Master Thesis Echo State Networks for Adaptive Filtering

University of Applied Sciences

Department of Computer Science

Master Thesis

Echo State Networks for Adaptive Filtering

Ali Uygar Küçükemre

A thesis submitted to theUniversity of Applied Sciences Bonn-Rhein-Sieg

for the degree ofMaster of Science in Autonomous Systems

Referee and Tutor: 1st Prof. Dr. Paul-Gerhard PlögerReferee: 2nd ................................................ external Referee: 3rd ................................................

Submitted: 30.04.2006

ECHO STATE NETWORKS FOR ADAPTIVE FILTERING

I, the undersigned, declare that this work has not previously been submitted to this or any other University, and that unless otherwise stated, it is entirely my own work.

...................... ..................................

Date Ali Uygar Küçükemre

A. U. KÜÇÜKEMRE II


ACKNOWLEDGEMENTS

Devoted to the my dear friend Özgün Doğan....

Special thanks to my professor Paul-Gerhard Plöger for his valuable support,teaching and understanding during my studies. For the first time in

my life I liked Mathematics after his short discourses during the lectures...

Nevertheless, I thank to my dear parents for all their support during this thesis work.

A. U. KÜÇÜKEMRE III


ABSTRACT

An Adaptive Filter is a self designing system that can work in non-stationary

environments. It relies on a recursive algorithm which continuously update its

parameters in order to track the changes its surrounding. If the non-stationarity of the

environment is governed by nonlinear dynamical processes a Recurrent Neural

Networks (RNNs) can be used successfully like an adaptive filter, provided they can

learn in an online manner. This is a necessary condition for adaptive filtering. In this

thesis, we will discuss usage of a special type of a RNN for those tasks, they are called

the Echo State Networks (ESNs). They constitute a novel approach that can be used for

RNN training. Counter to the conventional methods, which usually require a huge of

time to learn a given system, an ESN can be trained in linear time in a very quick way.

Most of the time, an offline learning algorithm is used to teach the ESN. But this does

not mean that training is only limited to this algorithm. The structure of the ESN, lets

the online adaptation algorithms from adaptive filtering area also to be used for training

of itself. Recursive Least Squares (RLS) is a very good example of these algorithms,

with which good results are achieved under the ESN context. The main problem of the

RLS is that it is prone to numerical instability*. For a reliable usage of RLS for the ESN

learning, this point should be treated in detail. Throughout this work, we look for

numerically stable RLS algorithms by digging through the literature. Extensive testing,

using two different adaptive filtering setups, is done to compare the individual

performances of different RLS based online adaptation algorithms** that can be used to

teach the ESNs. Results show that robust online adaptation is possible, provided

necessary precautions are taken. Additionally, it is shown that the ESNs are potentially

competitive candidates for adaptive filtering. This is done via comparing the ESN

performance to the standard linear and nonlinear adaptive filtering methods.

Keywords: Echo State Network, Adaptive Filtering, Recurrent Neural Network, Online Learning,

Recursive Least Squares, Least Mean Squares, Backpropagation Decorrelation, Adaptive System

Identification, Adaptive Noise Cancellation

* Other disadvantages of the RLS algorithm are also discussed in the thesis.

** Some non-RLS algorithms are also included in our discussions and testing.

A. U. KÜÇÜKEMRE IV


TABLE OF CONTENTS

1. ADAPTIVE FILTERING..........................................................................................................................1

2. PROBLEM STATEMENT........................................................................................................................9

3. ARTIFICIAL NEURAL NETWORKS AND THE ECHO STATE APPROACH.................................11

3.1 A Brief Introduction to the Artificial Neural Networks...................................................................11

3.2 The Echo State Networks Theory.................................................................................................... 13

4. ONLINE ADAPTATION OF THE ECHO STATE NETWORKS........................................................ 22

4.1 Introduction...................................................................................................................................... 22

4.2 The Conventional Recursive Least Squares Algorithm .................................................................. 23

4.3 Known Problems of the RLS Algorithm..........................................................................................31

4.4 The Problem of Computational Complexity.................................................................................... 31

4.5 The Problem of Numerical Instability..............................................................................................37

5. PERFORMANCE ANALYSIS OF THE ONLINE ADAPTATION ALGORITHMS.......................... 49

5.1 Introduction...................................................................................................................................... 49

5.2 Experimental Setups.........................................................................................................................51

5.3 Experimentation Phase and the Results........................................................................................... 55

5.3.1 Adaptive System Identification Setup..................................................................................... 55

5.3.1.1 Identifying the Second Order Nonlinear Dynamical System......................................... 56

5.3.1.2 Identifying the 10th Order Nonlinear System.................................................................61

5.3.2 Adaptive Noise Cancellation................................................................................................... 65

5.4 Comments on the Results.................................................................................................................68

6. ECHO STATE NETWORKS VS STANDARD METHODS.................................................................75

6.1 Introduction...................................................................................................................................... 75

6.2 Overview of the Standard Adaptive Filtering Methods................................................................... 75

6.3 Performance Comparison.................................................................................................................78

6.4 Conclusion........................................................................................................................................84

7. SUMMARY.............................................................................................................................................85

8. REFERENCES........................................................................................................................................ 89

APPENDIX A: ECHO STATE NETWORKS STATE OF THE ART.......................................................98

APPENDIX B: COMPUTATIONAL REQUIREMENTS ANALYSIS OF THE SELECTED

ALGORITHMS......................................................................................................................................... 108

A. U. KÜÇÜKEMRE V


LIST OF FIGURES

Figure 1: Block diagram of a basic filter....................................................................................................... 1

Figure 2: Block diagram of the adaptive modelling setup.............................................................................5

Figure 3: Block diagram of the adaptive inverse modeling setup................................................................. 6

Figure 4: Block diagram of the adaptive prediction setup.............................................................................6

Figure 5: Block diagram of the adaptive inference canceling setup..............................................................7

Figure 6: Structure of a neuron [Plöger 2004]............................................................................................. 12

Figure 7: Schematic description of a FNN and a RNN [Jaeger 2002a].......................................................12

Figure 8: An overview of the general Echo State Network structure. [Plöger 2004]..................................16

Figure 9: Block diagram of an ESN when used as a adaptive system identifier. (...) .................................51

Figure 10: First 250 samples of the second order nonlinear dynamical system.......................................... 52

Figure 11: Effects of the harsh parameter variations on the behavior of 10th order NARMA System (...) .................................................................................................................................. 53

Figure 12: Block diagram of an ESN-ANC. (...) ........................................................................................ 54

Figure 13: Results of the test case 1: fRate < 1 ( = 0.999) and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System.....................................................................................................58

Figure 14: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System...................................................................................................................... 59

Figure 15: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 2nd Order Nonlinear Dynamical System...................................................................................................................... 60

Figure 16: Stepwise squared error graph of the ESN-QR-RLS with fRate = 1 during identification of the second order nonlinear dynamical system for five million samples. (...) .................................................. 61

Figure 17: Results of the test case 1: fRate < 1 ( = 0.999) and 1000000 (One Million) Samples for the 10th Order NARMA System....................................................................................................................... 63

Figure 18: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 10th Order NARMA System..........................................................................................................................................64

Figure 19: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 10th Order NARMA System..........................................................................................................................................65

Figure 20: Results of the Test Case 1: fRate < 1 ( = 0.9999) and 4500000 Samples for the Adaptive Noise Cancellation................................................................................................................................................. 67

Figure 21: Results of the Test Case 2: fRate = 1 and 4500000 Samples for the Adaptive Noise Cancellation................................................................................................................................................. 68

Figure 22: Performance of different adaptive filtering methods for identification of the 2nd order nonlinear dynamical system.........................................................................................................................80

A. U. KÜÇÜKEMRE VI


Figure 23: Comparison of the ESN output versus the time varying 2nd Order Nonlinear Dynamical System in the last 100 iterations of the experiment..................................................................................... 81

Figure 24: Performance of different adaptive filtering methods for identification of the 10th order NARMA system...........................................................................................................................................82

Figure 25: Comparison of the ESN output versus the time varying 10th Order NARMA system in the last 100 iterations of the experiment.................................................................................................................. 83

Figure 26: Stepwise squared error graph of the ESN-IQR-RLS, observed during the identification of the 10th Order NARMA System. (...) ...............................................................................................................84

LIST OF TABLES

Table 1: Algorithm dependent parameters used for the identification of the 2nd order nonlinear dynamical system............................................................................................................... 57

Table 2: Results of the test case 1: fRate < 1 ( = 0.999) and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System...........................................................................57

Table 3: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System...................................................................................................................... 58

Table 4: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 2nd Order Nonlinear Dynamical System...................................................................................................................... 59

Table 5: Algorithm dependent parameters used for the identification of the 10th Order NARMA System....................................................................................................................... 62

Table 6: Results of the test case 1: fRate < 1 ( = 0.999) and 1000000 (One Million) Samples for the 10th Order NARMA System............................................................................................................62

Table 7: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 10th Order NARMA System..........................................................................................................................................63

Table 8: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 10th Order NARMA System..........................................................................................................................................64

Table 9: Algorithm dependent parameters used for the Adaptive Noise Cancellation................................66

Table 10: Results of the Test Case 1: fRate < 1 ( = 0.9999) and 4500000 Samples for Adaptive Noise Canceling......................................................................................................................66

Table 11: Results of the Test Case 2: fRate = 1 and 4500000 Samples for the AdaptiveNoise Cancellation....................................................................................................................................... 67

Table 12: Performance comparison of different adaptive filtering methods for identification of the 2nd order nonlinear dynamical system........................................................................80

Table 13: Performance comparison of different adaptive filtering methods for identification of the 10th order NARMA system................................................................................................................................. 82

A. U. KÜÇÜKEMRE VII


LIST OF ALGORITHMSAlgorithm 1: Generation of an RNN with ESP........................................................................................... 18

Algorithm 2: Supervised Training of ESN ................................................................................................. 21

Algorithm 3: ESN-CRLS.............................................................................................................................30

Algorithm 4: ESN-LMS...............................................................................................................................33

Algorithm 5: ESN-BPDC............................................................................................................................ 36

Algorithm 6: ESN-SCRLS...........................................................................................................................39

Algorithm 7: ESN-SCRLS2.........................................................................................................................41

Algorithm 8: ESN-Ardalan-RLS................................................................................................................. 42

Algorithm 9: ESN-RLSP............................................................................................................................. 44

Algorithm 11: ESN-QR-RLS.......................................................................................................................46

Algorithm 12 ESN-IQR-RLS...................................................................................................................... 48

LIST OF DEFINITIONSDefinition 1: Echo State Property................................................................................................................ 17

Definition 2: Linear Transversal Filter ....................................................................................................... 76

Definition 3: Second Order Truncated Volterra Filter.................................................................................77

Definition 4: Bilinear Filter......................................................................................................................... 79

LIST OF PREPOSITIONSPreposition 1: Sufficient Conditions for ESP.............................................................................................. 17

LIST OF LEMMASLemma 1: Matrix Inversion Lemma............................................................................................................ 26

LIST OF ABBREVIATIONSANC: Adaptive Noise Canceler

BIBO: Bounded-Input-Bounded-Output

BMI: Brain Machine Interface

BPDC: Backpropagation-Decorrelation

CRLS: Conventional Recursive Least Squares

CVA: Conventional Recursive Least Squares Variant Algorithms

DR: Dynamical Reservoir

DSP: Digital Signal Processor

EQR-RLS: Extended QR Decomposition Based Recursive Least Squares

ESN-ANC: Echo State Network Adaptive Noise Canceler

A. U. KÜÇÜKEMRE VIII


ESN-Ardalan-RLS: Echo state Network Ardalan Recursive Least Squares

ESN-BPDC: Echo state Network Backpropagation-Decorrelation

ESN-CRLS: Echo state Network Conventional Recursive Least Squares

ESN-IQR-RLS: Echo state Network Inverse QR Decomposition Based Recursive Least Squares

ESN-LMS: Echo state Network Least Mean Squares

ESN-QR-RLS:Echo state Network QR Decomposition Based Recursive Least Squares

ESN-RLSP: Echo state Network Recursive Least Squares Prewhitening

ESN-SCRLS2: Echo state Network Symmetric Conventional Recursive Least Squares 2

ESN-SCRLS: Echo state Network Symmetric Conventional Recursive Least Squares

ESN: Echo State Network

ESP: Echo State Property

FIR: Finite-Duration Impulse Response

fladd: floating point addition

fldiv: floating point division

flmult: floating point multiplication

flop: floating point operation

FNN: Feed-forward Neural Network

IIR: Infinite-Duration Impulse Response

IJCNN: International Joint Conference on Neural Networks

IQR-RLS: Inverse QR Decomposition Based Recursive Least Squares

LMS: Least Mean Squares

LSM: Liquid State Machines

LTA: Linear Time Algorithms

MSE: Mean Squared Error

NARMA: Nonlinear Auto Regressive Moving Average

NMSE: Normalized Mean of Squared Error

QR-RLS: QR Decomposition Based Recursive Least Squares

R-IML-N: Recurrent Infinite Middle Layer Network

RBA: Rotation Based Algorithms

RLS: Recursive Least Squares

RNN: Recurrent Neural Network

SNR: Signal to Noise Ratio

STM: Short Term Memory

UDU': Upper-Diagonal-Upper-Transpose

A. U. KÜÇÜKEMRE IX

1. ADAPTIVE FILTERING


What we basically do in signal processing? The most intuitive answer is that we do

something to the signal in order to make it more useful. While doing so, the most

important companion of a signal processing expert is probably a filter. Signal

Processing pioneer Simon Haykin gives the definition of a filter in [Haykin 1996] as

follows:

“The term filter is often used to describe a device in the form of a piece of physical

hardware or software that is applied to a set of noisy data in order to extract

information about a prescribed quantity of interest.”

In Figure 1, you see the most basic form of a filter. It takes an input signal denoted by

u , processes it and outputs the signal y . Processing should be done in such a

manner that the y is a good representative of the desired signal, also called the desired

response, d . A performance criterion is often defined to be able to decide on the

quality of the filter. It is usually a function of estimation error signal between the signals

y and d , and is denoted by e .

Filters can be classified in to two major groups. A filter is said to be linear if the filtered

signal at the output is a linear function of the observations at the input part of the filter,

otherwise it is called a nonlinear filter. Most of the signal processing theory is based on

the linear filters.

A process (or a system), on which a filter operates, is stationary if the statistical

A. U. KÜÇÜKEMRE Page 1 of 115

Figure 1: Block diagram of a basic filter.


characteristics of it are independent of the time at which the process is started. That is if

we look at the process at different time intervals, we essentially observe the same

statistical behavior at each of those intervals. If this property is not satisfied than it is a

non-stationary process. If our corrupted signal at hand is governed by a stationary

process, than we can design a optimum filter using the statistical signal processing

theory. Assuming that we know certain statistical parameters, like mean values and

correlation functions of the useful signal and the unwanted components that are mixed

on to it, we can design a linear filter which optimum in the statistical sense. The most

common performance criterion used in this case is the mean-squared value of the e ,

the difference between the desired response and the filter output. The corresponding

solution is known as the optimum Wiener solution named after his pioneering work in

[Wiener 1949]1.

The Wiener solution is insufficient during the times where the environment is non-

stationary. Therefore, the filter should now assume a time-varying form instead of fixed

statistics. In early 1960's, Kalman and Bucy extended the Wiener's work to time-varying

case with the Kalman filter, a highly powerful tool for many engineering problems

[Kalman 1960] [Kalman 1961]. It is an adaptive method which can respond positively

to statistical variations of the environment.

Although, the Kalman filter solved the problem of adaptivity to non-stationary

environments, like wise the Wiener solution it still assumes prior knowledge of certain

statistical parameters of the incoming signal. This knowledge is usually not known in

practical signal processing applications. In that case a good solution is using an adaptive

filter. An adaptive filter is a self-designing filter that relies on a recursive algorithm for

its operation. This recursive algorithm gives adaptive filter, the ability to perform

successfully in a non-stationary environment, where the relevant statistics of those

variations are not available, by continuously updating the filter parameters.

An adaptive filter always assumes limited or no knowledge about the inherent statistics

of its surrounding. If an adaptive filter is used in a stationary environment, after some

1 This book was originally issued as a classified National Defense Research Council Report on February 1942



number of iterations it converges to the optimum Wiener solution in some statistical

sense, from thereafter the recursive adaptation algorithm can be shut down. When the

environment is non-stationary, it offers tracking, that is after converging to its steady

state, it can track statistical variations of the system given the those variations are

sufficiently slow depending on the tracking capability of the recursive algorithm used.

Due to the recursive algorithm, the parameters of an adaptive filter are adapted from one

iteration to the next, hence become data dependent. By that property, all adaptive filters

are non-linear in the sense they do not obey the superposition principle, which is a

necessary condition for linearity. After all, in the literature they are often classified as

linear or non-linear. If the output of an adaptive filter is formed by a linear combination

of the filter coefficients and the input signal, than it is called linear, if otherwise, it is

called non-linear.

Linear adaptive filters can be implemented in two main forms, Infinite-Duration

Impulse Response (IIR), Finite-Duration Impulse Response (FIR). IIR filters are

governed by recursive equations of the form:

y t =∑i=0

M−1

a it x t−i ∑i=0

M−1

b it y t−i

Here a it and bi t are forward and feedback tap weights. Due to the presence of

feedback the impulse response of the IIR filters are infinitely long, hence their name.

The feedback connections also lead to a stability problem. It may get into oscillation if

no special precaution is taken for choice of feedback taps. Moreover, adaptation process

of IIR filters is hard. Performance functions (i.e: MSE) of these filters often contain

many local minimums. During the adaptation process, the filter may get trapped into

one of those local minimums, instead of the desired global minimum point of the

performance function. Because of these reasons FIR type of filters become more

popular for designing linear adaptive filters. This form of filters are also called the

transversal (FIR) filters. A FIR filter has a very simple form as follows:

y t =∑i=1

M

w it x t−i1



The output of the filter is generated by linear combination of filter weights and the

delayed input samples. Performance functions that belong to the FIR filters have usually

a one well defined global minimum that can easily be found by any recursive adaptation

algorithm.

In contrast to the linear adaptive filters, there exists no general structural framework for

implementing non-linear adaptive filters. Various schemes can be used for nonlinear

adaptive filtering. Examples of them include neural networks [Haykin 1999b], radial

basis function networks [Haykin 1996], polynomial filters [Mathews 1991], order

statistics filters [Palmieri 1988] etc.

Many different adaptation algorithms are developed for Adaptive Filters. Mostly they

either follow a statistical or deterministic approach. In statistical approach, also known

as stochastic gradient algorithms, instantaneous value of the mean squared error is

minimized at each iteration to get a rough estimate of the MSE. It turns out that this

rough estimate when used with a small step-size parameter leads to a very simple yet

robust algorithm, the widely celebrated Least Mean Squares (LMS) algorithm also

known as the Widrow-Hoff Rule [Widrow 1960]. Despite its simplicity and robustness,

it has a very slow convergence rate which is sensitive to the eigenvalue spread of the

input signal. When the deterministic approach is used, we want to minimize the sum of

weighted error squares, the least squares term. In contrast to the stochastic gradient

algorithms which use an instantaneous estimate of the performance criterion, the least

squares based algorithms consider also the history of the error function. The most

famous of least squares based adaptation algorithms is the Recursive Least Squares. The

most important advantage of using least squares based algorithms is their fast rate of

convergence, which is typically an order of magnitude faster than the stochastic

algorithms. Convergence is also invariant to the eigenvalue spread of the system. Price

paid for these desirable properties is the increased computational complexity.

Ability of adaptive filters to adjust themselves to different environments, let them to be

realized for many practical applications in diverse types of fields like control,

communications, military, radar and sonar signal processing, inference cancellation,

active noise control, biomedical engineering and other fields where minimal



information is available about the incoming signal. We can classify these application in

four main groups, namely Modeling, Inverse Modeling, Prediction and Inference

Canceling.

In the Adaptive Modeling (See Figure 2), we try to find a mathematical model of an

unknown plant. This is a very important task if you want to design controls for a time-

varying system. It is often difficult to model a physical phenomenon directly, however

by experimentation, response of the system in various conditions can be measured. In

this setup, we feed the filter and plant by the same input. Plant response is used as the

desired response and the recursive algorithm updates the adaptive filter accordingly

using the error signal of the filter response and the desired response. When the error

signal becomes sufficiently low, the filter response can be used as a representative of

the unknown plant response. The most prominent application example of this class is

the system identification.

In the second class of adaptive filtering applications, an adaptive filter is used to provide

an inverse model of an unknown noisy plant that represents the best fit in some sense

(See Figure 3). This setup is called the Inverse Modeling and also known as the

(adaptive) deconvolution. In case of linear systems, the inverse model is characterized

by a transfer function which is equal to the inverse (reciprocal) of the plant's transfer

function. That is the combination of two, ideally provides the perfect communication

medium. During the operation, delayed version of the plant input is used as the desired

signal, whereas the plant response is fed into the adaptive filter as input. In some


Figure 2: Block diagram of the adaptive modelling setup


applications plant input is used directly as the desired response without a delay.

Applications areas of inverse modeling can be given as predictive deconvolution,

adaptive equalization and blind equalization.

Adaptive filters that belong to the Prediction class are required to provide the best

prediction with respect to some performance criterion of the present value of a random

signal (process) (See Figure 4). During the adaptation, present values of the random

signal to be predicted constitute the desired response for the adaptive filter. Past values

of the signal are supplied as the input via delaying. Depending on the application type

either the adaptive filter output or the estimation error is used as the system output. First

case occurs when the filter operates as a predictor, in the latter case it operates as a

linear prediction error filter. Linear predictive coding, autoregressive spectrum analysis,

adaptive differential pulse-code modulation and signal detection are application types

that belong to the adaptive prediction class.

The last class of adaptive filtering applications is the Inference Canceling (See Figure

5). Aim is to cancel an interfering signal or noise component from the primary signal


Figure 3: Block diagram of the adaptive inverse modeling setup.

Figure 4: Block diagram of the adaptive prediction setup.


which is a combination of the information bearing component and the inference.

Principle is to obtain an estimate of the inferring component and then to subtract it from

the primary signal. Feasibility of this kind of adaptive filters relies on a reference signal

which is a correlated form of the interfering component. It can be derived from a sensor

or a sensor network located in relation to the sensor or set of sensors that provide the

primary signal, in such a way that the information containing signal is unobservable or

very weak. During the operation, the reference signal is fed in to the adaptive filter,

while primary signal is used as the desired response. Estimation error which is used for

adaptation of the filter is also used as the system output, in that it contains the best

estimate of the information bearing signal (in some sense). Sample applications can be

listed as Adaptive Noise Canceling, Echo Cancellation, Adaptive Beamforming and

Active Noise Control.

Adaptive linear filters are the very popular among engineers and scientists, hence have

been implemented in many applications. An obvious advantage of them is their natural

simplicity which allows their design, analysis and implementation stages to be

comparatively straightforward tasks for many cases. They have been studied very well

and there exist very good books that cover them extensively like [Haykin 1996],

[Farhang-Boroujeny 1998] and [Bellanger 2001]. Though, there are many situations

where performance of the linear adaptive filters are merely poor. Linear models does

not always produce the best estimates. This is quite natural especially if think that our

universe is governed by nonlinear dynamical processes. In that aspect, we need

nonlinear models to achieve better results.


Figure 5: Block diagram of the adaptive inference canceling setup.


Using an Artificial Neural Network is one of the solutions to operate in nonlinear

environments. A neural network is a biologically inspired, highly parallel and

distributed information (or signal) processor tool made up of inter-connected simple

non-linear processing units, also called neurons, which has ability to store experimental

knowledge. They imitate the biological neuronal networks in two manners. First,

knowledge is acquired by a learning process. Second, unit to unit connections, called the

synaptic weights, are responsible for storing the acquired knowledge. "As long as the

data used for a learning is good representative of the environment, one can build a

supervised neural network that can capture the underlying dynamics of the environment

whether the environment is stationary or non-stationary. This is truly a powerful

statement on non-linear adaptive filtering, with profound practical implications."

[Haykin 1999a]. In this thesis we will concentrate on usage of a special type of neural

networks, Echo State Networks [Jaeger 2001] for Adaptive Filtering.


2. PROBLEM STATEMENT


Echo State Networks (ESNs) serve as a powerful black-box tool for neural network

learning [Jaeger 2001]. They offer a novel approach for efficient training of Recurrent

Neural Networks (RNNs). The ESN training method has a complexity in the order of a

simple linear regression task. Up to now, they have been used successfully in a broad

range of applications like system identification, prediction, robot control etc. When we

examine these applications in detail, it follows that most of the time ESNs are trained in

an offline manner. That is they are used in a somewhat non-adaptive way, in the sense

that an ESN is being trained firstly in batch mode and after training it is utilized to the

needed application environment with no further change in the network parameters ( i.e:

synaptic weights). This in contrast to biological systems which learn continuously, to be

able to adapt to the time varying characteristics of the real world tasks.

In [Jaeger 2002c], it has was showed that ESNs can be used in an online manner using a

very well known algorithm from the Adaptive Filtering area which is called Recursive

Least Squares (RLS).2 Adaptive Filters, as previously explained, are special kinds of

filters that are running in unknown environments with time varying properties, therefore

continual adaptation of the filter taps has to take place at runtime. The learning

mechanism of ESNs also exhibit a similar structure to linear adaptive filters. Therefore

the same algorithms from the adaptive filtering area can be used for online adaptation of

ESNs.

Additionally, in [Jaeger 2004], ESNs are used for the solution of adaptive channel

equalization problem. This approach beaten conventional nonlinear methods methods

by two orders of magnitude. The same author also applied the ESNs to the nonlinear

adaptive system identification problem and again achieved very good performance

[Jaeger 2002c]. Based on these results, we can conclude that ESNs can also be

considered as an useful tool for Adaptive Filtering problems.

However we still need a more detailed treatment of the ESN-RLS combinations under

adaptive filtering scenarios since it is reported by many authors that the RLS family of

algorithms are subject to numerical stability problems, for example [Ljung 1985]

2 The same argument is also mentioned throughly in [Jaeger 2001].



[Ardalan 1986] [Cioffi 1987] [Yang 1992] [Levin 1994]. Should ESNs be used for

Adaptive Filtering or other application types in a robust manner, techniques to ensure or

increase the numerical stability of RLS algorithm should be found from the scientific

literature. In this thesis, we will try to give a detailed treatment of using RLS family of

algorithms for online adaptation of ESNs under Adaptive Filtering context. Much of the

concentration will be focused on numerical stability of different RLS algorithms, while

other performance criterias like algorithmic complexity, steady state error, tracking will

be discussed throughly. We will also show that the ESNs are competitive candidates to

do Adaptive Filtering. In that aspect, we will have a section which shows the superiority

of ESNs over standard linear and nonlinear adaptive filtering techniques. Additionally,

an algorithmic analysis of most promising RLS-ESN combinations will be given in

order to reveal the complexity trade-off between different algorithms and also to assist

future implementations.

The rest of the thesis is organized as follows. In the next section, we will make the

reader with the classical ESN theory which involves the basics of ESN generation and

supervised learning of them in batch mode. Then we go on with the online adaptation of

ESNs. Main focus will be on using RLS family of algorithms, while some non-RLS

algorithms will also be discussed. After specifying the most promising algorithms from

the adaptive filtering literature, we will have section on performance comparison of

these algorithms with respect to each other in different experimental setups. Later, we

will compare the ESNs, to standard linear and nonlinear methods of the adaptive

filtering area. We will finish the thesis, by giving a summary of what is done and

possible future works. In Appendix A, interested reader can find brief summary of ESN

research and application articles. Appendix B, is where we do algorithmic analyzes of

some most promising ESN online learning algorithms.


3. ARTIFICIAL NEURAL NETWORKS AND THE ECHO STATE APPROACH


3.1 A Brief Introduction to the Artificial Neural Networks

An artificial neural network (ANN) is a computational model that is inspired from the

biological neural networks (ie. Composition of a human brain). They can be used as a

powerful data modeling tool which has the ability to adapt, learn, generalize, cluster and

organize data. Their operation is based on distributed parallel processing. Formally,

ANNs can be represented as a set of simple processing units known as neurons which

communicate by sending signals to each other over a large number of weighted

connections, each usually having the following typical properties [Plöger 2004] (See

Figure 6):

• x it : activation state of neuron i at time t

• net i : net inputs to neuron i calculated by net propagation function

net jt =∑∀i

oit w ji

• f activationin. : Inner activation (transfer) function,

x it1 = f activationin. x it , net it ,i , where i is the limit step function.

• oi t : Output of neuron i , oi t = f activationout x it

• f out : Output activation function

One can classify ANNs under two main categories, feed forward and recurrent neural

networks. Also known as the Multi-layer Perceptrons (MLP), the feed forward neural

networks (FNNs) are the most popular type among ANNs. In an FNN, data enters at the

input layer and is piped through the network through intermediate layers, which are also

called the hidden layers, until it arrives at the output neurons. There is no feedback

between different layers during this process. This is the reason why this kind of

networks are called feed forward. Mathematically, they implement static input-output

mappings and theoretically FNNs can approximate any non-linear function with

arbitrary precision. Being studied extensively for many years now they are in a solid

state. Their application areas and training methods are quite well understood. (See



Figure 7)

Their counterparts, Recurrent Neural Networks (RNN) has cyclic connections between

the layers. Theoretically they can approximate any dynamical system with arbitrary

precision. Being able to model any dynamical system they offer a lot to the researchers

however their analysis and training is extremely difficult. Thus, the research on RNNs is

very limited. Yet, one cannot draw his/her interest away from RNNs since all biological

neural networks are recurrent and they are capable of doing a broader range of tasks.

Although, their analysis is still difficult, some difficulties of RNN training have been

solved on 2001 by a breakthrough approach called Echo State Networks by Herbert

Jaeger [Jaeger 2001]. Next, we follow with the details of Echo State Networks Theory.

Interested reader can refer to [Haykin 1999b] for a more detailed coverage of Artificial


Figure 6: Structure of a neuron [Plöger 2004]

Figure 7: Schematic description of a FNN and a RNN [Jaeger 2002a]


Neural Networks.

3.2 The Echo State Networks Theory

Echo State Networks resemble a novel approach for RNN training. While keeping the

expressive power of RNNs, they remedy the problems of RNNs training methods which

are usually hard and time consuming. A similar neural network type also developed

independently under the name Liquid State Machines (LSMs) [Maas 2002]. LSM theory

has a more biological perspective whereas the ESNs are approaching the problem from

an engineering point of view.

The central idea of ESN theory is based on using a relatively huge fixed dynamical

reservoir (DR). A DR can be viewed as a pool of artificial neurons with connections to

each other without any restriction on topology. Recurrent path are of course very

welcome in this setup. When excited by the the input signal, the DR maps it in to a

richer state space as encoded in its internal states. At the end the desired output can be

formed by computing a weighted combination of output connections. By imposing

certain algebraic conditions on the DR which will be mentioned in detail later in this

chapter, one can achieve impressive results by keeping the internal weights fixed and

only adjusting the taps from the DR to output units. In this way the RNN training which

was once a burden for researchers, boils down to a simple linear regression task. Being

based on such a simple approach, ESNs are now the title holder in prediction of well

known chaotic time series benchmark Mackey-Glass system.

Engineering applications of the method are numerous (See Appendix A). "ESNs can be

used for all basic tasks of signal processing and control including time series

prediction, inverse modeling, pattern generation, event detection and classification,

modeling distributions of stochastic processes, filtering and nonlinear control. Because

a single learning run takes only a few seconds (or minutes for very large datasets and

networks), engineers can test out variants at a high turnover rate, a crucial factor for

practical usability." [Jaeger 2004]

Before going further in to details of ESNs, firstly we want to fix the mathematical

notation that will be used throughout this work. To keep the integrity, we are going in



line with the same notation used in the original publications [Jaeger 2001] and [Jaeger

2002b].

Our ESN model consists of K input units with an activation (state) vector which is

denoting the activation of the input layer at time t :

u t = [ u1t , u2t , u3t ,... , uK t ]T : Input Activation Vector 3

N internal units ( reservoir) with the corresponding state vector:

x t = [ x1t , x2t , x3 t , ... , x N t ]T : Internal Activation Vector

and L output units with the output state vector:

y t = [ y1t , y2t , y3t , ... , yL t ]T : Output ActivationVector

The synaptic weights between the input, internal and output units are collected in three

matrices:

W in.= w ijin. : N×K , InputReservoir

W = wij : N×N , ReservoirReservoir

W out= wijout. : L× NKL , Reservoir , Input ,OutputOutput

Here a zero weight stands for no connection. It should also be noted that output units

have connections from input, internal and even from output neurons. In addition to

those, activations of the output units may optionally be projected back to the internal

units using the connections:

W back = wijback. : N×L , OutputReservoir

The matrices W in. , W out and also W back , if it exists at all, are usually full

matrices, whereas the W is a sparse matrix with recommended density values ranging

from 5% to 20%. Input to reservoir weights are assigned using an uniform distribution

3 Here T denotes the matrix transpose operation.



and is fixed through the ESN lifetime. Likewise, W back has also fixed weight which

are drawn randomly. Reservoir is scaled to have a nice global spectral radius, which is

also kept fixed. Only thing that is to be learned through time is W out , which makes

ESN learning computationally very fast among other RNN learning techniques.

Activations of the internal units are update by the rule:

x t1 = f activationW x t W in. ut1 W back y t

This step is called “Evaluation”. Here u t1 denotes new input vector at time

equals to t1 . f act= f 1, f 2, ... , f N denotes the activation function of the internal

neurons. It is also called the transfer function, output function or squashing function. To

achieve nonlinearity and because of being an invertible function, it is mostly selected as

tangent hyperbolic:

tanh x = sinh xcosh x

= exe−x

e xe−x

Specifically, neurons using tanh x are called sigmoids.

Following the evaluation step, new value of the output activation vector is given by the

formula:

y t1 = f activation W out concat x t1 , ut1 , y t

The name of this step in the ESN jargon is “Exploitation”. The function

concat x , u , y denotes concatenation of new internal and input states together with

the previous output state. One should observe that this notation does not require

recurrent pathways between internal neurons, although it is highly desirable. Therefore,

no restriction on the topology of network exists.

Successive computations of the evaluation and exploitation may lead to a chaotic and

unbounded behavior, therefore a proper global scaling of W should be applied in

order to prevent those unwanted effects. Details on how to this will be given in the

upcoming sections. An overview of ESNs is best summarized in the Figure 8.



Having defined our terminology, we now continue to details of the ESN theory. The

name "Echo States" is coming from special property of the network which is

characterized by the weight matrix W . The training data, [u t , d t ] is also

influential on the network whether it has echo states or not. That is for two different sets

of input-output signal pairs, the same network may have echo states on one set whilst

not having for the other. We require, input sequences are coming from a compact

interval U such that:

[u t ]t∈K ∈ U K

A similar requirement also hold for the teacher outputs (desired signal). We need the

desired signal values are from a compact interval, D , such that:

[d t ]t∈L ∈ DL

Then the echo state property is given in Definition 1.


Figure 8: An overview of the general Echo State Network structure. [Plöger 2004]


Verbally, the ESP states that the current state of the network is uniquely determined by

the input history and the teacher forced output. Therefore, when we run the network

freely, initial network states that are independent of the input and output history should

wash out after some time. By this property, we can have network states that are

characterized only by the input and the teacher forced output.

The echo state property is strongly connected to algebraic properties of the internal

weights matrix W . Although there exists no known necessary and sufficient algebraic

condition, to decide if any given [W in. ,W ,W back ] has the ESP. Still, a sufficient

condition to prove non-existence of echo states is present. It is given in Preposition 1.

In [Buehner 2006], a newer sufficient condition for ESP is also proposed. Authors'


Definition 1: Echo State Property (ESP)

Assume a recurrent neural network with connectivity weights W in. , W and W back

, which is driven by a teacher input u t = [u1t , u2t , u3 t , ... , u K t ]T and forced

by a desired teacher output, d t = [ d 1t , d 2t , d 3t , ... , d L t ]T , both coming

from the compact intervals [u t ]t∈K ∈ U K and [d t ]t∈L ∈ DL . This RNN has echo states if for every left infinite sequence [u t , d t−1] where t=−∞ , .. ,−1,0and all state sequences x t , x ' t which is computed by

x t1 = f activationW x t W in. ut1 W back y t

x ' t1 = f activation W x' t W in. u t1 W back y t

it holds true that x t = x ' t , ∀n≤0

Preposition 1: Sufficient Conditions for ESP

Given an untrained network, [W in. ,W ,W back ] , with state update according to x t1= f activation W x t W in. u t1 W back y t withf activation x = tanh x , let λmax be the largest absolute eigenvalue and σ max be

the largest singular value of W in. then:

a) If σ max 1 then ESP holds for the network with [W in. ,W ,W back ]

b) If ∣λmax∣ 1 then the network with [W in. ,W ,W back ] has no echo states for any input/output interval U K × Y L which is containing the zero input/output tuple 0 , 0 .


claim is that this new bound is tighter than the original one and guarantees asymptotic

stability (i.e: Guaranteed ESP for all inputs). Note that, no new design methodology for

generating a new untrained ESN is proposed in this paper, what is given can only be

used to test the global asymptotic stability of an ESN at hand. Based on the Preposition

1, Algorithm 1 seems to guarantee generation of an untrained network,

[ W in. , W , W back ] , with echo states.

The W matrix generated by the Algorithm 1 is what we usually call the Dynamical

Reservoir (DR). According to Jaeger's suggestions, it should be a sparse matrix. This is

a simple trick to ensure richness in internal dynamics of the DR. Best results are

achieved with low densities (connectivities) around 5% to 20%. Moreover, values

should be roughly in equilibrated that is the mean value of the internal weights should

be around zero. To achieve this, one can either draw random weights from a [-1,1]

uniform distribution or set the values either to -1 or 1.

Number of neurons, N , used in the reservoir should be selected considering both

hardness of the learning problem and the availability of the teacher signal. Harder the

problem, usually a high number of neurons is needed for good models. Another

important point is to avoid over-fitting. Over-fitting is a usual phenomenon that is

observed in ANN training. It occurs when the ANN learns a much too literal

reproduction of the teacher sequence, but is poor at generalizing for unseen examples.

As a rule of thumb, N should not be exceeding an order of magnitude T /2 or at least

should be bigger than T /10 where T is denoting the periodicity of the training data.


Algorithm 1: Generation of an RNN with ESP

1. Randomly generate a sparse matrix W 0 , with real valued w0ij 's between

[−1 , 1 ] with a low density. (i.e: only 5% of the weights are different from zero)

2. Normalize W 1=1∣λmax∣

W 0 , where λmax is largest absolute eigenvalue of W 0.

3. Scale W = W 1 , where 1 . It follows that α is the spectral radius ofW . Then the network [ W in. , W , W back ] has the ESP.


The more regular-periodic the training data, N can be chosen closer to the T /2 .

The spectral radius is another important parameter that should be carefully selected.

For fast network dynamics, a small should be used. The more it gets close to one,

slower is the network dynamics. There is no general answer for what is good choice of

spectral radius for a given application. Therefore, several trials usually need to be done

in order to fully exploit the capabilities of the ESN at hand.

Note that in Algorithm 1, the generation of the W in. and W back is left open. This

because of the fact that the ESP is independent of these two matrices. Mostly, weights

of the W in. and W back are drawn randomly. Absolute values of these weights have an

important effect on how the DR is excited by the input or output back-projections.

Larger the values imply the network is strongly driven by the input/output signals, vice

verse for the smaller values. Additionally, if tangent hyperbolic is used as the activation

function, using small weights indicate that network operates around the central near

linear region of the sigmoid. As the absolute weights grow we get close to the

saturation, hence work in a highly nonlinear region. In the absolute, we have binary

dynamics with a sigmoid output which is equal to one or minus one.

After we get an RNN with ESP, in other words an ESN, the training which was once a

very time consuming operation is easy. It is just adjusting weights of the W out , a very

simple linear regression task. Algorithm 2, can be used for supervised batch learning of

ESNs. In the Algorithm 2, be careful to collect the vectors x t and

f activation−1 d teacht

T in M and T , not x t and f activation−1 d teacht−1T .

When applying ESNs for certain applications there are also some additional points that

researchers should know in order to achieve meaningful results in a few trials. Examples

of these points can be enumerated as choosing the correct scaling parameters for the

input signal, using a reasonable network size, providing enough training examples etc.

A short tutorial of those training tricks is given in [Jaeger 2002a], whereas a more

comprehensive treatment of this topic, a detailed ESN training tutorial, yet waits aside

to be written.



In a relatively short period of time since their invention on 2001, the ESN theory gained

popularity amongst the neural network research community. Apart from the original

contributer Herbert Jaeger, also other researchers published their works regarding ESNs.

These articles come in various flavors. While some of them aimed at solving

engineering problems by means of ESNs, some concentrated on the theory of standard

ESNs in order to improve the performance or to put an end to inherent limitations of the

theory. It is also not so surprising to find some novel proposals like new RNN structures

or training methods whose development is strongly inspired by the ESN network theory.

Naturally, sometimes ESNs are as well criticized by some authors. In Appendix A, we

have given an overview of all ESN papers which are of our knowledge. Interested

readers can easily use the relevant references to get more detailed knowledge.

In the International Joint Conference on Neural Networks (IJCNN) which was held in

Montreal, Canada on 2005, a special session was reserved for Echo State Networks.

Outcome of this IJCNN 2005 was very prosperous in terms of number of ESN papers

published during the conference. Also, in year 2007, a full issue of the famous Neural

Networks magazine will be reserved only for the ESN topic and for another similar

idea, the LSMs [Maas 2002]. These two incidents provide evidence for increasing

popularity and acceptance of the ESN theory in the neural networks community.

Our discussion on the classical ESN theory ends here. In the next chapter, we will

discuss online adaptation of ESNs using algorithms from the adaptive filtering area.




Algorithm 2: Supervised Training of ESN

Let u teacht = [ u1t , u2t , u3t ,... , uK t ]T be the input teacher signal

and d teach t = [ d 1t , d 2t , d 3t , ... , d N t ]T be the output teacher signal

containing column vectors of number K and L in discrete time intervalt=1, 2, ... , t 0 , ... , T where t 0 is denoting the time point where all initial

states of the dynamic reservoir are washed out. With a definition of the initial zero teacher output d teach 0 = 0 .

1. Generate an ESN using the Algorithm 1

2. Initialize the network states of the dynamical reservoir arbitrarily. (i.e: x 0=0 )

3. Calculate x t1 ,∀ t = 0,1,. .. , t 0−1 using the evaluation equation,

x t1 = f activation W x t W in. ut1 W back y t

4. Do concat u teacht1 , x t1 , d teacht T ,∀ t=t 0, t 01,... ,T in rows

and store in the state matrix M t−t01×NKL .

5. In the same manner do concat f activation−1 d teacht

T ,∀ t = t 0, t01,... , T

in rows and save in to the teacher matrix C t−t01 xL

6. Solve for W ' = M−1C where M−1 denotes the pseudo (Moore-Penrose) inverse of M

7. Set W out = W 'T

4. ONLINE ADAPTATION OF THE ECHO STATE NETWORKS


4.1 Introduction

In previous works by Jaeger, [Jaeger 2002c] and [Jaeger 2004], online learning for

ESNs achieved by using Recursive Least Squares (RLS), a very well known algorithm

from Adaptive Filtering theory. The good results gained in these studies proved that

RLS offer a good solution for online adaptation of ESNs4.

Advantages of using RLS, can be listed as follows. Firstly, it has a very fast rate of

convergence5. Secondly, its rate of convergence is independent of the eigenvalue

spread6 of the correlation matrix of the input signal. RLS algorithm also has a good

steady state error performance.

However, RLS has two important disadvantages. Its computational complexity is in the

order of ON 2 .7 Secondly, it is numerically instable under finite precision

environments.8 (i.e: digital systems)

For a better and more reliable use of RLS for online adaptation of ESNs, the above

disadvantages should be examined in more detail and shall be remedied if possible. In

this section we will try do this detailed treatment of the RLS algorithm.

This chapter is organized as follows. Firstly, we will derive conventional RLS algorithm

from scratch to give a better insight of the algorithm to the reader. Next, we will begin

to investigate the problems one by one. At first step, we will look at the computational

complexity problem, leaving numerical instability problem to a later stage. In the

4 What is even better would be using the simple yet robust Least Mean Squares algorithm [Widrow 1960], however this is not possible at the moment with our current knowledge on ESNs [Jaeger 2005]. Further research should be done on the topic. Reasons for this will be explained later in this chapter.

5 By using Rate of Convergence, we refer to the following definition given in [Haykin 1996]. It is defined as the number of iterations for the algorithm, in response to stationary inputs, to converge "close enough" to the optimum Wiener solution in the mean square. Thus a fast rate of convergence allows filter to adapt more rapidly.

6 Eigenvalue spread is the ratio of the largest eigenvalue of a matrix to the smallest one.7 The Big-Oh is an mathematical notation used to describe asymptotic behavior of functions. In

computational literature it is often used to denote the complexity of the algorithms.8 An algorithm is numerically instable if it diverges from the desired response due to quantization errors

in digital environments.



proceeding chapter, we will test all different ESN adaptation methods using two

different scenarios, which are adaptive system identification and adaptive noise

canceling.

4.2 The Conventional Recursive Least Squares Algorithm

Now we will derive the conventional RLS algorithm for the ESN case as we abbreviate

the ESN-CRLS. The derivation may look complicated at first sight but actually is not so

hard to follow. Starting from the well known least squares minimization problem we

will come to a recursive set of equations that made up the CRLS algorithm.

For simplicity during derivation of the algorithm, we assume an ESN with one output

and one output. Neither an input to output connection nor an output to output

connection is used. Thus W out is a 1×N vector, storing connections from dynamical

reservoir to the one output neuron. More precisely we define W out as a function of

time as W out t =[w1 t , w2t ,... , wN t ] . We will be using the indexes T and

t to denote time variables. We adapt our ESN-CRLS from the derivation procedure

given in [Haykin 1996]. At the last step while giving the final version of the algorithm,

we will go to the most general form from this restricted case of the ESN.

In the method of weighted least squares, we want to minimize the following quantity to

achieve a good estimate of desired signal at time t :

=∑i=0

T

i ,t e i 2 (Equation 4.1)

where

e i = f activation−1 d teachi – W out t x i (Equation 4.2)

and i , n is the weighting factor which is defined as:

i , t =t−i where i=1,2,3, .... , n ∧ 0≤≤1 (Equation 4.3)

It is obvious from the formula that this kind of a weighting factor usage, tends to weigh



past samples with smaller coefficients. That is, the filter forgets the past, hence the

name forget rate is given to this term. The special case, when =1 ,is called the pre-

windowed or infinite memory RLS and is equal to the ordinary least squares

formulation. Using the notation with forget rate, , the cost function we want to

minimize becomes:

=∑i=0

T

t−i e i 2 (Equation 4.4)

This expression can be minimized by taking partial derivatives with respect to all

elements of W out t and equating result to the zero:

∂∂W out t

=∑i=0

T

t−i e i ∂e i ∂W out t

=∑i=0

T

t−i ei x i = 0 (Equation 4.5)

Now we replace e i with its original form:

∑i=0

T

t−i [ f activation−1 d teachi – W out t x i ] x i = 0 (Equation 4.6)

Rearranging the (Equation 4.6), we get:

W out t [∑i=0

T

t−i x i 2]=∑i=0

T

t−i f activation−1 d teach i x i (Equation 4.7)

We can express the same equation in matrix multiplication form as

t W out t = z t (Equation 4.8)

where t is the N×N correlation matrix of the internal state vector of our ESN:

t =∑i=0

T

t−i x i x i T (Equation 4.9)

z t is the N×1 cross-correlation vector between internal states vector of ESN,

x i and value of the inverse activation function which accepts the desired response



as input, f activation−1 d teachi :

z t =∑i=0

T

t−i x i f activation−1 d teachi (Equation 4.10)

Our aim is to get the least squares estimate of W out t . Using the (Equation 4.8), it

can be found by:

W out t T = t −1 z t (Equation 4.11)

Up to here, we followed the ordinary least squares framework to obtain a solution. Now

the important question is how to find the inverse of t recursively. From this point

on we start the main of part of our ESN-CRLS derivation which is involves recursions.

If we isolate the term where i=t from the rest of the summation in the correlation

matrix definition, (Equation 4.9), we get:

t = [∑i=0

T

t−1−i x i x i T ] x t x t T (Equation 4.12)

By definition, the term in the brackets is equal to the t−1 . Therefore, we get the

following recursion for the correlation matrix update:

t = t−1 x t x t T (Equation 4.13)

In the same manner, we get the recursion update equation for cross-correlation vector

z t given in (Equation 4.10), as:

z t = z t−1 x t f activation−1 d teacht (Equation 4.14)

Before jumping to the next step we briefly have to introduce an useful identity from the

linear algebra which will act a key role in the remaining steps. It is known as "Matrix

Inversion Lemma". In the literature, it is also referred to as Woodsbury's Identity. The

lemma states the following:



In the definition of the matrix inversion lemma, C H denotes the Hermitian Transpose9

of C . This lemma states that if we are given a matrix in the form of (Equation 4.15),

we can determine its inverse using the (Equation 4.16). The matrix inversion lemma can

easily be proved by multiplying the (Equation 4.15) and (Equation 4.16) side by side.

Then recognizing that the multiplication of a square matrix by its inverse results in an

identity matrix (i.e: A A−1= I ).

Assuming we have a positive definite10 t (hence a non-singular11 matrix), we may

apply the lemma to recursive update equation of the correlation matrix. Firstly, we make

the following identifications:

A =t

B−1= t−1

C = x t

D = 1

(Equation 4.17)

Substituting these terms in to the (Equation 4.16) of the matrix inversion lemma leads

us to the (Equation 4.18)

9 The Hermitian transpose, also called the conjugate transpose, of a matrix can be found by taking the transpose of the complex conjugate of the matrix, AH=AT . In case of a real matrix, it is equal to the ordinary matrix transpose operation [Eves 1980].

10 A N×N Hermitian matrix is positive definite iff for any nonzero vector v we have v H A v 0 . Whereas a square matrix is a Hermitian matrix iff A=AH . If the matrix is real than this condition boils down to A=AT . [Johnson 1970]

11 If a square matrix has a nonzero determinant then it is called a non-singular matrix [Lipschutz 1991]


Lemma 1: Matrix Inversion Lemma

Let A and B be two positive-definite M×M matrices which are related by the

equation; A = B−1 C D−1C H (Equation 4.15), where D is another positive-

definite N×M matrix. Then, the inverse of the matrix A can be expressed as:

A−1= B− B C D C H B C −1C H B (Equation 4.16)


−1t = −1−1t−1−−2−1t−1x t x t T−1t−1

1−1 x t T−1t−1x t (Equation 4.18)

In order to simplify the rest our calculations, we make the following definitions:

P t = −1t (Equation 4.19)

and

k t = −1 P t−1 x t 1−1 x t T P t−1x t

(Equation 4.20)

Using our new definitions, we can rewrite the (Equation 4.19) as:

P t = −1 [P t−1 − k t x t T P t−1 ] (Equation 4.21)

(Equation 4.20) can be reorganized to get a simpler form as:

k t = −1 P t−1 x t −−1 k t x t T P t−1x t

= −1 [−1 P t−1−k t x t T P t−1] x t (Equation 4.22)

Here notice that the terms in the brackets in (Equation 4.22) is equal to the P t by

(Equation 4.21). Therefore, we simplify the above equation using our new finding to get

a more compact form of the k t in terms of the correlation matrix and the internal

state vector of the ESN:

k t = P t x t = −1 x t (Equation 4.23)

The k t is referred by the name gain vector because of the close relation between the

Kalman Filtering and the RLS. Check out [Sayed 1994] for a very good description of

the relation between RLS and Kalman Filtering. Next we have find an recursive

equation for the W out update. Using (Equation 4.11), (Equation 4.14) and (Equation

4.19) we get:



W out t T = t −1 z t

= P t z t

= P t z t−1 P t x t f activation−1 d teacht

(Equation 4.24)

Substituting (Equation 4.21) only in to the first occurrence of P t in the (Equation

4.24) leads us to:

(Equation 4.25)

W out t T = P t−1 z t−1− k t x t T P t−1 z t−1 P t x t f activation

−1 d teach t

= −1 t−1 z t−1− k t x t T −1 t−1 z t−1 P t x t f activation−1 d teach t

=W out t−1 − k t x tT W out t−1 P t x t f activation−1 d teach t

Using the fact that k t = P t x t = −1 x t , we get:

(Equation 4.26)

W out t T = W out t−1T k t [ f activation

−1 d teacht − x t T W out t−1T ]

W out t = W out t−1 k t T t

Here the terms in brackets are used to define the a-posteriori error, t as

t = f activation−1 d teacht − x t T W out t−1T

= f activation−1 d teacht −W out t−1 x t

(Equation 4.27)

Now, using (Equation 4.27), (Equation 4.20), (Equation 4.21) and (Equation 4.26) in the

given order, we can write down the ESN-CRLS algorithm. In order to increase the

efficiency of our implementation, firstly we multiply both numerator and denominator

of the (Equation 4.20) by to get a simpler form of the gain vector as:

k t = P t−1 x t x t T P t−1 x t (Equation 4.28)



Then noticing that the term P t−1 x t appears in both numerator and denominator

of the k t , we introduce a new term t to simplify our notation:

t = P t−1x t (Equation 4.29)

Moreover, we expand our ESN structure to its most general definition. As a reminder,

we give this definition once again here. Our model consists of K input units with an

activation (state) vector which is denoting the activation of the input layer at time t :

u t = [ u1t , u2t , u3t , ... , u K t ]T : Input ActivationVector

N internal units that make up the DR with the corresponding state vector:

x t = [ x1t , x2t , x3t ,... , xN t ]T : Internal Activation Vector

and L output units with the output state vector:

y t = [ y1t , y2 t , y3t , ... , y L t ]T : Output Activation Vector

The synaptic weights are collected in four matrices:





Additionally, we have a correlation matrix, P t with dimensions

NKL×NKL and gain vector, k t , of dimension NKL×1 .

Also the t , now denotes the 1×L row vector of a-posteriori errors for coming

from each of the L outputs, defined as t = [1t ,2 t , ... ,L t] . The term

concat x t , u t , y t−1 which is used in the exploitation equation is used in

many places of the CRLS and its variants that will be mentioned in this section.

Therefore, we introduce a new term for the concatenated ESN states (input, internal,



output) vector, t = concat x t , u t , y t−1 to tidy up the notation of these

algorithms. Finally, the ESN-CRLS is given in Algorithm 3.

Algorithm 3: ESN−CRLS

Initialization :

0 ≤ 1

P 0 = −1 I , ≪ 1

W out 0=0 , y 0 = 0

Main Body :

for t = 1,2 , ... , T

x t = f activation W x t−1 W in. u t W back y t−1

t = concat x t , u t , y t−1

y t = f activation W out t−1 t

t = f activation−1 d teacht − W out t−1 t

t = P t−1t

k t =t

t Tt

P t = −1 [P t−1 − k t t T P t−1]

W out t = W out t−1 t k t T

During the initialization period of the ESN-CRLS, we set P 0 such that the non-

singularity of the correlation matrix, 0 is guaranteed. This is usually achieved by

setting the P 0 equal to an identity matrix which is multiplied by a very big scalar

number. Setting the W out 0 = 0 is another common practice in the literature which

we also followed here. It is known that using any value other than zero for initialization

of W out , does not have significant effect on the convergence and steady state behavior

of the algorithm, unless very large values are used for initialization [Farhang-Boroujeny



1998].

4.3 Known Problems of the RLS Algorithm

As we stated before the RLS family of algorithms mainly suffer from two main

problems which may hamper their usage. First of all their computational complexity is

directly proportional with the square of the size of W out we are using. In

computational notation O N 2 . The second big problem is the numerical instability

under finite precision environment which has been reported by many authors in the

literature like [Ljung 1985] [Ardalan 1986] [Cioffi 1987] [Yang 1992] [Levin 1994] etc.

The aim of this section is explore what are the effects of these problems on ESN and

RLS combinations and what kind of a strategy should be followed in order to cope with

them. We will look for the answers in the adaptive filtering literature, where these kind

of problems are studied in detail for other filter structures. Now, we continue with the

problem of computational complexity.

4.4 The Problem of Computational Complexity

High computational complexity of the CRLS limited the use of it for real world

applications despite its attractive advantages. This lead scientists to development of fast

RLS (FRLS) algorithms which has computational complexities which grows linearly

with the number of taps to be updated. They are on average O7M with M

denoting the number of filter taps of an adaptive filter. FRLS algorithms preserve the

nice properties of CRLS like fast convergence which is independent of the eigenvalue

spread of the correlation matrix. See [Cioffi 1984], [Slock 1991], [Carini 1999] for

individual examples of those algorithms. A more detailed overview on the FRLS

algorithms can be found in [Haykin 1996], [Glentis 1996], [Farhang-Boroujeny 1998],

[Bellanger 2001]

FRLS algorithms are utilized based on the shift time invariance property that is

associated with the input vector in transversal FIR filters. At each time step, only one

value is actually changing in the input vector of the transversal filter. All of the past data

samples are shifted by one, leaving the oldest time sample out of the vector. Then, the

new data value is injected to the first slot. Based on this property, it is possible to derive



FRLS algorithms. The derivation process is not in the scope of this thesis but can be

found in the given literature. However, the input vector to the RLS when used under

ESN context, is the combination of internal state vector x t and input output vectors

u t , y t . All values of x t are changing at each time step. This means that

the shift time invariant structure of transversal filters is unfortunately not present in

ESN-RLS combinations. From adaptive filtering theory point of view, weight update

procedure of ESN-RLS resembles to that of linear combiner structures (i.e:

Beamforming, Radar Array Processing) for which use of FRLS algorithms is

unfortunately not possible. Therefore, we have to omit the usage of FRLS family for

online adaptation of ESNs.

Instead we propose use of two other algorithms with ON complexity. First one is

the world famous Least Mean Squares (LMS) algorithm which also known by the name

Widrow-Hoff Rule [Widrow 1960]. It is a stochastic gradient algorithm That is the

gradient of the error performance surface with respect to free parameter vector changes

randomly from iteration to iteration. Many years by now from its invention, it

established itself as the workhorse of the adaptive filtering area. This is mainly because

of the reasons like; ease of implementation, low computational complexity and robust

performance. As with the every algorithm, it also has some drawbacks. LMS algorithm

converges slowly to its steady state. When compared to the RLS performance, LMS rate

of convergence is an order of magnitude slower than RLS. This phenomenon is also

reported for ESN-LMS combinations in [Jaeger 2001]. Another major drawback of the

RLS is its sensitivity to the eigenvalue spread of the correlation matrix of the input

vector. One way to overcome these limitations is to use projections of the input signal

on an orthogonal basis. This is usually attained by using variants of the algorithm which

operate in frequency domain at the cost of additional computational complexity. Instead

of using transform domain ESN-LMS, using ESN-RLS provides a more convenient

way. In an unpublished bachelors thesis [Liebald 2004], the eigenvalue spread problem

tried to be tackled by using different specifically tailored ESN topologies. However,

none of the proposed topologies performed better than the randomly created networks

using the Algorithm 1. How to shrink the eigenvalue spread of an ESN is a very hot

topic for ESN research as suggested by Jaeger in [Jaeger 2005]. Only after the

eigenvalue spread of ESNs can be made smaller, use of LMS algorithm for online



adaptation of ESNs with good performance would be possible. At the moment we only

know that adding a random noise component to the input signal is useful in lowering the

eigenvalue spread of an ESN, a trick that we know from using Extended Kalman

Filtering for RNN learning [Jaeger 2002a]. Still, our basic experimentation with the

algorithm revealed that it can be used for certain tasks in combination with the ESNs in

order to lower the computational complexity given the eigenvalue spread is in

acceptable intervals. But, this is at a cost of slower convergence and lower numerical

accuracy when compared to cases when CRLS algorithm is used for adaptation. Also,

note that we do not understood which tasks lead to a low eigenvalue spread and why.

This should be found empirically for each task. In spite of all, we decided to add the

LMS algorithm in to our performance tests. The ESN-LMS algorithm is given in

Algorithm 412.

Algorithm 4: ESN−LMS

Initialization :

∈ℝ is the user defined learning rate.

W out 0=0 , y 0 = 0

Main Body :

for t = 1,2 , ... , T





W out t = W out t−1 [t t T ]

The LMS algorithm may also suffer from numerical instability during weight update

because of the quantization errors. This usually becomes evident if the input signal does

12 Since our main interest is on the RLS algorithm throughout this thesis, we exclude the derivation of the LMS. Interested reader can find more information in standard adaptive filtering literatures like [Haykin 1996] or [Farhang-Boroujeny 1998]



not have enough energy at all frequencies [Cioffi 1987]. Often, this effect can be

remedied by introducing a leakage mechanism to the weight update equation [Zahm

1973]. Trade off is a small degradation in the steady state performance. A nearly

equivalent technique to ensure stability is to add a small amount of uncorrelated noise

component to the input vector before weight update [Werner 1983]. During our tests,

we followed the noise insertion approach to achieve stability. One final note is on the

convergence of the ESN-LMS. Usually, a common practice of choosing the learning

rate, is to use small real number smaller than 1. However, during our

experimentations with ESN-LMS, we observed that one may have also have to use very

big values for the , especially (but not always) when the input signal to the network

is scaled to a very compact interval around zero. Otherwise, the ESN-LMS does not

converge or converges at a very languishing pace. One should consider this while

playing out with the parameters of the ESN-LMS for any given application.

Another interesting method for online learning with computational complexity of

ON 13 is the recently introduced Backpropagation Decorrelation Algorithm

(BPDC) [Steil 2004]. It is suitable to be used for online ESN learning. The BPDC

algorithm is strongly influenced by the ESN and LSM theories in that it only learns the

output weights of recurrent neural networks. This is done in order to be able reduce the

complexity. Theoretic inspiration of the algorithm is coming from the RNN learning

rule introduced by Atiya and Parlos in the RNN training unification paper [Atiya 2000].

The Atiya-Parlos Recurrent Learning (APRL), "is based on the idea to differentiate the

error function with respect to the network states in order to obtain a virtual teacher

target, with respect to which the weight changes are computed." [Steil 2004]. Basically,

the BPDC relies on three important principles. First one is the one-step back

propagation of errors by means of the virtual teacher forcing like the APRL algorithm.

Secondly, usage of the short term memory in the network dynamics which is adapted

based on decorrelation of the internal activations of reservoir neurons. And finally use

of an non-adaptive dynamical reservoir as in the case of ESNs or LSMs in order to

attain reduced computational complexity.

13 This O N complexity is valid only when we have on one output neuron in our ESN. When more than one output units are used, complexity of this algorithm becomes O N 2 . See [Steil 2004]



In order to be able use the BPDC algorithm for ESNs with nonlinear activation

functions, we have to introduce some modifications on Exploitation and Evaluation

equations following the notation given in [Steil 2004]. See (Equation 4.30).

(Equation 4.30)

Exploitation for ESN−BPDC :

x t =W f activation x t−1 W in. f activationut W back f activation y t−1

Evaluation for ESN−BPDC :

y t =W out t f activation concat x t , u t , y t−1

In the original definition, we apply the transfer function to the network's internal state

elements after multiplying them with the corresponding synaptic weights. Now we do

the direct contrary by first applying the transfer function and then multiplying with the

weights. The ESN-BPDC algorithm in its general form with O N 2 complexity can

be given as in theAlgorithm 514. When only one output neuron is used, some

expressions in the algorithm cancel out, decreasing complexity of the ESN-BPDC to

ON .

It should be noted that in [Steil 2004], input signal is assumed to be coming from a

compact interval with mean value near to zero. Our experiences with ESN-BPDC also

indicates the need of a such conditioning of the input signal. Therefore, signals that are

not satisfying this regulation, should be biased and scaled appropriately before being fed

in to the ESN as an input. In a recent paper [Steil 2005], stability of the BPDC method

is also shown for many different cases. Therefore, we assume its stability depending on

this work and do not do any further investigation.

14 Like the LMS algorithm, we also skip the derivation of the BPDC algorithm. We also want to add that the notation we used here to express the algorithm is special for ESN case and is very different from the one used in the original paper [Steil 2004]




Algorithm 5 : ESN−BPDC

Initialization :

∈ℝ is the user defined learning rate.

∈ℝ is the user defined regularizationconstant.

W out 0=0 , y 0 = 0

Main Body :

for t = 1,2... , T

x t = W f activation x t−1 W in. f activation u t W back f activation y t−1

y t =W out t f activation concat x t , u t , y t−1

e t = y t − d teacht

for i = 1,2... , L

i =∑l=1

L

[wi NKl out f activation

−1 y l t−1 e lt−1]− ei t

for j = 1,2... , N

w ijout t =

f activation x j t−1

∑k=1

N

f activation xk t−12 ∑k=N1

NK

f activationuk t−12 ∑k=NK1

NKL

f activation yk t−12i

for j = N , N1..., NK

w ijout t =

f activation u j t−1

∑k=1

N


NK


NKL


for j = NK , NK1..., NKL

w ijout t =

f activation y jt−1

∑k=1

N


NK


NKL



4.5 The Problem of Numerical Instability

As we stated before, the RLS algorithm is advantageous over the LMS by two reasons.

First, its convergence rate is an order of magnitude faster. Second, convergence rate of

the RLS is independent of input signal statistics (i.e: eigenvalue spread). Although, we

have proposed two non-RLS algorithms as a substitute to solve the computational

complexity problem, today's microchip technology reached very high processing

speeds. Therefore, complexity is not a vital problem for most of the practical

applications provided not so big reservoirs are used. A more important problem than

complexity is the numerical instability, especially for the applications with long term

adaptation needs. Much of the work in the literature on stabilizing RLS family of

algorithms is concentrated on the fast versions, leaving out a more limited number of

studies for ON 2 algorithms. Here, we will go through the prominent ones of these

works in order to find out suitable ESN-RLS combinations with good numerical

stability. We will evaluate our findings in the next section under different experimental

scenarios.

When adaptive filters (or any other filter in general) are implemented in finite precision

environments (i.e: digital), all the values are quantized to certain numerical limited

precisions. Because of that, quantization (round-off) errors are generated, which deviate

the performance of the filter from the infinite precision performance. The amount of

these errors are implementation dependent, hence may vary from application to

application based on the size of the word lengths (i.e: number of bits) used [Ling 1984].

An error perturbation generated at an arbitrary point also has an effect on later

iterations, that is it said to be propagating. Continuous accumulation of such effects may

reach levels where deviations increase so much that the filter performance is no more

acceptable. If for a given algorithm, error accumulation grows without bound then it is

said to be unstable and its continuous use is unsuitable without further precautioning

[Liavas 1999]. For applications where adaptive filter is used only to determine an

unknown setting and the kept fixed at that setting, such instabilities are usually not

observed [Cioffi 1987]. We also experienced the same behavior during our simulations.

Algorithms usually tend to become unstable or deviate too much from the desired value,

after a few tens of thousands of iterations or more. The effects of quantization errors on



the RLS are very well studied by various authors. Good examples include [Ljung 1985],

[Ardalan 1986], [Cioffi 1987], [Bottomley 1989], [Leung 1989], [Bottomley 1991], [Liu

1991], [Yang 1992], [Haykin 1996], [Liavas 1999]. Here we do not aim to go into such

details, rather we will summarize what is said by these authors and then we will try

adapt their solutions to our problem. Interested reader can refer to above references for

more detailed information.

As with the other adaptive algorithms, RLS is as well affected by the quantization

errors. Two main effects are generally attributed for numerical instability of the

O N 2 type of RLS algorithms.

The first effect is about the recursive computation of the inverse of correlation matrix,

P t . This is the earliest problem noted for instability of the RLS algorithms [Hsu

1982] and has its origins from the Kalman Filtering theory. As a results of accumulation

of these errors, the matrix may become indefinite. Although this usually does not end up

in a overflow, the response of the filter is nevertheless unacceptable. This effect is

usually evident for input signals that do not satisfy a condition called persistent

excitation condition given in [Ljung 1985]. This statement essentially tells that the input

signal must have sufficient energies at all frequencies to prevent the inverse of

correlation matrix to become negative-definite [Cioffi 1987]. This condition can easily

be met by adding some uncorrelated white to the input signal. The same technique is

reported to be useful also for the ESN-RLS case in [Jaeger 2002c].

A better method to eliminate a negative-definite matrix is studied in [Hsu 1982]. This

method focuses on calculating the inverse correlation matrix using special recursions

that propagate the Upper-Diagonal-Upper-Transpose (UDU') factorization of the

P t . It aims at ensuring a positive definite matrix by keeping the symmetry of the

P t while having positive entries along the diagonal. A neat version of this

algorithm is given in [Yang 1994], which replaces the (Equation 4.21) in CRLS:

P t = −1 [P t−1 − k t x t T P t−1] (Equation 4.21)

By this new recursion which keeps the symmetry of the P t :



P t = Tri −1[ P t−1 − k t t T ] (Equation 4.31)

Here a new operator Tri ... is introduced which exploits the symmetry of the inverse

correlation matrix to increase efficiency of the algorithm. It does the computations on

only the upper or lower triangular part of the P t and then copies the results to the

opposite part. In this form the new algorithm has almost half the complexity of the

ESN-CRLS. Based on these, we give a new algorithm for online adaptation of ESNs,

which we call ESN Symmetric Conventional Recursive Least Squares (ESN-SCRLS),

as in Algorithm 6.

Algorithm 6 : ESN−SCRLS

Initialization :

0 ≤ 1

P 0 = −1 I , ≪ 1

W out 0=0 , y 0 = 0

Main Body :

for t = 1,2 , ... , T





t = P t−1t

k t =t

t Tt

P t = Tri −1[P t−1 − k t t T ]




Although the ESN-SCRLS increases the numerical stability is not always enough to

prevent divergence or overflow. The second main reason of the instability of RLS

occurs during the weight update equation. The unbounded grew of floating point round-

off errors causes filter weights to grew very large, resulting in a divergence or overflow.

While the divergence which shows up when P t becomes indefinite shows up after

few tens of thousands of iterations depending on the application type, this form of

divergence divergence usually manifests itself in the order of a factor 100 times later

than the first form [Cioffi 1987]. Therefore, in order ensure that it is eliminated,

extensive testing should be made with large datasets.

This kind of a divergence is of a similar nature to that of the LMS algorithms.

Therefore, it can be fixed by integrating tap weight a leakage mechanism to RLS

recursions which is a technique used also for LMS stabilization [Cioffi 1987]. Since we

have the inverse correlation matrix already present in our hand, a more simple scheme is

possible. Addition of a diagonal constant at periodically to the diagonal of the P t in

order to ensure good conditioning of it [Cioffi 1987] [Ardalan 1989]. While doing this it

is wise to use the UDU' form of P t since the SCRLS algorithm is more efficient

than the CRLS. We call this new algorithm ESN Symmetric Conventional Recursive

Least Squares Version 2 (ESN-SCRLS2) and is given Algorithm 7.

It is been shown that effect of floating point round-off errors on the weight update

equation increases as the forgetting rate is chosen close to one [Ardalan 1986].

Especially when = 1 is used, the possibility of the second form of divergence is

higher. In that case, error propagation mechanism becomes unbounded and is random

walk type [Ardalan 1987] [Slock 1991]. Therefore, for = 1 using ESN-SCRLS2

may not always be enough. Adali and Ardalan, developed a stabilization method

specifically for this case in [Adali 1991]. The technique resembles ESN-SCRLS2 in that

a term is added to diagonal of the P t , but this time instead of using a constant

value, a dynamic term is derived based on statistics of the amount of change in the tap

weights, w t . Our early experimentation showed the benefits of this technique

therefore we decided to add it to our algorithm collection. In the algorithm the operator

E ... denotes expectation (i.e: expected value). Therefore, the E [w t w t T ]



denotes the covariance matrix of the term w t . We refer to this algorithm by ESN

Ardalan Recursive Least Squares (ESN-Ardalan-RLS) which is formulated in the

Algorithm 8.


Algorithm 7 : ESN−SCRLS2

Initialization :

0 ≤ 1

∈ℕ∧0 isthe user defined period.

∈ℝ∧0 is the user defined diagonal constant.

P 0 = −1 I , ≪ 1

W out 0=0 , y 0 = 0

Main Body :

for t = 1,2 , ... , T





t = P t−1t

k t =t

t Tt

P t = Tri −1[P t−1 − k t t T ]

if t mod = 0 then Diag P l t , end



The methods we mentioned up to here are the most widely accepted ones by the

adaptive filtering community. But one can still find other studies which try to guarantee

stability of RLS algorithms. Some examples are [Bottomley 1991], [Chansarkar 1997],

[Horita 1999], [Douglas 2000] and others which we do not consider in this thesis. All of

these trials have some drawbacks. For example, the algorithm derived in [Chansarkar

1997] claims to have guaranteed stability but has an unacceptable computational

complexity of ON 3 . In [Horita 1999] the authors derived a leaky RLS algorithm,


Algorithm 8 : ESN−Ardalan−RLS

Initialization :

0 ≤ 1

P 0 = −1 I ∀ l∈[1,L ] , ≪ 1

W out 0=0 , y 0 = 0

Main Body :

for t = 1,2 , ... ,T



y t = f activationW out t−1 t


t = P t−1t

k t =l t

1tTt

w t = k t t

P t = Tri P t−1 − k t t T P t−1 E [w t wt T ]

W out t = W out t−1 t k tT


which is claimed to be robust. But with again an unacceptable complexity ON 3

which is because of a direct matrix inverse operation present in the algorithm.15 In

[Bottomley 1991], rather than deriving a new algorithm, the authors proposed some

modifications on the fixed point arithmetic to limit the error propagation, thus ensuring

stability. Since, we are only considering the single precision floating representation in

this thesis, this method is out of our scope although it looks promising at first sight.

Only with [Douglas 2000], we achieved some promising results during our early

experimentation. Using a novel least squares pre-whitening technique the author derived

a recursive algorithm to minimize the exponentially windowed least squares cost

function. The algorithm embodies the good properties of QR Decomposition based RLS

(QR-RLS)16 algorithms like high numerical accuracy. Algorithm performance is

acceptable and still in the orders of ON 2 . The algorithm is also claimed to be

stable through simulations. However, the author followed an overly simplified approach

during his stability analysis. Thus, we think further experimentation should be made to

conclude on the stability. We named this algorithm as ESN Recursive Least Squares

Pre-whitening (ESN-RLSP) and it is given in Algorithm 9.

We now continue with a new class of RLS algorithms which are implemented in a

structurally different form than the ones mentioned up to here. This class of algorithms

are very robust in terms of stability when compared to above methods. Trade-off is an

increase in the computational complexity but even this complexity is still bounded by

O N 2 . We will consider two main versions so called QR Decomposition Based

RLS (QR-RLS) and Inverse QR-RLS (IQR-RLS). From now on we may refer to this

class of algorithms using the name Rotation Based Algorithms (RBAs).

15 We also experimented with the [Chansarkar 1997] and [Horita 1999]. During those experiments, we observed overflows or unacceptable deviations from the desired signal. This is contradicting with the stability claims given in those articles. Because of the unacceptable complexity of these algorithms we did not give further attention on finding out the possible reasons of our observations.

16 We will consider QR-RLS algorithms in the upcoming pages.



The QR-RLS algorithm solves the least squares minimization problem by working

directly of the incoming data matrix via the QR Decomposition [McWhirter 1983]. This

is in contrast with CRLS and the variants of it which are working on the time averaged

correlation matrix. The QR Decomposition can be computed using variety of methods

among which the most popular ones are Givens Rotations, Householder


Algorithm 9 : ESN−RLSP

Initialization :

0 ≤ 1

P 0 = −1 I , ≪ 1

W out 0=0 , y 0 = 0

Main Body :

for t = 1,2 , ... , T





v t = P t−1 t

r t = P t−1T v t

k t = 1∥v t ∥2 ∥v t ∥2

P t = 1[P t−1 k t v t t T ]

w t =t t ∥v t ∥2



Transformations and Gramm-Schmidt Orthogonalization. Usually, Givens rotations are

preferred to other methods since it is computationally more efficient than the latter two

[Golub 1996]. Householder transformation can also be a good choice to implement

because theoretically it provides almost twice the better numerical accuracy than the

Givens method [Higham 1996]. It is known that QR-Decomposition when implemented

via Givens Rotations or Householder Transformations are numerically stable [Higham

1996]. The QR-RLS algorithm, when operating in finite environments, is shown to be

stable in a Bounded-Input-Bounded-Output (BIBO) manner [Leung 1989] [Liu 1991].

But it should be noted that BIBO stability does not always guarantee meaningful results.

In [Yang 1992], it is experimentally shown that when number of bits used is too small,

then the algorithm performance is unacceptable17. Experiments also show that better

accuracy is achieved when forgetting factor is chosen to be smaller than one. To present

the QR-RLS algorithm for ESN case, we used to notation used by the authors in [Sayed

1994]. Here we omit the derivation of the algorithm. Interested reader can refer to

[Sayed 1994] or [Haykin 1996]. The ESN-QR-RLS algorithm is given in Algorithm 10.

We would also like to give few more notes on the Algorithm 10. The initialization of

the algorithm is dependent on the size of the ESN used. If W out is a L×NKL

matrix then the initialization takes NKL iterations and during this period a-priori

estimation errors lt for each of the L outputs should assumed to be zero. To

calculate the W out at each time step has to calculate inverses of the l t −1 /2 's.

Matrix inverse is usually a computationally demanding process however here

l t −1 /2 is a lower triangular matrix. This makes it possible to compute inverse in

O N 2 time via back-substitution which exploits the lower triangular structure of the

matrix. Due to this special type of back-substitution, which may result in a division by

zero, real values of W out matrix is only accessible after the initialization period is

finished. During the initialization period of ESN-QR-RLS, both the W out and the ESN

output should assumed to be zero. Hence the a-posteriori estimation error lt is also

zero which is computed by taking the different of the latter two value.

17 Usage of five bits to express the values in digital form resulted in such an observation in [Yang 1992]




Algorithm 10 : ESN−QR−RLS

Initialization :

0 ≤ 1

l1 /2 0 = 0 ∀ l∈[1, L ]

p l 0 = 0 ∀ l∈[1, L ]

W out 0=0 , y 0 = 0

Main Body :

for t = 1,2,. .. , T




for l = 1,2,. .. , L

[ 1/2l1/2t−1 t

1/2 pl t−1T f activation−1 d l t

0T 1 ] t = [ l1/2t 0

pl t T l tϱl

1 /2t t Tl

1 /2t T ϱl1 /2t ]

end

W out t = [ p1t T1−1/2t

p2t T2−1 /2t

.

.pL t

TL−1 /2 t ]

NOTE :t is any unitary rotationthat produces a block zerocolumn vector on the last columnof the post array by annihilatingelements of the ESN internal state vector x t one byone. Alsod l tdenotes the l th element of the d teacht vector.


Because of the extra computational load introduced by computing inverse of l t −1 /2

, for the applications where the weights of W out need to be known explicitly (i.e:

Adaptive System Identification) ESN-QR-RLS is not a good choice. But, we have the a-

priori estimation error calculated at each iteration. We can omit the W out calculation

for applications where this a-priori estimation error value is enough. Prediction Error

Filters, Active Noise Control and Adaptive Noise Canceling can be given as examples

of such application scenarios. Additionally, another method called Extended QR-RLS

(EQR-RLS) can also be used for online adaptation of ESNs which avoids the

computationally demanding back-substitution operation [Yang 1992]. However, this

algorithm is not necessarily stable and the methods to make it stable are

computationally expensive [Moonen 1990]. Therefore, in the literature it is not

suggested to be used [Haykin 1996].

A better method is proposed in [Alexander 1993] under the name IQR-RLS. It is more

efficient than the QR-RLS algorithm in that it avoids the calculation of a lower

triangular matrix inverse via back-substitution. Unlike QR-RLS, IQR-RLS operates on

the inverse of the correlation matrix, hence named after this property. It also shares the

good numerical stability of the QR-RLS algorithm. Haykin states that the algorithm is

stable for 1 whereas for = 1 , the single error perturbation is not contractive

thus accumulation of such errors may lead to divergence [Haykin 1996]. According to

the simulation results given in the original paper [Alexander 1993] the algorithm stayed

stable over extremely long datasets (i.e: one million samples). Those simulations also

revealed that the algorithm has an outstanding numerical accuracy. As the QR-RLS we

again omit the derivation and the interested reader can refer to the [Alexander 1993] or

[Haykin 1996]. The IQR-RLS under ESN context can be written (again using the

notation from [Sayed 1994]) as in Algorithm 11.

Our discussion on the numerical instability problem of the RLS family of algorithms is

finished here. In the next chapter, we continue with our performance tests in which all

of the above algorithms are compared with respect to each other under similar

conditions.




Algorithm 11 : ESN−IQR−RLS

Initialization :

0 ≤ 1

P 0 = −1 I , ≪ 1

W out 0=0 , y 0 = 0

Main Body :

for t = 1,2 , ... , T





[ 1 −1 /2 t T P1/2t−10 −1 /2 P1/2t−1 ] t = [ ϱ

−1 /2 t 0T

k t ϱ−1/2t P1 /2t ]W out t = W out t−1 t k t T

NOTE :t is any unitary rotation that produces ablock zerorow vector on the first row of the post array by annihilatingelementsof theterm−1/2 x t T P1/2t−1one by one.

5. PERFORMANCE ANALYSIS OF THE ONLINE ADAPTATION ALGORITHMS


5.1 Introduction

In this section we will present our results that we got during our simulations. In these

simulations, we evaluated all methods with respect to numerical stability and steady

state error performances. Apart from those, computational complexities of some

selected algorithms are also discussed in Appendix B.

The following algorithms were used during the testing phase; ESN-CRLS, ESN-

SCLRS, ESN-SCRLS2, ESN-Ardalan-RLS, ESN-RLSP, ESN-IQR-RLS, ESN-QR-

RLS, ESN-LMS and ESN-BPDC. Two different experimental setups were utilized as

our testbeds which will be described later in this chapter. To ensure stability in a reliable

way, we ran all algorithm on the given setups for a few millions of iterations. For a

given experiment, all of algorithms use the same ESN. This is done to ensure objectivity

among different algorithm performances. During our early experimentations with those

algorithms, (while the implementation period,) we have observed that all of these

algorithms are also sensitive to the appropriate scaling and biasing of the input signals.

These two parameters not only effect the steady state error performance, but also have

an important impact on the numerical stability. Therefore, scaling and bias parameters

varied from algorithm to algorithm. By using different parameters, we aimed to get the

best results from each of the algorithm. Additionally, we chose parameters of the each

algorithm according to relevant precautions from the adaptive filtering literature, to

achieve a good numerical stability for each run. Actually, depending on our experience

with those methods while experimenting with short data sets (i.e: Number of samples

around 10000 to 50000), we expect many of them to remain stable for most of the test

cases.

We chose to use IEEE 754 Single Precision Floating Point Format to store all data in

our tests. This format uses thirty two (32) bits to express a real number in hardware.

One (1) bit is reserved for the sign, eight (8) bits for the exponent and lastly twenty

three (23) bits for the mantissa. Our motivation in taking such a decision comes from

the current state of the art of Digital Signal Processor (DSP) technology. See [Eyre



2000] for a good overview of evolution of DSPs. Adaptive filtering applications are

usually implemented in embedded systems. Computation of the adaptive filtering

algorithms are mainly composed of arithmetic operations. DSPs are specialized

especially for these kinds of applications where extensive arithmetic computation is

required. Thus, they offer impressive performance, scalability and ease of use when

compared to other embedded architectures (i.e FPGA, ASIC, VLSI etc.). High

performance is achieved by implementing sophisticated hardware techniques inside the

DSP chips like intensive pipelining for high frequency operations or parallel functional

units to execute multiple instructions at the same time. Mostly two types of data formats

are used inside the DSP chips, namely the fixed point and the floating point. Most of the

chips in the market are utilized with the fixed point arithmetic because of low power

consumption and pricing issues. On the other hand, the numerical accuracy and the

range is very limited with the fixed point architectures when compared to the floating

point representation. Because of this, numerical instability problems are much worse in

fixed point architectures. Floating point operations are usually emulated by

sophisticated assembly language tricks in fixed point systems which results in an

increased number of cycles per operation. On the other hand the same operations take

only 1 cycle due to specialized arithmetic units present on the floating point

architectures. A direct conclusion of these points is that the floating point DSPs, which

offer an easier design process to the developers, also lead to more efficient applications.

However, pricing issues hampered the use of floating point DSPs for a long period until

2000's. By the recent advances of the technology and also as a result of a highly

competitive market, the price per unit of DSP chips now decreased down to ten dollars

limit [TI 2002]. The pricing problems are now becoming less significant and floating

point DSPs are gaining much wider acceptance [Etalk 2004] [RTC 2004]. For most of

the cases, in floating point DSP chips 32 bit registers are used [Analog 2005] [TI 2005].

Based on these facts, we decided to use single precision floating points numbers during

our stability tests.

Our testing (evaluation) philosophy is as follows; firstly we ran all algorithms on

different setups for long time spans to check their numerical stability over time. If any

of the algorithms succeeded to remain stable, then we further compared the steady state

error performance to declare the winner for that particular setup. At the end of all



experiments, also taking the computational complexities in to the account, we made our

last comments on the algorithms. Our main intention in doing those tests is to show that

ESN-RLS combinations can be made numerically stable under applications scenarios

which require long term W out update due to constantly changing statistical properties

of the signals to be processed. Please do not forget that we do not guarantee hundred

percent stability under any circumstances. Additionally, as a natural outcome of this, we

do not claim to prove or disprove the stabilities of different algorithms mathematically.

Also note that such a detailed concentration on stability is usually not needed for

applications where only short term adaptation is required.

5.2 Experimental Setups

We chose adaptive nonlinear system identification as our first experimental setup. See

Figure 9.

Two different benchmark signals are used for learning which were introduced in a

recurrent neural network training unification paper by Atiya & Parlos [Atiya 2000]. We

also modified the signals in order to introduce time-varying behavior. First signal is a

second order dynamical system governed by the (Equation 5.1).

y t1 = y t y t y t−1 u3t (Equation 5.1)


Figure 9: Block diagram of an ESN when used as a adaptive system identifier. U(y) , y(t) and d(t) are the input, the ESN response and the desired response respectively. At each time step, the Adaptive Algorithm (i.e ESN-CRLS) updates W out using the error signal e(t).


Mild time varying statistics for this signal is achieved by using variable coefficients

, , , . At each time step we change them by a factor of 1% around the original

values 0.4, 0.4, 0.6, 0.1 respectively. Input u t is an uncorrelated uniform noise

coming from the interval [−0.5 , 0.5] . (See Figure 10)

Our second signal is a more difficult system to model. It is a 10th order Nonlinear

Autoregressive Moving Average (NARMA) system which is defined by the (Equation

5.2). This signal is also used in [Jaeger 2002c], and we follow the very same

prescriptions that are used there.

(Equation 5.2)

y t1 = tanh y t y t [∑i=0

9

y t−i ] u t−9u t

To ensure non-stationarity of the signal, we used the coefficients , , , which

vary at periodic intervals by a factor of ±50% around the original values which are 0.3,

0.05, 1.5, 0.1 respectively. This kind of a harsh coefficient value variation, effects the

signal behavior strongly, which is apparent in the Figure 11. Under some combinations

of , , , values this system may diverge in an explosive manner. In order to


Figure 10: First 250 samples of the second order nonlinear dynamical system


prevent such unwanted effects we used tanh x as a limiter. Input to the system is

again an uniform random noise that is drawn from the random uniform noise coming

from the interval [0 , 0.5] .

We used the following error function to evaluate the algorithm performances. It is the

Normalized Mean of Square Error (Equation 5.3), as defined in [Atiya 2000].

NMSE =∑t=1

T

y t −d t 2

∑t=1

T

d 2t (Equation 5.3)

In the (Equation 5.3), T is the number of filtered samples, d t is the desired

response and y t is the ESN output at time t. NMSE provides a more objective way

evaluating results, especially when the number of samples is high. We used this

definition from [Atiya 2000], in order to be in line with the standard literature. Our

second setup is another well known adaptive filtering application type, adaptive noise

cancellation [Widrow 1975]. In this experiment, we tried to enhance a music signal

which is corrupted by a non-stationary noise using an ESN Adaptive Noise Canceler


Figure 11: Effects of the harsh parameter variations on the behavior of 10th order NARMA System is apparent in this figure. Notice the periodic jumps at every 2000th sample.


(ESN-ANC). (See Figure 12)

The experimental setup is prepared as follows. The music signal, is ripped from a

commercial music CD in the wave file (.wav) format. As the noise source, we used a

real noise recordings which is made public on the Internet. It is a recording of a speech

babble. The source of this babble is 100 people speaking in a canteen. The room radius

is over two meters; therefore, individual voices are slightly audible. [TNO 1990a]. The

transfer functions H z and S z of the primary and the secondary paths

respectively, are obtained from the companion diskette coming with the book [Kuo

1996]. Those functions were measured from an experimental setup by the authors.

Using the transfer function H z , we get the x ' n which is an estimate of x n

. This is done by Infinite Impulse Response (IIR) filtering of the noise signal using the

transfer function as prescribed in the same book. Then the x ' n is summed up with

the u n in order to form the corrupted music signal which is d n . Using the

transfer function of the secondary path, S z , we synthesized the x ' ' n which is

the input signal that is fed in to the ESN-ANC.

The ESN-ANC tries to estimate a cleaner form of u n which is corrupted by the

signal x ' n , using x ' ' n as the reference noise. The ESN-ANC response

y n is subtracted from the corrupted signal, d n at each time step to get the

noise-cleaned, u ' n . The same u ' n is also fed back to the adaptive algorithm in


Figure 12: Block diagram of an ESN-ANC. The original signal, u(n) is corrupted by the noise element x'(t). x'(t) is the correlated version of the main noise source, x(t) which passes through the primary path H(z). The ESN ANC tries to estimate the noise on u(n), by using the x''(t) as its reference which is formed after x(t) passes the secondary path S(z). The ESN response, y(n), which is an estimate of x'(t), is subtracted from the corrupted signal, d(n) in order to get the cleaned version of the original signal, u'(n). Meanwhile, the Adaptive Algorithm (i.e RLS), updates W out to be able cope with the time varying properties of x(t).


order to update W out .

While evaluating, the results of the ESN-ANC, we also used an additional performance

criterion, Signal To Noise Ratio (SNR) which is defined as follows:

SNR= 10 × log10 ∑n=1

N

u ' 2n

∑n=1

N

u n−u ' n 2SNR is a more frequently used evaluation criterion than NMSE to measure the

performance of an audio application.

5.3 Experimentation Phase and the Results

Now we follow on with the details of our experimentation phase and the corresponding

results.

5.3.1 Adaptive System Identification Setup

During our tests under adaptive system identification setup, we evaluated three test

cases. In the first case, we used a forgetting factor of 0.999, = 0.999 , for each of

the algorithms and ran the networks for one million samples. In the second case we

again ran them for one million samples using a forgetting of 1, = 1 . In the last test

case, we again used = 1 but this time ran the network for 5 million samples. These

three cases are repeated for both the second order nonlinear system and the 10 th order

NARMA system. Now we go on to the results which we got with the first system. Here

note that although we were changing the value of the forgetting rate some algorithms

are not effected by this since they do not include the forgetting rate in their computation.

These are the ESN-LMS, ESN-BPDC and ESN-Ardalan-RLS. Despite the fact that they

do not use , we ran them in all tests in order to have a more strongly grounded

evidence regarding their stability. Moreover, we added some small amount of random

uncorrelated noise to the input of these algorithms to attain a better conditioning of

correlation matrix. By running them for each test, we can see the effects of the noise on

the steady state performance and the numerical stabilities of the given algorithms in a



better way. The same remarks also hold for the Adaptive Noise Cancellation

experiments.

5.3.1.1 Identifying the Second Order Nonlinear Dynamical System

In this test we used an 50 neuron reservoir with density of 0.1 and spectral radius of 0.3.

Our ESN is in its most general form where input to output and output to output

connections are also present. The W back matrix is also used in evaluation of the

internal states. Summing up we have 52 coefficients to be updated in our W out . We

skipped a very common practice of ESN theory which is running the network freely

with the input signal for some time without doing weight update in order to wash-out

the initial transient effects of the reservoir. Keep in mind that it takes some time before

all of the given algorithms converge depending on the filter length. Therefore, we think

it is not suitable to run networks freely for some additional time, which is a general

practice in offline training of ESNs. This does not make sense under adaptive filtering

context since convergence rate is of high importance in many applications. We ought

not prolong the transient period of the filter by doing so.

Algorithm dependent parameters can be best seen from the Table 1. Additionally, with

the ESN-SCRLS2 we have a diagonal constant of 1, = 1 when = 0.999 and

= 104 when = 1 . For the ESN-LMS we used a learning rate of 1, = 1 .

The learning rate and the regularization constant used for the ESN-BPDC are

= 0.25 and = 0.002 respectively. These parameters are estimated via testing

on small datasets with 1000 to 50000 samples. Even with those small data sets, we

experience that the parametrization has crucial role on both numerical stability and

accuracy. Slight changes may result in unacceptable deviations from the desired

performance. We optimized these parameters to some extend for each of the algorithms

until we got acceptable results.



Algorithm Scale Factor Bias Noise Added

ESN-CRLS 0.0005 0 Yes – 5 %

ESN-SCRLS 0.0005 0 Yes – 2.5 %

ESN-SCRLS2 0.0005 0 Yes – 2.5 %

ESN-Ardalan-RLS 0.0005 0 Yes – 2.5 %

ESN-RLSP 0.5 0 No

ESN-QR-RLS 0.75 0 No

ESN-IQR-RLS 0.75 0 No

ESN-LMS 0.45 0 Yes – 2.5 %

ESN-BPDC 0.075 0 No

Table 1: Algorithm dependent parameters used for the identification of the 2nd order nonlinear dynamical system.

Using the above setup, the results for all test cases are given in the Table 2, Table 3 and

Table 4 and also in the corresponding figures, Figure 13, Figure 14 and Figure 15. Note

that, we did not include the first 1000 iterations while calculating the NMSE in order to

discard transient responses of the ESN before it reached its steady state. Discarding first

samples is done not only for this experiment but also for the rest.

Algorithm NMSE

ESN-CRLS Overflow

ESN-SCRLS Overflow

ESN-SCRLS2 0.16037

ESN-Ardalan-RLS 0.16037

ESN-RLSP 0.0026872

ESN-QR-RLS 0.0029747

ESN-IQR-RLS 0.0036439

ESN-LMS 0.18792

ESN-BPDC 0.18059

Table 2: Results of the test case 1: fRate < 1 ( = 0.999) and 1000000 (One Million) Samples for the 2nd

Order Nonlinear Dynamical System



Algorithm NMSE

ESN-CRLS 0.15977

ESN-SCRLS 0.15977

ESN-SCRLS2 0.16609


ESN-RLSP 0.0066470



ESN-LMS 0.18751

ESN-BPDC 0.18003

Table 3: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System


Figure 13: Results of the test case 1: fRate < 1 ( = 0.999) and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System


Algorithm NMSE

ESN-CRLS 0.16036

ESN-SCRLS 0.16036

ESN-SCRLS2 0.16669


ESN-RLSP 0.0048698

ESN-QR-RLS 0.029552


ESN-LMS 0.18712

ESN-BPDC 0.18033

Table 4: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 2nd Order Nonlinear Dynamical System


Figure 14: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System


Here it is interesting to see that ESN-QR-RLS performed worse than the ESN-RLSP

and the ESN-IQR-RLS when forgetting rate is equal to one. However, one should not

forget that the numerical accuracy achieved by the ESN-QR-RLS is still much better

than the other algorithms. In the Figure 16, observe that the stepwise squared error

performance of ESN-QR-RLS decreases constantly as the number of iterations increase.

In this example the forgetting rate is set to one. This in parallel with observations given

in [Yang 1992] in that better numerical accuracy is achieved when is chosen to be

1 . This can also be related to the poor tracking performance of the ESN-QR-

RLS. It is a well known effect that tracking performance of any RLS based algorithm

decreases, as the value of the forgetting rate approaches one. At the limit, when

= 1 , it is the worst.


Figure 15: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 2nd Order Nonlinear Dynamical System


5.3.1.2 Identifying the 10th Order Nonlinear System

In this experiment we used a 100 neuron reservoir. The density and the spectral radius

of the DR are 0.1 and 0.99 respectively. ESN is again in the most general form as the

previous experiment.

Algorithm dependent parameters can be best seen from the Table 2. Additionally, with

the ESN-SCRLS2 we have a diagonal constant of 1, = 1 when = 0.999 and

= 2500 when = 1 . For the ESN-LMS we used a learning rate of 1,

= 2500 . The learning rate and the regularization constant used for the ESN-BPDC

are = 0.75 and = 0.0002 respectively.

Before going on the results, we want to mention that better results are achieved for the

same identification task in [Jaeger 2002c] by using a DR with squared activations. By

this way it is possible to increase the nonlinearity at a cost of increased W out size.

Jaeger used around two hundred taps where we have only a hundred. As we already

mentioned before our main intention in doing those experiments is to test numerical

stability of different online adaptation algorithms, not to achieve the best numerical

accuracy.


Figure 16: Stepwise squared error graph of the ESN-QR-RLS with fRate = 1 during identification of the second order nonlinear dynamical system for five million samples. Observe that value of the squared error increases, hence the performance decreases, as the number iterations increase


Algorithm ScaleFactor Bias NoiseAdded

ESN-CRLS 0.001 0 Yes – 5 %

ESN-SCRLS 0.001 0 Yes – 2.5 %

ESN-SCRLS2 0.001 0 Yes – 2.5 %


ESN-RLSP 0.001 0 No

ESN-QR-RLS 0.1 0 No


ESN-LMS 0.001 0 Yes – 2.5 %

ESN-BPDC 0.001 0 No

Table 5: Algorithm dependent parameters used for the identification of the 10th Order NARMA System.

Results of this experiment are given the tables, Table 6, Table 7 and Table 8. Also in the

figures, Figure 17, Figure 18 and Figure 19.

Algorithm NMSE

ESN-CRLS 28417, Useless

ESN-SCRLS Overflow

ESN-SCRLS2 0.0064756


ESN-RLSP 0.0078000



ESN-LMS 0.014239

ESN-BPDC 0.013382

Table 6: Results of the test case 1: fRate < 1 ( = 0.999) and 1000000 (One Million) Samples for the 10th Order NARMA System



Algorithm NMSE

ESN-CRLS 0.0062685

ESN-SCRLS 0.0062662



ESN-RLSP 0.025600



ESN-LMS 0.013930

ESN-BPDC 0.013098

Table 7: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 10th Order NARMA System


Figure 17: Results of the test case 1: fRate < 1 ( = 0.999) and 1000000 (One Million) Samples for the 10th Order NARMA System


Algorithm NMSE

ESN-CRLS 0.0073585

ESN-SCRLS 0.0073480



ESN-RLSP 0.0620620



ESN-LMS 0.013072

ESN-BPDC 0.011740

Table 8: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 10th Order NARMA System


Figure 18: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 10th Order NARMA System


5.3.2 Adaptive Noise Cancellation

Now we continue on with the results of our Adaptive Noise Cancellation experiments.

In this experiment we used two test cases. We processed 4500000 (four and a half

million) samples at each of these cases. Again different forgetting rates were used for

these two test cases as in the system identification setup. We used = 0.9999 and

= 1 for test cases one and two respectively.

During this experiment we used a very small ESN. It has a fully connected reservoir

with 10 units. Spectral radius is set to 0.99. Output to output connections and the

W back is not used in this experiment. Therefore considering the one input to output

connection we have a W out of size 1×11 .

The algorithm dependent parameters are given in the Table 9. Additional parameters are

used for ESN-SCRLS2, ESN-LMS and ESN-BPDC. Diagonal constant used for ESN-

SCRLS2 is equal to one, = 1 . Learning rates for ESN-LMS and ESN-BPDC are

= 0.005 and = 0.1 respectively. The regularization constant is set to 0.002 for


Figure 19: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 10th Order NARMA System


the ESN-BPDC algorithm.

Algorithm ScaleFactor Bias NoiseAdded

ESN-CRLS 0.0005 0 Yes - 5 %

ESN-SCRLS 0.0005 0 Yes - 5 %

ESN-SCRLS2 0.0005 0 Yes - 5 %


ESN-RLSP 0.0005 0 Yes – 2.5 %

ESN-QR-RLS 0.0005 0 No


ESN-LMS 0.1 0 Yes – 2.5 %

ESN-BPDC 0.01 0 Yes – 2.5 %

Table 9: Algorithm dependent parameters used for the Adaptive Noise Cancellation

Based on this parametrization our results are given in the tables Table 10, Table 11 and

in the figures, Figure 20 and Figure 21. Note that just as we calculate the NMSE by

discarding first 1000 iterations, we do the same for the calculation of the SNR.

Algorithm NMSE SNR

ESN-CRLS 276.3157, Useless 0.0026, Useless

ESN-SCRLS Overflow Overflow

ESN-SCRLS2 Overflow Overflow

ESN-Ardalan-RLS 6.3874, Useless 0.6331, Useless

ESN-RLSP 0.0771 11.4546

ESN-QR-RLS 0.0495 13.2566

ESN-IQR-RLS 0.0492 13.2859

ESN-LMS 0.0828 11.1654

ESN-BPDC 0.0837 11.1214

Table 10: Results of the Test Case 1: fRate < 1 ( = 0.9999) and 4500000 Samples for Adaptive Noise Canceling



Algorithm NMSE SNR

ESN-CRLS 19.4080, Useless 0.1965, Useless

ESN-SCRLS 53.9304, Useless 0.0667, Useless

ESN-SCRLS2 0.0806 11.2714

ESN-Ardalan-RLS 0.0876 10.9411

ESN-RLSP 0.0791 11.3496

ESN-QR-RLS 0.0640 12.2079

ESN-IQR-RLS 2.881, Useless 1.2997, Useless

ESN-LMS 0.0828 11.1654

ESN-BPDC 0.0837 11.1214

Table 11: Results of the Test Case 2: fRate = 1 and 4500000 Samples for the Adaptive Noise Cancellation


Figure 20: Results of the Test Case 1: fRate < 1 ( = 0.9999) and 4500000 Samples for the Adaptive Noise Cancellation


5.4 Comments on the Results

We should start commenting on the results by first dividing our algorithms to three

different classes. In the first class, which we name the Linear Time Algorithms (LTA),

we have ESN-LMS and ESN-BPDC. In fact none of the algorithms given in this thesis

are of linear time complexity. This is due to the matrix-vector multiplication (matrix

W multiplied by the vector x t ) present in the evaluation equation. Therefore all

of the algorithms have at least a complexity that is growing with the square of DR size.

But if the main bodies of ESN-LMS and ESN-BPDC are considered alone, (when

evaluation and exploitation steps are omitted), then they are of linear complexity. See

Appendix B for details. Second class is called the CRLS Variant Algorithms (CVA).

This class includes ESN-CRLS, ESN-SCRLS, ESN-SCRLS2 and ESN-Ardalan-RLS.

All of these algorithms are based on the ESN-CRLS with some changes. Our third and

the last class is called the Rotation Based Algorithms (RBA) since they are based on

orthogonal rotations. This class includes the ESN-QR-RLS and ESN-IQR-RLS. Now

we can continue on to our general observations. We keep the ESN-RLSP out of any


Figure 21: Results of the Test Case 2: fRate = 1 and 4500000 Samples for the Adaptive Noise Cancellation


class because it can be included for both CVAs or RBAs. It resembles the CVAs

structurally and the RBAs numerically. We will comment on the ESN-RLSP separately.

We begin our discussion with some statistics on the algorithm performances regarding

their stability. Totally, we had eight test cases, six for the adaptive nonlinear system

identification and two for the adaptive noise canceling. Among them, we used 1

for three times and = 1 for five times. Exceptions are the ESN-Ardalan-RLS, ESN-

LMS and the ESN-BPDC where the forgetting rate parameter is not used. Still, we ran

them for all of our test cases in order to have stronger grounds regarding their

performance. In the following tables we give individual performances of algorithms in

terms of numerical stability.

ESN-CRLS 1 = 1 Total

Diverged 100% 20% 50%

Normal - 80% 50%

ESN-SCRLS 1 = 1 Total

Diverged 100% 20% 50%

Normal - 80% 50%

ESN-SCRLS2 1 = 1 Total

Diverged 33% - 14%

Normal 67% 100% 86%

ESN-RLSP 1 = 1 Total

Diverged - - -

Normal 100% 100% 100%

ESN-QR-RLS 1 = 1 Total

Diverged - - -

Normal 100% 100% 100%

ESN-IQR-RLS 1 = 1 Total

Diverged - 20% 16%

Normal 100% 80% 86%



ESN-Ardalan-RLS Total

Diverged 14%

Normal 86%

ESN-LMS Total

Diverged -

Normal 100%

ESN-BPDC Total

Diverged -

Normal 100%

We begin our discussion with the LTA class. In all of the cases LTA class of

algorithms showed a very robust performance in terms of numerical stability. On the

other hand their numerical accuracy was poor as opposed to their stability. Additionally,

numerical accuracy would be much worse, if we used them for fast startup applications

where W out should be kept the same after filter converges to its steady state. This is

because of the slow convergence rate of these algorithms. Usually it is an order of

magnitude slower than the CVA or RBA algorithms. Actually both the steady state

performance and the convergence rate of these algorithms are mostly characterized by

the eigenvalue spread of the correlation matrix of the input signal, thus the convergence

rate may change drastically from signal to signal. Although we did not explicitly

investigated the tracking performance of any of the algorithms, it is well known that

ESN-LMS has superior tracking properties over other both CVAs or RBAs [Haykin

1996] [Farhang-Boroujeny 1998]. We do not have enough experience to comment on

the tracking performance of the ESN-BPDC. This should be investigated in a future

work. Finally, the most obvious advantage of these algorithms are their computational

complexity which is linear with the size of the W out18. ESN-LMS is has a lower

computational complexity than ESN-BPDC whereas the ESN-BPDC offers a more

precise numerical accuracy. NMSE results reveal that ESN-BPDC performed better

than ESN-LMS in 75% of the test cases. Provided good conditioning of the input signal,

ESN-LMS and ESN-BPDC can also be used for applications which does not need fast

start-up. Experiments show that performance of these algorithms become more and

more acceptable as the number of iterations increase. However, we still do not 18 Note that ESN-BPDC complexity is linear time only when one output neuron is used. See [Steil 2004]



recommend the general use of these algorithms because the current state of the art of

DSP technology can now realize very high computational speeds at reasonable costs

[Analog 2005] [TI 2005]. Therefore, for many applications, ON 2 algorithms can

easily be put in to practice. But if a decision between these two must be made, we favor

ESN-BPDC because of its better numerical accuracy. The fact that it is specifically

designed for neural network learning whereas the ESN-LMS has its roots from the

Adaptive Filtering theory is also another reason why we favor the ESN-BPDC. Only

when ESN research will come to a point where some techniques to shrink the

eigenvalue spread of reservoirs will be discovered, then the ESN-LMS will be a very

competitive choice among other online adaptation algorithms because of its simplicity

and highly robust performance (i.e: stability, good tracking). Do not forget that

eigenvalue spread of an ESN when used with certain inputs may be acceptable for some

cases, therefore it is always worth trying ESN-LMS at the first step to see if it gives

acceptable performance. This approach may save a great deal of time and resources. If

the performance is not good, you will not lose anything since the equations present in

ESN-LMS are structurally common in all other algorithms, therefore the code used to

implement ESN-LMS is also re-usable for the implementation of other algorithms.

CVA class of algorithms showed varying results during our tests. All of them resulted in

an overflow or in a useless result at least once in some test cases. Additionally, they

showed the best NMSE performance only once with the ESN-SCRLS2 algorithm. In

terms of numerical accuracy they were better than LTAs but worse than RBAs. Our

most obvious inference based on the results is that these algorithms are acting more

stable under ESN context whenever = 1 . This is interestingly contradicting with

the observations given in [Ardalan 1987] and [Slock 1991], where it is stated that

probability of divergence is higher when 1 is used. Reasons behind this

phenomenon can be analyzed in a future work. Our suggestion is then in the same

direction that these algorithms should be always used with a forgetting factor equal to 1

for long term adaptations. But do not forget that using = 1 may cause a degradation

in the tracking performance of them. Most promising results are obtained using ESN-

SCRLS2 and ESN-Ardalan-RLS which are based on the same trick of adding a scalar

value to the diagonal of the symmetric correlation matrix. ESN-Ardalan-RLS is a more



intelligent method in the sense that it changes this scalar value dynamically depending

on the change in W out where as it is kept constant in ESN-SCRLS2. But of course

using this dynamical approach, adds some additional computational complexity to the

algorithm. ESN-Ardalan-RLS is also designed specifically for the pre-windowed

memory case meaning that the forgetting rate is implicitly equal to one. On the other

hand, SCRLS2 has a better numerical accuracy and a lower computational complexity.

Only disadvantage is that the designer should estimate the correct diagonal constant and

the period before running the algorithm which is in contrast to the ESN-Ardalan-RLS.

In conclusion, our favorite algorithms from the CVA class are both the ESN-Ardalan-

RLS and the ESN-SCRLS2. Nonetheless, we favor ESN-SCRLS2 more.

The most successful class of algorithms were the RBAs during out experiments. They

have offered a very good numerical stability and accuracy. Considering the NMSE

performance, they achieved the best results six times through eight test cases. Two of

these are obtained by ESN-QR-RLS and the remaining four by the ESN-IQR-RLS. The

only important point is that they should be used with 1 or otherwise their

performance in terms of numerical accuracy degrades. This point is further supported by

the numerical stability observations given in the [Yang 1992]. Both QR-RLS and IQR-

RLS diverged after some period of time during their experiments with = 1 . This

observation especially holds for the IQR-RLS algorithm through our tests. In the second

test case of the adaptive noise canceling experiment, it diverged.

In general, we recommend use of the RBAs. Especially we favor the ESN-IQR-RLS for

its good numerical accuracy, stability and simplicity when compared to the ESN-QR-

RLS. The ESN-QR-RLS should be used only for the applications where calculation of

the a-priori estimation error, t , suffices. An example of this type is the Adaptive

Noise Cancellation. Otherwise, this algorithm becomes computationally too much

demanding due to the computation of inverse of a lower triangular matrix via back-

substitution.

As we mentioned in the previous chapter, RLSP share the good numerical accuracy

properties of the RBAs provided the statistical variations of the input signal are mild.

Otherwise, its performance heavily degrades due to its bad tracking properties. So, it is



not advised to be used with the signals which are prone to harsh statistical changes such

as the 10th Order NARMA system used during our testing phase. During our early

experimentations on smaller datasets, we observed that ESN-RLSP has a slower

convergence rate when compared to CVAs or RBAs. Based on this we can also

conclude that it is also not suitable for fast start-up applications where the convergence

rate is of vital importance. Numerical stability-wise it showed a robust performance,

although its accuracy was not as good for all of the cases. Again as similar to the RBAs,

better results are achieved when forgetting rates are chosen to be smaller than one.

Otherwise, it may drift away from the desired signal for a few number of iterations and

re-converge to the steady state later. Together with the bad tracking properties of the

algorithm this usually ends up in a bad solution in terms of accuracy although it does

not diverge. See results of the test cases two and three of the 10 th Order NARMA

System identification, where this phenomenon is clearly visible. Computational

complexity of the algorithm is also a big drawback, it is approximately four times

slower than the ESN-SCRLS. Combining all of the negative points mentioned above we

conclude that the use of ESN-RLSP is not a good choice. If one has to trade-off between

accuracy and efficiency, RBAs or CVAs offer a better performance in general, thus they

should be considered instead of the ESN-RLSP.

As a final remark, we again want to repeat that these results should not be interpreted as

the given algorithms ensure 100% numerical stability or instability. We do not claim to

cover a more general class of applications based on these results under any

circumstances. What we really want to achieve was to show that taking the appropriate

precautions, a stable use of the algorithms may become possible where continuous, long

term adaptation is required. For short term adaptation purposes we recommend the

ESN-SCRLS for two reasons. Firstly, due to its simplicity both computational

complexity-wise and implementation-wise. Secondly because of its good convergence

properties which is fast and independent of the signal statistics. However, if the

numerical accuracy is also an important issue, ESN-IQR-RLS should be used when

ESN-SCRLS does not suffice. ESN-LMS or ESN-BPDC needs very good conditioning

of the DR in terms of eigenvalue spread, in order to be used successfully.

Our chapter on online adaptation of ESNs ends here. We discussed performances



different algorithms which can be used to update output layer of an ESN online in a

reliable way. In the next chapter, we will compare ESN performance under adaptive

filtering context with respect to standard methods.


6. ECHO STATE NETWORKS VS STANDARD METHODS


6.1 Introduction

This chapter aims to show the reader that ESNs when used for adaptive filtering,

becomes a competitive candidate over standard adaptive filtering techniques. Firstly, we

will give a brief summary of standard methods of the adaptive filtering theory. The

methods we will summarize are linear transversal filters and adaptive polynomial filters

(i.e: a class of non-linear adaptive filtering methods). We will then compare ESN

performance with respect to the linear transversal filters and the adaptive polynomial

filters.

6.2 Overview of the Standard Adaptive Filtering Methods

The first type of filters to be compared with the ESN Adaptive Filter are the linear

transversal filters. They have a very simple yet very useful structure. These type filters

often played an very important role in development of core adaptive filtering

applications. The transversal filter tries to model the desired signal using M input

samples where M denotes the filter length. Modeling is done via expressing the signal

as the linear combination of tap weights and the history of input vector. Due to the feed-

forward structure, transversal filters belong to the Finite-Duration Impulse Response

(FIR) type of filters. Formal definition of a transversal filter is given in Definition 2.

All of the algorithms given in this thesis, can be used for weight adaptation of a

transversal filter. For simplicity in our comparisons, we will only use the IQR-RLS

algorithm for online adaptation of the transversal filters, and also for other yet to be

given filter structures. The IQR-RLS is chosen for its good stability and steady state

error performance. To save up space in the thesis, we omit the derivation of the IQR-

RLS algorithm when used for transversal filters. Interested reader can refer to [Sayed

1994], [Haykin 1996], [Farhang-Boroujeny 1998] or [Bellanger 2001].

Our second type of filters are the adaptive polynomial filters. These are a special class

of nonlinear adaptive filters which are using polynomial systems to get a nonlinear

model of the desired signal. Specifically, we will investigate the performance of two



main filter types; Volterra Filters and Bilinear Filters. We will not go deep in to the

theory of these filters here. Interested reader can refer to [Matthews 1991] and [Jenkins

1996] from where more detailed literature can also be tracked through the given

references.

Definition 2 : LINEAR TRANSVERSAL FILTER

Input Vector :

X t = [ x t , x t−1 , x t−2 , ... , x t−M1 ]T

Tap Weight Vector :

w t = [ w1t , w2t , w3t , ... , wM t ]T

Output :

y t =∑i=1

M

w it x t−i1 = wt T X t

Error :

e t =d t − y t

Now we follow on with an introduction to the Adaptive Volterra Filters. We base our

discussion on polynomial filters to the [Matthews 1991] and [Jenkins 1996]. Infinite

Volterra series Expansion for any given discrete time signal is given by:

y t = h0 ∑m1=0

∞

h1m1 x t−m1 ∑m1=0

∞

∑m2=0

∞

h2m1 , m2 x t−m1x t−m2

... ∑m1=0

∞

∑m2=0

∞

...∑m p=0

∞

h pm1 , m2 , ... ,m p x t−m1x t−m2... x t−mp ...

Here, the h pm1, m2 , ... , m p is called the p th order Volterra kernel of the system.

Volterra kernels are assumed to be symmetric that is it is left unchanged for any p!

permutations of the indexes m1, m2 ,... , m p . One may view the infinite Volterra series

as a special form of Taylor series expansion with memory. Since an infinite expression

is impossible to realize in a real world application, one should use a truncated Volterra

series expansion as:



y t =∑m1=0

M−1

h1m1 x t−m1 ∑m2=0

M−1

∑m1=0

M−1

h2m1 ,m2x t−m1 x t−m2

... ∑m p=0

M−1

∑m p−1=0

M−1

...∑m1=0

M−1

h pm1 ,m2 , ... ,mpx t−m1 x t−m2... x t−m p

Here notice that h0 is not included in the equation since it can be estimated as equal to

zero. The most prominent disadvantage of using the truncated series is that number of

coefficients increase directly proportional to the M p . (i.e: For a p=3 order series

with M=5 time steps of history, we have 55253=155 coefficients. As a result,

most of the real world applications of truncated Volterra filters, use lower order

expansions. In our comparisons, we will use a 2nd order expansion as used in [Matthews

1991], which is given in Definition 3.

Definition 3 : SECOND ORDER TRUNCATED VOLTERRA FILTER

Input Vector :

X t = [ x t , x t−1 , xt−2 , ... , x t−M 1 , xt 2 , x tx t−1 , ... , x tx T−m1 , x t−12,... , x t−M 12]T

Volterra Kernels:

H t = [ h10 : t , h11 : t , ... , h1M−1, t , h20,0 : t , h20,1 : t , ... , h20, M −1: t , h21,1 :t , ... , h2M−1, M−1 : t ]T

Output :

yt = ∑m1=0

M−1

h1m1: t x t−m1 ∑m2=0

M−1

∑m1=0

M−1

h2 m1 , m2 :t x t−m1 x t−m2 = H t T X t

Error :

e t =d t − y t

The vector notation used in the above definition simplifies the use of adaptation

algorithms, like LMS or RLS, for the Volterra filters. Since the output can be expressed

as the linear combination of the elements in the vector X t , H t , we can use

the IQR-RLS for Volterra filter coefficient update in a similar form as in the case of the

transversal filters or the linear combiners whose learning have a similar structure to an

ESN's W out .



The last filter type we will consider are the Bilinear Filters. The main problem

associated to Volterra filters is that a large number of coefficients is usually required

model certain systems. Therefore, use of other polynomial representations should also

be considered. It is known that recursive nonlinear difference equations can be used to

model nonlinear systems with a better precision than Volterra series representations.

Bilinear expansion is a simple but very useful example of such recursive nonlinear

difference equations. It is given by the following formulation:

y t = ∑i=0

M−1

a i x t−i ∑i=0

M−1

bi y t−i ∑i=0

M−1

∑j=0

M−1

c i , j x t−i y t− j

It has been shown that the bilinear series can model large number of nonlinear system

with arbitrary precision under mild conditions. Because the of output feed-back used in

the formula of Bilinear Filters, structurally they resemble IIR filters. Therefore, the

main drawback of using bilinear filters is their problem of numerical instability (as in

the case of IIR filters). The research on the subject of stabilizing bilinear filters is still

on early stage. There exists no known scheme for guaranteed stability of bilinear filters

yet.

From the formula we can see that the output of the bilinear series is a linear combination

of its coefficients thus it is again a simple task to derive adaptive algorithms in order to

use them for adaptive filtering. As in the case of Volterra filters, we can extend the

theories developed for linear signal processing to bilinear systems. Similarly, we can

define an Adaptive Bilinear Filter as in Definition 4.

6.3 Performance Comparison

Cross-evaluation of the Transversal, 2nd Order Truncated Volterra, Bilinear and ESN are

done using the same adaptive system identification scenario used in testing phase of

different online learning algorithms for ESNs. As quick refresher we once again go over

this setup throughly. In these experiments we tried to identify two unknown time

varying systems that are introduced in the paper [Atiya 2000].



Definition 4 : BILINEAR FILTER

Input Vector :

X t= [ x t , x t−1 , ... , x t−M1 , y t , y t−1 , ... , y t−M1, x t y t , x t−1 y t , ... , x t−M1 y t−M 1 ]T

Coefficient Vector :

C t = [ a0t , a1t ,... , a M−1t ,b0 t ,b1t ,... , bM −1t , c0,0 t , c1,0 t , ... , cM−1, M−1t ]T

Output :

y t =∑i=0

M−1

a i x t−i ∑i=0

M−1

bi y t−i ∑i=0

M−1

∑j=0

M−1

c i , j x t−i y t− j = C t T X t

Error :

e t =d t − y t

The first system is a 2nd order nonlinear dynamical system which is given by the equation:

y n1 = y n y n y n−1 u3n

We generated 10000 samples of the signal. At each time step we change the parameters

, , , by a factor of 1% around the original values, which are 0.4, 0.4, 0.6, 0.1

respectively, to achieve variations in the signal behavior through time. The input signal

u n is an uncorrelated uniform noise from the interval [−0.5 , 0.5] .

We used the same fifty neuron ESN from the previous test which has output to output

and input to input connections. W back is included in the exploitation equation. Input

signal is scaled by a factor of 0.75 and no bias is used. In order to make an objective

comparison we used the other filters with at least fifty or more number of coefficients.

Results of this experiment are given in Table 12, Figure 22 and Figure 23.



Algorithm NMSETransversal (via IQ-RLS) 0.1602100

Second Order Truncated Volterra Filter (via IQ-RLS) 0.1615600Bilinear Filter (via IQ-RLS) 0.1576400

ESN-IQ-RLS 0.0054497

Table 12: Performance comparison of different adaptive filtering methods for identification of the 2nd order nonlinear dynamical system.

In this experiment, ESN-IQ-RLS performed much better when compared to the

performance of other methods. Performance gain by using ESNs is in the order of

approximately 30 times better. Although there not much difference in the performances

of the other filters, the worst performance belongs to the second order truncated

Volterra filter.


Figure 22: Performance of different adaptive filtering methods for identification of the 2nd order nonlinear dynamical system

Algorithm0

0,010,020,030,040,050,060,070,080,090,10,110,120,130,140,150,160,17

TransversalVolterraBilinearESNN

MSE


Figure 23: Comparison of the ESN output versus the time varying 2nd Order Nonlinear Dynamical System in the last 100 iterations of the experiment

Our second system is a harder example. It is the 10th order NARMA system given by the

equation:

y n1 = tanh y n y n [∑i=0

9

y n−i ] u n−9u n

Again we followed a similar approach as the previous experiment, and used same 100

neuron reservoir that was used in the performance comparison of ESN-RLS

combinations under system identification setup of the 10th order NARMA system. It has

a density of 0.1 and the spectral radius is 0.99. All connections including W back are

present. Input signal is scaled by a factor of 0.1 without any bias. Number of

coefficients used for other filter types are also around 100. For the transversal filter we

used 102 coefficients which is same as the ESN. For the Volterra filter we chose a

window-length of 13 which corresponds to 105 coefficients. Our bilinear filter has a

window-length of 10 resulting in 109 coefficients. We need to mention that, when we

run the filter directly with the input signal we observed instabilities in the bilinear filter.

As a simple solution, we scaled the input to a more compact interval and then fed it to

filter. The scaling factor we used in this case for our bilinear filter is 0.005. Results are



given in Table 13, Figure 24 and Figure 25.

Algorithm NMSETransversal (via IQ-RLS) 0.016813

Second Order Truncated Volterra Filter (via IQ-RLS) 0.011886Bilinear Filter (via IQ-RLS) 0.019035

ESN-IQ-RLS 0.008947

Table 13: Performance comparison of different adaptive filtering methods for identification of the 10th order NARMA system.

Although not as good as the previous example, ESNs again performed better than the

other filtering methods. This time the performance gain is about twice the worst

method, which is now the bilinear filter.


Figure 24: Performance of different adaptive filtering methods for identification of the 10th order NARMA system

Algorithm0,00E+000

2,00E-003

4,00E-003

6,00E-003

8,00E-003

1,00E-002

1,20E-002

1,40E-002

1,60E-002

1,80E-002

2,00E-002

TransversalVolterraBilinearESNN

MSE


Additionally, we also want to show you good example of how these online adaptation

algorithms track the signal variations in time. As you know our 10th Order NARMA

system has harsh changes in the signal coefficients at every 2000th time step. As shown

in the previous chapter, we observe a dramatic jump in the mean value of the signal at

those times. (See Figure 11) Naturally, due to these jumps, ESN response drift away

from the desired value for a number of time steps until it can re-converge to the steady

state of the new signal behavior. In the following Figure 26, we give the stepwise

squared error graph of the ESN-IQ-RLS observed during the identification of the 10th

Order NARMA System. Observe the sudden increases in the error value at every 2000th

iteration which decreases slowly after some iterations pass. Better the tracking property

of an online adaptation algorithm, time interval between the error jump and re-

convergence to the steady state should be shorter. It is known that the tracking

performance of the ESN-IQ-RLS is not as good as LTA (i.e: ESN-BPDC) or CVA (i.e:

ESN-SCRLS2) class of algorithms.


Figure 25: Comparison of the ESN output versus the time varying 10th Order NARMA system in the last 100 iterations of the experiment


6.4 Conclusion

In conclusion, we reached our aim in this chapter by showing the superiority of the

ESNs to other adaptive filtering methods both linear and non-linear. By doing so we do

not claim that ESNs will always perform better than the other methods. This may differ

from application to application, thus more investigation should be carried on wider

range of application types. But, depending on these results plus the ones in [Jaeger

2002c] and [Jaeger 2004], we can easily conclude that ESNs when used in an online

learning fashion constitute a competitive approach for adaptive filtering.


Figure 26: Stepwise squared error graph of the ESN-IQR-RLS, observed during the identification of the 10th Order NARMA System. Observe that the error value increases suddenly at every 2000th time step. After some time it re-converges to the steady state.

7. SUMMARY

7. SUMMARY

Throughout out this thesis, we described the use of ESNs for adaptive filtering tasks.

Generally an ESN is trained in an offline manner using the Algorithm 2 given in the

Chapter 3. However, due to the simplicity of the learned part of an ESN, which is only

the output connections from reservoir and input or output neurons, online learning is

possible by means of adaptation algorithms that are used inside adaptive filters. RLS is

an example of those algorithms and fits the ESN case successfully. It offers a fast rate of

converge which is independent of the eigenvalue spread of the input signal. On the

contrary it has two major drawbacks. Firstly, it is prone to numerical instability under

finite precision environments (i.e: Digital Systems). Secondly, complexity of the RLS

algorithm is in the squared order of the number of connections to be learned. These

problems should be investigated in detail for a more robust and reliable use of the RLS

for the online adaptation of ESNs. This is especially needed for applications where

online learning should be done in long term.

The problem of computational complexity could have been solved by fast versions of

the RLS algorithm however the ESN structure do not allow the use of such algorithms.

Another solution is to use of stochastic gradient algorithms, like the well known LMS,

which have linear time complexity. We proposed ESN-LMS and ESN-BPDC as an

example of such algorithms. The major drawback of them is their converge rate is an

order of magnitude slower than the RLS and it is also dependent on the eigenvalue

spread of the input signal. Therefore, they are not suited for all kinds of applications.

Another point regarding the complexity problem is the current state of the art of the

DSP chips. Those chips can realize very high computation speeds at the moment and

this is now achieved at reasonable prices. Therefore, the complexity of RLS can be

acceptable for certain applications where number of units used in the ESN is not very

large.

However, under any condition, the numerical stability problem should be treated in

detail for a reliable usage of the RLS algorithm. In that aspect, we went through the

most prominent examples of the adaptive filtering literature regarding stability of the

RLS algorithm. Outcome is a number of RLS variant algorithms to be used for online


7. SUMMARY

adaptation of ESNs, which have different pros and cons. We selected ESN-CRLS, ESN-

SCRLS, ESN-SCRLS2, ESN-Ardalan-RLS, ESN-RLSP, ESN-QR-RLS and ESN-IQR-

RLS as the most promising algorithms. Later, we evaluated the selected algorithms with

respect to each other under well known adaptive filtering scenarios. Our main

concentration during the experiments was numerical stability. Additionally, we also

considered steady error performances as an evaluation criteria. First one of our

experimental scenarios was the Adaptive System Identification where we tried to

identify two different nonlinear systems. A second order nonlinear dynamical system

and a 10th order NARMA system. For each them, we had three test cases. Our second

experimental scenario was the Adaptive Noise Canceling. There we tried to enhance a

music signal which is corrupted by the speech babble of hundred people talking in a

canteen. Under this scenario, we had two test cases. All in all, we evaluated eight

different test cases. Since the stability was our main concern, we used a very large

number of samples which is between one to five million depending on the test case. As

a result of our experiments, ESN-SCRLS2 and ESN-IQR-RLS are found to be

advantageous. ESN-SCRLS2 offers a acceptable computational complexity, and good

robustness (provided forgetting rate is set to one). It also retains the good numerical

accuracy of the conventional RLS algorithm. ESN-SCRLS2 is also simple to

implement. On the other hand ESN-IQR-RLS offers an excellent numerical accuracy

and robustness (provided forgetting rate is set to a value that is smaller than one.) But,

this is achieved at a cost of increased computational complexity.

During our tests, we also evaluated performances of the two linear time algorithms

ESN-LMS and ESN-BPDC using the same experimental setups. Result is that they can

be used for certain applications where fast converge and numerical accuracy is not very

important. Especially, ESN-LMS could be very useful, however not before some

methods to shrink the eigenvalue spread of an ESN are developed by the researchers.

An example of previous attempts can be found in [Liebald 2004]. This is a very

important future research topic that is also suggested by the founder of the ESN theory,

Herbert Jaeger in a recent paper [Jaeger 2005].

In the last chapter, we compared the performance of online adapted ESNs to the

standard adaptive filtering techniques. We considered both linear and nonlinear


7. SUMMARY

methods. Transversal Filters are examples of linear methods whereas Bilinear Filters

and Second Order Truncated Volterra Filters are belong to the nonlinear adaptive

filtering methods. For weight adaptation of all filter types including the ESN, we used

the IQR-RLS algorithm in order to be fair in our comparisons. Performances of the

filters are evaluated on the same Adaptive System Identification setup that was used

during our stability tests. During the identification of the second order nonlinear

dynamical system, ESNs performed very well by giving an error rate which is

approximately 30 times better than the other methods. For the 10th Order NARMA

System, ESNs again performed better than other methods but in this case the

performance gain was only two times better. In conclusion, we showed that the ESNs

are competitive candidates among other adaptive filtering methods. Similar conclusions

can also be reached by looking at the results of [Jaeger 2002c] and [Jaeger 2004].

We believe the following points are worth further investigation as a continuation of the

work we presented in this thesis:

• Adaptive filters are usually implemented in embedded platforms. Unfortunately, we

did not have enough time to try out the given online adaptation methods inside

embedded systems (i.e: DSP). As a result of this, we are planning to present an

embedded ESN application with online learning in a future paper.

• Throughout the thesis we did not concentrated on finding the best parameters for

ESNs. We believe that these results can be much better if ESNs are applied to the

given problems with a more optimized set of parameters. Additionally, using such

an optimized set may also have positive effects on stability and tracking of the

algorithms presented in this thesis. This is also worth investigating and is left as

future work.

• Since the concept of this thesis was to explore ESN usage for Adaptive Filtering

tasks, we limited our test cases to examples of well known adaptive filtering setups

where number of inputs and outputs are usually both one. On the other hand, the

online learning algorithms given in the thesis also cover multi-input multi-output

mappings. It would be interesting evaluate performance of the online adapted ESNs

for such tasks.


7. SUMMARY

Apart from the points listed above there exist more important questions to be answered

regarding the ESN research. Examples of such open points can be enumerated as

finding a rich reservoir in less number of trials, how to decide if a given reservoir is

suited for the task at hand, how can we adapt ESN in a unsupervised manner to our

task's type of data, etc. The papers [Jaeger 2005] and [Prokhorov 2005] nicely

summarize what answers are still missing regarding ESN research and point out

appropriate future research directions.


8. REFERENCES

8. REFERENCES

1. [Adali 1991] T. Adali, S. H. Ardalan (1991), Analysis of a Stabilization

Technique for the Fixed Point Prewindowed RLS Algorithm, IEEE Transactions

on Signal Processing Vol. 39 No. 9

2. [Alexander 1993] S. T. Alexander, A. L. Ghirnikar (1993), A Method for

Recursive Least Squares Filtering Based Upon an Inverse QR Decomposition,

IEEE Transactions on Signal Processing Vol. 41 No. 1

3. [Analog 2005] Analog Devices (2005), Sharc Processor Home Page,

http://www.analog.com/processors/processors/sharc/ (December 2005)

4. [Ardalan 1986] S. H. Ardalan (1986), Floating-Point Error Analysis of

Recursive Least-Squares and Least-Mean-Squares Adaptive Filters, IEEE

Transactions on Signal Processing Vol. Cas. 33 No. 12

5. [Ardalan 1987] S. Ardalan, S. T. Alexander (1987), Fixed Point round-off

analysis of the exponentially windowed RLS algorithm for time varying

systems, IEEE Transactions on Acoustics, Speech and Signal Processing Vol.

ASSP-35 No. 6

6. [Ardalan 1989] S. H. Ardalan, T. Adali (1989), Sensitivity Analysis of

Transversal RLS Algorithms With Correlated Inputs, IEEE International

Symposium on Circuits and Systems, Portland, USA

7. [Asada 1999] M. Asada, H. Kitano, et al. (1999), RoboCup: Today and

Tomorrow What we have learned, Artificial Intelligence Magazine No. 110

8. [Atiya 2000] A. G. Atiya, A. F. Parlos (2000), New Results on Recurrent

Network Training: Unifying the Algorithms and Accelerating Convergence,

IEEE Transactions on Neural Networks Vol. 11 No. 3

9. [Bellanger 2001] M. Bellanger (2001), Adaptive Digital Filter 2nd Ed., Marcel

Dekker Inc., New York, USA

10. [Bottomley 1989] G. E. Bottomley, S. T. Alexander (1989), A theoretical basis

for the divergence of conventional recursive least squares filters, IEEE

International Conference on Acoustics, Speech, Signal Processing, Glasgow, UK

11. [Bottomley 1991] G. E. Bottomley, S. T. Alexander (1991), A Novel Approach

for Stabilizing Recursive Least Squares Filters, IEEE Transactions on Signal


8. REFERENCES

Processing Vol. 39 No. 8

12. [Buehner 2006] M. Buehner, P. Young (2006), A tighter bound for the echo

state property, IEEE Transactions on Neural Networks

13. [Carini 1999] A. Carini, E. Mumolo (1999), A Numerically Stable Fast RLS

Algorithm for Adaptive Filtering and Prediction Based on the UD Factorization,

IEEE Transactions on Signal Processing Vol. 47 No. 8

14. [Cernansky 2004] M. Cernansky, M. Makula (2005), Feed-forward Echo State

Networks, Proceedings of International Joint Conference on Neural Networks,

Montreal, Canada

15. [Chansarkar 1997] M. M. Chansarkar, U. B. Desai (1997), A Robust

Recursive Least Squares Algorithm, IEEE Transactions on Signal Processing

Vol. 45 No. 7

16. [Cioffi 1984] J. Cioffi, T. Kailath (1984), Fast, recursive-least-squares

transversal filters for adaptive filtering, IEEE Transactions on Signal Processing

Vol. 32 No. 2

17. [Cioffi 1987] J. M. Cioffi (1987), Limited-Precision Effects in Adaptive

Filtering, IEEE Transactions on Circuits and Systems Vol. 34 No. 7.

18. [Dai 2004] Dai X. (2004), Genetic Regulatory Systems Modeled by Recurrent

Neural Network, International Symposium on Neural Networks, Dalian, China.

19. [Ding 2005] H. Ding, Pei W., et al. (2005), A Multiple Objective Optimazation

Based Echo State Network Tree and Application to Intrusion Detection, IEEE

Int. Workshop VLSI Design & Video Tech, Suzhou, China

20. [Douglas 2000] S. C. Douglas (2000), Numerically-Robust O(N^2) RLS

Algorithms using least squares prewhitening, IEEE International Conference on

Acoustics, Speech, Signal Processing, Istanbul, Turkey

21. [Elman 1990] J.L. Elman (1990), Finding Structure in Time, Cognitive Science

No. 4

22. [Erhan 2004] D. Erhan (2004), Exploration of combining ESN learning with

gradient-descent RNN learning techniques, Bachelor Thesis, UIB, Bremen,

Germany

23. [Etalk 2004] Electronicstalk (2004), Budget floating-point DSP drives down

system costs, http://www.electronicstalk.com/news/tex/tex660.html (December


8. REFERENCES

2005)

24. [Eves 1980] H. Eves (1980), Elementary Matrix Theory, Dover publications,

Mineola, New York, USA

25. [Eyre 2000] J. Eyre, J. Bier (2000), Evolution of the DSP processors, Berkeley

Design Technology Inc. White Paper

26. [Farhang-Boroujeny 1998] B. Farhang-Boroujeny (1998), Adaptive Filters,

John Wiley & Sons Inc, Chichester, UK

27. [Fette 2005] G. Fette and J. Eggert (2005), Short Term Memory and Pattern

Matching with Simple Echo State Networks, Int. Conference on Artificial

Neural Networks, Warsaw, Poland

28. [Fischer 2003] J. Fischer (2003), The Recurrent IML-Network, J. Mira and J.R.

Alvarez (Eds.): IWANN 2003, LNCS 2686 (Springer-Verlag Berlin Heidelberg

2003), pp. 302-309.

29. [Glentis 1996] G. Glentis, K. Berberidis, S. Theodoridis (1996), Efficient Least

Squares Adaptive Algorithms for FIR Transversal Filtering, IEEE Signal

Processing Magazine

30. [Golub 1996] G. H. Golub, C. F. van Loan (1996), Matrix Computations 3rd

Ed., Johns Hopkins University Press, Baltimore, USA

31. [Haykin 1996] S. Haykin (1996), Adaptive Filter Theory, 3rd ed. Englewood

Cliffs, Prentice-Hall, New-Jersey, USA

32. [Haykin 1999a] S. Haykin (1999), Adaptive Filters, appeared as a sub-article

in: L. Atlas, P. Duhamel (1999), Recent developments in the core of signal

processing, IEEE Signal Processing Magazine Vol 16. No. 1

33. [Haykin 1999b] S. Haykin (1999), Neural Networks: A Comprehensive

Foundation, 2nd Ed, Prentice Hall, USA

34. [Hertzberg 2002] J. Hertzberg., H. Jaeger et. al (2002), Learning to Ground

Fact Symbols in Behavior-Based Robots, European Conference on Artificial

Intelligence, Lyon, France

35. [Higham 1996] N. J. Higham (1996), Accuracy and stability of numerical

algorithms, Society for Industrial and Applied Mathematics, Philedelphia, USA

36. [Horita 1999] E. Horita, Y. Miyanaga (1999), Numerically Stable RLS

Algorithms for Time-Varying Signals, Electronics and Communications in


8. REFERENCES

Japan Part 3. Vol. 82, No. 4

37. [Hsu 1982] F. Hsu (1982), Square-root Kalman filtering for high-speed data

received over fading dispersive HF Channels, IEEE Transactions of Information

Theory Vol. IT 28 No. 5

38. [IEEE 1985] IEEE Standard 754 (1985), IEEE Standard for Binary Floating-

Point Arithmetic

39. [Ishii 2004] K. Ishii, V. Becanovic, et al. (2004). Identification of Motion with

Echo State Network. Oceans 04, Kobe, Japan.

40. [Jaeger 2001] H. Jaeger (2001), The "echo state" approach to analyzing and

training recurrent neural networks. GMD Report 148. Sankt Augustin, Germany,

GMD.

41. [Jaeger 2002a] H. Jaeger (2002), A tutorial on training recurrent neural

networks, covering BPPT, RTRL, EKF and the "echo state network" approach.

GMD Report 159. Sankt Augustin, Germany, GMD.

42. [Jaeger 2002b] H. Jaeger (2002), Short Term Memory in Echo State Networks.


43. [Jaeger 2002c] H. Jaeger (2002), Adaptive nonlinear system identification with

echo state networks , S. Becker, S. Thrun, K. Obermayer (Eds): Advances in

Neural Information Processing Systems 15 (MIT Press, Cambridge, MA, 2003),

pp. 593-600

44. [Jaeger 2004] H. Jaeger, H. Haas (2004). Harnessing nonlinearity: predicting

chaotic systems and saving energy in wireless telecommunication. Science.

Bremen, Germany, International University Bremen.

45. [Jaeger 2005] H. Jaeger (2005), Reservoir Riddles: Suggestions for Echo State

Network Research (Extended Abstract), Proceedings of International Joint

Conference on Neural Networks, Montreal, Canada

46. [Jenkins 1996] K. Jenkins, A.W. Hull, et. al(1996), Advanced Concepts in

Adaptive Digital Signal Processing, Kluwer Academic Publishers,

Massachusetts, USA

47. [Johnson 1970] C.R. Johnson (1970), Positive definite matrices, American

Mathematical Monthly, Vol. 77, Issue 3

48. [Kalman 1960] R.E. Kalman (1960), On the general theory of control,


8. REFERENCES

Proceedings of the First IFAC Congress, Butterworth, London

49. [Kalman 1961] R.E. Kalman, R.S. Bucy (1961) , New results in linear filtering

and prediction theory, Transactions of ASME, Ser. D. J. Basic Eng. 95–107.

50. [Kuo 1996] S. M. Kuo , D. R. Morgan (1996), Active Noise Control Systems:

Algorithms and DSP Implementations, John Wiley and Sons Inc., New York,

USA

51. [Küçükemre 2005] A. U. Küçükemre (2005), Echo State Networks as Robust

Ball Trackers, Unpublished Research and Development Project Report, Master

of Science in Autonomous Systems, University of Applied Sciences Bonn-

Rhein-Sieg, Sankt Augustion, Germany

52. [Leung 1989] H. Leung, S. Haykin (1989), Stability of recursive QRD-LS

algorithms using finite-precision systolic array implementation, IEEE

Transactions on Signal Processing Vol. 37 No. 5

53. [Levin 1994] M. D. Levin, C. F. N. Cowan (1994), Some observations on

implementing various recursive least squares adaptive filtering algorithms,

Institution of Electrical Engineers, London, UK

54. [Liavas 1999] A. P. Liavas, P. A. Regalia (1999), On the Numerical Stability

and Accuracy of the Conventional Recursive Least Squares Algorithm, IEEE


55. [Liebald 2004] B. Liebald (2004), Exploration of effects of different network

topologies on the ESN signal cross-correlation matrix spectrum, Bachelor

Thesis, UIB, Bremen, Germany

56. [Ling 1984] F. Ling, J. G. Proakis (1984), Numerical Accuracy and Stability:

Two Problems of Adaptive Estimation Algorithms Caused by Round-Off Error,

IEEE International Conference on Acoustics, Speech, Signal Processing

57. [Lipschutz 1991] S. Lipschutz (1991), Invertible Matrices, Schaum's Outline of

Theory and Problems of Linear Algebra 2nd ed., McGraw-Hill, New York, USA

58. [Liu 1991] K. R. Liu, S. H. Hsieh, et. al. (1991), Dynamic Range, Stability, and

Fault-Tolerant Capability of Finite-Precision RLS Systolic Array Based on

Given Rotations, IEEE Transactions on Circuits and Systems Vol. 38 No. 6

59. [Ljung 1985] S. Ljung, L. Ljung (1985), Error propagation properties of

recursive least squares adaptation algorithms, A bridge between control science


8. REFERENCES

and technology, Vol. 2, Oxford and New York, Pergamon Press, p. 677-681.

60. [Maas 2002] Maas W., Natschläger T., Markram H. (2002), Real-Time

Computing Without Stable States: A New Framework for Neural Computation

Based on Perturbations, Institute for Theoretical Computer Science, Technische

Universität Graz

61. [Maierhofer 2003] L. Maierhofer (2003), Odometry Correction for the Keksi

Omnidrive using an Echo State Network, Computational Intelligence Project,

University of Technology Graz, Graz, Austria.

62. [Makula 2004] M. Makula, M. Cernansky (2004), Approaches Based on

Markovian Architectural Bias in Recurrent Neural Networks, P. Van Emde Boas

et al. (Eds.): SOFSEM 2004, LNCS 2932 (Springer-Verlag Berlin Heidelberg

2004), pp. 257–264.

63. [Matthews 1991] V. J. Matthews (1991), Adaptive Polynomial Filters, IEEE

Signal Processing Magazine

64. [Mayer 2004] N. M. Mayer and M. Browne (2004), Echo State Networks and

Self-Prediction, A.J. Ijspeert et al. (Eds.): BioADIT 2004, LNCS 3141

(Springer-Verlag Berlin Heidelberg 2004), pp. 40–48

65. [McWhirter 1983] J. G. McWhirter (1983), RLS minimization using a systolic

array, Proceedings of SPIE: Real Time Signal Processing Vol. 431, California,

USA

66. [Moonen 1990] M. Moonen, J. Vandewalle (1990), Recursive Least Squares

with stabilized inverse factorization, Signal Processing Vol 21. No. 1

67. [Oubbati 2005a] M. Oubbati, M. Schanz, et. al. (2005), Velocity Control of

Omnidirectional RoboCup Player with Recurrent Neural Networks, RoboCup

Symposium 2005, Osaka, Japan

68. [Oubbati 2005b] M. Oubbati, M. Schanz, et. al. (2005) Meta-learning for

Adaptive Identification of Nonlinear Dynamical Systems, Proceedings of the

Joint 20th IEEE International Symposium on Intelligent Control

69. [Ozturk 2005] M. C. Ozturk, J. C. Principe (2005), Computing with

Transiently Stable States, Proceedings of International Joint Conference on

Neural Networks, Montreal, Canada

70. [Palmieri 1988] F. Palmieri, C.G. Boncelet Junior (1988), A Class of nonlinear


8. REFERENCES

adaptive filters, Proceedings of IEEE Internation Conference on Acoustics,

Speech and Signal Processing, New York, USA

71. [Plöger 2003] P. G. Plöger, A. Arghir, et al. (2003). Echo State Networks for

Mobile Robot Modelling and Control. Robocup 2003, Padua, Italy.

72. [Plöger 2004] P.G. Plöger (2004), Echo State Nets (Lecture 6), Mathematics for

Roboticists Lecture Notes, University of Applied Sciences Bonn-Rhein-

Sieg,Sankt Augustin, Germany

73. [Prokhorov 2005] D. Prokhorov (2005), Echo State Networks: Appeal and

Challenges, Proceedings of International Joint Conference on Neural Networks,

Montreal, Canada

74. [RTC 2004] RTC Magazine (2004), New Applications Benefit from Floating-

Point DSP Accuracy, (December 2005)

http://www.rtcmagazine.com/home/article.php?id=100255

75. [Rao 2005] Y. N. Rao, S. Kim, et. al. (2005), Learning Mappings in Brain

Machine Interfaces with Echo State Networks, IEEE ICASSP 2005,

Philadelphia, USA.

76. [Salmen 2005] M. Salmen, P. G. Plöger (2005), Echo State Networks used for

Motor Control, IEEE ICRA´05, Barcelona, Spain.

77. [Sayed 1994] A. H Sayed, T. Kailath (1994), A State-Space Approach to

Adaptive RLS Filtering, IEEE Signal Processing Magazine

78. [Slock 1991] D. T. M. Slock, T. Kailath (1991), Numerically stable fast

transversal filters for recursive least squares adaptive filtering, IEEE


79. [Steil 2004] J.J. Steil (2004), Backpropagation-Decorrelation: Online recurrent

learning with O(N) complexity, Proceedings of IJCNN 2004

80. [Steil 2005] J.J. Steil (2005), Stability of backpropagation-decorrelation

efficient O(N) recurrent learning, Proceedings of ESANN 2005

81. [TI 2002] Texas Instruments (2002), Product Bulletin: TMS320C67x™

Floating-Point DSP Generation, http://focus.ti.com/lit/ml/sprt235a/sprt235a.pdf

(December 2005)

82. [TI 2005] Texas Instruments (2005), C67x Floating Point DSPs from Texas

Instruments,


8. REFERENCES

http://focus.ti.com/paramsearch/docs/parametricsearch.tsp?family=dsp&sectionI

d=2&tabId=135&familyId=327&paramCriteria=no (December 2005)

83. [TNO 1990] Institute for Perception-TNO (1990), Speech Babble, Netherlands,

URL: http://spib.rice.edu/spib/data/signals/noise/babble.html (December 2005)

84. [Takens 1981] F. Takens (1981), Detecting Strange Attractors in Fluid

Turbulence, D. Rand and L.S. Young (Eds.): Dynamical Systems and

Turbulence, Springer, Berlin, West Germany, 1981

85. [Van der Zant 2004] T. Van der Zant, V. Becanovic, et al. (2004). Finding

good echo state networks to control an underwater robot using evolutionary

computations. The 5th Symposium on Intelligent Autonomous Vehicles 2004,

Lisbon, Portugal.

86. [Wang 2005] S. Wang, C. Wei (2005), Harnessing Non-linearity by Sigmoid-

wavelet Hybrid Echo State Networks (SWHESN)

87. [Werner 1983] J. J. Werner (1983), Control of drift for fractionally spaced

equalizers, US Patent No. 4384355

88. [Widrow 1960] B. Widrow, M.E Hoff (1960), Adaptive Switching Circuits,

IRE WESCON Convention Record, 1960

89. [Widrow 1975] B. Widrow, J. R. Glower, et. al. (1975), Adaptive Noise

Canceling: Principles and Applications, Proceedings of IEEE Vol. 63 No. 12

90. [Wiener 1949] N. Wiener (1949), Extrapolation, Interpolation, and Smoothing

of Stationary Time Series with Engineering Applications, Wiley, USA

91. [Wierstra 2005] D. Wierstra, F. J. Gomez, et. al. (2005), Modeling Systems

with Internal State using Evolino, GECCO’05, June 25–29, 2005, Washington

D.C., USA.

92. [Xi 2005] J. Xi, Z. Shi, et. al. (2005), Analyzing the state space property of echo

state networks for chaotic system prediction, Proceedings of International Joint

Conference on Neural Networks, Montreal, Canada

93. [Xu 2005] D. Xu, J. Lan, J. C. Principe (2005), Direct Adaptive Control: An

Echo State Network and Genetic Algorithm Approach, Proceedings of

International Joint Conference on Neural Networks, Montreal, Canada

94. [Yang 1992] B. Yang, J. F. Böhme (1992), Rotation-Based RLS Algorithms:

Unified Derivations Numerical Properties, and Parallel Implementations, IEEE


8. REFERENCES


95. [Yang 1994] B. Yang (1994), A note on error propagation analysis of recursive

least squares algorithms, IEEE Transactions on Signal Processing Vol. 42

96. [Zahm 1973] C. L. Zahm (1973), Applications of adaptive arrays to supress

strong jammers in the presence of weak signals, IEEE Transaction on Aerospace

Electronic Systems Vol. AES 9


APPENDIX A: ECHO STATE NETWORKS STATE OF THE ART


In this section, we will give an state of the art of ESN papers (that are of our

knowledge). Certainly, before going on to details of any paper listed here, one should

firstly digest the main publications related to ESN theory which are:

• [Jaeger 2001] H. Jaeger, (2001). The "echo state" approach to analyzing and

training recurrent neural networks. GMD Report 148. Sankt Augustin, Germany,

GMD.

• [Jaeger 2002a] H. Jaeger, (2002). A tutorial on training recurrent neural networks,

covering BPPT, RTRL, EKF and the "echo state network" approach. GMD Report

159. Sankt Augustin, Germany, GMD.

• [Jaeger 2002b] H. Jaeger, (2002). Short Term Memory in Echo State Networks.


[Jaeger 2001] is the publication where the ESN idea is firstly described. It includes

detailed theory and some useful learning cases. Its follow up [Jaeger 2002a], is a RNN

training tutorial. It covers traditional RNN training methods like Back-propagation

Through Time, Real Time Recurrent Learning and Extended Kalman Filtering. Most of

the concentration is of course given to the ESNs. The author gives out a detailed

explanation of the steps that should be taken while training an ESN including some

tricks to reach the solution in a faster and more successful way. In that aspect, this paper

is more useful than [Jaeger 2001] for those who really want to implement and use ESNs.

The last publication [Jaeger 2002b], investigates the short term memory (STM) capacity

of ESNs. In addition to the deep theoretical information given in the paper, different

practical examples are included to show how the STM capacity of ESNs can be

exploited for dynamic pattern recognition and stochastic modeling tasks.

In the rest of this section, we tried to give a summarized list of other ESN and ESN

related publications. Our aim in doing so is to provide a starting point for readers who

want to know more about the ESNs. Readers can continue with the real publications if



our summaries arise a specific interest in them. Our alphabetical list of summaries is as

follows:

[Buehner 2006]

In [Buehner 2006], a newer sufficient condition for ESP is proposed. Authors' claim is

that this new bound is tighter than the original one and guarantees asymptotic stability

(i.e: Guaranteed ESP for all inputs). No new design methodology for generating an ESN

is proposed, what is given can only be used to test the global asymptotic stability of an

ESN at hand.

[Cernansky 2005]

Based on the classical ESN model, authors proposed a simpler feed-forward version of

the networks. The only recurrent connection is output to reservoir connections, that is

the W back matrix. The DR is built completely in a feed-forward way. The success of

the new method is shown through simulations in which a well known benchmark task,

prediction of chaotic Mackey-Glass Series is also included.

[Dai 2004]

Dai applied ESNs for modeling genetic regulatory systems. Gene expression is complex

process which is regulated at several stages in synthesis of proteins. Probabilistic

(Bayesian) or Logical (Boolean) methods provide a simple and intuitive way in

modeling genetic regulatory systems however they leave the dynamical aspects of these

processes implicit or coarse. With this motivation, an ESN is used as an identification

tool to model genetic regulatory system while covering those dynamic characteristics.

The experimental studies shows the new genetic regulatory system modeled by an ESN

and trained using the fluorescent density of reporter protein data has a satisfactory

performance in modeling the synthetic oscillatory network of transcriptional regulatory

of Escherichia coli cells.

[Ding 2005]

In [Ding 2005], authors proposed ESN-Tree's. It is novel hybrid model which combines



symbolic (Decision Tree) and non-symbolic (ESN) learning models to exploit the best

characteristics of both methods under one structure. The ESN-Tree is applied to

network intrusion detection problem. Depending on the experimental results, ESN-

Tree's proved to be very useful for the network intrusion detection applications. In most

of the cases, this new approach over-performed conventional methods.

[Erhan 2004]

In his bachelors thesis Erhan tried to improve the network performance by a post-

processing the output of an ESN using Real Time Recurrent Learning Algorithm.

Results indicate that performance is increased when this type of post processing is

applied.

[Fette 2005]

Unrestricted random recurrent topology of the DR when combined with nonlinear units

makes it a almost impossible to analyze internal dynamics of ESNs. In [Fette 2005],

authors proposed a simplified version of the standard ESNs and studied it in order to

provide insights on internal working mechanisms of more complex models.

[Fischer 2003]

A new type of recurrent neural network called Recurrent Infinite Middle Layer Network

(R-IML-N) is proposed. Idea is strongly inspired by the ESN theory. The R-IML-N can

handle complex learning tasks in an efficient manner. Moreover, the information flow in

the middle layer is easier to investigate when compared to other RNN topologies.

[Hertzberg 2002]

Symbolic situation model (i.e: kind of a world model) of a behavior based robot should

be updated continuously in order to ensure activation of correct behaviors. In this work,

an ESN is taught to ground fact symbols. Early simulations indicated that ESNs provide

computationally efficient yet powerful enough solution for the problem.



[Ishii 2004] and [Van der Zant 2004]

Based on the fact that generation of good ESNs depend on finding good parameters,

some expertise and also some luck, in [Ishii 2004] and [Van der Zant 2004] authors

tried to automize the process by replacing this hand-tuning with a double evolutionary

computation. They succeeded in finding good networks for their experimental setup in

which ESNs are used for motion identifications of an underwater robot. However, total

computation used by the evolutionary computation is too long even for the small

networks used during the tests. Therefore, it is not yet clear whether such an approach

will also scale for high dimensional networks.

[Jaeger 2002c]

Identification of a time varying 10th order NARMA system is done by ESNs in an

online fashion using the conventional RLS algorithm. This is the first work which

demonstrates the online learning of ESNs. By that it also constitutes our motivation for

this thesis. Identification of the same system but a stationary version, using the offline

algorithm is also presented.

[Jaeger 2004]

This article is appeared in the well known science magazine “Science” and is an

important hallmark indicating the method is growing very popular. Prediction of a

benchmark time series Mackey-Glass System is presented. Using the classical offline

learning algorithm ESNs performed better than the classical methods by a factor of

seven hundred. Additionally, using a novel refined version of the method which is

introduced for the first time, the ESN can even perform 2400 times better than other

methods. As a potential engineering application communication channel equalization

via ESNs is also demonstrated. Results were again impressive. The signal error rate is

improved by two orders of magnitude with respect to other linear and nonlinear

methods.



[Jaeger 2005]

In our personal view, this recent contribution is very important for the future of ESN

research. Although, the method often works nicely, sometimes we observe poor ESN

behavior. Poor performance is surely not acceptable, however what is more important is

that we cannot really understand the underlying reasons for this kind of phenomenon. In

this correspondence, Jaeger lists a number of open points regarding ESNs and suggests

related researcher direction which may lead to important answers. “The common theme

is to understand the reservoir dynamics in terms of a dynamical representation of the

task's input signals.” We advice this paper to everybody who want to do valuable

research in ESN area

[Küçükemre 2005]

This author used ESNs in order to improve performance of a Robocup mid-size league

ball tracker. The tracker used for the robot had good performance in terms of accuracy.

However during the times where the ball was unseen or occluded by other objects or if

some false positives were present, prediction performance was unacceptable which has

a negative effect on the robot behaviors therefore an ESN is trained as a replacement.

Simulation results indicate that the ESN Tracker can compete with the standard tracker

equally well in terms of accuracy. Additionally, at times where our trackers completely

lose the ball, ESNs can predict the ball trajectory with reasonable precision.

[Liebald 2004]

Due to certain properties of ESNs, use of simple algorithms for online adaptation is not

feasible for many applications. Liebald investigated if some special topologies could

improve the online learning ability of ESNs. In the end, the tested topologies did not

lead to an improvement over the standard randomly created ESNs or even performed

worse.

[Maierhofer 2003]

Omni-directional drives are now very popular amongst Robocup Mid-size League

teams. The most prominent problem with those type of drives is that the position



information cannot be calculated reliably from the odometry values due to high slippage

rate of the wheels. In this work, the author tried to implement a correction scheme based

on ESNs. At the end, correction performance was below expectations. Authors related

this results to the poor training data used for learning.

[Makula 2004]

The main theme of this work is to show usefulness of RNNs which have the Markovian

architectural bias property. In that aspect, ESN performance is compared to first order

simple Elman networks [Elman 1990] in order to investigate pros and cons of both

methods. Both of these methods make use of the architectural bias. Two different test

cases are considered. In the first example, the two neural network types are trained for

predicting a periodic sequence. It is the House of the Rising Sun Example which is also

given in [Jaeger 2001]. Second, example is again a neural network prediction task in

which the system to be learned is a long symbolic sequence of quantized activity

changes of a laser in the chaotic regime.

[Mayer 2004]

Mayer and Brown introduced a method to adapt the DR of an ESN without changing the

interconnection weights of the neurons. Idea is to use a second second network which is

trained to predict the internal state of the main ESN. Then the predicted states is mixed

with the real activation. This results in an ESN whose activation is driven partially by

the input and partially by the predicted network model. Early simulations reveal that this

kind of a self prediction scheme may improve the network performance.

[Oubbati 2005a]

In [Oubbati 2005a] the authors presented yet another dynamic velocity controller for

Robocup mid-size league robots but this time for omni-directional drives. Experimental

results are acceptable in spite of the poor data set used for training. The ESN-controller

approach is advantageous in the sense it does not require prior knowledge on the robot

model. This property makes it useful for practical situations, where the exact knowledge

of the robot parameters is almost unavailable.



[Oubbati 2005b]

In this work ESNs are made to work on time-varying environments using a notion

called meta-learning. In contrast to the conventional adaptive systems where weights are

updated on-the fly, with meta learning weights of the ESN is kept fixed all the time.

Successful operation is achieved by extensive training the network in prior under all

possible conditions of the signal to be learned. Then the network can adapt to global

changes only by changing its internal state.

[Ozturk 2005]

As the abstract of this paper clearly describes; “Stability is an essential constraint in the

design of linear dynamical systems. Similar stability restrictions on nonlinear

dynamical systems, such as ESNs, have been enforced in order to use them for reliable

computation. In this paper we will introduce a novel computational mode for nonlinear

systems with sigmoidal nonlinearity, which does not require global stability." The

model used here is an ESN. "In this mode, although the autonomous system is unstable,

the input signal forces the system dynamics to become transiently stable. We

demonstrate with a function approximation experiment that the transiently stable

system can still do useful computation. We explain the principles of computation with

the stability of local dynamics obtained from linearization of the system at the operating

point."

[Plöger 2003]

After presenting a detailed overview of ESN theory, authors used ESNs for both system

modeling and control. Context of the application is a differential drive robocup robot.

Using the desired velocity and the odometry values an ESN is employed for motor

control. Another ESN gets the pulse width modulation signals, which are used for motor

control, as input and acts as an inverse model of the motor by outputting the odometry

values. Simulation results were successful in both modeling and control. Additionally,

an improved controller is generated. Simulation results are implemented on a real robot

in [Salmen 2004] as a proceeding work.



[Prokhorov 2005]

In this paper the ESN idea is discussed in broader context of RNN applications.

Advantages and disadvantages of the method is given from a practical point of view.

The author highlighted important challenges, which are on the way, for real world

applications of the ESNs. In that aspect, interesting research directions are pointed out.

From our personal point of view, this paper is another important contribution, as [Jaeger

2005] regarding future of the ESN research.

[Rao 2005]

In this article ESNs are utilized for realization of Brain Machine Interfaces (BMI).

BMIs utilize linear or nonlinear models to map neuronal activity to the associated

behavior. (i.e 2D or 3D hand position of a primate.) Linear models are insufficient due

to high disparity of the I/O dimensions therefore exhibit poor generalization. Multilayer

Perceptrons provide a solution for the problem, however their training is too much time

consuming which can prevent of their practical use. In that aspect, an ESN trained with

a stochastic gradient algorithm, the Sparse LMS, is presented as a quickly trainable yet

performant solution for BMI.

[Salmen 2005]

In this work ESNs are used for motor control of a differential drive Robocup mid-size

league robot. This work is original in that for the first time successful simulation results

are moved one step further and implemented on a physical platform, as an embedded

hardware. ESN performed better than a classical PID controller under various control

theory performance measures.

[Wang 2005]

ESNs are famous for their success especially for chaotic system prediction. In [Wang

2005] the authors further improved the standard ESN performance by replacing some

sigmoidal units in the DR by their tuned wavelet based versions. This approach

increased the memory capacity of the reservoir while retaining its richness in

nonlinearity.



[Wierstra 2005]

Some authors also criticize ESNs. In [Wierstra 2005] it is stated that despite the fact that

ESN technique is the current champion on Mackey-Glass time series benchmark, it may

fail to succeed in some seemingly simple tasks like generation of multiple superimposed

sine waves or simple context sensitive grammars. A second criticism is that due to

relatively large number of units used, ESNs are prone to over-fitting. We do not agree to

the author in both of the points. We did not tried with context sensitive grammars,

however in our experiment we saw that author's example of learning superimposed

multiple sine wave generation is successfully handled by the ESNs. Secondly, there is

restriction on the dynamic reservoir size that should be taken in to account while

training for a specific task. Over-fitting can be prevented by making a wise choice of the

network size. Teacher signal also have an impact on over-fitting. Additionally,

precautions for over-fitting were well explained in [Jaeger 2002a], three years before

the paper [Wierstra 2005] is published. We believe the reason for such negative

criticisms is due to the author's (possibly) limited experience with the ESNs.

[Xi 2005]

ESN is a realization of neural state reconstruction for chaotic system prediction, in

which the reconstructed state variable is coming from the internal neurons' activation,

rather than the delay vector obtained from delay coordinate reconstruction as in the case

of Takens Embedding Theorem [Takens 1981] based models. In this paper authors first

used ESNs for identification and prediction of Chua's circuit. ESNs could learn the data

well and attained the desired prediction performance. Later they made a detailed

analysis of this trained network under state space framework. In authors' words, "it is

shown that the ESN is a non-minimum state space realization of the target time series,

and the initial state can be freely chosen in the training process, and in the phase of

prediction, ESN needs to know where the prediction begins by being set a proper initial

state through a process of teacher forcing."



[Xu 2005]

In [Xu 2005], authors presented direct adaptive approach for design of nonlinear system

controllers. In contrast to the classical approach, the need for system identification is

eliminated under direct adaptive control approach. An ESN whose output connections

are optimized by genetic algorithms used to implement the idea. Simulation results

reveal that the new approach can do very nice control and is computationally efficient.


APPENDIX B: COMPUTATIONAL REQUIREMENTS ANALYSIS OF THE SELECTED ALGORITHMS


In this section we will do a simple computational requirements analysis of three

important algorithms mentioned as candidates for online adaptation of ESNs in the

previous chapters. They are the ESN-LMS, ESN-SCRLS2, ESN-IQ-RLS. ESN-LMS is

chosen because it is the simplicity champion among other algorithms and have a good

potential for future if eigenvalue spread problems of ESNs can be solved. ESN-SCRLS2

is chosen due to the fact that it is at the same time relatively simple, robust and yet

accurate. Lastly, ESN-IQ-RLS is chosen because of its very good numerical accuracy

and stability properties.

During this analysis, we again assume an ESN in its most general form as:





Floating point operation counts for any given vector, matrix operation used in this

analysis are based on the book [Golub 1996]. Abbreviations flop, fladd, flmult denote

floating point operation, addition and multiplication respectively.

We begin our computational complexity analysis firstly by investigating the evaluation

and exploitation equations which appear as a common term in all of the algorithms.

Firstly lets remember the evaluation of internal states at time t which is given as:

x t = f activation W x t−1 W in. ut W back y t−1

This equation is composed of the following steps:

1. W x t−1 : Multiplication of a N×N matrix with a N×1 vector



requires 2N 2 flop=N 2 fladdN 2 flmult

2. W in.u t : Multiplication of a N×N matrix with a N×1 vector requires

2NK flop=NK fladdNK flmult

3. W back y t−1 : Multiplication of a N×L matrix with a L×1 vector

requires 2NL flop=NL fladdNL flmult

4. step1 step2 step3 : Addition of three N×1 vectors requires2N flop=2N fladd

5. f activation step4 : Apply f activation to a N×1 vector requires N f activation

for which real number of flops required for computation is strongly function

type and implementation dependent.

All in all we have following number of floating point operations for evaluation:

Computational Complexity of Evaluation :

Total=2NNKL1 flop N f activation

Total=N NKL2 fladd N NKL flmult N f activation

Now go on with the exploitation equation:

y t = f activation W out t−1 concat x t ,u t , y t−1

If we divide it to its sub-steps, we get:

6. W out t−1 concat x t , u t , y t−1 : Multiplication of a

L×NKL matrix with a NKL×1 vector requires

2 lnLKL2 flop=lnLKL2 fladdlnLKL2 flmult .

7. f activation step7 : Applying f activation to a L×1 vector requires requires

L f activation

In total for the exploitation we need:



Computational Complexity of Exploitation :

Total=2LNKL flop L f activation

Total=LNKL fladd L NKL flmult L f activation

These 7 steps is common in all of ESN-LMS, ESN-SCRLS2 and ESN-IQ-RLS

therefore will be very useful in the rest of out analysis.

Now we follow on to the algorithms. Our first algorithm is ESN-LMS. Main body of the

ESN-LMS algorithm is as follows:

ESN−LMS Main Body

for t = 1,2 , ... , T

a. x t = f activation W x t−1 W in. u t W back y t−1

b. t = concat x t , u t , y t−1

c. y t = f activation W out t−1 t

d. t = f activation−1 d teacht − W out t−1 t

e. W out t =W out t−1 [t t T ]

Notice that the steps a to c are identical to evaluation and exploitation steps of which we

know the total number of operations needed. Step d is the calculation of the a-posteriori

estimation error. The term W out t−1 t is already calculated during evaluation

therefore here we only have a subtraction of two L×1 vectors and application of

f activation to this result. We need L fladdL f activation−1 to realize these calculations.

Note that we are to denote floating point subtraction we used fladd operation since it is

only a special case of fladd and takes same amount of time to be completed in most

hardware. Step e is composed of two sub-parts. Outer product of the a-posteriori error

with the concatenated ESN states vector, t . We need L NKL flmult and

L NKL fladd for calculation of the resulting matrix. Result is then multiplied

by the learning rate, for which we need an additional L NKL flmult .

These operations give out a L×NKL matrix. In the second sub-step, this



matrix is added to another matrix of the same dimensions, W out t−1 for which

L NKL fladd suffices.

Totally, we need 2L NKL flmultLNKL fladd to compute the step e.

Summing up everything, total computational complexity for ESN-LMS is found as:

Computational Complexity of ESN−LMS:

Total = [ 2NNK1 L8N6K6L1] flops ...... NL f activation L f activation

−1

Total = [N NK2 L4N2K3L1 ] fladd ......[N3LNKL ] flmult NL f activation L f activation

−1

Our next algorithm is ESN-SCRLS2. Its main body is given by the following set of

equations:

ESN−SCRLS2 Main Body

for t = 1,2 , ... , T





e. lt = P t−1 t

f. k t =t

t T t

g. P t = Tri −1 [P t−1 − k t tT ]

h. if t mod = 0 then Diag P t , end

i. W out t =W out t−1 t k t T

During the analysis of ESN-SCRLS we assume that = 1 , in other words we that



assume a constant term is added to the diagonal of inverse correlation matrix at every

time step.

Steps a to d are already considered in ESN-LMS algorithm which require total number

of operations needed for evaluation and exploitation plus a L fladdL f activation−1 . In

step e, we multiply the autocorrelation matrix P t by the vector t which has

dimensions NKL×NKL and NKL×1 respectively. This

calculation needs NKL2 flmult and NKL2 fladd . Next, in step f, we

have to calculate the gain vector, k t . This step requires NKL1 fladd ,

2 NKL flmult and 1 fldiv . In step g, we do the inverse correlation matrix

update. The term k t t T is an outer product thus requires NKL2 flmult

and NKL2 fladd . Next we have to subtract this term from the P t−1 . To

achieve this we need to do NKL2 fladd . Finally we do NKL2 flmult

to multiply the inverse of forgetting rate, −1 , with the resulting matrix of the

previous subtraction. All together we have 2 NKL2 flmult and

2 NKL2 fladd for step g. However, due to the symmetric structure of the

inverse correlation matrix, we can use a special operator Tri ... which was

introduced in [Yang 1994]. Tri ... operator exploits the symmetry and does the

computations only in lower or upper triangular part of the correlation matrix and copies

the results to other part. By this way we save almost half of the computation required

for step g if it is naively calculated. So we can say that number of operations needed for

step g is only NKL2 flmult and NKL2 fladd . Step h is the addition of

the constant to each element lying in the diagonal of the inverse correlation matrix.

This step needs NKL fladd . In the last step we have to do the W out update.

Firstly, we have to calculate the outer product of the a posteriori error and the gain

vector, therefore L NKL flmult and L NKL fladd are needed. The

resulting matrix should be added to W out t−1 which introduces another

L NKL fladd . All together, the computational complexity of the ESN-

SCRLS2 can be given as:



Computational Complexity of ESN−SCRLS2 :

Total = [4NKL2 2N5L2NKL 4N2K3L2] flops ...... NL f activation L f activation

−1

Total = [2NKL2 N3LNKL 4N2K3L1] fladd ...... [2NKL2 N2L2NKL] flmult 1 fldiv ...... NL f activation L f activation

−1

Our last algorithm to be analyzed here is the ESN-IQR-RLS. Main body of the

algorithm consists of following steps:

ESN−IQR−RLS Main Body

for t = 1,2 , ... ,T





e. [ 1 −1 /2 t T P1/2t−10 −1 /2 P1/2t−1 ] t = [ ϱ

−1 /2t 0T

k t ϱ−1/2t P1 /2t ]f. W out t =W out t−1 t k t T

Before going on to the details, we should briefly talk about the t operator. It is a

unitary rotation operator which zeros the elements of −1/2 t T P1 /2 t−1 term by a

series of orthogonal rotations. It can be implemented using various techniques like

Householder Reflections, Givens Rotation or Hyperbolic Rotations. Among those the

Givens Rotation is the most popular method among the adaptive filtering community

therefore we also follow the common trend for our ESN-IQR-RLS implementation.

Steps a to d are identical to ESN-SCRLS2 and require total number of operations

needed for evaluation and exploitation plus a L fladdL f activation−1 . In step e, we have



to form the pre-array and then apply Givens Rotations to get the post array. Calculation

of −1/2 P1 /2t−1 requires NKL2 flmult . In the next step, we need to find

−1/2 t T P1 /2t−1 . Using the previously calculated −1/2 P1 /2t−1 , we can get

the result by a vector-matrix multiplication at a cost of NKL2 flmult and

NKL2 fladd . Then, we have to apply the unitary rotation operator via Given

Rotations. Our Givens Rotation implementation for ESN-IQ-RLS is an unoptimized

version which is based on the directives given in [Sayed 1994]. Total operation count to

realize the full orthogonal rotation is 4 NKL26NKL flmult ,

2 NKL23NKL fladd and 2 NKL fldiv . A faster version of

Givens Rotation is also available whose details are not covered in this thesis but can be

found in the book [Golub 1996].

After the rotation is applied, we can extract the k t from the post-array by dividing

k t ϱ−1 /2t by ϱ−1 /2t in NKL fldiv . The proceeding step f which does

the W out update is identical to that of ESN-SCRLS2. It requires

L NKL flmult and 2L NKL fladd in total. Using the above, total

complexity of the ESN-IQ-RLS is:

Computational Complexity of ESN− IQ−RLS :

Total = [9 NKL2 2N5L13NKL N2L] flops ...... NL f activation L f activation

−1

Total = [3 NKL2 N3L3NKL N2L] fladd ...... [6NKL2 N2L7NKL] flmult ...... 3NKL fldiv NL f activation L f activation

−1

In this section we have given the computational complexities of our selected algorithms

which are advantageous over others in some aspects. If we consider the ESN-LMS,

ESN-SCRLS2 and ESN-IQR-RLS as our benchmarks, we can sort all algorithms with

respect to complexity in ascending order as follows:



ESN-LMS < ESN-BPDC < ESN-SCRLS < ESN-SCRLS2 < ESN-Ardalan-RLS <

ESN-CRLS < ESN-RLSP < ESN-IQR-RLS < ESN-QR-RLS19.

Although what we have given here is an overly simplified computational requirements

analysis which only covers algorithmic complexity and leaves out other points like

memory requirement or implementation details (suggestions), it may still be of some

help for design of possible future applications.

The most important point to be careful is that all of the algorithms have at least

O N 2 complexity20. Therefore, while designing time critical applications where fast

response is important, DR size should be carefully selected because of the squared

growth of algorithms with respect to this parameter. A more detailed investigation on

embedded implementation of ESNs for various adaptive filtering application with

different timing needs will be done in a future study.

19 When the back-substitution step for W out computation is included in the overall complexity of ESN-QR-RLS.

20 Notice that the complexity of O N 2 also holds for the ESN-LMS due to the matrix-vector multiplication of W and x t which occurs during the evaluation step, although complexity of LMS is only O N when used for online weight adaptation of an ordinary filter. (i.e: Transversal FIR filters)


Master Thesis Echo State Networks for Adaptive Filtering

Documents