University of Applied Sciences
Department of Computer Science
Master Thesis
Echo State Networks for Adaptive Filtering
Ali Uygar Küçükemre
A thesis submitted to the University of Applied Sciences Bonn-Rhein-Sieg
for the degree of Master of Science in Autonomous Systems
Referee and Tutor: 1st Prof. Dr. Paul-Gerhard Plöger
Referee: 2nd ................................................
External Referee: 3rd ................................................
Submitted: 30.04.2006
I, the undersigned, declare that this work has not previously been submitted to this or any other University, and that unless otherwise stated, it is entirely my own work.
LIST OF FIGURES
Figure 1: Block diagram of a basic filter....................................................................................................... 1
Figure 2: Block diagram of the adaptive modelling setup.............................................................................5
Figure 3: Block diagram of the adaptive inverse modeling setup................................................................. 6
Figure 4: Block diagram of the adaptive prediction setup.............................................................................6
Figure 5: Block diagram of the adaptive interference canceling setup..........................................................7
Figure 6: Structure of a neuron [Plöger 2004]............................................................................................. 12
Figure 7: Schematic description of a FNN and a RNN [Jaeger 2002a].......................................................12
Figure 8: An overview of the general Echo State Network structure. [Plöger 2004]..................................16
Figure 9: Block diagram of an ESN when used as an adaptive system identifier. (...) ...............................51
Figure 10: First 250 samples of the second order nonlinear dynamical system.......................................... 52
Figure 11: Effects of the harsh parameter variations on the behavior of 10th order NARMA System (...) .................................................................................................................................. 53
Figure 12: Block diagram of an ESN-ANC. (...) ........................................................................................ 54
Figure 13: Results of the test case 1: fRate < 1 (λ = 0.999) and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System.....................................................................................................58
Figure 14: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System...................................................................................................................... 59
Figure 15: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 2nd Order Nonlinear Dynamical System...................................................................................................................... 60
Figure 16: Stepwise squared error graph of the ESN-QR-RLS with fRate = 1 during identification of the second order nonlinear dynamical system for five million samples. (...) .................................................. 61
Figure 17: Results of the test case 1: fRate < 1 (λ = 0.999) and 1000000 (One Million) Samples for the 10th Order NARMA System....................................................................................................................... 63
Figure 18: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 10th Order NARMA System..........................................................................................................................................64
Figure 19: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 10th Order NARMA System..........................................................................................................................................65
Figure 20: Results of the Test Case 1: fRate < 1 (λ = 0.9999) and 4500000 Samples for the Adaptive Noise Cancellation................................................................................................................................................. 67
Figure 21: Results of the Test Case 2: fRate = 1 and 4500000 Samples for the Adaptive Noise Cancellation................................................................................................................................................. 68
Figure 22: Performance of different adaptive filtering methods for identification of the 2nd order nonlinear dynamical system.........................................................................................................................80
Figure 23: Comparison of the ESN output versus the time varying 2nd Order Nonlinear Dynamical System in the last 100 iterations of the experiment..................................................................................... 81
Figure 24: Performance of different adaptive filtering methods for identification of the 10th order NARMA system...........................................................................................................................................82
Figure 25: Comparison of the ESN output versus the time varying 10th Order NARMA system in the last 100 iterations of the experiment.................................................................................................................. 83
Figure 26: Stepwise squared error graph of the ESN-IQR-RLS, observed during the identification of the 10th Order NARMA System. (...) ...............................................................................................................84
LIST OF TABLES
Table 1: Algorithm dependent parameters used for the identification of the 2nd order nonlinear dynamical system............................................................................................................... 57
Table 2: Results of the test case 1: fRate < 1 (λ = 0.999) and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System...........................................................................57
Table 3: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System...................................................................................................................... 58
Table 4: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 2nd Order Nonlinear Dynamical System...................................................................................................................... 59
Table 5: Algorithm dependent parameters used for the identification of the 10th Order NARMA System....................................................................................................................... 62
Table 6: Results of the test case 1: fRate < 1 (λ = 0.999) and 1000000 (One Million) Samples for the 10th Order NARMA System............................................................................................................62
Table 7: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 10th Order NARMA System..........................................................................................................................................63
Table 8: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 10th Order NARMA System..........................................................................................................................................64
Table 9: Algorithm dependent parameters used for the Adaptive Noise Cancellation................................66
Table 10: Results of the Test Case 1: fRate < 1 (λ = 0.9999) and 4500000 Samples for the Adaptive Noise Cancellation......................................................................................................................66
Table 11: Results of the Test Case 2: fRate = 1 and 4500000 Samples for the Adaptive Noise Cancellation....................................................................................................................................... 67
Table 12: Performance comparison of different adaptive filtering methods for identification of the 2nd order nonlinear dynamical system........................................................................80
Table 13: Performance comparison of different adaptive filtering methods for identification of the 10th order NARMA system................................................................................................................................. 82
LIST OF ALGORITHMS
Algorithm 1: Generation of an RNN with ESP........................................................................................... 18
Algorithm 2: Supervised Training of ESN ................................................................................................. 21
LIST OF DEFINITIONS
Definition 1: Echo State Property................................................................................................................ 17
Definition 2: Linear Transversal Filter ....................................................................................................... 76
Definition 3: Second Order Truncated Volterra Filter.................................................................................77
LIST OF PROPOSITIONS
Proposition 1: Sufficient Conditions for ESP.............................................................................................. 17
LIST OF LEMMAS
Lemma 1: Matrix Inversion Lemma............................................................................................................ 26
LIST OF ABBREVIATIONS
ANC: Adaptive Noise Canceler
BIBO: Bounded-Input-Bounded-Output
BMI: Brain Machine Interface
BPDC: Backpropagation-Decorrelation
CRLS: Conventional Recursive Least Squares
CVA: Conventional Recursive Least Squares Variant Algorithms
DR: Dynamical Reservoir
DSP: Digital Signal Processor
EQR-RLS: Extended QR Decomposition Based Recursive Least Squares
ESN-ANC: Echo State Network Adaptive Noise Canceler
ESN-Ardalan-RLS: Echo State Network Ardalan Recursive Least Squares
ESN-BPDC: Echo State Network Backpropagation-Decorrelation
ESN-CRLS: Echo State Network Conventional Recursive Least Squares
ESN-IQR-RLS: Echo State Network Inverse QR Decomposition Based Recursive Least Squares
ESN-LMS: Echo State Network Least Mean Squares
ESN-QR-RLS: Echo State Network QR Decomposition Based Recursive Least Squares
ESN-RLSP: Echo State Network Recursive Least Squares Prewhitening
ESN-SCRLS2: Echo State Network Symmetric Conventional Recursive Least Squares 2
ESN-SCRLS: Echo State Network Symmetric Conventional Recursive Least Squares
ESN: Echo State Network
ESP: Echo State Property
FIR: Finite-Duration Impulse Response
fladd: floating point addition
fldiv: floating point division
flmult: floating point multiplication
flop: floating point operation
FNN: Feed-forward Neural Network
IIR: Infinite-Duration Impulse Response
IJCNN: International Joint Conference on Neural Networks
IQR-RLS: Inverse QR Decomposition Based Recursive Least Squares
LMS: Least Mean Squares
LSM: Liquid State Machines
LTA: Linear Time Algorithms
MSE: Mean Squared Error
NARMA: Nonlinear Auto Regressive Moving Average
NMSE: Normalized Mean Squared Error
QR-RLS: QR Decomposition Based Recursive Least Squares
R-IML-N: Recurrent Infinite Middle Layer Network
RBA: Rotation Based Algorithms
RLS: Recursive Least Squares
RNN: Recurrent Neural Network
SNR: Signal to Noise Ratio
STM: Short Term Memory
UDU': Upper-Diagonal-Upper-Transpose
1. ADAPTIVE FILTERING
What do we basically do in signal processing? The most intuitive answer is that we do something to a signal in order to make it more useful. In doing so, the most important companion of a signal processing expert is probably the filter. Signal processing pioneer Simon Haykin gives the following definition of a filter in [Haykin 1996]:
“The term filter is often used to describe a device in the form of a piece of physical
hardware or software that is applied to a set of noisy data in order to extract
information about a prescribed quantity of interest.”
Figure 1 shows the most basic form of a filter. It takes an input signal, denoted by $u$, processes it, and outputs the signal $y$. The processing should be done in such a manner that $y$ is a good representative of the desired signal, also called the desired response, $d$. A performance criterion is often defined in order to judge the quality of the filter; it is usually a function of the estimation error between the signals $y$ and $d$, denoted by $e$.
Filters can be classified into two major groups. A filter is said to be linear if the filtered signal at its output is a linear function of the observations at its input; otherwise it is called a nonlinear filter. Most of signal processing theory is based on linear filters.
Figure 1: Block diagram of a basic filter.

A process (or system) on which a filter operates is stationary if its statistical characteristics are independent of the time at which the process is started; if we look at the process over different time intervals, we essentially observe the same statistical behavior in each of them. If this property is not satisfied, then the process is non-stationary. If the corrupted signal at hand is governed by a stationary process, we can design an optimum filter using statistical signal processing theory. Assuming that we know certain statistical parameters, such as the mean values and correlation functions of the useful signal and of the unwanted components mixed onto it, we can design a linear filter that is optimum in the statistical sense. The most common performance criterion used in this case is the mean-squared value of $e$, the difference between the desired response and the filter output. The corresponding solution is known as the optimum Wiener solution, named after the pioneering work in [Wiener 1949]1.
The Wiener solution is insufficient when the environment is non-stationary; the filter should then assume a time-varying form instead of relying on fixed statistics. In the early 1960s, Kalman and Bucy extended Wiener's work to the time-varying case with the Kalman filter, a highly powerful tool for many engineering problems [Kalman 1960] [Kalman 1961]. It is an adaptive method that can respond to statistical variations of the environment.
Although the Kalman filter solved the problem of adaptivity to non-stationary environments, like the Wiener solution it still assumes prior knowledge of certain statistical parameters of the incoming signal. This knowledge is usually not available in practical signal processing applications. In that case, a good solution is to use an adaptive filter. An adaptive filter is a self-designing filter that relies on a recursive algorithm for its operation. This recursive algorithm gives the adaptive filter the ability to perform successfully in a non-stationary environment, where the relevant statistics of the variations are not available, by continuously updating the filter parameters.
An adaptive filter always assumes limited or no knowledge about the inherent statistics of its surroundings. If an adaptive filter is used in a stationary environment, after some number of iterations it converges to the optimum Wiener solution in some statistical sense, after which the recursive adaptation algorithm can be shut down. When the environment is non-stationary, the filter offers tracking: after converging to its steady state, it can track the statistical variations of the system, provided those variations are sufficiently slow relative to the tracking capability of the recursive algorithm used.

1 This book was originally issued as a classified National Defense Research Council report in February 1942.
Due to the recursive algorithm, the parameters of an adaptive filter are adapted from one iteration to the next and hence become data dependent. By that property, all adaptive filters are non-linear in the sense that they do not obey the superposition principle, which is a necessary condition for linearity. Nevertheless, in the literature they are often classified as linear or non-linear: if the output of an adaptive filter is formed by a linear combination of the filter coefficients and the input signal, it is called linear; otherwise, it is called non-linear.
Linear adaptive filters can be implemented in two main forms: Infinite-Duration Impulse Response (IIR) and Finite-Duration Impulse Response (FIR). IIR filters are governed by recursive equations of the form:

$$y(t) = \sum_{i=0}^{M-1} a_i(t)\, x(t-i) + \sum_{i=1}^{M-1} b_i(t)\, y(t-i)$$
Here $a_i(t)$ and $b_i(t)$ are the forward and feedback tap weights. Due to the presence of feedback, the impulse response of an IIR filter is infinitely long, hence the name. The feedback connections also lead to a stability problem: the filter may go into oscillation if no special precaution is taken in the choice of the feedback taps. Moreover, the adaptation of IIR filters is hard. The performance functions (e.g., the MSE) of these filters often contain many local minima, and during adaptation the filter may get trapped in one of them instead of reaching the desired global minimum of the performance function. For these reasons, FIR filters have become more popular for designing linear adaptive filters. Filters of this form are also called transversal (FIR) filters. A FIR filter has a very simple form:
$$y(t) = \sum_{i=1}^{M} w_i(t)\, x(t-i+1)$$
The output of the filter is generated by a linear combination of the filter weights and the delayed input samples. The performance functions of FIR filters usually have a single well-defined global minimum that can easily be found by any recursive adaptation algorithm.
In contrast to linear adaptive filters, there exists no general structural framework for implementing non-linear adaptive filters; various schemes can be used. Examples include neural networks [Haykin 1999b], radial basis function networks [Haykin 1996], polynomial filters [Mathews 1991], and order statistics filters [Palmieri 1988].
Many different adaptation algorithms have been developed for adaptive filters. They mostly follow either a statistical or a deterministic approach. In the statistical approach, realized by the so-called stochastic gradient algorithms, the instantaneous squared error is used at each iteration as a rough estimate of the mean squared error to be minimized. It turns out that this rough estimate, when used with a small step-size parameter, leads to a very simple yet robust algorithm: the widely celebrated Least Mean Squares (LMS) algorithm, also known as the Widrow-Hoff rule [Widrow 1960]. Despite its simplicity and robustness, it has a very slow convergence rate, which is moreover sensitive to the eigenvalue spread of the input signal. In the deterministic approach, we want to minimize the sum of weighted error squares, the least squares criterion. In contrast to the stochastic gradient algorithms, which use an instantaneous estimate of the performance criterion, the least squares based algorithms also take the history of the error function into account. The most famous of the least squares based adaptation algorithms is Recursive Least Squares (RLS). The most important advantage of least squares based algorithms is their fast rate of convergence, typically an order of magnitude faster than that of the stochastic gradient algorithms; convergence is also invariant to the eigenvalue spread of the system. The price paid for these desirable properties is increased computational complexity.
The ability of adaptive filters to adjust themselves to different environments has allowed them to be realized in many practical applications in diverse fields like control, communications, military, radar and sonar signal processing, interference cancellation, active noise control, biomedical engineering, and other areas where minimal information is available about the incoming signal. We can classify these applications into four main groups, namely Modeling, Inverse Modeling, Prediction, and Interference Canceling.
In Adaptive Modeling (see Figure 2), we try to find a mathematical model of an unknown plant. This is a very important task if we want to design controls for a time-varying system. It is often difficult to model a physical phenomenon directly; however, by experimentation, the response of the system under various conditions can be measured. In this setup, we feed the filter and the plant with the same input. The plant response is used as the desired response, and the recursive algorithm updates the adaptive filter accordingly, using the error between the filter response and the desired response. When the error signal becomes sufficiently low, the filter response can be used as a representative of the unknown plant response. The most prominent application example of this class is system identification.

Figure 2: Block diagram of the adaptive modeling setup.
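As a minimal sketch of this modeling setup, assume for illustration that the unknown plant is a short FIR system and that the LMS rule drives the adaptation; the plant taps, filter length, and step size below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([0.5, -0.3, 0.2])   # hypothetical unknown plant (FIR only for illustration)
M = len(h)
w = np.zeros(M)                  # adaptive filter taps
mu = 0.05                        # LMS step size
u = rng.standard_normal(2000)    # common input fed to both plant and filter

for t in range(M, len(u)):
    x = u[t - M:t][::-1]         # current input window, newest sample first
    d = h @ x                    # plant response serves as the desired response
    e = d - w @ x                # error between plant and filter responses
    w += mu * e * x              # adaptation step

# After convergence, w approximates h and the filter stands in for the plant.
```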
In the second class of adaptive filtering applications, an adaptive filter is used to provide an inverse model, best in some sense, of an unknown noisy plant (see Figure 3). This setup is called Inverse Modeling and is also known as (adaptive) deconvolution. In the case of linear systems, the inverse model is characterized by a transfer function equal to the inverse (reciprocal) of the plant's transfer function, so that the combination of the two ideally provides a perfect communication medium. During operation, a delayed version of the plant input is used as the desired signal, whereas the plant response is fed into the adaptive filter as input. In some applications the plant input is used directly as the desired response, without a delay. Application areas of inverse modeling include predictive deconvolution, adaptive equalization, and blind equalization.

Figure 3: Block diagram of the adaptive inverse modeling setup.
Adaptive filters belonging to the Prediction class are required to provide the best prediction, with respect to some performance criterion, of the present value of a random signal (see Figure 4). During adaptation, the present values of the random signal to be predicted constitute the desired response for the adaptive filter, while past values of the signal are supplied as the input via delaying. Depending on the application, either the adaptive filter output or the estimation error is used as the system output: the first case occurs when the filter operates as a predictor, while in the latter case it operates as a prediction error filter. Linear predictive coding, autoregressive spectrum analysis, adaptive differential pulse-code modulation, and signal detection are applications that belong to the adaptive prediction class.

Figure 4: Block diagram of the adaptive prediction setup.
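A minimal sketch of the prediction setup, assuming an invented AR(2) signal and an LMS-adapted predictor; all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
T, M, mu = 5000, 8, 0.02
# Hypothetical random signal with internal structure (an AR(2) process here).
s = np.zeros(T)
for t in range(2, T):
    s[t] = 1.2 * s[t - 1] - 0.8 * s[t - 2] + 0.1 * rng.standard_normal()

w = np.zeros(M)
for t in range(M, T):
    x = s[t - M:t][::-1]      # past values, supplied via delaying
    y = w @ x                 # predictor output
    e = s[t] - y              # present value is the desired response
    w += mu * e * x           # LMS update of the predictor taps
# y is the predicted value; e would be the output of a prediction error filter.
```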
The last class of adaptive filtering applications is Interference Canceling (see Figure 5). The aim is to cancel an interfering signal or noise component from the primary signal, which is a combination of the information-bearing component and the interference. The principle is to obtain an estimate of the interfering component and then subtract it from the primary signal. The feasibility of this kind of adaptive filter relies on a reference signal that is a correlated form of the interfering component. It can be derived from a sensor, or a sensor network, located relative to the sensor or set of sensors providing the primary signal in such a way that the information-bearing signal is unobservable or very weak in it. During operation, the reference signal is fed into the adaptive filter, while the primary signal is used as the desired response. The estimation error, which drives the adaptation of the filter, is also used as the system output, since it contains the best estimate (in some sense) of the information-bearing signal. Sample applications include Adaptive Noise Canceling, Echo Cancellation, Adaptive Beamforming, and Active Noise Control.
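A minimal sketch of this interference canceling setup follows; the sinusoidal information signal, the delay-and-scale interference path, and the filter parameters are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000
s = np.sin(0.05 * np.arange(T))    # information-bearing signal (assumed)
n = rng.standard_normal(T)         # interference seen by the reference sensor

primary = s.copy()
primary[1:] += 0.8 * n[:-1]        # primary = signal + correlated interference

M, mu = 4, 0.01
w = np.zeros(M)
out = np.zeros(T)
for t in range(M, T):
    x = n[t - M:t][::-1]           # reference input (correlated with interference)
    e = primary[t] - w @ x         # error = estimate of s, used as system output
    w += mu * e * x                # the error also drives the adaptation
    out[t] = e
```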
Adaptive linear filters are very popular among engineers and scientists and hence have been implemented in many applications. An obvious advantage is their natural simplicity, which makes their design, analysis, and implementation comparatively straightforward in many cases. They have been studied very well, and there exist very good books that cover them extensively, such as [Haykin 1996], [Farhang-Boroujeny 1998], and [Bellanger 2001]. There are, however, many situations where the performance of linear adaptive filters is simply poor: linear models do not always produce the best estimates. This is quite natural, especially if we consider that our universe is governed by nonlinear dynamical processes. In that respect, we need nonlinear models to achieve better results.
Figure 5: Block diagram of the adaptive interference canceling setup.
Using an artificial neural network is one of the solutions for operating in nonlinear environments. A neural network is a biologically inspired, highly parallel and distributed information (or signal) processing tool, made up of interconnected simple non-linear processing units, also called neurons, which has the ability to store experiential knowledge. Neural networks imitate biological neuronal networks in two ways. First, knowledge is acquired through a learning process. Second, the unit-to-unit connections, called synaptic weights, are responsible for storing the acquired knowledge. "As long as the data used for learning is a good representative of the environment, one can build a supervised neural network that can capture the underlying dynamics of the environment, whether the environment is stationary or non-stationary. This is truly a powerful statement on non-linear adaptive filtering, with profound practical implications." [Haykin 1999a]. In this thesis we concentrate on the use of a special type of neural network, Echo State Networks [Jaeger 2001], for adaptive filtering.
2. PROBLEM STATEMENT
Echo State Networks (ESNs) serve as a powerful black-box tool for neural network learning [Jaeger 2001]. They offer a novel approach for the efficient training of Recurrent Neural Networks (RNNs): the ESN training method has a complexity in the order of a simple linear regression task. Up to now, ESNs have been used successfully in a broad range of applications like system identification, prediction, and robot control. When we examine these applications in detail, it turns out that most of the time ESNs are trained in an offline manner. That is, they are used in a somewhat non-adaptive way, in the sense that an ESN is first trained in batch mode and then utilized in the target application environment with no further change in the network parameters (i.e., the synaptic weights). This is in contrast to biological systems, which learn continuously in order to adapt to the time-varying characteristics of real-world tasks.
In [Jaeger 2002c], it was shown that ESNs can be used in an online manner using a very well known algorithm from the adaptive filtering area called Recursive Least Squares (RLS).2 Adaptive filters, as previously explained, are special kinds of filters that run in unknown environments with time-varying properties, so continual adaptation of the filter taps has to take place at runtime. The learning mechanism of ESNs exhibits a structure similar to that of linear adaptive filters. Therefore, the same algorithms from the adaptive filtering area can be used for the online adaptation of ESNs.
Additionally, in [Jaeger 2004], ESNs were applied to the adaptive channel equalization problem; this approach beat conventional nonlinear methods by two orders of magnitude. The same author also applied ESNs to the nonlinear adaptive system identification problem and again achieved very good performance [Jaeger 2002c]. Based on these results, we can conclude that ESNs can be considered a useful tool for adaptive filtering problems.
However, we still need a more detailed treatment of ESN-RLS combinations under adaptive filtering scenarios, since it has been reported by many authors that the RLS family of algorithms is subject to numerical stability problems, for example [Ljung 1985] [Ardalan 1986] [Cioffi 1987] [Yang 1992] [Levin 1994]. If ESNs are to be used for adaptive filtering or other application types in a robust manner, techniques to ensure or increase the numerical stability of the RLS algorithm have to be found in the scientific literature. In this thesis, we give a detailed treatment of using the RLS family of algorithms for the online adaptation of ESNs in the adaptive filtering context. Much of the attention is focused on the numerical stability of the different RLS algorithms, while other performance criteria like algorithmic complexity, steady-state error, and tracking are also discussed thoroughly. We also show that ESNs are competitive candidates for adaptive filtering; in that respect, a section is devoted to the superiority of ESNs over standard linear and nonlinear adaptive filtering techniques. Additionally, an algorithmic analysis of the most promising RLS-ESN combinations is given, in order to reveal the complexity trade-offs between the different algorithms and to assist future implementations.

2 The same argument is also mentioned thoroughly in [Jaeger 2001].
The rest of the thesis is organized as follows. In the next section, we acquaint the reader with classical ESN theory, which covers the basics of ESN generation and supervised learning in batch mode. We then go on with the online adaptation of ESNs; the main focus is on the RLS family of algorithms, while some non-RLS algorithms are also discussed. After identifying the most promising algorithms from the adaptive filtering literature, a section compares the performance of these algorithms against each other in different experimental setups. Later, we compare ESNs to standard linear and nonlinear methods from the adaptive filtering area. We finish the thesis with a summary of what has been done and of possible future work. In Appendix A, the interested reader can find brief summaries of ESN research and application articles. Appendix B contains algorithmic analyses of some of the most promising ESN online learning algorithms.
3. ARTIFICIAL NEURAL NETWORKS AND THE ECHO STATE APPROACH

3.1 A Brief Introduction to Artificial Neural Networks
An artificial neural network (ANN) is a computational model inspired by biological neural networks (i.e., the composition of a human brain). ANNs can be used as a powerful data modeling tool with the ability to adapt, learn, generalize, cluster, and organize data. Their operation is based on distributed parallel processing. Formally, ANNs can be represented as a set of simple processing units known as neurons, which communicate by sending signals to each other over a large number of weighted connections, each unit typically having the following properties [Plöger 2004] (see Figure 6):
• $x_i(t)$: the activation state of neuron $i$ at time $t$
• $net_i$: the net input to neuron $i$, calculated by the net propagation function $net_j(t) = \sum_{\forall i} o_i(t)\, w_{ji}$
• $f_{activation}^{in.}$: the inner activation (transfer) function, $x_i(t+1) = f_{activation}^{in.}(x_i(t), net_i(t), \theta_i)$, where $\theta_i$ is the threshold of neuron $i$
• $o_i(t)$: the output of neuron $i$, $o_i(t) = f_{activation}^{out}(x_i(t))$
• $f^{out}$: the output activation function
One can classify ANNs into two main categories: feed-forward and recurrent neural networks. Also known as Multi-Layer Perceptrons (MLPs), feed-forward neural networks (FNNs) are the most popular type of ANN. In an FNN, data enters at the input layer and is piped through the network through intermediate layers, also called hidden layers, until it arrives at the output neurons. There is no feedback between different layers during this process, which is why these networks are called feed-forward. Mathematically, they implement static input-output mappings, and theoretically FNNs can approximate any non-linear function with arbitrary precision. Having been studied extensively for many years, they are in a solid state: their application areas and training methods are quite well understood (see Figure 7).
Their counterparts, Recurrent Neural Networks (RNNs), have cyclic connections between the layers. Theoretically, they can approximate any dynamical system with arbitrary precision. Being able to model any dynamical system, they offer a lot to researchers; however, their analysis and training are extremely difficult, so research on RNNs has remained comparatively limited. Yet one cannot draw one's interest away from RNNs, since all biological neural networks are recurrent and they are capable of a broader range of tasks. Although their analysis is still difficult, some of the difficulties of RNN training were solved in 2001 by a breakthrough approach called Echo State Networks, due to Herbert Jaeger [Jaeger 2001]. Next, we continue with the details of Echo State Network theory. The interested reader can refer to [Haykin 1999b] for a more detailed coverage of artificial neural networks.

Figure 6: Structure of a neuron [Plöger 2004]
Figure 7: Schematic description of an FNN and an RNN [Jaeger 2002a]
3.2 The Echo State Network Theory
Echo State Networks represent a novel approach to RNN training. While keeping the expressive power of RNNs, they remedy the problems of RNN training methods, which are usually hard and time consuming. A similar neural network type was also developed independently under the name Liquid State Machines (LSMs) [Maas 2002]. LSM theory has a more biological perspective, whereas ESNs approach the problem from an engineering point of view.
The central idea of ESN theory is based on using a relatively huge, fixed dynamical reservoir (DR). A DR can be viewed as a pool of artificial neurons with connections to each other, without any restriction on topology; recurrent paths are of course very welcome in this setup. When excited by the input signal, the DR maps it into a richer state space, encoded in its internal states. In the end, the desired output can be formed by computing a weighted combination of the output connections. By imposing certain algebraic conditions on the DR, which will be mentioned in detail later in this chapter, one can achieve impressive results by keeping the internal weights fixed and adjusting only the taps from the DR to the output units. In this way, RNN training, which was once a burden for researchers, boils down to a simple linear regression task. Based on such a simple approach, ESNs are now the title holder in the prediction of the well-known chaotic time series benchmark, the Mackey-Glass system.
Engineering applications of the method are numerous (See Appendix A). "ESNs can be
used for all basic tasks of signal processing and control including time series
prediction, inverse modeling, pattern generation, event detection and classification,
modeling distributions of stochastic processes, filtering and nonlinear control. Because
a single learning run takes only a few seconds (or minutes for very large datasets and
networks), engineers can test out variants at a high turnover rate, a crucial factor for
practical usability." [Jaeger 2004]
Before going further into the details of ESNs, we first want to fix the mathematical notation that will be used throughout this work. For consistency, we stay in line with the notation used in the original publications [Jaeger 2001] and [Jaeger 2002b].
Our ESN model consists of $K$ input units with an activation (state) vector denoting the activation of the input layer at time $t$:

$$u(t) = [u_1(t), u_2(t), u_3(t), \ldots, u_K(t)]^T \quad \text{(input activation vector)}^3$$

$N$ internal units (the reservoir) with the corresponding state vector:

$$x(t) = [x_1(t), x_2(t), x_3(t), \ldots, x_N(t)]^T \quad \text{(internal activation vector)}$$

and $L$ output units with the output state vector:

$$y(t) = [y_1(t), y_2(t), y_3(t), \ldots, y_L(t)]^T \quad \text{(output activation vector)}$$
The synaptic weights between the input, internal, and output units are collected in three weight matrices:

$W^{in.} = (w_{ij}^{in.})$: $N \times K$, Input → Reservoir
$W = (w_{ij})$: $N \times N$, Reservoir → Reservoir
$W^{out} = (w_{ij}^{out})$: $L \times (K+N+L)$, (Input, Reservoir, Output) → Output

Here a zero weight stands for no connection. It should also be noted that the output units have connections from the input, internal, and even the output neurons. In addition to those, the activations of the output units may optionally be projected back to the internal units using the connections:

$W^{back} = (w_{ij}^{back.})$: $N \times L$, Output → Reservoir
The matrices $W^{in.}$, $W^{out}$ and also $W^{back}$, if it exists at all, are usually full matrices, whereas $W$ is a sparse matrix with recommended density values ranging from 5% to 20%. The input-to-reservoir weights are assigned using a uniform distribution and are fixed throughout the ESN's lifetime. Likewise, $W^{back}$ also has fixed weights, which are drawn randomly. The reservoir is scaled to have a suitable global spectral radius, which is also kept fixed. The only thing to be learned over time is $W^{out}$, which makes ESN learning computationally very fast among RNN learning techniques.

3 Here $T$ denotes the matrix transpose operation.
The activations of the internal units are updated by the rule:

$$x(t+1) = f_{activation}(W x(t) + W^{in.} u(t+1) + W^{back} y(t))$$

This step is called "Evaluation". Here $u(t+1)$ denotes the new input vector at time $t+1$, and $f_{activation} = (f_1, f_2, \ldots, f_N)$ denotes the activation functions of the internal neurons, also called the transfer function, output function, or squashing function. To achieve nonlinearity, and because it is invertible, it is mostly selected as the hyperbolic tangent:

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

Specifically, neurons using $\tanh(x)$ are called sigmoids.
Following the evaluation step, the new value of the output activation vector is given by the formula:

$$y(t+1) = f_{activation}(W^{out}\, \mathrm{concat}(x(t+1), u(t+1), y(t)))$$

The name of this step in ESN jargon is "Exploitation". The function $\mathrm{concat}(x, u, y)$ denotes the concatenation of the new internal and input states together with the previous output state. One should observe that this notation does not require recurrent pathways between the internal neurons, although they are highly desirable; therefore, no restriction on the topology of the network exists.
Successive computations of evaluation and exploitation may lead to chaotic and unbounded behavior; therefore, a proper global scaling of $W$ should be applied in order to prevent these unwanted effects. Details on how to do this are given in the upcoming sections. An overview of ESNs is summarized in Figure 8.
Having defined our terminology, we now continue with the details of ESN theory. The name "echo states" comes from a special property of the network, characterized by the weight matrix $W$. The training data $[u(t), d(t)]$ also influences whether the network has echo states or not: for two different sets of input-output signal pairs, the same network may have echo states on one set while not having them on the other. We require the input sequences to come from a compact interval $U$, such that:

$$u(t) \in U^K \quad \text{for all } t$$

A similar requirement also holds for the teacher outputs (the desired signal). We need the desired signal values to come from a compact interval $D$, such that:

$$d(t) \in D^L \quad \text{for all } t$$

The echo state property is then given in Definition 1.
Figure 8: An overview of the general Echo State Network structure. [Plöger 2004]
Verbally, the ESP states that the current state of the network is uniquely determined by the input history and the teacher-forced output. Therefore, when we run the network freely, initial network states that are independent of the input and output history should wash out after some time. By this property, we obtain network states that are characterized only by the input and the teacher-forced output.
The echo state property is strongly connected to the algebraic properties of the internal weight matrix $W$. Although there exists no known necessary and sufficient algebraic condition to decide whether a given $[W^{in.}, W, W^{back}]$ has the ESP, a sufficient condition to prove the non-existence of echo states is available; it is given in Proposition 1.
Definition 1: Echo State Property (ESP)

Assume a recurrent neural network with connection weights $W^{in.}$, $W$ and $W^{back}$, which is driven by a teacher input $u(t) = [u_1(t), u_2(t), u_3(t), \ldots, u_K(t)]^T$ and forced by a desired teacher output $d(t) = [d_1(t), d_2(t), d_3(t), \ldots, d_L(t)]^T$, both coming from compact intervals, $u(t) \in U^K$ and $d(t) \in D^L$. This RNN has echo states if, for every left-infinite sequence $[u(t), d(t-1)]$, where $t = -\infty, \ldots, -1, 0$, and for all state sequences $x(t)$, $x'(t)$ computed by

$$x(t+1) = f_{activation}(W x(t) + W^{in.} u(t+1) + W^{back} y(t))$$
$$x'(t+1) = f_{activation}(W x'(t) + W^{in.} u(t+1) + W^{back} y(t))$$

it holds that $x(t) = x'(t)$ for all $t \leq 0$.
Proposition 1: Sufficient Conditions for ESP

Given an untrained network $[W^{in.}, W, W^{back}]$ with state update according to $x(t+1) = f_{activation}(W x(t) + W^{in.} u(t+1) + W^{back} y(t))$ with $f_{activation}(x) = \tanh(x)$, let $\lambda_{max}$ be the largest absolute eigenvalue and $\sigma_{max}$ the largest singular value of $W$. Then:

a) If $\sigma_{max} < 1$, the ESP holds for the network $[W^{in.}, W, W^{back}]$.

b) If $|\lambda_{max}| > 1$, the network $[W^{in.}, W, W^{back}]$ has no echo states for any input/output interval $U^K \times Y^L$ containing the zero input/output tuple $(0, 0)$.
In [Buehner 2006], a newer sufficient condition for the ESP is proposed. The authors' claim is that this new bound is tighter than the original one and guarantees asymptotic stability (i.e., guaranteed ESP for all inputs). Note that no new design methodology for generating an untrained ESN is proposed in that paper; what is given can only be used to test the global asymptotic stability of an ESN at hand. Based on Proposition 1, Algorithm 1 seems to guarantee the generation of an untrained network $[W^{in.}, W, W^{back}]$ with echo states.
The $W$ matrix generated by Algorithm 1 is what we usually call the Dynamical Reservoir (DR). According to Jaeger's suggestions, it should be a sparse matrix; this is a simple trick to ensure richness in the internal dynamics of the DR. Best results are achieved with low densities (connectivities) around 5% to 20%. Moreover, the values should be roughly equilibrated, that is, the mean value of the internal weights should be around zero. To achieve this, one can either draw random weights from a uniform distribution over [-1, 1] or set the values to either -1 or 1.
The number of neurons $N$ used in the reservoir should be selected considering both the hardness of the learning problem and the availability of the teacher signal. The harder the problem, the higher the number of neurons usually needed for good models. Another important point is to avoid over-fitting, a common phenomenon in ANN training: it occurs when the ANN learns a much too literal reproduction of the teacher sequence but is poor at generalizing to unseen examples. As a rule of thumb, $N$ should not exceed the order of $T/2$, but should at least be bigger than $T/10$, where $T$ denotes the periodicity of the training data. The more regular and periodic the training data, the closer $N$ can be chosen to $T/2$.
Algorithm 1: Generation of an RNN with ESP

1. Randomly generate a sparse matrix $W_0$ with real-valued $w_{0,ij}$'s between $[-1, 1]$ and a low density (i.e., only about 5% of the weights are different from zero).

2. Normalize: $W_1 = \frac{1}{|\lambda_{max}|} W_0$, where $\lambda_{max}$ is the largest absolute eigenvalue of $W_0$.

3. Scale: $W = \alpha W_1$, where $\alpha < 1$. It follows that $\alpha$ is the spectral radius of $W$. Then the network $[W^{in.}, W, W^{back}]$ has the ESP.
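A direct transcription of Algorithm 1 might look as follows; the density, the spectral radius $\alpha$, and the random seed are free choices of the user.

```python
import numpy as np

def generate_reservoir(N, density=0.1, alpha=0.8, seed=0):
    """Algorithm 1: sparse random W0, normalized by its largest absolute
    eigenvalue, then scaled so that alpha < 1 is the spectral radius of W."""
    rng = np.random.default_rng(seed)
    W0 = rng.uniform(-1.0, 1.0, size=(N, N))
    W0[rng.random((N, N)) > density] = 0.0   # step 1: keep ~density of the weights
    lam_max = np.max(np.abs(np.linalg.eigvals(W0)))
    W1 = W0 / lam_max                        # step 2: unit spectral radius
    return alpha * W1                        # step 3: spectral radius alpha < 1
```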
The spectral radius $\alpha$ is another important parameter that should be carefully selected. For fast network dynamics, a small $\alpha$ should be used; the closer it gets to one, the slower the network dynamics become. There is no general answer to what a good choice of the spectral radius is for a given application. Therefore, several trials usually need to be made in order to fully exploit the capabilities of the ESN at hand.
Note that in Algorithm 1, the generation of $W^{in.}$ and $W^{back}$ is left open. This is because the ESP is independent of these two matrices. Mostly, the weights of $W^{in.}$ and $W^{back}$ are drawn randomly. The absolute values of these weights have an important effect on how the DR is excited by the input or the output back-projections: larger values imply that the network is strongly driven by the input/output signals, and vice versa for smaller values. Additionally, if the hyperbolic tangent is used as the activation function, small weights mean that the network operates around the central, near-linear region of the sigmoid. As the absolute weights grow, we get closer to saturation and hence work in a highly nonlinear region. In the extreme, we have binary dynamics with a sigmoid output equal to one or minus one.
After we get an RNN with the ESP, in other words an ESN, the training, which was once a very time-consuming operation, is easy: it is just adjusting the weights of $W^{out}$, a very simple linear regression task. Algorithm 2 can be used for supervised batch learning of ESNs. In Algorithm 2, be careful to collect the vectors $x(t)$ and $f_{activation}^{-1}(d_{teach}(t))^T$ in $M$ and $C$, not $x(t)$ and $f_{activation}^{-1}(d_{teach}(t-1))^T$.
When applying ESNs to particular applications, there are also some additional points that researchers should know in order to achieve meaningful results within a few trials. Examples of these points include choosing the correct scaling parameters for the input signal, using a reasonable network size, and providing enough training examples. A short tutorial on these training tricks is given in [Jaeger 2002a], whereas a more comprehensive treatment of the topic, a detailed ESN training tutorial, still waits to be written.
In the relatively short period of time since their invention in 2001, ESN theory has gained popularity in the neural network research community. Apart from the original contributor, Herbert Jaeger, other researchers have published works regarding ESNs. These articles come in various flavors: some aim at solving engineering problems by means of ESNs, while others concentrate on the theory of standard ESNs in order to improve performance or to remove inherent limitations of the theory. It is also not surprising to find novel proposals, like new RNN structures or training methods, whose development is strongly inspired by ESN theory. Naturally, ESNs have also been criticized by some authors. In Appendix A, we give an overview of all ESN papers of which we are aware; interested readers can use the relevant references to obtain more detailed knowledge.
At the International Joint Conference on Neural Networks (IJCNN), held in Montreal, Canada in 2005, a special session was reserved for Echo State Networks. The outcome of IJCNN 2005 was very prosperous in terms of the number of ESN papers published during the conference. Also, in 2007, a full issue of the famous Neural Networks journal will be reserved for the ESN topic and for a similar idea, the LSMs [Maas 2002]. These two events provide evidence for the increasing popularity and acceptance of ESN theory in the neural networks community.
Our discussion of classical ESN theory ends here. In the next chapter, we discuss the online adaptation of ESNs using algorithms from the adaptive filtering area.
Algorithm 2: Supervised Training of ESN

Let $u_{teach}(t) = [u_1(t), u_2(t), u_3(t), \ldots, u_K(t)]^T$ be the input teacher signal and $d_{teach}(t) = [d_1(t), d_2(t), d_3(t), \ldots, d_L(t)]^T$ be the output teacher signal, containing column vectors of dimensions $K$ and $L$, over the discrete time interval $t = 1, 2, \ldots, t_0, \ldots, T$, where $t_0$ denotes the time point at which all initial states of the dynamical reservoir are washed out, and with the initial teacher output defined as $d_{teach}(0) = 0$.

1. Generate an ESN using Algorithm 1.

2. Initialize the network states of the dynamical reservoir arbitrarily (e.g., $x(0) = 0$).

3. Calculate $x(t+1)$ for all $t = 0, 1, \ldots, t_0 - 1$ using the evaluation equation, $x(t+1) = f_{activation}(W x(t) + W^{in.} u(t+1) + W^{back} y(t))$.

4. Compute $\mathrm{concat}(u_{teach}(t+1), x(t+1), d_{teach}(t))^T$ for all $t = t_0, t_0+1, \ldots, T$ as rows and store them in the state matrix $M$ of size $(T - t_0 + 1) \times (N + K + L)$.

5. In the same manner, compute $f_{activation}^{-1}(d_{teach}(t))^T$ for all $t = t_0, t_0+1, \ldots, T$ as rows and save them in the teacher matrix $C$ of size $(T - t_0 + 1) \times L$.

6. Solve $W' = M^{+} C$, where $M^{+}$ denotes the (Moore-Penrose) pseudo-inverse of $M$.

7. Set $W^{out} = (W')^T$.
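A compact sketch of Algorithm 2 for tanh output units, using teacher forcing through $W^{back}$; it assumes $|d_{teach}| < 1$ so that the inverse activation (arctanh) is defined, and uses the pseudo-inverse for step 6. The function name and array conventions are ours.

```python
import numpy as np

def train_esn_readout(W, W_in, W_back, u_teach, d_teach, t0):
    """Batch ESN training per Algorithm 2 (tanh units assumed).

    u_teach: (T, K) teacher inputs; d_teach: (T, L) teacher outputs with
    entries in (-1, 1); t0: washout length. Returns W_out, shape (L, K+N+L)."""
    N = W.shape[0]
    T = len(u_teach)
    x = np.zeros(N)
    d_prev = np.zeros(d_teach.shape[1])   # initial zero teacher output, d_teach(0) = 0
    rows, targets = [], []
    for t in range(T):
        # Evaluation with teacher forcing: the desired output is fed back.
        x = np.tanh(W @ x + W_in @ u_teach[t] + W_back @ d_prev)
        if t >= t0:                       # initial reservoir states washed out
            rows.append(np.concatenate([u_teach[t], x, d_prev]))
            targets.append(np.arctanh(d_teach[t]))   # inverse activation
        d_prev = d_teach[t]
    M = np.asarray(rows)                  # state matrix
    C = np.asarray(targets)               # teacher matrix
    return (np.linalg.pinv(M) @ C).T      # steps 6-7: W_out = (M^+ C)^T
```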
4. ONLINE ADAPTATION OF THE ECHO STATE NETWORKS
4.1 Introduction

In previous works by Jaeger, [Jaeger 2002c] and [Jaeger 2004], online learning for ESNs was achieved by using Recursive Least Squares (RLS), a very well known algorithm from adaptive filtering theory. The good results obtained in these studies showed that RLS offers a good solution for the online adaptation of ESNs4.
The advantages of using RLS can be listed as follows. Firstly, it has a very fast rate of convergence5. Secondly, its rate of convergence is independent of the eigenvalue spread6 of the correlation matrix of the input signal. The RLS algorithm also has good steady-state error performance.

However, RLS has two important disadvantages. Firstly, its computational complexity is in the order of $O(N^2)$.7 Secondly, it is numerically unstable in finite-precision environments8 (i.e., digital systems).
For a better and more reliable use of RLS for the online adaptation of ESNs, the above disadvantages should be examined in more detail and remedied if possible. In this section we attempt such a detailed treatment of the RLS algorithm.
This chapter is organized as follows. First, we derive the conventional RLS algorithm from scratch to give the reader better insight into the algorithm. Next, we investigate the problems one by one: we first look at the computational complexity problem, leaving the numerical instability problem to a later stage. In the following chapter, we test all the different ESN adaptation methods in two different scenarios, namely adaptive system identification and adaptive noise canceling.

4 What would be even better is using the simple yet robust Least Mean Squares algorithm [Widrow 1960]; however, this is not possible at the moment with our current knowledge of ESNs [Jaeger 2005], for reasons that will be explained later in this chapter. Further research should be done on this topic.
5 By rate of convergence, we refer to the definition given in [Haykin 1996]: the number of iterations the algorithm needs, in response to stationary inputs, to converge "close enough" to the optimum Wiener solution in the mean square. A fast rate of convergence thus allows the filter to adapt more rapidly.
6 The eigenvalue spread is the ratio of the largest eigenvalue of a matrix to the smallest one.
7 Big-Oh is a mathematical notation used to describe the asymptotic behavior of functions. In the computational literature, it is often used to denote the complexity of algorithms.
8 An algorithm is numerically unstable if it diverges from the desired response due to quantization errors in digital environments.
4.2 The Conventional Recursive Least Squares Algorithm

We now derive the conventional RLS algorithm for the ESN case, which we abbreviate as ESN-CRLS. The derivation may look complicated at first sight but is actually not so hard to follow. Starting from the well-known least squares minimization problem, we arrive at the recursive set of equations that makes up the CRLS algorithm.

For simplicity, during the derivation we assume an ESN with one input and one output. Neither an input-to-output connection nor an output-to-output connection is used. Thus $W^{out}$ is a $1 \times N$ vector storing the connections from the dynamical reservoir to the single output neuron. More precisely, we define $W^{out}$ as a function of time, $W^{out}(t) = [w_1(t), w_2(t), \ldots, w_N(t)]$. We use the indices $T$ and $t$ to denote time variables. We adapt our ESN-CRLS from the derivation procedure given in [Haykin 1996]. In the last step, when giving the final version of the algorithm, we pass from this restricted case to the most general form of the ESN.
In the method of weighted least squares, we want to minimize the following quantity to achieve a good estimate of the desired signal at time $t$:

$$\epsilon(t) = \sum_{i=0}^{t} \lambda(i, t)\, e(i)^2 \quad \text{(Equation 4.1)}$$

where

$$e(i) = f_{activation}^{-1}(d_{teach}(i)) - W^{out}(t)\, x(i) \quad \text{(Equation 4.2)}$$

and $\lambda(i, t)$ is the weighting factor, defined as:

$$\lambda(i, t) = \lambda^{t-i}, \quad i = 0, 1, \ldots, t, \quad 0 \leq \lambda \leq 1 \quad \text{(Equation 4.3)}$$
It is obvious from the formula that this kind of weighting factor tends to weigh past samples with smaller coefficients; that is, the filter forgets the past, hence the name forget rate given to this term. The special case $\lambda = 1$ is called pre-windowed or infinite-memory RLS and is equal to the ordinary least squares formulation. Using the notation with the forget rate $\lambda$, the cost function we want to minimize becomes:
$$\epsilon(t) = \sum_{i=0}^{t} \lambda^{t-i}\, e(i)^2 \quad \text{(Equation 4.4)}$$
This expression can be minimized by taking partial derivatives with respect to all elements of $W^{out}(t)$ and equating the result to zero:

$$\frac{\partial \epsilon(t)}{\partial W^{out}(t)} = \sum_{i=0}^{t} \lambda^{t-i}\, e(i)\, \frac{\partial e(i)}{\partial W^{out}(t)} = -\sum_{i=0}^{t} \lambda^{t-i}\, e(i)\, x(i)^T = 0 \quad \text{(Equation 4.5)}$$
Now we replace $e(i)$ by its original form:

$$\sum_{i=0}^{t} \lambda^{t-i} \left[ f_{activation}^{-1}(d_{teach}(i)) - W^{out}(t)\, x(i) \right] x(i)^T = 0 \quad \text{(Equation 4.6)}$$
Rearranging Equation 4.6, we get:

$$W^{out}(t) \left[ \sum_{i=0}^{t} \lambda^{t-i}\, x(i)\, x(i)^T \right] = \sum_{i=0}^{t} \lambda^{t-i}\, f_{activation}^{-1}(d_{teach}(i))\, x(i)^T \quad \text{(Equation 4.7)}$$
We can express the same equation in matrix form as

$$\Phi(t)\, W^{out}(t)^T = z(t) \quad \text{(Equation 4.8)}$$

where $\Phi(t)$ is the $N \times N$ correlation matrix of the internal state vector of our ESN:

$$\Phi(t) = \sum_{i=0}^{t} \lambda^{t-i}\, x(i)\, x(i)^T \quad \text{(Equation 4.9)}$$

and $z(t)$ is the $N \times 1$ cross-correlation vector between the internal state vector of the ESN, $x(i)$, and the value of the inverse activation function applied to the desired response, $f_{activation}^{-1}(d_{teach}(i))$:

$$z(t) = \sum_{i=0}^{t} \lambda^{t-i}\, x(i)\, f_{activation}^{-1}(d_{teach}(i)) \quad \text{(Equation 4.10)}$$
Our aim is to get the least squares estimate of W out t . Using the (Equation 4.8), it
can be found by:
W out t T = t −1 z t (Equation 4.11)
Up to here, we followed the ordinary least squares framework to obtain a solution. Now the important question is how to find the inverse of $\Phi(t)$ recursively. From this point on we start the main part of our ESN-CRLS derivation, which involves the recursions.
If we isolate the term where $i = t$ from the rest of the summation in the correlation matrix definition, (Equation 4.9), we get:

$$\Phi(t) = \lambda \left[ \sum_{i=0}^{t-1} \lambda^{t-1-i}\, x(i)\, x(i)^T \right] + x(t)\, x(t)^T$$ (Equation 4.12)

By definition, the term in the brackets is equal to $\Phi(t-1)$. Therefore, we get the following recursion for the correlation matrix update:

$$\Phi(t) = \lambda\, \Phi(t-1) + x(t)\, x(t)^T$$ (Equation 4.13)
In the same manner, we get the recursive update equation for the cross-correlation vector $z(t)$ given in (Equation 4.10):

$$z(t) = \lambda\, z(t-1) + x(t)\, f_{activation}^{-1}(d_{teach}(t))$$ (Equation 4.14)
Before jumping to the next step, we briefly have to introduce a useful identity from linear algebra which will play a key role in the remaining steps. It is known as the "Matrix Inversion Lemma". In the literature, it is also referred to as Woodbury's identity. The lemma states the following:

Lemma 1: Matrix Inversion Lemma
Let $A$ and $B$ be two positive-definite $M \times M$ matrices which are related by the equation $A = B^{-1} + C\, D^{-1}\, C^H$ (Equation 4.15), where $D$ is another positive-definite $N \times N$ matrix and $C$ is an $M \times N$ matrix. Then, the inverse of the matrix $A$ can be expressed as:

$$A^{-1} = B - B\, C \left( D + C^H B\, C \right)^{-1} C^H B$$ (Equation 4.16)

In the definition of the matrix inversion lemma, $C^H$ denotes the Hermitian transpose⁹ of $C$. The lemma states that if we are given a matrix in the form of (Equation 4.15), we can determine its inverse using (Equation 4.16). The matrix inversion lemma can easily be proved by multiplying (Equation 4.15) and (Equation 4.16) side by side and recognizing that the product of a square matrix with its inverse is the identity matrix (i.e. $A A^{-1} = I$).
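The lemma can also be checked numerically. The following sketch (our own illustration; the matrix sizes and the random construction are arbitrary assumptions) builds random positive definite matrices and compares both sides of (Equation 4.16):

import numpy as np

# Check of the matrix inversion lemma (Woodbury's identity):
# A = B^-1 + C D^-1 C^H  ==>  A^-1 = B - B C (D + C^H B C)^-1 C^H B
rng = np.random.default_rng(0)
M, N = 5, 3

def random_spd(n):
    # a random symmetric positive definite matrix
    Q = rng.standard_normal((n, n))
    return Q @ Q.T + n * np.eye(n)

B = random_spd(M)
C = rng.standard_normal((M, N))   # real matrices, so C^H = C^T
D = random_spd(N)

A = np.linalg.inv(B) + C @ np.linalg.inv(D) @ C.T
lhs = np.linalg.inv(A)
rhs = B - B @ C @ np.linalg.inv(D + C.T @ B @ C) @ C.T @ B
print(np.allclose(lhs, rhs))      # prints True, up to round-off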
Assuming we have a positive definite¹⁰ $\Phi(t)$ (hence a non-singular¹¹ matrix), we may apply the lemma to the recursive update equation of the correlation matrix. Firstly, we make the following identifications:

$$A = \Phi(t), \qquad B^{-1} = \lambda\, \Phi(t-1), \qquad C = x(t), \qquad D = 1$$ (Equation 4.17)

Substituting these terms into (Equation 4.16) of the matrix inversion lemma leads us to (Equation 4.18):
9 The Hermitian transpose, also called the conjugate transpose, of a matrix can be found by taking the transpose of the complex conjugate of the matrix, $A^H = \bar{A}^T$. In the case of a real matrix, it is equal to the ordinary matrix transpose operation [Eves 1980].
10 An $N \times N$ Hermitian matrix $A$ is positive definite iff for any nonzero vector $v$ we have $v^H A\, v > 0$, where a square matrix is Hermitian iff $A = A^H$. If the matrix is real, then this condition boils down to $A = A^T$ [Johnson 1970].
11 If a square matrix has a nonzero determinant then it is called a non-singular matrix [Lipschutz 1991].
$$\Phi^{-1}(t) = \lambda^{-1}\, \Phi^{-1}(t-1) - \frac{\lambda^{-2}\, \Phi^{-1}(t-1)\, x(t)\, x(t)^T\, \Phi^{-1}(t-1)}{1 + \lambda^{-1}\, x(t)^T\, \Phi^{-1}(t-1)\, x(t)}$$ (Equation 4.18)
In order to simplify the rest of our calculations, we make the following definitions:

$$P(t) = \Phi^{-1}(t)$$ (Equation 4.19)

and

$$k(t) = \frac{\lambda^{-1}\, P(t-1)\, x(t)}{1 + \lambda^{-1}\, x(t)^T\, P(t-1)\, x(t)}$$ (Equation 4.20)

Using our new definitions, we can rewrite (Equation 4.18) as:

$$P(t) = \lambda^{-1} \left[ P(t-1) - k(t)\, x(t)^T\, P(t-1) \right]$$ (Equation 4.21)
(Equation 4.20) can be reorganized to get a simpler form:

$$k(t) = \lambda^{-1}\, P(t-1)\, x(t) - \lambda^{-1}\, k(t)\, x(t)^T\, P(t-1)\, x(t) = \lambda^{-1} \left[ P(t-1) - k(t)\, x(t)^T\, P(t-1) \right] x(t)$$ (Equation 4.22)

Notice that the term in brackets in (Equation 4.22) is equal to $P(t)$ by (Equation 4.21). Therefore, we simplify the above equation using this finding to get a more compact form of $k(t)$ in terms of the correlation matrix and the internal state vector of the ESN:

$$k(t) = P(t)\, x(t) = \Phi^{-1}(t)\, x(t)$$ (Equation 4.23)
$k(t)$ is referred to as the gain vector because of the close relation between Kalman filtering and the RLS; see [Sayed 1994] for a very good description of the relation between RLS and Kalman filtering. Next we have to find a recursive equation for the $W^{out}$ update. Using (Equation 4.11), (Equation 4.14) and (Equation 4.19) we get:
$$W^{out}(t)^T = \Phi(t)^{-1}\, z(t) = P(t)\, z(t) = \lambda\, P(t)\, z(t-1) + P(t)\, x(t)\, f_{activation}^{-1}(d_{teach}(t))$$ (Equation 4.24)

Substituting (Equation 4.21) only into the first occurrence of $P(t)$ in (Equation 4.24) leads us to:

$$\begin{aligned} W^{out}(t)^T &= P(t-1)\, z(t-1) - k(t)\, x(t)^T\, P(t-1)\, z(t-1) + P(t)\, x(t)\, f_{activation}^{-1}(d_{teach}(t)) \\ &= \Phi^{-1}(t-1)\, z(t-1) - k(t)\, x(t)^T\, \Phi^{-1}(t-1)\, z(t-1) + P(t)\, x(t)\, f_{activation}^{-1}(d_{teach}(t)) \\ &= W^{out}(t-1)^T - k(t)\, x(t)^T\, W^{out}(t-1)^T + P(t)\, x(t)\, f_{activation}^{-1}(d_{teach}(t)) \end{aligned}$$ (Equation 4.25)
Using the fact that $k(t) = P(t)\, x(t) = \Phi^{-1}(t)\, x(t)$, we get:

$$\begin{aligned} W^{out}(t)^T &= W^{out}(t-1)^T + k(t) \left[ f_{activation}^{-1}(d_{teach}(t)) - x(t)^T\, W^{out}(t-1)^T \right] \\ W^{out}(t) &= W^{out}(t-1) + \xi(t)\, k(t)^T \end{aligned}$$ (Equation 4.26)

Here the term in brackets is used to define the a priori error $\xi(t)$:

$$\xi(t) = f_{activation}^{-1}(d_{teach}(t)) - x(t)^T\, W^{out}(t-1)^T = f_{activation}^{-1}(d_{teach}(t)) - W^{out}(t-1)\, x(t)$$ (Equation 4.27)
Now, using (Equation 4.27), (Equation 4.20), (Equation 4.21) and (Equation 4.26) in the given order, we can write down the ESN-CRLS algorithm. In order to increase the efficiency of our implementation, we first multiply both the numerator and the denominator of (Equation 4.20) by $\lambda$ to get a simpler form of the gain vector:

$$k(t) = \frac{P(t-1)\, x(t)}{\lambda + x(t)^T\, P(t-1)\, x(t)}$$ (Equation 4.28)
Then, noticing that the term $P(t-1)\, x(t)$ appears in both the numerator and the denominator of $k(t)$, we introduce a new term $\pi(t)$ to simplify our notation:

$$\pi(t) = P(t-1)\, x(t)$$ (Equation 4.29)
Moreover, we expand our ESN structure to its most general definition. As a reminder, we give this definition once again here. Our model consists of $K$ input units with an activation (state) vector denoting the activation of the input layer at time $t$:

$u(t) = [u_1(t), u_2(t), u_3(t), \ldots, u_K(t)]^T$: Input Activation Vector

$N$ internal units that make up the DR with the corresponding state vector:

$x(t) = [x_1(t), x_2(t), x_3(t), \ldots, x_N(t)]^T$: Internal Activation Vector

and $L$ output units with the output state vector:

$y(t) = [y_1(t), y_2(t), y_3(t), \ldots, y_L(t)]^T$: Output Activation Vector

The synaptic weights are collected in four matrices: the $N \times K$ input weight matrix $W^{in}$, the $N \times N$ internal weight matrix $W$, the $N \times L$ output feedback weight matrix $W^{back}$ and the $L \times (N+K+L)$ output weight matrix $W^{out}$.

Additionally, we have a correlation matrix $P(t)$ with dimensions $(N+K+L) \times (N+K+L)$ and a gain vector $k(t)$ of dimension $(N+K+L) \times 1$. Also, $\xi(t)$ now denotes the vector of a priori errors coming from each of the $L$ outputs, defined as $\xi(t) = [\xi_1(t), \xi_2(t), \ldots, \xi_L(t)]^T$. The term $concat(x(t), u(t), y(t-1))$, which is used in the evaluation equation, appears in many places of the CRLS and the variants of it that will be mentioned in this section. Therefore, we introduce a new term for the concatenated ESN state (input, internal, output) vector, $\zeta(t) = concat(x(t), u(t), y(t-1))$, to tidy up the notation of these algorithms. Finally, the ESN-CRLS is given in Algorithm 3.
Algorithm 3: ESN-CRLS

Initialization:
  $0 \le \lambda \le 1$
  $P(0) = \delta^{-1} I, \quad \delta \ll 1$
  $W^{out}(0) = 0, \quad y(0) = 0$

Main Body:
for $t = 1, 2, \ldots, T$
  $x(t) = f_{activation}(W\, x(t-1) + W^{in}\, u(t) + W^{back}\, y(t-1))$
  $\zeta(t) = concat(x(t), u(t), y(t-1))$
  $y(t) = f_{activation}(W^{out}(t-1)\, \zeta(t))$
  $\xi(t) = f_{activation}^{-1}(d_{teach}(t)) - W^{out}(t-1)\, \zeta(t)$
  $\pi(t) = P(t-1)\, \zeta(t)$
  $k(t) = \pi(t) \,/\, (\lambda + \zeta(t)^T\, \pi(t))$
  $P(t) = \lambda^{-1} \left[ P(t-1) - k(t)\, \zeta(t)^T\, P(t-1) \right]$
  $W^{out}(t) = W^{out}(t-1) + \xi(t)\, k(t)^T$
end
During the initialization period of the ESN-CRLS, we set $P(0)$ such that the non-singularity of the correlation matrix $\Phi(0)$ is guaranteed. This is usually achieved by setting $P(0)$ equal to an identity matrix multiplied by a very big scalar. Setting $W^{out}(0) = 0$ is another common practice in the literature, which we also followed here. It is known that using any value other than zero for the initialization of $W^{out}$ does not have a significant effect on the convergence and steady state behavior of the algorithm, unless very large values are used for initialization [Farhang-Boroujeny 1998].
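To make the algorithm concrete, a minimal, self-contained sketch of the ESN-CRLS loop in Python/NumPy follows. The reservoir construction, the tanh activation, the spectral radius value and the placeholder input/teacher signals are our own illustrative assumptions, not prescriptions of the thesis:

import numpy as np

rng = np.random.default_rng(1)
N, K, L = 50, 1, 1                 # reservoir, input and output sizes (assumed)
M = N + K + L                      # length of the concatenated vector zeta(t)
lam, delta = 0.999, 1e-4           # forget rate and initialization constant

W = rng.uniform(-1, 1, (N, N)) * (rng.random((N, N)) < 0.1)   # sparse DR
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))               # assumed spectral radius
W_in = rng.uniform(-1, 1, (N, K))
W_back = rng.uniform(-1, 1, (N, L))
W_out = np.zeros((L, M))           # W_out(0) = 0
P = np.eye(M) / delta              # P(0) = delta^-1 I
x, y = np.zeros(N), np.zeros(L)    # x(0) = 0, y(0) = 0
f = np.tanh                        # activation; its inverse is arctanh

for t in range(1, 10001):
    u = rng.uniform(-0.5, 0.5, K)              # placeholder input signal
    d = np.array([0.3 * np.sin(0.01 * t)])     # placeholder teacher signal
    x = f(W @ x + W_in @ u + W_back @ y)       # exploitation
    zeta = np.concatenate([x, u, y])
    y = f(W_out @ zeta)                        # evaluation
    xi = np.arctanh(d) - W_out @ zeta          # a priori error xi(t)
    pi = P @ zeta                              # pi(t) = P(t-1) zeta(t)
    k = pi / (lam + zeta @ pi)                 # gain vector k(t)
    P = (P - np.outer(k, pi)) / lam            # P(t); zeta^T P = pi^T by symmetry
    W_out = W_out + np.outer(xi, k)            # weight update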
4.3 Known Problems of the RLS Algorithm
As we stated before, the RLS family of algorithms suffers from two main problems which may hamper their usage. First of all, their computational complexity is directly proportional to the square of the size of the $W^{out}$ we are using; in computational notation, $O(N^2)$. The second big problem is numerical instability in finite precision environments, which has been reported by many authors in the literature, e.g. [Ljung 1985] [Ardalan 1986] [Cioffi 1987] [Yang 1992] [Levin 1994]. The aim of this section is to explore the effects of these problems on ESN and RLS combinations and what kind of strategy should be followed in order to cope with them. We will look for the answers in the adaptive filtering literature, where these kinds of problems are studied in detail for other filter structures. We begin with the problem of computational complexity.
4.4 The Problem of Computational Complexity
The high computational complexity of the CRLS has limited its use in real world applications despite its attractive advantages. This led scientists to the development of fast RLS (FRLS) algorithms, whose computational complexity grows only linearly with the number of taps to be updated; they are on average $O(7M)$, with $M$ denoting the number of filter taps of an adaptive filter. FRLS algorithms preserve the nice properties of the CRLS, like a fast convergence that is independent of the eigenvalue spread of the correlation matrix. See [Cioffi 1984], [Slock 1991], [Carini 1999] for individual examples of those algorithms. A more detailed overview of FRLS algorithms can be found in [Haykin 1996], [Glentis 1996], [Farhang-Boroujeny 1998], [Bellanger 2001].
FRLS algorithms are based on the shift invariance property of the input vector in transversal FIR filters. At each time step, only one value actually changes in the input vector of the transversal filter: all past data samples are shifted by one, leaving the oldest sample out of the vector, and the new data value is injected into the first slot. Based on this property, it is possible to derive FRLS algorithms. The derivation process is not in the scope of this thesis but can be found in the given literature. However, the input vector to the RLS when used in the ESN context is the combination of the internal state vector $x(t)$ and the input and output vectors $u(t)$, $y(t)$. All values of $x(t)$ change at each time step. This means that the shift invariant structure of transversal filters is unfortunately not present in ESN-RLS combinations. From the adaptive filtering theory point of view, the weight update procedure of ESN-RLS resembles that of linear combiner structures (e.g. beamforming, radar array processing), for which the use of FRLS algorithms is unfortunately not possible. Therefore, we have to forgo the FRLS family for the online adaptation of ESNs.
Instead we propose the use of two other algorithms with $O(N)$ complexity. The first one is the world famous Least Mean Squares (LMS) algorithm, also known as the Widrow-Hoff rule [Widrow 1960]. It is a stochastic gradient algorithm; that is, the gradient of the error performance surface with respect to the free parameter vector changes randomly from iteration to iteration. In the many years since its invention, it has established itself as the workhorse of the adaptive filtering area, mainly because of its ease of implementation, low computational complexity and robust performance. As with every algorithm, it also has some drawbacks. The LMS algorithm converges slowly to its steady state: compared to the RLS, the LMS rate of convergence is an order of magnitude slower. This phenomenon is also reported for ESN-LMS combinations in [Jaeger 2001]. Another major drawback of the LMS is its sensitivity to the eigenvalue spread of the correlation matrix of the input vector. One way to overcome these limitations is to use projections of the input signal onto an orthogonal basis. This is usually attained by using variants of the algorithm which operate in the frequency domain, at the cost of additional computational complexity. Instead of using a transform domain ESN-LMS, using ESN-RLS provides a more convenient way. In an unpublished bachelor's thesis [Liebald 2004], the eigenvalue spread problem was tackled by using different specifically tailored ESN topologies. However, none of the proposed topologies performed better than the randomly created networks using Algorithm 1. How to shrink the eigenvalue spread of an ESN is a very hot topic for ESN research, as suggested by Jaeger in [Jaeger 2005]. Only when the eigenvalue spread of ESNs can be made smaller will the use of the LMS algorithm for the online adaptation of ESNs with good performance be possible. At the moment we only know that adding a random noise component to the input signal is useful in lowering the eigenvalue spread of an ESN, a trick that we know from using Extended Kalman Filtering for RNN learning [Jaeger 2002a]. Still, our basic experimentation with the algorithm revealed that it can be used for certain tasks in combination with ESNs in order to lower the computational complexity, given that the eigenvalue spread is in an acceptable interval. But this comes at the cost of slower convergence and lower numerical accuracy compared to cases where the CRLS algorithm is used for adaptation. Also note that we do not yet understand which tasks lead to a low eigenvalue spread and why; this must be found empirically for each task. In spite of all this, we decided to add the LMS algorithm to our performance tests. The ESN-LMS algorithm is given in Algorithm 4¹².
Algorithm 4: ESN-LMS

Initialization:
  $\mu \in \mathbb{R}$ is the user defined learning rate.
  $W^{out}(0) = 0, \quad y(0) = 0$

Main Body:
for $t = 1, 2, \ldots, T$
  $x(t) = f_{activation}(W\, x(t-1) + W^{in}\, u(t) + W^{back}\, y(t-1))$
  $\zeta(t) = concat(x(t), u(t), y(t-1))$
  $y(t) = f_{activation}(W^{out}(t-1)\, \zeta(t))$
  $\xi(t) = f_{activation}^{-1}(d_{teach}(t)) - W^{out}(t-1)\, \zeta(t)$
  $W^{out}(t) = W^{out}(t-1) + \mu \left[ \xi(t)\, \zeta(t)^T \right]$
end
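Compared to the CRLS sketch given after Algorithm 3, only the update lines change; a minimal sketch of the LMS step (with an assumed tanh output activation and an assumed learning rate mu) is:

import numpy as np

def esn_lms_update(W_out, zeta, d, mu=1.0):
    # One ESN-LMS weight update. zeta is the concatenated state vector,
    # d the teacher output; arctanh is the inverse of the assumed tanh
    # output activation. The P/pi/k recursions of CRLS disappear.
    xi = np.arctanh(d) - W_out @ zeta          # a priori error xi(t)
    return W_out + mu * np.outer(xi, zeta)     # W_out(t) = W_out(t-1) + mu xi zeta^T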
The LMS algorithm may also suffer from numerical instability during the weight update because of quantization errors. This usually becomes evident if the input signal does not have enough energy at all frequencies [Cioffi 1987]. Often, this effect can be remedied by introducing a leakage mechanism into the weight update equation [Zahm 1973]; the trade-off is a small degradation in steady state performance. A nearly equivalent technique to ensure stability is to add a small amount of uncorrelated noise to the input vector before the weight update [Werner 1983]. During our tests, we followed the noise insertion approach to achieve stability. One final note is on the convergence of the ESN-LMS. A common practice for choosing the learning rate $\mu$ is to use a small real number smaller than 1. However, during our experimentation with the ESN-LMS, we observed that one may also have to use very big values for $\mu$, especially (but not always) when the input signal to the network is scaled to a very compact interval around zero. Otherwise, the ESN-LMS does not converge or converges at a very languishing pace. One should consider this while tuning the parameters of the ESN-LMS for any given application.

12 Since our main interest is in the RLS algorithm throughout this thesis, we exclude the derivation of the LMS. The interested reader can find more information in the standard adaptive filtering literature, e.g. [Haykin 1996] or [Farhang-Boroujeny 1998].
Another interesting method for online learning with a computational complexity of $O(N)$¹³ is the recently introduced Backpropagation Decorrelation (BPDC) algorithm [Steil 2004], which is suitable for online ESN learning. The BPDC algorithm is strongly influenced by the ESN and LSM theories in that it only learns the output weights of recurrent neural networks; this is done in order to reduce the complexity. The theoretical inspiration of the algorithm comes from the RNN learning rule introduced by Atiya and Parlos in their RNN training unification paper [Atiya 2000]. The Atiya-Parlos Recurrent Learning (APRL) "is based on the idea to differentiate the error function with respect to the network states in order to obtain a virtual teacher target, with respect to which the weight changes are computed." [Steil 2004]. Basically, BPDC relies on three important principles: first, the one-step backpropagation of errors by means of virtual teacher forcing, as in the APRL algorithm; second, the usage of the short term memory in the network dynamics, which is adapted based on the decorrelation of the internal activations of the reservoir neurons; and finally, the use of a non-adaptive dynamical reservoir, as in the case of ESNs or LSMs, in order to attain the reduced computational complexity.

13 This $O(N)$ complexity is valid only when we have one output neuron in our ESN. When more than one output unit is used, the complexity of this algorithm becomes $O(N^2)$. See [Steil 2004].
In order to be able to use the BPDC algorithm for ESNs with nonlinear activation functions, we have to introduce some modifications to the exploitation and evaluation equations, following the notation given in [Steil 2004]. See (Equation 4.30).

Exploitation for ESN-BPDC:
$$x(t) = W\, f_{activation}(x(t-1)) + W^{in}\, f_{activation}(u(t)) + W^{back}\, f_{activation}(y(t-1))$$
Evaluation for ESN-BPDC:
$$y(t) = W^{out}(t)\, f_{activation}(concat(x(t), u(t), y(t-1)))$$ (Equation 4.30)

In the original definition, we apply the transfer function to the network's internal state elements after multiplying them with the corresponding synaptic weights. Now we do the exact opposite: we first apply the transfer function and then multiply with the weights. The ESN-BPDC algorithm in its general form with $O(N^2)$ complexity is given in Algorithm 5¹⁴. When only one output neuron is used, some expressions in the algorithm cancel out, decreasing the complexity of the ESN-BPDC to $O(N)$.
It should be noted that in [Steil 2004] the input signal is assumed to come from a compact interval with a mean value near zero. Our experience with ESN-BPDC also indicates the need for such a conditioning of the input signal. Therefore, signals that do not satisfy this requirement should be biased and scaled appropriately before being fed into the ESN as input. In a recent paper [Steil 2005], the stability of the BPDC method is also shown for many different cases. Therefore, we rely on this work, assume stability, and do not investigate it further.

14 As with the LMS algorithm, we skip the derivation of the BPDC algorithm. We also want to add that the notation we use here to express the algorithm is specific to the ESN case and is very different from the one used in the original paper [Steil 2004].
Algorithm 5: ESN-BPDC

Initialization:
  $\eta \in \mathbb{R}$ is the user defined learning rate.
  $\varepsilon \in \mathbb{R}$ is the user defined regularization constant.
  $W^{out}(0) = 0, \quad y(0) = 0$

Main Body:
for $t = 1, 2, \ldots, T$
  $x(t) = W\, f_{activation}(x(t-1)) + W^{in}\, f_{activation}(u(t)) + W^{back}\, f_{activation}(y(t-1))$
  $y(t) = W^{out}(t)\, f_{activation}(concat(x(t), u(t), y(t-1)))$
  $e(t) = y(t) - d_{teach}(t)$
  for $i = 1, 2, \ldots, L$
    $\gamma_i(t) = \sum_{l=1}^{L} \left[ w^{out}_{i,N+K+l}(t-1)\, f_{activation}^{-1}(y_l(t-1))\, e_l(t-1) \right] - e_i(t)$
    $D(t-1) = \sum_{k=1}^{N} f_{activation}(x_k(t-1))^2 + \sum_{k=1}^{K} f_{activation}(u_k(t-1))^2 + \sum_{k=1}^{L} f_{activation}(y_k(t-1))^2 + \varepsilon$
    for $j = 1, \ldots, N$:  $w^{out}_{ij}(t) = w^{out}_{ij}(t-1) + \eta\, \dfrac{f_{activation}(x_j(t-1))}{D(t-1)}\, \gamma_i(t)$
    for $j = N+1, \ldots, N+K$:  $w^{out}_{ij}(t) = w^{out}_{ij}(t-1) + \eta\, \dfrac{f_{activation}(u_{j-N}(t-1))}{D(t-1)}\, \gamma_i(t)$
    for $j = N+K+1, \ldots, N+K+L$:  $w^{out}_{ij}(t) = w^{out}_{ij}(t-1) + \eta\, \dfrac{f_{activation}(y_{j-N-K}(t-1))}{D(t-1)}\, \gamma_i(t)$
  end
end
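A compact sketch of this weight update, following our reconstruction of Algorithm 5 above (the function signature, the naming and the pre-assembled activation vector are our own assumptions and should be checked against [Steil 2004]):

import numpy as np

def bpdc_step(W_out, act_prev, gamma, eta=0.25, eps=0.002):
    # One BPDC output weight update.
    # act_prev : f_activation applied to the reservoir, input and output
    #            activations of the previous step, already concatenated
    #            into one vector of length N+K+L.
    # gamma    : length-L vector of virtual teacher terms gamma_i(t),
    #            as reconstructed in Algorithm 5 above.
    denom = np.sum(act_prev ** 2) + eps           # decorrelation denominator
    return W_out + eta * np.outer(gamma, act_prev) / denom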
4.5 The Problem of Numerical Instability
As we stated before, the RLS algorithm is advantageous over the LMS for two reasons. First, its convergence rate is an order of magnitude faster. Second, the convergence rate of the RLS is independent of the input signal statistics (i.e. the eigenvalue spread). Although we have proposed two non-RLS algorithms as substitutes to solve the computational complexity problem, today's microchip technology has reached very high processing speeds; complexity is therefore not a vital problem for most practical applications, provided that overly big reservoirs are not used. A more important problem than complexity is numerical instability, especially for applications with long term adaptation needs. Much of the work in the literature on stabilizing the RLS family of algorithms is concentrated on the fast versions, leaving a more limited number of studies on the $O(N^2)$ algorithms. Here, we will go through the prominent ones among these works in order to find suitable ESN-RLS combinations with good numerical stability. We will evaluate our findings in the next section under different experimental scenarios.
When adaptive filters (or any other filters in general) are implemented in finite precision environments (i.e. digitally), all values are quantized to a limited numerical precision. Because of that, quantization (round-off) errors are generated, which deviate the performance of the filter from its infinite precision performance. The amount of these errors is implementation dependent, hence it may vary from application to application based on the word lengths (i.e. number of bits) used [Ling 1984]. An error perturbation generated at an arbitrary point also has an effect on later iterations; that is, it is said to propagate. Continuous accumulation of such effects may reach levels where the deviations increase so much that the filter performance is no longer acceptable. If, for a given algorithm, the error accumulation grows without bound, then it is said to be unstable and its continuous use is unsuitable without further precautions [Liavas 1999]. For applications where the adaptive filter is used only to determine an unknown setting and is then kept fixed at that setting, such instabilities are usually not observed [Cioffi 1987]. We also experienced the same behavior during our simulations: algorithms usually tend to become unstable, or to deviate too much from the desired value, after a few tens of thousands of iterations or more. The effects of quantization errors on the RLS are very well studied by various authors; good examples include [Ljung 1985], [1991], [Yang 1992], [Haykin 1996], [Liavas 1999]. Here we do not aim to go into such details; rather, we will summarize what is said by these authors and then try to adapt their solutions to our problem. The interested reader can refer to the above references for more detailed information.
As with the other adaptive algorithms, the RLS is affected by quantization errors as well. Two main effects are generally held responsible for the numerical instability of the $O(N^2)$ type of RLS algorithms.
The first effect concerns the recursive computation of the inverse of the correlation matrix, $P(t)$. This is the earliest problem noted for the instability of the RLS algorithms [Hsu 1982] and has its origins in Kalman filtering theory. As a result of the accumulation of these errors, the matrix may become indefinite. Although this usually does not end up in an overflow, the response of the filter is nevertheless unacceptable. This effect is usually evident for input signals that do not satisfy the so-called persistent excitation condition given in [Ljung 1985]. This condition essentially states that the input signal must have sufficient energy at all frequencies to prevent the inverse of the correlation matrix from becoming negative definite [Cioffi 1987]. The condition can easily be met by adding some uncorrelated white noise to the input signal. The same technique is reported to be useful for the ESN-RLS case as well in [Jaeger 2002c].
A better method to eliminate a negative definite matrix is studied in [Hsu 1982]. This method focuses on calculating the inverse correlation matrix using special recursions that propagate the Upper-Diagonal-Upper-Transpose (UDU') factorization of $P(t)$. It aims at ensuring a positive definite matrix by keeping the symmetry of $P(t)$ while having positive entries along the diagonal. A neat version of this algorithm is given in [Yang 1994], which replaces (Equation 4.21) of the CRLS,

$$P(t) = \lambda^{-1} \left[ P(t-1) - k(t)\, x(t)^T\, P(t-1) \right]$$ (Equation 4.21)

by the following new recursion, which keeps the symmetry of $P(t)$:

$$P(t) = \operatorname{Tri}\!\left\{ \lambda^{-1} \left[ P(t-1) - k(t)\, \pi(t)^T \right] \right\}$$ (Equation 4.31)

Here a new operator $\operatorname{Tri}(\ldots)$ is introduced, which exploits the symmetry of the inverse correlation matrix to increase the efficiency of the algorithm. It does the computations on only the upper or lower triangular part of $P(t)$ and then copies the results to the opposite part. In this form the new algorithm has almost half the complexity of the ESN-CRLS. Based on these considerations, we give a new algorithm for the online adaptation of ESNs, which we call ESN Symmetric Conventional Recursive Least Squares (ESN-SCRLS), as Algorithm 6.
Algorithm 6: ESN-SCRLS

Initialization:
  $0 \le \lambda \le 1$
  $P(0) = \delta^{-1} I, \quad \delta \ll 1$
  $W^{out}(0) = 0, \quad y(0) = 0$

Main Body:
for $t = 1, 2, \ldots, T$
  $x(t) = f_{activation}(W\, x(t-1) + W^{in}\, u(t) + W^{back}\, y(t-1))$
  $\zeta(t) = concat(x(t), u(t), y(t-1))$
  $y(t) = f_{activation}(W^{out}(t-1)\, \zeta(t))$
  $\xi(t) = f_{activation}^{-1}(d_{teach}(t)) - W^{out}(t-1)\, \zeta(t)$
  $\pi(t) = P(t-1)\, \zeta(t)$
  $k(t) = \pi(t) \,/\, (\lambda + \zeta(t)^T\, \pi(t))$
  $P(t) = \operatorname{Tri}\!\left\{ \lambda^{-1} \left[ P(t-1) - k(t)\, \pi(t)^T \right] \right\}$
  $W^{out}(t) = W^{out}(t-1) + \xi(t)\, k(t)^T$
end
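The Tri(...) operator is simple to realize; a sketch (our own, assuming NumPy) that evaluates the update on the upper triangle and mirrors the result is:

import numpy as np

def tri_update(P, k, pi, lam):
    # Symmetry preserving P(t) update of (Equation 4.31): keep the upper
    # triangle of lam^-1 (P - k pi^T) and mirror it, so that P(t) stays
    # exactly symmetric despite round-off. (A real DSP implementation
    # would compute only the upper triangle to save half the work.)
    upper = np.triu((P - np.outer(k, pi)) / lam)  # upper triangle incl. diagonal
    return upper + np.triu(upper, 1).T            # copy to the lower triangle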
Although the ESN-SCRLS increases the numerical stability, it is not always enough to prevent divergence or overflow. The second main cause of RLS instability occurs in the weight update equation: the unbounded growth of floating point round-off errors causes the filter weights to grow very large, resulting in divergence or overflow. While the divergence which shows up when $P(t)$ becomes indefinite appears after a few tens of thousands of iterations, depending on the application type, this second form of divergence usually manifests itself on the order of a factor of 100 later than the first form [Cioffi 1987]. Therefore, in order to ensure that it is eliminated, extensive testing should be done with large datasets.

This kind of divergence is of a similar nature to that of the LMS algorithms. Therefore, it can be fixed by integrating a tap weight leakage mechanism into the RLS recursions, a technique also used for LMS stabilization [Cioffi 1987]. Since we already have the inverse correlation matrix at hand, a simpler scheme is possible: adding a constant periodically to the diagonal of $P(t)$ in order to ensure its good conditioning [Cioffi 1987] [Ardalan 1989]. While doing this it is wise to use the UDU' form of $P(t)$, since the SCRLS algorithm is more efficient than the CRLS. We call this new algorithm ESN Symmetric Conventional Recursive Least Squares Version 2 (ESN-SCRLS2); it is given as Algorithm 7.
It has been shown that the effect of floating point round-off errors on the weight update equation increases as the forgetting rate is chosen close to one [Ardalan 1986]. Especially when $\lambda = 1$ is used, the possibility of the second form of divergence is higher; in that case, the error propagation mechanism becomes unbounded and is of a random walk type [Ardalan 1987] [Slock 1991]. Therefore, for $\lambda = 1$, using ESN-SCRLS2 may not always be enough. Adali and Ardalan developed a stabilization method specifically for this case in [Adali 1991]. The technique resembles ESN-SCRLS2 in that a term is added to the diagonal of $P(t)$, but this time, instead of a constant value, a dynamic term is derived based on the statistics of the amount of change in the tap weights, $\Delta w(t)$. Our early experimentation showed the benefits of this technique, therefore we decided to add it to our algorithm collection. In the algorithm, the operator $E[\ldots]$ denotes expectation (i.e. expected value); $E[\Delta w(t)\, \Delta w(t)^T]$ therefore denotes the covariance matrix of the term $\Delta w(t)$. We refer to this algorithm as ESN Ardalan Recursive Least Squares (ESN-Ardalan-RLS), which is formulated as Algorithm 8.
Algorithm 7: ESN-SCRLS2

Initialization:
  $0 \le \lambda \le 1$
  $\tau \in \mathbb{N} \wedge \tau > 0$ is the user defined period.
  $\alpha \in \mathbb{R} \wedge \alpha > 0$ is the user defined diagonal constant.
  $P(0) = \delta^{-1} I, \quad \delta \ll 1$
  $W^{out}(0) = 0, \quad y(0) = 0$

Main Body:
for $t = 1, 2, \ldots, T$
  $x(t) = f_{activation}(W\, x(t-1) + W^{in}\, u(t) + W^{back}\, y(t-1))$
  $\zeta(t) = concat(x(t), u(t), y(t-1))$
  $y(t) = f_{activation}(W^{out}(t-1)\, \zeta(t))$
  $\xi(t) = f_{activation}^{-1}(d_{teach}(t)) - W^{out}(t-1)\, \zeta(t)$
  $\pi(t) = P(t-1)\, \zeta(t)$
  $k(t) = \pi(t) \,/\, (\lambda + \zeta(t)^T\, \pi(t))$
  $P(t) = \operatorname{Tri}\!\left\{ \lambda^{-1} \left[ P(t-1) - k(t)\, \pi(t)^T \right] \right\}$
  if $t \bmod \tau = 0$ then $P(t) = P(t) + \alpha I$ end
  $W^{out}(t) = W^{out}(t-1) + \xi(t)\, k(t)^T$
end
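The periodic diagonal loading of ESN-SCRLS2 is a one-line addition to the SCRLS loop; sketched below with our assumed names tau and alpha for the period and the diagonal constant:

import numpy as np

def periodic_diagonal_loading(P, t, tau=1000, alpha=1.0):
    # ESN-SCRLS2 regularization: every tau steps, add alpha to the
    # diagonal of P(t) to keep it well conditioned.
    if t % tau == 0:
        P = P + alpha * np.eye(P.shape[0])
    return P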
Algorithm 8: ESN-Ardalan-RLS

Initialization:
  $\lambda = 1$ (the method is designed for the infinite memory case)
  $P(0) = \delta^{-1} I, \quad \delta \ll 1$
  $W^{out}(0) = 0, \quad y(0) = 0$

Main Body:
for $t = 1, 2, \ldots, T$
  $x(t) = f_{activation}(W\, x(t-1) + W^{in}\, u(t) + W^{back}\, y(t-1))$
  $\zeta(t) = concat(x(t), u(t), y(t-1))$
  $y(t) = f_{activation}(W^{out}(t-1)\, \zeta(t))$
  $\xi(t) = f_{activation}^{-1}(d_{teach}(t)) - W^{out}(t-1)\, \zeta(t)$
  $\pi(t) = P(t-1)\, \zeta(t)$
  $k(t) = \pi(t) \,/\, (1 + \zeta(t)^T\, \pi(t))$
  $\Delta w(t) = k(t)\, \xi(t)$
  $P(t) = \operatorname{Tri}\!\left\{ P(t-1) - k(t)\, \zeta(t)^T\, P(t-1) \right\} + E[\Delta w(t)\, \Delta w(t)^T]$
  $W^{out}(t) = W^{out}(t-1) + \xi(t)\, k(t)^T$
end

The methods we have mentioned up to here are the ones most widely accepted by the adaptive filtering community, but one can still find other studies which try to guarantee the stability of RLS algorithms. Some examples are [Bottomley 1991], [Chansarkar 1997], [Horita 1999], [Douglas 2000] and others which we do not consider in this thesis. All of these attempts have some drawbacks. For example, the algorithm derived in [Chansarkar 1997] claims to have guaranteed stability but has an unacceptable computational complexity of $O(N^3)$. In [Horita 1999] the authors derived a leaky RLS algorithm which is claimed to be robust, but again with an unacceptable complexity of $O(N^3)$, caused by a direct matrix inversion present in the algorithm.¹⁵ In [Bottomley 1991], rather than deriving a new algorithm, the authors proposed some modifications to the fixed point arithmetic to limit the error propagation, thus ensuring stability. Since we are only considering the single precision floating point representation in this thesis, this method is out of our scope, although it looks promising at first sight.
Only with [Douglas 2000] did we achieve some promising results during our early experimentation. Using a novel least squares pre-whitening technique, the author derived a recursive algorithm to minimize the exponentially windowed least squares cost function. The algorithm embodies the good properties of QR Decomposition based RLS (QR-RLS)¹⁶ algorithms, like high numerical accuracy. Its performance is acceptable and its complexity is still of order $O(N^2)$. The algorithm is also claimed, through simulations, to be stable. However, the author followed an overly simplified approach in his stability analysis; thus, we think further experimentation should be made before concluding on the stability. We name this algorithm ESN Recursive Least Squares Pre-whitening (ESN-RLSP); it is given as Algorithm 9.

Algorithm 9: ESN-RLSP

Initialization:
  $0 \le \lambda \le 1$
  $P(0) = \delta^{-1} I, \quad \delta \ll 1$
  $W^{out}(0) = 0, \quad y(0) = 0$

Main Body:
for $t = 1, 2, \ldots, T$
  $x(t) = f_{activation}(W\, x(t-1) + W^{in}\, u(t) + W^{back}\, y(t-1))$
  $\zeta(t) = concat(x(t), u(t), y(t-1))$
  $y(t) = f_{activation}(W^{out}(t-1)\, \zeta(t))$
  $\xi(t) = f_{activation}^{-1}(d_{teach}(t)) - W^{out}(t-1)\, \zeta(t)$
  $v(t) = P(t-1)\, \zeta(t)$
  $r(t) = P(t-1)^T\, v(t)$
  $k(t) = \dfrac{1}{\|v(t)\|^2} \left[ \sqrt{\dfrac{\lambda}{\lambda + \|v(t)\|^2}} - 1 \right]$
  $P(t) = \dfrac{1}{\sqrt{\lambda}} \left[ P(t-1) + k(t)\, v(t)\, \zeta(t)^T \right]$
  $\Delta w(t) = \dfrac{\xi(t)\, r(t)}{\lambda + \|v(t)\|^2}$
  $W^{out}(t) = W^{out}(t-1) + \Delta w(t)^T$
end
We now continue with a new class of RLS algorithms which are implemented in a structurally different form than the ones mentioned up to here. This class of algorithms is very robust in terms of stability when compared to the above methods. The trade-off is an increase in computational complexity, but even this complexity is still bounded by $O(N^2)$. We will consider two main versions, the so-called QR Decomposition based RLS (QR-RLS) and the Inverse QR-RLS (IQR-RLS). From now on we may refer to this class of algorithms by the name Rotation Based Algorithms (RBAs).

15 We also experimented with [Chansarkar 1997] and [Horita 1999]. During those experiments, we observed overflows or unacceptable deviations from the desired signal, which contradicts the stability claims given in those articles. Because of the unacceptable complexity of these algorithms, we did not give further attention to finding out the possible reasons for our observations.
16 We will consider QR-RLS algorithms in the upcoming pages.
The QR-RLS algorithm solves the least squares minimization problem by working directly on the incoming data matrix via the QR Decomposition [McWhirter 1983]. This is in contrast with the CRLS and its variants, which work on the time averaged correlation matrix. The QR Decomposition can be computed using a variety of methods, among which the most popular ones are Givens rotations, Householder transformations and Gram-Schmidt orthogonalization.
preferred to other methods since it is computationally more efficient than the latter two
[Golub 1996]. Householder transformation can also be a good choice to implement
because theoretically it provides almost twice the better numerical accuracy than the
Givens method [Higham 1996]. It is known that QR-Decomposition when implemented
via Givens Rotations or Householder Transformations are numerically stable [Higham
1996]. The QR-RLS algorithm, when operating in finite environments, is shown to be
stable in a Bounded-Input-Bounded-Output (BIBO) manner [Leung 1989] [Liu 1991].
But it should be noted that BIBO stability does not always guarantee meaningful results.
In [Yang 1992], it is experimentally shown that when number of bits used is too small,
then the algorithm performance is unacceptable17. Experiments also show that better
accuracy is achieved when forgetting factor is chosen to be smaller than one. To present
the QR-RLS algorithm for ESN case, we used to notation used by the authors in [Sayed
1994]. Here we omit the derivation of the algorithm. Interested reader can refer to
[Sayed 1994] or [Haykin 1996]. The ESN-QR-RLS algorithm is given in Algorithm 10.
We would like to give a few more notes on Algorithm 10. The initialization of the algorithm depends on the size of the ESN used: if $W^{out}$ is an $L \times (N+K+L)$ matrix, then the initialization takes $N+K+L$ iterations, and during this period the a priori estimation errors $\varepsilon_l(t)$ for each of the $L$ outputs should be assumed to be zero. To calculate $W^{out}$, at each time step one has to compute the inverses of the $\Phi_l^{1/2}(t)$'s. Matrix inversion is usually a computationally demanding process; here, however, $\Phi_l^{1/2}(t)$ is a lower triangular matrix. This makes it possible to compute the inverse in $O(N^2)$ time via back-substitution, which exploits the lower triangular structure of the matrix. Due to this special type of back-substitution, which may otherwise result in a division by zero, the real values of the $W^{out}$ matrix are only accessible after the initialization period is finished. During the initialization period of ESN-QR-RLS, both $W^{out}$ and the ESN output should be assumed to be zero; hence the estimation error, which is computed from these two quantities, is also zero.

17 The use of five bits to express the values in digital form resulted in such an observation in [Yang 1992].
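To illustrate the back-substitution step, the sketch below (a textbook triangular solve of our own, not code from the thesis) computes one row of $W^{out}$ as $p_l(t)^T\, \Phi_l^{-1/2}(t)$ by solving a single triangular system in $O(N^2)$, without ever forming the explicit inverse:

import numpy as np

def row_from_triangular(L_half, p):
    # Solve L_half^T w = p by back-substitution, where L_half plays the
    # role of the nonsingular lower triangular factor Phi_l^{1/2}(t);
    # then w^T = p^T Phi_l^{-1/2}(t) is one row of W_out.
    n = len(p)
    w = np.zeros(n)
    for i in range(n - 1, -1, -1):             # L_half^T is upper triangular
        s = L_half[i + 1:, i] @ w[i + 1:]      # contribution of solved entries
        w[i] = (p[i] - s) / L_half[i, i]       # divide by the diagonal pivot
    return w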
Algorithm 10: ESN-QR-RLS

Initialization:
  $0 \le \lambda \le 1$
  $\Phi_l^{1/2}(0) = 0 \quad \forall\, l \in [1, L]$
  $p_l(0) = 0 \quad \forall\, l \in [1, L]$
  $W^{out}(0) = 0, \quad y(0) = 0$

Main Body:
for $t = 1, 2, \ldots, T$
  $x(t) = f_{activation}(W\, x(t-1) + W^{in}\, u(t) + W^{back}\, y(t-1))$
  $\zeta(t) = concat(x(t), u(t), y(t-1))$
  $y(t) = f_{activation}(W^{out}(t-1)\, \zeta(t))$
  for $l = 1, 2, \ldots, L$

$$\begin{bmatrix} \lambda^{1/2}\, \Phi_l^{1/2}(t-1) & \zeta(t) \\ \lambda^{1/2}\, p_l(t-1)^T & f_{activation}^{-1}(d_l(t)) \\ 0^T & 1 \end{bmatrix} \Theta(t) = \begin{bmatrix} \Phi_l^{1/2}(t) & 0 \\ p_l(t)^T & \varepsilon_l(t)\, \varrho_l^{1/2}(t) \\ \zeta(t)^T\, \Phi_l^{-1/2}(t)^T & \varrho_l^{1/2}(t) \end{bmatrix}$$

  end
  $W^{out}(t) = \begin{bmatrix} p_1(t)^T\, \Phi_1^{-1/2}(t) \\ p_2(t)^T\, \Phi_2^{-1/2}(t) \\ \vdots \\ p_L(t)^T\, \Phi_L^{-1/2}(t) \end{bmatrix}$
end

NOTE: $\Theta(t)$ is any unitary rotation that produces a block zero vector in the last column of the post-array by annihilating the elements of the concatenated state vector $\zeta(t)$ one by one. Also, $d_l(t)$ denotes the $l$-th element of the $d_{teach}(t)$ vector.
Because of the extra computational load introduced by computing the inverse of $\Phi_l^{1/2}(t)$, the ESN-QR-RLS is not a good choice for applications where the weights of $W^{out}$ need to be known explicitly (e.g. adaptive system identification). But we have the a priori estimation error calculated at each iteration, so we can omit the $W^{out}$ calculation for applications where this a priori estimation error value is enough. Prediction error filters, active noise control and adaptive noise canceling can be given as examples of such application scenarios. Additionally, another method called Extended QR-RLS (EQR-RLS), which avoids the computationally demanding back-substitution operation, can also be used for the online adaptation of ESNs [Yang 1992]. However, this algorithm is not necessarily stable, and the methods to make it stable are computationally expensive [Moonen 1990]. Therefore, its use is not suggested in the literature [Haykin 1996].
A better method is proposed in [Alexander 1993] under the name IQR-RLS. It is more efficient than the QR-RLS algorithm in that it avoids the calculation of a lower triangular matrix inverse via back-substitution. Unlike the QR-RLS, the IQR-RLS operates on the inverse of the correlation matrix, hence its name. It also shares the good numerical stability of the QR-RLS algorithm. Haykin states that the algorithm is stable for $\lambda < 1$, whereas for $\lambda = 1$ a single error perturbation is not contractive, and thus the accumulation of such errors may lead to divergence [Haykin 1996]. According to the simulation results given in the original paper [Alexander 1993], the algorithm stayed stable over extremely long datasets (i.e. one million samples). Those simulations also revealed that the algorithm has an outstanding numerical accuracy. As with the QR-RLS, we again omit the derivation; the interested reader can refer to [Alexander 1993] or [Haykin 1996]. The IQR-RLS in the ESN context can be written (again using the notation from [Sayed 1994]) as in Algorithm 11.
Our discussion of the numerical instability problem of the RLS family of algorithms ends here. In the next chapter, we continue with our performance tests, in which all of the above algorithms are compared with each other under similar conditions.
Algorithm 11: ESN-IQR-RLS

Initialization:
  $0 \le \lambda \le 1$
  $P(0) = \delta^{-1} I, \quad \delta \ll 1$
  $W^{out}(0) = 0, \quad y(0) = 0$

Main Body:
for $t = 1, 2, \ldots, T$
  $x(t) = f_{activation}(W\, x(t-1) + W^{in}\, u(t) + W^{back}\, y(t-1))$
  $\zeta(t) = concat(x(t), u(t), y(t-1))$
  $y(t) = f_{activation}(W^{out}(t-1)\, \zeta(t))$
  $\xi(t) = f_{activation}^{-1}(d_{teach}(t)) - W^{out}(t-1)\, \zeta(t)$

$$\begin{bmatrix} 1 & \lambda^{-1/2}\, \zeta(t)^T\, P^{1/2}(t-1) \\ 0 & \lambda^{-1/2}\, P^{1/2}(t-1) \end{bmatrix} \Theta(t) = \begin{bmatrix} \varrho^{-1/2}(t) & 0^T \\ k(t)\, \varrho^{-1/2}(t) & P^{1/2}(t) \end{bmatrix}$$

  $W^{out}(t) = W^{out}(t-1) + \xi(t)\, k(t)^T$
end

NOTE: $\Theta(t)$ is any unitary rotation that produces a block zero row vector in the first row of the post-array by annihilating the elements of the term $\lambda^{-1/2}\, \zeta(t)^T\, P^{1/2}(t-1)$ one by one.
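The rotation $\Theta(t)$ in both array algorithms is typically composed of elementary Givens rotations. The following generic sketch (our own illustration, not the full IQR-RLS recursion) shows how one Givens rotation annihilates a single entry of a row:

import numpy as np

def givens(a, b):
    # c, s such that [a b] [[c, -s], [s, c]] = [r, 0]
    r = np.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r

A = np.array([[3.0, 4.0],
              [1.0, 2.0]])
c, s = givens(A[0, 0], A[0, 1])
Theta = np.array([[c, -s],
                  [s,  c]])       # a unitary (orthogonal) rotation
print(A @ Theta)                  # first row becomes [5, 0]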
5. PERFORMANCE ANALYSIS OF THE ONLINE ADAPTATION ALGORITHMS
5.1 Introduction
In this section we present the results obtained during our simulations. In these simulations, we evaluated all methods with respect to their numerical stability and steady state error performance. Apart from those, the computational complexities of some selected algorithms are also discussed in Appendix B.

The following algorithms were used during the testing phase: ESN-CRLS, ESN-SCRLS, ESN-SCRLS2, ESN-Ardalan-RLS, ESN-RLSP, ESN-QR-RLS, ESN-IQR-RLS, ESN-LMS and ESN-BPDC. Two different experimental setups, described later in this chapter, were utilized as our testbeds. To ensure stability in a reliable way, we ran all algorithms on the given setups for a few million iterations. For a given experiment, all of the algorithms use the same ESN; this is done to ensure objectivity when comparing the different algorithm performances. During our early experimentation with these algorithms (in the implementation period), we observed that all of them are also sensitive to the appropriate scaling and biasing of the input signals. These two parameters not only affect the steady state error performance, but also have an important impact on the numerical stability. Therefore, the scaling and bias parameters varied from algorithm to algorithm; by using different parameters, we aimed to get the best results from each of the algorithms. Additionally, we chose the parameters of each algorithm according to the relevant precautions from the adaptive filtering literature, in order to achieve good numerical stability for each run. Based on our experience with these methods on short data sets (i.e. around 10,000 to 50,000 samples), we expected many of them to remain stable for most of the test cases.
We chose the IEEE 754 Single Precision Floating Point Format to store all data in our tests. This format uses thirty-two (32) bits to express a real number in hardware: one (1) bit is reserved for the sign, eight (8) bits for the exponent and, lastly, twenty-three (23) bits for the mantissa. Our motivation for this decision comes from the current state of the art of Digital Signal Processor (DSP) technology; see [Eyre 2000] for a good overview of the evolution of DSPs.

Adaptive filtering applications are usually implemented in embedded systems. The computation of adaptive filtering algorithms is mainly composed of arithmetic operations. DSPs are specialized for exactly these kinds of applications, where extensive arithmetic computation is required. Thus, they offer impressive performance, scalability and ease of use when compared to other embedded architectures (e.g. FPGA, ASIC, VLSI etc.). High performance is achieved by implementing sophisticated hardware techniques inside the DSP chips, like intensive pipelining for high frequency operation or parallel functional units to execute multiple instructions at the same time. Two types of data formats are mostly used inside DSP chips, namely fixed point and floating point. Most of the chips in the market use fixed point arithmetic because of low power consumption and pricing issues. On the other hand, the numerical accuracy and the range are very limited in fixed point architectures when compared to the floating point representation; because of this, numerical instability problems are much worse on fixed point architectures. Floating point operations are usually emulated by sophisticated assembly language tricks on fixed point systems, which results in an increased number of cycles per operation; on floating point architectures, the same operations take only one cycle, due to the specialized arithmetic units present. A direct conclusion from these points is that floating point DSPs, which offer an easier design process to developers, also lead to more efficient applications. However, pricing issues hampered the use of floating point DSPs for a long period, until the 2000s. With the recent advances of the technology, and also as a result of a highly competitive market, the price per unit of DSP chips has now decreased down to the ten dollar limit [TI 2002]. The pricing problems are thus becoming less significant, and floating point DSPs are gaining much wider acceptance [Etalk 2004] [RTC 2004]. In most cases, floating point DSP chips use 32 bit registers [Analog 2005] [TI 2005]. Based on these facts, we decided to use single precision floating point numbers during our stability tests.
Our testing (evaluation) philosophy is as follows: first, we ran all algorithms on the different setups for long time spans to check their numerical stability over time. If any of the algorithms succeeded in remaining stable, we then compared the steady state error performance to declare the winner for that particular setup. At the end of all experiments, also taking the computational complexities into account, we made our final comments on the algorithms. Our main intention in doing these tests is to show that ESN-RLS combinations can be made numerically stable under application scenarios which require long term $W^{out}$ updates due to constantly changing statistical properties of the signals to be processed. Please note that we do not guarantee one hundred percent stability under all circumstances. Additionally, as a natural outcome of this, we do not claim to prove or disprove the stability of the different algorithms mathematically. Also note that such a detailed concentration on stability is usually not needed for applications where only short term adaptation is required.
5.2 Experimental Setups
We chose adaptive nonlinear system identification as our first experimental setup; see Figure 9. Two different benchmark signals, introduced in the recurrent neural network training unification paper by Atiya & Parlos [Atiya 2000], are used for learning. We also modified the signals in order to introduce time-varying behavior. The first signal is a second order dynamical system governed by (Equation 5.1):

$$y(t+1) = \alpha\, y(t) + \beta\, y(t)\, y(t-1) + \gamma\, u^3(t) + \delta$$ (Equation 5.1)
Figure 9: Block diagram of an ESN when used as an adaptive system identifier. u(t), y(t) and d(t) are the input, the ESN response and the desired response respectively. At each time step, the Adaptive Algorithm (i.e. ESN-CRLS) updates $W^{out}$ using the error signal e(t).
Mild time-varying statistics for this signal are achieved by using variable coefficients $\alpha, \beta, \gamma, \delta$. At each time step we change them by a factor of 1% around their original values of 0.4, 0.4, 0.6 and 0.1 respectively. The input $u(t)$ is uncorrelated uniform noise drawn from the interval $[-0.5, 0.5]$. (See Figure 10)
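As we read (Equation 5.1) and the coefficient prescription above, a generator for this benchmark can be sketched as follows (the exact per-step randomization of the coefficients is our assumption):

import numpy as np

def second_order_system(T, seed=0):
    # y(t+1) = a y(t) + b y(t) y(t-1) + c u(t)^3 + e, with the four
    # coefficients wandering ~1% around (0.4, 0.4, 0.6, 0.1).
    rng = np.random.default_rng(seed)
    base = np.array([0.4, 0.4, 0.6, 0.1])
    u = rng.uniform(-0.5, 0.5, T)
    y = np.zeros(T)
    for t in range(1, T - 1):
        a, b, c, e = base * (1 + 0.01 * rng.uniform(-1, 1, 4))
        y[t + 1] = a * y[t] + b * y[t] * y[t - 1] + c * u[t] ** 3 + e
    return u, y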
Our second signal is a more difficult system to model. It is a 10th order Nonlinear Autoregressive Moving Average (NARMA) system, defined by (Equation 5.2). This signal is also used in [Jaeger 2002c], and we follow the very same prescriptions that are used there.

$$y(t+1) = \tanh\!\left( \alpha\, y(t) + \beta\, y(t) \left[ \sum_{i=0}^{9} y(t-i) \right] + \gamma\, u(t-9)\, u(t) + \delta \right)$$ (Equation 5.2)
To ensure the non-stationarity of the signal, we let the coefficients $\alpha, \beta, \gamma, \delta$ vary at periodic intervals by a factor of ±50% around their original values of 0.3, 0.05, 1.5 and 0.1 respectively. This kind of harsh coefficient variation affects the signal behavior strongly, as is apparent in Figure 11. Under some combinations of $\alpha, \beta, \gamma, \delta$ values this system may diverge in an explosive manner. In order to
Figure 10: First 250 samples of the second order nonlinear dynamical system
prevent such unwanted effects, we used $\tanh(x)$ as a limiter. The input to the system is again uniform random noise, drawn from the interval $[0, 0.5]$.
We used the following error function to evaluate the algorithm performances. It is the Normalized Mean Square Error (Equation 5.3), as defined in [Atiya 2000]:

$$NMSE = \frac{\sum_{t=1}^{T} \left( y(t) - d(t) \right)^2}{\sum_{t=1}^{T} d^2(t)}$$ (Equation 5.3)
In (Equation 5.3), $T$ is the number of filtered samples, $d(t)$ is the desired response and $y(t)$ is the ESN output at time $t$. The NMSE provides a more objective way of evaluating results, especially when the number of samples is high. We used this definition from [Atiya 2000] in order to be in line with the standard literature.

Our second setup is another well known adaptive filtering application type, adaptive noise cancellation [Widrow 1975]. In this experiment, we tried to enhance a music signal which is corrupted by a non-stationary noise, using an ESN Adaptive Noise Canceler (ESN-ANC). (See Figure 12)

Figure 11: The effects of the harsh parameter variations on the behavior of the 10th order NARMA system are apparent in this figure. Notice the periodic jumps at every 2000th sample.
The experimental setup is prepared as follows. The music signal is ripped from a commercial music CD in the wave file (.wav) format. As the noise source, we used a real noise recording which has been made public on the Internet. It is a recording of speech babble; the source of this babble is 100 people speaking in a canteen. The room radius is over two meters; therefore, individual voices are slightly audible [TNO 1990a]. The transfer functions $H(z)$ and $S(z)$ of the primary and the secondary paths, respectively, are obtained from the companion diskette coming with the book [Kuo 1996]; those functions were measured from an experimental setup by the authors. Using the transfer function $H(z)$, we get $x'(n)$, the correlated version of $x(n)$. This is done by Infinite Impulse Response (IIR) filtering of the noise signal with the transfer function, as prescribed in the same book. Then $x'(n)$ is summed with $u(n)$ in order to form the corrupted music signal, $d(n)$. Using the transfer function of the secondary path, $S(z)$, we synthesized $x''(n)$, which is the input signal that is fed into the ESN-ANC.
The ESN-ANC tries to estimate a cleaner form of $u(n)$, which is corrupted by the signal $x'(n)$, using $x''(n)$ as the reference noise. The ESN-ANC response $y(n)$ is subtracted from the corrupted signal $d(n)$ at each time step to get the noise-cleaned $u'(n)$. The same $u'(n)$ is also fed back to the adaptive algorithm in order to update $W^{out}$.

Figure 12: Block diagram of an ESN-ANC. The original signal u(n) is corrupted by the noise element x'(n). x'(n) is the correlated version of the main noise source x(n), which passes through the primary path H(z). The ESN-ANC tries to estimate the noise on u(n) by using x''(n) as its reference, which is formed after x(n) passes the secondary path S(z). The ESN response y(n), which is an estimate of x'(n), is subtracted from the corrupted signal d(n) in order to get the cleaned version of the original signal, u'(n). Meanwhile, the Adaptive Algorithm (i.e. RLS) updates $W^{out}$ to be able to cope with the time varying properties of x(n).
While evaluating the results of the ESN-ANC, we also used an additional performance criterion, the Signal to Noise Ratio (SNR), which is defined as follows:

$$SNR = 10 \times \log_{10} \frac{\sum_{n=1}^{N} u'^2(n)}{\sum_{n=1}^{N} \left( u(n) - u'(n) \right)^2}$$

The SNR is a more frequently used evaluation criterion than the NMSE for measuring the performance of an audio application.
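Both evaluation criteria are straightforward to compute; a small helper sketch (our own) is:

import numpy as np

def nmse(y, d):
    # Normalized Mean Square Error of (Equation 5.3)
    return np.sum((y - d) ** 2) / np.sum(d ** 2)

def snr_db(u, u_cleaned):
    # Output SNR in dB as defined above: power of the cleaned signal
    # over the power of the residual noise u(n) - u'(n)
    return 10.0 * np.log10(np.sum(u_cleaned ** 2)
                           / np.sum((u - u_cleaned) ** 2))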
5.3 Experimentation Phase and the Results
We now continue with the details of our experimentation phase and the corresponding results.

5.3.1 Adaptive System Identification Setup

During our tests under the adaptive system identification setup, we evaluated three test cases. In the first case, we used a forgetting factor of 0.999 ($\lambda = 0.999$) for each of the algorithms and ran the networks for one million samples. In the second case we again ran them for one million samples, using a forgetting factor of 1 ($\lambda = 1$). In the last test case, we again used $\lambda = 1$ but this time ran the networks for 5 million samples. These three cases were repeated for both the second order nonlinear system and the 10th order NARMA system. We now go on to the results obtained with the first system. Note that although we were changing the value of the forgetting rate, some algorithms are not affected by this, since they do not include the forgetting rate in their computation; these are ESN-LMS, ESN-BPDC and ESN-Ardalan-RLS. Despite the fact that they do not use $\lambda$, we ran them in all tests in order to have more strongly grounded evidence regarding their stability. Moreover, we added a small amount of random uncorrelated noise to the input of these algorithms to attain a better conditioning of the correlation matrix. By running them in each test, we can better see the effects of the noise on the steady state performance and the numerical stability of the given algorithms. The same remarks also hold for the adaptive noise cancellation experiments.
5.3.1.1 Identifying the Second Order Nonlinear Dynamical System
In this test we used an 50 neuron reservoir with density of 0.1 and spectral radius of 0.3.
Our ESN is in its most general form where input to output and output to output
connections are also present. The W back matrix is also used in evaluation of the
internal states. Summing up we have 52 coefficients to be updated in our W out . We
skipped a very common practice of ESN theory which is running the network freely
with the input signal for some time without doing weight update in order to wash-out
the initial transient effects of the reservoir. Keep in mind that it takes some time before
all of the given algorithms converge depending on the filter length. Therefore, we think
it is not suitable to run networks freely for some additional time, which is a general
practice in offline training of ESNs. This does not make sense under adaptive filtering
context since convergence rate is of high importance in many applications. We ought
not prolong the transient period of the filter by doing so.
Algorithm dependent parameters can be best seen from the Table 1. Additionally, with
the ESN-SCRLS2 we have a diagonal constant of 1, = 1 when = 0.999 and
= 104 when = 1 . For the ESN-LMS we used a learning rate of 1, = 1 .
The learning rate and the regularization constant used for the ESN-BPDC are
= 0.25 and = 0.002 respectively. These parameters are estimated via testing
on small datasets with 1000 to 50000 samples. Even with those small data sets, we
experience that the parametrization has crucial role on both numerical stability and
accuracy. Slight changes may result in unacceptable deviations from the desired
performance. We optimized these parameters to some extend for each of the algorithms
until we got acceptable results.
Algorithm Scale Factor Bias Noise Added
ESN-CRLS 0.0005 0 Yes – 5 %
ESN-SCRLS 0.0005 0 Yes – 2.5 %
ESN-SCRLS2 0.0005 0 Yes – 2.5 %
ESN-Ardalan-RLS 0.0005 0 Yes – 2.5 %
ESN-RLSP 0.5 0 No
ESN-QR-RLS 0.75 0 No
ESN-IQR-RLS 0.75 0 No
ESN-LMS 0.45 0 Yes – 2.5 %
ESN-BPDC 0.075 0 No
Table 1: Algorithm dependent parameters used for the identification of the 2nd order nonlinear dynamical system.
Using the above setup, the results for all test cases are given in Table 2, Table 3 and Table 4, and also in the corresponding figures, Figure 13, Figure 14 and Figure 15. Note that we did not include the first 1000 iterations when calculating the NMSE, in order to discard the transient response of the ESN before it reached its steady state. Discarding the first samples is done not only for this experiment but also for the rest.
Algorithm NMSE
ESN-CRLS Overflow
ESN-SCRLS Overflow
ESN-SCRLS2 0.16037
ESN-Ardalan-RLS 0.16037
ESN-RLSP 0.0026872
ESN-QR-RLS 0.0029747
ESN-IQR-RLS 0.0036439
ESN-LMS 0.18792
ESN-BPDC 0.18059
Table 2: Results of the test case 1: fRate < 1 (λ = 0.999) and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System
Algorithm NMSE
ESN-CRLS 0.15977
ESN-SCRLS 0.15977
ESN-SCRLS2 0.16609
ESN-Ardalan-RLS 0.15977
ESN-RLSP 0.0066470
ESN-QR-RLS 0.0057434
ESN-IQR-RLS 0.0054215
ESN-LMS 0.18751
ESN-BPDC 0.18003
Table 3: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System
Figure 13: Results of the test case 1: fRate < 1 (λ = 0.999) and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System
Algorithm NMSE
ESN-CRLS 0.16036
ESN-SCRLS 0.16036
ESN-SCRLS2 0.16669
ESN-Ardalan-RLS 0.16036
ESN-RLSP 0.0048698
ESN-QR-RLS 0.029552
ESN-IQR-RLS 0.0048431
ESN-LMS 0.18712
ESN-BPDC 0.18033
Table 4: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 2nd Order Nonlinear Dynamical System
Figure 14: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 2nd Order Nonlinear Dynamical System
Here it is interesting to see that ESN-QR-RLS performed worse than the ESN-RLSP and the ESN-IQR-RLS when the forgetting rate is equal to one. However, one should not forget that the numerical accuracy achieved by the ESN-QR-RLS is still much better than that of the other algorithms. In Figure 16, observe that the stepwise squared-error performance of ESN-QR-RLS degrades steadily as the number of iterations increases; in this example the forgetting rate is set to one. This is in line with the observations given in [Yang 1992], namely that better numerical accuracy is achieved when λ is chosen smaller than 1. It can also be related to the poor tracking performance of the ESN-QR-RLS: it is a well known effect that the tracking performance of any RLS based algorithm decreases as the forgetting rate approaches one, and at the limit, when λ = 1, it is the worst.
Figure 15: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 2nd Order Nonlinear Dynamical System
5.3.1.2 Identifying the 10th Order Nonlinear System
In this experiment we used a 100 neuron reservoir. The density and the spectral radius of the DR are 0.1 and 0.99 respectively. The ESN is again in the most general form, as in the previous experiment.
Algorithm dependent parameters are given in Table 5. Additionally, for the ESN-SCRLS2 we use a diagonal constant of δ = 1 when λ = 0.999 and δ = 2500 when λ = 1. For the ESN-LMS we used a learning rate of μ = 1/2500. The learning rate and the regularization constant used for the ESN-BPDC are η = 0.75 and ε = 0.0002 respectively.
Before going on to the results, we want to mention that better results are achieved for the same identification task in [Jaeger 2002c] by using a DR with squared activations. In this way it is possible to increase the nonlinearity, at the cost of an increased W out size. Jaeger used around two hundred taps, whereas we have only a hundred. As we already mentioned, our main intention in these experiments is to test the numerical stability of the different online adaptation algorithms, not to achieve the best numerical accuracy.
Figure 16: Stepwise squared error graph of the ESN-QR-RLS with fRate = 1 during identification of the second order nonlinear dynamical system for five million samples. Observe that the value of the squared error increases, hence the performance decreases, as the number of iterations increases
Algorithm Scale Factor Bias Noise Added
ESN-CRLS 0.001 0 Yes – 5 %
ESN-SCRLS 0.001 0 Yes – 2.5 %
ESN-SCRLS2 0.001 0 Yes – 2.5 %
ESN-Ardalan-RLS 0.001 0 Yes – 2.5 %
ESN-RLSP 0.001 0 No
ESN-QR-RLS 0.1 0 No
ESN-IQR-RLS 0.1 0 No
ESN-LMS 0.001 0 Yes – 2.5 %
ESN-BPDC 0.001 0 No
Table 5: Algorithm dependent parameters used for the identification of the 10th Order NARMA System.
Results of this experiment are given in Table 6, Table 7 and Table 8, and in the corresponding figures, Figure 17, Figure 18 and Figure 19.
Algorithm NMSE
ESN-CRLS 28417, Useless
ESN-SCRLS Overflow
ESN-SCRLS2 0.0064756
ESN-Ardalan-RLS 0.0064809
ESN-RLSP 0.0078000
ESN-QR-RLS 0.0039322
ESN-IQR-RLS 0.0051296
ESN-LMS 0.014239
ESN-BPDC 0.013382
Table 6: Results of the test case 1: fRate < 1 (λ = 0.999) and 1000000 (One Million) Samples for the 10th Order NARMA System
Algorithm NMSE
ESN-CRLS 0.0062685
ESN-SCRLS 0.0062662
ESN-SCRLS2 0.0059918
ESN-Ardalan-RLS 0.0062696
ESN-RLSP 0.025600
ESN-QR-RLS 0.0091669
ESN-IQR-RLS 0.0065988
ESN-LMS 0.013930
ESN-BPDC 0.013098
Table 7: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 10th Order NARMA System
Figure 17: Results of the test case 1: fRate < 1 (λ = 0.999) and 1000000 (One Million) Samples for the 10th Order NARMA System
Algorithm NMSE
ESN-CRLS 0.0073585
ESN-SCRLS 0.0073480
ESN-SCRLS2 0.0068149
ESN-Ardalan-RLS 0.0073510
ESN-RLSP 0.0620620
ESN-QR-RLS 0.0075911
ESN-IQR-RLS 0.0063173
ESN-LMS 0.013072
ESN-BPDC 0.011740
Table 8: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 10th Order NARMA System
Figure 18: Results of the test case 2: fRate = 1 and 1000000 (One Million) Samples for the 10th Order NARMA System
5.3.2 Adaptive Noise Cancellation
Now we continue with the results of our Adaptive Noise Cancellation experiments. Here we used two test cases and processed 4500000 (four and a half million) samples in each of them. Again, different forgetting rates were used for the two test cases, as in the system identification setup: λ = 0.9999 for test case one and λ = 1 for test case two.
During this experiment we used a very small ESN. It has a fully connected reservoir with 10 units, and the spectral radius is set to 0.99. Output to output connections and the W back matrix are not used in this experiment. Therefore, counting the one input to output connection, we have a W out of size 1×11.
The algorithm dependent parameters are given in Table 9. Additional parameters are used for ESN-SCRLS2, ESN-LMS and ESN-BPDC. The diagonal constant used for ESN-SCRLS2 is equal to one, δ = 1. The learning rates for ESN-LMS and ESN-BPDC are μ = 0.005 and η = 0.1 respectively. The regularization constant is set to ε = 0.002 for the ESN-BPDC algorithm.
Figure 19: Results of the test case 3: fRate = 1 and 5000000 (Five Million) Samples for the 10th Order NARMA System
Algorithm Scale Factor Bias Noise Added
ESN-CRLS 0.0005 0 Yes - 5 %
ESN-SCRLS 0.0005 0 Yes - 5 %
ESN-SCRLS2 0.0005 0 Yes - 5 %
ESN-Ardalan-RLS 0.0005 0 Yes – 2.5 %
ESN-RLSP 0.0005 0 Yes – 2.5 %
ESN-QR-RLS 0.0005 0 No
ESN-IQR-RLS 0.0005 0 No
ESN-LMS 0.1 0 Yes – 2.5 %
ESN-BPDC 0.01 0 Yes – 2.5 %
Table 9: Algorithm dependent parameters used for the Adaptive Noise Cancellation
Based on this parametrization, our results are given in Table 10 and Table 11 and in the figures, Figure 20 and Figure 21. Note that, just as with the NMSE, the first 1000 iterations are discarded in the calculation of the SNR.
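A sketch of the corresponding SNR computation, under the assumption that the SNR is measured as the ratio of desired-signal power to residual-error power in dB; the thesis's precise definition is given earlier.

```python
import numpy as np

def snr_db(clean, error, discard=1000):
    """Output SNR in dB: power of the desired signal over residual error power."""
    s, e = np.asarray(clean)[discard:], np.asarray(error)[discard:]
    return float(10.0 * np.log10(np.mean(s ** 2) / np.mean(e ** 2)))
```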
Algorithm NMSE SNR
ESN-CRLS 276.3157, Useless 0.0026, Useless
ESN-SCRLS Overflow Overflow
ESN-SCRLS2 Overflow Overflow
ESN-Ardalan-RLS 6.3874, Useless 0.6331, Useless
ESN-RLSP 0.0771 11.4546
ESN-QR-RLS 0.0495 13.2566
ESN-IQR-RLS 0.0492 13.2859
ESN-LMS 0.0828 11.1654
ESN-BPDC 0.0837 11.1214
Table 10: Results of the Test Case 1: fRate < 1 (λ = 0.9999) and 4500000 Samples for Adaptive Noise Canceling
Algorithm NMSE SNR
ESN-CRLS 19.4080, Useless 0.1965, Useless
ESN-SCRLS 53.9304, Useless 0.0667, Useless
ESN-SCRLS2 0.0806 11.2714
ESN-Ardalan-RLS 0.0876 10.9411
ESN-RLSP 0.0791 11.3496
ESN-QR-RLS 0.0640 12.2079
ESN-IQR-RLS 2.881, Useless 1.2997, Useless
ESN-LMS 0.0828 11.1654
ESN-BPDC 0.0837 11.1214
Table 11: Results of the Test Case 2: fRate = 1 and 4500000 Samples for the Adaptive Noise Cancellation
Figure 20: Results of the Test Case 1: fRate < 1 (λ = 0.9999) and 4500000 Samples for the Adaptive Noise Cancellation
5.4 Comments on the Results
We start commenting on the results by dividing our algorithms into three classes. In the first class, which we name the Linear Time Algorithms (LTA), we have ESN-LMS and ESN-BPDC. In fact, none of the algorithms given in this thesis is of truly linear time complexity. This is due to the matrix-vector multiplication (the matrix W multiplied by the vector x(t)) present in the evaluation equation; therefore all of the algorithms have a complexity that grows at least with the square of the DR size. But if the main bodies of ESN-LMS and ESN-BPDC are considered alone (i.e., when the evaluation and exploitation steps are omitted), then they are of linear complexity. See Appendix B for details. The second class is called the CRLS Variant Algorithms (CVA); it includes ESN-CRLS, ESN-SCRLS, ESN-SCRLS2 and ESN-Ardalan-RLS, all of which are based on the ESN-CRLS with some changes. Our third and last class is called the Rotation Based Algorithms (RBA), since its members are based on orthogonal rotations; this class includes the ESN-QR-RLS and the ESN-IQR-RLS. Now
we can continue to our general observations. We keep the ESN-RLSP out of either class because it could be included in both: it resembles the CVAs structurally and the RBAs numerically. We will comment on the ESN-RLSP separately.
Figure 21: Results of the Test Case 2: fRate = 1 and 4500000 Samples for the Adaptive Noise Cancellation
We begin our discussion with some statistics on the stability of the algorithms. In total, we had eight test cases: six for adaptive nonlinear system identification and two for adaptive noise canceling. Among them, we used λ < 1 three times and λ = 1 five times. Exceptions are the ESN-Ardalan-RLS, ESN-LMS and the ESN-BPDC, where the forgetting rate parameter is not used; still, we ran them on all of our test cases in order to have stronger grounds regarding their performance. In the following tables we give the individual performances of the algorithms in terms of numerical stability.
ESN-CRLS λ < 1 λ = 1 Total
Diverged 100% 20% 50%
Normal - 80% 50%
ESN-SCRLS λ < 1 λ = 1 Total
Diverged 100% 20% 50%
Normal - 80% 50%
ESN-SCRLS2 λ < 1 λ = 1 Total
Diverged 33% - 14%
Normal 67% 100% 86%
ESN-RLSP λ < 1 λ = 1 Total
Diverged - - -
Normal 100% 100% 100%
ESN-QR-RLS λ < 1 λ = 1 Total
Diverged - - -
Normal 100% 100% 100%
ESN-IQR-RLS λ < 1 λ = 1 Total
Diverged - 20% 16%
Normal 100% 80% 86%
ESN-Ardalan-RLS Total
Diverged 14%
Normal 86%
ESN-LMS Total
Diverged -
Normal 100%
ESN-BPDC Total
Diverged -
Normal 100%
We begin with the LTA class. In all of the cases, the LTA algorithms showed a very robust performance in terms of numerical stability. On the other hand, their numerical accuracy was poor, in contrast to their stability. Moreover, the numerical accuracy would be much worse if we used them for fast start-up applications, where W out should be kept fixed after the filter converges to its steady state. This is because of the slow convergence rate of these algorithms, usually an order of magnitude slower than the CVA or RBA algorithms. Both the steady state performance and the convergence rate of these algorithms are largely characterized by the eigenvalue spread of the correlation matrix of the input signal, so the convergence rate may change drastically from signal to signal. Although we did not explicitly investigate the tracking performance of any of the algorithms, it is well known that ESN-LMS has superior tracking properties over both the CVAs and the RBAs [Haykin 1996] [Farhang-Boroujeny 1998]. We do not have enough experience to comment on the tracking performance of the ESN-BPDC; this should be investigated in a future
work. Finally, the most obvious advantage of these algorithms is their computational complexity, which is linear in the size of W out; note that the ESN-BPDC complexity is linear time only when one output neuron is used, see [Steil 2004]. ESN-LMS has a lower computational complexity than ESN-BPDC, whereas ESN-BPDC offers better numerical accuracy: the NMSE results reveal that ESN-BPDC performed better than ESN-LMS in 75% of the test cases. Provided the input signal is well conditioned, ESN-LMS and ESN-BPDC can also be used for applications which do not need a fast start-up; our experiments show that the performance of these algorithms becomes more and more acceptable as the number of iterations increases. However, we still do not
recommend the general use of these algorithms, because the current state of the art in DSP technology realizes very high computational speeds at reasonable cost [Analog 2005] [TI 2005]; for many applications, O(N²) algorithms can therefore easily be put into practice. But if a decision between these two must be made, we favor ESN-BPDC because of its better numerical accuracy. The fact that it is specifically designed for neural network learning, whereas ESN-LMS has its roots in adaptive filtering theory, is another reason for this preference. Only when ESN research discovers techniques to shrink the eigenvalue spread of reservoirs will ESN-LMS become a truly competitive choice among the online adaptation algorithms, thanks to its simplicity and highly robust behavior (i.e., stability and good tracking). Keep in mind, though, that the eigenvalue spread of an ESN driven by certain inputs may already be acceptable, so it is always worth trying ESN-LMS first to see whether it gives acceptable performance; this may save a great deal of time and resources. If the performance is not good, nothing is lost: the equations of ESN-LMS are structurally common to all the other algorithms, so the code used to implement ESN-LMS is re-usable for the implementation of the others.
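To illustrate that structural overlap, here is a minimal sketch of a single ESN-LMS step on the output weights. The vector x denotes the extended state feeding W out; the function name and update order are illustrative, following the standard LMS recursion rather than the exact pseudocode of the thesis.

```python
import numpy as np

def esn_lms_step(w_out, x, d, mu):
    """One LMS update of the ESN read-out weights.

    w_out : current output weight vector (one output neuron assumed)
    x     : extended state vector feeding W out (reservoir states plus
            input/output feedback terms)
    d     : desired output at this step
    mu    : LMS step size
    """
    y = float(w_out @ x)            # exploitation: linear read-out
    e = d - y                       # a priori error
    w_out = w_out + mu * e * x      # stochastic gradient step
    return w_out, y, e
```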
The CVA class of algorithms showed varying results during our tests. All of them produced an overflow or a useless result at least once. Moreover, they achieved the best NMSE performance only once, with the ESN-SCRLS2 algorithm. In terms of numerical accuracy they were better than the LTAs but worse than the RBAs. Our most obvious inference from the results is that these algorithms act more stably in the ESN context whenever λ = 1. This interestingly contradicts the observations given in [Ardalan 1987] and [Slock 1991], where it is stated that the probability of divergence is higher when λ = 1 is used. The reasons behind this phenomenon can be analyzed in a future work. Our suggestion follows our own observations: for long term adaptation, these algorithms should always be used with a forgetting rate equal to 1, keeping in mind that λ = 1 may degrade their tracking performance. The most promising results are obtained with ESN-SCRLS2 and ESN-Ardalan-RLS, which are based on the same trick of adding a scalar value to the diagonal of the symmetric correlation matrix. ESN-Ardalan-RLS is the more intelligent method in the sense that it changes this scalar value dynamically depending on the change in W out, whereas it is kept constant in ESN-SCRLS2; of course, this dynamic approach adds some computational complexity to the algorithm. ESN-Ardalan-RLS is also designed specifically for the pre-windowed memory case, meaning that the forgetting rate is implicitly equal to one. On the other hand, ESN-SCRLS2 has better numerical accuracy and lower computational complexity; its only disadvantage is that the designer has to estimate the correct diagonal constant and refresh period before running the algorithm, in contrast to the ESN-Ardalan-RLS. In conclusion, our favorite algorithms from the CVA class are the ESN-Ardalan-RLS and the ESN-SCRLS2, with a preference for the ESN-SCRLS2.
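To make the shared trick concrete, the following sketch grafts a periodic diagonal correction onto a conventional RLS update of the read-out weights. For simplicity the correction is applied here to the inverse-correlation estimate P; the actual ESN-SCRLS2 and ESN-Ardalan-RLS corrections act on the correlation matrix itself and are derived earlier in the thesis, so this is an illustration of the idea, not a transcription of those algorithms.

```python
import numpy as np

def rls_step_diag_reg(w_out, P, x, d, lam=1.0, delta=1.0, period=1000, t=0):
    """Conventional RLS step with a periodic diagonal correction.

    P      : running inverse-correlation estimate
    lam    : forgetting rate (lambda)
    delta  : scalar diagonal correction applied every `period` steps
    """
    k = P @ x / (lam + x @ P @ x)          # gain vector
    e = d - float(w_out @ x)               # a priori error
    w_out = w_out + e * k
    P = (P - np.outer(k, x @ P)) / lam     # Riccati update
    if (t + 1) % period == 0:
        P[np.diag_indices_from(P)] += delta   # stabilizing diagonal addition
    return w_out, P
```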
The most successful class of algorithms in our experiments were the RBAs. They offered very good numerical stability and accuracy: considering the NMSE performance, they achieved the best result in six of the eight test cases, two of these with the ESN-QR-RLS and the remaining four with the ESN-IQR-RLS. The only important point is that they should be used with λ < 1, otherwise their numerical accuracy degrades. This is further supported by the numerical stability observations given in [Yang 1992], where both QR-RLS and IQR-RLS diverged after some period of time in experiments with λ = 1. In our tests this observation holds especially for the IQR-RLS algorithm: in the second test case of the adaptive noise canceling experiment, it diverged.
In general, we recommend the use of the RBAs. We especially favor the ESN-IQR-RLS for its good numerical accuracy, stability and simplicity compared to the ESN-QR-RLS. The ESN-QR-RLS should be used only in applications where calculation of the a priori estimation error suffices; Adaptive Noise Cancellation is an example of this type. Otherwise, the algorithm becomes computationally too demanding due to the computation of the inverse of a lower triangular matrix via back-substitution.
As we mentioned in the previous chapter, the ESN-RLSP shares the good numerical accuracy properties of the RBAs provided the statistical variations of the input signal are mild. Otherwise, its performance degrades heavily due to its bad tracking properties, so it is not advisable to use it with signals prone to harsh statistical changes, such as the 10th Order NARMA system used during our testing phase. During our early experiments on smaller datasets, we observed that the ESN-RLSP has a slower convergence rate than the CVAs or RBAs; based on this we conclude that it is also not suitable for fast start-up applications where the convergence rate is of vital importance. Stability-wise it showed a robust performance, although its accuracy was not equally good in all of the cases. Similar to the RBAs, better results are achieved when the forgetting rate is chosen smaller than one. Otherwise, it may drift away from the desired signal for a number of iterations and re-converge to the steady state later; together with the bad tracking properties of the algorithm, this usually ends up in a poor solution in terms of accuracy even though the algorithm does not diverge. See the results of test cases two and three of the 10th Order NARMA System identification, where this phenomenon is clearly visible. Computational complexity is another big drawback: the algorithm is approximately four times slower than the ESN-SCRLS. Combining all of the negative points mentioned above, we conclude that the ESN-RLSP is not a good choice. If one has to trade off accuracy against efficiency, the RBAs or CVAs offer better performance in general, so they should be considered instead of the ESN-RLSP.
As a final remark, we want to repeat that these results should not be interpreted as guaranteeing 100% numerical stability or instability of the given algorithms, and we do not claim that they carry over to a more general class of applications under any circumstances. What we wanted to show is that, by taking the appropriate precautions, a stable use of the algorithms becomes possible where continuous, long term adaptation is required. For short term adaptation purposes we recommend the ESN-SCRLS for two reasons: firstly, its simplicity, both in computational complexity and in implementation; secondly, its good convergence properties, which are fast and independent of the signal statistics. However, if numerical accuracy is also an important issue, the ESN-IQR-RLS should be used when the ESN-SCRLS does not suffice. ESN-LMS and ESN-BPDC need a very well conditioned DR in terms of eigenvalue spread in order to be used successfully.
Our chapter on the online adaptation of ESNs ends here. We discussed the performance of different algorithms which can be used to update the output layer of an ESN online in a reliable way. In the next chapter, we will compare ESN performance in the adaptive filtering context against standard methods.
6. ECHO STATE NETWORKS VS STANDARD METHODS
6.1 Introduction
This chapter aims to show that ESNs, when used for adaptive filtering, are a competitive alternative to standard adaptive filtering techniques. First, we will give a brief summary of standard methods from adaptive filtering theory, namely linear transversal filters and adaptive polynomial filters (a class of nonlinear adaptive filtering methods). We will then compare ESN performance with that of the linear transversal filters and the adaptive polynomial filters.
6.2 Overview of the Standard Adaptive Filtering Methods
The first type of filter to be compared with the ESN Adaptive Filter is the linear transversal filter. It has a very simple yet very useful structure, and filters of this type have played a very important role in the development of core adaptive filtering applications. The transversal filter tries to model the desired signal using M input samples, where M denotes the filter length. Modeling is done by expressing the signal as a linear combination of the tap weights and the history of the input vector. Due to the feed-forward structure, transversal filters belong to the Finite-Duration Impulse Response (FIR) class of filters. A formal definition of a transversal filter is given in Definition 2.
All of the algorithms given in this thesis can be used for weight adaptation of a transversal filter. For simplicity, in our comparisons we will only use the IQR-RLS algorithm for online adaptation of the transversal filters, and also of the other filter structures given below. The IQR-RLS is chosen for its good stability and steady state error performance. To save space in the thesis, we omit the derivation of the IQR-RLS algorithm for transversal filters; the interested reader can refer to [Sayed 1994], [Haykin 1996], [Farhang-Boroujeny 1998] or [Bellanger 2001].
Our second type of filter is the adaptive polynomial filter. These are a special class of nonlinear adaptive filters which use polynomial systems to obtain a nonlinear model of the desired signal. Specifically, we will investigate the performance of two main filter types: Volterra Filters and Bilinear Filters. We will not go deep into the theory of these filters here; the interested reader can refer to [Matthews 1991] and [Jenkins 1996], from which more detailed literature can be tracked through the given references.
Definition 2 : LINEAR TRANSVERSAL FILTER
Input vector: $X(t) = [\, x(t),\ x(t-1),\ x(t-2),\ \dots,\ x(t-M+1) \,]^T$
Tap weight vector: $w(t) = [\, w_1(t),\ w_2(t),\ w_3(t),\ \dots,\ w_M(t) \,]^T$
Output: $y(t) = \sum_{i=1}^{M} w_i(t)\, x(t-i+1) = w(t)^T X(t)$
Error: $e(t) = d(t) - y(t)$
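A minimal sketch of the filter in Definition 2; the weight adaptation (e.g., via IQR-RLS) would sit on top of this evaluation step.

```python
import numpy as np

# Transversal filter of Definition 2: y(t) = w(t)^T X(t), e(t) = d(t) - y(t).
M = 4
w = np.array([0.5, 0.25, 0.125, 0.0625])   # tap weights w_1 .. w_M
X = np.array([1.0, 0.0, -1.0, 2.0])        # [x(t), x(t-1), ..., x(t-M+1)]
y = float(w @ X)                           # filter output
e = 0.3 - y                                # error against a desired sample d(t) = 0.3
```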
Now we follow on with an introduction to Adaptive Volterra Filters. We base our discussion of polynomial filters on [Matthews 1991] and [Jenkins 1996]. The infinite Volterra series expansion of a given discrete-time signal is:

$y(t) = h_0 + \sum_{m_1=0}^{\infty} h_1(m_1)\, x(t-m_1) + \sum_{m_1=0}^{\infty}\sum_{m_2=0}^{\infty} h_2(m_1,m_2)\, x(t-m_1)\, x(t-m_2) + \dots + \sum_{m_1=0}^{\infty}\dots\sum_{m_p=0}^{\infty} h_p(m_1,m_2,\dots,m_p)\, x(t-m_1)\, x(t-m_2)\dots x(t-m_p) + \dots$
Here, $h_p(m_1, m_2, \dots, m_p)$ is called the $p$-th order Volterra kernel of the system. Volterra kernels are assumed to be symmetric, i.e., they are left unchanged under any of the $p!$ permutations of the indices $m_1, m_2, \dots, m_p$. One may view the infinite Volterra series as a special form of Taylor series expansion with memory. Since an infinite expansion is impossible to realize in a real-world application, one should use a truncated Volterra series expansion:
$y(t) = \sum_{m_1=0}^{M-1} h_1(m_1)\, x(t-m_1) + \sum_{m_1=0}^{M-1}\sum_{m_2=0}^{M-1} h_2(m_1,m_2)\, x(t-m_1)\, x(t-m_2) + \dots + \sum_{m_1=0}^{M-1}\dots\sum_{m_p=0}^{M-1} h_p(m_1,m_2,\dots,m_p)\, x(t-m_1)\, x(t-m_2)\dots x(t-m_p)$
Notice that $h_0$ is not included in the equation, since it can be estimated to be zero. The most prominent disadvantage of the truncated series is that the number of coefficients grows in direct proportion to $M^p$ (e.g., for a $p = 3$ order series with $M = 5$ time steps of history, we have $5 + 5^2 + 5^3 = 155$ coefficients). As a result, most real-world applications of truncated Volterra filters use low-order expansions. In our comparisons, we will use a 2nd order expansion as used in [Matthews 1991], which is given in Definition 3.
Definition 3 : SECOND ORDER TRUNCATED VOLTERRA FILTER
Input vector: $X(t) = [\, x(t),\ x(t-1),\ \dots,\ x(t-M+1),\ x^2(t),\ x(t)x(t-1),\ \dots,\ x(t)x(t-M+1),\ x^2(t-1),\ \dots,\ x^2(t-M+1) \,]^T$
Volterra kernels: $H(t) = [\, h_1(0;t),\ h_1(1;t),\ \dots,\ h_1(M-1;t),\ h_2(0,0;t),\ h_2(0,1;t),\ \dots,\ h_2(0,M-1;t),\ h_2(1,1;t),\ \dots,\ h_2(M-1,M-1;t) \,]^T$
Output: $y(t) = \sum_{m_1=0}^{M-1} h_1(m_1;t)\, x(t-m_1) + \sum_{m_1=0}^{M-1}\sum_{m_2=m_1}^{M-1} h_2(m_1,m_2;t)\, x(t-m_1)\, x(t-m_2) = H(t)^T X(t)$
Error: $e(t) = d(t) - y(t)$
The vector notation used in the above definition simplifies the use of adaptation algorithms, like LMS or RLS, for the Volterra filters. Since the output can be expressed as a linear combination of the elements of $X(t)$ with coefficient vector $H(t)$, we can use the IQR-RLS for the Volterra filter coefficient update in the same form as for the transversal filters, or for linear combiners in general, whose learning has the same structure as that of an ESN's W out.
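As an illustration, the expanded regressor X(t) of Definition 3 can be built as follows, after which any linear adaptation rule applies unchanged; only the unique quadratic products are kept, matching the symmetric-kernel convention. This is a sketch, not the thesis implementation.

```python
import numpy as np
from itertools import combinations_with_replacement

def volterra2_regressor(x_hist):
    """Expanded input X(t) of Definition 3.

    x_hist: [x(t), x(t-1), ..., x(t-M+1)], newest sample first.
    Returns the M linear taps followed by the M(M+1)/2 unique products.
    """
    quad = [x_hist[i] * x_hist[j]
            for i, j in combinations_with_replacement(range(len(x_hist)), 2)]
    return np.concatenate([x_hist, quad])

X = volterra2_regressor(np.array([1.0, -0.5, 0.25]))   # M = 3 -> 3 + 6 = 9 entries
```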
The last filter type we will consider is the Bilinear Filter. The main problem associated with Volterra filters is that a large number of coefficients is usually required to model certain systems; therefore, other polynomial representations should also be considered. It is known that recursive nonlinear difference equations can model nonlinear systems with better precision than Volterra series representations. The bilinear expansion is a simple but very useful example of such recursive nonlinear difference equations. It is given by the following formulation:
$y(t) = \sum_{i=0}^{M-1} a_i\, x(t-i) + \sum_{i=1}^{M-1} b_i\, y(t-i) + \sum_{i=0}^{M-1}\sum_{j=1}^{M-1} c_{i,j}\, x(t-i)\, y(t-j)$
It has been shown that bilinear series can model a large number of nonlinear systems with arbitrary precision under mild conditions. Because of the output feedback used in the bilinear formula, these filters structurally resemble IIR filters. Therefore, the main drawback of bilinear filters is numerical instability (as in the case of IIR filters). Research on stabilizing bilinear filters is still at an early stage; no scheme guaranteeing their stability is known yet.
From the formula we can see that the output of the bilinear series is a linear combination of its coefficients, so it is again a simple task to derive adaptive algorithms for it. As in the case of Volterra filters, we can extend the theory developed for linear signal processing to bilinear systems. Accordingly, we define an Adaptive Bilinear Filter as in Definition 4.
6.3 Performance Comparison
Cross-evaluation of the Transversal, 2nd Order Truncated Volterra, Bilinear and ESN filters is done using the same adaptive system identification scenario that was used for testing the different online learning algorithms for ESNs. As a quick refresher, we go over this setup thoroughly once more. In these experiments we try to identify two unknown time-varying systems introduced in [Atiya 2000].
Definition 4 : BILINEAR FILTER
Input vector: $X(t) = [\, x(t),\ x(t-1),\ \dots,\ x(t-M+1),\ y(t-1),\ \dots,\ y(t-M+1),\ x(t)y(t-1),\ x(t-1)y(t-1),\ \dots,\ x(t-M+1)y(t-M+1) \,]^T$
Coefficient vector: $C(t) = [\, a_0(t),\ a_1(t),\ \dots,\ a_{M-1}(t),\ b_1(t),\ \dots,\ b_{M-1}(t),\ c_{0,1}(t),\ \dots,\ c_{M-1,M-1}(t) \,]^T$
Output: $y(t) = \sum_{i=0}^{M-1} a_i\, x(t-i) + \sum_{i=1}^{M-1} b_i\, y(t-i) + \sum_{i=0}^{M-1}\sum_{j=1}^{M-1} c_{i,j}\, x(t-i)\, y(t-j) = C(t)^T X(t)$
Error: $e(t) = d(t) - y(t)$
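A sketch of one bilinear filter evaluation following Definition 4; note the output feedback, which is what makes the structure IIR-like and potentially unstable. The array shapes are our own convention here.

```python
import numpy as np

def bilinear_output(a, b, c, x_hist, y_hist):
    """y(t) of the bilinear filter in Definition 4.

    x_hist : [x(t), ..., x(t-M+1)], newest first (length M)
    y_hist : [y(t-1), ..., y(t-M+1)], past outputs fed back (length M-1)
    a, b, c: coefficient arrays of shapes (M,), (M-1,) and (M, M-1)
    """
    y = np.dot(a, x_hist) + np.dot(b, y_hist)
    y += float(x_hist @ c @ y_hist)   # cross terms c[i, j] * x(t-i) * y(t-1-j)
    return float(y)
```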
The first system is a 2nd order nonlinear dynamical system given by the equation:

$y(n+1) = \alpha\, y(n) + \beta\, y(n)\, y(n-1) + \gamma\, u^3(n) + \delta$

We generated 10000 samples of the signal. At each time step we vary the parameters $\alpha, \beta, \gamma, \delta$ by a factor of 1% around their original values, which are 0.4, 0.4, 0.6 and 0.1 respectively, to achieve variations in the signal behavior through time. The input signal $u(n)$ is uncorrelated uniform noise from the interval $[-0.5, 0.5]$.
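A sketch of how these samples can be generated; modeling the 1% variation as a uniform random factor in [0.99, 1.01] applied to the nominal parameter values at each step is our reading of the setup, not a transcription of the thesis code.

```python
import numpy as np

def gen_second_order(n_samples=10000, seed=0):
    """Time-varying 2nd order nonlinear system used above."""
    rng = np.random.default_rng(seed)
    nominal = np.array([0.4, 0.4, 0.6, 0.1])      # alpha, beta, gamma, delta
    u = rng.uniform(-0.5, 0.5, n_samples)         # uncorrelated uniform input
    y = np.zeros(n_samples + 1)
    for n in range(1, n_samples):
        a, b, g, d = nominal * rng.uniform(0.99, 1.01, 4)   # 1% variation
        y[n + 1] = a * y[n] + b * y[n] * y[n - 1] + g * u[n] ** 3 + d
    return u, y[1:]
```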
We used the same fifty neuron ESN from the previous test, which has input to output and output to output connections. W back is included in the evaluation of the internal states. The input signal is scaled by a factor of 0.75 and no bias is used. In order to make an objective comparison, we configured the other filters with fifty or more coefficients. Results of this experiment are given in Table 12, Figure 22 and Figure 23.
Algorithm NMSE
Transversal (via IQR-RLS) 0.1602100
Second Order Truncated Volterra Filter (via IQR-RLS) 0.1615600
Bilinear Filter (via IQR-RLS) 0.1576400
ESN-IQR-RLS 0.0054497
Table 12: Performance comparison of different adaptive filtering methods for identification of the 2nd order nonlinear dynamical system.
In this experiment, ESN-IQR-RLS performed much better than the other methods: the performance gain from using the ESN is on the order of a factor of 30. Although there is not much difference among the performances of the other filters, the worst performance belongs to the second order truncated Volterra filter.
Figure 22: Performance of different adaptive filtering methods for identification of the 2nd order nonlinear dynamical system
Figure 23: Comparison of the ESN output versus the time varying 2nd Order Nonlinear Dynamical System in the last 100 iterations of the experiment
Our second system is a harder example. It is the 10th order NARMA system given by the equation:

$y(n+1) = \tanh\Big( \alpha\, y(n) + \beta\, y(n) \Big[ \sum_{i=0}^{9} y(n-i) \Big] + \gamma\, u(n-9)\, u(n) + \delta \Big)$
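A sketch of a generator for the nominal (time-invariant) version of this system. The coefficient values 0.3, 0.05, 1.5 and 0.1 and the input range [0, 0.5] are the ones commonly quoted for this benchmark in [Atiya 2000] and [Jaeger 2002c]; they are assumptions here, and the harsh periodic parameter changes applied in our experiments are omitted for brevity.

```python
import numpy as np

def gen_narma10(n_samples=10000, seed=0):
    """Nominal 10th order NARMA system; coefficient values assumed."""
    rng = np.random.default_rng(seed)
    a, b, g, d = 0.3, 0.05, 1.5, 0.1          # assumed nominal alpha..delta
    u = rng.uniform(0.0, 0.5, n_samples)      # assumed input range
    y = np.zeros(n_samples + 1)
    for n in range(9, n_samples):
        y[n + 1] = np.tanh(a * y[n]
                           + b * y[n] * y[n - 9:n + 1].sum()   # 10-step history
                           + g * u[n - 9] * u[n]
                           + d)
    return u, y[1:]
```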
Again we followed a similar approach as in the previous experiment and used the same 100 neuron reservoir that was used for the ESN-RLS performance comparison in the system identification setup of the 10th order NARMA system. It has a density of 0.1 and a spectral radius of 0.99. All connections including W back are present. The input signal is scaled by a factor of 0.1 without any bias. The number of coefficients used for the other filter types is also around 100. For the transversal filter we used 102 coefficients, the same as the ESN. For the Volterra filter we chose a window length of 13, which corresponds to 105 coefficients. Our bilinear filter has a window length of 10, resulting in 109 coefficients. We need to mention that when we ran the bilinear filter directly on the input signal, we observed instabilities. As a simple remedy, we scaled the input to a more compact interval before feeding it to the filter; the scaling factor used for the bilinear filter is 0.005. Results are given in Table 13, Figure 24 and Figure 25.
Algorithm NMSE
Transversal (via IQR-RLS) 0.016813
Second Order Truncated Volterra Filter (via IQR-RLS) 0.011886
Bilinear Filter (via IQR-RLS) 0.019035
ESN-IQR-RLS 0.008947
Table 13: Performance comparison of different adaptive filtering methods for identification of the 10th order NARMA system.
Although not as pronounced as in the previous example, ESNs again performed better than the other filtering methods. This time the performance gain is about a factor of two over the worst method, which is now the bilinear filter.
Figure 24: Performance of different adaptive filtering methods for identification of the 10th order NARMA system
6. ECHO STATE NETWORKS VS STANDARD METHODS
Additionally, we want to show a good example of how these online adaptation algorithms track signal variations in time. As mentioned, our 10th Order NARMA system undergoes harsh changes in its coefficients at every 2000th time step, and as shown in the previous chapter we observe a dramatic jump in the mean value of the signal at those times (see Figure 11). Naturally, due to these jumps, the ESN response drifts away from the desired value for a number of time steps until it re-converges to the steady state of the new signal behavior. In Figure 26, we give the stepwise squared error graph of the ESN-IQR-RLS observed during the identification of the 10th Order NARMA System. Observe the sudden increases in the error value at every 2000th iteration, which decay slowly over the following iterations. The better the tracking ability of an online adaptation algorithm, the shorter the interval between the error jump and the re-convergence to the steady state. It is known that the tracking performance of the ESN-IQR-RLS is not as good as that of the LTA (e.g., ESN-BPDC) or CVA (e.g., ESN-SCRLS2) classes of algorithms.
Figure 25: Comparison of the ESN output versus the time varying 10th Order NARMA system in the last 100 iterations of the experiment
6.4 Conclusion
In conclusion, we reached our aim in this chapter by showing the superiority of ESNs over the other adaptive filtering methods considered, both linear and nonlinear. In doing so, we do not claim that ESNs will always perform better than the other methods; this may differ from application to application, and more investigation should be carried out on a wider range of application types. But based on these results, plus the ones in [Jaeger 2002c] and [Jaeger 2004], we can safely conclude that ESNs used in an online learning fashion constitute a competitive approach to adaptive filtering.
Figure 26: Stepwise squared error graph of the ESN-IQR-RLS, observed during the identification of the 10th Order NARMA System. Observe that the error value increases suddenly at every 2000th time step. After some time it re-converges to the steady state.
7. SUMMARY
Throughout this thesis we described the use of ESNs for adaptive filtering tasks. Generally an ESN is trained in an offline manner using Algorithm 2 given in Chapter 3. However, due to the simplicity of the learned part of an ESN, which consists only of the output connections from the reservoir and the input or output neurons, online learning is possible by means of the adaptation algorithms used inside adaptive filters. RLS is an example of those algorithms and fits the ESN case well: it offers a fast rate of convergence which is independent of the eigenvalue spread of the input signal. On the other hand, it has two major drawbacks. Firstly, it is prone to numerical instability in finite precision environments (i.e., digital systems). Secondly, the complexity of the RLS algorithm grows with the square of the number of connections to be learned. These problems should be investigated in detail for a more robust and reliable use of RLS for the online adaptation of ESNs, especially for applications where online learning must run over the long term.
The problem of computational complexity could be solved by fast versions of the RLS algorithm; however, the ESN structure does not allow the use of such algorithms. Another solution is the use of stochastic gradient algorithms, like the well known LMS, which have linear time complexity. We proposed ESN-LMS and ESN-BPDC as examples of such algorithms. Their major drawback is that their convergence rate is an order of magnitude slower than that of RLS and is also dependent on the eigenvalue spread of the input signal; therefore they are not suited to all kinds of applications. Another point regarding the complexity problem is the current state of the art of DSP chips, which now realize very high computation speeds at reasonable prices. Therefore, the complexity of RLS can be acceptable for certain applications where the number of units used in the ESN is not very large.
However, under any condition, the numerical stability problem must be treated in detail for a reliable usage of the RLS algorithm. In that respect, we went through the most prominent examples from the adaptive filtering literature regarding the stability of the RLS algorithm. The outcome is a number of RLS variant algorithms to be used for the online
adaptation of ESNs, each with different pros and cons. We selected ESN-CRLS, ESN-SCRLS, ESN-SCRLS2, ESN-Ardalan-RLS, ESN-RLSP, ESN-QR-RLS and ESN-IQR-RLS as the most promising algorithms. Later, we evaluated the selected algorithms against each other in well known adaptive filtering scenarios. Our main concern during the experiments was numerical stability; additionally, we considered steady state error performance as an evaluation criterion. The first of our experimental scenarios was Adaptive System Identification, where we tried to identify two different nonlinear systems: a second order nonlinear dynamical system and a 10th order NARMA system, with three test cases for each of them. Our second experimental scenario was Adaptive Noise Canceling, where we tried to enhance a music signal corrupted by the speech babble of a hundred people talking in a canteen; this scenario comprised two test cases. All in all, we evaluated eight different test cases. Since stability was our main concern, we used a very large number of samples, between one and five million depending on the test case. As a result of our experiments, ESN-SCRLS2 and ESN-IQR-RLS are found to be the most advantageous. ESN-SCRLS2 offers an acceptable computational complexity and good robustness (provided the forgetting rate is set to one), retains the good numerical accuracy of the conventional RLS algorithm, and is simple to implement. ESN-IQR-RLS, on the other hand, offers excellent numerical accuracy and robustness (provided the forgetting rate is set to a value smaller than one), but this is achieved at the cost of increased computational complexity.
During our tests, we also evaluated the performance of the two linear time algorithms, ESN-LMS and ESN-BPDC, using the same experimental setups. The result is that they can be used for certain applications where fast convergence and numerical accuracy are not critical. ESN-LMS in particular could become very useful, but not before methods to shrink the eigenvalue spread of an ESN are developed; an example of previous attempts can be found in [Liebald 2004]. This is a very important future research topic that is also suggested by the founder of ESN theory, Herbert Jaeger, in a recent paper [Jaeger 2005].
In the last chapter, we compared the performance of online adapted ESNs to standard adaptive filtering techniques, considering both linear and nonlinear methods: Transversal Filters as an example of the linear methods, and Bilinear Filters and Second Order Truncated Volterra Filters from among the nonlinear adaptive filtering methods. For the weight adaptation of all filter types, including the ESN, we used the IQR-RLS algorithm, in order to be fair in our comparisons. The performance of the filters was evaluated on the same Adaptive System Identification setup used during our stability tests. For the identification of the second order nonlinear dynamical system, ESNs performed very well, with an error approximately 30 times smaller than that of the other methods. For the 10th Order NARMA System, ESNs again performed better than the other methods, but in this case the gain was only a factor of two. In conclusion, we showed that ESNs are competitive candidates among adaptive filtering methods; similar conclusions can also be reached from the results of [Jaeger 2002c] and [Jaeger 2004].
We believe the following points are worth further investigation as a continuation of the
work we presented in this thesis:
• Adaptive filters are usually implemented on embedded platforms. Unfortunately, we did not have enough time to try out the given online adaptation methods on embedded systems (e.g., DSPs). We are therefore planning to present an embedded ESN application with online learning in a future paper.
• Throughout the thesis we did not concentrate on finding the best parameters for the ESNs. We believe the results could be considerably better if ESNs were applied to the given problems with a more optimized set of parameters. Such an optimized set may also have positive effects on the stability and tracking of the algorithms presented in this thesis. This, too, is worth investigating and is left as future work.
• Since the aim of this thesis was to explore ESN usage for adaptive filtering tasks, we limited our test cases to well known adaptive filtering setups, where the numbers of inputs and outputs are usually both one. The online learning algorithms given in the thesis, however, also cover multi-input multi-output mappings; it would be interesting to evaluate the performance of online adapted ESNs on such tasks.
Apart from the points listed above, there exist further important open questions in ESN research, for example: how to find a rich reservoir in fewer trials, how to decide whether a given reservoir is suited to the task at hand, and how to adapt an ESN in an unsupervised manner to the task's type of data. The papers [Jaeger 2005] and [Prokhorov 2005] nicely summarize which answers are still missing in ESN research and point out appropriate future research directions.
8. REFERENCES
1. [Adali 1991] T. Adali, S. H. Ardalan (1991), Analysis of a Stabilization Technique for the Fixed Point Prewindowed RLS Algorithm, IEEE Transactions on Signal Processing, Vol. 39, No. 9
2. [Alexander 1993] S. T. Alexander, A. L. Ghirnikar (1993), A Method for Recursive Least Squares Filtering Based Upon an Inverse QR Decomposition, IEEE Transactions on Signal Processing, Vol. 41, No. 1
3. [Analog 2005] Analog Devices (2005), Sharc Processor Home Page,
Although what we have given here is an oversimplified computational requirements analysis, covering only algorithmic complexity and leaving out other points such as memory requirements or implementation details, it may still be of some help for the design of possible future applications.
The most important point to be careful about is that all of the algorithms have at least O(N²) complexity²⁰. Therefore, when designing time critical applications where a fast response is important, the DR size should be selected carefully because of the squared growth of the algorithms with respect to this parameter. A more detailed investigation of embedded implementations of ESNs for various adaptive filtering applications with different timing needs will be done in a future study.
19 When the back-substitution step for the W out computation is included in the overall complexity of ESN-QR-RLS.
20 Notice that the O(N²) complexity also holds for the ESN-LMS, due to the matrix-vector multiplication of W and x(t) during the evaluation step, although the complexity of LMS is only O(N) when used for online weight adaptation of an ordinary filter (e.g., a transversal FIR filter).