The University of Birmingham School of Computer Science ... · The University of Birmingham School of Computer Science MSc in Natural Computation Summer Project Non-Linear Memory

The University of BirminghamSchool of Computer Science

MSc in Natural ComputationSummer Project

Non-Linear Memory

Mills,A

September 8, 2005

Supervised by Dr Peter Tino

Abstract

The power of models which exploit the dynamicsof high dimensional recurrent reservoirs is studiedthrough experiments with a highly recurrent neuralnetwork model called the Echo State Machine. TheEcho State Machine is compared to other models ona simulated wireless communications reconstructiontask and an analysis is performed of its performance.The empirical results lead to the idea that the powerof the Echo State Machine may derive from emulatedstable states arising from contractive affine transfor-mations induced by off-diagonal elements in its reser-voir weight matrix.

keywords: ESN, RNN, FFN, recursive least squares,contraction mapping, fractals, dynamical systems,time series, prediction, tracking

This work is entirely my own except where otherwise indicated.

Contents

List of Figures 5

List of Tables 8

1 Introduction 11

1.1 Time Series Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.2 ESN Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.3 A Different Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.4 Questions Asked . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.5 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 Description of the task 15

3 Analysis of the Wireless Dataset 16

4 First Experiments with the Wireless Dataset 19

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.2 Replication of Jaeger’s ESN results (ESN Reservoir with an RLS Readout) 20

4.3 ESN Reservoir with a DLR Readout . . . . . . . . . . . . . . . . . . . . 26

4.4 ESN Reservoir with an FFN Readout . . . . . . . . . . . . . . . . . . . . 27

4.5 Nearest History Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.6 Unprocessed Input Buffer with an RLS Readout . . . . . . . . . . . . . . 31

4.7 Unprocessed Input Buffer with a DLR Readout . . . . . . . . . . . . . . 33

4.8 Unprocessed Input Buffer with an FFN Readout . . . . . . . . . . . . . . 33

4.9 FPM Reservoir with VQ, RLS, and FFN Readouts . . . . . . . . . . . . 37

4.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.11 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5 A Closer Look at the ESN Results 41

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.2 Sensitivity of the ESN results to Input Shift . . . . . . . . . . . . . . . . 41

5.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.2.2 ESN Reservoir with an RLS Readout . . . . . . . . . . . . . . . . 41

5.2.3 ESN Reservoir with a DLR Readout . . . . . . . . . . . . . . . . 43

2

5.2.4 ESN Reservoir with an FFN Readout . . . . . . . . . . . . . . . . 43

5.2.5 Nearest History Approach . . . . . . . . . . . . . . . . . . . . . . 45

5.2.6 Unprocessed Input Buffer with an RLS Readout . . . . . . . . . . 46

5.2.7 Unprocessed Input Buffer with a DLR Readout . . . . . . . . . . 47

5.2.8 Unprocessed Input Buffer with an FFN Readout . . . . . . . . . . 48

5.2.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.3 Analysis of the ESN Reservoir Unit Activations for Different Input Shifts 50

5.4 Sensitivity of the ESN results to Weight Matrix Connectivity . . . . . . . 52

5.5 Sensitivity of the ESN results to Number of Non-Linear Reservoir Units . 56

5.6 Sensitivity of the ESN results to Number of Reservoir Units . . . . . . . 57

5.7 Sensitivity of the ESN results to Weight Matrix Spectral Radius . . . . . 61

5.8 Analysis of ESN Reservoir Recurrent Weight Distribution . . . . . . . . . 61

5.9 Analysis of the ESN Reservoir Dynamics for Constant Inputs . . . . . . . 62

5.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.11 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

6 Restricting the ESN 66

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

6.2 Performance Without Spectral Radius Rescaling . . . . . . . . . . . . . . 66

6.3 ESN Reservoir with Constant Reservoir or Constant Input Weights . . . 66

6.4 ESN With Only Self Loops . . . . . . . . . . . . . . . . . . . . . . . . . . 68

6.5 ESN Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

6.6 ESN Sub Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

6.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

7 Overall Summary 73

8 Discussion 75

8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

8.2 Future Direction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

8.3 Relation to Other Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

8.4 A Short Comment On the Idea of Dynamic Projection Networks . . . . . 78

9 Evaluation 80

3

10 Conclusion 81

A Document Appendix 82

A.0.1 FPM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

A.1 The Addition of i.i.d Gaussian Noise to a Continuous Signal . . . . . . . 82

A.2 L2 Distance between scaled inputs . . . . . . . . . . . . . . . . . . . . . . 82

A.3 Rescaling a matrix to a specified spectral radius . . . . . . . . . . . . . . 83

B File structure of software included on CD 85

References 86

4

List of Figures

1 Echo State Network (ESN). An ESN is a sparsely connected dynamic reser-voir which is itself fully connected to one input, and fully connected to oneoutput. Only the reservoir-to-output weights are trained during learning. 12

2 A sample input signal d and corresponding output signal u for the wirelesslearning task. In this example the noise term has been set to 0, the inputscaling is 1, and the input shift is 30. . . . . . . . . . . . . . . . . . . . . 17

3 The function U(x) = x+0.036 ·x2− 0.011 ·x3 over its domain. Its domainis equal to the the range [−5.1, 5.1] of the encoding function Q. . . . . . 17

4 The number of subsets in the domain of Q which each define an x-to-1mapping under the function Q, for different values of x. . . . . . . . . . . 18

5 The function U(x) = x + 0.036 · x2 − 0.011 · x3 over the domain [−5.1,−3.9]. 19

6 ESN symbol error rate as the SNR is increased for an ESN constructedusing the parameters shown in Table-1. . . . . . . . . . . . . . . . . . . . 22

7 The graphed results of [1], for the varying of SNR with an ESN reservoirwith an RLS readout, established using the parameters shown in Table-1.Line a is for a linear DFE, line b is for a Volterra DFE, line c is for abilinear DFE, line d is for the first stage of ESN training, and line e is forthe second stage of ESN training. An approximation of the data for thisgraph is given in Table-4. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

8 SER of an ESN with a DLR readout as the learning rate is increased. Dataof this plot is shown in Table-7. . . . . . . . . . . . . . . . . . . . . . . . 26

9 The SER performance and std of the nearest history approach as the mem-ory depth is increased for a fixed memory size of 1000, a data shift of 0, adata scale of 1, and a data SNR of 32. Std is not shown because it is toosmall to be of any interest. Data is shown in Table-10. . . . . . . . . . . 29

10 The SER performance and std of the nearest history approach as the mem-ory size is increased for a fixed memory depth of 3, a data shift of 0, a datascale of 1, and a data SNR of 32. Data is shown in Table-11. . . . . . . . 30

11 SER of an RLS readout coupled to an unprocessed input buffer, as the sizeof the input buffer is increased. Plot is shown in Table-12. . . . . . . . . 32

12 The effect on the wireless-data SER of extending further the size of theinput buffer of an RLS trained ESN. . . . . . . . . . . . . . . . . . . . . 32

13 SER of DLR with a fixed length input buffer of the wireless data as thelearning rate is increased. Data is shown in Table-13. . . . . . . . . . . . 33

14 SER of DLR on the wireless data as the length of the input buffer isincreased from 1 to 5 in unit steps. . . . . . . . . . . . . . . . . . . . . . 34


5

16 The SER of an FFN coupled to an unprocessed input buffer of WirelessData, as the size of the input buffer is varied. . . . . . . . . . . . . . . . 36

17 Comparison of the individual trial results for the best input buffer plusFFN readout result and the best ESN reservoir plus RLS readout result. 37

18 Comparison of the individual trial results for the best input buffer plusFFN readout result with 150 epochs of training, and and the best inputbuffer plus FFN readout result with 25 epochs of training. . . . . . . . . 38

19 SER of an ESN Reservoir with an RLS readout as the input sequence isshifted by increasing amounts. . . . . . . . . . . . . . . . . . . . . . . . . 42

20 SER of an ESN Reservoir with an RLS readout as the input sequence isshifted by increasing amounts. . . . . . . . . . . . . . . . . . . . . . . . . 42

21 SER of an ESN Reservoir with a DLR readout as the input sequence isshifted by increasing amounts. . . . . . . . . . . . . . . . . . . . . . . . . 44

22 SER of an ESN Reservoir with an FFN readout as the input sequence isshifted by increasing amounts. . . . . . . . . . . . . . . . . . . . . . . . . 45

23 RLS SER as the input shift is gradually increased. . . . . . . . . . . . . . 46

24 DLR SER as the input shift is gradually increased. . . . . . . . . . . . . 47

25 SER of an FFN readout coupled to an unprocessed input buffer as theinput shift is increased. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

26 The mean and std activation values of a relatively sparsely (connectivity0.2) connected ESN reservoir for different input shifts. . . . . . . . . . . . 51

27 Histograms with 20 bins each for different input shifts, the top left plot hasan input shift of 0 and the input shift is incremented in each subsequentplot, the incrementation goes across all columns of the the first row, thenacross all columns of the the second row etc. . . . . . . . . . . . . . . . . 52

28 Difference between mean and std reservoir unit activation values betweena fully connected ESN reservoir and an ESN reservoir with a connectivityof 0.2, for different input shifts. . . . . . . . . . . . . . . . . . . . . . . . 53

29 ESN SER as the connectivity of the weight matrix is increased. . . . . . 53


31 Average number of ESN reservoir units which have incoming and outgoingconnections as the connectivity is increased. (The graph continues in astraight line with 0 std for connectivity beyond 0.2.) . . . . . . . . . . . . 55

32 Average number of ESN reservoir units which have incoming and outgoingconnections as the connectivity is increased. Mean In (Mean Out) are themean number of units which have incoming (outgoing) connections, andStd In and Std Out are the associated standard deviations. Also shownare the SER and Std from Table-24, which correspond to the connectivitycolumn. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

6

33 ESN SER as the “non-linearity” of the dynamic reservoir is gradually in-creased. To begin with all neurons implement the identity function, andthen, one by one, they are changed to implement the tahn activation func-tion. The plot shows the results from 0 non-linear units out of 46 to 4 outof 46. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

34 ESN SER as the “non-linearity” of the dynamic reservoir is gradually in-creased. To begin with all neurons implement the identity function, andthen, one by one, they are changed to implement the tahn activation func-tion. The plot shows the results from 4 non-linear units out of 46 to 46out of 46. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

35 ESN SER as the “non-linearity” of the dynamic reservoir is gradually in-creased. To begin with all neurons implement the identity function, andthen, one by one, they are changed to implement the tahn activation function. 58

36 ESN SER as the number of reservoir units is increased from 1 to 15. . . . 59

37 ESN SER as the number of reservoir units is increased from 15 to 75. . . 59

38 Difference between the results of varying the number of units, for a con-nectivity of 0.2 vs a connectivity of 1. . . . . . . . . . . . . . . . . . . . . 60

39 ESN SER as the spectral radius of the weight matrix is increased for afixed connectivity of 0.2. The std for a spectral radius of 1 is omitted fromthe graph because it attenuates the other readings too much, its std was0.0014. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

40 Histogram of weights of 1000 randomized ESNs, using the initializationparameters specified in Table-1. . . . . . . . . . . . . . . . . . . . . . . . 62

41 Plot of weight values of 1000 randomized ESNs, using the initializationparameters specified in Table-1. . . . . . . . . . . . . . . . . . . . . . . . 63

42 Number of iterations before all reservoir activations are closer than thresh-old to 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

43 Number of weight-matrix generation tries it took to get a non-zero spectralradius. Std is shown every 0.0005 steps, no markers are shown for std ofzero. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

7

List of Tables

1 ESN parameters for replication of [1] . . . . . . . . . . . . . . . . . . . . 20

2 ESN symbol error rate as the SNR is increased for an ESN constructedusing the parameters shown in Table-1. The plot of this data is shown inFigure-6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3 ESN symbol error rate achieved by training the best network, for eachSNR, that was obtained in the experiment whose results are shown inTable-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4 Approximation of the results presented in [1]. See Figure-7 caption forinformation regarding alphabetic mnemonics and the graph from whichthis data was approximated. The results in [1] were averaged over 20randomly generated training and testing sets. . . . . . . . . . . . . . . . 23

5 ESN settings which improve on those suggested by [1] and given in Table-5. 25

6 Replication of the wireless experiment of [1] but with substitution of theparameters listed in Table-5. . . . . . . . . . . . . . . . . . . . . . . . . . 25

7 SER of an ESN with a DLR readout as the learning rate is increased. Plotof this data is shown in Figure-8. . . . . . . . . . . . . . . . . . . . . . . 27

8 SER of an ESN with a DLR readout as the number of training epochs areincreased. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

9 SER of an ESN with an FFN readout as the number of training epochs isincreased. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

10 The SER performance and std of the nearest history approach as the mem-ory depth is increased for a fixed memory size of 1000, a data shift of 0, adata scale of 1, and a data SNR of 32. Plot is shown in Figure-9. . . . . 29

11 The SER performance and std of the nearest history approach as the mem-ory size is increased for a fixed memory depth of 3, a data shift of 0, a datascale of 1, and a data SNR of 32. Plot is shown in Figure-10. . . . . . . . 30

12 SER of an RLS readout coupled to an unprocessed input buffer, as the sizeof the input buffer is increased. Plot is shown in Figure-11. . . . . . . . . 31

13 SER of DLR with a fixed-length input buffer of the wireless data as thelearning rate is increased. Plot is shown in Figure-13. . . . . . . . . . . . 34


15 The SER of an FFN coupled to an unprocessed input buffer of WirelessData, as the size of the input buffer is varied. . . . . . . . . . . . . . . . 36

8

16 Summary of the results for replication of [1] and comparison models. TheSER reported is the lowest mean SER for the model in question obtainedover the set of experimental parameters tested. N/A means that the per-formance was no better than a random predictor. The number of unitsgiven in brackets after a model name is the optimal number of memoryunits found for the model, and is the number associated with the resultsshown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

17 SER of an ESN Reservoir with an RLS readout as the input sequence isshifted by increasing amounts. See Section-5.2.2 for further information. . 43

18 SER of an ESN Reservoir with a DLR readout as the input sequence isshifted by increasing amounts. . . . . . . . . . . . . . . . . . . . . . . . . 44

19 SER of an ESN Reservoir with an FFN readout as the input sequence isshifted by increasing amounts. . . . . . . . . . . . . . . . . . . . . . . . . 45

20 RLS SER as the input shift is gradually increased. . . . . . . . . . . . . . 46

21 DLR SER as the input shift is gradually increased. . . . . . . . . . . . . 47

22 SER of an FFN readout coupled to an unprocessed input buffer as theinput shift is increased. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

23 Summary of experiments assessing sensitivity of several models to inputshift. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49


25 ESN SER as the number of reservoir units is increased . . . . . . . . . . 60

26 ESN SER as the spectral radius of the weight matrix is increased for afixed connectivity of 0.2. The std for a spectral radius of 1 is omitted fromthe graph because it attenuates the other readings too much, its std was0.0014. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

27 ESN settings that achieve excellent results without using spectral radiusrescaling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

28 The SER and std of four cases which arise from initialising the input-to-reservoir weights from a random interval or to a constant, and frominitialising the reservoir-to-reservoir weights from a random interval or toa constant. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

29 Performance of a 46 unit ESN reservoir implementing identity functionactivation units, for different values for the reservoir-to-reservoir and input-to-reservoir weight ranges. . . . . . . . . . . . . . . . . . . . . . . . . . . 67

30 Performance of an ESN Reservoir with only self-loops, coupled to an RLSreadout, for different settings of the self-loop and input-to-reservoir weights. 68

31 Performance of ESN Reservoir whose 46 units are connected in a singlecircuit, coupled to an RLS readout, for different settings of the reservoir-to-reservoir (circuit) and input-to-reservoir weights. . . . . . . . . . . . . 69

9

32 Performance of ESN Reservoir whose 46 units are connected in a singlecircuit, with identity activation functions coupled to an FFN readout, fordifferent settings of the circuit and input-to-reservoir weights. . . . . . . 70

33 Performance of ESN Reservoir whose 46 units are each connected to onlytwo other units (one incoming connection and one outgoing connection).The reservoir was coupled to an RLS readout. The performance for differ-ent settings of the reservoir-to-reservoir and input-to-reservoir weights isshown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

34 The best conventional ESN results on the wireless task for an SNR of32db, compared to several other similar models. The ANOVA columndesignates the p value for an ANOVA test involving the result from the firstrow and the row of interest (p < 0.05 indicates a statistically significantdifference). The meaning column gives an interpretation of the statistic,an “=” symbol means that the result is not significantly different fromthe best conventional ESN result, and a “<” symbol means the result issignificantly worse than the best conventional ESN result. . . . . . . . . . 71

10

1 Introduction

1.1 Time Series Tasks

Time series regression; the training of a readout mechanism to approximate a teachersignal, given a time series as input, is a task which has much real world relevance. Threecommon time series regression objectives, along with example usage, are given below:

1. Time series continuation. Prediction of time series, for example for financialforecasting.

2. Time series modeling. The modeling of the underlying generator of a time series,for example for chemical process monitoring.

3. Time series filtering. Changing one time series into another, for example thereconstruction of a corrupted wireless transmission.

The connectionist community has always had a strong interest in temporal processing,tracking, and prediction. Yet it is suggested in [2] that the connectionist models have nothad wide implementation and acceptance in industry. This is surprising given the obviousdemand for sophisticated analog processing in an increasingly wireless and autonomousworld.

It is suggested in [2] by Jaeger, that the problem stems from the absence of quick, reliable,and economically feasible training regimes for neural network models. It is claimed therethat traditional approaches such as back-propagation-through time (BPTT), real-timerecurrent learning (RTRL), and extended Kalman filtering (EKF) are tainted with prob-lems which make them essentially inaccessible to all but the expert. It is implied thateven in the hands of an expert, the techniques are inherently plagued by one or more of:slow convergence; computational inefficiency; undesirable trade-off decisions; instability;non-scalability; and obtuse complexity.

1.2 ESN Approach

The discussion in [2] is followed by the introduction of Jaeger’s own model, the EchoState Network (ESN), as a candidate to alleviate the problems of the models that hadpreviously been discussed.

An ESN (see Figure-1) is a highly recurrent reservoir of artificial neurons with non-linear activation functions. The input-to-reservoir and reservoir-to-reservoir connectionweights are initialised so that the resultant reservoir has what Jaeger calls “Echo StateProperty”[2]. Once these weights are initialised they remain fixed for the duration oftraining and testing.

”Echo State Property” essentially means that the input stream is projected as severalvariant “echos” of itself into the typically higher dimensional space of the reservoir. In-tuitively, since the input is fully connected to the reservoir, it is clear that the activation

11

INPUT OUTPUT

DYNAMICRESERVOIR

Figure 1: Echo State Network (ESN). An ESN is a sparsely connected dynamic reservoirwhich is itself fully connected to one input, and fully connected to one output. Only thereservoir-to-output weights are trained during learning.

of each unit is a variant of the input stream. The echoes decay at a rate governed by themagnitude of the connection weights as the reservoir is updated.

ESN have fading memory and obtain a hopefully injective-like mapping of the inputstream into echo states. The resultant reservoir is a high dimensional projection of theinput stream which is connected to one or more readout neurons. During learning, onlythe reservoir-to-output weights are changed.

Since the reservoir-to-output link is characterized by a single layer of weights, standardregression techniques (such as Recursive Least Squares (RLS)) can be used to train theweights such that the readout matches the teacher signal. The ability to use such linearregression techniques makes the ESN model efficient.

Focusing on the case of one input and one output unit, the ESN is described in terms ofa reservoir state x and an output state y. The reservoir and output states are updatedgiven a new input i(t) according to the following equations

x(t) = f(Wx(t− 1) + wini(t) + v(t)) (1)

y(t) = f out(wout(i(t),x(t))) (2)

f(.) stands for element-wise application of the reservoir activation function (for exampletanh); v(t) is an optional noise vector; (i(t),x(t)) is the concatenation of the input i(t)and reservoir state x(t); W is reservoir weight matrix (reservoir-to-reservoir weights); win

is the input weight vector (input-to-reservoir weights); wout is the output weight vector(reservoir-to-output weights); and f out is the output unit activation function.

The performance of the ESN, and its ease of use is strongly promoted in [2, 3, 4] where theclaims are empirically supported by excellent performance on several common benchmarks(for example, Mackey-Glass, Sante-Fe laser, and a 10th order NARMA system). It issuggested that, compared to traditional alternative techniques, the ESN is easy to train,

12

stable, and precise. The creator of the technique has even gone so far as to file a patentapplication for it.

1.3 A Different Perspective

Untrained recurrent neural networks (RNNs) initialised with small weights have a Marko-vian architectural bias [5, 6]. This means that the similarity of network states is relatedclosely to the similarity of the input stream which induces those states. The reason isthat small weights turn an RNN into a contraction mapping. A contraction mapping [7]is a function, under which the projections of any two points are closer than the originalpoints. In the limit, when driven indefinitely with a fixed input, a contraction mappingwill converge to a fixed point. Therefore each input in an RNN initialized with smallweights defines a fixed point to which the RNN will converge if driven indefinitely withthat input. It follows from this that similar input sequences have similar network stateswhich establishes the Markovian bias of the RNN.

A Markovian bias implies, in a prediction setting, a Markovian context from which topredict. Training a network using typical temporal backpropagation variants increasesthis bias [6], and this is what makes recurrent neural networks useful for time seriesregression tasks.

There are no architectural differences between a single layered classical RNN and an ESN.The only differences occur in training. In traditional RNN usually all the weights aretrained with BPTT, RTRL, or EKF, whereas in ESN only the reservoir-to-output weightsare adjusted and a linear regression algorithm is sufficient. Both models are initialisedby small weights and both have contractive dynamics.

Therefore the ESN is essentially identical to the concept of the neural prediction machine[5] (NPM). The NPM exploits the structure in an untrained RNN to provide a Markoviancontext, given an input stream, from which to predict the future.

There is a model called the fractal prediction machine (FPM) [8] inspired by the contrac-tive dynamics of RNN (and therefore NPM) which uses the idea of contraction directly.In the FPM, a “reservoir-like” state is maintained. The state at time t, s(t) is a convexcombination of the input at time t, i(t) and the state at time t− 1, s(t− 1):

s(t) = k1 · s(t− 1) + k2 · i(t)

Traditionally k2 is set to (1 − k1) but this is an arbitrary choice. This can be seen as arestriction of the ESN update equation (ignoring the noise term):

x(t) = f(Wx(t− 1) + wini(t))

With win set to 1 · k2 and W set to I · k1.

Alternatively the ESN can be thought of as a generalisation of the FPM. This is importantbecause it puts the ESN next to a model which is easy to analyse. It also demonstratesthat the ESN model is not an approach that is far removed from existing work.

13

1.4 Questions Asked

In [1], an Echo State Network was trained to reconstruct symbols received over a simulatedwireless communications channel, and the performance excelled all previous attempts.There it was stated, with regard to establishing an ESN reservoir for Mackey Glass seriescontinuation that:

It is important that the “echo” signals be richly varied.

And in [2] 5 open questions regarding ESN were listed, one of which was:

How is the “raw” echo state best prepared in order to obtain a “rich” varietyof internal connections?

Referring to [1] again, and continuing the quote above:

This was ensured by a sparse interconnectivity of 1% within the reservoir: thiscondition lets the reservoir decompose into many loosely coupled subsystems,establishing a richly structured reservoir of excitable dynamics.

The discovery that RNN initialised with small weights have a Markovian architecturalbias found order and simplicity in apparent complexity. The above quotes by Jaegersuggest that the ESN reservoir has complicated dynamics. It seems likely given that theNPM and the ESN are essentially identical, that the ESN has a Markovian architecturalbias and has relatively ordered dynamics.

It is believed that by looking at the ESN in this way, that questions about its operationscan be answered. The research addresses the following questions

1. The information in an ESN is derived from the input stream it is driven by. Thisraises the question: is a reservoir of echo states necessary at all? Or can a simpleapproach based on a sliding window work just as well?

2. It is suggested above that the way in which an ESN reservoir is created is importantin obtaining ”rich dynamics” and subsequent good performance. Is the suggestedsparse connectivity a necessary precursor for good performance? Why?

3. What kind of dynamics does the ESN actually have? And how do these dynamicsbenefit the ESN in terms of regression potential?

1.5 Document Structure

Section-2 describes the simulated wireless channel communications reconstruction taskthat serves as the benchmark used here to compare models and perform analyses. Section-3 performs an analysis of the wireless dataset construction algorithm to give a better

14

idea of the complexity of the task at hand. In Section-4 the work of [1] is replicated andcompared with the performance of several other models on the same task.

In Section-5 the results of Section-4, in particular the ESN results, are examined moreclosely. The sensitivity of the results to perturbation of various governing parameters isascertained, and the dynamics of the ESN model are analysed.

In Section-6 the ESN model is restricted in attempt to extract the underlying mechanism.

Section-7 gives a brief overall summary of the presented results. Section-8 provides adiscussion and puts the work in the context of others. Section-9 provides a criticalevaluation. Section-10 concludes the work. Appendices follow.

2 Description of the task

In [1], an Echo State Network (ESN) was trained to reconstruct symbols received overa simulated wireless communications channel. The technical details described in [1] arerecapitulated here in a different style and given a functional treatment.

The data to be transmitted is modeled as an i.i.d random symbol source

D = { d(1), d(2), ..., d(M) | d(i) ∈ {−3,−1, 1, 3} }

The data is converted to an intermediate signal q(n) by computing the dot productbetween a weight vector

qwT =

[0.08 −0.12 1.00 0.18 −0.10 0.09 0.05 0.04 0.03 0.01

]and a length 10 buffer of inputs

d(n) =

d(n + 2)d(n + 1)

d(n)d(n− 1)d(n− 2)d(n− 3)d(n− 4)d(n− 5)d(n− 6)d(n− 7)

so that

q(n) = qwTd(n)

or in its full form

15

q(n) = + 0.08 · d(n + 2) + 0.12 · d(n + 1) + 1.00 · d(n) + 0.18 · d(n− 1)− 0.10 · d(n− 2) + 0.09 · d(n− 3) − 0.05 · d(n− 4) + 0.04 · d(n− 5)+ 0.03 · d(n− 6) + 0.01 · d(n− 7)

expressed in terms of an encoding function Q : {−3,−1, 1, 3}10 → IR gives

Q(x) = qwTx

giving

q(n) = Q(d(n))

During transmission the signal gets gets corrupted according to the function U : IR → IR

U(x) = x + 0.036 · x2 − 0.011 · x3 + v

where v is an i.i.d Gaussian random variable. The variable v can be adjusted to yield aprescribed signal-to-noise ratio (SNR) (How to achieve this is explained in Appendix-A.1.The received signal is thus

u(n) = U(q(n))= U(Q(d(n)))

The learning task is to reconstruct the input d(n − 2) given the received signal u(n).Since u(n) is a function of q(n) which is a function of (d + 2), d(n + 1), ..., d(n − 7) thisimplies an input-to-output delay of 4 timesteps. The error measure, to be minimised, isthe number of incorrectly predicted symbols as a fraction of the test set length, whichcan be written as E/M where E is the number of incorrect symbol predictions and Mis the length of the test set. Figure-2 shows an example input signal and correspondingoutput signal.

3 Analysis of the Wireless Dataset

The data transmission signal q(n) is a fixed linear combination of 10 d(i) terms, and eachd(i) term can take only one of 4 discrete values from {−3,−1, 1, 3}, this yields 410 =220 = 1048576 possible values for q(n), which means the domain of Q has 220 elements.These are subsequently processed by the nonlinearity U(x) = x+0.036 ·x2−0.011 ·x3 +v.For now, ignore the noise term v(n). Enumeration of the range of Q and the range of Uare feasible owing to the small domain of Q.

All possible length 10 histories with respect to the symbol set {−3,−1, 1, 3} werecomputed and fed into the encoding function Q, this revealed that the range of Q andhence the domain of U was equal to [−5.1, 5.1], it contained exactly 511 points spaced at

16

0 20 40 60 80 100−20

−10

0

10

20Input

Val

ue

0 20 40 60 80 100−4

−2

0

2

4Output

Val

ue

Figure 2: A sample input signal d and corresponding output signal u for the wirelesslearning task. In this example the noise term has been set to 0, the input scaling is 1,and the input shift is 30.

−6 −4 −2 0 2 4 6−3

−2

−1

0

1

2

3

4

5Nonlinearity U over the range of Q

x

U(x

)

Figure 3: The function U(x) = x + 0.036 · x2 − 0.011 · x3 over its domain. Its domain isequal to the the range [−5.1, 5.1] of the encoding function Q.

equal intervals of 0.02. Plotting U over its domain revealed the nature of the non-linearityfor the case of no-noise, this is shown in Figure-3.

So in actuality the non-linearity is nearly linear within the operating range. It is interest-ing to ask whether the functions Q and U are injective, and if not, how many equivalenceclasses characterize their behavior.

17

To achieve this end, for each element in the range of Q, the number of elements in therange identical to it were counted.

This analysis revealed that the function Q was non-injective. 6 points in the domainshared their image with no other point 1, the rest shared their image with more, themaximum correspondence was 3216-to-1. Figure-4 shows the number of different x-to-1mappings for the function Q for different values of x.

What it means to say that there are y x-to-1 mappings is that there are y disjoint subsetsin the domain of Q having size x such that each subset is projected onto a differentelement in the range of Q.

0 500 1000 1500 2000 2500 3000 35000

2

4

6

8

10

12

Number of disjoint subsets of Q which definean x−to−1 mapping, for different values of x

x

Num

ber o

f dis

join

t sub

sets

of Q

whi

ch d

efin

e an

x−t

o−1

map

ping

Figure 4: The number of subsets in the domain of Q which each define an x-to-1 mappingunder the function Q, for different values of x.

The function Q has 511 equivalence classes, and the mean number of elements in a classis 6808.9 with a variance of 8878.4. This gives an idea of just how “non-injective” thefunction Q is.

When the range of Q was put through U it gave exactly 511 points2 which means thatthe function U is injective, that is no point in the domain of U shares its image withanother.

Figure-3 shows that U is almost linear apart from the small region in the lower left.A magnification of this non-linear region is shown in Figure-5, it spans from −5.1 to−3.9, the figure shows a horizontal line for each point as well as an asterisk, it shows thenon-equality of points on opposite sides of the bowl.

In conclusion, if the numerical precision is sufficient, the non-linearity U is injective (withrespect to the range of Q), and almost linear.

1Bearing in mind the numerical precision of the computer.2In C, it gave 510 points, but in Matlab it gave 511

18

−5.2 −5 −4.8 −4.6 −4.4 −4.2 −4 −3.8−2.77

−2.768

−2.766

−2.764

−2.762

−2.76

−2.758

−2.756

−2.754

−2.752

−2.75Nonlinearity U over the domain [−5.1,−3.9]

x

U(x

)

Figure 5: The function U(x) = x + 0.036 · x2 − 0.011 · x3 over the domain [−5.1,−3.9].

The non-linearity does not significantly increase the complexity of the task for the caseof no noise. With the addition of noise, however, the function becomes non-injectiveand the task much more complicated. And even without noise, the function Q is highlynon-injective which makes the task hard.

4 First Experiments with the Wireless Dataset

4.1 Introduction

The analysis of the preceding section revealed that the wireless data set has quite com-plicated memory structure. In this section the performance of several different predictionmodels are compared on randomly generated instances of the wireless data set.

Jaeger reported in [1] that the ESN can achieve a mean SER (Symbol Error Rate) of≈ 1 × 10−5 on the wireless dataset with an SNR (Signal to Noise Ratio) of 32db. Thepurpose of this section is twofold, first, it serves to replicate and verify the results of [1],and second, it serves to compare the results of [1] with the results achieved here usingother models.

The overall aim is to obtain enough empirical evidence to make a reasonable claim as towhy the ESN performs well and from where its underlying power derives.

In order that any comparisons be fair, the experiments to follow should be analyticallycompatible with the results of [1]. There, 47 adjustable weights were used, so it might besuggested that any meaningful comparison should also use a maximum of 47 adjustableweights. This however, does not account for other adjustable parameters such as connec-tivity, learning rates, scaling of inputs, shifting of inputs, and so on, which are clearly

19

adjustable parameters and which were important in achieving the results presented in [1].

It could be argued that a 47 parameter limit should refer to maximum memory depth, oreven the maximum number of neurons allowed, because in some sense the ESN used in [1]had 47 units of storage space. However it is hard to quantify storage space just in termsof units since factors such as how the memory is updated are important in determiningthe amount of information that can be stored. And in any case, there are models whichdo not use neurons, but which in any case are interesting. Where possible 46 units of“reservoir” were used since the 47th storage space of [1] comes from augmenting thereservoir with the current input, but not all models do this by default.

In general it is difficult to constrain the different models so that they are in some senseequivalent with respect to resources or adjustable parameters. It is difficult to reconciledifferent parameters in different models as being comparable. This means it is difficultto make general conclusions so much care must be taken.

4.2 Replication of Jaeger’s ESN results (ESN Reservoir withan RLS Readout)

The experiment parameters of [1] were duplicated exactly. They are shown in Table-1.

Parameter Description Value

Number of reservoir units 46Number of output units 1Number of input units 1Reservoir activation function tanhOutput unit activation function identityReservoir connectivity 0.2Reservoir weight set {−1, 0, 1}Reservoir spectral radius 0.5Input weight range [−0.025, 0.025]Input shift 30Input scale 1Feedback weight range NULLInitial reservoir activation value 0ESN washout length 100Training set length 5000Regression model RLSRLS forgetting rate 0.998

Table 1: ESN parameters for replication of [1]

An ESN reservoir was trained on a 5000 step random wireless data sequence3 using theRLS regression algorithm. The ESN reservoir was then tested on a 2.5×106 step randomwireless data sequence.

3100 of these steps are used as an initial “wash” of the ESN reservoir.

20

In [1], testing was terminated prematurely if 10 symbol errors accrued prior to exhaustionof the test set. This gives an unfair estimate of performance and it means that twoexperiments cannot be compared fairly if they are tested using this method. Given thesmall numbers of trials (20) used in [1] this method of testing is even more unfair. It isspeculated that the reason for using premature termination testing in [1] was efficiencysince they used a Matlab implementation.

To be fairer here, where time permitted testing occurred over the entire test set withno early termination. Because of the increase in running times this incurs for poorlyperforming ESN, the test length was reduced from 107 to 2.5 × 106, but the number oftrials were increased from 20 to 50. This should give a fair estimate of model performancewhich can be fairly compared to the performance of different models.

To save time some experiments were executed using a testing set of 107 and early termi-nation. If the results turned out to be comparable with the best performance of the ESN(i.e within an order of magnitude) then they were re-tested using the full test set to getmean and std values that could be fairly compared. Unless otherwise specified, assumefull testing.

The targets ({−3,−1, 1, 3}) of the wireless task are discrete, but the RLS output is realvalued, therefore the RLS output was quantized (as in [1]) using the bins [− inf,−2],(−2, 0],(0, 2],and (2, inf], so as to correspond respectively to each of the aforementioned targets, andthus enabling a discrete symbol error rate to be computed.

The symbol error rate was recorded relative to the number of symbols that were tested.The training and testing data were perturbed to achieve specified SNRs, namely 12db,16db, 20db, 24db, 28db, and 32db (See Appendix-A.1). The experiment was repeated foreach SNR and each experiment was averaged over 50 trials4. The results of the experimentare shown in Table-2 and Figure-6.

SNR (db) SER Std Min Max

12 2.880e-04 1.400e-04 7.960e-05 7.392e-0416 6.075e-05 5.011e-05 4.000e-07 2.300e-0420 5.295e-05 5.534e-05 0.000e+00 2.608e-0424 3.968e-05 5.001e-05 0.000e+00 2.148e-0428 4.507e-05 5.368e-05 0.000e+00 2.636e-0432 4.659e-05 5.821e-05 0.000e+00 2.100e-04

Table 2: ESN symbol error rate as the SNR is increased for an ESN constructed usingthe parameters shown in Table-1. The plot of this data is shown in Figure-6.

According to an ANOVA5 test [9], the difference between the means taking the resultsfor all SNRs into account is significant (p=0), but the difference between the means ofthe results for SNRs between 16db and 32db are not significantly different (p=0.3380).This implies that at the level of performance sampled by the ESN settings tested, noiseis insignificant at 16db and below (down to 32db).

420 trials were used in [1], presumably because of the relative inefficiency of Matlab.5p < 0.05 is considered significant

21

12 14 16 18 20 22 24 26 28 30 32

0

1

2

3

4

5

6

7

x 10−4

SNR in db

Sym

bol E

rror

Rat

e (S

ER

)

Replication of Jaeger’s wireless experiment. Forgetting factor = 0.998

MeanStdMaxMin

Figure 6: ESN symbol error rate as the SNR is increased for an ESN constructed usingthe parameters shown in Table-1.

As in [1], a second stage of training was executed where the best performing ESN obtainedabove, for each SNR, was trained and tested 50 times (in [1] it was 20 times) withrandomly generated training and testing sets (as opposed to above where random trainingand testing sets and a new random ESN were created for each trial). The results are shownin Table-3


12 1.861e-04 1.015e-04 6.480e-05 5.628e-0416 1.046e-05 1.819e-05 0.000e+00 9.240e-0520 3.752e-06 6.451e-06 0.000e+00 3.200e-0524 6.760e-06 1.056e-05 0.000e+00 5.280e-0528 1.471e-05 2.395e-05 0.000e+00 1.156e-0432 4.184e-06 6.916e-06 0.000e+00 3.880e-05

Table 3: ESN symbol error rate achieved by training the best network, for each SNR,that was obtained in the experiment whose results are shown in Table-2

According to ANOVA testing, the retraining (the second stage of ESN training) yieldssignificant improvements over the initial training for 12db (p=0.0001), 16db (p=0), 20db(p=0), 24db (p=0), 28db (p=0.0004), and 32db (p=0).

Tabulated data was not provided in [1], only a graph, upon contacting the author, herevealed that only an EPS file was available. The EPS was obtained and is shown inFigure-7. A numerical approximation of the underlying data was obtained by printingFigure-7 on A2 paper, and extracting the results with a ruler and calculator, these resultsare shown in Table-4. Measurements were made to the nearest mm and a linear interpo-

22

Figure 7: The graphed results of [1], for the varying of SNR with an ESN reservoir withan RLS readout, established using the parameters shown in Table-1. Line a is for a linearDFE, line b is for a Volterra DFE, line c is for a bilinear DFE, line d is for the first stageof ESN training, and line e is for the second stage of ESN training. An approximationof the data for this graph is given in Table-4.

lation between abscissa lines was then executed. In order to approximate the std, the stdline was measured according to the resolution between the enclosing abscissa lines of thecorresponding mean. Results were rounded to 2SF because measurements were made to2SF (with respect to the number of mm between two abscissa lines).

SNR a b c d d (std) e e (std)

12 - - - 8.900e-02 1.400e-02 8.900e-02 1.400e-0216 - - - 2.800e-02 1.600e-02 2.800e-02 1.900e-0220 - - - 2.500e-03 1.600e-03 2.500e-03 1.900e-0324 2.300e-02 7.400e-03 6.300e-03 4.000e-04 2.600e-04 2.100e-04 2.200e-0428 1.900e-02 6.700e-03 3.900e-03 8.900e-05 3.300e-05 1.000e-05 3.000e-0632 1.400e-02 5.600e-03 9.800e-04 9.700e-06 3.000e-06 9.100e-06 4.800e-06

Table 4: Approximation of the results presented in [1]. See Figure-7 caption for informa-tion regarding alphabetic mnemonics and the graph from which this data was approxi-mated. The results in [1] were averaged over 20 randomly generated training and testingsets.

Although the data shown in Table-4 was reverse engineered from Figure-7, it is believedthat it is accurate enough to make an informal comparison with the results presented inTable-2 and Table-3.

First of all it is apparent that the performance reported in [1] has been replicated here.Secondly, the results presented here appear to consistently improve on those presented in[1] at every SNR. Thirdly, the results of [1] show a consistent downward trend whilst theresults obtained here show insensitivity to SNR greater than 12db (within those SNRstested).

The most obvious source of the latter two discrepancies is that noise has been implemented

23

differently. Jaeger was consulted regarding this, and he confirmed that the equation shownin Appendix-A.1 for perturbing the signal was correct so this seems unlikely.

It is feasible to suggest that the difference in results stems from the difference in testingmethods. Early termination of testing appears to bias the results slightly toward anunder-estimation of performance.

The experiment was repeated using the early termination approach to testing. In thefirst stage of training, according to ANOVA [9], the performance was not significantlydifferent for any SNR. In the second stage of training, according to ANOVA testing,the performance was significantly better using full testing for SNRs of 12db (p=0.0016),24db (p=0.0058), and 32db (p=0.0072). There was no significant difference between theresults for 16db (p=0.4278), 20db (p=0.1456), and 28db (p=0.7548).

So exactly half the time, the second stage results were recorded as being significantlybetter using full training when compared to using an early termination approach, andthe rest of the time there was no significant difference between recorded results. Althoughit is not fair to make concrete conclusions, these results imply that the early terminationapproach to testing is more sensitive to the particular testing set used than full testing,additionally there appears to be a slight bias to under-estimate the true error with theearly termination approach to testing.

The difference in testing approach, and the difference in number of trials averaged overcould be partly responsible for the difference in results reported in [1] and here. But owingto the lack of a clear downward trend in the results reported here and apparent insensi-tivity to SNR above 12db, it is probable that there is another source of implementationdifference causing the difference in observed results.

Since the results which follow are mainly compared with each other, this discrepancy inresults is not a significant problem although it warrants further investigation given moretime.

It was discovered that the settings used to achieve the results of [1], or rather, the set-tings used to achieve the replication results presented here, are not optimal. Improvedparameters were discovered when performing the forthcoming experiments of Section-5,they are demonstrated here retrospectively as a comparison to the original settings6. Theparameter settings which differ from those shown in Table-1 are shown in Table-5. Notealso that the weights were initialised uniformly from the range [−1, 1] as opposed to dis-cretely from the set {−1, 0, 1}. Finally, note that in [1] the ESN reservoir plus the currentinput are connected to the output neuron, whereas in the improved settings here, onlythe ESN reservoir is connected to the output neuron.

The experiment of [1] which was replicated above, was repeated using the improvedparameters7. The results are shown in Table-6.

Compared to the first stage of the original replication results, shown in Table-2, accordingto ANOVA testing, the results for the improved settings are significantly better (p < 0.02in all cases).

6Changing the input shift should improve results too but this was avoided for consistency reasons7Using an RLS forgetting rate of 1.0 does not necessarily result in a better SER, but potentially can

improve stability.

24

Parameter Description Value

Reservoir connectivity 1Reservoir weight range [−1, 1]Reservoir spectral radius 0.1Input weight range [−0.025, 0.025]RLS forgetting rate 1.0

Table 5: ESN settings which improve on those suggested by [1] and given in Table-5.


12 5.289e-04 1.633e-04 3.432e-04 1.149e-0316 2.364e-05 2.595e-05 8.000e-07 1.240e-0420 5.968e-06 1.302e-05 0.000e+00 7.120e-0524 4.448e-06 1.276e-05 0.000e+00 7.120e-0528 3.560e-06 8.626e-06 0.000e+00 4.440e-0532 1.528e-06 6.199e-06 0.000e+00 4.240e-05

Table 6: Replication of the wireless experiment of [1] but with substitution of the param-eters listed in Table-5.

Compared to the second stage of the original replication results, shown in Table-3, ac-cording to ANOVA testing, the results for the improved settings are significantly worsefor SNRs of: 12db (p=0), 16db (p=0.0041), not significantly different for SNRs of: 20db(p=0.2835), 24db (p=0.3261), and significantly better for SNRs of: 28db (p=0.0025),and 32db (p=0.0459). This indicates that the sensitivity of known high performance net-works to the particular training set used reduces as the training noise is reduced, whichis intuitively obvious.

It is probable that retraining only the best networks from the improved results, wouldresult in marginal improvement in results, but unfortunately the networks were not saved.In any case, it is more interesting to consider the case of one-shot learning without aretraining of the best found networks since this is more realistic for an online case.

The results of Table-6 demonstrate that it is not necessary to augment the ESN reservoirwith the current input8. This is useful because; it is conceptually cleaner to considerthe dynamics of an isolated reservoir of neurons, it makes code less complicated, and itreduces complexity of parameter selection (it was found that when the ESN reservoir wasaugmented by an input and when an FFN readout or a DLR readout was used, then theinput would have to have to be scaled for it to work).

Thus, subsequent referral to an ESN reservoir alludes to the notion of a set of ESNreservoir units without augmentation with the current input used in [1].

8Experiments to determine whether any improvement is obtained from augmenting the ESN reservoirwith the current input, would be executed given more time.

25

4.3 ESN Reservoir with a DLR Readout

It is interesting to ask how much of the performance of the ESN shown above can beattributed to the chosen regression model. As a comparison, an ESN was created thatused a DLR readout 9 instead of an RLS readout. The ESN parameters were the sameas in Table-1.

As a parameter exploration, the DLR learning rate was increased from 0.1 to 1.0 in stepsof 0.1, in addition to trying the learning rate 0.01. The data SNR was set to 32db. Foreach setting, 25 ESN were created, and each was trained for 30 epochs on a random 5000step sequence and tested on a random 107 step sequence with early termination uponaccumulation of 10 errors. The results are shown in Figure-8 and Table-7.

0 0.2 0.4 0.6 0.8 1−0.02

0

0.02

0.04

0.06

0.08

0.1(ESN + DLR) SER as the Learning Rate is Increased

Learning Rate

SE

R

Figure 8: SER of an ESN with a DLR readout as the learning rate is increased. Data ofthis plot is shown in Table-7.

As can be seen from Table-8, a learning rate of about 0.3 appeared to work best. How-ever, irrespective of the learning rate, the performance is much worse than an ESN withan RLS readout. It could be argued that not enough training epochs were used. Tocounter this objection, using a learning rate of 0.3, the number of epochs in the set{5, 10, 25, 50, 100, 200, 300, 500, 1000} were tested, and for each setting the results wereaveraged over 25 trials. The results are shown in Table-8.

Table-8 shows that from 25 epochs onwards there is no observable difference in the per-formance. A validation set could be used, but it is clear that the best performance ofthis model, is grossly subordinate to ESN with an RLS readout and is likely to remainso irrespective of parameter tuning or the use of a validation set.

This result is important because it suggests that a very simple linear readout is not

9A DLR readout uses backpropagation to train a single layer of weights connected from the reservoirto the output.

26

LRate SER Std

0.01 3.118e-02 5.078e-020.1 6.673e-03 2.745e-030.2 7.683e-03 3.218e-030.3 6.635e-03 2.569e-030.4 7.237e-03 3.456e-030.5 8.100e-03 3.829e-030.6 7.880e-03 2.199e-030.7 8.968e-03 2.316e-030.8 8.555e-03 4.123e-030.9 8.753e-03 3.384e-031 9.534e-03 4.405e-03

Table 7: SER of an ESN with a DLR readout as the learning rate is increased. Plot ofthis data is shown in Figure-8.

Epochs SER Std

5 8.895e-03 5.918e-0310 1.021e-02 1.011e-0225 6.991e-03 2.665e-0350 6.889e-03 2.409e-03100 8.083e-03 3.154e-03200 7.120e-03 3.065e-03300 7.857e-03 2.585e-03500 7.297e-03 2.246e-031000 7.134e-03 2.253e-03

Table 8: SER of an ESN with a DLR readout as the number of training epochs areincreased.

sufficient to extract enough information from the reservoir to perform the prediction taskto any reasonable degree of success.

4.4 ESN Reservoir with an FFN Readout

DLR and RLS are linear regression models so it is interesting to consider a non-linearregression model.

An ESN reservoir with 46 units was constructed and coupled to a feed-forward network(FFN) with one hidden layer containing 5 tanh units, that is, the ESN reservoir activationsserved as input to the FFN. The FFN weights were established uniformly from [−1, 1] andthe ESN reservoir units used tanh for an activation function. Results were averaged over25 trials and each trial consisted of training with a 5000 step sequence and then testingusing early termination on a 107 step sequence. A learning rate of 0.1 was used. The ESNreservoir was fully connected and weights were established uniformly from [−0.1, 0.1] andthen rescaled to a spectral radius of 0.5. The input-to-reservoir weights were established

27

uniformly from [−0.025, 0.025]. The number of training epochs were varied using the set{1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50}, the results for this are shown in Table-9.

Epochs SER Std Min SER Max SER

1 1.931e-02 2.604e-02 1.485e-03 1.266e-015 1.787e-03 1.244e-03 2.665e-04 4.831e-0310 6.657e-04 8.197e-04 6.236e-05 3.775e-0315 2.285e-04 3.145e-04 0.000e+00 1.083e-0320 2.212e-04 3.624e-04 4.000e-07 1.674e-0325 9.767e-05 1.247e-04 2.000e-07 5.663e-0430 1.027e-04 1.814e-04 0.000e+00 8.787e-0435 3.612e-05 3.863e-05 0.000e+00 1.232e-0440 9.810e-05 1.416e-04 0.000e+00 5.432e-0445 9.203e-05 1.460e-04 0.000e+00 6.373e-0450 7.176e-05 7.915e-05 0.000e+00 3.330e-04100 5.207e-05 5.568e-05 0.000e+00 1.705e-04200 2.840e-05 6.804e-05 0.000e+00 2.414e-04

Table 9: SER of an ESN with an FFN readout as the number of training epochs isincreased.

From the table it can be seen that the SER drops as the number of epochs is increased,but the performance approximately an order of magnitude worse than an ESN with anRLS readout. It is possible that a validation set might improve performance, but use ofa validation set made the experiment running times prohibitively long.

4.5 Nearest History Approach

The nearest history approach is very simple. The training stage consists of updating aninput buffer and at each step storing as a pair, the buffer (an input history), and theoutput that is associated with the most recent symbol in the buffer. The exploitationstage consists of retrieving the “nearest” stored history from the memory, where “nearest”is for example measured as the smallest Euclidean distance between the current inputbuffer and those which have been stored. The output associated with this nearest storedhistory is then retrieved and used as the prediction. If more than one stored history hasthe same distance to the current input, the first found is selected.

The nearest history model has three parameters; memory size (number of stored histo-ries), memory depth (length of each stored history), and distance metric (used to comparehistories). Here the memory size and memory depth were varied but the distance metricwas maintained throughout as L2 Euclidean distance10. The dataset is characterized bythe parameters; input shift, input scale, and SNR.

10The Lm distance between two d dimensional vectors x = x1, x2, ..., xd and y = y1, y2, ..., yd is defined

as(∑d

i=1 |xi − yi|m)1/m

[10]

28

Preliminary experimentation revealed that the model performed well for large but shal-low memories. The experiments below were constructed to explore around this aprioriobtained interesting region in parameter space. Prediction results averaged over 100 ran-domized test sets of length 100000 (for fixed shift, scale, and SNR) were obtained forvarious settings of the model parameters.

Memory depth SER Std

1 7.377e-01 4.936e-032 6.045e-01 6.873e-033 8.633e-03 1.501e-034 1.430e-02 3.268e-035 7.640e-02 5.228e-036 1.481e-01 5.647e-037 2.213e-01 5.003e-038 2.812e-01 5.568e-039 3.389e-01 5.404e-0310 3.803e-01 4.403e-03

Table 10: The SER performance and std of the nearest history approach as the memorydepth is increased for a fixed memory size of 1000, a data shift of 0, a data scale of 1,and a data SNR of 32. Plot is shown in Figure-9.

0 2 4 6 8 100

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Nearest History SER as the Memory Depthis Increased For a Memory with 1000 Entries

Memory depth

SE

R

Figure 9: The SER performance and std of the nearest history approach as the memorydepth is increased for a fixed memory size of 1000, a data shift of 0, a data scale of 1, anda data SNR of 32. Std is not shown because it is too small to be of any interest. Data isshown in Table-10.

Increasing the memory depth whilst keeping the memory size fixed to 1000, gave theresults shown in Table-10 and Figure-9. As can be seen, the SER decreases from a high

29

SER to a low SER as the memory depth is increased from 2 to 3, it then steadily increasesas the memory depth is increased further.

Memory size SER Std


Table 11: The SER performance and std of the nearest history approach as the memorysize is increased for a fixed memory depth of 3, a data shift of 0, a data scale of 1, and adata SNR of 32. Plot is shown in Figure-10.

0 500 1000 1500 2000 2500 30000

0.02

0.04

0.06

0.08

0.1

0.12

0.14

Nearest History SER as the Memory Sizeis Increased for a Fixed Memory Depth of 3

Memory size

SE

R

Figure 10: The SER performance and std of the nearest history approach as the memorysize is increased for a fixed memory depth of 3, a data shift of 0, a data scale of 1, and adata SNR of 32. Data is shown in Table-11.

Using the low-error fixed memory depth of 3 obtained from the last experiment, thememory size was increased from 1 to 3000 in steps of 1. The results are shown in Table-11 and Figure-10. As can be seen, when the memory size is increased, the SER decreasesmonotonically.

30

4.6 Unprocessed Input Buffer with an RLS Readout

It goes without saying that the information required for the wireless task is present in theinput stream. It is therefore interesting to ask whether an unprocessed input stream canbe used directly to solve the prediction task. In this section, an RLS readout is coupledto a buffer of the unprocessed input stream.

At each timestep the input-buffer is updated in the obvious way (it is a window thatslides through the input stream), and a linear combination of the input buffer elementsserves as the output. The RLS training algorithm is used to adjust the weights of thelinear combination, using as a target, the output associated with the most recent inputin the input buffer. The forgetting factor λ used with the RLS learning algorithm wasset to 0.998 as in [1].

In the ESN experiments above, the data was shifted by adding the constant 30 prior torunning the experiment. It was found that a data shift of 30 had very poor performancewith this model, so a data shift of 0 was used instead because it gave relatively betterperformance. Further information about the sensitivity of this model to input-shift isgiven in Section-5.2.6 below.

Memory depth SER Std


Table 12: SER of an RLS readout coupled to an unprocessed input buffer, as the size ofthe input buffer is increased. Plot is shown in Figure-11.

Keeping all other parameters fixed, the length of the RLS input buffer was varied from1 to 30 in steps of 1. Table-12 and Figure-11 show the results. The SERs for bufferlengths 1 and 2 are not shown because their relative magnitude makes the rest of the plotunreadable, the missing data is shown in the table. The table shows only the entries forbuffer lengths between 1 and 10 because the rest of the data continues the same trend.The figure shows that any buffer length over 5 tends to have performance of the order0.005. Further increases in the buffer, at least upto a length of 100, do not appear toimprove performance (taking std into consideration), as can be seen in Figure-12.

31

5 10 15 20 25 300

0.005

0.01

0.015

0.02

0.025

0.03

0.035Effect on SER of varying the buffer length for RLS training

Buffer Length

SE

R

Figure 11: SER of an RLS readout coupled to an unprocessed input buffer, as the size ofthe input buffer is increased. Plot is shown in Table-12.

30 40 50 60 70 80 90 1000

1

2

3

4

5

6

7

8x 10

−3 Effect on SER of varying the buffer length for RLS training

Buffer Length

SE

R

Figure 12: The effect on the wireless-data SER of extending further the size of the inputbuffer of an RLS trained ESN.

32

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8(Input Buffer + DLR) SER as the Learning Rate is Increased

Learning Rate

SE

R

Figure 13: SER of DLR with a fixed length input buffer of the wireless data as thelearning rate is increased. Data is shown in Table-13.

4.7 Unprocessed Input Buffer with a DLR Readout

Delta rule learning (DLR) operates on a single layer network with linear units and appliesthe delta learning rule to update weights. Note that the delta rule is the same rule used byback-propagation. Since there is only one output for the wireless task, the readout-modelis extremely simple and is characterized by a single vector of weights with dimensionalityequal to the number of inputs.

Keeping the input buffer fixed to 46 inputs (to be compatible with the RLS trainedESN) the learning rate was increased from 0.1 to 1.0 in steps of 0.1, and in addition thelearning rates 0.0001, 0.001, and 0.01 were tried. For each setting, 100 random ESN werecreated. Each ESN was trained on a randomized test sequence containing 106 elements.The results are shown in Figure-13 and Table-13.

To check the performance of DLR as the buffer size is varied, as for the RLS case, thebuffer size was increased from 1 in unit steps. A learning rate of 0.001 was used. Theresults are shown in Figure-14, Figure-15, and Table-14.

4.8 Unprocessed Input Buffer with an FFN Readout

RLS and DLR (with linear units) are both linear regression models. The last two sectionsdemonstrated that it is unlikely that a linear readout is sufficient to obtain very low SERon a fixed length unprocessed input buffer. This implies that the ESN performs somekind of non-linear processing of the inputs which linearize the regression task from theperspective of the readout mechanism. It is thus interesting to see if a fixed input buffercoupled with a non-linear readout mechanism can achieve good performance.

33

LRate SER Std

0.0001 4.604e-01 3.957e-020.0010 5.530e-03 1.185e-030.0100 6.117e-01 2.777e-020.1000 6.693e-01 1.230e-020.2000 6.778e-01 1.054e-020.3000 6.826e-01 1.107e-020.4000 6.856e-01 9.805e-030.5000 6.884e-01 9.294e-030.6000 6.912e-01 2.625e-020.7000 6.933e-01 2.117e-020.8000 6.918e-01 7.936e-030.9000 6.964e-01 2.356e-021.0000 6.949e-01 6.535e-03

Table 13: SER of DLR with a fixed-length input buffer of the wireless data as the learningrate is increased. Plot is shown in Figure-13.

1 2 3 4 50

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8Effect on SER of varying the buffer length for LDR training

Buffer Length

SE

R

MeanStd

Figure 14: SER of DLR on the wireless data as the length of the input buffer is increasedfrom 1 to 5 in unit steps.

A feed-forward network (FFN) with one hidden layer containing 5 hidden units (usingthe tanh activation function) was trained for 150 epochs using back-propagation on afixed length input buffer (i.e a sliding window). Prior experimentation revealed that alearning rate of 0.01, and an input shift of 0 had good performance, so these were used,for a look at the sensitivity of this model to input shift see Section-5.2.8.

A 5000 step random sequence was used for training and a 2.5×106 step random sequencewas used for testing, full testing was performed. The SNR of both training and testing

34

10 15 20 25 30 35 40 45 501

2

3

4

5

6

7x 10

−3 Effect on SER of varying the buffer length for DLR training

Buffer Length

SE

R

MeanStd

Figure 15: SER of DLR on the wireless data as the length of the input buffer is increasedfrom 6 to 50 in unit steps.

Buffer Length SER Std


Table 14: SER of DLR on the wireless data as the length of the input buffer is increasedfrom 1 to 10 in unit steps.

data was set to 32db. The experiment was repeated for different input buffer lengths.The results, averaged over 50 trials, are shown in Figure-16 and Table-15.

According to ANOVA testing, an FFN with an input buffer of size 10 trained for 150epochs has performance which is not significantly different (p=0.7469) from the best ESNresults reported above for 32db.

The experiment was repeated using 100 epochs for training and the best result had abuffer size of 11 and achieved an SER of 4.496e-06 with a std of 2.275e-06. According toANOVA this is significantly worse (p=0.002) than the best ESN results for an SNR of32db.

35

5 10 15−3

−2

−1

0

1

2

3

4x 10

−3SER of an FFN Readout Coupled to an Unprocessed Input Buffer

Input Buffer Length

SE

R

SERStd

Figure 16: The SER of an FFN coupled to an unprocessed input buffer of Wireless Data,as the size of the input buffer is varied.

Buffer Length SER Std Min Max

5 7.527e-04 3.102e-03 2.320e-05 1.586e-026 4.532e-04 2.001e-03 0.000e+00 1.131e-027 2.903e-04 1.160e-03 0.000e+00 5.109e-038 6.768e-06 3.472e-05 0.000e+00 2.472e-049 4.218e-05 2.825e-04 0.000e+00 2.000e-0310 1.816e-06 1.079e-06 0.000e+00 6.400e-0611 2.864e-06 5.714e-06 4.000e-07 4.160e-0512 3.728e-06 6.690e-06 4.000e-07 4.160e-0513 1.674e-04 1.168e-03 8.000e-07 8.262e-0314 4.105e-04 2.015e-03 0.000e+00 1.099e-0215 3.220e-04 1.579e-03 0.000e+00 8.233e-03

Table 15: The SER of an FFN coupled to an unprocessed input buffer of Wireless Data,as the size of the input buffer is varied.

This indicates that an unprocessed input buffer can perform as well as an ESN if it istrained for long enough. This result is not particularly surprising, because the informationused by the ESN is obviously present in the unprocessed input stream, and FFN are goodfeature extractors.

The relatively shallow depth (10) of the sliding window implies that the task does notrequire a very deep memory.

The mean and std do not tell the full story. Figure-17 compares the distribution of resultvalues between the best input buffer plus FFN readout case and the best ESN reservoirplus RLS readout case. Each point corresponds to one trial.

36

0 10 20 30 40 500

2

4

6

8x 10

−6Distribution of Results for an Input Buffer with an FFN Readout

SE

R

0 10 20 30 40 500

1

2

3

4

5x 10

−5Distribution of Results for an ESN Reservoir with an RLS Readout

SE

R

Figure 17: Comparison of the individual trial results for the best input buffer plus FFNreadout result and the best ESN reservoir plus RLS readout result.

35 out of 50 of the ESN trials resulted in an SER of 0, whereas only 1 trial of the inputbuffer plus FFN approach resulted in an SER of 0. The ESN points do not appear tobe normally distributed about the mean, they appear to be distributed according to anarrower distribution with relatively broad tails.

Figure-18 compares the FFN trials for an input buffer of size 7 with 150 epochs trainingand with 25 epochs training. With 25 epochs of training and a buffer size of 7, 19 out of50 trials resulted in an SER of 0. With 150 epochs of training and a buffer size of 7, 17out of 50 trials resulted in an SER of 0. It is clear that the reason the mean SER is notmuch lower is because the majority of the “cumulative mass” of the results is made upby infrequent bad runs.

These observations suggest that the FFN results are just as good as the ESN resultsbut they suffer from instability, if a validation set were used it is likely the averageperformance would be much better.

However, the ESN uses a linear readout which can be trained online in one pass of theinput data which further supports the idea that it is more robust. Indeed the resultseffectively show that if the number of epochs are small then the ESN with an RLSreadout is more robust than the FFN plus buffer approach.

4.9 FPM Reservoir with VQ, RLS, and FFN Readouts

The FPM model is described in Appendix-A.0.1. As mentioned in the introduction, theclassical FPM model is equivalent to an ESN with

� A weight matrix equal to I · k, where k is the contraction coefficient.

37

0 10 20 30 40 500

0.01

0.02

0.03

0.04

0.057 Unit Input Buffer with an FFN Readout Trained for 200 Epochs

SE

R

0 10 20 30 40 500

2

4

6x 10

−37 Unit Input Buffer with an FFN Readout Trained for 25 Epochs

SE

R

Figure 18: Comparison of the individual trial results for the best input buffer plus FFNreadout result with 150 epochs of training, and and the best input buffer plus FFNreadout result with 25 epochs of training.

� Input weights equal to (1− k).

The input dimensionality is 1 and in the classical form of the FPM, the reservoir is thesame dimensionality as the input. In order to increase the size of the FPM reservoir,the input was “encoded” by multiplying it by a weight vector of the desired size. Thisintroduces variation into the contractions of different dimensions in the projection andcan potentially yield useful information.

Three different readout mechanisms were tried

1. VQ Prediction

2. RLS

3. FFN

VQ Prediction quantizes the current reservoir state to obtain a prediction context, theprediction context defines a vector of probabilities for each of the possible output symbols.An output symbol is selected according to the probability distribution. The probabilitiesare estimated by counting next-symbol observations, given each vector quantized context,as the fractal encoder is iterated over a training set.

The RLS and FFN readouts use supervised learning so they were attached to the FPMreservoir. They were both trained by updating the output weights/FFN in the normalmanner to approximate the targets. The targets are discrete so in computing the testerror, appropriate quantization of the output was performed to give a symbolic error rate.

For each readout mechanism, several different parameter settings were tried, but settingscould not be found which gave performance that was better than a random predictor.

38

4.10 Summary

Memory Model Readout Model SER Std

ESN Reservoir (46 Units) RLS 1.528e-06 6.199e-06ESN Reservoir (46 Units) DLR 7.339e-03 2.682e-03ESN Reservoir (46 Units) FFN 2.840e-05 6.804e-05Lookup Table Nearest History 3.899e-03 1.078e-03Input Buffer (10 Units) RLS 4.387e-03 1.902e-03Input Buffer (10 Units) DLR 4.828e-03 2.266e-03Input Buffer (10 Units) FFN 1.816e-06 1.079e-06FPM Reservoir VQ Prediction N/A N/AFPM Reservoir RLS N/A N/AFPM Reservoir FFN N/A N/A

Table 16: Summary of the results for replication of [1] and comparison models. TheSER reported is the lowest mean SER for the model in question obtained over the set ofexperimental parameters tested. N/A means that the performance was no better thana random predictor. The number of units given in brackets after a model name is theoptimal number of memory units found for the model, and is the number associated withthe results shown.

The section started with the replication of results by Jaeger in [1], the claimed excellentperformance of the ESN reported there was verified, and the performance was probablyimproved upon by discovering better settings 11. The better settings revealed that itis not necessary to augment the ESN reservoir with the current input to obtain goodperformance; the ESN reservoir alone is sufficient12.

The experiment was repeated substituting DLR as a readout. The performance wasgrossly inferior (several orders of magnitude) implying that a strong regression modelmust be used to extract enough task-relevant information from the ESN activations toachieve a good SER.

The experiment was repeated substituting an FFN as a readout. The performance wasan order of magnitude worse even with 200 training epochs.

An experiment was executed using a lookup table model (nearest history), and the theperformance was several orders of magnitude worse than that of the ESN with an RLSreadout. Although interestingly, using the early termination approach to testing, ANOVAsays that the performance is better than ESN reservoir coupled to a DLR readout (p=0).The results suggest that the training set is too small to allow a large fixed order Markovmodel to perform well.

Next, an unprocessed input buffer was coupled to three different regression models forreadouts; RLS, DLR, and FFN. All three models had small optimum input-buffer lengths.The RLS and DLR readouts performed poorly (orders of magnitude worse than thebest ESN result), but the FFN readout performed extremely well, showing no statistical

11Because of differences in testing strategy, the results cannot easily be statistically compared12This is not the same as saying that the performance is equivalent.

39

difference according to ANOVA between it and an ESN coupled to an RLS readout(p=0.7469).

All the experiments performed using an FPM reservoir produced SERs which were nobetter than a random predictor.

4.11 Discussion

When an RLS readout is coupled to an ESN reservoir the performance is excellent, butwhen it is coupled to an input buffer the performance is poor. Whereas when an FFNreadout is coupled to an ESN reservoir the performance is relatively poor, but whencoupled to an input buffer the performance is excellent. The poor performance of RLSon an input buffer can be attributed to non-linearity of the task since it is a linearregression model. This means that the ESN reservoir does a good job of “linearising” thetask.

The relatively poor performance of an FFN readout coupled to an ESN reservoir ismysterious, but implies that the processing of the input stream performed by the ESNreservoir is more suited to RLS. More information regarding this is provided in Section-6.2below where an FFN readout is coupled to a linear ESN reservoir.

Using an FFN readout, an input buffer size of 10 is sufficient to obtain performancestatistically equivalent (p=0.7469) to the best ESN result demonstrated (given sufficientlymany training epochs). In the experiments above, the ESN reservoir had 46 units, whichis more than 4 times as many units as the buffer. The sensitivity of the ESN reservoir tonumber of reservoir units is investigated in Section-5.6 below.

The FPM uses a very simple kind of constant contraction. All the results using thiskind of contraction were no better than a random predictor. This strongly implies thatconstant contraction does not preserve enough information from the input stream toenable a linear or a non-linear readout mechanism to readily solve the task.

40

5 A Closer Look at the ESN Results

5.1 Introduction

This section asks how sensitive the ESN results presented above are to perturbation ofthe indicated parameters.

5.2 Sensitivity of the ESN results to Input Shift

5.2.1 Introduction

In [1], the input is shifted by adding the constant 30 before it is passed through theinput-to-reservoir weights to the ESN reservoir. Presumably this was done because itwas found necessary to get good performance. It is natural therefore to question whyit is necessary, and a first step, is to take a look at how the performance changes withinput shift. It is then natural to ask whether or not the findings are specific to the ESNor whether they generalise to other models. These questions are the objective of thissection.

Subsequent reference to the term “input shift” refers to a scalar quantity which is addedto the input before it is passed to the model in question.

All of the experiments in this section terminate testing early if 10 errors accumulateprior to exhaustion of the test set. As explained earlier, this appears to tend towardan underestimate of performance, and is generally an unfair way to test models. Themaximum length of the test sequence differs between experiments in this section, andthere is variability in the number of trials. This was due to time constraints, but themaximum accuracy deemed feasible given the available time was used in each case. Thislack of consistency means that results cannot fairly be compared for statistical differences.

However, the results presented in this section are only intended to answer general ques-tions about the model dynamics, which should be relatively independent of the testaccuracy. Yet, it should be remembered that useful information could be missed, whichwould have otherwise been revealed had higher accuracy and more consistent testing beenused.

Sensitivity of the FPM reservoir model to input shift is not covered because its perfor-mance was no better than random for all settings tried (which included different inputshifts).

5.2.2 ESN Reservoir with an RLS Readout

Input shift was varied from 0 to 100 in steps of 1. For each input shift, 100 ESN wererandomly generated, trained using a 5000 step random sequence (with 100 steps for ESNwashout), and tested on a 106 step random sequence (with 100 steps for ESN washout).Termination of testing occurred early if 10 errors accumulated before the test sequencewas exhausted. Experiments for each setting were averaged over 50 trials. The results are

41

plotted in Figure-19 and Figure-20. The data for shifts of 1 to 10 is shown in Table-17.

0 2 4 6 8 10−0.02

0

0.02

0.04

0.06

0.08

0.1

0.12

Input Shift

SE

R

SER of an ESN with an RLS Readout as the Input Shift is Changed

MeanStd

Figure 19: SER of an ESN Reservoir with an RLS readout as the input sequence is shiftedby increasing amounts.

10 20 30 40 50 60 70 80 90 100−0.5

0

0.5

1

1.5

2

2.5

3x 10

−4

Input Shift

SE

R

SER of an ESN with an RLS Readout as the Input Shift is Changed

MeanStd

Figure 20: SER of an ESN Reservoir with an RLS readout as the input sequence is shiftedby increasing amounts.

As can be seen, shifting the input has a marked effect on the SER and associated std.There is an unexplained artifact which occurs within the range of shifts 0 to 10, shownin Figure-19 and Table-17; the SER is lower for an input shift of 0 than for an input shiftof 1. This is in contrast to the subsequent behavior, namely that as the input shift isincreased from a shift of 1 onwards, the SER and std decrease. They continue to decrease

42

Input Shift SER Std

0 3.385e-03 1.121e-031 5.999e-02 4.948e-022 4.086e-02 2.768e-023 2.588e-02 1.666e-024 1.265e-02 7.770e-035 7.945e-03 8.993e-036 3.046e-03 2.356e-037 1.186e-03 1.184e-038 6.851e-04 7.658e-049 2.376e-04 2.452e-0410 1.291e-04 1.368e-04

Input Shift SER Std

15 6.898e-05 8.225e-0520 5.936e-05 6.232e-0525 5.699e-05 7.660e-0530 4.674e-05 4.566e-0540 2.915e-05 4.461e-0550 2.154e-05 2.737e-0560 1.651e-05 2.100e-0570 1.682e-05 1.892e-0580 2.615e-05 2.839e-0590 4.283e-05 4.277e-05100 3.503e-05 4.467e-05

Table 17: SER of an ESN Reservoir with an RLS readout as the input sequence is shiftedby increasing amounts. See Section-5.2.2 for further information.

until a low point is reached at a shift of 54 and then the SER and std start climbing asthe input shift is increased further.

The results imply that a large shift is needed to get good performance, that too muchshift is detrimental, and that the best performance comes from a shift of approximately50. (Incidentally it is therefore apparent that the shift of 30 used in [1] is unlikely to beoptimal).

Two questions follow from the notion that input shift is important:

1. Do the other models tested in Section-4 derive the same benefit from large inputshifts as the ESN does?

2. What effect is manifest in the activations of the ESN reservoir as a consequence ofincreasing the input shift, and why does this result in a performance increase?

To address the first question, the models were considered in turn.

5.2.3 ESN Reservoir with a DLR Readout

The above input-shift experiment was repeated for an ESN reservoir with a DLR readout.The DLR and ESN settings were as in Section-4.3. The results are shown in Figure-21and Table-18.

As can be seen, the performance of an ESN reservoir with a DLR readout decreases asthe input shift is increased.

5.2.4 ESN Reservoir with an FFN Readout

The above input-shift experiment was repeated for an ESN with an FFN readout. TheESN and FFN were setup as in Section-4.4. The results are shown in Figure-22 andTable-19.

43

0 10 20 30 40 50−0.2

0

0.2

0.4

0.6

0.8

1

1.2(ESN + DLR) SER as the Input Shift is Increased

Input Shift

SE

R

Figure 21: SER of an ESN Reservoir with a DLR readout as the input sequence is shiftedby increasing amounts.

Input Shift SER Std


Table 18: SER of an ESN Reservoir with a DLR readout as the input sequence is shiftedby increasing amounts.

As can be seen from Table-19, the SER of an ESN reservoir with an FFN readout decreasesas the input shift is increased. For the first 20 shifts, the change is not so obvious butthe general increasing trend is apparent. Figure-22 clearly shows the subsequent increasein SER that occurs for greater input shifts. The increase in SER slows and appears tolevel off at around an input shift of 50, at this point the SER cannot really get any worsebecause an SER of 0.75 is the expected value for random prediction.

44

0 20 40 60 80 100

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1Sensitivity of the ESN+FFN to Input Shift

Input Shift

SE

R

Figure 22: SER of an ESN Reservoir with an FFN readout as the input sequence isshifted by increasing amounts.

Shift SER Std


Table 19: SER of an ESN Reservoir with an FFN readout as the input sequence is shiftedby increasing amounts.

5.2.5 Nearest History Approach

The nearest history model is invariant with respect to scaling of inputs because it alwaysselects the nearest stored history according to the L2 norm. Scaling the inputs by someconstant C causes the L2 distance between any two vectors to increase by a factor ofC, this is shown in Appendix-A.2. Because the distance between vectors increases by alinear function, the relative difference between vectors does not change, and hence thenearest history algorithm is scale invariant with respect to inputs.

45

5.2.6 Unprocessed Input Buffer with an RLS Readout

When the RLS algorithm was used in the experiments of Section-4.6, although it wasnot explicitly demonstrated, it was noted that a shift of 0 was required to get goodperformance. To explicitly demonstrate the sensitivity of the RLS algorithm to inputscaling, an experiment was executed.

An input buffer of length 10 was used, and the RLS algorithm was used for readout.The RLS forgetting factor was set to 1.0. All other factors were as specified in Section-4.6. The input shift was systematically increased from 0 to 30 in steps of 1, and foreach setting, the SER was computed by averaging over 100 randomised test sequences oflength 106. The results are shown in Figure-23 and Table-20.

0 5 10 15 20 25 300

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18RLS Algorithm SER as the Input Shift is Varied

Input Shift

SE

R

MeanStd

Figure 23: RLS SER as the input shift is gradually increased.

Input Shift SER Std


Table 20: RLS SER as the input shift is gradually increased.

As can be seen increasing the input shift when the RLS algorithm, coupled with an input

46

buffer, is used, actually increases the SER, which is opposite to the behavior observedwhen a properly configured ESN is coupled with an RLS readout.

5.2.7 Unprocessed Input Buffer with a DLR Readout

The experiment immediately above was repeated using a DLR readout. The results areshown in Figure-24.

0 5 10 15 20 25 300

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1DLR Algorithm SER as the Input Shift is Varied

Input Shift

SE

R

MeanStd

Figure 24: DLR SER as the input shift is gradually increased.

Input Shift SER Std


Table 21: DLR SER as the input shift is gradually increased.

For this model it can be seen that again input shift is detrimental to performance.

47

5.2.8 Unprocessed Input Buffer with an FFN Readout

The experiment above was repeated using an FFN readout setup as in Section-4.8. Theresults are shown in Figure-25 and Table-22.

0 5 10 15 20 25 30 35 40 45 500

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1SER of an FFN Readout Coupled to an Unprocessed Input Buffer

Input Shift

SE

R

SERStd

Figure 25: SER of an FFN readout coupled to an unprocessed input buffer as the inputshift is increased.

Shift SER Std


Table 22: SER of an FFN readout coupled to an unprocessed input buffer as the inputshift is increased.

As can be seen, increasing the input shift, increases the SER. Interestingly however,inspecting the actual data revealed that for an input shift of 2, many of the runs hadvery low SER, much lower than for an input shift of 0, so it is likely that increasing theshift very slightly does improve results but it makes the results less stable such that onaverage the results are not better.

48

5.2.9 Summary

Memory Model Readout Model Optimal Input Shift

ESN Reservoir RLS 54ESN Reservoir DLR 0ESN Reservoir FFN 6Lookup Table Nearest History N/AInput Buffer RLS 0Input Buffer DLR 0Input Buffer FFN 0

Table 23: Summary of experiments assessing sensitivity of several models to input shift.

The first question asked was

� Do the other models derive the same benefit from a large input shift as does theESN?

It would appear from Table-23 that shifting the input is beneficial only for the ESN modeland apparently only when the RLS regression algorithm is used as a readout mechanism.There also appeared to be slightly improved performance for an FFN with an unprocessedinput buffer for input shifts of 2, 3, and 4 but the average case performance decreasedwith input shift.

It is not surprising that the non-linear models do not benefit from an input shift, sinceshifting the inputs has no direct effect on the dynamics of such models. There was anopposite effect however; the SER increased with input shift, it is highly likely that thisis due to the following effect.

The output of the linear network/buffer is shifted as a consequence of the input beingshifted and so the output of the linear readout mechanism is shifted (because the sameinitialisation range for weights is used irrespective of input shift). The desired outputshowever, do not change which means that the difference between desired and actualoutputs increases, and it therefore requires more drastic changes in the weights to achievegood SER, which likely results in a decrease in training efficiency and the observed effect.This is very probably the reason for seeing an increase in SER with input shift whenDLR and RLS are used with an unprocessed input buffer and when RLS is used with anFPM reservoir.

When an FFN readout is used with an unprocessed input buffer and an FPM reservoir,and the input is shifted, and tanh saturates in the hidden layer of the FFN because ofthe shifted inputs (for example, tanh(5) = 9.9991e-01). This destroys information thatis needed for the task and so the SER increases. Naturally, changing the weights cancounteract this but the gradients are so steep because of saturated units that changes aredrastic and reduce the efficiency of learning. Once a certain threshold of input shift isexceeded, the tanh units become super-saturated and learning is hopeless.

For input shifts of 1, 2, and 3 there did appear to be a higher frequency of low SERresults in the unprocessed input buffer case, although the average SER did not change

49

because some runs had high SER. This seems reasonable because for these shifts, giventhat the FFN weight initialisation range was [−1, 1], the activations would be spread outmore, with the tanh units being very close to saturation.

The most interesting results are thus that the SER of an ESN reservoir performs wellwith input shift when an RLS readout is used, and badly when an FFN readout is usedor a DLR readout is used. It is speculated that perhaps the RLS readout is more suitedto the temporal nature of the problem so gains a slight advantage over both models. Andit is likely that the FFN and DLR models are not as sensitive (perhaps due to chosenlearning rates, length of training sets, or other parameters) so cannot detect the extrainformation which is obviously apparent to the RLS readout.

In any case, input shift is clearly an important factor in achieving good results with theESN reservoir plus RLS readout model.

5.3 Analysis of the ESN Reservoir Unit Activations for Differ-ent Input Shifts

Now to address the second question

� What effect is manifest in the activations as a consequence of increasing the inputshift, and why does this result in a performance increase?

Using the ESN initialisation parameters shown in Table-1 (except with different inputshifts), an ESN was created and driven for 5000 epochs (with an initial 100 epoch washoutperiod) with a random wireless data input sequence. At each step the reservoir unitactivation values were recorded. This was repeated 10 times with different input sequencesgiving a total of 10 ∗ 4900 ∗ 46 = 2.254 × 106 activation values. The entire process wasrepeated for input shifts between 0 and 100.

Figure-26 shows the mean and std activation value for each input shift. Clearly, shiftingthe input spreads the inputs out more. This is to be expected since tanh activationfunctions were used, but still it was necessary to confirm that this occurred in practicebecause the dynamics of the ESN reservoir are different to direct stimulation of a tanhactivation function.

A more detailed view of the spread of activations is shown in Figure-27, there, for eachdifferent input shift, a histogram with 20 bins has been drawn. This shows the distributionof activation values. This plot should be compared to Figure-20 which shows sensitivityto input of an RLS readout attached to an ESN reservoir. Figure-20 shows that the bestperformance occurs in the region between input shifts of 50 and 60, this is the sixth rowin Figure-27; a region in which activations are well spread and the histograms look moreflat. Performance is poorer for small input shifts, when the activations are bunched uptoo close in the middle, and performance is poorer for very large input shifts, when theactivations are pushed into small regions at the edges of the range; i.e where tanh beginsto saturate.

So back to the question “What effect is manifest in the activations as a consequenceof increasing the input shift, and why does this result in a performance increase?”. A

50

0 20 40 60 80 100−1

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

Act

ivat

ion

Input Shift

Mean ESN Reservoir−Unit Activation for Different Input Shifts

MeanStd

Figure 26: The mean and std activation values of a relatively sparsely (connectivity 0.2)connected ESN reservoir for different input shifts.

speculative answer is that an increased spread of activations is manifest in the activationsas a consequence of increasing the input shift, this results in an increase in performancebecause it increases the separability of inputs thereby making it easier to distinguishbetween distinct input histories and consequently reduces errors.

If a unit contains incoming connections from many other units, on the one hand it seemsintuitively reasonable to suggest that its net input will be significantly larger than if itis connected to less units, but on the other hand since the ESN weights are initialisedrandomly and uniformly from [−1, 1], the average incoming signal should be close to 0irrespective of the amount of incoming connections unless the number is very low, hencefrom that perspective it seems reasonable to suggest that connectivity does not effect theaverage excitation of a unit to a large extent. Naturally it will depend on the weightsand how changing the spectral radius effects networks of different connectivity.

To explore this idea, the above experiment was repeated using an ESN reservoir with aconnectivity of 1. Figure-28 shows the differences (Mean activation for fully connectedreservoir - Mean activation for reservoir with a connectivity of 0.2) for different inputshifts. Also shown is the differences in standard deviation.

The figure shows that there is no obvious difference in the spread of activations fordifferent input shifts at the two different connectivities, this is indicated by the factthat the difference goes in both directions and does not appear to be biased in eitherdirection. This implies that it is unlikely that the degree of connectivity makes anysignificant difference to the effective input shift (at least not above a connectivity of 0.2for this network).

51

Figure 27: Histograms with 20 bins each for different input shifts, the top left plothas an input shift of 0 and the input shift is incremented in each subsequent plot, theincrementation goes across all columns of the the first row, then across all columns of thethe second row etc.

5.4 Sensitivity of the ESN results to Weight Matrix Connectiv-ity

It is interesting to ask more generally what effect the weight matrix connectivity has onthe performance of the ESN. It was argued in the last section that connectivity is notresponsible for altering the input shift, but it is probable that the degree of connectivitychanges the dynamics or some other aspect of the model which is important.

Using an RLS forgetting factor of 1, but otherwise using the settings shown in Table-1,the ESN weight matrix connectivity was increased from 0.1 to 1 in steps of 0.1. And inaddition, some smaller connectivity values were tried (in order to cover the lower end ofthe scale more comprehensively). For each connectivity, 100 randomly initialised ESNwere created and then tested on a random sequence of 107 inputs with early termination

52

0 20 40 60 80 100

−0.1

−0.05

0

0.05

0.1

Difference Between Mean Activation of a Fully Connected ESN Reservoir and an ESN Reservoir with a Connectivity of 0.2

Input Shift

Diff

eren

ce

MeanStd

Figure 28: Difference between mean and std reservoir unit activation values betweena fully connected ESN reservoir and an ESN reservoir with a connectivity of 0.2, fordifferent input shifts.

after 10 errors. The results are shown in Figure-29, Figure-30, and Table-24.

0 0.01 0.02 0.03 0.04 0.05−0.2

0

0.2

0.4

0.6

0.8

1

1.2ESN SER as the Connectivity is Increased

Weight Matrix Connectivity

SE

R

MeanStd

Figure 29: ESN SER as the connectivity of the weight matrix is increased.

As Figure-29 and Figure-30 show, increasing the connectivity initially decreases the SER,but very quickly (at a connectivity of about 0.04, which is around 80 connections) theimprovement ceases and further increase in the connectivity has no noticeable effect. TheSER for a connectivity of 0.2 and for a connectivity of 0.7 are very similar, the SER ofthe former is slightly lower than the other whereas the std of the latter is slightly lower

53

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1−2

0

2

4

6

8

10

12x 10

−5 ESN SER as the Connectivity is Increased

Weight Matrix Connectivity

SE

R

MeanStd

Figure 30: ESN SER as the connectivity of the weight matrix is increased.

Connectivity SER Std

0.0010 2.747e-01 1.790e-010.0020 2.366e-01 1.666e-010.0030 1.966e-01 1.785e-010.0040 7.516e-02 1.026e-010.0050 4.563e-03 5.743e-030.0100 2.966e-02 1.426e-010.0200 9.037e-05 1.315e-040.0300 7.889e-05 6.674e-050.0400 5.176e-05 5.618e-050.0500 4.978e-05 5.389e-05

Connectivity SER Std

0.1000 4.997e-05 5.146e-050.2000 3.189e-05 3.209e-050.3000 4.936e-05 4.927e-050.4000 5.276e-05 5.086e-050.5000 4.254e-05 3.782e-050.6000 2.413e-05 2.436e-050.7000 3.216e-05 3.110e-050.8000 3.375e-05 3.240e-050.9000 3.759e-05 2.977e-051.0000 3.724e-05 3.585e-05

Table 24: ESN SER as the connectivity of the weight matrix is increased.

than the other (a previously run trial with a testing set of size 106 showed the same fora connectivity of 0.1 and a connectivity of 1.0).

In [1] they say, with regard to establishing an ESN reservoir for the purpose of Mackeyglass series continuation: “It is important that the “echo” signals be richly varied. Thiswas ensured by a sparse interconnectivity of 1% within the ESN reservoir: this conditionlets the reservoir decompose into many loosely coupled subsystems, establishing a richlystructured reservoir of excitable dynamics”.

Whilst the task here is different, it is interesting that a fully connected reservoir has per-formance that appears statistically equivalent13 to that of a sparsely connected reservoir,suggesting that a certain threshold of sparse connectivity is sufficient but not necessary

13A test for significance with this data would not be fair

54

to obtain excellent performance.

An important property of a recurrent network when used for time series prediction isits ability to memorize the past. The only way to memorize the past in a network likethis is through recurrency. Isolated units can only provide information about the currentinput (they can do so because the input is fully connected to the reservoir). It is thusinteresting to ask, at what point of reservoir connectivity, does each unit have at least oneoutgoing connection, and at least one incoming connection? Or in other words, at whatpoint is the reservoir able to transmit potentially all the information in the reservoir tothe next epoch?

For each of the connectivity settings tested above, 100 random ESN were created, and foreach ESN, the number of reservoir units which had incoming connections were counted,and the number of reservoir units which had outgoing connections were counted. Theresults are shown in Table-32 and plotted in Figure-31.

0 0.05 0.1 0.15 0.2 0.25 0.30

5

10

15

20

25

30

35

40

45

50

Reservoir Connectivity

Uni

ts

Num Units With Connections as Reservoir Connectivity is Increased

Units with incoming connectionsStd for incoming connectionsUnits with outgoing connectionsStd for outgoing connections

Figure 31: Average number of ESN reservoir units which have incoming and outgoingconnections as the connectivity is increased. (The graph continues in a straight line with0 std for connectivity beyond 0.2.)

Note that beyond a connectivity of 0.2, all units have at least one incoming connectionand at least one outgoing connection, with a std of 0.

It would appear that the performance reaches a reasonable level between a connectivityof 0.04 and a connectivity of 0.2. At a connectivity of 0.04, about 40 of 46 units have anincoming connection, and about the same have an outgoing connection. It is interestingthat the first time all units have both an incoming and an outgoing connection, accordingto Table-32 is at a connectivity of 0.2, which is the connectivity parameter used in [1].

These results suggest that what might have appeared previously to be sparse connectivitymight not be so sparse, in that most of the reservoir units have at least one outgoingand at least one incoming connection. This means that most of the information in the

55

Connectivity Mean In Std In Mean Out Std Out SER Std

0.0005 1.000 0.000 1.000 0.000 7.283e-01 1.185e-010.001 1.980 0.141 1.960 0.197 3.101e-01 1.586e-010.002 3.890 0.314 3.850 0.359 2.503e-01 1.524e-010.003 5.650 0.520 5.600 0.569 1.643e-01 1.431e-010.004 7.400 0.791 7.480 0.643 1.037e-01 1.487e-010.005 9.790 0.880 9.920 0.861 3.332e-02 8.872e-020.01 17.360 1.573 16.980 1.421 9.119e-04 1.470e-030.02 27.720 1.913 27.590 2.202 7.769e-03 7.692e-020.03 34.690 2.264 34.160 2.168 6.245e-05 5.936e-050.04 39.620 1.994 38.870 1.773 4.759e-05 4.625e-050.05 41.580 1.854 41.410 1.810 4.444e-05 4.802e-050.1 45.710 0.498 45.490 0.718 3.245e-05 3.640e-050.2 46.000 0.000 46.000 0.000 4.593e-05 5.615e-050.3 46.000 0.000 46.000 0.000 3.790e-05 3.927e-05

Figure 32: Average number of ESN reservoir units which have incoming and outgoingconnections as the connectivity is increased. Mean In (Mean Out) are the mean numberof units which have incoming (outgoing) connections, and Std In and Std Out are theassociated standard deviations. Also shown are the SER and Std from Table-24, whichcorrespond to the connectivity column.

network will be passed on via recurrent connections to the network state at the nextepoch, which might explain why even with a low connectivity, a large network can havegood performance (if there were many units having no incoming or outgoing connections,they would only serve to duplicate the input data and would not act as memory units.)

5.5 Sensitivity of the ESN results to Number of Non-LinearReservoir Units

Referring back to Section-5.3, it was suggested that shifting of the input causes partialsaturation of the non-linear reservoir units, spreading out activations and increasing sep-arability of different input sequences. This section explores how the SER varies as thenumber of non-linear units in the ESN reservoir is varied.

Using an RLS forgetting factor of 1, but otherwise using the settings described in Table-1,all 46 neurons of an ESN reservoir were set to use the identity function as their activationfunction. Then, one at a time, the neurons were changed to use the tahn activationfunction. For each configuration, 100 randomly initialised ESN were created and thentested on a random sequence of 106 inputs with early termination after 10 errors.

As can be seen from the results shown in Figure-33, Figure-34, and Table-35, withoutnon-linearity, the SER is very high, but with the addition of only a small number ofnon-linear units, the SER rapidly decreases. Beyond ≈ 10 non-linear units it is hard todiscern any difference in the results.

56

0 0.5 1 1.5 2 2.5 3 3.5 40

0.1

0.2

0.3

0.4

0.5

0.6

0.7

ESN SER as the "Non−Linearity" is Increased

Number of nonlinear units out of 46

SE

R

MeanMinMaxStd

Figure 33: ESN SER as the “non-linearity” of the dynamic reservoir is gradually in-creased. To begin with all neurons implement the identity function, and then, one byone, they are changed to implement the tahn activation function. The plot shows theresults from 0 non-linear units out of 46 to 4 out of 46.

This is interesting because the results from the last section suggested that sparse con-nectivity is sufficient to obtain good performance, and these results suggest that sparse“non-linearity” is sufficient to obtain good performance.

5.6 Sensitivity of the ESN results to Number of Reservoir Units

The foregoing discussion leads to asking whether all the units are really necessary toobtain good performance, since good performance can be obtained in a sparsely connectednetwork and good performance can be obtained in a network with sparsely distributed“non-linearity”.

To answer this question, the number of ESN units was increased from 1 to 75 in stepsof 1, and for each size a random ESN was created using the same parameters as before,but with a connectivity of 1 and an RLS forgetting factor of 1. Each ESN was trainedon a 5000 step sequence and tested on a 106 step sequence. The results are shown inFigure-36 and Figure-37.

As can be seen from the figures, there is a rapid decrease in SER as the number of unitsare increased from 1 to 15, the decrease in SER and std continues as the reservoir unitsare increased from 15 to 75 although not as quickly.

Table-25 shows the SER for every 5 unit increase from 5 to 75. The results suggestthat no less than the number of units specified in Table-1 are sufficient to achieve theperformance obtained with that number of units, given the other parameters are keptfixed.

57

5 10 15 20 25 30 35 40 450

1

2

3

4

5

6

7

8

9

10x 10

−4 ESN SER as the "Non−Linearity" is Increased

Number of nonlinear units out of 46

SE

R

MeanMinMaxStd

Figure 34: ESN SER as the “non-linearity” of the dynamic reservoir is gradually in-creased. To begin with all neurons implement the identity function, and then, one byone, they are changed to implement the tahn activation function. The plot shows theresults from 4 non-linear units out of 46 to 46 out of 46.

Size SER Std Min Max

0 8.405e-02 5.100e-02 1.942e-02 3.125e-015 1.713e-04 1.264e-04 0.000e+00 5.836e-0410 7.734e-05 7.367e-05 0.000e+00 4.480e-0415 4.804e-05 5.379e-05 0.000e+00 3.552e-0420 5.339e-05 5.098e-05 0.000e+00 2.548e-0425 5.224e-05 5.940e-05 0.000e+00 3.783e-0430 4.277e-05 4.791e-05 0.000e+00 3.171e-0435 5.554e-05 7.908e-05 0.000e+00 6.746e-0440 4.819e-05 5.366e-05 0.000e+00 2.304e-0445 4.168e-05 5.103e-05 0.000e+00 2.663e-04

Figure 35: ESN SER as the “non-linearity” of the dynamic reservoir is gradually in-creased. To begin with all neurons implement the identity function, and then, one byone, they are changed to implement the tahn activation function.

The experiment was repeated using a fully connected ESN reservoir, Figure-38 shows thedifferences (mean with full connectivity - mean with sparse connectivity (0.2)) and (stdwith full connectivity - std with sparse connectivity) plotted against number of reservoirunits.

The top subplot shows the differences from 1 to 12, the differences are larger and moresignificant here. Although no significance test was performed, it would appear that forsmall numbers of units a small connectivity is slightly detrimental. Although it cannot be

58

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15−0.2

0

0.2

0.4

0.6

0.8

1ESN SER as the Number of Reservoir Units is Increased

Number of Reservoir Units

SE

R

MeanStd

Figure 36: ESN SER as the number of reservoir units is increased from 1 to 15.

20 30 40 50 60 70−5

0

5

10

15

20x 10

−4 ESN SER as the Number of Reservoir Units is Increased

Number of Reservoir Units

SE

R

MeanStd

Figure 37: ESN SER as the number of reservoir units is increased from 15 to 75.

confirmed that this is true, it seems intuitively reasonable that in a small network, thereis less information in general, which can more readily be preserved by more connections.

The bottom subplot shows the differences from 12 to 75 reservoir units, as can be seen,the magnitude of the differences is much smaller and goes in both directions.

In general then, for reasonable numbers of reservoir units, it can be argued that there isno general difference in performance between a reservoir with a connectivity of 0.2 and areservoir with a connectivity of 1.

59

Size SER Std

1 7.951e-01 1.040e-015 1.877e-01 1.799e-0110 8.247e-03 1.374e-0215 7.982e-04 9.131e-0420 2.970e-04 4.648e-0425 1.960e-04 1.442e-0430 1.019e-04 1.289e-0435 8.561e-05 6.041e-05

Size SER Std

40 7.280e-05 7.145e-0545 5.617e-05 6.035e-0550 2.142e-05 2.476e-0555 8.778e-06 1.115e-0560 2.188e-05 2.906e-0565 1.033e-05 1.485e-0570 6.880e-06 1.270e-0575 7.564e-06 1.495e-05

Table 25: ESN SER as the number of reservoir units is increased

2 4 6 8 10 12−0.15

−0.1

−0.05

0

0.05

Number of units

SE

R

Mean full − Mean sparseStd full − Std sparse

20 30 40 50 60 70−15

−10

−5

0

5x 10

−4

Number of units

SE

R

Mean full − Mean sparseStd full − Std sparse

Figure 38: Difference between the results of varying the number of units, for a connectivityof 0.2 vs a connectivity of 1.

This supports the idea introduced in the last section that once the majority of reservoirunits have at least one outgoing connection and at least one incoming connection, thedynamics of the reservoir responsible for the power of the ESN are established.

The image now is of a reservoir in which each unit is connected to others, but in a sparsemanner, and that non-linearity is important but sparse non-linearity is sufficient.

The connections are partly controlled by the ESN spectral radius and thus it is naturalto ask how the performance behaves as the spectral radius is changed, this is the subjectof the next section.

60

5.7 Sensitivity of the ESN results to Weight Matrix SpectralRadius

Using an RLS forgetting factor of 1, but otherwise using the settings shown in Table-1,the spectral radius, to which the ESN weight matrix is rescaled prior to training, wasincreased from 0.1 to 1 in steps of 0.1. For each spectral radius, 100 randomly initialisedESN were created, trained on a random 5000 step sequence (with 100 steps ESN washout)and then tested on a random 2.5×106 step sequence (with 100 steps ESN washout). Theresults are shown in Figure-39 and Table-26.

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0

1

2

3

4x 10

−4 ESN SER as the Spectral Radius is Increased

Weight Matrix Spectral Radius

SE

R

MeanStd

Figure 39: ESN SER as the spectral radius of the weight matrix is increased for a fixedconnectivity of 0.2. The std for a spectral radius of 1 is omitted from the graph becauseit attenuates the other readings too much, its std was 0.0014.

As can be seen the SER increases as the spectral radius is increased. Small spectralradius implies very small weights, which implies a high degree of contractivity.

5.8 Analysis of ESN Reservoir Recurrent Weight Distribution

It is interesting to ask what the distribution of weights are in an ESN whose weightshave been rescaled to some specified spectral radius. A histogram of 1000 randomized46-Neuron ESN reservoirs initiated with the parameters shown in Table-1 is shown inFigure-40, the weight values, for all 1000 networks, concatenated and sorted in ascendingorder are shown in Figure-41.

The results are as expected; the weights are all very small.

61

Spectral Radius SER Std

0.1 8.591e-07 2.252e-060.2 1.006e-05 2.152e-050.3 8.008e-06 1.016e-050.4 2.705e-05 3.183e-050.5 3.762e-05 4.758e-050.6 4.796e-05 3.727e-050.7 6.144e-05 5.548e-050.8 8.942e-05 9.277e-050.9 8.682e-05 6.240e-051 2.421e-04 4.505e-04

Table 26: ESN SER as the spectral radius of the weight matrix is increased for a fixedconnectivity of 0.2. The std for a spectral radius of 1 is omitted from the graph becauseit attenuates the other readings too much, its std was 0.0014.

−0.2 −0.15 −0.1 −0.05 0 0.05 0.1 0.15 0.20

2

4

6

8

10

12

14

16

18x 10

5 Weight Value Histogram of 1000 46−Neuron ESN Reservoirs

Value

Cou

nt

Figure 40: Histogram of weights of 1000 randomized ESNs, using the initialization pa-rameters specified in Table-1.

5.9 Analysis of the ESN Reservoir Dynamics for Constant In-puts

The small spectral radius and consequent small weights used by the ESN imply a highdegree of contractivity. It is therefore useful to confirm that the ESN dynamics are indeedcontractive and to take a closer look at their behavior. This can be achieved by drivingan untrained ESN reservoir with each of the possible inputs for a fixed duration andthen recording the consequent activations as the reservoir is allowed to self-sustain in theabsence of the input.

62

0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2

x 106

−0.2

−0.15

−0.1

−0.05

0

0.05

0.1

0.15

0.2

The Weight Values Of 1000 46−Neuron ESN Reservoirs

Weight

Val

ue

Figure 41: Plot of weight values of 1000 randomized ESNs, using the initialization pa-rameters specified in Table-1.

In Section-3 it was shown that there are only 511 different inputs and they are taken witha spacing of 0.02 evenly from [−5.1, 5.1]. This means it is feasible to test every input.

For each of the inputs, an ESN initialised according to the settings in Table-1 was drivenin isolation with the chosen input for 1000 iterations, the ESN was then updated withoutinput for a further 4000 iterations. The number of iterations taken from the point whenthe input ceased until the difference between each of the reservoir unit activations and 0became smaller than a threshold was recorded. The results were recorded for thresholds10−1 to 10−19 and are shown in Figure-42.

Figure-42 confirms the contractivity of the ESN reservoir and shows that the activationsdecay linearly with time. It is interesting that the decay seems quite slow which mightimply a deep memory. Without performing an analysis, it is not clear at which pointthe information would cease to become useful to the reservoir, i.e would cease to becomedetectable by the readout mechanism.

5.10 Summary

The section began by showing that the ESN is sensitive to input shift, the experimentsdemonstrated that large input shifts were necessary to obtain good performance. It wasshown empirically that large input shifts resulted in a larger range of the activationfunction being used; the activations were more spread out. It was speculated that thismight enhance the separability of input streams resulting in an increase in regressionpotential.

The connectivity experiments implied that good performance was conditional on themajority of reservoir units having at least one incoming connection and at least one

63

0 −5 −10 −15 −200

10

20

30

40

50

60

70

80Number of Iterations Before Reservoir Dynamics Decay below Threshold

Threshold (10x)

Num

ber o

f Ite

ratio

ns

MeanStd

Figure 42: Number of iterations before all reservoir activations are closer than thresholdto 0.

outgoing connection. It was discovered that this occurred at relatively low connectivities.It was found that further increase in the connectivity had no obvious effect on the SER.

It was discovered that the number of units in the ESN reservoir that are non-linear isimportant. Performance was found to increase as the number of non-linear units in anESN reservoir were increased. It appeared as though once a relatively small fraction ofthe ESN units were established as non-linear, further addition of non-linearity did notresult in significant improvement.

It was shown that relatively large numbers of reservoir units were required to obtain goodperformance (performance comparable to the results presented in [1]). Reconciling thiswith the connectivity results, it is clear that sparsity of connections does not imply failureto utilise all units, as the majority of units in a relatively sparsely connected network areused. For example, almost all units in a 46 unit network with a connectivity of 0.1 willhave at least one incoming and at least one outgoing connection.

The experiments with spectral radius demonstrated that small spectral radii had betterperformance than large spectral radii. The distribution of weights after spectral radiusrescaling confirmed their contractive properties.

It was confirmed through experimentation of reservoir dynamics that the behavior of theESN was indeed contractive.

5.11 Discussion

It seems feasible that the success of fully connected ESN reservoirs obscures the analysis.If the addition of extra connectivity, and non-linearity beyond some threshold is not

64

beneficial, then the implication is that beneficial dynamics of the ESN are establishedvia a limited number of connections, and that the non-linearity requirements can besatisfied by relatively few units, suggesting perhaps that the remaining units act simplyas contractive memory units. At this point it can be speculated that the ESN has thefollowing three properties:

1. Contractivity in the form of small weights to provide memory.

2. Non-linear units to make linearly inseparable inputs linearly separable and to in-crease the difference between different inputs.

3. Computational ability through connections between reservoir units.

The next step is to try and elucidate further the properties of the ESN reservoir, throughthe direct experimentation of variants.

65

6 Restricting the ESN

6.1 Introduction

The purpose of this section is to strip down the ESN and reduce it to simpler form sothat the essence of the model may be exposed.

6.2 Performance Without Spectral Radius Rescaling

When the reservoir weights are rescaled to obtain a specified spectral radius, they areliterally divided by the old spectral radius, and then multiplied by the desired spectralradius. Thus, the spectral radius rescaling is nothing more than a weight rescaling. Thismeans that it is not necessary to execute a spectral radius rescaling if an appropriateweight range is picked to begin with. For example, the settings shown in Table-27 produceexcellent results using uniform random initialisation of weights with no spectral radiusrescaling.

Parameter Value

Input to reservoir weight range [−0.025, 0.025]Reservoir to reservoir weight range [−0.1, 0.1]Connectivity 0.2

Table 27: ESN settings that achieve excellent results without using spectral radius rescal-ing.

The average SER and std, for a run with 50 trials, were respectively 2.448e-06 and9.144e-06 . According to ANOVA this result is not significantly different from the bestESN result (for 32db) presented in the first section of this report (p=0.5573).

6.3 ESN Reservoir with Constant Reservoir or Constant InputWeights

In order to simplify, it is interesting to ask what happens when the ESN reservoir weightsare initialised to a constant value, or what happens when the input-to-reservoir weightsare initialised to a constant value14.

Four cases arise from the fact that the input-to-reservoir weights can be initialised eitherfrom a random distribution or with a constant, and the fact that the reservoir-to-reservoirweights can be initialised either from a random distribution or with a constant. Anexperiment was performed to examine each of these cases. For each case, ESN wereinitialised using the settings shown in Table-27, trained on a 5000 step random sequenceand fully tested on a 2× 106 step sequence, the results were averaged over 50 trials andare shown in Table-28.

14The reservoir-to-output weights are irrelevant because the RLS algorithm was found to be indifferentto their initial values, they were initialised to 0 here.

66

Input Weights Reservoir Weights SER Std

Constant 0.025 Constant 0.1 6.590e-04 2.720e-04Constant 0.025 Random [−0.1, 0.1] 6.390e-05 4.931e-05Random [−0.025, 0.025] Constant 0.1 9.128e-06 2.092e-05Random [−0.025, 0.025] Random [−0.1, 0.1] 2.448e-06 9.144e-06

Table 28: The SER and std of four cases which arise from initialising the input-to-reservoirweights from a random interval or to a constant, and from initialising the reservoir-to-reservoir weights from a random interval or to a constant.

The best results occur when both the input is weighted randomly and the reservoir isweighted randomly. Compared to the case when the input is weighted randomly andthe reservoir is weighted with a constant, the former is significantly better (ANOVA,p=0.0412).

The experiment discussed above was repeated using an ESN with identity activationfunctions (linear units) and an FFN readout. The readout had one hidden layer with 5tanh units, 25 epochs and a learning rate of 0.03 were used for training. The results areshown in Table-29.

Input Weights Reservoir Weights SER Std

Static 0.1 Static 0.26 7.501e-01 2.897e-04Static 0.1 Random [−0.26, 0.26] 2.112e-06 5.001e-06Random [−0.1, 0.1] Static 0.26 7.500e-01 2.428e-04Random [−0.1, 0.1] Random [−0.26, 0.26] 5.760e-06 9.838e-06

Table 29: Performance of a 46 unit ESN reservoir implementing identity function activa-tion units, for different values for the reservoir-to-reservoir and input-to-reservoir weightranges.

The performance is best for static input-to-reservoir weights (that is, input scaling) cou-pled with randomly initialized reservoir-to-reservoir weights. Compared to the best ESNresults, according to ANOVA, the results are not significantly different (p=0.6053). Theresults are also not significantly different from an input buffer with 10 units coupled toan FFN readout (p=0.6834), but only 25 epochs were used here compared to 150 epochsfor the last.

The performance of an FFN readout coupled to an ESN reservoir with linear units isalmost an order of magnitude better than that of an FFN readout coupled to a ESNreservoir with non-linear readouts.

These results are interesting and suggest that the non-linearity tested here in the ESN,when coupled to an FFN readout has a detrimental effect on FFN training. It is specu-lated that perhaps non-linear pre-processing interferes with the FFNs ability to extractfeatures from the ESN reservoir so it performs better in the case with identity functions.

Compared to the FFN performance when coupled to an input buffer (See Section-4.4),the number of training epochs have been reduced by a factor of 5. Indeed, the bestperformance discovered for an FFN readout coupled to an input buffer when trained

67

for only 25 epochs, has performance (SER of 1.56e-02 with a std of 1.680e-02) which isseveral orders of magnitude worse than an FFN readout coupled to an ESN reservoirwith identity activation functions (SER of 2.112e-06 with a std of 9.838e-06). The poorperformance of the former was due primarily to unstable learning in a minority of cases,therefore the ESN reservoir appears to have alleviated this instability.

Removing the non-linearity from the ESN reservoir, whilst finding evidence that thereservoir can still aid the performance of a readout (when compared to a sliding windowapproach), is useful because a reservoir with identity function units is characterized bya single matrix of weights. Which means that the reservoir in that case is an iteratedfunction system (IFS); iteration of the reservoir given an input stream corresponds torecursive application of a sequence of affine transformations (excluding translation), thatis, of reflections and rotations (either rotation of the state, or equivalently a coordinatetransform in the opposite direction).

The matrix would be easier to analyse if it were symmetric. An ESN was establishedwith 46 units, input-to-reservoir weights from [−0.025, 0.025] and reservoir-to-reservoirweights from [−0.2, 0.2]. The reservoir was fully connected but the weights were initialisesymmetrically. 50 epochs of 5000 inputs were used for training, and 2.5×106 inputs wereused for testing. The mean SER over 50 trials was 3.520e-07 with a std of 1.080e-06 .According to ANOVA this is not significantly different from the best ESN + RLS result(p=0.1894).

6.4 ESN With Only Self Loops

The last section demonstrated that under certain circumstances the ESN benefits froma randomly weighted reservoir, but does it benefit from the inter-connections, or justthe decaying of past inputs with random coefficients, that this connectivity, in part,affords? To find out, an ESN was setup with only self-loops, that is, each reservoir unitwas connected only to itself, and the input was fully connected to the reservoir. Anexperiment was executed to compare several different weights for input-to-reservoir andreservoir-to-reservoir connections. A 5000 step training sequence was used for training(with 45 inputs used for washout), and 2.5 × 106 inputs were used for full testing (with45 inputs used for washout). The results of the experiment, averaged over 50 randomtrials, are shown in Table-30.

Input Weights Self Weights SER Std

Constant 0.02 Constant 0.1 7.500e-01 2.624e-04Constant 0.02 Random [−0.1, 0.1] 2.009e-02 1.597e-03Random [−0.02, 0.02] Constant 0.1 5.921e-01 1.525e-03Random [−0.02, 0.02] Random [−0.1, 0.1] 1.015e-02 9.520e-03

Table 30: Performance of an ESN Reservoir with only self-loops, coupled to an RLSreadout, for different settings of the self-loop and input-to-reservoir weights.

The best performance occurs when both the input-to-reservoir and reservoir-to-reservoirweights (here self-loops), are initialised randomly ANOVA (p=0). However, this per-

68

formance is several orders of magnitude worse than the best ESN performance, whichimplies that the ESN needs the inter-unit connections to get the best performance, andtherefore that the inter-unit connections provide something that a decaying memory (evenwith random coefficients) does not. It is speculated, that the ESN connections providecomputation as well as decaying memory.

6.5 ESN Circuits

As a step up from the idea of using only self-loops, consider a minimal kind of connectivitywhere there is just one circuit of connections. Each ESN unit is indexed, and thenconnected to its immediate successor with wraparound (and the highest indexed unit isconnected to the lowest indexed unit since it has no successor). The result is a singlecircuit of connected reservoir units, which is itself fully connected to the single input andfully connected to the single output.

An ESN reservoir with 46 units was connected in a circuit using reservoir-to-reservoirweights either drawn randomly from [−0.1, 0.1] or set as the constant 0.1. The input-to-reservoir weights were either drawn randomly from [−0.025, 0.025] or set as the constant0.025. An experiment was executed to check performance for each of the arising fourcases. Reservoirs were trained on a 5000 step random sequence (with 45 inputs used forwashout. Testing was executed fully, using a 2.5 × 106 input sequence. Table-31 showsthe results for each of the four cases that arise.

Input Weights Circuit Weights SER Std

Static 0.025 Static 0.1 7.500e-01 2.464e-04Static 0.025 Random [−0.1, 0.1] 3.698e-03 1.114e-03Random [−0.025, 0.025] Static 0.1 2.720e-07 1.121e-06Random [−0.025, 0.025] Random [−0.1, 0.1] 1.847e-04 1.031e-04

Table 31: Performance of ESN Reservoir whose 46 units are connected in a single circuit,coupled to an RLS readout, for different settings of the reservoir-to-reservoir (circuit) andinput-to-reservoir weights.

The most striking result is the performance of the random/static configuration in Table-31, that is, with random input weights from [−0.025, 0.025] and a static reservoir weight of0.1. This result is not significantly different from the best proper ESN result (p=0.1618).However, with 84% confidence it can be said that the performance is better. In terms ofthe actual distribution of results, 45 out of 50 had an SER of 0 compared to 35 out of 50for the proper ESN.

The experiment was repeated using an FFN readout with 5 tanh hidden units, coupled toa circuit of 46 units using the identity activation function. It was found that very smallweights were detrimental to performance so a lower bound on absolute weight magnitudewas set. The results for different weightings are shown in Table-32.

The performance is poor considering that 46 units are used. Comparing this to theperformance of a buffer connected to an FFN readout, or an ESN reservoir with identity

69


Static 0.1 Static 0.3 5.762e-02 4.333e-02Static 0.1 Random [−1,−0.3] ∪ [0.3, 1] 2.653e-04 5.746e-04Random [−0.1, 0.1] Static 0.3 6.663e-02 6.097e-02Random [−0.1, 0.1] Random [−1,−0.3] ∪ [0.3, 1] 4.727e-04 9.332e-04

Table 32: Performance of ESN Reservoir whose 46 units are connected in a single circuit,with identity activation functions coupled to an FFN readout, for different settings of thecircuit and input-to-reservoir weights.

units connected to an FFN readout, it is clear that the results for a circuit with linearactivation functions are inferior.

This is in contrast to the performance of the RLS algorithm on a non-linear circuit ofunits, which had excellent performance, indicating that if linear units are used, thensomething more than a simple loop of connections is required.

6.6 ESN Sub Circuits

An increase in complexity from a single circuit is to allow several sub-circuits to co-exists. This can be achieved by connecting each unit to exactly one other at random,with the provision that no unit may have more than one incoming connection. That is,each unit has exactly one incoming and exactly one outgoing connection. In analogy tothe experiment of the last section, the input-to-reservoir and reservoir-to-reservoir weightswere initialised with constant values or from random distributions. The results are shownin Table-33.


Static 0.025 Static 0.26 7.500e-01 2.741e-04Static 0.025 Random [−0.26,−0.05] ∪ [0.05, 0.26] 1.941e-04 1.662e-04Random [−0.025, 0.025] Static 0.26 1.048e-05 1.236e-05Random [−0.025, 0.025] Random [−0.26,−0.05] ∪ [0.05, 0.26] 5.104e-06 8.482e-06

Table 33: Performance of ESN Reservoir whose 46 units are each connected to only twoother units (one incoming connection and one outgoing connection). The reservoir wascoupled to an RLS readout. The performance for different settings of the reservoir-to-reservoir and input-to-reservoir weights is shown.

The best performance occurs with random input-to-reservoir weights and random reservoir-to-reservoir weights. That performance is significantly worse than the results of thelast section (p=0.0001) and significantly worse than the best proper ESN performance(p=0.018).

70

6.7 Summary

The best performing results (including some from previous sections) are summarized inTable-34.

Number Mnemonic SER Std ANOVA Meaning

1 ESN-Best 1.528e-06 6.199e-06 1.0000 =

2 ESN-No-Spec 2.448e-06 9.144e-06 0.5573 =3 ESN-Linear 2.112e-06 5.001e-06 0.6053 =4 ESN-Linear-Symmetric 3.520e-07 1.080e-06 0.1894 =5 ESN-Self-Loop 1.015e-02 9.520e-03 0.0000 <6 ESN-Circuit 2.720e-07 1.121e-06 0.1618 =7 ESN-Circuit-Linear 4.727e-04 9.332e-04 0.0006 <8 ESN-One-Conn-Per-Unit 5.104e-06 8.482e-06 0.0180 <

Table 34: The best conventional ESN results on the wireless task for an SNR of 32db,compared to several other similar models. The ANOVA column designates the p value foran ANOVA test involving the result from the first row and the row of interest (p < 0.05indicates a statistically significant difference). The meaning column gives an interpreta-tion of the statistic, an “=” symbol means that the result is not significantly differentfrom the best conventional ESN result, and a “<” symbol means the result is significantlyworse than the best conventional ESN result.

1. ESN-Best: The best standard ESN result, spectral radius of 0.1, RLS readout,input-to-reservoir weights from [−0.025, 0.025], and reservoir-to-reservoir weightsfrom [−1, 1].

2. ESN-No-Spec: ESN with no spectral radius rescaling, RLS readout, input-to-reservoir weights from [−0.025, 0.025], and reservoir-to-reservoir weights from [−0.1, 0.1].

3. ESN-Linear. ESN with identity activation functions (instead of tanh) and anFFN readout trained for 25 epochs, input-to-reservoir weights from [−0.1, 0.1], andreservoir-to-reservoir weights from [−0.26, 0.26].

4. ESN-Linear-Symmetric ESN with identity activation functions and a symmetricweight matrix, FFN readout trained for 50 epochs, input-to-reservoir weights from[−0.025, 0.025], and reservoir-to-reservoir weights from [−0.2, 0.2].

5. ESN-Self-Loop. ESN with only self-loops, an RLS readout, input-to-reservoirweights from [−0.02, 0.02], and reservoir-to-reservoir weights from [−0.1, 0.1].

6. ESN-Circuit. Each ESN unit is connected only to its indexed successor, withwraparound to form a single circuit of ESN units, coupled to an RLS readout,input-to-reservoir weights from [−0.025, 0.025], and reservoir-to-reservoir weightsconstant at 0.1.

7. ESN-Circuit-Linear. Connections established as immediately above but identityactivation functions used, FFN readout trained for 30 epochs, input-to-reservoirweights from [−0.1, 0.1], and reservoir-to-reservoir weights from [−1,−0.3]∪ [0.3, 1].

71

8. ESN-One-Conn-Per-Unit. Each ESN unit has only one incoming and one outgo-ing connection, coupled to an RLS readout, input-to-reservoir weights from [0.025, 0.025],and reservoir to reservoir weights from [0.26, 0.05] ∪ [0.05, 0.26].

First of all it was demonstrated that an ESN with uniformly initialised weights and nospectral radius rescaling could achieve performance statistically equivalent (p=0.5573) tothe best performing ESN with spectral radius scaling. This was an obvious result, butit makes explicit the fact that a spectral radius rescaling is nothing more than a weightrescaling.

It was shown that for the ESN, random input-to-reservoir and random reservoir-to-reservoir weights statistically outperformed the cases that arise from either or bothweights being kept constant.

It was demonstrated that an ESN with identity function activation units and an FFNreadout could achieve statistically equivalent performance to a non-linear ESN with anRLS readout (p=0.6053), although more than one pass through the training data wasrequired in the former case. The results were also statistically equivalent to an inputbuffer with an FFN readout (p=0.6834), yet less training epochs were required implyingthat the ESN reservoir enhanced the stability of the FFN training.

With the addition of extra epochs it was shown that an ESN with identity unit acti-vation functions and a symmetric weight matrix could achieve statistically equivalentperformance to a non-linear ESN with an RLS readout (p=0.1894).

An ESN was initialised with a diagonal weight matrix, that is, each unit was connectedonly to itself (ESN-Self-Loop). The best performance (p=0 compared to the second bestcase) came from randomly weighted self-loops and random input-to-reservoir weights, butthe best performance was several orders of magnitude worse than the best conventionalESN result.

Next, the ESN reservoir units were connected in a single circuit with an RLS readout(ESN-Circuit), the best performance occurred when the reservoir-to-reservoir weightswere kept constant and the input-to-reservoir weights were randomly initialised. Theperformance was not significantly different from the best ESN result (p=0.1618).

The experiment was repeated using an FFN readout and identity activation function ESNunits. The performance was approximately two orders of magnitude worse, indicatingthat more than a circuit of connections was necessary to get good performance withlinear units.

Finally, an ESN reservoir was established so that each unit had only one incoming and oneoutgoing connection. The performance was best for random input-to-reservoir weightsand random reservoir-to-reservoir weights (p=0.0128 compared to the next best). Com-pared to a circuit of ESN units the performance was statistically worse (p=0.0001) andcompared to the best ESN results the performance was statistically worse (p=0.018), andin addition, a delicate setting of the weight ranges was found to be necessary to achievethe reported performance.

72

7 Overall Summary

In Section-3 the wireless data generation equations were analysed. The analysis re-vealed that the non-linearity was almost linear in the absence of noise. However, its non-injectivity in the presence of noise coupled with the highly non-injective pre-processingstep gave confidence that the task was a good benchmark.

Section-4 replicated the results of [1] and compared the ESN model against several othermodels. The results can be summarized as follows:

� An FFN readout coupled to a sliding window is in principle as powerful as an ESNwith an RLS readout (p=0.7469), it however requires many more training epochsto achieve this.

� The RLS readout is the only readout tested with sufficient power to efficiently learnfrom a non-linear ESN reservoir and obtain excellent performance.

� An ESN reservoir with identity activation function units and constantly weightedself-loops, is practically useless at preserving information relevant to the task. Allattempts to predict given this context are no better than random.

Section-5 examined the results of Section-4 in more detail and performed some analysis,the results can be summarized as follows:

� The sensitivity of the ESN to input shift is due to the effect of input shift onspread of activations; greater input shifts give a more uniform covering of the ac-tivation function which increases separability and regression potential, resulting inan increase in performance.

� The degree of connectivity does not change the “effective input shift”.

� A minimum kind of connectivity, where each unit has at least one incoming connec-tion and at least one outgoing connection, is sufficient but not necessary to obtaincompetitive performance with an ESN reservoir coupled to an RLS readout.

� A minimum level of non-linearity is sufficient but not necessary to obtain competi-tive performance with an ESN reservoir coupled to an RLS readout.

� The ESN dynamics are contractive.

Section-6 tested several restricted versions of the ESN. The results can be summarizedas follows:

� The performance of ESN reservoirs with identity function units (and optionallysymmetric weight matrices) coupled to FFN readouts are in principle as powerfulas high performing ESN reservoirs with RLS readouts, except that more trainingepochs are required in the former case.

73

� The performance of an ESN reservoir with self loops is non-competitive.

� The performance of a non-linear ESN reservoir with a single circuit of units hadperformance statistically equivalent to the best full ESN approach.

74

8 Discussion

8.1 Conclusions

The excellent performance of the ESN Circuit model with an RLS readout, and theexcellent performance of the ESN model with identity units coupled to an FFN readout,given the poor performance of linear and non-linear ESN reservoir models employing onlyself-loops, irrespective of readout model, suggests that the non-diagonal terms in the ESNreservoir weight matrices are a very important aspect of the ESN model.

The results suggest that non-linearity in the ESN reservoir is not a necessary precursor tocompetitive performance since the identity function ESN was able to reduce the numberof epochs required to train an FFN readout to a competitive level. This is supported byempirical evidence presented in Section-5.5 where it was shown that only a small numberof non-linear units were required to obtain competitive performance.

In Section-1.4 some questions were posed at the outlook of the research, here I willattempt to address the extent to which they have been answered.

1. The information in an ESN is derived from the input stream it is driven by. Thisraises the question: is a reservoir of echo states necessary at all? Or can a simpleapproach based on a sliding window work just as well?

The experiments of Section-4 demonstrated that an FFN coupled to a sliding win-dow can achieve statistically equivalent performance (ANOVA, p=0.7469) to anESN with an RLS readout if the number of training epochs is sufficiently large.Whether or not this can be judged to imply that the former works “just as well”as the latter is debatable. The sliding window approach required a 150 epochs totrain whereas the RLS approach achieves its excellent performance in 1 epoch.

It appeared as though the sliding window approach had the potential to achievecompetitive epochs in less epochs, but it was unstable. It is highly probable thatwith a validation set or some other training monitoring, this instability would van-ish. But even in this case, a small number of epochs and a mothering approachto training is not the same as one pass with an online adaptation algorithm. Itcould be argued that until a sliding window approach is found that obtains com-petitive performance in one pass of the training data, that no direct sliding windowapproach works “just as well”.

2. It is suggested that the way in which an ESN reservoir is created is importantin obtaining ”rich dynamics” and subsequent good performance. Is the suggestedsparse connectivity a necessary precursor for good performance? Why?

The experiments of Section-5 demonstrated that it was necessary to have a certainminimal kind of connectivity (given a fixed size ESN reservoir), in order to obtaincompetitive performance. This connectivity appeared to correspond to the pointwhere each unit had at least one incoming connection and at least one outgoingconnection. And this idea was supported by the results of Section-6.5 where a singlecircuit of ESN units was shown to have competitive performance (each unit havingexactly one incoming connection and exactly one outgoing connection).

75

The experiments of Section-5 also demonstrated that an ESN reservoir is not pe-nalised by being over-connected. The implication is that, starting from an un-connected ESN, beneficial dynamics are established by the addition of a “few”connections, but as more connections are added, the beneficial dynamics are notwashed out by the added complexity.

It was demonstrated that an FFN readout coupled to a single circuit of ESN unitswith identity activation functions, did not achieve competitive performance. Al-though an ESN with a connectivity of 0.2 coupled to an FFN readout did. Thissupports the idea that the beneficial dynamics are established with relatively fewconnections, and implies that perhaps either the non-linearity enhances the effector the RLS algorithm is better at exploiting it, because it would appear that whena non-linear ESN reservoir is coupled to an RLS readout, it can make do with lessconnectivity.

Further investigation is needed to determine the minimal kind of connectivity re-quired to get competitive performance using an ESN with identity activation func-tions and an FFN readout.

3. What kind of dynamics does the ESN actually have? And how do these dynamicsbenefit the ESN in terms of regression potential?

The dynamics of the ESN are contractive in the long term, which means that theESN has decaying memory. The dynamics obtained in the short term are a ques-tion that remains unanswered as yet. What can be said is that it is apparent thatthe power of the ESN comes from the off-diagonal terms of the weight matrix, andthat this is an affine transformation. The input-to-reservoir weights perform a co-ordinate transformation of the input. And the dynamics are long term contractive.Intuitively then, a sequence of reservoir states can be seen as a fractal encoding ofthe input stream.

The exact nature of the transformation is speculated to be dominated by strongrotations15 but this has yet to be demonstrated. A rotation can induce a spiralingdynamics which can to some extent emulate stable states and this could be behindthe power of the ESN. Further investigation is needed.

8.2 Future Direction

The idea that the ESN dynamics arise through rotations can be attacked in two differentways.

1. A symmetric ESN reservoir weight matrix with identity activation functions wasshown to have competitive performance when coupled to an FFN readout. Thedynamics of this model are not affected by interactions with non-linear activationfunctions, and the symmetry of the weight matrix makes it easier to analyse. It wassuggested by my supervisor that singular value decomposition could be performed

15Speculated by my supervisor Peter Tino

76

to analyse exactly what kind of rotations and reflections the ESN reservoir matrixwas executing in such a case.

2. It was suggested by my supervisor that weight matrices be constructed which havespecific rotational components and that an experiment could be executed varyingthe amount of rotation that the weight matrix would induce. The results of theESN with matrices inducing different degrees of rotation could then be compared.

Owning to time constraints neither method was executed, but they both seem like goodsuggestions. Information regarding the construction of matrices for general n-dimensionalrotations can be found in [11].

These suggestions, along with the weaknesses reported in Section-9 will be addressed andthe results will be submitted for publication.

8.3 Relation to Other Work

Jaeger states in [1] that the ESN model is similar to the LSM (Liquid State Machine)model of Wolfgang Maass[12]. The LSM is biologically motivated model of a cortical mi-crocircuit, the microcircuit is initialised with randomly connected heterogenous neurons.The model is used to solve temporal problems by projecting the input stream into itsliquid-like body of neurons, where a simple readout mechanism, for example a perceptronis attached and trained.

The analogies between the LSM and ESN are clear; both models employ a dynamic un-trained reservoir to processes the input stream and a simple trainable readout to solve thetask. But the differences in terms of implementation, dynamics, and analytical tractabil-ity are enormous, leading one to question how far the analogy extends and whether itis reasonable to claim that “ESNs employ artificial recurrent neural networks in a waythat has recently been proposed independently as a learning mechanism in biologicalbrains”[1].

Another example of a “dynamic projection network” is Jochen Steil’s BPDC algorithmand associated dynamic network[13]. It employs a large untrained dynamic reservoir (withcontractive dynamics), and a readout mechanism which tries to maximise the disparitybetween different network states.

The idea implicated behind all these models is that the high dimensional dynamics of thereservoir somehow enhance information in the input stream and lead to better learningmechanisms.

In [14] Legenstein and Maass ask the question “What makes a dynamical system compu-tationally powerful?”, it is suggested that biological computational systems self-criticallyorganise to a regime between order and chaos and have obtain excellent computationalproperties through “computation at the edge of chaos”.

Computation at the edge of chaos could play a role in the short term dynamics of theESN. To test this the short term dynamics should be analysed to assess sensitivity toinitial conditions and to ascertain whether there is any kind of chaotic behavior.

77

Concepts such as self-criticality, chaos, and emergence, are over-used terms which havesomewhat vague meanings. They are increasingly being used as explanations to phenom-ena without addressing the root of the problems.

The object of this research was to try and get to the root of the ESNs power in amathematically clean way, without resorting to abstract notions like “computation atthe edge of chaos”.

If light can be shed about the mechanism of computation in ESNs perhaps it is possibleto take some of the conclusions and apply them to the analysis of LSMs.

The results so far seem to suggest an affine transformation fractal processing of the inputstream. The ubiquity of fractals in nature almost leads one to question not if, but whereand how they are related to the brain and it would not be surprising if a similar mechanismwas at work there. It is certainly a perspective worth considering. Future experimentswith the LSM could explore this possibility.

8.4 A Short Comment On the Idea of Dynamic Projection Net-works

The ESN and LSM models make use of a large dynamic reservoir of activations into whichthe input is projected. It is speculated that the reservoir into which inputs are projected,enhances separability of salient information whilst providing a decaying memory. Sepa-rability and decaying memory are two features which both models claim to have. Maasscalls these separation property [12] and fading memory.

What follows is an attempt to analyse the kind of features that an ideal reservoir shouldpossess.

I ask the open question “What kinds of dynamical systems obtain dynamics which arecomputationally useful with respect to non-linear regression and prediction tasks, andwhat kinds of computation are useful?”

In the case of regression it would seem as though the power afforded by, what is effectivelya pre-processing of the input by a non-linear projection, is the recombination of inputinformation in such a way to enhance its utility to the task at hand. One can easilythink of examples where a non-linear combination of inputs can produce informationwhich is intuitively more important to a task than the original input, as an exampleconsider the division operator which can change distance travelled and time taken intospeed (assuming unidirectional measure of distance travelled), thus a ratio may be moreuseful than its component numerator and divisor in isolation.

Consider a prediction task. Clearly information about the past is important so thedynamic reservoir, or rather the input to the regression model, needs to take the pastinto account, hence the notion of memory. The necessity for fading memory can beargued from the perspective of finiteness of computational resources and also from theperspective of pertinence of information; but at what point does historical informationbecome non-relevant with respect to prediction of the future? If one restricts oneselfto tasks which require only fading memory then this question can be quantified with a

78

task-specific answer, but in the general case this depth will be unknown. Further, what isthe effect of non-relevant inputs on the performance of the regression model?, and whatis the effect of non-relevant promotion of latent information?

These are questions which are clearly important, but as far as I can tell, nobody hasattempted to answer.

Separation is an important concept, the ability to distinguish between different inputstreams and hence to be able to derive a base context from which to accurately predictthe future is clearly important; separation of elements from different classes is useful inclassification.

It seems that any useful prediction system needs to strike a delicate balance between thefollowing (with respect to the task at hand)

1. The promotion of latent information whose presence is useful for the task.

2. The immediate destruction of salient information whose presence is detrimental tothe task.

3. The separation of semantically disjoint salient information whose presence is usefulfor the task.

4. The storage of salient information that will be useful for the task in the future.

5. The destruction of salient information that will be detrimental to the task in thefuture but that was previously useful (i.e that was not destroyed immediately).

By latent-information, I mean information that can only be expressed by non-linearlyrecombining existing information. One can also contrive other factors which might beuseful, for example

1. The suppression of information for a finite period where it is irrelevant to the taskcoupled with its subsequent re-expression whence it again becomes relevant to thetask.

The terms useful and detrimental are somewhat analogous to the ideas of relevance andirrelevance, except that irrelevant information need not be detrimental. Such informationcould be called benign. Detrimental and irrelevant information are conventionally callednoise. It could be argued that noise is never benign, since it takes up space in an organismsinput window which could potentially be replaced by more useful information should itbe available, thus benign noise could supplant useful information and be detrimental inthe sense that more information could be available if the noise were not present.

Is there any kind of model that has all these properties, I would argue that the humanbrain does. But how can a random reservoir know what information to “express” andwhich information to suppress? The answer is of course, that it cannot know. Thisimplies that there are certain kinds of computation performed by generic reservoirs whichare beneficial to a wide variety of tasks? Is it the “edge of chaos”? What are thesecomputations and why do they improve performance?

79

9 Evaluation

The weakest aspect of this study is the data set used for experiments, because

1. Only one data set was used so it is not clear if the performance of different modelstested will generalise to different data sets.

2. It was used mainly in low noise forms so it is not clear if the order of performancereported generalises across different noise levels.

Clearly this should be rectified by testing the most important models and results presentedhere in the context of several different data sets with different but known characteristics,and for various noise levels within each data set. The reason for not doing this to beginwith was time, and not knowing in advance what the interesting models or experimentswould be.

It could be argued that this report spends to much time discussing the results frompoor performing models. This was however deliberate, as a pre-emptive strike at theobjection that the alternative models were not explored to a deep enough level to concludetheir inferiority. Whilst this could still be claimed, the presentation of testing multipleparameters should reduce the propensity to doubt.

Some of the experiments used the testing method of [1], which was to terminate testingearly if 10 errors occurred, and then compute the SER by dividing by the number ofsymbols seen. This meant that the results of some experiments could not be statisticallycompared fairly, so they were only compared informally from observation. Whilst it isunlikely that this changes the conclusions of this work, it means that useful informationcould have been missed in the cases where the experiments were more important. Obvi-ously a follow up work should repeat those experiments but using a full testing strategyand average over more trials.

80

10 Conclusion

It is the combination of the powerful RLS readout algorithm and ESN reservoir withnon-linear activations that enables excellent performance to be obtained in one pass ofthe training data.

It is the connections between units that provide the power, not the decaying memory inisolation. The long term dynamics of the ESN model are contractive and the short termdynamics are governed by recursively applied affine transformations.

The specific transformations which obtain the ESN its power are unknown but the evi-dence suggests that there is a dominant underlying dynamic which emerges at a minimalthreshold of connectivity and which remains present as the complexity of the network isincreased.

It is believed that determination of this dynamic will lead to profound insights into theoperation of the ESN model.

Mechanisms to analyse the transformations induced by the ESN reservoir weights will bethe subject of future research.

81

A Document Appendix

A.0.1 FPM Model

The FPM (Fractal Prediction Machine) consists of: (i) an FPM Reservoir, and, (ii) areadout mechanism.

The FPM reservoir encodes an input stream recursively in a fractal manner. Its operationis very simple. The reservoir r has as many units as the dimensionality of the input i. Thereservoir is initialised in some pre-determined start state, for example 0. The reservoir isupdated according to the equation below

r(t) = k · r(t− 1) + (k − 1) · i(t)

k is called the contraction coefficient and determines how fast the past is forgotten.

It might be desirable to change the dimensionality of the FPM reservoir, this can beachieved by re-encoding the input. For example, a one-dimensional real-valued inputcould be encoded in five dimensions by multiplying a five dimensional weight vector.

A.1 The Addition of i.i.d Gaussian Noise to a Continuous Signal

Given a specified SNRdb s in decibels (db), Gaussian noise can be added to a signal i toachieve the desired SNR. Note that

SNR =signalrms

noiserms

SNRdb = 10 ∗ log10(SNR)

so

noiserms = SNRsignalrms

= 100.1∗SNRdb

signalrms

For the case of the wireless data, the signal is centered around 0 so the rms is just thestd, the std of the signal for every possible input was found to be 2.08. Plugging thisinto the equation above yields the rms of the noise signal. Since the noise is a Gaussiancentered around 0 this is just the std, hence setting the std of the i.i.d Gaussian noise tothe computed value 100.1∗SNRdb

2.08yields the prescribed SNRdb.

A.2 L2 Distance between scaled inputs

Claim: The distance between two vectors in the L2 norm increases by a factor of C ifthe individual elements of the vectors are each scaled by C.

82

Proof: The L2 distance between two n-dimensional vectors x = x1, x2, ..., xn and y =y1, y2, ..., yn is defined as

(n∑

i=1

(xi − yi)2

) 12

scaling the vector elements by C yields

(n∑

i=1

(C · xi − C · yi)2

) 12

=

(n∑

i=1

(C · (xi − yi))2

) 12

=

(n∑

i=1

C2 · (xi − yi)2

) 12

=

(C2 ·

n∑i=1

(xi − yi)2

) 12

= C ·

(n∑

i=1

(xi − yi)2

) 12

hence the original distance is scaled by C.

A.3 Rescaling a matrix to a specified spectral radius

The spectral radius of a matrix is its largest eigenvalue. To rescale a matrix to somespecified spectral radius the following steps are executed:

1. Compute the spectral radius of the input matrix.

2. Divide each element of the input matrix by the spectral radius of the input matrix.The resultant matrix has a spectral radius of 1.

3. Multiply each element of the resultant matrix by the desired spectral radius. Thisgives a matrix with the desired spectral radius.

It was found that for very small matrices, the LAPACK[15], dgeev function would returna maximum eigenvalue (spectral radius), or 0. Clearly one cannot divide each element ofthe input matrix by 0, because the result is absurd. Therefore in this case the weights of

83

the input matrix were re-generated until the spectral radius of the input matrix was not0, or until some specified maximum number of retries had elapsed.

To check how much of a problem this might be in practice, weight matrices of size 100×100were created with different connectivity values. The number of creation attempts it tookto generate matrix that had a non-zero spectral radius, was recorded. Connectivity wasvaried from 0.0001 to 0.02 in steps of 0.001, and 100 trials were performed for each setting.The results are plotted in Figure-43.

0 0.005 0.01 0.015 0.020

20

40

60

80

100

120Number of Times Spectral Radius was Zero for Different Connectivity

Connectivity

Wei

ght I

nit F

ailu

res

MeanStd

Figure 43: Number of weight-matrix generation tries it took to get a non-zero spectralradius. Std is shown every 0.0005 steps, no markers are shown for std of zero.

The results show that, although for very small connectivity such as 0.0001, which equatesto 1 connection, the number of retries is quite high, the number of retries falls quitesharply and by a connectivity of 0.0015, which is 15 connections, the initialisation succeedsfirst time with a std of 0. For higher connectivities there are no more runs which do notinitialise with a non-zero spectral radius the first time. Thus, in practice, it would appearthat the issue is non-problematic.

84

B File structure of software included on CD

The main directory “summer” contains 6 subdirectories

Directory Description

analysis Contains analysis experiment scripts and results.expr Contains main text experiment scripts and results.src Contains source code for many models.docs Contains source for documentation which you are reading now.matlab Contains some scripts and analyses done in matlab.test Contains test scripts and data for various models.

In addition there is a directory called “lib” which contains my programming libraries, itcontain four subdirectories

Directory Description

aml The main directory for library source code.amltest Test harness for library.apps Some applications written to accompany the library.docs Automatically generated documentation for the library.

To execute the programs, the library needs to be compiled first, and then the models inthe “src” directory need to be compiled. After this, entering the “test” directory aboveand running a bash script in one of the directories will show you the chosen model inaction. If you need help with this contact me.

85

References

[1] H Jaeger and H Haas. Harnessing nonlinearity: Predicting chaotic systems andsaving energy in wireless communication. Science, pages 78–80, April 2004.

[2] H. Jaeger. The “echo state” approach to analyzing and training recurrent neuralnetworks. Manuscript submitted for publication, 2001.

[3] H. Jaeger. A tutorial on training recurrent neural networks, covering BPPT, RTRL,EKF and the ”echo state network” approach. Technical Report 159, FraunhoferInstitute for Autonomous Intelligent Systems, 2001.

[4] Herbert Jaeger. Adaptive nonlinear system identification with echo state networks.In NIPS, 2002.

[5] P. Tino, M. Cernansky, and L. Benuskova. Markovian architectural bias of recurrentneural networks. Technical Report NCRG/2002/008, Neural Computation ResearchGroup, Aston University, UK., 2002.

[6] P. Tino, M. Cernansky, and L. Benuskova. Markovian architectural bias of recurrentneural networks. In P. Sincak, J. Vascak, V. Kvasnicka, and J. Pospichal, editors,Intelligent Technologies - Theory and Applications, pages 17–23. IOS Press, Amster-dam, 2002.

[7] Micheal F Barnsley. Fractals Everywhere. Academic Press, 1993.

[8] P Tino and G Dorffner. Predicting the future of discrete sequences from fractalrepresentations of the past. Machine Learning, 45(2):187–218, 2001.

[9] John Pezzullo. Analysis of variance from summary data.http://members.aol.com/johnp71/anova1sm.html.

[10] Paul E. Black. Lm distance. http://www.nist.gov/dads/HTML/lmdistance.html.

[11] Antonio Aguilera and Ricardo Perez-Aguila. General n-dimensional rotations. InWSCG SHORT Communication papers proceedings. UNION Agency - Science Press,2004.

[12] W. Maass, T. Natschlager, and H. Markram. Real-time computing without stablestates: A new framework for neural computation based on perturbations. NeuralComputation, 14(11):2531–2560, 2002.

[13] Jochen J Steil. Backpropagation-decorrelation: online recurrent learning with o(n)complexity. In Proc. IJCNN, pages 843–848, 2004.

[14] R. Legenstein and W. Maass. What makes a dynamical system computationallypowerful? New Directions in Statistical Signal Processing: From Systems to Brain.,2005. to appear.

[15] Linear algebra package. http://www.netlib.org/lapack/.

86

The University of Birmingham School of Computer Science ... · The University of Birmingham School of Computer Science MSc in Natural Computation Summer Project Non-Linear Memory

Documents