Introduction To Perceptron Networks

Jan Jantzen [email protected] (1)

Abstract: When it is time-consuming or expensive to model a plant using the basic laws of physics, a neural network approach can be an alternative. From a control engineer's viewpoint a two-layer perceptron network is sufficient. It is indicated how to model a dynamic plant using a perceptron network.

Contents

1 Introduction
2 The perceptron
   2.1 Perceptron training
   2.2 Single layer perceptron
   2.3 Gradient descent learning
   2.4 Multi-layer perceptron
   2.5 Back-propagation
   2.6 A general model
3 Practical issues
   3.1 Rate of learning
   3.2 Pattern and batch modes of training
   3.3 Initialisation
   3.4 Scaling
   3.5 Stopping criteria
   3.6 The number of training examples
   3.7 The number of layers and neurons
4 How to model a dynamic plant
5 Summary

(1) Technical University of Denmark, Department of Automation, Bldg 326, DK-2800 Lyngby, DENMARK. Tech. report no 98-H 873 (nnet), 25 Oct 1998.


[Figure omitted: block diagram with the network in parallel with the plant; plant input u, plant output y_p, model output y_m, and error e.]
Figure 1: A neural network models a plant in a forward learning manner.

1 Introduction

A neural network is basically a model structure and an algorithm for fitting the model to some given data. The network approach to modelling a plant uses a generic nonlinearity and allows all the parameters to be adjusted. In this way it can deal with a wide range of nonlinearities. Learning is the procedure of training a neural network to represent the dynamics of a plant, for instance in accordance with Fig. 1. The neural network is placed in parallel with the plant, and the error e between the output of the system and the network outputs, the prediction error, is used as the training signal. Neural networks have a potential for intelligent control systems because they can learn and adapt, they can approximate nonlinear functions, they are suited for parallel and distributed processing, and they naturally model multivariable systems. If a physical model is unavailable or too expensive to develop, a neural network model might be an alternative.

Networks are also used for classification. An example of an industrial application concerns acoustic quality control (Meier, Weber & Zimmermann, 1994). A factory produces ceramic tiles, and an experienced operator is able to tell a bad tile from a good one by hitting it with a hammer; if there are cracks inside, it makes an unusual sound. In the quality control system the tiles are hit automatically and the sound is recorded with a microphone. A neural network (a so-called Kohonen network) tells the bad tiles from the good ones, with an acceptable rate of success, after being presented with a number of examples.

The objective here is to present the subject from a control engineer's viewpoint. Thus, there are two immediate application areas:

- models of (industrial) processes, and
- controllers for (industrial) processes.

In the early 1940s McCulloch and Pitts studied the connection of several basic elements based on a model of a neuron, and Hebb studied the adaptation in neural systems. Rosenblatt devised in the late 50s the Perceptron, now widely used. Then in 1969 Minsky and Papert pointed to several limitations of the perceptron, and as a consequence research in the field slowed down for lack of funding. The catalyst for today's level of research was a series of results and algorithms published in 1986 by Rumelhart and his co-workers. In the 90s neural networks and fuzzy logic came together in neurofuzzy systems, since both techniques are applied where there is uncertainty. There are now many real-world applications ranging from finance to aerospace.

There are many neural network architectures, such as the perceptron, multilayer perceptrons, networks with feedback loops, self-organising systems, and dynamical networks, together with several different learning methods, such as error-correction learning, competitive learning, supervised and unsupervised learning (see the textbook by Haykin, 1994). Neural network types and learning methods have been organised into a brief classification scheme, a taxonomy (Lippmann, 1987).

Neural networks have already been examined from a control engineer's viewpoint (Miller, Sutton & Werbos; Hunt, Sbarbaro, Zbikowski & Gawthrop, 1992). Neural networks can be used for system identification (forward modelling, inverse modelling) and for control, such as supervised control, direct inverse control, model reference control, internal model control, and predictive control (see the overview article by Hunt et al., 1992). Within the realm of modelling, identification, and control of nonlinear systems there are applications to pattern recognition, information processing, design, planning, and diagnosis (see the overview article by Fukuda & Shibata, 1992). Hybrid systems using neural networks, fuzzy sets, and artificial intelligence technologies exist, and these are also surveyed in that article. A systematic investigation of neural networks in control confirms that neural networks can be trained to control dynamic, nonlinear, multivariable, and noisy processes (see the PhD thesis by Sørensen, 1994). Somewhat related is Nørgaard's investigation (1996) of their application to system identification, and he also proposes improvements to specific control designs. A comprehensive systematic classification of the control schemes proposed in the literature has been attempted by Agarwal (1997). His taxonomy is a tree with 'control schemes using neural networks' at the top node, broken down into two classes: 'neural network only as aid' and 'neural network as controller'. These are then further refined.

There are many commercial tools for building and using neural networks, either alone or together with fuzzy logic tools; for an overview, see the database CITE (MIT, 1995). Neural network computations are naturally expressed in matrix notation, and there are several toolboxes in the matrix language Matlab, for example the commercial neural network toolbox (Demuth & Beale, 1992) and a university-developed toolbox for identification and control, downloadable from the World Wide Web (Nørgaard, NNSYSID with NNCTRL).

In the wider perspective of supervisory control, there are other application areas, such as robotic vision, planning, diagnosis, quality control, and data analysis (data mining). The strategy in this lecture note is to aim at all these application areas, but only present the necessary and sufficient neural network material for understanding the basics.

2 The perceptron

Given a classification problem, a set of data X is to be classified into the classes C1 and C2. A neural network can learn from data and improve its performance through learning, and that is the ability we are interested in. In a learning phase we wish to present a subset of examples to the network in order to train it. In a following generalisation phase we wish the trained network to make the correct classification with data it has never seen before.

    475 Hz    557 Hz    Quality Ok?
    0.958     0.003     Yes
    1.043     0.001     Yes
    1.907     0.003     Yes
    0.780     0.002     Yes
    0.579     0.001     Yes
    0.003     0.105     No
    0.001     1.748     No
    0.014     1.839     No
    0.007     1.021     No
    0.004     0.214     No

Table 1: Frequency intensities for ten tiles

Example 1 (tiles). Tiles are made from clay moulded into the right shape, brushed, glazed, and baked. Unfortunately, the baking may produce invisible cracks. Operators can detect the cracks by hitting the tiles with a hammer, and in an automated system the response is recorded with a microphone, filtered, Fourier transformed, and normalised. A small set of data (adapted from MIT, 1995) is given in Table 1.

The perceptron, the simplest form of a neural network, is able to classify data into two classes. Basically it consists of a single neuron with a number of adjustable weights. The neuron is the fundamental processor of a neural network (Fig. 2). It has three basic elements:

1. A set of connecting links (or synapses); each link carries a weight (or gain) w0, w1, w2.

2. A summation (or adder) sums the input signals after they are multiplied by their respective weights.

3. An activation function f(x) limits the output of the neuron. Typically the output is limited to the interval [0, 1] or alternatively [-1, 1].

The summation in the neuron also includes an offset w0 for lowering or raising the net input to the activation function. Mathematically the input to the neuron is represented by a vector U = (1, u1, u2, ..., un)^T, and the output is a scalar y = f(x). The weights of the connections are represented by the vector W = (w0, w1, ..., wn)^T, where w0 is the offset. The output is calculated as

    y = f(W^T U)    (1)

Fig. 2a is a perceptron with two inputs and an offset. With a hard limiter as activation function (b), the neuron produces an output equal to +1 or -1, which we can associate with C1 and C2, respectively.
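The output computation (1) can be sketched in a few lines of Python; the weight and input values below are made-up illustrations, not data from the text:

```python
def neuron(weights, inputs, f):
    """Single neuron: y = f(W^T U), where U = (1, u1, ..., un) includes
    the constant 1 that multiplies the offset weight w0."""
    u = [1.0] + list(inputs)                 # prepend the offset input
    x = sum(w * ui for w, ui in zip(weights, u))
    return f(x)

def hard_limiter(x):
    """Signum-type activation: output is +1 or -1."""
    return 1.0 if x >= 0 else -1.0

# Example with arbitrary weights (w0, w1, w2) = (0.5, 1.0, -1.0):
y = neuron([0.5, 1.0, -1.0], [0.2, 0.9], hard_limiter)
```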


[Figure omitted: (a) neuron with an offset input 1 weighted by w0 and inputs weighted by w1 and w2, feeding a summation and f(x); (b) hard limiter f(x), switching between -1 and 1 at x = 0.]
Figure 2: Perceptron consisting of a neuron (a) with an offset w0 and an activation function f(x), which is a hard limiter (b).

2.1 Perceptron training

The weights W are adjusted using an adaptive learning rule. One such learning rule is the perceptron convergence algorithm. If the two classes C1 and C2 are linearly separable (i.e., they lie on opposite sides of a straight line or, in general, a hyperplane), then there exists a weight vector W with the properties

    W^T U ≥ 0 for every input vector U belonging to class C1    (2)
    W^T U < 0 for every input vector U belonging to class C2

Assuming, to be general, that the perceptron has m inputs, the equation W^T U = 0, in an m-dimensional space with coordinates u1, u2, ..., um, defines a hyperplane as the switching surface between the two different classes of input. The training process adjusts the weights W to satisfy the two inequalities (2). A training set consists of, say, K samples of the input vector U, together with each sample's class membership (0 or 1). A presentation of the complete training set to the perceptron is called an epoch. The learning is continued epoch by epoch until the weights stabilise.

Algorithm. The core of the perceptron convergence algorithm for adapting the weights of the elementary perceptron has two steps.

(a) If the k-th member of the training set, U_k (k = 1, 2, ..., K), is correctly classified by the weight vector W_k computed at the k-th iteration of the algorithm, no correction is made to W_k, as shown by

    W_{k+1} = W_k  if W_k^T U_k ≥ 0 and U_k belongs to class C1    (3)

and

    W_{k+1} = W_k  if W_k^T U_k < 0 and U_k belongs to class C2    (4)

(b) Otherwise the perceptron weights are updated according to the rule

    W_{k+1} = W_k - η_k U_k  if W_k^T U_k ≥ 0 and U_k belongs to class C2    (5)

and

    W_{k+1} = W_k + η_k U_k  if W_k^T U_k < 0 and U_k belongs to class C1    (6)

Notice the interchange of the class numbers from step (a) to step (b). The learning rate η_k controls the adjustment applied to the weight vector at iteration k. If η is a constant, independent of the iteration number k, we have a fixed increment adaptation rule for the perceptron. The algorithm has been proved to converge (Haykin, 1994; Lippmann, 1987).

[Figure omitted: input vector U, successive weight vectors W(1), W(2), W(3), the update steps ηU, and the angle θ.]
Figure 3: After two updates θ is larger than 90 degrees, and W^T U = |W| |U| cos(θ) changes sign.

The perceptron learning rule is illustrated in Fig. 3. The input vector U points at some point in the m-dimensional space, marked by a star. The update rule is based on the definition of the dot product of two vectors, which relates the angle θ between the input vector and the weight vector W,

    W^T U = |W| |U| cos(θ)

The figure shows a situation where the weight vector W(1) needs to be changed. The angle θ is less than 90 degrees, cos(θ) is positive, and the update rule changes the weight vector into W(2) by the amount ηU in the direction opposite of U. The weight vector turned, but not enough, so another pass is necessary to bring it to W(3). Now the angle is larger than 90 degrees, and the sign of W^T U is correct. The vector W is orthogonal to the hyperplane that separates the classes (Fig. 4). When W turns, the hyperplane turns with it, until the hyperplane separates the classes correctly.

Example 2 (tiles). We will train the perceptron in Fig. 2 to detect bad tiles. The inputs to the neuron are the first two columns in Table 1. Let us associate the good tiles with class C1 and the bad ones with C2. The first test input to the perceptron, including a 1 for the offset weight, is then

    U^T = (1, 0.958, 0.003)    (7)

This corresponds to a tile from C1. The weights, including the offset weight, have to be initialised, and let us arbitrarily start with

    W^T = (0.5, 0.5, 0.5)    (8)


[Figure omitted: weight vector W at an angle θ to the input U, orthogonal to the separating line.]
Figure 4: The weight vector defines the line (hyperplane) that separates the white class from the shaded class.

The learning rate is fixed at

    η = 0.5    (9)

According to the perceptron convergence algorithm, we have to test the product

    W^T U = (1, 0.958, 0.003)(0.5, 0.5, 0.5)^T    (10)
          = 0.981                                 (11)

The sign is positive, corresponding to the case in (3); therefore we leave the weights of the perceptron as they are. We repeat the procedure with the next set of inputs, and the next, and so on. The results from a hand calculation of two passes through the whole set of data are collected in Table 2. There are no updates to the weights before we reach the first bad tile; then we apply (5), and the weights change.

After the first pass it looks like the perceptron can recognise the bad tiles, but as the weights have changed since the initial run, we have to go back and check the good tiles again. In the second pass (Table 2, second pass) the weights are updated in the first iteration, but after that there are no further changes. A third pass (not shown) will show that the adaptation has stopped.

The perceptron is indeed able to distinguish between good and bad tiles. Figure 5 shows a logarithmic plot of the data and a separating line between the two classes. The final weights are

    W^T = (0, 0.977, -0.425)    (12)

That is, no offset, and the line where the sign switches is defined by

    0.977 u1 - 0.425 u2 = 0    (13)

This is equivalent to the straight line u2 = (0.977/0.425) u1 drawn in the figure. With a hard limiter (signum function) as the activation function the perceptron produces an output according to

    y = sgn(W^T U)    (14)

Any new data, drawn from beyond the training set, above the line will result in y = -1 (bad), and any data below the line will result in y = +1 (good).
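The hand calculation can be reproduced with a short script, assuming the fixed-increment rules (3)-(6), η = 0.5, the Table 1 data, and the same initial weights; with these choices the run ends at the weights in (12):

```python
# Perceptron convergence algorithm on the tile data (Table 1).
# Inputs are augmented with a leading 1 for the offset weight.
data = [
    # (1, 475 Hz, 557 Hz), good tile?
    ((1, 0.958, 0.003), True), ((1, 1.043, 0.001), True),
    ((1, 1.907, 0.003), True), ((1, 0.780, 0.002), True),
    ((1, 0.579, 0.001), True), ((1, 0.003, 0.105), False),
    ((1, 0.001, 1.748), False), ((1, 0.014, 1.839), False),
    ((1, 0.007, 1.021), False), ((1, 0.004, 0.214), False),
]

def dot(w, u):
    return sum(wi * ui for wi, ui in zip(w, u))

def train(w, eta=0.5, passes=3):
    w = list(w)
    for _ in range(passes):
        for u, good in data:
            s = dot(w, u)
            if good and s < 0:           # rule (6): C1 misclassified
                w = [wi + eta * ui for wi, ui in zip(w, u)]
            elif not good and s >= 0:    # rule (5): C2 misclassified
                w = [wi - eta * ui for wi, ui in zip(w, u)]
    return w

w = train([0.5, 0.5, 0.5])   # ends at approximately (0, 0.977, -0.425)
```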


    U^T               W^T                   Ok?   W^T U   Eq.   Updated W^T
    (1,0.958,0.003)   (0.5,0.5,0.5)          1     0.981  (3)   (0.5,0.5,0.5)
    (1,1.043,0.001)   (0.5,0.5,0.5)          1     1.022  (3)   (0.5,0.5,0.5)
    (1,1.907,0.003)   (0.5,0.5,0.5)          1     1.455  (3)   (0.5,0.5,0.5)
    (1,0.780,0.002)   (0.5,0.5,0.5)          1     0.891  (3)   (0.5,0.5,0.5)
    (1,0.579,0.001)   (0.5,0.5,0.5)          1     0.790  (3)   (0.5,0.5,0.5)
    (1,0.003,0.105)   (0.5,0.5,0.5)          0     0.554  (5)   (0,0.499,0.448)
    (1,0.001,1.748)   (0,0.499,0.448)        0     0.783  (5)   (-0.5,0.498,-0.427)
    (1,0.014,1.839)   (-0.5,0.498,-0.427)    0    -1.277  (3)   (-0.5,0.498,-0.427)
    (1,0.007,1.021)   (-0.5,0.498,-0.427)    0    -0.932  (3)   (-0.5,0.498,-0.427)
    (1,0.004,0.214)   (-0.5,0.498,-0.427)    0    -0.589  (3)   (-0.5,0.498,-0.427)
    Second pass:
    (1,0.958,0.003)   (-0.5,0.498,-0.427)    1    -0.024  (6)   (0,0.977,-0.425)
    (1,1.043,0.001)   (0,0.977,-0.425)       1     1.019  (3)   (0,0.977,-0.425)
    (1,1.907,0.003)   (0,0.977,-0.425)       1     1.862  (3)   (0,0.977,-0.425)
    (1,0.780,0.002)   (0,0.977,-0.425)       1     0.761  (3)   (0,0.977,-0.425)
    (1,0.579,0.001)   (0,0.977,-0.425)       1     0.565  (3)   (0,0.977,-0.425)
    (1,0.003,0.105)   (0,0.977,-0.425)       0    -0.042  (3)   (0,0.977,-0.425)
    (1,0.001,1.748)   (0,0.977,-0.425)       0    -0.742  (3)   (0,0.977,-0.425)
    (1,0.014,1.839)   (0,0.977,-0.425)       0    -0.768  (3)   (0,0.977,-0.425)
    (1,0.007,1.021)   (0,0.977,-0.425)       0    -0.427  (3)   (0,0.977,-0.425)
    (1,0.004,0.214)   (0,0.977,-0.425)       0    -0.087  (3)   (0,0.977,-0.425)

Table 2: After two passes the perceptron algorithm converges.

[Figure omitted: log-log scatter plot of the tile data, 475 Hz intensity versus 557 Hz intensity, both axes from 10^-3 to 10^1, with the separating line between the 'Good' and 'Bad' regions.]
Figure 5: Classification of tiles in the input space (logarithmic axes).


2.2 Single layer perceptron

Given a problem which calls for more than two classes, several perceptrons can be combined into a network. The simplest form of a layered network just has an input layer of source nodes that connect to an output layer of neurons, but not vice versa. Fig. 6 shows a single layer perceptron with five nodes capable of recognising three linearly separable classes (three output nodes) by means of two features (two input nodes). Each output neuron defines a weight vector, which in turn defines a hyperplane. The hyperplanes separate the classes as in Fig. 7.

The activation function could have various shapes depending on the application; Fig. 8 shows six different types. The mathematical expressions for the three in the bottom row are, respectively,

    f(x) = 1 / (1 + exp(-a x))                        (15)
    f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))    (16)
    f(x) = exp(-x²)                                   (17)

Here (15) is a logistic function, an example of the widely used sigmoid function; in general a sigmoid is an s-shaped function, strictly increasing and asymptotic. The parameter a is used to adjust the slope; in the limit as a approaches infinity, the sigmoid becomes a step. Equation (16) is a hyperbolic tangent; it is symmetric about the origin, which can be convenient. The third function (17) is a Gaussian function.
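The three functions (15)-(17) can be written directly; the function names are ours, and a is the slope parameter of the logistic:

```python
import math

def logistic(x, a=1.0):
    # Eq. (15): s-shaped, output in (0, 1)
    return 1.0 / (1.0 + math.exp(-a * x))

def tanh_act(x):
    # Eq. (16): symmetric about the origin, output in (-1, 1)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def gaussian(x):
    # Eq. (17): bell shape with maximum 1 at x = 0
    return math.exp(-x * x)
```

With a large slope parameter the logistic approaches a step, as the text notes: `logistic(0.01, a=1000.0)` is already very close to 1.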

2.3 Gradient descent learning

If we introduce the desired response d of the network,

    d = +1 if U belongs to class C1
        -1 if U belongs to class C2    (18)

the learning rules (3)-(6) can be expressed in a more convenient way. They can be summed up nicely in the form of an error-correction learning rule, the so-called delta rule, as shown by

    W_{k+1} = W_k + ΔW_k                    (19)
            = W_k + η_k (d_k - y_k) U_k     (20)

The difference d_k - y_k plays the role of an error signal, and k corresponds to the current learning sample. The learning rate parameter η is a positive constant limited to the range 0 < η ≤ 1. The learning rule increases the output y by increasing W when the error e = d - y is positive; therefore w_i increases if u_i is positive, and it decreases if u_i is negative (and similarly when the error is negative). Features of the delta rule are as follows:

- It is simple,
- learning can be performed locally at each neuron, and
- weights are updated on-line after presentation of each pattern.
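As a small check of the delta rule (20), the sketch below performs one on-line update with a signum activation. For a correctly classified pattern the error d - y is zero; for a misclassified one it is ±2, so the factor 2 is absorbed into the choice of η. The pattern values are taken from Table 1 for illustration:

```python
def delta_rule_step(w, u, d, eta, f):
    """One on-line update W <- W + eta*(d - y)*U, eq. (20)."""
    y = f(sum(wi * ui for wi, ui in zip(w, u)))
    e = d - y                    # error signal
    return [wi + eta * e * ui for wi, ui in zip(w, u)]

sgn = lambda x: 1.0 if x >= 0 else -1.0

w = [0.5, 0.5, 0.5]
# Correctly classified pattern from C1: e = 0, so no change
w1 = delta_rule_step(w, [1, 0.958, 0.003], +1, 0.25, sgn)
# Misclassified pattern from class C2: weights move opposite to U
w2 = delta_rule_step(w, [1, 0.003, 0.105], -1, 0.25, sgn)
```

With η = 0.25 the second update equals rule (5) with η = 0.5, matching the first weight change in Table 2.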


[Figure omitted: two input nodes u1, u2 connected to three output neurons producing y1, y2, y3.]
Figure 6: Single layer perceptron network with three output neurons.

[Figure omitted: three regions C1, C2, C3 separated by the lines v1 = 0, v2 = 0, v3 = 0.]
Figure 7: Three linearly separable classes.


[Figure omitted: six activation function plots; top row Linear, Limiter, Hard limiter, bottom row Logistic, Tanh, Gaussian.]
Figure 8: Examples of six activation functions.

The ultimate purpose of error-correction learning is to minimise a cost function based on the error signal e. Then the response of each output neuron approaches the target response for that neuron in some statistical sense. Indeed, once a cost function is selected, the learning is strictly an optimisation problem.

A common cost function is the sum of the squared errors,

    E(W) = (1/2) Σ_r e_r²    (21)

The summation is over all the output neurons of the network (cf. index r). The network is trained by minimising E with respect to the weights, and this leads to the gradient descent method. The factor 1/2 is used to simplify differentiation when minimising E. A plot of the cost function E versus the weights is a multidimensional error surface. Depending on the type of activation functions used in the network we may encounter two situations:

1. The network has entirely linear neurons. The error surface is a quadratic function of the weights, and it is bowl-shaped with a unique minimum point.

2. The network has nonlinear neurons. The surface has a global minimum (perhaps multiple) as well as local minima.

In both cases, the objective of the learning is to start from an arbitrary point on the surface, determined by the initial weights and the initial inputs, and move toward a global minimum in a step-by-step fashion. This is feasible in the first case, whereas in the second case the algorithm may get trapped in a local minimum, never reaching a global minimum.


[Figure omitted: surface plot of E over the (w0, w1) plane, a parabolic valley.]
Figure 9: Example of error surface for a simple perceptron.

Example 3 (error surface). To see how the error surface is formed, we will examine a simple perceptron with one neuron (r = 1), one input (m = 1), an offset, and a linear activation function. The network output is

    y = f(W^T U) = W^T U = (w0, w1)(1, u)^T = w0 + w1 u    (22)

Given a particular input u and the corresponding desired output d, we can plot the error surface. Assume

    u = 1, d = 0.5    (23)

then the error function becomes

    E(W) = (1/2) e²                   (24)
         = (1/2)(d - y)²              (25)
         = (1/2)(d - (w0 + w1 u))²    (26)
         = (1/2)(0.5 - (w0 + w1))²    (27)

This is a parabolic bowl, as the plot indicates in Fig. 9.

In case there are K examples in the training set, u, y, and d become (row) vectors U and D of length K, and the squared errors are summed for each instance of w0 and w1:

    E(W) = (1/2) Σ_K (D - (w0 + w1 · U))²    (28)

The squaring operation is element-by-element, and we have a vector under the summation symbol. The summation is over the elements of that vector. The result is a scalar for each combination of the scalars w0 and w1.

To sum up, in order to make E(W) small, the idea is to change the weights W in the direction of the negative gradient of E, that is

    ΔW = -η ∂E/∂W = -η e ∂e/∂W    (29)

Here e² has been substituted for Σ_r e_r² (a single output neuron is assumed). The partial derivative ∂e/∂W tells how much the error is influenced by changes in the weights.

[Figure omitted: input layer, hidden layer, and output layer, with every node connected to all nodes of the next layer.]
Figure 10: Fully connected multi-layer perceptron.

Example 4. There are alternatives to the cost function (21). If it is chosen to be

    E(W) = |e|

the gradient method gives

    ΔW = -η (∂e/∂W) sgn(e)
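Under the assumptions of Example 3 (u = 1, d = 0.5, linear neuron), the gradient step (29) reduces to two scalar updates, since ∂E/∂w0 = -(d - y) and ∂E/∂w1 = -(d - y)u. This sketch iterates them down the bowl; the starting point and learning rate are arbitrary choices:

```python
# Gradient descent on E(W) = 1/2 * (d - (w0 + w1*u))^2 with u = 1, d = 0.5.
u, d = 1.0, 0.5
w0, w1 = 0.0, 1.0          # arbitrary starting point on the surface
eta = 0.1

for _ in range(200):
    y = w0 + w1 * u        # linear neuron, eq. (22)
    e = d - y              # error signal
    w0 += eta * e          # -eta * dE/dw0
    w1 += eta * e * u      # -eta * dE/dw1

error = 0.5 * (d - (w0 + w1 * u)) ** 2
```

Every minimum of this surface satisfies w0 + w1 = 0.5, the bottom of the parabolic valley in Fig. 9.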

2.4 Multi-layer perceptron

The single-layer perceptron can only classify linearly separable problems. For non-separable problems it is necessary to use more layers. A multilayer (feedforward) network has one or more hidden layers, whose neurons are called hidden neurons. The graph in Fig. 10 illustrates a multilayer network with one hidden layer. The network is fully connected, because every node in a layer is connected to all nodes in the next layer. If some of the links are missing, the network is partially connected. When we say that a network has n layers, we only count the hidden layers and the output layer; the input layer of source nodes does not count, because the nodes do not perform any computations. A single layer network thus refers to a network with just an output layer.

[Figure omitted: left, the four XOR points in the (u1, u2) plane; right, a two-layer network with hidden neurons 1 and 2 and output neuron 3.]
Figure 11: Two classes 'x' and 'o' that cannot be separated by a line (left) require a network with more than one layer (right).

Example 5 (XOR). In Fig. 11, left, the points marked by an 'x' belong to class C1 and the points marked by an 'o' belong to class C2. The relation between (u1, u2) and the class membership y can be described in the table

    u1  u2  y
    0   0   0
    0   1   1
    1   0   1
    1   1   0    (30)

The table also expresses the logical exclusive-or (XOR) operation; it was used many years ago to demonstrate that the (single layer) perceptron was unable to classify a very simple set of data.

In symbols, a two-input perceptron with one neuron, an offset, and a linear activation function has the following function

    y = f(W^T U) = W^T U = (w0, w1, w2)(1, u1, u2)^T = w0 + w1 u1 + w2 u2    (31)

It is required to satisfy the following four inequalities:

    0·w1 + 0·w2 + w0 ≤ 0  ⇒  w0 ≤ 0           (32)
    0·w1 + 1·w2 + w0 > 0  ⇒  w0 > -w2         (33)
    1·w1 + 0·w2 + w0 > 0  ⇒  w0 > -w1         (34)
    1·w1 + 1·w2 + w0 ≤ 0  ⇒  w0 ≤ -w1 - w2    (35)

Using (32) in (33)-(35) implies

    w2 > 0         (36)
    w1 > 0         (37)
    w1 + w2 ≤ 0    (38)

It is of course impossible to satisfy (36)-(38) all at the same time.

Example 6. It is possible, nevertheless, to solve the problem with the two-layer perceptron in Fig. 11, right, if neuron 1 in the hidden layer has the weights W1 = (-0.5, 1, -1), neuron 2 in the hidden layer W2 = (0.5, 1, -1), and the output neuron W3 = (0.5, 1, -1). Assuming a hard limiter activation function in each neuron, and representing the bits 0 and 1 by -1 and +1 respectively, the overall function of the network is

    y3 = sgn(W3^T (1, y1, y2)^T) = sgn(0.5 + y1 - y2)     (39)
    y1 = sgn(W1^T (1, u1, u2)^T) = sgn(-0.5 + u1 - u2)    (40)
    y2 = sgn(W2^T (1, u1, u2)^T) = sgn(0.5 + u1 - u2)     (41)

By insertion, the input-output relation is

    y3 = sgn(0.5 + sgn(-0.5 + u1 - u2) - sgn(0.5 + u1 - u2))    (42)

With u1 = (-1, -1, 1, 1)^T and u2 = (-1, 1, -1, 1)^T the output becomes y = (-1, 1, 1, -1)^T as desired, cf. (30).
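The network of Example 6 is easy to verify in code against the truth table (30), with the bits 0 and 1 represented by -1 and +1:

```python
def sgn(x):
    return 1 if x >= 0 else -1

def xor_net(u1, u2):
    """Two-layer perceptron of Example 6, eqs. (39)-(41)."""
    y1 = sgn(-0.5 + u1 - u2)    # hidden neuron 1, W1 = (-0.5, 1, -1)
    y2 = sgn(0.5 + u1 - u2)     # hidden neuron 2, W2 = (0.5, 1, -1)
    return sgn(0.5 + y1 - y2)   # output neuron,  W3 = (0.5, 1, -1)

# The four input pairs in the order of table (30):
outputs = [xor_net(u1, u2) for u1, u2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]]
# outputs is (-1, 1, 1, -1), i.e. XOR with -1 standing for bit 0
```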

2.5 Back-propagation

The error back-propagation algorithm is a popular learning rule for multilayer perceptrons. It is based on the delta rule, and it uses the squared error measure (21) for output nodes. A perceptron weight w_ji, the weight on the connection from neuron i to neuron j, is updated according to the generalised delta rule

    w_ji = w_ji + Δw_ji = w_ji - η ∂E/∂w_ji    (43)

To make the notation clearer, the index of the training instance k has been omitted from the expression; it is understood that the equation is recursive, such that the w_ji on the left hand side is the new weight, while the w_ji on the right hand side is the old weight. In the generalised delta rule the correction to the weight is proportional to the gradient of the error or, in other words, to the sensitivity of the error to changes in the weight. To apply the algorithm two passes of computation are necessary, a forward pass and a backward pass.

In the forward pass the weights remain unchanged. The forward pass begins at the first hidden layer by presenting it with the input vector, and terminates at the output layer by computing the error signal for each output neuron. The backward pass starts at the output layer by passing the error signals backwards through the network, layer by layer.

For the backward pass, assume that neuron j is an output neuron. The derivative may be expressed using the chain rule,

    ∂E/∂w_ji = (∂E/∂e_j)(∂e_j/∂y_j)(∂y_j/∂v_j)(∂v_j/∂w_ji) = e_j · (-1) · f'_j · y_i    (44)

Here e_j is the error of neuron j, y_j is the output of the neuron, and v_j is the internal input to neuron j after summation, but before the activation function, i.e., y_j = f(v_j). The f' notation means differentiation of f with respect to its argument. Insertion into (43) yields

    Δw_ji = η e_j f'_j y_i    (j is an output neuron)    (45)

The error e_j of neuron j is associated with a desired output d_j, and it is simply e_j = d_j - y_j.

For the hidden node i, we have to back-propagate the error recursively, since it occurs in the two first partial derivatives in (44). But there is a problem, in that the error is unknown for a hidden node. We will instead compute directly

    (∂E/∂e_i)(∂e_i/∂y_i) = ∂E/∂y_i    (46)

The partial derivative of E with respect to the output of hidden neuron i, connected directly to output neuron j, is

    ∂E/∂y_i = Σ_j e_j ∂e_j/∂y_i                            (47)
            = Σ_j e_j (∂e_j/∂v_j)(∂v_j/∂y_i)               (48)
            = Σ_j e_j (∂(d_j - f_j(v_j))/∂v_j)(∂v_j/∂y_i)  (49)
            = Σ_j e_j (-f'_j(·)) w_ji                      (50)

In this case the update rule becomes

    Δw_ih = η Σ_j e_j f'_j(·) w_ji f'_i(·) y_h    (51)

Here i is a hidden neuron, h is an immediate predecessor of i, and j is an output neuron. The factor f'_i(·) depends solely on the activation function of hidden neuron i. The summation over j requires knowledge of the error signals e_j for all the output neurons. The term w_ji consists of the weights associated with the output connections. Finally, y_h is the input to neuron i, and it stems from the partial derivative ∂v_i/∂w_ih. For a neuron in the next layer, going towards the input layer, the update rule is applied recursively.

In summary, the correction to the weight w_qp on the connection from neuron p to neuron q is defined by

    Δw_qp = η δ_q y_p    (52)

Here η is the learning rate parameter, δ_q is called the local gradient, cp. (29) or (43), and y_p is the input signal to neuron q. The local gradient is computed recursively for each neuron, and it depends on whether the neuron is an output neuron or a hidden neuron:

1. If neuron q is an output neuron, then

    δ_q = e_q f'_q(·)    (53)

Both factors in the product are associated with neuron q.


2. If neuron q is a hidden node, then

    δ_q = Σ_r δ_r w_rq f'_q(·)    (54)

The summation is a weighted sum of the δ_r's of the immediate successor neurons.

The weights on the connections feeding the output layer are updated using the delta rule (52), where the local gradient δ_q is as in (53). Given the δ's for the neurons of the output layer, we next use (54) to compute the δ's for all the neurons in the next layer and the changes to all the weights of the connections feeding it. To compute the δ for each neuron, the activation function f(·) of that neuron must be differentiable.

Algorithm (after Lippmann, 1987). The back-propagation algorithm has five steps.

(a) Initialise weights. Set all weights to small random values.

(b) Present inputs and desired outputs (training pairs). Present an input vector U and specify the desired outputs D. If the net is used as a classifier then all desired outputs are typically set to zero, except one (set to 1). The input could be new on each trial, or samples from a training set could be presented cyclically until the weights stabilise.

(c) Calculate actual outputs. Calculate the outputs by successive use of Y = F(W^T U), where F is a vector of activation functions.

(d) Adapt weights. Start at the output neurons and work back to the first hidden layer. Adjust the weights by

    w_ji = w_ji + η δ_j y_i    (55)

In this equation w_ji is the weight from hidden neuron i (or an input node) to neuron j, y_i is the output of neuron i (or an input), η is the learning rate parameter, and δ_j is the local gradient; if neuron j is an output neuron then it is defined by (53), and if neuron j is hidden then it is defined by (54). Convergence is sometimes faster if a momentum term is added according to (78).

(e) Go to step (b).

Example (local gradient) The logistic function f(x) = 1/(1 + exp(−x)) is differentiable with the derivative f'(x) = exp(−x)/(1 + exp(−x))². We can eliminate the exponential term and express it as

f'(x) = f(x) [1 − f(x)]   (56)

For a neuron j in the output layer, we may then express the local gradient as

δ_j = e_j f'_j(·) = (d_j − y_j) y_j (1 − y_j)   (57)

For a hidden neuron i the local gradient is

δ_i = f'_i(·) Σ_j δ_j w_ji = y_i (1 − y_i) Σ_j δ_j w_ji   (58)
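Identity (56) is easy to verify numerically. A quick sketch comparing the closed form with a central difference; the test points and the step size h are arbitrary choices of mine:

```python
import math

def f(x):
    """Logistic function, f(x) = 1/(1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

h = 1e-6
for x in (-2.0, -0.5, 0.0, 0.7, 3.0):
    closed_form = f(x) * (1.0 - f(x))            # eq. (56)
    numeric = (f(x + h) - f(x - h)) / (2.0 * h)  # central difference
    assert abs(closed_form - numeric) < 1e-8
```

The closed form is what makes the logistic function convenient in back-propagation: the derivative comes for free from the forward-pass output.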

Figure 12: Nonlinear error surface for back-propagation (left: the sum-squared error as a surface over the weights w0 and w1; right: the error contours).

Example (backpropagation) (after Demuth & Beale, 1992) To study back-propagation we will consider a simple example with the training set

u     y
−3    0.4
 2    0.8        (59)

Thus the network will have one input, there will be no hidden layer, and the output layer will have one output neuron. To make the example more interesting we choose a nonlinear activation function (the logistic function). The overall function is then

y = f((w0, w1) · (1, u)^T) = f(w0 + w1 u)   (60)

The error is

E = (1/2) Σ e² = (1/2) Σ (d − y)² = (1/2) Σ ((0.4, 0.8) − f(w0 + w1 (−3, 2)))²   (61)

With some help from the neural network toolbox (Demuth & Beale, 1992) we can plot the error surface, see Fig. 12 (the toolbox omits the factor 1/2 in (61)).

As the learning progresses, the weights are adjusted to move the network down the gradient. The error contour graph on the right side of Fig. 12 shows how the network moved from its original values to a solution. Notice that it more or less chooses the steepest path.

The network error per epoch is saved throughout the training and plotted in Fig. 13. The error decreases throughout the training, and the training stops after about 80 epochs when the error drops below the preset limit of 0.001. At this point the final values of the weights are (w0, w1) = (0.5497, 0.3350).
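The example is small enough to reproduce in a few lines. A sketch under stated assumptions: the learning rate and the zero initial weights are my own choices (the toolbox uses its own initialisation and adaptive stepping), so only the stopping limit of 0.001 follows the text and the final weights need not match the values quoted above.

```python
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

data = [(-3.0, 0.4), (2.0, 0.8)]   # training set (59): (u, d) pairs
w0, w1 = 0.0, 0.0                  # initial weights (assumed)
eta = 0.5                          # learning rate (assumed)

sse, epochs = 1.0, 0
while sse > 0.001 and epochs < 10000:
    sse = 0.0
    for u, d in data:
        y = f(w0 + w1 * u)               # network output, eq. (60)
        delta = (d - y) * y * (1.0 - y)  # local gradient, eq. (57)
        w0 += eta * delta                # delta rule; the offset input is 1
        w1 += eta * delta * u
        sse += (d - y) ** 2
    epochs += 1
```

The loop reaches the 0.001 error limit after a modest number of epochs, mirroring the training curve of Fig. 13.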

2.6 A general model

Figure 13: Network error (sum-squared error per epoch, logarithmic scale).

Figure 14: Generalised neuron model.

A standard, matrix oriented model for neural networks has been developed (Hunt, Sbarbaro, Zbikowski and Gawthrop, 1992), and many well-known structures fall within that model. The basic processing element is considered to have three elements (see Fig. 14): a weighted summer, a linear dynamic single-input, single-output function, and a non-dynamic nonlinear function.

• The weighted summer is described by

v_i(t) = Σ_{j=1..N} a_ij f(x_j)(t) + Σ_{k=1..M} b_ik u_k(t),   (62)

giving a weighted sum v_i in terms of the outputs of all elements f(x_j), external inputs u_k, and corresponding weights a_ij and b_ik, with offsets included in b_ik. To simplify the notation the time index will be omitted in the following, and the equation may be written in matrix notation as:

V = A F(X) + B U   (63)

where A is an N × N matrix of weights a_ij, and B is an N × M matrix of weights b_ik. The neurons are enumerated consecutively from top to bottom, layer by layer in the forward direction. The offsets are incorporated with the inputs U, but it could be useful to represent them explicitly.

• The linear dynamic function has input v_i and output x_i. In transfer function form it is described by

x_i(s) = H(s) v_i(s)   (64)

Five common choices are

H(s) = 1   (65)
H(s) = 1/s   (66)
H(s) = 1/(1 + τs)   (67)
H(s) = 1/(p1 s + p2)   (68)
H(s) = exp(−sτ)   (69)

Clearly, the first three are special cases of the fourth.

• The non-dynamic nonlinear function returns the output

f(x_i)   (70)

in terms of the input x_i. The function f(·) corresponds to the previously mentioned activation functions (see for example Fig. 8).

The three components of the neuron can be combined in various ways. For example, if the neurons are all non-dynamic (H(s) = 1), then an array of neurons can be written as the set of algebraic equations obtained by combining (62), (64), and (70):

X = A F(X) + B U   (71)

where X is an N × 1 vector. If, on the other hand, each neuron has integrator dynamics (H(s) = 1/s), then an array of neurons can be written as the set of differential equations

Ẋ = A F(X) + B U   (72)

Clearly, the solutions of (71) form possible steady state solutions of (72). The behaviour of such a network depends on the interconnection matrix A and on the form of H(s). We shall use the model now to give an alternative formulation of the back-propagation algorithm.

In a static two-layer feedforward network, H(s) = 1. In order to study more closely how the calculations proceed, Fig. 15 shows a signal flow graph of a simple network with one input, two nodes, and one output. It is a two-layer network with offsets. The arrows are weighted (sometimes nonlinearly) and the nodes are summation points. Generalising to any number of nodes,

X = A F(X) + B U   (73)
Y = C F(X)

Keeping the two-layer structure, the X, Y, and U vectors are partitioned according to the layers, and the forward pass of the back-propagation algorithm can be written in detail as

[X1; X2] = [0 0; A21 0] [f(X1); f(X2)] + [B11 B12; 0 B22] [U1; u2 = 1]   (74)

Y = [0 I] [f(X1); f(X2)]   (75)

The subscripts denote the layers in the network, such that the hidden layer has subscript 1 and the output layer has subscript 2. The A and B matrices have a block structure, and the matrix A21 holds the weights from layer 1 to layer 2; I is the identity matrix. The inputs are partitioned according to the real inputs U1 and one additional input u2 = 1 which is used for the offsets of all nodes. The input matrix B11 holds the weights from the input nodes U1 to the hidden layer, B12 is a vector of offset weights for the first layer, and B22 is for the second layer.

The forward pass proceeds in three iterations, after the vector X is initialised to 0.

Iteration 1

X1 = B11 U1 + B12
X2 = B22

Iteration 2

X1 = B11 U1 + B12
X2 = A21 f(B11 U1 + B12) + B22

Iteration 3

Y = f(A21 f(B11 U1 + B12) + B22)

Each iteration progresses one level farther into the flow graph, and after three iterations the outputs are reached. The calculation is straightforward to implement.
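The three iterations can be written down almost verbatim in matrix form. In this sketch the weight blocks B11, B12, A21, B22 carry made-up numbers for a net with one input, two hidden nodes, and one output, and the logistic activation is assumed:

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))   # logistic activation (assumed)

B11 = np.array([[0.3], [-0.2]])   # input-to-hidden weights (invented values)
B12 = np.array([0.1, 0.1])        # hidden-layer offset weights
A21 = np.array([[0.5, -0.4]])     # hidden-to-output weights
B22 = np.array([0.2])             # output-layer offset weight
U1 = np.array([1.0])              # the real input; u2 = 1 is implicit

X1 = B11 @ U1 + B12               # iteration 1: X1 = B11 U1 + B12, X2 = B22
X2 = A21 @ f(X1) + B22            # iteration 2: X2 = A21 f(X1) + B22
Y = f(X2)                         # iteration 3: Y = f(A21 f(B11 U1 + B12) + B22)
```

Each line corresponds to one iteration; the block-matrix bookkeeping of (74)-(75) reduces to two matrix-vector products and two activation evaluations.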

The backward pass is somewhat tricky, at least regarding the understanding of the difference between errors and gradients. Figure 16 is a flow graph of a backward pass in our simple network.

Figure 15: Signal flow graph of a forward pass in a two node network.

Figure 16: Signal flow graph of a backward pass showing the difference between errors and gradients (δ2 = f'(x2) e2 with e2 = d − y, and δ1 = f'(x1) e1 with e1 = a21 δ2).

The network is the same as in Fig. 15, except the arrows are reversed to emphasise the direction of the calculations. The input is now the error, which is the difference between the desired value and the network output, or e2 = d − y. The error is multiplied by the derivative of the activation function in the current operating point to produce the first local gradient δ2. The error corresponding to node 1 is then e1 = a21 δ2. The errors are used in the following matrix calculations similar to state variables in the state space model. Reversing the arrows corresponds to transposing the matrices. The backward pass can thus be expressed as

E = A^T [f'(X) E] + C^T (D − Y)   (76)
E_u = B^T [f'(X) E]

The expression in square brackets [·] is a product of two vectors, but it is element-by-element rather than a vector dot product, such that f'(X(1)) is multiplied by E(1), f'(X(2)) by E(2), and so on, producing a column vector. The bottom equation is not used in practice, but it is included for completeness; it expresses the back-propagation of the error all the way to the inputs. Note that f'(X) E is the vector of local gradients δ. For implementation purposes (76) can be written in more detail,

[E1; E2] = [0 A21^T; 0 0] [f'(X1) E1; f'(X2) E2] + [0; I] [D − Y]   (77)

E_u = [B11^T 0; B12^T B22^T] [f'(X1) E1; f'(X2) E2]

The backward pass also proceeds in three iterations. The vector E is initialised to 0.

Iteration 1

E1 = 0
E2 = D − Y

Iteration 2

E1 = A21^T f'(X2) (D − Y)
E2 = D − Y

Iteration 3

E_u1 = B11^T f'(X1) A21^T f'(X2) (D − Y)
E_u2 = B12^T f'(X1) A21^T f'(X2) (D − Y) + B22^T f'(X2) (D − Y)

Once the backward pass is over, the weights can be updated. Both matrices A and B have to be updated according to the delta rule,

A = A + ΔA
B = B + ΔB

Or, in more detail,

A = [0 0; A21 0] + η [0 0; δ2 f(X1)^T 0]

B = [B11 B12; 0 B22] + η [δ1 U1^T δ1; 0 δ2]

The vectors δ1 and δ2 are column vectors, and the products δ2 f(X1)^T and δ1 U1^T are outer products, i.e., the results are matrices of the same shape as A21 and B11 respectively; following the delta rule (55), each local gradient multiplies the output of the preceding layer, which for the output layer is f(X1). The model is for a two-layer, fully connected network. It should be fairly easy to extend it to more layers than two.

3 Practical issues

It is attractive that neural networks can model nonlinear relationships. From the treatment above, it may even seem easy to obtain a model of an unknown plant. In practice, however, several difficulties jeopardise the development, such as deciding the necessary and sufficient number of nodes in a network. There are no rigid mathematics regarding such design choices, but this section provides a few rules of thumb.

3.1 Rate of learning

The smaller we make the learning parameter η, the smaller the changes to the weights will be, but then the learning takes longer. If η is too large, the learning may become unstable. According to Sørensen (1994) it is normal practice to gradually decrease η from 0.1 to 0.001.

The back-propagation algorithm may reach only a local minimum of the error function, and the search may be slow, especially near the minimum. A simple way to improve the situation, and yet avoid instability, is to modify the delta rule by including a momentum term

Δw_ji = η δ_j y_i + α Δw_ji   (78)

where α is usually a positive constant called the momentum constant. The effect is a low-pass filtering of the increments of the weight updates, thus reducing the risk of getting stuck in a local minimum. The following rules of thumb apply.

• For the learning to be stable, the momentum constant must be in the range 0 ≤ |α| < 1. When α is zero the back-propagation algorithm operates without momentum. Note that α can be negative, although it is unlikely to be used in practice. According to Sørensen (1994) α is typically chosen as 0.2, 0.3, ..., 0.9.

• When the partial derivative ∂E/∂w_ji has the same sign on consecutive iterations, the sum Δw_ji grows in magnitude, and so the weight is adjusted by a large amount. Hence the momentum term tends to accelerate the descent.

• When the partial derivative ∂E/∂w_ji has opposite signs on consecutive iterations, the sum Δw_ji shrinks in magnitude, and the weight is adjusted by a small amount. Hence the momentum term has a stabilising effect in directions that alternate in sign.

The momentum term may also prevent the learning from terminating in a shallow localminimum on the error surface.
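The modified delta rule (78) can be sketched in a few lines; the values of η and α below are arbitrary illustrations, not recommendations from the text:

```python
def momentum_update(dw_prev, delta, y, eta=0.1, alpha=0.9):
    """Eq. (78): the new increment is the gradient step plus a fraction of
    the previous increment (a low-pass filter on the weight updates)."""
    return eta * delta * y + alpha * dw_prev

# a constant gradient direction: the effective step grows toward eta/(1 - alpha)
dw, steps = 0.0, []
for _ in range(5):
    dw = momentum_update(dw, delta=1.0, y=1.0)
    steps.append(dw)
# steps grow: 0.1, 0.19, 0.271, ...
```

With a consistent gradient sign the increments accumulate (acceleration); if delta alternated in sign the same filter would shrink them (stabilisation), exactly the two bullet points above.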

The learning parameter η has been assumed constant, but in reality it should be connection dependent. We may in fact constrain any number of weights to remain fixed by simply making the learning rate η_ji for weight w_ji equal to zero.

3.2 Pattern and batch modes of training

Back-propagation may proceed in one of two basic ways:

• Pattern mode. Weights are updated after each training example; that is how it was presented here. To be precise, consider an epoch consisting of N training examples (patterns) arranged in the order [U1, D1], ..., [UN, DN]. The first example [U1, D1] is presented to the network, and the forward and backward computations are performed. Then the second example [U2, D2] is presented, and the forward and backward computations are repeated. This continues until the last example.

• Batch mode. Weights are updated after presenting all training examples; that constitutes an epoch.

From an on-line point of view, where the data are presented as time series, the pattern mode uses less storage. Moreover, given a random presentation order of the patterns, the search in weight space becomes stochastic in nature, and back-propagation is less likely to get trapped in a local minimum. The batch mode, however, is appropriate off-line, since it provides a more accurate estimate of the gradient vector. The choice between the two methods therefore depends on the problem at hand.
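The two modes differ only in when the update is applied. A sketch for a single linear neuron; the data points, learning rate, and epoch count are made up:

```python
def gradient(w, u, d):
    """Downhill direction of E = 0.5*(d - w*u)**2 with respect to w."""
    return (d - w * u) * u

data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1)]   # (input, desired output) pairs
eta = 0.05

w_pattern = 0.0
for _ in range(100):            # pattern mode: update after every example
    for u, d in data:
        w_pattern += eta * gradient(w_pattern, u, d)

w_batch = 0.0
for _ in range(100):            # batch mode: one accumulated update per epoch
    w_batch += eta * sum(gradient(w_batch, u, d) for u, d in data)
```

Both end up near the least-squares slope (about 2.0); pattern mode wanders slightly around it because each step follows a single example, while batch mode settles on the exact gradient solution.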

3.3 Initialisation

A good choice for the initial values of the weights can be important. The customary practice is to initialise all weights randomly, uniformly distributed within a small range of values. A wrong choice of initial weights can lead to saturation, where the value of E remains nearly constant for some period of time during the learning process, corresponding to a saddle point in the error surface. Assuming sigmoidal activation functions, this happens if the input to one (or several) activation functions is numerically so large that it operates on the tails of the sigmoids. Here the slope is rather small, and the neuron is in saturation. Consequently the adjustments to the weights of the neuron will be small, and the network may take a long time to escape from it. The following observations have been made (Lee, Oh & Kim in Haykin, 1994):

• Incorrect saturation is avoided by choosing the initial values of the weights inside a small range of values.

• Incorrect saturation is less likely to occur when the number of hidden neurons is low.

• Incorrect saturation rarely occurs when the neurons operate in their linear regions.

3.4 Scaling

The training data most often have to be scaled. The data are measured in physical units, which are often of quite different orders of magnitude. This is inconvenient when the activation functions are defined on standard universes. In the case of s-shaped activation functions, there is a risk that neurons will operate on the tails of the activation function, causing saturation phenomena and increased training time. Another problem is that the error function E favours elements accidentally measured in a unit which implies a large magnitude.

Scaling may avoid these problems. One way is to map the data such that the scaled signals have a mean value m of zero and a standard deviation s of one, i.e.,

X' = (X − m) / s,

where X is a set of physical measurements, and X' is the corresponding vector in the scaled world of the network.
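The zero-mean, unit-standard-deviation mapping is a one-liner per signal. A sketch; the temperature readings are invented:

```python
import statistics

def scale(values):
    """Map measurements to mean 0 and standard deviation 1."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [(v - m) / s for v in values]

temperatures = [291.0, 295.0, 303.0, 299.0]   # physical units, e.g. kelvin
scaled = scale(temperatures)                  # roughly [-1.34, -0.45, 1.34, 0.45]
```

The mean m and deviation s must be stored, since any later test data (and the network outputs, if they are scaled too) have to be mapped with the same constants.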

3.5 Stopping criteria

In principle training continues until the weights stabilise and the average squared error over the entire training set converges to some minimum value. There are actually no well-defined criteria for stopping the algorithm's operation, but there are some reasonable criteria.

It is logical to look for a zero gradient, since the gradient of the error surface is zero in a minimum. A stopping criterion could therefore be to stop when the Euclidean norm of the gradient vector is sufficiently small. Learning may take a long time, however, and it requires computing the gradient vector g(W).

Another possibility is to exploit the fact that the error function is stationary at a minimum. The stopping criterion could therefore be to stop when the absolute rate of change in the average squared error per epoch is sufficiently small. A typical value is 0.1 to 1 percent per epoch; sometimes a value as small as 0.01 percent per epoch is used.

3.6 The number of training examples

Back-propagation learning starts with presenting a training set to the network. Hopefully the network is designed to generalise, that is, to perform a nonlinear interpolation. The network generalises well when its output is nearly correct for a test data set never used in training. The network can interpolate mainly because continuous activation functions lead to continuous output functions.

If a network learns too many input-output relations in the presence of noise, i.e., there is overfitting, it is probably because it uses too many hidden neurons. When this happens the error is low; it looks like a desirable situation, but it is not. In the presence of noise it is undesirable to fit the data points perfectly; a smoother approximation is more correct. A high error goal stops a network from overtraining, because it gives the network some tolerance in the fit.

The opposite effect, underfitting, is also possible. This occurs when there are too few weights; the network then cannot produce outputs reasonably close to the desired outputs. An example is illustrated in Fig. 17.

Figure 17: Due to underfitting, a network (solid line) has not been able to model the variations in the training data (+).

Overfitting and underfitting are affected by the size and the efficiency of the training set, and by the architecture of the network. If the architecture is fixed, then the right size of the training set must be determined.

For the case of a network containing a single hidden layer used as a binary classifier, some guidelines are available (Baum & Haussler in Haykin, 1994). The network will almost certainly provide generalisation if

1. the fraction of errors made on the training set is less than ε/2, and
2. the number of examples, N, used in training is

N ≥ (32 W / ε) ln(32 M / ε)   (79)

Here M is the number of hidden neurons, W is the total number of weights in the network, ε is the ratio of the permissible error to the desired output, and ln is the natural logarithm. The equation provides a worst case formula for estimating the training set size for a single-layer neural network; in practice there can be a huge gap between the actual size of the training set needed and that predicted by (79). In practice, all we need is to satisfy the condition

N ≥ W / ε   (80)

Thus, with an error ratio of, say, 0.1 (10 percent), the number of training examples should be approximately 10 times the number of weights in the network.
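As an illustration of how far apart (79) and (80) can be; the network size (40 weights, 5 hidden neurons) and the 10 percent error ratio are made-up numbers:

```python
import math

def worst_case_examples(W, M, eps):
    """Worst case bound, eq. (79): N >= (32 W / eps) * ln(32 M / eps)."""
    return 32.0 * W / eps * math.log(32.0 * M / eps)

def rule_of_thumb(W, eps):
    """Practical condition, eq. (80): N >= W / eps."""
    return W / eps

n_practical = rule_of_thumb(40, 0.1)        # 400 examples
n_worst = worst_case_examples(40, 5, 0.1)   # two orders of magnitude more
```

The gap (roughly 400 versus tens of thousands of examples here) is the "huge gap" mentioned above, which is why the simple condition (80) is the one used in practice.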

3.7 The number of layers and neurons

A single layer perceptron forms half-plane decision regions (Lippmann, 1987). A two-layer perceptron can form any, possibly unbounded, convex region in the input space. Such regions are formed from intersections of the half-plane regions formed by each node in the first layer of the multi-layer perceptron. These convex regions have at most as many sides as there are nodes in the first layer. The number of nodes must be large enough to form a decision region that is as complex as is required. It must not, however, be so large that the weights required cannot be readily estimated from the available training data. A three layer perceptron (with two hidden layers) can form arbitrarily complex decision regions. Thus no more than three layers are required in perceptron networks.

The number of nodes in the second (hidden) layer must be greater than one when decision regions are disconnected or meshed and cannot be formed from one convex region. The number of second layer nodes required in the worst case is equal to the number of disconnected regions in the input space. The number of nodes in the first layer must typically be sufficient to provide three or more edges for each convex area generated by every second layer node. There should thus typically be more than three times as many nodes in the first layer as in the second.

The above concerns primarily multi-layer perceptrons with one output, when hard limiters (signum functions) are used as activation functions. Similar rules apply to multi-output networks, where the resulting class corresponds to the output node with the largest output.

For control applications, as opposed to classification, Sørensen (1994) chooses to consider only multi-layer perceptrons with one hidden layer, since that is sufficient.

4 How to model a dynamic plant

To return to the learning problem posed in the introduction, forward learning is the procedure of training a neural network to represent the dynamics of a plant (Fig. 1). The neural network is placed in parallel with the plant, and the error between the system and the network outputs, the prediction error, is used as the training signal. In the case of a multi-layer perceptron the network can be trained by back-propagating the prediction error.

But how does the network allow for the dynamics of the plant? One possibility is to introduce dynamics into the network itself. This can be done using internal feedback in the network (a recurrent network) or by introducing transfer functions into the neurons. A straightforward approach is to augment the network input with signals corresponding to past inputs and outputs.

Assume that the plant is governed by a nonlinear discrete-time difference equation

y^p(t + 1) = f[y^p(t), ..., y^p(t − n + 1), u(t), ..., u(t − m + 1)]   (81)

Thus the system output y^p at time t + 1 is a nonlinear function f of the past n output values and of the past m values of the input u. The superscript p refers to the plant. An immediate idea is to choose the input-output structure of the network equal to the believed structure of the system. Denoting the output of the network y^m, where the superscript m refers to the model, we then have

y^m(t + 1) = f̂[y^p(t), ..., y^p(t − n + 1), u(t), ..., u(t − m + 1)]   (82)

Figure 18: Experimental step response data for training.

Here f̂ represents the nonlinear input-output map of the network (i.e. the approximation of f). Notice that the input to the network includes the past values of the plant output (the network has no memory). This dependence is not included in Fig. 1 for simplicity. If we assume that after a suitable training period the network gives a good representation of the plant (i.e. y^m ≈ y^p), then for subsequent post-training purposes the network output itself can be fed back and used as part of the network input. In this way the network can be used independently of the plant. This may be preferred when there is noise, since it avoids problems of bias caused by noise in the plant.

A pure delay of n_d samples in the plant can be incorporated directly. In the case of (81) the equation becomes

y^p(t + 1) = f[y^p(t), ..., y^p(t − n + 1), u(t − n_d + 1), ..., u(t − n_d − m + 2)]

Example (first order plant) Given a dynamic plant with one input u and one output y, we wish to model it with a neurofuzzy network using forward learning. Its step response to a unit step input u(t) = 1 (0 ≤ t ≤ 100) will be used as training data; it is plotted in Fig. 18. Assume the plant can be modelled using a continuous time Laplace transfer function with one time constant τ:

y = 1 / (1 + τs) u   (83)

The discrete time (zero-order-hold) equivalent is

y(k + 1) = a y(k) + (1 − a) u(k)   (84)

where a is a constant that depends on τ and the sampling time. The network inputs and output are, cf. (82),

Input1 = y(k)   (85)
Input2 = u(k)   (86)
Output = y(k + 1)   (87)

Figure 19: Parallel coupling with a matrix.

Using the step response, we can produce a training set. Two examples from the set are

y(k)   u(k)   y(k+1)
 ·      ·       ·
 ·      ·       ·        (88)

One may argue, assuming that (84) is an accurate model, that the first row determines a, and the problem is solved without neural nets. However, in case the plant contains a nonlinearity not accounted for in (84), a single neuron with two inputs and one output might model the plant better.
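The training set of the example can be generated synthetically from (84). A sketch under stated assumptions: the time constant, the sampling time, and hence the pole a are values I have invented, not taken from the report:

```python
import math

tau, Ts = 20.0, 1.0            # time constant and sampling time (assumed)
a = math.exp(-Ts / tau)        # zero-order-hold pole of 1/(1 + tau*s)

# unit step input u(k) = 1 for 0 <= k <= 100, as in Fig. 18
y = [0.0]
for k in range(100):
    y.append(a * y[k] + (1.0 - a) * 1.0)   # eq. (84)

# training pairs: inputs (y(k), u(k)), target y(k+1), eqs. (85)-(87)
training_set = [((y[k], 1.0), y[k + 1]) for k in range(100)]
```

Each pair feeds the two network inputs (85)-(86) and supplies the desired output (87); a richer input signal than a single step would of course excite the plant better.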

If a nonlinear plant is known to have partly linear, partly nonlinear gains from the inputs to the outputs, a network can be trained to model just the nonlinear behaviour of the plant (Sørensen, 1994). The network is coupled in parallel with a known constant gain matrix C, and the network and C are both fed with the data U(k), the input data to the plant (Fig. 19). The outputs from the network and from the matrix are Y_N(k) and Y_C(k), and

Y_N(k) = F(U(k), w)   (89)
Y_C(k) = C U(k)   (90)

The vector w contains all the weights and offsets in the network, ordered in some way. The total output of the arrangement is

Ŷ(k) = Y_N(k) + Y_C(k)   (91)
     = F(U(k), w) + C U(k)   (92)

The total gain matrix is

N(k) = dŷ_i(k) / du_j(k),   (93)

where the index i runs over all outputs and the index j over all inputs. The gain matrix N, evaluated at a particular input U(k) at time instance k, is called the Jacobian matrix of the function from U to ŷ. Each element is the gain on a small change of an input u_j, and the gain in position (i, j) refers to the signal path from input u_j to output ŷ_i. A further computation shows that

N(k) = dF(U(k), w) / dU^T(k) + C   (94)
     = N_N(k) + C   (95)

where N_N(k) is the gain matrix of the network. Equations (92) and (95) simply show that the combined parallel structure has a total output equal to the sum of the outputs from the nonlinear network and the linear matrix, and that the combined gain matrix is the sum of the two individual gain matrices. The configuration inside the dotted box in Fig. 19 behaves as an ordinary network, and can be trained as a network.
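Equations (92) and (95) can be checked numerically with a stand-in network. Everything below is invented for the illustration: the tanh map playing the role of the trained network, the matrices C and w, and the operating point u0:

```python
import numpy as np

C = np.array([[2.0, 0.0], [0.0, 1.0]])    # known constant gain matrix (assumed)
w = np.array([[0.5, -0.3], [0.2, 0.4]])   # network weights (assumed)

def network(u):
    """Stand-in for the trained network: a fixed tanh layer."""
    return np.tanh(w @ u)

def y_hat(u):
    return network(u) + C @ u             # total output, eq. (92)

u0 = np.array([0.1, -0.2])

# finite-difference Jacobian of the total map at u0, eq. (93)
h = 1e-6
J = np.zeros((2, 2))
for j in range(2):
    e = np.zeros(2)
    e[j] = h
    J[:, j] = (y_hat(u0 + e) - y_hat(u0 - e)) / (2.0 * h)

# analytic network Jacobian: diag(1 - tanh(w u0)^2) w; eq. (95) says J = J_N + C
J_N = (1.0 - np.tanh(w @ u0) ** 2)[:, None] * w
```

The finite-difference Jacobian of the parallel arrangement agrees with J_N + C, which is the content of (95).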

5 Summary

Back-propagation is an example of supervised learning. That means the network is shown (by a 'teacher') a desired output given an example from a training set. A performance measure is the sum of the squared errors, visualised as an error surface. For the system to learn (from the 'teacher'), the operating point is forced to move toward a minimum point, local or global, by stepping against the gradient, which points in the direction of steepest ascent. Given an adequate training set and enough time, the network is usually able to perform classification and function approximation satisfactorily. A network can be described in terms of matrices, resulting in a model similar in form to the state-space model known from control engineering. A dynamic plant can be modelled using a static nonlinear regression in the discrete time domain.

References

Agarwal, M. (1997). A systematic classification of neural-network-based control, IEEE Control Systems 17(2): 75-93.

Demuth, H. and Beale, M. (1992). Neural Network Toolbox: For Use with Matlab, The MathWorks, Inc, Natick, MA, USA.

Fukuda, T. and Shibata, T. (1992). Theory and applications of neural networks for industrial control systems, IEEE Transactions on Industrial Electronics 39(6): 472-489.

Haykin, S. (1994). Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, Inc., 866 Third Ave, New York, NY 10022.

Hunt, K., Sbarbaro, D., Zbikowski, R. and Gawthrop, P. (1992). Neural networks for control systems - a survey, Automatica 28(6): 1083-1112.

Lippmann, R. (1987). An introduction to computing with neural nets, IEEE ASSP Magazine pp. 4-22.

Meier, W., Weber, R. and Zimmermann, H.-J. (1994). Fuzzy data analysis - methods and industrial applications, Fuzzy Sets and Systems 61: 19-28.

MIT (1995). CITE Literature and Products Database, MIT GmbH / ELITE, Promenade 9, D-52076 Aachen, Germany.

MIT (1997). DataEngine: Part II, Tutorials, MIT GmbH, Promenade 9, D-52076 Aachen, Germany.

Nørgaard, P. M. (1996). System Identification and Control with Neural Networks, PhD thesis, Technical University of Denmark, Dept. of Automation, Denmark.

Nørgaard, P. M. (n.d.a). NNCTRL Toolkit, Technical University of Denmark: Dept. of Automation, http://www.iau.dtu.dk/research/control/nnctrl.html.

Nørgaard, P. M. (n.d.b). NNSYSID Toolbox, Technical University of Denmark: Dept. of Automation, http://www.iau.dtu.dk/Projects/proj/nnhtml.html.

Sørensen, O. (1994). Neural Networks in Control Applications, PhD thesis, Aalborg University, Institute of Electronic Systems, Dept. of Control Engineering, Denmark. ISSN 0908-1208.