Page 1

Class Exercise

The back-propagation training algorithm

Page 2

The back-propagation training algorithm

Step 1: Initialisation
Set all the weights and threshold levels of the network to random numbers uniformly distributed inside a small range:

$\left(-\frac{2.4}{F_i},\; +\frac{2.4}{F_i}\right)$

where $F_i$ is the total number of inputs of neuron i in the network. The weight initialisation is done on a neuron-by-neuron basis.
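As a concrete illustration of Step 1, a minimal Python sketch of this initialisation (the function name, the list-of-lists layout and the 2-2-1 layer sizes are illustrative choices, not taken from the slides):

```python
import random

def init_weights(n_inputs, n_neurons):
    """Initialise one layer's weights and thresholds uniformly in
    (-2.4/Fi, +2.4/Fi), where Fi is the number of inputs of each neuron."""
    bound = 2.4 / n_inputs
    weights = [[random.uniform(-bound, bound) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    thresholds = [random.uniform(-bound, bound) for _ in range(n_neurons)]
    return weights, thresholds

# Example: a 2-2-1 network (two inputs, two hidden neurons, one output neuron)
hidden_w, hidden_theta = init_weights(n_inputs=2, n_neurons=2)
output_w, output_theta = init_weights(n_inputs=2, n_neurons=1)
```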

Page 3

Step 2: Activation
Activate the back-propagation neural network by applying inputs x1(p), x2(p), ..., xn(p) and desired outputs yd,1(p), yd,2(p), ..., yd,n(p).

(a) Calculate the actual outputs of the neurons in the hidden layer:

$y_j(p) = sigmoid\left[\sum_{i=1}^{n} x_i(p)\, w_{ij}(p) - \theta_j\right]$

where n is the number of inputs of neuron j in the hidden layer, and sigmoid is the sigmoid activation function.

Page 4

Step 2: Activation (continued)

(b) Calculate the actual outputs of the neurons in the output layer:

$y_k(p) = sigmoid\left[\sum_{j=1}^{m} x_{jk}(p)\, w_{jk}(p) - \theta_k\right]$

where m is the number of inputs of neuron k in the output layer.
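A short sketch of Step 2, covering both the hidden-layer and output-layer activations, continuing the illustrative weight layout from the Step 1 sketch (the helper names `sigmoid`, `layer_outputs` and `forward` are assumptions, not the authors' code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_outputs(inputs, weights, thresholds):
    """y_j = sigmoid( sum_i x_i * w_ij - theta_j ) for every neuron j in a layer."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, w_row)) - theta)
            for w_row, theta in zip(weights, thresholds)]

def forward(x, hidden_w, hidden_theta, output_w, output_theta):
    """Forward pass: inputs -> hidden layer (step 2a) -> output layer (step 2b)."""
    y_hidden = layer_outputs(x, hidden_w, hidden_theta)
    y_output = layer_outputs(y_hidden, output_w, output_theta)
    return y_hidden, y_output
```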

Page 5

Step 3: Weight training
Update the weights in the back-propagation network, propagating backward the errors associated with the output neurons.

(a) Calculate the error gradient for the neurons in the output layer:

$\delta_k(p) = y_k(p)\,\left[1 - y_k(p)\right]\, e_k(p)$

where

$e_k(p) = y_{d,k}(p) - y_k(p)$

Calculate the weight corrections:

$\Delta w_{jk}(p) = \alpha\, y_j(p)\, \delta_k(p)$

Update the weights at the output neurons:

$w_{jk}(p+1) = w_{jk}(p) + \Delta w_{jk}(p)$

Page 6

Step 3: Weight training (continued)

(b) Calculate the error gradient for the neurons in the hidden layer:

$\delta_j(p) = y_j(p)\,\left[1 - y_j(p)\right] \sum_{k=1}^{l} \delta_k(p)\, w_{jk}(p)$

Calculate the weight corrections:

$\Delta w_{ij}(p) = \alpha\, x_i(p)\, \delta_j(p)$

Update the weights at the hidden neurons:

$w_{ij}(p+1) = w_{ij}(p) + \Delta w_{ij}(p)$
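Step 3 can be sketched as follows, continuing the same illustrative 2-layer layout (one hidden layer, one output layer, learning rate `alpha`). This mirrors the gradient and update formulas above but is only a sketch, not the authors' implementation:

```python
def backward_update(x, y_hidden, y_output, y_desired,
                    hidden_w, hidden_theta, output_w, output_theta, alpha=0.1):
    """One application of Step 3: output gradients, hidden gradients, weight updates."""
    # (a) error gradients at the output layer: delta_k = y_k (1 - y_k) e_k
    errors = [yd - y for yd, y in zip(y_desired, y_output)]
    delta_out = [y * (1 - y) * e for y, e in zip(y_output, errors)]

    # (b) error gradients at the hidden layer: delta_j = y_j (1 - y_j) sum_k delta_k w_jk
    delta_hid = [y * (1 - y) * sum(d * output_w[k][j] for k, d in enumerate(delta_out))
                 for j, y in enumerate(y_hidden)]

    # corrections and updates (each threshold acts as a weight on a fixed input of -1)
    for k, d in enumerate(delta_out):
        for j, yj in enumerate(y_hidden):
            output_w[k][j] += alpha * yj * d
        output_theta[k] += alpha * (-1) * d
    for j, d in enumerate(delta_hid):
        for i, xi in enumerate(x):
            hidden_w[j][i] += alpha * xi * d
        hidden_theta[j] += alpha * (-1) * d
```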

Page 7

Step 4: Iteration
Increase iteration p by one, go back to Step 2 and repeat the process until the selected error criterion is satisfied.
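Putting Steps 1-4 together, a minimal training loop for the XOR example that follows might look like the sketch below (it reuses the illustrative helpers from the earlier sketches; the stopping threshold of 0.001 is taken from the later slides, the epoch cap is an assumption):

```python
def train_xor(max_epochs=10000, alpha=0.1, target_sse=0.001):
    """Steps 1-4 on the XOR training set, using the sketches above."""
    hidden_w, hidden_theta = init_weights(2, 2)   # Step 1
    output_w, output_theta = init_weights(2, 1)
    data = [([1, 1], [0]), ([0, 1], [1]), ([1, 0], [1]), ([0, 0], [0])]
    for epoch in range(max_epochs):               # Step 4: iterate
        sse = 0.0
        for x, y_d in data:
            y_hid, y_out = forward(x, hidden_w, hidden_theta,
                                   output_w, output_theta)          # Step 2
            sse += sum((yd - y) ** 2 for yd, y in zip(y_d, y_out))
            backward_update(x, y_hid, y_out, y_d, hidden_w, hidden_theta,
                            output_w, output_theta, alpha)          # Step 3
        if sse < target_sse:                      # selected error criterion
            break
    return hidden_w, hidden_theta, output_w, output_theta
```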

Page 8

• As an example, we may consider the three-layer back-propagation network.

• Suppose that the network is required to perform the logical operation Exclusive-OR.

• Recall that a single-layer perceptron could not do this operation. Now we will apply the three-layer net.

Page 9

Three-layer network for solving the Exclusive-OR operation

[Figure: input layer (x1, x2), hidden layer (neurons 3 and 4), output layer (neuron 5), with weights w13, w14, w23, w24, w35, w45 and a fixed input of -1 feeding the threshold of each hidden and output neuron.]

Page 10

The effect of the threshold applied to a neuron in the hidden or output layer is represented by its weight, θ, connected to a fixed input equal to -1.

The initial weights and threshold levels are set randomly as follows:
w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = -1.2, w45 = 1.1, θ3 = 0.8, θ4 = -0.1 and θ5 = 0.3.

Page 11

We consider a training set where inputs x1 and x2 are equal to 1 and the desired output yd,5 is 0. The actual outputs of neurons 3 and 4 in the hidden layer are calculated as

$y_3 = sigmoid(x_1 w_{13} + x_2 w_{23} - \theta_3) = 1 / \left[1 + e^{-(1 \cdot 0.5 + 1 \cdot 0.4 - 1 \cdot 0.8)}\right] = 0.5250$

$y_4 = sigmoid(x_1 w_{14} + x_2 w_{24} - \theta_4) = 1 / \left[1 + e^{-(1 \cdot 0.9 + 1 \cdot 1.0 + 1 \cdot 0.1)}\right] = 0.8808$

Now the actual output of neuron 5 in the output layer is determined as:

$y_5 = sigmoid(y_3 w_{35} + y_4 w_{45} - \theta_5) = 1 / \left[1 + e^{-(-0.5250 \cdot 1.2 + 0.8808 \cdot 1.1 - 1 \cdot 0.3)}\right] = 0.5097$

Thus, the following error is obtained:

$e = y_{d,5} - y_5 = 0 - 0.5097 = -0.5097$
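This hand calculation can be checked with a few lines of Python (the weights, thresholds and inputs are exactly those given on the previous slide):

```python
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

x1, x2 = 1, 1
w13, w14, w23, w24, w35, w45 = 0.5, 0.9, 0.4, 1.0, -1.2, 1.1
t3, t4, t5 = 0.8, -0.1, 0.3

y3 = sigmoid(x1 * w13 + x2 * w23 - t3)   # 0.5250
y4 = sigmoid(x1 * w14 + x2 * w24 - t4)   # 0.8808
y5 = sigmoid(y3 * w35 + y4 * w45 - t5)   # 0.5097
e = 0 - y5                               # -0.5097
print(round(y3, 4), round(y4, 4), round(y5, 4), round(e, 4))
```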

Page 12

The next step is weight training. To update the weights and threshold levels in our network, we propagate the error, e, from the output layer backward to the input layer.

First, we calculate the error gradient for neuron 5 in the output layer:

$\delta_5 = y_5\,(1 - y_5)\, e = 0.5097 \cdot (1 - 0.5097) \cdot (-0.5097) = -0.1274$

Then we determine the weight corrections, assuming that the learning rate parameter, α, is equal to 0.1:

$\Delta w_{35} = \alpha \cdot y_3 \cdot \delta_5 = 0.1 \cdot 0.5250 \cdot (-0.1274) = -0.0067$

$\Delta w_{45} = \alpha \cdot y_4 \cdot \delta_5 = 0.1 \cdot 0.8808 \cdot (-0.1274) = -0.0112$

$\Delta \theta_5 = \alpha \cdot (-1) \cdot \delta_5 = 0.1 \cdot (-1) \cdot (-0.1274) = 0.0127$

Page 13

Next we calculate the error gradients for neurons 3 and 4 in the hidden layer:

$\delta_3 = y_3\,(1 - y_3) \cdot \delta_5 \cdot w_{35} = 0.5250 \cdot (1 - 0.5250) \cdot (-0.1274) \cdot (-1.2) = 0.0381$

$\delta_4 = y_4\,(1 - y_4) \cdot \delta_5 \cdot w_{45} = 0.8808 \cdot (1 - 0.8808) \cdot (-0.1274) \cdot 1.1 = -0.0147$

We then determine the weight corrections:

$\Delta w_{13} = \alpha \cdot x_1 \cdot \delta_3 = 0.1 \cdot 1 \cdot 0.0381 = 0.0038$
$\Delta w_{23} = \alpha \cdot x_2 \cdot \delta_3 = 0.1 \cdot 1 \cdot 0.0381 = 0.0038$
$\Delta \theta_3 = \alpha \cdot (-1) \cdot \delta_3 = 0.1 \cdot (-1) \cdot 0.0381 = -0.0038$
$\Delta w_{14} = \alpha \cdot x_1 \cdot \delta_4 = 0.1 \cdot 1 \cdot (-0.0147) = -0.0015$
$\Delta w_{24} = \alpha \cdot x_2 \cdot \delta_4 = 0.1 \cdot 1 \cdot (-0.0147) = -0.0015$
$\Delta \theta_4 = \alpha \cdot (-1) \cdot (-0.0147) = 0.0015$

Page 14

At last, we update all weights and thresholds:

$w_{13} = w_{13} + \Delta w_{13} = 0.5 + 0.0038 = 0.5038$
$w_{14} = w_{14} + \Delta w_{14} = 0.9 - 0.0015 = 0.8985$
$w_{23} = w_{23} + \Delta w_{23} = 0.4 + 0.0038 = 0.4038$
$w_{24} = w_{24} + \Delta w_{24} = 1.0 - 0.0015 = 0.9985$
$w_{35} = w_{35} + \Delta w_{35} = -1.2 - 0.0067 = -1.2067$
$w_{45} = w_{45} + \Delta w_{45} = 1.1 - 0.0112 = 1.0888$
$\theta_3 = \theta_3 + \Delta \theta_3 = 0.8 - 0.0038 = 0.7962$
$\theta_4 = \theta_4 + \Delta \theta_4 = -0.1 + 0.0015 = -0.0985$
$\theta_5 = \theta_5 + \Delta \theta_5 = 0.3 + 0.0127 = 0.3127$

The training process is repeated until the sum of squared errors is less than 0.001.
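The gradient, correction and update values on the last three slides can likewise be reproduced with a short script, continuing the variables from the previous snippet (alpha = 0.1 as assumed on Page 12):

```python
alpha = 0.1

# error gradients
d5 = y5 * (1 - y5) * e                       # -0.1274
d3 = y3 * (1 - y3) * d5 * w35                #  0.0381
d4 = y4 * (1 - y4) * d5 * w45                # -0.0147

# weight and threshold corrections (the threshold input is -1)
dw35, dw45, dt5 = alpha * y3 * d5, alpha * y4 * d5, alpha * (-1) * d5
dw13, dw23, dt3 = alpha * x1 * d3, alpha * x2 * d3, alpha * (-1) * d3
dw14, dw24, dt4 = alpha * x1 * d4, alpha * x2 * d4, alpha * (-1) * d4

# updated weights and thresholds
print(round(w13 + dw13, 4), round(w14 + dw14, 4))                   # 0.5038 0.8985
print(round(w23 + dw23, 4), round(w24 + dw24, 4))                   # 0.4038 0.9985
print(round(w35 + dw35, 4), round(w45 + dw45, 4))                   # -1.2067 1.0888
print(round(t3 + dt3, 4), round(t4 + dt4, 4), round(t5 + dt5, 4))   # 0.7962 -0.0985 0.3127
```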

Page 15

Learning curve for operation Exclusive-OR

[Figure: sum-squared network error (log scale, 10^-4 to 10^1) versus epoch, for 224 epochs.]

Page 16

Final results of three-layer network learning

Inputs        Desired output   Actual output   Error
x1   x2       yd               y5              e
1    1        0                0.0155          -0.0155
0    1        1                0.9849           0.0151
1    0        1                0.9849           0.0151
0    0        0                0.0175          -0.0175

Sum of squared errors: 0.0010

Page 17

Network represented by McCulloch-Pitts model for solving the Exclusive-OR operation

[Figure: the same architecture (inputs x1, x2; hidden neurons 3 and 4; output neuron 5) drawn with McCulloch-Pitts-style weights; the visible labels include connection weights of +1.0 and thresholds of +1.5, +0.5 and +0.5.]

Page 18

Decision boundaries

(a) Decision boundary constructed by hidden neuron 3;
(b) Decision boundary constructed by hidden neuron 4;
(c) Decision boundaries constructed by the complete three-layer network

[Figure: three panels in the (x1, x2) plane; the boundary x1 + x2 - 1.5 = 0 for hidden neuron 3, the boundary x1 + x2 - 0.5 = 0 for hidden neuron 4, and panel (c) combining both boundaries.]

Page 19

VARIATIONS in multilayer neural networks

Modifications can be made to improve the performance of BP.

These may involve:
– Changes in the weight update procedure
– An alternative to the sigmoid activation function
– Improving the computational power

Page 20

Alternative Weight Update Procedures

Momentum
– In BP with momentum, the weight change uses a combination of the current and the previous gradient.
– This is an advantage when some training data are very different from or unusual relative to the majority of the data (or incorrect):
  » use a small learning rate to avoid a major disruption of learning,
  » or maintain the training pace.

Page 21

Alternative Weight Update Procedures

Momentum
– Faster convergence with momentum
– To use momentum, weights from one or more previous training steps need to be saved

Page 22

Alternative Weight Update Procedures

BP with momentum (momentum constant μ):

$w_{jk}(t+1) = w_{jk}(t) + \alpha\,\delta_k\,z_j + \mu\left[w_{jk}(t) - w_{jk}(t-1)\right]$

or, equivalently, $\Delta w_{jk}(t+1) = \alpha\,\delta_k\,z_j + \mu\,\Delta w_{jk}(t)$

and

$v_{ij}(t+1) = v_{ij}(t) + \alpha\,\delta_j\,x_i + \mu\left[v_{ij}(t) - v_{ij}(t-1)\right]$

or, equivalently, $\Delta v_{ij}(t+1) = \alpha\,\delta_j\,x_i + \mu\,\Delta v_{ij}(t)$
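A minimal sketch of this momentum update for a single hidden-to-output weight (the function name and the Δw bookkeeping are illustrative; the same rule applies to the input-to-hidden weights v_ij with δ_j and x_i in place of δ_k and z_j):

```python
def momentum_step(w_jk, prev_dw_jk, delta_k, z_j, alpha=0.2, mu=0.9):
    """w_jk(t+1) = w_jk(t) + alpha*delta_k*z_j + mu*[w_jk(t) - w_jk(t-1)],
    i.e. dw(t+1) = alpha*delta_k*z_j + mu*dw(t)."""
    dw_jk = alpha * delta_k * z_j + mu * prev_dw_jk
    return w_jk + dw_jk, dw_jk   # new weight, plus dw to save for the next step
```

The only extra cost is storing the previous weight change for every weight in the network.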

Page 23

Alternative Weight Update Procedures

Advantages of adding momentum
– It allows the net to make reasonably large weight adjustments as long as the corrections are in the same general direction for several patterns, while using a small learning rate to prevent a large response to the error from any one training pattern.
– It reduces the chance of getting stuck in a local minimum.
– With momentum, the net proceeds not in the direction of the gradient, but in the direction of a combination of the current gradient and the previous direction of weight correction.

Page 24

Alternative Weight Update Procedures

Example of adding momentum
– In Examples 6.1-6.3, using the same initial weights and architecture, adding momentum of 0.9 with the same learning rate as before (0.2) reduces the training requirement from 387 epochs to 38 epochs.

Page 25

Adaptive Learning Rates

Changing the learning rate during training: Delta-Bar-Delta
– Allow each weight to have its own learning rate
– Let the learning rates vary with time as training progresses

Page 26

Adaptive Learning Rates

To accelerate the convergence and yet avoid the danger of instability, we can apply two heuristics:

Heuristic 1
If the change of the sum of squared errors has the same algebraic sign (direction) for several consecutive epochs, then the learning rate parameter, α, should be increased.

Heuristic 2
If the algebraic sign of the change of the sum of squared errors alternates for several consecutive epochs, then the learning rate parameter, α, should be decreased.

Page 27

Adapting the learning rate requires some changes in the back-propagation algorithm.

If the sum of squared errors at the current epoch exceeds the previous value by more than a predefined ratio (typically 1.04), the learning rate parameter is decreased (typically by multiplying by 0.7) and new weights and thresholds are calculated.

If the error is less than the previous one, the learning rate is increased (typically by multiplying by 1.05).
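The two heuristics, together with the typical factors quoted here, can be sketched as a simple per-epoch adjustment; this is a paraphrase under the stated typical values, not the authors' exact procedure, and the roll-back of weights is only signalled by the returned flag:

```python
def adapt_learning_rate(alpha, sse, prev_sse,
                        ratio=1.04, decrease=0.7, increase=1.05):
    """Return the adjusted learning rate and whether the epoch should be redone."""
    if prev_sse is not None and sse > prev_sse * ratio:
        # error grew too much: shrink the rate and recompute weights/thresholds
        return alpha * decrease, True
    if prev_sse is not None and sse < prev_sse:
        # error decreased: grow the rate
        return alpha * increase, False
    return alpha, False
```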

Page 28

Adaptive Learning Rates

Delta-bar-delta rule
– Weight update rule
– Learning rate update rule

w_ij(t) : arbitrary weight at time t
α_ij(t) : learning rate for that weight at time t
E : squared error for the pattern presented at time t

Page 29

Adaptive Learning Rates

Delta-bar-delta rule: weight update rule

$w_{jk}(t+1) = w_{jk}(t) - \alpha_{jk}(t+1)\,\frac{\partial E}{\partial w_{jk}} = w_{jk}(t) + \alpha_{jk}(t+1)\,\delta_k\,z_j$

where the standard weight change (p. 308) is

$\frac{\partial E}{\partial w_{jk}} = -\delta_k\,z_j$
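A minimal sketch of this weight update with a per-weight learning rate; only the weight update rule shown on this slide is implemented, the delta-bar-delta learning-rate update rule itself is referenced above but not reproduced here:

```python
def delta_bar_delta_weight_step(w_jk, alpha_jk_next, delta_k, z_j):
    """w_jk(t+1) = w_jk(t) - alpha_jk(t+1) * dE/dw_jk,
    with dE/dw_jk = -delta_k * z_j, so the weight moves by +alpha*delta_k*z_j."""
    dE_dw = -delta_k * z_j
    return w_jk - alpha_jk_next * dE_dw
```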

Page 30

Adaptive Learning Rates: Performance

METHOD            SIMULATIONS   SUCCESSES   MEAN EPOCHS
BP                    25            24        16,859.8
BP + momentum         25            25         2,056.3
Delta-bar-delta       25            22           447.3

Page 31

Adaptive Learning Rates

Delta-bar-delta rule: weight update rule

$w_{jk}(t+1) = w_{jk}(t) - \alpha_{jk}(t+1)\,\frac{\partial E}{\partial w_{jk}} = w_{jk}(t) + \alpha_{jk}(t+1)\,\delta_k\,z_j$

where the standard weight change (p. 308) is

$\frac{\partial E}{\partial w_{jk}} = -\delta_k\,z_j$

Page 32

Learning with adaptive learning rate

[Figure: training for 103 epochs; sum-squared error (log scale, 10^-4 to 10^2) versus epoch and learning rate (0 to 1) versus epoch.]

Page 33

Learning with momentum and adaptive learning rate

[Figure: training for 85 epochs; sum-squared error (log scale, 10^-4 to 10^2) versus epoch and learning rate (0 to 2.5) versus epoch.]

Page 34

Accelerated learning in multilayer neural networks

A multilayer network learns much faster when the sigmoidal activation function is represented by a hyperbolic tangent:

$Y^{tanh} = \frac{2a}{1 + e^{-bX}} - a$

where a and b are constants.

Suitable values for a and b are: a = 1.716 and b = 0.667.
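For reference, this parameterised hyperbolic-tangent activation as a small Python function, with the suggested constants a = 1.716 and b = 0.667 used as defaults:

```python
import math

def tanh_activation(x, a=1.716, b=0.667):
    """Y_tanh = 2a / (1 + exp(-b*x)) - a: a sigmoid rescaled to the range (-a, +a)."""
    return 2.0 * a / (1.0 + math.exp(-b * x)) - a
```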

Page 35

We can also accelerate training by including a momentum term in the delta rule:

$\Delta w_{jk}(p) = \beta\,\Delta w_{jk}(p-1) + \alpha\,y_j(p)\,\delta_k(p)$

where β is a positive number (0 ≤ β < 1) called the momentum constant. Typically, the momentum constant is set to 0.95.

This equation is called the generalised delta rule.

Page 36

Learning with momentum for operation Exclusive-OR

[Figure: training for 126 epochs; sum-squared error (log scale, 10^-4 to 10^2) versus epoch and learning rate (-1 to 1.5) versus epoch.]

Page 37

Learning with adaptive learning rate

To accelerate the convergence and yet avoid the danger of instability, we can apply two heuristics:

Heuristic 1
If the change of the sum of squared errors has the same algebraic sign for several consecutive epochs, then the learning rate parameter, α, should be increased.

Heuristic 2
If the algebraic sign of the change of the sum of squared errors alternates for several consecutive epochs, then the learning rate parameter, α, should be decreased.

Page 38

Adapting the learning rate requires some changes in the back-propagation algorithm.

If the sum of squared errors at the current epoch exceeds the previous value by more than a predefined ratio (typically 1.04), the learning rate parameter is decreased (typically by multiplying by 0.7) and new weights and thresholds are calculated.

If the error is less than the previous one, the learning rate is increased (typically by multiplying by 1.05).

Page 39

Learning with adaptive learning rate

[Figure: training for 103 epochs; sum-squared error (log scale, 10^-4 to 10^2) versus epoch and learning rate (0 to 1) versus epoch.]

Page 40

Learning with momentum and adaptive learning rate

[Figure: training for 85 epochs; sum-squared error (log scale, 10^-4 to 10^2) versus epoch and learning rate (0 to 2.5) versus epoch.]