Intelligent Control
Module I: Neural Networks, Lecture 7
Adaptive Learning Rate
Laxmidhar Behera
Department of Electrical Engineering, Indian Institute of Technology, Kanpur
Recurrent Networks – p.1/40
Subjects to be covered
Motivation for adaptive learning rate
Lyapunov Stability Theory
Training Algorithm based on Lyapunov Stability Theory
Simulations and discussion
Conclusion
Training of a Feed Forward Network
Figure 1: A feed-forward network
Here, W ∈ R^M is the weight vector. The training data consists of, say, N patterns (x_p, y_p), p = 1, 2, ..., N.

Weight update law:
\[ W(t+1) = W(t) - \eta \frac{\partial E}{\partial W}, \qquad \eta : \text{learning rate} \]
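As a minimal illustration of this update law (my own numpy sketch, not from the lecture; it assumes a linear single-output model so the gradient of E is available in closed form):

```python
import numpy as np

def gd_step(W, x, y_target, eta=0.1):
    # Model: y_hat = W . x, quadratic cost E = 0.5 * (y_target - y_hat)^2.
    # Gradient: dE/dW = -(y_target - y_hat) * x, so the update law
    # W(t+1) = W(t) - eta * dE/dW reduces to W + eta * error * x.
    error = y_target - W @ x
    return W + eta * error * x
```

With a fixed η the step direction is always the negative gradient; the slides that follow replace this fixed η by an adaptive value.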
Motivation for adaptive learning rate
[Plot: f(x) versus x for x ∈ [−10, 10], comparing the actual function with the adaptive-learning-rate trajectory starting from x0 = −6.7]
Figure 2: Convergence to global minimum
With an adaptive learning rate, one can employ a higher learning rate when the error is far from the global minimum and a smaller learning rate when it is near it.
Adaptive Learning Rate
The objective is to achieve global convergence for a non-quadratic, non-convex nonlinear function without increasing the computational complexity.

In gradient descent (GD), the learning rate is fixed. If one could use a larger learning rate at points far from the global minimum and a smaller one at points close to it, it would be possible to avoid local minima and ensure global convergence. This motivates an adaptive learning rate.
Lyapunov Stability Theory
Used extensively in control system problems.

If we choose a Lyapunov function candidate V(x(t), t) such that

    V(x(t), t) is positive definite
    \dot{V}(x(t), t) is negative definite

then the system is asymptotically stable.

Local Invariant Set Theorem (LaSalle): Consider an autonomous system of the form \dot{x} = f(x) with f continuous, and let V(x) be a scalar function with continuous partial derivatives. Assume that

    * for some l > 0, the region Ω_l defined by V(x) < l is bounded.
Lyapunov stability theory: contd...
    * \dot{V}(x) ≤ 0 for all x in Ω_l.

Let R be the set of all points within Ω_l where \dot{V}(x) = 0, and M be the largest invariant set in R. Then every solution x(t) originating in Ω_l tends to M as t → ∞.
Problem lies in choosing a proper Lyapunov functioncandidate.
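As a standard textbook illustration of the criterion above (my example, not from the lecture): for the scalar system \dot{x} = -x^3, the candidate V(x) = \tfrac{1}{2}x^2 works.

```latex
\dot{x} = -x^{3}, \qquad
V(x) = \tfrac{1}{2}x^{2} \;\; (\text{positive definite}), \qquad
\dot{V}(x) = x\,\dot{x} = -x^{4} \;\; (\text{negative definite for } x \neq 0)
```

Hence the origin x = 0 is asymptotically stable.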
Weight update law using Lyapunov based approach
The network output is given by
\[ \hat{y}_p = f(W, x_p), \qquad p = 1, 2, \ldots, N \tag{1} \]
The usual quadratic cost function is given as:
\[ E = \frac{1}{2} \sum_{p=1}^{N} (y_p - \hat{y}_p)^2 \tag{2} \]
Let us choose a Lyapunov function candidate for the system as below:
\[ V = \frac{1}{2} \tilde{y}^T \tilde{y} \tag{3} \]
where \tilde{y} = [y_1 - \hat{y}_1, \ldots, y_p - \hat{y}_p, \ldots, y_N - \hat{y}_N]^T.
LF I Algorithm
The time derivative of the Lyapunov function V is given by
\[ \dot{V} = -\tilde{y}^T \frac{\partial \hat{y}}{\partial W} \dot{W} = -\tilde{y}^T J \dot{W} \tag{4} \]
where
\[ J = \frac{\partial \hat{y}}{\partial W} \in \mathbb{R}^{N \times M} \]
Theorem 1. If an arbitrary initial weight W(0) is updated by
\[ W(t') = W(0) + \int_0^{t'} \dot{W} \, dt \tag{5} \]
where
\[ \dot{W} = \frac{\| \tilde{y} \|^2}{\| J^T \tilde{y} \|^2 + \varepsilon} \, J^T \tilde{y} \tag{6} \]
and ε is a small positive constant, then \tilde{y} converges to zero under the condition that \dot{W} exists along the convergence trajectory.
Proof of LF - I Algorithm
Proof. Substitution of Eq. (6) into Eq. (4) yields
\[ \dot{V}_1 = -\frac{\| \tilde{y} \|^2 \, \| J^T \tilde{y} \|^2}{\| J^T \tilde{y} \|^2 + \varepsilon} \leq 0 \tag{7} \]
where \dot{V}_1 < 0 for all \tilde{y} ≠ 0. If \dot{V}_1 is uniformly continuous and bounded, then according to Barbalat's lemma, as t → ∞, \dot{V}_1 → 0 and \tilde{y} → 0.
LF - I Algorithm: contd...
The weight update law is a batch update law. The instantaneous LF I learning algorithm can be derived as:
\[ \dot{W} = \frac{\| \tilde{y} \|^2}{\| J_i^T \tilde{y} \|^2} \, J_i^T \tilde{y} \tag{8} \]
where \tilde{y} = y_p - \hat{y}_p \in \mathbb{R} and J_i = \frac{\partial \hat{y}_p}{\partial W} \in \mathbb{R}^{1 \times M}. The difference-equation representation of the weight update equation is given by
\[ W(t+1) = W(t) + \mu \dot{W}(t) \tag{9} \]
Here µ is a constant.
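A minimal numpy sketch of one instantaneous LF-I step, Eqs. (8)-(9); the function name and the small eps guard against division by zero are my additions, not from the slides:

```python
import numpy as np

def lf1_step(W, Ji, y_err, mu=0.55, eps=1e-8):
    # Instantaneous LF-I update: W(t+1) = W(t) + mu * Wdot(t), where
    # Wdot = (||y_err||^2 / ||Ji^T y_err||^2) * Ji^T * y_err.
    # y_err = y_p - y_hat_p is a scalar; Ji = d(y_hat_p)/dW has shape (M,).
    g = Ji * y_err                            # Ji^T * y_err
    Wdot = (y_err ** 2) / (g @ g + eps) * g   # adaptive-rate step direction
    return W + mu * Wdot
```

Note that the step length grows with the squared error and shrinks as the error vanishes, matching the adaptive-learning-rate behaviour motivated earlier.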
Comparison with BP Algorithm
In the gradient descent method we have
\[ \Delta W = -\eta \frac{\partial E}{\partial W} = \eta J_i^T \tilde{y} \]
\[ W(t+1) = W(t) + \eta J_i^T \tilde{y} \tag{10} \]
The update equation for the LF-I algorithm:
\[ W(t+1) = W(t) + \mu \frac{\| \tilde{y} \|^2}{\| J_i^T \tilde{y} \|^2} \, J_i^T \tilde{y} \]
Comparing the above two equations, we find that the fixed learning rate η in the BP algorithm is replaced by its adaptive version η_a:
\[ \eta_a = \mu \frac{\| \tilde{y} \|^2}{\| J_i^T \tilde{y} \|^2} \tag{11} \]
Adaptive Learning rate of LF-I
[Plot: learning rate versus number of iterations (4 × number of epochs) for LF-I on XOR; the learning rate varies between roughly 0 and 50 during training]

The learning rate is not fixed, unlike in the BP algorithm.

The learning rate goes to zero as the error goes to zero.
Convergence of LF-I
The theorem states that global convergence of LF-I is guaranteed provided \dot{W} exists along the convergence trajectory. This, in turn, necessitates \| \partial V_1 / \partial W \| = \| J^T \tilde{y} \| ≠ 0.

\| \partial V_1 / \partial W \| = 0 indicates a local minimum of the error function.

Thus, the theorem says that the global minimum is reached only when local minima are avoided during training.

Since the instantaneous update rule introduces noise, it may be possible to reach the global minimum in some cases; however, global convergence is not guaranteed.
LF II Algorithm
We consider the following Lyapunov function:
\[ V_2 = \frac{1}{2}\left( \tilde{y}^T \tilde{y} + \lambda \dot{W}^T \dot{W} \right) = V_1 + \frac{\lambda}{2} \dot{W}^T \dot{W} \tag{12} \]
where λ is a positive constant. The time derivative of the above equation is given by
\[ \dot{V}_2 = -\tilde{y}^T \frac{\partial \hat{y}}{\partial W} \dot{W} + \lambda \dot{W}^T \ddot{W} = -\tilde{y}^T (J - D) \dot{W} \tag{13} \]
where J = \partial \hat{y} / \partial W \in \mathbb{R}^{N \times m} is the Jacobian matrix, and
\[ D = \frac{\lambda}{\| \tilde{y} \|^2} \, \tilde{y} \ddot{W}^T \in \mathbb{R}^{N \times m} \]
LF II Algorithm: contd...
Theorem 2. If the update law for the weight vector W follows the dynamics given by the nonlinear differential equation
\[ \dot{W} = \alpha(W) J^T \tilde{y} - \lambda \alpha(W) \ddot{W} \tag{14} \]
where \alpha(W) = \frac{\| \tilde{y} \|^2}{\| J^T \tilde{y} \|^2 + \varepsilon} is a scalar function of the weight vector W and ε is a small positive constant, then \tilde{y} converges to zero under the condition that (J - D)^T \tilde{y} is non-zero along the convergence trajectory.
Proof of LF II algorithm
Proof. \dot{W} = \alpha(W) J^T \tilde{y} - \lambda \alpha(W) \ddot{W} may be rewritten as
\[ \dot{W} = \frac{\| \tilde{y} \|^2}{\| J^T \tilde{y} \|^2 + \varepsilon} \, (J - D)^T \tilde{y} \tag{15} \]
Substituting for \dot{W} from the above equation into \dot{V}_2 = -\tilde{y}^T (J - D) \dot{W}, we get
\[ \dot{V}_2 = -\frac{\| \tilde{y} \|^2}{\| J^T \tilde{y} \|^2 + \varepsilon} \, \| (J - D)^T \tilde{y} \|^2 \leq 0 \tag{16} \]
Since (J - D)^T \tilde{y} is non-zero, \dot{V}_2 < 0 for all \tilde{y} ≠ 0 and \dot{V}_2 = 0 iff \tilde{y} = 0. If \dot{V}_2 is uniformly continuous and bounded, then according to Barbalat's lemma, as t → ∞, \dot{V}_2 → 0 and \tilde{y} → 0.
Proof of LF II algorithm: contd...
The instantaneous weight update equation using the LF II algorithm can finally be expressed in difference-equation form as follows:
\[
\begin{aligned}
W(t+1) &= W(t) + \mu \frac{\| \tilde{y} \|^2}{\| J_p^T \tilde{y} \|^2 + \varepsilon} \, (J_p - D)^T \tilde{y} \\
       &= W(t) + \mu \frac{\| \tilde{y} \|^2}{\| J_p^T \tilde{y} \|^2 + \varepsilon} \, J_p^T \tilde{y}
          - \frac{\mu_1}{\| J_p^T \tilde{y} \|^2 + \varepsilon} \, \ddot{W}(t)
\end{aligned} \tag{17}
\]
where \mu_1 = \mu \lambda and the acceleration \ddot{W}(t) is computed as
\[ \ddot{W}(t) = \frac{1}{(\Delta t)^2} \left[ W(t) - 2W(t-1) + W(t-2) \right] \]
and \Delta t is taken to be one time unit for simulation.
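The difference-equation update above can be sketched in numpy as follows (my sketch; the coefficient placement follows Eq. (17) with µ1 = µλ, and the eps guard is an assumption):

```python
import numpy as np

def lf2_step(W, W_prev, W_prev2, Jp, y_err, mu=0.65, lam=0.01, eps=1e-8):
    # LF-II instantaneous update: an LF-I-like gradient term plus a
    # damping term built from the discrete acceleration
    # Wddot(t) = W(t) - 2 W(t-1) + W(t-2)   (second difference, dt = 1).
    Wddot = W - 2.0 * W_prev + W_prev2
    g = Jp * y_err                     # Jp^T * y_err, with scalar y_err
    denom = g @ g + eps                # ||Jp^T y_err||^2 + eps
    return W + mu * (y_err ** 2) / denom * g - (mu * lam) / denom * Wddot
```

The caller keeps the two previous weight vectors so the second difference can be formed at each step.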
Comparison with BP Algorithm
Applying gradient descent to V_2 = V_1 + \frac{\lambda}{2} \dot{W}^T \dot{W}:
\[
\begin{aligned}
\Delta W &= -\eta \left( \frac{\partial V_2}{\partial W} \right)^T
          = -\eta \left( \frac{\partial V_1}{\partial W} \right)^T
            - \eta \left[ \frac{d}{dW}\left( \frac{\lambda}{2} \dot{W}^T \dot{W} \right) \right]^T \\
         &= \eta \left( \frac{\partial \hat{y}}{\partial W} \right)^T \tilde{y} - \eta \lambda \ddot{W}
\end{aligned}
\]
Thus, the weight update equation for the gradient descent method may be written as
\[ W(t+1) = W(t) + \eta' J_p^T \tilde{y} - \underbrace{\mu' \ddot{W}}_{\text{acceleration term}} \tag{18} \]
Adaptive learning rate and adaptive acceleration
Comparing the two update laws, the adaptive learning rate in this case is given by
\[ \eta'_a = \mu \frac{\| \tilde{y} \|^2}{\| J_p^T \tilde{y} \|^2 + \varepsilon} \tag{19} \]
and the adaptive acceleration rate is given by
\[ \mu'_a = \frac{\mu \lambda}{\| J_p^T \tilde{y} \|^2 + \varepsilon} \tag{20} \]
Convergence of LF II
The global minimum of V_2 is given by
\[ \tilde{y} = 0, \quad \dot{W} = 0 \qquad (\tilde{y} \in \mathbb{R}^n, \; \dot{W} \in \mathbb{R}^m) \]
The global minimum can be reached provided \dot{W} does not vanish along the convergence trajectory.

Analyzing local minima conditions: \dot{W} vanishes under the following conditions.

1. First condition: J = D (J, D ∈ \mathbb{R}^{n \times m}). In the case of neural networks, it is very unlikely that each element of J would equal the corresponding element of D; thus this possibility can easily be ruled out for a multi-layer perceptron network.
Convergence of LF II: contd...
2. Second condition: \dot{W} vanishes whenever
\[ (J - D)^T \tilde{y} = 0 \]
Assuming J ≠ D, rank ρ(J - D) = n ensures global convergence.

3. Third condition:
\[ J^T \tilde{y} = D^T \tilde{y} = \lambda \ddot{W} \]
Solutions of the above equation represent local minima. A solution to the above equation exists for every vector in \mathbb{R}^m whenever rank ρ(J) = m.
Convergence of LF II: contd...
For a neural network, n ≤ m ⇒ ρ(J) ≤ n. Hence there are at least m − n vectors in \mathbb{R}^m for which solutions do not exist, and hence local minima do not occur.

Thus, by increasing the number of hidden layers or hidden neurons (i.e., increasing m), the chances of encountering local minima can be reduced.

Increasing the number of output neurons increases both m and n, as well as n/m. Thus, for MIMO systems there are more local minima (for a fixed number of weights) as compared to single-output systems.
Avoiding local minima
[Sketch: error surface V_1 versus W around a local minimum and the global minimum, with successive points C, B, A, D at times t−2, t−1, t, t+1 and the weight increments ∆W(t−1), ∆W(t), ∆W(t+1) marked]
Avoiding local minima: contd...
Rewrite the update law for LF-II as
\[ W(t+1) = W(t) + \Delta W(t+1) = W(t) - \eta' \frac{\partial V_1}{\partial W}(t) - \mu' \ddot{W}(t) \]
Consider point B (at time t − 1). The weight update for the interval (t − 1, t], computed at this instant, is
\[ \Delta W(t) = \Delta W_1(t-1) + \Delta W_2(t-1) \]
\[ \Delta W_1(t-1) = -\eta \frac{\partial V_1}{\partial W}(t-1) > 0 \]
\[ \Delta W_2(t-1) = -\mu \ddot{W}(t-1) = -\mu \left( \Delta W(t-1) - \Delta W(t-2) \right) > 0 \]
It is to be noted that \Delta W(t-1) < \Delta W(t-2), as the velocity is decreasing towards the point of local minimum. \Delta W(t) > 0; hence the speed increases.
Avoiding local minima: contd...
Consider point A (at time t). The weight increments are
\[ \Delta W_1(t) = -\eta \frac{\partial V_1}{\partial W}(t) = 0 \]
\[ \Delta W_2(t) = -\mu \ddot{W}(t) = -\mu \left( \Delta W(t) - \Delta W(t-1) \right) > 0 \]
since \Delta W(t) < \Delta W(t-1) ⇒ \Delta W_2(t) > 0. Hence
\[ \Delta W(t+1) = \Delta W_1(t) + \Delta W_2(t) > 0 \]
This helps in avoiding the local minimum.
Avoiding local minima: contd...

Consider point D (at instant t + 1). The weight contributions are
\[ \Delta W_1(t+1) = -\eta \frac{\partial V_1}{\partial W}(t+1) < 0 \]
\[ \Delta W_2(t+1) = -\mu \ddot{W}(t+1) = -\mu \left( \Delta W(t+1) - \Delta W(t) \right) > 0 \]
The contribution due to the BP term becomes negative, as the slope \partial V_1 / \partial W > 0 on the right-hand side of the local minimum, while \Delta W(t+1) < \Delta W(t) keeps the acceleration term positive. Then
\[ \Delta W(t+2) = \Delta W_1(t+1) + \Delta W_2(t+1) > 0 \quad \text{if} \quad \Delta W_2(t+1) > |\Delta W_1(t+1)| \]
Thus it is possible to avoid local minima by properly choosing µ.
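The point-A argument can be checked numerically. In the toy calculation below (my construction, not from the slides), the BP term vanishes at the local minimum while the acceleration term stays positive, so the total increment pushes the weight past the minimum:

```python
def lf2_increment(slope, dW_curr, dW_prev, eta=0.1, mu=0.5):
    # Total increment = BP term + acceleration term:
    #   dW1 = -eta * dV1/dW
    #   dW2 = -mu * (dW(t) - dW(t-1))   (discrete acceleration)
    dW1 = -eta * slope
    dW2 = -mu * (dW_curr - dW_prev)
    return dW1 + dW2

# At point A the slope is zero and the recent steps have been shrinking
# (dW_curr < dW_prev), so the total increment is strictly positive:
step = lf2_increment(slope=0.0, dW_curr=0.02, dW_prev=0.05)  # 0.015 > 0
```

With µ too small the acceleration term cannot outweigh the BP term at point D, which is why the choice of µ matters.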
Simulation results - LF-I vs LF-II: XOR
[Plot: training epochs over 50 runs for LF I (λ=0.0, µ=0.55) and LF II (λ=0.015, µ=0.65) on XOR; epochs range roughly 50–300]

Figure 3: Performance comparison for XOR

Observation: LF II provides a tangible improvement over LF I, both in terms of convergence time and training epochs.
LF I vs LF II: 3-bit parity
[Plot: training epochs over 50 runs for LF I (λ=0.0, µ=0.47) and LF II (λ=0.03, µ=0.47) on 3-bit parity; epochs range roughly 0–3000]

Figure 4: Performance comparison for 3-bit parity

Observation: LF II performs better than LF I, both in terms of computation time and training epochs.
LF I vs LF II: 8-3 Encoder
[Plot: training epochs over 50 runs for LF I (λ=0.0, µ=0.46) and LF II (λ=0.01, µ=0.465) on the 8-3 encoder; epochs range roughly 0–150]

Figure 5: Comparison for 8-3 encoder

Observation: LF II takes the minimum number of epochs in most of the runs.
LF I vs LF II: 2D Gabor function
[Plot: RMS training error versus iterations (training data points, up to 30000) for LF I (µ=0.8, λ=0.0) and LF II (µ=0.8, λ=0.6) on the 2D Gabor function; the error falls from about 0.5]

Figure 6: Performance comparison for the 2D Gabor function

Observation: With increasing iterations, the performance of LF II improves as compared to LF I.
Simulation Results - Comparison: contd...
XOR

Algorithm   epochs   time (sec)   parameters
BP          5620     0.0578       η = 0.5
BP          3769     0.0354       η = 0.95
EKF         3512     0.1662       λ = 0.9
LF-I        165      0.0062       µ = 0.55
LF-II       120      0.0038       µ = 0.65, λ = 0.01
Comparison among BP, EKF and LF-II
[Plot: convergence time (seconds) over 50 runs for BP, EKF, and LF-II; times range roughly 0–0.4 s]

Observation: LF takes almost the same time for any arbitrary initial condition.
Comparison among BP, EKF and LF: contd...
3-bit Parity

Algorithm   epochs   time (sec)   parameters
BP          12032    0.483        η = 0.5
BP          5941     0.2408       η = 0.95
EKF         2186     0.4718       λ = 0.9
LF-I        1338     0.1176       µ = 0.47
LF-II       738      0.0676       µ = 0.47, λ = 0.03
Comparison among BP, EKF and LF: contd...
8-3 Encoder

Algorithm   epochs   time (sec)   parameters
BP          326      0.044        η = 0.7
BP          255      0.0568       η = 0.9
LF-I        72       0.0582       µ = 0.46
LF-II       42       0.051        µ = 0.465, λ = 0.01
Comparison among BP, EKF and LF: contd...
2D Gabor function

Algorithm   No. of Centers   rms error/run   parameters
BP          40               0.0847241       η1,2 = 0.2
BP          80               0.0314169       η1,2 = 0.2
LF-I        40               0.0192033       µ = 0.8
LF-II       40               0.0186757       µ = 0.8, λ = 0.3
Discussion
Global convergence of Lyapunov based learning Algorithms
Consider the following Lyapunov function candidate:
\[ V_2 = \mu V_1 + \frac{\sigma}{2} \left\| \frac{\partial V_1}{\partial W} \right\|^2, \qquad \text{where } V_1 = \frac{1}{2} \tilde{y}^T \tilde{y} \tag{21} \]
The objective is to select a weight update law \dot{W} such that the global minimum (V_1 = 0 and \partial V_1 / \partial W = 0) is reached.

The time derivative of the Lyapunov function V_2 is given as:
\[ \dot{V}_2 = \frac{\partial V_1}{\partial W} \left[ \mu I + \sigma \frac{\partial^2 V_1}{\partial W \, \partial W^T} \right] \dot{W} \tag{22} \]
If the weight update law W is selected as
If the weight update law \dot{W} is selected as
\[ \dot{W} = -\left[ \mu I + \sigma \frac{\partial^2 V_1}{\partial W \, \partial W^T} \right]^{-1} \frac{\left( \partial V_1 / \partial W \right)^T}{\left\| \partial V_1 / \partial W \right\|^2} \left( \zeta \left\| \frac{\partial V_1}{\partial W} \right\|^2 + \eta \| V_1 \|^2 \right) \tag{23} \]
with ζ > 0 and η > 0, then
\[ \dot{V}_2 = -\zeta \left\| \frac{\partial V_1}{\partial W} \right\|^2 - \eta \| V_1 \|^2 \tag{24} \]
which is negative definite with respect to V_1 and \partial V_1 / \partial W. Thus, V_2 will finally converge to its equilibrium point, given by V_1 = 0 and \partial V_1 / \partial W = 0.
But the implementation of this weight update algorithm becomes very difficult due to the presence of the Hessian term \partial^2 V_1 / \partial W \, \partial W^T.

Thus, the above algorithm is of theoretical interest.

The above weight update algorithm is similar to the BP learning algorithm with a fixed learning rate.
Conclusion
LF algorithms perform better than both the EKF and BP algorithms in terms of speed and accuracy.

LF II avoids local minima to a greater extent as compared to LF I.

It is seen that, by choosing a proper network architecture, it is possible to reach the global minimum.

The LF-I algorithm has an interesting parallel with the conventional BP algorithm, where the fixed learning rate of BP is replaced by an adaptive learning rate.