Intelligent Control
Module I- Neural Networks
Lecture 7
Adaptive Learning Rate
Laxmidhar Behera
Department of Electrical Engineering
Indian Institute of Technology, Kanpur
Subjects to be covered
Motivation for adaptive learning rate
Lyapunov Stability Theory
Training Algorithm based on Lyapunov Stability Theory
Simulations and discussion
Conclusion
Training of a Feed Forward Network
Figure 1: A feed-forward network with inputs x1, x2, weight vector W, and output y
Here, W ∈ R^M is the weight vector. The training data consists of, say, N patterns {x^p, y^p}, p = 1, 2, ..., N.

Weight update law: W(t + 1) = W(t) − η ∂E/∂W, where η is the learning rate.
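To make the update concrete, here is a minimal sketch of one batch gradient-descent step for a toy single-layer sigmoid network; the network shape, data, and learning rate below are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Minimal sketch of the gradient-descent update W(t+1) = W(t) - eta * dE/dW
# for a single-layer sigmoid network with M weights and N training patterns.
# All sizes and the learning rate below are illustrative assumptions.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, X):
    # X: (N, M) input patterns, W: (M,) weight vector -> (N,) outputs
    return sigmoid(X @ W)

def gd_step(W, X, y, eta=0.5):
    """One batch gradient-descent step on E = 0.5 * sum((y - y_hat)^2)."""
    y_hat = forward(W, X)
    err = y - y_hat                               # (N,) residuals
    # dE/dW = -X^T [err * y_hat * (1 - y_hat)] for the sigmoid output
    grad_E = -X.T @ (err * y_hat * (1.0 - y_hat))
    return W - eta * grad_E

# Example usage with random data (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                       # N = 4 patterns, M = 3 weights
y = np.array([0.0, 1.0, 1.0, 0.0])
W = rng.normal(size=3)
W = gd_step(W, X, y)
```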
Motivation for adaptive learning rate
Figure 2: Convergence to global minimum (plot of f(x) vs. x; curves labelled "Actual" and "Adaptive learning rate", starting point x0 = −6.7)
With an adaptive learning rate, one can employ a higher learning rate when the error is far from the global minimum and a smaller one when it is close to the global minimum.
Adaptive Learning Rate
The objective is to achieve global convergence for a
non-quadratic, non-convex nonlinear function without
increasing the computational complexity.
In gradient descent (GD), the learning rate is fixed. If one could use a larger learning rate for a point far away from the global minimum and a smaller learning rate for a point closer to it, it would be possible to avoid local minima and ensure global convergence. This motivates an adaptive learning rate.
Lyapunov Stability Theory
Used extensively in control system problems.
If we choose a Lyapunov function candidate V (x(t), t)
such that
V (x(t), t) is positive definite
V̇ (x(t), t) is negative definite
then the system is asymptotically stable.
Local Invariant Set Theorem (La Salle)
Consider an autonomous system of the form ẋ = f (x)
with f continuous, and let V (x) be a scalar function
with continuous partial derivatives. Assume that
* for some l > 0, the region Ωl defined by V (x) < l
is bounded.
Lyapunov stability theory: contd...
* V̇ (x) ≤ 0 for all x in Ωl .
Let R be the set of all points within Ωl where V̇ (x) = 0, and
M be the largest invariant set in R. Then, every solution
x(t) originating in Ωl tends to M as t → ∞.
The problem lies in choosing a proper Lyapunov function candidate.
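For intuition, here is a small worked example (illustrative, not from the slides) of this kind of Lyapunov argument:

```latex
% Worked example (illustrative, not from the slides): scalar system
% \dot{x} = -x^{3} analysed with the candidate V(x) = \tfrac{1}{2}x^{2}.
\[
V(x) = \tfrac{1}{2}x^{2} > 0 \ \ (x \neq 0), \qquad
\dot{V}(x) = x\,\dot{x} = -x^{4} \le 0 .
\]
% \dot{V} = 0 only at x = 0, so R = \{0\} and the largest invariant set is
% M = \{0\}; by the invariant set theorem every trajectory starting in a
% bounded region \Omega_l = \{x : V(x) < l\} converges to the origin.
```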
Weight update law using Lyapunov based approach
The network output is given by

ŷ^p = f(W, x^p),   p = 1, 2, . . . , N        (1)

The usual quadratic cost function is given as:

E = (1/2) Σ_{p=1}^{N} (y^p − ŷ^p)²            (2)
Let us choose a Lyapunov function candidate for the system as below:

V₁ = (1/2) ỹᵀỹ                                 (3)

where ỹ = [y^1 − ŷ^1, . . . , y^p − ŷ^p, . . . , y^N − ŷ^N]ᵀ.
LF-I Algorithm
The time derivative of the Lyapunov function V₁ is given by

V̇₁ = −ỹᵀ (∂ŷ/∂W) Ẇ = −ỹᵀ J Ẇ                 (4)

where J = ∂ŷ/∂W ∈ R^(N×M).

Theorem 1. If an arbitrary initial weight W(0) is updated by

W(t₀) = W(0) + ∫₀^{t₀} Ẇ dt                    (5)

where

Ẇ = [‖ỹ‖² / (‖Jᵀỹ‖² + ε)] Jᵀỹ                  (6)

and ε is a small positive constant, then ỹ converges to zero under the condition that Ẇ exists along the convergence trajectory.
Proof of LF-I Algorithm
Proof. Substitution of Eq. (6) into Eq. (4) yields

V̇₁ = −[‖ỹ‖² / (‖Jᵀỹ‖² + ε)] ‖Jᵀỹ‖² ≤ 0        (7)

where V̇₁ < 0 for all ỹ ≠ 0. If V̇₁ is uniformly continuous and bounded, then according to Barbalat's lemma, as t → ∞, V̇₁ → 0 and ỹ → 0.
LF-I Algorithm: contd...
The above weight update law is a batch update law. The instantaneous LF-I learning algorithm can be derived as:

Ẇ = [‖ỹ‖² / ‖J_iᵀỹ‖²] J_iᵀỹ                    (8)

where ỹ = y^p − ŷ^p ∈ R and J_i = ∂ŷ^p/∂W ∈ R^(1×M). The difference-equation representation of the weight update is given by

Ŵ(t + 1) = Ŵ(t) + µ Ẇ(t)                        (9)

Here µ is a constant.
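A minimal sketch of this instantaneous LF-I update is given below, assuming a single sigmoid unit and a small constant added to the denominator for numerical safety (both are assumptions for illustration, not part of the stated algorithm).

```python
import numpy as np

# Sketch of the instantaneous LF-I update, Eqs. (8)-(9), for a single-output
# sigmoid unit. Model, data, and the eps guard are illustrative assumptions.

def lf1_step(W, x_p, y_p, mu=0.55, eps=1e-8):
    """One LF-I step for pattern (x_p, y_p): W <- W + mu * W_dot."""
    z = float(x_p @ W)
    y_hat = 1.0 / (1.0 + np.exp(-z))
    y_tilde = y_p - y_hat                         # scalar error for this pattern
    J_i = y_hat * (1.0 - y_hat) * x_p             # J_i = d y_hat / d W, shape (M,)
    Jt_y = J_i * y_tilde                          # J_i^T * y_tilde
    eta_a = (y_tilde ** 2) / (np.dot(Jt_y, Jt_y) + eps)   # adaptive learning rate
    W_dot = eta_a * Jt_y
    return W + mu * W_dot

# Example: one pass over the XOR patterns. The single-unit architecture is a
# toy assumption (a real XOR net needs a hidden layer); it only illustrates
# how the update rule itself is applied pattern by pattern.
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
Y = np.array([0., 1., 1., 0.])
W = np.zeros(3)
for x_p, y_p in zip(X, Y):
    W = lf1_step(W, x_p, y_p)
```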
Comparison with BP Algorithm
In the gradient-descent method we have

ΔW = −η ∂E/∂W = η J_iᵀỹ
Ŵ(t + 1) = Ŵ(t) + η J_iᵀỹ                      (10)

The update equation for the LF-I algorithm:

Ŵ(t + 1) = Ŵ(t) + µ [‖ỹ‖² / ‖J_iᵀỹ‖²] J_iᵀỹ

Comparing the above two equations, we find that the fixed learning rate η in the BP algorithm is replaced by its adaptive version η_a:

η_a = µ ‖ỹ‖² / ‖J_iᵀỹ‖²                        (11)
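This equivalence can be checked numerically; in the sketch below, the Jacobian and error values are arbitrary assumed numbers used only for illustration.

```python
import numpy as np

# Numerical check of Eq. (11): a BP step with eta replaced by the adaptive
# eta_a equals the LF-I step. J_i and y_tilde are arbitrary assumed values.
mu = 0.55
J_i = np.array([0.3, -0.1, 0.2])       # assumed instantaneous Jacobian (1 x M)
y_tilde = 0.8                          # assumed instantaneous error
Jt_y = J_i * y_tilde                   # J_i^T * y_tilde

eta_a = mu * y_tilde**2 / np.dot(Jt_y, Jt_y)              # Eq. (11)
bp_step = eta_a * Jt_y                                    # BP form, Eq. (10) with eta_a
lf1_step = mu * y_tilde**2 / np.dot(Jt_y, Jt_y) * Jt_y    # LF-I form, Eqs. (8)-(9)
assert np.allclose(bp_step, lf1_step)
```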
Adaptive Learning rate of LF-I
Figure: Adaptive learning rate of LF-I for the XOR problem — learning rate vs. number of iterations (4 × number of epochs).
The learning rate is not fixed, unlike in the BP algorithm.
The learning rate goes to zero as the error goes to zero.
Convergence of LF-I
The theorem states that the global convergence of LF-I is guaranteed provided Ẇ exists along the convergence trajectory. This, in turn, necessitates ‖∂V₁/∂W‖ = ‖Jᵀỹ‖ ≠ 0.
‖∂V₁/∂W‖ = 0 indicates a local minimum of the error function. Thus, the theorem says that the global minimum is reached only when local minima are avoided during training.
Since the instantaneous update rule introduces noise, it may be possible to reach the global minimum in some cases; however, global convergence is not guaranteed.
LF-II Algorithm
We consider the following Lyapunov function:

V₂ = (1/2)(ỹᵀỹ + λẆᵀẆ) = V₁ + (λ/2) ẆᵀẆ       (12)

where λ is a positive constant. The time derivative of the above equation is given by

V̇₂ = −ỹᵀ (∂ŷ/∂W) Ẇ + λẄᵀẆ = −ỹᵀ (J − D) Ẇ    (13)

where J = ∂ŷ/∂W ∈ R^(N×m) is the Jacobian matrix, and D = (λ/‖ỹ‖²) ỹ Ẅᵀ ∈ R^(N×m).
LF-II Algorithm: contd...
Theorem 2. If the update law for the weight vector W follows the dynamics given by the following nonlinear differential equation

Ẇ = α(W) Jᵀỹ − λ α(W) Ẅ                        (14)

where α(W) = ‖ỹ‖² / (‖Jᵀỹ‖² + ε) is a scalar function of the weight vector W and ε is a small positive constant, then ỹ converges to zero under the condition that (J − D)ᵀỹ is non-zero along the convergence trajectory.
Proof of LF-II algorithm
Proof. Ẇ = α(W) Jᵀỹ − λ α(W) Ẅ may be rewritten as

Ẇ = [‖ỹ‖² / (‖Jᵀỹ‖² + ε)] (J − D)ᵀỹ            (15)

Substituting for Ẇ from the above equation into V̇₂ = −ỹᵀ(J − D)Ẇ, we get

V̇₂ = −[‖ỹ‖² / (‖Jᵀỹ‖² + ε)] ‖(J − D)ᵀỹ‖² ≤ 0   (16)

Since (J − D)ᵀỹ is non-zero, V̇₂ < 0 for all ỹ ≠ 0 and V̇₂ = 0 iff ỹ = 0. If V̇₂ is uniformly continuous and bounded, then according to Barbalat's lemma, as t → ∞, V̇₂ → 0 and ỹ → 0.
Proof of LF-II algorithm: contd...
The instantaneous weight update equation using the LF-II algorithm can finally be expressed in difference-equation form as follows:

W(t + 1) = W(t) + µ [‖ỹ‖² / (‖J_pᵀỹ‖² + ε)] (J_p − D)ᵀỹ
         = W(t) + µ [‖ỹ‖² / (‖J_pᵀỹ‖² + ε)] J_pᵀỹ − µ₁ [Ẅ(t) / (‖J_pᵀỹ‖² + ε)]    (17)

where µ₁ = µλ and the acceleration Ẅ(t) is computed as

Ẅ(t) = [W(t) − 2W(t − 1) + W(t − 2)] / (Δt)²

with Δt taken to be one time unit in the simulations.
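A sketch of this LF-II difference-equation update is given below; the single sigmoid unit, the training pattern, and the value of ε are assumptions made only so the snippet runs (the default µ and λ follow the XOR settings reported later in the comparison table).

```python
import numpy as np

# Sketch of the instantaneous LF-II update, Eq. (17), with the acceleration
# term computed from the last three weight vectors (Delta t = 1).
# The sigmoid model, pattern, and epsilon value are illustrative assumptions.

def lf2_step(W, W_prev, W_prev2, x_p, y_p, mu=0.65, lam=0.01, eps=1e-8):
    z = float(x_p @ W)
    y_hat = 1.0 / (1.0 + np.exp(-z))
    y_tilde = y_p - y_hat
    J_p = y_hat * (1.0 - y_hat) * x_p             # J_p = d y_hat / d W
    Jt_y = J_p * y_tilde                          # J_p^T * y_tilde
    denom = np.dot(Jt_y, Jt_y) + eps
    W_ddot = W - 2.0 * W_prev + W_prev2           # acceleration, (Delta t)^2 = 1
    mu1 = mu * lam
    return W + mu * (y_tilde ** 2) / denom * Jt_y - mu1 * W_ddot / denom

# Usage: keep a short history of the weight vector across iterations.
W2 = np.zeros(3); W1 = np.zeros(3); W0 = np.zeros(3)
x_p = np.array([1.0, 0.0, 1.0]); y_p = 1.0        # assumed training pattern
W_new = lf2_step(W0, W1, W2, x_p, y_p)
```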
Comparison with BP Algorithm
Applying gradient descent to V₂ = V₁ + (λ/2) ẆᵀẆ,

ΔW = −η ∂V₂/∂W = −η ∂V₁/∂W − η (d/dW)[(λ/2) ẆᵀẆ]
   = η (∂ŷ/∂W)ᵀ ỹ − ηλẄ

Thus, the weight update equation for the gradient-descent method may be written as

W(t + 1) = W(t) + η′ J_pᵀỹ − µ′ Ẅ               (18)

where −µ′Ẅ is the acceleration term.
Adaptive learning rate and adaptive acceleration
Comparing the two update laws, the adaptive learning rate in this case is given by

η_a′ = µ ‖ỹ‖² / (‖J_pᵀỹ‖² + ε)                  (19)

and the adaptive acceleration rate is given by

µ_a′ = µλ / (‖J_pᵀỹ‖² + ε)                       (20)
Convergence of LF-II
The global minimum of V₂ is given by

ỹ = 0,  Ẇ = 0     (ỹ ∈ R^n, W ∈ R^m)

The global minimum can be reached provided Ẇ does not vanish along the convergence trajectory.
Analyzing local-minima conditions: Ẇ vanishes under the following conditions.
1. First condition: J = D   (J, D ∈ R^(n×m))
In the case of neural networks, it is very unlikely that each element of J would equal the corresponding element of D; thus this possibility can easily be ruled out for a multi-layer perceptron network.
Convergence of LF-II: contd...
2. Second condition: Ẇ vanishes whenever (J − D)ᵀỹ = 0.
Assuming J ≠ D, rank ρ(J − D) = n ensures global convergence.
3. Third condition: Jᵀỹ = Dᵀỹ = λẄ.
Solutions of the above equation represent local minima.
A solution to the above equation exists for every vector Ẅ ∈ R^m whenever rank ρ(J) = m.
Convergence of LF-II: contd...
For a neural network, n ≤ m ⇒ ρ(J) ≤ n. Hence there are at least m − n vectors Ẅ ∈ R^m for which solutions do not exist, and hence local minima do not occur.
Thus, by increasing the number of hidden layers or hidden neurons (i.e., increasing m), the chances of encountering local minima can be reduced.
Increasing the number of output neurons increases both m and n, as well as n/m.
Thus, for MIMO systems, there are more local minima (for a fixed number of weights) as compared to single-output systems.
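As a quick, hedged illustration of the rank argument (the specific architecture is assumed here, not taken from the slides), consider a 2-2-1 MLP with biases trained on XOR:

```latex
% Hypothetical 2-2-1 MLP with biases, single output (n = 1):
%   m = (2\times 2 + 2) + (2\times 1 + 1) = 9 weights.
\[
\rho(J) \le \min(n, m) = 1 \ll m = 9 ,
\]
% so J^{T}\tilde{y} = \lambda\ddot{W} can hold only for \ddot{W} in an
% (at most) one-dimensional subspace of \mathbb{R}^{9}; for most
% \ddot{W} the third local-minimum condition has no solution,
% consistent with the rank argument above.
```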
Avoiding local minima
Figure: Sketch of the error surface V₁ versus W, showing a local minimum and the global minimum. Points C, B, A, D correspond to time instants t − 2, t − 1, t, t + 1, with weight increments ΔW(t − 1), ΔW(t), ΔW(t + 1).
Avoiding local minima: contd...
Rewrite the update law for LF-II as

W(t + 1) = W(t) + ΔW(t + 1) = W(t) − η′ (∂V₁/∂W)(t) − µ′ Ẅ(t)

Consider point B (at time t − 1):
The weight update for the interval (t − 1, t], computed at this instant, is
ΔW(t) = ΔW₁(t − 1) + ΔW₂(t − 1).
ΔW₁(t − 1) = −η (∂V₁/∂W)(t − 1) > 0
ΔW₂(t − 1) = −µẄ(t − 1) = −µ(ΔW(t − 1) − ΔW(t − 2)) > 0
It is to be noted that ΔW(t − 1) < ΔW(t − 2), as the velocity is decreasing towards the point of local minimum.
ΔW(t) > 0, hence the speed increases.
Avoiding local minima: contd...
Consider point 'A' (at time t):
Weight increments:
ΔW₁(t) = −η (∂V₁/∂W)(t) = 0
ΔW₂(t) = −µẄ(t) = −µ(ΔW(t) − ΔW(t − 1)) > 0
ΔW(t) < ΔW(t − 1) ⇒ ΔW₂(t) > 0
ΔW(t + 1) = ΔW₁(t) + ΔW₂(t) > 0
This helps in avoiding the local minimum.
Avoiding local minima: contd...
Consider point 'D' (at instant t + 1):
Weight contributions:
ΔW₁(t + 1) = −η (∂V₁/∂W)(t + 1) < 0
ΔW₂(t + 1) = −µẄ(t + 1) = −µ(ΔW(t + 1) − ΔW(t)) > 0
The contribution due to the BP term becomes negative, as the slope ∂V₁/∂W > 0 on the right-hand side of the local minimum.
ΔW(t + 1) < ΔW(t)
ΔW(t + 2) = ΔW₁(t + 1) + ΔW₂(t + 1) > 0 if ΔW₂(t + 1) > |ΔW₁(t + 1)|
Thus it is possible to avoid local minima by properly choosing µ.
Simulation results - LF-I vs LF-II: XOR
Figure 3: Performance comparison for XOR — training epochs vs. runs (50 runs) for LF-I (λ = 0.0, µ = 0.55) and LF-II (λ = 0.015, µ = 0.65).
Observation: LF-II provides a tangible improvement over LF-I, both in terms of convergence time and training epochs.
LF-I vs LF-II: 3-bit parity
Figure 4: Performance comparison for 3-bit parity — training epochs vs. runs (50 runs) for LF-I (λ = 0.0, µ = 0.47) and LF-II (λ = 0.03, µ = 0.47).
Observation: LF-II performs better than LF-I, both in terms of computation time and training epochs.
LF-I vs LF-II: 8-3 Encoder
Figure 5: Comparison for the 8-3 encoder — training epochs vs. runs (50 runs) for LF-I (λ = 0.0, µ = 0.46) and LF-II (λ = 0.01, µ = 0.465).
Observation: LF-II takes the fewest epochs in most of the runs.
LF-I vs LF-II: 2D Gabor function
Figure 6: Performance comparison for the 2D Gabor function — RMS training error vs. iterations (training data points, up to 30000) for LF-I (µ = 0.8, λ = 0.0) and LF-II (µ = 0.8, λ = 0.6).
Observation: With increasing iterations, the performance of LF-II improves as compared to LF-I.
Simulation Results - Comparison: contd...
XOR
Algorithm   Epochs   Time (sec)   Parameters
BP          5620     0.0578       η = 0.5
BP          3769     0.0354       η = 0.95
EKF         3512     0.1662       λ = 0.9
LF-I        165      0.0062       µ = 0.55
LF-II       120      0.0038       µ = 0.65, λ = 0.01
Comparison among BP, EKF and LF-II
Figure: Convergence time (seconds) vs. run (50 runs) for BP, EKF, and LF-II.
Observation: LF takes almost the same time for any arbitrary initial condition.
Comparison among BP, EKF and LF: contd...
3-bit Parity
Algorithm   Epochs   Time (sec)   Parameters
BP          12032    0.483        η = 0.5
BP          5941     0.2408       η = 0.95
EKF         2186     0.4718       λ = 0.9
LF-I        1338     0.1176       µ = 0.47
LF-II       738      0.0676       µ = 0.47, λ = 0.03
Comparison among BP, EKF and LF: contd...
8-3 Encoder
Algorithm   Epochs   Time (sec)   Parameters
BP          326      0.044        η = 0.7
BP          255      0.0568       η = 0.9
LF-I        72       0.0582       µ = 0.46
LF-II       42       0.051        µ = 0.465, λ = 0.01
Comparison among BP, EKF and LF: contd...
2D Gabor function
Algorithm   No. of centers   RMS error/run   Parameters
BP          40               0.0847241       η₁,₂ = 0.2
BP          80               0.0314169       η₁,₂ = 0.2
LF-I        40               0.0192033       µ = 0.8
LF-II       40               0.0186757       µ = 0.8, λ = 0.3
Discussion
Global convergence of Lyapunov-based learning algorithms
Consider the following Lyapunov function candidate:

V₂ = µV₁ + (σ/2) ‖∂V₁/∂W‖²,   where V₁ = (1/2) ỹᵀỹ        (21)

The objective is to select a weight update law Ẇ such that the global minimum (V₁ = 0 and ∂V₁/∂W = 0) is reached.
The time derivative of the Lyapunov function, V̇₂, is given as:

V̇₂ = (∂V₁/∂W)ᵀ [µI + σ ∂²V₁/(∂W∂Wᵀ)] Ẇ                    (22)
If the weight update law Ẇ is selected as

Ẇ = −[µI + σ ∂²V₁/(∂W∂Wᵀ)]⁻¹ [(∂V₁/∂W) / ‖∂V₁/∂W‖²] (ζ‖∂V₁/∂W‖² + η‖V₁‖²)    (23)

with ζ > 0 and η > 0, then

V̇₂ = −ζ‖∂V₁/∂W‖² − η‖V₁‖²                                   (24)

which is negative definite with respect to V₁ and ∂V₁/∂W. Thus, V₂ will finally converge to its equilibrium point, given by V₁ = 0 and ∂V₁/∂W = 0.
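The sketch below illustrates what implementing Eq. (23) would involve — a gradient, a full Hessian (approximated here by finite differences), and a linear solve at every step; the toy cost V₁, the step sizes, and all constants are assumptions for illustration only.

```python
import numpy as np

# Sketch of the update law in Eq. (23) using finite-difference derivatives.
# V1 here is a toy cost; mu, sigma, zeta, eta, dt are assumed values.

def grad_and_hessian(V1, W, h=1e-5):
    """Central finite-difference gradient and Hessian of a scalar V1(W)."""
    m = W.size
    g = np.zeros(m)
    H = np.zeros((m, m))
    for i in range(m):
        e_i = np.zeros(m); e_i[i] = h
        g[i] = (V1(W + e_i) - V1(W - e_i)) / (2 * h)
        for j in range(m):
            e_j = np.zeros(m); e_j[j] = h
            H[i, j] = (V1(W + e_i + e_j) - V1(W + e_i - e_j)
                       - V1(W - e_i + e_j) + V1(W - e_i - e_j)) / (4 * h * h)
    return g, H

def update_step(V1, W, mu=1.0, sigma=0.1, zeta=0.5, eta=0.5, dt=0.1):
    g, H = grad_and_hessian(V1, W)
    A = mu * np.eye(W.size) + sigma * H           # [mu*I + sigma * Hessian]
    scale = (zeta * (g @ g) + eta * V1(W) ** 2) / (g @ g + 1e-12)  # small guard (assumed)
    W_dot = -np.linalg.solve(A, g) * scale        # Eq. (23)
    return W + dt * W_dot                         # simple Euler discretisation

# Example with a toy cost (assumed): V1(W) = 0.5 * ||W - [1, -2]||^2
V1 = lambda W: 0.5 * float(np.sum((W - np.array([1.0, -2.0])) ** 2))
W = np.array([5.0, 5.0])
for _ in range(50):
    W = update_step(V1, W)
```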
But the implementation of the weight update algorithm becomes very difficult due to the presence of the Hessian term ∂²V₁/(∂W∂Wᵀ).
Thus, the above algorithm is of theoretical interest.
The above weight update algorithm is similar to the BP learning algorithm with a fixed learning rate.
Conclusion
LF Algorithms perform better than both EKF and BP
algorithms in terms of speed and accuracy.
LF-II avoids local minima to a greater extent as compared to LF-I.
It is seen that, by choosing a proper network architecture, it is possible to reach the global minimum.
The LF-I algorithm has an interesting parallel with the conventional BP algorithm, where the fixed learning rate of BP is replaced by an adaptive learning rate.