On the Optimal Learning Rate of a Neuron

Bo Peng∗
April 11, 2017
* This is NOT a research paper, but a draft on what I found in a toy model. The author welcomes all critiques and discussions; you can reach me at "bo at withablink dot com".
We found that, in some cases, the optimal (initial) learning rate of a neuron (the rate that minimises the expected loss after a 1-step update) has a closed-form formula, is independent of the regression target, and depends only on the distribution of the input data.
Namely, when we are using the L2-norm loss function and the batch size → ∞, the optimal initial learning rate of a neuron with n inputs uniformly distributed in [0, d] is:

\[ \eta_{\text{best-for-weights}} = \frac{24}{(9n + 7) \cdot d^{n+2}}, \tag{1} \]

\[ \eta_{\text{best-for-bias}} = \frac{1}{2 \cdot d^{n}}. \tag{2} \]
These results might be too nice to be true. So let me know if you find any mistakes. Thanks.
1 A neuron with 1 input, and without bias
Let us begin by considering the simplest case: a neuron with 1 input, without bias, and without
activation. We will also be using a number of other simplifications.
(Remark: in this section I am actually computing the inverse of the Hessian, as in Newton’s method. However, when the Hessian is not a diagonal matrix (which is often the case for neurons with n > 1 inputs), the result differs from Newton’s method.)
1.1 Problem definition
We will be doing regression. Let our single-sample loss function be the L2-norm (we will later discuss
other loss functions):
\[ L(x) = (w \cdot x - f(x))^2 \tag{3} \]

where w is the neuron’s weight for the input x, and f(x) is the (unknown) regression target function.
Assume the input x follows the uniform distribution in [c, d] (we will later discuss other distributions); then the expected loss is:

\[ L = E(L(x)) = \int_c^d L(x)\,dx = \int_c^d (w \cdot x - f(x))^2\,dx. \tag{4} \]

(Here, and throughout this note, the expectation is written as an unnormalised integral, dropping the constant density factor 1/(d − c); this amounts to a fixed rescaling of the loss and hence of the optimal η.)
In a gradient descent training process, let G be the gradient, η be the learning rate, and after 1 training step we have a new weight:

\[ \tilde{w} = w - \eta \cdot G \tag{5} \]
and we hope to find the optimal η = ηbest to minimise the new expected loss L̃:
\[ \eta_{\text{best}} = \arg\min_{\eta} \tilde{L} = \arg\min_{\eta} \int_c^d \bigl( (w - \eta \cdot G) \cdot x - f(x) \bigr)^2\,dx, \tag{6} \]

which means η_best must satisfy:

\[ \left. \frac{\partial \tilde{L}}{\partial \eta} \right|_{\eta = \eta_{\text{best}}} = 0. \tag{7} \]
Note that here we only consider minimising the expected loss after 1 training step, which is different from, and simpler than, minimising the expected loss after n training steps. We will discuss the latter case later.
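To make the setup concrete, here is a minimal numerical sketch (mine, not part of the original note) that evaluates the one-step objective of eq. (6) for an arbitrary choice of target f(x) = sin(3x), weight w = 0.7, and range [c, d] = [0, 1], and finds η_best by one-dimensional minimisation:

# Minimal numerical sketch of eqs. (4)-(7); f, w, c, d are arbitrary choices.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

c, d, w = 0.0, 1.0, 0.7
f = lambda x: np.sin(3 * x)  # regression target (arbitrary illustration)

# Gradient of the expected loss L = \int_c^d (w x - f(x))^2 dx with respect to w
G, _ = quad(lambda x: 2 * x * (w * x - f(x)), c, d)

# Expected loss after one update with learning rate eta (eq. 6)
def L_new(eta):
    return quad(lambda x: ((w - eta * G) * x - f(x)) ** 2, c, d)[0]

eta_best = minimize_scalar(L_new, bounds=(0.0, 20.0), method="bounded").x
print(eta_best)  # compare with the closed form derived in Section 1.2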
1.2 The continuous limit
For a batch size of N , the weight w updates according to the SGD rule:
\[ \tilde{w} = w - \eta \cdot \frac{1}{N} \sum_{i} \left. \frac{\partial L}{\partial w} \right|_{x = x_i} \tag{8} \]
And in the continuous limit (that is, as the batch size N goes to infinity), the batch average becomes an expectation over the input distribution (again written as an unnormalised integral), and the weight updates according to:
\begin{align*}
\tilde{w} &= w - \eta \cdot \int_c^d \frac{\partial L}{\partial w}\,dx \tag{9} \\
&= w - \eta \cdot \int_c^d 2x \cdot (wx - f(x))\,dx \tag{10} \\
&= w - \eta \cdot \int_c^d 2x \cdot M(x)\,dx \tag{11}
\end{align*}

where we define M(x) = wx − f(x) to simplify some later computations.
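As a quick illustration (my own, not from the note), the following sketch compares the finite-batch average gradient of eq. (8) with the continuous-limit integral of eq. (10) on [c, d] = [0, 1], where the uniform density is 1 and the two conventions coincide; f(x) = sin(3x) and w = 0.7 are arbitrary choices:

# Numerical sketch of the continuous limit, eqs. (8)-(11); concrete values are arbitrary.
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(0)
w = 0.7
f = lambda x: np.sin(3 * x)

# Continuous-limit gradient, eq. (10): \int_0^1 2x (w x - f(x)) dx
G_limit, _ = quad(lambda x: 2 * x * (w * x - f(x)), 0.0, 1.0)

# Finite-batch average gradient, eq. (8), for growing batch sizes N
for N in (10, 1_000, 100_000):
    xs = rng.uniform(0.0, 1.0, size=N)
    G_batch = np.mean(2 * xs * (w * xs - f(xs)))
    print(N, G_batch, G_limit)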
Hence the single-sample loss of the new model will be:

\begin{align*}
\tilde{L}(x) &= (\tilde{w}x - f(x))^2 \tag{12} \\
&= \left( \left( w - \eta \cdot \int_c^d 2x \cdot M(x)\,dx \right) x - f(x) \right)^2 \tag{13} \\
&= \left( -\eta \cdot \left( \int_c^d 2x \cdot M(x)\,dx \right) x + (wx - f(x)) \right)^2 \tag{14} \\
&= \left( -\eta \cdot \left( \int_c^d 2x \cdot M(x)\,dx \right) x + M(x) \right)^2 \tag{15} \\
&= \left( -\eta \cdot P(x) + M(x) \right)^2 \tag{16}
\end{align*}

where we define $P(x) = x \int_c^d 2x \cdot M(x)\,dx$ to simplify some later computations.
Recall that the optimal learning rate η_best must satisfy:

\[ \left. \frac{\partial \tilde{L}}{\partial \eta} \right|_{\eta = \eta_{\text{best}}} = 0 \tag{17} \]

where the expected loss of the new model is $\tilde{L} = E(\tilde{L}(x)) = \int_c^d \tilde{L}(x)\,dx$.
Perhaps surprisingly, η_best has a simple closed-form solution, and is independent of f(x) and w:

\[ \eta_{\text{best}} = \frac{1}{2 \int_c^d x^2\,dx} = \frac{3}{2(d^3 - c^3)} \tag{18} \]
Let us see how this is computed. Note that:

\begin{align*}
\frac{\partial \tilde{L}}{\partial \eta} &= \int_c^d \frac{\partial \tilde{L}(x)}{\partial \eta}\,dx \tag{19} \\
&= 2 \int_c^d P(x) \cdot \bigl( \eta \cdot P(x) - M(x) \bigr)\,dx \tag{20}
\end{align*}
Substituting η = η_best = $\frac{1}{2 \int_c^d x^2\,dx}$, we need to prove:

\[ \int_c^d P(x) \cdot \left( \frac{P(x)}{2 \int_c^d x^2\,dx} - M(x) \right) dx = 0 \tag{21} \]

Note that $P(x) = x \int_c^d 2x \cdot M(x)\,dx$; substituting this and cancelling a common factor of 2, we need to prove:
\[ \int_c^d x \cdot M(x)\,dx \cdot \int_c^d x \cdot \left( \frac{x \int_c^d x \cdot M(x)\,dx}{\int_c^d x^2\,dx} - M(x) \right) dx = 0 \tag{22} \]

for any M(x).
And this can be proved by computing its derivatives with respect to c and d (a long yet straightforward computation), which all turn out to be 0, so it must be a constant. Then we can let M(x) = 1 to see that the constant is actually 0.
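As an independent check (not part of the original derivation), here is a short sympy sketch that carries out the whole one-step computation for a generic cubic target f(x) = a0 + a1·x + a2·x² + a3·x³ with symbolic w, c, d, and confirms that the resulting η_best reduces to the closed form of eq. (18), independent of the coefficients and of w:

# Symbolic verification of eq. (18) (a sketch; f is restricted to a cubic
# purely to keep the computation small).
import sympy as sp

x, c, d, w, eta = sp.symbols('x c d w eta', real=True)
a0, a1, a2, a3 = sp.symbols('a0 a1 a2 a3', real=True)
f = a0 + a1 * x + a2 * x**2 + a3 * x**3  # generic cubic target (assumption)

G = sp.integrate(2 * x * (w * x - f), (x, c, d))             # eq. (10)
L_new = sp.integrate(((w - eta * G) * x - f)**2, (x, c, d))  # eq. (6)
eta_best = sp.solve(sp.diff(L_new, eta), eta)[0]             # eq. (7)

closed_form = 3 / (2 * (d**3 - c**3))                        # eq. (18)
print(sp.simplify(eta_best - closed_form))                   # prints 0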
2 A neuron with n > 1 inputs, and without bias
Here the computations are quite long. If c = −d, that is, if every component x_i of the input vector $\vec{x}$ follows the uniform distribution in [−d, d], then the result is again independent of the regression target $f(\vec{x})$ and the weights {w_i}. The optimal η_best will be:
\[ \eta_{\text{best}} = \frac{3}{2^{n+1}\, d^{n+2}} \tag{23} \]
which means that if d = 0.5 we have a nice result: η_best is even independent of n, and in that case the optimal learning rate exactly equals 6. A learning rate of 6 seems large; note, however, that input data is usually distributed further away from 0 than [−0.5, 0.5] (for example in [0, 1]), and then η_best is much smaller.
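A Monte-Carlo sanity check of eq. (23) (my own sketch, with an arbitrary nonlinear target and arbitrary weights): for d = 0.5 it should print a value close to 6 regardless of n, f, or w. Since the note measures the loss as an unnormalised integral over [−d, d]^n, the sample averages are rescaled by the volume (2d)^n where needed:

# Monte-Carlo check of eq. (23); n, d, w and f are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 0.5
w = np.array([0.3, -1.2, 0.8])                      # arbitrary current weights
f = lambda X: np.sin(X[:, 0]) + X[:, 1] * X[:, 2]   # arbitrary target

X = rng.uniform(-d, d, size=(2_000_000, n))         # huge batch ~ continuous limit
M = X @ w - f(X)                                    # M(x) = w.x - f(x)
vol = (2 * d) ** n                                  # volume of [-d, d]^n

G = vol * np.mean(2 * X * M[:, None], axis=0)       # gradient of the integral loss
# One-step-optimal eta: root of d/d_eta \int ((w - eta G).x - f(x))^2 dx
eta_best = np.mean((X @ G) * M) / np.mean((X @ G) ** 2)

print(eta_best)                              # approx. 6 for d = 0.5
print(3 / (2 ** (n + 1) * d ** (n + 2)))     # closed form, eq. (23): 6.0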
In fact, for other distributions of {x_i}, the optimal learning rate η_best will depend on $f(\vec{x})$ and {w_i}. For example, if n = 2, all components {x_i} follow the uniform distribution in [0, d], and we pick a simple target such as $f(\vec{x}) = x_1$, then plotting η_best and the expected loss $L = \int_0^d \cdots \int_0^d L(\vec{x})\,d\vec{x}$ against {w_i} gives the following:
[Figure: η_best (left) and the expected loss L (right), plotted as functions of the weights (w_1, w_2).]
The right graph is L, the expected loss. Its shape is typical in the sense that there are directions in which convergence is easy (thanks to a larger gradient) and directions in which convergence is hard (due to a smaller gradient).
The left graph is η_best. The optimal learning rate can be very large when {w_i} approaches the minimum of L along a "difficult direction" (note we are effectively using a batch size of ∞, so the gradient is small and stable, which justifies the very large optimal learning rate). Hence it is indeed easy for SGD to get stuck near a minimum, because we do not increase η. This suggests trying to increase η when we get stuck (while also increasing the batch size to stabilise the gradient).
Interestingly, as {w_i} gets very close to the minimum, η_best is more likely to be smaller (in the sense that a smaller percentage of directions requires a high η), which may be why we often find that SGD eventually gives a good answer. All these observations are typical for other, more complicated f.
Finally, as we can see in the η_best graph, when our weights are away from the minimum of L, η_best is surprisingly stable. For example, in the above case, in the limit as w_i → ±∞ we have $\eta_{\text{best}} \to \frac{75}{86 d^4}$, which is independent of $f(\vec{x})$. This means $\frac{75}{86 d^4}$ can be a good initial learning rate in the n = 2 case.
And there is a closed-form formula. The η_best can be computed by requiring the following directly-computable integral to be 0 and solving for η:

\[ \int_0^d \cdots \int_0^d \bigl( 4x_1 + 3 \cdot (x_2 + \cdots + x_n) \bigr) \bigl( -6x_1 + \eta \cdot d^{n+2} \bigl( 4x_1 + 3 \cdot (x_2 + \cdots + x_n) \bigr) \bigr)\,dx_1 \cdots dx_n \tag{24} \]

Arriving at this result requires some long computations, so we omit the steps here; the computation itself is standard.
And the η_best for a bias-free neuron with n inputs uniformly distributed in [0, d] is:

\[ \eta_{\text{best}} = \frac{54n + 42}{(27n^2 + 27n + 10) \cdot d^{n+2}} \approx \frac{2}{n \cdot d^{n+2}} \tag{25} \]
As a sanity check, if d = 1 and n = 10 then it’s around 0.2, and if n = 20 then it’s around 0.1.
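The integral condition (24) is small enough to be evaluated directly with a computer algebra system. The following sympy sketch (mine, not from the note) reproduces eq. (25) for n = 2 and n = 3 with symbolic d:

# Symbolic check that the integral condition (24) yields eq. (25).
import sympy as sp

d, eta = sp.symbols('d eta', positive=True)

def eta_best(n):
    xs = sp.symbols(f'x1:{n + 1}', positive=True)    # x1, ..., xn
    s = sum(xs[1:])                                  # x2 + ... + xn
    integrand = (4 * xs[0] + 3 * s) * (-6 * xs[0] + eta * d ** (n + 2) * (4 * xs[0] + 3 * s))
    for xi in xs:                                    # integrate over [0, d]^n
        integrand = sp.integrate(integrand, (xi, 0, d))
    return sp.solve(sp.Eq(integrand, 0), eta)[0]

for n in (2, 3):
    closed_form = sp.Rational(54 * n + 42, 27 * n ** 2 + 27 * n + 10) / d ** (n + 2)
    print(n, sp.simplify(eta_best(n) - closed_form))  # prints 0 for each n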
3 A neuron with n > 1 inputs, and with bias [TO-WRITE]
When the neuron has a bias, it is found that we also have a constant η_best as w_i → ±∞ or b → ±∞. Hence, for a given network structure, a fixed initial learning rate works well for all datasets (as long as they are carefully normalised to follow the same distribution).
However, the ηbest is a bit different for the weights and the bias:
1. n = 1: η_best for weights (as w_i → ±∞) is $\frac{3}{2d^3}$, and η_best for bias (as b → ±∞) is $\frac{1}{2d}$.
2. n = 2: η_best for weights (as w_i → ±∞) is $\frac{24}{25d^4}$, and η_best for bias (as b → ±∞) is $\frac{1}{2d^2}$.
3. n = 3: η_best for weights (as w_i → ±∞) is $\frac{12}{17d^5}$, and η_best for bias (as b → ±∞) is $\frac{1}{2d^3}$.
4. · · ·
The η_best for bias (as b → ±∞) is $\frac{1}{2 \cdot d^{n}}$, and the η_best for weights (as w_i → ±∞) is $\frac{24}{(9n+7) \cdot d^{n+2}}$.
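For reference, a tiny helper (mine) implementing the two general formulas above, which are the same as eqs. (1)-(2); the assertions check them against the n = 1, 2, 3 cases listed in this section:

# Initial learning rates from the asymptotic formulas above (eqs. 1-2).
from fractions import Fraction

def eta_for_weights(n, d=1):
    return Fraction(24, 9 * n + 7) / d ** (n + 2)

def eta_for_bias(n, d=1):
    return Fraction(1, 2) / d ** n

assert eta_for_weights(1) == Fraction(3, 2)    # 3 / (2 d^3) at d = 1
assert eta_for_weights(2) == Fraction(24, 25)  # 24 / (25 d^4)
assert eta_for_weights(3) == Fraction(12, 17)  # 12 / (17 d^5)
assert eta_for_bias(2) == Fraction(1, 2)       # 1 / (2 d^2) at d = 1
print(eta_for_weights(10), eta_for_bias(10))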
[TO-WRITE]
4 TO-WRITE
To summarize:
1. For a given network structure, a fixed initial learning rate can work well for all datasets (as long as they are carefully normalised to follow the same distribution).
2. One of the reasons why batch normalisation is effective is that it normalises the distribution of neuron inputs, so that a single learning rate works for all neurons.
3. It may be helpful to use a different learning rate for each neuron (and different learning rates for different weights and different biases). And it seems there exists a closed-form expression.
4. It may be even better to fix the learning rate scheme, and apply some more careful normalisation to input data and intermediate outputs (as we have seen for the [−0.5, 0.5] case, where the optimal learning rate is the constant 6).
To-write:
1. Find the expression in the stochastic limit (when batch size goes to 1 instead of ∞). Find the
optimal batch size + learning rate combination. Find the optimal hyper-parameters when we are
trying to maximise the performance after n steps (instead of 1 step).
2. Find the optimal hyper-parameters when we have non-linear activations, in a real neural network (it seems we should have a different η for each neuron), and in a convolutional neural network.
3. Find the expression when the input distribution is Gaussian, etc. Find the optimal normalisation
scheme of input.
4. Find the expression in the case of other optimisation algorithms such as SGD+momentum, ADAM,
etc. Will there be an optimal optimisation algorithm?