Multi Layer Perceptron trained by Quasi Newton learning rule

DAME – Data Mining & Exploration - Quasi Newton Learning Method for Multi Layer Perceptron
Feed-forward neural networks provide a general framework for representing nonlinear functional mappings
between a set of input variables and a set of output variables (Bishop 2006). One can achieve this goal by
representing the nonlinear function of many variables by a composition of non-linear activation functions of
one variable:
$$y_k = \sum_{j=1}^{M} w_{kj}^{(2)} \, g\!\left( \sum_{i=1}^{d} w_{ji}^{(1)} x_i \right), \qquad k = 1, \dots, K \qquad (1)$$
A Multi-Layer Perceptron may be represented by a graph: the input layer (xi) is made of a number of
perceptrons equal to the number of input variables (d); the output layer, on the other hand, will have as many
neurons as the output variables (K). The network may have an arbitrary number of hidden layers (in most
cases one) which in turn may have an arbitrary number of perceptrons (M). In a fully connected feed-forward
network each node of a layer is connected to all the nodes in the adjacent layers. Each connection is
represented by an adaptive weight which represents the strength of the synaptic connection between neurons
(wkj(l)). The response of each perceptron to the inputs is represented by a non-linear function g, referred to as
the activation function. Notice that the above equation assumes a linear activation function for neurons in the
output layer. We shall refer to the topology of an MLP, together with the weight matrix of its connections, as the model. In order to find the model that best fits the data, one has to provide the network with a set of examples: the training phase thus requires the KB, i.e. the training set.
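To make the mapping of equation (1) concrete, the following minimal Python sketch (illustrative only, not the DAME implementation; all names are hypothetical) computes the forward pass of a one-hidden-layer MLP with tanh hidden activations and a linear output layer:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2, g=np.tanh):
    """Forward pass of a one-hidden-layer MLP, as in equation (1):
    hidden units apply the nonlinear activation g, output units are linear."""
    z = g(W1 @ x + b1)      # hidden layer activations, shape (M,)
    return W2 @ z + b2      # linear output layer, shape (K,)

# Example: d=3 inputs, M=5 hidden units, K=2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.uniform(-1, 1, (5, 3)), np.zeros(5)
W2, b2 = rng.uniform(-1, 1, (2, 5)), np.zeros(2)
y = mlp_forward(rng.standard_normal(3), W1, b1, W2, b2)
print(y.shape)  # (2,)
```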
The learning rule of our MLP is the Quasi Newton Algorithm (QNA).
In general, Quasi Newton Algorithms (QNA) are variable metric methods used to find local maxima and minima of functions (Davidon 1968); in the case of MLPs they can be used to find the stationary (i.e. zero gradient) point of the learning function. The Newton method is the general basis for a whole family of so-called Quasi Newton methods. One of these methods, implemented here, is the L-BFGS algorithm (Byrd et al. 1994; Broyden 1970; Fletcher 1970; Goldfarb 1970; Shanno 1970). More rigorously, the QNA is an optimization of the learning rule, because, as described below, the implementation is based on a statistical approximation of the Hessian obtained through cyclic gradient calculations, the same calculations that, as said in the previous section, are at the base of the Back Propagation (BP; Bishop 2006) method.
As is well known, the classical Newton method uses the Hessian of a function. The step of the method is defined as the product of the inverse Hessian matrix and the function gradient. If the function is a positive definite quadratic form, the minimum is reached in one step. In the case of an indefinite quadratic form (which has no minimum), the method reaches a maximum or a saddle point. In short, the method finds the stationary point of a quadratic form. In practice, the functions to be minimized are usually not quadratic forms, but if such a function is smooth it is described sufficiently well by a quadratic form in the neighborhood of the minimum. However, the Newton method can converge to a minimum as well as to a maximum (by taking a step in the direction of a function increase). Quasi Newton methods solve this problem as follows: they use a positive definite approximation in place of the Hessian. If the Hessian is positive definite, the step is made using the Newton method. If the Hessian is indefinite, it is first modified to make it positive definite, and the step is then performed using the Newton method. The step is always performed in the direction of the function decrement. With a positive definite Hessian, the quadratic surface approximation is exploited and this should improve the convergence; with an indefinite Hessian, the method simply moves to where the function decreases.
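The positive definite modification described above can be sketched as follows (an illustrative Python fragment, assuming the explicit Hessian is available; the shift strategy shown is one common choice, not necessarily the one adopted by the DAME implementation):

```python
import numpy as np

def modified_newton_step(grad, hess, tau0=1e-3):
    """Newton step with a positive definite modification of the Hessian:
    add tau*I until the matrix admits a Cholesky factorization."""
    n = hess.shape[0]
    tau = 0.0
    while True:
        try:
            L = np.linalg.cholesky(hess + tau * np.eye(n))
            break                      # success: matrix is positive definite
        except np.linalg.LinAlgError:
            tau = max(2 * tau, tau0)   # increase the shift and retry
    # solve (H + tau*I) p = -grad via the Cholesky factors
    p = np.linalg.solve(L.T, np.linalg.solve(L, -grad))
    return p  # always a descent direction: p @ grad < 0
```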
Some variants of Quasi Newton methods perform a precise linear search for the minimum along the indicated line, but it has been proved that it is enough to sufficiently decrease the function value, without finding the precise minimum. The L-BFGS algorithm tries to perform a step using the Newton method; if this does not lead to a decrease of the function value, it reduces the step length until a smaller function value is found. Up to here it seems quite simple… but it is not!
The Hessian of a function is not always available, and in many cases it is far too complicated to compute; often we can only calculate the function gradient. Therefore, the following scheme is used: an approximation of the Hessian is generated on the basis of N consecutive gradient calculations, and the Quasi Newton step is then performed.
There are special formulas which allow one to iteratively obtain a Hessian approximation; at each approximation step the matrix remains positive definite. The L-BFGS algorithm does not generate the Hessian itself but directly its inverse matrix, so no time is wasted inverting the Hessian.
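The standard way of applying the stored gradient information without ever forming a matrix is the so-called two-loop recursion (Nocedal 1980). A minimal sketch, assuming the last m pairs s_k = w^(k+1) − w^(k) and y_k = g^(k+1) − g^(k) are kept in lists:

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Two-loop recursion: returns d = -(approximate inverse Hessian) @ g
    using only the last m pairs (s_k, y_k); no matrix is ever stored."""
    q = g.copy()
    alphas = []
    for s, y in reversed(list(zip(s_list, y_list))):
        a = (s @ q) / (y @ s)
        q -= a * y
        alphas.append(a)
    if s_list:
        # initial scaling of the inverse Hessian (a common heuristic)
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        b = (y @ q) / (y @ s)
        q += (a - b) * s
    return -q  # search direction
```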
In order to better understand the process at the base of the Quasi Newton method, let us start from the classical Gradient Descent Algorithm (Back Propagation; Bishop 2006). In the standard GDA, the direction of each updating step is calculated through the error descent gradient, while the step length is determined by the learning rate. A more sophisticated approach is to move along the negative direction of the gradient (the line search direction) not by a fixed length, but up to the minimum of the function along that direction. This is possible by calculating the descent gradient and analyzing its behavior as the learning rate varies (Brescia 2012).
Let us suppose that at step t the current weight vector is $w^{(t)}$, and consider the search direction $d^{(t)} = -\nabla E^{(t)}$. We select the parameter $\lambda$ so as to minimize $E(\lambda) = E\!\left(w^{(t)} + \lambda d^{(t)}\right)$. The new weight vector can then be expressed as:

$$w^{(t+1)} = w^{(t)} + \lambda^{(t)} d^{(t)} \qquad (2)$$
The line search is in practice a one-dimensional minimization problem. A simple solution is to vary $\lambda$ in small intervals, evaluating the error function at each new position, and to stop when the error starts to increase. There exist many other methods to solve this problem; for example, the parabolic search of a minimum calculates the parabolic curve crossing pre-defined learning rate points. The minimum of the parabolic curve is a good approximation of the minimum of $E(\lambda)$, and it can be refined by considering the parabolic curve crossing the points with the lowest error values.
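As an illustration of the parabolic search, a single quadratic interpolation step through three learning rate points a, b, c can be coded as follows (a sketch, assuming b lies between a and c with the lowest error):

```python
def parabolic_step(a, b, c, E):
    """One step of parabolic interpolation: fit a parabola through
    (a, E(a)), (b, E(b)), (c, E(c)) and return its minimum position."""
    fa, fb, fc = E(a), E(b), E(c)
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    return b - 0.5 * num / den  # vertex of the interpolating parabola

# Example: the minimum of E(l) = (l - 0.7)**2 is found in one step
print(parabolic_step(0.0, 0.5, 1.0, lambda l: (l - 0.7) ** 2))  # 0.7
```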
There are also trust region based strategies to find a minimum of an error function, whose main concept is to iteratively grow or contract a region over which a quadratic model function approximates the error function well. In this sense the technique is considered dual to line search, because it fixes the moving step (the region size) first and then finds the best direction, the opposite of the line search strategy, which always chooses the step direction before selecting the step size (Celis et al. 1985).
Up to now we have supposed that the optimal search direction for a line search based method is given at each step by the negative gradient. That is not always true! If the minimization is done along the negative gradient, the next search direction (the new gradient) will be orthogonal to the previous one. In fact, note that when the line search finds the minimum we have:
$$\frac{\partial}{\partial \lambda}\, E\!\left(w^{(t)} + \lambda d^{(t)}\right) = 0 \qquad (3)$$

and hence,

$$g^{(t+1)T} d^{(t)} = 0 \qquad (4)$$

where $g^{(t+1)} \equiv \nabla E^{(t+1)}$.
If successive directions were always chosen equal to the negative gradient, oscillations of the error function would arise that slow down the convergence process. The solution is to choose the successive directions such that the gradient component parallel to the previous search direction (which is zero) remains unchanged at each step. Let us suppose we have already minimized along the direction $d^{(t)}$, starting from the point $w^{(t)}$ and reaching the point $w^{(t+1)}$.
At the point $w^{(t+1)}$ equation (4) gives $g^{(t+1)T} d^{(t)} = 0$; by choosing each new direction so as to keep the gradient component parallel to $d^{(t)}$ equal to zero, it is possible to build a sequence of directions d such that each direction is conjugate to the previous one over the dimension |w| of the search space (conjugate gradients method; Golub et al. 1999).
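A small numerical check makes the notion concrete: two directions d_i and d_j are conjugate with respect to H when d_i^T H d_j = 0. Here H is an arbitrary positive definite matrix chosen purely for illustration:

```python
import numpy as np

# An arbitrary positive definite matrix standing in for the Hessian
H = np.array([[3.0, 1.0], [1.0, 2.0]])

d0 = np.array([1.0, 0.0])
# Build d1 conjugate to d0: subtract from a trial vector its H-projection on d0
trial = np.array([0.0, 1.0])
d1 = trial - (trial @ H @ d0) / (d0 @ H @ d0) * d0

print(d0 @ H @ d1)  # ~0: d0 and d1 are mutually conjugate w.r.t. H
```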
In the presence of a quadratic error function, an algorithm based on this technique updates the weights as:

$$w^{(t+1)} = w^{(t)} + \alpha^{(t)} d^{(t)} \qquad (5)$$

with

$$\alpha^{(t)} = -\frac{d^{(t)T} g^{(t)}}{d^{(t)T} H\, d^{(t)}} \qquad (6)$$
Furthermore, d is initialized to the negative gradient and is thereafter obtained as a linear combination of the current gradient and the previous search direction:

$$d^{(t+1)} = -g^{(t+1)} + \beta^{(t)} d^{(t)} \qquad (7)$$

with

$$\beta^{(t)} = \frac{g^{(t+1)T} H\, d^{(t)}}{d^{(t)T} H\, d^{(t)}} \qquad (8)$$
This algorithm finds the minimum of a quadratic error function in at most |w| steps. On the other hand, the computational cost of each step is high, because in order to determine the values of α and β we would have to refer to the Hessian matrix H, which is very expensive to compute. Fortunately, the coefficients α and β can be obtained from analytical expressions that do not use the Hessian matrix explicitly. For example, the term β can be calculated in one of the following ways:

1) expression of Polak-Ribiere: $\beta^{(t)} = \dfrac{g^{(t+1)T}\left(g^{(t+1)} - g^{(t)}\right)}{g^{(t)T} g^{(t)}}$

2) expression of Hestenes-Stiefel: $\beta^{(t)} = \dfrac{g^{(t+1)T}\left(g^{(t+1)} - g^{(t)}\right)}{d^{(t)T}\left(g^{(t+1)} - g^{(t)}\right)}$

3) expression of Fletcher-Reeves: $\beta^{(t)} = \dfrac{g^{(t+1)T} g^{(t+1)}}{g^{(t)T} g^{(t)}}$
These expressions are equivalent if the error function is quadratic; otherwise they assume different values. Typically the Polak-Ribiere expression obtains better results because, when the algorithm is slow and successive gradients are quite similar to each other, it produces values of β such that the search direction tends to the negative gradient direction (Vetterling et al. 1992), which corresponds to a restart of the procedure.
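A minimal sketch of nonlinear conjugate gradients with the Polak-Ribiere β follows (a quadratic test function is used so that the line search is available in closed form; this is illustrative code, not the DAME implementation):

```python
import numpy as np

# Quadratic test problem E(w) = 0.5 w^T A w - b^T w (illustrative only)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda w: A @ w - b

w = np.zeros(2)
g = grad(w)
d = -g
for t in range(20):
    # exact line search along d (closed form for a quadratic, cf. eq. (6))
    alpha = -(d @ g) / (d @ A @ d)
    w = w + alpha * d
    g_new = grad(w)
    if np.linalg.norm(g_new) < 1e-10:
        break
    # Polak-Ribiere beta; max(0, .) is the common "PR+" variant, which
    # restarts along -g when beta would become negative
    beta = max(0.0, g_new @ (g_new - g) / (g @ g))
    d = -g_new + beta * d
    g = g_new

print(w, np.linalg.solve(A, b))  # both ~ [0.2, 0.4], the exact minimizer
```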
Concerning the parameter α, its value can be obtained by using the line search method directly.
The method of conjugate gradients reduces the number of steps needed to minimize the error to at most |w|, because there can be at most |w| conjugate directions in a |w|-dimensional space. In practice, however, the algorithm is slower because, during the learning process, the conjugacy of the search directions tends to deteriorate. To avoid the deterioration, it is useful to restart the algorithm after |w| steps, by resetting the search direction to the negative gradient direction.
By using a local quadratic approximation of the error function, we can obtain an expression for the position of the minimum. The gradient in every point w is in fact given by:

$$\nabla E = H \left(w - w^{*}\right) \qquad (9)$$

where $w^{*}$ corresponds to the minimum of the error function and therefore satisfies:

$$w^{*} = w - H^{-1} \nabla E \qquad (10)$$
The vector $-H^{-1} \nabla E$ is known as the Newton direction, and it is the basis for a variety of optimization strategies, such as the QNA, which, instead of calculating the matrix H and then its inverse, uses a series of intermediate steps of lower computational cost to generate a sequence of matrices which are increasingly accurate approximations of $H^{-1}$.
From the Newton formula (10) we note that the weight vectors at steps t and t+1 are related to the corresponding gradients by:

$$w^{(t+1)} - w^{(t)} = H^{-1}\left(g^{(t+1)} - g^{(t)}\right) \qquad (11)$$
known as the Quasi Newton Condition. The approximation G is therefore built so as to satisfy this condition. The formula for G is:

$$G^{(t+1)} = G^{(t)} + \frac{p\,p^{T}}{p^{T} v} - \frac{G^{(t)} v\, v^{T} G^{(t)}}{v^{T} G^{(t)} v} + \left(v^{T} G^{(t)} v\right) u\,u^{T} \qquad (12)$$

where the vectors are

$$p = w^{(t+1)} - w^{(t)}, \qquad v = g^{(t+1)} - g^{(t)}, \qquad u = \frac{p}{p^{T} v} - \frac{G^{(t)} v}{v^{T} G^{(t)} v} \qquad (13)$$
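Equations (12) and (13) translate directly into code. A sketch follows (variable names p, v, u follow the text; the sanity check verifies the Quasi Newton condition (11)):

```python
import numpy as np

def bfgs_inverse_update(G, w_new, w_old, g_new, g_old):
    """Update of the inverse-Hessian approximation G, equations (12)-(13)."""
    p = w_new - w_old          # weight change
    v = g_new - g_old          # gradient change
    Gv = G @ v
    vGv = v @ Gv
    u = p / (p @ v) - Gv / vGv
    return (G
            + np.outer(p, p) / (p @ v)
            - np.outer(Gv, Gv) / vGv
            + vGv * np.outer(u, u))

# Sanity check: the updated G satisfies G v = p (illustrative numbers)
G = np.eye(2)
w0, w1 = np.zeros(2), np.array([0.1, -0.2])
g0, g1 = np.array([1.0, 1.0]), np.array([0.4, 0.3])
G1 = bfgs_inverse_update(G, w1, w0, g1, g0)
print(np.allclose(G1 @ (g1 - g0), w1 - w0))  # True
```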
Initializing the procedure with the identity matrix is equivalent to taking, at the first step, the direction of the negative gradient, while at each subsequent step the direction −Gg is guaranteed to be a descent direction. The above expression could carry the search outside the interval of validity of the quadratic approximation; the solution is therefore to use the line search to find the minimum of the function along the search direction.
By using such a scheme, the weight updating expression (5) is reformulated as follows:

$$w^{(t+1)} = w^{(t)} - \alpha^{(t)} G^{(t)} g^{(t)} \qquad (14)$$
where α is obtained by the line search. The following algorithm describes the MLP trained by the QNA method.
Let us consider a generic MLP with $w^{(t)}$ the weight vector at time t.

1) Initialize all weights $w^{(0)}$ with small random values (typically normalized in [-1, 1]), set the threshold constant $\varepsilon$, set t = 0 and $G^{(0)} = I$;
2) Present the whole training set to the network and calculate $E\!\left(w^{(t)}\right)$, the error function for the current weight configuration;
3) If t = 0 then $d^{(t)} = -\nabla E\!\left(w^{(t)}\right)$, else $d^{(t)} = -G^{(t-1)}\,\nabla E\!\left(w^{(t)}\right)$;
4) Calculate $w^{(t+1)} = w^{(t)} + \alpha^{(t)} d^{(t)}$, where $\alpha^{(t)}$ is obtained by the line search, as in expression (6);
5) Calculate $G^{(t)}$ with equation (12);
6) If $E\!\left(w^{(t+1)}\right) > \varepsilon$ then t = t+1 and go to 2; else STOP.
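In practice, the loop above can also be delegated to an off-the-shelf quasi-Newton driver. The following sketch (illustrative only; the dataset is synthetic and the gradient is approximated by finite differences for brevity, whereas a real implementation would use Back Propagation) trains a tiny one-hidden-layer MLP with scipy's L-BFGS-B method:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (64, 2))             # toy training set: d=2 inputs
T = np.sin(X[:, :1] + X[:, 1:])             # K=1 target
d, M, K = 2, 4, 1

def unpack(w):
    """Split the flat weight vector into the layer matrices of eq. (1)."""
    i = 0
    W1 = w[i:i + M * d].reshape(M, d); i += M * d
    b1 = w[i:i + M]; i += M
    W2 = w[i:i + K * M].reshape(K, M); i += K * M
    return W1, b1, W2, w[i:i + K]

def error(w):
    """Sum-of-squares error of the MLP over the whole training set."""
    W1, b1, W2, b2 = unpack(w)
    Y = np.tanh(X @ W1.T + b1) @ W2.T + b2   # eq. (1), batched over X
    return 0.5 * np.sum((Y - T) ** 2)

w0 = rng.uniform(-1, 1, M * d + M + K * M + K)
res = minimize(error, w0, method="L-BFGS-B")  # gradient via finite differences
print(res.fun, res.nit)  # final training error and number of iterations
```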
One of the main advantages of QNA, compared with conjugate gradients, is that the line search does not require the calculation of α with high precision, because it is not a critical parameter. The downside, on the contrary, is that it requires a large amount of memory to store the |w|×|w| matrix G when |w| is large.
One way to reduce the required memory is to replace, at each step, the matrix G with the identity matrix. With such a replacement, and after multiplying by g (the current gradient), we obtain:
$$d^{(t+1)} = -g^{(t+1)} + A\,p + B\,v \qquad (15)$$

Note that if the line search returns exact values, then the above equation produces mutually conjugate directions. A and B are scalar values defined as:

$$A = -\left(1 + \frac{v^{T} v}{p^{T} v}\right) \frac{p^{T} g^{(t+1)}}{p^{T} v} + \frac{v^{T} g^{(t+1)}}{p^{T} v}, \qquad B = \frac{p^{T} g^{(t+1)}}{p^{T} v} \qquad (16)$$
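Equations (15) and (16) translate into a few lines (a sketch, with p and v as defined in (13)):

```python
import numpy as np

def memoryless_qn_direction(g_new, p, v):
    """Search direction of equation (15): d = -g + A*p + B*v, with the
    scalars A and B of equation (16). Only vectors are stored."""
    pv = p @ v
    B = (p @ g_new) / pv
    A = -(1.0 + (v @ v) / pv) * B + (v @ g_new) / pv
    return -g_new + A * p + B * v
```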
As discussed, we use a slightly modified version of the QNA, known as L-QNA or limited-memory QNA (L-BFGS; Nocedal 1980). The algorithm of the MLP with L-QNA is the following:
Let us consider a generic MLP with $w^{(t)}$ the weight vector at time t.

1) Initialize all weights $w^{(0)}$ with small random values (typically normalized in [-1, 1]), set the threshold constant $\varepsilon$, set t = 0;
2) Present the whole training set to the network and calculate $E\!\left(w^{(t)}\right)$, the error function for the current weight configuration;
3) If t = 0 then $d^{(t)} = -\nabla E\!\left(w^{(t)}\right)$, else $d^{(t)} = -\nabla E\!\left(w^{(t)}\right) + A\,p + B\,v$, where $p = w^{(t)} - w^{(t-1)}$ and $v = g^{(t)} - g^{(t-1)}$;
4) Calculate $w^{(t+1)} = w^{(t)} + \alpha^{(t)} d^{(t)}$, where $\alpha^{(t)}$ is obtained by the line search, as in expression (6);
5) Calculate A and B for the next iteration, as reported in (16);
6) If $E\!\left(w^{(t+1)}\right) > \varepsilon$ then t = t+1 and go to 2; else STOP.
Note that the algorithm works well even for approximate values of α.
During the exploration of the parameter space, in its search for the direction of minimum error, QNA may initially move in a suboptimal direction: at the first step the method can only follow the error gradient, so it takes the direction of steepest descent. In subsequent steps, however, it incorporates information from the gradients at the steps already taken to build up an approximate model of the Hessian.
As is well known, all line search methods, being based on techniques that search for the minimum error by exploring the error function surface, are likely to get stuck in a local minimum. Throughout the research in the field, many solutions have been proposed (Floudas & Jongen 2005). Incorporating a random component into the weight updating is one general way to escape a local minimum. Genetic Algorithms have also been employed to deal with this problem, by proceeding through multiple initial weight settings and recombining trained weights during the process (Fu 1994). But the cost of both approaches is a prolonged training time. In order to accelerate the convergence of GDA, Newton's method uses the information on the second-order derivatives. QNA is able to better optimize the convergence time by approximating second-order information with first-order terms (Shanno 1990).
By exploiting second-derivative information, QNA is better able to avoid local minima of the error function and to follow the trend of the error function more precisely, revealing a "natural" capability to find the absolute minimum error of the optimization problem. However, this last feature can also be a downside of the model, especially when the signal-to-noise ratio of the data is very poor. With "clean" data, such as high quality spectroscopic redshifts used for model training, the QNA performance turns out to be extremely precise.
In the L-BFGS version of the algorithm, when the problem dimension is large, the amount of memory required to store the Hessian becomes too big, along with the machine time required to process it. Therefore, instead of using the complete series of gradient values to generate the Hessian, a smaller number of values can be used. On the one hand, the convergence slows down; on the other hand, the performance may even improve. At first sight this statement seems paradoxical, but it contains no contradiction: convergence is measured by the number of iterations, whereas performance depends on the number of processor time units spent to calculate the result.
Related to the computational cost is also the strategy adopted for the stopping criteria of the method. As is well known, the process of adjusting the weights based on the gradients is repeated until a minimum is reached. In practice, one has to decide the stopping condition of the algorithm. In general there are several criteria; among them the most used are: (i) the algorithm is terminated once the gradient is sufficiently small (by definition the gradient is zero at a minimum); (ii) the algorithm is terminated once the error to be minimized falls below a fixed threshold; (iii) the algorithm is terminated based on cross validation.
Cross validation can be used to monitor the generalization performance during training and to terminate the algorithm when there is no more improvement. The basic mechanism consists of dividing the data into a training and a test set: the network is trained on the training set and its performance is evaluated on the test set. Statistically significant results are obtained by trying multiple independent data partitions and averaging the performance.
The first two criteria mentioned above are mainly sensitive to the choice of specific parameters and may lead to poor results if the parameters are improperly set. Cross validation does not suffer from such a drawback: it can avoid overfitting the data and is able to improve the generalization performance of the model. However, it is much more computationally expensive.
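Criterion (iii) is easily sketched: hold out part of the KB as a validation set and stop when its error no longer improves for a fixed number of iterations (the "patience" counter below is a common heuristic, not prescribed by this document):

```python
def train_with_early_stopping(step, val_error, w, n_max=1000, patience=20):
    """Generic early-stopping loop: 'step' performs one training iteration
    and returns the new weights; 'val_error' evaluates the validation error.
    The best weights seen on the validation set are kept and returned."""
    best_w, best_err, bad = w.copy(), val_error(w), 0
    for t in range(n_max):
        w = step(w)
        err = val_error(w)
        if err < best_err:
            best_w, best_err, bad = w.copy(), err, 0
        else:
            bad += 1
            if bad >= patience:   # no improvement for 'patience' steps: stop
                break
    return best_w, best_err
```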
REFERENCES
Bishop, C.M., Pattern Recognition and Machine Learning, 2006, Springer ISBN 0-387-31073-8.
Brescia, M., “New Trends in E-Science: Machine Learning and Knowledge Discovery in Databases”. 2012,
Contribution to the Volume “Horizons in Computer Science Research”, Thomas S. Clary (eds.), Series
“Horizons in Computer Science” Vol. 7, Nova Science Publishers, ISBN: 978-1-61942-774-7.
Broyden, C. G., The convergence of a class of double-rank minimization algorithms. 1970, Journal of the
Institute of Mathematics and Its Applications, Vol. 6, pp. 76–90.
Byrd, R. H., et al., 1994, Mathematical Programming, Vol. 63, pp. 129–156.
Celis, M.; Dennis, J. E.; Tapia, R. A., A trust region strategy for nonlinear equality constrained optimization.
1985, in Numerical Optimization, P. Boggs, R. Byrd and R. Schnabel eds, SIAM, Philadelphia USA,
pp. 71–82.
Davidon, W. C., 1968, Computer Journal, Vol. 10, p. 406.
Fletcher, R., A New Approach to Variable Metric Algorithms. 1970, Computer Journal, Vol. 13, pp. 317–
322.
Floudas, C. A.; Jongen, H. Th., Global Optimization: Local Minima and Transition Points. 2005, Journal of
Global Optimization, Vol. 32, No. 3, pp. 409–415.
Fu, Limin, Neural Networks in Computer Intelligence. 1994, E.M. Munson and L. Goldberg Editors,
McGraw-Hill NY
Goldfarb, D., A Family of Variable Metric Updates Derived by Variational Means. 1970, Mathematics of
Computation, Vol. 24, pp. 23–26.
Golub, G. H.; Ye, Q., Inexact Preconditioned Conjugate Gradient Method with Inner-Outer Iteration. 1999,
SIAM Journal on Scientific Computing, Vol. 21, pp. 1305–1320.
Nocedal, J., Updating Quasi-Newton Matrices with Limited Storage. 1980, Mathematics of Computation,
Vol. 35, pp. 773–782.
Shanno, D. F., Conditioning of quasi-Newton methods for function minimization. 1970, Mathematics of
Computation, Vol. 24, pp. 647–656.
Shanno, D.F., Recent Advances in Numerical Techniques for large-scale optimization. 1990, in Neural
Networks for Control, MIT Press, Cambridge MA
Vetterling, T.; Flannery, B.P., Conjugate Gradients Methods in Multidimensions. 1992, Numerical Recipes
in C - The Art of Scientific Computing, W. H. Press and S. A. Teukolsky Eds, Cambridge University
Press; 2nd edition.
M. Brescia, March 30, 2012
All Rights Reserved
Please, explicitly cite DAME Program in case of re-use of full or partial text here reported