Plenary Lecture I

Neural Networks: Nonlinear Optimization for
Constrained Learning and its Applications
STAVROS J. PERANTONIS
Computational Intelligence Laboratory,
Institute of Informatics and Telecommunications,
National Center for Scientific Research “Demokritos”,
153 10 Aghia Paraskevi, Athens,
GREECE
Abstract: - Feedforward neural networks are highly non-linear systems whose training is usually carried out by
performing optimization of a suitable cost function which is nonlinear with respect to the optimization variables
(synaptic weights). This paper summarizes the considerable advantages that arise from using constrained
optimization methods for training feedforward neural networks. A constrained optimization framework is
presented that allows incorporation of suitable constraints in the learning process. This enables us to include
additional knowledge in efficient learning algorithms that can be either general purpose or problem specific.
We present general purpose first order and second order algorithms that arise from the proposed constrained
learning framework and evaluate their performance on several benchmark classification problems. Regarding
problem specific algorithms, we present recent developments concerning the numerical factorization and root
identification of polynomials using sigma-pi feedforward networks trained by constrained optimization
techniques.
Key-Words: - neural networks, constrained optimization, learning algorithms
1 Introduction
Artificial neural networks are computational systems
loosely modeled after the human brain. Although it is
not yet clearly understood exactly how the brain
optimizes its connectivity to perform perception and
decision making tasks, there have been considerable
successes in developing artificial neural network
training algorithms by drawing from the wealth of
methods available from the well established field of
non-linear optimization. Although most of the
methods used for supervised learning, including the
original and highly cited back propagation method for
multilayered feed forward networks [1], originate
from unconstrained optimization techniques, research
has shown that it is often beneficial to incorporate
additional knowledge in the neural network
architecture or learning rule [2]-[6]. Often, the
additional knowledge can be encoded in the form of
mathematical relations that have to be satisfied
simultaneously with the demand for minimization of
the cost function. Naturally, methods from the field of
nonlinear constrained optimization are essential for
solving these modified learning tasks.
In this paper, we present an overview of recent
research results on neural network training using
constrained optimization techniques. A general
constrained optimization framework is introduced for
incorporating additional knowledge in the neural
network learning rule. It is subsequently shown
how this general framework can be utilized to
obtain efficient neural network learning algorithms.
These can be either general purpose algorithms or
problem specific algorithms. The general purpose
algorithms incorporate additional information about
the specific type of the neural network, the nature
and characteristics of its cost function landscape
and are used to facilitate learning in broad classes
of problems. These are further divided into first
order algorithms (mainly using gradient
information) and second order algorithms (also
using information about second derivatives encoded
in the Hessian matrix). Problem specific algorithms
are also discussed. Special attention is paid to
recent advancements in numerical solution of the
polynomial factorization and root finding problem
which can be efficiently solved by using suitable
sigma-pi networks and incorporating relations
among the polynomial coefficients or roots as
additional constraints.
2 Constrained learning framework
Conventional unconstrained supervised learning in
neural networks involves minimization, with
respect to the synaptic weights and biases, of a cost
function of the form
E  E[Tip  Oip (w )]
(1)
Here w is a column vector containing all synaptic
weights and biases of the network, p is an index
running over the P patterns of the training set, $O_{ip}$ is the network output corresponding to output node i, and $T_{ip}$ is the corresponding target.
However, it is desirable to introduce additional
relations that represent the extra knowledge and
involve the network’s synaptic weights. Before
introducing the form of the extra relations, we note
that we will be adopting an epoch-by-epoch
optimization framework with the following
objectives:
• At each epoch of the learning process, the vector w is to be incremented by dw, so that the search for an optimum new point in the space of w is restricted to a region around the current point. It is possible to restrict the search to the surface of a hypersphere, as, for example, is done with the
steepest descent algorithm. However, in some
cases it is convenient to adopt a more general
approach whereby the search is done on a
hyperelliptic surface centered at the point defined
by the current w :
$dw^T M\, dw = (\delta P)^2$  (2)

where M is a positive definite matrix and δP is a known constant.
• At each epoch, the cost function must be decremented by a positive quantity δQ, so that at the end of learning E is rendered as small as possible. To first order, the change in E can be substituted by its first differential, so that:

$dE = -\delta Q$  (3)
Next, additional objectives are introduced in order to
incorporate the extra knowledge into the learning
formalism. The objective is to make the formalism as
general as possible and capable of incorporating
multiple optimization objectives. The following two
cases of importance are therefore considered, which
involve additional mathematical relations representing
knowledge about learning in neural networks:
Case 1: There are additional constraints Φ = 0 which must be satisfied as best as possible upon termination of the learning process. Here Φ is a column vector whose components are known functions of the synaptic weights. This problem is addressed by introducing a quantity ε and demanding at each epoch of the learning process the maximization of ε subject to the condition that dΦ = -εΦ. In this way, it is ensured that Φ tends to 0 at an exponential rate.
Based on these considerations, the following
optimization problem can be formulated for each
epoch of the algorithm, whose solution will
determine the adaptation rule for w:
Maximize ε with respect to dw, subject to the following constraints:

$dw^T M\, dw = (\delta P)^2$  (4)

$dE = -\delta Q$  (5)

$\varepsilon = -\frac{d\phi_m}{\phi_m}, \quad m = 1, \ldots, S$  (6)

where S is the number of components of Φ.
Case 2: This case involves additional conditions whereby there is no specific final target for the vector Φ, but rather it is desired that all components of Φ are rendered as large as possible at each individual epoch of the learning process. This is a multiobjective maximization problem, which is addressed by defining ε = dφ_m and demanding that ε assume the maximum possible value at each epoch. Thus the constrained optimization problem is as before, with equation (6) substituted by

$\varepsilon = d\phi_m, \quad m = 1, \ldots, S$  (7)
The solution to the above constrained optimization
problems can be obtained by a method similar to
the constrained gradient ascent technique
introduced by Bryson and Denham [7] and leads to
a generic update rule for w. To this end, suitable
Lagrange multipliers L1 and L2 are introduced to
take account of equations (5) and (4) respectively
and a vector of multipliers Λ to take account of
equation (6) or equation (7). The Lagrangian thus
reads
$\mathcal{L} = \varepsilon + L_1 (G^T dw + \delta Q) + L_2 [dw^T M\, dw - (\delta P)^2] + \Lambda^T (F\, dw - \varepsilon \mathbf{1})$  (8)

where the following quantities have been introduced:
1. A vector G with elements $G_i = \partial E / \partial w_i$
2. A matrix F whose elements are defined by $F_{mi} = -(1/\phi_m)(\partial \phi_m / \partial w_i)$ (for Case 1) or $F_{mi} = \partial \phi_m / \partial w_i$ (for Case 2).
To maximize ε under the required constraints, it is demanded that:

$d\mathcal{L} = d\varepsilon\, (1 - \Lambda^T \mathbf{1}) + (L_1 G^T + \Lambda^T F + 2 L_2\, dw^T M)\, d^2 w = 0$  (9)

$d^2 \mathcal{L} = 2 L_2\, d^2 w^T M\, d^2 w < 0$  (10)

Hence, the factors multiplying $d^2 w$ and $d\varepsilon$ in equation (9) should vanish, and therefore:

$dw = -\frac{L_1}{2 L_2} M^{-1} G - \frac{1}{2 L_2} M^{-1} F^T \Lambda$  (11)

$\Lambda^T \mathbf{1} = 1$  (12)
Equation (11) constitutes the weight update rule for
the neural network, provided that the Lagrange
multipliers appearing in it have been evaluated in
terms of known quantities. The result is summarized
below, while the full evaluation is carried out in
the Appendix. To complete the evaluation, it is
necessary to introduce the following quantities:
$I_{GG} = G^T M^{-1} G$  (13)

$I_{GF} = F M^{-1} G$  (14)

$I_{FF} = F M^{-1} F^T$  (15)

$R = \frac{1}{I_{GG}} \left( I_{GG} I_{FF} - I_{GF} I_{GF}^T \right)$  (16)

$Z = 1 + \frac{(R^{-1})_a \left( I_{GF}^T R^{-1} I_{GF} \right) - \left( \mathbf{1}^T R^{-1} I_{GF} \right)^2}{(R^{-1})_a\, I_{GG}}$  (17)

where $(\cdot)_a$ denotes the sum of all elements of a matrix and $\mathbf{1}$ is a column vector whose elements are all equal to 1. In terms of these known quantities, the Lagrange multipliers are evaluated using the relations:

$L_2 = -\frac{1}{2} \left[ \frac{I_{GG}}{(R^{-1})_a \left[ I_{GG} (\delta P)^2 - Z (\delta Q)^2 \right]} \right]^{1/2}$  (18)

$\Lambda = \frac{R^{-1} \mathbf{1}}{(R^{-1})_a} - \frac{2 L_2\, \delta Q}{I_{GG}} \left( R^{-1} I_{GF} - \frac{\mathbf{1}^T R^{-1} I_{GF}}{(R^{-1})_a}\, R^{-1} \mathbf{1} \right)$  (19)

$L_1 = \frac{2 L_2\, \delta Q - \Lambda^T I_{GF}}{I_{GG}}$  (20)

In the Appendix, it is shown that δQ must be chosen adaptively according to $\delta Q = \xi\, \delta P \sqrt{I_{GG}}$. Here ξ is a real parameter with 0 < ξ < 1. Consequently, the proposed generic weight update algorithm has two free parameters, namely δP and ξ.
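To make the update rule concrete, the following NumPy sketch (our own illustration, with assumed function and variable names) assembles one epoch of the generic rule from equations (11) and (13)-(20), together with the adaptive choice of δQ derived in the Appendix:

```python
import numpy as np

def constrained_update(G, F, M, dP, xi=0.1):
    """One epoch of the generic constrained learning rule, eqs. (11)-(20).

    G  -- gradient of the cost E, shape (n,)
    F  -- constraint derivative matrix (Case 1 or Case 2), shape (S, n)
    M  -- positive definite metric of the search hyperellipse, shape (n, n)
    dP -- step length delta-P; xi in (0, 1) sets delta-Q adaptively and must
          be small enough to keep the radicand of eq. (18) positive.
    """
    Minv = np.linalg.inv(M)
    I_GG = G @ Minv @ G                                   # eq. (13)
    I_GF = F @ Minv @ G                                   # eq. (14)
    I_FF = F @ Minv @ F.T                                 # eq. (15)
    R = I_FF - np.outer(I_GF, I_GF) / I_GG                # eq. (16)
    Rinv = np.linalg.inv(R)
    ones = np.ones(F.shape[0])
    Ra = Rinv.sum()                                       # (R^-1)_a: sum of all elements
    Z = 1 + (Ra * (I_GF @ Rinv @ I_GF)
             - (ones @ Rinv @ I_GF) ** 2) / (Ra * I_GG)   # eq. (17)
    dQ = xi * dP * np.sqrt(I_GG)                          # adaptive delta-Q (Appendix)
    L2 = -0.5 * np.sqrt(I_GG / (Ra * (I_GG * dP**2 - Z * dQ**2)))    # eq. (18)
    Lam = (Rinv @ ones) / Ra - (2 * L2 * dQ / I_GG) * (
        Rinv @ I_GF - ((ones @ Rinv @ I_GF) / Ra) * (Rinv @ ones))   # eq. (19)
    L1 = (2 * L2 * dQ - Lam @ I_GF) / I_GG                # eq. (20)
    return (-(L1 / (2 * L2)) * (Minv @ G)
            - (1 / (2 * L2)) * (Minv @ (F.T @ Lam)))      # eq. (11)
```

Choosing F according to Case 1 or Case 2 then specializes this generic rule to the particular algorithms discussed in the following sections.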
3 First order constrained learning
algorithms
3.1 Algorithms with adaptive momentum
Learning in feedforward networks is usually hindered
by specific characteristics of the landscape defined by
the mean square error cost function
$E = \frac{1}{2} \sum_{ip} (T_{ip} - O_{ip})^2$  (21)

The two most common problems arise because
• of the occurrence of long, deep valleys or troughs that force gradient descent to follow zigzag paths [8]
• of the possible existence of temporary minima in the cost function landscape.
In order to improve learning speed in long deep
valleys, it is desirable to align current and previous
epoch weight update vectors as much as possible,
without compromising the need for a decrease in
the cost function [6]. Thus, satisfaction of an additional condition is required, i.e. maximization of the quantity $\Phi = (w - w_t)^T (w_t - w_{t-1})$ with respect to the synaptic weight vector w at each epoch of the algorithm. Here $w_t$ and $w_{t-1}$ are the values of the weight vector at the present and immediately preceding epoch respectively and are treated as known constant vectors. The generic update rule derived in the previous section can be applied. There is only one additional condition to satisfy (maximization of dΦ at each epoch), so that Case 2 of the previous section is applicable. It is readily seen that in this case Λ has a single component, equal to 1 (by equation (12)), and the weight update rule is quite simple:
$dw = -\frac{L_1}{2 L_2} G - \frac{1}{2 L_2} u$  (22)

where

$u = w_t - w_{t-1}, \qquad L_1 = \frac{2 L_2\, \delta Q - I_{Gu}}{I_{GG}}$

$L_2 = -\frac{1}{2} \left[ \frac{I_{GG} I_{uu} - I_{Gu}^2}{I_{GG} (\delta P)^2 - (\delta Q)^2} \right]^{1/2}$  (23)

with

$I_{Gu} = u^T G, \qquad I_{uu} = u^T u$  (24)
Hence, weight updates are formed as linear
combinations of the cost function derivatives G
with respect to the weights and of the weight
updates u at the immediately preceding epoch.
This is similar to back propagation with a
momentum term, with the essential difference that
the coefficients of G and u are suitably adapted at
each epoch of the learning process.
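For illustration, one such epoch is sketched below (our own rendering of equations (22)-(24); the plain gradient step used as a fallback when u vanishes or is parallel to G is an assumption, not part of the original algorithm):

```python
import numpy as np

def aleco_step(G, u, dP, xi=0.9):
    """One first-order constrained epoch with adaptive momentum, eqs. (22)-(24)."""
    I_GG = G @ G
    I_Gu = u @ G                          # eq. (24)
    I_uu = u @ u
    disc = I_GG * I_uu - I_Gu**2          # >= 0 by the Cauchy-Schwarz inequality
    if disc <= 0:                         # first epoch, or u parallel to G:
        return -dP * G / np.sqrt(I_GG)    # assumed fallback: plain step of length dP
    dQ = xi * dP * np.sqrt(I_GG)          # adaptive delta-Q
    L2 = -0.5 * np.sqrt(disc / (I_GG * dP**2 - dQ**2))    # eq. (23)
    L1 = (2 * L2 * dQ - I_Gu) / I_GG
    return -(L1 / (2 * L2)) * G - (1 / (2 * L2)) * u      # eq. (22)
```

Since L2 is negative, the coefficient -1/(2 L2) multiplying u is positive, so the second term is a genuine momentum contribution whose size is re-adapted at every epoch.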
The resulting learning algorithm (Algorithm for Learning Efficiently using Constrained Optimization - ALECO) is historically the first
constrained learning algorithm to be proposed that
complies with the optimization framework of the
previous section. An example of its behaviour in a
benchmark problem, the 11-11-1 multiplexer
problem [9] is shown in Fig. 1 where the
performance of ALECO is compared to other state
of the art algorithms including Back propagation (BP),
resilient propagation (RP) [10], conjugate gradient
(CG) [11], the quickprop (QP) algorithm [12] and
Delta-bar-Delta (DBD) [9]. From these experiments it
is evident that ALECO generally outperforms the
other algorithms in terms of success rate and
computing time (these first order algorithms require
roughly the same CPU time per epoch).
[Bar chart: success rates and epochs/5 for ALECO, BP, RPROP and ConjGrad; QP and DBD failed to converge.]
Figure 1: Multiplexer benchmark task (11-11-1):
Comparison of a prototype first order constrained learning
algorithm (ALECO) with other learning algorithms in terms
of success rate and epochs needed to complete the task.
Figure 2: Classification of lithologies and identification of
mineral alteration zones using ALECO.
Moreover, the algorithm has exhibited very good
generalization performance in a number of benchmark
and industrial tasks. Fig. 2 shows the result of
supervised learning performed on the problem of
classifying lithological regions and mineral alteration
zones for mineral exploration purposes from satellite
images. Among several neural and statistical
classification algorithms tried on the problem,
ALECO was most successful in correctly classifying
mineral alteration zones which are very important for
mineral exploration [13].
3.2 Dynamical system based approach
As mentioned in the previous section, the problem
of slow learning in feedforward networks is also associated with the problem of temporary minima in the cost function landscape. In recent work [14], the
problem of temporary minima has been studied
using a method that originates from the theory of
dynamical systems. One of the major results obtained is the analytical prediction of the characteristic dynamical transitions from flat plateaus (or temporary minima) of finite error to the desired levels of lower or even zero error.
It is well known that temporary minima result from
the development of internal symmetries and from
the subsequent building of redundancy in the
hidden layer. In this case, one or more of the hidden
nodes perform approximately the same function and
therefore they form clusters of redundant nodes
which are approximately reducible to a single unit.
Due to the formation of these clusters, the network
is trapped in a temporary minimum and it usually
takes a very long time before the redundancy is
broken and it finds its way down the cost function
landscape. Introducing suitable state variables
formed by appropriate linear combinations of the
synaptic weights, a dynamical system model can be
derived that describes the dynamics of the
feedforward network in the vicinity of these
temporary minima [14].
The corresponding non-linear system can be
linearized in the vicinity of temporary minima and
the learning behaviour of the feedforward network
can then be characterized by the largest eigenvalue of the Jacobian matrix $J_c$ of each cluster of redundant hidden nodes corresponding to the linearized system. It turns out that in the vicinity of
the temporary minima, learning is slow because the
largest eigenvalues of the Jacobian matrices of all
formed clusters are very small, and therefore the
system evolves very slowly when unconstrained
back propagation is utilized. The magnitude of the
largest eigenvalues gradually grows with learning
and eventually a bifurcation of the eigenvalues
occurs and the system follows a trajectory which
allows it to move far away from the minimum.
Instead of waiting for the growth of the
eigenvalues, it is useful to raise these eigenvalues
more rapidly in order to facilitate learning. It is
therefore beneficial to incorporate in the learning
rule additional knowledge related to the desire for
rapid growth of these eigenvalues. Since it is
difficult in the general case to express the maximum eigenvalues in closed form in terms of the weights, it is chosen to raise the values of appropriate lower bounds $\phi_c = \mathbf{1}^T J_c \mathbf{1} / S_c$ for these eigenvalues, where $S_c$ is the size of cluster c. Details are given in [14], but here it suffices to say that this is readily achieved by the generic weight update rule (Case 2 of section 2) using the above $\phi_c$ with M = I (the unit matrix).
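For intuition, the toy sketch below illustrates a Rayleigh-quotient lower bound of this type on a symmetric cluster matrix; the normalization by the cluster size is our assumption here, and the exact form of φ_c used by DCBP is the one given in [14]:

```python
import numpy as np

# Lower bound on the largest eigenvalue of a symmetric cluster matrix J_c:
# lambda_max >= (1^T J_c 1) / (1^T 1), by the Rayleigh quotient with x = 1.
# (Assumes a symmetric J_c; see [14] for the general treatment.)
Jc = np.array([[0.02, 0.01],
               [0.01, 0.05]])                  # toy 2x2 cluster Jacobian
ones = np.ones(Jc.shape[0])
phi_c = (ones @ Jc @ ones) / (ones @ ones)     # lower bound: 0.045
print(phi_c, np.linalg.eigvalsh(Jc).max())     # 0.045 <= 0.0530 (largest eigenvalue)
```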
[Bar chart: success rates and epochs/5 for DCBP, BP, RPROP, ConjGrad, DBD and QP.]
Figure 3: Parity 4 problem: Comparison of constrained
learning algorithm DCBP with other first order algorithms
in terms of success rate and complexity.
The algorithm thus obtained, DCBP (Dynamical
Constrained Back Propagation), results in significant
acceleration of learning in the vicinity of the
temporary minima, also improving the success rate in
comparison with other first order algorithms. The
improvement achieved for the parity-4 problem is
shown in Fig. 3. Moreover, in a standard breast cancer
diagnosis benchmark problem (cancer3 problem of the
PROBEN1 set [15]), DCBP was the only first order
algorithm tried that succeeded in learning the task
with 100% success starting from different initial
weights, with resilient propagation achieving a 20%
success rate and other first order algorithms failing to
learn the task.
4 Second order constrained learning
algorithms
Apart from information contained in the gradient, it is
often useful to include second order information about
the cost function landscape which is of course
contained in the Hessian matrix of second derivatives
with respect to the weights. Constrained learning
algorithms in this section combine advantages of first
order unconstrained learning algorithms like
conjugate gradient and second order unconstrained
algorithms like the very successful Levenberg-Marquardt (LM) algorithm.
The main idea is that a one-dimensional minimization
in the previous step direction $u = dw_{t-1}$, followed by a
second minimization in the current direction dw does
not guarantee that the function has been minimized on
the subspace spanned by both of these directions. A
solution to this problem is to choose minimization
directions which are non-interfering and linearly
independent. This can be achieved by the selection
of conjugate directions which form the basis of the
Conjugate Gradient (CG) method [16]. The two
vectors dw and u are non-interfering or mutually
conjugate with respect to $\nabla^2 E$ when

$dw^T (\nabla^2 E)\, u = 0$  (25)
Our objective is to decrease the cost function of equation (1) with respect to w, as well as to maximize $\Phi = dw^T (\nabla^2 E)\, u$. Second order information is introduced by incrementing the weight vector w by dw, so that

$dw^T (\nabla^2 E)\, dw = (\delta P)^2$  (26)
Thus, at each iteration, the search for an optimum
new point in the weight space is restricted to a
small hyperellipse centered at the point defined by
the current weight vector. The shape of such a
hyperellipse reflects the scaling of the underlying
problem, and allows for a more correct weighting
among all possible directions [17]. Thus, in terms
of the general constrained learning framework of
section 2, this is a one-goal optimization problem,
seeking to maximize Φ, so that (26) is respected.
This falls under Case 2 of the general framework
with the matrix M being replaced by the Hessian
matrix $\nabla^2 E$. The derived weight update rule reads:
$dw = -\frac{L_1}{2 L_2} [\nabla^2 E]^{-1} G - \frac{1}{2 L_2} u$  (27)

where

$L_1 = \frac{2 L_2\, \delta Q - I_{Gu}}{I_{GG}}, \qquad L_2 = -\frac{1}{2} \left[ \frac{I_{uu} I_{GG} - I_{Gu}^2}{I_{GG} (\delta P)^2 - (\delta Q)^2} \right]^{1/2}$  (28)

$I_{GG} = G^T (\nabla^2 E)^{-1} G, \qquad I_{Gu} = G^T u, \qquad I_{uu} = u^T (\nabla^2 E)\, u$  (29)
As a final touch, to save computational resources, the Hessian is evaluated using the LM trust-region prescription for the Hessian [18]

$\nabla^2 E \approx J^T J + \mu I$  (30)

where J is the Jacobian matrix and μ is a scalar that (indirectly) controls the size of the trust region.
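Putting equations (27)-(30) together, one epoch of the resulting update can be sketched as follows (a minimal illustration under our own naming; the fallback step and the fixed μ are assumptions, since in practice μ is adapted as in standard LM):

```python
import numpy as np

def lmam_step(J, r, u, dP, xi=0.9, mu=1e-2):
    """One second-order constrained epoch, eqs. (27)-(30).

    J -- Jacobian of the network outputs w.r.t. the weights, shape (m, n)
    r -- residuals O - T stacked over patterns and outputs, shape (m,)
    u -- weight update of the previous epoch, shape (n,)
    """
    H = J.T @ J + mu * np.eye(J.shape[1])     # trust-region Hessian, eq. (30)
    G = J.T @ r                               # gradient of E = 0.5 * ||r||^2
    Hinv_G = np.linalg.solve(H, G)
    I_GG = G @ Hinv_G                         # eq. (29)
    I_Gu = G @ u
    I_uu = u @ H @ u
    disc = I_uu * I_GG - I_Gu**2              # >= 0 by Cauchy-Schwarz in the H metric
    if disc <= 0:                             # e.g. first epoch (u = 0):
        return -dP * Hinv_G / np.sqrt(I_GG)   # assumed fallback: scaled LM step
    dQ = xi * dP * np.sqrt(I_GG)
    L2 = -0.5 * np.sqrt(disc / (I_GG * dP**2 - dQ**2))        # eq. (28)
    L1 = (2 * L2 * dQ - I_Gu) / I_GG
    return -(L1 / (2 * L2)) * Hinv_G - (1 / (2 * L2)) * u     # eq. (27)
```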
As is evident from equation (27), the resulting
algorithm is an LM type algorithm with an
additional adaptive momentum term, hence the
algorithm is termed LMAM (Levenberg-Marquardt
with Adaptive Momentum). An additional
1 2
100
80
LMAM
OLMAM
LM
ConjGrad
60
40
20
AM
LM
O
SV
M
C
H
H
ar
d
C
C
-m
-m
ea
ns
ea
ns
K
G
ise
d
Fu
zz
y
M
Classification rate
100
99
98
97
96
95
94
93
92
91
Su
pe
rv
improvement over this type of algorithm is the so-called OLMAM algorithm (Optimized Levenberg-Marquardt with Adaptive Momentum), which implements exactly the same weight update rule but also achieves independence from the externally provided parameter values δP and ξ. This
independence is achieved by automatically enforcing analytical conditions that must hold in order to maintain conjugacy between the weight changes of successive epochs. Details are given in [19].
LMAM and OLMAM have been particularly successful in a number of problems, both in terms of their ability to reach a solution and in terms of generalization.
In the famous 2-spiral benchmark, LMAM and
OLMAM have achieved a remarkable success rate of
89-90% and the smallest mean number of epochs
for a feedforward network with just one hidden layer
(30 hidden units) and no shortcut connections that (to
the best of our knowledge) has ever been reported in
the literature of neural networks. As shown in Fig. 4,
other algorithms like the original LM and conjugate
gradient have failed to solve this type of problem in
the great majority of cases, while the CPU time
required by LMAM and OLMAM is also very
competitive.
Figure 5: Comparison of classification rate achieved by
different classification methods for the pap-smear
classification problem. The constrained learning
algorithm (OLMAM) outperforms all other methods
achieving a rate of 98.8%.
5 Problem specific algorithms
Problem specific applications of constrained
learning range from financial modeling and market
analysis problems, where sales predictions can be
made to conform to certain requirements of the
retailers, to scientific problems like the numerical
solution of simultaneous linear equations [21]. A
scientific application of constrained learning that
has reached a level of maturity is numerical
factorization and root finding of polynomials.
Polynomial factorization is an important problem
with applications in various areas of mathematics,
mathematical physics and signal processing ([22]
and references cited therein).
Consider, for example, a polynomial of two variables $z_1$ and $z_2$:

$A(z_1, z_2) = \sum_{i=0}^{N_A} \sum_{j=0}^{N_A} a_{ij} z_1^i z_2^j$  (31)
Figure 4: 2-spiral problem: Success rates and CPU times
required for second-order constrained learning algorithms
(LMAM and OLMAM) and two other algorithms
(Levenberg-Marquardt and Conjugate Gradient).
Regarding generalization ability, the second order
constrained learning algorithms have been
successfully applied to a medical problem concerning
classification of Pap-Smear test images. In particular,
OLMAM has achieved the best classification ability
on the test set reported in the literature among a
multitude of statistical and other classification
methods. Some of the results of this comparison are
shown graphically in Fig. 5, while a full account of
this comparison is given in [20].
with $N_A$ even, and $a_{00} = 1$. For the above polynomial, it is sought to achieve an exact or approximate factorization of the form

$A(z_1, z_2) = \prod_{i=1,2} A^{(i)}(z_1, z_2)$  (32)
where
$A^{(i)}(z_1, z_2) = \sum_{j=0}^{M_A} \sum_{k=0}^{M_A} v_{jk}^{(i)} z_1^j z_2^k$  (33)

with $M_A = N_A / 2$. We can try to find the coefficients $v_{jk}^{(i)}$ by considering P training patterns selected from the region $|z_1| \le 1$, $|z_2| \le 1$. The
primary purpose of the learning rule is thus to minimize with respect to the $v_{jk}^{(i)}$ a cost function of the form

$E = \sum_p \left( \prod_{i=1,2} A^{(i)}(z_{1p}, z_{2p}) - A(z_{1p}, z_{2p}) \right)^2$  (34)
Note that this cost function corresponds to a sigma-pi neural network with the elements of $v^{(1)}$ and $v^{(2)}$ as its synaptic weights.
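As a small illustration of this formulation, the sketch below (with assumed names and sampling) evaluates the factor polynomials and the cost of equation (34) over a set of training points:

```python
import numpy as np

def eval_poly(C, z1, z2):
    """Evaluate a 2-D polynomial whose coefficient C[j, k] multiplies z1^j z2^k."""
    return (z1 ** np.arange(C.shape[0])) @ C @ (z2 ** np.arange(C.shape[1]))

def factor_cost(V1, V2, A, pts):
    """Cost of eq. (34): squared mismatch between the product of the factor
    polynomials (the sigma-pi network output) and the target polynomial A."""
    return sum((eval_poly(V1, z1, z2) * eval_poly(V2, z1, z2)
                - eval_poly(A, z1, z2)) ** 2 for z1, z2 in pts)

# P training patterns drawn from the region |z1| <= 1, |z2| <= 1:
rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(64, 2))
```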
Unconstrained minimization of the cost function has
been tried, but often leads to unsatisfactory results,
because it can be easily trapped in flat minima.
However, there is extra knowledge available for this
problem. The easiest way to incorporate more
knowledge is to take advantage of the constraints
among the coefficients of the desired factor
polynomials and the coefficients of the original
polynomial. More explicitly, if it is assumed that $A(z_1, z_2)$ is factorable, then these constraints can be expressed as follows:

$\phi_{j+(N_A+1)i} = a_{ij} - \sum_{l=0}^{i} \sum_{m=0}^{j} v_{lm}^{(1)} v_{i-l,\, j-m}^{(2)} = 0$

with $0 \le i \le N_A$, $0 \le j \le N_A$. Thus, the objective is to reach a minimum of the cost function of equation (34) with respect to the variables $v_{jk}^{(i)}$, which satisfies as best as possible the constraints $\Phi_v = 0$, where $\Phi_v = \left( \phi_{j+(N_A+1)i},\; 0 \le i \le N_A,\; 0 \le j \le N_A \right)$.
The constraints can be incorporated into the
constrained optimization formalism as in Case 1 of
section 2. It turns out that the constrained learning
algorithm can determine the factor polynomials in
factorable cases and gives good approximate solutions
in cases where the original polynomial is non-factorable [23].
Recently, there have been interesting developments
that allow us to achieve improved results in the
problem of numerical factorization and root finding
for polynomials using constrained learning
techniques. In [24] additional constraints have been
incorporated for ensuring stability of the resulting
factor polynomials in filter factoring problems related
to signal processing. More importantly, the basic
constrained learning method has been applied to root
finding of arbitrary polynomials of one variable. The
parallel structure of the neural network has been
exploited in order to obtain all roots of the original
polynomial simultaneously or in distinct groups,
resulting in efficient algorithms capable of handling
problems with polynomials of high degree [25]. The
method has also been extended for finding the roots of
polynomials with arbitrary complex coefficients and
with roots that can be close to each other and need
extra effort to be resolved [26]. A recent extension
incorporates constraints among the root moments of
the original polynomial instead of the relations
among the coefficients themselves [27]. These
advances have rendered the constrained learning
algorithms very competitive compared with well
established numerical root finding methods like the
Muller and Laguerre techniques leading to better
accuracies at a fraction of the CPU time.
6 Conclusion
An overview of recent research results on neural
network training using constrained optimization
techniques has been presented. A generic learning
framework was derived in which many types of
additional knowledge, codified as mathematical
relations satisfied by the synaptic weights, can be
incorporated. Specific examples were given of the
application of this framework to neural network
learning including first and second order general
purpose algorithms as well as problem specific
methods. It is hoped that the constrained learning
approach will continue to offer insight into learning
in neural networks. It has potential to combine the
merits of both connectionist and knowledge based
approaches for developing successful applications.
Appendix: Derivation of constrained
learning algorithm
In this Appendix, evaluation of the Lagrange
multipliers L1 , L2 and Λ involved in the general
constrained learning framework of section 2 is
carried out.
By multiplying both sides of equation (11) by $G^T$ and by taking into account equation (5) we obtain:

$-\delta Q = -\frac{L_1}{2 L_2} I_{GG} - \frac{1}{2 L_2} \Lambda^T I_{GF}$  (35)
Solving for L1 readily yields equation (20), which
evaluates L1 in terms of L2 and Λ .
By left multiplication of both sides of equation (11)
by F and taking into account equations (6) and
(20), we obtain
$\varepsilon \mathbf{1} + \frac{\delta Q}{I_{GG}} I_{GF} = -\frac{1}{2 L_2} R \Lambda$  (36)
where the matrix R is defined by equation (16).
Solving equation (36) for Λ yields

$\Lambda = -2 L_2\, \varepsilon\, R^{-1} \mathbf{1} - \frac{2 L_2\, \delta Q}{I_{GG}} R^{-1} I_{GF}$  (37)

By substituting this equation into equation (12) we arrive at:
$\varepsilon = -\frac{1 + \dfrac{2 L_2\, \delta Q}{I_{GG}} \mathbf{1}^T R^{-1} I_{GF}}{2 L_2\, (R^{-1})_a}$  (38)
We can now substitute this equation into equation (37)
to obtain equation (19) evaluating Λ in terms of L2 .
To evaluate L2 , we must substitute our expression for
dw into equation (4). To make the algebra easier, we
note that on account of equation (20), equation (11)
can be written as:
$dw = -\frac{\delta Q}{I_{GG}} M^{-1} G + \frac{1}{2 L_2} M^{-1} A$  (39)

where

$A = \frac{\Lambda^T I_{GF}}{I_{GG}} G - F^T \Lambda$  (40)
From the definition of A we can readily derive the
following properties:
$A^T M^{-1} A = \Lambda^T R \Lambda, \qquad A^T M^{-1} G = 0$  (41)
Substituting equation (39) into equation (4) and taking
into account equation (41), we can obtain a relation
involving only L2 and Λ:
$L_2 = -\frac{1}{2} \left[ \frac{I_{GG} (\Lambda^T R \Lambda)}{I_{GG} (\delta P)^2 - (\delta Q)^2} \right]^{1/2}$  (42)
where the negative square root sign has been selected
on account of inequality (10).
By substituting equation (19) into equation (42) and
solving for L2 , equation (18) is obtained, with Z
given by equation (17). Evaluation of all Lagrange
multipliers in terms of known quantities is now
complete.
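As a quick numerical sanity check of these closed forms, the update produced by the hypothetical constrained_update() sketch of section 2 should satisfy the epoch constraints to machine precision for random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, S, dP, xi = 6, 2, 1e-2, 0.1    # small xi keeps the radicand of eq. (18) positive
G = rng.normal(size=n)
F = rng.normal(size=(S, n))
B = rng.normal(size=(n, n))
M = B @ B.T + n * np.eye(n)       # a positive definite metric
dw = constrained_update(G, F, M, dP, xi)
dQ = xi * dP * np.sqrt(G @ np.linalg.solve(M, G))
print(np.isclose(dw @ M @ dw, dP**2))      # eq. (4): step lies on the hyperellipse
print(np.isclose(G @ dw, -dQ))             # eq. (5): first-order decrease by delta-Q
print(np.allclose(F @ dw, (F @ dw)[0]))    # eqs. (6)/(7): equal progress on all constraints
```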
As a final note, let us discuss our choice for δQ. This choice is dictated by the demand that the quantity under the square root in equation (42) be positive. It can readily be seen from the first of equations (41) that $\Lambda^T R \Lambda \ge 0$ provided that M is positive definite. Since $I_{GG} = G^T M^{-1} G > 0$, it follows from equation (42) that care must be taken to ensure that $I_{GG} (\delta P)^2 > (\delta Q)^2$. The simplest way to achieve this is to set $\delta Q = \xi\, \delta P \sqrt{I_{GG}}$ with $0 < \xi < 1$.
References
[1] D. E. Rumelhart, G. E. Hinton and R. J. Williams,
Learning internal representations by error
propagation, in: Parallel Distributed Processing:
Explorations in the Microstructures of Cognition,
Vol. 1, Foundations, eds. D. E. Rumelhart and
J. L. McClelland, MIT Press, 1986, pp. 318-362.
[2] D. Barber and D. Saad, Does extra knowledge
necessarily improve generalization?, Neural
Computation, Vol. 8, 1996, pp. 202-214.
[3] Y. le Cun, L. D. Jackel, B. E. Boser, J. S.
Denker, H-P. Graf, I. Guyon, D. Henderson,
R. E. Howard and W. Hubbard, Handwritten
digit recognition: Applications of neural
network chips and automatic learning, IEEE
Communications Magazine, Nov. 1989, pp. 41-46.
[4] P. Simard, Y. le Cun and J. Denker, Efficient
pattern recognition using a new transformation
distance, in: Advances in Neural Processing
Systems, eds. S. J. Hanson, J. D. Cowan and
C. L. Giles, Morgan Kaufmann, 1993, pp. 50-58.
[5] S. Gold, A. Rangarajan and E. Mjolsness,
Learning with preknowledge: clustering with
point and graph matching distance, Neural
Computation, Vol. 8, 1996, pp. 787-804.
[6] S. J. Perantonis and D. A. Karras, An efficient learning algorithm with momentum acceleration, Neural Networks, Vol. 8, 1995, pp. 237-249.
[7] A. E. Bryson and W. F. Denham, A steepest
ascent method for solving optimum
programming problems, Journal of Applied Mechanics,
Vol. 29, 1962, pp. 247-257.
[8] S. S. Rao, Optimization Theory and
Applications, New Delhi, Wiley Eastern, 1984.
[9] R. A. Jacobs, Increased rates of convergence
through learning rate adaptation, Neural
Networks, Vol. 1, 1988, pp. 295-307.
[10] M. Riedmiller and H. Braun, A direct adaptive
method for faster backpropagation learning:
The RPROP algorithm, Proceedings of the
International Conference on Neural Networks,
San Francisco, Vol. 1, 1993, pp. 586-591.
[11] E. M. Johansson, F. U. Dowla and
D. M. Goodman, Backpropagation learning for
multilayer feedforward networks using the
conjugate gradient method, International
Journal of Neural Systems, Vol. 2, 1992, pp.
291-301.
[12] S. E. Fahlman, Faster learning variations on
back-propagation: An empirical study, in:
Proceedings of the Connectionist Models
Summer School, eds. D. Touretzky, G. Hinton
and T. Sejnowski, Morgan Kaufmann, 1988,
pp. 29-37.
[13] J. Aarnisalo et al., Integrated technologies for
minerals exploration: pilot project for nickel
ore deposits, Transactions of the Institution of
Mining and Metallurgy, Section B, Applied
Earth Science, Vol. 108, September-December
1999, pp. 151-163.
[14] N. Ampazis, S. J. Perantonis and J. G. Taylor, A
dynamical model for the analysis and
acceleration of learning in feedforward networks,
Neural Networks, Vol. 14, 2002, pp. 1075-1088.
[15] L. Prechelt, PROBEN1-A set of neural network
benchmark problems and benchmarking rules,
Technical Report 21/94, Universität Karlsruhe,
Germany, 1994.
[16] J. Gilbert and J. Nocedal, Global convergence
properties of conjugate gradient methods for
optimization, SIAM Journal on Optimization,
Vol. 2, 1992, pp. 21-42.
[17] N. I. M. Gould and J. Nocedal, On the modified
absolute-value factorization norm for trust-region
minimization, In High Performance Algorithms
for Software in Nonlinear Optimization, Boston,
MA: Kluwer, 1998, pp. 225-241.
[18] K. Levenberg, A method for the solution of
certain problems in least squares, Quarterly of
Applied Mathematics, Vol. 2, 1944, pp. 164-168.
[19] N. Ampazis and S. J. Perantonis, Two highly
efficient second order algorithms for feedforward
networks, IEEE Transactions on Neural
Networks, Vol. 13(5), 2002, pp. 1064-1074.
[20] N. Ampazis, G. Dounias and J. Jantzen, Efficient
second order neural network training algorithms
for the construction of a Pap-smear classifier,
SETN04, Samos, Greece, 2004, accepted for
presentation.
[21] D. S. Huang, On the comparisons between RLSA
and CLA for solving arbitrary linear
simultaneous equations, Lecture Notes in
Computer Science, Vol. 2690, 2003, pp. 169-176,
Springer-Verlag.
[22] N. E. Mastorakis, A method of approximate
multidimensional factorization via the singular
value decomposition, Found. of Comput.
Decis. Sci., Vol. 21, No. 3, 1996, pp.137-144.
[23] S. J. Perantonis, N. Ampazis, S. Varoufakis
and G. Antoniou, Constrained learning in
neural networks: Application to stable
factorization of 2-D polynomials, Neural Proc.
Lett., Vol. 7, 1998, pp. 5-14.
[24] G. Antoniou, S. J. Perantonis, N. Ampazis and
S. J. Varoufakis, Stable factorization of 2-D polynomials using neural networks, Proceedings of 13th International Conference
on Signal Processing, Santorini, Greece, July
1997, pp. 983-986.
[25] D. S. Huang and Z. Chi, Neural networks with
problem decomposition for finding real roots
of polynomials, Proceedings of IJCNN2001,
Washington DC, Addendum, 2001, pp. 25-30.
[26] D. S. Huang, H. H. S. Ip, Z. Chi and H.S.
Wong, Dilation method for finding close roots
of polynomials based on constrained learning
neural networks, Physics Letters A, Vol. 309, No. 5-6, 2003, pp. 443-451.
[27] D. S. Huang, Finding roots of polynomials
based on root moments. Proc. 8th Int. Conf. on
Neural Information Processing (ICONIP), Shanghai, China, Vol. 3, 2001, pp. 1565-1571.