MATHEMATICS OF OPERATIONS RESEARCH
Vol. 19, No. 2, May 1994
GLOBAL CONVERGENCE OF DAMPED NEWTON'S METHOD
FOR NONSMOOTH EQUATIONS VIA THE PATH SEARCH

DANIEL RALPH
A natural damping of Newton's method for nonsmooth equations is presented. This
damping, via the path search instead of the traditional line search, enlarges the domain of
convergence of Newton's method and therefore is said to be globally convergent. Convergence behavior is like that of line search damped Newton's method for smooth equations,
including Q-quadratic convergence rates under appropriate conditions.
Applications of the path search include damping Robinson-Newton's method for nonsmooth normal equations corresponding to nonlinear complementarity problems and variational inequalities, hence damping both Wilson's method (sequential quadratic programming)
for nonlinear programming and Josephy-Newton's method for generalized equations.
Computational examples from nonlinear programming are given.
1. Introduction. This paper presents a novel, natural damping of (local) Newton's method, essentially Robinson-Newton's method (Robinson 1988), for solving nonsmooth equations. This damping, via the so-called path search instead of the traditional line search, enlarges the domain of convergence of Newton's method and therefore is said to be globally convergent. It is natural in the sense that the convergence behavior of the method is fundamentally the same as that of the traditional line search damped Newton's method applied to smooth equations, which, under appropriate conditions, is roughly described as linear convergence far from a solution and superlinear, possibly quadratic, convergence near a solution. Such convergence results are amongst the strongest results of their kind, but also require strong conditions; see further discussion below. We also investigate the nonmonotone path search (§§3, 4), which is an easy extension of the nonmonotone line search (Grippo, Lampariello and Lucidi 1986).
An immediate application is damping (Robinson-)Newton's method for solving the nonsmooth normal equations (Robinson 1992), yielding a damping procedure for both Wilson's method (or sequential quadratic programming) (Wilson 1963, Fletcher 1987) for nonlinear programs, and Josephy-Newton's method (Josephy 1979) for generalized equations. Path search damped Newton's method likewise applies to the normal equation formulation of variational inequalities and nonlinear complementarity problems, as described in §5.
For the moment, our prototype of a nonsmooth function will be

F_+(x) := F(x_+) + x − x_+,   ∀x ∈ R^N,

where F: R^N → R^N is a smooth function, and x_+ is the vector in R^N whose ith component is max{x_i, 0}.
Received January 18, 1991; revised October 27, 1992.
AMS 1980 subject classification. Primary: 49D15, 90C30, 90C33.
IAOR 1973 subject classification. Main: Optimization. Cross references: Programming: Nonlinear.
OR/MS Index 1978 subject classification. Primary: 653 Programming/Nonlinear.
Key words. Damped Newton's method, global convergence, first-order approximation, line search, path search, nonsmooth equations, normal mappings, variational inequalities, generalized equations, complementarity problems, nonlinear programming.
F_+ is nonsmooth only on the boundaries of the orthants of R^N. The system F_+(x) = 0 is actually the normal equation formulation of the nonlinear complementarity problem: find z ∈ R^N such that

z ≥ 0,   F(z) ≥ 0,   and   ⟨z, F(z)⟩ = 0,

where vector inequalities are taken in a componentwise fashion. Note that if x solves the normal equation F_+(x) = 0, then z := x_+ solves the nonlinear complementarity problem; and, conversely, a solution z of the latter yields a solution x := z − F(z) of the former.
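As a concrete illustration of this equivalence (an example of ours, not from the paper), the sketch below builds the normal map for a small linear complementarity problem whose data are chosen so that the NCP solution is z* = (1, 0); it checks that x* = z* − F(z*) zeroes F_+ and that z* is recovered as x*_+. NumPy is assumed.

```python
import numpy as np

# Hypothetical linear test data: F(z) = Mz + q, chosen so that z* = (1, 0)
# solves the NCP  z >= 0, F(z) >= 0, <z, F(z)> = 0.
M = np.array([[2.0, 1.0], [1.0, 3.0]])
q = np.array([-2.0, 0.0])

def F(z):
    return M @ z + q

def F_plus(x):
    # Normal map: F_+(x) = F(x_+) + x - x_+, where (x_+)_i = max{x_i, 0}.
    xp = np.maximum(x, 0.0)
    return F(xp) + x - xp

z_star = np.array([1.0, 0.0])        # NCP solution: F(z*) = (0, 1)
x_star = z_star - F(z_star)          # then x* solves the normal equation
print(F_plus(x_star))                # -> [0. 0.]
print(np.maximum(x_star, 0.0))       # -> [1. 0.], recovering z* = x*_+
```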
We need some notation to aid further explanation. Suppose X and Y are both N-dimensional normed spaces (though Banach spaces can be dealt with) and f: X → Y. We wish to solve the equation

f(x) = 0,   x ∈ X.
For the traditional damped Newton's method (Ortega and Rheinboldt 1970, Burdakov 1980) we assume f is a continuously differentiable function. Suppose x^k is the kth iterate of the algorithm, and x̂^{k+1} is the zero of the "linearization"

A_k(x) := f(x^k) + ∇f(x^k)(x − x^k),   x ∈ X.

Since A_k approximates f, x̂^{k+1} at least formally approximates a zero of f, so x̂^{k+1} is called the Newton iterate. Newton's method defines x^{k+1} := x̂^{k+1}. Newton's method is damped by a line search; that is, by choosing x^{k+1} on the line segment from x^k to x̂^{k+1} such that the decrease in ‖f(·)‖ after moving from x^k to x^{k+1} is close to the decrease predicted by ‖A_k(·)‖. The normed residual ‖f(·)‖ is the merit function for determining x^{k+1}. We refer to such methods as line search damped. Some convergence results under different line search strategies are set out by O. P. Burdakov (1980).
The computational success of line search damped Newton's method relies on uniformly bounded invertibility of the Jacobians ∇f(x^k), which yields two key properties of the linearizations A_k that are independent of k: first, A_k is a first-order approximation of f at x^k, i.e., f(x) = A_k(x) + o(x − x^k) where ‖o(x − x^k)‖/‖x − x^k‖ → 0 as x → x^k; and, secondly, A_k goes to zero "rapidly" on the path p^k(t) := x^k + t(x̂^{k+1} − x^k) as t goes from 0 to 1, by which we mean that, given our choice of A_k and x̂^{k+1},

A_k(p^k(t)) = f(x^k) + t∇f(x^k)(x̂^{k+1} − x^k) = (1 − t)f(x^k).

We call p^k the Newton path. Observe that f(p^k(t)) = (1 − t)f(x^k) + o(t) where o(t)/t → 0 as t ↓ 0, uniformly for all k. Thus for a fixed σ ∈ (0, 1) and all sufficiently small positive t independent of k,

‖f(p^k(t))‖ ≤ (1 − σt)‖f(x^k)‖.

So there exists a step length t_k ∈ (0, 1] such that not only does the previous inequality hold at t = t_k, but there is a positive lower bound on t_k uniformly for all k. A (monotone) line search determines such a step length; then the damped Newton iterate is x^{k+1} := p^k(t_k). We see that the normed residual ‖f(·)‖ is decreased by at least a constant factor after each iteration. As the residual converges to zero, the iterates converge to a solution and, within a neighborhood of this solution, superlinear convergence of the iterates is observed.
We propose a damping of Newton's method suitable for a nonsmooth equation f(x) = 0, using first-order approximations A_k and paths p^k with the properties summarized above. For example, suppose f = F_+ from above, and the approximation to f at x^k is

A_k(x) := F(x^k_+) + ∇F(x^k_+)(x_+ − x^k_+) + x − x_+,   ∀x ∈ R^N.

The approximation error f(x) − A_k(x) is the difference between F(x_+) and the linearization F(x^k_+) + ∇F(x^k_+)(x_+ − x^k_+), i.e., the relationship of f to A_k is like that of a smooth function to its linearization about a given point. Newton's method, essentially Robinson-Newton's method (Robinson 1988) in this context, defines x^{k+1} as the Newton iterate x̂^{k+1} = A_k^{-1}(0), as in the smooth case. A naive damping of this method is a line search along the interval [x^k, x̂^{k+1}]. Instead, we follow a path p^k from x^k (t = 0) to x̂^{k+1} (t = 1) on which A_k goes to zero rapidly: p^k(t), the Newton path, is defined by the equation

A_k(p^k(t)) = (1 − t)f(x^k).

By assuming p^k(t) − x^k = O(t), we have that

f(p^k(t)) = A_k(p^k(t)) + o(p^k(t) − x^k) = (1 − t)f(x^k) + o(t);

hence there is a path length t_k ∈ (0, 1] such that ‖f(p^k(t_k))‖ ≤ (1 − σt_k)‖f(x^k)‖. A path search determines the path length t_k; then the damped Newton iterate is x^{k+1} := p^k(t_k). It is not even necessary that x̂^{k+1} exist so long as p^k can be defined on an interval [0, T_k] for suitable T_k ∈ (0, 1]. Assuming each A_k is invertible near x^k, uniformly in some sense for all k, the path search ensures that the residuals f(x^k) converge to zero at least linearly, and eventually we obtain superlinear convergence of the damped iterates to a solution. We have outlined path search damped Newton's method.
Since A_k need not be affine (it is piecewise smooth above), p^k need not be affine. In this case, there is no basis for the line search on [x^k, x̂^{k+1}]. It is even possible that both ‖A_k(x^k + t(x̂^{k+1} − x^k))‖ and ‖f(x^k + t(x̂^{k+1} − x^k))‖ initially increase as t increases from 0, causing the line search to fail altogether. The path search, however, may still be numerically and theoretically sound.
The remainder of the introduction will be devoted to related literature. For purposes of comparison, we observe that the approach proposed here is distinguished from most, if not all, others by the use of nonsmooth Newton paths p^k. The sections to come are as follows:
§2. Notation and preliminary results.
§3. Motivation: Line search damped Newton's method for smooth equations.
§4. Path search damped Newton's method for nonsmooth equations.
§5. Applications.
One of the simplest approaches to (damping) Newton's method for nonsmooth equations is suggested by J.-S. Pang (1990): set

B_k(x) := f(x^k) + f′(x^k; x − x^k)
where we assume the existence of the directional derivative f′(x; d) at each x in each direction d. As before, suppose x̂^{k+1} solves B_k(x) = 0. Assuming f is also locally Lipschitz, f′(x; ·) is a so-called B(ouligand)-derivative (Robinson 1987), and we say the Newton method given by x^{k+1} := x̂^{k+1} is B-Newton's method. Now f(x) = B_k(x) + o(x − x^k), and B_k is affine on the interval [x^k, x̂^{k+1}]; hence damping B-Newton's method by a line search makes sense. A major difficulty, however, is that the closer x^k is to a point of nondifferentiability of f, the smaller the neighborhood of x^k in which B_k accurately approximates f. This difficulty is reflected in the dearth of strong global convergence results for such a scheme. For instance, the global convergence result (Pang 1990, Theorem 6) requires the existence of an accumulation point x* of the sequence of iterates (x^k) at which ‖f(·)‖² is differentiable; this requirement partly defeats the aim of solving nonsmooth equations. Harker and Xiao (1990) apply the theory of Pang (1990) to the nonsmooth normal mapping f = F_+ associated with the nonlinear complementarity problem and provide some computational experience of line search damped B-Newton's method. For further discussion, see §5.2.
A global Newton method with convergence results comparable to ours is given by J.-S. Pang (1991), for the "min" equation formulation of variational inequalities. The min equation for the nonlinear complementarity problem, above, is

0 = f(x) := min{x, F(x)},

where the min operation is taken componentwise. In Pang (1991) a modified B-Newton method, with line search damping, is given for this min equation. This method is globally convergent under a regularity condition at each iterate. The regularity condition at the kth iterate ensures invertibility of the equation defining the modified B-Newton method, and is quite similar to the local invertibility property of the approximation A_k of F_+ needed in path search damped Newton's method. Nevertheless, the approach and analysis of Pang (1991) are quite different from ours, and have some of the flavor of Han, Pang and Rangaraj (1992), discussed next.
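For comparison with the normal map (an illustration of ours, not from the paper), the min residual is immediate to evaluate componentwise; reusing the hypothetical linear data above, it vanishes exactly at the NCP solution:

```python
import numpy as np

def min_map(F, z):
    # "min" equation residual for the NCP: f(z) = min{z, F(z)}, componentwise.
    return np.minimum(z, F(z))

M = np.array([[2.0, 1.0], [1.0, 3.0]])
q = np.array([-2.0, 0.0])
F = lambda z: M @ z + q                    # same hypothetical data as before
print(min_map(F, np.array([1.0, 0.0])))    # -> [0. 0.] at z* = (1, 0)
```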
Line search damping of general Newton methods for nonsmooth equations is considered by S.-P. Han, J.-S. Pang and N. Rangaraj (1992). Given a general iteration function G(x^k, ·) from R^N to R^N, the associated Newton method defines x^{k+1} := x̂^{k+1} where x̂^{k+1} is such that f(x^k) + G(x^k, x̂^{k+1} − x^k) = 0. To incorporate a line search on [x^k, x̂^{k+1}], conditions on the iteration function G(x^k, ·) are given to ensure a certain rate of decrease of ‖f(x^k + t(x̂^{k+1} − x^k))‖ as t increases, at least for all small t > 0 uniformly in k. Such conditions, in general, preclude f(x^k) + G(x^k, ·) from being a good approximation of f near x^k, uniformly in k. In effect, Han, Pang and Rangaraj (1992) give conditions on G(x^k, ·) such that line search damping is viable on the interval [x^k, x̂^{k+1}], while the approximation properties of G(x^k, ·) with respect to f are not of importance. In contrast, we must abandon the line search to retain what we believe are the fundamental properties of the approximations A_k that make the classical Newton's method successful.
Local and global Newton methods for locally Lipschitz, semismooth functions f are presented by L. Qi (1993). Such functions are B-differentiable. The proposed Newton iterate x̂^{k+1} satisfies f(x^k) + M_k(x̂^{k+1} − x^k) = 0, where the matrix M_k is any member of a refinement of the Clarke generalized Jacobian (Clarke 1983) of f at x^k. When f is piecewise smooth, for example f(x) is F_+(x) or min{x, F(x)}, M_k can be chosen very simply: roughly, let f_k be one of the smooth functions that defines f on a "piece of smoothness" containing x^k, and take M_k := ∇f_k(x^k). The undamped method reduces, in this case, to the method of M. Kojima and S. Shindo (1986). Global convergence properties of the line search damped method are similar to those of line search damped B-Newton's method (Pang 1990), and can be enhanced using the framework of Han, Pang and Rangaraj (1992).
P. Marcotte and J.-P. Dussault (1987) have shown global convergence of Josephy-Newton's method (Josephy 1979) with line search damping, for monotone variational inequalities over compact convex sets, using a nonsmooth merit function. Related work using M. Fukushima's smooth merit function is given in Fukushima (1992), and by K. Taji, M. Fukushima and T. Ibaraki (1993), and J. H. Wu, M. Florian and P. Marcotte (1990), for line search damped Josephy-Newton's method applied to strongly monotone variational inequalities over closed convex, but not necessarily compact, sets. Our results are sharper when they apply, i.e., in the strongly monotone case; see Proposition 11 and the comments after its proof.
It can be argued that line search damping of a Newton method for nonsmooth equations lacks the elegance of path search damping; putting it differently, path search damping seems better suited than other damping methods to the standard merit function, the normed residual. This view, however, may have little bearing on the ultimate usefulness of the different damping procedures. For example, in §5 we use a modification of Lemke's algorithm to determine each path p^k; hence this implementation might be prey to the same exponential behavior as the original Lemke's algorithm for certain pathological problems, a difficulty not observed when applying line search damped B-Newton's method to such problems (Harker and Xiao 1990).
It has been pointed out by Liqun Qi that conditions sufficient for global and superlinear convergence of nonsmooth Newton methods are weaker than the assumptions made in this paper, which include a local invertibility property of A_k (see Theorem 9 and Proposition 10). For superlinear convergence see F. Bonnans (1990), M. Kojima and S. Shindo (1986), J.-S. Pang (1993), and L. Qi (1993). It seems, however, that substantially weaker conditions require the existence of the solution point a priori, whereas our procedure actually proves the existence of such a point. In fact the same criticism, and rebuttal of that criticism, can be applied to extensions of the classical Kantorovich-Newton theorem (Ortega and Rheinboldt 1970, Theorem 12.6.2) to generalized equations (Josephy 1979, Theorem 2.2) and nonsmooth equations (Robinson 1988, Theorem 3.2). Implications of our assumptions are explored further after Proposition 10.
The weakest assumptions for global convergence, known to the author, are for Gauss-Newton-like methods, rather than Newton-like methods, such as the method of J.-S. Pang and S. A. Gabriel (1993). In this basic paper we prefer to make strong assumptions in order to use principles underlying the classical Newton method, and obtain strong results. A Gauss-Newton approach to nonsmooth systems prior to Pang and Gabriel (1993) is given by J.-S. Pang, S.-P. Han and N. Rangaraj (1991). Also, M. C. Ferris and S. Lucidi (1994) present conditions on merit functions, used in line search damping, that are sufficient for global convergence.
Finally we draw the reader's attention to the starting point of this research, K. Park
(1989), in which homotopy methods via point-based approximations are considered
for solving nonsmooth equations. Stephen M. Robinson has suggested that, while
Park takes the viewpoint of path following, this paper proposes "multiple path
following".
2. Notation and preliminary results. Most of the interest for us is in finite dimensions, when both X and Y are N-dimensional real spaces, R^N, with arbitrary norms. Our most fundamental result, however, Theorem 9, is valid for complete normed spaces X, Y over R; hence this generality of X and Y will be used unless otherwise stated. Throughout, f is a function mapping X to Y, and B_X, B_Y denote the closed unit balls in X, Y respectively. The unit ball may be written B when the context is clear.
We write t ↓ 0 to mean t → 0, t > 0. By o(t) (as t ↓ 0) we mean a (vector) function of the scalar t such that o(t)/t → 0 in norm as t ↓ 0. Likewise, by O(t) (as t ↓ 0) we mean O(t)/t is bounded as t ↓ 0. When d ∈ X, o(d) denotes a function of d such that o(d)/‖d‖ → 0 as d → 0, d ≠ 0.

The function f is Lipschitz (of modulus l ≥ 0) on a subset X₀ of X if ‖f(x) − f(x′)‖ is bounded above by a constant multiple (l) of ‖x − x′‖, for any points x, x′ in X₀. We say f is locally Lipschitz if it is Lipschitz near each x ∈ X. A function g from a subset X₀ of X to Y is said to be continuously invertible, or Lipschitzianly invertible (of modulus l ≥ 0), if it is bijective and its inverse mapping is continuous, or Lipschitz (of modulus l), respectively. Such a function g is continuously invertible, or Lipschitzianly invertible (of modulus l), near a point x ∈ X₀ if, for some neighborhoods U of x in X and V of g(x) in Y, the restricted mapping g|_{U∩X₀}: U ∩ X₀ → V: x ↦ g(x) is continuously invertible, or Lipschitzianly invertible (of modulus l). In defining this restricted mapping it is tacitly assumed that g(U ∩ X₀) ⊂ V.
We are interested in approximating f when it is not necessarily differentiable.

DEFINITION 1. Let X₀ ⊂ X.
(1) Let x ∈ X. A first-order approximation of f at x is a mapping Â: X → Y such that

Â(x′) − f(x′) = o(x − x′),   x′ ∈ X.

A first-order approximation of f on X₀ is a mapping 𝒜 on X₀ such that for each x ∈ X₀, 𝒜(x) is a first-order approximation of f at x.
(2) Let 𝒜 be a first-order approximation of f on X₀. 𝒜 is a uniform first-order approximation (with respect to X₀) if there exists Δ: (0, ∞) → [0, ∞] with Δ(t) = o(t), such that for any x, x′ ∈ X₀,

(1)   ‖𝒜(x)(x′) − f(x′)‖ ≤ Δ(‖x − x′‖).

𝒜 is a uniform first-order approximation near x⁰ ∈ X₀ if, for some Δ(t) as above, (1) holds for x, x′ near x⁰.
The idea of a path will be needed to define the path search damping of Newton's method.

DEFINITION 2. A path (in X) is a continuous function p: [0, T] → X where T ∈ [0, 1]. The domain of p is [0, T], denoted dom(p).

We note a path lifting result.

LEMMA 3. Let Φ: X → Y, x ∈ X and Φ(x) ≠ 0. Suppose the restricted mapping Φ̂ := Φ|_U: U → V is continuously invertible, where U and V are neighborhoods of x and Φ(x), respectively. If U is open, and ε > 0 is such that Φ(x) + εB_Y ⊂ V, then, for 0 < T ≤ min{ε/‖Φ(x)‖, 1}, the unique path p of domain [0, T] such that

p(0) = x,   Φ(p(t)) = (1 − t)Φ(x)   ∀t ∈ [0, T]

is given by

p(t) = Φ̂^{-1}((1 − t)Φ(x))   ∀t ∈ [0, T].
PROOF. Let U, ε and T be as above, and observe that p: [0, T] → X: t ↦ Φ̂^{-1}((1 − t)Φ(x)) is a well-defined mapping. Suppose q: [0, T] → X is a path such that q(0) = x and, for t ∈ [0, T], Φ(q(t)) = (1 − t)Φ(x). Clearly q(t) = p(t) if and only if q(t) ∈ U, so consider

t̄ := sup{t ∈ [0, T] | q(s) ∈ U, ∀s ∈ [0, t]}.

If t̄ ∈ [0, T) and q(s) ∈ U for each s ∈ [0, t̄), then q(t̄) = p(t̄) by continuity of p and q; so q(t̄) belongs to the open set U. Therefore, by continuity of q, q(t′) belongs to U for each t′ ∈ [0, T] near t̄. It follows that t̄ < T is impossible, hence t̄ = T, and q(t) = p(t) for each t ∈ [0, T]. ∎
For a nonempty, closed convex set C in R^N and each x ∈ R^N, π_C(x) denotes the nearest point in C to x with respect to the Euclidean norm. The existence and uniqueness of the projected point π_C(x) is classical in the generality of Hilbert spaces. We refer the reader to Brezis (1973, Example 2.8.2 and Prop. 2.6.1), where we also see that the projection operator π_C is Lipschitz of modulus 1. The normal cone to C at z is

N_C(z) := {y ∈ R^N | ⟨y, c − z⟩ ≤ 0, ∀c ∈ C}   if z ∈ C,

and N_C(z) := ∅ otherwise. The tangent cone to C at z, denoted T_C(z), is {x ∈ R^N | ⟨x, y⟩ ≤ 0, ∀y ∈ N_C(z)} if z ∈ C, and the empty set otherwise.
Next we present the normal mappings of Robinson (1992). These will be our source of applications (§5).

DEFINITION 4. Let C be a closed, convex, nonempty set in R^N and F: C → R^N. The normal mapping induced by F and C is

F_C := F∘π_C + I − π_C,

where I is the identity operator on R^N.

Our applications will concern finding a zero of a normal mapping such as F_C above, where C is usually polyhedral convex. If F is smooth, its derivative mapping is denoted ∇F.
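To make Definition 4 concrete (a sketch of ours, assuming NumPy and a box-shaped C, for which the Euclidean projection is a componentwise clip), the normal mapping can be evaluated directly from the projection; taking lo = 0 and hi = +∞ recovers the prototype F_+ of the introduction:

```python
import numpy as np

def proj_box(x, lo, hi):
    # Euclidean projection pi_C onto the box C = [lo, hi].
    return np.clip(x, lo, hi)

def normal_map(F, x, lo, hi):
    # F_C(x) = F(pi_C(x)) + x - pi_C(x)  (Definition 4).
    c = proj_box(x, lo, hi)
    return F(c) + x - c

# Example with C = R^2_+ (lo = 0, hi = inf): normal_map reduces to F_+.
F = lambda z: np.array([2*z[0] + z[1] - 2.0, z[0] + 3*z[1]])
x = np.array([1.0, -1.0])
print(normal_map(F, x, 0.0, np.inf))   # -> [0. 0.]
```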
DEFINITION 5. Let F and C be as in Definition 4. F is continuously differentiable (on C) if the restriction of F to the relative interior of C, ri C, is continuously differentiable and, for each relative boundary point c̄ of C, the limit

∇F(c̄) := lim_{c → c̄, c ∈ ri C} ∇F(c)

exists.

If F, as above, is continuously differentiable (on C) then it is locally Lipschitz, just as in the classical case when C is a linear subspace; hence F_C is locally Lipschitz.
We relate the normal mapping F_C to the set mapping F + N_C.

LEMMA 6. Suppose F, C are as in Definition 4, and F is locally Lipschitz. The mapping F_C is Lipschitzianly invertible near x if and only if for some neighborhoods U, V of π_C(x), F_C(x) respectively, (F + N_C)^{-1} ∩ U is a (single-valued) Lipschitz function when restricted to V.
PROOF. It is well known (Brezis 1973, Example 2.8.2) that c = π_C(x) if and only if c ∈ C and

⟨x − c, c′ − c⟩ ≤ 0,   ∀c′ ∈ C;

thus c = π_C(ξ + c) if and only if ξ ∈ N_C(c). It follows that

(2)   ξ = F_C(x), c = π_C(x)   ⟺   ξ ∈ (F + N_C)(c), x = ξ + (I − F)(c).

So for U ⊂ R^N, (F + N_C)(U) equals F_C(π_C^{-1}(U)) and, for ξ ∈ R^N, we get

(3)   (F + N_C)^{-1}(ξ) ∩ U = [π_C(F_C^{-1}(ξ))] ∩ U.

Let x⁰ ∈ X and ξ⁰ := F_C(x⁰).

Suppose F_C is Lipschitzianly invertible near x⁰, so, for some δ > 0 and neighborhood V⁰ of ξ⁰, the restriction of F_C^{-1} ∩ (x⁰ + 2δB) to V⁰ is a Lipschitz function that maps V⁰ onto x⁰ + 2δB. Let

U := (I − F)^{-1}(x⁰ − ξ⁰ + δB),   V := F_C(π_C^{-1}(U)) ∩ (ξ⁰ + εB),

where ε ∈ (0, δ) is chosen such that ξ⁰ + εB ⊂ V⁰. Observe U is a neighborhood of π_C(x⁰), since I − F is a continuous function which maps π_C(x⁰) to x⁰ − ξ⁰. Also V is a neighborhood of ξ⁰, since U is a neighborhood of π_C(x⁰) and F_C is continuously invertible near x⁰. Moreover for ξ ∈ V we have

x ∈ F_C^{-1}(ξ), π_C(x) ∈ U   ⟹   x = ξ + (I − F)(π_C(x)) ∈ ξ⁰ + εB + x⁰ − ξ⁰ + δB ⊂ x⁰ + 2δB.

Thus

[π_C(F_C^{-1}(ξ))] ∩ U ⊂ π_C[F_C^{-1}(ξ) ∩ (x⁰ + 2δB)].

As the set on the right is a singleton, this statement combined with (3) yields

(F + N_C)^{-1}(ξ) ∩ U = π_C[F_C^{-1}(ξ) ∩ (x⁰ + 2δB)].

In particular (F + N_C)^{-1} ∩ U, as a mapping on V, is a Lipschitz function.

Conversely, suppose the restriction of (F + N_C)^{-1} ∩ U to V is a Lipschitz function, where U and V are respective neighborhoods of π_C(x⁰) and ξ⁰. Let U⁰ be a neighborhood of π_C(x⁰) in U such that F is Lipschitz on U⁰, and V⁰ := (F + N_C)(U⁰) ∩ V. So the restriction of (F + N_C)^{-1} ∩ U⁰ to V⁰ is a Lipschitz function, and V⁰ is a neighborhood of ξ⁰. Let U¹ := π_C^{-1}(U⁰), a neighborhood of x⁰. Using (2), we have

F_C^{-1} ∩ U¹ = I + [I − F]∘[(F + N_C)^{-1} ∩ U⁰].

So the restriction of F_C^{-1} ∩ U¹ to V⁰ is a Lipschitz function, and it follows that F_C is Lipschitzianly invertible near x⁰. ∎
Finally, we restate Robinson (1988, Lemma 2.3), a generalization of the Banach perturbation lemma.

LEMMA 7. Let Ω be a set in X. Let g and g′ be functions from X into Y such that g|_Ω: Ω → g(Ω) is Lipschitzianly invertible of modulus L > 0, and g − g′ is Lipschitz on Ω of modulus η > 0. Let x⁰ ∈ Ω and δ > 0. If
(a) Ω ⊃ x⁰ + δB_X,
(b) g(Ω) ⊃ g(x⁰) + (δ/L)B_Y,
(c) ηL < 1,
then g′|_Ω: Ω → g′(Ω) is Lipschitzianly invertible of modulus L/(1 − ηL) > 0, and

g′(Ω) ⊃ g′(x⁰) + (1 − ηL)(δ/L)B_Y.

PROOF. Define the perturbation function h(·) := g′(·) − g(·) − [g′(x⁰) − g(x⁰)]. Let ĝ := g|_Ω and observe that the Lipschitz property of ĝ^{-1} gives

1/L ≤ 1/sup{‖ĝ^{-1}(y) − ĝ^{-1}(y′)‖/‖y − y′‖ | y, y′ ∈ g(Ω), y ≠ y′}
    = inf{‖g(x) − g(x′)‖/‖x − x′‖ | x, x′ ∈ Ω, x ≠ x′}.

Thus according to Robinson (1988, Lemma 2.3), g + h satisfies

(1/L) − η ≤ inf{‖(g + h)(x) − (g + h)(x′)‖/‖x − x′‖ | x, x′ ∈ Ω, x ≠ x′}

and (g + h)(Ω) contains g(x⁰) + (1 − ηL)(δ/L)B_Y. Therefore (g + h)|_Ω: Ω → (g + h)(Ω) is invertible and, as above, its inverse is Lipschitz of modulus 1/[(1/L) − η] = L/(1 − ηL). The claimed properties of g′ hold because g′ = g + h + g′(x⁰) − g(x⁰). ∎
3. Motivation: Line search damped Newton's method for smooth equations. Let f: X → Y be smooth, that is, continuously differentiable. We wish to solve the nonlinear equation

f(x) = 0,   x ∈ X.

Suppose x^k ∈ X (k ∈ {0, 1, ...}) and there exists the Newton iterate x̂^{k+1}, i.e., x̂^{k+1} solves the equation

f(x^k) + ∇f(x^k)(x − x^k) = 0,   x ∈ X.

Newton's method is inductively given by setting x^{k+1} := x̂^{k+1}.
The algorithm is also called local Newton's method because the Kantorovich-Newton theorem (Ortega and Rheinboldt 1970, Theorem 12.6.2), probably the best known convergence result for Newton's method, shows convergence of the Newton iterates to a solution in a ball of radius δ > 0 about the starting point x⁰. Assumptions include that ∇f(x⁰) is boundedly invertible; then δ is constructed small enough to ensure, by continuity of ∇f and the Banach perturbation lemma, that ∇f(x) is boundedly invertible at each x ∈ x⁰ + δB_X. Further conditions guarantee δ can also be chosen large enough such that the sequence of iterates remains in the δ-ball of x⁰, and convergence follows. It is well known (Burdakov 1980), however, that the domain of convergence of the algorithm can be substantially enlarged using line search damping, reviewed below, which preserves the asymptotic convergence properties of the local method.
Choose line search parameters σ, τ ∈ (0, 1). As we saw in the introduction, the Newton path p^k(t) := x^k + t(x̂^{k+1} − x^k) is such that f(p^k(t)) = (1 − t)f(x^k) + o(t) for t ≥ 0; hence for sufficiently small t > 0 and f(x^k) ≠ 0, we have Monotone Descent of the norm of the residual:

(MD)   ‖f(p^k(t))‖ ≤ (1 − σt)‖f(x^k)‖.
In the Armijo line search familiar in optimization (McCormick 1983, Chapter 6, §1), the step length t_k is defined as τ^l, where l is the smallest integer in {0, 1, ...} such that (MD) holds at t = τ^l. The (line search) damped Newton iterate is x^{k+1} := p^k(t_k).

More recently, in the context of unconstrained optimization, Grippo, Lampariello and Lucidi (1986) have developed a line search using a Nonmonotone Descent condition that often yields better computational results than monotone damping. Let M ∈ ℕ, the memory length of the procedure, and relax the progress criterion (MD) to

(NmD)   ‖f(p^k(t))‖ ≤ (1 − σt) max{‖f(x^{k+1−j})‖ | j = 1, ..., min{M, k + 1}}.

This is identical to (MD) when M = 1. The nonmonotone Armijo line search would determine t_k = τ^l similarly to the monotone case.
We abstract the general properties of an unspecified Nonmonotone Line search procedure:

(NmLs). If (NmD) holds at t = 1, let t_k := 1. Otherwise, choose any t_k ∈ [0, 1] such that (NmD) holds at t = t_k and

t_k ≥ τ sup{T ∈ [0, 1] | (NmD) holds ∀t ∈ [0, T]}.

It is easy to see that the above nonmonotone Armijo line search produces a step length that fulfills (NmLs). The parameter τ need not be explicitly used in damped Newton's algorithm, however, so other line search procedures in which τ is not specified may be valid; only the existence of τ, independent of k, is needed to prove convergence.
The formal algorithm is now given.

Line search damped Newton's method. Given x⁰ ∈ X, the sequence (x^k) is inductively defined for k = 0, 1, ... as follows.
If f(x^k) = 0, stop.
Find x̂^{k+1} := x^k − ∇f(x^k)^{-1}f(x^k).
Line search: Let p^k(t) := x^k + t(x̂^{k+1} − x^k) for t ∈ [0, 1]. Find t_k ∈ [0, 1] satisfying (NmLs).
Define x^{k+1} := p^k(t_k).
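The following sketch (ours, not from the paper; any smooth test function works) implements this algorithm with the nonmonotone Armijo rule determining t_k = τ^l, assuming NumPy and a dense Jacobian:

```python
import numpy as np

def damped_newton(f, jac, x, sigma=1e-4, tau=0.5, M=1, tol=1e-10, max_iter=50):
    """Line search damped Newton's method; the Armijo step t = tau**l is the
    first satisfying (NmD) with memory length M."""
    hist = [np.linalg.norm(f(x))]               # recent residual norms for (NmD)
    for _ in range(max_iter):
        if hist[-1] <= tol:
            break
        step = np.linalg.solve(jac(x), -f(x))   # Newton step x_hat - x
        ref = max(hist[-M:])                    # nonmonotone reference value
        t = 1.0
        while t > 1e-12 and np.linalg.norm(f(x + t * step)) > (1 - sigma * t) * ref:
            t *= tau                            # backtrack along the segment
        x = x + t * step
        hist.append(np.linalg.norm(f(x)))
    return x

# Hypothetical smooth test problem.
f = lambda x: np.array([x[0]**2 + x[1] - 3.0, x[0] + x[1]**3 - 5.0])
jac = lambda x: np.array([[2*x[0], 1.0], [1.0, 3*x[1]**2]])
print(damped_newton(f, jac, np.array([2.0, 2.0])))
```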
We present a basic convergence result for the line search damped, or so-called global, Newton's method. It is a corollary of Proposition 10.

PROPOSITION 8. Let f: R^N → R^N be continuously differentiable, a₀ > 0 and

X₀ := {x ∈ R^N | ‖f(x)‖ ≤ a₀}.

Let σ, τ ∈ (0, 1) and M ∈ ℕ be the line search parameters governing (NmLs). Suppose X₀ is bounded, and ∇f(x) is invertible for each x ∈ X₀. Then for each x⁰ ∈ X₀, line search damped Newton's method is well defined and the sequence (x^k) converges to a zero x* of f.

The residuals converge to zero at least at an R-linear rate: for some constant ρ ∈ (0, 1) and all k ≥ 0,

‖f(x^k)‖ ≤ ρ max{‖f(x^{k−j})‖ | j = 1, ..., min{M, k}} ≤ ρ^k‖f(x⁰)‖.

The rate of convergence of (x^k) to x* is Q-superlinear. In particular, if ∇f is Lipschitz near x* then (x^k) converges Q-quadratically to x*:

‖x^{k+1} − x*‖ ≤ d‖x^k − x*‖²

for some constant d > 0 and all sufficiently large k.
The convergence properties of the method depend both on the uniform accuracy of each of the approximations A_k of f at x^k for all k, i.e.,

‖f(x) − A_k(x)‖/‖x − x^k‖ → 0 as x → x^k (x ≠ x^k), uniformly ∀k,

and on the uniformly bounded invertibility of the approximations A_k:

sup_k ‖∇f(x^k)^{-1}‖ < ∞.

These uniformness properties are disguised in the boundedness (hence compactness) hypothesis on X₀.
4. Path search damped Newton's method for nonsmooth equations. We want to solve the nonlinear and, in general, nonsmooth equation

f(x) = 0,   x ∈ X,

where f: X → Y. We proceed as in the smooth case, the main difference being the use of first-order approximations of the function f instead of linearizations.

Suppose x^k ∈ X (k ∈ {0, 1, ...}) and A_k := 𝒜(x^k), where 𝒜 is a first-order approximation of f on X. Recall (Definition 2) a path is a continuous mapping of the form p: [0, T] → X where T ∈ [0, 1]. Assume there exists a path p^k: [0, 1] → X such that, for t ∈ [0, 1],

(4)   A_k(p^k(t)) = (1 − t)f(x^k);

p^k is the Newton path. Assume also that p^k(t) − x^k = O(t). Note that x̂^{k+1} := p^k(1) is the Newton iterate, a solution of the equation A_k(x) = 0.

In nonsmooth Newton's method, the next iterate is given by x^{k+1} := x̂^{k+1} just as in smooth Newton's method. For nonsmooth functions having a uniform first-order approximation we call this Robinson-Newton's method, since the ideas behind local convergence results are essentially the same as those employed in the seminal paper (Robinson 1988), although a special uniform first-order approximation called the point-based approximation is required there (see discussion after Proposition 10). (Robinson 1988, Theorem 3.2) is the nonsmooth version of the classical Kantorovich-Newton convergence theorem (Ortega and Rheinboldt 1970, Theorem 12.6.2). Applications of Robinson-Newton's method include sequential quadratic programming (Fletcher 1987), or Wilson's method (Wilson 1963) for nonlinear programming; and
Josephy-Newton's method (Josephy 1979) for generalized equations (see also §5). As in the smooth case, however, convergence of the method is shown within a ball of radius δ > 0 about x⁰. The first-order approximation Â := 𝒜(x⁰) of f at x⁰ is assumed to be Lipschitzianly invertible near x⁰ and then, for small enough δ, the continuity properties of 𝒜(·) are used in Lemma 7 to show that 𝒜(x) is Lipschitzianly invertible near each x ∈ x⁰ + δB_X. Further assumptions guarantee that Robinson-Newton's method generates a sequence lying in this δ-ball of x⁰, and convergence follows. We propose to enlarge the domain of convergence using a path search.
Now A_k is a first-order approximation of f at x^k, and o(p^k(t) − x^k) = o(t) assuming p^k(t) − x^k = O(t). With (4) we find that

(5)   f(p^k(t)) = A_k(p^k(t)) + o(p^k(t) − x^k) = (1 − t)f(x^k) + o(t),

i.e., f moves toward zero rapidly on the path p^k as t increases from 0, at least initially.

In the spirit of §3, we fix σ, τ ∈ (0, 1). As before, assuming f(x^k) ≠ 0, we have for all sufficiently small positive t,

‖f(p^k(t))‖ ≤ (1 − σt)‖f(x^k)‖.
So the nonmonotone descent condition below, formally identical to that given in §3, is valid given any memory size M ∈ ℕ and all small positive t:

(NmD)   ‖f(p^k(t))‖ ≤ (1 − σt) max{‖f(x^{k+1−j})‖ | j = 1, ..., min{M, k + 1}}.

The path search is any procedure satisfying: If (NmD) holds at t = 1, let t_k := 1. Otherwise, choose any t_k ∈ [0, 1] such that (NmD) holds at t = t_k and

t_k ≥ τ sup{T ∈ [0, 1] | (NmD) holds ∀t ∈ [0, T]}.

The (path search) damped Newton iterate is x^{k+1} := p^k(t_k). The path search takes t_k = 1, hence the Newton iterate x^{k+1} = p^k(1), if possible; otherwise it takes a path length t_k large enough to prevent premature convergence under further conditions. Also, as in §3, τ need not be specified explicitly.

However, the path search given above is too restrictive in practice; in particular it assumes existence of the Newton iterate x̂^{k+1} = A_k^{-1}(0). Motivated by computation (§5) we only assume the path p^k: [0, T_k] → X can be constructed for some T_k ∈ (0, 1]. The path is constructed by iteratively extending the domain [0, T_k] until either it cannot be extended further (e.g., T_k = 1) or the progress criterion (NmD) is violated at t = T_k. The path length t_k is then chosen with reference to this upper bound T_k.
Of course we are still assuming that the path p^k satisfies the Path conditions:

(P)   p^k(0) = x^k,   A_k(p^k(t)) = (1 − t)f(x^k),   ∀t ∈ dom(p^k),

where dom(p^k) = [0, T_k]. The idea of extending the path p^k is justified by Lemma 3, taking Φ = A_k and x = p^k(T_k), which says that if A_k is continuously invertible near p^k(T_k) and T_k < 1, then p^k can be defined over a larger domain (i.e., T_k is strictly increased) while still satisfying (P).
We use the following Nonmonotone Path search, in which p^k: [0, T_k] → X is supposed to satisfy (P).

(NmPs). If (NmD) holds at t = T_k, let t_k := T_k. Otherwise, choose any t_k ∈ [0, T_k] such that (NmD) holds at t = t_k and

t_k ≥ τ sup{T ∈ [0, T_k] | (NmD) holds ∀t ∈ [0, T]}.
Now we give the algorithm formally. As the choice of t_k may be determined during the construction of p^k (see §5), we have not separated the construction of p^k from the path search in the algorithm.
Path search damped Newton's method. Given x⁰ ∈ X, the sequence (x^k) is inductively defined for k = 0, 1, ... as follows.
If f(x^k) = 0, stop.
Path search: Let A_k := 𝒜(x^k). Construct a path p^k: [0, T_k] → X, T_k ∈ (0, 1], satisfying (P) such that if T_k < 1 then either A_k is not continuously invertible near p^k(T_k), or (NmD) fails at t = T_k. Find t_k ∈ [0, T_k] satisfying (NmPs).
Define x^{k+1} := p^k(t_k).
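A schematic rendering of the algorithm (ours, with helper names of our own choosing): the problem-specific work, constructing p^k and T_k, is delegated to a caller-supplied `make_path(xk)` returning a function p with p(0) = x^k and A_k(p(t)) = (1 − t)f(x^k) on [0, T_k], e.g. built by pivoting as in §5; backtracking in t then realizes a step satisfying (NmPs):

```python
import numpy as np

def path_search_newton(f, make_path, x, sigma=1e-4, tau=0.5, M=1,
                       tol=1e-10, max_iter=50):
    """Schematic path search damped Newton's method. `make_path(xk)` must
    return (p, Tk) with p satisfying the path conditions (P) on [0, Tk]."""
    hist = [np.linalg.norm(f(x))]
    for _ in range(max_iter):
        if hist[-1] <= tol:
            break
        p, Tk = make_path(x)
        ref = max(hist[-M:])                    # nonmonotone reference value
        t = Tk
        # Backtrack along the *path* (not a line segment) until (NmD) holds;
        # the first failed trial bounds sup{T | (NmD) on [0, T]} from above,
        # so the accepted t is at least tau times that supremum, as (NmPs)
        # requires.
        while t > 1e-12 and np.linalg.norm(f(p(t))) > (1 - sigma * t) * ref:
            t *= tau
        x = p(t)
        hist.append(np.linalg.norm(f(x)))
    return x
```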
Our main result, showing the convergence properties of global Newton's method, is now given. The first two assumptions on the first-order approximation 𝒜 correspond, in the smooth case, to uniform continuity of ∇f on X₀ and uniformly bounded invertibility of ∇f(x) for x ∈ X₀, respectively. The purpose of the third, technical assumption is to guarantee the existence of paths used by the algorithm (cf. the continuation property of Rheinboldt (1969)). With the exception of dealing with the paths p^k, the proof uses techniques well developed in standard convergence theory of damped algorithms (e.g., Ortega and Rheinboldt 1970, McCormick 1983, Fletcher 1987).

THEOREM 9. Let f: X → Y be continuous, a₀ > 0 and

X₀ := {x ∈ X | ‖f(x)‖ ≤ a₀}.

Let σ, τ ∈ (0, 1) and M ∈ ℕ be the parameters governing the path search (NmPs). Suppose
(1) 𝒜 is a uniform first-order approximation of f on X₀.
(2) 𝒜(x) is uniformly Lipschitzianly invertible near each x ∈ X₀, meaning for some constants δ, ε, L > 0 and for each x ∈ X₀, there are sets U_x and V_x containing x + δB_X and f(x) + εB_Y respectively, such that 𝒜(x)|_{U_x}: U_x → V_x is Lipschitzianly invertible of modulus L.
(3) For each x ∈ X₀ and T ∈ (0, 1], if p: [0, T) → X is such that p(0) = x and, for each t ∈ [0, T), 𝒜(x)(p(t)) = (1 − t)f(x) and 𝒜(x) is continuously invertible near p(t), then there exists p(T) := lim_{t↑T} p(t) with 𝒜(x)(p(T)) = (1 − T)f(x).

Then for any x⁰ ∈ X₀, path search damped Newton's method is well defined such that the sequence (x^k) converges to a zero x* of f.

The residuals converge to zero at least at an R-linear rate: for some constant ρ ∈ (0, 1) and all k ≥ 0,

‖f(x^k)‖ ≤ ρ max{‖f(x^{k−j})‖ | j = 1, ..., min{M, k}} ≤ ρ^k‖f(x⁰)‖.

The rate of convergence of (x^k) to x* is Q-superlinear; indeed, if for c > 0 and all points x near x* we have ‖𝒜(x)(x*) − f(x*)‖ ≤ c‖x − x*‖², then the rate of convergence is Q-quadratic:

‖x^{k+1} − x*‖ ≤ cL‖x^k − x*‖²

for sufficiently large k.
PROOF. Assume, without loss of generality, that f(x^k) ≠ 0 for each k. Let δ, ε, L be the constants given by Hypothesis (2) of the theorem and, for each k, let A_k be the Lipschitzianly invertible mapping 𝒜(x^k)|_{U_{x^k}}: U_{x^k} → V_{x^k} given there. We may also assume without loss of generality that U_{x^k} is open: replace U_{x^k} by its interior, δ by any δ̂ ∈ (0, δ), ε by any ε̂ ∈ (0, min{ε, δ̂/L}), and V_{x^k} by A_k(U_{x^k}). The only point which may be unclear is that A_k(U_{x^k}) contains A_k(x^k) + ε̂B: let y ∈ A_k(x^k) + ε̂B; then x := A_k^{-1}(y) ∩ U_{x^k} is an element of U_{x^k} such that

‖x − x^k‖ ≤ L‖y − A_k(x^k)‖ < δ̂,

hence y ∈ A_k(U_{x^k}). Let Δ(t) = o(t) be the uniform bound on the accuracy of 𝒜(x), x ∈ X₀, as given by Definition 1; set Δ(0) = 0 for convenience.

We show the algorithm is well defined. Suppose x^k ∈ X₀. Lemma 3 can be used to show existence of a (unique) continuous function p: I → X of largest domain I, with respect to the conditions:
• p(0) = x^k;
• either I = [0, 1], or I = [0, T) for some T ∈ (0, 1]; and
• for each t ∈ I, A_k(p(t)) = (1 − t)f(x^k) and A_k is continuously invertible near p(t).

If I = [0, 1], then p^k := p is a path acceptable to the algorithm. If I = [0, T) then, by Hypothesis (3), we can extend p continuously to domain [0, T] by p(T) := lim_{t↑T} p(t), for which A_k(p(T)) = (1 − T)f(x^k). In this case, by maximality of I, A_k is not continuously invertible near p(T), so the extension p: [0, T] → X is acceptable as p^k. This shows that existence of the path p^k, required by the algorithm, is guaranteed if x^k ∈ X₀. The fact that each x^k belongs to X₀ follows from the inequality

max{‖f(x^{k−j})‖ | j = 1, ..., min{M, k}} ≤ ‖f(x⁰)‖,

which is easily shown by induction.
Recall that p^k: [0, T_k] → X, 0 < T_k ≤ 1, is the path determined by the algorithm. We will show that T_k and the path length t_k are bounded away from zero.

We aim to find a positive constant γ such that for each k,

T_k ≥ S_k := min{γ/‖f(x^k)‖, 1}

and

(6)   p^k(t) = A_k^{-1}((1 − t)f(x^k))   ∀t ∈ [0, S_k],

(7)   (NmD) holds ∀t ∈ [0, S_k].

To show these we need several other facts, the first of which is given by Lemma 3 when Φ := A_k and U := U_{x^k}:

(8)   p^k(t) = A_k^{-1}((1 − t)f(x^k)),   ∀0 ≤ t ≤ min{ε/‖f(x^k)‖, T_k}.
Now if T_k < 1 and (NmD) holds at t = T_k then, by choice of p^k, A_k is not continuously invertible near p^k(T_k); thus

T_k ≥ ε/‖f(x^k)‖.

Another fact is that

(9)   (NmD) holds for 0 ≤ t ≤ min{γ/‖f(x^k)‖, T_k},

where γ ∈ (0, ε] will be specified below. Recall the path search parameter σ ∈ (0, 1). We choose β̄ > 0 such that Δ(β) ≤ β(1 − σ)/L for 0 ≤ β ≤ β̄. Then for

0 ≤ t ≤ min{β̄/(L‖f(x^k)‖), ε/‖f(x^k)‖, T_k}

we have

‖p^k(t) − x^k‖ = ‖A_k^{-1}((1 − t)f(x^k)) − A_k^{-1}(f(x^k))‖   by (8)
  ≤ L‖(1 − t)f(x^k) − f(x^k)‖ = tL‖f(x^k)‖   by Hypothesis (2).

So for such t, ‖p^k(t) − x^k‖ ≤ β̄, hence

(10)   Δ(‖p^k(t) − x^k‖) ≤ ‖p^k(t) − x^k‖(1 − σ)/L ≤ t(1 − σ)‖f(x^k)‖,

and furthermore,

‖f(p^k(t))‖ ≤ ‖A_k(p^k(t))‖ + Δ(‖p^k(t) − x^k‖)   by Hypothesis (1)
  ≤ (1 − t)‖f(x^k)‖ + t(1 − σ)‖f(x^k)‖ = (1 − σt)‖f(x^k)‖   by choice of p^k and (10)
  ≤ (1 − σt) max{‖f(x^{k+1−j})‖ | j = 1, ..., min{M, k + 1}}.

Let γ := min{β̄/L, ε} (∈ (0, ε]). We have verified (9).

If T_k < 1 and (NmD) holds at t = T_k we have already seen that T_k ≥ ε/‖f(x^k)‖; hence T_k ≥ S_k. If (NmD) is violated at t = T_k then, with the statement (9), we see that T_k ≥ γ/‖f(x^k)‖. So we always have T_k ≥ S_k. Statement (6) now immediately follows from (8) because ε ≥ γ and T_k ≥ S_k. Likewise, statement (7) immediately follows from (9).

Next we show that t_k ≥ τS_k for every k (recall τ ∈ (0, 1) is another path search parameter). From the rules of (NmPs), if (NmD) holds at t = T_k then t_k := T_k ≥ S_k > τS_k; otherwise t_k is chosen to satisfy

t_k ≥ τ sup{T ∈ [0, T_k] | (NmD) holds ∀t ∈ [0, T]} ≥ τS_k   by (7).
Since each iterate x^k belongs to X₀, we get

t_k ≥ τ min{γ/‖f(x^k)‖, 1} ≥ t* := τ min{γ/a₀, 1} > 0.

Therefore, by a short induction argument, for each k ≥ 0

(11)   ‖f(x^k)‖ ≤ ρ max{‖f(x^{k−j})‖ | j = 1, ..., min{M, k}} ≤ ρ^k‖f(x⁰)‖,

where ρ := (1 − σt*)^{1/M}. This validates the claim of linear convergence of the residuals. As a result, for some K₁ > 0 and each k ≥ K₁, γ/‖f(x^k)‖ ≥ 1, whence S_k = 1. So p^k(1) = A_k^{-1}(0) (by (6)) and (NmD) holds at t = 1 (by (7)). Also T_k = 1, since 1 = S_k ≤ T_k ≤ 1; hence (NmPs) determines t_k := 1 and the damped Newton iterate is the Newton iterate: x^{k+1} = A_k^{-1}(0). For k ≥ K₁,

‖x^{k+1} − x^k‖ = ‖A_k^{-1}(0) − A_k^{-1}(f(x^k))‖ ≤ L‖0 − f(x^k)‖   by Hypothesis (2)
  ≤ Lρ^k‖f(x⁰)‖   by (11).

So if K, K′ ∈ ℕ, K₁ ≤ K < K′, then

‖x^K − x^{K′}‖ ≤ Σ_{k=K}^{K′−1} ‖x^{k+1} − x^k‖ ≤ Σ_{k=K}^{K′−1} Lρ^k‖f(x⁰)‖ ≤ L‖f(x⁰)‖ρ^K/(1 − ρ) → 0   as K → ∞, K′ > K.

This shows that (x^k) is a Cauchy sequence, hence convergent in the complete normed space X with limit, say, x*. Since ‖f(x^k)‖ → 0, continuity of f yields f(x*) = 0.

Finally note that for all sufficiently large k, x* ∈ x^k + δB_X ⊂ U_{x^k}. For such k ≥ K₁, Hypothesis (2) yields

‖x^{k+1} − x*‖ = ‖A_k^{-1}(0) − A_k^{-1}(A_k(x*))‖ ≤ L‖f(x*) − A_k(x*)‖ ≤ LΔ(‖x^k − x*‖).

The second inequality demonstrates Q-superlinear convergence. With the first inequality we see that if ‖𝒜(x)(x*) − f(x*)‖ ≤ c‖x − x*‖² for some c > 0 and all x near x*, then

‖x^{k+1} − x*‖ ≤ cL‖x^k − x*‖²

for sufficiently large k. ∎
We can say more about the rate of convergence of the residuals: the inequality

‖f(x^k)‖ ≤ ρ max{‖f(x^{k−j})‖ | j = 1, ..., min{M, k}}

shows that convergence is M-step Q-linear, and, if f is Lipschitz near x* as it must be in the smooth case, the residuals converge at a Q-superlinear or Q-quadratic rate.
We will find the following version of Theorem 9 useful in applications (§5.1).

PROPOSITION 10. Let f: R^N → R^N be continuous, a₀ > 0 and

X₀ := {x ∈ R^N | ‖f(x)‖ ≤ a₀}.

Let σ, τ ∈ (0, 1) and M ∈ ℕ be the parameters governing the path search (NmPs). Suppose X₀ is bounded, 𝒜 is a first-order approximation of f on X₀, and for each x ∈ X₀ the following hold:
(1) 𝒜 is a uniform first-order approximation of f near x.
(2) 𝒜(x) is invertible near x, and continuous.
(3) There exists l_x > 0 such that if 𝒜(x) is invertible near any x′ ∈ R^N, then it is Lipschitzianly invertible of modulus l_x near x′.
(4) There are η_x: (0, ∞) → [0, ∞] and a neighborhood U_x of x such that lim_{s↓0} η_x(s) = 0 and 𝒜(x¹) − 𝒜(x²) is Lipschitz of modulus η_x(‖x¹ − x²‖) on U_x, for x¹, x² ∈ U_x.

Then for X = Y = R^N, the hypotheses (and conclusions) of Theorem 9 hold.
PROOF. We first strengthen Hypothesis (2): each x ∈ X₀ has an open neighborhood U_x such that

(2′) For some scalars δ_x, ε_x, L_x > 0 and each x′ ∈ x + δ_xB_X, the mapping 𝒜(x′)|_{U_x}: U_x → 𝒜(x′)(U_x) is Lipschitzianly invertible of modulus L_x, and 𝒜(x′)(U_x) contains f(x′) + ε_xB_Y.

To see this, appeal to Hypotheses (2)-(4). There are neighborhoods U_x and V_x of x and f(x), respectively, and l_x > 0 for which 𝒜(x)|_{U_x}: U_x → V_x is Lipschitzianly invertible of modulus l_x > 0; and there is η_x: [0, ∞) → [0, ∞] such that lim_{s↓0} η_x(s) = 0, η_x(0) = 0, and 𝒜(x¹) − 𝒜(x²) is Lipschitz of modulus η_x(‖x¹ − x²‖) for x¹, x² ∈ U_x. Also 𝒜(x) is continuous. So there is δ_x > 0 such that

x + 2δ_xB_X ⊂ U_x,   𝒜(x)(x + δ_xB_X) + (δ_x/l_x)B_Y ⊂ V_x,   η_x(s) ≤ 1/(2l_x) ∀s ∈ [0, δ_x].

For x′ ∈ x + δ_xB_X, we apply Lemma 7 with g := 𝒜(x), g′ := 𝒜(x′), L := l_x, η := η_x(‖x − x′‖), x⁰ := x′, and Ω := U_x: 𝒜(x′)|_{U_x}: U_x → 𝒜(x′)(U_x) is Lipschitzianly invertible of modulus 2l_x, and 𝒜(x′)(U_x) contains f(x′) + (δ_x/(2l_x))B_Y. Let L_x := 2l_x and ε_x := δ_x/(2l_x). (2′) is confirmed.
By Hypothesis (1) and the above, each x ∈ X₀ has a neighborhood U_x such that for some Δ_x(t) = o(t),

(12)   ‖𝒜(x′)(x″) − f(x″)‖ ≤ Δ_x(‖x′ − x″‖),   x′, x″ ∈ U_x,

and (2′) holds. Assume without loss of generality that U_x contains x + δ_xB_X, and let O_x denote the interior of x + δ_xB_X. Since X₀ is compact we may cover it by finitely many neighborhoods (O_{x_i}) corresponding to a finite sequence (x_i) ⊂ X₀. For each i and x = x_i we have Δ_{x_i}(t) = o(t) such that (12) holds, and scalars δ_{x_i}, ε_{x_i}, L_{x_i} satisfying (2′). Now each x ∈ X₀ is an interior point of O_{x_i} for some i; indeed there exists δ̄ > 0, independent of x, such that x + δ̄B_X ⊂ O_{x_i} for some i; if not, an easy sequential compactness argument on X₀ leads to a contradiction. Define Δ(t) as max_i Δ_{x_i}(t) for 0 ≤ t ≤ δ̄ and ∞ for t > δ̄; then Δ(t) = o(t). Since any two points of X₀ separated by distance less than δ̄ lie in some O_{x_i}, (12) yields

‖𝒜(x′)(x″) − f(x″)‖ ≤ Δ(‖x′ − x″‖)   ∀x′, x″ ∈ X₀,

which confirms Hypothesis (1) of Theorem 9.

Hypothesis (2) of Theorem 9 also holds, with δ := δ̄, ε := min_i ε_{x_i}, and L := max_i L_{x_i}. For each x ∈ X₀ we take U_x := U_{x_i} for i such that x + δ̄B_X ⊂ O_{x_i}, and V_x := 𝒜(x)(U_{x_i}). Note that x + δ̄B_X ⊂ x_i + δ_{x_i}B_X, hence x + δB_X ⊂ U_x and, by (2′), 𝒜(x)|_{U_x}: U_x → V_x is Lipschitzianly invertible of modulus L and V_x ⊃ f(x) + εB_Y.
To show Hypothesis (3) of Theorem 9, let x ∈ X₀, T ∈ (0, 1] and p: [0, T) → X be such that, for each t ∈ [0, T), 𝒜(x)(p(t)) = (1 − t)f(x) and 𝒜(x) is continuously invertible near p(t). We only need show p is Lipschitz on (0, T), say of modulus l, in which case, first, {p(t) | t ∈ (0, T)} is bounded, hence p(t) has an accumulation point y as t ↑ T; secondly, for t ∈ (0, T),

‖p(t) − y‖ ≤ limsup_{s↑T} ‖p(t) − p(s)‖ ≤ l|t − T|,

so lim_{t↑T} p(t) = y; and finally, 𝒜(x)(y) = (1 − T)f(x) by continuity of 𝒜(x).

Let t, t′ ∈ (0, T); without loss of generality assume t′ > t. Let l_x > 0 be the Lipschitz constant given by Hypothesis (3), and choose a neighborhood U of p(t) such that 𝒜(x)^{-1} ∩ U is Lipschitz of modulus l_x near (1 − t)f(x). So there is γ > 0 such that (t − γ, t + γ) is contained in (0, T), and p(s) = 𝒜(x)^{-1}[(1 − s)f(x)] ∩ U for s ∈ (t − γ, t + γ). Then p is Lipschitz of modulus l := l_x‖f(x)‖ on (t − γ, t + γ). As [t, t′] is compact, it can be covered by finitely many open intervals in (0, T), on each of which p is Lipschitz of modulus l. Hence there is a finite sequence t₀ = t < t₁ < ... < t_K = t′ such that p is Lipschitz of modulus l on [t_{k−1}, t_k] for each k = 1, ..., K. We conclude:

‖p(t) − p(t′)‖ ≤ Σ_{k=1}^{K} ‖p(t_{k−1}) − p(t_k)‖ ≤ Σ_{k=1}^{K} (t_k − t_{k−1})l = (t′ − t)l. ∎
Hypotheses (1) and (4) of the proposition specify a relaxed local version of the point-based approximation of f at x, used to define Robinson-Newton's method (Robinson 1988). A point-based approximation of f on Ω ⊂ X is a function 𝒜 on Ω, each value 𝒜(x) of which is itself a mapping from Ω to Y, such that for some K > 0 and every x¹, x² ∈ Ω,
(a) ‖𝒜(x¹)(x²) − f(x²)‖ ≤ (1/2)K‖x¹ − x²‖², and
(b) 𝒜(x¹) − 𝒜(x²) is Lipschitz of modulus K‖x¹ − x²‖.
Suppose 𝒜 is a point-based approximation of f on a neighborhood Ω of x and, for each x¹ ∈ Ω, we extend the domain of 𝒜(x¹) to X by arbitrarily defining the values 𝒜(x¹)(x²) ∈ Y, x² ∈ X\Ω. Then property (a) implies Hypothesis (1), and property (b) implies Hypothesis (4).
Liqun Qi has noted that the conditions of the previous two convergence results preclude from consideration such functions as f: R → R: x ↦ |x|, because f does not have a first-order approximation at 0 that is invertible near 0. This particular example is easily accommodated: Theorem 9 is valid if X₀ is replaced by X₀\f^{-1}(0). A similar recasting of Proposition 10 is possible, though care is needed because compactness of X₀ does not imply compactness of X₀\f^{-1}(0). This leads to a more serious question, however: can we deal with a function that is badly behaved at the current iterate x^k ∈ X₀? In light of Lemma 7, the hypotheses of both Theorem 9 and Proposition 10 imply that f must be Lipschitzianly invertible near each point of X₀ (though we can restrict our attention to points of X₀ at which f is nonzero), hence at x^k. This is an implicit restriction on the kind of mapping to which the analysis applies.
5. Applications.

5.1. Variational inequalities. The Variational Inequality is the problem of finding a vector z ∈ R^N such that

(VI)   z ∈ C,   ⟨F(z), c − z⟩ ≥ 0,   ∀c ∈ C,

where F is a function from C to R^N and C is a nonempty closed convex set in R^N. We will also assume that F is continuously differentiable (see Definition 5) and usually that C is polyhedral. See Harker and Pang (1990) for a survey of variational inequalities which includes analysis, algorithms and applications.

Equivalent to (VI) are the Generalized Equation (Robinson 1979)

(GE)   0 ∈ F(z) + N_C(z),

where N_C(z) is the normal cone to C at z (see §2), and the Normal Equation (Robinson 1992)

(NE)   0 = F(π_C(x)) + x − π_C(x),

where π_C(x) is the Euclidean projection of x to its nearest point in C. So (NE) is just 0 = F_C(x). We will work with (NE) because it is a (nonsmooth) equation.

Note that (VI) and (GE) are, but for notation, identical, whereas the equivalence between each of these two problems and (NE) is indirect: if z solves (VI) or (GE) then x = z − F(z) solves (NE), while if x solves (NE) then z = π_C(x) solves (VI) and (GE). In fact we have a stronger result from Lemma 6, assuming F is locally Lipschitz: F_C is Lipschitzianly invertible near x if and only if for some neighborhoods U, V of π_C(x), F_C(x) respectively, (F + N_C)^{-1} ∩ U is a Lipschitz function when restricted to V. This result has links to strongly regular generalized equations (see Robinson (1980), and proof of Proposition 13).
A first-order approximation of f := F_C at x is obtained by linearizing F about c := π_C(x):

(13)   𝒜(x)(x′) := F(c) + ∇F(c)(π_C(x′) − c) + x′ − π_C(x′),   ∀x′ ∈ R^N.

So 𝒜(x)(·) is ∇F(c)_C(·) shifted by the vector F(c) − ∇F(c)(c). Therefore 𝒜(x) is continuous, and for any x¹, x² ∈ R^N,

‖𝒜(x¹)(x²) − f(x²)‖ ≤ sup_{0≤s≤1} ‖∇F(π_C(x¹)) − ∇F(sπ_C(x¹) + (1 − s)π_C(x²))‖ · ‖π_C(x¹) − π_C(x²)‖

by the vector mean value theorem (Ortega and Rheinboldt 1970, Theorem 3.2.3). Now π_C(x + B) is compact, so ∇F is uniformly continuous on the convex hull of this set, and it follows from the above inequality and Lipschitz continuity of π_C that 𝒜 is a uniform first-order approximation of f near x. Similar arguments ensure that, for s > 0 and

η_x(s) := sup{‖∇F(π_C(ξ¹)) − ∇F(π_C(ξ²))‖ | ξ¹, ξ² ∈ x + B, ‖ξ¹ − ξ²‖ ≤ s},

if x¹, x² ∈ x + B then 𝒜(x¹) − 𝒜(x²) is Lipschitz of modulus η_x(‖x¹ − x²‖), and η_x(s) ↓ 0 as s ↓ 0. We have verified the first and fourth assumptions of Proposition 10 for each x ∈ R^N.
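In code, the approximation (13) is a few lines once π_C is available; the sketch below (ours, not from the paper) specializes to a box C, whose projection is a componentwise clip, and returns A_k = 𝒜(x^k) as a closure suitable for path construction:

```python
import numpy as np

def make_approximation(F, dF, xk, lo, hi):
    """First-order approximation (13) of f = F_C at xk, for a box C = [lo, hi]:
    A(x') = F(c) + dF(c)(pi_C(x') - c) + x' - pi_C(x'), with c = pi_C(xk)."""
    c = np.clip(xk, lo, hi)
    Fc, Jc = F(c), dF(c)        # dF(c) is the Jacobian matrix of F at c
    def A(xp):
        cp = np.clip(xp, lo, hi)
        return Fc + Jc @ (cp - c) + xp - cp
    return A
```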
To apply Proposition 10 we still must find a constant a₀ > 0 such that the level set

X₀ := {x ∈ R^N | ‖f(x)‖ ≤ a₀}

is bounded and only contains points x near which 𝒜(x) is Lipschitzianly invertible, etc. One case we will deal with is when F is strongly monotone of modulus λ > 0, that is,

⟨F(c) − F(c′), c − c′⟩ ≥ λ‖c − c′‖²,   ∀c, c′ ∈ C.

Another case to consider is when C is polyhedral convex; here π_C is piecewise linear, hence, for each x, 𝒜(x) is piecewise linear too. In this situation, Robinson's homeomorphism theorem (Robinson 1992, Theorem 4.3) says 𝒜(x) is a homeomorphism if and only if it is coherently oriented, i.e., its determinants in each of its full dimensional pieces of linearity have the same (nonzero) sign. (Robinson 1992, Theorem 5.2) also provides homeomorphism results near a point x, via the critical cone (Robinson 1987) to C at x, which is the intersection of the tangent cone to C at c = π_C(x), T_C(c), with the linear subspace orthogonal to x − c, (x − c)^⊥. See also Ralph (1993). These results provide testable conditions for the second assumption of Proposition 10.
PROPOSITION 11. Let C be a nonempty closed convex set in R^N, F: C → R^N be continuously differentiable, σ, τ ∈ (0, 1) and M ∈ ℕ be the parameters governing the path search (NmPs), and a₀ > 0. Let X₀ := {x ∈ R^N | ‖F_C(x)‖ ≤ a₀}.

Suppose, in addition, either
(1) F is strongly monotone; or
(2) X₀ is bounded; C is polyhedral; and for each x ∈ X₀, with c := π_C(x), ∇F(c)_C is invertible near x or, equivalently, ∇F(c)_K is coherently oriented, where K := T_C(c) ∩ (x − c)^⊥.

Let path search damped Newton's method for solving F_C(x) = 0 be defined using the first-order approximation (13), starting from x⁰ ∈ X₀. Then the damped Newton iterates x^k converge to a zero x* of F_C. The residuals F_C(x^k) and iterates x^k converge at Q-superlinear rates to zero and x*, respectively; indeed these rates are Q-quadratic if ∇F is Lipschitz near π_C(x*).
PROOF. We have seen that Hypotheses (1) and (4) of Proposition 10 hold. It remains to verify the other hypotheses. The claim of Q-superlinear or Q-quadratic convergence of F_C(x^k) is then explained by the comments after Theorem 9, given that F_C is Lipschitz near x*.
(1) Let F be strongly monotone of modulus λ > 0.

It is well known for such F (Harker and Pang 1990, Corollary 3.2) that the generalized equation

y ∈ F(c) + N_C(c)

has a unique solution c ∈ C for each y ∈ R^N. Suppose cⁱ ∈ C and yⁱ ∈ (F + N_C)(cⁱ), for i = 1, 2. Using the Cauchy-Schwarz inequality, and the definitions of the normal cone and strong monotonicity, we get

‖y¹ − y²‖ ‖c¹ − c²‖ ≥ ⟨y¹ − y², c¹ − c²⟩
  = ⟨y¹ − F(c¹), c¹ − c²⟩ + ⟨y² − F(c²), c² − c¹⟩ + ⟨F(c¹) − F(c²), c¹ − c²⟩
  ≥ λ‖c¹ − c²‖².

Thus ‖y¹ − y²‖ ≥ λ‖c¹ − c²‖. Since (F + N_C)^{-1} maps points to nonempty sets, this bound implies (F + N_C)^{-1} is actually a function on R^N that is Lipschitz (of modulus λ^{-1}). So C₀ := (F + N_C)^{-1}(a₀B) is compact and it follows, by continuity of F, that F(C₀) is compact too. Recall from Lemma 6, equation (2), that y = F_C(x) if and only if x = y − F(c) + c for some c ∈ (F + N_C)^{-1}(y). This yields

X₀ = {y − F(c) + c | y ∈ a₀B, c ∈ (F + N_C)^{-1}(y)} ⊂ a₀B − F(C₀) + C₀,

which shows that X₀ is bounded.

It also follows from strong monotonicity of F that for any x ∈ R^N and c := π_C(x), ∇F(c) is strongly monotone, hence, as above, (∇F(c) + N_C)^{-1} is a Lipschitz mapping. Since ∇F(c) is also Lipschitz on R^N we see that, as in Lemma 6, the normal mapping ∇F(c)_C is Lipschitzianly invertible; hence 𝒜(x) is Lipschitzianly invertible. As 𝒜(x) is also continuous, Hypotheses (2) and (3) of Proposition 10 are satisfied.
(2) Under the assumptions of this proposition, we have seen that, by virtue of Robinson (1992), Hypothesis (2) of Proposition 10 holds. We are also given that X₀ is bounded, so it only remains to show that Hypothesis (3) of Proposition 10 is valid.

Since C is polyhedral convex, 𝒜(x)(·) is piecewise linear for each x ∈ R^N, that is, 𝒜(x) is represented on each of finitely many polyhedral convex sets covering R^N by an affine mapping z ↦ M_iz + b^i, where {M_i} ⊂ R^{N×N} and {b^i} ⊂ R^N. Since 𝒜(x) is invertible near x, at least one of the matrices M_i must be invertible; let

l_x := max{‖M_i^{-1}‖ | M_i is invertible}.

If 𝒜(x) is invertible near some x′ ∈ R^N, then it has a local inverse P: V → U where V is a neighborhood of 𝒜(x)(x′) and U is a neighborhood of x′. We assume without loss of generality that V is convex, and take any y, y′ ∈ V. Now the interval joining P(y) and P(y′) is covered by finitely many subintervals {I_j} on each of which 𝒜(x) is represented by some invertible mapping M_iz + b^i. Hence the interval joining y and y′ is covered by the subintervals {𝒜(x)(I_j)}, on each of which P is Lipschitz of modulus l_x. It follows that ‖P(y) − P(y′)‖ ≤ l_x‖y − y′‖. ∎
Fukushima (1992), Taji, Fukushima and Ibaraki (1993), and Wu, Florian and Marcotte (1990) show convergence of line search damped schemes based on Josephy-Newton's method (Josephy (1979) and below) for strongly monotone variational inequalities. Superlinear convergence is shown under the additional assumptions of polyhedrality of C, and a nondegeneracy condition implying smoothness of F_C at the solution point to which the iterates converge. When F is monotone, but not necessarily strongly monotone, and C is compact as well as being convex and nonempty, Marcotte and Dussault (1987) give a globally convergent damped Josephy-Newton's method; in this situation, however, it seems that neither our analysis nor the analysis of Fukushima (1992), Taji, Fukushima and Ibaraki (1993) and Wu, Florian and Marcotte (1990) applies.
Josephy-Newton's method. In Josephy-Newton's method (Josephy 1979) for solving (GE), given the kth iterate c^k, the next iterate is defined to be a solution c^{k+1} of the linearized generalized equation

    0 ∈ F(c^k) + ∇F(c^k)(c − c^k) + N_C(c),  c ∈ R^N.

The equivalence between Josephy-Newton's method on (GE) and Robinson-Newton's method on the associated (NE) is well known: if x^k is such that π_C(x^k) = c^k and

    x̂^{k+1} =def c^{k+1} − F(c^k) − ∇F(c^k)(c^{k+1} − c^k),

then x̂^{k+1} is a zero of 𝒜(x^k)(·) and c^{k+1} = π_C(x̂^{k+1}). Hence the Josephy-Newton iterate is the Euclidean projection onto C of the Robinson-Newton iterate. Path search damping of Robinson-Newton's method produces x^{k+1} on the Newton path p^k from x^k to x̂^{k+1}. The projection π_C(x^{k+1}), on a path from c^k to c^{k+1}, is a damped Josephy-Newton iterate.
For practical purposes, suppose C is polyhedral convex. Let us describe the construction of the path p^k given the current iterate x^k. From (13), with c^k =def π_C(x^k),

    A^k(x) =def 𝒜(x^k)(x) = F(c^k) + ∇F(c^k)(π_C(x) − c^k) + x − π_C(x).
As A^k is piecewise linear, p^k is also piecewise linear. We construct p^k, piece by affine piece, using a pivotal method. Starting from t = 0 (p^k(0) = x^k), ignoring degeneracy, each pivot increases t to the next breakpoint in the derivative of p^k while maintaining the equation

    A^k(p^k(t)) = (1 − t)f(x^k),

thereby extending the domain of p^k. We continue to pivot so long as pivoting is possible and our latest breakpoint t satisfies the nonmonotone descent condition (NmD). If, after a pivot, (NmD) fails, then we line search on the interval [p^k(t_old), p^k(t)] to find x^{k+1}, where t_old is the value of t at the previous breakpoint. The line search makes sense here because p^k is affine between successive breakpoints, hence affine on [t_old, t]. It is easy to see that the Armijo line search applied with parameters σ, τ ∈ (0, 1) produces t_k ∈ [t_old, t] that fulfills (NmPs). On the other hand, if (NmD) holds at every breakpoint t then we must eventually stop because t = 1 or further pivots are not possible (i.e., A^k is not continuously invertible at p^k(t)). In this case we take t_k =def t and x^{k+1} =def p^k(t).
This procedure can be thought of as a forward path search: we construct the path as t moves forward from 0 to t_k, testing (NmD) after each pivot that extends the path. This is an efficient strategy if (NmD) fails after the first few pivots. A backward path search extends the path as far as possible, that is until t = 1 or further pivots are not possible, while recording each pivot point p^k(t) and the corresponding value of t. If (NmD) holds at the final value of t, i.e., at t_k, take x^{k+1} =def p^k(t_k); otherwise search through the list of (p^k(t), t) pairs until one, with t = t_1 say, is found such that t = t_1 satisfies (NmD), but (NmD) fails at the next value of t seen in the list, say t = t_2. A line search is now performed on the interval [p^k(t_1), p^k(t_2)].
For the sake of generality, we mention a class of nonsmooth functions with "natural" first-order approximations that contains each normal mapping f = F_C such that F is smooth and C is closed, convex and nonempty. Consider the class of functions of the form

    f = H ∘ h,

where D ⊂ R^K, h: R^N → D is locally Lipschitz, and H: D → R^N is smooth (e.g., D is convex and H is continuously differentiable). A subclass of this class was highlighted in Robinson (1988) in the context of point-based approximations. Similar to above, it is easy to see that

    𝒜(x)(x') =def H(h(x)) + ∇H(h(x))[h(x') − h(x)],  x, x' ∈ R^N,

defines a first-order approximation of f on R^N satisfying Assumptions (1) and (4) of Proposition 10. F_C is given by setting K =def 2N, D =def C × R^N, h(x) =def (π_C(x), x − π_C(x)) and H(a, b) =def F(a) + b.
5.2. Nonlinear complementarity problems. We now confine our attention to C =def R^N_+, the nonnegative orthant in R^N. The problems (VI), (GE) and (NE) are equivalent to the NonLinear Complementarity Problem:

    (NLCP)  z, F(z) ≥ 0,  ⟨z, F(z)⟩ = 0,

where vector inequalities are taken componentwise. The normal mapping induced by F and R^N_+ is denoted by F_+, so F_+ = F_{R^N_+}. The normal equation form of (NLCP) is F_+(x) = 0 or

    F(x_+) + x − x_+ = 0,

where x_+ denotes π_{R^N_+}(x). The associated first-order approximation (13) becomes

    (14)  𝒜(x)(x') =def F(x_+) + ∇F(x_+)(x'_+ − x_+) + x' − x'_+,  ∀x' ∈ R^N.
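As a transcription of these formulas (a minimal numpy sketch; the names plus, normal_map and approx_14 are our own, for illustration only):

    import numpy as np

    def plus(x):
        # x_+ : projection of x onto the nonnegative orthant
        return np.maximum(x, 0.0)

    def normal_map(F, x):
        # F_+(x) = F(x_+) + x - x_+
        return F(plus(x)) + x - plus(x)

    def approx_14(F, dF, x, xp):
        # A(x)(x') from (14); dF(c) is the Jacobian of F at c
        return F(plus(x)) + dF(plus(x)) @ (plus(xp) - plus(x)) + xp - plus(xp)

Note that approx_14 agrees with normal_map at x' = x, which is the starting point of the Newton path at each iteration.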
(NLCP) has many applications, for example in nonlinear programming (§5.3) and
economic equilibria problems (Harker and Pang 1990, Harker and Xiao 1990, Pang
and Gabriel 1993).
More notation is needed. Let I be the identity matrix in R^{N×N}. Given a matrix M ∈ R^{N×N} and index sets J, J' ⊂ {1,…, N}, let M_{J,J'} be the (possibly vacuous) submatrix of M of elements M_{i,j} where (i, j) ∈ J × J'. Also let \J be the complement of J, {1,…, N} \ J, and M/M_{J,J} be the Schur complement of M_{J,J} with respect to M,

    M/M_{J,J} =def  M_{\J,\J} − M_{\J,J} M_{J,J}^{-1} M_{J,\J}   if ∅ ≠ J ≠ {1,…, N},
                    M                                             if J = ∅,
                    vacuous                                       if J = {1,…, N},

assuming M_{J,J} is invertible or vacuous. Also, M is a P-matrix if it has positive principal minors.
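In computational terms, these two definitions can be transcribed as follows (a numpy sketch under our own naming; enumerating all principal minors is exponential in N and serves only to pin down the definitions, not as a practical test):

    import numpy as np
    from itertools import combinations

    def schur_complement(M, J):
        # M / M_{J,J} for an index list J, assuming M_{J,J} is invertible.
        n = M.shape[0]
        if len(J) == 0:
            return M                          # J empty: the complement is M itself
        if len(J) == n:
            return np.empty((0, 0))           # J = {1,...,N}: vacuous
        Jc = [i for i in range(n) if i not in J]      # the complement \J
        A = M[np.ix_(Jc, Jc)]; B = M[np.ix_(Jc, J)]
        C = M[np.ix_(J, Jc)]; D = M[np.ix_(J, J)]
        return A - B @ np.linalg.solve(D, C)

    def is_P_matrix(M):
        # All principal minors positive (the definition used in the text).
        n = M.shape[0]
        return all(np.linalg.det(M[np.ix_(K, K)]) > 0
                   for r in range(1, n + 1) for K in combinations(range(n), r))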
PROPOSITION 12. Let F: R^N → R^N be continuously differentiable, and σ, τ ∈ (0, 1), M ∈ ℕ be the parameters governing the path search (NmPs). Suppose a_0 > 0 and

    X_0 =def {x ∈ R^N : ‖F_+(x)‖ ≤ a_0}

is bounded. Suppose for each x ∈ X_0 the normal mapping ∇F(x_+)_+ is Lipschitzianly invertible near x or, equivalently, the following (possibly vacuous) conditions hold:

    ∇F(x_+)_{J,J} is invertible, where J =def {i : x_i > 0};
    ∇F(x_+)_{J̄,J̄}/∇F(x_+)_{J,J} is a P-matrix, where J̄ =def {i : x_i ≥ 0}.

Let path search damped Newton's method for solving F_+(x) = 0 be defined using the first-order approximation (14). Then for any x^0 ∈ X_0, the damped Newton iterates x^k converge to a zero x* of F_+. The residuals F_+(x^k) and iterates x^k converge at Q-superlinear rates to zero and x*, respectively; indeed these rates are Q-quadratic if ∇F is Lipschitz near x*.
PROOF. If the above conditions on ∇F(x_+) are equivalent to the assertion that ∇F(x_+)_+ is Lipschitzianly invertible near x, then the result is a corollary of Proposition 11(2). The required equivalence follows from Robinson (1992, Proposition 4.1 and Theorem 5.2). ∎
This result is similar in statement to Harker and Xiao (1990, Theorem 3) on line
search damped B-Newton's method, but the conclusions of Harker and Xiao (1990)
are weaker. The convergence results of Harker and Xiao (1990) are derived directly
from Pang (1990), hence, as noted in the Introduction, require existence of a limit
point of the damped Newton iterates at which ‖F_+(·)‖² is differentiable. Note that
no differentiability properties of the residual, or of its norm squared, are needed in
Proposition 12.
On a practical level, let us compare the damped method used in Proposition 12 with line search damped B-Newton's method. In the latter case we use the approximation

    B^k(x) =def F_+(x^k) + F'_+(x^k; x − x^k),

where the B-derivative F'_+(x^k; d) equals ∇F(x^k_+)d̄ + d − d̄, and d̄_i is d_i if x^k_i > 0, max{d_i, 0} if x^k_i = 0, and 0 otherwise. Solving B^k(x) = 0 involves factoring out the linear part of F'_+(x^k; ·), which corresponds to the nonzero components of x^k, and solving the remaining piecewise linear problem of reduced size. On one hand, this generally requires less work per iteration than using the full piecewise linear approximation (14) to carry out a path search. On the other hand, since the approximation B^k of F_+ is generally not as accurate as A^k, damped B-Newton's method may require more iterations to reduce the size of the residual to a given level.
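A transcription of the B-derivative just described (a minimal numpy sketch; the name b_derivative is ours):

    import numpy as np

    def b_derivative(dF_at_xk_plus, xk, d):
        # F'_+(x^k; d) = dF(x^k_+) dbar + d - dbar, where dbar_i is
        # d_i if x^k_i > 0, max(d_i, 0) if x^k_i = 0, and 0 otherwise.
        dbar = np.where(xk > 0, d, np.where(xk == 0, np.maximum(d, 0.0), 0.0))
        return dF_at_xk_plus @ dbar + d - dbar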
5.3. Nonlinear programming. Our computational examples are all optimization problems in NonLinear Programming. A general form of the nonlinear programming problem is

    min θ(z) subject to z ∈ D, g(z) = 0,

where D is a nonempty polyhedral convex set in R^n, and θ: D → R and g: D → R^m are smooth functions. Under a constraint qualification the standard first-order conditions necessary for optimality of this problem are of the form (GE), where N =def n + m,

    C =def D × R^m,  F(z, y) =def (∇θ(z) + ∇g(z)^T y, g(z)),  ∀(z, y) ∈ D × R^m.
See, for example, Robinson (1983, §1 and Theorem 3.2). Variable-multiplier pairs (z, y) satisfying the first-order conditions can be rewritten as solutions of (NE) for these F and C (Park 1989, Chapter 3, §4). We will confine ourselves to a more restrictive class of nonlinear programs which contains our computational examples:

    (NLP)  min θ(z) subject to z ≥ 0, g(z) ≤ 0.

Again under a constraint qualification such as the Mangasarian-Fromovitz condition (Mangasarian 1969, 11.3.5, McCormick 1983, 10.2.16) the first-order conditions necessary for optimality of (NLP) can be written as either (NLCP), or the corresponding normal equation, where N =def n + m and

    (15)  F(z, y) =def (∇θ(z) + ∇g(z)^T y, −g(z)),  ∀(z, y) ∈ R^n_+ × R^m_+.
Kojima (1980) introduced the normal equation formulation for programs with (nonlinear) inequality and equality constraints.
Wilson's method. In Wilson's method (Wilson 1963, Fletcher 1987), also known as sequential quadratic programming (SQP), given the kth variable-multiplier pair (a^k, b^k) ∈ R^{n+m}, the next iterate is defined as the optimal variable-multiplier pair (a^{k+1}, b^{k+1}) for the quadratic program:

    min_{a ∈ R^n}  θ(a^k) + ⟨∇θ(a^k), a − a^k⟩ + (1/2)⟨a − a^k, ∇²[θ + (b^k)^T g](a^k)(a − a^k)⟩
    subject to  a ≥ 0,  g(a^k) + ∇g(a^k)(a − a^k) ≤ 0.

Suppose x^k = (z^k, y^k) ∈ R^{n+m} satisfies (z^k_+, y^k_+) = (a^k, b^k), F is given by (15), and 𝒜(x^k) is given by (14). The SQP iterate (a^{k+1}, b^{k+1}) satisfies the first-order condition for the quadratic program, which is equivalent, by previous discussion, to saying the point

    x̂^{k+1} = (ẑ^{k+1}, ŷ^{k+1}) =def (a^{k+1}, b^{k+1}) − F(a^k, b^k) − ∇F(a^k, b^k)(a^{k+1} − a^k, b^{k+1} − b^k)

is a zero of 𝒜(x^k). Also (ẑ^{k+1}_+, ŷ^{k+1}_+) = (a^{k+1}, b^{k+1}). A path search applied to Robinson-Newton's method for (NE) determines the next iterate x^{k+1} = (z^{k+1}, y^{k+1}) as a point on the Newton path p^k from x^k to x̂^{k+1}. The nonnegative part (z^{k+1}_+, y^{k+1}_+), a point on the path from (a^k, b^k) to (a^{k+1}, b^{k+1}), is a damped iterate for SQP.
There are standard conditions that jointly ensure local uniqueness of an optimal variable-multiplier pair (z̄, ȳ) ≥ 0 of (NLP), hence of the solution (also (z̄, ȳ)) of the associated system (NLCP), and of the solution (z*, y*) = (z̄, ȳ) − F(z̄, ȳ) of the associated equation (NE). At nonsolution points (z, y) of (NE), analogous conditions will guarantee local invertibility of F_+ = F_{R^{n+m}_+} and its first-order approximation (14) at (z, y).
To specify these conditions at (z, y) ∈ R^{n+m}, let

    (16)  Z =def {i : z_i > 0},  Z̄ =def {i : z_i ≥ 0},  Y =def {l : y_l > 0},  Ȳ =def {l : y_l ≥ 0},

and |Z| be the cardinality of Z, etc. We present the conditions of Linear Independence of binding constraint gradients at (z, y):

    (LI)  ∇g(z_+)_{Ȳ,Z} has linearly independent rows,

and Strong Second-Order Sufficiency at (z, y):

    (SSOS)  if 0 ≠ d ∈ R^{|Z̄|} and ∇g(z_+)_{Y,Z̄} d = 0, then d^T [∇²(θ + (y_+)^T g)(z_+)]_{Z̄,Z̄} d > 0.

We note that if (z, y) is a zero of F_+ then (LI) and (SSOS) correspond to more familiar conditions (e.g., Robinson 1980, §4) defined with respect to (z_+, y_+) rather than (z, y). This connection is explored further in the proof of our next result.
PROPOSITION 13. Let θ: R^n → R, g: R^n → R^m be twice continuously differentiable functions, and F be given by (15). Let σ, τ ∈ (0, 1) and M ∈ ℕ be the parameters governing the path search (NmPs). Suppose a_0 > 0,

    X_0 =def {(z, y) ∈ R^{n+m} : ‖F_+(z, y)‖ ≤ a_0}

is bounded, and the above (LI) and (SSOS) conditions hold at each (z, y) ∈ X_0.
If path search damped Newton's method for solving F_+(z, y) = 0 is defined using the first-order approximation (14), where x =def (z, y), then for any (z^0, y^0) ∈ X_0, the damped Newton iterates (z^k, y^k) converge to a zero (z*, y*) of F_+ such that z*_+ is a local minimizer of (NLP). The residuals F_+(z^k, y^k) and the iterates (z^k, y^k) converge at Q-superlinear rates to zero and (z*, y*), respectively; indeed these rates are Q-quadratic if ∇²θ and ∇²g are Lipschitz near z*_+.
PROOF. This is essentially a corollary of Proposition 12. Most of the proof is devoted to showing that the (LI) and (SSOS) conditions at a given point (z^0, y^0) ∈ R^n × R^m are sufficient for ∇F(z^0_+, y^0_+)_+ to be Lipschitzianly invertible near that point. Below, I_n denotes the n × n identity matrix; and the fact that, for u, z ∈ R^n,

    0 ∈ u + N_{R^n_+}(z)  ⟺  (u, z) ≥ 0, ⟨u, z⟩ = 0  ⟺  0 ∈ z + N_{R^n_+}(u),

will be used without reference. Also Robinson (1980) is needed, so generalized equations rather than normal equations are stressed.
Let x^0 =def (z^0, y^0) and u^0 =def z^0_+ − z^0 (so that z^0_+ − u^0 = z^0). Define

    ξ^0 = (ξ^0_z, ξ^0_y) =def F_+(x^0).
Consider the nonlinear program, a perturbed version of (NLP):

    (NLP~)  min θ̃(z) subject to z ≥ 0, g̃(z) ≤ 0,

where θ̃(z) =def θ(z) − ⟨ξ^0_z, z⟩ and g̃(z) =def g(z) + ξ^0_y. The first-order optimality condition (Robinson 1980) for this problem is the perturbed generalized equation

    (GE~)  0 ∈ F̃(z, y, u) + N_{R^n × R^{m+n}_+}(z, y, u),

where

    F̃(z, y, u) =def (∇θ̃(z) + ∇g̃(z)^T y − u, −g̃(z), z).

Now

    F̃(z^0_+, y^0_+, u^0) = (0, y^0_+ − y^0, z^0_+) ∈ −N_{R^n × R^{m+n}_+}(z^0_+, y^0_+, u^0),

i.e., (z^0_+, y^0_+, u^0) solves (GE~). This generalized equation is said to be strongly regular (Robinson 1980) at (z^0_+, y^0_+, u^0) if the linearized set mapping

    T(z, y, u) =def F̃(z^0_+, y^0_+, u^0) + ∇F̃(z^0_+, y^0_+, u^0)(z − z^0_+, y − y^0_+, u − u^0) + N_{R^n × R^{m+n}_+}(z, y, u)

is such that for some neighborhoods U of (z^0_+, y^0_+, u^0) and V of 0 ∈ R^{2n+m}, T^{-1}(·) ∩ U is a Lipschitz mapping when restricted to V.
Given a subset 𝒦 of the row indices of a matrix M, let M_𝒦 denote the submatrix consisting of the rows of M with indices in 𝒦. According to Robinson (1980, Theorem 4.1), it is sufficient for strong regularity of (GE~) at (z^0_+, y^0_+, u^0) that two conditions hold, where, writing the constraints of (NLP~) as ĝ(z) =def (g̃(z), −z) ≤ 0 with multiplier vector w̄ =def (y^0_+, u^0),

    𝒦 =def {i : ĝ_i(z^0_+) = 0, w̄_i > 0},  𝒦̄ =def {i : ĝ_i(z^0_+) = 0}.

The first is linear independence of the binding constraints:

    ∇ĝ(z^0_+)_𝒦̄ has full row rank;

and the second, strong second-order sufficiency:

    if ζ ∈ R^n \ {0} and ∇ĝ(z^0_+)_𝒦 ζ = 0, then ζ^T L ζ > 0,

where

    L =def ∇²(θ̃ + (y^0_+)^T g̃)(z^0_+) = ∇²(θ + (y^0_+)^T g)(z^0_+).

In terms of g and the index sets (16) at (z^0, y^0), and \Z =def {1,…, n} \ Z, the linear independence condition is that

    ( ∇g(z^0_+)_Ȳ )
    ( −I_{\Z}     )   has linearly independent rows,

hence is equivalent to (LI) at (z^0, y^0). Likewise, the strong second-order sufficient condition may be rewritten

    if ζ ∈ R^n \ {0}, ζ_{\Z̄} = 0 and ∇g(z^0_+)_Y ζ = 0, then ζ^T L ζ > 0,

which clearly is equivalent to (SSOS) at (z^0, y^0). Therefore the conditions (LI) and (SSOS) at (z^0, y^0) guarantee strong regularity of (GE~) at (z^0_+, y^0_+, u^0).
Furthermore, assuming (z*, y*) is a zero of F_+, if (z^0, y^0) = (z*, y*) then ξ^0 = (0, 0) and (NLP~) is just the original problem (NLP). So, as shown above, (LI) and (SSOS) at (z*, y*) are respectively equivalent to the usual conditions of linear independence and strong second-order sufficiency for (NLP) with respect to the variable z*_+, the multipliers y*_+ corresponding to the nonlinear constraints g(z) ≤ 0, and the multipliers z*_+ − z* corresponding to the constraints z ≥ 0. Applying McCormick (1983, 10.4.1) we find that z*_+ is a strict local minimizer of (NLP).
We now show that strong regularity of (GE~) implies ∇F(z^0_+, y^0_+)_+ is Lipschitzianly invertible near (z^0, y^0). Let M =def ∇F(z^0_+, y^0_+) and M̃ =def ∇F̃(z^0_+, y^0_+, u^0). Observe that

    M = ( L    G^T )       M̃ = ( L    G^T   −I_n )
        ( −G   0   ),           ( −G   0     0   )
                                ( I_n  0     0   ),

where L = ∇²(θ + (y^0_+)^T g)(z^0_+) is as above and G =def ∇g(z^0_+). Define

    ζ^0 =def M(z^0_+, y^0_+) + (z^0 − z^0_+, y^0 − y^0_+);
    ξ̃^0 =def (0, y^0_+ − y^0, z^0_+) − M̃(z^0_+, y^0_+, u^0).

Note T(·) = ξ̃^0 + (M̃ + N_{R^n × R^{m+n}_+})(·) and −ξ̃^0 = (ζ^0, 0). For any (z, y) and (ẑ, ŷ) in R^n × R^m,

    (ẑ, ŷ) ∈ (M + N_{R^{n+m}_+})(z, y)
      ⟺ there is u ∈ R^n with ẑ = Lz + G^T y − u, ŷ ∈ −Gz + N_{R^m_+}(y), 0 ∈ z + N_{R^n_+}(u)
      ⟺ there is u ∈ R^n with (ẑ, ŷ, 0) ∈ (M̃ + N_{R^n × R^{m+n}_+})(z, y, u).
Hence if Ũ, Ṽ are respective neighborhoods of (z^0_+, y^0_+, u^0) and (ζ^0, 0) such that (M̃ + N_{R^n × R^{m+n}_+})^{-1} ∩ Ũ is a Lipschitz mapping when restricted to Ṽ, then (M + N_{R^{n+m}_+})^{-1} ∩ U is a Lipschitz map when restricted to V, where

    U =def {(z, y) ∈ R^{n+m} : (z, y, u) ∈ Ũ for some u ∈ R^n},
    V =def {(ẑ, ŷ) ∈ R^{n+m} : (ẑ, ŷ, 0) ∈ Ṽ}.

In this case U and V are neighborhoods of (z^0_+, y^0_+) and ζ^0 respectively, so Lemma 6 (and equation (2)) assures us that M_+ is also Lipschitzianly invertible near

    ζ^0 + (I − M)(z^0_+, y^0_+) = (z^0, y^0)  (= ξ^0 + (I − F)(z^0_+, y^0_+)).
To summarize: (z^0_+, y^0_+, u^0) solves (GE~) and (LI) and (SSOS) hold at (z^0, y^0)

    ⟹ T^{-1} ∩ U is Lipschitz when restricted to V, for some neighborhoods U, V of (z^0_+, y^0_+, u^0) and 0 ∈ R^{2n+m} respectively

    ⟺ (M̃ + N_{R^n × R^{m+n}_+})^{-1} ∩ Ũ is Lipschitz when restricted to Ṽ, for some neighborhoods Ũ, Ṽ of (z^0_+, y^0_+, u^0) and (ζ^0, 0) respectively

    ⟺ (M + N_{R^{n+m}_+})^{-1} ∩ U is Lipschitz when restricted to V, for some neighborhoods U, V of (z^0_+, y^0_+) and ζ^0 respectively

    ⟺ M_+ is Lipschitzianly invertible near (z^0, y^0).

The proof is complete. ∎
We show that Proposition 13 applies to convex programs.

LEMMA 14. Let θ, g, F, σ, τ and M be as in Proposition 13. Let θ and each component function g_l (l = 1,…, m) of g be convex. Suppose there is z̃ > 0 with g(z̃) < 0 (Slater constraint qualification) and (NLP) has a (global) minimum at z̄ ∈ R^n_+. Then there exists ȳ ∈ R^m_+ such that (z*, y*) =def (z̄, ȳ) − F(z̄, ȳ) is a zero of F_+. If (LI) and (SSOS) hold at (z*, y*) then there is a_0 > 0 such that

    X_0 =def {(z, y) ∈ R^{n+m} : ‖F_+(z, y)‖ ≤ a_0}

is bounded, and ∇F(z_+, y_+)_+ is Lipschitzianly invertible near each (z, y) ∈ X_0.
PROOF. It is a classical result in optimization (Mangasarian 1969, 7.3.7, McCormick 1983, 10.2.16) that there exists a Lagrange multiplier (ȳ, ū) ∈ R^m_+ × R^n_+ such that

    0 = ∇θ(z̄) + ∇g(z̄)^T ȳ − ū,
    0 ≤ ū, ⟨ū, z̄⟩ = 0,
    0 ≤ ȳ, ⟨ȳ, g(z̄)⟩ = 0.

Since z̄ is also feasible (z̄ ≥ 0 and g(z̄) ≤ 0), (z, y, u) = (z̄, ȳ, ū) is a solution of

    (GE*)  0 ∈ (F* + N_{R^n × R^{m+n}_+})(z, y, u),

where F*(z, y, u) =def (F(z, y) − (u, 0), z). (GE*) is just the generalized equation (GE~) from the proof of Proposition 13 when ξ^0 is 0 ∈ R^{n+m}. It is easy to see that (z̄, ȳ) solves (GE),

    0 ∈ (F + N_{R^{n+m}_+})(z̄, ȳ),

hence (z*, y*) =def (z̄, ȳ) − F(z̄, ȳ) is a zero of F_+; and (z*_+, y*_+, z*_+ − z*) equals (z̄, ȳ, ū), a solution of (GE*).
Since (LI) and (SSOS) hold at x* =def (z*, y*), the proof of Proposition 13 demonstrates that ∇F(x*_+)_+ is Lipschitzianly invertible near x*. Lemma 7 with g =def ∇F(x*_+)_+ and g' =def F_+ can be used to show that F_+ is Lipschitzianly invertible near x*. Let U* be a neighborhood of x* and a* > 0 be such that

    F_+|_{U*}: U* → a*𝔹

is Lipschitzianly invertible. So U* is bounded. For any element x^0 of U*, another application of Lemma 7, this time with g =def F_+ and g' =def ∇F(x^0_+)_+, shows that ∇F(x^0_+)_+ is Lipschitzianly invertible near x^0. It only remains to be seen that for some a_0 > 0,

    X_0 = F_+^{-1}(a_0𝔹) ⊂ U*,

because then X_0 is bounded and, for each x^0 ∈ X_0, ∇F(x^0_+)_+ is Lipschitzianly invertible near x^0.
It is a basic exercise in convex analysis to show that F is monotone on R^{n+m}_+:

    ⟨F(x) − F(x'), x − x'⟩ ≥ 0,  ∀x, x' ∈ R^{n+m}_+.

By Brezis (1973, Proposition 2.10), F + N_{R^{n+m}_+} is a maximal monotone operator, meaning there is no monotone operator whose graph strictly contains the graph of F + N_{R^{n+m}_+}. Hence (F + N_{R^{n+m}_+})^{-1} is also maximal monotone and, as a result (see Brezis 1973, Example 2.1.2), (F + N_{R^{n+m}_+})^{-1}(ξ) is a convex set for each ξ ∈ R^{n+m}.
We have established that F_+ is Lipschitzianly invertible near x*, so Lemma 6 yields neighborhoods U^1 of x*_+ and V^1 of 0 ∈ R^{n+m} such that (F + N_{R^{n+m}_+})^{-1} ∩ U^1 is a Lipschitz function when restricted to V^1. We may assume V^1 ⊂ a*𝔹 and U^1 is open. Let ξ ∈ V^1, x be the unique point of (F + N_{R^{n+m}_+})^{-1}(ξ) ∩ U^1, and x' ∈ (F + N_{R^{n+m}_+})^{-1}(ξ); x' need not lie in U^1. Then (1 − t)x + tx' lies in (F + N_{R^{n+m}_+})^{-1}(ξ) ∩ U^1 for all sufficiently small t ∈ [0, 1], because (F + N_{R^{n+m}_+})^{-1}(ξ) is convex and U^1 is open. This contradicts uniqueness of x in U^1, unless x' = x. Therefore (F + N_{R^{n+m}_+})^{-1}|_{V^1} is a function. It follows, by using (2) from the proof of Lemma 6, that F_+^{-1}|_{V^1} is also a function. Recall F_+^{-1} ∩ U* is a function when restricted to a*𝔹, and a*𝔹 contains V^1; therefore F_+^{-1}|_{V^1} ∩ U* and F_+^{-1}|_{V^1} are identical, in particular F_+^{-1}(V^1) ⊂ U*. To conclude, choose a_0 > 0 such that a_0𝔹 ⊂ V^1, and observe

    X_0 = F_+^{-1}(a_0𝔹) ⊂ F_+^{-1}(V^1) ⊂ U*. ∎
5.4. Implementation of the path search. We will outline how the Newton path p^k is obtained computationally at iteration k + 1, for solving (NE) when C =def R^N_+. We are actually thinking of solving (NLP), that is letting N =def n + m and F be given by (15). First Lemke's algorithm (Cottle and Dantzig 1974), a standard pivotal method for solving linear complementarity problems, is reviewed.
The Linear Complementarity Problem is a special case of the nonlinear complementarity problem in which the function F defining (NLCP) is affine: find v, w ∈ R^N such that

    (LCP)  w = Mv + q,  0 ≤ v, w,  0 = ⟨v, w⟩,

where M ∈ R^{N×N} and q ∈ R^N. In Lemke's algorithm an artificial variable v_0 ∈ R is introduced. Let e =def (1,…, 1)^T ∈ R^N. At each iteration of the algorithm we have a basic feasible solution or BFS (Chvatal 1983), (v, w, v_0), of the system

    (17)  w = Mv + q + ev_0,  0 ≤ (v, w, v_0),

which is almost complementary, i.e., for each basic variable v_i (respectively w_i), 1 ≤ i ≤ N, w_i (respectively v_i) is nonbasic. The variable w_i is called the complement of v_i, and vice versa. In particular ⟨v, w⟩ = 0, hence if v_0 = 0 then (v, w) solves (LCP). Given v_0, (v, w) solves the parametric linear complementarity problem

    w = Mv + q + ev_0,  0 ≤ v, w,  0 = ⟨v, w⟩.

The initial almost complementary BFS is given by taking v_0 large and positive, and (v, w) = (0, q + ev_0). So Lemke's algorithm may be viewed as a method of constructing a path of solutions (v(v_0), w(v_0)) to the parametric problem as v_0 moves down toward zero.¹ We use this idea, but with a different parametric problem, to construct the path p^k we need for iteration k + 1 of damped Newton's method.
Let N =def n + m, F be given by (15) and x^k =def (z^k, y^k). Let the first-order approximation 𝒜 of F_+ on R^N be given by (14); and denote 𝒜(x^k) by A^k. Now p^k(0) =def x^k and, for each t ∈ dom(p^k), p^k(t) is the solution x to

    (1 − t)F_+(x^k) = A^k(x) = F(x^k_+) + ∇F(x^k_+)(x_+ − x^k_+) + x − x_+
                             = ∇F(x^k_+)x_+ + (x − x_+) + [F(x^k_+) − ∇F(x^k_+)x^k_+].

¹Actually, v_0 need not strictly decrease as a result of a pivot in Lemke's algorithm, in which case the corresponding solution (v, w) of the parametric problem is not a function of v_0. In any case (v, w) traces out a path.
Equivalently, (v, w) = (p^k(t)_+, p^k(t)_+ − p^k(t)) solves the parametric linear complementarity problem:

    (LCP)(t)  w = M^k v + q^k − (1 − t)r^k,  0 ≤ v, w,  0 = ⟨v, w⟩,

where

    M^k =def ( L^k           ∇g(z^k_+)^T )
             ( −∇g(z^k_+)    0           ),

    q^k =def F(x^k_+) − ∇F(x^k_+)x^k_+ = (∇θ(z^k_+) − L^k z^k_+, −g(z^k_+) + ∇g(z^k_+)z^k_+),

    r^k =def F_+(x^k) = M^k v^k + q^k − w^k,

and

    L^k =def ∇²(θ + (y^k_+)^T g)(z^k_+),  (v^k, w^k) =def (x^k_+, x^k_+ − x^k).

Clearly (v^k, w^k) solves (LCP)(0), and a solution (v, w) of (LCP)(1) yields a zero v − w of A^k, i.e., the Newton iterate.
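Assembling the data of (LCP)(t) from the problem functions is mechanical; the following numpy sketch, with our own helper names grad_theta, g, grad_g and hess_lagrangian standing for ∇θ, g, ∇g and L^k, builds M^k, q^k, r^k and the starting pair (v^k, w^k):

    import numpy as np

    def lcp_data(grad_theta, g, grad_g, hess_lagrangian, zk, yk):
        # Data of (LCP)(t) at x^k = (z^k, y^k); a sketch, not the paper's code.
        zp, yp = np.maximum(zk, 0.0), np.maximum(yk, 0.0)
        L = hess_lagrangian(zp, yp)          # L^k
        G = grad_g(zp)                       # m-by-n Jacobian of g at z^k_+
        m = G.shape[0]
        Mk = np.block([[L, G.T], [-G, np.zeros((m, m))]])
        qk = np.concatenate([grad_theta(zp) - L @ zp, -g(zp) + G @ zp])
        vk = np.concatenate([zp, yp])        # v^k = x^k_+
        wk = vk - np.concatenate([zk, yk])   # w^k = x^k_+ - x^k
        rk = Mk @ vk + qk - wk               # r^k = F_+(x^k)
        return Mk, qk, rk, vk, wk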
To construct p^k we use a modification of Lemke's algorithm in which the role of the auxiliary variable v_0 is played by t. Assume (v, w, t) is an almost complementary BFS of

    (18)  w = M^k v + q^k + (t − 1)r^k,  0 ≤ v, w,  0 ≤ t ≤ 1,

such that t is basic. This defines the current point on p^k: p^k(t) =def v − w. As there are exactly N basic variables, there is some i for which both v_i and w_i are nonbasic. One of these nonbasics is identified as the entering variable. The modified algorithm iterates like Lemke's algorithm, in the following way.
Increase the entering variable, altering the basic variables as needed to maintain the equation w = M^k v + q^k + (t − 1)r^k, until (at least) one of the basic variables is
forced to a bound. We choose one of these basic variables to be the leaving variable,
that is the variable to be replaced in the basis by the entering variable. This
transformation of the entering and basic variables is called a pivot operation, and
corresponds to moving along an affine piece of the path p^k from the value of the parameter t before the pivot, t_old, to the current parameter value. After a pivot,
(v, w, t) is still an almost complementary BFS for (18); the leaving variable is now
nonbasic, and, assuming it is not t, its complement is nonbasic too. The next entering
variable is defined as the complement of the leaving variable. This completes an
iteration of the modified algorithm, and we are ready to begin the next iteration.
The modified algorithm is initialized at (v, w, t) = (v^k, w^k, 0) with t as the nonbasic entering variable (so p^k(0) = v^k − w^k as needed).
Now the algorithm cannot continue if either t becomes the leaving variable (because no entering variable is defined) or no leaving variable can be found (because increasing the entering variable causes no decrease in any basic variable). In the latter case we have detected a ray. Other stopping criteria are needed to reflect the aim of the exercise, namely to path search. We stop iterating if t = 1 (the entire path has been traversed), or t, with p^k(t) =def v − w, does not satisfy the descent condition (NmD), or t strictly decreases as a result of the last pivot; satisfaction of (NmD) after each pivot is the basic requirement of a forward path search. In practice, an upper bound on the number of pivots is also enforced.
It is a subtle point that, when solving (NLP), the first-order approximation at x^k = (z^k, y^k) ∈ R^{n+m} used to define the path p^k is exact for points x = (z, y) such that z_+ = z^k_+: A^k(x) = F(x_+) + x − x_+. Using this idea, we may save a little work by only checking (NmD) if the positive part of the variables z differs from z^k_+. This corresponds in the modified Lemke's algorithm to only checking (NmD) if the subvector of the first n components of v differs from the corresponding subvector of v^k.
Suppose the method halts because a ray is detected or t decreases. In both cases we know A^k is not invertible at p^k(t_old), otherwise it would have been possible to perform a pivot yielding t > t_old, so we take x^{k+1} =def p^k(t_old). Another possibility is that the method halts because (NmD) fails at t, in which case we use the Armijo line search on [p^k(t_old), p^k(t)] to determine a path length t_k ∈ [t_old, t] satisfying (NmPs), and take x^{k+1} =def p^k(t_k).
We have described our implementation of a forward path search. A backward path search was also implemented as follows. We start by recording the first tuple (v^k, w^k, 0), and then perform pivots as described previously, recording the tuple (v, w, t) after each successful pivot. As before, pivoting stops if a ray is detected, t leaves the basis, or t = 1; we do not, however, use the condition (NmD) or the sign of the change in t to decide whether to continue pivoting. Suppose J is the number of successful pivots. If the (J + 1)th tuple (v, w, t) is accurate, meaning (NmD) is satisfied at t with p^k(t) =def v − w, then take x^{k+1} =def v − w. If not, take J_1 = 1, J_2 = J + 1 and, while J_1 + 1 ≠ J_2, do the following bisection search on the list: if the tuple at position J' =def ⌊(J_1 + J_2)/2⌋ in the list is accurate, then let J_1 = J' and J_2 be unchanged; otherwise let J_1 be unchanged and J_2 = J'. After the bisection search, the tuple at position J_1 is accurate and the tuple at position J_1 + 1 is not accurate. An Armijo line search is now carried out on the interval between the points v − w defined by these two tuples, and x^{k+1} defined accordingly.
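The bisection over the recorded tuples amounts to the following (a Python sketch; accurate(j) is a hypothetical predicate of ours testing whether the jth recorded tuple satisfies (NmD), with positions counted from 1 as in the text, so num_tuples = J + 1):

    def backward_bisection(num_tuples, accurate):
        # Find position j with tuple j accurate and tuple j + 1 not.
        # Assumes tuple 1 (the starting point, t = 0) is accurate and the
        # last tuple is not, as in the text; sketch only.
        j_lo, j_hi = 1, num_tuples           # invariant: j_lo accurate, j_hi not
        while j_lo + 1 != j_hi:
            j_mid = (j_lo + j_hi) // 2
            if accurate(j_mid):
                j_lo = j_mid
            else:
                j_hi = j_mid
        return j_lo      # Armijo search between tuples j_lo and j_lo + 1

This costs O(log J) accuracy tests rather than the O(J) of a linear scan through the list.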
One difficulty we have not discussed is the possibility that (v^k, w^k) is not a basic solution of

    w = M^k v + q^k − r^k,  0 ≤ v, w,

i.e., (v^k, w^k, 0) with t nonbasic cannot be used as a starting point of the modified Lemke's algorithm. To overcome this, we performed a modified Cholesky factorization (Gill, Murray and Wright 1981) of the Hessian of the Lagrangian, L^k above, changing it if necessary into a positive definite matrix. This operation (somewhat crude because it attacks the entire Hessian of the Lagrangian, rather than just the "basis" columns) may not always be sufficient to correct the problem since columns of M^k not corresponding to columns of L^k are generally present in the initial basis. However it was sufficient in the problems we tried, below. It has the added advantage of being a heuristic for preventing convergence to a Karush-Kuhn-Tucker point of the nonlinear program which corresponds to a local maximum or saddle point instead of a local minimum of the program.
5.5. Computational examples. Our computational examples are of global, or path search damped Newton's method applied to the nonlinear programs of the form (NLP):

    min θ(z) subject to z ≥ 0, g(z) ≤ 0.
The computer programs were written in C and the computation carried out on a
Sun 4 (Sparcstation). Most problems are taken from Ferris (1990), with starting
variables z^0 close to solutions² and initial multipliers y^0 set to zero. We note that
these problems all have small dimension: no more than 15 variables and 10 nonlinear
constraints. Results include performance of global Newton's method using both the
forward and backward path search procedures. Other problems, showing convergence
of the damped method in spite of the cycling (nonconvergence) that occurs without
damping, are also tested.
For comparison we have also tested local (Robinson- or Josephy-)Newton's method
on the same problems, using Lemke's algorithm to find the Newton iterate at each
iteration. This method was found to be somewhat fragile with respect to starting
points: quite often the method failed because a ray was detected by Lemke's algorithm, i.e., the Newton iterate x̂^{k+1} could not be determined. To make the method more robust we altered it, allowing it to continue after Lemke's algorithm detects a ray by setting x^{k+1} = v − w, where (v, w, v_0) is the last almost complementary BFS produced by Lemke's algorithm. The altered method was usually successful in solving test problems.
Our limited computational results, below, bear out what intuition suggests. First,
the global method is more robust than the local method especially in avoiding cycling,
discussed below. Second, the global method requires more stringent conditions on the
current approximation A^k than does the local method, and enforcing these heuristically can degrade convergence. Third, the computational cost of carrying out a path search is seen in the increase in the number of evaluations of F. For these problems, the backward path search was more efficient than the forward procedure because, at each iteration, the Newton iterate p^k(1) ∈ (A^k)^{-1}(0) usually existed and satisfied the descent criterion (NmD), so that x^{k+1} =def p^k(1).
In higher dimensions we expect the value of path searching to be even greater, a
view partly supported by the following observations. One peculiarity of Newton's
method is the possibility of cycling, depending on the starting point, as in the smooth
case. This is seen for the altered Newton's algorithm when testing the Colville 2
problem with a feasible starting point (Table 1). It is easy to find a real function of
one variable, which, when minimized over [0, ∞) by Newton's method, demonstrates
cycling of (unaltered) Newton's method (Example 15 below). It turns out that for any
problem (NLP), the new problem formed by adding such a function in an (n + 1)th variable to the objective function θ cannot be solved by Newton's method for some starting points (Example 16): at best it will converge in the first n variables and cycle in the (n + 1)th variable. So it can be argued that the likelihood of cycling in
Newton's method converges for otherwise well behaved problems.
Table 1 summarizes our results on the problems from Ferris (1990). The following parameters were set in the path search: σ =def 0.1, τ =def 0.5, M =def 4. The problem was considered solved when the Euclidean norm of the residual satisfied

    ‖F(x^k_+) + x^k − x^k_+‖ < 10^{-5}.
²Fortran subroutines and initial values of variables supplied by Michael C. Ferris, Computer Sciences
Department, University of Wisconsin-Madison, Madison, WI 53706.
TABLE 1
I: Local Newton. II: Global Newton with a Forward Path Search. III: Global Newton with a Backward Path Search.

                              Size           Pivots           Evaluations        Iterations
    Problem                   m × n       I    II   III      I    II   III      I    II   III
    Rosenbrock                4 × 2      20    19    21      7    29    29      6     9     9
    Himmelblau                3 × 4      25     7     7      6     5     4      5     5     5
    Wright                    3 × 5      99    31    31      8    21    13      7    27    27
    Colville 1               10 × 5      31    41    41      4    23     8      3     3     3
    Colville 2 (feasible)     5 × 15      *    23    24      *    15    17      *     8     8
    Colville 2 (infeasible)   5 × 15    113    40    40      8     6     6      7     7     7
Starting points x^0 =def (z^0, y^0) ∈ R^{n+m} were taken with the variables z^0 used in Ferris (1990) and the multipliers y^0 =def 0. The first column of Table 1 gives the name of the problem, and the second column gives the number of constraints m in the vector inequality g(z) ≤ 0, and the number of variables n, i.e., the respective dimensions of the multipliers y and variables z. Each of the remaining columns contains
three subcolumns corresponding to three implementations of Newton's method: I
denotes the (altered) local Newton's method, II, the global method with a forward
path search, and III, the global method with a backward path search. The 'Pivots'
column lists the total number of pivots required by the three implementations to solve
a given problem. The 'Evaluations' column lists the total number of evaluations of F
used by the three implementations. The final column, 'Iterations', shows the number
of iterations needed by the implementations to solve each problem. An asterisk *
indicates failure to solve the problem.
REMARKS ON TABLE 1.
Iterations. The most important feature of the results is the number of iterations
needed, i.e., the number of evaluations of the Hessian of the Lagrangian, L^k. Three problems where the numbers of iterations differ widely are discussed further.
The objective function of the Rosenbrock problem is highly nonlinear causing any
damping of Newton's method (even in the unconstrained case) to take very short
steps, hence many iterations, unless the current iterate is very close to a minimizer.
The nonmonotone line search alleviates this problem: with memory length M = 4
only 9 iterations were needed, while more than 500 iterations were needed in another
test using the monotone path search (M = 1). In other tests using the monotone path
search, the remaining problems required the same or slightly fewer iterations than
listed in Table 1.
In the Wright problem, the local method does much better (7 iterations) than both
implementations of the global method (27 iterations) because it uses the actual
Hessian of the Lagrangian whereas the global implementations introduce error by
using a modified Cholesky version of this. In another test, 27 iterations were also
required by the local method using the modified Cholesky version of the Hessian of
the Lagrangian. This highlights a difficulty of path search damping: care is needed to
ensure that the path is well defined near the current iterate. On the other hand, this has
advantages (other than damping) as we see in the remarks on pivots below.
For the Colville 2 problem with a feasible starting point, the local Newton's method
cycled while the global Newton implementations solved the problem easily. In the
local implementation, after the first iteration the pair (v^k, w^k) alternated between two different vector pairs for which the associated residual ‖F(v^k) − w^k‖ took values 98 and 5 respectively, the former corresponding to an unbounded ray.
Evaluations. The number of evaluations of F, i.e., of ∇θ, g and ∇g, is higher for the global implementations than for the local implementation, because the former compare ‖F(v) − w‖ to (1 − t)‖F(v^k) − w^k‖ after perhaps several pivots during
iteration k + 1, whereas the latter only checks the norm of the residual at the start of
the iteration. For global Newton's method, the backward path search (III) generally
uses fewer function and derivative evaluations than the forward version (II) because,
near the solution, the Newton iterate usually exists and is acceptable as the next
iterate.
The difference between the number in the 'Pivots' column and the number in the
'Evaluations' column for method II (global Newton with a forward path search) is
the savings in function/derivative evaluations obtained by only checking (NmD) when
the first n components of v differ from the first n components of v^k.
Pivots. The global implementations usually require fewer pivots per iteration than
does the local implementation. This is not surprising: the damped method requires
the current iterate x^k to correspond to a basis of (LCP)(0). We initiate the modified
Lemke's algorithm at this basis by explicitly factorizing the corresponding matrix. By
starting with this 'warm' basis, both the forward and backward implementation of the
damped method generally take many fewer pivots than the local implementation to
find the Newton iterate, the solution of (LCP)(1).
EXAMPLE 15. Let φ: R → R be a differentiable function whose gradient is

    ∇φ(z) =def arctan(z − 10),

the shifted inverse trigonometric tangent function. Integrating and setting the constant of integration to be zero we obtain

    φ(z) = (z − 10)arctan(z − 10) − (1/2)log(1 + (z − 10)²).

Now consider a simple problem of the form (NLP):

    min φ(z) subject to z ≥ 0.

The unique solution is z = 10.
It can be easily shown that Newton's method cycles infinitely for any starting point with |z^0 − 10| > 2. This has been observed in computation for several such starting points. Global Newton's method using the forward path search, with the same parameters as used to obtain the results of Table 1, converged in 4-33 iterations over a variety of starting points 2 < |z^0 − 10| < 100.
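The mechanism behind the cycling is already visible in the undamped Newton step for φ'(z) = 0, namely z ← z − φ''(z)^{-1}φ'(z) = z − (1 + (z − 10)²)arctan(z − 10); the following short script (ours, for illustration) shows the oscillation about z = 10 with growing amplitude:

    import math

    z = 13.0                      # any start with |z - 10| > 2
    for k in range(6):
        z = z - (1.0 + (z - 10.0) ** 2) * math.atan(z - 10.0)
        print(k + 1, z)           # iterates jump from one side of 10 to the
                                  # other with growing amplitude: no convergence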
EXAMPLE 16. Derive a problem from (NLP) by including an extra variable z_{n+1}. The new objective function is

    θ̃(z_1,…, z_{n+1}) =def θ(z_1,…, z_n) + φ(z_{n+1}),

where φ is defined in the last example. The constraints are g(z_1,…, z_n) ≤ 0 as before, and 0 ≤ z =def (z_1,…, z_{n+1}). It is not hard to see that for any starting point (z^0, y^0) ∈ R^{n+1} × R^m in which |z^0_{n+1} − 10| > 2, Newton's method cannot converge in z_{n+1}.
We have tested this for the Himmelblau (NLP), where n =def 4, (z^0_1,…, z^0_4) are the initial variables used in Ferris (1990), z^0_5 = 0, and all multipliers are initially zero. Newton's method cycles infinitely; damped Newton's method using the forward path search, with the same parameters as used for Table 1, converges in 33 iterations.
In other tests using the monotone path search, damped Newton's method using the
forward path search only required 4-7 iterations to solve the problem in Example 15
from various starting points; and 9 iterations for the augmented problem of Example
16 using the starting point specified there. In both cases this is an improvement on
nonmonotone damping. It is interesting that nonmonotone damping, which aids
convergence for highly nonlinear problems by accepting iterates that need not
decrease the size of the residual, slows convergence here because it perpetuates
almost cyclic behavior of the iterates.
Acknowledgement. This research was initiated under funding from the Air Force
Office of Scientific Research, Air Force Systems Command, USAF, through Grants
No. AFOSR-88-0090 and No. AFOSR-89-0058; and the National Science Foundation
through Grant No. CCR-8801489. Further support was given by the National Science
Foundation for the author's participation in the 1990 Young Scientists' Summer
Program at the International Institute of Applied Systems Analysis, Laxenburg,
Austria. Finally, this research was partially supported by the U.S. Army Research
Office through the Mathematical Sciences Institute, Cornell University, and by the
Computational Mathematics Program of the National Science Foundation under
grant DMS-8706133. The U.S. Government is authorized to reproduce and distribute
reprints for Government purposes notwithstanding any copyright notation thereon.
I would like to thank Stephen M. Robinson for his support of this project, through
both initial funding and discussions, and his continued interest. I also thank Laura M.
Morley for her assistance in implementing path search damped Newton's method;
and Oleg P. Burdakov, Michael C. Ferris, Liqun Qi, and two anonymous referees for
their helpful comments. This paper was completed while the author was at the
Advanced Computing Research Institute, Cornell University, Ithaca, New York
14853.
References
Bonnans, J. F. (1990). Rates of Convergence of Newton Type Methods for Variational Inequalities and
Nonlinear Programming, manuscript, Institut National de Recherche en Informatique et en Automatique, Rocquencourt, France.
Brézis, H. (1973). Opérateurs Maximaux Monotones et Semi-groupes de Contractions dans les Espaces de Hilbert, North-Holland Mathematics Studies No. 5, North-Holland, Amsterdam.
Burdakov, O. P. (1980). Some Globally Convergent Modifications of Newton's Method for Solving Systems of Nonlinear Equations. Soviet Math. Dokl. 22 376-379.
Chvatal, V. (1983). Linear Programming, W. H. Freeman and Co., NY.
Clarke, F. H. (1983). Optimization and Nonsmooth Analysis, John Wiley and Sons, NY.
Cottle, R. W. and Dantzig, G. B. (1974). Complementary Pivot Theory of Mathematical Programming. In
Studies in Optimization 10, (G. B. Dantzig and B. C. Eaves, Eds), MAA Studies in Mathematics, pp.
27-51.
Ferris, M. C. (1990). Iterative Linear Programming Solution of Convex Programs. J. Optim. Theory Appl. 65 53-65.
Ferris, M. C. and Lucidi, S. (1994). Nonmonotone Stabilization Methods for Nonlinear Equations. J. Optim. Theory Appl. 81 (to appear).
Fletcher, R. (1987). Practical Methods of Optimization. 2nd ed., John Wiley and Sons, NY.
Fukushima, M. (1992). Equivalent Differentiable Optimization Problems and Descent Methods for Asymmetric Variational Inequality Problems. Math. Programming 53 99-110.
Gill, P. E., Murray, W. and Wright, M. H. (1981). Practical Optimization. Academic Press, NY.
Grippo, L., Lampariello, F. and Lucidi, S. (1986). A Nonmonotone Line Search Technique for Newton's
Method. SIAM J. Numer. Anal. 23 707-716.
Han, S.-P., Pang, J.-S. and Rangaraj, N. (1992). Globally Convergent Newton Methods for Nonsmooth
Equations. Math. Oper. Res. 17 586-607.
Harker, P. T. and Pang, J.-S. (1990). Finite-dimensional Variational Inequality and Nonlinear Complementarity Problems: A Survey of Theory, Algorithms and Applications. Math. Programming 48 161-220.
Harker, P. T. and Xiao, B. (1990). Newton's Method for the Nonlinear Complementarity Problem: A B-Differentiable Equation Approach. Math. Programming 48 339-357.
Josephy, N. H. (1979). Newton's Method for Generalized Equations. Technical Summary Report #1965,
Mathematics Research Center, University of Wisconsin-Madison.
Kojima, M. (1980). Strongly Stable Stationary Solutions in Nonlinear Programs. In Analysis and Computation of Fixed Points (S. M. Robinson, Ed.), Academic Press, NY. pp. 93-138.
Kojima, M. and Shindo, S. (1986). Extensions of Newton and Quasi-Newton Methods to Systems of PC¹ Equations. J. Oper. Res. Soc. Japan 29 352-374.
Mangasarian, O. L. (1969). Nonlinear Programming. McGraw-Hill, NY.
Marcotte, P. and Dussault, J.-P. (1987). A Note on a Globally Convergent Newton Method for Solving
Monotone Variational Inequalities. Oper. Res. Lett. 6 35-42.
McCormick, G. P. (1983) Nonlinear Programming, Theory, Algorithms and Applications. John Wiley and
Sons, NY.
Ortega, J. M. and Rheinboldt, W. C. (1970). Iterative Solution of Nonlinear Equations in Several Variables.
Academic Press, San Diego.
Pang, J.-S. (1990). Newton's Method for B-differentiable Equations. Math. Oper. Res. 15 311-341.
Pang, J.-S. (1991). A B-differentiable Equation Based, Globally and Locally Quadratically Convergent Algorithm for Nonlinear Programs, Complementarity and Variational Inequality Problems. Math. Programming 51 101-131.
Pang, J.-S. (1993). Convergence of Splitting and Newton Methods for Complementarity Problems: An Application of Some Sensitivity Results. Math. Programming 58 149-160.
Pang, J.-S. and Gabriel, S. A. (1993). NE/SQP: A Robust Algorithm for the Nonlinear Complementarity Problem. Math. Programming 60 295-337.
Pang, J.-S., Han, S.-P. and Rangaraj, N. (1991). Minimization of Locally Lipschitz Functions. SIAM J. Optim. 1 57-82.
Park, K. (1989). Continuation Methods for Nonlinear Programming. Dissertation, Department of Industrial
Engineering, University of Wisconsin-Madison.
Qi, L. (1993). Convergence Analysis of Some Algorithms for Solving Nonsmooth Equations. Math. Oper. Res. 18 227-244.
Ralph, D. (1993). A New Proof of Robinson's Homeomorphism Theorem for PL-Normal Maps. Linear Algebra Appl. 178 249-260.
Rheinboldt, W. C. (1969). Local Mapping Relations and Global Implicit Function Theorems. Trans. Amer.
Math. Soc. 138 183-198.
Robinson, S. M. (1979). Generalized Equations and Their Solutions. Part I: Basic Theory. Math.
Programming Stud. 10 128-141.
Robinson, S. M. (1980). Strongly Regular Generalized Equations. Math. Oper. Res. 5 43-62.
Robinson, S. M. (1983). Local Structure of Feasible Sets in Nonlinear Programming. Part I: Regularity. In Numerical Methods: Proceedings, Caracas 1982 (V. Pereyra and A. Reinoza, Eds.), Springer-Verlag, Berlin, pp. 240-251.
Robinson, S. M. (1987). Local Structure of Feasible Sets in Nonlinear Programming. Part III: Stability and Sensitivity. Math. Programming Stud. 30 45-66.
Robinson, S. M. (1988). Newton's Method for a Class of Nonsmooth Functions. Manuscript, Department of Industrial Engineering, University of Wisconsin-Madison.
Robinson, S. M. (1992). Normal Maps Induced by Linear Transformations. Math. Oper. Res. 17 691-714.
Taji, K., Fukushima, M. and Ibaraki, T. (1993). A Globally Convergent Newton Method for Solving
Strongly Monotone Variational Inequalities. Math. Programming 58 369-383.
Wilson, R. B. (1963). A Simplicial Method for Concave Programming. Dissertation, Graduate School of
Business, Harvard University.
Wu, J. H., Florian, M., and Marcotte, P. (1990). A General Descent Framework for the Monotone
Variational Inequality Problem. Publication CRT-723, Centre de Recherche sur les Transports,
Universite de Montreal, Canada.
D. Ralph: Mathematics Department, University of Melbourne, Parkville, Victoria 3052, Australia;
e-mail: [email protected]