UNIVERSITY OF CALIFORNIA, SAN DIEGO
Reduced Hessian Quasi-Newton Methods for
Optimization
A dissertation submitted in partial satisfaction of the
requirements for the degree Doctor of Philosophy
in Mathematics
by
Michael Wallace Leonard
Committee in charge:
Professor Philip E. Gill, Chair
Professor Randolph E. Bank
Professor James R. Bunch
Professor Scott B. Baden
Professor Pao C. Chau
1995
Copyright © 1995
Michael Wallace Leonard
All rights reserved.
The dissertation of Michael Wallace Leonard is approved,
and it is acceptable in quality and form for publication
on microfilm:
Professor Philip E. Gill, Chair
University of California, San Diego
1995
This dissertation is dedicated to my mother and father.
Contents
Signature Page . . . iii
Dedication . . . iv
Table of Contents . . . vi
List of Tables . . . vii
Preface . . . viii
Acknowledgements . . . xiii
Curriculum Vita . . . xiv
Abstract . . . xv

1 Introduction to Unconstrained Optimization . . . 1
  1.1 Newton’s method . . . 2
  1.2 Quasi-Newton methods . . . 6
    1.2.1 Minimizing strictly convex quadratic functions . . . 9
    1.2.2 Minimizing convex objective functions . . . 10
  1.3 Computation of the search direction . . . 13
    1.3.1 Notation . . . 13
    1.3.2 Using Cholesky factors . . . 13
    1.3.3 Using conjugate-direction matrices . . . 15
  1.4 Transformed and reduced Hessians . . . 16

2 Reduced-Hessian Methods for Unconstrained Optimization . . . 18
  2.1 Fenelon’s reduced-Hessian BFGS method . . . 19
    2.1.1 The Gram-Schmidt process . . . 20
    2.1.2 The BFGS update to RZ . . . 22
  2.2 Reduced inverse Hessian methods . . . 23
  2.3 An extension of Fenelon’s method . . . 25
  2.4 The effective approximate Hessian . . . 29
  2.5 Lingering on a subspace . . . 31
    2.5.1 Updating Z when p = pr . . . 34
    2.5.2 Calculating sZ̄ and yZ̄ . . . 36
    2.5.3 The form of RZ when using the BFGS update . . . 37
    2.5.4 Updating RZ after the computation of p . . . 39
    2.5.5 The Broyden update to RZe . . . 41
    2.5.6 A reduced-Hessian algorithm with lingering . . . 41

3 Rescaling Reduced Hessians . . . 43
  3.1 Self-scaling variable metric methods . . . 44
  3.2 Rescaling conjugate-direction matrices . . . 46
    3.2.1 Definition of p . . . 46
    3.2.2 Rescaling V̄ . . . 47
    3.2.3 The conjugate-direction rescaling algorithm . . . 48
    3.2.4 Convergence properties . . . 49
  3.3 Extending Algorithm RH . . . 50
    3.3.1 Reinitializing the approximate curvature . . . 50
    3.3.2 Numerical results . . . 53
  3.4 Rescaling combined with lingering . . . 54
    3.4.1 Numerical results . . . 57
    3.4.2 Algorithm RHRL applied to a quadratic . . . 58

4 Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling . . . 62
  4.1 A search-direction basis for range(V1) . . . 62
  4.2 A transformed Hessian associated with B . . . 66
  4.3 How rescaling V̄ affects Ū TB̄ Ū . . . 70
  4.4 The proof of equivalence . . . 75

5 Reduced-Hessian Methods for Large-Scale Unconstrained Optimization . . . 79
  5.1 Large-scale quasi-Newton methods . . . 79
  5.2 Extending Algorithm RH to large problems . . . 82
    5.2.1 Imposing a storage limit . . . 83
    5.2.2 The deletion procedure . . . 84
    5.2.3 The computation of T̄ . . . 86
    5.2.4 The updates to ḡ Z̄ and R̄Z̄ . . . 87
    5.2.5 Gradient-based reduced-Hessian algorithms . . . 88
    5.2.6 Quadratic termination . . . 89
    5.2.7 Replacing g with p . . . 90
  5.3 Numerical results . . . 97
  5.4 Algorithm RHR-L-P applied to quadratics . . . 107

6 Reduced-Hessian Methods for Linearly-Constrained Problems . . . 114
  6.1 Linearly constrained optimization . . . 114
  6.2 A dynamic null-space method for LEP . . . 118
  6.3 Numerical results . . . 123

Bibliography . . . 125
List of Tables
2.1 Alternate methods for computing Z . . . 21

3.1 Alternate values for σ̄ . . . 53
3.2 Test Problems from Moré et al. . . . 54
3.3 Results for Algorithm RHR using R1, R4 and R5 . . . 55
3.4 Results for Algorithm RHRL on problems 1–18 . . . 58
3.5 Results for Algorithm RHRL on problems 19–22 . . . 59

5.1 Comparing p̄ from CG and Algorithm RH-L-G on quadratics . . . 90
5.2 Iterations/Functions for RHR-L-G (m = 5) . . . 98
5.3 Iterations/Functions for RHR-L-P (m = 5) . . . 99
5.4 Results for RHR-L-P using R3–R5 (m = 5) on Set #1 . . . 100
5.5 Results for RHR-L-P using R3–R5 (m = 5) on Set #2 . . . 101
5.6 RHR-L-P using different m with R4 . . . 102
5.7 RHR-L-P (R4) for m ranging from 2 to n . . . 103
5.8 Results for RHR-L-P and L-BFGS-B (m = 5) on Set #1 . . . 105
5.9 Results for RHR-L-P and L-BFGS-B (m = 5) on Set #2 . . . 106

6.1 Results for LEPs (mL = 5, δ = 10^-10, ‖N^T g‖ ≤ 10^-6) . . . 124
6.2 Results for LEPs (mL = 8, δ = 10^-10, ‖N^T g‖ ≤ 10^-6) . . . 124
Preface
This thesis consists of seven chapters and a bibliography. Each chapter
starts with a review of the literature and proceeds to new material developed
by the author under the direction of the Chair of the dissertation committee.
All lemmas, theorems, corollaries and algorithms are those of the author unless
otherwise stated.
Problems from all areas of science and engineering can be posed as
optimization problems. An optimization problem involves a set of independent
variables, and often includes constraints or restrictions that define acceptable values of the variables. The solution of an optimization problem is a set of allowed
values of the variables for which some objective function achieves its maximum
or minimum value. The class of model-based methods forms quadratic approximations of optimization problems using first and sometimes second derivatives of
the objective and constraint functions.
If no constraints are present, an optimization problem is said to be
unconstrained. The formulation of effective methods for the unconstrained case
is the first step towards defining methods for constrained optimization. The
unconstrained optimization problem is considered in Chapters 1–5. Methods for
problems with linear equality constraints are considered in Chapter 6.
Chapter 1 opens with a discussion of Newton’s method for unconstrained
optimization. Newton’s method is a model-based method that requires both
first and second derivatives. In Section 1.2 we move on to quasi-Newton methods, which are intended for the situation when the provision of analytic second
derivatives is inconvenient or impossible. Quasi-Newton methods use only first
derivatives to build up an approximate Hessian over a number of iterations. At
each iteration of a quasi-Newton method, the approximate Hessian is altered
to incorporate new curvature information. This process, which is known as an
update, involves the addition of a low-rank matrix (usually of rank one or rank
two). This thesis will be concerned with a class of rank-two updates known as
the Broyden class. The most important member of this class is the so-called
Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula.
In Chapter 2 we consider quasi-Newton methods from a completely different point of view. Quasi-Newton methods that employ updates from the
Broyden class are known to accumulate approximate curvature in a sequence
of expanding subspaces. It follows that the search direction can be defined using
matrices of smaller dimension than the approximate Hessian. In exact arithmetic
these so-called reduced Hessians generate the same iterates as the standard quasi-Newton methods. This result is the basis for all of the new algorithms defined
in this thesis. Reduced-Hessian and reduced inverse Hessian methods are considered in Sections 2.1 and 2.2 respectively. In Section 2.3 we propose Algorithm
RH, which is the template algorithm for this thesis. In Section 2.5 this algorithm
is generalized to include a “lingering scheme” (Algorithm RHL) that allows the
iterates to be restricted to certain low dimensional manifolds.
In practice, the choice of initial approximate Hessian can greatly influence the performance of quasi-Newton methods. In the absence of exact second-derivative information, the approximate Hessian is often initialized to the identity
matrix. Several authors have observed that a poor choice of initial approximate
Hessian can lead to inefficiencies—especially if the Hessian itself is ill-conditioned.
These inefficiencies can lead to a large number of function evaluations in some
cases.
Rescaling techniques are intended to address this difficulty and are the
subject of Chapter 3. The rescaling methods of Oren and Luenberger [39], Siegel
[45] and Lalee and Nocedal [27] are discussed. In particular, the conjugate-direction rescaling method of Siegel (Algorithm CDR), which is also a variant of
the BFGS method, is described in some detail. Algorithm CDR (page 48) has
been shown to be effective in solving ill-conditioned problems. Algorithm CDR
has notable similarities to reduced-Hessian methods, and two new rescaling algorithms follow naturally from the interpretation of Algorithm CDR as a reduced
Hessian method. These algorithms are derived in Sections 3.3 and 3.4. The
first (Algorithm RHR) is a modification of Algorithm RH; the second (Algorithm RHRL) is derived from Algorithm RHL. Numerical results are given for
both algorithms. Moreover, under certain conditions Algorithm RHRL is shown
to converge in a finite number of iterations when applied to a class of quadratic
problems. This property, often termed quadratic termination, can be numerically
beneficial for quasi-Newton methods.
In Chapter 4, it is shown that if Algorithm RHRL is used in conjunction
with a particular rescaling technique of Siegel [45], then it is equivalent to Algorithm CDR in exact arithmetic. Chapter 4 is mostly technical in nature and may
be skipped without loss of continuity. However, the convergence results given in
Section 4.4 should be reviewed before passing to Chapter 5.
If the problem has many independent variables, it may not be practical
to store the Hessian matrix or an approximate Hessian. In Chapter 5, methods for solving large unconstrained problems are reviewed. Conjugate-gradient
(CG) methods require storage for only a few vectors and can be used in the
large-scale case. However, CG methods can require a large number of itera-
tions relative to the problem size and can be prohibitively expensive in terms
of function evaluations. In an effort to accelerate CG methods, several authors
have proposed limited-memory and reduced-Hessian quasi-Newton methods. The
limited-memory algorithm of Nocedal [35], the successive affine reduction method
of Nazareth [34], the reduced-Hessian method of Fenelon [14] and reduced inverse-Hessian methods due to Siegel [46] are reviewed.
In Chapter 5, new reduced-Hessian rescaling algorithms are derived as
extensions of Algorithms RH and RHR. These algorithms (Algorithms RHR-L-G
and RHR-L-P) employ the rescaling method of Algorithm RHR. Algorithm RHR-L-P shares features of the methods of Fenelon, Nazareth and Siegel. However, the
inclusion of rescaling is demonstrated numerically to be essential for efficiency.
Moreover, Algorithm RHR-L-P is shown to enjoy the property of quadratic termination, which is shown to be beneficial when the algorithm is applied to general
functions.
Chapter 6 considers the minimization of a function subject to linear
equality constraints. Two algorithms (Algorithms RH-LEP and RHR-LEP) extend reduced-Hessian methods to problems with linear constraints. Numerical
results are given comparing Algorithm RHR-LEP with a standard method for
solving linearly constrained problems.
In summary, a total of seven new reduced-Hessian algorithms are proposed.
• Algorithm RH (p. 28)—The algorithm template.
• Algorithm RHL (p. 41)—Uses a lingering scheme that constrains the iterates to remain on a manifold.
• Algorithm RHR (p. 52)—Rescales when approximate curvature is obtained
in a new subspace.
• Algorithm RHRL (p. 56)—Exploits the special form of the reduced Hessian
resulting from the lingering strategy. This special form allows rescaling on
larger subspaces.
• Algorithm RHR-L-G (p. 95)—A gradient-based method with rescaling for
large-scale optimization.
• Algorithm RHR-L-P (p. 95)—A direction-based method with rescaling for
large-scale optimization. This algorithm converges in a finite number of
iterations when applied to a quadratic function.
• Algorithm RHR-LEP (p. 123)—A reduced-Hessian rescaling method for
linear equality-constrained problems.
Acknowledgements
I am pleased to acknowledge my advisor, Professor Philip E. Gill. I
became interested in doing research while I was a student in the Master of Arts
program, but writing a dissertation seemed an unlikely task. However, Professor
Gill thought that I had the right stuff. He has helped me hurdle many obstacles,
not the least of which was transferring into the Ph.D. program. He introduced
me to a very interesting and rewarding problem in numerical optimization. He
also supported me as a Research Assistant for several summers and during my
last quarter as a graduate student.
I would like to express my gratitude to Professors James R. Bunch,
Randolph E. Bank, Scott B. Baden and Pao C. Chau, all of whom served on my
thesis committee. My thanks also to Professors Maria E. Ong and Donald R.
Smith from whom I learned much in my capacity as a teaching assistant.
My special thanks to Professor Carl H. Fitzgerald. His training inspired
in me a much deeper appreciation of mathematics and is the basis of my technical
knowledge.
My family has always prompted me towards further education. I want
to thank my mother and father, my stepmother Maggie and my brother Clif for
their encouragement and support while I have been a graduate student.
I also want to express my appreciation to all of my friends who have
been supportive while I worked on this thesis. My climbing friends Scott Marshall,
Michael Smith, Fred Weening and Jeff Gee listened to my ranting and raving and
always encouraged me. My friends in the department, Jerome Braunstein, Scott
Crass, Sam Eldersveld, Ricardo Fierro, Richard LeBorne, Ned Lucia, Joe Shinnerl, Mark Stankus, Tuan Nguyen and others were all inspirational, informative
and helpful.
Vita
1982
Appointed U.C. Regents Scholar.
University of California, Santa Barbara
1985
B.S., Mathematical Sciences, Highest Honors.
University of California, Santa Barbara
1985
B.S., Mechanical Engineering, Highest Honors.
University of California, Santa Barbara
1985-1987
Associate Engineering Scientist.
McDonnell-Douglas Astronautics Corporation
1987-1990
High School Mathematics Teacher.
Vista Unified School District
1988
Mathematics Single Subject Teaching Credential.
University of California, San Diego
1991
M.A., Applied Mathematics.
University of California, San Diego
1991-1993
Adjunct Mathematics Instructor. Mesa Community College
1991-1995
Teaching Assistant. Department of Mathematics,
University of California, San Diego
1993
C.Phil., Mathematics. University of California, San Diego
1995
Research Assistant. Department of Mathematics,
University of California, San Diego
1995
Ph.D., Mathematics. University of California, San Diego
Major Fields of Study
Major Field: Mathematics
Studies in Numerical Optimization.
Professor Philip E. Gill
Studies in Numerical Analysis.
Professors Randolph E. Bank, James R. Bunch, Philip E. Gill
and Donald R. Smith
Studies in Complex Analysis.
Professor Carl H. Fitzgerald
Studies in Applied Algebra.
Professors Jeffrey B. Remmel and Adriano M. Garsia
Abstract of the Dissertation
Reduced Hessian Quasi-Newton Methods for Optimization
by
Michael Wallace Leonard
Doctor of Philosophy in Mathematics
University of California, San Diego, 1995
Professor Philip E. Gill, Chair
Many methods for optimization are variants of Newton’s method, which
requires the specification of the Hessian matrix of second derivatives. Quasi-Newton methods are intended for the situation where the Hessian is expensive
or difficult to calculate. Quasi-Newton methods use only first derivatives to
build an approximate Hessian over a number of iterations. This approximation
is updated each iteration by a matrix of low rank. This thesis is concerned with
the Broyden class of updates, with emphasis on the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update.
Updates from the Broyden class accumulate approximate curvature in
a sequence of expanding subspaces. This allows the approximate Hessians to be
represented in compact form using smaller reduced approximate Hessians. These
reduced matrices offer computational advantages when the objective function is
highly nonlinear or the number of variables is large.
Although the initial approximate Hessian is arbitrary, some choices may
cause quasi-Newton methods to fail on highly nonlinear functions. In this case,
rescaling can be used to decrease inefficiencies resulting from a poor initial approximate Hessian. Reduced-Hessian methods facilitate a trivial rescaling that
implicitly changes the initial curvature as iterations proceed. Methods of this
type are shown to have global and superlinear convergence. Moreover, numerical
results indicate that this rescaling is effective in practice.
In the large-scale case, so-called limited-storage reduced-Hessian methods offer advantages over conjugate-gradient methods, with only slightly increased memory requirements. We propose two limited-storage methods that utilize rescaling, one of which can be shown to terminate on quadratics. Numerical
results suggest that the method is effective compared with other state-of-the-art
limited-storage methods.
Finally, we extend reduced-Hessian methods to problems with linear
equality constraints. These methods are the first step towards reduced-Hessian
methods for the important class of nonlinearly constrained problems.
Chapter 1
Introduction to Unconstrained
Optimization
Problems from all areas of science and engineering can be posed as
optimization problems. An optimization problem involves a set of independent
variables, and often includes constraints or restrictions that define acceptable values of the variables. The solution of an optimization problem is a set of allowed
values of the variables for which some objective function achieves its maximum
or minimum value. The class of model-based methods forms quadratic approximations of optimization problems using first and sometimes second derivatives of
the objective and constraint functions.
Consider the unconstrained optimization problem
    minimize_{x ∈ IR^n}  f(x),   (1.1)
where f : IRn → IR is twice-continuously differentiable. Since maximizing f can
be achieved by minimizing −f , it suffices to consider only minimization. When
no constraints are present, the problem of minimizing f is often called “unconstrained optimization.” When linear constraints are present, the minimization
problem is called “linearly-constrained optimization.” The unconstrained opti-
mization problem is introduced in the next section. Linearly constrained optimization is introduced in Chapter 6. Nonlinearly constrained optimization is not
considered. However, much of the work given here applies to solving “subproblems” that might arise in the course of solving nonlinearly constrained problems.
1.1 Newton’s method
A local minimizer x∗ of (1.1) satisfies f (x∗ ) ≤ f (x) for all x in some open neighborhood of x∗ . The necessary optimality conditions at x∗ are
∇f (x∗ ) = 0 and ∇2f (x∗ ) ≥ 0,
where ∇2f (x∗ ) ≥ 0 means that the Hessian of f at x∗ is positive semi-definite.
Sufficient conditions for a point x∗ to be a local minimizer are
∇f (x∗ ) = 0 and ∇2f (x∗ ) > 0,
where ∇2f (x∗ ) > 0 means that the Hessian of f at x∗ is positive definite. Since
∇f (x∗ ) = 0, many methods for solving (1.1) attempt to “drive” the gradient
to zero. The methods considered here are iterative and generate search directions
by minimizing quadratic approximations to f . In what follows, let xk denote the
kth iterate and pk the kth search direction.
Newton’s method for solving (1.1) minimizes a quadratic model of f
each iteration. The function qkN (x) given by
    q_k^N(x) = f(x_k) + ∇f(x_k)^T(x − x_k) + ½ (x − x_k)^T ∇²f(x_k)(x − x_k),
(1.2)
is a second-order Taylor-series approximation to f at the point xk . If ∇2f (xk ) > 0,
then qkN (x) has a unique minimizer, corresponding to the point at which ∇qkN (x)
vanishes. This point is taken as the new estimate xk+1 of x∗ . If the substitution
p = x − xk is made in (1.2) then the resulting quadratic model
    q_k^{N′}(p) = f(x_k) + ∇f(x_k)^T p + ½ p^T ∇²f(x_k) p   (1.3)
can be minimized with respect to p for a search direction p_k. If ∇²f(x_k) > 0, then
the vector p_k such that ∇q_k^{N′}(p_k) = ∇²f(x_k) p_k + ∇f(x_k) = 0 minimizes q_k^{N′}(p).
The new iterate is defined as xk+1 = xk + pk . This leads to the definition of
Newton’s method given below.
Algorithm 1.1. Newton’s method
Initialize k = 0 and choose x0 .
while not converged do
Solve ∇2f (xk )p = −∇f (xk ) for pk .
xk+1 = xk + pk .
k ←k+1
end do
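To make the iteration concrete, the following is a minimal Python/NumPy sketch of Algorithm 1.1. The test function, its derivatives, the starting point and the stopping tolerance are illustrative assumptions of this sketch, not part of the thesis.

# A minimal sketch of Algorithm 1.1, assuming a strictly convex test
# function so that the pure Newton iteration is well defined.
import numpy as np

def grad(x):
    # gradient of f(x) = sum(exp(x_i) + 0.5*x_i^2)
    return np.exp(x) + x

def hess(x):
    # the Hessian is diagonal and positive definite for every x
    return np.diag(np.exp(x) + 1.0)

x = np.array([1.0, -2.0, 0.5])          # x_0
for k in range(20):
    g = grad(x)
    if np.linalg.norm(g) <= 1e-10:      # "while not converged"
        break
    p = np.linalg.solve(hess(x), -g)    # solve  Hess(x_k) p = -grad(x_k)
    x = x + p                           # x_{k+1} = x_k + p_k
print(k, x, np.linalg.norm(grad(x)))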
We now summarize the convergence properties of Newton’s method. It
is important to note that the method seeks points at which the gradient vanishes
and has no particular affinity for minimizers. In the following theorem we will
let x̄ denote a point such that ∇f (x̄) = 0.
Theorem 1.1 Let f : IRn → IR be a twice-continuously differentiable mapping
defined in an open set D, and assume that ∇f (x̄) = 0 for some x̄ ∈ D and that
∇2f (x̄) is nonsingular. Then there is an open set S such that for any x0 ∈ S the
Newton iterates are well defined, remain in S, and converge to x̄.
Proof. See Moré and Sorenson [30, pp. 37–38].
The rate or order of convergence of a sequence of iterates is as important
as its convergence. If a sequence {xk } converges to x̄ and
kxk+1 − x̄k ≤ Ckxk − x̄kp
(1.4)
for some positive constant C, then {xk } is said to converge with order p. The
special cases of p = 1 and p = 2 correspond to linear and quadratic convergence
respectively. In the case of linear convergence, the constant C must satisfy C ∈
(0, 1). Note that if C is close to 1, linear convergence can be unsatisfactory. For
example, if C = .9 and kxk − x̄k = .1, then roughly 21 iterations may be required
to attain kxk − x̄k = .01.
A sequence {xk } that converges to x̄ and satisfies
kxk+1 − x̄k ≤ βk kxk − x̄k,
for some sequence {βk } that converges to zero, is said to converge superlinearly.
Note that a sequence that converges superlinearly also converges linearly. Moreover, a sequence that converges quadratically converges superlinearly. In this
sense, superlinear convergence can be considered a “middle ground” between linear and quadratic convergence.
We now state order of convergence results for Newton’s method (for
proofs of these results, see Moré and Sorenson [30]). If f satisfies the conditions
of Theorem 1.1, the iterates converge to x̄ superlinearly. Moreover, if the Hessian
is Lipschitz continuous at x̄, i.e.,
k∇2f (x) − ∇2f (x̄)k ≤ κkx − x̄k (κ > 0),
(1.5)
then {xk } converges quadratically. These asymptotic rates of convergence of
Newton’s method are the benchmark for all other methods that use only first
and second derivatives of f . Note that since x∗ satisfies ∇f (x∗ ) = 0, these
results hold also for minimizers.
If x0 is far from x∗ , Newton’s method can have several deficiencies.
Consider first when ∇2f (xk ) is positive definite. In this case, pk is a descent
direction satisfying ∇f(x_k)^T p_k < 0. However, since the quadratic model q_k^{N′} is
only a local approximation of f , it is possible that f (xk + pk ) > f (xk ). This
problem is alleviated by redefining xk+1 = xk + αk pk , where αk is a positive step
length. If pTk ∇f (xk ) < 0, then the existence of ᾱ > 0 such that αk ∈ (0, ᾱ)
implies f (xk+1 ) < f (xk ) is guaranteed (see Fletcher [15]). The specific value
of αk is computed using a line search algorithm that approximately minimizes
the univariate function f (xk + αpk ). As a result of the line search, the iterates
satisfy f (xk+1 ) < f (xk ) for all k, which is the defining property associated with
all descent methods. This thesis is concerned mainly with descent methods that
use a line search.
Another problem with Algorithm 1.1 arises when ∇2f (xk ) is indefinite
or singular. In this case, pk may be undefined, non-uniquely defined, or a nondescent direction. This drawback has been successfully overcome by both modified Newton methods and trust-region methods. Modified Newton methods replace
∇2f (xk ) with a positive-definite approximation whenever the former is indefinite
or singular (see Gill et al. [22] for details). Trust-region methods minimize the
quadratic model (1.3) in some small region surrounding xk (see Moré and Sorenson [13, pp. 61–67] for further details).
Any Newton method requires the definition of O(n2 ) second-derivatives
associated with the Hessian. In some cases, for example when f is the solution to
a differential or integral equation, it may be inconvenient or expensive to define
the Hessian. In the next section, quasi-Newton methods are introduced that solve
the unconstrained problem (1.1) using only gradient information.
1.2 Quasi-Newton methods
The idea of approximating the Hessian with a symmetric positive-definite matrix
was first introduced in Davidon’s 1959 paper, Variable metric methods for minimization [9]. If Bk denotes an approximate Hessian, then the quadratic model
qkN is replaced by
    q_k(x) = f(x_k) + ∇f(x_k)^T(x − x_k) + ½ (x − x_k)^T B_k (x − x_k).
(1.6)
In this case, pk is the solution of the subproblem
    minimize_{p ∈ IR^n}  f(x_k) + ∇f(x_k)^T p + ½ p^T B_k p.   (1.7)
Since Bk is positive definite, pk satisfies
Bk pk = −∇f (xk )
(1.8)
and pk is guaranteed to be a descent direction. Approximate second-derivative
information obtained in moving from xk to xk+1 is incorporated into Bk+1 using
an “update” to Bk . Hence, a general quasi-Newton method takes the form given
in Algorithm 1.2 below.
Algorithm 1.2. Quasi-Newton method
Initialize k = 0; Choose x0 and B0 ;
while not converged do
Solve Bk pk = −∇f (xk );
Compute αk , and set xk+1 = xk + αk pk ;
Compute Bk+1 by applying an update to Bk ;
k ← k + 1;
end do
It remains to discuss the form of the update to Bk and the choice of αk .
Define sk = xk+1 − xk , gk = ∇f (xk ) and yk = gk+1 − gk . The definition of xk+1
implies that sk satisfies
sk = αk pk .
(1.9)
This relationship will be used throughout this thesis. The curvature of f along sk at
a point xk is defined as sTk ∇2f (xk )sk . The gradient of f can be expanded about
xk to give
    g_{k+1} = ∇f(x_k + s_k) = g_k + ( ∫_0^1 ∇²f(x_k + ξ s_k) dξ ) s_k.
It follows from the definition of yk that
sTk ∇2f (xk )sk ≈ sTk yk .
(1.10)
The quantity sTk yk is called the approximate curvature of f at xk along sk .
Next, we present a class of low-rank changes to Bk that ensure
sTk Bk+1 sk = sTk yk ,
(1.11)
so that Bk+1 incorporates the correct approximate curvature.
• The well-known Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula defined by
    B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(s_k^T y_k)   (1.12)
is easily shown to satisfy (1.11). An implementation of Algorithm 1.2 using
the BFGS update will be called a “BFGS method”.
• The Davidon-Fletcher-Powell (DFP) formula is defined by
    B_{k+1} = B_k − (y_k s_k^T B_k + B_k s_k y_k^T)/(s_k^T y_k) + (1 + (s_k^T B_k s_k)/(s_k^T y_k)) (y_k y_k^T)/(s_k^T y_k).   (1.13)
An implementation of Algorithm 1.2 using the DFP update will be called
a “DFP method”.
• The approximate Hessians of the so-called Broyden class are defined by the
formulae
    B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(s_k^T y_k) + φ_k (s_k^T B_k s_k) w_k w_k^T,   (1.14)
where
    w_k = y_k/(s_k^T y_k) − (B_k s_k)/(s_k^T B_k s_k),
and φk is a scalar parameter. Note that the BFGS and DFP formulae
correspond to the choices φk = 0 and φk = 1.
• The convex class of updates is a subclass of the Broyden updates for which
φk ∈ [0, 1] for all k. The updates from the convex class satisfy (1.11) since they
are all elements of the Broyden class.
Several results follow immediately from the definition of the updates
in the Broyden class. First, formulae in the Broyden class apply at most rank-two updates to Bk . Second, updates in the Broyden class are such that Bk+1 is
symmetric as long as Bk is symmetric. Third, if Bk is positive definite and φk is
properly chosen (e.g., any φk ≥ 0 is acceptable (see Fletcher [16])), then Bk+1 is
positive definite if and only if sTkyk > 0.
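The following NumPy sketch implements the Broyden-class update (1.14) as a single function; phi = 0 reproduces the BFGS formula (1.12) and phi = 1 the DFP formula (1.13). The function name and the random test data are assumptions of this illustration, not thesis code.

# A hedged sketch of the Broyden-class update (1.14).
import numpy as np

def broyden_update(B, s, y, phi=0.0):
    Bs = B @ s
    sBs = s @ Bs
    sy = s @ y                           # must be positive for a positive-definite update
    w = y / sy - Bs / sBs                # the vector w_k of (1.14)
    return (B - np.outer(Bs, Bs) / sBs
              + np.outer(y, y) / sy
              + phi * sBs * np.outer(w, w))

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)              # a symmetric positive-definite "true" Hessian
s = rng.standard_normal(n)
y = H @ s                                # then s^T y = s^T H s > 0
B1 = broyden_update(np.eye(n), s, y, phi=0.5)
print(np.allclose(B1 @ s, y),            # secant condition B_{k+1} s_k = y_k
      np.allclose(s @ B1 @ s, s @ y))    # curvature condition (1.11)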
In unconstrained optimization, the value of αk can ensure that sTkyk > 0.
In particular, sTkyk is positive if αk satisfies the Wolfe [48] conditions
    f(x_k + α_k p_k) ≤ f(x_k) + ν α_k g_k^T p_k   and   g_{k+1}^T p_k ≥ η g_k^T p_k,   (1.15)
where 0 < ν < 1/2 and ν ≤ η < 1. The existence of such αk is guaranteed if, for
example, f is bounded below. In a practical line search, it is often convenient to
require αk to satisfy the modified Wolfe conditions
    f(x_k + α_k p_k) ≤ f(x_k) + ν α_k g_k^T p_k   and   |g_{k+1}^T p_k| ≤ η |g_k^T p_k|.   (1.16)
The existence of an αk satisfying these conditions can also be guaranteed theoretically. (See Fletcher [15, pp. 26–30] for the existence results and further details.)
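As a small illustration (an assumption-laden sketch, not thesis code), the following helper tests whether a given step length satisfies the Wolfe conditions (1.15) or the modified Wolfe conditions (1.16); the default values of nu and eta are common choices and are not prescribed by the text.

import numpy as np

def wolfe(f, grad, x, p, alpha, nu=1e-4, eta=0.9, modified=False):
    g0p = grad(x) @ p                         # g_k^T p_k (negative for a descent direction)
    x1 = x + alpha * p
    sufficient_decrease = f(x1) <= f(x) + nu * alpha * g0p
    g1p = grad(x1) @ p                        # g_{k+1}^T p_k
    if modified:
        curvature = abs(g1p) <= eta * abs(g0p)    # condition (1.16)
    else:
        curvature = g1p >= eta * g0p              # condition (1.15)
    return sufficient_decrease and curvature

# Example on the convex quadratic f(x) = 0.5 x^T x with p = -g and alpha = 1.
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([1.0, 2.0])
print(wolfe(f, grad, x, -grad(x), 1.0), wolfe(f, grad, x, -grad(x), 1.0, modified=True))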
For theoretical discussion, αk is sometimes considered to be an exact
minimizer of the univariate function Ψ(α) defined by Ψ(α) = f (xk + αpk ). This
choice ensures a positive-definite update since, for such an α_k, g_{k+1}^T p_k = 0, which
implies sTk yk > 0. Properties of Algorithm 1.2 when it is applied to a convex
quadratic objective function using such an exact line search are given in the next
section.
1.2.1 Minimizing strictly convex quadratic functions
Consider the quadratic function
    q(x) = d + c^T x + ½ x^T H x,   where d ∈ IR, c ∈ IR^n, H ∈ IR^{n×n},   (1.17)
and H is symmetric positive definite and independent of x. This quadratic has
a unique minimizer x∗ that satisfies Hx∗ = −c. If Algorithm 1.2 is used with
an exact line search and an update from the Broyden class, then the following
properties hold at the kth (0 < k ≤ n) iteration:
    B_k s_i = H s_i,   (1.18)
    s_i^T H s_k = 0,   (1.19)
and
    s_i^T g_k = 0,   (1.20)
for all i < k. Multiplying (1.18) by s_i^T gives s_i^T B_k s_i = s_i^T H s_i, which implies that
the curvature of the quadratic model (1.6) along s_i (i < k) is exact. Define
S_k = ( s_0  s_1  · · ·  s_{k−1} ) and assume that s_i ≠ 0 (0 ≤ i ≤ n − 1). Under
this assumption, note that (1.19) implies that the set {si | i ≤ n − 1} is linearly
independent. At the start of the nth iteration, (1.18) implies that Bn Sn = HSn ,
and Bn = H since Sn is nonsingular.
It can be shown that xk minimizes q(x) on the manifold defined by x0
and range(Sk ) (see Fletcher [15, pp. 25–26]). It follows that xn minimizes q(x).
This implies that Algorithm 1.2 with an exact line search finds the minimizer of
the quadratic (1.17) in at most n steps, a property often referred to as quadratic
termination.
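The following NumPy experiment illustrates quadratic termination numerically: the BFGS method with an exact line search and B_0 = I is applied to a randomly generated strictly convex quadratic, and after n iterations the gradient is negligible and B_n ≈ H. The problem data are assumptions of the sketch, not taken from the thesis.

import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)              # symmetric positive definite
c = rng.standard_normal(n)

grad = lambda x: H @ x + c               # gradient of q(x) = d + c^T x + 0.5 x^T H x
x = rng.standard_normal(n)
B = np.eye(n)                            # B_0 = I

for k in range(n):
    g = grad(x)
    p = np.linalg.solve(B, -g)           # B_k p_k = -g_k
    alpha = -(g @ p) / (p @ H @ p)       # exact line search for a quadratic
    s = alpha * p
    y = H @ s                            # y_k = g_{k+1} - g_k = H s_k
    x = x + s
    B = B - np.outer(B @ s, B @ s) / (s @ B @ s) + np.outer(y, y) / (s @ y)   # BFGS (1.12)

print(np.linalg.norm(grad(x)), np.linalg.norm(B - H))   # both should be tiny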
Further properties of Algorithm 1.2 follow from its well-known equivalence to the conjugate-gradient method when used to minimize convex quadratic
functions using an exact line search. If B0 = I and the updates are from the
Broyden class, then for all k ≥ 1 and 0 ≤ i < k,
    g_i^T g_k = 0   (1.21)
and
    p_k = −g_k + β_{k−1} p_{k−1},   (1.22)
where β_{k−1} = ‖g_k‖²/‖g_{k−1}‖² (see Fletcher [15, p. 65] for further details).
1.2.2 Minimizing convex objective functions
Much of the convergence theory for quasi-Newton methods involves convex functions. The theory focuses on two properties of the sequence of iterates. First,
given an arbitrary starting point x0 , will the sequence of iterates converge to x∗ ?
If so, then the method is said to be globally convergent. Second, what is the order
of convergence of the sequence of iterates? In the next two sections, we present
some of the results from the literature regarding the convergence properties of
quasi-Newton methods.
Global convergence of quasi-Newton methods
Consider the application of Algorithm 1.2 to a convex function. Powell has shown
that in this case, the BFGS method with a Wolfe line search is globally convergent
with lim inf kgk k = 0 (see Powell [40]). Byrd, Nocedal and Yuan have extended
Powell’s result to a quasi-Newton method using any update from the convex class
except the DFP update (see Byrd et al. [6]).
Uniformly convex functions are an important subclass of the set of convex functions. The Hessians of these functions satisfy
    m ‖z‖² ≤ z^T ∇²f(x) z ≤ M ‖z‖²,   (1.23)
for all x and z in IRn . It follows that a function in this class has a unique minimizer
x∗ . Although the DFP method is on the boundary of the convex class, it has not
been shown to be globally convergent, even on uniformly convex functions (see
Nocedal [36]).
Order of convergence of quasi-Newton methods
The order of convergence of a sequence has been defined in Section 1.1. The
method of steepest descent, which sets pk = −gk for all k, is known to converge linearly from any starting point (see, for example, Gill et al. [22, p. 103]).
This poor rate of convergence occurs because steepest descent uses no second-derivative information (the method implicitly chooses Bk = I for all k). On the
other hand, Newton’s method can be shown to converge quadratically for x0 sufficiently close to x∗ if ∇2f (x) is nonsingular and satisfies the Lipschitz condition
(1.5) at x∗ . Since quasi-Newton methods use an approximation to the Hessian,
they might be expected to converge at a rate between linear and quadratic. This
is indeed the case.
The following order of convergence results apply to the general quasiNewton method given in Algorithm 1.2. It has been shown that {xk } converges
superlinearly to x∗ if and only if
    lim_{k→∞} ‖(B_k − ∇²f(x∗)) s_k‖ / ‖s_k‖ = 0   (1.24)
(see Dennis and Moré [11]). Hence, the approximate curvature must converge to
the curvature in f along the unit directions sk /ksk k. In a quasi-Newton method
using a Wolfe line search, it has been shown that if the search direction approaches
the Newton direction asymptotically, the step length αk = 1 is acceptable for large
enough k (see Dennis and Moré [12]).
Suppose now that a quasi-Newton method using updates from the convex class converges to a point x∗ such that ∇2f (x∗ ) is nonsingular. In this case,
if f is convex, Powell has shown that the BFGS method with a Wolfe line search
converges superlinearly as long as the unit step length is taken whenever possible
(see [40]). This result has been extended to every member of the convex class of
Broyden updates except the DFP update (see Byrd et al. [6]).
The DFP method has not been shown to be superlinearly convergent
when using a Wolfe line search. However, there are convergence results concerning
the application of the DFP method using an exact line search (see Nocedal [36]
for further discussion).
In Section 1.2.1, it was noted that if Algorithm 1.2 with exact line search
is applied to a strictly convex quadratic function, and the steps sk (0 ≤ k ≤ n−1)
are nonzero, then Bn = H. When applied to general functions, it should be noted
that Bk need not converge to ∇2f (x∗ ) even when {xk } converges to x∗ (see Dennis
and Moré [11]).
The global and superlinear convergence of Algorithm 1.2 when applied
to general f using a Wolfe line search remains an open question.
1.3 Computation of the search direction
Various methods for solving the system Bk pk = −gk in a practical implementation
of Algorithm 1.2 are discussed in this section.
1.3.1 Notation
For simplicity, the subscript k is suppressed in much of what follows. Bars, tildes
and cups are used to define updated quantities obtained during the kth iteration.
Underlines are sometimes used to denote quantities associated with xk−1 . The use
of the subscript will be retained in the definition of sets that contain a sequence of
quantities belonging to different iterations, e.g., {g0 , g1 , . . . , gk }. Also, for clarity,
the use of subscripts will be retained in the statement of results.
Throughout the thesis, Ij denotes the j × j identity matrix, where j
satisfies 1 ≤ j < n. The matrix I is reserved for the n × n identity matrix. The
vector ei denotes the ith column of an identity matrix whose order depends on
the context.
If u ∈ IRn and v ∈ IRm , then (u, v)T denotes the column vector of order
n + m whose components are the components of u and v.
1.3.2 Using Cholesky factors
The equations Bp = −g can be solved if an upper-triangular matrix R is known
such that B = RTR. If B̄ is obtained from B using a Broyden update, then an
upper-triangular matrix R̄ satisfying B̄ = R̄TR̄ can be obtained from a rank-one
update to R (see Goldfarb [24], Dennis and Schnabel [10]). In particular, the
BFGS update can be written as
    R̄ = S(R + u(w − R^T u)^T),   where   u = Rs/‖Rs‖,   w = y/(y^T s)^{1/2},   (1.25)
and S is an orthogonal matrix that transforms R + u(w − R^T u)^T to upper-triangular form.
Since many choices of S yield an upper-triangular R̄, we now describe
the particular choice used throughout the paper. The matrix S is of the form
S = S2 S1 , where S1 and S2 are products of Givens matrices. The matrix S1 is
defined by S1 = Pn,1 · · · Pn,n−2 Pn,n−1 , where Pnj (1 ≤ j ≤ n−1) is a Givens matrix
in the (j, n) plane designed to annihilate the jth element of Pn,j+1 · · · Pn,n−1 u.
The product S1 R is upper triangular except for the presence of a “row spike” in
the nth row. Since S1 u = ±en , the matrix S1 (R + u(w − RTu)T ) is also upper
triangular except for a row-spike in the nth row. This matrix is restored to
upper-triangular form using a second product of Givens matrices. In particular,
S2 = Pn−1,n Pn−2,n · · · P1n , where Pin (1 ≤ i ≤ n−1) is a Givens matrix in the (i, n)
plane defined to annihilate the (n, i) element of Pi−1,n · · · P1n S1 (R+u(w−RTu)T ).
For simplicity, the BFGS update (1.25) and the Broyden update to R
will be written
R̄ = BFGS(R, s, y) and R̄ = Broyden(R, s, y).
(1.26)
The form of S will be as described in the last paragraph.
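A hedged NumPy sketch of the factor update (1.25)-(1.26) follows. For brevity the re-triangularization is performed with a full QR factorization rather than the two sweeps of Givens rotations described above; this simplification, and the random test data, are assumptions of the sketch.

import numpy as np

def bfgs_factor_update(R, s, y):
    Rs = R @ s
    u = Rs / np.linalg.norm(Rs)
    w = y / np.sqrt(y @ s)
    M = R + np.outer(u, w - R.T @ u)      # R + u (w - R^T u)^T
    # Any orthogonal S with S M upper triangular will do; QR supplies one.
    return np.linalg.qr(M)[1]

rng = np.random.default_rng(2)
n = 5
R = np.triu(0.5 * rng.standard_normal((n, n))) + 2 * np.eye(n)   # B = R^T R
B = R.T @ R
s = rng.standard_normal(n)
y = B @ s + 0.1 * rng.standard_normal(n)
if s @ y > 0:
    Rbar = bfgs_factor_update(R, s, y)
    Bbar = B - np.outer(B @ s, B @ s) / (s @ B @ s) + np.outer(y, y) / (s @ y)
    print(np.allclose(Rbar.T @ Rbar, Bbar))   # R̄^T R̄ equals the BFGS update (1.12)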
Another choice of S that implies S1 (R + u(w − RTu)T ) is upper Hessenberg is described by Gill, Golub, Murray and Saunders [17]. Goldfarb prefers
to write the update as a product of R and a rank-one modification of the identity. This form of the update is also easily restored to upper-triangular form (see
Goldfarb [24]).
Some authors reserve the term “Cholesky factor” of a positive definite
matrix B to mean the triangular factor with positive diagonals satisfying B =
RTR. However, throughout this thesis, the diagonal components of R are not
restricted in sign, but R will be called “the” Cholesky factor of B.
1.3.3 Using conjugate-direction matrices
Since B is symmetric positive definite, there exists a nonsingular matrix V such
that V TBV = I. The columns of V are said to be “conjugate” with respect to
B. In terms of V , the approximate Hessian satisfies
B −1 = V V T,
(1.27)
which implies that the solution of (1.7) may be written as
p = −V V Tg.
(1.28)
If B̄ is defined by the BFGS formula (1.12), then a formula for V̄ satisfying
V̄ TB̄ V̄ = I can be obtained from the product form of the BFGS update (see
Brodlie, Gourlay, and Greenstadt [3]). The formula is given by
    V̄ = (I − s u^T) V Ω,   where   u = (Bs)/((s^T y)^{1/2} (s^T B s)^{1/2}) + y/(s^T y),   (1.29)
and Ω is an orthogonal matrix.
Powell has proposed that Ω be defined as follows. Let Ṽ denote the
product V Ω. The matrix Ω is chosen as a lower-Hessenberg matrix such that
the first column of Ṽ is parallel to s (see Powell [42]). Let gV be defined as
gV = V Tg,
(1.30)
and define Ω such that Ω T = P12 P23 · · · Pn−1,n , where Pi,i+1 is a rotation in the
(i, i+1) plane chosen to annihilate the (i+1)th component of Pi+1,i+2 · · · Pn−1,n gV .
Then, Ω is an orthogonal lower-Hessenberg matrix such that Ω TgV = kgV ke1 .
Furthermore, (1.28) and the relation s = αp give
    Ṽ e_1 = −(1/(α ‖g_V‖)) s.   (1.31)
Hence, the first column of Ṽ is parallel to s.
With this choice of Ω, Powell shows that the columns of V̄ satisfy
    v̄_i = s/(s^T y)^{1/2},                 if i = 1;
    v̄_i = ṽ_i − (ṽ_i^T y / s^T y) s,       otherwise.   (1.32)
Note that the matrix B in the update (1.29) has been eliminated in the formulae
(1.32).
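The following NumPy sketch applies the column formulae (1.32) to one quasi-Newton step on a small quadratic and then checks that the columns of V̄ are conjugate with respect to the updated matrix B̄. A Householder reflector is used (as one valid choice) for the orthogonal matrix Ω with Ω e_1 = g_V/‖g_V‖, in place of the lower-Hessenberg product of plane rotations described above; that choice, the test problem and all data are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(3)
n = 5
V = np.eye(n) + 0.2 * rng.standard_normal((n, n))   # conjugate directions: V^T B V = I
B = np.linalg.inv(V @ V.T)                          # since B^{-1} = V V^T, see (1.27)

Aq = rng.standard_normal((n, n))
Hq = Aq @ Aq.T + n * np.eye(n)                      # Hessian of a convex quadratic objective
x = rng.standard_normal(n)
gfun = lambda x: Hq @ x                             # its gradient
g = gfun(x)

gV = V.T @ g                                        # g_V = V^T g, see (1.30)
p = -V @ gV                                         # p = -V V^T g, see (1.28)
alpha = 0.5                                         # any positive step works here
s = alpha * p
y = gfun(x + s) - g                                 # y = H_q s, so s^T y > 0

ghat = gV / np.linalg.norm(gV)
w = ghat - np.eye(n)[:, 0]
Omega = np.eye(n) if np.linalg.norm(w) < 1e-12 else np.eye(n) - 2 * np.outer(w, w) / (w @ w)
Vt = V @ Omega                                      # Vtilde = V Omega; first column parallel to s

Vbar = Vt - np.outer(s, Vt.T @ y) / (s @ y)         # second case of (1.32), applied columnwise
Vbar[:, 0] = s / np.sqrt(s @ y)                     # first case of (1.32)

Bbar = B - np.outer(B @ s, B @ s) / (s @ B @ s) + np.outer(y, y) / (s @ y)
print(np.allclose(Vbar.T @ Bbar @ Vbar, np.eye(n)))  # columns of Vbar are conjugate w.r.t. Bbar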
Formulae have also been derived for matrices V̄ that satisfy V̄ TB̄ V̄ = I,
where B̄ is any Broyden update to B (see Siegel [47]).
1.4 Transformed and reduced Hessians
Let Q denote an n × n orthogonal matrix and let B denote a positive-definite approximation to ∇2f (x). The matrix QTBQ is called the transformed approximate
Hessian. If Q is partitioned as Q = ( Z  W ), the transformed Hessian has a corresponding partition

    Q^T B Q = [ Z^T B Z   Z^T B W ]
              [ W^T B Z   W^T B W ].
The positive-definite submatrices Z TBZ and W TBW are called reduced approximate Hessians.
Transformed Hessians are often used in the solution of constrained optimization problems (see, for example, Gill et al. [21]). In the next chapter, a
particular choice of Q will be seen to give block-diagonal structure to the approximate Hessians associated with quasi-Newton methods for unconstrained optimization. This simplification leads to another technique for solving Bp = −g
that involves a reduced Hessian. Reduced Hessian quasi-Newton methods using
this technique are the subject of Chapter 2.
Chapter 2
Reduced-Hessian Methods for
Unconstrained Optimization
In her dissertation, Fenelon [14] has shown that the BFGS method accumulates approximate curvature information in a sequence of expanding subspaces.
This feature is used to show that the BFGS search direction can often be generated with matrices of smaller dimension than the approximate Hessian. Use
of these reduced approximate Hessians leads to a variant of the BFGS method
that can be used to solve problems whose Hessians may be too large to store.
In this chapter, reduced Hessian methods are reviewed from Fenelon’s point of
view. A reduced inverse Hessian method, due to Siegel [46], is reviewed in Section 2.2. Fenelon’s and Siegel’s work is extended in Sections 2.3–2.5, giving new
reduced-Hessian methods that utilize the Broyden class of updates.
2.1 Fenelon’s reduced-Hessian BFGS method
Using the equations Bi pi = −gi and si = αi pi for 0 ≤ i ≤ k, the BFGS updates
from B0 to Bk can be “telescoped” to give
    B_k = B_0 + Σ_{i=0}^{k−1} ( (g_i g_i^T)/(g_i^T p_i) + (y_i y_i^T)/(s_i^T y_i) ).   (2.1)

If B0 = σI (σ > 0), then (2.1) can be used to show that the solution of Bk pk = −gk is given by

    p_k = −(1/σ) g_k − (1/σ) Σ_{i=0}^{k−1} ( ((g_i^T p_k)/(g_i^T p_i)) g_i + ((y_i^T p_k)/(s_i^T y_i)) y_i ).   (2.2)
(2.2)
Hence, if Gk denotes the set of vectors
Gk = {g0 , g1 , . . . , gk },
(2.3)
then (2.2) implies that pk ∈ span(Gk ). The following lemma summarizes this
result.
Lemma 2.1 (Fenelon) If the BFGS method is used to solve the unconstrained
minimization problem (1.1) with B0 = σI (σ > 0), then pk ∈ span(Gk ) for all k.
Using this result, Fenelon has shown that if Zk is a full-rank matrix such
that range(Zk ) = span(Gk ), then
    p_k = Z_k p_Z,   where p_Z = −(Z_k^T B_k Z_k)^{−1} Z_k^T g_k.   (2.4)
This form of the search direction implies a reduced-Hessian implementation of
the BFGS method employing Zk and an upper-triangular matrix RZ such that
RZTRZ = ZkTBk Zk .
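A minimal NumPy sketch of the computation implied by (2.4) is given below; it forms p = Z p_Z from Z and the Cholesky factor R_Z and checks the result against the full system B p = −g for a B that acts as σI off range(Z). numpy.linalg.solve stands in for the triangular solves that would be used in practice; all data are assumptions of the sketch.

import numpy as np

def reduced_direction(Z, RZ, g):
    gZ = Z.T @ g
    tZ = np.linalg.solve(RZ.T, -gZ)     # R_Z^T t_Z = -g_Z
    pZ = np.linalg.solve(RZ, tZ)        # R_Z p_Z = t_Z, so p_Z = -(R_Z^T R_Z)^{-1} g_Z
    return Z @ pZ                       # p = Z p_Z

rng = np.random.default_rng(4)
n, r, sigma = 8, 3, 2.0
Z = np.linalg.qr(rng.standard_normal((n, r)))[0]                 # orthonormal basis
RZ = np.triu(0.5 * rng.standard_normal((r, r))) + 2 * np.eye(r)  # reduced Hessian factor
B = Z @ (RZ.T @ RZ) @ Z.T + sigma * (np.eye(n) - Z @ Z.T)        # B acts as sigma*I off range(Z)
g = Z @ rng.standard_normal(r)          # a gradient lying in range(Z)
p = reduced_direction(Z, RZ, g)
print(np.allclose(p, np.linalg.solve(B, -g)))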
2.1.1 The Gram-Schmidt process
The matrix Zk is obtained from Gk using the Gram-Schmidt process. This process
gives an orthonormal basis for Gk . The choice of orthonormal basis is motivated
by the result
cond(ZkT Bk Zk ) ≤ cond(Bk ) if ZkT Zk = Irk
(see Gill et al. [22, p. 162]).
To simplify the description of this process we drop the subscript k, as
discussed in Section 1.3.1. At the start of the first iteration, Z is initialized to
g0 /kg0 k. During the kth iteration, assume that the columns of Z approximate
an orthonormal basis for span(G). The matrix Z̄ is defined so that range(Z̄) =
span(G ∪ ḡ) as follows. The vector ḡ can be uniquely written as ḡ = ḡ R + ḡ N ,
where ḡ R ∈ range(Z), ḡ N ∈ null(Z T ). The vector ḡ R satisfies ḡ R = ZZ T ḡ, which
implies that the component of ḡ orthogonal to range(Z) satisfies ḡ N = ḡ−ZZ T ḡ =
(I −ZZ T )ḡ. Let zḡ denote the normalized component of ḡ orthogonal to range(Z).
If we define ρḡ = kḡ N k, then zḡ = ḡ N /ρḡ . Note that if ρḡ = 0, then ḡ ∈ range(Z).
In this case, we will define Z̄ = Z.
To summarize, if r denotes the column dimension of Z, we define

    r̄ = r,  if ρḡ = 0;    r̄ = r + 1,  otherwise.   (2.5)

Using r̄, zḡ and Z̄ satisfy

    zḡ = 0,  if r̄ = r;    zḡ = (1/ρḡ)(I − Z Z^T)ḡ,  otherwise,   (2.6)

and

    Z̄ = Z,  if r̄ = r;    Z̄ = ( Z  zḡ ),  otherwise.   (2.7)
It is well-known that the Gram-Schmidt process is unstable in the presence of computer round-off error (see Golub and Van Loan [25, p. 218]). Several
methods have been proposed to stabilize the process. These methods are given
in Table 2.1. The advantages and disadvantages of each method are also given in
the table. Note that a “flop” is defined as a multiplication and an addition. The
flop counts given in the table are only approximations of the actual counts. The
value of 3.2nr flops for the reorthogonalization process is an average that results
if 3 reorthogonalizations are performed every 5 iterations.
Table 2.1: Alternate methods for computing Z
  Method                                     Advantage             Disadvantage
  Gram-Schmidt                               Simple; 2nr flops     Unstable
  Modified Gram-Schmidt                      More stable than GS   Z must be recomputed
                                                                   each iteration
  Gram-Schmidt with reorthogonalization      Stable                Expensive, e.g.,
  (Daniel et al. [76], Fenelon [81])                               3.2nr flops
  Implicitly (Siegel [92])                   nr + O(r^2) flops     Expensive if r is large
Another technique for stabilizing the process suggested by Daniel et
al. [8] (and used by Siegel [46]) is to ignore the component of ḡ orthogonal to
range(Z) if it is small (but possibly nonzero) relative to kgk. In this case, the
definition of r̄ satisfies

    r̄ = r,  if ρḡ ≤ ε‖ḡ‖;    r̄ = r + 1,  otherwise,   (2.8)

where ε ≥ 0 is a preassigned constant.
The matrix Z that results when this definition of r̄ is used has properties
that depend on the choice of ε. If ε = 0, then in exact arithmetic the columns of
Z form an orthonormal basis for span(G). Moreover, for any ε (ε ≥ 0), the matrix
Z forms an orthonormal basis for a subset of G. If K = {k1, k2, . . . , kr} denotes
the set of indices for which ρg > ε‖g‖ and Gε = ( gk1  gk2  · · ·  gkr ) is the
matrix of corresponding gradients, then the columns of Z form an orthonormal
basis for range(Gε). Gradients satisfying ρg > ε‖g‖ are said to be “accepted”;
otherwise, they are said to be “rejected”. Hence, Gε is the matrix of accepted
gradients associated with a particular choice of ε. Note that the dimension of Z
is nondecreasing with k.
During iteration k + 1, the vector ḡ Z̄ (ḡ Z̄ = Z̄ Tḡ) is needed to compute
the next search direction p̄. Since
    ḡZ̄ = Z^T ḡ,  if r̄ = r;    ḡZ̄ = ( Z^T ḡ, ρḡ )^T,  otherwise,   (2.9)
this quantity is a by-product of the computation of Z̄.
If r̄, ḡZ̄ and Z̄ satisfy (2.8), (2.9) and (2.7), then we will write

    (Z̄, ḡZ̄, r̄) = GS(Z, ḡ, r, ε).   (2.10)
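The following NumPy function is an illustrative sketch of the accept/reject step (2.5)-(2.10); the function name and the test data are assumptions, and plain Gram-Schmidt with no reorthogonalization is used for brevity.

import numpy as np

def gs_step(Z, gbar, eps):
    gR = Z.T @ gbar                     # coefficients of the projection onto range(Z)
    gN = gbar - Z @ gR                  # component orthogonal to range(Z)
    rho = np.linalg.norm(gN)
    if rho <= eps * np.linalg.norm(gbar):
        return Z, gR, Z.shape[1]        # gbar rejected: r_bar = r, see (2.8)
    z = gN / rho                        # new basis vector z_gbar, see (2.6)
    Zbar = np.hstack([Z, z[:, None]])   # Z_bar = ( Z  z_gbar ), see (2.7)
    return Zbar, np.append(gR, rho), Z.shape[1] + 1   # reduced gradient as in (2.9)

rng = np.random.default_rng(5)
n = 6
g0 = rng.standard_normal(n)
Z = (g0 / np.linalg.norm(g0))[:, None]  # initial basis Z = g_0/||g_0||
Zbar, gZbar, rbar = gs_step(Z, rng.standard_normal(n), eps=1e-8)
print(rbar, np.allclose(Zbar.T @ Zbar, np.eye(rbar)))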
2.1.2 The BFGS update to RZ
If Z, gZ and RZ are known during the kth iteration of a reduced-Hessian method,
then p is computed using (2.4). Following the calculation of x̄ in the line search,
ḡ is either rejected or added to the basis defined by Z. It remains to define
a matrix R̄Z̄ satisfying Z̄ TB̄ Z̄ = R̄TZ̄R̄Z̄ , where B̄ is obtained from B using the
BFGS update.
Let yZ denote the quantity Z Ty. If ḡ is rejected, Fenelon employs the
method of Gill et al. [17] to obtain R̄Z̄ from RZ via two rank-one updates involving
gZ and yZ . If ḡ is accepted, R̄Z̄ can be partitioned as

    R̄Z̄ = [ R̄Z   R̄ḡ ]
          [ 0    φ̄  ],   where φ̄ is a scalar.
The matrix R̄Z is obtained from RZ using gZ and yZ . The following lemma is
used to define R̄ḡ and φ̄.
Lemma 2.2 (Fenelon) If zḡ denotes the normalized component of gk+1 orthogonal to span(Gk ), then
    Z^T B_{k+1} zḡ = ((y^T zḡ)/(s^T y)) yZ   and   zḡ^T B_{k+1} zḡ = σ + ((zḡ^T y)²)/(s_k^T y_k).   (2.11)
(Although the relation zḡ Tg = 0 is used in the proof of Lemma 2.2, it was not
used to simplify (2.11).) The solution of an upper-triangular system involving
R̄Z and ((y^T zḡ)/(s^T y)) yZ is used to define R̄ḡ . The value φ̄ is then obtained from R̄ḡ
and zḡ T B̄zḡ .
2.2 Reduced inverse Hessian methods
Many quasi-Newton algorithms are defined in terms of the inverse approximate
Hessian Hk = Bk−1 . The Broyden update to Hk is
    H_{k+1} = M_k H_k M_k^T + (s_k s_k^T)/(s_k^T y_k) − ψ_k (y_k^T H_k y_k) r_k r_k^T,

where

    M_k = I − (s_k y_k^T)/(s_k^T y_k)   and   r_k = (H_k y_k)/(y_k^T H_k y_k) − s_k/(s_k^T y_k).
The parameter φk is related to ψk by the equation
φk (ψk − 1)(ykT Hk yk )(sTk Bk sk ) = ψk (φk − 1)(sTk yk )2 .
(2.12)
Note that the values ψk = 0 and ψk = 1 correspond to the BFGS and the DFP
updates respectively.
Siegel [46] gives a more general result than Lemma 2.1 that applies to
the entire Broyden class. The result is stated below without proof.
Lemma 2.3 (Siegel) If Algorithm 1.2 is used to solve the unconstrained minimization problem (1.1) with B0 = σI (σ > 0) and a Broyden update, then
pk ∈ span(Gk ) for all k. Moreover, if z ∈ span(Gk ) and w ∈ span(Gk )⊥ , then
Bk z ∈ span(Gk ), Hk z ∈ span(Gk ), Bk w = σw and Hk w = σ −1 w.
Let Gk denote the matrix of the first k + 1 gradients. For simplicity,
assume that these gradients are linearly independent and that k is less than n.
Since Gk has full column rank, it has a QR factorization of the form

    G_k = Q_k [ T_k ]
              [ 0   ],   where Q_k^T Q_k = I and T_k is nonsingular and upper triangular.   (2.13)

Define rk = dim(span(Gk )), and partition Qk = ( Zk  Wk ), where Zk ∈ IR^{n×rk}. Note that the product Gk = Zk Tk defines
a “skinny” QR factorization of Gk (see Golub and Van Loan [25, p. 217]). The
columns of Zk form an orthonormal basis for range(Gk ) and the columns of Wk
form an orthonormal basis for null(GTk ). If the first k+1 gradients are not linearly
independent, Qk is defined as in (2.13), except that G0k is used in place of Gk .
Hence, the first r columns of Qk are still an orthonormal basis for Gk .
Consider the transformed inverse Hessian QTk Hk Qk . Lemma 2.3 implies
that if H0 = σ^{−1} I, then QTk Hk Qk is block diagonal and satisfies

    Q_k^T H_k Q_k = [ Z_k^T H_k Z_k    0                 ]
                    [ 0                σ^{−1} I_{n−rk}   ].   (2.14)
As the equation for the search direction in terms of Hk satisfies pk = −Hk gk ,
we have QTk pk = −(QTk Hk Qk )QTk gk . It follows that pk = −Zk (ZkT Hk Zk )ZkT gk
since WkT gk = 0. This form of the search direction leads to a reduced inverse
Hessian method employing Zk and ZkT Hk Zk . Instead of using reorthogonalization
for stability, Siegel defines Zk implicitly in terms of Gk and a nonsingular uppertriangular matrix similar to Tk given by (2.13) (see Siegel [46] for further details).
This form of Zk has some advantages in the case of large-scale unconstrained
optimization (see Table 2.1).
2.3 An extension of Fenelon’s method
Lemma 2.3 is now used to show that pk is of the form (2.4) when Bk is updated
using any member from the Broyden class. Let Qk be defined as in Section 2.2,
i.e., Qk = ( Zk  Wk ), where range(Zk ) = span(G0k ) and QTk Qk = I. If B0 = σI
and Bk is updated using (1.14), then Lemma 2.3 implies that

    Q_k^T B_k Q_k = [ Z_k^T B_k Z_k    0              ]
                    [ 0                σ I_{n−rk}     ].   (2.15)
The equation for the search direction can be written as (QTkBk Qk )QTk pk = −QTk gk .
Since WkT gk = 0, it follows from the form of the transformed Hessian (2.15) that
pk satisfies (2.4).
The curvature of the quadratic model (1.7) along any unit vector in
range(Wk ) depends only on the choice of B0 and has no effect on pk . All relevant
curvature in Bk is contained in the reduced Hessian ZkTBk Zk . Since rk+1 ≥ rk
for all k, the curvature in the quadratic model used to define pk accumulates in
subspaces of nondecreasing dimension.
Let Qk+1 denote an update to Qk satisfying

    G0_{k+1} = Q_{k+1} [ T_{k+1} ]
                       [ 0       ],

where Tk+1 is nonsingular and upper triangular. Partition Qk+1 as Qk+1 = ( Zk+1  Wk+1 ), where Zk+1 ∈ IR^{n×rk+1}. Furthermore, let Zk+1 be defined so
that its first rk columns are identical to Zk . In the remainder of the section, the
subscript k will be omitted.
Let RQ and RQ̄ denote upper-triangular matrices such that RQT RQ =
QTBQ and RQ̄T RQ̄ = Q̄TB Q̄. Since the first r columns of Q and Q̄ are identical,
Lemma 2.3 implies that the matrices QTBQ and Q̄TB Q̄ are identical. Hence, the
form of the transformed Hessian given by (2.15) implies that RQ̄ is of the form

    R_Q̄ = [ R_Z̄    0                  ]
          [ 0      σ^{1/2} I_{n−r̄}    ],   (2.16)

where RZ̄ satisfies

    R_Z̄ = R_Z,  if r̄ = r;     R_Z̄ = [ R_Z   0       ]
                                     [ 0     σ^{1/2} ],  if r̄ = r + 1.   (2.17)
Define the transformed vectors sQ̄ = Q̄Ts and yQ̄ = Q̄Ty. Let R̄Q̄ denote
the Cholesky factor of Q̄TB̄ Q̄, where B̄ is obtained from B using a Broyden
update. The following lemma follows from the definition of B̄, sQ̄ and yQ̄ .
Lemma 2.4 If RQ , RQ̄ and R̄Q̄ satisfy RQT RQ = QTBQ, RQ̄T RQ̄ = Q̄TB Q̄ and
R̄TQ̄ R̄Q̄ = Q̄TB̄ Q̄, then
R̄Q̄ = Broyden(RQ̄ , sQ̄ , yQ̄ ).
(2.18)
Hence, the updated Cholesky factor of the transformed Hessian is obtained in
the same way as R except that sQ̄ and yQ̄ are used in place of s and y.
Lemma 2.3 and the definition of y imply that

    s_Q̄ = ( s_Z̄, 0 )^T   and   y_Q̄ = ( y_Z̄, 0 )^T,   (2.19)
where sZ̄ = Z̄ Ts and yZ̄ = Z̄ Ty. A simplification of the Broyden update results
from the special form of sQ̄ and yQ̄ .
Lemma 2.5 If sQ̄ and yQ̄ are of the form (2.19) and RQ̄ satisfies (2.16), then
R̄Q̄ = Broyden(RQ̄ , sQ̄ , yQ̄ ) satisfies

    R̄_Q̄ = [ R̄_Z̄    0                  ]
           [ 0      σ^{1/2} I_{n−r̄}    ],
where R̄Z̄ = Broyden(RZ̄ , sZ̄ , yZ̄ ).
Since R̄TQ̄ R̄Q̄ = Q̄TB̄ Q̄, and Q̄TB̄ Q̄ satisfies (2.15) post-dated one iteration, R̄Z̄ is the Cholesky factor of Z̄ TB̄ Z̄. It follows that the Cholesky factor
corresponding to the updated reduced Hessian can be obtained directly from RZ
using the reduced quantities sZ̄ and yZ̄ .
This discussion leads to the definition of reduced-Hessian methods using
updates from the Broyden class. We first present a version of these methods that
is identical in exact arithmetic to the corresponding quasi-Newton method. This
method will serve as a template for the more practical reduced-Hessian methods
that follow.
Algorithm 2.1. Template reduced-Hessian quasi-Newton method
Initialize k = 0, r0 = 1; Choose x0 and σ;
Initialize Z = g0 /kg0 k, gZ = kg0 k and RZ = σ 1/2 ;
while not converged do
Solve RZT tZ = −gZ , RZ pZ = tZ , and set p = ZpZ ;
Compute α so that sTy > 0 and set x̄ = x + αp;
Compute (Z̄, ḡ Z̄ , r̄) = GS(Z, ḡ, r, 0);
Form RZ̄ according to (2.17);
if r̄ = r then
Set gZ̄ = gZ and pZ̄ = pZ ;
else
Set gZ̄ = (gZ , 0)T and pZ̄ = (pZ , 0)T;
Reduced-Hessian Methods for Unconstrained Optimization
28
end if
Compute sZ̄ = αpZ̄ and yZ̄ = ḡ Z̄ − gZ̄ ;
Compute R̄Z̄ = Broyden(RZ̄ , sZ̄ , yZ̄ );
k ← k + 1;
end do
The columns of Z form an orthonormal basis for span(G) since = 0.
The definition of gZ̄ follows because when ḡ is accepted, gZ̄ = (Z Tg, zḡ T g)T =
(gZ , 0)T since g ∈ range(Z). A similar argument implies that the form of pZ̄ is
correct.
Round-off error can cause the computed value of ρḡ to be inaccurate
when ḡ is nearly in range(Z). For this reason, we consider a modification of
Algorithm 2.1 that employs a positive value for . In this case, the following
comment is made with regard to the definition of gZ̄ .
Consider the case when g has been rejected and ḡ is accepted. At the
end of iteration k, the vector gZ̄ satisfies gZ̄ = Z̄ Tg = (gZ , zḡ T g)T . Note that zḡ
may be nonzero since g might have been rejected with 0 < ρg ≤ kgk. In this case,
gZ̄ is not of the form given in Algorithm 2.1. We take the suggestion of Siegel [46]
and define the update in terms of an approximation gZ̄ defined by gZ̄ = (gZ , 0)T .
The quantity yZ̄ is replaced by approximation yZ̄ defined by yZ̄ = ḡ Z̄ − gZ̄ .
This discussion leads to the definition of the following reduced-Hessian
algorithm.
Algorithm 2.2. Reduced-Hessian quasi-Newton method (RH)
Initialize k = 0, r0 = 1; Choose x0 , σ and ;
Initialize Z = g0 /kg0 k, gZ = kg0 k and RZ = σ 1/2 ;
while not converged do
Reduced-Hessian Methods for Unconstrained Optimization
29
Solve RZT tZ = −gZ , RZ pZ = tZ , and set p = ZpZ ;
Compute α so that sTy > 0 and set x̄ = x + αp;
Compute (Z̄, ḡ Z̄ , r̄) = GS(Z, ḡ, r, );
Form RZ̄ according to (2.17);
if r̄ = r then
Define gZ̄ = gZ and pZ̄ = pZ ;
else
Define gZ̄ = (gZ , 0)T and pZ̄ = (pZ , 0)T ;
Compute sZ̄ = αpZ̄ and yZ̄ = ḡ Z̄ − gZ̄ ;
Compute R̄Z̄ = Broyden(RZ̄ , sZ̄ , yZ̄ );
k ← k + 1;
end do
Note that the Broyden update is well defined as long as sTy > 0 since
sTZ̄ yZ̄ = sTZ yZ = sTQ yQ = sTy.
2.4
(2.20)
The effective approximate Hessian
As suggested by Nazareth [34], we define an effective approximate Hessian B in
terms of Z, RZ , and the implicit matrix W . In particular, with Q = ( Z
W ),
B is given by

B = Q(RQ )T RQ QT ,

RZ
0
.
where RQ = 
0 σ 1/2 In−r
The quadratic model associated with B is denoted by q (p) and satisfies
q (p) = f (x) + g Tp + 21 pT B p.
It can be verified that the search direction p = ZpZ defined in Algorithm RH
minimizes q (p) in range(Z).
Reduced-Hessian Methods for Unconstrained Optimization
30
It is important to note that if > 0, B may not be equal to the
approximate Hessian B generated by Algorithm 1.2. To see this, suppose that
the first k + 1 gradients are accepted and that RZ is updated as described above.
During iteration k, suppose that ḡ is not accepted, but that 0 < ρḡ ≤ kḡk. This
implies that the component of ḡ orthogonal to range(Z) is nonzero. Since ḡ is
not accepted and g has been accepted, it follows that yQ̄ satisfies






Z̄ T ḡ
Z̄ T g
Z̄ T y
yQ̄ =  T  −  T  =  T  .
W̄ ḡ
W̄ g
W̄ ḡ
(2.21)
Since 0 < ρḡ ≤ kḡk, it follows that W̄ T ḡ 6= 0 (note that kW̄ T ḡk = ρḡ ) and that yQ̄
does not satisfy the hypothesis (2.19) of Lemma 2.5. If R̄Q̄ = Broyden(RQ̄ , sQ̄ , yQ̄ ),
where RQ̄ satisfies (2.16) and yQ̄ satisfies (2.21), then R̄Q̄ is generally a dense
upper-triangular matrix (although the elements corresponding to W̄ are “small”),
which is not equal to R̄Q̄ .
The structure of R̄Q̄ corresponds to an approximate gradient ḡ defined
by ḡ = Z̄ ḡ Z̄ . Note that ḡ 0 = ḡ and that ḡ = ḡ, whenever ḡ is accepted.
The vector g is similarly defined as g = ZgZ . In terms of these approximate
gradients, y is defined by y = ḡ − g . Since




Z̄ Ty   yZ̄ 
T 
yQ̄ = Q̄ y =
=
,
W̄ Ty 0
the following lemma holds.
Lemma 2.6 Let Z ∈ IRn×r and let RQ denote a nonsingular upper-triangular
matrix of the form


RZ
0 
RQ = 
,
0 σIn−r
where RZ ∈ IRr×r .
Let r̄ and Z̄ be defined by (Z̄, r̄) = GS(Z, ḡ, r, ). Define

RQ̄ = 

RZ̄
0 
,
0 σIn−r̄
where RZ̄ is defined by (2.17).
Reduced-Hessian Methods for Unconstrained Optimization
31
If R̄Q̄ = Broyden(RQ̄ , sQ̄ , yQ̄ ), then


R̄Z̄
0 
R̄Q̄ = 
,
0 σIn−r̄
where R̄Z̄ = Broyden(RZ̄ , sZ̄ , yZ̄ ).
Proof. Since s ∈ range(Z) and the first r columns of Q̄ are the columns of Z, it
follows that sQ̄ = (sZ , 0)T. A short calculation verifies that the first r components
of yQ̄ are given by yZ . Hence,
sTQ̄ yQ̄ = sTZ yZ = sTQ̄ yQ̄ = sTy > 0,
(2.22)
which implies that R̄Q̄ is well defined. An identical argument shows that R̄Z̄ is
well defined. The rest of the proof follows from the form of sQ̄ and yQ̄ and the
definition of the Broyden update to the Cholesky factor.
2.5
Lingering on a subspace
We have seen that quasi-Newton methods gain approximate curvature in a sequence of expanding subspaces whose dimensions are given by dim(span(Gk )).
The subspace span(Gk ) ⊕ span(x0 ) is the manifold determined by x0 and Gk
and will be denoted by Mk (Gk , x0 ). Because of the form of pk , it is clear that
{x0 , x1 , . . . , xk } lies in Mk (Gk , x0 ). Moreover, as will be shown in Chapter 5,
each iteration that dim(span(Gk )) increases, the iterates “step into” the corresponding larger subspace. Hence, {x0 , x1 , . . . , xk } spans Mk (Gk , x0 ). This
property also holds if Algorithm RH is used with a positive value of , i.e.,
span{x0 , x1 , . . . , xk } = Mk (Gk , x0 ) for any ≥ 0.
We now consider a modification of Algorithm RH that employs a scheme
in which successive iterates “linger” on a manifold smaller than Mk (Gk , x0 ).
Both Fenelon [14] and Siegel [45] have considered lingering as an optimization
Reduced-Hessian Methods for Unconstrained Optimization
32
strategy. Our experience has shown that lingering can be beneficial, especially
when combined with the rescaling strategies defined in Chapter 3. We develop
the lingering strategy as a modification of Algorithm RH during iteration k. The
iteration subscript is again dropped as described in Section 1.3.1.
Suppose that g is accepted during iteration k − 1 of Algorithm RH. It
follows that Z satisfies Z = ( Z
zg ) at the start of iteration k (recall that Z is
defined by Z = Zk−1 ). Define U = Z and Y = zg so that Z = ( U
Y ) and let
l denote the number of columns in U , which is r − 1 in this case. Partition RZ
according to


RU RU Y 
RZ = 
,
RY
where RU ∈ IRl×l .
(2.23)
The search direction defined in Algorithm RH satisfies
pr = ZpZ ,
where RZT tZ = −gZ
and RZ pZ = tZ .
(2.24)
The superscript r has been added to emphasize that the search direction is obtained from the r-dimensional subspace range(Z). A unit step along pr minimizes the quadratic model q (p) in range(Z). Partition gZ = (gU , gY )T and
tZ = (tU , tY )T , where gU = U Tg, gY = Y Tg, tU ∈ IRl and tY ∈ IRr−l . Note that the
partition of RZ given by (2.23) and the equation RZ tZ = −gZ imply that
RUT tU = −gU
and RYT tY = −(RUT Y tU + gY ).
(2.25)
The reduction in the quadratic model q (p) along pr satisfies
q (0) − q (pr ) = 12 ktZ k2 .
Let pl denote the vector obtained by minimizing the quadratic model in range(U ).
The vector pl satisfies
pl = −U (RUT RU )−1 gU
Reduced-Hessian Methods for Unconstrained Optimization
33
and the reduction in the quadratic model along pl satisfies
q (0) − q (pl ) = 12 ktU k2 .
When minimizing a convex quadratic function with exact line search, successive
gradients are mutually orthogonal. In this case, gU = 0 and it follows from (2.25)
that tU = 0. Hence, a decrease in the quadratic model can be made only by
minimizing on the subspace determined by Z. However, Siegel has observed that
this behavior can be nearly reversed when minimizing general functions with an
inexact line search (see Siegel [45]). In this case, it is possible that ktU k ≈ ktZ k,
which implies that nearly all of the reduction in the quadratic model is obtained
in range(U ).
Since gU = 0 when minimizing quadratics with exact line search, quasiNewton methods minimize completely on range(U ) before to moving into the
larger subspace range(Z). This phenomenon leads to the well-known property of
quadratic termination. Although this property is not retained when minimizing
general f , the quasi-Newton method can be modified so that gU is “smaller”
before moving into the larger subspace range(Z). This modification is achieved
easily by choosing pl instead of pr as the search direction. If the search direction
is given by pl , then the iterate x̄ = x + αpl remains on the manifold M(U, x0 )
defined by x0 , x1 , . . ., xk .
While the iterates linger on range(U ), it is likely that the column dimension of Z continues to grow as gradients are accepted into the basis. The new
components of each accepted gradient are appended to Z as in Algorithm RH and
contribute to an increase in the dimension of range(Y ). The matrix U remains
fixed as long as the iterates linger on M(U, x0 ). While the iterates linger, unused
approximate curvature accumulates in the effective Hessian along directions in
Reduced-Hessian Methods for Unconstrained Optimization
34
range(Y ).
As noted by Fenelon [14, p. 72], it is not generally efficient to remain
on M(U, x0 ) until gU = 0. As suggested by Siegel [45], we will allow the iterates
to linger on M(U, x0 ) until the reduction in the quadratic model obtained by
moving into range(Y ) is significantly better than that obtained by lingering. In
particular, the iterates will remain in M(U, x0 ) as long as ktU k2 > τ ktZ k2 , where
τ ∈ ( 12 , 1] is a preassigned constant. Since ktZ k2 = ktU k2 + ktY k2 , the inequality
ktU k2 > τ ktZ k2 is equivalent to (1 − τ )ktU k2 > ktZ k2 . Hence, if τ = 1, then the
iterates do not linger.
In the case that p = pr (ktU k2 ≤ τ ktZ k2 ), let pZ be partioned as pZ =
(pU , pY )T, where pU ∈ IRl and pY ∈ IRr−l . The partition of RZ and the equation
RZ pZ = tZ imply that
RY pY = tY
and RU pU = tU − RU Y pY .
(2.26)
In terms of pU and pY , the search direction satisfies p = U pU + Y pY . Note that
if τ < 1, then the inequality (1 − τ )ktU k2 ≤ ktY k2 implies that tY 6= 0. It follows
from (2.26) that pY 6= 0 since RY is nonsingular. Hence, Y pY is a nonzero step
into range(Y ) and we will say that the iterate x̄ = x + αpr “steps into” range(Y ).
2.5.1
Updating Z when p = pr
When x̄ “steps into” range(Y ), the dimension of the manifold defined by the
sequence of iterates increases by one. The new manifold is determined by x0 , U
and pr . If subsequent iterates are to linger on this manifold, then it is convenient
to change U to another matrix, say Ue , such that range(U ) ⊂ range(Ue ) and
pr ∈ range(Ue ). The new manifold is then given by M(x0 , Ue ). If the search
directions are taken from range(Ue ), then the iterates will remain on M(x0 , Ue ).
Reduced-Hessian Methods for Unconstrained Optimization
35
The matrix Ue can be defined using an update to Z following the computation of pr .
Ze = ( Ue
Let Ze denote the desired update to Z and partition Ze as
Ye ), where Ue ∈ IRn×(l+1) and Ye ∈ IRn×(r−l−1) . The component of
pr in range(Y ) is given by Y pY . The matrix Ze is defined so that
range(Ue ) = range(U ) ⊕ range(Y pY ),
e = range(Z). Because range(Z)
e = range(Z), the update essentially
and range(Z)
defines a “reorganization” of Z.
The update described here corresponds to an update of the GramSchmidt QR factorization associated with G and is due to Daniel, et al. [8].
Let S denote an orthogonal (r − l) × (r − l) matrix satisfying SpY = kpY ke1
and define Ze = ( U
Y S T ). Note that Y S Te1 = Y pY /kpY k. Accordingly, the
Y S Te1 ). The
e i.e., U
e =(U
update Ue is given by the first l + 1 columns of Z,
remainder of Ze is denoted by Ye , i.e., Ye = ( Y S Te2
Y S Te3
···
Y S Ter ). A
e = range(Z) and Z
e TZ
e = I . Hence, Z
e is also
short argument shows that range(Z)
r
an orthonormal basis for G . The matrix S satisfies S = Pl+1,l+2 Pl+2,l+3 · · · Pr−1,r ,
where Pi,i+1 is a symmetric (r−l)×(r−l) Givens matrix in the (i, i+1) plane chosen to annihilate the (i + 1)th component of Pi+1,i+2 · · · Pr−1,r pY . We say that the
component of p in range(Y ) is “rotated” into Ue . This component is considered
to be removed from Y to define Ye since these two matrices satisfy
range(Y ) = range(Ye ) ⊕ range(Y pY ).
As an aside, we note that if Ue is defined as above, then the columns of
Ue form a basis for the search directions. Moreover, if Pk = ( pk0
p k1
···
p kl )
denotes the matrix of “full” search directions satisfying p = pr , then Pk has full
rank and range(Pk ) = range(Ue ).
Reduced-Hessian Methods for Unconstrained Optimization
36
If p = pl , then let Ue = U , Ye = Y and Ze = Z. The new partition
parameter satisfies


l̄ = 
if ktU k2 > τ ktZ k2 ;
l,
(2.27)
l + 1,
otherwise,
The matrix Z̄ is defined by the Gram-Schmidt process used in Algorithm RH, except that Ze is used in place of Z. Hence, Z̄, ḡ Z̄ and r̄ satisfy
e ḡ, r, ).
(Z̄, ḡ Z̄ , r̄) = GS(Z,
2.5.2
Calculating sZ̄ and yZ̄
At the end of iteration k, the quantities sZ̄ and yZ̄ are required to compute R̄Z̄
using a Broyden update. (Recall that yZ̄ is the approximation of yZ̄ that results
when a positive value of is used in the Gram-Schmidt process.) Computational
savings can be made if these quantities are obtained using pZ and gZ . We discuss
the definition of sZ̄ first. The vector sZ̄ satisfies


sZ̄ = 
sZe,
if r̄ = r;
(2.28)
(sZe, 0)T ,
if r̄ = r + 1,
where sZe = Ze Ts. The vector sZe satisfies sZe = αpZe, where pZe = Ze Tp. If the
partition parameter increases, then pZe 6= pZ . However, we shall show below that
pZe can be obtained directly from pZ without a matrix-vector multiplication. This
is important, especially when n is large, since the computation of Ze Tp “from
e p
scratch” requires nr floating point operations. From the definition of Z,
Z
e
satisfies

pZe = Ze Tp = ( U



U Tp   pU 
Y S T )Tp = 
=
.
SY Tp
kpY ke1
The value kpY k is computed during the update of Z. Hence, the definition of pZe
requires no further computation. Note that if l̄ = l, then pZe = pZ .
Reduced-Hessian Methods for Unconstrained Optimization
37
Second, we discuss the calculation of yZ̄ . The vector gZe defined by
gZe = Ze Tg satisfies


gZ 
gZe = 
.
SgY
This vector can be calculated by applying the Givens matrices defining S to the
vector gY . The definition of gZ̄ is similar to the definition used in Algorithm RH,
i.e.,
gZ̄ =


gZe,

(gZe, 0)T ,
if r̄ = r;
(2.29)
if r̄ = r + 1.
The vector yZ̄ is defined as in Algorithm RH, i.e., yZ̄ = ḡ Z̄ − gZ̄ .
2.5.3
The form of RZ when using the BFGS update
In this section, the effect of the lingering strategy on the block structure of RZ
is examined. Although a complete algorithm utilizing lingering has not yet been
defined, we present some preliminary results based on the discussion given to this
point. The first result gives information about the effect of the BFGS update on
RZ̄ when s ∈ range(Ū ).
Lemma 2.7 Let RZ̄ denote a nonsingular upper-triangular r̄ × r̄ matrix partitioned as


RŪ RŪ Ȳ 
RZ̄ = 
,
0 RȲ
where RŪ ∈ IRl̄×l̄
and
RȲ ∈ IRr̄−l̄ .
Suppose sZ̄ is of the form sZ̄ = (sŪ , 0)T, where sŪ ∈ IRl̄ , and that yZ̄ ∈ IRr̄ . If
R̄Z̄ = BFGS(RZ̄ , sZ̄ , yZ̄ ), then the (2, 2) block of RZ̄ is unaltered by the update,
i.e.,


R̄Ū R̄Ū Ȳ 
R̄Z̄ = 
0 RȲ
Reduced-Hessian Methods for Unconstrained Optimization
38
Proof. The result follows from the definition of the rank-one BFGS update given
in Section 1.3.2.
Note that the result is purely algebraic in nature. The notation used
in the lemma is consistent with the current discussion to facilitate application in
Lemma 2.8 below.
Lemma 2.8 Assume that Algorithm RH has been applied with the BFGS update
to minimize f for k iterations. Moreover, assume that g was accepted at iteration
k − 1. Let lk = rk − 1 and partition RZ as in (2.23). During iteration k, suppose
that the iterates begin to linger on M(U, x0 ), and that they remain on the manifold
for m (m ≥ 0) iterations. Then, at the start of iteration k + m, the (2, 2) block
of RZ satisfies RY = σ 1/2 Irk+m −lk .
Proof. The result is proved by induction on i, where (0 ≤ i ≤ m). Since g is
accepted and l = r − 1, Z is of the form Z = ( U
Y ), where U = Z and Y = zg .
Prior to application of the BFGS update, the Cholesky factor satisfies


R
0 
RZ =  U
,
0 RY
where RY = σ 1/2
.
Since s ∈ range(U ), it follows that sZ = (sU , 0)T . Hence, the result holds for i = 0
by application of Lemma 2.7 predated by one iteration. Assume that the result
holds for i = m−1. Since the iterates linger during iterations k through k +m−1,
the partition parameter satisfies lk+m = lk+m−1 = · · · = lk . Hence, we may use l
to denote this common value of the partition parameter. For the remainder of the
proof, let unbarred quantities be associated with the start of iteration k + m − 1
and let barred quantities denote their corresponding updates. By the inductive
hypothesis, RY = σ 1/2 Ir−l . Since x̄ lingers on M(U, x0 ), s ∈ range(U ) and it
Reduced-Hessian Methods for Unconstrained Optimization
39
follows that sZ̄ = (sU , 0)T . Prior to the BFGS update, RȲ = σ 1/2 Ir̄−l . After the
BFGS update, Lemma 2.7 implies that R̄Ȳ = σ 1/2 Ir̄−l , as required.
2.5.4
Updating RZ after the computation of p
The change of basis from Z to Ze necessitates a corresponding change in RZ
whenever l̄ = l + 1. Recall that the effective approximate Hessian is defined by

T
T
B = QRQ RQ Q ,
and Q = ( Z

RZ
0

where RQ = 
1/2
0 σ In−r
W ) is orthogonal. The reduced Hessian Z TB Z satisfies Z TB Z =
RZT RZ . Following the change of basis, the Cholesky factor of the reduced Hessian
Ze T B Ze is required. Let RZe denote the desired matrix. A short calculation
shows that Ze TB Ze = diag(Il , S)RZT RZ diag(Il , S T ). The partition of RZ defined
by (2.23) gives


RU RU Y S T 
,
RZ diag(Il , S T ) = 
0 RY S T
which is not generally upper triangular. Hence, RZe is defined by
T
e
RZe = diag(Il , S)R
Z diag(Il , S ),
T
e
where Se is defined so that SR
is upper triangular. In the next section, we
YS
consider the definition of Se when BFGS updates are used.
The form of Se when using the BFGS update
Lemma 2.8 implies that RY = σ 1/2 Ir−l . Hence, the matrix RZ diag(Il , S T ) satisfies


RU RU Y S T 
RZ diag(Il , S T ) = 
.
0 σ 1/2 S T
Reduced-Hessian Methods for Unconstrained Optimization
40
Thus, Se may be set equal to S giving


RU RU Y S T 
.
RZe = 
0 σ 1/2 Ir−l
Note that the Givens matrices defined by S need only be applied to RU Y in this
case. In the next section, we consider the definition of Se when general Broyden
updates are used.
The form of Se when using Broyden updates
When using Broyden updates other that the BFGS update, RY is not generally
diagonal. Restoring RY S T to upper-triangular form is more complicated in this
case. The matrix Se is defined by a product of Givens matrices. In particular,
Se = P˜l+1,l+2 · · · P˜r−1,r , where P˜i,i+1 is an (r − l) × (r − l) Givens matrix in the
(i, i + 1) plane defined to annihilate the (i + 1, i) component of
P˜i+1,i+2 · · · P˜r−1,r RY Pr−1,r · · · Pi,i+1 .
Note that P˜i,i+1 is defined immediately after the definition of Pi,i+1 . For this
reason, the Givens matrices defining Se are said to be interlaced with those defining
S. This technique of interlacing Givens matrices to maintain upper-triangular
form has been described by Crawford in the context of the generalized eigenvalue
problem and has been suggested for use in optimization by Gill et al. (see [20]).
In summary, the update to RZ satisfies



RZe = 

RZ ,
if l̄ = l;
(2.30)
T
e
diag(Il , S)R
Z diag(Il , S ),
otherwise.
Reduced-Hessian Methods for Unconstrained Optimization
2.5.5
41
The Broyden update to RZe
The matrix RZ̄ is defined by
RZ̄


RZe,



 

=
R
0
  Ze

,



0 σ 1/2
if r̄ = r;
(2.31)
if r̄ = r + 1.
The updated Cholesky factor R̄Z̄ satisfies R̄Z̄ = Broyden(RZ̄ , sZ̄ , yZ̄ ). Note that
when the BFGS update is used, fewer Givens matrices need be defined in order
to reduce uZ̄ to er̄ , resulting in computational savings. (See Section 1.3.2 and
note that uZ̄ = RZ̄ sZ̄ /kRZ̄ sZ̄ k is of the form uZ̄ = (uŪ , 0)T , where uŪ ∈ IRl̄ .)
2.5.6
A reduced-Hessian algorithm with lingering
A reduced Hessian algorithm with lingering is given below. This algorithm will
be referred to as Algorithm RHL.
Algorithm 2.3. Reduced Hessian method with lingering (RHL)
Initialize k = 0, r0 = 1 and l0 = 0; Choose x0 , σ0 (σ0 > 0) and ;
1/2
Initialize Z = Y = g0 /kg0 k and RZ = RY = σ0
(U and RU are void);
while not converged do
Compute tU and tY according to (2.25);
Compute l̄ according to (2.27);
if l̄ = l then
Solve RU pU = −tU and compute p = U pU ;
else
Compute pU and pY according to (2.26) and set p = U pU + Y pY ;
end if
Compute α so that sTy > 0 and set x̄ = x + αp;
Reduced-Hessian Methods for Unconstrained Optimization
42
if l̄ = l then
Define Ze = Z, Ue = U and Ye = Y ;
else
Define Ze = Z diag(Il , S T ), where S satisfies SpY = kpY ke1 ;
Define Ue = ( U
Y S T e1 ) and Ye = Y S T ( e2
e3
···
er );
T
e
e
Compute RZe = diag(Il , S)R
Z diag(Il , S ), where S is defined so that
T
e
SR
is upper triangular;
YS
Define pZe = (pU , kpY ke1 )T and gZe = (gU , SgY )T ;
end if
Form RZ̄ according to (2.31);
Compute sZ̄ and yZ̄ ;
Compute R̄Z̄ = Broyden(RZ̄ , sZ̄ , yZ̄ );
k ← k + 1;
end do
Chapter 3
Rescaling Reduced Hessians
In practice, the choice of B0 can greatly influence the performance of
quasi-Newton methods. If no second-derivative information is available at x0 ,
then B0 is often initialized to I. Several authors have observed that a poor
choice of B0 can lead to inefficiences—especially if ∇2f (x∗ ) is ill-conditioned (e.g.,
see Powell [41] and Siegel [45]). These inefficiences can lead to a large number
of function evaluations in practical implementations. Function evaluations are
often expensive in comparison to the linear algebra required to implement quasiNewton methods.
One remedy involves rescaling the approximate Hessians.
To date,
rescaling has involved multiplying the approximate Hessians (or part of a factorization of the approximate Hessians) by positive scalars. The following are
examples of rescaling methods.
• The self-scaling variable metric (SSVM) method, reviewed in Section 3.1,
multiplies Bk by a scalar prior to application of the Broyden update.
• Siegel [45] has demonstrated global and superlinear convergence of a scheme
that rescales columns of a conjugate-direction factorization of Bk−1 . This
43
Rescaling Reduced Hessians
44
method is reviewed in Section 3.2.
• Lalee and Nocedal [27] have defined an algorithm that rescales columns of
a lower-Hessenberg factor of Bk .
In this thesis, rescaling is achieved by reassigning the values of certain
elements of the reduced-Hessian Cholesky factor. In Sections 3.3 and 3.4, two
new rescaling algorithms of this type are introduced as extensions of Algorithm
RH (p. 28) and Algorithm RHL (p. 41).
3.1
Self-scaling variable metric methods
The first rescaling method, suggested by Oren and Luenberger [39], involves a
scalar factor ηk applied to the approximate Hessian before the quasi-Newton
update. Although the original SSVM methods were formulated in terms of the
inverse approximate Hessian, we shall describe them in terms of Bk (see Brodlie
[2]).
Let Mk = H 1/2 Bk−1 H 1/2 , where H is the Hessian of the quadratic q(x)
given in (1.17). Assume that Bk is positive definite. Brodlie states the result
that when q(x) is minimized using an exact line search,
q(xk+1 ) − q(x∗ ) ≤ γk2 (q(xk ) − q(x∗ )),
where γk = (κ(Mk ) − 1)/(κ(Mk ) + 1).
The value γk2 is called the “one-step convergence rate”. Note that the
smaller the value of κ(Mk ), the smaller the value of γk . Hence, Oren and Luenberger suggest that a good method should decrease κ(Mk ) every iteration. However, when Bk is updated by a formula from the Broyden convex class, κ(Mk )
Rescaling Reduced Hessians
45
can fluctuate. Consider the scalar ηk (β) defined by
ηk (β) = β
ykT Bk−1 yk
sTk yk
+
(1
−
β)
,
sTk Bk sk
sTk yk
(3.1)
If Bk is multiplied by ηk before application of an update from the Broyden convex
class, then κ(Mk ) decreases monotonically assuming an exact line search (see
Oren and Luenberger [39]).
The choice β = 1 avoids the need to form Bk−1 yk , a quantity not normally
computed by methods updating Bk . The corresponding value of the rescaling
parameter.
ηk (1) =
sTk yk
,
sTk Bk sk
(3.2)
has been studied by several authors. Contreras and Tapia [7] consider this choice
in connection with trust-region methods for unconstrained optimization using
both the BFGS and the DFP updates. They report positive results for the DFP
update but negative results for the BFGS update (see [7] for further details).
Results given by Nocedal and Yuan [37] suggest that rescaling by ηk (1) every
iteration may inhibit superlinear convergence in line search algorithms that use
an initial step length of one.
Several researchers have proposed rescaling at the first iteration only.
Shanno and Phua suggest multiplying H0 by the scalar 1/η0 (0) prior to the first
BFGS update. This is analogous to multiplying B0 by
η0 (0) =
y0 B0−1 y0
.
sT0 y0
Numerical results imply that the method can be superior to the BFGS method,
especially for larger values of n. They also compare the method to a SSVM
method that suggested by Oren and Spedicato [38] and conclude that initial scal-
Rescaling Reduced Hessians
46
ing is superior (see Shanno and Phua [44]). Siegel has suggested multiplying B0
by η0 (1) = sT0 y0 /sT0 B0 s0 in methods for large-scale unconstrained optimization.
Liu and Nocedal [28] have studied rescaling parameters in connection
with limited-memory methods (see Section 5.1). In these methods, the “initial”
inverse approximate Hessian Hk0 can be redefined every iteration. Several choices
for Hk0 are compared and they conclude that
Hk0 =
sT yk−1
1
I = k−1
I
T
η0 (0)
yk−1
yk−1
is the most effective in practice.
A common feature of SSVM methods is that they alter the approximate curvature in all directions. Recent methods, such as the conjugate-direction
rescaling algorithm reviewed in the next section, rescale more selectively.
3.2
Rescaling conjugate-direction matrices
Siegel has proposed a rescaling algorithm that uses conjugate direction matrices.
The matrices are updated using the form of the BFGS update suggested by Powell
(see Section 1.3.3). The algorithm is similar to Powell’s, but the definition of p
is different and the updated matrix V̄ is rescaled. We present an outline of the
method in the following sections. See Siegel [45] for further details regarding both
the motivation and implementation of the method.
3.2.1
Definition of p
Consider the matrix V defined in Section 1.3.3. The rescaling algorithm uses an
integer parameter l (0 ≤ l ≤ n) that may be increased at any iteration. The
matrix V is partitioned as V = ( V1
V2 ), where V1 = (v1 v2 · · · vl ) and
Rescaling Reduced Hessians
47
V2 = (vl+1 vl+2 · · · vn ). Define the vectors
g1 = V1Tg,
and g2 = V2Tg.
(3.3)
Note that the definitions of gV (1.30) and p (1.28) satisfy gV = (g1 , g2 )T and
p = −V gV . The definition of gV is modified for the rescaling scheme as follows.
Let τ ∈ ( 21 , 1] denote a preassigned constant and define
gV

 (g1 , 0)T,
= 
if kg1 k2 > τ (kg1 k2 + kg2 k2 );
(3.4)
(g1 , g2 )T,
otherwise.
As before, the search direction is given by
p = −V gV .
(3.5)
Note that if gV is defined by the first part of (3.4), then p depends only on the
first l columns of V . The parameter l is initialized at zero and updated during the
calculation of p. This parameter is always incremented during the first iteration.
For k ≥ 1,


l̄ = 
3.2.2
if kg1 k2 > τ (kg1 k2 + kg2 k2 );
l,
(3.6)
l + 1,
otherwise.
Rescaling V̄
After the calculation of p, V is updated to give V̄ using the BFGS update (1.32).
Let V̄ be partitioned as V̄ = ( V̄ 1
V̄ 2 ), where V̄ 1 = (v1 v2 · · · vl̄ ) and
V̄ 2 = (vl̄+1 vl̄+2 · · · vn ). Let γk and µk denote the scalar parameters
γk =
µk =
ykTsk
ksk k2
and
min {γi }.
0≤i≤k
(3.7)
(3.8)
Rescaling Reduced Hessians
48
The matrix Vb , which is used to denote V̄ after rescaling, satisfies
Vb = ( V̄ 1
where
βk =



βk V̄ 2 ),
(1/γ0 )1/2 ,
n
o

1/2
 max 1, (µ
,
k−1 /γk )
(3.9)
if k = 0;
(3.10)
otherwise.
This choice of βk is motivated by considering the application of the
BFGS method to the convex quadratic function
q(x) = 21 xTHx,
where λ1 (H) · · · λn (H) > 0.
(3.11)
When the BFGS method with exact line search is applied to q(x), the search
directions tend to be almost parallel with the eigenvectors. Moreover, successive
search directions are aligned with eigenvectors associated with smaller eigenvalues. Under these conditions, the curvature along pk+1 should be no larger than
γ0 , . . ., γk , i.e., it should be no larger than µk . Recall that W̄ defines the subspace
of IRn in which the BFGS method has gained no approximate curvature through
the (k + 1)th iteration. We shall show in Chapter 4 that the choice of βk is such
that the approximate curvature in (Vb Vb T )−1 along unit directions w ∈ range(W̄ )
is equal to µk .
3.2.3
The conjugate-direction rescaling algorithm
For reference, the conjugate-direction rescaling algorithm is given below.
Algorithm 3.1. Conjugate-direction rescaling (CDR) (Siegel)
Initialize k = 0, l = 0;
Choose x0 and V0 (V0T V0 = I);
Define τ ( 12 < τ < 1);
Rescaling Reduced Hessians
49
while not converged do
if k = 0 then
Compute gV = V Tg and set l̄ = l + 1;
else if l < n
if kg1 k2 > τ (kg1 k2 + kg2 k2 ) then
Set gV = (g1 , 0)T and l̄ = l;
else
Set gV = (g1 , g2 )T and l̄ = l + 1;
end if
end if
Compute p = −V gV ;
Compute α so that y Ts > 0, and set x̄ = x + αp;
Compute y = ḡ − g;
Compute V̄ from V using (1.32) and set Vb = ( V̄ 1
β V̄ 2 );
k ← k + 1;
end do
Note that β ≥ 1 except possibly on the first iteration. It follows that
the columns of V̄ 2 are either unchanged or “scaled up” on every iteration after
the first. Once l reaches n, the BFGS update to V is no longer rescaled.
3.2.4
Convergence properties
It has been shown that if Algorithm CDR is applied to a strictly convex, twicecontinuously differentiable f with Lipschitz continuous Hessian satisfying
k∇2f (x)−1 k < C,
where C > 0,
Rescaling Reduced Hessians
50
for all x in the level set of f (x0 ), then the iterates converge globally and superlinearly to f (x∗ ). The proof uses the convergence properties of the BFGS algorithm
proven by Powell to imply global and superlinear convergence of {xk }. In the
case that l never reaches n, it is also necessary to show that the limit of {xk }
minimizes f (x) (see Siegel [45] for further details).
3.3
Extending Algorithm RH
In this section, a new rescaling algorithm is introduced that is an extension of
Algorithm RH (p. 28). This new algorithm alters the approximate curvature of
B̄ on a subspace of dimension n − r at each iteration that r̄ = r + 1. Attention is
now restricted to the BFGS update (1.12) since this has been the most successful
update in practice.
3.3.1
Reinitializing the approximate curvature
The effective transformed Hessian associated with Algorithm RH is




0 
Z̄ T B̄ Z̄ Z̄ T B̄ W̄   R̄TZ̄ R̄Z̄
=
Q̄TB̄ Q̄ =  T T 0
σIn−r̄
W̄ B̄ Z̄ W̄ B̄ W̄
(3.12)
at the end of iteration k. The approximate curvature along unit vectors in
range(W̄ ) is equal to σ. The approximate curvature along zḡ is given by the
following lemma.
Lemma 3.1 Suppose that ḡ is accepted during the kth iteration of Algorithm RH.
If the BFGS update is used at the end of the iteration, then
zḡ T B̄ zḡ = σ +
ρḡ 2
.
sTy
Proof. The value zḡ T B̄ zḡ is the (r̄, r̄) element of R̄TZ̄ R̄Z̄ . This matrix satisfies
RZ̄T RZ̄ sZ̄ sTZ̄ RZ̄T RZ̄ yZ̄ (yZ̄ )T
R̄Z̄ R̄Z̄ = RZ̄ RZ̄ −
+ T .
sTZ̄ RZ̄T RZ̄ sZ̄
sZ̄ yZ̄
T
T
Rescaling Reduced Hessians
51
The (r̄, r̄) element of RZ̄T RZ̄ is σ. The result follows since sZ̄ = (sZ , 0)T, yZ̄ =
(Z Ty, ρḡ )T and sTZ̄ yZ̄ = sTy.
Lemma 3.1 is analogous to Lemma 2.2, which applies to the BFGS method in
exact arithmetic.
Lemma 3.1 implies that
zḡ T B̄ zḡ − (σ − σ̄) = σ̄ +
ρḡ 2
.
sTy
This is the value of the approximate curvature along zḡ that would result from
choosing B0 = σ̄I. In this sense, subtracting σ − σ̄ reinitializes the approximate
curvature along zḡ . The approximate curvature along directions in range(W̄ ) can
be reinitialized in the same way. The rescaled transformed effective Hessian is
defined accordingly by


0
Z TB̄ zḡ
Z TB̄ Z


Tb
T
T

Q̄ Be Q̄ =  zḡ B̄ Z zḡ B̄ zḡ − (σ − σ̄)
0 

0
0
σ̄In−r̄
(3.13)
(the “hat” denotes rescaling as in the definition of Algorithm CDR).
The rescaling suggested by (3.13) can be simply applied to R̄Z̄ . Since
p ∈ range(Z), it follows that sZ̄ = (sZ , 0)T . Moreover, Lemma 2.7 implies that
the (r̄, r̄) element of RZ̄ is unaltered by the BFGS update. It follows that R̄Z̄ can
be partitioned as


R̄Z R̄ḡ 
R̄Z̄ = 
,
0 σ 1/2


R̄T R̄Z
R̄TZ R̄ḡ 
.
which implies R̄TZ̄ R̄Z̄ =  ZT
R̄ḡ R̄Z R̄ḡ T R̄ḡ + σ
b be defined by replacing the (r̄, r̄) element of R̄ with σ̄, i.e.,
Let R
Z̄
Z̄


b =  R̄Z R̄ḡ  .
R
Z̄
0 σ̄ 1/2
Rescaling Reduced Hessians
52
It follows that




T
R̄TZ r̄ 
0
0 
bT R
b =  R̄Z R̄Z
= R̄TZ̄ R̄Z̄ − 
.
R
Z̄
Z̄
T
T
r̄ R̄Z r̄ r̄ + σ̄
0 σ − σ̄
b is the Cholesky factor of Z̄ T B
b Z̄. Note that R
b is nonsingular after
Hence, R
Z̄
Z̄
the reassignment since σ̄ > 0. Thus, no loss of positive definiteness occurs as a
result of subtracting σ − σ̄ from the reduced Hessian.
An algorithm using this rescaling scheme is given below.
Algorithm 3.2. Reduced Hessian rescaling
Initialize k = 0; Choose x0 , σ0 and ;
1/2
Initialize r = 1, Z = g0 /kg0 k, and RZ = σ0 ;
while not converged do
Solve RZT tZ = −gZ , RZ pZ = tZ , and set p = ZpZ ;
Compute α so that sTy > 0 and set x̄ = x + αp;
Compute (Z̄, ḡ Z̄ , r̄) = GS(Z, ḡ, r, ).
Define RZ̄ as in (2.17);
Compute R̄Z̄ = BFGS(RZ̄ , sZ̄ , yZ̄ );
Compute or define σ̄;
if r̄ > r and σ̄ 6= σ then
Set the (r̄, r̄) element of R̄Z̄ equal to σ̄;
end if
k ← k + 1;
end do
It remains to define an appropriate value of σ̄. We draw upon the
discussion in Section 3.1 to define four possible values. The fifth value has been
Rescaling Reduced Hessians
53
suggested by Siegel for Algorithm CDR. The five alternatives are summarized in
Table 3.1.
Label
R0
R1
3.3.2
Table 3.1: Alternate values for σ̄
σ̄
Reference
σ
No rescaling
γ0
Siegel [46]
R2
R3
y0T y0 /sT0 y0
γk
Shanno and Phua [44]
Analogous to Liu and Nocedal
R4
R5
ykT yk /sTk yk
µk
Liu and Nocedal [28]
Siegel [45]
Numerical results
The first set of test problems consists of the 18 unconstrained optimization problems given by Moré et al. [29]. These problems are listed in Table 3.2 below.
The method is implemented in double precision FORTRAN 77 on a
DEC 5000/240. The line search is a slightly modified version of that included in
NPSOL. The line search is designed to ensure that α satisfies the modified Wolfe
conditions (1.16) (see Gill et al. [21]). The step length α = 1 is always attempted
first. The step length parameters are ν = 10−4 and η = 0.9. The value = 10−4
is used in the Gram-Schmidt process and the stopping criterion is kgk k < 10−8 .
The results of Table 3.3 compare Algorithm RH (p. 28) with Algorithm RHR using several of the rescaling values. The numbers of iterations and
function evaluations needed to achieve the stopping criterion are given for each
run. For example, the notation “31/39” indicates that 31 iterations and 39 function evaluations are required for convergence. The notation “L” indicates that
the method terminated in the line search. In this case, the number in parentheses
gives gives the final norm of the gradient. The final column in Table 3.3 gives an
Rescaling Reduced Hessians
Table
Number
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
3.2:
n
3
6
3
2
3
16
12
16
16
2
4
3
20
14
16
2
4
16
54
Test Problems from Moré et al.
Problem name
Helical valley
Biggs EXP6
Gaussian function
Powell badly scaled function
Box three-dimensional function
Variably dimensioned function
Watson
Penalty I
Penalty II
Brown badly scaled function
Brown and Dennis
Gulf research and development
Trigonometric
Extended Rosenbrock
Extended Powell singular function
Beale
Wood
Chebyquad
“at a glance” comparison of Algorithm RHR and Algorithm RH. For example,
the notation “+ − +” means that Algorithm RHR required fewer function evaluations than Algorithm RH for rescaling methods R1 and R5, but required more
function evaluations for R4 (note that “+” means fewer function evaluations).
3.4
Rescaling combined with lingering
Sometimes it is desirable to reinitialize the approximate curvature in a larger
subspace than that determined by zḡ . Our objective is to alleviate inefficiencies
resulting from poor initial approximate curvature. In doing so, we must be careful
to alter only the affects of the initial approximate curvature. In Lemma 2.8 it
is shown that RȲ = σIr̄−l̄ when the BFGS update is used in Algorithm RHL
(p. 41). In this sense, the initial approximate curvature along unit directions in
Rescaling Reduced Hessians
Table 3.3:
Problem
No. n
1
3
2
6
3
3
4
2
5
3
6
16
7
12
8
16
9
16
10
2
11
4
12
3
13 20
14 14
15 16
16
2
17
4
18 16
55
Results for Algorithm RHR using R1, R4
Alg. RH
Algorithm RHR
σ=1
R1
R4
R5
31/39
28/35
27/36
26/35
34/42
44/50
41/45
39/44
4/6
5/8
5/8
5/8
145/191 147/199 140/193 147/199
32/36
34/37
34/37
33/36
24/32
24/32
24/32
24/32
80/93
146/153 124/129 109/114
58/74
61/85
57/69
56/68
301/442 416/470 505/584 582/710
L(.1E-1) L(.1E-1) L(.5E-2) L(.1E-1)
65/88
74/83
72/80
72/80
31/40
47/65
48/69
47/65
48/53
54/61
47/50
39/49
37/52
38/54
36/50
38/54
48/62
74/79
81/87
60/65
16/23
16/20
16/20
16/20
80/121
39/48
63/78
66/87
68/102
77/87
70/78
58/82
and R5
Comp.
+++
−−−
−−−
−−−
−−0
000
−−−
−++
−−−
0+0
+++
−−−
−++
−+−
−−−
+++
+++
+++
range(Ȳ ) is unaltered. Hence, the approximate curvature in B corresponding to
Ȳ is easily reinitialized. The approximate curvature along directions in range(Ū )
will be considered to be established and the associated reduced Hessian will not
b will be defined by
be rescaled. Following the BFGS update, R
Z̄


b =  R̄Ū R̄Ū Ȳ  ,
R
Z̄
0 σ̄Ir̄−l̄


R̄Ū R̄Ū Ȳ 
which replaces R̄Z̄ = 
0 σIr̄−l̄
at the start of iteration k + 1.
The approximate curvature along directions w ∈ range(W̄ ) is also reinitialized. The Cholesky factors of the effective transformed Hessians Q̄TB̄ Q̄ and
Rescaling Reduced Hessians
56
Q̄TBb Q̄ satisfy


R̄Z̄
R̄Z̄ Ȳ
0


1/2


R̄Q̄ =  0 σ Ir̄−l̄
0

0
0
σ 1/2 Ir̄−l̄


R̄Z̄
R̄Z̄ Ȳ
0


b
1/2

.
and RQ̄ =  0 σ̄ Ir̄−l̄
0

0
0
σ̄ 1/2 Ir̄−l̄
Note that the rescaled transformed Hessian satisfies


0
bT R
b = R̄T R̄ −  0
.
Q̄T Bb Q̄ = R
Q̄
Q̄
Q̄
Q̄
0 (σ − σ̄)In−l̄
(3.14)
This rescaling it therefore analogous to that defined for Algorithm RH (3.13). In
this case the rescaling is defined on the (possibly larger) subspace range( Ȳ
instead of range( zḡ
W̄ ).
An algorithm employing this strategy is given below.
Algorithm 3.3. Reduced-Hessian rescaling with lingering (RHRL)
Initialize k = 0, r0 = 1 and l0 = 0; Choose x0 , σ0 and ;
1/2
Initialize Z = Y = g0 /kg0 k and RZ = RY = σ0
(U is void);
while not converged do
Compute p as in Algorithm RHL;
Compute Ze and associated quantities as in Algorithm RHL;
Compute α so that sTy > 0 and set x̄ = x + αp;
if r < n then
e ḡ, r, );
Compute (Z̄, ḡ Z̄ , r̄) = GS(Z,
else
e
Define Z̄ = Z;
Compute ḡ Z̄ ;
end if
Form RZ̄ according to (2.31);
Compute sZ̄ and yZ̄ ;
W̄ )
Rescaling Reduced Hessians
57
Compute R̄Z̄ = BFGS(RZ̄ , sZ̄ , yZ̄ );
Compute σ̄;
if l̄ < r̄ and σ̄ 
6= σ then 
b =  R̄Ū R̄Ū Ȳ ;
Set R
Z̄
0 σ̄Ir̄−l̄
end if
k ← k + 1;
end do
3.4.1
Numerical results
Results are given in Table 3.4 that compare Algorithm RHRL using R1, R4
and R5 with Algorithm RH (p. 28) on the 18 problems listed in Table 3.2. The
constants used in the line search and the Gram Schmidt process are the same as
those given in Section 3.3.2.
Results are given for four additional problems in Table 3.5. Problem 19
was used by Siegel to test Algorithm 3.1 (p. 48). Results are given for the case
D11 = 1 and D55 = 10−12 , which define a function whose Hessian is very illconditioned (see [45] for further details). In this case, the convergence criteria
are kgk k < 10−8 and |f (xk )−f ∗ | < 10−8 , where f ∗ = 3.085557482E−3. Problems
20, 21 and 22 are the calculus of variation problems discussed by Gill and Murray
(see [18]). We give results for these problems for n = 50, n = 100 and n = 200.
Generally the column dimension of Y stays large as the iterations proceed, which
means that the approximate curvature is rescaled on high-dimensional subspaces
of IRn . For example, in the solution of problem 20 with n = 50, the column
dimension of Y reaches 10 at iteration 33 and remains greater than or equal to
10 until iteration 49.
Rescaling Reduced Hessians
58
Table 3.4: Results for Algorithm RHRL on problems 1–18
Problem Alg. RH
Algorithm RHRL
No. n
σ=1
R1
R4
R5
Comp.
1
3
31/39
28/37
26/33
31/38
+++
2
6
40/43
46/52
48/53
40/43
−−0
3
3
4/6
5/8
5/8
5/8
−−−
4
2 146/194 147/199 140/193 147/199 − + −
5
3
32/36
34/37
31/34
29/33
−++
6
16
24/32
24/32
24/32
24/32
000
7
12
80/93
146/157 119/126
79/84
−−+
8
16
58/74
60/77
57/75
57/75
−−−
9
16 285/411 526/614 449/558 493/614 − − −
10
2 L(.1E-1) L(.1E-1) L(.5E-2) L(.1E-1) 0 + 0
11
4
65/88
77/84
66/73
68/75
+++
12
3
31/40
51/60
49/67
38/53
−−−
13 20
48/53
55/59
48/52
38/50
−++
14 14
37/52
37/52
39/55
36/50
0−+
15 16
48/52
74/79
61/68
67/73
−−−
16
2
16/23
16/20
16/20
16/20
+++
17
4
80/121
38/45
71/88
69/91
+++
18 16 68/102
80/89
67/77
57/83
+++
3.4.2
Algorithm RHRL applied to a quadratic
The following theorem summarizes some properties of Algorithm RHRL when it
is used with an exact line search to minimize the quadratic function (1.17). In
the statement and proof of the theorem, rij denotes the (i, j) component of RZ .
Theorem 3.1 Consider the use of Algorithm RHRL with exact line search to
minimize the strictly convex quadratic function (1.17). In this case, the uppertriangular matrix RZ is upper bidiagonal. At the start of iteration k, rk = k + 1,
1/2
lk = k, RU ∈ IRk×k , RU Y = −kgk k/(sTk−1 yk−1 )ek and RY = σk . The nonzero
elements of RU satisfy
rii =
kgi−1 k
T
(si−1 yi−1 )1/2
and
ri,i+1 = −
kgi k
T
(si−1 yi−1 )1/2
Rescaling Reduced Hessians
Table 3.5:
Problem
No. n
19
5
50
20 100
200
50
21 100
200
50
22 100
200
59
Results for Algorithm RHRL on problems 19–22
Alg. RH
Algorithm RHRL
σ=1
R1
R4
R5
Comp.
L(.2E-8) L(.4E-9) L(.4E-9) 115/139
00+
222/255 266/291 209/215 79/128 − + +
398/480 470/524 475/478 137/249 − + +
731/912 849/966 969/985 260/524 − − +
50/172 197/202 110/115
49/74
−++
64/310 247/253 169/174 74/124 + + +
127/623 335/341 295/302 127/227 + + +
164/217 280/284 187/191 70/107 − + +
250/350 421/425 312/316 99/148 − + +
217/317 152/252 217/220 161/292 + + +
for 1 ≤ i ≤ k. The matrix Z satisfies Z = ( U
U=
g0
kg0 k
Y ), where
gk−1 g1
···
kg1 k
kgk−1 k
and
Y =
gk
.
kgk k
Furthermore, the search directions satisfy





−gk ,
if k = 0;
!
2
pk =  1
kg
k
k


σk−1
pk−1 − gk ,

σk
kgk−1 k2
(3.15)
otherwise.
Proof. The result is clearly true for k = 0. Assume that the result holds at the
start of iteration k, i.e., RZ , Z and the first k search directions are of the stated
form.
The first k + 1 gradients are orthogonal and nonzero by assumption (or
by the assumed form of the first k search directions). Hence, gU = U Tgk = 0,
−1/2
which implies tU = 0 and tY = −σk
kgk k. Since ktU k2 < τ (ktU k2 + ktY k2 ), lk+1
satisfies lk is incremented and lk+1 = k + 1, as required. The definitions of pY
and pU give

pY = −
kgk k
σk

kg0 k−1

kgk k2 
..


and pU = −

,
.

σk 
kgk−1 k−1
(3.16)
Rescaling Reduced Hessians
60
respectively. Hence,
1
kgk k2
g0
gk−1
pk = Uk pU + Yk pY =
−
kgk−1 k2
+ ··· +
2
2
σk
kgk−1 k
kg0 k
kgk−1 k2
!
!
− gk .
A short inductive argument verifies that
2
−kgk−1 k
g0
gk−1
+ ··· +
2
kg0 k
kgk−1 k2
!
= σk−1 pk−1 ,
which, together with the previous equation, implies that pk is of the required
form.
Following the computation of pk , the matrix Z must be reorganized since
the partition parameter has increased. Since pY is a scalar, S = 1, Ue = ( U
Y )
and Ye is void. The Cholesky factor RZe satisfies RZe = RZ . Since the first k + 1
search directions are parallel to the conjugate-gradient directions, xk+1 is such
that gk+1 is orthogonal to g0 , . . ., gk . Thus, gk+1 is accepted if it is nonzero.
It follows that the matrices Ū , Ȳ and Z̄ satisfy Ū = Ue , Ȳ = gk+1 /kgk+1 k and
Z̄ = ( Ū
Ȳ ), as required.
We complete the proof by considering the computation of R̄Z̄ (see Sec1/2
1/2
tion 1.3.2). Since gk+1 is always accepted, RZ̄ = diag(RZe, σk ) = diag(RZ , σk ).
The vector uZ̄ used in the BFGS update satisfies




tZ
0



RZ̄ sZ̄
RZ̄ pZ̄
1 



uZ̄ =
=
=
tY  =  −1 

.
kRZ̄ sZ̄ k
kRZ̄ pZ̄ k
ktU k
0
0
Thus, the matrices S1 and S1 RZ̄ satisfy


Ik 0 0



S1 =  0 0 1 

0 1 0


RU RU Y
0

1/2 

and S1 RZ̄ =  0
0 σk 
,
1/2
0 σk
0
respectively. Since



0

1/2 
T

RZ̄ uZ̄ = 
 −σk 
0
and wZ̄ =
1
(sTk yk )1/2

0


 −kg k  ,
k 

kgk+1 k
Rescaling Reduced Hessians
61
it follows that

RU

 0
S1 (RZ̄ + uZ̄ (wZ̄ − RZ̄T uZ̄ )T ) = 


0
RU Y
0
1/2
0
σk
kgk k
kgk+1 k
− T 1/2
(sTk yk )1/2
(sk yk )



.


If S2 is defined by S2 = S1 , then R̄Z̄ is upper triangular and satisfies



R̄Z̄ R̄Ū Ȳ 
R̄Z̄ = 
,
0 R̄Ȳ

R̄Ū Ȳ
RU
where R̄Z̄ = 

0

RU Y

kgk k  ,
(sTk yk )1/2

0


= 
kgk+1 k 
− T 1/2
(sk yk )
1/2
and R̄Ȳ = σk .
b satisfies
The rescaled matrix R
Z̄


b =  R̄Z̄ R̄Ū Ȳ  ,
R
Z̄
1/2
0 σk+1
which completes the inductive argument.
Now we show that Theorem 3.1 implies that Algorithm RHRL terminates on quadratics.
Corollary 3.1 If Algorithm RHRL is used to minimize the convex quadratic
function (1.17) with exact line search and σ0 = 1, then the method converges to
the minimizer in at most n iterations.
Proof. The search directions are parallel to the conjugate-gradient directions
by Theorem 3.1. Thus, Algorithm RHRL enjoys quadratic termination since the
conjugate-gradient method has this property.
Chapter 4
Equivalence of Reduced-Hessian
and Conjugate-Direction
Rescaling
In this chapter, it is shown that if Algorithm RHRL is used in conjunction with a particular rescaling technique of Siegel [45], then it is equivalent to
Algorithm CDR in exact arithmetic. This chapter is mostly technical in nature
and may be skipped without loss of continuity. However, the convergence results
given in Section 4.4 should be reviewed before passing to Chapter 5.
First, we show that a basis for V1 can be formulated in terms of the
search directions generated by Algorithm CDR. Second, a transformed approximate Hessian associated with B is derived that has the same form as the transformed Hessian generated by Algorithm RHRL. Third, we define the affect that
rescaling the conjugate-direction matrices has on this transformed Hessian.
4.1
A search-direction basis for range(V1)
The following two lemmas lead to a result that gives a basis for range(V1 ) in
terms of a subset of the search directions generated by Algorithm 3.1.
62
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
63
Lemma 4.1 If l is unchanged during any iteration of Algorithm 3.1, then
range(Vb1 ) = range(V1 ).
Proof. Since l remains fixed, gV = (g1 , 0)T. Thus Ω (see Section 1.3.3 for the
definition of Ω) is of the form


Ω1 0 
Ω=
,
0 In−l
where Ω1 ∈ IRl×l
is orthogonal and lower Hessenberg. Using the update (1.29) and the form of Ω,
we find
V̄ = (I − suT ) V1 Ω1 V2 .
If rescaling is applied to the second part of V̄ , we obtain
Vb = (I − suT ) V1 Ω1 βV2 ,
which implies Vb1 = (I − suT )V1 Ω1 . Since s = αp = −αV1 g1 , it follows that
Vb1 = (I + αV1 g1 uT )V1 Ω1 = V1 (Il + αg1 uT V1 )Ω1 .
From this we see that range(Vb1 ) ⊆ range(V1 ). However, (Il + αg1 uT V1 )Ω1 is
invertible since otherwise, Vb1 is rank deficient. Thus, range(V1 ) ⊆ range(Vb1 ), and
we may conclude that range(Vb1 ) = range(V1 ).
The second lemma relates to a property of Ω. In both the statement
and proof of the lemma, Ω is partitioned according to
Ω = ( Ω̃1
Ω̃2 ),
where Ω̃1 ∈ IRn×(l+1)
and Ω̃2 ∈ IRn×(n−l−1) .
The tildes are used to distinguish the partition from that used in Lemma 4.1 and
Theorem 4.1.
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
64
Lemma 4.2 Let Ω ∈ IRn×n be an orthogonal, lower-Hessenberg matrix. Given
an integer l (1 ≤ l ≤ n − 1), partition Ω as in (4.1). There exist w1 , w2 , . . . , wl ∈
IRl+1 , such that Ω1 wi = ei (1 ≤ i ≤ l), where ei is the ith column of I.
Proof. The first l rows of Ω̃2 are zero since Ω is lower Hessenberg. Hence, Ω̃2
may be partitioned as


0 
Ω̃2 = 
,
Ω̃22
where Ω22 ∈ IR(n−l)×(n−l−1) .
The product Ω̃2 Ω̃2T satisfies


0
0
.
Ω̃2 Ω̃2T = 
T
0 Ω̃22 Ω̃22
Since I = ΩΩ T = Ω̃1 Ω̃1T + Ω̃2 Ω̃2T, it follows that

Ω̃1 Ω̃1T

Il
0
.
=
T
0 In−l − Ω̃22 Ω̃22
Thus, with wi (1 ≤ i ≤ l) defined as the transpose of the ith row of Ω̃1 , we have
the desired result.
Let P denote the set of search directions generated by Algorithm CDR.
Let l denote the value of the partition parameter at the kth (k ≥ 1) iteration
of Algorithm CDR before calculation of the search direction. Let k1 , k2 , . . . , kl
(0 = k1 < k2 < · · · < kl < k) denote the indices of the iterations at which l is
incremented. Define
P1 = {pk1 , pk2 , . . . , pkl },
and P2 = P − P1 .
(4.1)
Note that the subscripts 1 and 2 on P in this definition are not iteration indices.
The main result of this section follows.
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
65
Theorem 4.1 Let P1 be defined as in (4.1). Then P1 is a basis for range(V1 )
for all k ≥ 1.
Proof. If k = 1, then Algorithm 3.1 gives l = 1 and P1 = {p0 } automatically.
Since l = 1 and by (1.32), V1 = s0 /(sT0 y0 )1/2 . Hence, the result holds for k = 1.
Given l (1 ≤ l ≤ n), assume that the result holds for k = kl + 1. The set
P1 as given in (4.1) is a basis for range(V1 ). If l = n, then the inductive argument
is complete since this would imply that V1 = V , P1 is a basis for IRn , and hence
P1 is a basis for range(V̄ 1 ) = range(V̄ ). If l < n and l does not increase during
or after iteration k, then the inductive argument is complete since Lemma 4.1
implies range(V̄ 1 ) = range(V1 ). Therefore, assume that l < n and that l increases
during or after iteration k.
The result is true for all k (kl + 1 < k ≤ kl+1 ) by Lemma 4.1, and we
fix k = kl+1 for the rest of the argument. Since l̄ = l + 1, p 6∈ range(V1 ). Hence,
p is independent of P1 , which implies that P̄1 is a linearly independent set. It
remains to show that P̄1 is a spanning set for range(V̄ 1 ).
The vector p ∈ range(V̄1 ) since the first column of V̄ is parallel to it by
(1.32). We now show that each member of P1 is also an element of range(V̄1 ).
Partition Ṽ as
Ṽ = ( Ṽ1
Ṽ2 ),
where Ṽ1 ∈ IRn×(l+1)
and Ṽ2 ∈ IRn×(n−l−1) .
Since Ω is constructed to make ṽ1 parallel to s and since v̄1 is parallel to s by
(1.32), it follows that ṽ1 ∈ range(V̄1 ). Rearranging the definition of v̄ i in (1.32)
gives ṽi = v̄i − (ṽiT y/sTy)s (2 ≤ i ≤ n). Thus, ṽi ∈ range(V̄1 ) (2 ≤ i ≤ l + 1).
Therefore range(Ṽ1 ) ⊆ range(V̄1 ). Since Ṽ1 = V Ω1 , we have V Ω1 wi = V ei =
vi ∈ range(V̄1 ) (1 ≤ i ≤ l) where Ω1 and wi are defined as in Lemma 4.2. Thus,
range(V1 ) ⊂ range(V̄1 ), and since P1 is a basis for range(V1 ), P1 ⊂ range(V̄1 ).
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
66
It has been shown that p ∈ range(V̄ ) and P1 ⊂ range(V̄ ). Thus,
P̄1 ⊆ range(V̄ 1 ). Since P̄1 consists of l + 1 linearly independent vectors and
dim(range(V̄ 1 )) = l + 1, P̄1 is a basis for V̄1 . Finally, since rescaling has no effect
on V̄ 1 , P̄1 is a basis for Vb1 , as required.
4.2
A transformed Hessian associated with B
The set Gk is defined as in (2.3), i.e.,
Gk = {g0 , g1 , . . . , gk }.
In this section, Q will denote an orthogonal matrix partitioned as
Q=(Z
W ),
where
range(Z) = span(G).
The following lemma, analogous to Lemma 2.3, shows that the transformed Hessian QT BQ has the same structure as the transformed Hessian associated with
the BFGS method. Hence, conjugate-direction rescaling preserves the block diagonal structure of the transformed Hessian. The proof of Lemma 4.3 is similar
to the proof of Lemma 2.3 given by Siegel [46].
Lemma 4.3 Let V0 be any orthogonal matrix. If Algorithm CDR is applied to
a twice-continuously differentiable function f : IRn → IR, then s ∈ span(G) for
all k. Moreover, if z and w belong to span(G) and the orthogonal complement of
span(G) respectively, then
Bz
∈ span(G),
B −1 z ∈ span(G) for all k, while
Bw

 w
= 
µw
where µ = µk−1 is defined by (3.8).
if k = 0;
otherwise,
(4.2)
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
67
Proof. The result for k = 0 is proved directly, while induction is used for
iterations such that k ≥ 1. Since B = I, (1.8) implies p = −g. Thus, s =
αp = −αg, which implies that s ∈ span(G). Also Bz = z implies Bz ∈ span(G)
and B −1 z ∈ span(G), for all z ∈ span(G). Since Bw = B −1 w = w, for all
w ∈ span(G)⊥ the result is true for k = 0.
With k = 0, the update B̄ satisfies
B̄ = I −
ssT yy T
+
.
sTs y Ts
The set Ḡ satisfies Ḡ = {g0 , g1 }. The vector s = −αg ∈ span(Ḡ) and y = ḡ − g ∈
span(Ḡ). span(Ḡ). Hence, for all w ∈ span(Ḡ)⊥ , it is true that B̄w = w, which
implies B̄ −1 w = w. The matrix B̄ −1 = V̄ V̄ T by definition and it follows that
V̄ V̄ Tw = w.
(4.3)
Since the first column of V̄ is parallel to s and since l = 1 at the end of the first
iteration, V̄ T1 w = 0. It follows from (4.3) that V̄ 2 V̄ T2 w = w. Hence, using (3.9),
Bb −1 w = (V̄ 1 V̄ T1 + β 2 V̄ 2 V̄ T2 )w = β 2 w.
(4.4)
Using (3.10) and (3.8), equation (4.4) implies that Bb −1 w = (1/µ)w, which also
b
b Tw = µz Tw = 0. Hence,
gives Bw
= µw. For all z ∈ span(Ḡ), we have (Bz)
b ∈ span(Ḡ). Similarly, B
b −1 z ∈ span(Ḡ). Finally, since s̄ = −ᾱB
b −1 ḡ, and since
Bz
Bb −1 ḡ ∈ span(Ḡ), if follows that s̄ ∈ span(Ḡ). Therefore, the result holds for
k = 1.
Assume that the result holds at the start of iteration k. By the inductive
hypothesis, s ∈ span(G) ⊆ span(Ḡ), and Bs ∈ span(G) ⊆ span(Ḡ). Also, y ∈
span(Ḡ) by definition. With w ∈ span(Ḡ)⊥ , and using (1.12) along with the
inductive hypothesis, we find
B̄w = Bw = µw.
(4.5)
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
68
Equation (4.5) implies V̄ V̄ Tw = (1/µ)w, whence V̄ 2 V̄ T2 w = (1/µ)w − V̄ 1 V̄ T1 w.
Using Theorem 4.1 and the inductive hypothesis, range(V̄ 1 ) = span(P̄1 ) ⊆
span(G) ⊆ span(Ḡ). Hence, V̄ T1 w = 0 and V̄ 2 V̄ T2 w = (1/µ)w. Thus, Bb −1 w =
(V̄ 1 V̄ T1 + β 2 V̄ 2 V̄ T2 )w = (β 2 /µ)w. Using (3.10) and (3.8),
β2
1
= ,
µ
µ
(4.6)
b
for all k ≥ 1, which implies Bb −1 w = (1/µ)w as desired. Hence, Bw
= µw.
b ∈ span(Ḡ), and B
b −1 z ∈ span(Ḡ) for all z ∈ span(Ḡ).
Exactly as above, we find Bz
Finally, if s̄ = −ᾱBb −1 ḡ, then s̄ ∈ span(Ḡ) since Bb −1 ḡ ∈ span(Ḡ). Otherwise, if
s̄ = −ᾱV̂1 V̂1Tḡ, then s̄ ∈ span(Ḡ) since range(V̂1 ) = span(P̄1 ) ⊆ span(G) ⊆
span(Ḡ).
Lemma 4.3 implies that for k = 0, QT BQ = I, and for k ≥ 1,

QT BQ = 

T
Z BZ
0 
.
0
µIn−r
(4.7)
Hence, the transformed Hessian associated with Algorithm CDR has the same
block structure as that given in equation (2.15) in connection with the BFGS
method. Furthermore, the transformed gradient satisfies




ZT g
ZT g 
QT g =  T  = 
.
W g
0
(4.8)
When l̄ = l + 1, the form of p given in (3.5) satisfies Bp = −g, which is equivalent
to
(QT BQ)QT p = −QT g.
(4.9)
Equations (4.7), (3.5), and (4.9) imply that
p = −Z(Z T BZ)−1 Z T g.
(4.10)
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
69
Using this form of p (4.10) and Theorem 4.1, it is now shown that the
search directions in P1 can be rotated into the basis defined by Z. For k = 0, p
can replace g in the definition of Z since p = −g as long as V is orthogonal. At
the start of iteration k, assume that
Z=(U
Y ),
where
range(U ) = span(P1 ).
(4.11)
Let r denote the column dimension of Z. If l̄ = l + 1, equation (4.10) implies that
p = U pU + Y pY , for some pU ∈ IRl and pY ∈ IRr−l . By Theorem 4.1, p is independent of P1 , and hence pY 6= 0. Therefore, Y pY /kpY k can be rotated into the first
column of Y as described in Section 2.5.1. Let S denote an orthogonal matrix
satisfying SpY = kpY ke1 and define Ze = ( Ū
and Ye = Y S T( e2
···
Ye ), where Ū = ( U
Y S Te1 ),
er ). Let ρḡ denote the norm of the component of ḡ
e Let ye denote the normalized component of ḡ orthogonal to Z
e
orthogonal to Z.
ḡ
and define
Ȳ
=




 ( Ye
If Z̄ = ( Ū
Ye ,
if ρḡ = 0;
(4.12)
yeḡ ),
otherwise.
Ȳ ), then range(Z̄) = span(Ḡ) and range(Ū ) = span(P̄1 ) and this
completes the argument.
For the remainder of the chapter, Q is defined as an orthogonal matrix
satisfying
Q=(Z
W ),
where Z = ( U
Y ).
(4.13)
Consider the (2, 2) block of the transformed Hessian determined by W̄ . From
equation (4.5), it follows that W̄ T B̄ W̄ = µIn−r̄ while the form of the transformed
Hessian given by (4.7) implies that W̄ T Bb W̄ = µIn−r̄ . Since the off-diagonal blocks
of the transformed Hessian are 0, the affect of rescaling V̄ on the transformed
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
70
Hessian corresponding to W̄ is now determined. The affect of conjugate direction
rescaling on the reduced Hessian Ū T B̄ Ū is examined in the next section.
4.3
How rescaling V̄ affects Ū T B̄ Ū
A preliminary lemma is required that relates V̄ −1 to Vb −1 .
Lemma 4.4 If V̄ −1 is partitioned as

V̄ −1

V̄ −1
1 
=  −1
,
V̄ 2
l̄×n
where V̄ −1
, then
1 ∈ IR

V̄ −1
1

Vb −1 =  1 −1  .
V̄
β 2

Proof. Using the partition V̄ = ( V̄ 1
−1
b
V̄ 1 V̄ −1
1 + V̄ 2 V̄ 2 . Since V = ( V̄ 1

V̄ −1
1

V̄ 2 ), it follows that I = V̄ V̄ −1 =
β V̄ 2 ),

1 −1

−1
Vb 
 1 −1  = V̄ 1 V̄ 1 + (β V̄ 2 )( V̄ 2 ) = I,
β
V̄
β 2
as required.
The overall form of Q̄TB̄ Q̄ given by (4.7) (postdated one iteration) is


Z̄ TB̄ Z̄
0 
.
Q̄TB̄ Q̄ = 
0
µIn−r̄
Since Z̄ satisfies equation (4.11) postdated one iteration, range(Ū ) = range(V̄ 1 ).
Thus, there exists a nonsingular M ∈ IRl̄×l̄ such that Ū = V̄ 1 M , and we may
write Z̄ = ( V̄ 1 M
Ȳ ). It follows that

Z̄ TB̄ Z̄ = 
T
T
1 B̄ V̄ 1 M
M V̄
Ȳ TB̄ V̄ 1 M
T
T
1 B̄ Ȳ
M V̄
Ȳ TB̄ Ȳ


Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
71
By definition of B̄, V̄ T1 B̄ V̄ 1 = Il̄ . Since V̄ −1 V̄ = I, we have V̄ −1 V̄ 1 = E1 , where
E1 denotes the first l̄ columns of the n × n identity matrix. It follows that
T
B̄ V̄ 1 = V̄ T −1 V̄ −1 V̄ 1 = V̄ T −1 E1 = (V̄ −1
1 ) ,
which implies


M TM
M T V̄ −1
1 Ȳ 
Z̄ TB̄ Z̄ =  T −1 T
.
T
Ȳ (V̄ 1 ) M
Ȳ B̄ Ȳ
Using the relation V̄ 1 = Vb1 , Z̄ TBb Z̄ is found to satisfy


M TM
M T Vb1−1 Ȳ 
Tb

Z̄ B Z̄ =
.
Ȳ T(Vb1−1 )T M
Ȳ TBb Ȳ
Lemma 4.4 implies that V̄ −1
= Vb1−1 and it follows that Z̄ TB̄ Z̄ and Z̄ TBb Z̄ are
1
identical except in the (2, 2) block.
The quantity Ȳ TB̄ Ȳ can be written in terms of quantities involving V̄
as follows
T −1
T
−1 T −1
Ȳ TB̄ Ȳ = Ȳ T(V̄ −1
1 ) V̄ 1 Ȳ + Ȳ (V̄ 2 ) V̄ 2 Ȳ .
(4.14)
b −1
Similarly, using the equality V̄ −1
1 = V1 ,
T −1
T b −1 T b −1
Ȳ TBb Ȳ = Ȳ T(V̄ −1
1 ) V̄ 1 Ȳ + Ȳ (V2 ) V2 Ȳ .
(4.15)
Subtracting (4.14) from (4.15), and using Lemma 4.4 gives
Ȳ TBb Ȳ − Ȳ TB̄ Ȳ = (1 − β 2 )Ȳ T(Vb2−1 )T Vb2−1 Ȳ .
(4.16)
From (4.16) it is seen that the form of Ȳ T(Vb2−1 )T Vb2−1 Ȳ is required to determine
how rescaling V̄ affects the reduced Hessian Ȳ TB̄ Ȳ .
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
72
The form of Ȳ T (Vb2−1 )T Vb2−1 Ȳ
The following theorem gives information about the block structure of (V T V )−1 ,
from which the right-hand side of (4.16) is ascertained.
Theorem 4.2 If (V T V )−1 is partitioned as

(V T V )−1

X11 X12 
,
= T
X12 X22
where X11 ∈ IRl×l , then X22 = µIn−l for all k ≥ 1.
Proof. The proof is by induction on k. For k = 0, V is orthogonal by definition
of Algorithm CDR. Using (1.29), and the Sherman-Morrison formula (see Golub
and Van Loan [26, p. 51]),
(V̄ T V̄ )−1 = Ω T V −1 (I − γsuT )(I − γusT )V T −1 Ω,
where γ = 1/(sT u − 1). From (1.31), the quantity Ω T V −1 s satisfies
Ω T V −1 s = −αkgV ke1 .
Hence,
(V̄ T V̄ )−1 = (Ω T V −1 + γαkgV ke1 uT )(V T −1 Ω + γαkgV kueT1 ),
which can be written as
(V̄ T V̄ )−1 = I + δ(e1 f T + f eT1 ) + δ 2 kuk2 e1 eT1 ,
(4.17)
where δ = γαkgV k, and f = Ω T V −1 u. Let X̄22 denote the (n − 1) × (n − 1), (2, 2)
block of (V̄ T V̄ )−1 . Since equation (4.17) only involves rank-one changes to I, all
of which include e1 as a factor,
X̄22 = In−1 .
(4.18)
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
73
c = (1/β 2 )X̄ = µI.
Using Lemma 4.4, (4.18), (3.8), and (3.10), we have X
22
22
Since l̄ = 1, the result is true at the start of the first iteration.
Assume that the result is true at the start of the kth iteration. Exactly
as in the derivation of (4.17),
(V̄ T V̄ )−1 = Ω T (V T V )−1 Ω + δ(e1 f T + f eT1 ) + δ 2 kdk2 e1 eT1 .
(4.19)
Let Ω be partitioned as


Ω11 Ω12 
Ω=
,
Ω21 Ω22
where Ω11 ∈ IRl×l , while (V T V )−1 is partitioned as in the statement of the lemma,
and consider the cases l̄ = l and l̄ = l + 1.
If l̄ = l, then Ω12 = 0 and Ω22 = In−l . In this case,






T
Ω T Ω21
X
X12   Ω11 0 
  11
Ω T V −1 V T −1 Ω =  11
T
0 In−l
X12 X22
Ω22 In−l
X̃11 X̃12 
,
=  T
X̃12 X22
where quantities with tildes have been affected by Ω. Using this in (4.19) gives

(V̄ T V̄ )−1

X̄11 X̄12 
= T
,
X̄12 X22
where the quantities with bars differ from those with tildes as a result of the rank
one matrices in (4.19). By the inductive hypothesis, X22 = µIn−l . Hence, using
c = µI
Lemma 4.4 and (4.6), we have X
22
n−l as required.
Suppose that l̄ = l + 1. Due to the lower Hessenberg form of Ω, we have
Ω12 = ζen−l eT1 , where ζ is some constant. The matrix Ω22 is lower Hessenberg,
and, because of the form of Ω12 , orthogonal. Let X̃22 denote the (2, 2) block of
Ω T (V T V )−1 Ω. Block multiplication, the orthogonality of Ω22 , and the inductive
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
74
hypothesis give
T
T
X̃22 = ζe1 (eTn−l (X11 Ω12 + X12 Ω22 )) + ζ(Ω22
X12
en−l )eT1 + µIn−l ,
(4.20)
which differs from µIn−l only in the first row and column. Since l̄ = l + 1, the
partitioning in (V̄ T V̄ )−1 is changed so that X̄11 ∈ IR(l+1)×(l+1) . Substitution of
(4.20) into (4.19) yields


(V̄ T V̄ )−1
X̄11 X̄12 
.
=
X̄21 µIn−l̄
c = µI , as required.
Finally, using Lemma 4.4 and (4.6) again, we have X
22
n−l̄
We now return to the derivation of the right hand side of (4.16). Since
Vb2−1 Vb1 = 0 by definition of a matrix inverse, the rows of Vb2−1 are a basis for
null(Vb1 ). Theorem 4.2 implies that Vb2−1 (Vb2−1 )T = µIn−l̄ , which means that the
rows of µ−1/2 Vb2−1 are orthonormal. Hence, the rows of µ−1/2 Vb2−1 form an orthonormal basis for null(Vb1 ). The form of Q̄ given in (4.13) and the definition
of Ū imply that ( Ȳ
W̄ ) also forms an orthonormal basis for null(Vb1 ). Thus,
µ−1/2 (Vb2−1 )T = ( Ȳ
W̄ )N , where N ∈ IR(n−l̄)×(n−l̄) is nonsingular. Moreover,
N is orthogonal. Therefore,

Ȳ T (Vb2−1 )T Vb2−1 Ȳ = µȲ T
Ȳ
W̄

Ȳ T
N N T  T  Ȳ = µIr̄−l̄ .
W̄
Substituting this result into (4.16), and using (4.6), it follows that
Ȳ T Bb Ȳ − Ȳ T B̄ Ȳ = (µ − µ)Ir̄−l̄ .
The effect that rescaling V̄ has on Q̄T B̄ Q̄ is now fully determined and is summarized in the following theorem.
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
75
Theorem 4.3 Let V0 denote any orthogonal matrix. During the kth iteration of
Algorithm CDR, let B̄ = (V̄ V̄ T )−1 , where V̄ is the BFGS update to V . Let Q̄ be
defined as in (4.13) postdated one iteration. Then, for k = 0 and k ≥ 1,




Z̄ T B̄ Z̄
0 
,
Q̄T B̄ Q̄ = 
0
In−r̄
Z̄ T B̄ Z̄
0 
Q̄T B̄ Q̄ = 
.
0
µIn−r̄
and
Now, let Vb be given by (3.9). If Bb = (Vb Vb T )−1 , then Bb satisfies


0 
Z̄ T Bb Z̄
Q̄T Bb Q̄ = 
,
0
µIn−r̄
where




0
Ū T B̄ Ū Ū T B̄ Ȳ   0
.
Z̄ T Bb Z̄ =  T
−
T
0 (µ − µ)Ir̄−l̄
Ȳ B̄ Ū Ȳ B̄ Ȳ
4.4
The proof of equivalence
The results of this chapter are now applied in the proof of the following theorem
on the equivalence of Algorithm RHRL and CDR. Following the theorem are two
corollaries, one of which addresses the convergence properties of Algorithm RHRL
on strictly convex quadratic functions. In the proof of the theorem, the subscript
“c” is used to denote quantities generated by Algorithm CDR.
Theorem 4.4 Consider application of Algorithm RHRL and Algorithm CDR in
exact arithmetic to find a local minimizer of a twice-continuously differentiable
f : IRn → IR, where the former algorithm uses σ0 = 1, R5 and = 0. Then, if
τ = τc and both algorithms start from the same initial point x0 , then they generate
the same sequence {xk } of iterates.
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
76
Proof. It suffices to show that both algorithms generate the same sequence of
search directions. Clearly, p = pc = −g for k = 0. Assume that the first k search
directions satisfy p = pc and assume that the index l increases on the same
iterations that lc increases. Assume that the matrices Q and QC are identical
satisfying U = UC , Y = YC and W = WC . This is true at the start of the
first iteration since U and UC are vacuous, Y and YC both equal g0 /kg0 k and the
implicit matrices W and WC can be considered to be equal. Furthermore, assume
that V and RQ are such that RQT RQ = QTC (V V T )−1 QC . This is true at the start
of the first iteration since RQ = I and since V is orthogonal.
At the start of iteration k, the reduction in the quadratic model in
range(U ) is equal to 21 ktU k2 , where RUT tU = −gU . Since U = UC = V1 M , for some
nonsingular matrix M , and since RUT RU = UCT BC UC , it follows that
1
kt k2
2 U
= 21 gUT (UCT BC UC )−1 gU = 12 g1T M (M T V1T BC V1 M )−1 M T g1 = 21 kg1 k2 ,
where the last equality follows since V1T BC V1 = Ilc . A similar argument shows
explicitly that
τ (ktZ k2 + ktY k2 ) = τc (kg1 k2 + kg2 k2 ),
since τ = τc by assumption. Hence, the parameters l and lc increase or remain
fixed in tandem. If l̄ = l̄c = l, then
p = −U (RUT RU )−1 U T g = −V1 M (M T V1T BC V1 M )−1 M T V1T g = −V1 V1T g = pc .
Otherwise, l̄ = l̄c = l + 1 and
p = −Z(RZT RZ )−1 Z T g = −ZC (ZCT BC ZC )−1 ZCT g = pc ,
where the last equality is given in (4.10). Thus, the search directions satisfy p = pc
and x̄ = x̄c assuming that both algorithms use the same line search strategy.
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
77
The matrix Z̄ is defined by Algorithm RHRL so that range(Z̄) =
span(Ḡ), and if l̄ = l + 1, then p is rotated into the basis so that range(Ū ) =
span(P̄). The implicit matrix W̄ is defined so that Q̄ = ( Z̄
W̄ ) is orthog-
onal. Note that the update to Q does not affect the underlying matrix B, i.e.,
Q̄RQ̄T RQ̄ Q̄T = B = BC . Since s = sc and y = yc , the BFGS updates to RQ̄ and
V yield R̄Q̄ and V̄ respectively satisfying Q̄R̄TQ̄ R̄Q̄ Q̄T = (V̄ V̄ T )−1 = B̄C . This
equation and Theorem 4.3 imply that


0
0
bT R
b ,
=R
Q̄T Bb C Q̄ = R̄TQ̄ R̄Q̄ − 
Q̄
Q̄
0 (µ − µ)In−l̄
where the last equality follows from (3.14) and the choice of σ̄. Since Q̄ = Q̄C ,
bT R
b = Q̄ (Vb Vb T )−1 Q̄T , as required.
the last equation implies that R
Q̄
C
Q̄
C
The following corollary addresses the quadratic termination of Algorithm CDR. This result was not given by Siegel in [45], although the algorithm
was designed specifically not to interfere with the quadratic termination of the
BFGS method.
Corollary 4.1 If Algorithm 3.1 is used with exact line search each iteration to
minimize the strictly convex quadratic (1.17), then the iteration terminates at the
minimizer x∗ in at most n steps.
Proof. The result follows since Algorithm CDR generates the same iterates as
Algorithm RHRL and since the latter terminates on quadratics by Corollary 3.1.
The last result of this chapter gives convergence properties of Algorithm
RHR when applied to strictly convex functions.
Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
78
Corollary 4.2 Let f : IRn → IR denote a strictly convex, twice-continuously
differentiable function. Furthermore, assume that ∇2f (x) is Lipschitz continuous
with k∇2f (x)−1 k bounded above for all x in the level set of x0 . If Algorithm RHRL
with a Wolfe line search is used to minimize f , then convergence is global and
superlinear.
Proof. Since Algorithm CDR has these convergence properties, the proof is
immediate considering Theorem 4.4.
Chapter 5
Reduced-Hessian Methods for
Large-Scale Unconstrained
Optimization
5.1
Large-scale quasi-Newton methods
When n is large, it may be impossible to store the Cholesky factor of Bk or the
conjugate-direction matrix Vk . Conjugate-gradient methods can be used in this
case and require storage for only a few n-vectors (see Gill et al. [22, pp. 144–
150]). However, these methods can require as many as 5n iterations and may be
prohibitively costly in terms of function evaluations. In an effort to accelerate
these methods, several authors have proposed “limited-memory” quasi-Newton
methods. These methods define a quasi-Newton update used either alone (e.g.,
see Shanno [43], Gill and Murray [19] or Nocedal [35]) or in a preconditioned
conjugate-gradient scheme (e.g., see Nazareth [33] or Buckley [4], [5]). Instead of
forming Hk explicitly, these methods store vectors that implicitly define Hk as a
sequence of updates to an “initial” inverse approximate Hessian. This allows the
direction pk = −Hk gk to be computed using a sequence of inner products.
79
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
80
For example, Nocedal’s method [35] makes use of the product form of
the inverse BFGS update
Hk+1 = MkT Hk Mk +
sk sTk
,
sTk yk
where Mk = I −
yk sTk
sTk yk
(5.1)
(see (2.12) for the corresponding form of the general Broyden update). Storage
is provided for a maximum of m pairs of vectors (si , yi ). Once the storage limit
is reached, the oldest pair of vectors is discarded at each iteration. Hence, after
the mth iteration, Hk is given by
T
T
Hk = Mk−1
· · · Mk−m−1
Hk0 Mk−m−1 · · · Mk−1
+
T
Mk−1
sk−m−1 sTk−m−1
T
Mk−m
· · · Mk−m T
sk−m−1 yk−m−1
· · · Mk−1
..
.
T
+ Mk−1
(5.2)
sk−2 sTk−2
sk−1 sTk−1
M
+
,
k−1
sTk−2 yk−2
sTk−1 yk−1
where Hk0 is chosen during iteration k. Liu and Nocedal study several different
choices for Hk0 , all of which are diagonal. In particular, the choice
Hk0 = θk I,
where θk =
sTk yk
,
ykT yk
is shown to be the most effective in practice (see Liu and Nocedal [28]).
The formula (5.2) for Hk is used to compute pk using the stored vectors
{sk−m , . . . , sk−1 } and {yk−m , . . . , yk−1 }.
An efficient method for computing pk due to Strang is given in Nocedal [35] that
requires 4mn floating-point operations. The iterations proceed using formula
(5.2) until a non-descent search direction is computed. The matrix Hk is then
reset to a diagonal matrix and the storage of pairs (si , yi ) begins from scratch.
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
81
Reduced-Hessian methods provide an alternative to standard limitedmemory methods. Fenelon proposed the first reduced-Hessian method for large
scale unconstrained optimization in her thesis dissertation (see Fenelon [14]). Her
method is an extension of the Cholesky-factor method given in Section 2.1 and
is based on the fact that the reduced Hessian is tridiagonal when minimizing
quadratic functions using an exact line search. This tridiagonal form implies
that the reduced Hessian can be written as Z TBZ = LDLT, where L is unit
lower bidiagonal. The matrix Z is partitioned as Z = ( Z1
Z2 ), where Z2
corresponds to the last m accepted gradients. Fenelon suggests forcing L to have
a block structure


L11
0 
L=
,
T
λe1 er−m L22
where λ ∈ IR,
L11 is unit lower bidiagonal and L22 is unit lower triangular. A recurrence relation
is given for computing p satisfying LDLTpZ = −gZ and p = ZpZ using L22 , Z2
and one extra n-vector. The form of L is motivated by the desire for quadratic
termination. However, the update to L22 may not be defined when minimizing
general f because of a loss of positive definiteness in the matrix L̄D̄L̄T. An
indefinite update does not occur because of roundoff error, but stems rather from
the assumed structure of L. Fenelon suggests a restart strategy to alleviate the
problem, but reports disappointing results (see Fenelon [14] for further details of
the algorithm and complete test results).
Nazareth has defined reduced inverse-Hessian successive affine reduction
(SAR) methods. These methods store a matrix Z T HZ, where
range(Z) = span{pk−1 , gk−m+1 , gk−m+2 , . . . , gk } (assuming k ≥ m − 1)
and the columns of Z are orthonormal. In terms of Z T HZ, the search direction
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
82
satisfies p = −Z(Z T HZ)Z T g. The inclusion of pk−1 in range(Z) ensures that the
method terminates on quadratics.
Siegel has proposed a method based on the reduced inverse approximate
Hessian method of Section 2.2. The method differs from Fenelon’s in that no
attempt is made to define the approximate inverse Hessian corresponding to
Z1 . Positive definiteness of Z2THZ2 is guaranteed and quadratic termination is
achieved by redefining Z2 after the computation of p so that p ∈ range(Z2 ).
The method differs from SAR methods because the size of reduced Hessian is
explicitly controlled. Information is only discarded when the acceptance of a
new gradient causes the reduced Hessian to exceed order m (see Siegel [46] for
complete details).
5.2
Extending Algorithm RH to large problems
Four new reduced Hessian methods for large-scale optimization are introduced in
this chapter. The first, which is called Algorithm RH-L-G, is similar to Fenelon’s
method in the sense that it uses both a Cholesky factor of the reduced Hessian and an orthonormal basis for the gradients. The second, called Algorithm
RH-L-P, uses an orthonormal basis for previous search directions and possibly
the last accepted gradient. The third and fourth new algorithms, called RHRL-G and RHR-L-P, use the method of resaling proposed in Chapter 3. Since
Algorithms RHR-L-G and RHR-L-P can be implemented without rescaling, they
include Algorithms RH-L-G and RH-L-P as special cases.
The methods are similar to Siegel’s and utilize two important features
of his algorithms.
• When information is discarded, the exact reduced Hessian corresponding
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
83
to the saved gradients (or search directions) is maintained. Since the saved
gradients (search directions) will be linearly independent, this implies that
there is no loss of positive definiteness in exact arithmetic.
• In Algorithm RHR-L-P, the last accepted gradient is replaced by the search
direction in order to establish quadratic termination.
In Section 5.3, numerical results are given for the algorithms. Results
show that Algorithm RHR-L-P outperforms RHR-L-G, which may indicate that
the quadratic termination property appears to be beneficial in practice. Rescaling
is shown to be crucial in practice through numerical experimentation. Results
are given comparing the methods to the limited-memory BFGS algorithm of Zhu
et al. [49], which may be considered as the current state of the art.
5.2.1
Imposing a storage limit
Let m denote a prespecified “storage limit”. This limit restricts the size of the
reduced Hessian passed from one iteration to the next. If the reduced Hessian
grows to size (m + 1) × (m + 1) during any iteration, then approximate curvature
information will be discarded and an m × m reduced Hessian is passed to the
next iteration. Several authors have suggested discarding curvature information
corresponding to the “oldest” gradient (e.g., see Fenelon [14], Nazareth [34] and
Siegel [46]). Alternative discard procedures are the subject of future research and
will not be considered in this thesis.
To introduce some of the notation that will be used, we present an example illustrating how the oldest gradient can be discarded. At the end of the kth
iteration, suppose that Z̄ and R̄Z̄ are associated with Ḡ = ( g0
g1
···
gm )
(g0 , g1 , . . . , gm are assumed to be linearly independent). Because m + 1 linearly
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
84
independent vectors are in the basis, g0 will be discarded before the start of iteration k + 1. We will use tildes to denote the corresponding quantities following the
e =(g
deletion of g0 . In this case, G
1
g2
···
e = range(G)
e
gm ) and range(Z)
with Ze TZe = Im . The matrix R̄Ze will denote the Cholesky factor of the reduced
e The determination of Z
e is considered in the next section.
Hessian Ze T B̄ Z.
5.2.2
The deletion procedure
e is obtained from
In the next two sections, we consider the definition of Ze when G
Ḡ by dropping the oldest gradient. This procedure is due to Daniel et al. (see
[8] for further details). In the first section, we will consider an example. In the
second, we will give the general procedure.
An example of the discard procedure
Consider the case where n = 4, m = 2 and Ḡ = ( g0
g1
g2 ). (The gradients
g0 , g1 and g2 are assumed to be linearly independent.) The matrix Z̄ satisfies
range(Z̄) = range(Ḡ), Z̄ TZ̄ = I3 and is obtained using the Gram-Schmidt process
e = range(G),
e where G
e =
on g0 , g1 and g2 . We require Ze such that range(Z)
( g1
g2 ) and Ze TZe = I2 . Recall that there exists a nonsingular upper-triangular
matrix T̄ such that Ḡ = Z̄ T̄ . Let Z̄ and T̄ be partitioned as

Z̄ = ( z̄ 1
z̄ 2

t̄11 t̄12 t̄13



z̄ 3 ) and T̄ = 
t̄22 t̄23 
.
t̄33
It follows that
g1 = t̄12 z̄ 1 + t̄22 z̄ 2
and g2 = t̄13 z̄ 1 + t̄23 z̄ 2 + t̄33 z̄ 3 .
e Let P denote a 3 × 3
Hence, no two columns of Z̄ define a basis for range(G).
12
symmetric Givens matrix in the (1, 2) plane defined to annihilate t̄22 in T̄ . Let
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
85
P23 denote a symmetric Givens matrix in the (2, 3) plane defined to annihilate
t̄33 in P12 T̄ . It follows that Ḡ = (Z̄P12 P23 )(P23 P12 T̄ ) and that P23 P12 T̄ is of the
form


× × ×



P23 P12 T̄ =  ×
×
.
×
Suppose we partition Z̄P12 P23 and P23 P12 T̄ such that


te Te 
ze ) and P23 P12 T̄ = 
,
τ 0
Z̄P12 P23 = ( Ze
where Ze ∈ IR4×2 and Te ∈ IR2×2 . Note that Te is nonsingular since Ḡ has full
e =Z
e Te and it follows that Z
e and Te define a
rank. These partitions imply that G
e
skinny Gram-Schmidt QR factorization of G.
It is important to note that the discard procedure cannot be accomplished without knowledge of T̄ . The Givens matrices P12 and P23 depend on
every nonzero component of T̄ except t̄11 .
The general drop-off procedure
During iteration k, suppose that ḡ = gkm is accepted into the basis and that, with
the addition of ḡ, the reduced approximate Hessian attains order m + 1. The
matrix of accepted gradients, Ḡ = ( gk0
Ḡ = ( gk0
gk1
···
gkm ) may be partitioned as
e ) in accordance with the strategy of deleting the oldest gradient.
G
Define T̄S as S T̄ , where S denotes an orthogonal matrix. Define Te as the (1, 2),
m × m block of T̄S . The matrix S is defined so that Te is nonsingular and upper
triangular. In particular, S = Pm,m+1 Pm−1,m · · · P12 , where Pi,i+1 is a symmetric
(m + 1) × (m + 1) Givens matrix in the (i, i + 1) plane defined to annihilate the
(i + 1, i + 1) element of Pi−1,i · · · P12 T̄ . The resulting product satisfies

S T̄ = 
te Te
τ
0

,
where te ∈ IRm .
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
Let Z̄ S = Z̄S T and define Ze ∈ IRn×m by the partition Z̄ S = ( Ze
86
ze ), i.e.,
Ze = Z̄ S Em , where Em consists of the first m columns of I. From the definition
of Ḡ, we have
Ḡ = ( gk0
e ) == Z̄ T̄ = Z̄ T̄ = ( Z
e te + τ ze Z
e Te ),
G
S S
e =Z
e Te is a Gram-Schmidt QR factorization corresponding
and it follows that G
to the last m accepted gradients.
5.2.3
The computation of T̄
We now describe the computation of the nonsingular upper-triangular matrix T̄ .
This matrix is a by-product of the Gram-Schmidt process described in Section
2.1.
Given that Z and T are known at the start of iteration k (they are
easily defined at the start of the first iteration), consider the definition of Z̄ and
T̄ . During iteration k, suppose that ḡ is accepted giving Ḡ = ( G ḡ ). Define
ρḡ = k(I − ZZ T )ḡk and zḡ = (I − ZZ T )ḡ/ρḡ as in Section 2.1. If Z̄ and T̄ are
defined by

Z̄ = ( Z

T Z Tḡ 
,
zḡ ) and T̄ = 
0 ρḡ
then Z̄ T̄ = Ḡ. If ḡ is rejected, we will define Z̄ = Z and T̄ = T .
In summary, after the computation of ḡ, r̄ is defined as in Chapter 2,
i.e.,
r̄ =
if ρḡ ≤ kḡk;


r,

r + 1,
(5.3)
otherwise,
The updates to Z and T satisfy


Z̄ = 
Z,
if r̄ = r;
(5.4)
(Z
zḡ ),
otherwise
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
and


T,



 

=
T
ḡ
Z



,



0 ρ
T̄
87
if r̄ = r;
(5.5)
otherwise.
ḡ
For convenience, we define the function GST (short for Gram-Schmidt orthogonalization including T )
(Z̄, T̄ , ḡ Z̄ , r̄) = GST(Z, T, ḡ, r, )
that defines r̄, T̄ and Z̄ according to (5.3)–(5.5).
5.2.4
The updates to ḡ Z̄ and R̄Z̄
The change of basis necessitates changing ḡ Z̄ and R̄Z̄ so that all quantities passed
e The quantity ḡ
to the next iteration correspond to the new basis defined by Z.
Z
e
needed to compute the search direction during iteration k + 1 can be obtained
from ḡ Z̄ without the mn floating-point operations required to compute Ze Tḡ from
scratch. Let ḡ S denote the vector S ḡ Z̄ = Pm,m+1 Pm−1,m · · · P12 ḡ Z̄ . Since Ze =
Z̄ S Em (recall that Em denotes the matrix of first m columns of I), we have
T
T
T
ḡ Ze = Ze Tḡ = (Z̄ S Em )T ḡ = Em
S Z̄ Tḡ = Em
S ḡ Z̄ = Em
ḡ S .
Thus, ḡ Ze is given by the first m components of ḡ S .
It remains to define an update to R̄Z̄ that yields R̄Ze, where R̄TZeR̄Ze =
e The latter quantity satisfies
Ze TB̄ Z.
T
Ze TB̄ Ze = (Z̄S TEm )TB̄ (Z̄S T Em ) = Em
S R̄TZ̄ R̄Z̄ S T Em .
In general, the matrix R̄Z̄ S T is not upper triangular. Let Se denote an orthogonal
matrix of order m + 1 defined so that SeR̄Z̄ S T is upper triangular. If R̄S = SeR̄Z̄ S T
denotes the resulting matrix, then it follows that
T T
R̄S R̄S Em ,
Ze T B̄ Ze = Em
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
88
which implies that the leading m × m block of R̄S is the required factor R̄Ze.
The matrix Se is the product P˜m,m+1 · · · P˜23 P˜12 , where P˜i,i+1 is an (m +
1) × (m + 1) Givens matrix in the (i, i + 1) plane that annihilates the (i, i +
1) element of P˜i−1,i · · · P˜12 R̄ZeP12 · · · Pi,i+1 . The two sweeps of Givens matrices
defined by S and Se must be interlaced as in the update for the Cholesky factor
following the change of basis for Z (see Section 2.5).
For notational convenience, we define the function discard corresponding
to the drop-off procedure. We write
e Te , ḡ , R̄ ) = discard(Z̄, T̄ , ḡ , R̄ ).
(Z,
Z̄
Z̄
Z
e Ze
The quantities ḡ Z̄ and R̄Z̄ are supplied to discard because if ḡ Ze and R̄Ze are come the rotations defining S need not be stored.
puted during the computation of Z,
5.2.5
Gradient-based reduced-Hessian algorithms
The first of the four reduced-Hessian methods for large-scale unconstrained optimization is given as Algorithm RH-L-G below.
Algorithm 5.1. Gradient-based Large-scale reduced-Hessian method (RH-L-G)
Initialize k = 0; Choose x0 , σ, and m;
Initialize r = 1, Z = g0 /kg0 k, T = kg0 k and RZ = σ 1/2 ;
while not converged do
Solve RZT tZ = −gZ , RZ pZ = tZ and set p = ZpZ ;
Compute α so that sTy > 0 and set x̄ = x + αp;
Compute (Z̄, T̄ , ḡ Z̄ , r̄) = GS(Z, T, ḡ, r, );
if r̄ = r + 1 then
Define pZ̄ = (pZ , 0)T , gZ̄ = (gZ , 0)T and RZ̄ = diag(RZ , σ 1/2 );
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
89
else
Define pZ̄ = pZ , gZ̄ = gZ and RZ̄ = RZ ;
end if
Compute sZ̄ = αpZ̄ and yZ̄ = ḡ Z̄ − gZ̄ ;
Compute R̄Z̄ = Broyden(RZ̄ , sZ̄ , yZ̄ );
if r̄ = m + 1 then
e Te , ḡ , R̄ ) = discard(Z̄, T̄ , ḡ , R̄ );
Compute (Z,
Z̄
Z̄
Z
e Ze
r̄ ← m;
end if
k ← k + 1;
end do
5.2.6
Quadratic termination
Fenelon [14] and Siegel [46] have observed that gradient-based reduced-Hessian algorithms may not enjoy quadratic termination. Consider a quasi-Newton method
employing an update from the Broyden class and an exact line search. Recall that
when this method is applied to a quadratic, the search directions are parallel to
the conjugate-gradient directions (see Section 1.2.1). However, we demonstrate
below that the directions generated by Algorithm RH-L-G are not necessarily
parallel to the conjugate-gradient directions. Moreover, Algorithm RH-L-G does
not exhibit quadratic termination in practice.
Table 5.1 gives the definition of the search direction generated during
iteration k + 1 of both the conjugate-gradient method and Algorithm RH-L-G.
Note that the conjugate-gradient direction is a linear combination of ḡ and p.
Suppose that the first k + 1 directions of Algorithm RH-L-G are parallel to the
first k + 1 conjugate-gradient directions. Under this assumption, ḡ is accepted by
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
90
Table 5.1: Comparing p̄ from CG and Algorithm RH-L-G on quadratics
Iteration Conjugate Gradient Reduced Hessian
kḡk2
e
k+1
p̄ = −ḡ +
p̄ = Zp
p
Z
e
2
kgk
Algorithm RH-L-G during iteration k since ḡ is orthogonal to the previous gradie However,
ents (see Section 1.2.1). It follows by construction that ḡ ∈ range(Z).
if the oldest gradient is dropped from Ḡ, and if p has a nonzero component along
e Hence, the search directhe direction of the oldest gradient, then p 6∈ range(Z).
tion p̄ generated by Algorithm RH-L-G cannot be parallel to the corresponding
conjugate-gradient direction.
Authors have devised various ways of ensuring quadratic termination of
reduced-Hessian type methods. As described in Section 5.1, Fenelon [14] obtains
quadratic termination by recurring the super-diagonal elements of RZ corresponding to the deleted gradients. Nazareth [34] defines the basis used during iteration
k + 1 to include p and ḡ. Siegel [46] maintains quadratic termination by replacing
g with p in the definition of Z whenever the former is accepted. This exchange is
discussed further in the next section and will lead to a modification of Algorithm
RH-L-G.
5.2.7
Replacing g with p
Consider the set of search directions,
Pk = {p0 , p1 , . . . , pk },
(5.6)
generated by a quasi-Newton method (see Algorithm 1.2) using updates from the
Broyden class. Siegel has observed that the subspace associated with Gk is also
determined by Pk , i.e., span(Gk ) = span(Pk ). Lemma 5.1 given below is essential
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
91
for the proof of this result. In Lemma 5.1, zḡ and zp̄ are the normalized components of ḡ and p̄ respectively, that are orthogonal to span(G). The normalized
component of p̄ orthogonal to range(Z) is given
zp̄ =




0,
if ρp̄ = 0;
1


(I − ZZ T)p̄,

ρp̄
(5.7)
otherwise,
where ρp̄ = k(I − ZZ T )p̄k. The lemma establishes that zp̄ is nonzero as long zḡ
is nonzero, i.e., p̄ always includes a component of zḡ . Note that in the proof of
Lemma 5.1, Z and Z̄ are assumed to be exact orthonormal bases for span(G) and
span(Ḡ) respectively.
Lemma 5.1 If B0 = σI (σ > 0), and Bk is updated using an update from the
Broyden class, then zp̄ = ±zḡ .
Proof. The proof is trivial if zḡ = 0. Suppose zḡ 6= 0. Using (2.4), p̄ = Z̄ Z̄ Tp̄ =
ZZ Tp̄ + (zḡTp̄)zḡ , which implies that (I − ZZ T)p̄ = (zḡTp̄)zḡ and ρp̄ = |zḡTp̄|. Hence,
as long as zḡTp̄ 6= 0, zp̄ = sign(zḡTp̄)zḡ , as required.
It remains to show that zḡTp̄ cannot be zero. Assume that zḡTp̄ = 0 with
zḡ 6= 0, which means that p̄ = Zp1 , where p1 ∈ IRr , i.e., p̄ ∈ span(G). Using the
Broyden update formulae (1.14), the equation Bp = −g, and the equations
s = αp and B̄ p̄ = −ḡ,
(5.8)
it follows that
p̄Tg αφsTg wTp̄ − p̄Ty
B p̄ + αφw p̄ + T +
g=
pg
sTy
!
T
αφsTg wTp̄ − p̄Ty
− 1 ḡ. (5.9)
sTy
!
Since p̄ ∈ span(G), Lemma 2.3 implies that B p̄ ∈ span(G). Thus, if (αφsTg wTp̄ −
p̄Ty)/sTy 6= 1, then equation (5.9) implies that ḡ ∈ span(G), which contradicts
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
92
zḡ 6= 0. Otherwise, equation (5.9) implies that
B p̄ = −βg,
where β = αφwTp̄ +
p̄Tg αφsTg wTp̄ − p̄Ty
+
.
pTg
sTy
Multiplying through by B −1 gives p̄ = −βB −1 g = βp. Combining this with the
quasi-Newton condition B̄s = y and (5.8) gives
!
β
β
+ 1 ḡ = g,
α
α
which must imply that ḡ is parallel to g, contradicting zḡ 6= 0. These contradictions establish that zḡTp̄ 6= 0 as required.
Once Lemma 5.1 is established, the following result follows directly.
Theorem 5.1 (Siegel) If B0 = σI (σ > 0), and Bk is updated using any formula
from the Broyden class, then
span(Gk ) = span(Pk ).
Proof. The result is clearly true for k = 0. Suppose that the result holds
through iteration k − 1, i.e., span(Pk−1 ) = span(Gk−1 ). Since pk ∈ span(Gk ) (see
Lemma 2.3), it follows that span(Pk ) ⊆ span(Gk ). A straightforward application
of Lemma 5.1 (predated one iteration) implies that span(Gk ) ⊆ span(Pk ) and the
desired result follows.
Recall that the iterates x0 , x1 , . . ., xk+1 of quasi-Newton methods (using
updates from the Broyden class) lie on the manifold M(x0 , Gk ). Since span(Pk ) =
span(Gk ), the iterates are a spanning set for the manifold. Hence, these methods
exploit all gradient information.
We now consider exchanging the search direction for the gradient in
Algorithm RH (p. 28). Recall that this algorithm uses an approximate basis Gk
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
93
for span(G) (see Section 2.1). The columns of Gk are the accepted gradients gk1 ,
gk2 , . . ., gkr . We define Pk as the corresponding matrix of search directions, i.e.,
Pk =
p k1 p k2 · · · p kr
.
(5.10)
The following corollary, analogous to Lemma 5.1, implies that p̄ has a nonzero
component along zḡ whenever ḡ is accepted.
Corollary 5.1 If zḡ is defined as in Algorithm RH and zp̄ is defined by (5.7),
then zp̄ = ±zḡ .
Proof. In the reduced Hessian method, a full approximate Hessian is not stored,
but can be implicitly defined by


R T RZ
0  T
Q̄
B = Q̄  Z
0
σIn−r
(see Section 2.4). Similarly, define


R̄T R̄Z̄
0  T
B̄ = Q̄  Z̄
Q̄
0
σIn−r̄
and note that in terms of these effective Hessians, the search directions satisfy
B p = −ZgZ and B̄ p̄ = −Z̄ ḡ Z̄ respectively. In light of these two equations,
define g = ZgZ , ḡ = Z̄ ḡ Z̄ , and y = ḡ − g . Hence,
B p = −g and B̄ p̄ = −ḡ .
(5.11)
If R̄Z̄ is obtained from RZ̄ using a Broyden update as in Algorithm 2.2, then
B̄ is the matrix obtained by applying the same Broyden update to B using
the quantities B , and y in place of B and y respectively. A short calculation
verifies the quasi-Newton condition B̄ s = y . The rest of the proof proceeds as
in Lemma 5.1.
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
94
Theorem 5.2, which is analogous to Theorem 5.1, follows immediately
from Corollary 5.1.
Theorem 5.2 If B0 = σI (σ > 0) in Algorithm RH,
range(Gk ) = range(Pk ).
Proof. The proof is analogous to the proof of 5.1 and is omitted.
When the discard procedure is used, we expect that zp̄ is nonzero whenever ḡ is accepted. However, at the time of the completion of this dissertation,
this result has not been proved. We therefore give the following proposition.
Proposition 5.1 If Algorithm RH-L-G is used to minimize f : IRn → IR, then
zp̄ 6= 0 whenever ḡ is accepted.
Replacing g with p can be accomplished by simply replacing the last
column of T with pZ . We will use the function chbs (for “change of basis”) to
denote this replacement and will write
T̆ = chbs(T ).
In the absence of a proof for the proposition, we will only perform the replacement
if ρp > M kpZ k. We note that in exact arithmetic, ρp must be nonzero if the search
directions are conjugate-gradient directions. This follows because the conjugategradient directions are linearly independent (see Fletcher [15, p. 25]).
The following algorithm employs the change of basis.
Algorithm 5.2. Direction-based large-scale reduced-Hessian method (RH-L-P)
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
95
The algorithm is identical to Algorithm RH-L-G except after defining p.
The lines following the computation of p are as follows.
if g was accepted then
T̆ = chbs(T )
end if
Rescaling reduced Hessians for large problems
When solving smaller problems, the numerical effects of rescaling vary as shown in
Chapter 3. For larger problems, the discard procedure makes rescaling essential.
We now present two rescaling algorithms defined as extensions of Algorithms RHL-G and RH-L-P. The definition is based on the rescaling suggested in Algorithm
RHR (p. 52). Algorithm RHR-L-G is identical to RH-L-G except following the
BFGS update.
Algorithm 5.3. Gradient-based Large-scale reduced-Hessian rescaling method
(RHR-L-G)
Compute σ̄;
if r̄ = r + 1 then
b ;
Replace the (r̄, r̄) element of R̄Z̄ with σ̄ to give R
Z̄
end if
Since much of the notation is altered for the direction-based algorithm,
it is given in its entirety.
Algorithm 5.4. Direction-based large-scale reduced-Hessian rescaling method
(RHR-L-P)
Initialize k = 0; Choose x0 , σ, and m;
Initialize r = 1, Z = g0 /kg0 k, T = kg0 k and RZ = σ 1/2 ;
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
96
while not converged do
Solve RZT tZ = −gZ , RZ pZ = tZ and set p = ZpZ ;
if g was accepted then
T̆ = chbs(T )
end if
Compute α so that sTy > 0 and set x̄ = x + αp;
Compute (Z̄, T̄ , ḡ Z̄ , r̄) = GST(Z, T̆ , ḡ, r, );
if r̄ = r + 1 then
Define pZ̄ = (pZ , 0)T , gZ̄ = (gZ , 0)T and RZ̄ = diag(RZ , σ 1/2 );
else
Define pZ̄ = pZ , gZ̄ = gZ and RZ̄ = RZ ;
end if
Compute sZ̄ = αpZ̄ and yZ̄ = ḡ Z̄ − gZ̄ ;
Compute R̄Z̄ = BFGS(RZ̄ , sZ̄ , yZ̄ );
Compute σ̄;
if r̄ = r + 1 then
b ;
Replace the (r̄, r̄) element of R̄Z̄ with σ̄ to give R
Z̄
end if
if r̄ = m + 1 then
e Te , ḡ , R
b
b
Compute (Z,
Z
e Ze) = discard(Z̄, T̄ , ḡ Z̄ , RZ̄ );
r̄ ← m;
end if
k ← k + 1;
end do
In Section 5.3, we compare several choices of σ̄ used in Algorithm RHR-
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
97
L-P. In Section 5.4, Algorithm RHR-L-P (and consequently Algorithm RH-L-P)
is shown to enjoy quadratic termination in exact arithmetic.
5.3
Numerical results
Results are presented in this section for various aspects of Algorithms RHR-LG and RHR-L-P. We also present a comparison with the L-BFGS-B algorithm
proposed by Zhu et al. [49]. (The L-BFGS-B algorithm is an extension of the LBFGS method reviewed in Section 5.1, but performs similarly on unconstrained
problems.)
Many of the problems are taken from the CUTE collection (see Bongartz
et al. [1]). In the tables of results, we will use the CUTE designation for the test
problems although there is some overlap with the problems from Moré et al. [29]
listed in Table 3.2.
In the following sections we answer four questions concerning the algorithms.
• Does the enforcement of quadratic termination in Algorithm RHR-L-P result in practical benefits in comparison to Algorithm RHR-L-G?
• How do the various rescaling schemes presented in Table 3.1 affect the
performance of Algorithms RHR-L-G and RHR-L-P?
• What effect does the value of m have on the number of iterations and
function evaluations required by the algorithms?
• How many iterations and function evaluations do the algorithms require in
comparison with L-BFGS-B?
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
98
Algorithm RHR-L-G compared with RHR-L-P
In this section we examine the performance of Algorithm RHR-L-G compared
with Algorithm RHR-L-P. The implementation is in FORTRAN 77 using a DEC
5000/240. The line search, step length parameters and acceptance parameter are
the same as those used to test Algorithm RHR (see Section 3.3.2, p. 53). We
present results in Tables 5.2–5.3 for the rescaling methods R0 (no rescaling), R2
and R4 (see Table 3.1, p. 53, for definitions of the rescaling schemes).
The stopping criterion is kgk k∞ < 10−5 as suggested by Zhu et al. [49].
The notation “L” indicates that the algorithm terminated during the line search.
Termination during the line search usually occurs when the search direction is
nearly orthogonal to the gradient. A limit of 1500 iterations was imposed. The
notation “I” indicates that the algorithm was terminated after 1500 iterations.
Both the “L” and “I” are accompanied by a number in parentheses that indicates
the final infinity norm of the gradient.
Table 5.2: Iterations/Functions for RHR-L-G (m = 5)
Problem
Algorithm RHR-L-G
Name
n
R0
R2
R4
ARWHEAD 1000
8/14
17/22
17/22
BDQRTIC
100 211/325 I(.1E-2)
259/268
CRAGGLVY 1000 L(.2E-4) I(.6E-2)
249/256
DIXMAANA 1500
12/16
17/21
15/19
DIXMAANB 1500
28/45
22/26
17/21
DIXMAANE 1500 965/976 I(.6E-3) 1232/1236
EIGENALS
110 I(.8E-1) I(.7E-1)
I(.6E-1)
GENROSE
500 I(.2E+1) I(.2E+1) I(.2E+1)
MOREBV
1000 120/196 130/132
158/160
PENALTY1 1000
49/60
75/89
59/71
QUARTC
1000 120/178 645/1176
97/103
Consider the performance of the two algorithms for a given rescaling
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
99
Table 5.3: Iterations/Functions for RHR-L-P (m = 5)
Problem
Algorithm RHR-L-P
Name
n
R0
R2
R4
ARWHEAD 1000
8/14
17/22
17/22
BDQRTIC
100
123/211
625/634
148/163
CRAGGLVY 1000 L(.1E-3)
L(.3E-4)
108/116
DIXMAANA 1500
12/16
17/21
15/19
DIXMAANB 1500
25/39
22/26
18/22
DIXMAANE 1500 209/216
772/776
178/183
EIGENALS
110 975/1880 659/1189
682/704
GENROSE
500 1432/3449 1058/1903 1136/1211
MOREBV
1000 100/201
96/98
81/83
PENALTY1 1000
49/60
75/89
59/71
QUARTC
1000 122/160
722/728
87/93
technique. For some of the problems (e.g., ARWHEAD, DIXMAANA–B and
PENALTY1) they perform similarly. However, for most of the problems, it is
clear that Algorithm RHR-L-P performs better than RHR-L-G. This is particularly true for DIXMAANE, EIGENALS and GENROSE. There are very few cases
where the reverse is true and in these cases the difference is quite small. For example, RHR-L-G without rescaling takes 196 function evaluations for MOREBV
while RHR-L-P requires 201.
A comparison of the rescaling schemes R3–R5
If we consider the three rescaling schemes shown in Table 5.3, then clearly R4 is
the best choice. In this section, we give results comparing more of the rescaling
techniques in conjunction with Algorithm RHR-L-P. In Tables 5.4–5.5, we consider the choices R3–R5 for two sets of test problems from the CUTE collection.
The first set includes 26 problems whose names range in alphabetical order from
ARWHEAD to FLETCHCR. The second set includes problems from FMINSURF
to WOODS. The maximum number of iterations is increased to 3500 for this set
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
100
of results.
Table 5.4: Results for RHR-L-P using R3–R5 (m = 5) on Set # 1
Problem
n
R3
R4
R5
ARWHEAD 1000
17/22
17/22
17/22
BDQRTIC
100
129/177
148/163
127/189
BROYDN7D 1000 352/614
367/372
329/657
BRYBND
1000
41/52
30/36
39/53
CRAGGLVY 1000 100/138
108/116
91/162
DIXMAANA 1500
15/19
15/19
15/19
DIXMAANB 1500
18/22
18/22
18/22
DIXMAANC 1500
12/17
12/17
12/17
DIXMAAND 1500
24/28
21/25
23/27
DIXMAANE 1500 156/293
178/183
159/303
DIXMAANF 1500 175/320
147/152
165/306
DIXMAANG 1500 108/199
124/130
134/251
DIXMAANH 1500 268/442
243/254
256/542
DIXMAANI 1500 1534/3059 1352/1367 1153/2289
DIXMAANK 1500 136/257
129/136
153/299
DIXMAANL 1500 174/318
147/152
154/286
DQDRTIC
1000
7/10
12/15
7/10
DQRTIC
500
85/92
85/91
90/97
EIGENALS
110 667/1239
682/704
756/1439
EIGENBLS
110 1184/2321 329/343 1562/3434
EIGENCLS
462 3313/6810 2808/2843 3097/6804
ENGVAL1
1000
23/29
22/26
25/33
FLETCBV2 1000 1003/2007 952/967 1002/2005
FLETCBV3 1000 L(.1E+2)
L(.3E-1)
L(.1E+2)
FLETCHBV 100
L(.2E-1)
L(.2E+5)
L(.8E-1)
FLETCHCR 100
84/142
76/84
92/196
From the tables, it is clear that R4 is the best choice of the rescaling
parameter in practice. The number of function evaluations for R3 and R5 is
similar with slightly fewer required for option R3. Note that for many of the
problems option R5 requires roughly twice as many function evaluations as iterations. This behavior results because the rescaling parameter σ̄ = µ results in a
search direction of larger norm than is acceptable by the line search. (In the case
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
101
Table 5.5: Results for RHR-L-P using R3–R5 (m = 5) on Set # 2
Problem
n
R3
R4
R5
FMINSURF 1024 180/337
229/233
214/470
FREUROTH 1000 L(.3E-3)
L(.8E-4)
L(.1E-3)
GENROSE
500 1150/2097 1136/1211 1417/3361
LIARWHD
1000
20/26
20/26
20/26
MANCINO
100
L(.2E-4)
L(.2E-4)
L(.2E-4)
MOREBV
1000
96/179
81/82
95/187
NONDIA
1000
9/21
9/21
9/21
NONDQUAR 100 943/1445
794/848 1095/1848
PENALTY1 1000
59/71
59/71
59/71
PENALTY2
100
107/135
125/133
113/161
PENALTY3
100
L(.2E-1)
L(.8E-2)
L(.8E-2)
POWELLSG 1000
41/46
44/50
41/46
POWER
1000 149/223
177/184
152/253
QUARTC
1000
90/98
87/93
90/97
SINQUAD
1000 143/189
142/187
130/173
SROSENBR 1000
18/24
18/24
18/24
TOINTGOR
50
134/220
157/161
137/256
TOINTGSS 1000
5/8
5/8
5/8
TOINTPSP
50
123/193
133/153
118/227
TOINTQOR
50
41/57
43/45
51/74
TQUARTIC 1000
20/26
20/26
20/26
TRIDIA
1000 398/796
903/924
391/783
VARDIM
100
36/44
36/44
36/44
VAREIGVL 1000
91/152
90/95
92/165
WOODS
1000
43/49
45/51
43/49
of a quadratic with exact line search, the length of the search direction varies as
the inverse of σ̄.) For this choice, the line search is forced to interpolate nearly
every iteration.
Results for Algorithm RHR-L-P using different m
Results are given in Table 5.6 for Algorithm RHR-L-P (with R4) using different
values of m.
When m = 2, the search direction is a linear combination of two vectors
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
Table 5.6:
Problem
Name
n
BDQRTIC
100
CRAGGLVY 1000
DIXMAANA 1500
DIXMAANB 1500
DIXMAANE 1500
EIGENALS
110
GENROSE
500
MOREBV
1000
PENALTY1 1000
QUARTC
1000
SINQUAD
1000
102
RHR-L-P using different m with R4
Algorithm RHR-L-G
m=2
m=5
m = 10
m = 15
216/250
148/163
109/118
104/115
127/134
108/116
118/126
124/132
13/17
15/19
15/19
15/19
14/18
18/22
18/22
18/22
222/226
178/183
190/194
193/197
752/765
682/704
682/711
389/409
1776/1804 1136/1211 1106/1254 1133/1295
132/134
81/83
70/72
73/75
59/71
59/71
59/71
59/71
55/61
87/93
139/145
212/218
376/435
142/187
142/187
142/187
as it is for conjugate-gradient methods. We note that the choice m = 5 gives
better results than m = 2 on most of the problems. The choice of m that
minimizes the number of function evaluations varies. For example, the number
of function evaluations for CRAGGLVY, DIXMAANE and GENROSE is least
for m = 5. For several of the problems, for example BDQRTIC and EIGENALS,
the fewest number of function evaluations is required for m = 15.
For some of the examples, the smallest number of functions evaluations
is required for an intermediate value of m. For example, the smallest number of
function evaluations for CRAGGLVY, DIXMAANE and GENROSE occurs for
m = 5. The function evaluations might be expected decrease for values of m
greater than 15, especially as m is chosen closer to n. The variation in function
evaluations for values of m ranging from 2 to n is given in Table 5.7 for four
problems. The first three are the calculus of variation problems (see Section
3.4.1). Problem 23 is a minimum-energy problem (see Siegel [46]). For this
table, the termination criterion is kgk k < 10−4 |f (xk )|. The maximum number of
iterations is 15, 000 and the notation “I” indicates that this limit was reached.
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
103
In this case, the number in parentheses gives the final ratio kgk k/|f (xk )|.
Table 5.7: RHR-L-P (R4) for
m
Problem 20 Problem 21
2
I(.4E-1)
3898/3903
3 13496/13906 2545/2619
5 11765/11896 1920/1947
7
I(.4E-3)
1792/1805
10 14496/14532 1992/1997
15 13152/13167 2315/2321
20 13672/13683 1712/1718
30 11042/11064 1426/1431
40 13775/13794 997/1002
50 14170/14189 1002/1007
70 13361/13373
510/516
100 11419/11432
315/326
150 5829/5302
808/821
200
854/865
464/475
m ranging from 2 to n
Problem 22 Problem 23
I(.6E-2)
269/281
I(.1E-2)
210/223
9027/9129
163/176
I(.5E-2)
181/194
7927/7953
192/213
14766/14789
228/266
953/957
225/247
1216/1218
254/281
744/746
311/340
451/453
316/341
313/316
296/320
207/210
374/398
190/193
437/461
190/193
656/680
For all four of the problems there is a slight “dip” in the plot of function
evaluations versus m. This dip occurs for m = 5, m = 7, m = 5 and m = 5 for
the respective problems. For Problems 20–22, the number of function evaluations
decreases dramatically for values of m closer to n. This is not the case for Problem
23 where the choice m = 5 results in the smallest number of function evaluations.
Algorithm RHR-L-P compared with L-BFGS-B
In this section, we compare Algorithm RHR-L-P (using rescaling option R4)
with L-BFGS-B. The L-BFGS-B method is run using the primal option (see Zhu
et al. [49] for an description of the three L-BFGS-B options). The line search
provided with L-BFGS-B is an implementation of the line search proposed by
Moré and Thuente [31]. The line search parameters ν = 10−4 and η = .9 are
used in L-BFGS-B as suggested by Zhu et al. (these are the same as those used
for RHR-L-P). The termination criterion is kgk k∞ < 10−5 .
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
104
The number of iterations and function evaluations required to solve the
problems in Set #1 and Set #2 is given in Tables 5.8–5.9. The notation “R” in
the tables indicates that L-BFGS-B did not meet the termination criterion, but
that when the iterations were halted, a secondary termination criterion had been
met. This criterion is given by (f (xk ) − f (xk+1 ))/ max(|f (xk )|, |f (xk )|, 1) ≤ CM ,
where C = 10−7 and M is the machine precision.
From the tables, we see that in terms of function evaluations Algorithm
RHR-L-P is comparable with L-BFGS-B. The number of problems on which
RHR-L-P requires fewer function evaluation is relatively low. However, we should
stress that the results are somewhat preliminary.
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
Table 5.8: Results for RHR-L-P and L-BFGS-B (m = 5) on Set #1
Problem
n
Algorithm RHR-L-P Algorithm L-BFGS-B
ARWHEAD 1000
17/22
11/13
BDQRTIC
100
148/163
86/101
BROYDN7D 1000
367/372
362/373
BRYBND
1000
30/36
29/31
CRAGGLVY 1000
108/116
87/95
DIXMAANA 1500
15/19
10/12
DIXMAANB 1500
18/22
10/12
DIXMAANC 1500
12/17
12/14
DIXMAAND 1500
21/25
14/16
DIXMAANE 1500
178/183
165/171
DIXMAANF 1500
147/152
153/160
DIXMAANG 1500
124/130
158/166
DIXMAANH 1500
243/254
152/157
DIXMAANI 1500
1352/1367
1170/1215
DIXMAANK 1500
129/136
135/139
DIXMAANL 1500
147/152
171/177
DQDRTIC
1000
12/15
13/19
DQRTIC
500
85/91
38/43
EIGENALS
110
682/704
541/574
EIGENBLS
110
329/343
1072/1116
EIGENCLS
462
2808/2843
2795/2900
ENGVAL1
1000
22/26
20/23
FLETCBV2 1000
952/967
490/505
FLETCBV3 1000
L(.3E-1)
R(.4E+1)
FLETCHBV 100
L(.2E+5)
R(.5E+0)
FLETCHCR 100
76/84
525/602
105
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
Table 5.9: Results for RHR-L-P and L-BFGS-B (m = 5) on Set #2
Problem
n
Algorithm RHR-L-P Algorithm L-BFGS-B
FMINSURF 1024
229/233
198/208
FREUROTH 1000
L(.8E-4)
R(.2E-4)
GENROSE
500
1136/1211
1086/1244
INDEF
1000
L(.1E+2)
R(.6E+1)
LIARWHD
1000
20/26
22/27
MANCINO
100
L(.2E-4)
11/15
MOREBV
1000
81/82
74/79
NONDIA
1000
9/21
19/23
NONDQUAR 100
794/848
907/1001
PENALTY1 1000
59/71
50/60
PENALTY2
100
125/133
69/74
PENALTY3
100
L(.8E-2)
L(.3E-2)
POWELLSG 1000
44/50
51/57
POWER
1000
177/184
131/136
QUARTC
1000
87/93
41/47
SINQUAD
1000
142/187
150/207
SROSENBR 1000
18/24
17/20
TOINTGOR
50
157/161
122/134
TOINTGSS 1000
5/8
14/20
TOINTPSP
50
133/153
105/129
TOINTQOR
50
43/45
38/42
TQUARTIC 1000
20/26
21/27
TRIDIA
1000
903/924
675/705
VARDIM
100
36/44
36/37
VAREIGVL 1000
90/95
122/130
WOODS
1000
45/51
28/31
106
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
5.4
107
Algorithm RHR-L-P applied to quadratics
We now show that Algorithm RHR-L-P has the quadratic termination property.
For use in the proof of quadratic termination, we define
γk = (sTk yk )1/2
Note that this definition of γk differs from that given in (3.7).
Theorem 5.3 Consider Algorithm RHR-L-P implemented with exact line search
and σ0 = 1. If this algorithm is applied to the strictly convex quadratic function
(1.17), then RZ is upper bidiagonal. At the start of iteration k (0 ≤ k ≤ m − 1),
RZ satisfies
kg0 k
kg1 k
−
0

γ0
 γ0


kg2 k
kg1 k

−


γ1
γ1

RZ = 
.

..









···
0
...
..
.
kgk−1 k
γk−1




..

.





0 

kgk k 

−

γk−1 

1/2
σk
At the start of iteration k (k ≥ m), RZ satisfies
kgl+1 k
−kgl k2
−
0

γl
 σl kpl kγl


kgl+1 k
kgl+2 k


−

γl+1
γl+1

RZ = 
...










···
...
...
kgk−1 k
γk−1
0




..


.


,

0 

kgk k 

−

γk−1 

1/2
σk
where l = k − m + 1. The matrices Zk and Tk satisfy
Zk =
g0
kg0 k
gk g1
···
kg1 k
kgk k
and
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
kg1 k2
kg2 k2
kgk−1 k2
−
··· −
 −kg0 k −

σ1 kg0 k
σ2 kg0 k
σk−1 kg0 k


2
kg1 k
kg2 k
kgk−1 k2


−
−
·
·
·
−

σ1
σ2 kg1 k
σk−1 kg1 k



kg2 k
..

−
.

σ2
= 


kgk−1 k2

..

.
−

σk−1 kgk−2 k



kgk−1 k

−


σk−1



Tk
108
0 

0
..
.
0
0
kgk k























at the start of iteration k (0 ≤ k ≤ m − 1). At the start of iteration k (k ≥ m),
these two matrices satisfy
Zk =
pl
kpl k
gl+1
kgl+1 k
gk gl+2
···
kgl+2 k
kgk k
and
kgl+1 k2
kgl+2 k2
kgk−1 k2
kp
k
kp
k
·
·
·
kpl k
 kpl k
l
l

kgl k2
kgl k2
kgl k2


kgl+2 k2
kgk−1 k2


−kg
k
−
·
·
·
−
l+1

kgl+1 k
kgl+1 k



..

−kgl+2 k
.


Tk = Ck 











..
.

0 
0
..
.
kgk−1 k2
kgk−2 k
0
−kgk−1 k
0
−
kgk k











 Dk ,











−1
−1
where Ck = diag(σl , Im−1 ) and Dk = diag(σl−1 , σl+1
, . . . , σk−1
, 1). Furthermore,
the search direction is given by equation (3.15) of Theorem 3.1, i.e.,
pk =





−gk ,
1




σk
if k = 0;
kgk k2
pk−1 − gk ,
kgk−1 k2
(5.12)
!
σk−1
otherwise.
Proof. The form of Z, RZ and the search directions is already established for
iterations k (0 ≤ k ≤ m − 1) by Theorem 3.1 since Algorithm RHR-L-P and
Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
109
Algorithm RHRL are equivalent for the first m − 1 iterations. Since the first
m − 1 search directions are parallel to the conjugate-gradient directions, the
first m gradients are mutually orthogonal and accepted. Moreover, the rank is
rk = k + 1. The search directions p0 , p1 , . . ., pm−2 replace the corresponding
gradients during iterations k (0 ≤ k ≤ m − 2). The form of Tk at the start of
iterations k (0 ≤ k ≤ m − 1) follows from the form of pZ (3.16, p. 59).
The value of m is assumed to be 3 for the remainder of the argument.
The key ideas of the proof are illustrated using this value. At the start of iteration
m − 1, the forms of Zk , RZ and Tk are given by
kg0 k
kg1 k
−
0
 γ
γ0

0


kg2 k
kg1 k
RZ = 
−
 0

γ1
γ1


Zk =
g1
kg1 k
g0
kg0 k
g2 ,
kg2 k
0
0
1/2









σ2
and
kg1 k2
 −kg0 k −

σ1 kg0 k


kg1 k
= 

0
−

σ1


Tk

0
0

0 

0
kg2 k


.




The rank satisfies r2 = 3. The form of p2 is given by (5.12) since the algorithm
is identical Algorithm RHRL until the end of iteration m − 1. Since g2 has been
accepted, g2 is replaced by p2 . The form of pZ given by (3.16) implies that T̆k
satisfies

kg2 k2
kg1 k2
−
σ1 kg0 k
σ2 kg0 k
kg1 k
kg2 k2
−
−
σ1
σ2 kg1 k
kg2 k
0
−
σ2
 −kg0 k −




T̆k = 




0
0






.




Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
110
The gradient g3 is orthogonal to g0 , g1 and g2 and is accepted. The rank satisfies
r̄2 = 4 and the updates to Zk , RZ and T̆k satisfy
Z̄ k =

g3 ,
kg3 k
Zk

T̆k
0 
T̄k = 
0 kg3 k
kg0 k
kg1 k
−
0
0

 γ0
γ
0


kg1 k
kg2 k

 0
−
0
b =
γ
γ
1
1

R
Z̄


kg2 k
kg3 k
 0
0
−

γ2
γ2











,






0
0
0
1/2
σ3
and
respectively.
As r̄2 > m, the algorithm performs the first discard procedure. Since
kp1 k is given by the norm of the second column of T̄k , the rotation P12 satisfies

P12
kg1 k2
−
 σ1 kg0 kkp1 k



 − kg1 k
=
σ1 kp1 k






kg1 k
−
0
σ1 kp1 k
kg1 k2
0
σ1 kg0 kkp1 k
0
0
0
0
0




0
.



0

1
0
1
In the remainder of the proof, we will use the symbol “×” to denote a value that
will be discarded. A short computation gives
σ1 kg2 k2
kp1 k
 × kp1 k

σ2 kg1 k2



×

P12 T̄k = 


 0



0
0
0
0
−

0 

0
0
kg2 k
σ2
0
0
kg3 k




.






Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
111
Since the second row of P12 T̄k is a multiple of eT1 , P23 and P34 are permutations.
Hence, Tek satisfies
σ1 kg2 k2
 kp1 k
kp1 k

σ2 kg1 k2


kg2 k
Tek = 
 0
−

σ2




0
0 

0
kg3 k
0


.




The matrix Sk is defined by Sk = P34 P23 P12 and it is easily verified that
Zek = Z̄ k SkT E3 =
p1
kp1 k
g2
kg2 k
g3 .
kg3 k
0
0
Another short computation gives

b P
R
Z̄ 12
0



kg1 k2

−
 σ γ kp k
1 1
1
=



0



0
×




0 

.

kg3 k 

−

γ2 

kg2 k
γ1
kg2 k
γ2
× −
0
0

1/2
0
σ3
Hence, the rotation P˜12 is a permutation, as are P˜23 and P˜34 . The matrix Sek is
b given by the leading 3 × 3 block of
defined by Sek = P˜34 P˜23 P˜12 . The update R
Z
e
b S satisfies
Sek R
Z̄ k
kg1 k2
kg2 k
−
−
0
 σ1 γ1 kp1 k
γ1


b =
kg2 k
kg3 k
R

Z
e
0
−

γ2
γ2








.




0
0
1/2
σ3
Following the drop-off procedure, the rank satisfies r3 = 3.
We have shown that Zk , Tk and RZ have the required structure at the
start of iteration m. Now assume that they have this structure at the start of
iteration k (k ≥ m). Moreover, assume that the first k search directions satisfy
(5.12). Hence,
\[
Z_k = \begin{pmatrix} \frac{p_{k-2}}{\|p_{k-2}\|} & \frac{g_{k-1}}{\|g_{k-1}\|} & \frac{g_k}{\|g_k\|} \end{pmatrix},
\qquad
T_k = \begin{pmatrix}
\|p_{k-2}\| & \frac{\sigma_{k-2}\|g_{k-1}\|^2}{\sigma_{k-1}\|g_{k-2}\|^2}\,\|p_{k-2}\| & 0 \\
0 & -\frac{\|g_{k-1}\|}{\sigma_{k-1}} & 0 \\
0 & 0 & \|g_k\|
\end{pmatrix}
\]
and
\[
R_Z = \begin{pmatrix}
-\frac{\|g_{k-2}\|^2}{\sigma_{k-2}\gamma_{k-2}\|p_{k-2}\|} & -\frac{\|g_{k-1}\|}{\gamma_{k-2}} & 0 \\
0 & \frac{\|g_{k-1}\|}{\gamma_{k-1}} & -\frac{\|g_k\|}{\gamma_{k-1}} \\
0 & 0 & \sigma_k^{1/2}
\end{pmatrix}.
\]
The rank satisfies rk = 3.
Since g_Z = (0, 0, ‖gk‖)^T, the equations R_Z^T t_Z = −g_Z and R_Z p_Z = t_Z give
\[
t_Z = \begin{pmatrix} 0 \\ 0 \\ -\frac{\|g_k\|}{\sigma_k^{1/2}} \end{pmatrix}
\qquad\text{and}\qquad
p_Z = \frac{1}{\sigma_k}\begin{pmatrix} \frac{\sigma_{k-2}\|g_k\|^2}{\|g_{k-2}\|^2}\,\|p_{k-2}\| \\[4pt] -\frac{\|g_k\|^2}{\|g_{k-1}\|} \\[4pt] -\|g_k\| \end{pmatrix}.
\]
Using the form of Zk and the form of pk−1 given by (5.12), pk satisfies
\[
p_k = \frac{1}{\sigma_k}\left( \sigma_{k-1}\,\frac{\|g_k\|^2}{\|g_{k-1}\|^2}\,p_{k-1} - g_k \right),
\]
as required. Since gk has been accepted, pk is exchanged for gk, giving
\[
\breve T_k = \begin{pmatrix}
\|p_{k-2}\| & \frac{\sigma_{k-2}\|g_{k-1}\|^2}{\sigma_{k-1}\|g_{k-2}\|^2}\,\|p_{k-2}\| & \frac{\sigma_{k-2}\|g_k\|^2}{\sigma_k\|g_{k-2}\|^2}\,\|p_{k-2}\| \\
0 & -\frac{\|g_{k-1}\|}{\sigma_{k-1}} & -\frac{1}{\sigma_k}\frac{\|g_k\|^2}{\|g_{k-1}\|} \\
0 & 0 & -\frac{\|g_k\|}{\sigma_k}
\end{pmatrix}.
\]
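The third column of T̆k is just the coordinate vector of pk with respect to Zk; a brief check (supplied here), using the coordinates of pk−1 from the second column of Tk, is
\[
\frac{\sigma_{k-1}\|g_k\|^2}{\sigma_k\|g_{k-1}\|^2}
\begin{pmatrix} \frac{\sigma_{k-2}\|g_{k-1}\|^2}{\sigma_{k-1}\|g_{k-2}\|^2}\,\|p_{k-2}\| \\[4pt] -\frac{\|g_{k-1}\|}{\sigma_{k-1}} \\[2pt] 0 \end{pmatrix}
-\frac{1}{\sigma_k}\begin{pmatrix} 0 \\ 0 \\ \|g_k\| \end{pmatrix}
=
\begin{pmatrix} \frac{\sigma_{k-2}\|g_k\|^2}{\sigma_k\|g_{k-2}\|^2}\,\|p_{k-2}\| \\[4pt] -\frac{1}{\sigma_k}\frac{\|g_k\|^2}{\|g_{k-1}\|} \\[2pt] -\frac{\|g_k\|}{\sigma_k} \end{pmatrix}.
\]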
Since pk is parallel to the (k+1)th conjugate-gradient direction, gk+1 is orthogonal
to the previous gradients (and pk−2 ) and is accepted. The updated rank satisfies
r̄k = 4. The corresponding updates satisfy
\[
\bar Z_k = \begin{pmatrix} Z_k & \frac{g_{k+1}}{\|g_{k+1}\|} \end{pmatrix},
\qquad
\bar T_k = \begin{pmatrix} \breve T_k & 0 \\ 0 & \|g_{k+1}\| \end{pmatrix}
\qquad\text{and}\qquad
R_{\bar Z} = \begin{pmatrix} R_Z & 0 \\ 0 & \sigma_k^{1/2} \end{pmatrix}.
\]
The matrix R̄Z̄ is obtained from RZ̄ in the manner described in Theorem 3.1.
The rescaled matrix R̂_Z̄ satisfies
\[
\hat R_{\bar Z} = \begin{pmatrix}
-\frac{\|g_{k-2}\|^2}{\sigma_{k-2}\gamma_{k-2}\|p_{k-2}\|} & -\frac{\|g_{k-1}\|}{\gamma_{k-2}} & 0 & 0 \\
0 & \frac{\|g_{k-1}\|}{\gamma_{k-1}} & -\frac{\|g_k\|}{\gamma_{k-1}} & 0 \\
0 & 0 & \frac{\|g_k\|}{\gamma_k} & -\frac{\|g_{k+1}\|}{\gamma_k} \\
0 & 0 & 0 & \sigma_{k+1}^{1/2}
\end{pmatrix}.
\]
Since r̄k > m, the algorithm executes the drop-off procedure. The
remainder of the proof is similar to that given above for the drop-off at the end
of iteration m − 1 and is omitted.
Chapter 6

Reduced-Hessian Methods for Linearly-Constrained Problems

6.1 Linearly constrained optimization
In this section we consider the linear equality constrained problem (LEP)
\[
\underset{x\in\mathbb{R}^n}{\text{minimize}}\;\; f(x) \quad\text{subject to}\quad Ax = b,
\tag{6.1}
\]
where rank(A) = m_L and A ∈ IR^{m_L×n}. The assumption of full rank is included only to simplify the discussion; the methods proposed here do not require this assumption.
A point x ∈ IRn satisfying Ax = b is said to be feasible. Since A
has full row rank, the existence of feasible points is guaranteed. (If A were rank deficient, then the existence of feasible points would require that b ∈ range(A).)
Standard methods can be used to determine a particular feasible point (e.g., see
Gill et al. [23, p. 316]).
Two first-order necessary conditions for optimality hold at a minimizer
x∗. The first is that x∗ be feasible. The second requires the existence of λ∗ ∈ IR^{m_L}
such that
\[
A^T \lambda^* = \nabla f(x^*).
\tag{6.2}
\]
The components of λ∗ are often called Lagrange multipliers. If n_L denotes n − m_L, let N ∈ IR^{n×n_L} denote a full-rank matrix whose columns form a basis for null(A). Condition (6.2) is equivalent to the condition
\[
N^T \nabla f(x^*) = 0
\tag{6.3}
\]
(see Gill et al. [22, pp. 69–70]). The quantity N^T∇f(x) is often called a reduced gradient. In methods for solving (6.1) that utilize a representation for null(A), equation (6.3) gives a simple method for verifying first-order optimality.
The second-order necessary conditions for optimality at x∗ are that Ax∗ = b, N^T∇f(x∗) = 0 and that the reduced Hessian N^T∇²f(x∗)N is positive semidefinite. Sufficient conditions for optimality at a point x∗ are that the first-order conditions hold and that N^T∇²f(x∗)N is positive definite.
In what follows, we will assume that an initial feasible iterate x0 is
known. Since the constraints for LEP are linear, it is simple to enforce feasibility
of all the iterates. Let x denote a feasible iterate satisfying Ax = b and let
x̄ = x + αp. If
\[
p = N p_N, \quad\text{where } p_N \in \mathbb{R}^{n_L},
\tag{6.4}
\]
then x̄ is feasible for all α. Furthermore, it is easily shown that x̄ is feasible only if p = N p_N for some p_N ∈ IR^{n_L}. To see this, note that any given direction p can be written as p = N p_N + A^T p_R. Since x̄ is feasible, it follows that Ax̄ = A(x + αp) = Ax + αAp = b + αAp = b, which implies that Ap = 0. It follows that p ∈ null(A), i.e., p = N p_N for some p_N. A vector p satisfying (6.4) is called
a feasible direction. Because of the feasibility requirement, the subproblem (1.7) used to define p in the unconstrained case is replaced by
\[
\underset{p\in\mathbb{R}^n}{\text{minimize}}\;\; f(x_k) + g^T p + \tfrac12 p^T B p
\quad\text{subject to}\quad Ap = 0.
\tag{6.5}
\]
This subproblem is an equality-constrained quadratic program (EQP). If B is positive definite, then the solution of the EQP is given by
\[
p = N p_N, \quad\text{where } p_N = -(N^T B N)^{-1} N^T g,
\tag{6.6}
\]
which is of the form (6.4).
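As an illustration of (6.6), the following minimal sketch computes the EQP step from a basis N for null(A); the function name and the NumPy setting are illustrative assumptions and not the dissertation's implementation.

    import numpy as np

    # Minimal sketch of (6.6): N is an n x n_L basis for null(A), B is a
    # symmetric positive-definite approximate Hessian, g is the gradient.
    def eqp_step(N, B, g):
        BN = N.T @ B @ N                 # reduced Hessian  N^T B N
        gN = N.T @ g                     # reduced gradient N^T g
        pN = -np.linalg.solve(BN, gN)
        return N @ pN                    # p = N p_N lies in null(A), so A p = 0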
For the remainder of the chapter, the columns of N are assumed to be orthonormal. Let Q denote an orthogonal matrix of the form Q = ( N  Y ), where Y ∈ IR^{n×m_L}. Consider the transformed Hessian
\[
Q^T B Q = \begin{pmatrix} N^T B N & N^T B Y \\ Y^T B N & Y^T B Y \end{pmatrix}.
\]
If all of the search directions satisfy the EQP (6.5), only the reduced Hessian N^T B N is needed, i.e., no information about the transformed Hessian corresponding to Y is required. Hence, we consider quasi-Newton methods for solving LEP that store only N^T B N.
The question naturally arises as to whether the reduced Hessian can be updated (e.g., using the BFGS update) without knowledge of the entire transformed Hessian. In the unconstrained case, we have seen that B can be block-diagonalized using a certain choice of Q. The Broyden update to the transformed Hessian is completely defined by the corresponding update to the (possibly much smaller) reduced Hessian. In the linearly constrained case, Q^T B Q is generally dense as long as N is chosen so that range(N) = null(A). It is not possible in this case to define the updated matrix Q^T B̄ Q (corresponding to a fixed Q) if only N^T B N is known. However, it can be shown that N^T B̄ N can be obtained from N^T B N without knowledge of N^T B Y or Y^T B Y. The update to N^T B N is obtained by way of an update from the Broyden class using the reduced quantities s_N = N^T s and y_N = N^T y in place of s and y respectively.
Based on the above discussion, a method for solving LEP is presented below. The matrix R_N is the Cholesky factor of N^T B N and is used to solve for p_N according to (6.6).
Algorithm 6.1. Quasi-Newton method for LEP
Initialize k = 0; Obtain x0 such that Ax0 = b;
Initialize N so that range(N) = null(A) and N^T N = I_{n_L};
Initialize R_N = σ^{1/2} I_{n_L};
while not converged do
   Solve R_N^T t_N = −g_N, R_N p_N = t_N, and set p = N p_N;
   Compute α so that s^T y > 0 and set x̄ = x + αp;
   Compute s_N = α p_N and y_N = ḡ_N − g_N;
   Compute R̄_N = Broyden(R_N, s_N, y_N);
end do
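The sketch below illustrates the flow of Algorithm 6.1 under simplifying assumptions that are not part of the dissertation: the reduced Hessian N^T B N is stored and updated directly with the BFGS formula rather than through its Cholesky factor R_N, and a crude backtracking line search stands in for one that enforces s^T y > 0. All names are illustrative.

    import numpy as np

    # f, grad: objective and gradient; x0 is feasible (A x0 = b); the
    # columns of N form an orthonormal basis for null(A).
    def qn_lep_sketch(f, grad, x0, N, sigma=1.0, tol=1e-6, max_iter=200):
        x = x0.copy()
        BN = sigma * np.eye(N.shape[1])          # approximation to N^T B N
        for _ in range(max_iter):
            g = grad(x)
            gN = N.T @ g                         # reduced gradient
            if np.linalg.norm(gN) <= tol:
                break
            p = N @ (-np.linalg.solve(BN, gN))   # feasible search direction
            alpha = 1.0
            while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p) and alpha > 1e-12:
                alpha *= 0.5                     # backtracking on sufficient decrease
            x_new = x + alpha * p
            sN = alpha * (N.T @ p)               # equals alpha * p_N
            yN = N.T @ (grad(x_new) - g)
            if sN @ yN > 0:                      # BFGS update of the reduced Hessian
                BNs = BN @ sN
                BN = BN - np.outer(BNs, BNs) / (sN @ BNs) + np.outer(yN, yN) / (sN @ yN)
            x = x_new
        return x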
Note that N is fixed in Algorithm 6.1. The matrix N can be obtained in several ways. For example, N can be obtained from a QR factorization of A. If B denotes a nonsingular matrix whose columns are taken from A, then N can also be obtained by applying a Gram-Schmidt QR factorization to the columns of a variable-reduction form of null(A) (see Murtagh and Saunders [32]). This factorization can be stably achieved using either reorthogonalization or modified Gram-Schmidt (see Golub and Van Loan [26, pp. 218–220]).
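One way to realize the QR route mentioned above is sketched here (an assumption for illustration, not the dissertation's code): a complete QR factorization of A^T gives an n × n orthogonal Q whose last n − m_L columns are orthogonal to the rows of A and therefore form an orthonormal basis for null(A).

    import numpy as np

    def nullspace_basis(A):
        m_L, n = A.shape
        Q, _ = np.linalg.qr(A.T, mode="complete")
        return Q[:, m_L:]                        # orthonormal basis for null(A)

    A = np.random.default_rng(1).standard_normal((5, 12))
    N = nullspace_basis(A)
    assert np.allclose(A @ N, 0.0)               # columns lie in null(A)
    assert np.allclose(N.T @ N, np.eye(12 - 5))  # columns are orthonormal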
6.2 A dynamic null-space method for LEP
Many quasi-Newton methods for solving LEP utilize a fixed representation, N ,
for null(A). In this section, a new method is given for solving LEP that employs
a dynamic choice of N . Since N is a matrix with orthonormal columns, the
method is only practical for small problems or when the number of constraints
is close to n. In the case when the number of constraints is small, an alternative
range-space method can be used in conjunction with the techniques for large-scale optimization discussed in Chapter 5. This method is the subject of current
research and will not be discussed further.
The freedom to vary N stems from the invariance of the search direction (6.6) with respect to N as long as the columns of N form a basis for null(A). To see this, suppose that Ñ is another matrix with orthonormal columns such that range(Ñ) = range(N). In this case, there exists an orthogonal matrix M such that Ñ = N M. The search direction p = −N(N^T B N)^{-1} N^T g satisfies
\[
p = -\widetilde N M^T (M \widetilde N^T B \widetilde N M^T)^{-1} M \widetilde N^T g
  = -\widetilde N (\widetilde N^T B \widetilde N)^{-1} \widetilde N^T g
\]
in terms of Ñ. We now consider a choice for Ñ that will induce special structure in the reduced Hessian.
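A quick numerical check of this invariance is sketched below: the direction p = −N(N^T B N)^{-1} N^T g is unchanged when N is replaced by Ñ = N M for an orthogonal M. All of the data here are synthetic and illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    n, m_L = 9, 3
    A = rng.standard_normal((m_L, n))
    Q, _ = np.linalg.qr(A.T, mode="complete")
    N = Q[:, m_L:]                                   # orthonormal basis of null(A)
    C = rng.standard_normal((n, n)); B = C @ C.T + n * np.eye(n)
    g = rng.standard_normal(n)
    M, _ = np.linalg.qr(rng.standard_normal((n - m_L, n - m_L)))   # random orthogonal M

    def step(Nmat):
        return -Nmat @ np.linalg.solve(Nmat.T @ B @ Nmat, Nmat.T @ g)

    assert np.allclose(step(N), step(N @ M))         # same search direction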
In the unconstrained case, the quasi-Newton search direction satisfies
p = ZpZ , where range(Z) = span{g0 , g1 , . . . , gk }. If Z has orthonormal columns,
then the orthogonal matrix Q = ( Z
W ) induces block-diagonal structure
in the transformed Hessian QTBQ. In Algorithm RHR (p. 52), this structure
facilitates a rescaling scheme that reinitializes the approximate curvature each
iteration that ḡ is accepted.
In the linearly-constrained case, if Q = ( N
Y ), the transformed
Hessian QTBQ generally has no special structure. In actuality, we are concerned
with the structure of N^T B N since the transformed Hessian corresponding to Y is not used to define p. Let g_{N_i} denote the reduced gradient N^T g_i (0 ≤ i ≤ k). Note that N g_{N_i} is the component of g_i in null(A). Let Z denote a matrix with orthonormal columns satisfying
\[
\mathrm{range}(Z) = \mathrm{span}\{ N g_{N_0}, N g_{N_1}, \ldots, N g_{N_k} \}.
\tag{6.7}
\]
Let r denote the column dimension of Z and note that r ≤ n_L since range(Z) ⊆ range(N). There exists an n_L × r matrix M_1 with orthonormal columns such that Z = N M_1. Let M = ( M_1  M_2 ) denote an orthogonal matrix with M_1 as its first block. If Ñ is defined as Ñ = N M, then range(Ñ) = range(N). Let W = N M_2 and note that the partition of M implies that
\[
\widetilde N = N M = (\, N M_1 \;\; N M_2 \,) = (\, Z \;\; W \,).
\]
By construction of Z and W, it follows that if w ∈ range(W), then w^T g_i = 0 for 0 ≤ i ≤ k. Hence if B_0 = σI, Lemma 2.3 implies that
\[
\widetilde N^T B \widetilde N = \begin{pmatrix} Z^T B Z & 0 \\ 0 & \sigma I_{n_L - r} \end{pmatrix}
\qquad\text{and}\qquad
\widetilde N^T g = \begin{pmatrix} g_Z \\ 0 \end{pmatrix},
\tag{6.8}
\]
where g_Z = Z^T g.
We have defined Ñ so that it induces block-diagonal structure in Ñ^T B Ñ. If the quantities (6.8) are substituted into the equation for p (6.6), it is easy to see that p = −Z(Z^T B Z)^{-1} g_Z. This search direction can be computed using a Cholesky factor R_Z of Z^T B Z.
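Spelling out the substitution (supplied here for clarity):
\[
p = -\widetilde N(\widetilde N^T B \widetilde N)^{-1}\widetilde N^T g
  = -(\, Z \;\; W \,)\begin{pmatrix} (Z^T B Z)^{-1} & 0 \\ 0 & \sigma^{-1} I_{n_L-r}\end{pmatrix}\begin{pmatrix} g_Z \\ 0\end{pmatrix}
  = -Z(Z^T B Z)^{-1} g_Z .
\]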
Comparison with the unconstrained case
The choice of Z and W discussed here has some similarities and differences with
the definitions of Z and W in the unconstrained case. The subspace range(Z) is
associated with gradient information in both cases, although in the linearly constrained case Z defines gradient information in null(A). Elements from range(W )
are orthogonal to the first k + 1 gradients in both cases. Also in both cases, the
column dimension of Z is nondecreasing while that of W is nonincreasing. In
the unconstrained case, the column dimension of Z can become as large as n,
but in the linearly-constrained case, the column dimension can only reach nL .
Finally, the approximate curvature in the quadratic model along unit directions
in range(W ) is equal to σ in both cases.
Maintaining Z and W
At the start of iteration k, suppose that N is such that N = ( Z  W ), where Z satisfies (6.7). The component of ḡ in null(A) is given by
\[
N N^T \bar g = (\, Z \;\; W \,) \begin{pmatrix} \bar g_Z \\ \bar g_W \end{pmatrix},
\qquad\text{where } \bar g_Z = Z^T \bar g \text{ and } \bar g_W = W^T \bar g.
\]
If ḡ_W = 0, then N N^T ḡ ∈ range(Z) and the matrices Z and W remain fixed. If ḡ_W ≠ 0, then ḡ has a nonzero component in range(W). In this case, updates to both Z and W are required. The purpose of the updates is to ensure that Z̄ satisfies (6.7) with k advanced by one iteration. More specifically, the updates are defined so that if N̄ = ( Z̄  W̄ ), then
\[
\mathrm{range}(\bar N) = \mathrm{range}(N),
\qquad
\mathrm{range}(\bar Z) = \mathrm{range}(Z) \cup \mathrm{range}(N N^T \bar g)
\]
and N̄^T N̄ = I_{n_L}. In the implementation, we preassign a positive constant δ > 0 and require that ‖ḡ_W‖ > δ(‖ḡ_W‖² + ‖ḡ_Z‖²)^{1/2} for an update to be performed.
The updates to Z and W are similar to those defined in Section 2.5.1. Recall that symmetric Givens matrices can be used to define an orthogonal matrix S such that S ḡ_W = ‖ḡ_W‖ e_1. If N̄ is defined by N̄ = ( Z  W S^T ), then range(N̄) = range(N) and N̄^T N̄ = I_{n_L}. Moreover, the first column of W S^T satisfies W S^T e_1 = W ḡ_W / ‖ḡ_W‖, which is the normalized component of ḡ in range(W). Hence, we may define Z̄ = ( Z  W S^T e_1 ) and W̄ = ( W S^T e_2  W S^T e_3  · · ·  W S^T e_{n_L − r} ).
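A sketch of this basis update is given below. For simplicity it uses a single Householder reflection S (symmetric and orthogonal) in place of the product of symmetric Givens matrices described above; both map ḡ_W to a multiple of e_1. The function name and the NumPy setting are assumptions for illustration only.

    import numpy as np

    def update_basis(Z, W, gbar, delta=1e-10):
        gZ, gW = Z.T @ gbar, W.T @ gbar
        if np.linalg.norm(gW) <= delta * np.sqrt(gZ @ gZ + gW @ gW):
            return Z, W, False                   # gbar effectively lies in range(Z)
        v = gW.copy()
        v[0] -= np.linalg.norm(gW)               # reflector sending gW to ||gW|| e_1
        S = np.eye(len(gW)) if v @ v == 0 else np.eye(len(gW)) - 2.0 * np.outer(v, v) / (v @ v)
        WS = W @ S                               # S^T = S, so this is W S^T
        return np.hstack([Z, WS[:, :1]]), WS[:, 1:], True   # Zbar, Wbar, updated flag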
When the null-space basis is changed from N to N̄, some related quantities must also be altered. The first is the Cholesky factor R_Z. In Section 2.4 (p. 29), we define the effective approximate Hessian associated with a reduced-Hessian method. In the linearly constrained case, we define an effective reduced Hessian since the approximate Hessian corresponding to Y is not needed. The effective reduced Hessian is defined by
\[
N^T B^\delta N = \begin{pmatrix} R_Z^T R_Z & 0 \\ 0 & \sigma I_{n_L - r} \end{pmatrix}.
\]
If N̄ = N diag(I_r, S^T), the reduced Hessian corresponding to N̄ satisfies
\[
\bar N^T B^\delta \bar N
= \begin{pmatrix} I_r & 0 \\ 0 & S \end{pmatrix}
  \begin{pmatrix} R_Z^T R_Z & 0 \\ 0 & \sigma I_{n_L - r} \end{pmatrix}
  \begin{pmatrix} I_r & 0 \\ 0 & S^T \end{pmatrix}
= N^T B^\delta N.
\]
It follows that the Cholesky factor of Z̄^T B^δ Z̄ is given by R_Z̄ = diag(R_Z, σ^{1/2}). This definition of R_Z̄ is identical to that used in the unconstrained case.
The other quantities corresponding to Z̄ are g_Z̄, ḡ_Z̄ and s_Z̄. We use an approximation to g_Z̄ (as in the unconstrained case) given by g^δ_Z̄ = (g_Z, 0)^T. The vector ḡ_Z̄ satisfies
\[
\bar g_{\bar Z} = \bar Z^T \bar g = (\, Z \;\; W S^T e_1 \,)^T \bar g = \begin{pmatrix} \bar g_Z \\ \|\bar g_W\| \end{pmatrix}.
\]
Since p ∈ range(Z), s_Z̄ = α(p_Z, 0)^T. As in the unconstrained case, the vector y^δ_Z̄ = ḡ_Z̄ − g^δ_Z̄ is used in the Broyden update.
The following algorithm solves LEP using the choice of Z described
above.
Algorithm 6.2. Reduced Hessian method for LEP (RH-LEP)
Initialize k = 0, r = 1; Choose σ and δ;
Compute x0 satisfying Ax0 = b;
Compute N satisfying range(N) = null(A) and N^T N = I_{n_L};
Initialize R_Z = σ^{1/2};
Rotate N N^T g_0 into the first column of N and partition N = ( Z  W ), where Z ∈ IR^{n×1};
while not converged do
   Solve R_Z^T t_Z = −g_Z, R_Z p_Z = t_Z, and set p = Z p_Z;
   Compute α so that s^T y > 0 and set x̄ = x + αp;
   if ‖ḡ_W‖ > δ(‖ḡ_Z‖² + ‖ḡ_W‖²)^{1/2} then
      Update Z and W as described in this section;
      Set r̄ = r + 1;
   else Set r̄ = r; end if
   Compute s_Z̄ and y^δ_Z̄;
   Define R_Z̄ by
\[
R_{\bar Z} =
\begin{cases}
R_Z, & \text{if } \bar r = r; \\[4pt]
\begin{pmatrix} R_Z & 0 \\ 0 & \sigma^{1/2} \end{pmatrix}, & \text{otherwise;}
\end{cases}
\tag{6.9}
\]
   Compute R̄_Z̄ = Broyden(R_Z̄, s_Z̄, y^δ_Z̄);
end do
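The two triangular solves and the expansion (6.9) used in Algorithm 6.2 are simple to realize; a sketch is given below, where R is the current upper-triangular factor R_Z, gZ = Z^T g, and `expanded` records whether Z gained a column this iteration. Function names are illustrative assumptions.

    import numpy as np
    from scipy.linalg import solve_triangular

    def reduced_step(R, gZ):
        tZ = solve_triangular(R, -gZ, trans="T")     # R_Z^T t_Z = -g_Z
        return solve_triangular(R, tZ)               # R_Z p_Z = t_Z

    def expand_factor(R, sigma, expanded):
        if not expanded:
            return R                                  # (6.9), case rbar = r
        r = R.shape[0]
        Rbar = np.zeros((r + 1, r + 1))
        Rbar[:r, :r] = R
        Rbar[r, r] = np.sqrt(sigma)                   # new diagonal entry sigma^{1/2}
        return Rbar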
Rescaling R̄Z̄
If r̄ = r + 1, then the value σ^{1/2} in the (r̄, r̄) position of R_Z̄ is unaltered by the BFGS update. Hence, the last diagonal component of R̄_Z̄ can be reinitialized exactly as in Algorithm RHR. This leads to the definition of the rescaling algorithm for LEP given below. All steps are the same as in Algorithm RH-LEP except the last three.
Algorithm 6.3. Reduced Hessian rescaling method for LEP (RHR-LEP)
Compute R̄_Z̄ = BFGS(R_Z̄, s_Z̄, y^δ_Z̄);
Compute σ̄;
if r̄ = r + 1 and σ̄ < σ then
   Replace the (r̄, r̄) component of R̄_Z̄ with σ̄^{1/2};
end if
6.3 Numerical results
The test problems correspond to seven of the eighteen problems listed in Table
3.2 (p. 54). The constraints are randomly generated. The starting point x0 is the
closest point to the starting point given by Moré et al. [29] that satisfies Ax0 = b.
More specifically, x0 is the solution to
\[
\underset{x\in\mathbb{R}^n}{\text{minimize}}\;\; \tfrac12 \| x - x_{MGH} \|^2
\quad\text{subject to}\quad Ax = b,
\tag{6.10}
\]
where x_{MGH} is the starting point suggested by Moré et al.
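Since (6.10) is a projection onto the affine set {x : Ax = b}, it has the closed-form solution x0 = x_MGH − A^T(AA^T)^{-1}(A x_MGH − b) when A has full row rank. A small sketch (names and data are illustrative, not the dissertation's test code):

    import numpy as np

    def project_onto_constraints(A, b, x_mgh):
        lam = np.linalg.solve(A @ A.T, A @ x_mgh - b)   # multipliers of (6.10)
        return x_mgh - A.T @ lam                        # closest feasible point

    rng = np.random.default_rng(3)
    A = rng.standard_normal((5, 16)); b = rng.standard_normal(5)
    x0 = project_onto_constraints(A, b, rng.standard_normal(16))
    assert np.allclose(A @ x0, b)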
Numerical results given in Tables 6.1 and 6.2 compare Algorithm 6.1 with Algorithm RHR-LEP. Algorithm RHR-LEP is tested using the rescaling techniques R0 (no rescaling), R1, R4 and R5 (see Table 3.1, p. 53). Table 6.1 gives results for a random set of five linear constraints. The second table gives results for m_L = 8.
The algorithm is implemented in Matlab on a DEC 5000/240 workstation using the line search given by Fletcher [15, pp. 33–39]. The line search ensures that α meets the modified Wolfe conditions (1.16) and uses the step length of one whenever it satisfies these conditions. The step length parameters are ν = 10^{-4} and η = 0.9. The implementation uses δ = 10^{-10} with the stopping criterion ‖N^T g‖ < 10^{-6}. The numbers of iterations and function evaluations required to achieve the stopping criterion are given for each run. For example, the notation “26/30” indicates that 26 iterations and 30 function evaluations are required for convergence. The notation “L” indicates termination during the line search. In this case, the value in parentheses gives the final norm of the reduced gradient.

Table 6.1: Results for LEPs (m_L = 5, δ = 10^{-10}, ‖N^T g‖ ≤ 10^{-6})

  Problem          Alg. 6.1             Algorithm 6.3
  No.    n         σ = 1       R0         R1         R4         R5
  6      16        26/30       24/28      24/28      24/28      24/28
  7      12        39/50       39/50      75/79      67/70      54/57
  8      16        54/72       49/66      113/147    105/138    99/132
  9      16        33/49       33/49      51/55      28/32      27/31
  13     20        20/37       20/37      11/14      10/13      9/12
  14     14        45/79       45/79      53/61      48/54      48/56
  15     16        41/74       41/74      66/71      47/54      38/44

Table 6.2: Results for LEPs (m_L = 8, δ = 10^{-10}, ‖N^T g‖ ≤ 10^{-6})

  Problem          Alg. 6.1               Algorithm 6.3
  No.    n         σ = 1       R0           R1         R4         R5
  6      16        26/30       24/28        24/28      24/28      24/28
  7      12        15/21       15/21        22/25      22/25      21/24
  8      16        17/23       16/21        18/23      18/23      18/23
  9      16        16/46       16/46        20/25      16/21      16/21
  13     20        17/40       17/40        12/15      11/14      12/16
  14     14        18/40       L(1.2E-6)    25/30      19/24      L(1.1E-6)
  15     16        19/41       19/41        30/35      23/28      19/25
Bibliography
[1] I. Bongartz, A. R. Conn, N. I. M. Gould, and P. L. Toint,
CUTE: Constrained and unconstrained testing environment, Report 93/10,
Département de Mathématique, Facultés Universitaires de Namur, 1993.
[2] K. W. Brodlie, An assessment of two approaches to variable metric methods, Math. Prog., 12 (1977), pp. 344–355.
[3] K. W. Brodlie, A. R. Gourlay, and J. Greenstadt, Rank-one and
rank-two corrections to positive definite matrices expressed in product form,
Journal of the Institute of Mathematics and its Applications, 11 (1973),
pp. 73–82.
[4] A. G. Buckley, A combined conjugate-gradient quasi-Newton minimization algorithm, Math. Prog., 15 (1978), pp. 200–210.
[5] A. G. Buckley, Extending the relationship between the conjugate-gradient and BFGS algorithms, Math. Prog., 15 (1978), pp. 343–348.
[6] R. H. Byrd, J. Nocedal, and Y.-X. Yuan, Global convergence of a
class of quasi-Newton methods on convex problems, SIAM J. Numer. Anal.,
24 (1987), pp. 1171–1190.
[7] M. Contreras and R. A. Tapia, Sizing the BFGS and DFP updates:
Numerical study, J. Optim. Theory and Applics., 78 (1993), pp. 93–108.
[8] J. W. Daniel, W. B. Gragg, L. Kaufman, and G. W. Stewart,
Reorthogonalization and stable algorithms for updating the Gram-Schmidt
QR factorization, Math. Comput., 30 (1976), pp. 772–795.
[9] W. C. Davidon, Variable metric methods for minimization, A.E.C. Research and Development Report ANL-5990, Argonne National Laboratory, 1959.
[10] J. E. Dennis, Jr. and R. B. Schnabel, A new derivation of symmetric
positive definite secant updates, in Nonlinear programming, 4 (Proc. Sympos., Special Interest Group on Math. Programming, Univ. Wisconsin, Madison, Wis., 1980), Academic Press, New York, 1981, pp. 167–199. ISBN
0-12-468662-1.
[11] J. E. Dennis Jr. and J. J. Moré, A characterization of superlinear
convergence and its application to quasi-Newton methods, Math. Comput.,
28 (1974), pp. 549–560.
[12] J. E. Dennis, Jr. and J. J. Moré, Quasi-Newton methods, motivation and theory, SIAM Review, 19 (1977), pp. 46–89.
[13] J. E. Dennis, Jr. and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983.
[14] M. C. Fenelon, Preconditioned Conjugate-Gradient-Type Methods for
Large-Scale Unconstrained Optimization, PhD thesis, Department of Operations Research, Stanford University, Stanford, CA, 1981.
[15] R. Fletcher, Practical Methods of Optimization, John Wiley and Sons,
Chichester, New York, Brisbane, Toronto and Singapore, second ed., 1987.
ISBN 0471915475.
[16] R. Fletcher, An overview of unconstrained optimization, Report NA/149, Department of Mathematics and Computer Science, University of Dundee, June 1993.
[17] P. E. Gill, G. H. Golub, W. Murray, and M. A. Saunders, Methods
for modifying matrix factorizations, Math. Comput., 28 (1974), pp. 505–535.
[18] P. E. Gill and W. Murray, The numerical solution of a problem in the
calculus of variations, in Recent mathematical developments in control, D. J.
Bell, ed., vol. 24, Academic Press, New York and London, 1973, pp. 97–122.
[19] P. E. Gill and W. Murray, Conjugate-gradient methods for large-scale nonlinear optimization, Report SOL 79-15, Department of Operations Research, Stanford University, Stanford, CA, 1979.
[20] P. E. Gill, W. Murray, M. A. Saunders, and M. H. Wright,
Procedures for optimization problems with a mixture of bounds and general
linear constraints, ACM Trans. Math. Software, 10 (1984), pp. 282–298.
[21] P. E. Gill, W. Murray, M. A. Saunders, and M. H. Wright, User’s guide for NPSOL (Version 4.0): a Fortran package for nonlinear programming, Report SOL 86-2, Department of Operations Research, Stanford University, Stanford, CA, 1986.
[22] P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization,
Academic Press, London and New York, 1981. ISBN 0-12-283952-8.
[23] P. E. Gill, W. Murray, and M. H. Wright, Numerical Linear Algebra and Optimization, volume 1, Addison-Wesley Publishing Company, Redwood City, 1991. ISBN 0-201-12649-4.
[24] D. Goldfarb, Factorized variable metric methods for unconstrained optimization, Math. Comput., 30 (1976), pp. 796–811.
[25] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns
Hopkins University Press, Baltimore, Maryland, 1983. ISBN 0-8018-5414-8.
[26] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, Maryland, second ed., 1989. ISBN 0-8018-5414-8.
[27] M. Lalee and J. Nocedal, Automatic column scaling strategies for quasi-Newton methods, SIAM J. Optim., 3 (1993), pp. 637–653.
[28] D. C. Liu and J. Nocedal, On the limited memory BFGS method for
large scale optimization, Math. Prog., 45 (1989), pp. 503–528.
[29] J. J. Moré, B. S. Garbow, and K. E. Hillstrom, Testing unconstrained optimization software, ACM Trans. Math. Software, 7 (1981),
pp. 17–41.
[30] J. J. Moré and D. C. Sorensen, Newton’s method, in Studies in Mathematics, Volume 24. Studies in Numerical Analysis, Math. Assoc. America,
Washington, DC, 1984, pp. 29–82.
[31] J. J. Moré and D. J. Thuente, Line search algorithms with guaranteed
sufficient decrease, ACM Trans. Math. Software, 20 (1994), pp. 286–307.
[32] B. A. Murtagh and M. A. Saunders, Large-scale linearly constrained
optimization, Math. Prog., 14 (1978), pp. 41–72.
[33] J. L. Nazareth, A relationship between the BFGS and conjugate gradient
algorithms and its implications for new algorithms, SIAM J. Numer. Anal.,
16 (1979), pp. 794–800.
[34] J. L. Nazareth, The method of successive affine reduction for nonlinear minimization, Math. Prog., 35 (1986), pp. 97–109.
[35] J. Nocedal, Updating quasi-Newton matrices with limited storage, Math.
Comput., 35 (1980), pp. 773–782.
[36] J. Nocedal, Theory of algorithms for unconstrained optimization, in Acta Numerica 1992, A. Iserles, ed., Cambridge University Press, New York, USA, 1992, pp. 199–242. ISBN 0-521-41026-6.
[37] J. Nocedal and Y. Yuan, Analysis of self-scaling quasi-Newton method,
Math. Prog., 61 (1993), pp. 19–37.
[38] S. Oren and E. Spedicato, Optimal conditioning of self-scaling variable
metric algorithms, Math. Prog., 10 (1976), pp. 70–90.
[39] S. S. Oren and D. G. Luenberger, Self-scaling variable metric (SSVM)
algorithms, Part I: Criteria and sufficient conditions for scaling a class of
algorithms, Management Science, 20 (1974), pp. 845–862.
[40] M. J. D. Powell, Some global convergence properties of a variable metric
algorithm for minimization without exact line searches, in SIAM-AMS Proceedings, R. W. Cottle and C. E. Lemke, eds., vol. IX, Philadelphia, 1976,
SIAM Publications.
[41] M. J. D. Powell, How bad are the BFGS and DFP methods when the objective function is quadratic?, Math. Prog., 34 (1986), pp. 34–37.
[42] M. J. D. Powell, Methods for nonlinear constraints in optimization calculations, in The State of the Art in Numerical Analysis, A. Iserles and M. J. D. Powell, eds., Oxford, 1987, Oxford University Press, pp. 325–357.
[43] D. F. Shanno, Conjugate-gradient methods with inexact searches, Math.
Oper. Res., 3 (1978), pp. 244–256.
[44] D. F. Shanno and K. Phua, Matrix conditioning and nonlinear optimization, Math. Prog., 14 (1978), pp. 149–160.
[45] D. Siegel, Modifying the BFGS update by a new column scaling technique,
Report DAMTP/1991/NA5, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, May 1991.
[46] D. Siegel, Implementing and modifying Broyden class updates for large scale optimization, Report DAMTP/1992/NA12, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, December 1992.
[47] D. Siegel, Updating of conjugate direction matrices using members of Broyden’s family, Math. Prog., 60 (1993), pp. 167–185.
[48] P. Wolfe, Convergence conditions for ascent methods, SIAM Review, 11
(1968), pp. 226–235.
[49] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization, ACM Trans. Math. Software, 23 (1997), pp. 550–560.