Mathematical Programming 26 (1983) 153-171
North-Holland Publishing Company
OPTIMAL APPROXIMATION OF SPARSE HESSIANS AND
ITS EQUIVALENCE TO A GRAPH COLORING PROBLEM
S. Thomas McCORMICK
Department of Operations Research, Stanford University, Stanford, CA 94305, U.S.A.
Received 4 December 1981
Revised manuscript received 11 June 1982
We consider the problem of approximating the Hessian matrix of a smooth non-linear
function using a minimum number of gradient evaluations, particularly in the case that the
Hessian has a known, fixed sparsity pattern. We study the class of Direct Methods for this
problem, and propose two new ways of classifying Direct Methods. Examples are given that
show the relationships among optimal methods from each class. The problem of finding a
non-overlapping direct cover is shown to be equivalent to a generalized graph coloring
problem, the distance-2 graph coloring problem. A theorem is proved showing that the
general distance-k graph coloring problem is NP-Complete for all fixed k ≥ 2, and hence that
the optimal non-overlapping direct cover problem is also NP-Complete. Some worst-case
bounds on the performance of a simple coloring heuristic are given. An appendix proves a
well-known folklore result, which gives lower bounds on the number of gradient evaluations
needed in any possible approximation method.
Key words: Sparse Hessian Approximation, Graph Coloring Heuristics, Elimination
Methods, NP-Complete, Direct Methods.
1. Introduction
The problem of interest is the approximation of the Hessian matrix of a
smooth nonlinear function F: R^n -> R. In many circumstances, it is difficult or
even impossible to evaluate the Hessian of F from its exact representation.
Under these conditions, an approximation to the Hessian can be computed using
finite differences of the gradient. When the Hessian is a dense matrix, this
approximation is usually obtained by differencing the gradient along the
coordinate vectors, and hence requires n evaluations of the gradient (which is
the minimum possible number; see Appendix 1). However, if the Hessian has a
fixed sparsity pattern at every point (i.e., certain elements are known to be
zero), the Hessian may be approximated with a smaller number of gradient
evaluations by differencing along specially selected sets of vectors.
For example, consider the following sparsity pattern, in which 0 stands for a
This research was partially supported by the Department of Energy Contract AM03-76SF00326,
PA# DE-AT03-76ER72018; Army Research Office Contract DAA29-79-C-0110; Office of Naval
Research Contract N00014-74-C-0267; National Science Foundation Grants MCS76-81259,
MCS79260099 and ECS-8012974.
known zero and 1 stands for a possible non-zero of the Hessian:

    1 0 1
    0 1 0
    1 0 1

If the gradient is differenced along the directions (1, 1, 0)^T and (0, 0, 1)^T, the
Hessian may be approximated with only two additional gradient evaluations.
When n is large and the proportion of zero elements is high, the number of
gradient evaluations needed to approximate the Hessian may be only a small
fraction of n. This result is particularly useful in numerical optimization
algorithms.
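To make the grouped differencing concrete, here is a minimal Python sketch
(the paper gives no code; the gradient function, step size, and the quadratic
test function below are illustrative assumptions, not part of the original):

    import numpy as np

    def approx_hessian_by_groups(grad, x0, groups, pattern, h=1e-6):
        # groups:  list of lists of 0-based column indices (a direct cover)
        # pattern: pattern[i][j] is True iff h_ij may be non-zero
        n = len(x0)
        H = np.zeros((n, n))
        g0 = grad(x0)
        for S in groups:
            d = np.zeros(n)
            d[S] = 1.0                        # d = sum of e_j over j in the group
            y = (grad(x0 + h * d) - g0) / h   # finite difference: y ~ H(x0) d
            for i in range(n):
                cols = [j for j in S if pattern[i][j]]
                if len(cols) == 1:            # no overlap in row i: read h_ij directly
                    H[i, cols[0]] = y[i]
        return H

    # The 3-by-3 pattern above with groups {1,2} and {3} (here 0-based), applied
    # to the illustrative function F(x) = x1^2 + x2^2 + x3^2 + x1*x3:
    grad = lambda x: np.array([2*x[0] + x[2], 2*x[1], 2*x[2] + x[0]])
    pattern = [[True, False, True], [False, True, False], [True, False, True]]
    print(approx_hessian_by_groups(grad, np.zeros(3), [[0, 1], [2]], pattern))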
Let g(x) denote the gradient of F and H(x) denote the Hessian. Assume that
g(x^0) is known and that we wish to approximate H(x^0) by evaluating g(x^0 + hd^l),
l = 1, ..., k, for some step size h and a set of k difference vectors {d^l}. For each l
and sufficiently small h, we obtain n approximate linear equations

    hH(x^0)d^l ~ g(x^0 + hd^l) - g(x^0),    (1)

so that there are a total of nk equations. Note that many of the components of d^l
and H(x^0) are usually zero.
Schemes for evaluating a Hessian approximation have been divided into three
categories, depending on the complexity of the subsystem of (1) that must be
solved for the unknown elements of the Hessian (see e.g. [2]). Direct Methods
correspond to a diagonal subsystem; Substitution Methods correspond to a
triangular subsystem; and Elimination Methods correspond simply to a non-
singular subsystem. There is a tradeoff here: as we move from Direct Methods
to Elimination Methods, we are less restricted and thus expect fewer evaluations
to be required, but we lose ease of approximation and possibly numerical
stability.
Once a class of methods has been selected, the problem is to choose a specific
method that minimizes k for a given sparsity pattern, without requiring too much
effort to determine the vectors {d^l}. This paper is concerned primarily with a
partial solution to this problem for direct methods. (The solution for elimination
methods is well known, but seems never to have been published. Appendix 1
gives a proof of the theorem behind the solution in this case.) Section 2 gives a
new classification for direct methods, Section 3 reduces one of these classes to a
graph coloring problem and shows that problem to be NP-Complete, Section 4
gives some heuristic results for the same class, and Section 5 points out possible
future avenues of research.
2. Classifying direct methods
Let H denote the Hessian of F at the point x^0. Any element of H that is not
known to be zero is called an unknown. An illuminating interpretation of a direct
method is to regard the non-zero components of a given d^l as specifying a subset
S_l of the columns of H; S_l is called the lth group of columns; by a slight abuse
of notation, a column index j is said to belong to S_l when its column does. When
two columns belonging to S_l both have an unknown in row i, there is said to be
an overlap in S_l in row i.
By definition of a direct method, the family of subsets {S_l} must satisfy the
Direct Cover Property (DC):

(DC)  For each unknown h_ij, there must be at least one S_l containing
      column j such that column j is the only column in S_l that has an
      unknown in row i.
Any family of subsets of columns satisfying (DC) is called a direct cover for
H, and naturally gives a scheme for approximating H. That is, if e_j is the jth unit
vector, differencing along

    d^l = Σ_{j ∈ S_l} e_j,    l = 1, ..., k,

is the scheme associated with the family {S_l}. The problem of interest is thus that
of finding a minimum cardinality direct cover for a given H (an optimal direct
cover).
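The property (DC) is mechanical to check. The following sketch (illustrative,
not from the paper) tests a proposed family of groups against a sparsity
pattern, using the symmetry h_ij = h_ji so that an unknown may be read either
in row i from a group containing column j, or in row j from a group containing
column i:

    def is_direct_cover(pattern, groups):
        # pattern[i][j] is True iff h_ij is an unknown; groups use 0-based indices
        n = len(pattern)
        def clean_read(i, j):
            # some group containing column j has j as its only column
            # with an unknown in row i
            return any(j in S and sum(1 for c in S if pattern[i][c]) == 1
                       for S in groups)
        return all(not pattern[i][j] or clean_read(i, j) or clean_read(j, i)
                   for i in range(n) for j in range(i + 1))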
Since it is difficult to find a general optimal direct cover, the problem is often
approached heuristically by restricting the acceptable direct covers and attempt-
ing to choose an optimal or near-optimal direct cover from the restricted set. We
suggest a new classification scheme for types of permitted direct covers. From
most to least restrictive, the categories are:
(1) Non-Overlap Direct Covers (NDC): No overlap may occur within any
group of columns, i.e. every group has at most one unknown in each row. The
best-known heuristic, the CPR method, belongs to this class (see Curtis, Powell
and Reid [3]).
(2) Sequentially Overlapping Direct Covers (SeqDC): A less restricted class
may be defined by observing that overlap within a group of an ordered direct
cover does not violate the direct cover property if the values of overlapping
unknowns can be resolved using preceding groups. In a SeqDC, columns in
group l are allowed to overlap in row m only if column m belongs to some group
k such that k < l. This definition implies that (DC) is satisfied: consider an
unknown h_ij and let k = min{l | either i or j belongs to group l} (k is called the
minimum index group of h_ij); note that h_ij cannot overlap with any other
unknown in its row in group k. Powell and Toint [8] propose a heuristic of this
class.
(3) Simultaneously Overlapping Direct Covers (SimDC): This is the most
general class of direct covers. Any kind of overlap is allowed, as long as (DC) is
not violated. In particular, an unknown may overlap in its row even in its
minimum index group as long as the overlap is resolved in some succeeding
group. Thapa's 'New Direct Method' [9] falls in this class, though he adds
several other restrictions.
Define c^XDC to be the number of groups needed by an optimal direct cover of
type XDC. Clearly NDC ⊆ SeqDC ⊆ SimDC, which implies that c^NDC ≥ c^SeqDC ≥
c^SimDC. These inclusions and inequalities can be strict, as the following examples
show. Consider

    1 1 0 0 1
    1 1 1 0 0
    0 1 1 1 1
    0 0 1 1 1
    1 0 1 1 1

Since every column overlaps with every other column, an NDC must use at least
five groups (and of course, five suffice). But {{3}, {2,4}, {1}, {5}} is a SeqDC of
cardinality 4 < 5. Now consider (from [8]):
    1 1 1 1 0 0
    1 1 1 0 1 0
    1 1 1 0 0 1
    1 0 0 1 0 0
    0 1 0 0 1 0
    0 0 1 0 0 1

It is easy to see that any SeqDC requires at least four groups, but {{1,5}, {2,6},
{3,4}} is a SimDC of size only three.
The above discussion glossed over whether a column can belong to more than
one group. This consideration leads to an independent classification scheme for
direct covers into:
(1) Partitioned Direct Covers (prefix P): These require the direct cover to be a
partition of {1, 2, ..., n}, i.e., every column must belong to exactly one group.
(2) General Direct Covers (prefix G): These allow either columns that belong
to no group, or columns that belong to more than one group.
All heuristics proposed so far that are known to this author restrict themselves
to partitioned direct covers. When h_ii is an unknown, column i must belong to at
least one group, for otherwise h_ii would not be determined. Since h_ii is an
unknown for all i in most unconstrained problems, it seems natural to restrict
our attention to direct covers in which each column belongs to at least one
group. However, sometimes the optimal direct cover is larger under this
restriction. For example, consider an NDC for
    0 1 1
    1 1 1
    1 1 1
Since all three pairs of columns overlap, a PNDC must use three groups;
however, {{2}, {3}} is a valid GNDC of smaller size. But such problems rarely
occur in unconstrained problems, and we shall henceforth consider only direct
covers in which every column belongs to at least one group.
From the remark in the definition of SeqDC that the definition of a SeqDC
implies (DC), it is easy to see that in any GSeqDC, we can delete a column from
all groups in which it appears except for its minimum index group, without
violating (DC) (i.e., since any h_ij is always determined directly by some column
in that column's minimum index group, any later occurrences of that column are
superfluous). Thus in the SeqDC case, and so also in the NDC case, it suffices to
consider only partitioned direct covers.
But, unfortunately, PSimDCs are not optimal in the class of GSimDCs.
Consider

    1 0 0 1 1 1 1
    0 1 1 0 1 1 0
    0 1 1 1 0 0 1
    1 0 1 1 1 0 0    (2)
    1 1 0 1 1 0 0
    1 1 0 0 0 1 0
    1 0 1 0 0 0 1

Laborious calculations verify that any PSimDC must have more than four
groups. However, {{1,2}, {1,3}, {4,6}, {5,7}} is a GSimDC of size four where
column 1 appears in two different groups. (The matrix (2) is the smallest possible
such example in terms of number of columns.)
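As a check on matrix (2), the short script below (an illustrative sketch in the
same spirit as the checker in Section 2, not part of the original paper) verifies
that the stated family satisfies (DC) even though columns within the groups
overlap, and that column 1 may legitimately appear twice:

    P = [[1,0,0,1,1,1,1],
         [0,1,1,0,1,1,0],
         [0,1,1,1,0,0,1],
         [1,0,1,1,1,0,0],
         [1,1,0,1,1,0,0],
         [1,1,0,0,0,1,0],
         [1,0,1,0,0,0,1]]
    groups = [[0,1], [0,2], [3,5], [4,6]]   # 0-based form of {{1,2},{1,3},{4,6},{5,7}}

    def clean_read(i, j):
        # column j is the only column of some group with an unknown in row i
        return any(j in S and sum(P[i][c] for c in S) == 1 for S in groups)

    assert all(not P[i][j] or clean_read(i, j) or clean_read(j, i)
               for i in range(7) for j in range(7))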
There is no efficient way known at present to find an optimal direct cover from
any of these classes, though heuristics that appear to work very well in practice
are known. The first heuristics studied were designed to find NDCs, largely
because this is the natural model which was known for approximating sparse
Jacobians (see [2]). Powell and Toint [8] noted that any NDC heuristic can be
converted into a SeqDC heuristic just by relaxing the non-overlapping condition
to apply only to unknowns that have not yet been determined, so as to take
advantage of the symmetry of the Hessian. Finally, Thapa [9] realized that in
some cases a SimDC would do better than a SeqDC, and proposed a heuristic
for SimDCs. Thapa's New Direct Method appears to offer little gain relative to
the extra effort necessary to implement it. However, Powell and Toint's
modification of the Curtis, Powell and Reid method adds almost no overhead
and can achieve substantial improvements in some cases, and so appears to be
the best current choice for actual practice. The next section shows that finding
an optimal NDC is very difficult. Though it would be interesting to analyze the
more useful class, SeqDC, NDC seems to be more amenable to analysis. The
two classes are similar enough that the results in Section 3 are considerable
evidence that finding an optimal SeqDC is also a very hard problem.
3. An equivalent graph coloring problem; NP-completeness
We show that the problem of finding an optimal NDC (which can be assumed
to be partitioned) is equivalent to a graph coloring problem, which is then shown
to be NP-Complete. Our notation and terminology for graphs follow that of
Bondy and Murty [1]. Our way of writing a sparsity pattern as a (0,1)-matrix,
call it S, could just as well be interpreted as the vertex-vertex incidence matrix
of an undirected graph. That is, the symmetry of the sparsity pattern is reflected
in the undirectedness of the graph. A partition of the columns of the sparsity
pattern naturally induces a partition of the vertices of the associated graph; the
vertex partition can be considered to be some sort of coloring on the graph.
Two columns in the same group, i.e. two vertices of the same color, cannot
'overlap'. Column i overlaps column j if s_ki = s_kj = 1 for some row k, i.e., if
vertex i and vertex j are both adjacent to vertex k. Thus the restriction on our
graph coloring is that no two vertices of the same color can have a common
neighbor. If distance from vertex i to vertex j in the graph is measured by
'minimum number of edges in any path between i and j', then we require that
any two vertices of the same color must be more than two units apart. In the
usual Graph Coloring Problem, we require that any two vertices of the same
color be more than one unit apart. This leads to defining a proper distance-k
coloring of a graph G to be a partition of the vertices of G into classes (colors)
so that any two vertices of the same color are more than k units apart. Then we
want to solve the Distance-k Graph Coloring Problem (DkGCP) on a graph G:
Find a proper distance-k coloring of G in the minimum possible number of
colors.
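The distance-2 condition can be made concrete in code. The sketch below
(illustrative; S is the (0,1) sparsity pattern read as a vertex-vertex incidence
matrix, with the diagonal acting as self-loops) tests whether a column partition,
viewed as a coloring, is a proper distance-2 coloring:

    def overlaps(S, i, j):
        # columns i and j overlap iff some row k has unknowns in both,
        # i.e., vertex k is adjacent to both vertex i and vertex j
        return any(S[k][i] and S[k][j] for k in range(len(S)))

    def is_proper_distance2_coloring(S, color):
        n = len(S)
        return all(color[i] != color[j] or not overlaps(S, i, j)
                   for i in range(n) for j in range(i))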
Then the usual Graph Coloring Problem (GCP) is D1GCP, and the optimal
NDC problem is equivalent to D2GCP. We shall use this equivalence to show
that the optimal NDC problem is NP-Complete by showing that D2GCP is
NP-Complete; in fact, we shall show the stronger result that DkGCP is NP-
Complete for any fixed k ≥ 2.
First we review the definition of NP-Completeness (see Garey and Johnson
[4]). The fundamental NP-Complete problem is the Satisfiability Problem (SAT),
which we use in a slightly simpler, but equivalent form:
3-Satisfiability (3SAT): Given a set of atoms u_1, u_2, ..., u_n, we get the set of
literals L = {u_1, ū_1, u_2, ū_2, ..., u_n, ū_n}. Let C = {C_1, C_2, ..., C_m} be a set of
3-clauses drawn from L, that is, each C_s ⊆ L and |C_s| = 3. Is there a truth
assignment T: {u_1, ..., u_n} -> {true, false} such that each C_s contains at least one
u_i with T(u_i) = true or at least one ū_i with T(u_i) = false?
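For small instances the decision problem can be settled by enumeration; the
sketch below (illustrative only, not part of the original) encodes the literal u_i
as +i and its negation ū_i as -i:

    from itertools import product

    def satisfiable(n_atoms, clauses):
        # clauses: collection of sets of signed atoms, e.g. {1, -2, 3}
        for bits in product([False, True], repeat=n_atoms):
            T = {i + 1: bits[i] for i in range(n_atoms)}
            if all(any(T[abs(l)] == (l > 0) for l in c) for c in clauses):
                return True
        return False

    # (u1 or not-u2 or u3) and (not-u1 or u2 or not-u3) is satisfiable:
    print(satisfiable(3, [{1, -2, 3}, {-1, 2, -3}]))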
The set of clauses is really an abstraction of a logical formula; imagine the
clauses as parenthesized subformulae whose literals are connected by 'or', with
all the clauses connected with 'and'. Then a satisfying truth assignment makes
the whole formula true. 3SAT has been shown to be 'at least as hard as' a large
class of hard problems. Thus, if 3SAT can be encoded into any other problem X,
then X inherits the 'at least as hard as' property and is called N P - C o m p l e t e .
Since GCP is well known to be NP-Complete [7], it is also possible to show that
a problem X is NP-Complete by showing that GCP can be encoded into X. This
approach might seem to be more natural for DkGCP, but it appears to be very
difficult. As we shall see in Section 4, DkGCP encodes more naturally into GCP
than vice-versa.
In order to encode 3SAT into DkGCP, DkGCP must be recast as a decision
problem. As is standard with optimization problems, we re-phrase DkGCP to 'Is
there a distance-k coloring that uses p or fewer colors?' In a slight abuse of
notation, we shall let DkGCP refer to both the optimization problem and the
related decision problem. Our encoding is a generalization of the one found in
Karp's original proof [7] of the NP-Completeness of GCP. The theorem about
the encoding requires the exclusion of the case in which a clause contains both
an atom and its negation. But such clauses are always trivially satisfied, so we
henceforth understand '3SAT' to mean '3-Satisfiability without trivial clauses'.
Also, we assume (without loss of generality) that n ≥ 4. This will allow us to
choose, for any clause C_s, an atom u_l with u_l ∉ C_s and ū_l ∉ C_s.
Given a 3SAT problem P, we construct from it a decision problem on a graph
G_k(P). If P has atoms u_1, u_2, ..., u_n and clauses C_1, C_2, ..., C_m, let h = ⌈k/2⌉ and
p = 2nk + m(k - 1). Let V and E denote the vertices and edges of G_k(P), and
define them by

    V = { u_i, ū_i                        i = 1, ..., n   (literal vertices)
          F_i^r, T_i^r,  r = 1, ..., k    i = 1, ..., n   (false vertices, true vertices)
          C_s^r,  r = 0, ..., k-1         s = 1, ..., m   (clause vertices)
          I_s^r,  r = 1, ..., k-1         s = 1, ..., m   (intermediate vertices) }

    E = { {u_i, ū_i},  all i
              (u_i and ū_i get different colors);
          {F_i^r, F_j^{r+1}}, {F_i^r, T_j^{r+1}}, {T_i^r, F_j^{r+1}}, {T_i^r, T_j^{r+1}},
              all r, all i ≠ j
              (all F's and T's get different colors);
          {I_s^r, I_s^{r+1}},  all r;  {I_s^{k-1}, C_s^0},  all s;
          {I_s^1, u_a}  if u_a ∈ C_s;  {I_s^1, ū_a}  if ū_a ∈ C_s
              (C_s^0 gets a color different from its literals);
          {u_i, F_j^1}, {u_i, T_j^1}, {ū_i, F_j^1}, {ū_i, T_j^1},  all i ≠ j
              (u_i and ū_i can only be F_i^1 or T_i^1);
          {C_s^r, C_s^{r+1}},  all r;  {C_s^r, C_t^r},  r ≥ h, all s ≠ t;
          {C_s^r, F_i^{r+1}}, {C_s^r, T_i^{r+1}},  r ≥ h, all i
              (C_s^r, r > 0, different from each other and from the F's and T's);
          {C_s^{k-1}, T_i^k},  all i, all s, k odd;
          {C_s^{k-1}, F_i^k},  all i, all s, k even
              (C_s^0 can only be an F^1 color of its literals) }

Note that we consider G_k(P) only for k ≥ 2, implying that h + 1 ≤ k, so h + 1
makes sense as a superscript for the F's and the T's. The global structure of
G_k(P) is pictured in Fig. 1.

[Figure 1: the global structure of G_k(P).]
We need three propositions about the structure of a proper distance-k coloring
of G_k(P).

Proposition 1. The vertices F_i^r, T_i^r (i = 1, ..., n, r = 1, ..., k) and C_s^r
(s = 1, ..., m, r = 1, ..., k - 1) must all have different colors, thus using up all p
colors.
Proof. Consider the length at most k paths

    F_i^1 - F_j^2 - F_i^3 - ... ,    T_i^1 - T_j^2 - T_i^3 - ...    (i ≠ j)

which demonstrate that all F's and T's must be different colors. Now consider
the length k - 1 paths

    C_s^1 - C_s^2 - ... - C_s^h - T_i^{h+1} - T_j^{h+2} - ... - T^k    (3)

(together with the corresponding paths through the F's), which show that no
C_s^q, 0 < q ≤ h, can be any F_i^r or T_i^r color, r > h. The length at most k - 1
paths

    C_s^{k-1} - C_s^{k-2} - ... - C_s^h - F_i^{h+1} - F_j^{h+2} - ... ,
    C_s^{k-1} - C_s^{k-2} - ... - C_s^h - T_i^{h+1} - T_j^{h+2} - ...

show that no C_s^q, h ≤ q < k, can be any F_i^r or T_i^r color, r > h. Let l be an
index such that u_l ∉ C_s and ū_l ∉ C_s, and consider the length k - 1 paths

    C_s^1 - C_s^2 - ... - C_s^h - T_l^{h+1} - T_i^h - T_l^{h-1} - ... - T_i^1   (k odd)
    C_s^1 - C_s^2 - ... - C_s^h - T_l^{h+1} - T_i^h - T_l^{h-1} - ... - T_l^1   (k even)    (4)

which show that no C_s^q, 0 < q < h, can be any T_i^r color, r ≤ h. The length k
paths

    C_s^1 - C_s^2 - ... - C_s^h - F_l^{h+1} - F_l^h - F_i^{h-1} - ... - F_i^1   (k odd)
    C_s^1 - C_s^2 - ... - C_s^h - F_l^{h+1} - F_l^h - F_i^{h-1} - ... - F_l^1   (k even)    (5)

show that no C_s^q, 0 < q < h, can be any F_i^r color, r ≤ h. The length k paths

    C_s^{k-1} - ... - C_s^h - F_l^{h+1} - F_l^h - ... - F_i^1 ,
    C_s^{k-1} - ... - C_s^h - T_l^{h+1} - T_l^h - ... - T_i^1

show that no C_s^q, h ≤ q < k, can be any F_i^r or T_i^r color, r ≤ h. The length
k - 1 paths

    C_s^1 - C_s^2 - ... - C_s^{h-1} - C_s^h - C_t^h - C_t^{h-1} - ... - C_t^1    (6)

show that no C_s^r can be the same color as any C_t^q, 0 < q, r < h. Finally, the
length k - 1 path

    C_s^{k-1} - C_s^{k-2} - ... - C_s^h - C_t^h - C_t^{h+1} - ... - C_t^{k-1}    (7)

shows that no C_s^q can be the same color as any C_t^r, h ≤ q, r < k.
Since F_i^r, T_i^r and C_s^r, r > 0, use up all the colors, we subsequently refer to
the colors by these vertex names.
Proposition 2. Vertices u_i and ū_i must be colored F_i^1 and T_i^1 in some order,
i = 1, ..., n.

Proof. Let j ≠ i and consider the length k paths

    {u_i, ū_i} - F_j^1 - F_i^2 - ... - F_j^k ,
    {u_i, ū_i} - ... - C_t^{k-1} - C_t^{k-2} - ... - C_t^1 ,
    {u_i, ū_i} - T_j^1 - T_i^2 - ... - T_j^k    (8)

which show that u_i and ū_i cannot be any color other than F_i^1 and T_i^1. Also,
u_i and ū_i certainly cannot be the same color.
Thus a proper distance-k coloring of G_k(P) induces a truth assignment on the
literals.
Proposition 3. If the literals in clause C_s have indices a, b, and c, then C_s^0 must
be colored F_a^1, F_b^1, or F_c^1, s = 1, ..., m.

Proof. We can add C_s^0 to the beginning of the paths in (3), (4), (5), (6) and (7),
thus excluding all colors except the F^1's from C_s^0. If l is an index such that
u_l ∉ C_s, ū_l ∉ C_s, then we can drop the edge F_l^{h+1} - F_l^h from (5) and add
C_s^0 to the beginning to show that C_s^0 cannot be any F_l^r color either.
Now we are ready to state and prove the NP-Completeness theorem, which
then immediately implies that finding an optimal NDC is NP-Complete. This
theorem is a symmetric version of a result in Section 3 of Coleman and Moré [2].

Theorem 1. For fixed k ≥ 2, DkGCP is NP-Complete.

Proof. Since the size of G_k(P) is a polynomial in m and n, it is clear that the
above reduction of 3SAT to DkGCP can be carried out in polynomial time. We
must show that there is a satisfying truth assignment for the 3SAT problem P if
and only if the graph G_k(P) has a proper distance-k coloring in p or fewer
colors.
First suppose that G_k(P) is properly distance-k colored. If l_a, l_b, and l_c are
the literals contained in C_s, then the length k path

    C_s^0 - I_s^{k-1} - I_s^{k-2} - ... - I_s^1 - l_i

shows that C_s^0 cannot be the same color as any of l_a, l_b, or l_c. But C_s^0 must
be colored F_a^1, F_b^1, or F_c^1 by Proposition 3. By Proposition 2, each l_i is
colored either F_i^1 or T_i^1, so each clause must contain at least one true literal
under the truth assignment induced by the proper coloring, i.e., the clauses are
satisfied.
Now we need only show that G_k(P) can always be colored in p or fewer
colors if P is satisfiable. Let τ be some satisfying truth assignment for C_1,
C_2, ..., C_m. First color the F_i^r's, T_i^r's and C_s^r's, r > 0, as decreed by
Proposition 1. Color u_i with T_i^1 if τ(u_i) = true, and color u_i with F_i^1
otherwise; color ū_i with the complementary color. Each C_s has at least one true
literal, say l_a. Color C_s^0 with color F_a^1. Finally, color I_s^r with C_{s+1}^r,
r = 1, ..., k - 1, where the subscript on C is interpreted modulo m.
We now show that this coloring is proper. The colors F_i^r, T_i^r, 1 < r ≤ k, each
appear on only one vertex and so are proper. Color C_{s+1}^r appears on exactly
two vertices, itself and I_s^r. A shortest possible path between these vertices in
G_k(P) is

    I_s^r - I_s^{r-1} - ... - I_s^1 - l_a - F^1 - ... - C_{s+1}^r

and is of length k + 1. This is a shortest path because at least k edges must be
used to get from layer I^r to layer C^r, and one extra edge must be used to get
from an F or a T to a C. Also, any alternative path between these vertices that
goes through a C_t^q has at least k + 2h edges, because of the difference in
subscripts and because the C's do not interconnect for r < h; thus color C_{s+1}^r
is proper. Color T_i^1 also appears on exactly two vertices, itself and one of u_i or
ū_i. A shortest possible path between these vertices is the third one in (8) with
T_i^1 added at the end. For j ≠ i, at least k edges must be used to get from the u
layer to the T^1 layer, and an extra edge is necessary to go from an i vertex to a
j vertex. Any other path between these vertices through the I's uses at least
k + 2h - 1 edges (see Fig. 1), so T_i^1 is proper. Finally, F_i^1 can appear in three
places: on itself, on u_i or ū_i, and on any number of C_s^0 whose clauses contain
either u_i or ū_i. As with u_i or ū_i and T_i^1 above, u_i or ū_i and F_i^1 do not
cause a conflict. Some shortest possible paths between F_i^1 and any C_s^0 are
those in (5) with C_s^0 added to the beginning. Again, at least k edges are
necessary to go from the C^0 layer to the F^1 layer, and an extra edge is
necessary to go from an l vertex to an i vertex. In Fig. 1 we see that any other
path between these vertices through the I's uses at least 2k edges, so no C_s^0,
F_i^1 pair causes a conflict. Between a u_i or ū_i and a C_s^0, some shortest
possible paths are

    u_i - I_s^1 - I_s^2 - ... - I_s^{k-1} - C_s^0 ,
    u_i - F_j^1 - ... - C_s^1 - C_s^0

of lengths k and k + 1 respectively. The first cannot exist because of the truth
assignment and because there are no trivial clauses. Once again, the second must
use k edges going from layer u to layer C^0, and an extra edge going from an F
or a T to a C, so no u_i or ū_i, C_s^0 pair conflicts. Finally, a shortest possible
path between C_s^0 and C_t^0 is (6) with C_s^0 added to the beginning and C_t^0
added to the end, of length k + 1. In Fig. 1 we see that any other path between
these vertices through the I's uses at least 2k edges, so no F_i^1 color conflicts.
Thus the coloring is proper, and the theorem is proved.
4. Heuristics for finding non-overlapping direct covers
Theorem 1 is, unfortunately, a negative result, since it implies that finding an
optimal NDC is very hard. On a more positive note, much work has been done
on finding near-optimal, polynomial-time, heuristic algorithms for NP-Complete
problems (see [4, Chapter 6]).
In the present case, the most obvious heuristic approach is to reduce D2GCP
to GCP and then apply known heuristic results on GCP to the reduced graph.
Given a graph G = (V, E), define D2(G) (the distance-2 completion of G) to be
the graph on the same vertex set V, and with edges E' = {{i, j} | i and j are
distance 2 or less apart in G}. Then it is easy to verify that a coloring of V is a
proper distance-2 coloring of G if and only if it is a proper (distance-1) coloring
of D2(G) (note that this reduction also implies that D2GCP is in NP).
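The completion itself is straightforward to compute; here is a sketch with
adjacency sets (assumed symmetric), written from the definition above:

    def distance2_completion(adj):
        # adj: dict mapping each vertex to the set of its neighbors in G;
        # returns the adjacency of D2(G): join vertices one or two edges apart
        d2 = {}
        for v in adj:
            near = set(adj[v])           # distance 1
            for u in adj[v]:
                near |= adj[u]           # distance 2, through u
            near.discard(v)
            d2[v] = near
        return d2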
If there were a 'good' heuristic for GCP, then we could compose it with D2(·)
to obtain a 'good' heuristic for D2GCP. Coleman and Moré [2, Section 4] give a
good overview of the present state of the art in GCP heuristics, which is not
'good'. In fact, if c^H(G) denotes the number of colors used by the best known
heuristic on graph G, and χ(G) denotes the optimal number of colors necessary
for G (its chromatic number), then in the worst case

    max_{G with n vertices} c^H(G)/χ(G) = O(n)    (9)

(this best heuristic and the bound (9) are due to Johnson [6]). Two facts mitigate
the unpleasantness of (9). First, the range of D2(·) does not include all graphs,
and hence a better bound than (9) can be obtained for D2GCP. Second,
average-case results have been obtained for GCP heuristics that are consider-
ably better than (9).
To improve on (9) for D2GCP, consider the specific heuristic called the
distance-2 sequential algorithm (D2SA). Define N(i) = {j ≠ i | j is distance ≤ 2
from i}, the distance-2 neighborhood of a vertex i in a graph. Thus, if i has color
c in a proper distance-2 coloring, no j ∈ N(i) can be color c. Then D2SA assigns
color

    min{c ≥ 1 | no j ∈ N(i), j < i, is colored c}

to vertex i, i = 1, ..., |V|. That is, D2SA assigns vertex i the smallest color not
conflicting with those already assigned. (This is just the distance-2 version of the
best known GCP heuristic, the sequential algorithm, which is called the CPR
method in its applications to approximating sparse Jacobians (see Curtis et al.
[3]).) Let c^S(G) denote the number of colors used by D2SA when applied to G.
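A direct transcription of D2SA follows (a sketch from the definition above;
vertices are processed in the order given and colors are the integers 1, 2, ...):

    def d2sa(adj, order):
        color = {}
        for v in order:
            near = set(adj[v])            # build N(v): everything within
            for u in adj[v]:              # distance 2 of v
                near |= adj[u]
            near.discard(v)
            taken = {color[u] for u in near if u in color}
            c = 1
            while c in taken:             # smallest non-conflicting color
                c += 1
            color[v] = c
        return color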
In order to obtain bounds on c^S(G), we require two definitions. The maximum
degree of G, Δ(G), is defined as

    Δ(G) = max_i |{j | {i, j} ∈ E(G)}|.

The distance-2 chromatic number of G, χ_2(G), is defined as the optimal number
of colors in a proper distance-2 coloring of G, i.e.

    χ_2(G) = min{k | G has a proper distance-2 coloring with k colors}.

The following theorem bounds χ_2(G) and c^S(G) in terms of Δ(G), and a
corollary improves (9) for D2SA:

Theorem 2. Let d = Δ(G). Then

    d + 1 ≤ χ_2(G) ≤ c^S(G) ≤ d^2 + 1    (10)

for all graphs G.
Proof. Let i be a vertex incident to exactly d edges, and note that i and its d
nearest neighbors must all be different colors in a proper distance-2 coloring;
this proves the lower bound in (10). The second inequality in (10) is trivial.
To prove the upper bound in (10), note that for any vertex i, |N(i)| ≤
d + d(d - 1) = d^2. Suppose that D2SA assigns color l to vertex i; by definition
of D2SA, this can happen only if at least one vertex of each color 1, ..., l - 1 is
in N(i). Thus, if i were assigned color l > d^2 + 1, then |N(i)| ≥ d^2 + 1 (a
contradiction). (This proof is essentially a constructive proof of Corollary 8.2.1
in Bondy and Murty [1].)
Corollary 1. For all n ≥ 1,

    max_{G with n vertices} c^S(G)/χ_2(G) ≤ √(n - 1) + 1 = O(√n).    (11)

Proof. Clearly, c^S(G) ≤ n. Let k = χ_2(G). Applying the first and third
inequalities of (10), we obtain

    c^S(G) ≤ (k - 1)^2 + 1.    (12)

Consider separately two cases:
Case 1: If n ≤ (k - 1)^2 + 1, this implies that √(n - 1) + 1 ≤ k, and so

    c^S(G)/k ≤ n/k ≤ n/(√(n - 1) + 1) ≤ √(n - 1) + 1.

Case 2: If n > (k - 1)^2 + 1, then k < √(n - 1) + 1, and so

    c^S(G)/k ≤ ((k - 1)^2 + 1)/k = k - 2 + 2/k < k < √(n - 1) + 1.

(Corollary 1 is essentially (4.6) of Coleman and Moré [2] in the special case that
the matrix is square and symmetric.)
Graphs that attain bound (12) for a certain ordering of their vertices exist for
k = 1, 2, 3, 4. The cases k = 1, 2 are trivial. For k = 3, consider G_3 = (V_3, E_3)
defined by

    V_3 = {x_ij | i = 1, 2, 3, j = 1, 2, 3, 4, 5},
    E_3 = {{x_ij, x_{i+1,j+1}} | all i, j}    (subscripts modulo 3 and 5).

Then D2SA assigns x_ij color i when the vertices are ordered by i (which is
optimal by (10)), and assigns x_ij color j when the vertices are ordered by j
(which is the worst possible, by (12)). For k = 4, consider G_4 = (V_4, E_4)
defined by

    V_4 = {x_ij | i = 1, 2, 3, 4, j = 1, ..., 10},
    E_4 = {{x_ij, x_{i+2,j+1}} | all i, all j}
          ∪ {{x_ij, x_{i+1,j+2}} | all i, all odd j}
          ∪ {{x_ij, x_{i+1,j+4}} | all i, all even j}    (subscripts modulo 4 and 10).

Then D2SA applied to G_4 also colors x_ij with i when ordered by i, and with j
when ordered by j (which are again respectively optimal and worst possible), as
the sketch below illustrates for G_3.
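The following self-contained sketch (illustrative, not from the paper) builds G_3
and runs the greedy rule of D2SA under both orderings; the printed values 3 and
5 are the optimal and worst-possible counts claimed above:

    V = [(i, j) for i in range(3) for j in range(5)]
    adj = {v: set() for v in V}
    for i, j in V:                                   # edges {x_ij, x_{i+1,j+1}}
        w = ((i + 1) % 3, (j + 1) % 5)
        adj[(i, j)].add(w)
        adj[w].add((i, j))

    def d2sa(order):                                 # the greedy rule of D2SA
        color = {}
        for v in order:
            near = set().union(adj[v], *(adj[u] for u in adj[v])) - {v}
            taken = {color[u] for u in near if u in color}
            c = 1
            while c in taken:
                c += 1
            color[v] = c
        return color

    by_i = d2sa(sorted(V))                               # order by i, then j
    by_j = d2sa(sorted(V, key=lambda v: (v[1], v[0])))   # order by j, then i
    print(max(by_i.values()), max(by_j.values()))        # prints: 3 5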
However, this construction seems difficult to extend. Even if it can be extended,
the number of vertices is given by n = k((k - 1)^2 + 1), so that

    c^S(G_k)/k = O(n^{1/3}),

which is a better result than (11). Thus, while (10), (11) and (12) are better
results than (9), I believe that the associated bounds are not the best possible.
For the average case, Grimmett and McDiarmid [5] proved the following
theorem:

Theorem 3. Fix n vertices, and let vertices i and j be independently connected by
an edge with fixed probability p, 0 < p < 1. Let c^C(G) be the number of colors
used by CPR on G, and χ(G) be the optimal number of colors (so that c^C(G)
and χ(G) are random variables). Then

    c^C(G)/χ(G) ≤ 2 + ε

for all ε > 0 with probability 1 - o(1).

Thus, on average, CPR almost never performs more than twice as badly as the
optimal strategy. This is a nice result, but for our purposes it has at least two
flaws. First, sparsity patterns in practical problems are not uniformly random as
Theorem 3 supposes. Second, even if they were, the density of sparsity patterns
tends to be O(1/n) rather than constant with increasing n, so the theorem does
not apply anyway. It would be useful to determine a better random model for
sparsity patterns, or at least to prove Theorem 3 under the assumption that
p = O(1/n).
5. Conclusions and further questions
As we move from Direct Methods to Elimination Methods, and from NDCs to
GSimDCs within Direct Methods, we are less restricted, and so can find
potentially more powerful methods. We also move from an NP-Complete
problem (finding an optimal NDC) to a polynomially-bounded one (a general
Elimination Method) (using Theorem 1, and Theorem 4 of the Appendix). It
would be interesting to know what intermediate point divides NP-Complete
methods from polynomial methods (if indeed there is a continuum at all).
In particular, as was noted in Section 2, it is of some practical importance to
ascertain the difficulty of finding an optimal SeqDC. It is fairly easy to see what
property a graph coloring must have to induce a proper SeqDC; can this be used
to prove a version of Theorem 1 for SeqDCs? SimDCs are harder to deal with
because of their simultaneous nature, but they can still be shown to be
equivalent to a form of graph 'multi-coloring'. Is the corresponding Theorem 1
still true? The expectation is that these graph coloring problems are also
NP-Complete, which, if true, means that we must rely on heuristic algorithms to
construct SeqDCs and SimDCs. Can the bound (11) be significantly improved,
or, more importantly, can a provably better heuristic be found?
Other interesting questions involve the observed performance of heuristics.
Even CPR, one of the simplest possible heuristics, seems to give good results on
practical problems (see Coleman and Moré [2, Tables 3, 4 and 5]). Can this
behavior be proved under some convincing randomness assumption, as sug-
gested at the end of Section 4? Much work remains to be done before we know
whether we are approximating sparse Hessians efficiently.
Acknowledgments
This paper would not exist were it not for Walter Murray suggesting the
problem, and for Margaret Wright finding errors and otherwise improving the
presentation; I thank them both very much.
Appendix 1. Elimination methods
Assume that we approximate the Hessian of F: R^n -> R using an elimination
method by evaluating the gradient of F at x^0 along directions d^1, d^2, ..., d^k.
If no entry of the Hessian is known to be zero, there are ½n(n + 1) unknowns,
namely h_ij for all 1 ≤ j ≤ i ≤ n. The following theorem is well known in the
folklore, but the present author knows of no published proof:

Theorem 4. The maximum number of unknowns that can be determined from a
set of gradient evaluations along any k directions, 0 ≤ k ≤ n, is

    r_{n,k} = n + (n - 1) + ... + (n - k + 1) = ½k(2n - k + 1).

In particular, in the completely dense case, n evaluations are necessary to obtain
all ½n(n + 1) unknowns. This bound is sharp.
Proof. Let g(x) denote the gradient of F at x, and H(x) denote the Hessian of F
at x. Evaluating g(x) along the k directions d^1, d^2, ..., d^k produces the nk
approximate linear equations

    H(x^0)d^i ~ g(x^0 + d^i) - g(x^0),    i = 1, ..., k.    (13)

We assume that H is symmetric, and so we identify unknowns h_ij and h_ji in
(13). The number of unknowns that can be determined from equations (13) is
bounded above by the rank of the nk by ½n(n + 1) coefficient matrix of the
h_ij's when no h_ij is assumed to be zero. Questions of rank could be affected by
dependence among the d^i; we thus assume the Haar Condition, namely that
every square submatrix of the matrix whose ith column is d^i is non-singular.
This assumption maximizes the potential rank in (13).
T o d e s c r i b e the coefficient matrix, order the equations in (13) so that equations
with left-hand side H , . d ~, i = 1 . . . . . k a p p e a r first, those with left-hand side
H 2 . d ~, i = 1 . . . . . k a p p e a r next, etc., and then order the u n k n o w n s as h~,, h2, h22,
h3i, h32. . . . . h,,,. W e shall refer to the c o l u m n s of the coefficient matrix by the
n a m e s of their a s s o c i a t e d u n k n o w n s . G i v e n this ordering, call the coefficient
matrix in (13) Ak; partition A k r o w - w i s e into n blocks of k rows each, and
c o l u m n - w i s e into n blocks, w h e r e the ith c o l u m n block has i c o l u m n s . T h u s
c o l u m n b l o c k i consists of all c o l u m n s h~,i with j = 1. . . . . i. L e t the i, jth
s u b m a t r i x of the partition be d e n o t e d by A~ i, i, j = 1. . . . . n. For e x a m p l e , w h e n
n = 4 and k = 2, the ordering of the u n k n o w n s and the f o r m of A k are s h o w n in
the following (where zero e l e m e n t s are not shown):
hll
h21 tl22 h31 hn
h33 h41 h42 h4-4 h44
d~
d~
d~
d~
d~
d]
dl
d,
d~
d~
d~
d~
dl
d~
d~'
d~
d~
d~
d~
d~
d~
d~
(14)
dl d~ d~ d~
d~ d~ d~ d~
T o simplify notation, we define c i as the k - v e c t o r of the jth c o m p o n e n t s of the
{d'}, i.e., c i = (d~, d~ . . . . . d~) ~, j = 1. . . . . n. T h e n we see f r o m (14) that each A~i is
c o m p l e t e l y d e s c r i b e d by
A ik,i =
0
ifi>j;
(c 1, c 2. . . . . c i)
if i = j;
(0,0 . . . . . c i. . . . . 0)
if i < j .
12
i
j
(15)
To complete the proof, it must be shown that rank(A^k) = r_{n,k}. Let E be the
set of columns h_{i,j} with k < j ≤ i ≤ n, and let F be the complementary set of
columns. We shall show that the columns of F can be used to eliminate the
columns of E. Note that |F| = r_{n,k} and that each column in F involves only
c^i with i ≤ k.
Define λ^l to be the solution of the system

    (c^1  c^2  ...  c^k) λ^l = -c^l,    l = k + 1, k + 2, ..., n

(λ^l exists and is unique under the assumption of the Haar condition). The
following computations show that linear combinations of the columns in F, using
the {λ^l} as multipliers, can be used to eliminate the columns in E; since the
form of the linear combinations is complicated, the result is best understood by
referring to the following example (16), where we re-write (14) in our new
notation and include the multipliers defined below for the case k = 2, q = 4 and
p = 3:
           h11        h21                  h22        h31   h32   h33  h41   h42   h43  h44
           c^1        c^2                             c^3              c^4
                      c^1                  c^2               c^3             c^4
                                                      c^1   c^2   c^3              c^4       (16)
                                                                       c^1   c^2   c^3  c^4

    (h44)  λ_1^4λ_1^4  λ_1^4λ_2^4           λ_2^4λ_2^4                  λ_1^4  λ_2^4        1
    (h43)  2λ_1^4λ_1^3 λ_1^4λ_2^3+λ_1^3λ_2^4 2λ_2^4λ_2^3  λ_1^4  λ_2^4       λ_1^3  λ_2^3  1
Let p and q satisfy k < p < q ≤ n, so h_{q,q} and h_{q,p} are typical columns in
E. To eliminate column h_{q,q} from A^k, we add to it λ_j^q times column h_{q,j},
j = 1, ..., k, and λ_i^q λ_j^q times column h_{i,j}, i = 1, ..., k, j = 1, ..., i (these
multipliers are the first line of λ's in (16)). From (15) and (16) we see that the
resulting column will be zero in row blocks i, k < i ≤ n, i ≠ q, since no column
with a non-zero coefficient is non-zero in these row blocks. In row block i,
1 ≤ i ≤ k, the non-zero contributions to the resulting column are Σ_{l≤i}
λ_i^q λ_l^q c^l from column block i, λ_l^q λ_i^q c^l from column block l, i < l ≤ k,
and λ_i^q c^q from column block q, for a total of

    λ_i^q (Σ_{l≤k} λ_l^q c^l + c^q) = 0,    i = 1, ..., k.

In row block q, the only non-zero contribution is from column block q, which is

    Σ_{l≤k} λ_l^q c^l + c^q = 0.

Thus the resultant column is zero and column h_{q,q} is indeed dependent on the
columns of F.
Now we eliminate column h_{q,p} using the columns in F. Add to it λ_j^p times
column h_{q,j}, j = 1, ..., k, λ_j^q times column h_{p,j}, j = 1, ..., k, and
(λ_i^q λ_j^p + λ_i^p λ_j^q) times column h_{i,j}, i = 1, ..., k, j = 1, ..., i (these
multipliers are shown in the second row of λ's in (16)). There is no non-zero
contribution to the resultant column in any row block i, k < i ≤ n, i ≠ p, q. In
row block i, 1 ≤ i ≤ k, there is a contribution of Σ_{j≤i} (λ_i^q λ_j^p + λ_i^p λ_j^q)c^j
from column block i, a contribution of (λ_i^q λ_l^p + λ_i^p λ_l^q)c^l from column
block l, i < l ≤ k, a contribution of λ_i^q c^p from column block p, and a
contribution of λ_i^p c^q from column block q, for a total of

    λ_i^q (Σ_{l≤k} λ_l^p c^l + c^p) + λ_i^p (Σ_{l≤k} λ_l^q c^l + c^q) = 0.

In row block p, there is a contribution of Σ_{l≤k} λ_l^q c^l from column block p
and a contribution of c^q from column block q, for a total of

    Σ_{l≤k} λ_l^q c^l + c^q = 0.

In row block q, the only non-zero contribution is from column block q and is

    Σ_{l≤k} λ_l^p c^l + c^p = 0.

Once again the row block totals are all zero, so column h_{q,p} is also dependent
on the columns in F.
By eliminating the 1 + 2 + ... + (n - k) columns of E we have shown that
rank(A^k) ≤ r_{n,k}. To show that rank(A^k) = r_{n,k}, delete the columns of E
from A^k, and delete the last k - i rows from each row block i, i < k. Then the
remaining matrix is r_{n,k} by r_{n,k} and is block upper triangular with square,
non-singular diagonal blocks. Thus this submatrix of A^k is non-singular, so
rank(A^k) ≥ r_{n,k}.
To show that this bound is sharp, suppose that our original sparsity pattern
has h_{i,j} as an unknown whenever i ≤ k or j ≤ k. Then by letting d^i be the ith
unit vector for i = 1, 2, ..., k, we can clearly solve for all r_{n,k} of the
unknowns, so the bound of the theorem is attained.
Note that this theorem actually gives useful lower bounds on the number of
gradient evaluations needed for any approximation method, since the class of
Elimination Methods is unrestricted. A corollary to this proof is that the
minimum number of gradient evaluations necessary to approximate a Hessian
(sparse or dense) can in principle be found in time bounded by O(n^7) by the
following procedure (sketched in code below):
Step 1: Set k = 1.
Step 2: Form A^k, deleting the columns corresponding to variables known to
be zero.
Step 3: Evaluate rank(A^k) = t_k, say. If t_k ≥ number of unknowns, then k is
optimal; otherwise, set k = k + 1 and go to Step 2.
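The procedure is easy to simulate numerically. The sketch below (illustrative,
not from the paper) builds A^k from (15), with random directions standing in
for the Haar condition, and returns the first k whose rank covers the unknowns;
the finite-precision caveat discussed next applies:

    import numpy as np

    def min_evaluations(pattern, seed=0):
        # pattern[i][j] (j <= i) is truthy iff h_ij is an unknown
        n = len(pattern)
        D = np.random.default_rng(seed).standard_normal((n, n))  # column i is d^i
        unknowns = [(i, j) for i in range(n) for j in range(i + 1)
                    if pattern[i][j]]
        for k in range(1, n + 1):
            A = np.zeros((n * k, len(unknowns)))
            for col, (i, j) in enumerate(unknowns):
                A[i*k:(i+1)*k, col] = D[j, :k]        # c^j in row block i
                if j < i:
                    A[j*k:(j+1)*k, col] = D[i, :k]    # c^i in row block j
            if np.linalg.matrix_rank(A) >= len(unknowns):
                return k                  # Step 3: rank covers all unknowns
        return n

    # Dense 3-by-3 pattern: n = 3 evaluations, as Theorem 4 predicts.
    print(min_evaluations([[1], [1, 1], [1, 1, 1]]))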
Step 3 can be performed at most n times, on a matrix whose number of
columns is O(n^2). Evaluation of rank requires O(|columns|^3) operations,
giving a total bound of O(n^7). However, because of the difficulties inherent in
calculating the rank of a matrix in finite precision arithmetic, this procedure
could never be practically applied. It would be quite interesting to find a
procedure that could calculate the minimum possible k directly from the sparsity
pattern without needing to evaluate any ranks. But even if such a procedure
could be found, numerical difficulties would probably make a general elimination
method untrustworthy in any case.
References
[1] J.A. Bondy and U.S.R. Murty, Graph theory with applications (Macmillan, London, 1976).
[2] T.F. Coleman and J.J. Moré, "Estimation of sparse Jacobian matrices and graph coloring
problems", Technical Report ANL-81-39, Argonne National Laboratory (Argonne, IL, 1981).
[3] A.R. Curtis, M.J.D. Powell and J.K. Reid, "On the estimation of sparse Jacobian matrices",
Journal of the Institute of Mathematics and its Applications 13 (1974) 117-119.
[4] M.R. Garey and D.S. Johnson, Computers and intractability (Freeman, San Francisco, CA,
1979).
[5] G.R. Grimmett and C.J.H. McDiarmid, "On colouring random graphs", Proceedings of the
Cambridge Philosophical Society 77 (1975) 313-324.
[6] D.S. Johnson, "Worst case behavior of graph coloring algorithms", in: Proceedings of the 5th
Southeastern Conference on Combinatorics, Graph Theory, and Computing (Utilitas
Mathematica, Winnipeg, Man., 1974) pp. 513-527.
[7] R.M. Karp, "Reducibility among combinatorial problems", in: R.E. Miller and J.W. Thatcher,
eds., Complexity of computer computations (Plenum, New York, 1972) pp. 85-103.
[8] M.J.D. Powell and P.L. Toint, "On the estimation of sparse Hessian matrices", SIAM Journal
on Numerical Analysis 16 (1979) 1060-1074.
[9] M.N. Thapa, "Optimization of unconstrained functions with sparse Hessian matrices", Ph.D.
Thesis, Stanford University (Stanford, CA, 1980).