
Optimization Methods and Software
Vol. 19, Nos. 3–4, June–August 2004, pp. 247–265
THEORETICAL EFFICIENCY OF A NEW INEXACT
METHOD OF TANGENT HYPERBOLAS
NAIYANG DENG^a and HAIBIN ZHANG^b,*
^a College of Sciences, China Agricultural University, Beijing 100083, China; ^b Department of Applied Mathematics, Beijing University of Technology, Beijing 100022, China
(Received 6 February 2003; Revised 20 September 2003; In final form 10 January 2004)
A new improved method of tangent hyperbolas is established in this article. It is shown that, for medium- and large-scale
unconstrained optimization problems, its average computation cost per iteration is much lower than that of the
improved method of tangent hyperbolas (IMTH), while the rapid local convergence rate is maintained.
Keywords: Unconstrained optimization problems; Preconditioned conjugate gradient method (PCG); Improved method
of tangent hyperbolas; Automatic differentiation
AMS Subject Classification: 65K05; 90C30
1 INTRODUCTION
Consider the medium- and large-scale unconstrained optimization problem
    min_{x ∈ R^n} f(x).   (1)
Suppose that problem (1) satisfies the following standard assumptions.
ASSUMPTION (A1) f is four-times continuously differentiable in a neighbourhood of the local minimum x*, which minimizes f(x).
ASSUMPTION (A2) The Hessian ∇^2 f(x*) is symmetric positive definite.
In this article, we are concerned with third-order methods [see e.g. Refs. 8,10]. Just
as the Newton method approximates the gradient of the objective function by a linear function
with the same slope at the current iterate, the third-order methods approximate the gradient
by a 'parabola' with the same slope and curvature at the current iterate. The most famous
third-order method is the improved method of tangent hyperbolas (Algorithm IMTH):
    x_{k+1} = x_k + s_k^1 + s_k^2,   (2)
where s_k^1 and s_k^2 are the solutions to the following Newton equation and Newton-like equation,
respectively:
    ∇^2 f(x_k) s_k^1 = −∇f(x_k),   (3)
    ∇^2 f(x_k) s_k^2 = −(1/2) ∇^3 f(x_k) s_k^1 s_k^1.   (4)
The cost of one step of Algorithm IMTH consists of two parts:
(1) Evaluating the gradient ∇f(x_k), the Hessian ∇^2 f(x_k), and the term ∇^3 f(x_k) s_k^1 s_k^1.
(2) Solving the Newton equation (3) and the Newton-like equation (4).
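To fix ideas, here is a minimal sketch (ours, not from the paper) of one IMTH iteration (2)-(4) in Python, using JAX automatic differentiation for the derivative quantities listed in item (1); the objective f is assumed to be any smooth JAX-traceable function.

    import jax
    import jax.numpy as jnp

    def imth_step(f, x):
        # One iteration of Algorithm IMTH: x+ = x + s1 + s2, Eqs. (2)-(4).
        g = jax.grad(f)(x)                      # ∇f(x)
        H = jax.hessian(f)(x)                   # ∇²f(x)
        s1 = jnp.linalg.solve(H, -g)            # Newton equation (3)
        # ∇³f(x) s1 s1 as the directional derivative of the Hessian-vector product
        hvp = lambda y: jax.jvp(jax.grad(f), (y,), (s1,))[1]
        t = jax.jvp(hvp, (x,), (s1,))[1]
        s2 = jnp.linalg.solve(H, -0.5 * t)      # Newton-like equation (4)
        return x + s1 + s2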
Our aim is to improve Algorithm IMTH further by reducing its computation cost while
keeping the same convergence rate. Here the following two strategies are used:
(1) Automatic differentiation (AD). It is an efficient and precise numerical differentiation
technique. It has significant advantages over hand-coding, finite difference approximation,
and symbolic differentiation [see Ref. 14]. Forward mode and reverse mode of AD were
apparently first proposed in Ref. [15] and Ref. [11], respectively. Since the 1990s, AD has been
developed rapidly, see e.g. Refs. [2,7]. One of its most important applications is to improve
the optimization algorithms by computing the relevant derivative information efficiently.
(2) Preconditioned conjugate gradient (PCG) method. Computing the exact solution of a linear
system by a direct method such as Cholesky factorization (CF) may be expensive for
medium- and large-scale problems. It is reasonable to use an iterative method to solve it only
approximately, see e.g. Refs. [4,5,12,13]. Here the PCG method is selected. Since we are concerned
with a local algorithm, we restrict our discussion to a sufficiently small neighbourhood
of the solution to Eq. (1). Thus, it can be assumed that the norms of the right-hand sides of
both Eqs. (3) and (4) are small enough, e.g. less than 1.
It should be noted that the idea of applying the PCG method of Ref. [5] to the third-order method was
mentioned briefly in Ref. [3]. Using the above two strategies, a new improved method of tangent
hyperbolas (NIMTH) – Algorithm NIMTH – is established in this article. It is well known that
optimization algorithms equipped with the AD technique are more efficient than algorithms using
other differentiation techniques. Therefore, in this article we only show the superiority of our
new Algorithm NIMTH over the original Algorithm IMTH when the AD technique is used in both
of them.
In this article, the following notations and conventions are used: ẋ ∈ R^n denotes a directional
vector. ∇f(x), ∇^2 f(x), and ∇^3 f(x) are, respectively, the gradient, Hessian matrix, and third-order
derivative tensor of the function f(x). κ(A) denotes the condition number of the matrix A. ⌊·⌋ denotes
the largest integer not larger than ·, and ⌈·⌉ denotes the smallest integer not smaller than ·.
Finally, we make the convention that
    Σ_{t=a}^{b} · · · = 0,  when a > b.
This article is organized as follows: An Algorithm Model NIMTH(p) with a parameter p is
established in Section 2. The local convergence and the efficiency of the algorithm model are
given, respectively, in Sections 3 and 4. In Section 5, we derive our new Algorithm NIMTH
from Algorithm Model NIMTH(p) and show that it is much more efficient than the original
Algorithm IMTH for medium- and large-scale problems.
2 ALGORITHM MODEL NIMTH(p)
Now starting from Algorithm IMTH, we establish Algorithm Model NIMTH(p), where p is
a parameter. Its characteristics are: the derivative information is calculated by AD, and the linear
systems of equations are solved by either the CF or the PCG method. So we first give the AD
algorithms and the PCG method, and then propose the algorithm model.
2.1 Automatic Differentiation Algorithms
We can evaluate the gradient, the directional gradient, and the second-order directional gradient
of the function by AD, see e.g. Ref. [7]. Two AD algorithms, Algorithm AD1 and Algorithm
AD2(m), in Ref. [6] will be used later. The former will be used to evaluate a gradient ∇f(x);
the latter will be used to evaluate m Hessian-vector products ∇^2 f(x) ẋ_i with i = 1, . . . , m,
where ẋ_i ∈ R^n, and the parameter m is any positive integer.
In order to evaluate ∇^3 f(x) ẋ ẋ by AD, the following algorithm is also proposed:
ALGORITHM AD3
Step 0. Set x, ẋ ∈ R^n.
Step 1. Calculate ∇^2 f(x) ẋ by Algorithm AD2(1).
Step 2. Evaluate ∇^3 f(x) ẋ ẋ, using forward propagation of tangents.
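As an illustration only (the paper's Algorithm AD3 builds on the AD routines of Ref. [6], which are not reproduced here), the same quantity can be obtained with JAX by one reverse pass for the gradient and two forward (tangent) passes, mirroring the two steps above:

    import jax

    def ad3(f, x, xdot):
        # Step 1 analogue: y -> ∇²f(y)·ẋ via a forward pass over the gradient.
        hvp = lambda y: jax.jvp(jax.grad(f), (y,), (xdot,))[1]
        # Step 2 analogue: one more forward (tangent) pass gives ∇³f(x)·ẋ·ẋ.
        return jax.jvp(hvp, (x,), (xdot,))[1]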
2.2 Algorithm PCG
For solving the linear system of equations
    ∇^2 f(x) s = b,   (5)
the following Algorithm PCG(M, ∇f(x), b, l) is constructed, where M is the preconditioner
and l is the maximum number of subiterations.
ALGORITHM PCG(M, ∇f(x), b, l)
Step 0. Initial Data. Set s_0 = 0, r_0 = b, i = 1.
Step 1. Termination Test. If
    ‖r_{i−1}‖ ≤ ‖b‖^3  or  i − 1 = l,   (6)
then terminate the iteration by taking s̄ = s_{i−1}.
Step 2. Subiteration:
(a) Solve the equation M z = r_{i−1} for z and set t_{i−1} = z^T r_{i−1}; if i = 1, then set β = 0 and
q = z; otherwise, set β = t_{i−1}/t_{i−2} and q = z + βq. Evaluate
    w = ∇^2 f(x) q   (7)
by Algorithm AD2(1) with ẋ_1 = q.
(b) Set s_i = s_{i−1} + αq, where α = t_{i−1}/q^T w; set r_i = r_{i−1} + αw.
Step 3. Set i = i + 1, and go to Step 1.
It should be noted that, according to the test criterion (6), the number of PCG subiterations is less
than or equal to l. In fact, when the number of PCG subiterations is allowed to reach n, the solution
is exact, as it would be from a CF.
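A minimal runnable sketch of this routine (an assumption of ours, not the authors' code) is given below; it follows the standard preconditioned conjugate gradient recursion with the termination test (6), with hvp(q) standing for the AD-computed product ∇^2 f(x)q of Eq. (7) and M_solve(r) for the preconditioner solve Mz = r. Sign conventions of the residual update may differ in detail from the statement above.

    import jax.numpy as jnp

    def pcg(hvp, M_solve, b, l):
        # Solve ∇²f(x)s = b approximately; stop by test (6) or after l subiterations.
        s, r = jnp.zeros_like(b), b
        q, t_prev = None, None
        for _ in range(l):
            if jnp.linalg.norm(r) <= jnp.linalg.norm(b) ** 3:   # termination test (6)
                break
            z = M_solve(r)                                      # preconditioner step
            t = z @ r
            q = z if q is None else z + (t / t_prev) * q
            w = hvp(q)                                          # w = ∇²f(x)q, Eq. (7)
            alpha = t / (q @ w)
            s, r, t_prev = s + alpha * q, r - alpha * w, t      # r kept as b − ∇²f(x)s
        return s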
2.3 Algorithm Model NIMTH(p)
Now we are in a position to propose the new algorithm model. Its basic structure is as follows:
every CF step is followed by p PCG steps, where p is a parameter.
ALGORITHM MODEL NIMTH(p)
Step 0. Initial Data. Set the initial point x_0 ∈ R^n and the parameter p. If p ≥ 1, set the
maximum numbers of subiterations l_1^N = 2·3, l_2^N = 2·3^2, . . . , l_p^N = 2·3^p and l_1^H = 3,
l_2^H = 3^2, . . . , l_p^H = 3^p, and set k = 0, where the superscripts N and H refer, respectively, to the
Newton equation (3) and the Newton-like equation (4).
Step 1. Evaluate ∇f(x_k) by Algorithm AD1. If ∇f(x_k) = 0, then terminate the iteration by
taking x* = x_k.
Step 2. Switch Test. If k is divisible by p + 1, go to Step 3; otherwise,
go to Step 4.
Step 3. CF Step. Evaluate ∇^2 f(x_k) by using Algorithm AD2(n) with ẋ_i = e_i, i =
1, . . . , n, where e_i is the ith Cartesian basis vector in R^n. If p > 0, then set
    B_k = ∇^2 f(x_k).
Evaluate ∇^3 f(x_k) s_k^1 s_k^1 by using Algorithm AD3 with x = x_k and ẋ = s_k^1. Solve for s_k^1 and s_k^2
in the Newton equation (3) and the Newton-like equation (4) by the CF ∇^2 f(x_k) = L_k D_k L_k^T.
Set m = 0 and go to Step 5.
Step 4. PCG Step. Set m = m + 1 and
    M = B_k.   (8)
Find the approximate solutions s_k^1 and s_k^2 to the Newton equation (3) and the Newton-like
equation (4) by Algorithm PCG(M, ∇f(x_k), −∇f(x_k), l_m^N) and Algorithm
PCG(M, ∇f(x_k), −(1/2)∇^3 f(x_k) s_k^1 s_k^1, l_m^H), respectively.
Step 5. Update Solution Estimate. Set x_{k+1} = x_k + s_k^1 + s_k^2. Set k = k + 1, and go to Step 1.
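The driver loop below is a hedged sketch of the algorithm model (our own illustration, assuming f is a smooth JAX-traceable objective and reusing the pcg sketch from Section 2.2): every CF step factors a fresh Hessian, and that Hessian is reused as the preconditioner M = B_k in the following p PCG steps.

    import jax
    import jax.numpy as jnp

    def nimth_model(f, x0, p, tol=1e-10, max_iter=200):
        grad = jax.grad(f)
        lN = [2 * 3 ** m for m in range(1, p + 1)]     # l_m^N = 2·3^m
        lH = [3 ** m for m in range(1, p + 1)]         # l_m^H = 3^m
        x, B = x0, None
        for k in range(max_iter):
            g = grad(x)
            if jnp.linalg.norm(g) <= tol:              # Step 1
                break
            hvp = lambda q: jax.jvp(grad, (x,), (q,))[1]   # ∇²f(x)q by AD
            t3 = lambda s: jax.jvp(lambda y: jax.jvp(grad, (y,), (s,))[1], (x,), (s,))[1]
            if k % (p + 1) == 0:                       # Step 3: CF step
                B = jax.hessian(f)(x)
                s1 = jnp.linalg.solve(B, -g)
                s2 = jnp.linalg.solve(B, -0.5 * t3(s1))
            else:                                      # Step 4: PCG step with M = B_k
                m = k % (p + 1)
                M_solve = lambda r: jnp.linalg.solve(B, r)   # stand-in for reusing the CF factors
                s1 = pcg(hvp, M_solve, -g, lN[m - 1])
                s2 = pcg(hvp, M_solve, -0.5 * t3(s1), lH[m - 1])
            x = x + s1 + s2                            # Step 5
        return x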
3 CONVERGENCE ANALYSIS OF ALGORITHM MODEL NIMTH(p)
In this section, the local convergence of the new algorithm model is given.
DEFINITION 3.1 Let both x_CF and x_c be near to the solution x* to Eq. (1). The progress index
ν = ν[x_CF, x_c] from x_CF to x_c with respect to x* is defined as
    ν = ν[x_CF, x_c] = ln‖x_c − x*‖ / ln‖x_CF − x*‖,   (9)
where x_CF denotes the point obtained by the CF step and x_c denotes the current iterative point
obtained from a PCG step between one CF step and the next CF step. The progress index ν roughly
reflects the convergence order.
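For instance (an illustrative calculation, not from the paper), if ‖x_CF − x*‖ = 10^{−2} and ‖x_c − x*‖ = 10^{−6}, then ν[x_CF, x_c] = ln 10^{−6} / ln 10^{−2} = 3, which is the progress expected of one exact IMTH step.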
LEMMA 3.1 Assume that x_+ = x_CF + s_CF^1 + s_CF^2, where s_CF^1 and s_CF^2 are the solutions to the Newton equation
    ∇^2 f(x_CF) s^1 = −∇f(x_CF)
and the Newton-like equation
    ∇^2 f(x_CF) s^2 = −(1/2) ∇^3 f(x_CF) s^1 s^1,
respectively. Then there exists δ ∈ (0, 1) such that for the solution x* to Eq. (1), when
‖x_CF − x*‖ ≤ δ, we have
    ν[x_CF, x_+] ≥ 3 + θ_1,   (10)
where θ_1 = ln C_1 / ln‖x_CF − x*‖ and C_1 > 1 is a constant.
Proof In fact, by Assumption (A1), we have
    ∇f(x*) − ∇f(x_CF) = ∇^2 f(x_CF)(x* − x_CF) + (1/2) ∇^3 f(x_CF)(x* − x_CF)(x* − x_CF)
        + O(‖x* − x_CF‖^3).   (11)
Since ∇f(x*) = 0, we conclude
    ∇f(x_CF) = O(‖x_CF − x*‖).   (12)
By
    0 = ∇f(x*) = ∇f(x_CF) + ∇^2 f(x_CF)(x* − x_CF) + O(‖x* − x_CF‖^2),
we get
    ∇f(x_CF) + ∇^2 f(x_CF)(x* − x_CF) = O(‖x* − x_CF‖^2).
Since x_CF is sufficiently near to x*, Assumptions (A1) and (A2) along with ‖∇^2 f(x_CF)^{−1}‖ =
O(1) yield
    ∇^2 f(x_CF)^{−1}[∇f(x_CF) + ∇^2 f(x_CF)(x* − x_CF)] = O(‖x* − x_CF‖^2).
Therefore,
    x* − x_CF + ∇^2 f(x_CF)^{−1}∇f(x_CF) = O(‖x* − x_CF‖^2).   (13)
From Eqs. (11)–(13), we have
    x_+ − x* = x_CF + s_CF^1 + s_CF^2 − x*
        = x_CF − ∇^2 f(x_CF)^{−1}∇f(x_CF)
          − (1/2) ∇^2 f(x_CF)^{−1}∇^3 f(x_CF)[−∇^2 f(x_CF)^{−1}∇f(x_CF)]^2 − x*
        = ∇^2 f(x_CF)^{−1}{∇^2 f(x_CF)(x_CF − x*) − ∇f(x_CF)
          − (1/2) ∇^3 f(x_CF)[−∇^2 f(x_CF)^{−1}∇f(x_CF)]^2}
        = ∇^2 f(x_CF)^{−1}[(1/2) ∇^3 f(x_CF)(x* − x_CF)^2 + O(‖x* − x_CF‖^3)]
          − (1/2) ∇^2 f(x_CF)^{−1}∇^3 f(x_CF)[−∇^2 f(x_CF)^{−1}∇f(x_CF)]^2
        = (1/2) ∇^2 f(x_CF)^{−1}∇^3 f(x_CF)[x* − x_CF − ∇^2 f(x_CF)^{−1}∇f(x_CF)]
          × [x* − x_CF + ∇^2 f(x_CF)^{−1}∇f(x_CF)] + O(‖x* − x_CF‖^3)
        = (1/2) ∇^2 f(x_CF)^{−1}∇^3 f(x_CF) O(‖x* − x_CF‖) O(‖x* − x_CF‖^2)
          + O(‖x* − x_CF‖^3) = O(‖x* − x_CF‖^3).
In other words, there exists a constant C_1 > 1 such that
    ‖x_+ − x*‖ ≤ C_1 ‖x_CF − x*‖^3.   (14)
Thus, Eq. (10) is obtained from Eq. (9).

LEMMA 3.2 Let the progress index ν from x_CF to x_c with respect to x* satisfy
    ν = ν[x_CF, x_c] ≥ 1.   (15)
Assume that s̄^1 and s̄^2 are obtained by Algorithm PCG(∇^2 f(x_CF), ∇f(x_CF), −∇f(x_CF), l_m^N)
and Algorithm PCG(∇^2 f(x_CF), ∇f(x_CF), −(1/2)∇^3 f(x_CF) s̄^1 s̄^1, l_m^H), respectively. Then there
exists δ ∈ (0, 1), such that when ‖x_CF − x*‖ ≤ δ, the residuals
    r(s̄^1) = ∇^2 f(x_c) s̄^1 + ∇f(x_c)
and
    r′(s̄^2) = ∇^2 f(x_c) s̄^2 + (1/2) ∇^3 f(x_c) s̄^1 s̄^1,
respectively, satisfy
    ‖r(s̄^1)‖ ≤ ω ‖∇f(x_c)‖^{1+min{2, l_m^N/ν}},   (16)
and
    ‖r′(s̄^2)‖ ≤ ω′ ‖∇f(x_c)‖^{2+min{1, l_m^H/ν}},   (17)
where ω > 0 and ω′ > 0 are constants.
Proof We first prove the inequality (16). Without loss of generality, we assume ‖∇f(x_c)‖ < 1.
According to termination condition (6), there are two possibilities: either
    ‖r(s̄^1)‖ ≤ ‖∇f(x_c)‖^3,   (18)
or the number of the subiterations is l_m^N, i.e.
    s̄^1 = s_{l_m^N}.   (19)
The former case implies that the inequality (16) is true with ω = 1, since 1 + min{2, l_m^N/ν} ≤ 3.
So we only need to show Eq. (16) for the latter case.
First, we estimate the difference between (∇^2 f(x_c))_{ij} and (∇^2 f(x_CF))_{ij}, where (·)_{ij} is
the element in the ith row and the jth column of the matrix. Noticing Assumption (A2),
Definition 3.1 and Eq. (15), we have that when δ is small enough, there exists a constant
ω_1 > 0, such that
    |(∇^2 f(x_c))_{ij} − (∇^2 f(x_CF))_{ij}| ≤ (1/2) ω_1 ‖x_c − x_CF‖ ≤ (1/2) ω_1 (‖x_c − x*‖ + ‖x_CF − x*‖)
        = (1/2) ω_1 (‖x_c − x*‖ + ‖x_c − x*‖^{1/ν}) ≤ ω_1 ‖x_c − x*‖^{1/ν}.   (20)
On the other hand, by Assumptions (A1) and (A2), we conclude that when δ is small enough,
    ‖∇f(x_c)‖ = ‖∇f(x_c) − ∇f(x*)‖ ≥ (1/2) ω_2 ‖x_c − x*‖,   (21)
where ω_2 = min{λ_min, 1}, and λ_min is the smallest eigenvalue of the matrix ∇^2 f(x*). It follows
from Eqs. (20) and (21) that
    |(∇^2 f(x_c))_{ij} − (∇^2 f(x_CF))_{ij}| ≤ ω_1 (2/ω_2)^{1/ν} ‖∇f(x_c)‖^{1/ν} ≤ ω_3 ‖∇f(x_c)‖^{1/ν},   (22)
where ω_3 = 2ω_1/ω_2, which is independent of ν.
Let
    A = (∇^2 f(x_CF))^{−1/2} ∇^2 f(x_c) (∇^2 f(x_CF))^{−1/2}
      = I + (∇^2 f(x_CF))^{−1/2} [∇^2 f(x_c) − ∇^2 f(x_CF)] (∇^2 f(x_CF))^{−1/2}.   (23)
It is easy to see that when δ is small enough, for all i, j,
    |((∇^2 f(x_CF))^{−1/2})_{ij}| ≤ ω_4,   (24)
where ω_4 ≝ c max{|((∇^2 f(x*))^{−1/2})_{ij}| : i, j = 1, 2, . . . , n} and c > 1 is a positive constant.
Therefore, from Eqs. (22) and (24),
    |((∇^2 f(x_CF))^{−1/2} [∇^2 f(x_c) − ∇^2 f(x_CF)] (∇^2 f(x_CF))^{−1/2})_{ij}| ≤ (ω_5/n) ‖∇f(x_c)‖^{1/ν},   (25)
where ω_5 = n^3 ω_3 ω_4^2. By the Gerschgorin theorem and Eqs. (22) and (24), we can get that for any
eigenvalue λ of A, there exists a diagonal entry a_{ii} of A such that
    |λ − a_{ii}| ≤ (n − 1) (ω_5/n) ‖∇f(x_c)‖^{1/ν}.
By Eq. (25), we have
    |a_{ii} − 1| ≤ (ω_5/n) ‖∇f(x_c)‖^{1/ν},
therefore,
    |λ − 1| ≤ |λ − a_{ii}| + |a_{ii} − 1| ≤ ω_5 ‖∇f(x_c)‖^{1/ν},
that is,
    1 − ω_5 ‖∇f(x_c)‖^{1/ν} ≤ λ ≤ 1 + ω_5 ‖∇f(x_c)‖^{1/ν}.   (26)
Therefore, the condition number κ(A) of A satisfies
    κ(A) = λ_max(A)/λ_min(A) ≤ (1 + ω_5 ‖∇f(x_c)‖^{1/ν}) / (1 − ω_5 ‖∇f(x_c)‖^{1/ν}).   (27)
Thus by Eq. (27), noticing lim_{δ→0} ‖∇f(x_c)‖ = 0, we conclude that when δ is small enough,
    κ(A) ≤ 2(1 + ω_5),   (28)
    κ(A) − 1 ≤ 2ω_5 ‖∇f(x_c)‖^{1/ν} / (1 − ω_5 ‖∇f(x_c)‖^{1/ν}) ≤ 4ω_5 ‖∇f(x_c)‖^{1/ν}.   (29)
Now, we consider solving the linear system
    A ŝ = −b   (30)
using the conjugate gradient method, where A is defined by Eq. (23) and
    b = ∇^2 f(x_CF)^{−1/2} ∇f(x_c).   (31)
Let the initial point be ŝ_0 = 0 and let ŝ_{l^N} be the approximate solution obtained after l^N subiterations.
Let us estimate the residual
    r̂(ŝ_{l^N}) = A ŝ_{l^N} + b.   (32)
In fact, from Lemma 2.3.2 in Ref. [9], we have
    ‖r̂(ŝ_{l^N})‖ / ‖r̂(0)‖ ≤ (κ(A))^{1/2} ‖ŝ_{l^N} − ŝ*‖_A / ‖0 − ŝ*‖_A   (33)
and
    ‖ŝ_{l^N} − ŝ*‖_A / ‖0 − ŝ*‖_A ≤ 2 [((κ(A))^{1/2} − 1) / ((κ(A))^{1/2} + 1)]^{l^N},   (34)
where ŝ* is the exact solution to Eq. (30).
Combining Eqs. (33), (34), (28), and (29), we have
    ‖r̂(ŝ_{l^N})‖ ≤ 2‖r̂(0)‖ (κ(A))^{1/2} [((κ(A))^{1/2} − 1)/((κ(A))^{1/2} + 1)]^{l^N}
        ≤ 2‖r̂(0)‖ (κ(A))^{1/2} (κ(A) − 1)^{l^N}
        ≤ 2‖r̂(0)‖ (2(1 + ω_5))^{1/2} (4ω_5)^{l^N} ‖∇f(x_c)‖^{l^N/ν}
        ≤ 2‖∇^2 f(x_CF)^{−1/2}‖ (2(1 + ω_5))^{1/2} (4ω_5)^{l^N} ‖∇f(x_c)‖^{1+l^N/ν}
        ≤ ω_6 ‖∇f(x_c)‖^{1+l^N/ν},   (35)
where ω_6 = 4(2(1 + ω_5))^{1/2} (4ω_5)^{l^N} ‖∇^2 f(x*)^{−1/2}‖.
Considering the relationships
    s_{l^N} = −∇^2 f(x_c)^{−1}∇f(x_c),
    ŝ_{l^N} = (∇^2 f(x_CF))^{1/2} s_{l^N},
and
    r(s_{l^N}) = (∇^2 f(x_CF))^{1/2} r̂(ŝ_{l^N}),
we obtain
    ‖r(s_{l^N})‖ ≤ ω ‖∇f(x_c)‖^{1+l^N/ν},   (36)
where ω = 2ω_6 ‖(∇^2 f(x*))^{1/2}‖. For the solution s̄^2 to the Newton-like equation, we can prove
similarly that, after executing l^H subiterations of the PCG, analogously to Eq. (35), the residual satisfies
    ‖r′(s_{l^H})‖ ≤ 2‖r′(0)‖ (2(1 + ω_5))^{1/2} (4ω_5)^{l^H} ‖∇f(x_c)‖^{l^H/ν}.   (37)
Since
    ‖r′(0)‖ = ‖(1/2) ∇^3 f(x_c) s_{l^N} s_{l^N}‖ ≤ (1/2) ‖∇^3 f(x_c)‖ ‖s_{l^N}‖^2,
and Assumptions (A1) and (A2) hold, there exists a constant C′ such that
    ‖r′(0)‖ ≤ C′ ‖∇f(x_c)‖^2.
Therefore,
    ‖r′(s_{l^H})‖ ≤ ω′ ‖∇f(x_c)‖^{2+l^H/ν},   (38)
where ω′ is a constant. Therefore, from Eqs. (36) and (38), we get the conclusion.
LEMMA 3.3 Suppose that ν ≥ 1 and x_+ = x_c + s̄^1 + s̄^2, where ν, s̄^1, and s̄^2 are defined
in Lemma 3.2. Then there exists δ ∈ (0, 1), such that for the solution x* to Eq. (1), when
‖x_CF − x*‖ ≤ δ and ‖x_c − x*‖ ≤ δ, we have
    ν[x_CF, x_+] ≥ μν + θ_2,   (39)
where
    μ = min{3, 1 + min{2, l_m^N/ν}, 2 + min{1, l_m^H/ν}},   (40)
    θ_2 ≝ ln C_2 / ln‖x_CF − x*‖ < 0,   (41)
and C_2 > 1 is a constant.
Proof From the definitions of the residuals r(s̄^1) and r′(s̄^2), we have
    s̄^1 = −∇^2 f(x_c)^{−1}∇f(x_c) + ∇^2 f(x_c)^{−1} r(s̄^1),   (42)
and
    s̄^2 = −(1/2) ∇^2 f(x_c)^{−1}∇^3 f(x_c) s̄^1 s̄^1 + ∇^2 f(x_c)^{−1} r′(s̄^2).   (43)
Suppose
    B1 = x_c − ∇^2 f(x_c)^{−1}∇f(x_c) − (1/2) ∇^2 f(x_c)^{−1}∇^3 f(x_c)
         × [−∇^2 f(x_c)^{−1}∇f(x_c)]^2 − x*,   (44)
    B2 = ∇^2 f(x_c)^{−1} r(s̄^1) + ∇^2 f(x_c)^{−1}∇^3 f(x_c)[∇^2 f(x_c)^{−1}∇f(x_c)]
         × [∇^2 f(x_c)^{−1} r(s̄^1)] − (1/2) ∇^2 f(x_c)^{−1}∇^3 f(x_c)[∇^2 f(x_c)^{−1} r(s̄^1)]^2,   (45)
    B3 = ∇^2 f(x_c)^{−1} r′(s̄^2).   (46)
Then
    x_+ − x* = x_c + s̄^1 + s̄^2 − x* = B1 + B2 + B3.   (47)
By Assumptions (A1) and (A2) and Lemma 3.2, we can see that there exist constants M_1, M_2,
and M_3 that depend only on x*, such that the following three inequalities hold:
    ‖B1‖ ≤ M_1 ‖x_c − x*‖^3,   (48)
    ‖B2‖ ≤ M_2 ‖∇f(x_c)‖^{1+l_m^N/max{ν, 3^m}},   (49)
    ‖B3‖ ≤ M_3 ‖∇f(x_c)‖^{2+l_m^H/max{ν, 3^m}},   (50)
where the proof of Eq. (48) can be obtained from the proof of Lemma 3.1 by setting x_c = x_CF,
while Eqs. (49) and (50) are not difficult to prove. Therefore, by Eqs. (47)–(50),
    ‖x_+ − x*‖ ≤ C_2 ‖x_c − x*‖^μ,   (51)
where μ and C_2 are defined in Eqs. (40) and (41). Noticing that ν[x_CF, x_+] = ν[x_c, x_+] ν, we
can get Eq. (39) from Eq. (51).
For convenience of expression, we rewrite the sequence {x_k} generated by Algorithm Model
NIMTH(p) as
    {x_k} = {x_{0(p+1)}^CF, x_{1(p+1)}^CF, . . . , x_{j(p+1)}^CF, . . .},  when p = 0,
    {x_k} = {x_{0(p+1)}^CF, x_{0(p+1)+1}^PCG, . . . ,
             x_{j(p+1)}^CF, x_{j(p+1)+1}^PCG, . . . , x_{j(p+1)+p}^PCG, x_{(j+1)(p+1)}^CF, . . .},  when p > 0,   (52)
where the superscripts CF and PCG indicate which step is executed at the corresponding iterate.
We call the iterations that produce x_{(j+1)(p+1)}^CF from x_{j(p+1)}^CF the jth cycle of the algorithm model.
LEMMA 3.4 Consider the sequence (52) generated by Algorithm Model NIMTH(p). Then
there exists δ ∈ (0, 1) such that, when ‖x_0 − x*‖ < δ, we have, for the solution x* to Eq. (1)
and for any j,
    ν[x_{j(p+1)}^CF, x_{j(p+1)+q+1}] ≥ 3 + Σ_{t=1}^{q} l_t^N + ((3^{q+1} − 1)/2) θ > 1,  for q = 0, 1, . . . , p,   (53)
where θ = min{θ_1, θ_2}, and θ_1 and θ_2 are defined in Lemma 3.1 and Lemma 3.3, respectively.
Proof We consider the two cases p = 0 and p > 0 separately. For the former case, by
Lemma 3.1, we have that when δ is small enough and ‖x_{j(p+1)}^CF − x*‖ ≤ δ,
    ν[x_{j(p+1)}^CF, x_{(j+1)(p+1)}^CF] = ν[x_j^CF, x_{j+1}^CF] ≥ 3 + θ_1 ≥ 3 + θ > 1.   (54)
For the latter case, it is sufficient to prove by induction that when δ is small enough and
‖x_{j(p+1)}^CF − x*‖ ≤ δ, Eq. (53) and
    ‖x_{j(p+1)+q+1} − x*‖ ≤ δ,  q = 0, 1, . . . , p,   (55)
are valid.
First, note that when q = 0, Eqs. (53) and (55) can be obtained directly from Lemma 3.1.
Second, we assume the validity of Eqs. (53) and (55) with q = i − 1 and prove their validity
with q = i. In fact, by Lemma 3.3 we can conclude that when ‖x_{j(p+1)}^CF − x*‖ ≤ δ,
    ν[x_{j(p+1)}^CF, x_{j(p+1)+i+1}^PCG] ≥ min{α_1, α_2, α_3},   (56)
where
    α_1 = ν[x_{j(p+1)}^CF, x_{j(p+1)+i}^PCG] + min{1, ν[x_{j(p+1)}^CF, x_{j(p+1)+i}^PCG]/3^i} l_i^N + θ_2,
    α_2 = 2ν[x_{j(p+1)}^CF, x_{j(p+1)+i}^PCG] + min{1, ν[x_{j(p+1)}^CF, x_{j(p+1)+i}^PCG]/3^i} l_i^H + θ_2,
    α_3 = 3ν[x_{j(p+1)}^CF, x_{j(p+1)+i}^PCG] + θ_2.
It is easy to see from the induction assumption that
    ν[x_{j(p+1)}^CF, x_{j(p+1)+i}^PCG] ≥ 3^i + ((3^i − 1)/2) θ = 3^i (1 + ((1 − 1/3^i)/2) θ).
Thus,
    α_1 ≥ 3 + Σ_{t=1}^{i−1} l_t^N + ((3^i − 1)/2) θ + (1 + ((1 − 1/3^i)/2) θ) l_i^N + θ
        = 3 + Σ_{t=1}^{i} l_t^N + (1 + (3^i − 1)/2 + ((1 − 1/3^i)/2) l_i^N) θ
        ≥ 3 + Σ_{t=1}^{i} l_t^N + ((3^{i+1} − 1)/2) θ,   (57)
    α_2 ≥ 2(3 + Σ_{t=1}^{i−1} l_t^N + ((3^i − 1)/2) θ) + (1 + ((1 − 1/3^i)/2) θ) l_i^H + θ
        = (3 + Σ_{t=1}^{i−1} l_t^N) + (3 + Σ_{t=1}^{i−1} l_t^N + l_i^H) + ((3^i − 1) + ((1 − 1/3^i)/2) l_i^H + 1) θ
        = 3 + Σ_{t=1}^{i} l_t^N + ((3^{i+1} − 1)/2) θ,   (58)
    α_3 ≥ 3(3 + Σ_{t=1}^{i−1} l_t^N + ((3^i − 1)/2) θ) + θ
        = 3 + Σ_{t=1}^{i−1} l_t^N + 2(3 + Σ_{t=1}^{i−1} l_t^N) + (3(3^i − 1)/2) θ + θ
        = 3 + Σ_{t=1}^{i} l_t^N + ((3^{i+1} − 1)/2) θ.   (59)
The validity of Eq. (53) follows from Eqs. (56)–(59). Combining Eq. (53) with the condition
‖x_{j(p+1)}^CF − x*‖ ≤ δ yields Eq. (55).
THEOREM 3.1 Algorithm Model NIMTH(p) is locally convergent. Furthermore, there exists
a constant C_3 > 0, such that
    ‖x_{(j+1)(p+1)}^CF − x*‖ ≤ C_3 ‖x_{j(p+1)}^CF − x*‖^{3^{p+1}}.   (60)
Proof Setting q = p in Lemma 3.4, the conclusion is obtained.
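For orientation (our own illustrative reading, not a statement from the paper): with p = 2, Eq. (60) says that each cycle of three steps has convergence order 3^3 = 27, i.e. the same average order 3 per step as IMTH, while only one of the three steps pays for a full Cholesky factorization.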
4 EFFICIENCY ANALYSIS OF ALGORITHM MODEL NIMTH(p)
In this section, we will analyze the efficiency of Algorithm Model NIMTH(p) after examining
the computation cost.
4.1 The Computation Cost of Algorithm Model NIMTH(p)
4.1.1 The Computation Cost of Automatic Differentiation Algorithms
Consider the computation cost Q_{AD1} to evaluate a gradient ∇f by Algorithm AD1. By Eq. (32)
in Ref. [7, Chap. 3], Section 3.4, we have
    Q_{AD1} ≤ 4 Q_f,   (61)
where
    Q_f = the computation cost to evaluate a function value f(x).   (62)
The computation cost Q_{AD2}(m) of Algorithm AD2 consists of two parts: the cost Q_{AD1}
to evaluate a gradient ∇f and the cost Q_{HV}(m) to evaluate m Hessian-vector products after
∇f(x) is evaluated. Using Eq. (14) in Ref. [7, Chap. 3], Section 3.2, we have
    Q_{HV}(m) ≤ 1.5 m Q_{AD1} ≤ 6 m Q_f.   (63)
Therefore,
    Q_{AD2}(m) = Q_{AD1} + Q_{HV}(m) ≤ (1 + 1.5m) Q_{AD1} ≤ (4 + 6m) Q_f,   (64)
where the last inequality comes from Eq. (61).
Let Q_{AD3} be the computation cost involved in Algorithm AD3. Using Eq. (14) in Ref. [7,
Chap. 3], Section 3.2, we arrive at
    Q_{AD3} ≤ 20 Q_f.   (65)
4.1.2 The Computation Cost of a CF Step
The computation cost of a CF step with an extra gradient evaluation consists of four parts:
(1) Evaluating a gradient ∇f and a Hessian ∇^2 f. This is completed by using Algorithm
AD2(n) with ẋ_i = e_i, i = 1, . . . , n, where e_i is the ith Cartesian basis vector in
R^n. The corresponding computation cost is Q_{AD2}(n).
(2) Solving the Newton equation by CF. Denoting the corresponding computation cost by Q_D^−,
we have
    Q_D^− = (1/6) n^3 + (3/2) n^2 − (2/3) n,   (66)
where only the multiplicative operations are counted.
(3) Evaluating a tensor–vector product ∇^3 f s^1 s^1. This is completed by using Algorithm AD3
with x = x_k, ẋ = s^1, where s^1 is the solution of the Newton equation (3). The corresponding computation cost is Q_{AD3} − Q_{AD1}.
(4) Solving the Newton-like equation. The corresponding multiplicative computation
cost is n^2.
Thus, by Eqs. (64) and (63), the total computation cost of one CF step with an extra gradient
evaluation is
    Q_{AD2}(n) + Q_{AD3} − Q_{AD1} + Q_D^− + n^2 = Q_{AD1} + Q_{HV}(n) + Q_{AD3} − Q_{AD1} + Q_D^− + n^2
        = Q_{HV}(n) + Q_{AD3} + Q_D^− + n^2
        ≤ (6n + 20) Q_f + Q_D,   (67)
where
    Q_D = Q_D^− + n^2 = (1/6) n^3 + (5/2) n^2 − (2/3) n.   (68)
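To get a sense of scale (an illustrative calculation of ours, not from the paper): for n = 1000, Q_D ≈ n^3/6 ≈ 1.7 × 10^8 multiplications, whereas one PCG subiteration costs only about n^2 ≈ 10^6 multiplications (cf. Eq. (69) below), roughly 170 times less; this gap is what the PCG steps of the model exploit.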
4.1.3 The Computation Cost of the PCG Steps
For the PCG steps, examine the tth (1 ≤ t ≤ p) PCG step after a CF step. According to
Algorithm Model NIMTH(p), the subiteration number in the tth PCG step is not greater than
3^{t+1}. Therefore, the computation cost of the tth PCG step consists of two parts:
(1) Evaluating a gradient ∇f, a tensor–vector product (1/2)∇^3 f(x)qq and at most 3^{t+1}
Hessian-vector products ∇^2 f q. Using Algorithm AD2(m) with m ≤ 3^{t+1}, the computation
cost is not greater than Q_{AD2}(3^{t+1}).
(2) Executing at most 3^{t+1} subiterations. Denote by
    Q_I = Q_I(n) = n^2 + 6n + 2   (69)
the multiplicative computation cost of one PCG subiteration. The computation cost is
not greater than 3^{t+1} Q_I.
Thus, by Eq. (64), the total computation cost of p PCG steps with p extra gradient evaluations
has the upper bound
    Σ_{t=1}^{p} [Q_{AD3} + Q_{AD2}(3^{t+1}) + 3^{t+1} Q_I] ≤ (20p + 6·(1/2)(3^{p+2} − 3^2)) Q_f + (1/2)(3^{p+2} − 3^2) Q_I.   (70)
4.1.4 The Computation Cost of Algorithm Model NIMTH(p)
Let W[x_{j(p+1)}^CF, x_{(j+1)(p+1)}^CF] be the computation cost of the jth cycle from x_{j(p+1)}^CF to
x_{(j+1)(p+1)}^CF. Combining Eqs. (67) and (70) and defining
    σ = σ(p) ≝ (1/2)(3^{p+2} − 3^2),   (71)
we conclude that the average computation cost over the p + 1 steps of the jth cycle satisfies
    W[x_{j(p+1)}^CF, x_{(j+1)(p+1)}^CF] / (p + 1) ≤ [(6n + 20p + 6σ + 20) Q_f + Q_D + σ Q_I] / (p + 1) ≝ w(n, Q_f, p).   (72)
4.2 The Efficiency Coefficient of Algorithm Model NIMTH(p)
Now we estimate the efficiency of Algorithm Model NIMTH(p). Here we cite a definition of
the efficiency coefficient given in Ref. [1].
DEFINITION 4.1 Consider the sequence {x_k} generated by Algorithm Model NIMTH(p) and
given in Eq. (52). The efficiency coefficient of the algorithm is defined by
    E(p) = lim inf_{j→∞, {x_k}} ln ν[x_{j(p+1)}^CF, x_{(j+1)(p+1)}^CF] / W[x_{j(p+1)}^CF, x_{(j+1)(p+1)}^CF].   (73)
THEOREM 4.1 Suppose that the efficiency coefficient E(p) of Algorithm Model NIMTH(p) is defined
by Eq. (73). Then it satisfies
    E(p) ≥ (p + 1) ln 3 / W[x_{j(p+1)}^CF, x_{(j+1)(p+1)}^CF] ≥ ln 3 / w(n, Q_f, p) ≝ Ē(p),   (74)
where w(n, Q_f, p) is defined by Eqs. (72), (71), (68), and (69).
Proof By Lemma 3.4, we conclude that when x_0 is close enough to the solution x* to Eq. (1),
the sequence (52) generated by Algorithm Model NIMTH(p) satisfies Eq. (53). Using the
estimate (53) with q = p, we have
    ν[x_{j(p+1)}^CF, x_{(j+1)(p+1)}^CF] ≥ 3 + Σ_{t=1}^{p} 2·3^t + ((3^{p+1} − 2)/2) θ = 3^{p+1} + ((3^{p+1} − 2)/2) θ.   (75)
Then the conclusion (74) comes from Eqs. (73), (75), and (72).
5 NEW ALGORITHM AND ITS THEORETICAL ADVANTAGES
Our Algorithm NIMTH is derived by specifying the parameter p in Algorithm Model
NIMTH(p). Theorem 4.1 shows that Ē(p) is a lower bound of the efficiency coefficient of
Algorithm Model NIMTH(p). It is natural to select p by maximizing Ē(p), or equivalently by minimizing
w(n, Q_f, p) defined by Eq. (72). This leads to our Algorithm NIMTH:
5.1 Algorithm NIMTH
Algorithm NIMTH is obtained from Algorithm Model NIMTH(p) by specifying p = p*,
where p* is the solution to the one-dimensional optimization problem:
    min w(n, Q_f, p) = [(6n + 20p + 6σ + 20) Q_f + Q_D + σ Q_I] / (p + 1),   (76)
    s.t. p is a nonnegative integer,   (77)
where Q_f, Q_D, Q_I, and σ = σ(p) are defined by Eqs. (62), (68), (69), and (71).
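As a purely illustrative sketch (ours, not the authors' code), the problem (76)-(77) can be solved by direct enumeration of p; the search bound p_max and the sample values of n and Q_f are assumptions, the former justified by Lemma 5.2 below, under which p* grows only like ln n / ln 3. The last line evaluates the efficiency ratio of Definition 5.1 below.

    def w_cost(n, Qf, p):
        # Average per-step cost w(n, Qf, p) of one NIMTH(p) cycle, Eq. (72).
        QD = n ** 3 / 6 + 5 * n ** 2 / 2 - 2 * n / 3          # Eq. (68)
        QI = n ** 2 + 6 * n + 2                               # Eq. (69)
        sigma = (3 ** (p + 2) - 9) / 2                        # Eq. (71)
        return ((6 * n + 20 * p + 6 * sigma + 20) * Qf + QD + sigma * QI) / (p + 1)

    def best_p(n, Qf, p_max=30):
        # p* of (76)-(77) by enumeration; p_max is an assumed bound (p* ~ ln n / ln 3).
        return min(range(p_max + 1), key=lambda p: w_cost(n, Qf, p))

    n, Qf = 1000, 5000           # e.g. a function costing about 5n multiplications to evaluate
    p_star = best_p(n, Qf)
    gamma = w_cost(n, Qf, 0) / w_cost(n, Qf, p_star)          # efficiency ratio, Definition 5.1
    print(p_star, round(gamma, 2))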
In order to compare Algorithm NIMTH with the original Algorithm IMTH, note that the
latter can also be obtained from Algorithm Model NIMTH(p) by specifying p = 0. Therefore,
according to Theorem 4.1, the following efficiency ratio reflects, in some sense, the improvement
of Algorithm NIMTH over Algorithm IMTH: the larger this efficiency ratio, the more superior
Algorithm NIMTH.
DEFINITION 5.1 The efficiency ratio of Algorithm NIMTH versus Algorithm IMTH is
defined as
    γ(n, Q_f) = R(n, Q_f, p*) = Ē(p*)/Ē(0) = w(n, Q_f, 0)/w(n, Q_f, p*).   (78)
In order to estimate the efficiency ratio, the following two lemmas are needed.
LEMMA 5.1 When n ≥ 100, the solution p* to the one-dimensional optimization problem (76)
and (77) satisfies:
    ((1/9)n − 3) / (3 ln(((1/9)n − 3)/e)) < 3^{p*+1} < 3((2/3)n − 3) / (ln(((2/3)n − 3)/e) − ln ln(((2/3)n − 3)/e)).   (79)
Proof Suppose p_1 is the solution to the optimization problem
    min_{p≥0} w(n, Q_f, p)   (80)
with the continuous variable p, where w(n, Q_f, p) is defined by Eq. (72). Obviously,
the solution p* = p(n, Q_f) to Eqs. (76) and (77) and the solution p_1 = p_1(n, Q_f) have the
relationship
    ⌊p_1⌋ ≤ p* ≤ ⌈p_1⌉.   (81)
By
    ∂w(n, Q_f, p_1)/∂p_1 = 0,
we have
    (3^{p_1+1}/e) ln(3^{p_1+1}/e) = (β − 3)/e,   (82)
where
    β = β(n, Q_f) = (2/3)(Q_D + 6n Q_f)/(Q_I + 6 Q_f).   (83)
Obviously, Eq. (82) has a unique solution p_1 when n ≥ 100 and Q_f ≥ 0. Now let us estimate p_1.
By the monotonicity of p_1 with respect to β in Eq. (82) and the monotonicity of β with
respect to n and Q_f in Eq. (83), it is not difficult to see that when n ≥ 100 and Q_f ≥ 0,
p_1(n, Q_f) is increasing with respect to both n and Q_f. When n ≥ 100,
    e^2 + 3 < (1/9) n < β = β(n, Q_f) < (2/3) n.   (84)
In addition, notice that the function
    (3^{p_1+1}/e) ln(3^{p_1+1}/e)
is increasing when p_1 ≥ 0. Hence, by Eqs. (84) and (82) we conclude that when n ≥ 100, p_1
satisfies
    (β − 3)/ln((β − 3)/e) < 3^{p_1+1} < (β − 3)/(ln((β − 3)/e) − ln ln((β − 3)/e)).   (85)
By Eq. (81), this leads to the conclusion that, when n ≥ 100,
    σ_1(β) < 3^{p*+1} < σ_2(β),   (86)
where
    σ_1(β) = (β − 3)/(3 ln((β − 3)/e))
and
    σ_2(β) = 3(β − 3)/(ln((β − 3)/e) − ln ln((β − 3)/e)).
Thus, by Eq. (84) and the monotonicity of σ_1(β) and σ_2(β) with respect to β, Eq. (79) is
obtained.
LEMMA 5.2 The solution p* to the optimization problem (76) and (77) satisfies:
(1) When n ≥ 100,
    p* ≥ 1.   (87)
(2) When n → +∞,
    lim sup_{n→+∞} 3^{p*+1}/(n/ln n) ≤ 2,   (88)
    lim inf_{n→+∞} 3^{p*+1}/(n/ln n) ≥ 1/27.   (89)
(3) When n → +∞,
    p* ∼ ln n / ln 3.   (90)
Proof
In fact, Q_D > 9Q_I when n ≥ 100 and Q_f ≥ 0. So
    w(n, Q_f, 0) − w(n, Q_f, 1) = (1/2)[(Q_D − 9Q_I) + (6n − 54)Q_f] > 0.
Therefore,
    w(n, Q_f, 0) > w(n, Q_f, 1).   (91)
Thus Eq. (87) is proved.
By Eq. (79), it is easy to see the validity of Eqs. (88)–(90).
The next remark is concerned with the maximum subiteration number in the PCG steps of our algorithm.
Remark 5.1 Suppose n ≥ 100. It is shown by Lemma 5.2 that p* ≥ 1. Therefore,
    w(n, Q_f, p* − 1) > w(n, Q_f, p*).
This yields that the computation cost of the p*th PCG step is less than that of a CF step:
    l_{p*}^N Q_I + l_{p*}^H Q_I = 2·3^{p*} Q_I + 3^{p*} Q_I < Q_D,
where l_{p*}^N and l_{p*}^H are defined in Step 0 of Algorithm NIMTH. So the subiteration number in
the p*th PCG step satisfies
    l_{p*}^N + l_{p*}^H < Q_D/Q_I = ((1/6)n^3 + (5/2)n^2 − (2/3)n)/(n^2 + 6n + 2) < n.
Therefore, we have
    l_{p*}^N + l_{p*}^H < n.
Noticing that
    max{l_1^N + l_1^H, l_2^N + l_2^H, . . . , l_{p*}^N + l_{p*}^H} = l_{p*}^N + l_{p*}^H,
we conclude that the maximum of the subiteration numbers in the PCG steps within a cycle is
less than n. In practice, the maximum is much less than n. This is the reason for the
efficiency of our algorithm.
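As an illustrative calculation (ours, not from the paper): for n = 1000 one has Q_D/Q_I ≈ 168, so even the largest PCG step in a cycle performs fewer than 168 subiterations of roughly n^2 multiplications each, against the roughly n^3/6 ≈ 1.7 × 10^8 multiplications of a fresh CF.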
THEOREM 5.1 When n ≥ 100, the efficiency ratio γ(n, Q_f) satisfies:
(1) γ(n, Q_f) > 1.
(2) For fixed n, γ(n, Q_f) is strictly increasing with respect to Q_f ≥ 0.
(3) γ(n, 0) is strictly increasing with respect to n.
(4) When n → +∞, for all Q_f > 0,
    γ(n, Q_f) > γ(n, 0) ∼ ln n / ln 3.
Proof In the following, n ≥ 100 and Q_f ≥ 0 are always assumed.
(1) By Lemma 5.2(1), p* ≥ 1 when n ≥ 100. Therefore, by Eq. (91),
    γ(n, Q_f) = R(n, Q_f, p*) = w(n, Q_f, 0)/w(n, Q_f, p*) ≥ w(n, Q_f, 0)/w(n, Q_f, 1) > 1.   (92)
(2) Define
    c(p) = p/σ(p),   (93)
where σ(p) is defined in Eq. (71), and regard σ(p) and c(p) as functions of the continuous
variable p ≥ 1. Because
    c′(p) = [p/((1/2)(3^{p+2} − 3^2))]′ = 2(3^p − 1 − p 3^p ln 3)/(9(3^p − 1)^2) < 0,
by Eq. (87), we can get
    c(p*) = p*/σ* = p*/σ(p*) ≤ 1/σ(1) = 1/9.
Therefore,
    (6 + 20 c(p*))/(6n + 20) ≤ (6 + 20/9)/(6n + 20) < Q_I/Q_D,
that is,
    (6σ* + 20p*)/(6n + 20) < σ* Q_I/Q_D.   (94)
Thus,
    (20 + 6n + 20p* + 6σ*)/((1 + p*)(20 + 6n)) < (Q_D + σ* Q_I)/((1 + p*) Q_D).   (95)
Therefore, it is not difficult to get that when n is fixed and Q_{f1} > Q_f ≥ 0,
    (p* + 1)((20 + 6n)Q_f + Q_D) / [(6n + 20 + 20p* + 6σ*)Q_f + Q_D + σ* Q_I]
        < (p* + 1)((20 + 6n)Q_{f1} + Q_D) / [(6n + 20 + 20p* + 6σ*)Q_{f1} + Q_D + σ* Q_I],   (96)
and, by Eq. (78),
    γ(n, Q_f) = R(n, Q_f, p*) < R(n, Q_{f1}, p*).   (97)
If p_1^* ≝ p(n, Q_{f1}), then
    R(n, Q_{f1}, p*) ≤ R(n, Q_{f1}, p_1^*) = γ(n, Q_{f1}).   (98)
Combining Eqs. (97) and (98), we have, when n is fixed and Q_{f1} > Q_f ≥ 0,
    γ(n, Q_f) < γ(n, Q_{f1}).   (99)
The conclusion (2) is obtained.
(3) Denote T(n) = Q_D/Q_I; then it is easy to prove that
    T(n) < T(n + 1).   (100)
By Eqs. (100) and (78) and
    γ(n, 0) = (p_0^*(n) + 1) Q_D/(Q_D + σ_0^*(n) Q_I) = (p_0^*(n) + 1)/(1 + σ_0^*(n)/T(n)),   (101)
where p_0^*(n) = p(n, 0) and σ_0^*(n) = σ(p_0^*(n)), we have
    γ(n, 0) < (p_0^*(n) + 1)/(1 + σ_0^*(n)/T(n + 1)) < (p_0^*(n + 1) + 1)/(1 + σ_0^*(n + 1)/T(n + 1)) = γ(n + 1, 0).
The conclusion (3) is obtained.
(4) By Eqs. (101), (90), (88), and (89), the validity of the conclusion (4) is obtained.
Theorem 5.1 is also supported by our preliminary numerical experiments. The details are
omitted.
Acknowledgments
The work was supported by the National Science Foundation of China (Grant No. 10071094),
Research Grants Council of the Hong Kong Special Administrative Region, China (Grant
CityU 1066/00P), and the Talent Foundation of Beijing (Grant No. Kw0603200352).
References
[1] R. Brent (1973). Some efficient algorithms for solving systems of nonlinear equations. SIAM J. Numerical Anal.,
10, 327–344.
[2] M. Bartholomew-Biggs, S. Brown, B. Christianson and L.C.W. Dixon (2000). Automatic differentiation of
algorithms. J. Comput. Appl. Math., 12, 171–190.
[3] L.C.W. Dixon (2000). On the Deng-Wang theorem. OR Trans., 4, 42–48.
[4] R. Dembo, S. Eisenstat and T. Steihaug (1982). Inexact Newton method. SIAM J. Numerical Anal., 19, 400–408.
[5] N.Y. Deng and Z.Z. Wang (2000). Theoretical efficiency of an inexact Newton method. J. Optim. Theor. Appl.,
105, 97–112.
[6] N.Y. Deng, H.B. Zhang and C.H. Zhang (2001). Further improvement of the Newton- PCG algorithm with
automatic differentiation. Optim. Methods Software, 16, 151–178.
[7] A. Griewank (2000). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Frontiers
in Appl. Math., Vol. 19, SIAM, Philadelphia.
[8] R. Jackson and G. McCormick (1986). The polyadic structure of factorable function tensors with applications
to high-order minimization techniques. J. Optim. Theor. Appl., 51(1), 63–94.
[9] C. Kelly (1995). Iterative Methods for Linear and Nonlinear Equations, SIAM, Philadelphia.
[10] R. Kalaba and A. Tischler (1983). A generalized Newton algorithm using high order derivatives. J. Optim. Theor.
Appl., 39, 1–17.
[11] G. Ostrovskii, Yu. Volin and W. Borisov (1971). Über die Berechnung von Ableitungen. Wissenschaftliche
Zeitschrift der Technischen Hochschule für Chemie, Leuna-Merseburg, 13(4), 382–384.
[12] A.H. Sherman (1978). On Newton-iterative methods for the solution of systems of nonlinear equations. SIAM J.
Numer. Anal., 15, 755–771.
[13] P.L. Toint (1981). Towards an Efficient Sparsity Exploiting Newton Method for Minimization, Sparse Matrices
and Their Uses, In: I.S. Duff (Ed.), Academic Press, London, England, pp. 57–88.
[14] J.E. Tolsma and P.I. Barton (1998). On computational differentiation. Comput. Chem. Eng., 22(4/5), 475–490.
[15] R. Wengert (1964). A simple automatic derivative evaluation program. Commun. ACM, 7(8), 463–464.