5 Overview of algorithms for unconstrained optimization
5.1 General optimization algorithm
Recall: we are attempting to solve the problem
$$(P)\qquad \min\ f(x) \quad \text{s.t.}\ x \in X,$$

where $f(x)$ is differentiable and $X \subset \mathbb{R}^n$ is an open set.
Solutions to optimization problems are almost always impossible to obtain directly (or "in closed form"), with a few exceptions. Hence, for the most part, we will solve these problems with iterative algorithms. These algorithms typically require the user to supply a starting point $x^0 \in X$. Beginning at $x^0$, an iterative algorithm will generate a sequence of points $\{x^k\}_{k=0}^{\infty}$ called iterates. In deciding how to generate the next iterate, $x^{k+1}$, the algorithms use information about the function $f$ at the current iterate, $x^k$, and sometimes past iterates $x^0, \dots, x^{k-1}$. In practice, rather than constructing an infinite sequence of iterates, algorithms stop when an appropriate termination criterion is satisfied, indicating either that the problem has been solved within a desired accuracy, or that no further progress can be made.
Most algorithms for unconstrained optimization we will discuss fall into the category of directional
search algorithms:
General directional search optimization algorithm

Initialization: Specify an initial guess of the solution $x^0$.

Iteration: For $k = 0, 1, \dots$:
If $x^k$ is optimal, stop. Otherwise,
• Determine $d^k$, a search direction;
• Determine $\alpha_k > 0$, a step size;
• Determine $x^{k+1} = x^k + \alpha_k d^k$, a new estimate of the solution.
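Schematically, in Python (a minimal sketch; the callables `direction` and `step_size` are illustrative stand-ins for the choices discussed in the next two subsections, not part of the notes):

```python
import numpy as np

def directional_search(f, grad, x0, direction, step_size,
                       tol=1e-8, max_iter=1000):
    """Generic directional search: x_{k+1} = x_k + alpha_k * d_k."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:      # termination test (Section 5.1.3)
            return x, k
        d = direction(x, g)               # e.g., -g for steepest descent
        alpha = step_size(f, grad, x, d)  # e.g., exact or backtracking search
        x = x + alpha * d
    return x, max_iter
```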
5.1.1 Choosing the direction
Typically, we require that $d^k$ is a descent direction of $f$ at $x^k$, that is,
$$f(x^k + \alpha d^k) < f(x^k) \quad \forall \alpha \in (0, \epsilon]$$
for some $\epsilon > 0$. For the case when $f$ is differentiable, we have shown in Theorem 4.1 that any $d^k$ such that $\nabla f(x^k)^T d^k < 0$ is a descent direction whenever $\nabla f(x^k) \neq 0$.

Often, the direction is chosen to be of the form
$$d^k = -D^k \nabla f(x^k),$$
where $D^k$ is a positive definite symmetric matrix. (Why is it important that $D^k$ is positive definite?)
The following are the two basic methods for choosing the matrix $D^k$ at each iteration; they give rise to two classic algorithms for unconstrained optimization we are going to discuss in class:

Steepest descent: $D^k = I$, $k = 0, 1, 2, \dots$

Newton's method: $D^k = H(x^k)^{-1}$ (provided $H(x^k)$ is positive definite).

5.1.2 Choosing the stepsize
After $d^k$ is fixed, $\alpha_k$ ideally would solve the one-dimensional optimization problem
$$\min_{\alpha \geq 0} f(x^k + \alpha d^k).$$
This optimization problem is usually also impossible to solve exactly. Instead, $\alpha_k$ is computed (via an iterative procedure referred to as line search) either to approximately solve the above optimization problem, or to ensure a "sufficient" decrease in the value of $f$.
5.1.3 Testing for optimality
Based on the optimality conditions, $x^k$ is locally optimal if $\nabla f(x^k) = 0$ and $H(x^k)$ is positive definite. However, such a point is unlikely to be found. In fact, most of the analysis of the algorithms in the above form deals with their limiting behavior, i.e., analyzes the limit points of the infinite sequence of iterates generated by the algorithm. Thus, to implement the algorithm in practice, more realistic termination criteria need to be implemented. They often hinge, at least in part, on approximately satisfying, to a certain tolerance, the first order necessary condition for optimality discussed in the previous section.
5.2 Steepest descent algorithm for minimization
The steepest descent algorithm is a version of the general optimization algorithm that chooses $d^k = -\nabla f(x^k)$ at the $k$th iteration. As a source of motivation, note that $f(x)$ can be approximated by its linear expansion $f(\bar{x} + d) \approx f(\bar{x}) + \nabla f(\bar{x})^T d$. It is not hard to see that so long as $\nabla f(\bar{x}) \neq 0$, the direction
$$\bar{d} = -\frac{\nabla f(\bar{x})}{\|\nabla f(\bar{x})\|} = -\frac{\nabla f(\bar{x})}{\sqrt{\nabla f(\bar{x})^T \nabla f(\bar{x})}}$$
minimizes the above approximation over all directions of unit length. Indeed, for any direction $d$ with $\|d\| = 1$, the Schwartz inequality yields
$$\nabla f(\bar{x})^T d \geq -\|\nabla f(\bar{x})\| \cdot \|d\| = -\|\nabla f(\bar{x})\| = \nabla f(\bar{x})^T \bar{d}.$$
Of course, if $\nabla f(\bar{x}) = 0$, then $\bar{x}$ is a candidate for a local minimizer, i.e., $\bar{x}$ satisfies the first order necessary optimality condition. The direction $\bar{d} = -\nabla f(\bar{x})$ is called the direction of steepest descent at the point $\bar{x}$.

Note that $\bar{d} = -\nabla f(\bar{x})$ is a descent direction as long as $\nabla f(\bar{x}) \neq 0$. To see this, simply observe that $\bar{d}^T \nabla f(\bar{x}) = -\nabla f(\bar{x})^T \nabla f(\bar{x}) < 0$ so long as $\nabla f(\bar{x}) \neq 0$.
A natural consequence of this is the following algorithm, called the steepest descent algorithm.
Steepest Descent Algorithm:

Step 0: Given $x^0$, set $k \leftarrow 0$.

Step 1: $d^k = -\nabla f(x^k)$. If $d^k = 0$, then stop.

Step 2: Choose stepsize $\alpha_k$ by performing an exact (or inexact) line search.

Step 3: Set $x^{k+1} \leftarrow x^k + \alpha_k d^k$, $k \leftarrow k + 1$. Go to Step 1.

Note that from Step 2 and the fact that $d^k = -\nabla f(x^k)$ is a descent direction, it follows that $f(x^{k+1}) < f(x^k)$.
The following theorem establishes that under certain assumptions on $f$, the steepest descent algorithm converges regardless of the initial starting point $x^0$ (i.e., it exhibits global convergence).

Theorem 5.1 (Convergence Theorem; Steepest Descent with exact line search) Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable on the set $S = \{x \in \mathbb{R}^n : f(x) \leq f(x^0)\}$, and that $S$ is a closed and bounded set. Suppose further that the sequence $\{x^k\}$ is generated by the steepest descent algorithm with stepsizes $\alpha_k$ chosen by an exact line search. Then every point $\bar{x}$ that is a limit point of the sequence $\{x^k\}$ satisfies $\nabla f(\bar{x}) = 0$.
Proof: The proof of this theorem is by contradiction. By the Weierstrass Theorem, at least one limit point of the sequence $\{x^k\}$ must exist. Let $\bar{x}$ be any such limit point. Without loss of generality, assume that $\lim_{k\to\infty} x^k = \bar{x}$, but that $\nabla f(\bar{x}) \neq 0$. This being the case, there is a value of $\bar\alpha > 0$ such that $\delta := f(\bar{x}) - f(\bar{x} + \bar\alpha\bar{d}) > 0$, where $\bar{d} = -\nabla f(\bar{x})$. Then also $(\bar{x} + \bar\alpha\bar{d}) \in \operatorname{int} S$, because $f(\bar{x} + \bar\alpha\bar{d}) < f(\bar{x}) \leq f(x^0)$.

Let $\{d^k\}$ be the sequence of directions generated by the algorithm, i.e., $d^k = -\nabla f(x^k)$. Since $f$ is continuously differentiable, $\lim_{k\to\infty} d^k = \bar{d}$. Then, since $(\bar{x} + \bar\alpha\bar{d}) \in \operatorname{int} S$ and $(x^k + \bar\alpha d^k) \to (\bar{x} + \bar\alpha\bar{d})$, for $k$ sufficiently large we have $x^k + \bar\alpha d^k \in S$ and
$$f(x^k + \bar\alpha d^k) \leq f(\bar{x} + \bar\alpha\bar{d}) + \frac{\delta}{2} = f(\bar{x}) - \delta + \frac{\delta}{2} = f(\bar{x}) - \frac{\delta}{2}.$$
However,
$$f(\bar{x}) \leq f(x^k + \alpha_k d^k) \leq f(x^k + \bar\alpha d^k) \leq f(\bar{x}) - \frac{\delta}{2},$$
which is, of course, a contradiction. Thus $\bar{d} = -\nabla f(\bar{x}) = 0$. ∎
An example. Suppose $f(x)$ is a simple quadratic function of the form
$$f(x) = \tfrac{1}{2}x^T Qx + q^T x,$$
where $Q$ is a positive definite symmetric matrix. The optimal solution of (P) is easily computed as
$$x^\star = -Q^{-1}q$$
(since $Q$ is positive definite, it is non-singular) and direct substitution shows that the optimal objective function value is
$$f(x^\star) = -\tfrac{1}{2}q^T Q^{-1}q.$$
For convenience, let $x$ denote the current point in the steepest descent algorithm. We have
$$f(x) = \tfrac{1}{2}x^T Qx + q^T x,$$
and let $d$ denote the current direction, which is the negative of the gradient, i.e.,
$$d = -\nabla f(x) = -Qx - q.$$
Now let us compute the next iterate of the steepest descent algorithm. If $\alpha$ is the generic stepsize, then
$$\begin{aligned}
f(x + \alpha d) &= \tfrac{1}{2}(x + \alpha d)^T Q(x + \alpha d) + q^T(x + \alpha d) \\
&= \tfrac{1}{2}x^T Qx + \alpha d^T Qx + \tfrac{1}{2}\alpha^2 d^T Qd + q^T x + \alpha q^T d \\
&= f(x) - \alpha d^T d + \tfrac{1}{2}\alpha^2 d^T Qd.
\end{aligned}$$
Optimizing the value of $\alpha$ in this last expression yields
$$\alpha = \frac{d^T d}{d^T Qd},$$
and the next iterate of the algorithm then is
$$x' = x + \alpha d = x + \frac{d^T d}{d^T Qd}\,d, \quad \text{where } d = -\nabla f(x),$$
and
$$f(x') = f(x + \alpha d) = f(x) - \alpha d^T d + \tfrac{1}{2}\alpha^2 d^T Qd = f(x) - \frac{1}{2}\frac{(d^T d)^2}{d^T Qd}.$$
Suppose that
$$Q = \begin{pmatrix} +4 & -2 \\ -2 & +2 \end{pmatrix} \quad \text{and} \quad q = \begin{pmatrix} +2 \\ -2 \end{pmatrix}.$$
Then
$$\nabla f(x) = \begin{pmatrix} +4 & -2 \\ -2 & +2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} +2 \\ -2 \end{pmatrix},$$
and so
$$x^\star = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \quad \text{and} \quad f(x^\star) = -1.$$
Suppose that $x^0 = (0, 0)$. Then we have:
$$x^1 = (-0.4,\, 0.4), \quad x^2 = (0,\, 0.8), \ \text{etc.},$$
and the even numbered iterates satisfy
$$x^{2n} = (0,\, 1 - 0.2^n) \quad \text{and} \quad f(x^{2n}) = (1 - 0.2^n)^2 - 2 + 2(0.2)^n,$$
and so
$$\|x^{2n} - x^\star\| = 0.2^n, \qquad f(x^{2n}) - f(x^\star) = (0.2)^{2n}.$$
Therefore, starting from the point $x^0 = (0, 0)$, the distance from the current iterate to the optimal solution goes down by a factor of 0.2 after every two iterations of the algorithm (a similar observation can be made about the progress of the objective function values). The graph below plots the progress of the sequence $\|x^k - x^\star\|$ as a function of the iteration number; notice that the y-axis is drawn on a logarithmic scale, which allows us to better visualize the progress of the algorithm as the values of $\|x^k - x^\star\|$ approach zero.
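These iterates are easy to reproduce; a short sketch using the exact stepsize $\alpha = d^Td/(d^TQd)$ derived above:

```python
import numpy as np

Q = np.array([[4.0, -2.0], [-2.0, 2.0]])
q = np.array([2.0, -2.0])
x = np.zeros(2)                      # x0 = (0, 0)
for k in range(10):
    d = -(Q @ x + q)                 # steepest descent direction
    alpha = (d @ d) / (d @ Q @ d)    # exact line search stepsize
    x = x + alpha * d
    print(k + 1, x, np.linalg.norm(x - np.array([0.0, 1.0])))
```

The first two printed iterates are $(-0.4, 0.4)$ and $(0, 0.8)$, and the printed distances shrink by the factor 0.2 every two iterations, as claimed.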
Although it is easy to find the optimal solution of the quadratic optimization problem in closed form,
the above example is relevant in that it demonstrates a typical performance of the steepest descent
algorithm. Additionally, most functions behave as near-quadratic functions in a neighborhood of
the optimal solution, making the example even more relevant.
Termination criteria. Ideally, the algorithm will terminate at a point $x^k$ such that $\nabla f(x^k) = 0$. However, the algorithm is not guaranteed to be able to find such a point in a finite amount of time. Moreover, due to rounding errors in computer calculations, the calculated value of the gradient will have some imprecision in it.

Therefore, in practical algorithms the termination criterion is designed to test whether the above condition is satisfied approximately, so that the resulting output of the algorithm is an approximately optimal solution. A natural termination criterion for the steepest descent could be $\|\nabla f(x^k)\| \leq \epsilon$, where $\epsilon > 0$ is a pre-specified tolerance. However, depending on the scaling of the function, this requirement can be either unnecessarily stringent, or too loose to ensure near-optimality (consider a problem concerned with minimizing distance, where the objective function can be expressed in inches, feet, or miles). Another alternative, which might alleviate the above consideration, is to terminate when $\|\nabla f(x^k)\| \leq \epsilon|f(x^k)|$; this, however, may lead to problems when the objective function at the optimum is zero. A combined approach is then to terminate when
$$\|\nabla f(x^k)\| \leq \epsilon(1 + |f(x^k)|).$$
The value of $\epsilon$ is typically taken to be at most the square root of the machine tolerance (e.g., $\epsilon = 10^{-8}$ if 16-digit computing is used), due to the error incurred in estimating derivatives.
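In code, the combined criterion is a one-line test; a minimal sketch:

```python
import numpy as np

def should_terminate(g, fx, eps=1e-8):
    """Combined test: ||grad f(x)|| <= eps * (1 + |f(x)|)."""
    return np.linalg.norm(g) <= eps * (1.0 + abs(fx))
```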
5.3 Stepsize selection
In the analysis in the above subsection we assumed that the one-dimensional optimization problem invoked in the line search at each iteration of the Steepest Descent algorithm was solved exactly and with perfect precision, which is usually not possible. In this subsection we discuss one of the many practical ways of solving this problem approximately, to determine the stepsize at each iteration of the general directional search optimization algorithm (including steepest descent).
5.3.1 Stepsize selection basics
Suppose that $f(x)$ is a continuously differentiable function, and that we seek to (approximately) solve
$$\bar\alpha = \arg\min_{\alpha > 0} f(\bar{x} + \alpha\bar{d}),$$
where $\bar{x}$ is our current iterate, and $\bar{d}$ is the current direction generated by an algorithm that seeks to minimize $f(x)$. We assume that $\bar{d}$ is a descent direction, i.e., $\nabla f(\bar{x})^T\bar{d} < 0$. Let
$$F(\alpha) = f(\bar{x} + \alpha\bar{d}),$$
whereby $F(\alpha)$ is a function of the scalar variable $\alpha$, and our problem is to solve for
$$\bar\alpha = \arg\min_{\alpha > 0} F(\alpha).$$
Using the chain rule for differentiation, we can show that
$$F'(\alpha) = \nabla f(\bar{x} + \alpha\bar{d})^T\bar{d}.$$
Therefore, applying the necessary optimality conditions to the one-dimensional optimization problem above, we want to find a value $\bar\alpha$ for which $F'(\bar\alpha) = 0$. Furthermore, since $\bar{d}$ is a descent direction, $F'(0) < 0$.
5.3.2 Armijo rule, or backtracking
Although there are iterative algorithms developed to solve the problem $\min_\alpha F(\alpha)$ (or $F'(\alpha) = 0$) "exactly," i.e., with a high degree of precision (such as, for instance, the bisection search algorithm), they are typically too expensive computationally. (Recall that we need to perform a line search at every iteration of our steepest descent algorithm!) On the other hand, if we sacrifice accuracy of the line search, this can cause inferior performance of the overall algorithm.

The Armijo rule, or the backtracking method, is one of several inexact line search methods which guarantee a sufficient degree of improvement in the objective function to ensure the algorithm's convergence.
The Armijo rule requires two parameters: $0 < \mu < 0.5$ and $0 < \beta < 1$. Suppose we are minimizing a function $F(\alpha)$ such that $F'(0) < 0$ (which is indeed the case for the line search problems arising in descent algorithms). Then the first order approximation of $F(\alpha)$ at $\alpha = 0$ is given by $F(0) + \alpha F'(0)$. Define $\hat{F}(\alpha) = F(0) + \mu\alpha F'(0)$ (see figure). A stepsize $\bar\alpha$ is considered acceptable by the Armijo rule only if $F(\bar\alpha) \leq \hat{F}(\bar\alpha)$, that is, if taking a step of size $\bar\alpha$ guarantees sufficient decrease of the function:
$$f(\bar{x} + \bar\alpha\bar{d}) \leq f(\bar{x}) + \mu\bar\alpha\nabla f(\bar{x})^T\bar{d}.$$
[Figure: the line search function $F(\alpha)$, the linear approximation $F(0) + \alpha F'(0)$, and the relaxed line $\hat{F}(\alpha) = F(0) + \mu\alpha F'(0)$.]
Note that the sufficient decrease condition will hold for any small value of $\alpha$. On the other hand, we would like to prevent the step size from being too small, for otherwise our overall optimization algorithm would not be making much progress. To combine these two considerations, we will implement the following iterative backtracking procedure (here we use $\beta = \frac{1}{2}$):

Backtracking line search

Step 0: Set $k = 0$, $\alpha_0 = 1$.

Step k: If $F(\alpha_k) \leq \hat{F}(\alpha_k)$, choose $\alpha_k$ as the step size; stop. If $F(\alpha_k) > \hat{F}(\alpha_k)$, let $\alpha_{k+1} \leftarrow \frac{1}{2}\alpha_k$, $k \leftarrow k + 1$.

Note that as a result of the above iterative scheme, the chosen stepsize is $\alpha = \frac{1}{2^t}$, where $t \geq 0$ is the smallest integer such that $F(1/2^t) \leq \hat{F}(1/2^t)$ (or, for general $\beta$, $F(\beta^t) \leq \hat{F}(\beta^t)$).

Typically, $\mu$ is chosen in the range between 0.01 and 0.3, and $\beta$ between 0.1 and 0.8.
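A direct implementation of the procedure, working with $f$ itself via $F(\alpha) = f(\bar{x} + \alpha\bar{d})$ (a sketch; the function names are illustrative):

```python
import numpy as np

def backtracking(f, grad, x, d, mu=0.1, beta=0.5):
    """Armijo backtracking: scale alpha by beta until
    f(x + alpha*d) <= f(x) + mu * alpha * grad(x)^T d."""
    fx, slope = f(x), grad(x) @ d   # slope = F'(0) < 0 for a descent direction
    alpha = 1.0
    while f(x + alpha * d) > fx + mu * alpha * slope:
        alpha *= beta
    return alpha

# Example: one steepest descent step on f(x) = x1^2 + 10*x2^2.
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
x = np.array([1.0, 1.0])
print(backtracking(f, grad, x, -grad(x)))
```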
Note that if $x^k$ and $x^{k+1}$ are consecutive iterates of the general optimization algorithm with $d^k$ a descent direction, and the stepsizes are chosen by backtracking, then $f(x^{k+1}) < f(x^k)$; that is, the algorithm is guaranteed to produce an improvement in the function value at every iteration.

Under additional assumptions on $f$, it can also be shown that the steepest descent algorithm will demonstrate global convergence properties under the Armijo line search rule, as stated in the following theorem.
Theorem 5.2 (Convergence Theorem; Steepest Descent with backtracking line search) Suppose that the set $S = \{x \in \mathbb{R}^n : f(x) \leq f(x^0)\}$ is closed and bounded, and suppose that the gradient of $f$ is Lipschitz continuous on the set $S$, i.e., there exists a constant $G > 0$ such that
$$\|\nabla f(x) - \nabla f(y)\| \leq G\|x - y\| \quad \forall x, y \in S.$$
Suppose further that the sequence $\{x^k\}$ is generated by the steepest descent algorithm with stepsizes $\alpha_k$ chosen by a backtracking line search. Then every point $\bar{x}$ that is a limit point of the sequence $\{x^k\}$ satisfies $\nabla f(\bar{x}) = 0$.
The additional assumption, basically, ensures that the gradient of $f$ does not change too rapidly. In the proof of the theorem, this allows one to provide a lower bound on the stepsize at each iteration. (See any of the reference textbooks for details.)
Remark: Our discussion so far implicitly assumed that the domain of the optimization problem was the entire $\mathbb{R}^n$. If our optimization problem is
$$(P)\quad \min f(x) \ \text{ s.t. } x \in X,$$
where $X$ is an open set, then the line-search problem is
$$\min_\alpha f(\bar{x} + \alpha\bar{d}) \ \text{ s.t. } \bar{x} + \alpha\bar{d} \in X.$$
In this case, we must ensure that all iterate values of $\alpha$ in the backtracking algorithm satisfy $\bar{x} + \alpha\bar{d} \in X$. As an example, consider the following problem:
$$(P)\qquad \min\ f(x) := -\sum_{i=1}^m \ln(b_i - a_i^T x) \qquad \text{s.t.}\ b - Ax > 0.$$

Here the domain of $f(x)$ is $X = \{x \in \mathbb{R}^n : b - Ax > 0\}$. Given a point $\bar{x} \in X$ and a direction $\bar{d}$, the line-search problem is:

$$\text{(LS)}\qquad \min\ h(\alpha) := f(\bar{x} + \alpha\bar{d}) = -\sum_{i=1}^m \ln(b_i - a_i^T(\bar{x} + \alpha\bar{d})) \qquad \text{s.t.}\ b - A(\bar{x} + \alpha\bar{d}) > 0.$$
Standard arithmetic manipulation can be used to establish that
$$b - A(\bar{x} + \alpha\bar{d}) > 0 \quad \text{if and only if} \quad \check\alpha < \alpha < \hat\alpha,$$
where
$$\check\alpha := \max_{i:\ a_i^T\bar{d} < 0}\left\{\frac{b_i - a_i^T\bar{x}}{a_i^T\bar{d}}\right\} \quad \text{and} \quad \hat\alpha := \min_{i:\ a_i^T\bar{d} > 0}\left\{\frac{b_i - a_i^T\bar{x}}{a_i^T\bar{d}}\right\},$$
and the line-search problem then is:
$$\text{(LS)}:\quad \min\ h(\alpha) := -\sum_{i=1}^m \ln(b_i - a_i^T(\bar{x} + \alpha\bar{d})) \qquad \text{s.t.}\ 0 < \alpha < \hat\alpha.$$
The implementation of the backtracking rule for this problem would have to be modified: starting with $\alpha = 1$, we will backtrack, if necessary, until $\alpha < \hat\alpha$, and only then start checking the sufficient decrease conditions.
5.4 Newton's method for minimization

Again, we want to solve
$$(P)\qquad \min f(x), \quad x \in \mathbb{R}^n.$$
Newton's method can also be interpreted in the framework of the general optimization algorithm, but it truly stems from Newton's method for solving systems of nonlinear equations. Recall that if $\phi : \mathbb{R}^n \to \mathbb{R}^n$, to solve the system of equations
$$\phi(x) = 0,$$
one can apply an iterative method. Starting at a point $\bar{x}$, approximate the function by $\phi(\bar{x} + d) \approx \phi(\bar{x}) + \nabla\phi(\bar{x})^T d$, where $\nabla\phi(\bar{x})^T \in \mathbb{R}^{n\times n}$ is the Jacobian of $\phi$ at $\bar{x}$, and, provided that $\nabla\phi(\bar{x})$ is nonsingular, solve the system of linear equations
$$\nabla\phi(\bar{x})^T d = -\phi(\bar{x})$$
to obtain $d$. Set the next iterate $x = \bar{x} + d$, and continue. This method is well-studied, and is well-known for its good performance when the starting point $\bar{x}$ is chosen appropriately. Newton's method for minimization is precisely an application of this equation-solving method to the (system of) first-order optimality conditions $\nabla f(x) = 0$.
Here is another view of the motivation behind Newton's method for optimization. At $x = \bar{x}$, $f(x)$ can be approximated by
$$f(x) \approx q(x) := f(\bar{x}) + \nabla f(\bar{x})^T(x - \bar{x}) + \tfrac{1}{2}(x - \bar{x})^T H(\bar{x})(x - \bar{x}),$$
which is the quadratic Taylor expansion of $f(x)$ at $x = \bar{x}$. Here $q(x)$ is a quadratic function which is minimized by solving $\nabla q(x) = 0$, i.e., $\nabla f(\bar{x}) + H(\bar{x})(x - \bar{x}) = 0$, which yields
$$x - \bar{x} = -H(\bar{x})^{-1}\nabla f(\bar{x}).$$
The direction $-H(\bar{x})^{-1}\nabla f(\bar{x})$ is called the Newton direction, or the Newton step.
This leads to the following algorithm for solving (P):

Newton's Method:

Step 0: Given $x^0$, set $k \leftarrow 0$.

Step 1: $d^k = -H(x^k)^{-1}\nabla f(x^k)$. If $d^k = 0$, then stop.

Step 2: Choose stepsize $\alpha_k = 1$.

Step 3: Set $x^{k+1} \leftarrow x^k + \alpha_k d^k$, $k \leftarrow k + 1$. Go to Step 1.
Proposition 5.3 If $H(x)$ is p.d., then $d = -H(x)^{-1}\nabla f(x)$ is a descent direction.

Proof: It is sufficient to show that $\nabla f(x)^T d = -\nabla f(x)^T H(x)^{-1}\nabla f(x) < 0$. Since $H(x)$ is positive definite, if $v \neq 0$,
$$0 < (H(x)^{-1}v)^T H(x)(H(x)^{-1}v) = v^T H(x)^{-1}v,$$
completing the proof.
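A bare-bones sketch of the pure method; the direction is obtained by solving the linear system $H(x^k)d = -\nabla f(x^k)$ rather than forming the inverse explicitly:

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-12, max_iter=50):
    """Pure Newton's method: x_{k+1} = x_k - H(x_k)^{-1} grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            return x, k
        d = np.linalg.solve(hess(x), -g)   # Newton direction; O(n^3) work
        x = x + d                          # full step, alpha_k = 1
    return x, max_iter
```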
Note that:

• Work per iteration: $O(n^3)$.

• The iterates of Newton's method are, in general, equally attracted to local minima and local maxima. Indeed, the method is just trying to solve the system of equations $\nabla f(x) = 0$.

• The method assumes $H(x^k)$ is nonsingular at each iteration. Moreover, unless $H(x^k)$ is positive definite, $d^k$ is not guaranteed to be a descent direction.

• There is no guarantee that $f(x^{k+1}) \leq f(x^k)$.

• Step 2 could be augmented by a linesearch of $f(x^k + \alpha d^k)$ over the value of $\alpha$; then the previous consideration would not be an issue.

• What if $H(x^k)$ becomes increasingly singular (or not positive definite)? Use $H(x^k) + \epsilon I$.

• In general, points generated by Newton's method as it is described above may not converge. For example, $H(x^k)^{-1}$ may not exist. Even if $H(x)$ is always non-singular, the method may not converge, unless started "close enough" to the right point.
Example 1: Let $f(x) = 7x - \ln(x)$. Then $\nabla f(x) = f'(x) = 7 - \frac{1}{x}$ and $H(x) = f''(x) = \frac{1}{x^2}$. It is not hard to check that $x^\star = \frac{1}{7} = 0.142857143$ is the unique global minimizer. The Newton direction at $x$ is
$$d = -H(x)^{-1}\nabla f(x) = -\frac{f'(x)}{f''(x)} = -x^2\left(7 - \frac{1}{x}\right) = x - 7x^2,$$
and is defined so long as $x > 0$. So, Newton's method will generate the sequence of iterates $\{x^k\}$ with $x^{k+1} = x^k + (x^k - 7(x^k)^2) = 2x^k - 7(x^k)^2$. Below are some examples of the sequences generated by this method for different starting points:
  k    x^k (x^0 = 1)    x^k (x^0 = 0.1)    x^k (x^0 = 0.01)
  0        1                0.1                0.01
  1       −5                0.13               0.0193
  2                         0.1417             0.03599257
  3                         0.14284777         0.062916884
  4                         0.142857142        0.098124028
  5                         0.142857143        0.128849782
  6                                            0.1414837
  7                                            0.142843938
  8                                            0.142857142
  9                                            0.142857143
 10                                            0.142857143
(note that the iterate in the first column is not in the domain of the objective function, so the
algorithm has to terminate with an error). Below is a plot of the progress of the algorithm as a
function of iteration number (for the two sequences that did converge):
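The iterates are easy to regenerate; a sketch:

```python
# Newton iterates for f(x) = 7x - ln(x); the update reduces to
# x_{k+1} = 2*x_k - 7*x_k**2.
for x0 in (1.0, 0.1, 0.01):
    x, seq = x0, [x0]
    for k in range(10):
        x = 2 * x - 7 * x ** 2
        seq.append(x)
        if x <= 0:              # left the domain {x > 0}: method fails
            break
    print(seq)
```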
Example 2: $f(x) = -\ln(1 - x_1 - x_2) - \ln x_1 - \ln x_2$,
$$\nabla f(x) = \begin{pmatrix} \frac{1}{1 - x_1 - x_2} - \frac{1}{x_1} \\[4pt] \frac{1}{1 - x_1 - x_2} - \frac{1}{x_2} \end{pmatrix}, \qquad
H(x) = \begin{pmatrix} \left(\frac{1}{1 - x_1 - x_2}\right)^2 + \left(\frac{1}{x_1}\right)^2 & \left(\frac{1}{1 - x_1 - x_2}\right)^2 \\[4pt] \left(\frac{1}{1 - x_1 - x_2}\right)^2 & \left(\frac{1}{1 - x_1 - x_2}\right)^2 + \left(\frac{1}{x_2}\right)^2 \end{pmatrix},$$
$$x^\star = \left(\tfrac{1}{3}, \tfrac{1}{3}\right), \qquad f(x^\star) = 3.295836866.$$

  k    x_1^k                x_2^k                 ‖x^k − x̄‖
  0    0.85                 0.05                  0.58925565098879
  1    0.717006802721088    0.0965986394557823    0.450831061926011
  2    0.512975199133209    0.176479706723556     0.238483249157462
  3    0.352478577567272    0.273248784105084     0.0630610294297446
  4    0.338449016006352    0.32623807005996      0.00874716926379655
  5    0.333337722134802    0.333259330511655     7.41328482837195e−5
  6    0.333333343617612    0.33333332724128      1.19532211855443e−8
  7    0.333333333333333    0.333333333333333     1.57009245868378e−16
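A sketch reproducing this table from the formulas for $\nabla f$ and $H$ above:

```python
import numpy as np

def grad(x):
    s = 1.0 / (1.0 - x[0] - x[1])
    return np.array([s - 1.0 / x[0], s - 1.0 / x[1]])

def hess(x):
    s2 = 1.0 / (1.0 - x[0] - x[1]) ** 2
    return np.array([[s2 + 1.0 / x[0] ** 2, s2],
                     [s2, s2 + 1.0 / x[1] ** 2]])

x = np.array([0.85, 0.05])
for k in range(7):
    x = x + np.linalg.solve(hess(x), -grad(x))   # full Newton step
    print(k + 1, x, np.linalg.norm(x - 1.0 / 3.0))
```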
Termination criteria. Since Newton's method is working with the Hessian as well as the gradient, it would be natural to augment the termination criterion we used in the Steepest Descent algorithm with the requirement that $H(x^k)$ is positive semi-definite, or, taking into account the potential for computational errors, that $H(x^k) + \epsilon I$ is positive semi-definite for some $\epsilon > 0$ (this parameter may be different than the one used in the condition on the gradient).
5.5 Comparing performance of the steepest descent and Newton algorithms

5.5.1 Rate of convergence
Suppose we have a converging sequence $\lim_{k\to\infty} s_k = \bar{s}$, and we would like to characterize the speed, or rate, at which the iterates $s_k$ approach the limit $\bar{s}$.

A converging sequence of numbers $\{s_k\}$ exhibits linear convergence if for some $0 \leq C < 1$,
$$\lim_{k\to\infty} \frac{|s_{k+1} - \bar{s}|}{|s_k - \bar{s}|} = C.$$
$C$ in the above expression is referred to as the rate constant; if $C = 0$, the sequence exhibits superlinear convergence.

A converging sequence of numbers $\{s_k\}$ exhibits quadratic convergence if
$$\lim_{k\to\infty} \frac{|s_{k+1} - \bar{s}|}{|s_k - \bar{s}|^2} = \delta < \infty.$$
Examples:

Linear convergence: $s_k = \frac{1}{10^k}$: 0.1, 0.01, 0.001, etc. $\bar{s} = 0$.
$$\frac{|s_{k+1} - \bar{s}|}{|s_k - \bar{s}|} = 0.1.$$

Superlinear convergence: $s_k = 0.1 \cdot \frac{1}{k!}$: $\frac{1}{10}$, $\frac{1}{20}$, $\frac{1}{60}$, $\frac{1}{240}$, $\frac{1}{1200}$, etc. $\bar{s} = 0$.
$$\frac{|s_{k+1} - \bar{s}|}{|s_k - \bar{s}|} = \frac{k!}{(k+1)!} = \frac{1}{k+1} \to 0 \ \text{as } k \to \infty.$$

Quadratic convergence: $s_k = \frac{1}{10^{(2^{k-1})}}$: 0.1, 0.01, 0.0001, 0.00000001, etc. $\bar{s} = 0$.
$$\frac{|s_{k+1} - \bar{s}|}{|s_k - \bar{s}|^2} = \frac{(10^{2^{k-1}})^2}{10^{2^k}} = 1.$$
This illustration compares the rates of convergence of the above sequences (note that the y-axis is
displayed on the logarithmic scale):
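The sequences and the relevant ratios can also be computed directly; a sketch:

```python
from math import factorial

K = 6
linear      = [10.0 ** -(k + 1) for k in range(K)]          # 0.1, 0.01, ...
superlinear = [0.1 / factorial(k + 1) for k in range(K)]    # 1/10, 1/20, 1/60, ...
quadratic   = [10.0 ** -(2 ** k) for k in range(K)]         # 0.1, 0.01, 0.0001, ...

print([linear[k + 1] / linear[k] for k in range(K - 1)])            # constant 0.1
print([superlinear[k + 1] / superlinear[k] for k in range(K - 1)])  # tends to 0
print([quadratic[k + 1] / quadratic[k] ** 2 for k in range(K - 1)]) # constant 1
```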
We will use the notion of rate of convergence to analyze one aspect of performance of optimization algorithms. Indeed, since an algorithm for nonlinear optimization problems, in its abstract form, generates an infinite sequence of points $\{x^k\}$ converging to a solution $\bar{x}$ only in the limit, it makes sense to discuss the rate of convergence of the sequences $e^k = \|x^k - \bar{x}\|$, or $E^k = |f(x^k) - f(\bar{x})|$, which both have limit 0.
5.5.2 Rate of convergence of the steepest descent algorithm for the case of a quadratic function
In this section we explore answers to the question of how fast the steepest descent algorithm converges. Recall that in the earlier example we observed linear convergence of both the sequences $\{E^k\}$ and $\{e^k\}$.

We will show now that the steepest descent algorithm with stepsizes selected by exact line search in general exhibits linear convergence, but that the rate constant depends very much on the ratio of the largest to the smallest eigenvalue of the Hessian matrix $H(x)$ at the optimal solution $x = x^\star$. In order to see how this dependence arises, we will examine the case where the objective function $f(x)$ is itself a simple quadratic function of the form
$$f(x) = \tfrac{1}{2}x^T Qx + q^T x,$$
where $Q$ is a positive definite symmetric matrix. We will suppose that the eigenvalues of $Q$ are
$$A = a_1 \geq a_2 \geq \dots \geq a_n = a > 0,$$
i.e., $A$ and $a$ are the largest and smallest eigenvalues of $Q$.

We already derived that the optimal solution of (P) is
$$x^\star = -Q^{-1}q,$$
with the optimal objective function value
$$f(x^\star) = -\tfrac{1}{2}q^T Q^{-1}q.$$
Moreover, if $x$ is the current point in the steepest descent algorithm, then
$$f(x) = \tfrac{1}{2}x^T Qx + q^T x,$$
and the next iterate of the steepest descent algorithm with exact line search is
$$x' = x + \alpha d = x + \frac{d^T d}{d^T Qd}\,d,$$
where $d = -\nabla f(x)$, and
$$f(x') = f(x) - \alpha d^T d + \tfrac{1}{2}\alpha^2 d^T Qd = f(x) - \frac{1}{2}\frac{(d^T d)^2}{d^T Qd}.$$
Therefore,
$$\begin{aligned}
\frac{f(x') - f(x^\star)}{f(x) - f(x^\star)}
&= \frac{f(x) - \tfrac{1}{2}\frac{(d^T d)^2}{d^T Qd} - f(x^\star)}{f(x) - f(x^\star)}
= 1 - \frac{\tfrac{1}{2}\frac{(d^T d)^2}{d^T Qd}}{\tfrac{1}{2}x^T Qx + q^T x + \tfrac{1}{2}q^T Q^{-1}q} \\
&= 1 - \frac{\tfrac{1}{2}\frac{(d^T d)^2}{d^T Qd}}{\tfrac{1}{2}(Qx + q)^T Q^{-1}(Qx + q)}
= 1 - \frac{(d^T d)^2}{(d^T Qd)(d^T Q^{-1}d)} = 1 - \frac{1}{\beta},
\end{aligned}$$
where
$$\beta = \frac{(d^T Qd)(d^T Q^{-1}d)}{(d^T d)^2}.$$
In order for the convergence constant to be good, which will translate to fast linear convergence, we would like the quantity $\beta$ to be small. The following result provides an upper bound on the value of $\beta$.

Kantorovich Inequality: Let $A$ and $a$ be the largest and the smallest eigenvalues of $Q$, respectively. Then
$$\beta \leq \frac{(A + a)^2}{4Aa}.$$
We will skip the proof of this inequality. Let us apply this inequality to the above analysis. Continuing, we have
$$\frac{f(x') - f(x^\star)}{f(x) - f(x^\star)} = 1 - \frac{1}{\beta} \leq 1 - \frac{4Aa}{(A + a)^2} = \frac{(A - a)^2}{(A + a)^2} = \left(\frac{A/a - 1}{A/a + 1}\right)^2.$$
Note by definition that $A/a$ is always at least 1. If $A/a$ is small (not much bigger than 1), then the convergence constant will be much smaller than 1. However, if $A/a$ is large, then the convergence constant will be only slightly smaller than 1. The following table shows some sample values:
    A       a     Upper bound on 1 − 1/β    Number of iterations to reduce
                                             the optimality gap by 0.10
    1.1     1.0          0.0023                           1
    3.0     1.0          0.25                             2
   10.0     1.0          0.67                             6
  100.0     1.0          0.96                            58
  200.0     1.0          0.98                           116
  400.0     1.0          0.99                           231
Note that the number of iterations needed to reduce the optimality gap by 0.10 grows linearly in
the ratio A/a.
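The table entries follow directly from the bound; a sketch (the iteration count is the smallest $n$ with $c^n \leq 0.1$):

```python
import math

for A in (1.1, 3.0, 10.0, 100.0, 200.0, 400.0):
    a = 1.0
    c = ((A / a - 1) / (A / a + 1)) ** 2        # upper bound on 1 - 1/beta
    iters = math.ceil(math.log(0.1) / math.log(c))
    print(A, round(c, 4), iters)
```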
Two pictures of possible iterations of the steepest descent algorithm are as follows:
[Two figures omitted.]
Some remarks:

• We analyzed the convergence of the function values; the convergence of the algorithm iterates can be easily shown to be linear with the same rate constant.

• The bound on the rate of convergence is attained in practice quite often, which is unfortunate. The ratio of the largest to the smallest eigenvalue of a matrix is called the condition number of the matrix.

• What about non-quadratic functions? If the Hessian at the locally optimal solution is positive definite, the function behaves as a near-quadratic function in a neighborhood of that solution. The convergence exhibited by the iterates of the steepest descent algorithm will also be linear. The analysis of the non-quadratic case gets very involved; fortunately, the key intuition is obtained by analyzing the quadratic case.

• What about backtracking line search? Also linear convergence! (The rate constant depends in part on the backtracking parameters.)
5.5.3 Rate of convergence of the pure Newton's method
We have seen from our examples that, even for convex functions, Newton's method in its pure form (i.e., with stepsize of 1 at every iteration) does not guarantee descent at each iteration, and may produce a diverging sequence of iterates. Moreover, each iteration of Newton's method is much more computationally intensive than that of the steepest descent. However, under certain conditions, the method exhibits a quadratic rate of convergence, making it the "ideal" method for solving convex optimization problems. Recall that a method exhibits quadratic convergence when $\|e^k\| = \|x^k - \bar{x}\| \to 0$ and
$$\lim_{k\to\infty} \frac{\|e^{k+1}\|}{\|e^k\|^2} = C.$$
Roughly speaking, if the iterates converge quadratically, the accuracy (i.e., the number of correct digits) of the solution doubles in a fixed number of iterations.

There are many ways to state and prove results regarding the convergence of Newton's method. We provide one that gives a particular insight into the circumstances under which the pure Newton's method demonstrates quadratic convergence.
Let $\|v\|$ denote the usual Euclidean norm of a vector, namely $\|v\| := \sqrt{v^T v}$. Recall that the operator norm of a matrix $M$ is defined as follows:
$$\|M\| := \max_x\{\|Mx\| : \|x\| = 1\}.$$
As a consequence of this definition, for any $x$, $\|Mx\| \leq \|M\| \cdot \|x\|$.
Theorem 5.4 (Quadratic convergence) Suppose $f(x)$ is twice continuously differentiable and $x^\star$ is a point for which $\nabla f(x^\star) = 0$. Suppose $H(x)$ satisfies the following conditions:

• there exists a scalar $h > 0$ for which $\|[H(x^\star)]^{-1}\| \leq \frac{1}{h}$;

• there exist scalars $\beta > 0$ and $L > 0$ for which $\|H(x) - H(y)\| \leq L\|x - y\|$ for all $x$ and $y$ satisfying $\|x - x^\star\| \leq \beta$ and $\|y - x^\star\| \leq \beta$.

Let $x$ satisfy $\|x - x^\star\| \leq \gamma\bar\beta$, where $0 < \gamma < 1$ and $\bar\beta := \min\left\{\beta, \frac{2h}{3L}\right\}$, and let $x^N := x - H(x)^{-1}\nabla f(x)$. Then:

(i) $\|x^N - x^\star\| \leq \|x - x^\star\|^2 \cdot \dfrac{L}{2(h - L\|x - x^\star\|)}$;

(ii) $\|x^N - x^\star\| < \gamma\|x - x^\star\|$, and hence the iterates converge to $x^\star$;

(iii) $\|x^N - x^\star\| \leq \|x - x^\star\|^2 \cdot \dfrac{3L}{2h}$.
The proof relies on the following two "elementary" facts.

Proposition 5.5 Suppose that $M$ is a symmetric matrix. Then the following are equivalent:

1. $h > 0$ satisfies $\|M^{-1}\| \leq \frac{1}{h}$;

2. $h > 0$ satisfies $\|Mv\| \geq h \cdot \|v\|$ for any vector $v$.
Proposition 5.6 Suppose that $f(x)$ is twice differentiable. Then
$$\nabla f(z) - \nabla f(x) = \int_0^1 [H(x + t(z - x))](z - x)\,dt.$$

Proof: Let $\phi(t) := \nabla f(x + t(z - x))$. Then $\phi(0) = \nabla f(x)$ and $\phi(1) = \nabla f(z)$, and $\phi'(t) = [H(x + t(z - x))](z - x)$. From the fundamental theorem of calculus, we have:
$$\nabla f(z) - \nabla f(x) = \phi(1) - \phi(0) = \int_0^1 \phi'(t)\,dt = \int_0^1 [H(x + t(z - x))](z - x)\,dt.$$
Proof of Theorem 5.4: We have:
$$\begin{aligned}
x^N - x^\star &= x - x^\star - H(x)^{-1}\nabla f(x) \\
&= x - x^\star + H(x)^{-1}(\nabla f(x^\star) - \nabla f(x)) \\
&= x - x^\star + H(x)^{-1}\int_0^1 [H(x + t(x^\star - x))](x^\star - x)\,dt \qquad \text{(from Proposition 5.6)} \\
&= H(x)^{-1}\int_0^1 [H(x + t(x^\star - x)) - H(x)](x^\star - x)\,dt.
\end{aligned}$$
Therefore
$$\begin{aligned}
\|x^N - x^\star\| &\leq \|H(x)^{-1}\| \int_0^1 \|H(x + t(x^\star - x)) - H(x)\| \cdot \|x^\star - x\|\,dt \\
&\leq \|x^\star - x\| \cdot \|H(x)^{-1}\| \int_0^1 L\,t\,\|x^\star - x\|\,dt \\
&= \|x^\star - x\|^2\,\|H(x)^{-1}\|\,L \int_0^1 t\,dt = \frac{\|x^\star - x\|^2\,\|H(x)^{-1}\|\,L}{2}.
\end{aligned}$$
We now bound $\|H(x)^{-1}\|$. Let $v$ be any vector. Then
$$\begin{aligned}
\|H(x)v\| &= \|H(x^\star)v + (H(x) - H(x^\star))v\| \\
&\geq \|H(x^\star)v\| - \|(H(x) - H(x^\star))v\| \\
&\geq h \cdot \|v\| - \|H(x) - H(x^\star)\| \cdot \|v\| \qquad \text{(from Proposition 5.5)} \\
&\geq h \cdot \|v\| - L\|x^\star - x\| \cdot \|v\| \\
&= (h - L\|x^\star - x\|) \cdot \|v\|.
\end{aligned}$$
Invoking Proposition 5.5 again, we see that this implies that
$$\|H(x)^{-1}\| \leq \frac{1}{h - L\|x^\star - x\|}.$$
Combining this with the above yields
$$\|x^N - x^\star\| \leq \frac{\|x^\star - x\|^2\,L}{2(h - L\|x^\star - x\|)},$$
which is (i) of the theorem. Because $L\|x^\star - x\| \leq \gamma\frac{2h}{3} < \frac{2h}{3}$, we have:
$$\|x^N - x^\star\| \leq \|x^\star - x\| \cdot \frac{L\|x^\star - x\|}{2(h - L\|x^\star - x\|)} < \|x^\star - x\| \cdot \frac{\gamma\frac{2h}{3}}{2\left(h - \frac{2h}{3}\right)} = \gamma\|x^\star - x\|,$$
which establishes (ii) of the theorem. Finally, we have
$$\|x^N - x^\star\| \leq \frac{\|x^\star - x\|^2\,L}{2(h - L\|x^\star - x\|)} \leq \frac{\|x^\star - x\|^2\,L}{2\left(h - \frac{2h}{3}\right)} = \|x^\star - x\|^2\,\frac{3L}{2h},$$
which establishes (iii) of the theorem. ∎
Notice that the results regarding the convergence and rate of convergence in the above theorem
are local, i.e., they apply only if the algorithm is initialized at certain starting points (the ones
“sufficiently close” to the desired limit). In practice, it is not known how to pick such starting
points, or to check if the proposed starting point is adequate. (With the very important exception
of self-concordant functions.)
5.6 Further discussion and modifications of the Newton's method

5.6.1 Global convergence for strongly convex functions with a two-phase Newton's method
We have noted that, to ensure descent at each iteration, Newton's method can be augmented by a line search. This idea can be formalized, and the efficiency of the resulting algorithm can be analyzed (see, for example, "Convex Optimization" by Stephen Boyd and Lieven Vandenberghe, available at http://www.stanford.edu/~boyd/cvxbook.html, for a fairly simple presentation of the analysis).

Suppose that $f(x)$ is strongly convex on its domain, i.e., assume there exists $\mu > 0$ such that the smallest eigenvalue of $H(x)$ is greater than or equal to $\mu$ for all $x$, and that the Hessian is Lipschitz continuous everywhere on the domain of $f$. Suppose we apply Newton's method with the stepsize at each iteration determined by the backtracking procedure of Section 5.3.2. That is, at each iteration of the algorithm we first attempt to take a full Newton step, but reduce the stepsize if the decrease in the function value is not sufficient. Then there exist positive numbers $\eta$ and $\gamma$ such that

• if $\|\nabla f(x^k)\| \geq \eta$, then $f(x^{k+1}) \leq f(x^k) - \gamma$, and

• if $\|\nabla f(x^k)\| < \eta$, then the stepsize $\alpha_k = 1$ will be selected, and the next iterate will satisfy $\|\nabla f(x^{k+1})\| < \eta$, and so will all the further iterates. Moreover, quadratic convergence will be observed in this phase.
As hinted above, the algorithm will proceed in two phases: while the iterates are far from the minimizer, a "dampening" of the Newton step will be required, but there will be a guaranteed decrease in the objective function values. This phase (referred to as the "dampened Newton phase") cannot take more than $\frac{f(x^0) - f(x^\star)}{\gamma}$ iterations. Once the norm of the gradient becomes sufficiently small, no dampening of the Newton step will be required in the rest of the algorithm, and quadratic convergence will be observed, thus making it the "quadratically convergent phase."

Note that it is not necessary to know the values of $\eta$ and $\gamma$ to apply this version of the algorithm! The two-phase Newton's method is globally convergent; however, to ensure global convergence, the function being minimized needs to possess particularly nice global properties.
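A sketch of this damped Newton's method, combining the Newton direction of Section 5.4 with the backtracking rule of Section 5.3.2 (the parameter values are illustrative):

```python
import numpy as np

def damped_newton(f, grad, hess, x0, mu=0.1, beta=0.5, tol=1e-10, max_iter=100):
    """Backtrack while far from the minimizer; near it, the full step
    alpha = 1 passes the Armijo test and quadratic convergence sets in."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol * (1.0 + abs(f(x))):
            return x, k
        d = np.linalg.solve(hess(x), -g)          # Newton direction
        alpha, fx, slope = 1.0, f(x), g @ d
        while f(x + alpha * d) > fx + mu * alpha * slope:
            alpha *= beta                         # dampen the step
        x = x + alpha * d
    return x, max_iter
```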
5.6.2 Other modifications of the Newton's method
We have seen that if Newton's method is initialized sufficiently close to the point $\bar{x}$ such that $\nabla f(\bar{x}) = 0$ and $H(\bar{x})$ is positive definite (i.e., $\bar{x}$ is a local minimizer), then it will converge quadratically, using stepsizes of $\alpha = 1$. There are three issues in the above statement that we should be concerned with:
• What if H(x̄) is singular, or nearly-singular?
• How do we know if we are “close enough,” and what to do if we are not?
• Can we modify Newton’s method to guarantee global convergence?
In the previous subsection we "assumed away" the first issue, and, under an additional assumption, showed how to address the other two. What if the function $f$ is not strongly convex, and $H(x)$ may approach singularity?

There are two popular approaches (which are actually closely related) to address these issues. The first approach ensures that the method always uses a descent direction. For example, instead of the direction $-H(x^k)^{-1}\nabla f(x^k)$, use the direction $-(H(x^k) + \epsilon_k I)^{-1}\nabla f(x^k)$, where $\epsilon_k \geq 0$ is chosen so that the smallest eigenvalue of $H(x^k) + \epsilon_k I$ is bounded below by a fixed number $\delta > 0$. It is important to choose the value of $\delta$ appropriately: if it is chosen to be too small, the matrix employed in computing the direction can become ill-conditioned if $H(\bar{x})$ is nearly singular; if it is chosen to be too large, the direction becomes nearly that of the steepest descent algorithm, and hence only linear convergence can be guaranteed. Hence, the value of $\epsilon_k$ is often chosen dynamically.
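A minimal sketch of this first approach, choosing the smallest shift $\epsilon_k$ that enforces the eigenvalue bound $\delta$:

```python
import numpy as np

def modified_newton_direction(H, g, delta=1e-4):
    """Return -(H + eps*I)^{-1} g with eps >= 0 chosen so that the smallest
    eigenvalue of H + eps*I is at least delta."""
    lam_min = np.linalg.eigvalsh(H)[0]     # eigenvalues in ascending order
    eps = max(0.0, delta - lam_min)        # smallest shift achieving the bound
    return np.linalg.solve(H + eps * np.eye(len(g)), -g)
```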
The second approach is the so-called trust region method. Note that the main idea behind Newton's method is to represent the function $f(x)$ by its quadratic approximation $q_k(x) = f(x^k) + \nabla f(x^k)^T(x - x^k) + \tfrac{1}{2}(x - x^k)^T H(x^k)(x - x^k)$ around the current iterate, and then minimize that approximation. While locally the approximation works quite well, this may no longer be the case when a large step is taken. The trust region methods hence find the next iterate by solving the following constrained optimization problem:
$$\min\ q_k(x) \ \text{ s.t. } \|x - x^k\| \leq \Delta_k,$$
i.e., not allowing the next iterate to be outside the neighborhood of $x^k$ where the quadratic approximation is close to the original function $f(x)$ (as it turns out, this problem is not much harder to solve than the unconstrained minimization of $q_k(x)$).

The value of $\Delta_k$ is set to represent the size of the region in which we can "trust" $q_k(x)$ to provide a good approximation of $f(x)$. Smaller values of $\Delta_k$ ensure that we are working with an accurate representation of $f(x)$, but result in conservative steps. Larger values of $\Delta_k$ allow for larger steps, but may lead to inaccurate estimation of the objective function. To account for this, the value of $\Delta_k$ is updated dynamically throughout the algorithm; namely, it is increased if it is observed that $q_k(x)$ provided an exceptionally good approximation of $f(x)$ at the previous iteration, and decreased if the approximation was exceptionally bad.