Constrained Optimization
A Simple Case: Bound-Constrained Optimization. Consider the problem of minimizing a function subject to nonnegativity constraints on the variables:
min f (x) subject to x ≥ 0.    (2.10)
Here “x ≥ 0” indicates that all components of the vector x are required to be nonnegative. Following Chapter 2, we say that x∗ is a local solution of (2.10) if there is
a neighborhood N of x∗ such that
f (x) ≥ f (x∗ ) for all x ∈ N with x ≥ 0.
We can derive first-order necessary conditions for x∗ to be a local solution of
(2.10) by using a slight generalization of the proof technique of Theorem 2.2 in the
book. We claim that these conditions are as follows:
0 ≤ x∗i ⊥ [∇f (x∗ )]i ≥ 0,  i = 1, 2, . . . , n,    (2.11)
where the symbol ⊥ indicates that the product of these two quantities is zero. We
can write this condition even more succinctly as follows:
0 ≤ x∗ ⊥ ∇f (x∗ ) ≥ 0,
(2.12)
where u ⊥ v indicates that uT v = 0.
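Conditions (2.11)–(2.12) are easy to test numerically. The following Python sketch checks the componentwise complementarity at a candidate point; the objective used here is our own illustration (not from the text), chosen so that its minimizer over x ≥ 0 is x∗ = (1, 0)T:

```python
import numpy as np

def satisfies_complementarity(x, g, tol=1e-8):
    """Check 0 <= x  ⊥  g >= 0 componentwise, as in (2.11):
    x >= 0, g >= 0, and x_i * g_i = 0 for each i."""
    return (np.all(x >= -tol)
            and np.all(g >= -tol)
            and np.all(np.abs(x * g) <= tol))

# Illustrative objective (an assumption, not from the text):
# f(x) = (x1 - 1)^2 + (x2 + 1)^2, minimized over x >= 0 at x* = (1, 0).
grad = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] + 1)])

x_star = np.array([1.0, 0.0])
print(satisfies_complementarity(x_star, grad(x_star)))            # True
print(satisfies_complementarity(np.zeros(2), grad(np.zeros(2))))  # False: [∇f(0)]_1 < 0
```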
We have the following result.
Theorem 2.3. If x∗ is a local solution of (2.10) and f is continuously differentiable in a neighborhood of x∗ , then conditions (2.11) hold.
Proof. Suppose that conditions (2.11) are violated for a certain index i. If x∗i < 0, then x∗ is not feasible and hence not a solution, so we need only consider the cases x∗i = 0 and x∗i > 0.
In the case x∗i = 0, since condition (2.11) is violated, we must have [∇f (x∗ )]i < 0.
Consider the direction p ∈ IRn with pj = 0 for j ≠ i and pi = −[∇f (x∗ )]i > 0. It is
easy to see that x∗ + t̄p ≥ 0 for all t̄ sufficiently small and positive, and that
pT ∇f (x∗ ) = −[∇f (x∗ )]2i < 0,
and hence that
pT ∇f (x∗ + tp) < 0
for all t sufficiently small. It follows from Taylor’s theorem (see the argument in the
proof of Theorem 2.2) that
f (x∗ + t̄p) = f (x∗ ) + t̄pT ∇f (x∗ + tp) < f (x∗ ), for some t ∈ (0, t̄).
Since any neighborhood N of x∗ will contain points of the form x∗ + t̄p for small t̄, and
since these points are feasible, x∗ does not satisfy the definition of a local minimizer.
Consider now the case of x∗i > 0. Since (2.11) is violated, we must have [∇f (x∗ )]i ≠ 0. Define p ∈ IRn as above: pj = 0 for j ≠ i and pi = −[∇f (x∗ )]i . Using a similar argument, we can verify that x∗ + t̄p ≥ 0 and that f (x∗ + t̄p) < f (x∗ ) for all t̄ sufficiently small and positive, again demonstrating that any neighborhood N of x∗ must contain feasible points that have a lower function value than f (x∗ ).
Figure 2.1 illustrates three possible cases of optimality for n = 2.
Fig. 2.1. Three optimal points for nonnegative-constrained optimization
Fig. 2.2. Lemma 2.4 illustrated: part (i) (left) and part (ii) (right).
Optimization over a Closed Convex Set. We now discuss the more general problem
of optimization over a closed convex set Ω ⊂ IRn :
min f (x) subject to x ∈ Ω.
(2.13)
A point that satisfies the constraint x ∈ Ω is said to be feasible.
It is relatively easy to develop optimality conditions and the gradient projection algorithm in these general terms, then specialize them to the bound-constrained case in
which
Ω = {z ∈ IRn | z ≥ 0}.
(2.14)
Given a closed convex set Ω ⊂ IRn , the projection operator P : IRn → Ω is defined
as follows:
P (y) = arg min_{z∈Ω} kz − yk2 .
That is, P (y) is the point in Ω that is closest to y in the sense of the Euclidean
norm. This operator is useful both in defining optimality conditions and in defining
algorithms.
We start with a useful result about P . See Figure 2.2.
Lemma 2.4.
(i) (P (y) − z)T (y − z) ≥ 0 for all z ∈ Ω, with equality if and only if z = P (y).
(ii) (y − P (y))T (z − P (y)) ≤ 0 for all z ∈ Ω.
Proof. We prove (i) and leave (ii) as an exercise.
Let z be an arbitrary vector in Ω. We have
kP (y) − yk22 = kP (y) − z + z − yk22
= kP (y) − zk22 + 2(P (y) − z)T (z − y) + kz − yk22
which implies by rearrangement that
2(P (y) − z)T (y − z) = kP (y) − zk22 + kz − yk22 − kP (y) − yk22 .
(2.15)
The term kz − yk22 − kP (y) − yk22 is nonnegative, from the definition of P (since P (y) is the closest point of Ω to y). The first term on the right-hand side is trivially nonnegative, so the nonnegativity claim is proved.
If z = P (y), we obviously have (P (y) − z)T (y − z) = 0. Conversely, if (P (y) − z)T (y − z) = 0, then both terms on the right-hand side of (2.15) must be zero; in particular kP (y) − zk22 = 0, so we
have z = P (y).
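For the nonnegative orthant (2.14), the projection P is simply a componentwise clamp, and both parts of Lemma 2.4 can be spot-checked numerically. The following Python sketch is ours (the random test points are an arbitrary choice):

```python
import numpy as np

# For the nonnegative orthant (2.14), the Euclidean projection is componentwise:
def P(y):
    return np.maximum(y, 0.0)

rng = np.random.default_rng(0)
y = rng.standard_normal(5)          # arbitrary point, possibly infeasible
z = np.abs(rng.standard_normal(5))  # arbitrary feasible point (z >= 0)

# Lemma 2.4(i): (P(y) - z)^T (y - z) >= 0 for all z in Omega.
assert (P(y) - z) @ (y - z) >= -1e-12

# Lemma 2.4(ii): (y - P(y))^T (z - P(y)) <= 0 for all z in Omega.
assert (y - P(y)) @ (z - P(y)) <= 1e-12
print("Lemma 2.4 holds at the sampled points")
```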
We now state the first-order necessary conditions for the problem (2.13).
Theorem 2.5. If f is continuously differentiable and x∗ is a local solution of
(2.13), then
∇f (x∗ )T (x − x∗ ) ≥ 0, for all x ∈ Ω.
(2.16)
Proof. (Sketch.) Suppose that there is some x ∈ Ω such that ∇f (x∗ )T (x − x∗ ) < 0.
Consider a step to a point x∗ + ε(x − x∗ ). Since x ∈ Ω and x∗ ∈ Ω, we have by convexity that x∗ + ε(x − x∗ ) ∈ Ω for all ε ∈ [0, 1]. In addition, we have by Taylor’s theorem that
f (x∗ + ε(x − x∗ )) = f (x∗ ) + ε∇f (x∗ )T (x − x∗ ) + o(ε) < f (x∗ ),
for all positive ε sufficiently small, since ∇f (x∗ )T (x − x∗ ) < 0, so the o(ε) term is dominated by the first-order term for small ε.
We now introduce the concept of limiting feasible directions to derive an alternative to the condition (2.16) that will be useful in subsequent analysis. (This concept is also useful in the case of nonconvex feasible sets, which we do not consider here.)
Definition 2.6. We say that t ∈ IRn is a limiting feasible direction for the set Ω at a point x ∈ Ω if there is a sequence of vectors ti → t and a sequence of positive scalars αi → 0 such that x + αi ti ∈ Ω. Some limiting feasible directions for a set Ω with a curved boundary are shown in Figure 2.3.
The following result establishes an equivalence between “entering directions” and
“limiting feasible directions” for the set Ω at x∗ .
Theorem 2.7. Given any vector y ∈ IRn , the closed convex set Ω, and a point x∗ ∈ Ω, we have that y T (x − x∗ ) ≥ 0 for all x ∈ Ω if and only if y T t ≥ 0 for all limiting feasible directions t for Ω at x∗ .
Proof. Suppose first that y T (x − x∗ ) ≥ 0 for all x ∈ Ω. Given a limiting feasible direction t, with associated sequences ti and αi > 0, we have x∗ + αi ti ∈ Ω and therefore
y T ((x∗ + αi ti ) − x∗ ) = αi y T ti ≥ 0,
so dividing by αi , we obtain y T ti ≥ 0 for all i. By taking limits as i → ∞, we obtain
y T t ≥ 0.
Fig. 2.3. Limiting feasible directions
Suppose now that y T t ≥ 0 for all limiting feasible directions t, and let x be an arbitrary element of Ω. By defining ti ≡ (x − x∗ ) and αi = 1/i for all i ≥ 1, we have by
convexity that x∗ + αi ti = (1 − 1/i)x∗ + (1/i)x ∈ Ω, so these sequences define the
limiting direction t = (x − x∗ ). We therefore have y T (x − x∗ ) ≥ 0, completing the
proof.
Using this equivalence, we can restate Theorem 2.5 to make reference to limiting
feasible directions rather than to all elements of the set Ω.
Theorem 2.8. If f is continuously differentiable and x∗ is a local solution of
(2.13), then
∇f (x∗ )T t ≥ 0,
(2.17)
for all limiting directions t of Ω at x∗ .
We can combine Lemma 2.4 and Theorem 2.5 to obtain an equivalent form of the
first-order necessary conditions.
Theorem 2.9. If
P (x∗ − ᾱ∇f (x∗ )) = x∗ , for some ᾱ > 0,    (2.18)
then (2.16) holds. Conversely, if (2.16) holds, then
P (x∗ − α∇f (x∗ )) = x∗ , for all α > 0.
Proof. Suppose first that (2.18) holds. In Lemma 2.4(ii) we set
y = x∗ − ᾱ∇f (x∗ ), P (y) = x∗ ,
and let z be any element of Ω. We then have
0 ≥ (y − P (y))T (z − P (y)) = (−ᾱ∇f (x∗ ))T (z − x∗ ),
which implies that ∇f (x∗ )T (z − x∗ ) ≥ 0 for all z ∈ Ω, proving (2.16).
Now suppose that (2.16) holds, and denote
xα = P (x∗ − α∇f (x∗ )).
Setting y = x∗ − α∇f (x∗ ), P (y) = xα , z = x∗ in Lemma 2.4(ii), we have
(x∗ − α∇f (x∗ ) − xα )T (x∗ − xα ) ≤ 0,
which implies that
kx∗ − xα k22 − α∇f (x∗ )T (x∗ − xα ) ≤ 0.
(2.19)
By (2.16), we have that
−α∇f (x∗ )T (x∗ − xα ) ≥ 0,
so both terms on the left-hand side of (2.19) are nonnegative. Hence they are both
zero and we have in particular that xα = x∗ as claimed.
An immediate consequence of this theorem is that if P (x∗ − ᾱ∇f (x∗ )) = x∗ for some ᾱ > 0, then P (x∗ − α∇f (x∗ )) = x∗ for all α > 0. (Why?)
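Theorem 2.9 gives a practical stationarity test: the residual kx − P (x − α∇f (x))k vanishes exactly at points satisfying (2.16), for any fixed α > 0. A Python sketch for the orthant case, with an illustrative objective of our own:

```python
import numpy as np

def P(y):                      # projection onto the orthant (2.14)
    return np.maximum(y, 0.0)

def stationarity_residual(x, grad, alpha=1.0):
    """|| x - P(x - alpha * grad(x)) ||: zero iff x satisfies (2.16),
    by Theorem 2.9, for any fixed alpha > 0."""
    return np.linalg.norm(x - P(x - alpha * grad(x)))

# Illustrative objective (an assumption, not from the text):
# f(x) = (x1 + 1)^2 + 2(x2 + 1)^2 over x >= 0, with solution x* = 0.
grad = lambda x: np.array([2 * (x[0] + 1), 4 * (x[1] + 1)])

print(stationarity_residual(np.zeros(2), grad))           # 0.0 : stationary
print(stationarity_residual(np.array([1.0, 1.0]), grad))  # > 0 : not stationary
```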
Reduction to the Bound-Constrained Case. It is not difficult to show that the
general first-order conditions (2.16) reduce to (2.11) in the case of Ω defined by (2.14).
We prove this result informally. Suppose first that (2.11) holds. Then
∇f (x∗ )T (x − x∗ ) = Σ_{i : [∇f (x∗ )]i > 0} [∇f (x∗ )]i (xi − x∗i ) + Σ_{i : [∇f (x∗ )]i = 0} [∇f (x∗ )]i (xi − x∗i ).
The second sum is trivially zero. In the first sum, we have [∇f (x∗ )]i > 0 and thus
x∗i = 0 by (2.11). Thus xi −x∗i = xi ≥ 0, so that [∇f (x∗ )]i (xi −x∗i ) = [∇f (x∗ )]i xi ≥ 0.
Hence ∇f (x∗ )T (x − x∗ ) ≥ 0, as required. Now suppose that ∇f (x∗ )T (x − x∗ ) ≥ 0 for
all feasible x. Choosing i such that x∗i = 0, we can define x such that xj = x∗j for all
j ≠ i, and xi = 1. Thus
0 ≤ ∇f (x∗ )T (x − x∗ ) = [∇f (x∗ )]i (xi − x∗i ) = [∇f (x∗ )]i ,
implying that [∇f (x∗ )]i ≥ 0. Hence (2.11) is satisfied for this i. For the other case
x∗i > 0, we again choose x with xj = x∗j for all j ≠ i and obtain
0 ≤ ∇f (x∗ )T (x − x∗ ) = [∇f (x∗ )]i (xi − x∗i ).
By choosing xi = 0 and xi = 2x∗i , we obtain
0 ≤ [∇f (x∗ )]i (−x∗i ),  0 ≤ [∇f (x∗ )]i x∗i ,
from which it follows, using x∗i > 0, that [∇f (x∗ )]i = 0. Thus (2.11) is satisfied in this case as well.
Optimization with Linear Constraints. We consider now the case in which Ω is
defined by a set of linear inequalities, that is,
Ω := {x | Ax ≥ b},
(2.20)
where A ∈ IRm×n and b ∈ IRm . By breaking A and b into their constituent rows aTi ,
i = 1, 2, . . . , m and bi , i = 1, 2, . . . , m, we can rewrite (2.20) as follows:
Ω := {x | aTi x ≥ bi , i = 1, 2, . . . , m}.    (2.21)
It is easy to characterize the set of limiting feasible directions for such sets Ω. Given a
point x∗ ∈ Ω, we define the active set A∗ ⊂ {1, 2, . . . , m} to be the linear constraints
that are satisfied at equality at the point x∗ , that is,
A∗ := {i = 1, 2, . . . , m | aTi x∗ = bi }.
(2.22)
Note that for the inactive constraints i ∉ A∗ , we have strict inequality: aTi x∗ > bi .
Theorem 2.10. Given x∗ ∈ Ω defined by (2.21), and A∗ defined as above, the
set of limiting feasible directions for Ω at x∗ is exactly the set
T (x∗ ) := {t | aTj t ≥ 0, for all j ∈ A∗ }.
Proof. Suppose first that t is a limiting feasible direction, and let αi and ti be the sequences that define it. From x∗ + αi ti ∈ Ω and aTj x∗ = bj for j ∈ A∗ , we have
aTj ti = (1/αi ) aTj (x∗ + αi ti − x∗ ) = (1/αi ) [aTj (x∗ + αi ti ) − bj ] ≥ 0,  for all j ∈ A∗ .
By taking limits as i → ∞, we have that aTj t ≥ 0 for all j ∈ A∗ , so that t ∈ T (x∗ ).
Suppose now that t ∈ T (x∗ ). Define ᾱ to be positive but sufficiently small that aTj (x∗ + ᾱt) ≥ bj for all j ∉ A∗ . (Since aTj x∗ > bj for j ∉ A∗ , we can certainly find a positive value of ᾱ satisfying this condition.) Now defining ti ≡ t and αi = ᾱ/i for i ≥ 1, it is easy to see that aTj (x∗ + αi t) ≥ bj for all j ∉ A∗ . For j ∈ A∗ , we have from aTj t ≥ 0 and aTj x∗ = bj that aTj (x∗ + αi t) ≥ bj for all j ∈ A∗ . We conclude that x∗ + αi ti ∈ Ω for all i, so t is a limiting feasible direction for Ω at x∗ .
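For polyhedral Ω, membership in T (x∗ ) is a finite check over the active rows. The helper functions below are our own sketch (the tolerance handling is simplistic); the constraints used for illustration are x2 ≥ 0 and x1 + x2 ≥ 0:

```python
import numpy as np

def active_set(A, b, x, tol=1e-9):
    """Indices of constraints a_i^T x >= b_i that hold with equality at x,
    as in (2.22)."""
    return np.where(np.abs(A @ x - b) <= tol)[0]

def in_limiting_cone(A, b, x, t, tol=1e-9):
    """Membership test for T(x) in Theorem 2.10: a_j^T t >= 0 for active j."""
    act = active_set(A, b, x)
    return bool(np.all(A[act] @ t >= -tol))

# Small illustrative polyhedron: x2 >= 0 and x1 + x2 >= 0.
A = np.array([[0.0, 1.0], [1.0, 1.0]])
b = np.zeros(2)
x = np.zeros(2)                 # both constraints active here

print(in_limiting_cone(A, b, x, np.array([1.0, 0.0])))   # True: stays feasible
print(in_limiting_cone(A, b, x, np.array([0.0, -1.0])))  # False: violates x2 >= 0
```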
By combining this result with Theorem 2.8, we have that when x∗ is optimal, ∇f (x∗ )T t ≥ 0 for all t ∈ T (x∗ ). This condition is still not “checkable” because in general there is an infinite number of directions in T (x∗ ). However, a key result known as Farkas’ Lemma allows us to derive a checkable optimality condition for this case, extending the checkable conditions (2.12) derived for the bound-constrained case.
Lemma 2.11 (Farkas). Given a set of vectors aj ∈ IRn , j ∈ A, and a direction y ∈ IRn , exactly one of the following two statements is true:
I. There are nonnegative coefficients λj ≥ 0 such that y = Σ_{j∈A} λj aj . (That is, y can be expressed as a nonnegative linear combination of the vectors aj , j ∈ A.)
II. There is a vector z such that aTj z ≥ 0 for all j ∈ A and y T z < 0. (This vector defines a hyperplane that separates y from the aj , j ∈ A.)
The proof of this lemma is difficult and beyond the scope of these notes. It is
closely related to the strong duality theorem for linear programming, which is often
proved constructively by applying the simplex method and using the properties of
this method. An illustration of this result is shown in Figure 2.4.
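Computationally, the two alternatives can be distinguished by projecting y onto the cone generated by the aj , for example with SciPy’s nonnegative least squares. The sketch below is ours, not part of the text: when the projection residual is zero we are in case I, and otherwise the negated residual serves as a separating vector z:

```python
import numpy as np
from scipy.optimize import nnls

def farkas_alternative(A_cols, y, tol=1e-8):
    """Decide which case of Farkas' Lemma holds, by projecting y onto the
    cone generated by the columns of A_cols via nonnegative least squares.
    Returns ('I', lam) or ('II', z). A sketch, not a robust implementation."""
    lam, resid = nnls(A_cols, y)
    if resid <= tol:
        return "I", lam                    # y = sum_j lam_j a_j with lam >= 0
    z = A_cols @ lam - y                   # negated residual of the cone projection
    return "II", z                         # a_j^T z >= 0 for all j, y^T z < 0

A_cols = np.array([[0.0, 1.0], [1.0, 1.0]])  # columns a1 = (0,1), a2 = (1,1)
print(farkas_alternative(A_cols, np.array([2.0, 4.0])))    # case I, lam = (2, 2)
print(farkas_alternative(A_cols, np.array([2.0, -4.0])))   # case II
```

The case-II formula relies on the fact that the residual of a projection onto a convex cone is orthogonal to the projection itself, so y T z = −kresidk2 < 0.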
We can combine Farkas’ Lemma with the preceding results to derive the Karush-Kuhn-Tucker conditions, usually known as the KKT conditions, a checkable first-order necessary condition for constrained optimization. This theorem introduces quantities λj that are known as Lagrange multipliers.
Theorem 2.12 (KKT conditions). Suppose that f is continuously differentiable
and x∗ is a local solution of (2.13), where Ω is defined by (2.21) and A∗ is defined by
(2.22). Then we have
∇f (x∗ ) = Σ_{j∈A∗} λj aj ,    (2.23)
Fig. 2.4. Farkas’ Lemma illustrated: Case I (left) and Case II (right). The separating hyperplane in case II is shown as a dotted line.
for some nonnegative coefficients λj , j ∈ A∗ .
Proof. We noted above that under the assumptions of the theorem, we must have ∇f (x∗ )T t ≥ 0 for any t such that aTj t ≥ 0 for all j ∈ A∗ . Hence alternative II in Farkas’ Lemma cannot hold (when we identify A with A∗ and z with t). Thus alternative I holds, giving the result.
Properly speaking, the KKT conditions are defined as the conditions defining
feasibility and the active set A∗ , along with the condition (2.23). A succinct way to
state them, similarly to (2.12), is as follows:
0 ≤ λ ⊥ Ax∗ − b ≥ 0,    (2.24a)
∇f (x∗ ) = Σ_{j=1}^{m} λj aj .    (2.24b)
Note that the first condition (2.24a) guarantees feasibility of x∗ . It also ensures that
the components of λ that correspond to inactive constraints are set to zero, because
aTj x∗ − bj > 0 ⇒ λj = 0 by the perpendicularity operator. Thus, the inactive terms
contribute nothing to the sum in (2.24b), so the right-hand sides of (2.24b) and (2.23)
are the same.
Example. We illustrate the KKT conditions with a simple example in two dimensions:
min (x1 + 1)2 + 2(x2 + 1)2 subject to x2 ≥ 0, x1 + x2 ≥ 0.
(2.25)
We can express the constraints in the form (2.21) by defining
m = 2,  a1 = (0, 1)T ,  a2 = (1, 1)T ,  b1 = 0, b2 = 0.
The problem is shown in Figure 2.5, and we can see that the solution is at x∗ = (0, 0)T .
Let us check the KKT conditions at this point. Both constraints are active, that is,
A∗ = {1, 2}. We have
∇f (x) = (2(x1 + 1), 4(x2 + 1))T , and so ∇f (x∗ ) = ∇f (0) = (2, 4)T .
Fig. 2.5. KKT conditions for the example, with quadratic objective centered at (−1, −1)T .
Both constraints are active at the solution.
We now seek λ1 ≥ 0 and λ2 ≥ 0 such that ∇f (x∗ ) = λ1 a1 + λ2 a2 , that is,
(2, 4)T = λ1 (0, 1)T + λ2 (1, 1)T = (λ2 , λ1 + λ2 )T .
It is easy to see that these conditions are satisfied (uniquely) by λ1 = 2, λ2 = 2, so the KKT conditions hold.
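We can confirm this small calculation in a few lines of Python (the numbers transcribe the system above):

```python
import numpy as np

# Verify the multipliers for example (2.25) at x* = (0, 0): solve
# grad f(x*) = lam1 * a1 + lam2 * a2 and check nonnegativity.
a1 = np.array([0.0, 1.0])
a2 = np.array([1.0, 1.0])
grad_at_xstar = np.array([2.0, 4.0])

lam = np.linalg.solve(np.column_stack([a1, a2]), grad_at_xstar)
print(lam)                 # [2. 2.]
assert np.all(lam >= 0)    # KKT conditions hold at x* = (0, 0)
```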
Consider now a variation on this example with the same constraints, but with
unconstrained minimizer of the objective function at the point (−1, β) instead of
(−1, −1). The problem is therefore
min (x1 + 1)2 + 2(x2 − β)2 subject to x2 ≥ 0, x1 + x2 ≥ 0.
(2.26)
We ask the questions: For what values of β is x∗ = (0, 0)T still the solution of the problem? How does the optimal active set change as we change β? We answer these questions by considering different possible active sets A∗ for each value of β, and checking to see if the KKT conditions are satisfied for this choice of A∗ .
We first investigate the range of values of β for which A∗ = {1, 2}, so that x∗ = (0, 0)T continues to be the solution. For f defined in (2.26), we have
∇f (x) = (2(x1 + 1), 4(x2 − β))T ,  that is, ∇f (0) = (2, −4β)T ,
the KKT conditions will be satisfied at the point (0, 0) if we can find λ1 ≥ 0 and
λ2 ≥ 0 such that
(2, −4β)T = λ1 (0, 1)T + λ2 (1, 1)T = (λ2 , λ1 + λ2 )T .
This linear system is solved uniquely by (λ1 , λ2 ) = (−4β − 2, 2), and we have λ ≥ 0 if
and only if −4β − 2 ≥ 0, that is, β ≤ −.5. We can conclude that the KKT conditions
are satisfied at x∗ = 0 when β ≤ −.5.
Figure 2.6 suggests that for values of β greater than −.5 (but not too large), the solution x∗ may be at a point where only the second constraint x1 + x2 ≥ 0 is active. Accordingly, we examine the situation where A∗ = {2}, to see if we can identify x∗
Fig. 2.6. KKT conditions for the example, with quadratic objective centered at approximately
(−1, .3)T . Only the second constraint is active at the solution. Note that ∇f (x∗ ) and a2 point in
the same direction.
and a range of values of β for which this is the optimal active set. For this choice of
A∗ , a point x∗ will be optimal if the constraints in A∗ are active, it is feasible with
respect to the other constraints, and if the condition (2.23) holds for some nonnegative
λj , j ∈ A∗ . We thus have the following conditions:
x∗1 + x∗2 = 0,    (2.27a)
x∗2 ≥ 0,    (2.27b)
∇f (x∗ ) = (2(x∗1 + 1), 4(x∗2 − β))T = λ2 (1, 1)T ,    (2.27c)
λ2 ≥ 0.    (2.27d)
For a given β, the conditions (2.27a) and (2.27c) form a system of linear equations —
three equations in the three unknowns x∗1 , x∗2 , and λ2 — which we can solve to find
the following values:
x∗1 = −(2/3)β − 1/3,  x∗2 = (2/3)β + 1/3,  λ2 = −(4/3)β + 4/3.
By applying the inequalities (2.27b) and (2.27d), we find the values of β for which this is indeed a solution:
x∗2 ≥ 0 ⇔ (2/3)β + 1/3 ≥ 0 ⇔ β ≥ −1/2,
λ2 ≥ 0 ⇔ −(4/3)β + 4/3 ≥ 0 ⇔ β ≤ 1.
We conclude that for β ∈ [−1/2, 1], the point
x∗ = β (−2/3, 2/3)T + (−1/3, 1/3)T
is a solution of (2.26).
So far we have accounted for the range β ∈ (−∞, −1/2] (within which the active set A∗ = {1, 2} leads to a solution x∗ = (0, 0)T that satisfies the KKT conditions), and the range β ∈ [−1/2, 1], for which A∗ = {2} leads to a solution satisfying KKT.
What happens when β ≥ 1? Clearly, in this case, the unconstrained minimizer of the
function f in (2.26) lies inside the feasible set, so we have A∗ = ∅ and x∗ = (−1, β)T .
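The three regimes can be collected into a single piecewise formula. The function below transcribes the solutions derived above; it is a summary of the casework, not an independent derivation:

```python
import numpy as np

def solution(beta):
    """Solution of (2.26) as a function of beta, transcribing the three
    active-set regimes derived in the text."""
    if beta <= -0.5:              # A* = {1, 2}: both constraints active
        return np.array([0.0, 0.0])
    if beta <= 1.0:               # A* = {2}: only x1 + x2 >= 0 active
        return np.array([-2 * beta / 3 - 1 / 3, 2 * beta / 3 + 1 / 3])
    return np.array([-1.0, beta]) # A* = empty: unconstrained minimizer feasible

print(solution(-1.0))   # [0. 0.]
print(solution(0.25))   # [-0.5  0.5]
print(solution(2.0))    # [-1.  2.]
```

Note that the formula is continuous in β: the middle expression gives (0, 0)T at β = −1/2 and (−1, 1)T at β = 1, matching the adjacent regimes.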
KKT conditions for bound-constrained problems. Suppose the constraints are all
nonnegativity bounds xi ≥ 0, i = 1, 2, . . . , n. We can put these constraints into the
form (2.21) by defining ai = ei , i = 1, 2, . . . , n (where ei are the standard unit vectors),
and bi = 0, i = 1, 2, . . . , n. The active set A∗ is simply the indices for which x∗i = 0,
that is,
A∗ = {i | x∗i = 0}.
The conditions (2.23) thus require existence of nonnegative λi such that
∇f (x∗ ) = Σ_{i∈A∗} λi ei .
Note that the right-hand side of this expression is a vector with zero entries in positions
i corresponding to x∗i > 0 (the inactive indices) and nonnegative values λi in positions
i corresponding to x∗i = 0. This corresponds exactly to the conditions (2.11), so in fact
(2.11) are the specialization of the KKT conditions to the case of bound constraints.
Consider now the specialization of the KKT conditions to the two-sided bound-constrained problem
min f (x) subject to l ≤ x ≤ u,
(2.28)
where l, u ∈ IRn are vectors of lower and upper bounds, respectively. We can express
this problem in the form (2.21) by setting m = 2n and defining
ai = ei ,
bi = li ,
i = 1, 2, . . . , n,
an+i = −ei , bn+i = −ui , i = 1, 2, . . . , n.
The active set A∗ at x∗ is then defined by
A∗ = {i ∈ {1, 2, . . . , n} | x∗i = li } ∪ {n + i | i ∈ {1, 2, . . . , n}, x∗i = ui }.
The key condition (2.23) thus has the form
∇f (x∗ ) = Σ_{i : x∗i = li} λi ei − Σ_{i : x∗i = ui} λn+i ei .
The vector on the right-hand side thus consists of the following entries:
• 0, for those components i for which x∗i is strictly between li and ui , that is,
li < x∗i < ui ;
• Nonnegative numbers λi , for those i for which x∗i = li ;
• Nonpositive numbers −λi , for those i for which x∗i = ui .
We can state these conditions succinctly, in a fashion similar to (2.12), by collecting the lower- and upper-bound multipliers into vectors λ, ν ∈ IRn (with zero entries at indices where the corresponding bound is inactive):
∇f (x∗ ) = λ − ν,  0 ≤ x∗ − l ⊥ λ ≥ 0,  0 ≤ u − x∗ ⊥ ν ≥ 0.    (2.29)
Linear Equality Constraints. We have considered problems in which the constraint sets are defined in terms of linear inequality constraints (2.21). Our discussion
was based on constraints of the form aTj x ≥ bj , but of course any constraint expressed
in “less than” form aTj x ≤ bj can be reformulated as −aTj x ≥ −bj , and the theory
above applied.
Suppose now that in addition to inequality constraints, we have equality constraints, which we write as cTj x = dj , j = 1, 2, . . . , p. Extending (2.21), the constraint
set has the form
Ω := {x | aTj x ≥ bj , j = 1, 2, . . . , m; cTl x = dl , l = 1, 2, . . . , p}.    (2.30)
Can the KKT conditions (2.23) be extended to this case? The key is to redefine
Ω in terms of inequality constraints only, apply the theory above, then simplify the
formulae. We continue to use the definition (2.22) of the active set, to refer only to the
active inequality constraints. (The equality constraints are of course always “active.”)
We have the following extension of Theorem 2.12.
Theorem 2.13 (KKT conditions for inequality and equality constraints). Suppose that f is continuously differentiable and x∗ is a local solution of (2.13), where Ω
is defined by (2.30) and A∗ is defined by (2.22). Then we have
∇f (x∗ ) = Σ_{j∈A∗} λj aj + Σ_{l=1}^{p} µl cl ,    (2.31)
for some nonnegative coefficients λj , j ∈ A∗ and scalars µl , l = 1, 2, . . . , p.
Proof. The feasible set Ω can be defined using the following inequalities:
aTj x ≥ bj ,  j = 1, 2, . . . , m,    (2.32a)
cTl x ≥ dl ,  l = 1, 2, . . . , p,    (2.32b)
−cTl x ≥ −dl ,  l = 1, 2, . . . , p.    (2.32c)
Introducing Lagrange multipliers λj for (2.32a), µ̂l , l = 1, 2, . . . , p for (2.32b), and
µ̄l , l = 1, 2, . . . , p for (2.32c), we can apply Theorem 2.12 to obtain the following
condition:
∇f (x∗ ) = Σ_{j∈A∗} λj aj + Σ_{l=1}^{p} µ̂l cl − Σ_{l=1}^{p} µ̄l cl ,
where all coefficients λj , j ∈ A∗ , µ̂l , l = 1, 2, . . . , p, and µ̄l , l = 1, 2, . . . , p are nonnegative. We obtain the result by setting
µl := µ̂l − µ̄l ,
l = 1, 2, . . . , p.
(Note that since µl is the difference of two nonnegative variables, it is not sign-restricted.)
Example. Consider the problem
min_x (x1 + 1)2 + 3(x2 − 2)2 s.t. x1 ≥ 0, x1 + x2 = 1.
Let us check that the KKT conditions hold at the point x∗ = (0, 1)T . We have
∇f (x) = (2(x1 + 1), 6(x2 − 2))T ,  ∇f (x∗ ) = (2, −6)T .
Note that the inequality constraint is active at x∗ , so that A∗ = {1}. For condition
(2.31), we need to find λ1 ≥ 0 and µ1 such that
∇f (x∗ ) = (2, −6)T = λ1 (1, 0)T + µ1 (1, 1)T .
It is easy to see that the values µ1 = −6, λ1 = 8 satisfy these conditions, so we have
verified the KKT conditions at this point.
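The same verification can be done numerically by solving the 2 × 2 linear system for (λ1 , µ1 ):

```python
import numpy as np

# Verify (2.31) for the equality-constrained example at x* = (0, 1):
# grad f(x*) = lam1 * a1 + mu1 * c1, with a1 = (1, 0)^T and c1 = (1, 1)^T.
M = np.column_stack([np.array([1.0, 0.0]), np.array([1.0, 1.0])])
lam1, mu1 = np.linalg.solve(M, np.array([2.0, -6.0]))
print(lam1, mu1)        # 8.0 -6.0
assert lam1 >= 0        # the inequality multiplier must be nonnegative;
                        # mu1 is free in sign (equality multiplier)
```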
Gradient Projection. A basic algorithm for constrained optimization is gradient
projection. In a sense it is the natural extension of steepest descent to the problem
(2.13). The search path from a point x is the projection of the steepest descent path
onto the feasible set Ω, that is,
P (x − α∇f (x)), α > 0.
We seek an α for which the function decreases, that is,
f (x) > f (P (x − α∇f (x))).
For global convergence we need a stronger condition than this. Ideally, we would like
to find an α that minimizes the line search function φ(α) ≡ f (P (x − α∇f (x))). But
this is difficult, even for simple functions f . In general it is not a smooth function of α.
For example, when Ω is polyhedral and f is quadratic, φ is piecewise quadratic. When
Ω is polyhedral and f is smooth, φ is piecewise smooth. But because of the “kinks”
in the search path, we cannot apply line search procedures like the one developed in
Chapter 3, which assume smoothness. Backtracking may be more suitable.
To describe this strategy we use the following notation, for convenience:
x(α) := P (x − α∇f (x)).
We consider an Armijo backtracking strategy along the projection arc, in which
we choose an ᾱ > 0 and take the step αk to be the first element in the sequence
ᾱ, β ᾱ, β 2 ᾱ, . . . for which the following condition is satisfied:
f (xk (β m ᾱ)) ≤ f (xk ) + c1 ∇f (xk )T (xk (β m ᾱ) − xk ).
(2.33)
We can show that the resulting algorithm has stationary limit points. The analysis is
quite complicated (see pp. 236–240 of Bertsekas), but we state it in part because the
proof techniques are interesting and relevant to convergence results in other settings.
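The method described above can be sketched in a few lines. Everything below is a minimal illustration, not production code: the parameter defaults are our own, and the test problem is a bound-constrained instance echoing the earlier examples (solution x∗ = (0, 0)T):

```python
import numpy as np

def gradient_projection(f, grad, P, x0, c1=1e-4, beta=0.5,
                        alpha_bar=1.0, tol=1e-8, max_iter=200):
    """Gradient projection with Armijo backtracking along the projection
    arc, per (2.33). A minimal sketch; the backtracking loop relies on
    part (a) of Theorem 2.15 to terminate."""
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        alpha = alpha_bar
        while True:
            x_new = P(x - alpha * g)
            # Armijo test (2.33) on the projected step
            if f(x_new) <= f(x) + c1 * g @ (x_new - x):
                break
            alpha *= beta
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x

# Bound-constrained instance (an assumption, echoing earlier examples):
# min (x1 + 1)^2 + 2(x2 + 1)^2 subject to x >= 0, with solution x* = (0, 0).
f = lambda x: (x[0] + 1) ** 2 + 2 * (x[1] + 1) ** 2
grad = lambda x: np.array([2 * (x[0] + 1), 4 * (x[1] + 1)])
P = lambda y: np.maximum(y, 0.0)

print(gradient_projection(f, grad, P, np.array([2.0, 3.0])))  # [0. 0.]
```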
We need a preliminary “geometric” lemma whose proof is omitted—see Lemma
2.3.1 of Bertsekas.
Lemma 2.14. For all x ∈ Ω and z ∈ IRn , the function g : (0, ∞) → IR defined by
g(s) := kP (x + sz) − xk / s
is monotonically nonincreasing.
Our convergence result is then:
Theorem 2.15.
(a) For every x ∈ Ω there exists a scalar sx > 0 such that
f (x) − f (x(s)) ≥ c1 ∇f (x)T (x − x(s)), ∀s ∈ [0, sx ].    (2.34)
(b) Let xk be the sequence generated by the gradient projection method with Armijo
backtracking along the projection path, ᾱ = 1, and step acceptance criterion
(2.33). Then every limit point of {xk } is stationary.
Proof. We follow the proof of Bertsekas Proposition 2.3.3. For Part (a), we have
by Lemma 2.4(ii) with y = x − s∇f (x) (thus P (y) = x(s)) that
(x − x(s))T (x − s∇f (x) − x(s)) ≤ 0, ∀x ∈ Ω, s > 0.
By rearrangement this becomes
∇f (x)T (x − x(s)) ≥ kx − x(s)k2 / s,  ∀x ∈ Ω, s > 0.    (2.35)
If x is stationary (that is, satisfies condition (2.18)), we have x(s) = x for all s > 0, so the conclusion (2.34) holds trivially. Otherwise, x is nonstationary, so x ≠ x(s) for all s > 0. By Taylor’s theorem we have
f (x) − f (x(s)) = ∇f (x)T (x − x(s)) + (∇f (ζs ) − ∇f (x))T (x − x(s)),
for some ζs on the line segment between x and x(s). Hence, (2.34) can be written as
(1 − c1 )∇f (x)T (x − x(s)) ≥ (∇f (x) − ∇f (ζs ))T (x − x(s)).
(2.36)
From (2.35) and Lemma 2.14, we have for all s ∈ (0, 1] that
∇f (x)T (x − x(s)) ≥ kx − x(s)k2 / s ≥ kx − x(1)k kx − x(s)k.
Therefore (2.36) is satisfied for all s ∈ (0, 1] such that
(1 − c1 )kx − x(1)k ≥ (∇f (x) − ∇f (ζs ))T (x − x(s)) / kx − x(s)k.
Obviously the left-hand side of this expression is strictly positive, while the right-hand
side goes to zero continuously, as s ↓ 0. Hence there is an sx with the desired property,
so the proof of part (a) is complete.
Note that part (a) ensures that we can find an αk of the form αk = β mk that
satisfies (2.33) from each point xk . Suppose there is a subsequence K such that
xk →k∈K x̄. Since the sequence of function values {f (xk )} is nonincreasing we have
f (xk ) → f (x̄).
We consider two cases. First, suppose that
lim inf_{k∈K} αk ≥ α̂
for some α̂ > 0.
for some α̂ > 0. From (2.35) and Lemma 2.14, we have for all k ∈ K sufficiently large
that
f (xk ) − f (xk+1 ) ≥ c1 ∇f (xk )T (xk − xk+1 )
kxk − xk+1 k2
αk
c1 αk kxk − xk+1 k2
=
αk2
≥ c1
≥ c1 α̂kxk − xk (1)k2 ,
since 1 is the initial choice of steplength and αk ≤ 1. Taking the limit as k → ∞,
k ∈ K, we have x̄ − x̄(1) = 0, which implies that x̄ is stationary (see (2.18)).
In the second case, suppose that lim inf k∈K,k→∞ αk = 0. Then by taking a further subsequence K̄ ⊂ K, we have limk∈K̄,k→∞ αk = 0. For all k ∈ K̄, the Armijo test
(2.33) will fail at least once, so we have
f (xk ) − f (xk (β −1 αk )) < c1 ∇f (xk )T (xk − xk (β −1 αk )).
(2.37)
Further, no xk with k ∈ K̄ can be stationary, since for stationary points we have
αk = 1. Hence,
kxk − xk (β −1 αk )k > 0.
(2.38)
By Taylor’s theorem we have
f (xk ) − f (xk (β −1 αk )) = ∇f (xk )T (xk − xk (β −1 αk ))
+ (∇f (ζk ) − ∇f (xk ))T (xk − xk (β −1 αk )),
for some ζk on the line segment between xk and xk (β −1 αk ). By combining this expression with (2.37), we have
(1 − c1 )∇f (xk )T (xk − xk (β −1 αk )) < (∇f (xk ) − ∇f (ζk ))T (xk − xk (β −1 αk )).
From (2.35) with Lemma 2.14, we have
∇f (xk )T (xk − xk (β −1 αk )) ≥ kxk − xk (β −1 αk )k2 / (β −1 αk ) ≥ kxk − xk (1)k kxk − xk (β −1 αk )k.
By combining the last two results, and using the Schwarz inequality, we have for
large k ∈ K̄ that
(1 − c1 )kxk − xk (1)kkxk − xk (β −1 αk )k < (∇f (xk ) − ∇f (ζk ))T (xk − xk (β −1 αk ))
≤ k∇f (xk ) − ∇f (ζk )kkxk − xk (β −1 αk )k.
Using this expression together with (2.38) we obtain
(1 − c1 )kxk − xk (1)k < k∇f (xk ) − ∇f (ζk )k.
Since αk → 0 and xk → x̄ as k → ∞, k ∈ K̄, it follows that ζk → x̄. Hence by
taking limits in the expression above we obtain that x̄ = x̄(1), which implies that x̄
is stationary, as claimed.