21-690: Methods of Optimization
Fall 2009
Prof. S. Ta'Asan∗
Chris Almost†
Contents

1 Review and Definitions
  1.1 Convex sets
  1.2 Convex functions
  1.3 Convex optimization

2 Unconstrained Optimization
  2.1 Inequalities
  2.2 Descent methods
  2.3 Newton's method
  2.4 Non-differentiable optimization

3 Constrained Optimization
  3.1 Separating hyperplanes
  3.2 Conjugate functions
  3.3 Duality
  3.4 Complementary slackness
  3.5 Alternatives

4 Algorithms

Index
∗ email?
† [email protected]
1 Review and Definitions

1.1 Convex sets
The line through x₁, x₂ ∈ Rⁿ is the set of points {(1 − θ)x₁ + θx₂ | θ ∈ R}. An affine set is a set that contains the entire line through any pair of points in it, i.e. C is affine if (1 − θ)x₁ + θx₂ ∈ C for all x₁, x₂ ∈ C and θ ∈ R. Notice that if C is an affine set and x₀ ∈ C is chosen, then C − x₀ = {x − x₀ | x ∈ C} is a vector space. Indeed, (x₁ − x₀) + (x₂ − x₀) = (2(½x₁ + ½x₂) − x₀) − x₀ ∈ C − x₀, since 2(½x₁ + ½x₂) − x₀ lies on the line through x₀ and the point ½x₁ + ½x₂ of C, and θ(x − x₀) = (θx + (1 − θ)x₀) − x₀ ∈ C − x₀. If A is a matrix and b a vector then the set of solutions x to Ax = b is an affine set.

A convex set is a set that contains the line segment joining any pair of points in it, i.e. C is convex if (1 − θ)x₁ + θx₂ ∈ C for all x₁, x₂ ∈ C and θ ∈ [0, 1]. A cone is a set that contains the ray emanating from the origin through any point in it, i.e. K is a cone if θx ∈ K for all x ∈ K and θ ≥ 0. A hyperplane is {x | aᵀx − b = 0} and a half space is {x | aᵀx − b ≤ 0}, where a ∈ Rⁿ and b ∈ R. The norm cone is {(x, t) | ‖x‖ ≤ t} ⊆ Rⁿ⁺¹, where ‖·‖ is any norm on Rⁿ.
The set of symmetric n × n matrices is denoted Sⁿ. Define

Sⁿ₊ = {A ∈ Sⁿ | xᵀAx ≥ 0 for all x ∈ Rⁿ}

and

Sⁿ₊₊ = {A ∈ Sⁿ | xᵀAx > 0 for all x ≠ 0}.

Note that Sⁿ₊ and Sⁿ₊₊ are convex cones in the convex set Sⁿ ⊆ Rⁿˣⁿ.
1.2 Convex functions
A convex function is a function whose graph lies below the line segment joining any two points of it. More precisely, f is convex if its domain is convex and

f((1 − θ)x₁ + θx₂) ≤ (1 − θ)f(x₁) + θf(x₂)

for all x₁, x₂ ∈ dom(f) and θ ∈ [0, 1]. Any norm is a convex function with domain Rⁿ.

Recall from multi-variable calculus that if f is a differentiable convex function then f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all x, y ∈ dom(f). In fact,

1.2.1 Proposition. If f is differentiable and dom(f) is convex then f is convex if and only if f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all x, y ∈ dom(f).
PROOF: Let f be differentiable. On R, if f is convex then f(x + θ(y − x)) ≤ (1 − θ)f(x) + θf(y), so we get

f(y) − f(x) ≥ (1/θ)(f(x + θ(y − x)) − f(x)) → f′(x)(y − x)

as θ → 0⁺, which is the required inequality.

Conversely, still on R, let x, y ∈ dom(f), θ ∈ [0, 1], and z = θx + (1 − θ)y. We have f(x) ≥ f(z) + f′(z)(x − z) and f(y) ≥ f(z) + f′(z)(y − z), so multiplying by θ and 1 − θ respectively and adding we get

θf(x) + (1 − θ)f(y) ≥ f(z) + f′(z)(θx + (1 − θ)y − z) = f(z).

In Rⁿ, let g(t) := f(ty + (1 − t)x). Then g is a convex function on its domain, a subset of R. Note that g′(t) = ∇f(ty + (1 − t)x)ᵀ(y − x), so the one-dimensional case implies the n-dimensional case. □
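Proposition 1.2.1 can be checked numerically. The following sketch (the function f and the grid of test points are arbitrary illustrative choices, not from the notes) verifies that the tangent line of a convex function stays below its graph.

```python
# Numerical check of Proposition 1.2.1: for the convex function
# f(x) = x^4 + x^2, the first-order condition f(y) >= f(x) + f'(x)(y - x)
# holds at every pair of test points (up to rounding).
def f(x):
    return x**4 + x**2

def df(x):
    return 4 * x**3 + 2 * x

points = [i / 10.0 for i in range(-30, 31)]
ok = all(f(y) >= f(x) + df(x) * (y - x) - 1e-12
         for x in points for y in points)
print(ok)  # True
```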
1.2.2 Proposition. If f is twice differentiable and dom(f) is convex then f is convex if and only if ∇²f(x) ⪰ 0 for all x ∈ dom(f).

PROOF: Let f be twice differentiable. We have seen that f is convex if and only if f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all x, y ∈ dom(f). On R, if f is convex, for x, y ∈ dom(f) we have

f(y) ≥ f(x) + f′(x)(y − x)  and  f(x) ≥ f(y) + f′(y)(x − y).

Adding and dividing by (x − y)²,

0 ≤ (f′(x) − f′(y))/(x − y) → f″(x)  as y → x.

Conversely, still on R, let x, y ∈ dom(f) and note that the following is non-negative (taking y > x; the case y < x is symmetric):

∫ₓʸ f″(z)(y − z) dz = f′(z)(y − z)|ₓʸ + ∫ₓʸ f′(z) dz = −f′(x)(y − x) + f(y) − f(x),

which is the first-order characterization of convexity.

In Rⁿ, for x, y ∈ dom(f) fixed, define g(t) = g_{x,y}(t) := f(x + t(y − x)). Then

g′(t) = ∇f(x + t(y − x))ᵀ(y − x)  and  g″(t) = (y − x)ᵀ∇²f(x + t(y − x))(y − x).

Now g = g_{x,y} is convex for every x, y ∈ dom(f) if and only if f is convex, and g″(t) ≥ 0 for all t ∈ [0, 1] and all x, y ∈ dom(f) if and only if

(y − x)ᵀ∇²f(x + t(y − x))(y − x) ≥ 0

for all such t, x, y. It follows that f is convex if and only if ∇²f(x) ⪰ 0 for all x ∈ dom(f). □
Let f be convex in its first variable for every value of its second variable in some set A (not necessarily convex), and let g(x) := sup_{y∈A} f(x, y). Then g is convex. Indeed,

g(θx₁ + (1 − θ)x₂) = sup_{y∈A} f(θx₁ + (1 − θ)x₂, y)
  ≤ sup_{y∈A} (θf(x₁, y) + (1 − θ)f(x₂, y))
  ≤ sup_{y∈A} θf(x₁, y) + sup_{y∈A} (1 − θ)f(x₂, y)
  = θ sup_{y∈A} f(x₁, y) + (1 − θ) sup_{y∈A} f(x₂, y)
  = θg(x₁) + (1 − θ)g(x₂).

Let C ⊆ Rⁿ be a convex set and let f : Rᵐ × C → R be convex. Then g(x) := inf_{y∈C} f(x, y) is convex. Indeed, for x₁, x₂ ∈ Rᵐ and θ ∈ [0, 1],

g(θx₁ + (1 − θ)x₂) = inf_{y∈C} f(θx₁ + (1 − θ)x₂, y)
  = inf_{y₁,y₂∈C} f(θx₁ + (1 − θ)x₂, θy₁ + (1 − θ)y₂)
  ≤ inf_{y₁,y₂∈C} (θf(x₁, y₁) + (1 − θ)f(x₂, y₂))
  ≤ θ inf_{y₁∈C} f(x₁, y₁) + (1 − θ) inf_{y₂∈C} f(x₂, y₂)
  = θg(x₁) + (1 − θ)g(x₂),

since the sets C and θC + (1 − θ)C are equal.
1.3 Convex optimization
The constrained optimization problem we consider is that of minimizing a function f₀ : Rⁿ → R subject to inequality constraints fᵢ(x) ≤ 0 for i = 1, …, m, and equality constraints hⱼ(x) = 0 for j = 1, …, p, where the fᵢ are convex and the hⱼ are affine. Call the set of points satisfying the constraints X.

If f₀ is convex and differentiable then f₀(y) ≥ f₀(x) + ∇f₀(x)ᵀ(y − x) for all x, y ∈ X. If x∗ ∈ X is a minimizer then ∇f₀(x∗)ᵀ(y − x∗) ≥ 0 for all y ∈ X. These are the first order conditions. Note that for the unconstrained problem X = Rⁿ, the first order conditions are ∇f₀(x∗) = 0, and they are necessary and sufficient.

Here is a useful trick. If you can write f(x + tv) = f(x) + t wᵀv + O(t²) for every direction v, then w = ∇f(x). This follows from Taylor's theorem. For example, if P ∈ Sⁿ₊ and f(x) := ½xᵀPx + qᵀx + r, then

f(x + tv) = ½(x + tv)ᵀP(x + tv) + qᵀ(x + tv) + r = f(x) + t(Px + q)ᵀv + O(t²),

so ∇f(x) = Px + q.      (Q)
1.3.1 Example. Solve the unconstrained problem minₓ ‖Ax − b‖₂². This is a general form of the problem of finding a least-squares approximation. Note that

‖Ax − b‖₂² = xᵀ(AᵀA)x − 2(Aᵀb)ᵀx + bᵀb,

so the gradient is 2(AᵀAx − Aᵀb) by (Q). Setting the gradient equal to zero gives the normal equations, AᵀAx = Aᵀb.
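The normal equations can be solved directly for small problems. A minimal pure-Python sketch, fitting a line to three made-up data points (the matrix A and vector b are illustrative, not from the notes):

```python
# Least squares via the normal equations A^T A x = A^T b (Example 1.3.1),
# for a 3x2 problem: fit y = c0 + c1*t at t = 1, 2, 3.
A = [[1.0, 1.0],
     [1.0, 2.0],
     [1.0, 3.0]]
b = [1.0, 2.0, 2.0]

# Form A^T A (2x2) and A^T b (2-vector).
AtA = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)]
       for i in range(2)]
Atb = [sum(A[k][i] * b[k] for k in range(3)) for i in range(2)]

# Solve the 2x2 system by Cramer's rule.
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x = [(Atb[0] * AtA[1][1] - AtA[0][1] * Atb[1]) / det,
     (AtA[0][0] * Atb[1] - Atb[0] * AtA[1][0]) / det]
print(x)  # best-fit intercept and slope
```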
2 Unconstrained Optimization

2.1 Inequalities
Consider the unconstrained problem minₓ f(x), where f is a convex function. Let x∗ = argminₓ f(x). We assume that x∗ is unknown and p∗ = f(x∗) is unknown, but that ∇f(x) exists and can be computed. We would like to find x (our guess at the minimum) such that |f(x) − p∗| is small or ‖x − x∗‖ is small.

Assume that f is twice differentiable. By Taylor's theorem,

f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½(y − x)ᵀ∇²f(z)(y − x)

for some z on the line segment between x and y. Assume that, for some m > 0, ∇²f(x) ⪰ mI for all x ∈ Rⁿ. Then

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖².

The right hand side is a quadratic function of y, so it has a minimum. Differentiating with respect to y, the minimizer y∗ satisfies ∇f(x) + m(y∗ − x) = 0, i.e. y∗ − x = −(1/m)∇f(x). Then

f(y) ≥ f(x) − (1/m)‖∇f(x)‖² + (m/2)‖(1/m)∇f(x)‖² = f(x) − (1/2m)‖∇f(x)‖².

Whence p∗ = min_y f(y) ≥ f(x) − (1/2m)‖∇f(x)‖², so

0 ≤ f(x) − p∗ ≤ (1/2m)‖∇f(x)‖².      (1)
Under the same assumptions, for any x, y ∈ Rⁿ, as before,

f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½(y − x)ᵀ∇²f(z)(y − x) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖²,

so

p∗ = f(x∗) ≥ f(x) + ∇f(x)ᵀ(x∗ − x) + (m/2)‖x∗ − x‖² ≥ f(x) − ‖∇f(x)‖‖x∗ − x‖ + (m/2)‖x∗ − x‖².

Therefore 0 ≥ p∗ − f(x) ≥ −‖∇f(x)‖‖x∗ − x‖ + (m/2)‖x∗ − x‖², so

‖x∗ − x‖ ≤ (2/m)‖∇f(x)‖.      (2)
Now assume that, for some M > 0, ∇²f(x) ⪯ MI for all x ∈ Rⁿ. Then

f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½(y − x)ᵀ∇²f(z)(y − x) ≤ f(x) + ∇f(x)ᵀ(y − x) + (M/2)‖y − x‖²,

so

p∗ = min_y f(y) ≤ min_y { f(x) + ∇f(x)ᵀ(y − x) + (M/2)‖y − x‖² }.

We can solve the minimization problem on the right hand side, as above, by setting the derivative with respect to y to zero: ∇f(x) + M(y∗ − x) = 0, so y∗ − x = −(1/M)∇f(x). Plugging this in,

p∗ − f(x) ≤ −(1/M)‖∇f(x)‖² + (1/2M)‖∇f(x)‖² = −(1/2M)‖∇f(x)‖²,

so

(1/2M)‖∇f(x)‖² ≤ f(x) − p∗.      (3)
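Inequalities (1) and (3) sandwich the optimality gap between multiples of ‖∇f(x)‖². The sketch below checks this on a quadratic where m and M are the exact eigenvalue bounds; the function and test points are illustrative choices.

```python
# Check of bounds (1) and (3) for f(x) = (1/2) x^T Q x with Q = diag(1, 4),
# so m = 1, M = 4 and p* = 0:
#   ||grad f||^2 / (2M)  <=  f(x) - p*  <=  ||grad f||^2 / (2m).
m, M = 1.0, 4.0

def f(x):
    return 0.5 * (m * x[0]**2 + M * x[1]**2)

def grad_sq(x):  # squared norm of the gradient (m*x1, M*x2)
    return (m * x[0])**2 + (M * x[1])**2

for x in [(1.0, 0.5), (-2.0, 3.0), (0.1, -0.7)]:
    gap = f(x)  # f(x) - p*, since p* = 0
    assert grad_sq(x) / (2 * M) <= gap <= grad_sq(x) / (2 * m)
print("bounds (1) and (3) hold at the test points")
```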
2.2 Descent methods

Let {x^(n)}ₙ₌₀^∞ be a sequence of vectors. If x^(n+1) = ρx^(n) for some 0 < ρ < 1 then x^(n) → 0, for any starting value x^(0), and we say that x^(n) has linear convergence with constant ρ. (In the convergence theorems below it is the optimality gap f(x^(n)) − p∗ that contracts by a constant factor at each step.)

The general descent method is as follows.

    def descent(f, df, x0, delta):
        x = x0
        while abs(df(x)) > delta:
            dx = compute_dir(f, df, x)   # choose a descent direction
            t = line_search(f, df, x, dx)
            x = x + t * dx
        return x
Gradient descent

The gradient descent method chooses ∆x = −∇f(x). The line search for this method is either the exact line search (i.e. solve min_t f(x + t∆x) exactly for t) or the backtracking line search, as follows. The function g(t) := f(x + t∆x) is convex, and the line

{(t, f(x) + t∇f(x)ᵀ∆x) | t ∈ R}

supports the graph of g at the point t = 0. Choose an α < ½ and form the reference line

{(t, f(x) + αt∇f(x)ᵀ∆x) | t ∈ R},

which intersects the graph of g at t = 0 and at some t₀ > 0. See Figure 1. Choose a reduction factor β < 1 and run the following.

    t = 1
    while f(x + t * dx) > f(x) + alpha * t * dot(df(x), dx):
        t = beta * t
    return t

Observe that this algorithm terminates with t ∈ [βt₀, min{t₀, 1}], where t₀ is the other point at which the reference line intersects the graph of g.
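Putting the descent loop and the backtracking search together gives a complete method. A minimal pure-Python sketch; the objective f(x, y) = x² + 4y², the parameters α and β, and the stopping tolerance are all illustrative choices.

```python
# Gradient descent with backtracking line search (Section 2.2).
def f(x):
    return x[0]**2 + 4.0 * x[1]**2

def df(x):
    return [2.0 * x[0], 8.0 * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gradient_descent(x, alpha=0.3, beta=0.5, tol=1e-8):
    while dot(df(x), df(x)) > tol**2:
        dx = [-g for g in df(x)]  # descent direction: -grad f(x)
        t = 1.0
        # Backtracking: shrink t until the sufficient-decrease condition holds.
        while f([x[i] + t * dx[i] for i in range(2)]) > f(x) + alpha * t * dot(df(x), dx):
            t *= beta
        x = [x[i] + t * dx[i] for i in range(2)]
    return x

x_star = gradient_descent([4.0, -2.0])
print(x_star)  # close to the minimizer (0, 0)
```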
Figure 1: Backtracking line search

2.2.1 Theorem. Let f be twice continuously differentiable and such that mI ⪯ ∇²f(x) ⪯ MI for all x ∈ Rⁿ. Assume that m > 0. Then the gradient descent method applied to f converges linearly with constant 1 − m/M using exact line search and with constant 1 − 2αβm/M using backtracking line search.
PROOF: Let {x^(n)}ₙ₌₀^∞ be the sequence of points encountered by the gradient descent method.

For exact line search, let t_exact := argmin_t f(x − t∇f(x)). For any t ≥ 0,

f(x − t∇f(x)) ≤ f(x) − t‖∇f(x)‖² + t²(M/2)‖∇f(x)‖²,

and minimizing the right hand side over t (the minimum is at t = 1/M) gives

f(x − t_exact∇f(x)) ≤ f(x) − (1/2M)‖∇f(x)‖².

Therefore

f(x^(n+1)) − p∗ ≤ f(x^(n)) − p∗ − (1/2M)‖∇f(x^(n))‖² ≤ (1 − m/M)(f(x^(n)) − p∗),

where the last inequality follows from (1), written as −‖∇f(x)‖² ≤ −2m(f(x) − p∗). This proves linear convergence since m < M.

For the backtracking method, as above,

f(x − t∇f(x)) ≤ f(x) − t‖∇f(x)‖² + t²(M/2)‖∇f(x)‖² = f(x) + (−t + (M/2)t²)‖∇f(x)‖²
≤ f(x) − ½t‖∇f(x)‖² ≤ f(x) − αt‖∇f(x)‖²,

where the first of these two inequalities holds for all 0 ≤ t ≤ 1/M. It follows that the backtracking step always terminates with a non-zero t, and further: either t = 1 and f(x − ∇f(x)) ≤ f(x) − α‖∇f(x)‖², or β/M ≤ t ≤ 1 and

f(x − t∇f(x)) ≤ f(x) − αt‖∇f(x)‖² ≤ f(x) − (αβ/M)‖∇f(x)‖².

Therefore, for the iteration,

f(x^(n+1)) − p∗ ≤ f(x^(n)) − p∗ − α min{1, β/M}‖∇f(x^(n))‖²
  ≤ (f(x^(n)) − p∗)(1 − 2mα min{1, β/M})
  = (f(x^(n)) − p∗)(1 − 2αβm/M),

using (1) again in the second step, and assuming β/M ≤ 1 in the last (otherwise the constant 1 − 2mα is even better). □
Data processing

Here is our first application. Given data {u∗ⱼ}ⱼ₌₁^N, interpreted as coming from some "rough" or "noisy" source, find {uⱼ}ⱼ₌₁^N such that u is "nice" and u is "close to" u∗. If by "close to" we mean that ‖u − u∗‖₂² is small, and if by "nice" we mean that u has a reasonably sized quadratic variation, then the problem becomes

min_u Σⱼ (uⱼ − u∗ⱼ)² + β Σⱼ (uⱼ₊₁ − uⱼ)².

The term on the right is the roughness penalty and it involves an arbitrary parameter β controlling the relative importance of the two goals. Call this objective function f(u). It is convex (as a sum of convex functions) and twice continuously differentiable, with

∂f/∂uₖ = 2(uₖ − u∗ₖ) + 2β(2uₖ − uₖ₋₁ − uₖ₊₁)

for 1 ≤ k ≤ N (where uₖ := 0 if k = 0 or k = N + 1). Setting ∇f(u) = 0, we need to solve the following linear system of equations.

    [ 2 + 4β    −2β       0      ⋯       0    ]
    [  −2β     2 + 4β    −2β             ⋮    ]
    [   0       −2β     2 + 4β    ⋱      0    ]  u = 2u∗
    [   ⋮                  ⋱    2 + 4β  −2β   ]
    [   0        ⋯        0      −2β   2 + 4β ]
We can of course do the same thing when u∗ is two-dimensional (e.g. an image). Then the problem becomes

min_u Σᵢⱼ (u∗ᵢⱼ − uᵢⱼ)² + β ( Σᵢⱼ (uᵢ₊₁,ⱼ − uᵢ,ⱼ)² + Σᵢⱼ (uᵢ,ⱼ₊₁ − uᵢ,ⱼ)² ).
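The 1D tridiagonal system above can be solved in O(N) by forward elimination and back substitution (the Thomas algorithm). A pure-Python sketch; the noisy data and the value of β are illustrative.

```python
# Solve (2 + 4b) u_k - 2b u_{k-1} - 2b u_{k+1} = 2 u*_k (with u_0 = u_{N+1} = 0)
# by the Thomas algorithm for a constant tridiagonal matrix.
def smooth(u_star, beta):
    n = len(u_star)
    diag = [2.0 + 4.0 * beta] * n
    off = -2.0 * beta
    rhs = [2.0 * v for v in u_star]
    # Forward elimination.
    for k in range(1, n):
        w = off / diag[k - 1]
        diag[k] -= w * off
        rhs[k] -= w * rhs[k - 1]
    # Back substitution.
    u = [0.0] * n
    u[-1] = rhs[-1] / diag[-1]
    for k in range(n - 2, -1, -1):
        u[k] = (rhs[k] - off * u[k + 1]) / diag[k]
    return u

noisy = [0.0, 1.1, 0.9, 1.2, 1.0, 0.1]
print(smooth(noisy, beta=1.0))  # a smoothed version of the data
```

With β = 0 the system reduces to 2u = 2u∗ and the data is returned unchanged, which is a convenient sanity check.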
Steepest Descent

Given z ∈ Rⁿ, z∗ : Rⁿ → R : x ↦ zᵀx defines a linear functional. We write ‖z‖∗ := max_{‖x‖=1} |zᵀx|, and call ‖·‖∗ the dual norm of ‖·‖. The dual norm of the Euclidean norm is itself. Recall that the operator norm of a matrix A acting on (Rⁿ, ‖·‖₁) is the largest of the ℓ¹-norms of the columns of A, and that the operator norm of A acting on (Rⁿ, ‖·‖∞) is the largest of the ℓ¹-norms of the rows of A. It follows that ‖z‖₁,∗ = ‖z‖∞ and ‖z‖∞,∗ = ‖z‖₁.

The steepest descent method chooses ∆x so that f(x + t∆x) decreases in t most rapidly, for some definition of "rapidly". Recall from Taylor's theorem that f(x + t∆x) ≈ f(x) + t∇f(x)ᵀ∆x. Given a norm ‖·‖ on Rⁿ, we would like to find a direction vector ∆x_nsd (i.e. ‖∆x_nsd‖ = 1) that minimizes ∇f(x)ᵀ∆x (this minimum will be negative). That is to say, for steepest descent,

∆x_nsd := argmin{∇f(x)ᵀ∆x : ‖∆x‖ = 1}.

Let P ∈ Sⁿ₊₊ and consider that

‖x‖²_P = xᵀPx = xᵀP^(1/2)P^(1/2)x = (P^(1/2)x)ᵀ(P^(1/2)x) = ‖P^(1/2)x‖₂².

Therefore we need to solve

min{∇f(x)ᵀ∆x : ‖P^(1/2)∆x‖₂ = 1}
  = min{∇f(x)ᵀP^(−1/2)(P^(1/2)∆x) : ‖P^(1/2)∆x‖₂ = 1}
  = min{(P^(−1/2)∇f(x))ᵀ(P^(1/2)∆x) : ‖P^(1/2)∆x‖₂ = 1}.

The solution to the final problem is a unit vector in the direction −P^(−1/2)∇f(x), so, up to a positive scalar, the steepest descent direction for ‖·‖_P is ∆x = −P^(−1)∇f(x).
2.2.2 Example. Let f(x, y) = εx² + y² for some ε > 0. Then f is twice continuously differentiable, ∇f(x, y) = (2εx, 2y)ᵀ, and ∇²f(x, y) = diag(2ε, 2). If we take P = diag(2ε, 2) then the steepest descent direction is

∆x = −diag(2ε, 2)^(−1) (2εx, 2y)ᵀ = −(x, y)ᵀ,

which points directly at the minimizer (0, 0), however small ε is.

2.2.3 Example. For the norm ‖·‖₁, the steepest descent direction is

∆x = −sign(∇f(x)ᵢ) eᵢ,  where i = argmaxᵢ |∇f(x)ᵢ|,

i.e. a signed coordinate direction along a largest component of the gradient.
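Example 2.2.3 amounts to coordinate descent on the largest partial derivative. A small pure-Python sketch (the gradient vector is an illustrative input):

```python
# Steepest descent direction for the l1 norm: a signed standard basis
# vector along a largest-magnitude gradient component.
def l1_steepest_direction(grad):
    i = max(range(len(grad)), key=lambda j: abs(grad[j]))
    dx = [0.0] * len(grad)
    dx[i] = -1.0 if grad[i] > 0 else 1.0
    return dx

print(l1_steepest_direction([0.5, -3.0, 1.0]))  # [0.0, 1.0, 0.0]
```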
2.2.4 Theorem. Let f be twice continuously differentiable and such that mI ⪯ ∇²f(x) ⪯ MI for all x ∈ Rⁿ. Assume that m > 0. Fix a norm ‖·‖ on Rⁿ and choose γ ∈ (0, 1) such that ‖x‖ ≥ γ‖x‖₂ for all x ∈ Rⁿ. The steepest descent method with backtracking line search applied to f converges linearly with constant 1 − 2mαγ² min{1, βγ²/M}.
PROOF: Define

∆x_nsd := argmin{∇f(x)ᵀ∆x : ‖∆x‖ = 1}  and  ∆x_sd := ‖∇f(x)‖∗ ∆x_nsd,

so that ∇f(x)ᵀ∆x_sd = −‖∇f(x)‖∗² and ‖∆x_sd‖ = ‖∇f(x)‖∗. Then

f(x + t∆x_sd) ≤ f(x) + t∇f(x)ᵀ∆x_sd + t²(M/2)‖∆x_sd‖₂²      (Taylor's theorem)
  ≤ f(x) + t∇f(x)ᵀ∆x_sd + t²(M/2γ²)‖∆x_sd‖²      (by choice of γ)
  = f(x) − t‖∇f(x)‖∗² + t²(M/2γ²)‖∇f(x)‖∗²      (definition of ∆x_sd)
  = f(x) − t(1 − tM/2γ²)‖∇f(x)‖∗²
  ≤ f(x) − ½t‖∇f(x)‖∗²      (see below)
  ≤ f(x) − αt‖∇f(x)‖∗²      (α < ½)

The second last line follows because the minimum value of −t + t²M/2γ² is −γ²/2M, attained at t = γ²/M; the line joining the vertex of this parabola to the origin has slope −½ and lies above the parabola for t ∈ (0, γ²/M].

It follows that the result of the backtracking line search satisfies t ≥ min{1, βγ²/M}, so, using ‖∇f(x)‖∗ ≥ γ‖∇f(x)‖₂ (which we may arrange by shrinking γ if necessary, all norms on Rⁿ being equivalent) and then (1),

f(x^(n+1)) − p∗ ≤ (f(x^(n)) − p∗) − α min{1, βγ²/M}‖∇f(x^(n))‖∗²
  ≤ (f(x^(n)) − p∗) − αγ² min{1, βγ²/M}‖∇f(x^(n))‖₂²
  ≤ (f(x^(n)) − p∗)(1 − 2mαγ² min{1, βγ²/M}). □
Newton’s method
2
Let {x n }∞
n=1 be a sequence of real numbers. If x n+1 = K x n then, if x n → 0, we say
that x n has quadratic convergence. Unfortunately, the conditions under which such
a sequence converges depend on both K and x 0 .
Newton’s method is steepest descent where, at each step, we move in the steepest descent direction for the norm k · k∇2 f (x (n) ) . We will see that Newton’s method
has quadratic convergence.
Another approach to Newton’s method is as follows. Recall
f (x + t v) ≈ f (x) + t∇ f (x) T v + 21 t 2 v T ∇2 f (x)v.
The minimum value of min v ∇ f T v + 12 v T ∇2 f v occurs when ∇2 f (x)v = −∇ f (x),
by (Q). Newton’s method takes ∆x = −∇2 f (x)−1 ∇ f (x).
A third approach is to note f (x ∗ ) = min x f (x) if and only if ∇ f (x ∗ ) = 0.
We can approximate the latter by solving ∇ f (x + v) = 0, which is approximately
solved by v satisfying ∇ f (x) + ∇2 f (x)v = 0, by Taylor’s theorem.
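The step ∆x solves the linear system ∇²f(x)∆x = −∇f(x). A pure-Python sketch with full steps (no line search) on a strictly convex two-variable function; the objective is an illustrative choice, not from the notes.

```python
import math

# Newton's method with full steps for
# f(x, y) = exp(x + y) + exp(x - y) + x^2.
def grad(x, y):
    a, b = math.exp(x + y), math.exp(x - y)
    return [a + b + 2 * x, a - b]

def hess(x, y):
    a, b = math.exp(x + y), math.exp(x - y)
    return [[a + b + 2, a - b], [a - b, a + b]]

def newton(x, y, steps=20):
    for _ in range(steps):
        g, H = grad(x, y), hess(x, y)
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        # v = -H^{-1} g via the explicit 2x2 inverse
        vx = -(H[1][1] * g[0] - H[0][1] * g[1]) / det
        vy = -(H[0][0] * g[1] - H[1][0] * g[0]) / det
        x, y = x + vx, y + vy
    return x, y

print(newton(0.0, 0.0))  # approx (-0.5671, 0.0), the unique minimizer
```

At the minimizer y = 0 and x satisfies eˣ + x = 0, which makes the result easy to verify.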
Newton’s method
11
2.3.1 Theorem. Let f be twice continuously differentiable and such that mI ∇2 f (x) M I for all x ∈ Rn . Assume that m > 0 and there is L (the Lipschitz
constant) such that
k∇2 f (x) − ∇2 f ( y)k2 ≤ Lkx − yk2
for all x, y ∈ Rn . Newton’s method with backtracking line search applied to f
satisfies the following. There is 0 < η ≤ m2 /L and γ = γ(η) > 0 such that
(i) if k∇ f (x)k ≥ η then f (x (n+1) ) − f (x (n) ) ≤ −γ;
(ii) if k∇ f (x)k ≤ η then k 2mL 2 ∇ f (x (n+1) )k2 ≤ k 2mL 2 ∇ f (x (n) )k22 and the line
search terminates with t = 1.
In fact, η = max{3(1 − 2α), 1}m2 /L.
Remark. It follows from (i) that the number of iterations needed, from any starting point x^(0), to reach case (ii) is at most (f(x^(0)) − p∗)/γ. If we are in case (ii) then ‖∇f(x^(n))‖ ≤ η implies ‖∇f(x^(n+1))‖ ≤ η, so we remain in case (ii). Further, for any ℓ > k,

(L/2m²)‖∇f(x^(ℓ))‖₂ ≤ ((L/2m²)‖∇f(x^(k))‖₂)^(2^(ℓ−k)) ≤ (½)^(2^(ℓ−k)),

since η ≤ m²/L. By (1), after n steps in case (ii),

f(x^(n)) − p∗ ≤ (1/2m)‖∇f(x^(n))‖₂² ≤ (1/2m)(2m²/L)²(½)^(2^(n+1)) = (2m³/L²)(½)^(2^(n+1)).

Note that in this case the number of steps required so that f(x^(n)) − p∗ < ε is on the order of log₂ log₂(1/ε). For all practical purposes, at most 6 such steps will be required!
PROOF: Define ∆x_nt := −(∇²f(x))^(−1)∇f(x) and let

λ²(x) := ∇f(x)ᵀ(∇²f(x))^(−1)∇f(x) > 0.

Note that λ²(x) = −∇f(x)ᵀ∆x_nt. From the linear approximation, this is the amount the approximation decreases with a step of unit size; λ(x) is called the Newton decrement. Further,

‖∆x_nt‖₂² = ∇f(x)ᵀ(∇²f(x))^(−1)(∇²f(x))^(−1)∇f(x) ≤ (1/m)λ²(x).

As usual,

f(x + t∆x_nt) ≤ f(x) + t∇f(x)ᵀ∆x_nt + t²(M/2)‖∆x_nt‖₂²
  ≤ f(x) − tλ²(x) + t²(M/2m)λ²(x)
  = f(x) + (−t + t²M/2m)λ²(x).

The parabola −t + t²M/2m is minimized at t = m/M, and the slope of the line joining (0, 0) with the vertex (m/M, −m/2M) is −½. So when 0 < t ≤ m/M,

f(x + t∆x_nt) ≤ f(x) − ½tλ²(x) ≤ f(x) − αtλ²(x) = f(x) + αt∇f(x)ᵀ∆x_nt      (α < ½).

Therefore any t ∈ (0, m/M] satisfies the backtracking condition. It follows that the t returned by the backtracking algorithm is at least t̂ = βm/M. What is the reduction in f?

f(x + t̂∆x_nt) ≤ f(x) − αβ(m/M)λ²(x),  so  f(x^(n+1)) − f(x^(n)) ≤ −αβ(m/M)λ²(x^(n)).

But λ²(x) = ∇f(x)ᵀ(∇²f(x))^(−1)∇f(x), so

(1/M)‖∇f(x)‖₂² ≤ λ²(x) ≤ (1/m)‖∇f(x)‖₂².

Whence

f(x^(n+1)) − f(x^(n)) ≤ −αβ(m/M²)‖∇f(x^(n))‖₂² ≤ −αβ(m/M²)η² =: −γ.
Now for the second, more difficult, part. Let f̃(t) = f(x + t∆x_nt). The idea is to derive an estimate for f̃″, then for f̃′, and then for f̃. We have

f̃′(t) = ∇f(x + t∆x_nt)ᵀ∆x_nt,      f̃″(t) = ∆x_ntᵀ∇²f(x + t∆x_nt)∆x_nt,
f̃′(0) = ∇f(x)ᵀ∆x_nt = −λ²(x),      f̃″(0) = ∆x_ntᵀ∇²f(x)∆x_nt = λ²(x).

It follows from the Lipschitz condition that

|f̃″(t) − f̃″(0)| = |∆x_ntᵀ(∇²f(x + t∆x_nt) − ∇²f(x))∆x_nt| ≤ tL‖∆x_nt‖₂³.

Whence

f̃″(t) ≤ f̃″(0) + tL‖∆x_nt‖₂³ ≤ λ²(x) + tL‖∆x_nt‖₂³ ≤ λ²(x) + t(L/m^(3/2))λ³(x),

so, integrating,

f̃′(t) ≤ f̃′(0) + λ²(x)t + t²(L/2m^(3/2))λ³(x) = −λ²(x) + λ²(x)t + t²(L/2m^(3/2))λ³(x).

Integrating again,

f̃(t) ≤ f̃(0) − tλ²(x) + ½t²λ²(x) + t³(L/6m^(3/2))λ³(x),

i.e.

f(x + t∆x_nt) − f(x) ≤ −tλ²(x) + ½t²λ²(x) + t³(L/6m^(3/2))λ³(x).
We need to check the exit condition for the backtracking line search:

f(x + t∆x_nt) ≤ f(x) + αt∇f(x)ᵀ∆x_nt  ⟺  f̃(t) ≤ f̃(0) − αtλ²(x).

We want to exit the line search algorithm on the first step, with t = 1. Taking t = 1 in the bound above, the following steps are reversible:

−λ²(x) + ½λ²(x) + (L/6m^(3/2))λ³(x) ≤ −αλ²(x)
  ⟺ −½ + (L/6m^(3/2))λ(x) ≤ −α
  ⟺ λ(x) ≤ 3(1 − 2α)m^(3/2)/L.

We have the inequality √m·λ(x) ≤ ‖∇f(x)‖₂, so if we choose η = min{3(1 − 2α), 1}m²/L then, since ‖∇f(x)‖₂ ≤ η in case (ii), the inequality is satisfied and the line search terminates with t = 1. It remains to show that

(L/2m²)‖∇f(x^(n+1))‖₂ ≤ ((L/2m²)‖∇f(x^(n))‖₂)².
Consider the following, writing x = x^(n) and x^(n+1) = x + ∆x_nt, and using ∇f(x) + ∇²f(x)∆x_nt = 0:

‖∇f(x^(n+1))‖₂ = ‖∇f(x + ∆x_nt) − ∇f(x) − ∇²f(x)∆x_nt‖₂
  = ‖∫₀¹ (∇²f(x + t∆x_nt) − ∇²f(x))∆x_nt dt‖₂
  ≤ ∫₀¹ ‖∇²f(x + t∆x_nt) − ∇²f(x)‖₂ ‖∆x_nt‖₂ dt
  ≤ L ∫₀¹ t‖∆x_nt‖₂² dt = (L/2)‖∆x_nt‖₂² ≤ (L/2m²)‖∇f(x)‖₂².

Multiplying both sides by L/2m² proves the inequality. The last inequality above follows from

‖∆x_nt‖₂ = ‖(∇²f(x))^(−1)∇f(x)‖₂ ≤ ‖(∇²f(x))^(−1)‖₂‖∇f(x)‖₂ ≤ (1/m)‖∇f(x)‖₂. □
2.4 Non-differentiable optimization

In applications the objective function, while convex, may not be differentiable. Whether or not f is differentiable at a point x, define the sub-gradient (or sub-differential) of f at x to be

∂f(x) = {g ∈ Rⁿ | f(y) ≥ f(x) + gᵀ(y − x) for all y ∈ Rⁿ}.

For example, ∂|·|(0) = [−1, 1], and

∂(|x| + |y|) = {(sign(x), sign(y))}            if x ≠ 0 and y ≠ 0,
               {(s, sign(y)) | s ∈ [−1, 1]}    if x = 0, y ≠ 0,
               {(sign(x), s) | s ∈ [−1, 1]}    if y = 0, x ≠ 0,
               [−1, 1] × [−1, 1]               if (x, y) = (0, 0).

Note that the sub-gradient is always a convex set.
2.4.1 Lemma. x∗ = argminₓ f(x) if and only if 0 ∈ ∂f(x∗).

PROOF: If 0 ∈ ∂f(x) then by definition f(y) ≥ f(x) + 0ᵀ(y − x) = f(x) for all y ∈ Rⁿ, so x is a minimizer. Conversely, if 0 ∉ ∂f(x) then there is y ∈ Rⁿ such that f(y) < f(x) + 0ᵀ(y − x) = f(x), so x is not a minimizer. □
The sub-gradient method is described below. Choose a step size αₖ and a direction g^(k) ∈ ∂f(x^(k)) and let x^(k+1) = x^(k) − αₖg^(k).

    x = xbest = x0
    while not converged:
        choose g in the sub-gradient of f at x
        calculate alpha
        x = x - alpha * g
        if f(x) < f(xbest): xbest = x
Since g^(k) ∈ ∂f(x^(k)), it follows that f(x∗) ≥ f(x^(k)) + g^(k)ᵀ(x∗ − x^(k)), so

‖x^(k+1) − x∗‖₂² = ‖x^(k) − αₖg^(k) − x∗‖₂²
  = ‖x^(k) − x∗‖₂² − 2αₖg^(k)ᵀ(x^(k) − x∗) + αₖ²‖g^(k)‖₂²
  ≤ ‖x^(k) − x∗‖₂² − 2αₖ(f(x^(k)) − f(x∗)) + αₖ²‖g^(k)‖₂²
  ≤ ‖x^(1) − x∗‖₂² − 2 Σⱼ₌₁ᵏ αⱼ(f(x^(j)) − f(x∗)) + Σⱼ₌₁ᵏ αⱼ²‖g^(j)‖₂².

In particular we conclude the following, where x_best is the best of the first k iterates:

2 Σⱼ₌₁ᵏ αⱼ(f(x^(j)) − f(x∗)) ≤ ‖x^(1) − x∗‖₂² + Σⱼ₌₁ᵏ αⱼ²‖g^(j)‖₂²,

2(f(x_best) − f(x∗)) Σⱼ₌₁ᵏ αⱼ ≤ ‖x^(1) − x∗‖₂² + Σⱼ₌₁ᵏ αⱼ²‖g^(j)‖₂²,

f(x_best) − f(x∗) ≤ (‖x^(1) − x∗‖₂² + Σⱼ₌₁ᵏ αⱼ²‖g^(j)‖₂²) / (2 Σⱼ₌₁ᵏ αⱼ).
There are several methods for choosing the sequence {αₖ}ₖ₌₁^∞.

(i) Constant step size: αₖ = h for all k ≥ 1.
(ii) Variable step size, scaled by g^(k): αₖ = h/‖g^(k)‖₂ for all k ≥ 1.
(iii) ℓ² sequence: Σₖ αₖ² < ∞ but Σₖ αₖ = ∞.
(iv) c₀ sequence: limₖ→∞ αₖ = 0 but Σₖ αₖ = ∞.

We assume that ∂f(x) is bounded, uniformly in x ∈ Rⁿ, by the same radius G. We analyze each case below.

(i) f(x_best) − f(x∗) ≤ (‖x^(1) − x∗‖₂² + kh²G²)/(2kh) → hG²/2 as k → ∞.

(ii) f(x_best) − f(x∗) ≤ (‖x^(1) − x∗‖₂² + kh²)/(2h Σⱼ₌₁ᵏ 1/‖g^(j)‖₂) ≤ (‖x^(1) − x∗‖₂² G + kh²G)/(2hk) → hG/2.

(iii) f(x_best) − f(x∗) ≤ (‖x^(1) − x∗‖₂² + Σⱼ₌₁ᵏ αⱼ²‖g^(j)‖₂²)/(2 Σⱼ₌₁ᵏ αⱼ) → 0, since the numerator remains bounded while the denominator diverges.

(iv) Exercise.
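Case (iii) can be watched in action. A pure-Python sketch with the square-summable steps αₖ = 1/k on the non-differentiable function f(x) = Σᵢ |x − aᵢ|, minimized at the median of the data; the data points and iteration count are illustrative.

```python
# Sub-gradient method with alpha_k = 1/k (an l^2 sequence, case (iii))
# on f(x) = sum |x - a_i|.
data = [1.0, 2.0, 7.0]

def f(x):
    return sum(abs(x - a) for a in data)

def subgrad(x):
    # one valid sub-gradient: the sum of signs, taking sign(0) = 0
    return sum((x > a) - (x < a) for a in data)

x, xbest = 10.0, 10.0
for k in range(1, 2001):
    x = x - (1.0 / k) * subgrad(x)
    if f(x) < f(xbest):
        xbest = x
print(xbest)  # close to the median, 2.0
```

Note that the iterates are not monotone: the method oscillates around the minimizer, which is why the notes track x_best.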
2.4.2 Example. In a data fitting problem we can choose our error function. Let's compare the sum-of-squares smoothness penalty

Σⱼ (uⱼ − u∗ⱼ)² + β Σⱼ (uⱼ₊₁ − uⱼ)²

with the non-differentiable

Σⱼ (uⱼ − u∗ⱼ)² + β Σⱼ |uⱼ₊₁ − uⱼ|.

Let h = 1/N and consider the two possible fits {uⱼ = jh} (a gradual ramp) and {vⱼ = δⱼ,N} (a single jump) on the grid {jh | j = 0, …, N}. Then Σⱼ |uⱼ₊₁ − uⱼ| = Σⱼ h = Nh = 1 and Σⱼ |vⱼ₊₁ − vⱼ| = 1, but Σⱼ (uⱼ₊₁ − uⱼ)² = Σⱼ h² = Nh² = 1/N while Σⱼ (vⱼ₊₁ − vⱼ)² is still 1: the quadratic penalty charges the jump N times more than the ramp, while the ℓ¹ penalty charges them equally. If the "true" curve is expected to have jumps then we would prefer the non-differentiable error function, because the sum-of-squares error function gives too much preference to smoother fits, i.e. we lose "sharpness."
3 Constrained Optimization

Let f : Rⁿ → R be a differentiable convex function. Recall that minₓ f(x) has a solution at x∗ if and only if ∇f(x∗) = 0. This is the unconstrained case. Consider the problem minₓ f(x) subject to constraints such as Ax = b or x ≥ 0. The minimizer may occur at some x∗ such that ∇f(x∗) ≠ 0.

Recall that f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all x, y ∈ Rⁿ and that x∗ is a minimizer if and only if ∇f(x∗)ᵀ(y − x∗) ≥ 0 for all y satisfying the constraints.

For an equality constraint Ax = b we require that Ay = b. The minimizer must of course satisfy the constraints, so we obtain A(y − x∗) = 0. Furthermore, every element of ker(A) arises as y − x∗ for some feasible y. Therefore ∇f(x∗)ᵀv ≥ 0 for any v ∈ ker(A). Again, this is a linear functional bounded below on a linear space, so it must be the case that ∇f(x∗)ᵀv = 0 for all v ∈ ker(A). It is a fact that Rⁿ = ker(A) ⊕ range(Aᵀ), so ∇f(x∗) ∈ range(Aᵀ). Write ∇f(x∗) = −Aᵀv. It follows that x∗ is a minimizer if and only if Ax∗ = b and ∇f(x∗) + Aᵀv = 0 for some vector v.
3.0.3 Example. We solve minₓ ½‖x − x₀‖₂² subject to Ax = b. Then ∇f(x) = x − x₀, so the optimality condition becomes Ax∗ = b and Aᵀv + x∗ − x₀ = 0. We can write this as the system

    [ A  0  ] [ x ]   [ b  ]
    [ I  Aᵀ ] [ v ] = [ x₀ ].

For the non-negativity constraint x ≥ 0, recall that ∇f(x∗)ᵀ(y − x∗) ≥ 0 for all y ∈ X. Then ∇f(x∗)ᵀy ≥ ∇f(x∗)ᵀx∗ for all y ≥ 0, so at y = 0 we get ∇f(x∗)ᵀx∗ ≤ 0. Further, ∇f(x∗)ᵀ is a linear form bounded below on the positive cone, so ∇f(x∗) ≥ 0. Combining these, each term xᵢ∗(∇f(x∗))ᵢ of ∇f(x∗)ᵀx∗ is non-negative while the sum is non-positive, so xᵢ∗(∇f(x∗))ᵢ = 0 for all i = 1, …, n. This is also called complementary slackness.
3.1 Separating hyperplanes

3.1.1 Theorem. Let C, D ⊆ Rⁿ be disjoint convex sets. Then there are a ∈ Rⁿ, a ≠ 0, and b ∈ R such that aᵀc ≤ b for all c ∈ C and aᵀd ≥ b for all d ∈ D. If C is closed and D is compact then the inequalities may be taken to be strict. If C and D are both open then the inequalities may be taken to be strict.

Consider the inequalities Ax ≤ b. This is equivalent to b − Ax ≥ 0, and these inequalities have a solution if and only if C = {b − Ax | x ∈ Rⁿ} and D = {y | y ≥ 0} have a non-empty intersection. If C ∩ D = ∅ then by the separating hyperplane theorem there are a vector λ and a scalar µ, not both zero, such that λᵀz + µ ≤ 0 for all z ∈ C and λᵀz + µ ≥ 0 for all z ∈ D.

From λᵀ(b − Ax) + µ ≤ 0 it follows that −λᵀAx ≤ −λᵀb − µ for all x ∈ Rⁿ. A linear functional bounded above must be identically zero, so λᵀA = 0, i.e. Aᵀλ = 0. Further, λᵀb + µ ≤ 0.

From λᵀy + µ ≥ 0 for all y ≥ 0, taking y = 0 gives µ ≥ 0, so λᵀb ≤ −µ ≤ 0. Further, if λ had a negative coordinate then the inequality would fail for some y ≥ 0, so λ ≥ 0.

Therefore (with a little more care to make the last inequality strict) Ax ≤ b does not have a solution if and only if there is λ ≥ 0 such that Aᵀλ = 0 and λᵀb < 0.
3.2 Conjugate functions

Let f : Rⁿ → R be any function and let f∗(y) = sup_{x∈dom(f)} (yᵀx − f(x)), defined for all y for which this supremum is not ∞. We say that f∗ is the conjugate function of f.

3.2.1 Examples.

(i) If f(x) = ax + b then

f∗(y) = sup_{x∈R} (yx − ax − b) = sup_{x∈R} (x(y − a) − b) = ∞ if y ≠ a.

Therefore dom(f∗) = {a} and f∗(a) = −b.

(ii) If f(x) = −log(x) then dom(f) = R₊₊ and

f∗(y) = sup_{x∈R₊₊} (yx + log x) = { −1 − log(−y) if y < 0;  ∞ if y ≥ 0. }

(iii) Let Q ∈ Sⁿ₊₊ and let f(x) = ½xᵀQx. Then

f∗(y) = sup_{x∈Rⁿ} (yᵀx − ½xᵀQx) = ½yᵀQ^(−1)y,

the supremum occurring at the x for which y − Qx = 0, by (Q).

(iv) Let f(x) = ‖x‖, where ‖·‖ is some norm. If ‖y‖∗ > 1 then there is z ∈ Rⁿ with ‖z‖ = 1 such that yᵀz > 1 (replace z by −z if necessary). Notice that in this case

yᵀ(tz) − ‖tz‖ = (yᵀz − 1)t

is unbounded as t → ∞. Suppose instead that ‖y‖∗ ≤ 1. Then

yᵀx − ‖x‖ ≤ ‖y‖∗‖x‖ − ‖x‖ ≤ 0,

with a maximum of 0 attained at x = 0. Therefore

f∗(y) = sup_{x∈Rⁿ} (yᵀx − ‖x‖) = { 0 if ‖y‖∗ ≤ 1;  ∞ if ‖y‖∗ > 1. }

(v) If f(x) = ½‖x‖² then

f∗(y) = sup_{x∈Rⁿ} (yᵀx − ½‖x‖²) ≤ sup_{x∈Rⁿ} (‖y‖∗‖x‖ − ½‖x‖²) = ½‖y‖∗²,

and the bound is attained at any x with ‖x‖ = ‖y‖∗ and yᵀx = ‖y‖∗‖x‖, so f∗(y) = ½‖y‖∗².
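Example 3.2.1(iii) can be verified numerically by approximating the supremum over a grid. A pure-Python sketch; Q, y, and the grid are illustrative choices (the grid is chosen so that the true maximizer x = Q⁻¹y lies on it).

```python
# Check of Example 3.2.1(iii): for f(x) = (1/2) x^T Q x with Q = diag(2, 5),
# the conjugate is f*(y) = (1/2) y^T Q^{-1} y.
q1, q2 = 2.0, 5.0
y = (3.0, -1.0)

grid = [i / 20.0 for i in range(-100, 101)]  # [-5, 5] in steps of 0.05
approx = max(y[0] * x1 + y[1] * x2 - 0.5 * (q1 * x1**2 + q2 * x2**2)
             for x1 in grid for x2 in grid)
exact = 0.5 * (y[0]**2 / q1 + y[1]**2 / q2)
print(approx, exact)  # both near 2.35
```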
3.3 Duality

The general problem is minₓ f₀(x) subject to fᵢ(x) ≤ 0, i = 1, …, m, and hⱼ(x) = 0, j = 1, …, p, where the fᵢ are convex and the hⱼ are affine. Let X be the set of points satisfying the constraints, the feasible points, and let D = ∩ᵢ dom(fᵢ). We assume ∇²f₀ ⪰ mI where differentiability is needed, and let p∗ be the optimal value. Define the Lagrangian to be

L(x, λ, ν) = f₀(x) + Σᵢ₌₁ᵐ λᵢfᵢ(x) + Σⱼ₌₁ᵖ νⱼhⱼ(x)

and g(λ, ν) := inf_{x∈D} L(x, λ, ν). We show that g(λ, ν) ≤ f₀(x̃) for λ ≥ 0 and any feasible point x̃. Indeed,

g(λ, ν) = inf_{x∈D} L(x, λ, ν) ≤ L(x̃, λ, ν) = f₀(x̃) + Σᵢ₌₁ᵐ λᵢfᵢ(x̃) + Σⱼ₌₁ᵖ νⱼhⱼ(x̃) ≤ f₀(x̃),

since each λᵢfᵢ(x̃) ≤ 0 and each hⱼ(x̃) = 0. It follows that g(λ, ν) ≤ p∗ for all λ ≥ 0 and arbitrary ν. The problem max_{λ≥0,ν} g(λ, ν) is the dual problem.
3.3.1 Examples.

(i) For the problem minₓ xᵀx subject to Ax = b,

L(x, ν) = xᵀx + νᵀ(Ax − b),
g(ν) = infₓ (xᵀx + νᵀ(Ax − b)) = −νᵀb − ¼νᵀAAᵀν.

(ii) For the problem minₓ cᵀx subject to Ax = b and x ≥ 0,

L(x, λ, ν) = cᵀx − λᵀx + νᵀ(Ax − b),
g(λ, ν) = infₓ (cᵀx − λᵀx + νᵀ(Ax − b)) = −νᵀb + infₓ ((c − λ + Aᵀν)ᵀx)
        = { −νᵀb if c − λ + Aᵀν = 0;  −∞ otherwise. }

(iii) For the problem minₓ f₀(x) subject to Ax = b and Cx ≤ d,

L(x, λ, ν) = f₀(x) + λᵀ(Cx − d) + νᵀ(Ax − b),
g(λ, ν) = infₓ (f₀(x) + λᵀ(Cx − d) + νᵀ(Ax − b))
        = −νᵀb − λᵀd + infₓ (f₀(x) + (Cᵀλ + Aᵀν)ᵀx)
        = −νᵀb − λᵀd − supₓ ((−Aᵀν − Cᵀλ)ᵀx − f₀(x))
        = −νᵀb − λᵀd − f₀∗(−Aᵀν − Cᵀλ).

(iv) To solve the problem minₓ ‖x‖ subject to Ax = b, note that by (iii) and Example 3.2.1(iv),

g(ν) = −νᵀb − f₀∗(−Aᵀν) = { −νᵀb if ‖Aᵀν‖∗ ≤ 1;  −∞ otherwise. }
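Example (i) admits a quick numerical sanity check. A pure-Python sketch for the tiny instance A = [1 1], b = 2 (an illustrative choice): the primal optimum projects 0 onto the line x₁ + x₂ = 2, and a coarse scan of the dual function recovers the same value.

```python
# Duality check for min x^T x s.t. x1 + x2 = 2.
# Primal optimum: x = (1, 1), p* = 2.
p_star = 1.0**2 + 1.0**2

def g(nu):
    # g(nu) = -nu*b - (1/4) nu^2 (A A^T), with b = 2 and A A^T = 2
    return -nu * 2.0 - 0.25 * (nu**2) * 2.0

d_star = max(g(i / 100.0) for i in range(-1000, 1001))
print(p_star, d_star)  # both 2.0: d* = p*
```

The maximizing multiplier is ν = −2, which lies on the scan grid, so the scan recovers d∗ exactly here.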
We have the following relationships:

g(λ, ν) ≤ f₀(x) for any x ∈ X,
g(λ, ν) ≤ inf_{x∈X} f₀(x) = p∗,
sup_{λ≥0,ν} g(λ, ν) ≤ inf_{x∈X} f₀(x).

This is weak duality. The problem sup_{λ≥0,ν} g(λ, ν) is called the dual problem to the primal problem inf_{x∈X} f₀(x).
3.3.2 Examples.

(i) For the problem minₓ cᵀx subject to Ax = b and x ≥ 0,

g(λ, ν) = { −νᵀb if c − λ + Aᵀν = 0;  −∞ otherwise },

so the dual problem is max_ν −bᵀν subject to Aᵀν + c ≥ 0.

(ii) For the problem minₓ cᵀx subject to Cx ≤ d,

g(λ) = { −λᵀd if c + Cᵀλ = 0;  −∞ otherwise },

so the dual problem is max_λ −dᵀλ subject to Cᵀλ + c = 0 and λ ≥ 0.

Under some conditions,

d∗ = sup_{λ≥0,ν} g(λ, ν) = inf_{x∈X} f₀(x) = p∗.

This is referred to as strong duality.
Suppose for now that there are no equality constraints. Then g(λ) = infₓ L(x, λ), and for any feasible x,

sup_{λ≥0} L(x, λ) = sup_{λ≥0} (f₀(x) + Σⱼ₌₁ᵐ λⱼfⱼ(x)) = f₀(x),

since fⱼ(x) ≤ 0 forces the supremum to occur at λ = 0. Therefore

infₓ sup_{λ≥0} L(x, λ) ≤ inf_{x∈X} f₀(x) = p∗.

On the other hand,

d∗ = sup_{λ≥0} g(λ) = sup_{λ≥0} infₓ L(x, λ).

Finally, by the max-min inequality,

d∗ = sup_{λ≥0} infₓ L(x, λ) ≤ infₓ sup_{λ≥0} L(x, λ) ≤ p∗.
3.3.3 Proposition (Slater condition). If the following two conditions hold then
d ∗ = p∗ , i.e. strong duality holds.
20
Optimization
(i) There is x̃ ∈ int(D) such that f i (x̃) < 0 for all i = 1, . . . , m and h j (x̃) = 0 for
all j = 1, . . . , p; and
(ii) rank(A) = p, where Ax = b is the other representation of the equality constraints.
PROOF: Define the sets

G := {( f1(x), ..., fm(x), h1(x), ..., hp(x), f0(x) ) | x ∈ D}
A := {(u, v, t) ∈ R^{m+p+1} | ∃ x ∈ D ∀ i, j ( f_i(x) ≤ u_i ∧ h_j(x) = v_j ∧ f0(x) ≤ t )}
B := {(0, 0, t) ∈ R^{m+p+1} | t < p^*}

Then A and B are convex, and p^* = inf{t | (0, 0, t) ∈ A}. It follows that A and B are disjoint. If p^* = −∞ then d^* = p^* trivially, so we may assume that B is non-empty, and by assumption A is non-empty.
By the separating hyperplane theorem there is (λ, ν, µ) ≠ 0 and α ∈ R such that (λ, ν, µ)^T(u, v, t) ≥ α for all (u, v, t) ∈ A and (λ, ν, µ)^T(u, v, t) ≤ α for all (u, v, t) ∈ B. From the latter, µt ≤ α for all t < p^*, so µp^* ≤ α. From the former, λ, µ ≥ 0, since u and t can be taken to +∞ in A.
If µ > 0 then we may take µ = 1 without loss of generality, and for any x ∈ D,

Σ_i λ_i f_i(x) + Σ_j ν_j h_j(x) + f0(x) ≥ α ≥ p^*.

Taking the infimum over x ∈ D gives d^* ≥ g(λ, ν) ≥ α ≥ p^*, so with weak duality d^* = p^*.
Assume that µ = 0; then α ≥ 0. For any x ∈ D,

Σ_i λ_i f_i(x) + ν^T(Ax − b) ≥ α ≥ 0.

For the particular point x̃ this gives Σ_i λ_i f_i(x̃) ≥ 0, and since f_i(x̃) < 0 for all i, this forces λ = 0. Therefore ν^T(Ax − b) ≥ 0 for all x ∈ D, and ν ≠ 0 (otherwise (λ, ν, µ) = 0). Since x̃ is in the interior of D, if ν^T A ≠ 0 then there is a point x^− ∈ int(D) such that ν^T(Ax^− − b) < 0. This contradiction implies that A^T ν = 0. But this too is a contradiction, since the rank of A is p. Therefore µ > 0. □
We know that g(λ, ν) ≤ p∗ for any λ ≥ 0 and ν. Therefore
f0(x) − p^* ≤ f0(x) − g(λ, ν).   (4)
This estimate is very important in applications.
3.4 Complementary slackness
Assume for now that strong duality holds. Let x^* be primal optimal and (λ^*, ν^*) be dual optimal. Recall that

f0(x^*) = g(λ^*, ν^*) = inf_{x∈D} { f0(x) + Σ_{i=1}^m λ_i^* f_i(x) + Σ_{j=1}^p ν_j^* h_j(x) }
        ≤ f0(x^*) + Σ_{i=1}^m λ_i^* f_i(x^*) + Σ_{j=1}^p ν_j^* h_j(x^*)
        ≤ f0(x^*).

It follows that we have equality throughout, and in particular λ_i^* f_i(x^*) = 0 for all i = 1, ..., m. If f_i(x^*) < 0 then λ_i^* = 0, and we say that this constraint is an inactive constraint.
3.4.1 Proposition (Karush-Kuhn-Tucker conditions).
Assume that f0 and f_i, i = 1, ..., m, are convex, h_j, j = 1, ..., p, are affine, and all functions are differentiable. Then x^* is primal optimal, (λ^*, ν^*) is dual optimal, and strong duality holds if and only if

f_i(x^*) ≤ 0,        i = 1, ..., m
h_j(x^*) = 0,        j = 1, ..., p
λ_i^* ≥ 0,           i = 1, ..., m
λ_i^* f_i(x^*) = 0,  i = 1, ..., m
∇f0(x^*) + Σ_{i=1}^m λ_i^* ∇f_i(x^*) + Σ_{j=1}^p ν_j^* ∇h_j(x^*) = 0.
PROOF: Note that the function

x ↦ L(x, λ^*, ν^*) = f0(x) + Σ_{i=1}^m λ_i^* f_i(x) + Σ_{j=1}^p ν_j^* h_j(x)

is convex, and the last condition says that ∇_x L(x^*, λ^*, ν^*) = 0, so x^* = argmin_x L(x, λ^*, ν^*). Therefore

g(λ^*, ν^*) = L(x^*, λ^*, ν^*) = f0(x^*),

using λ_i^* f_i(x^*) = 0 and h_j(x^*) = 0, so strong duality holds and both points are optimal. Conversely, if strong duality holds at optimal x^* and (λ^*, ν^*), the first four conditions are feasibility and complementary slackness (shown above), and x^* minimizes the convex differentiable function L(·, λ^*, ν^*), giving the last condition. □
3.4.2 Example. Solve min_x ½ x^T P x + q^T x + r subject to Ax = b. Here

L(x, ν) = ½ x^T P x + q^T x + r + ν^T(Ax − b).

Therefore 0 = ∇_x L implies Px + q + A^T ν = 0, and together with Ax = b the original problem is equivalent to the linear system

[ P  A^T ] [ x ]   [ −q ]
[ A   0  ] [ ν ] = [  b ].
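A minimal numeric sketch of this example (the data P, q, A, b below are made up): assemble the KKT matrix, solve the linear system, and check stationarity and feasibility.

```python
import numpy as np

# Solve min_x 1/2 x^T P x + q^T x + r  subject to  Ax = b
# via the KKT system  [P A^T; A 0][x; nu] = [-q; b].
P = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite
q = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0]])               # one equality constraint
b = np.array([1.0])

n, p = P.shape[0], A.shape[0]
KKT = np.block([[P, A.T], [A, np.zeros((p, p))]])
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(KKT, rhs)
x, nu = sol[:n], sol[n:]

print(A @ x - b)              # feasibility: ~0
print(P @ x + q + A.T @ nu)   # stationarity: ~0
```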
3.4.3 Example. Solve min_x f0(x) = Σ_{i=1}^n f_i(x_i) subject to a^T x = b, i.e. x lies on a hyperplane. The dual function is

g(ν) = inf_x L(x, ν)
     = inf_x ( Σ_{i=1}^n f_i(x_i) + ν(a^T x − b) )
     = −νb + inf_x Σ_{i=1}^n ( f_i(x_i) + ν a_i x_i )
     = −νb − Σ_{i=1}^n sup_{x_i} ( −ν a_i x_i − f_i(x_i) )
     = −νb − Σ_{i=1}^n f_i^*(−ν a_i).

The dual problem is to maximize g(ν) over ν ∈ R. This is one-dimensional, so it should be easy; say the maximum occurs at ν^*. Now to find the optimizer for the primal problem, we must solve the unconstrained problem min_x Σ_{i=1}^n ( f_i(x_i) + ν^* a_i x_i ).
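A worked instance of this example, with f_i(x_i) = x_i²/2 (so f_i^*(y) = y²/2, and the problem is the least-norm point on the hyperplane). Then g(ν) = −νb − (ν²/2)Σ_i a_i², a concave quadratic maximized in closed form, and each one-dimensional recovery problem gives x_i = −ν^* a_i. The data a, b are made up.

```python
import numpy as np

# f_i(x_i) = x_i^2 / 2, so f_i*(y) = y^2 / 2 and
# g(nu) = -nu b - (nu^2 / 2) sum_i a_i^2, maximized at nu* = -b / ||a||^2.
a = np.array([1.0, 2.0, -1.0])
b = 3.0

nu_star = -b / (a @ a)      # closed-form maximizer of the concave quadratic g
x_star = -nu_star * a       # minimize f_i(x_i) + nu* a_i x_i componentwise

print(a @ x_star)           # -> 3.0, i.e. the recovered x is feasible
```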
3.4.4 Example (Optimal transport). Given a list of locations with given amounts of material (x_i, p_i), we would like to find a scheme to move it to locations with given capacities (y_j, q_j), minimizing some distance quantity.
Let the flows be µ_{i,j} ≥ 0, so that Σ_j µ_{i,j} = p_i and Σ_i µ_{i,j} = q_j. The distance quantity is Σ_{i,j} µ_{i,j} |x_i − y_j|², the Wasserstein distance. This problem has the µ_{i,j} as the quantities to be found, so there are on the order of N² unknowns. The dual problem is found as follows. Notice that

Σ_{i,j} µ_{i,j} |x_i − y_j|² = Σ_i |x_i|² p_i + Σ_j |y_j|² q_j − 2 Σ_{i,j} µ_{i,j} x_i · y_j,

so it suffices to minimize −Σ_{i,j} µ_{i,j} x_i · y_j.

L(µ, a, b, λ) = −Σ_{i,j} µ_{i,j} x_i · y_j + Σ_i a_i ( Σ_j µ_{i,j} − p_i ) + Σ_j b_j ( Σ_i µ_{i,j} − q_j ) − Σ_{i,j} λ_{i,j} µ_{i,j}
             = Σ_{i,j} µ_{i,j} ( −x_i · y_j + a_i + b_j − λ_{i,j} ) − Σ_i a_i p_i − Σ_j b_j q_j

g(a, b, λ) = inf_µ L(µ, a, b, λ)
           = { −Σ_i a_i p_i − Σ_j b_j q_j   if −x_i · y_j + a_i + b_j − λ_{i,j} = 0 for all i, j
             { −∞                           otherwise

The supremum of g(a, b, λ) over λ ≥ 0 will occur when −x_i · y_j + a_i + b_j − λ_{i,j} = 0 for all i, j; then 0 ≤ λ_{i,j} = a_i + b_j − x_i · y_j. The dual problem becomes −min_{a,b} ( Σ_i a_i p_i + Σ_j b_j q_j ) subject to a_i + b_j ≥ x_i · y_j. But consider the constraints: b_j ≥ x_i · y_j − a_i for all i, so we may take b_j = max_i ( x_i · y_j − a_i ). In some sense b = a^*, and the dual problem reduces to the unconstrained problem min_a ( Σ_i a_i p_i + Σ_j a_j^* q_j ). If strong duality holds (which it likely does), the minimum distance is 2d^* + Σ_i |x_i|² p_i + Σ_j |y_j|² q_j.
To actually implement this we need an algorithm for calculating a^* given a, and we need some descent method. From the KKT conditions, µ_{i,j} λ_{i,j} = 0, so if λ_{i,j} > 0 then µ_{i,j} = 0, and it can be seen that many of the primal variables are actually zero.
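A sketch of the reduced dual: given potentials a, form b_j = max_i (x_i · y_j − a_i) and minimize the dual objective Σ_i a_i p_i + Σ_j b_j q_j by subgradient descent. All data below is made up, and the plain fixed-step subgradient method is one simple choice of descent method, not necessarily the lecture's.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2)); p = np.full(4, 0.25)   # source locations and masses
y = rng.normal(size=(4, 2)); q = np.full(4, 0.25)   # sink locations and capacities

def conjugate(a):
    # b_j = max_i (x_i . y_j - a_i), and the maximizing index for each j
    scores = x @ y.T - a[:, None]       # scores[i, j] = x_i . y_j - a_i
    return scores.max(axis=0), scores.argmax(axis=0)

def dual_obj(a):
    b, _ = conjugate(a)
    return a @ p + b @ q

a = np.zeros(4)
for _ in range(500):
    _, imax = conjugate(a)
    # Subgradient: d/da_i = p_i - (total capacity q_j of sinks whose max is i)
    g = p - np.bincount(imax, weights=q, minlength=4)
    a -= 0.05 * g

print(dual_obj(a))   # dual objective after the subgradient steps
```

Note the objective is invariant under a → a + c (the shift moves between a and b), which is why no normalization of a is needed.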
3.5 Alternatives

Either there is a solution to Ax = b with x > 0 (componentwise), or there is λ such that A^T λ ≥ 0, A^T λ ≠ 0, and b^T λ ≤ 0. Let C = {x ∈ R^n | x > 0} and D = {x ∈ R^n | Ax = b}. Then C ∩ D = ∅ if and only if there is no solution with x > 0. C and D are convex and non-empty, so there are c ∈ R^n and d ∈ R such that c^T x ≥ d on C and c^T x ≤ d on D. The fact that c^T x ≥ d on C implies c ≥ 0, since the coördinates of x ∈ C may be taken to positive infinity. Taking x → 0, it is seen that d ≤ 0. D is an affine space, so c^T x is constant on D, since it is bounded above there; rename d, if necessary, to be equal to this constant, so that c^T x = d on D. Write D = {x_0 + F y | y ∈ R^r}, where r = dim ker A and F is a matrix whose columns are a basis of ker A. Then c^T(x_0 + F y) = d for any y ∈ R^r, which is only possible if c^T F = 0, so c ⊥ ker A. But ker A ⊕ range A^T = R^n, so c ∈ range A^T, i.e. c = A^T λ for some λ. Finally, A^T λ = c ≥ 0, A^T λ = c ≠ 0, and b^T λ = λ^T A x_0 = c^T x_0 = d ≤ 0. This was problem 2.20 in the textbook.
3.5.1 Example. Consider the system f_i(x) ≤ 0 for i = 1, ..., m and h_j(x) = 0 for j = 1, ..., p. We would like to find an alternative for this collection of inequalities. The idea is to consider the problem min_x 0 subject to f_i(x) ≤ 0 and h_j(x) = 0. If there is any feasible x then the minimum is p^* = 0, otherwise p^* = ∞. The dual function is

g(ν, λ) = inf_x ( Σ_{j=1}^p ν_j h_j(x) + Σ_{i=1}^m λ_i f_i(x) ).

If α > 0 then g(αν, αλ) = α g(ν, λ). Whence, if there is (ν, λ) with λ ≥ 0 such that g(ν, λ) > 0 then p^* ≥ d^* = ∞ by weak duality. If there is no solution to λ ≥ 0, g(ν, λ) > 0 then d^* = 0. Summarizing, if the dual is feasible (i.e. there is (ν, λ) such that λ ≥ 0 and g(ν, λ) > 0) then the primal is not feasible. This is a weak alternative because it could be the case that neither is feasible.
3.5.2 Example. Consider the system f_i(x) < 0 for i = 1, ..., m and h_j(x) = 0 for j = 1, ..., p. We would like to show that λ ≥ 0, λ ≠ 0, g(ν, λ) ≥ 0 is an alternative. Consider the problem min_x 0 subject to f_i(x) ≤ 0 and h_j(x) = 0. The dual function is

g(ν, λ) = inf_x ( Σ_{j=1}^p ν_j h_j(x) + Σ_{i=1}^m λ_i f_i(x) ).

If the primal is feasible then g(ν, λ) < 0 for all λ ≥ 0, λ ≠ 0, so the proposed alternative cannot hold. (Not quite right.)
If strong duality holds then g(λ∗ , ν ∗ ) = d ∗ = p∗ = f0 (x ∗ ). We assume for now
that strong duality holds.
3.5.3 Example. For f_i(x) < 0, i = 1, ..., m, and Ax = b, the alternative is λ ≥ 0, λ ≠ 0, and g(λ, ν) ≥ 0.
Consider the problem min x,s s subject to f i (x) − s ≤ 0 and Ax = b. Then p∗ < 0
if and only if there is x such that f i (x) < 0 for all i and Ax = b.
The dual function is

g(λ, ν) = inf_{x,s} ( s + Σ_{i=1}^m λ_i ( f_i(x) − s ) + ν^T(Ax − b) )
        = −ν^T b + inf_{x,s} ( s (1 − Σ_{i=1}^m λ_i) + Σ_{i=1}^m λ_i f_i(x) + ν^T Ax ).

This is −∞ unless 1^T λ = 1. The dual problem hence is

max g(λ, ν) subject to 1^T λ = 1, λ ≥ 0.

By strong duality there are λ^* and ν^* optimizing the dual. Now, p^* < 0 if and only if there is x such that Ax = b and f_i(x) < 0 for all i. In this case g(λ^*, ν^*) = d^* = p^* < 0, so the alternative does not hold; in particular g(λ, ν) < 0 for all λ ≥ 0 with 1^T λ = 1 and all ν. Similarly, if there is no such x then the alternative holds.
4 Algorithms
Now we move to chapters 10 and 11, concerning problems of the form min f0(x) subject to Ax = b, and min f0(x) subject to Ax = b and f_i(x) ≤ 0, respectively.
One way to solve the problem with only equality constraints is to use the dual problem:

g(ν) := inf_x ( f0(x) + ν^T(Ax − b) ) = −ν^T b + inf_x ( f0(x) + (A^T ν)^T x )
      = −ν^T b − sup_x ( (−A^T ν)^T x − f0(x) )
      = −ν^T b − f0^*(−A^T ν)

Solving the primal is equivalent (under strong duality) to solving the unconstrained problem sup_ν ( −ν^T b − f0^*(−A^T ν) ). Suppose ν^* is the solution. Then plugging this into the Lagrangian, we can solve the unconstrained problem inf_x ( f0(x) + ν^{*T}(Ax − b) ) for x^*, the solution to the original problem.
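A sketch of this dual approach with f0(x) = ‖x‖²/2, so that f0^*(y) = ‖y‖²/2 and everything is in closed form (the matrix A and vector b below are made-up example data): maximizing g gives ν^* = −(AA^T)^{-1}b, and the primal recovery gives x^* = −A^T ν^*, the least-norm solution of Ax = b.

```python
import numpy as np

# Dual approach for min ||x||^2/2 s.t. Ax = b:
# g(nu) = -nu^T b - ||A^T nu||^2 / 2, maximized where -b - A A^T nu = 0.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

nu_star = -np.linalg.solve(A @ A.T, b)   # maximizer of g
x_star = -A.T @ nu_star                  # minimizer of f0(x) + nu*^T A x

print(A @ x_star - b)                                 # feasibility: ~0
print(np.allclose(x_star, np.linalg.pinv(A) @ b))     # -> True (least-norm solution)
```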
Another way is to use Newton's method. Suppose x approximates the solution and we wish to improve it. Approximate f0(x + v) up to the quadratic term,

f0(x + v) ≈ f0(x) + ∇f0(x)^T v + ½ v^T ∇²f0(x) v,

and use this approximation to find the direction to move to improve x. This is also known as sequential quadratic programming. We solve the following problem for the best direction v:

min_v f0(x) + ∇f0(x)^T v + ½ v^T ∇²f0(x) v subject to A(x + v) = b.

The optimality conditions are as follows. The Lagrangian is

L = ½ v^T ∇²f0(x) v + ∇f0(x)^T v + ν^T ( A(x + v) − b ),

so the KKT condition is that ∇²f0(x) v + ∇f0(x) + A^T ν = 0, and of course the constraint Av = b − Ax must be satisfied. We get the following system of equations for the best direction v and the dual variable ν:

[ ∇²f0(x)  A^T ] [ v_nt ]     [ ∇f0(x) ]
[ A         0  ] [ ν_nt ] = − [ Ax − b ].
A second derivation of this system comes from the optimality conditions for the original problem. They are ∇f0(x) + A^T ν = 0 and Ax = b. We wish to find x + ∆x so that the optimality conditions still hold (but at which some other quantity is improved). Linearizing,

∇f0(x) + ∇²f0(x)∆x + A^T ∆ν + A^T ν = 0,   Ax + A∆x = b,

which gives the system

[ ∇²f0(x)  A^T ] [ ∆x_nt ]     [ ∇f0(x) + A^T ν ]
[ A         0  ] [ ∆ν_nt ] = − [ Ax − b         ].

In the unconstrained Newton's method we want ∇f0(x + v) = 0, so we pick v so that ∇²f0(x) v = −∇f0(x).
f0(x + t v) = f0(x) + t ∇f0(x)^T v + O(t²) = f0(x) − t v^T ∇²f0(x) v + O(t²)

d/dt f0(x + t v)|_{t=0} = −λ(x)² < 0

since ∇²f0(x) is positive definite. With the backtracking method f0(x) decreases with every iteration.
With equality constraints, however, f0(x) will not necessarily decrease if Ax ≠ b. We need to find some other quantity that decreases with every iteration to use as the backtracking quantity. From Newton's step,

d/dt f0(x + t∆x_nt)|_{t=0} = ∇f0(x)^T ∆x_nt
   = −∆x_nt^T ( ∇²f0(x)∆x_nt + A^T(ν + ∆ν_nt) )
   = −∆x_nt^T ∇²f0(x)∆x_nt − (ν + ∆ν_nt)^T (A∆x_nt)
   = −∆x_nt^T ∇²f0(x)∆x_nt − (ν + ∆ν_nt)^T (b − Ax)
We show now that the residual

r = r(x, ν) = ( r_dual, r_primal ) = ( ∇f0(x) + A^T ν, Ax − b )

strictly decreases in norm, and that r = 0 at the optimum. Consider the new variable y = (x, ν) and do Newton's method on the equation r(y) = 0: the linear approximation is r(y + z) ≈ r(y) + Dr(y) z, so we solve r(y) + Dr(y) z = 0 for z. This material is in §10.3.
d/dt ‖r(y + t∆y_nt)‖₂² |_{t=0}
   = d/dt ⟨ r(y) + t Dr(y)∆y_nt , r(y) + t Dr(y)∆y_nt ⟩ |_{t=0}
   = 2 ⟨ Dr(y)∆y_nt , r(y) ⟩ = −2 r(y)^T r(y) = −2‖r(y)‖₂²

Similarly,

d/dt ‖r(y + t∆y_nt)‖₂ |_{t=0} = d/dt √( ‖r(y + t∆y_nt)‖₂² ) |_{t=0} = −2‖r(y)‖₂² / ( 2‖r(y)‖₂ ) = −‖r(y)‖₂.

Hence the tangent line to the graph of t ↦ ‖r(y + t∆y_nt)‖₂ at t = 0 is (1 − t)‖r(y)‖₂. We build our line search as follows: find t such that

‖r(y + t∆y_nt)‖₂ ≤ (1 − αt)‖r(y)‖₂.
The infeasible start Newton's method is as follows. (See §10.3.)

input: (x0, v0), eps > 0, 0 < alpha < 0.5, 0 < beta < 1
x, v = x0, v0
while |r(x, v)| > eps:
    dxnt, dvnt = compute_dir(x, v)   # solve the KKT system above
    t = 1.0
    while |r(x + t dxnt, v + t dvnt)| > (1 - alpha t) |r(x, v)|:
        t = beta * t
    x, v = x + t dxnt, v + t dvnt
Some observations. If we accept the Newton step, i.e. if t = 1, then the solution
becomes feasible, i.e. A(x + ∆x nt ) = b. Otherwise, if we step by t < 1 then we
have b −A(x + t∆x nt ) = (1− t)(b −Ax), so we “improve the feasibility” by a factor
of 1 − t.
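The method above can be made concrete on a small example problem (the objective Σ_i exp(x_i) and the data A, b are made up for illustration): the residual is r(x, ν) = (∇f0(x) + A^T ν, Ax − b) and each Newton direction solves the KKT system.

```python
import numpy as np

# Infeasible start Newton method for min sum_i exp(x_i) subject to Ax = b.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
alpha, beta, eps = 0.25, 0.5, 1e-10

def r(x, nu):
    # residual: (dual residual, primal residual)
    return np.concatenate([np.exp(x) + A.T @ nu, A @ x - b])

x, nu = np.zeros(3), np.zeros(1)        # infeasible start: Ax != b here
while np.linalg.norm(r(x, nu)) > eps:
    # Newton direction: [H A^T; A 0] d = -r, with H = diag(exp(x))
    KKT = np.block([[np.diag(np.exp(x)), A.T], [A, np.zeros((1, 1))]])
    d = np.linalg.solve(KKT, -r(x, nu))
    dx, dnu = d[:3], d[3:]
    t = 1.0                             # backtracking on the residual norm
    while np.linalg.norm(r(x + t*dx, nu + t*dnu)) > (1 - alpha*t) * np.linalg.norm(r(x, nu)):
        t *= beta
    x, nu = x + t*dx, nu + t*dnu

print(x)          # -> approximately [1/3, 1/3, 1/3] by symmetry
print(A @ x - b)  # -> ~0, feasible at convergence
```

As the observation above predicts, the very first accepted step with t = 1 already makes the iterate feasible, and the remaining iterations drive the dual residual to zero.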
Now we move on to the problem with inequality constraints. Consider the problem min f0(x) subject to Ax = b and f_i(x) ≤ 0, i = 1, ..., m. Introduce the functional ϕ(x) = −Σ_{i=1}^m log(−f_i(x)) and solve

min_x t f0(x) + ϕ(x) subject to Ax = b.
We hope that the solutions get close to the solution of the original problem for large t. Compute

∇ϕ(x) = −Σ_{i=1}^m ( 1 / f_i(x) ) ∇f_i(x)

∇²ϕ(x) = Σ_{i=1}^m ( ( 1 / f_i(x)² ) ∇f_i(x) ∇f_i(x)^T − ( 1 / f_i(x) ) ∇²f_i(x) )
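These formulas can be sanity-checked numerically. A sketch for linear constraints f_i(x) = c_i^T x − d_i (so ∇f_i = c_i and ∇²f_i = 0), with made-up data C, d, comparing the gradient formula against central finite differences:

```python
import numpy as np

# Barrier phi(x) = -sum_i log(-(c_i^T x - d_i)) for linear constraints Cx <= d.
C = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
d = np.array([2.0, 2.0, -0.5])

def phi(x):
    return -np.sum(np.log(-(C @ x - d)))

def grad_phi(x):
    f = C @ x - d                   # f_i(x) < 0 at a strictly feasible point
    return -C.T @ (1.0 / f)         # -sum_i (1/f_i) c_i

def hess_phi(x):
    f = C @ x - d
    return C.T @ np.diag(1.0 / f**2) @ C   # sum_i (1/f_i^2) c_i c_i^T

x0 = np.array([1.0, 1.0])          # strictly feasible: f(x0) = (-1, -1, -1.5)
h = 1e-6
fd = np.array([(phi(x0 + h*e) - phi(x0 - h*e)) / (2*h) for e in np.eye(2)])
print(np.allclose(fd, grad_phi(x0), atol=1e-5))   # gradient matches finite differences
```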
Combining with the dual, the optimality conditions are

t ∇f0(x^*) − Σ_{i=1}^m ( 1 / f_i(x^*) ) ∇f_i(x^*) + A^T ν^* = 0
Ax^* − b = 0

where the first line is t∇f0(x) + ∇ϕ(x) + A^T ν = 0. These are the equations for the barrier method. We know that −f_i(x^*) > 0, so write λ_i^* := −1/(t f_i(x^*)) > 0. Rescaling ν^*, we can write the above equations as

∇f0(x^*) + Σ_i λ_i^* ∇f_i(x^*) + A^T ν^* = 0
Ax^* − b = 0
−λ_i^* f_i(x^*) = 1/t
These are the equations for the primal-dual method. The Lagrangian for the problem is

L(x, ν, λ) = f0(x) + Σ_{i=1}^m λ_i f_i(x) + ν^T(Ax − b)
∇_x L(x, ν, λ) = ∇f0(x) + Σ_{i=1}^m λ_i ∇f_i(x) + A^T ν

Note that ∇_x L(x^*, ν^*, λ^*) = 0. With the dual function g we have

g(λ^*, ν^*) = f0(x^*) + Σ_{i=1}^m λ_i^* f_i(x^*) + ν^{*T}(Ax^* − b),

so

f0(x^*) − p^* ≤ f0(x^*) − g(λ^*, ν^*) = −Σ_{i=1}^m λ_i^* f_i(x^*) = Σ_{i=1}^m 1/t = m/t → 0 as t → ∞.
For the barrier method we have the variables (x, ν). Use Newton's method with line search on the residuals to find ∆x and ∆ν:

[ t∇²f0 + ∇²ϕ  A^T ] [ ∆x ]     [ t∇f0 + ∇ϕ + A^T ν ]
[ A             0  ] [ ∆ν ] = − [ Ax − b            ]
How do we start the barrier method? We need to start with a feasible point, because otherwise ϕ(x) is not defined. The first phase is to solve the auxiliary problem

min_{x,s} s subject to f_i(x) ≤ s and Ax = b.

This problem is always feasible: choose x^(0) such that Ax^(0) = b and set s^(0) = max_i f_i(x^(0)). Use any method to get a sequence of approximate solutions to this problem. If we find s^(n) < 0 for some n then we know p^* < 0 for the auxiliary problem, so the original problem is feasible and we have obtained a feasible point x^(n). However, if we find that p^* is probably positive, then the original problem is probably not feasible. In this case we hope to find (λ, ν) for the dual of the auxiliary problem for which g(λ, ν) > 0, proving infeasibility of the original problem. Another way to get started is to instead consider the problem

min_{x,s} 1^T s subject to f_i(x) ≤ s_i, i = 1, ..., m, and Ax = b.
For the primal-dual method we apply Newton's method with line search on the residuals of the modified KKT equations

r_dual = ∇f0(x) + Σ_i λ_i ∇f_i(x) + A^T ν = 0
r_cent = −λ_i f_i(x) − 1/t = 0,  i = 1, ..., m
r_prim = Ax − b = 0

to find ∆x, ∆λ, and ∆ν:

[ ∇²f0 + Σ_i λ_i ∇²f_i       [∇f1 | ··· | ∇fm]    A^T ] [ ∆x_nt ]     [ r_dual ]
[ −[λ1∇f1 | ··· | λm∇fm]^T   −diag(f1, ..., fm)    0  ] [ ∆λ_nt ] = − [ r_cent ]
[ A                           0                    0  ] [ ∆ν_nt ]     [ r_prim ]

that is, Dr(y)∆y_nt = −r(y) with y = (x, λ, ν).

Recall the inequality f0(x) − p^* ≤ m/t. Define the surrogate gap η̂ = −Σ_i λ_i f_i(x) ≈ m/t. Since λ_i > 0 and −λ_i f_i(x) = 1/t > 0, we must have f_i(x) < 0.
Write y^+ = y + s∆y_nt. The line search consists of the following steps:
(i) Choose s1 ≤ 1 so that λ^+ ≥ 0.
(ii) Starting with s = s1, set s ← βs until f_i(x^+) < 0 for all i. Call the result s2.
(iii) Starting with s = s2, set s ← βs until ‖r(y^+)‖₂ ≤ (1 − αs)‖r(y)‖₂.
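The three steps above can be sketched as a helper function. The callables `f` (vector of constraint values f_i(x)) and `resid` (the full residual r(y)) are assumed to be supplied by the caller; the names and structure are illustrative, not the lecture's own code.

```python
import numpy as np

def pd_line_search(x, lam, nu, dx, dlam, dnu, f, resid, alpha=0.1, beta=0.5):
    # (i) largest s <= 1 keeping lambda^+ >= 0 (step slightly inside the boundary)
    neg = dlam < 0
    s = 1.0 if not neg.any() else min(1.0, 0.99 * np.min(-lam[neg] / dlam[neg]))
    # (ii) backtrack until strict feasibility f_i(x^+) < 0
    while np.any(f(x + s * dx) >= 0):
        s *= beta
    # (iii) backtrack until the residual norm decreases enough
    r0 = np.linalg.norm(resid(x, lam, nu))
    while np.linalg.norm(resid(x + s*dx, lam + s*dlam, nu + s*dnu)) > (1 - alpha*s) * r0:
        s *= beta
    return s
```

Phase (iii) terminates only when (dx, dlam, dnu) is an actual Newton direction for r, since that is what makes the residual norm decrease to first order.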
What follows are fragments of the proof that this all works.
Let t_max = inf{t > 0 : ‖r(y + t∆y_nt)‖₂ > ‖r(y)‖₂}. This is the first t for which moving along the Newton direction by this much will not improve the value of ‖r(y)‖₂. Let

S := {t : ‖r(y + t∆y_nt)‖₂ ≤ ‖r(y)‖₂} ⊇ [0, t_max],

and assume that S is closed. Note that we have Dr(y)∆y = −r(y), because that is the definition of the Newton increment.
r(y + t∆y) = r(y) + ∫₀¹ t Dr(y + tτ∆y) ∆y dτ
           = r(y) + t Dr(y)∆y + ∫₀¹ [Dr(y + tτ∆y) − Dr(y)] t∆y dτ
           = (1 − t) r(y) + e

‖r(y + t∆y)‖₂ ≤ (1 − t)‖r(y)‖₂ + ‖e‖₂

Now, if Dr is Lipschitz, say ‖Dr(y) − Dr(y′)‖₂ ≤ L‖y − y′‖₂, then

‖e‖₂ = ‖ ∫₀¹ [Dr(y + tτ∆y) − Dr(y)] t∆y dτ ‖₂
     ≤ t ∫₀¹ ‖Dr(y + tτ∆y) − Dr(y)‖₂ ‖∆y‖₂ dτ
     ≤ L t² ∫₀¹ ‖∆y‖₂² τ dτ = ½ L t² ‖∆y‖₂²
     = ½ L t² ‖[Dr(y)]⁻¹ r(y)‖₂²

Suppose as well that ‖[Dr(y)]⁻¹‖₂ ≤ K for all y. Then for all t ∈ [0, 1 ∧ t_max],

‖r(y + t∆y)‖₂ ≤ (1 − t)‖r(y)‖₂ + ½ t² L K² ‖r(y)‖₂².   (5)
This parabola in t is minimized at t̄ = 1/(LK²‖r(y)‖₂). It is clear that 1/(LK²) is a critical value of ‖r(y)‖₂. There are hence two cases: (i) ‖r‖₂ > 1/(K²L) and (ii) ‖r‖₂ ≤ 1/(K²L).
In the first case t̄ < 1 ∧ t_max, so t̄ is a valid step size and we get

‖r(y + t̄∆y)‖₂ ≤ (1 − t̄)‖r(y)‖₂ + ½ t̄² L K² ‖r(y)‖₂²
             = ‖r(y)‖₂ − 1/(2K²L)
             ≤ ‖r(y)‖₂ − α t̄ ‖r(y)‖₂ = (1 − α t̄)‖r(y)‖₂

since α < ½ and t̄ ‖r(y)‖₂ = 1/(K²L). Therefore the line search terminates for some t ≥ β t̄, and

‖r(y + t∆y)‖₂ ≤ (1 − αt)‖r(y)‖₂ ≤ (1 − αβ t̄)‖r(y)‖₂ = ‖r(y)‖₂ − αβ/(K²L),

so the reduction after the line search is at least αβ/(K²L). Therefore, after finitely many steps, we will end up in case (ii).
In the second case ‖r(y)‖₂ ≤ 1/(K²L), so

‖r(y + t∆y)‖₂ ≤ (1 − t)‖r(y)‖₂ + ½ t² L K² ‖r(y)‖₂² ≤ (1 − t + ½t²)‖r(y)‖₂.

From this it follows that t_max > 1, and for t = 1 we have

‖r(y + ∆y)‖₂ ≤ ½‖r(y)‖₂ ≤ (1 − α)‖r(y)‖₂,

so the line search exits with t = 1. Finally, plugging t = 1 into the original inequality (5),

‖r(y + ∆y)‖₂ ≤ (LK²/2) ‖r(y)‖₂²
(K²L/2) ‖r(y + ∆y)‖₂ ≤ ( (K²L/2) ‖r(y)‖₂ )²

so we have quadratic convergence in case (ii).
Index

affine set, 2
alternative, 23
backtracking line search, 6
barrier method, 27
complementary slackness, 16
cone, 2
conjugate function, 17
constrained optimization problem, 4
convex function, 2
convex set, 2
descent method, 6
dual norm, 9
dual problem, 18, 19
equality constraint, 16
equality constraints, 4
exact line search, 6
feasible, 18
first order conditions, 4
gradient descent method, 6
half space, 2
hyperplane, 2
inactive constraint, 21
inequality constraints, 4
infeasible start Newton's method, 26
Lagrangian, 18
line, 2
linear convergence, 6
Lipschitz constant, 11
Newton's decrement, 11
Newton's method, 10
non-negativity constraint, 16
norm cone, 2
normal equations, 5
primal problem, 19
primal-dual method, 27
quadratic convergence, 10
residual, 26
roughness penalty, 8
sequential quadratic programming, 25
steepest descent method, 9
strong duality, 19
sub-gradient, 13
unconstrained problem, 4
Wasserstein distance, 22
weak alternative, 23
weak duality, 19