Lecture 13
Gradient Methods for Constrained Optimization
October 16, 2008
Outline
• Gradient Projection Algorithm
• Convergence Rate
Constrained Minimization
minimize f(x)
subject to x ∈ X
• Assumption 1:
• The function f is convex and continuously differentiable over Rn
• The set X is closed and convex
• The optimal value f∗ = inf_{x∈X} f(x) is finite
• Gradient projection algorithm
xk+1 = PX [xk − αk ∇f (xk )]
starting with x0 ∈ X.
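As a concrete illustration of this iteration (not part of the original slides), here is a minimal Python sketch of the gradient projection method, assuming a Euclidean projection onto a box; the test problem, stepsize, and iteration count are illustrative choices.

```python
import numpy as np

def project_box(x, lo, hi):
    # Euclidean projection onto the box {x : lo <= x <= hi}
    return np.clip(x, lo, hi)

def gradient_projection(grad_f, project, x0, alpha, num_iters=1000):
    # x_{k+1} = P_X[x_k - alpha * grad_f(x_k)], starting from x0 in X
    x = project(x0)
    for _ in range(num_iters):
        x = project(x - alpha * grad_f(x))
    return x

# Illustrative problem: minimize 0.5*||x - c||^2 over the box [0, 1]^2 with c outside the box
c = np.array([2.0, -1.0])
grad_f = lambda x: x - c
x_star = gradient_projection(grad_f, lambda x: project_box(x, 0.0, 1.0),
                             x0=np.zeros(2), alpha=0.5)
print(x_star)  # approaches the projection of c onto the box: [1, 0]
```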
Bounded Gradients
Theorem 1 Let Assumption 1 hold, and suppose that the gradients are
uniformly bounded over the set X, say ‖∇f(x)‖ ≤ L for all x ∈ X. Then, the
gradient projection method generates a sequence {xk} ⊂ X such that
• When a constant stepsize αk ≡ α is used, we have
    lim inf_{k→∞} f(xk) ≤ f∗ + αL²/2
• When a diminishing stepsize is used with Σ_k αk = +∞, we have
    lim inf_{k→∞} f(xk) = f∗.
Proof: We use projection properties and a line of analysis similar to that
of the unconstrained method (HWK 6).
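A small sketch of the two stepsize rules referenced in Theorem 1 (the schedules below are common illustrative choices, not prescribed by the slides):

```python
import numpy as np

def constant_stepsize(alpha):
    # alpha_k = alpha for all k: liminf f(x_k) <= f* + alpha*L^2/2 when ||grad f|| <= L on X
    return lambda k: alpha

def diminishing_stepsize(c=1.0):
    # alpha_k = c/(k+1): diminishing, with sum_k alpha_k = +infinity, so liminf f(x_k) = f*
    return lambda k: c / (k + 1)

def gradient_projection_schedule(grad_f, project, x0, stepsize, num_iters=1000):
    # Same iteration as in the earlier sketch, with an iteration-dependent stepsize alpha_k
    x = project(x0)
    for k in range(num_iters):
        x = project(x - stepsize(k) * grad_f(x))
    return x
```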
Lipschitz Gradients
• Lipschitz Gradient Lemma: For a differentiable convex function f with
Lipschitz continuous gradients, we have for all x, y ∈ Rn,
    (1/L) ‖∇f(x) − ∇f(y)‖² ≤ (∇f(x) − ∇f(y))ᵀ(x − y),
where L is a Lipschitz constant of ∇f.
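A quick numeric spot-check of the lemma (illustrative only), using a quadratic f(x) = ½ xᵀQx, for which ∇f(x) − ∇f(y) = Q(x − y) and L = λmax(Q) is a Lipschitz constant of ∇f:

```python
import numpy as np

# Spot-check of the Lipschitz Gradient Lemma for an assumed quadratic example
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
L = np.linalg.eigvalsh(Q).max()

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    d = x - y
    g = Q @ d                            # grad f(x) - grad f(y)
    assert (g @ g) / L <= g @ d + 1e-9   # (1/L)||grad diff||^2 <= (grad diff)^T (x - y)
```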
• Theorem 2 Let Assumption 1 hold, and assume that the gradients of
f are Lipschitz continuous over X with constant L > 0. Suppose that the
optimal solution set X∗ is nonempty. Then, for a constant stepsize αk ≡ α
with 0 < α < 2/L, the iterates converge to an optimal point, i.e.,
    lim_{k→∞} ‖xk − x∗‖ = 0   for some x∗ ∈ X∗.
Proof:
Fact 1: If z = PX[z − v] for some v ∈ Rn, then z = PX[z − τv] for any
τ > 0.
Fact 2: z ∈ X ∗ if and only if z = PX [z − ∇f (z)].
These facts imply that z ∈ X ∗ if and only if z = PX [z − τ ∇f (z)] for any
τ > 0.
By using the definition of the method and the preceding relation with
τ = α, we obtain for any z ∈ X ∗,
    ‖xk+1 − z‖² = ‖PX[xk − α∇f(xk)] − PX[z − α∇f(z)]‖².
By the non-expansiveness of the projection, it follows that
    ‖xk+1 − z‖² ≤ ‖xk − z − α(∇f(xk) − ∇f(z))‖²
               = ‖xk − z‖² − 2α(xk − z)ᵀ(∇f(xk) − ∇f(z)) + α²‖∇f(xk) − ∇f(z)‖².
Using the Lipschitz Gradient Lemma, we obtain for any z ∈ X∗,
    ‖xk+1 − z‖² ≤ ‖xk − z‖² − (α/L)(2 − αL) ‖∇f(xk) − ∇f(z)‖².    (1)
Hence, for all k,
    (α/L)(2 − αL) ‖∇f(xk) − ∇f(z)‖² ≤ ‖xk − z‖² − ‖xk+1 − z‖².
By summing the preceding relations from an arbitrary K to N, with K < N,
we obtain
    (α/L)(2 − αL) Σ_{k=K}^{N} ‖∇f(xk) − ∇f(z)‖² ≤ ‖xK − z‖² − ‖xN+1 − z‖² ≤ ‖xK − z‖².
In particular, setting K = 0 and letting N → ∞, we see that
    (α/L)(2 − αL) Σ_{k=0}^{∞} ‖∇f(xk) − ∇f(z)‖² ≤ ‖x0 − z‖² < ∞.    (2)
As a consequence, we also have
    lim_{k→∞} ∇f(xk) = ∇f(z).    (3)
By discarding the non-positive term on the right-hand side of Eq. (1), we
have for any z ∈ X∗ and all k,
    ‖xk+1 − z‖² ≤ ‖xk − z‖² + (2 − αL)‖∇f(xk) − ∇f(z)‖².
By summing these relations over k = K, . . . , N for arbitrary K and N with
K < N, we obtain
    ‖xN+1 − z‖² ≤ ‖xK − z‖² + (2 − αL) Σ_{k=K}^{N} ‖∇f(xk) − ∇f(z)‖².
Taking the limsup as N → ∞, we obtain
    lim sup_{N→∞} ‖xN+1 − z‖² ≤ ‖xK − z‖² + (2 − αL) Σ_{k=K}^{∞} ‖∇f(xk) − ∇f(z)‖².
Now, taking the liminf as K → ∞ yields
    lim sup_{N→∞} ‖xN+1 − z‖² ≤ lim inf_{K→∞} ‖xK − z‖² + (2 − αL) lim_{K→∞} Σ_{k=K}^{∞} ‖∇f(xk) − ∇f(z)‖²
                              = lim inf_{K→∞} ‖xK − z‖²,
where the equality follows in view of the relation in (2) (the tail of a convergent series vanishes). Thus, we have that
the sequence {‖xk − z‖} is convergent for every z ∈ X∗.
By the inequality in Eq. (1), we have that
    ‖xk − z‖ ≤ ‖x0 − z‖   for all k.
Hence, the sequence {xk} is bounded, and it has an accumulation point.
Since the scalar sequence {‖xk − z‖} is convergent for every z ∈ X∗, the
sequence {xk} must be convergent.
Suppose now that xk → x̄. By considering the definition of the iterate xk+1,
we have
xk+1 = PX [xk − α∇f (xk )].
Letting k → ∞, and using xk → x̄ together with the continuity of the gradient ∇f
and of the projection PX, we obtain
x̄ = PX [x̄ − α∇f (x̄)].
In view of Facts 1 and 2, the preceding relation is equivalent to x̄ ∈ X∗.
Modes of Convexity: Strict and Strong
• Def. f is strictly convex if for all x ≠ y and α ∈ (0, 1) we have
f (αx + (1 − α)y) < αf (x) + (1 − α)f (y)
• Def. f is strongly convex if there exists a scalar ν > 0 such that
    f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − (ν/2) α(1 − α)‖x − y‖²
for all x, y ∈ Rn and any α ∈ [0, 1].
The scalar ν is referred to as the strong convexity constant; the function is
said to be strongly convex with constant ν.
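For a concrete example (not from the slides), a quadratic f(x) = ½ xᵀQx with symmetric positive definite Q is strongly convex with constant ν = λmin(Q); the sketch below numerically checks the defining inequality:

```python
import numpy as np

# Assumed example: f(x) = 0.5 * x^T Q x is strongly convex with nu = lambda_min(Q)
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
f = lambda x: 0.5 * x @ Q @ x
nu = np.linalg.eigvalsh(Q).min()

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    a = rng.uniform()
    lhs = f(a * x + (1 - a) * y)
    rhs = a * f(x) + (1 - a) * f(y) - 0.5 * nu * a * (1 - a) * np.sum((x - y) ** 2)
    assert lhs <= rhs + 1e-9  # strong convexity inequality holds
```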
Modes of Convexity: Differentiable Function
• Let f : Rn → R be continuously differentiable.
• Modes of convexity can be equivalently characterized in terms of the
linearization properties of f, based on the gradient mapping ∇f : Rn → Rn.
• We have
• f is convex if and only if
f (x) + ∇f (x)T (y − x) ≤ f (y)
for all x, y ∈ Rn
• f is strictly convex if and only if
f (x) + ∇f (x)T (y − x) < f (y)
for all x ≠ y
• f is strongly convex with constant ν if and only if
    f(x) + ∇f(x)ᵀ(y − x) + (ν/2)‖y − x‖² ≤ f(y)   for all x, y ∈ Rn
Modes of Convexity: Gradient Mapping
• Let f : Rn → R be continuously differentiable.
• Modes of convexity can be equivalently characterized in terms of the
monotonicity properties of the gradient mapping ∇f : Rn → Rn.
• We have
• f is convex if and only if
(∇f (x) − ∇f (y))T (x − y) ≥ 0
for all x, y ∈ Rn
• f is strictly convex if and only if
(∇f (x) − ∇f (y))T (x − y) > 0
for all x ≠ y
• f is strongly convex with constant ν if and only if
    (∇f(x) − ∇f(y))ᵀ(x − y) ≥ ν‖x − y‖²   for all x, y ∈ Rn
Modes of Convexity: Twice Differentiable Function
• Let f : Rn → R be twice continuously differentiable.
• Modes of convexity can be equivalently characterized in terms of the
definiteness of the Hessians ∇2f(x) for x ∈ Rn.
• We have
• f is convex if and only if
∇2f (x) ≥ 0
for all x ∈ Rn
• f is strictly convex if
∇2f (x) > 0
for all x ∈ Rn
• f is strongly convex with constant ν if and only if
    ∇2f(x) ≥ νI   for all x ∈ Rn
Strong Convexity: Implications
Let f be continuously differentiable and strongly convex∗ over Rn with
constant m
• Implications:
• Lower bound on f over Rn: for all x, y ∈ Rn,
    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖x − y‖²    (4)
  Minimizing with respect to y on the right-hand side:
    f(y) ≥ f(x) − (1/(2m))‖∇f(x)‖²
  Taking the minimum over y ∈ Rn:
    f(x) − f∗ ≤ (1/(2m))‖∇f(x)‖²
• Useful as a stopping criterion (if you know m)
∗ Strong convexity over Rn can be replaced by strong convexity over a set X; then all the relations remain valid over the set X.
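A minimal sketch (with assumed problem data) of using the bound f(x) − f∗ ≤ ‖∇f(x)‖²/(2m) as a stopping criterion inside an unconstrained gradient method:

```python
import numpy as np

def gradient_descent_with_gap_bound(grad_f, x0, alpha, m, tol=1e-8, max_iters=10000):
    # Stop once the strong-convexity bound f(x) - f* <= ||grad f(x)||^2 / (2m)
    # certifies that the optimality gap is below tol.
    x = x0
    for _ in range(max_iters):
        g = grad_f(x)
        if np.dot(g, g) / (2.0 * m) <= tol:   # certified gap bound
            break
        x = x - alpha * g
    return x

# Assumed example: f(x) = 0.5 * x^T Q x - b^T x, strongly convex with m = lambda_min(Q)
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
m = np.linalg.eigvalsh(Q).min()
L = np.linalg.eigvalsh(Q).max()
x = gradient_descent_with_gap_bound(lambda x: Q @ x - b, np.zeros(2), alpha=1.0 / L, m=m)
print(x)
```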
• Relation (4) with x = x0 and f (y) ≤ f (x0) implies that the level
set Lf (f (x0)) is bounded
• Relation (4) also yields for an optimal x∗ and any x ∈ Rn,
    (m/2)‖x − x∗‖² ≤ f(x) − f(x∗)
• The last two bullets are part of the HWK 6 assignment.
Convergence Rate: Once Differentiable
Theorem 3 Let Assumption 1 hold, and assume that the gradients of f
are Lipschitz continuous over X with constant L > 0. Suppose that the
function is strongly convex with constant m > 0. Then:
• A solution x∗ exists and it is unique.
• The iterates generated by the gradient projection method with αk ≡ α
and 0 < α < 2/L converge to x∗ with a geometric rate, i.e.,
    ‖xk+1 − x∗‖² ≤ q ‖xk − x∗‖²   for all k,
with q ∈ (0, 1) depending on m and L.
Proof: HWK 6.
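An illustrative numeric check of the geometric rate (all problem data below are assumptions for demonstration): run the gradient projection method on a strongly convex quadratic over a box and measure the per-step contraction of the distance to the limit point.

```python
import numpy as np

Q = np.array([[5.0, 1.0], [1.0, 2.0]])     # m = lambda_min(Q), L = lambda_max(Q)
c = np.array([3.0, -2.0])
grad_f = lambda x: Q @ x - c               # gradient of f(x) = 0.5 x^T Q x - c^T x
project = lambda x: np.clip(x, -1.0, 1.0)  # X = [-1, 1]^2

L = np.linalg.eigvalsh(Q).max()
alpha = 1.0 / L                            # satisfies 0 < alpha < 2/L

# Run long enough to get an accurate approximation of x*
x = np.zeros(2)
for _ in range(5000):
    x = project(x - alpha * grad_f(x))
x_star = x

# Measure the contraction ||x_{k+1} - x*|| / ||x_k - x*|| along a fresh run
x = np.array([1.0, -1.0])
for k in range(10):
    x_next = project(x - alpha * grad_f(x))
    ratio = np.linalg.norm(x_next - x_star) / np.linalg.norm(x - x_star)
    print(k, ratio)                        # ratios stay bounded by some q < 1
    x = x_next
```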
Convergence Rate: Twice Differentiable
Theorem 4 Let Assumption 1 hold. Assume that the function is twice
continuously differentiable and strongly convex with constant m > 0.
Assume also that ∇2f(x) ≤ L I for all x ∈ X. Then:
• A solution x∗ exists and it is unique.
• The iterates generated by the gradient projection method with αk ≡ α
and 0 < α < 2/L converge to x∗ with a geometric rate, i.e.,
    ‖xk+1 − x∗‖ ≤ q ‖xk − x∗‖   for all k,
with q = max{|1 − αm|, |1 − αL|}.
Proof: The q here is different from the one in the preceding theorem. Since
∇2f(x) ≤ L I for all x ∈ X, it follows that the gradients are Lipschitz
continuous over X with constant L. By the definition of the method and
the non-expansive property of the projection, we have for z = x∗ and any
k,
    ‖xk+1 − x∗‖² = ‖PX[xk − α∇f(xk)] − PX[x∗ − α∇f(x∗)]‖²
                 ≤ ‖xk − x∗ − α(∇f(xk) − ∇f(x∗))‖².    (5)
Mean Value Theorem for vector functions: When g : Rn → Rn is
differentiable on the segment [x, y], we have
    g(y) = g(x) + ( ∫_0^1 ∇g(x + τ(y − x)) dτ ) (y − x),
where ∇g denotes the Jacobian of g.
Applying this theorem with g = ∇f, y = xk, and x = x∗, we obtain
    ∇f(xk) = ∇f(x∗) + ( ∫_0^1 ∇2f(x∗ + τ(xk − x∗)) dτ ) (xk − x∗).
Hence,
    ∇f(xk) − ∇f(x∗) = ( ∫_0^1 ∇2f(x∗ + τ(xk − x∗)) dτ ) (xk − x∗).    (6)
By introducing Ak with Ak(xk − x∗) = ∇f(xk) − ∇f(x∗) and using this in relation
(5), we obtain
    ‖xk+1 − x∗‖ ≤ ‖(I − αAk)(xk − x∗)‖ ≤ ‖I − αAk‖ ‖xk − x∗‖.
The matrix Ak is symmetric, and hence ‖I − αAk‖ is equal to the maximum
absolute eigenvalue of I − αAk, i.e.,
    ‖I − αAk‖ = max{|1 − αλmax(Ak)|, |1 − αλmin(Ak)|}.
In view of Eq. (6), we have Ak = ∫_0^1 ∇2f(x∗ + τ(xk − x∗)) dτ. By the
strong convexity of f, we have ∇2f(x) ≥ mI for all x, while by the given
condition, we have ∇2f(x) ≤ L I. Therefore,
    λmax(Ak) ≤ L,    λmin(Ak) ≥ m,
implying that
    ‖I − αAk‖ ≤ max{|1 − αm|, |1 − αL|} = q.
• The parameter q is minimized when α∗ = 2/(m + L), in which case
    q∗ = (L − m)/(L + m)  ⟺  q∗ = (cond(f) − 1)/(cond(f) + 1),
with cond(f) = L/m.
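A small numeric illustration (constants assumed for demonstration) of the contraction factor q = max{|1 − αm|, |1 − αL|} and of the optimal stepsize α∗ = 2/(m + L):

```python
import numpy as np

def contraction_factor(alpha, m, L):
    # q = max{|1 - alpha*m|, |1 - alpha*L|} from Theorem 4
    return max(abs(1 - alpha * m), abs(1 - alpha * L))

m, L = 1.0, 10.0                      # assumed strong convexity / curvature constants
alpha_star = 2.0 / (m + L)            # stepsize minimizing q
q_star = (L - m) / (L + m)            # = (cond(f) - 1)/(cond(f) + 1) with cond(f) = L/m

print(contraction_factor(alpha_star, m, L), q_star)   # both print 9/11 ≈ 0.818
print(contraction_factor(1.0 / L, m, L))              # a common but slower choice: 0.9
```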
Upper Bound on Hessian and f over the Level Set
For a twice differentiable strongly convex f :
• The level set L0 = {x | f (x) ≤ f (x0)} is bounded
• The maximum eigenvalue of the Hessian ∇2f (x) is a continuous function
of x over L0
• Hence, the maximum eigenvalue of the Hessian is bounded over L0:
there is a constant M such that
    ∇2f(x) ≤ M I   for all x ∈ L0
• Upper bound on f over L0:
    f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (M/2)‖y − x‖²   for all x, y ∈ L0
• Minimizing over y ∈ L0 on both sides:
    f∗ ≤ f(x) − (1/(2M))‖∇f(x)‖²   for all x ∈ L0
Condition Number of a Matrix
For a twice differentiable strongly convex f: mI ≤ ∇2f(x) ≤ M I for
all x ∈ L0
• The condition number cond(A) of a positive definite matrix A:
    cond(A) = (largest eigenvalue of A) / (smallest eigenvalue of A)
• The ratio M/m is an upper bound on the condition number of ∇2f(x) for
every x ∈ L0
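As a minimal illustration of the definition (example matrix assumed), the condition number of a symmetric positive definite matrix can be computed from its extreme eigenvalues:

```python
import numpy as np

def cond_spd(A):
    # Condition number of a symmetric positive definite matrix:
    # ratio of the largest to the smallest eigenvalue.
    eigs = np.linalg.eigvalsh(A)
    return eigs.max() / eigs.min()

A = np.array([[4.0, 1.0], [1.0, 2.0]])   # assumed example matrix
print(cond_spd(A))                        # matches np.linalg.cond(A) for SPD matrices
```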
Strong Convexity and Condition Number of Level Sets
Assume a minimizer x∗ of f over Rn exists and f is strongly convex.
Consider the level set L0 = {x | f (x) ≤ f (x0)}
• We have seen that mI ≤ ∇2f(x) ≤ M I for all x ∈ L0
• Also, we have
    f∗ + (m/2)‖x − x∗‖² ≤ f(x) ≤ f∗ + (M/2)‖x − x∗‖²
• Hence: Binner ⊆ L0 ⊆ Bouter, where
    Binner = { x | ‖x − x∗‖ ≤ √( 2(f(x0) − f∗)/M ) },
    Bouter = { x | ‖x − x∗‖ ≤ √( 2(f(x0) − f∗)/m ) }
• Therefore, we have a bound on cond(L0):
    cond(L0) ≤ M/m
• The condition number of level sets affects the efficiency of the
algorithms