Math 273a: Optimization
Convex Functions
Instructor: Wotao Yin
Department of Mathematics, UCLA
Fall 2015
online discussions on piazza.com
Definition
• f : Rn → R is convex if
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y),
∀ x, y ∈ Rn and λ ∈ [0, 1]
• modern definition: f : Rn → R ∪ {∞}, assumed proper (i.e., not identically ∞);
then we can ignore dom(f ). Also, f can incorporate the indicator function of a
convex set S:
ιS (x) = 0 if x ∈ S, and ∞ otherwise
Concave function
• A function f is concave if (−f ) is convex.
Epigraph
• definition:
epi(f ) := {(x, t) : x ∈ domf, t ≥ f (x)}
• f is convex if and only if epi(f ) is a convex set
• “lifts” the convex function to a convex set and enables set operations such as
projection, enriching both our means of understanding and our numerical tools
F 1 : the set of C 1 convex functions
• A continuously differentiable function f (x) is convex if and only if
f (y) ≥ f (x) + (y − x)T ∇f (x)
for any x, y ∈ Rn .
interpretation: f (y) is on or above its linear support function at any point
• a twice continuously differentiable function f (x) is convex if and only if
∇2 f (x) ⪰ 0
for any x ∈ Rn (a numerical sketch of this condition follows this list)
• F k (Rn ): the set of convex functions on Rn that are k times continuously
differentiable. [notation used by Nesterov’03 textbook]
• (we will show these later)
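To make the second-order condition above concrete, here is a minimal numerical sketch of our own, not part of the lecture and assuming NumPy is available: for the hypothetical quadratic f (x) = (1/2) xT Qx + aT x, the Hessian equals Q everywhere, so convexity reduces to checking that the eigenvalues of Q are nonnegative.

import numpy as np

# Sketch (our example, not from the lecture): the Hessian of
# f(x) = 0.5*x'Qx + a'x is the constant matrix Q, so f is convex
# exactly when all eigenvalues of Q are nonnegative.
np.random.seed(0)
B = np.random.randn(5, 5)
Q = B.T @ B                      # B'B is symmetric positive semidefinite
eigenvalues = np.linalg.eigvalsh(Q)
print("smallest Hessian eigenvalue:", eigenvalues.min())  # >= 0, so f is convex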
Epigraph and support hyperplane
• if (x, t) ∈ epi(f ), then
(∇f (x0 ), −1)T (x − x0 , t − f (x0 )) ≤ 0,
i.e., (x − x0 )T ∇f (x0 ) − (t − f (x0 )) ≤ 0
Examples in R
• xα is convex over x ≥ 0 for α ≥ 1
• |x| is convex (but not differentiable)
• log x is concave, and x log x is convex, over x > 0
• ex is convex
• max(0, x) and max(0, −x) are both convex
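A quick numerical sanity check of these one-dimensional examples; this is our own sketch (assuming NumPy), not part of the lecture. It samples random points and tests the defining inequality directly: passing does not prove convexity, but a single failure would disprove it.

import numpy as np

# Sketch: test f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) on random samples.
examples = {
    "|x|":       (np.abs,                        (-5.0, 5.0)),
    "exp(x)":    (np.exp,                        (-3.0, 3.0)),
    "max(0,x)":  (lambda x: np.maximum(0.0, x),  (-5.0, 5.0)),
    "x log x":   (lambda x: x * np.log(x),       (1e-6, 5.0)),  # convex on x > 0
    "-log x":    (lambda x: -np.log(x),          (1e-6, 5.0)),  # log x is concave
}
rng = np.random.default_rng(1)
for name, (f, (lo, hi)) in examples.items():
    ok = True
    for _ in range(1000):
        x, y = rng.uniform(lo, hi, size=2)
        lam = rng.uniform()
        ok &= f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) + 1e-9
    print(f"{name}: no violation found -> {bool(ok)}")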
Basic properties
• f is convex if and only if g(t) := f (x + td) is convex for all x, d
• if f is convex, then αf is convex for all α ≥ 0
• if f1 , f2 are convex, then so is f1 + f2
• extends to integrals (and infinite sums): if g(x, y) is convex in x for each y, then
∫ g(x, y) dy is convex in x
• if f1 , f2 are convex, then so is h(x) = max{f1 (x), f2 (x)}
(also extends to point-wise supremum of infinitely many convex functions)
• affine transform to x: if f is convex, so is f (Ax + b)
Examples in Rn
• linear functions
f (x) = aT x + b
• quadratic functions with Q ⪰ 0:
f (x) = (1/2) xT Qx + aT x + b
• all norms, e.g., ‖x‖1 , ‖x‖2 , ‖x‖∞
• max distance to any set S (convex or not): f (x) = sups∈S ‖s − x‖
Partial min of jointly-convex functions
• if h(x, y) is convex (i.e., jointly convex in (x, y)), then
f (x) := infy h(x, y)
is convex. (corresponds to projecting the epigraph)
• min distance to a convex set S, f (x) = infs∈S ‖s − x‖, is convex.
proof: consider h(x, s) = ‖s − x‖ + ιS (s), which is (jointly) convex. Then,
f (x) = infs h(x, s)
• infimal post-composition, f (y) = inf{g(x) : Ax = y}, is convex
proof: consider h(x, y) = g(x) + ι{0} (Ax − y), which is (jointly) convex. Then,
f (y) = infx h(x, y)
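The following sketch (our own, assuming NumPy; the unit Euclidean ball is a hypothetical choice of S) illustrates the min-distance example numerically: the distance to the unit ball is the partial minimization f (x) = infs h(x, s) with h(x, s) = ‖s − x‖ + ιS (s), it has the closed form max(‖x‖ − 1, 0), and random sampling finds no violation of convexity.

import numpy as np

# Sketch: distance to the unit Euclidean ball, obtained by partially
# minimizing the jointly convex h(x, s) = ||s - x|| + i_S(s).
def dist_to_unit_ball(x):
    return max(np.linalg.norm(x) - 1.0, 0.0)

rng = np.random.default_rng(2)
violations = 0
for _ in range(1000):
    x, y = 3.0 * rng.normal(size=(2, 3))
    lam = rng.uniform()
    lhs = dist_to_unit_ball(lam*x + (1-lam)*y)
    rhs = lam*dist_to_unit_ball(x) + (1-lam)*dist_to_unit_ball(y)
    violations += lhs > rhs + 1e-9
print("convexity violations found:", violations)   # expect 0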
Jensen’s inequality
Let f : Rn → R be convex.
• two points: for λ1 , λ2 ≥ 0, λ1 + λ2 = 1,
f (λ1 x1 + λ2 x2 ) ≤ λ1 f (x1 ) + λ2 f (x2 )
• multiple points: for λi ≥ 0, Σi λi = 1,
f (Σi λi xi ) ≤ Σi λi f (xi )
• continuous version: for a distribution p(x) ≥ 0 with ∫ p(x)dx = 1,
f (Ex) = f (∫ x p(x)dx) ≤ ∫ f (x)p(x)dx = E(f (x)).
application: f (x) = − log(x) is convex over x > 0. Then, for a, b > 0,
− (1/2)(log a + log b) ≥ − log((a + b)/2) =⇒ √(ab) ≤ (a + b)/2
(of course, extends to a, b ≥ 0.)
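A small numerical illustration of Jensen's inequality; this is our own sketch assuming NumPy, and the choice f (x) = − log x with randomly sampled points is hypothetical. Random convex weights satisfy f (Σi λi xi ) ≤ Σi λi f (xi ), and the two-point case reproduces the AM-GM conclusion above.

import numpy as np

# Sketch: Jensen's inequality for the convex function f(x) = -log(x).
rng = np.random.default_rng(3)
x = rng.uniform(0.1, 10.0, size=6)       # positive sample points
lam = rng.dirichlet(np.ones(6))          # lam_i >= 0 and sum(lam_i) = 1
f = lambda t: -np.log(t)
print("f(sum lam_i x_i) =", f(lam @ x))          # left-hand side
print("sum lam_i f(x_i) =", lam @ f(x))          # right-hand side (larger)

# Two-point special case: the arithmetic-geometric mean inequality.
a, b = 2.0, 8.0
print("sqrt(ab) =", np.sqrt(a*b), " <= (a+b)/2 =", (a+b)/2)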
Properties of f, g ∈ F 1 (Rn )
• Recall: a C 1 function f (x) is convex on Rn if and only if
f (y) ≥ f (x) + (y − x)T ∇f (x)
for any x, y ∈ Rn .
(function f (y) is on or above its linear support function at any point)
• Corollary: ∇f (x∗ ) = 0 if and only if f (x) ≥ f (x∗ ) for all x.
• F 1 is closed under nonnegative linear combinations: if f, g ∈ F 1 , then αf + βg ∈ F 1 for
α, β ≥ 0
• For matrix A ∈ Rm×n , b ∈ Rm , and f ∈ F 1 (Rm ), we have
h(x) = f (Ax + b) ∈ F 1 (Rn ).
Equivalent Definitions for F 1 (Rn )
Theorem 1
The following are equivalent:
1. f ∈ F 1 (Rn ), i.e., f is convex and in C 1 (Rn ).
2. f ∈ C 1 (Rn ) and for any x, y ∈ Rn and 0 ≤ α ≤ 1,
f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y).
3. f ∈ C 1 (Rn ) and ∇f is monotone, i.e., for any x, y ∈ Rn ,
(x − y)T (∇f (x) − ∇f (y)) ≥ 0.    (1)
Theorem 2
Function f ∈ F 2 (Rn ) if and only if it is in C 2 (Rn ) and
∇2 f (x) ⪰ 0
for any x ∈ Rn .
Proof of Theorem 2.
(⇒) : Fix x and any direction s. Let xt = x + ts (for t > 0). From (1),
0 ≤ (1/t2 )(xt − x)T (∇f (xt ) − ∇f (x)) = (1/t2 )(t sT ) ∫_0^t ∇2 f (x + τ s)s dτ
= (1/t) ∫_0^t sT ∇2 f (x + τ s)s dτ.
Letting t → 0 and using continuity, we conclude that sT ∇2 f (x)s ≥ 0.
(⇐) : Suppose y, x ∈ Rn . Observe that
f (y) = f (x) + ∫_0^1 (y − x)T ∇f (x + t(y − x)) dt
= f (x) + ∫_0^1 (y − x)T [ ∫_0^t ∇2 f (x + τ (y − x))(y − x) dτ + ∇f (x) ] dt
= f (x) + (y − x)T ∇f (x) + ∫_0^1 ∫_0^t (y − x)T ∇2 f (x + τ (y − x))(y − x) dτ dt
≥ f (x) + (y − x)T ∇f (x),
which is the first-order characterization of convexity, so f is convex.
FLk,p (Rn ): convex functions with Lipschitz continuous derivatives
• FLk,p (Rn ) ⊂ F k (Rn ): the pth derivative of f is Lipschitz with constant
L ≥ 0; that is,
‖f (p) (x) − f (p) (y)‖ ≤ L‖x − y‖
for any x, y ∈ Rn .
(the rate of change of f (p) is bounded)
Equivalent Definitions for FL1,1 (Rn )
Theorem 3
The following are equivalent:
1. f ∈ FL1,1 (Rn ).
2. f is in C 1 (Rn ) and for any x, y ∈ Rn
0 ≤ f (y) − f (x) − (y − x)T ∇f (x) ≤ (L/2)‖y − x‖2 .
3. f is in C 1 (Rn ) and for any x, y ∈ Rn
f (y) ≥ f (x) + (y − x)T ∇f (x) + (1/(2L))‖∇f (x) − ∇f (y)‖2 .
4. f is in C 1 (Rn ) and for any x, y ∈ Rn
(1/L)‖∇f (x) − ∇f (y)‖2 ≤ (x − y)T (∇f (x) − ∇f (y)).
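Items 2 and 4 of this theorem can be checked numerically on a simple instance. The sketch below is ours (assuming NumPy) and uses the hypothetical quadratic f (x) = (1/2) xT Qx, whose gradient x ↦ Qx is Lipschitz with constant L = λmax (Q).

import numpy as np

# Sketch: check items 2 and 4 for f(x) = 0.5*x'Qx, where grad f(x) = Qx
# is Lipschitz with constant L = largest eigenvalue of Q.
rng = np.random.default_rng(4)
B = rng.normal(size=(6, 6))
Q = B.T @ B
L = np.linalg.eigvalsh(Q).max()
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

for _ in range(1000):
    x, y = rng.normal(size=(2, 6))
    gap = f(y) - f(x) - (y - x) @ grad(x)
    assert 0.0 <= gap <= L/2 * (y - x) @ (y - x) + 1e-9     # item 2
    g = grad(x) - grad(y)
    assert (g @ g) / L <= (x - y) @ g + 1e-9                # item 4
print("items 2 and 4 hold on all sampled pairs")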
Proof of Theorem 3.
(1) ⇒ (2): Observe that
f (y) = f (x) + ∫_0^1 (y − x)T ∇f (x + t(y − x)) dt
= f (x) + (y − x)T ∇f (x) + ∫_0^1 (y − x)T (∇f (x + t(y − x)) − ∇f (x)) dt
≤ f (x) + (y − x)T ∇f (x) + ∫_0^1 ‖y − x‖ ‖∇f (x + t(y − x)) − ∇f (x)‖ dt
≤ f (x) + (y − x)T ∇f (x) + ∫_0^1 ‖y − x‖ · tL‖y − x‖ dt
= f (x) + (y − x)T ∇f (x) + (L/2)‖y − x‖2 .
The lower bound 0 ≤ f (y) − f (x) − (y − x)T ∇f (x) is the first-order condition for convexity.
(2) ⇒ (3): Fix x0 ∈ Rn . Consider the convex function φ(y) := f (y) − yT ∇f (x0 ),
which achieves its minimum at y∗ = x0 since
∇φ(y∗ ) = ∇f (y∗ ) − ∇f (x0 ) = 0. In particular,
φ(y∗ ) ≤ φ(y − (1/L)∇φ(y)).
Applying assumption (2) to φ (with y := y − (1/L)∇φ(y) and x := y),
Proof of Theorem 3 (continued).
Hence,
φ(y − (1/L)∇φ(y)) ≤ φ(y) − (1/(2L))‖∇φ(y)‖2 .
Therefore,
φ(x0 ) ≤ φ(y) − (1/(2L))‖∇φ(y)‖2 .
Substituting and using the fact that ∇φ(y) = ∇f (y) − ∇f (x0 ) implies (3).
(3) ⇒ (4): Add the two inequalities obtained from (3) with x and y interchanged.
(4) ⇒ (1): Notice from Theorem 1.3 that f is convex. Next we show the
Lipschitz bound. From assumption (4) and Cauchy-Schwarz we have
(1/L)‖∇f (x) − ∇f (y)‖2 ≤ (x − y)T (∇f (x) − ∇f (y)) ≤ ‖x − y‖ ‖∇f (x) − ∇f (y)‖.
Thus,
‖∇f (x) − ∇f (y)‖ ≤ L‖x − y‖.
Strongly Convex Function
A continuously differentiable function f (x) is μ-strongly convex on Rn if for
any x, y ∈ Rn we have
f (y) ≥ f (x) + (y − x)T ∇f (x) + (μ/2)‖y − x‖2 .
• Sμk (Rn ): the set of μ-strongly convex functions on Rn that are k times
continuously differentiable.
Properties:
• If f ∈ Sμ1 (Rn ) and ∇f (x∗ ) = 0 then for all x,
f (x) ≥ f (x∗ ) + (μ/2)‖x − x∗ ‖2
• If f1 ∈ Sμ1 1 (Rn ), f2 ∈ Sμ2 1 (Rn ) and α, β ≥ 0, then
αf1 + βf2 ∈ Sαμ1 +βμ2 1 (Rn )
(the strong-convexity constants combine linearly).
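A numerical illustration of the first property; again our own sketch assuming NumPy, with a hypothetical quadratic. For f (x) = (1/2) xT Qx with Q positive definite, f is μ-strongly convex with μ = λmin (Q), its minimizer is x∗ = 0, and the inequality f (x) ≥ f (x∗ ) + (μ/2)‖x − x∗ ‖2 holds at every sampled point.

import numpy as np

# Sketch: strong-convexity lower bound for f(x) = 0.5*x'Qx, Q positive definite.
rng = np.random.default_rng(5)
B = rng.normal(size=(6, 6))
Q = B.T @ B + 0.5 * np.eye(6)        # shift makes Q positive definite
mu = np.linalg.eigvalsh(Q).min()
f = lambda x: 0.5 * x @ Q @ x        # minimizer x* = 0 and f(x*) = 0

for _ in range(1000):
    x = rng.normal(size=6)
    assert f(x) >= mu/2 * (x @ x) - 1e-9
print("f(x) >= f(x*) + (mu/2)||x - x*||^2 holds, with mu =", mu)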
Equivalent Definitions for Sμ1 (Rn )
Theorem (4)
The following are equivalent:
1. f ∈ Sμ1 (Rn ) (i.e., f is μ-strongly convex).
2. f is in C 1 (Rn ) and for any x, y ∈ Rn and 0 ≤ α ≤ 1,
f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y) − α(1 − α)(μ/2)‖x − y‖2 .
3. f is in C 1 (Rn ) and for any x, y ∈ Rn
(x − y)T (∇f (x) − ∇f (y)) ≥ μ‖x − y‖2 .
Theorem (5)
Function f ∈ Sμ2 (Rn ) if and only if it is in C 2 (Rn ) and for any x ∈ Rn we have:
∇2 f (x) ⪰ μIn .
Application: performance of gradient methods
“black box” optimization (Nesterov’03 textbook):
• Problem to be solved (e.g., find minimizer).
• Class of functions (e.g., F 1 (Rn ), FLk,p (Rn ), Sμ1 (Rn )).
• Oracle: black box information of the problem. For example,
0th order: xi → f (xi )
1st order: xi → (f (xi ), ∇f (xi ))
2nd order: xi → (f (xi ), ∇f (xi ), ∇2 f (xi )).
• Solution: exact or approximate condition (i.e. ‖x∗ − xsol ‖ < ε).
• Algorithm using only oracle to compute or approximate xsol .
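A small sketch of the oracle idea in Python; the class name and interface are our own, purely illustrative. A first-order oracle returns (f (xi ), ∇f (xi )) and counts its calls, which is exactly the quantity that analytical complexity measures.

import numpy as np

# Sketch: a counted first-order "black box" oracle. A method interacting
# only with this object never sees the formula for f; the call counter is
# the number that analytical complexity bounds.
class FirstOrderOracle:
    def __init__(self, f, grad):
        self.f, self.grad, self.calls = f, grad, 0

    def __call__(self, x):
        self.calls += 1
        return self.f(x), self.grad(x)

oracle = FirstOrderOracle(f=lambda x: x @ x, grad=lambda x: 2.0 * x)
fx, gx = oracle(np.array([1.0, 2.0]))
print("oracle calls so far:", oracle.calls)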
Example:
• Problem: find root of function
• Function class: all polynomials
• Oracle: 0th order
• Method: always return x∗ = 0.
Performance: for some functions in the class this method cannot be beaten; for
other functions its performance is arbitrarily bad.
How to gauge performance of a method?
Use the computational cost for the worst function in a chosen function class
(e.g. F 1 (Rn ), FLk,p (Rn ), Sμ1 (Rn )).
• Analytical complexity: number of calls to the oracle which is required to
solve the problem up to the desired accuracy.
• Arithmetical complexity: number of arithmetic operations (including work
of oracle and method) to solve the problem up to desired accuracy.
For our purposes
• Problem: minimizex∈Rn f (x).
• Function class: f ∈ FL∞,1 (Rn ) or Sμ1 (Rn ).
• Oracle: first order
• Approximate solution: |f (x̄) − f ∗ | < ε.
• Methods: Sequence {xk } such that
xk ∈ x0 + span{∇f (x0 ), ∇f (x1 ), . . . , ∇f (xk−1 )}
(methods include gradient descent methods.)
Lower Complexity Bounds
Theorem
For any k, 1 ≤ k ≤ (n − 1)/2, and any x0 ∈ Rn , there exists a function
f ∈ FL∞,1 (Rn ) such that for all first order methods generating a sequence
xk ∈ x0 + span{∇f (x0 ), ∇f (x1 ), . . . , ∇f (xk−1 )},
we have
f (xk ) − f ∗ ≥ (3L‖x0 − x∗ ‖2 ) / (32(k + 1)2 ),
‖xk − x∗ ‖2 ≥ (1/32)‖x0 − x∗ ‖2 .
Gradient Method
Problem:
minx∈Rn f (x), where f ∈ FL1,1 (Rn ).
Scheme:
0. Choose x0 ∈ Rn .
1. k th iteration (k ≥ 0).
a. Compute f (xk ) and ∇f (xk )
b. Set xk+1 ← xk − hk ∇f (xk )
To simplify the following work, we assume hk = h > 0 for every k.
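A minimal implementation sketch of this scheme (ours, assuming NumPy; the function names and the quadratic test problem are hypothetical), using a constant step size h chosen below 2/L as required by the theorem that follows.

import numpy as np

def gradient_method(f, grad, x0, h, num_iters):
    """Iterate x_{k+1} = x_k - h * grad f(x_k) and record objective values."""
    x = np.array(x0, dtype=float)
    values = [f(x)]
    for _ in range(num_iters):
        x = x - h * grad(x)          # step 1b of the scheme above
        values.append(f(x))
    return x, values

# usage on a convex quadratic f(x) = 0.5*x'Qx, where L = lambda_max(Q)
rng = np.random.default_rng(6)
B = rng.normal(size=(10, 10))
Q = B.T @ B
L = np.linalg.eigvalsh(Q).max()
x_last, vals = gradient_method(lambda x: 0.5 * x @ Q @ x, lambda x: Q @ x,
                               x0=rng.normal(size=10), h=1.0/L, num_iters=200)
print("f(x0) =", vals[0], "   f(x200) =", vals[-1])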
Upper Bound for Gradient Method
Theorem
If f ∈ FL1,1 (Rn ) and 0 < h < 2/L, then
f (xk ) − f ∗ ≤ [(f (x0 ) − f ∗ )‖x0 − x∗ ‖2 ] / [‖x0 − x∗ ‖2 + (f (x0 ) − f ∗ ) h(1 − (L/2)h) k]
Proof:
Step 1. Boundedness of {xk }. Let εk := ‖xk − x∗ ‖. Then
ε2k+1 = ‖xk − x∗ − h∇f (xk )‖2
= ε2k − 2h⟨∇f (xk ), xk − x∗ ⟩ + h2 ‖∇f (xk )‖2
≤ ε2k − h(2/L − h)‖∇f (xk )‖2
where we used the 4th equivalent definition of FL1,1 (Rn ) and ∇f (x∗ ) = 0 in
the last line. Therefore εk+1 ≤ εk ≤ . . . ≤ ε0 .
Note: since ∇f cannot vary too quickly, the iterates xk cannot become unbounded.
Proof continued
Step 2: Objective descent.
The second equivalent definition of FL1,1 (Rn ) tells us
f (xk+1 ) ≤ f (xk ) + ⟨∇f (xk ), xk+1 − xk ⟩ + (L/2)‖xk+1 − xk ‖2
= f (xk ) + ⟨∇f (xk ), −h∇f (xk )⟩ + (L/2)h2 ‖∇f (xk )‖2
= f (xk ) − h(1 − (L/2)h)‖∇f (xk )‖2
To simplify notation, define ω := h(1 − (L/2)h). Also define Δk := f (xk ) − f ∗ .
Subtracting f ∗ from both sides of the above inequality yields
Δk+1 ≤ Δk − ω‖∇f (xk )‖2    (2)
Proof continued
Step 3: The descent is “sufficient.”
We shall bound Δk by ‖∇f (xk )‖, so there is sufficient descent to estimate
the descent rate of Δk .
From the definition of convexity,
f (x∗ ) ≥ f (xk ) + ⟨∇f (xk ), x∗ − xk ⟩
Thus, since εk ≤ ε0 by Step 1,
Δk ≤ −⟨∇f (xk ), x∗ − xk ⟩ ≤ ‖∇f (xk )‖εk ≤ ‖∇f (xk )‖ε0 ,
hence
−ω‖∇f (xk )‖2 ≤ −(ω/ε20 )Δ2k .    (3)
Inequalities (2) and (3) together give:
Δk+1 ≤ Δk − (ω/ε20 )Δ2k .
Proof continued
Step 4: Rate establishment.
Recall that f ∗ is the minimum of f , hence Δk ≥ 0. Divide the inequality by
Δk+1 Δk , rearrange, and recall that {Δk } decreases in k to get:
1/Δk+1 ≥ 1/Δk + (ω/ε20 ) ∙ (Δk /Δk+1 ) ≥ 1/Δk + ω/ε20 ≥ . . . ≥ 1/Δ0 + (ω/ε20 )(k + 1)
Reindex at k and invert the inequality to finally obtain
Δk ≤ (Δ0 ε20 ) / (ε20 + Δ0 ωk)
which, miraculously, is what we were to show.
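To see the theorem in action, the following self-contained sketch (ours, assuming NumPy; the quadratic test problem is hypothetical) runs the gradient method on a problem with known minimizer x∗ = 0 and f ∗ = 0 and compares the observed gap Δk with the bound Δ0 ε20 /(ε20 + Δ0 ωk).

import numpy as np

# Sketch: compare the observed gap f(x_k) - f* with the theoretical bound
# on f(x) = 0.5*x'Qx, for which x* = 0 and f* = 0.
rng = np.random.default_rng(7)
B = rng.normal(size=(10, 10))
Q = B.T @ B
L = np.linalg.eigvalsh(Q).max()
f, grad = (lambda x: 0.5 * x @ Q @ x), (lambda x: Q @ x)

h = 1.0 / L
omega = h * (1.0 - L/2 * h)                 # omega = h(1 - Lh/2)
x = rng.normal(size=10)
delta0, eps0_sq = f(x), x @ x               # Delta_0 and eps_0^2 (since x* = 0)

for k in range(1, 201):
    x = x - h * grad(x)
    if k % 50 == 0:
        bound = delta0 * eps0_sq / (eps0_sq + delta0 * omega * k)
        print(f"k = {k:3d}   observed gap = {f(x):.3e}   bound = {bound:.3e}")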
Performance bounds
Further analysis can prove the following:
1. The gradient method is not optimal for FL1,1 (Rn ).
2. The gradient method is not optimal for Sμ,L 1,1 (Rn ).
Notes:
• All standard NLP methods (conjugate gradients, variable metric) have
similar lower efficiency estimates.
• The bounds do not necessarily reflect real-world performance, and often do not.
• The lecture is not finished yet. More on the way.