Introduction to Optimization - Lecture: Lagrange Duality

Math 273a: Optimization
Lagrange Duality
Instructor: Wotao Yin
Department of Mathematics, UCLA
Winter 2015
online discussions on piazza.com
Gradient descent / forward Euler
• assume the function f is proper, closed, convex, and differentiable
• consider
min f (x)
• gradient descent iteration (with step size c):
xk+1 = xk − c ∇f (xk )
• xk+1 minimizes the following local quadratic approximation of f :
f (xk ) + ⟨∇f (xk ), x − xk ⟩ + (1/(2c)) ‖x − xk ‖²
• compare with forward Euler iteration, a.k.a. the explicit update:
x(t + 1) = x(t) − Δt ∙ ∇f (x(t))
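As a concrete illustration, here is a minimal gradient-descent sketch, assuming the hypothetical quadratic f (x) = ½‖x − p‖² (so ∇f (x) = x − p); the problem data are illustrative, not from the lecture:

```python
import numpy as np

# Hypothetical test problem: f(x) = 0.5*||x - p||^2, so grad f(x) = x - p,
# minimized at x = p.
p = np.array([1.0, -2.0])

def grad_f(x):
    return x - p

c = 0.1          # step size (Delta t in the forward Euler view)
x = np.zeros(2)  # x^0
for _ in range(200):
    x = x - c * grad_f(x)   # x^{k+1} = x^k - c * grad f(x^k)

print(x)  # approaches the minimizer p
```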
Backward Euler / implicit gradient descent
• backward Euler iteration, also known as the implicit update:
solve for x(t + 1):  x(t + 1) = x(t) − Δt ∙ ∇f (x(t + 1)) .
• equivalent to:
1. solve for u(t + 1):  u = ∇f (x(t) − Δt ∙ u) ;
2. x(t + 1) = x(t) − Δt ∙ u(t + 1).
• we can view it as the “implicit gradient” descent:
solve for (xk+1 , uk+1 ):  x = xk − cu,  u = ∇f (x).
• c plays the role of a “step size”, though it behaves very differently from a standard explicit step size
• explicit (implicit) update uses the gradient at the start (end) point
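A sketch of the backward Euler update for the same hypothetical quadratic f (x) = ½‖x − p‖²; for this f the implicit equation can be solved in closed form, and the iteration stays stable even at a step size for which the explicit update would diverge:

```python
import numpy as np

# Same hypothetical quadratic: grad f(x) = x - p, minimized at p.
p = np.array([1.0, -2.0])
c = 3.0  # explicit update x <- x - c*(x - p) diverges here, since |1 - c| = 2 > 1

# Implicit update: solve x_new = x - c*(x_new - p)  =>  x_new = (x + c*p)/(1 + c)
x = np.zeros(2)
for _ in range(100):
    x = (x + c * p) / (1.0 + c)

print(x)  # converges to p despite the large step size
```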
Implicit gradient step = proximal operation
• proximal update:
proxcf (z) := arg minx f (x) + (1/(2c)) ‖x − z‖² .
• optimality condition:
0 = c ∇f (x∗ ) + (x∗ − z).
• given input z,
- proxcf (z) returns solution x∗
- ∇f (proxcf (z)) returns u∗
solve for (x∗ , u∗ ):  x = z − cu,  u = ∇f (x).
Proposition
Proximal operator is equivalent to an implicit gradient (or backward Euler) step.
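A quick numerical check of this proposition, again assuming the hypothetical f (x) = ½‖x − p‖²; here the prox has a closed form, and its output satisfies the implicit-gradient relations x∗ = z − cu∗ with u∗ = ∇f (x∗):

```python
import numpy as np

# Hypothetical f(x) = 0.5*||x - p||^2.  prox_{cf}(z) minimizes
# f(x) + ||x - z||^2/(2c); the optimality condition 0 = c*(x - p) + (x - z)
# gives the closed form x* = (z + c*p)/(1 + c).
p = np.array([1.0, -2.0])
c = 0.5
z = np.array([3.0, 3.0])

x_star = (z + c * p) / (1.0 + c)   # prox_{cf}(z)
u_star = x_star - p                # grad f at x*

# Backward Euler / implicit gradient consistency: x* = z - c*u*
print(np.allclose(x_star, z - c * u_star))  # True
```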
Proximal operator handles sub-differentiable f
• assume that f is a closed, proper, sub-differentiable convex function
• ∂f (x) denotes the subdifferential of f at x.
Recall u ∈ ∂f (x) if
f (x0 ) ≥ f (x) + hu, x0 − xi, ∀x0 ∈ Rn .
• ∂f is a point-to-set mapping: a point x may have multiple subgradients, and the same subgradient may occur at multiple points
• prox is well-defined for sub-differentiable f ; it is point-to-point,
prox maps any input to a unique point
Proximal operator
proxcf (z) := arg minx f (x) + (1/(2c)) ‖x − z‖² .
• since objective is strongly convex, solution proxcf (z) is unique
• since f is closed and proper, dom proxcf = Rn
• the following are equivalent:
proxcf (z) = x∗ = arg minx f (x) + (1/(2c)) ‖x − z‖² ,
solve for x∗ :  0 ∈ c ∂f (x) + (x − z),
solve for (x∗ , u∗ ):  x = z − cu,  u ∈ ∂f (x).
• point x∗ minimizes f if and only if x∗ = proxf (x∗ ).
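A standard concrete instance of the prox of a sub-differentiable function: for f (x) = λ|x| (applied elementwise), proxcf is the well-known soft-thresholding operator. A minimal sketch:

```python
import numpy as np

# For the nonsmooth f(x) = lam*|x| (elementwise), prox_{cf} is soft-thresholding:
#   prox_{cf}(z) = sign(z) * max(|z| - c*lam, 0).
# f is only sub-differentiable at 0, yet the prox is single-valued everywhere.
def prox_l1(z, c, lam):
    return np.sign(z) * np.maximum(np.abs(z) - c * lam, 0.0)

z = np.array([2.0, -0.3, 0.5])
out = prox_l1(z, c=1.0, lam=0.5)
print(out)  # entries with |z_i| <= c*lam are mapped to zero
```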
Lagrange duality
Convex problem
minimize f (x)
subject to Ax = b.
Relax the constraints and price their violation (pay a price if violated one way;
get paid if violated the other way; the payment is linear in the violation):
L(x; y) := f (x) + yT (Ax − b)
For later use, define the augmented Lagrangian
LA (x; y, c) := f (x) + yT (Ax − b) + (c/2) ‖Ax − b‖²
Minimize L for a fixed price y:
d(y) := − minx L(x; y).  The function d(y) is always convex.
The Lagrange dual problem
minimize (over y)  d(y)
Given a dual solution y∗ , recover x∗ = arg minx L(x; y∗ ) (under which conditions?)
Question: how to compute the explicit/implicit gradients of d(y)?
Dual explicit gradient (ascent) algorithm
Assume d(y) is differentiable (true if f (x) is strictly convex.)
Gradient descent iteration (if the dual is posed as a maximization, this is called
gradient ascent):
yk+1 = yk − c ∇d(yk ).
It turns out to be relatively easy to compute ∇d via an unconstrained subproblem:
∇d(y) = b − x̄A,  where x̄A is shorthand for Ax̄ and x̄ = arg minx L(x; y); that is, ∇d(y) = b − Ax̄.
Dual gradient iteration
1. solve for xk :  xk = arg minx L(x; yk );
2. yk+1 = yk − c(b − Axk ).
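The two-step dual gradient iteration can be sketched on a small hypothetical instance, minimize ½‖x‖² subject to Ax = b, where the inner minimization has the closed form x = −AT y (the matrix A, vector b, and step size below are illustrative):

```python
import numpy as np

# Hypothetical problem: minimize 0.5*||x||^2 subject to Ax = b.
# With f(x) = 0.5*||x||^2, step 1 has the closed form x = -A^T y.
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
b = np.array([1.0, 2.0])
c = 0.1

y = np.zeros(2)
for _ in range(200):
    x = -A.T @ y               # step 1: x^k = arg min_x L(x; y^k)
    y = y - c * (b - A @ x)    # step 2: y^{k+1} = y^k - c*(b - A x^k)

x_star = A.T @ np.linalg.solve(A @ A.T, b)  # known solution of this problem
print(np.allclose(x, x_star, atol=1e-6))    # True (for a suitable step size c)
```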
Sub-gradient of d(y)
Assume d(y) is sub-differentiable (which condition on primal can guarantee
this?)
Lemma
Given a dual point y and x̄ = arg minx L(x; y), we have b − Ax̄ ∈ ∂d(y).
Proof.
Recall
(i) u ∈ ∂d(y) if d(y′ ) ≥ d(y) + ⟨u, y′ − y⟩ for all y′ ;
(ii) d(y) := − minx L(x; y).
From (ii) and the definition of x̄,
d(y) + ⟨b − Ax̄, y′ − y⟩ = −L(x̄; y) + (b − Ax̄)T (y′ − y)
= −[f (x̄) + yT (Ax̄ − b)] + (b − Ax̄)T (y′ − y)
= −[f (x̄) + (y′ )T (Ax̄ − b)]
= −L(x̄; y′ ) ≤ d(y′ ).
From (i), b − Ax̄ ∈ ∂d(y).
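The Lemma can also be verified numerically. For the hypothetical f (x) = ½‖x‖² one gets x̄ = −AT y and d(y) = ½‖AT y‖² + bT y, so the subgradient inequality can be tested at random points (problem data below are illustrative):

```python
import numpy as np

# Numerical check of the Lemma for the hypothetical f(x) = 0.5*||x||^2:
# xbar = arg min_x L(x; y) = -A^T y, and
# d(y) = -min_x L(x; y) = 0.5*||A^T y||^2 + b^T y.
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
b = np.array([1.0, 2.0])

def d(y):
    return 0.5 * np.dot(A.T @ y, A.T @ y) + b @ y

rng = np.random.default_rng(1)
y = rng.standard_normal(2)
u = b - A @ (-A.T @ y)   # candidate subgradient b - A*xbar

# Subgradient inequality d(y') >= d(y) + <u, y' - y>, tested at random y'
ok = all(d(yp) >= d(y) + u @ (yp - y) - 1e-12
         for yp in rng.standard_normal((100, 2)))
print(ok)  # True
```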
Dual explicit (sub)gradient iteration
The iteration:
1. solve for xk :  xk = arg minx L(x; yk );
2. yk+1 = yk − ck (b − Axk );
Notes:
• (b − Axk ) ∈ ∂d(yk ) as shown in the last slide
• it does not require d(y) to be differentiable
• convergence might require a careful choice of ck (e.g., a diminishing
sequence) if d(y) is only sub-differentiable (or lacking Lipschitz continuous
gradient)
Dual implicit gradient
Goal: to descend using the (sub)gradient of d at the next point yk+1 :
Following from the Lemma, we have
b − Axk+1 ∈ ∂d(yk+1 ),  where xk+1 = arg minx L(x; yk+1 ).
Since the implicit step is yk+1 = yk − c(b − Axk+1 ), we can derive
xk+1 = arg minx L(x; yk+1 )
⇐⇒ 0 ∈ ∂x L(xk+1 ; yk+1 ) = ∂f (xk+1 ) + AT yk+1
= ∂f (xk+1 ) + AT (yk − c(b − Axk+1 )).
Therefore, while xk+1 is a solution to minx L(x; yk+1 ), it is also a solution to
minx LA (x; yk , c) = minx f (x) + (yk )T (Ax − b) + (c/2) ‖Ax − b‖² ,
which is independent of yk+1 .
Dual implicit gradient
Proposition
Assuming y′ = y − c(b − Ax′ ), the following are equivalent
1. solve for x′ :  x′ = arg minx L(x; y′ ),
2. solve for x′ :  x′ = arg minx LA (x; y, c).
Dual implicit gradient iteration
The iteration
yk+1 = proxcd (yk )
is commonly known as the augmented Lagrangian method or the method of
multipliers.
Implementation:
solve
1. xk+1 ←− minx LA (x; yk , c);
2. yk+1 = yk − c(b − Axk+1 ).
Proposition
The following are equivalent
1. the augmented Lagrangian iteration;
2. the implicit gradient iteration of d(y);
3. the proximal iteration yk+1 = proxcd (yk ).
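The augmented Lagrangian iteration can be sketched on the same hypothetical problem, minimize ½‖x‖² subject to Ax = b, where the x-subproblem reduces to a linear solve (all data below are illustrative):

```python
import numpy as np

# Augmented Lagrangian method / method of multipliers for the hypothetical QP:
# minimize 0.5*||x||^2 subject to Ax = b.  The x-subproblem
# min_x L_A(x; y, c) has optimality condition x + A^T y + c*A^T(Ax - b) = 0,
# i.e. the linear system (I + c*A^T A) x = A^T (c*b - y).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
b = np.array([1.0, 2.0])
c = 1.0
I = np.eye(4)

y = np.zeros(2)
for _ in range(50):
    x = np.linalg.solve(I + c * A.T @ A, A.T @ (c * b - y))  # step 1
    y = y - c * (b - A @ x)                                  # step 2

x_star = A.T @ np.linalg.solve(A @ A.T, b)  # known solution
print(np.allclose(x, x_star, atol=1e-8))    # True
```

Note that, unlike the explicit dual iteration above, no diminishing step-size schedule is needed here; a fixed c works.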
Dual explicit/implicit (sub)gradient computation
Definitions:
• L(x; y) = f (x) + yT (Ax − b)
• LA (x; y, c) = L(x; y) + (c/2) ‖Ax − b‖²
Objective:
d(y) = − minx L(x; y).
Explicit (sub)gradient iteration: yk+1 = yk − c∇d(yk ), or use a subgradient from ∂d(yk ):
1. xk+1 = arg minx L(x; yk );
2. yk+1 = yk − c(b − Axk+1 ).
Implicit (sub)gradient step: yk+1 = proxcd yk
1. xk+1 = arg minx LA (x; yk , c);
2. yk+1 = yk − c(b − Axk+1 ).
The implicit iteration is more stable; “step size” c does not need to diminish.