A Greedy Framework for First-Order Optimization

Jonathan Huggins¹ and Jacob Steinhardt²
¹ Massachusetts Institute of Technology
² Stanford University

Dec 10, 2013
Motivation

We want to solve the following saddle point problem:
$$\min_u \max_\theta L(u, \theta),$$
where $L(u, \theta) \stackrel{\text{def}}{=} h(u) + u^\top \theta - R(\theta)$. (Assume $h$, $R$ convex.)

Tie-in with optimization: can think of this as minimizing
$$L(u) \stackrel{\text{def}}{=} \max_\theta L(u, \theta) = h(u) + R^*(u),$$
where $R^*$ denotes the convex (Fenchel) conjugate of $R$.
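For concreteness, here is the conjugate computed in a simple case (my example, not from the slides): take $u, \theta \in \mathbb{R}$ and $R(\theta) = \frac{1}{2}\theta^2$. Then
$$R^*(u) = \max_\theta\, u\theta - \tfrac{1}{2}\theta^2 = \tfrac{1}{2}u^2 \quad (\text{attained at } \theta = u),$$
so $L(u) = h(u) + \frac{1}{2}u^2$. This is exactly the quadratic pair used in the oscillation example on the next slide.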
Being (Too) Greedy

$L(u, \theta) \stackrel{\text{def}}{=} h(u) + u^\top \theta - R(\theta)$

Let's try the following updates:
$$u_t = \arg\min_u L(u, \theta_t)$$
$$\theta_{t+1} = \arg\max_\theta L(u_t, \theta)$$
"iterative best response"

Issue: let $u, \theta \in \mathbb{R}$, $h(u) = \frac{1}{2}u^2$, $R(\theta) = \frac{1}{2}\theta^2$. Then:
$$u_t = \arg\min_u \tfrac{1}{2}u^2 + u\theta_t = -\theta_t$$
$$\theta_{t+1} = \arg\max_\theta u_t\theta - \tfrac{1}{2}\theta^2 = u_t$$
OSCILLATION: the iterates cycle forever and never reach the saddle point.
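A minimal numerical check of this oscillation, as a Python sketch; nothing here beyond the quadratic example is from the slides:

```python
# Iterative best response on L(u, theta) = u^2/2 + u*theta - theta^2/2.
theta = 1.0  # arbitrary nonzero start
for t in range(6):
    u = -theta   # argmin_u  u^2/2 + u*theta   (set derivative u + theta = 0)
    theta = u    # argmax_th u*theta - theta^2/2  (set derivative u - theta = 0)
    print(t, u, theta)
# Prints 0 -1.0 -1.0, then 1 1.0 1.0, then 2 -1.0 -1.0, ...:
# the iterates bounce between +-1 and never reach the saddle point (0, 0).
```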
Being Just Greedy Enough

$L(u, \theta) \stackrel{\text{def}}{=} h(u) + u^\top \theta - R(\theta)$

Can get what we want if we replace $u_t$ with $\hat{u}_t \stackrel{\text{def}}{=} \frac{1}{t}\sum_{s=1}^{t} u_s$:
$$u_t = \arg\min_u L(u, \theta_t)$$
$$\theta_{t+1} = \arg\max_\theta L(\hat{u}_t, \theta)$$

Theorem
If $R$ is strongly convex, then $|L(\hat{u}_t, \theta_t) - L(u^*, \theta^*)| \le O\!\left(\frac{\log(T)}{T}\right)$.

Note: can get $O(1/T)$ convergence if we use a weighted average for $\hat{u}_t$.
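And the same quadratic game with the averaged iterate $\hat{u}_t$ driving the $\theta$-step (again a sketch, same instance as before):

```python
# Averaged best response on L(u, theta) = u^2/2 + u*theta - theta^2/2.
theta, u_sum = 1.0, 0.0
for t in range(1, 11):
    u = -theta            # argmin_u  u^2/2 + u*theta
    u_sum += u
    u_hat = u_sum / t     # u_hat_t = (1/t) * sum_{s=1}^t u_s
    theta = u_hat         # argmax_th u_hat*theta - theta^2/2
print(u_hat, theta)       # both 0.0: the saddle point (u*, theta*) = (0, 0)
```

Here the averaging lands exactly on the saddle point within a few iterations; in general the theorem's $O(\log(T)/T)$ rate is what one should expect.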
Frank-Wolfe

$L(u, \theta) \stackrel{\text{def}}{=} h(u) + u^\top \theta - R(\theta)$

We get Frank-Wolfe for $h \equiv 0$ (with $u$ ranging over a compact feasible set, so the $u$-step is a linear minimization oracle):
$$u_t = \arg\min_u L(u, \theta_t) = \arg\min_u u^\top \theta_t$$
$$\theta_{t+1} = \arg\max_\theta L(\hat{u}_t, \theta) = \arg\max_\theta \hat{u}_t^\top \theta - R(\theta) = \partial R^*(\hat{u}_t)$$

General updates:
$$u_t = \partial h^*(-\theta_t)$$
$$\theta_{t+1} = \partial R^*(\hat{u}_t).$$
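To see the general updates in action, here is a sketch of the $h \equiv 0$ (Frank-Wolfe) case on a concrete instance of my own choosing: minimize $R^*(u) = \frac{1}{2}\|u - c\|_2^2$ over the probability simplex, i.e. $R(\theta) = \frac{1}{2}\|\theta\|_2^2 + c^\top\theta$. Note that the averaging $\hat{u}_t = (1 - \frac{1}{t})\hat{u}_{t-1} + \frac{1}{t}u_t$ is exactly the classic Frank-Wolfe step with step size $1/t$:

```python
import numpy as np

# General updates with h == 0 on the simplex (linear minimization oracle)
# and R(theta) = ||theta||^2/2 + c.theta, so grad R*(u) = u - c and the
# implied objective is L(u) = ||u - c||^2 / 2 over the simplex.
n = 5
c = np.linspace(0.0, 1.0, n)             # target point (illustrative choice)
theta = np.zeros(n)
u_hat = np.zeros(n)

for t in range(1, 501):
    u = np.eye(n)[np.argmin(theta)]      # u_t = argmin_{u in simplex} u.theta_t
    u_hat = (1 - 1.0 / t) * u_hat + u / t  # running average = FW step, gamma = 1/t
    theta = u_hat - c                    # theta_{t+1} = grad R*(u_hat_t)

print(u_hat)  # approaches the Euclidean projection of c onto the simplex
```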
Applications

Each row pairs a choice of $L(u, \theta)$ with the algorithm it recovers:

  * $h(u) + u^\top \theta - f^*(\theta)$  →  mirror descent
  * $\|u\|_1 + (Au - y)^\top \theta - \tfrac{1}{2}\|\theta\|_2^2$  →  thresholded Frank-Wolfe
  * $\mathrm{Tr}(A^\top X) + \sum_{i=1}^n y_i (X_{ii} - 1 - \eta \log y_i)$  →  AHK low-rank SDP
  * $\mathbb{E}_{x \sim \mu}[\theta^\top(\phi(x) - \bar{\phi})] - \tfrac{1}{q}\|\theta\|_q^q$  →  $q$-herding

(Note: some of the entries above use a dual version of the algorithm.)
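As one worked instance of the table, the first row plugged into the general updates reads $u_t = \partial h^*(-\theta_t)$, $\theta_{t+1} = \nabla f(\hat{u}_t)$. The sketch below uses my own illustrative choices of $f$ and $h$; given the note about dual versions, it need not coincide with textbook mirror descent:

```python
import numpy as np

# First table row, L(u, theta) = h(u) + u.theta - f*(theta), with the
# illustrative choices f(u) = ||u - c||^2/2 and h(u) = lam*||u||^2/2,
# so grad h*(v) = v/lam and grad R* = grad f** = grad f.
lam = 1.0
c = np.array([1.0, -2.0, 3.0])
theta = np.zeros_like(c)
u_sum = np.zeros_like(c)

for t in range(1, 51):
    u = -theta / lam          # u_t = grad h*(-theta_t)
    u_sum += u
    u_hat = u_sum / t         # averaged iterate u_hat_t
    theta = u_hat - c         # theta_{t+1} = grad f(u_hat_t)

print(u_hat)  # -> c / (1 + lam), the minimizer of h(u) + f(u)
```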