Some New Complexity Results for
Composite Optimization
Guanghui (George) Lan
Georgia Institute of Technology
Joint work with Yuyuan Ouyang (Clemson) and Yi Zhou (Georgia Tech)
Workshop on Optimization without Borders, dedicated to
Yuri Nesterov’s 60th Birthday, Les Houches, France,
December 7-12, 2016

Outline: Background, Complex composite problems, Finite-sum problems, Summary

General CP methods
Problem: Ψ* = min_{x∈X} Ψ(x).
X closed and convex.
Ψ is convex.
Goal: to find an ε-solution, i.e., x̄ ∈ X s.t. Ψ(x̄) − Ψ* ≤ ε.
Complexity: the number of (sub)gradient evaluations of Ψ:
Ψ is smooth: O(1/√ε).
Ψ is nonsmooth: O(1/ε²).
Ψ is strongly convex: O(log(1/ε)).

Composite optimization problems
We consider composite problems which can be modeled as
Ψ* = min_{x∈X} {Ψ(x) := f(x) + h(x)}.
Here, f: X → R is a smooth and expensive term (data fitting), h: X → R is a nonsmooth regularization term (solution structure), and X is a closed convex set.
Three challenging cases:
h or X are not necessarily simple.
f given by the summation of many terms.
f (or h) is possibly nonconvex.

Existing complexity results
Problem: Ψ* := min_{x∈X} {Ψ(x) := f(x) + h(x)}.
First-order methods: iterative methods which operate with the gradients (subgradients) of f and h.
Complexity: number of iterations needed to find an ε-solution, i.e., a point x̄ ∈ X s.t. Ψ(x̄) − Ψ* ≤ ε.
Easy case: h simple, X simple
Pr_{X,h}(y) := argmin_{x∈X} ‖y − x‖² + h(x) is easy to compute (e.g., compressed sensing). Complexity: O(1/√ε) (Nesterov 07, Tseng 08, Beck and Teboulle 09).
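
For concreteness, a minimal sketch of such a prox-mapping, assuming X = R^n and h = λ‖·‖₁ (the compressed-sensing case); the minimizer is the classical soft-thresholding operator:

    import numpy as np

    def prox_l1(y, lam):
        # argmin_x 0.5*||y - x||^2 + lam*||x||_1, solved coordinate-wise by
        # soft-thresholding (rescale lam if the quadratic is weighted differently).
        return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)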

More difficult cases
h general, X simple
h is a general nonsmooth function; P_X(y) := argmin_{x∈X} ‖y − x‖² is easy to compute. Complexity: O(1/ε²).
h structured, X simple
h is structured, e.g., h(x) = max_{y∈Y} ⟨Ax, y⟩; P_X is easy to compute. Complexity: O(1/ε).
h simple, X complicated
L_{X,h}(y) := argmin_{x∈X} ⟨y, x⟩ + h(x) is easy to compute (e.g., matrix completion). Complexity: O(1/ε).
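
As an illustration of the last case, a hedged sketch of a linear optimization oracle L_{X,h}, assuming h = 0 and X is a nuclear-norm ball (the matrix-completion setting); unlike a projection, it only requires the top singular pair:

    import numpy as np

    def lmo_nuclear_ball(g, radius):
        # argmin over ||X||_* <= radius of <g, X> is attained at the rank-one
        # matrix -radius * u1 v1^T, where (u1, v1) is the top singular pair of g.
        U, s, Vt = np.linalg.svd(g, full_matrices=False)
        return -radius * np.outer(U[:, 0], Vt[0])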

Motivation

Case                       Complexity   # iterations (e.g., ε = 10⁻⁴)
h simple, X simple         O(1/√ε)      10²
h general, X simple        O(1/ε²)      10⁸
h structured, X simple     O(1/ε)       10⁴
h simple, X complicated    O(1/ε)       10⁴

More general h or more complicated X
⇓
Slow convergence of first-order algorithms
⇓ (?)
A large number of gradient evaluations of ∇f
Question: Can we skip the computation of ∇f ?

Composite problems
Ψ* = min_{x∈X} {Ψ(x) := f(x) + h(x)}.
f is smooth, i.e., ∃L > 0 s.t. ∀x, y ∈ X, ‖∇f(y) − ∇f(x)‖ ≤ L‖y − x‖.
h is nonsmooth, i.e., ∃M > 0 s.t. ∀x, y ∈ X, |h(x) − h(y)| ≤ M‖y − x‖.
P_X is simple to compute.
Question: How many gradient evaluations of ∇f and subgradient evaluations of h′ are needed to find an ε-solution?

Existing results
Existing algorithms evaluate ∇f and h′ together at each iteration:
Mirror-prox method (Juditsky, Nemirovski and Tauvel 11): O{ L/ε + M²/ε² }.
Accelerated stochastic approximation (Lan 12): O{ √(L/ε) + M²/ε² }.
Issue: whenever the second term dominates, the number of gradient evaluations of ∇f is given by O(1/ε²).

Bottleneck for composite problems
The computation of ∇f, however, is often the bottleneck in comparison with that of h′.
The computation of ∇f involves a large data set, while that of h′ only involves a very sparse matrix (e.g., total variation minimization).
Can we reduce the number of gradient evaluations of ∇f from O(1/ε²) to O(1/√ε), while still maintaining the optimal O(1/ε²) bound on subgradient evaluations of h′?

The gradient sliding algorithm
Algorithm 1 The gradient sliding (GS) algorithm
Input: initial point x₀ ∈ X and iteration limit N.
Let βk ≥ 0, γk ≥ 0, and Tk ≥ 0 be given and set x̄₀ = x₀.
for k = 1, 2, ..., N do
1. Set x̂_k = (1 − γk)x̄_{k−1} + γk x_{k−1} and g_k = ∇f(x̂_k).
2. Set (x_k, x̃_k) = PS(g_k, x_{k−1}, βk, Tk).
3. Set x̄_k = (1 − γk)x̄_{k−1} + γk x̃_k.
end for
Output: x̄_N.
PS: the prox-sliding procedure. (A runnable sketch of GS and PS appears after the convergence theorem below.)

The PS procedure
Procedure (x⁺, x̃⁺) = PS(g, x, β, T)
Let the parameters p_t > 0 and θ_t ∈ [0, 1], t = 1, ..., T, be given. Set u₀ = ũ₀ = x.
for t = 1, 2, ..., T do
u_t = argmin_{u∈X} ⟨g + h′(u_{t−1}), u⟩ + (β/2)‖u − x‖² + (βp_t/2)‖u − u_{t−1}‖²,
ũ_t = (1 − θ_t)ũ_{t−1} + θ_t u_t.
end for
Set x⁺ = u_T and x̃⁺ = ũ_T.
Note: ‖· − ·‖²/2 can be replaced by the more general Bregman distance V(x, u) = ω(u) − ω(x) − ⟨∇ω(x), u − x⟩.

Remarks
When supplied with g, x ∈ X, β, and T, the PS procedure computes a pair of approximate solutions (x⁺, x̃⁺) ∈ X × X for the problem
argmin_{u∈X} { Φ(u) := ⟨g, u⟩ + h(u) + (β/2)‖u − x‖² }.
In each iteration of GS, the subproblem is given by
argmin_{u∈X} { Φ_k(u) := ⟨∇f(x̂_k), u⟩ + h(u) + (βk/2)‖u − x_{k−1}‖² }.

Convergence of the PS procedure
Proposition
If {p_t} and {θ_t} in the PS procedure satisfy
p_t = t/2 and θ_t = 2(t + 1)/(t(t + 3)),
then for any t ≥ 1 and u ∈ X,
Φ(ũ_t) − Φ(u) + [β(t + 1)(t + 2)/(2t(t + 3))] ‖u_t − u‖² ≤ β‖u₀ − u‖²/(t(t + 3)) + M²/(β(t + 3)).

Convergence of the GS algorithm
Theorem
Suppose that the previous conditions on {p_t} and {θ_t} hold, and that N is given a priori. If
βk = 2L/k, γk = 2/(k + 1), and Tk = ⌈M²Nk²/(D̃L²)⌉
for some D̃ > 0, then
Ψ(x̄_N) − Ψ(x*) ≤ [L/(N(N + 1))] (3‖x₀ − x*‖²/2 + 2D̃).
Remark: we do NOT need N given a priori if X is bounded.
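
A minimal runnable sketch of GS and PS with the parameter choices above, assuming X = R^n (so the inner argmin has a closed form) and handling h through a user-supplied subgradient; the toy data at the end are illustrative, not from the talk:

    import numpy as np

    def ps(g, h_sub, x, beta, T):
        # Prox-sliding: approximately solve argmin_u <g,u> + h(u) + (beta/2)||u-x||^2
        # on R^n, linearizing h at u_{t-1} in each inner step.
        u = x.copy(); u_tld = x.copy()
        for t in range(1, T + 1):
            p, theta = t / 2.0, 2.0 * (t + 1) / (t * (t + 3))
            u = (beta * x + beta * p * u - (g + h_sub(u))) / (beta * (1 + p))
            u_tld = (1 - theta) * u_tld + theta * u
        return u, u_tld

    def gradient_sliding(grad_f, h_sub, x0, L, M, N, D_tilde):
        x = x0.copy(); x_bar = x0.copy()
        for k in range(1, N + 1):
            beta, gamma = 2.0 * L / k, 2.0 / (k + 1)
            T = int(np.ceil(M**2 * N * k**2 / (D_tilde * L**2)))
            x_hat = (1 - gamma) * x_bar + gamma * x          # step 1
            x, x_tld = ps(grad_f(x_hat), h_sub, x, beta, T)  # step 2
            x_bar = (1 - gamma) * x_bar + gamma * x_tld      # step 3
        return x_bar

    # Toy usage: f(x) = 0.5||Ax-b||^2 (L = ||A||_2^2), h(x) = lam*||x||_1 (M = lam*sqrt(n)).
    rng = np.random.default_rng(0)
    A, b, lam = rng.standard_normal((20, 10)), rng.standard_normal(20), 0.1
    x_N = gradient_sliding(lambda x: A.T @ (A @ x - b), lambda u: lam * np.sign(u),
                           np.zeros(10), L=np.linalg.norm(A, 2)**2,
                           M=lam * np.sqrt(10), N=30, D_tilde=1.0)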

Complexity of the GS algorithm
Denote D_X := max_{x1,x2∈X} ‖x1 − x2‖ and set D̃ = 3D_X²/4.
The number of gradient evaluations of ∇f is bounded by
√(3LD_X²/ε),
and the number of subgradient evaluations of h′ is bounded by Σ_{k=1}^N Tk, which is bounded by
4M²D_X²/ε² + √(3LD_X²/ε).
Consequence
Significantly reduce the number of gradient evaluations of ∇f from O(1/ε²) to O(1/√ε), even though the whole objective function Ψ is nonsmooth in general.
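
A small arithmetic check of these two bounds with illustrative constants (not from the talk): summing Tk = ⌈M²Nk²/(D̃L²)⌉ over k = 1, ..., N reproduces the 4M²D_X²/ε² leading term:

    import numpy as np

    L, M, D_X, eps = 100.0, 10.0, 1.0, 1e-3
    D_tilde = 3 * D_X**2 / 4
    N = int(np.ceil(np.sqrt(3 * L * D_X**2 / eps)))    # gradient evaluations of f
    T_total = sum(int(np.ceil(M**2 * N * k**2 / (D_tilde * L**2)))
                  for k in range(1, N + 1))
    print(N, T_total, 4 * M**2 * D_X**2 / eps**2)      # T_total tracks the last value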

Extensions
Gradient sliding for min_{x∈X} f(x) + h(x):

Case                     total iter.   ∇f
h general nonsmooth      O(1/ε²)       O(1/√ε)
h structured nonsmooth   O(1/ε)        O(1/√ε)
f strongly convex        O(1/ε)        O(log(1/ε))

Conditional gradient sliding methods for problems with more complicated feasible set:

Case                total iter. (LO oracle)   ∇f
f convex            O(1/ε)                    O(1/√ε)
f strongly convex   O(1/ε)                    O(log(1/ε))

The problem of interest
Problem: Ψ* := min_{x∈X} { Ψ(x) := Σ_{i=1}^m fi(x) + h(x) + μω(x) }.
X closed and convex.
fi smooth convex: ‖∇fi(x1) − ∇fi(x2)‖* ≤ Li‖x1 − x2‖.
h simple, e.g., the l1 norm.
ω strongly convex with modulus 1 w.r.t. an arbitrary norm.
μ ≥ 0.
Subproblem argmin_{x∈X} ⟨g, x⟩ + h(x) + μω(x) is easy.
Denote f(x) ≡ Σ_{i=1}^m fi(x) and L ≡ Σ_{i=1}^m Li; f is smooth with Lipschitz constant ≤ L.

Stochastic subgradient descent for nonsmooth problems
General stochastic programming (SP): min_{x∈X} E_ξ[F(x, ξ)].
Reformulation of the finite-sum problem as SP: ξ ∈ {1, ..., m}, Prob{ξ = i} = νi, and F(x, i) = νi⁻¹fi(x) + h(x) + μω(x), i = 1, ..., m.
Iteration complexity: O(1/ε²), or O(1/ε) when μ > 0.
Iteration cost: m times cheaper than deterministic first-order methods.
Save up to a factor of O(m) subgradient computations.
For details, see Nemirovski et al. (09).
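
A hedged sketch of this reformulation, assuming X = R^n and μ = 0, with νi⁻¹fi′(x) + h′(x) serving as an unbiased subgradient estimate (the names and the stepsize rule are illustrative):

    import numpy as np

    def sgd_finite_sum(sub_fi, sub_h, x0, nu, iters, rng, c=1.0):
        # E[ nu_i^{-1} f_i'(x) + h'(x) ] = f'(x) + h'(x): unbiased for the full sum.
        x, m = x0.copy(), len(nu)
        for t in range(1, iters + 1):
            i = rng.choice(m, p=nu)
            g = sub_fi(i, x) / nu[i] + sub_h(x)
            x = x - (c / np.sqrt(t)) * g       # O(1/sqrt(t)) stepsizes for mu = 0
        return x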

Required ∇fi's in the smooth case
For simplicity, focus on the strongly convex case (μ > 0).
Goal: find a solution x̄ ∈ X s.t. ‖x̄ − x*‖ ≤ ε‖x⁰ − x*‖.
Nesterov's optimal method (Nesterov 83): O{ m√(Lf/μ) log(1/ε) }.
Accelerated stochastic approximation (Lan 12, Ghadimi and Lan 13): O{ √(Lf/μ) log(1/ε) + σ²/(με) }.
Note: the optimality of the latter bound for general SP does not preclude more efficient algorithms for the finite-sum problem.

Randomized incremental gradient methods
Each iteration requires a randomly selected ∇fi(x).
Stochastic average gradient (SAG) by Schmidt, Le Roux and Bach 13: O{ (m + L/μ) log(1/ε) }.
Similar results were obtained in Johnson and Zhang 13, Defazio et al. 14, ...
Worse dependence on L/μ than Nesterov's method.

Coordinate ascent in the dual
min Σ_{i=1}^m φi(aiᵀx) + h(x), with h strongly convex w.r.t. the l2 norm.
All these coordinate algorithms achieve O{ (m + √(mL/μ)) log(1/ε) }:
Shalev-Shwartz and Zhang 13, 15 (restarting stochastic dual ascent),
Lin, Lu and Xiao 14 (Nesterov, Fercoq and Richtárik's accelerated coordinate descent),
see also Zhang and Xiao 14 (Chambolle and Pock),
Dang and Lan 14 (non-strongly convex), O(1/ε) or O(1/√ε).
Some issues:
Deal with a more special class of problems.
Require solving argmin_y {⟨g, y⟩ + φi*(y) + ‖y‖²} rather than computing incremental gradients.

Open problems and our research
Problems:
Can we accelerate the convergence of randomized incremental gradient methods?
What is the best possible performance we can expect?
Our contributions:
A primal-dual gradient (PDG) method = a primal-dual look at Nesterov's method.
A randomized PDG (RPDG) method.
A new lower complexity bound.
A game-theoretic interpretation for acceleration.
Catalyst: Lin, Mairal, and Harchaoui 15.

Reformulation and game/economic interpretation
Let Jf be the conjugate function of f. Consider
Ψ* := min_{x∈X} { h(x) + μω(x) + max_{g∈G} [⟨x, g⟩ − Jf(g)] }.
The buyer purchases products from the supplier.
The unit price is given by g ∈ R^n.
X, h and ω are constraints and other local costs for the buyer.
The profit of the supplier: revenue ⟨x, g⟩ minus local cost Jf(g).
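
A quick scalar sanity check of this reformulation (an illustrative example, not from the slides): for f(x) = (L/2)x², the conjugate is Jf(g) = g²/(2L), and max_g {xg − g²/(2L)} is attained at g = Lx with value (L/2)x² = f(x), so the inner maximization indeed recovers f.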

How to achieve equilibrium?
Current order quantity x⁰, and product price g⁰.
Proximity control functions:
P(x⁰, x) := ω(x) − [ω(x⁰) + ⟨ω′(x⁰), x − x⁰⟩].
Df(g⁰, g) := Jf(g) − [Jf(g⁰) + ⟨Jf′(g⁰), g − g⁰⟩].
Dual prox-mapping:
MG(−x̃, g⁰, τ) := argmin_{g∈G} { ⟨−x̃, g⟩ + Jf(g) + τDf(g⁰, g) }.
x̃ is the given or predicted demand. Maximize the profit, but not too far away from g⁰.
Primal prox-mapping:
MX(g, x⁰, η) := argmin_{x∈X} { ⟨g, x⟩ + h(x) + μω(x) + ηP(x⁰, x) }.
g is the given or predicted price. Minimize the cost, but not too far away from x⁰.
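
For intuition, a hedged closed-form example (assuming ω = ‖·‖²/2, h = 0, X = G = R^n, and f(x) = (L/2)‖x‖², so that Jf(g) = ‖g‖²/(2L) and Df(g⁰, g) = ‖g − g⁰‖²/(2L)): both prox-mappings can be computed explicitly,
MG(−x̃, g⁰, τ) = (Lx̃ + τg⁰)/(1 + τ) and MX(g, x⁰, η) = (ηx⁰ − g)/(μ + η).
In particular MG(−x̃, g⁰, τ) = ∇f((x̃ + τJf′(g⁰))/(1 + τ)), which is precisely the substitution behind the gradient form of PDG below.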

The deterministic PDG
Algorithm 2 The primal-dual gradient method
Let x⁰ = x⁻¹ ∈ X, and the nonnegative parameters {τt}, {ηt}, and {αt} be given.
Set g⁰ = ∇f(x⁰).
for t = 1, ..., k do
Update z^t = (g^t, x^t) according to
x̃^t = αt(x^{t−1} − x^{t−2}) + x^{t−1}.
g^t = MG(−x̃^t, g^{t−1}, τt).
x^t = MX(g^t, x^{t−1}, ηt).
end for

A game/economic interpretation
The supplier predicts the buyer's demand based on historical information: x̃^t = αt(x^{t−1} − x^{t−2}) + x^{t−1}.
The supplier seeks to maximize predicted profit, but not too far away from g^{t−1}: g^t = MG(−x̃^t, g^{t−1}, τt).
The buyer tries to minimize the cost, but not too far away from x^{t−1}: x^t = MX(g^t, x^{t−1}, ηt).

PDG in gradient form
Algorithm 3 PDG method in gradient form
Input: Let x⁰ = x⁻¹ ∈ X, and the nonnegative parameters {τt}, {ηt}, and {αt} be given.
Set x̂⁰ = x⁰.
for t = 1, 2, ..., k do
x̃^t = αt(x^{t−1} − x^{t−2}) + x^{t−1}.
x̂^t = (x̃^t + τt x̂^{t−1})/(1 + τt).
g^t = ∇f(x̂^t).
x^t = MX(g^t, x^{t−1}, ηt).
end for
Idea: set Jf′(g^{t−1}) = x̂^{t−1}.
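
A minimal runnable sketch of Algorithm 3, assuming X = R^n, h = 0, ω = ‖·‖²/2 (so MX has the closed form derived above), and the strongly convex stepsize policy from the convergence theorem two slides below:

    import numpy as np

    def pdg(grad_f, x0, mu, Lf, iters):
        tau = np.sqrt(2.0 * Lf / mu)            # stepsizes from the theorem below
        eta = np.sqrt(2.0 * Lf * mu)
        alpha = tau / (1.0 + tau)
        x_prev = x0.copy(); x = x0.copy(); x_hat = x0.copy()
        for _ in range(iters):
            x_tilde = alpha * (x - x_prev) + x
            x_hat = (x_tilde + tau * x_hat) / (1.0 + tau)
            g = grad_f(x_hat)
            # M_X(g, x, eta) = argmin_u <g,u> + (mu/2)||u||^2 + (eta/2)||u - x||^2
            x_prev, x = x, (eta * x - g) / (mu + eta)
        return x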

Relation to Nesterov's method
A variant of Nesterov's method:
x̂^t = (1 − θt)x̄^{t−1} + θt x^{t−1}.
x^t = MX(Σ_{i=1}^m ∇fi(x̂^t), x^{t−1}, ηt).
x̄^t = (1 − θt)x̄^{t−1} + θt x^t.
Note that
x̂^t = (1 − θt)x̂^{t−1} + (1 − θt)θ_{t−1}(x^{t−1} − x^{t−2}) + θt x^{t−1}.
Equivalent to PDG with τt = (1 − θt)/θt and αt = θ_{t−1}(1 − θt)/θt.
Nesterov's acceleration: looking-ahead dual players.
Gradient descent: myopic dual players (αt = τt = 0 in PDG).

Convergence of PDG (or Nesterov's variant)
Theorem
Define x̄^k := (Σ_{t=1}^k θt)⁻¹ Σ_{t=1}^k θt x^t. Suppose that
τt = √(2Lf/μ), ηt = √(2Lf μ), αt = α ≡ √(2Lf/μ)/(1 + √(2Lf/μ)), and θt = 1/α^t.
Then
P(x^k, x*) ≤ [(μ + Lf)/μ] α^k P(x⁰, x*),
Ψ(x̄^k) − Ψ(x*) ≤ μ(1 − α)⁻¹ [1 + (Lf/μ)(2 + Lf/μ)] α^k P(x⁰, x*).

Theorem
If τt = (t − 1)/2, ηt = 4Lf/t, αt = (t − 1)/t, and θt = t, then
Ψ(x̄^k) − Ψ(x*) ≤ [8Lf/(k(k + 1))] P(x⁰, x*).

A multi-dual-player reformulation
Let Ji: Yi → R, i = 1, ..., m, be the conjugate functions of the fi, and let Yi denote the dual spaces. Then
min_{x∈X} { h(x) + μω(x) + max_{yi∈Yi} [⟨x, Σ_i yi⟩ − Σ_i Ji(yi)] }.
Define the new dual prox-functions and dual prox-mappings as
Di(yi⁰, yi) := Ji(yi) − [Ji(yi⁰) + ⟨Ji′(yi⁰), yi − yi⁰⟩],
MYi(−x̃, yi⁰, τ) := argmin_{yi∈Yi} { ⟨−x̃, yi⟩ + Ji(yi) + τDi(yi⁰, yi) }.

The RPDG method
Algorithm 4 The RPDG method
Let x⁰ = x⁻¹ ∈ X, and {τt}, {ηt}, and {αt} be given.
Set yi⁰ = ∇fi(x⁰), i = 1, ..., m.
for t = 1, ..., k do
Choose it according to Prob{it = i} = pi, i = 1, ..., m.
x̃^t = αt(x^{t−1} − x^{t−2}) + x^{t−1}.
yi^t = MYi(−x̃^t, yi^{t−1}, τt) if i = it, and yi^t = yi^{t−1} if i ≠ it.
ỹi^t = pi⁻¹(yi^t − yi^{t−1}) + yi^{t−1} if i = it, and ỹi^t = yi^{t−1} if i ≠ it.
x^t = MX(Σ_{i=1}^m ỹi^t, x^{t−1}, ηt).
end for

RPDG in gradient form
Algorithm 5 RPDG
for t = 1, ..., k do
Choose it according to Prob{it = i} = pi, i = 1, ..., m.
x̃^t = αt(x^{t−1} − x^{t−2}) + x^{t−1}.
x̂i^t = (1 + τt)⁻¹(x̃^t + τt x̂i^{t−1}) if i = it, and x̂i^t = x̂i^{t−1} if i ≠ it.
yi^t = ∇fi(x̂i^t) if i = it, and yi^t = yi^{t−1} if i ≠ it.
g^t = g^{t−1} + y_{it}^t − y_{it}^{t−1}.
x^t = MX(g^t + (p_{it}⁻¹ − 1)(y_{it}^t − y_{it}^{t−1}), x^{t−1}, ηt).
end for
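
A minimal runnable sketch of Algorithm 5 under the same simplifications as the PDG sketch (X = R^n, h = 0, ω = ‖·‖²/2), with the parameters taken from the rate theorem two slides below; grad_fi and its calling convention are illustrative:

    import numpy as np

    def rpdg(grad_fi, Li, x0, mu, iters, rng):
        m, L = len(Li), float(sum(Li))
        C = 8.0 * L / mu
        root = np.sqrt((m - 1)**2 + 4 * m * C)
        tau = (root - (m - 1)) / (2.0 * m)
        eta = mu * (root + (m - 1)) / 2.0
        alpha = 1.0 - 2.0 / ((m + 1) + root)
        p = np.array([1.0 / (2 * m) + Li[i] / (2 * L) for i in range(m)])
        x_prev = x0.copy(); x = x0.copy()
        x_hat = np.tile(x0, (m, 1))                 # per-component points x_hat_i
        y = np.stack([grad_fi(i, x0) for i in range(m)])
        g = y.sum(axis=0)                           # running sum of the y_i's
        for _ in range(iters):
            i = rng.choice(m, p=p)
            x_tilde = alpha * (x - x_prev) + x
            x_hat[i] = (x_tilde + tau * x_hat[i]) / (1.0 + tau)
            delta = grad_fi(i, x_hat[i]) - y[i]
            y[i] += delta
            g = g + delta
            # prox-mapping with the variance-corrected gradient estimate
            x_prev, x = x, (eta * x - (g + (1.0 / p[i] - 1.0) * delta)) / (mu + eta)
        return x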

Game-theoretic interpretation for RPDG
The suppliers predict the buyer's demand as before.
Only one randomly selected supplier will change his/her price, arriving at y^t.
The buyer would have used y^t as the price, but the resulting algorithm converges slowly (a worse dependence on m) (Dang and Lan 14).
Add a dual prediction (estimation) step, i.e., ỹ^t s.t. Et[ỹi^t] = ŷi^t, where ŷi^t := MYi(−x̃^t, yi^{t−1}, τt).
The buyer uses ỹ^t to determine the order quantity.

Rate of Convergence
Theorem
Let C = 8L/μ and
pi = Prob{it = i} = 1/(2m) + Li/(2L), i = 1, ..., m,
τt = [√((m − 1)² + 4mC) − (m − 1)]/(2m),
ηt = μ[√((m − 1)² + 4mC) + (m − 1)]/2,
αt = α := 1 − 2/[(m + 1) + √((m − 1)² + 4mC)].
Then
E[P(x^k, x*)] ≤ (1 + 3Lf/μ) α^k P(x⁰, x*),
E[Ψ(x̄^k)] − Ψ* ≤ α^{k/2}(1 − α)⁻¹ [μ + 2Lf + Lf²/μ] P(x⁰, x*).
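
A small numeric illustration of this rate with hypothetical m, L, μ (not from the talk), comparing the iterations implied by α with the m√(L/μ) log(1/ε) gradient evaluations of the deterministic optimal method:

    import numpy as np

    m, L, mu, eps = 100, 1e4, 1.0, 1e-6
    C = 8 * L / mu
    root = np.sqrt((m - 1)**2 + 4 * m * C)
    alpha = 1 - 2.0 / ((m + 1) + root)
    k_rpdg = np.log(eps) / np.log(alpha)                 # iterations until alpha^k <= eps
    k_nesterov = m * np.sqrt(L / mu) * np.log(1 / eps)   # gradient evaluations
    print(int(k_rpdg), int(k_nesterov))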

The iteration complexity of RPDG
To find a point x̄ ∈ X s.t. E[P(x̄, x*)] ≤ ε:
O{ (m + √(mL/μ)) log(P(x⁰, x*)/ε) }.
To find a point x̄ ∈ X s.t. Prob{P(x̄, x*) ≤ ε} ≥ 1 − λ for some λ ∈ (0, 1):
O{ (m + √(mL/μ)) log(P(x⁰, x*)/(λε)) }.
Similar results hold for the ergodic sequence in terms of function values.
A factor of up to O(min{√(L/μ), √m}) savings on gradient computation (or price changes), at the price of more order transactions.

Lower complexity bound
Consider
min_{xi∈R^ñ, i=1,...,m} Ψ(x) := Σ_{i=1}^m [ fi(xi) + (μ/2)‖xi‖₂² ],
fi(xi) = [μ(Q − 1)/4] ( ½⟨Axi, xi⟩ − ⟨e1, xi⟩ ), ñ ≡ n/m,
where A is the ñ × ñ tridiagonal matrix

A =
[  2 −1  0 ···  0  0 ]
[ −1  2 −1 ···  0  0 ]
[ ··· ··· ··· ··· ··· ]
[  0  0 ··· −1  2 −1 ]
[  0  0 ···  0 −1  κ ],  κ = (√Q + 3)/(√Q + 1).

Theorem
Denote q := (√Q − 1)/(√Q + 1). Then the iterates {x^k} generated by a randomized incremental gradient method must satisfy
E[‖x^k − x*‖₂²]/‖x⁰ − x*‖₂² ≥ ½ exp( −4k√Q/(m(√Q + 1)² − 4√Q) )
for any n ≥ n(m, k) ≡ ⌈ m log((1 − (1 − q²)/m)^{k/2}) / (2 log q) ⌉.
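
A short sketch constructing one block of this worst-case instance (function and variable names are illustrative):

    import numpy as np

    def worst_case_block(n_tilde, Q, mu):
        # Tridiagonal A: 2 on the diagonal, -1 off-diagonal, kappa in the corner.
        A = 2 * np.eye(n_tilde) - np.eye(n_tilde, k=1) - np.eye(n_tilde, k=-1)
        sqrtQ = np.sqrt(Q)
        A[-1, -1] = (sqrtQ + 3) / (sqrtQ + 1)    # kappa
        e1 = np.zeros(n_tilde); e1[0] = 1.0
        c = mu * (Q - 1) / 4
        grad_fi = lambda x: c * (A @ x - e1)     # gradient of f_i at x
        return A, e1, grad_fi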

Complexity
Corollary
The number of gradient evaluations performed by a randomized incremental gradient method for finding a solution x̄ ∈ X s.t. E[‖x̄ − x*‖₂²] ≤ ε cannot be smaller than
Ω{ (√(mC) + m) log(‖x⁰ − x*‖₂²/ε) }
if n is sufficiently large.
Other results in the paper
Generalization to problems without strong convexity.
Lower complexity bound for randomized coordinate descent methods.

What's new?
Gradient sliding algorithms for complex composite optimization:
saving gradient computation significantly without increasing the number of iterations.
An optimal randomized incremental gradient method for finite-sum optimization:
saving gradient computation at the expense of more iterations.
New lower complexity bound and game-theoretic interpretation for first-order methods.