
A Consensus-based Decentralized Algorithm for Non-convex Optimization with
Application to Dictionary Learning
Hoi-To Wai†, Tsung-Hui Chang and Anna Scaglione Emails: [email protected], [email protected], [email protected]
Motivations
A paradigm shift is needed for big-data signal processing:
Data is stored across the network.
Parallel computation is needed for large-scale problems.
Only local communication is allowed.

Prior works
A natural approach is to combine alternating optimization (AO) with distributed gradient (DG) methods. E.g., [1] applies adapt-then-combine (ATC)-AO with fixed/diminishing step sizes, with no guarantee (not even empirical evidence) of convergence to locally optimal solutions.
Empirical Convergence on the Dictionary Learning problem
Apply DL with brodatz.png – image size: 512 × 512, M = 300, m = 256, n = 64, γ = 0.1, λ = 0.03.
Network setting: N = 10 agents, communication graph generated as an Erdős–Rényi (ER) graph with p = 0.6.
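As an illustration of this network setting, the sketch below builds a connected ER graph and a doubly stochastic mixing matrix W. The Metropolis-Hastings weight rule is one common construction and is our assumption here; the poster does not state how W was chosen.

```python
# Sketch (assumption): ER communication graph + Metropolis-Hastings doubly stochastic W.
import numpy as np
import networkx as nx

def er_graph_and_weights(N=10, p=0.6, seed=0):
    rng = np.random.default_rng(seed)
    while True:                                   # resample until the graph is connected
        G = nx.erdos_renyi_graph(N, p, seed=int(rng.integers(1 << 30)))
        if nx.is_connected(G):
            break
    W = np.zeros((N, N))
    deg = dict(G.degree())
    for i, j in G.edges():                        # Metropolis-Hastings weights
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))      # make every row (and column) sum to one
    return G, W

G, W = er_graph_and_weights()
W_tilde = (np.eye(W.shape[0]) + W) / 2            # the matrix W~ = (I + W)/2 used by EXTRA-AO
```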
EXTRA-AO algorithm
The EXTRA algorithm was originally developed in [2], with a convergence guarantee for convex problems. We propose an EXTRA-AO algorithm for (2); its step-by-step listing is given below.
Data are collected/stored in a distributed setting; agents cooperate to learn common features.
Goal: to develop decentralized optimization tools that are suitable in a networked setting.
Example: Dictionary Learning (DL) —

    min_{X,Y} ∑_{j=1}^{N} ( (1/2)‖S_{:,j} − X Y_{:,j}‖_F^2 + λ‖Y_{:,j}‖_1 ) + γ‖X‖_2^2        (1)
S ∈ R^{m×M} – training data, divided into N parts
X ∈ R^{m×n} – dictionary that constitutes S
Y ∈ R^{n×M} – sparse coefficient matrix that encodes S, divided into N parts
The DL problem is non-convex and the training data S is stored across the network.
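To make (1) concrete, here is a minimal numpy sketch that evaluates the objective for a given (X, Y). It treats the sum column-wise (equivalent to summing over the N blocks) and reads ‖X‖_2^2 as the squared Frobenius norm; both readings, and all variable names, are our assumptions for illustration.

```python
# Sketch: evaluate the DL objective (1); sizes and data are illustrative only.
import numpy as np

m, n, M = 256, 64, 300
rng = np.random.default_rng(0)
S = rng.standard_normal((m, M))     # training data (in practice, image patches)
X = rng.standard_normal((m, n))     # dictionary
Y = rng.standard_normal((n, M))     # sparse coefficients
lam, gamma = 0.03, 0.1

def dl_objective(S, X, Y, lam, gamma):
    fit = 0.5 * np.sum((S - X @ Y) ** 2)        # sum_j 0.5 * ||S[:, j] - X @ Y[:, j]||^2
    sparsity = lam * np.abs(Y).sum()            # sum_j lam * ||Y[:, j]||_1
    return fit + sparsity + gamma * np.sum(X ** 2)   # + gamma * ||X||^2 (Frobenius reading)

print(dl_objective(S, X, Y, lam, gamma))
```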
Problem Set-up
Our aim is to apply decentralized optimization to
tackle:
    min_{x, {y_i}_{i=1}^{N}} ∑_{i=1}^{N} f_i(x, y_i) + h_i(y_i)        (2)
x ∈ R^m, y_i ∈ R^n – optimization variables
f_i(x, y_i) – continuously differentiable, non-convex
h_i(y_i) – convex but non-smooth, e.g., the ℓ1 norm
N – number of agents in the network
Examples: distributed DL, NMF, low-rank matrix factorization, etc.
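For instance, the distributed DL problem (1) can be written in the form (2) by identifying x with the shared dictionary X and letting y_i collect the coefficient columns held by agent i; one possible splitting (the 1/N share of the γ regularizer is our assumption for illustration) is

    f_i(x, y_i) = (1/2)‖S_i − X Y_i‖_F^2 + (γ/N)‖X‖_2^2,    h_i(y_i) = λ‖Y_i‖_1,

where S_i and Y_i denote the data and coefficient columns stored at agent i.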
EXTRA-AO algorithm:
1: for k = 1, 2, ... do
2:   Agents compute the following:
3:     x^k = W x^{k−1} − α ∇_x f(x^{k−1}, y^{k−1}),   if k = 1,
       x^k = x^{k−1} + W x^{k−1} − α ∇_x f(x^{k−1}, y^{k−1}) − W̃ x^{k−2} + α ∇_x f(x^{k−2}, y^{k−2}),   if k > 1,
       where α > 0 and W̃ = (I + W)/2.
4:   For all i, agent i computes the following:
       y_i^k = prox_{β h_i(·)}( y_i^{k−1} − β ∇_y f_i(x_i^k, y_i^{k−1}) )
5: end for
Return: x^k, y^k.
where
x_i^k ∈ R^m, y_i^k ∈ R^n – variables held by agent i at iteration k.
x^k ≜ (x_1^k, ..., x_N^k)^T, y^k ≜ (y_1^k, ..., y_N^k)^T.
W ∈ R^{N×N} – doubly stochastic matrix, constrained by the network topology G.
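Below is a minimal single-machine simulation of the EXTRA-AO iterations, assuming h_i = λ‖·‖_1 so that the proximal step reduces to soft-thresholding; the gradient callables, names, and defaults are illustrative assumptions rather than the authors' implementation.

```python
# Sketch: simulate EXTRA-AO on one machine; each agent i holds a local copy x_i
# of the shared variable (stacked as rows of x) and a private block y_i.
import numpy as np

def soft_threshold(v, t):
    """Prox of t*||.||_1 (assumes h_i = lam*||.||_1)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def extra_ao(grad_x, grad_y, W, x0, y0, alpha, beta, lam, iters=100):
    """grad_x[i], grad_y[i]: gradients of f_i w.r.t. x and y_i; W: doubly stochastic."""
    N = W.shape[0]
    W_t = (np.eye(N) + W) / 2                         # W~ = (I + W)/2
    x_prev = x0.copy()                                # x^0, shape (N, m)
    y = [yi.copy() for yi in y0]                      # y^0
    g_prev = np.stack([grad_x[i](x_prev[i], y[i]) for i in range(N)])
    x = W @ x_prev - alpha * g_prev                   # EXTRA step for k = 1
    for _ in range(iters):
        for i in range(N):                            # local proximal gradient step on y_i
            y[i] = soft_threshold(y[i] - beta * grad_y[i](x[i], y[i]), beta * lam)
        g = np.stack([grad_x[i](x[i], y[i]) for i in range(N)])
        x_next = x + W @ x - alpha * g - W_t @ x_prev + alpha * g_prev   # EXTRA step, k > 1
        x_prev, g_prev, x = x, g, x_next
    return x, y
```

Since W is supported on the edges of G, every W @ x product corresponds to one round of local communication; everything else in the loop is agent-local, which is what makes Steps 3 and 4 decentralized.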
The EXTRA-AO is shown to converge to a local minimum (with a vanishing gradient).
The convergence rate is comparable to other AO-based algorithms.
Theoretical Guarantees
Proposition 1: Assume that null{I − W} = span{1}. Suppose that the sequence {(x^k, y^k)}_k generated by EXTRA-AO converges to a point (x^∞, y^∞); then (x^∞, y^∞) is a stationary point of problem (2).
For the DL problem, we observe that the sequence {(x^k, y^k)}_k converges.
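For reference, stationarity here can be read as the first-order conditions below (our paraphrase of the standard definition, using the consensus assumption null{I − W} = span{1}):

    x_1^∞ = ... = x_N^∞ =: x̄^∞,    ∑_{i=1}^{N} ∇_x f_i(x̄^∞, y_i^∞) = 0,    0 ∈ ∇_{y_i} f_i(x̄^∞, y_i^∞) + ∂h_i(y_i^∞),  i = 1, ..., N.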
Lemma 1: Suppose that the step sizes α, β in EXTRA-AO satisfy

    0 < α < 2λ_min(W̃)/L_x,    0 < β < 1/L_y,

then the following inequalities hold for the objective value of (2) at each iteration:

    f(x^{k+1}, y^k) − f(x^k, y^k) ≤ −δ‖x^{k+1} − x^k‖_2^2 − (1/α)⟨(W̃ − W) ∑_{t=0}^{k+1} x^t, x^{k+1} − x^k⟩,        (3)

    f(x^{k+1}, y^{k+1}) − f(x^{k+1}, y^k) ≤ −(1/2)‖y^k − y^{k+1}‖_2^2,        (4)

where δ = λ_min(W̃)/α − L_x/2 > 0 is a constant.
Step 3 is the EXTRA step, which can be carried out in a distributed manner.
Step 4 can be computed individually by each agent.
The algorithm can be sped up using parallel computation across the network.
The convergence analysis of [2] applies to convex functions only.
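Summing (3) and (4) gives the per-iteration estimate behind the argument below:

    f(x^{k+1}, y^{k+1}) − f(x^k, y^k) ≤ −δ‖x^{k+1} − x^k‖_2^2 − (1/2)‖y^{k+1} − y^k‖_2^2 − (1/α)⟨(W̃ − W) ∑_{t=0}^{k+1} x^t, x^{k+1} − x^k⟩.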
If the inner product term in (3) vanishes as k → ∞, then f(x^k, y^k) decreases monotonically ⇒
‖y^k − y^{k+1}‖_2^2 + ‖x^k − x^{k+1}‖_2^2 → 0 ⇒ EXTRA-AO converges to a stationary point of (2).
In fact, under some mild assumptions, we can prove that

    ‖(W̃ − W) ∑_{t=0}^{k} x^t‖ ≤ C < ∞.        (5)
This paper: we propose the EXTRA-AO algorithm and study its convergence.
Ref.: [1] P. Chainais and C. Richard, “Distributed dictionary learning over a sensor network,” pp. 1–6, Apr. 2013.
[Online]. Available: http://arxiv.org/abs/1304.3568
[2] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An Exact First-Order Algorithm for Decentralized Consensus
Optimization,” pp. 1–23, Apr. 2014. [Online]. Available: http://arxiv.org/abs/1404.6264
The full convergence analysis may involve studying a coupled dynamical system.
This work is supported by NSF grant CCF 1018111 and Taiwan Ministry of Science and Technology NSC 102-2221-E-011-005-MY3.
April 19 – 26, 2015, ICASSP 2015, Brisbane, Australia.