A Consensus-based Decentralized Algorithm for Non-convex Optimization with Application to Dictionary Learning
Hoi-To Wai, Tsung-Hui Chang and Anna Scaglione

Motivations

A shift of paradigm is needed for big-data signal processing:
- Data are collected and stored in a distributed fashion across the network.
- Parallel computation is needed for large-scale problems.
- Only local communication between neighbouring agents is allowed.
Agents must therefore cooperate to learn common features from their local data.
Goal: to develop decentralized optimization tools that are suitable for such a networked setting.

Example: Dictionary Learning (DL)

  min_{X,Y}  Σ_{j=1}^{N} [ (1/2)‖S_{:,j} − X Y_{:,j}‖_F^2 + λ‖Y_{:,j}‖_1 ] + γ‖X‖_2^2        (1)

- S ∈ R^{m×M}: training data, divided into N parts.
- X ∈ R^{m×n}: dictionary that constitutes S.
- Y ∈ R^{n×M}: sparse coefficient matrix that encodes S, divided into N parts.
Challenges: the DL problem is non-convex, the training data S is stored across the network, etc.

[Figure: training-data blocks S_{:,1}, ..., S_{:,4} distributed over the networked agents.]

Problem Set-up

Our aim is to apply decentralized optimization to tackle:

  min_{x, {y_i}_{i=1}^{N}}  Σ_{i=1}^{N} [ f_i(x, y_i) + h_i(y_i) ]        (2)

- x ∈ R^m, y_i ∈ R^n: optimization variables.
- f_i(x, y_i): continuously differentiable, non-convex.
- h_i(y_i): convex but non-smooth, e.g., the ℓ1 norm.
- N: number of agents in the network.
Examples: distributed DL, NMF, low-rank matrix factorization, etc.

Prior works

- A natural approach is to combine alternating optimization (AO) with distributed gradient (DG) methods. For example, [1] applies adapt-then-combine (ATC)-AO with fixed/diminishing step sizes, but offers no guarantee (not even empirical evidence) of convergence to locally minimal solutions.
- The EXTRA algorithm was originally developed in [2] with convergence guarantees for convex problems; its analysis applies to convex functions only.
- This paper: we propose the EXTRA-AO algorithm and study its convergence.

EXTRA-AO algorithm

We propose an EXTRA-AO algorithm for (2):

1: Initialize x^0, y^0.
2: for k = 1, 2, ... do
3:   Agents compute
       x^k = W x^{k−1} − α ∇_x f(x^{k−1}, y^{k−1}),   if k = 1,
       x^k = x^{k−1} + W x^{k−1} − α ∇_x f(x^{k−1}, y^{k−1}) − W̃ x^{k−2} + α ∇_x f(x^{k−2}, y^{k−2}),   if k > 1,
     where α > 0 and W̃ = (I + W)/2.
4:   For all i, agent i computes
       y_i^k = prox_{β h_i(·)} ( y_i^{k−1} − β ∇_{y_i} f_i(x_i^k, y_i^{k−1}) ).
5: end for
   Return: x^k, y^k.

Here x_i^k ∈ R^m and y_i^k ∈ R^n are the variables held by agent i at iteration k, x^k := (x_1^k, ..., x_N^k)^T, y^k := (y_1^k, ..., y_N^k)^T, and W ∈ R^{N×N} is a doubly stochastic matrix constrained by the network topology G.

- Step 3 is the EXTRA step and can be carried out distributively.
- Step 4 can be computed individually by each agent.
- The algorithm can be sped up using parallel computation across the network.
- EXTRA-AO is shown to converge to a locally minimal (stationary) solution with a vanishing gradient; the convergence rate is comparable to other AO-based algorithms.

A minimal sketch of one possible implementation of this iteration is given below.
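As a rough illustration of Steps 3 and 4, the following Python/NumPy sketch simulates the iteration centrally, stacking the agents' variables row-wise, and assumes h_i = λ‖·‖_1 as in the DL example, so the proximal step is elementwise soft-thresholding. The function name extra_ao, the gradient callbacks grad_x and grad_y, and the stacked-variable layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def extra_ao(W, grad_x, grad_y, x0, y0, alpha, beta, lam, num_iter):
    """Centralized simulation of the EXTRA-AO iteration, assuming h_i = lam * ||.||_1.

    W      : (N, N) doubly stochastic mixing matrix (network topology G)
    grad_x : callable (x, y) -> (N, m) stacked local gradients w.r.t. x_i
    grad_y : callable (x, y) -> (N, n) stacked local gradients w.r.t. y_i
    x0, y0 : (N, m) and (N, n) initial local variables, one row per agent
    """
    W_tilde = 0.5 * (np.eye(W.shape[0]) + W)      # W~ = (I + W) / 2
    x_old, y = x0, y0
    g_old = grad_x(x_old, y)                      # gradient at (x^0, y^0)

    for k in range(1, num_iter + 1):
        # Step 3: EXTRA step on x (W @ x is one round of neighbour averaging)
        if k == 1:
            x = W @ x_old - alpha * g_old
        else:
            g = grad_x(x, y)
            x, x_old, g_old = (x + W @ x - alpha * g
                               - W_tilde @ x_old + alpha * g_old), x, g
        # Step 4: local proximal-gradient step on y_i (soft-thresholding)
        y = soft_threshold(y - beta * grad_y(x, y), beta * lam)

    return x, y
```

In a genuinely decentralized run, the product W @ x reduces to each agent averaging the x_j received from its neighbours, while every other operation only touches that agent's own rows of x and y.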
Theoretical Guarantees

Proposition 1: Assume that null{I − W} = span{1}. If the sequence {(x^k, y^k)}_k generated by EXTRA-AO converges to a point (x^∞, y^∞), then (x^∞, y^∞) is a stationary point of problem (2). For the DL problem, we observe that the sequence {(x^k, y^k)}_k converges.

Lemma 1: Suppose that the step sizes α, β in EXTRA-AO satisfy 0 < α < 2λ_min(W̃)/L_x and 0 < β < 1/L_y (where L_x, L_y denote the Lipschitz constants of the respective gradients). Then the objective values of (2) satisfy, at each iteration,

  f(x^{k+1}, y^k) − f(x^k, y^k) ≤ −δ‖x^{k+1} − x^k‖_2^2 − (1/α) ⟨ (W̃ − W) Σ_{t=0}^{k+1} x^t , x^{k+1} − x^k ⟩,        (3)
  f(x^{k+1}, y^{k+1}) − f(x^{k+1}, y^k) ≤ −(1/2)‖y^k − y^{k+1}‖_2^2,        (4)

where δ = λ_min(W̃)/α − L_x/2 > 0 is a constant.

- If the inner-product term in (3) vanishes as k → ∞, then f(x^k, y^k) decreases monotonically, which implies ‖y^k − y^{k+1}‖_2^2 + ‖x^k − x^{k+1}‖_2^2 → 0, i.e., EXTRA-AO converges to a stationary point of (2).
- In fact, under some mild assumptions, we can prove that

  ‖(W̃ − W) Σ_{t=0}^{k} x^t‖ ≤ C < ∞.        (5)

- The convergence analysis may involve the analysis of a coupled dynamical system.

Empirical Convergence on the Dictionary Learning problem

We apply DL to brodatz.png (image size 512 × 512) with M = 300, m = 256, n = 64, γ = 0.1, λ = 0.03. Network setting: N = 10 agents; the communication graph is generated as an Erdős–Rényi (ER) graph with connection probability p = 0.6. A sketch of one common way to build the mixing matrix W for such a graph is given below.
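The poster does not specify how W is formed from the ER graph. One common choice that is compatible with the requirements above (doubly stochastic, respecting the topology, and satisfying null{I − W} = span{1} whenever the realized graph is connected) is the Metropolis-Hastings weighting; the sketch below, with the hypothetical function er_metropolis_weights and a fixed seed, shows one way to generate such a W for N = 10 and p = 0.6.

```python
import numpy as np

def er_metropolis_weights(N=10, p=0.6, seed=0):
    """Draw an Erdos-Renyi graph and build a doubly stochastic mixing matrix W
    via Metropolis-Hastings weights (one standard choice, not the poster's)."""
    rng = np.random.default_rng(seed)
    # Symmetric adjacency matrix: each edge (i, j), i < j, present with prob. p
    upper = rng.random((N, N)) < p
    A = np.triu(upper, k=1)
    A = (A | A.T).astype(float)

    deg = A.sum(axis=1)
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if A[i, j] > 0:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()            # make each row sum to one
    return W

W = er_metropolis_weights()
# Sanity checks: symmetry and row sums; the eigenvalue 1 is simple
# (null{I - W} = span{1}) only if the realized graph is connected.
assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1.0)
eigvals = np.linalg.eigvalsh(W)               # ascending order
print("second-largest eigenvalue of W:", eigvals[-2])
```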
References

[1] P. Chainais and C. Richard, "Distributed dictionary learning over a sensor network," pp. 1–6, Apr. 2013. [Online]. Available: http://arxiv.org/abs/1304.3568
[2] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization," pp. 1–23, Apr. 2014. [Online]. Available: http://arxiv.org/abs/1404.6264

Acknowledgment: This work is supported by NSF grant CCF 1018111 and Taiwan Ministry of Science and Technology NSC 102-2221-E-011-005-MY3.

April 19 – 26, 2015, ICASSP 2015, Brisbane, Australia.