Active set strategy for high-dimensional non-convex sparse optimization problems

A. Boisbunon, R. Flamary, A. Rakotomamonjy
CNES/INRIA – AYIN Team
Université de Nice Sophia-Antipolis – Lab. Lagrange – OCA – CNRS
Université de Rouen – Lab. LITIS
May 7, 2014


Motivation: high-dimensional linear estimation

    min_{x ∈ ℝ^p}  f(x) = l(x) + r(x)                                (1)

where l is the data term (differentiable) and r the regularization term (nonconvex, nondifferentiable).

Objective
- Estimate a high-dimensional sparse model x ∈ ℝ^p.
- Go beyond the Lasso (biased, not always consistent [8, 2]).
- Regularization term is a DC function: r(x) = ∑_{i=1}^p h(|x_i|).
- Use sparsity for efficient optimization.
- Build on top of existing efficient algorithms [6, 4].


Nonconvex sparse optimization in the literature

Difference of Convex Algorithm (DCA) [1, 2, 3]
- Iteratively solves weighted ℓ1-penalized problems.
- Slow, but converges in few reweighting operations.

Sequential Convex Programming (SCP) [6]
- Uses a majorization of the nonconvex penalty.
- Also handles constrained optimization.

General Iterative Shrinkage and Thresholding (GIST) [4]
- Extension of proximal methods to nonconvex regularization.
- Descent step estimated via the BB rule (Barzilai & Borwein).

Limits of these approaches
- They solve the full optimization problem.
- Computing the full gradient is expensive.
→ Use an active set to focus on a small number of variables.


Active set strategy

Principle
- Work on a subset ϕ of the variables and solve the problem restricted to this subset.
- Optimality conditions are used to update the active set.
- Widely used in convex optimization.
- Sparse optimization: initialization ϕ = ∅.

Nonconvex optimality conditions
- The regularization term is expressed as a DC function:

      r(x) = r1(x) − r2(x),  with r1(x) = ∑_i g1(|x_i|), r2(x) = ∑_i g2(|x_i|)   (2)

  where g1 and g2 are two convex functions.
- If x* is a stationary point of the optimization problem, then

      ∂r2(x*) ⊂ ∇l(x*) + ∂r1(x*)                                     (3)


Optimality conditions in practice

Optimality conditions
- r(x) = ∑_{i=1}^p h(|x_i|) = ∑_{i=1}^p {g1(|x_i|) − g2(|x_i|)}.
- The optimality condition is component-wise.
- When g2′(0) = 0, the optimality condition becomes |∇l(x)_i| ≤ g1′(0) if x_i = 0.
- When g2 = g1 − h, the optimality condition becomes |∇l(x)_i| ≤ h′(0) if x_i = 0.

[Figure: regularization functions h(|u|): ℓ1, capped-ℓ1, log penalty, ℓp with p = 1/2.]

Examples
- ℓ1: h(u) = λu ⇒ |∇l(x)_i| ≤ λ if x_i = 0.
- Capped-ℓ1: h(u) = λ min(u, θ) ⇒ |∇l(x)_i| ≤ λ if x_i = 0.
- Log sum: h(u) = λ log(1 + u/θ) ⇒ |∇l(x)_i| ≤ λ/θ if x_i = 0.


Active set algorithm

Algorithm (for log-sum regularization)
Inputs: initial active set ϕ = ∅
1: repeat
2:   x ← solve Problem (1) restricted to the current active set ϕ (using GIST)
3:   compute r ← |∇l(x)|
4:   for k = 1, …, ks do
5:     j ← arg max_{i ∈ ϕ̄} r_i
6:     if r_j > h′(0) + ε then ϕ ← {j} ∪ ϕ
7:   end for
8: until stopping criterion is met

Discussion (see the code sketch below)
- Only small problems are solved (of dimension |ϕ|).
- A warm-starting trick is used for the inner solver.
- At each iteration, up to ks variables are added to the active set.
- Step 3 can be computed in parallel.
- ε > 0 is typically small and acts as a threshold, similarly to OMP.
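To make the loop concrete, here is a minimal NumPy sketch of the active-set strategy for a least-squares data term with the log-sum penalty (so h′(0) = λ/θ). It illustrates the pseudocode above and is not the authors' implementation: in particular, `inner_solve` is a hypothetical placeholder standing in for GIST restricted to the active columns.

```python
import numpy as np

def active_set(A, y, lam, theta, inner_solve, ks=10, eps=0.1, max_outer=100):
    """Active-set loop for min_x 0.5*||y - Ax||^2 + lam*sum_i log(1 + |x_i|/theta).

    `inner_solve(A_sub, y, x0)` is any solver for the problem restricted to
    the active columns (e.g. GIST); it is a hypothetical placeholder here.
    """
    p = A.shape[1]
    active = np.zeros(p, dtype=bool)        # phi starts empty
    x = np.zeros(p)
    h_prime_0 = lam / theta                 # h'(0) for the log-sum penalty
    for _ in range(max_outer):
        if active.any():
            # Step 2: solve on the active set, warm-started at the current x
            x[active] = inner_solve(A[:, active], y, x[active])
        # Step 3: component-wise gradient magnitudes of the data term
        r = np.abs(A.T @ (A @ x - y))
        r[active] = -np.inf                 # only inspect variables outside phi
        # Steps 4-7: add at most ks violators of the optimality condition
        violators = [j for j in np.argsort(r)[-ks:] if r[j] > h_prime_0 + eps]
        if not violators:                   # no violation left: stop
            break
        active[violators] = True
    return x
```

Note that warm-starting comes for free in this form: each restricted solve is initialized at the current values x[active].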
Numerical experiments

Datasets
- Simulated dataset: p ∈ [10², 10⁷], SNR = 30 dB, n = 100, t = 10.
- Dorothea dataset [5]: p = 10⁵, n = 1150.
- URL Reputation dataset [7]: p = 3.2 × 10⁶, n = 20 000, sparse.

Compared methods
- DC algorithm, reweighted-ℓ1 (DC-Lasso) [2, 3].
- General Iterative Shrinkage and Thresholding (GIST) [4].
- Proposed active set approach with GIST (AS-GIST).

Performance measures
- CPU time used by the algorithm.
- Final objective value.
- Both measures averaged over 10 splits/generations of the data.

Parameters
- Regularized least-squares data term.
- Log-sum penalty with θ = 0.1 (the corresponding proximal step is sketched below).
- ks = 10 and ε = 0.1.
- Computed in Octave.
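To illustrate what the inner GIST iterations involve for this penalty, here is a hedged sketch: the proximal operator of the log-sum penalty admits a closed form (the minimizer is either 0 or the largest non-negative root of a quadratic, following the analysis in [4]), while a fixed-step proximal-gradient loop stands in for GIST's BB-rule step. The function names and the constant step size are our simplifications, not the authors' code.

```python
import numpy as np

def prox_log_sum(v, c, theta):
    """Elementwise proximal operator of u -> c*log(1 + |u|/theta):
    argmin_u 0.5*(u - v)^2 + c*log(1 + |u|/theta).

    Stationarity gives u^2 + (theta - |v|)*u + (c - |v|*theta) = 0; the
    minimizer is either 0 or the largest non-negative root, whichever
    achieves the lower objective value.
    """
    w = np.abs(v)
    disc = (w + theta) ** 2 - 4.0 * c            # discriminant of the quadratic
    root = np.where(disc >= 0,
                    ((w - theta) + np.sqrt(np.maximum(disc, 0.0))) / 2.0,
                    0.0)
    root = np.maximum(root, 0.0)
    f_zero = 0.5 * w ** 2                        # objective at u = 0
    f_root = 0.5 * (root - w) ** 2 + c * np.log1p(root / theta)
    return np.sign(v) * np.where(f_root < f_zero, root, 0.0)

def proximal_gradient(A, y, x0, lam, theta, n_iter=200):
    """Fixed-step proximal gradient as a simple stand-in for GIST (GIST
    instead picks the step with the Barzilai-Borwein rule plus a line
    search; a constant 1/L step keeps this sketch short)."""
    L = np.linalg.norm(A, 2) ** 2                # Lipschitz constant of grad l
    t = 1.0 / L
    x = x0.copy()
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                 # gradient of 0.5*||y - Ax||^2
        x = prox_log_sum(x - t * grad, t * lam, theta)
    return x
```

This stand-in can be plugged into the active-set sketch above, e.g. with inner_solve = lambda A_sub, y, x0: proximal_gradient(A_sub, y, x0, lam, theta).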
Simulated dataset

[Figure: CPU time (s) and final objective value vs. p for DC-Lasso, GIST and AS-GIST on the generated dataset.]

Results
- Standard deviations shown as dashed lines.
- DC-Lasso is outperformed by GIST and AS-GIST.
- GIST and AS-GIST are statistically equivalent, and both better than DC-Lasso.
- AS-GIST is up to 20× faster than GIST and more than 100× faster than DC-Lasso.


Dorothea dataset

[Figure: CPU time (s) and final objective value along the regularization path (λ) for GIST and AS-GIST on Dorothea.]

Results
- Performance measures along the regularization path.
- DC-Lasso not computed due to its computational time.
- AS-GIST is more efficient for sparse solutions (large λ).
- AS-GIST reaches a better objective value for small λ.


URL Reputation dataset

[Figure: CPU time (s) and final objective value along the regularization path (λ) for GIST and AS-GIST on the URL dataset.]

Results
- Very high dimension: p = 3.2 × 10⁶.
- Important computational gain with AS-GIST.
- Important gain in objective value for small values of the λ parameter.


Conclusion

Active set strategy
- When the solution is sparse, use an active set even for nonconvex problems.
- Spends more time optimizing the variables that count.
- Applicable to a wide class of regularization terms.
- Works with any convex differentiable loss (least-squares, logistic regression).
- Simple algorithm; the code will be made available.

Work in progress
- More general optimality conditions (Clarke subdifferential).
- Proof of convergence to a stationary point.
- Study of the regularization effect of initializing at 0 and of the choice of ε.
- Applications to large-scale datasets/problems.


References I

[1] L.T.H. An and P.D. Tao. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1):23–46, 2005.
[2] E.J. Candès, M.B. Wakin, and S.P. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5-6):877–905, 2008.
[3] G. Gasso, A. Rakotomamonjy, and S. Canu. Recovering sparse signals with a certain family of nonconvex penalties and DC programming. IEEE Transactions on Signal Processing, 57(12):4686–4698, 2009.
[4] P. Gong, C. Zhang, Z. Lu, J. Huang, and J. Ye. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In Proceedings of the 30th International Conference on Machine Learning (ICML), volume 28, pages 37–45, 2013.

References II

[5] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545–552, 2004.
[6] Z. Lu. Sequential convex programming methods for a class of structured nonlinear programming. arXiv preprint arXiv:1210.3039, 2012.
[7] J. Ma, L.K. Saul, S. Savage, and G.M. Voelker. Identifying suspicious URLs: an application of large-scale online learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 681–688, 2009.
[8] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.


Examples of optimization problems

    min_{x ∈ ℝ^p}  f(x) = l(x) + r(x)

Data-fitting term
- Least-squares: l(x) = ½ ∑_k (y_k − a_kᵀx)² = ½ ‖y − Ax‖².
- Logistic regression: l(x) = ∑_k log(1 + exp(−y_k a_kᵀx)).
- SVM Rank: l(x) = ∑_k max(0, 1 − a_kᵀx)².
Gradient of the form ∇l(x) = Aᵀ e(x) (see the sketch below).

Regularization term
- Lasso (ℓ1): r(x) = λ‖x‖₁.
- Capped-ℓ1: r(x) = λ ∑_i min(|x_i|, θ).
- Log sum: r(x) = λ ∑_i log(1 + |x_i|/θ).
- ℓp pseudo-norm: r(x) = λ ∑_i |x_i|^p.
Regularizer of the form r(x) = ∑_{i=1}^p h(|x_i|).

[Figure: regularization functions h(|u|): ℓ1, capped-ℓ1, log penalty, ℓp with p = 1/2.]
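To make the common gradient form ∇l(x) = Aᵀe(x) concrete, this short sketch (with function names of our choosing) spells out e(x) for the two losses singled out in the conclusion, least-squares and logistic regression.

```python
import numpy as np

def grad_least_squares(A, y, x):
    """l(x) = 0.5 * ||y - Ax||^2  =>  e(x) = Ax - y."""
    return A.T @ (A @ x - y)

def grad_logistic(A, y, x):
    """l(x) = sum_k log(1 + exp(-y_k * a_k^T x)), labels y_k in {-1, +1}.
    Here e_k(x) = -y_k * sigmoid(-y_k * a_k^T x)."""
    m = -y * (A @ x)                 # margins, one per sample
    e = -y / (1.0 + np.exp(-m))      # = -y * sigmoid(m)
    return A.T @ e
```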