Active set strategy for high-dimensional
non-convex sparse optimization problems
A. Boisbunon, R. Flamary, A. Rakotomamonjy
CNES/INRIA – AYIN Team
Université de Nice Sophia-Antipolis – Lab. Lagrange – OCA – CNRS
Université de Rouen – Lab. LITIS
May 7, 2014
Motivation: high-dimensional linear estimation
    min_{x ∈ R^p}  { f(x) = l(x) + r(x) }        (1)

with l(x) the data term (differentiable) and r(x) the regularization term (nonconvex, nondifferentiable).
Objective
- Estimate a high-dimensional sparse model x ∈ R^p.
- Go beyond the Lasso (biased, not always consistent [8, 2]).
- Regularization term is a DC function: r(x) = Σ_{i=1}^p h(|x_i|) (see the sketch below).
- Use sparsity for efficient optimization.
- Build on top of existing efficient algorithms [6, 4].
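As a concrete instance of problem (1): a minimal Python sketch of the objective with a least-squares data term and the log-sum penalty used later in the experiments (function and variable names are illustrative, not from the paper).

```python
import numpy as np

def objective(x, A, y, lam=1.0, theta=0.1):
    """Evaluate f(x) = l(x) + r(x) from Eq. (1) for a regularized
    least-squares data term and the log-sum penalty."""
    # Differentiable data term: l(x) = 1/2 ||y - Ax||^2
    residual = y - A @ x
    data_term = 0.5 * residual @ residual
    # Nonconvex, nondifferentiable regularizer: r(x) = lam * sum_i log(1 + |x_i|/theta)
    reg_term = lam * np.sum(np.log1p(np.abs(x) / theta))
    return data_term + reg_term
```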
Nonconvex sparse optimization in the literature
Difference of Convex Algorithm (DCA) [1, 2, 3]
- Iteratively solves reweighted ℓ1-penalized problems.
- Slow, but converges in a few reweighting iterations.
Sequential Convex Programming (SCP) [6]
- Uses a majorization of the nonconvex penalty.
- Also handles constrained optimization.
General Iterative Shrinkage and Thresholding (GIST) [4]
- Extension of proximal methods to nonconvex regularization.
- Descent step estimated via the Barzilai-Borwein (BB) rule (see below).
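For reference, one common form of the BB step-size estimate (this particular variant is my illustration; GIST [4] combines such an estimate with a line search):

    η_t = (s_t^⊤ s_t) / (s_t^⊤ q_t),   where s_t = x^t − x^{t−1} and q_t = ∇l(x^t) − ∇l(x^{t−1}).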
Limits of these approaches
- They solve the full optimization problem.
- Full gradient computation is expensive.
→ Use an active set to focus on a small number of variables.
Active set strategy
Principle
- Work on a subset of variables ϕ and solve the problem on this subset.
- Optimality conditions are used to update the active set.
- Widely used in convex optimization.
- For sparse optimization: initialization ϕ = ∅.
Nonconvex optimality conditions
- The regularization term is expressed as a DC function (see the worked example below):
      r(x) = r1(x) − r2(x), with r1 and r2 two convex functions of the form
      r1(x) = Σ_i g1(|x_i|),   r2(x) = Σ_i g2(|x_i|)        (2)
- If x* is a stationary point of the optimization problem, then
      ∂r2(x*) ⊂ ∇l(x*) + ∂r1(x*)        (3)
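As a worked instance of decomposition (2) (a standard split, reconstructed here rather than quoted from the slides): for the log-sum penalty h(u) = λ log(1 + u/θ) with u ≥ 0, one may take

    g1(u) = (λ/θ) u,    g2(u) = (λ/θ) u − λ log(1 + u/θ),

both convex on u ≥ 0, with g1'(0) = λ/θ and g2'(0) = 0. This gives the condition |∇l(x*)_i| ≤ λ/θ at x*_i = 0 used on the next slide.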
Optimality conditions in practice
Optimality conditions
- r(x) = Σ_{i=1}^p h(|x_i|) = Σ_{i=1}^p { g1(|x_i|) − g2(|x_i|) }
- The optimality condition is component-wise (code sketch below).
- When g2'(0) = 0, the condition becomes |∇l(x)_i| ≤ g1'(0) if x_i = 0.
- When g2 = g1 − h, the condition becomes |∇l(x)_i| ≤ h'(0) if x_i = 0.

[Figure: regularization functions h(|u|) as a function of |u| for ℓ1, capped ℓ1, log penalty, and ℓp with p = 1/2.]

Examples:
- ℓ1:         h(u) = λ u             ⇒  |∇l(x)_i| ≤ λ    if x_i = 0
- Capped-ℓ1:  h(u) = λ min(u, θ)     ⇒  |∇l(x)_i| ≤ λ    if x_i = 0
- Log sum:    h(u) = λ log(1 + u/θ)  ⇒  |∇l(x)_i| ≤ λ/θ  if x_i = 0
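A minimal sketch (my own; names are illustrative) of how this component-wise test flags zero variables that violate optimality, for the log-sum penalty where h'(0) = λ/θ:

```python
import numpy as np

def violating_zero_variables(grad_l, x, lam, theta, eps=0.0):
    """Return indices i with x_i = 0 violating |grad_l[i]| <= h'(0) + eps,
    where h(u) = lam * log(1 + u/theta), so h'(0) = lam/theta.
    (The tolerance eps anticipates the threshold used on the next slide.)"""
    zero_mask = (x == 0)
    violation = np.abs(grad_l) > lam / theta + eps
    return np.where(zero_mask & violation)[0]
```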
Active set algorithm
Algorithm for log-sum regularization (sketched in Python below)
Inputs: initial active set ϕ = ∅
1: repeat
2:     x ← solve Problem (1) restricted to the current active set ϕ (using GIST)
3:     compute r ← |∇l(x)|
4:     for k = 1, ..., ks do
5:         j ← arg max_{i ∈ ϕ̄} r_i
6:         if r_j > h'(0) + ε then ϕ ← ϕ ∪ {j}
7:     end for
8: until stopping criterion is met
Discussion
- Only small problems are solved (of dimension |ϕ|).
- The subproblem solver is warm-started between outer iterations.
- At each outer iteration, up to ks variables are added to the active set.
- Step 3 can be computed in parallel.
- ε > 0 is typically small and acts as a threshold, similarly to OMP.
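The outer loop above can be sketched as follows for a least-squares loss and the log-sum penalty (my reconstruction; `inner_solver` is a hypothetical stand-in for a GIST solver restricted to the active set, not the authors' code):

```python
import numpy as np

def active_set_loop(A, y, lam, theta, inner_solver, ks=10, eps=0.1, max_outer=100):
    """Sketch of the active set strategy: solve on the active set, then add
    up to ks variables violating the condition |grad_i| <= h'(0) + eps."""
    p = A.shape[1]
    active = np.zeros(p, dtype=bool)        # phi = empty set
    x = np.zeros(p)
    for _ in range(max_outer):
        if active.any():
            # Step 2: solve Problem (1) on the active set, warm-started at x[active]
            x[active] = inner_solver(A[:, active], y, lam, theta, x0=x[active])
        grad = A.T @ (A @ x - y)            # Step 3: gradient of the least-squares loss
        r = np.abs(grad)
        r[active] = -np.inf                 # only consider variables outside phi
        added = False
        for _ in range(ks):                 # Steps 4-7: add up to ks violating variables
            j = int(np.argmax(r))
            if r[j] > lam / theta + eps:    # h'(0) = lam/theta for the log-sum penalty
                active[j] = True
                r[j] = -np.inf
                added = True
            else:
                break
        if not added:                       # Step 8: stop when no variable violates optimality
            break
    return x
```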
Numerical experiments
Datasets
- Simulated dataset: p ∈ [10^2, 10^7], SNR = 30 dB, n = 100, t = 10 (generation sketched below).
- Dorothea dataset [5]: p = 10^5, n = 1150.
- URL Reputation dataset [7]: p = 3.2 × 10^6, n = 20 000, sparse.
Compared methods
- DC algorithm with reweighted ℓ1 (DC-Lasso) [2, 3].
- General Iterative Shrinkage and Thresholding (GIST) [4].
- Proposed active set approach on top of GIST (AS-GIST).
Performance measures
- CPU time used by the algorithm.
- Final objective value.
- Both measures averaged over 10 splits/generations of the data.
Parameters
- Regularized least-squares loss.
- Log-sum penalty with θ = 0.1.
- ks = 10 and ε = 0.1.
- Computed in Octave.
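The slides do not detail the generation protocol for the simulated data; the sketch below shows one standard way to produce such data (Gaussian design, t-sparse ground truth, Gaussian noise scaled to a 30 dB SNR). Everything beyond the values of p, n, t, and the SNR is my assumption.

```python
import numpy as np

def simulate_data(p=10_000, n=100, t=10, snr_db=30.0, seed=0):
    """Sparse regression benchmark (assumed protocol): Gaussian design A,
    t-sparse ground truth x_true, Gaussian noise scaled to the requested SNR."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, p))
    x_true = np.zeros(p)
    support = rng.choice(p, size=t, replace=False)
    x_true[support] = rng.standard_normal(t)
    signal = A @ x_true
    noise = rng.standard_normal(n)
    # Rescale noise so that 20*log10(||signal|| / ||noise||) = snr_db
    noise *= np.linalg.norm(signal) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
    return A, signal + noise, x_true
```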
Simulated dataset
[Figure: CPU time (s) and final objective value versus dimension p on the generated dataset, for DC-Lasso, GIST, and AS-GIST (log-log scale).]
Results
- Standard deviations shown as dashed lines.
- DC-Lasso is outperformed by GIST and AS-GIST.
- GIST and AS-GIST reach statistically equivalent objective values, both better than DC-Lasso.
- AS-GIST is up to 20× faster than GIST and more than 100× faster than DC-Lasso.
Dorothea dataset
[Figure: CPU time (s) and final objective value along the regularization path (as functions of λ) on the Dorothea dataset, for GIST and AS-GIST (log-log scale).]
Results
- Performance measures reported along the regularization path.
- DC-Lasso not included because of its computational cost.
- AS-GIST is more efficient for sparse solutions (large λ).
- AS-GIST reaches a better objective value for small λ.
URL Reputation dataset
[Figure: CPU time (s) and final objective value as functions of λ on the URL Reputation dataset, for GIST and AS-GIST (log-log scale).]
Results
- Very high dimension: p = 3.2 × 10^6.
- Substantial computational gain with AS-GIST.
- Substantial gain in objective value for small values of the regularization parameter λ.
Conclusion
Active set strategy
- When the solution is sparse, use an active set even for nonconvex problems.
- More time is spent optimizing the variables that matter.
- Applicable to a wide class of regularization terms.
- Works with any convex differentiable loss (least squares, logistic regression).
- Simple algorithm; code will be made available.
Working on
- More general optimality conditions (Clarke subdifferential).
- Proof of convergence to a stationary point.
- Study of the regularization effect of the zero initialization and of the choice of ε.
- Applications to large-scale datasets/problems.
References I
[1] L.T.H. An and P.D. Tao.
The DC (difference of convex functions) programming and DCA revisited with DC
models of real world nonconvex optimization problems.
Annals of Operations Research, 133(1):23–46, 2005.
[2] E.J. Candes, M.B. Wakin, and S.P. Boyd.
Enhancing sparsity by reweighted ℓ1 minimization.
Journal of Fourier Analysis and Applications, 14(5-6):877–905, 2008.
[3] G. Gasso, A. Rakotomamonjy, and S. Canu.
Recovering sparse signals with a certain family of nonconvex penalties and DC
programming.
IEEE Transactions on Signal Processing, 57(12):4686–4698, 2009.
[4] P. Gong, C. Zhang, Z. Lu, J. Huang, and J. Ye.
A general iterative shrinkage and thresholding algorithm for non-convex
regularized optimization problems.
In Proceedings of the 30th International Conference on Machine Learning (ICML),
volume 28, pages 37–45, 2013.
References II
[5] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror.
Result analysis of the NIPS 2003 feature selection challenge.
In Advances in Neural Information Processing Systems, pages 545–552, 2004.
[6] Z. Lu.
Sequential convex programming methods for a class of structured nonlinear
programming.
arXiv preprint arXiv:1210.3039, 2012.
[7] J. Ma, L.K. Saul, S. Savage, and G.M. Voelker.
Identifying suspicious URLs: an application of large-scale online learning.
In Proceedings of the 26th Annual International Conference on Machine Learning
(ICML), pages 681–688, 2009.
[8] H. Zou.
The adaptive lasso and its oracle properties.
Journal of the American Statistical Association, 101(476):1418–1429, 2006.
Examples of optimization problems
    min_{x ∈ R^p}  { f(x) = l(x) + r(x) }

Data-fitting term
- Least squares: l(x) = ½ Σ_k (y_k − a_k^⊤ x)^2 = ½ ‖y − Ax‖^2
- Logistic regression: l(x) = Σ_k log(1 + exp(−y_k a_k^⊤ x))
- SVM Rank: l(x) = Σ_k max(0, 1 − a_k^⊤ x)^2
Gradient of the form ∇l(x) = A^⊤ e(x) (see the sketch at the end of this slide).
Regularization term
- Lasso (ℓ1): r(x) = λ ‖x‖_1
- Capped-ℓ1: r(x) = λ Σ_i min(|x_i|, θ)
- Log sum: r(x) = λ Σ_i log(1 + |x_i|/θ)
- ℓp pseudo-norm: r(x) = λ Σ_i |x_i|^p
Regularizer of the form r(x) = Σ_{i=1}^p h(|x_i|)
[Figure: regularization functions h(|u|) as a function of |u| for ℓ1, capped ℓ1, log penalty, and ℓp with p = 1/2.]
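A minimal sketch of the common gradient form ∇l(x) = A^⊤ e(x) for the first two losses above (my own helpers, assuming a data matrix A with rows a_k^⊤ and targets/labels y):

```python
import numpy as np

def grad_least_squares(x, A, y):
    """Least squares: e(x) = Ax - y, so grad l(x) = A.T @ (Ax - y)."""
    return A.T @ (A @ x - y)

def grad_logistic(x, A, y):
    """Logistic loss l(x) = sum_k log(1 + exp(-y_k * a_k.x)):
    e_k(x) = -y_k / (1 + exp(y_k * a_k.x)), so grad l(x) = A.T @ e(x)."""
    e = -y / (1.0 + np.exp(y * (A @ x)))
    return A.T @ e
```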