Fast Optimization Methods for L1 Regularization: A Comparative Study and Two New Approaches
Supplemental Material

Mark Schmidt¹, Glenn Fung², Romer Rosales²
¹ Department of Computer Science, University of British Columbia
² IKM, Siemens Medical Solutions, USA
Smooth L1-norm approximation
In order to deal with the non-differentiable penalty, we propose a smooth approximation to the L1 penalty based on the following:
(i) $|x| = (x)_+ + (-x)_+$, where the plus function is $(x)_+ = \max\{x, 0\}$;
(ii) the plus function can be approximated smoothly [2] by the integral of the sigmoid function:

$$(x)_+ \approx p(x, \alpha) = x + \frac{1}{\alpha}\log\left(1 + \exp(-\alpha x)\right) \qquad (1)$$
Combining these, we arrive at the following smooth approximation to the absolute value function, consisting of the sum of the integrals of two sigmoid functions (Fig. 1 plots this approximation near 0 for different values of $\alpha$):

$$|x| = (x)_+ + (-x)_+ \approx p(x, \alpha) + p(-x, \alpha) = \frac{1}{\alpha}\left[\log(1 + \exp(-\alpha x)) + \log(1 + \exp(\alpha x))\right] = |x|_\alpha \qquad (2)$$
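As a concrete illustration of equations (1) and (2), a minimal NumPy sketch might look like the following; the function names are our own, and `np.logaddexp` is used only as a numerically stable way to evaluate $\log(1 + \exp(\cdot))$ for large $\alpha$:

```python
import numpy as np

def plus_smooth(x, alpha):
    """Smooth plus function p(x, alpha) = x + (1/alpha)*log(1 + exp(-alpha*x)), eq. (1)."""
    # np.logaddexp(0, t) evaluates log(1 + exp(t)) without overflow for large |t|.
    return x + np.logaddexp(0.0, -alpha * x) / alpha

def abs_smooth(x, alpha):
    """Smooth absolute value |x|_alpha = p(x, alpha) + p(-x, alpha), eq. (2)."""
    return (np.logaddexp(0.0, -alpha * x) + np.logaddexp(0.0, alpha * x)) / alpha

if __name__ == "__main__":
    x = np.linspace(-0.3, 0.3, 13)
    for alpha in [10.0, 100.0, 1000.0]:
        # The gap to |x| shrinks as alpha grows; it is largest at x = 0, where it equals 2*log(2)/alpha.
        print(alpha, np.max(np.abs(abs_smooth(x, alpha) - np.abs(x))))
```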
In practice, $\alpha = 10^6$ yields results that are within some small tolerance of the results produced by (optimal) constrained optimization methods. As opposed to the L1 penalty, this approximation is amenable to standard unconstrained optimization methods since it is twice-differentiable:

$$\nabla(|x|) \approx \left(1 + \exp(-\alpha x)\right)^{-1} - \left(1 + \exp(\alpha x)\right)^{-1} \qquad (3)$$

$$\nabla^2(|x|) \approx \frac{2\alpha\exp(\alpha x)}{\left(1 + \exp(\alpha x)\right)^{2}} \qquad (4)$$
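These derivatives can be coded directly; below is a sketch (with our own function names) that uses `scipy.special.expit` for the sigmoid $(1 + \exp(-t))^{-1}$ and checks the analytic formulas against central finite differences:

```python
import numpy as np
from scipy.special import expit  # expit(t) = 1 / (1 + exp(-t))

def abs_smooth(x, alpha):
    """|x|_alpha from eq. (2)."""
    return (np.logaddexp(0.0, -alpha * x) + np.logaddexp(0.0, alpha * x)) / alpha

def abs_smooth_grad(x, alpha):
    """Eq. (3): (1 + exp(-alpha*x))^{-1} - (1 + exp(alpha*x))^{-1}."""
    return expit(alpha * x) - expit(-alpha * x)

def abs_smooth_hess(x, alpha):
    """Eq. (4): 2*alpha*exp(alpha*x)/(1 + exp(alpha*x))^2 = 2*alpha*s*(1 - s), s = expit(alpha*x)."""
    s = expit(alpha * x)
    return 2.0 * alpha * s * (1.0 - s)

if __name__ == "__main__":
    x, alpha, h = 0.05, 100.0, 1e-6
    fd_grad = (abs_smooth(x + h, alpha) - abs_smooth(x - h, alpha)) / (2 * h)
    fd_hess = (abs_smooth_grad(x + h, alpha) - abs_smooth_grad(x - h, alpha)) / (2 * h)
    print(abs_smooth_grad(x, alpha), fd_grad)  # analytic vs finite-difference gradient
    print(abs_smooth_hess(x, alpha), fd_hess)  # analytic vs finite-difference second derivative
```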
This approximation can be used in conjunction with any general likelihood or loss function. Next we show that, for optimization problems derived from learning methods with L1 regularization, the solutions of the smoothly approximated problems approach the solutions of the original problems as $\alpha$ approaches infinity. We start by proposing and proving a simple lemma, similar to one proposed in [1] for the plus function, which gives us a bound relating $|x|$ and $|x|_\alpha$.
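Before turning to that lemma, the following sketch illustrates plugging the smooth penalty into a concrete loss, here a toy L1-regularized least-squares problem minimized with `scipy.optimize.minimize`; the data, the regularization weight `lam`, the moderate value of $\alpha$, and the choice of BFGS are all our own choices for the example, not prescriptions from the paper:

```python
import numpy as np
from scipy.special import expit
from scipy.optimize import minimize

def smooth_l1(w, alpha):
    """Sum of |w_i|_alpha, i.e. the smoothed L1 penalty applied coordinate-wise (eq. (2))."""
    return np.sum(np.logaddexp(0.0, -alpha * w) + np.logaddexp(0.0, alpha * w)) / alpha

def smooth_l1_grad(w, alpha):
    """Coordinate-wise gradient of the smoothed penalty (eq. (3))."""
    return expit(alpha * w) - expit(-alpha * w)

def objective(w, A, b, lam, alpha):
    """0.5*||A w - b||^2 + lam * sum_i |w_i|_alpha  (lam is a made-up example weight)."""
    r = A @ w - b
    value = 0.5 * r @ r + lam * smooth_l1(w, alpha)
    grad = A.T @ r + lam * smooth_l1_grad(w, alpha)
    return value, grad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 10))
    w_true = np.zeros(10)
    w_true[:3] = [1.0, -2.0, 0.5]                      # sparse ground truth for the toy data
    b = A @ w_true + 0.01 * rng.standard_normal(50)
    res = minimize(objective, x0=np.zeros(10), args=(A, b, 1.0, 1e3),
                   jac=True, method="BFGS")            # moderate alpha keeps plain BFGS comfortable
    print(np.round(res.x, 3))                          # coefficients off the true support end up near zero
```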
Fig. 1. Approximation near 0 for different settings of the parameter $\alpha$ ($\alpha$ = 1, 10, 100, 1000, 10000).
Lemma 1 (Approximation bound). For any $x \in \Re$ and for any $\alpha > 0$:

$$\big|\, |x| - |x|_\alpha \,\big| \le \frac{2\log 2}{\alpha} \qquad (5)$$
Proof: Let us consider two cases. For $x > 0$,

$$p(x, \alpha) - (x)_+ = x + \frac{1}{\alpha}\log(1 + \exp(-\alpha x)) - x = \frac{1}{\alpha}\log(1 + \exp(-\alpha x)) \le \frac{\log 2}{\alpha} \qquad (6)$$
For $x \le 0$,

$$0 \le p(x, \alpha) - (x)_+ = p(x, \alpha) \le p(0, \alpha) = \frac{\log 2}{\alpha}, \qquad (7)$$

where the last inequality derives from the fact that $p$ is a monotonically increasing function. Hence, from equations (6) and (7), and because $p(x, \alpha)$ always dominates $(x)_+$, we conclude that:

$$|p(x, \alpha) - (x)_+| \le \frac{\log 2}{\alpha} \qquad (8)$$
Since $|x| = (x)_+ + (-x)_+$, using equation (1) we have:

$$\begin{aligned}
\big|\, |x| - |x|_\alpha \,\big| &= \big| p(x, \alpha) + p(-x, \alpha) - \big((x)_+ + (-x)_+\big) \big| \\
&\le |p(x, \alpha) - (x)_+| + |p(-x, \alpha) - (-x)_+| \\
&\le \frac{\log 2}{\alpha} + \frac{\log 2}{\alpha} = \frac{2\log 2}{\alpha}. \quad \Box
\end{aligned} \qquad (9)$$
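The bound in equation (5) is easy to check numerically; the short script below (our own, not part of the proof) evaluates the gap on a grid and compares it with $2\log 2/\alpha$ (the gap is largest at $x = 0$, where it equals the bound):

```python
import numpy as np

def abs_smooth(x, alpha):
    """|x|_alpha from eq. (2)."""
    return (np.logaddexp(0.0, -alpha * x) + np.logaddexp(0.0, alpha * x)) / alpha

x = np.linspace(-5.0, 5.0, 100001)          # grid containing x = 0, where the gap is largest
for alpha in [1.0, 10.0, 100.0, 1000.0]:
    gap = np.max(np.abs(abs_smooth(x, alpha) - np.abs(x)))
    bound = 2.0 * np.log(2.0) / alpha
    print(f"alpha={alpha:7.1f}   max gap={gap:.6e}   bound 2*log(2)/alpha={bound:.6e}")
```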
Let us now define $\|x\|_{(1,\alpha)}$ as a smooth approximation to the 1-norm $\|x\|_1$ of a vector $x \in \Re^n$ in the following way: $\|x\|_{(1,\alpha)} = \sum_{i=1}^{n} |x_i|_\alpha$. Then, using Lemma 1 (and the fact that $|x_i|_\alpha \ge |x_i|$ for every $i$) we have that

$$0 \le \|x\|_{(1,\alpha)} - \|x\|_1 \le \frac{2n\log 2}{\alpha} \qquad (10)$$

Hence, we can conclude that:

$$\lim_{\alpha \to \infty} \|x\|_{(1,\alpha)} = \|x\|_1 \quad \forall x \in \Re^n \qquad (11)$$
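The same kind of check extends to the vector bound (10), where the dimension $n$ enters linearly; again, a small self-contained script of our own:

```python
import numpy as np

def norm1_smooth(x, alpha):
    """||x||_(1,alpha) = sum_i |x_i|_alpha."""
    return np.sum(np.logaddexp(0.0, -alpha * x) + np.logaddexp(0.0, alpha * x)) / alpha

rng = np.random.default_rng(1)
for n in [5, 50, 500]:
    x = rng.standard_normal(n)
    for alpha in [10.0, 100.0, 1000.0]:
        gap = norm1_smooth(x, alpha) - np.linalg.norm(x, 1)
        # Eq. (10): 0 <= gap <= 2*n*log(2)/alpha, so the gap vanishes as alpha grows (eq. (11)).
        print(f"n={n:4d}  alpha={alpha:7.1f}  gap={gap:.6e}  bound={2 * n * np.log(2) / alpha:.6e}")
```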
Let $L : \Re^n \to \Re$ be any continuous loss function, and define $f(x) = L(x) + \|x\|_1$ and $f_\alpha(x) = L(x) + \|x\|_{(1,\alpha)}$. Let us also define $\bar{x} = \arg\min_x f(x)$ and $\bar{x}_\alpha = \arg\min_x f_\alpha(x)$. By the definition of $f$ and $f_\alpha$ and by equation (11), it follows that:

$$\lim_{\alpha \to \infty} f_\alpha(x) = f(x) \quad \forall x \in \Re^n \qquad (12)$$
In addition, we know that $f(\bar{x}) \le f(x)$ for all $x$; in particular, $f(\bar{x}) \le f(\bar{x}_\alpha)$. Then:

$$\begin{aligned}
f(\bar{x}) \le f(\bar{x}_\alpha) &= L(\bar{x}_\alpha) + \|\bar{x}_\alpha\|_1 \\
&= L(\bar{x}_\alpha) + \|\bar{x}_\alpha\|_1 + \|\bar{x}_\alpha\|_{(1,\alpha)} - \|\bar{x}_\alpha\|_{(1,\alpha)} \\
&= L(\bar{x}_\alpha) + \|\bar{x}_\alpha\|_{(1,\alpha)} + \|\bar{x}_\alpha\|_1 - \|\bar{x}_\alpha\|_{(1,\alpha)} \\
&= f_\alpha(\bar{x}_\alpha) + \|\bar{x}_\alpha\|_1 - \|\bar{x}_\alpha\|_{(1,\alpha)}
\end{aligned} \qquad (13)$$
This implies $f(\bar{x}) - f_\alpha(\bar{x}_\alpha) \ge -\frac{2n\log 2}{\alpha}$ (using equation (10)). Similarly, we can prove that $f(\bar{x}) - f_\alpha(\bar{x}_\alpha) \le \frac{2n\log 2}{\alpha}$; hence $\lim_{\alpha \to \infty} f_\alpha(\bar{x}_\alpha) = f(\bar{x})$.
Furthermore:

$$|f(\bar{x}_\alpha) - f(\bar{x})| = |f(\bar{x}_\alpha) - f_\alpha(\bar{x}_\alpha) + f_\alpha(\bar{x}_\alpha) - f(\bar{x})| \le |f(\bar{x}_\alpha) - f_\alpha(\bar{x}_\alpha)| + |f_\alpha(\bar{x}_\alpha) - f(\bar{x})| \qquad (14)$$

Both terms on the right-hand side vanish as $\alpha \to \infty$ (the first by equation (10), the second by the limit just established), so $\lim_{\alpha \to \infty} f(\bar{x}_\alpha) = f(\bar{x})$. Moreover, if $L$ is strictly convex, it is easy to prove that $\lim_{\alpha \to \infty} \bar{x}_\alpha = \bar{x}$.
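This last convergence can be visualized on a one-dimensional, strictly convex toy problem: with $L(x) = (x - 1)^2$, the exact minimizer of $f(x) = L(x) + |x|$ is $\bar{x} = 1/2$, and the sketch below (ours, not from the paper) shows $\bar{x}_\alpha$ approaching that value as $\alpha$ grows:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def abs_smooth(x, alpha):
    """|x|_alpha from eq. (2)."""
    return (np.logaddexp(0.0, -alpha * x) + np.logaddexp(0.0, alpha * x)) / alpha

def f_alpha(x, alpha):
    # Smoothed objective with the toy strictly convex loss L(x) = (x - 1)^2.
    return (x - 1.0) ** 2 + abs_smooth(x, alpha)

for alpha in [1.0, 10.0, 100.0, 1000.0, 10000.0]:
    res = minimize_scalar(f_alpha, args=(alpha,), bounds=(-2.0, 2.0), method="bounded")
    print(f"alpha={alpha:9.1f}   x_bar_alpha={res.x:.6f}")   # approaches the exact minimizer x_bar = 0.5
```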
References
1. Chunhui Chen and O. L. Mangasarian. A class of smoothing functions for nonlinear and mixed complementarity problems. Computational Optimization and Applications, 5(2):97–138, 1996.
2. Y.-J. Lee and O. L. Mangasarian. SSVM: A smooth support vector machine. Computational Optimization and Applications, 20:5–22, 2001.