Overview of Sparse Penalized Regression

Yongdai Kim
Seoul National University
1. Introduction
Variable selection is an important issue in high-dimensional statistical modeling. Traditionally, stepwise subset selection procedures have been widely adopted to select an appropriate number of predictive variables. However, such procedures have drawbacks such as intensive computation, difficulties in obtaining sampling properties, and instability; see Breiman (1996) for further details.
Methods for overcoming these problems have been developed via sparse penalized approaches such as the bridge
regression (Frank and Friedman, 1993), the least absolute shrinkage and selection operator (LASSO) (Tibshirani,
1996), and the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001). All these methods have
common advantages over subset selection procedures: they are computationally simpler, the resulting sparse estimators are more stable, and they often achieve higher prediction accuracy.
In this paper, we review various penalized approaches. In Section 2, various penalties are introduced, and the
corresponding optimization algorithms are explained in Section 3. Asymptotic properties of penalized estimators are
discussed in Section 4.
2. Various penalties
Let (x1, y1), . . . , (xn, yn) be n covariate-response pairs, where xi ∈ X and yi ∈ Y, and X is a subset of R^p. We consider estimating the regression coefficient β by minimizing the penalized empirical risk (PER)

Cλ(β) = ∑_{i=1}^n l(yi, xi′β) + Jλ(β),

where l is a given loss function, Jλ is a penalty, and λ is a regularization parameter controlling the size of the penalty. Examples of the loss function are the squared error loss l(y, a) = (y − a)², which corresponds to standard linear regression, and the logistic loss l(y, a) = log(1 + exp(a)) − ya, which corresponds to logistic regression.
Examples of the penalty functions are:

• Ridge (Hoerl and Kennard, 1970): Jλ(β) = λ ∑_{k=1}^p βk²;
• Bridge (Frank and Friedman, 1993): Jλ(β) = λ ∑_{k=1}^p |βk|^γ, γ > 0;
• Lasso (Tibshirani, 1996): Jλ(β) = λ ∑_{k=1}^p |βk|;
• SCAD (Fan and Li, 2001): Jλ(β) = ∑_{k=1}^p Jλ(|βk|), where

Jλ(|β|) = λ|β| I(0 ≤ |β| < λ) + { [aλ(|β| − λ) − (|β|² − λ²)/2]/(a − 1) + λ² } I(λ ≤ |β| ≤ aλ) + { (a − 1)λ²/2 + λ² } I(|β| > aλ).
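For concreteness, the piecewise SCAD formula above can be evaluated numerically. The following is a minimal sketch (not from the original paper); the default a = 3.7 follows the suggestion of Fan and Li (2001), and the three branches mirror the three indicator terms.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """Evaluate the SCAD penalty J_lam(|beta|) elementwise."""
    b = np.abs(beta)
    # middle branch: quadratic spline between lam and a*lam
    quad = (a * lam * (b - lam) - (b**2 - lam**2) / 2) / (a - 1) + lam**2
    # flat branch beyond a*lam: (a - 1) * lam^2 / 2 + lam^2 = (a + 1) * lam^2 / 2
    flat = (a - 1) * lam**2 / 2 + lam**2
    return np.where(b < lam, lam * b, np.where(b <= a * lam, quad, flat))
```

A quick check that the three pieces join continuously at λ and aλ confirms the formula is a single continuous function of |β|.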
This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (No. R012008-000-20538-0).
Proceedings for the Spring Conference, 2011, The Korean Statistical Society
• Elastic net (Zou and Hastie, 2005): Jλ(β) = λ1 ∑_{k=1}^p |βk| + λ2 ∑_{k=1}^p βk²;
• MCP (Zhang, 2010): Jλ(β) = ∑_{k=1}^p Jλ(|βk|), where

Jλ(β) = [λβ − β²/(2a)] I(0 ≤ β < aλ) + (aλ²/2) I(β ≥ aλ), a > 0.
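As a numerical check of the MCP formula above, here is a minimal sketch (the choice a = 3.0 is illustrative, not prescribed by the source):

```python
import numpy as np

def mcp_penalty(beta, lam, a=3.0):
    """Evaluate the MCP penalty J_lam(|beta|) elementwise."""
    b = np.abs(beta)
    # quadratic piece up to a*lam, then constant a*lam^2/2 (no further shrinkage)
    return np.where(b < a * lam, lam * b - b**2 / (2 * a), a * lam**2 / 2)
```

The constant tail beyond aλ is what makes large coefficients unpenalized at the margin, which is the source of the near-unbiasedness discussed below.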
Some remarks on these penalties are in order.
• The ridge penalty is differentiable while the other penalties are not. It is well known that penalties that are non-differentiable at 0 give sparse solutions (i.e., some coefficients of the minimizer of the penalized empirical risk are exactly zero), and hence can perform regularization and variable selection simultaneously. The bridge penalty, which includes the ridge and the Lasso (Least Absolute Shrinkage and Selection Operator) as special cases, gives a sparse solution only when γ ≤ 1.
• The bridge with γ < 1, the SCAD (Smoothly Clipped Absolute Deviation) penalty, and the MCP (Minimax Concave Penalty) are nonconvex; they are devised to give sparse and nearly unbiased solutions. Note that convex penalties cannot give unbiased solutions.
• The elastic net penalty is devised to handle highly correlated covariates better than other sparse penalties such as the Lasso. The Lasso penalty tends to select only one of a group of highly correlated covariates, while the elastic net tends to include all of them.
• Fan and Li (2001) discussed three desirable properties of penalized estimators: unbiasedness, sparsity, and continuity. Unbiasedness means that the estimator is nearly unbiased when the true coefficient is large; sparsity means that some coefficients are estimated as exactly 0; and continuity means that the solution changes continuously in λ. Continuity implies stability of the solution, which is indispensable for accurate prediction.
3. Optimization algorithms
Except for the ridge penalty, the penalties above are non-differentiable at 0, and so standard optimization algorithms such as gradient descent and the Newton-Raphson algorithm are not directly applicable. In this section, we briefly review optimization algorithms for sparse penalties.
3.1. Algorithms for Lasso
For the squared error loss, Osborne et al. (2000) and Rosset and Zhu (2007) developed an algorithm to find the solution path. Here, the solution path is the set of solutions {β̂(λ), λ ≥ 0}, where β̂(λ) is the minimizer of Cλ(β). The algorithm starts from the fact that β̂(λ) is piecewise linear. That is, there is a finite increasing sequence 0 = λ0 < λ1 < · · · < λm < ∞ (called the knots) such that

β̂(λ) = [(λk − λ)/(λk − λk−1)] β̂(λk−1) + [(λ − λk−1)/(λk − λk−1)] β̂(λk)

for λ ∈ [λk−1, λk). They then developed an algorithm to find the knots, which are the points where the Karush-Kuhn-Tucker (KKT) conditions are violated. For the KKT conditions of the Lasso, see Rosset and Zhu (2007).
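Piecewise linearity means the entire path is determined by the solutions at the knots. A small sketch of evaluating the path at an arbitrary λ from stored knots (the inputs here are hypothetical, standing in for the output of a path algorithm):

```python
import numpy as np

def interpolate_path(lam, knots, betas):
    """Evaluate the piecewise-linear Lasso path at lam.

    knots : increasing array (lambda_0, ..., lambda_m), with lambda_0 = 0
    betas : array of shape (m+1, p); betas[k] is the solution at knots[k]
    """
    knots = np.asarray(knots, dtype=float)
    betas = np.asarray(betas, dtype=float)
    if lam >= knots[-1]:
        return betas[-1]
    k = np.searchsorted(knots, lam, side="right")  # knots[k-1] <= lam < knots[k]
    w = (knots[k] - lam) / (knots[k] - knots[k - 1])
    return w * betas[k - 1] + (1 - w) * betas[k]
```

The weight w is exactly the coefficient of β̂(λk−1) in the displayed interpolation formula.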
Efron et al. (2004) developed an algorithm called LARS (Least Angle Regression), which is motivated by the forward stagewise selection method proposed initially by Friedman (2001) in the context of the boosting algorithm.
The main idea of the LARS algorithm is to add variables into the model gradually. Let β̂ be the current solution and let ri = yi − xi′β̂. Find the covariate zk = (x1k, . . . , xnk)′ that maximizes the correlation between zk and r = (r1, . . . , rn)′. Then, update β̂ by β̂k = β̂k + ϵ and β̂j = β̂j for j ≠ k, where ϵ > 0 is a very small positive number. Efron et al. (2004) proved that the solution of the LARS algorithm with a slight modification (the delete step) converges to the Lasso solution as ϵ → 0. In fact, the main contribution of the LARS is not a new optimization algorithm but an explanation of how the Lasso solution behaves. The equivalence of the LARS and Lasso solutions implies that the Lasso solution is similar to that of a prudent stepwise model selection, where "prudent" means that only a very small amount of a covariate is added at each selection step.
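The ϵ-update described above can be sketched as follows. This is a minimal forward-stagewise illustration, not the full LARS algorithm: the delete step and any stopping rule are omitted, and the step count is an illustrative choice.

```python
import numpy as np

def stagewise(X, y, eps=0.01, n_steps=300):
    """Epsilon-forward-stagewise sketch of the LARS idea: repeatedly
    nudge the coefficient of the covariate most correlated with the
    current residual by a small amount eps."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_steps):
        r = y - X @ beta                       # current residual
        k = np.argmax(np.abs(X.T @ r))         # most correlated covariate
        beta[k] += eps * np.sign(X[:, k] @ r)  # prudent step toward the fit
    return beta
```

With enough small steps, the iterate hovers around the least squares fit along the selected coordinates, illustrating why the path resembles a cautious stepwise selection.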
For general convex loss functions, there is no algorithm that gives the exact solution path, since the path is no longer piecewise linear. Kim et al. (2008a) developed optimization algorithms to find the solution for a fixed λ. Park and Hastie (2007) mimicked a solution path algorithm to give an approximate solution path for generalized linear models.
Recently, the shooting algorithm was proposed by Friedman et al. (2007). Its main idea is to solve one-dimensional optimization problems iteratively: we minimize Cλ(β) in βk with βj, j ≠ k, held fixed, cycle through k = 1, . . . , p, and keep iterating until convergence. The univariate Lasso optimization problem has a closed form solution, and hence the shooting algorithm is very fast because it requires no matrix inversion, even though it does not provide a solution path.
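The coordinate-wise updates can be sketched as follows. This is a minimal illustration for the squared error loss with a 1/(2n) scaling (an illustrative convention, matching the display in Section 3.3), using the closed-form soft-thresholding solution of the univariate Lasso problem:

```python
import numpy as np

def soft_threshold(z, t):
    """Closed-form solution of min_b (1/2)(b - z)^2 + t|b|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def shooting_lasso(X, y, lam, n_iter=100):
    """Coordinate descent ("shooting") for
    (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X**2).sum(axis=0) / n            # (1/n) x_k' x_k
    r = y - X @ beta
    for _ in range(n_iter):
        for k in range(p):
            r = r + X[:, k] * beta[k]          # remove k-th contribution
            zk = X[:, k] @ r / n               # univariate least-squares fit
            beta[k] = soft_threshold(zk, lam) / col_ss[k]
            r = r - X[:, k] * beta[k]          # restore residual
    return beta
```

No matrix is ever inverted; each sweep costs O(np), which is why the shooting algorithm scales well in p.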
3.2. An algorithm for the bridge estimator
For the bridge estimator with γ > 1, the PER is differentiable, and so standard optimization algorithms can be applied.
For γ = 1, it is the Lasso and we have already discussed the algorithms. Hence, we focus on the bridge penalties with
γ < 1.
For γ < 1, consider the modified PER

Cλ(β : θ) = ∑_{i=1}^n l(yi, xi′β) + ∑_{k=1}^p θk^{1−1/γ} |βk| + τ ∑_{k=1}^p θk,

where θk ≥ 0 and λ = τ^{1−γ} γ^{−γ} (1 − γ)^{γ−1}.
Let (β̂, θ̂) be the minimizer of Cλ(β : θ). Huang et al. (2008) proved that β̂ is then the minimizer of Cλ(β) for the bridge penalty with γ < 1. We can minimize Cλ(β : θ) iteratively. For fixed θ, we minimize Cλ(β : θ) in β, which is a Lasso optimization problem and so can be solved easily. When β̂ is fixed, the minimization over θ can also be done easily since it has a closed form solution. We repeat these two steps until convergence. Since Cλ(β : θ) decreases after each iteration, the solution eventually converges to a local minimum. This algorithm, however, guarantees neither a good local minimum nor the global optimum; further study of this algorithm is needed.
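A minimal sketch of this alternating scheme for γ = 1/2, where minimizing θ^{−1}|β| + τθ over θ ≥ 0 gives the closed form θk = (|βk|/τ)^{1/2}. The inner weighted Lasso solver, the 1/(2n) loss scaling, and the large initialization of θ are illustrative choices on our part, not part of the published algorithm:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso_cd(X, y, w, n_iter=50):
    """Coordinate descent for (1/2n)||y - X b||^2 + sum_k w_k |b_k|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X**2).sum(axis=0) / n
    r = y.copy()
    for _ in range(n_iter):
        for k in range(p):
            r = r + X[:, k] * beta[k]
            beta[k] = soft_threshold(X[:, k] @ r / n, w[k]) / col_ss[k]
            r = r - X[:, k] * beta[k]
    return beta

def bridge_half(X, y, tau, n_outer=20, eps=1e-8):
    """Alternating minimization for the bridge penalty with gamma = 1/2:
    beta-step = weighted Lasso, theta-step = closed form sqrt(|beta|/tau)."""
    p = X.shape[1]
    theta = np.full(p, 10.0)   # large initial theta: light penalty at first
    beta = np.zeros(p)
    for _ in range(n_outer):
        beta = weighted_lasso_cd(X, y, 1.0 / (theta + eps))  # Lasso weights theta^{-1}
        theta = np.sqrt(np.abs(beta) / tau)                  # closed-form theta-step
    return beta
```

Coordinates driven to zero receive an effectively infinite weight at the next β-step (guarded by eps), so the iteration can get stuck at poor local minima, consistent with the caveat above.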
3.3. An algorithm for the SCAD
Fan and Li (2001) proposed to approximate the SCAD penalty by a quadratic function that depends on the current solution, and to update the solution by the minimizer of the PER with the quadratically approximated SCAD penalty, which can be obtained easily by the Newton-Raphson algorithm. Zou and Li (2008) proposed a linear approximation of the SCAD penalty. These algorithms, however, are approximations, and it is not obvious whether they converge, or whether they yield a local minimum when they do.
Kim et al. (2008b) proposed an optimization algorithm which always converges to a local minimum. The main idea of the proposed algorithm is to decompose the PER Cλ(β) into the sum of a convex and a concave function. We then apply the CCCP algorithm (Yuille and Rangarajan, 2003; An and Tao, 1997), one of the most powerful optimization algorithms for non-convex problems. The CCCP algorithm has been popularly used in many learning problems, including Shen et al. (2003) and Collobert et al. (2006). The key idea of the CCCP algorithm is to update the solution with the minimizer of a tight convex upper bound of the objective function obtained at the current solution. To be specific, suppose there are a convex function Cvex(β) and a concave function Ccav(β) such that C(β) = Cvex(β) + Ccav(β). Given a current solution βc, the tight convex upper bound is defined by Q(β) = Cvex(β) + ∇Ccav(βc)′β, where ∇Ccav(β) = ∂Ccav(β)/∂β. We then update the solution by the minimizer of Q(β). Note that Q(β) is convex, so it can be minimized easily. An important property of the CCCP algorithm is that the objective function decreases after each iteration; hence the solution eventually converges to a local minimum. We also note that the SCAD penalty can be decomposed into the sum of a concave and a convex function:
Jλ(|βj|) = J̃λ(|βj|) + λ|βj|,

where J̃λ(|βj|) is a differentiable concave function and |βj| is a convex function. Hence, the PER with the SCAD penalty can be rewritten as

C(β) = ∑_{i=1}^n l(yi, xi′β) + ∑_{j=1}^p J̃λ(|βj|) + λ ∑_{j=1}^p |βj|,   (3.1)
which is the sum of a convex and a concave function. We apply the CCCP algorithm as follows. Given a current solution βc, the tight convex upper bound is given by

Q(β) = (1/2n) ∑_{i=1}^n (yi − xi′β)² + ∑_{j=1}^p ∇J̃λ(|β^c_j|)βj + λ ∑_{j=1}^p |βj|.   (3.2)

We then update the current solution βc with the minimizer of (3.2). To minimize (3.2), we can use any algorithm for the Lasso.
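For illustration, the CCCP iteration for the SCAD penalty can be sketched in the special case of standardized covariates with X′X/n = I, where each convex upper bound has a closed-form soft-thresholding minimizer. The orthonormality assumption is ours, made so that no inner Lasso solver is needed; in general, the minimizer of (3.2) would be computed by any Lasso algorithm.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty J_lam at t >= 0."""
    return np.where(t <= lam, lam,
                    np.where(t <= a * lam, (a * lam - t) / (a - 1), 0.0))

def grad_concave(beta, lam, a=3.7):
    """Gradient of the differentiable concave part J~_lam = J_lam - lam|.|."""
    return (scad_deriv(np.abs(beta), lam, a) - lam) * np.sign(beta)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cccp_scad(X, y, lam, a=3.7, n_iter=50):
    """CCCP for SCAD-penalized least squares, assuming X'X/n = I so
    that each convex upper bound (3.2) separates across coordinates."""
    n = X.shape[0]
    z = X.T @ y / n                    # univariate least-squares fits
    beta = soft_threshold(z, lam)      # Lasso start (concave part ignored)
    for _ in range(n_iter):
        # minimize (3.2): shift by the concave gradient, then soft-threshold
        beta = soft_threshold(z - grad_concave(beta, lam, a), lam)
    return beta
```

Note that for coordinates with |β^c_j| > aλ the concave gradient equals −λ sign(β^c_j), which exactly cancels the soft-thresholding shrinkage: large coefficients are left unshrunk, reproducing the near-unbiasedness of SCAD.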
The CCCP algorithm can also be applied to the MCP. Recently, Zhang (2010) developed an algorithm that gives a solution path for the SCAD penalty and the MCP.
4. Asymptotic properties
Basically, there are two types of asymptotic properties: selection consistency and minimax optimality. For given β ∈ R^p, let σ(β) = { j : βj ≠ 0}. Selection consistency means that

Pr{σ(β̂) = σ(β∗)} → 1

as n → ∞, where β∗ is the true regression coefficient. Minimax optimality concerns the convergence rate of

sup_{β∗ ∈ R^p} Eβ∗(∥β̂ − β∗∥²).

Recently, asymptotic properties of penalized estimators have received much attention for high-dimensional models, where p diverges as n → ∞. This is partly because the well-known ridge estimator is not even consistent in such cases (Kim, 2009).
4.1. Asymptotic properties of Lasso
Recently, many researchers have proved that the Lasso estimator is minimax optimal even when p diverges exponentially fast in n; see, for example, Bickel et al. (2009). Also, the selection consistency of the Lasso estimator for high-dimensional models has been proved by Zhao and Yu (2006).
An interesting observation is that minimax optimality and selection consistency cannot be achieved simultaneously. That is, the sequences of regularization parameters λn that achieve these two aims have different orders. Moreover, this conflict exists even when p is fixed. See, for example, Zou (2006).
4.2. Asymptotic properties of SCAD
A stronger asymptotic result than selection consistency is the oracle property, which means

Pr{β̂ = β̂o} → 1,

where β̂o is the estimator obtained after deleting all noisy variables. Fan and Li (2001) proved that the SCAD estimator has the oracle property when p is fixed. Fan and Peng (2004) extended this result to the case where p diverges slower than n. Kim et al. (2008b) and Kwon and Kim proved that the oracle property of the SCAD estimator still holds when p diverges faster than n. Zhang (2010) proved that this desirable property also holds for the MCP estimator.
4.3. Contradiction between minimax optimality and selection consistency
We have mentioned that the Lasso estimator cannot achieve minimax optimality and selection consistency simultaneously. This conflict is not specific to the Lasso estimator; it is a general rule. In fact, Leeb and Potscher (2008) proved that no selection consistent estimator can be minimax optimal, even when p is fixed. They argued that the SCAD estimator is therefore not good in prediction accuracy.
This conflict arises when some of the true coefficients are not zero but close to zero (i.e., inside the 1/√n ball around 0). A selection consistent estimator cannot detect such small non-zero coefficients and sets them to 0. However, such small coefficients play an important role in the prediction error; that is, to achieve optimal prediction accuracy, we need to select more variables than necessary. This is closely related to the relationship between the AIC and BIC: the AIC is optimal in prediction while the BIC is selection consistent.
References
L.T.H. An and P.D. Tao. Solving a class of linearly constrained indefinite quadratic problems by dc algorithms.
Journal of Global Optimization, 11:253–285, 1997.
P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics, 37: 1705–1732, 2009.
L. Breiman. Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24: 2350–2383,
1996.
R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 7: 1687–1712, 2006.
Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics,
32: 407–499, 2004.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the
American Statistical Association, 96: 1348–1360, 2001.
J. Fan and H. Peng. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32: 928–961, 2004.
I.E. Frank and J.H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35: 109–
148, 1993.
J. H. Friedman. Greedy function approximation : a gradient boosting machine. Annals of Statistics, 29: 1189–
1232, 2001.
J.H. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1: 302–332, 2007.
A. E. Hoerl and R. W. Kennard. Ridge regression : biased estimation for nonorthogonal problems. Technometrics,
12: 55–67, 1970.
J. Huang, S. Ma, and C-H. Zhang. Asymptotic properties of bridge estimators in sparse high-dimensional regression
models. Annals of Statistics, 36: 587–613, 2008.
J. Kim, Y. Kim, and Y. Kim. A gradient-based optimization algorithm for lasso. Journal of Computational and
Graphical Statistics, 17: 994–1009, 2008a.
Y. Kim. An inconsistency of the ridge estimator. Technical report, Seoul National University, 2009.
Y. Kim, H. Choi, and H. Oh. Smoothly clipped absolute deviation on high dimensions. Journal of the American
Statistical Association, 103: 1665–1673, 2008b.
S. Kwon and Y. Kim. Large sample properties of the SCAD penalized maximum likelihood estimator on high dimensions. Statistica Sinica.
H. Leeb and B.M. Potscher. Sparse estimators and the oracle property, or the return of hodges’ estimator. Journal of
Econometrics, 142: 201–211, 2008.
M. R. Osborne, Brett Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems.
IMA Journal of Numerical Analysis, 20 (3): 389–404, 2000.
M. Park and T. Hastie. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B, 69 (4): 659–677, 2007.
S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35, 2007.
X. Shen, G.C. Tseng, X. Zhang, and W.H. Wong. On ψ-learning. Journal of the American Statistical Association,
98: 724–734, 2003.
R.J. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Ser. B,
58: 267–288, 1996.
A. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15: 915–936, 2003.
C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38: 894–
942, 2010.
P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7: 2541–2563, 2006.
2006.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101: 1418–
1429, 2006.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical
Society, Ser. B, 67: 301–320, 2005.
H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36:
1509–1533, 2008.