Linear Programming for Feature Selection via
Regularization
Yoonkyung Lee
Department of Statistics
The Ohio State University
(Joint work with Yonggang Yao)
July 2008
Outline
◮ Methods of regularization
◮ Solution paths
◮ Main optimization problems for feature selection
◮ Overview of linear programming
◮ Simplex algorithm for generating solution paths
◮ Implications
◮ Numerical examples
◮ Concluding remarks
Regularization
◮ Tikhonov regularization (1943): solving ill-posed integral equations numerically
◮ Process of modifying ill-posed problems by introducing additional information about the solution
◮ Modification of the maximum likelihood principle or the empirical risk minimization principle (Bickel & Li 2006)
◮ Smoothness, sparsity, small norm, large margin, ...
◮ Bayesian connection
Methods of Regularization (Penalization)
Find f ∈ F minimizing

  (1/n) Σ_{i=1}^n L(y_i, f(x_i)) + λ J(f).

◮ Empirical risk + penalty
◮ F: a class of candidate functions
◮ J(f): the complexity of a model f
◮ λ > 0: a regularization parameter
◮ Without the penalty J(f), an ill-posed problem
Examples of Regularization Methods
◮ Ridge regression (Hoerl and Kennard 1970)
◮ LASSO (Tibshirani 1996)
◮ Smoothing splines (Wahba 1990)
◮ Support vector machines (Vapnik 1998)
◮ Regularized neural networks, boosting, logistic regression, ...
◮ Smoothing splines:
  Find f ∈ F = W2[0, 1] = {f : f, f′ absolutely continuous, and f′′ ∈ L2} minimizing

    (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ ∫_0^1 (f′′(x))² dx,

  where J(f) = ∫_0^1 (f′′(x))² dx.
◮ Support vector machines:
  Find f ∈ F = {f(x) = w⊤x + b | w ∈ R^p and b ∈ R} minimizing

    (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ‖w‖²,

  where J(f) = J(w⊤x + b) = ‖w‖².
LASSO
  min_β Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{ij})² + λ‖β‖₁
  ⇔ min_β Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{ij})² s.t. ‖β‖₁ ≤ s
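A path like the one in the next figure can be traced with standard software; a minimal sketch, assuming scikit-learn's LARS implementation and simulated data (not data from the talk):

    # Sketch: LASSO coefficient path via LARS (assumes scikit-learn; data simulated).
    import numpy as np
    from sklearn.linear_model import lars_path

    rng = np.random.default_rng(0)
    n, p = 100, 10
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[[0, 2, 4, 9]] = 2.0                 # a sparse truth, for illustration only
    y = X @ beta_true + rng.standard_normal(n)

    # alphas: breakpoints of the penalty parameter; coefs[:, k]: solution at alphas[k]
    alphas, active, coefs = lars_path(X, y, method="lasso")
    print(np.round(coefs.T, 2))

Between consecutive breakpoints the coefficients change linearly, which is what the path plot displays.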
LASSO coefficient paths

Figure: LASSO coefficient paths; standardized coefficients plotted against |beta|/max|beta|
Solution Paths
◮ Each regularization method defines a continuum of optimization problems indexed by a tuning parameter.
◮ λ determines the trade-off between the prediction error and the model complexity.
◮ The entire set of solutions f or β as a function of λ
◮ Complete exploration of the model space and computational savings
Examples
◮ LARS (Efron et al. 2004)
◮ SVM path (Hastie et al. 2004)
◮ Multicategory SVM path (Lee and Cui 2006)
◮ Piecewise linear paths (Rosset and Zhu 2007)
◮ Generalized path seeking algorithm (Friedman 2008)
Main Problem
◮ Regularization for simultaneous fitting and feature selection
◮ Convex piecewise linear loss functions
◮ Penalties of ℓ1 nature for feature selection
  ◮ Parametric: LASSO-type
  ◮ Nonparametric: COSSO-type
    COmponent Selection and Smoothing Operator
    (Lin and Zhang 2003, Gunn and Kandola 2002)
◮ Non-differentiability of the loss and penalty
◮ Linear programming (LP) problems indexed by a single regularization parameter
◮ Examples
  ◮ ℓ1-norm SVM (Bradley and Mangasarian 1998, Zhu et al. 2004)
  ◮ ℓ1-norm quantile regression (Li and Zhu 2005)
  ◮ θ-step (kernel selection) for structured kernel methods (Lee et al. 2006)
  ◮ Dantzig selector (Candes and Tao 2005)
  ◮ ε-insensitive loss in SVM regression
  ◮ Sup norm, max_{j=1,...,p} |β_j|
◮ Computational properties of the solutions to the problems can be treated generally by tapping into the LP theory.
Linear Programming
◮ One of the cornerstones of optimization theory
◮ Applications in operations research, economics, business management, and engineering
◮ The simplex algorithm by Dantzig (1947)
◮ ‘Parametric-cost LP’ or ‘parametric right-hand-side LP’ in optimization theory
◮ Exploit the connection to lay out general algorithms for the solution paths of the feature selection problems.
Geometry of LP
◮ Search for the minimum of a linear function over a polyhedron whose edges are defined by hyperplanes.
◮ At least one of the intersection points of the hyperplanes attains the minimum if the minimum exists.
Linear Programming
◮ Standard form of LP:

    min_{z ∈ R^N} c′z   s.t.  Az = b,  z ≥ 0,

  where z is an N-vector of variables, c is a fixed N-vector, b is a fixed M-vector, and A is a fixed M × N matrix.
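A minimal sketch of solving one such standard-form LP numerically, assuming SciPy; the cost vector and constraints below are toy values chosen only for illustration:

    # Sketch: a standard-form LP  min c'z  s.t. Az = b, z >= 0  (assumes SciPy; toy data).
    import numpy as np
    from scipy.optimize import linprog

    c = np.array([1.0, 2.0, 0.0, 0.0])            # fixed N-vector (N = 4)
    A = np.array([[1.0, 1.0, 1.0, 0.0],           # fixed M x N matrix (M = 2)
                  [1.0, -1.0, 0.0, 1.0]])
    b = np.array([4.0, 1.0])                      # fixed M-vector

    res = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 4, method="highs")
    print(res.x, res.fun)                         # an optimal basic solution and its value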
LP terminology
◮ A set B* := {B*_1, ..., B*_M} ⊂ N = {1, ..., N} is called a basic index set if A_{B*} is invertible.
◮ z* ∈ R^N is called the basic solution associated with B* if z* satisfies

    z*_{B*} := (z*_{B*_1}, ..., z*_{B*_M})′ = A_{B*}^{-1} b  and  z*_j = 0 for j ∈ N \ B*.

◮ A basic index set B* is called a feasible basic index set if A_{B*}^{-1} b ≥ 0.
◮ A feasible basic index set B* is also called an optimal basic index set if

    c − A′ (A_{B*}^{-1})′ c_{B*} ≥ 0.
Optimality Condition for LP
Theorem
Let z* be the basic solution associated with an optimal basic index set B*. Then z* is an optimal basic solution.
◮ The standard LP problem can be solved by finding the optimal basic index set.
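These definitions translate directly into linear algebra. A small sketch (NumPy assumed; A, b, c and the candidate index set are the toy values from the previous sketch) that computes the basic solution and reduced costs, then tests feasibility and optimality:

    # Sketch: basic solution and optimality check for a candidate basic index set (NumPy).
    import numpy as np

    A = np.array([[1.0, 1.0, 1.0, 0.0],
                  [1.0, -1.0, 0.0, 1.0]])
    b = np.array([4.0, 1.0])
    c = np.array([1.0, 2.0, 0.0, 0.0])
    B = [2, 3]                                    # candidate basic index set (0-based)

    A_B = A[:, B]                                 # must be invertible for B to be basic
    z = np.zeros(A.shape[1])
    z[B] = np.linalg.solve(A_B, b)                # z_B = A_B^{-1} b, zero elsewhere
    feasible = bool(np.all(z[B] >= 0))            # feasible basic index set?
    reduced = c - A.T @ np.linalg.solve(A_B.T, c[B])   # c - A'(A_B^{-1})' c_B
    optimal = feasible and bool(np.all(reduced >= -1e-12))
    print(z, reduced, feasible, optimal)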
Parametric Linear Programs
◮ Standard form of a parametric-cost LP:

    min_{z ∈ R^N} (c + λa)′z   s.t.  Az = b,  z ≥ 0

◮ Standard form of a parametric right-hand-side LP:

    min_{z ∈ R^N} c′z   s.t.  Az = b + ωb*,  z ≥ 0
Example: ℓ1-norm SVM
◮   min_{β0 ∈ R, β ∈ R^p} Σ_{i=1}^n {1 − y_i(β0 + x_i β)}_+ + λ‖β‖₁,

◮ In other words,

    min_{β0 ∈ R, β ∈ R^p, ζ ∈ R^n} Σ_{i=1}^n (ζ_i)_+ + λ‖β‖₁
    s.t. y_i(β0 + x_i β) + ζ_i = 1 for i = 1, ..., n.

◮ As a parametric-cost LP in standard form,

    z := ( β0+, β0−, (β+)′, (β−)′, (ζ+)′, (ζ−)′ )′
    c := ( 0, 0, 0′, 0′, 1′, 0′ )′
    a := ( 0, 0, 1′, 1′, 0′, 0′ )′
    A := ( Y, −Y, diag(Y)X, −diag(Y)X, I, −I )
    b := 1.
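A minimal sketch of this construction, assuming NumPy/SciPy and simulated data: it assembles (c, a, A, b) as above and solves the LP at one fixed λ (the path algorithm on the following slides avoids re-solving for every λ):

    # Sketch: l1-norm SVM at one fixed lambda as a standard-form LP (SciPy; simulated data).
    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(1)
    n, p, lam = 50, 5, 0.5
    X = rng.standard_normal((n, p))
    y = np.sign(X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(n))

    # z = (beta0+, beta0-, (beta+)', (beta-)', (zeta+)', (zeta-)')', all components >= 0
    c = np.concatenate([[0.0, 0.0], np.zeros(p), np.zeros(p), np.ones(n), np.zeros(n)])
    a = np.concatenate([[0.0, 0.0], np.ones(p), np.ones(p), np.zeros(n), np.zeros(n)])
    A = np.hstack([y[:, None], -y[:, None], y[:, None] * X, -(y[:, None] * X),
                   np.eye(n), -np.eye(n)])        # rows: y_i(beta0 + x_i beta) + zeta_i = 1
    b = np.ones(n)

    res = linprog(c + lam * a, A_eq=A, b_eq=b,
                  bounds=[(0, None)] * A.shape[1], method="highs")
    beta0 = res.x[0] - res.x[1]
    beta = res.x[2:2 + p] - res.x[2 + p:2 + 2 * p]
    print(round(beta0, 3), np.round(beta, 3))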
Optimality Interval
Corollary
For a fixed λ* ≥ 0, let B* be an optimal basic index set of the parametric-cost LP problem at λ = λ*. Define

    λ_low := max_{j ∈ N \ B*: ǎ*_j > 0} ( −č*_j / ǎ*_j )   and   λ_up := min_{j ∈ N \ B*: ǎ*_j < 0} ( −č*_j / ǎ*_j ),

where ǎ*_j := a_j − a′_{B*} A_{B*}^{-1} A_j and č*_j := c_j − c′_{B*} A_{B*}^{-1} A_j for j ∈ N.
Then B* is an optimal basic index set for λ ∈ [λ_low, λ_up], which includes λ*.
Simplex Algorithm
1. Initialize the optimal basic index set at λ^{-1} = ∞ with B^0.
2. Given B^l at λ = λ^{l−1}, determine the solution z^l by
     z^l_{B^l} = A_{B^l}^{-1} b  and  z^l_j = 0 for j ∈ N \ B^l.
3. Find the entry index
     j^l = arg max_{j ∈ N \ B^l: ǎ^l_j > 0} ( −č^l_j / ǎ^l_j ).
4. Find the exit index
     i^l = arg min_{i ∈ {j ∈ B^l: d^l_j < 0}} ( −z^l_i / d^l_i ).
5. Update the optimal basic index set to B^{l+1} = B^l ∪ {j^l} \ {i^l}.
6. Terminate the algorithm if č^l_{j^l} ≥ 0 or, equivalently, λ^l ≤ 0. Otherwise, repeat steps 2–5.
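A sketch of one iteration (steps 2–5) in NumPy, under the assumption that d^l denotes the basic components of the direction −A_{B^l}^{-1} A_{j^l} for the entering column, which the slide does not spell out:

    # Sketch of steps 2-5 for one breakpoint; d is taken to be the direction
    # -A_B^{-1} A_j of the entering column j (an assumption, not stated on the slide).
    import numpy as np

    def simplex_path_step(A, b, c, a, B):
        """One parametric-cost pivot: current solution z, next breakpoint lam, updated B."""
        B = np.asarray(B)
        N_B = np.setdiff1d(np.arange(A.shape[1]), B)       # nonbasic indices N \ B
        A_B_inv = np.linalg.inv(A[:, B])
        z = np.zeros(A.shape[1])
        z[B] = A_B_inv @ b                                  # step 2: basic solution
        c_hat = c[N_B] - A[:, N_B].T @ (A_B_inv.T @ c[B])   # reduced costs w.r.t. c
        a_hat = a[N_B] - A[:, N_B].T @ (A_B_inv.T @ a[B])   # reduced costs w.r.t. a
        pos = a_hat > 1e-12
        if not pos.any():
            return z, 0.0, B                                # no entry candidate: stop
        ratios = -c_hat[pos] / a_hat[pos]
        j = N_B[pos][np.argmax(ratios)]                     # step 3: entry index
        lam = float(ratios.max())                           # next breakpoint lambda^l
        d = -A_B_inv @ A[:, j]                              # direction of the basic variables
        neg = d < -1e-12
        i = B[neg][np.argmin(z[B][neg] / -d[neg])]          # step 4: exit index
        B_new = np.sort(np.append(B[B != i], j))            # step 5: update B
        return z, lam, B_new

Repeating this step until the breakpoint drops to zero (step 6) traces the whole path described in the theorem below.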
Theorem
The solution path of the parametric-cost LP is

    z^0                        for λ > λ^0,
    z^l                        for λ^l < λ < λ^{l−1}, l = 1, ..., J,
    τ z^l + (1 − τ) z^{l+1}    for λ = λ^l and τ ∈ [0, 1], l = 0, ..., J − 1.

◮ The simplex algorithm gives a piecewise constant path.
An Illustrative Example
◮ x = (x_1, ..., x_10) ∼ N(0, I)
◮ A probit model: Y = sign(β_0 + xβ + ε), where ε ∼ N(0, 50)
◮ β_0 = 0, β_j = 2 for j = 1, 3, 5, 10, and 0 elsewhere
◮ The Bayes error rate: 0.336
◮ n = 400
◮ ℓ1-norm SVM
Figure: ℓ1-norm SVM coefficient path indexed by λ; coefficients β plotted against −log(λ) (five-fold CV with 0-1 and hinge)
Alternative Formulation
◮ Example: ℓ1-norm SVM

    min_{β0 ∈ R, β ∈ R^p} Σ_{i=1}^n {1 − y_i(β0 + x_i β)}_+   s.t.  ‖β‖₁ ≤ s

◮ As a parametric right-hand-side LP,

    min_{z ∈ R^N, δ ∈ R} c′z   s.t.  Az = b,  a′z + δ = s,  z ≥ 0, δ ≥ 0.
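A minimal sketch of the right-hand-side form at one fixed s, assuming SciPy and reusing the (c, a, A, b) construction from the ℓ1-norm SVM sketch above; the extra slack variable δ carries the row a′z + δ = s:

    # Sketch: l1-norm SVM with ||beta||_1 <= s as a right-hand-side LP (assumes SciPy;
    # c, a, A, b as constructed in the earlier lambda-form sketch; s is fixed here).
    import numpy as np
    from scipy.optimize import linprog

    def l1_svm_at_s(c, a, A, b, s):
        """Append the slack delta and the row a'z + delta = s, then solve for z."""
        A_aug = np.vstack([np.hstack([A, np.zeros((A.shape[0], 1))]),
                           np.append(a, 1.0)])              # last row encodes a'z + delta = s
        b_aug = np.append(b, s)
        c_aug = np.append(c, 0.0)                           # delta carries no cost
        res = linprog(c_aug, A_eq=A_aug, b_eq=b_aug,
                      bounds=[(0, None)] * len(c_aug), method="highs")
        return res.x[:-1]                                   # drop delta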
Theorem
For s ≥ 0, the solution path can be expressed as

    ((s_{l+1} − s)/(s_{l+1} − s_l)) z^l + ((s − s_l)/(s_{l+1} − s_l)) z^{l+1}   if s_l ≤ s < s_{l+1}, l = 0, ..., J − 1,
    z^J                                                                         if s ≥ s_J,

where s_l = a′z^l.
◮ The simplex algorithm gives a piecewise linear path.
Figure: ℓ1-norm SVM coefficient path indexed by s; coefficients β plotted against s (five-fold CV with 0-1 and hinge)

Figure: The true error rate path for the ℓ1-norm SVM under the probit model (error rate plotted against s)
Annual Household Income Data
◮ http://www-stat.stanford.edu/∼tibs/ElemStatLearn/
◮ Predict the annual household income with 13 demographic attributes (education, age, gender, marital status, occupation, householder status, etc.).
◮ The response takes one of nine specified income brackets.
◮ Split 6,876 records into a training set of 2,000 and a test set of 4,876.
Figure: Boxplots of the annual household income with education, age, gender, marital status, occupation, and householder status out of 13 demographic attributes in the data
Median Regression with l1 Penalty
    min_β Σ_{i=1}^n | y_i − Σ_{j=1}^p β_j x_{ij} |   subject to  ‖β‖₁ ≤ s

◮ Main effect model with 35 variables plus a quadratic term for age
◮ Partial two-way interaction model with additional 69 two-way interactions (out of 531 potential interaction terms)
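A minimal sketch of this constrained median (LAD) regression as an LP, assuming SciPy; β is split into positive and negative parts and the absolute residuals are bounded by auxiliary variables u (the data arguments are placeholders, not the income data):

    # Sketch: l1-constrained median (LAD) regression as an LP (assumes SciPy).
    import numpy as np
    from scipy.optimize import linprog

    def l1_median_regression(X, y, s):
        """min sum_i |y_i - x_i beta|  s.t. ||beta||_1 <= s,
        with beta = beta+ - beta- and residual bounds u_i >= |y_i - x_i beta|."""
        n, p = X.shape
        # variable order: (beta+, beta-, u), all components nonnegative
        c = np.concatenate([np.zeros(2 * p), np.ones(n)])
        A_ub = np.vstack([
            np.hstack([X, -X, -np.eye(n)]),                      # X beta - y <= u
            np.hstack([-X, X, -np.eye(n)]),                      # y - X beta <= u
            np.concatenate([np.ones(2 * p), np.zeros(n)])[None, :],  # ||beta||_1 <= s
        ])
        b_ub = np.concatenate([y, -y, [s]])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(0, None)] * (2 * p + n), method="highs")
        return res.x[:p] - res.x[p:2 * p]                        # beta = beta+ - beta-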
Main effect model

Figure: Median regression coefficient paths (β against s) for the main effect model.
Positive: home ownership (in dark blue relative to renting), education (in brown), dual income due to marriage (in purple relative to ‘not married’), age (in skyblue), and male (in light green). Negative: single or divorced (in red relative to ‘married’) and student, clerical worker, retired or unemployed (in green relative to professionals/managers)

Two-way interaction model

Figure: Median regression coefficient paths (β against s) for the two-way interaction model.
Positive: ‘dual income ∗ home ownership’, ‘home ownership ∗ education’, and ‘married but no dual income ∗ education’. Negative: ‘single ∗ education’ and ‘home ownership ∗ age’
Risk Path

Figure: Estimated risk plotted against s. The risks of the two-way fitted models are estimated by using a test data set with 4,876 observations.
Refinement of the Simplex Algorithm
◮ The simplex algorithm assumes non-degeneracy of the solutions, that is, z^l ≠ z^{l+1} for each l.
◮ Tableau-simplex algorithm with the anti-cycling property for more general settings
◮ Structural commonalities in the elements of the standard LP form can be utilized for efficient computation.
Concluding Remarks
◮ Establish the connection between a family of regularization problems for feature selection and the LP theory.
◮ Shed new light on solution path-finding algorithms for the optimization problems in comparison with the existing algorithms.
◮ Provide fast and efficient computational tools for screening and selection of features for regression and classification problems.
◮ Unified algorithm with modular treatment of different procedures (lpRegPath)
◮ Model selection (or averaging) and validation
◮ Optimization theory and tools are very useful to statisticians.
Reference
◮ Another Look at Linear Programming for Feature Selection via Methods of Regularization, Yao, Y. and Lee, Y., Technical Report No. 800, Department of Statistics, The Ohio State University, 2007.
◮ For a copy of the paper and slides of this talk, visit http://www.stat.ohio-state.edu/∼yklee
◮ E-mail: [email protected]