Feature Selection using Sparse Priors: A Regularization Approach

Martin Brown and Nick Costen
[email protected] [email protected]
Feature selection is a fundamental process within many classification algorithms. A large
dictionary of potential, flexible features often exists, from which it is necessary to select a relevant
subset. The aims of feature selection include:
• improved, more robust parameter estimation,
• improved insight into the decision-making process.
Feature selection is generally an empirical process that is performed prior to, or jointly with, the
parameter estimation process. However, it should also be recognized that feature selection often
introduces problems because:
• optimal feature selection is an NP-complete problem,
• sensitivity analysis of the trained classifier is often neglected,
• parametric prediction uncertainty with respect to the unselected features is often not represented in the final classifier.
This talk is concerned with a classification-regularization approach of the form:
min θ,b ,ζ f θ, b, ζ,   
st
1
ζ 2  θ 1
2
2
diag (t )Φθ  1b   ζ  1
where  and b are the classification parameters and bias term respectively,  is the regularization
parameter and the empirical data set is given by {, t}, where the target class labels lie in {–
1,+1}. This is similar to the regularization functions proposed by Vapnik and used to develop
Support Vector Machines [1]. The aim is to jointly minimize the loss function that measures how
close the predictions are to the class labels and the model complexity term that orders potential
classifiers according to their complexity. For classifiers that are linear in their parameters, this is
a piecewise Quadratic Programming problem.
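As a concrete illustration of this piecewise Quadratic Programming problem, the sketch below solves the formulation above with an off-the-shelf convex solver (CVXPY). This is an assumption of this note rather than the authors' implementation; the names Phi, t and lam simply mirror the notation Φ, t and λ in the text.

```python
# A minimal sketch of the classification-regularization problem above using
# CVXPY (an assumption of this note, not the authors' implementation).
# Phi is the n x m feature matrix, t the labels in {-1,+1}, lam the
# regularization parameter lambda.
import cvxpy as cp

def sparse_classifier(Phi, t, lam):
    """Solve  min 0.5*||zeta||_2^2 + lam*||theta||_1
       s.t.   diag(t)(Phi @ theta + 1*b) + zeta >= 1."""
    n, m = Phi.shape
    theta = cp.Variable(m)
    b = cp.Variable()
    zeta = cp.Variable(n, nonneg=True)          # non-negativity is implied by the squared slack penalty
    loss = 0.5 * cp.sum_squares(zeta)           # data-fit (loss) term
    complexity = lam * cp.norm1(theta)          # sparsity-inducing 1-norm term
    margin = cp.multiply(t, Phi @ theta + b)    # diag(t)(Phi theta + 1 b)
    cp.Problem(cp.Minimize(loss + complexity), [margin + zeta >= 1]).solve()
    return theta.value, b.value
```

For a fixed λ this returns a single sparse classifier; the path algorithm discussed later generates the whole family of such classifiers at once.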
An interesting observation is that the aim of classification is to minimize the number of classification errors, which is, in general, an NP-complete problem. To circumvent this, “soft” measures, such as a 1-norm or 2-norm on the prediction errors, are used (the loss function), and this is minimized instead. This is the basis of the sparse regularization approach to classification. Instead of performing a discrete search through the space of potential classifiers, which is an NP-complete problem, the 1-norm of the parameter vector can be used to provide a “soft” measure.
Any convex optimization criterion with convex constraints will have a global minimum and it can
be shown that the 1-norm complexity measure is convex. The talk will concentrate on the
properties of the 1-norm prior, showing why it produces sparse classification models, and
compare it with other complexity measures.
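The following toy experiment, built on the sparse_classifier sketch above with illustrative synthetic data (the data generation and λ value are arbitrary choices), contrasts the 1-norm with a 2-norm complexity term: only the former drives the coefficients of uninformative features to zero, up to solver tolerance.

```python
# Toy contrast between 1-norm and 2-norm complexity terms, reusing the
# formulation above on illustrative synthetic data (two informative
# features out of twenty; lam = 2.0 is an arbitrary choice).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 200, 20
Phi = rng.standard_normal((n, m))
score = Phi[:, 0] - 2.0 * Phi[:, 1] + 0.1 * rng.standard_normal(n)
t = np.where(score > 0, 1.0, -1.0)

def fit(penalty, lam=2.0):
    theta, b = cp.Variable(m), cp.Variable()
    zeta = cp.Variable(n, nonneg=True)
    obj = 0.5 * cp.sum_squares(zeta) + lam * penalty(theta)
    cp.Problem(cp.Minimize(obj),
               [cp.multiply(t, Phi @ theta + b) + zeta >= 1]).solve()
    return theta.value

theta_l1 = fit(cp.norm1)                             # derivative discontinuity at 0
theta_l2 = fit(lambda th: 0.5 * cp.sum_squares(th))  # smooth at 0, no exact zeros

tol = 1e-4  # "zero" up to solver tolerance
print("zero coefficients, 1-norm prior:", int(np.sum(np.abs(theta_l1) < tol)))
print("zero coefficients, 2-norm prior:", int(np.sum(np.abs(theta_l2) < tol)))
```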
The 1-norm’s derivative discontinuity at 0 is fundamental to producing sparse models, and it will
be shown that other p-norms that do not possess this property will not produce truly sparse
models. In addition, because the gradient of the 1-norm is constant in each region where the parameters’ signs do not change, an efficient algorithm can be generated that calculates every globally optimal sparse classifier as a function of the regularization parameter. This is because the local Hessian matrix is constant in each such region. The complete algorithm for generating every sparse classifier, and thus every sparse feature set, will be briefly described [2], together with a discussion of how to visualize the information provided by the parameter trajectories. This is an important aspect of providing a qualitative understanding of the classification problem’s sensitivity and gives the designer some indication of the conditional independence of the different features as the classifier’s complexity changes.
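The exact algorithm of [2] follows these trajectories analytically between sign changes; the sketch below is only a naive grid approximation that re-solves the earlier sparse_classifier problem over a range of λ values, but it records the same two quantities used in the figures that follow: the parameter trajectories θ(λ) and the training error count.

```python
# A naive grid approximation of the regularization path: re-solve the
# sparse_classifier problem from the earlier sketch for each lambda on a
# grid and record theta(lambda) and the training error count.  The exact
# algorithm in [2] instead follows the piecewise-linear trajectories
# analytically between sign changes of the parameters.
import numpy as np

def regularization_path(Phi, t, lams):
    thetas, errors = [], []
    for lam in lams:
        theta, b = sparse_classifier(Phi, t, lam)   # defined in the sketch above
        pred = np.sign(Phi @ theta + b)
        thetas.append(theta)
        errors.append(int(np.sum(pred != t)))       # training misclassifications
    return np.array(thetas), np.array(errors)

# Grid on the transformed axis log(lambda + 1) used in the figures:
lams = np.expm1(np.linspace(0.01, 3.0, 40))         # lambda = exp(u) - 1 > 0
# thetas, errors = regularization_path(Phi, t, lams)
# Plotting thetas[:, j] against np.log1p(lams) gives a trajectory plot like
# Figure 2; plotting errors gives an error curve like Figure 1.
```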
The algorithm has been tested on a variety of datasets, including Australian Credit, a real-world set consisting of 690 observations about whether or not credit should be granted, using 14 potential attributes. The error curve and parameter trajectories for a linear model are shown in Figures 1 and 2 respectively, using a transformed regularization parameter log(λ+1) to emphasize small-λ behaviour.
Figure 1. Classification errors versus the regularization parameter, log(λ+1).
Figure 2. Parameter trajectories versus the regularization parameter, log(λ+1).
As can be seen, for a large range of sparse models the error rate is approximately constant and there is a single dominant parameter. When λ is small, a number of parameters start to reverse their sign as the regularization parameter is reduced further, demonstrating the presence of Simpson’s paradox. The advantage of this approach is that the properties of all the sparse models can be directly visualized and the behaviour of the parameters discussed. In addition, the approach is efficient compared to re-training a single classifier, even though every sparse classifier is generated.
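For readers who wish to experiment with the same model family using a standard library, the sketch below uses scikit-learn’s LinearSVC, whose 1-norm penalized squared-hinge objective matches the formulation above up to scaling (roughly C = 1/(2λ)), with the caveat that LinearSVC also regularizes the bias term. It assumes the UCI Statlog (Australian Credit Approval) data has already been loaded into arrays X (690 × 14) and y ∈ {–1, +1}.

```python
# A hedged cross-check with scikit-learn: LinearSVC(penalty='l1',
# loss='squared_hinge', dual=False) minimizes ||theta||_1 + C * sum(squared
# hinge losses), which matches the formulation above up to scaling
# (roughly C = 1/(2*lambda)), except that the bias term is also penalized.
# X (690 x 14) and y in {-1,+1} are assumed to be pre-loaded from the
# UCI Statlog (Australian Credit Approval) data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def error_and_sparsity_curves(X, y, lams):
    Xs = StandardScaler().fit_transform(X)          # put the 14 attributes on a common scale
    errors, n_selected = [], []
    for lam in lams:                                # lams must be strictly positive
        clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False,
                        C=1.0 / (2.0 * lam), max_iter=10000)
        clf.fit(Xs, y)
        errors.append(int(np.sum(clf.predict(Xs) != y)))
        n_selected.append(int(np.sum(np.abs(clf.coef_) > 1e-6)))  # active features
    return errors, n_selected
```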
Other aspects, such as the generalization of the problem to non-linear classifier structures, the
development of parametric uncertainty/error bars for the inactive features, the inclusion of prior
weighting parameters on the features [3] and the development of globally optimal, sparse,
adaptive classifiers will also be briefly discussed.
References
1. Vapnik, V. N., Statistical Learning Theory. Wiley, New York, 1998.
2. Brown, M., Exploring the set of sparse, optimal classifiers. Proceedings of the International Workshop on Artificial Neural Networks in Pattern Recognition, pages **-**, 2003.
3. Costen, N. P. and Brown, M., Exploratory sparse models for face classification. Proceedings of the British Machine Vision Conference, vol. 1, pages 13-22, 2003.