Feature Selection using Sparse Priors: A Regularization Approach

Martin Brown and Nick Costen
[email protected] [email protected]

Feature selection is a fundamental process within many classification algorithms. A large dictionary of potential, flexible features often exists, from which it is necessary to select a relevant subset. The aims of feature selection include improved, more robust parameter estimation and improved insight into the decision-making process. Feature selection is generally an empirical process that is performed prior to, or jointly with, the parameter estimation process. However, it should also be recognized that feature selection often introduces problems, because optimal feature selection is an NP-complete problem; sensitivity analysis of the trained classifier is often neglected; and parametric prediction uncertainty with respect to the unselected features is often not represented in the final classifier.

This talk is concerned with a classification-regularization approach of the form:

    min_{θ, b, ζ}  f(θ, b, ζ, λ) = (1/2)||ζ||_2^2 + λ||θ||_1,
    subject to  diag(t)(Φθ + 1b) ≥ 1 − ζ,                        (1)

where θ and b are the classification parameters and bias term respectively, λ is the regularization parameter and the empirical data set is given by {Φ, t}, where the target class labels t lie in {−1, +1}. This is similar to the regularization functions proposed by Vapnik and used to develop Support Vector Machines [1]. The aim is to jointly minimize the loss function, which measures how close the predictions are to the class labels, and the model complexity term, which orders potential classifiers according to their complexity. For classifiers that are linear in their parameters, this is a piecewise Quadratic Programming problem.

An interesting observation is that the aim of classification is to minimize the number of classification errors, which is, in general, an NP-complete problem. To circumvent this, "soft" measures, such as a 1-norm or 2-norm on the prediction errors, are used as the loss function, which is minimized instead. This is the basis of the sparse regularization approach to classification: instead of performing a discrete search through the space of potential classifiers, which is an NP-complete problem, the 1-norm of the parameter vector can be used to provide a "soft" complexity measure. Any convex optimization criterion with convex constraints has a global minimum, and it can be shown that the 1-norm complexity measure is convex.

The talk will concentrate on the properties of the 1-norm prior, showing why it produces sparse classification models, and compare it with other complexity measures. The 1-norm's derivative discontinuity at 0 is fundamental to producing sparse models, and it will be shown that other p-norms that do not possess this property will not produce truly sparse models. In addition, because the 1-norm's gradient is constant in each region where the parameters' signs do not change, an efficient algorithm can be generated that calculates every globally optimal sparse classifier as a function of the regularization parameter; this is because the local Hessian matrix is constant in each such region. The complete algorithm for generating every sparse classifier, and thus every sparse feature set, will be briefly described [2], together with a discussion of how to visualize the information provided by the parameter trajectories.
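To make the role of the derivative discontinuity concrete, the following is a minimal one-dimensional sketch (an illustrative toy, not taken from the talk): with a quadratic loss, the 1-norm penalty yields the soft-thresholding rule and hence exact zeros, whereas a differentiable 2-norm penalty only shrinks the parameter towards zero.

    import numpy as np

    def argmin_l1(a, lam):
        # minimiser of 0.5*(theta - a)**2 + lam*abs(theta): soft-thresholding,
        # exactly zero whenever abs(a) <= lam
        return np.sign(a) * max(abs(a) - lam, 0.0)

    def argmin_l2(a, lam):
        # minimiser of 0.5*(theta - a)**2 + 0.5*lam*theta**2: pure shrinkage,
        # never exactly zero for a != 0
        return a / (1.0 + lam)

    for a in (0.3, 1.5):
        print(f"a={a}: 1-norm -> {argmin_l1(a, 0.5):.3f}, 2-norm -> {argmin_l2(a, 0.5):.3f}")

With lam = 0.5 the 1-norm solution is exactly 0 for a = 0.3, while the 2-norm solution is merely reduced in magnitude; this is the mechanism behind the sparsity produced by the 1-norm prior.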
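The path-following algorithm itself is specific to the talk and [2], but its output can be approximated with standard tools. The following is a minimal sketch, assuming Python with scikit-learn: an L1-penalised linear classifier with a squared-hinge loss (the same loss/penalty family as equation (1), with C = 1/λ) is re-fitted over a grid of regularization values, and the number of selected features and the training error are recorded at each value. The synthetic data is a stand-in for a real set such as Australian Credit, and the grid sweep only approximates the exact piecewise path computed by the algorithm described above.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    # Stand-in data: 690 observations, 14 features, labels mapped to {-1, +1}.
    X, y = make_classification(n_samples=690, n_features=14, n_informative=5,
                               random_state=0)
    X = StandardScaler().fit_transform(X)
    y = 2 * y - 1

    lambdas = np.logspace(-3, 2, 30)                   # regularization parameter sweep
    coef_paths, errors = [], []
    for lam in lambdas:
        clf = LinearSVC(penalty='l1', loss='squared_hinge', dual=False,
                        C=1.0 / lam, max_iter=10000)
        clf.fit(X, y)
        coef_paths.append(clf.coef_.ravel().copy())    # parameter values at this lambda
        errors.append(np.mean(clf.predict(X) != y))    # training classification errors

    coef_paths = np.array(coef_paths)
    n_active = (np.abs(coef_paths) > 1e-8).sum(axis=1) # number of selected features
    for lam, k, err in zip(lambdas, n_active, errors):
        print(f"log(lambda+1)={np.log(lam + 1):.3f}  features={k:2d}  error={err:.3f}")

Plotting coef_paths and errors against log(lambda + 1) reproduces the style of error curve and parameter trajectories discussed below.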
Visualizing these parameter trajectories is an important aspect of providing a qualitative understanding of the classification problem's sensitivity, and it gives the designer some indication of the conditional independence of the different features as the classifier's complexity changes.

The algorithm has been tested on a variety of datasets, including Australian Credit, a real-world set consisting of 690 observations about whether or not credit should be granted, using 14 potential attributes. The error curve and parameter trajectories for a linear model are shown in Figures 1 and 2 respectively, using a transformed regularization parameter, log(λ+1), to emphasize small-λ behaviour.

Figure 1. Classification errors versus the regularization parameter, log(λ+1).
Figure 2. Parameter trajectories versus the regularization parameter, log(λ+1).

As can be seen, for a large range of sparse models the error rate is approximately constant and there is a single dominant parameter. When λ is small, a number of parameters start to reverse their sign as the regularization parameter is further reduced, demonstrating the presence of Simpson's paradox. The advantage of this approach is that the properties of all the sparse models can be directly visualized and the behaviour of the parameters discussed. In addition, the approach is efficient compared to re-training a single classifier, even though every sparse classifier is generated.

Other aspects, such as the generalization of the problem to non-linear classifier structures, the development of parametric uncertainty/error bars for the inactive features, the inclusion of prior weighting parameters on the features [3] and the development of globally optimal, sparse, adaptive classifiers, will also be briefly discussed.

References
1. Vapnik, V. N., Statistical Learning Theory. Wiley, New York, 1998.
2. Brown, M., Exploring the set of sparse, optimal classifiers. Proceedings of the International Workshop on Artificial Neural Networks in Pattern Recognition, pages **-**, 2003.
3. Costen, N. P. and Brown, M., Exploratory sparse models for face classification. Proceedings of the British Machine Vision Conference, vol. 1, pages 13-22, 2003.