
Feature Selection as a one-player game
Michèle Sebag
LRI, CNRS UMR 8623 & INRIA-Futurs
Bâtiment 490, Université Paris-Sud
91405 - Orsay Cedex (France)
[email protected]
Romaric Gaudel
LRI, CNRS UMR 8623 & INRIA-Futurs
Bâtiment 490, Université Paris-Sud
91405 - Orsay Cedex (France)
[email protected]
Abstract
This paper formalizes Feature Selection as a Reinforcement Learning problem,
leading to a provably optimal though intractable selection policy. As a second
contribution, this paper presents an approximation thereof, based on a one-player game approach and relying on the Monte-Carlo tree search algorithm UCT (Upper Confidence Tree) proposed by Kocsis and Szepesvári (2006).
More precisely, the presented FUSE (Feature Uct SElection) algorithm extends
UCT to deal with i) a finite unknown horizon (the target number of relevant features); ii) a huge branching factor of the search tree (the size of the initial feature
set). Additionally, a frugal reward function is proposed as a rough but unbiased
estimate of the relevance of a feature subset. A proof of concept of FUSE is shown
on the NIPS 2003 Feature Selection Challenge.
1   Introduction
Feature Selection (FS), a key issue in statistical learning, is a combinatorial optimization problem aimed at minimizing the generalization error. FS is tackled using three main approaches: scoring, wrapper and embedded FS. Scoring approaches [1] rely on statistical criteria to rank features w.r.t. the classification problem. The main limitation of scoring approaches is that they poorly account for feature inter-dependencies. Wrapper methods [2, 3, 4] tackle the whole combinatorial optimization problem, exploring the powerset of the feature set and computing for each candidate subset an estimate of the generalization error. Embedded approaches [5, 6, 7] incorporate sparsity criteria to achieve FS during learning, or exploit the learned hypothesis [8, 9] to compute an educated score of feature relevance.
This paper formalizes wrapper FS approaches as a Reinforcement Learning (RL) problem. This
formalization leads to a provably optimal, though intractable, policy. An approximation of the
optimal policy is obtained by casting the RL problem as a one-player game and using the Upper
Confidence Tree (UCT) framework [10] as a robust approach for optimization under uncertainty.
The formalization of FS as an RL problem is described in Section 2. The derived UCT-based algorithm, called FUSE (for Feature UCT SElection), is presented in Section 3. Preliminary results on datasets from the NIPS 2003 FS challenge are discussed in Section 4.
2   Feature Selection seen as a Reinforcement Learning problem
Let us formalize FS as a Markov Decision Process (MDP). Let $\mathcal{F}$ denote the finite set of features, plus an additional stopping feature denoted $f_s$. The state space $\mathcal{S}$ of the MDP is the powerset of $\mathcal{F}$. Final states are all subsets $F \subseteq \mathcal{F}$ containing $f_s$. The action space likewise is the set of features. In any non-final state $F$, a possible action is to select a feature in $\mathcal{F} \setminus F$: letting $p : \mathcal{S} \times \mathcal{F} \times \mathcal{S} \to \mathbb{R}^+$ be the transition function, $p(F_d, f, F_{d+1})$ is non-zero only if $F_{d+1} = F_d \cup \{f\}$.
The reward function, defined for final states $F$ only, is the generalization error of the hypothesis $\mathcal{A}(F \setminus \{f_s\})$ learned from the selected features, denoted $Err(\mathcal{A}(F))$ for short. Letting the starting state be the empty set, we denote $S_\perp$ the final state built by iteratively applying policy $S$. The optimal policy $S^\star$ is the one minimizing $Err(\mathcal{A}(S_\perp))$:
$$ S^\star = \operatorname*{argmin}_{S} \; Err\left(\mathcal{A}\left(S_\perp\right)\right) \qquad (1) $$
Following Bellman's optimality principle [11] and recursively defining the value function $V^\star$ as
$$ V^\star(F) = \begin{cases} Err(\mathcal{A}(F)) & \text{if } F \text{ is final } (f_s \in F) \\ \min_{f \in \mathcal{F} \setminus F} V^\star(F \cup \{f\}) & \text{otherwise} \end{cases} \qquad (2) $$
it follows that the optimal policy $S^\star$ is defined as
$$ S^\star(F) = \operatorname*{argmin}_{f \in \mathcal{F} \setminus F} V^\star(F \cup \{f\}) \qquad (3) $$
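For illustration, the recursion of Equations (2)-(3) can be written down directly as a brute-force procedure; the sketch below uses a toy feature set and a hypothetical err oracle standing for Err(A(.)), and it is only tractable for a handful of features, in line with the intractability discussed next.

from functools import lru_cache

# Brute-force sketch of Equations (2)-(3); only tractable for a handful of
# features. err() is a hypothetical stand-in for the generalization error
# Err(A(F)), not the paper's actual learner.
FEATURES = frozenset(range(4))   # toy feature set
STOP = "f_s"                     # virtual stopping feature

def err(features):
    return 1.0 / (1 + len(features))     # placeholder error oracle (assumption)

@lru_cache(maxsize=None)
def v_star(state):
    # Equation (2): V*(F) is the error on final states, otherwise the best
    # value reachable by adding one more feature (possibly f_s).
    if STOP in state:
        return err(state - {STOP})
    return min(v_star(state | {f}) for f in (FEATURES | {STOP}) - state)

def s_star(state):
    # Equation (3): the optimal move from state F.
    return min((FEATURES | {STOP}) - state,
               key=lambda f: v_star(state | frozenset([f])))

# Example: first optimal move from the empty feature set.
print(s_star(frozenset()))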
While the above policy $S^\star$ is provably optimal, it does not lead to a tractable algorithm, as the state space is exponential in the number of features. Interestingly, a similar setting is investigated in [12],
formalizing Active Learning (AL) as an MDP.
The optimal AL strategy being likewise intractable, a one-player game setting was proposed to build
an approximation thereof. In the AL one-player game setting (respectively in the FS setting), each
move corresponds to selecting an example (resp. a feature). At the end of each game, i.e. after a
set of examples (resp. features) has been selected, the instant reward is computed as an unbiased
estimation of the generalization error of the hypothesis learned from these examples (resp. these
features).
As in [12], the design of a robust game strategy proposed here relies on the Upper Confidence Tree (UCT) framework [10], which extends the optimal exploration vs. exploitation tradeoff of the Multi-Armed Bandit algorithm (Upper Confidence Bound, UCB [13]) to sequential decision making.
3   Overview of the UCT-based Feature Selection: FUSE
This section briefly recalls the UCT framework for game strategy, which asymmetrically grows the game search tree (Fig. 1) to explore the best moves. We first define the instant reward attached to each FS game (Section 3.1) and then discuss the specific heuristics introduced to deal with a finite unknown horizon (the target number of relevant features). Only the case of binary classification is considered in the following.
Figure 1: FUSE, the UCT approach for FS, asymmetrically grows the search tree: within the explored tree, features are chosen following the UCB criterion; beyond a new node, features are chosen at random.
3.1   Instant reward function
An essential ingredient of Monte-Carlo tree search is a computationally cheap estimate of the reward, computed at each simulation (game). FUSE relies on an estimate of the generalization error attached to a feature subset F based on the k-Nearest Neighbor classifier (kNN), as
follows. Let $d_F$ denote the Euclidean distance based on the features in $F$. Let $\mathcal{L}$ and $\mathcal{V}$ respectively denote the training set and a small uniform sample of $\mathcal{L}$. For each labeled example $z = (x, y)$ in $\mathcal{V}$, consider the set $N_{F,k}(x)$ of the $k$ nearest neighbors of $x$ in $\mathcal{L}$ according to $d_F$; define $score(z)$ as the sum of the positive labels among these neighbors:
$$ score(z) = \sum_{z' \in N_{F,k}(x)} y' \qquad (4) $$
The instant reward of $F$ is finally computed as the Mann-Whitney-Wilcoxon criterion attached to $score$:
$$ reward(F) = \left|\left\{ (z, z') \in \mathcal{V}^2 : score(z) < score(z'),\; y < y' \right\}\right| \qquad (5) $$
Note that the calculation of the proposed instant reward is linear in the size n of L up to a logarithmic
term. Denoting m the size of V and d the size of the feature subset, computing the nearest neighbors
of all examples in V is Õ(mnd). While kNN raises some statistical and algorithmic difficulties, it provides a cheap and robust estimate of the local relevance of the feature subset.
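As an illustration, this reward might be computed along the following lines; the sketch uses NumPy, the variable names (X_train/y_train for L, X_val/y_val for V) and the index-array representation of the feature subset are assumptions, labels are taken in {0, 1}, and the pairwise count is written naively rather than via a rank-based computation matching the complexity discussed above.

import numpy as np

def instant_reward(X_train, y_train, X_val, y_val, feature_subset, k=5):
    # feature_subset: list/array of selected column indices.
    # Restrict the Euclidean distance d_F to the selected features.
    Xt = X_train[:, feature_subset]
    Xv = X_val[:, feature_subset]
    # k nearest neighbours in L of each example in V.
    dists = np.linalg.norm(Xv[:, None, :] - Xt[None, :, :], axis=2)
    knn = np.argsort(dists, axis=1)[:, :k]
    # Equation (4): score(z) = sum of the positive labels among the neighbours.
    scores = y_train[knn].sum(axis=1)
    # Equation (5): count the (negative, positive) pairs ranked in the right
    # order by score; dividing by |pos| * |neg| would give the familiar AUC.
    pos, neg = scores[y_val == 1], scores[y_val == 0]
    return float((neg[:, None] < pos[None, :]).sum())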
3.2   Upper Confidence Tree for Feature Selection
UCT [10] is a Monte-Carlo tree search algorithm where each simulation (aka game) explores a sequence of feature selections (aka moves), divided into two phases: a bandit-based phase and a random phase. Within the bandit phase, FUSE iteratively selects a feature $f^\star$ among those available in node (state) $F_d$, using the Upper Confidence Bound (UCB) criterion [13]:
$$ f^\star = \operatorname*{argmax}_{f \in \mathcal{F} \setminus F_d} \left\{ \hat{\mu}\left(F_d \cup \{f\}\right) + \sqrt{\frac{\log t(F_d)}{t\left(F_d \cup \{f\}\right)}} \right\} \qquad (6) $$
where t (F ) stands for the number of times node F has been visited, and µ̂ (F ) is the average of the
instant rewards (Section 3.1) collected when visiting node F .
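Read literally, the bandit phase of Equation (6) could be sketched as below; the dictionaries t and mu holding visit counts and average rewards, keyed by feature subsets, are assumptions about the surrounding tree-search code.

import math

def ucb_select(node, candidate_features, t, mu):
    # node: current feature subset F_d (frozenset); t[F] and mu[F] are the
    # visit count and average instant reward of node F, assumed to be
    # maintained by the caller.
    def ucb(f):
        child = node | {f}
        if t.get(child, 0) == 0:
            return float("inf")       # unvisited children are tried first
        return mu[child] + math.sqrt(math.log(t[node]) / t[child])
    return max(candidate_features, key=ucb)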
While the UCB criterion enforces an optimal tradeoff between exploration and exploitation (conditionally on the independence of the nodes), it is hindered by the huge number of features; this issue will be addressed in the next section.
At some point, the simulation arrives at a node F which does not belong to the search tree. This node is added to the tree and FUSE switches to the random phase: new features are uniformly selected. The termination of the simulation is ensured by selecting the stopping feature $f_s$ with probability $1 - q^d$, where d is the current number of selected features and $q \in [0, 1]$ is a parameter of the algorithm. This heuristic enforces the timely termination of each simulation and is expected to enable FUSE to estimate the optimal number of features.
Upon arriving at a final node F, the instant reward associated with F is computed as in Section 3.1 and the value µ̂ attached to each node of the visited branch is updated accordingly.¹
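A possible sketch of the random phase and of the 1 - q^d stopping heuristic follows; all_features, the random generator and the set representation are assumptions, and the returned subset would then be scored with the instant reward of Section 3.1.

import random

def random_phase(node, all_features, q=0.9, rng=random):
    # Continue the simulation from the newly added node: at each step, the
    # stopping feature f_s is drawn with probability 1 - q**d (d = number of
    # features selected so far); otherwise a remaining feature is drawn
    # uniformly at random.
    selected = set(node)
    while True:
        d = len(selected)
        remaining = all_features - selected
        if not remaining or rng.random() < 1.0 - q ** d:
            return frozenset(selected)    # f_s selected: final state reached
        selected.add(rng.choice(list(remaining)))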
3.3   Progressive Widening fed by RAVE
A general limitation of Multi-Armed Bandit algorithms, when dealing with a large number of moves compared to the number of allowed simulations, is to be biased toward exploration. This limitation is even more severe for UCT, which must resist over-exploration at each level of the tree. A heuristic proposed by [16] to prevent over-exploration is to gradually increase the number of considered features, depending on the number of times t(F) a node F has been visited. Formally, the Progressive Widening heuristic sets the allowed number of features to the integer part of $t(F)^{1/a}$, where $a > 1$ is a parameter of the algorithm.
¹ Formally, the search tree should be viewed as a graph rather than a tree (as in [14]). As in other applications of UCT [12, 15], however, only the current branch is updated in FUSE due to the high branching factor of nodes.
Note that the selection of the additionally considered feature offers room to take into account any prior knowledge [12] or any knowledge gained within the search. In FUSE, the knowledge gradually acquired along the search is summarized through the so-called Rapid Action Value Estimation (RAVE) vector, associating with each feature f the average instant reward of all final nodes F containing f:
$$ RAVE(f) = \operatorname*{average}_{F \text{ s.t. } \{f, f_s\} \subseteq F} reward(F) \qquad (7) $$
Regarding the stopping feature, RAVE is defined conditionally on the level of a node (which is also the size of its corresponding feature subset). Specifically, let $f_s^{(d)}$ be the stopping feature at level d; $RAVE(f_s^{(d)})$ is defined as the average instant reward of all final nodes F of size d + 1:
$$ RAVE(f_s^{(d)}) = \operatorname*{average}_{|F| = d+1,\; f_s \in F} reward(F) \qquad (8) $$
The knowledge encapsulated within the RAVE score is exploited through the Progressive Widening heuristic: whenever the set of considered features is enlarged, the top-ranked feature according to its RAVE value is added to it. The RAVE-enhanced Progressive Widening thus makes it possible to introduce promising features, including stopping features, and to avoid a hopeless uniform exploration of the feature space.
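The interplay of RAVE (Equations 7-8) and Progressive Widening could be sketched as follows; the Rave class and the per-level ("stop", d) keys are illustrative data structures, and only the running averages and the floor(t(F)^(1/a)) rule come from the text.

from collections import defaultdict

class Rave:
    # Running averages behind Equations (7) and (8).
    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, feature_subset, reward):
        # Equation (7): every (non-stopping) feature of a final node is
        # credited with the node's instant reward; Equation (8): the stopping
        # feature is credited per level d = |feature_subset|.
        for key in list(feature_subset) + [("stop", len(feature_subset))]:
            self.total[key] += reward
            self.count[key] += 1

    def score(self, key):
        return self.total[key] / self.count[key] if self.count[key] else 0.0

def candidate_features(node, visits, rave, all_features, a=2):
    # Progressive Widening fed by RAVE: when node F has been visited t(F)
    # times, only the floor(t(F)^(1/a)) features with the best RAVE scores
    # are considered by the UCB criterion of Equation (6).
    width = max(1, int(visits ** (1.0 / a)))
    ranked = sorted(all_features - node, key=rave.score, reverse=True)
    return ranked[:width]

In the bandit phase, candidate_features(F_d, t[F_d], rave, all_features) would then supply the candidate set used by the UCB sketch of Section 3.2.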
Note that the FUSE tree and the RAVE score are inter-dependent. While the RAVE score guides the FUSE exploration, FUSE can conversely be viewed as a sophisticated way of building an educated guess of feature relevance through the RAVE score.
3.4   Output
FUSE yields two types of solution. Firstly, the best path in the FUSE tree (obtained by iteratively selecting the most often visited child node, following standard UCT practice) is used to derive the top-d features, denoted $\hat{F}_d$ (legend FUSE in Table 1). Another possibility is to only exploit the RAVE score: $\hat{F}_d^R$ denotes the top-d features ranked by their RAVE value (legend RAVE in Table 1).
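In code, the two outputs might be read off the search structures as follows; children (mapping a node to its expanded child nodes) and the reuse of the Rave sketch of Section 3.3 are assumptions.

def fuse_top_d(root, children, t, d):
    # F_hat_d: follow the most visited child from the root, as in standard
    # UCT practice, and collect the d features met along this best path.
    node, selected = root, []
    while len(selected) < d and children.get(node):
        child = max(children[node], key=lambda c: t.get(c, 0))
        selected.extend(child - node)     # the single feature added at this step
        node = child
    return selected

def rave_top_d(rave, all_features, d):
    # F_hat_d^R: the d features with the highest RAVE scores.
    return sorted(all_features, key=rave.score, reverse=True)[:d]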
4   Experimental Validation
A proof of principle of the approach is obtained by running FUSE on two datasets from the NIPS 2003 FS challenge [17].
4.1   Experimental setting
The Madelon² dataset was considered as it involves a small ratio of relevant features: 480 out of 500 features are probes while the remaining 20 are built from 5 initial features. The Madelon challenge thus corresponds to finding the proverbial needle in the haystack. The Gisette problem was considered to investigate the scalability of the FUSE algorithm as it involves 5,000 features, half of which are probes.
FUSE is launched with 200,000 Monte-Carlo simulations on each problem. The learning algorithm A is a Support Vector Machine with Gaussian kernels³ [18]. The a parameter controlling the Progressive Widening (Section 3.3) is set to 2 for both problems. The q parameter controlling the size of the feature subsets (Section 3.2) is set to 0.9 for Madelon and to 0.9999 for Gisette. The k parameter of the kNN (Section 3.1) is set to 5.
The candidate solutions ($\hat{F}_d$ or $\hat{F}_d^R$) are assessed along three indicators: the predictive test error averaged over s splits (s = 40 for Madelon and s = 10 for Gisette) of the available dataset into a training set (containing 90% of the examples) and a test set; the challenge error returned by the challenge website; and the number d′ of irrelevant features out of the d features submitted.
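A sketch of the predictive-error indicator, using scikit-learn as an assumed stand-in for the SVM of [18], might look as follows; the split count and the 90/10 ratio come from the text, while the parameter grid and variable names are illustrative.

import numpy as np
from sklearn.model_selection import ShuffleSplit, GridSearchCV
from sklearn.svm import SVC

def predictive_error(X, y, feature_subset, n_splits=40, seed=0):
    # Average test error over n_splits random 90% / 10% splits of the
    # available data, restricted to the candidate feature subset.
    Xf = X[:, feature_subset]
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.1, random_state=seed)
    errors = []
    for train_idx, test_idx in splitter.split(Xf):
        # Gaussian-kernel SVM; C and the bandwidth are tuned by 10-fold
        # cross-validation on the training split, as in footnote 3.
        grid = GridSearchCV(SVC(kernel="rbf"),
                            {"C": [1, 10, 100], "gamma": ["scale", 0.01, 0.001]},
                            cv=10)
        grid.fit(Xf[train_idx], y[train_idx])
        errors.append(1.0 - grid.score(Xf[test_idx], y[test_idx]))
    return float(np.mean(errors))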
² For the Madelon dataset, features were centered and normalized.
³ The parameter C and the bandwidth σ are optimized by 10-fold cross-validation on the training set.
database   algorithm    predictive error   challenge error     d     d′
Madelon    FSPP2 [8]         -                  6.22%          12     0
Madelon    FUSE             9.1%                7.89%          15     1
Madelon    FUSE             7.9%                6.83%           8     0
Madelon    RAVE             7.8%                6.50%          18     0
Gisette    BB                -                  0.86%          NA    NA
Gisette    BB'               -                  1.11%         100     0
Gisette    FUSE             5.8%                5.94%          15     1
Gisette    RAVE             1.4%                1.63%         200     1
Gisette    RAVE             1.5%                1.51%         500    15
Table 1: Experimental validation of the FUSE algorithm. The predictive error is computed over s splits of the dataset into a learning and a testing subset (s = 40 for Madelon and s = 10 for Gisette). The challenge error is returned by the website (http://www.nipsfsc.ecs.soton.ac.uk/). Column d (resp. d′) corresponds to the number of features (resp. irrelevant features) in the submitted feature subset. BB (best baseline) corresponds to the current best submission on the website. BB' corresponds to the current best submission including information on features.
4.2   Results and interpretation
The empirical results are displayed in Table 1. On Madelon, the best solutions $\hat{F}_{18}^R$ and $\hat{F}_8$ (in terms of predictive error on validation sets) selected relevant features only. While this result confirms the ability of FUSE to select relevant features (out of 500), the challenge error remains higher than that of the best known algorithm on this dataset (FSPP2 [8]), and FUSE only reaches rank 15 in terms of challenge error.
On Gisette, the disappointing performance of FUSE is explained by the small size of the candidate feature subset (15 features as compared to 100 features for the best known algorithm). On the contrary, the performance of FS based on the RAVE score can be considered very promising: only 1 out of 200 (resp. 15 out of 500) of the selected features are irrelevant. While these feature subsets do not lead to predictive accuracies matching that of the best known algorithm on this dataset, they are nevertheless good (1.63% and 1.51%), and rank FUSE among the best 80 out of 300.
5   Discussion and Perspectives
The proposed formalization of Feature Selection as a Reinforcement Learning problem aimed at minimizing the generalization error is new to the best of our knowledge. The other main contribution of this paper is the FUSE algorithm, implementing an efficient approximation of the FS-RL problem, viewed as a one-player game and tackled using UCT.
Firstly, the use of virtual stopping features is introduced to handle the finite unknown horizon, corresponding to the target number of relevant features. Secondly, a frugal estimate of feature subset relevance is used as the reward function. Thirdly, the reward estimates (RAVE) extracted from the UCT tree are used to overcome the exploration bottleneck due to the large number of features relative to the computational budget.
A proof of concept on the NIPS 2003 Feature Selection Challenge shows the merits and weaknesses of the approach, and opens two main perspectives for further research.
While RAVE was meant to guide the FUSE search, it turns out that RAVE can be directly used to score the features. Further, feature selection based on the RAVE scores is found to be more efficient than FUSE itself on the Gisette problem. On the one hand, the weakness of FUSE is blamed on the insufficient depth of the tree search, leading to an insufficient exploration of the 5,000-feature search space. On the other hand, this suggests that FUSE might also be viewed as an educated way of scoring feature relevance,⁴ taking into account the feature correlations with regard to the classification problem.
⁴ Complementary experiments show that the RAVE score built from FUSE significantly outperforms the scores built from a uniform random exploration of the feature subsets.
A second perspective is to reconsider the reward function used in FUSE. While this reward must be computed from an aggressive subsample of the training set for computational tractability, a more sophisticated learner than the k-nearest neighbor classifier will be considered.
Acknowledgments
The authors thank Olivier Teytaud for fruitful discussions, and gratefully acknowledge the support
of the PASCAL2 Network of Excellence, IST-2007-216886.
References
[1] Y. Seldin and N. Tishby. Multi-classification by categorical features via clustering. In ICML'08, pages 920–927, 2008.
[2] T. Zhang. Multi-stage Convex Relaxation for Learning with Sparse Regularization. In NIPS'08, pages 1929–1936, 2008.
[3] M. Boullé. Compression-based averaging of selective Naive Bayes classifiers. J. Mach. Learn.
Res., 8:1659–1685, 2007.
[4] T. Zhang. Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear
Models. In NIPS’08, pages 1921–1928, 2008.
[5] F. Bach. Exploring large feature spaces with Hierarchical Multiple Kernel Learning. In
NIPS’08, pages 105–112, 2008.
[6] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. J. Mach. Learn. Res.,
9:2491–2521, 2008.
[7] J. Langford, L. Li, and T. Zhang. Sparse Online Learning via Truncated Gradient. J. Mach.
Learn. Res., 10:777–801, 2009.
[8] K. Q. Shen, C. J. Ong, X. P. Li, and E. P. V. Wilder-Smith. Feature selection via sensitivity
analysis of SVM probabilistic outputs. Mach. Learn., 70(1):1–20, 2008.
[9] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using
Support Vector Machines. Mach. Learn., 46(1-3):389–422, 2002.
[10] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In ECML'06, pages 282–293. Springer Verlag, 2006.
[11] R. Bellman. Dynamic Programming. Princeton Univ. Press, 1957.
[12] P. Rolet, M. Sebag, and O. Teytaud. Boosting Active Learning to optimality: a tractable
Monte-Carlo, Billiard-based algorithm. In ECML/PKDD’09, pages 302–317. Springer Verlag,
2009.
[13] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res.,
3:397–422, 2002.
[14] F. de Mesmay, A. Rimmel, Y. Voronenko, and M. Püschel. Bandit-based optimization on
graphs with application to library performance tuning. In ICML’09, pages 729–736, 2009.
[15] S. Gelly and D. Silver. Combining online and offline knowledge in UCT. In ICML’07, pages
273–280, 2007.
[16] R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Computers and Games, pages 72–83, 2006.
[17] I. Guyon, S. R. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 Feature
Selection challenge. In NIPS’04, pages 545–552, 2004.
[18] R. Collobert, S. Bengio, and J. Mariéthoz. Torch: A modular machine learning software library. Technical report, IDIAP, 2002.