Embedded Variable Selection in Classification Trees
Servane Gey (1), Tristan Mary-Huard (2)
(1) MAP5, UMR 8145, Université Paris Descartes, Paris, France
(2) UMR AgroParisTech/INRA 518, Paris, France
Overview
Introduction
• Binary classification setting
• Model and variable selection in classification
• Classification trees
Variable selection for CART
• Classes of classification trees
• Theoretical results
• Comparison with practice
Binary classification
Prediction of the unknown label $Y \in \{0,1\}$ of an observation $X$.
⇒ Use a training sample $D = (X_1, Y_1), \dots, (X_n, Y_n) \stackrel{i.i.d.}{\sim} P$ to build a classifier $\hat{f}$:
$$\hat{f} : \mathcal{X} \to \{0,1\}, \quad X \mapsto \hat{Y}.$$
Quality assessment
• Classification risk and loss: quality of the resulting classifier $\hat{f}$
$$L(\hat{f}) = P(\hat{f}(X) \neq Y \mid D), \qquad \ell(\hat{f}, f^*) = L(\hat{f}) - L(f^*),$$
where $f^*$ is the Bayes classifier.
• Average loss: quality of the classification algorithm
$$E_D[\ell(\hat{f}, f^*)].$$
Remark: all these quantities depend on $P$, which is unknown.
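Since $P$ is unknown, $L(\hat{f})$ is in practice estimated on held-out data. A minimal sketch in Python (toy data and scikit-learn, my choices, not from the talk):

```python
# Minimal sketch: estimate L(f_hat) = P(f_hat(X) != Y | D) by the
# empirical error on a held-out test set, since P is unknown.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                  # toy features
Y = (X[:, 0] > 0).astype(int)                  # toy labels
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

f_hat = DecisionTreeClassifier(random_state=0).fit(X_tr, Y_tr)
risk_hat = np.mean(f_hat.predict(X_te) != Y_te)  # hold-out estimate of L(f_hat)
print(f"hold-out estimate of L(f_hat): {risk_hat:.3f}")
```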
Basics of Vapnik theory: structural risk minimization (SRM)
Consider a collection of classes of classifiers $\mathcal{C}_1, \dots, \mathcal{C}_M$. Define
$$f_m = \arg\min_{f \in \mathcal{C}_m} L(f), \qquad \hat{f}_m = \arg\min_{f \in \mathcal{C}_m} L_n(f), \qquad \hat{f} = \arg\min_m \left[ L_n(\hat{f}_m) + \alpha \frac{V_{\mathcal{C}_m}}{n} \right].$$
• Class complexity
If $\mathcal{C}_1, \dots, \mathcal{C}_M$ have finite VC dimensions $V_{\mathcal{C}_1}, \dots, V_{\mathcal{C}_M}$, then
$$E_D[\ell(\hat{f}, f^*)] \leq C \inf_m \left\{ \ell(f_m, f^*) + K\left(\sqrt{\frac{V_{\mathcal{C}_m}}{n}} + \frac{\lambda}{n}\right) \right\}$$
(Vapnik, 1998).
• Classification task complexity (margin assumption)
If there exists $h \in \, ]0, 0.5[$ such that
$$P\left(\left|\eta(X) - 1/2\right| \leq h\right) = 0, \quad \text{with } \eta(x) = P(Y = 1 \mid X = x),$$
then
$$E_D[\ell(\hat{f}, f^*)] \leq C' \inf_m \left\{ \ell(f_m, f^*) + K\left(\frac{V_{\mathcal{C}_m}}{n} + \frac{\lambda'}{n}\right) \right\}$$
(Massart & Nédélec, 2006).
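A schematic sketch of SRM-style selection (not from the talk; the number of leaves stands in for the VC dimension $V_{\mathcal{C}_m}$, and $\alpha$ is arbitrary):

```python
# Schematic SRM sketch: minimize the penalized empirical risk over a
# collection of classes C_1, ..., C_M, here trees of maximum depth m.
# The leaf count is a crude stand-in for V_{C_m}; alpha is arbitrary.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def srm_select(X, Y, depths=range(1, 8), alpha=2.0):
    n = len(Y)
    best = None
    for m in depths:
        f_m = DecisionTreeClassifier(max_depth=m, random_state=0).fit(X, Y)
        L_n = np.mean(f_m.predict(X) != Y)   # empirical risk L_n(f_hat_m)
        V_m = f_m.get_n_leaves()             # complexity proxy for V_{C_m}
        crit = L_n + alpha * V_m / n         # penalized criterion
        if best is None or crit < best[0]:
            best = (crit, f_m)
    return best[1]
```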
Application to variable selection in classification
Assume that $X \in \mathbb{R}^p$. Define
$$f_{m(k)} = \arg\min_{f \in \mathcal{C}_{m(k)}} L(f), \qquad \hat{f}_{m(k)} = \arg\min_{f \in \mathcal{C}_{m(k)}} L_n(f).$$
• Variable selection
Choose $\hat{f}$ such that
$$\hat{f} = \arg\min_{m(k)} \left[ L_n(\hat{f}_{m(k)}) + \alpha \frac{V_{\mathcal{C}_{m(k)}}}{n} + \alpha' \frac{\log \binom{p}{k}}{n} \right].$$
Then (under the strong margin assumption)
$$E_D[\ell(\hat{f}, f^*)] \leq C \log(p) \inf_{m(k)} \left\{ \ell(f_{m(k)}, f^*) + K'\left(\frac{V_{\mathcal{C}_{m(k)}}}{n} + \frac{\lambda}{n}\right) \right\}$$
(Massart, 2000; Mary-Huard et al., 2007).
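The extra $\log\binom{p}{k}$ term is cheap to compute; a small sketch with illustrative constants (names and values are my choices, not from the talk):

```python
# Sketch of the variable-selection penalty above:
# pen(m(k)) = alpha * V_{C_m(k)} / n + alpha' * log(binom(p, k)) / n.
from math import comb, log

def vs_penalty(V_m, k, p, n, alpha=1.0, alpha_prime=1.0):
    """Class-complexity term plus the log-count of the binom(p, k)
    classes sharing the same a priori complexity."""
    return alpha * V_m / n + alpha_prime * log(comb(p, k)) / n

# e.g. p = 1000 candidate variables, k = 5 selected, n = 200 observations:
print(vs_penalty(V_m=10, k=5, p=1000, n=200))
```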
Classification trees
General strategy
Heuristic approach (CART, Breiman, 1984)
• Find a tree $T_{\max}$ such that $L_n(f_{T_{\max}}) = 0$,
• Prune $T_{\max}$ using the criterion
$$\hat{f} = \arg\min_{T \subseteq T_{\max}} \left[ L_n(f_T) + \alpha \frac{|T|}{n} \right].$$
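A minimal sketch of these two CART steps with scikit-learn, on toy data of my choosing: grow a tree with zero training error, then enumerate the optimal pruned subtrees of $T_{\max}$ under the cost-complexity criterion.

```python
# Grow T_max, then list the cost-complexity pruning path, i.e. one
# optimal subtree per candidate penalty alpha in L_n(f_T) + alpha*|T|/n.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
Y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)

T_max = DecisionTreeClassifier(random_state=0).fit(X, Y)  # L_n(f_Tmax) = 0
path = T_max.cost_complexity_pruning_path(X, Y)           # one subtree per alpha
print(path.ccp_alphas)                                    # candidate penalties
```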
Definitions
Consider a tree $T_{c\ell}$ with
- a given configuration $c$,
- a given list $\ell$ of associated variables.
Remark: the same variable may be associated with several nodes.
Class of tree classifiers
Define
$$\mathcal{C}_{c\ell} = \{f : f \text{ based on } T_{c\ell}\}, \qquad H_{c\ell} = \text{VC log-entropy of the class } \mathcal{C}_{c\ell},$$
$$f_{c\ell} = \arg\min_{f \in \mathcal{C}_{c\ell}} L(f), \qquad \hat{f}_{c\ell} = \arg\min_{f \in \mathcal{C}_{c\ell}} L_n(f).$$
Remark: two classifiers $f, f' \in \mathcal{C}_{c\ell}$ only differ in their thresholds and labels.
Risk bound for one class
Proposition
Assume that the strong margin assumption is satisfied. For all $C > 1$, there exist positive constants $K_1$ and $K_2$ depending on $C$ such that
$$E_D[\ell(\hat{f}_{c\ell}, f^*)] \leq C \left\{ \ell(f_{c\ell}, f^*) + K_1 \frac{|T_{c\ell}| \log(2n)}{n} \right\} + \frac{K_2}{n}.$$
Idea of proof
• Show that $E[H_{c\ell}] \leq |T_{c\ell}| \log(2n)$,
• Apply the general theory (Koltchinskii, 2006).
Combinatorics for variable selection
To take into account variable selection in the penalized criterion, one needs to count
the number of classes sharing the same a priori complexity.
• Parametric case (logistic regression, LDA, ...)
- One parameter per variable,
- Two classes whose classifiers are based on $k$ variables have the same a priori complexity,
⇒ $\binom{p}{k}$ classes of a priori complexity $k$.
• Classification trees
- One parameter per internal node (threshold),
- Two classes $\mathcal{C}_{c\ell}$ and $\mathcal{C}_{c'\ell'}$ such that $|T_{c\ell}| = |T_{c'\ell'}|$ have the same a priori complexity,
⇒ Count the number of classes based on trees of size $k$!
Combinatorics for variable selection
A tree $T_{c\ell}$ is defined by
- a configuration,
- a list of variables associated with each node.
• Number of configurations of size $k$:
$$N_c^k = \frac{1}{k} \binom{2k-2}{k-1}$$
• Number of variable lists of size $k$:
- the list is ordered: $\{1, 2, 3\} \neq \{2, 1, 3\}$,
- variables are selected with replacement: $\{1, 2, 1\}$ is allowed.
⇒ $N_\ell^k = p^{k-1}$ instead of $\binom{p}{k}$!
• Number of classes based on trees of size $|T_{c\ell}| = k$:
$$N^k = N_c^k \times N_\ell^k = \frac{1}{k} \binom{2k-2}{k-1} \times p^{k-1}$$
⇒ $\log(N^k) \leq \lambda |T_{c\ell}| \log(p)$
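A quick numerical check of this count (illustrative values of $k$ and $p$, my choices): $N_c^k$ is the Catalan number $C_{k-1}$, and $\log(N^k)$ is dominated by $k \log(p)$.

```python
# N_c^k = (1/k) * binom(2k-2, k-1) configurations (a Catalan number),
# N_l^k = p^(k-1) ordered-with-replacement variable lists, so
# log(N^k) <= lambda * k * log(p) for a moderate constant lambda.
from math import comb, log

def n_classes(k, p):
    N_c = comb(2 * k - 2, k - 1) // k   # configurations of size k
    N_l = p ** (k - 1)                  # variable lists of size k
    return N_c * N_l

p = 1000
for k in (2, 5, 10):
    print(k, log(n_classes(k, p)), k * log(p))  # log(N^k) vs k*log(p)
```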
Risk bound for tree classifiers
Proposition
Assume that the strong margin assumption is satisfied. If
$$\hat{f} = \arg\min_{c,\ell} \left( L_n(\hat{f}_{c\ell}) + \mathrm{pen}(c, \ell) \right),$$
where
$$\mathrm{pen}(c, \ell) = C_{h,1} \frac{|T_{c\ell}| \log(2n)}{n} + C_{h,2} \frac{|T_{c\ell}| \log(p)}{n}$$
with constants $C_{h,1}, C_{h,2}$ depending on the $h$ appearing in the margin condition, then there exist positive constants $C, C', C''$ such that
$$E_D[\ell(\hat{f}, f^*)] \leq C \log(p) \left\{ \inf_{c,\ell} \left\{ \ell(f_{c\ell}, f^*) + C' \frac{|T_{c\ell}| \log(2n)}{n} \right\} + \frac{C''}{n} \right\}.$$
Remark:
Theory: $\mathrm{pen}(c, \ell) = (a_n + b_n \log(p)) |T_{c\ell}| = \alpha_{p,n} |T_{c\ell}|$
Practice (CART): $\mathrm{pen}(c, \ell) = \alpha_{CV} |T_{c\ell}|$
Does $\alpha_{CV}$ match $\alpha_{p,n}$?
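A sketch of how this comparison can be set up empirically (all constants illustrative; only the shape of $\alpha_{p,n}$ is dictated by the theory, the functions below are my own scaffolding):

```python
# Pick alpha_CV by cross-validation over the pruning path, then compare
# its order of magnitude with alpha_{p,n} = a_n + b_n * log(p).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def alpha_cv(X, Y, cv=5):
    """CART-style choice of the cost-complexity constant by CV."""
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, Y)
    scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                              X, Y, cv=cv).mean() for a in path.ccp_alphas]
    return path.ccp_alphas[int(np.argmax(scores))]

def alpha_theory(n, p, a=1.0, b=1.0):
    """Theoretical shape (a*log(2n) + b*log(p)) / n; constants a, b unknown."""
    return (a * np.log(2 * n) + b * np.log(p)) / n
```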
Illustration on simulated data (1)
- Variables $X^1, \dots, X^p$ are independent,
- If $X^1 > 0$ and $X^2 > 0$, then $P(Y = 1) = q$; otherwise $P(Y = 1) = 1 - q$.
Remark: easy case
- The Bayes classifier belongs to the collection of classes,
- The strong margin assumption is satisfied.
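A minimal simulation of this easy design (parameter values are my choices, not from the talk):

```python
# p independent variables; the label is driven by the signs of X^1 and
# X^2, with P(Y = 1) = q inside the region and 1 - q outside.
import numpy as np

def simulate_easy(n=500, p=50, q=0.9, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))                 # independent variables
    in_region = (X[:, 0] > 0) & (X[:, 1] > 0)   # X^1 > 0 and X^2 > 0
    prob1 = np.where(in_region, q, 1 - q)       # P(Y = 1 | X)
    Y = rng.binomial(1, prob1)
    return X, Y
```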
Illustration on simulated data (2)
- $P(Y = 1) = 0.5$,
- For $j = 1, 2$, $X^j \mid Y = 0 \sim N(0, \sigma^2)$ and $X^j \mid Y = 1 \sim N(1, \sigma^2)$,
- Additional variables are independent and non-informative.
Remark: difficult case
- The Bayes classifier does NOT belong to the collection of classes,
- The strong margin assumption is NOT satisfied.
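A minimal simulation of this difficult design ($\sigma$ and the dimensions are my choices, not from the talk):

```python
# Two Gaussian informative variables shifted by the class label, plus
# independent non-informative noise variables.
import numpy as np

def simulate_difficult(n=500, p=50, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Y = rng.binomial(1, 0.5, size=n)        # P(Y = 1) = 0.5
    X = rng.normal(size=(n, p)) * sigma     # all variables N(0, sigma^2)
    X[:, :2] += Y[:, None]                  # X^j | Y=1 ~ N(1, sigma^2), j = 1, 2
    return X, Y
```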
Conclusion
Model selection for tree classifiers:
- Already investigated (Nobel 2002, Gey & Nédélec 2005, Gey 2010),
- Variable selection not investigated so far,
- The pruning step is now validated from this point of view.
Theory vs practice
- Theory : exhaustive search,
- Practice : forward strategy,
- Nonetheless, the theoretical results are informative!
Extension
- In this talk: the strong margin assumption,
- Can be extended to less restrictive margin assumptions,
- Manuscript on arXiv.org: http://arxiv.org/abs/1108.0757
Bibliography
Breiman L., Friedman J., Olshen R. & Stone, C. (1984) Classification And
Regression Trees, Chapman & Hall.
Gey S. & Nédélec E. (2005) Model selection for CART regression trees, IEEE Trans.
Inform. Theory, 51, 658–670.
Koltchinskii, V. (2006) Local Rademacher Complexities and Oracle Inequalities in
Risk Minimization, Annals of Statistics, 34, 2593–2656.
Mary-Huard T., Robin S. & Daudin J.-J. (2007) A penalized criterion for variable
selection in classification, J. of Mult. Anal., 98, 695–705.
Massart P. (2000) Some applications of concentration inequalities to statistics,
Annales de la Faculté des Sciences de Toulouse.
Massart P. & Nédélec E. (2006) Risk Bounds for Statistical Learning, Annals of
Statistics, 34, 2326–2366.
Nobel A.B. (2002) Analysis of a complexity-based pruning scheme for classification
trees, IEEE Trans. Inform. Theory, 48, 2362–2368.