
Skewing: An Efficient Alternative to Lookahead for Decision Tree Induction

David Page and Soumya Ray
Department of Biostatistics and Medical Informatics
Department of Computer Sciences
University of Wisconsin, Madison, USA
Main Contribution

- Greedy tree learning algorithms suffer from myopia
- Myopia is remedied by lookahead, which is computationally very expensive
- We present an approach that efficiently addresses the myopia of tree learners
Task Setting

Given: m examples of n Boolean attributes each, labeled according to a function f over some subset of the n attributes
Do: Learn the Boolean function f
TDIDT Algorithm

- Top-Down Induction of Decision Trees
- Greedy algorithm: chooses the feature that locally optimizes some measure of "purity" of the class labels
  - Information Gain
  - Gini Index
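As an illustration (a minimal sketch of ours, not code from the paper), this is how a greedy TDIDT learner scores a candidate split with the Gini index; information gain works analogously with entropy in place of Gini:

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 (for two classes this is 2p(1-p),
    twice the slides' p(1-p), so it ranks splits identically)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(examples, labels, attr):
    """Reduction in Gini impurity from splitting on Boolean attribute `attr`."""
    left = [y for x, y in zip(examples, labels) if x[attr] == 0]
    right = [y for x, y in zip(examples, labels) if x[attr] == 1]
    n = len(labels)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - children

def best_split(examples, labels, attrs):
    """Greedy TDIDT step: pick the attribute with the highest local gain."""
    return max(attrs, key=lambda a: gini_gain(examples, labels, a))
```

On the four-example table on the next slide, best_split would choose x1, the only attribute with nonzero gain.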
TDIDT Example

x1  x2  x3  Value
0   0   0   +
1   0   1   −
0   1   0   −
0   1   1   +
TDIDT Example

[Decision tree: root splits on x1. Branch x1 = 1 (1−) is a leaf labeled "−". Branch x1 = 0 (2+, 1−) splits on x2: x2 = 0 (1+), x2 = 1 (1+, 1−).]
Outline

- Introduction to TDIDT algorithm
- Myopia and "Hard" Functions
- Skewing
- Experiments with Skewing Algorithm
- Sequential Skewing
- Experiments with Sequential Skewing
- Conclusions and Future Work
Myopia and Correlation Immunity

- For certain Boolean functions, no variable has "gain" according to standard purity measures (e.g., entropy, Gini)
- No variable is correlated with the class; in cryptography such functions are called correlation immune
- Given such a target function, every variable looks equally good (or equally bad)
- In an application, the learner will be unable to differentiate between relevant and irrelevant variables
A Correlation Immune Function

x1  x2  f = x1 ⊕ x2
0   0   0
0   1   1
1   0   1
1   1   0
Examples

- In Drosophila, Survival is an exclusive-or function of Gender and the expression of the SxL gene
- In drug binding (Ligand-Domain), binding may have an exclusive-or subfunction of Ligand Charge and Domain Charge
Learning Hard Functions

- Standard method of learning hard functions with TDIDT: depth-k lookahead
  - O(mn^(2^(k+1) − 1)) for m examples in n variables
- Can we devise a technique that allows TDIDT algorithms to efficiently learn hard functions?
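To make the cost concrete (our arithmetic, not a figure from the slides), instantiating the bound for n = 30 variables and depth k = 2:

\[
O\!\left(m\, n^{2^{k+1}-1}\right) = O\!\left(m \cdot 30^{7}\right) \approx O\!\left(m \cdot 2.2 \times 10^{10}\right)
\]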
Key Idea

Correlation immune functions aren't hard to learn if the data distribution is significantly different from uniform.
Example

- The uniform distribution can be sampled by setting each variable (feature) independently of all others, with probability 0.5 of being set to 1
- Consider instead a distribution where each variable has probability 0.75 of being set to 1
Example

x1  x2  x3  f = x1 ⊕ x2
0   0   0   0
0   0   1   0
0   1   0   1
0   1   1   1
1   0   0   1
1   0   1   1
1   1   0   0
1   1   1   0

Under the uniform distribution:
GINI(f) = 0.25
GINI(f; xi = 0) = 0.25
GINI(f; xi = 1) = 0.25
GAIN(xi) = 0
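As a quick check (a small sketch of ours, not from the slides), this zero-gain computation can be reproduced directly:

```python
from fractions import Fraction as F
from itertools import product

# Uniform distribution: every assignment of (x1, x2, x3) gets weight 1/8; target f = x1 XOR x2.
w = {xs: F(1, 8) for xs in product((0, 1), repeat=3)}
f = {xs: xs[0] ^ xs[1] for xs in w}

def gini(rows):
    """p(1-p) over the renormalised weights of `rows`, matching the slides' GINI."""
    total = sum(w[xs] for xs in rows)
    p = sum(w[xs] for xs in rows if f[xs]) / total
    return p * (1 - p)

print(gini(list(w)))                              # 1/4: GINI(f)
for i in range(3):                                # each branch also sits at 1/4,
    print(gini([xs for xs in w if xs[i] == 0]),   # so GAIN(x_i) = 0 for every variable
          gini([xs for xs in w if xs[i] == 1]))
```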
Example

Weights under the skewed distribution (each variable set to 1 with probability 0.75):

x1  x2  x3  f   Weight   Sum over x3
0   0   0   0   1/64
0   0   1   0   3/64     1/16
0   1   0   1   3/64
0   1   1   1   9/64     3/16
1   0   0   1   3/64
1   0   1   1   9/64     3/16
1   1   0   0   9/64
1   1   1   0   27/64    9/16
Example

x1  x2  f   Sum over x3
0   0   0   1/16
0   1   1   3/16
1   0   1   3/16
1   1   0   9/16

GINI(f) = ((1 + 9)/16) × ((3 + 3)/16) = (10/16)(6/16) = 60/256
Example

Rows with x1 = 0:

x1  x2  f   Sum over x3
0   0   0   1/16
0   1   1   3/16

GINI(f; x1 = 0) = (1/16 ÷ 4/16) × (3/16 ÷ 4/16) = (1/4)(3/4) = 48/256
Example

Rows with x1 = 1:

x1  x2  f   Sum over x3
1   0   1   3/16
1   1   0   9/16

GINI(f; x1 = 1) = (3/16 ÷ 12/16) × (9/16 ÷ 12/16) = (1/4)(3/4) = 48/256
Example

Rows with x3 = 0:

x1  x2  x3  f   Weight
0   0   0   0   1/64
0   1   0   1   3/64
1   0   0   1   3/64
1   1   0   0   9/64

GINI(f; x3 = 0) = (6/64 ÷ 16/64) × (10/64 ÷ 16/64) = (6/16)(10/16) = 60/256
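So under the skewed distribution the relevant variables stand out: splitting on x1 (or x2) drops the impurity from 60/256 to 48/256, while splitting on the irrelevant x3 leaves it at 60/256. A small sketch (ours, not from the slides) reproducing these numbers:

```python
from fractions import Fraction as F
from itertools import product

# Weight each assignment of (x1, x2, x3) as if every variable is 1 with probability 3/4.
w = {xs: F(3, 4) ** sum(xs) * F(1, 4) ** (3 - sum(xs)) for xs in product((0, 1), repeat=3)}
f = {xs: xs[0] ^ xs[1] for xs in w}           # target: f = x1 XOR x2, x3 irrelevant

def gini(rows):
    """p(1-p) over the renormalised weights of `rows`, matching the slides' GINI."""
    total = sum(w[xs] for xs in rows)
    p = sum(w[xs] for xs in rows if f[xs]) / total
    return p * (1 - p)

print(gini(list(w)))                          # 15/64 == 60/256: GINI(f)
print(gini([xs for xs in w if xs[0] == 0]))   # 3/16  == 48/256: GINI(f; x1=0)
print(gini([xs for xs in w if xs[0] == 1]))   # 3/16  == 48/256: GINI(f; x1=1)
print(gini([xs for xs in w if xs[2] == 0]))   # 15/64 == 60/256: GINI(f; x3=0), so x3 still has no gain
```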
Key Idea

Given
- a large enough sample, and
- a second distribution sufficiently different from the first,
we can learn functions that are hard for TDIDT algorithms under the original distribution.
Issues to Address

- How can we get a "sufficiently different" distribution?
  - Our approach: "skew" the given sample by choosing "favored settings" for the variables
- What about the effects of a not-large-enough sample?
  - Our approach: average the "goodness" of each variable over multiple skews
Skewing Algorithm

For T trials do:
- Choose a favored setting for each variable
- Reweight the sample
- Calculate the entropy of each variable's split under this weighting
- For each variable that has sufficient gain, increment a counter

Finally, split on the variable with the highest count (a sketch appears below).
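A compact sketch (ours) of the procedure above; the parameters trials, skew, and threshold and the use of information gain are illustrative choices, not the exact settings reported in the paper:

```python
import math
import random
from collections import Counter

def entropy(class_weights):
    """Weighted class entropy; `class_weights` maps label -> total example weight."""
    total = sum(class_weights.values())
    return -sum((w / total) * math.log2(w / total)
                for w in class_weights.values() if w > 0)

def weighted_gain(examples, labels, weights, attr):
    """Information gain of splitting on Boolean `attr` under per-example weights."""
    def dist(indices):
        d = Counter()
        for i in indices:
            d[labels[i]] += weights[i]
        return d
    idx = range(len(examples))
    parts = [[i for i in idx if examples[i][attr] == v] for v in (0, 1)]
    total = sum(weights)
    children = sum((sum(weights[i] for i in part) / total) * entropy(dist(part))
                   for part in parts if part)
    return entropy(dist(idx)) - children

def skewing_split(examples, labels, attrs, trials=30, skew=0.75, threshold=1e-3):
    """One skewing-based split choice: count, over random skews, how often
    each attribute shows sufficient gain, then split on the most frequent winner."""
    counts = Counter()
    for _ in range(trials):
        favored = {a: random.randint(0, 1) for a in attrs}      # favored setting per variable
        weights = [math.prod(skew if x[a] == favored[a] else 1.0 - skew for a in attrs)
                   for x in examples]                           # reweight the sample
        for a in attrs:
            if weighted_gain(examples, labels, weights, a) > threshold:
                counts[a] += 1
    return max(attrs, key=lambda a: counts[a])
```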
Experiments

- ID3 vs. ID3 with Skewing (ID3 chosen to avoid issues with parameters, pruning, etc.)
- Synthetic propositional data
  - Examples of 30 Boolean variables
  - Target Boolean functions of 2-6 of these variables
  - Randomly chosen targets and randomly chosen hard targets
- UCI datasets (Perlich et al., JMLR 2003)
- 10-fold cross validation
- Evaluation metric: Weighted Accuracy = average of accuracy over positives and negatives (sketched below)
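For reference (names ours), a minimal sketch of the weighted-accuracy metric:

```python
def weighted_accuracy(y_true, y_pred):
    """Average of accuracy on positive examples and accuracy on negative examples.
    Assumes both classes are present in y_true."""
    pos = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    neg = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    acc = lambda pairs: sum(t == p for t, p in pairs) / len(pairs)
    return (acc(pos) + acc(neg)) / 2
```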
Results (3-variable Boolean functions)

[Two plots of accuracy vs. sample size (200 to 1000), comparing ID3 without skewing and ID3 with skewing; left: random functions, right: hard functions.]
Results (4-variable Boolean functions)

[Two plots of accuracy vs. sample size (200 to 1000), comparing ID3 without skewing and ID3 with skewing; left: random functions, right: hard functions.]
Results (5-variable Boolean functions)

[Two plots of accuracy vs. sample size (200 to 1000), comparing ID3 without skewing and ID3 with skewing; left: random functions, right: hard functions.]
Results (6-variable Boolean functions)

[Two plots of accuracy vs. sample size (200 to 1000), comparing ID3 without skewing and ID3 with skewing; left: random functions, right: hard functions.]
Current Shortcomings

- Sensitive to noise and to high-dimensional data
- Very small signal on the hardest CI functions (parity) given more than 3 relevant variables
- Only very small gains on real-world datasets attempted so far
  - Few correlation immune functions in practice?
  - Noise, dimensionality, not enough examples?