Skewing: An Efficient Alternative to Lookahead for Decision Tree Induction

David Page (Department of Biostatistics and Medical Informatics) and Soumya Ray (Department of Computer Sciences), University of Wisconsin, Madison, USA

Main Contribution
- Greedy tree learning algorithms suffer from myopia.
- Myopia is remedied by lookahead, which is computationally very expensive.
- We present an approach that efficiently addresses the myopia of tree learners.

Task Setting
- Given: m examples of n Boolean attributes each, labeled according to a function f over some subset of the n attributes.
- Do: Learn the Boolean function f.

TDIDT Algorithm
- Top-Down Induction of Decision Trees.
- Greedy algorithm: at each node it chooses the feature that locally optimizes some measure of "purity" of the class labels, such as Information Gain or the Gini Index.

TDIDT Example
  x1  x2  x3  Value
   0   0   0    +
   1   0   1    -
   0   1   0    -
   0   1   1    +

TDIDT Example (resulting tree)
- The root splits on x1. The branch x1=1 holds (1-) and becomes a "-" leaf; the branch x1=0 holds (2+, 1-).
- Within x1=0, the tree splits on x2: the branch x2=0 holds (1+); the branch x2=1 holds (1+, 1-).

Outline
- Introduction to the TDIDT algorithm
- Myopia and "hard" functions
- Skewing
- Experiments with the skewing algorithm
- Sequential skewing
- Experiments with sequential skewing
- Conclusions and future work

Myopia and Correlation Immunity
- For certain Boolean functions, no variable has "gain" according to standard purity measures (e.g., entropy, Gini): no single variable is correlated with the class. In cryptography, such functions are called correlation immune.
- Given such a target function, every variable looks equally good (or equally bad).
- In an application, the learner is therefore unable to differentiate between relevant and irrelevant variables.

A Correlation Immune Function
  x1  x2  f = x1 XOR x2
   0   0   0
   0   1   1
   1   0   1
   1   1   0

Examples
- In Drosophila, survival is an exclusive-or function of gender and the expression of the SxL gene.
- In drug binding (ligand-domain), binding may have an exclusive-or subfunction of ligand charge and domain charge.

Learning Hard Functions
- The standard method of learning hard functions with TDIDT is depth-k lookahead, which costs O(m * n^(2^(k+1) - 1)) for m examples in n variables.
- Can we devise a technique that allows TDIDT algorithms to efficiently learn hard functions?

Key Idea
- Correlation immune functions aren't hard if the data distribution is significantly different from uniform.

Example
- The uniform distribution can be sampled by setting each variable (feature) independently of all the others, with probability 0.5 of being set to 1.
- Consider instead a distribution where each variable has probability 0.75 of being set to 1.

Example (uniform distribution; target f = x1 XOR x2, with x3 irrelevant)
  x1  x2  x3  f
   0   0   0  0
   0   0   1  0
   0   1   0  1
   0   1   1  1
   1   0   0  1
   1   0   1  1
   1   1   0  0
   1   1   1  0

Under the uniform distribution, GINI(f) = 0.25, and GINI(f; xi=0) = GINI(f; xi=1) = 0.25 for every variable xi, so GAIN(xi) = 0 for all xi.

Example (skewed distribution: each variable set to 1 with probability 3/4)
  x1  x2  x3  f   Weight
   0   0   0  0    1/64
   0   0   1  0    3/64
   0   1   0  1    3/64
   0   1   1  1    9/64
   1   0   0  1    3/64
   1   0   1  1    9/64
   1   1   0  0    9/64
   1   1   1  0   27/64

Summing the weights over x3, the (x1, x2) settings (0,0), (0,1), (1,0), and (1,1) carry weight 1/16, 3/16, 3/16, and 9/16 respectively. Under these weights:
- GINI(f) = ((1+9)/16) * ((3+3)/16) = 60/256.
- GINI(f; x1=0) = (1/4) * (3/4) = 48/256, since within x1=0 the weight is 1/16 on f=0 and 3/16 on f=1 (total 4/16).
- GINI(f; x1=1) = (9/12) * (3/12) = 48/256, since within x1=1 the weight is 9/16 on f=0 and 3/16 on f=1 (total 12/16).
- GINI(f; x3=0) = (10/16) * (6/16) = 60/256, since within x3=0 the weight is 10/64 on f=0 and 6/64 on f=1 (total 16/64).

So under the skewed distribution the relevant variables x1 and x2 have positive gain, while the irrelevant variable x3 still has zero gain.

Key Idea (restated)
- Given a large enough sample and a second distribution sufficiently different from the first, we can learn functions that are hard for TDIDT algorithms under the original distribution.

Issues to Address
- How can we get a "sufficiently different" distribution? Our approach: "skew" the given sample by choosing "favored settings" for the variables.
- What about not-large-enough-sample effects? Our approach: average the "goodness" of any variable over multiple skews.

Skewing Algorithm
- For T trials do:
  - Choose a favored setting for each variable.
  - Reweight the sample.
  - Calculate the entropy of each variable split under this weighting.
  - For each variable that has sufficient gain, increment a counter.
- Split on the variable with the highest count.

Experiments
- ID3 vs. ID3 with skewing (ID3 is used to avoid issues to do with parameters, pruning, etc.).
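The worked Gini example and the reweighting step at the heart of skewing can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it weights the eight rows of the XOR truth table as if every variable's favored setting were 1, held with probability 3/4, and computes the Gini gain of each variable under the uniform and skewed weightings.

```python
from itertools import product

def gini(pairs):
    """Gini impurity of a weighted set of (label, weight) pairs."""
    total = sum(w for _, w in pairs)
    if total == 0:
        return 0.0
    p1 = sum(w for y, w in pairs if y == 1) / total
    return p1 * (1.0 - p1)

def gini_gain(data, weights, i):
    """Reduction in weighted Gini impurity from splitting on variable i."""
    total = sum(weights)
    g = gini([(y, w) for (_, y), w in zip(data, weights)])
    for v in (0, 1):
        branch = [(y, w) for (x, y), w in zip(data, weights) if x[i] == v]
        g -= (sum(w for _, w in branch) / total) * gini(branch)
    return g

# Target from the talk: f = x1 XOR x2, with x3 irrelevant.
data = [(x, x[0] ^ x[1]) for x in product((0, 1), repeat=3)]

# Uniform distribution: every row has weight 1/8.
uniform = [1.0 / 8] * len(data)

# Skew: favored setting 1 for every variable, held with probability 3/4,
# so a row's weight is the product of 3/4 (matches) or 1/4 (does not).
skew = []
for x, _ in data:
    w = 1.0
    for xi in x:
        w *= 0.75 if xi == 1 else 0.25
    skew.append(w)

gains_uniform = [gini_gain(data, uniform, i) for i in range(3)]
gains_skew = [gini_gain(data, skew, i) for i in range(3)]

# Under the uniform weighting every variable has zero gain (correlation
# immunity); under the skew, x1 and x2 gain 60/256 - 48/256 = 12/256,
# while the irrelevant x3 still gains nothing.
```

This reproduces the 60/256 and 48/256 values computed by hand above, and shows why a skew separates relevant from irrelevant variables.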
- Synthetic propositional data: examples of 30 Boolean variables, with target Boolean functions over 2-6 of these variables; both randomly chosen targets and randomly chosen hard targets.
- UCI datasets (Perlich et al., JMLR 2003), with 10-fold cross validation.
- Evaluation metric: Weighted Accuracy = the average of the accuracy over the positives and the accuracy over the negatives.

Results (3-variable Boolean functions)
[Two plots of accuracy vs. sample size (200-1000), comparing ID3 with and without skewing on random functions (left) and hard functions (right).]

Results (4-variable Boolean functions)
[Plots as above, for 4-variable targets.]

Results (5-variable Boolean functions)
[Plots as above, for 5-variable targets.]

Results (6-variable Boolean functions)
[Plots as above, for 6-variable targets.]

Current Shortcomings
- Sensitive to noise and to high-dimensional data.
- Very small signal on the hardest correlation immune functions (parity) given more than 3 relevant variables.
- Only very small gains on the real-world datasets attempted so far. Are there few correlation immune functions in practice? Or is it noise, dimensionality, or not enough examples?
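The weighted-accuracy metric used in the experiments can be stated directly from its definition above (the average of the accuracy on positive examples and the accuracy on negative examples); this is a sketch of that definition, not code from the authors.

```python
def weighted_accuracy(y_true, y_pred):
    """Average of the accuracy on positives and the accuracy on negatives."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    acc_pos = sum(p == 1 for p in pos) / len(pos)
    acc_neg = sum(p == 0 for p in neg) / len(neg)
    return (acc_pos + acc_neg) / 2

# On an imbalanced sample, weighted accuracy penalizes ignoring the minority
# class: predicting the majority label everywhere scores 0.5, not the
# majority-class fraction (0.8 here).
wacc = weighted_accuracy([1, 1, 1, 1, 0], [1, 1, 1, 1, 1])
```

This is why the metric is a fairer summary than plain accuracy on the skewed-class UCI datasets used in the evaluation.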