Decision Trees
SDSC Summer Institute 2012
Natasha Balac, Ph.D.
© Copyright 2012, Natasha Balac

DECISION TREE INDUCTION
- A method for approximating discrete-valued functions
- Robust to noisy or missing data
- Can learn non-linear relationships
- Inductive bias towards shorter trees

Decision trees
- "Divide-and-conquer" approach
- Nodes test a particular attribute; the attribute value may be
  - compared to a constant,
  - compared to the value of another attribute, or
  - passed through a function of one or more attributes
- Leaves assign a classification, a set of classifications, or a probability distribution to instances
- An unknown instance is routed down the tree to a leaf

Decision Tree Learning Applications
- Medical diagnosis (e.g., heart disease)
- Analysis of complex chemical compounds
- Classifying equipment malfunctions
- Assessing the risk of loan applicants
- Boston housing data: price prediction

DECISION TREE FOR THE CONCEPT "Sunburn"

Name   Hair    Height   Weight   Lotion   Result
Sarah  blonde  average  light    no       sunburned (positive)
Dana   blonde  tall     average  yes      none (negative)
Alex   brown   short    average  yes      none
Annie  blonde  short    average  no       sunburned
Emily  red     average  heavy    no       sunburned
Pete   brown   tall     heavy    no       none
John   brown   average  heavy    no       none
Katie  blonde  short    light    yes      none

DECISION TREE FOR THE CONCEPT "Sunburn"
[Figure: the decision tree induced from the sunburn data: hair colour is tested at the root; blonde instances are then split on lotion use (no lotion -> sunburned, lotion -> none); red -> sunburned; brown -> none]

Medical Diagnosis and Prognosis
[Figure: a prognosis tree for heart-attack patients (Breiman et al., 1984). The root asks whether the minimum systolic blood pressure over the 24-hour period following admission to the hospital exceeds 91; patients at or below 91 are classified as Class 2 (early death). Otherwise the tree tests whether the patient's age exceeds 62.5; patients at or below 62.5 are Class 1 (survivors), and for older patients a final test on sinus tachycardia separates survivors (no) from early deaths (yes).]

Occam's Razor
"The world is inherently simple. Therefore the smallest decision tree that is consistent with the samples is the one that is most likely to identify unknown objects correctly."

Decision Tree Representation
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification

When to Consider Decision Trees
- Instances are describable by attribute-value pairs
  - each attribute takes a small number of disjoint possible values
- The target function has a discrete output value
- The training data may be noisy
  - it may contain errors
  - it may contain missing attribute values
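Those criteria are met by the sunburn data set above. As a concrete illustration that is not part of the original slides, the following minimal sketch induces a tree from that data, assuming scikit-learn and pandas are available; the entropy criterion mirrors the information-gain heuristic discussed in the slides that follow.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The sunburn data set from the slide above.
data = pd.DataFrame({
    "hair":   ["blonde", "blonde", "brown", "blonde", "red", "brown", "brown", "blonde"],
    "height": ["average", "tall", "short", "short", "average", "tall", "average", "short"],
    "weight": ["light", "average", "average", "average", "heavy", "heavy", "heavy", "light"],
    "lotion": ["no", "yes", "yes", "no", "no", "no", "no", "yes"],
    "result": ["sunburned", "none", "none", "sunburned", "sunburned", "none", "none", "none"],
})

X = pd.get_dummies(data.drop(columns="result"))   # one-hot encode the nominal attributes
y = data["result"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # text rendering of the tree
```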
Weather Data Set - Make the Tree!

Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No

DECISION TREE FOR THE CONCEPT "Play Tennis"
[Figure: the decision tree induced from the weather data above, shown alongside the data set (Mitchell, 1997)]

Constructing decision trees
- Normal procedure: top-down, in a recursive divide-and-conquer fashion
- First: an attribute is selected for the root node and a branch is created for each possible attribute value
- Then: the instances are split into subsets (one for each branch extending from the node)
- Finally: the procedure is repeated recursively for each branch, using only the instances that reach that branch
- The process stops when all instances at a node have the same class

Induction of Decision Trees - Top-down Method
Main loop:
- Pick the "best" decision attribute A for the next node
- Assign A as the decision (split) attribute for the node
- For each value of A, create a new descendant of the node
- Sort the training examples to the leaf nodes
- If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes

Which is the best attribute?
[Figure: candidate splits of the weather data on each attribute]

Attribute selection
- How to choose the best attribute? We want the smallest tree
- Heuristic: choose the attribute that produces the "purest" nodes
- Impurity criterion: information gain
  - increases with the average purity of the subsets produced by the attribute split
- Choose the attribute that results in the greatest information gain

Computing information
- Information is measured in bits
- Given a probability distribution, the information required to predict an event is the distribution's entropy
- Entropy gives the required information in bits (this can involve fractions of bits!)
- Formula for computing the entropy:
  entropy(p1, p2, ..., pn) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)

Expected information for attribute "Outlook"
- "Outlook" = "Sunny":    info([2,3]) = 0.971 bits
- "Outlook" = "Overcast": info([4,0]) = 0.0 bits
- "Outlook" = "Rainy":    info([3,2]) = 0.971 bits
- Total expected information: info([2,3],[4,0],[3,2]) = (5/14)*0.971 + (4/14)*0 + (5/14)*0.971 = 0.693 bits

Computing the information gain
- Information gain = information before splitting - information after splitting
- Gain("Outlook") = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
- Information gain for the attributes of the weather data:
  Gain("Outlook")  = 0.247 bits
  Gain("Temp")     = 0.029 bits
  Gain("Humidity") = 0.152 bits
  Gain("Windy")    = 0.048 bits

Which is the best attribute?
[Figure: "Outlook" gives the highest information gain and is chosen as the root]
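The following minimal Python sketch, which is not part of the original slides, reproduces the entropy and information-gain numbers above for the weather data; the function names are illustrative.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, child_counts):
    """Information gain = entropy before the split - weighted entropy after."""
    total = sum(parent_counts)
    after = sum(sum(c) / total * entropy(c) for c in child_counts)
    return entropy(parent_counts) - after

# Whole data set: 9 "Yes" and 5 "No" instances.
print(round(entropy([9, 5]), 3))                               # 0.94 bits
# Splitting on Outlook: Sunny -> [2,3], Overcast -> [4,0], Rainy -> [3,2]
print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247 bits
```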
Further splits
For the "Sunny" branch:
- Gain("Temp")     = 0.571 bits
- Gain("Humidity") = 0.971 bits
- Gain("Windy")    = 0.020 bits

Final product
[Figure: the finished decision tree for the weather data]

Purity Measure
Desirable properties:
- Pure node -> measure = zero
- Maximal impurity -> measure = maximal
- Multistage property: decisions can be made in several stages
  measure([2,3,4]) = measure([2,7]) + (7/9) measure([3,4])
- Entropy is the only function that satisfies all of these properties

Highly-branching attributes
- Attributes with a large number of values (example: an ID code)
- Subsets are more likely to be pure if there is a large number of values
- Information gain is therefore biased towards attributes with a large number of values
- This leads to overfitting

New version of Weather Data
[Figure/table: the weather data with an added "ID code" attribute that takes a different value for every day]

ID code attribute split
- Info([9,5]) = 0.940 bits before the split; every branch of the ID code split contains a single instance, so the information after the split is zero and the gain is the maximal 0.940 bits

Gain ratio
- A modification of information gain that reduces its bias
- Takes the number and size of branches into account when choosing an attribute
- Does so by taking the intrinsic information of a split into account
- Intrinsic information: the entropy of the distribution of instances into branches, i.e. how much information we need to tell which branch an instance belongs to

Computing the gain ratio
- Example: intrinsic information for the ID code
  info([1,1,...,1]) = 14 x (-1/14 log2(1/14)) = 3.807 bits
- The value of the attribute decreases as the intrinsic information gets larger
- Definition of gain ratio:
  gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
- Example:
  gain_ratio("ID code") = 0.940 / 3.807 = 0.247

Gain ratio example
[Figure/table: gains, intrinsic (split) information, and gain ratios for the four weather attributes]
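As a sketch of the calculation just described (again not taken from the original slides), the snippet below computes gain ratios for the ID code and Outlook splits. The helper names are illustrative, and the entropy function is repeated from the earlier sketch so that this one stands alone; printed values may differ slightly from rounded figures elsewhere.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    """Information gain divided by the intrinsic information of the split."""
    total = sum(parent_counts)
    gain = entropy(parent_counts) - sum(sum(c) / total * entropy(c) for c in child_counts)
    intrinsic = entropy([sum(c) for c in child_counts])   # split entropy, ignores the classes
    return gain / intrinsic

# ID code: 14 branches with one instance each -> intrinsic info = log2(14) = 3.807 bits
id_split = [[1, 0]] * 9 + [[0, 1]] * 5
print(round(gain_ratio([9, 5], id_split), 3))                     # about 0.247

# Outlook: intrinsic info of the 5/4/5 split is about 1.577 bits
print(round(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))     # about 0.156
```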
Avoid Overfitting
How can we avoid overfitting?
- Stop growing the tree when a data split is not statistically significant
- Grow the full tree, then post-prune
How to select the best tree?
- Measure performance over the training data
- Measure performance over a separate validation data set

Pruning
- Pruning simplifies a decision tree to prevent overfitting to noise in the data
- Post-pruning: takes a fully-grown decision tree and discards unreliable parts
- Pre-pruning: stops growing a branch when the information becomes unreliable
- Post-pruning is preferred in practice because pre-pruning can stop too early

Summary
- Algorithm for top-down induction of decision trees
- Probably the most extensively studied machine learning method used in data mining
- Different criteria for attribute/test selection rarely make a large difference
- Different pruning methods mainly change the size of the resulting pruned tree

WEKA Tutorial

REGRESSION TREE INDUCTION
Why regression trees? They give the ability to:
- Predict a continuous variable
- Model conditional effects
- Model uncertainty

Regression Trees (Quinlan, 1992)
- Continuous goal variables
- Induction by means of an efficient recursive partitioning algorithm
- Uses linear regression to select internal nodes

Regression trees
Differences from decision trees:
- Splitting: minimizes intra-subset variation
- Pruning: uses a numeric error measure
- A leaf node predicts the average class value of the training instances reaching that node
- Can approximate piecewise constant functions
- Easy to interpret and understand the structure
- Special kind: model trees

Model trees
- Regression trees with a linear regression function at each leaf
- Linear regression (LR) is applied to the instances that reach a node after the full regression tree has been built
- Only a subset of the attributes is used for each LR model
- Fast: the overhead for LR is minimal because only a small subset of attributes is used in the tree

Building the tree
- Splitting criterion: standard deviation reduction, where T is the portion of the training data reaching the node and T1, T2, ... are the sets that result from splitting the node on the chosen attribute:
  SDR = sd(T) - sum_i ( |Ti| / |T| ) x sd(Ti)
- Treating the standard deviation of the class values in T as a measure of the error at the node, SDR is the expected reduction in error from testing each attribute at the node
- Termination criteria:
  - the standard deviation becomes smaller than a certain fraction of the standard deviation for the full training set (5%)
  - too few instances remain (< 4)
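A minimal sketch of the standard deviation reduction criterion above; the example values and the function name are illustrative, not taken from the slides.

```python
from statistics import pstdev

def sdr(parent, subsets):
    """Standard deviation reduction for a candidate split.

    parent  -- numeric class values reaching the node (T)
    subsets -- class values of the subsets produced by the split (T1, T2, ...)
    """
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)

# Hypothetical class values at a node and one candidate binary split:
values = [3.1, 2.9, 7.8, 8.2, 3.0, 8.0]
left, right = [3.1, 2.9, 3.0], [7.8, 8.2, 8.0]
print(round(sdr(values, [left, right]), 3))   # about 2.381: a large reduction
```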
Nominal attributes
- Nominal attributes are converted into binary attributes (which can then be treated as numeric ones)
- The nominal values are sorted using the average class value
- If there are k values, k-1 binary attributes are generated
- It can be proven that the best split on one of the new attributes is the best binary split on the original
- M5' only does the conversion once

Pruning Model Trees
- Based on the estimated absolute error of the LR models
- Heuristic smoothing calculation, where p' is the prediction passed up to the next node, p is the prediction passed up from the node below, q is the value predicted by the model at this node, n is the number of training instances that reach the node below, and k is a smoothing constant:
  p' = (n*p + k*q) / (n + k)
- LR models are pruned by greedily removing terms to minimize the estimated error
- Heavy pruning is allowed: a single LR model can replace a whole subtree
- Pruning proceeds bottom-up: the error for the LR model at an internal node is compared to the error for the subtree

Pseudo-code for M5'
Four methods:
- Main method: MakeModelTree()
- Method for splitting: split()
- Method for pruning: prune()
- Method that computes error: subtreeError()
We'll briefly look at each method in turn. The linear regression method is assumed to perform attribute subset selection based on error.

MakeModelTree(instances)
  SD = sd(instances)
  for each k-valued nominal attribute
    convert into k-1 synthetic binary attributes
  root = newNode
  root.instances = instances
  split(root)
  prune(root)
  printTree(root)

split(node)
  if sizeof(node.instances) < 4 or sd(node.instances) < 0.05 * SD
    node.type = LEAF
  else
    node.type = INTERIOR
    for each attribute
      for all possible split positions of the attribute
        calculate the attribute's SDR
    node.attribute = attribute with maximum SDR
    split(node.left)
    split(node.right)

prune(node)
  if node.type = INTERIOR then
    prune(node.leftChild)
    prune(node.rightChild)
    node.model = linearRegression(node)
    if subtreeError(node) > error(node) then
      node.type = LEAF

subtreeError(node)
  l = node.left; r = node.right
  if node.type = INTERIOR then
    return (sizeof(l.instances) * subtreeError(l)
            + sizeof(r.instances) * subtreeError(r)) / sizeof(node.instances)
  else
    return error(node)

MULTI-VARIATE REGRESSION TREES
- All the characteristics of a regression tree
- Capable of predicting two or more outcomes
- Example: activity and toxicity, or monetary gain and time
- Balac & Gaines, ICML 2001

MULTI-VARIATE REGRESSION TREE INDUCTION
[Figure: an example multivariate regression tree; internal nodes test combinations of variables (Var 1 - Var 4) against thresholds, and each leaf predicts both Activity (e.g., 7.39, 7.05) and Toxicity (e.g., 2.89, 0.173)]

CLUSTERING
- Basic idea: group similar things together
- Unsupervised learning: useful when no other information is available
- K-means: partitions instances into k disjoint clusters using a measure of similarity
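To make the k-means idea concrete, here is a minimal pure-Python sketch that is not part of the original slides; the data, the choice of k, and the use of one-dimensional points are illustrative.

```python
import random

def kmeans(points, k, iterations=20):
    """Partition numeric points into k clusters by alternating
    assignment and centroid-update steps."""
    centroids = random.sample(points, k)          # pick k initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment: nearest centroid
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]   # update step
                     for i, c in enumerate(clusters)]
    return clusters

random.seed(1)   # fixed seed so the sketch is reproducible
data = [1.0, 1.2, 0.8, 7.9, 8.1, 8.3, 4.0, 4.2]
print(kmeans(data, k=3))
```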