Learning with Trees
Unit 6, Machine Learning, University of Vienna

Learning with Trees
- Quinlan's ID3 (C4.5)
- Classification and Regression Trees (CART)

Benefits of the decision tree
The computational cost of making the tree is fairly low, and the cost of using it is even lower: O(log N), where N is the number of data points. Important for machine learning:
- querying the trained algorithm is fast
- the result is immediately easy to understand, which makes people trust it more than getting an answer from a black box
E.g.: phone a helpline for computer faults. The phone operators are guided through the decision tree by your answers to their questions.

Idea of a decision tree
We start at the root (base) of the tree and progress down to the leaves, where we receive the classification decision. At each stage we choose the question that gives us the most information given what we know already. Encoding this mathematically is the task of information theory. The trees can even be turned into a set of if-then rules, suitable for use in a rule induction system.

Quick Aside: Entropy in Information Theory
Claude Shannon published "A Mathematical Theory of Communication" in 1948. Information entropy describes the amount of impurity in a set of features. The entropy H of a set of probabilities p_i is
  H(p) = -\sum_i p_i \log_2 p_i,
where the logarithm is base 2 because we are imagining that we encode everything using binary digits (bits), and we define 0 \log 0 = 0. In practice, \log_2 p = \ln p / \ln 2.

Binary decision theory
A binary decision is "positive" with probability p and "negative" with probability 1 - p. The entropy is maximal (equal to 1) at p = 0.5.

Decision tree
If a feature separates the examples into 50% positive and 50% negative, then the entropy is at a maximum, and knowing about that feature is very useful to us. For our decision tree, the best feature to pick as the one to classify on now is the one that gives us the most information, i.e. the one with the highest entropy. After using that feature, we re-evaluate the entropy of each remaining feature and again pick the one with the highest entropy.

ID3
Information gain: the entropy of the whole set S minus the entropy when a particular feature F is chosen. |S_f| is the number of members of S that have value f for feature F:
  Gain(S, F) = Entropy(S) - \sum_{f \in values(F)} (|S_f| / |S|) Entropy(S_f)

Example
Outcomes S = (s_1 = true, s_2 = false, s_3 = false, s_4 = false):
  Entropy(S) = -p(true) \log_2 p(true) - p(false) \log_2 p(false)
             = -(1/4) \log_2 (1/4) - (3/4) \log_2 (3/4) = 0.811
Suppose we have one feature F that can take the values f_1, f_2, f_3, and
  s_1 has value f_2, s_2 has value f_2, s_3 has value f_3, s_4 has value f_1.
Then
  (|S_{f_1}| / |S|) Entropy(S_{f_1}) = 0
  (|S_{f_2}| / |S|) Entropy(S_{f_2}) = (2/4) (-(1/2) \log_2 (1/2) - (1/2) \log_2 (1/2)) = 0.5
  (|S_{f_3}| / |S|) Entropy(S_{f_3}) = 0
  Gain(S, F) = 0.811 - 0.5 = 0.311
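The entropy and gain computations above are easy to check numerically. The following is a minimal sketch in Python (not code from the slides); the helper names entropy and information_gain are my own.

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    # Gain(S, F) = Entropy(S) - sum over f of (|S_f| / |S|) * Entropy(S_f).
    n = len(labels)
    remainder = 0.0
    for f in set(feature_values):
        subset = [lab for lab, v in zip(labels, feature_values) if v == f]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# The worked example above: S = (true, false, false, false),
# and F maps s1, s2, s3, s4 to f2, f2, f3, f1.
S = [True, False, False, False]
F = ['f2', 'f2', 'f3', 'f1']
print(round(entropy(S), 3))              # 0.811
print(round(information_gain(S, F), 3))  # 0.311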
The ID3 Algorithm
- if all examples have the same label:
  - return a leaf with that label
- else if there are no features left to test:
  - return a leaf with the most common label
- else:
  - choose the feature F̂ that maximises the information gain of S to be the next node
  - add a branch from the node for each possible value f of F̂
  - for each branch:
    - calculate S_f by removing F̂ from the set of features
    - recursively call the algorithm with S_f, to compute the gain relative to the current set of examples

Implementation in Python
For a graph, the key of each dictionary entry is the name of a node, and its value is a list of the nodes that it is connected to, as in this example:

graph = {'A': ['B', 'C'], 'B': ['C', 'D'], 'C': ['D'],
         'D': ['C'], 'E': ['F'], 'F': ['C']}

ID3 uses an inductive bias: the next feature added to the tree is the one with the highest information gain, which biases the algorithm towards smaller trees, since it tries to minimise the amount of information that is left. The shortest description of something, i.e. the most compressed one, is the best description. This is known as the Minimum Description Length principle (MDL) (Rissanen, 1989).

ID3
- ID3 can deal with noise and with missing data.
- Continuous variables: in general, only one split is made, rather than allowing three-way or higher splits.
- Univariate trees split according to one feature; multivariate trees split on a combination of features.
- To avoid overfitting: limit the size of the tree, or prune it (Quinlan's C4.5).

Naive pruning
- Run the decision tree algorithm until all of the features are used.
- Produce smaller trees by running over the tree, picking each node in turn and replacing the subtree beneath it with a leaf labelled with the most common classification of the subtree.
- The error of the pruned tree is evaluated on a validation set.
- The pruned tree is kept if its error is the same as or less than that of the original tree, and rejected otherwise.

Post-pruning
Implemented in C4.5:
- take the tree generated by ID3,
- convert it to a set of if-then rules,
- prune each rule by removing preconditions if the accuracy of the rule increases without them,
- sort the rules according to their accuracy on the training set and apply them in order.
The advantages of dealing with rules are that they are easier to read, and that their order in the tree does not matter, just their accuracy in the classification.

Classification and Regression Trees (CART)
CART can be used for both classification and regression. CART is constrained to construct binary trees, but a question that has three answers can always be split into two questions. Another information measure is the Gini impurity. A leaf is pure if all of the training data within it belong to just one class. Let N(i) be the fraction of the datapoints in the node that belong to class i. For any particular feature k we can compute
  G_k = \sum_{i=1}^{c} \sum_{j \ne i} N(i) N(j),
where c is the number of classes.

Gini impurity
Since there has to be some output class, \sum_{i=1}^{c} N(i) = 1 and \sum_{j \ne i} N(j) = 1 - N(i), which gives the equivalent form
  G_k = 1 - \sum_{i=1}^{c} N(i)^2.
The Gini impurity is equivalent to computing the expected error rate if the classification was picked according to the class distribution.

Gini impurity
We can change the information measure by adding weights to the misclassifications. Let \lambda_{ij} represent the risk (cost) of misclassifying class i as class j. The modified Gini impurity is
  G_k = \sum_{i=1}^{c} \sum_{j \ne i} \lambda_{ij} N(i) N(j).
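As a quick numerical check of the two Gini formulas above, here is a small Python sketch (mine, not from the slides); gini and weighted_gini are hypothetical helper names.

from collections import Counter

def gini(labels):
    # G = 1 - sum_i N(i)^2, the second form of the Gini impurity above.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(labels, risk):
    # G = sum over i != j of lambda_ij * N(i) * N(j); missing pairs default to lambda = 1.
    n = len(labels)
    p = {k: c / n for k, c in Counter(labels).items()}
    return sum(risk.get((i, j), 1.0) * p[i] * p[j]
               for i in p for j in p if i != j)

labels = ['a', 'a', 'b', 'c']                    # N(a) = 0.5, N(b) = N(c) = 0.25
print(gini(labels))                              # 0.625
print(weighted_gini(labels, {}))                 # 0.625: both forms agree when all lambda_ij = 1
print(weighted_gini(labels, {('a', 'b'): 5.0}))  # larger, since confusing a with b is penalised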
Regression Trees (CART)
The splitting criterion for regression is
  SS_Total - (SS_Left + SS_Right),
where SS_Total = \sum_i (y_i - \bar{y})^2 is the sum of squares for the node, and SS_Left and SS_Right are the sums of squares for the left and right branches, respectively. This is equivalent to choosing the split that maximises the between-groups sum of squares in a simple analysis of variance (ANOVA). We then pick the feature whose split point provides the best sum-of-squares error. The error of a node is the variance of y, as in ANOVA.

Example using ID3
Options: go to the pub, watch TV, go to a party, or study. The choice depends on an assignment (the deadline), the availability of a party, and how lazy we feel.

  Deadline  Party  Lazy  Activity
  Urgent    Yes    Yes   Party
  Urgent    No     Yes   Study
  Near      Yes    Yes   Party
  None      Yes    No    Party
  None      No     Yes   Pub
  None      Yes    No    Party
  Near      No     No    Study
  Near      No     Yes   TV
  Near      Yes    Yes   Party
  Urgent    No     No    Study

Example using ID3
To produce a decision tree for this problem, first decide which feature to use as the root node. Compute the entropy of S, with P(Party) = 0.5, P(Study) = 0.3, P(Pub) = 0.1, P(TV) = 0.1:
  Entropy(S) = -0.5 \log_2 0.5 - 0.3 \log_2 0.3 - 2 \times 0.1 \log_2 0.1 = 1.685
Then find which feature has the maximal information gain:
  Gain(S, Deadline) = 1.685 - (3/10)(-(1/3)\log_2(1/3) - (2/3)\log_2(2/3))
                            - (4/10)(-2 \times (1/4)\log_2(1/4) - (1/2)\log_2(1/2))
                            - (3/10)(-(1/3)\log_2(1/3) - (2/3)\log_2(2/3))
                    = 1.685 - 1.151 = 0.534

Example using ID3
  Gain(S, Party) = 1.685 - (5/10) \times 0
                         - (5/10)(-2 \times (1/5)\log_2(1/5) - (3/5)\log_2(3/5))
                 = 1.685 - 0.685 = 1.0
  Gain(S, Lazy)  = 1.685 - (6/10)(-(3/6)\log_2(3/6) - 3 \times (1/6)\log_2(1/6))
                         - (4/10)(-2 \times (1/2)\log_2(1/2))
                 = 1.685 - 1.475 = 0.21

Example using ID3
The root node will therefore be the Party feature, with two branches coming out of it (yes and no). All five cases on the yes branch are Party, so we put a leaf node there: "Go to party". For the no branch there are three different outcomes, so we need to choose another feature and apply the same procedure as before to the remaining examples.

Example - Solution in Python
The resulting tree:
  Party?
    Yes -> Party
    No  -> Deadline?
             Urgent -> Study
             None   -> Pub
             Near   -> Lazy?
                         Yes -> TV
                         No  -> Study

Example - Regression Trees (CART)
The regression example on the slides works with simulated data. Only a fragment of the R code survives in the extracted text (a1 and a2, presumably the two input variables, are not defined in it):

# fragment from the slides; a1 and a2 are assumed to be the simulated input variables
covariation <- matrix(c(5, 0.6, 0.6, 5), nrow = 2, ncol = 2)
means <- c(2, 7)
a3 <- (a1 - means[1]) * (a2 - means[2])

The remaining slides (Result - Regression Trees (CART); Iris - Regression Trees (CART); Iris - Cross-Validation; Iris - Results; Iris - Partition) consist of figures that are not reproduced here.
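Since the Iris slides above show figures only, here is a sketch of what such an analysis might look like in Python using scikit-learn (not the code from the slides): scikit-learn's DecisionTreeClassifier is a CART-style implementation (binary splits, Gini impurity by default), and cross_val_score supplies the cross-validation. The exact settings used in the lecture are not recoverable from the text, so the split sizes, depth limit, and fold count below are assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Limit the depth to keep the tree small (a simple form of pre-pruning); depth 3 is an assumption.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(export_text(tree, feature_names=list(iris.feature_names)))  # the learned partition
print("test accuracy:", tree.score(X_test, y_test))

# 10-fold cross-validation on the full data set.
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                         iris.data, iris.target, cv=10)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))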