Decision Trees (2)

Numerical attributes
• Decision trees work most naturally with discrete (nominal)
attributes.
• They can be extended to work with numeric attributes.
• At each node, check the value of one attribute against a constant.
Numerical attributes
• Tests in nodes can be of the form fi > constant or fi < constant
• In this way, attribute space is partitioned into disjoint regions
(rectangles in 2D space).
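A rough sketch of such a node test (not from the slides; the feature index and threshold below are made-up values) shows that routing a point into one of the two disjoint regions is just a comparison:

```python
# Minimal sketch: a node test of the form "fi > constant" that routes a
# point to one of two disjoint regions of attribute space.
def goes_right(point, feature_index, threshold):
    """True if the point's value on the tested attribute exceeds the constant."""
    return point[feature_index] > threshold

# Hypothetical 2-D points, tested on feature 0 against the constant 0.9.
print(goes_right((1.2, 3.0), feature_index=0, threshold=0.9))  # True
print(goes_right((0.5, 3.0), feature_index=0, threshold=0.9))  # False
```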
Decision Boundaries
Predicting Bankruptcy
(Leslie Kaelbling’s example, MIT courseware)
Considering splits
• Consider splitting between each data point in each dimension.
• So, here we'd consider 9 different splits in the R dimension.
Considering splits II
• And there are 6 other possible splits in the L dimension
– because L is an integer, there are lots of duplicate L values.
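The slides only say "splitting between each data point"; a common convention, assumed in the sketch below, is to sort the distinct values of an attribute and place a candidate threshold midway between consecutive values, which is also why duplicate L values generate no extra splits.

```python
# Sketch, assuming candidate thresholds sit midway between consecutive
# distinct sorted values of an attribute.
def candidate_splits(values):
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# Hypothetical L values with duplicates: only the gaps between distinct
# values produce candidate splits.
L = [1, 1, 2, 2, 2, 3, 4, 4, 5, 6, 10]
print(candidate_splits(L))  # [1.5, 2.5, 3.5, 4.5, 5.5, 8.0]
```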
Bankruptcy Example
• We consider all the possible splits in all dimensions, and
compute the average entropies of the children.
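The "average entropy of the children" can be read as a size-weighted average of the two children's entropies. A small sketch under that reading (the class labels below are made up, not the bankruptcy data):

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum_j p_j * log2(p_j)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def average_entropy(left_labels, right_labels):
    """Weighted average entropy of the two children of a split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * entropy(left_labels)
            + len(right_labels) / n * entropy(right_labels))

# Hypothetical split of class labels into two children.
print(average_entropy([0, 0, 0, 1], [1, 1, 1, 0]))  # ≈ 0.811
```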
Bankruptcy Example
• The best split is at L < 1.5
• All the points with L ≤ 1.5 are of class 0, so we can make a leaf.
Bankruptcy Example
• Now, we consider all the splits of the remaining part of space.
• Note that we have to recalculate all the average entropies, because the points that fall into the leaf node are taken out of consideration.
Bankruptcy Example
• Now the best split is at R > 0.9. And we see that all the points
for which that's true are positive, so we can make another leaf.
Bankruptcy Example
• Continuing in this way, we finally obtain the complete decision tree.
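For concreteness, here is a minimal sketch of the greedy procedure just walked through, reusing the candidate_splits and average_entropy helpers sketched earlier. The dictionary representation and the pure-node stopping test are my own choices, not part of the slides.

```python
def build_tree(points, labels):
    """Greedy decision-tree sketch for numeric attributes.

    Returns a nested dict: a leaf stores a class, an internal node stores
    the tested feature, the threshold, and the two subtrees."""
    if len(set(labels)) == 1:          # pure node: make a leaf
        return {"leaf": labels[0]}

    best = None                        # (average child entropy, feature, threshold)
    for f in range(len(points[0])):
        for t in candidate_splits([p[f] for p in points]):
            left = [y for p, y in zip(points, labels) if p[f] < t]
            right = [y for p, y in zip(points, labels) if p[f] >= t]
            score = average_entropy(left, right)
            if best is None or score < best[0]:
                best = (score, f, t)

    if best is None:                   # identical points with mixed labels
        return {"leaf": max(set(labels), key=labels.count)}

    _, f, t = best
    left = [(p, y) for p, y in zip(points, labels) if p[f] < t]
    right = [(p, y) for p, y in zip(points, labels) if p[f] >= t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([p for p, _ in left], [y for _, y in left]),
        "right": build_tree([p for p, _ in right], [y for _, y in right]),
    }
```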
Alternative Splitting Criterion based on GINI
• So far we have used as splitting criterion the entropy at the given node, computed by:
  Entropy(t) = − ∑j p(j | t) log p(j | t)
  (NOTE: p(j | t) is the relative frequency, i.e. the probability, of class j at node t.)
• An alternative impurity measure is GINI, computed by:
  GINI(t) = 1 − ∑j [p(j | t)]²
• Both have:
  – Maximum when records are equally distributed among all classes, implying least interesting information
  – Minimum when all records belong to one class, implying most interesting information
• For a 2-class problem, both measures reduce to a function of a single class proportion p.
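A small sketch that just evaluates the two formulas above on class-probability vectors and confirms the maximum/minimum behaviour listed in the bullets (log base 2 is assumed for the entropy):

```python
from math import log2

def entropy_from_probs(probs):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), skipping zero probabilities."""
    return sum(-p * log2(p) for p in probs if p > 0)

def gini_from_probs(probs):
    """GINI(t) = 1 - sum_j [p(j|t)]^2."""
    return 1 - sum(p * p for p in probs)

# Records equally distributed over two classes: both measures are maximal.
print(entropy_from_probs([0.5, 0.5]), gini_from_probs([0.5, 0.5]))  # 1.0 0.5
# All records in one class: both measures are minimal.
print(entropy_from_probs([1.0, 0.0]), gini_from_probs([1.0, 0.0]))  # 0.0 0.0
```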
Regression* Trees
• Classification: predicts discrete class attributes
• Regression: predicts continuous class attributes
• Regression trees are similar to decision trees,
but with real-valued constant outputs at the leaves.
(*In statistics, regression is the process of computing an expression that predicts a
numeric quantity.)
Splitting and Leaf values
• Assume that multiple training points are in the leaf and we have
decided, for whatever reason, to stop splitting.
– if the class is nominal, we use the majority output value as the value
for the leaf.
– if the class is numeric, we'll use the average output value.
• So, if we're going to use the average value at a leaf as its output,
we'd like to split up the data (in the parent node) so that the leaf
averages are not too far away from the actual class value of the
items in the leaf.
• Statistics has a good measure of how spread out a set of numbers is
– (and, therefore, how different the individuals are from the average);
– it's called the variance of a set.
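A tiny sketch of the leaf-value rule in the first bullets above: majority output value for a nominal class, average output value for a numeric one (the function name and example values are made up).

```python
from collections import Counter

def leaf_value(ys, numeric):
    """Leaf output: majority class for nominal targets, mean for numeric ones."""
    if numeric:
        return sum(ys) / len(ys)
    return Counter(ys).most_common(1)[0][0]

# Hypothetical leaves.
print(leaf_value(["yes", "no", "yes"], numeric=False))  # yes
print(leaf_value([3.0, 5.0, 4.0], numeric=True))        # 4.0
```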
Variance
Variance is a measure of how spread out a set of numbers is.
• Mean of m values, z1 through zm:
  mean = (z1 + z2 + … + zm) / m
• Variance is essentially the average of the squared distance between the individual values and the mean:
  variance = [ (z1 − mean)² + (z2 − mean)² + … + (zm − mean)² ] / (m − 1)
– If it's the average, then you might wonder why we're dividing by m − 1 instead of m.
– Dividing by m − 1 makes it an unbiased estimator of the variance based on a random sample of the population.
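A direct transcription of the two formulas above into code, dividing by m − 1 as discussed; with NumPy, numpy.var(zs, ddof=1) computes the same quantity.

```python
def mean(zs):
    """Mean of m values z1 .. zm."""
    return sum(zs) / len(zs)

def variance(zs):
    """Unbiased sample variance: squared distances to the mean, divided by m - 1."""
    m = len(zs)
    z_bar = mean(zs)
    return sum((z - z_bar) ** 2 for z in zs) / (m - 1)

# Hypothetical y values.
print(variance([2.0, 4.0, 6.0]))  # 4.0
```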
Splitting
• We're going to use the average variance of the children to
evaluate the quality of splitting on a particular feature.
• Here we have a data set, DB, for which just the y (class) values have been indicated.
Splitting
• Just as we did in the binary (classification) case, we can compute a weighted average variance for a split, depending on the relative sizes of the two sides of the split.
Splitting
• We can see that the average variance of splitting on feature f3 is much lower than that of splitting on f7, and so we'd choose to split on f3.
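The size-weighted average variance used to compare two splits can be sketched as below, reusing the variance helper from the previous slide; the y values are made up rather than taken from DB.

```python
def average_variance(left_ys, right_ys):
    """Weighted average variance of the two children, weighted by their sizes."""
    n = len(left_ys) + len(right_ys)
    return (len(left_ys) / n * variance(left_ys)
            + len(right_ys) / n * variance(right_ys))

# Hypothetical split: tight y clusters on each side give a low average variance.
print(average_variance([1.0, 1.2, 0.9], [9.8, 10.1, 10.0]))   # ≈ 0.023
# A split that mixes small and large y values scores much worse.
print(average_variance([1.0, 10.1, 0.9], [9.8, 1.2, 10.0]))   # ≈ 26.6
```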
Stopping
• Stop when the variance at the leaf is small enough.
• Then, set the value at the leaf to be the mean of the y (class)
values of the elements.
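Putting the pieces together, a minimal regression-tree sketch under the stopping rule above: stop when the node's variance is small enough (the 0.05 threshold is an arbitrary, made-up value), output the mean of the y values at the leaf, and otherwise recurse on the split with the lowest average child variance. It reuses candidate_splits, variance, and average_variance from the earlier sketches.

```python
def build_regression_tree(points, ys, min_variance=0.05):
    """Greedy regression-tree sketch: split on the lowest weighted average
    child variance; stop when the node's variance is small enough and
    output the mean of the y values at the leaf."""
    if len(ys) < 2 or variance(ys) <= min_variance:
        return {"leaf": sum(ys) / len(ys)}

    best = None  # (average child variance, feature, threshold)
    for f in range(len(points[0])):
        for t in candidate_splits([p[f] for p in points]):
            left = [y for p, y in zip(points, ys) if p[f] < t]
            right = [y for p, y in zip(points, ys) if p[f] >= t]
            if len(left) < 2 or len(right) < 2:
                continue  # children too small for a sample variance
            score = average_variance(left, right)
            if best is None or score < best[0]:
                best = (score, f, t)

    if best is None:  # no usable split left: make a leaf anyway
        return {"leaf": sum(ys) / len(ys)}

    _, f, t = best
    left = [(p, y) for p, y in zip(points, ys) if p[f] < t]
    right = [(p, y) for p, y in zip(points, ys) if p[f] >= t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_regression_tree([p for p, _ in left],
                                      [y for _, y in left], min_variance),
        "right": build_regression_tree([p for p, _ in right],
                                       [y for _, y in right], min_variance),
    }
```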