Decision Trees
What is a decision tree?
• Input = assignment of values for given attributes
– Discrete (often Boolean) or continuous
• Output = predicted value
– Discrete - classification
– Continuous - regression
• Structure:
– Internal node - tests one attribute
– Leaf node - output value (or linear function of attributes)
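As a concrete illustration, here is a minimal Python sketch of this node/leaf structure (the class and field names are my own, not from the slides):

    from dataclasses import dataclass, field

    @dataclass
    class Leaf:
        value: object            # output value (class label, or a regression value)

    @dataclass
    class Node:
        attribute: str           # the one attribute this internal node tests
        branches: dict = field(default_factory=dict)   # attribute value -> subtree

    def classify(tree, example):
        # Follow attribute tests from the root down to a leaf.
        while isinstance(tree, Node):
            tree = tree.branches[example[tree.attribute]]
        return tree.value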
Example Attributes
• Alternate - Boolean
• Bar - Boolean
• Fri/Sat - Boolean
• Hungry - Boolean
• Patrons - {None, Some, Full}
• Price - {$, $$, $$$}
• Raining - Boolean
• Reservation - Boolean
• Type - {French, Italian, Thai, burger}
• WaitEstimate - {0-10 min, 10-30, 30-60, >60}
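For the sketches in the rest of these notes, the attribute domains above can be written down as a plain Python mapping (an illustrative encoding, not part of the slides):

    ATTRIBUTES = {
        "Alternate":    [True, False],
        "Bar":          [True, False],
        "Fri/Sat":      [True, False],
        "Hungry":       [True, False],
        "Patrons":      ["None", "Some", "Full"],
        "Price":        ["$", "$$", "$$$"],
        "Raining":      [True, False],
        "Reservation":  [True, False],
        "Type":         ["French", "Italian", "Thai", "burger"],
        "WaitEstimate": ["0-10", "10-30", "30-60", ">60"],
    }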
Decision Tree for WillWait
(figure: example decision tree diagram from Russell & Norvig, not reproduced here)
Decision Tree Properties
• Propositional: attribute-value pairs
• Not useful for relationships between objects
• Universal: the space of possible decision trees includes every Boolean function
• Not efficient for all functions (e.g., parity, majority)
• Good for disjunctive functions
Decision Tree Learning
• From a training set, construct a tree
Methods for Constructing Decision Trees
• One path for each example in the training set
  – Not robust, little predictive value
• Rather, look for a “small” tree
  – Applying Ockham’s Razor
• ID3: simple non-backtracking search through the space of decision trees
Algorithm
• Recursive: pick a test with Choose-Attribute, split the examples on its values, and build a subtree for each subset (see the sketch below)
• Choose-Attribute is greedy - it uses the attribute that gives the greatest immediate information gain
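A minimal Python sketch of this greedy, non-backtracking construction, in the spirit of the AIMA pseudocode. It reuses Node/Leaf and ATTRIBUTES from the earlier sketches and information_gain from the entropy sketch below; the plurality tie-breaking is an assumption:

    from collections import Counter

    def plurality(examples):
        # Most common label among (attribute_dict, label) pairs.
        return Counter(label for _, label in examples).most_common(1)[0][0]

    def id3(examples, attributes, parent_examples=()):
        if not examples:                      # no examples left: use parent majority
            return Leaf(plurality(parent_examples))
        labels = [label for _, label in examples]
        if len(set(labels)) == 1:             # all examples agree: pure leaf
            return Leaf(labels[0])
        if not attributes:                    # no tests left: majority vote
            return Leaf(plurality(examples))
        # Choose-Attribute: greatest immediate information gain.
        best = max(attributes, key=lambda a: information_gain(examples, a))
        node = Node(best)
        rest = [a for a in attributes if a != best]
        for value in ATTRIBUTES[best]:
            subset = [(x, y) for x, y in examples if x[best] == value]
            node.branches[value] = id3(subset, rest, examples)
        return node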
Information/Entropy
• Information provided by knowing an answer:
  – Possible answers vi with probability P(vi)
  – I({P(vi)}) = Σi -P(vi) log2 P(vi)
  – I({0.5, 0.5}) = -0.5*(-1) - 0.5*(-1) = 1
  – I({0.01, 0.99}) = -0.01*(-6.6) - 0.99*(-0.014) ≈ 0.08
• Estimate probability from a set of examples
  – Example:
    • 6 yes, 6 no: estimate P(yes) = P(no) = 0.5
    • 1 bit required
  – After testing an attribute, take the weighted sum of the information required for the subsets of the examples
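These quantities are easy to compute; here is a small sketch (the names entropy, remainder, and information_gain are my own; examples are (attribute_dict, label) pairs as above):

    from math import log2

    def entropy(probs):
        # I({P(vi)}) = sum over i of -P(vi) * log2 P(vi); 0*log2(0) is taken as 0.
        return -sum(p * log2(p) for p in probs if p > 0)

    def label_entropy(examples):
        # Entropy of the label distribution estimated from the examples.
        labels = [label for _, label in examples]
        return entropy([labels.count(l) / len(labels) for l in set(labels)])

    def remainder(examples, attribute):
        # Weighted sum of entropy over the subsets created by testing the attribute.
        total = len(examples)
        rem = 0.0
        for value in ATTRIBUTES[attribute]:
            subset = [(x, y) for x, y in examples if x[attribute] == value]
            if subset:
                rem += len(subset) / total * label_entropy(subset)
        return rem

    def information_gain(examples, attribute):
        # Entropy before the test minus the weighted entropy after it.
        return label_entropy(examples) - remainder(examples, attribute)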
After Testing “Type”
2/12*I(1/2,1/2) + 2/12*I(1/2,1/2) + 4/12*I(1/2,1/2) + 4/12*I(1/2,1/2) = 1 bit
After Testing “Patrons”
2/12*I(0,1) + 4/12*I(1,0) + 6/12*I(2/6,4/6) ≈ 0.46 bits
So the gain is 1 - 1 = 0 bits for Type but 1 - 0.46 = 0.54 bits for Patrons: the greedy algorithm tests Patrons first.
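These two calculations can be checked with the sketch above, using the Patrons and Type columns of the 12-example AIMA restaurant training set:

    DATA = [
        ({"Patrons": "Some", "Type": "French"},  "yes"),
        ({"Patrons": "Full", "Type": "Thai"},    "no"),
        ({"Patrons": "Some", "Type": "burger"},  "yes"),
        ({"Patrons": "Full", "Type": "Thai"},    "yes"),
        ({"Patrons": "Full", "Type": "French"},  "no"),
        ({"Patrons": "Some", "Type": "Italian"}, "yes"),
        ({"Patrons": "None", "Type": "burger"},  "no"),
        ({"Patrons": "Some", "Type": "Thai"},    "yes"),
        ({"Patrons": "Full", "Type": "burger"},  "no"),
        ({"Patrons": "Full", "Type": "Italian"}, "no"),
        ({"Patrons": "None", "Type": "Thai"},    "no"),
        ({"Patrons": "Full", "Type": "burger"},  "yes"),
    ]

    print(remainder(DATA, "Type"))     # 1.0
    print(remainder(DATA, "Patrons"))  # ~0.459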
Repeat adding tests
• Note: induced tree not the same as “true” function
• Best we can do given the examples
Overfitting
• Algorithm can find meaningless regularity
• E.g., using date & time to predict the roll of a die
• Approaches to fixing:
– Stop the tree from growing too far
– Allow the tree to overfit, then prune
Approach 2 - Pruning
• Is a split irrelevant?
  – Information gain close to zero - but how close?
  – Assume there is no underlying pattern (the null hypothesis)
  – If a statistical test (e.g., chi-squared) shows < 5% probability that a deviation this large could arise under the null hypothesis, treat the attribute as relevant; otherwise prune the split
• Pruning provides noise tolerance
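A sketch of such a significance test in Python, using the chi-squared deviation measure from AIMA (the function name and the SciPy dependency are my choices):

    from scipy.stats import chi2   # chi2.sf(x, df) is the upper-tail p-value

    def split_is_significant(subset_counts, alpha=0.05):
        # subset_counts: one (yes, no) pair per branch of the candidate split.
        pos = sum(p for p, n in subset_counts)
        neg = sum(n for p, n in subset_counts)
        dev = 0.0
        for p, n in subset_counts:
            if p + n == 0:
                continue
            exp_p = pos * (p + n) / (pos + neg)   # expected yes-count under the null
            exp_n = neg * (p + n) / (pos + neg)   # expected no-count under the null
            if exp_p > 0:
                dev += (p - exp_p) ** 2 / exp_p
            if exp_n > 0:
                dev += (n - exp_n) ** 2 / exp_n
        df = len(subset_counts) - 1               # degrees of freedom
        return chi2.sf(dev, df) < alpha           # True: keep the split

    # The Patrons split above: (yes, no) = (0,2), (4,0), (2,4) -> significant
    print(split_is_significant([(0, 2), (4, 0), (2, 4)]))   # True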
Rule Post-Pruning
• Convert the tree into a set of rules (one per root-to-leaf path)
• Prune preconditions from (i.e., generalize) each rule when doing so improves its accuracy (sketch below)
• Rules may now overlap
• Consider rules in order of accuracy when doing classification
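A rough sketch of this procedure, assuming the Node/Leaf classes above and a held-out validation set; the greedy drop-one-precondition loop is one simple choice:

    def tree_to_rules(tree, conditions=()):
        # One rule per root-to-leaf path: (list of (attribute, value) tests, label).
        if isinstance(tree, Leaf):
            return [(list(conditions), tree.value)]
        rules = []
        for value, subtree in tree.branches.items():
            rules += tree_to_rules(subtree, conditions + ((tree.attribute, value),))
        return rules

    def rule_accuracy(rule, examples):
        # Accuracy of the rule on the examples that match all its preconditions.
        conds, label = rule
        matched = [(x, y) for x, y in examples if all(x[a] == v for a, v in conds)]
        return sum(y == label for _, y in matched) / len(matched) if matched else 0.0

    def prune_rule(rule, validation):
        # Greedily drop preconditions while that improves validation accuracy.
        conds, label = rule
        improved = True
        while improved and conds:
            improved = False
            for c in list(conds):
                shorter = [d for d in conds if d != c]
                if rule_accuracy((shorter, label), validation) > rule_accuracy((conds, label), validation):
                    conds, improved = shorter, True
                    break
        return (list(conds), label)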
Continuous-valued attributes
• Split based on some threshold
  – X < 97 vs X >= 97
• Many possible split points - typically the midpoints between adjacent sorted values (sketch below)
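A candidate-threshold search in the same style (midpoints between adjacent distinct sorted values; entropy is from the sketch above):

    def best_threshold(values, labels):
        # Return the split point X < t with the highest information gain.
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        base = entropy([labels.count(l) / n for l in set(labels)])
        best_gain, best_t = 0.0, None
        for i in range(1, n):
            if pairs[i - 1][0] == pairs[i][0]:
                continue                      # no boundary between equal values
            t = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [l for _, l in pairs[:i]]
            right = [l for _, l in pairs[i:]]
            rem = sum(len(side) / n *
                      entropy([side.count(l) / len(side) for l in set(side)])
                      for side in (left, right))
            if base - rem > best_gain:
                best_gain, best_t = base - rem, t
        return best_t, best_gain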
Multi-valued attributes
• The information gain of an attribute with many values may be huge (e.g., Date)
• Rather than absolute info gain, use the ratio of gain to SplitInformation:
  – SplitInformation(S, A) = -Σi |Si|/|S| log2(|Si|/|S|)
  – GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
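In code, continuing the earlier sketches (split_information and gain_ratio are my names; the formula is the standard C4.5 one):

    def split_information(examples, attribute):
        # Entropy of the partition sizes themselves; large for many-valued attributes.
        total = len(examples)
        sizes = [sum(1 for x, _ in examples if x[attribute] == v)
                 for v in ATTRIBUTES[attribute]]
        return entropy([s / total for s in sizes if s > 0])

    def gain_ratio(examples, attribute):
        si = split_information(examples, attribute)
        return information_gain(examples, attribute) / si if si > 0 else 0.0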
Continuous-valued outputs
• Each leaf holds a linear function of the attributes, not a single value
• Regression Tree
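A sketch of fitting such a leaf with least squares (NumPy; the data here is made up purely to show the shape of the computation):

    import numpy as np

    def fit_leaf(X, y):
        # Fit y ~ X*w + b for the examples that reach this leaf.
        A = np.hstack([X, np.ones((len(X), 1))])    # append a bias column
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        return lambda x: float(np.append(x, 1.0) @ w)   # the leaf's linear function

    X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
    y = np.array([3.1, 2.4, 4.0, 7.1])
    leaf = fit_leaf(X, y)
    print(leaf(np.array([2.5, 1.5])))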
Missing Attribute Values
• An attribute value in an example may not be known
  – Assign the most common value amongst comparable examples
  – Or split into fractional examples based on the observed distribution of values (sketch below)
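A sketch of the fractional-example idea; examples here carry an explicit weight, an assumption not in the slides:

    def fractional_split(examples, attribute):
        # examples: (attribute_dict, label, weight) triples; a missing value is None.
        known = [(x, y, w) for x, y, w in examples if x[attribute] is not None]
        total = sum(w for _, _, w in known)
        subsets = {v: [] for v in ATTRIBUTES[attribute]}
        for v in subsets:
            # Fraction of known-valued weight that took this branch.
            frac = sum(w for x, _, w in known if x[attribute] == v) / total
            for x, y, w in examples:
                if x[attribute] == v:
                    subsets[v].append((x, y, w))
                elif x[attribute] is None:
                    subsets[v].append((x, y, w * frac))   # fractional example
        return subsets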
Credits
• Diagrams from “Artificial Intelligence: A Modern Approach” by Russell and Norvig