Decision Trees
Nicholas Ruozzi
University of Texas at Dallas
Based on the slides of Vibhav Gogate and David Sontag
Supervised Learning
β’ Input: labelled training data
β i.e., data plus desired output
• Assumption: there exists a function f that maps data items x to their correct labels
• Goal: construct an approximation to f
2
Today
β’ Weβve been focusing on linear separators
β Relatively easy to learn (using standard techniques)
β Easy to picture, but not clear if data will be separable
β’ Next two lectures weβll focus on other hypothesis spaces
β Decision trees
β Nearest neighbor classification
3
Application: Medical Diagnosis
β’ Suppose that you go to your doctor with flu-like symptoms
β How does your doctor determine if you have a flu that
requires medical attention?
4
Application: Medical Diagnosis
β’ Suppose that you go to your doctor with flu-like symptoms
β How does your doctor determine if you have a flu that
requires medical attention?
β Check a list of symptoms:
β’ Do you have a fever over 100.4 degrees Fahrenheit?
β’ Do you have a sore throat or a stuffy nose?
β’ Do you have a dry cough?
5
Application: Medical Diagnosis
• Just having some symptoms is not enough; you should also not have symptoms that are inconsistent with the flu
β’ For example,
– If you have a fever over 100.4 degrees Fahrenheit,
– and you have a sore throat or a stuffy nose,
– you probably do not have the flu (most likely just a cold)
6
Application: Medical Diagnosis
β’ In other words, your doctor will perform a series of tests
and ask a series of questions in order to determine the
likelihood of you having a severe case of the flu
β’ This is a method of coming to a diagnosis (i.e., a
classification of your condition)
β’ We can view this decision making process as a tree
7
Decision Trees
β’ A tree in which each internal (non-leaf) node tests the value
of a particular feature
β’ Each leaf node specifies a class label (in this case whether
or not you should play tennis)
8
Decision Trees
β’ Features: (Outlook, Humidity, Wind)
β’ Classification is performed root to leaf
β The feature vector (Sunny, Normal, Strong) would be classified as
a yes instance
9
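To make the root-to-leaf classification concrete, here is a minimal sketch (ours, not code from the slides), assuming the classic PlayTennis tree with Outlook at the root, Humidity under Sunny, and Wind under Rain; the dictionary encoding and function name are illustrative.

```python
# Each internal node tests one feature; leaves are class labels.
tree = {
    'feature': 'Outlook',
    'children': {
        'Sunny':    {'feature': 'Humidity',
                     'children': {'High': 'No', 'Normal': 'Yes'}},
        'Overcast': 'Yes',
        'Rain':     {'feature': 'Wind',
                     'children': {'Strong': 'No', 'Weak': 'Yes'}},
    },
}

def classify(node, example):
    """Walk from the root to a leaf, following the branch for each tested feature."""
    while isinstance(node, dict):
        node = node['children'][example[node['feature']]]
    return node

print(classify(tree, {'Outlook': 'Sunny', 'Humidity': 'Normal', 'Wind': 'Strong'}))  # Yes
```

With this encoding, the feature vector (Sunny, Normal, Strong) follows the path Outlook = Sunny → Humidity = Normal and is classified as a yes instance, matching the slide.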
Decision Trees
β’ Can have continuous features too
β Internal nodes for continuous features correspond to
thresholds
10
Decision Trees
β’ Decision trees divide the feature space into axis parallel
rectangles
[Figure: a decision tree whose internal nodes test x2 > 4, x1 ≤ 6, x2 > 5, x1 > 4, and x2 > 2, shown alongside the corresponding partition of the (x1, x2) plane (both axes running 0 to 8) into axis-parallel rectangles containing the + and − training points]
11
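To see why the regions are axis-parallel rectangles, consider this minimal sketch (ours; the exact structure of the tree in the figure is not recoverable from the slide text, so the tree below is illustrative and reuses only some of its thresholds). Each root-to-leaf path is a conjunction of threshold tests, i.e., a rectangle.

```python
def classify_point(x1, x2):
    """A small threshold tree: each root-to-leaf path defines an axis-parallel rectangle."""
    if x2 > 4:
        if x1 <= 6:
            return '+'        # rectangle: x1 <= 6 and x2 > 4
        return '-'            # rectangle: x1 > 6 and x2 > 4
    else:
        if x1 > 4:
            return '+'        # rectangle: x1 > 4 and x2 <= 4
        return '-'            # rectangle: x1 <= 4 and x2 <= 4

print(classify_point(3, 6), classify_point(7, 7), classify_point(5, 2))   # + - +
```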
Decision Trees
• In the worst case, a decision tree may require exponentially many nodes
[Figure: the four points of {0, 1}² labeled XOR-style: + at (0,1) and (1,0), − at (0,0) and (1,1)]
17
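To see why such cases force large trees, the sketch below (ours, assuming the XOR-style labeling suggested by the figure) checks that splitting on either single feature leaves every branch with an even mix of + and −, so the tree must keep splitting until each cell of the grid gets its own leaf; for n-bit parity this means exponentially many nodes.

```python
from collections import Counter

# XOR-style labeling on {0,1}^2: '+' when x1 != x2, '-' when x1 == x2.
data = [((x1, x2), '+' if x1 != x2 else '-') for x1 in (0, 1) for x2 in (0, 1)]

for i in (0, 1):                        # try splitting on x1, then on x2
    for v in (0, 1):
        labels = [y for (x, y) in data if x[i] == v]
        print(f"x{i+1} = {v}: {Counter(labels)}")   # always one '+' and one '-'
```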
Decision Tree Learning
β’ Basic decision tree building algorithm:
β Pick some feature/attribute
β Partition the data based on the value of this attribute
β Recurse over each new partition
18
Decision Tree Learning
β’ Basic decision tree building algorithm:
β Pick some feature/attribute (how to pick the βbestβ?)
β Partition the data based on the value of this attribute
β Recurse over each new partition (when to stop?)
Weβll focus on the discrete case first (i.e., each feature takes
a value in some finite set)
19
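A sketch of this basic procedure in Python (ours, not code from the slides); the naive attribute choice and the simple stopping rule mark exactly the two open questions on the slide, and the attribute choice is upgraded once information gain is defined later in the lecture.

```python
from collections import Counter

def majority_label(labels):
    """Most common label among the remaining examples (used at impure leaves)."""
    return Counter(labels).most_common(1)[0][0]

def learn_tree(data, labels, attributes):
    """Recursive tree construction for discrete features.

    data:       list of dicts mapping attribute name -> value
    labels:     class label for each row of data
    attributes: attribute names still available for splitting
    """
    if len(set(labels)) == 1 or not attributes:      # stop: pure node, or no attributes left
        return majority_label(labels)

    a = attributes[0]    # naive choice for now; replaced by information gain later
    remaining = [b for b in attributes if b != a]
    tree = {'attribute': a, 'children': {}}
    for value in set(x[a] for x in data):
        rows = [(x, y) for x, y in zip(data, labels) if x[a] == value]
        sub_data = [x for x, _ in rows]
        sub_labels = [y for _, y in rows]
        tree['children'][value] = learn_tree(sub_data, sub_labels, remaining)
    return tree
```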
Decision Trees
β’ What functions can be represented by decision trees?
Every function can be represented by a sufficiently
complicated decision tree
β’ Are decision trees unique?
20
Decision Trees
β’ What functions can be represented by decision trees?
– Every function mapping the (discrete) features to {+, −} can be represented by a sufficiently complicated decision tree
β’ Are decision trees unique?
β No, many different decision trees are possible for the
same set of labels
21
Choosing the Best Attribute
• Because the complexity of storage and classification increases with the size of the tree, we should prefer smaller trees
– The simplest models that explain the data are usually preferred over more complicated ones
– Finding the smallest such tree is an NP-hard problem
β Instead, use a greedy heuristic based approach to pick
the best attribute at each stage
22
Choosing the Best Attribute
x1, x2 ∈ {0, 1}
Which attribute should you split on?

Training data (x1, x2, y):
  x1  x2  y
  1   1   +
  1   0   +
  1   1   +
  1   0   +
  0   1   +
  0   0   −
  0   1   −
  0   0   −

Split on x1:  x1 = 1 → (y = −: 0, y = +: 4);  x1 = 0 → (y = −: 3, y = +: 1)
Split on x2:  x2 = 1 → (y = −: 1, y = +: 3);  x2 = 0 → (y = −: 2, y = +: 2)
23
Choosing the Best Attribute
x1, x2 ∈ {0, 1}
Which attribute should you split on?

Split on x1:  x1 = 1 → (y = −: 0, y = +: 4);  x1 = 0 → (y = −: 3, y = +: 1)
Split on x2:  x2 = 1 → (y = −: 1, y = +: 3);  x2 = 0 → (y = −: 2, y = +: 2)

Can think of these counts as probability distributions over the labels: e.g., given x1 = 1, the probability that y = + is equal to 1
24
Choosing the Best Attribute
β’ The selected attribute is a good split if we are more
βcertainβ about the classification after the split
– If each partition with respect to the chosen attribute contains only a single class label, we are completely certain about the classification after partitioning
β If the class labels are evenly divided between the partitions, the
split isnβt very good (we are very uncertain about the label for
each partition)
β What about other situations? How do you measure the
uncertainty of a random process?
25
Discrete Probability
• Sample space Ω specifies the set of possible outcomes
– For example, Ω = {H, T} would be the set of possible outcomes of a coin flip
• Each element ω ∈ Ω is associated with a number p(ω) ∈ [0, 1] called a probability, with
  ∑_{ω ∈ Ω} p(ω) = 1
– For example, a biased coin might have p(H) = .6 and p(T) = .4
26
Discrete Probability
• An event is a subset of the sample space
– Let Ω = {1, 2, 3, 4, 5, 6} be the 6 possible outcomes of a die roll
– A = {1, 5, 6} ⊆ Ω would be the event that the die comes up as a one, five, or six
• The probability of an event is just the sum of the probabilities of the outcomes that it contains
– P(A) = p(1) + p(5) + p(6)
27
Independence
• Two events A and B are independent if
  P(A ∩ B) = P(A)P(B)
• Let's suppose that we have a fair die: p(1) = … = p(6) = 1/6
• If A = {1, 2, 5} and B = {3, 4, 6}, are A and B independent?
[Figure: Venn diagram of Ω = {1, …, 6} with the disjoint events A = {1, 2, 5} and B = {3, 4, 6}]
28
Independence
• Two events A and B are independent if
  P(A ∩ B) = P(A)P(B)
• Fair die, A = {1, 2, 5}, B = {3, 4, 6}: are A and B independent?
• No!  P(A ∩ B) = 0 ≠ P(A)P(B) = 1/4
29
Independence
• Now, suppose that Ω = {(1,1), (1,2), …, (6,6)} is the set of all possible rolls of two unbiased dice
• Let A = {(1,1), (1,2), (1,3), …, (1,6)} be the event that the first die is a one and let B = {(1,6), (2,6), …, (6,6)} be the event that the second die is a six
• Are A and B independent?
[Figure: Venn diagram of the 36 outcomes with events A and B overlapping only at (1,6)]
30
Independence
• Now, suppose that Ω = {(1,1), (1,2), …, (6,6)} is the set of all possible rolls of two unbiased dice
• Let A = {(1,1), (1,2), (1,3), …, (1,6)} be the event that the first die is a one and let B = {(1,6), (2,6), …, (6,6)} be the event that the second die is a six
• Are A and B independent?
• Yes!  P(A ∩ B) = P({(1,6)}) = 1/36 = (1/6)·(1/6) = P(A)P(B)
31
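These event calculations are easy to verify numerically. A minimal sketch (ours, not from the slides) that builds the two-dice sample space and checks the independence example above; the helper name prob is an assumption.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair dice; each of the 36 outcomes has probability 1/36.
omega = list(product(range(1, 7), repeat=2))
p = {w: Fraction(1, 36) for w in omega}

def prob(event):
    """Probability of an event = sum of the probabilities of its outcomes."""
    return sum(p[w] for w in event)

A = {w for w in omega if w[0] == 1}   # first die is a one
B = {w for w in omega if w[1] == 6}   # second die is a six

print(prob(A & B), prob(A) * prob(B))   # 1/36  1/36  -> independent
```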
Conditional Probability
• The conditional probability of an event A given an event B with P(B) > 0 is defined to be
  P(A | B) = P(A ∩ B) / P(B)
• This is the probability of the event A ∩ B over the sample space Ω′ = B
• Some properties:
– ∑_{ω ∈ Ω} P(ω | B) = 1
– If A and B are independent, then P(A | B) = P(A)
32
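Conditional probability fits into the same numerical sketch; this reuses prob, A, and B from the snippet above (our helper names, not from the slides).

```python
def cond_prob(A, B):
    """P(A | B) = P(A ∩ B) / P(B), assuming P(B) > 0."""
    return prob(A & B) / prob(B)

# For independent events, conditioning changes nothing:
print(cond_prob(A, B), prob(A))   # 1/6  1/6
```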
Discrete Random Variables
• A discrete random variable, X, is a function from the state space Ω into a discrete space D
– For each x ∈ D,
    P(X = x) ≡ P({ω ∈ Ω : X(ω) = x})
  is the probability that X takes the value x
– P(X) defines a probability distribution:
    ∑_{x ∈ D} P(X = x) = 1
• Random variables partition the state space into disjoint events
33
Example: Pair of Dice
• Let Ω be the set of all possible outcomes of rolling a pair of dice
• Let P be the uniform probability distribution over all possible outcomes in Ω
• Let X(ω) be equal to the sum of the values showing on the pair of dice in the outcome ω
– P(X = 2) = ?
– P(X = 8) = ?
34
Example: Pair of Dice
• Let Ω be the set of all possible outcomes of rolling a pair of dice
• Let P be the uniform probability distribution over all possible outcomes in Ω
• Let X(ω) be equal to the sum of the values showing on the pair of dice in the outcome ω
– P(X = 2) = 1/36
– P(X = 8) = ?
35
Example: Pair of Dice
• Let Ω be the set of all possible outcomes of rolling a pair of dice
• Let P be the uniform probability distribution over all possible outcomes in Ω
• Let X(ω) be equal to the sum of the values showing on the pair of dice in the outcome ω
– P(X = 2) = 1/36
– P(X = 8) = 5/36
36
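A quick numerical check of these two values (a sketch, ours, not from the slides):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Distribution of X = sum of two fair dice.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
dist = {x: Fraction(c, 36) for x, c in counts.items()}

print(dist[2], dist[8])   # 1/36  5/36
```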
Discrete Random Variables
• We can have vectors of random variables as well:
  X(ω) = [X1(ω), …, Xn(ω)]
• The joint distribution P(X1 = x1, …, Xn = xn) is
  P(X1 = x1 ∩ ⋯ ∩ Xn = xn)
  typically written as
  p(x1, …, xn)
• Because Xi = xi is an event, all of the same rules from basic
probability apply
37
Entropy
β’ A standard way to measure uncertainty of a random
variable is to use the entropy
H(Y) = − ∑_y p(Y = y) log p(Y = y)
β’ Entropy is maximized for uniform distributions
β’ Entropy is minimized for distributions that place all their
probability on a single outcome
38
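A minimal sketch of the entropy computation (ours, not from the slides); the slide does not fix the base of the logarithm, so base 2 (entropy in bits) is assumed here.

```python
import math

def entropy(dist):
    """Entropy (in bits) of a discrete distribution, given as probabilities summing to 1."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 -- uniform distribution: maximal uncertainty
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0 -- all mass on one outcome: no uncertainty
```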
Entropy of a Coin Flip
[Figure: plot of H(Y) versus p, where Y is the outcome of a coin flip with probability of heads p; the entropy is largest at p = 0.5 and zero at p = 0 and p = 1]
39
Conditional Entropy
β’ We can also compute the entropy of a random variable
conditioned on a different random variable
H(Y | X) = − ∑_x p(X = x) ∑_y p(Y = y | X = x) log p(Y = y | X = x)
– This is called the conditional entropy
– This is the amount of information needed to quantify the random variable Y given the random variable X
40
Information Gain
β’ Using entropy to measure uncertainty, we can greedily
select an attribute that guarantees the largest expected
decrease in entropy (with respect to the empirical
partitions)
IG(X) = H(Y) − H(Y | X)
– Called the information gain
– Larger information gain corresponds to less uncertainty about Y given X
• Note that H(Y | X) ≤ H(Y)
41
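A sketch (ours, not from the slides) of how the empirical entropy and information gain could be computed from data, using plug-in probability estimates and base-2 logarithms; entropy_of_labels and information_gain are our names.

```python
import math
from collections import Counter, defaultdict

def entropy_of_labels(labels):
    """Empirical entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """IG = H(Y) - H(Y|X) for paired samples (values[i], labels[i])."""
    n = len(labels)
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)                 # partition the labels by the value of X
    h_cond = sum((len(g) / n) * entropy_of_labels(g) for g in groups.values())
    return entropy_of_labels(labels) - h_cond
```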
Decision Tree Learning
β’ Basic decision tree building algorithm:
β Pick the feature/attribute with the highest information
gain
β Partition the data based on the value of this attribute
β Recurse over each new partition
42
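With information gain defined, the only missing piece of the earlier learn_tree skeleton (after slide 19) is the attribute choice. A sketch of the plug-in (ours), assuming the information_gain helper defined after slide 41:

```python
def choose_attribute(data, labels, attributes):
    """Greedily pick the attribute with the highest empirical information gain."""
    return max(attributes,
               key=lambda a: information_gain([x[a] for x in data], labels))
```

Substituting this for the naive attributes[0] choice in learn_tree gives the standard greedy, ID3-style learner.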
Choosing the Best Attribute
x1, x2 ∈ {0, 1}
Which attribute should you split on? (Same data and candidate splits as before.)

Split on x1:  x1 = 1 → (y = −: 0, y = +: 4);  x1 = 0 → (y = −: 3, y = +: 1)
Split on x2:  x2 = 1 → (y = −: 1, y = +: 3);  x2 = 0 → (y = −: 2, y = +: 2)

What is the information gain in each case?
43
Choosing the Best Attribute
x1, x2 ∈ {0, 1}
Which attribute should you split on?

Split on x1:  x1 = 1 → (y = −: 0, y = +: 4);  x1 = 0 → (y = −: 3, y = +: 1)
Split on x2:  x2 = 1 → (y = −: 1, y = +: 3);  x2 = 0 → (y = −: 2, y = +: 2)

H(Y) = − (5/8) log (5/8) − (3/8) log (3/8)

H(Y | x1) = .5 [− 0 log 0 − 1 log 1] + .5 [− .75 log .75 − .25 log .25]

H(Y | x2) = .5 [− .5 log .5 − .5 log .5] + .5 [− .75 log .75 − .25 log .25]

IG(x1) − IG(x2) = [H(Y) − H(Y | x1)] − [H(Y) − H(Y | x2)] = H(Y | x2) − H(Y | x1) = − .5 log .5 > 0

Should split on x1
44
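The same numbers can be checked with the information-gain sketch from slide 41 (base-2 logs) on the eight training examples above; this snippet reuses entropy_of_labels and information_gain from that sketch.

```python
ys = ['+', '+', '+', '+', '+', '-', '-', '-']
x1 = [ 1,   1,   1,   1,   0,   0,   0,   0 ]
x2 = [ 1,   0,   1,   0,   1,   0,   1,   0 ]

print(entropy_of_labels(ys))        # H(Y)    ~ 0.954 bits
print(information_gain(x1, ys))     # IG(x1)  ~ 0.549 bits
print(information_gain(x2, ys))     # IG(x2)  ~ 0.049 bits  -> split on x1
```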
When to Stop
β’ If the current set is βpureβ (i.e., has a single label in the
output), stop
β’ If you run out of attributes to recurse on, even if the current
data set isnβt pure, stop and use a majority vote
• If a partition contains no data items, there is nothing to recurse on
β’ For fixed depth decision trees, the final label is determined
by majority vote
45
Decision Trees
β’ Because of speed/ease of implementation, decision trees
are quite popular
β Can be used for regression too!
β’ Decision trees will always overfit!
β It is always possible to obtain zero training error on the
input data with a deep enough tree (if there is no noise
in the labels)
β Solution?
46