Decision trees

Decision Trees (suggested time: 30 min)
• Definition
• Mechanism
• Splitting Functions
Issues in Decision-Tree Learning (if time permits)
• Avoiding overfitting through pruning
• Numeric and missing attributes
Applications to Security
Illustration
Example: Learning to identify Spam
[Figure: a decision tree for identifying spam. The root tests "Is the sender unknown?" (Yes / No); one branch leads directly to the leaf Spam, and the other leads to an internal node that tests "Number of Recipients": >= N leads to the leaf Spam, < N leads to the leaf Not Spam.]
Definition
• A decision-tree learning algorithm approximates a target concept using a tree representation, where each internal node corresponds to an attribute and every terminal node corresponds to a class.
• There are two types of nodes:
• Internal node: splits into different branches according to the different values the corresponding attribute can take. Example: Number of recipients <= N or Number of recipients > N.
• Terminal node: decides the class assigned to the example.
Classifying Examples
X = (Unknown Sender, Number of recipients > N)
[Figure: the same spam tree. The example X is routed from the root ("Is the sender unknown?") down the branches that match its attribute values until it reaches a leaf; the assigned class is Spam.]
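A minimal sketch of how such a tree can be represented and used to classify an example. The Node class, the attribute names, the threshold N = 5, and the exact branch layout (an unknown sender classified directly as Spam) are illustrative assumptions, not taken verbatim from the figure:

class Node:
    # A terminal node holds a class label; an internal node holds an attribute
    # test and one child subtree per possible outcome of the test.
    def __init__(self, label=None, test=None, branches=None):
        self.label = label
        self.test = test
        self.branches = branches

    def classify(self, example):
        if self.label is not None:               # terminal node: assign the class
            return self.label
        outcome = self.test(example)              # internal node: evaluate the test
        return self.branches[outcome].classify(example)

N = 5  # illustrative threshold on the number of recipients

spam_tree = Node(
    test=lambda e: e["sender_unknown"],
    branches={
        True: Node(label="Spam"),
        False: Node(
            test=lambda e: e["num_recipients"] >= N,
            branches={True: Node(label="Spam"), False: Node(label="Not Spam")},
        ),
    },
)

# X = (Unknown Sender, Number of recipients > N)
x = {"sender_unknown": True, "num_recipients": 12}
print(spam_tree.classify(x))   # -> Spam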
Appropriate Problems for Decision Trees
• Attributes can be numeric or nominal.
• The target function takes on a discrete set of values.
• The data may contain errors.
• Some examples may have missing attribute values.
Historical Information
Ross Quinlan – Induction of Decision Trees. Machine Learning Journal 1: 81-106, 1986 (over 8,000 citations).
Leo Breiman – CART (Classification and Regression Trees), 1984.
Mechanism
There are different ways to construct trees from data.
We will concentrate on the top-down, greedy search approach:
Basic idea:
1. Choose the best attribute a* to place at the root of the tree.
2. Separate the training set D into subsets {D1, D2, ..., Dk}, where each subset Di contains the examples having the same value for a*.
3. Recursively apply the algorithm to each new subset until all examples in a subset have the same class or there are too few of them.
Illustration
[Figure: a scatter plot of the training examples, with Duration on one axis and Destination Port on the other; the thresholds P1, D2, and D3 are marked. Class A: Attack, Class B: Benign.]
Attributes: Destination Port and Duration.
Destination Port has two values: > P1 or <= P1.
Duration has three values: > D2; <= D2 and > D3; <= D3.
Illustration
Suppose we choose Destination Port as the best attribute:
[Figure: the scatter plot split at P1 and the partial tree rooted at Destination Port. The branch > P1 contains only class A (Attack) examples and becomes a leaf labeled A; the branch <= P1 still contains a mixture of classes and is marked "?".]
Illustration
Suppose we choose Duration as the next best attribute:
[Figure: the scatter plot and the resulting tree. The root splits on Destination Port; the > P1 branch is a leaf labeled A. The <= P1 branch splits on Duration into three branches (> D2; > D3 and <= D2; <= D3), each of which now contains examples of a single class and becomes a leaf.]
Formal Mechanism
• Create a root for the tree.
• If all examples are of the same class, or the number of examples is below a threshold, return a leaf labeled with that class (the majority class in the threshold case).
• If no attributes are available, return a leaf labeled with the majority class.
• Let a* be the best attribute.
• For each possible value v of a*:
  • Add a branch below a* labeled "a* = v".
  • Let Sv be the subset of examples where attribute a* = v.
  • Recursively apply the algorithm to Sv.
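A minimal Python sketch of this top-down procedure for nominal attributes. The splitting function choose_best (for example, one based on the criteria introduced next) is passed in as a parameter; all names and the default threshold are illustrative rather than taken from the slides:

from collections import Counter

def build_tree(examples, labels, attributes, choose_best, min_examples=2):
    # examples:    list of dicts mapping attribute name -> (nominal) value
    # labels:      class labels, parallel to examples
    # attributes:  attribute names still available for splitting
    # choose_best: function picking the best attribute a* for a node
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure, has too few examples, or no attributes remain.
    if len(set(labels)) == 1 or len(examples) < min_examples or not attributes:
        return majority                                   # terminal node: a class label
    best = choose_best(examples, labels, attributes)      # the best attribute a*
    tree = {"attribute": best, "branches": {}}
    for v in set(x[best] for x in examples):              # one branch per value "a* = v"
        idx = [i for i, x in enumerate(examples) if x[best] == v]
        tree["branches"][v] = build_tree(
            [examples[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attributes if a != best],
            choose_best,
            min_examples,
        )
    return tree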
What attribute is the best to split the data?
Let us recall some definitions from information theory.
The entropy, a measure of the uncertainty associated with a random variable X, is defined as
H(X) = - Σ p_i log2 p_i
where the logarithm is taken in base 2.
This is the "average amount of information or entropy of a finite complete probability scheme" (An Introduction to Information Theory, F. Reza).
Example: two possible events A and B forming a complete scheme (e.g., flipping a biased coin):
• P(A) = 1/256, P(B) = 255/256: H(X) = 0.0369 bit
• P(A) = 1/2, P(B) = 1/2: H(X) = 1 bit
• P(A) = 7/16, P(B) = 9/16: H(X) = 0.989 bit
Entropy is a function concave downward.
[Figure: the binary entropy H as a function of p, rising from 0 at p = 0 to a maximum of 1 bit at p = 0.5 and returning to 0 at p = 1.]
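A quick Python check of the entropy values above (the helper name is mine):

import math

def entropy(probs):
    # H(X) = -sum p_i * log2(p_i); terms with p_i = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/256, 255/256]))   # ~0.0369 bit
print(entropy([1/2, 1/2]))         # 1.0 bit
print(entropy([7/16, 9/16]))       # ~0.989 bit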
Splitting based on Entropy
[Figure: the scatter plot with the sample split at P1 into two regions, S1 and S2.]
Destination Port divides the sample in two:
S1 = {6A, 0B}
S2 = {3A, 5B}
H(S1) = 0
H(S2) = -(3/8) log2(3/8) - (5/8) log2(5/8)
Splitting based on Entropy
[Figure: the scatter plot with the sample split at D2 and D3 into three regions, S1, S2, and S3.]
Duration divides the sample in three:
S1 = {2A, 2B}
S2 = {5A, 0B}
S3 = {2A, 3B}
H(S1) = 1
H(S2) = 0
H(S3) = -(2/5) log2(2/5) - (3/5) log2(3/5)
Information Gain
IG(A) = H(S) - Σ_v (|Sv| / |S|) H(Sv)
H(S) is the entropy of the whole sample S.
H(Sv) is the entropy of the subsample Sv obtained by partitioning S on value v of attribute A; the sum runs over all possible values of A.
Components of IG(A)
[Figure: the scatter plot split at P1 into S1 and S2 (the Destination Port split).]
H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14)
H(S1) = 0
H(S2) = -(3/8) log2(3/8) - (5/8) log2(5/8)
|S1|/|S| = 6/14
|S2|/|S| = 8/14
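To make these numbers concrete, here is a small Python sketch that computes the information gain of both attributes from the subset counts in the slides (the helper names and the tuple encoding of class counts are my own choices):

import math

def entropy(counts):
    # Entropy of a subset from its per-class counts, e.g. (3, 5) for {3A, 5B}.
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_gain(subsets):
    # IG = H(S) - sum over v of |Sv|/|S| * H(Sv), where S is the union of the subsets.
    total = [sum(cls) for cls in zip(*subsets)]    # per-class counts of S: here (9A, 5B)
    n = sum(total)
    remainder = sum(sum(s) / n * entropy(s) for s in subsets)
    return entropy(total) - remainder

# Destination Port splits S into S1 = {6A, 0B} and S2 = {3A, 5B}
print(info_gain([(6, 0), (3, 5)]))             # ~0.395
# Duration splits S into S1 = {2A, 2B}, S2 = {5A, 0B}, S3 = {2A, 3B}
print(info_gain([(2, 2), (5, 0), (2, 3)]))     # ~0.308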
Gain Ratio
Let's define the entropy of the attribute itself:
H(A) = - Σ_j p_j log2 p_j
where p_j is the probability that attribute A takes value v_j.
Then
GainRatio(A) = IG(A) / H(A)
Gain Ratio
[Figure: the scatter plot split at P1 into S1 and S2 (the Destination Port split).]
H(Destination Port) = -(6/14) log2(6/14) - (8/14) log2(8/14)
where |S1|/|S| = 6/14 and |S2|/|S| = 8/14.
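A matching sketch of the gain-ratio computation for Destination Port, reusing the information-gain value from the previous sketch (names are illustrative):

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# H(Destination Port): the attribute sends 6/14 of the sample down one branch
# and 8/14 down the other.
split_entropy = entropy([6/14, 8/14])       # ~0.985
ig = 0.395                                  # IG(Destination Port) from the previous sketch
print(ig / split_entropy)                   # GainRatio(Destination Port) ~ 0.40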
Security Applications
Decision trees have been used in:
• Intrusion detection [> 11 papers]
• Online dynamic security assessment [He et al. ISGT 12]
• Password checking [Bergadano et al. CCS 97]
• Database inference [Chang, Moskowitz NSPW 98]
• Analyzing malware [Ravula et al. KDIR 11]