Lab 9: Entropy and Decision Trees

The objective of this lab is to:
Practice ideas from machine learning,
Learn about entropy and apply the concept,
Learn to build a decision tree using real data.
The Classification Problem
As described in the lectures, “machine learning” is an area of artificial intelligence concerning
methods by which machines can learn from data. One of the problems addressed by machine
learning is the so-called “classification problem”. The objective of the classification problem is to
use examples to build a classifier that can accurately classify samples as having or not having a
particular property. An instance of a classification problem would be the classification of E-mail
messages as spam or non-spam. In this lab, you will build a classifier known as a “decision tree”.
Training Data
The examples from which a classifier is built are called “training data”. Training data is typically
given in the form of a table, with some rows being instances that are in the class to be identified by
the classifier and other rows being instances that are not in the class. The columns of the table
give the values of various features (attributes) exhibited by the instances.
This can be illustrated with an example from the text “Machine Learning” by Tom Mitchell
(McGraw Hill 1997). Suppose we wished to tell whether a day is good to play tennis and we have
some information on previous days that were good or bad for playing. Our data is collected in a
table, as follows.
Notice that there are 3 different possibilities for outlook, 3 for temperature, 2 for humidity and 2
for wind. That is, there are 3 x 3 x 2 x 2 = 36 possible combinations, and the training data covers
only 14 of them.
Decision Trees
Using the data in the table, we wish to build a decision tree to help us make decisions whether to
play tennis. The tree will have a root node making a decision based on one of the features, for
example whether the outlook is sunny, overcast, or rain. Then, in each case, further decisions
are made based on other features. A partially constructed decision tree for this example is given
below.
Entropy and Information Gain
To build a decision tree, we must select one of the attributes to use at the root. We want to pick
the one that is most effective at separating the good from the bad conditions to play tennis. To
make this choice, we introduce the notions of “entropy” and “information gain”.
Entropy is a measure of how mixed a collection of training examples is with respect to the
classification. The entropy of a set of examples is 0 if they all belong to the same class, and (with
two classes) it is 1 if each class contains the same number of samples. If the classes contain
different numbers of examples, then the entropy is between 0 and 1.
Let the number of possible classifications be 𝑐. In our example 𝑐 = 2: “Play tennis?” is either “Yes”
or “No”. Let 𝑆 be a collection of training examples, and let 𝑝𝑖 be the proportion of those samples
in the 𝑖-th classification. Then the entropy of the collection 𝑆 is
Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i
(with the convention that a term with 𝑝𝑖 = 0 contributes 0).
For example, our entire collection of training data is a set with 14 examples, 9 of which are “Yes”
and 5 of which are “No”. So the entropy is
Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940
Information gain is a measure of how well an attribute (feature) separates the training examples
according to the desired classification. It is defined as
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v),
where 𝑉𝑎𝑙𝑢𝑒𝑠(𝐴) is the set of possible values for the attribute 𝐴, 𝑆𝑣 is the subset of 𝑆 for which
the attribute 𝐴 has the value 𝑣, and |𝑋| is the number of elements in the set 𝑋.
For example, let 𝑆𝑊𝑒𝑎𝑘 be the set of examples with weak wind (with 6 “Yes” and 2 “No”) and
𝑆𝑆𝑡𝑟𝑜𝑛𝑔 be the set of examples with strong wind (with 3 “Yes” and 3 “No”). Then the information
gain for the Wind attribute is
Gain(S, Wind) = Entropy(S) - \frac{8}{14}\, Entropy(S_{Weak}) - \frac{6}{14}\, Entropy(S_{Strong}) \approx 0.048
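Using the counts above, the two subset entropies and the resulting gain work out as follows (values rounded):
Entropy(S_{Weak}) = -\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8} \approx 0.811
Entropy(S_{Strong}) = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1.000
Gain(S, Wind) \approx 0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1.000) \approx 0.048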
Building the Decision Tree
To build the decision tree, use the attribute with the greatest information gain in the root node.
In our example, we have
𝐺𝑎𝑖𝑛(𝑆, 𝑂𝑢𝑡𝑙𝑜𝑜𝑘 ) ≈ 0.246,
𝐺𝑎𝑖𝑛(𝑆, 𝐻𝑢𝑚𝑖𝑑𝑖𝑡𝑦) ≈ 0.151,
𝐺𝑎𝑖𝑛(𝑆, 𝑊𝑖𝑛𝑑 ) ≈ 0.048,
𝐺𝑎𝑖𝑛(𝑆, 𝑇𝑒𝑚𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒) ≈ 0.029.
So Outlook should be at the root node. We then build the branches of the tree by forming three
new decision trees, one each for Outlook = Overcast (with 4 examples), Outlook = Sunny (with 5
examples) and Outlook = Rain (5 examples).
A Decision Tree for Breast Cancer Malignancy
For this lab you will build a decision tree using a set of data collected at the University of Wisconsin
Hospitals, Madison, by Dr William H. Wolberg.
Have the TA check your work for each of the following:
1. Obtain the Data Set
Download the data from the course resources under OWL.
The data file is “Lab 8 -- breast-cancer-wisconsin.xls” and the fields are described in the “.txt” file
of the same name. The important information is this: Columns A to I contain values for different
measured attributes, with the values encoded as numbers in the range 1 to 10. Some values are
missing and are indicated by question marks. Column J indicates whether malignancy is detected,
with 2 for “benign” and 4 for “malignant”.
Insert 43 rows after row 1 of the spreadsheet, and one column before column A to make room to
work.
2. Compute Proportions
In cell A2 add the title “Proportions” and in cells A3 to A12 enter the numbers 1 to 10. Make sure
that A3 has the value 1 in it and that the original data is in rows 45 to 743.
You will now compute the proportion of the instances with “clump_thickness” equal to 1, the
proportion with “clump_thickness” equal to 2, etc., all the way to 10. To do this, enter the formula
=COUNTIF(B$45:B$743, "="&$A3) / COUNTA(B$45:B$743)
into cell B3. Be prepared to explain this formula to the TA when you show your work.
Copy and paste this formula into all of cells B3 to B12. Now each entry in this area should give the
proportion of the samples having the corresponding attribute value (1 to 10) in that column.
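Because the row references $45 and $743 and the column reference $A are anchored with dollar signs while the other references are relative, the formula adjusts itself as it is copied. Purely as an illustration, the copy that lands in cell C5 should read
=COUNTIF(C$45:C$743, "="&$A5) / COUNTA(C$45:C$743)
which gives the proportion of samples whose attribute in column C equals the value in A5.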
Put a title “Total” into cell A14 and in cell B14 put a sum of all the 10 proportions. It should add
up to 1. Copy and paste B3 to B14 to the same rows in columns C through K.
3. Compute the Entropies for Each Attribute Value
You will now compute the Entropies for each of the sets defined by fixing attribute values.
In cell A16 add the title “Entropies” and in cells A17 to A26 enter the numbers 1 to 10.
Add a column L to the data table (rows 45 to 743) with TRUE or FALSE, depending on whether the
value for “class” in column K indicates malignancy or not. That is, the entry in column L should be
TRUE if the value in column K is 4, and should be FALSE otherwise.
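One simple way to fill this column (a suggestion only; any equivalent formula works) is to enter
=(K45=4)
in cell L45 and copy it down to row 743.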
Now you will compute the entropy for the attribute value clump_thickness = 1. To do this you will
need to pay attention only to the rows with clump_thickness = 1, i.e. rows 51, 55, 58, 69, 74, 90,
92, 109, 114, 117, 120, 121, 135, 138, 152, … Some of these rows will have the value TRUE in
column L and some will have FALSE in column L. You should compute the entropy of the
TRUE/FALSE split, taking into account only those rows with clump_thickness = 1. To do this, you
could write your own VBA function, but since you have a limited time for the lab, one is provided
below. Use the Developer tab to open the Visual Basic editor and add this function to a module. Now, in cell B17 (i.e. in the
column for “clump_thickness” and the row for the value 1), enter the formula
=Entropy($A17, B$45:B$743, $L$45:$L$743)
Copy and paste this to all cells in the range B17 to K26. Compute the overall entropy in the cell
K28. If 𝑝 is the proportion of all rows with TRUE in column L, then this overall entropy is
-p \log_2 p - (1 - p) \log_2 (1 - p)
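One way to do this (the helper cell M28 is just a suggestion; any free cell will do) is to compute 𝑝 first and then the entropy:
In M28:   =COUNTIF($L$45:$L$743, TRUE) / COUNTA($L$45:$L$743)
In K28:   = -M28*LOG(M28, 2) - (1 - M28)*LOG(1 - M28, 2)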
Function Entropy(valOfInterest As Integer, values As Range, inClass As Range) As Double
    ' Entropy of the TRUE/FALSE split in inClass, restricted to the rows
    ' whose cell in "values" equals valOfInterest.
    Dim countIn As Double, countOut As Double
    Dim i As Integer
    countIn = 0
    countOut = 0
    ' Count how many of the selected rows are in the class (TRUE) and how many are not.
    For i = 1 To values.Rows.Count
        If values.Rows(i).Value = valOfInterest Then
            If inClass.Rows(i).Value Then
                countIn = countIn + 1
            Else
                countOut = countOut + 1
            End If
        End If
    Next i
    Dim propIn As Double, propOut As Double
    ' No rows with this attribute value: define the entropy to be 0.
    If countIn + countOut = 0 Then
        Entropy = 0
        Exit Function
    End If
    propIn = countIn / (countIn + countOut)
    propOut = countOut / (countIn + countOut)
    Dim ans As Double
    ans = 0
    ' Add -p*log2(p) for each non-empty class; VBA's Log is natural log, so divide by Log(2).
    If propIn <> 0 Then
        ans = ans - propIn * Log(propIn) / Log(2)
    End If
    If propOut <> 0 Then
        ans = ans - propOut * Log(propOut) / Log(2)
    End If
    Entropy = ans
End Function
4. Compute the Information Gain for Each Attribute
In cell A30 enter the title “Information” and in cells A31 to A40 enter the numbers 1 to 10.
In each of the cells in the region from B31 to J40 enter the corresponding proportion from the first
table times the entropy from the second table. For example, cell B31 should contain the formula
=B3*B17. Thus each entry computes one of the terms (|𝑆𝑣|/|𝑆|) 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑆𝑣).
In row 42 compute the information gain for each attribute. E.g. in cell B42 use
=$K$28-SUM(B31:B40)
Which attribute gives the greatest information gain to identify malignancies? This attribute would
be tested as the root of a decision tree.
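If you would like an end-to-end cross-check of your spreadsheet results, a VBA function along the following lines could compute the information gain of one attribute column directly. This is only a sketch: the name InfoGain is an arbitrary choice, it reuses the Entropy function from step 3, and it assumes the attribute values are the integers 1 to 10 as described above.
Function InfoGain(values As Range, inClass As Range) As Double
    ' Information gain of the TRUE/FALSE split in inClass with respect to
    ' the attribute whose values (1 to 10) are in the range "values".
    Dim total As Double, countTrue As Double
    Dim i As Integer
    total = values.Rows.Count
    countTrue = 0
    For i = 1 To inClass.Rows.Count
        If inClass.Rows(i).Value Then countTrue = countTrue + 1
    Next i
    ' Overall entropy of the whole collection (the value you put in K28).
    Dim p As Double, overall As Double
    p = countTrue / total
    overall = 0
    If p <> 0 Then overall = overall - p * Log(p) / Log(2)
    If p <> 1 Then overall = overall - (1 - p) * Log(1 - p) / Log(2)
    ' Subtract the weighted entropy of each attribute value, reusing Entropy from step 3.
    Dim v As Integer, countV As Double, gain As Double
    gain = overall
    For v = 1 To 10
        countV = 0
        For i = 1 To values.Rows.Count
            If values.Rows(i).Value = v Then countV = countV + 1
        Next i
        If countV > 0 Then
            gain = gain - (countV / total) * Entropy(v, values, inClass)
        End If
    Next v
    InfoGain = gain
End Function
For example, =InfoGain(B$45:B$743, $L$45:$L$743) should agree (up to rounding) with the value you computed in B42 for that column.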
Your worksheet should look something like the following:
5. [Optional] Now Compute the Next Node
Now that you have determined which attribute should be tested at the root node of the decision
tree, it is possible to go on and figure out which attributes should be tested second.
Suppose the attribute that had the highest information gain was “uniformity_of_cell_size”. That
would be checked at the root of the tree. Then one branch would be the decision tree for the
case uniformity_of_cell_size = 1, another branch would be the decision tree for the case
uniformity_of_cell_size = 2, and so on, with one decision tree for each of the 10 values.
It is possible to do this by making space below the work for questions 2-4 and above the data, and
copying and pasting the three tables you have worked on. Then sort the data table by the column
selected as the root attribute. You can now adjust the formulas in the pasted tables to refer only
to the rows with uniformity_of_cell_size = 1. (Adjust one formula for “Proportions” and copy it to
all the cells, one formula for “Entropies” and copy it, and one formula for “Information” and copy
it.)
You can then copy and paste the block again and modify it to compute the main attribute for the
branch uniformity_of_cell_size = 2, and so on.
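Alternatively, if you prefer not to sort and re-point row ranges, the restriction can be expressed directly with COUNTIFS. Purely as an illustration, assuming the root attribute happened to sit in column C (use whichever column you actually identified), the adjusted “Proportions” formula for the branch uniformity_of_cell_size = 1 could look like
=COUNTIFS(B$45:B$743, "="&$A3, $C$45:$C$743, 1) / COUNTIFS($C$45:$C$743, 1)
with the trailing 1 replaced by 2, 3, etc. for the other branches.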