Information Entropy:
Illustrating Example

Andrew Kusiak
Intelligent Systems Laboratory
2139 Seamans Center
The University of Iowa
Iowa City, Iowa 52242-1527
[email protected]
http://www.icaen.uiowa.edu/~ankusiak
Tel: 319-335-5934
Fax: 319-335-5669
Etymology of “Entropy”
• Entropy = a measure of randomness
• The amount of uncertainty in an outcome
Definitions

Shannon Entropy
• S = a probability space composed of two disjoint events E1 and E2 with probabilities p1 = p and p2 = 1 - p, respectively.
• The Shannon entropy is defined as
  H(S) = H(p1, p2) = -p log p - (1 - p) log (1 - p)

Information content (s = total number of examples, si = number of examples in class i, i = 1, ..., m):
  I(s1, s2, ..., sm) = -Σ_{i=1}^{m} (si/s) log2(si/s)

Entropy of feature (attribute) A with v values (sij = number of class-i examples taking the j-th value of A):
  E(A) = Σ_{j=1}^{v} [(s1j + ... + smj)/s] I(s1j, ..., smj)

Information gain of feature A:
  Gain(A) = I(s1, s2, ..., sm) - E(A)
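These definitions can be checked numerically. The following is a minimal Python sketch (not part of the original slides; the function names info, entropy_of_feature, and gain are illustrative) that computes the information content I, the feature entropy E(A), and the information gain from class counts:

```python
import math

def info(counts):
    # I(s1, ..., sm): information content of a list of class counts.
    # log2(s/si) = -log2(si/s), so every term is non-negative.
    s = sum(counts)
    return sum((si / s) * math.log2(s / si) for si in counts if si > 0)

def entropy_of_feature(partitions):
    # E(A): weighted information content of the partitions induced by feature A;
    # `partitions` is a list of per-value class-count lists [s1j, ..., smj].
    s = sum(sum(p) for p in partitions)
    return sum((sum(p) / s) * info(p) for p in partitions)

def gain(class_counts, partitions):
    # Gain(A) = I(s1, ..., sm) - E(A)
    return info(class_counts) - entropy_of_feature(partitions)
```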
Example
2 classes; 2 nodes

Collection (D1 = number of examples in class 1, D2 = number of examples in class 2):
No.  F1    D
1    Blue  1
2    Blue  1
3    Blue  1
4    Blue  1
5    Red   2
6    Red   2
7    Red   2
8    Red   2

I(D1, D2) = -4/8*log2(4/8) - 4/8*log2(4/8) = 1

For Blue: D11 = 4, D21 = 0
I(D11, D21) = -4/4*log2(4/4) = 0

For Red: D12 = 0, D22 = 4
I(D12, D22) = -4/4*log2(4/4) = 0

E(F1) = 4/8 I(D11, D21) + 4/8 I(D12, D22) = 0
Gain(F1) = I(D1, D2) - E(F1) = 1

Splitting on F1 produces two pure nodes:
Blue node: D1 = 4, D2 = 0
Red node:  D1 = 0, D2 = 4
Example
3 classes; 2 nodes

Collection:
No.  F1    D
1    Blue  1
2    Blue  1
3    Blue  2
4    Blue  2
5    Red   2
6    Red   3
7    Red   3
8    Red   3

I(D1, D2, D3) = -2/8*log2(2/8) - 3/8*log2(3/8) - 3/8*log2(3/8) = 1.56

For Blue: D11 = 2, D21 = 2, D31 = 0
I(D11, D21) = -2/4*log2(2/4) - 2/4*log2(2/4) = 1

For Red: D12 = 0, D22 = 1, D32 = 3
I(D22, D32) = -1/4*log2(1/4) - 3/4*log2(3/4) = 0.81

E(F1) = 4/8 I(D11, D21) + 4/8 I(D22, D32) = 0.905
Gain(F1) = I(D1, D2, D3) - E(F1) = 0.655

Splitting on F1 gives two mixed nodes:
Blue node: D1 = 2, D2 = 2, D3 = 0
Red node:  D1 = 0, D2 = 1, D3 = 3
Example
3 classes; 2 nodes

Collection:
No.  F1    D
1    Blue  1
2    Blue  2
3    Blue  2
4    Blue  2
5    Red   3
6    Red   3
7    Red   3
8    Red   3

I(D1, D2, D3) = -1/8*log2(1/8) - 3/8*log2(3/8) - 4/8*log2(4/8) = 1.41

For Blue: D11 = 1, D21 = 3, D31 = 0
I(D11, D21) = -1/4*log2(1/4) - 3/4*log2(3/4) = 0.81

For Red: D12 = 0, D22 = 0, D32 = 4
I(D32) = -4/4*log2(4/4) = 0

E(F1) = 4/8 I(D11, D21) + 4/8 I(D32) = 0.41
Gain(F1) = I(D1, D2, D3) - E(F1) = 1

Splitting on F1:
Blue node: D1 = 1, D2 = 3, D3 = 0
Red node:  D1 = 0, D2 = 0, D3 = 4

Example
3 classes; 3 nodes

Collection:
No.  F1     D
1    Blue   1
2    Blue   1
3    Green  3
4    Green  3
5    Green  3
6    Red    2
7    Red    2
8    Red    2

I(D1, D2, D3) = -2/8*log2(2/8) - 3/8*log2(3/8) - 3/8*log2(3/8) = 1.56

For Blue: D11 = 2, D21 = 0, D31 = 0
I(D11) = -2/2*log2(2/2) = 0

For Red: D12 = 0, D22 = 3, D32 = 0
I(D22) = -3/3*log2(3/3) = 0

For Green: D13 = 0, D23 = 0, D33 = 3
I(D33) = -3/3*log2(3/3) = 0

E(F1) = 2/8 I(D11) + 3/8 I(D22) + 3/8 I(D33) = 0
Gain(F1) = I(D1, D2, D3) - E(F1) = 1.56

Splitting on F1 gives three pure nodes:
Blue node:  D1 = 2, D2 = 0, D3 = 0
Red node:   D1 = 0, D2 = 3, D3 = 0
Green node: D1 = 0, D2 = 0, D3 = 3
Example
4 classes; 3 nodes

Collection:
No.  F1     D
1    Blue   1
2    Green  1
3    Green  3
4    Green  3
5    Green  4
6    Green  4
7    Red    2
8    Red    2

I(D1, D2, D3, D4) = -2/8*log2(2/8) - 2/8*log2(2/8) - 2/8*log2(2/8) - 2/8*log2(2/8) = 2

For Blue: D11 = 1, D21 = 0, D31 = 0, D41 = 0
I(D11) = -1/1*log2(1/1) = 0

For Red: D12 = 0, D22 = 2, D32 = 0, D42 = 0
I(D22) = -2/2*log2(2/2) = 0

For Green: D13 = 1, D23 = 0, D33 = 2, D43 = 2
I(D13, D33, D43) = -1/5*log2(1/5) - 2/5*log2(2/5) - 2/5*log2(2/5) = 1.52

E(F1) = 1/8 I(D11) + 2/8 I(D22) + 5/8 I(D13, D33, D43) = 0.95
Gain(F1) = I(D1, D2, D3, D4) - E(F1) = 1.05

Splitting on F1:
Blue node:  D1 = 1, D2 = 0, D3 = 0, D4 = 0
Green node: D1 = 1, D2 = 0, D3 = 2, D4 = 2
Red node:   D1 = 0, D2 = 2, D3 = 0, D4 = 0
Summary

Case  F1 (examples 1-8)                            D (classes of examples 1-8)  E(F1)  Gain(F1)
1     Blue Blue Blue Blue Red Red Red Red          1 1 1 1 2 2 2 2              0      1
2     Blue Blue Blue Blue Red Red Red Red          1 1 2 2 2 3 3 3              0.905  0.655
3     Blue Blue Blue Blue Red Red Red Red          1 2 2 2 3 3 3 3              0.41   1
4     Blue Blue Green Green Green Red Red Red      1 1 3 3 3 2 2 2              0      1.56
5     Blue Green Green Green Green Green Red Red   1 1 3 3 4 4 2 2              0.95   1.05
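As a cross-check, all five cases in the summary can be recomputed with the info, entropy_of_feature, and gain functions sketched earlier; each case is written as the overall class counts plus the class counts inside each node, read straight from the tables above:

```python
cases = {
    1: ([4, 4],       [[4, 0], [0, 4]]),                           # 2 classes; 2 nodes
    2: ([2, 3, 3],    [[2, 2, 0], [0, 1, 3]]),                     # 3 classes; 2 nodes
    3: ([1, 3, 4],    [[1, 3, 0], [0, 0, 4]]),                     # 3 classes; 2 nodes
    4: ([2, 3, 3],    [[2, 0, 0], [0, 3, 0], [0, 0, 3]]),          # 3 classes; 3 nodes
    5: ([2, 2, 2, 2], [[1, 0, 0, 0], [1, 0, 2, 2], [0, 2, 0, 0]]), # 4 classes; 3 nodes
}
for case, (class_counts, nodes) in cases.items():
    e = entropy_of_feature(nodes)
    print(case, round(e, 3), round(info(class_counts) - e, 3))
# 1 0.0 1.0
# 2 0.906 0.656
# 3 0.406 1.0
# 4 0.0 1.561
# 5 0.951 1.049
```

The small differences from the slide values (0.905 vs. 0.906, and so on) are only rounding.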
Play Tennis: Training Data Set

Feature (Attribute)                             Decision
Outlook    Temperature   Humidity   Wind        Play tennis
sunny      hot           high       weak        no
sunny      hot           high       strong      no
overcast   hot           high       weak        yes
rain       mild          high       weak        yes
rain       cool          normal     weak        yes
rain       cool          normal     strong      no
overcast   cool          normal     strong      yes
sunny      mild          high       weak        no
sunny      cool          normal     weak        yes
rain       mild          normal     weak        yes
sunny      mild          normal     strong      yes
overcast   mild          high       strong      yes
overcast   hot           normal     weak        yes
rain       mild          high       strong      no

• Continuous feature values require a split (a threshold), as sketched below.

http://www.icaen.uiowa.edu/%7Ecomp/Public/Kantardzic.pdf
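The note about continuous values refers to turning a numeric feature into a binary split. A common way to choose the cut point (an assumption here, not spelled out on the slide) is to test the midpoints between consecutive sorted values and keep the one with the highest information gain; the sketch reuses the gain function from above:

```python
from collections import Counter

def best_threshold(values, labels):
    # Evaluate midpoints between consecutive distinct sorted values as thresholds
    # for the binary split "value <= t" vs. "value > t"; return (t, gain).
    pairs = sorted(zip(values, labels))
    total = Counter(labels)
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = Counter(lab for v, lab in pairs if v <= t)
        right = total - left
        g = gain(list(total.values()), [list(left.values()), list(right.values())])
        if g > best[1]:
            best = (t, g)
    return best

# Illustrative only: humidity measured numerically instead of high/normal.
print(best_threshold([95, 90, 70, 65], ["no", "no", "yes", "yes"]))  # -> (80.0, 1.0)
```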
Feature Selection
(The training data set is the same as on the previous slide.)

Information gain of each feature on the full training set S:
feature wind:         Gain(S, wind) = 0.048
feature outlook:      Gain(S, outlook) = 0.246
feature humidity:     Gain(S, humidity) = 0.151
feature temperature:  Gain(S, temperature) = 0.029

Outlook has the largest gain and is therefore selected for the root node (these values can be reproduced with the script below).
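The four gains can be recomputed from the training table; the dataset below is just the 14 rows of the slide restated in code (helper names are illustrative), reusing the gain function sketched earlier:

```python
# Each row: (outlook, temperature, humidity, wind, play_tennis)
data = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]
features = ["outlook", "temperature", "humidity", "wind"]

def class_counts(rows):
    # Count the decisions (last column) among the given rows.
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return list(counts.values())

def feature_gain(rows, index):
    # Gain(S, feature at `index`), using the info/entropy/gain sketch above.
    values = {row[index] for row in rows}
    partitions = [class_counts([r for r in rows if r[index] == v]) for v in values]
    return gain(class_counts(rows), partitions)

for i, name in enumerate(features):
    print(name, round(feature_gain(data, i), 3))
# outlook 0.247, temperature 0.029, humidity 0.152, wind 0.048
# (matches the slide values 0.246, 0.029, 0.151, 0.048 up to rounding)
```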
Constructing Decision Tree
With Outlook at the root, the training examples split into three subsets:
Outlook = overcast: all examples are yes, so the branch ends in a Yes leaf
Outlook = sunny: both yes and no examples remain, so the subset is split further (on Humidity)
Outlook = rain: both yes and no examples remain, so the subset is split further (on Wind)

Complete Decision Tree

Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
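The whole construction can be automated. Below is a compact recursive sketch in the spirit of ID3 (Quinlan 1986); it reuses data, features, class_counts, and feature_gain from the previous snippet, and the nested-dictionary tree representation is an illustrative choice, not the slides' notation:

```python
def majority(rows):
    # Most frequent decision among the rows (used when no features are left).
    decisions = [row[-1] for row in rows]
    return max(set(decisions), key=decisions.count)

def build_tree(rows, feature_indices):
    decisions = {row[-1] for row in rows}
    if len(decisions) == 1:          # pure node -> leaf
        return decisions.pop()
    if not feature_indices:          # no features left -> majority leaf
        return majority(rows)
    # Pick the feature with the highest information gain on these rows.
    best = max(feature_indices, key=lambda i: feature_gain(rows, i))
    remaining = [i for i in feature_indices if i != best]
    tree = {features[best]: {}}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        tree[features[best]][value] = build_tree(subset, remaining)
    return tree

print(build_tree(data, [0, 1, 2, 3]))
# {'outlook': {'overcast': 'yes',
#              'sunny': {'humidity': {'high': 'no', 'normal': 'yes'}},
#              'rain': {'wind': {'weak': 'yes', 'strong': 'no'}}}}
# (branch order may vary; the tree matches the slide)
```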
From Decision Trees to Rules
Each path of the tree that ends in a Yes leaf contributes one condition of the rule:

IF Outlook = Overcast
OR (Outlook = Sunny AND Humidity = Normal)
OR (Outlook = Rain AND Wind = Weak)
THEN Play tennis
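A small helper can read these rules off the nested-dictionary tree produced above (again an illustrative sketch, assuming the build_tree representation):

```python
def paths_to(tree, target, prefix=()):
    # Collect every root-to-leaf path whose leaf equals `target`.
    if not isinstance(tree, dict):
        return [prefix] if tree == target else []
    (feature, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules += paths_to(subtree, target, prefix + ((feature, value),))
    return rules

tree = build_tree(data, [0, 1, 2, 3])
for rule in paths_to(tree, "yes"):
    print(" AND ".join(f"{f} = {v}" for f, v in rule), "-> play tennis")
# outlook = overcast -> play tennis
# outlook = sunny AND humidity = normal -> play tennis
# outlook = rain AND wind = weak -> play tennis
# (order may vary)
```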
Decision Trees: Key Characteristics
• Searches the complete space of finite discrete-valued functions
• Maintains a single hypothesis
• No backtracking in the search
• All training examples are used at each step
Avoiding Overfitting the Data

(Figure: classification accuracy on the training data set vs. the testing data set as a function of the size of the tree.)

Reference
J. R. Quinlan, Induction of decision trees, Machine Learning, 1, 1986, 81-106.