
Information Entropy:
Illustrating Example
Andrew Kusiak
Intelligent Systems Laboratory
2139 Seamans Center
The University of Iowa
Iowa City, Iowa 52242 - 1527
[email protected]
http://www.icaen.uiowa.edu/~ankusiak
Tel: 319-335-5934
Fax: 319-335-5669
Etymology of “Entropy”
• Entropy = randomness
• Amount of uncertainty
Shannon Entropy
• S = final probability space composed of two disjoint events E1 and E2 with probability p1 = p and p2 = 1 - p, respectively.
• The Shannon entropy is defined as
  H(S) = H(p1, p2) = -p log p - (1 - p) log(1 - p)

Definitions
• Information content of a collection with si examples in class i (s = s1 + ... + sm is the total number of examples):
  I(s1, s2, ..., sm) = -Σ_{i=1}^{m} (si/s) log2(si/s)
• Entropy of feature A, which splits the collection into v subsets with sij examples of class i in subset j:
  E(A) = Σ_{j=1}^{v} ((s1j + ... + smj)/s) I(s1j, ..., smj)
• Information gain of feature A:
  Gain(A) = I(s1, s2, ..., sm) - E(A)
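A minimal Python sketch of these three quantities (an addition to the slides; the names info, feature_entropy, and gain are illustrative, not notation from the slides):

```python
from math import log2

def info(counts):
    """I(s1, ..., sm): information content of a collection with class counts s1, ..., sm."""
    s = sum(counts)
    return -sum(si / s * log2(si / s) for si in counts if si > 0)  # 0*log2(0) is treated as 0

def feature_entropy(node_counts, s):
    """E(A): weighted information content of the nodes produced by feature A.
    node_counts holds one class-count list [s1j, ..., smj] per node."""
    return sum(sum(node) / s * info(node) for node in node_counts)

def gain(class_counts, node_counts):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    return info(class_counts) - feature_entropy(node_counts, sum(class_counts))

# Shannon entropy of two equally likely events: H(1/2, 1/2) = 1 bit.
print(info([1, 1]))  # 1.0
```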
Example
2 classes; 2 nodes

D1 = number of examples in class 1
D2 = number of examples in class 2

No.  F1    D
1    Blue  1
2    Blue  1
3    Blue  1
4    Blue  1
5    Red   2
6    Red   2
7    Red   2
8    Red   2

For the whole collection, D1 = 4 and D2 = 4:
I(D1, D2) = -4/8*log2(4/8) - 4/8*log2(4/8) = 1

Splitting the collection on F1 gives two nodes:
Blue node: D1 = 4, D2 = 0
Red node:  D1 = 0, D2 = 4

For Blue, D11 = 4, D21 = 0:  I(D11, D21) = -4/4*log2(4/4) = 0
For Red,  D12 = 0, D22 = 4:  I(D12, D22) = -4/4*log2(4/4) = 0

E(F1) = 4/8 I(D11, D21) + 4/8 I(D12, D22) = 0
Gain(F1) = I(D1, D2) - E(F1) = 1
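As a check, the same numbers fall out of a few lines of Python (an added illustration; the fractions are read off the table above):

```python
from math import log2

I_D  = -(4/8) * log2(4/8) - (4/8) * log2(4/8)  # I(D1, D2) for the whole collection
E_F1 = 4/8 * 0 + 4/8 * 0                       # both nodes are pure, so each contributes 0
print(I_D, I_D - E_F1)                         # 1.0 1.0  ->  Gain(F1) = 1
```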
Example
3 classes; 2 nodes

No.  F1    D
1    Blue  1
2    Blue  1
3    Blue  2
4    Blue  2
5    Red   2
6    Red   3
7    Red   3
8    Red   3

I(D1, D2, D3) = -2/8*log2(2/8) - 3/8*log2(3/8) - 3/8*log2(3/8) = 1.56

Splitting the collection on F1:
Blue node: D1 = 2, D2 = 2, D3 = 0
Red node:  D1 = 0, D2 = 1, D3 = 3

For Blue, D11 = 2, D21 = 2, D31 = 0:
I(D11, D21) = -2/4*log2(2/4) - 2/4*log2(2/4) = 1
For Red, D12 = 0, D22 = 1, D32 = 3:
I(D22, D32) = -1/4*log2(1/4) - 3/4*log2(3/4) = 0.81

E(F1) = 4/8 I(D11, D21) + 4/8 I(D22, D32) = 0.905
Gain(F1) = I(D1, D2, D3) - E(F1) = 0.655
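The same arithmetic in Python (an added check; here both nodes are impure, so both contribute to E(F1)):

```python
from math import log2

I_D    = -2/8*log2(2/8) - 3/8*log2(3/8) - 3/8*log2(3/8)  # I(D1, D2, D3), about 1.56
I_blue = -2/4*log2(2/4) - 2/4*log2(2/4)                  # = 1
I_red  = -1/4*log2(1/4) - 3/4*log2(3/4)                  # about 0.81
E_F1   = 4/8*I_blue + 4/8*I_red
print(E_F1, I_D - E_F1)  # about 0.906 and 0.656 (0.905 and 0.655 on the slide after rounding)
```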
Example
3 classes; 2 nodes

No.  F1    D
1    Blue  1
2    Blue  2
3    Blue  2
4    Blue  2
5    Red   3
6    Red   3
7    Red   3
8    Red   3

I(D1, D2, D3) = -1/8*log2(1/8) - 3/8*log2(3/8) - 4/8*log2(4/8) = 1.41

Splitting the collection on F1:
Blue node: D1 = 1, D2 = 3, D3 = 0
Red node:  D1 = 0, D2 = 0, D3 = 4

For Blue, D11 = 1, D21 = 3, D31 = 0:
I(D11, D21) = -1/4*log2(1/4) - 3/4*log2(3/4) = 0.81
For Red, D12 = 0, D22 = 0, D32 = 4:
I(D32) = -4/4*log2(4/4) = 0

E(F1) = 4/8 I(D11, D21) + 4/8 I(D32) = 0.41
Gain(F1) = I(D1, D2, D3) - E(F1) = 1
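An added check for this case; the Red node is pure, so only the Blue node contributes to E(F1):

```python
from math import log2

I_D  = -1/8*log2(1/8) - 3/8*log2(3/8) - 4/8*log2(4/8)    # about 1.41
E_F1 = 4/8 * (-1/4*log2(1/4) - 3/4*log2(3/4)) + 4/8 * 0  # about 0.41
print(I_D - E_F1)                                        # 1.0  ->  Gain(F1) = 1
```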
Example
3 classes; 3 nodes

No.  F1     D
1    Blue   1
2    Blue   1
3    Green  3
4    Green  3
5    Green  3
6    Red    2
7    Red    2
8    Red    2

I(D1, D2, D3) = -2/8*log2(2/8) - 3/8*log2(3/8) - 3/8*log2(3/8) = 1.56

Splitting the collection on F1:
Blue node:  D1 = 2, D2 = 0, D3 = 0
Green node: D1 = 0, D2 = 0, D3 = 3
Red node:   D1 = 0, D2 = 3, D3 = 0

For Blue,  D11 = 2, D21 = 0, D31 = 0:  I(D11) = -2/2*log2(2/2) = 0
For Red,   D12 = 0, D22 = 3, D32 = 0:  I(D22) = -3/3*log2(3/3) = 0
For Green, D13 = 0, D23 = 0, D33 = 3:  I(D33) = -3/3*log2(3/3) = 0

E(F1) = 2/8 I(D11) + 3/8 I(D22) + 3/8 I(D33) = 0
Gain(F1) = I(D1, D2, D3) - E(F1) = 1.56
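Here every node is pure, so E(F1) = 0 and the gain equals the information content of the whole collection; as an added check:

```python
from math import log2

I_D = -2/8*log2(2/8) - 3/8*log2(3/8) - 3/8*log2(3/8)  # about 1.56
print(I_D - 0)                                        # Gain(F1) = I(D1, D2, D3), about 1.56
```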
Example
4 classes; 3 nodes

No.  F1     D
1    Blue   1
2    Green  1
3    Green  3
4    Green  3
5    Green  4
6    Green  4
7    Red    2
8    Red    2

I(D1, D2, D3, D4) = -2/8*log2(2/8) - 2/8*log2(2/8) - 2/8*log2(2/8) - 2/8*log2(2/8) = 2

Splitting the collection on F1:
Blue node:  D1 = 1, D2 = 0, D3 = 0, D4 = 0
Green node: D1 = 1, D2 = 0, D3 = 2, D4 = 2
Red node:   D1 = 0, D2 = 2, D3 = 0, D4 = 0

For Blue,  D11 = 1, D21 = 0, D31 = 0, D41 = 0:  I(D11) = -1/1*log2(1/1) = 0
For Red,   D12 = 0, D22 = 2, D32 = 0, D42 = 0:  I(D22) = -2/2*log2(2/2) = 0
For Green, D13 = 1, D23 = 0, D33 = 2, D43 = 2:
I(D13, D33, D43) = -1/5*log2(1/5) - 2/5*log2(2/5) - 2/5*log2(2/5) = 1.52

E(F1) = 1/8 I(D11) + 2/8 I(D22) + 5/8 I(D13, D33, D43) = 0.95
Gain(F1) = I(D1, D2, D3, D4) - E(F1) = 1.05
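Only the Green node is impure, and it holds 5 of the 8 examples, which is where the 5/8 weight comes from; as an added check:

```python
from math import log2

I_D     = 4 * (-2/8 * log2(2/8))                          # = 2 (four equally frequent classes)
I_green = -1/5*log2(1/5) - 2/5*log2(2/5) - 2/5*log2(2/5)  # about 1.52
E_F1    = 1/8*0 + 2/8*0 + 5/8*I_green                     # about 0.95
print(E_F1, I_D - E_F1)                                   # about 0.95 and 1.05
```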
Summary

       Case 1       Case 2       Case 3       Case 4       Case 5
No.    F1     D     F1     D     F1     D     F1     D     F1     D
1      Blue   1     Blue   1     Blue   1     Blue   1     Blue   1
2      Blue   1     Blue   1     Blue   2     Blue   1     Green  1
3      Blue   1     Blue   2     Blue   2     Green  3     Green  3
4      Blue   1     Blue   2     Blue   2     Green  3     Green  3
5      Red    2     Red    2     Red    3     Green  3     Green  4
6      Red    2     Red    3     Red    3     Red    2     Green  4
7      Red    2     Red    3     Red    3     Red    2     Red    2
8      Red    2     Red    3     Red    3     Red    2     Red    2

E(F1)     0         0.905        0.41         0            0.95
Gain(F1)  1         0.655        1            1.56         1.05
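The five summary columns can also be recomputed directly from the (F1, D) pairs; the following self-contained script is an addition to the slides and reproduces the E(F1) and Gain(F1) values above (up to rounding):

```python
from math import log2
from collections import Counter

def info(counts):
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c > 0)

# (F1, D) columns of the five cases, copied from the summary tables.
cases = {
    "Case 1": ("BBBBRRRR", [1, 1, 1, 1, 2, 2, 2, 2]),
    "Case 2": ("BBBBRRRR", [1, 1, 2, 2, 2, 3, 3, 3]),
    "Case 3": ("BBBBRRRR", [1, 2, 2, 2, 3, 3, 3, 3]),
    "Case 4": ("BBGGGRRR", [1, 1, 3, 3, 3, 2, 2, 2]),
    "Case 5": ("BGGGGGRR", [1, 1, 3, 3, 4, 4, 2, 2]),
}

for name, (f1, d) in cases.items():
    total = info(list(Counter(d).values()))  # I(D1, ..., Dm) for the whole collection
    e = 0.0
    for node in set(f1):                     # one node per distinct F1 value
        members = [cls for feat, cls in zip(f1, d) if feat == node]
        e += len(members) / len(d) * info(list(Counter(members).values()))
    print(f"{name}: E(F1) = {e:.3f}, Gain(F1) = {total - e:.3f}")
```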
Note
• Continuous values → a split (continuous feature values are handled by splitting them into intervals).
• http://www.icaen.uiowa.edu/%7Ecomp/Public/Kantardzic.pdf

Transformation between log2(x) and log10(x):
log2(x) = log10(x)/log10(2) = 3.322 log10(x)
log10(x) = log2(x)/log2(10) = 0.301 log2(x)

Formula in Excel:
For log2(x), use the function “=log(x,2)”
For log10(x), use the function “=log(x,10)” or “=log10(x)”
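The same conversion in Python (an added aside that mirrors the Excel formulas above):

```python
import math

x = 8
print(math.log2(x))                   # 3.0
print(math.log10(x) / math.log10(2))  # 3.0, via the log10(x)/log10(2) transformation
print(3.322 * math.log10(x))          # about 3.0, using the approximate constant from the slide
```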
Play Tennis: Training Data Set

Outlook    Temperature  Humidity  Wind    Play tennis
sunny      hot          high      weak    no
sunny      hot          high      strong  no
overcast   hot          high      weak    yes
rain       mild         high      weak    yes
rain       cool         normal    weak    yes
rain       cool         normal    strong  no
overcast   cool         normal    strong  yes
sunny      mild         high      weak    no
sunny      cool         normal    weak    yes
rain       mild         normal    weak    yes
sunny      mild         normal    strong  yes
overcast   mild         high      strong  yes
overcast   hot          normal    weak    yes
rain       mild         high      strong  no

Outlook, Temperature, Humidity, and Wind are the features (attributes), their entries in the table are the feature values, and Play tennis is the decision.
Feature Selection

For the Play Tennis training data set above, the information gain of each feature is:
Gain(S, outlook)     = 0.246
Gain(S, humidity)    = 0.151
Gain(S, wind)        = 0.048
Gain(S, temperature) = 0.029

Outlook has the highest gain, so it is selected as the root node of the decision tree.
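These four gains can be reproduced with a short script (an addition to the slides; the tuple order below follows the columns of the training table):

```python
from math import log2
from collections import Counter

# Play Tennis training data: (outlook, temperature, humidity, wind, play tennis).
data = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]

def info(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    labels = [r[-1] for r in rows]
    e = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        e += len(subset) / len(rows) * info(subset)
    return info(labels) - e

for col, name in enumerate(["outlook", "temperature", "humidity", "wind"]):
    print(f"Gain(S, {name}) = {gain(data, col):.3f}")  # matches the slide values up to rounding
```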
Constructing Decision Tree

Outlook, the feature with the highest gain, becomes the root node, with one branch for each of its values (Sunny, Overcast, Rain). The training examples are partitioned among the branches:
• Outlook = Overcast: every example is Yes, so this branch ends in a Yes leaf.
• Outlook = Sunny: the examples contain both Yes and No, so the branch must be split further on the remaining features (Temp, Humidity, Wind).
• Outlook = Rain: the examples contain both Yes and No, so this branch must also be split further.

[Figure: the partial tree with the Outlook root and, under each branch, the table of training examples restricted to the remaining features Temp, Humidity, Wind and the Play tennis decision.]
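The recursive step described above can be sketched as follows (an added illustration of the ID3-style procedure, not code from the slides; each example is assumed to be a dict of feature values plus a 'play' entry for the decision):

```python
from math import log2
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, features):
    """Grow a decision tree from dicts that hold feature values and a 'play' decision."""
    labels = [r["play"] for r in rows]
    if len(set(labels)) == 1 or not features:        # pure node, or no feature left to split on
        return Counter(labels).most_common(1)[0][0]  # leaf labelled with the (majority) decision

    def gain(f):                                     # information gain of splitting on feature f
        e = 0.0
        for value in {r[f] for r in rows}:
            subset = [r for r in rows if r[f] == value]
            e += len(subset) / len(rows) * info([r["play"] for r in subset])
        return info(labels) - e

    best = max(features, key=gain)                   # feature with the highest gain
    remaining = [f for f in features if f != best]
    return {best: {value: build_tree([r for r in rows if r[best] == value], remaining)
                   for value in {r[best] for r in rows}}}
```

Run on the 14 Play Tennis examples, this selects Outlook at the root, Humidity under Sunny, and Wind under Rain, i.e. the complete tree shown next.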
Complete Decision Tree

Outlook
  Sunny    → Humidity
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → Wind
               Strong → No
               Weak   → Yes

From Decision Trees to Rules

Reading off the branches of the tree that end in Yes gives the rule:
IF Outlook = Overcast
   OR (Outlook = Sunny AND Humidity = Normal)
   OR (Outlook = Rain AND Wind = Weak)
THEN Play tennis
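Equivalently, the complete tree (and the rule above) collapses into one small decision function; a sketch, assuming the lower-case attribute values used in the training table:

```python
def play_tennis(outlook, humidity, wind):
    """'Play tennis' decision encoded by the complete decision tree (Temperature is never tested)."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    if outlook == "rain":
        return "yes" if wind == "weak" else "no"
    raise ValueError(f"unexpected outlook value: {outlook}")

print(play_tennis("sunny", "high", "weak"))  # no  (matches training example 1)
print(play_tennis("rain", "high", "weak"))   # yes (matches training example 4)
```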
Decision Trees: Key Characteristics
• Complete space of finite discrete-valued functions
• Maintaining a single hypothesis
• No backtracking in search
• All training examples used at each step

Avoiding Overfitting the Data
[Figure: accuracy on the training data set and on the testing data set plotted against the size of the tree.]
Reference
J. R. Quinlan, Induction of decision trees, Machine Learning, vol. 1, 1986, pp. 81-106.