Introduction to Information Theory

Dept. of Linguistics, Indiana University
Fall 2015

Background reading: T. Cover & J. Thomas (2006), Elements of Information Theory. Wiley.

Topics: Uncertainty, Entropy, Surprisal, Joint entropy, Conditional entropy, Mutual information, Relative entropy, Noisy channel model

Information theory answers two fundamental questions in communication theory:
- What is the ultimate data compression?
- What is the ultimate transmission rate of communication?

Applies to: computer science, thermodynamics, economics, computational linguistics, ...
Information & Uncertainty

Information as a decrease in uncertainty:
- We have some uncertainty about a process, e.g., which symbol (A, B, C) will be generated from a device
- We learn some information, e.g., previous symbol is B
- How uncertain are we now about which symbol will appear?
- The more informative our information is, the less uncertain we will be.

Entropy

Entropy is the way we measure how informative a random variable is:

(1) $H(p) = H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$

How do we get this formula?
Motivating Entropy

- Assume a device that emits one symbol (A)
  - Uncertainty of what we will see is zero
- Assume a device that emits two symbols (A, B)
  - With one choice (A or B), our uncertainty is one
  - We could use one bit (0 or 1) to encode the outcome
- Assume a device that emits four symbols (A, B, C, D)
  - If we made a decision tree, there would be two levels (starting with "Is it A/B or C/D?"): uncertainty is two
  - Need two bits (00, 01, 10, 11) to encode the outcome
- We are describing a (base 2) logarithm: $\log_2 M$, where M is the number of symbols

Adding probabilities

- $\log_2 M$ assumes every choice (A, B, C, D) is equally likely
  - This is not the case in general
- Instead, we look at $-\log_2 p(x)$, where x is the given choice, to tell us how surprising it is
- If every choice x is equally likely:
  - $p(x) = \frac{1}{M}$ (and $M = \frac{1}{p(x)}$)
  - $\log_2 M = \log_2 \frac{1}{p(x)} = -\log_2 p(x)$
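A quick numerical check of that last identity (a minimal sketch of my own, not from the slides): for M equally likely symbols, the surprisal $-\log_2 p(x)$ is exactly $\log_2 M$.

```python
import math

def surprisal(p):
    """Surprisal of one outcome with probability p: -log2 p."""
    return -math.log2(p)

# With M equally likely symbols, p(x) = 1/M, so -log2 p(x) equals log2 M
for M in (1, 2, 4, 8):
    print(M, math.log2(M), surprisal(1 / M))  # last two columns match
```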
Average surprisal

- $-\log_2 p(x)$ tells us how surprising one particular symbol is.
- But on average, how surprising is a random variable?

Summation gives a weighted average:

(2) $H(X) = -\sum_{x \in X} p(x) \log_2 p(x) = E\left[\log_2 \frac{1}{p(X)}\right]$

i.e., sum over all possible outcomes, multiplying surprisal ($-\log_2 p(x)$) by probability of seeing that outcome ($p(x)$)

The amount of surprisal is the amount of information we need in order to not be surprised.
- $H(X) = 0$ if the outcome is certain
- $H(X) = 1$ if out of 2 outcomes, both are equally likely

Entropy example

Roll an 8-sided die (or pick a character from an alphabet of 8 characters)

Because each outcome is equally likely, the entropy is:

(3) $H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i) = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = \log_2 8 = 3$

i.e., 3 bits needed to encode this 8-character language:

    char:  A    E    I    O    U    F    G    H
    code:  000  001  010  011  100  101  110  111
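As an illustration (my own sketch, assuming nothing beyond the formulas above), the entropy of the uniform 8-outcome distribution comes out to 3 bits, and the two boundary cases from the previous slide also check out:

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/8] * 8))    # 3.0 -- 8-sided die / 8-character alphabet
print(entropy([1.0]))        # 0.0 -- the outcome is certain
print(entropy([0.5, 0.5]))   # 1.0 -- two equally likely outcomes
```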
Simplified Polynesian

Simplified Polynesian: 6 characters

    char:  P    T    K    A    I    U
    prob:  1/8  1/4  1/8  1/4  1/8  1/8

Simplified Polynesian
Entropy calculation

$H(X) = -\sum_{i \in L} p(i) \log p(i) = -\left[4 \times \frac{1}{8}\log\frac{1}{8} + 2 \times \frac{1}{4}\log\frac{1}{4}\right] = 2.5$

- again, we need 3 bits:

    char:  P    T    K    A    I    U
    code:  000  001  010  011  100  101

- BUT: since the distribution is NOT uniform, we can design a better code ...
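A sketch (mine, using only the probabilities in the table above) confirming the 2.5-bit figure:

```python
import math

# Character probabilities for Simplified Polynesian (from the slide)
probs = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}

H = -sum(p * math.log2(p) for p in probs.values())
print(H)   # 2.5 -- less than the 3 bits per character of the fixed-length code
```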
Simplified Polynesian
Designing a better code

    char:  P    T   K    A   I    U
    code:  100  00  101  01  110  111

- 0 as first digit: 2-digit char.
- 1 as first digit: 3-digit char.
- More likely characters get shorter codes

Task:
- Code the following word: KATUPATI
- How many average bits do we need? (A code sketch for checking this follows the next slide.)

Joint entropy

For two random variables X & Y, joint entropy is the average amount of information needed to specify both values

(4) $H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)$

How much do the two values influence each other?
- e.g., the average surprisal at seeing two POS tags next to each other
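Returning to the coding task above, here is a sketch (my own, using only the code table and probabilities from the slides) that encodes KATUPATI and computes the expected code length; the 2.5-bit average is what makes this code better than the fixed 3-bit one:

```python
# Prefix code and character probabilities from the Simplified Polynesian slides
code  = {"p": "100", "t": "00", "k": "101", "a": "01", "i": "110", "u": "111"}
probs = {"p": 1/8,   "t": 1/4,  "k": 1/8,   "a": 1/4,  "i": 1/8,   "u": 1/8}

word = "katupati"          # KATUPATI, lowercased to match the dictionary keys
encoded = "".join(code[ch] for ch in word)
print(encoded, len(encoded))   # 20 bits, vs. 24 bits with the 3-bit code

# Expected number of bits per character under this code
avg_bits = sum(probs[ch] * len(code[ch]) for ch in code)
print(avg_bits)                # 2.5 -- matches the entropy computed earlier
```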
Conditional entropy

Conditional entropy: how much extra information is needed to find Y's value, given that we know X?

(5)
$$
\begin{aligned}
H(Y|X) &= \sum_{x \in X} p(x)\, H(Y|X = x) \\
       &= \sum_{x \in X} p(x)\left[-\sum_{y \in Y} p(y|x) \log_2 p(y|x)\right] \\
       &= -\sum_{x \in X}\sum_{y \in Y} p(x)\, p(y|x) \log_2 p(y|x) \\
       &= -\sum_{x \in X}\sum_{y \in Y} p(x, y) \log_2 p(y|x)
\end{aligned}
$$

Chain Rule

Chain rule for entropy:

$H(X, Y) = H(X) + H(Y|X)$

$H(X_1, \ldots, X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1, X_2) + \ldots + H(X_n|X_1, \ldots, X_{n-1})$
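A minimal sketch (with a made-up 2×2 joint distribution, purely for illustration) that computes H(X,Y), H(X), and H(Y|X) and checks the chain rule H(X,Y) = H(X) + H(Y|X):

```python
import math

def H(probs):
    """Entropy of a collection of probabilities; 0 log 0 is treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) over X = {0, 1} and Y = {0, 1}
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal p(x), then H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x)
p_x = {x: sum(p for (x2, _), p in joint.items() if x2 == x) for x in (0, 1)}
H_Y_given_X = -sum(p * math.log2(p / p_x[x]) for (x, _), p in joint.items() if p > 0)

print(H(joint.values()))              # H(X,Y)
print(H(p_x.values()) + H_Y_given_X)  # H(X) + H(Y|X) -- same value (chain rule)
```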
Syllables in Simplified Polynesian

Our model of simplified Polynesian earlier was too simple; joint entropy will help us with a better model
- Probabilities for letters on a per-syllable basis, using C and V as separate random variables
- Probabilities for consonants followed by a vowel (P(C, ·)) & vowels preceded by a consonant (P(·, V)):

      p: 1/8   t: 3/4   k: 1/8
      a: 1/2   i: 1/4   u: 1/4

- On a per-letter basis, this would be the following (which we are not concerned about here):

      p: 1/16  t: 3/8   k: 1/16
      a: 1/4   i: 1/8   u: 1/8

Syllables in Simplified Polynesian (2)

More specifically, for CV structures the joint probability P(C, V) is represented:

            a      i      u    | P(C)
      p    1/16   1/16    0    | 1/8
      t    3/8    3/16   3/16  | 3/4
      k    1/16    0     1/16  | 1/8
      P(V) 1/2    1/4    1/4   |

P(C, ·) & P(·, V) from before are marginal probabilities:
- P(C = t, V = a) = 3/8
- P(C = t) = 3/4
Polynesian Syllables

Find H(C, V), how surprised we are on average to see a particular syllable structure

(6) $H(C, V) = H(C) + H(V|C) \approx 1.061 + 1.375 \approx 2.44$

(7) a. $H(C) = -\sum_{c \in C} p(c) \log_2 p(c) \approx 1.061$
    b. $H(V|C) = -\sum_{c \in C}\sum_{v \in V} p(c, v) \log p(v|c) = 1.375$

For the calculation of H(V|C) ...
- Can calculate the probability p(v|c) from the chart on the previous page
- e.g., $p(V = a \mid C = p) = \frac{1}{2}$ because $\frac{1}{16}$ is half of $\frac{1}{8}$

Polynesian Syllables (2)

$$
\begin{aligned}
H(C) &= -\sum_{i \in L} p(i) \log p(i) \\
     &= -\left[2 \times \frac{1}{8}\log\frac{1}{8} + \frac{3}{4}\log\frac{3}{4}\right] \\
     &= 2 \times \frac{1}{8}\log 8 + \frac{3}{4}(\log 4 - \log 3) \\
     &= 2 \times \frac{1}{8} \times 3 + \frac{3}{4}(2 - \log 3) \\
     &= \frac{3}{4} + \frac{6}{4} - \frac{3}{4}\log 3 \\
     &= \frac{9}{4} - \frac{3}{4}\log 3 \approx 1.061
\end{aligned}
$$
Polynesian Syllables (3)

$$
\begin{aligned}
H(V|C) &= -\sum_{x \in C}\sum_{y \in V} p(x, y) \log p(y|x) \\
&= -\big[\tfrac{1}{16}\log\tfrac{1}{2} + \tfrac{3}{8}\log\tfrac{1}{2} + \tfrac{1}{16}\log\tfrac{1}{2} + \tfrac{1}{16}\log\tfrac{1}{2} + \tfrac{3}{16}\log\tfrac{1}{4} + 0\log 0 \\
&\qquad + 0\log 0 + \tfrac{3}{16}\log\tfrac{1}{4} + \tfrac{1}{16}\log\tfrac{1}{2}\big] \\
&= \tfrac{1}{16}\log 2 + \tfrac{3}{8}\log 2 + \tfrac{1}{16}\log 2 + \tfrac{1}{16}\log 2 + \tfrac{3}{16}\log 4 + \tfrac{3}{16}\log 4 + \tfrac{1}{16}\log 2 \\
&= \tfrac{11}{8}
\end{aligned}
$$

Polynesian Syllables (4)

Exercise: Verify this result by using $H(C, V) = H(V) + H(C|V)$
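As a numerical check (my own sketch, built only from the joint P(C, V) table given earlier), the three quantities above can be recomputed directly; swapping the roles of C and V in the same code is one way to approach the exercise:

```python
import math

# Joint probabilities P(C, V) from the Simplified Polynesian syllable table
joint = {
    ("p", "a"): 1/16, ("p", "i"): 1/16, ("p", "u"): 0,
    ("t", "a"): 3/8,  ("t", "i"): 3/16, ("t", "u"): 3/16,
    ("k", "a"): 1/16, ("k", "i"): 0,    ("k", "u"): 1/16,
}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginal P(C), then H(V|C) = -sum p(c,v) log2 p(v|c), with p(v|c) = p(c,v)/p(c)
p_c = {c: sum(p for (c2, _), p in joint.items() if c2 == c) for c in "ptk"}
H_C = H(p_c.values())
H_V_given_C = -sum(p * math.log2(p / p_c[c]) for (c, _), p in joint.items() if p > 0)

print(H_C)                  # ~1.061
print(H_V_given_C)          # 1.375 (= 11/8)
print(H(joint.values()))    # ~2.44, equal to H_C + H_V_given_C
```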
Pointwise mutual information

Mutual information ($I(X; Y)$): how related are two different random variables?
- = Amount of information one random variable contains about another
- = Reduction in uncertainty of one random variable based on knowledge of the other

Pointwise mutual information: mutual information for two points (not two distributions), e.g., a two-word collocation

(8) $I(x; y) = \log \frac{p(x, y)}{p(x)\,p(y)}$

Probability of x and y occurring together vs. independently

Exercise: Calculate the pointwise mutual information of C = p and V = i from the simplified Polynesian example

Mutual information

Mutual information: on average, what is the common information between X and Y?

(9) $I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)}$

Note that $\log_2 1 = 0$ and that this happens when $p(x, y) = p(x)p(y)$
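A sketch of both quantities (mine, reusing the joint syllable table from earlier); running it is one way to check the pointwise-MI exercise above (and the I(C; V) exercise that follows):

```python
import math

# Joint P(C, V) and its marginals, from the Simplified Polynesian syllable table
joint = {
    ("p", "a"): 1/16, ("p", "i"): 1/16, ("p", "u"): 0,
    ("t", "a"): 3/8,  ("t", "i"): 3/16, ("t", "u"): 3/16,
    ("k", "a"): 1/16, ("k", "i"): 0,    ("k", "u"): 1/16,
}
p_c = {c: sum(p for (c2, _), p in joint.items() if c2 == c) for c in "ptk"}
p_v = {v: sum(p for (_, v2), p in joint.items() if v2 == v) for v in "aiu"}

def pmi(c, v):
    """Pointwise mutual information: log2 p(x,y) / (p(x) p(y))."""
    return math.log2(joint[(c, v)] / (p_c[c] * p_v[v]))

def mutual_information():
    """I(C;V) = sum over c,v of p(c,v) * pmi(c,v); zero-probability cells contribute 0."""
    return sum(p * pmi(c, v) for (c, v), p in joint.items() if p > 0)

print(pmi("p", "i"))          # pointwise MI of C = p, V = i
print(mutual_information())   # I(C; V)
```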
Mutual information (2)

H(X|Y): information needed to specify X when Y is known

(10) $I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$

Take the information needed to specify X and subtract out the information when Y is known
- ... i.e., the information shared by X and Y
- When X and Y are independent, $I(X; Y) = 0$
- When X and Y are very dependent, $I(X; Y)$ is high

Exercise: Calculate I(C; V) for Simplified Polynesian

Relative entropy

How far off are two distributions from each other?

Relative entropy, or Kullback-Leibler (KL) divergence, provides such a measure (for distributions p and q):

(11) $D(p\,\|\,q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$

- Informally, this is the distance between p and q
- If p is the correct distribution: the average number of bits wasted by using distribution q instead of p
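A minimal sketch of D(p||q) for discrete distributions (the two distributions below are made up purely for illustration):

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)); infinite if q(x) = 0 where p(x) > 0."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue                 # 0 * log(0/q(x)) is treated as 0
        qx = q.get(x, 0)
        if qx == 0:
            return math.inf          # p puts mass where q has none
        total += px * math.log2(px / qx)
    return total

# Hypothetical "true" distribution p and approximating distribution q
p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 1/3, "b": 1/3, "c": 1/3}

print(kl_divergence(p, q))   # extra bits per symbol wasted by coding p with q
print(kl_divergence(q, p))   # a different value: D is not symmetric
```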
Notes on relative entropy

$D(p\,\|\,q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$

- Often used as a distance measure in machine learning
- $D(p\,\|\,q)$ is always nonnegative
- $D(p\,\|\,q) = 0$ if $p = q$
- $D(p\,\|\,q) = \infty$ if there is an $x \in X$ such that $p(x) > 0$ and $q(x) = 0$
- Not symmetric: $D(p\,\|\,q) \neq D(q\,\|\,p)$

Divergence as mutual information

Our formula for mutual information was:

(12) $I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$

meaning that it is the same as measuring how far a joint distribution ($p(x, y)$) is from independence ($p(x)p(y)$):

(13) $I(X; Y) = D(p(x, y)\,\|\,p(x)p(y))$
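As a cross-check (my own sketch, again using the syllable joint table), equation (13) can be verified numerically by computing I(C; V) once in the KL form and once via equation (10) as H(C) − H(C|V); the two routes agree:

```python
import math

joint = {
    ("p", "a"): 1/16, ("p", "i"): 1/16, ("p", "u"): 0,
    ("t", "a"): 3/8,  ("t", "i"): 3/16, ("t", "u"): 3/16,
    ("k", "a"): 1/16, ("k", "i"): 0,    ("k", "u"): 1/16,
}
p_c = {c: sum(p for (c2, _), p in joint.items() if c2 == c) for c in "ptk"}
p_v = {v: sum(p for (_, v2), p in joint.items() if v2 == v) for v in "aiu"}

# I(C;V) as a KL divergence: D( p(c,v) || p(c)p(v) )
I_as_kl = sum(p * math.log2(p / (p_c[c] * p_v[v]))
              for (c, v), p in joint.items() if p > 0)

# I(C;V) via equation (10): H(C) - H(C|V), with p(c|v) = p(c,v)/p(v)
H_C = -sum(p * math.log2(p) for p in p_c.values())
H_C_given_V = -sum(p * math.log2(p / p_v[v])
                   for (c, v), p in joint.items() if p > 0)

print(I_as_kl, H_C - H_C_given_V)   # the same number, illustrating (13)
```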
The noisy channel model

The idea of information comes from Claude Shannon's description of the noisy channel model

Many natural language tasks can be viewed as:
- There is an output, which we can observe
- We want to guess at what the input was ... but it has been "corrupted" along the way

Example: machine translation from English to French
- Assume that the true input is in French
- But all we can see is the garbled (English) output
- To what extent can we recover the "original" French?

The noisy channel model theoretically

Some questions behind information theory:
- How much loss of information can we prevent when we are attempting to compress the data?
  - i.e., how redundant does the data need to be?
  - And what is the theoretical maximum amount of compression? (This is entropy.)
- How fast can data be transmitted perfectly?
  - A channel has a specific capacity (defined by mutual information)