Introduction to Information Theory
L645: Advanced NLP, Autumn 2009

Information Theory

Information theory answers two fundamental questions in communication theory:
• What is the ultimate data compression?
• What is the ultimate transmission rate of communication?

It is helpful in many areas: computer science (Kolmogorov complexity), thermodynamics, economics, computational linguistics.

Background reading: T. Cover, J. Thomas (2006) Elements of Information Theory. Wiley.

Information & Uncertainty

The idea of information comes from Claude Shannon's description of the noisy channel model, which is used to model many processes. We are going to talk about information as a decrease in uncertainty:
• We have some uncertainty about a process, e.g., which symbol (A, B, C) will be generated by a device
• We learn a piece of information, e.g., that the previous symbol was B
• How uncertain are we now about which symbol will appear?
  – The more informative our piece of information is, the less uncertain we will be.

The noisy channel model

Many natural language tasks can be viewed as follows:
• There is a certain output, which we can see
• We want to guess the input ... but it has been corrupted along the way

Example: machine translation from English to French
• Assume that the true input is in French
• But all we can see is the garbled (English) output
• To what extent can we recover the "original" French?

The noisy channel model theoretically

Some questions behind information theory:
• How much loss of information can we prevent when we are attempting to compress the data?
  – i.e., how redundant does the data need to be?
  – And what is the theoretical maximum amount of compression? (We'll see that this is entropy.)
• How fast can data be transmitted perfectly?
  → a channel has a specific capacity (defined by mutual information)

Entropy

Entropy is the way we measure how informative something (a random variable) is:

(1) H(p) = H(X) = − Σ_{x∈X} p(x) log2 p(x)

What does this formula mean? How do we get it?

Motivating Entropy

Let's assume that we have a device that emits one symbol (A). We have no uncertainty as to what we will see: uncertainty is zero.

A device that emits two symbols (A, B): we have one choice, either A or B, so our uncertainty is one.
• We could use one bit (0 or 1) to encode the outcome.

A device that emits four symbols (A, B, C, D): if we made a decision tree, there would be two levels (starting with "Is it A/B or C/D?"), so uncertainty is two.
• We would need two bits (00, 01, 10, 11) to encode the outcome.

What we are describing is a logarithm (base 2): log2 M, where M is the number of symbols.

Adding probabilities

But log2 M assumes that every choice (A, B, C, D) is equally likely → this is not generally true.

So, instead we look at − log2 p(x), where x is the given symbol, to tell us how surprising something is.
• If every choice x is equally likely, then p(x) = 1/M (and thus also M = 1/p(x))
• And then log2 M = log2 (1/p(x)) = − log2 p(x)
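To make the step from log2 M to − log2 p(x) concrete, here is a minimal Python sketch (an illustrative addition, not part of the original slides; the function name surprisal is mine): for M equally likely symbols the surprisal of each symbol is exactly log2 M bits, and rarer symbols are more surprising.

    import math

    def surprisal(p: float) -> float:
        """Surprisal (self-information) in bits of an outcome with probability p."""
        return -math.log2(p)

    # For M equally likely symbols, each outcome has probability 1/M,
    # so its surprisal is exactly log2 M bits.
    for m in (2, 4, 8):
        print(m, surprisal(1 / m))   # 2 -> 1.0, 4 -> 2.0, 8 -> 3.0

    # With unequal probabilities, rarer symbols are more surprising:
    print(surprisal(1 / 4), surprisal(1 / 8))   # 2.0 bits vs. 3.0 bits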
Average surprisal

But − log2 p(x) just tells us how surprising one particular symbol is. On average, though, how surprising is our random variable? That is where the summation (a weighted average) comes in:

(2) H(X) = − Σ_{x∈X} p(x) log2 p(x) = E(log2 (1/p(X)))

Sum over all possible outcomes, multiplying the surprisal (− log2 p(x)) by the probability of seeing that surprising outcome (p(x)).
• The amount of surprisal is the amount of information we need in order to not be surprised.
  – H(X) = 0 if the outcome is certain
  – H(X) = 1 if there are two outcomes and both are equally likely

Entropy example

Roll an 8-sided die (or pick a character from an alphabet of 8 characters).
• Because each outcome is equally likely, the entropy is:

(3) H(X) = − Σ_{i=1}^{8} p(i) log2 p(i) = − Σ_{i=1}^{8} (1/8) log2 (1/8) = log2 8 = 3

In other words, you need 3 bits to encode this (000, 001, 010, 011, ...), i.e., the number of bits needed to describe an 8-character language:

  char: A   E   I   O   U   F   G   H
  code: 000 001 010 011 100 101 110 111

• How many bits do we need if the entropy is 2.3?

Simplified Polynesian

• Simplified Polynesian has an alphabet L of 6 characters:

  char: p   t   k   a   i   u
  prob: 1/8 1/4 1/8 1/4 1/8 1/8

• Entropy: H(X) = − Σ_{i∈L} p(i) log2 p(i) = −[4 × (1/8) log2 (1/8) + 2 × (1/4) log2 (1/4)] = 2.5

Simplified Polynesian (2)

• With a fixed-length code, we again need 3 bits:

  char: p   t   k   a   i   u
  code: 000 001 010 011 100 101

• BUT: since the distribution is NOT uniform, we can design a better code:

  char: t  a  p   k   i   u
  code: 00 01 100 101 110 111

  – 0 as first digit: 2-digit character
  – 1 as first digit: 3-digit character
• More likely characters get shorter codes
• Code the following word: "KATUPATI"
• How many bits on average do we need?

Joint entropy

For two random variables X and Y, the joint entropy is the amount of information needed to specify both values, on average:

(4) H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log2 p(x, y)

In other words, we can talk about how two values are related and influence each other
• e.g., the average surprisal at seeing two POS tags next to each other

Conditional entropy

Conditional entropy is a similar idea: how much extra information is needed to find Y's value, given that we know the value of X.

(5) H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x)
           = Σ_{x∈X} p(x) [− Σ_{y∈Y} p(y|x) log2 p(y|x)]
           = − Σ_{x∈X} Σ_{y∈Y} p(x) p(y|x) log2 p(y|x)
           = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log2 p(y|x)

Chain Rule

And then we have the chain rule for entropy (resembling the chain rule of probability):

  H(X, Y) = H(X) + H(Y|X)
  H(X1, ..., Xn) = H(X1) + H(X2|X1) + H(X3|X1, X2) + ... + H(Xn|X1, ..., Xn−1)

Syllables in Simplified Polynesian

Our model of simplified Polynesian earlier was too simple; joint entropy will help us with a better model.
• Probabilities for letters on a per-syllable basis, using the consonant C and the vowel V as separate random variables
  – i.e., probabilities for consonants followed by a vowel (P(C, ·)) and for vowels preceded by a consonant (P(·, V)):

  consonant: p   t   k      vowel: a   i   u
  prob:      1/8 3/4 1/8    prob:  1/2 1/4 1/4

• On a per-letter basis, this would be the following (which we are not concerned about here):

  p    t   k    a   i   u
  1/16 3/8 1/16 1/4 1/8 1/8

Polynesian Syllables

More specifically, for CV structures the joint probability P(C, V) is represented as:

           p     t     k     | P(V)
  a        1/16  3/8   1/16  | 1/2
  i        1/16  3/16  0     | 1/4
  u        0     3/16  1/16  | 1/4
  ---------------------------+-----
  P(C)     1/8   3/4   1/8   |

Syllables in Simplified Polynesian (2)

We want to find H(C, V), in other words, how surprised we are on average to see a particular syllable structure:

(6) H(C, V) = H(C) + H(V|C) ≈ 1.061 + 1.375 ≈ 2.44

(7) a. H(C) = − Σ_{c∈C} p(c) log2 p(c) ≈ 1.061
    b. H(V|C) = − Σ_{c∈C} Σ_{v∈V} p(c, v) log2 p(v|c) = 1.375
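Before working these values out by hand below, here is a small Python sketch (an addition, not part of the original slides; the dictionary name joint is mine) that recomputes H(C), H(V|C), and H(C, V) directly from the joint table above, reproducing the numbers in (6) and (7).

    import math

    # Joint distribution P(C, V) for Simplified Polynesian syllables,
    # copied from the table above (consonants p, t, k; vowels a, i, u).
    joint = {
        ('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
        ('p', 'i'): 1/16, ('t', 'i'): 3/16, ('k', 'i'): 0,
        ('p', 'u'): 0,    ('t', 'u'): 3/16, ('k', 'u'): 1/16,
    }

    # Marginal P(C): sum the joint probabilities over the vowels.
    p_c = {}
    for (c, v), p in joint.items():
        p_c[c] = p_c.get(c, 0) + p

    # H(C) = -sum_c p(c) log2 p(c)
    h_c = -sum(p * math.log2(p) for p in p_c.values() if p > 0)

    # H(V|C) = -sum_{c,v} p(c,v) log2 p(v|c), with the convention 0 log 0 = 0
    h_v_given_c = -sum(p * math.log2(p / p_c[c])
                       for (c, v), p in joint.items() if p > 0)

    print(round(h_c, 3), round(h_v_given_c, 3), round(h_c + h_v_given_c, 3))
    # ~1.061, 1.375, ~2.436 -- matching (6) and (7) above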
For the calculation of H(V|C):
• The P(C, ·) and P(·, V) values from before are the marginal probabilities
• We can calculate the probability p(v|c) from the chart on the previous page
  – e.g., P(C = t, V = a) = 3/8 and P(C = t) = 3/4
  – e.g., p(V = a | C = p) = 1/2, because 1/16 is half of 1/8

Exercise: Verify this result by using H(C, V) = H(V) + H(C|V)

Polynesian Syllables (2)

  H(C) = − Σ_{i∈C} p(i) log2 p(i)
       = −[2 × (1/8) log2 (1/8) + (3/4) log2 (3/4)]
       = 2 × (1/8) log2 8 + (3/4)(log2 4 − log2 3)
       = 2 × (1/8) × 3 + (3/4)(2 − log2 3)
       = 3/4 + 6/4 − (3/4) log2 3
       = 9/4 − (3/4) log2 3 ≈ 1.061

Polynesian Syllables (3)

Taking 0 log2 0 = 0:

  H(V|C) = − Σ_{x∈C} Σ_{y∈V} p(x, y) log2 p(y|x)
         = −[1/16 log2 (1/2) + 3/8 log2 (1/2) + 1/16 log2 (1/2)
             + 1/16 log2 (1/2) + 3/16 log2 (1/4) + 0 log2 0
             + 0 log2 0 + 3/16 log2 (1/4) + 1/16 log2 (1/2)]
         = 1/16 log2 2 + 3/8 log2 2 + 1/16 log2 2
             + 1/16 log2 2 + 3/16 log2 4
             + 3/16 log2 4 + 1/16 log2 2
         = 11/8

Pointwise mutual information

Mutual information, I(X; Y), tells us how related two different random variables are, i.e., will knowing one tell you a lot about the other?
• It is the amount of information that one random variable contains about another, or the reduction in uncertainty of one random variable based on knowledge of the other.

Let's start by looking at pointwise mutual information, the mutual information for two points (not two distributions), e.g., a collocation of two words:

(8) I(x; y) = log2 (p(x, y) / (p(x) p(y)))

We compare the probability of x and y occurring together vs. the probability of x and the probability of y occurring independently.

Exercise: Calculate the pointwise mutual information of C = p and V = i

Mutual information

Mutual information makes this comparison for the random variables themselves, i.e., on average, what is the common information between X and Y?

(9) I(X; Y) = Σ_{x,y} p(x, y) log2 (p(x, y) / (p(x) p(y)))

Because of how we defined H(X, Y), namely as H(X) + H(Y|X) = H(Y) + H(X|Y), we have the following equivalent formulas:

(10) I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
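As a companion to (8)-(10), here is a short Python sketch (an addition, not from the original slides; the function names are mine) that computes pointwise mutual information for a single pair and mutual information for a whole joint distribution. The commented-out lines show how it could be applied to the joint dictionary from the earlier sketch to check the two exercises; the answers are deliberately not printed here.

    import math

    def pmi(p_xy, p_x, p_y):
        """Pointwise mutual information of one outcome pair, as in (8)."""
        return math.log2(p_xy / (p_x * p_y))

    def mutual_information(joint):
        """I(X;Y) = sum_{x,y} p(x,y) log2 p(x,y)/(p(x)p(y)), as in (9).
        `joint` maps (x, y) pairs to probabilities."""
        p_x, p_y = {}, {}
        for (x, y), p in joint.items():
            p_x[x] = p_x.get(x, 0) + p
            p_y[y] = p_y.get(y, 0) + p
        return sum(p * math.log2(p / (p_x[x] * p_y[y]))
                   for (x, y), p in joint.items() if p > 0)

    # With the Simplified Polynesian `joint` table from the earlier sketch:
    # print(pmi(joint[('p', 'i')], 1/8, 1/4))   # pointwise MI of C=p, V=i
    # print(mutual_information(joint))          # I(C;V)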
Mutual information (2)

Recall that H(X|Y) is the information needed to specify X when Y is known.

(11) I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

This formula means: take the information needed to specify X and subtract out the information still needed when Y is known; in other words, we are left with the information shared by X and Y.
• When X and Y are independent, I(X; Y) = 0
  – Note that log2 1 = 0, and that this happens when p(x, y) = p(x) p(y)
• When X and Y are highly dependent, I(X; Y) is very high

Exercise: Calculate I(C; V) for Simplified Polynesian

Noisy Channel Model (revisited)

Recall the noisy channel model, developed with data compression and transmission in mind (its basis: communication over a telephone line).
• Entropy = the optimal amount of compression
• Capacity C = the rate at which information can be transmitted:

  C = max_{p(X)} I(X; Y)

Relative entropy

If we know that one distribution is (more) true, but another distribution is perhaps easier to calculate, we might want to see how far off the two distributions are.

Relative entropy, or Kullback-Leibler (KL) divergence, provides such a measure (for distributions p and q):

(12) D(p||q) = Σ_{x∈X} p(x) log2 (p(x) / q(x))

Informally, this is the distance between p and q.

Notes on relative entropy

• Often used as a distance measure in machine learning
• D(p||q) is always nonnegative
  – D(p||q) = 0 if p = q
  – D(p||q) = ∞ if there is an x ∈ X such that p(x) > 0 and q(x) = 0
• Not symmetric: D(p||q) ≠ D(q||p)
• It is the average number of bits wasted by using distribution q instead of the correct p

Divergence as mutual information

Our formula for mutual information was:

(13) I(X; Y) = Σ_{x,y} p(x, y) log2 (p(x, y) / (p(x) p(y)))

meaning that it is the same as measuring how far a joint distribution (p(x, y)) is from independence (p(x) p(y)):

(14) I(X; Y) = D(p(x, y) || p(x) p(y))

Cross entropy

Let's say we have a probability model p_M (for language model M) covering some data we have seen.
• How close is this model to the true model p?

The cross entropy between a random variable X with true distribution p and a model of this, p_M, is:

(15) H(X, p_M) = − Σ_x p(x) log2 p_M(x)

Cross entropy as a measure of language models

The following inequality is true:

(16) − Σ_x p(x) log2 p(x) ≤ − Σ_x p(x) log2 q(x) = H(X, q)

This means that the lower the cross entropy for some model q, the closer it is to reality.
• If we can construct a series of models with decreasing cross entropy, we approach (and maybe even reach) the correct model
• Good language models tend to minimize cross entropy (and thus perplexity = 2^(cross entropy))

The Entropy of English

Depending on our language model, we have different (cross) entropy measures:
• Uniform model (27 characters, equiprobable): log2 27 = 4.76 bits per character
• Taking account of letter frequency: 4.03
• Taking account of bigrams: 2.8
• Human experiment: 1.34

The true entropy of English is lower than any of these.
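To tie (12)-(16) together, here is one more Python sketch (an addition, not part of the original slides; the two toy distributions are made up purely for illustration). It shows that D(p||q) is zero only when the model matches the true distribution, and that the cross entropy H(X, q) = H(p) + D(p||q) never drops below the true entropy, which is the sense in which lower cross entropy (and lower perplexity) means a better language model.

    import math

    def entropy(p):
        """H(p) = -sum_x p(x) log2 p(x)"""
        return -sum(px * math.log2(px) for px in p.values() if px > 0)

    def kl_divergence(p, q):
        """D(p||q) = sum_x p(x) log2 p(x)/q(x); infinite if q(x)=0 where p(x)>0."""
        total = 0.0
        for x, px in p.items():
            if px == 0:
                continue
            if q.get(x, 0) == 0:
                return math.inf
            total += px * math.log2(px / q[x])
        return total

    def cross_entropy(p, q):
        """H(X, q) = -sum_x p(x) log2 q(x) = H(p) + D(p||q), as in (16)."""
        return entropy(p) + kl_divergence(p, q)

    # Toy example: a "true" character distribution p and two models of it.
    p = {'a': 0.5, 'b': 0.25, 'c': 0.25}
    q_good = {'a': 0.5, 'b': 0.25, 'c': 0.25}   # matches p exactly
    q_poor = {'a': 1/3, 'b': 1/3, 'c': 1/3}     # uniform model

    print(entropy(p))                     # 1.5 bits: the best any model can do
    print(cross_entropy(p, q_good))       # 1.5 bits, since D(p||q_good) = 0
    print(cross_entropy(p, q_poor))       # ~1.585 bits = log2 3 (0.085 bits wasted)
    print(2 ** cross_entropy(p, q_poor))  # perplexity of the uniform model: ~3.0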