
Some basic concepts of Information Theory and
Entropy
• Information theory, IT
• Entropy
• Mutual Information
• Use in NLP
Entropy
• Related to coding theory: a more efficient code
  uses fewer bits for the more frequent messages,
  at the cost of more bits for the less frequent ones
EXAMPLE: You have to send messages about the two
occupants in a house every five minutes

• Equal probability (each message needs 2 bits):
  0  no occupants
  1  first occupant
  2  second occupant
  3  both occupants

• Different probabilities:

  Situation         Probability   Code
  no occupants      .5            0
  first occupant    .125          110
  second occupant   .125          111
  both occupants    .25           10
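To make the saving concrete, here is a minimal Python sketch (not from the original slides) comparing the expected code length of the two codes above:

```python
# Expected code length for the house-occupancy example.
# Probabilities and code lengths come from the table above.
probs = {"no occupants": 0.5, "first occupant": 0.125,
         "second occupant": 0.125, "both occupants": 0.25}
code_len = {"no occupants": 1, "first occupant": 3,
            "second occupant": 3, "both occupants": 2}

fixed = 2  # bits per message with the equal-length code
variable = sum(probs[s] * code_len[s] for s in probs)

print(f"fixed-length code:    {fixed} bits/message")
print(f"variable-length code: {variable} bits/message")  # 1.75 bits/message
```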
• Let X be a random variable taking values x1,
  x2, ..., xn from a domain D according to a
  probability distribution p
• We can define the expected value of X, E(X),
  as the sum of the possible values weighted
  by their probability:

  E(X) = p(x1) x1 + p(x2) x2 + ... + p(xn) xn
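As a quick illustration (not part of the original slides), a minimal Python sketch of the expected value of a discrete random variable with numeric values:

```python
def expected_value(dist):
    """Expected value of a discrete random variable.

    dist maps each numeric value xi of X to its probability p(xi).
    """
    return sum(p * x for x, p in dist.items())

# Example: a fair four-sided die
print(expected_value({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}))  # 2.5
```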
Entropy
• A message can be thought of as a random
  variable W that can take one of several
  values V(W) with a probability distribution P
• Is there a lower bound on the number of bits
  needed to encode a message? Yes, the
  entropy
• It is possible to get close to this minimum
  (lower bound)
• It is also a measure of our uncertainty about
  what the message says (many bits: uncertain,
  few bits: certain)
• Given an event, we want to obtain its information
  content (I)
• From Shannon in the 1940s
• Two constraints:
  • Significance:
    the less probable an event is, the more information it
    contains:
    P(x1) > P(x2) => I(x2) > I(x1)
  • Additivity:
    if two events are independent:
    I(x1x2) = I(x1) + I(x2)
• I(m) = 1/p(m) does not satisfy
  the second requirement (additivity)
• I(x) = - log p(x) satisfies both
• So we define I(x) = - log p(x)
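A small Python sketch (illustration only) checking the two constraints: I(x) = -log2 p(x) is additive for independent events, while 1/p(x) is not:

```python
import math

def info(p):
    """Information content in bits of an event with probability p."""
    return -math.log2(p)

p1, p2 = 0.5, 0.25          # two independent events
p_joint = p1 * p2           # joint probability under independence

print(info(p_joint), info(p1) + info(p2))   # 3.0 3.0  -> additive
print(1 / p_joint, 1 / p1 + 1 / p2)         # 8.0 6.0  -> not additive
```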
• Let X be a random variable, described by p(X), with an
  information content I
• Entropy is the expected value of I:
  H(X) = E(I) = - Σx p(x) log2 p(x)
• Entropy measures the information content of a random variable.
  We can see it as the average length of the message
  needed to transmit a value of this variable using an optimal
  coding
• Entropy measures the degree of disorder (uncertainty) of the
  random variable
• Uniform distribution of a variable X:
  • Each possible value xi ∈ X, with |X| = M, has the same
    probability pi = 1/M
  • If the value xi is encoded in binary we need log2 M bits of
    information
• Non-uniform distribution:
  • By analogy
  • Each value xi has a different probability pi
  • Let us assume the values to be independent
  • If Mi = 1/pi we will need log2 Mi = log2 (1/pi) = - log2 pi bits
    of information
Let X = {a, b, c, d} with pa = 1/2; pb = 1/4; pc = 1/8; pd = 1/8

entropy(X) = E(I) =
-1/2 log2 (1/2) - 1/4 log2 (1/4) - 1/8 log2 (1/8) - 1/8 log2 (1/8) =
7/4 = 1.75 bits
[Decision tree of yes/no questions ("X = a?", if no "X = b?", if no "X = c?")
identifying the value of X; average number of questions: 1.75]
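The same entropy can be computed directly; a minimal Python sketch (illustration only):

```python
import math

def entropy(dist):
    """Entropy in bits of a discrete distribution given as {value: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

X = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
print(entropy(X))  # 1.75 bits, matching the average number of questions
```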
Let X be a binary random variable (Bernoulli distribution):
X = 0 with probability p
X = 1 with probability (1 - p)

H(X) = -p log2 (p) - (1-p) log2 (1-p)

p = 0   => 1 - p = 1    H(X) = 0
p = 1   => 1 - p = 0    H(X) = 0
p = 1/2 => 1 - p = 1/2  H(X) = 1

[Plot of H(X) against p: 0 at p = 0 and p = 1, maximum of 1 bit at p = 1/2]
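A quick check of the binary entropy values quoted above (illustration only):

```python
import math

def binary_entropy(p):
    """H(X) for a binary variable with P(X=0) = p."""
    if p in (0.0, 1.0):
        return 0.0  # the limit of p*log2(p) is 0 as p -> 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.5, 1.0):
    print(p, binary_entropy(p))  # 0.0, 1.0, 0.0 bits
```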
• The joint entropy of two random variables X and
  Y is the average information content needed for
  specifying both variables:

  H(X,Y) = - Σx Σy p(x,y) log2 p(x,y)
• The conditional entropy of a random
  variable Y given another random variable
  X describes how much information is
  needed on average to communicate Y when
  the receiver already knows X:

  H(Y|X) = - Σx Σy p(x,y) log2 p(y|x)
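A minimal Python sketch of joint and conditional entropy, using a made-up joint distribution (the numbers are only for illustration):

```python
import math

# Hypothetical joint distribution p(x, y) over X in {0, 1} and Y in {0, 1}
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

def joint_entropy(p_xy):
    """H(X,Y) = -sum_{x,y} p(x,y) log2 p(x,y)."""
    return -sum(p * math.log2(p) for p in p_xy.values() if p > 0)

def conditional_entropy(p_xy):
    """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x)."""
    p_x = {}
    for (x, _), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p          # marginal p(x)
    return -sum(p * math.log2(p / p_x[x])
                for (x, _), p in p_xy.items() if p > 0)

print(joint_entropy(p_xy))        # H(X,Y)
print(conditional_entropy(p_xy))  # H(Y|X)
```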
Chain rule for probabilities

P(A,B) = P(A|B) P(B) = P(B|A) P(A)
P(A,B,C,D,...) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) ...
Chain rule for entropies

H(X,Y) = H(X) + H(Y|X)
H(X1, X2, ..., Xn) = H(X1) + H(X2|X1) + ... + H(Xn|X1, ..., Xn-1)
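The chain rule can be checked numerically; a sketch reusing the same made-up joint distribution as in the previous example:

```python
import math

# Hypothetical joint distribution p(x, y) (same as in the previous sketch)
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
p_x = {0: 0.75, 1: 0.25}                               # marginal of X

# H(X,Y) directly from the joint distribution
h_joint = -sum(p * math.log2(p) for p in p_xy.values())

# H(X) and H(Y|X), using p(x,y) = p(x) p(y|x)
h_x = -sum(p * math.log2(p) for p in p_x.values())
h_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items())

print(h_joint, h_x + h_y_given_x)  # equal: H(X,Y) = H(X) + H(Y|X)
```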
Mutual Information

• I(X,Y) is the mutual information
  between X and Y:
  I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
• I(X,Y) measures the reduction in the
  uncertainty about X when Y is known
• It also measures the amount of
  information X contains about Y (or Y
  about X)
• I(X,Y) = 0 only when X and Y are
  independent:
  • H(X|Y) = H(X)
• H(X) = H(X) - H(X|X) = I(X,X)
• Entropy is the self-information (the mutual
  information between X and X)
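A minimal sketch of mutual information computed as I(X,Y) = H(X) + H(Y) - H(X,Y), again with the made-up joint distribution used above:

```python
import math

# Hypothetical joint distribution p(x, y)
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

def marginal(p_xy, axis):
    """Marginal distribution over the given axis (0 for X, 1 for Y)."""
    m = {}
    for xy, p in p_xy.items():
        m[xy[axis]] = m.get(xy[axis], 0.0) + p
    return m

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

p_x, p_y = marginal(p_xy, 0), marginal(p_xy, 1)

mi = H(p_x) + H(p_y) - H(p_xy)   # I(X,Y)
print(mi)        # > 0: X and Y are not independent here
print(H(p_x))    # I(X,X) = H(X): entropy as self-information
```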
Pointwise Mutual Information

• The PMI of a pair of outcomes x and y of two
  discrete random variables quantifies the
  discrepancy between the probability of their
  coincidence under their joint distribution and the
  probability of their coincidence under their
  individual distributions, assuming independence:

  PMI(x,y) = log2 ( p(x,y) / (p(x) p(y)) )

• The mutual information of X and Y is the expected
  value of the PMI (specific mutual information) over all
  possible outcomes
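In NLP, PMI is typically estimated from co-occurrence counts to score word associations. A minimal sketch; the word pair and the counts below are made up for illustration:

```python
import math

# Hypothetical corpus statistics (made-up numbers for illustration)
N = 1_000_000           # total number of co-occurrence windows
count_x = 1_200         # occurrences of "new"
count_y = 800           # occurrences of "york"
count_xy = 500          # co-occurrences of "new york"

p_x, p_y, p_xy = count_x / N, count_y / N, count_xy / N

pmi = math.log2(p_xy / (p_x * p_y))
print(f"PMI(new, york) = {pmi:.2f} bits")  # large positive -> strong association
```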
• H: entropy of a language L
• We do not know the true distribution p(X)
• Let q(X) be a language model (LM)
• How good is q(X) as an estimate of
  p(X)?
Cross Entropy

Measures the "surprise" of a model q
when it describes events that follow a
distribution p:

H(p,q) = - Σx p(x) log2 q(x)
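A minimal sketch of cross entropy with made-up distributions; note that H(p,q) ≥ H(p), with equality only when q = p:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x), in bits."""
    return -sum(p[x] * math.log2(q[x]) for x in p if p[x] > 0)

# True distribution p and two candidate models (hypothetical numbers)
p = {"a": 0.5, "b": 0.25, "c": 0.25}
q_good = {"a": 0.5, "b": 0.25, "c": 0.25}
q_bad = {"a": 0.1, "b": 0.45, "c": 0.45}

print(cross_entropy(p, q_good))  # 1.5 = H(p): perfect model
print(cross_entropy(p, q_bad))   # > 1.5: more "surprise"
```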
Relative Entropy or Kullback-Leibler (KL) divergence

Measures the difference between two probability distributions:

D(p || q) = Σx p(x) log2 ( p(x) / q(x) )
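A minimal sketch (same made-up distributions as above) relating KL divergence to cross entropy, D(p||q) = H(p,q) - H(p):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2(p(x) / q(x)), in bits."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.1, "b": 0.45, "c": 0.45}

entropy_p = -sum(px * math.log2(px) for px in p.values())
cross = -sum(p[x] * math.log2(q[x]) for x in p)

print(kl_divergence(p, q))   # ~0.74 bits
print(cross - entropy_p)     # same value: D(p||q) = H(p,q) - H(p)
```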