EE5319R: Problem Set 3
Assigned: 24/08/16, Due: 31/08/16
1. Cover and Thomas: Problem 2.30 (Maximum Entropy):
Solution: We are required to maximize $H(P_X)$ over all distributions $P_X$ on the non-negative integers satisfying
$$\sum_{n=0}^{\infty} n P_X(n) = A$$
and also the normalization constraint $\sum_{n=0}^{\infty} P_X(n) = 1$ (which we ignore without loss of generality). Now, construct the Lagrangian:
$$L(P_X, \lambda) = -\sum_{n=0}^{\infty} P_X(n) \log P_X(n) + \lambda \left( \sum_{n=0}^{\infty} n P_X(n) - A \right).$$
Differentiating with respect to $P_X(n)$ (assuming natural logs and interchanging differentiation and the infinite sum), we obtain
$$-P_X(n) \cdot \frac{1}{P_X(n)} - \log P_X(n) + \lambda n = 0,$$
so we have
$$P_X^*(n) = \exp(-1 + \lambda n), \qquad n \geq 0.$$
We immediately recognize that this is a geometric distribution with mean $A$, i.e., $P_X^*$ can be written alternatively as
$$P_X^*(n) = (1 - p)^n p, \qquad n \geq 0,$$
where
$$A = \frac{1 - p}{p}.$$
From direct calculation, the entropy of this maximizing distribution is
$$H(P_X^*) = \frac{H_b(p)}{p}.$$
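As a quick numerical sanity check (not part of the original solution), the following Python sketch computes the entropy of the geometric distribution with mean $A$ directly and compares it to the closed form $H_b(p)/p$; the value $A = 3$ is an arbitrary illustrative choice.

```python
import numpy as np

A = 3.0                       # target mean (illustrative choice)
p = 1.0 / (1.0 + A)           # from A = (1 - p) / p

n = np.arange(0, 500)         # truncation; the tail beyond n = 500 is negligible here
pmf = (1.0 - p) ** n * p      # P*(n) = (1 - p)^n p

H_direct = -np.sum(pmf * np.log2(pmf))                   # term-by-term entropy
H_b = -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)     # binary entropy, in bits
print(H_direct, H_b / p)      # both are approximately 3.245 bits for A = 3
```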
2. (Optional): Cover and Thomas: Problem 2.38 (The Value of a Question):
Solution: We have
$$H(X) - H(X|Y) = I(X; Y) = H(Y) - H(Y|X) = H_b(\alpha) - H(Y|X) = H_b(\alpha),$$
since $H(Y|X) = 0$ (the answer $Y$ is a deterministic function of $X$).
3. Fano's inequality for list decoding: Recall the proof of Fano's inequality. Now develop a generalization of Fano's inequality for list decoding. Let $(X, Y) \sim P_{XY}$ and let $\mathcal{L}(Y) \subset \mathcal{X}$ be a set of size $L \geq 1$ (compare this to an estimator $\hat{X}(Y) \in \mathcal{X}$, which is a set of size $L = 1$). Lower bound the probability of error $\Pr(X \notin \mathcal{L}(Y))$ in terms of $L$, $H(X|\mathcal{L}(Y))$ and $|\mathcal{X}|$.
You should be able to recover the standard Fano inequality if you set L = 1.
Solution: Define the error random variable
$$E = \begin{cases} 1 & X \notin \mathcal{L}(Y) \\ 0 & X \in \mathcal{L}(Y) \end{cases}$$
Now consider
$$H(X, E \mid \mathcal{L}(Y)) = H(X \mid E, \mathcal{L}(Y)) + H(E \mid \mathcal{L}(Y)) = H(E \mid X, \mathcal{L}(Y)) + H(X \mid \mathcal{L}(Y)).$$
Let $P_e := \Pr(X \notin \mathcal{L}(Y))$. Now clearly, $H(E \mid X, \mathcal{L}(Y)) = 0$, and $H(E \mid \mathcal{L}(Y)) \leq H(E) = H_b(P_e)$. Now, we examine the term $H(X \mid E, \mathcal{L}(Y))$. We have
$$H(X \mid E, \mathcal{L}(Y)) = \Pr(E = 0)\, H(X \mid E = 0, \mathcal{L}(Y)) + \Pr(E = 1)\, H(X \mid E = 1, \mathcal{L}(Y)) \leq (1 - P_e) \log L + P_e \log(|\mathcal{X}| - L),$$
since if we know that $E = 0$, the number of values that $X$ can take on is no more than $L$, and if $E = 1$, the number of values that $X$ can take on is no more than $|\mathcal{X}| - L$. Putting everything together and upper bounding $H_b(P_e)$ by $1$, we have
$$P_e \geq \frac{H(X \mid \mathcal{L}(Y)) - 1 - \log L}{\log \frac{|\mathcal{X}| - L}{L}}.$$
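To see the bound in action, here is a small numerical sketch (not part of the original solution): it builds a random joint pmf, uses the list of the $L$ most likely symbols given $Y$ as the decoder, and checks that $P_e$ is at least the right-hand side of the inequality. The alphabet sizes and list size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, L = 8, 4, 2
P = rng.random((nx, ny)); P /= P.sum()            # joint pmf P_{XY}

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# List decoder: the L most likely x's under P_{X|Y}(. | y).
lists = {y: tuple(np.argsort(-P[:, y])[:L]) for y in range(ny)}

# Error probability P_e = Pr(X not in L(Y)).
Pe = sum(P[x, y] for x in range(nx) for y in range(ny) if x not in lists[y])

# H(X | L(Y)): condition on the value of the list, which is a function of Y.
HX_given_list = 0.0
for lst in set(lists.values()):
    ys = [y for y in range(ny) if lists[y] == lst]
    p_lst = P[:, ys].sum()
    HX_given_list += p_lst * H(P[:, ys].sum(axis=1) / p_lst)

bound = (HX_given_list - 1 - np.log2(L)) / np.log2((nx - L) / L)
print(Pe, bound)                                   # Pe >= bound (bound may be negative)
```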
4. (Optional): Data Processing Inequality for KL Divergence: Let PX , QX be pmfs on the
same alphabet X . Assume for the sake of simplicity that PX (x), QX (x) > 0 for all x ∈ X . Let
W (y|x) = Pr(Y = y|X = x) be a channel from X to Y. Define
$$P_Y(y) = \sum_x W(y|x) P_X(x), \quad \text{and} \quad Q_Y(y) = \sum_x W(y|x) Q_X(x).$$
Show that
$$D(P_X \| Q_X) \geq D(P_Y \| Q_Y).$$
You may use the log-sum inequality. This problem shows that processing does not increase divergence.
Solution: Starting from the definition of $D(P_Y \| Q_Y)$, we have
$$\begin{aligned}
D(P_Y \| Q_Y) &= \sum_y P_Y(y) \log \frac{P_Y(y)}{Q_Y(y)} \\
&= \sum_y \left( \sum_x W(y|x) P_X(x) \right) \log \frac{\sum_x W(y|x) P_X(x)}{\sum_x W(y|x) Q_X(x)} \\
&\leq \sum_y \sum_x W(y|x) P_X(x) \log \frac{W(y|x) P_X(x)}{W(y|x) Q_X(x)} \\
&= \sum_y \sum_x W(y|x) P_X(x) \log \frac{P_X(x)}{Q_X(x)} \\
&= \sum_x P_X(x) \log \frac{P_X(x)}{Q_X(x)} \\
&= D(P_X \| Q_X),
\end{aligned}$$
where the inequality follows from the log-sum inequality.
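A quick numerical sketch of this inequality (not part of the original solution): draw random $P_X$, $Q_X$ and a random channel $W$, push both pmfs through $W$, and compare the two divergences. The alphabet sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
nx, ny = 6, 4
PX = rng.random(nx); PX /= PX.sum()
QX = rng.random(nx); QX /= QX.sum()
W = rng.random((ny, nx)); W /= W.sum(axis=0)   # W[y, x] = Pr(Y = y | X = x)

PY = W @ PX
QY = W @ QX

def D(p, q):                                    # KL divergence in bits
    return np.sum(p * np.log2(p / q))

print(D(PX, QX), D(PY, QY))                     # D(PX || QX) >= D(PY || QY)
```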
5. Typical-Set Calculations 1:
(a) Suppose a DMS emits h and t with probability 1/2 each. For $\epsilon = 0.01$ and $n = 5$, what is $A_\epsilon^{(n)}$?
Solution: In this case, $H(X) = 1$. All source sequences are equally likely, each with probability $2^{-5} = 2^{-nH(X)}$. Hence, all sequences satisfy the condition for being typical,
$$2^{-n(H(X)+\epsilon)} \leq p_{X^n}(x^n) \leq 2^{-n(H(X)-\epsilon)},$$
for any $\epsilon > 0$. Hence, all 32 sequences are typical.
(b) Repeat if $\Pr(h) = 0.2$, $\Pr(t) = 0.8$, $n = 5$, and $\epsilon = 0.0001$.
Solution: Consider a sequence with $m$ heads and $n - m$ tails. Then, the probability of occurrence of this sequence is $p^m (1-p)^{n-m}$, where $p = \Pr(h)$. For such a sequence to be typical,
$$2^{-n(H(X)+\epsilon)} \leq p^m (1-p)^{n-m} \leq 2^{-n(H(X)-\epsilon)},$$
which translates to
$$\left| \frac{m}{n} - p \right| \log \frac{1-p}{p} \leq \epsilon.$$
Plugging in the value of $p = 0.2$, we get
$$\left| \frac{m}{5} - \frac{1}{5} \right| \leq \frac{\epsilon}{2}.$$
Since $m = 0, \ldots, 5$, this condition is satisfied for the given $\epsilon$ only for $m = 1$, i.e., when there is one H in the sequence. Thus,
$$A_\epsilon^{(n)} = \{(HTTTT), (THTTT), (TTHTT), (TTTHT), (TTTTH)\}.$$
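One can confirm this by brute force. The following sketch (not part of the original solution) enumerates all $2^5$ sequences and keeps those meeting the typicality condition:

```python
import itertools
import numpy as np

p, n, eps = 0.2, 5, 0.0001
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))    # source entropy in bits

typical = []
for seq in itertools.product("HT", repeat=n):
    m = seq.count("H")
    prob = p**m * (1 - p)**(n - m)
    if abs(-np.log2(prob) / n - H) <= eps:
        typical.append("".join(seq))

print(typical)   # exactly the five sequences with a single H
```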
6. Typical-Set Calculations 2: Consider a DMS with a two symbol alphabet {a, b} where pX (a) = 2/3
and pX (b) = 1/3. Let X n = (X1 , . . . , Xn ) be a string of chance variables from the source with
n = 100, 000.
(a) Let W (Xj ) be the log pmf random variable for the j-th source output, i.e., W (Xj ) = − log 2/3
for Xj = a and − log 1/3 for Xj = b. Find the variance of W (Xj ).
Solution: For notational convenience, we will denote the log pmf random variable by $W$. Now, note that $W$ takes on the value $-\log 2/3$ with probability 2/3 and $-\log 1/3$ with probability 1/3. Since these two values differ by exactly $\log 2 = 1$ bit,
$$\mathrm{Var}(W) = E[W^2] - E[W]^2 = \frac{2}{3} \cdot \frac{1}{3} \cdot 1^2 = \frac{2}{9}.$$
(b) For $\epsilon = 0.01$, evaluate the bound on the probability of the typical set using $\Pr(X^n \notin A_\epsilon^{(n)}) \leq \sigma_W^2 / (n\epsilon^2)$.
Solution: The bound on the typical set, as derived using Chebyshev's inequality, is
$$\Pr(X^n \in A_\epsilon^{(n)}) \geq 1 - \frac{\sigma_W^2}{n\epsilon^2}.$$
Substituting the values $n = 10^5$ and $\epsilon = 0.01$, we obtain
$$\Pr(X^n \in A_\epsilon^{(n)}) \geq 1 - \frac{1}{45} = \frac{44}{45}.$$
Loosely speaking, this means that if we were to look at sequences of length 100,000 generated from our DMS, more than 97% of the time the sequence will be typical.
(c) Let $N_a$ be the number of a's in the string $X^n = (X_1, \ldots, X_n)$. The random variable (rv) $N_a$ is the sum of $n$ iid rv's. Show what these rv's are.
Solution: The rv $N_a$ is the sum of $n$ iid rv's $Y_i$, i.e., $N_a = \sum_{i=1}^n Y_i$, where $Y_i = \mathbf{1}\{X_i = a\}$ is Bernoulli with $\Pr(Y_i = 1) = 2/3$.
(d) Express the rv $W(X^n)$ as a function of the rv $N_a$. Note how this depends on $n$.
Solution: The probability of a particular sequence $x^n$ with $N_a$ a's is $(2/3)^{N_a}(1/3)^{n - N_a}$. Hence,
$$W(X^n) = -\log p_{X^n}(x^n) = -\log\left[(2/3)^{N_a}(1/3)^{n - N_a}\right] = n \log 3 - N_a.$$
(e) Express the typical set in terms of bounds on $N_a$ (i.e., $A_\epsilon^{(n)} = \{x^n : \alpha < N_a < \beta\}$) and calculate $\alpha$ and $\beta$.
Solution: For a sequence $X^n$ to be typical, it must satisfy
$$\left| \frac{1}{n}\left(-\log p_{X^n}(x^n)\right) - H(X) \right| < \epsilon.$$
From (a), the source entropy is $H(X) = E[W(X)] = \log 3 - 2/3$. Substituting this and $W(X^n)$ from part (d), we get
$$\left| \frac{N_a}{n} - \frac{2}{3} \right| \leq 0.01.$$
Note the intuitive appeal of this condition! It says that for a sequence to be typical, the proportion of a's in that sequence must be very close to the probability that the DMS generates an a. Plugging in the value of $n$, we get the bounds on $N_a$:
$$65{,}667 \leq N_a \leq 67{,}666.$$
(f) Find the mean and variance of $N_a$. Approximate $\Pr(X^n \in A_\epsilon^{(n)})$ by the central limit theorem approximation. The central limit theorem approximation is to evaluate $\Pr(X^n \in A_\epsilon^{(n)})$ assuming that $N_a$ is Gaussian with the mean and variance of the actual $N_a$. Recall that for a sequence of iid rvs $C_1, \ldots, C_n$, the central limit theorem asserts that
$$\Pr\left( \frac{1}{\sqrt{n}} \sum_{i=1}^n (C_i - \mu_C) \leq t \right) \approx \Phi\left( \frac{t}{\sigma_C} \right),$$
where $\mu_C$ and $\sigma_C$ are the mean and standard deviation of the $C_i$'s and $\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right) du$ is the cdf of the standard Gaussian.
Solution: $N_a$ is a binomial rv (a sum of independent Bernoulli rv's, as shown in part (c)). The mean and variance are
$$E[N_a] = \frac{2}{3} \times 10^5, \qquad \mathrm{Var}(N_a) = \frac{2}{9} \times 10^5.$$
Note that we can calculate the exact probability of the typical set $A_\epsilon^{(n)}$:
$$\Pr(A_\epsilon^{(n)}) = \Pr(65{,}667 \leq N_a \leq 67{,}666) = \sum_{N_a = 65{,}667}^{67{,}666} \binom{10^5}{N_a} \left(\frac{2}{3}\right)^{N_a} \left(\frac{1}{3}\right)^{10^5 - N_a}.$$
But this is computationally intensive, so we approximate $\Pr(A_\epsilon^{(n)})$ with the central limit theorem. We can use the CLT because $N_a$ is the sum of $n$ iid rv's, so in the limit of large $n$ the cumulative distribution approaches that of a Gaussian rv with the mean and variance of $N_a$:
$$\Pr(65{,}667 \leq N_a \leq 67{,}666) \approx \int_\alpha^\beta \frac{1}{\sqrt{2\pi \mathrm{Var}(N_a)}} \exp\left( -\frac{(x - E[N_a])^2}{2\,\mathrm{Var}(N_a)} \right) dx = \Phi(6.706) - \Phi(-6.710) \approx 1,$$
where $\Phi(x)$ is the integral of the standard Gaussian density from $-\infty$ to $x$. Thus the CLT approximation tells us that approximately all of the sequences we observe from the output of the DMS will be typical, whereas Chebyshev gave us a bound that more than 97% of the sequences that we observe will be typical.
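The following sketch (not in the original solution; it assumes scipy is available) compares the exact binomial probability, the CLT approximation, and the Chebyshev bound from part (b):

```python
import numpy as np
from scipy import stats

n, p, eps = 100_000, 2/3, 0.01
lo, hi = 65_667, 67_666

# Exact: Pr(lo <= Na <= hi) for Na ~ Binomial(n, p).
exact = stats.binom.cdf(hi, n, p) - stats.binom.cdf(lo - 1, n, p)

# CLT: Gaussian with the mean and variance of Na.
mu, sigma = n * p, np.sqrt(n * p * (1 - p))
clt = stats.norm.cdf(hi, mu, sigma) - stats.norm.cdf(lo, mu, sigma)

# Chebyshev bound from part (b), with sigma_W^2 = 2/9 bits^2.
cheby = 1 - (2 / 9) / (n * eps**2)

print(exact, clt, cheby)   # exact and CLT are both ~1; Chebyshev gives 44/45
```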
7. (Optional): Typical-Set Calculations 3: For the random variables in the previous problem, find $\Pr(N_a = i)$ for $i = 0, 1, 2$. Find the probability of each individual string $x^n$ for those values of $i$. Find the particular string $x^n$ that has maximum probability over all sample values of $X^n$. What are the next most probable $n$-strings? Give a brief discussion of why the most probable $n$-strings are not regarded as typical strings.
Solution: We know from the previous problem that
$$\Pr(N_a = i) = \binom{10^5}{i} \left(\frac{2}{3}\right)^i \left(\frac{1}{3}\right)^{10^5 - i}.$$
For $i = 0, 1, 2$, $\Pr(N_a = i)$ is approximately zero. The string with the maximal probability is the string with all "a"s. The next most probable strings are the sequences with $n - 1$ "a"s and one "b", and so forth. From the definition of the typical set, we see that the typical set is a fairly small set which contains most of the probability, and the probability of each sequence in the typical set is almost the same. The most probable sequences and the least probable sequences lie in the tails of the distribution of the sample mean of the log pmf (they are the furthest from the mean), so they are not regarded as typical strings. In fact, the aggregate probability of all the most likely sequences and all the least likely sequences is very small. The only case where the most likely sequence is regarded as typical is when every sequence is typical and every sequence is most likely (as in problem Typical-Set Calculations 1). However, this is not the case in general. As we have seen in problem Typical-Set Calculations 2, for very long sequences a typical sequence contains each symbol in roughly the same proportion as the probability of that symbol.
8. (Optional): AEP and Mutual Information: Let $(X_i, Y_i)$ be i.i.d. $\sim p_{X,Y}(x, y)$. We form the log-likelihood ratio of the hypothesis that $X$ and $Y$ are independent vs. the hypothesis that $X$ and $Y$ are dependent. What is the limit of
$$\frac{1}{n} \log \frac{p_{X^n}(X^n)\, p_{Y^n}(Y^n)}{p_{X^n, Y^n}(X^n, Y^n)}?$$
What is the limit of $\frac{p_{X^n}(X^n)\, p_{Y^n}(Y^n)}{p_{X^n, Y^n}(X^n, Y^n)}$ if $X_i$ and $Y_i$ are independent for all $i$?

Solution: Let
$$L = \frac{1}{n} \log \frac{p_{X^n}(X^n)\, p_{Y^n}(Y^n)}{p_{X^n, Y^n}(X^n, Y^n)}.$$
Since the $(X_i, Y_i)$ are i.i.d. $\sim p_{X,Y}(x, y)$, we have
$$L = \frac{1}{n} \sum_{i=1}^n \underbrace{\log \frac{p_X(X_i)\, p_Y(Y_i)}{p_{X,Y}(X_i, Y_i)}}_{W(X_i, Y_i)}.$$
Each of the terms is a function of $(X_i, Y_i)$, which are independent across $i = 1, \ldots, n$. Thus, by the weak law of large numbers, the following convergence in probability holds:
$$L \to E[W(X, Y)] = E_{(X,Y) \sim p_{X,Y}}\left[ \log \frac{p_X(X)\, p_Y(Y)}{p_{X,Y}(X, Y)} \right] = -I(X; Y).$$
Hence, $2^{nL} = \frac{p_{X^n}(X^n)\, p_{Y^n}(Y^n)}{p_{X^n, Y^n}(X^n, Y^n)}$ behaves like $2^{-nI(X;Y)}$, which converges to one if $X$ and $Y$ are independent, because then $I(X; Y) = 0$.
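A short simulation sketch of this convergence (not part of the original solution; the joint pmf below is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)

# Joint pmf of (X, Y) on a 2 x 2 alphabet (illustrative choice).
pXY = np.array([[0.30, 0.10],
                [0.15, 0.45]])
pX, pY = pXY.sum(axis=1), pXY.sum(axis=0)
I = np.sum(pXY * np.log2(pXY / np.outer(pX, pY)))    # mutual information, bits

n = 200_000
idx = rng.choice(4, size=n, p=pXY.ravel())           # draw n iid pairs (X_i, Y_i)
xi, yi = np.unravel_index(idx, pXY.shape)

L = np.mean(np.log2(pX[xi] * pY[yi] / pXY[xi, yi]))  # (1/n) sum of W(X_i, Y_i)
print(L, -I)                                         # L is close to -I(X;Y)
```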
9. Piece of Cake: A cake is sliced roughly in half, the largest piece being chosen each time, the other pieces discarded. We will assume that a random cut creates pieces of proportions
$$P = \begin{cases} (2/3,\ 1/3) & \text{w.p. } 3/4 \\ (2/5,\ 3/5) & \text{w.p. } 1/4 \end{cases}$$
Thus, for example, the first cut (and choice of largest piece) may result in a piece of size 3/5. Cutting and choosing from this piece might reduce it to size $(3/5)(2/3)$ at time 2, and so on. Let $T_n$ be the fraction of cake left after $n$ cuts. Find the limit (in probability) of
$$\lim_{n \to \infty} \frac{1}{n} \log T_n.$$
Solution: Let $C_i$ be the fraction of the piece of cake that is kept at the $i$-th cut, and let $T_n$ be the fraction of cake left after $n$ cuts. Then we have $T_n = C_1 C_2 \cdots C_n$. Hence, by the weak law of large numbers,
$$\frac{1}{n} \log T_n = \frac{1}{n} \sum_{i=1}^n \log C_i \to E[\log C_1] = \frac{3}{4} \log \frac{2}{3} + \frac{1}{4} \log \frac{3}{5} \quad \text{(in probability)}.$$
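A quick simulation sketch (not part of the original solution) confirms the concentration:

```python
import numpy as np

rng = np.random.default_rng(3)

n, trials = 10_000, 5
target = 0.75 * np.log2(2/3) + 0.25 * np.log2(3/5)    # the limiting value, in bits

for _ in range(trials):
    # Fraction kept at each cut: 2/3 w.p. 3/4, or 3/5 w.p. 1/4.
    C = rng.choice([2/3, 3/5], size=n, p=[0.75, 0.25])
    print(np.sum(np.log2(C)) / n, target)             # each run is close to the target
```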
10. Two Typical Sets: Let $X_1, X_2, \ldots$ be a sequence of real-valued random variables, independent and identically distributed according to $P_X(x)$, $x \in \mathcal{X}$. Let $\mu = E[X]$ and denote the entropy of $X$ as $H(X) = -\sum_x P_X(x) \log P_X(x)$. Define the two sets
$$A_n = \left\{ x^n \in \mathcal{X}^n : \left| \frac{1}{n} \log \frac{1}{P_{X^n}(x^n)} - H(X) \right| \leq \epsilon \right\}, \qquad B_n = \left\{ x^n \in \mathcal{X}^n : \left| \frac{1}{n} \sum_{i=1}^n x_i - \mu \right| \leq \epsilon \right\}.$$
(a) (1 point) $\Pr(X^n \in A_n) \to 1$ as $n \to \infty$. True or false? Justify your answer.
Solution: True. This follows from Chebyshev's inequality: indeed,
$$\Pr(X^n \in A_n^c) \leq \frac{\sigma_0^2}{n\epsilon^2} \to 0,$$
where $\sigma_0^2 = \mathrm{Var}(-\log P_X(X))$. Consequently,
$$\Pr(X^n \in A_n) \to 1,$$
as desired.
(b) (1 point) $\Pr(X^n \in A_n \cap B_n) \to 1$ as $n \to \infty$. True or false? Justify your answer.
Solution: True. By Chebyshev's inequality and the same logic as above,
$$\Pr(X^n \in B_n) \to 1.$$
So by De Morgan's law and the union bound,
$$\Pr(X^n \in A_n \cap B_n) = 1 - \Pr(X^n \in A_n^c \cup B_n^c) \geq 1 - \Pr(X^n \in A_n^c) - \Pr(X^n \in B_n^c).$$
Since the latter two terms tend to zero, we know that
$$\Pr(X^n \in A_n \cap B_n) \to 1,$$
as desired.
(c) (1 point) Show that $|A_n \cap B_n| \leq 2^{n(H(X)+\epsilon)}$ for all $n$.
Solution: We have
$$|A_n \cap B_n| \leq |A_n| \leq 2^{n(H(X)+\epsilon)},$$
where the final inequality comes from the AEP, shown in class.
(d) (1 point) Show that $|A_n \cap B_n| \geq \frac{1}{2} \times 2^{n(H(X)-\epsilon)}$ for $n$ sufficiently large.
Solution: From part (b), we have
$$\Pr(X^n \in A_n \cap B_n) \geq \frac{1}{2}$$
for $n$ sufficiently large. Thus
$$\frac{1}{2} \leq \sum_{x^n \in A_n \cap B_n} P_{X^n}(x^n) \leq \sum_{x^n \in A_n \cap B_n} 2^{-n(H(X)-\epsilon)} = |A_n \cap B_n|\, 2^{-n(H(X)-\epsilon)},$$
where the second inequality holds because every $x^n \in A_n \cap B_n$ lies in $A_n$, and we are done.
11. Entropy Inequalities: Let X and Y be real-valued random variables that take on discrete values in
X = {1, . . . , r} and Y = {1, . . . , s}. Let Z = X + Y .
(a) (1 point) Show that H(Z|X) = H(Y |X). Justify your answer carefully.
Solution: Consider
$$\begin{aligned}
H(Z|X) &= \sum_x P_X(x) H(Z|X = x) \\
&= -\sum_x P_X(x) \sum_z P_{Z|X}(z|x) \log P_{Z|X}(z|x) \\
&= -\sum_x P_X(x) \sum_z P_{Y|X}(z - x|x) \log P_{Y|X}(z - x|x) \\
&= -\sum_x P_X(x) \sum_y P_{Y|X}(y|x) \log P_{Y|X}(y|x) \\
&= H(Y|X).
\end{aligned}$$
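A small numerical sketch of this identity (not part of the original solution), using an arbitrary joint pmf on small alphabets:

```python
import numpy as np

rng = np.random.default_rng(4)

r, s = 3, 4
P = rng.random((r, s)); P /= P.sum()            # P[x-1, y-1] = Pr(X = x, Y = y)

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint pmf of (X, Z), where Z = X + Y takes values 2, ..., r + s.
PXZ = np.zeros((r, r + s - 1))
for x in range(1, r + 1):
    for y in range(1, s + 1):
        PXZ[x - 1, x + y - 2] += P[x - 1, y - 1]

HX = H(P.sum(axis=1))                            # H(X)
print(H(PXZ.ravel()) - HX, H(P.ravel()) - HX)    # H(Z|X) and H(Y|X) coincide
```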
(b) (1 point) It is now known that X and Y are independent. Which of the following is true in general: (i) $H(X) \leq H(Z)$; (ii) $H(X) \geq H(Z)$? Justify your answer.
Solution: From the above, note that X and Y play symmetric roles. So given what we have proved in (a), we also know that
$$H(Z|Y) = H(X|Y).$$
Now, we have
$$H(Z) \geq H(Z|Y) = H(X|Y) = H(X),$$
where the inequality is due to "conditioning reduces entropy" and the final equality follows from the independence of X and Y. So the first assertion is true.
(c) (1 point) Now, in addition to Z = X + Y and that X and Y are independent, it is also known
that X = f1 (Z) and Y = f2 (Z) for some functions f1 and f2 . Find H(Z) in terms of H(X) and
H(Y ).
Solution: We have
$$H(Z) = H(X + Y) \leq H(X, Y) = H(X) + H(Y),$$
where the final equality is by independence of X and Y. On the other hand,
$$H(X) + H(Y) = H(X, Y) = H(f_1(Z), f_2(Z)) \leq H(Z).$$
Hence all inequalities above are equalities and we have
$$H(Z) = H(X) + H(Y).$$
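A concrete illustration (not part of the original solution): take X on $\{0, 1\}$ and Y on $\{0, 2\}$, independent. Every value of $Z = X + Y$ then determines $(X, Y)$ uniquely, so $H(Z) = H(X) + H(Y)$:

```python
import numpy as np
from itertools import product

def H(p):
    p = np.asarray([q for q in p if q > 0])
    return -np.sum(p * np.log2(p))

pX = {0: 0.3, 1: 0.7}          # illustrative pmf for X
pY = {0: 0.6, 2: 0.4}          # illustrative pmf for Y

pZ = {}
for (x, px), (y, py) in product(pX.items(), pY.items()):
    pZ[x + y] = pZ.get(x + y, 0.0) + px * py   # Z = X + Y, with X, Y independent

print(H(pX.values()), H(pY.values()), H(pZ.values()))
# H(Z) = H(X) + H(Y) since Z takes the four distinct values 0, 1, 2, 3
```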