Lecture 11∗
Agenda for the lecture
• Properties of information measures (contd.): Data processing inequalities, Fano’s
inequality, Continuity of entropy
11.1 Data processing inequalities
In this section, we explore how the measures of information change as the random variables
become increasingly noisy. We shall consider two typical situations: First, given two
distributions P and Q on X and a channel W : X → Y, how does the KL divergence
between PW and QW relate to that between P and Q? Second, given a rv X and rvs
Y and Z produced by adding successive independent noise to X, how does the mutual
information I(X ∧ Y ) compare with I(X ∧ Z)?
In the first case, we draw our heuristics from Stein’s lemma (Lemma 6.4). Note that
while we proved Stein’s lemma for deterministic tests for binary hypothesis testing, the
result holds even when more general randomized tests are allowed. A randomized test
T : X → {0, 1} upon observing x ∈ X declares the null hypothesis with probability T(0|x)
and the alternative with probability T(1|x) = 1 − T(0|x), i.e., the test is not required to
declare a fixed hypothesis but is allowed to toss a coin, in addition to making the observation X, and decide
∗ © Himanshu Tyagi. Feel free to use with acknowledgement.
on its answer. Stein’s lemma holds for such randomized tests. Therefore, given a
fixed acceptable probability of false alarm ε, the minimum probability of missed detection
for P^n vs. Q^n is roughly 2^{−nD(P‖Q)}. Note that a feasible (randomized) test involves first
making the observation “more noisy” by passing it through a channel W and using a binary
hypothesis test for PW vs. QW. Using Stein’s lemma for (PW)^n vs. (QW)^n, the best such
test with probability of false alarm less than ε will have probability of missed detection
roughly 2^{−nD(PW‖QW)}. But this can only be worse than the least possible probability of
missed detection over all tests, and so

2^{−nD(PW‖QW)} ≥ 2^{−nD(P‖Q)},

which gives

D(P‖Q) ≥ D(PW‖QW).
Thus, the distance between distributions decreases as they become increasingly noisy.¹ A
similar heuristic based on the Bayes error for uniform prior and its connection to d(P, Q)
given in Section 5.1.1 gives
d(P, Q) ≥ d(PW, QW ).
For the second form, we first introduce the notion of a Markov chain.
Definition 11.1. Random variables X, Y, Z form a Markov chain, denoted X −◦− Y −◦− Z if
X and Z are conditionally independent given Y , or in the language of information theory,
I(X ∧ Z|Y ) = 0. A sequence X1 , ..., Xn forms a Markov chain if (X1 , ..., Xi−1 ) −◦− Xi −◦−
(Xi+1 , ..., Xn ) holds for every 1 ≤ i ≤ n.
A Markov chain is an interesting probability model which captures a “first-level dependence” between rvs, where the next rv depends on the previous ones only through the
last rv. Such models and their extensions have several uses; in particular, they are used to
¹ In fact, a similar argument gives β_ε(P, Q) ≤ β_ε(PW, QW).
model written or typed text and speech for compression. In the context of this section,
X −◦− Y −◦− Z corresponds to producing Z by adding noise to Y that is independent of X. In this
case, we expect X and Y to be “more correlated” than X and Z. Since mutual information
has an interpretation as a measure of correlation, we expect I(X ∧ Y ) to be greater than
I(X ∧ Z), a property which indeed holds.
Formally, we have the following:
Data processing inequality for total variation distance.
d(PW, QW ) =
=
≤
=
=
1X
|PW (y) − QW (y)|
2 y
X
1 X X
P(x)W (y|x) −
Q(x)W (y|x)
2 y x
x
1 XX
W (y|x) |P(x) − Q(x)|
2 y x
1 XX
W (y|x) |P(x) − Q(x)|
2 x y
1X
|P(x) − Q(x)|
2 x
= d(P, Q),
where the equality holds by the triangle inequality. Equality holds iff for every y
X
W (y|x) |P(x) − Q(x)| = |PW (y) − QW (y)|.
x
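As a sanity check, the inequality d(PW, QW) ≤ d(P, Q) can be verified numerically. The following Python sketch does this for one small example; the distributions P, Q and the channel W below are arbitrary illustrative choices, not taken from the lecture.

    import numpy as np

    def tv_distance(p, q):
        # d(P, Q) = (1/2) * sum_x |P(x) - Q(x)|
        return 0.5 * np.abs(p - q).sum()

    # Illustrative pmfs P, Q on a 3-letter alphabet and a channel W with 2 outputs.
    P = np.array([0.5, 0.3, 0.2])
    Q = np.array([0.2, 0.2, 0.6])
    W = np.array([[0.9, 0.1],    # W[x, y] = W(y|x); rows sum to 1
                  [0.4, 0.6],
                  [0.3, 0.7]])

    PW = P @ W    # output distribution of the channel when the input pmf is P
    QW = Q @ W
    print(tv_distance(PW, QW), "<=", tv_distance(P, Q))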
Data processing inequality for KL divergence.
D(PW‖QW) = Σ_y PW(y) log [PW(y)/QW(y)]
          = Σ_y (Σ_x P(x)W(y|x)) log [(Σ_x P(x)W(y|x)) / (Σ_x Q(x)W(y|x))]
          ≤ Σ_y Σ_x P(x)W(y|x) log [(P(x)W(y|x)) / (Q(x)W(y|x))]
          = Σ_x P(x) (Σ_y W(y|x)) log [P(x)/Q(x)]
          = D(P‖Q),

where the inequality is by the log-sum inequality. Equality holds iff for every x, y

P(x)W(y|x)/PW(y) = Q(x)W(y|x)/QW(y).
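A similar numerical check works for the KL divergence form of the inequality. The sketch below (again with arbitrary illustrative P, Q and W) compares D(PW‖QW) with D(P‖Q).

    import numpy as np

    def kl_divergence(p, q):
        # D(P||Q) in bits, with 0 log(0/q) = 0; assumes q(x) > 0 wherever p(x) > 0
        mask = p > 0
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

    P = np.array([0.5, 0.3, 0.2])      # same illustrative P, Q, W as before
    Q = np.array([0.2, 0.2, 0.6])
    W = np.array([[0.9, 0.1],
                  [0.4, 0.6],
                  [0.3, 0.7]])

    print(kl_divergence(P @ W, Q @ W), "<=", kl_divergence(P, Q))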
Data processing inequality for mutual information. Given a Markov chain X −◦− Y −◦− Z,
by the chain rule for mutual information we have
I(X ∧ Y, Z) = I(X ∧ Y ) + I(X ∧ Z|Y )
= I(X ∧ Y ),
and also
I(X ∧ Y, Z) = I(X ∧ Z) + I(X ∧ Y |Z).
By comparing the two expressions above and using the nonnegativity of conditional mutual
information, we have
I(X ∧ Y ) ≥ I(X ∧ Z),
with equality iff² I(X ∧ Y |Z) = 0.
An alternative proof uses the data processing inequality for KL divergence, thereby
showing that the latter is more general. Specifically, denoting by W the channel from X
to Y and by V the channel from Y to Z, we note that the channel T from X to Z is given
² Equality holds iff both X −◦− Y −◦− Z and X −◦− Z −◦− Y hold, which can only hold in very specific cases;
see HW3.
by

T(z|x) = Σ_y V(z|y) W(y|x).
Denoting by P the probability distribution of X, the distribution of Z is given by PT =
(PW)V, and T_x = W_x V. Then,

I(X ∧ Z) = D(T‖PT|P)
         = Σ_x P(x) D(T_x‖PT)
         = Σ_x P(x) D(W_x V‖(PW)V)
         ≤ Σ_x P(x) D(W_x‖PW)
         = I(X ∧ Y),
where the inequality corresponds to the data processing inequality for KL divergence.
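To see the mutual information form of the inequality in a concrete setting, one can cascade two channels numerically. The sketch below is a hypothetical example, not taken from the lecture: a uniform binary input X is passed through a binary symmetric channel W to produce Y, and Y through a second binary symmetric channel V to produce Z, so that X −◦− Y −◦− Z; the mutual information can only decrease after the second channel.

    import numpy as np

    def mutual_information(PX, W):
        # I(X ∧ Y) in bits for input pmf PX and channel W with W[x, y] = W(y|x)
        PXY = PX[:, None] * W
        PY = PXY.sum(axis=0)
        prod = PX[:, None] * PY[None, :]
        mask = PXY > 0
        return float(np.sum(PXY[mask] * np.log2(PXY[mask] / prod[mask])))

    def bsc(delta):
        # binary symmetric channel with crossover probability delta
        return np.array([[1 - delta, delta],
                         [delta, 1 - delta]])

    PX = np.array([0.5, 0.5])    # uniform binary input (illustrative choice)
    W = bsc(0.1)                 # channel from X to Y
    V = bsc(0.2)                 # channel from Y to Z (independent noise)
    T = W @ V                    # composed channel from X to Z: T(z|x) = sum_y V(z|y)W(y|x)

    print(mutual_information(PX, W), ">=", mutual_information(PX, T))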
11.2 Lower bound for error: Fano’s inequality
The next property, which is perhaps the most widely used theoretical result of information
theory, is a lower bound on the probability of error for M-ary hypothesis testing in a Bayesian
framework.
Theorem 11.2 (Fano’s inequality). Given a rv X taking values in a finite set X with
|X | > 1 and a channel W : X → Y, for every random variable X̂ taking values in X such
that the Markov chain X −◦− Y −◦− X̂ holds
P(X ≠ X̂) + h(P(X ≠ X̂)) ≥ H(X|Y)/log(|X| − 1),
where h(p) = −p log p − (1 − p) log(1 − p) is the binary entropy function.
Remark 11.3. In the result above, X = x corresponds to the input hypothesis, Y to the
observation generated by PY |X=x , and X̂ corresponds to the estimate of X generated using
a randomized function of Y (i.e., using a randomized hypothesis test).
Proof. We show that

P(X ≠ X̂) + h(P(X ≠ X̂)) ≥ H(X|X̂)/log(|X| − 1).    (1)
The claimed bound then follows since by the data processing inequality for mutual information,
H(X|X̂) − H(X|Y ) = I(X ∧ Y ) − I(X ∧ X̂) ≥ 0.
It remains to prove (1). To that end, denote by Z the {0, 1}-valued rv 1(X ≠ X̂), i.e., Z
is a function of (X, X̂) and equals 1 if X ≠ X̂ and 0 otherwise. Note that

P (Z = 1) = E [Z] = P(X ≠ X̂) =: p.
Furthermore, since Z is a function of (X, X̂),
H(X, Z|X̂) = H(X|X̂) + H(Z|X, X̂) = H(X|X̂).
Also,
H(X, Z|X̂) = H(Z|X̂) + H(X|Z, X̂) ≤ H(Z) + H(X|Z, X̂) = h(p) + H(X|Z, X̂).
Thus, the proof will be completed upon showing
H(X|Z, X̂) ≤ p log(|X| − 1).    (2)
Indeed,
H(X|Z, X̂) ≤ (1 − p)H(X|Z = 0, X̂) + pH(X|Z = 1, X̂).
For the first term, note that if Z = 0, X = X̂ and so H(X|Z = 0, X̂) = 0. For the second
term, given Z = 1, X ≠ X̂ and therefore X can take at most |X| − 1 values. Thus, by
property (d) in Lecture 10, H(X|Z = 1, X̂) ≤ log(|X| − 1). Therefore, (2) holds, which
completes the proof.
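As an illustration of how Theorem 11.2 is used, the sketch below (with a hypothetical value of H(X|Y) and a hypothetical alphabet size, not from the lecture) numerically finds the smallest error probability p consistent with the bound, i.e., the smallest p satisfying p + h(p) ≥ H(X|Y)/log(|X| − 1).

    import numpy as np

    def h(p):
        # binary entropy in bits, with h(0) = h(1) = 0
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    def fano_lower_bound(H_cond, alphabet_size, grid=100001):
        # smallest p in [0, 1] with p + h(p) >= H(X|Y) / log(|X| - 1)
        rhs = H_cond / np.log2(alphabet_size - 1)
        for p in np.linspace(0.0, 1.0, grid):
            if p + h(p) >= rhs:
                return p
        return 1.0

    # Hypothetical numbers: H(X|Y) = 2 bits and |X| = 16.
    print(fano_lower_bound(2.0, 16))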
11.3 Continuity of entropy
For a discrete set X , we saw that entropy is a concave function of P, which implies continuity
of entropy in P. In this section, we give a direct proof of this fact with precise estimates
for the rate. Specifically, we provide precise bounds for |H(P) − H(Q)| in terms of d(P, Q).
In deriving our bound, we use a result known as the maximal coupling lemma. We
review this fundamental result first. Consider a joint distribution PXY for X -valued rvs X
and Y such that PX = P and PY = Q. Then, denoting A := {x : P(x) > Q(x)},
P (X = Y) = Σ_x PXY(x, x)
          = Σ_{x∈A} PXY(x, x) + Σ_{x∈Aᶜ} PXY(x, x)
          ≤ Σ_{x∈A} PY(x) + Σ_{x∈Aᶜ} PX(x)
          = Q(A) + 1 − P(A)
          = 1 − d(P, Q).

Thus, for all joint distributions PXY with PX = P and PY = Q,

P (X ≠ Y) ≥ d(P, Q).
The maximal coupling lemma asserts that for every P and Q there exists a joint distribution
which attains the bound above with equality.
Theorem 11.4 (Maximal Coupling). Given two distributions P and Q on X , there exists
a joint distribution PXY with PX = P and PY = Q such that
P (X ≠ Y ) = d(P, Q).
As an example, consider X = {0, 1}. Let P = Bernoulli(p) and Q = Bernoulli(q).
Then, d(P, Q) = |p − q|. We now give a joint distribution satisfying the maximal coupling
lemma. For a rv U distributed uniformly in [0, 1], note that X = 1(U < p) and Y =
1(U < q) have the required marginals. Furthermore, assuming p < q, P (X ≠ Y ) =
P (p ≤ U < q) = |p − q|. Thus, the rvs X and Y above constitute a maximal coupling of P
and Q.
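For general pmfs on a finite set, one standard construction consistent with Theorem 11.4 puts the overlap min{P(x), Q(x)} on the diagonal of the joint pmf and spreads the leftover mass of P against the leftover mass of Q off the diagonal. The Python sketch below is a minimal illustrative implementation of this idea, not taken from the lecture; it builds such a coupling and checks that the marginals are P and Q and that P(X ≠ Y) = d(P, Q).

    import numpy as np

    def maximal_coupling(P, Q):
        # joint pmf with marginals P, Q and P(X != Y) = d(P, Q):
        # overlap min{P, Q} on the diagonal, leftover mass spread off the diagonal
        P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
        overlap = np.minimum(P, Q)
        d = 1.0 - overlap.sum()            # total variation distance d(P, Q)
        PXY = np.diag(overlap)
        if d > 0:
            PXY += np.outer(P - overlap, Q - overlap) / d
        return PXY

    P = np.array([0.5, 0.3, 0.2])          # arbitrary illustrative pmfs
    Q = np.array([0.2, 0.2, 0.6])
    PXY = maximal_coupling(P, Q)

    print(np.allclose(PXY.sum(axis=1), P), np.allclose(PXY.sum(axis=0), Q))   # marginals
    print(1.0 - np.trace(PXY), "==", 0.5 * np.abs(P - Q).sum())               # P(X != Y) = d(P, Q)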
Using the maximal coupling lemma, we establish the following bound for |H(P)−H(Q)|.
Lemma 11.5. For a finite set X and pmfs P and Q on X ,
|H(P) − H(Q)| ≤ d(P, Q) log(|X| − 1) + h(d(P, Q)),
where h denotes the binary entropy function.
Proof. Given P, Q, let PXY be the maximal coupling of P and Q guaranteed by Theorem 11.4. Noting that H(P) − H(Q) = H(X) − H(Y ) = H(X|Y ) − H(Y |X), we get
|H(P) − H(Q)| = |H(X|Y ) − H(Y |X)|
≤ max{H(X|Y ), H(Y |X)}.
Note that by our construction d(P, Q) = P (X ≠ Y ). Thus, by Fano’s inequality
H(X|Y ) ≤ d(P, Q) log(|X | − 1) + h(d(P, Q)),
and
H(Y |X) ≤ d(P, Q) log(|X | − 1) + h(d(P, Q)).
Therefore, the same bound also holds for the max of the two conditional entropies and the
claim follows.
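The bound of Lemma 11.5 can also be checked numerically; the sketch below draws a few random pmfs (an illustrative choice of alphabet size and random seed, not from the lecture) and compares |H(P) − H(Q)| with d(P, Q) log(|X| − 1) + h(d(P, Q)).

    import numpy as np

    def entropy(p):
        # Shannon entropy in bits
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    def h(p):
        # binary entropy in bits
        return entropy(np.array([p, 1.0 - p]))

    rng = np.random.default_rng(0)
    k = 8                                   # alphabet size |X| (illustrative)
    for _ in range(5):
        P = rng.dirichlet(np.ones(k))       # random pmfs on X
        Q = rng.dirichlet(np.ones(k))
        d = 0.5 * np.abs(P - Q).sum()
        lhs = abs(entropy(P) - entropy(Q))
        rhs = d * np.log2(k - 1) + h(d)
        print(f"{lhs:.4f} <= {rhs:.4f}")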