EE5139R Lecture 13: Stein's Lemma
Vincent Y. F. Tan
November 9, 2016
Let $X_1, X_2, \ldots, X_n$ be independently drawn from $Q \in \mathcal{P}(\mathcal{X})$. Consider the hypothesis test
$$H_0 : Q = P_0 \qquad \text{vs.} \qquad H_1 : Q = P_1. \tag{1}$$
Assume that $D(P_0 \| P_1) < \infty$. Let $A_n \subset \mathcal{X}^n$ be an acceptance region for hypothesis $H_0$. Let the probabilities of error be
$$\alpha_n(A_n) := P_0^n(A_n^c) \tag{2}$$
$$\beta_n(A_n) := P_1^n(A_n). \tag{3}$$
For $\varepsilon \in (0,1)$, define
$$\beta_n^*(\varepsilon) := \min\{\beta_n(A_n) : \alpha_n(A_n) \le \varepsilon\}, \tag{4}$$
where $A_n$ runs through all subsets of $\mathcal{X}^n$.
Theorem 1 (Chernoff–Stein Lemma). For every $\varepsilon \in (0, 1)$,
$$\lim_{n \to \infty} -\frac{1}{n} \log \beta_n^*(\varepsilon) = D(P_0 \| P_1). \tag{5}$$
This means that, regardless of the constant upper bound $\varepsilon$ on the type-I error, the type-II error behaves as
$$\beta_n^*(\varepsilon) \doteq 2^{-nD(P_0 \| P_1)}, \tag{6}$$
where $a_n \doteq b_n$ means equality to first order in the exponent, i.e., $\frac{1}{n}\log\frac{a_n}{b_n} \to 0$ (and $\dot\le$, $\dot\ge$ denote the corresponding one-sided statements). This provides an operational meaning of the relative entropy $D(P_0 \| P_1)$ as the best exponent for discriminating the two distributions $P_0$ and $P_1$.
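For concreteness, here is a small worked instance (the Bernoulli parameters are illustrative choices, not from the lecture). Take $\mathcal{X} = \{0,1\}$, $P_0 = \mathrm{Bern}(0.3)$ and $P_1 = \mathrm{Bern}(0.6)$. Then
$$D(P_0 \| P_1) = 0.3 \log_2 \frac{0.3}{0.6} + 0.7 \log_2 \frac{0.7}{0.4} \approx -0.300 + 0.565 = 0.265 \ \text{bits},$$
so by Theorem 1 the best achievable type-II error decays as $\beta_n^*(\varepsilon) \doteq 2^{-0.265\,n}$, however close the type-I budget $\varepsilon$ is to $1$.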
Proof. Our objective is first to prove the achievability (upper) bound
$$\beta_n^*(\varepsilon) \ \dot\le\ 2^{-nD(P_0 \| P_1)}. \tag{7}$$
There is a proof of this in Cover and Thomas [1, page 384]. Here we provide an alternative argument that does not require knowledge of $P_1$ (but does need to know $P_0$). Consider the decision regions
$$B_n = \big\{x^n \in \mathcal{X}^n : D(\hat P_{x^n} \| P_0) \le \delta_n\big\}, \qquad \text{where } \delta_n = \frac{1}{\sqrt{n}}. \tag{8}$$
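To make the test (8) concrete, the following is a minimal sketch in Python, assuming a small hypothetical alphabet and distributions chosen purely for illustration: it computes the empirical type $\hat P_{x^n}$ of the observed sequence and accepts $H_0$ iff $D(\hat P_{x^n} \| P_0) \le 1/\sqrt{n}$.

```python
# Sketch of the decision rule (8): accept H0 iff D(type of x^n || P0) <= 1/sqrt(n).
# The alphabet and the distributions P0, P1 are hypothetical illustrative choices.
import math
import random
from collections import Counter

def kl(q, p):
    """Relative entropy D(q || p) in bits; q, p are dicts over the same alphabet."""
    return sum(q[a] * math.log2(q[a] / p[a]) for a in q if q[a] > 0)

def accept_H0(x, P0):
    """The test B_n: threshold the divergence of the empirical type from P0."""
    n = len(x)
    counts = Counter(x)
    type_x = {a: counts.get(a, 0) / n for a in P0}  # empirical distribution of x
    return kl(type_x, P0) <= 1 / math.sqrt(n)       # delta_n = 1/sqrt(n)

P0 = {'a': 0.5, 'b': 0.3, 'c': 0.2}
P1 = {'a': 0.2, 'b': 0.3, 'c': 0.5}

random.seed(0)
n = 1000
x0 = random.choices(list(P0), weights=list(P0.values()), k=n)  # drawn from P0
x1 = random.choices(list(P1), weights=list(P1.values()), k=n)  # drawn from P1
print(accept_H0(x0, P0), accept_H0(x1, P0))  # typically: True False
```

Note that the rule uses only $P_0$; this is the universality just mentioned: the same region $B_n$ works simultaneously against every alternative $P_1$.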
In this case we can calculate
$$\begin{aligned}
\alpha_n(B_n) &= P_0^n(B_n^c) && (9) \\
&= P_0^n\big(\{x^n : D(\hat P_{x^n} \| P_0) > \delta_n\}\big) && (10) \\
&= \sum_{x^n : D(\hat P_{x^n} \| P_0) > \delta_n} P_0^n(x^n) && (11) \\
&= \sum_{Q \in \mathcal{P}_n(\mathcal{X}) : D(Q \| P_0) > \delta_n} \ \sum_{x^n \in T(Q)} P_0^n(x^n) && (12) \\
&= \sum_{Q \in \mathcal{P}_n(\mathcal{X}) : D(Q \| P_0) > \delta_n} P_0^n(T(Q)) && (13) \\
&\le \sum_{Q \in \mathcal{P}_n(\mathcal{X}) : D(Q \| P_0) > \delta_n} 2^{-nD(Q \| P_0)} && (14) \\
&\le \sum_{Q \in \mathcal{P}_n(\mathcal{X}) : D(Q \| P_0) > \delta_n} 2^{-n\delta_n} && (15) \\
&= \sum_{Q \in \mathcal{P}_n(\mathcal{X}) : D(Q \| P_0) > \delta_n} 2^{-\sqrt{n}} && (16) \\
&\le (n+1)^{|\mathcal{X}|}\, 2^{-\sqrt{n}} && (17) \\
&\le \varepsilon, && (18)
\end{aligned}$$
where (12) follows by splitting the sum into type classes, (14) follows from the inequality $P_0^n(T(Q)) \le 2^{-nD(Q \| P_0)}$, (15) follows since each $Q$ in the sum satisfies $D(Q \| P_0) > \delta_n$, (16) follows from the definition of $\delta_n$, and finally (17) follows since the number of types $|\mathcal{P}_n(\mathcal{X})| \le (n+1)^{|\mathcal{X}|}$ is polynomial in $n$. Hence, the type-I error probability bound $\alpha_n(B_n) \le \varepsilon$ is satisfied for all $n$ large enough.
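The final step (18) holds once $(n+1)^{|\mathcal{X}|}\, 2^{-\sqrt{n}} \le \varepsilon$, and since the polynomial factor is beaten by $2^{-\sqrt{n}}$ only slowly, "large enough" can be sizable. A quick sketch (the binary alphabet and $\varepsilon = 0.01$ are hypothetical choices):

```python
# Smallest n with (n+1)^{|X|} * 2^{-sqrt(n)} <= eps, i.e., where (18) kicks in.
# |X| = 2 and eps = 0.01 are hypothetical choices for illustration.
import math

X_size, eps = 2, 0.01
n = 1
while (n + 1) ** X_size * 2 ** (-math.sqrt(n)) > eps:
    n += 1
print(n)  # prints 640 for these choices
```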
Now let's estimate the type-II error. We have
$$\beta_n(B_n) = P_1^n(B_n) \tag{19}$$
$$\le (n+1)^{|\mathcal{X}|}\, 2^{-nD(Q^* \| P_1)} \tag{20}$$
by Sanov's theorem, where
$$Q^* = \mathop{\arg\min}_{Q : D(Q \| P_0) \le \delta_n} D(Q \| P_1). \tag{21}$$
Because $\delta_n \to 0$, and $D(Q \| P_0) = 0$ iff $Q = P_0$, the optimizer $Q^*$ converges to $P_0$, i.e.,
$$Q^*(a) \to P_0(a), \qquad \forall\, a \in \mathcal{X}. \tag{22}$$
Hence, for any $\delta > 0$,
$$|D(Q^* \| P_1) - D(P_0 \| P_1)| \le \delta \tag{23}$$
for all $n$ large enough. Consequently,
$$\beta_n(B_n) \le (n+1)^{|\mathcal{X}|}\, 2^{-n(D(P_0 \| P_1) - \delta)}. \tag{24}$$
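The optimization (21) is easy to carry out numerically; for a binary alphabet it is one-dimensional, so a grid search suffices. The sketch below (Bernoulli parameters are hypothetical choices) shows $Q^*$ drifting toward $P_0$ and $D(Q^* \| P_1)$ climbing toward $D(P_0 \| P_1) \approx 0.265$ bits as $\delta_n \to 0$:

```python
# Grid-search sketch of Q* = argmin { D(Q||P1) : D(Q||P0) <= delta_n } on a
# binary alphabet, illustrating (21)-(22). Parameters are hypothetical.
import math

p0, p1 = 0.3, 0.6  # P0(1) and P1(1)

def kl(q, p):
    """Binary relative entropy D((q,1-q) || (p,1-p)) in bits."""
    return sum(a * math.log2(a / b) for a, b in ((q, p), (1 - q, 1 - p)) if a > 0)

for n in (10, 100, 1000, 10000):
    delta_n = 1 / math.sqrt(n)
    grid = [k / 10000 for k in range(1, 10000)]
    feasible = [q for q in grid if kl(q, p0) <= delta_n]  # constraint set
    q_star = min(feasible, key=lambda q: kl(q, p1))       # minimizer of D(.||P1)
    print(f"n={n:6d}  Q*(1)={q_star:.4f}  D(Q*||P1)={kl(q_star, p1):.4f}")
# Q*(1) -> P0(1) = 0.3 and D(Q*||P1) -> D(P0||P1) ~ 0.2651, though slowly.
```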
We have established that $B_n$ satisfies the property that its type-I error is at most $\varepsilon$ while its type-II error vanishes essentially as $2^{-nD(P_0 \| P_1)}$. This proves (7).
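As a numerical sanity check on the whole achievability argument: over a binary alphabet the type of $x^n$ is determined by the number of ones, so $\alpha_n(B_n)$ and $\beta_n(B_n)$ are exact binomial sums and can be computed directly. A sketch (the Bernoulli parameters are again hypothetical choices):

```python
# Exact alpha_n(B_n) and beta_n(B_n) for Bernoulli sources: over X = {0,1} the
# type of x^n is determined by k = #ones, so both errors are binomial sums.
# Parameters are hypothetical illustrative choices.
import math

p0, p1 = 0.3, 0.6  # P0(1) and P1(1)

def kl(q, p):
    """Binary relative entropy D((q,1-q) || (p,1-p)) in bits."""
    return sum(a * math.log2(a / b) for a, b in ((q, p), (1 - q, 1 - p)) if a > 0)

def binom_pmf(n, k, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

print(f"D(P0||P1) = {kl(p0, p1):.4f} bits")
for n in (100, 200, 400):
    delta_n = 1 / math.sqrt(n)
    accept = [k for k in range(n + 1) if kl(k / n, p0) <= delta_n]  # B_n via counts
    alpha = 1 - sum(binom_pmf(n, k, p0) for k in accept)  # type-I error
    beta = sum(binom_pmf(n, k, p1) for k in accept)       # type-II error
    print(f"n={n:3d}  alpha={alpha:.4f}  -(1/n)log2(beta)={-math.log2(beta)/n:.4f}")
# The empirical exponent increases toward D(P0||P1); convergence is slow because
# delta_n = 1/sqrt(n) keeps the acceptance region larger than necessary.
```

The slow convergence matches the previous sketch: at finite blocklength the exponent is governed by $D(Q^* \| P_1)$ rather than $D(P_0 \| P_1)$.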
Now, we prove the lower bound
$$\beta_n^*(\varepsilon) \ \dot\ge\ 2^{-nD(P_0 \| P_1)}. \tag{25}$$
To do so, let’s first establish the following lemma:
Lemma 2. Let $D_n \subset \mathcal{X}^n$ be a subset satisfying
$$P_0^n(D_n) > 1 - \varepsilon \tag{26}$$
where $\varepsilon \in (0, 1)$. Then for any $0 < \delta < 1 - \varepsilon$ and $n$ large enough, we have
$$P_1^n(D_n) > (1 - \varepsilon - \delta)\, 2^{-n(D(P_0 \| P_1) + \delta)}. \tag{27}$$
We defer the proof of Lemma 2 to the end of the document. Now, any acceptance region $A_n$ with $\alpha_n(A_n) \le \varepsilon$ has the property in (26) (think of $A_n$ as $D_n$, so that $P_0^n(D_n^c) \le \varepsilon$). Hence (27) holds and
$$\lim_{n \to \infty} \frac{1}{n} \log \beta_n^*(\varepsilon) \ge -D(P_0 \| P_1) - \delta + \lim_{n \to \infty} \frac{1}{n} \log(1 - \varepsilon - \delta). \tag{28}$$
The final term vanishes since $1 - \varepsilon - \delta$ is a positive constant, so we are left with
$$\lim_{n \to \infty} \frac{1}{n} \log \beta_n^*(\varepsilon) \ge -D(P_0 \| P_1) - \delta. \tag{29}$$
Since $\delta > 0$ can be made arbitrarily small, we have
$$\lim_{n \to \infty} \frac{1}{n} \log \beta_n^*(\varepsilon) \ge -D(P_0 \| P_1) \tag{30}$$
as desired.
Proof of Lemma 2. Fix $\delta \in (0, 1 - \varepsilon)$. Define the relative entropy typical set
$$E_n := \left\{ x^n : -\delta \le \frac{1}{n} \log \frac{P_0^n(x^n)}{P_1^n(x^n)} - D(P_0 \| P_1) \le \delta \right\}.$$
By the weak law of large numbers (with $X^n$ i.i.d. $\sim P_0$),
$$P_0^n(E_n) = \Pr\left[\, \left| \frac{1}{n} \log \frac{P_0^n(X^n)}{P_1^n(X^n)} - D(P_0 \| P_1) \right| \le \delta \,\right] \tag{31}$$
$$> 1 - \delta \tag{32}$$
for all $n$ large enough.
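The WLLN step (31)–(32) is straightforward to visualize: under $P_0$, the normalized log-likelihood ratio is an i.i.d. average with mean $D(P_0 \| P_1)$. A quick Monte Carlo sketch (binary alphabet and all parameters are hypothetical choices):

```python
# Monte Carlo sketch of (31)-(32): under P0, (1/n) log2[P0^n(X^n)/P1^n(X^n)]
# concentrates around D(P0||P1). Alphabet and parameters are hypothetical.
import math
import random

p0, p1 = 0.3, 0.6  # P0(1) and P1(1)
D = p0 * math.log2(p0 / p1) + (1 - p0) * math.log2((1 - p0) / (1 - p1))

def normalized_llr(n):
    """(1/n) * sum of per-symbol log-likelihood ratios, X_i i.i.d. ~ P0."""
    s = sum(math.log2(p0 / p1) if random.random() < p0
            else math.log2((1 - p0) / (1 - p1)) for _ in range(n))
    return s / n

random.seed(0)
delta, n, trials = 0.05, 2000, 1000
inside = sum(abs(normalized_llr(n) - D) <= delta for _ in range(trials))
print(f"D(P0||P1) = {D:.4f} bits;  estimated P0^n(E_n) = {inside/trials:.3f}")
# prints an estimate close to 1, i.e., P0^n(E_n) > 1 - delta at this blocklength
```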
Furthermore, by the union bound,
$$P_0^n(E_n^c \cup D_n^c) \le P_0^n(E_n^c) + P_0^n(D_n^c) \le \delta + \varepsilon, \tag{33}$$
so
$$P_0^n(E_n \cap D_n) \ge 1 - (\delta + \varepsilon). \tag{34}$$
Now consider
$$\begin{aligned}
P_1^n(D_n) &\ge P_1^n(D_n \cap E_n) && (35) \\
&= \sum_{x^n \in D_n \cap E_n} P_1^n(x^n) && (36) \\
&\ge \sum_{x^n \in D_n \cap E_n} P_0^n(x^n)\, 2^{-n(D(P_0 \| P_1) + \delta)} && (37) \\
&= 2^{-n(D(P_0 \| P_1) + \delta)}\, P_0^n(D_n \cap E_n) && (38) \\
&\ge (1 - \delta - \varepsilon)\, 2^{-n(D(P_0 \| P_1) + \delta)}, && (39)
\end{aligned}$$
where (37) uses the definition of the relative entropy typical set $E_n$ as in (31) and (39) uses the union bound calculation in (34). This completes the proof.
References
[1] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.