National Institute for Theoretical Physics,
Stellenbosch University, South Africa, January 24, 2017.
Frank den Hollander
Leiden University, The Netherlands
Large Deviations
– a mini-course –
Topic 1: Overview of large deviation theory, Part I
Topic 2: Overview of large deviation theory, Part II
Topic 3: Application: statistical hypothesis testing

TOPIC 1
Overview of large deviation theory
Part I
Large deviation theory is the study of unlikely events. Since
such events are always realised in the least unlikely of all
the unlikely ways, they can be captured with the help of
variational principles. This gives the theory a particularly
elegant structure.

Large deviation theory originated in statistical physics, and
via insurance and finance mathematics reached probability
theory and statistics. It can now be found in almost any
scientific discipline.
§ BACKGROUND
A classical setting for large deviations is the following. Let
(Xi)i∈N be i.i.d. R-valued random variables with

E(X1) = µ ∈ R,   Var(X1) = σ² ∈ (0, ∞).

Let

X̄n = n−1(X1 + · · · + Xn)

be the empirical average of the n-sample. Then, as n → ∞,

LLN:  X̄n → µ   P-a.s.,
CLT:  √n (X̄n − µ) → σ N(0, 1)   in distribution,

where N(0, 1) is the standard normal random variable.

Under the assumption that E(e^{λX1}) < ∞ for all λ ∈ R, large
deviation theory says that

P(X̄n ≈ ν) = e^{−[1+o(1)] I(ν) n},   n → ∞,

where ≈ means close in a proper sense, and

ν ↦ I(ν)

is a rate function that achieves a unique zero at ν = µ.

The rate function captures the cost of the large deviations.
Large deviations away from the mean are exponentially
costly at a rate that depends on the size of the deviation.
§ COIN TOSSING: an example

THEOREM Let (Xi)i∈N be i.i.d. random variables with

P(X1 = 0) = P(X1 = 1) = 1/2.

Then, for all a > 1/2,

lim_{n→∞} (1/n) log P(X̄n ≥ a) = −I(a),

where

I(z) = log 2 + z log z + (1 − z) log(1 − z)   if z ∈ [0, 1],
I(z) = ∞   otherwise.
[Figure: graph of z ↦ I(z) on [0, 1], with unique zero at z = 1/2 and value log 2 at the endpoints z = 0 and z = 1.]
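A small numerical sketch of the theorem (illustrative code, not part of the course; all function names are my choices): the exact binomial tail probability can be compared with the predicted decay rate I(a).

  import math

  def I(z):
      # Coin-tossing rate function: log 2 + z log z + (1 - z) log(1 - z) on [0, 1].
      if not 0 <= z <= 1:
          return math.inf
      term = lambda p: 0.0 if p == 0 else p * math.log(p)
      return math.log(2) + term(z) + term(1 - z)

  def log_tail(n, a):
      # log P(X̄_n >= a), computed exactly via a log-sum-exp over binomial terms.
      k0 = math.ceil(a * n)
      logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
              - n * math.log(2) for k in range(k0, n + 1)]
      m = max(logs)
      return m + math.log(sum(math.exp(l - m) for l in logs))

  a = 0.7
  for n in [10, 100, 1000, 10000]:
      print(n, -log_tail(n, a) / n, I(a))   # first column approaches I(a) as n grows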
Proof. The claim is trivial for a > 1. For a ∈ (1/2, 1] we
observe that P(X̄n ≥ a) = 2^{−n} Σ_{k≥an} (n choose k), which yields

2^{−n} Qn(a) ≤ P(X̄n ≥ a) ≤ (n + 1) 2^{−n} Qn(a),

where

Qn(a) = max_{k≥an} (n choose k).

The maximum is attained at k = ⌈an⌉, the smallest integer
≥ an. Stirling’s formula

n! = n^n e^{−n} √(2πn) [1 + O(1/n)]

therefore allows us to infer that [exercise]

lim_{n→∞} (1/n) log Qn(a) = −a log a − (1 − a) log(1 − a) = −I(a) + log 2,

which proves the claim.

Note that the rate function z ↦ I(z) is infinite outside
[0, 1], finite and strictly convex inside [0, 1], and has a
unique zero at z = 1/2. This zero corresponds to the LLN.
Indeed,

Σ_{n∈N} P(|X̄n − 1/2| > δ) < ∞   ∀ δ > 0,   [exercise]

and so the LLN follows via the Borel-Cantelli lemma.

The curvature of the rate function at z = 1/2 corresponds
to the CLT, as will become clear later.

Since E(X1) = 1/2 and a > 1/2, the above theorem deals with
large deviations in the upward direction. It is clear from
symmetry that the same holds for P(X̄n ≤ a) with a < 1/2.
This is manifested by the symmetry relation I(1 − z) = I(z).
§ LARGE DEVIATION PRINCIPLE: general theory

DEFINITION (Large Deviation Principle)
A family of probability measures (µε)ε>0 on a Polish space
X is said to satisfy the large deviation principle (LDP) with
rate function I : X → [0, ∞] if
(i) I has compact level sets and is not identically infinite,
(ii) lim inf_{ε↓0} ε log µε(O) ≥ −I(O) for all O ⊆ X open,
(iii) lim sup_{ε↓0} ε log µε(C) ≤ −I(C) for all C ⊆ X closed,
where I(S) = inf_{x∈S} I(x), S ⊆ X.
[Figure: paradigmatic picture of a rate function I on X with a unique zero.]

Informally, the LDP says that if Bδ(x) is the open ball of
radius δ > 0 centred at x ∈ X, then

µε(Bδ(x)) = e^{−[1+o(1)] I(x)/ε}

when ε ↓ 0 followed by δ ↓ 0.
THEOREM The rate function is unique.

Proof. Suppose that I, J are both rate functions for the
same (µε)ε>0. Pick x ∈ X and consider the open balls
BN = B_{1/N}(x), N ∈ N. Then

−I(x) ≤ −I(BN+1) ≤ lim inf_{ε↓0} ε log µε(BN+1)
≤ lim sup_{ε↓0} ε log µε(cl(BN+1)) ≤ −J(cl(BN+1)) ≤ −J(BN).

Since J has compact level sets it is lower semi-continuous,
so that lim_{N→∞} J(BN) = J(x). Hence I(x) ≥ J(x).
By interchanging I and J we get the opposite bound and
hence I(x) = J(x). Since x ∈ X is arbitrary, this proves
that I = J.
§ WEAK LDP

If the level sets of I are assumed to be closed rather than
compact, and the inequality for the lim sup is assumed to
hold for compact sets rather than closed sets, then it is
said that the weak LDP holds.

Strengthening a weak LDP to an LDP requires establishing
exponential tightness, i.e., proving that for every N < ∞
there exists a compact set KN ⊆ X such that

lim sup_{ε↓0} ε log µε([KN]^c) ≤ −N.
§ DUALITY

The LDP is the workhorse for the computation of averages
of exponential functionals.

THEOREM (Varadhan’s Lemma)
If (µε)ε>0 satisfies the LDP on X with rate function I, then

lim_{ε↓0} ε log ∫_X e^{F(x)/ε} µε(dx) = ΛF   ∀ F ∈ Cb(X),

where Cb(X) is the space of bounded continuous functions
on X, and

ΛF = sup_{x∈X} [F(x) − I(x)].
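A minimal numerical sketch of Varadhan’s Lemma in the coin-tossing setting of Topic 1, with ε = 1/n and µε the law of X̄n (the test function F and all names below are my choices):

  import math

  def I(z):
      # Coin-tossing rate function from the example in Topic 1.
      if not 0 <= z <= 1:
          return math.inf
      t = lambda p: 0.0 if p == 0 else p * math.log(p)
      return math.log(2) + t(z) + t(1 - z)

  F = lambda z: math.sin(3 * z)     # an arbitrary bounded continuous test function

  def lhs(n):
      # (1/n) log E[exp(n F(X̄_n))] for n fair coin tosses, computed exactly.
      logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
              - n * math.log(2) + n * F(k / n) for k in range(n + 1)]
      m = max(logs)
      return (m + math.log(sum(math.exp(l - m) for l in logs))) / n

  rhs = max(F(z) - I(z) for z in [j / 10000 for j in range(10001)])   # Λ_F = sup[F − I]
  print(lhs(2000), rhs)   # the two numbers should be close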
Varadhan’s Lemma can be easily extended to functions F
that are unbounded and discontinuous, provided certain
tail estimates on µε are available.

Varadhan’s Lemma has the following inverse.

THEOREM (Bryc’s Lemma)
Suppose that (µε)ε>0 is exponentially tight and the limit in
Varadhan’s Lemma exists for all F ∈ Cb(X). Then (µε)ε>0
satisfies the LDP with rate function I given by

I(x) = sup_{F∈Cb(X)} [F(x) − ΛF],   x ∈ X.
§ FORWARD PRINCIPLES

There are several ways to generate one LDP from another.
We give three examples.

THEOREM (Contraction Principle)
Let (µε)ε>0 satisfy the LDP on X with rate function I. Let
Y be a second Polish space, and T : X → Y a continuous
map from X to Y. Then the family of probability measures
(νε)ε>0 on Y defined by

νε = µε ◦ T^{−1}

satisfies the LDP on Y with rate function J given by

J(y) = inf_{x∈X : T(x)=y} I(x),   y ∈ Y.
[Figure: illustration of the contraction principle, with T mapping X to Y.]
Proof. Since T is continuous, T^{−1} maps open sets into
open sets and closed sets into closed sets. Pick C ⊂ Y
closed and write

lim sup_{ε↓0} ε log νε(C) = lim sup_{ε↓0} ε log µε(T^{−1}(C))
≤ −I(T^{−1}(C)) = − inf_{x∈T^{−1}(C)} I(x) = − inf_{y∈C} J(y) = −J(C).

A similar argument applies for O ⊂ Y open. Since the rate
function is unique, this identifies J as the rate function.
Moreover, J inherits compact level sets from I because T
is continuous.
THEOREM (Exponential Tilting)
Let (µε)ε>0 satisfy the LDP on X with rate function I, and
let F ∈ Cb(X). Then the family of probability measures
(νε)ε>0 on X defined by

νε(dx) = (1/Nε) e^{F(x)/ε} µε(dx),   Nε = ∫_X e^{F(x)/ε} µε(dx),

satisfies the LDP on X with rate function J given by

J(x) = ΛF − [F(x) − I(x)],   x ∈ X.
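A small numerical sketch of the tilting formula, again for coin tossing with ε = 1/n and a linear tilt F(z) = cz (the tilt and all names are my choices): the tilted rate function J has its unique zero at the tilted mean e^c/(1 + e^c).

  import math

  def I(z):
      # Coin-tossing rate function.
      if not 0 <= z <= 1:
          return math.inf
      t = lambda p: 0.0 if p == 0 else p * math.log(p)
      return math.log(2) + t(z) + t(1 - z)

  c = 1.5                                          # tilt F(z) = c*z
  zs = [j / 100000 for j in range(100001)]
  Lam = max(c * z - I(z) for z in zs)              # Λ_F = sup[F − I]
  J = lambda z: Lam - (c * z - I(z))               # tilted rate function
  zmin = min(zs, key=J)                            # location of the unique zero of J
  print(zmin, math.exp(c) / (1 + math.exp(c)))     # both ≈ 0.8176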
THEOREM (Dawson-Gärtner projective limit LDP)
Let (µε)ε>0 be a family of probability measures on X. Let
(π^N)N∈N be a nested family of projections acting on X
such that ⋃_{N∈N} π^N is the identity. Let

X^N = π^N X,   µε^N = µε ◦ (π^N)^{−1},   N ∈ N.

If, for each N ∈ N, the family (µε^N)ε>0 satisfies the LDP on
X^N with rate function I^N, then (µε)ε>0 satisfies the LDP
on X with rate function I given by

I(x) = sup_{N∈N} I^N(π^N x),   x ∈ X.
Since

I^N(y) = inf_{x∈X : π^N(x)=y} I(x),   y ∈ X^N,

the supremum defining I is non-decreasing in N because
the projections are nested.

The projective limit LDP can be used to extend a suitably
nested sequence of LDPs on finite-dimensional spaces to
an LDP on an infinite-dimensional space:

{−N, . . . , +N} → Z → R → . . .
TOPIC 2
Overview of large deviation theory
Part II
§ SPECIAL STRUCTURES

LDPs can be formulated on general topological spaces X,
although this comes at the cost of more technicalities.
Conversely, more can be said when X has more structure.
For instance, if X is a vector space, then the rate function
can be identified as the Legendre transform of a generalised
cumulant generating function. When X = R^d, we have the
following.
THEOREM (Gärtner-Ellis Theorem)
Let (µε)ε>0 be a family of probability measures on R^d, d ≥ 1,
with the following properties:
(i) For all t ∈ R^d,

ψ(t) = lim_{ε↓0} ε log ∫_{R^d} e^{⟨t,x⟩/ε} µε(dx)   exists in R,

where ⟨·, ·⟩ denotes the standard inner product on R^d.
(ii) t ↦ ψ(t) is differentiable on R^d.
Then (µε)ε>0 satisfies the LDP on R^d with a convex rate
function ψ* given by the Legendre transform

ψ*(x) = sup_{t∈R^d} [⟨t, x⟩ − ψ(t)],   x ∈ R^d.
There is a version of the Gärtner-Ellis Theorem where the
domain of ψ is not all of R^d, in which case some additional
assumptions must be made.

Two special cases deserve to be mentioned:
• Cramér’s Theorem
• Sanov’s Theorem

Let (Xi)i∈N be i.i.d. R-valued random variables with law ρ.
Let M1(R) denote the space of probability measures on R
(which is a subset of the vector space of signed measures
on R).
§ CRAMÉR’S THEOREM

Let µn denote the law of the empirical average

X̄n = n−1 Σ_{i=1}^{n} Xi,   Xi ∈ R.

If

ϕ(t) = E(e^{tX1}) = ∫_R e^{tx} ρ(dx) < ∞   ∀ t ∈ R,

then (µn)n∈N satisfies the LDP on R with rate ε = n−1 and
rate function

I(x) = sup_{t∈R} [tx − log ϕ(t)],   x ∈ R.
Examples:

1. ρ = N(0, 1):
   ϕ(t) = e^{t²/2}, t ∈ R,
   I(x) = (1/2) x²,  x ∈ R.

2. ρ = (1/2)(δ−1 + δ+1):
   ϕ(t) = cosh(t), t ∈ R,
   I(x) = (1/2)(1 + x) log(1 + x) + (1/2)(1 − x) log(1 − x),  x ∈ [−1, +1].

3. ρ = POISSON(m):
   ϕ(t) = exp[m(e^t − 1)], t ∈ R,
   I(x) = x log(x/m) − (x − m),  x ∈ [0, ∞).
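These closed forms can be checked with a small numerical Legendre transform (illustrative code; the grid and test points are my choices):

  import math

  def legendre(log_phi, x, ts):
      # Numerical Legendre transform: sup_t [t*x − log ϕ(t)] over a grid of t values.
      return max(t * x - log_phi(t) for t in ts)

  ts = [-10 + k * 0.001 for k in range(20001)]

  # 1. standard normal: log ϕ(t) = t²/2, so I(x) = x²/2
  print(legendre(lambda t: t * t / 2, 1.3, ts), 1.3**2 / 2)

  # 2. fair ±1 coin: log ϕ(t) = log cosh t
  x = 0.4
  print(legendre(lambda t: math.log(math.cosh(t)), x, ts),
        0.5 * (1 + x) * math.log(1 + x) + 0.5 * (1 - x) * math.log(1 - x))

  # 3. Poisson(m): log ϕ(t) = m(e^t − 1)
  m, x = 2.0, 3.5
  print(legendre(lambda t: m * (math.exp(t) - 1), x, ts),
        x * math.log(x / m) - (x - m))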
We will prove the following:

Cramér’s Theorem (one-sided version)
If

ϕ(t) = E(e^{tX1}) < ∞   ∀ t ∈ R,

then, for all a > E(X1),

lim_{n→∞} (1/n) log P(X̄n ≥ a) = −I(a),

where

I(z) = sup_{t∈R} [zt − log ϕ(t)].
Proof. We may suppose without loss of generality that
a = 0 and E(X1) < 0. Namely, the substitution X1 → X1 + a
gives ϕ(t) → e^{at} ϕ(t) and, consequently, I(a) → I(0). We
may also suppose that X1 is non-degenerate, because the
claim is trivial otherwise.

Henceforth we abbreviate

ρ = inf_{t∈R} ϕ(t).

Note that I(0) = − log ρ (with I(0) = ∞ if ρ = 0), so we
must prove that

lim_{n→∞} (1/n) log P(X̄n ≥ 0) = log ρ.
Let F(x) = P(X1 ≤ x) be the distribution function of X1.
We have ϕ ∈ C^∞(R), the space of smooth functions, with

ϕ′(t) = ∫_R x e^{tx} dF(x),   ϕ″(t) = ∫_R x² e^{tx} dF(x).

Hence, ϕ is strictly convex and ϕ′(0) = E(X1) < 0. We
focus on the case

P(X1 < 0) > 0,   P(X1 > 0) > 0.

The other cases are trivial.
Since lim_{t→±∞} ϕ(t) = ∞ and since ϕ is strictly convex,
there exists a unique τ ∈ R satisfying τ > 0 such that

ϕ(τ) = ρ,   ϕ′(τ) = 0.

[Figure: graph of t ↦ ϕ(t), attaining its minimum ρ at t = τ.]

Let Sn = nX̄n = Σ_{i=1}^{n} Xi. By the exponential Chebyshev
inequality, we have the following upper bound:

P(X̄n ≥ 0) = P(e^{τSn} ≥ 1) ≤ E(e^{τSn}) = [ϕ(τ)]^n = ρ^n.

Hence

lim sup_{n→∞} (1/n) log P(X̄n ≥ 0) ≤ log ρ.
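A quick numerical sketch of this upper bound for a concrete two-point law (my choice): X1 = +1 with probability p and −1 with probability 1 − p, with p < 1/2 so that E(X1) < 0.

  import math

  p = 0.3                                   # P(X1 = +1); P(X1 = −1) = 1 − p
  phi = lambda t: p * math.exp(t) + (1 - p) * math.exp(-t)
  rho = min(phi(k * 0.0001) for k in range(-100000, 100001))   # inf_t ϕ(t) = 2√(p(1−p))

  def log_prob(n):
      # log P(X̄_n ≥ 0) = log P(at least n/2 of the Xi equal +1), computed exactly.
      k0 = math.ceil(n / 2)
      logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
              + k * math.log(p) + (n - k) * math.log(1 - p) for k in range(k0, n + 1)]
      m = max(logs)
      return m + math.log(sum(math.exp(l - m) for l in logs))

  for n in [10, 100, 1000]:
      # Chebyshev bound: log_prob(n)/n ≤ log(rho); equality holds in the limit.
      print(n, log_prob(n) / n, math.log(rho))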
To get the matching lower bound we resort to an important
technique in large deviation theory: exponential tilting.

Let (X̂i)i∈N be an i.i.d. sequence of random variables with
distribution function

F̂(x) = (1/ρ) ∫_{(−∞,x]} e^{τy} dF(y),

called the Cramér transform of F. Note that

ρ = ϕ(τ) = ∫_R e^{τy} dF(y).

The proof of the lower bound proceeds via three lemmas.
Lemma: E(X̂1) = µ̂ = 0,  Var(X̂1) = σ̂² ∈ (0, ∞).

Proof. Let ϕ̂(t) = E(e^{tX̂1}). Then

ϕ̂(t) = ∫_R e^{tx} dF̂(x) = (1/ρ) ∫_R e^{tx} e^{τx} dF(x) = (1/ρ) ϕ(t + τ) < ∞   ∀ t ∈ R.

This implies that ϕ̂ ∈ C^∞(R) and

E(X̂1) = ϕ̂′(0) = (1/ρ) ϕ′(τ) = 0,   Var(X̂1) = ϕ̂″(0) = (1/ρ) ϕ″(τ) ∈ (0, ∞).
Lemma: Let Ŝn = Σ_{i=1}^{n} X̂i. Then

P(Sn ≥ 0) = ρ^n E(e^{−τŜn} 1_{{Ŝn ≥ 0}}).

Proof. Write

P(Sn ≥ 0) = ∫_{{x1+···+xn≥0}} dF(x1) × · · · × dF(xn)
          = ∫_{{x1+···+xn≥0}} [ρ e^{−τx1} dF̂(x1)] × · · · × [ρ e^{−τxn} dF̂(xn)].
Lemma: lim inf_{n→∞} (1/n) log E(e^{−τŜn} 1_{{Ŝn ≥ 0}}) ≥ 0.

Proof. For any C > 0, we may estimate

E(e^{−τŜn} 1_{{Ŝn ≥ 0}}) ≥ e^{−τCσ̂√n} P(Ŝn/(σ̂√n) ∈ [0, C)).

Since Ŝn satisfies the central limit theorem, the probability
in the right-hand side tends to

P(Z ∈ [0, C)),   Z standard normal.

The latter is strictly positive for any C > 0.
The proof of the one-sided version of Cramér’s Theorem
is now concluded by observing that the last two lemmas
imply that [exercise]

lim inf_{n→∞} (1/n) log P(X̄n ≥ 0) ≥ log ρ,

which is the matching lower bound.

The rate function I has the following properties: [exercise]
(1) I is continuous, strictly convex and smooth on the
interior of its domain.
(2) I(z) ≥ 0 with equality if and only if z = µ.
(3) I″(µ) = 1/σ².
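Property (3) can be checked numerically for the coin-tossing rate function of Topic 1, where µ = 1/2 and σ² = 1/4 (a minimal sketch; the step size is my choice):

  import math

  def I(z):
      # Rate function for fair coin tossing (values 0 and 1, probability 1/2 each).
      t = lambda q: 0.0 if q == 0 else q * math.log(q)
      return math.log(2) + t(z) + t(1 - z)

  mu, var, h = 0.5, 0.25, 1e-4
  second_deriv = (I(mu + h) - 2 * I(mu) + I(mu - h)) / h**2   # central finite difference
  print(second_deriv, 1 / var)                                 # both should be close to 4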
§ SANOV’S THEOREM

Let µn denote the law of the empirical distribution

Ln = n−1 Σ_{i=1}^{n} δXi ∈ M1(R).

Then (µn)n∈N satisfies the LDP on M1(R) with rate ε = n−1
and rate function

I(ν) = H(ν | ρ) = ∫_R ν(dx) log (dν/dρ)(x),   ν ∈ M1(R),

with the right-hand side infinite when ν is not absolutely
continuous with respect to ρ.

H(ν | ρ) is called the relative entropy of ν with respect to
ρ.
We restrict the proof to the situation where the Xi take
values in a finite set:

Xi ∈ Γ = {1, . . . , r} ⊂ N,
X1, X2, . . . are i.i.d. with marginal law ρ = (ρs)s∈Γ,  ρs > 0 ∀ s ∈ Γ.

We write

M1(Γ) = {ν = (ν1, . . . , νr) ∈ [0, 1]^r : Σ_{s=1}^{r} νs = 1}

to denote the probability simplex in R^r, which may be
identified with the set of probability measures on Γ.
Sanov’s Theorem (finite support)
Let Ln = n−1 Σ_{i=1}^{n} δXi. Then, for all open sets O ⊂ M1(Γ),

lim_{n→∞} (1/n) log P(Ln ∈ O) = − inf_{ν∈O} I(ν),

where

I(ν) = Σ_{s=1}^{r} νs log(νs/ρs)

(with the convention inf_∅ I = ∞).
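A small numerical sketch of the finite-support statement (the alphabet size, marginal law ρ and open set O below are my choices): the exact multinomial probability of {Ln ∈ O} is compared with the infimum of the relative entropy over O.

  import math

  rho = [0.2, 0.3, 0.5]                      # marginal law on Γ = {1, 2, 3}
  in_O = lambda nu: nu[0] > 0.5              # an open set O of empirical measures

  def H(nu):
      # Relative entropy H(ν | ρ) on the finite simplex.
      return sum(v * math.log(v / r) for v, r in zip(nu, rho) if v > 0)

  def log_prob(n):
      # log P(L_n ∈ O), summing the multinomial distribution over all types k with ν_n(k) ∈ O.
      logs = []
      for k1 in range(n + 1):
          for k2 in range(n + 1 - k1):
              k = (k1, k2, n - k1 - k2)
              if in_O([ki / n for ki in k]):
                  logs.append(math.lgamma(n + 1) - sum(math.lgamma(ki + 1) for ki in k)
                              + sum(ki * math.log(r) for ki, r in zip(k, rho)))
      m = max(logs)
      return m + math.log(sum(math.exp(l - m) for l in logs))

  grid = [(a / 200, b / 200, 1 - a / 200 - b / 200)
          for a in range(201) for b in range(201 - a)]
  inf_I = min(H(nu) for nu in grid if in_O(nu))      # inf over O on a grid of the simplex
  for n in [50, 200, 500]:
      print(n, log_prob(n) / n, -inf_I)              # first column approaches −inf_I slowly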
On M1(Γ) we define the total variation distance

d(µ, ν) = (1/2) Σ_{s=1}^{r} |µs − νs|,

which turns M1(Γ) into a Polish space (i.e., a complete
separable metric space).
Proof. Let

Kn = {k = (k1, . . . , kr) ∈ N0^r : Σ_{s=1}^{r} ks = n},

and note that (1/n) Kn ⊂ M1(Γ) for all n ∈ N. Then Ln has
the multinomial distribution

P(Ln(s) = ks/n ∀ s) = (n! / Π_{s=1}^{r} ks!) Π_{s=1}^{r} ρs^{ks},   k ∈ Kn.

For k ∈ Kn, let νn(k) = (1/n) k ∈ M1(Γ). Put

Qn = max_{k∈Kn : νn(k)∈O} (n! / Π_{s=1}^{r} ks!) Π_{s=1}^{r} ρs^{ks}.

Then, clearly,

Qn ≤ P(Ln ∈ O) ≤ |Kn| Qn.
Stirling’s formula gives

(1/n) log [ (n! / Π_{s=1}^{r} ks!) Π_{s=1}^{r} ρs^{ks} ]
= Σ_{s=1}^{r} (ks/n) [ log ρs − log(ks/n) ] + O(log n / n)

uniformly on Kn, where we use that Σ_{s=1}^{r} ks = n. Since the
sum in the right-hand side equals −I(νn(k)), and |Kn| =
(n+r−1 choose r−1) = O(n^{r−1}), we find that

(1/n) log P(Ln ∈ O) = O(log n / n) + (1/n) log Qn
= O(log n / n) − min_{k∈Kn : νn(k)∈O} I(νn(k)).
In order to complete the proof, it now remains to observe
that [exercise]

(i) ⋃_{n∈N} {νn(k) : k ∈ Kn} is dense in M1(Γ),
(ii) ν ↦ I(ν) is continuous on M1(Γ).

Indeed, (i) and (ii) guarantee that for every ν ∈ M1(Γ)
there exists a sequence (kn)n∈N, with kn ∈ Kn for all n,
such that

lim_{n→∞} d(νn(kn), ν) = 0,   lim_{n→∞} I(νn(kn)) = I(ν).

Since O is an open set, this implies that

lim sup_{n→∞} min_{k∈Kn : νn(k)∈O} I(νn(k)) ≤ I(ν)   ∀ ν ∈ O.

Optimizing over ν, we get

lim sup_{n→∞} min_{k∈Kn : νn(k)∈O} I(νn(k)) ≤ inf_{ν∈O} I(ν).

However, the reverse inequality is trivial, and so

lim_{n→∞} (1/n) log P(Ln ∈ O) = − inf_{ν∈O} I(ν),

which proves the claim.

I(ν) = H(ν | ρ) is the relative entropy of ν with respect to
ρ and has the following properties: [exercise]
(1) I is finite, continuous and strictly convex on M1(Γ).
(2) I(ν) ≥ 0 with equality if and only if ν = ρ.
With the help of the Dawson-Gärtner projective limit LDP,
the proof of Sanov’s Theorem can be easily extended to
R.
Relative entropy is a key notion in ergodic theory,
information theory and statistical physics.
§ EMPIRICAL PROCESS

It is also possible to obtain an infinite-dimensional version
of Sanov’s Theorem. The key object of interest is the
empirical process

Rn = n−1 Σ_{i=1}^{n} δ_{θ^i (X1,...,Xn)^per} ∈ M∗1(R^N),

where

(X1, . . . , Xn)^per = (X1, . . . , Xn), (X1, . . . , Xn), . . .

is the periodic extension of the first n elements of (Xi)i∈N,
θ is the left-shift acting on R^N, and M∗1(R^N) is the space
of θ-invariant probability measures on R^N.
Sanov’s Theorem (empirical process)
Let µn denote the law of the empirical process Rn. Then
(µn)n∈N satisfies the LDP on M∗1(R^N) with rate ε = n−1
and rate function

I(Q) = H(Q | ρ^⊗N) = lim_{N→∞} N−1 H(π^N Q | ρ^N),

where π^N Q is the projection of Q onto the first N coordinates.

H(Q | ρ^⊗N) is called the specific relative entropy of Q with
respect to ρ^⊗N.
§ EXTENSIONS

The rate functions

x ↦ I(x),   ν ↦ I(ν),   Q ↦ I(Q)

are all convex. The first two are strictly convex, the third is
affine. There is a deep connection with information theory.

It is possible to generalize the three LDPs to sequences
(Xi)i∈N that are not i.i.d., for instance, Markov sequences.
However, some mixing properties are needed to ensure
proper rate functions. In general these rate functions are
no longer convex.
§ LITERATURE

S.R.S. Varadhan, Large Deviations and Applications,
Society for Industrial and Applied Mathematics, 1984.
R.S. Ellis, Entropy, Large Deviations, and Statistical Physics,
Springer, 1985.
J.-D. Deuschel and D.W. Stroock, Large Deviations,
Academic Press, 1989.
A. Dembo and O. Zeitouni, Large Deviations Techniques
and Applications, Springer, 1998.
F. den Hollander, Large Deviations,
American Mathematical Society, 2000.
R. Cerf, On Cramér Theory in Infinite Dimensions,
Société Mathématique de France, 2007.
H. Touchette, The Large Deviation Approach to Statistical Mechanics,
Phys. Rep. 478, 1-69, 2009.
H. Touchette, A Basic Introduction to Large Deviations:
Theory, Applications, Simulations, 3rd International Oldenburg
Summer School, 2011.
+ ...
TOPIC 3
Application:
Statistical hypothesis testing
§ STATISTICAL HYPOTHESIS TESTING

Let X1, . . . , Xn be i.i.d. R-valued random variables with an
unknown marginal law µ. Suppose we know that either
µ = µ0 or µ = µ1, where µ0 and µ1 are given, but we do
not know which.

What is the best statistical test to decide, based on the
observation of the sample X1, . . . , Xn, which of the two
laws occurs, and how good is this test for large n?
There are two hypotheses:

hypothesis H0 : µ = µ0,
hypothesis H1 : µ = µ1.

DEFINITION
A decision test is a measurable function Tn : R^n → {0, 1}
such that

H0 is accepted when Tn(X1, . . . , Xn) = 0,
H1 is accepted when Tn(X1, . . . , Xn) = 1.
The performance of a decision test is determined by the
error probabilities

αn = P(Tn rejects H0 | µ = µ0) = P(Tn = 1 | µ = µ0),
βn = P(Tn rejects H1 | µ = µ1) = P(Tn = 0 | µ = µ1),

where P is the joint law of X1, . . . , Xn and µ. The latter is
assumed to be distributed according to an a priori law on
{µ0, µ1}. The Bayesian error probability is

∆n = αn P(µ = µ0) + βn P(µ = µ1).

ASSUMPTION: µ0 ≠ µ1, but they are equivalent, i.e.,
mutually absolutely continuous w.r.t. each other.
§ OPTIMAL DECISION TESTS

A good decision test will have both αn and βn small. A
sensible criterion for optimality is to seek a decision test
that minimizes βn subject to a pre-set constraint on αn, or
vice versa.

The optimal decision test was identified by
Neyman and Pearson and is based on the
so-called log-likelihood ratios.
Define the likelihood ratios

L10 = dµ1/dµ0,   L01 = dµ0/dµ1,

and define new random variables

Yi = log L10(Xi) = − log L01(Xi),   i = 1, . . . , n.

DEFINITION
A Neyman-Pearson test (NP-test) is a decision test of the
form

Tn(X1, . . . , Xn) = 1_{{(1/n)(Y1 + · · · + Yn) > γn}}

for some γn ∈ R.
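A minimal sketch of an NP-test in code (the Gaussian example and all names are my choices, not part of the course):

  import random

  def np_test(xs, log_likelihood_ratio, gamma):
      # Neyman-Pearson test: accept H1 (return 1) iff the average log-likelihood ratio exceeds gamma.
      avg = sum(log_likelihood_ratio(x) for x in xs) / len(xs)
      return 1 if avg > gamma else 0

  # Example with marginals µ0 = N(0, 1) and µ1 = N(1, 1):
  llr = lambda x: x - 0.5                              # log(dµ1/dµ0)(x) for these two Gaussians
  random.seed(0)
  sample = [random.gauss(0, 1) for _ in range(1000)]   # data generated under H0
  print(np_test(sample, llr, gamma=0.0))               # typically 0, i.e. H0 is accepted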
The γn of the NP-test depends on the choice of αn.

This definition shows that NP-tests are naturally linked
with Cramér’s Theorem. In order to make the proper
connection, we need the following quantities:

γ0 = E(Y1 | µ = µ0) = −H(µ0 | µ1),
γ1 = E(Y1 | µ = µ1) = H(µ1 | µ0),

where

H(µ1 | µ0) = ∫ dµ1 log(dµ1/dµ0)

denotes the relative entropy of µ1 with respect to µ0. Note
that −∞ ≤ γ0 < 0 < γ1 ≤ ∞.
§ OPTIMALITY

It is a classical result in statistics that NP-tests are optimal
in the following sense:

There are no tests with the same value of αn
and a smaller value of βn, and vice versa.

In other words, if we consider the class of all decision tests
with a fixed value of αn, then in this class the NP-test has
the smallest value of βn.
THEOREM (Chernoff)
Let γ ∈ (γ0, γ1). Then, for the NP-test with γn ≡ γ,

lim_{n→∞} (1/n) log αn = −I0(γ) < 0,
lim_{n→∞} (1/n) log βn = −[I0(γ) − γ] < 0,

with I0 the Legendre transform of the cumulant generating
function of Y1 given µ = µ0, i.e.,

I0(z) = sup_{t∈R} [zt − log ϕ0(t)],   with   ϕ0(t) = E(e^{tY1} | µ = µ0).
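A small numerical sketch of the two decay rates for a pair of Bernoulli marginals (the parameters and all names below are my choices):

  import math

  p0, p1 = 0.3, 0.6                          # µ0(1) = p0, µ1(1) = p1 on the alphabet {0, 1}

  # Y1 = log(dµ1/dµ0)(X1) takes two values; list them with their µ0-probabilities.
  y_vals  = [math.log((1 - p1) / (1 - p0)), math.log(p1 / p0)]
  weights = [1 - p0, p0]

  log_phi0 = lambda t: math.log(sum(w * math.exp(t * y) for w, y in zip(weights, y_vals)))
  ts = [k / 1000 for k in range(-20000, 20001)]
  I0 = lambda z: max(t * z - log_phi0(t) for t in ts)   # numerical Legendre transform

  gamma0 = sum(w * y for w, y in zip(weights, y_vals))              # E(Y1 | µ = µ0) < 0
  gamma1 = sum(w * y for w, y in zip([1 - p1, p1], y_vals))         # E(Y1 | µ = µ1) > 0
  print(gamma0, gamma1)
  gamma = 0.0                                                       # any γ in (γ0, γ1)
  print(-I0(gamma), -(I0(gamma) - gamma))   # decay rates of αn and βn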
Proof. First note that

ϕ0(t) = ∫ dµ0 e^{t log(dµ1/dµ0)}.

Hence, for all 0 ≤ t ≤ 1,

ϕ0(t) ≤ ∫_{{dµ1/dµ0 < 1}} dµ0 + ∫_{{dµ1/dµ0 ≥ 1}} dµ1 ≤ 2,

which means that ϕ0 exists in a right-neighbourhood of
the origin. We have

αn = P((1/n)(Y1 + · · · + Yn) > γ | µ = µ0),

so the first claim follows from the one-sided version of
Cramér’s Theorem. Note that γ > γ0 = E(Y1 | µ = µ0), so
I0(γ) > 0.
Similarly, we have

βn = P((1/n)(Y1 + · · · + Yn) ≤ γ | µ = µ1),

and hence the same reasoning applies, this time with the
moment generating function

ϕ1(t) = E(e^{tY1} | µ = µ1),

which exists in a left-neighbourhood of the origin.

An easy computation shows that ϕ1(t) = ϕ0(t + 1), so that
I1(z) = I0(z) − z for the rate function I1 associated with
ϕ1, and so the second claim follows as well. Note that
γ < γ1 = E(Y1 | µ = µ1), so I1(γ) > 0.
[Figure: the graph of t ↦ log ϕ0(t) on [0, 1], a convex curve with minimum −I0(0); the slopes at the endpoints are indicated.]
COROLLARY (Chernoff)
Suppose that 0 < P(µ = µ0) < 1. Then

lim inf_{n→∞} (1/n) log inf_{Tn} ∆n(Tn) = −I0(0),

where the infimum runs over all decision tests.

Proof. It suffices to consider NP-tests (since these are
optimal). Let αn* and βn* be the error probabilities for the
NP-test with γn ≡ γ = 0. Then, for any other NP-test,
either αn ≥ αn* (when γn ≤ 0) or βn ≥ βn* (when γn ≥ 0).
Hence, by the definition of ∆n,

inf_{Tn} ∆n(Tn) ≥ inf_{Tn} min{αn P(µ = µ0), βn P(µ = µ1)}
≥ min{αn* P(µ = µ0), βn* P(µ = µ1)}.

Since 0 < P(µ = µ0) = 1 − P(µ = µ1) < 1, it follows that

lim inf_{n→∞} (1/n) log inf_{Tn} ∆n(Tn)
≥ min{ lim_{n→∞} (1/n) log αn*, lim_{n→∞} (1/n) log βn* } = −I0(0).

Equality is obtained for the NP-test with γn ≡ γ = 0.
§ CONCLUSION

The Bayesian error probability is smallest, in the limit
as n → ∞, for the zero threshold NP-test. This test has an
exponential rate

I0(0) = − log inf_{t∈R} ϕ0(t) = I1(0) = − log inf_{t∈R} ϕ1(t),

which is the best possible. This rate is called the Chernoff
information of the laws µ0 and µ1.

[exercise] Compute I0(0) when µ0, µ1 are exponential with
respective parameters δ0, δ1 > 0.
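As a numerical companion to the exercise (not the closed-form answer), the Chernoff information can be approximated by integrating ϕ0 directly; the parameter values below are my choices.

  import math

  d0, d1 = 1.0, 3.0      # rates of the exponential laws µ0, µ1

  def phi0(t):
      # ϕ0(t) = E(exp(t Y1) | µ = µ0) with Y1 = log(dµ1/dµ0)(X1),
      # computed by a simple Riemann sum over x in (0, 20].
      dx = 1e-3
      total = 0.0
      for k in range(1, 20001):
          x = k * dx
          llr = math.log(d1 / d0) + (d0 - d1) * x   # log-likelihood ratio at x
          total += d0 * math.exp(-d0 * x) * math.exp(t * llr) * dx
      return total

  ts = [k / 100 for k in range(1, 100)]             # t in (0, 1), where ϕ0 is finite
  print(-math.log(min(phi0(t) for t in ts)))        # numerical value of I0(0)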