
Lecture 13
1  Large Deviations – Cramér's Theorem
If we think of Extreme Value Theory as being concerned with the chance of making a rare
observation among a collection of observations (Xi )1≤i≤n , then Large Deviations Theory deals
with a different type of rare events, where the rare event occurs not because of a single rare
observation, but because the observations (Xi )1≤i≤n contribute jointly to the rare event. For
example, the rare event could be that the empirical average (1/n)∑_{i=1}^n X_i takes on an unusually
large value (e.g. X_1, · · · , X_n may be the values of n assets in a portfolio, so that we are only
interested in their sum), or the rare event could be that (X_i)_{1≤i≤n} take on exceptionally small
values uniformly in 1 ≤ i ≤ n. Large deviations theory is concerned with determining how
fast the probabilities of these rare events decay to 0 as the sample size n tends to ∞, and
how to compute the precise rate of decay as a function of the rare event.
We consider here the simplest case, where the observations (X_i)_{i∈N} are i.i.d. real-valued
random variables with α := E[X_1] < ∞, and the events we consider are
\[
A_n(E) := \Big\{ \frac{1}{n}\sum_{i=1}^n X_i \in E \Big\}, \qquad \text{with } E \subset \mathbb{R}. \tag{1.1}
\]
The weak law of large numbers implies that (1/n)∑_{i=1}^n X_i converges in distribution to the delta
measure δ_α(dx). If α ∉ Ē (the closure of E), then (α − ε, α + ε) ⊂ E^c for some ε > 0, and hence P(A_n(E)) → 0
as n → ∞, which makes A_n(E) a rare event. The rate of decay of P(A_n(E)) is identified in
the following basic large deviation result.
Theorem 1.1 [Cramér's Theorem] Let (X_i)_{i∈N} be i.i.d. non-degenerate real-valued random
variables with a finite logarithmic moment generating function
\[
\phi(\lambda) := \log E[e^{\lambda X_1}] \qquad \text{for all } \lambda \in \mathbb{R}. \tag{1.2}
\]
Let S_n := ∑_{i=1}^n X_i.
Then
\[
\lim_{n\to\infty} \frac{1}{n}\log P(S_n \ge na) = -I(a) \qquad \text{for all } a > E[X_1], \tag{1.3}
\]
\[
\lim_{n\to\infty} \frac{1}{n}\log P(S_n \le na) = -I(a) \qquad \text{for all } a < E[X_1], \tag{1.4}
\]
where
\[
I(x) = \sup_{\lambda\in\mathbb{R}} \big(\lambda x - \phi(\lambda)\big) \tag{1.5}
\]
is called the rate function. I is convex, non-negative, and equals 0 only at E[X_1]. More
generally, for E ⊂ R, if Ē and E° denote respectively the closure and the interior of E, then
\[
\limsup_{n\to\infty} \frac{1}{n}\log P(S_n/n \in E) \le - \inf_{x\in\bar E} I(x), \tag{1.6}
\]
\[
\liminf_{n\to\infty} \frac{1}{n}\log P(S_n/n \in E) \ge - \inf_{x\in E^o} I(x). \tag{1.7}
\]
Remark. Cramér's Theorem is a study of the distributions (μ_n)_{n∈N} of the sequence of random
variables (1/n)∑_{i=1}^n X_i for i.i.d. (X_i)_{i∈N}. The law of large numbers implies that μ_n asymptotically
concentrates all of its mass at the point E[X_1]. Cramér's Theorem shows that the mass
assigned by μ_n to a small neighborhood of x decays asymptotically as e^{−nI(x)}, which explains
the name “rate function” for I.
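As a quick numerical illustration (added here, not part of the original notes), take X_i to be standard normal, so that φ(λ) = λ²/2, I(a) = a²/2, and S_n ∼ N(0, n). The following Python sketch compares −(1/n) log P(S_n ≥ na) with I(a) using the exact Gaussian tail; the choices a = 1 and the values of n are arbitrary.

```python
# Numerical check of Cramér's Theorem for X_i ~ N(0, 1) (illustrative sketch).
# Here phi(lambda) = lambda^2 / 2, so the rate function is I(a) = a^2 / 2,
# and S_n ~ N(0, n), so P(S_n >= n a) = P(Z >= a sqrt(n)) for Z ~ N(0, 1).
import math

def gaussian_tail(x: float) -> float:
    """P(Z >= x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2))

a = 1.0                      # threshold above E[X_1] = 0
rate = a ** 2 / 2            # I(a) for the standard normal
for n in (10, 100, 1000):
    p = gaussian_tail(a * math.sqrt(n))
    print(f"n={n:5d}  -log P(S_n >= na)/n = {-math.log(p)/n:.4f}  I(a) = {rate:.4f}")
```

The empirical rate approaches 1/2 rather slowly, because the Gaussian tail carries a polynomial prefactor that contributes a (log n)/n correction; this is consistent with (1.3), which holds only on the exponential scale.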
Proof. By replacing (Xi )i∈N with (−Xi )i∈N , we note that (1.4) follows from (1.3). Let
a > E[X1 ]. We can bound P(Sn ≥ na) from above by the exponential Markov inequality as
follows: For any λ > 0,
\[
P(S_n \ge na) = P\big(e^{\lambda S_n} \ge e^{\lambda na}\big) \le e^{-n\lambda a}\, E[e^{\lambda S_n}] = e^{-n\lambda a}\, E\big[e^{\lambda \sum_{i=1}^n X_i}\big] = e^{-n\lambda a}\, E[e^{\lambda X_1}]^n = e^{-n(a\lambda - \phi(\lambda))}.
\]
Since this bound holds for all λ > 0, we can minimize over λ > 0 to obtain
\[
P(S_n \ge na) \le e^{-n \sup_{\lambda>0}(a\lambda - \phi(\lambda))}. \tag{1.8}
\]
Note that
\[
\phi'(\lambda) = \frac{E[X_1 e^{\lambda X_1}]}{E[e^{\lambda X_1}]} \qquad \text{and} \qquad \phi''(\lambda) = \frac{E[X_1^2 e^{\lambda X_1}]}{E[e^{\lambda X_1}]} - \bigg(\frac{E[X_1 e^{\lambda X_1}]}{E[e^{\lambda X_1}]}\bigg)^2 \tag{1.9}
\]
are respectively the mean and variance of a random variable with probability distribution
\[
\mu_\lambda(dx) := \frac{e^{\lambda x}}{E[e^{\lambda X_1}]}\, P(X_1 \in dx) = e^{\lambda x - \phi(\lambda)}\, P(X_1 \in dx). \tag{1.10}
\]
Therefore φ′′ ≥ 0 and φ is convex. The assumption that the distribution of X_1 is non-degenerate further implies that φ′′(λ) > 0 for all λ ∈ R, and hence φ′(λ) is strictly increasing.
Note that φ(0) = 0 and φ′(0) = E[X_1]. Since a > E[X_1], it is easy to see from the convexity
of φ that aλ < φ(λ) for all λ < 0, and hence
\[
\sup_{\lambda > 0}\big(a\lambda - \phi(\lambda)\big) = \sup_{\lambda \in \mathbb{R}}\big(a\lambda - \phi(\lambda)\big) = I(a),
\]
which together with (1.8) gives the desired upper bound on P(S_n ≥ na) for a > E[X_1].
We now make a digression and establish the basic properties of I. By the definition in (1.5),
I(x) is the supremum of the collection of linear functions λx − φ(λ), indexed by λ ∈ R (I is in
fact the Legendre transform of φ), and hence is convex. The non-negativity of I is seen by
taking λ = 0 in the supremum in (1.5). Since φ is convex and differentiable, f_x(λ) := λx − φ(λ)
is concave and differentiable in λ. Note that f_x(0) = 0, therefore I(x) = sup_λ f_x(λ) = 0 only
if f_x′(0) = 0, which gives x = φ′(0) = E[X_1]. Thus I has a unique zero at E[X_1], is increasing on
[E[X_1], ∞), and decreasing on (−∞, E[X_1]].
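For concreteness, here is a standard worked example (added here, not in the original notes): if X_1 ∼ Bernoulli(p) with 0 < p < 1, then
\[
\phi(\lambda) = \log\big(1 - p + p e^{\lambda}\big),
\qquad
I(x) = x\log\frac{x}{p} + (1-x)\log\frac{1-x}{1-p} \quad \text{for } x \in [0,1],
\]
and I(x) = +∞ for x ∉ [0,1]; the formula for I is obtained by solving φ′(λ) = x for λ and substituting back into λx − φ(λ). One checks directly that I is convex, non-negative, and vanishes only at x = p = E[X_1].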
We now use the upper bound we established for (1.3) to deduce (1.6). Given E ⊂ R,
let α_+ := min(Ē ∩ [E[X_1], ∞)) and α_− := max(Ē ∩ (−∞, E[X_1]]), with α_± := ±∞ if the
respective set is empty. Then
\[
\begin{aligned}
\limsup_{n\to\infty} \frac{1}{n}\log P(S_n/n \in E)
&\le \limsup_{n\to\infty} \frac{1}{n}\log\Big( P(S_n \le n\alpha_-) + P(S_n \ge n\alpha_+) \Big) \\
&= \max\Big( \limsup_{n\to\infty} \frac{1}{n}\log P(S_n \le n\alpha_-),\ \limsup_{n\to\infty} \frac{1}{n}\log P(S_n \ge n\alpha_+) \Big) \\
&\le \max\big( -I(\alpha_-),\, -I(\alpha_+) \big) = - \inf_{x\in\bar E} I(x),
\end{aligned}
\]
which is just (1.6).
The lower bound for (1.3) follows from (1.7), since I is continuous and increasing on
[E[X_1], ∞), and hence inf_{x>a} I(x) = I(a) if a > E[X_1]. To prove (1.7), it suffices to show that
for any y ∈ R and ε > 0,
\[
\liminf_{n\to\infty} \frac{1}{n}\log P\big(S_n/n \in (y-\varepsilon, y+\varepsilon)\big) \ge -I(y), \tag{1.11}
\]
since for each y ∈ E°, we can find ε > 0 such that (y − ε, y + ε) ⊂ E.
We first prove (1.11) in the degenerate cases where either P(X_1 ≤ y) = 1 or P(X_1 ≥ y) = 1,
with ρ := P(X_1 = y). In these cases, P(S_n/n ∈ (y − ε, y + ε)) ≥ P(X_1 = · · · = X_n = y) = ρ^n, so that
\[
\liminf_{n\to\infty} \frac{1}{n}\log P\big(S_n/n \in (y-\varepsilon, y+\varepsilon)\big) \ge \lim_{n\to\infty} \frac{1}{n}\log \rho^n = \log\rho,
\]
while
\[
I(y) = \sup_{\lambda\in\mathbb{R}} \big(y\lambda - \log E[e^{\lambda X_1}]\big) = -\inf_{\lambda\in\mathbb{R}} \log E\big[e^{\lambda(X_1 - y)}\big] = -\log\rho.
\]
Therefore (1.11) holds.
Now we assume that P(X_1 > y) > 0 and P(X_1 < y) > 0. We will prove (1.11) by a change
of measure argument. For the rare event {S_n/n ∈ (y − ε, y + ε)} to occur, where ε is small, it
turns out that the optimal strategy is for (X_i)_{1≤i≤n} to change their common distribution from
μ(dx) := P(X_1 ∈ dx) to some ν(dx) with ∫ x ν(dx) = y, so that if (X_i)_{1≤i≤n} have common
distribution ν, then the event {S_n/n ∈ (y − ε, y + ε)} becomes typical in the sense that its
probability is close to 1. It then only remains to compute the “cost” of requiring (X_i)_{1≤i≤n}
to behave as if they have common distribution ν. The exact details are as follows, where ν
will be the μ_λ in (1.10) for a suitably chosen λ, and the assumption that P(X_1 > y) > 0
and P(X_1 < y) > 0 guarantees that such a μ_λ exists.
Recall that given f : R → [0, ∞) with E[f(X_1)] = 1, we can interpret f(x) as a probability
density with respect to P(X_1 ∈ dx), so that f(x) P(X_1 ∈ dx) defines another probability
distribution. If X̂_1 denotes a random variable with this distribution, then for all bounded
measurable functions g,
\[
E[f(X_1)\, g(X_1)] = \int g(x)\, f(x)\, P(X_1 \in dx) = E[g(\hat X_1)].
\]
Let (X̂_i)_{i∈N} be i.i.d. random variables with distribution μ_{λ*}(dx) as in (1.10), where λ* satisfies
\[
\phi'(\lambda^*) = \int x\, \mu_{\lambda^*}(dx) = y. \tag{1.12}
\]
Such a λ* exists and is unique, because lim_{λ→∞} φ′(λ) > y by the assumption P(X_1 > y) > 0,
lim_{λ→−∞} φ′(λ) < y by the assumption P(X_1 < y) > 0, and φ′ is continuous and strictly
increasing because φ′′(λ) > 0 for all λ ∈ R. Let Ŝ_n := ∑_{i=1}^n X̂_i. We can then write
\[
\begin{aligned}
P(|S_n/n - y| < \varepsilon)
&= E\Big[ \prod_{i=1}^n e^{\lambda^* X_i - \phi(\lambda^*)} \cdot e^{-\lambda^* S_n + n\phi(\lambda^*)}\, 1_{\{|S_n/n - y| < \varepsilon\}} \Big] \\
&= E\Big[ e^{-\lambda^* \hat S_n + n\phi(\lambda^*)}\, 1_{\{|\hat S_n/n - y| < \varepsilon\}} \Big]
\ \ge\ e^{-n(\lambda^* y + |\lambda^*|\varepsilon - \phi(\lambda^*))}\, P(|\hat S_n/n - y| < \varepsilon).
\end{aligned}
\]
By the law of large numbers applied to Ŝ_n/n, which has mean y, we obtain
\[
\liminf_{n\to\infty} \frac{1}{n}\log P\big(S_n/n \in (y-\varepsilon, y+\varepsilon)\big) \ge -\big(\lambda^* y + |\lambda^*|\varepsilon - \phi(\lambda^*)\big).
\]
Since restricting to S_n/n ∈ (y − ε′, y + ε′) for ε′ ∈ (0, ε) gives a lower bound, we can send ε′ ↓ 0
to obtain
\[
\liminf_{n\to\infty} \frac{1}{n}\log P\big(S_n/n \in (y-\varepsilon, y+\varepsilon)\big) \ge -\big(\lambda^* y - \phi(\lambda^*)\big). \tag{1.13}
\]
On the other hand, because yλ − φ(λ) is differentiable and concave in λ, the supremum in
I(y) = sup_{λ∈R}(yλ − φ(λ)) is achieved at an interior critical point λ with y = φ′(λ). By (1.12),
only λ* satisfies this equation, so I(y) = λ*y − φ(λ*). Therefore (1.13) is exactly (1.11). The proof of (1.3) and (1.7)
is then complete.
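The exponential tilting used in this proof is also the standard recipe for importance sampling of rare events. The following Python sketch (an illustration added here, not part of the notes) estimates P(S_n ≥ na) for standard normal X_i by drawing from the tilted law μ_{λ*} = N(λ*, 1), where λ* = a solves φ′(λ*) = a, and reweighting each sample by e^{−λ* Ŝ_n + nφ(λ*)} as in the display above; the sample sizes and parameters are arbitrary choices.

```python
# Importance sampling of P(S_n >= n a) for X_i ~ N(0, 1) via exponential tilting.
# The tilted law mu_{lambda*} in (1.10) is N(lambda*, 1) with lambda* = a (since phi'(lambda) = lambda),
# and P(S_n >= na) = E[ exp(-lambda* S_hat_n + n phi(lambda*)) 1{S_hat_n >= na} ]
# when S_hat_n is the sum of n i.i.d. draws from the tilted law.
import math
import random

def tilted_estimate(n: int, a: float, num_samples: int = 100_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    lam = a                              # lambda* solving phi'(lambda*) = a for N(0, 1)
    phi = lam ** 2 / 2                   # phi(lambda*) = lambda*^2 / 2
    total = 0.0
    for _ in range(num_samples):
        s_hat = sum(rng.gauss(lam, 1.0) for _ in range(n))   # S_hat_n under mu_{lambda*}
        if s_hat >= n * a:
            total += math.exp(-lam * s_hat + n * phi)        # Radon-Nikodym weight
    return total / num_samples

n, a = 50, 1.0
est = tilted_estimate(n, a)
exact = 0.5 * math.erfc(a * math.sqrt(n) / math.sqrt(2))     # S_n ~ N(0, n)
print(f"tilted IS estimate: {est:.3e}   exact: {exact:.3e}   n*I(a) = {n * a**2 / 2:.1f}")
```

Naive Monte Carlo would need on the order of e^{nI(a)} ≈ e^{25} samples to observe the event {S_n ≥ na} even once, while under the tilted law the event {Ŝ_n ≥ na} has probability about 1/2, so a modest number of reweighted samples already gives an accurate estimate.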
2  Sanov's Theorem
Cramér's Theorem studies the distribution μ_n of the empirical average (1/n)∑_{i=1}^n X_i, and is
sometimes called a level-1 large deviation. We can go one level higher to study an object
which contains more information than the empirical average: the empirical measure
\[
L_n(dx) := \frac{1}{n}\sum_{i=1}^n \delta_{X_i}(dx). \tag{2.14}
\]
Note that L_n is a probability measure with n atoms of size 1/n each, located respectively
at X_1, . . . , X_n. In particular, L_n is a random probability measure. The empirical mean can
be recovered from L_n by noting that (1/n)∑_{i=1}^n X_i = ∫ x L_n(dx).
Instead of studying the distribution μ_n of the empirical average, we can study the distribution
ν_n of L_n. More precisely, L_n is a random variable taking its value in the space of probability
measures on R, which we denote by M_1, and hence the distribution ν_n of L_n is a probability
measure on M_1. Equipped with the topology of weak convergence (the weak topology), M_1 can
be shown to be a Polish space. Note also that μ_n is the measure induced by ν_n on R via the map
\[
\rho \in M_1 \ \mapsto\ \int x\, \rho(dx) \in \mathbb{R}.
\]
First we claim that as a sequence of probability measures on M1 , νn asymptotically
concentrates all its mass at the point µ1 ∈ M1 , where µ1 is the distribution of X1 . In other
words, Ln converges to µ1 in probability. We will in fact show that Ln ⇒ µ1 almost surely.
Indeed, for each x ∈ R, by the law of large numbers,
\[
L_n\big((-\infty, x]\big) = \frac{1}{n}\sum_{i=1}^n 1_{\{X_i \le x\}} \ \xrightarrow[n\to\infty]{}\ P(X_1 \le x) \qquad \text{almost surely.}
\]
The almost sure convergence can be applied simultaneously for a countable collection of x,
say x ∈ Q. Therefore, almost surely, L_n((−∞, x]) → μ_1((−∞, x]) for all x ∈ Q; since Q is dense,
this gives convergence at every continuity point of the distribution function of μ_1, which implies
that L_n ⇒ μ_1, as claimed.
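A minimal simulation sketch of this convergence (added for illustration, with X_i taken uniform on [0, 1], an arbitrary choice): compare L_n((−∞, x]) with μ_1((−∞, x]) = x at a few points x as n grows.

```python
# Illustration: the empirical measure L_n converges weakly to mu_1 (here Uniform[0, 1]),
# monitored through the empirical distribution function L_n((-inf, x]).
import random

rng = random.Random(1)
xs = [0.25, 0.5, 0.75]                       # evaluation points
for n in (100, 10_000, 1_000_000):
    sample = [rng.random() for _ in range(n)]
    ecdf = [sum(1 for s in sample if s <= x) / n for x in xs]
    print(n, [f"{v:.4f}" for v in ecdf], "target:", xs)
```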
The fact that νn asymptotically concentrates all its mass at the point µ1 ∈ M1 is analogous
to µn , a sequence of probability measures on R, asymptotically concentrating all its mass at
E[X1 ] ∈ R. The main difference is that (νn )n∈N are probability measures on the Polish space
M1 , while (µn )n∈N are probability measures on the Polish space R. By this analogy, we can
ask: Is there a rate function I : M_1 → [0, ∞] such that the probability that L_n is in a small
neighborhood of ρ ∈ M_1 decays asymptotically like e^{−nI(ρ)}? The answer is affirmative and
is given by Sanov's Theorem.
Theorem 2.1 [Sanov's Theorem] Let (X_i)_{i∈N} be i.i.d. real-valued random variables with
common distribution μ_1. Let L_n := (1/n)∑_{i=1}^n δ_{X_i}. Then for any measurable E ⊂ M_1,
\[
\limsup_{n\to\infty} \frac{1}{n}\log P(L_n \in E) \le - \inf_{\rho\in\bar E} I(\rho), \tag{2.15}
\]
\[
\liminf_{n\to\infty} \frac{1}{n}\log P(L_n \in E) \ge - \inf_{\rho\in E^o} I(\rho), \tag{2.16}
\]
where the rate function
\[
I(\rho) = \begin{cases} \displaystyle\int f(x)\log f(x)\, \mu_1(dx) & \text{if } \rho(dx) = f(x)\mu_1(dx) \text{ for some density } f, \\[4pt] \infty & \text{otherwise} \end{cases} \tag{2.17}
\]
is convex, non-negative, and zero only at μ_1.
Remark. Often I(ρ) in (2.17) is also denoted by H(ρ|µ1 ), and is called the relative entropy
of ρ w.r.t. μ_1. Note that the non-negativity of I follows from the convexity of x ↦ x log x
and Jensen's inequality:
\[
\int f(x)\log f(x)\, \mu_1(dx) \ \ge\ \Big(\int f(x)\, \mu_1(dx)\Big) \log\Big(\int f(x)\, \mu_1(dx)\Big) = 1\cdot\log 1 = 0.
\]
The convexity of I also follows from similar reasoning. Relative entropy is a key notion in
ergodic theory, information theory, and statistical physics.
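On a finite alphabet, Sanov's Theorem can be checked directly by the method of types: the probability that L_n equals a given type ρ (with integer counts nρ(a)) is an explicit multinomial probability, and −(1/n) log of it converges to H(ρ|μ_1). A small Python sketch (added for illustration; the alphabet, μ_1 and ρ are arbitrary choices):

```python
# Sanov's Theorem on a finite alphabet via the method of types.
# P(L_n = rho) = n! / prod_a (n rho(a))!  *  prod_a mu1(a)^{n rho(a)},
# and -(1/n) log P(L_n = rho) -> H(rho | mu1) = sum_a rho(a) log(rho(a) / mu1(a)).
import math

mu1 = [0.5, 0.3, 0.2]          # distribution of X_1 on the alphabet {0, 1, 2}
rho = [0.2, 0.3, 0.5]          # target empirical measure (the "rare" type)

def relative_entropy(rho, mu):
    return sum(r * math.log(r / m) for r, m in zip(rho, mu) if r > 0)

def log_prob_type(n, rho, mu):
    counts = [round(n * r) for r in rho]                  # assumes n * rho(a) are integers
    log_multinom = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_multinom + sum(c * math.log(m) for c, m in zip(counts, mu))

H = relative_entropy(rho, mu1)
for n in (10, 100, 1000, 10_000):
    print(f"n={n:6d}  -log P(L_n = rho)/n = {-log_prob_type(n, rho, mu1)/n:.4f}  H(rho|mu1) = {H:.4f}")
```

The printed values decrease towards H(ρ|μ_1) as n grows; the gap is the polynomial (in n) correction coming from the multinomial coefficient, which vanishes on the exponential scale.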
Sketch proof of Theorem 2.1. We sketch the ideas of the proof and refer to [1, 2] for the details.
The fundamental ideas are the same as in the proof of Cramér's Theorem. Let us restrict our
attention to E = B_δ(ρ), a ball of small radius δ centered around any ρ ∈ M_1 (recall that the
weak topology on M_1 can be metrized). For the large deviation upper bound, we apply the
exponential Markov inequality as follows. For any bounded continuous f : R → R, denoting
⟨f, L_n⟩ := ∫ f(x) L_n(dx), we have
\[
P(L_n \in B_\delta(\rho)) \le P\Big( e^{n\langle f, L_n\rangle} \ge e^{n \inf_{\xi\in B_\delta(\rho)} \langle f,\xi\rangle} \Big) \le e^{-n \inf_{\xi\in B_\delta(\rho)} \langle f,\xi\rangle}\, E\big[e^{n\langle f, L_n\rangle}\big] \le e^{-n\langle f,\rho\rangle + n C_f(\delta) + n\phi(f)},
\]
where
\[
C_f(\delta) := \sup_{\xi\in B_\delta(\rho)} \big|\langle f,\xi\rangle - \langle f,\rho\rangle\big|
\]
is comparable to the modulus of continuity of the map ξ ∈ M_1 ↦ ⟨f, ξ⟩ ∈ R, and
\[
\phi(f) = \log E[e^{f(X_1)}]. \tag{2.18}
\]
Optimizing over f then leads to the upper bound
\[
\limsup_{n\to\infty} \frac{1}{n}\log P(L_n \in B_\delta(\rho)) \le - \sup_f \big( \langle f,\rho\rangle - C_f(\delta) - \phi(f) \big). \tag{2.19}
\]
Since for each f, C_f(δ) → 0 as δ → 0, we obtain
\[
\lim_{\delta\downarrow 0}\, \limsup_{n\to\infty} \frac{1}{n}\log P(L_n \in B_\delta(\rho)) \le -I(\rho), \tag{2.20}
\]
where the rate function
\[
I(\rho) = \sup_f \big( \langle f,\rho\rangle - \phi(f) \big) \tag{2.21}
\]
has the same form as the rate function (1.5) in Cramér's Theorem. The bound (2.20) shows
that the probability that L_n lies in a small neighborhood of ρ decays at least as fast as e^{−nI(ρ)}.
Deriving the large deviation upper bound (2.15) from (2.19) requires some more work: it is
necessary to first replace E by its intersection with a compact set, so that this set can be
covered by a finite number of balls of radius δ.
To prove the large deviation lower bound (2.16) with E replaced by B_δ(ρ), we apply the
same change of measure argument as in the proof of Cramér's Theorem. We only need to
consider ρ(dx) = g(x)μ_1(dx) for some density g; otherwise the desired lower bound is −∞,
which is trivial. As a starting point, let us consider g = e^{f*} for some bounded continuous f*.
Let (X̂_i)_{i∈N} be i.i.d. random variables with distribution ρ, and let L̂_n := (1/n)∑_{i=1}^n δ_{X̂_i}. Then
we can write
\[
P(L_n \in B_\delta(\rho)) = E\Big[ \prod_{i=1}^n e^{f^*(X_i)} \cdot e^{-n\langle f^*, L_n\rangle}\, 1_{\{L_n \in B_\delta(\rho)\}} \Big] = E\Big[ e^{-n\langle f^*, \hat L_n\rangle}\, 1_{\{\hat L_n \in B_\delta(\rho)\}} \Big].
\]
Since L̂_n converges almost surely to ρ, we obtain
\[
\liminf_{n\to\infty} \frac{1}{n}\log P(L_n \in B_\delta(\rho)) \ \ge\ \lim_{\delta'\downarrow 0}\, \liminf_{n\to\infty} \frac{1}{n}\log P(L_n \in B_{\delta'}(\rho)) = -\langle f^*, \rho\rangle. \tag{2.22}
\]
Note that ⟨f*, ρ⟩ is exactly the rate function I(ρ) defined in (2.17): indeed, ∫ g(x) log g(x) μ_1(dx) = ∫ f*(x) ρ(dx) = ⟨f*, ρ⟩. One can then extend this
bound to f* which is not assumed to be bounded and continuous.
Lastly, it can be shown that I(ρ) as defined in (2.21) coincides with ⟨f*, ρ⟩ in (2.22), and
hence equals the rate function defined in (2.17).
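As a sanity check (added here, not part of the original notes), on a finite alphabet the variational formula (2.21) can be verified numerically: the supremum of ⟨f, ρ⟩ − log E[e^{f(X_1)}] over f is attained at f = log(ρ/μ_1) and equals H(ρ|μ_1). The distributions below are arbitrary choices.

```python
# Numerical check of the variational formula (2.21) on a finite alphabet:
# sup_f ( <f, rho> - log E[e^{f(X_1)}] ) = H(rho | mu1), attained at f = log(rho / mu1).
import math
import random

mu1 = [0.5, 0.3, 0.2]
rho = [0.2, 0.3, 0.5]

def objective(f):
    # <f, rho> - phi(f), where phi(f) = log E[e^{f(X_1)}] under mu1
    return sum(r * fi for r, fi in zip(rho, f)) - math.log(sum(m * math.exp(fi) for m, fi in zip(mu1, f)))

f_star = [math.log(r / m) for r, m in zip(rho, mu1)]
H = sum(r * math.log(r / m) for r, m in zip(rho, mu1))

rng = random.Random(0)
best_perturbed = max(objective([fi + rng.uniform(-1, 1) for fi in f_star]) for _ in range(10_000))
print(f"objective(f*) = {objective(f_star):.4f}   H(rho|mu1) = {H:.4f}   best perturbed f: {best_perturbed:.4f}")
```

Since ⟨f, ρ⟩ is linear and log E[e^{f(X_1)}] is convex in f, the objective is concave, so random perturbations of f* can only decrease it, which the printed values confirm.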
3  Further Extensions
Sanov’s Theorem goes one level higher than Cramér’s Theorem by considering the empirical
measure Ln , which contains more information on the sequence (Xi )1≤i≤n than the empirical
average. However, Ln still does not capture all information about (Xi )1≤i≤n , because it does
not keep track of the order in (Xi )1≤i≤n . We can go one level even higher, by considering the
pair empirical measure
\[
L_n^2 := \frac{1}{n}\sum_{i=1}^n \delta_{(X_i, X_{i+1})},
\]
which is a probability measure on R² consisting of n atoms of equal size, with one atom at
(X_i, X_{i+1}) ∈ R² for each 1 ≤ i ≤ n, and we define X_{n+1} := X_1 to create a periodic boundary
condition. We can then analyze large deviations for the distribution ν_n^2 of the sequence of
random measures L_n^2. Note that (ν_n^2)_{n∈N} are probability measures on M_1(R²), the space of
probability measures on R². The full level-3 large deviation considers the empirical process
\[
L_n^\infty := \frac{1}{n}\sum_{i=1}^n \delta_{(X_{i+1}, X_{i+2}, \cdots)},
\]
where (X_i)_{1≤i≤n} is extended periodically to all i ∈ N by setting X_{i+n} := X_i. Note that L_n^∞ is
a random measure on R^N. Large deviation analysis of the distribution of (L_n^∞)_{n∈N} requires
proving upper and lower bounds analogous to (2.15) and (2.16).
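For a sample taking finitely many values, L_n^2 is simply a normalized matrix of pair counts with periodic wrap-around. A short illustrative Python sketch (added here; the sample is an arbitrary choice):

```python
# Pair empirical measure L_n^2 for a finite-valued sample, with periodic boundary X_{n+1} := X_1.
from collections import Counter

x = [0, 1, 1, 0, 2, 1, 0, 0]                                 # an arbitrary sample (X_1, ..., X_n)
n = len(x)
pairs = Counter((x[i], x[(i + 1) % n]) for i in range(n))    # atoms at (X_i, X_{i+1})
L2 = {pair: c / n for pair, c in pairs.items()}              # each atom carries mass 1/n

print(L2)
```

Thanks to the periodic boundary condition, both marginals of L_n^2 coincide with the (level-2) empirical measure L_n.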
A major advancement in large deviations is the theory developed by Donsker and Varadhan
for (X_i)_{i∈N} forming a Markov chain. The importance of Markov chains in applications has also
made the associated large deviation theory an important tool in the study of many problems,
from finance to protein folding. For more on large deviations, see [1] for an introduction,
and [2] for a more comprehensive treatment.
References
[1] F. den Hollander. Large deviations. Fields Institute Monographs, 14. American Mathematical Society, Providence, RI, 2000.
[2] A. Dembo and O. Zeitouni. Large deviations techniques and applications. Stochastic Modelling and Applied Probability, 38. Springer-Verlag, Berlin, 2010.