PROBABILITY THEORY 2 LECTURE NOTES
These lecture notes were written for MATH 6720 at Cornell University in the Spring semester of 2014. They were last revised in the Spring of 2016 and the schedule on the following page reflects that semester.
These notes are for personal educational use only and are not to
be published or redistributed.
Much of the material on martingales, Markov chains, and ergodic theory comes directly from the course text, Probability: Theory and Examples by Rick Durrett. Some of the discussion of Markov chains is also drawn from the book Markov Chains and Mixing Times by David Levin, Yuval Peres, and Elizabeth Wilmer. The sections on Brownian motion are based mainly on Brownian Motion by Peter Mörters and Yuval Peres. There are likely typos and mistakes in these notes. All such errors are the author's fault and corrections are greatly appreciated.
Day 1: Through Example 1.2
Day 2: Through Proposition 1.5
Day 3: Through definition of r.c.d.s
Day 4: Through Example 2.4
Day 5: Through Theorem 2.5
Day 6: Through Lemma 2.1
Day 7: Finished Section 2
Day 8: Through Theorem 3.3 (Skipped Polya's Urn)
Day 9: Through Theorem 4.3
Day 10: Through first observations about uniform integrability
Day 11: Through Theorem 4.9
Day 12: Through Theorem 5.2
Day 13: Finished Section 5
Day 14: Up to Example 6.1
Day 15: Through Example 6.6
Day 16: Finished Section 7
Day 17: Through Theorem 8.3 (Lionel)
Day 18: Through Example 8.2 (Dan)
Day 19: Through Example 9.2
Day 20: Through Theorem 9.1
Day 21: Through Theorem 9.5
Day 22: Through Theorem 10.2
Day 23: Up to Theorem 10.3
Day 24: Student Presentations
Day 25: Through cut-off phenomenon
Day 26: Through Example 11.1
Day 27: Finished Section 11
(Skipped Sections 12 and 13)
Day 28: Through Theorem 14.1
Day 29: Through B(d) = Σ_{i=0}^∞ Fi(d) for d ∈ D
Day 30: Through Theorem 15.1
Day 31: Finished Section 15
Day 32: Through Theorem 16.7
Day 33: Through Theorem 16.9
Day 34: Started Theorem 16.13
Day 35: Through Theorem 17.1
Day 36: Through Proposition 17.1 (and some discussion of remaining material)
Days 37-42: Student Presentations
1. Conditional Expectation
Let (Ω, F, P) be a probability space and suppose that A, B ∈ F with P(B) > 0. In undergraduate probability, the probability of A conditional on B is defined as
P(A|B) = P(A ∩ B)/P(B).
The idea is that if we learn that B has occurred, then the probability space must be updated to account for this new information. In particular, the sample space becomes B, the σ-algebra now includes only those events contained in B, FB = {E ∩ B : E ∈ F}, and the probability measure restricts to FB and is normalized to account for this change. The formula for conditional probability is thus a description of how the probability function changes when additional information dictates that the sample space should shrink. (P(·|B) also defines a probability on (Ω, F), and it is often more convenient to adopt this perspective.)
When thinking about conditional probability, it can be more instructive to take a step back and think of a second observer with access to partial information. Here we interpret (Ω, F, P) as describing a random system whose chance of being in state ω ∈ Ω is governed by P. F represents the possible conclusions that can be drawn about the state of the system: All that can be said is whether it lies in B for each B ∈ F. Now suppose that the observer has performed a measurement that tells her if B holds for some B ∈ F with P(B) ∈ (0, 1). If she found out that B is true, her assessment of the probability of A ∈ F would be P(A|B). If she found that B is false, she would evaluate the probability of A as P(A|B^C) for each A ∈ F. Thus, from our point of view, her description of the probability of A is given by the random variable
X_A(ω) = P(A|B) if ω ∈ B,  X_A(ω) = P(A|B^C) if ω ∉ B.
This is ultimately the kind of idea we are trying to capture with conditional expectation.
The typical development in elementary treatments of probability is to apply the definition of P(A|B) to the events {X = x} and {Y = y} for discrete random variables X, Y in order to define the conditional mass function of X given that Y = y as
pX(x|Y = y) = pX,Y(x, y)/pY(y).
One then extrapolates to absolutely continuous X and Y by replacing mass functions with densities (which is problematic in that it treats pdfs as probabilities and raises issues concerning conditioning on null events). Finally, conditional expectation is defined in terms of integrating against the conditional pmfs/pdfs.
In what follows, we will need a more sophisticated theory of conditioning that avoids some of the pitfalls,
paradoxes, and limitations of the framework sketched out above. Rather than try to arrive at the proper
definition by way of more familiar concepts, we will begin with a formal definition and then work through a
variety of examples and related results in order to provide motivation, build intuition, and make connections
with ideas from elementary probability.
Definition. Let (Ω, F, P) be a probability space, X : (Ω, F) → (ℝ, B) a random variable with E|X| < ∞, and G ⊆ F a sub-σ-algebra. We define E[X|G], the conditional expectation of X given G, to be any random variable Y satisfying
(i) Y ∈ G (i.e. Y is measurable with respect to G),
(ii) ∫_A Y dP = ∫_A X dP for all A ∈ G.
If Y satisfies (i) and (ii), we say that Y is a version of E[X|G].
Our most immediate order of business is to show that this definition makes good mathematical sense by proving existence and uniqueness theorems. To streamline this task, we first take a moment to establish integrability for random variables which fit the definition so that we may manipulate various quantities of interest with impunity.
Lemma 1.1. If Y satisfies conditions (i) and (ii) in the definition of E[X|G], then it is integrable.
Proof. Letting A = {Y ≥ 0} ∈ G, condition (ii) implies
∫_A Y dP = ∫_A X dP ≤ ∫_A |X| dP,
∫_{A^C} (−Y) dP = −∫_{A^C} Y dP = −∫_{A^C} X dP = ∫_{A^C} (−X) dP ≤ ∫_{A^C} |X| dP.
It follows that
E|Y| = ∫_A Y dP + ∫_{A^C} (−Y) dP ≤ ∫_A |X| dP + ∫_{A^C} |X| dP = E|X| < ∞.
Our next result makes use of a famous theorem from analysis whose proof can be found in any text on
measure theory.
Theorem 1.1 (Radon-Nikodym). If µ and ν are σ-finite measures on (S, S) with ν ≪ µ, then there is a measurable function f : S → ℝ such that ν(A) = ∫_A f dµ for all A ∈ S. f is called the Radon-Nikodym derivative of ν with respect to µ, written f = dν/dµ.
The following existence proof gives an interpretation of conditional expectation in terms of Radon-Nikodym
derivatives.
Theorem 1.2. Let (Ω, F, P) be a probability space, X : (Ω, F) → (ℝ, B) a random variable with E|X| < ∞, and G ⊆ F a sub-σ-algebra. There exists a random variable Y satisfying
(i) Y ∈ G,
(ii) ∫_A Y dP = ∫_A X dP for all A ∈ G.
Proof. First suppose that X ≥ 0. Define ν(A) = ∫_A X dP for A ∈ G. Then ν and P|G are finite measures on (Ω, G). (That ν is countably additive is an easy application of the DCT.) Moreover, ν is clearly absolutely continuous with respect to P. The Radon-Nikodym theorem therefore implies that there is a G-measurable function dν/dP such that
∫_A X dP = ν(A) = ∫_A (dν/dP) dP.
It follows that dν/dP ∈ G and Y = dν/dP is a version of E[X|G].
For general X, write X = X⁺ − X⁻ and let Y1 = E[X⁺|G], Y2 = E[X⁻|G]. Then Y = Y1 − Y2 is integrable and G-measurable, so for all A ∈ G,
∫_A Y dP = ∫_A Y1 dP − ∫_A Y2 dP = ∫_A X⁺ dP − ∫_A X⁻ dP = ∫_A X dP.
* The proof of Theorem 1.2 is really a one-liner since the Radon-Nikodym theorem holds for signed measures and ρ(A) = ∫_A g dµ defines a signed measure for µ-integrable g.
To conclude our discussion of the definition of conditional expectation, we record
Theorem 1.3. E[X|G] is unique up to null sets.
Proof. Suppose that Y and Y′ are both versions of E[X|G]. Condition (ii) implies that
∫_A Y dP = ∫_A X dP = ∫_A Y′ dP
for all A ∈ G. By condition (i), the event Aε = {Y − Y′ ≥ ε} is in G for all ε > 0, hence
0 = ∫_{Aε} Y dP − ∫_{Aε} Y′ dP = ∫_{Aε} (Y − Y′) dP ≥ εP(Y − Y′ ≥ ε).
It follows that Y ≤ Y′ a.s. Interchanging the roles of Y and Y′ in the preceding argument shows that Y′ ≤ Y a.s. as well, and the proof is complete.
The following observation helps elucidate the sense in which uniqueness holds.
Proposition 1.1. If Y is a version of E[X|G] and Y′ ∈ G with Y = Y′ a.s., then Y′ is also a version of E[X|G].
Proof. Since Y and Y′ are G-measurable, E = {ω : Y(ω) ≠ Y′(ω)} ∈ G. Since P(E) = 0, we see that for any B ∈ G,
∫_B X dP = ∫_B Y dP = ∫_{B\E} Y dP + ∫_{B∩E} Y dP = ∫_{B\E} Y dP
= ∫_{B\E} Y′ dP = ∫_{B\E} Y′ dP + ∫_{B∩E} Y′ dP = ∫_B Y′ dP.
Lemma 1.1, Theorem 1.3, and Proposition 1.1 combine to tell us that conditional expectation is unique as an element of L¹(Ω, G, P). Just as elements of Lp spaces are really equivalence classes of functions (rather than specific functions) in classical analysis, conditional expectations are equivalence classes of random variables. Here versions play the role of specific functions. Often we will omit the almost sure qualification when speaking of relations between conditional expectations, but it is important to keep this issue in mind.
In light of Proposition 1.1, we can often work with convenient versions of E[X|G] when we need to make use of pointwise results.
Examples.
Intuitively, sub-σ-algebras represent (potentially available) information - for each A ∈ G we can ask whether or not A has occurred. From this perspective, we can think of E[X|G] as giving the best guess for the value of X given the information in G. The following examples are intended to clarify this view.
Example 1.1. If X ∈ G, then our heuristic suggests that E[X|G] = X since if we know X, then our best guess is X itself. This clearly satisfies the definition as X always satisfies condition (ii) and condition (i) is met by assumption. Since constants are measurable with respect to any σ-algebra, taking X = c shows that E[c|G] = c.
Example 1.2. At the other extreme, suppose that X is independent of G - that is, for all A ∈ G, B ∈ B, A and {X ∈ B} are independent events. In this case, G tells us nothing about X, so our best guess is E[X]. As a constant, E[X] automatically satisfies condition (i). To see that (ii) holds as well, note that for any A ∈ G,
∫_A E[X] dP = E[X]P(A) = E[X]E[1_A] = E[X 1_A] = ∫_A X dP
by independence. In particular, ordinary expectation corresponds to conditional expectation w.r.t. G = {Ω, ∅}.
Example 1.3. We now expand upon our introductory example: Suppose that Ω1, Ω2, ... is a countable partition of Ω into disjoint measurable sets, each having positive probability (e.g. B and B^C). Let G = σ(Ω1, Ω2, ...). We claim that E[X|G] = P(Ωi)⁻¹ E[X; Ωi] on Ωi. The interpretation is that G tells us which Ωi contains the outcome, and given that information, our best guess for X is its average over Ωi.
To verify our claim, note that
E[X|G](ω) = Σ_i (E[X; Ωi]/P(Ωi)) 1_{Ωi}(ω)
is G-measurable since each Ωi belongs to G. Also, since each A ∈ G is a countable disjoint union of the Ωi's, it suffices to check condition (ii) on the elements of the partition. But this is trivial as
∫_{Ωi} P(Ωi)⁻¹ E[X; Ωi] dP = E[X; Ωi] = ∫_{Ωi} X dP.
If we make the obvious definition P(A|H) = E[1_A|H], then the above says that
P(A|G) = P(Ωi)⁻¹ ∫_{Ωi} 1_A dP = P(A ∩ Ωi)/P(Ωi) on Ωi.
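To see the partition formula in action, here is a minimal numerical sketch (the uniform twelve-point sample space and three-block partition are invented purely for illustration): E[X|G] is computed blockwise as the block average, and condition (ii) can be checked directly on each block.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 12
X = rng.normal(size=N)              # a random variable on Ω = {0, ..., N-1}, P uniform
blocks = [range(0, 4), range(4, 6), range(6, 12)]  # a partition Ω_1, Ω_2, Ω_3

# E[X|G] = E[X; Ω_i]/P(Ω_i) on Ω_i, i.e. the average of X over the block containing ω
cond_exp = np.empty(N)
for block in blocks:
    idx = list(block)
    cond_exp[idx] = X[idx].mean()

# Condition (ii) on each block: the integrals of E[X|G] and X over Ω_i agree
for block in blocks:
    idx = list(block)
    assert np.isclose(cond_exp[idx].sum() / N, X[idx].sum() / N)
```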
Example 1.4. Conditioning on a random variable can be seen as a special case of our definition by taking E[X|Y] = E[X|σ(Y)]. To see how this compares with the definition given in undergraduate probability, suppose that X and Y are discrete with joint pmf pX,Y and marginals pX, pY. Then σ(Y) is generated by the countable partition {Y = y}_{y ∈ Range(Y)}, so the previous example shows that if E|X| < ∞, then on {Y = y},
E[X|Y] = P(Y = y)⁻¹ E[X; {Y = y}] = (1/P(Y = y)) Σ_x x P(X = x, Y = y) = Σ_x x pX,Y(x, y)/pY(y).
Example 1.5. Similarly, suppose that X and Y are jointly absolutely continuous with joint density fX,Y and marginals fX, fY. Suppose for simplicity that fY(y) > 0 for all y ∈ ℝ. In this case, if E|g(X)| < ∞, then E[g(X)|Y] = h(Y) where
h(y) = ∫ g(x) (fX,Y(x, y)/fY(y)) dx.
The Doob-Dynkin lemma shows that E[g(X)|Y] ∈ σ(Y). To see that the second criterion is satisfied as well, recall that every A ∈ σ(Y) is of the form A = {Y ∈ B} for some B ∈ B. The change of variables formula shows that
∫_{Y∈B} h(Y) dP = ∫_B h(y) fY(y) dy = ∫ 1_B(y) (∫ g(x) (fX,Y(x, y)/fY(y)) dx) fY(y) dy
= ∫∫ g(x) 1_B(y) fX,Y(x, y) dx dy = E[g(X) 1_B(Y)] = ∫_{Y∈B} g(X) dP.
Note that the condition fY > 0 is actually unnecessary since the above proof only needs h to satisfy
h(y) fY(y) = ∫ g(x) fX,Y(x, y) dx.
(Since fY(y) = ∫ fX,Y(x, y) dx and fX,Y ≥ 0, the right-hand side of the above equation will also be 0 at such y.) Thus h can take on any value at those y with fY(y) = 0.
Example 1.6. Suppose that X and Y are independent and ϕ satisfies E|ϕ(X, Y)| < ∞. Then E[ϕ(X, Y)|X] = g(X) where g(x) = E[ϕ(x, Y)]. As in the previous example, condition (i) is satisfied by Doob-Dynkin, and condition (ii) can be verified by letting µ and ν denote the distributions of X and Y, respectively, and computing
∫_{X∈B} g(X) dP = ∫_B g(x) dµ(x) = ∫ 1_B(x) (∫ ϕ(x, y) dν(y)) dµ(x)
= ∫∫ 1_B(x) ϕ(x, y) d(µ × ν)(x, y) = ∫ 1_B(X) ϕ(X, Y) dP = ∫_{X∈B} ϕ(X, Y) dP.
Properties.
Many of the properties of ordinary expectation carry over to conditional expectation as they are ultimately facts about integrals:
Proposition 1.2 (Linearity). E[aX + Y|G] = aE[X|G] + E[Y|G].
Proof. Sums and constant multiples of G-measurable functions are G-measurable, and for A ∈ G,
∫_A (aE[X|G] + E[Y|G]) dP = a∫_A E[X|G] dP + ∫_A E[Y|G] dP = a∫_A X dP + ∫_A Y dP = ∫_A (aX + Y) dP.
Proposition 1.3 (Monotonicity). If X ≤ Y, then E[X|G] ≤ E[Y|G] a.s.
Proof. By assumption, we have
∫_A E[X|G] dP = ∫_A X dP ≤ ∫_A Y dP = ∫_A E[Y|G] dP
for all A ∈ G. For all ε > 0, Aε = {ω : E[X|G] − E[Y|G] ≥ ε} ∈ G, so
εP(Aε) ≤ ∫_{Aε} (E[X|G] − E[Y|G]) dP = ∫_{Aε} E[X|G] dP − ∫_{Aε} E[Y|G] dP ≤ 0.
It follows that E[X|G] ≤ E[Y|G] a.s.
Proposition 1.4 (Monotone Convergence Theorem). If Xn ≥ 0 and Xn ↗ X, then E[Xn|G] ↗ E[X|G].
Proof. By monotonicity, 0 ≤ E[Xn|G] ≤ E[Xn+1|G] ≤ E[X|G] for all n. (The inequalities are almost sure, but we can work with versions of the conditional expectations where they hold pointwise.) Since bounded nondecreasing sequences of reals converge to their limit superior, there is a random variable Y with E[Xn|G] ↗ Y. Moreover, Y ∈ G as it is the limit of G-measurable functions. Finally, applying the ordinary MCT to E[Xn|G]1_B ↗ Y 1_B, invoking the definition of conditional expectation, and then applying the MCT to Xn 1_B ↗ X 1_B shows that
∫_B Y dP = lim_{n→∞} ∫_B E[Xn|G] dP = lim_{n→∞} ∫_B Xn dP = ∫_B X dP
for all B ∈ G, hence Y is a version of E[X|G].
Note that since we have established a conditional MCT, conditional versions of Fatou and dominated convergence follow from the usual arguments.
The final analogue we will consider is a conditional form of Jensen's inequality. It is fairly straightforward to derive conditional variants of other familiar theorems using these examples as templates.
Proposition 1.5 (Jensen). If ϕ is convex and E|X|, E|ϕ(X)| < ∞, then
ϕ(E[X|G]) ≤ E[ϕ(X)|G].
Proof. When we proved the original Jensen inequality, we established that if ϕ is convex, then for every c ∈ ℝ, there is a linear function l_c(x) = a_c x + b_c such that l_c(c) = ϕ(c) and l_c(x) ≤ ϕ(x) for all x ∈ ℝ. Let S = {(a_r, b_r)}_{r∈ℚ}. Then S is countable with ax + b ≤ ϕ(x) for all x ∈ ℝ, (a, b) ∈ S. Moreover, since ℚ is dense in ℝ and convex functions are continuous, we have
ϕ(x) = sup_{(a,b)∈S} (ax + b) for all x ∈ ℝ.
Monotonicity and linearity imply that
E[ϕ(X)|G] ≥ E[aX + b|G] = aE[X|G] + b a.s.
whenever (a, b) ∈ S. As S is countable, the event A = {E[ϕ(X)|G] ≥ aE[X|G] + b for all (a, b) ∈ S} has full probability. Thus with probability one, we have
E[ϕ(X)|G] ≥ sup_{(a,b)∈S} (aE[X|G] + b) = ϕ(E[X|G]).
One use for conditional expectation is as an intermediary for computing ordinary expectations. This is justified by the law of total expectation:
Proposition 1.6. E[E[X|G]] = E[X].
Proof. Taking A = Ω in the definition of E[X|G] yields
E[X] = ∫_Ω X dP = ∫_Ω E[X|G] dP = E[E[X|G]].
As an example of the utility of the preceding observation, we prove
Proposition 1.7. Conditional expectation is a contraction in Lp, p ≥ 1.
Proof. Since ϕ(x) = |x|^p is convex, Proposition 1.5 implies that |E[X|G]|^p ≤ E[|X|^p |G]. Taking expectations and appealing to Proposition 1.6 gives
E[|E[X|G]|^p] ≤ E[E[|X|^p |G]] = E[|X|^p].
Proposition 1.6 is actually a special case of the tower property of conditional expectation. This result is one of the more useful theorems about conditional expectation and is often summarized as "the smaller σ-algebra always wins."
Theorem 1.4. If G1 ⊆ G2, then
E[E[X|G1]|G2] = E[E[X|G2]|G1] = E[X|G1].
Proof. Since E[X|G1] ∈ G1 ⊆ G2, Example 1.1 shows that E[E[X|G1]|G2] = E[X|G1]. To see that E[E[X|G2]|G1] = E[X|G1], we observe that E[X|G1] ∈ G1 and
∫_A E[X|G1] dP = ∫_A X dP = ∫_A E[X|G2] dP
for any A ∈ G1 ⊆ G2.
* Proposition 1.6 is the case G1 = {Ω, ∅}, G2 = G.
The second criterion in our definition of conditional expectation can be expressed in more probabilistic language as E[Y 1_A] = E[X 1_A] for all A ∈ G. One sometimes sees the alternative criterion E[YZ] = E[XZ] for all bounded Z ∈ G. The equivalence of the two conditions follows from the usual four-step procedure for building general integrals from integrals of indicators. We will stick with our original definition as it is easier to check.
The following theorem (which Durrett describes as saying that for conditional expectation with respect to G, random variables Y ∈ G "are like constants [in that] they can be brought outside the integral") generalizes this alternative definition.
Theorem 1.5. If W ∈ G and E|X|, E|WX| < ∞, then E[WX|G] = W E[X|G].
Proof. W E[X|G] ∈ G by assumption, so we need only check the second criterion.
We first suppose that W = 1_B for some B ∈ G. Then for all A ∈ G,
∫_A W E[X|G] dP = ∫_A 1_B E[X|G] dP = ∫_{A∩B} E[X|G] dP = ∫_{A∩B} X dP = ∫_A 1_B X dP = ∫_A WX dP.
By linearity, we see that the condition ∫_A W E[X|G] dP = ∫_A WX dP also holds when W is a simple function. Now if W, X ≥ 0, we can take a sequence of simple functions Wn ↗ W and use the MCT to conclude that
∫_A W E[X|G] dP = lim_{n→∞} ∫_A Wn E[X|G] dP = lim_{n→∞} ∫_A Wn X dP = ∫_A WX dP.
The general result follows by splitting W and X into positive and negative parts.
Our final major result about conditional expectation gives a geometric interpretation in the case of square integrable X. Namely, noting that L²(F) = {Y ∈ F : E[Y²] < ∞} is a Hilbert space and L²(G) is a closed subspace of L²(F), we will show that if X ∈ L²(F), then E[X|G] is the orthogonal projection of X onto L²(G).
Theorem 1.6. If E[X²] < ∞, then E[X|G] minimizes the mean square error E[(X − Y)²] amongst all Y ∈ G.
Proof. To begin, we note that if Z ∈ L²(G), then E|ZX| < ∞ by the Cauchy-Schwarz inequality, so Theorem 1.5 implies ZE[X|G] = E[ZX|G]. Taking expected values gives
E[ZE[X|G]] = E[E[ZX|G]] = E[ZX],
showing that
E[Z(X − E[X|G])] = E[ZX] − E[ZE[X|G]] = 0
for Z ∈ L²(G).
Thus for any Y ∈ L²(G), if we set Z = E[X|G] − Y, then we have
E[(X − Y)²] = E[((X − E[X|G]) + Z)²]
= E[(X − E[X|G])²] + 2E[Z(X − E[X|G])] + E[Z²]
= E[(X − E[X|G])²] + E[Z²].
(Proposition 1.7 shows that E[X|G] ∈ L²(G), so Z = E[X|G] − Y ∈ L²(G) as well.) It follows that E[(X − Y)²] is minimized over L²(G) when E[X|G] − Y = Z = 0.
To see that E[X|G] minimizes the MSE over L⁰(G), we make use of the inequality
(a + b)² ≤ (a + b)² + (a − b)² = 2a² + 2b².
If Y ∈ G is such that E[(X − Y)²] = ∞, then it certainly doesn't minimize the MSE since E[X|G] ∈ L²(G) with
E[(X − E[X|G])²] ≤ 2E[X²] + 2E[E[X|G]²] < ∞,
and if E[(X − Y)²] < ∞, then
E[Y²] = E[((Y − X) + X)²] ≤ 2E[(X − Y)²] + 2E[X²] < ∞.
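The projection property can also be observed numerically. In the sketch below (the joint law of X and Y is an arbitrary illustrative choice), the version of E[X|Y] built from within-group averages achieves a smaller empirical mean square error than another σ(Y)-measurable predictor.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 10**6
Y = rng.choice([-1.0, 0.0, 1.0], size=n)
X = Y**2 + rng.normal(size=n)             # here E[X|Y] = Y^2

cond_exp = np.empty(n)
for y in [-1.0, 0.0, 1.0]:
    mask = Y == y
    cond_exp[mask] = X[mask].mean()       # empirical version of E[X|Y] on {Y = y}

mse_proj = np.mean((X - cond_exp)**2)
mse_other = np.mean((X - (0.5 + 0.1 * Y))**2)   # some other sigma(Y)-measurable guess
print(mse_proj, mse_other)                      # the first should be (essentially) smaller
```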
In some treatments of conditional expectation, the Radon-Nikodym approach is bypassed entirely by first defining E[X|G] for X ∈ L²(F) in terms of projection onto L²(G), and then extending the definition to X ∈ L¹(F) using approximating sequences of square integrable random variables. An upshot of this strategy is that one can then prove the Radon-Nikodym theorem using martingales!
Regular Conditional Distributions.
Before moving on to martingales, we pause briefly to discuss regular conditional distributions/probabilities. Recall that if (Ω, F, P) is a probability space, (S, S) is a measurable space, and X is an (S, S)-valued random variable on (Ω, F, P), then X induces the pushforward measure P ∘ X⁻¹ on (S, S). The theory of regular conditional distributions provides a conditional analogue of this construction.
To see how this works, suppose that G ⊆ F is a sub-σ-algebra. For any set A ∈ S, the conditional probability of {X ∈ A} given G is given by the conditional expectation µ(ω, A) = E[1_A(X)|G](ω). As the notation suggests, we are thinking of µ as a function µ : Ω × S → [0, 1]. We would like to know whether µ(ω, ·) defines a probability measure on (S, S) for P-a.e. ω ∈ Ω.
Note that for every A ∈ S, µ(·, A) is a.s. uniquely defined and we are free to modify µ(·, A) on null sets, so linearity and monotone convergence imply that for any particular countable disjoint collection {An} ⊆ S, we have
µ(ω, ⋃_n An) = lim_n E[1_{A1}(X) + ... + 1_{An}(X)|G] = Σ_n µ(ω, An)
for a set of ω having probability 1. The issue is that this event may depend on {An} and we want µ(ω, ·) to satisfy the countable additivity condition for any collection {An} for P-a.e. ω. If there are too many different countable collections, then the exceptional sets might pile up to give a set having positive probability, or even a non-measurable set.
To address this issue properly, we set forth a formal definition.
Definition. Let (Ω, F, P) be a probability space, (S, S) a measurable space, G ⊆ F a sub-σ-algebra, and X : (Ω, F) → (S, S) a random variable. A map µ : Ω × S → [0, 1] is said to be a regular conditional distribution for X given G if
(i) For each A ∈ S, ω ↦ µ(ω, A) is a version of P(X ∈ A|G),
(ii) For P-a.e. ω ∈ Ω, A ↦ µ(ω, A) is a probability measure on (S, S).
When (S, S) = (Ω, F) and X(ω) = ω, µ is called a regular conditional probability.
One reason such objects are of interest is that if µ is a regular conditional distribution for X given G, then we can express conditional expectations of functions of X given G in terms of integrals against the r.c.d. (just like the ordinary change of variables formula with distribution functions).
Theorem 1.7. Let (Ω, F, P), (S, S), G, and X be as in the above definition. Let µ be a r.c.d. for X given G. Then for any measurable f : (S, S) → (ℝ, B) with E|f(X)| < ∞, we have
E[f(X)|G](ω) = ∫ f(x) µ(ω, dx) for P-a.e. ω.
Proof. If f = 1_B for some B ∈ S, then
E[f(X)|G](ω) = E[1_B(X)|G](ω) = P(X ∈ B|G)(ω) = µ(ω, B) = ∫_B µ(ω, dx) = ∫ f(x) µ(ω, dx),
where the third equality (which is a.s.) is condition (i) in the definition of µ.
By linearity, the result holds for simple functions; by monotone convergence it holds for nonnegative functions; and by consideration of positive and negative parts, it holds for integrable functions. (In each of these three steps, there are only countably many null sets that need to be discarded.)
Of course, for all of this to be worthwhile, it is necessary that r.c.d.s actually exist. Naively, one would just take a version Y_A of P(X ∈ A|G) for each A ∈ S and set µ(ω, A) = Y_A(ω). If F is finite this is not a problem as one can just modify each Y_A on a null set so that everything works out. But if F is infinite, then there are uncountably many Y_A's, and the set of ω for which there is some {An} ⊆ S with Y_{⋃_n An}(ω) ≠ Σ_n Y_{An}(ω) may have positive probability or not even lie in F no matter how the Y_A's are chosen. One can construct examples where r.c.d.s fail to exist for essentially this reason.
However, the following theorem shows that if X takes values in a nice space (e.g. a complete and separable metric space), then it has a r.c.d. given any sub-σ-algebra.
Theorem 1.8. If (Ω, F, P) is a probability space, G ⊆ F is a sub-σ-algebra, (S, S) is a nice space, and X is an (S, S)-valued random variable on (Ω, F, P), then X has a r.c.d. given G.
Proof. By definition, there is a bijection ϕ : (S, S) → (ℝ, B) with ϕ and ϕ⁻¹ measurable. Using monotonicity and throwing out a countable collection of null sets, we see that there is a set Ω0 ∈ F with P(Ω0) = 1 and a family of random variables ω ↦ G(ω, q), q ∈ ℚ, such that ω ↦ G(ω, q) is a version of P(ϕ(X) ≤ q|G) and q ↦ G(ω, q) is nondecreasing for all ω ∈ Ω0.
As in the proof of Helly's selection theorem, if we define F(ω, x) = inf{G(ω, q) : q > x}, then F(ω, ·) is a distribution function for all ω ∈ Ω0.
* x ↦ F(ω, x) is nondecreasing since for any x < y, there is a rational x < r < y so, since G(ω, r) ≤ G(ω, s) for r ≤ s,
F(ω, x) = inf{G(ω, q) : q > x} ≤ G(ω, r) ≤ inf{G(ω, q) : q > y} = F(ω, y);
x ↦ F(ω, x) is right continuous since for any x ∈ ℝ, ε > 0, there is a rational q > x with G(ω, q) ≤ F(ω, x) + ε and thus for any x < y < q,
F(ω, y) ≤ G(ω, q) ≤ F(ω, x) + ε;
and x ↦ F(ω, x) satisfies the appropriate boundary conditions since it is nondecreasing with supremum 1 and infimum 0 by virtue of q ↦ G(ω, q) satisfying these boundary conditions. (G(ω, ·) has the appropriate limits since 0 ≤ G(ω, q) ≤ 1 and ∫ G(ω, q) dP(ω) = ∫ 1{ϕ(X) ≤ q} dP, which goes to 0 or 1 as q goes to ∓∞, by monotone convergence.)
Thus each ω ∈ Ω0 gives rise to a unique measure ν(ω, ·) having distribution function F(ω, x) = ν(ω, (−∞, x]). Also, monotone convergence shows that ω ↦ F(ω, x) is a version of P(ϕ(X) ≤ x|G) for every x ∈ ℝ. As one readily checks that L = {B ∈ B : ν(ω, B) = P(ϕ(X) ∈ B|G)} is a λ-system, it follows from the π-λ theorem (with P = {(−∞, x] : x ∈ ℝ}) that ν(ω, B) is a version of P(ϕ(X) ∈ B|G). To extract the r.c.d. in question, note that for any A ∈ S, {X ∈ A} = {ϕ(X) ∈ ϕ(A)}, so we can take µ(ω, A) := ν(ω, ϕ(A)).
2. Martingales
With conditional expectation formally defined and its most important properties established, we are ready to tackle the topic of discrete time martingales.
Recall that a filtration is an increasing sequence of sub-σ-algebras F1 ⊆ F2 ⊆ ···. (Really, the term "net" is more appropriate than "sequence" since often one would like to consider uncountable index sets such as [0, T], but we are working in discrete time at the moment so the distinction is not important.)
We say that a collection of random variables {Xn}_{n∈ℕ} is adapted to the filtration {Fn}_{n=1}^∞ if Xn ∈ Fn for all n ∈ ℕ. Note that {Xn}_{n∈ℕ} is always adapted to the filtration defined by Fn = σ(X1, ..., Xn). Indeed, if {Xn}_{n∈ℕ} is adapted to {Gn}_{n∈ℕ}, then necessarily Gn ⊇ σ(X1, ..., Xn).
Definition 2.1. We say that a sequence {Xn}_{n=1}^∞ is a martingale with respect to the filtration {Fn}_{n∈ℕ} if
(i) E|Xn| < ∞ for all n ∈ ℕ,
(ii) Xn ∈ Fn,
(iii) E[Xn+1|Fn] = Xn.
If condition (iii) is replaced with E[Xn+1|Fn] ≥ Xn, then we say that {Xn} is a submartingale. If (iii) reads E[Xn+1|Fn] ≤ Xn, then we say that {Xn} is a supermartingale.
Note that, by definition, {Xn} is a martingale if and only if it is both a submartingale and a supermartingale. Also, by linearity, if {Xn} is a submartingale, then {−Xn} is a supermartingale (and symmetrically). Thus when one proves a theorem involving submartingales, say, there are immediate corollaries for martingales and supermartingales.
The standard picture of a "smartingale" (the collective term for ordinary, sub-, and super- martingales) is in terms of one's fortune after repeated bets (which need not be identical) on a game of unchanging odds. A fair game corresponds to a martingale. If the odds are in the player's favor, then it corresponds to a submartingale. If the odds are against the player, one has a supermartingale. (If the odds are always for/against, then one still gets a sub/super-martingale even if they vary game to game.)
Example 2.1. We have already studied one martingale in some detail - simple random walk in one dimension. Specifically, let ξ1, ξ2, ... be i.i.d. with P(ξ1 = 1) = P(ξ1 = −1) = 1/2, and let Xn = Σ_{i=1}^n ξi. As will be our convention henceforth, when no filtration is specified, we will take {Fn} to be the natural filtration Fn = σ(X1, ..., Xn) so that condition (ii) is automatically satisfied. The first condition is also satisfied in this case since |Xn| ≤ n. To see that this does indeed define a martingale, we compute
E[Xn+1|Fn] = E[Xn + ξn+1|Fn] = E[Xn|Fn] + E[ξn+1|Fn] = Xn + E[ξn+1] = Xn,
where we used linearity of conditional expectation, Xn ∈ Fn, and ξn+1 independent of Fn.
Example 2.2. More generally, if Xn = Σ_{i=1}^n ξi where the ξi's are independent, then an identical argument shows that {Xn} is a martingale if E[ξn] = 0 for all n; a submartingale if E[ξn] ≥ 0 for all n; and a supermartingale if E[ξn] ≤ 0 for all n.
Example 2.3. By linearity, if {Xn^(1)}, ..., {Xn^(m)} are martingales (w.r.t. a common filtration {Fn}) and α0, α1, ..., αm ∈ ℝ, then the sequence defined by Xn = α0 + Σ_{i=1}^m αi Xn^(i) is a martingale. If α1, ..., αm ≥ 0 and {Xn^(1)}, ..., {Xn^(m)} are sub/super-martingales, then so is {Xn}.
Example 2.4. Let X be an integrable random variable on a filtered probability space (Ω, F, {Fn}, P). Define Xn = E[X|Fn]. We have
E|Xn| = E|E[X|Fn]| ≤ E[E[|X| |Fn]] = E|X| < ∞
by Jensen's inequality and the law of total expectation; Xn = E[X|Fn] ∈ Fn by definition of conditional expectation; and
E[Xn+1|Fn] = E[E[X|Fn+1]|Fn] = E[X|Fn] = Xn
by the tower property. The interpretation is that X1, X2, ... represent increasingly better estimates of X as more information is revealed.
Example 2.5. Let X1, X2, ... be i.i.d. with P(X1 = 0) = 1/2, P(X1 = 2) = 1/2. Define Mn = Π_{i=1}^n Xi. Then {Mn} is a martingale with respect to Fn = σ(X1, ..., Xn) since
E[Mn+1|Fn] = (1/2)(0 · Mn) + (1/2)(2 · Mn) = Mn.
Here we can think of Mn as representing the fortune of a gambler going "double or nothing" in a fair game with an initial bet (at time 0) of 1 unit.
The same argument shows that if X1, X2, ... are independent, bounded, and nonnegative, then Mn = Π_{i=1}^n Xi defines a martingale provided that all multiplicands have mean 1. When they all have mean at least 1 (respectively, at most 1), Mn is a submartingale (respectively, supermartingale).
Example 2.6. The name "martingale" derives from a class of betting systems, the prototype of which is the following: In a fair game with an initial bet of $1, double your bet after each loss and quit after your first win.
Mathematically, suppose that ξ1, ξ2, ... are i.i.d. with P(ξ1 = 1) = P(ξ1 = −1) = 1/2 and define Mn recursively by M0 = 0,
Mn+1 = 1 if Mn = 1,  Mn+1 = Mn + 2^n ξn+1 else.
Then {Mn} is adapted to F0 = {∅, Ω}, Fn = σ(ξ1, ..., ξn) by an induction argument, and is integrable since |Mn| < 2^n. It is thus a martingale since
E[Mn+1|Fn] = E[1{Mn = 1} + 1{Mn ≠ 1}(Mn + 2^n ξn+1)|Fn]
= 1{Mn = 1} + 1{Mn ≠ 1}(Mn + 2^n E[ξn+1|Fn])
= 1{Mn = 1} + 1{Mn ≠ 1}(Mn + 2^n E[ξn+1])
= 1{Mn = 1}Mn + 1{Mn ≠ 1}Mn = Mn.
Because you will eventually win with probability one and
2^n − Σ_{i=0}^{n−1} 2^i = 2^n − (2^n − 1) = 1,
it seems that this strategy guarantees that you will come out ahead even though the game is fair. Moreover, if 0 < P(ξ1 = 1) < P(ξ1 = −1), then the same analysis shows that {Mn} is a supermartingale, but you will still eventually come out ahead.
The catch is that this trick only works if you have an infinite line of credit and are allowed unlimited plays and arbitrarily large wagers. We will see later that if such stipulations are missing, then there is no system that will turn the odds of an unfair game in your favor.
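The tension in Example 2.6 - E[Mn] = 0 for every n even though Mn → 1 almost surely - is easy to see in simulation. The sketch below is an illustration only; the horizon of ten plays and the number of sample paths are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def doubling_path(n_steps):
    """One path of the doubling martingale M_n from Example 2.6 (fair coin)."""
    M, bet = 0.0, 1.0
    path = [M]
    for _ in range(n_steps):
        if M == 1.0:                      # already won: quit, so M stays at 1
            path.append(M)
            continue
        M += bet * rng.choice([1, -1])
        bet = 1.0 if M == 1.0 else 2.0 * bet
        path.append(M)
    return path

paths = np.array([doubling_path(10) for _ in range(100000)])
print(paths[:, -1].mean())               # close to E[M_10] = 0
print((paths[:, -1] == 1.0).mean())      # close to P(win by time 10) = 1 - 2**-10
```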
* (Sub/Super)Harmonic Functions.
From the gambler's perspective, the definitions of sub- and super- martingales seem to be backwards (though
the names are apt from the house's point of view). In fact, the nomenclature has nothing to do with choosing
sides in this metaphor but rather arises from connections with potential theory.
Recall that if U is a domain (open connected set) in ℝ^n, then a continuous function h : U → ℝ is said to be harmonic on U if it satisfies Laplace's equation ∆h = 0 on U.
An important fact about harmonic functions is
Theorem 2.1 (Mean Value Formulas). h ∈ C²(U) is harmonic if and only if for every ball B(x, r) ⊆ U,
h(x) = (1/|B(x, r)|) ∫_{B(x,r)} h(y) dy = (1/σ(∂B(x, r))) ∫_{∂B(x,r)} h(y) dS(y),
where |B(x, r)| = r^n α(n) is the volume of the ball B(x, r) and σ(∂B(x, r)) = nα(n)r^{n−1} is the surface measure of the sphere ∂B(x, r); α(n) = |B(0, 1)| = π^{n/2}/Γ(n/2 + 1).
Proof. Suppose that h is harmonic on U ⊇ B(x, r) and set
φ(r) := (1/(nα(n)r^{n−1})) ∫_{∂B(x,r)} h(y) dS(y) = (1/nα(n)) ∫_{∂B(0,1)} h(x + rz) dS(z).
Differentiating and appealing to the divergence theorem gives
φ′(r) = (1/nα(n)) ∫_{∂B(0,1)} ∇h(x + rz) · z dS(z) = (1/(nα(n)r^{n−1})) ∫_{∂B(x,r)} ∇h(y) · ((y − x)/r) dS(y)
= (1/(nα(n)r^{n−1})) ∫_{∂B(x,r)} ∇h(y) · ν(y) dS(y) = (1/(nα(n)r^{n−1})) ∫_{B(x,r)} ∆h(y) dy = 0.
Since φ(r) is constant, we have
φ(r) = lim_{s→0} (1/(nα(n)s^{n−1})) ∫_{∂B(x,s)} h(y) dS(y) = h(x),
which is the spherical mean value formula. Radial integration gives
∫_{B(x,r)} h(y) dy = ∫_0^r (∫_{∂B(x,s)} h dS) ds = ∫_0^r h(x) nα(n)s^{n−1} ds = h(x) α(n) r^n.
For the converse, suppose that ∆h ≢ 0 on U. Then there is some ball B(x, r) ⊆ U with, say, ∆h > 0 on B(x, r). Taking φ as above yields the contradiction
0 = φ′(r) = (1/(nα(n)r^{n−1})) ∫_{B(x,r)} ∆h(y) dy > 0.
In other words, h is harmonic if and only if at each point in U, h is equal to its average over any ball in U centered at that point and also to its average over any sphere centered at that point.
A function f : U → ℝ is called subharmonic if it is upper semicontinuous (lim sup_{x→x0} f(x) ≤ f(x0) for all x0 ∈ U) and for each closed ball B(x, r) ⊆ U and each harmonic function h defined on a neighborhood of B(x, r) for which f ≤ h on ∂B(x, r), one has f ≤ h on B(x, r).
* Note that the only harmonic functions in one dimension are linear functions, so subharmonic is the same as convex when n = 1.
The analogue of Theorem 2.1 for subharmonic functions is
Theorem 2.2. An upper semicontinuous function f is subharmonic on a domain U ⊆ ℝ^n if and only if
f(x) ≤ (1/|B(x, r)|) ∫_{B(x,r)} f(y) dy,  f(x) ≤ (1/σ(∂B(x, r))) ∫_{∂B(x,r)} f(y) dS(y)
for every ball B(x, r) ⊆ U.
Proof sketch. If f is subharmonic, then it is u.s.c., so there is a sequence of continuous functions fn ↘ f. For any B(x, r) ⊆ U, let hn be the harmonic function on B(x, r) with hn = fn on ∂B(x, r). (It is known that the Dirichlet problem for the Laplace equation on a ball with continuous boundary values has a classical solution.) The mean value formula for harmonic functions and f ≤ fn = hn on ∂B(x, r) gives
f(x) ≤ hn(x) = (1/σ(∂B(x, r))) ∫_{∂B(x,r)} hn(y) dS(y) = (1/σ(∂B(x, r))) ∫_{∂B(x,r)} fn(y) dS(y)
for all n, and the result follows from the MCT since upper semicontinuous functions attain their maximum on compact sets and thus the fn can be taken to be bounded from above. Radial integration takes care of the integral over B(x, r).
Conversely, suppose that f satisfies the given inequality and there is x0 ∈ B(x, r) ⊆ U and some harmonic function h with f ≤ h on ∂B(x, r) and f(x0) > h(x0). Without loss, we can assume that x0 maximizes f − h on B(x, r). By assumption, we have
f(x0) − h(x0) ≤ (1/|B(x0, s)|) ∫_{B(x0,s)} (f(y) − h(y)) dy
for sufficiently small s, so maximality implies that f(y) − h(y) = f(x0) − h(x0) for a.e. y in some neighborhood of x0. A connectivity argument shows that f(y) − h(y) = f(x0) − h(x0) > 0 a.e. on B(x, r), contradicting f ≤ h on ∂B(x, r) (as f − h is u.s.c.).
A function g is superharmonic if and only if −g is subharmonic, so one has analogous mean value inequalities for superharmonic functions.
Roughly, one thinks of martingales as corresponding to harmonic functions while submartingales (respectively, supermartingales) correspond to subharmonic (respectively, superharmonic) functions. For example, if f is subharmonic, then the value of f at x is at most the average of f over a neighborhood of x, which is sort of similar to Xn ≤ E[Xn+1|Fn].
A concrete example of the correspondence between potential theory and martingales is that if {Bt}_{t≥0} is a Brownian motion in ℝ^n (which is a continuous-time martingale w.r.t. Ft = σ(Bs : s ≤ t) in the sense that Bt ∈ L¹(Ft) and E[Bt|Fs] = Bs for 0 ≤ s ≤ t), then under certain integrability assumptions, Yt = h(Bt) is a martingale (respectively, sub/super-martingale) if h is harmonic (respectively, sub/super-harmonic).
A more accessible connection is given by the following proposition.
Proposition 2.1. Suppose f is continuous and subharmonic on ℝ^d. Let ξ1, ξ2, ... be i.i.d. uniform on B(0, 1), and define Sn = x + Σ_{i=1}^n ξi for some x ∈ ℝ^d. Then {f(Sn)} is a submartingale.
Proof. The implicit filtration is Fn = σ(ξ1, ..., ξn) ∋ Sn, so, since continuous functions are measurable, f(Sn) is adapted to Fn. Also, since Sn ∈ B(x, n) by construction and continuous functions are bounded on compact sets, E|f(Sn)| < ∞. Finally, by the mean value inequality for subharmonic functions,
E[f(Sn+1)|Fn] = E[f(Sn + ξn+1)|Fn] = (1/|B(0, 1)|) ∫_{B(Sn,1)} f(y) dy ≥ f(Sn).
Here we are using the fact that if X ∈ G and Y is independent of G, then E[ϕ(X, Y)|G] = h(X) where h(x) = E[ϕ(x, Y)].
First Results.
An easy but extremely important fact is that the martingale property is not limited to a single time step.
Theorem 2.3. If {Xn} is a submartingale, then for any n > m, E[Xn|Fm] ≥ Xm.
Proof. When n = m + 1, the result follows from the definition of a submartingale, so we may assume for the purposes of an induction argument that E[Xm+k|Fm] ≥ Xm. Then the tower property and monotonicity give
E[Xm+k+1|Fm] = E[E[Xm+k+1|Fm+k]|Fm] ≥ E[Xm+k|Fm] ≥ Xm.
Corollary 2.1. If {Xm} is a supermartingale, then E[Xn|Fm] ≤ Xm for any n > m, and if {Xm} is a martingale, then E[Xn|Fm] = Xm for any n > m.
Proof. The first claim follows from Theorem 2.3 by noting that {−Xm} is a submartingale, and the second then follows since a martingale is both a submartingale and a supermartingale.
The line of reasoning used in Corollary 2.1 applies to many of our subsequent results, so to avoid redundancy we will often just state a theorem for submartingales or supermartingales with the corresponding statements for the other cases taken as implicit.
Corollary 2.2. If {Xm} is a submartingale, then E[Xn] ≥ E[Xm] for any n > m, and analogously for supermartingales and martingales.
Proof. Proposition 1.6, Theorem 2.3, and monotonicity imply
E[Xn] = E[E[Xn|Fm]] ≥ E[Xm].
Keeping in mind that convex functions of one variable are subharmonic, the following results provide another connection between smartingales and potential theory.
Theorem 2.4. If {Xn} is a martingale w.r.t. {Fn} and ϕ : ℝ → ℝ is a convex function with E|ϕ(Xn)| < ∞ for all n, then {ϕ(Xn)} is a submartingale w.r.t. {Fn}.
Proof. ϕ(Xn) ∈ L¹(Fn) by assumption and Jensen's inequality implies
E[ϕ(Xn+1)|Fn] ≥ ϕ(E[Xn+1|Fn]) = ϕ(Xn)
since {Xn} is a martingale.
Corollary 2.3. Let p ≥ 1 and suppose that {Xn} is a martingale such that E[|Xn|^p] < ∞ for all n. Then {|Xn|^p} is a submartingale.
Proof. f(x) = |x|^p is convex.
Theorem 2.5. If {Xn} is a submartingale w.r.t. {Fn} and ϕ is a nondecreasing convex function with E|ϕ(Xn)| < ∞ for all n, then {ϕ(Xn)} is a submartingale w.r.t. {Fn}.
Proof. By Jensen's inequality and the assumptions,
E[ϕ(Xn+1)|Fn] ≥ ϕ(E[Xn+1|Fn]) ≥ ϕ(Xn).
Corollary 2.4. If {Xn} is a submartingale, then {(Xn − a)⁺} is a submartingale. If {Xn} is a supermartingale, then {Xn ∧ a} is a supermartingale.
Proof. f(x) = x ∨ b is convex and nondecreasing, and −[(−x) ∨ (−a)] = x ∧ a.
In order to further develop the basic notions of martingale theory, we introduce the following definitions:
Definition. Given a filtration {Fn}_{n=0}^∞, we say that a sequence {Hn}_{n=1}^∞ is predictable (or previsible) if Hn ∈ Fn−1 for all n ∈ ℕ. That is, the value of Hn is determined by the information available at time n − 1.
Definition. If {Xn}_{n=0}^∞ is a smartingale and {Hn}_{n=1}^∞ is predictable w.r.t. Fn = σ(X0, ..., Xn), the process
(H · X)0 = 0,  (H · X)n = Σ_{m=1}^n Hm(Xm − Xm−1), n ≥ 1
is called the martingale transform of Hn with respect to Xn.
Note that the martingale transform is defined like a Riemann sum for the integral of H with respect to X. With a bit of care, this analogy can be extended to develop a general theory of stochastic integration.
We typically think of {Hn} as a gambling strategy: If Xn represents the gambler's fortune at time n if the stakes were $1 per game and Hn represents the amount wagered on the nth game (which can be based on the outcomes of the first n − 1 games, but not on the outcome of game n), then (H · X)n is the gambler's fortune at time n when they bet according to {Hn}.
As an example, take Xn = Σ_{i=0}^n ξi where ξ0 = 0 and ξ1, ξ2, ... are i.i.d. with P(ξ1 = 1) = p ∈ (0, 1), P(ξ1 = −1) = 1 − p, and set
H1 = 1,  Hn = 1 if ξn−1 = 1,  Hn = 2Hn−1 if ξn−1 = −1,  for n ≥ 2.
This corresponds to our example of the martingale betting system, except that instead of quitting after a win and walking away with one dollar, the process just starts over. Every win recovers the amount from the last losing streak and one additional dollar to boot, so the gambler has a profit of $k after the kth win.
As the second Borel-Cantelli lemma ensures infinitely many wins with probability one, the gambler can eventually amass an arbitrarily large fortune no matter how much the odds are stacked against them. However, the above reasoning is predicated on unrealistic assumptions such as infinite credit and no table limits. The ensuing results show that in practice there is no system for beating an unfavorable game.
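For concreteness, here is a short sketch of the transform (H · X)n for the restart strategy just described; the success probability p = 0.4 and the horizon are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def martingale_transform(H, X):
    """(H.X)_n = sum_{m<=n} H_m (X_m - X_{m-1}); H[m-1] is the wager on game m."""
    return np.concatenate([[0.0], np.cumsum(H * np.diff(X))])

p, n = 0.4, 50
xi = rng.choice([1, -1], size=n, p=[p, 1 - p])
X = np.concatenate([[0], np.cumsum(xi)])        # unit-stakes fortune, X_0 = 0

H = np.empty(n)
H[0] = 1.0
for m in range(1, n):                           # bet 1 after a win, double after a loss
    H[m] = 1.0 if xi[m - 1] == 1 else 2.0 * H[m - 1]

# final fortune: one dollar per win so far, minus 2^j - 1 during a losing streak of length j
print(martingale_transform(H, X)[-1])
```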
Theorem 2.6. Suppose that {Xn}_{n=0}^∞ is a supermartingale. If Hn ≥ 0 is predictable and each Hn is bounded, then (H · X)n defines a supermartingale.
Proof. (H · X)n is integrable since H is bounded and the Xm's are integrable, and it is adapted by construction. To see that it defines a supermartingale, we compute
E[(H · X)n+1|Fn] = E[(H · X)n + Hn+1(Xn+1 − Xn)|Fn]
= E[(H · X)n|Fn] + E[Hn+1(Xn+1 − Xn)|Fn]
= (H · X)n + Hn+1 E[Xn+1 − Xn|Fn] ≤ (H · X)n,
where we used Hn+1, (H · X)n ∈ Fn, Hn ≥ 0, and E[Xn+1 − Xn|Fn] ≤ 0.
Of course, the argument in Theorem 2.6 also applies to submartingales and to martingales (without the Hn ≥ 0 restriction in the latter case).
An important example of a predictable sequence is given by Hn = 1{N ≥ n} where N is a stopping time. The terminology is quite apt as the strategy is "stop gambling at time N."
It is worth observing that (H · X) is linear in H (and in X for that matter), so one can construct very general betting systems using stopping times.
Note that if {Xn} is a supermartingale and Hn = 1{N ≥ n}, then Theorem 2.6 shows that
(H · X)n = Σ_{m=1}^n 1{N ≥ m}(Xm − Xm−1) = X_{N∧n} − X0
is a supermartingale. Since the constant sequence Yn = X0 is also a supermartingale and the sum of two supermartingales is a supermartingale, we have
Corollary 2.5. If N is a stopping time and {Xn} is a supermartingale, then {X_{N∧n}} is a supermartingale, and similarly for submartingales and martingales.
Corollary 2.5 gives us our first "optional stopping theorem," which explains why the martingale system doesn't work if we can't play forever.
Theorem 2.7. If {Xn}_{n=0}^∞ is a supermartingale and N is a stopping time with N ≤ m a.s. for some m ∈ ℕ, then E[X_N] ≤ E[X0].
Proof. Under the above assumptions,
E[X_N] = E[X_{N∧m}] ≤ E[X0].
Now suppose that Xn, n ≥ 0 is a submartingale. Fix a, b ∈ ℝ with a < b and define the sequence N0, N1, N2, ... by N0 = −1 and for k ≥ 1,
N_{2k−1} = inf{m > N_{2k−2} : Xm ≤ a},
N_{2k} = inf{m > N_{2k−1} : Xm ≥ b}.
The Nj's are stopping times and {N_{2k−1} < m ≤ N_{2k}} = {N_{2k−1} ≤ m − 1} ∩ {N_{2k} ≤ m − 1}^C ∈ Fm−1, so
Hm = 1{N_{2k−1} < m ≤ N_{2k} for some k ≥ 1}
defines a predictable sequence.
By construction, X_{N_{2k−1}} ≤ a and X_{N_{2k}} ≥ b, so in the time interval [N_{2k−1}, N_{2k}], Xm crosses from below a to above b. {Hm}_{m=1}^∞ is a gambling strategy which tries to take advantage of these "upcrossings": It turns on after Xm dips below a and turns off once Xm rises above b. As the sum defining (H · X)m telescopes in between these times, the contribution from an upcrossing starting at time N_{2k−1} and ending at time N_{2k} is X_{N_{2k}} − X_{N_{2k−1}} ≥ b − a.
In stock market terms, this is like buying a share once its price is less than a and selling once it's greater than b.
Letting
Un = sup{k : N_{2k} ≤ n},
Vn = sup{j : N_{2j−1} ≤ n}
denote the number of upcrossings completed by time n and the number of upcrossings begun before time n, we see that
(H · X)n ≥ Un(b − a) + 1{Un < Vn}(Xn − X_{N_{2Vn−1}}).
The following lemma extends this observation to bound the expected number of upcrossings by time n.
Lemma 2.1 (Upcrossing Inequality). If {Xm}_{m=0}^∞ is a submartingale, then
(b − a)E[Un] ≤ E[(Xn − a)⁺] − E[(X0 − a)⁺].
Proof. Since ϕ(x) = x ∨ a is convex and nondecreasing, Ym = Xm ∨ a is a submartingale. Also, Ym = a when Xm ≤ a and Ym = Xm when Xm > a, so Ym upcrosses [a, b] the same number of times Xm does. Moreover, Yn ≥ a = Y_{N_{2Vn−1}} implies
(H · Y)n ≥ Un(b − a) + 1{Un < Vn}(Yn − Y_{N_{2Vn−1}}) ≥ Un(b − a).
Setting Km = 1 − Hm, we have that {Km} is nonnegative, bounded, and predictable, so Corollary 2.2 and the submartingale form of Theorem 2.6 yield
E[(K · Y)n] ≥ E[(K · Y)0] = 0.
Putting these facts together gives
(b − a)E[Un] ≤ E[(H · Y)n] = E[(1 · Y)n − (K · Y)n]
= E[Yn − Y0] − E[(K · Y)n] ≤ E[Yn] − E[Y0]
= E[(Xn − a)⁺ + a] − E[(X0 − a)⁺ + a] = E[(Xn − a)⁺] − E[(X0 − a)⁺].
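The inequality is straightforward to test by simulation. The sketch below counts completed upcrossings along simple random walk paths (the interval [−2, 2], horizon, and trial count are arbitrary) and compares both sides of Lemma 2.1 up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(5)

def count_upcrossings(path, a, b):
    """Number of completed upcrossings of [a, b] by the sequence `path`."""
    count, below = 0, False
    for x in path:
        if not below and x <= a:
            below = True            # time N_{2k-1}: the path has dipped to a or lower
        elif below and x >= b:
            below = False           # time N_{2k}: an upcrossing is completed
            count += 1
    return count

a, b, n, trials = -2.0, 2.0, 100, 20000
lhs = rhs = 0.0
for _ in range(trials):
    X = np.concatenate([[0.0], np.cumsum(rng.choice([1, -1], size=n))])
    lhs += (b - a) * count_upcrossings(X, a, b)
    rhs += max(X[-1] - a, 0.0) - max(X[0] - a, 0.0)
print(lhs / trials, rhs / trials)   # first value should not exceed the second
```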
The primary utility of the upcrossing inequality is to facilitate the proof of the pointwise martingale convergence theorem, which shows that smartingales behave like monotone sequences of real numbers in that they converge if appropriately bounded. Essentially, the idea is to show that there is a set of full measure on which Xn upcrosses any interval [a, b] with a, b ∈ ℚ only finitely many times. This allows us to conclude that Xn has an almost sure limit.
Theorem 2.8 (Martingale Convergence). If {Xn} is a submartingale with sup_n E[Xn⁺] < ∞, then there is an integrable random variable X such that Xn → X a.s. as n → ∞.
Proof. For any a, b ∈ ℚ with a < b, let Un be the number of upcrossings of [a, b] by time n, and let U = lim_{n→∞} Un be the number of upcrossings of the whole sequence. (U is a well-defined random variable because Un is increasing.) Set M = sup_n E[Xn⁺] < ∞. Since E[(Xn − a)⁺] ≤ E[Xn⁺] + |a| ≤ M + |a| and E[(X0 − a)⁺] ≥ 0, Lemma 2.1 implies that
E[Un] ≤ (1/(b − a))(E[(Xn − a)⁺] − E[(X0 − a)⁺]) ≤ (M + |a|)/(b − a)
for all n, so the monotone convergence theorem gives
E[U] = lim_{n→∞} E[Un] ≤ (M + |a|)/(b − a) < ∞,
and thus U < ∞ a.s.
As the above holds for any [a, b] with a, b ∈ ℚ, and the set of intervals with rational endpoints is countable, the event
⋃_{a,b∈ℚ} {lim inf_n Xn < a < b < lim sup_n Xn}
has probability 0, hence lim sup_n Xn = lim inf_n Xn a.s.
Letting X denote this common value, Fatou's lemma shows that E[X⁺] ≤ lim inf_n E[Xn⁺] < ∞, so X < ∞ a.s. To see that X > −∞ a.s., we observe that since {Xn} is a submartingale,
E[Xn⁻] = E[Xn⁺] − E[Xn] ≤ E[Xn⁺] − E[X0],
so another application of Fatou's lemma gives
E[X⁻] ≤ lim inf_n E[Xn⁻] ≤ lim inf_n E[Xn⁺] − E[X0] < ∞.
Thus Xn has a limit X ∈ ℝ with E|X| = E[X⁺] + E[X⁻] < ∞.
An immediate corollary is
Corollary 2.6. If Xn ≥ 0 is a supermartingale, then as n → ∞, Xn → X a.s. and E[X] ≤ E[X0].
Proof. Yn = −Xn is a submartingale with E[Yn⁺] = 0, so it follows from Theorem 2.8 that Yn → Y = −X a.s. The inequality follows from Fatou's lemma and the supermartingale property:
E[X] ≤ lim inf_n E[Xn] ≤ E[X0].
Of course, by considering Xn + K, one may replace "non-negative" with "bounded from below" in the preceding.
It is worth observing that the assumptions in Theorem 2.8 do not guarantee convergence in L¹.
Example 2.7. Let ξ1, ξ2, ... be i.i.d. with P(ξ1 = 1) = P(ξ1 = −1) = 1/2. Let S0 = 1, and for n ≥ 1, Sn = Sn−1 + ξn. That is, Sn is a simple random walk started at 1. Let N = inf{n : Sn = 0} be the hitting time of 0, and let Xn = S_{n∧N} be the walk stopped at 0. Then Xn is a nonnegative martingale, so Corollary 2.6 implies that it has an almost sure limit X. Clearly, we must have X = 0 a.s. since if Xn = k > 0, then |Xn − Xn+1| = 1. However, E[Xn] = E[X0] = 1 for all n, so we can't have Xn → X in L¹.
We will provide criteria for convergence in Lp before long, but before doing so we have one more general result to discuss, and then we will take some time to consider some examples in detail in order to better understand the utility of martingale convergence.
The following decomposition result is due to Doob (like essentially everything else in the classical theory of martingales) and can be useful for reducing questions about sub/super-martingales to questions about martingales.
Theorem 2.9 (Doob Decomposition). Any submartingale {Xn}_{n=0}^∞ can be written in a unique way as Xn = Mn + An where Mn is a martingale and An is a predictable increasing sequence with A0 = 0.
Proof. If Xn = Mn + An with E[Mn|Fn−1] = Mn−1 and An ∈ Fn−1, we must have
E[Xn|Fn−1] = E[Mn|Fn−1] + E[An|Fn−1] = Mn−1 + An = Xn−1 − An−1 + An.
The recursion A0 = 0, An − An−1 = E[Xn|Fn−1] − Xn−1 uniquely defines An and thus Mn = Xn − An. It remains to establish existence by checking that An and Mn as defined above satisfy the assumptions.
To check that An is indeed increasing and predictable, we note that An − An−1 = E[Xn|Fn−1] − Xn−1 ≥ 0 since Xn is a submartingale, and An = E[Xn|Fn−1] − Xn−1 + An−1 ∈ Fn−1 by induction.
Finally, rewriting the recursion for An as E[Xn|Fn−1] − An = Xn−1 − An−1, we see that
E[Mn|Fn−1] = E[Xn − An|Fn−1] = E[Xn|Fn−1] − An = Xn−1 − An−1 = Mn−1,
so Mn is a martingale.
The inspiration for the theorem is quite clear:
An = Σ_{k=1}^n (E[Xk|Fk−1] − Xk−1)
is the running sum of the amount by which E[Xn|Fn−1] overshoots Xn−1, and once these drift terms have been subtracted off, the remainder
Mn = X0 + Σ_{k=1}^n (Xk − E[Xk|Fk−1])
is a martingale.
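For a concrete instance of the decomposition (a standard example, not worked out in the notes above): if Sn is simple random walk started at 0, then Xn = Sn² is a submartingale with E[Xn|Fn−1] = Xn−1 + 1, so An = n and Mn = Sn² − n. The sketch below checks that Mn has constant mean.

```python
import numpy as np

rng = np.random.default_rng(6)

n, trials = 25, 200000
steps = rng.choice([1, -1], size=(trials, n))
S = np.concatenate([np.zeros((trials, 1)), np.cumsum(steps, axis=1)], axis=1)

M = S**2 - np.arange(n + 1)          # martingale part of the Doob decomposition of S_n^2
print(M.mean(axis=0)[[0, 5, 25]])    # each entry close to E[M_0] = 0
```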
3. Applications
Bounded Increments.
Our first application of the martingale convergence theorem involves a generalization of the second Borel-Cantelli lemma. This will follow from a more general result concerning martingales with bounded increments: They either converge or oscillate between ±∞.
Theorem 3.1. Suppose {Xn} is a martingale such that sup_n |Xn+1 − Xn| ≤ M for some M ∈ ℝ. Writing
C = {lim_{n→∞} Xn exists in ℝ},
D = {lim sup_{n→∞} Xn = ∞} ∩ {lim inf_{n→∞} Xn = −∞},
we have P(C ∪ D) = 1.
Proof. Since Xn − X0 is a martingale, we may assume without loss of generality that X0 = 0.
For any 0 < K < ∞, set N = inf{n : Xn ≤ −K}. Then X_{n∧N} is a martingale with X_{n∧N} ≥ −K − M a.s., so applying Corollary 2.6 to X_{n∧N} + K + M shows that Xn has a finite limit on (almost all of) {N = ∞}. Letting K → ∞ shows that lim_{n→∞} Xn exists a.s. on {lim inf_{n→∞} Xn > −∞}.
Applying the foregoing to −Xn shows that lim_{n→∞} Xn exists a.s. on {lim sup_{n→∞} Xn < ∞}. Thus, up to a null set, C = Ω \ D.
Corollary 3.1. Let {Fn}_{n=0}^∞ be a filtration with F0 = {∅, Ω}, and let A1, A2, ... be events with An ∈ Fn. Then (up to a null set)
{An i.o.} = {Σ_{n=1}^∞ P(An|Fn−1) = ∞}.
Proof. Set X0 = 0 and Xn = Σ_{m=1}^n (1_{Am} − P(Am|Fm−1)) for n ≥ 1. Then Xn is a martingale since
E[Xn+1|Fn] = E[Xn + 1_{An+1} − P(An+1|Fn)|Fn] = Xn + E[1_{An+1}|Fn] − P(An+1|Fn) = Xn,
with |Xn+1 − Xn| = |1_{An+1} − P(An+1|Fn)| ≤ 1.
On C = {lim_{n→∞} Xn exists and is finite}, we must have Σ_{m=1}^∞ 1_{Am} = ∞ if and only if Σ_{m=1}^∞ P(Am|Fm−1) = ∞, and the same is true on D = {lim sup_n Xn = ∞ and lim inf_n Xn = −∞}. This proves the result since P(C ∪ D) = 1 by Theorem 3.1.
* Polya's Urn.
Suppose that an urn initially contains r red balls and g green balls. At each time step we remove a ball at random, observe its color, and then replace it along with c other balls of the same color for some integer c. Negative values of c correspond to removing balls, and in this case it is generally necessary to stop the process once further removals become impossible. Note that c = −1 corresponds to sampling without replacement and c = 0 corresponds to sampling with replacement.
For positive values of c, each time a color is sampled increases the probability that it will be sampled in the future. Such self-reinforcing behavior is sometimes described by the phrase "the rich get richer." (Of course, the opposite holds when c is negative.) We will assume henceforth that c > 0 so that the process can be continued indefinitely and so we can divide by c when convenient.
To make the foregoing rigorous, let ξn = 1{green ball drawn at time n}. The first observation is that the sequence ξ1, ξ2, ..., though certainly not independent, is exchangeable - for any n ∈ ℕ, σ ∈ Sn, (ξ1, ..., ξn) =_d (ξ_{σ(1)}, ..., ξ_{σ(n)}).
To gain intuition, we compute P(ξ1 = 1, ξ2 = 0, ξ3 = 0) = P(ξ1 = 0, ξ2 = 1, ξ3 = 0): Using elementary conditional probability, the left-hand side is
(g/(r + g)) · (r/(r + g + c)) · ((r + c)/(r + g + 2c))
and the right-hand side is
(r/(r + g)) · (g/(r + g + c)) · ((r + c)/(r + g + 2c)).
The successive denominators are constant and the numerators are just permuted.
More generally,
P(ξ1 = 1, ..., ξm = 1, ξm+1 = 0, ..., ξn = 0)
= (g/(g + r)) ··· ((g + (m − 1)c)/(g + r + (m − 1)c)) · (r/(g + r + mc)) ··· ((r + (n − m − 1)c)/(g + r + (n − 1)c)),
which is the same as P(ξ1 = 1_S(1), ..., ξn = 1_S(n)) for any other S ⊆ [n] with |S| = m.
Now let Xn = Σ_{i=1}^n ξi be the number of green balls drawn at time n. The preceding discussion shows that for any 0 ≤ m ≤ n, P(Xn = m) = (n choose m) p_{n,m} where
p_{n,m} = P(ξ1 = 1, ..., ξm = 1, ξm+1 = 0, ..., ξn = 0)
= [Π_{i=0}^{m−1} (g + ic) · Π_{j=0}^{n−m−1} (r + jc)] / Π_{k=0}^{n−1} (g + r + kc)
= [Π_{i=0}^{m−1} (g/c + i) · Π_{j=0}^{n−m−1} (r/c + j)] / Π_{k=0}^{n−1} ((g + r)/c + k)
= [Γ(g/c + m) Γ(r/c + n − m) / (Γ(g/c) Γ(r/c))] · [Γ((g + r)/c) / Γ((g + r)/c + n)].
Consequently, we have
P(Xn = m) = (n choose m) p_{n,m} = [Γ(n + 1)/(Γ(m + 1)Γ(n − m + 1))] p_{n,m}
= [Γ(n + 1)/(Γ(m + 1)Γ(n − m + 1))] · [Γ(m + g/c) Γ(n − m + r/c)/(Γ(g/c) Γ(r/c))] · [Γ((g + r)/c)/Γ(n + (g + r)/c)].
Stirling's approximation, Γ(x + 1) ≈ √(2πx)(x/e)^x, implies
Γ(x + a)/Γ(x + b) ≈ [√(2π(x + a − 1)) ((x + a − 1)/e)^{x+a−1}] / [√(2π(x + b − 1)) ((x + b − 1)/e)^{x+b−1}] ≈ x^{a−b},
so
nP(Xn = m) ≈ n · [Γ((g + r)/c)/(Γ(g/c) Γ(r/c))] · m^{g/c − 1} (n − m)^{r/c − 1} n^{1 − g/c − r/c}
and thus
nP(Xn = m) → [Γ((g + r)/c)/(Γ(g/c) Γ(r/c))] · x^{g/c − 1} (1 − x)^{r/c − 1}
as m, n → ∞ with m/n = x.
It follows that for any 0 < x < 1,
P(Xn/n ≤ x) = P(Xn ≤ nx) = P(Xn = 0) + P(Xn = 1) + ... + P(Xn = ⌊nx⌋)
= ∫_0^{⌊nx⌋+1} P(Xn = ⌊y⌋) dy = ∫_0^{(⌊nx⌋+1)/n} nP(Xn = ⌊nu⌋) du
→ ∫_0^x [Γ((g + r)/c)/(Γ(g/c) Γ(r/c))] u^{g/c − 1} (1 − u)^{r/c − 1} du
as n → ∞. This shows that
Xn/n ⇒ Beta(g/c, r/c),
which implies that the fraction of green balls at time n, Yn = (g + cXn)/(g + r + cn), satisfies
Yn = (Xn/n)(cn/(g + r + cn)) + g/(g + r + cn) ⇒ Beta(g/c, r/c).
The martingale convergence theorem allows us to strengthen this conclusion to almost sure convergence: If we can show that Yn ≥ 0 is a martingale, then it will follow from Corollary 2.6 that it has an almost sure limit Y, which necessarily has the Beta(g/c, r/c) distribution by the previous argument.
The obvious filtration in this case is Fn = σ(ξ1, ..., ξn). Yn is adapted by the Doob-Dynkin Lemma and is integrable since 0 ≤ Yn ≤ 1, so it remains only to show that E[Yn+1|Fn] = Yn. To see that this is so, we observe that on the event that the urn contains gn green and rn red balls at time n, Yn = gn/(gn + rn) and
E[Yn+1|Fn] = ((gn + c)/(gn + rn + c)) · (gn/(gn + rn)) + (gn/(gn + rn + c)) · (rn/(gn + rn))
= gn(gn + c + rn)/((gn + rn + c)(gn + rn)) = gn/(gn + rn) = Yn.
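A simulation makes the Beta limit visible; the parameter values below (g = 1, r = 2, c = 1, so the limit is Beta(1, 2)) are an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(7)

def polya_fraction(g, r, c, steps):
    """Fraction of green balls after `steps` draws from a Polya urn."""
    for _ in range(steps):
        if rng.random() < g / (g + r):
            g += c
        else:
            r += c
    return g / (g + r)

samples = np.array([polya_fraction(1, 2, 1, 500) for _ in range(5000)])
# Beta(1, 2) has mean 1/3 and P(Beta(1, 2) < 1/2) = 3/4
print(samples.mean(), (samples < 0.5).mean())
```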
Branching Processes.
Let {ξi^n}_{i,n∈ℕ} be an array of i.i.d. ℕ0-valued random variables. Define the sequence Z0, Z1, ... by Z0 = 1 and
Zn+1 = ξ1^{n+1} + ... + ξ_{Zn}^{n+1} if Zn > 0,  Zn+1 = 0 if Zn = 0.
Zn is called a Galton-Watson process, and the interpretation is that Zn gives the population size of the nth generation where each individual gives birth to an identically distributed number of offspring and then dies. pk = P(ξi^n = k) is called the offspring distribution.
One can come up with many natural variants on this problem by changing assumptions such as i.i.d. offspring variables or one generation life-spans, but we will content ourselves with the simplest case here.
Our ultimate goal is to show that the mean of the offspring distribution determines whether or not the population is doomed to extinction. Specifically, if the population is to have a chance of surviving indefinitely, then the number of births must exceed the number of deaths on average.
In order to prove this unsurprising yet nontrivial claim, we first establish
Lemma 3.1. Let Fn = σ(ξi^m : i ≥ 1, 1 ≤ m ≤ n) and µ = E[ξi^n] ∈ (0, ∞). Then Mn := Zn/µ^n is a martingale with respect to Fn.
Proof. Mn is adapted by construction and integrable by an induction argument using the ensuing computation. Linearity, monotone convergence, and the definition of Fn imply
E[Zn+1|Fn] = E[Σ_{k=0}^∞ Zn+1 1{Zn = k}|Fn] = Σ_{k=0}^∞ E[Zn+1 1{Zn = k}|Fn]
= Σ_{k=1}^∞ E[(ξ1^{n+1} + ... + ξk^{n+1}) 1{Zn = k}|Fn]
= Σ_{k=1}^∞ 1{Zn = k} E[ξ1^{n+1} + ... + ξk^{n+1}]
= Σ_{k=1}^∞ 1{Zn = k} kµ = µZn,
and the result follows upon division by µ^{n+1}.
Since Mn is a nonnegative martingale, it has a finite almost sure limit M∞. We begin by identifying cases where M∞ = 0.
Theorem 3.2. If µ < 1, then M_n, Z_n → 0 a.s.

Proof. Since M_n → M_∞ ∈ R a.s. and µ^n → 0, we have that Z_n = µ^n M_n → 0 · M_∞ = 0 a.s. Because Z_n is integer valued and converges to 0 a.s., we must have that Z_n(ω) = 0 for all large n (depending on ω) for almost every ω ∈ Ω. It follows that M_n = Z_n/µ^n is eventually 0, hence M_∞ = 0.

It makes sense that if the population has more deaths than births on average, then it will eventually die out. The next result shows that if we exclude the trivial case where every individual has one offspring with full probability, then the same is also true if, on average, each individual replaces themselves before dying.
Theorem 3.3. If µ = 1 and P(ξ_i^n = 1) < 1, then Z_n → 0 a.s.

Proof. When µ = 1, Z_n = M_n converges almost surely to some finite Z_∞. Because Z_n ∈ N_0, we must have that Z_n = Z_∞ for all n ≥ N(ω).

If P(ξ_i^n = 1) < 1, then µ = 1 implies that P(ξ_i^n = 0) > 0, so for each k ∈ N, P(ξ_1^n = ... = ξ_k^n = 0 i.o.) = 1 by the second Borel-Cantelli lemma. Thus for any k, N ∈ N,

P(Z_n = k for all n ≥ N) ≤ 1 − P(ξ_1^n = ... = ξ_k^n = 0 for some n > N) = 0,

so it must be the case that Z_∞ = 0.

It is worth observing that since M_n = Z_n/µ^n is a martingale, E[Z_n/µ^n] = E[Z_0/µ^0] = 1, hence E[Z_n] = µ^n. Thus when µ < 1, Z_n → 0 (exponentially fast) in L^1. However, when µ = 1, E[Z_n] = E[Z_0] = 1 for all n, so Z_n does not converge in L^1.

In the µ ≤ 1 cases, we were able to conclude not only that Z_n → 0 a.s., but also that M_n → 0 a.s. When µ > 1, we will show that there is positive probability that the population never goes extinct, but we point out that this does not enable us to conclude that M_∞ is nonzero with positive probability. Necessary and sufficient conditions for this to occur are stated in Durrett.
Theorem 3.4. If µ > 1, then P(Z_n > 0 for all n) > 0.

Proof. For s ∈ [0, 1], let

G(s) = E[s^{ξ_i^n}] = ∑_{k=0}^∞ p_k s^k

be the probability generating function for the offspring distribution.

As ∑_{k=0}^∞ p_k s^k is a power series whose interval of convergence contains [−1, 1], we may differentiate termwise to obtain

G′(s) = ∑_{k=1}^∞ k p_k s^{k−1} ≥ 0,
G″(s) = ∑_{k=2}^∞ k(k−1) p_k s^{k−2} ≥ 0

for s ∈ [0, 1). Since the DCT implies that lim_{s→1−} G′(s) = ∑_{k=1}^∞ k p_k = µ > 1, there is a k ≥ 2 such that p_k > 0, hence G′(s), G″(s) > 0 for s ∈ (0, 1).

The reason we care about the p.g.f. is that the proof relies on showing that the probability of extinction is given by the unique fixed point of G in [0, 1).
Claim (a). If θ_m = P(Z_m = 0), then θ_m = ∑_{k=0}^∞ p_k θ_{m−1}^k = G(θ_{m−1}).

Proof. If Z_1 = k, which happens with probability p_k, then Z_m = 0 if and only if all k lineages die out in the next m − 1 time steps. By independence, this happens with probability θ_{m−1}^k. Summing over the different possibilities for k establishes the claim.
Our next step is to establish the existence and uniqueness of the purported fixed point:

Claim (b). There is a unique ρ < 1 such that G(ρ) = ρ.

Proof. Since lim_{s→1−} G′(s) = µ > 1 and G′ is continuous on [0, 1), there is an ε > 0 such that G′(s) > 1 on (1 − ε, 1). As G is continuous on [1 − ε, 1] and G(1) = 1, the mean value theorem implies that

1 − G(s) = G(1) − G(s) = G′(c)(1 − s) > 1 − s,

hence G(s) < s, on (1 − ε, 1).

Also, G(0) = p_0 ≥ 0. If G(0) = 0, take ρ = 0. Otherwise, letting F(x) = G(x) − x, we have F(0) > 0 and F(1 − ε/2) < 0, so the existence of ρ follows from the intermediate value theorem.

To see that this is the only fixed point less than 1, observe that G″ > 0 on (0, 1), so G is strictly convex, and thus for any x ∈ (ρ, 1), writing x = λρ + (1 − λ) · 1 with λ ∈ (0, 1),

G(x) = G(λρ + (1 − λ) · 1) < λG(ρ) + (1 − λ)G(1) = λρ + (1 − λ) · 1 = x.

There can be no fixed point in (0, ρ) either, as the above computation (with x and ρ interchanged) would then imply that G(ρ) < ρ.
The final step is to show that ρ is indeed the extinction probability:

Claim (c). θ_m ↗ ρ as m ↗ ∞.

Proof. θ_n = G(θ_{n−1}) is an increasing sequence because θ_0 = 0 ≤ G(0) = θ_1 and G′(x) ≥ 0 for x ≥ 0. Similarly, sup_n θ_n ≤ ρ since θ_0 = 0 ≤ ρ and θ_m ≤ ρ implies θ_{m+1} = G(θ_m) ≤ G(ρ) = ρ.

As θ_n is a bounded increasing sequence, it converges to some θ_∞ ≤ ρ, so, since G is continuous, we have

θ_∞ = lim_{n→∞} θ_n = lim_{n→∞} G(θ_{n−1}) = G(θ_∞).

Because θ_∞ ≤ ρ is a fixed point of G, the previous claim implies θ_∞ = ρ.

Finally, {Z_m = 0} ⊆ {Z_{m+1} = 0} for all m, so continuity from below implies

P(Z_n = 0 for some n) = P(⋃_{m=1}^∞ {Z_m = 0}) = lim_{m→∞} P(Z_m = 0) = lim_{m→∞} θ_m = ρ < 1,

and the proof is complete.
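The proof also yields a practical recipe for computing the extinction probability: by Claim (c), iterating θ ↦ G(θ) from θ_0 = 0 converges to ρ. A minimal sketch (the offspring distribution below is an arbitrary choice for illustration):

def extinction_probability(p, iterations=200):
    """Iterate theta -> G(theta) starting from 0; p[k] = P(offspring = k)."""
    G = lambda s: sum(pk * s**k for k, pk in enumerate(p))
    theta = 0.0
    for _ in range(iterations):
        theta = G(theta)
    return theta

# Offspring distribution p0 = 1/4, p1 = 1/4, p2 = 1/2, so mu = 5/4 > 1.
# G(s) = s has roots 1/2 and 1, and the iteration finds rho = 0.5.
print(extinction_probability([0.25, 0.25, 0.5]))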
4. L^p Convergence

In order to obtain an L^p version of Theorem 2.8, we begin with a generalization of Theorem 2.7.

Theorem 4.1. If X_n is a submartingale, M and N are stopping times with M ≤ N, and there is a k ∈ N with P(N ≤ k) = 1, then E[X_M] ≤ E[X_N].

Proof. Since {M < n ≤ N} = {M ≤ n − 1} ∩ {N ≤ n − 1}^C ∈ F_{n−1}, K_n = 1{M < n ≤ N} ≥ 0 is predictable, hence

(K · X)_n = ∑_{m=1}^n 1{M < m ≤ N}(X_m − X_{m−1}) = ∑_{m=M∧n+1}^{N∧n} (X_m − X_{m−1}) = X_{N∧n} − X_{M∧n}

is a submartingale. Consequently,

E[X_N] − E[X_M] = E[X_{N∧k}] − E[X_{M∧k}] = E[(K · X)_k] ≥ E[(K · X)_0] = 0.
In particular, we have

Corollary 4.1. If X_n is a submartingale and N is a stopping time with P(N ≤ k) = 1 for some k ∈ N, then E[X_0] ≤ E[X_N] ≤ E[X_k].

Proof. For the first inequality, take M = 0 in Theorem 4.1. For the second, take M = N, N = k.
Our next step in proving the L^p martingale convergence theorem is

Theorem 4.2 (Doob's Inequality). Let X_n be a submartingale, X̃_n = max_{0≤m≤n} X_m^+, λ > 0, and A = {X̃_n ≥ λ}. Then

λP(A) ≤ E[X_n 1_A] ≤ E[X_n^+ 1_A] ≤ E[X_n^+].

Proof. Let N = inf{m : X_m ≥ λ} ∧ n. If ω ∈ A, then there is a smallest m ≤ n such that λ ≤ X_m(ω) = X_{N(ω)}(ω). Also, since X_N = X_n on A^C, Corollary 4.1 implies

E[X_N 1_A] + E[X_n 1_{A^C}] = E[X_N 1_A] + E[X_N 1_{A^C}] = E[X_N] ≤ E[X_n] = E[X_n 1_A] + E[X_n 1_{A^C}].

Thus, as in the proof of Chebychev's inequality, we have

λP(A) = E[λ1_A] ≤ E[X_N 1_A] ≤ E[X_n 1_A].

The other inequalities are trivial since X_n 1_A ≤ X_n^+ 1_A ≤ X_n^+.
A tangential application of Doob's inequality is

Theorem 4.3 (Kolmogorov's Maximal Inequality). If ξ_1, ξ_2, ... are independent with E[ξ_i] = 0 and Var(ξ_i) ∈ (0, ∞), then S_n = ∑_{i=1}^n ξ_i satisfies

P(max_{1≤m≤n} |S_m| ≥ x) ≤ x^{−2} Var(S_n) for all x > 0.

Proof. S_n^2 is a submartingale (by convexity), so taking λ = x^2 in Theorem 4.2 gives

x^2 P(max_{1≤m≤n} |S_m| ≥ x) = x^2 P(max_{1≤m≤n} S_m^2 ≥ x^2) ≤ E[S_n^2] = Var(S_n).
More important for the task at hand is

Theorem 4.4 (L^p Maximum Inequality). If X_n is a submartingale, then for all 1 < p < ∞,

E[X̃_n^p] ≤ (p/(p−1))^p E[(X_n^+)^p].

Proof. For any fixed M > 0, we have

{X̃_n ∧ M ≥ λ} = {X̃_n ≥ λ} if M ≥ λ, and {X̃_n ∧ M ≥ λ} = ∅ if M < λ,

so the layer cake representation, Doob's inequality, Tonelli's theorem, and Hölder's inequality give

E[(X̃_n ∧ M)^p] = ∫_0^∞ pλ^{p−1} P(X̃_n ∧ M ≥ λ) dλ
  = ∫_0^M pλ^{p−1} P(X̃_n ≥ λ) dλ
  ≤ ∫_0^M pλ^{p−1} (1/λ) E[X_n^+ 1{X̃_n ≥ λ}] dλ
  = ∫ X_n^+ ∫_0^{X̃_n ∧ M} pλ^{p−2} dλ dP
  = (p/(p−1)) ∫ X_n^+ (X̃_n ∧ M)^{p−1} dP
  = (p/(p−1)) E[X_n^+ (X̃_n ∧ M)^{p−1}]
  ≤ (p/(p−1)) E[(X_n^+)^p]^{1/p} E[(X̃_n ∧ M)^p]^{(p−1)/p}.

Dividing through by E[(X̃_n ∧ M)^p]^{(p−1)/p} > 0 shows that

E[(X̃_n ∧ M)^p]^{1/p} ≤ (p/(p−1)) E[(X_n^+)^p]^{1/p},

and thus

E[(X̃_n ∧ M)^p] ≤ (p/(p−1))^p E[(X_n^+)^p].

Sending M → ∞ and invoking monotone convergence completes the proof.
Since |Y_n| is a submartingale whenever Y_n is a martingale, we have

Corollary 4.2. If Y_n is a martingale and Y_n* = max_{0≤m≤n} |Y_m|, then

E[(Y_n*)^p] ≤ (p/(p−1))^p E[|Y_n|^p].
With the maximum inequality behind us, it is a small step to show

Theorem 4.5 (L^p Convergence). If 1 < p < ∞ and X_n is a martingale with sup_n E[|X_n|^p] < ∞, then X_n → X a.s. and in L^p.

Proof. E[(X_n^+)^p] ≤ E[|X_n|^p] ≤ sup_n E[|X_n|^p] < ∞, so the martingale convergence theorem implies X_n → X a.s. Also,

E[(max_{0≤m≤n} |X_m|)^p] ≤ (p/(p−1))^p E[|X_n|^p]

by Corollary 4.2, so letting n → ∞ and using monotone convergence shows that sup_n |X_n| ∈ L^p. Because |X_n − X|^p ≤ (2 sup_n |X_n|)^p, dominated convergence gives E[|X_n − X|^p] → 0.
Uniform Integrability.

It remains to consider the p = 1 case. Essentially, this is just the Vitali convergence theorem, but we will go ahead and do everything carefully.

Definition. A collection of random variables {X_i}_{i∈I} is said to be uniformly integrable if

lim_{M→∞} sup_{i∈I} E[|X_i| 1{|X_i| > M}] = 0.

Uniform integrability implies uniform L^1 bounds since we can take M large enough that the supremum in the definition is less than 1, say, and thus conclude that

sup_{i∈I} E[|X_i|] = sup_{i∈I} (E[|X_i| 1{|X_i| ≤ M}] + E[|X_i| 1{|X_i| > M}])
  ≤ sup_{i∈I} E[|X_i| 1{|X_i| ≤ M}] + sup_{i∈I} E[|X_i| 1{|X_i| > M}] ≤ M + 1 < ∞.

However, one should observe that uniform integrability is a less restrictive assumption than domination by an integrable function since if X ≥ 0 is integrable and |X_i| ≤ X for all i ∈ I, then |X_i| 1{|X_i| > M} ≤ |X| 1{|X| > M} for all i ∈ I, so monotonicity and dominated convergence give

lim_{M→∞} sup_{i∈I} E[|X_i| 1{|X_i| > M}] ≤ lim_{M→∞} E[|X| 1{|X| > M}] = 0.

As an example of a u.i. family which is not dominated by an integrable random variable, consider Lebesgue measure on (0, 1) and X_n(ω) = ω^{−1} 1_{(1/(n+1), 1/n)}(ω).

Any finite collection of integrable random variables is clearly uniformly integrable, but countable collections need not be. For example, if X_n = n 1_{(0, 1/n)} on [0, 1] with Lebesgue measure, then E[|X_n| 1{|X_n| > M}] = 1 for all n > M. (This also shows that uniform integrability is a stronger assumption than uniform L^1 bounds.)

Our next theorem shows that one can have very large u.i. families.

Theorem 4.6. Given a probability space (Ω, F, P) and a random variable X ∈ L^1, the collection {E[X |G] : G is a sub-σ-algebra of F} is uniformly integrable.
Proof. We first note that since X ∈ L^1, if A_n is a sequence of events with P(A_n) → 0, then E[|X| 1_{A_n}] → 0 by the DCT for convergence in probability. It follows that for every ε > 0, there is a δ > 0 such that P(A) < δ implies E[|X| 1_A] < ε - if not, there exists an ε > 0 and a sequence A_1, A_2, ... with P(A_n) < 1/n and E[|X| 1_{A_n}] ≥ ε, a contradiction.

Now let ε > 0 be given and choose δ as above. Taking M > E|X|/δ, it follows from Jensen's inequality that for any G ⊆ F,

E[|E[X |G]| 1{|E[X |G]| > M}] ≤ E[E[|X| |G] 1{|E[X |G]| > M}]
  ≤ E[E[|X| |G] 1{E[|X| |G] > M}]
  = E[|X| 1{E[|X| |G] > M}],

where the final equality follows from the definition of conditional expectation by observing that {E[|X| |G] > M} ∈ G.

Using Chebychev's inequality, we have

P(E[|X| |G] > M) ≤ E[E[|X| |G]]/M = E|X|/M < δ.

Thus,

E[|E[X |G]| 1{|E[X |G]| > M}] ≤ E[|X| 1{E[|X| |G] > M}] < ε

for every G ⊆ F, and the result follows since ε was arbitrary.
Uniform integrability is the condition needed to upgrade convergence in probability to convergence in L^1.

Theorem 4.7. If X_n →_p X, then the following are equivalent:
(i): {X_n}_{n=0}^∞ is uniformly integrable.
(ii): X_n → X in L^1.
(iii): E|X_n| → E|X| < ∞.
Proof. Assume {X_n} is u.i. and let

φ_M(x) = M if x > M; x if |x| ≤ M; −M if x < −M.

Then

|X_n − X| ≤ |X_n − φ_M(X_n)| + |φ_M(X_n) − φ_M(X)| + |φ_M(X) − X|
  = (|X_n| − M)^+ + |φ_M(X_n) − φ_M(X)| + (|X| − M)^+
  ≤ |X_n| 1{|X_n| > M} + |φ_M(X_n) − φ_M(X)| + |X| 1{|X| > M},

and thus

E|X_n − X| ≤ E[|X_n| 1{|X_n| > M}] + E|φ_M(X_n) − φ_M(X)| + E[|X| 1{|X| > M}].

Since X_n →_p X and φ_M is bounded and continuous, E|φ_M(X_n) − φ_M(X)| → 0. (Convergence in probability is preserved by continuous functions, and the convergence in probability generalization of bounded convergence then applies.)

Uniform integrability ensures that the first term can be made less than any given ε > 0 by choosing M sufficiently large. It also ensures that sup_n E|X_n| < ∞, so the convergence in probability version of Fatou's lemma implies that E|X| < ∞, and thus M can be taken large enough that the third term is less than ε as well.

Therefore, given ε > 0, we can choose M so that

E|X_n − X| < ε + E|φ_M(X_n) − φ_M(X)| + ε → 2ε as n → ∞,

hence (i) implies (ii).

That (ii) implies (iii) is standard:

|E|X_n| − E|X|| ≤ E||X_n| − |X|| ≤ E|X_n − X| → 0.

Finally, suppose that E|X_n| → E|X| < ∞ and define

ψ_M(x) = x if 0 ≤ x ≤ M − 1; (M − 1)(M − x) if M − 1 < x < M; 0 otherwise.

Given ε > 0, one can choose M large enough that E|X| − E[ψ_M(|X|)] < ε by the DCT. Also, as in the first part of the proof, X_n →_p X and ψ_M(|·|) ∈ C_b implies E[ψ_M(|X_n|)] → E[ψ_M(|X|)]. Thus there is an N ∈ N such that for all n ≥ N, |E|X_n| − E|X||, |E[ψ_M(|X_n|)] − E[ψ_M(|X|)]| < ε, hence

E[|X_n| 1{|X_n| > M}] ≤ E|X_n| − E[ψ_M(|X_n|)]
  ≤ |E|X_n| − E|X|| + (E|X| − E[ψ_M(|X|)]) + |E[ψ_M(|X|)] − E[ψ_M(|X_n|)]| < 3ε.

By increasing M if need be, we can ensure that E[|X_n| 1{|X_n| > M}] < 3ε for the (finitely many) n < N, and uniform integrability is established.
We are finally able to provide necessary and sufficient conditions for L^1 convergence.

Theorem 4.8. For a submartingale, the following are equivalent.
(i): It is uniformly integrable.
(ii): It converges a.s. and in L^1.
(iii): It converges in L^1.

Proof. Uniform integrability implies that sup_n E|X_n| < ∞, so the martingale convergence theorem implies X_n → X a.s., and Theorem 4.7 implies X_n → X in L^1, hence (i) implies (ii). As (ii) tautologically implies (iii), it remains only to show that L^1 convergence implies uniform integrability. But this is also a simple consequence of Theorem 4.7 since X_n → X in L^1 implies X_n →_p X.

When X_n is a martingale, we can actually say a little bit more.
Theorem 4.9. For a martingale, the following are equivalent.
(i): It is uniformly integrable.
(ii): It converges a.s. and in L^1.
(iii): It converges in L^1.
(iv): There is an X ∈ L^1 with X_n = E[X |F_n].

Proof. Since martingales are submartingales, (i) ⇒ (ii) ⇒ (iii) by Theorem 4.8.

Now assume that X_n → X in L^1. Then for all A ∈ F,

|E[X_n 1_A] − E[X 1_A]| = |E[(X_n − X) 1_A]| ≤ E|(X_n − X) 1_A| ≤ E|X_n − X| → 0,

hence E[X_n 1_A] → E[X 1_A]. Also, since X_n is a martingale, we have E[X_n |F_m] = X_m for all m < n. Thus if A ∈ F_m, then

E[X_m 1_A] = E[E[X_n |F_m] 1_A] = E[E[X_n 1_A |F_m]] = E[X_n 1_A].

Putting these facts together shows that E[X_m 1_A] = E[X 1_A] for all A ∈ F_m, which, by definition of conditional expectation, implies that X_m = E[X |F_m].

Since Theorem 4.6 shows that (iv) ⇒ (i), the chain is complete.

The nontrivial part of Theorem 4.9 (conditional on preceding results) was the conclusion that if a martingale X_n converges to a random variable X in L^1, then X_n = E[X |F_n]. Our next result can be seen as a variation on this theme.
Theorem 4.10 (Lévy's Forward Theorem). Suppose that F_n ↗ F_∞ - i.e. {F_n} is a filtration and F_∞ = σ(⋃_n F_n). Then for any integrable X,

E[X |F_n] → E[X |F_∞] a.s. and in L^1.

Proof. We showed in Example 2.4 that X_n = E[X |F_n] is a martingale, and Theorem 4.6 implies it is uniformly integrable. It follows, therefore, from Theorem 4.9 that E[X |F_n] converges to a limit X_∞ a.s. and in L^1, and that

E[X |F_n] = X_n = E[X_∞ |F_n].

By definition of conditional expectation, this means that for all A ∈ F_n,

∫_A X dP = ∫_A X_n dP = ∫_A X_∞ dP.

Our claim that X_∞ = E[X |F_∞] will follow if we can show that the above holds for all A ∈ F_∞. But this is a consequence of the π−λ theorem since P = ⋃_n F_n is a π-system which generates F_∞ and is contained in the λ-system L = {A : ∫_A X dP = ∫_A X_∞ dP}. (P is a π-system since {F_n} is a filtration, and L is a λ-system since X, X_∞ are integrable.)
An immediate consequence of Theorem 4.10 is

Theorem 4.11 (Lévy's 0−1 Law). If F_n ↗ F_∞ and A ∈ F_∞, then P(A |F_n) → 1_A a.s.

Though Theorem 4.11 may seem trivial, it should be noted that it implies Kolmogorov's 0−1 law: If X_1, X_2, ... are independent and A belongs to the tail field T = ⋂_n σ(X_n, X_{n+1}, ...), then 1_A is independent of each F_n = σ(X_1, ..., X_n), so P(A |F_n) = P(A). As T ⊆ σ(X_1, X_2, ...) = F_∞, Theorem 4.11 implies that this converges to 1_A a.s., so we must have P(A) = 1_A a.s., hence P(A) ∈ {0, 1}.
5. Optional Stopping

We round out our discussion of smartingales with a look at optional sampling theorems, of which we have already seen one example: If X_n is a submartingale and M, N are stopping times with M ≤ N ≤ k a.s., then

E[X_0] ≤ E[X_M] ≤ E[X_N] ≤ E[X_k].

We now look into what kind of conditions on X_n allow us to reach similar conclusions for unbounded stopping times.
Lemma 5.1. If X_n is a uniformly integrable submartingale and N is any stopping time, then X_{N∧n} is uniformly integrable.

Proof. X_m^+ is a submartingale and N ∧ n ≤ n is a stopping time, so E[X_{N∧n}^+] ≤ E[X_n^+] by Corollary 4.1. Since X_m^+ is u.i., we have

sup_n E[X_{N∧n}^+] ≤ sup_n E[X_n^+] < ∞,

so the martingale convergence theorem implies X_{N∧n} → X_N a.s. (where X_∞ = lim_{n→∞} X_n) and E|X_N| < ∞.

Because E|X_N| < ∞ and {X_n} is u.i., for any ε > 0, we can choose K so that

E[|X_N| ; |X_N| > K], sup_n E[|X_n| ; |X_n| > K] < ε/2,

hence

E[|X_{N∧n}| ; |X_{N∧n}| > K] = E[|X_n| ; |X_n| > K, N > n] + E[|X_N| ; |X_N| > K, N ≤ n]
  ≤ sup_n E[|X_n| ; |X_n| > K] + E[|X_N| ; |X_N| > K] < ε

for all n, showing that {X_{N∧n}} is u.i.

It is worth observing that the final sentence in the preceding proof establishes

Corollary 5.1. If E|X_N| < ∞ and X_n 1{N > n} is u.i., then X_{N∧n} is u.i.
The utility of Lemma 5.1 is that it enables us to prove

Theorem 5.1. If X_n is a uniformly integrable submartingale, then for any stopping time N ≤ ∞, E[X_0] ≤ E[X_N] ≤ E[X_∞] where X_∞ = lim_{n→∞} X_n.

Proof. Corollary 4.1 gives E[X_0] ≤ E[X_{N∧n}] ≤ E[X_n] for all n. Since X_n and X_{N∧n} are u.i., the L^1 martingale convergence theorem implies E[X_{N∧n}] → E[X_N] and E[X_n] → E[X_∞], so the result follows by taking n → ∞ in the above inequality.

Observe that if X_n is our double after losing martingale and N = inf{n : X_n = 1}, then X_N = 1 ≠ 0 = X_0. Thus the conclusion of Theorem 5.1 need not hold if X_n is not u.i.

More generally, Theorem 5.1 shows that any such system will fail if we only have finite credit, because then the corresponding martingale would be bounded and thus uniformly integrable.
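A quick simulation sketch of the double-after-losing system with a finite credit line (hypothetical parameters) illustrates the point: the success probability is strictly less than one.

import random

def double_or_bust(credit, trials=100_000, seed=4):
    """Double the stake after each loss of a fair bet; quit when up 1 unit."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        fortune, stake = 0, 1
        while stake <= fortune + credit:     # can we cover the next bet?
            if rng.random() < 0.5:
                fortune += stake             # win: net fortune is now +1
                break
            fortune -= stake                 # lose: double and try again
            stake *= 2
        wins += (fortune == 1)
    return wins / trials

# With credit 100, at most 6 doublings are affordable, so the success
# probability is 1 - 2**-6, not 1.
print(double_or_bust(credit=100))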
A useful consequence of Theorem 5.1 is

Theorem 5.2. If L ≤ M are stopping times and Y_{M∧n} is a u.i. submartingale, then

E[Y_0] ≤ E[Y_L] ≤ E[Y_M] and Y_L ≤ E[Y_M |F_L].

Proof. The first claim follows from Theorem 5.1 by setting X_n = Y_{M∧n}, N = L, so that

E[Y_0] = E[X_0], E[Y_L] = E[Y_{M∧L}] = E[X_N], E[Y_M] = lim_{n→∞} E[Y_{M∧n}] = E[X_∞].

Now for A ∈ F_L, let N = L on A and N = M on A^C. This is a stopping time since L ≤ M implies A ∈ F_L ⊆ F_M, hence {N = n} = (A ∩ {L = n}) ∪ (A^C ∩ {M = n}) ∈ F_n.

As N ≤ M by construction, the preceding shows that

E[Y_L ; A] + E[Y_M ; A^C] = E[Y_N] ≤ E[Y_M] = E[Y_M ; A] + E[Y_M ; A^C],

hence

E[Y_L ; A] ≤ E[Y_M ; A] = E[E[Y_M 1_A |F_L]] = E[E[Y_M |F_L] ; A].

Taking A = A_ε := {Y_L − E[Y_M |F_L] ≥ ε} shows that P(A_ε) = 0 for all ε > 0, and the desired result follows.
We have shown that optional stopping holds for bounded stopping times and for bounded smartingales (as bounded implies uniformly integrable), and noted that these results show that you need infinite time and credit, respectively, to ensure victory in an unfavorable game. Our last optional stopping theorem lies somewhere in between these two and shows that another way for casinos to guard against length of play strategies is to place caps on bets.
Theorem 5.3. Suppose X_n is a submartingale with E[|X_{n+1} − X_n| |F_n] ≤ B a.s. If N is a stopping time with E[N] < ∞, then X_{N∧n} is u.i. and thus E[X_N] ≥ E[X_0].

Proof. Since

|X_{N∧n}| = |X_0 + ∑_{m=0}^{n−1} (X_{m+1} − X_m) 1{N > m}| ≤ |X_0| + ∑_{m=0}^∞ |X_{m+1} − X_m| 1{N > m},

it will follow that X_{N∧n} is u.i. once we show that the right-hand side is integrable.

To see this, observe that {N > m} = {N ≤ m}^C ∈ F_m, so it follows from our assumptions that

E[|X_{m+1} − X_m| 1{N > m}] = E[E[|X_{m+1} − X_m| |F_m] 1{N > m}] ≤ E[B 1{N > m}] = B P(N > m),

and thus

E[|X_0| + ∑_{m=0}^∞ |X_{m+1} − X_m| 1{N > m}] ≤ E|X_0| + ∑_{m=0}^∞ B P(N > m) = E|X_0| + B E[N] < ∞.
Our final optional stopping theorem applies to arbitrary stopping times and requires that the smartingale be a.s. bounded in the appropriate direction.

Theorem 5.4. If X_n is a nonnegative supermartingale and N is a stopping time, then E[X_0] ≥ E[X_N].

Proof. Corollary 2.6 shows that X_∞ = lim_{n→∞} X_n exists, so X_N is well-defined. Also, E[X_0] ≥ E[X_{N∧n}] for all n ∈ N by Theorem 2.7.

Since monotone convergence implies

E[X_N ; N < ∞] = lim_{n→∞} E[X_N ; N ≤ n]

and Fatou's lemma implies

E[X_N ; N = ∞] ≤ lim inf_{n→∞} E[X_n ; N > n],

we have

E[X_0] ≥ lim inf_{n→∞} E[X_{N∧n}] = lim inf_{n→∞} (E[X_n ; N > n] + E[X_N ; N ≤ n]) ≥ E[X_N].
Example 5.1 (Gambler's Ruin). Suppose that in successive flips of an unfair coin, we win one dollar if the coin comes up heads and lose one dollar if it comes up tails. If we start with nothing, then our fortune at time n behaves like asymmetric simple random walk, S_n = ∑_{i=1}^n ξ_i where ξ_1, ξ_2, ... are i.i.d. with P(ξ_i = 1) = 1 − P(ξ_i = −1) = p.

Clearly S_n is a submartingale w.r.t. F_n = σ(ξ_1, ..., ξ_n) if p > 1/2 and a supermartingale if p < 1/2. Since we don't want to have to consider the cases separately, we look at X_n = ((1−p)/p)^{S_n}, which we claim is a martingale: X_n is F_n-measurable, and

E[X_{n+1} |F_n] = E[((1−p)/p)^{S_n} ((1−p)/p)^{ξ_{n+1}} |F_n]
  = ((1−p)/p)^{S_n} E[((1−p)/p)^{ξ_{n+1}}]
  = X_n (p · (1−p)/p + (1−p) · p/(1−p)) = X_n.

Now let's suppose we decide beforehand that we will quit once we lose a dollars or win b dollars, whichever happens first. That is, we walk away at time N = T_{−a} ∧ T_b where T_x = inf{m : S_m = x}. A natural question is how likely is it the game ends in ruin, r = P(T_{−a} < T_b).

We first note that S_{N∧n} is uniformly bounded and thus is uniformly integrable. It follows from the L^1 martingale convergence theorem that S_{N∧n} has a limit almost surely and in L^1. As convergence to a point in (−a, b) is impossible, it must be the case that N < ∞ a.s.

That S_{N∧n} is bounded also implies X_{N∧n} = ((1−p)/p)^{S_{N∧n}} is u.i., so we can apply the martingale form of Theorem 5.1 to conclude that

1 = E[X_0] = E[X_N] = ((1−p)/p)^{−a} r + ((1−p)/p)^b (1 − r)
  = r [((1−p)/p)^{−a} − ((1−p)/p)^b] + ((1−p)/p)^b.

The probability of ruin is thus

r = [1 − ((1−p)/p)^b] / [((1−p)/p)^{−a} − ((1−p)/p)^b].
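One can sanity-check the formula by simulation; a minimal sketch (the values of p, a, b below are hypothetical):

import random

def ruin_probability(p, a, b, trials=100_000, seed=0):
    """Estimate P(hit -a before b) for the p-biased walk started at 0."""
    rng = random.Random(seed)
    ruins = 0
    for _ in range(trials):
        s = 0
        while -a < s < b:
            s += 1 if rng.random() < p else -1
        ruins += (s == -a)
    return ruins / trials

p, a, b = 0.48, 5, 5
q = (1 - p) / p
exact = (1 - q**b) / (q**(-a) - q**b)
print(ruin_probability(p, a, b), exact)   # both about 0.60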
Example 5.2 (Patterns in Coin Tossing). We are interested in the expected waiting time for the first occurrence of HTH in a sequence of independent tosses of a fair coin. More formally, suppose that X_1, X_2, ... are i.i.d. with P(X_i = H) = P(X_i = T) = 1/2, and set τ_HTH = inf{t ≥ 3 : X_{t−2} = H, X_{t−1} = T, X_t = H}. We want to compute E[τ_HTH].

Somewhat surprisingly, our analysis is simplified by incorporating a casino and an army of gamblers into the model. The setup is as follows. The casino offers even odds on successive tosses of a fair coin (X_1, X_2, ...). Gamblers arrive one at a time with the kth gambler joining the game just before X_k is observed and placing a $1 bet on heads. If X_k = T, he loses his dollar and quits playing. If X_k = H, the casino pays him $2, which he then wagers on X_{k+1} = T. If he loses, then he walks away with nothing and is out his initial stake. Otherwise, he bets his $4 fortune on X_{k+2} = H. Regardless of the outcome, he quits after this round.

Since the game is fair, the casino's net profit from the kth round, ξ_k, has E[ξ_k] = 0. It follows that the casino's profit from the first n rounds, M_n = ∑_{k=1}^n ξ_k, is a martingale w.r.t. σ(X_1, ..., X_n). (ξ_k ∈ σ(X_1, ..., X_k).)

Also, M_{τ_HTH} = τ_HTH − 10 because each of the τ_HTH gamblers paid a $1 entrance fee, and all gamblers except the (τ_HTH − 2)nd (who has $8) and the τ_HTH th (who has $2) walked away with nothing.

If we can show that optional stopping applies, then it will follow that 0 = E[M_0] = E[M_{τ_HTH}] = E[τ_HTH] − 10.

To see that this is so, we first note that τ_HTH is stochastically dominated by 3Y where Y ∼ Geom(1/8), hence E[τ_HTH] < ∞, thus E|M_{τ_HTH}| ≤ E[τ_HTH] + 10 < ∞.

Also, since |M_n| ≤ 7n (as none of the n gamblers have a net loss or gain exceeding 7), we see that

∫ |M_n 1{τ_HTH > n}| dP ≤ ∫ 7n 1{τ_HTH > n} dP = 7n P(τ_HTH > n) ≤ 7n P(Y > n/3) = 7n (7/8)^{⌊n/3⌋} → 0

as n → ∞. It follows that M_n 1{τ_HTH > n} is u.i., so Corollary 5.1 shows that M_{τ_HTH ∧ n} is u.i., hence Theorem 5.2 implies E[M_{τ_HTH}] = E[M_0], and we conclude that E[τ_HTH] = 10.

The exact same logic applies for words of different lengths and alphabets of various sizes. It is interesting to note that if you carry out this analysis for the sequence HHH, then you find that E[τ_HHH] = 8 + 4 + 2 = 14, so coin toss patterns of the same length can have different expected times to first occurrence.
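Both expectations are easy to confirm by simulation; a minimal sketch:

import random

def mean_wait(pattern, trials=200_000, seed=0):
    """Monte Carlo estimate of E[tau_pattern] for fair coin flips."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        window, t = "", 0
        while window != pattern:
            window = (window + rng.choice("HT"))[-len(pattern):]
            t += 1
        total += t
    return total / trials

print(mean_wait("HTH"), mean_wait("HHH"))   # approximately 10 and 14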
6. Markov Chains

In words, a Markov chain is a random process in which the future depends on the past only through the present. More formally,

Definition. Let (S, S) be a measurable space. A sequence of S-valued random variables X_0, X_1, X_2, ... on a filtered probability space (Ω, F, {F_n}, P) is said to be a Markov chain with respect to F_n if X_n ∈ F_n and for all B ∈ S,

P(X_{n+1} ∈ B |F_n) = P(X_{n+1} ∈ B |X_n).

S is called the state space of the chain, and the law of X_0 is called the initial distribution.

* We take it as implicit in the name Markov chain (as opposed to Markov process) that we are working in discrete time.
As the above definition is fairly difficult to work with directly, we introduce the following useful construct.

Definition. A function p : S × S → R is called a transition probability if
(1) For each x ∈ S, A ↦ p(x, A) is a probability measure on (S, S),
(2) For each A ∈ S, x ↦ p(x, A) is a measurable function.

We say that X_n is a Markov chain (w.r.t. F_n) with transition probabilities p_n if

P(X_{n+1} ∈ B |F_n) = p_n(X_n, B).

If (S, S) is nice or S is countable, there is no loss of generality in supposing the existence of transition probabilities, as we are then assured of the existence of a r.c.d. for X_{n+1} given σ(X_n):

µ_n(ω, B) = P(X_{n+1} ∈ B |X_n) = P(X_{n+1} ∈ B |F_n).

The last problem on the first homework shows that we can take µ_n(ω, B) = p_n(X_n(ω), B) for some transition probability p_n.
Conversely, if we are given an initial distribution µ on (S, S) and a sequence of transition probabilities p_0, p_1, ..., we can define a consistent sequence of finite dimensional distributions by

ν_n(X_0 ∈ B_0, X_1 ∈ B_1, ..., X_n ∈ B_n) = ∫_{B_0} µ(dx_0) ∫_{B_1} p_0(x_0, dx_1) ··· ∫_{B_n} p_{n−1}(x_{n−1}, dx_n).

If (S, S) is nice, Kolmogorov's extension theorem guarantees the existence of a measure on the sequence space (S^{N_0}, S^{N_0}) such that the coordinate maps X_n(ω) = ω_n have the desired distributions.

* Note that the extension theorem also applies when S is countable since we can then identify S with a subset of Z ⊆ R.

Through a slight abuse of notation, when µ = δ_x is the point mass at x, we will write P_x for P_{δ_x}. It is worth observing that the family of measures {P_x}_{x∈S} is fundamental in the sense that P_µ(A) = ∫_S P_x(A) µ(dx) for any initial distribution µ on (S, S).
To see that the construction from Kolmogorov's theorem defines a Markov chain with respect to F_n = σ(X_0, X_1, ..., X_n) having transition probabilities {p_n}, we need to prove that

∫_A 1_B(X_{n+1}) dP_µ = ∫_A p_n(X_n, B) dP_µ

for all n ∈ N_0, A ∈ F_n, B ∈ S. Since the collection of (n+1)-cylinders is a π-system which generates F_n, a π−λ argument shows that it suffices to consider A = {X_0 ∈ B_0, X_1 ∈ B_1, ..., X_n ∈ B_n} with B_i ∈ S. We compute

∫_A 1_B(X_{n+1}) dP_µ = P_µ(A, X_{n+1} ∈ B) = P_µ(X_0 ∈ B_0, X_1 ∈ B_1, ..., X_n ∈ B_n, X_{n+1} ∈ B)
  = ∫_{B_0} µ(dx_0) ∫_{B_1} p_0(x_0, dx_1) ··· ∫_{B_n} p_{n−1}(x_{n−1}, dx_n) ∫_B p_n(x_n, dx_{n+1})
  = ∫_{B_0} µ(dx_0) ∫_{B_1} p_0(x_0, dx_1) ··· ∫_{B_n} p_{n−1}(x_{n−1}, dx_n) p_n(x_n, B).

To finish up, we reconstruct the integral: If f(x_n) = 1_C(x_n), then

∫_{B_0} µ(dx_0) ∫_{B_1} p_0(x_0, dx_1) ··· ∫_{B_n} p_{n−1}(x_{n−1}, dx_n) 1_C(x_n) = P_µ(X_0 ∈ B_0, ..., X_n ∈ B_n ∩ C)
  = P_µ(A, X_n ∈ C) = ∫_A 1_C(X_n) dP_µ.

Linearity shows the result holds for f(x_n) simple, and the bounded convergence theorem extends it to bounded measurable f, such as f(x_n) = p_n(x_n, B).
In summary, we have shown that

Theorem 6.1. If (S, S) is nice (or S is countable), then for any distribution µ and any sequence of transition probabilities p_0, p_1, ..., there exists an S-valued Markov chain {X_n} such that X_0 ∼ µ and

P(X_{n+1} ∈ B |X_0, X_1, ..., X_n) = p_n(X_n, B).

In order to verify that the preceding Kolmogorov construction is indeed the right one, we prove

Theorem 6.2. Any Markov chain X_n on (S, S) having initial distribution µ and transition probabilities p_n has finite dimensional distributions satisfying

P(X_0 ∈ B_0, X_1 ∈ B_1, ..., X_n ∈ B_n) = ∫_{B_0} µ(dx_0) ∫_{B_1} p_0(x_0, dx_1) ··· ∫_{B_n} p_{n−1}(x_{n−1}, dx_n)

for all n ∈ N_0, B_0, ..., B_n ∈ S.
To this end, we first record the following useful consequence of the π−λ theorem.

Theorem 6.3 (Functional Monotone Class Theorem). Let A be a π-system containing Ω and let H be a collection of functions f : Ω → R which satisfies
(i) If A ∈ A, then 1_A ∈ H.
(ii) If f, g ∈ H and c ∈ R, then f + g and cf are in H.
(iii) If f_1, f_2, ... ∈ H are nonnegative with f_n ↗ f, then f ∈ H.

Then H contains all functions which are measurable with respect to σ(A).
Proof. L = {A : 1_A ∈ H} is a λ-system: (i) implies 1_Ω ∈ H; (ii) implies 1_{B\C} = 1_B − 1_C ∈ H if B, C ∈ L with C ⊆ B; and (iii) implies 1_A = lim_n 1_{A_n} ∈ H if A_n ∈ L with A_n ↗ A. As the π-system A is contained in L, it follows from the π−λ theorem that σ(A) ⊆ L, thus H contains all indicators of events in σ(A). It then follows from (ii) that H contains all simple functions and from (iii) that it contains all nonnegative σ(A)-measurable functions. Taking positive and negative parts and using (ii) gives the result.

(Often, condition (iii) only supposes that f ∈ H when f_n ↗ f with f bounded, and the conclusion is that H contains all bounded σ(A)-measurable functions. The argument is the same, and that version can be more convenient when H is defined in terms of expectations.)
Proof of Theorem 6.2. The proof of Theorem 6.1 shows that

H = {f : S → R such that f is bounded and E[f(X_{n+1}) |F_n] = ∫ p_n(X_n, dy) f(y)}

satisfies the conditions of Theorem 6.3 with A = S, hence

E[f(X_{n+1}) |F_n] = ∫ p_n(X_n, dy) f(y)

for all n ∈ N_0 and all bounded measurable f. Accordingly, for any bounded measurable f_0, ..., f_n,

E[∏_{m=0}^n f_m(X_m)] = E[E[∏_{m=0}^n f_m(X_m) |F_{n−1}]]
  = E[∏_{m=0}^{n−1} f_m(X_m) · E[f_n(X_n) |F_{n−1}]]
  = E[∏_{m=0}^{n−1} f_m(X_m) · ∫ p_{n−1}(X_{n−1}, dy) f_n(y)].

Since ∫ p_{n−1}(X_{n−1}, dy) f_n(y) is a bounded measurable function of X_{n−1}, it follows by induction that if µ = L(X_0), then

E[∏_{m=0}^n f_m(X_m)] = ∫ µ(dx_0) f_0(x_0) ∫ p_0(x_0, dx_1) f_1(x_1) ··· ∫ p_{n−1}(x_{n−1}, dx_n) f_n(x_n),

establishing Theorem 6.2.
The preceding results show that we can describe a Markov chain X_n by specifying the transition probabilities p_n. In practice, the transition probabilities are the fundamental objects for analyzing Markov chains.

Given transition probabilities, we can assume that the X_n's are the coordinate maps on the sequence space (S^{N_0}, S^{N_0}). This construction gives us a measure P_µ for each initial distribution µ, which makes X_n a Markov chain with transition probabilities p_n. It also enables us to work with the shift operators (θ_n ω)_i = ω_{i+n}.

To keep things simple, we will restrict our attention henceforth to temporally homogeneous Markov chains, in which there is a single transition probability p = p_0 = p_1 = ...
Example 6.1 (Random Walk). Let ξ_1, ξ_2, ... ∈ R^d be i.i.d. random vectors with distribution µ, and define X_n = x_0 + ξ_1 + ... + ξ_n. Then X_n defines a Markov chain with initial distribution δ_{x_0} and transition probability p(x, B) = µ(B − x).

If the ξ_k's are independent but not identically distributed, we still get a Markov chain, but it's no longer time homogeneous.
All other examples of Markov chains that we will consider will have countable state space S = {s_1, s_2, ...} equipped with the σ-algebra S = 2^S. In this case, the transition probabilities are specified by functions of the form p : S × S → [0, 1] with ∑_{t∈S} p(s, t) = 1 for all s ∈ S. (The corresponding transition probability p̃ : S × S → R is given by p̃(x, B) = ∑_{y∈B} p(x, y).) The interpretation is p(s, t) = P(X_{n+1} = t |X_n = s).
Example 6.2 (Branching Processes). Let ξ_1, ξ_2, ... ∈ N_0 be i.i.d. The Galton-Watson process can be viewed as a Markov chain on N_0 with transition function p(i, j) = P(∑_{k=1}^i ξ_k = j). (If the current population size is i, and individual k has ξ_k offspring, the next generation will have population size ∑_{k=1}^i ξ_k.)

Example 6.3 (Birth and Death Chains). Birth and death chains are defined by the condition X_n ∈ N_0 with |X_n − X_{n+1}| ≤ 1. In terms of transition probabilities, this means that p_n = p(n, n+1), r_n = p(n, n), and q_n = p(n, n−1), with p_0 + r_0 = 1 and p_k + r_k + q_k = 1 for k ≥ 1. One can think of the associated chains as giving population sizes in successive generations in which at most one birth or death can occur per generation.
Example 6.4 (M/G/1 Queue). The M/G/1 queue is a model of line lengths at a service station. M stands for Markovian (or memoryless) and means that customers arrive according to a rate λ Poisson process; G indicates that the service times follow a general distribution F; and 1 is because there is a single server. We assume that the line can be arbitrarily long and that the priority is first come, first serve.

The time steps correspond to new customers being served, so that X_n is the length of the queue when customer n begins their service. X_0 = x is the number of people in line when service opens with customer 0.

Let a_k = ∫_0^∞ e^{−λt} (λt)^k / k! dF(t) be the probability that k customers arrive during a service time, and let ξ_n denote the net number of customers to enter the queue during the service of customer n, keeping in mind that customer n completed their service in this time period. Our assumptions imply that ξ_0, ξ_1, ... are i.i.d. with P(ξ_n = k − 1) = a_k, and we have X_{n+1} = (X_n + ξ_n)^+. We take positive parts because if there is no one waiting when the nth customer begins their service (X_n = 0) and no customers arrive during this service time (ξ_n = −1), then the next queue length is 0 since we don't start counting until the next customer arrives and begins service.

It is not difficult to see that X_n defines a Markov chain with transition probabilities

p(0, 0) = a_0 + a_1, p(j, j + k − 1) = a_k if j ≥ 1 or k > 1.
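A minimal simulation sketch of the queue-length chain (assuming NumPy; the arrival rate and service distribution below are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(5)

def mg1_queue_lengths(lam, service_time, steps, x0=0):
    """Queue lengths via X_{n+1} = (X_n + xi_n)^+ with xi_n = arrivals - 1."""
    x, path = x0, [x0]
    for _ in range(steps):
        # Arrivals during a service of length T are Poisson(lam * T).
        arrivals = rng.poisson(lam * service_time())
        x = max(x + arrivals - 1, 0)
        path.append(int(x))
    return path

# Exponential service times with mean 0.8 and arrival rate 1 (hypothetical).
print(mg1_queue_lengths(1.0, lambda: rng.exponential(0.8), 20))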
Example 6.5 (Random Walks on Graphs). Let G = (V, E) be a simple undirected graph with vertex set V and edge set E. For u, v ∈ V, write u ∼ v if {u, v} ∈ E, and let deg(v) = ∑_{u∈V} 1{u ∼ v} be the degree of v. Assume that sup_{v∈V} deg(v) < ∞. A simple random walk on G proceeds by moving from the present vertex to a neighbor chosen uniformly at random - that is,

p(u, v) = 1{v ∼ u} / deg(u).

More generally, suppose that G⃗ = (V, E⃗) is a directed graph (possibly containing self-loops) and let w : E⃗ → [0, ∞). One can define a random walk on V by

p(u, v) = w({u, v}) / ∑_{x : {u,x} ∈ E⃗} w({u, x}).

In fact, every Markov chain on a countable state space S can be interpreted as a random walk on the directed graph having vertices indexed by S, edge set {{u, v} : p(u, v) > 0}, and edge weights w({u, v}) = p(u, v).

Example 6.6 (Random Walks on Groups/Card Shuffling). Any probability µ on a countable group G induces a random walk via X_{n+1} = g_{n+1} X_n where g_1, g_2, ... are chosen independently from µ. The Markov chain is defined by p(g, h) = µ(hg^{−1}).

For example, let G = (Z/2Z)^d, and let µ(x) = 1/d if x has exactly one coordinate equal to one, µ(x) = 0 otherwise. The associated Markov chain, X_n, is equivalent to simple random walk on the hypercube. If we define Y_n = ‖X_n‖ where ‖x‖ = |{i ∈ [d] : x_i = 1}|, then one can verify that Y_n is a Markov chain. In fact, Y_n is equivalent to the Ehrenfest chain mentioned in Durrett (Example 6.2.5).

As another example, let G = S_n, and let µ(τ) = (n choose 2)^{−1} 1{τ = (ij)} be the uniform distribution on transpositions. We can think of permutations as representing arrangements of a deck of cards: σ ∈ S_n corresponds to the ordering in which σ(k) is the label of the kth card from the top. (Equivalently, the card labeled l is in position σ^{−1}(l).) Left-multiplying σ by τ = (ij) corresponds to interchanging card i and card j in the deck:

τ ∘ σ(k) = σ(k) if σ(k) ∉ {i, j}; τ ∘ σ(k) = j if σ(k) = i; τ ∘ σ(k) = i if σ(k) = j.

Thus we can think of the random walk in terms of repeatedly shuffling the deck by randomly transposing pairs of cards.

In card shuffling applications, it is often more convenient to multiply on the right - so X_{n+1} = X_n σ, p(σ, π) = µ(σ^{−1} π) - as we typically want shuffles to act on positions rather than labels. For the random transposition case, right-multiplying σ by τ = (ij) corresponds to interchanging the card in position i with the card in position j:

σ ∘ τ(k) = σ(k) if k ∉ {i, j}; σ ∘ τ(k) = σ(j) if k = i; σ ∘ τ(k) = σ(i) if k = j.

In this example, the two conventions are essentially equivalent, but typically there is a distinction. Consider shuffling by removing the top card and inserting it in a random position. Here we need to multiply on the right by permutations distributed according to µ(σ) = (1/n) 1{σ = (1 ··· k) for some k ∈ [n]}. Left multiplication by (1 ··· k) would correspond to replacing the card labeled k with the card labeled 1 and the card labeled j with that labeled j + 1 for j < k. This requires looking at the cards.

The inverse of this top-to-random shuffle - namely, placing a randomly chosen card on the top of the deck - corresponds to right-multiplication by a cycle of the form (k ··· 1) = (1 ··· k)^{−1} with k chosen uniformly from [n].

Note that the left-invariant walk (having kernel p(x, y) = µ(x^{−1} y)) transforms into the right-invariant walk driven by µ̌(g) = µ(g^{−1}) under the anti-automorphism x ↦ x^{−1}, so it suffices to stick with one convention for developing theory and then translate the results when a particular model is better suited to the other choice.
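As a concrete illustration of shuffles acting on positions, here is a minimal sketch of the top-to-random shuffle on a small deck (deck size and seed are arbitrary choices):

import random

def top_to_random(deck, rng):
    """One top-to-random shuffle: remove the top card, reinsert uniformly."""
    card = deck.pop(0)
    deck.insert(rng.randrange(len(deck) + 1), card)
    return deck

rng = random.Random(3)
deck = list(range(1, 11))   # card labels 1..10, index 0 is the top
for _ in range(5):
    deck = top_to_random(deck, rng)
print(deck)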
When S = {s_1, ..., s_n} is finite (as in some of the latter examples), the transition probabilities can be encoded in a transition matrix K ∈ M_n(R) defined by K_{i,j} = p(s_i, s_j). One nice thing about finite state space Markov chains is that one can often prove powerful results by using ideas from linear algebra.

Some texts have different conventions regarding the indexing of transition matrices. For us, the x, y-entry represents the probability of moving from x to y in a single step. It follows that probability distributions are represented by row vectors and functions by column vectors. That is, if K is the transition matrix for X_n, µ is a probability measure on (S, S), and f : S → R, then

(µK)(y) = ∑_{x∈S} µ(x) K(x, y) = P(X_{n+1} = y |X_n ∼ µ)

and

(Kf)(x) = ∑_{y∈S} K(x, y) f(y) = E[f(X_{n+1}) |X_n = x].

For countably infinite state spaces, the transition matrix is infinite, so not all linear algebraic results carry over directly. However, the operator perspective is still convenient.

For example, Theorem 6.2 implies that

P_µ(X_0 = x_0, X_1 = x_1, ..., X_n = x_n) = µ(x_0) ∏_{m=1}^n p(x_{m−1}, x_m).

When n = 1, we have

P_µ(X_1 = y) = ∑_x P_µ(X_0 = x, X_1 = y) = ∑_x µ(x) p(x, y) = (µp)(y).

When n = 2, µ = δ_x,

P_x(X_2 = z) = ∑_y P_x(X_1 = y, X_2 = z) = ∑_y p(x, y) p(y, z) = p^2(x, z).

It follows by induction that

P_x(X_n = z) = ∑_y P_x(X_{n−1} = y, X_n = z) = ∑_y p^{n−1}(x, y) p(y, z) = p^n(x, z),

and thus

P_µ(X_n = z) = ∑_x µ(x) P_x(X_n = z) = ∑_x µ(x) p^n(x, z) = (µp^n)(z).
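In code, these conventions amount to matrix-vector products. A minimal NumPy sketch with an arbitrary three-state chain:

import numpy as np

# Rows index the current state; K[x, y] = p(x, y), and rows sum to 1.
K = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])

mu = np.array([1.0, 0.0, 0.0])     # initial distribution (row vector)
f = np.array([0.0, 1.0, 2.0])      # a function on states (column vector)

print(mu @ K)                                # distribution after one step
print(K @ f)                                 # x -> E[f(X_{n+1}) | X_n = x]
print(mu @ np.linalg.matrix_power(K, 10))    # distribution after ten steps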
7. Extensions of the Markov Property

The Markov property reads E[1_B(X_{n+1}) |F_n] = E[1_B(X_{n+1}) |X_n] for all B ∈ S. By the usual argument of approximating by simple functions, we see that this is equivalent to E[h(X_{n+1}) |F_n] = E[h(X_{n+1}) |X_n] for all bounded measurable h.

In this section, we show that for discrete time Markov chains, the Markov property extends to the whole future and to random times. Specifically, we have that for all bounded measurable h and all n ∈ N_0,

E[h(X_n, X_{n+1}, ...) |F_n] = E[h(X_n, X_{n+1}, ...) |X_n].

Moreover, if N is a stopping time, then the above holds with N in place of n when we restrict to the event {N < ∞}.

We will assume throughout that the X_n are coordinate maps on the sequence space (Ω, F) = (S^{N_0}, S^{N_0}) and that F_n = σ(X_0, X_1, ..., X_n). For each probability measure µ on (S, S), we write P_µ for the measure on (Ω, F) that makes X_n a Markov chain with initial distribution µ and transition probability p. Also, we recall that the maps θ_n : Ω → Ω act by shifting coordinates: (θ_n ω)_i = ω_{i+n}.
We begin by showing that the Markov property is not limited to a single time step.

Theorem 7.1. If Y : Ω → R is bounded and measurable, then

E_µ[Y ∘ θ_n |F_n] = E_{X_n}[Y]

where the subscript on the left indicates that the conditional expectation is taken with respect to P_µ, and the expression on the right is φ(x) = E_x[Y] evaluated at x = X_n.

Proof. Let A = {ω : ω_0 ∈ A_0, ..., ω_n ∈ A_n} for some A_0, ..., A_n ∈ S and let g_0, ..., g_m : (S, S) → (R, B) be bounded and measurable. Applying the formula

E_µ[∏_{i=0}^N f_i(X_i)] = ∫ µ(dx_0) f_0(x_0) ∫ p(x_0, dx_1) f_1(x_1) ··· ∫ p(x_{N−1}, dx_N) f_N(x_N)

with f_k = 1_{A_k} for k < n, f_n = 1_{A_n} g_0, and f_{n+j} = g_j for 1 ≤ j ≤ m, we have

E_µ[∏_{k=0}^m g_k(X_{n+k}) ; A] = ∫_{A_0} µ(dx_0) ∫_{A_1} p(x_0, dx_1) ··· ∫_{A_n} p(x_{n−1}, dx_n)
    · g_0(x_n) ∫ p(x_n, dx_{n+1}) g_1(x_{n+1}) ··· ∫ p(x_{m+n−1}, dx_{m+n}) g_m(x_{m+n})
  = E_µ[E_{X_n}[∏_{k=0}^m g_k(X_k)] ; A].

The collection of sets A for which this holds is a λ-system, and the collection of sets for which it has been proved is a π-system that generates F_n, so the π−λ theorem shows that it is true for all A ∈ F_n. Thus if Y(ω) = ∏_{k=0}^m g_k(ω_k) for g_0, ..., g_m bounded and measurable, then

∫_A Y ∘ θ_n dP_µ = ∫_A ∏_{k=0}^m g_k(X_{n+k}) dP_µ = E_µ[∏_{k=0}^m g_k(X_{n+k}) ; A]
  = E_µ[E_{X_n}[∏_{k=0}^m g_k(X_k)] ; A] = ∫_A E_{X_n}[∏_{k=0}^m g_k(X_k)] dP_µ = ∫_A E_{X_n}[Y] dP_µ

for all A ∈ F_n, hence E_µ[Y ∘ θ_n |F_n] = E_{X_n}[Y] for all such Y.

To complete the proof, we observe that the collection A of events of the form {ω_0 ∈ A_0, ..., ω_k ∈ A_k} is a π-system that generates S^{N_0}. The set of functions H = {Y : E_µ[Y ∘ θ_n |F_n] = E_{X_n}[Y]} contains the indicators of events in A (take g_k = 1_{A_k} in the preceding), and is certainly closed under sums, scalar multiples, and increasing limits, so Theorem 6.3 implies that H contains all bounded measurable Y.

Thus, conditional on F_n, X_n, X_{n+1}, ... has the same distribution as a copy of the chain started at X_n. Markov chains are forgetful; they start fresh at every step. It should be noted that Theorem 7.1 depends on the assumption of time homogeneity.

In general, writing Y(ω) = h(ω_0, ω_1, ...), so that Y ∘ θ_n = h(X_n, X_{n+1}, ...), the proof of Theorem 7.1 gives

E[h(X_n, X_{n+1}, ...) |F_n] = E[h(X_n, X_{n+1}, ...) |X_n].
Another interpretation of this statement of the Markov property is that the past and the future are conditionally independent given the present:

Corollary 7.1. If A ∈ σ(X_0, ..., X_n) and B ∈ σ(X_n, X_{n+1}, ...), then for any initial distribution µ,

P_µ(A ∩ B |X_n) = P_µ(A |X_n) P_µ(B |X_n).

Proof. By Theorem 7.1 and basic properties of conditional expectation,

P_µ(A ∩ B |X_n) = E_µ[1_A 1_B |X_n] = E_µ[E_µ[1_A 1_B |F_n] |X_n]
  = E_µ[1_A E_µ[1_B |F_n] |X_n] = E_µ[1_A E_µ[1_B |X_n] |X_n]
  = E_µ[1_A |X_n] E_µ[1_B |X_n] = P_µ(A |X_n) P_µ(B |X_n).
A useful application of the Markov property is the following intuitive decomposition result for expressing multi-step transition probabilities in terms of convolution (matrix multiplication):

Proposition 7.1 (Chapman-Kolmogorov). If X_n is time homogeneous with discrete state space, then

P_x(X_{m+n} = z) = ∑_y P_x(X_m = y) P_y(X_n = z).

Proof. Since 1_{{z}}(X_{m+n}(ω)) = 1_{{z}} ∘ X_n ∘ θ_m(ω), Theorem 7.1 shows that

P_x(X_{m+n} = z) = E_x[1_{{z}}(X_{m+n})] = E_x[E_x[1_{{z}} ∘ X_n ∘ θ_m |F_m]]
  = E_x[E_{X_m}[1_{{z}} ∘ X_n]] = E_x[P_{X_m}(X_n = z)] = ∑_y P_x(X_m = y) P_y(X_n = z).
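In matrix form, Chapman-Kolmogorov is just p^{m+n} = p^m p^n, which is easy to verify numerically (an arbitrary two-state chain for illustration):

import numpy as np

# The (x, z) entry of K^(m+n) equals sum_y K^m[x, y] K^n[y, z].
K = np.array([[0.5, 0.5], [0.3, 0.7]])
m, n = 3, 4
lhs = np.linalg.matrix_power(K, m + n)
rhs = np.linalg.matrix_power(K, m) @ np.linalg.matrix_power(K, n)
print(np.allclose(lhs, rhs))   # True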
Our second extension is known as the Strong Markov Property, which generalizes the original definition by replacing deterministic times with stopping times.

Recall that if N is a stopping time with respect to a filtration F_n, then the stopped σ-algebra F_N consists of all events A ∈ F such that A ∩ {N = n} ∈ F_n for all n. Also, remember that we defined random shifts by

θ_N ω = θ_n ω on {N = n}, and θ_N ω = ∆ on {N = ∞},

where ∆ is an extra point we add to Ω for convenience. In what follows, we will restrict our attention to {N < ∞}, so this extra point need not concern us.

Theorem 7.2 (Strong Markov Property). Let N be a stopping time and suppose that Y : Ω → R is bounded and measurable. Then E_µ[Y ∘ θ_N |F_N] = E_{X_N}[Y] on {N < ∞}.

Proof. For any A ∈ F_N,

E_µ[Y ∘ θ_N ; A ∩ {N < ∞}] = ∑_{n=0}^∞ E_µ[Y ∘ θ_n ; A ∩ {N = n}]
  = ∑_{n=0}^∞ E_µ[E_{X_n}[Y] ; A ∩ {N = n}]
  = E_µ[E_{X_N}[Y] ; A ∩ {N < ∞}],

and the result follows by definition of conditional expectation.

The above proof is representative of many results for discrete stopping times - one sums over possible values of N and then applies existing results to the summands. This trick doesn't work in continuous time and the corresponding theorems can be much less trivial.

While every discrete time Markov process has the strong Markov property, the two notions do not necessarily coincide in continuous time. (For example, B_t 1{B_0 ≠ 0} is Markov but not strong Markov.)
8. Classifying States

We will restrict our attention henceforth to chains having a countable state space.

Define T_y^0 = 0 and T_y^k = inf{n > T_y^{k−1} : X_n = y} (where inf ∅ = ∞), so that T_y^k is the time of the kth visit to y at positive times. Set T_y = T_y^1 and let ρ_xy = P_x(T_y < ∞) be the probability that the chain started at x visits y in finitely many steps.
Theorem 8.1. P_x(T_y^k < ∞) = ρ_xy ρ_yy^{k−1}.

Proof. The result follows from the definition of ρ_xy when k = 1, so we can assume that k ≥ 2.

Now let Y(ω) = 1{ω_n = y for some n ∈ N} = 1{T_y < ∞}. Setting N = T_y^{k−1}, we have

Y ∘ θ_N = 1{ω_n = y for some n > T_y^{k−1}} = 1{T_y^k < ∞} on {N < ∞}.

Also, E_x[Y ∘ θ_N |F_N] = E_{X_N}[Y] on {N < ∞} by the strong Markov property. As X_N = y on {N < ∞}, the right-hand side is E_{X_N}[Y] = E_y[Y] = P_y(T_y < ∞) = ρ_yy.

Thus, since 1{N < ∞} ∈ F_N, we have

P_x(T_y^k < ∞) = E_x[1{T_y^k < ∞}] = E_x[Y ∘ θ_N ; N < ∞]
  = E_x[E_x[Y ∘ θ_N |F_N] ; N < ∞]
  = E_x[ρ_yy ; N < ∞] = ρ_yy P_x(T_y^{k−1} < ∞),

and the result follows by induction.
Definition. We say that y ∈ S is a recurrent state if ρ_yy = 1 and a transient state if ρ_yy < 1. If every x ∈ S is a recurrent state, we say that the chain is recurrent.

If y is recurrent, then Theorem 8.1 shows that P_y(T_y^k < ∞) = ρ_yy^k = 1 for all k ∈ N, hence P_y(X_n = y i.o.) = 1.

If y is transient and we let N(y) = ∑_{k=1}^∞ 1{X_k = y} be the number of visits to y at positive times, then

E_x[N(y)] = ∑_{k=1}^∞ P_x(N(y) ≥ k) = ∑_{k=1}^∞ P_x(T_y^k < ∞) = ∑_{k=1}^∞ ρ_xy ρ_yy^{k−1} = ρ_xy ∑_{k=0}^∞ ρ_yy^k = ρ_xy / (1 − ρ_yy) < ∞.

Combining these observations gives

Theorem 8.2. y is recurrent if and only if E_y[N(y)] = ∞.
We say that
accessible from
x,
y
is
we say that
accessible from x if ρxy > 0.
and y communicate.
x
51
If
x = y
or
x
is accessible from
y
and
y
is
Our next theorem says that if x is recurrent, then so are all states accessible from x. (Recurrence is contagious.)

Theorem 8.3. If x is recurrent and ρ_xy > 0, then y is recurrent and ρ_yx = 1.

Proof. We first prove that ρ_yx = 1 by showing that ρ_xy > 0 and ρ_yx < 1 implies ρ_xx < 1.

Let K = inf{k : p^k(x, y) > 0}. (K < ∞ since ρ_xy > 0.) Then there is a sequence y_1, ..., y_{K−1} such that p(x, y_1) p(y_1, y_2) ··· p(y_{K−1}, y) > 0. Moreover, since K is minimal, y_k ≠ x for k = 1, ..., K − 1. If ρ_yx < 1, then we have

1 − ρ_xx = P_x(T_x = ∞) ≥ p(x, y_1) p(y_1, y_2) ··· p(y_{K−1}, y)(1 − ρ_yx) > 0,

a contradiction.

To prove that y is recurrent, we note that ρ_yx > 0 implies that there is an L with p^L(y, x) > 0. Because p^{L+n+K}(y, y) ≥ p^L(y, x) p^n(x, x) p^K(x, y), we see that

E_y[N(y)] = ∑_{k=1}^∞ E_y[1{X_k = y}] = ∑_{k=1}^∞ P_y(X_k = y) = ∑_{k=1}^∞ p^k(y, y) ≥ ∑_{n=1}^∞ p^{L+n+K}(y, y)
  ≥ p^L(y, x) p^K(x, y) ∑_{n=1}^∞ p^n(x, x) = p^L(y, x) p^K(x, y) E_x[N(x)] = ∞,

so y is recurrent by Theorem 8.2.

Theorem 8.3 allows us to conclude that states accessible from x are recurrent provided that we already know x is recurrent. It is useful at this point to introduce the following definitions.

Definition. A set D ⊆ S is called a communicating class if all states in D communicate: x, y ∈ D implies x = y or ρ_xy > 0. A communicating class in which every state is accessible from itself (ρ_xx > 0 for all x ∈ D) is called irreducible. (In general, a set B ⊆ S is irreducible if ρ_xy > 0 for all x, y ∈ B.) If S itself is irreducible, we say that the chain is irreducible.
Proposition 8.1. The communicating classes partition the state space.

Proof. "Communicates with" is reflexive and symmetric by definition, thus we need only establish transitivity. This is trivial if x, y, z are not all distinct, so (because of symmetry) it suffices to show that ρ_xy, ρ_yz > 0 implies ρ_xz > 0. But this is a simple consequence of the strong Markov property since

P_x(T_z ∘ θ_{T_y} < ∞ |F_{T_y}) = P_y(T_z < ∞) on {T_y < ∞},

thus

ρ_xz = P_x(T_z < ∞) ≥ P_x(T_z ∘ θ_{T_y} < ∞ ; T_y < ∞)
  = E_x[P_x(T_z ∘ θ_{T_y} < ∞ |F_{T_y}) ; T_y < ∞]
  = E_x[P_y(T_z < ∞) ; T_y < ∞]
  = P_y(T_z < ∞) P_x(T_y < ∞) = ρ_yz ρ_xy > 0.

(Or you could just use p^{K+L}(x, z) ≥ p^K(x, y) p^L(y, z)...)

Note that communicating classes consisting of a single element, x, are not necessarily irreducible as it may be the case that ρ_xx = 0.
However, the proof of transitivity shows that every class containing more than one state is irreducible. Also, any communicating class containing a recurrent state is irreducible as it either contains multiple elements or consists of a single recurrent state.

More importantly, Theorem 8.3 shows that recurrence is a class property - either every element in a communicating class is recurrent or every element is transient. Thus if we have identified a communicating class, we can check recurrence for all members simultaneously by testing a single element. In many cases, this is simplified by the next result.
Definition. A set C ⊆ S is said to be closed if it contains all points accessible from any of its elements: x ∈ C and ρ_xy > 0 implies y ∈ C. The reason for the name is that if C is closed and x ∈ C, then P_x(X_n ∈ C) = 1 - there is no escaping C.

If a singleton {z} is closed, we say that the state z is absorbing.
Theorem 8.4. If C is a finite closed set, then it contains a recurrent state. If, in addition, C is a communicating class, then all states in C are recurrent.

Proof. It suffices to prove the first statement as the second then follows from Theorem 8.3.

To this end, suppose that C is a finite closed set with ρ_yy < 1 for all y ∈ C. Then E_x[N(y)] = ρ_xy/(1 − ρ_yy) < ∞ for all x, y ∈ C, so, since C is finite, we have the contradiction that for any x ∈ C,

∞ > ∑_{y∈C} E_x[N(y)] = ∑_{y∈C} ∑_{n=1}^∞ p^n(x, y) = ∑_{n=1}^∞ ∑_{y∈C} p^n(x, y) = ∑_{n=1}^∞ 1 = ∞,

where the penultimate equality follows from the fact that C is closed.
Corollary 8.1. Every irreducible Markov chain on a finite state space is recurrent.

Proof. S is closed.
Theorem 8.3 provides a simple test of transience: If there exists a y ∈ S such that ρ_xy > 0 and ρ_yx < 1, then [x] is transient.

Theorem 8.4 gives a similar recurrence test for states in a finite communicating class: If |[x]| < ∞ and ρ_xy > 0 implies ρ_yx > 0, then [x] is recurrent. (The assumptions imply that [x] is closed since ρ_xw > 0 implies ρ_wx > 0, hence w ∈ [x]; and y ∈ [x] \ {x} and ρ_yz > 0 implies ρ_xz ≥ ρ_xy ρ_yz > 0, so ρ_zx > 0 as well, hence z ∈ [x].)
To recap, we can partition the state space into communicating classes, each of which is either recurrent or transient. All classes containing at least two states (as well as certain single-state classes, such as those consisting of an absorbing state) are irreducible. Communicating classes are not necessarily closed, but Theorem 8.3 shows that any communicating class containing a recurrent element is both closed and irreducible. It follows from Proposition 8.1 that the set of recurrent states R = {x ∈ S : ρ_xx = 1} can be expressed as a disjoint union of closed and irreducible communicating classes.
We conclude this discussion by considering recurrence/transience behavior in some concrete examples.
Example 8.1 (Random Walks on Finite Groups). If G is a finite group and X_n is a Markov chain on G with transition function p(g, h) = µ(hg^{−1}) for some probability µ on G, then X_n is irreducible if and only if the support of µ, Σ = {g ∈ G : µ(g) > 0}, generates G.

If so, then for any r, s ∈ G, there exist g_{i_1}, ..., g_{i_k} ∈ Σ such that g_{i_k} ··· g_{i_1} = sr^{−1}, hence ρ_rs ≥ p^k(r, s) ≥ µ(g_{i_1}) ··· µ(g_{i_k}) > 0. If not, there is some g ∈ G which cannot be expressed as a finite product of terms in Σ, so ρ_eg = 0 where e is the identity in G.

Whether or not the chain is irreducible, all states are recurrent. If X_n is not irreducible, then Σ generates a proper subgroup H < G. The communicating classes are precisely the right cosets of H, and they are all closed.
Example 8.2 (Branching Processes). If the offspring distribution has mass p_0 > 0 at zero (i.e. there is positive probability that an individual has no children), then for every k ≥ 1, ρ_k0 ≥ p_0^k > 0. Since ρ_0k = 0 for all such k, we see that every state k ≥ 1 is transient. 0 is recurrent because p(0, 0) = 1.
Example 8.3 (Birth and Death Chains on N_0). Denote p(i, i+1) = p_i, p(i, i) = r_i, p(i, i−1) = q_i, where q_0 = 0 and p_{k−1}, q_k > 0 for k ≥ 1. (The latter condition ensures that the chain is irreducible.) Let N = inf{n ≥ 0 : X_n = 0}.

We will define a function φ : N_0 → R so that φ(X_{N∧n}) is a martingale. We begin by imposing the conditions φ(0) = 0 and φ(1) = 1. For the martingale property to hold when X_n = k ≥ 1, we must have

φ(k) = p_k φ(k+1) + r_k φ(k) + q_k φ(k−1),

or

(p_k + q_k)φ(k) = (1 − r_k)φ(k) = p_k φ(k+1) + q_k φ(k−1).

Dividing by p_k and rearranging gives

φ(k+1) − φ(k) = (q_k/p_k)(φ(k) − φ(k−1)).

Since φ(1) − φ(0) = 1, we have

φ(m+1) − φ(m) = ∏_{j=1}^m (q_j/p_j)

for m ≥ 1, hence for n ≥ 1

φ(n) = φ(1) + ∑_{m=1}^{n−1} (φ(m+1) − φ(m)) = 1 + ∑_{m=1}^{n−1} ∏_{j=1}^m (q_j/p_j) = ∑_{m=0}^{n−1} ∏_{j=1}^m (q_j/p_j),

where we adopt the convention that the empty product equals 1.

Now for any a < x < b, let T = T_a ∧ T_b where T_z = inf{n ≥ 1 : X_n = z}. Since φ(X_{T∧n}) is a bounded martingale and X_T ∈ {a, b} P_x-a.s., optional stopping gives

φ(x) = E_x[φ(X_T)] = φ(b) P_x(T_a > T_b) + φ(a)[1 − P_x(T_a > T_b)],

hence

P_x(T_a > T_b) = (φ(x) − φ(a)) / (φ(b) − φ(a)).

Taking a = 0, b = M gives

P_x(T_0 > T_M) = φ(x)/φ(M),

so letting M → ∞ and observing that T_M ≥ M − x, we see that 0 is recurrent if and only if

φ(M) = ∑_{m=0}^{M−1} ∏_{j=1}^m (q_j/p_j) → ∞

as M → ∞. (φ is increasing and thus has a limit φ(∞) ∈ [1, ∞]. If φ(∞) = ∞, then P_1(T_0 = ∞) = 0, hence 0 is recurrent as the chain started at 0 is either at 0 or at 1 after one time step. If φ(∞) < ∞, then P_1(T_0 = ∞) = 1/φ(∞) > 0, so as long as p_0 > 0, there is positive probability that the chain started at 0 never returns.)
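The recurrence criterion can be tested numerically by computing partial sums of φ; a minimal sketch with two hypothetical choices of the ratios q_j/p_j:

def phi(M, ratio):
    """phi(M) = sum_{m=0}^{M-1} prod_{j=1}^{m} q_j/p_j (empty product = 1)."""
    total, prod = 0.0, 1.0
    for m in range(M):
        if m > 0:
            prod *= ratio(m)
        total += prod
    return total

# q_j/p_j = j/(j+2): the products telescope to 2/((m+1)(m+2)), so
# phi(infinity) = 2 < infinity and the chain is transient.
print(phi(10**5, lambda j: j / (j + 2)))
# q_j/p_j = 1 (symmetric case): phi(M) = M -> infinity, so 0 is recurrent.
print(phi(10**5, lambda j: 1.0))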
9. Stationary Measures

Definition. If X_n is a Markov chain with state space (S, S) and transition function p, we say that a measure µ ≢ 0 on (S, S) is a stationary measure for X_n if it satisfies the equilibrium equation

µ(y) = ∑_{x∈S} µ(x) p(x, y)

for every y ∈ S. If µ is a probability measure, it is called a stationary distribution.

Of course, if a stationary measure µ is finite, then π(y) = µ(y)/∑_x µ(x) is a stationary distribution.

If π is a stationary distribution for X_n, then the equilibrium equation reads P_π(X_1 = y) = π(y). Using the Markov property and induction, it follows that P_π(X_n = y) = π(y) for all n, hence the name stationary.

From the transition operator perspective, the equilibrium equation reads µp = µ, so a stationary measure is a left eigenfunction with eigenvalue 1.

Example 9.1 (Random Walk on Z^d). Here p(x, y) = f(y − x) where f ≥ 0 and ∑_z f(z) = 1. In this case, µ ≡ 1 is a stationary measure since

µ(y) = 1 = ∑_z f(z) = ∑_x 1 · f(y − x) = ∑_x µ(x) p(x, y).

Of course ν ≡ k is also a stationary measure, and in general, any positive multiple of a stationary measure is stationary.
Note that ν ≡ k is not a stationary distribution since ∑_x ν(x) = ∞.

Example 9.2 (Asymmetric Simple Random Walk). Here S = Z and p(x, x+1) = p, p(x, x−1) = q = 1 − p. The preceding example shows that µ ≡ 1 is a stationary measure. If p ≠ q, another stationary measure is given by ν(x) = (p/q)^x since

∑_x ν(x) p(x, y) = ν(y−1) p(y−1, y) + ν(y+1) p(y+1, y)
  = (p/q)^{y−1} p + (p/q)^{y+1} q = p^y/q^y · q + p^y/q^y · p = (p^y/q^y)(q + p) = (p/q)^y = ν(y).
Example 9.3 (Random Walks on a Finite Group). Suppose that G is a finite group and X_n is a Markov chain on G with transition probabilities p(x, y) = f(yx^{−1}) for some probability f on G. Then π ≡ 1/|G| is a stationary distribution since

π(y) = 1/|G| = (1/|G|) ∑_{g∈G} f(g) = (1/|G|) ∑_{x∈G} f(yx^{−1}) = ∑_{x∈G} π(x) p(x, y).
Example 9.4 (Birth and Death Chains). Here S = N_0, p(i, i+1) = p_i, p(i, i) = r_i, p(i, i−1) = q_i, with q_0 = 0 and p(x, y) = 0 for |x − y| > 1.

The measure µ(k) = ∏_{j=1}^k (p_{j−1}/q_j) has

µ(i) p(i, i+1) = p_i ∏_{j=1}^i (p_{j−1}/q_j) = q_{i+1} ∏_{j=1}^{i+1} (p_{j−1}/q_j) = µ(i+1) p(i+1, i).

Since p(x, y) = 0 if |x − y| > 1, this implies that µ satisfies the detailed balance equations:

µ(x) p(x, y) = µ(y) p(y, x) for all x, y ∈ S.

Summing over x, we have

∑_x µ(x) p(x, y) = µ(y) ∑_x p(y, x) = µ(y),

so any such measure is stationary.
If $X_n$ has a stationary distribution which satisfies the detailed balance equations, we say that $X_n$ is reversible. To see the reason for the nomenclature, observe that if $X_n$ is reversible with respect to $\pi$, then
$$P_\pi(X_0 = x_0, X_1 = x_1, \ldots, X_n = x_n) = \pi(x_0) p(x_0, x_1) p(x_1, x_2) \cdots p(x_{n-1}, x_n)$$
$$= p(x_1, x_0) \pi(x_1) p(x_1, x_2) \cdots p(x_{n-1}, x_n)$$
$$= p(x_2, x_1) p(x_1, x_0) \pi(x_2) \cdots p(x_{n-1}, x_n)$$
$$= \ldots = \pi(x_n) p(x_n, x_{n-1}) \cdots p(x_1, x_0)$$
$$= P_\pi(X_0 = x_n, X_1 = x_{n-1}, \ldots, X_n = x_0),$$
thus in stationarity, $(X_0, X_1, \ldots, X_n) =_d (X_n, \ldots, X_1, X_0)$.

Random walks on countable groups (like the first three examples) are reversible with respect to counting measure precisely when $f(g) = f(g^{-1})$ for all $g \in G$.
Example 9.5 (Simple Random Walk on a Finite Graph). If $G = (V, E)$ is a simple, undirected graph with $|V| < \infty$ and $X_n$ is a Markov chain on $V$ with transition probabilities $p(x, y) = \frac{1}{\deg(x)} 1\{x \sim y\}$, then
$$\pi(x) = \frac{\deg(x)}{2|E|}$$
is a reversible stationary distribution for $X_n$ since
$$\pi(x) p(x, y) = \frac{1}{2|E|} 1\{x \sim y\} = \pi(y) p(y, x).$$
$\pi$ is a probability measure on $V$ by the handshaking lemma, $\sum_{x \in V} \deg(x) = 2|E|$, which follows by observing that if we orient the edges in any way, then $|E| = \sum_x \deg^+(x) = \sum_x \deg^-(x)$ (where $\deg^+$ and $\deg^-$ denote the outdegree and indegree, respectively), hence $2|E| = \sum_x \left( \deg^+(x) + \deg^-(x) \right) = \sum_x \deg(x)$.
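Both detailed balance and stationarity are easy to confirm by direct computation. The sketch below does so for a small hypothetical graph (the adjacency lists are illustrative, not from the text).

# Verify that pi(x) = deg(x) / (2|E|) satisfies detailed balance and the
# equilibrium equation for simple random walk on a small graph.
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
E2 = sum(len(nbrs) for nbrs in graph.values())        # = 2|E| by handshaking
pi = {x: len(graph[x]) / E2 for x in graph}
p = lambda x, y: 1 / len(graph[x]) if y in graph[x] else 0.0

for x in graph:
    for y in graph:
        assert abs(pi[x] * p(x, y) - pi[y] * p(y, x)) < 1e-12   # detailed balance
    assert abs(sum(pi[w] * p(w, x) for w in graph) - pi[x]) < 1e-12  # stationarity
print("detailed balance and stationarity hold")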
Reversibility is a very convenient feature for Markov chains to have, but as the term detailed balance
suggests, it is much less generic than one might infer from the preceding examples.
For instance, the M/G/1 queue has no reversible measures since if
x > y+1, then p(x, y) = 0, but p(y, x) > 0.
We will see shortly that there is a more complicated potential obstruction to reversibility in the case of
irreducible Markov chains, but rst we present a lemma which is useful in its own right.
Lemma 9.1. If p is irreducible and µ is a stationary measure for p, then µ(x) > 0 for all states x.
Proof. Since $\mu \not\equiv 0$, there must be some $x_0$ with $\mu(x_0) > 0$. Assume that $N = \{y : \mu(y) = 0\} \neq \emptyset$, and let $x_0, x_1, \ldots, x_k$ be a sequence of minimal length such that $p(x_{i-1}, x_i) > 0$ for each $i = 1, \ldots, k$ and $x_k \in N$. Such a sequence exists by irreducibility. Since $x_k \in N$, we have that
$$0 = \mu(x_k) = \sum_{w} \mu(w) p(w, x_k) \geq \mu(x_{k-1}) p(x_{k-1}, x_k).$$
It follows that $\mu(x_{k-1}) = 0$, contradicting minimality.

(Alternatively, for any $y \in S$, there is an $n$ with $p^n(x_0, y) > 0$, so $\mu(y) = (\mu p^n)(y) \geq \mu(x_0) p^n(x_0, y) > 0$.)

Theorem 9.1. Suppose that $p$ is irreducible. A necessary and sufficient condition for the existence of a reversible measure is
(i) $p(x, y) > 0$ implies $p(y, x) > 0$,
(ii) For any loop $x_0, x_1, \ldots, x_n = x_0$ with $\prod_{i=1}^{n} p(x_{i-1}, x_i) > 0$,
$$\prod_{i=1}^{n} \frac{p(x_{i-1}, x_i)}{p(x_i, x_{i-1})} = 1.$$

Proof. To prove necessity, we note that any stationary measure has $\mu(x) > 0$ for all $x$ (by Lemma 9.1), so the detailed balance equations imply that $p(x_{i-1}, x_i) = \frac{\mu(x_i)}{\mu(x_{i-1})} p(x_i, x_{i-1})$, so, since $\mu(x) > 0$, (i) holds. To check the cycle condition (ii), we observe that $x_0 = x_n$, so we have
$$\prod_{i=1}^{n} \frac{p(x_{i-1}, x_i)}{p(x_i, x_{i-1})} = \prod_{i=1}^{n} \frac{\mu(x_i)}{\mu(x_{i-1})} = \frac{\mu(x_n)}{\mu(x_0)} \prod_{i=1}^{n-1} \frac{\mu(x_i)}{\mu(x_i)} = 1.$$

To show that these conditions are sufficient as well, fix $s \in S$ and set $\mu(s) = 1$. By irreducibility, for any $x \in S \setminus \{s\}$, there is a sequence $s = x_0, x_1, \ldots, x_n = x$ with $\prod_{i=1}^{n} p(x_{i-1}, x_i) > 0$, and we set
$$\mu(x) = \prod_{i=1}^{n} \frac{p(x_{i-1}, x_i)}{p(x_i, x_{i-1})}.$$
If $s = y_0, y_1, \ldots, y_m = x$ is another such sequence, then $s = x_0, x_1, \ldots, x_n = y_m, y_{m-1}, \ldots, y_0 = s$ is a loop, so (i) implies that $\prod_{j=1}^{m} p(y_j, y_{j-1}) > 0$, and (ii) implies that
$$\prod_{i=1}^{n} \frac{p(x_{i-1}, x_i)}{p(x_i, x_{i-1})} \cdot \prod_{j=1}^{m} \frac{p(y_j, y_{j-1})}{p(y_{j-1}, y_j)} = 1.$$
It follows that $\mu$ does not depend on the particular path chosen.

Finally, detailed balance is satisfied since if $p(x, y) > 0$, then consideration of the path $s = x_0, x_1, \ldots, x, y$ shows that
$$\mu(y) = \mu(x) \frac{p(x, y)}{p(y, x)}.$$
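The sufficiency argument is effectively an algorithm: fix $s$, propagate $\mu$ along positive-probability paths, and test detailed balance. A minimal sketch follows (the triangle chain at the end is an illustrative example where the cycle condition fails).

# Sketch of the construction in Theorem 9.1: breadth-first search from s
# defines mu along paths; a final detailed-balance test detects violations
# of the cycle condition (in which case no reversible measure exists).
from collections import deque

def reversible_measure(P, s=0):
    n = len(P)
    mu = {s: 1.0}
    queue = deque([s])
    while queue:
        x = queue.popleft()
        for y in range(n):
            if P[x][y] > 0 and y not in mu:
                if P[y][x] == 0:          # condition (i) fails
                    return None
                mu[y] = mu[x] * P[x][y] / P[y][x]
                queue.append(y)
    ok = all(abs(mu[x] * P[x][y] - mu[y] * P[y][x]) < 1e-12
             for x in range(n) for y in range(n) if P[x][y] > 0)
    return mu if ok else None

# Biased walk around a triangle: the loop 0 -> 1 -> 2 -> 0 has ratio
# product (2/3)^3 / (1/3)^3 = 8 != 1, so there is no reversible measure.
P = [[0, 2/3, 1/3], [1/3, 0, 2/3], [2/3, 1/3, 0]]
print(reversible_measure(P))   # None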
Though the existence of a reversible measure implies the existence of a stationary measure, it is not a
necessary condition. The following theorem shows that any chain having a recurrent state has a stationary
measure.
Theorem 9.2. If x is a recurrent state and Tx = inf {n ≥ 1 : Xn = x}, then
$$\mu_x(y) = E_x\left[ \sum_{n=0}^{T_x - 1} 1\{X_n = y\} \right] = \sum_{n=0}^{\infty} P_x(X_n = y, T_x > n)$$
defines a stationary measure.
The intuition is that $\mu_x(y)$ is the expected number of visits to $y$ in $\{0, 1, \ldots, T_x - 1\}$ and $(\mu_x p)(y)$ is the expected number of visits to $y$ in $\{1, \ldots, T_x\}$, which is equal since $X_0 = X_{T_x} = x$ for the chain started at $x$.

Proof. Tonelli's theorem shows that
$$\sum_{y} \mu_x(y) p(y, z) = \sum_{y} p(y, z) \sum_{n=0}^{\infty} P_x(X_n = y, T_x > n) = \sum_{n=0}^{\infty} \sum_{y} P_x(X_n = y, T_x > n) p(y, z) = \sum_{n=0}^{\infty} \sum_{y} P_x(X_n = y, X_{n+1} = z, T_x > n)$$
for all $z \in S$.

When $z \neq x$, we have
$$\sum_{y} \mu_x(y) p(y, z) = \sum_{n=0}^{\infty} \sum_{y} P_x(X_n = y, X_{n+1} = z, T_x > n) = \sum_{n=0}^{\infty} P_x(X_{n+1} = z, T_x > n+1) = \sum_{m=1}^{\infty} P_x(X_m = z, T_x > m) = \sum_{n=0}^{\infty} P_x(X_n = z, T_x > n) = \mu_x(z)$$
since $P_x(X_0 = z) = 0$.

For the $z = x$ case, we note that
$$\mu_x(x) = \sum_{n=0}^{\infty} P_x(X_n = x, T_x > n) = P_x(X_0 = x, T_x > 0) = 1,$$
so
$$\sum_{y} \mu_x(y) p(y, x) = \sum_{n=0}^{\infty} \sum_{y} P_x(X_n = y, X_{n+1} = x, T_x > n) = \sum_{n=0}^{\infty} \sum_{y} P_x(X_n = y, T_x = n+1) = \sum_{n=0}^{\infty} P_x(T_x = n+1) = \sum_{m=1}^{\infty} P_x(T_x = m) = 1 = \mu_x(x)$$
since $x$ recurrent implies $T_x \in [1, \infty)$ a.s.

Finally, we observe that $\mu_x(y) < \infty$ for all $y$. To see that this is so, note that since $x$ is recurrent, it follows from Theorem 8.3 that if $\rho_{xy} > 0$, then $\rho_{yx} = 1$, hence $p^n(y, x) > 0$ for some $n \in \mathbb{N}$. As $\mu_x = \mu_x p$ implies that $\mu_x = \mu_x p^n$, we have
$$1 = \mu_x(x) = (\mu_x p^n)(x) = \sum_{w} \mu_x(w) p^n(w, x) \geq \mu_x(y) p^n(y, x),$$
thus $\mu_x(y) \leq \frac{1}{p^n(y, x)} < \infty$. On the other hand, if $\rho_{xy} = 0$, then the definition of $\mu_x$ implies that $\mu_x(y) = 0 < \infty$.
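The definition of $\mu_x$ suggests a direct Monte Carlo estimate: average the visit counts over excursions from $x$. A minimal sketch on a small illustrative chain:

# Monte Carlo estimate of mu_x(y): expected visits to y before returning to x.
# Illustrative chain: lazy walk on {0, 1, 2} with reflection at the ends.
import random

P = {0: [(0, 0.5), (1, 0.5)],
     1: [(0, 0.25), (1, 0.5), (2, 0.25)],
     2: [(1, 0.5), (2, 0.5)]}

def step(x):
    u, acc = random.random(), 0.0
    for y, pr in P[x]:
        acc += pr
        if u < acc:
            return y
    return P[x][-1][0]

def mu_estimate(x, trials=200_000):
    visits = {s: 0 for s in P}
    for _ in range(trials):
        visits[x] += 1                 # the time-0 visit to x
        w = step(x)
        while w != x:                  # run the excursion until T_x
            visits[w] += 1
            w = step(w)
    return {s: v / trials for s, v in visits.items()}

print(mu_estimate(0))   # mu_0(0) = 1 by construction; other entries estimated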
To complement the previous existence result, we have
Theorem 9.3. If p is irreducible and recurrent, then the stationary measure is unique up to multiplication
by a positive constant.
Proof. Fix $s \in S$ and define $T_s$, $\mu_s$ as in Theorem 9.2. For any stationary measure $\nu$ and any $z \neq s$, we have
$$\nu(z) = \sum_{y} \nu(y) p(y, z) = \nu(s) p(s, z) + \sum_{y \neq s} \nu(y) p(y, z).$$
Repeating this decomposition gives
$$\nu(z) = \nu(s) p(s, z) + \sum_{y \neq s} \left( \nu(s) p(s, y) + \sum_{x \neq s} \nu(x) p(x, y) \right) p(y, z)$$
$$= \nu(s) p(s, z) + \sum_{y \neq s} \nu(s) p(s, y) p(y, z) + \sum_{y \neq s} \sum_{x \neq s} \nu(x) p(x, y) p(y, z)$$
$$= \nu(s) P_s(X_1 = z) + \nu(s) P_s(X_1 \neq s, X_2 = z) + P_\nu(X_0 \neq s, X_1 \neq s, X_2 = z).$$
Continuing in this fashion yields
$$\nu(z) = \nu(s) \sum_{m=1}^{n} P_s(X_1, \ldots, X_{m-1} \neq s, X_m = z) + P_\nu(X_0, \ldots, X_{n-1} \neq s, X_n = z)$$
$$\geq \nu(s) \sum_{m=1}^{n} P_s(X_1, \ldots, X_{m-1} \neq s, X_m = z) = \nu(s) \sum_{m=0}^{n} P_s(X_m = z, T_s > m).$$
Letting $n \to \infty$ shows that $\nu(z) \geq \nu(s) \mu_s(z)$ for all $z \neq s$. Since $\mu_s(s) = 1$, we have $\nu(s) \geq \nu(s) \mu_s(s)$ as well, hence $\nu(x) \geq \nu(s) \mu_s(x)$ for all $x \in S$.

Now $\nu$ and $\mu_s$ are stationary and $\mu_s(s) = 1$, so
$$\sum_{x} \nu(x) p^n(x, s) = \nu(s) = \nu(s) \mu_s(s) = \nu(s) \sum_{x} \mu_s(x) p^n(x, s),$$
and thus
$$\sum_{x} \left( \nu(x) - \nu(s) \mu_s(x) \right) p^n(x, s) = 0$$
for all $n \in \mathbb{N}$. As $\nu(x) \geq \nu(s) \mu_s(x)$, this implies that $\nu(x) = \nu(s) \mu_s(x)$ for all $x$ with $p^n(x, s) > 0$. Because $p$ is irreducible and $n$ is arbitrary, we conclude that $\nu(x) = \nu(s) \mu_s(x)$ for all $x \in S$.
The foregoing guarantees existence and uniqueness (up to scaling) of stationary measures under certain relatively mild assumptions. However, it may be the case that some (and thus every) stationary measure is infinite, so that no stationary distribution exists.
Theorem 9.4. If there is a stationary distribution π, then all states y ∈ S with π(y) > 0 are recurrent.
Proof. Suppose that $\pi(y) > 0$ and recall that $N(y) = \sum_{k=1}^{\infty} 1\{X_k = y\}$ satisfies $E_x[N(y)] = \rho_{xy} \sum_{k=0}^{\infty} \rho_{yy}^k$. Thus if we start the chain in stationarity, we have
$$E_\pi[N(y)] = \sum_{x} \pi(x) E_x[N(y)] = \sum_{x} \pi(x) \rho_{xy} \sum_{k=0}^{\infty} \rho_{yy}^k \leq \sum_{x} \pi(x) \sum_{k=0}^{\infty} \rho_{yy}^k = \sum_{k=0}^{\infty} \rho_{yy}^k.$$
On the other hand, since $\pi p^n = \pi$, we see that
$$E_\pi[N(y)] = \sum_{n=1}^{\infty} P_\pi(X_n = y) = \sum_{n=1}^{\infty} \sum_{x} \pi(x) p^n(x, y) = \sum_{n=1}^{\infty} \pi(y) = \infty.$$
Combining these equations shows that
$$\sum_{k=0}^{\infty} \rho_{yy}^k \geq E_\pi[N(y)] = \infty,$$
hence $\rho_{yy} = 1$.
Now Lemma 9.1 says that if $p$ is irreducible, then any stationary measure $\mu$ has $\mu(x) > 0$ for all $x$, thus Theorem 9.4 shows that if an irreducible Markov chain has a transient state, then it cannot have a stationary distribution.

We will see shortly that recurrence is not quite sufficient for an irreducible Markov chain to have a stationary distribution, but first we show that if one exists (which is necessarily unique by Theorem 9.3), then it must be given by the formula in the following theorem.
Theorem 9.5. If $p$ is irreducible and has a stationary distribution $\pi$, then $\pi(x) = \frac{1}{E_x[T_x]}$.

Proof. Irreducibility implies that $\pi(x) > 0$ for all $x \in S$ by Lemma 9.1, so Theorem 9.4 shows that all states are recurrent.
It follows from Theorem 9.2 that
$$\mu_x(y) = \sum_{n=0}^{\infty} P_x(X_n = y, T_x > n)$$
defines a stationary measure with $\mu_x(x) = 1$. By Tonelli's theorem and the layer cake representation, we have
$$\sum_{y} \mu_x(y) = \sum_{y} \sum_{n=0}^{\infty} P_x(X_n = y, T_x > n) = \sum_{n=0}^{\infty} \sum_{y} P_x(X_n = y, T_x > n) = \sum_{n=0}^{\infty} P_x(T_x > n) = E_x[T_x].$$
By Theorem 9.3, the stationary measure is unique up to positive scaling, so the unique stationary distribution is given by
$$\pi(x) = \frac{\mu_x(x)}{E_x[T_x]} = \frac{1}{E_x[T_x]}.$$

Observe that Theorem 9.5 gives the interesting identity
$$\sum_{x} \frac{p(x, y)}{E_x[T_x]} = \sum_{x} \pi(x) p(x, y) = \pi(y) = \frac{1}{E_y[T_y]}.$$
A notable example of an irreducible Markov chain which does not have a stationary distribution is simple random walk on $\mathbb{Z}$: We have seen that the expected first return time is infinite, so Theorem 9.5 shows that there cannot be a stationary distribution.
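Theorem 9.5 can also be checked by simulation: estimate $E_y[T_y]$ by averaging return times and compare with the exact stationary distribution. A minimal sketch for a hypothetical two-state chain:

# Simulation check of pi(y) = 1 / E_y[T_y] on a two-state chain.
import random

p01, p10 = 0.3, 0.8   # illustrative transition probabilities
P = [[1 - p01, p01], [p10, 1 - p10]]
pi = [p10 / (p01 + p10), p01 / (p01 + p10)]   # exact stationary distribution

def mean_return_time(y, trials=100_000):
    total = 0
    for _ in range(trials):
        x, t = y, 0
        while True:
            t += 1
            x = 0 if random.random() < P[x][0] else 1
            if x == y:
                break
        total += t
    return total / trials

for y in (0, 1):
    print(pi[y], 1 / mean_return_time(y))   # the two columns should agree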
Definition. If $x$ is a recurrent state with $E_x[T_x] < \infty$, we say that $x$ is positive recurrent. If a state $x$ has $E_x[T_x] = \infty$, then $x$ is called null recurrent.
Theorem 9.6. If p is irreducible, then the following are equivalent.
(i): Some x is positive recurrent
(ii): There is a (unique) stationary distribution
(iii): All states are positive recurrent.
Proof. If $x$ is positive recurrent, then the preceding proof shows that
$$\pi(y) = \frac{\mu_x(y)}{E_x[T_x]} = \frac{1}{E_x[T_x]} \sum_{n=0}^{\infty} P_x(X_n = y, T_x > n)$$
defines a stationary distribution (which is unique by Theorem 9.3), thus (i) implies (ii).

If $\pi$ is a stationary distribution, then Theorem 9.5 shows that $\pi(y) = \frac{1}{E_y[T_y]}$ for all $y$. Since Lemma 9.1 implies that $\pi(y) > 0$ for all $y$, we must have $E_y[T_y] < \infty$, hence (ii) implies (iii). As (iii) trivially implies (i), the proof is complete.
We observe that Theorem 9.2 and the proof of Theorem 9.5 show that if any state $x$ is positive recurrent, then
$$\pi_x(y) = \frac{\mu_x(y)}{E_x[T_x]}$$
defines a stationary distribution, regardless of irreducibility. Also, since $x$ is recurrent, the communicating class $[x]$ is closed and irreducible. Applying Theorem 9.6 to the chain restricted to $[x]$ shows that positive recurrence is a class property.

Thus we have an extra layer in our classification of states: Each communicating class is either recurrent or transient, and each recurrent class is either positive recurrent or null recurrent. Moreover, for each positive recurrent class, there is a unique stationary distribution which is positive on all states in the class and $0$ for all other states.
10. Convergence Theorems

If a state $y$ is transient, then for all states $x$, we have
$$\sum_{n=1}^{\infty} P_x(X_n = y) = \sum_{n=1}^{\infty} p^n(x, y) = E_x[N(y)] = \frac{\rho_{xy}}{1 - \rho_{yy}} < \infty,$$
hence $p^n(x, y) \to 0$ as $n \to \infty$. As termwise convergence to $0$ is not sufficient for summability, the converse is not necessarily true.

Let $N_n(y) = \sum_{m=1}^{n} 1\{X_m = y\}$ be the number of visits to $y$ by time $n$.
Theorem 10.1. Suppose that y is recurrent. Then for any x ∈ S ,
$$\lim_{n \to \infty} \frac{1}{n} N_n(y) = \frac{1}{E_y[T_y]} 1\{T_y < \infty\} \quad P_x\text{-a.s.}$$
Proof. We begin by considering the chain started at $y$. Let $R(k) = \inf\{n \geq 1 : N_n(y) = k\}$ be the time of the $k$th return to $y$. Set $t_1 = R(1) = T_y$ and $t_k = R(k) - R(k-1)$ for $k \geq 2$. Since we have assumed that $X_0 = y$, the strong Markov property implies that $t_1, t_2, \ldots$ are i.i.d. Thus it follows from the strong law of large numbers that
$$\frac{R(n)}{n} = \frac{1}{n} \sum_{k=1}^{n} t_k \to E_y[T_y] \quad P_y\text{-a.s.}$$
Observing that $R(N_n(y))$ is the time of the last visit to $y$ by time $n$ and $R(N_n(y) + 1)$ is the time of the first visit to $y$ after time $n$, we see that $R(N_n(y)) \leq n < R(N_n(y) + 1)$, thus
$$\frac{R(N_n(y))}{N_n(y)} \leq \frac{n}{N_n(y)} < \frac{R(N_n(y) + 1)}{N_n(y) + 1} \cdot \frac{N_n(y) + 1}{N_n(y)}.$$
Letting $n \to \infty$ and noting that $N_n(y) \to \infty$ a.s. (because $y$ is recurrent) yields
$$\frac{n}{N_n(y)} \to E_y[T_y] \quad P_y\text{-a.s.}$$

When the initial state is $x \neq y$, we first note that if $T_y = \infty$, then $N_n(y) = 0$ for all $n$, hence $\frac{N_n(y)}{n} \to 0$ on $\{T_y = \infty\}$. The strong Markov property shows that, conditional on $\{T_y < \infty\}$, $t_2, t_3, \ldots$ are i.i.d. with $P_x(t_k = n) = P_y(T_y = n)$. It follows that
$$\frac{R(k)}{k} = \frac{t_1}{k} + \frac{t_2 + \cdots + t_k}{k - 1} \cdot \frac{k - 1}{k} \to 0 + E_y[T_y] \quad P_x\text{-a.s.}$$
Thus, arguing as before, we see that for all $x \in S$,
$$\frac{N_n(y)}{n} \to \frac{1}{E_y[T_y]} \quad P_x\text{-a.s.}$$
on $\{T_y < \infty\}$. Adding our observation about the $\{T_y = \infty\}$ case completes the proof.
Theorem 10.1 helps to explain the terminology for recurrent states: $y$ is positive recurrent if, when we start the chain at $y$, the asymptotic fraction of time spent at $y$ is positive. $y$ is null recurrent if this fraction is $0$.
To connect this result with our opening remarks about the $n \to \infty$ behavior of $p^n(x, y)$, we note that $\frac{N_n(y)}{n} \leq 1$, so the bounded convergence theorem gives
$$\frac{1}{n} \sum_{m=1}^{n} p^m(x, y) = \frac{E_x[N_n(y)]}{n} \to E_x\left[ \frac{1}{E_y[T_y]} 1\{T_y < \infty\} \right] = \frac{P_x(T_y < \infty)}{E_y[T_y]} = \frac{\rho_{xy}}{E_y[T_y]}.$$
(This holds for $y$ transient as well since $E_y[T_y] = \infty$ in that case.)

In particular, if $y$ is positive recurrent and accessible from $x$, then $\frac{\rho_{xy}}{E_y[T_y]} > 0$, so it cannot be the case that $p^n(x, y) \to 0$. In other words, if $\rho_{xy} > 0$ and $p^n(x, y) \to 0$, then $y$ is transient or null recurrent. More precisely, we have

Corollary 10.1. A recurrent class $[w]$ is null recurrent if and only if $\frac{1}{n} \sum_{m=1}^{n} p^m(x, y) \to 0$ for any/all $x, y \in [w]$.
We are now in a position to upgrade Corollary 8.1 to
Theorem 10.2. Every irreducible Markov chain on a finite state space is positive recurrent and thus has a unique stationary distribution.

Proof. We know that irreducible finite state space chains are recurrent. Suppose that such a chain were null recurrent; then for any $x, y \in S$, $\frac{1}{n} \sum_{m=1}^{n} p^m(x, y) \to 0$. But since $S$ is finite, this implies that
$$1 = \frac{1}{n} \sum_{m=1}^{n} \sum_{y} p^m(x, y) = \sum_{y} \frac{1}{n} \sum_{m=1}^{n} p^m(x, y) \to 0,$$
a contradiction.
The preceding analysis shows that the Cesàro mean $\frac{1}{n} \sum_{m=1}^{n} p^m(x, y)$ always converges. If $p$ is irreducible and positive recurrent, the limit is
$$\frac{\rho_{xy}}{E_y[T_y]} = \frac{1}{E_y[T_y]} = \pi(y).$$
However, the following simple examples show that the sequence $p^n(x, y)$ may not converge in the ordinary sense:
Example 10.1. Consider the chain on $\mathbb{Z}/m\mathbb{Z}$ with transition probabilities $p(x, x+1) = 1$, where addition is taken modulo $m$. This chain is irreducible and positive recurrent (with uniform stationary distribution), but $p^k(x, y) = 1$ if $k \equiv y - x \ (\mathrm{mod}\ m)$ and $p^k(x, y) = 0$ otherwise, hence $p^k(x, y)$ is divergent for all $x, y$.
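The oscillation and the Cesàro convergence are both visible in a few lines of code; a minimal sketch with the illustrative choice $m = 5$:

# p^k(x, y) for the deterministic cycle on Z/mZ oscillates between 0 and 1,
# while the Cesaro averages converge to pi(y) = 1/m.
m = 5
def pk(x, y, k):
    return 1.0 if (y - x) % m == (k % m) else 0.0

print([pk(0, 0, k) for k in range(1, 11)])                # 0,0,0,0,1,0,0,0,0,1
avg = [sum(pk(0, 0, j) for j in range(1, n + 1)) / n for n in (10, 100, 1000)]
print(avg)                                                # tends to 1/5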
Example 10.2. Consider the random transposition shuffle which proceeds by choosing two distinct cards at random and interchanging them. This is the random walk on $S_n$ driven by the uniform measure on transpositions. By construction, $p^k(\sigma, \eta) = 0$ if $k$ is even and $\mathrm{sgn}(\sigma)\mathrm{sgn}(\eta) = -1$, or if $k$ is odd and $\mathrm{sgn}(\sigma)\mathrm{sgn}(\eta) = 1$. However, one can show that for any $k \geq n$, $p^k(\sigma, \eta) \geq \binom{n}{2}^{-n}$ if the parity of $\sigma\eta$ and $k$ agree.

In both cases, the periodicity problem can be sidestepped by adding laziness - that is, expanding the support of the measure driving the walk to include the identity - so that the chain has some positive probability of staying put at each step.
Inspired by the obstructions to convergence in the preceding examples, we are led to make the following definition.

Definition. For any $x \in S$, set $I_x = \{n \geq 1 : p^n(x, x) > 0\}$. $d_x = \gcd(I_x)$ is called the period of $x$, where $\gcd(\emptyset) = \infty$.

Note that $I_x \subseteq \mathbb{N}$ is nonempty precisely when $\rho_{xx} > 0$. In this case, $d_x \leq \min(I_x)$. In particular, $d_x = 1$ if $p(x, x) > 0$.
Lemma 10.1. If x ∼ y, then dy = dx .
Proof. We may assume that $x \neq y$ so that $\rho_{xy}, \rho_{yx} > 0$. Let $K$ and $L$ be such that $p^K(x, y), p^L(y, x) > 0$. Then
$$p^{K+L}(y, y) \geq p^L(y, x) p^K(x, y) > 0,$$
so $d_y \mid (K + L)$. If $n$ is such that $p^n(x, x) > 0$, then
$$p^{K+n+L}(y, y) \geq p^L(y, x) p^n(x, x) p^K(x, y) > 0,$$
so we also have that $d_y \mid (K + L + n)$, and thus $d_y \mid n$. As $n \in I_x$ is arbitrary, it follows that $d_y \mid d_x$. Interchanging the roles of $x$ and $y$ shows that $d_x \mid d_y$ as well, and we conclude that $d_y = d_x$.

Definition. We say that a state $x$ is aperiodic if $d_x = 1$. If all states in a recurrent Markov chain are aperiodic, we say that the chain is aperiodic.
Lemma 10.2. If $d_x = 1$, then there is an $m_x \in \mathbb{N}$ such that $p^m(x, x) > 0$ for all $m > m_x$.

Proof. We first observe that there is a finite $F_x \subseteq I_x$ such that $\gcd(F_x) = \gcd(I_x) = 1$. To see that this is so, note that $d(n) = \gcd(I_x \cap [1, n])$ is a nonincreasing $\mathbb{N}$-valued function and thus can only decrease a finite number of times. Let $N = \max\{n \in \mathbb{N} : d(n) < d(n-1)\}$ and set $F_x = I_x \cap [1, N] = \{b_1, \ldots, b_n\}$.

Now the Euclidean Algorithm shows that there are integers $a_1, \ldots, a_n$ with $\sum_{i=1}^{n} a_i b_i = 1$. Set $a = \max_i |a_i|$, $b = \sum_i b_i$. For any $m \in \mathbb{N}$, there exist $q, r \in \mathbb{N}_0$ with $r < b$ such that $m = qb + r$. If $q \geq ab$, then $q + r a_i \geq 0$ for all $i$, and
$$m = qb + r = q \sum_i b_i + r \sum_i a_i b_i = \sum_i (q + r a_i) b_i.$$
In other words, every integer greater than $ab^2$ can be written as a sum of elements in $F_x \subseteq I_x$. Since $I_x$ is closed under addition, this means that $I_x$ contains every integer greater than $ab^2$, so we may take $m_x = ab^2$.
If $K$ is the transition matrix for an irreducible and aperiodic Markov chain with finite state space $S$, then it follows from Lemma 10.2 that there is an $N \in \mathbb{N}$ such that $K^n(x, y) > 0$ whenever $n \geq N$: For each $x, y$, there is an $n(x, y) \in \mathbb{N}$ with $p^{n(x,y)}(x, y) > 0$ by irreducibility. Since $S$ is finite, $M = \max_{(x,y)} n(x, y)$ exists in $\mathbb{N}$. Letting $L = \max_x m_x$ with $m_x$ as in Lemma 10.2, we see that for any $n \geq L + M$ and any $x, y \in S$,
$$p^n(x, y) \geq p^{n(x,y)}(x, y)\, p^{n - n(x,y)}(y, y) > 0$$
since $n(x, y) \leq M$, hence $n - n(x, y) \geq L \geq m_y$.

Now since $K^N$ is a positive matrix, the Perron-Frobenius theorem ensures that $K^N$ has a left eigenvector $\pi$ with $\pi K^N = \pi$, $\pi(x) > 0$ for all $x$, and $\sum_x \pi(x) = 1$. Since $K$ is stochastic, it has spectral radius $1$, and Perron-Frobenius also shows that $1$ is a simple eigenvalue. It follows that $K$ has unique and strictly positive stationary distribution $\pi$. Moreover, letting $w \equiv 1$ denote the appropriately normalized right eigenvector with eigenvalue $1$, Perron-Frobenius implies $\lim_{m \to \infty} K^{Nm} = w\pi$, the matrix with all rows equal to $\pi$. Finally, writing $K$ in Jordan normal form and recalling that the eigenvalues satisfy $\lambda_0 = 1 > |\lambda_1| \geq |\lambda_2| \geq \ldots$, we see that $K^n$ converges as well (with exponential rate given by $|\lambda_1|$), hence $K^n(x, y) \to \pi(y)$ for all $x, y \in S$.
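The convergence of $K^n$ to the rank-one matrix $w\pi$ can be observed directly; a minimal sketch using an illustrative $3 \times 3$ kernel (numpy's eigendecomposition recovers $\pi$):

# Rows of K^n converge to pi geometrically at rate |lambda_1| for an
# irreducible, aperiodic finite chain.
import numpy as np

K = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])

# Left eigenvector for eigenvalue 1 gives the stationary distribution.
w, V = np.linalg.eig(K.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

Kn = np.eye(3)
for n in range(1, 31):
    Kn = Kn @ K
    if n % 10 == 0:
        print(n, np.abs(Kn - pi).max())   # geometric decay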
When the state space is countably infinite, we no longer have these linear algebra tools and the above argument does not apply. However, if we tack on the assumption of positive recurrence (so that a stationary distribution exists), we can still arrive at the same conclusion.

Theorem 10.3. Suppose that $p$ is irreducible, aperiodic, and positive recurrent with countable state space $S$. Then there is a probability measure $\pi$ on $S$ which satisfies
$$\lim_{n \to \infty} p^n(x, y) = \pi(y)$$
for all states $x, y$.
Proof. Define a transition probability $\tilde{p}$ on $S \times S$ by
$$\tilde{p}((x_1, y_1), (x_2, y_2)) = p(x_1, x_2)\, p(y_1, y_2).$$
For any $x, y \in S$, aperiodicity gives $p^m(x, x) > 0$ whenever $m > m_x$, and irreducibility shows that there exists $n(x, y) \in \mathbb{N}$ with $p^{n(x,y)}(x, y) > 0$. It follows that for any $(x_1, y_1), (x_2, y_2) \in S \times S$,
$$\tilde{p}^{\,n(x_1,x_2)+n(y_1,y_2)+m}((x_1, y_1), (x_2, y_2)) = p^{n(x_1,x_2)+n(y_1,y_2)+m}(x_1, x_2)\, p^{n(x_1,x_2)+n(y_1,y_2)+m}(y_1, y_2)$$
$$\geq p^{n(x_1,x_2)}(x_1, x_2)\, p^{n(y_1,y_2)+m}(x_2, x_2)\, p^{n(y_1,y_2)}(y_1, y_2)\, p^{n(x_1,x_2)+m}(y_2, y_2) > 0$$
whenever $m \geq m_{x_2} \vee m_{y_2}$, hence $\tilde{p}$ is irreducible.

Also, since $p$ is irreducible and positive recurrent, it has a unique stationary distribution $\pi$. Define $\tilde{\pi}((x, y)) = \pi(x)\pi(y)$. Then $\tilde{\pi}$ is a stationary distribution for $\tilde{p}$ since
$$\sum_{(w, y) \in S \times S} \tilde{\pi}((w, y))\, \tilde{p}((w, y), (x, z)) = \sum_{w \in S} \pi(w) p(w, x) \sum_{y \in S} \pi(y) p(y, z) = \pi(x)\pi(z) = \tilde{\pi}((x, z)).$$
Since $\tilde{p}$ is irreducible and has a stationary distribution, it follows from Theorem 9.6 that all states in $S \times S$ are positive recurrent.

Now let $(X_n, Y_n)$ be a Markov chain on $S \times S$ with transition probability $\tilde{p}$ and let $T = \inf\{n : X_n = Y_n\}$ be the hitting time of the diagonal $D = \{(y, y) : y \in S\}$. Since $\tilde{p}$ is irreducible and recurrent, the hitting time of any point in $D$ is a.s. finite (by Theorem 8.3), thus $T < \infty$ a.s.

By considering the time and location of the first intersection and appealing to the Markov property, we see that
$$P(X_n = y, T \leq n) = \sum_{m=1}^{n} \sum_{x \in S} P(T = m, X_m = x, X_n = y)$$
$$= \sum_{m=1}^{n} \sum_{x \in S} P(T = m, X_m = x)\, P(X_n = y \mid X_m = x)$$
$$= \sum_{m=1}^{n} \sum_{x \in S} P(T = m, Y_m = x)\, P(Y_n = y \mid Y_m = x) = P(Y_n = y, T \leq n).$$
It follows that
$$P(X_n = y) = P(X_n = y, T \leq n) + P(X_n = y, T > n) = P(Y_n = y, T \leq n) + P(X_n = y, T > n).$$
Since $P(Y_n = y) = P(Y_n = y, T \leq n) + P(Y_n = y, T > n)$, we have
$$|P(X_n = y) - P(Y_n = y)| = |P(X_n = y, T > n) - P(Y_n = y, T > n)| \leq P(X_n = y, T > n) + P(Y_n = y, T > n),$$
and thus
$$\sum_{y} |P(X_n = y) - P(Y_n = y)| \leq 2 P(T > n).$$
If we take $X_0 = x$, $Y_0 \sim \pi$, this says that
$$\sum_{y} |p^n(x, y) - \pi(y)| \leq 2 P(T > n) \to 0,$$
and the proof is complete.
11. Coupling

Let $p$ denote the transition function for a time homogeneous Markov chain $X_0, X_1, \ldots$ on a countable state space $S$, so that $p(x, y) = P(X_{t+1} = y \mid X_t = x)$ for all $t \in \mathbb{N}_0$, $x, y \in S$. The $k$-step transitions are given recursively by
$$p^k(x, z) = P(X_{t+k} = z \mid X_t = x) = \sum_{y \in S} p^{k-1}(x, y) p(y, z).$$
In the $|S| = N < \infty$ case, we think of $p$ as an $N \times N$ matrix and the above is just the formula for matrix exponentiation. In general, $p$ is an operator which acts on functions (column vectors) by
$$(pf)(x) = \sum_{y} p(x, y) f(y) = E[f(X_{t+1}) \mid X_t = x]$$
and acts on probabilities (row vectors) by
$$(\mu p)(y) = \sum_{x} \mu(x) p(x, y) = P(X_{t+1} = y \mid X_t \sim \mu).$$
Thus if the chain has initial distribution $\mu_0 = \mathcal{L}(X_0)$, then the distribution of $X_t$ is $\mu_0 p^t$.

We know that if $p$ is irreducible, aperiodic, and positive recurrent, then it has a unique stationary distribution $\pi$ and $\lim_{n \to \infty} p^n(x, y) = \pi(y)$ for all $x, y \in S$. (If $S$ is finite, the assumption of positive recurrence is redundant.)

In applications such as MCMC, we want to know how fast the chain converges - that is, how big does $t$ need to be for $\mu_0 p^t$ to be close to $\pi$. To make this question more precise, we need a metric on probabilities.
To make this question more precise, we need a metric on probabilities.
Denition.
The
total variation distance
between probabilities
kµ − νkT V =
1X
2
µ
and
ν
on a countable set
S
is dened as
|µ(s) − ν(s)| .
s∈S
It is routine to verify that total variation denes a metric, and that we have the equivalent characterizations
kµ − νkT V = max (µ(A) − ν(A))
A⊆S
=
(The maxima are attained by
1
max (Eµ [f ] − Eν [f ]) .
2 kf k∞ ≤1
A = {s : µ(s) > ν(s)}
Also, it is clear from the original denition that
and
f = 1A − 1AC .)
kµn − νkT V → 0
implies
µn → ν
pointwise. The converse
implication does not necessarily hold when the state space is innite.
Note that the proof of Theorem 10.3 actually established that
68
pnx → π
in total variation.
When speaking of Markov chain mixing, we often want a measure of distance which is independent of the
initial distribution, so we dene
d(t) = sup µ0 pt − π T V .
µ0
In the case where the initial distribution is a point mass, we write
ptx = δx pt .
It is left as an exercise to show that an equivalent denition is
d(t) = sup ptx − π T V .
x∈S
Also, it is often useful to consider the distance between two copies of the chain started at dierent states,
and we dene
d(t) = sup ptx − pty T V .
x,y∈S
Another good exercise is to establish the inequality
d(t) ≤ d(t) ≤ 2d(t).
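For a finite chain these quantities are straightforward to compute; the following sketch (with an illustrative kernel) evaluates $d(t) = \sup_x \|p_x^t - \pi\|_{TV}$ by taking the maximum over point masses.

# Total variation distance and the worst-case distance d(t) for a finite chain.
import numpy as np

def tv(mu, nu):
    return 0.5 * np.abs(mu - nu).sum()

K = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])            # illustrative birth-death kernel
w, V = np.linalg.eig(K.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

Kt = np.eye(3)
for t in range(1, 16):
    Kt = Kt @ K
    d_t = max(tv(Kt[x], pi) for x in range(3))   # sup over point masses
    if t % 5 == 0:
        print(t, d_t)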
One of the main goals in the modern theory of Markov chains is to estimate the $\varepsilon$-mixing time
$$t_{mix}(\varepsilon) = \min\{t \in \mathbb{N}_0 : d(t) \leq \varepsilon\}.$$
Remarkably, in many cases of interest the mixing time is asymptotically independent of $\varepsilon$. Here we are thinking of a sequence of chains $p_{(1)}, p_{(2)}, \ldots$ where $p_{(n)}$ has state space $S_{(n)}$, stationary distribution $\pi_{(n)}$, and $\varepsilon$-mixing time $t_{mix}^{(n)}(\varepsilon)$. For example, $p_{(n)}$ may represent a particular method of shuffling $n$ cards, simple random walk on the $n$-dimensional hypercube, Glauber dynamics for the Ising model on the torus $(\mathbb{Z}/n\mathbb{Z})^d$, and so forth.

We say that the sequence $p_{(n)}$ exhibits the cutoff phenomenon if
$$\lim_{n \to \infty} \frac{t_{mix}^{(n)}(\varepsilon)}{t_{mix}^{(n)}(1 - \varepsilon)} = 1$$
for all $\varepsilon \in (0, 1)$. This means that $t_{mix}(\varepsilon_1)$ and $t_{mix}(\varepsilon_2)$ differ only in lower order terms. Writing
$$d^{(n)}(t) = \sup_{x \in S_{(n)}} \left\| \delta_x p_{(n)}^t - \pi_{(n)} \right\|_{TV},$$
this is equivalent to the requirement that
$$\lim_{n \to \infty} d^{(n)}\left( c\, t_{mix}^{(n)} \right) = \begin{cases} 1, & c < 1 \\ 0, & c > 1 \end{cases}$$
where $t_{mix}^{(n)} = t_{mix}^{(n)}\left( \frac{1}{4} \right)$. (The choice of $\frac{1}{4}$ is arbitrary but standard.)

That is, when time is scaled by $t_{mix}^{(n)}$, $d^{(n)}(t)$ approaches a step function. Essentially, the distance to equilibrium stays near $1$ for a while and then abruptly drops and tends rapidly to $0$.
A classical method of bounding mixing times is based on the notion of coupling. (In fact, this was the technique used to prove Theorem 10.3.)

Definition. If $\mu$ and $\nu$ are probabilities on $(S, \mathcal{S})$, we say that $(X, Y)$ is a coupling of $\mu$ and $\nu$ if $X$ and $Y$ are $S$-valued random variables on a common probability space $(\Omega, \mathcal{F}, P)$ such that $P(X \in A) = \mu(A)$ and $P(Y \in A) = \nu(A)$ for all $A \in \mathcal{S}$.

Lemma 11.1. If $(X, Y)$ is a coupling of $\mu$ and $\nu$, then $\|\mu - \nu\|_{TV} \leq P(X \neq Y)$.

Proof. For any $A \in \mathcal{S}$,
$$\mu(A) - \nu(A) = P(X \in A) - P(Y \in A)$$
$$= P(X \in A, X \neq Y) + P(X \in A, X = Y) - P(Y \in A, X \neq Y) - P(Y \in A, X = Y)$$
$$= P(X \in A, X \neq Y) - P(Y \in A, X \neq Y) \leq P(X \in A, X \neq Y) \leq P(X \neq Y).$$

When $S$ is countable, one can show that there always exist couplings for which the inequality is an equality, hence we have the additional definition of total variation:
$$\|\mu - \nu\|_{TV} = \min\left\{ P(X \neq Y) : (X, Y) \text{ is a coupling of } \mu \text{ and } \nu \right\}.$$
* Briefly, one defines $w$ on $S \times S$ by
$$w(z, z) = \min\{\mu(z), \nu(z)\}, \qquad w(x, y) = \frac{(\mu(x) - w(x, x))(\nu(y) - w(y, y))}{1 - \sum_z w(z, z)} \text{ for } x \neq y,$$
and checks that $(X, Y) \sim w$ is a coupling with $P(X \neq Y) = \|\mu - \nu\|_{TV}$. We leave the verification of this claim as an exercise.
Definition. A coupling of a transition probability $p$ on a countable state space $S$ is an $S \times S$-valued process $\{(X_t, Y_t)\}_{t \in \mathbb{N}_0}$ defined on some probability space $(\Omega, \mathcal{F}, P)$ such that, marginally, $\{X_t\}$ and $\{Y_t\}$ are each Markov chains with transition probability $p$.

Note that we do not require $\{X_t\}$ and $\{Y_t\}$ to proceed independently or that $\{(X_t, Y_t)\}$ is a Markov chain.
Theorem 11.1. Suppose that $\{(X_t, Y_t)\}_{t \in \mathbb{N}_0}$ is a coupling of $p$ with $X_0 \sim \mu$, $Y_0 \sim \nu$. If $T$ is a random time such that $X_t = Y_t$ on $\{T \leq t\}$, then
$$\left\| \mu p^t - \nu p^t \right\|_{TV} \leq P(T > t).$$

Proof. By construction, $(X_t, Y_t)$ is a coupling of $\mu p^t$ and $\nu p^t$. Since $\{X_t \neq Y_t\} \subseteq \{T > t\}$, the coupling lemma implies
$$\left\| \mu p^t - \nu p^t \right\|_{TV} \leq P(X_t \neq Y_t) \leq P(T > t).$$

When $\nu = \pi$, Theorem 11.1 gives the bound $\|\mu p^t - \pi\|_{TV} \leq P(T > t)$. It can also be convenient to take $\mu = \delta_x$ and $\nu = \delta_y$ to get $\|p_x^t - p_y^t\|_{TV} \leq P(T_{x,y} > t)$. One can then bound the distance to stationarity using
$$d(t) \leq \bar{d}(t) \leq \sup_{x,y} P(T_{x,y} > t).$$
Typically, one considers couplings $\{(X_t, Y_t)\}_{t \in \mathbb{N}_0}$ with the property that $X_t = Y_t$ implies $X_{t+1} = Y_{t+1}$. In this case, we can take $T = \inf\{t \geq 0 : X_t = Y_t\}$.

Definition. A Markovian coupling of a Markov chain $p$ is a Markov chain $\{(X_t, Y_t)\}_{t=0}^{\infty}$ with state space $S \times S$ whose transition probability $q$ satisfies
(1) $\sum_{y' \in S} q((x, y), (x', y')) = p(x, x')$ for all $x, y, x' \in S$,
(2) $\sum_{x' \in S} q((x, y), (x', y')) = p(y, y')$ for all $x, y, y' \in S$.

Markovian couplings can be modified so that the two chains stay together after colliding. To wit, suppose that $\{(X_t, Y_t)\}_{t=0}^{\infty}$ is a Markovian coupling of $p$ and let $T = \inf\{t \geq 0 : X_t = Y_t\}$. Define
$$Z_t = \begin{cases} Y_t, & t \leq T \\ X_t, & t > T \end{cases}.$$
Then $Z_0 \sim \mathcal{L}(Y_0)$, $X_{t+1} = Z_{t+1}$ whenever $X_t = Z_t$, and the strong Markov property implies that $(X_t, Z_t)$ is a coupling of $p$. (This trick may not work for non-Markovian couplings.)
The general idea is that we start one copy of the chain in a specified distribution, let another copy begin in stationarity, and then let them evolve according to the same transition mechanism until they meet and proceed simultaneously forever after. As the second chain was stationary to begin with, it remains so for all time, thus the first chain must have equilibrated by the time they couple.

Though the preceding argument captures the intuition, it is not strictly correct as it overlooks a subtle point: $Y_t \sim \pi$ for all $t$ does not guarantee that $Y_T \sim \pi$ for a stopping time $T$. For example, consider the chain with state space $\{x, y\}$ and transition probabilities $p(x, y) = 1$, $p(y, x) = p(y, y) = \frac{1}{2}$. It is easy to see that $\pi(x) = \frac{1}{3}$, $\pi(y) = \frac{2}{3}$ is stationary for $p$. If we let $\{X_t\}$ be a copy of the chain started at $y$, let $\{Y_t\}$ be another copy of the chain with initial distribution $\pi$, and let $T$ be the coupling time of $\{X_t\}$ and $\{Y_t\}$, then we necessarily have that $Y_T = y$ since $W_t = x$ implies that $W_{t-1} = y$ for any chain $\{W_t\}$ having transition probability $p$.

Nonetheless, the coupling bound on variation distance still holds and can be quite useful.
Example 11.1 (Lazy Random Walk on the Hypercube). Here $S = (\mathbb{Z}/2\mathbb{Z})^d$, $p(x, x) = \frac{1}{2}$, $p(x, y) = \frac{1}{2d}$ if $x$ and $y$ differ in exactly one coordinate, and $p(x, z) = 0$ otherwise. That is, at each step we flip a fair coin. If it comes up heads, we stay put. If it comes up tails, we choose one of our $d$ neighbors uniformly at random and move there.

As an irreducible random walk on a finite group (or a simple random walk on a finite regular graph), the stationary distribution is uniform. The $\frac{1}{2}$ holding probabilities ensure aperiodicity, so the convergence theorem implies that the distribution of the position at time $t$ will approach the uniform distribution as $t \to \infty$. We want to know how fast it converges.

To this end, we let $X_0 = x$ and let $Y_0$ have the uniform distribution on $S$. Let $U_1, U_2, \ldots$ be i.i.d. uniform on $\{1, \ldots, d\}$ and let $V_1, V_2, \ldots$ be i.i.d. uniform on $\{0, 1\}$. All random variables are taken to be independent. At time $t$, the $U_t$th coordinate of each chain is set to $V_t$ and the others remain as they were. In other words, at every time step we pick a coordinate at random and set its value (in both chains) to $0$ or $1$ according to a toss of a fair coin.

It is easy to see that $\{X_t\}$ and $\{Y_t\}$ are each evolving according to $p$. Moreover, the two chains agree in coordinate $U_t$ from time $t$ onward, so they have coupled by time $T = \inf\{t : \{U_1, \ldots, U_t\} = \{1, \ldots, d\}\}$. Since $T$ does not depend on the initial state, Theorem 11.1 shows that
$$d(t) = \sup_x \left\| p_x^t - \pi \right\|_{TV} \leq P(T > t).$$
Thus the variation bound reduces to a coupon collector problem: If we let $A_k^t = \{k \notin \{U_1, \ldots, U_t\}\}$ for $k = 1, \ldots, d$, then
$$P(T > t) = P\left( \bigcup_{k=1}^{d} A_k^t \right) \leq d\, P\left( A_1^t \right) = d \left( 1 - \frac{1}{d} \right)^t \leq d e^{-\frac{t}{d}}.$$
Therefore, if $t = d\log(d) + cd$ where $c > 0$ is chosen so that $t \in \mathbb{N}$, then $d(t) \leq e^{-c}$, hence the mixing time is $O(d \log(d))$.

* Using a more sophisticated coupling or other techniques such as Fourier analysis, along with lower bound arguments like Wilson's method, one can show that the correct rate is $\frac{1}{2} d \log(d)$.
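Since the coupling time depends only on the coordinate choices $U_1, U_2, \ldots$, it can be simulated as a coupon collector time. A minimal sketch comparing the empirical tail $P(T > d\log d + cd)$ with the bound $e^{-c}$:

# Simulate the coupling time T = first time every coordinate has been chosen.
import random, math

def coupling_time(d):
    seen, t = set(), 0
    while len(seen) < d:
        t += 1
        seen.add(random.randrange(d))
    return t

d, trials = 50, 10_000
c = 2.0
threshold = d * math.log(d) + c * d
tail = sum(coupling_time(d) > threshold for _ in range(trials)) / trials
print(tail, math.exp(-c))   # empirical P(T > t) versus the e^{-c} bound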
We conclude our discussion with a look at grand couplings. The idea here is to build copies of the chain started at each possible initial state using a common source of randomness. In other words, we wish to construct a collection of random variables $\{X_t^x : x \in S, t \in \mathbb{N}_0\}$ on a common probability space $(\Omega, \mathcal{F}, P)$ such that for each $x \in S$, $\{X_t^x\}_{t=0}^{\infty}$ is a Markov chain with transition probability $p$ and initial state $X_0^x = x$. To apply the coupling bound, we also need the various trajectories to stay together after their first meeting.

One way to achieve such a setup is through a random mapping representation of $p$, which is a pair $(f, Z)$ such that $Z$ is a random variable on $(\Omega, \mathcal{F}, P)$ and $f : S \times \Omega \to S$ is a (deterministic) function satisfying
$$P(f(x, Z) = y) = p(x, y) \quad \text{for all } x, y \in S.$$
It is left as an exercise to show that every transition probability on a countable state space $S$ admits a random mapping representation. (Hint: Let $Z \sim U(0, 1)$ and consider the array $F_{i,j} = \sum_{k=1}^{j} p(s_i, s_k)$ where $p$ is the transition function and $\{s_1, s_2, \ldots\}$ is an enumeration of the state space.)
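The hint amounts to inverse-CDF sampling, and the resulting $(f, Z)$ immediately yields a grand coupling: a single uniform variable drives every copy of the chain. A minimal sketch with an illustrative kernel:

# f(s_i, z) returns the first state s_j with F_{i,j} > z, where
# F_{i,j} = sum_{k <= j} p(s_i, s_k); then P(f(x, Z) = y) = p(x, y).
import random

P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]     # illustrative kernel

def f(x, z):
    acc = 0.0
    for y, pr in enumerate(P[x]):
        acc += pr
        if z < acc:
            return y
    return len(P[x]) - 1

# Grand coupling: one uniform Z_t drives every copy of the chain at once.
states = [0, 1, 2]
for t in range(100):
    z = random.random()
    states = [f(x, z) for x in states]
    if len(set(states)) == 1:
        print("coalesced at time", t + 1, "in state", states[0])
        break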
If $(f, Z)$ is a random mapping representation for $p$ and $Z_0, Z_1, \ldots$ is an i.i.d. sequence on $(\Omega, \mathcal{F}, P)$ distributed as $Z$, then a grand coupling is given by taking $X_0^x = x$ and $X_{t+1}^x = f(X_t^x, Z_t)$ for each $x \in S$, $t \in \mathbb{N}_0$.

Random mapping representations can be represented more compactly as iterates of random functions by writing $f_t(\cdot) = f(\cdot, Z_t)$. If we define $F_i^j = f_{j-1} \circ f_{j-2} \circ \cdots \circ f_{i+1} \circ f_i$ for $i < j$, then $X_t = F_0^t(X_0)$ is a forward simulation of the Markov chain with transition probability $p$ and initial distribution $\mathcal{L}(X_0)$.

The coalescence time is defined by
$$T_c = \inf\{t : F_0^t \text{ is a constant function}\}.$$
It follows from the coupling lemma and the fact that $d(t) \leq \bar{d}(t)$ that we have the variation bound
$$d(t) \leq P(T_c > t).$$
* Note that the unique element in the range of $F_0^{T_c}$ is not necessarily distributed according to $\pi$, as evidenced by the two-state example given previously. However, if
$$R_c = \inf\{t : F_{-t}^0 \text{ is a constant function}\},$$
then $F_{-R_c}^0(x) \sim \pi$. This observation lies at the heart of the perfect sampling scheme known as coupling from the past.

(Roughly, by time homogeneity and the convergence theorem,
$$\lim_{t \to \infty} P\left( F_{-t}^0(x) = y \right) = \lim_{t \to \infty} P\left( F_0^t(x) = y \right) = \pi(y),$$
so $F_{-\infty}^0, F_0^\infty \sim \pi$. If $r > R_c$, then
$$F_{-r}^0 = f_{-1} \circ \cdots \circ f_{-R_c} \circ f_{-R_c - 1} \circ \cdots \circ f_{-r} = F_{-R_c}^0 \circ f_{-R_c - 1} \circ \cdots \circ f_{-r} = F_{-R_c}^0,$$
hence $F_{-R_c}^0 = F_{-\infty}^0 \sim \pi$. This trick does not work in the forward direction, because if $t > T_c$, then
$$F_0^t = f_{t-1} \circ \cdots \circ f_{T_c} \circ f_{T_c - 1} \circ \cdots \circ f_0 = f_{t-1} \circ \cdots \circ f_{T_c} \circ F_0^{T_c}$$
is not equal to $F_0^{T_c}$ in general.)
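A minimal implementation of coupling from the past for a small finite chain, reusing the inverse-CDF map from above. The key point, reflected in the code, is that the random map attached to each past time step is fixed once and reused as the starting time recedes. The kernel is illustrative; its exact stationary distribution is $\pi = (1/4, 1/2, 1/4)$, which the empirical frequencies should match.

import random

def f(x, z, P):
    acc = 0.0
    for y, pr in enumerate(P[x]):
        acc += pr
        if z < acc:
            return y
    return len(P[x]) - 1

def cftp(P, rng=random.random):
    zs = []                                # zs[k] drives the step at time -(k+1)
    while True:
        zs.append(rng())                   # extend one step further into the past
        states = list(range(len(P)))
        for z in reversed(zs):             # apply f_{-t}, ..., f_{-1} in order
            states = [f(x, z, P) for x in states]
        if len(set(states)) == 1:
            return states[0]               # F_{-t}^0 is constant; value ~ pi

P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
samples = [cftp(P) for _ in range(20_000)]
print([samples.count(s) / len(samples) for s in range(3)])  # approx (.25, .5, .25)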
Example 11.2 (Metropolis chain for proper $q$-colorings). Let $G = (V, E)$ be a graph. A proper $q$-coloring of $G$ is an element $x \in \{1, 2, \ldots, q\}^V$ such that $x(u) \neq x(v)$ whenever $\{u, v\} \in E$ (henceforth $u \sim v$).

We can generate an approximation to the uniform distribution on the set $\Lambda$ of proper $q$-colorings of $G$ using the Metropolis algorithm: From any coloring $x$, select a vertex $v$ uniformly from $V$ and a color $j$ uniformly from $[q]$. If $j \neq x(u)$ for any $u \sim v$, assign $v$ the color $j$. Otherwise do nothing.

By applying these dynamics on the larger space $\tilde{\Lambda} = [q]^V$, we can use a grand coupling to prove
Theorem 11.2. Let $G$ be a graph with $n$ vertices and maximal degree $\Delta$. For the Metropolis chain on proper $q$-colorings of $G$ with $q > 3\Delta$, we have
$$t_{mix}(\varepsilon) \leq \left( 1 - \frac{3\Delta}{q} \right)^{-1} n \log\left( \frac{n}{\varepsilon} \right).$$

Proof. Let $(v_0, k_0), (v_1, k_1), \ldots$ be i.i.d. uniform on $V \times [q]$, and for each $x \in \tilde{\Lambda}$, define $\{X_t^x\}_{t \geq 0}$ by $X_0^x = x$ and
$$X_{t+1}^x(v) = \begin{cases} k_t, & v = v_t \text{ and } X_t^x(u) \neq k_t \text{ for all } u \sim v \\ X_t^x(v), & \text{else} \end{cases}.$$
Define the metric $\rho$ on $\tilde{\Lambda}$ by
$$\rho(x, y) = \sum_{v \in V} 1\{x(v) \neq y(v)\}.$$
Also, for each $x \in \tilde{\Lambda}$, $v \in V$, denote the set of colors of neighbors of $v$ in $x$ by
$$N(x, v) = \{j \in [q] : x(u) = j \text{ for some } u \sim v\}.$$

Now suppose that $x, y \in \tilde{\Lambda}$ are such that $\rho(x, y) = 1$. Then $x$ and $y$ differ at only one vertex, say $w$. Let's see what happens to the distance after an update. Since $N(x, w) = N(y, w)$ by assumption, we have
$$P(\rho(X_1^x, X_1^y) = 0) = P(v_0 = w, k_0 \notin N(x, w)) = \frac{1}{n} \cdot \frac{q - |N(x, w)|}{q} \geq \frac{q - \Delta}{nq}.$$
Similarly,
$$P(\rho(X_1^x, X_1^y) = 2) \leq P(v_0 \sim w, k_0 = y(w)) + P(v_0 \sim w, k_0 = x(w)) = \frac{|\{u \in V : u \sim w\}|}{n} \cdot \frac{1}{q} + \frac{|\{u \in V : u \sim w\}|}{n} \cdot \frac{1}{q} \leq \frac{2\Delta}{nq}.$$
The only other possible value for $\rho(X_1^x, X_1^y)$ is $1$, so we have
$$E[\rho(X_1^x, X_1^y) - 1] = -1 \cdot P(\rho(X_1^x, X_1^y) = 0) + 0 \cdot P(\rho(X_1^x, X_1^y) = 1) + 1 \cdot P(\rho(X_1^x, X_1^y) = 2) \leq \frac{2\Delta}{nq} - \frac{q - \Delta}{nq} = \frac{3\Delta - q}{nq},$$
or
$$E[\rho(X_1^x, X_1^y)] \leq 1 - \frac{q - 3\Delta}{nq}.$$

If $x, z \in \tilde{\Lambda}$ are such that $\rho(x, z) = r$, then there exist $x_0 = x, x_1, \ldots, x_{r-1}, x_r = z$ such that $\rho(x_{i-1}, x_i) = 1$ for $i = 1, \ldots, r$. It follows from the triangle inequality and the preceding estimate that
$$E[\rho(X_1^x, X_1^z)] \leq \sum_{k=1}^{r} E\left[ \rho\left( X_1^{x_{k-1}}, X_1^{x_k} \right) \right] \leq r \left( 1 - \frac{q - 3\Delta}{nq} \right) = \rho(x, z) \left( 1 - \frac{q - 3\Delta}{nq} \right).$$
Since the chain is time homogeneous, we have
$$E\left[ \rho(X_t^x, X_t^z) \mid X_{t-1}^x = x_{t-1}, X_{t-1}^z = z_{t-1} \right] = E\left[ \rho\left( X_1^{x_{t-1}}, X_1^{z_{t-1}} \right) \right] \leq \rho(x_{t-1}, z_{t-1}) \left( 1 - \frac{q - 3\Delta}{nq} \right),$$
so taking expectation with respect to $X_{t-1}^x, X_{t-1}^z$ gives
$$E[\rho(X_t^x, X_t^z)] \leq E\left[ \rho(X_{t-1}^x, X_{t-1}^z) \right] \left( 1 - \frac{q - 3\Delta}{nq} \right),$$
and thus
$$E[\rho(X_t^x, X_t^z)] \leq \rho(x, z) \left( 1 - \frac{q - 3\Delta}{nq} \right)^t$$
by induction.

Now Chebyshev's inequality shows that
$$P(X_t^x \neq X_t^z) = P(\rho(X_t^x, X_t^z) \geq 1) \leq E[\rho(X_t^x, X_t^z)] \leq \rho(x, z) \left( 1 - \frac{q - 3\Delta}{nq} \right)^t \leq n \left( 1 - \frac{q - 3\Delta}{nq} \right)^t$$
for all $x, z \in \tilde{\Lambda}$, hence
$$d(t) \leq \bar{d}(t) \leq \max_{x, z \in \Lambda} P(X_t^x \neq X_t^z) \leq \max_{x, z \in \tilde{\Lambda}} P(X_t^x \neq X_t^z) \leq n \left( 1 - \frac{q - 3\Delta}{nq} \right)^t \leq n \exp\left( -\frac{q - 3\Delta}{nq}\, t \right).$$
Since we assumed that $q > 3\Delta$, taking $t \geq \left( 1 - \frac{3\Delta}{q} \right)^{-1} n \log\left( \frac{n}{\varepsilon} \right)$ gives
$$d(t) \leq n \exp\left( -\frac{q - 3\Delta}{nq}\, t \right) \leq n e^{-\log\left( \frac{n}{\varepsilon} \right)} = \varepsilon.$$
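The update rule in the proof translates directly into code. A minimal sketch of the Metropolis coloring chain on a small hypothetical graph (here $\Delta = 3$ and $q = 10 > 3\Delta$, so Theorem 11.2 applies):

# Metropolis chain on proper q-colorings: draw a uniform (vertex, color)
# pair and recolor only when the proposal keeps the coloring proper.
import random

graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}   # Delta = 3
q = 10                                                 # q > 3 * Delta

def metropolis_step(x):
    v = random.choice(list(graph))
    j = random.randrange(1, q + 1)
    if all(x[u] != j for u in graph[v]):               # proposal stays proper
        x[v] = j
    return x

x = {0: 1, 1: 2, 2: 3, 3: 2}                           # a proper initial coloring
for t in range(10_000):
    x = metropolis_step(x)
print(x)   # approximately uniform over proper q-colorings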
12. Probability Preserving Dynamical Systems
Much of our investigation of probability theory has revolved around the long term behavior of sequences of
random variables. We continue this theme with a cursory look at ergodic theory. Roughly speaking, ergodic
theorems assert that under certain stability and irreducibility conditions time averages converge to space
averages. As usual, we begin with some definitions.

Definition. A sequence $X_0, X_1, \ldots$ is said to be stationary if $(X_0, X_1, \ldots) =_d (X_k, X_{k+1}, \ldots)$ for all $k \in \mathbb{N}$. Equivalently, $X_0, X_1, \ldots$ is stationary if for every $n, k \in \mathbb{N}_0$, we have $(X_0, \ldots, X_n) =_d (X_k, \ldots, X_{n+k})$.

We have already seen several examples of stationary sequences. For instance, i.i.d. sequences are stationary, and more generally, so are exchangeable sequences. Another example of a stationary sequence is a Markov chain $X_0, X_1, \ldots$ started in equilibrium.
To treat the general case, we introduce the following construct.
Definition. Given a probability space $(\Omega, \mathcal{F}, P)$, a measurable map $T : \Omega \to \Omega$ is said to be probability preserving if $P(T^{-1}A) = P(A)$ for all $A \in \mathcal{F}$, where $T^{-1}A = \{\omega \in \Omega : T\omega \in A\}$ denotes the preimage of $A$ under $T$. We say that the tuple $(\Omega, \mathcal{F}, P, T)$ is a probability preserving dynamical system.

Iterates of $T$ and $T^{-1}$ are defined inductively by $T^0\omega = \omega$ and, for $n \geq 1$, $T^n = T \circ T^{n-1}$, $T^{-n} = T^{-1} \circ T^{-(n-1)} = (T^n)^{-1}$.

* We use the inverse image in our definitions because $A \in \mathcal{F}$ does not necessarily imply that $TA \in \mathcal{F}$. Also, beware that some authors say that $P$ is an invariant measure for $T$ rather than that $T$ preserves $P$.

Finally, observe that since the push-forward measure $T_* P = P \circ T^{-1}$ is equal to $P$, the change of variables formula shows that
$$\int_\Omega f \circ T \, dP = \int_\Omega f \, dT_* P = \int_\Omega f \, dP$$
for all $f$ for which the latter integral is defined.

If $X$ is a random variable on $(\Omega, \mathcal{F}, P)$ and $T : \Omega \to \Omega$ is probability preserving, then $X_n(\omega) = X(T^n\omega)$ defines a stationary sequence since for any $n, k \in \mathbb{N}$ and any Borel set $B \in \mathcal{B}^{n+1}$, if $A = \{\omega : (X_0(\omega), \ldots, X_n(\omega)) \in B\}$, then
$$P((X_k, \ldots, X_{n+k}) \in B) = P(T^{-k}A) = P(A) = P((X_0, \ldots, X_n) \in B).$$

In fact, every stationary sequence taking values in a nice space can be expressed in this form: If $Y_0, Y_1, \ldots$ is a stationary sequence of random variables taking values in a nice space $(S, \mathcal{S})$, then the Kolmogorov extension theorem gives a measure $P$ on $(S^{\mathbb{N}_0}, \mathcal{S}^{\mathbb{N}_0})$ such that the coordinate projections $X_n(\omega) = \omega_n$ satisfy $(X_0, X_1, \ldots) =_d (Y_0, Y_1, \ldots)$. If we let $X = X_0$ and $T = \theta$ (the shift map), then $T$ is probability preserving and $X_n(\omega) = \omega_n = (\theta^n\omega)_0 = X(T^n\omega)$.
In light of the preceding observations, we will assume henceforth that we are working with stationary sequences of the form $X_n(\omega) = X(T^n\omega)$ for some $(S, \mathcal{S})$-valued random variable $X$ defined on a probability space $(\Omega, \mathcal{F}, P)$ and some probability preserving map $T : \Omega \to \Omega$.
For subsequent results, we will need a few more definitions.

Definition. Let $(\Omega, \mathcal{F}, P, T)$ be a probability preserving dynamical system. We say that an event $A \in \mathcal{F}$ is invariant if $T^{-1}A = A$ up to null sets - that is, $P(T^{-1}A \,\triangle\, A) = 0$ where $\triangle$ denotes the symmetric difference, $E \triangle F = (E \setminus F) \cup (F \setminus E)$. A random variable $X$ is called invariant if $X \circ T = X$ a.s.

It is left as an exercise to show

Proposition 12.1. $\mathcal{I} = \{A \in \mathcal{F} : A \text{ is invariant}\}$ is a sub-$\sigma$-field of $\mathcal{F}$, and $X \in \mathcal{I}$ if and only if $X$ is invariant.

Definition. We say that $T$ is ergodic if for every invariant event $A \in \mathcal{I}$, we have $P(A) \in \{0, 1\}$.

Ergodicity is a kind of irreducibility requirement: $T$ is ergodic if $\Omega$ cannot be decomposed as $\Omega = A \sqcup B$ with $A, B \in \mathcal{I}$ and $P(A), P(B) > 0$. Ergodic maps are good at mixing things up in the sense that they don't fix nontrivial subsets.
A useful test for ergodicity is given by

Proposition 12.2. $T$ is ergodic if and only if every invariant $X : \Omega \to \mathbb{R}$ is a.s. constant.

Proof. Suppose that $T$ is ergodic and $X \circ T = X$ a.s. For any $a \in \mathbb{R}$, the set $E_a = \{\omega \in \Omega : X(\omega) < a\}$ is clearly invariant since $T^{-1}E_a = \{\omega : X(T\omega) < a\} = E_a$ a.s. Ergodicity implies that $P(E_a) \in \{0, 1\}$ for all $a \in \mathbb{R}$, hence $X$ is a.s. constant.

Conversely, suppose that the only invariant random variables are a.s. constant, and let $A \in \mathcal{I}$. Then $1_A$ is an invariant random variable, and thus is a.s. constant, and we conclude that $P(A) \in \{0, 1\}$.

Note that the above proof also holds if we restrict our attention to classes of random variables containing the indicator functions, such as $X \in L^p(\Omega, \mathcal{F}, P)$.

* For the sake of clarity, we will sometimes use the notation of functions rather than random variables, but ultimately the two are the same.
Example 12.1 (Rotation of the Circle). Let $(\Omega, \mathcal{F}, P)$ be $[0, 1)$ with the Borel sets and Lebesgue measure. For $\alpha \in (0, 1)$, define $T_\alpha : \Omega \to \Omega$ by $T_\alpha x = x + \alpha \ (\mathrm{mod}\ 1)$. $T_\alpha$ is clearly measurable and probability preserving.

* We consider $T_\alpha$ a rotation since $[0, 1)$ may be identified with $\mathbb{T} = \{z \in \mathbb{C} : |z| = 1\}$ via the map $x \mapsto e^{2\pi i x}$.

If $\alpha \in \mathbb{Q}$, then $\alpha = \frac{m}{n}$ for some $m, n \in \mathbb{N}$ with $(m, n) = 1$, so for any measurable $B \subseteq \left[ 0, \frac{1}{2n} \right)$ with $P(B) > 0$, the set $O(B) = \bigcup_{k=0}^{n-1} \left( B + \frac{k}{n} \ (\mathrm{mod}\ 1) \right)$ is invariant with $P(O(B)) \in (0, 1)$, thus $T_\alpha$ is not ergodic. (Alternatively, the function $f(x) = e^{2\pi i n x}$ is nonconstant and invariant.)

However, $T_\alpha$ is ergodic whenever $\alpha \notin \mathbb{Q}$. To see that this is so, we note that any square-integrable $f : [0, 1) \to \mathbb{R}$ has Fourier expansion $\sum_{k \in \mathbb{Z}} c_k e^{2\pi i k x}$ where
$$c_k = \int_0^1 f(x) e^{-2\pi i k x}\, dx \quad \text{and} \quad \sum_{k=-n}^{n} c_k e^{2\pi i k x} \to f(x) \text{ in } L^2([0, 1)).$$
If $f$ is invariant, then we have
$$f(x) = f(T_\alpha x) = \sum_{k \in \mathbb{Z}} c_k e^{2\pi i k (x + \alpha\ (\mathrm{mod}\ 1))} = \sum_{k \in \mathbb{Z}} c_k e^{2\pi i k \alpha} e^{2\pi i k x}$$
(where equality is in the $L^2$ sense). Uniqueness of Fourier coefficients implies that $c_k = c_k e^{2\pi i k \alpha}$ for all $k \in \mathbb{Z}$. Since $\alpha$ is irrational, $e^{2\pi i k \alpha} \neq 1$ for $k \neq 0$, so it must be the case that $c_k = 0$ for all $k \in \mathbb{Z} \setminus \{0\}$. It follows that any invariant $f \in L^2$ is constant, which shows that $T_\alpha$ is ergodic.
Example 12.2 (Affine Expanding Maps). Again take $(\Omega, \mathcal{F}, P)$ to be $[0, 1)$ with Lebesgue measure. For any integer $d \geq 2$, define $T_d x = dx \ (\mathrm{mod}\ 1)$. Note that
$$T_d^{-1}x = \left\{ \frac{x}{d} + \frac{j}{d} : j = 0, 1, \ldots, d - 1 \right\}.$$
(Throughout this example, all operations are taken modulo $1$ and we suppress the mod notation for convenience.)

For any interval $I = [a, b)$ with $0 \leq a < b < 1$, $T_d^{-1}I$ can be expressed as the disjoint union $T_d^{-1}I = \bigsqcup_{j=0}^{d-1} I_j$ with $I_j = \left[ \frac{a+j}{d}, \frac{b+j}{d} \right)$, so
$$P\left( T_d^{-1}I \right) = \sum_{j=0}^{d-1} \left( \frac{b+j}{d} - \frac{a+j}{d} \right) = d \cdot \frac{b - a}{d} = P(I).$$
Thus, by a $\pi$-$\lambda$ argument, $T_d$ is probability preserving.

To establish ergodicity, we suppose that $f \in L^2$ is invariant with Fourier series $\sum_k c_k e^{2\pi i k x}$. Then
$$f(T_d x) = \sum_{k \in \mathbb{Z}} c_k e^{2\pi i k (dx\ (\mathrm{mod}\ 1))} = \sum_{k \in \mathbb{Z}} c_k e^{2\pi i k d x},$$
so comparing coefficients with $f(x) = \sum_k c_k e^{2\pi i k x}$ shows that $c_k = c_{dk}$. Iterating yields $c_k = c_{dk} = c_{d^2 k} = \ldots$ Since Parseval's identity gives
$$\sum_{k \in \mathbb{Z}} |c_k|^2 = \int_0^1 |f(x)|^2\, dx < \infty,$$
it must be the case that $c_k = 0$ for all $k \neq 0$, hence $f$ is constant.
Before moving on to the ergodic theorems, we present the following simple yet incredible result due to Henri
Poincaré.
Theorem 12.1 (Poincaré Recurrence Theorem). Let $(\Omega, \mathcal{F}, P, T)$ be a probability preserving dynamical system, and let $U \in \mathcal{F}$. Then for almost every $\omega \in U$, $T^n\omega \in U$ infinitely often. If $T$ is ergodic and $P(U) > 0$, then $P(T^n\omega \in U \text{ i.o.}) = 1$.

Proof. Let $U_n = \bigcup_{j=n}^{\infty} T^{-j}U$ be the set of points $\omega \in \Omega$ that enter $U$ at least once at or after time $n$. Then $U_{n+1} = T^{-1}U_n$, so $P(U_{n+1}) = P(T^{-1}U_n) = P(U_n)$, hence $P(U_n) = P(U_0)$ for all $n$ by induction.

Setting $W = \bigcap_{n=0}^{\infty} U_n = \{\omega \in \Omega : T^n\omega \in U \text{ i.o.}\}$ and noting that $U_0 \supseteq U_1 \supseteq U_2 \ldots$, it follows from continuity from above that
$$P(W) = \lim_{n \to \infty} P(U_n) = P(U_0).$$
The set of points in question is $V = U \cap W$. Since $U, W \subseteq U_0$ and $P(U_0) = P(W)$, we conclude that $P(U) = P(V)$.

For the second claim, note that $P(W) \geq P(V) = P(U) > 0$. Since $W$ is invariant, ergodicity implies that $P(W) \in \{0, 1\}$, so we must have $P(W) = 1$.
To put Theorem 12.1 in perspective, suppose that $\Omega$ is a separable metric space and $\mathcal{F}$ contains the Borel sets for the metric topology. Then we can cover $\Omega$ with countably many open balls of arbitrarily small radius, and almost every point in each ball returns to its neighborhood of origin infinitely often. It follows that $P(\liminf_{n \to \infty} d(T^n x, x) = 0) = 1$.

This interpretation of the recurrence theorem lends itself to all sorts of amusing inferences about thermodynamics, cosmology, and so forth, but we will resist the urge to indulge in such speculations here.
A natural question to ask after seeing Theorem 12.1 is "How long does it take to return?" The following result (due to Mark Kac) says that if $T$ is ergodic, the expected first return time for a point in $A$ is $\frac{1}{P(A)}$.

Theorem 12.2. Let $(\Omega, \mathcal{F}, P, T)$ be a probability preserving dynamical system with $T$ ergodic, and suppose that $A \in \mathcal{F}$ has $P(A) > 0$. Define $\tau_A(\omega) = \inf\{n \geq 1 : T^n\omega \in A\}$. Then
$$\int_A \tau_A \, dP = 1.$$
Proof. * In what follows, equality and inclusion are taken to mean up to a null set. Because the relevant definitions are stated in these terms and all constructions are countable, there is no harm in being loose on this point and the argument is much clearer without constant qualification.

For each $n \in \mathbb{N}$, set $A_n = \{\omega \in A : \tau_A(\omega) = n\}$. The $A_n$'s are disjoint and the recurrence theorem implies that their union is $A$. Consequently, we have
$$\sum_{n=1}^{\infty} P(A_n) = P\left( \bigcup_{n=1}^{\infty} A_n \right) = P(A).$$
Similarly, for $n \geq 1$, define $B_n = \{\omega \in \Omega : \tau_A(\omega) = n\}$, and let $B = \bigcup_{n=1}^{\infty} B_n$. Since
$$T^{-1}B_n = \{\omega \in \Omega : T\omega \in B_n\} = \{\omega \in \Omega : T^{n+1}\omega \in A,\ T^k\omega \notin A \text{ for } 2 \leq k \leq n\} = B_{n+1} \cup T^{-1}A_n,$$
we see that
$$T^{-1}B = \bigcup_{n=1}^{\infty} \left( B_{n+1} \cup T^{-1}A_n \right) = \left( \bigcup_{m=2}^{\infty} B_m \right) \cup T^{-1}\left( \bigcup_{n=1}^{\infty} A_n \right) = \left( \bigcup_{m=2}^{\infty} B_m \right) \cup T^{-1}A = \bigcup_{n=1}^{\infty} B_n = B,$$
hence $B$ is an invariant set.

Moreover, $A_n \subseteq B_n$ by definition, so $A \subseteq B$, and thus $P(B) \geq P(A) > 0$. Therefore, it follows from ergodicity that $P(B) = 1$. Since the $B_n$'s are disjoint, we have
$$\sum_{n=1}^{\infty} P(B_n) = P(B) = 1.$$
Now we can write
$$\int_A \tau_A \, dP = \sum_{n=1}^{\infty} n P(A_n) = \sum_{n=1}^{\infty} \sum_{k=n}^{\infty} P(A_k),$$
so if we can show that $\sum_{k=n}^{\infty} P(A_k) = P(B_n)$, then we will obtain
$$\int_A \tau_A \, dP = \sum_{n=1}^{\infty} \sum_{k=n}^{\infty} P(A_k) = \sum_{n=1}^{\infty} P(B_n) = 1$$
as desired.

When $m = 1$, we have $B_1 = T^{-1}A$ by definition, so $P(B_1) = P(A) = \sum_{k=1}^{\infty} P(A_k)$. Now suppose that $\sum_{k=m}^{\infty} P(A_k) = P(B_m)$. Since $T^{-1}B_n = B_{n+1} \cup T^{-1}A_n$ with $B_{n+1}$ and $T^{-1}A_n$ disjoint (because $\omega \in T^{-1}A_n$ implies $T\omega \in A_n \subseteq A$ and thus $\omega \notin B_{n+1}$), the fact that $T$ preserves $P$ implies
$$P(B_n) = P(T^{-1}B_n) = P(B_{n+1}) + P(T^{-1}A_n) = P(B_{n+1}) + P(A_n),$$
so the inductive hypothesis gives
$$P(B_{m+1}) = P(B_m) - P(A_m) = \sum_{k=m}^{\infty} P(A_k) - P(A_m) = \sum_{k=m+1}^{\infty} P(A_k).$$
This completes the inductive step and the proof.
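Kac's formula is easy to test by simulation. Since irrational rotation is ergodic (Example 12.1), starting from a uniform point of $A = [0, 0.1)$ the mean first return time should be $1/P(A) = 10$; a minimal sketch:

# Monte Carlo check of Kac's formula for the rotation T(x) = x + alpha (mod 1).
import math, random

alpha = math.sqrt(2) - 1          # an irrational rotation number
a, b = 0.0, 0.1                   # A = [a, b), P(A) = 0.1

def tau(x):
    n = 0
    while True:
        n += 1
        x = (x + alpha) % 1.0
        if a <= x < b:
            return n

samples = [tau(random.uniform(a, b)) for _ in range(100_000)]
print(sum(samples) / len(samples))   # approximately 1 / P(A) = 10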
13. Ergodic Theorems

Suppose that $(\Omega, \mathcal{F}, P, T)$ is a probability preserving dynamical system. If $f$ is a real-valued measurable function on $\Omega$ (i.e. a random variable), we define the $n$th Birkhoff sum as
$$f^n(x) = \sum_{k=0}^{n-1} f\left( T^k x \right),$$
so that $\frac{1}{n} f^n$ gives the time average over orbit segments of length $n$.

Ergodic theorems are statements of the form "$\frac{1}{n} f^n \to \bar{f}$ where $\bar{f}$ is invariant with $\int_\Omega \bar{f}\, dP = \int_\Omega f\, dP$."

If $g = f \cdot 1_A$ for $A \in \mathcal{I}$, then
$$\frac{1}{n} g^n = \frac{1}{n} \sum_{k=0}^{n-1} (f \cdot 1_A) \circ T^k = \frac{1}{n} \sum_{k=0}^{n-1} (f \circ T^k)\, 1_{T^{-k}A} = \frac{1}{n} \sum_{k=0}^{n-1} (f \circ T^k)\, 1_A = \frac{1}{n} f^n \cdot 1_A,$$
so
$$\int_A \bar{f}\, dP = \int_\Omega \bar{g}\, dP = \int_\Omega g\, dP = \int_A f\, dP.$$
(In the cases we consider, $g$ will satisfy the assumptions of an ergodic theorem if $f$ does.) Thus, in probabilistic terms, $\bar{f} = E[f \mid \mathcal{I}]$.

When $T$ is ergodic, $\bar{f}$ must be a.s. constant, hence $\bar{f} = E[f]$ - the space average.

The first proof of an ergodic theorem was by John von Neumann in 1931, establishing convergence in $L^2$ (and thus in $L^1$ by an approximation argument). This result is known as the mean ergodic theorem. von Neumann's paper was not published until 1932, by which time George Birkhoff (after hearing about the former's work) had already published a proof of the pointwise ergodic theorem establishing a.s. convergence.

We begin with a cute proof of the mean ergodic theorem in $L^2$ due to Frigyes Riesz.
Theorem 13.1. Let $S = \{g \in L^2 : g \circ T = g\}$ and let $P_S$ be the projection from $L^2$ onto $S$. For all $f \in L^2$, one has $\frac{1}{n} f^n \to P_S f$ in $L^2$.
Proof. Define the Koopman operator $U : L^2 \to L^2$ by $Uf = f \circ T$. Then
$$\langle Uf, Ug \rangle = \int (Uf)(x)(Ug)(x)\, dP(x) = \int (fg) \circ T(x)\, dP(x) = \int (fg)(x)\, dP(x) = \langle f, g \rangle,$$
where the penultimate equality used the fact that $T$ preserves $P$.

* This shows that $\langle f, g \rangle = \langle Uf, Ug \rangle = \langle f, U^*Ug \rangle$ for all $f, g \in L^2$, so $U^*U$ is the identity. A bounded linear operator with this property is said to be an isometry. If $U$ is surjective as well, then $UU^*$ is also the identity and we say that $U$ is unitary.

Now set $W = \{Uh - h : h \in L^2\}$. We will show that $W^\perp = S$.

To see that $S \subseteq W^\perp$, let $f \in S$, $g \in W$ so that $f = Uf$ and $g = Uh - h$ for some $h \in L^2$. Then
$$\langle f, g \rangle = \langle f, Uh - h \rangle = \langle f, Uh \rangle - \langle f, h \rangle = \langle Uf, Uh \rangle - \langle f, h \rangle = 0,$$
so, since $g \in W$ was arbitrary, we have that $f \in W^\perp$.

For the reverse inclusion, let $f \in W^\perp$ so that for every $h \in L^2$, we have $\langle f, Uh - h \rangle = 0$, hence $\langle f, h \rangle = \langle f, Uh \rangle = \langle U^*f, h \rangle$. Since $h$ was arbitrary, this implies that $U^*f = f$. Accordingly, we have
$$\|Uf - f\|_2^2 = \langle Uf - f, Uf - f \rangle = \langle Uf, Uf \rangle - \langle Uf, f \rangle - \langle f, Uf \rangle + \langle f, f \rangle = \|f\|_2^2 - \langle f, U^*f \rangle - \langle U^*f, f \rangle + \|f\|_2^2 = 2\|f\|_2^2 - 2\langle f, f \rangle = 0,$$
and we conclude that $f \in S$.

Since $W$ is clearly a subspace of $L^2$ (though not necessarily closed), we have $L^2 = \overline{W} \oplus S$. (If $V$ is a closed subspace of a Hilbert space $H$, then $H = V \oplus V^\perp$. Since $W \subseteq \overline{W}$, $\overline{W}^\perp \subseteq W^\perp$, so $H = \overline{W} \oplus \overline{W}^\perp \subseteq \overline{W} + W^\perp \subseteq H$. To see that the latter sum is direct, note that $x \in \overline{W} \cap W^\perp$ implies $\langle x, x \rangle = \langle x, \lim_n x_n \rangle = \lim_n \langle x, x_n \rangle = 0$.)

Now if $f \in S$, then $U^k f = f$ for all $k \geq 0$, so
$$\frac{1}{n} f^n = \frac{1}{n} \sum_{k=0}^{n-1} U^k f = \frac{1}{n} \cdot n f = f,$$
hence $\frac{1}{n} f^n \to f = P_S f$.

If $g = Uh - h \in W$, then
$$g^n = \sum_{k=0}^{n-1} U^k g = \sum_{k=0}^{n-1} \left( U^{k+1}h - U^k h \right) = U^n h - h,$$
thus $\frac{1}{n} g^n = \frac{1}{n}(U^n h - h) \to 0 = P_S g$. (Recall that convergence is in the $L^2$ sense.)

If $g \in \overline{W}$, there exists a sequence $g_i \in W$ with $g_i \to g$, so for any $\varepsilon > 0$, taking $i$ so that $\|g - g_i\|_2 < \varepsilon$ shows that we can take $n$ large enough that
$$\left\| \frac{1}{n} \sum_{k=0}^{n-1} U^k g \right\|_2 \leq \left\| \frac{1}{n} \sum_{k=0}^{n-1} U^k (g - g_i) \right\|_2 + \left\| \frac{1}{n} \sum_{k=0}^{n-1} U^k g_i \right\|_2 < 2\varepsilon,$$
since $U$ is an isometry and $\frac{1}{n} \sum_{k=0}^{n-1} U^k g_i \to 0$, so $\frac{1}{n} g^n \to 0$ for $g \in \overline{W}$.

Finally, if $F \in L^2$, then $F = f + g$ with $f = P_S F \in S$ and $g \in \overline{W}$, so $\frac{1}{n} F^n = \frac{1}{n} f^n + \frac{1}{n} g^n \to f = P_S F$ in $L^2$.

* Note that the limit function is the projection of $f$ onto the space of $T$-invariant functions in $L^2$. Theorem 1.6 shows that this is $E[f \mid \mathcal{I}]$. Also, it is worth observing that the above proof shows more generally that if $U$ is a linear isometry on a Hilbert space $H$, then $\frac{1}{n} \sum_{k=0}^{n-1} U^k x - P_{\ker(U - I)} x \to 0$ for all $x \in H$.
Analogous to the development of conditional expectation in terms of projection, we have

Corollary 13.1. If $f \in L^1$, then $\frac{1}{n} f^n \to \bar{f}$ in $L^1$ where $\bar{f}$ is invariant with $\int_\Omega \bar{f}\, dP = \int_\Omega f\, dP$.

Proof. Since bounded $L^1$ functions are dense in $L^1$ and are square integrable, for any $\varepsilon > 0$, there is a $g \in L_b^1 = \{h \in L^1 : \|h\|_\infty < \infty\}$ such that $\|f - g\|_1 < \varepsilon$. Since $g \in L_b^1 \subseteq L^2$, Theorem 13.1 shows that $\frac{1}{n} g^n \to \bar{g}$ in $L^2$ where $\bar{g} = P_S g$. As $\|\cdot\|_1 \leq \|\cdot\|_2$ (by Hölder's inequality), we have that $\frac{1}{n} g^n \to \bar{g}$ in $L^1$. Since
$$\left\| \frac{1}{n} f^n - \frac{1}{n} g^n \right\|_1 \leq \frac{1}{n} \sum_{k=0}^{n-1} \left\| (f - g) \circ T^k \right\|_1 \leq \|f - g\|_1 < \varepsilon,$$
we have that $\left\| \frac{1}{n} f^n - \bar{g} \right\|_1 \leq \left\| \frac{1}{n} f^n - \frac{1}{n} g^n \right\|_1 + \left\| \frac{1}{n} g^n - \bar{g} \right\|_1 < 2\varepsilon$ for $n$ sufficiently large. It follows that for all large $m, n$, we have
$$\left\| \frac{1}{n} f^n - \frac{1}{m} f^m \right\|_1 \leq \left\| \frac{1}{n} f^n - \bar{g} \right\|_1 + \left\| \bar{g} - \frac{1}{m} f^m \right\|_1 < 4\varepsilon,$$
so $\frac{1}{n} f^n$ has a limit $\bar{f}$ in $L^1$ (by completeness).

Invariance follows from uniqueness of $L^1$ limits: since $T$ preserves $P$,
$$\lim_{n \to \infty} \int_\Omega \left| \frac{1}{n} f^n \circ T - \bar{f} \circ T \right| dP = \lim_{n \to \infty} \int_\Omega \left| \frac{1}{n} f^n - \bar{f} \right| dP = 0,$$
while
$$\frac{1}{n} f^n \circ T = \frac{1}{n} \sum_{k=1}^{n} f \circ T^k = \frac{n+1}{n} \cdot \frac{1}{n+1} \sum_{k=0}^{n} f \circ T^k - \frac{1}{n} f \to_{L^1} \bar{f},$$
so $\bar{f} \circ T = \bar{f}$ a.s.

Finally, $L^1$ convergence, Fubini's theorem, and the fact that $T$ preserves $P$ imply
$$\int_\Omega \bar{f}\, dP = \lim_{n \to \infty} \int_\Omega \frac{1}{n} f^n\, dP = \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \int_\Omega f \circ T^k\, dP = \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \int_\Omega f\, dP = \int_\Omega f\, dP.$$
The key to proving the pointwise ergodic theorem is

Theorem 13.2 (Maximal Ergodic Theorem). Suppose that $f \in L^1$ and define $M(f) = \{x \in \Omega : \sup_{n \geq 1} f^n(x) > 0\}$. Then $\int_{M(f)} f\, dP \geq 0$.

Proof. Write $F_N(x) = \max_{0 \leq k \leq N} f^k(x)$ where we adopt the convention that $f^0 \equiv 0$. Then $F_1 \leq F_2 \leq \ldots$ pointwise, so the sets $M_N(f) = \{x \in \Omega : F_N(x) > 0\}$ form a nested increasing sequence with $M(f) = \bigcup_{N=1}^{\infty} M_N(f)$.

Now for every $k = 0, 1, \ldots, N$,
$$F_N(Tx) + f(x) = f(x) + \max_{0 \leq j \leq N} f^j(Tx) \geq f(x) + f^k(Tx) = f(x) + \sum_{j=0}^{k-1} f(T^{j+1}x) = \sum_{j=0}^{k} f(T^j x) = f^{k+1}(x),$$
hence $f(x) \geq f^k(x) - F_N(Tx)$ for $k = 1, \ldots, N+1$. Because $F_N(x) = \max_{1 \leq k \leq N} f^k(x)$ for $x \in M_N(f)$, this shows that $f(x) \geq F_N(x) - F_N(Tx)$ for $x \in M_N(f)$.

As $F_N$ is nonnegative and equal to $0$ on $\Omega \setminus M_N(f)$, we have
$$\int_{M_N(f)} f\, dP \geq \int_{M_N(f)} F_N\, dP - \int_{M_N(f)} F_N \circ T\, dP = \int_\Omega F_N\, dP - \int_{M_N(f)} F_N \circ T\, dP \geq \int_\Omega F_N\, dP - \int_\Omega F_N \circ T\, dP = 0.$$
Finally, since $M_N(f) \nearrow M(f)$, the dominated convergence theorem shows that
$$\int_{M(f)} f\, dP = \lim_{N \to \infty} \int_{M_N(f)} f\, dP \geq 0.$$
Corollary 13.2. If $M_\alpha(f) = \{x : \sup_{n \geq 1} \frac{1}{n} f^n(x) > \alpha\}$, then $\int_{M_\alpha(f)} f\, dP \geq \alpha P(M_\alpha(f))$.

Proof. Let $g = f - \alpha$. Then
$$g^n = \sum_{k=0}^{n-1} (f - \alpha) \circ T^k = \sum_{k=0}^{n-1} f \circ T^k - n\alpha = f^n - n\alpha,$$
so $\frac{1}{n} g^n = \frac{1}{n} f^n - \alpha$, and thus
$$M_\alpha(f) = \left\{ x : \sup_{n \geq 1} \frac{1}{n} f^n(x) > \alpha \right\} = \left\{ x : \sup_{n \geq 1} \left( \frac{1}{n} f^n(x) - \alpha \right) > 0 \right\} = \left\{ x : \sup_{n \geq 1} \frac{1}{n} g^n(x) > 0 \right\} = \left\{ x : \sup_{n \geq 1} g^n(x) > 0 \right\} = M(g).$$
Therefore, the maximal ergodic theorem implies
$$0 \leq \int_{M(g)} g\, dP = \int_{M_\alpha(f)} (f - \alpha)\, dP = \int_{M_\alpha(f)} f\, dP - \alpha P(M_\alpha(f)).$$
Corollary 13.3. If $A \subseteq M(f)$ is $T$-invariant, then $\int_A f\, dP \geq 0$.

Proof. Since $T^{-1}A = A$ up to a null set, $1_A \circ T = 1_A$ a.s., so if $g = f \cdot 1_A$, then
$$g^n = \sum_{k=0}^{n-1} (f \cdot 1_A) \circ T^k = \sum_{k=0}^{n-1} f \circ T^k \cdot 1_A = f^n \cdot 1_A.$$
It follows that
$$M(g) = \left\{ x \in \Omega : \sup_{n \geq 1} g^n(x) > 0 \right\} = \left\{ x \in A : \sup_{n \geq 1} f^n(x) > 0 \right\} = A \cap M(f) = A,$$
and thus
$$\int_A f\, dP = \int_{M(g)} g\, dP \geq 0.$$
We are now in a position to prove

Theorem 13.3 (Pointwise Ergodic Theorem). For any $f \in L^1$,
$$\lim_{n \to \infty} \frac{1}{n} f^n = \bar{f} \quad \text{a.s.}$$
where $\bar{f} \in L^1$ is invariant with $\int_\Omega \bar{f}\, dP = \int_\Omega f\, dP$.

Proof. Set
$$f^+(x) = \limsup_{n \to \infty} \frac{1}{n} f^n(x), \qquad f^-(x) = \liminf_{n \to \infty} \frac{1}{n} f^n(x).$$
Clearly $f^+$ and $f^-$ are invariant with $f^-(x) \leq f^+(x)$ for all $x$. We wish to show that $f^+ = f^-$ a.s., so we need to show that $M = \{x : f^-(x) < f^+(x)\}$ has $P(M) = 0$.

For $\alpha, \beta \in \mathbb{Q}$, let $M_{\alpha,\beta} = \{x : f^-(x) < \alpha, f^+(x) > \beta\}$. Then $M = \bigcup_{\alpha, \beta \in \mathbb{Q},\, \alpha < \beta} M_{\alpha,\beta}$, so it suffices to show that $P(M_{\alpha,\beta}) = 0$ for all $\alpha, \beta \in \mathbb{Q}$ with $\alpha < \beta$. We note at the outset that the invariance of $f^+$ and $f^-$ imply that $M_{\alpha,\beta}$ is an invariant set.

Now let $M_\beta^+ = \{x : f^+(x) > \beta\}$. If $x \in M_\beta^+$, then there is some $n \in \mathbb{N}$ such that $\frac{1}{n} f^n(x) > \beta$, so $(f - \beta)^n(x) = f^n(x) - n\beta > 0$, hence $x \in M(f - \beta)$. Since $M_{\alpha,\beta} \subseteq M_\beta^+ \subseteq M(f - \beta)$ is invariant, Corollary 13.3 shows that $\int_{M_{\alpha,\beta}} (f - \beta)\, dP \geq 0$, hence
$$\int_{M_{\alpha,\beta}} f\, dP \geq \beta P(M_{\alpha,\beta}).$$
Similarly, if $x \in M_\alpha^- := \{x : f^-(x) < \alpha\}$, then there is an $m$ with $\frac{1}{m} f^m(x) < \alpha$, so $(\alpha - f)^m(x) > 0$. It follows that $M_{\alpha,\beta} \subseteq M_\alpha^- \subseteq M(\alpha - f)$, so that
$$\int_{M_{\alpha,\beta}} f\, dP \leq \alpha P(M_{\alpha,\beta}).$$
Thus we have shown that
$$\beta P(M_{\alpha,\beta}) \leq \int_{M_{\alpha,\beta}} f\, dP \leq \alpha P(M_{\alpha,\beta}),$$
so since $\alpha < \beta$, we conclude that $P(M_{\alpha,\beta}) = 0$ as desired. This shows that $\frac{1}{n} f^n$ has an almost sure limit $f^*$.

By Corollary 13.1, we also have that $\frac{1}{n} f^n \to \bar{f}$ in $L^1$, and Fatou's lemma gives
$$\int \left| f^* - \bar{f} \right| dP = \int \liminf_n \left| \frac{1}{n} f^n - \bar{f} \right| dP \leq \liminf_n \int \left| \frac{1}{n} f^n - \bar{f} \right| dP = 0,$$
so $f^* = \bar{f}$ a.s. In light of previous observations, this shows that $\frac{1}{n} f^n$ converges a.s. to $\bar{f} = E[f \mid \mathcal{I}]$.
It is left as a homework exercise to use Theorem 13.3 to extend the mean ergodic theorem to

Theorem 13.4. If $f \in L^p$, $1 \leq p < \infty$, then $\frac{1}{n} f^n \to E[f \mid \mathcal{I}]$ in $L^p$.

We conclude our discussion of ergodic theorems with some examples.
Example 13.1 (Strong Law of Large Numbers). Suppose that $X_1, X_2, \ldots$ is an i.i.d. sequence with $X_1 \in L^1$. We can think of the $X_i$'s as coordinate projections on the probability space $(\mathbb{R}^\mathbb{N}, \mathcal{B}^\mathbb{N}, P)$ from Kolmogorov's theorem with $P$ arising from finite dimensional distributions given by product measure. If $\theta$ is the shift map, then $(\mathbb{R}^\mathbb{N}, \mathcal{B}^\mathbb{N}, P, \theta)$ is a probability preserving dynamical system.

An invariant set $A$ has $\{\omega \in A\} = \{\theta\omega \in A\} \in \sigma(X_2, X_3, \ldots)$. Iterating shows that $A \in \bigcap_{n=1}^{\infty} \sigma(X_n, X_{n+1}, \ldots) = \mathcal{T}$, the tail field. Since Kolmogorov's $0$-$1$ law shows that $\mathcal{T}$ is trivial, this shows that $\mathcal{I}$ is trivial, hence $\theta$ is ergodic.

The pointwise ergodic theorem gives
$$\frac{1}{n} \sum_{k=1}^{n} X_k = \frac{1}{n} \sum_{k=0}^{n-1} X_1 \circ \theta^k \to E[X_1 \mid \mathcal{I}] = E[X_1],$$
the strong law of large numbers! (Corollary 13.1 shows that we also have convergence in $L^1$.)
Example 13.2
S
space
(Markov Ergodic Theorem)
and stationary distribution
π
with
.
L1 .)
Example 13.2 (Markov Ergodic Theorem). Suppose that $\{X_n\}$ is a Markov chain with countable state space $S$ and stationary distribution $\pi$ with $\pi(x) > 0$ for all $x \in S$. If $X_0 \sim \pi$, then $X_0, X_1, \ldots$ is stationary and the shift map is probability preserving.

If $R$ is a recurrent communicating class, then $X_0 \in R$ implies $X_n \in R$ for all $n$, so $\{\omega : X_0(\omega) \in R\} \in \mathcal{I}$. Thus $\theta$ is not ergodic if the chain is not irreducible.

If the chain is irreducible and $A$ is invariant, then, taking $\mathcal{F}_n = \sigma(X_0, \ldots, X_n)$ as usual, it follows from the Markov property and the invariance of $A$ that

$$E_\pi[1_A \mid \mathcal{F}_n] = E_\pi[1_A \circ \theta^n \mid \mathcal{F}_n] = h(X_n)$$

where $h(x) = E_x[1_A]$. Lévy's 0-1 law implies that $E_\pi[1_A \mid \mathcal{F}_n] \to 1_A$ a.s., and since the chain is irreducible and recurrent (by irreducibility and the existence of a stationary distribution), for every $x \in S$, $P_\pi(X_n = x \text{ i.o.}) = 1$. This means that $h(X_n) = h(x)$ infinitely often for any $x \in S$, so it must be the case that $h \equiv 0$ or $h \equiv 1$ a.s. In other words, $P_\pi(A) \in \{0,1\}$, so the shift map is ergodic.

For any $f \in L^1(\pi)$, applying the ergodic theorem to $f \circ X_0$ yields

$$\frac{1}{n} \sum_{k=0}^{n-1} f(X_k) \to \sum_x f(x)\pi(x) \quad \text{a.s. and in } L^1.$$
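The following Python sketch (our addition) illustrates this on a made-up 3-state chain; the transition matrix and the function $f$ are arbitrary choices.

```python
import numpy as np

# A minimal sketch of the Markov ergodic theorem: time averages of f along
# the chain should converge to the spatial average sum_x f(x) pi(x).
rng = np.random.default_rng(1)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

w, v = np.linalg.eig(P.T)                    # stationary pi: left eigenvector
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

f = np.array([1.0, -2.0, 5.0])               # an arbitrary f on {0, 1, 2}
n, x, total = 100_000, 0, 0.0
for _ in range(n):
    total += f[x]
    x = rng.choice(3, p=P[x])

print("time average :", total / n)
print("space average:", float(f @ pi))
```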
Example 13.3 (Equidistribution Theorem). As a final example, recall that irrational rotation is ergodic. Thus Theorem 13.3 shows that for any $\alpha \in (0,1) \setminus \mathbb{Q}$, the map $T\omega = \omega + \alpha \ (\mathrm{mod}\ 1)$ satisfies

$$\frac{1}{n} \sum_{k=0}^{n-1} 1\{T^k \omega \in A\} \to \lambda(A) \quad \text{a.s.}$$

for all $A \in \mathcal{B}_{[0,1)}$, where $\lambda$ is Lebesgue measure. In fact, we can show

Claim. If $A = [a,b)$, then the exceptional set is $\emptyset$.
Proof. Write $A_k = \left[ a + \frac{1}{k},\ b - \frac{1}{k} \right)$. If $b - a > \frac{2}{k}$, then the ergodic theorem shows that

$$\frac{1}{n} \sum_{j=0}^{n-1} 1\{T^j \omega \in A_k\} \to b - a - \frac{2}{k} \quad \text{as } n \to \infty$$

for all $\omega \in \Omega_k$ where $\lambda(\Omega_k) = 1$. Let $G = \bigcap_{k > \frac{2}{b-a}} \Omega_k$. Since $\lambda(G) = 1$, $G$ is dense in $[0,1)$, thus for any $x \in [0,1)$ and $k \in \mathbb{N} \cap \left( \frac{2}{b-a}, \infty \right)$, we can find $\omega_k \in G$ so that $|\omega_k - x| < \frac{1}{k}$. Then $|\omega_k - x| < \frac{1}{k}$ implies that $T^j \omega_k \in A_k$ only if $T^j x \in A$, so we have

$$\liminf_{n \to \infty} \frac{1}{n} \sum_{j=0}^{n-1} 1\{T^j x \in A\} \geq \liminf_{n \to \infty} \frac{1}{n} \sum_{j=0}^{n-1} 1\{T^j \omega_k \in A_k\} = b - a - \frac{2}{k}.$$

As $k$ was arbitrary, we conclude that

$$\liminf_{n \to \infty} \frac{1}{n} \sum_{j=0}^{n-1} 1\{T^j x \in A\} \geq b - a.$$

A similar argument with $A^C$ shows that

$$\limsup_{n \to \infty} \frac{1}{n} \sum_{j=0}^{n-1} 1\{T^j x \in A\} \leq b - a,$$

and the claim follows. $\square$
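A minimal numerical sketch of the claim (our addition): for irrational rotation the visit frequency of $[a,b)$ converges for every starting point $x$. The values of $\alpha$, $a$, $b$, and $x$ below are arbitrary (and floating point $\alpha$ is of course only an approximation of an irrational).

```python
import numpy as np

# Visit frequencies of an interval under irrational rotation.
alpha = np.sqrt(2) - 1
a, b, x, n = 0.25, 0.6, 0.1, 1_000_000
orbit = (x + alpha * np.arange(n)) % 1.0     # T^j x = x + j * alpha (mod 1)
freq = np.mean((a <= orbit) & (orbit < b))
print(f"visit frequency {freq:.5f} vs lambda([a, b)) = {b - a:.5f}")
```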
A neat application of this result concerns the distribution of leading digits in the decimal expansion of powers of 2. The first few terms of the sequence of initial digits are

$$1, 2, 4, 8, 1, 3, 6, 1, 2, 5, 1, 2, 4, 8, 1, 3, 6, 1, \ldots$$

If $2^j$ has leading digit $k$, then $k \cdot 10^m \leq 2^j < (k+1) \cdot 10^m$ for some $m \in \mathbb{N}_0$, thus

$$m + \log_{10}(k) \leq j \log_{10}(2) < m + \log_{10}(k+1).$$

If $T$ is rotation by $\alpha = \log_{10}(2) \in (0,1) \setminus \mathbb{Q}$, then the above inequality shows that $2^j$ has leading digit $k$ precisely when

$$\log_{10}(k) \leq T^j 0 < \log_{10}(k+1).$$

Thus if $N_k(n) = \left\{ j \in [0, n-1] \cap \mathbb{Z} : k \text{ is the first digit in the decimal expansion of } 2^j \right\}$, then we have

$$\frac{|N_k(n)|}{n} = \frac{1}{n} \sum_{j=0}^{n-1} 1\left\{ \log_{10}(k) \leq T^j 0 < \log_{10}(k+1) \right\} \to \log_{10}(k+1) - \log_{10}(k) = \log_{10}\left( \frac{k+1}{k} \right).$$

(Of course, one can play the same game with powers and bases other than 2 and 10 and can consider the leading $r$ digits as well by the same basic reasoning.)

The distribution $p(k) = \log_{10}\left( \frac{k+1}{k} \right)$ on $\{1, \ldots, 9\}$ is known as Benford's law and it has been shown to model many real world data sets surprisingly well. It is sometimes used on financial and scientific data as a method of fraud detection.
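The limit just derived is easy to check directly (a numerical aside, our addition):

```python
from collections import Counter
from math import log10

# Frequency of leading digit k among 2^0, ..., 2^(n-1) versus Benford's law.
n = 10_000
counts = Counter(int(str(2 ** j)[0]) for j in range(n))
for k in range(1, 10):
    print(f"digit {k}: observed {counts[k] / n:.4f}, "
          f"Benford {log10((k + 1) / k):.4f}")
```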
The general idea is that if you have a collection of numbers that range over several orders of magnitude and are approximately uniformly distributed on a logarithmic scale, then the distribution of leading digits should be close to Benford's law: $x$ has leading digit $d$ if there is some $m$ with $d \cdot 10^m \leq x < (d+1) \cdot 10^m$, or $\log_{10}(d) + m \leq \log_{10}(x) < \log_{10}(d+1) + m$, and the width of the interval $[\log_{10}(d) + m,\ \log_{10}(d+1) + m)$ is $\log_{10}\left( \frac{d+1}{d} \right)$.

The reason that one wants the data to span several orders of magnitude is that a sample point has leading digit $d$ if it falls in any of the intervals $[\log_{10}(d) + m,\ \log_{10}(d+1) + m)$, $[\log_{10}(d) + m + 1,\ \log_{10}(d+1) + m + 1)$, $[\log_{10}(d) + m + 2,\ \log_{10}(d+1) + m + 2)$, etc., and averaging over more intervals blurs out local deviations from log uniformity.

Note that by a standard change of variables argument, a random variable whose logarithm is uniform has density $f(x) \propto x^{-1}$. Thus Benford's law describes the leading digits of random variables which obey power laws with exponent 1.

One way that this leading digit law might come up is if you were to examine the size of an exponential growth process at a random time. Basically, this is because exponential growth translates to linear growth on a logarithmic scale.

As with the power law description, what we are ultimately picking up on is some kind of self-similarity or scale invariance. This is one reason that Benford's law is so often observed empirically: In many financial applications, you expect the same basic picture whether you're working with dollars or yen. Similarly, if you are looking at scientific data, then it generally shouldn't matter too much whether length is measured in centimeters or furlongs.

A final example in which the law might arise is a process which is subject to successive multiplicative perturbations, so that its logarithm undergoes additive perturbations (i.e. random walk). The CLT shows that after many time steps, the law of the log should be approximately normal with huge variance, and this looks uniform in the central regime.
14. Brownian Motion
Historically, Brownian motion refers to the random movement of particles suspended in a fluid famously described by Robert Brown in 1827.

In 1905, Albert Einstein explained this phenomenon in terms of the motion of water molecules. Einstein's analysis and the experimental verification of his predictions were a major force behind the widespread acceptance of the atomic hypothesis.

Five years prior to the publication of Einstein's paper, Louis Bachelier gave a mathematical treatment of Brownian motion in the context of evaluating stock options. This work (Bachelier's PhD thesis) helped usher in the modern era of mathematical finance.

The first fully rigorous mathematical derivation of Brownian motion was due to Norbert Wiener in a series of papers in the early 1920s.

It is this Wiener process - the mathematical construct rather than its various physical manifestations - with which we will concern ourselves here, though we retain the colloquial term Brownian motion.
Definition. A real-valued stochastic process $\{B(t) : t \geq 0\}$ is called a (linear) Brownian motion if

(1) For all times $0 \leq t_1 < \cdots < t_n$, $B(t_1), B(t_2) - B(t_1), \ldots, B(t_n) - B(t_{n-1})$ are independent random variables,
(2) For all $t \geq 0$, $h > 0$, the increment $B(t+h) - B(t)$ is normal with mean $0$ and variance $h$,
(3) The function $t \mapsto B(t)$ is almost surely continuous.

If $B(0) = x$ a.s., we say that $\{B(t) : t \geq 0\}$ is a Brownian motion started at $x$. The case $x = 0$ is known as standard Brownian motion.

We will sometimes talk about Brownian motion on $[0, T]$ for some $T \geq 0$. This just means that properties 1-3 hold when we restrict our attention to times in $[0, T]$.

We will only concern ourselves with the one-dimensional case here, but we observe that if $B_1(t), \ldots, B_d(t)$ are independent linear Brownian motions started at $x_1, \ldots, x_d$, respectively, then $B(t) = (B_1(t), \ldots, B_d(t))^T$ is a $d$-dimensional Brownian motion started at $(x_1, \ldots, x_d)^T$.

On the one hand, we can regard $\{B(t) : t \geq 0\}$ as a collection of random variables $\omega \mapsto B(t, \omega)$ defined on some underlying probability space $(\Omega, \mathcal{F}, P)$ and indexed by $t \in [0, \infty)$.
Alternatively, Brownian motion can be thought of as a random function. Specifically, Property 3 allows us to interpret Brownian motion as a random variable taking values in the space of continuous functions $C([0,\infty), \mathbb{R})$. The target $\sigma$-algebra is the Borel sets for the topology of uniform convergence on compact sets (which coincides with the natural product $\sigma$-algebra generated by the coordinate maps $\pi_t : f \mapsto f(t)$).

The reason we don't work in the space of measurable functions from $[0,\infty)$ to $\mathbb{R}$ with the product $\sigma$-algebra is that every measurable set is then determined by the values of the functions at a countable number of points, hence the set of continuous functions is not even measurable!

For a fixed $\omega \in \Omega$, the map $t \mapsto B(t,\omega)$ is called a sample path or trajectory.*

* We will switch freely between the equivalent notations $B(t)$, $B_t$, $B(t,\omega)$, $B_t(\omega)$ depending on readability and whether we want to emphasize the role of $\omega$.
Before establishing that the process we have defined actually exists, we make a few simple observations in order to familiarize ourselves with our object of study.

Proposition 14.1 (Translation Invariance). If $\{B_t : t \geq 0\}$ is a Brownian motion started at $x$, then for any $y \in \mathbb{R}$, $\{B_t + y : t \geq 0\}$ is a Brownian motion started at $x + y$.

Proof. Properties 1 and 2 hold since the $y$ terms cancel out and Property 3 holds since adding a constant preserves continuity. The starting state is $B_0 + y = x + y$. $\square$

The exact same argument shows that if $\{B_t\}_{t \geq 0}$ is a standard Brownian motion and $Y$ is independent of $\{B_t\}_{t \geq 0}$, then the process $\{X_t\}_{t \geq 0}$ defined by $X_t = B_t + Y$ is a Brownian motion with $X_0 = Y$ a.s. Thus there is no real loss of generality in restricting our attention to standard Brownian motion.
In a similar vein, we have the following Markov property for time shifts.

Proposition 14.2 (Time-shift Invariance). If $\{B(t) : t \geq 0\}$ is a Brownian motion and $s \geq 0$, then $\{B(s+t) - B(s) : t \geq 0\}$ is a standard Brownian motion and is independent of the process $\{B(t) : 0 \leq t \leq s\}$.

Proof. Clearly the process starts at $B(s) - B(s) = 0$. Properties 1 and 2 follow from cancellation of the $B(s)$ terms and Property 3 holds since the composition of a.s. continuous functions is a.s. continuous.

Now, stochastic processes $\{X(s) : s \in S\}$, $\{Y(t) : t \in T\}$ are independent if for every $m, n \in \mathbb{N}$, $s_1, \ldots, s_m \in S$, $t_1, \ldots, t_n \in T$, the random vectors $(X(s_1), \ldots, X(s_m))$ and $(Y(t_1), \ldots, Y(t_n))$ are independent.

Since Brownian motion has independent increments, for any $t_1, \ldots, t_n \geq 0$, $0 \leq s_1, \ldots, s_m \leq s$, the vectors $(B(s+t_1) - B(s), \ldots, B(s+t_n) - B(s))$ and $(B(s_1), \ldots, B(s_m))$ are independent. (The former is built from disjoint increments to the right of $s$ and the latter from disjoint increments to the left.) $\square$
Another simple but extremely useful invariance property is

Proposition 14.3 (Diffusive Scaling). If $\{B(t) : t \geq 0\}$ is a standard Brownian motion and $a \neq 0$, then the process $\{X(t) : t \geq 0\}$ defined by $X(t) = \frac{1}{a} B(a^2 t)$ is also a standard Brownian motion.

Proof. Again, we just have to verify the defining properties. Clearly $X(0) = \frac{1}{a} B(0) = 0$. If $0 \leq t_1 < \cdots < t_n$, then $0 \leq a^2 t_1 < \cdots < a^2 t_n$, so $B(a^2 t_1), B(a^2 t_2) - B(a^2 t_1), \ldots, B(a^2 t_n) - B(a^2 t_{n-1})$ are independent, hence

$$\left( X(t_1), X(t_2) - X(t_1), \ldots, X(t_n) - X(t_{n-1}) \right) = \left( \frac{B(a^2 t_1)}{a}, \frac{B(a^2 t_2) - B(a^2 t_1)}{a}, \ldots, \frac{B(a^2 t_n) - B(a^2 t_{n-1})}{a} \right)$$

are independent.

Similarly, for $t \geq 0$, $h > 0$, $B(a^2 t + a^2 h) - B(a^2 t)$ is normally distributed with mean $0$ and variance $a^2 h$, so

$$X(t+h) - X(t) = \frac{B(a^2 t + a^2 h) - B(a^2 t)}{a}$$

is normal with mean $0$ and variance $h$.

Finally, $X(t) = \frac{1}{a} B(a^2 t)$ is a composition of a.s. continuous functions and thus is a.s. continuous. $\square$

Observe that taking $a = -1$ shows that standard Brownian motion is symmetric about $0$.

It is sometimes more convenient to express the scaling relation as $\{B_{at}\}_{t \geq 0} =_d \{\sqrt{a}\, B_t\}_{t \geq 0}$ for $a > 0$.
At this point, it is useful to give the following alternative characterization of Brownian motion.

Theorem 14.1. A real-valued process $\{B_t\}_{t \geq 0}$ with $B_0 = 0$ is a standard Brownian motion if and only if

a) $B_t$ is a Gaussian process - i.e. all of its finite dimensional distributions are multivariate normal,
b) $E[B_t] = 0$ and $E[B_s B_t] = s \wedge t$,
c) With probability one, $t \mapsto B_t$ is continuous.
Proof. To see that Brownian motion is a Gaussian process, fix times $0 < t_1 < \cdots < t_n$ and define

$$M = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ -1 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & -1 & 1 \end{pmatrix}, \qquad D = \begin{pmatrix} \frac{1}{\sqrt{t_1}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{t_2 - t_1}} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & \frac{1}{\sqrt{t_n - t_{n-1}}} \end{pmatrix}.$$

Properties 1 and 2 show that the random vector

$$X = DM (B(t_1), \ldots, B(t_n))^T = \left( \frac{B(t_1)}{\sqrt{t_1}}, \frac{B(t_2) - B(t_1)}{\sqrt{t_2 - t_1}}, \ldots, \frac{B(t_n) - B(t_{n-1})}{\sqrt{t_n - t_{n-1}}} \right)^T$$

has independent standard normal entries. Since $M$ and $D$ are nonsingular, the matrix $A = M^{-1} D^{-1}$ is well-defined, so we have $(B(t_1), \ldots, B(t_n))^T = AX$, which is multivariate normal (with mean $0$ and covariance $AA^T$) by definition.

Now suppose that $s \leq t$. Properties 1 and 2 show that $B_s \sim N(0,s)$ and $B_t - B_s \sim N(0, t-s)$ are independent, so we have $E[B_s] = 0$ and

$$E[B_s B_t] = E[B_s (B_t - B_s)] + E[B_s^2] = s.$$

For the other direction, note that multivariate normals have independent entries if and only if all covariances are $0$. If $\{B(t)\}_{t \geq 0}$ satisfies a and b, then for any $r \leq s \leq t \leq u$,

$$E[(B(s) - B(r))(B(u) - B(t))] = E[B(s)B(u)] - E[B(s)B(t)] - E[B(r)B(u)] + E[B(r)B(t)] = s - s - r + r = 0,$$

so Property 1 holds.

Similarly, for any $t \geq 0$, $h > 0$, $B(t+h) - B(t)$ is a difference of mean zero normals and thus is normal with mean zero and variance

$$\operatorname{Var}(B(t+h) - B(t)) = \operatorname{Var}(B(t+h)) + \operatorname{Var}(B(t)) - 2\operatorname{Cov}(B(t+h), B(t)) = (t+h) + t - 2E[B(t+h)B(t)] = 2t + h - 2t = h. \qquad \square$$
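A quick Monte Carlo sanity check of property b (our addition; the grid size and the times $s, t$ are arbitrary choices):

```python
import numpy as np

# Estimate E[B_s B_t] from simulated paths and compare with s ^ t.
rng = np.random.default_rng(2)
n_paths, n_steps, T = 50_000, 200, 1.0
dt = T / n_steps
B = np.cumsum(rng.normal(0.0, np.sqrt(dt), (n_paths, n_steps)), axis=1)

s, t = 0.3, 0.7                      # B[:, k] approximates B((k+1) * dt)
i, j = int(s / dt) - 1, int(t / dt) - 1
print("empirical E[B_s B_t]:", np.mean(B[:, i] * B[:, j]))
print("theoretical s ^ t   :", min(s, t))
```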
With this alternative definition, we can prove two more simple results.

Proposition 14.4 (Time Inversion). Suppose that $\{B(t) : t \geq 0\}$ is a standard Brownian motion. Then the process $\{X(t) : t \geq 0\}$ defined by

$$X(t) = \begin{cases} 0, & t = 0 \\ t B\left( \frac{1}{t} \right), & t > 0 \end{cases}$$

is also a standard Brownian motion.
Proof. Since for any $t_1, \ldots, t_n > 0$, $\left( B(t_1^{-1}), \ldots, B(t_n^{-1}) \right)^T$ is a Gaussian vector, so is

$$(X(t_1), \ldots, X(t_n))^T = \begin{pmatrix} t_1 & 0 & \cdots & 0 \\ 0 & t_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & t_n \end{pmatrix} \left( B\left( \tfrac{1}{t_1} \right), \ldots, B\left( \tfrac{1}{t_n} \right) \right)^T.$$

Also, $E[X(0)] = 0$ and $E[X(t)] = t E\left[ B\left( \frac{1}{t} \right) \right] = 0$ for $t > 0$, and for $0 < s \leq t$, $E[X(0)X(t)] = 0$ and

$$E[X(s)X(t)] = st\, E\left[ B\left( \tfrac{1}{s} \right) B\left( \tfrac{1}{t} \right) \right] = st \cdot \frac{1}{t} = s.$$

Finally, the paths $t \mapsto X(t)$ are a.s. continuous for $t > 0$ since compositions of continuous functions are continuous, so it remains only to demonstrate that $\lim_{t \to 0^+} X(t) = 0$.

Because $\mathbb{Q}^+ = \mathbb{Q} \cap (0,\infty)$ is countable, the foregoing shows that $\{X(t) : t \in \mathbb{Q}^+\}$ has the same distribution as $\{B(t) : t \in \mathbb{Q}^+\}$, so we must have

$$P\Big( \lim_{t \to 0,\, t \in \mathbb{Q}^+} X(t) = 0 \Big) = P\Big( \lim_{t \to 0,\, t \in \mathbb{Q}^+} B(t) = 0 \Big) = 1.$$

Since $\mathbb{Q}^+$ is dense in $(0,\infty)$, we conclude that $\lim_{t \to 0^+} X(t) = 0$ a.s. and the proof is complete. $\square$
Time inversion gives an easy proof of the following fact about the long term behavior of Brownian motion.

Proposition 14.5 (Brownian SLLN). If $\{B(t) : t \geq 0\}$ is a standard Brownian motion, then

$$\lim_{t \to \infty} \frac{B(t)}{t} = 0 \quad \text{a.s.}$$

Proof. Let $\{X(t) : t \geq 0\}$ be as in Proposition 14.4. Then

$$\lim_{t \to \infty} \frac{B(t)}{t} = \lim_{t \to \infty} X\left( \frac{1}{t} \right) = \lim_{s \to 0^+} X(s) = 0 \quad \text{a.s.} \qquad \square$$
Wiener's Theorem.

Now that we're a little more comfortable working with Brownian motion, it's time to prove that it actually exists. The main complication lies in the continuity requirement.

To illustrate this issue, observe that if $\{B(t) : t \geq 0\}$ is a Brownian motion and $U$ is an independent random variable which is uniformly distributed on $[0,1]$, then the process $\left\{ \tilde{B}(t) : t \geq 0 \right\}$ defined by

$$\tilde{B}(t) = \begin{cases} B(t), & t \neq U \\ 0, & t = U \end{cases}$$

has the same finite dimensional distributions as Brownian motion since $P(U \in S) = 0$ for any finite $S \subseteq [0,1]$. However, $\tilde{B}(t)$ is discontinuous whenever $B(U) \neq 0$ - that is, with probability one.

There are several ways to go about proving existence. We will pursue a fairly straightforward approach due to Paul Lévy. The basic idea is to construct standard Brownian motion on $[0,1]$ as a uniform limit of continuous functions having the right finite dimensional distributions on sets of dyadic rationals and then patch together independent copies to obtain Brownian motion on $[0,\infty)$.
Let $D_n = \left\{ \frac{k}{2^n} : k = 0, 1, \ldots, 2^n \right\}$ for $n = 0, 1, \ldots$, and set $D = \bigcup_{n=0}^\infty D_n$.

Let $(\Omega, \mathcal{F}, P)$ be a probability space on which a collection $\{Z_d : d \in D\}$ of independent standard normals can be defined. (Kolmogorov's extension theorem gives one such space.)

For each $n \in \mathbb{N}_0$, we define random variables $B(d)$, $d \in D_n$, so that

(1) For all $q < r \leq s < t$ in $D_n$, $B(t) - B(s) \sim N(0, t-s)$ and $B(r) - B(q) \sim N(0, r-q)$ are independent.
(2) The collections $\{B(d) : d \in D_n\}$ and $\{Z_t : t \in D \setminus D_n\}$ are independent.

We begin by taking $B(0) = 0$ and $B(1) = Z_1$.

Now suppose for the sake of induction that $\{B(d) : d \in D_{n-1}\}$ satisfies (1) and (2). We then define $B(d)$ for $d \in D_n \setminus D_{n-1}$ by

$$B(d) = \frac{B(d - 2^{-n}) + B(d + 2^{-n})}{2} + \frac{Z_d}{2^{\frac{n+1}{2}}}.$$

Since the first term is the average of $B$ at points in $D_{n-1}$ and the $Z_d$'s are independent, the inductive hypothesis shows that $B(d)$ is the sum of two random variables which are independent of $\{Z_t : t \in D \setminus D_n\}$, so Property (2) is satisfied.

The inductive hypothesis also shows that $\frac{1}{2}\left[ B(d + 2^{-n}) - B(d - 2^{-n}) \right]$ and $2^{-\frac{n+1}{2}} Z_d$ are independent normals having mean $0$ and variance $2^{-(n+1)}$, so their sum

$$\frac{1}{2}\left[ B(d + 2^{-n}) - B(d - 2^{-n}) \right] + 2^{-\frac{n+1}{2}} Z_d = \frac{B(d + 2^{-n}) + B(d - 2^{-n})}{2} - B(d - 2^{-n}) + \frac{Z_d}{2^{\frac{n+1}{2}}} = B(d) - B(d - 2^{-n})$$

and their difference

$$\frac{1}{2}\left[ B(d + 2^{-n}) - B(d - 2^{-n}) \right] - 2^{-\frac{n+1}{2}} Z_d = B(d + 2^{-n}) - \frac{B(d + 2^{-n}) + B(d - 2^{-n})}{2} - 2^{-\frac{n+1}{2}} Z_d = B(d + 2^{-n}) - B(d)$$

are independent and normally distributed with mean $0$ and variance $2^{-n}$ (as proved in the homework).

Since the vector of increments $\left( B(d) - B(d - 2^{-n}) : d \in D_n \setminus \{0\} \right)$ is multinormal (as its coordinates are linear combinations of independent normals), it has independent entries if they are pairwise independent.

We have already seen that $B(d + 2^{-n}) - B(d)$ and $B(d) - B(d - 2^{-n})$ are independent for $d \in D_n \setminus D_{n-1}$. It remains to consider the case where the two intervals lie on opposite sides of some $d \in D_{n-1}$. For such pairs, choose $d \in D_j$ with $j$ minimal so that the intervals lie in $[d - 2^{-j}, d]$ and $[d, d + 2^{-j}]$, respectively. The inductive hypothesis ensures that $B(d) - B(d - 2^{-j})$ and $B(d + 2^{-j}) - B(d)$ are independent. Since the increments in question are constructed from $B(d) - B(d - 2^{-j})$ and $B(d + 2^{-j}) - B(d)$, respectively, using disjoint subsets of the $\{Z_t : t \in D_n\}$, the desired independence follows.

As the increments $\left\{ B(d) - B(d - 2^{-n}) : d \in D_n \setminus \{0\} \right\}$ are i.i.d. $N(0, 2^{-n})$, Property (1) is verified and the inductive step is complete.
At this point, we have defined $B(t)$ on the set $D$ of dyadic rationals so that the finite dimensional distributions are as desired. The next step is to extend the definition to all of $[0,1]$ by interpolation.

Let

$$F_0(t) = \begin{cases} Z_1, & t = 1 \\ 0, & t = 0 \\ \text{linear}, & \text{in between} \end{cases}$$

and, for $n \geq 1$,

$$F_n(t) = \begin{cases} 2^{-\frac{n+1}{2}} Z_t, & t \in D_n \setminus D_{n-1} \\ 0, & t \in D_{n-1} \\ \text{linear}, & \text{between consecutive points in } D_n. \end{cases}$$

The $F_i$'s are continuous, and we claim that for all $n \in \mathbb{N}_0$, $d \in D_n$,

$$B(d) = \sum_{i=0}^n F_i(d) = \sum_{i=0}^\infty F_i(d).$$

(We only need to justify the first equality since $d \in D_n$ implies $F_i(d) = 0$ for $i > n$.)
The proof is a simple induction argument: It holds for $n = 0$ by construction. Assume that it holds for $n - 1$. We need to verify that $B(d) = \sum_{i=0}^n F_i(d)$ for $d \in D_n \setminus D_{n-1}$.

Since $F_i$ is linear on $[d - 2^{-n}, d + 2^{-n}]$ for $0 \leq i \leq n - 1$, we have

$$\sum_{i=0}^{n-1} F_i(d) = \sum_{i=0}^{n-1} \frac{F_i(d - 2^{-n}) + F_i(d + 2^{-n})}{2} = \frac{B(d - 2^{-n}) + B(d + 2^{-n})}{2}.$$

The claim follows from the definition of $B(d)$ since $F_n(d) = 2^{-\frac{n+1}{2}} Z_d$.

We now need to show that the sum $\sum_{i=0}^\infty F_i(t)$ is uniformly convergent. To this end, we observe that the $Z_d$'s are standard normal, so

$$P(|Z_d| \geq u) = \frac{2}{\sqrt{2\pi}} \int_u^\infty e^{-\frac{x^2}{2}} dx \leq \frac{2}{\sqrt{2\pi}} \int_u^\infty \frac{x}{u} e^{-\frac{x^2}{2}} dx = \frac{2}{u\sqrt{2\pi}} \int_{\frac{u^2}{2}}^\infty e^{-y} dy = \frac{2}{u\sqrt{2\pi}} e^{-\frac{u^2}{2}}$$

for all $u > 0$. In particular,

$$P\left( |Z_d| \geq \sqrt{4n \log 2} \right) \leq \frac{2}{\sqrt{8\pi n \log 2}} e^{-2n \log 2} \leq 2^{-2n}$$

for all $n \in \mathbb{N}$. Consequently,

$$\sum_{n=0}^\infty P\left( |Z_d| \geq \sqrt{4n \log 2} \text{ for some } d \in D_n \right) \leq \sum_{n=0}^\infty \sum_{d \in D_n} P\left( |Z_d| \geq \sqrt{4n \log 2} \right) \leq \sum_{n=0}^\infty \frac{2^n + 1}{2^{2n}} < \infty.$$
Therefore, the first Borel-Cantelli lemma shows that there is a (random, but a.s. finite) $N$ such that $n \geq N$ and $d \in D_n$ implies $|Z_d| < \sqrt{4n \log 2}$, hence

$$\|F_n\|_\infty \leq 2^{-\frac{n+1}{2}} \sqrt{4n \log 2}$$

whenever $n \geq N$. It follows that, almost surely, $B(t) = \sum_{i=0}^\infty F_i(t)$ is a uniform limit of continuous functions and thus is continuous.

To see that $\{B(t) : t \in [0,1]\}$ is a standard Brownian motion on $[0,1]$, we just have to verify that it has the right finite dimensional distributions.
But this is an easy consequence of continuity since we have already established the claim on the dense set $D \subseteq [0,1]$. Indeed, let $0 \leq t_1 < \ldots < t_n \leq 1$. Then there exist $t_{1,k} \leq \ldots \leq t_{n,k}$ in $D$ with $t_{i,k} \to t_i$ as $k \to \infty$, so continuity gives

$$\lim_{k \to \infty} B(t_{i+1,k}) - B(t_{i,k}) = B(t_{i+1}) - B(t_i) \quad \text{a.s.}$$

for $i = 1, \ldots, n-1$. As

$$\lim_{k \to \infty} E[B(t_{i+1,k}) - B(t_{i,k})] = 0$$

and

$$\lim_{k \to \infty} \operatorname{Cov}\left( B(t_{i+1,k}) - B(t_{i,k}),\ B(t_{j+1,k}) - B(t_{j,k}) \right) = \lim_{k \to \infty} 1\{i = j\}(t_{i+1,k} - t_{i,k}) = 1\{i = j\}(t_{i+1} - t_i),$$

we see that $(B(t_2) - B(t_1), \ldots, B(t_n) - B(t_{n-1}))$ is multivariate normal with mean $(0, \ldots, 0)^T$ and covariance $\Sigma = \operatorname{Diag}(t_2 - t_1, \ldots, t_n - t_{n-1})$. (An easy characteristic functions argument shows that the limit is indeed Gaussian.)

Thus we have constructed a continuous process $B : [0,1] \to \mathbb{R}$ with the same finite dimensional distributions as standard Brownian motion on $[0,1]$.
To extend this to all positive times, take a sequence $B_0, B_1, \ldots$ of independent $C([0,1], \mathbb{R})$-valued random variables having the distribution of this process and glue them together end to end. Specifically, define $\{B(t) : t \geq 0\}$ by

$$B(t) = B_{\lfloor t \rfloor}(t - \lfloor t \rfloor) + \sum_{i=0}^{\lfloor t \rfloor - 1} B_i(1).$$

One readily checks that $B$ is an a.s. continuous function from $[0,\infty)$ to $\mathbb{R}$ which has the right finite dimensional distributions.
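To make the construction concrete, here is a minimal Python sketch of the dyadic refinement on $[0,1]$ (our addition; the series is truncated at a finite level and the function name is our own):

```python
import numpy as np

# Levy's construction on [0, 1]: start from B(0) = 0, B(1) = Z_1 and, at
# each level n, fill in the dyadic midpoints d in D_n \ D_{n-1} via
#     B(d) = (B(d - 2^-n) + B(d + 2^-n)) / 2 + Z_d / 2^((n+1)/2),
# exactly as in the inductive step above.
def levy_brownian(levels: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    B = np.array([0.0, rng.normal()])          # values on D_0 = {0, 1}
    for n in range(1, levels + 1):
        Z = rng.normal(size=len(B) - 1)        # one Z_d per new midpoint
        mid = (B[:-1] + B[1:]) / 2 + Z / 2 ** ((n + 1) / 2)
        new = np.empty(2 * len(B) - 1)
        new[::2], new[1::2] = B, mid           # interleave old and new points
        B = new
    return B                                   # B sampled on D_levels

path = levy_brownian(levels=12)
# Sanity check: increments over the final mesh should be i.i.d. N(0, 2^-12).
print("increment variance:", np.var(np.diff(path)), "vs", 2.0 ** -12)
```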
15. Sample Path Properties
Having shown that Brownian motion exists and established some simple invariance principles, we now turn
our attention to a few basic properties of Brownian paths.
We begin with an easy result which demonstrates some of the peculiarities of Brownian motion.
Theorem 15.1. Almost surely, Brownian motion is not monotone on any interval $[a,b]$, $0 \leq a < b < \infty$.

Proof. Fix a nondegenerate interval $[a,b]$. We can partition $[a,b]$ into $n$ subintervals $[a_{i-1}, a_i]$ by picking $a = a_0 < a_1 < \ldots < a_n = b$. If $B_t$ is monotone on $[a,b]$, then each of the increments $B(a_i) - B(a_{i-1})$ has the same sign. Since the increments are independent mean zero normals, the probability of this occurring is $2 \cdot 2^{-n}$.

Thus for any $0 \leq a < b < \infty$, $n \in \mathbb{N}$, $P(B_t \text{ is monotone on } [a,b]) \leq 2^{-(n-1)}$. Sending $n \to \infty$ shows that $\{B_t\}_{t \geq 0}$ is not monotone on any particular interval almost surely.

Since there are countably many intervals with rational endpoints and every nondegenerate interval contains some such interval, the result follows. $\square$
Our next goal is to show that Brownian motion is a.s. locally $\gamma$-Hölder continuous for $\gamma < \frac{1}{2}$.

The key to doing so is the following result which deduces local Hölder continuity from moment bounds.

Theorem 15.2 (Kolmogorov's Continuity Theorem). Let $X(t)$ be a stochastic process with a.s. continuous sample paths such that

$$E\left[ |X(t) - X(s)|^\beta \right] \leq C |t-s|^{1+\alpha}$$

for some constants $\alpha, \beta, C > 0$ and all times $s, t \geq 0$.

Then for each $0 \leq \gamma < \frac{\alpha}{\beta}$, $T > 0$, and almost every $\omega$, there exists a constant $K = K(\omega, \gamma, T)$ such that

$$|X(t,\omega) - X(s,\omega)| \leq K |t-s|^\gamma$$

for all $0 \leq s, t \leq T$. That is, the sample paths $t \mapsto X(t,\omega)$ are a.s. $\gamma$-Hölder continuous on $[0,T]$.
Proof. By time scaling, it suffices to consider the case $T = 1$. Let $D$ denote the dyadic rationals in $[0,1]$ as in the proof of Wiener's theorem, and let $\gamma \in \left( 0, \frac{\alpha}{\beta} \right)$. Chebyshev's inequality and our assumption give

$$P\left( \max_{1 \leq k \leq 2^n} \left| X\left( \tfrac{k}{2^n} \right) - X\left( \tfrac{k-1}{2^n} \right) \right| \geq 2^{-\gamma n} \right) = P\left( \bigcup_{k=1}^{2^n} \left\{ \left| X\left( \tfrac{k}{2^n} \right) - X\left( \tfrac{k-1}{2^n} \right) \right| \geq 2^{-\gamma n} \right\} \right)$$
$$\leq \sum_{k=1}^{2^n} P\left( \left| X\left( \tfrac{k}{2^n} \right) - X\left( \tfrac{k-1}{2^n} \right) \right| \geq 2^{-\gamma n} \right) \leq \sum_{k=1}^{2^n} \frac{E\left[ \left| X\left( \frac{k}{2^n} \right) - X\left( \frac{k-1}{2^n} \right) \right|^\beta \right]}{2^{-\beta\gamma n}} \leq 2^n \, 2^{\beta\gamma n} C \left( \frac{1}{2^n} \right)^{\alpha+1} = 2^{(\beta\gamma - \alpha)n} C.$$
Since $\beta\gamma < \alpha$, we see that

$$\sum_{n=1}^\infty P\left( \max_{1 \leq k \leq 2^n} \left| X\left( \tfrac{k}{2^n} \right) - X\left( \tfrac{k-1}{2^n} \right) \right| \geq 2^{-\gamma n} \right) \leq C \sum_{n=1}^\infty 2^{(\beta\gamma - \alpha)n} < \infty,$$

so the first Borel-Cantelli lemma implies the existence of a set $\Omega^*$ with $P(\Omega^*) = 1$ such that for every $\omega \in \Omega^*$, there is an $N(\omega) \in \mathbb{N}$ with

$$\max_{1 \leq k \leq 2^n} \left| X\left( \tfrac{k}{2^n}, \omega \right) - X\left( \tfrac{k-1}{2^n}, \omega \right) \right| < 2^{-\gamma n}$$

for all $n \geq N(\omega)$. Choosing $K_0 = K_0(\omega, \gamma)$ large enough gives

$$\max_{1 \leq k \leq 2^n} \left| X\left( \tfrac{k}{2^n} \right) - X\left( \tfrac{k-1}{2^n} \right) \right| < 2^{-\gamma n} K_0$$

for all $n \in \mathbb{N}$ on $\Omega^*$.
Now suppose that $s, t \in D$ with $s < t$, and let $n \in \mathbb{N}$ be such that $2^{-n} \leq t - s < 2^{-(n-1)}$. We can write

$$s = \frac{i}{2^n} - \frac{1}{2^{p_1}} - \ldots - \frac{1}{2^{p_k}}, \qquad t = \frac{j}{2^n} + \frac{1}{2^{q_1}} + \ldots + \frac{1}{2^{q_\ell}}$$

with $n < p_1 < \ldots < p_k$, $n < q_1 < \ldots < q_\ell$, and $i \leq j$. Moreover, since $\frac{j}{2^n} - \frac{i}{2^n} \leq t - s < 2^{-(n-1)}$, we must have $|i - j| \leq 1$, hence

$$\left| X\left( \tfrac{j}{2^n} \right) - X\left( \tfrac{i}{2^n} \right) \right| < 2^{-\gamma n} K_0.$$

Also,

$$\left| X\left( \tfrac{i}{2^n} - \tfrac{1}{2^{p_1}} - \ldots - \tfrac{1}{2^{p_r}} \right) - X\left( \tfrac{i}{2^n} - \tfrac{1}{2^{p_1}} - \ldots - \tfrac{1}{2^{p_{r-1}}} \right) \right| \leq 2^{-\gamma p_r} K_0$$

for $r = 1, \ldots, k$, so, since $n < p_1 < \ldots < p_k$,

$$\left| X(s) - X\left( \tfrac{i}{2^n} \right) \right| \leq K_0 \sum_{r=1}^k 2^{-\gamma p_r} \leq K_0 \sum_{r=1}^k 2^{-\gamma(n+r)} \leq 2^{-\gamma n} K_0 \sum_{r=1}^\infty 2^{-\gamma r} \leq 2^{-\gamma n} K_1$$

where $K_1 = \left( 1 \vee \sum_{r=1}^\infty 2^{-\gamma r} \right) K_0$. The exact same reasoning shows that $\left| X\left( \tfrac{j}{2^n} \right) - X(t) \right| \leq 2^{-\gamma n} K_1$.

It follows that

$$|X(t) - X(s)| \leq \left| X(t) - X\left( \tfrac{j}{2^n} \right) \right| + \left| X\left( \tfrac{j}{2^n} \right) - X\left( \tfrac{i}{2^n} \right) \right| + \left| X\left( \tfrac{i}{2^n} \right) - X(s) \right| \leq 2^{-\gamma n} K \leq K|t-s|^\gamma$$

with $K = 3K_1$ (using $2^{-n} \leq t - s$ in the last step), hence the paths are a.s. $\gamma$-Hölder continuous on $D$.

Because $D$ is dense in $[0,1]$ and $t \mapsto X(t)$ is a.s. continuous, the Hölder condition holds a.s. for all $s, t \in [0,1]$. $\square$
Applying the above result to Brownian motion yields

Theorem 15.3. Let $\{B_t\}_{t \geq 0}$ be a standard Brownian motion, and let $T > 0$. Then with full probability, $t \mapsto B_t$ is $\gamma$-Hölder continuous on $[0,T]$ for all $\gamma < \frac{1}{2}$.

Proof. The scaling relation (Proposition 14.3) with $a = t - s$ shows that

$$E\left[ |B_t - B_s|^\beta \right] = E\left[ |B_{t-s}|^\beta \right] = C_\beta |t-s|^{\frac{\beta}{2}}$$

where

$$C_\beta = E\left[ |B_1|^\beta \right] = \frac{1}{\sqrt{2\pi}} \int |x|^\beta e^{-\frac{x^2}{2}} dx.$$

Thus when $\beta > 2$, Theorem 15.2 applies with $\alpha = \frac{\beta}{2} - 1$, so $t \mapsto B_t$ is $\gamma$-Hölder continuous for $0 \leq \gamma < \frac{\alpha}{\beta} = \frac{1}{2} - \frac{1}{\beta}$. The result follows by sending $\beta$ to infinity. $\square$
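A rough numerical companion to this dichotomy (our addition, with arbitrarily chosen parameters): on a simulated path over a dyadic grid, the ratio $|B_t - B_s| / |t-s|^\gamma$ stays moderate for $\gamma < \frac12$ and grows with the mesh for $\gamma \geq \frac12$. On a finite grid the supremum is of course always finite, so this is only suggestive.

```python
import numpy as np

# Empirical Holder ratios of a simulated Brownian path for several gammas.
rng = np.random.default_rng(3)
n = 2 ** 16
dt = 1.0 / n
B = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), n))])

for gamma in (0.3, 0.5, 0.7):
    ratio = 0.0
    for lag in (1, 2, 4, 16, 256):          # a few lags only, for speed
        diffs = np.abs(B[lag:] - B[:-lag])
        ratio = max(ratio, diffs.max() / (lag * dt) ** gamma)
    print(f"gamma = {gamma}: max ratio over sampled pairs ~ {ratio:.1f}")
```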
In fact, the preceding result is optimal in the sense that

Theorem 15.4. $\{B_t\}_{t \in [0,1]}$ is not Hölder continuous for any exponent $\gamma \geq \frac{1}{2}$.

Proof. (Homework) We wish to show that

$$\sup_{s,t \in [0,1]} \frac{|B_t - B_s|}{|t-s|^\gamma} = \infty \quad \text{a.s.}$$

whenever $\gamma \geq \frac{1}{2}$. Since $x^{\gamma_1} \geq x^{\gamma_2}$ for $0 < \gamma_1 \leq \gamma_2$, $x \in [0,1]$, it suffices to establish the result for $\gamma = \frac{1}{2}$. Define

$$K(\omega) := \sup_{0 \leq s < t \leq 1} \frac{|B_t(\omega) - B_s(\omega)|}{\sqrt{t-s}}, \qquad K_{i,n}(\omega) := \sup_{\frac{i-1}{n} \leq s < t \leq \frac{i}{n}} \frac{|B_t(\omega) - B_s(\omega)|}{\sqrt{t-s}}$$

for $n \in \mathbb{N}$, $i \in [n]$.

The independent increments property of Brownian motion shows that $K_{1,n}, \ldots, K_{n,n}$ are independent. Moreover, the scaling relation shows that $\{X_t\}_{t \in [0,1]} =_d \{B_t\}_{t \in [0,1]}$ where $X_t := \sqrt{n}\left( B_{\frac{i-1+t}{n}} - B_{\frac{i-1}{n}} \right)$, hence

$$K_{i,n} = \sup_{\frac{i-1}{n} \leq s < t \leq \frac{i}{n}} \frac{|B_t - B_s|}{\sqrt{t-s}} = \sup_{0 \leq s < t \leq 1} \frac{\sqrt{n}\left| B_{\frac{t+i-1}{n}} - B_{\frac{s+i-1}{n}} \right|}{\sqrt{t-s}} = \sup_{0 \leq s < t \leq 1} \frac{|X_t - X_s|}{\sqrt{t-s}} =_d \sup_{0 \leq s < t \leq 1} \frac{|B_t - B_s|}{\sqrt{t-s}} = K.$$

Because $K \geq \max_{1 \leq i \leq n} K_{i,n}$ for all $n \in \mathbb{N}$ by construction, it follows that for any $M > 0$, we have

$$P(K \leq M) \leq P(K_{1,n} \leq M, \ldots, K_{n,n} \leq M) \leq P(K \leq M)^n.$$

Since $\frac{B_t - B_s}{\sqrt{t-s}} \sim N(0,1)$ for all $0 \leq s < t \leq 1$, we have that $P(K > M) > 0$ for all $M > 0$. Therefore, the preceding inequality implies that $P(K \leq M) = 0$ for all $M > 0$ (if $0 < p < 1$, then $p \leq p^n \to 0$ is impossible, so $P(K \leq M) \in \{0,1\}$, and the value $1$ is ruled out), hence $K = \infty$ a.s. $\square$
One can actually show that Brownian motion is almost surely not $\gamma$-Hölder continuous at any point $t$ (which is a stronger statement than not being $\gamma$-Hölder continuous over an interval) whenever $\gamma > \frac{1}{2}$.

This can be verified by appropriately modifying the proof of the following theorem.
Theorem 15.5. With probability one, Brownian paths are not Lipschitz continuous at any point.

Proof. Fix a constant $C \in (0,\infty)$ and let

$$A_n = \left\{ \omega : \text{there is an } s \in [0,1] \text{ such that } |B_t(\omega) - B_s(\omega)| \leq C|t-s| \text{ whenever } |t-s| \leq \tfrac{3}{n} \right\}.$$

For $1 \leq k \leq n-2$, let

$$Y_{n,k} = \max\left\{ \left| B\left( \tfrac{k}{n} \right) - B\left( \tfrac{k-1}{n} \right) \right|,\ \left| B\left( \tfrac{k+1}{n} \right) - B\left( \tfrac{k}{n} \right) \right|,\ \left| B\left( \tfrac{k+2}{n} \right) - B\left( \tfrac{k+1}{n} \right) \right| \right\},$$

and set

$$B_n = \left\{ \min_{1 \leq k \leq n-2} Y_{n,k} \leq \tfrac{6C}{n} \right\}.$$

We will show that $A_n \subseteq B_n$. To this end, suppose that $\omega \in A_n$. Then there is an $s \in [0,1]$ such that $|B_t(\omega) - B_s(\omega)| \leq C|t-s|$ whenever $|t-s| \leq \frac{3}{n}$. Let $k = \max\left\{ j \in [n-2] : \frac{j-1}{n} \leq s \right\}$. Then $\left| \frac{j}{n} - s \right| \leq \frac{3}{n}$ for $j = k-1, k, k+1, k+2$, so

$$\left| B\left( \tfrac{\ell}{n}, \omega \right) - B\left( \tfrac{\ell-1}{n}, \omega \right) \right| \leq \left| B\left( \tfrac{\ell}{n}, \omega \right) - B(s,\omega) \right| + \left| B(s,\omega) - B\left( \tfrac{\ell-1}{n}, \omega \right) \right| \leq C\left| \tfrac{\ell}{n} - s \right| + C\left| s - \tfrac{\ell-1}{n} \right| \leq C\left( \tfrac{3}{n} + \tfrac{3}{n} \right) = \tfrac{6C}{n}$$

for $\ell = k, k+1, k+2$, hence $Y_{n,k}(\omega) \leq \frac{6C}{n}$.

Now for any $k = 1, \ldots, n-2$,

$$P\left( Y_{n,k} \leq \tfrac{6C}{n} \right) = \prod_{j=k}^{k+2} P\left( \left| B\left( \tfrac{j}{n} \right) - B\left( \tfrac{j-1}{n} \right) \right| \leq \tfrac{6C}{n} \right) = P\left( \left| B\left( \tfrac{1}{n} \right) \right| \leq \tfrac{6C}{n} \right)^3$$

since Brownian motion has stationary independent increments. Applying the scaling relation with $a = \frac{1}{n}$ and then using $B(1) \sim N(0,1)$ yields

$$P\left( Y_{n,k} \leq \tfrac{6C}{n} \right) = P\left( \tfrac{1}{\sqrt{n}} |B(1)| \leq \tfrac{6C}{n} \right)^3 = P\left( |B(1)| \leq \tfrac{6C}{\sqrt{n}} \right)^3 = \left( \tfrac{2}{\sqrt{2\pi}} \int_0^{\frac{6C}{\sqrt{n}}} e^{-\frac{x^2}{2}} dx \right)^3 \leq \left( \tfrac{12C}{\sqrt{2\pi n}} \right)^3,$$

where the final inequality is because $e^{-\frac{x^2}{2}} \leq 1$ for $x \geq 0$.

Combining our observations shows that

$$P(A_n) \leq P(B_n) = P\left( \bigcup_{k=1}^{n-2} \left\{ Y_{n,k} \leq \tfrac{6C}{n} \right\} \right) \leq (n-2) \left( \tfrac{12C}{\sqrt{2\pi n}} \right)^3 \to 0$$

as $n \to \infty$. Since $A_1 \subseteq A_2 \subseteq \ldots$, $P(A_n)$ is increasing in $n$, so we conclude that $P(A_n) = 0$ for all $n$. As $C$ was arbitrary, we have shown that $t \mapsto B_t$ is not Lipschitz at any point in $[0,1]$ with full probability.

The exact same argument shows that

$$A_n^m = \left\{ \text{there exists an } s \in [m, m+1] \text{ such that } |B_t - B_s| \leq C|t-s| \text{ whenever } |t-s| \leq \tfrac{3}{n} \right\}$$

has probability zero for all $m, n \in \mathbb{N}$, and the result follows from countable subadditivity. $\square$

Note that if $f$ is differentiable at $t$, then there is a $\delta > 0$ such that $|f(t) - f(s)| \leq (|f'(t)| + 1)|t-s|$ whenever $|t-s| < \delta$. In other words, a function is Lipschitz at every point at which it is differentiable. Taking the contrapositive and appealing to Theorem 15.5 yields

Theorem 15.6 (Paley, Wiener, Zygmund). Brownian motion is a.s. nowhere differentiable.
In fact, if we define the upper and lower right derivatives of a function by

$$D^* f(t) = \limsup_{h \searrow 0} \frac{f(t+h) - f(t)}{h}, \qquad D_* f(t) = \liminf_{h \searrow 0} \frac{f(t+h) - f(t)}{h},$$

then one can show that

$$P\left( \bigcap_{t \geq 0} \left( \{D_* B(t) = -\infty\} \cup \{D^* B(t) = \infty\} \right) \right) = 1.$$

Indeed, if there is some $t_0 \in [0,1]$ such that $-\infty < D_* B(t_0) \leq D^* B(t_0) < \infty$, then

$$\limsup_{h \searrow 0} \frac{|B(t_0 + h) - B(t_0)|}{h} < \infty.$$

Since $B(t)$ is a.s. continuous (and thus uniformly bounded on compact sets), this means that

$$\sup_{h \in [0,1]} \frac{|B(t_0 + h) - B(t_0)|}{h} \leq M$$

for some a.s. finite $M$, so that $B(t)$ is Lipschitz from the right at $t_0$. An argument similar to the proof of Theorem 15.5 shows that for any $C \in (0,\infty)$, the event

$$\{\omega : \text{there is some } s \in [0,1] \text{ such that } |B(s+h,\omega) - B(s,\omega)| \leq Ch \text{ for all } h \in [0,1]\}$$

has probability zero.
The Hölder continuity results discussed here are not the most precise answer to the question "How continuous is Brownian motion?"

In 1937, Paul Lévy proved the following theorem regarding Brownian motion's modulus of continuity:

Theorem 15.7. Almost surely,

$$\limsup_{h \to 0} \sup_{0 \leq t \leq 1-h} \frac{|B(t+h) - B(t)|}{\sqrt{2h \log \frac{1}{h}}} = 1.$$

Though the proof of Theorem 15.7 is not difficult, we leave it to independent pursuit so that we may explore more topics.
Before moving on, however, we note that the arguments in Theorems 15.2 and 15.3 provide an alternative path to proving the existence of Brownian motion.

Namely, one uses the extension theorem to get a process on the dyadic rationals with appropriate finite dimensional distributions. One then argues as before to show that this process satisfies a Hölder condition. As $D$ is dense in $[0,1]$, there is a unique continuous extension to $[0,1]$. By an easy limiting argument, this continuous version has the correct finite dimensional distributions.

Though I prefer the approach we have taken since it is more concrete, the above argument has the advantage that it also works for other continuous processes specified by their finite dimensional distributions.

Another benefit is that the starting point is a part of the finite dimensional distributions, so as with Markov chains, we have one process $B_t(\omega) = \omega(t)$ and a family of probability measures $\{P_x\}_{x \in \mathbb{R}}$ such that under $P_x$, $B_t$ is a Brownian motion with $P_x(B_0 = x) = 1$. Measures corresponding to general initial distributions then arise as mixtures: $P_\mu(A) = \int P_x(A) \, d\mu(x)$.
16. Markov Properties
Our next object is to establish that Brownian motion is a (strong) Markov process and explore some simple
consequences of this fact.
In order to discuss Markov properties, it is necessary to have a referent filtration. The definitions in continuous time are as one would expect from the discrete case:
Given a probability space $(\Omega, \mathcal{F}, P)$, a filtration is a family of sub-$\sigma$-algebras $\{\mathcal{F}(t) : t \geq 0\}$ with $\mathcal{F}(s) \subseteq \mathcal{F}(t) \subseteq \mathcal{F}$ for all $0 \leq s \leq t$.

A process $\{X(t) : t \geq 0\}$ on a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}(t)\}_{t \geq 0}, P)$ is adapted if $X(t)$ is $\mathcal{F}(t)$-measurable for all $t \geq 0$.

If $\{B(t) : t \geq 0\}$ is a standard Brownian motion, the obvious candidate is $\mathcal{F}^0(t) = \sigma(B(s) : 0 \leq s \leq t)$. $B(t)$ is adapted to this filtration by construction, and Proposition 14.2 shows that $\{B(t+s) - B(s) : t \geq 0\}$ is independent of $\mathcal{F}^0(s)$.

It turns out that a more convenient filtration is given by

$$\mathcal{F}^+(s) = \bigcap_{t > s} \mathcal{F}^0(t).$$
$\{\mathcal{F}^+(t) : t \geq 0\}$ is clearly a filtration with $\mathcal{F}^0(t) \subseteq \mathcal{F}^+(t)$. Intuitively, $\mathcal{F}^0(t)$ contains all of the information up to time $t$, while $\mathcal{F}^+(t)$ contains, in addition, an infinitesimal peek into the future: $A \in \mathcal{F}^+(t)$ if and only if $A \in \mathcal{F}^0(t + \varepsilon)$ for all $\varepsilon > 0$.

One advantage of working with the latter is that it is right-continuous:

$$\bigcap_{t > s} \mathcal{F}^+(t) = \bigcap_{t > s} \left( \bigcap_{u > t} \mathcal{F}^0(u) \right) = \bigcap_{u > s} \mathcal{F}^0(u) = \mathcal{F}^+(s).$$
Theorem 16.1 (Markov Property). For every s ≥ 0, {Bt+s − Bs : t ≥ 0} is independent of F + (s).
Proof.
Let
Let
sn
+
be a strictly decreasing sequence with
A ∈ F (s),
Since
+
let
A ∈ F (s)
t1 , ..., tm ≥ 0,
implies
0
and let
A ∈ F (sn )
F :R
for all
n,
m
limn→∞ sn = s.
→R
be bounded and continuous.
Proposition 14.2 shows that
E [F (Bt1 +sn − Bsn , ..., Btm +sn − Bsn ) 1A ] = E [F (Bt1 +sn − Bsn , ..., Btm +sn − Bsn )] P (A)
for all
n.
Sample path continuity implies
lim Bt+sn − Bsn = Bt+s − Bs
n→∞
for all
t ≥ 0,
a.s.
so two applications of the dominated convergence theorem give
E [F (Bt1 +s − Bs , ..., Btm +s − Bs ) 1A ] = lim E [F (Bt1 +sn − Bsn , ..., Btm +sn − Bsn ) 1A ]
n→∞
= lim E [F (Bt1 +sn − Bsn , ..., Btm +sn − Bsn )] P (A)
n→∞
= E [F (Bt1 +s − Bs , ..., Btm +s − Bs )] P (A).
As
A ∈ F + (s), t1 , ..., tm ≥ 0,
and
F ∈ Cb (Rm , R)
were arbitrary, the claim follows.
102
In other words, conditional on $\mathcal{F}^+(s)$, the process $\{B(t+s) : t \geq 0\}$ is a Brownian motion started at $B(s)$.

When specialized to the germ $\sigma$-algebra $\mathcal{F}^+(0)$, Theorem 16.1 yields

Theorem 16.2 (Blumenthal's 0-1 Law). The $\sigma$-algebra $\mathcal{F}^+(0)$ is trivial.

Proof. Let $A \in \mathcal{F}^+(0)$. Then $A \in \sigma(B_t : t \geq 0)$, so Theorem 16.1 implies that $A$ is independent of $\mathcal{F}^+(0)$. In particular, $A$ is independent of itself, so $P(A) = P(A \cap A) = P(A)P(A)$, hence $P(A) \in \{0,1\}$. $\square$

An interesting consequence of Theorem 16.2 is that standard Brownian motion has positive values, negative values, and zeros in any time interval $(0,\varepsilon)$ almost surely.
Theorem 16.3. Suppose that $\{B(t) : t \geq 0\}$ is a standard Brownian motion. Define $\tau = \inf\{t > 0 : B(t) > 0\}$ and $T = \inf\{t > 0 : B(t) = 0\}$. Then $P(\tau = 0) = P(T = 0) = 1$.

Proof. Let $A_k = \left\{ \tau < \frac{1}{k} \right\} = \left\{ B_t > 0 \text{ for some } 0 < t < \frac{1}{k} \right\}$, so that $A_k \in \mathcal{F}^0\left( \frac{1}{k} \right)$. Then

$$\{\tau = 0\} = \bigcap_{k=n}^\infty A_k \in \mathcal{F}^0\left( \tfrac{1}{n} \right) \quad \text{for all } n \in \mathbb{N},$$

so $\{\tau = 0\} \in \mathcal{F}^+(0)$, hence $P(\tau = 0) \in \{0,1\}$.

To see that $P(\tau = 0) > 0$, observe that for all $n \in \mathbb{N}$,

$$P(A_n) \geq P\left( B\left( \tfrac{1}{2n} \right) > 0 \right) = \tfrac{1}{2}$$

since $B\left( \frac{1}{2n} \right) \sim N\left( 0, \frac{1}{2n} \right)$, so

$$P(\tau = 0) = P\left( \bigcap_{n=1}^\infty A_n \right) = \lim_{n \to \infty} P(A_n) \geq \tfrac{1}{2}.$$

The exact same argument (or the fact that $\{B_t\}_{t \geq 0} =_d \{-B_t\}_{t \geq 0}$) shows that $\sigma = \inf\{t > 0 : B(t) < 0\} = 0$ a.s. Since $B$ is a.s. continuous, the intermediate value theorem implies $P(T = 0) = 1$. $\square$

Because of time inversion, results concerning the $t \to 0$ behavior of Brownian motion can be used to understand the behavior as $t \to \infty$. Define $\mathcal{G}(t) = \sigma(B(s) : s \geq t)$ and let $\mathcal{T} = \bigcap_{t \geq 0} \mathcal{G}(t)$ be the $\sigma$-algebra of tail events. Since $\mathcal{T}$ is mapped onto the germ $\sigma$-algebra under time inversion, we have

Theorem 16.4. The tail $\sigma$-algebra for standard Brownian motion is trivial.

A typical application of Theorem 16.4 is given by

Theorem 16.5. If $B_t$ is a standard Brownian motion, then with probability one,

$$\limsup_{t \to \infty} \frac{B_t}{\sqrt{t}} = \infty, \qquad \liminf_{t \to \infty} \frac{B_t}{\sqrt{t}} = -\infty.$$
Proof. It suffices to prove the first statement as the second then follows from symmetry. Let $K \in (0,\infty)$, set $A_n = \left\{ \frac{B_n}{\sqrt{n}} \geq K \right\}$, and recall the inequality

$$P(A_n \text{ i.o.}) = P\left( \bigcap_{n=1}^\infty \bigcup_{m=n}^\infty A_m \right) = \lim_{n \to \infty} P\left( \bigcup_{m=n}^\infty A_m \right) \geq \lim_{n \to \infty} \sup_{m \geq n} P(A_m) = \limsup_{n \to \infty} P(A_n).$$

It follows from Brownian scaling and the fact that $B_1 \sim N(0,1)$ that

$$P\left( \frac{B_n}{\sqrt{n}} \geq K \text{ i.o.} \right) \geq \limsup_{n \to \infty} P\left( \frac{B_n}{\sqrt{n}} \geq K \right) = P(B_1 \geq K) > 0.$$

Since $\left\{ \frac{B_n}{\sqrt{n}} \geq K \text{ i.o.} \right\} \in \mathcal{T}$, Theorem 16.4 implies $P\left( \frac{B_n}{\sqrt{n}} \geq K \text{ i.o.} \right) = 1$ and the result follows since $K$ was arbitrary. $\square$
Writing $P_x$ for the probability measure which makes $\{B(t) : t \geq 0\}$ a Brownian motion started at $x$, the previous theorem, along with a.s. continuity and translation invariance, shows that

Theorem 16.6. Let $B_t$ be a Brownian motion and let $A = \bigcap_{n \in \mathbb{N}} \{B_t = 0 \text{ for some } t \geq n\}$. Then $P_x(A) = 1$ for all $x$.
Our final application of the Markov property concerns the local and global maxima of Brownian motion.

Theorem 16.7. For a Brownian motion $\{B(t) : 0 \leq t \leq 1\}$, almost surely

(1) Every local maximum is a strict local maximum.
(2) The set of times where local maxima are attained is countable and dense.
(3) The global maximum is attained at a unique time.

Proof. We begin by showing that for any closed time intervals with disjoint interiors, the maxima of Brownian motion over each are almost surely distinct:

Let $0 \leq a_1 < b_1 \leq a_2 < b_2 \leq 1$, and let $m_1(\omega) = \max_{t \in [a_1,b_1]} B(t,\omega)$, $m_2(\omega) = \max_{t \in [a_2,b_2]} B(t,\omega)$. (Since $B(t)$ is almost surely continuous and the intervals in question are compact, $m_1$ and $m_2$ are well-defined with full probability.)

Theorem 16.3 and the Markov property show that for any $\varepsilon > 0$, there almost surely exists some $a_2 < t < a_2 + \varepsilon$ with $B(t) - B(a_2) > 0$, hence $B(a_2) < m_2$ a.s. Thus for almost every $\omega$, there is a smallest $n = n(\omega) \in \mathbb{N}$ such that $m_2(\omega)$ is the maximum over $\left[ a_2 + \frac{1}{n}, b_2 \right]$. By considering each of these intervals, it suffices to assume in the proof that $b_1 < a_2$.

Applying the Markov property at time $b_1$ shows that $B(a_2) - B(b_1)$ is independent of $m_1 - B(b_1)$. Applying the Markov property at time $a_2$ shows that $m_2 - B(a_2)$ is independent of each of these increments.

Now we can write the event $m_1 = m_2$ as

$$B(a_2) - B(b_1) = (m_1 - B(b_1)) - (m_2 - B(a_2)).$$

Conditioning on the values of $m_1 - B(b_1)$ and $m_2 - B(a_2)$, we see that the right hand side is constant and the left is a continuous random variable, so this event has probability $0$.
(1) We just proved that all non-overlapping pairs of nondegenerate closed intervals with rational endpoints have distinct maxima. If Brownian motion had a non-strict local maximum, there would be two such intervals having the same maximum.

(2) The maximum over any nondegenerate closed interval with rational endpoints is a.s. not attained at the endpoints, so there is a local maximum between any two rational numbers, hence the set of local maxima is dense. Since every local maximum is a strict local maximum, there are no more maxima than there are intervals with rational endpoints, which is countable.

(3) For any $q \in \mathbb{Q} \cap [0,1]$, the maxima over $[0,q]$ and $[q,1]$ are distinct. If the global maximum was obtained at two points $t_1 < t_2$, there would be a rational $q \in (t_1, t_2)$ such that the maxima over $[0,q]$ and $[q,1]$ agree. $\square$
Strong Markov Property.

To state the strong Markov property, we need to understand stopping times in the continuous setting. Though there are many similarities, beware that not all results for discrete time carry over.

Definition. Given a filtered probability space $\left( \Omega, \mathcal{F}, \{\mathcal{F}(t)\}_{t \geq 0}, P \right)$, we say that a $[0,\infty]$-valued random variable $T$ is a stopping time if $\{T \leq t\} \in \mathcal{F}(t)$ for all $t \geq 0$.
One of the reasons that it is so convenient to work with right-continuous filtrations is

Theorem 16.8. If $\{\mathcal{F}(t)\}_{t \geq 0}$ is right-continuous, then a $[0,\infty]$-valued random variable $T$ is a stopping time if $\{T < t\} \in \mathcal{F}(t)$ for all $t \geq 0$.

Proof. For such a $T$, we have

$$\{T \leq t\} = \bigcap_{n=1}^\infty \left\{ T < t + \tfrac{1}{n} \right\} \in \bigcap_{k=1}^\infty \mathcal{F}\left( t + \tfrac{1}{k} \right) = \mathcal{F}(t). \qquad \square$$

Thus for right-continuous filtrations, we recover our definition from the discrete setting:

Corollary 16.1. If $\{\mathcal{F}(t)\}_{t \geq 0}$ is right-continuous, then a $[0,\infty]$-valued random variable $T$ is a stopping time if $\{T = t\} \in \mathcal{F}(t)$ for all $t \geq 0$.

Proof. $\{T = t\} = \{T \leq t\} \setminus \{T < t\} \in \mathcal{F}(t)$. $\square$
To illustrate the utility of Theorem 16.8 (and get some practice working with stopping times), observe that

• If $T_n$ is an increasing sequence of stopping times with respect to $\{\mathcal{F}(t)\}_{t \geq 0}$ and $T_n \nearrow T$, then $T$ is also a stopping time with respect to $\{\mathcal{F}(t)\}_{t \geq 0}$:

$$\{T \leq t\} = \bigcap_{n=1}^\infty \{T_n \leq t\} \in \mathcal{F}(t).$$

• If $T_n$ is a decreasing sequence of stopping times with $T_n \searrow T$, then so is $T$ provided that the filtration is right-continuous:

$$\{T < t\} = \bigcup_{n=1}^\infty \{T_n < t\} \in \mathcal{F}(t).$$

• If $K$ is a closed set, then $T_K = \inf\{t \geq 0 : B_t \in K\}$ is a stopping time with respect to $\mathcal{F}^0(t)$ (and thus $\mathcal{F}^+(t)$): Let $D$ be a countable dense subset of $K$. Then

$$\{T_K \leq t\} = \bigcap_{n=1}^\infty \bigcup_{s \in \mathbb{Q} \cap (0,t)} \bigcup_{x \in D} \left\{ |B(s) - x| < \tfrac{1}{n} \right\} \in \mathcal{F}^0(t).$$

• If $U$ is an open set, then $T_U = \inf\{t \geq 0 : B_t \in U\}$ is a stopping time with respect to $\{\mathcal{F}^+(t)\}_{t \geq 0}$: By continuity,

$$\{T_U \leq t\} = \bigcap_{s > t} \{T_U < s\} = \bigcap_{s > t} \bigcup_{r \in \mathbb{Q} \cap (0,s)} \{B(r) \in U\} \in \mathcal{F}^+(t).$$

However, $T_U$ is not necessarily a stopping time with respect to $\left\{ \mathcal{F}^0(t) \right\}_{t \geq 0}$. Indeed, suppose that $U$ is bounded and does not contain the starting point. Then we can find a path $\gamma : [0,t] \to \mathbb{R}$ with $\gamma(t) \in \partial U$ and $\{\gamma(s) : 0 \leq s < t\} \cap U = \emptyset$. Clearly $\mathcal{F}^0(t)$ contains no nontrivial subset of $\{B(s) = \gamma(s) \text{ for all } 0 \leq s \leq t\}$. But if $\{T_U \leq t\} \in \mathcal{F}^0(t)$, then $\{B(s) = \gamma(s) \text{ for all } 0 \leq s \leq t,\ T_U = t\}$ would be a nontrivial subset, a contradiction.

(The first statement is because $\mathcal{F}^0(t)$ only contains information about the paths up to time $t$, and the second is because $B$ could either enter $U$ immediately after hitting its boundary or could remain in the complement for a while.)
Definition. If $T$ is a stopping time with respect to $\{\mathcal{F}^+(t) : t \geq 0\}$, the stopped $\sigma$-algebra $\mathcal{F}^+(T)$ is given by

$$\mathcal{F}^+(T) = \left\{ A \in \mathcal{F} : A \cap \{T \leq t\} \in \mathcal{F}^+(t) \text{ for all } t \geq 0 \right\}.$$

Theorem 16.9 (Strong Markov Property). If $\{B_t : t \geq 0\}$ is a Brownian motion and $T$ is an a.s. finite stopping time with respect to $\{\mathcal{F}^+(t) : t \geq 0\}$, then the process

$$\{B_{T+t} - B_T : t \geq 0\}$$

is a standard Brownian motion independent of $\mathcal{F}^+(T)$.
Proof. For $n \in \mathbb{N}$, define $T_n$ by

$$T_n = \frac{m+1}{2^n} \quad \text{if } \frac{m}{2^n} \leq T < \frac{m+1}{2^n}.$$

Define the processes

$$B_{n,k}(t) = B\left( t + \tfrac{k}{2^n} \right) - B\left( \tfrac{k}{2^n} \right), \qquad B_n^*(t) = B(T_n + t) - B(T_n).$$

We will show that $B_n^*(t)$ is a standard Brownian motion independent of $\mathcal{F}^+(T_n)$.

To this end, let $E \in \mathcal{F}^+(T_n)$. For every event $\{B_n^* \in A\}$, we have

$$P(\{B_n^* \in A\} \cap E) = \sum_{k=1}^\infty P\left( \{B_{n,k} \in A\} \cap E \cap \{T_n = k/2^n\} \right) = \sum_{k=1}^\infty P(B_{n,k} \in A)\, P\left( E \cap \{T_n = k/2^n\} \right)$$

since the Markov property implies $B_{n,k}$ is independent of $E \cap \{T_n = k/2^n\} \in \mathcal{F}^+(k/2^n)$.

Because $B_{n,k}$ is a standard Brownian motion, $P(B_{n,k} \in A) = P(\tilde{B} \in A)$ does not depend on $n$ or $k$, so

$$P(\{B_n^* \in A\} \cap E) = \sum_{k=1}^\infty P(\tilde{B} \in A)\, P\left( E \cap \{T_n = k/2^n\} \right) = P(\tilde{B} \in A)\, P(E).$$

Taking $E = \Omega$ shows that $B_n^*$ is a standard Brownian motion, hence

$$P(\{B_n^* \in A\} \cap E) = P(B_n^* \in A)\, P(E)$$

for all $A$, $E$, establishing the independence of $B_n^*$ and $\mathcal{F}^+(T_n)$.

Now $T_n \searrow T$, so it follows from continuity that

$$\left( B(T_n + t_2) - B(T_n + t_1), \ldots, B(T_n + t_m) - B(T_n + t_{m-1}) \right) \to \left( B(T + t_2) - B(T + t_1), \ldots, B(T + t_m) - B(T + t_{m-1}) \right).$$

Since the left hand side is multivariate normal with mean zero and covariance $\operatorname{diag}(t_2 - t_1, \ldots, t_m - t_{m-1})$ for all $n$, the right hand side is as well. As $\{B(T+t) - B(T) : t \geq 0\}$ is clearly a.s. continuous, we see that it is a Brownian motion.

To finish up, we need to show that $\{B(T+t) - B(T) : t \geq 0\}$ is independent of $\mathcal{F}^+(T)$. This is a simple consequence of the fact that $\{B(T_n + t) - B(T_n) : t \geq 0\}$ is independent of $\mathcal{F}^+(T_n) \supseteq \mathcal{F}^+(T)$ (because $T \leq T_n$ implies $A \cap \{T_n \leq t\} = (A \cap \{T \leq t\}) \cap \{T_n \leq t\} \in \mathcal{F}^+(t)$ if $A \in \mathcal{F}^+(T)$).

Specifically, if $A \in \mathcal{F}^+(T)$, $t_1, \ldots, t_m \geq 0$, and $F \in C_b(\mathbb{R}^m, \mathbb{R})$, then continuity and the dominated convergence theorem give

$$E\left[ F(B(T+t_1) - B(T), \ldots, B(T+t_m) - B(T))\, 1_A \right] = \lim_{n \to \infty} E\left[ F(B(T_n+t_1) - B(T_n), \ldots, B(T_n+t_m) - B(T_n))\, 1_A \right]$$
$$= \lim_{n \to \infty} E\left[ F(B(T_n+t_1) - B(T_n), \ldots, B(T_n+t_m) - B(T_n)) \right] P(A) = E\left[ F(B(T+t_1) - B(T), \ldots, B(T+t_m) - B(T)) \right] P(A). \qquad \square$$
It is easy to get bogged down in the details of these kinds of arguments, but observe that the general strategy is quite simple. Namely, we approximate $T$ discretely from above, sum over the possible values of $T_n$, and apply the ordinary Markov property, just as we did for Markov chains. The rest is just tying up loose ends.

An alternative form of Theorem 16.9 that is sometimes useful is

For any $x \in \mathbb{R}$ and any bounded measurable $f : C([0,\infty), \mathbb{R}) \to \mathbb{R}$, we have

$$E_x\left[ f(\{B(T+t) : t \geq 0\}) \mid \mathcal{F}^+(T) \right] = E_{B(T)}\left[ f\left( \{\tilde{B}(t) : t \geq 0\} \right) \right]$$

where the expectation on the right is with respect to a Brownian motion $\{\tilde{B}(t) : t \geq 0\}$ started at $B(T)$.
An intriguing consequence of the strong Markov property is

Theorem 16.10 (Reflection Principle). If $T$ is an a.s. finite stopping time and $\{B(t) : t \geq 0\}$ is a standard Brownian motion, then the process $\{B^*(t) : t \geq 0\}$ defined by

$$B^*(t) = B(t)\, 1\{t \leq T\} + (2B(T) - B(t))\, 1\{t > T\}$$

is also a standard Brownian motion. ($B^*$ is called Brownian motion reflected at $T$.)

Proof. The strong Markov property implies that $B^{(T)} = \{B_{T+t} - B_T : t \geq 0\}$ is a standard Brownian motion which is independent of the initial segment $\{B_t : 0 \leq t \leq T\}$, hence $-B^{(T)} = \{B_T - B_{T+t} : t \geq 0\}$ is as well.

The concatenation map which glues a continuous path $\{g(t) : t \geq 0\}$ to a finite continuous path $\{f(t) : 0 \leq t \leq T\}$ to form the continuous path $\Psi_T(f,g)(t) = f(t)\, 1\{0 \leq t \leq T\} + (f(T) - g(0) + g(t-T))\, 1\{t > T\}$ is evidently measurable.

Applying $\Psi_T$ to $\{B_t : 0 \leq t \leq T\}$ and $B^{(T)}$ gives the original process $B$, and applying $\Psi_T$ to $\{B_t : 0 \leq t \leq T\}$ and $-B^{(T)}$ gives $B^*$. The result follows since $\left( \{B_t\}_{t \leq T}, B^{(T)} \right) =_d \left( \{B_t\}_{t \leq T}, -B^{(T)} \right)$. $\square$
It is an interesting fact that Brownian motion reflected at a random time is still a Brownian motion, but the real beauty of this result lies in its consequences. For example, with the aid of Theorem 16.10, we can deduce the marginal distributions of the running maximum of Brownian motion,

$$M(t) = \max_{0 \leq s \leq t} B(s).$$

Theorem 16.11. If $a > 0$, then

$$P_0(M(t) > a) = 2P_0(B(t) > a) = P_0(|B(t)| > a).$$

Proof. The second equality follows from the symmetry of standard Brownian motion. For the first, let $T = \inf\{t \geq 0 : B(t) = a\}$, and let $B^*(t)$ be Brownian motion reflected at $T$. Then

$$\{M(t) > a\} = \{B(t) > a\} \sqcup \{M(t) > a,\ B(t) \leq a\}.$$

The right-hand side is a disjoint union, the second term of which coincides with $\{B^*(t) \geq a\}$, so Theorem 16.10 gives

$$P_0(M(t) > a) = P_0(B(t) > a) + P_0(B^*(t) \geq a) = 2P_0(B(t) > a). \qquad \square$$
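A quick Monte Carlo check of Theorem 16.11 (our addition, with arbitrary parameters):

```python
import numpy as np

# Estimate P(M(t) > a) and 2 P(B(t) > a) from simulated paths.  The
# discrete-time maximum slightly undershoots the true running maximum.
rng = np.random.default_rng(4)
n_paths, n_steps, t, a = 20_000, 500, 1.0, 1.0
dt = t / n_steps
paths = np.cumsum(rng.normal(0.0, np.sqrt(dt), (n_paths, n_steps)), axis=1)

M = paths.max(axis=1)
print("P(M(t) > a)   ~", np.mean(M > a))
print("2 P(B(t) > a) ~", 2 * np.mean(paths[:, -1] > a))   # theory: ~0.317
```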
Our next application of the strong Markov property concerns the zero set of Brownian motion. First, we take a brief detour to recall a few facts from undergraduate real analysis.

Definition. A point $x$ in a topological space $(X, \mathcal{T})$ is called an isolated point of $S \subseteq X$ if there exists some $U \in \mathcal{T}$ with $x \in U$ and $U \cap S = \{x\}$.

$x \in \mathbb{R}$ is isolated from the right (with respect to $S$) if there is some $\varepsilon > 0$ such that $(x, x+\varepsilon) \cap S = \emptyset$, and is isolated from the left if there is some $\varepsilon > 0$ such that $(x-\varepsilon, x) \cap S = \emptyset$. A point $x \in \mathbb{R}$ is isolated if and only if it is isolated from the left and from the right.

Definition. A perfect subset of a topological space $(X, \mathcal{T})$ is a closed set with no isolated points.

Proposition. A perfect subset of $\mathbb{R}$ (with the usual metric topology) is uncountable.

Proof. We first note that if $x \in S$ and $U$ is an open set containing $x$, then $U \cap S$ is infinite - otherwise, the open ball centered at $x$ with radius $\frac{1}{2} \min\{|x - y| : y \in U \cap S,\ y \neq x\}$ would intersect $S$ only in $x$.

Now suppose that $S$ can be enumerated as $S = \{x_1, x_2, \ldots\}$. Let $U_1$ be a bounded open set containing $x_1$. Let $k_1 = 1$ and $k_2 = \min\{k > k_1 : x_k \in U_1\}$. Let $U_2$ be an open set containing $x_{k_2}$ such that $\overline{U_2} \subseteq U_1 \setminus \{x_{k_1}\}$. Then set $k_3 = \min\{k > k_2 : x_k \in U_2\}$, let $U_3$ be an open set containing $x_{k_3}$ with $\overline{U_3} \subseteq U_2 \setminus \{x_{k_2}\}$, and so on...

Define $V = \bigcap_{n=1}^\infty \overline{U_n} \cap S$. Since $S$ and $\overline{U_n}$ are closed subsets of $\mathbb{R}$ and $\overline{U_n}$ is bounded, $V$ is the intersection of a nested decreasing sequence of nonempty compact sets and thus is nonempty. But, by construction, $V \subseteq S$ does not contain any $x_k$, contradicting the assumption that $S$ can be enumerated. $\square$
Returning to Brownian motion, we have

Theorem 16.12. Let $\{B(t) : t \geq 0\}$ be a Brownian motion and define

$$\text{Zeros} = \{t \geq 0 : B(t) = 0\}.$$

Then, almost surely, Zeros is a perfect set and thus is uncountable.

Proof. Since Brownian motion is a.s. continuous, Zeros is closed with probability one. In light of the preceding proposition, it remains only to show that Zeros almost surely has no isolated point.

To this end, we define for each nonnegative $q \in \mathbb{Q}$, $\tau_q = \inf\{t \geq q : B(t) = 0\}$, which is a.s. finite by Theorem 16.6. Then $\tau_q$ is a stopping time and, since Zeros is a.s. closed, $P(\tau_q \in \text{Zeros}) = 1$. Theorem 16.3 and the strong Markov property applied to $\tau_q$ show that, almost surely, $\tau_q$ is not isolated from the right for all $q \in \mathbb{Q} \cap [0,\infty)$.

We will be done if we can show that all other points in Zeros are not isolated from the left. For such $t$, take $q_n \nearrow t$ with $q_n \in \mathbb{Q}$ and define $t_n = \tau_{q_n}$. Then $q_n \leq t_n < t$, so $t_n \nearrow t$, hence $t$ is not isolated from the left. $\square$
Markov Processes.

At this point, we need to define precisely what we mean by a continuous time Markov process.

Definition. A function $p : [0,\infty) \times \mathbb{R} \times \mathcal{B} \to [0,1]$ is a Markov transition kernel on $(\mathbb{R}, \mathcal{B})$ if

(1) For all $A \in \mathcal{B}$, $p(\cdot, \cdot, A) : [0,\infty) \times \mathbb{R} \to [0,1]$ is a measurable function of $(t,x)$.
(2) For all $t \in [0,\infty)$, $x \in \mathbb{R}$, $p(t,x,\cdot)$ is a probability measure on $(\mathbb{R}, \mathcal{B})$.
(3) For all $A \in \mathcal{B}$, $x \in \mathbb{R}$, $s, t > 0$, $p(s+t, x, A) = \int_{\mathbb{R}} p(s, x, dy)\, p(t, y, A)$.

Given a filtration $\{\mathcal{F}(t) : t \geq 0\}$, an adapted process $\{X(t) : t \geq 0\}$ is a Markov process with transition kernel $p$ if for all $t > s \geq 0$ and all $A \in \mathcal{B}$,

$$P(X(t) \in A \mid \mathcal{F}(s)) = p(t-s, X(s), A) \quad \text{almost surely.}$$

In words, $p(t,x,A)$ is the probability that the process started at $x$ will be in $A$ at time $t$.

Theorem 16.1 shows that Brownian motion is a Markov process with respect to $\{\mathcal{F}^+(t) : t \geq 0\}$. The transition kernel is

$$p(t,x,A) = \frac{1}{\sqrt{2\pi t}} \int_A e^{-\frac{(y-x)^2}{2t}} dy.$$

That is, $p(t,x,\cdot)$ is the normal distribution with mean $x$ and variance $t$. (The Chapman-Kolmogorov condition here is the statement that the sum of independent normals is normal.)
Likewise, if $\{B(t) : t \geq 0\}$ is a standard Brownian motion, the reflected Brownian motion $\{X(t) : t \geq 0\}$ given by $X(t) = |B(t)|$ is a Markov process (w.r.t. $\{\mathcal{F}^+(t) : t \geq 0\}$) with transition kernel $p(t,x,\cdot)$ the modulus normal distribution - i.e. the law of $|W|$ with $W \sim N(x,t)$.

The following theorem shows that the difference of the maximum process $M(t) = \sup_{0 \leq s \leq t} B(s)$ and the underlying Brownian motion $B(t)$ is a reflected Brownian motion.
Theorem 16.13. Let $\{B(t) : t \geq 0\}$ be a Brownian motion and let $\{M(t) : t \geq 0\}$ be the running maximum process. Then the process $\{Y(t) : t \geq 0\}$ given by $Y(t) = M(t) - B(t)$ is a reflected Brownian motion.

Proof. Since $M(0) = B(0)$, we can assume without loss that $B(t)$ is a standard Brownian motion. We proceed by fixing $s \geq 0$ and defining the processes $\{\hat{B}(t) : t \geq 0\}$ and $\{\hat{M}(t) : t \geq 0\}$ by

$$\hat{B}(t) = B(s+t) - B(s), \qquad \hat{M}(t) = \max_{0 \leq u \leq t} \hat{B}(u).$$

Since $Y(t)$ is clearly $\mathcal{F}^+(t)$-measurable, we need only show that

$$P\left( Y(s+t) \in A \mid \mathcal{F}^+(s) \right) = p(t, Y(s), A)$$

for all $t \geq 0$, $A \in \mathcal{B}$, where $p(t,y,\cdot) = \mathcal{L}(|W|)$, $W \sim N(y,t)$.

As $\hat{B}(t) \sim N(0,t)$ is independent of $\mathcal{F}^+(s)$, this is equivalent to showing that, conditional on $\mathcal{F}^+(s)$, $Y(s+t)$ has the same distribution as $\left| \hat{B}(t) + Y(s) \right|$.

Writing $M(s+t) = M(s) \vee \left( B(s) + \hat{M}(t) \right)$ - the max over $[0, s+t]$ is the maximum of the max over $[0,s]$ and the max over $[s, s+t]$ - shows that

$$Y(s+t) = M(s+t) - B(s+t) = M(s) \vee \left( B(s) + \hat{M}(t) \right) - \left( \hat{B}(t) + B(s) \right) = \left( Y(s) - \hat{B}(t) \right) \vee \left( \hat{M}(t) - \hat{B}(t) \right) = \left( Y(s) \vee \hat{M}(t) \right) - \hat{B}(t).$$

We will be done if we prove that for every $y \geq 0$,

$$\left( y \vee \hat{M}(t) \right) - \hat{B}(t) =_d \left| \hat{B}(t) + y \right|.$$

To see this, observe that for $a \geq 0$,

$$P\left( \left( y \vee \hat{M}(t) \right) - \hat{B}(t) > a \right) = P\left( y - \hat{B}(t) > a \right) + P\left( y - \hat{B}(t) \leq a,\ \hat{M}(t) - \hat{B}(t) > a \right) = P\left( y + \hat{B}(t) > a \right) + P\left( y - \hat{B}(t) \leq a,\ \hat{M}(t) - \hat{B}(t) > a \right)$$

since $\hat{B}(t) =_d -\hat{B}(t)$. Thus equality in distribution will follow upon showing that

$$P\left( y - \hat{B}(t) \leq a,\ \hat{M}(t) - \hat{B}(t) > a \right) = P\left( y + \hat{B}(t) < -a \right).$$

To this end, let $\{W(u) : 0 \leq u \leq t\}$ be the time reversed Brownian motion $W(u) = \hat{B}(t-u) - \hat{B}(t)$ (which was shown in the homework to be a standard Brownian motion on $[0,t]$), and set $M_W(t) = \max_{0 \leq u \leq t} W(u)$.

We have

$$M_W(t) = \max_{0 \leq u \leq t} \left( \hat{B}(t-u) - \hat{B}(t) \right) = \max_{0 \leq u \leq t} \hat{B}(u) - \hat{B}(t) = \hat{M}(t) - \hat{B}(t)$$

and $W(t) = -\hat{B}(t)$, so

$$P\left( y - \hat{B}(t) \leq a,\ \hat{M}(t) - \hat{B}(t) > a \right) = P\left( y + W(t) \leq a,\ M_W(t) > a \right).$$

Finally, let $\{W^*(u) : 0 \leq u \leq t\}$ be the process formed by reflecting $W$ at $\tau_a = \inf\{u : W(u) = a\}$. The reflection principle shows that $\{W^*(u) : 0 \leq u \leq t\}$ is a standard Brownian motion on $[0,t]$, hence $W^*(t) =_d -\hat{B}(t)$. Moreover, it is clear that $\{y + W(t) \leq a,\ M_W(t) > a\} = \{W^*(t) \geq a + y\}$, so we end up with

$$P\left( y - \hat{B}(t) \leq a,\ \hat{M}(t) - \hat{B}(t) > a \right) = P\left( y + W(t) \leq a,\ M_W(t) > a \right) = P\left( W^*(t) \geq a + y \right) = P\left( -\hat{B}(t) \geq a + y \right) = P\left( \hat{B}(t) + y < -a \right)$$

(as $\hat{B}(t)$ is a continuous random variable) and the proof is complete. $\square$
While the foregoing establishes that $\{M(t) - B(t) : t \geq 0\}$ is a Markov process, $\{M(t) : t \geq 0\}$ clearly is not. However, the next result shows that the process which records the times when new maxima are achieved is Markovian.

Theorem 16.14. For $a \geq 0$, define the stopping time

$$T_a = \inf\{t \geq 0 : B(t) = a\}.$$

Then $\{T_a : a \geq 0\}$ is a Markov process with transition kernel given by the densities

$$p(a, t, s) = \frac{a}{\sqrt{2\pi (s-t)^3}} \exp\left( -\frac{a^2}{2(s-t)} \right) 1\{s > t\} \quad \text{for } a > 0.$$

Proof. Fix $a \geq b \geq 0$ and observe that for all $t \geq 0$,

$$\{T_a - T_b = t\} = \left\{ B(T_b + s) - B(T_b) < a - b \text{ for all } s < t \right\} \cap \left\{ B(T_b + t) - B(T_b) = a - b \right\}.$$

The strong Markov property shows that this event is independent of $\mathcal{F}^+(T_b)$, and thus of $\{T_c : c \leq b\}$. Therefore, $\{T_a : a \geq 0\}$ is Markovian with respect to its natural filtration.

To compute the transition density, observe that the strong Markov property implies that $T_{a-b} =_d T_a - T_b$, so Theorem 16.11 gives

$$P(T_a - T_b \leq t) = P(T_{a-b} \leq t) = P\left( \max_{0 \leq s \leq t} B(s) \geq a - b \right) = P(|B(t)| \geq a - b) = \frac{2}{\sqrt{2\pi t}} \int_{a-b}^\infty e^{-\frac{x^2}{2t}} dx = \int_0^t \frac{a-b}{\sqrt{2\pi s^3}} e^{-\frac{(a-b)^2}{2s}} ds$$

where the final equality used the substitution $x = (a-b)\sqrt{\frac{t}{s}}$. $\square$

The process in Theorem 16.14 is called a stable subordinator of index $\frac{1}{2}$.

A subordinator is a real-valued nondecreasing Lévy process - that is, a process with $X(0) = 0$ and stationary independent increments which is continuous in probability: for all $t \geq 0$, $\varepsilon > 0$, $\lim_{h \to 0} P(|X_{t+h} - X_t| > \varepsilon) = 0$. It is stable with index $\alpha$ if the scaling relation $t^{-\frac{1}{\alpha}} X(t) =_d X(1)$ holds for all $t > 0$.

Thus, for example, standard Brownian motion is stable with index 2, but is not a subordinator. Conversely, the Poisson process is a subordinator, but is not stable.

The process $\{T_a : a \geq 0\}$ is nondecreasing and continuous in probability by continuity of Brownian paths. It has stationary independent increments since the transition densities are shift invariant in the sense that $p(a,t,s) = p(a,0,s-t)$ for all $a, s, t \geq 0$. The self-similarity is a consequence of Brownian scaling:

$$T_a = \inf\{t \geq 0 : B_t = a\} = a^2 \inf\{t \geq 0 : B_{a^2 t} = a\} =_d a^2 \inf\{t \geq 0 : a B_t = a\} = a^2 T_1.$$
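A minimal numerical check of the law of $T_a$ (our addition): the computation above gives $P(T_a \leq t) = P(|B(t)| \geq a)$, and since $|B(t)| =_d \sqrt{t}\,|Z|$ with $Z \sim N(0,1)$, this is also $P(a^2/Z^2 \leq t)$, i.e. $T_a =_d a^2/Z^2$.

```python
import numpy as np

# Compare discretized first-passage probabilities with the exact
# description T_a =d a^2 / Z^2, Z ~ N(0,1).
rng = np.random.default_rng(5)
a, n_paths, n_steps, horizon = 1.0, 10_000, 800, 8.0
dt = horizon / n_steps

paths = np.cumsum(rng.normal(0.0, np.sqrt(dt), (n_paths, n_steps)), axis=1)
running_max = np.maximum.accumulate(paths, axis=1)
Z = rng.normal(size=200_000)

for t in (1.0, 2.0, 8.0):
    k = int(t / dt) - 1
    walk = np.mean(running_max[:, k] >= a)    # biased low by discretization
    exact = np.mean(a ** 2 / Z ** 2 <= t)
    print(f"t = {t}: walk ~ {walk:.3f}, exact ~ {exact:.3f}")
```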
17. Martingale Properties
In the previous section, we were able to infer several interesting facts about Brownian motion and related
processes by appealing to the Markov property.
We will now explore Brownian motion as a (continuous time) martingale and use the opportunity to introduce
some standard concepts in the study of stochastic processes.
Definition. A real-valued process $\{X(t) : t \geq 0\}$ is said to be a martingale with respect to a filtration $\{\mathcal{F}(t) : t \geq 0\}$ if it is adapted, integrable, and satisfies

$$E[X(t) \mid \mathcal{F}(s)] = X(s) \quad \text{for all } 0 \leq s \leq t.$$

It is a sub/super-martingale if the equality is replaced with an appropriate inequality.
Example 17.1. Brownian motion is clearly a martingale since for all $0 \leq s \leq t$,

$$E\left[ B(t) \mid \mathcal{F}^+(s) \right] = E\left[ (B(t) - B(s)) + B(s) \mid \mathcal{F}^+(s) \right] = E[B(t) - B(s)] + B(s) = B(s).$$
Before elaborating on the martingale property of Brownian motion, we will take a moment to extend some
familiar results to continuous time.
The basic strategy is to apply discrete time analogs to approximating sequences.
In order to present these results in greater generality, we need a few more definitions.
We say that a filtered probability space $\left( \Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \geq 0}, P \right)$ satisfies the usual conditions if

(1) $\mathcal{F}$ is complete with respect to $P$ - that is, if $A \in \mathcal{F}$ has $P(A) = 0$ and $B \subseteq A$, then $B \in \mathcal{F}$.
(2) $\mathcal{F}_0$ contains all $P$-null sets.
(3) $\{\mathcal{F}_t\}_{t \geq 0}$ is right-continuous.
Given a probability space $(\Omega, \mathcal{F}^o, P^o)$, its completion is the space $(\Omega, \mathcal{F}, P)$ where $\mathcal{F} = \sigma(\mathcal{F}^o \cup \mathcal{N})$ with

$$\mathcal{N} = \{E \subseteq \Omega : E \subseteq F \text{ for some } F \in \mathcal{F}^o \text{ with } P^o(F) = 0\},$$

and $P(E) = \inf\{P^o(F) : E \subseteq F,\ F \in \mathcal{F}^o\}$.

The usual augmentation $\left( \Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \geq 0}, P \right)$ of the filtered space $\left( \Omega, \mathcal{F}^o, \{\mathcal{F}_t^o\}_{t \geq 0}, P^o \right)$ is formed by taking $(\Omega, \mathcal{F}, P)$ to be the completion of $(\Omega, \mathcal{F}^o, P^o)$ and setting $\mathcal{F}_t = \bigcap_{s > t} \sigma(\mathcal{F}_s^o \cup \mathcal{N})$.
All of our results in the previous section carry over without change to the augmented filtration for Brownian motion since adding null sets doesn't make any difference in the arguments.

The reason that we are now concerned with completeness is that many processes of interest do not enjoy the continuity properties of Brownian motion, but we still want to be able to argue by discrete approximation. This approach generally works for càdlàg (right-continuous with left limits) processes, and we will see that under the usual conditions, martingales have càdlàg versions.

(We say that $\{Y_t : t \geq 0\}$ is a version of $\{X_t : t \geq 0\}$ if for all $t \geq 0$, $P(X_t \neq Y_t) = 0$. This is a strictly weaker notion than being indistinguishable: $P(X_t \neq Y_t \text{ for some } t \geq 0) = 0$, though the two coincide when the processes are right-continuous.)
Theorem 17.1 (Doob's inequalities). If X_t is a martingale or nonnegative submartingale with càdlàg sample paths, then

(1) P(sup_{s≤t} |X_s| ≥ λ) ≤ E|X_t|/λ for all λ > 0.

(2) E[sup_{s≤t} |X_s|^p] ≤ (p/(p−1))^p E[|X_t|^p] for all 1 < p < ∞.
Proof. Since the absolute value of a martingale is a nonnegative submartingale, it suffices to consider the case where X_t ≥ 0 is a submartingale.

Let D_n(t) = {kt/2^n : 0 ≤ k ≤ 2^n}, and set Y_k^(n) = X_{kt/2^n}, G_k^(n) = F_{kt/2^n}. Then Y_k^(n) is clearly a discrete time submartingale with respect to G_k^(n). Setting A_n(t) = {sup_{s∈D_n(t)} |X_s| > λ}, it follows from Theorem 4.2 that

P(A_n(t)) = P(max_{0≤k≤2^n} Y_k^(n) > λ) ≤ E[Y_{2^n}^(n)]/λ = E|X_t|/λ.

Since A_1(t) ⊆ A_2(t) ⊆ ··· and X_t is right-continuous, we have ⋃_n A_n(t) = {sup_{s≤t} |X_s| > λ}, and thus

P(sup_{s≤t} |X_s| > λ) = lim_{n→∞} P(A_n(t)) ≤ E|X_t|/λ.

Plugging in λ − ε for λ and taking ε ↘ 0 gives (1).

Similarly, Theorem 4.4 gives

E[max_{0≤k≤2^n} (Y_k^(n))^p] ≤ (p/(p−1))^p E[(Y_{2^n}^(n))^p] = (p/(p−1))^p E[|X_t|^p].

Right-continuity implies that max_{0≤k≤2^n} (Y_k^(n))^p ↗ sup_{s≤t} |X_s|^p as n → ∞, and (2) follows from Fatou's lemma.
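(A numerical illustration, ours rather than the notes': inequality (1) can be checked by Monte Carlo for the nonnegative submartingale |S_k| built from a simple random walk. The sketch assumes numpy is available.)

```python
import numpy as np

# Monte Carlo check of Doob's maximal inequality for X_k = |S_k|, a
# nonnegative submartingale when S_k is a simple random walk:
#     P(max_{k<=n} X_k >= lam) <= E[X_n] / lam.
rng = np.random.default_rng(1)
n, n_paths = 400, 20_000
steps = rng.choice([-1, 1], size=(n_paths, n))
X = np.abs(np.cumsum(steps, axis=1))

EXn = X[:, -1].mean()
for lam in (20, 30, 40):
    print(f"lam={lam}: P(max >= lam) = {(X.max(axis=1) >= lam).mean():.4f}"
          f"  <=  E|S_n|/lam = {EXn/lam:.4f}")
```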
In a similar vein, we can prove continuous time optional stopping theorems, such as
Theorem 17.2. Suppose that {X(t) : t ≥ 0} is a martingale with càdlàg sample paths and S ≤ T are stopping times. If there exists some integrable random variable Y such that |X(t ∧ T)| ≤ Y a.s. for all t ≥ 0, then

E[X(T) | F(S)] = X(S) a.s.
Proof. Fix N ∈ ℕ and define a discrete time martingale Y_n = X(T ∧ n/2^N) and stopping times S′ = ⌊2^N S⌋ + 1, T′ = ⌊2^N T⌋ + 1. The referent filtration is taken to be G_n = F(n/2^N).

{Y_n : n ∈ ℕ} is clearly uniformly integrable, so, writing S_N = 2^{−N}(⌊2^N S⌋ + 1), Theorem 5.2 gives

E[X(T) | F(S_N)] = E[Y_{T′} | G_{S′}] = Y_{S′} = X(T ∧ S_N).

Since S_N ↘ S as N ↗ ∞, dominated convergence and right-continuity show that for any A ∈ F(S),

∫_A X(T) dP = lim_{N→∞} ∫_A E[X(T) | F(S_N)] dP = lim_{N→∞} ∫_A X(T ∧ S_N) dP = ∫_A lim_{N→∞} X(T ∧ S_N) dP = ∫_A X(S) dP.
To wrap up our general discussion of martingales, we prove a continuous time martingale convergence theorem and show that there is no loss in assuming càdlàg paths as long as the usual conditions are satisfied. We will use the notation

D_n = {k/2^n : k ∈ ℕ₀}, D = ⋃_{n∈ℕ₀} D_n.
Theorem 17.3. Suppose that {X_t : t ≥ 0} is a submartingale with respect to some filtration {F_t : t ≥ 0} and that sup_{t≥0} E|X_t| < ∞. Then

(1) lim_{t→∞} X_t exists in ℝ almost surely.

(2) With probability one, X_t has finite left and right limits along D.

Proof. Theorem 17.1 ensures that for all λ > 0,

P(sup_{t∈D_n∩[0,n]} X_t ≥ λ) ≤ E|X_n|/λ,

so the monotone convergence theorem gives

P(sup_{t∈D} X_t ≥ λ) ≤ sup_{t≥0} E|X_t|/λ.

Since λ > 0 is arbitrary, we see that {|X_t| : t ∈ D} is a.s. a bounded set.

Accordingly, the only way for (1) or (2) to fail is if there is some pair of rationals a < b such that the number of upcrossings of [a, b] by {X_t : t ∈ D} is infinite. But Lemma 2.1 shows that if U_n is the number of upcrossings of [a, b] by {X_t : t ∈ D_n ∩ [0, n]}, then

E[U_n] ≤ E|X_n|/(b − a).

Taking n → ∞, Fatou's lemma shows that the number of upcrossings of [a, b] by {X_t : t ∈ D} has finite expectation and thus is finite with full probability. The result follows since there are countably many pairs of rationals.
The reason for taking the seemingly circuitous route through the upcrossing inequality was to establish the
second claim in Theorem 17.3. The utility of this fact lies in
Theorem 17.4. Let {F_t} be a filtration satisfying the usual conditions. If X_t is a martingale with respect
to {Ft }, then there is a version of {Xt : t ≥ 0} which is a martingale and has càdlàg paths.
Proof. Jensen's inequality implies that |X_t| is a submartingale, so E|X_t| ≤ E|X_N| < ∞ for all t ≤ N. Accordingly, Theorem 17.3 shows that for any N ∈ ℕ, X_{t∧N} almost surely has left and right limits along D. Since N is arbitrary, we must have that X_t has left and right limits along D a.s.

Now define

Y_t = lim_{u∈D, u→t+} X_u.

Then Y_t has càdlàg paths by construction. Also, since {F_t} is right-continuous and {X_t} is adapted, Y_t is F_t-measurable for all t ≥ 0.

We now observe that for fixed N ∈ ℕ, {X_t : 0 ≤ t ≤ N} is uniformly integrable: Let ε > 0. Since X_N is integrable, there is a δ > 0 such that P(A) < δ implies E[|X_N| ; A] < ε. (See the proof of Theorem 4.6 for details.) When M > E|X_N|/δ, we have

P(|X_t| ≥ M) ≤ E|X_t|/M ≤ E|X_N|/M < δ for all 0 ≤ t ≤ N

because |X_t| is a submartingale. Since {|X_t| ≥ M} ∈ F_t, the submartingale property implies

E[|X_t| ; |X_t| ≥ M] ≤ E[|X_N| ; |X_t| ≥ M] < ε for all t ∈ [0, N],

hence {X_t : 0 ≤ t ≤ N} is u.i.

Now let t < N. If E ∈ F_t, then the Vitali convergence theorem gives

E[Y_t ; E] = E[lim_{u∈D, u→t+} X_u ; E] = lim_{u∈D, u→t+} E[X_u ; E] = E[X_t ; E].

Since Y_t ∈ F_t, this proves that X_t = Y_t a.s. Because N is arbitrary, we conclude that {Y_t : t ≥ 0} is a version of {X_t : t ≥ 0}, and we are done since this implies that Y_t is a martingale: For s ≤ t, E[Y_t | F_s] = E[X_t | F_s] = X_s = Y_s almost surely.
Now that we are comfortable with martingales in continuous time and have the foregoing results at our
disposal, we can return to Brownian motion. We begin with
Theorem 17.5 (Wald's lemma). Let {B(t) : t ≥ 0} be a standard Brownian motion and let T be a stopping
time with E[T ] < ∞. Then E[B(T )] = 0.
Proof. Define M_k = max_{0≤t≤1} |B(t + k) − B(k)| and M = Σ_{k=0}^{⌊T⌋} M_k. Then independent increments and the fact that M_k is independent of {T ≥ k} = {T < k}^C ∈ F^+(k) yield

E[M] = E[Σ_{k=0}^{⌊T⌋} M_k] = Σ_{k=0}^∞ E[M_k 1{T ≥ k}] = Σ_{k=0}^∞ E[M_k] P(T ≥ k)
= E[M_0](1 + Σ_{k=1}^∞ P(T ≥ k)) ≤ E[M_0](1 + ∫_0^∞ P(T ≥ t) dt) = E[M_0](1 + E[T]).

The reflection principle shows that

E[M_0] = ∫_0^∞ P(max_{0≤t≤1} |B(t)| > x) dx ≤ 1 + ∫_1^∞ 2P(max_{0≤t≤1} B(t) > x) dx
= 1 + 2∫_1^∞ P(|B(1)| > x) dx = 1 + 2∫_1^∞ ∫_x^∞ (2/√(2π)) e^{−s²/2} ds dx
≤ 1 + (4/√(2π)) ∫_1^∞ ∫_x^∞ (s/x) e^{−s²/2} ds dx = 1 + (4/√(2π)) ∫_1^∞ x^{−1} e^{−x²/2} dx < ∞,

so E[M] < ∞. Since sup_{t≥0} |B(t ∧ T)| ≤ M ∈ L¹, Theorem 17.2 gives

E[B(T)] = E[E[B(T) | F^+(0)]] = E[B(0)] = 0.
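(A small simulation aside, ours rather than the notes', assuming numpy: the hypothesis E[T] < ∞ matters. For T equal to the exit time of (−1, 1), which has finite mean, the empirical mean of B(T) is near 0; by contrast, for T equal to the first hitting time of +1, which has infinite mean, B(T) = 1 on every path that hits, so the conclusion E[B(T)] = 0 fails.)

```python
import numpy as np

# Wald's lemma in action: for T = exit time of (-1, 1), E[T] < oo and the
# empirical mean of B(T) is near 0.  For T = first hit of +1, E[T] = oo
# and B(T) = 1 identically, so E[B(T)] = 0 fails there.
rng = np.random.default_rng(2)
dt, n_paths, horizon = 1e-3, 2_000, 20.0
n = int(horizon / dt)

exit_vals = []
for _ in range(n_paths):
    path = np.cumsum(rng.normal(0.0, np.sqrt(dt), n))
    idx = np.nonzero(np.abs(path) >= 1.0)[0]
    if idx.size:                     # virtually every path exits by t = 20
        exit_vals.append(np.sign(path[idx[0]]))
print("E[B(T)] for T = exit of (-1,1):", np.mean(exit_vals))  # ~ 0
```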
In order to compute the second moment of B(T), we need the following results, the second of which gives another concrete example of a continuous martingale.

Lemma 17.1. Let S ≤ T be stopping times with E[T] < ∞. Then

E[B(T)²] = E[B(S)²] + E[(B(T) − B(S))²].
Proof.

E[B(T)²] = E[E[(B(T) − B(S))² + 2B(T)B(S) − B(S)² | F^+(S)]]
= E[E[(B(T) − B(S))² + 2B(S)(B(T) − B(S)) + B(S)² | F^+(S)]]
= E[(B(T) − B(S))²] + 2E[B(S) E[B(T) − B(S) | F^+(S)]] + E[B(S)²].

Since E[T] < ∞ implies E[T − S | F(S)] < ∞ a.s., the strong Markov property at time S in conjunction with Wald's lemma shows that E[B(T) − B(S) | F^+(S)] = 0 a.s.

Proposition 17.1. Let {B(t) : t ≥ 0} be a Brownian motion. Then the process {B(t)² − t : t ≥ 0} is a martingale.
Proof. B(t)² − t is clearly integrable and F^+(t)-measurable for all t ≥ 0, and for any 0 ≤ s ≤ t,

E[B(t)² − t | F^+(s)] = E[(B(t) − B(s))² + 2B(t)B(s) − B(s)² − t | F^+(s)]
= E[(B(t) − B(s))²] + 2B(s) E[B(t) | F^+(s)] − B(s)² − t
= (t − s) + 2B(s)² − B(s)² − t = B(s)² − s.
We can now prove Wald's second lemma.

Theorem 17.6. If T is a stopping time for standard Brownian motion and E[T] < ∞, then E[B(T)²] = E[T].
Proof. Consider the martingale {B(t)² − t : t ≥ 0} and define stopping times T_n = inf{t ≥ 0 : |B(t)| = n}. Then the process {B(t ∧ T ∧ T_n)² − (t ∧ T ∧ T_n) : t ≥ 0} is uniformly dominated by the integrable random variable n² + T, so Theorem 17.2 with S = 0 implies

E[B(T ∧ T_n)² − (T ∧ T_n)] = 0, or E[B(T ∧ T_n)²] = E[T ∧ T_n].

Applying Lemma 17.1 to T ∧ T_n ≤ T shows that E[B(T)²] ≥ E[B(T ∧ T_n)²], hence

E[B(T)²] ≥ lim_{n→∞} E[B(T ∧ T_n)²] = lim_{n→∞} E[T ∧ T_n] = E[T]

by monotone convergence.

On the other hand, Fatou's lemma gives

E[B(T)²] = E[lim inf_{n→∞} B(T ∧ T_n)²] ≤ lim inf_{n→∞} E[B(T ∧ T_n)²] = lim inf_{n→∞} E[T ∧ T_n] = E[T],

and the proof is complete.
A typical application of Wald's lemmas is computing exit times and exit probabilities for Brownian motion.
For example, we have the following gambler's ruin type result.
Theorem 17.7. Let {B(t) : t ≥ 0} be a standard Brownian motion, and let T_A = inf{t ≥ 0 : B(t) ∈ A} denote the hitting time of a Borel set A. If a < 0 < b, then T = T_{{a,b}} satisfies

(1) P(B(T) = a) = b/(|a| + b) and P(B(T) = b) = |a|/(|a| + b).

(2) E[T] = |ab|.
Proof. |B(T ∧ t)| ≤ |a| ∨ b, so it follows from Theorem 17.2 that

0 = E[B(0)] = E[B(T)] = aP(B(T) = a) + bP(B(T) = b)
= aP(B(T) = a) + b[1 − P(B(T) = a)]
= b + (a − b)P(B(T) = a)
= b − (b + |a|)P(B(T) = a),

which gives (1).

To see that T has finite mean, observe that

E[T] = ∫_0^∞ P(T > t) dt ≤ ∫_0^∞ P(T > ⌊t⌋) dt = Σ_{k=0}^∞ P(T > k)
≤ 1 + Σ_{k=1}^∞ P(⋂_{j=1}^k {B(t) ∈ (a, b) for all j − 1 ≤ t < j})
≤ 1 + Σ_{k=1}^∞ P(max_{0≤t≤1} |B(t)| ≤ b − a)^k < ∞.

Therefore, Wald's second lemma implies

E[T] = E[B(T)²] = a²P(B(T) = a) + b²P(B(T) = b)
= |a|²b/(|a| + b) + |a|b²/(|a| + b) = |a|b(|a| + b)/(|a| + b) = |ab|.
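(A Monte Carlo sketch of ours, assuming numpy; the time discretization biases the estimates slightly. For a = −1, b = 2 the theorem predicts P(B(T) = a) = 2/3 and E[T] = 2.)

```python
import numpy as np

# Gambler's ruin for Brownian motion: estimate P(B(T) = a) and E[T] for
# T = exit time of (a, b), and compare with b/(|a|+b) and |ab|.
rng = np.random.default_rng(3)
a, b, dt = -1.0, 2.0, 1e-3
n_paths, n_max = 20_000, 20_000

x = np.zeros(n_paths)
T = np.full(n_paths, np.nan)          # exit time, NaN while still inside
side = np.zeros(n_paths)              # records which endpoint was hit
alive = np.ones(n_paths, dtype=bool)
for k in range(1, n_max + 1):
    x[alive] += rng.normal(0.0, np.sqrt(dt), alive.sum())
    done = alive & ((x <= a) | (x >= b))
    T[done] = k * dt
    side[done] = np.where(x[done] <= a, a, b)
    alive &= ~done

hit = ~np.isnan(T)                    # essentially all paths exit by t = 20
print("P(B(T)=a) ~", (side[hit] == a).mean(), "   exact:", b/(abs(a)+b))
print("E[T]      ~", T[hit].mean(),           "   exact:", abs(a*b))
```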
Proposition 17.1 showed that the process {B(t)² − t : t ≥ 0} is a martingale with respect to F^+(t). Another important example is given by geometric Brownian motion.

Proposition 17.2. Let {B(t) : t ≥ 0} be a standard Brownian motion. Then the process {exp(σB(t) − σ²t/2) : t ≥ 0} is a martingale for all σ > 0.
Proof. For 0 ≤ s ≤ t,

E[exp(σB(t) − σ²t/2) | F^+(s)] = exp(σB(s)) E[exp(σ(B(t) − B(s))) | F^+(s)] exp(−σ²t/2).

Since σ(B(t) − B(s)) ∼ N(0, σ²(t − s)) is independent of F^+(s), the middle term is

∫_{−∞}^∞ e^x (1/√(2πσ²(t−s))) e^{−x²/(2σ²(t−s))} dx = (1/√(2πσ²(t−s))) ∫_{−∞}^∞ exp(−(x² − 2σ²(t−s)x)/(2σ²(t−s))) dx
= (1/√(2πσ²(t−s))) ∫_{−∞}^∞ exp(−((x − σ²(t−s))² − σ⁴(t−s)²)/(2σ²(t−s))) dx = exp(σ²(t−s)/2),

thus

E[exp(σB(t) − σ²t/2) | F^+(s)] = exp(σB(s) + σ²(t−s)/2 − σ²t/2) = exp(σB(s) − σ²s/2).
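(A one-line Monte Carlo check of ours, assuming numpy, that the exponential martingale has constant expectation E[exp(σB(t) − σ²t/2)] = 1:)

```python
import numpy as np

# E[exp(sigma*B(t) - sigma^2 t / 2)] should equal 1 for every t.
rng = np.random.default_rng(4)
sigma, n = 1.5, 1_000_000
for t in (0.5, 1.0, 2.0):
    B = rng.normal(0.0, np.sqrt(t), n)
    print(f"t = {t}: empirical mean = "
          f"{np.exp(sigma*B - sigma**2*t/2).mean():.4f}")
```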
The derived martingales in Propositions 17.1 and 17.2 are of the form f_k(B(t), t) with f_1(x, t) = x² − t and f_2(x, t) = exp(σx − σ²t/2). Note that both f_1 and f_2 solve the backward equation

∂u/∂t = −(1/2) ∂²u/∂x².

The following formal derivation shows that under relatively mild conditions, if f is sufficiently regular and solves the backward equation, then {f(B(t), t) : t ≥ 0} is a martingale.

First observe that the transition density p_t(x, y) = (1/√(2πt)) exp(−(y−x)²/(2t)) satisfies the forward equation

∂u/∂t = (1/2) ∂²u/∂y².

(This is the heat equation with diffusivity 1/2.) Indeed,

∂/∂t [(1/√(2πt)) exp(−(y−x)²/(2t))] = (1/√(2πt)) exp(−(y−x)²/(2t)) (−1/(2t) + (y−x)²/(2t²))

and

∂²/∂y² [(1/√(2πt)) exp(−(y−x)²/(2t))] = (1/√(2πt)) ∂/∂y [−((y−x)/t) exp(−(y−x)²/(2t))]
= (1/√(2πt)) exp(−(y−x)²/(2t)) (−1/t + (y−x)²/t²).
If one can justify differentiating under the integral, then

∂/∂t E_x[f(B(t), t)] = ∂/∂t ∫ f(y, t) p_t(x, y) dy = ∫ (∂/∂t f(y, t)) p_t(x, y) dy + ∫ f(y, t) (∂/∂t p_t(x, y)) dy.

If the following integration by parts steps are valid, then the latter integral is

∫ f(y, t) (∂/∂t p_t(x, y)) dy = (1/2) ∫ f(y, t) (∂²/∂y² p_t(x, y)) dy
= −(1/2) ∫ (∂/∂y f(y, t)) (∂/∂y p_t(x, y)) dy
= (1/2) ∫ (∂²/∂y² f(y, t)) p_t(x, y) dy.
Putting these equations together gives

∂/∂t E_x[f(B(t), t)] = ∫ (∂/∂t f(y, t)) p_t(x, y) dy + ∫ f(y, t) (∂/∂t p_t(x, y)) dy
= ∫ (∂/∂t f(y, t)) p_t(x, y) dy + (1/2) ∫ (∂²/∂y² f(y, t)) p_t(x, y) dy
= ∫ (∂/∂t f(y, t) + (1/2) ∂²/∂y² f(y, t)) p_t(x, y) dy = 0

since f satisfies the backward equation. Finally, the Markov property shows that

E[f(B(t), t) | F^+(s)] = E_{B(s)}[f(B(t), t) − f(B(s), s)] + f(B(s), s) = f(B(s), s)

because E_x[f(B(t), t)] is constant in t.

An important example of the above heuristic is f(x, t) = g(x) with Δg = 0. That is, under mild integrability assumptions, harmonic functions of Brownian motion are martingales.
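(The displayed computations can also be verified symbolically. Here is a short check of ours, assuming sympy is available, that p_t solves the forward equation and that f_1, f_2 solve the backward equation:)

```python
import sympy as sp

x, y = sp.symbols("x y", real=True)
t, sigma = sp.symbols("t sigma", positive=True)

# transition density of Brownian motion
p = sp.exp(-(y - x)**2 / (2*t)) / sp.sqrt(2*sp.pi*t)
print(sp.simplify(sp.diff(p, t) - sp.diff(p, y, 2)/2))      # 0: forward eqn

# the martingale-generating functions from Propositions 17.1 and 17.2
f1 = x**2 - t
f2 = sp.exp(sigma*x - sigma**2*t/2)
for f in (f1, f2):
    print(sp.simplify(sp.diff(f, t) + sp.diff(f, x, 2)/2))  # 0: backward eqn
```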
18. Donsker's Theorem
Let X_1, X_2, ... be i.i.d. with E[X_1] = 0 and Var(X_1) = 1, and set S_n = Σ_{i=1}^n X_i. We can extend the random walk S_n from ℕ₀ to [0, ∞) by linear interpolation: S(t) = S_⌊t⌋ + (t − ⌊t⌋)(S_{⌊t⌋+1} − S_⌊t⌋). Scaling diffusively gives a sequence of continuous functions S^n(t) = S(nt)/√n.
The CLT implies that the law of S^n(1) converges weakly to N(0, 1) = L(B(1)) where {B(t)}_{t∈[0,1]} is a standard Brownian motion. We will prove the stronger statement that the process {S^n(t)}_{t∈[0,1]} converges in distribution to {B(t)}_{t∈[0,1]}. Thus we can think of Brownian motion as the scaling limit of random walk.

Here we are viewing the processes as random variables taking values in C[0,1] (the space of continuous functions from [0,1] to ℝ endowed with the uniform metric), so weak convergence means that for every bounded and continuous functional Λ : C[0,1] → ℝ, lim_{n→∞} E[Λ(S^n)] = E[Λ(B)].

One approach to proving this functional CLT begins by defining {S̃^n(t) : t ≥ 0} by S̃^n(t) = (1/√n) Σ_{k=1}^{⌊nt⌋} X_k.
Since Cov(S̃^n(s), S̃^n(t)) = (⌊ns⌋ ∧ ⌊nt⌋)/n → s ∧ t for all s, t ≥ 0, the multivariate CLT shows that for any t_1, ..., t_m ≥ 0, (S̃^n(t_1), ..., S̃^n(t_m)) ⇒ (B(t_1), ..., B(t_m)) as n → ∞.

Linear interpolation yields the continuous process S^n(t) = S̃^n(t) + ((nt − ⌊nt⌋)/√n) X_{⌊nt⌋+1}, and we have that

P(|S^n(t) − S̃^n(t)| > ε) = P(X²_{⌊nt⌋+1} > ε²n/(nt − ⌊nt⌋)²) ≤ ((nt − ⌊nt⌋)²/(nε²)) E[X²_{⌊nt⌋+1}] ≤ 1/(nε²) → 0,

so Slutsky's theorem shows that the finite dimensional distributions of {S^n(t)}_{t≥0} converge weakly to those of Brownian motion. One can then deduce weak convergence of the processes by demonstrating tightness. Making this argument rigorous involves more analysis than probability, so we'll pursue an alternative approach based on Skorokhod embedding.
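(A quick empirical check of ours, assuming numpy, of the covariance computation Cov(S̃^n(s), S̃^n(t)) → s ∧ t, with Rademacher increments and s = 0.3, t = 0.7:)

```python
import numpy as np

# Cov(S~n(s), S~n(t)) = (floor(ns) ^ floor(nt)) / n, which tends to s ^ t.
rng = np.random.default_rng(5)
n, n_paths, s, t = 1_000, 100_000, 0.3, 0.7
first = (2*rng.integers(0, 2, (n_paths, int(n*s)), dtype=np.int8) - 1).sum(axis=1)
extra = (2*rng.integers(0, 2, (n_paths, int(n*t) - int(n*s)), dtype=np.int8) - 1).sum(axis=1)
Ss, St = first/np.sqrt(n), (first + extra)/np.sqrt(n)
print("empirical Cov =", np.cov(Ss, St)[0, 1], "   s ^ t =", min(s, t))
```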
Theorem 18.1 (Skorokhod's embedding theorem). If X is a random variable with mean zero and finite variance, then there is a stopping time T for {F^+(t)}_{t≥0} such that B(T) =_d X and E[T] = E[X²].
Note that if X is supported on {a, b} with a < 0 < b, then the mean zero condition implies that

P(X = a) = b/(b − a), P(X = b) = −a/(b − a).

Letting T = T_{a,b} = inf{t : B(t) ∉ (a, b)}, Theorem 17.7 shows that B(T) =_d X and E[T] = −ab = E[X²].

We can use this observation to prove Theorem 18.1 by considering an appropriate martingale.
Definition. A martingale {X_n}_{n=0}^∞ is binary splitting if whenever x_0, ..., x_n ∈ ℝ is such that the event

A(x_0, ..., x_n) = {X_0 = x_0, ..., X_n = x_n}

has positive probability, then conditional on A(x_0, ..., x_n), X_{n+1} is supported on at most two values.
Lemma 18.1. If X is a random variable on (Ω, F, P) with E[X²] < ∞, then there is a binary splitting martingale {X_n} with X_n → X a.s. and in L².

Proof. Set G_0 = {∅, Ω}, X_0 = E[X], ξ_0 = 1 if X ≥ X_0 and −1 if X < X_0, and define G_n, X_n, ξ_n recursively by

G_n = σ(ξ_0, ..., ξ_{n−1}), X_n = E[X | G_n], ξ_n = 1 if X ≥ X_n and −1 if X < X_n.

Then G_n is generated by a partition P_n of Ω into 2^n events, each of which can be expressed as A(x_0, ..., x_n). As each element in P_n is a union of two events in P_{n+1}, we see that {X_n} is binary splitting.

Also, Lévy's Forward Theorem gives X_n → E[X | G_∞] a.s. where G_∞ = σ(⋃_n G_n). Since X is square-integrable and E[X_n²] = E[E[X | G_n]²] ≤ E[E[X² | G_n]] = E[X²], it follows from Theorem 4.5 that X_n converges to X_∞ = E[X | G_∞] in L² as well.

We'll be done upon showing that X_∞ = X a.s. To see that this is so, we observe that

lim_{n→∞} ξ_n(X − X_n) = |X − X_∞| a.s.

Indeed, if X(ω) = X_∞(ω), then this is clearly true. If X(ω) > X_∞(ω), then X(ω) ≥ X_n(ω) for all sufficiently large n, so ξ_n(ω) = 1 eventually and the equality holds. Similarly, if X(ω) < X_∞(ω), then ξ_n(ω) = −1 eventually.

Since ξ_n ∈ G_n, we see that E[ξ_n(X − X_n)] = E[ξ_n E[X − X_n | G_n]] = 0 for all n. As |ξ_n(X − X_n)| ≤ |X_n| + |X| ≤ |X_∞| + 1 + |X| for large n, the DCT implies E|X − X_∞| = 0.
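(To see the construction concretely, here is a small sketch of ours, assuming numpy, that runs the recursion for an X with four-point support. Each cell below is an atom of G_n, X_n is the conditional mean on the cell, and E[X_n²] increases to E[X²], reflecting the L² convergence.)

```python
import numpy as np

# Binary splitting martingale for a finitely supported X: a cell (values,
# probs) is an atom of G_n; it splits into {X >= X_n} and {X < X_n}, so
# conditional on the history X_{n+1} takes at most two values.
vals = np.array([-2.0, -0.5, 1.0, 3.0])
probs = np.array([0.2, 0.3, 0.4, 0.1])

cells = [(vals, probs)]
for n in range(4):
    EXn2 = sum(((v @ p) / p.sum())**2 * p.sum() for v, p in cells)
    print(f"n={n}: E[X_n^2] = {EXn2:.4f}")
    new_cells = []
    for v, p in cells:
        m = (v @ p) / p.sum()     # X_n on this cell
        hi = v >= m
        for mask in (hi, ~hi):
            if mask.any():
                new_cells.append((v[mask], p[mask]))
    cells = new_cells
print("E[X^2]  =", vals**2 @ probs)  # the limit, reached once cells are atoms
```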
Skorokhod's embedding theorem is a simple consequence of the above observations.

Proof of Theorem 18.1. Let {X_n} be a binary splitting martingale which converges to X a.s. and in L². We know that if X is supported on {a, b} for a < 0 < b, then T_{a,b} = inf{t ≥ 0 : B(t) ∉ (a, b)} gives the requisite stopping time. Since, conditional on A(x_0, ..., x_{n−1}), X_n is supported on two such values, we can find T_0 ≤ T_1 ≤ ··· such that B(T_n) =_d X_n and E[T_n] = E[X_n²].

As T_n is increasing, there is some stopping time T with T_n → T a.s. and

E[T] = lim_{n→∞} E[T_n] = lim_{n→∞} E[X_n²] = E[X²]

by monotone convergence.

Finally, B(T_n) converges in distribution to X by construction, and B(T_n) converges a.s. to B(T) by continuity of Brownian paths, so we have B(T) =_d X as required.
Now that we have Skorokhod embedding at our disposal, we are ready to start proving

Theorem 18.2 (Donsker's invariance principle). On the space C[0,1], the sequence {S^n} converges in distribution to a standard Brownian motion.

The basic idea of the proof is to construct a sequence of random variables X_1, X_2, ... on the same probability space as the Brownian motion such that {S^n} is close to a scaling of the Brownian motion with high probability.
Lemma 18.2. Suppose that {B(t) : t ≥ 0} is a standard Brownian motion. Then for any random variable X with mean zero and variance one, there exists a sequence of stopping times 0 = T_0 ≤ T_1 ≤ T_2 ≤ ··· with respect to {F^+(t)} such that

(1) The sequence of random variables {B(T_n)}_{n=0}^∞ has the distribution of the random walk whose increments are distributed as X.

(2) For all ε > 0, the sequence of functions {S^n}_{n=0}^∞ constructed from this random walk satisfies

lim_{n→∞} P(sup_{0≤t≤1} |B(nt)/√n − S^n(t)| > ε) = 0.
Proof. By Skorokhod embedding, we can find a stopping time T_1 for {F^+(t)} with E[T_1] = E[X²] = 1 and B(T_1) =_d X. The strong Markov property shows that {B(T_1 + t) − B(T_1) : t ≥ 0} is a standard Brownian motion which is independent of F^+(T_1) and thus of (T_1, B(T_1)). Accordingly, we can find T_2′ with E[T_2′] = 1 and B(T_1 + T_2′) − B(T_1) =_d X.

Setting T_2 = T_1 + T_2′, we see that E[T_2] = E[T_1] + E[T_2′] = 2 and B(T_2) = (B(T_1 + T_2′) − B(T_1)) + B(T_1), which is distributed as the second step in a random walk with increment distribution L(X).

Continuing inductively gives a sequence T_0 ≤ T_1 ≤ T_2 ≤ ··· where E[T_n] = n and S_n = B(T_n) is the embedded random walk.
To prove the second claim, write W_n(t) = B(nt)/√n and define A_n = {|W_n(t) − S^n(t)| > ε for some t ∈ [0,1]}. We must show that P(A_n) → 0.

Let k = k(t) be the unique integer with (k−1)/n ≤ t < k/n. Since S^n is linear on [(k−1)/n, k/n], we have

A_n ⊆ {|W_n(t) − S_{k−1}/√n| > ε for some t ∈ [0,1]} ∪ {|W_n(t) − S_k/√n| > ε for some t ∈ [0,1]}.

Using S_k = B(T_k) = √n W_n(T_k/n), we see that A_n is contained in

A_n* = {|W_n(t) − W_n(T_k/n)| > ε for some t ∈ [0,1]} ∪ {|W_n(t) − W_n(T_{k−1}/n)| > ε for some t ∈ [0,1]}.

Now for any fixed 0 < δ < 1, A_n* is contained in

{|W_n(t) − W_n(s)| > ε for some s, t ∈ [0,2] with |s − t| < δ} ∪ {|T_k/n − t| ∨ |T_{k−1}/n − t| ≥ δ for some t ∈ [0,1]}.

Since W_n(t) = B(nt)/√n =_d B(t), the probability of the first of these events is independent of n, and sample path continuity ensures that we can choose δ small enough to make this probability as small as we wish.

Thus we will be done upon showing that for any 0 < δ < 1,

P(|T_{k(t)}/n − t| ∨ |T_{k(t)−1}/n − t| ≥ δ for some t ∈ [0,1]) → 0.

To prove this, we note that the strong law implies

lim_{n→∞} T_n/n = lim_{n→∞} (1/n) Σ_{k=1}^n (T_k − T_{k−1}) = 1 a.s.

because the random variables T_k′ = T_k − T_{k−1} are i.i.d. with mean 1.
Now observe that any sequence of reals with a_n/n → 1 must satisfy sup_{0≤k≤n} |a_k − k|/n → 0. (Given ε > 0, we can choose N so that |a_n/n − 1| < ε for n ≥ N, and then choose M > N so that max_{0≤k≤N} |a_k − k|/n < ε for n ≥ M. Thus when n ≥ M, we have

max_{0≤k≤n} |a_k − k|/n = max_{0≤k≤N} |a_k − k|/n ∨ max_{N<k≤n} |a_k − k|/n < ε ∨ max_{N<k≤n} (k/n)|a_k/k − 1| < ε.)

Accordingly, we see that

lim_{n→∞} P(sup_{0≤k≤n} |T_k − k|/n ≥ δ) = 0.

Because (k(t)−1)/n ≤ t < k(t)/n, whenever n > 2/δ we have

P(|T_{k(t)}/n − t| ∨ |T_{k(t)−1}/n − t| ≥ δ for some t ∈ [0,1])
≤ P(sup_{1≤k≤n} |T_k − (k−1)|/n ∨ |T_{k−1} − k|/n ≥ δ)
≤ P(sup_{1≤k≤n} |T_k − k|/n ≥ δ/2) + P(sup_{1≤k≤n} |T_{k−1} − (k−1)|/n ≥ δ/2) → 0.
We are now able to give the

Proof of Theorem 18.2. Choose T_0 ≤ T_1 ≤ ··· as in Lemma 18.2 and note that Proposition 14.3 shows that the random functions W_n ∈ C[0,1] given by W_n(t) = B(nt)/√n are standard Brownian motions.

For any closed set K ⊆ C[0,1], if we let

K[ε] = {f ∈ C[0,1] : ‖f − g‖ ≤ ε for some g ∈ K}

denote its ε-neighborhood, then it is clear that

P(S^n ∈ K) ≤ P(W_n ∈ K[ε]) + P(‖S^n − W_n‖ > ε) = P(B ∈ K[ε]) + P(‖S^n − W_n‖ > ε) → P(B ∈ K[ε])

where B is a standard Brownian motion. Since K is closed,

lim_{ε→0} P(B ∈ K[ε]) = P(B ∈ ⋂_{n∈ℕ} K[1/n]) = P(B ∈ K).

Therefore, lim sup_{n→∞} P(S^n ∈ K) ≤ P(B ∈ K), showing that S^n ⇒ B by the Portmanteau theorem.
One of the main uses of Donsker's theorem and the Skorokhod embedding theorem is to translate results
about random walks into results about Brownian motions and conversely. Standard examples of this sort of
reasoning are given by the arcsine laws and the law of the iterated logarithm.
For instance, the law of the iterated logarithm is easier to prove in the continuous setting of Brownian motion, and one derives the corresponding statement for random walk by embedding S_n in B.
Similarly, it is straightforward to prove the arcsine law for the last zero of Brownian motion, and then one
can invoke Donsker's theorem to get the analogous statement for random walk. Conversely, combinatorial
considerations make it relatively easy to prove an arcsine law for the time simple random walk spends above
the
x-axis,
and one can use Donsker to get the statement about positive times of Brownian motion.
Moreover, Donsker's theorem is an invariance principle, meaning that the statement does not depend on the
particulars of the increment distribution. Thus one can prove a result for simple random walk, use Donsker
to establish its analogue for Brownian motion, and then convert it back to a result about random walk with
arbitrary mean 0, variance 1 increments.
The following two examples are among the simplest illustrations of this line of reasoning.
Example 18.1. Let X_1, X_2, ... be i.i.d. with E[X_1] = 0 and E[X_1²] = 1, and set S_n = Σ_{k=1}^n X_k. Define S^n(t) = S(nt)/√n with S(t) = S_⌊t⌋ + (t − ⌊t⌋)(S_{⌊t⌋+1} − S_⌊t⌋) as before.

Let h : ℝ → ℝ be bounded and continuous, and define Λ_h : C[0,1] → ℝ by Λ_h(f) = h(f(1)). Then sup_{f∈C[0,1]} |Λ_h(f)| ≤ ‖h‖_∞, and if f_n → f in C[0,1], then Λ_h(f_n) = h(f_n(1)) → h(f(1)) = Λ_h(f), hence Λ_h is a bounded, continuous functional.

It follows from Theorem 18.2 that

E[h(S_n/√n)] = E[h(S^n(1))] = E[Λ_h(S^n)] → E[Λ_h(B)] = E[h(B(1))].
Since h ∈ C_b(ℝ) was arbitrary and B(1) ∼ N(0,1), we have just proved the CLT!

Example 18.2. In the setting of the preceding example, let M_n = max_{0≤k≤n} S_k. Suppose that h : ℝ → ℝ is bounded and continuous and define Γ_h : C[0,1] → ℝ by Γ_h(f) = h(max_{0≤x≤1} f(x)). As before, it is clear that Γ_h is continuous and bounded. Also, since S(t) is the linear interpolation of S_n, we have

E[Γ_h(S^n)] = E[h(max_{0≤x≤1} S(nx)/√n)] = E[h(max_{0≤k≤n} S_k/√n)] = E[h(M_n/√n)].

Since E[Γ_h(B)] = E[h(max_{0≤t≤1} B(t))] and max_{0≤t≤1} B(t) =_d |B(1)| by Theorem 16.11, Donsker's theorem implies

lim_{n→∞} E[h(M_n/√n)] = lim_{n→∞} E[Γ_h(S^n)] = E[Γ_h(B)] = E[h(|B(1)|)].

Thus the definition of weak convergence in terms of distribution functions shows that

lim_{n→∞} P(M_n > x√n) = P(|B(1)| > x) = √(2/π) ∫_x^∞ e^{−t²/2} dt.
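(As a closing numerical illustration, ours rather than the notes', assuming numpy: the limit in Example 18.2 can be checked by direct simulation of simple random walk, since P(|B(1)| > x) = erfc(x/√2).)

```python
import numpy as np
from math import erfc, sqrt

# Empirical check that P(M_n > x sqrt(n)) approaches P(|B(1)| > x)
# for simple random walk.
rng = np.random.default_rng(6)
n, n_paths, x = 1_000, 10_000, 1.0
steps = 2*rng.integers(0, 2, (n_paths, n), dtype=np.int8) - 1
M = np.cumsum(steps, axis=1).max(axis=1)       # running maximum M_n
print("empirical P(M_n > x sqrt(n)) =", (M > x*np.sqrt(n)).mean())
print("limit P(|B(1)| > x)          =", erfc(x / sqrt(2)))
```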