Weak convergence in Probability Theory
A summer excursion!
Day 1
Armand M. Makowski
ECE & ISR/HyNet
University of Maryland at College Park
[email protected]
BCAM June 2013
Day 1: Basic definitions of convergence for random
variables will be reviewed, together with criteria and
counter-examples.
Day 2: Skorokhod’s Theorem and coupling – Examples in queueing
theory, in the theory of Markov chains and time series analysis.
Day 3: Poisson convergence: The Stein-Chen method with
applications to problems in the theory of random graphs.
Day 4: Weak convergence in function spaces – Prohorov’s Theorem
and sequential compactness
Day 5: An illustration: From random walks to Brownian motion
A very short bibliography
A. D. Barbour and L. Holst, “Some applications of the Stein-Chen
method for proving Poisson convergence,” Advances in Applied
Probability 21 (1989), pp. 74-90.
A. D. Barbour, L. Holst and S. Janson, Poisson Approximation,
Oxford Studies in Probability 2, Oxford University Press, Oxford
(UK), 1992.
P. Billingsley, Convergence of Probability Measures, John Wiley &
Sons, New York (NY), 1968.
P. Billingsley, Probability and Measure, Third Edition, Wiley
Series in Probability and Statistics, John Wiley & Sons, New York
(NY), 1995.
CONVERGENCE IN R^d:
BASIC FACTS
A sequence a : N_0 → R, often described as {a_n, n = 1, 2, . . .},
converges to some a in R if for every ε > 0, there exists n⋆(ε) such
that
    |a_n − a| ≤ ε,    n ≥ n⋆(ε)
We write
    lim_{n→∞} a_n = a    or    a_n → a
This definition contains two basic questions:
• Existence – Does it converge?
• Value – Find the limiting value!
What happens if a = ±∞?
Existence
Two basic ideas
Every monotone sequence converges (possibly to ±∞)!
Bolzano-Weierstrass Theorem: Every bounded sequence contains
at least one convergent subsequence!
Liminf/Limsup
Given a sequence a : N_0 → R, define
    lim sup_{n→∞} a_n = inf_{n≥1} ( sup_{m≥n} a_m )
and
    lim inf_{n→∞} a_n = sup_{n≥1} ( inf_{m≥n} a_m )

    lim sup_{n→∞} a_n = Largest accumulation point of the sequence
and
    lim inf_{n→∞} a_n = Smallest accumulation point of the sequence
The tail infima increase and the tail suprema decrease, with
    inf_{m≥n} a_m ↑ lim inf_{n→∞} a_n    and    sup_{m≥n} a_m ↓ lim sup_{n→∞} a_n
and always
    lim inf_{n→∞} a_n ≤ lim sup_{n→∞} a_n
Fact: The sequence a : N_0 → R converges if and only if
    lim inf_{n→∞} a_n = lim sup_{n→∞} a_n ≡ lim_{n→∞} a_n
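A small numerical illustration (an addition to these notes, not part of the original slides): for the non-convergent sequence a_n = (−1)^n (1 + 1/n), the tail suprema and infima computed below approach lim sup a_n = 1 and lim inf a_n = −1; numpy is assumed available.

```python
# Tail suprema/infima of a_n = (-1)^n (1 + 1/n): the sequence oscillates,
# so lim sup = 1 and lim inf = -1 while the sequence itself has no limit.
import numpy as np

n = np.arange(1, 10_001)
a = (-1.0) ** n * (1.0 + 1.0 / n)

tail_sup = np.maximum.accumulate(a[::-1])[::-1]  # sup_{m >= n} a_m
tail_inf = np.minimum.accumulate(a[::-1])[::-1]  # inf_{m >= n} a_m

print(tail_sup[0], tail_inf[0])        # 1.5 and -2.0: the first tail sup/inf
print(tail_sup[5000], tail_inf[5000])  # close to 1 and -1
```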
A sequence a : N_0 → R is said to be Cauchy if for every ε > 0,
there exists n⋆(ε) such that
    |a_n − a_m| ≤ ε,    m, n ≥ n⋆(ε)
Fact: A sequence a : N_0 → R converges if and only if it is Cauchy –
R is complete under its usual metric
MODES OF CONVERGENCE
FOR RANDOM VARIABLES
Random variables
Given a probability triple (Ω, F, P), a d-dimensional random
variable (rv) is a measurable mapping X : Ω → R^d, i.e.,
    X^{-1}(B) = {ω ∈ Ω : X(ω) ∈ B} ∈ F,    B ∈ B(R^d)
Two viewpoints
• A mapping
• A probability distribution function (i.e., the measure induced on B(R^d))
    F : R^d → [0, 1] : x → F(x) ≡ P[X ≤ x]
Several modes of convergence with many subtleties!
An obvious definition . . .
Consider a collection {X; X_n, n = 1, 2, . . .} of R^d-valued rvs all
defined on the same probability triple (Ω, F, P). Then, we could
say convergence takes place to X if
    lim_{n→∞} X_n(ω) = X(ω),    ω ∈ Ω
Why not?
• Too strong
• Modeling information: Often only the corresponding
  probability distributions {F_n, n = 1, 2, . . .} are available
Four basic modes of convergence
• Convergence in distribution (in law) – Weak convergence
• Convergence in the r-th mean (r ≥ 1)
• Convergence in probability
• Convergence with probability one (w.p. 1)
Requirements
• Consistency with usual convergence for deterministic
sequences
• Subsequence principle
A programme
• Easy-to-use criteria
• Relationships between modes of convergence
• Impact of (continuous) transformations
• Key limit theorems of Probability Theory
With r ≥ 1,
    ‖x‖_r = ( Σ_{k=1}^d |x_k|^r )^{1/r},    x = (x_1, . . . , x_d) ∈ R^d
Convergence with probability one
Consider a collection {X; X_n, n = 1, 2, . . .} of R^d-valued rvs all
defined on the same probability triple (Ω, F, P). We say that the
sequence {X_n, n = 1, 2, . . .} converges almost surely (a.s.) (or
with probability one (w.p. 1)) to the rv X if
    P[ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)] = 1
We write
    lim_{n→∞} X_n = X    a.s.
Convergence in probability
Consider a collection {X; X_n, n = 1, 2, . . .} of R^d-valued rvs all
defined on the same probability triple (Ω, F, P). We say that the
sequence {X_n, n = 1, 2, . . .} converges in probability to the rv X
if for every ε > 0,
    lim_{n→∞} P[‖X_n − X‖_2 > ε] = 0.
This is often written
    X_n →_P X
For d = 1,
    lim_{n→∞} P[|X_n − X| > ε] = 0.
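As a quick illustration (added here, not in the original slides), the following Monte Carlo sketch estimates P[|X_n − X| > ε] for X_n = X + Z_n/n with X and Z_n independent standard normals; the probability visibly vanishes as n grows. The distributions and sample size are illustrative choices, and numpy is assumed available.

```python
# Monte Carlo estimate of P[|X_n - X| > eps] for X_n = X + Z_n / n,
# with X, Z_n independent N(0,1); the estimate shrinks toward 0 as n grows.
import numpy as np

rng = np.random.default_rng(0)
eps, samples = 0.1, 100_000
X = rng.standard_normal(samples)

for n in (1, 10, 100, 1000):
    Xn = X + rng.standard_normal(samples) / n
    print(n, np.mean(np.abs(Xn - X) > eps))
```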
Convergence in the r-th mean (r ≥ 1)
Consider a collection {X; X_n, n = 1, 2, . . .} of R^d-valued rvs all
defined on the same probability triple (Ω, F, P). We say that the
sequence {X_n, n = 1, 2, . . .} converges in the r-th mean to the rv X
if
    E[(‖X_n‖_r)^r] < ∞,    n = 1, 2, . . .    and    E[(‖X‖_r)^r] < ∞
and
    lim_{n→∞} E[(‖X_n − X‖_r)^r] = 0.
This is often written
    X_n →_r X
Convergence in distribution
Also known as distributional convergence, convergence in law and
weak convergence. Multiple equivalent definitions are available.
A sequence of probability distribution functions {F_n, n = 1, 2, . . .}
on R^d converges weakly to the probability distribution function F
on R^d, written F_n ⇒_n F, if
    lim_{n→∞} F_n(x) = F(x),    x ∈ C_F
where C_F denotes the continuity set of F, i.e.,
    C_F := {x ∈ R^d : x is a point of continuity of F}
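A simulation sketch (an addition, not from the slides): here F_n is the distribution function of the standardized mean of n Uniform(0,1) rvs, estimated by Monte Carlo and compared with the standard normal distribution function at a few continuity points; numpy is assumed available and the sample size is an arbitrary choice.

```python
# Estimate F_n(x) = P[X_n <= x] for X_n = sqrt(12 n) (mean of n U(0,1) - 1/2)
# and compare with the standard normal cdf: F_n(x) -> F(x) at every x.
import math
import numpy as np

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = np.random.default_rng(1)
samples = 50_000
for n in (2, 10, 100):
    U = rng.random((samples, n))
    Xn = np.sqrt(12 * n) * (U.mean(axis=1) - 0.5)
    print(n, [round(float(np.mean(Xn <= x)), 3) for x in (-1.0, 0.0, 1.0)],
          [round(normal_cdf(x), 3) for x in (-1.0, 0.0, 1.0)])
```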
The definition is sometimes given in the following form and setting:
Consider R^d-valued rvs {X, X_n, n = 1, 2, . . .} where for each
n = 1, 2, . . ., the rv X_n is defined on some probability triple
(Ω_n, F_n, P_n) and the rv X is defined on some probability triple
(Ω, F, P) (the probability triples may or may not be distinct!).
The sequence of rvs {X_n, n = 1, 2, . . .} converges in distribution
to the rv X if F_n ⇒_n F where
    F_n(x) = P_n[X_n ≤ x]    and    F(x) = P[X ≤ x],    x ∈ R^d, n = 1, 2, . . .
In that case we write X_n ⇒_n X.
Why this definition?
Consider the two sequences
    X_n = 1/n    and    Y_n = −1/n,    n = 1, 2, . . .
By the consistency requirement, one should have X_n ⇒_n 0 and
Y_n ⇒_n 0. BUT,
    lim_{n→∞} F_{X_n}(x) = lim_{n→∞} F_{Y_n}(x) = 0 if x < 0,    and = 1 if x > 0
with lim_{n→∞} F_{X_n}(0) = 0 and lim_{n→∞} F_{Y_n}(0) = 1.
The two pointwise limits disagree only at x = 0, which is precisely the
discontinuity point of the distribution function of the constant 0; hence
the restriction to the continuity set C_F in the definition.
“The limit of a distribution is not always a distribution”
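A tiny sketch (added for illustration) that evaluates the two distribution functions near 0 and shows that, in the limit, they only disagree at x = 0.

```python
# Distribution functions of X_n = 1/n and Y_n = -1/n evaluated near 0:
# the pointwise limits agree for x != 0 but differ at x = 0.
def F_Xn(x, n):  # P[1/n <= x]
    return 1.0 if x >= 1.0 / n else 0.0

def F_Yn(x, n):  # P[-1/n <= x]
    return 1.0 if x >= -1.0 / n else 0.0

for n in (1, 10, 1000):
    print(n, [F_Xn(x, n) for x in (-0.05, 0.0, 0.05)],
          [F_Yn(x, n) for x in (-0.05, 0.0, 0.05)])
```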
RELATIONSHIPS BETWEEN
MODES OF CONVERGENCE
Fact: Almost sure convergence implies convergence in probability
With ε > 0,
    [X_n converges to X] ⊆ ∪_{n=1}^∞ B_n(ε)
with monotone increasing events
    B_n(ε) ≡ ∩_{m=n}^∞ [|X_m − X| ≤ ε],    n = 1, 2, . . .
Therefore,
    P[X_n converges to X] ≤ lim_{n→∞} P[B_n(ε)]
by monotonicity!
If P[X_n converges to X] = 1, then lim_{n→∞} P[B_n(ε)] = 1 becomes
    0 = lim_{n→∞} P[B_n(ε)^c] ≥ lim_{n→∞} P[∪_{m=n}^∞ [|X_m − X| > ε]]
by complementarity, whence
    lim_{n→∞} P[|X_n − X| > ε] = 0
Converse is not true, as seen through standard counterexamples (e.g., the
sliding-indicator rvs on the unit interval with Lebesgue measure, which
converge to 0 in probability but equal 1 infinitely often at every ω).
Partial converse – If the sequence {X_n, n = 1, 2, . . .} converges
in probability to the rv X, then there exists a sequence
ν : N_0 → N_0 with
    ν_k < ν_{k+1},    k = 1, 2, . . .
(whence lim_{k→∞} ν_k = ∞) such that
    lim_{k→∞} X_{ν_k} = X    a.s.
Thus, any sequence convergent in probability contains a
deterministic subsequence which converges a.s. (to the same limit).
Fact: Convergence in the r-th mean implies convergence in probability
By Markov’s inequality,
    P[|X_n − X| > ε] = P[|X_n − X|^r > ε^r] ≤ ε^{−r} E[|X_n − X|^r],    r > 0, ε > 0, n = 1, 2, . . .
Converse is not true without additional conditions, e.g., with α > 0,
    X_n = 0 with probability 1 − n^{−α}    and    X_n = n with probability n^{−α}
Here X_n →_P 0 since P[|X_n| > ε] = n^{−α} → 0, yet E[|X_n|^r] = n^{r−α},
which does not vanish when r ≥ α.
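The counterexample can be checked directly; the short sketch below (an addition, with α = 1/2 and r = 1 as arbitrary choices) prints P[|X_n| > ε], which vanishes, alongside E[|X_n|^r] = n^{r−α}, which blows up.

```python
# For X_n = n with probability n**(-alpha) and 0 otherwise:
# P[|X_n| > eps] = n**(-alpha) -> 0, but E[|X_n|**r] = n**(r - alpha) -> infinity
# whenever r > alpha, so convergence in probability does not give L^r convergence.
alpha, r, eps = 0.5, 1.0, 0.1

for n in (10, 100, 1000, 10_000):
    p_exceed = n ** (-alpha)
    moment = n ** (r - alpha)
    print(n, p_exceed, moment)
```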
Fact: Convergence in probability implies convergence in distribution
For each n = 1, 2, . . . and ε > 0, we have
    P[X_n ≤ x] ≤ P[X ≤ x + ε] + P[|X_n − X| ≥ ε]
and
    P[X ≤ x − ε] ≤ P[X_n ≤ x] + P[|X_n − X| ≥ ε]
Thus,
    lim sup_{n→∞} P[X_n ≤ x] ≤ P[X ≤ x + ε]
and
    P[X ≤ x − ε] ≤ lim inf_{n→∞} P[X_n ≤ x]
Finally, let ε ↓ 0: at every point x of continuity of the probability
distribution of X, both bounds collapse to P[X ≤ x].
Converse is not true! With Z ∼ N(0, 1), take X_n = (−1)^n Z for
each n = 1, 2, . . .. Obviously, X_n ⇒_n Z (since −Z and Z have the same
distribution), but
    |X_n − Z| = |1 − (−1)^n| |Z| = 0 if n is even, and 2|Z| if n is odd,
so X_n does not converge to Z in probability.
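A short simulation sketch (added) of this counterexample: every X_n = (−1)^n Z has the N(0,1) law, yet P[|X_n − Z| > ε] stays near 1 along odd n; numpy is assumed available.

```python
# X_n = (-1)**n Z has the same N(0,1) distribution for every n (so X_n => Z),
# but |X_n - Z| = 2|Z| along odd n, so convergence in probability fails.
import numpy as np

rng = np.random.default_rng(2)
Z = rng.standard_normal(100_000)
eps = 0.1

for n in (9, 10, 99, 100):
    Xn = (-1) ** n * Z
    print(n, np.mean(np.abs(Xn - Z) > eps))
```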
Partial converse – If the sequence {X_n, n = 1, 2, . . .} converges
in distribution to the a.s. constant rv c, then
    X_n →_P c
Every sequence converging in distribution to a constant converges
to it in probability!
Indeed, for each n = 1, 2, . . . and ε > 0, we have
    P[|X_n − c| ≤ ε] = P[X_n ≤ c + ε] − P[X_n < c − ε]
and both c + ε and c − ε are continuity points of the distribution
function of c, so the right-hand side tends to 1 − 0 = 1.
BEWARE!
WEAK CONVERGENCE
IS INDEED WEAK
Consider the two sequences of rvs {X, X_n, n = 1, 2, . . .} and
{Y, Y_n, n = 1, 2, . . .} where for each n = 1, 2, . . ., the pair of rvs X_n
and Y_n are defined on the same probability triple (Ω_n, F_n, P_n).
Assume that
    X_n ⇒_n X    and    Y_n ⇒_n Y.
There are important ways in which weak convergence differs from the
other modes of convergence.
Convergence under transformation: Is it true that
    h(X_n) ⇒_n h(X)
with h : R^d → R^p? If not always, then under what conditions?
Fact: We have
    h(X_n) ⇒_n h(X)
if h : R^d → R^p is continuous
Not easy to show from the basic definition because there is no
“pointwise convergence” to work with – Skorokhod to the rescue!
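A minimal sketch of the continuous-mapping fact (an addition, with the illustrative choices h(x) = x², X_n = Z + 1/n and X = Z ∼ N(0,1)): the Monte Carlo estimates of P[h(X_n) ≤ x] approach those of P[h(X) ≤ x]. numpy is assumed available.

```python
# Continuous mapping: with h(x) = x**2 and X_n = Z + 1/n, the distribution of
# h(X_n) approaches that of h(Z) = Z**2 (chi-square with 1 degree of freedom).
import numpy as np

rng = np.random.default_rng(3)
Z = rng.standard_normal(200_000)
points = (0.5, 1.0, 2.0)

target = [round(float(np.mean(Z ** 2 <= x)), 3) for x in points]
for n in (1, 10, 1000):
    hXn = (Z + 1.0 / n) ** 2
    print(n, [round(float(np.mean(hXn <= x)), 3) for x in points], target)
```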
Joint convergence: Is it true that
    (X_n, Y_n) ⇒_n (X, Y)?
In general no: Take Z ∼ N(0, 1), X_n = Z and Y_n = (−1)^n Z, so that
    P[X_n ≤ x, Y_n ≤ y] = P[Z ≤ min(x, y)] if n is even, and P[−y ≤ Z ≤ x] if n is odd
Fact: We have
    (X_n, Y_n) ⇒_n (X, Y)
if for each n = 1, 2, . . ., the rvs X_n and Y_n are independent, in
which case X and Y are independent.
Convergence of sums: Is it true that
    X_n + Y_n ⇒_n X + Y?
In general no: Take Z ∼ N(0, 1), X_n = Z and Y_n = (−1)^n Z, so
that X_n + Y_n = (1 + (−1)^n) Z, which alternates between 0 (n odd)
and 2Z (n even) and hence does not converge in distribution.
Fact: We have
    X_n + Y_n ⇒_n X + Y
if for each n = 1, 2, . . ., the rvs X_n and Y_n are independent!
Question: You know that
    X_n →_P X    and    Y_n →_P Y
Convergence of sums: Is it true that
    X_n + Y_n →_P X + Y?
Yes, because for each n = 1, 2, . . ., the event
    [|(X_n + Y_n) − (X + Y)| > ε]
is contained in
    [|X_n − X| > ε/2] ∪ [|Y_n − Y| > ε/2]
What if only
    X_n →_P X    and    Y_n ⇒_n Y?
Counterexample: With Z ∼ N(0, 1), set
    X_n = Z    and    Y_n = (−1)^n Z,    n = 1, 2, . . .
so that
    X_n + Y_n = (1 + (−1)^n) Z,    n = 1, 2, . . .
It is plain that X_n →_P Z and Y_n ⇒_n Z, but the convergence
X_n + Y_n ⇒_n X + Y does not hold, hence X_n + Y_n →_P X + Y fails
as well!
TIGHTNESS
Tightness
The R^d-valued rvs {X_n, n = 1, 2, . . .} (or equivalently, their
probability distribution functions {F_n, n = 1, 2, . . .}) are tight if
for every ε > 0, there exists a compact subset K_ε ⊆ R^d such that
    inf_{n=1,2,...} P[X_n ∈ K_ε] ≥ 1 − ε
or equivalently, by complementarity,
    sup_{n=1,2,...} P[X_n ∉ K_ε] ≤ ε
An easy criterion
Fact: Tightness holds if for some p ≥ 1, we have
    B = sup_{n=1,2,...} E[|X_n|^p] < ∞
(Proof for d = 1) By Markov’s inequality,
    P[|X_n| > c] ≤ E[|X_n|^p] / c^p ≤ B / c^p,    c > 0, n = 1, 2, . . .
and note that
    K_c = [−c, c]
is a compact subset of R
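A quick numerical sketch of the criterion (an addition; the family X_n ∼ N(0, 1 + 1/n) and the constants are illustrative): with p = 2 and B = sup_n E[X_n²] = 2, the simulated tails stay below the uniform Markov bound B/c², so K_c = [−c, c] works for every n at once. numpy is assumed available.

```python
# Moment criterion for tightness: X_n ~ N(0, 1 + 1/n) has sup_n E[X_n^2] = 2,
# so sup_n P[|X_n| > c] <= 2 / c**2 and the compact set [-c, c] does the job.
import numpy as np

rng = np.random.default_rng(4)
B, p = 2.0, 2
for c in (2.0, 5.0, 10.0):
    worst_tail = max(
        float(np.mean(np.abs(np.sqrt(1.0 + 1.0 / n) * rng.standard_normal(100_000)) > c))
        for n in (1, 2, 10, 100)
    )
    print(c, worst_tail, "<=", B / c ** p)
```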
Fact: Every probability distribution function F on R^d is tight.
By monotone continuity of probability measures and the fact that
R^d is σ-compact:
    1 = P[X ∈ R^d] = P[∪_{n=1}^∞ [X ∈ B(0, n)]] = lim_{n→∞} P[X ∈ B(0, n)]
Fact: If the sequence of probability distribution functions
{F_n, n = 1, 2, . . .} on R^d converges weakly to the probability
distribution function F on R^d, then the collection
{F_n, n = 1, 2, . . .} is tight
(For d = 1) Fix x and y in C_F such that y < 0 < x. For each δ > 0,
there exists a finite integer n⋆ = n⋆(x, y; δ) such that
    F(x) − δ ≤ F_n(x) ≤ F(x) + δ,    n ≥ n⋆
and
    F(y) − δ ≤ F_n(y) ≤ F(y) + δ,    n ≥ n⋆
Consequently,
    P[X_n > x] ≤ P[X > x] + δ,    n ≥ n⋆
and
    P[X_n ≤ y] ≤ P[X ≤ y] + δ,    n ≥ n⋆
Thus,
    P[X_n ∉ [y, x]] ≤ P[X > x] + P[X ≤ y] + 2δ,    n ≥ n⋆
Now take x in C_F sufficiently large, say x = x(δ), such that
    P[X > x] ≤ δ
Similarly take y in C_F with |y| sufficiently large, say y = y(δ), such that
    P[X ≤ y] ≤ δ
With this choice,
    P[X_n ∉ [y, x]] ≤ 4δ,    n ≥ n(δ)
with
    n(δ) = n⋆(x(δ), y(δ); δ)
The finitely many rvs X_1, . . . , X_{n(δ)−1} are each tight on their own,
so the interval [y, x] can be enlarged to accommodate them as well.
By Prohorov’s Theorem,
    Tightness = Sequential precompactness (with respect to weak convergence)
Remember Bolzano-Weierstrass!
ANALYTIC VIEW
OF WEAK CONVERGENCE
Basic idea
Transform of a probability distribution: With any probability
distribution F : R^d → [0, 1], we associate its transform/related
quantity
    T(F) : R^d → C
Many such transforms are available: characteristic functions (general
applicability), moment generating functions, Laplace-Stieltjes
transforms (non-negative rvs), z-transforms (N-valued rvs), etc.
Uniqueness requirement: With F and G probability
distributions on R^d,
    T(F) = T(G)    if and only if    F = G
Desired result: The sequence of probability distribution functions
{F_n, n = 1, 2, . . .} on R^d converges weakly to the probability
distribution function F on R^d if and only if
    lim_{n→∞} T(F_n)(t) = T(F)(t),    t ∈ R^d
Characteristic functions
With an R^d-valued rv X = (X_1, . . . , X_d)′, its characteristic
function Φ_X : R^d → C is given by
    Φ_X(t) = E[e^{i t′X}],    t ∈ R^d
Also
    Φ_F = Φ_X    where X ∼ F
Uniqueness: With F and G probability distributions on R^d,
    Φ_F = Φ_G    if and only if    F = G
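A small sketch (added): the empirical characteristic function of N(0,1) samples compared with the exact value e^{−t²/2}; numpy is assumed available and the sample size is arbitrary.

```python
# Empirical characteristic function of N(0,1) samples versus exp(-t**2 / 2).
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal(100_000)

for t in (0.0, 0.5, 1.0, 2.0):
    phi_hat = np.mean(np.exp(1j * t * X))   # E[e^{itX}], estimated
    print(t, round(phi_hat.real, 4), round(float(np.exp(-t ** 2 / 2.0)), 4))
```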
Fact: With an R^d-valued rv X = (X_1, . . . , X_d)′, its characteristic
function Φ_X : R^d → C satisfies the following properties:
• Bounded:
    |Φ_X(t)| ≤ Φ_X(0) = 1,    t ∈ R^d
• Uniformly continuous on R^d:
    lim_{h→0} sup_{t∈R^d} |Φ_X(t + h) − Φ_X(t)| = 0
• Positive definiteness: For every n = 1, 2, . . ., every t_1, . . . , t_n
  in R^d and every z_1, . . . , z_n in C,
    Σ_{k=1}^n Σ_{ℓ=1}^n Φ_X(t_k − t_ℓ) z_k z_ℓ⋆ ≥ 0
The Bochner-Herglotz Theorem
Theorem 1 Consider a function Φ : R^d → C. It is the
characteristic function of the probability distribution
F : R^d → [0, 1] of some R^d-valued rv X if and only if it is positive
definite, continuous at the origin, and Φ(0) = 1.
A beautiful characterization
Fact: The sequence of probability distribution functions
{F_n, n = 1, 2, . . .} on R^d converges weakly to the probability
distribution function F on R^d if and only if
    lim_{n→∞} Φ_{F_n}(t) = Φ_F(t),    t ∈ R^d
This is a useful analytic characterization of weak convergence via
characteristic functions.
A natural idea: Look for the limiting behavior of the characteristic
functions
    lim_{n→∞} Φ_{F_n}(t) = lim_{n→∞} E[e^{i t′X_n}],    t ∈ R^d
Beware: While this limit may exist for every t, it is not always a
characteristic function, e.g., for X_n ∼ N(0, n) the limit equals 1 at
t = 0 and 0 elsewhere.
Fact: Consider a sequence of probability distribution functions
{F_n, n = 1, 2, . . .} on R^d such that the limits
    lim_{n→∞} Φ_{F_n}(t) = Φ(t),    t ∈ R^d
exist. If Φ : R^d → C is continuous at t = 0, then it is the
characteristic function of a probability distribution function F on
R^d and F_n ⇒_n F.
Consequence of the Bochner-Herglotz Theorem because the
positive definiteness of Φ and the requirement Φ(0) = 1 are
automatically inherited through the limiting process.
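A numerical sketch of this continuity argument at work in the CLT (an addition; the Uniform(−√3, √3) summands, with mean 0 and variance 1, are an illustrative choice): the characteristic function of S_n/√n is (φ(t/√n))^n and approaches e^{−t²/2}, the N(0,1) characteristic function, which is continuous at t = 0. numpy is assumed available.

```python
# Characteristic function of S_n / sqrt(n) for iid Uniform(-sqrt(3), sqrt(3))
# summands: (phi(t / sqrt(n)))**n converges to exp(-t**2 / 2) for every t.
import numpy as np

def phi_uniform(t):
    # characteristic function of Uniform(-a, a), a = sqrt(3), valid for t != 0
    a = np.sqrt(3.0)
    return np.sin(a * t) / (a * t)

t = np.array([0.5, 1.0, 2.0])
for n in (1, 5, 50, 500):
    print(n, np.round(phi_uniform(t / np.sqrt(n)) ** n, 4),
          np.round(np.exp(-t ** 2 / 2.0), 4))
```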
Applications
• Sums of independent rvs – WLLNs and CLT
• Joint distribution of independent components