Some notes on asymptotic theory in probability
Alen Alexanderian (The University of Texas at Austin, USA. E-mail: [email protected])
Last revised: April 16, 2015
Abstract
We provide a precise account of some commonly used results from asymptotic theory
in probability.
Contents

1 Introduction and basic definitions
2 Basic definitions from probability theory
3 Convergence in probability and o_p(n), O_p(n) notations
4 Probabilistic Taylor expansion
5 Convergence in distribution
6 Central Limit Theorem and related results
References

1 Introduction and basic definitions
This brief note summarizes some important results in asymptotic theory in probability. The main motivation of this theory is to approximate the distribution of large-sample statistics with a limiting distribution that is often much simpler to work with.
The structure of this note is as follows. First, we recall some basic definitions from probability theory in Section 2. Then, we discuss the notion of convergence in probability and the o_p(n) and O_p(n) notations in Section 3. Section 4 presents the probabilistic Taylor formula. In Section 5 we discuss convergence in distribution along with some related results, and in Section 6 we present the Central Limit Theorem along with some accompanying results.
2 Basic definitions from probability theory
Throughout this note, we let (Ω, F, P) be a probability space, where Ω denotes the sample space, F is a suitable σ-algebra on Ω, and P is a probability measure. We call a measurable function X : Ω → R a random variable; sometimes, we work with
random k-vectors (vector-valued random variables), which are measurable functions X : Ω → R^k. The expectation (mean) of a random variable X is
\[
E[X] = \int_\Omega X(\omega) \, dP(\omega).
\]
The variance of a random variable X measures the deviation of X from its mean,
\[
\mathrm{Var}[X] = E\big[(X - E[X])^2\big].
\]
It is straightforward to show that
\[
\mathrm{Var}[X] = E[X^2] - (E[X])^2.
\]
Next, we recall the notion of the covariance of two random variables. For two random variables X and Y,
\[
\mathrm{Cov}[X, Y] = E\big[(X - E[X])(Y - E[Y])\big].
\]
Note that Cov[X, X] = Var[X]. The variance and covariance for random k-vectors are defined by
\[
\mathrm{Var}[X] = E\big[(X - E[X]) \cdot (X - E[X])\big], \qquad
\mathrm{Cov}[X, Y] = E\big[(X - E[X]) \cdot (Y - E[Y])\big],
\]
where x · y = Σ_{i=1}^k x_i y_i denotes the Euclidean inner product.
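As a quick numerical sanity check of these identities (a minimal sketch not in the original note; the distributions and sample size below are arbitrary choices), the following Python snippet estimates Var[X] both as E[(X − E[X])²] and as E[X²] − (E[X])², and estimates Cov[X, Y] for Y equal to X plus independent noise, in which case Cov[X, Y] = Var[X].

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000

    # X ~ Exponential(1); Y = X + independent standard normal noise (illustrative choices)
    X = rng.exponential(scale=1.0, size=n)
    Y = X + rng.normal(size=n)

    EX = X.mean()
    var_def = np.mean((X - EX) ** 2)        # E[(X - E[X])^2]
    var_alt = np.mean(X ** 2) - EX ** 2     # E[X^2] - (E[X])^2
    cov_xy = np.mean((X - EX) * (Y - Y.mean()))

    print(var_def, var_alt)   # both close to 1, the variance of Exponential(1)
    print(cov_xy)             # close to Var[X] = 1, since the noise is independent of X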
Definition 2.1 (Distribution Function). The distribution function of a random variable X is the function F_X : R → [0, 1] defined by
\[
F_X(z) := P(X \le z).
\]
For a random k-vector X we similarly define F_X : R^k → [0, 1] by
\[
F_X(z) := P(X \le z).
\]
Note that for two vectors x and y in R^k, we write x ≤ y to mean x_i ≤ y_i for i = 1, . . . , k.
Another equivalent definition of expectation is given by
\[
E[X] = \int_\Omega X(\omega) \, dP(\omega) = \int_{\mathbb{R}^k} z \, dF_X(z).
\]
We also have, for a measurable function g,
\[
E[g(X)] = \int_\Omega g(X(\omega)) \, dP(\omega) = \int_{\mathbb{R}^k} g(z) \, dF_X(z). \tag{2.1}
\]
The next definition concerns the very important notion of the characteristic function of a random variable.

Definition 2.2. Let X be a random k-vector. Then
\[
\phi_X(\xi) := E\big[e^{i \xi \cdot X}\big], \qquad \xi \in \mathbb{R}^k,
\]
is the characteristic function of X. Using (2.1) we have
\[
\phi_X(\xi) = E\big[e^{i \xi \cdot X}\big] = \int_{\mathbb{R}^k} e^{i \xi \cdot z} \, dF_X(z).
\]
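As a small numerical illustration (a sketch, not part of the original note): we can approximate φ_X(ξ) = E[e^{iξX}] by a sample average and compare it with the known characteristic function e^{−ξ²/2} of a standard normal random variable.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=100_000)      # samples of X ~ N(0, 1)
    xi = np.linspace(-3.0, 3.0, 13)

    # Monte Carlo approximation of phi_X(xi) = E[exp(i * xi * X)]
    phi_emp = np.array([np.mean(np.exp(1j * t * X)) for t in xi])
    phi_true = np.exp(-xi ** 2 / 2)   # characteristic function of N(0, 1)

    print(np.max(np.abs(phi_emp - phi_true)))  # small (Monte Carlo error only)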
3 Convergence in probability and o_p(n), O_p(n) notations
Let us begin by defining the notion of convergence in probability to zero [2].

Definition 3.1. Let {X_n} be a sequence of random variables defined on (Ω, F, P). We say X_n converges in probability to zero, written X_n = o_p(1) or X_n →^P 0, if for every ε > 0,
\[
P(|X_n| > \varepsilon) \to 0, \quad \text{as } n \to \infty.
\]
A closely related notion is that of a sequence being bounded in probability [2].

Definition 3.2. Let {X_n} be a sequence of random variables defined on (Ω, F, P). We say X_n is bounded in probability (or tight), written X_n = O_p(1), if for every ε > 0 there exists M > 0 such that
\[
P(|X_n| > M) < \varepsilon, \quad \text{for all } n.
\]
We can look at the idea of boundedness in probability in the following alternative way. We say X_n is bounded in probability if for every ε > 0 there is a set F ∈ F and a number M such that, for all ω ∈ F,
\[
|X_n(\omega)| \le M, \quad \text{for all } n, \qquad \text{and} \qquad P(F^c) < \varepsilon.
\]
To get used to these ideas, let us prove the following basic fact: if X_n →^P 0, then X_n is bounded in probability. That is, if X_n = o_p(1), then X_n = O_p(1). This corresponds to the basic fact from real analysis that a convergent sequence is bounded. The argument is as follows. Suppose X_n →^P 0 and let ε > 0 be given; we know P(|X_n| > 1) → 0, and therefore there is some n_0 such that
\[
P(|X_n| > 1) < \varepsilon, \quad \text{for } n \ge n_0.
\]
Now, pick M_0 sufficiently large so that P(|X_i| > M_0) < ε for i = 1, . . . , n_0 − 1. Then we have, for M = max(1, M_0),
\[
P(|X_n| > M) < \varepsilon, \quad \text{for all } n,
\]
which completes the proof that X_n = O_p(1).
Now, we use the above developments to define convergence in probability and order in probability [2].

Definition 3.3. Let {X_n} be a sequence of random variables on (Ω, F, P) and let {a_n} be a sequence of strictly positive reals.

1. X_n converges in probability to X if and only if X_n − X = o_p(1).
2. X_n = o_p(a_n) if and only if a_n^{-1} X_n = o_p(1).
3. X_n = O_p(a_n) if and only if a_n^{-1} X_n = O_p(1).
The following proposition [2] shows the similarities between the notions of o_p and O_p and their counterparts from real analysis (the little-oh and big-oh notations).

Proposition 3.4. Let {X_n} and {Y_n} be sequences of random variables on (Ω, F, P), and let {a_n} and {b_n} be sequences of positive real numbers.

1. Suppose X_n = o_p(a_n) and Y_n = o_p(b_n); then,
   (a) X_n Y_n = o_p(a_n b_n),
   (b) X_n + Y_n = o_p(max(a_n, b_n)),
   (c) |X_n|^r = o_p(a_n^r) for r > 0.
2. If X_n = o_p(a_n) and Y_n = O_p(b_n), then X_n Y_n = o_p(a_n b_n).
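The following simulation (a minimal sketch, not from [2]; the Uniform(0,1) distribution, tolerance ε, bound M, and sample sizes are arbitrary choices) gives an empirical feel for these notions: for the sample mean X̄_n of n independent Uniform(0,1) variables with mean μ = 1/2, the deviation X̄_n − μ is o_p(1), while the rescaled deviation n^{1/2}(X̄_n − μ) remains bounded in probability, i.e., X̄_n − μ = O_p(n^{−1/2}).

    import numpy as np

    rng = np.random.default_rng(2)
    eps, M = 0.1, 3.0     # epsilon for the o_p(1) check, bound M for the O_p(n^{-1/2}) check
    reps = 1_000          # Monte Carlo replications per sample size

    for n in [10, 100, 1_000, 10_000]:
        xbar = rng.uniform(size=(reps, n)).mean(axis=1)   # sample means of Uniform(0,1) data
        dev = xbar - 0.5
        p_op = np.mean(np.abs(dev) > eps)                 # P(|Xbar_n - mu| > eps): tends to 0
        p_Op = np.mean(np.sqrt(n) * np.abs(dev) > M)      # stays small for all n (tightness)
        print(n, p_op, p_Op)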
We can extend the above developments to random k-vectors. For a random k-vector X, denote its j-th component by X^j.

Definition 3.5. Let {X_n} be a sequence of random k-vectors on (Ω, F, P), and let {a_n} be a sequence of strictly positive reals.

1. X_n = o_p(a_n) if and only if X_n^j = o_p(a_n) for j = 1, . . . , k.
2. X_n = O_p(a_n) if and only if X_n^j = O_p(a_n) for j = 1, . . . , k.
3. X_n converges in probability to X, written X_n →^P X, if and only if X_n − X = o_p(1).

Recall that in analysis, convergence in R^k is characterized by componentwise convergence. The same idea applies when we talk about convergence in probability of random k-vectors.
Let ‖X‖ denote the Euclidean norm, ‖X‖² = Σ_{j=1}^k |X^j|². The following result characterizes convergence in probability in terms of the Euclidean distance [2].

Proposition 3.6. Let {X_n} be a sequence of random k-vectors on (Ω, F, P) and X a random k-vector on the same probability space. Then, X_n − X = o_p(1) if and only if ‖X_n − X‖ = o_p(1).
Proof. Suppose X_n − X = o_p(1). Then, for every ε > 0,
\[
\lim_{n \to \infty} P\big(|X_n^j - X^j|^2 > \varepsilon/k\big) = 0, \quad \text{for } j = 1, \ldots, k.
\]
Claim: P(Σ_{j=1}^k |X_n^j − X^j|² > ε) ≤ Σ_{j=1}^k P(|X_n^j − X^j|² > ε/k).

This follows from the fact that Σ_{j=1}^k |X_n^j − X^j|² > ε implies that at least one of the summands is larger than ε/k. To make this clearer, let
\[
E_n = \Big\{\omega : \sum_{j=1}^{k} |X_n^j(\omega) - X^j(\omega)|^2 > \varepsilon\Big\}, \qquad
E_n^j = \big\{\omega : |X_n^j(\omega) - X^j(\omega)|^2 > \varepsilon/k\big\},
\]
and note that E_n ⊆ ∪_{j=1}^k E_n^j. Therefore, P(E_n) ≤ Σ_{j=1}^k P(E_n^j), which establishes the claim. Then, we have
\[
P\big(\|X_n - X\|^2 > \varepsilon\big) \le \sum_{j=1}^{k} P\big(|X_n^j - X^j|^2 > \varepsilon/k\big) \to 0, \quad \text{as } n \to \infty.
\]
Therefore, ‖X_n − X‖² = o_p(1), and by Proposition 3.4, ‖X_n − X‖ = o_p(1). Conversely, if ‖X_n − X‖ = o_p(1), we use |X_n^j − X^j| ≤ ‖X_n − X‖ to get
\[
P\big(|X_n^j - X^j| > \varepsilon\big) \le P\big(\|X_n - X\| > \varepsilon\big) \to 0, \quad \text{as } n \to \infty.
\]
Proposition 3.7. Let {X_n} and {Y_n} be sequences of random k-vectors on (Ω, F, P). If X_n − Y_n →^P 0 and Y_n →^P Y, where Y is a random k-vector on (Ω, F, P), then X_n →^P Y also.
Proof. Note that by Proposition 3.6, ‖X_n − Y_n‖ = o_p(1) and ‖Y_n − Y‖ = o_p(1). Then, we have
\[
\|X_n - Y\| \le \|X_n - Y_n\| + \|Y_n - Y\| = o_p(1),
\]
where the last equality follows from Proposition 3.4. Therefore, again applying Proposition 3.6, we have X_n − Y = o_p(1).
We also have the following continuity result [2].

Proposition 3.8. Let {X_n} be a sequence of random variables on (Ω, F, P), such that X_n →^P X. If g : R → R is a continuous mapping, then g(X_n) →^P g(X).
Proof. Let ε > 0 be given. For any K ∈ R^+,
\[
P(|g(X_n) - g(X)| > \varepsilon) \le P\big(|g(X_n) - g(X)| > \varepsilon,\ |X| \le K,\ |X_n| \le K\big) + P\big(\{|X| > K\} \cup \{|X_n| > K\}\big).
\]
Now we note that g is uniformly continuous on {x : |x| ≤ K}, and thus there exists δ = δ(ε) such that |g(X_n) − g(X)| ≤ ε whenever |X_n − X| < δ (on the set where |X| ≤ K and |X_n| ≤ K). Therefore,
\[
\big\{|g(X_n) - g(X)| > \varepsilon,\ |X| \le K,\ |X_n| \le K\big\} \subseteq \big\{|X_n - X| \ge \delta\big\}.
\]
Therefore, we have
\[
\begin{aligned}
P(|g(X_n) - g(X)| > \varepsilon) &\le P(|X_n - X| \ge \delta) + P(|X| > K) + P(|X_n| > K) \\
&\le P(|X_n - X| \ge \delta) + P(|X| > K) + P(|X| > K/2) + P(|X_n - X| > K/2).
\end{aligned}
\]
Now, let γ > 0 be arbitrary; we can choose K sufficiently large so that the second and third terms are both less than γ/2. Thus, for sufficiently large K,
\[
P(|g(X_n) - g(X)| > \varepsilon) \le P(|X_n - X| \ge \delta) + P(|X_n - X| > K/2) + \gamma.
\]
Taking the limit superior of both sides and using the fact that |X_n − X| →^P 0 gives
\[
\limsup_{n \to \infty} P(|g(X_n) - g(X)| > \varepsilon) \le \gamma,
\]
and since γ > 0 was arbitrary, the proof is complete.
Suppose X_n = a + o_p(1); then, using the above result, we have for any continuous function g,
\[
g(X_n) = g(a) + o_p(1).
\]
Now, what if g is also differentiable at x = a? In the next section, we discuss a probabilistic analogue of Taylor's Theorem.
Remark 3.9. The following technical note is relevant in what follows. Let {X_n} be a sequence of random variables such that X_n = O_p(r_n), where 0 < r_n → 0. Then X_n = o_p(1); i.e., X_n →^P 0. To show this, let δ > 0 be fixed but arbitrary, and let ε > 0 be given. Since X_n = O_p(r_n), there exists M ∈ (0, ∞) such that P(r_n^{-1}|X_n| > M) < ε for all n. Now, take n_0(ε) sufficiently large so that for n ≥ n_0(ε) we have r_n M < δ. Then, we have for n ≥ n_0(ε),
\[
P(|X_n| > \delta) \le P(|X_n| > M r_n) < \varepsilon,
\]
and hence the claim.
4 Probabilistic Taylor expansion
Proposition 4.1 (Proposition 6.1.5 in [2]). Let {X_n} be a sequence of random variables such that X_n = a + O_p(r_n), where a ∈ R and 0 < r_n → 0 as n → ∞. If g is a function with s continuous derivatives at a, then
\[
g(X_n) = \sum_{j=0}^{s} \frac{g^{(j)}(a)}{j!} (X_n - a)^j + o_p(r_n^s).
\]
Proof. Define the function h as follows:
\[
h(x) =
\begin{cases}
\dfrac{g(x) - \sum_{j=0}^{s} \frac{g^{(j)}(a)}{j!} (x - a)^j}{(x - a)^s / s!}, & x \ne a, \\[1.5ex]
0, & x = a.
\end{cases}
\]
The function h defined above is continuous at x = a (by repeated application of L'Hôpital's rule we can check that lim_{x→a} h(x) = 0). Therefore, h(X_n) = h(a) + o_p(1), and since h(a) = 0, we have h(X_n) = o_p(1). But then, using Proposition 3.4(2), we have
\[
(X_n - a)^s h(X_n) = o_p(r_n^s),
\]
that is,
\[
g(X_n) - \sum_{j=0}^{s} \frac{g^{(j)}(a)}{j!} (X_n - a)^j = o_p(r_n^s),
\]
which completes the proof.
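The following short simulation (a sketch under arbitrary choices, not from [2]: g(x) = e^x, a = 0, s = 2, and X_n = a + Z_n/√n with Z_n standard normal, so that r_n = n^{−1/2}) illustrates Proposition 4.1: the remainder after the order-s Taylor polynomial, scaled by r_n^{−s}, converges to zero in probability.

    import numpy as np

    rng = np.random.default_rng(3)
    a, s, eps = 0.0, 2, 0.05
    reps = 100_000

    for n in [10, 100, 1_000, 10_000]:
        rn = n ** -0.5
        Xn = a + rn * rng.normal(size=reps)             # X_n = a + O_p(r_n)
        taylor = 1.0 + (Xn - a) + (Xn - a) ** 2 / 2.0   # order-2 Taylor polynomial of exp at a = 0
        scaled_rem = (np.exp(Xn) - taylor) / rn ** s    # should be o_p(1)
        print(n, np.mean(np.abs(scaled_rem) > eps))     # P(|scaled remainder| > eps): tends to 0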
We can also state an analogue of the above result for random k-vectors; the proof is similar [2].

Proposition 4.2. Let {X_n} be a sequence of random k-vectors such that X_n = a + O_p(r_n), where a ∈ R^k and 0 < r_n → 0 as n → ∞. If g : R^k → R is a function whose partial derivatives ∂g/∂x_i are continuous in a neighborhood of a, then
\[
g(X_n) = g(a) + \nabla g(a) \cdot (X_n - a) + o_p(r_n).
\]
5 Convergence in distribution
When we talk about convergence in probability, we always deal with a sequence of random variables X_1, X_2, . . . defined on the same probability space. In this section, we discuss convergence in distribution, which makes sense even if the X_i are not defined on the same probability space.
Definition 5.1. Let {X_n} be a sequence of random k-vectors, with distribution functions {F_{X_n}(·)}. We say X_n converges in distribution to X if
\[
\lim_{n \to \infty} F_{X_n}(z) = F_X(z)
\]
for all z ∈ C, where C is the set of continuity points of F_X. We use the notation X_n →^D X or F_{X_n} →^D F_X to denote convergence in distribution.
Speaking in practical terms, X_n →^D X tells us that the distribution of X_n is well approximated by the distribution of X for large n. The following theorem [2] is an important result linking convergence in distribution to convergence of characteristic functions.
Theorem 5.2. Let {F_n}, n = 0, 1, 2, . . ., be distribution functions on R^k with corresponding characteristic functions φ_n(ξ) = ∫_{R^k} e^{iξ·x} dF_n(x), n = 0, 1, 2, . . .. Then the following are equivalent.

1. F_n →^D F_0.
2. ∫_{R^k} g(x) dF_n(x) → ∫_{R^k} g(x) dF_0(x) for every bounded continuous function g on R^k.
3. lim_{n→∞} φ_n(ξ) = φ_0(ξ) for every ξ ∈ R^k.
Let us get some insight into the notion of convergence in distribution by proving the following simple result, which says that a sequence that converges in distribution is bounded in probability.

Lemma 5.3. Suppose X_n →^D X. Then, X_n = O_p(1).

Proof. Let ε > 0 be given. Choose M_0 sufficiently large so that P(|X| > M_0) < ε; since F_X has at most countably many discontinuities, we may also take ±M_0 to be continuity points of F_X. Now, we know by convergence in distribution that
\[
P(|X_n| > M_0) \to P(|X| > M_0).
\]
So there exists some n_0 such that for n ≥ n_0,
\[
P(|X_n| > M_0) < \varepsilon.
\]
Next, choose M_1 sufficiently large so that P(|X_i| > M_1) < ε for i = 1, . . . , n_0 − 1. Then, for M = max(M_0, M_1), we have
\[
P(|X_n| > M) < \varepsilon, \quad \text{for all } n.
\]
Proposition 5.4. Let {X_n} be a sequence of random variables. Suppose X_n →^P X. Then,

1. E[|e^{itX_n} − e^{itX}|] → 0 as n → ∞, for all t ∈ R, and
2. X_n →^D X.

Proof. Let t ∈ R and ε > 0 be fixed but arbitrary. Pick δ = δ(ε) > 0 such that
\[
|x - y| < \delta \implies |e^{itx} - e^{ity}| = |1 - e^{it(y - x)}| < \varepsilon.
\]
Then, we have
\[
\begin{aligned}
E\big[|e^{itX_n} - e^{itX}|\big] &= E\big[|1 - e^{it(X - X_n)}|\big] \\
&= E\big[|1 - e^{it(X - X_n)}| \chi_{\{|X_n - X| < \delta\}}\big] + E\big[|1 - e^{it(X - X_n)}| \chi_{\{|X_n - X| \ge \delta\}}\big] \\
&\le \varepsilon + E\big[|1 - e^{it(X - X_n)}| \chi_{\{|X_n - X| \ge \delta\}}\big] \\
&\le \varepsilon + 2 P(|X_n - X| \ge \delta).
\end{aligned}
\]
Therefore, taking the limit superior as n → ∞ gives
\[
\limsup_{n \to \infty} E\big[|e^{itX_n} - e^{itX}|\big] \le \varepsilon,
\]
and since ε > 0 was arbitrary, the proof of the first statement of the proposition is complete.
To prove convergence in distribution, we show the convergence of the characteristic functions:
\[
|\phi_{X_n}(t) - \phi_X(t)| = \big|E[e^{itX_n}] - E[e^{itX}]\big| = \big|E[e^{itX_n} - e^{itX}]\big| \le E\big[|e^{itX_n} - e^{itX}|\big] \to 0, \quad \text{as } n \to \infty,
\]
where the last conclusion follows from the first statement of the proposition; this completes the proof.
Remark 5.5. By Proposition 5.4 we have that convergence in probability implies convergence in distribution. While the converse is not true in general, we mention the following important special case. Let {X_n} be a sequence of random variables such that X_n →^D c, where c is a constant. Then X_n →^P c also. To see this, let δ > 0 be fixed but arbitrary, and note that
\[
P(|X_n - c| > \delta) \le P(X_n - c > \delta) + P(X_n - c \le -\delta) = 1 - F_{X_n}(c + \delta) + F_{X_n}(c - \delta) \to 0, \quad \text{as } n \to \infty.
\]
Proposition 5.6. Let {X_n} and {Y_n} be sequences of random variables such that X_n − Y_n = o_p(1) and X_n →^D X; then Y_n →^D X also.
Proof. Again, we will show the convergence of the characteristic functions. First, we claim that
\[
|\phi_{Y_n}(t) - \phi_{X_n}(t)| \to 0, \quad \text{as } n \to \infty, \ \text{for all } t \in \mathbb{R}.
\]
This follows from
\[
|\phi_{Y_n}(t) - \phi_{X_n}(t)| = \big|E[e^{itY_n}] - E[e^{itX_n}]\big| = \big|E[e^{itY_n} - e^{itX_n}]\big| \le E\big[|e^{itY_n} - e^{itX_n}|\big] = E\big[|1 - e^{it(X_n - Y_n)}|\big] \to 0, \quad \text{as } n \to \infty,
\]
where the last conclusion follows from Proposition 5.4, applied to the sequence X_n − Y_n, which converges in probability to zero. The rest of the proof is simple; for any t ∈ R,
\[
|\phi_{Y_n}(t) - \phi_X(t)| \le |\phi_{Y_n}(t) - \phi_{X_n}(t)| + |\phi_{X_n}(t) - \phi_X(t)| \to 0, \quad \text{as } n \to \infty.
\]
The following continuity result is very useful.

Proposition 5.7. Let {X_n} be a sequence of random variables such that X_n →^D X. If h : R → R is a continuous function, then h(X_n) →^D h(X).
Proof. We will show that φ_{h(X_n)}(t) → φ_{h(X)}(t) for all t. Note that
\[
\phi_{h(X_n)}(t) = E\big[e^{i t h(X_n)}\big] = \int_{\mathbb{R}} e^{i t h(x)} \, dF_{X_n}(x).
\]
Now, for every t ∈ R, the function g(x) = e^{ith(x)} is a bounded and continuous function of x. Therefore, by Theorem 5.2(2), we have
\[
\int_{\mathbb{R}} e^{i t h(x)} \, dF_{X_n}(x) \to \int_{\mathbb{R}} e^{i t h(x)} \, dF_X(x),
\]
but the right-hand side of the above equation is none other than φ_{h(X)}(t). Hence, we have shown that φ_{h(X_n)}(t) → φ_{h(X)}(t) for every t and consequently, by Theorem 5.2, h(X_n) →^D h(X).
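As an empirical illustration of Proposition 5.7 (a sketch with arbitrary choices, not from [2]): take X_n to be the standardized mean of n Uniform(0,1) variables, which by the Central Limit Theorem of the next section satisfies X_n →^D Z ∼ N(0,1), and take h(x) = x². Then h(X_n) →^D Z², and the code below compares the empirical distribution function of h(X_n) with that of a large sample from the limit Z².

    import numpy as np

    rng = np.random.default_rng(4)
    reps, n = 20_000, 200

    # X_n: standardized sample mean of n Uniform(0,1) variables (approximately N(0,1))
    u = rng.uniform(size=(reps, n))
    Xn = (u.mean(axis=1) - 0.5) * np.sqrt(12 * n)

    hXn = Xn ** 2                       # h(x) = x^2; h(X_n) should be approximately chi-square(1)
    Z2 = rng.normal(size=reps) ** 2     # reference sample from the limit distribution Z^2

    # compare empirical distribution functions on a grid
    grid = np.linspace(0.0, 6.0, 61)
    F_hXn = np.array([np.mean(hXn <= t) for t in grid])
    F_Z2 = np.array([np.mean(Z2 <= t) for t in grid])
    print(np.max(np.abs(F_hXn - F_Z2)))  # small, consistent with h(X_n) -> h(Z) in distribution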
6 Central Limit Theorem and related results
Definition 6.1. A sequence {X_n} of random variables is said to be asymptotically normal with mean μ_n and standard deviation σ_n if σ_n > 0 for n sufficiently large and
\[
\frac{X_n - \mu_n}{\sigma_n} \xrightarrow{D} Z, \qquad Z \sim N(0, 1).
\]
In this case we use the notation: X_n is AN(μ_n, σ_n²).
Remark 6.2 ([2]). If X_n is AN(μ_n, σ_n²), it is not necessarily the case that E[X_n] = μ_n and Var[X_n] = σ_n².
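For instance (a standard type of construction, added here for illustration and not taken from [2]): let Z ∼ N(0, 1) and, for each n, let A_n be an event independent of Z with P(A_n) = 1/n. Put X_n = Z 1_{A_n^c} + n 1_{A_n}. Then P(|X_n − Z| > ε) ≤ P(A_n) = 1/n → 0, so X_n →^P Z and hence X_n →^D Z; that is, X_n is AN(0, 1). However,
\[
E[X_n] = E[Z]\,P(A_n^c) + n\,P(A_n) = 1 \quad \text{for every } n, \qquad
\mathrm{Var}[X_n] = \Big(1 - \frac{1}{n}\Big) + n - 1 = n - \frac{1}{n} \to \infty.
\]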
The following Central Limit Theorem, which is stated without proof, is well known (I will at some point put a carefully written proof here). Elementary proofs of this theorem can be found in [3] or [4]; for more rigorous, measure-theoretic proofs, see [1, 2, 5].
Theorem 6.3 (Central Limit Theorem). Let {X_n} be a sequence of independent and identically distributed random variables with mean μ and variance σ² (notation: {X_n} ∼ IID(μ, σ²)). Define X̄_n = (X_1 + X_2 + · · · + X_n)/n. Then,
\[
\bar{X}_n \ \text{is} \ AN\Big(\mu, \frac{\sigma^2}{n}\Big).
\]
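As a quick empirical check of Theorem 6.3 (a sketch; the Exponential(1) distribution and the sample sizes below are arbitrary choices, not from the original note), the standardized sample mean √n(X̄_n − μ)/σ should be approximately standard normal; the code compares a few of its empirical quantiles with those of N(0, 1).

    import numpy as np

    rng = np.random.default_rng(5)
    reps, n = 20_000, 500
    mu = sigma = 1.0                                   # Exponential(1) has mean 1 and variance 1

    x = rng.exponential(scale=1.0, size=(reps, n))
    z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma     # standardized sample means

    print(np.quantile(z, [0.025, 0.5, 0.975]))         # roughly [-1.96, 0.0, 1.96]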
The following result [2], known as the delta method in the statistics literature, is useful in applications.

Proposition 6.4. If X_n is AN(μ, σ_n²), where σ_n → 0 as n → ∞, and if g is a function that is continuously differentiable at μ with g′(μ) ≠ 0, then
\[
g(X_n) \ \text{is} \ AN\big(g(\mu), [g'(\mu)]^2 \sigma_n^2\big).
\]
Proof. Let Z_n = (X_n − μ)/σ_n; we know that Z_n →^D Z ∼ N(0, 1). Thus, by Lemma 5.3, Z_n = O_p(1), and therefore X_n = μ + O_p(σ_n). Then, by the Taylor formula given in Proposition 4.1, we have g(X_n) = g(μ) + g′(μ)(X_n − μ) + o_p(σ_n). Therefore,
\[
\frac{g(X_n) - g(\mu)}{\sigma_n} = g'(\mu) \, \frac{X_n - \mu}{\sigma_n} + o_p(1).
\]
We rewrite the above equation as
\[
\frac{g(X_n) - g(\mu)}{g'(\mu) \sigma_n} = \frac{X_n - \mu}{\sigma_n} + o_p(1),
\]
and recall that (X_n − μ)/σ_n →^D Z ∼ N(0, 1); therefore, by Proposition 5.6, we have
\[
\frac{g(X_n) - g(\mu)}{g'(\mu) \sigma_n} \xrightarrow{D} Z \sim N(0, 1).
\]
That is,
\[
g(X_n) \ \text{is} \ AN\big(g(\mu), [g'(\mu)]^2 \sigma_n^2\big).
\]
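To see the delta method in action numerically (a sketch with arbitrary choices, not from [2]: X̄_n is the mean of n Exponential(1) samples, so X̄_n is AN(1, 1/n), and g(x) = log x with g′(1) = 1), the standardized quantity (g(X̄_n) − g(μ))/(g′(μ)σ_n) should be approximately standard normal.

    import numpy as np

    rng = np.random.default_rng(6)
    reps, n = 20_000, 500
    mu, sigma_n = 1.0, 1.0 / np.sqrt(n)    # Xbar_n is AN(1, 1/n) for Exponential(1) samples

    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    t = (np.log(xbar) - np.log(mu)) / (1.0 * sigma_n)   # g(x) = log(x), g'(mu) = 1/mu = 1

    print(t.mean(), t.std())                # roughly 0 and 1
    print(np.quantile(t, [0.025, 0.975]))   # roughly [-1.96, 1.96]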
References
[1] L. Breiman, Probability, SIAM, 1992.
[2] P. Brockwell and R. Davis, Time Series: Theory and Methods, Springer, 1991.
[3] P. G. Hoel, S. C. Port, and C. J. Stone, Introduction to Probability Theory, Houghton Mifflin
Company, 1971.
[4] S. Ross, A First Course in Probability, Macmillan Publishing Company, 2nd ed., 1984.
[5] D. Williams, Probability with Martingales, Cambridge University Press, 1991.