
Probability 2 - Notes 10
Some Useful Inequalities.
Lemma. If X is a random variable and g(x) ≥ 0 for all x in the support of fX , then P(g(X) ≥
1) ≤ E[g(X)].
Proof. (continuous case)

P(g(X) ≥ 1) = ∫_{x: g(x) ≥ 1} fX(x) dx ≤ ∫_{x: g(x) ≥ 1} g(x) fX(x) dx ≤ ∫_{−∞}^{∞} g(x) fX(x) dx = E[g(X)]
Corollaries
1. Markov’s Inequality. For any h > 0, P(|X| ≥ h) ≤ E[|X|]/h. When X only takes non-negative values, then for any h > 0, P(X ≥ h) ≤ E[X]/h.

Proof. Take g(X) = |X|/h in the lemma. If X only takes non-negative values, take g(X) = X/h in the lemma.
2. Chebyshev’s Inequality. If E[X] = µ and Var(X) = σ², which are finite, then for any h > 0, P(|X − µ| ≥ h) ≤ σ²/h².

Proof. Take g(X) = ((X − µ)/h)² in the lemma.
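A quick numerical illustration of both corollaries (not part of the original notes): a minimal Python sketch, assuming numpy is available; the Exponential(1) sample and the value h = 3 are arbitrary illustrative choices.

```python
# Minimal sketch: check Markov's and Chebyshev's inequalities by simulation
# for an Exponential(1) sample (mean 1, variance 1).
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)

mu, var = x.mean(), x.var()
h = 3.0

markov_lhs = np.mean(x >= h)              # P(X >= h), X non-negative
markov_rhs = mu / h                       # E[X]/h
cheb_lhs = np.mean(np.abs(x - mu) >= h)   # P(|X - mu| >= h)
cheb_rhs = var / h**2                     # sigma^2 / h^2

print(f"Markov:    {markov_lhs:.4f} <= {markov_rhs:.4f}")
print(f"Chebyshev: {cheb_lhs:.4f} <= {cheb_rhs:.4f}")
```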
Note: Chebyshev’s inequality can be used to derive the weak law of large numbers. This is made precise in the theorem below.
Theorem. Let X1, X2, ... be a sequence of i.i.d. random variables each with finite mean µ and finite variance σ². Then for any ε > 0 and δ > 0 there exists an N such that P(|X̄n − µ| ≥ ε) ≤ δ for all n ≥ N, where X̄n = (1/n) ∑_{j=1}^{n} Xj.
Proof. Note that E[X̄n] = µ and Var(X̄n) = σ²/n. Consider any ε > 0 and δ > 0. Apply Chebyshev’s inequality to X̄n with h = ε. Then P(|X̄n − µ| ≥ ε) ≤ σ²/(nε²) ≤ δ provided n ≥ σ²/(ε²δ). Therefore we need only choose N = σ²/(ε²δ) to obtain the result.
Note. Observe that lim_{n→∞} P(|X̄n − µ| ≥ ε) = 0 for any ε > 0. We say that X̄n converges in probability to µ as n tends to infinity.
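The following is a minimal simulation sketch of the weak law (not from the notes), assuming numpy; the Uniform(0, 1) distribution, ε = 0.05 and the values of n are arbitrary illustrative choices.

```python
# Minimal sketch of the weak law of large numbers: the proportion of simulated
# sample means falling more than eps from mu shrinks as n grows, and it always
# sits below the Chebyshev bound sigma^2 / (n eps^2).
import numpy as np

rng = np.random.default_rng(1)
mu, var = 0.5, 1/12          # mean and variance of Uniform(0, 1)
eps, reps = 0.05, 1000

for n in (10, 100, 1000, 10000):
    xbar = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)
    p = np.mean(np.abs(xbar - mu) >= eps)
    bound = var / (n * eps**2)
    print(f"n={n:6d}  P(|Xbar-mu|>=eps) ~ {p:.4f}  Chebyshev bound {bound:.4f}")
```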
Some examples using the inequalities.
1. From Markov’s inequality with h = Nµ, if X is a non-negative random variable with mean µ, then P(X ≥ Nµ) ≤ µ/(Nµ) = 1/N for any N > 0.
2. If σ² = 0 then from Chebyshev’s inequality, for any h > 0, P(|X − µ| < h) = 1 − P(|X − µ| ≥ h) ≥ 1 − σ²/h² = 1. Hence P(X = µ) = lim_{h↓0} P(|X − µ| < h) = 1. So variance zero implies the random variable takes a single value with probability 1.
3. When σ² > 0 Chebyshev’s inequality gives a lower bound on the probability that X lies within k standard deviations of the mean. Take h = kσ. Then

P(|X − µ| < kσ) = 1 − P(|X − µ| ≥ kσ) ≥ 1 − σ²/(kσ)² = 1 − 1/k²
4. When σ = 1, how large a sample is needed if we want to be at least 95% certain that the sample mean lies within 0.5 of the true mean? We use Chebyshev’s inequality for X̄n with h = 0.5. Then

P(|X̄n − µ| < 0.5) = 1 − P(|X̄n − µ| ≥ 0.5) ≥ 1 − σ²/(n(0.5)²) = 1 − 4/n ≥ 0.95

provided n ≥ 4/0.05 = 80. So we need a minimum sample size of 80.
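A one-line check of the arithmetic in example 4, as a minimal Python sketch (not from the notes); it simply re-expresses the requirement n ≥ σ²/(h²δ) with δ = 1 − 0.95.

```python
# Minimal sketch: smallest n with 1 - sigma^2/(n h^2) >= 0.95.
import math

sigma, h, conf = 1.0, 0.5, 0.95
n_min = math.ceil(sigma**2 / (h**2 * (1 - conf)))   # sigma^2 / (h^2 * delta)
print(n_min)                                         # 80
```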
The Central Limit Theorem.
Let X1 , X2 , ... be a sequence of i.i.d. random variables each with finite mean µ and finite variance
σ² and let X̄n be the sample mean based on X1, ..., Xn. Then we can find an approximation for P(X̄n ≤ A) when n is large by writing the event for X̄n in terms of the standardized variable Zn = √n(X̄n − µ)/σ (i.e. P(X̄n ≤ A) = P(Zn ≤ √n(A − µ)/σ)) and proving that lim_{n→∞} P(Zn ≤ z) = Φ(z) = ∫_{−∞}^{z} (1/√(2π)) e^{−x²/2} dx, which is the c.d.f. of N(0, 1). The proof of this result uses the m.g.f. and the following lemma.
Lemma. Let Z1, Z2, ... be a sequence of random variables. If lim_{n→∞} MZn(t) = M(t), which is the m.g.f. of a distribution with c.d.f. F, then lim_{n→∞} FZn(z) = F(z) at all points z for which F(z) is continuous.
Theorem (The Central Limit Theorem). Let X1, X2, ... be a sequence of i.i.d. random variables, each with an m.g.f. which exists in an open interval about zero (so is differentiable there), with finite mean µ and finite variance σ². Let Zn = √n(X̄n − µ)/σ. Then lim_{n→∞} P(Zn ≤ z) = Φ(z).
Proof. Let Uj = (Xj − µ)/σ and let MU(t) be the common m.g.f. Then MU(t) = e^{−µt/σ} MX(t/σ) exists in an open interval about t = 0, MU(0) = 1, MU'(0) = E[U] = 0 and MU''(0) = E[U²] = Var(U) = 1. So U1, U2, ... are i.i.d. with mean zero and variance one. Now

MZn(t) = E[e^{t ∑_{j=1}^{n} Uj/√n}] = ∏_{j=1}^{n} E[e^{t Uj/√n}] = (MU(t/√n))^n

Taking logs to base e gives ln(MZn(t)) = n ln(MU(t/√n)). Now let x = 1/√n and use L’Hopital’s rule. Then

lim_{n→∞} n ln(MU(t/√n)) = lim_{x↓0} ln(MU(xt))/x²
= lim_{x↓0} [t MU'(xt)/MU(xt)] / (2x)
= lim_{x↓0} t² [MU''(xt) MU(xt) − (MU'(xt))²] / (2 (MU(xt))²)
= t² [MU''(0) MU(0) − (MU'(0))²] / (2 (MU(0))²) = t²/2
Hence lim_{n→∞} ln(MZn(t)) = t²/2 and so lim_{n→∞} MZn(t) = e^{t²/2}. Since this is the m.g.f. of an N(0, 1) distribution, using the lemma proves that lim_{n→∞} P(Zn ≤ z) = Φ(z) = ∫_{−∞}^{z} (1/√(2π)) e^{−x²/2} dx.
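A minimal simulation sketch of the theorem (not part of the notes), assuming numpy and scipy; the Exponential(1) distribution, n = 200 and the chosen z values are illustrative assumptions.

```python
# Minimal sketch of the CLT: compare the empirical c.d.f. of
# Z_n = sqrt(n)(Xbar_n - mu)/sigma with Phi at a few points.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, reps = 200, 20_000
mu = sigma = 1.0                                  # Exponential(1): mean 1, sd 1

x = rng.exponential(scale=1.0, size=(reps, n))
zn = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma

for z in (-1.0, 0.0, 1.0, 1.96):
    print(f"z={z:5.2f}  P(Z_n <= z) ~ {np.mean(zn <= z):.4f}  Phi(z) = {norm.cdf(z):.4f}")
```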
The bivariate and multivariate normal distribution.
An indirect method was used on problem sheet 9 to get you to derive standard results for a
bivariate normal distribution. The results are summarised below. They may be proved directly, but this is messy unless matrix and vector notation is used. Once this notation is in place, results can just as easily be obtained for the multivariate normal, so we may as well derive them immediately for the more general case.
Summary of results for the bivariate normal distribution.
1. If X1 and X2 have a bivariate normal distribution then the joint p.d.f. is

fX1,X2(x1, x2) = [1 / (2πσ1σ2√(1 − ρ²))] exp{ −1/(2(1 − ρ²)) [ ((x1 − µ1)/σ1)² − 2ρ((x1 − µ1)/σ1)((x2 − µ2)/σ2) + ((x2 − µ2)/σ2)² ] }

for all x1, x2. The distribution has parameters µ1, µ2, σ1², σ2², ρ. The parameter ρ is restricted so that −1 < ρ < 1.
2. The joint m.g.f. is

MX1,X2(t1, t2) = exp{ (µ1t1 + µ2t2) + ½(σ1²t1² + 2ρσ1σ2t1t2 + σ2²t2²) }

This can be used to identify the parameters and find the marginal distributions. MX1(t1) = MX1,X2(t1, 0) = exp{µ1t1 + ½σ1²t1²}. Hence X1 ∼ N(µ1, σ1²). Similarly X2 ∼ N(µ2, σ2²). Differentiating the joint m.g.f. in the standard manner shows that ρ(X1, X2) = ρ.
3. X1 and X2 are independent iff ρ = 0. This is easily seen from either the joint p.d.f. or the
joint m.g.f.
4. The conditional distribution of X2 |X1 = x1 is normal with mean linear in x1 and variance
which does not depend on x1 . A similar result holds for the conditional distribution of X1 |X2 =
x2 .
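A minimal sampling sketch of results 1 and 2 above (not from the notes), assuming numpy; all parameter values are arbitrary illustrative choices.

```python
# Minimal sketch: sample from a bivariate normal and check the marginal
# means/variances and the correlation against the stated parameters.
import numpy as np

rng = np.random.default_rng(3)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 1.5, 0.5, 0.7
m = np.array([mu1, mu2])
V = np.array([[s1**2,      rho*s1*s2],
              [rho*s1*s2,  s2**2]])

x = rng.multivariate_normal(m, V, size=200_000)
print(x.mean(axis=0))              # ~ (mu1, mu2)
print(x.var(axis=0))               # ~ (sigma1^2, sigma2^2)
print(np.corrcoef(x.T)[0, 1])      # ~ rho
```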
Using vector and matrix notation.
X = (X1, X2)^T ;  x = (x1, x2)^T ;  t = (t1, t2)^T ;  m = (µ1, µ2)^T ;
V = ( σ1²      ρσ1σ2
      ρσ1σ2    σ2²   )

Then m is the vector of means and V is the variance-covariance matrix. Note that |V| = σ1²σ2²(1 − ρ²) and
V^{-1} = 1/(1 − ρ²) (  1/σ1²        −ρ/(σ1σ2)
                       −ρ/(σ1σ2)    1/σ2²     )

Hence

fX(x) = [1 / ((2π)^{2/2} |V|^{1/2})] e^{−½ (x − m)^T V^{-1} (x − m)}

for all x. Also MX(t) = e^{t^T m + ½ t^T V t}.
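A minimal sketch (not from the notes) checking that the matrix form of the p.d.f. agrees with the explicit bivariate formula in point 1 above; numpy is assumed, and the parameter values and the evaluation point x are arbitrary.

```python
# Minimal sketch: evaluate the bivariate normal p.d.f. two ways and compare.
import numpy as np

mu1, mu2, s1, s2, rho = 0.5, 1.0, 2.0, 1.0, -0.3
m = np.array([mu1, mu2])
V = np.array([[s1**2,      rho*s1*s2],
              [rho*s1*s2,  s2**2]])

x = np.array([1.2, 0.4])
d = x - m

# matrix form: (2*pi)^{-1} |V|^{-1/2} exp(-0.5 d^T V^{-1} d)
pdf_matrix = np.exp(-0.5 * d @ np.linalg.solve(V, d)) / (2*np.pi*np.sqrt(np.linalg.det(V)))

# explicit bivariate form from point 1
q = ((d[0]/s1)**2 - 2*rho*(d[0]/s1)*(d[1]/s2) + (d[1]/s2)**2) / (1 - rho**2)
pdf_explicit = np.exp(-0.5*q) / (2*np.pi*s1*s2*np.sqrt(1 - rho**2))

print(pdf_matrix, pdf_explicit)    # should agree
```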
The Multivariate Normal Distribution.
We again use matrix and vector notation, but now there are n random variables, so that X, x, t and m are now n-vectors with ith entries Xi, xi, ti and µi, and V is the n × n matrix with iith entry σi² and ijth entry (for i ≠ j) σij. Note that V is symmetric so that V^T = V.
The joint p.d.f. is

fX(x) = [1 / ((2π)^{n/2} |V|^{1/2})] e^{−½ (x − m)^T V^{-1} (x − m)}

for all x. We say that X ∼ N(m, V).
We can find the joint m.g.f. quite easily.
MX(t) = E[e^{∑_{j=1}^{n} tj Xj}] = E[e^{t^T X}] = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} [1 / ((2π)^{n/2} |V|^{1/2})] e^{−½ ((x − m)^T V^{-1} (x − m) − 2 t^T x)} dx1 ... dxn
We do the equivalent of completing the square, i.e. we write

(x − m)^T V^{-1} (x − m) − 2 t^T x = (x − m − a)^T V^{-1} (x − m − a) + b

for a suitable choice of the n-vector a of constants and a constant b. Then
MX(t) = e^{−b/2} ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} [1 / ((2π)^{n/2} |V|^{1/2})] e^{−½ (x − m − a)^T V^{-1} (x − m − a)} dx1 ... dxn = e^{−b/2},

since the integrand is the p.d.f. of an N(m + a, V) distribution and so integrates to 1.
We just need to find a and b. Expanding, we have

((x − m) − a)^T V^{-1} ((x − m) − a) + b
= (x − m)^T V^{-1} (x − m) − 2 a^T V^{-1} (x − m) + a^T V^{-1} a + b
= (x − m)^T V^{-1} (x − m) − 2 a^T V^{-1} x + [ 2 a^T V^{-1} m + a^T V^{-1} a + b ]

This has to equal (x − m)^T V^{-1} (x − m) − 2 t^T x for all x. Hence we need a^T V^{-1} = t^T and b = −[ 2 a^T V^{-1} m + a^T V^{-1} a ]. Hence a = Vt and b = −[ 2 t^T m + t^T V t ]. Therefore

MX(t) = e^{−b/2} = e^{t^T m + ½ t^T V t}
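A minimal numerical sketch (not from the notes) of the completing-the-square step: it checks, at randomly chosen x, m and t, that the identity holds with a = Vt and b = −(2 t^T m + t^T V t); numpy is assumed and the dimension n = 3 is arbitrary.

```python
# Minimal sketch: verify the completing-the-square identity numerically.
import numpy as np

rng = np.random.default_rng(4)
n = 3
m, t, x = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
B = rng.normal(size=(n, n))
V = B @ B.T + n*np.eye(n)                      # a positive-definite "covariance"
Vinv = np.linalg.inv(V)

a = V @ t
b = -(2*t @ m + t @ V @ t)

lhs = (x - m) @ Vinv @ (x - m) - 2*t @ x
rhs = (x - m - a) @ Vinv @ (x - m - a) + b
print(lhs, rhs)                                # should agree
```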
Results obtained using the m.g.f.
1. Any (non-empty) subset of multivariate normal random variables is multivariate normal. Simply put tj = 0 for all j for which Xj is not in the subset. For example, MX1(t1) = MX1,...,Xn(t1, 0, ..., 0) = e^{t1µ1 + t1²σ1²/2}. Hence X1 ∼ N(µ1, σ1²). A similar result holds for each Xi. This identifies the parameters µi and σi² as the mean and variance of Xi. Also

MX1,X2(t1, t2) = MX1,...,Xn(t1, t2, 0, ..., 0) = e^{t1µ1 + t2µ2 + ½(t1²σ1² + 2σ12 t1t2 + σ2²t2²)}
Hence X1 and X2 have bivariate normal distribution with σ12 = Cov(X1 , X2 ). A similar result
holds for the joint distribution of Xi and X j for i 6= j. This identifies V as the variance-covariance
matrix for X1 , ..., Xn .
2. X is a vector of independent random variables iff V is diagonal (i.e. all off-diagonal entries are zero, so that σij = 0 for i ≠ j).

Proof. From (1), if the X’s are independent then σij = Cov(Xi, Xj) = 0 for all i ≠ j, so that V is diagonal.

If V is diagonal then t^T V t = ∑_{j=1}^{n} σj² tj² and hence

MX(t) = e^{t^T m + ½ t^T V t} = ∏_{j=1}^{n} e^{µj tj + ½ σj² tj²} = ∏_{j=1}^{n} MXj(tj)
By the uniqueness of the joint m.g.f., X1 , ..., Xn are independent.
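A minimal numerical sketch (not from the notes) of result 2: when V is diagonal the joint m.g.f. exp(t^T m + ½ t^T V t) factorises into the product of the marginal m.g.f.s; numpy is assumed and the numbers are arbitrary.

```python
# Minimal sketch: joint m.g.f. vs product of marginal m.g.f.s for diagonal V.
import numpy as np

m = np.array([1.0, -0.5, 2.0])
V = np.diag([0.5, 2.0, 1.0])
t = np.array([0.3, -0.2, 0.1])

joint = np.exp(t @ m + 0.5 * t @ V @ t)
product = np.prod(np.exp(m*t + 0.5*np.diag(V)*t**2))
print(joint, product)                      # should agree
```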
3. Linearly independent linear functions of multivariate normal random variables are multivariate normal. If Y = AX + b, where A is an n × n non-singular matrix of constants and b is a (column) n-vector of constants, then Y ∼ N(Am + b, AVA^T).

Proof. Use the joint m.g.f.:

MY(t) = E[e^{t^T Y}] = E[e^{t^T (AX + b)}] = e^{t^T b} E[e^{(A^T t)^T X}] = e^{t^T b} MX(A^T t)
= e^{t^T b} e^{(A^T t)^T m + ½ (A^T t)^T V (A^T t)} = e^{t^T (Am + b) + ½ t^T (A V A^T) t}

This is just the m.g.f. for the multivariate normal distribution with vector of means Am + b and variance-covariance matrix AVA^T. Hence, from the uniqueness of the joint m.g.f., Y ∼ N(Am + b, AVA^T).
Note that from (1) a subset of the Y’s is multivariate normal.
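A minimal simulation sketch (not from the notes) of result 3: the sample mean and sample covariance of Y = AX + b approach Am + b and AVA^T; numpy is assumed and the particular m, V, A and b are arbitrary choices.

```python
# Minimal sketch: simulate Y = AX + b and compare sample moments with Am + b and A V A^T.
import numpy as np

rng = np.random.default_rng(5)
m = np.array([0.0, 1.0, -1.0])
V = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
A = np.array([[1.0, 1.0,  0.0],
              [0.0, 2.0, -1.0],
              [1.0, 0.0,  3.0]])          # non-singular
b = np.array([1.0, 0.0, 2.0])

x = rng.multivariate_normal(m, V, size=200_000)
y = x @ A.T + b

print(y.mean(axis=0))                     # ~ A m + b
print(np.cov(y.T))                        # ~ A V A^T
print(A @ m + b)
print(A @ V @ A.T)
```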
NOTE. The results concerning the vector of means and variance-covariance matrix for linear
functions of random variables hold regardless of the joint distribution of X1 , ..., Xn .
We define the expectation of a vector of random variables X, E[X] to be the vector of the
expectations and the expectation of a matrix of random variables Y, E[Y], to be the matrix of
the expectations. Then the variance-covariance matrix of X is just E[(X − E[X])(X − E[X])T ].
The following results are easily obtained:
(i) Let A be an m × n matrix of constants, B be an m × k matrix of constants and Y be an n × k
matrix of random variables. Then E[AY + B] = AE[Y] + B.
Proof. The ijth entry of E[AY + B] is E[∑_{r=1}^{n} Air Yrj + Bij] = ∑_{r=1}^{n} Air E[Yrj] + Bij, which is the ijth entry of AE[Y] + B. The result is then immediate.
(ii) Let C be a k × m matrix of constants and Y be an n × k matrix of random variables. Then
E[YC] = E[Y]C.
Proof. Just transpose the equation. The result then follows from (i).
Hence if Z = AX + b, where A is an m × n matrix of constants, b is an m-vector of constants and
X is an n-vector of random variables with E[X] = m and variance-covariance matrix V, then
E[Z] = E[AX + b] = AE[X] + b = Am + b
Also the variance-covariance matrix for Z is just

E[(Z − E[Z])(Z − E[Z])^T] = E[A(X − m)(X − m)^T A^T] = A E[(X − m)(X − m)^T] A^T = A V A^T
Example. Suppose that E[X1 ] = 1, E[X2 ] = 0, Var(X1 ) = 2, Var(X2 ) = 4 and Cov(X1 , X2 ) = 1.
Let Y1 = X1 + X2 and Y2 = X1 + aX2 . Find the means, variances and covariance and hence find
a so that Y1 and Y2 are uncorrelated.
Writing in vector and matrix notation, we have E[Y] = Am and the variance-covariance matrix for Y is just AVA^T, where

m = ( 1 )    V = ( 2  1 )    A = ( 1  1 )
    ( 0 )        ( 1  4 )        ( 1  a )

Therefore

Am = ( 1  1 ) ( 1 )  =  ( 1 )
     ( 1  a ) ( 0 )     ( 1 )

AVA^T = ( 1  1 ) ( 2  1 ) ( 1  1 )  =  ( 8         3 + 5a       )
        ( 1  a ) ( 1  4 ) ( 1  a )     ( 3 + 5a    2 + 2a + 4a² )
Hence Y1 and Y2 have means 1 and 1, variances 8 and 2 + 2a + 4a², and covariance 3 + 5a. They are therefore uncorrelated if 3 + 5a = 0, i.e. if a = −3/5.
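A minimal check of the example (not from the notes), assuming numpy: with a = −3/5 the off-diagonal entry of AVA^T vanishes.

```python
# Minimal sketch: verify the worked example numerically.
import numpy as np

m = np.array([1.0, 0.0])
V = np.array([[2.0, 1.0],
              [1.0, 4.0]])

a = -3/5
A = np.array([[1.0, 1.0],
              [1.0, a]])

print(A @ m)           # means of Y1, Y2: (1, 1)
print(A @ V @ A.T)     # variances 8 and 2 + 2a + 4a^2, covariance 3 + 5a = 0
```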