Chapter 5. Introduction to Probability Theory

Chapter 5. Introduction to Probability Theory
This chapter introduces some elementary probability theory which is the base of statistical
inference in the next chapter. The foundations of modern probability theory were laid by Andrey
Nikolaevich Kolmogorov (1903-1987) in 1933. Related materials are Chapter 1-5 of Casella and
Berger (2002) and Appendix B of Hansen (2016).
1
Foundations
The set of all possible outcomes of an experiment is called the sample space for the experiment.
Take the simple example of tossing a coin. There are two outcomes, heads and tails, so we can
write
= fH; T g. If two coins are tossed in sequence, we can write the four outcomes as
=
fHH; HT; T H; T T g.
An event A is any collection of possible outcomes of an experiment. An event is a subset of ,
including itself and the null set ;. Continuing the two coin example, one event is A = fHH; HT g,
the event that the …rst coin is heads. We say that A and B are disjoint or mutually exclusive
if A \ B = ;. For example, the sets fHH; HT g and fT Hg are disjoint. Furthermore, if the
sets A1 ; A2 ;
are pairwise disjoint and [1
, then the collection A1 ; A2 ;
is called a
i=1 Ai =
partition of .
A probability function P assigns probabilities (numbers between 0 and 1) to events A in .
This is straightforward when is countable; when is uncountable we must be somewhat more
careful. A collection of sets, F, is called a sigma algebra (or -…eld) if ; 2 F, A 2 F implies
Ac 2 F, and A1 ; A2 ;
2 F implies [1
i=1 Ai 2 F. We only de…ne probabilities for events contained
in F. A simple example of F is f;; g which is known as the trivial -algebra. For any sample
space , let F be the smallest sigma algebra which contains all of the open sets in . When is
countable, F is simply the collection of all subsets of F, including ; and . When is the real line,
then F is the collection of all open and closed intervals. We call such an F the Borel -algebra
associated with , and a set A 2 F a Borel (measurable) set.
We now can give the axiomatic de…nition of probability. Given and F, a probability function
P satis…es P ( ) = 1, P (A)
0 for all A 2 F, and if A1 ; A2 ;
2 F are pairwise disjoint, then
Pn
1
P ([i=1 Ai ) = i=1 P (Ai ).
Some important properties of the probability function include the following
1. P (;) = 0
Email: [email protected]
1
2. P (A)
1
3. P (Ac ) = 1
P (A)
4. P (B \ Ac ) = P (B)
P (A \ B)
5. P (A [ B) = P (A) + P (B)
6. If A
B then P (A)
P (A \ B)
P (B)
7. Bonferroni’s Inequality: P (A \ B)
8. Boole’s Inequality: P (A [ B)
P (A) + P (B)
1
P (A) + P (B)
For some elementary probability models, it is useful to have simple rules to count the number
of objects in a set. These counting rules are facilitated by using the binomial coe¢ cients which are
de…ned for nonnegative integers n and r, n r, as
n
r
!
=
n!
:
r!(n r)!
When counting the number of objects in a set, there are two important distinctions. Counting
may be with replacement or without replacement. Counting may be ordered or unordered.
For example, consider a lottery where you pick six numbers from the set 1; 2; :::; 49. This selection is
without replacement if you are not allowed to select the same number twice, and is with replacement
if this is allowed. Counting is ordered or not depending on whether the sequential order of the
numbers is relevant to winning the lottery. Depending on these two distinctions, we have four
expressions for the number of objects (possible arrangements) of size r from n objects.
Without Replacement With Replacement
n!
Ordered
nr
(n r)!
!
!
n
n+r 1
Unordered
r
r
In the lottery example,!if counting is unordered and without replacement, the number of potential
49
combinations is
= 13; 983; 816.
6
If P (B) > 0 the conditional probability of the event A given the event B is
P (AjB) =
P (A \ B)
:
P (B)
For any B, the conditional probability function is a valid probability function where
replaced by B. Rearranging the de…nition, we can write
P (A \ B) = P (AjB)P (B);
2
has been
which is often quite useful. We can say that the occurrence of B has no information about the
likelihood of event A when P (AjB) = P (A), in which case we …nd
P (A \ B) = P (A)P (B):
(1)
We say that the events A and B are statistically independent when (1) holds. Furthermore,
we say that the collection of events A1 ; :::; Ak are mutually independent when for any subset
fAi : i 2 Ig ;
!
Y
\
P
P (Ai ):
Ai =
i2I
i2I
Theorem 1 (Bayes’Rule or Theorem) For any set B and any partition A1 ; A2 ;
ple space, then for each i = 1; 2;
of the sam-
P (BjAi )P (Ai )
:
P (Ai jB) = P1
j=1 P (BjAj )P (Aj )
Example 1 (Monty Hall Problem) The Monty Hall problem is a brain teaser, …rst appearing
on the American television game show Let’s Make a Deal and named after its original host, Monty
Hall. Suppose you’re given the choice of three doors: Behind one door is a car; behind the others,
goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another
door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it
to your advantage to switch your choice?
Solution: Consider the events C1, C2 and C3 indicating the car is behind respectively door 1 2
or 3. All these 3 events have probability 1=3. The player initially choosing door 1 is described by
the event X1. As the …rst choice of the player is independent of the position of the car, also the
conditional probabilities are P (CijX1) = 1=3. The host opening door 3 is described by H3. For
this event it holds:
P (H3jC1; X1) = 1=2, P (H3jC2; X1) = 1 and P (H3jC3; X1) = 0:
Then, if the player initially selects door 1, and the host opens door 3, the conditional probability
of winning by switching is
P (C2jH3; X1) =
=
=
=
P (H3jC2; X1) P (C2 \ X1)
P (H3 \ X1)
P (H3jC2; X1) P (C2 \ X1)
P3
i=1 P (H3jCi; X1) P (Ci \ X1)
P (H3jC2; X1)
P3
i=1 P (H3jCi; X1)
1
2
= :
1=2 + 1 + 0
3
3
Figure 1: Tree Showing the Probability of Every Possible Outcome If the Player Initially Picks
Door 1
Figure 1 shows the probability of every possible outcome if the player initially picks Door 1.
2
Random Variables
A random variable (r.v.) X is a measurable function from a sample space into the real line,
i.e., for any Borel set A in R, the inverse image of A is in F.1 This induces a new sample space - the
real line - and a new probability function on the real line. Typically, we denote r.v.’s by uppercase
letters such as X, and use lowercase letters such as x for potential values and realized values. For
a r.v. X we de…ne its cumulative distribution function (cdf) as
F (x) = P (X
x):
Sometimes we write this as FX (x) to denote that it is the cdf of X. A function F (x) is a cdf i¤ the
following three properties hold:
1.
lim F (x) = 0 and lim F (x) = 1
x! 1
x!1
2. F (x) is nondecreasing in x
3. F (x) is right-continuous
1
Measurability of X guarantees that for any Borel set A, P (X 2 A) P (! 2 jX (!) 2 A) is well de…ned. Since
the Borel -algebra can be generated by f( 1; x]jx 2 Rg, we need only de…ne P (X x) for x 2 R to calculate all
other probabilities.
4
The support of a r.v. X is the closure of the set of possible values that X can take, denoted as
supp(X). We say that the r.v. X is discrete if F (x) is a step function. In this case, the support
of X consists of a countable set of real numbers x1 ;
; xJ , where J can be 1. The probability
function for X takes the form of the probability mass function (pmf)
P (X = xj ) = pj , j = 1;
; J;
(2)
P
where 0
pj
1 and Jj=1 pj = 1. A famous discrete r.v. is the Bernoulli r.v., where J = 2,
x1 = 0, x2 = 1, p2 = p and p1 = 1 p. We say that the r.v. X is continuous if F (x) is continuous
in x. In this case P (X = x) = 0 for all x 2 R so the representation (2) is unavailable. Instead, we
represent the relative probabilities by the probability density function (pdf)
f (x) =
so that
F (x) =
Z
d
F (x)
dx
x
f (u)du
1
and
P (a
X
b) =
Z
b
f (u)du:
a
These expressions make sense only if F (x) is di¤erentiable. While there are examples of continuous
r.v.’s which do not possess a pdf, these cases are unusual and are typically ignored. A function
R1
f (x) is a pdf i¤ f (x) 0 for all x 2 R and 1 f (x)dx = 1. The support of a continuous r.v. is
fx 2 Rjf (x) > 0g. One famous continuous r.v. is the normal r.v. which will be discussed in Section
7. Another famous continuous r.v. is the uniform r.v. whose pdf is f (x) = 1 (0 x 1), i.e., X
can occur only on [0; 1] and occurs uniformly. We denote X
U [0; 1]. A generalization is the
uniform r.v. on [a; b], a < b, whose pdf is f (x) = 1 (a x b) =(b a). We denote X U [a; b].
3
Expectation
For any measurable real function g, we de…ne the mean or expectation E [g(X)] as follows. If X
is discrete,
J
X
E [g(X)] =
g(xj )pj ;
j=1
and if X is continuous
E [g(X)] =
Z
1
g(x)f (x)dx:
1
The latter is well de…ned and …nite if
Z
1
1
jg(x)j f (x)dx < 1:
5
(3)
If (3) does not hold, evaluate
I1 =
Z
g(x)f (x)dx;
g(x)>0
I2 =
Z
g(x)f (x)dx:
g(x)<0
If I1 = 1 and I2 < 1 then we de…ne E [g(X)] = 1. If I1 < 1 and I2 = 1 then we de…ne
E [g(X)] = 1. If both I1 = 1 and I2 = 1 then E [g(X)] is unde…ned.
To cover both the discrete and continuous case (and even more general cases), we often write
R
E [g(X)] as a Riemann-Stieltjes integral g(x)dF (x) (why intuitively correct? see Chapter 1).
Since E [a + bX] = a + b E [X], we say that expectation is a linear operator.
For m > 0, we de…ne the m0 th moment of X as E [X m ], the m0 th central moment as
E [(X E[X])m ], the m0 th absolute moment of X as E [jXjm ], and the m0 th absolute central
moment as E [jX E[X]jm ].
h
i
2.
Two special moments are the mean = E[X] and variance 2 = E (X
)2 = E X 2
p
2 the standard deviation of X. We can also write 2 = V ar(X). For example,
We call =
this allows the convenient expression V ar(a + bX) = b2 V ar(X).
The moment generating function (MGF) of X is
M ( ) = E [exp ( X)] :
The MGF does not necessarily exist. However, when it does and E [X m ] < 1 then
E [X m ] =
dm
M( )
d m
;
=0
which is why it is called the moment generating function.
More generally, the characteristic function (CF) of X is
C( ) = E [exp (i X)] ;
where i =
p
1 is the imaginary unit. The CF always exists, and when E [jXjm ] < 1,
E [X m ] =
1 dm
C( )
im d m
which is called Moment Theorem.
The Lp norm, p 1, of the r.v. X is
kXkp = (E [jXjp ])1=p :
6
;
=0
4
Multivariate Random Variables
A pair of bivariate r.v.’s (X; Y ) is a function from the sample space into R2 . The joint cdf of (X; Y )
is
F (x; y) = P (X x; Y
y) :
If F is continuous, the joint probability density function is
@2
F (x; y):
@x@y
f (x; y) =
R2 ,
For a Borel measurable set A
P ((X; Y ) 2 A) =
Z Z
f (x; y)dxdy:
A
For any measurable function g(x; y),
E [g(X; Y )] =
Z
1
1
Z
1
g(x; y)f (x; y)dxdy:
1
The marginal distribution of X is
FX (x) = P (X
x) = lim F (x; y) =
y!1
Z
x
1
Z
1
f (x; y)dydx;
1
so the marginal density of X is
fX (x) =
d
FX (x) =
dx
Z
1
f (x; y)dy:
1
Similarly, the marginal density of Y is
fY (y) =
Z
1
f (x; y)dx:
1
The r.v.’s X and Y are de…ned to be independent if f (x; y) = fX (x)fY (y). Furthermore,
X and Y are independent i¤ there exist functions g(x) and h(y) such that f (x; y) = g(x)h(y).
(Exercise)
If X and Y are independent, then
E [g(X)h(Y )] =
=
=
Z
Z
Z
1
Z
1
g(x)h(y)f (x; y)dxdy
1
1
1 Z 1
1
1
1
g(x)h(y)fX (x)fY (y)dxdy
Z 1
g(x)fX (x)dx
h(y)fY (y)dy
1
= E [g(X)] E [h(Y )] ;
7
1
(4)
if the expectations exist. For example, if X and Y are independent then
E [XY ] = E [X] E [Y ] :
Another implication of (4) is that if X and Y are independent and Z = X + Y , then
MZ ( ) = E [exp ( (X + Y ))]
= E [exp ( X) exp ( Y )]
= E [exp ( X)] E [exp ( Y )]
= MX ( )MY ( ):
The covariance between X and Y is
Cov(X; Y ) =
XY
= E [(X
E [X]) (Y
E [Y ])] = E [XY ]
E [X] E [Y ] :
The correlation between X and Y is
Corr(X; Y ) =
XY
=
XY
:
X Y
The Cauchy-Schwarz Inequality implies that
j
XY j
1:
The correlation is a measure of linear dependence, free of units of measurement.
If X and Y are independent, then XY = 0 and XY = 0. The reverse, however, is not true.
For example, if E [X] = 0 and E X 3 = 0, then Cov(X; X 2 ) = 0.
A useful fact is that
V ar(X + Y ) = V ar(X) + V ar(Y ) + 2Cov(X; Y ):
An implication is that if X and Y are independent, then
V ar(X + Y ) = V ar(X) + V ar(Y );
the variance of the sum is the sum of the variances.
A k 1 random vector X = (X1 ;
; Xk )0 is a function from to Rk . Let x = (x1 ;
denote a vector in Rk . The vector X has the distribution and density functions
F (x) = P (X
x) ;
k
@
f (x) =
F (x):
@x1
@xk
8
; xk )0
For a measurable function g : Rk ! Rs , we de…ne the expectation
Z
g(x)f (x)dx;
E [g(X)] =
Rk
where the symbol dx denotes dx1
dxk . In particular, we have the k
1 multivariate mean
= E [X]
and k
k covariance matrix
= E (X
)0 = E XX 0
) (X
If the elements of X are mutually independent, then
V ar
k
X
Xi
i=1
5
!
k
X
=
0
:
is a diagonal matrix and
V ar (Xi ) :
i=1
Conditional Distributions and Expectation
The conditional density of Y given X = x is de…ned as
fY jX (yjx) =
f (x; y)
fX (x)
if fX (x) > 0. One way to derive this expression from the de…nition of conditional probability is
fY jX (yjx) =
=
=
=
=
@
lim P (Y
yjx X x + ")
@y "!0
@
P (fY
yg \ fx X x + "g)
lim
@y "!0
P (x X x + ")
@
F (x + "; y) F (x; y)
lim
@y "!0 FX (x + ") FX (x)
@
F (x + "; y)
@
lim @x
@y "!0 fX (x + ")
@
@y@x F
(x; y)
fX (x)
=
f (x; y)
:
fX (x)
The conditional mean or conditional expectation is the function
m(x) = E [Y jX = x] =
Z
1
1
yfY jX (yjx)dy:
The conditional mean m(x) is a function, meaning that when X equals x, then the expected value
9
of Y is m(x). Similarly, we de…ne the conditional variance of Y given X = x as
2
(x) = V ar (Y jX = x)
i
h
= E (Y m(x))2 X = x
= E Y2 X =x
m(x)2 :
Evaluated at x = X, the conditional mean m(X) and conditional variance 2 (X) are r.v.’s, functions of X. We write this as E [Y jX = x] = m(X) and V ar (Y jX) = 2 (X). For example, if
E [Y jX = x] = + 0 x, then E [Y jX] = + 0 X, a transformation of X.
(
8xy; if 0 < x < y < 1;
Example 2 Suppose fXY (x; y) =
Are X and Y independent? What
0;
otherwise.
are the marginal densities of X and Y ? What is the conditional density of Y given X? What are
E [Y ] and V ar (Y )? What are E [Y jX = x] and V ar (Y jX = x)?
Solution: It is tempting to claim that X and Y are independent since it seems that f (x; y) =
g(x)h(y) with g(x) = 8x and h(y) = y. However, such an argument neglects the support of (X; Y )
which is shown in the left panel of Figure 2. Note that the support of (X; Y ), supp(X; Y ) =
f(x; y) j0 x y 1g, where the boundary is included.
The marginal density of X is
fX (x) =
Z
1
8xydy = 4x(1
x2 ); 0 < x < 1
x
and the marginal density of Y is
fY (y) =
Z
y
8xydx = 4y 3 ; 0 < y < 1;
0
so
E [Y ] =
Z
0
1
4
4y ydy = ; E Y 2 =
5
3
Z
0
1
2
4y 3 y 2 dy = ;
3
and
V ar (Y ) = E Y 2
E [Y ]2 =
2
3
4
5
2
=
2
:
75
Obviously,
fXY (x; y) = 8xy 1 (0 < x < y < 1)
6= 4x(1
x2 ) 1 (0 < x < 1) 4y 3 1 (0 < y < 1)
= fX (x)fY (y);
so X and Y are not independent.
10
The conditional density of Y given X = x is
f (x; y)
fY jX (yjx) =
=
fX (x)
(
8xy
4x(1 x2 )
=
2y
;
1 x2
0;
if 0 < x < y < 1;
otherwise.
So when 0 < x < 1,
E Y 2 jX = x
Z
1
2 1 + x + x2
2y
dy
=
;
1 x2
3 1+x
x
Z 1
2y
1 1 + x + x2 + x3
y2
=
dy
=
;
1 x2
2
1+x
x
E [Y jX = x] =
y
and
V ar (Y jX = x) = E Y 2 jX = x
=
=
E [Y jX = x]2
1 1 + x + x2 + x3
2
1+x
(1
2 1 + x + x2
3 1+x
2
x)2 (1 + 4x + x2 )
:
18(1 + x)2
The right panel of Figure 2 shows E [Y jX = x] and V ar (Y jX = x) as a function of x. Obviously,
E [Y jX = x] is increasing with E [Y jX = 0] = 2=3 and E [Y jX = 1] = 1, and V ar (Y jX = x)
is decreasing with V ar (Y jX = 0) = 1=18 and V ar (Y jX = 1) = 0. Since E [Y jX = x] and
V ar (Y jX = x) are not constant, X and Y are not independent (why?). Another observation
is that
Z 1
2 1 + x + x2
4
=
4x(1 x2 )dx
E [Y ] =
5
3
1
+
x
0
Z
=
E [Y jX = x] fX (x)dx:
This is an example of the famous law of iterated expectations (LIE) as explained below.
Theorem 2 (Simple Law of Iterated Expectations) E [E [Y jX]] = E [Y ] :
Proof.
E [E [Y jX]] = E [m(X)]
Z 1
=
m (x) fX (x)dx
1
Z 1Z 1
=
yfY jX (yjx)fX (x)dydx
1
1
Z 1
Z 1
=
yf (y; x)dydx
1
= E [Y ] :
1
11
Figure 2: supp(X; Y ), E [Y jX = x] and V ar (Y jX = x)
Corollary 1 (Law of Iterated Expectations) E [E [Y jX; Z] jX] = E [Y jX] :
Theorem 3 (Conditioning Theorem) For any function g(x),
E [g(X)Y jX] = g(X)E [Y jX] :
Proof. Let
h(x) = E [g(X)Y jX = x]
Z 1
=
g(x)yfY jX (yjx)dy
1
Z 1
yfY jX (yjx)dy
= g(x)
1
= g(x)m(x);
where m(x) = E [Y jX = x]. Thus h(X) = g(X)m(X), which is the same as E [g(X)Y jX] =
g(X)E [Y jX].
6
Transformations
Suppose that X 2 Rk with continuous distribution function FX (x) and density fX (x). Let Y =
g(X) where g(x) : Rk ! Rk is one-to-one, di¤erentiable, and invertible. Let h(y) denote the
inverse of g(x). Recall that the Jacobian of h is de…ned as
J(y) = det (Dh(y)) ;
12
where for a matrix A, det(A) is the determinant of A.
Consider the univariate case k = 1. If g(x) is an increasing function, then g(X)
so the distribution function of Y is
FY (y) = P (g(X)
y) = P (X
y i¤ X
h(y),
h(y)) = FX (h(y)):
Taking the derivative, the density of Y is
fY (y) =
d
d
FY (y) = fX (h(y)) h(y):
dy
dy
If g(x) is a decreasing function, then g(X)
FY (y) = P (g(X)
y i¤ X
y) = P (X
h(y), so
h(y)) = 1
FX (h(y));
and the density of Y is
fY (y) =
d
FY (y) =
dy
fX (h(y))
d
h(y):
dy
We can write these two cases jointly as
fY (y) = fX (h(y)) jJ(y)j :
(5)
This is known as the change-of-variables formula. This same formula (5) holds for k > 1, but
its justi…cation requires deeper results from analysis.
Example 3 Suppose X U [0; 1] and Y = log(X). Here, g(x) = log(x) and h(y) = exp( y)
so the Jacobian is J(y) = exp( y). As the range of X is [0; 1], that for Y is [0; 1]. Since
fX (x) = 1 for 0 x 1, (5) shows that fY (y) = exp( y), 0 y 1, an exponential density.
7
Normal and Related Distributions
The standard normal density is
1
(x) = p exp
2
x2
2
; 1 < x < 1:
It is conventional to write X
N (0; 1), and to denote the standard normal density function by
(x) and its distribution function by (x). The latter has no closed-form solution. The normal
density has all moments …nite. Since it is symmetric about zero all odd moments are zero. By
iterated integration by parts, we can also show that E X 2 = 1 and E X 4 = 3. In fact, for any
positive integer m, E X 2m = (2m 1)!! = (2m 1)(2m 3) 1. Thus E X 4 = 3, E X 6 = 15,
E X 8 = 105, and E X 10 = 945.
If Z is standard normal and X = + Z, then using the change-of-variables formula, X has
13
density
1
f (x) = p
2
)2
(x
exp
2
2
!
; 1 < x < 1:
which is the univariate normal density. The mean and variance of the distribution are
2 , and it is conventional to write X
N ( ; 2 ).
For x 2 Rk , the multivariate normal density is
f (x) = p
1
1=2
2 det ( )
(x
exp
1 (x
)
)
2
and
; x 2 Rk :
The mean and covariance matrix of the distribution are and . When X has the multivariate
normal density, we call X follows the multivariate normal distribution and write X N ( ; ).
0
The MGF and CF of the multivariate normal are exp 0 + 0
=2 and exp i 0
=2 ,
respectively.
Ifn X o2 Rk is multivariate normal and the elements of X are mutually uncorrelated, then
=
2
diag j is a diagonal matrix. In this case the density function can be written as
f (x) =
=
p
1
2
k
Y
j=1
p
exp
1
k
1
2
exp
2
2
1) = 1
(x1
2
2
k) = k
(xk
2
xj
j
2
j
2
j
!
2
!
;
which is the product of marginal univariate normal densities. This shows that if X is multivariate
normal with uncorrelated elements, then they are mutually independent.
Theorem 4 If X
N ( ; ) is a k-dimensional multivariate normal vector, and Y = a + BX,
where B is an m k matrix with full row rank, then Y
N (a + BX; B B0 ) is a m-dimensional
multivariate normal vector.
Theorem 5 Let X
written as 2r .
Theorem 6 If Z
N (0; Ir ). Then Q = X 0 X is distributed chi-square with r degrees of freedom,
N (0; A) with A > 0, q
Theorem 7 Let Z
N (0; 1) and Q
student’s t with r degrees of freedom.
Remark 1 T1
2
r
q, then Z 0 A
1Z
p
be independent. Then Tr = Z= Q=r is distributed as
N (0; 1). E[Tr ] = 0 for r > 1 and V ar (Tr ) = r=(r
2 and Q
2 be independent. Then F
Theorem 8 Let P
q;r =
q
r
F distribution with q and r degrees of freedom.
Remark 2 qFq;1
2.
q
2.
q
14
2) for r > 2.
P=q
Q=r
is distributed as Fisher’s
0.4
0.35
0.3
Density
0.25
0.2
0.15
0.1
0.05
0
-5
-4
-3
-2
-1
0
1
2
3
4
5
Figure 3: Density of the t Distribution for r = 1; 2; 5; 1
3.5
3
Density
2.5
2
1.5
1
0.5
0
0
0.5
1
1.5
2
2.5
3
Figure 4: Density of the F Distribution for (q; r) = (1; 1) ; (2; 1) ; (5; 2) ; (10; 1) ; (100; 100)
15

Download Report

Chapter 5. Introduction to Probability Theory

Paperzz.com

Your Paperzz