Probability Theory

Chapter 1
Probability Theory
1.1 Set Theory
One of the main objectives of a statistician is to draw conclusions
about a population of objects by conducting an experiment. The
first step in this endeavor is to identify the possible outcomes or, in
statistical terminology, the sample space.
Definition 1.1.1 The set, S, of all possible outcomes of a particular experiment is called the sample space for the experiment.
If the experiment consists of tossing a coin, the sample space contains two outcomes, heads and tails; thus, S = {H, T }.
Consider an experiment where the observation is reaction time to a
certain stimulus. Here, the sample space would consist of all positive
numbers, that is, S = (0, ∞).
A sample space can be classified into two types: countable and
uncountable. If the elements of a sample space can be put into 1–1 correspondence with a subset of integers, the sample space is countable.
Otherwise, it is uncountable.
Definition 1.1.2 An event is any collection of possible outcomes
of an experiment, that is, any subset of S (including S itself ).
Let A be an event, a subset of S. We say the event A occurs if the
outcome of the experiment is in the set A.
We first define two relationships of sets, which allow us to order
and equate sets:
A ⊂ B ⇔ x ∈ A ⇒ x ∈ B   (containment)
A = B ⇔ A ⊂ B and B ⊂ A.   (equality)
Given any two events (or sets) A and B, we have the following
elementary set operations:
Union: The union of A and B, written A∪B, is the set of elements
that belong to either A or B or both:
A ∪ B = {x : x ∈ A or x ∈ B}.
Intersection: The intersection of A and B, written A ∩ B, is the
set of elements that belong to both A and B:
A ∩ B = {x : x ∈ A and x ∈ B}.
Complementation: The complement of A, written Ac, is the
set of all elements that are not in A:
Ac = {x : x ∉ A}.
Example 1.1.1 (Event operations) Consider the experiment of
selecting a card at random from a standard deck and noting its
suit: clubs (C), diamonds (D), hearts (H), or spades (S). The
sample space is
S = {C, D, H, S},
and some possible events are
A = {C, D},
and
B = {D, H, S}.
From these events we can form
A ∪ B = {C, D, H, S},
A ∩ B = {D},
and
Ac = {H, S}.
Furthermore, notice that A ∪ B = S and (A ∪ B)c = ∅, where ∅
denotes the empty set (the set consisting of no elements).
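The event operations of Example 1.1.1 can be sketched with Python's built-in set type; using Python here is our own illustration, not part of the text.

```python
# Sample space and events from Example 1.1.1 (suits of a card).
S = {"C", "D", "H", "S"}           # sample space: the four suits
A = {"C", "D"}
B = {"D", "H", "S"}

union = A | B                      # A ∪ B
intersection = A & B               # A ∩ B
A_complement = S - A               # Ac, the complement taken relative to S

print(union == S)                  # A ∪ B = S, as noted in the example
print((A | B) == S and len(S - (A | B)) == 0)   # (A ∪ B)c = ∅
```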
Theorem 1.1.1 For any three events, A, B, and C, defined on
a sample space S,
a. Commutativity. A ∪ B = B ∪ A, A ∩ B = B ∩ A.
b. Associativity. A ∪ (B ∪ C) = (A ∪ B) ∪ C, A ∩ (B ∩ C) =
(A ∩ B) ∩ C.
c. Distributive Laws. A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C), A ∪ (B ∩
C) = (A ∪ B) ∩ (A ∪ C).
d. DeMorgan’s Laws. (A ∪ B)c = Ac ∩ Bc, (A ∩ B)c = Ac ∪ Bc.
The operations of union and intersection can be extended to infinite
collections of sets as well. If A1, A2, A3, . . . is a collection of sets, all
defined on a sample space S, then
∪_{i=1}^∞ Ai = {x ∈ S : x ∈ Ai for some i},
∩_{i=1}^∞ Ai = {x ∈ S : x ∈ Ai for all i}.
For example, let S = (0, 1] and define Ai = [(1/i), 1]. Then
∪_{i=1}^∞ Ai = ∪_{i=1}^∞ [(1/i), 1] = (0, 1],
∩_{i=1}^∞ Ai = ∩_{i=1}^∞ [(1/i), 1] = {1}.
It is also possible to define unions and intersections over uncountable collections of sets. If Γ is an index set (a set of elements to be
used as indices), then
∪_{a∈Γ} Aa = {x ∈ S : x ∈ Aa for some a},
∩_{a∈Γ} Aa = {x ∈ S : x ∈ Aa for all a}.
Definition 1.1.3 Two events A and B are disjoint (or mutually exclusive) if A ∩ B = ∅. The events A1, A2, . . . are pairwise disjoint (or mutually exclusive) if Ai ∩ Aj = ∅ for all i ≠ j.
Definition 1.1.4 If A1, A2, . . . are pairwise disjoint and ∪_{i=1}^∞ Ai = S, then the collection A1, A2, . . . forms a partition of S.
1.2 Basics of Probability Theory
When an experiment is performed, the realization of the experiment
is an outcome in the sample space. If the experiment is performed
a number of times, different outcomes may occur each time or some
outcomes may repeat. This “frequency of occurrence” of an outcome
can be thought of as a probability. More probable outcomes occur
more frequently. If the outcomes of an experiment can be described
probabilistically, we are on our way to analyzing the experiment statistically.
1.2.1 Axiomatic Foundations
Definition 1.2.1 A collection of subsets of S is called a sigma
algebra (or Borel field), denoted by B, if it satisfies the following
three properties:
a. ∅ ∈ B (the empty set is an element of B).
b. If A ∈ B, then Ac ∈ B (B is closed under complementation).
c. If A1, A2, . . . ∈ B, then ∪_{i=1}^∞ Ai ∈ B (B is closed under countable unions).
Example 1.2.1 (Sigma algebra-I) If S is finite or countable, these technicalities really do not arise; we define, for a given sample space S,
B = {all subsets of S, including S itself}.
If S has n elements, there are 2^n sets in B. For example, if
S = {1, 2, 3}, then B is the following collection of 2^3 = 8 sets:
{1}, {1, 2}, {1, 2, 3}, {2}, {1, 3}, ∅, {3}, {2, 3}.
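A short sketch of this construction: the sigma algebra for a finite S is its power set, which we can enumerate with the standard library (the helper name `power_set` is our own).

```python
from itertools import combinations

def power_set(S):
    """All subsets of a finite set S, i.e. the sigma algebra of Example 1.2.1."""
    items = sorted(S)
    return [frozenset(c)
            for r in range(len(items) + 1)      # subset sizes 0, 1, ..., n
            for c in combinations(items, r)]

B = power_set({1, 2, 3})
print(len(B))   # 2^3 = 8 sets, as in the text
```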
Example 1.2.2 (Sigma algebra-II) Let S = (−∞, ∞), the real
line. Then B is chosen to contain all sets of the form
[a, b],
(a, b],
(a, b),
[a, b)
for all real numbers a and b. Also, from the properties of B, it follows that B contains all sets that can be formed by taking (possibly countably infinite) unions and intersections of sets of the above varieties.
Definition 1.2.2 Given a sample space S and an associated sigma
algebra B, a probability function is a function P with domain B
that satisfies
1. P (A) ≥ 0 for all A ∈ B.
2. P (S) = 1.
3. If A1, A2, . . . ∈ B are pairwise disjoint, then P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).
The three properties given in the above definition are usually referred to as the Axioms of Probability or the Kolmogorov Axioms.
Any function P that satisfies the Axioms of Probability is called a
probability function.
The following gives a common method of defining a legitimate probability function.
Theorem 1.2.1 Let S = {s1, . . . , sn} be a finite set. Let B be
any sigma algebra of subsets of S. Let p1, . . . , pn be nonnegative
numbers that sum to 1. For any A ∈ B, define P (A) by
P(A) = Σ_{i: si ∈ A} pi.
(The sum over an empty set is defined to be 0.) Then P is a
probability function on B. This remains true if S = {s1, s2, . . .}
is a countable set.
Proof: We will give the proof for finite S. For any A ∈ B, P(A) = Σ_{i: si ∈ A} pi ≥ 0, because every pi ≥ 0. Thus, Axiom 1 is true. Now,
P(S) = Σ_{i: si ∈ S} pi = Σ_{i=1}^n pi = 1.
Thus, Axiom 2 is true. Let A1, . . . , Ak denote pairwise disjoint events.
(B contains only a finite number of sets, so we need to consider only
finite disjoint unions.) Then,
P(∪_{i=1}^k Ai) = Σ_{j: sj ∈ ∪_{i=1}^k Ai} pj = Σ_{i=1}^k Σ_{j: sj ∈ Ai} pj = Σ_{i=1}^k P(Ai).
The first and third equalities are true by the definition of P(A). The disjointness of the Ai’s ensures that the second equality is true, because the same pj’s appear exactly once on each side of the equality. Thus, Axiom 3 is true and Kolmogorov’s Axioms are satisfied. □
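The construction in Theorem 1.2.1 can be sketched directly: attach a nonnegative weight pi to each outcome and sum weights over an event. The fair-die weights below are an illustrative choice, not part of the theorem.

```python
# A minimal sketch of Theorem 1.2.1: P(A) sums the weights of the
# outcomes in A.  The weights must be nonnegative and sum to 1.
def make_prob(weights):
    assert abs(sum(weights.values()) - 1.0) < 1e-12   # the p_i sum to 1
    def P(A):
        return sum(weights[s] for s in A)             # sum over empty A is 0
    return P

# Illustrative example: a fair six-sided die.
P = make_prob({1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6})
print(P({2, 4, 6}))   # probability of an even roll
print(P(set()))       # P(∅) = 0, the sum over an empty set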
Example 1.2.3 (Defining probabilities-II) The game of darts is
played by throwing a dart at a board and receiving a score corresponding to the number assigned to the region in which the dart
lands. For a novice player, it seems reasonable to assume that the
probability of the dart hitting a particular region is proportional
to the area of the region. Thus, a bigger region has a higher probability of being hit.
The dart board has radius r and the distance between rings is
r/5. If we make the assumption that the board is always hit, then
we have
P(scoring i points) = (Area of region i)/(Area of dart board).
For example,
P(scoring 1 point) = (πr^2 − π(4r/5)^2)/(πr^2) = 1 − (4/5)^2.
It is easy to derive the general formula, and we find that
P(scoring i points) = ((6 − i)^2 − (5 − i)^2)/5^2,   i = 1, . . . , 5,
independent of π and r. The sum of the areas of the disjoint
regions equals the area of the dart board. Thus, the probabilities
that have been assigned to the five outcomes sum to 1, and, by Theorem 1.2.1, this is a probability function.
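A quick numerical check of the dart-board formula: the closed form agrees with the ratio of ring areas for any radius, and the five probabilities sum to 1. The radius value below is an arbitrary choice.

```python
import math

# Region i lies between radii (5-i)r/5 and (6-i)r/5; region 5 is the bullseye.
r = 2.5                                   # arbitrary; cancels in the ratio
def ring_area(i):
    return math.pi * ((6 - i) * r / 5) ** 2 - math.pi * ((5 - i) * r / 5) ** 2

# Closed form from the text vs. areas divided by the board area.
probs = [((6 - i) ** 2 - (5 - i) ** 2) / 25 for i in range(1, 6)]
areas = [ring_area(i) / (math.pi * r ** 2) for i in range(1, 6)]

print(probs)                              # decreasing toward the bullseye
print(abs(sum(probs) - 1.0) < 1e-12)      # the five probabilities sum to 1
```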
1.2.2 The Calculus of Probabilities
Theorem 1.2.2 If P is a probability function and A is any set
in B, then
a. P (∅) = 0, where ∅ is the empty set.
b. P (A) ≤ 1.
c. P (Ac) = 1 − P (A).
Theorem 1.2.3 If P is a probability function and A and B are
any sets in B, then
a. P (B ∩ Ac) = P (B) − P (A ∩ B).
b. P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
c. If A ⊂ B, then P (A) ≤ P (B).
Formula (b) of Theorem 1.2.3 gives a useful inequality for the probability of an intersection. Since P(A ∪ B) ≤ 1, we have
P(A ∩ B) ≥ P(A) + P(B) − 1.
This inequality is a special case of what is known as Bonferroni’s
inequality.
Theorem 1.2.4 If P is a probability function, then
a. P(A) = Σ_{i=1}^∞ P(A ∩ Ci) for any partition C1, C2, . . .;
b. P(∪_{i=1}^∞ Ai) ≤ Σ_{i=1}^∞ P(Ai) for any sets A1, A2, . . . (Boole’s Inequality).
Proof: Since C1, C2, . . . form a partition, we have that Ci ∩ Cj = ∅ for all i ≠ j, and S = ∪_{i=1}^∞ Ci. Hence,
A = A ∩ S = A ∩ (∪_{i=1}^∞ Ci) = ∪_{i=1}^∞ (A ∩ Ci),
where the last equality follows from the Distributive Law. We therefore have
P(A) = P(∪_{i=1}^∞ (A ∩ Ci)).
Now, since the Ci are disjoint, the sets A ∩ Ci are also disjoint, and
from the properties of a probability function we have
P(∪_{i=1}^∞ (A ∩ Ci)) = Σ_{i=1}^∞ P(A ∩ Ci).
To establish (b) we first construct a disjoint collection A∗1, A∗2, . . . , with the property that ∪_{i=1}^∞ A∗i = ∪_{i=1}^∞ Ai. We define A∗i by
A∗1 = A1,   A∗i = Ai \ (∪_{j=1}^{i−1} Aj),   i = 2, 3, . . . ,
where the notation A\B denotes the part of A that does not intersect
with B. In more familiar symbols, A \ B = A ∩ Bc. It should be easy to see that ∪_{i=1}^∞ A∗i = ∪_{i=1}^∞ Ai, and we therefore have
P(∪_{i=1}^∞ Ai) = P(∪_{i=1}^∞ A∗i) = Σ_{i=1}^∞ P(A∗i),
where the equality follows since the A∗i ’s are disjoint. To see this, we
write
A∗i ∩ A∗k = {Ai \ (∪_{j=1}^{i−1} Aj)} ∩ {Ak \ (∪_{j=1}^{k−1} Aj)}   (definition of A∗i)
= {Ai ∩ (∪_{j=1}^{i−1} Aj)^c} ∩ {Ak ∩ (∪_{j=1}^{k−1} Aj)^c}   (definition of \)
= {Ai ∩ (∩_{j=1}^{i−1} Aj^c)} ∩ {Ak ∩ (∩_{j=1}^{k−1} Aj^c)}.   (DeMorgan’s Laws)
Now if i > k, the first intersection above will be contained in the set Ak^c, which will have an empty intersection with Ak. If k > i, the
argument is similar. Further, by construction A∗i ⊂ Ai, so P (A∗i ) ≤
P (Ai) and we have
Σ_{i=1}^∞ P(A∗i) ≤ Σ_{i=1}^∞ P(Ai),
establishing (b). □
There is a similarity between Boole’s Inequality and Bonferroni’s
Inequality. If we apply Boole’s Inequality to the complements Ai^c, we have
P(∪_{i=1}^n Ai^c) ≤ Σ_{i=1}^n P(Ai^c),
and using the facts that ∪Ai^c = (∩Ai)^c and P(Ai^c) = 1 − P(Ai), we obtain
1 − P(∩_{i=1}^n Ai) ≤ n − Σ_{i=1}^n P(Ai).
This becomes, on rearranging terms,
P(∩_{i=1}^n Ai) ≥ Σ_{i=1}^n P(Ai) − (n − 1),
which is a more general version of the Bonferroni Inequality.
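Both inequalities are easy to verify numerically by brute force on a finite sample space. The sketch below checks them on the two-dice sample space; the three events chosen are our own illustration.

```python
from fractions import Fraction

# Two fair dice: 36 equally likely outcomes.
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]
def P(event):
    return Fraction(sum(1 for s in S if event(s)), len(S))

events = [lambda s: s[0] == s[1],        # doubles
          lambda s: s[0] + s[1] >= 10,   # large sum
          lambda s: s[0] == 1]           # first die shows 1

union = P(lambda s: any(e(s) for e in events))
inter = P(lambda s: all(e(s) for e in events))
total = sum(P(e) for e in events)

print(union <= total)        # Boole:      P(∪Ai) ≤ Σ P(Ai)
print(inter >= total - 2)    # Bonferroni: P(∩Ai) ≥ Σ P(Ai) − (n − 1), n = 3
```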
1.2.3 Counting
Methods of counting are often used in order to construct probability
assignments on finite sample spaces, although they can be used to answer other questions also. The following theorem is sometimes known
as the Fundamental Theorem of Counting.
Theorem 1.2.5 If a job consists of k separate tasks, the ith of
which can be done in ni ways, i = 1, . . . , k, then the entire job
can be done in n1 × n2 × · · · × nk ways.
Example 1.2.4 For a number of years the New York state lottery
operated according to the following scheme. From the numbers
1,2, . . ., 44, a person may pick any six for her ticket. The winning
number is then decided by randomly selecting six numbers from
the forty-four. So the first number can be chosen in 44 ways, and
the second number in 43 ways, making a total of 44 × 43 = 1892
ways of choosing the first two numbers. However, if a person
is allowed to choose the same number twice, then the first two
numbers can be chosen in 44 × 44 = 1936 ways.
The above example makes a distinction between counting with replacement and counting without replacement. The second crucial
element in counting is whether or not the ordering of the tasks is
important. Taking all of these considerations into account, we can
construct a 2 × 2 table of possibilities.
Number of possible arrangements of size r from n objects

              Without replacement   With replacement
Ordered       n!/(n − r)!           n^r
Unordered     (n choose r)          (n + r − 1 choose r)
Let us consider counting all of the possible lottery tickets under
each of these four cases.
Ordered, without replacement: From the Fundamental Theorem of Counting, there are
44 × 43 × 42 × 41 × 40 × 39 = 44!/38! = 5,082,517,440
possible tickets.
Ordered, with replacement: Since each number can now be selected in 44 ways, there are
44 × 44 × 44 × 44 × 44 × 44 = 44^6 = 7,256,313,856
possible tickets.
Unordered, without replacement: From the Fundamental Theorem, six numbers can be arranged in 6! ways, so the total number of unordered tickets is
(44 × 43 × 42 × 41 × 40 × 39)/(6 × 5 × 4 × 3 × 2 × 1) = 44!/(6! 38!) = 7,059,052.
Unordered, with replacement: In this case, the total number of unordered tickets is
(44 × 45 × 46 × 47 × 48 × 49)/(6 × 5 × 4 × 3 × 2 × 1) = 49!/(6! 43!) = 13,983,816.
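All four lottery counts follow directly from the formulas in the 2 × 2 table and can be reproduced with the standard library:

```python
from math import comb, perm

n, r = 44, 6                  # 44 numbers, pick 6

print(perm(n, r))             # ordered, without replacement: 44!/38!
print(n ** r)                 # ordered, with replacement: 44^6
print(comb(n, r))             # unordered, without replacement: C(44, 6)
print(comb(n + r - 1, r))     # unordered, with replacement: C(49, 6)
```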
1.2.4 Enumerating Outcomes
The counting techniques of the previous section are useful when the
sample space S is a finite set and all the outcomes in S are equally
likely. Then probabilities of events can be calculated by simply counting the number of outcomes in the event. Suppose that S = {s1, . . . , sN }
is a finite sample space. Saying that all the outcomes are equally likely
means that P({si}) = 1/N for every outcome si. Then, we have, for
any event A,
P(A) = Σ_{si ∈ A} P({si}) = Σ_{si ∈ A} 1/N = (# of elements in A)/(# of elements in S).
Example 1.2.5 Consider choosing a five-card poker hand from
a standard deck of 52 playing cards. What is the probability of
having four aces? If we specify that four of the cards are aces,
then there are 48 different ways of specifying the fifth card. Thus,
P(four aces) = 48/(52 choose 5) = 48/2,598,960.
The probability of having four of a kind is
P(four of a kind) = (13 × 48)/(52 choose 5) = 624/2,598,960.
The probability of having exactly one pair is
P(exactly one pair) = 13 (4 choose 2)(12 choose 3) 4^3 / (52 choose 5) = 1,098,240/2,598,960.
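The counts in Example 1.2.5 can be checked with binomial coefficients from the standard library:

```python
from math import comb

total = comb(52, 5)                               # all five-card hands
four_aces = 48                                    # aces fixed; 48 fifth cards
four_kind = 13 * 48                               # 13 ranks for the quadruple
one_pair = 13 * comb(4, 2) * comb(12, 3) * 4**3   # pair rank, pair suits,
                                                  # three kicker ranks, their suits

print(total)                                      # 2,598,960 hands
print(four_aces / total, four_kind / total, one_pair / total)
```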
1.3 Conditional Probability and Independence
All of the probabilities that we have dealt with thus far have been
unconditional probabilities. A sample space was defined and all probabilities were calculated with respect to that sample space. In many
instances, however, we are in a position to update the sample space
based on new information. In such cases, we want to be able to update
probability calculations or to calculate conditional probabilities.
Example 1.3.1 (Four aces) Four cards are dealt from the top of
a well-shuffled deck. What is the probability that they are four
aces?
The probability is 1/(52 choose 4) = 1/270,725.
We can also calculate this probability by an “updating” argument, as follows. The probability that the first card is an ace is
4/52. Given that the first card is an ace, the probability that the
second card is an ace is 3/51. Continuing this argument, we get
the desired probability as
(4/52)(3/51)(2/50)(1/49) = 1/270,725.
Definition 1.3.1 If A and B are events in S, and P(B) > 0, then the conditional probability of A given B, written P(A|B), is
P(A|B) = P(A ∩ B)/P(B).   (1.1)
Note that what happens in the conditional probability calculation
is that B becomes the sample space: P (B|B) = 1. The intuition is
that our original sample space, S, has been updated to B. All further occurrences are then calibrated with respect to their relation to B.
Example 1.3.2 (Continuation of Example 1.3.1) Calculate the
conditional probabilities given that some aces have already been
drawn.
P(4 aces in 4 cards | i aces in i cards)
= P({4 aces in 4 cards} ∩ {i aces in i cards})/P(i aces in i cards)
= P(4 aces in 4 cards)/P(i aces in i cards)
= (52 choose i)/[(52 choose 4)(4 choose i)] = 1/(52 − i choose 4 − i) = (4 − i)! 48!/(52 − i)!.
For i = 1, 2, and 3, the conditional probabilities are .00005, .00082, and .02041, respectively.
Example 1.3.3 (Three prisoners) Three prisoners, A, B, and C,
are on death row. The governor decides to pardon one of the three
and chooses at random the prisoner to pardon. He informs the
warden of his choice but requests that the name be kept secret for
a few days.
The next day, A tries to get the warden to tell him who has
been pardoned. The warden refuses. A then asks which of B or C
will be executed. The warden thinks for a while, then tells A that
B is to be executed.
Warden’s reasoning: Each prisoner has a 1/3 chance of being pardoned. Clearly, either B or C must be executed, so I have given A no information about whether A will be pardoned.

A’s reasoning: Given that B will be executed, then either A or C will be pardoned. My chance of being pardoned has risen to 1/2.
Let A, B, and C denote the events that A, B, or C is pardoned,
respectively. We know that P (A) = P (B) = P (C) = 1/3. Let W
denote the event that the warden says B will die. Using (1.1), A
can update his probability of being pardoned to
P(A|W) = P(A ∩ W)/P(W).
What is happening can be summarized in this table:
Prisoner pardoned   Warden tells A   Probability
A                   B dies              1/6
A                   C dies              1/6
B                   C dies              1/3
C                   B dies              1/3
Using this table, we can calculate
P (W ) = P (warden says B dies)
= P (warden says B dies and A pardoned)
+ P (warden says B dies and C pardoned)
+ P (warden says B dies and B pardoned)
= 1/6 + 1/3 + 0 = 1/2.
Thus, using the warden’s reasoning, we have
P(A|W) = P(A ∩ W)/P(W)
= P(warden says B dies and A pardoned)/P(warden says B dies)
= (1/6)/(1/2) = 1/3.
However, A falsely interprets the event W as equal to the event
Bc and calculates
P(A|Bc) = P(A ∩ Bc)/P(Bc) = (1/3)/(2/3) = 1/2.
We see that conditional probabilities can be quite slippery and
require careful interpretation.
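The table above can be computed exactly with rational arithmetic, which makes the warden's conclusion concrete. The modeling choice that the warden names B or C by a fair coin flip when A is pardoned follows the table in the text.

```python
from fractions import Fraction

third, half = Fraction(1, 3), Fraction(1, 2)

# Joint probabilities P(pardoned prisoner, what the warden tells A).
table = {("A", "B dies"): third * half,   # warden flips a coin: 1/6
         ("A", "C dies"): third * half,   # 1/6
         ("B", "C dies"): third,          # forced: 1/3
         ("C", "B dies"): third}          # forced: 1/3

P_W = sum(p for (pard, says), p in table.items() if says == "B dies")
P_A_given_W = table[("A", "B dies")] / P_W

print(P_W)           # 1/2
print(P_A_given_W)   # 1/3: the warden is right; A has learned nothing
```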
The following are several variations of (1.1):
P(A ∩ B) = P(A|B)P(B),
P(A ∩ B) = P(B|A)P(A),
P(A|B) = P(B|A)P(A)/P(B).
Theorem 1.3.1 (Bayes’ Rule) Let A1, A2, . . . be a partition of
the sample space, and let B be any set. Then, for each i = 1, 2, . . .,
P(Ai|B) = P(B|Ai)P(Ai) / Σ_{j=1}^∞ P(B|Aj)P(Aj).
Example 1.3.4 (Codings) When coded messages are sent, there
are sometimes errors in transmission. In particular, Morse code
uses “dots” and “dashes”, which are known to occur in the proportion of 3:4. This means that for any given symbol,
P(dot sent) = 3/7 and P(dash sent) = 4/7.
Suppose there is interference on the transmission line, and with probability 1/8 a dot is mistakenly received as a dash, and vice versa.
If we receive a dot, can we be sure that a dot was sent?
Using Bayes’ Rule, we can write
P(dot sent|dot received) = P(dot received|dot sent) P(dot sent)/P(dot received).
P(dot received) = P(dot received ∩ dot sent) + P(dot received ∩ dash sent)
= P(dot received|dot sent)P(dot sent) + P(dot received|dash sent)P(dash sent)
= (7/8) × (3/7) + (1/8) × (4/7) = 25/56.
So we have
P(dot sent|dot received) = ((7/8) × (3/7))/(25/56) = 21/25.
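The Bayes computation above is exact in rational arithmetic:

```python
from fractions import Fraction

p_dot_sent = Fraction(3, 7)            # prior proportion of dots
p_dash_sent = Fraction(4, 7)           # prior proportion of dashes
p_flip = Fraction(1, 8)                # dot <-> dash transmission error

# Total probability of receiving a dot, then Bayes' Rule.
p_dot_received = (1 - p_flip) * p_dot_sent + p_flip * p_dash_sent
posterior = (1 - p_flip) * p_dot_sent / p_dot_received

print(p_dot_received)   # 25/56
print(posterior)        # 21/25
```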
In some cases it may happen that the occurrence of a particular
event, B, has no effect on the probability of another event, A. For
these cases, we have the following definition.
Definition 1.3.2 Two events, A and B, are statistically independent if
P (A ∩ B) = P (A)P (B).
Example 1.3.5 The gambler Chevalier de Mere was particularly interested in the event that he could throw at least 1 six in 4 rolls of a die. We have
P(at least 1 six in 4 rolls) = 1 − P(no six in 4 rolls)
= 1 − Π_{i=1}^4 P(no six on roll i)
= 1 − (5/6)^4 = 0.518.
The equality of the second line follows by independence of the
rolls.
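De Mere's calculation can be confirmed two ways: by the complement formula, and by brute-force enumeration of all 6^4 outcomes.

```python
from itertools import product

formula = 1 - (5 / 6) ** 4       # complement of "no six in four rolls"

# Enumerate all 1296 ordered outcomes and count those containing a six.
count = sum(1 for rolls in product(range(1, 7), repeat=4) if 6 in rolls)
brute = count / 6 ** 4

print(round(formula, 3))         # 0.518
print(abs(formula - brute) < 1e-12)
```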
Theorem 1.3.2 If A and B are independent events, then the
following pairs are also independent:
a. A and B c.
b. Ac and B.
c. Ac and B c.
Proof: (a) can be proved as follows.
P(A ∩ Bc) = P(A) − P(A ∩ B) = P(A) − P(A)P(B)   (A and B are independent)
= P(A)(1 − P(B)) = P(A)P(Bc). □
Example 1.3.6 (Tossing two dice) Let an experiment consist of tossing two dice. For this experiment the sample space is
S = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), . . . , (2, 6), . . . , (6, 1), . . . , (6, 6)}.
Define the following events:
A = {doubles appear} = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)},
B = {the sum is between 7 and 10},
C = {the sum is 2, 7 or 8}.
Thus, we have P (A) = 1/6, P (B) = 1/2 and P (C) = 1/3. Furthermore,
P(A ∩ B ∩ C) = P(the sum is 8, composed of double 4s)
= 1/36 = (1/6) × (1/2) × (1/3)
= P(A)P(B)P(C).
However,
P(B ∩ C) = P(sum equals 7 or 8) = 11/36 ≠ P(B)P(C).
Similarly, it can be shown that P(A ∩ B) ≠ P(A)P(B). Therefore, the requirement P(A ∩ B ∩ C) = P(A)P(B)P(C) is not a strong enough condition to guarantee pairwise independence.
Definition 1.3.3 A collection of events A1, . . . , An are mutually
independent if for any subcollection Ai1, . . . , Aik , we have
P(∩_{j=1}^k Aij) = Π_{j=1}^k P(Aij).
Example 1.3.7 (Three coin tosses) Consider the experiment of
tossing a coin three times. The sample space is {HHH, HHT, HT H,
T HH, T T H, T HT, HT T, T T T }. Let Hi, i = 1, 2, 3, denote the
event that the ith toss is a head. For example, H1 = {HHH, HHT,
HT H, HT T }. Assuming that the coin is fair and has an equal
probability of landing heads or tails on each toss, the events H1,
H2 and H3 are mutually independent.
To verify this we note that
P(H1 ∩ H2 ∩ H3) = P({HHH}) = 1/8 = P(H1)P(H2)P(H3).
To verify the condition in Definition 1.3.3, we must check each
pair. For example,
P(H1 ∩ H2) = P({HHH, HHT}) = 2/8 = P(H1)P(H2).
The equality is also true for the other two pairs. Thus, H1, H2
and H3 are mutually independent.
1.4 Random Variables
Example 1.4.1 (Motivation example) In an opinion poll, we might
decide to ask 50 people whether they agree or disagree with a certain issue. If we record a “1” for agree and “0” for disagree, the
sample space for this experiment has 2^50 elements. If we define
a variable X=number of 1s recorded out of 50, we have captured
the essence of the problem. Note that the sample space of X is the set of integers {0, 1, 2, . . . , 50} and is much easier to deal with than the original sample space.
In defining the quantity X, we have defined a mapping (a function)
from the original sample space to a new sample space, usually a set
of real numbers. In general, we have the following definition.
Definition 1.4.1 A random variable is a function from a sample
space S into the real numbers.
Example 1.4.2 (Random variables) In some experiments random variables are implicitly used; some examples are these.
Experiment                           Random variable
Toss two dice                        X = sum of the numbers
Toss a coin 25 times                 X = number of heads in 25 tosses
Apply different amounts of
  fertilizer to corn plants          X = yield/acre
Suppose we have a sample space
S = {s1, . . . , sn}
with a probability function P and we define a random variable X with
range X = {x1, . . . , xm}. We can define a probability function PX
on X in the following way. We will observe X = xi if and only if the
outcome of the random experiment is an sj ∈ S such that X(sj ) = xi.
Thus,
PX(X = xi) = P({sj ∈ S : X(sj) = xi}).   (1.2)
Note PX is an induced probability function on X , defined in terms of
the original function P . Later, we will simply write PX (X = xi) =
P (X = xi).
Theorem 1.4.1 The induced probability function defined in (1.2)
defines a legitimate probability function in that it satisfies the Kolmogorov Axioms.
Proof: X is finite. Therefore B is the set of all subsets of X . We must verify each of the three properties of the axioms.
(1) If A ∈ B then PX (A) = P (∪xi∈A{sj ∈ S : X(sj ) = xi}) ≥ 0
since P is a probability function.
(2) PX(X ) = P(∪_{i=1}^m {sj ∈ S : X(sj) = xi}) = P(S) = 1.
(3) If A1, A2, . . . ∈ B are pairwise disjoint, then
PX(∪_{k=1}^∞ Ak) = P(∪_{k=1}^∞ {∪_{xi ∈ Ak} {sj ∈ S : X(sj) = xi}})
= Σ_{k=1}^∞ P(∪_{xi ∈ Ak} {sj ∈ S : X(sj) = xi}) = Σ_{k=1}^∞ PX(Ak),
where the second equality follows from the fact that P is a probability function. □
A note on notation: Random variables will always be denoted
with uppercase letters and the realized values of the variable will be
denoted by the corresponding lowercase letters. Thus, the random
variable X can take the value x.
Example 1.4.3 (Three coin tosses-II) Consider again the experiment of tossing a fair coin three times independently. Define the
random variable X to be the number of heads obtained in the three
tosses. A complete enumeration of the value of X for each point
in the sample space is
s       HHH  HHT  HTH  THH  TTH  THT  HTT  TTT
X(s)     3    2    2    2    1    1    1    0
The range for the random variable X is X = {0, 1, 2, 3}. Assuming that all eight points in S have probability 1/8, by simply
counting in the above display we see that the induced probability
function on X is given by
x            0    1    2    3
PX(X = x)   1/8  3/8  3/8  1/8
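The induced pmf can be built directly from definition (1.2): enumerate the eight equally likely outcomes and group them by the value of X(s).

```python
from itertools import product
from fractions import Fraction

# The eight outcomes of three coin tosses, each with probability 1/8.
S = ["".join(s) for s in product("HT", repeat=3)]   # HHH, HHT, ..., TTT
X = {s: s.count("H") for s in S}                    # X(s) = number of heads

# Definition (1.2): PX(X = x) = P({s in S : X(s) = x}).
pmf = {x: Fraction(sum(1 for s in S if X[s] == x), len(S)) for x in range(4)}
print(pmf)   # {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}
```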
The previous illustrations had both a finite S and finite X , and
the definition of PX was straightforward. Such is also the case if X
is countable. If X is uncountable, we define the induced probability
function, PX , in a manner similar to (1.2). For any set A ⊂ X ,
PX(X ∈ A) = P({s ∈ S : X(s) ∈ A}).   (1.3)
This does define a legitimate probability function for which the Kolmogorov Axioms can be verified.
1.5 Distribution Functions
Definition 1.5.1 The cumulative distribution function (cdf ) of
a random variable X, denoted by FX (x), is defined by
FX (x) = PX (X ≤ x),
for all x.
Example 1.5.1 (Tossing three coins) Consider the experiment
of tossing three fair coins, and let X =number of heads observed.
The cdf of X is

FX(x) = 0     if −∞ < x < 0
        1/8   if 0 ≤ x < 1
        1/2   if 1 ≤ x < 2
        7/8   if 2 ≤ x < 3
        1     if 3 ≤ x < ∞.
Remark:
1. FX is defined for all values of x, not just those in X = {0, 1, 2, 3}.
Thus, for example,
FX(2.5) = P(X ≤ 2.5) = P(X = 0, 1, or 2) = 7/8.
2. FX has jumps at the values of xi ∈ X and the size of the jump
at xi is equal to P (X = xi).
3. FX(x) = 0 for x < 0 since X cannot be negative, and FX(x) = 1 for x ≥ 3 since X is certain to be less than or equal to such a value.
FX is right-continuous, namely, the function is continuous when a
point is approached from the right. The property of right-continuity
is a consequence of the definition of the cdf. In contrast, if we had
defined FX (x) = PX (X < x), FX would then be left-continuous.
Theorem 1.5.1 The function FX(x) is a cdf if and only if the following three conditions hold:
a. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.
b. F(x) is a nondecreasing function of x.
c. F(x) is right-continuous; that is, for every number x0, lim_{x↓x0} F(x) = F(x0).
Example 1.5.2 (Tossing for a head) Suppose we do an experiment that consists of tossing a coin until a head appears. Let
p =probability of a head on any given toss, and define X =number
of tosses required to get a head. Then, for any x = 1, 2, . . .,
P(X = x) = (1 − p)^(x−1) p.
The cdf is
FX(x) = P(X ≤ x) = Σ_{i=1}^x P(X = i) = Σ_{i=1}^x (1 − p)^(i−1) p = 1 − (1 − p)^x.
It is easy to show that if 0 < p < 1, then FX (x) satisfies the
conditions of Theorem 1.5.1. First,
lim_{x→−∞} FX(x) = 0
since FX(x) = 0 for all x < 0, and
lim_{x→∞} FX(x) = lim_{x→∞} (1 − (1 − p)^x) = 1,
where x goes through only integer values when this limit is taken.
To verify property (b), we simply note that the sum contains more
positive terms as x increases. Finally, to verify (c), note that, for any x, FX(x + ε) = FX(x) if ε > 0 is sufficiently small. Hence,
lim_{ε↓0} FX(x + ε) = FX(x),
so FX (x) is right-continuous.
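The closed form of the geometric cdf can be checked numerically: summing the pmf up to x matches 1 − (1 − p)^x, and the values are nondecreasing. The particular value of p below is an arbitrary choice in (0, 1).

```python
p = 0.3   # any 0 < p < 1 works; 0.3 is arbitrary

def cdf(x):
    """Sum the geometric pmf (1-p)^(i-1) p for i = 1, ..., x."""
    return sum((1 - p) ** (i - 1) * p for i in range(1, x + 1))

closed = [1 - (1 - p) ** x for x in range(1, 11)]
summed = [cdf(x) for x in range(1, 11)]

print(all(abs(a - b) < 1e-12 for a, b in zip(closed, summed)))
print(all(summed[i] <= summed[i + 1] for i in range(9)))   # nondecreasing
```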
Example 1.5.3 (Continuous cdf ) An example of a continuous
cdf (logistic distribution) is the function
FX(x) = 1/(1 + e^(−x)).
It is easy to verify that
lim_{x→−∞} FX(x) = 0 and lim_{x→∞} FX(x) = 1.
Differentiating FX(x) gives
d/dx FX(x) = e^(−x)/(1 + e^(−x))^2 > 0,
showing that FX (x) is increasing. FX is not only right-continuous,
but also continuous.
Definition 1.5.2 A random variable X is continuous if FX (x) is
a continuous function of x. A random variable X is discrete if
FX (x) is a step function of x.
We close this section with a theorem formally stating that FX completely determines the probability distribution of a random variable
X. This is true if P (X ∈ A) is defined only for events A in B1, the
smallest sigma algebra containing all the intervals of real numbers of
the form (a, b), [a, b), (a, b], and [a, b]. If probabilities are defined for
a larger class of events, it is possible for two random variables to have
the same distribution function but not the same probability for every
event (see Chung 1974, page 27).
Definition 1.5.3 The random variables X and Y are identically
distributed if, for every set A ∈ B1, P (X ∈ A) = P (Y ∈ A).
Note that two random variables that are identically distributed are
not necessarily equal. That is, the above definition does not say that X = Y.
Example 1.5.4 (Identically distributed random variables) Consider the experiment of tossing a fair coin three times. Define the
random variables X and Y by
X =number of heads observed and
Y =number of tails observed.
For each k = 0, 1, 2, 3, we have P (X = k) = P (Y = k). So X
and Y are identically distributed. However, for no sample point
do we have X(s) = Y (s).
Theorem 1.5.2 The following two statements are equivalent:
a. The random variables X and Y are identically distributed.
b. FX (x) = FY (x) for every x.
Proof: To show equivalence we must show that each statement implies the other. We first show that (a) ⇒ (b).
Because X and Y are identically distributed, for any set A ∈ B1,
P (X ∈ A) = P (Y ∈ A). In particular, for every x, the set (−∞, x]
is in B1, and
FX (x) = P (X ∈ (−∞, x]) = P (Y ∈ (−∞, x]) = FY (x).
The above argument showed that if the X and Y probabilities agree on all sets, then they agree on all intervals. To show (b) ⇒ (a), we must prove that if the X and Y probabilities agree on all intervals, then
they agree on all sets. For more details see Chung (1974, section
2.2). □
1.6 Density and Mass Functions
Definition 1.6.1 The probability mass function (pmf ) of a discrete random variable X is given by
fX (x) = P (X = x) for all x.
Example 1.6.1 (Geometric probabilities) For the geometric distribution of Example 1.5.2, we have the pmf
fX(x) = P(X = x) = (1 − p)^(x−1) p for x = 1, 2, . . . , and 0 otherwise.
From this example, we see that a pmf gives us “point probability”.
In the discrete case, we can sum over values of the pmf to get the
cdf. The analogous procedure in the continuous case is to substitute
integrals for sums, and we get
Z
x
P (X ≤ x) = FX (x) =
fX (t)dt.
−∞
Using the Fundamental Theorem of Calculus, if fX (x) is continuous,
we have the further relationship
d/dx FX(x) = fX(x).
Definition 1.6.2 The probability density function or pdf, fX (x),
of a continuous random variable X is the function that satisfies
FX(x) = ∫_{−∞}^x fX(t) dt for all x.   (1.4)
A note on notation: The expression “X has a distribution given by FX(x)” is abbreviated symbolically by “X ∼ FX(x)”, where we read the symbol “∼” as “is distributed as”. We can similarly write X ∼ fX(x) or, if X and Y have the same distribution, X ∼ Y.
In the continuous case we can be somewhat cavalier about the specification of interval probabilities. Since P (X = x) = 0 if X is a
continuous random variable,
P (a < X < b) = P (a < X ≤ b) = P (a ≤ X < b) = P (a ≤ X ≤ b).
Example 1.6.2 (Logistic probabilities) For the logistic distribution, we have
FX(x) = 1/(1 + e^(−x)),
hence, we have
fX(x) = d/dx FX(x) = e^(−x)/(1 + e^(−x))^2,
and
P(a < X < b) = FX(b) − FX(a) = ∫_{−∞}^b fX(x) dx − ∫_{−∞}^a fX(x) dx = ∫_a^b fX(x) dx.
Theorem 1.6.1 A function fX(x) is a pdf (or pmf) of a random variable X if and only if
a. fX(x) ≥ 0 for all x.
b. Σ_x fX(x) = 1 (pmf) or ∫_{−∞}^∞ fX(x) dx = 1 (pdf).
Proof: If fX (x) is a pdf (or pmf), then the two properties are immediate from the definitions. In particular, for a pdf, using (1.4) and
Theorem 1.5.1, we have that
1 = lim_{x→∞} FX(x) = ∫_{−∞}^∞ fX(t) dt.
The converse implication is equally easy to prove. Once we have
fX(x), we can define FX(x) and appeal to Theorem 1.5.1. □