1 Probability Space

From the book "Probability Theory" by Hannelore Lisei (Editura Casa Cărţii de Ştiinţă, 2004)
1.1 Experiments and Events
The theory of probability pertains to the various possible outcomes that might
be obtained and the possible events that might occur when an experiment is performed. The terms trial or experiment are used in probability theory to describe
virtually any process or action whose outcome is not known in advance with certainty, i.e. it has a random behaviour.
A sample space, denoted S, is the set of all possible outcomes of an experiment.
The elements of S are called elementary events. For an elementary event e we
write e ∈ S to denote that e is an element of S. An event is a subset of the sample
space, i.e. it is a collection of elementary events. If A is an event, we write A ⊆ S
(or S ⊇ A) to denote that A is a subset of S (or that S is a superset of A). An
elementary event e is realized or occurs if it is the outcome of an experiment.
We consider two special events: the impossible event, denoted by ∅, which is the event that never happens (it does not occur in any trial), and the sure (also certain) event S, which happens in every trial.
For any event A ⊆ S we may associate its contrary event or complement, denoted Ā, which is realized if and only if A is not realized.
It is said that the event A implies the event B if every element of A is also an element of B, written A ⊆ B. If A implies B and B implies A, then the events A and B are equal: A = B.
The union of the events A and B, denoted by A ∪ B, is the set of all elementary
events that are either in A or in B, i.e.
A ∪ B = {e ∈ S : e ∈ A or e ∈ B}.
The intersection of the events A and B, denoted by A ∩ B, is the set of all elementary events that are both in A and in B, i.e.
A ∩ B = {e ∈ S : e ∈ A and e ∈ B}.
Two events A and B are said to be disjoint (or mutually exclusive) if A and B
have no common outcomes, i.e. A ∩ B = ∅.
The difference of the events A and B, denoted by A\B, is the set of all elementary
events that are in A but not in B, i.e.
A \ B = A ∩ B̄.
The symmetric difference of the events A and B, denoted by A∆B, is the set of
all elementary events that are in A or B but are not common to both, i.e.
A∆B = (A ∪ B) \ (A ∩ B).
The operations of union, intersection and symmetric difference are commutative:
A ∪ B = B ∪ A,
A ∩ B = B ∩ A,
A∆B = B∆A;
associative:
(A ∪ B) ∪ C = A ∪ (B ∪ C),
(A ∩ B) ∩ C = A ∩ (B ∩ C),
(A∆B)∆C = A∆(B∆C);
and distributive:
(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C),
(A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C),
A ∩ (B∆C) = (A ∩ B)∆(A ∩ C).
Example 1.1.
Suppose that we make an experiment by tossing a coin two times. Then the
sample space is
S = {e1 , e2 , e3 , e4 },
where the elementary events (outcomes) are:
e1 head is obtained on the first toss and tail in the second toss;
e2 tail is obtained on the first toss and tail in the second toss;
e3 head is obtained on the first toss and head in the second toss;
e4 tail is obtained on the first toss and head in the second toss.
Let A be the event that at least one head is obtained during the experiment, let B
be the event that a head is obtained on the second toss, C the event that a tail is
obtained on the first toss and D the event that no heads are obtained. We can write
A = {e1 , e3 , e4 }, B = {e3 , e4 }, C = {e2 , e4 }, D = {e2 }.
Some relations among these events are:
D ⊆ C, A ∩ D = ∅, B ∩ C = {e4}, A ∪ C = S, D̄ = A.
♦
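The event algebra of Example 1.1 can be checked mechanically with Python's built-in set type; the sketch below simply transcribes the outcomes e1–e4 and the events A, B, C, D from the example.

```python
# Sample space for two coin tosses, encoded as (first toss, second toss).
e1, e2, e3, e4 = "HT", "TT", "HH", "TH"
S = {e1, e2, e3, e4}

A = {e1, e3, e4}   # at least one head
B = {e3, e4}       # head on the second toss
C = {e2, e4}       # tail on the first toss
D = {e2}           # no heads

# The relations listed in the example:
assert D <= C                 # D implies C
assert A & D == set()         # A and D are disjoint
assert B & C == {e4}
assert A | C == S             # A and C are collectively exhaustive
assert S - D == A             # the complement of D is A
```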
Throughout this chapter let S denote a sample space.
A collection of events (Ai)i∈I (where I is a set of indices) from S is said to be collectively exhaustive if
⋃_{i∈I} Ai = S.
A collection (Ai)i∈I of events from S is said to be a partition of S if the events are collectively exhaustive and for all i, j ∈ I, i ≠ j, the events Ai and Aj are disjoint.
The lower limit of a sequence (An)n∈N of events from S is the event
lim inf_{n→∞} An = ⋃_{n=1}^∞ ⋂_{i=n}^∞ Ai.
The upper limit of a sequence (An)n∈N of events from S is the event
lim sup_{n→∞} An = ⋂_{n=1}^∞ ⋃_{i=n}^∞ Ai.
Note that lim inf_{n→∞} An is the event that occurs if all events of the sequence (An)n∈N occur except finitely many of them, while the event lim sup_{n→∞} An occurs if infinitely many events of the sequence (An)n∈N occur:
lim inf_{n→∞} An = {e ∈ S : e ∈ An for all n except a finite number},
lim sup_{n→∞} An = {e ∈ S : e ∈ An for infinitely many n}.
A sequence (An)n∈N of events from S is said to be monotone increasing if
An ⊆ An+1 for all n ∈ N,
and monotone decreasing if
An+1 ⊆ An for all n ∈ N.
Proposition 1.2.
(1) A collection (Ai)i∈I of events from S satisfies De Morgan’s laws:
( ⋃_{i∈I} Ai )‾ = ⋂_{i∈I} Āi,   ( ⋂_{i∈I} Ai )‾ = ⋃_{i∈I} Āi.
(2) For any sequence (An)n∈N of events from S it holds
lim inf_{n→∞} An ⊆ lim sup_{n→∞} An.
(3) For any monotonically increasing sequence (An)n∈N of events from S we have
lim inf_{n→∞} An = lim sup_{n→∞} An = ⋃_{n=1}^∞ An.
(4) For any monotonically decreasing sequence (An)n∈N of events from S it holds
lim inf_{n→∞} An = lim sup_{n→∞} An = ⋂_{n=1}^∞ An.
Proof. (1) We have the following equivalences:
e ∈ ( ⋃_{i∈I} Ai )‾ ⇔ e ∉ ⋃_{i∈I} Ai ⇔ e ∉ Ai for all i ∈ I ⇔ e ∈ Āi for all i ∈ I ⇔ e ∈ ⋂_{i∈I} Āi.
Using the property we have just proved (applied to the events Āi) and the fact that (Āi)‾ = Ai, we can write
⋂_{i∈I} Ai = ⋂_{i∈I} (Āi)‾ = ( ⋃_{i∈I} Āi )‾,
and taking complements on both sides gives ( ⋂_{i∈I} Ai )‾ = ⋃_{i∈I} Āi.
(2) Let e ∈ lim inf_{n→∞} An. Then there exists nₑ ∈ N such that e ∈ ⋂_{i=nₑ}^∞ Ai, i.e. e ∈ Ai for all i ∈ N, i ≥ nₑ. This implies
e ∈ ⋃_{i=n}^∞ Ai for all n ∈ N.
Hence
e ∈ ⋂_{n=1}^∞ ⋃_{i=n}^∞ Ai = lim sup_{n→∞} An.
Therefore,
lim inf_{n→∞} An ⊆ lim sup_{n→∞} An.
(3) For a monotonically increasing sequence (An)n∈N of events we have ⋂_{i=n}^∞ Ai = An for all n ∈ N and therefore
lim inf_{n→∞} An = ⋃_{n=1}^∞ ⋂_{i=n}^∞ Ai = ⋃_{n=1}^∞ An.
But
lim sup_{n→∞} An ⊆ ⋃_{n=1}^∞ An,
which implies by (2)
⋃_{n=1}^∞ An = lim inf_{n→∞} An ⊆ lim sup_{n→∞} An ⊆ ⋃_{n=1}^∞ An.
(4) If (An)n∈N is a monotonically decreasing sequence of events, then the sequence (Ān)n∈N is monotonically increasing and we can use the property proved in (3) to get
lim inf_{n→∞} Ān = lim sup_{n→∞} Ān = ⋃_{n=1}^∞ Ān.
By using De Morgan’s laws and the fact that (Ān)‾ = An for all n ∈ N, the desired property
lim inf_{n→∞} An = lim sup_{n→∞} An = ⋂_{n=1}^∞ An
is obtained.
1.2 Sigma Fields and Probabilities
Definition 1.3. A collection K of events from the sample space S is said to be a
σ-field (or σ-algebra) if it satisfies the following conditions:
(i) K ≠ ∅;
(ii) if A ∈ K, then Ā ∈ K;
(iii) if An ∈ K for all n ∈ N, then ⋃_{n=1}^∞ An ∈ K.
If K is a σ-field in the sample space S, then the pair (S, K) is called a measurable space.
Example 1.4.
(1) The collection of all subsets of a set S is a σ-field, denoted by P(S).
(2) Let (S, K) and (S ∗ , K∗ ) be measurable spaces, and let T : S → S ∗ be a
function. Then
T −1 (K∗ ) = {T −1 (A∗ ) : A∗ ∈ K∗ }
is a σ-field in S.
Proposition 1.5. If K is a σ-field in S, then the following properties hold:
(1) ∅, S ∈ K.
(2) Let A, B ∈ K. Then A ∩ B, A \ B, A∆B ∈ K.
(3) Let An ∈ K for all n ∈ N. Then
⋂_{n=1}^∞ An ∈ K,   lim inf_{n→∞} An ∈ K,   lim sup_{n→∞} An ∈ K.
Proof. (i) Let A ∈ K (such an A exists since K ≠ ∅); then by Definition 1.3 we have Ā ∈ K and A ∪ Ā = S ∈ K. This implies S̄ = ∅ ∈ K.
(ii) Let A, B ∈ K. By Definition 1.3 and by De Morgan’s laws (see Proposition 1.2) Ā, B̄ ∈ K and Ā ∪ B̄ = (A ∩ B)‾ ∈ K. Then using again Definition 1.3 it follows that A ∩ B = ((A ∩ B)‾)‾ ∈ K. We have A \ B = A ∩ B̄ ∈ K and A∆B = (A ∪ B) \ (A ∩ B) ∈ K, because A ∪ B, A ∩ B ∈ K.
(iii) For all n ∈ N we have Ān ∈ K, since An ∈ K. By De Morgan’s laws (see Proposition 1.2) and by Definition 1.3 it follows that
⋂_{n=1}^∞ An = ( ⋃_{n=1}^∞ Ān )‾ ∈ K.
Using this property and the definitions of the lower and upper limit of a sequence of events we obtain lim inf_{n→∞} An ∈ K and lim sup_{n→∞} An ∈ K.
Proposition 1.6. Let A be a collection of subsets of S. Then there exists in P(S)
a smallest σ-field containing A and denoted by σ(A), i.e. σ(A) is a σ-field such
that it contains A and for any σ-field F containing A we have σ(A) ⊆ F.
Proof. The set
σ(A) = ⋂ {F : A ⊆ F, F is a σ-field}
is well defined, since P(S) is a σ-field which contains A. It is easily verified that σ(A) is a σ-field and that A ⊆ σ(A). Obviously σ(A) is minimal.
Definition 1.7. The σ-field σ(A), whose existence is proved in Proposition 1.6, is
called σ-field generated by A.
Definition 1.8. A nonempty collection A of subsets of S is called a π-class if A ∩ B ∈ A whenever A, B ∈ A.
Definition 1.9. A nonempty collection A of subsets of S is called a λ-class if it satisfies the following conditions:
(i) S ∈ A;
(ii) if A, B ∈ A and B ⊆ A, then A \ B ∈ A;
(iii) if An ∈ A and An ⊆ An+1 for all n ∈ N, then ⋃_{n=1}^∞ An ∈ A.
Proposition 1.10.
(1) Let A and D be collections of subsets of S. There exists in P(S) a smallest λ-class containing A, denoted by λ(A), i.e. λ(A) is a λ-class such that it contains A and for any λ-class F containing A we have λ(A) ⊆ F.
(2) A λ-class which is simultaneously a π-class is a σ-field.
(3) If a λ-class A contains a π-class D, then σ(D) ⊆ A.
Proof. (1) The set
λ(A) = ⋂ {F : A ⊆ F, F is a λ-class}
is well defined, since P(S) is a λ-class which contains A. It is easily verified that λ(A) is a λ-class and that A ⊆ λ(A). Obviously λ(A) is minimal.
(2) Let A be a λ-class which is also a π-class. Then we have Ā = S \ A ∈ A for all A ∈ A. For A, B ∈ A it follows that Ā ∩ B̄ ∈ A, hence
A ∪ B = ( Ā ∩ B̄ )‾ ∈ A.
We can write
⋃_{n=1}^∞ An = ⋃_{n=1}^∞ ( ⋃_{k=1}^n Ak ) ∈ A, since { ⋃_{k=1}^n Ak : n ∈ N } is an increasing sequence of sets from A.
(3) We show that λ(D) is also a π-class. Set
G1 = {A : A ⊆ S, A ∩ D ∈ λ(D) for all D ∈ D}.
It is easy to verify that G1 is a λ-class containing D. Therefore λ(D) ⊆ G1. Then by the definition of G1 we get A ∩ D ∈ λ(D) for all A ∈ λ(D) and all D ∈ D.
Now we consider
G2 = {B : B ⊆ S, B ∩ G ∈ λ(D) for all G ∈ λ(D)}.
It is easy to verify that G2 is a λ-class containing D. Therefore λ(D) ⊆ G2, which means that if B, G ∈ λ(D), then by the definition of G2 we get B ∩ G ∈ λ(D).
So, λ(D) is a π-class and a λ-class at the same time. By statement (2) of this proposition it follows that λ(D) = σ(D). But λ(D) ⊆ λ(A) = A, therefore σ(D) ⊆ A.
Definition 1.11. Let K be a σ-field in S. A mapping P : K → R is called a probability if it satisfies the following axioms:
(i) P(S) = 1;
(ii) P(A) ≥ 0 for every A ∈ K;
(iii) for any sequence (An)n∈N of mutually exclusive events from K it holds
P( ⋃_{n=1}^∞ An ) = Σ_{n=1}^∞ P(An),
i.e. P is σ-additive.
The triplet (S, K, P) consisting of a measurable space (S, K) and a probability P : K → R is called a probability space.
Theorem 1.12. Let (S, K, P) be a probability space, and let A, B ∈ K. Then the
following properties are true:
(1) P(Ā) = 1 − P(A) and 0 ≤ P(A) ≤ 1.
(2) P(∅) = 0.
(3) P(A \ B) = P(A) − P(A ∩ B).
(4) If A ⊆ B, then P(A) ≤ P(B), i.e. P is monotone.
(5) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Proof. (1) Since A and Ā are disjoint and A ∪ Ā = S, we have P(S) = P(A) + P(Ā). Taking into consideration that P(S) = 1, it follows that P(Ā) = 1 − P(A). But P(Ā) ≥ 0 implies P(A) ≤ 1.
(2) We apply (1) for A = S and use P(S) = 1.
(3) By virtue of the equality A = (A ∩ B) ∪ (A \ B), a union of disjoint sets, we have
P(A) = P(A ∩ B) + P(A \ B).
(4) We have A ∩ B = A, since A ⊆ B. Then by property (3) and axiom (ii) it follows that 0 ≤ P(B \ A) = P(B) − P(A).
(5) The following property holds: A ∪ (Ā ∩ B) = A ∪ B, where the union on the left side of the equality is composed of disjoint sets. Thus,
P(A ∪ B) = P(A) + P(Ā ∩ B) = P(A) + P(B \ A) = P(A) + P(B) − P(A ∩ B),
where we also used property (3).
where we also used the property (3).
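Properties (1)–(5) of Theorem 1.12 can be sanity-checked on a small finite probability space where P is the uniform (counting) measure; the fair-die space below is an illustrative choice, not part of the text.

```python
from fractions import Fraction

S = set(range(1, 7))                      # one roll of a fair die

def P(event):
    """Uniform (classical) probability on S."""
    return Fraction(len(event), len(S))

A = {2, 4, 6}                             # even number
B = {4, 5, 6}                             # number >= 4

assert P(S - A) == 1 - P(A)               # property (1): complement
assert P(set()) == 0                      # property (2): impossible event
assert P(A - B) == P(A) - P(A & B)        # property (3): difference
assert P({4, 6}) <= P(B)                  # property (4): {4,6} ⊆ B, monotonicity
assert P(A | B) == P(A) + P(B) - P(A & B) # property (5): union formula
```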
Theorem 1.13. Let (S, K, P) be a probability space. If (An )n∈N is a sequence of
events from K, then the following properties are true:
(1) The inclusion-exclusion principle:
P( ⋃_{i=1}^n Ai ) = Σ_{i=1}^n P(Ai) − Σ_{1≤i<j≤n} P(Ai ∩ Aj) + . . . + (−1)^{n−1} P(A1 ∩ . . . ∩ An)
for all n ∈ N.
(2) P is subadditive, i.e.
P( ⋃_{n=1}^∞ An ) ≤ Σ_{n=1}^∞ P(An).
Proof. (1) To prove this we use induction. For n = 2 the property was proved in Theorem 1.12. Assuming it is true for n ∈ N, we prove that it is also true for n + 1. By using statement (5) of Theorem 1.12 we write

(1)  P( ⋃_{i=1}^{n+1} Ai ) = P( ⋃_{i=1}^n Ai ) + P(An+1) − P( ⋃_{i=1}^n (Ai ∩ An+1) )

and applying the inclusion-exclusion principle for n sets we have
P( ⋃_{i=1}^n Ai ) = Σ_{i=1}^n P(Ai) − Σ_{1≤i<j≤n} P(Ai ∩ Aj) + . . . + (−1)^{n−1} P(A1 ∩ . . . ∩ An)
and
P( ⋃_{i=1}^n (Ai ∩ An+1) ) = Σ_{i=1}^n P(Ai ∩ An+1) − Σ_{1≤i<j≤n} P(Ai ∩ Aj ∩ An+1) + . . . + (−1)^{n−1} P(A1 ∩ . . . ∩ An ∩ An+1).
Using these relations in (1) it follows that the inclusion-exclusion principle holds also for n + 1 sets.
(2) We define in K the sequence (Bn)n∈N of events by
B1 = A1,   Bn = An \ ⋃_{i=1}^{n−1} Ai for n ≥ 2.
Then we have
⋃_{n=1}^∞ Bn = ⋃_{n=1}^∞ An,   Bi ∩ Bj = ∅ for i ≠ j.
But Bn ⊆ An for all n ∈ N, which implies P(Bn) ≤ P(An) (by Theorem 1.12). Then it follows that
P( ⋃_{n=1}^∞ An ) = P( ⋃_{n=1}^∞ Bn ) = Σ_{n=1}^∞ P(Bn) ≤ Σ_{n=1}^∞ P(An).
Theorem 1.14. Let (S, K, P) be a probability space. Then the following properties
are true:
(1) If (An)n∈N is an increasing sequence of events from K, then
lim_{n→∞} P(An) = P( ⋃_{n=1}^∞ An ).
(2) If (An)n∈N is a decreasing sequence of events from K, then
lim_{n→∞} P(An) = P( ⋂_{n=1}^∞ An ).
Proof. (1) We define in K the sequence (Bn)n∈N of events by
B1 = A1,   Bn = An \ An−1 for n ≥ 2.
Since (An)n∈N is an increasing sequence of events, we have
⋃_{n=1}^∞ Bn = ⋃_{n=1}^∞ An,   Bi ∩ Bj = ∅ for i ≠ j.
We write

(2)  P( ⋃_{n=1}^∞ An ) = P( ⋃_{n=1}^∞ Bn ) = Σ_{n=1}^∞ P(Bn) = P(A1) + Σ_{n=1}^∞ P(An+1 \ An) = lim_{n→∞} ( P(A1) + P(A2 \ A1) + . . . + P(An \ An−1) ).

By Theorem 1.12 we have
P(An+1 \ An) = P(An+1) − P(An) for all n ∈ N,
because An ⊆ An+1. Thus,
P(A1) + P(A2 \ A1) + . . . + P(An \ An−1) = P(An).
Then by (2)
P( ⋃_{n=1}^∞ An ) = lim_{n→∞} P(An).
(2) We apply (1) for Bn = Ān. So (Bn)n∈N becomes an increasing sequence of events from K and it holds
lim_{n→∞} P(Bn) = P( ⋃_{n=1}^∞ Bn ).
Due to De Morgan’s laws (see Proposition 1.2) this equality implies
1 − lim_{n→∞} P(An) = P( ⋃_{n=1}^∞ Ān ) = P( ( ⋂_{n=1}^∞ An )‾ ) = 1 − P( ⋂_{n=1}^∞ An ).
Therefore, lim_{n→∞} P(An) = P( ⋂_{n=1}^∞ An ).
1.3 Classic Interpretation of Probability
The classic interpretation of probability is based on the concept of equally likely outcomes. For example, when a numbered die is rolled, there are 6 possible outcomes: 1, 2, 3, 4, 5, or 6. If it may be assumed that these outcomes are equally likely, then they must have the same probability. Since the sum of the probabilities must be 1, the probability that the number 1, or 2, or 3, or 4, or 5, or 6 appears must be 1/6.
More generally, if the outcome of some process is one of n different outcomes and if these outcomes are equally likely to occur, then the probability of each outcome is 1/n. But this formal definition cannot be applied to experiments whose outcomes are not equally likely.
The classical definition of probability is due to the French mathematician and physicist Pierre Simon Laplace. We consider an experiment with finitely many, equally likely outcomes. Then the probability that an event A will occur is the number
P(A) = (number of outcomes favorable to the occurrence of A) / (number of all possible outcomes of the experiment).
This is closely related to the so-called relative frequency fA of the occurrence of the event A, defined as
fA = k/n,
when A appears k times after repeating the experiment n times. In practice, by repeating the experiment a large number of times, one observes that the relative frequency fA oscillates around the number P(A).
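The oscillation of the relative frequency fA around P(A) can be observed in a short simulation; the event, sample sizes and seed below are arbitrary illustrative choices.

```python
import random

random.seed(1)  # arbitrary seed, chosen only for reproducibility

def relative_frequency(n):
    """Roll a fair die n times; return f_A for A = 'a six appears'."""
    hits = sum(1 for _ in range(n) if random.randint(1, 6) == 6)
    return hits / n

# f_A wanders around P(A) = 1/6 ≈ 0.1667 and settles down as n grows.
for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency(n))
```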
Example 1.15. (Dice Rolling)
The correspondence between B. Pascal and P. Fermat, in which they investigated the dice-rolling problem of the French nobleman and gambler Chevalier de Méré, is famous. It is said that de Méré had been betting that in four rolls of a die at least one six would turn up. He was consistently winning, and to get more people to play he changed the bet: in 24 rolls of two dice a pair of sixes would turn up. But with this second bet de Méré lost, and he felt that 25 rolls were necessary to make the game favorable.
We will calculate and compare the probabilities of the following events:
A: we obtain at least one six in 4 rolls of a die;
B: we obtain at least one pair of sixes in 24 rolls of two dice;
C: we obtain at least one pair of sixes in 25 rolls of two dice.
For this problem it is easier to determine the probabilities of the contrary events Ā, B̄ and C̄.
The event Ā means no six is obtained in 4 rolls of a die. The experiment has 6^4 possible outcomes. This is the number of functions F we can define
F : {r1, r2, r3, r4} → {1, 2, 3, 4, 5, 6},
where F(ri) shows which number was obtained in the ith roll of the die, i = 1, . . . , 4. In order to count the outcomes favorable to the occurrence of Ā, we observe that in each roll of the die we have 5 possibilities to obtain no six, so in 4 rolls of the die we have 5^4 possibilities to obtain no six. This is the number of functions G we can define
G : {r1, r2, r3, r4} → {1, 2, 3, 4, 5},
where G(ri) shows which number 1, 2, 3, 4 or 5 (no six) was obtained in the ith roll of the die, i = 1, . . . , 4. Therefore,
P(Ā) = 5^4 / 6^4,
which implies
P(A) = 1 − 5^4 / 6^4 ≈ 0.5177.
The event B̄ means no pair of sixes is obtained in 24 rolls of two dice. The experiment has 36^24 possible outcomes. This is the number of functions F we can define
F : {r1, r2, . . . , r24} → {(1, 1), (1, 2), . . . , (6, 6)},
where F(ri) shows which pair of numbers was obtained in the ith roll of the two dice, i = 1, . . . , 24. In order to count the outcomes favorable to the occurrence of B̄, we observe that in each roll of the two dice we have 35 possibilities to obtain no pair of sixes, so in 24 rolls of the two dice we have 35^24 possibilities to obtain no pair of sixes. This is the number of functions G we can define
G : {r1, r2, . . . , r24} → {(1, 1), (1, 2), . . . , (6, 5)},
where G(ri) shows which pair (not a pair of sixes) was obtained in the ith roll of the two dice, i = 1, . . . , 24. Therefore,
P(B̄) = 35^24 / 36^24,
which implies
P(B) = 1 − 35^24 / 36^24 ≈ 0.4914.
The event C has the probability
P(C) = 1 − 35^25 / 36^25 ≈ 0.5055.
Comparing now the calculated probabilities we notice
P(B) < 1/2 < P(C) < P(A).
♦
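The three probabilities of Example 1.15 follow directly from the complement counts, which a few lines of Python can reproduce:

```python
# De Méré's bets, computed via the complementary ("no success") events.
p_A = 1 - (5 / 6) ** 4        # at least one six in 4 rolls of a die
p_B = 1 - (35 / 36) ** 24     # at least one double six in 24 rolls of two dice
p_C = 1 - (35 / 36) ** 25     # at least one double six in 25 rolls of two dice

print(f"P(A) = {p_A:.4f}")    # 0.5177, favorable for de Méré
print(f"P(B) = {p_B:.4f}")    # 0.4914, unfavorable
print(f"P(C) = {p_C:.4f}")    # 0.5055, favorable again
```

This confirms the ordering P(B) < 1/2 < P(C) < P(A) stated in the example.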
1.4 Conditional Probability
Definition 1.16. Let (S, K, P) be a probability space, and let A, B ∈ K. The
conditional probability of A given B is P(·|B) : K → R, defined by
P(A|B) = P(A ∩ B) / P(B),
provided P(B) > 0.
Proposition 1.17. Let (S, K, P) be a probability space, and let B ∈ K be such that
P(B) > 0. Then (S, K, P(·|B)) is a probability space.
Proof. We have
P(A|B) = P(A ∩ B) / P(B) ≥ 0,   P(S|B) = P(S ∩ B) / P(B) = 1,
and for any mutually exclusive events (An)n∈N from K we have
P( ⋃_{n=1}^∞ An | B ) = P( ⋃_{n=1}^∞ (An ∩ B) ) / P(B) = Σ_{n=1}^∞ P(An ∩ B) / P(B) = Σ_{n=1}^∞ P(An|B).
Proposition 1.18 (The multiplication rule). Let (S, K, P) be a probability space,
and let A1 , . . . , An ∈ K be such that
P(A1 ∩ . . . ∩ An−1 ) > 0.
Then
P(A1 ∩ . . . ∩ An ) = P(A1 )P(A2 |A1 ) . . . P(An |A1 ∩ . . . ∩ An−1 ).
Proof. Using the definition of the conditional probability we write
P(A1)P(A2|A1) . . . P(An|A1 ∩ . . . ∩ An−1)
= P(A1) · P(A1 ∩ A2)/P(A1) · . . . · P(A1 ∩ . . . ∩ An−1 ∩ An)/P(A1 ∩ . . . ∩ An−1)
= P(A1 ∩ . . . ∩ An).
Proposition 1.19 (Bayes’ formula). In a probability space (S, K, P) we consider
(Ai )i∈I to be a partition of S with P(Ai ) > 0 and Ai ∈ K for all i ∈ I, and let
A ∈ K be such that P(A) > 0. Then
P(Aj|A) = P(Aj)P(A|Aj) / Σ_{i∈I} P(Ai)P(A|Ai)   for all j ∈ I.
Proof. Because (Ai)i∈I is a partition of S, we can write
P(A) = P( ⋃_{i∈I} (A ∩ Ai) ) = Σ_{i∈I} P(A ∩ Ai).
By the definition of the conditional probability we have
P(Aj|A) = P(Aj ∩ A) / P(A) = P(Aj)P(A|Aj) / Σ_{i∈I} P(A ∩ Ai) = P(Aj)P(A|Aj) / Σ_{i∈I} P(Ai)P(A|Ai).
Remark 1.20.
We have a set of prior probabilities P(Ai) for the hypotheses Ai, where i ∈ I, and an event A called the evidence. By Bayes’ formula we can calculate the probability of each hypothesis given the evidence, P(Ai|A), called the posterior probability. △
Example 1.21.
A shop buys a certain CD brand from three suppliers S1, S2 and S3. 40% of the CDs are acquired from S1, 35% from S2 and 25% from S3. Further suppose that 2% of the CDs from S1 are defective, 1% of those from S2 are defective and 3% of those from S3 are defective. Finally, suppose that a customer buys a randomly chosen CD of this brand and then requests a refund because it is defective. What is the probability that this was a CD from S2? And what is the probability that a randomly chosen CD of this brand is defective?
We consider the following events:
D: CD is defective
Ai : CD was produced by Si , where i = 1, 2, 3.
We then have
P(A1) = 40/100,   P(A2) = 35/100,   P(A3) = 25/100
and
P(D|A1) = 2/100,   P(D|A2) = 1/100,   P(D|A3) = 3/100.
By Bayes’ formula we calculate the probability that the defective CD came from S2:
P(A2|D) = P(A2)P(D|A2) / Σ_{i=1}^3 P(Ai)P(D|Ai) = 7/38 ≈ 0.1842.
The probability that a randomly chosen CD is defective can be determined by
P(D) = Σ_{i=1}^3 P(Ai)P(D|Ai).
♦
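The numbers of Example 1.21 can be reproduced exactly with rational arithmetic; the supplier labels below are just dictionary keys transcribing the example's data.

```python
from fractions import Fraction

# Priors P(Ai) and likelihoods P(D|Ai) from Example 1.21.
prior = {"S1": Fraction(40, 100), "S2": Fraction(35, 100), "S3": Fraction(25, 100)}
defect = {"S1": Fraction(2, 100), "S2": Fraction(1, 100), "S3": Fraction(3, 100)}

# Total probability of a defective CD: P(D) = sum_i P(Ai) P(D|Ai).
p_defective = sum(prior[s] * defect[s] for s in prior)

# Bayes' formula: posterior P(A2|D).
posterior_S2 = prior["S2"] * defect["S2"] / p_defective

print(p_defective)   # 19/1000
print(posterior_S2)  # 7/38
```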
1.5 Independence of Events
Definition 1.22. Let (S, K, P) be a probability space. The events A, B ∈ K are
said to be independent events if
P(A ∩ B) = P(A)P(B).
A sequence (An)n∈N of events from K is called a sequence of independent events if
P(Ai1 ∩ . . . ∩ Aim) = P(Ai1) . . . P(Aim)
for each finite subset {i1, . . . , im} ⊂ N.
A sequence (An)n∈N of events from K is called a sequence of pairwise independent events if
P(Ai ∩ Aj) = P(Ai)P(Aj)
for each i, j ∈ N, i ≠ j.
Proposition 1.23. In a probability space (S, K, P) let A, B ∈ K. Then the following assertions are equivalent:
(1) A and B are independent.
(2) Ā and B are independent.
(3) A and B̄ are independent.
(4) Ā and B̄ are independent.
Proof. In order to prove the equivalences we use the properties from Theorem 1.12
and the definition of independent events.
(1)⇔(2):
P(A ∩ B) = P(A)P(B) ⇔ P(B) − P(A ∩ B) = P(B) − P(A)P(B) = P(B)P(Ā)
⇔ P(B \ A) = P(B)P(Ā) ⇔ P(Ā ∩ B) = P(Ā)P(B).
(1)⇔(3):
P(A ∩ B) = P(A)P(B) ⇔ P(A) − P(A ∩ B) = P(A) − P(A)P(B) = P(A)P(B̄)
⇔ P(A \ B) = P(A)P(B̄) ⇔ P(A ∩ B̄) = P(A)P(B̄).
(1)⇔(4):
P(A ∩ B) = P(A)P(B) ⇔ P(A ∪ B) = P(A) + P(B) − P(A)P(B)
⇔ 1 − P(A ∪ B) = (1 − P(A))(1 − P(B)) = P(Ā)P(B̄) ⇔ P(Ā ∩ B̄) = P(Ā)P(B̄).
2 Classic Probabilistic Models
2.1 Samplings with Replacement
Bernoulli Trials and the Bernoulli Model
Repeated independent trials of an experiment, such that there are only two possible outcomes for each trial and their probabilities remain the same throughout the trials, are called Bernoulli trials. We denote the two possible outcomes by A, "success", having probability p, and respectively Ā, "failure", with probability q. Of course, we have
p + q = 1,   p ≥ 0,   q ≥ 0.
The sample space of each trial is S = A ∪ Ā.
Frequently we are interested in the total number of successes produced in n
Bernoulli trials, but not in their order. We want to determine the probability of the
event that in n Bernoulli trials we obtain k successes and n − k failures.
Proposition 2.1. Given n Bernoulli trials with probability p of success and probability q = 1 − p of failure, the probability that exactly k successes occur is

(3)  b(k, n; p) = C_n^k p^k q^{n−k}.
Proof. For each i ∈ {1, . . . , n} we denote by Ai the event that in the ith trial we obtain success. Then Āi is the event that in the ith trial we obtain failure. We also consider the events
B_{i1...ik} = Ai1 ∩ . . . ∩ Aik ∩ Āik+1 ∩ . . . ∩ Āin,
where (i1, . . . , ik) ∈ I and I is given by
I = { (i1, . . . , ik) : 1 ≤ i1 < . . . < ik ≤ n, 1 ≤ ik+1 < . . . < in ≤ n, ik+1, . . . , in ∉ {i1, . . . , ik} }.
We have a union of disjoint events B_{i1...ik} and we know that the events Ai are independent, as well as P(Ai) = p and P(Āi) = q. Therefore,
b(k, n; p) = P( ⋃_{(i1,...,ik)∈I} B_{i1...ik} ) = Σ_{(i1,...,ik)∈I} P(B_{i1...ik})
= Σ_{(i1,...,ik)∈I} P(Ai1 ∩ . . . ∩ Aik ∩ Āik+1 ∩ . . . ∩ Āin)
= Σ_{(i1,...,ik)∈I} p^k q^{n−k} = C_n^k p^k q^{n−k}.
Remark 2.2.
The number b(k, n; p) represents the coefficient of x^k in the binomial expansion
(px + q)^n = Σ_{k=0}^n b(k, n; p) x^k.  △
Example 2.3.
A well balanced coin is tossed 10 times. The possible outcomes of each toss are head, with probability p = 0.5, and tail, with probability q = 0.5. The probability never to get a head is
b(0, 10; 0.5) = 1/2^10,
while the probability to obtain at least one head is
1 − b(0, 10; 0.5) = 1 − 1/2^10.
♦
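Formula (3) translates directly into code; the sketch below also checks that the probabilities b(0, n; p), . . . , b(n, n; p) sum to 1, as the binomial expansion of Remark 2.2 at x = 1 requires.

```python
from math import comb

def b(k, n, p):
    """Bernoulli (binomial) probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Example 2.3: ten tosses of a fair coin.
p_no_heads = b(0, 10, 0.5)        # 1 / 2**10
p_at_least_one = 1 - p_no_heads

# The binomial probabilities sum to 1.
assert abs(sum(b(k, 10, 0.5) for k in range(11)) - 1) < 1e-12
print(p_no_heads, p_at_least_one)
```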
Multiple Bernoulli Trials
The binomial distribution can be generalized to the case of n repeated independent
trials, where each trial can have several possible (disjoint) outcomes E1 , . . . , Ek
and the probability of the realization of Ei (i ∈ {1, . . . , k}) in each trial is P(Ei ) =
pi . Obviously, p1 + . . . + pk = 1. Let n1 , . . . , nk be such that n1 + . . . + nk = n.
Proposition 2.4. The probability that in n = n1 + n2 + . . . + nk trials each Ei (i ∈ {1, . . . , k}) occurs exactly ni times is

(4)  b(n1, . . . , nk; n) = n!/(n1! n2! . . . nk!) · p1^{n1} p2^{n2} . . . pk^{nk}.
Proof. We denote by A_j^i the event that in the jth trial the event Ei occurs. For each way of splitting the trial numbers {1, . . . , n} into k groups of pairwise distinct indices, the jth group containing exactly nj indices, we consider the event B in which E1 occurs in the n1 trials of the first group, E2 occurs in the n2 trials of the second group, and so on up to Ek, which occurs in the nk trials of the last group. Let I denote the set of all such index systems. The events B (for distinct index systems) are pairwise disjoint, and since the trials are independent, each such event has probability
P(B) = p1^{n1} p2^{n2} . . . pk^{nk}.
Therefore,
b(n1, . . . , nk; n) = P( ⋃_{I} B ) = Σ_{I} P(B) = |I| · p1^{n1} p2^{n2} . . . pk^{nk}.
Now we determine how many elements the index set I has. This can be reformulated equivalently: suppose that n distinct elements are to be divided into k different groups in such a way that the jth group contains exactly nj (j ∈ {1, . . . , k}) elements, where n1 + n2 + . . . + nk = n. It is desired to determine the total number N of different ways in which the n elements can be divided into the k groups.
The n1 elements of the first group can be selected from the n available elements in C_n^{n1} different ways. After the n1 elements of the first group have been selected, the n2 elements of the second group can be selected from the remaining n − n1 elements in C_{n−n1}^{n2} different ways. The total number of different ways of selecting the elements for both the first group and the second group is C_n^{n1} C_{n−n1}^{n2}. After the n1 + n2 elements of the first two groups have been selected, the number of different ways in which the n3 elements of the third group can be selected is C_{n−n1−n2}^{n3}. Then the total number of different ways of selecting the elements for the first three groups is C_n^{n1} C_{n−n1}^{n2} C_{n−n1−n2}^{n3}. From the preceding explanation it follows that after the first k − 2 groups have been formed, the number of different ways in which the nk−1 elements of the (k−1)th group can be selected from the remaining n − n1 − . . . − nk−2 elements is C_{n−n1−...−nk−2}^{nk−1}. The remaining nk elements form the last group. Hence, the total number N of different ways of dividing the n elements into the k groups is
N = C_n^{n1} C_{n−n1}^{n2} C_{n−n1−n2}^{n3} . . . C_{n−n1−...−nk−2}^{nk−1} = n! / (n1! n2! . . . nk!).
The index set I has N elements and it holds
b(n1, . . . , nk; n) = n!/(n1! n2! . . . nk!) · p1^{n1} p2^{n2} . . . pk^{nk}.
Remark 2.5.
The number b(n1, . . . , nk; n) represents the coefficient of x1^{n1} . . . xk^{nk} in the multinomial expansion (p1 x1 + . . . + pk xk)^n.  △
Example 2.6.
Suppose that nine dice are rolled. What is the probability that each of the numbers 1, 3, 6 appears exactly three times?
Solution. We consider multiple Bernoulli trials with n = 9 repetitions of rolling one die, where the possible outcomes are the events:
E1: the number 1 occurred;
E2: the number 3 occurred;
E3: the number 6 occurred.
We have
p1 = P(E1) = p2 = P(E2) = p3 = P(E3) = 1/6,
while n1 = n2 = n3 = 3 represent the numbers of occurrences of E1, E2 and E3, respectively. So, the probability that each of the numbers 1, 3, 6 appears exactly three times is
9!/(3! 3! 3!) · 1/6^9 = 35/209952.
♦
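Formula (4) and the value in Example 2.6 can be reproduced with exact rational arithmetic; the function name below is an illustrative choice.

```python
from fractions import Fraction
from math import factorial

def multinomial_b(ns, ps):
    """Probability that outcome i occurs ns[i] times in sum(ns) trials,
    where P(E_i) = ps[i] in every trial (formula (4))."""
    n = sum(ns)
    coeff = factorial(n)
    for ni in ns:
        coeff //= factorial(ni)          # multinomial coefficient n!/(n1!...nk!)
    prob = Fraction(coeff)
    for ni, pi in zip(ns, ps):
        prob *= Fraction(pi) ** ni       # times p1^n1 ... pk^nk
    return prob

# Example 2.6: each of 1, 3, 6 appears exactly three times in nine rolls.
p = multinomial_b([3, 3, 3], [Fraction(1, 6)] * 3)
print(p)  # 35/209952
```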
3 Samplings without Replacement
Model with Two States
Proposition 3.1. In a box there are n1 white balls and n2 black balls, all balls having the same size. We draw a ball from the box, note its colour and do not replace the ball into the box. The probability that in n trials we get k white balls (k ≤ n1, n ≤ n1 + n2) is
p(k; n) = C_{n1}^k C_{n2}^{n−k} / C_{n1+n2}^n.
Proof. For i ∈ {1, . . . , n} we denote by Ai the event that in the ith drawing we took a white ball. Then Āi is the event that in the ith drawing we took a black ball. We also consider the events
B_{i1...ik} = Ai1 ∩ . . . ∩ Aik ∩ Āik+1 ∩ . . . ∩ Āin,
where (i1, . . . , ik) ∈ I and I is given by
I = { (i1, . . . , ik) : 1 ≤ i1 < . . . < ik ≤ n, 1 ≤ ik+1 < . . . < in ≤ n, ik+1, . . . , in ∉ {i1, . . . , ik} }.
The set I of indices has C_n^k elements. We have a union of disjoint events B_{i1...ik}, each having the same probability (the probability of a drawing sequence depends only on how many white and black balls it contains, not on their order), and therefore
p(k; n) = P( ⋃_{(i1,...,ik)∈I} B_{i1...ik} ) = Σ_{(i1,...,ik)∈I} P(B_{i1...ik}) = C_n^k P(A1 ∩ . . . ∩ Ak ∩ Āk+1 ∩ . . . ∩ Ān).
Since the events Ai are not independent (the sampling is without replacement), by the multiplication rule (see Proposition 1.18) we get
P(A1 ∩ . . . ∩ Ak ∩ Āk+1 ∩ . . . ∩ Ān)
= P(A1)P(A2|A1) . . . P(Āk+1|A1 ∩ . . . ∩ Ak) . . . P(Ān|A1 ∩ . . . ∩ Ak ∩ Āk+1 ∩ . . . ∩ Ān−1)
= n1/(n1+n2) · (n1−1)/(n1+n2−1) · . . . · (n1−(k−1))/(n1+n2−(k−1)) · n2/(n1+n2−k) · . . . · (n2−(n−k−1))/(n1+n2−k−(n−k−1))
= n1!/(n1−k)! · n2!/(n2−(n−k))! · (n1+n2−n)!/(n1+n2)!.
Then
p(k; n) = C_n^k · n1!/(n1−k)! · n2!/(n2−(n−k))! · (n1+n2−n)!/(n1+n2)! = C_{n1}^k C_{n2}^{n−k} / C_{n1+n2}^n.
Example 3.2.
Suppose that a deck of 36 cards contains 4 aces. We draw 4 cards. What is the probability to obtain exactly one ace? And what is the probability to obtain at least one ace?
Solution. We use the sampling model without replacement with 2 states. The probability to obtain exactly k = 1 ace in n = 4 drawings is
p(1; 4) = C_4^1 C_32^3 / C_36^4 ≈ 0.3368.
The probability to obtain at least one ace is
p = p(1; 4) + p(2; 4) + p(3; 4) + p(4; 4) ≈ 0.3895.
♦
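The hypergeometric formula of Proposition 3.1 and the two values of Example 3.2 can be sketched as follows (the function name is an illustrative choice):

```python
from fractions import Fraction
from math import comb

def p_without_replacement(k, n, n1, n2):
    """Probability of k white balls in n draws without replacement
    from n1 white and n2 black balls (Proposition 3.1)."""
    return Fraction(comb(n1, k) * comb(n2, n - k), comb(n1 + n2, n))

# Example 3.2: 4 aces in a 36-card deck, 4 cards drawn.
exactly_one = p_without_replacement(1, 4, 4, 32)
at_least_one = sum(p_without_replacement(k, 4, 4, 32) for k in range(1, 5))

print(float(exactly_one))   # ≈ 0.3368
print(float(at_least_one))  # ≈ 0.3895
```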
Model with k States
Proposition 3.3. In a box there are balls of the same size and of k colours c1, . . . , ck. We assume that we have ni balls of colour ci (i ∈ {1, . . . , k}). We draw a ball from the box, note its colour and do not replace the ball into the box. The probability that in m1 + . . . + mk trials we get mi balls of colour ci (i ∈ {1, . . . , k}), where mi ≤ ni, is
p(m1, . . . , mk; m1 + . . . + mk) = C_{n1}^{m1} . . . C_{nk}^{mk} / C_{n1+...+nk}^{m1+...+mk}.
Example 3.4.
Suppose that a deck of 52 cards contains 13 hearts. Suppose that the cards are
shuffled and distributed in a random manner to four players A, B, C and D so that
each player receives 13 cards. Determine the probability that player A will receive
2 hearts, player B will receive 3 hearts, player C will receive 4 hearts and player
D will also receive 4 hearts.
Solution. We use the sampling model without replacement with 4 states (corresponding to the four players), with n1 = n2 = n3 = n4 = 13 (the number of cards for each player) and m1 = 2 (player A receives 2 hearts), m2 = 3 (player B receives 3 hearts), m3 = 4 (player C receives 4 hearts) and m4 = 4 (player D receives 4 hearts), and get
p(2, 3, 4, 4; 13) = C_13^2 C_13^3 C_13^4 C_13^4 / C_52^13 ≈ 0.018.
♦
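Proposition 3.3 generalizes the two-state computation above; a minimal sketch, with the hand sizes and heart counts of Example 3.4:

```python
from fractions import Fraction
from math import comb

def p_k_states(ms, ns):
    """Probability of drawing ms[i] balls of colour i without replacement,
    given ns[i] balls of each colour (Proposition 3.3)."""
    num = 1
    for mi, ni in zip(ms, ns):
        num *= comb(ni, mi)
    return Fraction(num, comb(sum(ns), sum(ms)))

# Example 3.4: the 13 hearts split 2/3/4/4 among four 13-card hands.
p = p_k_states([2, 3, 4, 4], [13, 13, 13, 13])
print(float(p))  # ≈ 0.018
```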
4 Poisson Model
Proposition 4.1. In an experiment only two possible outcomes occur in each trial, but their probabilities do not remain the same throughout the trials. We denote the two possible outcomes by "success" and respectively "failure". The probability that success occurs in the ith trial is pi and that failure occurs is qi = 1 − pi. The probability of the event that in n trials we obtain k successes and n − k failures is
P(k; n) = Σ_{1≤i1<...<ik≤n} p_{i1} . . . p_{ik} q_{ik+1} . . . q_{in},
where {ik+1, . . . , in} = {1, . . . , n} \ {i1, . . . , ik}.
Proof. For i ∈ {1, . . . , n} we denote by A_i the event that in the i-th trial we obtain success. Then $\overline{A}_i$ is the event that in the i-th trial we obtain failure. We also consider the events
$$B_{i_1 \dots i_k} = A_{i_1} \cap \dots \cap A_{i_k} \cap \overline{A}_{i_{k+1}} \cap \dots \cap \overline{A}_{i_n},$$
where (i_1, . . . , i_k) ∈ I and I is given by
$$I = \big\{ (i_1,\dots,i_k) : 1 \le i_1 < \dots < i_k \le n,\ 1 \le i_{k+1} < \dots < i_n \le n,\ i_{k+1},\dots,i_n \notin \{i_1,\dots,i_k\} \big\}.$$
We have a union of disjoint events $B_{i_1\dots i_k}$ and we know that the events A_1, . . . , A_n are independent. Therefore,
$$P(k;n) = P\Big(\bigcup_{(i_1,\dots,i_k)\in I} B_{i_1\dots i_k}\Big) = \sum_{(i_1,\dots,i_k)\in I} P(B_{i_1\dots i_k}) = \sum_{(i_1,\dots,i_k)\in I} P\big(A_{i_1}\cap\dots\cap A_{i_k}\cap \overline{A}_{i_{k+1}}\cap\dots\cap \overline{A}_{i_n}\big) = \sum_{(i_1,\dots,i_k)\in I} p_{i_1}\cdots p_{i_k}\, q_{i_{k+1}}\cdots q_{i_n}.$$
Remark 4.2.
1) The number P(k; n) represents the coefficient of $x^k$ in the polynomial
$$(p_1 x + q_1)\cdots(p_n x + q_n).$$
2) If pi = p and qi = 1 − p for all i ∈ {1, . . . , n}, then we obtain the Bernoulli
model and P (k; n) = b(n, k; p).
△
Example 4.3.
Suppose that four new computer models M1 , M2 , M3 , M4 are being tested for their
reliability. The probability that a model fits the latest market standards is p1 = 0.8
for model M1 , p2 = 0.7 for model M2 , p3 = 0.9 for model M3 , respectively
p4 = 0.6 for model M4 . Determine the probability p that at least three models
match the profile.
Solution. We use a Poisson model with p_1 = 0.8 (the probability that M_1 fits), p_2 = 0.7 (the probability that M_2 fits), p_3 = 0.9 (the probability that M_3 fits) and p_4 = 0.6 (the probability that M_4 fits). If at least three models match the profile, it means that exactly three or exactly four models match, so p = P(3; 4) + P(4; 4), representing the sum of the coefficients of $x^3$ and $x^4$ in the polynomial
$$(0.8x + 0.2)(0.7x + 0.3)(0.9x + 0.1)(0.6x + 0.4).$$
Then p = 0.7428. ♦
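The value p = 0.7428 can be recovered by expanding the polynomial numerically; a small sketch (plain list-based polynomial multiplication, no external libraries):

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (index = power of x)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

# factors (q_i + p_i x), constant term first
coeffs = [1.0]
for qi, pi in [(0.2, 0.8), (0.3, 0.7), (0.1, 0.9), (0.4, 0.6)]:
    coeffs = poly_mul(coeffs, [qi, pi])

p = coeffs[3] + coeffs[4]  # P(3;4) + P(4;4)
print(round(p, 4))  # 0.7428
```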
5 Pascal Model (Negative Binomial Model)
Proposition 5.1. We consider an infinite sequence of Bernoulli trials in which the outcome of any trial is either a success with the probability p or a failure with the probability 1 − p. Then the probability of the event that k failures occur before the n-th success is obtained is
$$P(k,n;p) = C_{n+k-1}^{k}\, p^n (1-p)^k.$$
Proof. We consider the following events:
A: in n + k − 1 trials we obtain exactly n − 1 successes;
B: in the (n + k)th trial we obtain success.
We observe that P(B) = p and
$$P(A) = b(n-1, n+k-1; p) = C_{n+k-1}^{n-1}\, p^{n-1} (1-p)^k,$$
since the event A describes a Bernoulli model.
Then the event that k failures occurred before the n-th success is obtained is equal to the event that in n + k − 1 trials we obtain exactly n − 1 successes (the event A) and in the (n + k)-th trial we obtain success (the event B). But A and B are independent events, so
$$P(k,n;p) = P(A \cap B) = P(A)\,P(B) = C_{n+k-1}^{n-1}\, p^n (1-p)^k = C_{n+k-1}^{k}\, p^n (1-p)^k.$$
Remark 5.2.
We consider the following generalization: for k ∈ {0, 1, . . .} and a ∈ R* we define
$$C_a^k = \begin{cases} 1 & \text{if } k = 0,\\[4pt] \dfrac{a(a-1)\cdots(a-k+1)}{k!} & \text{if } k \in \mathbb{N}^{*}. \end{cases}$$
Then one can write the following expansion:
$$(1+t)^a = \sum_{k=0}^{\infty} C_a^k\, t^k \quad \text{for all } |t| < 1.$$
If we take a = −n and t = −(1 − p)x, then
$$\frac{p^n}{\big(1-(1-p)x\big)^n} = p^n \big(1-(1-p)x\big)^{-n} = p^n \sum_{k=0}^{\infty} C_{-n}^{k}\, (-1)^k (1-p)^k x^k = \sum_{k=0}^{\infty} C_{n+k-1}^{k}\, p^n (1-p)^k x^k.$$
Hence the number P(k, n; p) represents the coefficient of $x^k$ in the expansion of
$$\frac{p^n}{\big(1-(1-p)x\big)^n}\,.$$
For this reason this model is sometimes called negative binomial model.
△
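As a sanity check of Proposition 5.1, the probabilities P(k, n; p) should sum to 1 over k; a quick numerical check (the values n = 3, p = 0.4 are arbitrary):

```python
from math import comb

def pascal_pmf(k, n, p):
    """P(k, n; p): probability of k failures before the n-th success."""
    return comb(n + k - 1, k) * p**n * (1 - p)**k

total = sum(pascal_pmf(k, 3, 0.4) for k in range(2000))
print(round(total, 6))  # 1.0
```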
6 Geometric Model
For the special case n = 1 in the Pascal model we have the following result.
Proposition 6.1. We consider an infinite sequence of Bernoulli trials in which the
outcome of any trial is either a success with the probability p or a failure with the
probability 1 − p. Then the probability of the event that k failures occurred before
the first success is obtained is given by
P (k; p) = p(1 − p)k .
Remark 6.2.
In this model the number P(k; p) represents the coefficient of $x^k$ in the geometric expansion
$$\frac{p}{1-(1-p)x} = \sum_{k=0}^{\infty} p(1-p)^k x^k.$$
For this reason this model is sometimes called geometric model.
△
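A quick numerical illustration of Proposition 6.1 (the value p = 0.25 is arbitrary): the probabilities p(1 − p)^k sum to 1 over k = 0, 1, 2, …

```python
p = 0.25
pmf = [p * (1 - p)**k for k in range(200)]  # truncated geometric model
print(pmf[0], round(sum(pmf), 6))  # 0.25 1.0
```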
7 Random Variables and Random Vectors
7.1 Definition and Properties
Let (R, B) be the measurable space R endowed with the σ-field B = σ(S_open) generated by
$$S_{\mathrm{open}} = \{S \subset \mathbb{R} : S \text{ is an open set}\}.$$
Let (R^n, B^n) be the measurable space R^n endowed with the σ-field B^n = σ(S^n_open) generated by
$$S^{n}_{\mathrm{open}} = \{S \subset \mathbb{R}^n : S \text{ is an open set}\}.$$
Definition 7.1. Let (Ω, K) and (E, E) be measurable spaces. A function F : Ω →
E is said to be K/E measurable if
F −1 (B) = {ω ∈ Ω : F (ω) ∈ B} ∈ K for all B ∈ E.
Now we consider the special cases E = R, respectively E = Rn . Further, let
(Ω, K, P) denote a probability space.
Definition 7.2. A function X : Ω → R is called random variable if it is K/B
measurable, i.e.
X −1 (B) = {ω ∈ Ω : X(ω) ∈ B} ∈ K for all B ∈ B.
A function X : Ω → C is called random variable if Re X and Im X are random
variables.
A function X : Ω → Rn is called random vector if it is K/B n measurable, i.e.
X −1 (B) = {ω ∈ Ω : X(ω) ∈ B} ∈ K for all B ∈ B n .
If not otherwise mentioned we will consider in this book random variables taking
values in R.
Example 7.3.
A simple example of a random variable is the indicator I_A : Ω → R of a set A ∈ K, defined by
$$I_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A,\\ 0 & \text{if } \omega \notin A. \end{cases}$$
♦
Definition 7.4. A random variable X is said to be a discrete random variable if it has the representation
$$X(\omega) = \sum_{i\in I} x_i\, I_{A_i}(\omega) \quad \text{for all } \omega \in \Omega,$$
where I ⊆ N, (A_i)_{i∈I} is a partition of Ω and A_i ∈ K, x_i ∈ R for all i ∈ I. If I is a finite set, then X is called a simple random variable.
For a random variable X we consider
KX = {X −1 (B) : B ∈ B}.
It is easily verified that KX is a σ-field. It is called the σ-field generated by X.
Proposition 7.5. The following statements are equivalent:
(1) X is a random variable.
(2) X −1 ((−∞, x)) ∈ K for all x ∈ R.
(3) X −1 ((−∞, x]) ∈ K for all x ∈ R.
(4) X −1 ((x, ∞)) ∈ K for all x ∈ R.
(5) X −1 ([x, ∞)) ∈ K for all x ∈ R.
Proof. (1) ⇒ (2) Since (−∞, x) ∈ B it follows by Definition 7.2 that
X −1 ((−∞, x)) ∈ K.
(2) ⇒ (3) We write
$$(-\infty, x] = \bigcap_{n=1}^{\infty} \Big(-\infty,\, x + \frac{1}{n}\Big)$$
and use that $X^{-1}\big((-\infty, x + \frac1n)\big) \in K$ for all n ∈ N.
(3) ⇒ (4) We have
$$X^{-1}(\mathbb{R}) = \bigcup_{n=1}^{\infty} X^{-1}\big((-\infty, n]\big) \in K.$$
Then we can write
$$X^{-1}\big((x,\infty)\big) = X^{-1}\big(\mathbb{R} \setminus (-\infty, x]\big) = X^{-1}(\mathbb{R}) \setminus X^{-1}\big((-\infty, x]\big) \in K.$$
(4) ⇒ (5) We write
$$[x, \infty) = \bigcap_{n=1}^{\infty} \Big(x - \frac{1}{n},\, \infty\Big)$$
and take into consideration that $X^{-1}\big((x - \frac1n, \infty)\big) \in K$ for all n ∈ N.
(5) ⇒ (1) We consider the set M = {A ⊆ R : X^{-1}(A) ∈ K}, which is a σ-field over R. The set G = {[x, ∞) : x ∈ R} is a subset of M and moreover σ(G) = B. Since G ⊆ M and M, σ(G) are σ-fields, it follows that σ(G) ⊆ M. Therefore, for all B ∈ σ(G) = B we have B ∈ M, and so X^{-1}(B) ∈ K.
Proposition 7.6. Let F : Rn → R be a B n /B measurable function, and let X be
a random vector. Then F ◦ X is also a random variable.
Proof. We have F^{-1}(B) ∈ B^n for all B ∈ B, because F is a measurable function. Since X is a random vector we obtain {ω ∈ Ω : X(ω) ∈ F^{-1}(B)} ∈ K. Hence
$$(F \circ X)^{-1}(B) = \{\omega \in \Omega : F(X(\omega)) \in B\} \in K.$$
Using the above-proved proposition we easily can prove the following properties.
Proposition 7.7. Let X and Y be random variables. Then aX (where a ∈ R), |X|, min{X, Y}, max{X, Y}, X + Y, X − Y, XY, and X/Y (if Y(ω) ≠ 0 for all ω ∈ Ω) are random variables.
Proposition 7.8. Let X and Y be random variables. Then
{ω ∈ Ω : X(ω) > Y (ω)} ∈ K,
{ω ∈ Ω : X(ω) ≥ Y (ω)} ∈ K,
and
{ω ∈ Ω : X(ω) = Y (ω)} ∈ K.
Proof. We have
$$\{\omega \in \Omega : X(\omega) > Y(\omega)\} = \bigcup_{r \in \mathbb{Q}} \Big( \{\omega \in \Omega : X(\omega) > r\} \cap \{\omega \in \Omega : r > Y(\omega)\} \Big).$$
From this relation and Proposition 7.5 (applied for X and Y respectively) it follows that {ω ∈ Ω : X(ω) > Y(ω)} ∈ K. Using the above-mentioned result we also obtain {ω ∈ Ω : Y(ω) > X(ω)} ∈ K. Finally we write
$$\{\omega \in \Omega : X(\omega) \ge Y(\omega)\} = \overline{\{\omega \in \Omega : Y(\omega) > X(\omega)\}} \in K$$
and
$$\{\omega \in \Omega : X(\omega) = Y(\omega)\} = \{\omega \in \Omega : X(\omega) \ge Y(\omega)\} \setminus \{\omega \in \Omega : X(\omega) > Y(\omega)\} \in K.$$
Proposition 7.9. Let (X_n)_{n∈N} be a sequence of random variables. Then
$$\sup_{n\in\mathbb{N}} X_n, \qquad \inf_{n\in\mathbb{N}} X_n, \qquad \limsup_{n\to\infty} X_n, \qquad \liminf_{n\to\infty} X_n$$
are random variables.
Proof. We have
$$\Big\{\omega \in \Omega : \sup_{n\in\mathbb{N}} X_n(\omega) > x\Big\} = \bigcup_{n=1}^{\infty} \{\omega \in \Omega : X_n(\omega) > x\} \in K,$$
hence $\sup_{n\in\mathbb{N}} X_n$ is a random variable. Using the identity
$$\inf_{n\in\mathbb{N}} X_n = -\sup_{n\in\mathbb{N}} (-X_n),$$
it follows that $\inf_{n\in\mathbb{N}} X_n$ is also a random variable. The rest follows from the first two results and the identities
$$\limsup_{n\to\infty} X_n = \inf_{n\ge 1}\Big(\sup_{k\ge n} X_k\Big), \qquad \liminf_{n\to\infty} X_n = \sup_{n\ge 1}\Big(\inf_{k\ge n} X_k\Big).$$
Proposition 7.10. Let X be a random variable. Then the mapping PX : B → R
defined by
PX (B) = P(ω ∈ Ω : X(ω) ∈ B)
is a probability over B.
Proof. We verify Definition 1.11:
(i) PX (R) = P(ω ∈ Ω : X(ω) ∈ R) = P(Ω) = 1.
(ii) PX (B) ≥ 0 for all B ∈ B.
(iii) If (B_n)_{n∈N} are mutually exclusive events, then
$$P_X\Big(\bigcup_{n=1}^{\infty} B_n\Big) = P\Big(\omega \in \Omega : X(\omega) \in \bigcup_{n=1}^{\infty} B_n\Big) = P\Big(\bigcup_{n=1}^{\infty} X^{-1}(B_n)\Big) = \sum_{n=1}^{\infty} P\big(X^{-1}(B_n)\big) = \sum_{n=1}^{\infty} P_X(B_n).$$
Definition 7.11. The probability PX : B → R, introduced in Proposition 7.10 is
called probability distribution of X.
If X is a simple random variable, represented by
$$X(\omega) = \sum_{i=1}^{n} x_i\, I_{A_i}(\omega) \quad \text{for all } \omega \in \Omega,$$
then P_X is completely determined by
$$P_X(\{x_i\}) = P(\omega \in \Omega : X(\omega) = x_i), \quad i \in \{1, \dots, n\}.$$
In this case P_X is said to be the probability mass function of the simple random variable X.
Example 7.12.
(1) We roll one die. Let X be the random variable that takes the value 1 if we obtained 6, and 0 otherwise. The sample space is Ω = {e_1, e_2, e_3, e_4, e_5, e_6}, where the elementary events are
e_i : we obtained the number i, for i ∈ {1, . . . , 6}.
The σ-field is K = P(Ω) (the set of all subsets of Ω). Then X has the probability mass function
$$P_X(\{x\}) = P(\omega \in \Omega : X(\omega) = x) = p^x q^{1-x}, \quad x \in \{0, 1\},$$
where p = 1/6 and q = 5/6.
(2) We consider n = 3 Bernoulli trials, where in each trial there are only two
possible outcomes: ”success” (s) with the probability p, and ”failure” (f ) with the
probability q. We consider the discrete random variable X that represents how
many successes occurred in n = 3 Bernoulli trials. The sample space, i.e. the set
of all possible outcomes, is
$$\Omega = \big\{(s,s,s),\,(f,s,s),\,(s,f,s),\,(s,s,f),\,(f,f,s),\,(f,s,f),\,(s,f,f),\,(f,f,f)\big\}.$$
Then the random variable X is given by
$$X(\omega) = \begin{cases} 0 & \text{if } \omega = (f,f,f),\\ 1 & \text{if } \omega \in \{(f,f,s),\,(f,s,f),\,(s,f,f)\},\\ 2 & \text{if } \omega \in \{(f,s,s),\,(s,f,s),\,(s,s,f)\},\\ 3 & \text{if } \omega = (s,s,s). \end{cases}$$
The probability mass function of X is
$$P_X(\{x\}) = P(\omega \in \Omega : X(\omega) = x) = C_n^x\, p^x q^{n-x}, \quad x \in \{0, 1, 2, 3\}.$$
♦
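The probability mass function from Example 7.12 (2) can be tabulated directly; a minimal sketch (assuming a fair coin, p = q = 1/2):

```python
from math import comb

n, p = 3, 0.5
q = 1 - p
pmf = {x: comb(n, x) * p**x * q**(n - x) for x in range(n + 1)}
print(pmf)  # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```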
Notation:
In what follows we denote the event {ω ∈ Ω : X(ω) ∈ B} simply by {X ∈ B}.
8 Distribution Function
Definition 8.1. The function F : R → R, defined by
F (x) = P(X ≤ x),
is called distribution function or cumulative distribution function of X.
Example 8.2.
The values of the binomial distribution function F : R → R of the random variable
X from Example 7.12 (2) are

0
if x < 0



3

q
if
0≤x<1

3
2
q + 3pq
if 1 ≤ x < 2
F (x) =

3 + 3pq 2 + 3p2 q if 2 ≤ x < 3

q



1
if 3 ≤ x .
The graphic representation of F for n = 3 and p = 0.45 is given in Figure 1.
If we want to determine probabilities by means of the distribution function we may
use the following properties.
Proposition 8.3. Let a, b ∈ R, a < b. The distribution function F : R → R of a
random variable X has the following properties:
(1) P(a < X ≤ b) = F (b) − F (a) .
Fig. 1: Binomial distribution function for n = 3 and p = 0.45
(2) P(X = b) = F(b) − F(b−), where the limit from the left is defined by $F(b-) = \lim_{y \nearrow b} F(y)$.
(3) P (X < b) = F (b−).
Proof. (1) We have
P(a < X ≤ b) = P({X ≤ b} \ {X ≤ a})
= P(X ≤ b) − P(X ≤ a) = F (b) − F (a).
(2) We consider a monotone increasing sequence (x_n)_{n∈N} of real numbers with $\lim_{n\to\infty} x_n = b$. For all n ∈ N we define the event A_n = {x_n < X ≤ b}. Since (x_n)_{n∈N} is an increasing sequence, it follows that A_{n+1} ⊆ A_n for all n ∈ N. By Theorem 1.14 we obtain
$$\lim_{n\to\infty} P(A_n) = P\Big(\bigcap_{n=1}^{\infty} A_n\Big) = P(X = b).$$
But P(A_n) = F(b) − F(x_n) for all n ∈ N. Thus,
$$\lim_{n\to\infty} \big(F(b) - F(x_n)\big) = P(X = b),$$
hence F(b) − F(b−) = P(X = b).
(3) We have P(X < b) = P(X ≤ b) − P(X = b) = F (b−).
Theorem 8.4. The distribution function F : R → R of a random variable X has
the following properties:
(1) F is monotone increasing.
(2) F is continuous from the right, that is, F(x+) = F(x) for each x ∈ R, where the limit from the right is defined by $F(x+) = \lim_{y \searrow x} F(y)$.
(3) $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
Proof. (1) If a ≤ b, then we have {X ≤ a} ⊆ {X ≤ b}. By the properties of
the probability P (see Theorem 1.12) it follows that F (a) = P(X ≤ a) ≤ P(X ≤
b) = F (b).
(2) We fix a point x ∈ R and consider a monotone decreasing sequence (x_n)_{n∈N} of real numbers with $\lim_{n\to\infty} x_n = x$. For all n ∈ N we define the event A_n = {x < X ≤ x_n}. Since A_{n+1} ⊆ A_n for all n ∈ N, it follows by Theorem 1.14 that
$$\lim_{n\to\infty} P(A_n) = P\Big(\bigcap_{n=1}^{\infty} A_n\Big) = P(\emptyset) = 0.$$
But P(A_n) = F(x_n) − F(x) for all n ∈ N by Proposition 8.3. Thus, $\lim_{n\to\infty} F(x_n) = F(x)$ and F(x+) = F(x).
(3) We consider a monotone decreasing sequence (x_n)_{n∈N} of real numbers with $\lim_{n\to\infty} x_n = -\infty$. For all n ∈ N we define the event A_n = {X ≤ x_n}. Since A_{n+1} ⊆ A_n for all n ∈ N, we get by Theorem 1.14 that
$$\lim_{n\to\infty} P(A_n) = P\Big(\bigcap_{n=1}^{\infty} A_n\Big) = P(\emptyset) = 0.$$
But P(A_n) = F(x_n) for all n ∈ N. Thus, $\lim_{n\to\infty} F(x_n) = 0$, and hence $\lim_{x\to-\infty} F(x) = 0$.
We consider a monotone increasing sequence (y_n)_{n∈N} of real numbers with $\lim_{n\to\infty} y_n = \infty$. For all n ∈ N we define the event B_n = {y_n < X}. Since (y_n)_{n∈N} is an increasing sequence, it follows that B_{n+1} ⊆ B_n for all n ∈ N. By Theorem 1.14 we obtain
$$\lim_{n\to\infty} P(B_n) = P\Big(\bigcap_{n=1}^{\infty} B_n\Big) = P(\emptyset) = 0.$$
But P(B_n) = 1 − P(X ≤ y_n) = 1 − F(y_n) for all n ∈ N. Thus, $\lim_{n\to\infty} F(y_n) = 1$, and hence $\lim_{x\to\infty} F(x) = 1$.
Remark 8.5.
The properties (1), (2) and (3) from Theorem 8.4 characterize a distribution function, i.e. if a function F : R → R has these properties, then there exists a random variable X which has F as its distribution function.
To prove this property, consider the function G : [0, 1] → R, defined by
$$G(y) = \begin{cases} \inf F^{-1}(\{y\}) & \text{if } y \in F(\mathbb{R}),\ y \ne 0,\\ \sup F^{-1}(\{y\}) & \text{if } y \in F(\mathbb{R}),\ y = 0,\\ x_y & \text{if } y \notin F(\mathbb{R}),\ y \in\, ]F(x_y), F(x_y + 0)[\,. \end{cases}$$
Note that for y ∈ [0, 1] \ F(R) there exists a unique point x_y ∈ R such that y ∈ ]F(x_y), F(x_y + 0)[.
Let Ω = [0, 1], K = B[0, 1] (the σ-field of Borel subsets of [0, 1]) and let P = λ be the Lebesgue measure on [0, 1]. We define the random variable X : Ω → R by X(ω) = G(ω). Then the distribution function of X is
$$F_X(x) = P(X \le x) = P(\omega \in \Omega : G(\omega) \le x) = \lambda(y \in [0,1] : y \le F(x)) = F(x).$$
△
9 Density Function
Definition 9.1. Let X be a random variable, and let F : R → R be its distribution function. If there exists a function f : R → R such that
$$F(x) = \int_{-\infty}^{x} f(t)\,dt \quad \text{for all } x \in \mathbb{R}, \qquad (5)$$
then f is called density function of X. If the random variable X admits a density
function, then X is said to be continuous random variable. Sometimes it is also
said that such X has a continuous distribution.
Theorem 9.2. Let a, b ∈ R, a < b. If X is a continuous random variable having
the distribution function F and the density function f , then the following properties
hold:
(1) F is absolutely continuous¹ and F′(x) = f(x) for almost every x ∈ R.
¹ A function F : J → R, where J ⊆ R, is absolutely continuous if for any ε > 0 there exists δ > 0 such that for any finite family of open and disjoint intervals (a_i, b_i), i = 1, . . . , n, from J with $\sum_{i=1}^{n} (b_i - a_i) < \delta$ we have $\sum_{i=1}^{n} |F(b_i) - F(a_i)| < \varepsilon$.
(2) f(x) ≥ 0 for almost every x ∈ R.
(3) $\int_{\mathbb{R}} f(t)\,dt = 1$.
(4) For a, b ∈ R with a < b we have P(X = b) = 0 and
$$P(a < X < b) = P(a \le X < b) = P(a \le X \le b) = P(a < X \le b) = \int_{a}^{b} f(t)\,dt\,.$$
Proof. (1) From the expression (5) of the distribution function F it follows that F is absolutely continuous and that it admits a derivative almost everywhere. Differentiating each side of the equality (5), it follows that F′(x) = f(x) for almost every x ∈ R.
(2) Since F is monotone increasing (see Theorem 8.4) it follows that F′(x) ≥ 0 for almost every x ∈ R. Thus by (1) we obtain f(x) ≥ 0 for almost every x ∈ R.
(3) We write
$$\int_{\mathbb{R}} f(t)\,dt = \lim_{x\to\infty} \int_{-\infty}^{x} f(t)\,dt = \lim_{x\to\infty} F(x) = 1,$$
where we used the properties of F from Theorem 8.4.
(4) By using Proposition 8.3 we have P(X = b) = F(b) − F(b−). But F is continuous, hence F(b) − F(b−) = 0 and then P(X = b) = 0. We write
$$P(a < X < b) = P(a < X < b) + P(X = b) = P(a < X \le b) = P(X = a) + P(a < X \le b) = P(a \le X \le b) = P(a \le X < b) + P(X = b) = F(b) - F(a) = \int_a^b f(t)\,dt.$$

10 Continuous Distributions
Uniform Distribution on an Interval
Let a, b ∈ R, a < b. A random variable X has uniform distribution on the interval [a, b] if its density function is given by
$$f_X(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } x \in [a,b],\\[4pt] 0 & \text{if } x \notin [a,b]. \end{cases}$$
Then the distribution function has the form
$$F_X(x) = \int_{-\infty}^{x} f(t)\,dt = \begin{cases} 0 & \text{if } x < a,\\[2pt] \dfrac{x-a}{b-a} & \text{if } a \le x \le b,\\[4pt] 1 & \text{if } x > b\,. \end{cases}$$
The distribution function and density function of a random variable which is uniformly distributed on the interval [2, 3] are given in Figure 2.
Fig. 2: Uniform distribution for a = 2 and b = 3
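The piecewise formula for F_X translates directly into code; a minimal sketch for the values a = 2, b = 3 used in Figure 2:

```python
def uniform_cdf(x, a=2.0, b=3.0):
    """Distribution function of the uniform distribution on [a, b]."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

print(uniform_cdf(1.0), uniform_cdf(2.5), uniform_cdf(4.0))  # 0.0 0.5 1.0
```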
Normal Distribution
A random variable X has normal distribution N(m, σ) if it has the density function
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^{2}\right),$$
where m ∈ R and σ > 0 are parameters. Then the distribution function is given by
$$F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\frac{x-m}{\sigma}} \exp\left(\frac{-t^2}{2}\right) dt.$$
The distribution function and density function of a random variable which has
N (0, 1) normal distribution are given in Figure 3.
Fig. 3: Normal distribution for m = 0 and σ = 1: (a) density function, (b) distribution function
Gamma Distribution
We say that a random variable X has Gamma distribution with parameters a, b ∈ R₊ if it has the density function given by
$$f_X(x) = \begin{cases} 0 & \text{if } x \le 0,\\[2pt] \dfrac{1}{\Gamma(a)\,b^a}\, x^{a-1} \exp\left(-\dfrac{x}{b}\right) & \text{if } x > 0, \end{cases}$$
where
$$\Gamma(a) = \int_{0}^{\infty} x^{a-1} \exp(-x)\,dx$$
is the Gamma function of Euler (for more properties of Γ see Appendix B.1). The distribution function and density function of a random variable which has Gamma distribution with parameter a = 5 are given in Figure 4.
Fig. 4: Gamma distribution for a = 5: (a) density function, (b) distribution function

Exponential Distribution
Let λ > 0 be given. If we take a = 1 and b = 1/λ in the Gamma distribution, then we obtain the exponential distribution with parameter λ. The density function is then given by
$$f_X(x) = \begin{cases} 0 & \text{if } x \le 0,\\ \lambda e^{-\lambda x} & \text{if } x > 0. \end{cases}$$
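One can verify numerically that the Gamma density with a = 1 and b = 1/λ reduces to λe^(−λx); a quick check using `math.gamma`:

```python
from math import gamma, exp

def gamma_pdf(x, a, b):
    """Gamma density with parameters a, b (zero for x <= 0)."""
    return 0.0 if x <= 0 else x**(a - 1) * exp(-x / b) / (gamma(a) * b**a)

lam = 2.0
diffs = [abs(gamma_pdf(x, 1.0, 1.0 / lam) - lam * exp(-lam * x))
         for x in (0.1, 0.5, 1.0, 3.0)]
print(max(diffs) < 1e-12)  # True
```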
χ² Distribution
Let n ∈ N and σ > 0 be given. If we take a = n/2 and b = 2σ² in the Gamma distribution, then we obtain the so-called χ² distribution with n degrees of freedom, denoted by χ²(n, σ). The density function is then
$$f_X(x) = \begin{cases} 0 & \text{if } x \le 0,\\[2pt] \dfrac{1}{\Gamma\!\left(\frac n2\right) 2^{\frac n2} \sigma^{n}}\, x^{\frac n2 - 1} \exp\left(-\dfrac{x}{2\sigma^2}\right) & \text{if } x > 0. \end{cases}$$
The distribution functions and density functions of random variables which have χ²(n, 1) distribution for n ∈ {1, 5, 10, 30} are given in Figure 5.
Fig. 5: χ² density function: (a) for n = 1, (b) for n ∈ {5, 10, 30}
Beta Distribution
We say that a random variable X has Beta distribution with parameters a, b ∈ R₊ if it has the density function given by
$$f_X(x) = \begin{cases} \dfrac{1}{B(a,b)}\, x^{a-1} (1-x)^{b-1} & \text{if } x \in [0,1],\\[4pt] 0 & \text{if } x \notin [0,1], \end{cases}$$
where
$$B(a,b) = \int_{0}^{1} x^{a-1} (1-x)^{b-1}\,dx$$
is the Beta function of Euler (for more properties of B see Appendix B.2).
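As a numerical check, B(a, b) computed by direct integration can be compared against the classical identity B(a, b) = Γ(a)Γ(b)/Γ(a + b) (a standard property of the Euler integrals; the values a = 2.5, b = 4 below are arbitrary):

```python
from math import gamma

a, b = 2.5, 4.0
n = 100_000
h = 1.0 / n
# midpoint-rule approximation of the Beta integral on [0, 1]
integral = sum(((i + 0.5) * h)**(a - 1) * (1 - (i + 0.5) * h)**(b - 1)
               for i in range(n)) * h
identity = gamma(a) * gamma(b) / gamma(a + b)
print(abs(integral - identity) < 1e-6)  # True
```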
Student Distribution
We say that a random variable X has Student distribution with n ∈ N degrees of freedom if it has the density function given by
$$f_X(x) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\!\left(\frac n2\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}.$$
Snedecor-Fischer Distribution (F Distribution)
We say that a random variable X has Snedecor-Fischer distribution (or F distribution) with m, n ∈ N degrees of freedom if it has the density function given by
$$f_X(x) = \begin{cases} 0 & \text{if } x \le 0,\\[2pt] \dfrac{\left(\frac mn\right)^{\frac m2} x^{\frac m2 - 1}}{B\!\left(\frac m2, \frac n2\right) \left(1 + \frac mn x\right)^{\frac{m+n}{2}}} & \text{if } x > 0\,. \end{cases}$$
This distribution is denoted by F(m, n).
11 Joint Distribution Function
Definition 11.1. The function F : Rn → R, defined by
F (x1 , . . . , xn ) = P(X1 ≤ x1 , . . . , Xn ≤ xn ),
is called joint distribution function of the random vector (X1 , . . . , Xn ).
If we want to determine probabilities from the joint distribution function we may
use the following property.
Proposition 11.2. The joint distribution function F of a random vector (X_1, . . . , X_n) has the following property:
$$P(a_1 < X_1 \le b_1, \dots, a_n < X_n \le b_n) = F(b_1, \dots, b_n) - \sum_{i=1}^{n} F(b_1, \dots, a_i, \dots, b_n) + \sum_{\substack{i,j=1\\ i<j}}^{n} F(b_1, \dots, a_i, \dots, a_j, \dots, b_n) - \dots + (-1)^n F(a_1, \dots, a_n) \qquad (6)$$
whenever a_i, b_i ∈ R and a_i < b_i for all i ∈ {1, . . . , n}.
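For n = 2 this formula reads P(a₁ < X₁ ≤ b₁, a₂ < X₂ ≤ b₂) = F(b₁, b₂) − F(a₁, b₂) − F(b₁, a₂) + F(a₁, a₂). A minimal numerical illustration, assuming two independent U(0, 1) components so that the joint distribution function factorizes:

```python
def F(x, y):
    """Joint distribution function of two independent U(0,1) variables."""
    clamp = lambda u: min(max(u, 0.0), 1.0)
    return clamp(x) * clamp(y)

a1, b1, a2, b2 = 0.2, 0.7, 0.1, 0.4
prob = F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)
print(round(prob, 4))  # 0.15, i.e. (0.7 - 0.2) * (0.4 - 0.1)
```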
Theorem 11.3. If (X1 , . . . , Xn ) is a random vector having the joint distribution
function F , then the following properties hold:
(1) F is monotone increasing in each argument.
(2) F is right continuous in each argument.
(3) $\lim_{x_1, \dots, x_n \to \infty} F(x_1, \dots, x_n) = 1$.
(4) For each k ∈ {1, . . . , n} we have
$$\lim_{x_k \to -\infty} F(x_1, \dots, x_{k-1}, x_k, x_{k+1}, \dots, x_n) = 0$$
for all x_1, . . . , x_{k−1}, x_{k+1}, . . . , x_n ∈ R.
12 Joint Density Function
Definition 12.1. Let (X_1, . . . , X_n) be a random vector, and let F be its joint distribution function. If there exists a function f : R^n → R such that
$$F(x_1, \dots, x_n) = \int_{-\infty}^{x_1} \dots \int_{-\infty}^{x_n} f(t_1, \dots, t_n)\,dt_1 \dots dt_n,$$
then f is called the joint density function of the random vector (X_1, . . . , X_n). If the random vector (X_1, . . . , X_n) admits a joint density function, then it is said to be a continuous random vector. Sometimes it is said that such a vector has a continuous distribution.
The properties of the joint density function of a random vector are given in the
following theorem.
Theorem 12.2. If (X_1, . . . , X_n) is a continuous random vector having the joint distribution function F and the joint density function f, then the following properties hold:
(1) F is an absolutely continuous function and
$$\frac{\partial^n F(x_1, \dots, x_n)}{\partial x_1 \partial x_2 \dots \partial x_n} = f(x_1, \dots, x_n)$$
for a.e. (x_1, . . . , x_n) ∈ R^n.
(2) f(x_1, . . . , x_n) ≥ 0 for a.e. (x_1, . . . , x_n) ∈ R^n.
(3) $\int \dots \int_{\mathbb{R}^n} f(t_1, \dots, t_n)\,dt_1 \dots dt_n = 1$.
(4) For any B ∈ B^n it holds that
$$P\big((X_1, \dots, X_n) \in B\big) = \int \dots \int_{B} f(t_1, \dots, t_n)\,dt_1 \dots dt_n.$$
Example 12.3.
(1) A random vector (X_1, . . . , X_n) has a uniform distribution on
$$I = [a_1, b_1] \times \dots \times [a_n, b_n],$$
with a_i, b_i ∈ R, a_i < b_i, i ∈ {1, . . . , n}, if it has the joint density function
$$f(x_1, \dots, x_n) = \begin{cases} \dfrac{1}{(b_1 - a_1) \cdots (b_n - a_n)} & \text{if } (x_1, \dots, x_n) \in I,\\[4pt] 0 & \text{if } (x_1, \dots, x_n) \notin I. \end{cases}$$
Then the joint distribution function has the form
$$F(x_1, \dots, x_n) = \int_{-\infty}^{x_1} \dots \int_{-\infty}^{x_n} f(t_1, \dots, t_n)\,dt_1 \dots dt_n = \left(\frac{x_1 - a_1}{b_1 - a_1}\right)^{\!*} \cdots \left(\frac{x_n - a_n}{b_n - a_n}\right)^{\!*},$$
where we use the notation
$$u^* = \begin{cases} 0 & \text{if } u < 0,\\ u & \text{if } 0 \le u \le 1,\\ 1 & \text{if } 1 < u. \end{cases}$$
(2) Let m_1, m_2 ∈ R, σ_1, σ_2 > 0 and r ∈ (−1, 1). A random vector (X, Y) has normal distribution N(m_1, m_2, σ_1, σ_2, r) if it has the density function
$$f(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}} \exp\left(-\frac{1}{2(1-r^2)} \left[\left(\frac{x-m_1}{\sigma_1}\right)^{2} - \frac{2r(x-m_1)(y-m_2)}{\sigma_1\sigma_2} + \left(\frac{y-m_2}{\sigma_2}\right)^{2}\right]\right).$$
The graphic representation of f is given in Figure 6. ♦

12.1 Independent Random Variables
Definition 12.4. Two random variables X and Y are called independent if
$$F_X(x)\,F_Y(y) = F_{(X,Y)}(x,y) \quad \text{for all } x, y \in \mathbb{R}.$$
The random variables X_1, . . . , X_n are independent if
$$F_{X_1}(x_1) \cdots F_{X_n}(x_n) = F_{(X_1, \dots, X_n)}(x_1, \dots, x_n)$$
for all x_1, . . . , x_n ∈ R.
A sequence (X_n)_{n∈N} of random variables is said to be independent if for each finite subset {i_1, . . . , i_m} ⊂ N the random variables X_{i_1}, . . . , X_{i_m} are independent.
Fig. 6: Normal distribution N(0, 0, 1, 2, −0.5)
Theorem 12.5. The following properties hold:
(1) If $X = \sum_{i\in I} x_i\, I_{A_i}(\omega)$ and $Y = \sum_{j\in J} y_j\, I_{B_j}(\omega)$ are discrete random variables, then they are independent if and only if
$$P(X = x_i)\,P(Y = y_j) = P(X = x_i, Y = y_j)$$
for all i ∈ I, j ∈ J.
(2) If X and Y are continuous random variables, then they are independent if and only if
$$f_X(x)\,f_Y(y) = f_{(X,Y)}(x,y) \quad \text{for a.e. } (x,y) \in \mathbb{R}^2.$$
Proof. (1) Let x ∈ {x_i : i ∈ I}, y ∈ {y_j : j ∈ J}. First we assume that X and Y are independent random variables. By using properties of the distribution function (see Proposition 8.3) we write
$$P(X = x, Y = y) = P\Big(\bigcap_{n=1}^{\infty} \Big\{x - \frac1n < X \le x,\ y - \frac1n < Y \le y\Big\}\Big) = \lim_{n\to\infty} P\Big(x - \frac1n < X \le x,\ y - \frac1n < Y \le y\Big)$$
$$= \lim_{n\to\infty} \Big[ F_{(X,Y)}\Big(x - \frac1n,\, y - \frac1n\Big) - F_{(X,Y)}\Big(x,\, y - \frac1n\Big) - F_{(X,Y)}\Big(x - \frac1n,\, y\Big) + F_{(X,Y)}(x, y) \Big]$$
$$= \lim_{n\to\infty} \Big[ F_X\Big(x - \frac1n\Big) F_Y\Big(y - \frac1n\Big) - F_X(x)\, F_Y\Big(y - \frac1n\Big) - F_X\Big(x - \frac1n\Big) F_Y(y) + F_X(x)\, F_Y(y) \Big]$$
$$= F_X(x-)\,F_Y(y-) - F_X(x)\,F_Y(y-) - F_X(x-)\,F_Y(y) + F_X(x)\,F_Y(y) = \big(F_X(x) - F_X(x-)\big)\big(F_Y(y) - F_Y(y-)\big) = P(X = x)\,P(Y = y).$$
Thus P(X = x)P(Y = y) = P(X = x, Y = y).
Now we assume that P(X = x)P(Y = y) = P(X = x, Y = y) holds. By using this equality, we get
$$F_{(X,Y)}(x,y) = P(X \le x, Y \le y) = \sum_{\substack{x_i \le x\\ i \in I}} \sum_{\substack{y_j \le y\\ j \in J}} P(X = x_i, Y = y_j) = \sum_{\substack{x_i \le x\\ i \in I}} \sum_{\substack{y_j \le y\\ j \in J}} P(X = x_i)\,P(Y = y_j)$$
$$= \sum_{\substack{x_i \le x\\ i \in I}} P(X = x_i) \sum_{\substack{y_j \le y\\ j \in J}} P(Y = y_j) = P(X \le x)\,P(Y \le y) = F_X(x)\,F_Y(y).$$
Hence X and Y are independent.
(2) First we assume that X and Y are independent. By Theorem 12.2 and Definition 12.4 we have
$$f_{(X,Y)}(x,y) = \frac{\partial^2 F_{(X,Y)}(x,y)}{\partial x\, \partial y} = \frac{\partial F_X(x)}{\partial x}\, \frac{\partial F_Y(y)}{\partial y} = f_X(x)\,f_Y(y)$$
for a.e. (x, y) ∈ R².
Now we assume that f_X(x)f_Y(y) = f_{(X,Y)}(x,y) holds for a.e. (x, y) ∈ R². Let (x, y) ∈ R²; then
$$F_{(X,Y)}(x,y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{(X,Y)}(u,v)\,du\,dv = \int_{-\infty}^{x} f_X(u)\,du \int_{-\infty}^{y} f_Y(v)\,dv = F_X(x)\,F_Y(y).$$
Hence X and Y are independent.
Theorem 12.6.
(1) The random variables X_1, . . . , X_n are independent if and only if
$$P(X_1 \in A_1) \cdots P(X_n \in A_n) = P(X_1 \in A_1, \dots, X_n \in A_n)$$
for every A_1, . . . , A_n ∈ B.
(2) Let X_1, . . . , X_n be discrete random variables. They are independent if and only if
$$P(X_1 = x_1) \cdots P(X_n = x_n) = P(X_1 = x_1, \dots, X_n = x_n)$$
for all x_1, . . . , x_n ∈ R.
(3) Let X_1, . . . , X_n be continuous random variables. They are independent if and only if
$$f_{X_1}(x_1) \cdots f_{X_n}(x_n) = f_{(X_1, \dots, X_n)}(x_1, \dots, x_n)$$
for a.e. (x_1, . . . , x_n) ∈ R^n.
Using Theorem 12.6 and the Definition 7.1 of measurable functions we obtain the
following result.
Proposition 12.7. Let X1 , . . . , Xn be independent random variables and let g :
R → R be a B/B measurable function. Then the random variables g ◦ X1 , . . . , g ◦
Xn are independent.
13 Functions of Random Variables
Let X = (X_1, . . . , X_n) be a continuous random vector with joint density function f and let g = (g_1, . . . , g_k) : R^n → R^k be a B^n/B^k measurable function. We want to determine the joint density function of the vector
$$g \circ X = \big(g_1(X_1, \dots, X_n), \dots, g_k(X_1, \dots, X_n)\big).$$
We write the joint distribution function F_{g∘X} : R^k → R of the vector g ∘ X as follows:
$$F_{g\circ X}(z_1, \dots, z_k) = P\big(g_1(X_1, \dots, X_n) \le z_1, \dots, g_k(X_1, \dots, X_n) \le z_k\big) = P\big((X_1, \dots, X_n) \in B\big) = \int \dots \int_{B} f(t_1, \dots, t_n)\,dt_1 \dots dt_n,$$
where
$$B = \big\{(x_1, \dots, x_n) \in \mathbb{R}^n : g_i(x_1, \dots, x_n) \le z_i,\ i = 1, \dots, k\big\} \in \mathcal{B}^n.$$
Then the joint density function f_{g∘X} can be calculated by
$$f_{g\circ X}(z_1, \dots, z_k) = \frac{\partial^k F_{g\circ X}(z_1, \dots, z_k)}{\partial z_1\, \partial z_2 \dots \partial z_k}$$
for almost every (z_1, . . . , z_k) ∈ R^k.
We now consider some special cases of the function g. For the addition of two random variables, we take g : R² → R given by g(x, y) = x + y.
Proposition 13.1. The density function of the sum X + Y of the random variables X and Y is given by
$$f_{X+Y}(z) = \int_{\mathbb{R}} f_{(X,Y)}(u, z-u)\,du \quad \text{for a.e. } z \in \mathbb{R}.$$
Proof. We start with
$$F_{X+Y}(z) = P(X + Y \le z) = \iint_{\{(x,y)\in\mathbb{R}^2 :\, x+y \le z\}} f_{(X,Y)}(x,y)\,dx\,dy.$$
Then we make the change of variable Γ = (Γ_1, Γ_2) with
$$\Gamma_1(u,v) = u, \qquad \Gamma_2(u,v) = v - u.$$
The domain of integration D = {(x, y) ∈ R² : x + y ≤ z} becomes D_Γ = {(u, v) ∈ R² : v ≤ z}. The determinant of the Jacobian matrix is
$$\det J_\Gamma(u,v) = \begin{vmatrix} \frac{\partial \Gamma_1}{\partial u} & \frac{\partial \Gamma_2}{\partial u}\\[2pt] \frac{\partial \Gamma_1}{\partial v} & \frac{\partial \Gamma_2}{\partial v} \end{vmatrix} = \begin{vmatrix} 1 & -1\\ 0 & 1 \end{vmatrix} = 1.$$
Thus,
$$F_{X+Y}(z) = \iint_{D_\Gamma} 1 \cdot f_{(X,Y)}(u, v-u)\,du\,dv = \int_{\mathbb{R}} \int_{-\infty}^{z} f_{(X,Y)}(u, v-u)\,dv\,du.$$
We differentiate² with respect to z and obtain
$$f_{X+Y}(z) = \int_{\mathbb{R}} f_{(X,Y)}(u, z-u)\,du$$
for a.e. z ∈ R.
If X and Y are independent, then we have
$$f_{X+Y}(z) = \int_{\mathbb{R}} f_X(u)\,f_Y(z-u)\,du \quad \text{for a.e. } z \in \mathbb{R}.$$
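For independent X, Y ~ U(0, 1) the convolution integral yields the triangular density on [0, 2]; a numerical sketch using a midpoint rule for the integral over u:

```python
def f_uniform(x):
    """Density of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def f_sum(z, n=20_000):
    """Midpoint-rule approximation of the convolution  ∫ f_X(u) f_Y(z - u) du."""
    h = 2.0 / n  # integrate u over [-0.5, 1.5], which covers the support
    total = 0.0
    for i in range(n):
        u = -0.5 + (i + 0.5) * h
        total += f_uniform(u) * f_uniform(z - u)
    return total * h

print(round(f_sum(0.5), 3), round(f_sum(1.0), 3))  # 0.5 1.0
```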
For the multiplication of two random variables, we take g : R² → R given by g(x, y) = x · y.
Proposition 13.2. The density function of the product X · Y of the random variables X and Y is given by
$$f_{X\cdot Y}(z) = \int_{\mathbb{R}} \frac{1}{|u|}\, f_{(X,Y)}\Big(u, \frac zu\Big)\,du \quad \text{for a.e. } z \in \mathbb{R}.$$
Proof. We write
$$F_{X\cdot Y}(z) = P(X \cdot Y \le z) = \iint_{\{(x,y)\in\mathbb{R}^2 :\, xy \le z\}} f_{(X,Y)}(x,y)\,dx\,dy.$$
Then we make the change of variable Γ = (Γ_1, Γ_2) with
$$\Gamma_1(u,v) = u, \qquad \Gamma_2(u,v) = \frac vu.$$
The domain of integration D = {(x, y) ∈ R² : xy ≤ z} becomes D_Γ = {(u, v) ∈ R* × R : v ≤ z}. The determinant of the Jacobian matrix is
$$\det J_\Gamma(u,v) = \begin{vmatrix} \frac{\partial \Gamma_1}{\partial u} & \frac{\partial \Gamma_2}{\partial u}\\[2pt] \frac{\partial \Gamma_1}{\partial v} & \frac{\partial \Gamma_2}{\partial v} \end{vmatrix} = \begin{vmatrix} 1 & -\frac{v}{u^2}\\[2pt] 0 & \frac 1u \end{vmatrix} = \frac 1u.$$
Thus,
$$F_{X\cdot Y}(z) = \iint_{D_\Gamma} \frac{1}{|u|}\, f_{(X,Y)}\Big(u, \frac vu\Big)\,du\,dv = \int_{-\infty}^{0} \int_{-\infty}^{z} \Big(-\frac 1u\Big) f_{(X,Y)}\Big(u, \frac vu\Big)\,dv\,du + \int_{0}^{\infty} \int_{-\infty}^{z} \frac 1u\, f_{(X,Y)}\Big(u, \frac vu\Big)\,dv\,du.$$
We take the derivative with respect to z and obtain
$$f_{X\cdot Y}(z) = \int_{-\infty}^{0} \Big(-\frac 1u\Big) f_{(X,Y)}\Big(u, \frac zu\Big)\,du + \int_{0}^{\infty} \frac 1u\, f_{(X,Y)}\Big(u, \frac zu\Big)\,du = \int_{\mathbb{R}} \frac{1}{|u|}\, f_{(X,Y)}\Big(u, \frac zu\Big)\,du$$
for a.e. z ∈ R.
If X and Y are independent, then we have
$$f_{X\cdot Y}(z) = \int_{\mathbb{R}} \frac{1}{|u|}\, f_X(u)\, f_Y\Big(\frac zu\Big)\,du \quad \text{for a.e. } z \in \mathbb{R}.$$
² Differentiation formula: $\dfrac{d}{dz}\left(\displaystyle\int_{a(z)}^{b(z)} g(t)\,dt\right) = g(b(z))\,b'(z) - g(a(z))\,a'(z)$.
For the division of two random variables, we take g : R × R* → R given by g(x, y) = x/y.
Proposition 13.3. The density function of the quotient X/Y of the random variables X and Y (satisfying Y(ω) ≠ 0 for all ω ∈ Ω) is given by
$$f_{X/Y}(z) = \int_{\mathbb{R}} |v|\, f_{(X,Y)}(vz, v)\,dv \quad \text{for a.e. } z \in \mathbb{R}.$$
Proof. We write
$$F_{X/Y}(z) = P\Big(\frac XY \le z\Big) = \iint_{\{(x,y)\in\mathbb{R}\times\mathbb{R}^* :\, \frac xy \le z\}} f_{(X,Y)}(x,y)\,dx\,dy.$$
Then we make the change of variable Γ = (Γ_1, Γ_2) with
$$\Gamma_1(u,v) = u \cdot v, \qquad \Gamma_2(u,v) = v.$$
The domain of integration D = {(x, y) ∈ R × R* : x/y ≤ z} becomes D_Γ = {(u, v) ∈ R × R* : u ≤ z}. The determinant of the Jacobian matrix is
$$\det J_\Gamma(u,v) = \begin{vmatrix} \frac{\partial \Gamma_1}{\partial u} & \frac{\partial \Gamma_2}{\partial u}\\[2pt] \frac{\partial \Gamma_1}{\partial v} & \frac{\partial \Gamma_2}{\partial v} \end{vmatrix} = \begin{vmatrix} v & 0\\ u & 1 \end{vmatrix} = v.$$
Thus,
$$F_{X/Y}(z) = \iint_{D_\Gamma} |v|\, f_{(X,Y)}(u \cdot v, v)\,du\,dv = \int_{-\infty}^{0} \int_{-\infty}^{z} (-v)\, f_{(X,Y)}(u \cdot v, v)\,du\,dv + \int_{0}^{\infty} \int_{-\infty}^{z} v\, f_{(X,Y)}(u \cdot v, v)\,du\,dv.$$
We take the derivative with respect to z and obtain
$$f_{X/Y}(z) = \int_{-\infty}^{0} (-v)\, f_{(X,Y)}(z \cdot v, v)\,dv + \int_{0}^{\infty} v\, f_{(X,Y)}(z \cdot v, v)\,dv = \int_{\mathbb{R}} |v|\, f_{(X,Y)}(z \cdot v, v)\,dv$$
for a.e. z ∈ R.
If X and Y are independent, then we have
$$f_{X/Y}(z) = \int_{\mathbb{R}} |v|\, f_X(z \cdot v)\, f_Y(v)\,dv$$
for a.e. z ∈ R.
We consider now the case of simple random variables
$$X = \sum_{i\in I} x_i\, I_{\{X = x_i\}} \quad \text{and} \quad Y = \sum_{j\in J} y_j\, I_{\{Y = y_j\}}.$$
The probability distributions for X + Y, X · Y and X/Y are given by:
$$P(X + Y = x_i + y_j) = \sum_{(k,l)\in I_1(i,j)} P(X = x_k, Y = y_l),$$
where I_1(i, j) = {(k, l) ∈ I × J : x_k + y_l = x_i + y_j};
$$P(X \cdot Y = x_i y_j) = \sum_{(k,l)\in I_2(i,j)} P(X = x_k, Y = y_l),$$
where I_2(i, j) = {(k, l) ∈ I × J : x_k y_l = x_i y_j};
$$P(X/Y = x_i/y_j) = \sum_{(k,l)\in I_3(i,j)} P(X = x_k, Y = y_l),$$
where I_3(i, j) = {(k, l) ∈ I × J : x_k/y_l = x_i/y_j}.
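For two independent fair dice, the distribution of the sum can be enumerated exhaustively; a small sketch using exact rational arithmetic:

```python
from collections import Counter
from fractions import Fraction

one_sixth = Fraction(1, 6)
dist_sum = Counter()
for x in range(1, 7):      # values of X
    for y in range(1, 7):  # values of Y
        dist_sum[x + y] += one_sixth * one_sixth

print(dist_sum[7])  # 1/6
print(dist_sum[2])  # 1/36
```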
Proposition 13.4. Let g : R → R be a strictly monotone and differentiable function. Let X be a continuous random variable, and set Y = g ∘ X. Then the density function of Y is for a.e. y ∈ R given by
$$f_Y(y) = \begin{cases} \dfrac{f_X\big(g^{-1}(y)\big)}{\big|g'\big(g^{-1}(y)\big)\big|} & \text{if } y \in g(\mathbb{R}),\\[6pt] 0 & \text{if } y \notin g(\mathbb{R}). \end{cases}$$
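Proposition 13.4 can be checked numerically: taking X standard normal and g(x) = e^x (strictly increasing and differentiable), the formula gives f_Y(y) = f_X(ln y)/y for y > 0, which should integrate to 1. A sketch with a midpoint rule:

```python
from math import exp, log, sqrt, pi

def f_X(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def f_Y(y):
    """Density of Y = e^X by Proposition 13.4: f_X(g^{-1}(y)) / |g'(g^{-1}(y))|
    with g^{-1}(y) = ln y and g'(x) = e^x, i.e. f_X(ln y) / y for y > 0."""
    return f_X(log(y)) / y if y > 0 else 0.0

# f_Y should integrate to 1 over (0, ∞); midpoint rule on [1e-6, 100]
n = 200_000
a, b = 1e-6, 100.0
h = (b - a) / n
total = sum(f_Y(a + (i + 0.5) * h) for i in range(n)) * h
print(round(total, 3))  # 1.0
```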
14 Numerical Characteristics of Random Variables
14.1 Expectation
Definition 14.1. If $X = \sum_{i\in I} x_i\, I_{\{X = x_i\}}$ is a discrete random variable, then the expectation of X (or mean value, or expected value) is the number
$$E(X) = \sum_{i\in I} x_i\, P(X = x_i),$$
if the series is absolutely convergent.
If a continuous random variable X has f_X as its density function, then the expectation (or mean value, or expected value) of X is the number
$$E(X) = \int_{\mathbb{R}} x\, f_X(x)\,dx,$$
if the integral is absolutely convergent.
Definition 14.2. If X : Ω → C is a random variable taking complex values, then
the expectation of X is the complex number
E(X) = E(Re X) + iE(Im X).
Definition 14.3. Let X = (X_1, . . . , X_n) be a random vector. The n-dimensional vector

E(X) = (E(X_1), . . . , E(X_n))

is called the expectation of the random vector X if each of the random variables X_i (i ∈ {1, . . . , n}) has expectation E(X_i).
Remark 14.4.
(1) The expectation of a random variable X with the distribution function F_X can be written as

E(X) = ∫_R x dF_X(x),

if the Lebesgue-Stieltjes integral is absolutely convergent. An equivalent definition is

E(X) = ∫_Ω X(ω) dP(ω),

if the integral with respect to the measure P is absolutely convergent (see Appendix A).
(2) If h : R → R is a B/B measurable function and X a random variable, then the expectation of the random variable h(X) is given by

E(h(X)) = ∫_R x dF_{h(X)}(x) = ∫_R h(x) dF_X(x).

If X is a discrete random variable X = Σ_{i∈I} x_i I_{X=x_i}, then

E(h(X)) = Σ_{i∈I} h(x_i) P(X = x_i),

while for a continuous random variable X we have

E(h(X)) = ∫_R h(x) f_X(x) dx = ∫_R x f_{h(X)}(x) dx.
(3) Let X = (X_1, . . . , X_n) be a random vector and let h : R^n → R be a B^n/B measurable function, then the expectation of the random variable h(X) is given by

E(h(X)) = ∫_R x dF_{h(X)}(x) = ∫...∫_{R^n} h(x_1, . . . , x_n) dF_X(x_1, . . . , x_n).

If X is a discrete random vector X = Σ_{i∈I} x_i I_{X=x_i}, where x_i = (x_{i1}, . . . , x_{in}), then

E(h(X)) = Σ_{i∈I} h(x_{i1}, . . . , x_{in}) P(X = (x_{i1}, . . . , x_{in})),

while for a continuous random vector X we have

E(h(X)) = ∫...∫_{R^n} h(x_1, . . . , x_n) f_X(x_1, . . . , x_n) dx_1 . . . dx_n
        = ∫_R x f_{h(X_1,...,X_n)}(x) dx.

△
Example 14.5.
(1) The expectation of the random variable X having binomial distribution with parameters n and p is

E(X) = Σ_{k=0}^{n} k P(X = k) = Σ_{k=0}^{n} k C_n^k p^k (1 − p)^{n−k} = np.
(2) A random variable X having hypergeometric distribution with parameters a, b and n has the expectation

E(X) = Σ_{k=0}^{n} k (C_a^k C_b^{n−k}) / C_{a+b}^n = an/(a + b).
(3) A random variable X having Poisson distribution with parameter λ > 0 has the expectation

E(X) = Σ_{k=0}^{∞} k (λ^k / k!) e^{−λ} = λ.
(4) The expectation of the continuous random variable X with uniform distribution on the interval [a, b] is

E(X) = ∫_R x f_X(x) dx = (1/(b − a)) ∫_a^b x dx = (a + b)/2.
(5) The expectation of the continuous random variable X with normal distribution N(m, σ) is

E(X) = ∫_R x f_X(x) dx = ∫_R (x/(σ√(2π))) exp(−(x − m)²/(2σ²)) dx = m.

♦
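The discrete and continuous expectations above can be reproduced directly from the definitions; the sketch below (parameter values chosen arbitrarily) computes each sum or integral numerically and should match np, λ and (a + b)/2.

```python
import math

# Binomial: E(X) = sum over k of k C(n, k) p^k (1 - p)^(n - k)
n, p = 10, 0.3
binom = sum(k * math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

# Poisson: E(X) = sum over k of k (lambda^k / k!) e^(-lambda), truncated far in the tail
lam = 2.5
poisson = sum(k * lam**k / math.factorial(k) * math.exp(-lam) for k in range(100))

# Uniform on [a, b]: E(X) = \int x f_X(x) dx with f_X = 1/(b - a), midpoint rule
a, b = 1.0, 4.0
m = 10000
h = (b - a) / m
uniform = sum((a + (k + 0.5) * h) / (b - a) for k in range(m)) * h
```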
Now we prove some properties of the expectation of a random variable.
Theorem 14.6. If X and Y are both discrete or both continuous random variables which have the expectations E(X) and E(Y) respectively, then the following properties hold:
(1) E(aX + b) = aE(X) + b for all a, b ∈ R.
(2) E(X + Y) = E(X) + E(Y).
(3) If X and Y are independent, then E(X · Y) = E(X)E(Y).
(4) If X(ω) ≤ Y(ω) for all ω ∈ Ω, then E(X) ≤ E(Y).
Proof. (1) If X = Σ_{i∈I} x_i I_{X=x_i} is a discrete random variable, then the probability distribution of Z = aX + b is

{P(Z = ax_i + b) : i ∈ I} = {P(X = x_i) : i ∈ I}.

Hence the expectation of Z = aX + b is given by

E(Z) = E(aX + b) = Σ_{i∈I} (ax_i + b) P(X = x_i)
     = a Σ_{i∈I} x_i P(X = x_i) + b Σ_{i∈I} P(X = x_i)
     = aE(X) + b P(∪_{i∈I} (X = x_i)) = aE(X) + b.
Now we consider that X is a continuous random variable. Put Z = aX + b. Then the expectation of Z is

E(Z) = ∫_R z f_Z(z) dz.

We use Proposition 13.4 for g : R → R defined by g(x) = ax + b, which is a strictly monotone function on R if a ≠ 0 and which is also derivable. The inverse function of g is g^{−1}(z) = (z − b)/a. Then the density function of Z is given by

f_Z(z) = f_X(g^{−1}(z)) / |g′(g^{−1}(z))| = (1/|a|) f_X((z − b)/a).
Hence, substituting x = (z − b)/a (so that dz = a dx, with sign(a) accounting for the orientation of the limits when a < 0),

E(Z) = ∫_R (z/|a|) f_X((z − b)/a) dz = sign(a) ∫_{−∞}^{∞} ((ax + b)/|a|) f_X(x) a dx
     = ∫_R (ax + b) f_X(x) dx = aE(X) + b.

If a = 0, then Z is a discrete random variable taking the constant value b with probability 1 and therefore E(Z) = b · 1 = 0 · E(X) + b.
(2) If

X = Σ_{i∈I} x_i I_{X=x_i} and Y = Σ_{j∈J} y_j I_{Y=y_j}

are discrete random variables, then the expectation of the sum X + Y is

E(X + Y) = Σ_{i∈I} Σ_{j∈J} (x_i + y_j) P(X = x_i, Y = y_j)
         = Σ_{i∈I} x_i Σ_{j∈J} P(X = x_i, Y = y_j) + Σ_{j∈J} y_j Σ_{i∈I} P(X = x_i, Y = y_j)
         = Σ_{i∈I} x_i P(X = x_i) + Σ_{j∈J} y_j P(Y = y_j) = E(X) + E(Y).
Now we consider the case of continuous random variables X and Y. We use the definition of the expectation and Proposition 13.1 to obtain

E(X + Y) = ∫_R z f_{X+Y}(z) dz
         = ∫_R z ( ∫_R f_{(X,Y)}(x, z − x) dx ) dz
         = ∫_R ( ∫_R z f_{(X,Y)}(x, z − x) dz ) dx.
We make the change of variable Γ = (Γ1, Γ2) with

Γ1(u, v) = u,  Γ2(u, v) = u + v.

The determinant of the Jacobian matrix is

det JΓ(u, v) = (∂Γ1/∂u)(∂Γ2/∂v) − (∂Γ1/∂v)(∂Γ2/∂u) = 1 · 1 − 0 · 1 = 1.

Thus,

E(X + Y) = ∫_R ∫_R (u + v) f_{(X,Y)}(u, v) du dv
         = ∫_R u ( ∫_R f_{(X,Y)}(u, v) dv ) du + ∫_R v ( ∫_R f_{(X,Y)}(u, v) du ) dv
         = ∫_R u f_X(u) du + ∫_R v f_Y(v) dv = E(X) + E(Y).
We used the relation between the (marginal) density function of the random variable X and the joint density function of (X, Y).
(3) If

X = Σ_{i∈I} x_i I_{X=x_i} and Y = Σ_{j∈J} y_j I_{Y=y_j}

are discrete independent random variables, then the expectation of the product X · Y is

E(X · Y) = Σ_{i∈I} Σ_{j∈J} x_i y_j P(X = x_i, Y = y_j)
         = Σ_{i∈I} Σ_{j∈J} x_i y_j P(X = x_i) P(Y = y_j)
         = Σ_{i∈I} x_i P(X = x_i) Σ_{j∈J} y_j P(Y = y_j) = E(X)E(Y).
Now we consider the case of continuous independent random variables X and Y. We use the definition of the expectation and Proposition 13.2 to obtain

E(X · Y) = ∫_R z f_{X·Y}(z) dz = ∫_R z ( ∫_R (1/|x|) f_X(x) f_Y(z/x) dx ) dz.
We make the change of variable Γ = (Γ1, Γ2) with

Γ1(u, v) = u,  Γ2(u, v) = u · v.

The determinant of the Jacobian matrix is

det JΓ(u, v) = (∂Γ1/∂u)(∂Γ2/∂v) − (∂Γ1/∂v)(∂Γ2/∂u) = 1 · u − 0 · v = u.

Thus,

E(X · Y) = ∫_R ∫_R ((u · v)/|u|) f_X(u) f_Y(v) |u| du dv
         = ∫_R u f_X(u) du ∫_R v f_Y(v) dv = E(X)E(Y).
(4) Let Z(ω) = Y(ω) − X(ω) ≥ 0 for all ω ∈ Ω. By the linearity of the expectation (see (1) and (2) proved above) it is enough to prove that Z ≥ 0 implies E(Z) ≥ 0.
If Z = Σ_{i∈I} z_i I_{Z=z_i} is a discrete random variable, then Z ≥ 0 implies z_i ≥ 0 for all i ∈ I. The expectation of Z is

E(Z) = Σ_{i∈I} z_i P(Z = z_i) ≥ 0.

If Z is a continuous random variable, then

E(Z) = ∫_R z f_Z(z) dz.

But for all z ≤ 0 we have F_Z(z) = P(Z < z) = 0. Hence f_Z(z) = F_Z′(z) = 0 for a.e. z ≤ 0. Therefore,

E(Z) = ∫_0^∞ z f_Z(z) dz ≥ 0,

hence E(Y) ≥ E(X).
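The four properties of Theorem 14.6 can also be observed on simulated data. The sketch below (sample size and distributions chosen arbitrarily) uses an independent pair, a uniform X on [0, 1] and a fair 0/1 coin Y; the sample means should satisfy the linearity and product rules up to simulation error.

```python
import random

random.seed(7)
N = 50000

xs = [random.random() for _ in range(N)]        # X uniform on [0, 1]
ys = [random.randint(0, 1) for _ in range(N)]   # Y independent fair 0/1 coin

mean = lambda v: sum(v) / len(v)
EX, EY = mean(xs), mean(ys)

E_aff = mean([3 * x + 2 for x in xs])            # property (1): ~ 3 E(X) + 2
E_sum = mean([x + y for x, y in zip(xs, ys)])    # property (2): ~ E(X) + E(Y)
E_prod = mean([x * y for x, y in zip(xs, ys)])   # property (3): ~ E(X) E(Y)
```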
15 Variance
We define now other numerical characteristics of random variables.
Definition 15.1. Let X be a random variable, and let E(X) be its expectation. The variance (or dispersion) of X is the number

V(X) = E((X − E(X))²),

if the expectation of (X − E(X))² exists. The value σ = √(V(X)) is called the standard deviation of X.
Theorem 15.2. If X and Y are random variables, then the following properties hold:
(1) V(X) = E(X²) − E(X)².
(2) V(aX + b) = a²V(X) for all a, b ∈ R.
(3) If X and Y are independent, then

V(X + Y) = V(X) + V(Y)

and

V(X · Y) = V(X)V(Y) + E(X)²V(Y) + E(Y)²V(X).
Proof. (1) By Definition 15.1 and by the linearity property of the expectation we have

V(X) = E(X² − 2E(X)X + E(X)²) = E(X²) − E(X)².

(2) Using Definition 15.1 and Theorem 14.6 we obtain

V(aX + b) = E((aX + b − E(aX + b))²) = E((aX − aE(X))²) = a²V(X).

(3) We get

V(X + Y) = E((X − E(X) + Y − E(Y))²)
         = E((X − E(X))²) + 2E((X − E(X))(Y − E(Y))) + E((Y − E(Y))²)
         = E((X − E(X))²) + E((Y − E(Y))²) = V(X) + V(Y),

since by independence E((X − E(X))(Y − E(Y))) = E(X − E(X)) E(Y − E(Y)) = 0.
Since X and Y are independent, we have E(X · Y) = E(X)E(Y) and E(X² · Y²) = E(X²)E(Y²). Therefore,

V(X · Y) = E(X² · Y²) − E(X)²E(Y)²
         = E(X²)E(Y²) − E(X)²E(Y)²
         = (V(X) + E(X)²)(V(Y) + E(Y)²) − E(X)²E(Y)²
         = V(X)V(Y) + E(X)²V(Y) + E(Y)²V(X).
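The product-variance identity of Theorem 15.2(3) can be checked exactly on small simple random variables. The sketch below (with made-up pmfs) builds the distribution of X · Y for an independent pair and compares V(X · Y) with V(X)V(Y) + E(X)²V(Y) + E(Y)²V(X).

```python
from itertools import product

# Illustrative independent simple random variables (pmfs are made up)
pX = {1: 0.2, 2: 0.5, 4: 0.3}
pY = {-1: 0.4, 3: 0.6}

def E(pmf, f=lambda x: x):
    return sum(f(x) * p for x, p in pmf.items())

def V(pmf):
    return E(pmf, lambda x: x * x) - E(pmf) ** 2

# Distribution of X * Y; by independence P(X = x, Y = y) = pX[x] * pY[y]
p_prod = {}
for (x, px), (y, py) in product(pX.items(), pY.items()):
    p_prod[x * y] = p_prod.get(x * y, 0.0) + px * py

lhs = V(p_prod)
rhs = V(pX) * V(pY) + E(pX) ** 2 * V(pY) + E(pY) ** 2 * V(pX)
```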
Definition 15.3. Let X and Y be two random variables, and let E(X) and E(Y), respectively, be their expectations. The covariance of the random variables X and Y is the number (if it exists)

cov(X, Y) = E((X − E(X))(Y − E(Y))).

The correlation coefficient of X and Y is defined by

ρ(X, Y) = cov(X, Y) / √(V(X)V(Y)),

if cov(X, Y), V(X), V(Y) exist and V(X) ≠ 0, V(Y) ≠ 0.
Theorem 15.4. If X, Y and Z are random variables, then the following properties hold:
(1) cov(X, X) = V(X).
(2) cov(X, Y) = E(X · Y) − E(X)E(Y).
(3) If X and Y are independent, then cov(X, Y) = ρ(X, Y) = 0 (it is said that X and Y are uncorrelated).
(4) V(aX + bY) = a²V(X) + b²V(Y) + 2ab cov(X, Y) for all a, b ∈ R.
(5) cov(X + Y, Z) = cov(X, Z) + cov(Y, Z).
(6) ρ²(X, Y) ≤ 1. Equality holds in this inequality if there exist a, b ∈ R such that

Y = aX + b for a.e. ω ∈ Ω.
Proof. (1) Follows directly from the definition of the covariance and variance.
(2) By the linearity property of the expectation (see Theorem 14.6) we can write

cov(X, Y) = E(X · Y − E(X)Y − E(Y)X + E(X)E(Y)) = E(X · Y) − E(X)E(Y).

(3) Since X and Y are independent, it follows that E(X · Y) = E(X)E(Y). Then by (2) it immediately follows that cov(X, Y) = 0, hence ρ(X, Y) = 0.
(4) We have

V(aX + bY) = E((aX + bY − E(aX + bY))²)
           = a²E((X − E(X))²) + b²E((Y − E(Y))²) + 2ab E((X − E(X))(Y − E(Y)))
           = a²V(X) + b²V(Y) + 2ab cov(X, Y).

(5) We calculate

cov(X + Y, Z) = E((X − E(X) + Y − E(Y))(Z − E(Z)))
              = E((X − E(X))(Z − E(Z))) + E((Y − E(Y))(Z − E(Z)))
              = cov(X, Z) + cov(Y, Z).
(6) We notice that

E((t(X − E(X)) + Y − E(Y))²) ≥ 0 for all t ∈ R.

Therefore we have

V(X)t² + 2cov(X, Y)t + V(Y) ≥ 0 for all t ∈ R.

Since V(X) > 0 this inequality holds if and only if

4cov²(X, Y) − 4V(X)V(Y) ≤ 0.

So, ρ²(X, Y) ≤ 1.
Now we consider the random variables

U = (X − E(X)) / √(V(X)),   W = (Y − E(Y)) / √(V(Y)).

They satisfy the properties E(U) = E(W) = 0 and

V(U) = E(U²) = 1,   V(W) = E(W²) = 1.
First we consider the case ρ(X, Y) = 1. Then

E(U W) = ρ(X, Y) = 1

and

E((U − W)²) = E(U²) + E(W²) − 2E(U W) = 0.

This implies U(ω) = W(ω) for a.e. ω ∈ Ω. So,

(X(ω) − E(X)) / √(V(X)) = (Y(ω) − E(Y)) / √(V(Y))   for a.e. ω ∈ Ω.

Taking

a = √(V(Y)/V(X)),   b = E(Y) − √(V(Y)/V(X)) E(X),

we obtain Y = aX + b for a.e. ω ∈ Ω. The case ρ(X, Y) = −1 is treated analogously.
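Properties (3) and (6) of Theorem 15.4 are easy to see on simulated samples: an exact affine relation Y = aX + b forces the (sample) correlation coefficient to ±1, while an independent pair gives a value near 0. Sample size and distributions below are arbitrary choices.

```python
import random

random.seed(1)
N = 50000
xs = [random.gauss(0, 1) for _ in range(N)]

def mean(v):
    return sum(v) / len(v)

def cov(u, w):
    mu, mw = mean(u), mean(w)
    return mean([(ui - mu) * (wi - mw) for ui, wi in zip(u, w)])

def rho(u, w):
    # sample correlation coefficient cov(U, W) / sqrt(V(U) V(W))
    return cov(u, w) / (cov(u, u) * cov(w, w)) ** 0.5

ys_lin = [2 * x - 5 for x in xs]            # Y = aX + b with a > 0: rho = 1
ys_neg = [-3 * x + 1 for x in xs]           # a < 0: rho = -1
ys_ind = [random.gauss(0, 1) for _ in xs]   # independent of X: rho near 0
```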
Example 15.5.
(1) Let X be a random variable having binomial distribution with parameters n and p. We calculate its variance by using the formula V(X) = E(X²) − E(X)². We know that E(X) = np (see Example 14.5) and have

E(X²) = Σ_{k=0}^{n} k² C_n^k p^k (1 − p)^{n−k} = n²p² + np − np².

Therefore V(X) = np(1 − p).
(2) Let X be a random variable with Gamma distribution with parameters a, b ∈ R₊. We calculate

E(X) = (1/(Γ(a)bᵃ)) ∫_0^∞ xᵃ exp(−x/b) dx = b Γ(a + 1)/Γ(a) = ab

and

E(X²) = (1/(Γ(a)bᵃ)) ∫_0^∞ x^{a+1} exp(−x/b) dx = b² Γ(a + 2)/Γ(a) = a(a + 1)b².

Then V(X) = E(X²) − E(X)² = ab².
♦
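Both variances in Example 15.5 can be reproduced numerically via V(X) = E(X²) − E(X)²; the sketch below (parameter values and integration grid chosen arbitrarily) should give np(1 − p) and ab².

```python
import math

# (1) Binomial(n, p): exact sums for E(X) and E(X^2)
n, p = 12, 0.4
pmf = lambda k: math.comb(n, k) * p**k * (1 - p)**(n - k)
EX = sum(k * pmf(k) for k in range(n + 1))
EX2 = sum(k * k * pmf(k) for k in range(n + 1))
var_binom = EX2 - EX ** 2               # should equal n p (1 - p)

# (2) Gamma(a, b): moments by midpoint-rule integration of x^k f_X(x)
a, b = 2.5, 1.5
f = lambda x: x ** (a - 1) * math.exp(-x / b) / (math.gamma(a) * b ** a)

def moment(k, hi=60.0, m=200000):
    h = hi / m
    return sum(((j + 0.5) * h) ** k * f((j + 0.5) * h) for j in range(m)) * h

var_gamma = moment(2) - moment(1) ** 2  # should equal a b^2
```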