Introductory Course M2 TSE
Part II: Probability Theory

Basic Probability Theory

This material is extracted from the books of Robert B. Ash (John Wiley & Sons, 1970) and of Kai Lai Chung and Farid AitSahlia, Elementary Probability Theory with Stochastic Processes and an Introduction to Mathematical Finance, 4th edition (Springer, 2003).

This is a preliminary version of the document. Some sections remain to be completed, so refer to the books above (or other literature) for the corresponding topics.
1 Basic concepts
The classical definition of probability is the following: the probability of an event is the number of outcomes favorable to the event, divided by the total number of outcomes, where all outcomes are equally likely. This definition is restrictive (finite number of outcomes) and circular ("equally likely" = "equally probable").
The frequency approach is based on physical observations of the following type: if an unbiased coin is tossed independently n times, where n is very large, the relative frequency of heads is likely to be close to 1/2. We would like to define the probability of an event as the limit of Sn/n, where Sn is the number of occurrences of the event. However, it is possible that Sn/n converges to any number between 0 and 1, or has no limit at all.
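To make the frequency idea concrete, here is a minimal simulation sketch in Python (not part of the original text; the function name and the choice of a fair coin are illustrative assumptions). It tosses a coin n times and prints the relative frequency of heads Sn/n, which for large n is typically close to 1/2, although nothing forces a particular realization to converge.

    import random

    def relative_frequency_of_heads(n, seed=0):
        """Toss a fair coin n times and return S_n / n."""
        rng = random.Random(seed)
        heads = sum(rng.random() < 0.5 for _ in range(n))
        return heads / n

    for n in (10, 1000, 100000):
        print(n, relative_frequency_of_heads(n))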
We need a rigorous definition of probability to construct a mathematical theory. We now introduce the basic concepts of mathematical probability theory.
1.1 Probability space
1.1.1 Sample space
A sample space Ω is a set of points representing possible outcomes of a random experiment. Its choice is dictated by the problem under consideration. Examples:

• In the experiment of tossing a single die, we can choose Ω = {1, 2, 3, 4, 5, 6}. Another choice: Ω = {N is even, N is odd}, but if we are interested, for example, in whether or not N ≥ 3, this second space is not useful.

• To model the number of accidents that an insurance company faces in one year, we can define the sample space as Ω = {0, 1, 2, . . . }. In this example, the sample space is infinite (countable).

• Let the experiment consist in selecting a person at random and measuring his height. We can choose Ω = R+. Here, Ω has a continuum of points.
1.1.2 Algebra of events
An event associated with a random experiment is a condition that is satisfied or not satisfied in a given performance of the experiment. For example, if a coin is tossed twice, "the number of heads is ≤ 1" is an event.

We can also say that an event corresponds to a yes/no question that may be answered after the experiment is performed. A "yes" answer is associated with a subset of the sample space. Let in the previous example Ω = {HH, HT, TH, TT}. The question "Is the number of heads ≤ 1?" can be answered yes or no. The subset of Ω corresponding to a "yes" answer is A = {HT, TH, TT}.

Thus, in mathematical terms, an event is defined as a subset of the sample space. Example:

• Experiment: tossing a coin twice. Sample space: Ω = {HH, HT, TH, TT}. The event "the result of the first toss is equal to the result of the second toss" corresponds to the subset B = {HH, TT}.

We can form new events from old ones by use of the connectives "or", "and", and "not". In terms of subsets, this corresponds to the following operations:

• or = union: A ∪ B (means A or B or both).

• and = intersection: A ∩ B.

• not = complement: A^c ≡ Ω \ A.
For example, in the experiment of tossing a single die with Ω = {1, 2, 3, 4, 5, 6} and N the result, we can consider the following events: A = {N ≥ 3} = {3, 4, 5, 6} and B = {N is even} = {2, 4, 6}. Then

• A ∪ B = {N ≥ 3 or N is even} = {2, 3, 4, 5, 6}

• A ∩ B = {N ≥ 3 and N is even} = {4, 6}

• A^c = {N is not ≥ 3} = {N < 3} = {1, 2}

• B^c = {N is not even} = {N is odd} = {1, 3, 5}

We can also apply these operations to more than two events: A1 ∪ A2 ∪ · · · ∪ An = ∪_{i=1}^{n} Ai (the set of points belonging to at least one of the events Ai) and A1 ∩ A2 ∩ · · · ∩ An = ∩_{i=1}^{n} Ai (the set of points belonging to all of the events Ai). We define in the same way the union and the intersection of an infinite sequence of events.
Two events A and B in a sample space are said to be mutually exclusive or disjoint if it is impossible that both A and B occur during the same performance of the experiment. Mathematically, this means that their intersection is empty: A ∩ B = ∅. In general, the events {Ai} (a finite or infinite collection) are mutually exclusive if no more than one of them can occur during the same performance of the experiment: Ai ∩ Aj = ∅ for i ≠ j.

In some ways the algebra of events is similar to the algebra of real numbers, with union corresponding to addition and intersection to multiplication. For example, the commutative and associative properties hold:

A ∪ B = B ∪ A,   A ∩ B = B ∩ A,
A ∪ (B ∪ C) = (A ∪ B) ∪ C,   A ∩ (B ∩ C) = (A ∩ B) ∩ C.

In many ways the algebra of events differs from the algebra of real numbers, as some of the identities below indicate:

A ∪ A = A,   A ∩ A = A,
A ∩ Ω = A,   A ∪ Ω = Ω,
A ∪ A^c = Ω,   A ∩ A^c = ∅,
A ∪ ∅ = A,   A ∩ ∅ = ∅.
1.1.3 Class of events
In some situations, we may not have complete information about the outcomes of an experiment. For example, if the experiment involves tossing a coin three times, we may record the results of only the first two tosses. In this case, we cannot consider all subsets of Ω as events.

Indeed, let Ω = {(a1, a2, a3) | ai ∈ {H, T}, i = 1, 2, 3} and let us consider the subset A of Ω corresponding to the condition "there are at least two heads":

A = {(H, H, T), (H, T, H), (T, H, H), (H, H, H)}.

Imagine that after the experiment is performed, we have the following information about the outcome: ω = (H, T, ∗). In this case, we are not able to give a yes or no answer to the question "Is ω ∈ A?" (this depends on the result of the last toss, and we miss this information). So, A is not measurable with respect to the given information.

In contrast, the subset B = {(T, T, T), (T, T, H)}, which corresponds to the condition "the first two tosses are tails", is an event. Indeed, with the information about the first two tosses, we are always able to say whether ω is in B or not, for all possible outcomes ω.

This leads to consider a particular class of subsets of Ω called the class of events. The standard notation for the class of events is F. For reasons of mathematical consistency, we require that F form a sigma field, which is a collection of subsets of Ω satisfying the following three requirements.

1. Ω ∈ F.

2. A ∈ F implies A^c ∈ F. That is, F is closed under complementation.

3. A1, A2, · · · ∈ F implies ∪_{i=1}^{∞} Ai ∈ F. That is, F is closed under finite or countable union.

The above conditions imply also that F is closed under finite or countable intersection, and that the empty set ∅ belongs to F (exercise).

Examples of sigma fields:
• F = {Ω, ∅}.

• The collection of all subsets of Ω is a sigma field.

• Let Ω = {1, 2, 3, 4, 5, 6}. The following collection of subsets is a sigma field: F = {Ω, ∅, {1, 3, 5}, {2, 4, 6}}.

• If Ω is a part of R (for instance, R or R+, or [0, 1]), we will typically consider the sigma field B of Borel sets, that is, the smallest sigma field containing the intervals (and, in consequence, unions and intersections of intervals).
1.1.4 Probability measure
We now consider the assignment of probabilities to events. The probability of an event should somehow reflect the relative frequency of the event in a large number of independent repetitions of the experiment. Thus, if A ∈ F, the probability P(A) should be a number between 0 and 1, with P(∅) = 0 and P(Ω) = 1. Furthermore, if the events A and B are disjoint (cannot occur at the same time), then the number of occurrences of A ∪ B is the sum of the number of occurrences of A and the number of occurrences of B, so we should have P(A ∪ B) = P(A) + P(B). This motivates the following definition.

A function that assigns a number P(A) to each set A in the sigma field F is called a probability measure on F, provided that the following conditions are satisfied:

1. P(A) ≥ 0 for every A ∈ F

2. P(Ω) = 1

3. If A1, A2, . . . is a finite or countable collection of disjoint sets in F, then

P(A1 ∪ A2 ∪ · · · ) = P(A1) + P(A2) + · · ·

Remark. Another reason for considering a class of events F instead of all subsets of Ω is that in some cases it is impossible to define a probability measure (in the sense of the definition above) on all subsets, just as it is impossible to define an area of all subsets of the plane or a length of all subsets of a line (this is the reason why we consider only the Borel sets in R^n).
From this definition, we can deduce the following properties of a probability measure (exercise):

• P(∅) = 0

• P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

• If B ⊂ A, then P(B) ≤ P(A).

• P(A1 ∪ A2 ∪ · · · ) ≤ P(A1) + P(A2) + · · ·

• If the sets An are nondecreasing, i.e. An ⊂ An+1, ∀n ≥ 1, then P(∪_{n≥1} An) = lim_{n→∞} P(An).

• If the sets An are nonincreasing, i.e. An+1 ⊂ An, ∀n ≥ 1, then P(∩_{n≥1} An) = lim_{n→∞} P(An).
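As an illustration of these properties (a sketch that is not part of the original notes), the snippet below builds the uniform probability measure on a finite sample space and checks inclusion-exclusion and monotonicity on the die example of Section 1.1.2.

    from fractions import Fraction

    omega = {1, 2, 3, 4, 5, 6}

    def P(event):
        """Uniform probability measure on the finite sample space omega."""
        return Fraction(len(event & omega), len(omega))

    A = {n for n in omega if n >= 3}       # {3, 4, 5, 6}
    B = {n for n in omega if n % 2 == 0}   # {2, 4, 6}

    assert P(A | B) == P(A) + P(B) - P(A & B)   # P(A u B) = P(A) + P(B) - P(A n B)
    assert P({4, 6}) <= P(A)                    # monotonicity: {4, 6} is a subset of A
    print(P(A | B))                             # 5/6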
We may now give the underlying mathematical framework for probability theory.

Definition 1. A probability space is a triple (Ω, F, P), where Ω is a set, F a sigma field of subsets of Ω, and P a probability measure on F.
Exercises
1. Let Ω = {ω1, ω2, . . . , ωn, . . . } be a countable sample space and F the class of all subsets of Ω. To each sample point ωn let us attach an arbitrary weight pn subject to the conditions

∀n : pn ≥ 0,   ∑_n pn = 1.

Now for any subset A of Ω, we define its probability to be the sum of the weights of all points in it. In symbols,

∀A ⊂ Ω,   P(A) = ∑_{ωn ∈ A} pn.

Show that P is indeed a probability measure.

2. If Ω is countably infinite, may all the sample points ωn ∈ Ω be equally likely (that is, all pn in the previous example be equal)?

3. Let Ω be a plane set (Ω ⊂ R²) with a finite Lebesgue measure 0 < |Ω| < +∞ (think about a square or a circle for simplicity). Consider all measurable subsets A of Ω and define

P(A) = |A| / |Ω|.

Show that P is a probability measure.

4. Suppose that the land of a square kingdom is divided into three strips A, B, C of equal area and suppose the value per unit is in the ratio of 1 : 3 : 2. For any piece of (measurable) land S in this kingdom, the relative value with respect to that of the kingdom is then given by the formula

V(S) = [P(S ∩ A) + 3P(S ∩ B) + 2P(S ∩ C)] / 2,

where P is as in the previous exercise. Show that V is a probability measure.

5. Show that if P and Q are two probability measures defined on the same class of events F of Ω, then aP + bQ is also a probability measure on F for any two nonnegative numbers a and b satisfying a + b = 1.

6. If P is a probability measure, show that the function P/2 satisfies conditions (1) and (3) but not (2) in the definition of a probability measure. The function P² satisfies (1) and (2) but not necessarily (3); give a counterexample to (3).
1.2 Independence
Consider the following experiment. A person is selected at random and his height is recorded. After this, the last digit of the licence number of the next car to pass is noted. If A is the event that the height is over 1m70, and B is the event that the digit is ≥ 7, then, intuitively, A and B are independent. The knowledge about the occurrence or nonoccurrence of one of the events should not influence the odds about the other.

In other words, we expect that the relative frequency of occurrence of B should be the same if we consider all repetitions of the experiment or only those in which A occurs:

N_AB / N_A = N_B / N.

This implies that

N_AB / N = (N_A / N)(N_B / N).

Thus, if we interpret probabilities as relative frequencies, we should have

P(A ∩ B) = P(A)P(B).

This reasoning motivates the following definition.
Definition 2. Two events A and B are independent if P(A ∩ B) = P(A)P(B).

This can be extended to an arbitrary (possibly infinite) collection of events.

Definition 3. Let Ai, i ∈ I, where I is an arbitrary index set, possibly infinite, be an arbitrary collection of events on a given probability space (Ω, F, P). The Ai are said to be independent if for each finite set of distinct indices i1, . . . , ik ∈ I we have

P(A_{i1} ∩ A_{i2} ∩ · · · ∩ A_{ik}) = P(A_{i1})P(A_{i2}) · · · P(A_{ik})
We list below some properties of independent events (the proofs are left as an exercise).

• If in a collection of independent events Ai, i ∈ I, we replace some Ai by their complements Ai^c, the new collection is also independent. For example, if A and B are independent, then A and B^c are independent, as well as A^c and B^c.

• Any subcollection of independent events also forms, of course, a family of independent events. However, the condition P(A1 ∩ · · · ∩ An) = P(A1) · · · P(An) does not imply the analogous condition for any smaller family of events! For example, it is possible to have P(A ∩ B ∩ C) = P(A)P(B)P(C), but P(A ∩ B) ≠ P(A)P(B), P(A ∩ C) ≠ P(A)P(C), P(B ∩ C) ≠ P(B)P(C). Conversely, it is possible to have, for example, P(A ∩ B) = P(A)P(B), P(A ∩ C) = P(A)P(C), P(B ∩ C) = P(B)P(C), but P(A ∩ B ∩ C) ≠ P(A)P(B)P(C). Thus A and B are independent, as are A and C, and also B and C, but A, B, and C are not independent.
Exercises
1. What can you say about the event A if it is independent of itself? If the events A and B are disjoint and independent, what can you say of them?
1.3 Conditional probability
If the events A and B are not independent, the knowledge about the occurrence of A will change the odds about B. How to measure this exactly? In other words, we want to quantify the relative frequency of the occurrence of B in the trials on which A occurs. We look only at the trials on which A occurs and count those trials on which B occurs also. This relative frequency N_AB/N_A may be represented as

N_AB / N_A = (N_AB / N) / (N_A / N).

This discussion suggests the following definition.
Definition 4. Let P(A) > 0. The conditional probability of B given A is defined as

P(B | A) = P(A ∩ B) / P(A).

Example. Throw two unbiased dice independently. Let A = {sum of the faces = 8} and B = {faces are equal}. Then

P(B | A) = P(A ∩ B) / P(A) = P((4, 4)) / P({(4, 4), (3, 5), (5, 3), (2, 6), (6, 2)}) = (1/36) / (5/36) = 1/5.

Note some consequences of the above definition.

• If A and B are independent, then

P(B | A) = P(A ∩ B) / P(A) = P(A)P(B) / P(A) = P(B),

which is in accordance with the intuition.

• We have P(A ∩ B) = P(A)P(B | A), and we can extend this formula to more than two events:

P(A ∩ B ∩ C) = P(A ∩ B)P(C | A ∩ B) = P(A)P(B | A)P(C | A ∩ B).

Similarly,

P(A ∩ B ∩ C ∩ D) = P(A)P(B | A)P(C | A ∩ B)P(D | A ∩ B ∩ C),

and so on.
Example. Three cards are drawn without replacement from an ordinary deck. Find the probability of not obtaining a heart.

Let Ai = {card i is not a heart}. Then we are looking for

P(A1 ∩ A2 ∩ A3) = P(A1)P(A2 | A1)P(A3 | A1 ∩ A2) = (39/52)(38/51)(37/50).
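A quick sanity check of this product of conditional probabilities is a simulation (a sketch, not part of the original notes): draw three cards without replacement many times and compare the empirical frequency of "no heart" with 39·38·37/(52·51·50) ≈ 0.4135.

    import random

    def no_heart_frequency(trials=200_000, seed=1):
        """Empirical probability that 3 cards drawn without replacement contain no heart."""
        rng = random.Random(seed)
        deck = ["heart"] * 13 + ["other"] * 39
        count = 0
        for _ in range(trials):
            hand = rng.sample(deck, 3)      # sampling without replacement
            count += "heart" not in hand
        return count / trials

    exact = (39 * 38 * 37) / (52 * 51 * 50)
    print(exact, no_heart_frequency())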
We now formulate the most useful results on conditional probabilities.
1.3.1 Theorem of total probability
Theorem 5. Let B1, B2, . . . be a finite or countable family of mutually exclusive and exhaustive events (i.e., the Bi are disjoint and their union is Ω). If A is any event, then

P(A) = ∑_i P(Bi)P(A | Bi)

(the sum is taken over those i for which P(Bi) > 0).
Example. Consider the following experiment. We have a biased coin with probability of heads equal to 1/3. We also have two urns: the first contains 3 white and 2 black balls; the second contains 1 white and 3 black balls. We toss the coin. If the result is heads, we draw a ball from the first urn; if the result is tails, we draw a ball from the second urn. What is the probability that a white ball is drawn?

Let A = {a white ball is drawn}, B1 = {the coin falls heads}, B2 = {the coin falls tails}. Then

P(A) = P(B1)P(A | B1) + P(B2)P(A | B2) = (1/3)(3/5) + (2/3)(1/4) = 11/30.
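The theorem of total probability can also be checked by simulating the two-stage experiment directly; the sketch below (not from the original text) should give an empirical frequency close to 11/30 ≈ 0.367.

    import random

    def white_ball_frequency(trials=200_000, seed=2):
        """Simulate: biased coin with P(heads) = 1/3, then draw from urn 1 or urn 2."""
        rng = random.Random(seed)
        urn1 = ["white"] * 3 + ["black"] * 2
        urn2 = ["white"] * 1 + ["black"] * 3
        white = 0
        for _ in range(trials):
            urn = urn1 if rng.random() < 1 / 3 else urn2   # heads -> first urn
            white += rng.choice(urn) == "white"
        return white / trials

    print(11 / 30, white_ball_frequency())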
1.3.2 Bayes' theorem
Notice that under the above assumptions we have

P(Bk | A) = P(A ∩ Bk) / P(A) = P(Bk)P(A | Bk) / ∑_i P(Bi)P(A | Bi).

This formula is referred to as Bayes' theorem. The quantity P(Bk | A) is called an a posteriori probability. The reason for this terminology may be seen in the example below.
Example. Consider the previous experiment with one biased coin and two urns. Suppose that we did not observe the whole experiment but only the final result: a black ball is drawn. We would like to estimate the a posteriori probability that the coin fell heads. Let C = {a black ball is drawn}. We use Bayes' theorem to compute P(B1 | C):

P(B1 | C) = P(B1)P(C | B1) / [P(B1)P(C | B1) + P(B2)P(C | B2)] = (1/3)(2/5) / [(1/3)(2/5) + (2/3)(3/4)] = 4/19.
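The a posteriori probability 4/19 ≈ 0.21 can likewise be estimated by simulation (a sketch, not part of the original notes): among the runs in which a black ball is drawn, count the fraction in which the coin had fallen heads.

    import random

    def heads_given_black(trials=300_000, seed=3):
        """Estimate P(coin falls heads | black ball drawn) for the coin-and-urns experiment."""
        rng = random.Random(seed)
        urn1 = ["white"] * 3 + ["black"] * 2
        urn2 = ["white"] * 1 + ["black"] * 3
        heads_and_black = black = 0
        for _ in range(trials):
            heads = rng.random() < 1 / 3
            ball = rng.choice(urn1 if heads else urn2)
            if ball == "black":
                black += 1
                heads_and_black += heads
        return heads_and_black / black

    print(4 / 19, heads_given_black())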
2 Random variables
2.1 Definition of a random variable
Intuitively, a random variable is a quantity that is measured in connection with a random experiment. If Ω is a sample space, and the outcome of the experiment is ω, a measuring process is carried out to obtain a number R(ω). Thus a random variable is a real-valued function on a sample space. Let us give some examples:

• Throw a coin 10 times, and let R be the number of heads. For ω = HHTHTTHHTH, R(ω) = 6. Another random variable, R1, is the number of times a head is followed immediately by a tail. For the outcome ω above, R1(ω) = 3.

• Throw two dice. We may take the sample space to be the set of all pairs of integers (x, y), x, y = 1, 2, . . . , 6 (36 points in all).

Let R1 = the result of the first toss. Then R1(x, y) = x.

Let R2 = the sum of the two faces. Then R2(x, y) = x + y.

Let R3 = 1 if at least one face is an even number; R3 = 0 otherwise. Then R3(6, 5) = 1, R3(3, 6) = 1, R3(1, 3) = 0, and so on.
If we are interested in a random variable R, we generally want to know the probability of events involving R. In general these events are of the form "R lies in a set B ⊆ R". For instance, "R is less than 5" or "R lies in the interval [a, b)".

Notation. The event {ω : a ≤ R(ω) < b} will often be abbreviated to {a ≤ R < b}. We note its probability P(a ≤ R < b).

Example. A biased coin is tossed independently n times, with probability p of coming up heads on a given toss. Let R be the number of heads. Then, for integers k ≤ l,

P(k ≤ R ≤ l) = ∑_{i=k}^{l} C_n^i p^i (1 − p)^{n−i}.
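The binomial sum above is easy to evaluate exactly; the following short function (a sketch, not part of the original notes) computes P(k ≤ R ≤ l).

    from math import comb

    def binomial_interval_probability(n, p, k, l):
        """P(k <= R <= l), where R is the number of heads in n independent tosses
        with probability p of heads on each toss."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, l + 1))

    # Example: 10 tosses of a fair coin, between 4 and 6 heads (about 0.656).
    print(binomial_interval_probability(10, 0.5, 4, 6))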
The formal definition of a random variable is the following.

Definition 6. A random variable R on the probability space (Ω, F, P) is a real-valued function R defined on Ω, such that for every Borel subset B of the reals, {ω : R(ω) ∈ B} belongs to F.

Remember that Borel sets in R are the sets obtained from the intervals by applying the operations of union and intersection. The last condition in the definition means that the assertions of the form "R belongs to a Borel set B" are events on our probability space.
2.2 Classification of random variables
The way in which the probabilities P(R ∈ B) are calculated depends on the particular nature of R. In this section we examine some standard classes of random variables.
2.2.1 Discrete random variables
Definition 7. A random variable R is said to be discrete if the set of possible values of R is finite or countably infinite.

Let {x1, x2, . . . } be the set of possible values of R. Then R is characterized by its probability function p_R defined by

p_R(xi) = P(R = xi),   i = 1, 2, . . .

We say that R has masses of probability at the points xi. The probability function defines the probabilities of all events involving R:

P(R ∈ B) = ∑_{xi ∈ B} P(R = xi) = ∑_{xi ∈ B} p_R(xi).

Another way of characterizing R is by means of the distribution function defined by

F_R(x) = P(R ≤ x),   x ∈ R.

Example. Let R be the number of heads in two independent tosses of a coin, with the probability of heads being 0.6 on a given toss. Take Ω = {HH, HT, TH, TT} with respective probabilities 0.36, 0.24, 0.24, 0.16. Then R can take three values: 0, 1, or 2. Its probability function is given by

p_R(0) = 0.16,   p_R(1) = 0.48,   p_R(2) = 0.36.

The distribution function of R is given by

F_R(x) = 0 for x < 0,   0.16 for 0 ≤ x < 1,   0.64 for 1 ≤ x < 2,   1 for x ≥ 2.

The distribution function of a discrete random variable is piecewise constant, with jumps at the points where R has masses of probability. The size of the jump at a point is equal to the corresponding probability mass:

F_R(x) − F_R(x−) = p_R(x).

Thus, in the discrete case, if we know p_R, we can construct F_R, and, conversely, given F_R, we can construct p_R. Knowledge of either function is sufficient to determine the probability of all events involving R.
2.2.2 Absolutely continuous random variables
Definition 8. The random variable R is said to be absolutely continuous if there is a nonnegative function f_R defined on R such that

F_R(x) ≡ P(R ≤ x) = ∫_{−∞}^{x} f_R(y) dy   for all real x.

f_R is called the density function of R.

From the definition, it follows that the distribution function of an absolutely continuous random variable is continuous. Note that if F_R is differentiable at x, then its derivative is given by f_R:

F_R'(x) = d/dx ∫_{−∞}^{x} f_R(y) dy = f_R(x).

It can be proved that the probabilities of all events involving R may be computed in the following way:

P(R ∈ B) = ∫_B f_R(y) dy.

In particular, we can see that, in contrast with discrete random variables, the probability for R to fall exactly at a given point x is zero:

P(R = x) = ∫_x^x f_R(y) dy = 0.

We give below some important examples of absolutely continuous random variables.
Uniform random variable. A uniform random variable on an interval [a, b] is defined by the density function

f_R(x) = 1/(b − a) for x ∈ [a, b],   f_R(x) = 0 otherwise.

R represents a number chosen at random between a and b in a uniform way: that is, the probability that R will fall into an interval of length c depends only on c and not on the position of this interval within [a, b]. Indeed, let I_c ⊆ [a, b] be any interval of length c. Then

P(R ∈ I_c) = ∫_{I_c} f_R(y) dy = ∫_{I_c} 1/(b − a) dy = (1/(b − a)) ∫_{I_c} dy = c/(b − a).

The distribution function of R is given by

F_R(x) = 0 for x < a,   (x − a)/(b − a) for a ≤ x < b,   1 for x ≥ b.

Notation. The uniform distribution on [a, b] is noted U([a, b]), and we write R ∼ U([a, b]).
Normal random variable. R has a normal distribution with parameters µ ∈ R and σ > 0, written R ∼ N(µ, σ), if its density function is given by

f_R(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.

The distribution function of a normal random variable

F_R(x) = ∫_{−∞}^{x} (1/√(2πσ²)) e^{−(y−µ)²/(2σ²)} dy

cannot be expressed in terms of elementary functions, but its properties are well known and its values are listed in tables or may be easily obtained on a computer. If µ = 0 and σ = 1, we say that R has a standard normal distribution. There is a standard notation for the distribution function in this case:

Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−y²/2} dy.

Exponential random variable. R has an exponential distribution with parameter λ > 0, written R ∼ E(λ), if its density function is given by

f_R(x) = λe^{−λx} for x ≥ 0,   f_R(x) = 0 for x < 0.

The distribution function of R is given by

F_R(x) = 1 − e^{−λx} for x ≥ 0,   F_R(x) = 0 for x < 0.
2.2.3 Mixed random variables
A random variable need not be discrete or absolutely continuous. There are also mixed distributions that have masses of probability at some points and are continuous elsewhere. In terms of the distribution function, F_R is piecewise continuous. Typically, F_R is also piecewise differentiable, so that we can identify the intervals where R has a density f_R and the points where R has masses of probability. For example,

F_R(x) = 0 for x < 0,   (x + 30)/200 for 0 ≤ x < 120,   1 for x ≥ 120.

F_R is continuous everywhere except two points: x = 0 and x = 120. This means that R has masses of probability at these points:

P(R = 0) = F_R(0) − F_R(0−) = 0.15 − 0 = 0.15,
P(R = 120) = F_R(120) − F_R(120−) = 1 − 0.75 = 0.25.

On the intervals (−∞, 0), (0, 120), and (120, ∞), R has a density function

g_R(x) = F_R'(x) = 1/200 for x ∈ (0, 120),   g_R(x) = 0 for x < 0 or x > 120.

The probability that R ∈ B, for a Borel set B, is then computed in a "mixed" way, combining the formulae for discrete and continuous random variables:

P(R ∈ B) = ∫_B g_R(y) dy + ∑_{xi ∈ B} P(R = xi),

where the xi are the points where R has masses of probability. For instance, in the example above,

P(R > 100) = ∫_{100}^{120} g_R(y) dy + P(R = 120) = ∫_{100}^{120} (1/200) dy + 0.25 = 0.35.
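The "mixed" formula combines an integral of the density part with the point masses. The short sketch below (not from the original notes) reproduces P(R > 100) = 0.35 for this example.

    def prob_greater_than(t):
        """P(R > t) for the mixed example: density 1/200 on (0, 120),
        point masses P(R = 0) = 0.15 and P(R = 120) = 0.25."""
        lower = min(max(t, 0.0), 120.0)
        integral = (120.0 - lower) / 200.0                          # density part over (t, infinity)
        masses = sum(p for x, p in [(0.0, 0.15), (120.0, 0.25)] if x > t)
        return integral + masses

    print(prob_greater_than(100))   # 0.35
    print(prob_greater_than(-1))    # 1.0, the total mass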
2.3 Properties of distribution functions
In this section, we list some general properties of the distribution function of an arbitrary random variable.

Theorem 9. Let F be the distribution function of an arbitrary random variable R. Then

1. F(x) is nondecreasing; that is, a < b implies F(a) ≤ F(b).

2. lim_{x→∞} F(x) = 1

3. lim_{x→−∞} F(x) = 0

4. F is continuous from the right; that is, lim_{x→x0+} F(x) = F(x0).

5. lim_{x→x0−} F(x) = P(R < x0)

6. P(R = x0) = F(x0) − F(x0−). Thus F is continuous at x0 if and only if P(R = x0) = 0. The random variable R is said to be continuous if its distribution function F(x) is a continuous function for all x. In any "reasonable" case a continuous random variable will have a density (that is, it will be absolutely continuous), but it is possible to establish the existence of random variables that are continuous but not absolutely continuous.

7. Let F be a function from reals to reals, satisfying properties 1, 2, 3, and 4 above. Then F is the distribution function of some random variable.

Remark. Note that property 2 implies that

∫_{−∞}^{∞} f(x) dx = 1.

It can be shown that any nonnegative function f satisfying this condition (the integral on R is equal to 1) is the density function of some random variable.
2.4 Joint density functions
We are going to investigate situations in which we deal simultaneously with several random variables defined on the same sample space. For example, suppose that a person is selected at random, and his age and weight recorded. We may take Ω = {(x, y) | x, y ∈ R}. Let R1 be the age of the person selected and R2 the weight; that is, R1(x, y) = x, R2(x, y) = y. We wish to assign probabilities to events that involve R1 and R2 simultaneously. For example, "the person is between 22 and 23 years, and 60 and 80 kilograms":

{22 ≤ R1 ≤ 23, 60 ≤ R2 ≤ 80}

Definition 10. The joint distribution function of two arbitrary random variables R1 and R2 on the same probability space is defined by

F12(x, y) = P(R1 ≤ x, R2 ≤ y)

The pair (R1, R2) is said to be absolutely continuous if there is a nonnegative function f12 defined on R² such that

F12(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f12(u, v) du dv   for all real x, y.

f12 is called the density of (R1, R2), or the joint density of R1 and R2.

As for a single random variable, we have the following properties of an absolutely continuous pair (R1, R2).

• For any Borel set B ⊆ R²,

P((R1, R2) ∈ B) = ∫∫_B f12(x, y) dx dy

• The density function has total mass 1:

∫_{−∞}^{∞} ∫_{−∞}^{∞} f12(x, y) dx dy = 1.

In a similar way, we can define a joint distribution function of a random vector (R1, R2, . . . , Rn) with

F12...n(x1, x2, . . . , xn) = P(R1 ≤ x1, R2 ≤ x2, . . . , Rn ≤ xn)
Example. Let the joint density of (R1, R2) be

f12(x, y) = 1 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1,   f12(x, y) = 0 elsewhere.

(This is the uniform density on the unit square.) Let us calculate the probability that 1/2 ≤ R1 + R2 ≤ 3/2:

P(1/2 ≤ R1 + R2 ≤ 3/2) = ∫∫_{1/2 ≤ x+y ≤ 3/2} 1 dx dy = 1 − 2(1/8) = 3/4

(making a figure can help to compute such integrals).
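Such probabilities can also be estimated by Monte Carlo integration; the sketch below (not part of the original text) samples the unit square uniformly and should return a value close to 3/4.

    import random

    def estimate_band_probability(trials=200_000, seed=4):
        """Monte Carlo estimate of P(1/2 <= R1 + R2 <= 3/2) for (R1, R2) uniform
        on the unit square."""
        rng = random.Random(seed)
        hits = sum(0.5 <= rng.random() + rng.random() <= 1.5 for _ in range(trials))
        return hits / trials

    print(3 / 4, estimate_band_probability())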
2.5 Relationship between joint and individual distributions
In this section, we investigate the relationship between joint and individual distributions of random variables defined on the same probability space.

Question 1. If (R1, R2) is absolutely continuous, are R1 and R2 absolutely continuous, and, if so, how can the individual densities of R1 and R2 be found in terms of the joint density?

The answer to this question is positive, and the individual densities (also called marginal densities) are given by

f1(x) = ∫_{−∞}^{∞} f12(x, y) dy,   f2(y) = ∫_{−∞}^{∞} f12(x, y) dx

Indeed, we have

F1(x) = P(R1 ≤ x) = P(R1 ≤ x, R2 ∈ (−∞, ∞)) = ∫_{−∞}^{x} ∫_{−∞}^{∞} f12(u, v) dv du = ∫_{−∞}^{x} (∫_{−∞}^{∞} f12(u, v) dv) du = ∫_{−∞}^{x} f1(u) du,

so R1 is absolutely continuous with density f1. We deal with R2 similarly.

In exactly the same way we may establish similar formulae in higher dimensions; for example,

f12(x, y) = ∫_{−∞}^{∞} f123(x, y, z) dz,
f2(y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f123(x, y, z) dx dz,

and so on.
Question 2. Given R1, R2 (individually) absolutely continuous, is (R1, R2) absolutely continuous, and, if so, can the joint density be derived from the individual densities?

The answer to this question is negative; that is, if R1 and R2 are each absolutely continuous, then (R1, R2) is not necessarily absolutely continuous. Furthermore, even if (R1, R2) is absolutely continuous, f1(x) and f2(y) do not determine f12(x, y). The examples below illustrate these statements.

Example. Let R1 be absolutely continuous with density f. Take R2 ≡ R1; that is, R2(ω) = R1(ω), ω ∈ Ω. Then R2 is absolutely continuous but (R1, R2) is not. Indeed, if L denotes the line x = y, we have P((R1, R2) ∈ L) = 1. On the other hand, if (R1, R2) had a density g(x, y), we would have

P((R1, R2) ∈ L) = ∫∫_L g(x, y) dx dy = 0

since L has area 0. This contradiction proves that (R1, R2) cannot have a density.

Example. It is easy to check that the following two densities

f12(x, y) = (1/4)(1 + xy) for −1 ≤ x ≤ 1 and −1 ≤ y ≤ 1, 0 elsewhere,
g12(x, y) = 1/4 for −1 ≤ x ≤ 1 and −1 ≤ y ≤ 1, 0 elsewhere

have the same marginal densities:

f1(x) = 1/2 for −1 ≤ x ≤ 1, 0 elsewhere,   f2(y) = 1/2 for −1 ≤ y ≤ 1, 0 elsewhere.

Thus the individual densities are not sufficient to determine the joint density.

However, there is one situation where the individual densities do determine the joint density: when the random variables are independent.
2.6 Independence of random variables
We have considered the notion of independence of events, and this can be used to define independence of random variables. Namely, we say that random variables are independent if events involving these random variables are independent. The formal definition is the following.

Definition 11. Let R1, . . . , Rn be random variables on (Ω, F, P). R1, . . . , Rn are said to be independent if for all Borel subsets B1, . . . , Bn of R we have

P(R1 ∈ B1, . . . , Rn ∈ Bn) = P(R1 ∈ B1) · · · P(Rn ∈ Bn).
Note that the last equality in itself does not imply that the events {R1 ∈ B1}, . . . , {Rn ∈ Bn} are independent. However, since we require that this equality holds for all Borel subsets B1, . . . , Bn, it holds for any subfamily of the Bi. Indeed, it is sufficient to replace the other ones by (−∞, ∞). For example, in the case n = 3,

P(R1 ∈ B1, R2 ∈ B2) = P(R1 ∈ B1, R2 ∈ B2, R3 ∈ (−∞, ∞)) = P(R1 ∈ B1)P(R2 ∈ B2)P(R3 ∈ (−∞, ∞)) = P(R1 ∈ B1)P(R2 ∈ B2),

and so on. By the same reasoning, we deduce that if R1, . . . , Rn are independent, so are R1, . . . , Rk, for k < n.

If (Ri, i ∈ I) is an arbitrary family of random variables on the space (Ω, F, P), the Ri are said to be independent if for each finite set of distinct indices i1, . . . , ik ∈ I, R_{i1}, . . . , R_{ik} are independent.

Theorem 12. Let R1, R2, . . . , Rn be independent random variables on a given probability space. If each Ri is absolutely continuous with density fi, then (R1, R2, . . . , Rn) is absolutely continuous; also, for all x1, x2, . . . , xn,

f12...n(x1, x2, . . . , xn) = f1(x1)f2(x2) · · · fn(xn).

Thus in this sense the joint density is the product of the individual densities.
2.7 Problems
1. An absolutely continuous random variable R has a density function f(x) = (1/2)e^{−|x|}.

(a) Sketch the distribution function of R.

(b) Find the probability of each of the following events.

i. {|R| ≤ 2}
ii. {|R| ≤ 2 or R ≥ 0}
iii. {|R| ≤ 2 and R ≥ −1}
iv. {|R| + |R − 3| ≤ 3}
v. {R³ − R² − R − 2 ≤ 0}
vi. {e^{sin πR} ≥ 1}
vii. {R is irrational} = {ω : R(ω) is an irrational number}

2. Consider a sequence of five Bernoulli trials. Let R be the number of times that a head is followed immediately by a tail. For example, if ω = HHTHT then R(ω) = 2, since a head is followed directly by a tail at trials 2 and 3, and also at trials 4 and 5. Find the probability function of R.
3 Expectation
The physical meaning of the expectation of a random variable is the average value of this variable in a very large number of independent repetitions of the random experiment. Before we make this definition mathematically precise, let us consider the following example.

Suppose that we observe the length of a telephone call made from a specific phone booth at a given time of the day (say, the first call after 12 o'clock). Suppose that the cost R2 of a call depends on its length R1 in the following way:

If 0 ≤ R1 ≤ 3 (minutes), R2 = 10 (cents)
If 3 < R1 ≤ 6, R2 = 20
If 6 < R1 ≤ 9, R2 = 30

(Assume for simplicity that the telephone is automatically disconnected after 9 minutes.)

Suppose that we repeat the experiment independently N times, where N is very large, and record the cost of each call. If we take the arithmetic average of the costs (the total cost divided by N) we expect physically that it will converge to a number that we should interpret as the long-run average cost of a call.

Suppose that P(R2 = 10) = 0.6, P(R2 = 20) = 0.25, and P(R2 = 30) = 0.15. If we observe N calls, then, roughly, {R2 = 10} will occur 0.6N times; the total cost of the calls of this type is 10(0.6N) = 6N. The calls with {R2 = 20} will occur approximately 0.25N times, giving rise to a total cost of 20(0.25N) = 5N. Finally, {R2 = 30} will occur approximately 0.15N times, producing a total cost of 30(0.15N) = 4.5N. The total cost of all calls is then equal to 6N + 5N + 4.5N = 15.5N. We deduce that the average cost of a call is 15.5 cents.

Observe how we have computed the average:

[10(0.6N) + 20(0.25N) + 30(0.15N)] / N = 10(0.6) + 20(0.25) + 30(0.15) = ∑_y y P(R2 = y).

Thus we are taking a weighted average of the possible values of R2, where the weights are the corresponding probabilities. This suggests the following definition.
3.1 Expectation of discrete random variables
Definition 13. Let R be a simple random variable, that is, a discrete random variable with a finite number of possible values. Define the expectation of R (also called the expected value, average value, mean value, or mean of R) as

E[R] = ∑_x x P(R = x)
Example. Let R have a binomial distribution with parameters n and p. Then

E[R] = ∑_{k=0}^{n} k C_n^k p^k q^{n−k} = ∑_{k=1}^{n} [n! / ((k−1)!(n−k)!)] p^k q^{n−k}
= np ∑_{k=1}^{n} [(n−1)! / ((k−1)!(n−k)!)] p^{k−1} q^{n−k} = np ∑_{l=0}^{n−1} [(n−1)! / (l!(n−1−l)!)] p^l q^{n−1−l}
= np ∑_{l=0}^{n−1} C_{n−1}^l p^l q^{n−1−l} = np (p + q)^{n−1} = np

Example. Let R be a simple random variable with probability function

x:       −3     −2     0      10     15
p_R(x):  0.1    0.35   0.2    0.05   0.3

Then its expectation is given by

E[R] = (−3) × 0.1 + (−2) × 0.35 + 0 × 0.2 + 10 × 0.05 + 15 × 0.3 = 4
If R is discrete with infinitely (countably) many possible values, the expectation is defined in the same way, but there is a little complication: an infinite sum is not always convergent. This leads to the following construction. Let R+ = max(R, 0) and R− = max(−R, 0) be the positive and negative parts of R. We have R = R+ − R−. For example, for the simple random variable R above, the probability functions of R+ and R− are given by

x:       0                     10     15
p+(x):   0.1+0.35+0.2 = 0.65   0.05   0.3

x:       3      2      0
p−(x):   0.1    0.35   0.2+0.05+0.3 = 0.55

We define

E[R+] = ∑_x x P(R+ = x),   E[R−] = ∑_x x P(R− = x).

Since R+ and R− take on only nonnegative values, these sums are always well defined (they may be finite or equal to +∞). Now we define

E[R] = E[R+] − E[R−]

if this is not of the form +∞ − ∞. That is, the expectation of R exists if E[R+] and E[R−] are not equal to +∞ simultaneously. It may be finite, or equal to +∞ or −∞. The possible cases are summarized in the following table.

                E[R+] = a        E[R+] = +∞
E[R−] = b       E[R] = a − b     E[R] = +∞
E[R−] = +∞      E[R] = −∞        E[R] does not exist

Example. Let R have a Poisson distribution with parameter λ > 0. Its probability function is given by

p_R(n) = e^{−λ} λ^n / n!,   n = 0, 1, 2, . . .

Let us calculate the expectation of R:

E[R] = ∑_{n=0}^{∞} n e^{−λ} λ^n / n! = λ e^{−λ} ∑_{n=1}^{∞} λ^{n−1} / (n−1)! = λ e^{−λ} ∑_{k=0}^{∞} λ^k / k! = λ e^{−λ} e^{λ} = λ
3.2 Expectation of absolutely continuous random variables
If R is absolutely continuous, the definition of the expectation is similar, but the sum is replaced by an integral, and P(R = x) by the density f_R(x).

Definition 14. Let R be an absolutely continuous random variable with density f_R(x). The expectation of R is defined as

E[R] = ∫_{−∞}^{∞} x f_R(x) dx

if this integral is well defined; that is, E[R+] = ∫_0^{∞} x f_R(x) dx and E[R−] = ∫_{−∞}^{0} (−x) f_R(x) dx are not equal to +∞ simultaneously.

Note. Don't confuse, however, f_R(x) and P(R = x). For an absolutely continuous random variable, the probability P(R = x) is zero for all x, but the density is a non-zero function. The density f_R(x) is not a probability: in particular, it need not be ≤ 1. The quantity which represents a probability in this expression is f_R(x) dx: informally speaking, this is the probability that R belongs to the "infinitesimal" interval (x, x + dx).
Example. Let R be a uniform random variable on [a, b] with density function

f_R(x) = 1/(b − a) for x ∈ [a, b],   f_R(x) = 0 otherwise.

Then

E[R] = ∫_a^b x/(b − a) dx = (1/(b − a)) [x²/2]_a^b = (1/(b − a)) (b² − a²)/2 = (a + b)/2

Example. If R is an exponential random variable with parameter λ, then

E[R] = ∫_0^{∞} x λ e^{−λx} dx = −∫_0^{∞} x (e^{−λx})' dx = [−x e^{−λx}]_0^{∞} + ∫_0^{∞} e^{−λx} dx = [−e^{−λx}/λ]_0^{∞} = 1/λ

Example. Let R be a standard normal random variable, R ∼ N(0, 1). Then

E[R] = ∫_{−∞}^{∞} x (1/√(2π)) e^{−x²/2} dx = 0

since the integrand is an odd function of x. If R ∼ N(µ, σ²), then

E[R] = ∫_{−∞}^{∞} x (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx = ∫_{−∞}^{∞} (y + µ) (1/√(2πσ²)) e^{−y²/(2σ²)} dy
= ∫_{−∞}^{∞} y (1/√(2πσ²)) e^{−y²/(2σ²)} dy + µ ∫_{−∞}^{∞} (1/√(2πσ²)) e^{−y²/(2σ²)} dy = 0 + µ · 1 = µ

On the last line, the first integral is equal to zero because the integrand is odd, and the second integral is equal to 1 because this is the total mass of the density function of a N(0, σ²) random variable.

Thus the meaning of the parameter µ of a normal random variable is its expectation.
Remark. It is possible for the expectation to be infinite, or not to exist at all. For example, let

f_R(x) = 1/x² for x ≥ 1,   f_R(x) = 0 for x < 1.

Then

E[R] = ∫_{−∞}^{∞} x f_R(x) dx = ∫_1^{∞} x (1/x²) dx = ∞

As another example, let f_R(x) = 1/(2x²) for |x| ≥ 1, f_R(x) = 0 for |x| < 1. Then

E[R+] = ∫_0^{∞} x f_R(x) dx = (1/2) ∫_1^{∞} x (1/x²) dx = ∞
E[R−] = ∫_{−∞}^{0} (−x) f_R(x) dx = (1/2) ∫_{−∞}^{−1} (−x) (1/x²) dx = ∞

Thus E[R] does not exist.
3.3 Expectation of mixed random variables
If R is a mixed random variable such that its distribution function is piecewise differentiable with derivative g(x) and has jumps at the points x1, x2, . . . , then the expectation of R is defined as follows:

E[R] = ∫_{−∞}^{∞} x g(x) dx + ∑_{xi} xi P(R = xi)

if both terms are well defined.

Example. If R is the mixed random variable from the example in Section 2.2.3, then

E[R] = ∫_0^{120} x (1/200) dx + 0 · P(R = 0) + 120 · P(R = 120) = (1/200)(120²/2) + 120 · 0.25 = 66
3.4 General moments of random variables
If R1 : Ω → R is a random variable, and g : R → R a real-valued function on R, then R2 = g(R1) : Ω → R is also a random variable (under some conditions on g; for instance, it is true if g is continuous or piecewise continuous).

Let us consider the example of phone calls given at the beginning of Section 3. The cost R2 of a call depends on the length R1 of the call, so that R2 = g(R1) with

g(x) = 10 if 0 ≤ x ≤ 3,   20 if 3 < x ≤ 6,   30 if 6 < x ≤ 9,   0 otherwise.

This example shows that we may be interested in expectations of the form E[g(R)].

Suppose that R1 is discrete with possible values x1, x2, . . . . Then with probability P(R1 = xi) we have R1 = xi, hence R2 = g(xi). So, the expectation of g(R1) should be given by

E[g(R1)] = E[R2] = ∑_{xi} g(xi) P(R1 = xi)

Similarly, if R is absolutely continuous with density f_R, then

E[g(R)] = ∫_{−∞}^{∞} g(x) f_R(x) dx

If we have an n-dimensional situation, for instance R0 = g(R1, . . . , Rn), the preceding formulae generalize in a natural way. If R1, . . . , Rn are discrete,

E[g(R1, . . . , Rn)] = ∑_{x1, . . . , xn} g(x1, . . . , xn) P(R1 = x1, . . . , Rn = xn)

If (R1, . . . , Rn) is absolutely continuous with density f12...n, then

E[g(R1, . . . , Rn)] = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(x1, . . . , xn) f12...n(x1, . . . , xn) dx1 · · · dxn

Of course, in all definitions above, we require that the corresponding series or integrals are well defined.
3.4.1 Terminology
If R is a random variable, the kth moment of R (k > 0, not necessarily an integer) is defined by

m_k = E[R^k]

if the expectation exists. Thus

m_k = ∑_x x^k p_R(x) if R is discrete,   m_k = ∫_{−∞}^{∞} x^k f_R(x) dx if R is absolutely continuous.

The first moment m_1 is simply the expectation E[R].

The kth central moment of R (k > 0) is defined by

c_k = E[(R − E[R])^k] = ∑_x (x − m_1)^k p_R(x) if R is discrete,   ∫_{−∞}^{∞} (x − m_1)^k f_R(x) dx if R is absolutely continuous,

if m_1 is finite and the expectation in question exists. Note that the first central moment is zero: c_1 = E[R − m_1] = m_1 − m_1 = 0.

The second central moment E[(R − E[R])²] is called the variance of R, written σ² or Var(R). The positive square root of the variance σ = √Var(R) is called the standard deviation of R.

The variance may be interpreted as a measure of dispersion. A large variance corresponds to a high probability that R will fall far from its mean, while a small variance indicates that R is likely to be close to its mean.

The quantities E[g(R)], with g other than x^k or (x − m_1)^k, are sometimes called general moments of R.
3.4.2 Moment generating function
The moments of order k = 1, 2, . . . of R may be calculated using the so called moment generating function of a random variable R, defined as

M_R(t) = E[e^{Rt}],   t ∈ R,

wherever this expectation exists. Provided M_R(t) exists in an open interval around t = 0, the kth moment is given by

E[R^k] = M_R^{(k)}(0) = d^k M_R(t)/dt^k evaluated at t = 0.

For example, M_R'(t) = E[R e^{Rt}], hence M_R'(0) = E[R]. Similarly, M_R''(t) = E[R² e^{Rt}] yields M_R''(0) = E[R²], and so on.
Example. Let R ∼ E(λ). Then

M_R(t) = E[e^{Rt}] = ∫_0^{∞} e^{xt} λ e^{−λx} dx = λ ∫_0^{∞} e^{−(λ−t)x} dx = λ/(λ − t),

provided t < λ. We have

M_R'(t) = λ/(λ − t)²,   M_R''(t) = 2λ/(λ − t)³,   M_R'''(t) = 6λ/(λ − t)⁴.

Thus we obtain

E[R] = M_R'(0) = 1/λ,   E[R²] = M_R''(0) = 2/λ²,   E[R³] = M_R'''(0) = 6/λ³,

and so on. In particular, Var(R) = E[R²] − (E[R])² = 2/λ² − 1/λ² = 1/λ².
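The derivatives of M_R(t) can also be obtained with a computer algebra system; the sketch below uses SymPy (a library choice not mentioned in the original notes) to recover the first two moments of the exponential distribution symbolically.

    import sympy as sp

    t, lam = sp.symbols("t lambda", positive=True)
    M = lam / (lam - t)                  # moment generating function of E(lambda), valid for t < lambda

    m1 = sp.diff(M, t, 1).subs(t, 0)     # E[R]   = 1/lambda
    m2 = sp.diff(M, t, 2).subs(t, 0)     # E[R^2] = 2/lambda^2
    variance = sp.simplify(m2 - m1**2)   # Var(R) = 1/lambda^2

    print(m1, m2, variance)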
Example. Let R ∼ P(λ). Then

M_R(t) = E[e^{Rt}] = ∑_{n=0}^{∞} e^{nt} e^{−λ} λ^n / n! = e^{−λ} ∑_{n=0}^{∞} (λe^t)^n / n! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}.

Let us compute the first two moments of R. We have

M_R'(t) = λ e^t e^{λ(e^t − 1)},   M_R''(t) = λ e^t e^{λ(e^t − 1)} + λ² e^{2t} e^{λ(e^t − 1)}.

Therefore,

E[R] = M_R'(0) = λ,   E[R²] = M_R''(0) = λ + λ².

Then Var(R) = E[R²] − (E[R])² = λ + λ² − λ² = λ.
3.5 Properties of expectation
In this section we list several basic properties of the expectation of a random variable. (We will always assume that all expectations which appear in the properties listed below exist.)

1. Let R1, . . . , Rn be random variables on a given probability space. Then

E[R1 + · · · + Rn] = E[R1] + · · · + E[Rn]

(if +∞ and −∞ do not both appear in the sum E[R1] + · · · + E[Rn]).

2. If E[R] exists and a is any real number, then E[aR] exists and

E[aR] = a E[R]

Basically, properties 1 and 2 say that the expectation is linear.

3. If R1 ≤ R2, then E[R1] ≤ E[R2].

4. If R ≥ 0 and E[R] = 0, then R is zero almost surely; that is, P(R = 0) = 1.

5. If Var(R) = 0, then R is essentially constant; more precisely, R = E[R] almost surely. This is a corollary of the previous property. Indeed, from E[(R − m_1)²] = 0 we conclude that (R − m_1)² = 0 almost surely, since (R − m_1)² is nonnegative. Thus R = m_1 almost surely.

6. Let R1, . . . , Rn be independent random variables. If one of the following conditions is satisfied:

a) all Ri are nonnegative, or
b) all E[Ri] are finite,

then

E[R1 R2 · · · Rn] = E[R1] E[R2] · · · E[Rn]

7. Let R be a random variable with finite mean m and variance σ² (possibly infinite). If a and b are real numbers, then

Var(aR + b) = a² σ²

8. Let R1, . . . , Rn be independent random variables, each with finite expectation. Then

Var(R1 + · · · + Rn) = Var(R1) + · · · + Var(Rn)

Corollary. If R1, . . . , Rn are independent, each with finite expectation, and a1, . . . , an, b are real numbers, then

Var(a1 R1 + · · · + an Rn + b) = a1² Var(R1) + · · · + an² Var(Rn)
3.6 Correlation
If R1 and R2 are random variables on a given probability space, we define their covariance as

Cov(R1, R2) = E[(R1 − E[R1])(R2 − E[R2])] = E[R1 R2] − E[R1] E[R2]

Theorem 15. If R1 and R2 are independent, then Cov(R1, R2) = 0, but not conversely.

Proof. If R1 and R2 are independent, then E[R1 R2] = E[R1] E[R2], hence Cov(R1, R2) = 0.
Let θ be uniformly distributed between 0 and 2π. Define R1 = cos θ, R2 = sin θ. We have

E[R1] = E[cos θ] = ∫_0^{2π} cos(x) (1/2π) dx = 0
E[R2] = E[sin θ] = ∫_0^{2π} sin(x) (1/2π) dx = 0
E[R1 R2] = E[cos θ sin θ] = ∫_0^{2π} cos(x) sin(x) (1/2π) dx = 0

Thus Cov(R1, R2) = 0. However, R1 and R2 are not independent. Indeed,

P(R1 > √2/2) = P(0 < θ < π/4 or 7π/4 < θ < 2π) = 1/4
P(R2 > √2/2) = P(π/4 < θ < 3π/4) = 1/4
P(R1 > √2/2, R2 > √2/2) = 0 ≠ P(R1 > √2/2) P(R2 > √2/2)
Definition 16. Let R1 and R2 be random variables defined on a given probability space. If Var(R1) > 0, Var(R2) > 0, we define the correlation coefficient of R1 and R2 as

ρ(R1, R2) = Cov(R1, R2) / (√Var(R1) √Var(R2))

By Theorem 15, if R1 and R2 are independent, they are uncorrelated, that is, ρ(R1, R2) = 0, but not conversely.

It can be shown that −1 ≤ ρ(R1, R2) ≤ 1.
3.7 Indicator function
In this section, we introduce the useful notion of indicator functions.

Definition 17. The indicator of an event A is a random variable I_A defined as follows:

I_A(ω) = 1 if ω ∈ A,   I_A(ω) = 0 if ω ∉ A.

Note that this is a (simple) discrete random variable since it has only two possible values. Its expectation is given by

E[I_A] = 1 · P(I_A = 1) + 0 · P(I_A = 0),

where

P(I_A = 1) ≡ P({ω ∈ Ω | I_A(ω) = 1}) = P({ω ∈ Ω | ω ∈ A}) ≡ P(A).

So, the expectation of I_A is equal to the probability of A:

E[I_A] = P(A).
Example. A single unbiased die is tossed independently n times. Let R1 be the number of 1's obtained, and R2 the number of 2's. Find E[R1 R2].

Intuitively, the random variables R1 and R2 are not independent (if we obtain a 1, we cannot obtain a 2 at the same time), so the direct evaluation of the expectation E[R1 R2] is not easy. The method of indicators allows to greatly simplify this problem.

If Ai is the event that the ith toss results in a 1, and Bi the event that the ith toss results in a 2, then

R1 = I_{A1} + · · · + I_{An},   R2 = I_{B1} + · · · + I_{Bn}.

Hence

E[R1 R2] = ∑_{i,j=1}^{n} E[I_{Ai} I_{Bj}].

Now if i ≠ j, I_{Ai} and I_{Bj} are independent (exercise); hence

E[I_{Ai} I_{Bj}] = E[I_{Ai}] E[I_{Bj}] = P(Ai) P(Bj) = 1/36.

If i = j, Ai and Bi are disjoint, since the ith toss cannot simultaneously result in a 1 and in a 2. Thus I_{Ai} I_{Bi} = I_{Ai ∩ Bi} = 0. Thus

E[R1 R2] = n(n − 1)/36

since there are n(n − 1) ordered pairs (i, j) of integers belonging to {1, 2, . . . , n} such that i ≠ j.
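The value n(n − 1)/36 can be checked by simulation; the sketch below (not part of the original notes) tosses a die n times, counts the 1's and the 2's, and averages the product R1·R2.

    import random

    def mean_product(n=6, trials=200_000, seed=5):
        """Monte Carlo estimate of E[R1 * R2], where R1 is the number of 1's and
        R2 the number of 2's in n tosses of a fair die."""
        rng = random.Random(seed)
        total = 0
        for _ in range(trials):
            tosses = [rng.randint(1, 6) for _ in range(n)]
            total += tosses.count(1) * tosses.count(2)
        return total / trials

    n = 6
    print(n * (n - 1) / 36, mean_product(n))   # exact value 30/36 vs. the estimate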
4 Conditional probability and expectation
4.1 Conditional expectation given A with P (A) > 0
If R is discrete,

E[R | A] = ∑_x x P(R = x | A) = ∑_x x P({R = x} ∩ A) / P(A) = E[R I_A] / P(A)

For any R, not necessarily discrete, define

E[R | A] = E[R I_A] / P(A)

Example. Let R be the result of the toss of a single die and A be the event "the result is even". Then

E[R | A] = ∑_{n=1}^{6} n P(R = n | A)

We have

P(R = 1 | A) = P(R = 3 | A) = P(R = 5 | A) = 0

and

P(R = 2 | A) = P(R = 2 and R is even) / P(R is even) = P(R = 2) / P(R is even) = (1/6)/(1/2) = 1/3,

and also P(R = 4 | A) = P(R = 6 | A) = 1/3. Thus

E[R | A] = 2 (1/3) + 4 (1/3) + 6 (1/3) = 4
We have the following consequence of the formula of total probability.

Theorem of total expectation. Let B1, B2, . . . be a finite or countable family of mutually exclusive and exhaustive events. If R is any random variable, then

E[R] = ∑_i P(Bi) E[R | Bi]

We say that R is independent from A if all the events involving R are independent from A; that is, {R ∈ B} and A are independent for all Borel sets B.

Property 18. If R is independent from A, then

E[R | A] = E[R]
Example. Let N be a discrete random variable with possible values 0, 1, 2, . . . (for instance, N ∼ P(λ)) and let X1, X2, . . . be i.i.d. random variables, independent from N. Define

S = ∑_{i=1}^{N} Xi

Let us compute the expectation of S:

E[S] = E[∑_{i=1}^{N} Xi] = ∑_{n=0}^{∞} P(N = n) E[∑_{i=1}^{N} Xi | N = n]
= ∑_{n=0}^{∞} P(N = n) E[∑_{i=1}^{n} Xi | N = n] = ∑_{n=0}^{∞} P(N = n) E[∑_{i=1}^{n} Xi]
= ∑_{n=0}^{∞} P(N = n) n E[X1] = E[X1] ∑_{n=0}^{∞} n P(N = n) = E[X1] E[N]

For example, if N ∼ P(200) and X1 ∼ E(1/1000), then E[S] = 1000 × 200 = 2 × 10⁵.
4.2 Conditional expectation with respect to a random variable
If R has an absolutely continuous distribution, then P(R = x) = 0 for all x, so the conditional probability P(A | R = x) and the conditional expectation E[R1 | R = x] cannot be defined as before. However, intuitively, these quantities make sense. For example, let R1 and R2 be independent random variables with the same distribution U([0, 1]). Then we would like to write, for example,

P(R1 + R2 ≤ 1 | R1 = x) = P(x + R2 ≤ 1) = P(R2 ≤ 1 − x) = 1 − x

or

E[R1 R2 | R1 = x] = E[x R2 | R1 = x] = x E[R2] = x/2

The rigorous definition of conditional probability and expectation with respect to events of probability zero of the type {R = x} is rather involved and is out of the scope of these lectures.

Instead, we give the continuous equivalents of the theorems of total probability and total expectation and define the notion of conditional density consistent with these theorems.
Theorem 19. Let R be an absolutely continuous random variable with density function f_R(x). Let A be any event and R1 any random variable. Then

P(A) = ∫_{−∞}^{∞} P(A | R = x) f_R(x) dx
E[R1] = ∫_{−∞}^{∞} E[R1 | R = x] f_R(x) dx

Definition 20. (conditional density)

Example. A nonnegative number R1 is chosen with the density f1(x) = x e^{−x}, x ≥ 0; f1(x) = 0, x < 0. If R1 = x, a number R2 is chosen with uniform density between 0 and x. Find P(R1 + R2 ≤ 2).
By the theorem of total probability,

P(R1 + R2 ≤ 2) = ∫_{−∞}^{∞} P(R1 + R2 ≤ 2 | R1 = x) f1(x) dx = ∫_0^{∞} P(x + R2 ≤ 2 | R1 = x) x e^{−x} dx
= ∫_0^1 (1) x e^{−x} dx + ∫_1^2 P(R2 ≤ 2 − x | R1 = x) x e^{−x} dx + ∫_2^{∞} (0) x e^{−x} dx

We have

P(R2 ≤ 2 − x | R1 = x) = (2 − x)/x

Therefore

P(R1 + R2 ≤ 2) = ∫_0^1 x e^{−x} dx + ∫_1^2 ((2 − x)/x) x e^{−x} dx = 1 − 2e^{−1} + e^{−2}
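The answer 1 − 2e^{−1} + e^{−2} ≈ 0.400 can be double-checked by simulating the two-stage choice; the sketch below (not part of the original text) draws R1 from the density x e^{−x}, which is a Gamma(2, 1) density, and then R2 uniformly on [0, R1].

    import math
    import random

    def estimate_probability(trials=300_000, seed=7):
        """Monte Carlo estimate of P(R1 + R2 <= 2), where R1 has density x*exp(-x)
        and, given R1 = x, R2 is uniform on [0, x]."""
        rng = random.Random(seed)
        hits = 0
        for _ in range(trials):
            r1 = rng.gammavariate(2.0, 1.0)   # density x e^{-x}, x >= 0
            r2 = rng.uniform(0.0, r1)         # uniform on [0, r1]
            hits += (r1 + r2) <= 2.0
        return hits / trials

    exact = 1 - 2 * math.exp(-1) + math.exp(-2)   # about 0.4002
    print(exact, estimate_probability())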
we write
E[R2 | R1 ] = g(R1 ).
So, the
conditional expectation with respect to a random variable is also a random
variable (and not a constant as a usual expectation).
Example.
x ≥ 0,
and
R1 have density fR (x) = xe−x ,
uniform on [0, x] given R = x. Then we have
∫ x
x
1
E[R2 | R1 = x] =
y dy =
x
2
0
∫ x
1
ex − 1
R2
E[e | R1 = x] =
ey dy =
x
x
0
Let, as in the previous example,
R2
be
Thus we write
R1
2
eR1 − 1
| R1 ] =
R1
E[R2 | R1 ] =
E[eR2
36
4.3 Conditional expectation with respect to a σ-field
Definition 21. Let X be a random variable on a probability space (Ω, F, P) such that E[|X|] < ∞, and let G ⊂ F be a σ-field. The conditional expectation of X given G (written E[X | G]) is a random variable Z satisfying the following properties:

• Z ∈ G (Z is G-measurable);

• ∀ Y ∈ G, Y bounded, E[XY] = E[ZY].

Remark. The random variable defined above is unique in the sense that if Z1 and Z2 satisfy the properties of the conditional expectation E[X | G], then Z1 = Z2 almost surely (P{ω | Z1(ω) ≠ Z2(ω)} = 0).

If X is square integrable (that is, E[|X|²] < ∞), then the conditional expectation has a geometrical interpretation. Indeed, the set L² of square-integrable random variables on a given probability space may be considered as a vector space with the norm ∥X∥_2 = √(E[|X|²]) and the associated scalar product X · Y = E[XY]. For any σ-field G, the random variables which are G-measurable form a vector subspace of L² (since X, Y ∈ G, α ∈ R implies αX + Y ∈ G). Let us denote this subspace by V_G. The conditional expectation E[X | G] is then an orthogonal projection of X on V_G.

Indeed, what is an orthogonal projection of a vector v ∈ V on a subspace G ⊂ V? It is a vector v_G which lies in G and such that v − v_G is orthogonal to G. The second condition means that for all w ∈ G, (v − v_G) ⊥ w in the sense (v − v_G) · w = 0. Equivalently, for all w ∈ G, v · w = v_G · w.

In our case, the vector space V is L², the vectors are random variables, and G is the set V_G of G-measurable random variables. Looking at the definition of E[X | G], we see that this is exactly the description of the orthogonal projection of X on V_G.

In other words, the conditional expectation E[X | G] is the G-measurable random variable closest to X in the sense of the L²-norm (that is, the random variable Z ∈ G which minimizes E[(Z − X)²]).

Keeping in mind this geometrical interpretation helps to understand some of the following properties of the conditional expectation. These properties are essential in stochastic process theory and thus are extensively used in mathematical finance.
Properties. Let X, Y, Z be random variables on a given probability space (Ω, F, P). Let G, H ⊆ F be σ-algebras of events.

1. E[aX + bY | G] = a E[X | G] + b E[Y | G], ∀a, b ∈ R.
(The projection is linear.)

2. If Y ⊥⊥ G, then E[Y | G] = E[Y].
(Don't confuse "independent" and "orthogonal in the sense of L²"! Here, Y is not necessarily orthogonal to G. To understand this property, use the interpretation of G as an "information": if Y is independent of G, the information contained in G does not influence the distribution of Y, so the condition in the expectation may be dropped.)

3. If Z ∈ G, then E[XZ | G] = Z E[X | G]. In particular, E[Z | G] = Z.
(The projection on G of a vector v which already belongs to G is v itself. In terms of the information, if Z is G-measurable, it means that given the information G the value of Z is known; thus Z is treated as a known constant in the conditional expectation given G.)

4. If G ⊆ H, then E[E[X | H] | G] = E[X | G].
(To project on a smaller subspace is equivalent to projecting first on a greater subspace and then projecting the result on the smaller subspace.)
The comments above do not constitute proofs of these properties but only provide some intuition about them. To prove these properties, we have to use the definition of the conditional expectation. For example, let us prove property 2.

We need to prove that the (constant) random variable Z = E[Y] satisfies the two properties of the conditional expectation of Y given G. Since it is constant, it is G-measurable. Let G ∈ G be an arbitrary bounded random variable. We have to check that E[YG] = E[E[Y]G]. By independence of Y with respect to G, the left hand side is equal to E[Y]E[G]. The right hand side is equal to the same expression, simply by taking the constant E[Y] out of the expectation.

We omit here the proofs of the other properties of the conditional expectation.
5 Gaussian vectors
(to be completed)