
6
RANDOM VECTORS
Basic Concepts
Often, a single random variable cannot adequately provide all of the information
needed about the outcome of an experiment. For example, tomorrow’s weather
is really best described by an array of random variables that includes wind speed,
wind direction, atmospheric pressure, relative humidity, and temperature. It would be neither easy nor desirable to attempt to combine all of this information into a single measurement.
We would like to extend the notion of a random variable to deal with an experiment
that results in several observations each time the experiment is run. For example,
let T be a random variable representing tomorrow’s maximum temperature and
let R be a random variable representing tomorrow’s total rainfall. It would be
reasonable to ask for the probability that tomorrow’s temperature is greater than
70° and tomorrow’s total rainfall is less than 0.1 inch. In other words, we wish to
determine the probability of the event
A = {T > 70, R < 0.1}.
Another question that we might like to have answered is, “What is the probability
that the temperature will be greater than 70° regardless of the rainfall?” To answer
this question, we would need to compute the probability of the event
B = {T > 70}.
In this chapter, we will build on our probability model and extend our definition of
a random variable to permit such calculations.
Definition
The first thing we must do is to precisely define what we mean by “an array of
random variables.”
Figure 6.1: A mapping using (X1, X2)
Definition 6.1. Let Ω be a sample space. An n-dimensional random variable, or random vector, (X1(·), X2(·), ..., Xn(·)), is a vector of functions that assigns to each point ω ∈ Ω a point (X1(ω), X2(ω), ..., Xn(ω)) in n-dimensional Euclidean space.
Example: Consider an experiment where a die is rolled twice. Let X1 denote the
number of the first roll, and X2 the number of the second roll. Then (X1 , X2 ) is a
two-dimensional random vector. A possible sample point in Ω is one that is mapped into the point (4, 1), as shown in Figure 6.1.
Figure 6.2: The region representing the event P(X ≤ x, Y ≤ y).
Joint Distributions
Now that we know the definition of a random vector, we can begin to use it to assign
probabilities to events. For any random vector, we can define a joint cumulative
distribution function for all of the components as follows:
Definition 6.2. Let (X1 , X2 , . . . , Xn ) be a random vector. The joint cumulative
distribution function for this random vector is given by
FX1,X2,...,Xn(x1, x2, ..., xn) ≡ P(X1 ≤ x1, X2 ≤ x2, ..., Xn ≤ xn).
In the two-dimensional case, the joint cumulative distribution function for the
random vector (X, Y ) evaluated at the point (x, y), namely
FX,Y (x, y),
is the probability that the experiment results in a two-dimensional value within the
shaded region shown in Figure 6.2.
Every joint cumulative distribution function must possess the following properties:

1. lim_{all xi → −∞} FX1,X2,...,Xn(x1, x2, ..., xn) = 0
Figure 6.3: The possible outcomes from rolling a die twice.
2. lim_{all xi → +∞} FX1,X2,...,Xn(x1, x2, ..., xn) = 1

3. As xi varies, with all other xj's (j ≠ i) fixed, FX1,X2,...,Xn(x1, x2, ..., xn) is a nondecreasing function of xi.
As in the case of one-dimensional random variables, we shall identify two major classifications of vector-valued random variables: discrete and continuous.
Although there are many common properties between these two types, we shall
discuss each separately.
Discrete Distributions
A random vector that can only assume at most a countable collection of discrete
values is said to be discrete. As an example, consider once again the example on
page 154 where a die is rolled twice. The possible values for either X1 or X2 are in
the set {1, 2, 3, 4, 5, 6}. Hence, the random vector (X1 , X2 ) can only take on one
of the 36 values shown in Figure 6.3.
If the die is fair, then each of the points can be considered to have a probability mass of 1/36. This prompts us to define a joint probability mass function for this type of random vector, as follows:
Definition 6.3. Let (X1, X2, ..., Xn) be a discrete random vector. Then

pX1,X2,...,Xn(x1, x2, ..., xn) ≡ P(X1 = x1, X2 = x2, ..., Xn = xn)

is the joint probability mass function for the random vector (X1, X2, ..., Xn).
Referring again to the example on page 154, we find that the joint probability mass function for (X1, X2) is given by

pX1,X2(x1, x2) = 1/36   for x1 = 1, 2, ..., 6 and x2 = 1, 2, ..., 6.
Note that for any probability mass function,

FX1,X2,...,Xn(b1, b2, ..., bn) = Σ_{x1 ≤ b1} Σ_{x2 ≤ b2} ··· Σ_{xn ≤ bn} pX1,X2,...,Xn(x1, x2, ..., xn).
Therefore, if we wished to evaluate FX1 ,X2 (3.0, 4.5) we would sum all of the
probability mass in the shaded region shown in Figure 6.3, and obtain
FX1,X2(3.0, 4.5) = 12 · (1/36) = 1/3.
This is the probability that the first roll is less than or equal to 3 and the second roll
is less than or equal to 4.5.
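As a quick computational aside, this value can be checked by brute-force enumeration; the short Python sketch below (with purely illustrative names) sums the mass of every outcome in the shaded region.

```python
from fractions import Fraction

# Enumerate the 36 equally likely outcomes of two rolls of a fair die and
# accumulate the probability mass in the event {X1 <= 3.0, X2 <= 4.5}.
outcomes = [(x1, x2) for x1 in range(1, 7) for x2 in range(1, 7)]
p = Fraction(1, 36)  # mass of each outcome (fair die)

F_3_45 = sum(p for (x1, x2) in outcomes if x1 <= 3.0 and x2 <= 4.5)
print(F_3_45)  # 1/3
```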
Every joint probability mass function must have the following properties:
1. pX1,X2,...,Xn(x1, x2, ..., xn) ≥ 0 for any (x1, x2, ..., xn).

2. Σ_{all x1} ··· Σ_{all xn} pX1,X2,...,Xn(x1, x2, ..., xn) = 1

3. P(E) = Σ_{(x1,...,xn) ∈ E} pX1,X2,...,Xn(x1, x2, ..., xn) for any event E.
You should compare these properties with those of probability mass functions for
single-valued discrete random variables given in Chapter 5.
Continuous Distributions
Extending our notion of probability density functions to continuous random vectors
is a bit tricky, and the mathematical details of the problem are beyond the scope of an introductory course. In essence, it is not possible to define the joint density as a
derivative of the cumulative distribution function as we did in the one-dimensional
case.
Let Rn denote n-dimensional Euclidean space. We sidestep the problem by defining
the density of an n-dimensional random vector to be a function that when integrated
over the set
{(x1 , . . . , xn ) ∈ Rn : x1 ≤ b1 , x2 ≤ b2 , . . . , xn ≤ bn }
will yield the value for the cumulative distribution function evaluated at
(b1 , b2 , . . . , bn ).
More formally, we have the following:
Definition 6.4. Let (X1, ..., Xn) be a continuous random vector with joint cumulative distribution function FX1,...,Xn(x1, ..., xn). The function fX1,...,Xn(x1, ..., xn) that satisfies the equation

FX1,...,Xn(b1, ..., bn) = ∫_{−∞}^{b1} ··· ∫_{−∞}^{bn} fX1,...,Xn(t1, ..., tn) dt1 ··· dtn

for all (b1, ..., bn) is called the joint probability density function for the random vector (X1, ..., Xn).
Now, instead of obtaining a derived relationship between the density and the
cumulative distribution function by using integrals as anti-derivatives, we have
enforced such a relationship by the above definition.
Every joint probability density function must have the following properties:
1. fX1,X2,...,Xn(x1, x2, ..., xn) ≥ 0 for any (x1, x2, ..., xn).

2. ∫_{−∞}^{+∞} ··· ∫_{−∞}^{+∞} fX1,...,Xn(x1, ..., xn) dx1 ··· dxn = 1

3. P(E) = ∫ ··· ∫_E fX1,...,Xn(x1, ..., xn) dx1 ··· dxn for any event E.
You should compare these properties with those of probability density functions
for single-valued continuous random variables given in Chapter 5.
Figure 6.4: Computing P(a1 < X1 ≤ b1, a2 < X2 ≤ b2).
In the one-dimensional case, we had the handy formula
P (a < X ≤ b) = FX (b) − FX (a).
This worked for any type of probability distribution. The situation in the multi-dimensional case is a little more complicated, with a comparable formula given by

P(a1 < X1 ≤ b1, a2 < X2 ≤ b2) = FX1,X2(b1, b2) − FX1,X2(a1, b2) − FX1,X2(b1, a2) + FX1,X2(a1, a2).
You should be able to verify this formula for yourself by accounting for all of the
probability masses in the regions shown in Figure 6.4.
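As an illustrative aside, the sketch below checks the formula for the fair-die random vector (X1, X2) of the earlier example, using the arbitrarily chosen values a1 = 1, b1 = 3, a2 = 2, b2 = 5.

```python
from fractions import Fraction

p = Fraction(1, 36)
outcomes = [(x1, x2) for x1 in range(1, 7) for x2 in range(1, 7)]

def F(x, y):
    """Joint CDF of the two die rolls."""
    return sum(p for (x1, x2) in outcomes if x1 <= x and x2 <= y)

a1, b1, a2, b2 = 1, 3, 2, 5

# Direct count of the rectangle event versus the four-term CDF formula.
direct = sum(p for (x1, x2) in outcomes if a1 < x1 <= b1 and a2 < x2 <= b2)
via_cdf = F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)
print(direct, via_cdf)  # both 1/6
```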
Example: Let (X, Y ) be a two-dimensional random variable with the following
joint probability density function (see Figure 6.5):
fX,Y(x, y) = 2 − y   if 0 ≤ x ≤ 2 and 1 ≤ y ≤ 2
           = 0       otherwise

Note that

∫_1^2 ∫_0^2 (2 − y) dx dy = 1.
Figure 6.5: A two-dimensional probability density function
Suppose we would like to compute P (X ≤ 1.0, Y ≤ 1.5). To do this, we calculate
the volume under the surface fX,Y (x, y) over the region {(x, y) : x ≤ 1, y ≤ 1.5}.
This region of integration is shown shaded (in green) in Figure 6.5. Performing the
integration, we get

P(X ≤ 1.0, Y ≤ 1.5) = ∫_{−∞}^{1.5} ∫_{−∞}^{1.0} fX,Y(x, y) dx dy
                    = ∫_{1.0}^{1.5} ∫_0^{1.0} (2 − y) dx dy = 3/8.
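As a numerical aside, the same probability can be approximated with a simple midpoint Riemann sum over the region of integration; the sketch below is one minimal way to do it.

```python
# Midpoint Riemann sum approximating P(X <= 1.0, Y <= 1.5) for
# f(x, y) = 2 - y on 0 <= x <= 2, 1 <= y <= 2 (0 elsewhere).
# Inside the event, the density is nonzero only for x in [0, 1], y in [1, 1.5].
n = 400                      # grid resolution in each direction
dx = 1.0 / n
dy = 0.5 / n

total = 0.0
for i in range(n):
    x = (i + 0.5) * dx
    for j in range(n):
        y = 1.0 + (j + 0.5) * dy
        total += (2 - y) * dx * dy

print(total)  # approximately 0.375 = 3/8
```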
Marginal Distributions
Given the probability distribution for a vector-valued random variable (X1, ..., Xn),
we might ask the question, “Can we determine the distribution of X1 , disregarding
the other components?” The answer is yes, and the solution requires the careful
use of English rather than mathematics.
For example, in the two-dimensional case, we may be given a random vector (X, Y )
with joint cumulative distribution function FX,Y (x, y). Suppose we would like to
find the cumulative distribution function for X alone, i.e., FX(x). We know that

FX,Y(x, y) = P(X ≤ x, Y ≤ y)
and we are asking for

FX(x) = P(X ≤ x).                                   (1)

But in terms of both X and Y, expression (1) can be read: “the probability that X takes on a value less than or equal to x and Y takes on any value.” Therefore, it would make sense to say

FX(x) = P(X ≤ x)
      = P(X ≤ x, Y ≤ ∞)
      = lim_{y→∞} FX,Y(x, y).
Using this idea, we shall define what we will call the marginal cumulative distribution function:

Definition 6.5. Let (X1, ..., Xn) be a random vector with joint cumulative distribution function FX1,...,Xn(x1, ..., xn). The marginal cumulative distribution function for X1 is given by

FX1(x1) = lim_{x2→∞} lim_{x3→∞} ··· lim_{xn→∞} FX1,X2,...,Xn(x1, x2, ..., xn).
Notice that we can renumber the components of the random vector and call any one
of them X1 . So we can use the above definition to find the marginal cumulative
distribution function for any of the Xi ’s.
Although Definition 6.5 is a nice definition, it is more useful to examine marginal probability mass functions and marginal probability density functions. For example, suppose we have a discrete random vector (X, Y) with joint probability mass function pX,Y(x, y). To find pX(x), we ask, “What is the probability that X = x regardless of the value that Y takes on?” This can be written as

pX(x) = P(X = x) = P(X = x, Y = any value) = Σ_{all y} pX,Y(x, y).
Example: In the die example (on page 154),

pX1,X2(x1, x2) = 1/36   for x1 = 1, 2, ..., 6 and x2 = 1, 2, ..., 6.

To find pX1(2), for example, we compute

pX1(2) = P(X1 = 2) = Σ_{k=1}^{6} pX1,X2(2, k) = 1/6.
This is hardly a surprising result, but it brings some comfort to know we can get it from all of the mathematical machinery we’ve developed thus far.

Table 6.1: Joint pmf for daily production

    pX,Y(x, y) |  y = 1   y = 2   y = 3   y = 4   y = 5
    x = 1      |  .05     0       0       0       0
    x = 2      |  .15     .10     0       0       0
    x = 3      |  .05     .05     .10     0       0
    x = 4      |  .05     .025    .025    0       0
    x = 5      |  .10     .10     .10     .10     0
Example: Let X be the total number of items produced in a day’s work at a factory,
and let Y be the number of defective items produced. Suppose that the probability
mass function for (X, Y ) is given by Table 6.1. Using this joint distribution, we
can see that the probability of producing 2 items with exactly 1 of those items being
defective is
pX,Y (2, 1) = 0.15.
To find the marginal probability mass function for the total daily production, X ,
we sum the probabilities over all possible values of Y for each fixed x:
pX (1) = pX,Y (1, 1) = 0.05
pX (2) = pX,Y (2, 1) + pX,Y (2, 2) = 0.15 + 0.10 = 0.25
pX (3) = pX,Y (3, 1) + pX,Y (3, 2) + pX,Y (3, 3) = 0.05 + 0.05 + 0.10 = 0.20
etc.
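These row sums (and the analogous column sums for Y) are easy to automate. The short Python sketch below, with Table 6.1 stored in an illustrative dictionary, computes both marginal pmf's.

```python
from collections import defaultdict

# Joint pmf of Table 6.1, keyed by (x, y); zero entries are omitted.
p = {
    (1, 1): .05,
    (2, 1): .15, (2, 2): .10,
    (3, 1): .05, (3, 2): .05, (3, 3): .10,
    (4, 1): .05, (4, 2): .025, (4, 3): .025,
    (5, 1): .10, (5, 2): .10, (5, 3): .10, (5, 4): .10,
}

pX = defaultdict(float)
pY = defaultdict(float)
for (x, y), mass in p.items():
    pX[x] += mass   # sum across each row
    pY[y] += mass   # sum down each column

print(dict(pX))  # {1: .05, 2: .25, 3: .20, 4: .10, 5: .40}  (up to rounding)
print(dict(pY))  # {1: .40, 2: .275, 3: .225, 4: .10}        (up to rounding)
```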
But notice that in these computations, we are simply adding the entries in all of the columns for each row of Table 6.1. Doing this for Y as well as for X, we obtain Table 6.2. So, for example, P(Y = 2) = pY(2) = 0.275. We simply look for the result in the margin¹ for the entry Y = 2.

¹ Would you believe that this is why they are called marginal distributions?
Table 6.2: Marginal pmf's for daily production

    pX,Y(x, y) |  y = 1   y = 2   y = 3   y = 4   y = 5  | pX(x)
    x = 1      |  .05     0       0       0       0      |  .05
    x = 2      |  .15     .10     0       0       0      |  .25
    x = 3      |  .05     .05     .10     0       0      |  .20
    x = 4      |  .05     .025    .025    0       0      |  .10
    x = 5      |  .10     .10     .10     .10     0      |  .40
    pY(y)      |  .40     .275    .225    .10     0      |

The procedure is similar for obtaining marginal probability density functions. Recall that a density, fX(x), itself is not a probability measure, but fX(x) dx is. So
with a little loose-speaking integration notation we should be able to compute

fX(x) dx = P(x ≤ X < x + dx)
         = P(x ≤ X < x + dx, Y = any value)
         = [ ∫_{−∞}^{+∞} fX,Y(x, y) dy ] dx

where y is the variable of integration in the above integral. Looking at this relationship as

fX(x) dx = [ ∫_{−∞}^{+∞} fX,Y(x, y) dy ] dx,

it would seem reasonable to define

fX(x) ≡ ∫_{−∞}^{+∞} fX,Y(x, y) dy,

and we therefore offer the following:
Definition 6.6. Let (X1, ..., Xn) be a continuous random vector with joint probability density function fX1,...,Xn. The marginal probability density function for the random variable X1 is given by

fX1(x1) = ∫_{−∞}^{+∞} ··· ∫_{−∞}^{+∞} fX1,X2,...,Xn(x1, x2, ..., xn) dx2 ··· dxn.
Notice that in both the discrete and continuous cases, we sum (or integrate) over
all possible values of the unwanted random variable components.
Figure 6.6: The support for the random vector
The trick in such problems is to ensure that your limits of integration are correct.
Drawing a picture of the region where there is positive probability mass (the support
of the distribution) often helps.
For the above example, the picture of the support would be as shown in Figure 6.6.
If the dotted line in Figure 6.6 indicates a particular value for x (call it a), by
integrating over all values of y , we are actually determining how much probability
mass has been placed along the line x = a. The integration process assigns all of
that mass to x = a in one dimension. Repeating this process for each x yields the
desired probability density function.
Example: Let (X, Y ) be a two-dimensional continuous random variable with joint
probability density function
fX,Y(x, y) = x + y   if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1
           = 0       otherwise
Find the marginal probability density function for X .
Solution: Let’s consider the case where we fix x so that 0 ≤ x ≤ 1. We compute
fX(x) = ∫_{−∞}^{+∞} fX,Y(x, y) dy
      = ∫_0^1 (x + y) dy
      = [ xy + y²/2 ]_{y=0}^{y=1}
      = x + 1/2.

If x is outside the interval [0, 1], we have fX(x) = 0. So, summarizing these computations, we find that

fX(x) = x + 1/2   if 0 ≤ x ≤ 1
      = 0         otherwise
We will leave it to you to check that fX(x) is, in fact, a probability density function by making sure that fX(x) ≥ 0 for all x and that ∫_{−∞}^{+∞} fX(x) dx = 1.
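As a computational aside, the sketch below checks this marginal numerically: it integrates the joint density over y at a few fixed values of x, compares the result with x + 1/2, and then verifies that the marginal integrates to 1 (all by a simple midpoint rule).

```python
def f_joint(x, y):
    """Joint density f(x, y) = x + y on the unit square, 0 elsewhere."""
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def integrate(g, a, b, n=2000):
    """Midpoint rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

for x in (0.25, 0.5, 0.9):
    fx = integrate(lambda y: f_joint(x, y), 0, 1)
    print(x, fx, x + 0.5)      # the last two values should agree

print(integrate(lambda x: x + 0.5, 0, 1))   # approximately 1
```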
Example: Let (X, Y ) be the two-dimensional random variable with the following
joint probability density function (see Figure 6.5 on page 160):

fX,Y(x, y) = 2 − y   if 0 ≤ x ≤ 2 and 1 ≤ y ≤ 2
           = 0       otherwise
Find the marginal probability density function for X and the marginal probability
density function for Y .
Solution: Let’s first find the marginal probability density function for X . Consider
the case where we fix x so that 0 ≤ x ≤ 2. We compute
fX(x) = ∫_{−∞}^{+∞} fX,Y(x, y) dy
      = ∫_1^2 (2 − y) dy
      = [ 2y − y²/2 ]_{y=1}^{y=2}
      = 1/2.

If x is outside the interval [0, 2], we have fX(x) = 0. Therefore,

fX(x) = 1/2   if 0 ≤ x ≤ 2
      = 0     otherwise
To find the marginal probability density function for Y, consider the case where we fix y so that 1 ≤ y ≤ 2. We compute

fY(y) = ∫_{−∞}^{+∞} fX,Y(x, y) dx
      = ∫_0^2 (2 − y) dx
      = (2 − y) x |_{x=0}^{x=2}
      = 2(2 − y).

If y is outside the interval [1, 2], we have fY(y) = 0. Therefore,

fY(y) = 2(2 − y)   if 1 ≤ y ≤ 2
      = 0          otherwise
You should also double check that fX and fY are both probability density functions.
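One quick way to carry out that check is numerically; the sketch below verifies that both marginals found above integrate to 1, using a simple midpoint rule.

```python
def integrate(g, a, b, n=2000):
    """Midpoint rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

def f_X(x):
    """Marginal density of X for the (2 - y) example."""
    return 0.5 if 0 <= x <= 2 else 0.0

def f_Y(y):
    """Marginal density of Y for the (2 - y) example."""
    return 2 * (2 - y) if 1 <= y <= 2 else 0.0

print(integrate(f_X, 0, 2))   # approximately 1
print(integrate(f_Y, 1, 2))   # approximately 1
```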
Functions of Random Vectors
The technique for computing functions of one-dimensional random variables carries
over to the multi-dimensional case. Most of these problems just require a little
careful reasoning.
Example: Let (X1 , X2 ) be the die-tossing random vector of the example on
page 154. Find the probability mass function for the random variable Z = X1 +X2 ,
the sum of the two rolls.
Solution: We already know that pX1,X2(x1, x2) = 1/36 for x1 = 1, ..., 6 and x2 = 1, ..., 6. We ask the question, “What are the possible values that Z can take on?” The answer: “The integers 2, 3, 4, ..., 12.” For example, Z equals 4 precisely when any one of the mutually exclusive events

{X1 = 1, X2 = 3},   {X1 = 2, X2 = 2},   or   {X1 = 3, X2 = 1}

occurs. So,

pZ(4) = pX1,X2(1, 3) + pX1,X2(2, 2) + pX1,X2(3, 1) = 3/36.
Continuing in this manner, you should be able to verify that

pZ(2) = 1/36;    pZ(3) = 2/36;    pZ(4) = 3/36;
pZ(5) = 4/36;    pZ(6) = 5/36;    pZ(7) = 6/36;
pZ(8) = 5/36;    pZ(9) = 4/36;    pZ(10) = 3/36;
pZ(11) = 2/36;   pZ(12) = 1/36.
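These values can be confirmed by direct enumeration; the short sketch below tabulates the pmf of Z = X1 + X2 exactly (the fractions are automatically reduced, so 2/36 prints as 1/18, and so on).

```python
from collections import Counter
from fractions import Fraction

p = Fraction(1, 36)
pZ = Counter()
for x1 in range(1, 7):
    for x2 in range(1, 7):
        pZ[x1 + x2] += p     # add this outcome's mass to its sum

for z in range(2, 13):
    print(z, pZ[z])          # 2 -> 1/36, ..., 7 -> 1/6, ..., 12 -> 1/36
```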
Example: Let (X, Y ) be the two-dimensional random variable given in the example on page 164. Find the cumulative distribution function for the random variable
Z = X + Y.
Solution: The support for the random variable Z is the interval [0, 2], so

FZ(z) = 0   if z < 0
      = ?   if 0 ≤ z ≤ 2
      = 1   if z > 2
For the case 0 ≤ z ≤ 2, we wish to evaluate
FZ (z) = P (Z ≤ z) = P (X + Y ≤ z).
In other words, we are computing the probability mass assigned to the shaded set
in Figure 6.7 as z varies from 0 to 2.
In establishing limits of integration, we notice that there are two cases to worry
about as shown in Figure 6.8:
Case I (z ≤ 1):

FZ(z) = ∫_0^z ∫_0^{z−y} (x + y) dx dy = (1/3) z³.

Case II (z > 1):

FZ(z) = ∫_0^{z−1} ∫_0^1 (x + y) dx dy + ∫_{z−1}^1 ∫_0^{z−y} (x + y) dx dy = z² − (1/3) z³ − 1/3.
Figure 6.7: The event {Z ≤ z}.
Figure 6.8: Two cases to worry about for the example (Case I and Case II)
These two cases can be summarized as

FZ(z) = 0                     if z < 0
      = (1/3) z³              if 0 ≤ z ≤ 1
      = z² − (1/3) z³ − 1/3   if 1 < z ≤ 2
      = 1                     if z > 2

Notice that what we thought at first to be one case (0 ≤ z ≤ 2) had to be divided into two cases (0 ≤ z ≤ 1 and 1 < z ≤ 2).
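As a numerical aside, the piecewise formula can be spot-checked by approximating P(X + Y ≤ z) with a two-dimensional midpoint sum over the unit square; the sketch below compares the two at a few values of z.

```python
def F_Z_numeric(z, n=500):
    """Approximate P(X + Y <= z) for the density f(x, y) = x + y on the unit square."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            y = (j + 0.5) * h
            if x + y <= z:
                total += (x + y) * h * h
    return total

def F_Z_formula(z):
    """The piecewise CDF derived above."""
    if z < 0:
        return 0.0
    if z <= 1:
        return z**3 / 3
    if z <= 2:
        return z**2 - z**3 / 3 - 1/3
    return 1.0

for z in (0.5, 1.0, 1.5, 2.0):
    print(z, F_Z_numeric(z), F_Z_formula(z))  # the two columns should be close
```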
Independence of Random Variables
Definition 6.7. A sequence of n random variables X1, X2, ..., Xn is independent if and only if

FX1,X2,...,Xn(b1, b2, ..., bn) = FX1(b1) FX2(b2) ··· FXn(bn)

for all values b1, b2, ..., bn.
Definition 6.8. A sequence of n random variables X1 , X2 , . . . , Xn is a random
sample if and only if
1. X1 , X2 , . . . , Xn are independent, and
2. FXi (x) = F (x) for all x and for all i (i.e., each Xi has the same marginal
distribution, F (x)).
We say that a random sample is a vector of independent and identically distributed (i.i.d.) random variables.
Recall: An event A is independent of an event B if and only if
P (A ∩ B) = P (A)P (B).
Theorem 6.1. If X and Y are independent random variables then any event A
involving X alone is independent of any event B involving Y alone.
Testing for independence
Case I: Discrete
A discrete random variable X is independent of a discrete random variable Y if
and only if
pX,Y (x, y) = [pX (x)][pY (y)]
for all x and y .
Case II: Continuous
A continuous random variable X is independent of a continuous random variable
Y if and only if
fX,Y (x, y) = [fX (x)][fY (y)]
for all x and y .
Example: A company produces two types of compressors, grade A and grade B.
Let X denote the number of grade A compressors produced on a given day. Let
Y denote the number of grade B compressors produced on the same day. Suppose
that the joint probability mass function pX,Y (x, y) = P (X = x, Y = y) is given
by the following table:
    pX,Y(x, y) |  y = 0   y = 1
    x = 0      |  0.1     0.3
    x = 1      |  0.2     0.1
    x = 2      |  0.2     0.1

The random variables X and Y are not independent. Note that

pX,Y(0, 0) = 0.1 ≠ pX(0) pY(0) = (0.4)(0.5) = 0.2.
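The same comparison can be scripted; the sketch below computes both marginals from the table and tests whether pX,Y(x, y) = pX(x) pY(y) at every point.

```python
from collections import defaultdict

# Joint pmf of the compressor example, keyed by (x, y).
p = {(0, 0): 0.1, (1, 0): 0.2, (2, 0): 0.2,
     (0, 1): 0.3, (1, 1): 0.1, (2, 1): 0.1}

pX, pY = defaultdict(float), defaultdict(float)
for (x, y), mass in p.items():
    pX[x] += mass
    pY[y] += mass

independent = all(abs(p[(x, y)] - pX[x] * pY[y]) < 1e-12
                  for x in pX for y in pY)
print(independent)               # False
print(p[(0, 0)], pX[0] * pY[0])  # 0.1 versus approximately 0.2
```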
Example: Suppose an electronic circuit contains two transistors. Let X be the time to failure of transistor 1 and let Y be the time to failure of transistor 2, with joint probability density function

fX,Y(x, y) = 4e^{−2(x+y)}   if x ≥ 0, y ≥ 0
           = 0              otherwise
The marginal densities are
fX(x) = 2e^{−2x}   if x ≥ 0
      = 0          otherwise

fY(y) = 2e^{−2y}   if y ≥ 0
      = 0          otherwise
We must check that fX,Y(x, y) = fX(x) fY(y) holds for all values of (x, y).

For x ≥ 0 and y ≥ 0:   fX,Y(x, y) = 4e^{−2(x+y)} = (2e^{−2x})(2e^{−2y}) = fX(x) fY(y)
For x ≥ 0 and y < 0:   fX,Y(x, y) = 0 = (2e^{−2x})(0) = fX(x) fY(y)
For x < 0 and y ≥ 0:   fX,Y(x, y) = 0 = (0)(2e^{−2y}) = fX(x) fY(y)
For x < 0 and y < 0:   fX,Y(x, y) = 0 = (0)(0) = fX(x) fY(y)
So the random variables X and Y are independent.
Expectation and random vectors
Suppose we are given a random vector (X, Y ) and a function g(x, y). Can we find
E(g(X, Y ))?
Theorem 6.2.

E(g(X, Y)) = Σ_{all x} Σ_{all y} g(x, y) pX,Y(x, y)               if (X, Y) is discrete

E(g(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y(x, y) dy dx      if (X, Y) is continuous
Example: Suppose X and Y have joint probability density function
fX,Y(x, y) = x + y   if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
           = 0       otherwise
Let Z = XY. To find E(Z) = E(XY), use Theorem 6.2 to get

E(XY) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy fX,Y(x, y) dx dy
      = ∫_0^1 ∫_0^1 xy (x + y) dx dy
      = ∫_0^1 ∫_0^1 (x²y + xy²) dx dy
      = ∫_0^1 [ (1/3)x³y + (1/2)x²y² ]_{x=0}^{x=1} dy
      = ∫_0^1 ( (1/3)y + (1/2)y² ) dy
      = [ (1/6)y² + (1/6)y³ ]_{y=0}^{y=1} = 1/3
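As a numerical aside, a two-dimensional midpoint sum over the unit square gives the same value to good accuracy.

```python
def E_xy(n=800):
    """Approximate E(XY) = integral of x*y*(x + y) over the unit square."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            y = (j + 0.5) * h
            total += x * y * (x + y) * h * h
    return total

print(E_xy())   # approximately 1/3
```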
We will prove the following results for the case when (X, Y) is a continuous random vector. The proofs for the discrete case are similar, using summations rather than integrals and probability mass functions rather than probability density functions.
Theorem 6.3. E(X + Y ) = E(X) + E(Y )
Proof. Using Theorem 6.2:

E(X + Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x + y) fX,Y(x, y) dx dy
         = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x fX,Y(x, y) dx dy + ∫_{−∞}^{∞} ∫_{−∞}^{∞} y fX,Y(x, y) dx dy
         = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x fX,Y(x, y) dy dx + ∫_{−∞}^{∞} ∫_{−∞}^{∞} y fX,Y(x, y) dx dy
         = ∫_{−∞}^{∞} x [ ∫_{−∞}^{∞} fX,Y(x, y) dy ] dx + ∫_{−∞}^{∞} y [ ∫_{−∞}^{∞} fX,Y(x, y) dx ] dy
         = ∫_{−∞}^{∞} x fX(x) dx + ∫_{−∞}^{∞} y fY(y) dy
         = E(X) + E(Y)

Theorem 6.4. If X and Y are independent, then

E[h(X) g(Y)] = E[h(X)] E[g(Y)].
Proof. Using Theorem 6.2:

E[h(X) g(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x) g(y) fX,Y(x, y) dx dy

since X and Y are independent. . .

             = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x) g(y) fX(x) fY(y) dx dy
             = ∫_{−∞}^{∞} g(y) fY(y) [ ∫_{−∞}^{∞} h(x) fX(x) dx ] dy
             = ∫_{−∞}^{∞} g(y) fY(y) E(h(X)) dy

since E(h(X)) is a constant. . .

             = E(h(X)) ∫_{−∞}^{∞} g(y) fY(y) dy
             = E(h(X)) E(g(Y))
Corollary 6.5. If X and Y are independent, then E(XY ) = E(X)E(Y )
Proof. Using Theorem 6.4, set h(x) = x and g(y) = y to get
E[h(X)g(Y )] = E(h(X))E(g(Y ))
E[XY ] = E(X)E(Y )
Definition 6.9. The covariance of the random variables X and Y is
Cov(X, Y ) ≡ E[(X − E(X))(Y − E(Y ))].
Note that Cov(X, X) = Var(X).
Theorem 6.6. For any random variables X and Y
Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y )
Proof. Remember that the variance of a random variable W is defined as

Var(W) = E[(W − E(W))²].

Now let W = X + Y . . .

Var(X + Y) = E[(X + Y − E(X + Y))²]
           = E[(X + Y − E(X) − E(Y))²]
           = E[({X − E(X)} + {Y − E(Y)})²]

Now let a = {X − E(X)}, let b = {Y − E(Y)}, and expand (a + b)² to get. . .

           = E[{X − E(X)}² + {Y − E(Y)}² + 2{X − E(X)}{Y − E(Y)}]
           = E[{X − E(X)}²] + E[{Y − E(Y)}²] + 2E[{X − E(X)}{Y − E(Y)}]
           = Var(X) + Var(Y) + 2Cov(X, Y)
Theorem 6.7. Cov(X, Y ) = E(XY ) − E(X)E(Y )
Proof. Using Definition 6.9, we get

Cov(X, Y) = E[(X − E(X))(Y − E(Y))]
          = E[XY − XE(Y) − E(X)Y + E(X)E(Y)]
          = E[XY] + E[−XE(Y)] + E[−E(X)Y] + E[E(X)E(Y)]

Since E(X) and E(Y) are constants. . .

          = E[XY] − E(Y)E[X] − E(X)E[Y] + E(X)E(Y)
          = E[XY] − E(X)E(Y)
Corollary 6.8. If X and Y are independent then Cov(X, Y ) = 0.
Proof.
If X and Y are independent then, from Corollary 6.5, E(XY ) =
E(X)E(Y ). We can then use Theorem 6.7:
Cov(X, Y ) = E(XY ) − E(X)E(Y ) = 0
Corollary 6.9. If X and Y are independent then Var(X+Y ) = Var(X)+Var(Y ).
Proof. If X and Y are independent then, from Theorem 6.6 and Corollary 6.8:
Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y )
Var(X + Y ) = Var(X) + Var(Y ) + 2(0)
Var(X + Y ) = Var(X) + Var(Y )
Definition 6.10. The covariance of two random variables X and Y is given by

Cov(X, Y) ≡ E[(X − E(X))(Y − E(Y))].
Definition 6.11. The correlation coefficient for two random variables X and Y is given by

ρ(X, Y) ≡ Cov(X, Y) / √(Var(X) Var(Y)).
Theorems 6.10, 6.11 and 6.12 are stated without proof. Proofs of these results may be found in the book by Meyer.²

Theorem 6.10. For any random variables X and Y,

|ρ(X, Y)| ≤ 1.

Theorem 6.11. Suppose that |ρ(X, Y)| = 1. Then (with probability one), Y = aX + b for some constants a and b. In other words: if the correlation coefficient ρ is ±1, then Y is a linear function of X (with probability one).

The converse of this theorem is also true:

Theorem 6.12. Suppose that X and Y are two random variables such that Y = aX + b, where a and b are constants with a ≠ 0. Then |ρ(X, Y)| = 1. If a > 0, then ρ(X, Y) = +1. If a < 0, then ρ(X, Y) = −1.

² Meyer, P., Introductory Probability Theory and Statistical Applications, Addison-Wesley, Reading, MA, 1965.
Random vectors and conditional probability
Example: Consider the compressor problem again.
    pX,Y(x, y) |  y = 0   y = 1
    x = 0      |  0.1     0.3
    x = 1      |  0.2     0.1
    x = 2      |  0.2     0.1
Given that no grade B compressors were produced on a given day, what is the
probability that 2 grade A compressors were produced?
Solution:
P(X = 2 | Y = 0) = P(X = 2, Y = 0) / P(Y = 0) = 0.2 / 0.5 = 2/5
Given that 2 compressors were produced on a given day, what is the probability
that one of them is a grade B compressor?
Solution:
P(Y = 1 | X + Y = 2) = P(Y = 1, X + Y = 2) / P(X + Y = 2)
                     = P(X = 1, Y = 1) / P(X + Y = 2)
                     = 0.1 / 0.3 = 1/3
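Both conditional probabilities can be reproduced mechanically from the table; the sketch below does exactly that, using an illustrative helper prob() that sums the mass of any event.

```python
# Joint pmf of the compressor example, keyed by (x, y).
p = {(0, 0): 0.1, (1, 0): 0.2, (2, 0): 0.2,
     (0, 1): 0.3, (1, 1): 0.1, (2, 1): 0.1}

def prob(event):
    """Total mass of the outcomes (x, y) satisfying the given predicate."""
    return sum(mass for (x, y), mass in p.items() if event(x, y))

# P(X = 2 | Y = 0)
print(prob(lambda x, y: x == 2 and y == 0) / prob(lambda x, y: y == 0))          # approx 0.4
# P(Y = 1 | X + Y = 2)
print(prob(lambda x, y: y == 1 and x + y == 2) / prob(lambda x, y: x + y == 2))  # approx 1/3
```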
Example: And, again, consider the two transistors. . .
fX,Y(x, y) = 4e^{−2(x+y)}   if x ≥ 0, y ≥ 0
           = 0              otherwise
Given that the total life time for the two transistors is less than two hours, what is
the probability that the first transistor lasted more than one hour?
Solution:

P(X > 1 | X + Y ≤ 2) = P(X > 1, X + Y ≤ 2) / P(X + Y ≤ 2)

We then compute

P(X > 1, X + Y ≤ 2) = ∫_1^2 ∫_0^{2−x} 4e^{−2(x+y)} dy dx = e^{−2} − 3e^{−4}

and

P(X + Y ≤ 2) = ∫_0^2 ∫_0^{2−x} 4e^{−2(x+y)} dy dx = 1 − 5e^{−4}

to get

P(X > 1 | X + Y ≤ 2) = (e^{−2} − 3e^{−4}) / (1 − 5e^{−4}) ≈ 0.0885
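As a quick sanity check, X and Y can be simulated as independent exponential random variables with rate 2 (which is what the joint density above factors into) and the conditional probability estimated by Monte Carlo; the sketch below is one minimal way to do it.

```python
import math
import random

random.seed(0)
n = 200_000
hits_condition = 0   # samples with X + Y <= 2
hits_both = 0        # of those, samples that also have X > 1

for _ in range(n):
    x = random.expovariate(2.0)   # density 2*exp(-2x) for x >= 0
    y = random.expovariate(2.0)
    if x + y <= 2:
        hits_condition += 1
        if x > 1:
            hits_both += 1

estimate = hits_both / hits_condition
exact = (math.exp(-2) - 3 * math.exp(-4)) / (1 - 5 * math.exp(-4))
print(estimate, exact)   # both should be near 0.0885
```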
Conditional distributions
Case I: Discrete
Let X and Y be random variables with joint probability mass function pX,Y (x, y)
and let pY (y) be the marginal probability mass function for Y .
We define the conditional probability mass function of X given Y as
pX|Y(x | y) = pX,Y(x, y) / pY(y)
whenever pY (y) > 0.
Case II: Continuous
Let X and Y be random variables with joint probability density function fX,Y (x, y)
and let fY (y) be the marginal probability density function for Y .
We define the conditional probability density function of X given Y as
fX|Y(x | y) = fX,Y(x, y) / fY(y)

whenever fY(y) > 0.
Law of total probability
Case I: Discrete
pX(x) = Σ_y pX|Y(x | y) pY(y)

Case II: Continuous

fX(x) = ∫_{−∞}^{+∞} fX|Y(x | y) fY(y) dy
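For instance, the discrete version can be checked with the compressor table from earlier in this chapter: rebuild pX from the conditional pmf's and the marginal of Y, and compare with the direct marginal.

```python
from collections import defaultdict

# Joint pmf of the compressor example, keyed by (x, y).
p = {(0, 0): 0.1, (1, 0): 0.2, (2, 0): 0.2,
     (0, 1): 0.3, (1, 1): 0.1, (2, 1): 0.1}

pY = defaultdict(float)
for (x, y), mass in p.items():
    pY[y] += mass

# pX(x) = sum over y of pX|Y(x | y) * pY(y); the pY factors cancel,
# so this must match the direct marginal row sums.
pX = {x: sum((p[(x, y)] / pY[y]) * pY[y] for y in pY) for x in (0, 1, 2)}
print(pX)   # {0: 0.4, 1: 0.3, 2: 0.3} (up to rounding)
```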
Self-Test Exercises for Chapter 6
For each of the following multiple-choice questions, choose the best response
among those provided. Answers can be found in Appendix B.
S6.1 Let X1, X2, X3, X4 be independent and identically distributed random variables, each with P(Xi = 1) = 1/2 and P(Xi = 0) = 1/2. Let P(X1 + X2 + X3 + X4 = 3) = r; then the value of r is
(A) 1/16
(B) 1/4
(C) 1/2
(D) 1
(E) none of the above.
S6.2 Let (X, Y ) be a discrete random vector with joint probability mass function
given by
pX,Y (0, 0) = 1/4
pX,Y (0, 1) = 1/4
pX,Y (1, 1) = 1/2
Then P (Y = 1) equals
(A) 1/3
(B) 1/4
(C) 1/2