5  FUNCTIONS OF RANDOM VARIABLES
Basic Concepts
Two of the primary reasons for considering random variables are:
1. They are convenient, and often obvious, representations of experimental outcomes.
2. They allow us to restrict our attention to sample spaces that are subsets of the
real line.
There is a third, and probably more important reason for using random variables:
3. They permit algebraic manipulation of event definitions.
This chapter deals with this third “feature” of random variables. As an example,
let X be the random variable representing the Buffalo, New York temperature
at 4:00 pm next Monday in degrees Fahrenheit. This random variable might be
suitable for the use of Buffalonians, but our neighbors in Fort Erie, Ontario prefer
the random variable Y representing the same Buffalo temperature, except in degrees
Celsius. Furthermore, they are interested in the event
A = {ω ∈ Ω : Y (ω) ≤ 20}
where Ω is the sample space of the underlying experiment. Note that Ω is being
mapped by both of the random variables X and Y into the real line.
In order to compute P (A), we could evaluate the cumulative distribution function
for Y , FY (y), at the point y = 20. But suppose we only know the cumulative
distribution function for X , namely FX (x). Can we still find P (A)?
The answer is “yes.” First, notice that for any ω ∈ Ω,
(1)   Y (ω) = (5/9)(X(ω) − 32).
We can then redefine the event A in terms of X as follows:
A = {ω ∈ Ω : Y (ω) ≤ 20}
  = {ω ∈ Ω : (5/9)(X(ω) − 32) ≤ 20}
  = {ω ∈ Ω : X(ω) − 32 ≤ 36}
  = {ω ∈ Ω : X(ω) ≤ 68}.
So, knowing a relationship like Equation 1 between X(ω) and Y (ω) for every
ω ∈ Ω, allows a representation of the event A in terms of either the random
variable X or the random variable Y .
When finding the probability of the event A, we can use the shortcut notation
presented in Chapter 5:
P (A) = P (Y ≤ 20)
      = P ((5/9)(X − 32) ≤ 20)
      = P (X − 32 ≤ 36)
      = P (X ≤ 68)
      = FX (68).
So, if we know the probability distribution for X , we can compute probabilities of
events expressed in terms of the random variable Y .
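The algebra above amounts to inverting h to find the equivalent event threshold for X. A minimal sketch in Python (the helper names are ours, chosen for illustration):

```python
def f_to_c(x):
    """The function h(x) = (5/9)(x - 32), converting Fahrenheit to Celsius."""
    return (5.0 / 9.0) * (x - 32.0)

def equivalent_threshold(c):
    """Solve (5/9)(x - 32) <= c for x, so that {Y <= c} = {X <= threshold}."""
    return (9.0 / 5.0) * c + 32.0

t = equivalent_threshold(20.0)
print(t)   # 68.0, i.e. {Y <= 20} = {X <= 68}, so P(Y <= 20) = F_X(68)
```

The same inversion works for any strictly increasing h, which is the idea developed in the rest of the chapter.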
The general concept surrounding the above example is the notion of a function of
a random variable. To show you what we mean by this, let the function h(x) be
defined by
h(x) = (5/9)(x − 32),
which is nothing more than the equation of a straight line. If X is a random variable,
we also have
Y = h(X)
as a random variable.¹ The function h(x) merely maps points in the real line (the
sample space for X ) into points in the real line (the sample space for Y ) as in
Figure 5.1.
The random variable X serves as a “middleman” in the mapping shown in Figure 5.1. It may seem, at first, more obvious to define Y as mapping directly from
¹ Note that we should have really written this as Y (ω) = h(X(ω)). In order to simplify the notation, mathematicians usually will drop the (ω).
Figure 5.1: The mapping Y (ω) = h(X(ω)) [diagram: Ω is mapped by X into the real line, and that image is mapped by h into a second real line for Y ; for example, 212 maps to 100]
Ω to the real numbers. But consider the case where we have already defined X
and then are given a functional relationship between X and Y . Under those circumstances, it would be very handy to have a tool that would enable us to compute
probabilities involving Y using the facts we already know about X .
Referring to Figure 5.2, you can see that what is happening is really quite simple.
Suppose we are given a cumulative distribution function for the random variable
X , the Buffalo temperature in degrees Fahrenheit. Using this, we can compute
P (X ≤ t) for any value of t. This gives us the cumulative distribution function
FX (t) and allows us to compute the probability of many other events.
Returning to our example, the folks in Fort Erie are asking us to compute P (Y ≤
20). To oblige, we ask ourselves, “What are all of the possible outcomes of
the experiment (i.e., measuring Monday’s 4 pm temperature) that would result in
announcing that the temperature in degrees Celsius is 20 or less?” Call this event
A and compute the probability of A from the distribution of the random variable
X.
An even more general situation appears in Figure 5.3. In this case, suppose we
are given a random variable X . Also assume that we are given the cumulative
distribution function for X so that we are capable of computing the probability
measure of any subset of the sample space for X . If we then define the random
Figure 5.2: Representing the event {Y ≤ 20} [diagram: the line y = h(x), with y = 20 corresponding to x = 68; the event A is the set of x values at or below 68]
Figure 5.3: A more complicated example [diagram: a curve y = h(x), with a set B on the y-axis and its preimage h−1 (B) on the x-axis]
variable Y = h(X) (where h is the function shown in Figure 5.3) we can compute
the probability measure of any subset B in the sample space for Y by determining
which outcomes of X will be mapped by h into B . The probability of any of these
particular outcomes of X is, therefore, also the probability of observing Y taking
on a value in B .
Some examples
Discrete case
Example: Let X be a discrete random variable with probability mass function
pX (−2) = pX (−1) = pX (0) = pX (1) = pX (2) = 1/5
Let Y = X 2 . We would like to find the probability distribution for Y .
Solution: There is a pleasant fact that helps when working with discrete random
variables:
Theorem 5.1. A function of any discrete random variable is also a discrete
random variable.
Unfortunately, we cannot make such a statement for continuous random variables,
as we will see in a later example.
To solve the problem, we should first ask ourselves, “Since X can take on the
values −2, −1, 0, 1 or 2, what values can Y assume?” We see that the function
h(x) = x2 maps the point x = −2 into the point 4, x = −1 into 1, x = 0 into
0, x = 1 into 1 and x = 2 into 4. So all of the probability mass for the random
variable Y is assigned to the set {0, 1, 4}. Such a set is called the support for the
random variable Y .
Definition 5.1. Let A be the smallest subset of R such that P (Y ∈ A) = 1. Then
A is called the support of the distribution for the random variable Y .
Since X will take on the value −2 with probability 1/5 and the value 2 with probability
1/5, and since each of these mutually exclusive events will result in Y taking on the
value 4, we can say that
pY (4) = pX (−2) + pX (2) = 2/5.
Similarly, pY (1) = 2/5 and pY (0) = 1/5. Hence the probability distribution for Y is
given by the probability mass function:
pY (0) = 1/5
pY (1) = 2/5
pY (4) = 2/5
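The bookkeeping in this example, summing pX over all x that map to the same y, can be sketched in a few lines of Python (the helper name pmf_of_function is ours, not standard):

```python
from collections import defaultdict
from fractions import Fraction

def pmf_of_function(p_x, h):
    """p_Y(y) = sum of p_X(x) over all x with h(x) = y."""
    p_y = defaultdict(Fraction)
    for x, p in p_x.items():
        p_y[h(x)] += p
    return dict(p_y)

p_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}
p_y = pmf_of_function(p_x, lambda x: x * x)
print(p_y)   # p_Y(0) = 1/5, p_Y(1) = 2/5, p_Y(4) = 2/5
```

Exact fractions are used so that the computed mass function matches the hand calculation with no rounding.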
Continuous case
Example: Let X be a uniformly distributed random variable on the interval (−1, 1),
i.e., the density for X is given by
fX (x) =  1/2   if −1 ≤ x ≤ 1
          0     otherwise
and the cumulative distribution function is
          0            if x < −1
FX (x) =  (x + 1)/2    if −1 ≤ x ≤ 1
          1            if x > 1.
Find the distribution for the random variable Y = X 2 .
Solution: For continuous random variables, it is easier to deal with the cumulative
distribution functions. Let’s try to find FY (y). First, we should ask ourselves,
“Since X can take on only values from −1 to +1, what values can Y take on?”
Since Y = X 2 , the answer is “Y can assume values from 0 to 1.” With no elaborate
computation, we now know that
          0   if y < 0
FY (y) =  ?   if 0 ≤ y ≤ 1
          1   if y > 1.
Now suppose that y is equal to a value in the only remaining unknown case, that is
0 ≤ y ≤ 1. We then have
FY (y) = P (Y ≤ y)
       = P (X² ≤ y)
       = P (−√y ≤ X ≤ √y)
       = FX (√y) − FX (−√y)
       = (√y + 1)/2 − (−√y + 1)/2
       = √y.
So,
          0    if y < 0
FY (y) =  √y   if 0 ≤ y ≤ 1
          1    if y > 1.
Finding the probability density function for Y is now easy. That task is left to you.
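As a sanity check on FY (y) = √y, one can simulate X uniform on (−1, 1) and compare the empirical distribution of Y = X² with the formula. A rough sketch:

```python
import math
import random

def F_Y(y):
    """CDF of Y = X**2 for X ~ Uniform(-1, 1), as derived above."""
    if y < 0.0:
        return 0.0
    if y > 1.0:
        return 1.0
    return math.sqrt(y)

random.seed(0)
n = 200_000
samples = [random.uniform(-1.0, 1.0) ** 2 for _ in range(n)]
for y in (0.04, 0.25, 0.81):
    empirical = sum(s <= y for s in samples) / n
    # the empirical fraction should be close to sqrt(y)
    assert abs(empirical - F_Y(y)) < 0.01
```

The tolerance 0.01 is generous: with 200,000 samples the sampling error of an empirical CDF value is on the order of 0.001.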
Example: Let X be a uniformly distributed random variable on the interval (0, 2),
i.e., the density for X is given by
fX (x) =  1/2   if 0 ≤ x ≤ 2
          0     otherwise
and the cumulative distribution function is
          0     if x < 0
FX (x) =  x/2   if 0 ≤ x ≤ 2
          1     if x > 2.
Let h(x) be the greatest integer less than or equal to x, i.e., let h be given by the
step function shown in Figure 5.4. Find the distribution for Y = h(X).
Solution: Since X takes on the values ranging continuously from 0 to 2, Y can take
on only the integers 0, 1 and 2 (why 2?). We immediately know that the random
variable Y is discrete, and we have our first example of a continuous random
variable being mapped into a discrete random variable. Look at Figure 5.4 again
and try to visualize why this happened.
To find the probability mass function for Y , first let’s try to find pY (0). We know
that Y takes on the value 0 when X takes on any value from 0 to 1 (but not including
1). Hence,
pY (0) = ∫_0^1 (1/2) dx = [x/2]_0^1 = 1/2.
Figure 5.4: The greatest integer function [step function: h(x) = k for k ≤ x < k + 1]
Similarly,
pY (1) = ∫_1^2 (1/2) dx = 1/2
and
pY (2) = ∫_2^2 (1/2) dx = 0,
which is a good thing because we ran out of probability mass after the first two computations. So we have the answer,
pY (0) = pY (1) = 1/2.
Invertible Functions of Continuous Random Variables
First we need a few concepts about inverse functions from calculus:
Definition 5.2. A function h : R → R is called one-to-one, if h(a) = h(b)
implies a = b. A function h : R → R is called onto, if for every y ∈ R there exists
some x ∈ R such that y = h(x).
If a function h(x) is both one-to-one and onto, then we can define the inverse
function h−1 : R → R such that h−1 (y) = x if and only if h(x) = y .
Suppose h(x) has an inverse and is increasing (or decreasing). Then, if X is a continuous random variable, and Y = h(X), we can develop a general relationship between the probability density function for X and the probability density function for Y .
If h(x) is a strictly increasing function and h(x) has an inverse, then h−1 (x) is
nondecreasing. Therefore, for any a ≤ b, we have h−1 (a) ≤ h−1 (b). This
produces the following relationship between the cumulative distribution function
for X and the cumulative distribution function for Y :
FY (y) = P [Y ≤ y] = P [h(X) ≤ y]
       = P [h−1 (h(X)) ≤ h−1 (y)]
       = P [X ≤ h−1 (y)]
       = FX (h−1 (y))
To get the relationship between the densities, take the first derivative of both sides
with respect to y and use the chain rule:
(d/dy) FY (y) = (d/dy) FX (h−1 (y))
fY (y) = fX (h−1 (y)) (d/dy) h−1 (y)
We have just proven the following theorem:
Theorem 5.2. Suppose X is a continuous random variable with probability
density function fX (x). If h(x) is a strictly increasing function with a differentiable inverse h−1 (·), then the random variable Y = h(X) has probability density
function
fY (y) = fX (h−1 (y)) (d/dy) h−1 (y)
We will leave it to you to find a similar result for the case when h(x) is decreasing
rather than increasing.
Note: This transformation technique is actually a special case of a more general technique from calculus. If one considers the function Y = h(X) as a transformation (x) → (y), the monotonicity property of h allows us to uniquely solve y = h(x) for x. That solution is x = h−1 (y). We can then express the probability
density function of Y as
fY (y) = fX (h−1 (y))J(y)
where J(y) is the following 1 × 1 determinant:
J(y) = |∂x/∂y| = |(∂/∂y) h−1 (y)| = (∂/∂y) h−1 (y)
since h−1 (y) is also an increasing function of y and, thus, has a nonnegative first
derivative with respect to y .
In calculus, when solving transformation problems with n variables, the resulting
n × n determinant is called the Jacobian of the transformation.
Example: Let X be a continuous random variable with density for X given by
fX (x) =  |x|/4   if −2 ≤ x ≤ 2
          0       otherwise
Let Y = h(X) = X 3 . Find the probability density function for Y without explicitly
finding the cumulative distribution function for Y .
Solution: The support for X is [−2, 2]. So the support for Y is [−8, 8].
Since h−1 (y) = y^(1/3), we can use Theorem 5.2 to get

fY (y) = fX (h−1 (y)) (d/dy) h−1 (y)
       = fX (y^(1/3)) (d/dy) y^(1/3)
       = fX (y^(1/3)) (1/3) y^(−2/3)

When y < −8 or y > 8 we have y^(1/3) < −2 or y^(1/3) > 2. In those cases, fX (y^(1/3)) = 0, hence fY (y) is zero.
When −8 ≤ y ≤ 8,

fY (y) = (|y^(1/3)|/4)(1/3) y^(−2/3)
       = |y^(1/3)| / (12 y^(2/3))
       = |y^(−1/3)|/12

Summarizing,

fY (y) =  |y^(−1/3)|/12   if −8 ≤ y ≤ 8
          0               otherwise
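Since the density |y^(−1/3)|/12 is unbounded near y = 0, it is worth confirming numerically that it still integrates to one over [−8, 8]. A midpoint-rule sketch (the grid is chosen so no evaluation point lands exactly on the singularity at 0):

```python
def f_Y(y):
    """Density of Y = X**3 from the example; zero outside [-8, 8]."""
    if y == 0.0 or not (-8.0 <= y <= 8.0):
        return 0.0   # the value at the single point 0 does not affect integrals
    return abs(y) ** (-1.0 / 3.0) / 12.0

n = 200_000
h = 16.0 / n
# midpoints -8 + (k + 0.5) h never equal 0 for integer k
total = sum(f_Y(-8.0 + (k + 0.5) * h) for k in range(n)) * h
print(total)   # close to 1.0
assert abs(total - 1.0) < 1e-3
```

The singularity is integrable (it behaves like |y|^(−1/3)), which is why the crude midpoint rule still converges to 1.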
Expected values for functions of random variables
Theorem 5.3. Let Y = h(X).
If X is a discrete random variable with probability mass function pX (x), then
E(Y ) = E(h(X)) = Σ_{all x} h(x) pX (x)
If X is a continuous random variable with probability density function fX (x), then
E(Y ) = E(h(X)) = ∫_{−∞}^{+∞} h(x) fX (x) dx
Question: Toss a coin. Let X = 1 if heads, X = 0 if tails. Let h(X) denote your
winnings.
h(x) =  −5   if x = 0
        10   if x = 1
Find E(h(X)), your expected winnings.
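For this question, Theorem 5.3 reduces to a two-term sum. A minimal sketch, assuming a fair coin (the problem does not state the bias, so P(heads) = 1/2 is our assumption):

```python
def expected_value(pmf, h):
    """E[h(X)] = sum of h(x) * p_X(x) over the support (Theorem 5.3, discrete case)."""
    return sum(h(x) * p for x, p in pmf.items())

coin = {0: 0.5, 1: 0.5}                    # assumed fair coin: 0 = tails, 1 = heads
winnings = lambda x: -5.0 if x == 0 else 10.0
print(expected_value(coin, winnings))      # 2.5
```

Note that the distribution of Y = h(X) itself was never needed; the sum runs over the values of X.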
Question: Let X , the life length of a transistor, be a continuous random variable
with probability density function
fX (x) =  2e^(−2x)   if x ≥ 0
          0          otherwise
Suppose you earn ln(x) + x² − 3 dollars if the transistor lasts exactly x hours. What
are your expected earnings?
Theorem 5.4. E(a) = a if a is a constant.
Proof. We will provide proofs for both the discrete and continuous cases.
We are really finding E(h(X)) where h(x) = a for all x.
Suppose X is a discrete random variable with probability mass function
pX (xk ) = P (X = xk ) = pk
for k = 1, 2, . . ..
Then
E(h(X)) = Σ_k h(xk ) pk
E(a) = Σ_k a pk = a Σ_k pk = a(1) = a
Suppose X is a continuous random variable with probability density function
fX (x). Then
E(h(X)) = ∫_{−∞}^{∞} h(x) fX (x) dx
E(a) = ∫_{−∞}^{∞} a fX (x) dx = a ∫_{−∞}^{∞} fX (x) dx = a(1) = a
Theorem 5.5. E(aX) = aE(X) if a is a constant.
Proof. We are finding E(h(X)) where h(x) = ax.
Suppose X is a discrete random variable with probability mass function
pX (xk ) = P (X = xk ) = pk
for k = 1, 2, . . ..
Then
E(h(X)) = Σ_k h(xk ) pk
E(aX) = Σ_k a xk pk = a Σ_k xk pk = a E(X)
Suppose X is a continuous random variable with probability density function
fX (x). Then
E(h(X)) = ∫_{−∞}^{∞} h(x) fX (x) dx
E(aX) = ∫_{−∞}^{∞} a x fX (x) dx = a ∫_{−∞}^{∞} x fX (x) dx = a E(X)
Theorem 5.6. E(X + b) = E(X) + b if b is a constant.
Proof. We are finding E(h(X)) where h(x) = x + b.
Suppose X is a discrete random variable with probability mass function
pX (xk ) = P (X = xk ) = pk
for k = 1, 2, . . ..
Then
E(h(X)) = Σ_k h(xk ) pk
E(X + b) = Σ_k (xk + b) pk = Σ_k (xk pk + b pk)
         = Σ_k xk pk + Σ_k b pk
         = Σ_k xk pk + b Σ_k pk = E(X) + b(1)
Suppose X is a continuous random variable with probability density function
fX (x). Then
E(h(X)) = ∫_{−∞}^{∞} h(x) fX (x) dx
E(X + b) = ∫_{−∞}^{∞} (x + b) fX (x) dx = ∫_{−∞}^{∞} (x fX (x) + b fX (x)) dx
         = ∫_{−∞}^{∞} x fX (x) dx + ∫_{−∞}^{∞} b fX (x) dx = E(X) + b(1)
Theorem 5.7. E[h1 (X) + h2 (X)] = E[h1 (X)] + E[h2 (X)] for any functions h1 (X) and h2 (X).
Proof. We are finding E(h(X)) where h(x) = h1 (x) + h2 (x).
Suppose X is a discrete random variable with probability mass function
pX (xk ) = P (X = xk ) = pk
for k = 1, 2, . . ..
Then
E(h(X)) = Σ_k h(xk ) pk
E[h1 (X) + h2 (X)] = Σ_k [h1 (xk ) + h2 (xk )] pk
                   = Σ_k [h1 (xk ) pk + h2 (xk ) pk]
                   = Σ_k h1 (xk ) pk + Σ_k h2 (xk ) pk
                   = E[h1 (X)] + E[h2 (X)]
Suppose X is a continuous random variable with probability density function
fX (x). Then
E(h(X)) = ∫_{−∞}^{∞} h(x) fX (x) dx
E[h1 (X) + h2 (X)] = ∫_{−∞}^{∞} (h1 (x) + h2 (x)) fX (x) dx
                   = ∫_{−∞}^{∞} (h1 (x) fX (x) + h2 (x) fX (x)) dx
                   = ∫_{−∞}^{∞} h1 (x) fX (x) dx + ∫_{−∞}^{∞} h2 (x) fX (x) dx
                   = E[h1 (X)] + E[h2 (X)]
Notice in all of the above proofs the close relationship between the discrete and continuous cases:
Σ_k   and   ∫_{−∞}^{∞}
P (X = xk ) = pX (xk ) = pk   and   P (x ≤ X ≤ x + dx) = fX (x) dx
Also notice that Theorem 5.6 is really only a corollary to Theorem 5.7. Simply use
Theorem 5.7 with h1 (x) = x and h2 (x) = b and you get Theorem 5.6.
The kth Moment of X
If we let h(x) = xk and consider E[h(X)], we get the following definition:
Definition 5.3. The quantity E(X k ) is defined as the kth moment about the
origin of the random variable X .
Variance
Recall that E(X) is simply a number. If we let h(x) = (x − E(X))2 and compute
E(h(X)) we get a very useful result:
Definition 5.4. The variance of a random variable X is
Var(X) ≡ E[(X − E(X))²].
Note that
Var(X) = Σ_{all x} (x − E(X))² pX (x)
for a discrete random variable X , and
Var(X) = ∫_{−∞}^{+∞} (x − E(X))² fX (x) dx
for a continuous random variable X .
Note: The value of Var(X) is the second moment of the probability distribution
about the expected value of X . This can be interpreted as the moment of inertia of
the probability mass distribution for X .
Definition 5.5. The standard deviation of a random variable X is given by
σX ≡ √Var(X).
Notation: We often write
E(X) = µX
Var(X) = σX²
Theorem 5.8. Var(X) = E(X 2 ) − [E(X)]2
Proof. Using the properties of expected value, we get
Var(X) = E[(X − E(X))²]
       = E[X² − 2XE(X) + [E(X)]²]           (expand the square)
       = E[X²] + E[−2XE(X)] + E[[E(X)]²]    (from Theorem 5.7)
       = E[X²] − 2E(X)E[X] + E[[E(X)]²]     (from Theorem 5.5)
       = E[X²] − 2E(X)E[X] + [E(X)]²        (from Theorem 5.4)
       = E[X²] − [E[X]]²
Theorem 5.9. Var(aX) = a2 Var(X)
Proof. Using Theorem 5.8, we get
Var(aX) = E[(aX)²] − [E(aX)]²
        = E[a²X²] − [E(aX)]²
        = a²E[X²] − [aE(X)]²      (using Theorem 5.5)
        = a²E[X²] − a²[E(X)]²
        = a²(E[X²] − [E(X)]²)
        = a²Var(X)                (using Theorem 5.8)
Theorem 5.10. Var(X + b) = Var(X)
Proof. Using Definition 5.4, we get
Var(X + b) = E[((X + b) − E[X + b])2 ]
= E[(X + b − E[X] − b)2 ]
= E[(X − E[X])2 ]
= Var(X)
The proof of the following result is left to the reader:
Theorem 5.11. Let Y = h(X). Then
Var(Y ) = E[(h(X) − E(h(X)))²]
Example: Suppose X is a continuous random variable with probability density
function
fX (x) =  3x²   if 0 ≤ x ≤ 1
          0     otherwise
Let Y = h(X) = 1/X . Then

E(Y ) = ∫_0^1 h(x)(3x²) dx
      = ∫_0^1 (1/x)(3x²) dx
      = (3/2) x² |_0^1 = 3/2

Then we can use E(Y ) = E(h(X)) = 3/2 to compute

Var(Y ) = ∫_0^1 (h(x) − E(h(X)))² (3x²) dx
        = ∫_0^1 (1/x − 3/2)² (3x²) dx
        = ∫_0^1 (1/x² − 3/x + 9/4)(3x²) dx
        = ∫_0^1 3 dx − ∫_0^1 9x dx + ∫_0^1 (27/4) x² dx = 3/4
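The two integrals above are easy to verify with a crude midpoint rule (a numerical sketch; the integrands stay bounded on (0, 1) even though h(x) = 1/x blows up, because the density 3x² vanishes at 0):

```python
def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    step = (b - a) / n
    return sum(f(a + (k + 0.5) * step) for k in range(n)) * step

density = lambda x: 3.0 * x * x      # f_X on [0, 1]
h = lambda x: 1.0 / x                # Y = 1/X

ey = integrate(lambda x: h(x) * density(x), 0.0, 1.0)
var = integrate(lambda x: (h(x) - 1.5) ** 2 * density(x), 0.0, 1.0)
print(ey, var)   # approximately 1.5 and 0.75
```

Both integrands simplify to low-degree polynomials, so the midpoint rule is essentially exact here.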
Approximating E(h(X)) and Var(h(X))
This section is based on material in a text by P. L. Meyer.²
We have already shown that, given a random variable X and function Y = h(X),
we can compute E(Y ) = E(h(X)) directly using only the probability distribution
for X . This avoids the need for explicitly finding the probability distribution for
the random variable Y = h(X) if we are only computing expected values.
However, if the function h(·) is very complicated, then even using these short-cut methods for computing E(Y ) and Var(Y ) may result in difficult integrals. Often, we can get good results using the following approximations:
Theorem 5.12. Let X be a random variable with E(X) = µ and Var(X) = σ 2 .
Suppose Y = h(X). Then
(2)   E(Y ) ≈ h(µ) + (h″(µ)/2) σ²
(3)   Var(Y ) ≈ [h′(µ)]² σ²

² Meyer, P., Introductory probability theory and statistical applications, Addison-Wesley, Reading, MA, 1965.
Proof. (This is only a sketch of the proof.)
Expand the function h in a Taylor series about x = µ to two terms. We obtain,
Y = h(µ) + (X − µ) h′(µ) + ((X − µ)²/2) h″(µ) + R1
where R1 is the remainder of the expansion. Discard the remainder term R1 and
take the expected value of both sides to get Equation (2).
For the second approximation, expand h in a Taylor series about x = µ to only one
term to get
Y = h(µ) + (X − µ) h′(µ) + R2
where R2 is the remainder. Discard the term R2 and take the variance of both sides
to get Equation (3).
A common mistake when using expected values is to assume that E(h(X)) =
h(E(X)). This is not generally true.
For example, suppose we are selling a product, and the random variable X represents the monthly demand (in terms of the number of items sold). Then E(X) = µ
represents the expected number of items sold in a month.
Now suppose, Y = h(X) is the actual profit from selling X items in a month.
Often, the number h(µ) is incorrectly interpreted as the expected profit. That is,
the expected sales figure is “plugged into” the profit function, believing the result
is the expected profit.
In fact, the expected profit is actually E(Y ) = E(h(X)). Unless h is linear, this
will not be the same as h(µ) = h(E(X)).
Equation (2) in Theorem 5.12 provides an approximation for the error if you
incorrectly use h(µ) rather than E(h(X)).
E(Y ) ≈ h(µ) + (h″(µ)/2) σ²
h(µ) − E(Y ) ≈ −(h″(µ)/2) σ²
Therefore, if you incorrectly state that h(µ) is your expected monthly profit, your
answer will be wrong by approximately
∆ = −(h″(µ)/2) σ²
Note that if h is linear, h00 (x) = 0, and ∆ = 0. This result for the linear case is also
provided by Theorems 5.5, 5.6 and 5.7.
Example: (Meyer) Under certain conditions, the surface tension of a liquid (dyn/cm) is given by the formula S = 2(1 − 0.005T )^1.2, where T is the temperature of the liquid (degrees centigrade).
Suppose T is a continuous random variable with probability density function

fT (t) =  3000 t^(−4)   if t ≥ 10
          0             otherwise
Therefore,

E(T ) = ∫_10^∞ 3000 t^(−3) dt = 15   (degrees centigrade)

Var(T ) = E(T²) − (15)² = ∫_10^∞ 3000 t^(−2) dt − 225 = 75   (degrees centigrade)²
In order to use Equations (2) and (3), we will use
h(t) = 2(1 − 0.005t)^1.2
h′(t) = −0.012(1 − 0.005t)^0.2
h″(t) = 0.000012(1 − 0.005t)^(−0.8)
Since µ = 15, we have
h(15) = 1.82
h′(15) ≈ −0.012
h″(15) = 0.000012(1 − 0.005(15))^(−0.8) ≈ 0+
and, using Theorem 5.12, we obtain
E(S) ≈ h(15) + (75/2) h″(15) ≈ 1.82   (dyn/cm)
Var(S) ≈ 75 [h′(15)]² ≈ 0.011   (dyn/cm)²
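The approximations can be cross-checked numerically (a sketch; approx_moments and the finite-difference derivatives are ours, not from the text). Note that with h′(15) ≈ −0.012, Equation (3) gives Var(S) ≈ 75(0.012)² ≈ 0.011:

```python
def approx_moments(h, mu, var, eps=1e-4):
    """Theorem 5.12 with central-difference derivatives:
    E(Y) ~ h(mu) + h''(mu)/2 * var,  Var(Y) ~ [h'(mu)]**2 * var."""
    d1 = (h(mu + eps) - h(mu - eps)) / (2.0 * eps)
    d2 = (h(mu + eps) - 2.0 * h(mu) + h(mu - eps)) / (eps * eps)
    return h(mu) + 0.5 * d2 * var, d1 * d1 * var

surface_tension = lambda t: 2.0 * (1.0 - 0.005 * t) ** 1.2
e_s, var_s = approx_moments(surface_tension, 15.0, 75.0)
print(e_s, var_s)   # about 1.82 and 0.010
```

Since h″(15) is tiny, the second-order correction barely moves E(S) away from h(µ); here the "plug in the mean" shortcut happens to be nearly harmless.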
Chebychev’s Inequality
Chebychev’s inequality, given in the following Theorem 5.13, is applicable for
any random variable X with finite variance. For a given Var(X), the inequality
establishes a limit on the amount of probability mass that can be found at distances
far away from E(X) (the boondocks³ of the probability distribution).
Theorem 5.13. Let X be a random variable with E(X) = µ and Var(X) = σ 2 .
Then for any a > 0,
P (|X − µ| ≥ a) ≤ σ²/a²
Proof. We will prove this result when X is a continuous random variable. (The
proof for X discrete is similar using summations rather than integrals, and fX (x) dx
replaced by the probability mass function for X .)
We have for any a > 0

σ² = ∫_{−∞}^{+∞} (x − µ)² fX (x) dx
   = ∫_{−∞}^{µ−a} (x − µ)² fX (x) dx + ∫_{µ−a}^{µ+a} (x − µ)² fX (x) dx + ∫_{µ+a}^{+∞} (x − µ)² fX (x) dx
   ≥ ∫_{−∞}^{µ−a} (x − µ)² fX (x) dx + ∫_{µ+a}^{+∞} (x − µ)² fX (x) dx

since
∫_{µ−a}^{µ+a} (x − µ)² fX (x) dx ≥ 0.
Note that (x − µ)2 ≥ a2 for every x ∈ (−∞, µ − a] ∪ [µ + a, +∞). Hence, for all
x over this region of integration, we have (x − µ)2 fX (x) ≥ a2 fX (x). Therefore,
∫_{−∞}^{µ−a} (x − µ)² fX (x) dx ≥ ∫_{−∞}^{µ−a} a² fX (x) dx   and
∫_{µ+a}^{+∞} (x − µ)² fX (x) dx ≥ ∫_{µ+a}^{+∞} a² fX (x) dx.
³ boon·docks (bo͞on′dŏks′) plural noun. Rural country; the backwoods. Example usage: the 1965 Billy Joe Royal song “Down in the Boondocks.” Source: The American Heritage® Dictionary of the English Language, Fourth Edition, Houghton Mifflin Company.
So, we get

σ² ≥ ∫_{−∞}^{µ−a} (x − µ)² fX (x) dx + ∫_{µ+a}^{+∞} (x − µ)² fX (x) dx
   ≥ ∫_{−∞}^{µ−a} a² fX (x) dx + ∫_{µ+a}^{+∞} a² fX (x) dx
   = a² P (X ≤ µ − a) + a² P (X ≥ µ + a)
   = a² P (|X − µ| ≥ a)
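Chebychev's bound is distribution-free, so any variable with known mean and variance can illustrate it. A simulation sketch using an exponential(1) sample, which has mean 1 and variance 1:

```python
import random

random.seed(2)
n = 100_000
xs = [random.expovariate(1.0) for _ in range(n)]   # E(X) = 1, Var(X) = 1
mu, var = 1.0, 1.0

results = []
for a in (1.5, 2.0, 3.0):
    tail = sum(abs(x - mu) >= a for x in xs) / n   # empirical P(|X - mu| >= a)
    bound = var / (a * a)                          # Chebychev's bound
    results.append((a, tail, bound))
    assert tail <= bound
```

For this distribution the actual tails are far below the bound, which is typical: Chebychev's inequality trades sharpness for complete generality.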
The Law of Large Numbers
This is an important application of Chebychev’s inequality.
Theorem 5.14. Let an experiment be conducted n times, with each outcome of
the experiment independent of all others. Let
Xn ≡ number of times the event A occurs
p ≡ P (A)
Then for any ε > 0,
lim_{n→∞} P ( |Xn /n − p| < ε ) = 1.
The above result is the weak law of large numbers. The theorem guarantees that,
as n gets large, the probability that the ratio Xn /n is observed to be “close” to p
approaches one.
There is a related result called the strong law of large numbers that makes a similar (but stronger) statement about the entire sequence of random variables
{X1 , (1/2)X2 , (1/3)X3 , (1/4)X4 , . . .}
The strong law states that, with probability one, this sequence converges to p. In this case, we often say that the sequence {(1/n)Xn } converges to p almost surely.
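The weak law is easy to watch in action: simulate n Bernoulli(p) trials and track Xn /n as n grows. A sketch with p = 0.3:

```python
import random

random.seed(3)
p = 0.3
ratios = {}
for n in (100, 10_000, 1_000_000):
    x_n = sum(random.random() < p for _ in range(n))   # X_n = number of occurrences of A
    ratios[n] = x_n / n
    print(n, ratios[n])
# the ratio X_n / n settles near p = 0.3 as n grows
```

By Chebychev's inequality applied to Xn /n, the fluctuation around p shrinks like 1/√n, which is exactly what the printed ratios show.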
The Normal Distribution
Definition 5.6. The random variable X is said to have a normal distribution if
it has probability density function
fX (x) = (1/(σ√(2π))) e^(−(1/2)((x−µ)/σ)²)   for −∞ < x < ∞
Note that X has two parameters, µ and σ .
Theorem 5.15. If X ∼ N (µ, σ²) then E(X) = µ and Var(X) = σ².
We often say that “X is distributed normally with mean µ and variance σ 2 ”
We write
X ∼ N (µ, σ 2 )
One of the amazing facts about the normal distribution is that the cumulative
distribution function
(4)   FX (a) = ∫_{−∞}^{a} (1/(σ√(2π))) e^(−(1/2)((x−µ)/σ)²) dx
does not exist in closed form. In other words, there exists no function in closed
form whose derivative is
fX (x) = (1/(σ√(2π))) e^(−(1/2)((x−µ)/σ)²)
The only way to compute probabilities for normally distributed random variables
is to evaluate the cumulative distribution function in Equation 4 using numerical
integration.
If X ∼ N (µ, σ 2 ), there is an important property of the normal distribution that
allows us to convert any probability statement, such as P (X ≤ a), to an equivalent
probability statement involving only a normal random variable with mean zero and
variance one. We can then refer to a table of N (0, 1) probabilities, and thus avoid
numerically integrating Equation 4 for every conceivable pair of values µ and σ .
Such a table is provided in Table I of Appendix A.
Definition 5.7. The standard normal random variable, Z , has probability
density function
φ(z) = (1/√(2π)) e^(−z²/2)
i.e., Z ∼ N (0, 1).
We denote the cumulative distribution function for Z by Φ(a) = P (Z ≤ a).
Theorem 5.16. If X ∼ N (µ, σ 2 ), then the random variable
Z = (X − µ)/σ ∼ N (0, 1).
Proof. Suppose X ∼ N (µ, σ 2 ). Therefore, the probability density function for X
is
fX (x) = (1/(σ√(2π))) e^(−(1/2)((x−µ)/σ)²)   for −∞ < x < ∞
Define the function
h(x) = (x − µ)/σ
Note that z = h(x) is an invertible, increasing function of x with
h−1 (z) = σz + µ
(d/dz) h−1 (z) = σ
Using Theorem 5.2, the probability density function for the random variable Z =
h(X) is
fZ (z) = fX (h−1 (z)) (d/dz) h−1 (z)
       = (1/(σ√(2π))) e^(−(1/2)(((σz+µ)−µ)/σ)²) · σ   for −∞ < σz + µ < ∞
       = (1/√(2π)) e^(−z²/2)   for −∞ < z < ∞
Note that we couldn’t use the cumulative distribution function of X to find the
cumulative distribution function of Z since neither cumulative distribution function
exists in closed form.
If X ∼ N (µ, σ²), Theorem 5.16 tells us three things about Z = (X − µ)/σ :
1. E(Z) = 0
2. Var(Z) = 1, and
3. Z has a normal distribution
Actually, the first two results would be true even if X did not have a normal
distribution:
Lemma 5.17. Suppose X is any random variable with finite mean E(X) = µ
and finite, nonzero variance Var(X) = σ 2 . Define the random variable
Z = (X − µ)/σ.
Then E(Z) = 0 and Var(Z) = 1.
Proof. We have

E(Z) = E[(X − µ)/σ]
     = E[(1/σ)(X − µ)]
     = (1/σ) E[X − µ]
     = (1/σ)(E[X] − µ)
     = (1/σ)(0) = 0

and

Var(Z) = Var[(X − µ)/σ]
       = Var[(1/σ)(X − µ)]
       = (1/σ²) Var[X − µ]
       = (1/σ²) Var[X] = 1
So, it is no surprise that, in Theorem 5.16, E(Z) = 0 and Var(Z) = 1. What
is important in Theorem 5.16 is that if X is normally distributed, then the linear
transformation of X , namely Z = (X − µ)/σ, is also normally distributed.
A distribution that has zero mean and unit variance is said to be in standard form. When we perform the above transformation (i.e., Z = (X − µ)/σ), we say we are standardizing the distribution of X .
Linear transformations rarely preserve the “shape” of the probability distribution.
For example, if X is exponentially distributed, Z would not be exponentially
distributed. If X was a Poisson random variable, Z would not be a Poisson random
variable. But, as provided by Theorem 5.16, if X is normally distributed, then Z
is also normally distributed. (See Question 5.6 to see that the uniform distribution
is preserved under linear transformations.)
Tools for using the standard normal tables
P (Z ≤ a) = Φ(a) is given in the tables for various values of a. It can be shown
that since Z is a continuous random variable and the probability density function
for Z is symmetric about zero, we have
Φ(a) = 1 − Φ(−a).
Example: Suppose X ∼ N (0, 4). Let’s find P (X ≤ 1).
P (X ≤ 1) = P (X − 0 ≤ 1 − 0)
          = P ((X − 0)/2 ≤ (1 − 0)/2)
Since X ∼ N (0, 4) we know from Theorem 5.16 that
Z = (X − 0)/2 ∼ N (0, 1)
so, using Table I in Appendix A, we have
P (X ≤ 1) = P (Z ≤ (1 − 0)/2) = P (Z ≤ 1/2) = 0.6915
This is the area shown in the following figure:
[Figure: the standard normal density φ(z), with the area to the left of z = 0.5 shaded.]
Example: Suppose X ∼ N (2, 9). Let’s find P (0 ≤ X ≤ 3).

P (0 ≤ X ≤ 3) = P (0 − 2 ≤ X − 2 ≤ 3 − 2)
             = P ((0 − 2)/3 ≤ (X − 2)/3 ≤ (3 − 2)/3)

Since X ∼ N (2, 9) we know that
Z = (X − 2)/3 ∼ N (0, 1)

Hence,
P (0 ≤ X ≤ 3) = P (−2/3 ≤ Z ≤ 1/3) = P (Z ≤ 1/3) − P (Z < −2/3)

Since the N (0, 1) distribution is symmetric about zero,
P (Z < −2/3) = P (Z > 2/3) = 1 − P (Z ≤ 2/3)

Finally, from Table I, we get
P (0 ≤ X ≤ 3) = P (Z ≤ 1/3) − (1 − P (Z ≤ 2/3)) = 0.63 − (1 − 0.75) = 0.38,
which is the area shown in the following figure:
[Figure: the standard normal density φ(z), with the area between z = −2/3 and z = 1/3 shaded.]
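Both table look-ups above can be reproduced in code, since Φ(z) = (1 + erf(z/√2))/2. A sketch using the standard library:

```python
import math

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# First example: X ~ N(0, 4), so P(X <= 1) = Phi(1/2).
print(round(Phi(0.5), 4))                              # 0.6915
# Second example: X ~ N(2, 9), so P(0 <= X <= 3) = Phi(1/3) - Phi(-2/3).
print(round(Phi(1.0 / 3.0) - Phi(-2.0 / 3.0), 2))      # 0.38
```

This is one practical substitute for Table I: since FX has no closed form, numerical routines such as erf stand in for the table.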
Self-Test Exercises for Chapter 5
For each of the following multiple-choice questions, choose the best response
among those provided. Answers can be found in Appendix B.
S5.1 Let X be a continuous random variable with probability density function

fX (x) =  1/2   if −1 ≤ x ≤ 1
          0     otherwise.

The value of P (X² ≥ 1/4) is
(A) 0
(B) 1/4
(C) 1/2
(D) 1
(E) none of the above.
S5.2 Let X be a random variable with E(X) = 0 and E(X²) = 2. Then the value
of E[X(1 − X)] is
(A) −2
(B) 0
(C) 1
(D) 2
(E) Var(X)