Expected Value and Variance
This handout reorganizes the material in §3.3 of the text. It also adds to it slightly (Theorem 1 on p.3) and
presents two results from chapter 5 of the text, because they fit in here and will simplify the presentation of some
of the material in chapters 3 and 4. In addition, the handout contains a motivating discussion that leads up to
the definition of expected value.
1  Expected Value.

1.1  Motivating Discussion and Definition.
Suppose that I play the following game with anyone who is interested in playing:
• I will charge a player an up-front fee F (to be determined).
• The player then flips a fair coin three times, and I pay him/her $1.00 for each head that (s)he gets and
$2.00 for each tail that (s)he gets.
Below, I show the random variable Y that logs my payout as a function defined on a sample space, as well as the
distribution of Y.
Sample Space:

    Sample point s:   HHH   THH   HTH   HHT   HTT   THT   TTH   TTT
    p(s):             1/8   1/8   1/8   1/8   1/8   1/8   1/8   1/8
    Y(s):              3     4     4     4     5     5     5     6

Distribution:

    Value y:       3     4     5     6
    P(Y = y):     1/8   3/8   3/8   1/8
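If you would like to check these tables by computer, here is a small Python sketch (mine, not part of the handout); it enumerates the sample space, assigns p(s) and Y(s), and recovers the distribution of Y by adding up p(s) over the sample points that share each payout value.

from itertools import product
from fractions import Fraction

# Sample space of three fair flips; each sample point has probability 1/8.
sample_space = ["".join(flips) for flips in product("HT", repeat=3)]
p = {s: Fraction(1, 8) for s in sample_space}

# Payout: $1.00 per head plus $2.00 per tail.
Y = {s: 1 * s.count("H") + 2 * s.count("T") for s in sample_space}

# Distribution of Y: P(Y = y) is the sum of p(s) over the s with Y(s) = y.
dist = {}
for s in sample_space:
    dist[Y[s]] = dist.get(Y[s], 0) + p[s]

print(dist)   # {3: Fraction(1, 8), 4: Fraction(3, 8), 5: Fraction(3, 8), 6: Fraction(1, 8)}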
The question I want to consider is this: what would be an equitable value for F, the up-front fee? If I overcharge,
I will be swindling people, but if I undercharge, I will eventually lose all my savings. In order to be fair, I should
take in from each player the amount that, on average, I must pay out to each player:
    F should be the average payout per game.                                                                        (1)
The number referred to in box (1) is the theoretical average of the random variable Y; it is called the expected
value of Y and is denoted "E[Y]" or "µ_Y."¹ In order to discover the correct formula for E[Y], let us see what an
estimate of it looks like. Suppose we estimate the average payout per game by repeating the game a huge number
of times (say 10^9 times, to be definite) and then calculating the arithmetic average of the payouts:

    arithmetic average = (y_1 + y_2 + · · · + y_{10^9}) / 10^9,                                                      (2)

where the numbers (y_1, y_2, . . .) are the amounts of the individual payouts.

¹ µ is the Greek lowercase letter mu; I am certain that it was chosen because m is the first letter in the word mean.
We now need to simplify (2). The first step is to group like terms together in the numerator, so that (2) is
replaced by (3):
    arithmetic average = [(3+3+· · ·+3) + (4+4+· · ·+4) + (5+5+· · ·+5) + (6+6+· · ·+6)] / 10^9                      (3)

                       = (3+3+· · ·+3)/10^9 + (4+4+· · ·+4)/10^9 + (5+5+· · ·+5)/10^9 + (6+6+· · ·+6)/10^9

                       = 3 · (# of 3's)/10^9 + 4 · (# of 4's)/10^9 + 5 · (# of 5's)/10^9 + 6 · (# of 6's)/10^9.      (4)
Now, if you look carefully at (4), you can discern the theoretical quantity that the arithmetic average is estimating.
Since 10^9 is such a big number, we can be sure that

    (# of 3's)/10^9 ≈ P(Y = 3),   (# of 4's)/10^9 ≈ P(Y = 4),   (# of 5's)/10^9 ≈ P(Y = 5),   and   (# of 6's)/10^9 ≈ P(Y = 6).

Making these approximate substitutions into the right side of (4) gives us the approximate equation

    arithmetic average ≈ 3P(Y = 3) + 4P(Y = 4) + 5P(Y = 5) + 6P(Y = 6).                                              (5)
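To see (5) emerge numerically, here is a short simulation sketch in Python (my own illustration, not part of the handout); it uses 10^6 plays rather than 10^9, which is already enough for the relative frequencies to settle down.

import random

random.seed(1)                      # fixed seed so the run is reproducible
N = 10**6                           # number of simulated games (stand-in for 10^9)
total_payout = 0
for _ in range(N):
    heads = sum(random.random() < 0.5 for _ in range(3))   # heads in three fair flips
    total_payout += 1 * heads + 2 * (3 - heads)            # $1 per head, $2 per tail

print(total_payout / N)   # close to 3*(1/8) + 4*(3/8) + 5*(3/8) + 6*(1/8) = 4.5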
We have identified the theoretical average value of this particular random variable: it is the right side of (5). The
corresponding formula works for any discrete random variable.²

Definition 1  Let Y be a discrete random variable with distribution given by P(Y = y_i) = p_i (i ≥ 1). The
expected value of Y (written E[Y] or µ_Y) is given by

    E[Y] = µ_Y := y_1 P(Y = y_1) + y_2 P(Y = y_2) + y_3 P(Y = y_3) + · · · = Σ_{i≥1} y_i P(Y = y_i).                 (6)

² When the random variable has infinitely many values, the sum in (6) can diverge.
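For the payout variable Y of the game on p.1, for instance, (6) gives

    E[Y] = 3 · (1/8) + 4 · (3/8) + 5 · (3/8) + 6 · (1/8) = 36/8 = 4.50,

so the equitable fee asked for in box (1) is F = $4.50.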
1.2  Basic Properties of Expected Value.

There are quite a few of these; together, they make expected value a very user-friendly quantity. The first of
these, which is not in the text, simplifies the proofs of some of the others; it shows that E[Y], which is defined
from the distribution of Y, can also be viewed as a sum over the sample space on which Y is defined.³

³ It is strongly suggested that at this point you compute the expected value of the random variable on p.1 both from
the definition and from formula (7).
Theorem 1

    E[Y] = Σ_{s∈S} Y(s) p(s).                                                                                        (7)

Proof. For each i ≥ 1, let E_i be the event "Y = y_i," so that P(Y = y_i) = P(E_i). Observe that the events
{E_1, E_2, . . .} partition S. Starting from the definition, we get:

    E[Y] = Σ_{i≥1} y_i P(Y = y_i)

         = Σ_{i≥1} y_i P(E_i)

    (by the definition of P(E_i) −→)   = Σ_{i≥1} y_i ( Σ_{s∈E_i} p(s) )

    (distribute −→)   = Σ_{i≥1} Σ_{s∈E_i} y_i p(s)

    (because y_i = Y(s) for s ∈ E_i −→)   = Σ_{i≥1} Σ_{s∈E_i} Y(s) p(s)

    (because the events {E_1, E_2, . . .} partition S −→)   = Σ_{s∈S} Y(s) p(s).
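As a quick computational check (again my own sketch, not part of the text), the next few lines compute E[Y] for the coin-game payout both ways: from the distribution as in (6) and over the sample space as in (7).

from itertools import product
from fractions import Fraction

sample_space = ["".join(flips) for flips in product("HT", repeat=3)]
p = {s: Fraction(1, 8) for s in sample_space}                     # p(s) = 1/8
Y = {s: s.count("H") + 2 * s.count("T") for s in sample_space}    # payout Y(s)

# Formula (7): sum Y(s) p(s) over the sample space.
E_sample_space = sum(Y[s] * p[s] for s in sample_space)

# Formula (6): sum y * P(Y = y) over the distinct values of Y.
E_distribution = sum(y * sum(p[s] for s in sample_space if Y[s] == y)
                     for y in set(Y.values()))

print(E_sample_space, E_distribution)   # both print 9/2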
Theorem 2  Let X, Y_1 and Y_2 be random variables defined on the same sample space, and let c and d be constants.

    [a]: If X is constant (that is, if P(X = x_0) = 1 for some x_0), then E[X] = x_0.

    [b]: E[cY_1 + d] = cE[Y_1] + d   (or, equivalently: µ_{cY_1+d} = cµ_{Y_1} + d).

    [c]: E[Y_1 + Y_2] = E[Y_1] + E[Y_2]   (or, equivalently: µ_{Y_1+Y_2} = µ_{Y_1} + µ_{Y_2}).

Proof of [a]. E[X] = x_0 P(X = x_0) = x_0 · 1 = x_0.
Proof of [b].

    E[cY_1 + d] = Σ_{i≥1} (c y_i + d) P(Y_1 = y_i)

                = Σ_{i≥1} [ c y_i P(Y_1 = y_i) + d P(Y_1 = y_i) ]

                = Σ_{i≥1} c y_i P(Y_1 = y_i) + Σ_{i≥1} d P(Y_1 = y_i)

                = c Σ_{i≥1} y_i P(Y_1 = y_i) + d Σ_{i≥1} P(Y_1 = y_i)

                = cE[Y_1] + d × (1) = cE[Y_1] + d.
Proof of [c]. This can be proved more easily using (7) than it can by using the definition of expected value.

    E[Y_1 + Y_2] = Σ_{s∈S} (Y_1 + Y_2)(s) p(s)

    (definition of Y_1 + Y_2 −→)   = Σ_{s∈S} [ Y_1(s) + Y_2(s) ] p(s)

    (distribute −→)   = Σ_{s∈S} [ Y_1(s) p(s) + Y_2(s) p(s) ]

    (group into two sums −→)   = Σ_{s∈S} Y_1(s) p(s) + Σ_{s∈S} Y_2(s) p(s)

    (equation (7), used twice −→)   = E[Y_1] + E[Y_2].
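Parts [b] and [c] are easy to confirm numerically. The sketch below (not from the text) does so on the three-flip sample space, with Y_1 = number of heads, Y_2 = number of tails, and arbitrarily chosen constants c and d.

from itertools import product
from fractions import Fraction

S = ["".join(flips) for flips in product("HT", repeat=3)]
p = {s: Fraction(1, 8) for s in S}

def E(f):
    """Expected value computed via formula (7): sum of f(s) p(s) over s in S."""
    return sum(f(s) * p[s] for s in S)

Y1 = lambda s: s.count("H")          # number of heads
Y2 = lambda s: s.count("T")          # number of tails
c, d = 10, 7                         # arbitrary constants for the check

print(E(lambda s: c * Y1(s) + d) == c * E(Y1) + d)    # True  (part [b])
print(E(lambda s: Y1(s) + Y2(s)) == E(Y1) + E(Y2))    # True  (part [c])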
Exercise 1  Show that E[Y − µ_Y] = 0.
1.3  Independence and Expected Values.

The concept of independent random variables is built upon the idea of independent events.

Definition 2  Let X and Y be two random variables defined on the same sample space. X and Y are independent
iff for every value x_i in the distribution of X and every y_j in the distribution of Y, the events "X = x_i" and
"Y = y_j" are independent events; that is,

    P(X = x_i ∩ Y = y_j) = P(X = x_i) P(Y = y_j).

Many desirable things happen when random variables are independent; in this section, I will focus on one of them.
Theorem 3  If X and Y are independent random variables,⁴ then

    E[XY] = E[X] · E[Y]   (if X and Y are independent).                                                              (8)

⁴ Because they are independent, they are necessarily defined on the same sample space.

Proof. The simplest proof I can find uses both the definition and Theorem 1. For each value x_j of X, let
F_j = "X = x_j", and for each value y_i of Y, let E_i = "Y = y_i". Note that the events {(F_j ∩ E_i)} partition S.
    E[X] E[Y] = ( Σ_{j≥1} x_j P(X = x_j) ) ( Σ_{i≥1} y_i P(Y = y_i) )

              = ( Σ_{j≥1} x_j P(F_j) ) ( Σ_{i≥1} y_i P(E_i) )

    (multiply out −→)   = Σ_{j≥1} Σ_{i≥1} x_j y_i P(F_j) P(E_i)

    (because X and Y are independent −→)   = Σ_{j≥1} Σ_{i≥1} x_j y_i P(F_j ∩ E_i)

    (definition of P(F_j ∩ E_i) −→)   = Σ_{j≥1} Σ_{i≥1} x_j y_i Σ_{s∈F_j∩E_i} p(s)

    (multiply out −→)   = Σ_{j≥1} Σ_{i≥1} Σ_{s∈F_j∩E_i} x_j y_i p(s)

    (because x_j y_i = X(s)Y(s) for s ∈ F_j ∩ E_i −→)   = Σ_{j≥1} Σ_{i≥1} Σ_{s∈F_j∩E_i} X(s) Y(s) p(s)

    (because the events {(F_j ∩ E_i)} partition S −→)   = Σ_{s∈S} X(s) Y(s) p(s)

    (formula (7) −→)   = E[XY].
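Here is a small numerical illustration (mine, not the text's) of Theorem 3 on the three-flip sample space: X and Y are indicators of "first flip is a head" and "second flip is a head," which are independent; the last two lines show that the conclusion can fail for dependent variables such as the number of heads and the number of tails.

from itertools import product
from fractions import Fraction

S = ["".join(flips) for flips in product("HT", repeat=3)]
p = {s: Fraction(1, 8) for s in S}
E = lambda f: sum(f(s) * p[s] for s in S)          # expected value via formula (7)

X = lambda s: 1 if s[0] == "H" else 0              # indicator of a head on flip 1
Y = lambda s: 1 if s[1] == "H" else 0              # indicator of a head on flip 2
print(E(lambda s: X(s) * Y(s)) == E(X) * E(Y))     # True: 1/4 == (1/2)(1/2)

H = lambda s: s.count("H")                         # number of heads
T = lambda s: s.count("T")                         # number of tails (dependent on H)
print(E(lambda s: H(s) * T(s)), E(H) * E(T))       # 3/2 versus 9/4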
2  Variance and Standard Deviation.

2.1  Motivating Discussion and Definition.

Just as the expected value µ_Y of a random variable Y is analogous to the mean ȳ of a sample
(y_1, y_2, . . . , y_n), there is a number called the variance of Y that is analogous to the variance s^2 of
the sample y_1, y_2, . . . , y_n. The considerations that lead to the definition of s^2 apply in this case as
well:⁵

1. The goal is to measure the spread of the distribution of Y; that is, to find a number that
   is small for a tight, compact distribution and large for a widely spread-out distribution.

2. The expected (signed) distance of a random value of Y from the mean µ_Y does not work, because
   E[Y − µ_Y] = 0, always (Exercise 1).

3. The expected (unsigned) distance of Y from µ_Y, namely E[|Y − µ_Y|], does work, but it is not
   particularly easy to work with.

4. The expected squared distance of Y from µ_Y (the variance) also works, just as #3 does; but
   unlike #3, the variance has a number of nice mathematical properties, and (probably because of
   them) it is by far the most commonly used measure of dispersion.

5. The variance has one undesirable property, namely that its units are the square of the units of Y;
   this can be fixed by taking the square root of the variance, to get the so-called standard deviation
   of Y.

⁵ With the exception of the (n − 1) in the denominator in the definition of s^2. The reason for that will become clear
when you reach Chapter 8.
Definition 3  Let Y be a discrete random variable.

1. The variance of Y, denoted V(Y) or σ_Y^2, is given by

       σ_Y^2 := E[(Y − µ_Y)^2].

2. The standard deviation of Y, denoted σ_Y, is the square root of the variance of Y:

       σ_Y := √(σ_Y^2) = √( E[(Y − µ_Y)^2] ).
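As a concrete instance (my own sketch, not part of the handout), the following lines apply Definition 3 to the payout variable Y from the game on p.1; the exact variance comes out to 3/4, so the standard deviation is about $0.87.

from itertools import product
from fractions import Fraction
from math import sqrt

S = ["".join(flips) for flips in product("HT", repeat=3)]
p = {s: Fraction(1, 8) for s in S}
E = lambda f: sum(f(s) * p[s] for s in S)          # expected value via formula (7)

Y = lambda s: s.count("H") + 2 * s.count("T")      # payout: $1 per head, $2 per tail
mu = E(Y)                                          # 9/2

variance = E(lambda s: (Y(s) - mu) ** 2)           # Definition 3: E[(Y - mu)^2]
print(variance, sqrt(variance))                    # 3/4  0.8660...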
2.2  Basic Properties of the Variance and Standard Deviation.
While there are parallels between the properties of variance and those of expected value, there are also
differences. Notice in particular that part [d] of Theorem 4 requires independence, whereas part [c] of
Theorem 2 (p.3) does not.
Theorem 4  Let X, Y_1 and Y_2 be random variables defined on the same sample space, and let c and d
be constants.

    [a]: σ_{Y_1}^2 = E[Y_1^2] − µ_{Y_1}^2.

    [b]: If X is constant (that is, if P(X = x_0) = 1 for some x_0), then σ_X^2 = 0.⁶

    [c]: V(cY_1 + d) = c^2 V(Y_1)   (or, equivalently: σ_{cY_1+d}^2 = c^2 σ_{Y_1}^2).

    [d]: If Y_1 and Y_2 are independent, V(Y_1 + Y_2) = V(Y_1) + V(Y_2)   (or, equivalently: σ_{Y_1+Y_2}^2 = σ_{Y_1}^2 + σ_{Y_2}^2).

⁶ The converse of [b] is also true (Exercise 3).
Proof of [a].

    σ_{Y_1}^2 = E[(Y_1 − µ_{Y_1})^2]

              = E[Y_1^2 − 2Y_1 µ_{Y_1} + µ_{Y_1}^2]

    (by Theorem 2 part [c] −→)   = E[Y_1^2] + E[−2Y_1 µ_{Y_1}] + E[µ_{Y_1}^2]

    (by Theorem 2 parts [a] and [b] −→)   = E[Y_1^2] − 2µ_{Y_1} E[Y_1] + µ_{Y_1}^2

    (because E[Y_1] = µ_{Y_1} −→)   = E[Y_1^2] − 2µ_{Y_1}^2 + µ_{Y_1}^2

              = E[Y_1^2] − µ_{Y_1}^2.
Proof of [b]. Since P(X = x_0) = 1, it is also true that P(X^2 = x_0^2) = 1, so that by part [a] of
Theorem 2, E[X^2] = x_0^2. Then, by part [a] of Theorem 4,

    σ_X^2 = E[X^2] − (µ_X)^2 = x_0^2 − (µ_X)^2 = x_0^2 − (x_0)^2 = 0.
Proof of [c].

    V(cY_1 + d) = E[ (cY_1 + d − µ_{cY_1+d})^2 ]

    (by Theorem 2[b] −→)   = E[ (cY_1 + d − (cµ_{Y_1} + d))^2 ]

    (arithmetic −→)   = E[ (c(Y_1 − µ_{Y_1}))^2 ]

    (arithmetic −→)   = E[ c^2 (Y_1 − µ_{Y_1})^2 ]

    (by Theorem 2[b] −→)   = c^2 E[ (Y_1 − µ_{Y_1})^2 ]

                = c^2 V(Y_1).
Proof of [d].

    V(Y_1 + Y_2) = E[ (Y_1 + Y_2 − µ_{Y_1+Y_2})^2 ]

    (by Theorem 2[c] −→)   = E[ (Y_1 + Y_2 − (µ_{Y_1} + µ_{Y_2}))^2 ]

    (arithmetic −→)   = E[ ((Y_1 − µ_{Y_1}) + (Y_2 − µ_{Y_2}))^2 ]

    (arithmetic −→)   = E[ (Y_1 − µ_{Y_1})^2 + 2(Y_1 − µ_{Y_1})(Y_2 − µ_{Y_2}) + (Y_2 − µ_{Y_2})^2 ]

    (by Theorem 2 [b],[c] −→)   = E[(Y_1 − µ_{Y_1})^2] + 2E[(Y_1 − µ_{Y_1})(Y_2 − µ_{Y_2})] + E[(Y_2 − µ_{Y_2})^2]

    (by Theorem 3 (we have independence) −→)   = E[(Y_1 − µ_{Y_1})^2] + 2E[Y_1 − µ_{Y_1}] E[Y_2 − µ_{Y_2}] + E[(Y_2 − µ_{Y_2})^2]

    (by Exercise 1 on p.4 −→)   = E[(Y_1 − µ_{Y_1})^2] + 2(0)(0) + E[(Y_2 − µ_{Y_2})^2]

                 = V(Y_1) + V(Y_2).
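To round this out, here is one more numerical sketch (not from the text) on the three-flip sample space: it confirms parts [c] and [d] for the independent indicators Y_1 and Y_2 of the first two flips, and then shows that [d] can fail without independence, since the number of heads plus the number of tails is the constant 3.

from itertools import product
from fractions import Fraction

S = ["".join(flips) for flips in product("HT", repeat=3)]
p = {s: Fraction(1, 8) for s in S}
E = lambda f: sum(f(s) * p[s] for s in S)                 # expected value via formula (7)
V = lambda f: E(lambda s: (f(s) - E(f)) ** 2)             # variance via Definition 3

Y1 = lambda s: 1 if s[0] == "H" else 0                    # indicator of a head on flip 1
Y2 = lambda s: 1 if s[1] == "H" else 0                    # indicator of a head on flip 2
c, d = 10, 7                                              # arbitrary constants for the check

print(V(lambda s: c * Y1(s) + d) == c**2 * V(Y1))         # True  (part [c])
print(V(lambda s: Y1(s) + Y2(s)) == V(Y1) + V(Y2))        # True  (part [d]; Y1, Y2 independent)

H = lambda s: s.count("H")
T = lambda s: s.count("T")                                # H and T are NOT independent
print(V(lambda s: H(s) + T(s)), V(H) + V(T))              # 0 versus 3/2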
Exercise 2 For each part of Theorem 4, there is a corresponding fact about standard deviations. Write
these facts down.
Exercise 3  Show that if σ_Y^2 = 0, then Y is a constant.