Conditioning Random Variables
OPRE 7310 Lecture Notes by Metin Çakanyıldırım
Compiled at 19:47 on Wednesday 7th December, 2016
1 Conditional Distribution and Expected Values of RVs
To begin with, let us consider a discrete random vector [X, Y] with joint pmf p_{X,Y} and marginal pmfs p_X and p_Y. The conditional pmf of X given Y is
p_{X|Y=y}(x) = P(X = x|Y = y) = P(X = x, Y = y)/P(Y = y) = p_{X,Y}(x, y)/p_Y(y)
provided that pY (y) > 0. pY |X = x (y) is similarly defined. Once the conditional pmf is available for pY (y) > 0,
the conditional expectation is
E(X|Y = y) = ∑_x x p_{X|Y=y}(x).
Example: Let [ X, Y ] be a binary valued random vector with joint pmf given by
p X,Y (0, 0) = 0.1, p X,Y (0, 1) = 0.3, p X,Y (1, 0) = 0.2, p X,Y (1, 1) = 0.4.
Find p X |Y =0 ( x ) and p X |Y =1 ( x ) as well as E( X |Y = 0) and E( X |Y = 1). The marginal pmf of Y is pY (y = 0) =
p X,Y (0, 0) + p X,Y (1, 0)=0.1+0.2=0.3 and pY (y = 1) = p X,Y (0, 1) + p X,Y (1, 1)=0.3+0.4=0.7.
p_{X|Y=0}(x) = { p_{X,Y}(0,0)/p_Y(0) = 0.1/0.3 if x = 0; p_{X,Y}(1,0)/p_Y(0) = 0.2/0.3 if x = 1 } and
p_{X|Y=1}(x) = { p_{X,Y}(0,1)/p_Y(1) = 0.3/0.7 if x = 0; p_{X,Y}(1,1)/p_Y(1) = 0.4/0.7 if x = 1 }.
E( X |Y = 0) = 0p X |Y =0 ( x = 0) + 1p X |Y =0 ( x = 1) = 2/3.
E( X |Y = 1) = 0p X |Y =1 ( x = 0) + 1p X |Y =1 ( x = 1) = 4/7.
⋄
Example: Let X1 ∼ Bin(n1 , p), X2 ∼ Bin(n2 , p) and X1 ⊥ X2 . Find P( X1 = k| X1 + X2 = m) for 0 ≤ k ≤ m.
P(X1 = k | X1 + X2 = m) = P(X1 = k, X1 + X2 = m)/P(X1 + X2 = m) = P(X1 = k, X2 = m − k)/P(X1 + X2 = m) = P(X1 = k)P(X2 = m − k)/P(X1 + X2 = m)
= C_k^{n1} p^k (1 − p)^{n1−k} C_{m−k}^{n2} p^{m−k} (1 − p)^{n2−m+k} / (C_m^{n1+n2} p^m (1 − p)^{n1+n2−m})
= C_k^{n1} C_{m−k}^{n2} / C_m^{n1+n2}.
This conditional probability turns out to be that of a Hypergeometric random variable. This is the probability of k successes among the first n1 trials when it is known that the n1 + n2 trials ended up with m successes. ⋄
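As a quick numerical check that is not part of the original notes, the conditional pmf above can be compared with R's built-in Hypergeometric pmf dhyper; the values n1 = 6, n2 = 4, p = 0.3 and m = 5 below are arbitrary illustration choices.

n1 <- 6; n2 <- 4; p <- 0.3; m <- 5            # arbitrary illustration values
k <- 0:m
cond <- dbinom(k, n1, p) * dbinom(m - k, n2, p) / dbinom(m, n1 + n2, p)
hyper <- dhyper(k, n1, n2, m)                 # C(n1,k) C(n2,m-k) / C(n1+n2,m)
max(abs(cond - hyper))                        # numerically zero, and free of p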
For a continuous random vector [X, Y] with joint pdf f_{X,Y} and marginal pdfs f_X and f_Y, the conditional pdf of X given Y is
f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y)
provided that f Y (y) > 0 at y values where the conditional probability is considered. f Y |X (y| x ) is similarly
defined. Moreover,
E(X|Y = y) = ∫_{−∞}^{∞} x f_{X|Y}(x|y) dx.
Example: Consider the continuous random vector [ X, Y ] with joint pdf f X,Y ( x, y) = c( x2 − xy) if 0 ≤
x, y ≤ 1; otherwise, 0. What value of c makes this pdf legitimate? The answer is none because f X,Y ( x, y) =
cx ( x − y) < 0 if c > 0 and x < y and f X,Y ( x, y) = cx ( x − y) < 0 if c < 0 and x > y. There is no c that ensures
f X,Y ( x, y) ≥ 0 for all possible values of x, y. ⋄
Example: Find the conditional pdf of X given Y for continuous random vector [ X, Y ] with joint pdf
f X,Y ( x, y) = c( x2 − xy) if 0 ≤ x ≤ 1, 0 ≤ y ≤ x; otherwise, 0. Compute P( X ≤ 1/2|Y = y). Unlike
the previous example, we now insist on Y ≤ X. We can now find c.
1 = ∫_0^1 ∫_0^x (cx^2 − cxy) dy dx = ∫_0^1 (cx^2 y − cxy^2/2)|_{y=0}^{x} dx = ∫_0^1 (cx^3 − cx^3/2) dx = (cx^4/8)|_{x=0}^{1} = c/8
yields c = 8. The marginal pdf of Y is
f_Y(y) = ∫_y^1 (8x^2 − 8xy) dx = (8x^3/3 − 4x^2 y)|_{x=y}^{1} = 8/3 − 4y − 8y^3/3 + 4y^3 = (4/3)(y − 1)^2 (y + 2).
Hence f Y (y) > 0 for 0 ≤ y < 1. The conditional pdf of X given Y is
f_{X|Y}(x|y) = 8x(x − y) / ((4/3)(y − 1)^2 (y + 2)) = 6x(x − y)/((y − 1)^2 (y + 2)) for 0 ≤ y ≤ x < 1.
The conditional cdf of X given Y is
F_{X|Y}(a|y) = ∫_y^a 6x(x − y)/((y − 1)^2 (y + 2)) dx = (2x^3 − 3x^2 y)|_{x=y}^{a} / ((y − 1)^2 (y + 2)) = (2a^3 − 3a^2 y + y^3)/((y − 1)^2 (y + 2)) = (y − a)^2 (y + 2a)/((y − 1)^2 (y + 2))
= (1 − (1 − a)/(1 − y))^2 (1 − (2 − 2a)/(2 + y)) for 0 ≤ y ≤ a < 1
and F_{X|Y}(a = y|y) = 0, F_{X|Y}(a = 1|y) = 1 for every y. When a = 0, we must have y = 0, and then F_{X|Y}(a = 0|0) = 0. We can find P(X ≤ 1/2|Y = y) simply by setting a = 1/2 in the conditional cdf:
P(X ≤ 1/2|Y = y) = { (1 − 1/(2(1 − y)))^2 (1 − 1/(2 + y)) if y ≤ 1/2; 0 if y ≥ 1/2 }.
P(X ≤ 1/2|Y = y) is plotted in Figure 1 for 0 ≤ y ≤ 1/2. By inspection of the plot, a good approximation for the probability is P(X ≤ 1/2|Y = y) ≈ (1 − 2y)/8 for 0 ≤ y ≤ 1/2. ⋄
Example: Let X1 ∼ Expo (λ1 ), X2 ∼ Expo (λ2 ) and X1 ⊥ X2 . Find the pdf of X1 and the expected value of X1
given that X1 + X2 = v. First we write down the joint density of [ X1 , X2 ]: f X1 ,X2 ( x1 , x2 ) = λ1 e−λ1 x1 λ2 e−λ2 x2 .
We can then seek the joint pdf f_{U,V} of U = X1 and V = X1 + X2. Then x1 = u and x2 = v − u, and the determinant of the Jacobian is |1 0; −1 1| = 1. Hence,
f U,V (u, v) = f X1 ,X2 (u, v − u) = λ1 e−λ1 u λ2 e−λ2 (v−u) for 0 ≤ u ≤ v.
Figure 1: P(X ≤ 1/2|Y = y) for 0 ≤ y ≤ 1/2 produced by the R commands y <- seq(0, 0.5, by=.001); prob <- ((y-0.5)^2*(y+1))/((y-1)^2*(y+2)); plot(y, prob).
Then we can recover the marginal distribution of V:
f_V(v) = ∫_0^v f_{U,V}(u, v) du = ∫_0^v λ^2 e^{−λu} e^{−λ(v−u)} du = λ^2 v e^{−λv} for λ = λ1 = λ2.
f_V(v) = ∫_0^v f_{U,V}(u, v) du = ∫_0^v λ1 λ2 e^{−(λ1−λ2)u} e^{−λ2 v} du = λ1 λ2 e^{−λ2 v} (−e^{−(λ1−λ2)u})|_{u=0}^{v} / (λ1 − λ2) = λ1 λ2 (e^{−λ2 v} − e^{−λ1 v})/(λ1 − λ2) for λ1 ≠ λ2.
This density is nonnegative when λ1 > λ2 and when λ1 < λ2 .
The desired conditional distribution is
f_{U|V}(u|v) = f_{U,V}(u, v)/f_V(v) = λ^2 e^{−λu} e^{−λ(v−u)} / (λ^2 v e^{−λv}) = 1/v for 0 ≤ u ≤ v and λ = λ1 = λ2.
f_{U|V}(u|v) = f_{U,V}(u, v)/f_V(v) = λ1 e^{−λ1 u} λ2 e^{−λ2(v−u)} (λ1 − λ2) / (λ1 λ2 (e^{−λ2 v} − e^{−λ1 v})) = e^{−λ1 u} e^{−λ2(v−u)} (λ1 − λ2)/(e^{−λ2 v} − e^{−λ1 v}) = e^{−(λ1−λ2)u} (λ1 − λ2)/(1 − e^{−(λ1−λ2)v}) for 0 ≤ u ≤ v and λ1 ≠ λ2.
Since f U |V (u|v) is a uniform distribution for λ1 = λ2 , its expected value is
E(U |V = v) = v/2 for λ1 = λ2 .
Note that the symmetry of λ1 = λ2 yields a simple answer.
For λ1 ̸= λ2 ,
E(U|V = v) = 1/(1 − e^{−(λ1−λ2)v}) ∫_0^v u e^{−(λ1−λ2)u} (λ1 − λ2) du = 1/(1 − e^{−(λ1−λ2)v}) (−u e^{−(λ1−λ2)u} − e^{−(λ1−λ2)u}/(λ1 − λ2))|_{u=0}^{v}
= 1/(1 − e^{−(λ1−λ2)v}) (−v e^{−(λ1−λ2)v} − e^{−(λ1−λ2)v}/(λ1 − λ2) + 1/(λ1 − λ2))
= 1/(λ1 − λ2) − v e^{−(λ1−λ2)v}/(1 − e^{−(λ1−λ2)v}) = 1/(λ1 − λ2) − v/(e^{(λ1−λ2)v} − 1). ⋄
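A minimal Monte Carlo sketch in R, not from the notes, that approximates E(X1 | X1 + X2 ≈ v) by keeping simulated pairs whose sum falls in a small window around v; the rates λ1 = 1, λ2 = 3, the value v = 2 and the window width are assumptions of the sketch.

set.seed(1)
lam1 <- 1; lam2 <- 3; v <- 2; eps <- 0.01      # arbitrary illustration values
x1 <- rexp(2e6, lam1); x2 <- rexp(2e6, lam2)
keep <- abs(x1 + x2 - v) < eps                 # condition on X1 + X2 being close to v
mean(x1[keep])                                 # simulated E(U | V = v)
1/(lam1 - lam2) - v/(exp((lam1 - lam2)*v) - 1) # closed form from the example, about 1.54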
2 Expectations via Conditioning
Above we have seen the computation of the expected value of conditional random variables. For a random
vector [ X, Y ], the conditional expectation E( X |Y = y) is a function of y in the sense that there are various
E(X|Y = y1), E(X|Y = y2), . . . values. Hence, we can consider the sum of E(X|Y = y) weighted by P(Y = y) when Y is a discrete random variable. This sum turns out to be the expected value of X:
E(X) = ∑_{y: p_Y(y)>0} E(X|Y = y) p_Y(y).
Justification for this formula can be found in Ross (2014). The analogous equation for continuous random
variable Y is
E(X) = ∫_{−∞}^{∞} E(X|Y = y) f_Y(y) dy.
In short, for any random vector [ X, Y ], we have E( X ) = E(E( X |Y )).
Example: Suppose that the number of blank paper sheets you use as scratch paper is Xi on day i and you are to buy paper from Office Depot to last for the next N days. N is the number of days between two consecutive visits to Office Depot. You are not sure about N, which is therefore treated as a random variable. We assume Xi ⊥ N for each day i. What is the expected value of the total number of paper sheets you need in N days? Note that neither the distribution of {Xi} nor that of N is given, but the question is about the expected value, not the distribution. The total number of sheets is X1 + X2 + · · · + X_N.
If we know the number of days N = n between two visits, the expected total number of sheets is E(X1 + X2 + · · · + X_N | N = n) = E(X1) + E(X2) + · · · + E(Xn), where the last equality is due to Xi ⊥ N. Then
E(∑_{i=1}^{N} Xi) = ∑_{n: p_N(n)>0} (E(X1) + E(X2) + · · · + E(Xn)) p_N(n).
Note that Xi and Xj can depend on each other. Furthermore, if {Xi} is an identically distributed sequence, then
E(∑_{i=1}^{N} Xi) = E(Xi) ∑_{n: p_N(n)>0} n p_N(n) = E(Xi) E(N).
Sums of a random number of random variables often appear in supply chains, finance, marketing, etc. The equality above is known as Wald's equation and it can be obtained under assumptions more relaxed than Xi ⊥ N. ⋄
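A hedged R sketch of Wald's equation under an assumed setup: daily usage Xi ~ Poisson(3) and N uniform on {5, . . . , 15}, independent of the Xi's; both distributional choices are arbitrary.

set.seed(2)
total <- replicate(1e5, {
  N <- sample(5:15, 1)            # random number of days, independent of the X's
  sum(rpois(N, 3))                # total sheets used in N days
})
mean(total)                        # close to E(Xi) * E(N) = 3 * 10 = 30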
Example: A SOM student is looking for his TA's office to hand in a homework. The student knows the name of the TA but not the office number. The TA's office is in the new SOM building and can be located in 2 minutes
once the student goes to the new building. The student starts by the elevators on the western wing on the
third floor.
• First, if he takes the southern aisle and then the northern aisle on the third floor, he returns to where he
started after 9 minutes without finding the TA.
• Second, if he goes to the fourth floor and takes the southern aisle and then the northern aisle on the fourth
floor, he returns to where he started after 12 minutes without finding the TA.
• Third, if he goes to the new building, he takes 4 minutes to reach there.
Assuming that the student chooses one of these 3 options with equal probabilities, independently of his previously chosen options, how long does it take on average for him to find his TA's office? Let Y be the option chosen, so Y ∈ {1, 2, 3}. Let X be the time it takes for the student to find his TA's office. We have
E( X |Y = 1) = 9 + E( X ); E( X |Y = 2) = 12 + E( X ); E( X |Y = 3) = 4 + 2.
Putting them together
E( X ) = E( X |Y = 1)P(Y = 1) + E( X |Y = 2)P(Y = 2) + E( X |Y = 3)P(Y = 3) = 3 + E( X )/3 + 4 + E( X )/3 + 2
= 9 + 2E( X )/3,
which leads to E( X ) = 27 minutes. If the student has a memory, he will not try the same option twice. Rather
he will consider all possible strings of options that end with the third option: (1,2,3), (2,1,3), (1,3), (2,3), (3).
The duration to find the TA’s office with options (1,2,3) and (2,1,3) is the same and is 27 minutes. The duration
with options (1,3) is 15 minutes whereas the duration with (2,3) is 18 minutes. Finally the duration with (3)
is 6 minutes. The probabilities of choosing these options are
P(1,2,3 is chosen) = P(2,1,3 is chosen) = (1/3)(1/2) = 1/6
P(1,3 is chosen) = P(2,3 is chosen) = (1/3)(1/2) = 1/6
P(3 is chosen) = 1/3.
In other words, all 6 permutations of 1, 2, 3 are equally likely. Then the expected value of the duration X^m with memory is
E( X m ) = 27(1/6) + 27(1/6) + 15(1/6) + 18(1/6) + 6(2/6) = 99/6 = 16.5.
E(X) = 27 > 16.5 = E(X^m), so having a memory reduces the expected search time by 10.5 minutes. ⋄
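The two expected durations can be checked by simulation; the R sketch below is illustrative only and uses the option durations 9, 12 and 6 minutes from the example.

set.seed(3)
times <- c(9, 12, 6)                       # options 1, 2 and 3 (option 3 includes the 2 minutes)
no_memory <- replicate(1e5, {              # keep resampling options until option 3 is chosen
  t <- 0
  repeat { y <- sample(1:3, 1); t <- t + times[y]; if (y == 3) break }
  t
})
memory <- replicate(1e5, {                 # try each option at most once, stop at option 3
  ord <- sample(1:3)
  sum(times[ord[1:which(ord == 3)]])
})
c(mean(no_memory), mean(memory))           # approximately 27 and 16.5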
Example: Consider a sequence of Bernoulli trials with success probability p and let Nk be the number of
trials required to obtain k consecutive successes. Find nk = E( Nk ). As a warm up, we know that N1 is a
geometric random variable and n1 = 1/p. What is the additional number of trials needed to make 1 success
into 2 consecutive successes? Let this random variable be A_{1,2}. We have E(A_{1,2}) = 1 · p + (1 + n2)(1 − p) = 1 + (1 − p)n2; 1 more trial if that trial is a success and otherwise 1 + n2 more trials on average. Then we have
n2 = E(N2) = E(N1 + A_{1,2}) = n1 + 1 + (1 − p)n2,
which yields n2 = p^{-1} + n1 p^{-1} = p^{-1} + p^{-2}. Now we are ready to generalize the idea: we have Nk = N_{k−1} + A_{k−1,k}, where A_{k−1,k} is the additional number of trials needed to make k − 1 consecutive successes into k consecutive successes. E(A_{k−1,k}) = 1 · p + (1 − p)(nk + 1), which leads to
nk = E(Nk) = E(N_{k−1} + A_{k−1,k}) = n_{k−1} + 1 + (1 − p)nk,
which yields nk = p^{-1} + n_{k−1} p^{-1}. This, starting with n1 = p^{-1} or n2 = p^{-1} + p^{-2}, gives us
nk = p^{-1} + p^{-2} + · · · + p^{-k}.
Note that the analysis for k = 2 above is for warm up and is not necessary. However, examining simpler
cases helps to build our understanding when we face a new problem. ⋄
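An R sketch, not part of the notes, that estimates nk by simulation and compares it with p^{-1} + · · · + p^{-k}; the values p = 0.4 and k = 3 are arbitrary.

set.seed(4)
p <- 0.4; k <- 3
trials_until_k_successes <- function() {
  run <- 0; n <- 0
  while (run < k) { n <- n + 1; run <- if (runif(1) < p) run + 1 else 0 }
  n                                         # trials until k consecutive successes
}
mean(replicate(2e4, trials_until_k_successes()))
sum(1/p^(1:k))                              # p^-1 + p^-2 + p^-3 = 24.375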
Example: Let X ∼ Geo(p) and find E(X) and E(X^2) by conditioning on the outcome Y of the first trial.
E(X) = E(X|Y = 0)P(Y = 0) + E(X|Y = 1)P(Y = 1) = (1 + E(X))(1 − p) + 1 · p = 1 + (1 − p)E(X),
which yields E(X) = 1/p.
E(X^2) = E(X^2|Y = 0)P(Y = 0) + E(X^2|Y = 1)P(Y = 1) = E((1 + X)^2)(1 − p) + 1 · p
= (1 − p) + 2(1 − p)E(X) + (1 − p)E(X^2) + p = 1 + 2(1 − p)/p + (1 − p)E(X^2),
which yields E( X 2 ) = (2 − p)/p2 . Then we easily obtain V( X ) = (1 − p)/p2 . ⋄
As we have obtained V( X ) by conditioning on Y above, we can also obtain the conditional variance of X
given Y. The conditional variance of X given [Y = y] is
V(X|Y = y) = E( (X − E(X|Y = y))^2 | Y = y )
= E(X^2|Y = y) − 2E( X E(X|Y = y) | Y = y) + E( (E(X|Y = y))^2 | Y = y)
= E(X^2|Y = y) − 2E(X|Y = y)E(X|Y = y) + (E(X|Y = y))^2
= E(X^2|Y = y) − (E(X|Y = y))^2.
In the third equality, we use E( ZE( X |Y = y)|Y = y) = E( X |Y = y)E( Z |Y = y) for any random variable Z.
Here E( X |Y = y) is a constant and hence E(E( X |Y = y)|Y = y) = E( X |Y = y).
On the other hand, V( X |Y = y) is a function of y and it is the random variable V( X |Y ) when y is not
known. Hence, we are interested in its expectation:
E( V( X |Y ) ) = E( E( X 2 |Y ) ) − E( (E( X |Y ))2 ) = E( X 2 ) − E( (E( X |Y ))2 ),
where E(E( X 2 |Y )) = E( X 2 ). Also, E( X |Y ) is a random variable and its variance is
V( E( X |Y ) ) = E( (E( X |Y ))2 ) − ( E(E( X |Y )) )2 = E( (E( X |Y ))2 ) − (E( X ))2 ,
where E(E( X |Y )) = E( X ). Summing up the last two equalities above side by side,
E( V( X |Y ) ) + V( E( X |Y ) ) = E( X 2 ) − E((E( X |Y ))2 ) + E((E( X |Y ))2 ) − (E( X ))2 = V( X ).
This leads to the next theorem.
Theorem 1. Conditional variance formula for a random vector [ X, Y ]: V( X ) = E(V( X |Y )) + V(E( X |Y )).
This theorem is used to compute V( X ) by first conditioning on Y, as illustrated next.
Example: Let { Xi } be an iid sequence with mean µ and variance σ2 . Let N be independent of { Xi }. Let
S = ∑_{i=1}^{N} Xi. Consider the random vector [S, N]. We want to find V(S) by using the conditional variance
formula. First, E(S| N = n) = nµ and V(S| N = n) = nσ2 , so E(S| N ) = Nµ and V(S| N ) = Nσ2 . From
the conditional variance formula, V(S) = E(V(S| N )) + V(E(S| N )) = E( Nσ2 ) + V( Nµ) = E( N )σ2 + V( N )µ2 .
This formula is used in supply chains to obtain the variance of demand S during lead time N when each day
is subject to iid demands of Xi . ⋄
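A hedged R check of V(S) = E(N)σ^2 + V(N)µ^2 with assumed Xi ~ Normal(µ = 5, σ = 2) and N ~ Poisson(7), so that E(N) = V(N) = 7; all distributional choices are for illustration only.

set.seed(5)
mu <- 5; sigma <- 2
S <- replicate(2e5, sum(rnorm(rpois(1, 7), mu, sigma)))   # random sum with a random N
var(S)                                                    # simulated V(S)
7 * sigma^2 + 7 * mu^2                                    # formula: 28 + 175 = 203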
Example: Hat matching. N people throw their hats into the center of a room. Hats are mixed up. Each person selects a hat randomly. What is the expected number of people selecting their own hat?
First, one needs to think of whether the hats are selected sequentially or simultaneously. Let us suppose that each hat is given a number from 1 to N and each of the N people picks a number from 1 to N, all picking simultaneously, in such a way that every number from 1 to N is picked by exactly one person. Then each person gets the hat whose number he/she has selected. This is simultaneous selection of hats.
Let X n be the number of people selecting their own hat in a population of size n. Let Xi = 1 if the ith
person selects his/her hat; otherwise Xi = 0, so
X n = X 1 + X2 + · · · + X n .
Note that X1 , X2 , . . . , Xn are identical but not independent as X1 = X2 = · · · = Xn−1 = 1 implies Xn = 1. For
the ith person among N people in simultaneous selection, the probability of choosing own hat is 1/N and
E( Xi ) = 1/N. Hence,
E(X^N) = E(X1) + E(X2) + · · · + E(X_N) = N (1/N) = 1.
Interestingly, the expected number of people selecting their own hat is 1, independent of the number N of people. ⋄
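A short R simulation, not part of the notes, of the simultaneous selection: a uniformly random permutation assigns the hats and the number of fixed points is recorded; the choice N = 10 is arbitrary.

set.seed(6)
N <- 10
matches <- replicate(1e5, sum(sample(N) == 1:N))   # people who get their own hat
c(mean(matches), var(matches))                     # both close to 1, whatever N is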
Example: In the hat selection example, what are V( Xi ), Cov( Xi , X j ), Cor( Xi , X j ), V( X N ) and E(( X N )2 )?
Xi is a binary variable and P(Xi = 1) = 1/N so V(Xi) = E(Xi^2) − E(Xi)^2 = 1/N − 1/N^2. To compute P(Xi = 1, Xj = 1), consider all N! possible hat assignments (permutations) and the (N − 2)! permutations that match the hats of people i and j: P(Xi = 1, Xj = 1) = (N − 2)!/N!. Xi Xj is another binary variable so E(Xi Xj) = P(Xi = 1, Xj = 1) = (1/N)(1/(N − 1)).
Cov(Xi, Xj) = E(Xi Xj) − E(Xi)E(Xj) = (1/N)(1/(N − 1)) − 1/N^2 = 1/((N − 1)N^2) > 0.
Xi and X j are positively correlated: If a person picks his own hat correctly, it is more likely for another person
to pick correctly.
Cor(Xi, Xj) = Cov(Xi, Xj)/√(V(Xi)V(Xj)) = (1/((N − 1)N^2)) (N^2/(N − 1)) = 1/(N − 1)^2.
For N = 2, the correlation between X1 and X2 is 1; either both people pick correctly or both pick incorrectly, i.e., X1 = X2. The correlation drops quadratically as N increases; the effect of one individual on another is low in large populations.
We have
V(X^N) = ∑_{i=1}^{N} V(Xi) + ∑_{i=1}^{N} ∑_{j=1, j≠i}^{N} Cov(Xi, Xj) = N (1/N − 1/N^2) + 2C_2^N (1/(N(N − 1)) − 1/N^2)
= 1 − 1/N + 1 − (N^2 − N)/N^2 = 1.
We also have E(( X N )2 ) = V( X N ) + (E( X N ))2 = 1 + 1 = 2. Once more, moments are independent of the
number of people. ⋄
Example: Expected value of rounds. Suppose that people who select their own hat depart with their hats. Then the remaining people simultaneously and randomly select a hat from the pile of remaining and mixed up hats. Each time the remaining people pick hats and those with matches depart, we say that a round is complete. One round is assumed to be independent of the next, except that they are tied by the number of remaining hats. That
is, a person does not carry any knowledge from one round to another. Note that this requires renumbering
of remaining hats in every round. If the hats are not renumbered in each round and a person picks jth hat
which turns out to be not his, he will not pick the jth hat in the next round. Without renumbering, rounds
are dependent, which is not considered here.
Let Rn be the number of rounds it takes for n people to select their own hats. We know that R1 = 1 and
E( R1 ) = 1. On the other hand, it can take infinitely many rounds for n people to select their own hats. Let
X n be the number of matches in a round that starts with n people. From the example above E( X n ) = 1 for
every n.
By conditioning on [ X n = i ] for 0 ≤ i ≤ n, we obtain
E(R_n) = ∑_{i=0}^{n} E(R_n | X^n = i) P(X^n = i)
= ∑_{i=0}^{n} (1 + E(R_{n−i})) P(X^n = i)    as E(R_n | X^n = i) = 1 + E(R_{n−i})
= 1 + E(R_n) P(X^n = 0) + ∑_{i=1}^{n} E(R_{n−i}) P(X^n = i).
Before solving this recursion, let us specialize it for n = 2:
E(R_2) = 1 + E(R_2) P(X^2 = 0) + E(R_1) P(X^2 = 1) + E(R_0) P(X^2 = 2), where P(X^2 = 0) = 1/2!, P(X^2 = 1) = 0/2!, P(X^2 = 2) = 1/2! and E(R_0) = 0,
which yields E( R2 )/2 = 1 or E( R2 ) = 2. For n = 3:
E(R_3) = 1 + E(R_3) P(X^3 = 0) + E(R_2) P(X^3 = 1) + E(R_1) P(X^3 = 2) + E(R_0) P(X^3 = 3), where P(X^3 = 0) = 2/3!, P(X^3 = 1) = 3/3!, P(X^3 = 2) = 0/3!, P(X^3 = 3) = 1/3! and E(R_0) = 0,
where the probabilities P(X^n = i) are obtained by enumeration; a structured approach for this computation
is presented in the next section. We obtain E( R3 ) = 1 + E( R3 )/3 + E( R2 )/2 = 2 + E( R3 )/3, which yields
E( R3 ) = 3.
To solve the recursion for general n, we use the initial condition E( R1 ) = 1 and assume that E( Rn ) =
an + b for unknown constants a and b. The initial condition gives b = 1 − a, so E( Rn ) = an + 1 − a. Inserting
this into the recursive equation:
an + 1 − a = 1 + (an + 1 − a) P(X^n = 0) + ∑_{i=1}^{n} (a(n − i) + 1 − a) P(X^n = i).
Hence, by using ∑in=1 (n − i )P( X n = i ) = n ∑in=1 P( X n = i ) − ∑in=0 iP( X n = i ) = n(1 − P( X n = 0)) − 1,
(1 − P(X^n = 0))(an + 1 − a) = 1 + a ∑_{i=1}^{n} (n − i) P(X^n = i) + (1 − a) ∑_{i=1}^{n} P(X^n = i)
= 1 + an(1 − P(X^n = 0)) − a + (1 − a)(1 − P(X^n = 0)).
This equality holds for every n if (1 − P( X n = 0)) an = (1 − P( X n = 0)) an and (1 − P( X n = 0))(1 − a) =
(2 − P( X n = 0))(1 − a). While the first equality is an identity, the second holds only if a = 1. So E( Rn ) = n
is the only linear function that satisfies the recursive equation and the initial condition. ⋄
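An R sketch simulating the rounds (with renumbering, as described above) to check E(Rn) = n; the choice n = 8 is arbitrary, and the same runs also suggest the variance result of the next example.

set.seed(7)
rounds <- function(n) {
  r <- 0
  while (n > 0) {                   # people still without their own hat
    r <- r + 1
    n <- n - sum(sample(n) == 1:n)  # those who match depart; hats are renumbered
  }
  r
}
r_sim <- replicate(2e4, rounds(8))
c(mean(r_sim), var(r_sim))          # both approximately 8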
Example: Variance of rounds. Show that V( Rn ) = n for n ≥ 2. The proof is based on induction on n. For
n = 2, R2 is the number of rounds two people attempt to find their own hat so R2 ∼ Geo (1/2). We have
V( R2 ) = (1 − 1/2)/(1/2)2 = 2 and the induction holds for n = 2. As an induction hypothesis, we assume
V( R j ) = j for 2 ≤ j < n.
From the previous example, E( Rn ) = n so for random variable Xn
E ( R n | X n ) = 1 + E ( R n − Xn ) = 1 + n − X n .
We also observe P( Rn = k | X n = x ) = P( Rn− x = k − 1) for x ≥ 1 and k ≥ 1 because [ Rn = k| X n = x ]
and [ Rn− x = k − 1] are the same events. The number Rn of rounds differs from Rn− x by a constant 1 when
X n = x: V( Rn | X n = x ) = V( Rn− x ). Hence
V( Rn | X n ) = V( Rn−X n ) = wn−Xn for Xn ≥ 1,
where w j = V( R j ). When X n ≥ 1, we can use the induction hypothesis to write V( Rn | X n ) = V( Rn−X n ) =
n − Xn . However, this does not apply when X n = 0, hence we introduce w j . Using the conditional variance
formula,
w_n = V(R_n) = E(V(R_n|X^n)) + V(E(R_n|X^n)) = E(w_{n−X^n}) + V(X^n) = ∑_{i=0}^{n} w_{n−i} P(X^n = i) + 1
= w_n P(X^n = 0) + ∑_{i=1}^{n} (n − i) P(X^n = i) + 1 = w_n P(X^n = 0) + n(1 − P(X^n = 0)) − E(X^n) + 1
= w_n P(X^n = 0) + n(1 − P(X^n = 0)),
which implies wn = n and completes the induction. Thus, V( Rn ) = n. ⋄
Example: Expected value of total selections in all rounds. Find E(Sn ), where n ≥ 2 and Sn is the total number
of selections made by n people in Rn rounds until everybody selects his/her own hat. We start by observing
S0 = 0, S1 = 1 and E(S0 ) = 0, E(S1 ) = 1. S2 = 2R2 , where R2 ∼ Geo (1/2), so E( R2 ) = 2 and E(S2 ) = 4. By
conditioning on [ X n = i ] for 0 ≤ i ≤ n, we obtain
E(S_n) = ∑_{i=0}^{n} E(S_n|X^n = i) P(X^n = i) = ∑_{i=0}^{n} (n + E(S_{n−i})) P(X^n = i)
= n + ∑_{i=0}^{n} E(S_{n−i}) P(X^n = i) = n + E(S_{n−X^n}).
If there is exactly one match in each round, there will be n rounds and n selections in the nth round. That is
Rn = n and Sn = 1 + 2 + · · · + n = n2 /2 + n/2. So let us suppose E(Sn ) = an + bn2 . The initial condition
yields 4 = E(S2 ) = 2a + 4b, so b = 1 − a/2. Then E(Sn ) = an + (1 − a/2)n2 can be inserted into the recursive
equation to obtain
an + (1 − a/2)n2 = n + E(Sn−X n ) = n + E( a(n − X n ) + (1 − a/2)(n − X n )2 )
= n + a(n − E( X n )) + (1 − a/2)n2 − 2(1 − a/2)nE( X n ) + (1 − a/2)E(( X n )2 )
= n + a(n − 1) + (1 − a/2)n2 − 2(1 − a/2)n + (1 − a/2)2
= (1 − a/2)n2 + (2a − 1)n + 2 − 2a.
This equality holds for every n if a = 2a − 1 and 2 − 2a = 0, which imply a = 1. So E(Sn ) = n + n2 /2 satisfies
the recursive equation and the initial condition. ⋄
3 Probabilities via Conditioning
In the last section, we have computed expectation E( X ) for a random variable. In particular, 1I A ∈ {0, 1}
is a random variable, which takes the value of 1 when event A happens. Using the conditional expectation
formulas of the last section,
P(A) = E(1I_A) = ∑_{y: p_Y(y)>0} E(1I_A|Y = y) p_Y(y) = ∑_{y: p_Y(y)>0} P(A|Y = y) p_Y(y),
or analogously
P(A) = E(1I_A) = ∫_{−∞}^{∞} E(1I_A|Y = y) f_Y(y) dy = ∫_{−∞}^{∞} P(A|Y = y) f_Y(y) dy.
Example: For two independent uniform random variables X1 , X2 ∼ U (0, 1), find P( X1 ≥ X2 ).
P(X1 ≥ X2) = ∫_0^1 P(X1 ≥ X2 | X2 = u) f_{X2}(u) du = ∫_0^1 (1 − u) du = (u − u^2/2)|_0^1 = 1/2. ⋄
Example: Convolution. Suppose that X, Y are continuous random variables with pdfs f X , f Y and cdfs FX , FY ,
and X ⊥ Y. We want to find the pdf of X + Y. We can proceed by conditioning as follows.
P(X + Y ≤ a) = ∫_{−∞}^{∞} P(X + Y ≤ a|Y = y) f_Y(y) dy = ∫_{−∞}^{∞} P(X ≤ a − y) f_Y(y) dy = ∫_{−∞}^{∞} F_X(a − y) f_Y(y) dy.
The above equality can also be obtained by mapping the vector [X, Y] to [X + Y, Y]. Taking the derivative with respect to a in the above equality, we can find the density f_{X+Y}:
f_{X+Y}(a) = ∫_{−∞}^{∞} f_X(a − y) f_Y(y) dy.
As a matter of fact, we can define the convolution operator ∗ for any two given functions f 1 and f 2 as
(f_1 ∗ f_2)(a) = ∫_{−∞}^{∞} f_1(a − y) f_2(y) dy = ∫_{−∞}^{∞} f_1(u) f_2(a − u) du,
where the last equality follows from setting u = a − y. From the equality above, f_1 ∗ f_2 = f_2 ∗ f_1, so the convolution operator is commutative. ⋄
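A small R illustration of the convolution formula: the density of the sum of two independent U(0, 1) random variables is computed by numerical integration and matches the triangular density min(a, 2 − a) on [0, 2].

f_sum <- function(a) {                      # (f1 * f2)(a) for two U(0,1) densities
  integrate(function(y) dunif(a - y) * dunif(y), 0, 1)$value
}
sapply(c(0.5, 1, 1.5), f_sum)               # 0.5, 1.0, 0.5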
Example: The number of accidents that occur to each policyholder at an insurance company has a Poisson distribution with mean Λ, where Λ is itself a random variable with pdf f_Λ(λ) = λe^{−λ} (a Gamma density with shape 2 and rate 1). Let X be the number of accidents occurring to a random policyholder and find its pmf p_X.
P(X = n) = ∫_0^{∞} P(X = n|Λ = λ) f_Λ(λ) dλ = ∫_0^{∞} (e^{−λ} λ^n / n!) λ e^{−λ} dλ = 1/(n! 2^{n+1}) ∫_0^{∞} e^{−2λ} (2λ)^{n+1} dλ
= 1/(n! 2^{n+2}) ∫_0^{∞} e^{−u} u^{n+1} du = (n + 1)!/(n! 2^{n+2}) = (n + 1)/2^{n+2}.
You may wonder if 1 = ∑_{n=0}^{∞} P(X = n) = ∑_{n=0}^{∞} (n + 1)/2^{n+2}. We know ∑_{n=0}^{∞} 1/2^{n+2} = (1/4)(1/(1 − 1/2)) = 1/2, so one must check ∑_{n=0}^{∞} n/2^{n+1} = 1, which is done in the appendix. ⋄
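A hedged R check of the pmf (n + 1)/2^{n+2}: Λ is drawn from the density λe^{−λ}, which is Gamma with shape 2 and rate 1, and then X is drawn from Poisson(Λ).

set.seed(8)
lambda <- rgamma(1e6, shape = 2, rate = 1)   # density lambda * exp(-lambda)
x <- rpois(1e6, lambda)
n <- 0:5
rbind(simulated = sapply(n, function(k) mean(x == k)),
      formula   = (n + 1)/2^(n + 2))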
Example: The Best Price Problem. Suppose that you are selling your used car and you decide to entertain exactly N ≥ 2 bids. The prospective buyers come to you and present their bids one by one. When talking to
a buyer, either you accept the bid or reject it. If the current bid is accepted, the car is sold at the bid’s price
and no more bids are received. If the current bid is rejected, it becomes unavailable and you consider the
next bid. You know the bid amounts of all the previous bids as well as the current bid, so you know whether
the current bid has rank 1 (the best) among all that has been received or rank j among them for 1 < j ≤ n + 1,
where n is the number of all previously received bids. But you do not know anything about the forthcoming
bids. Without knowing how the bids are sequenced, we assume that all N! permutations are equally likely. What should we do to have a reasonably high chance of accepting the best bid among all N bids?
To have a grasp of the situation, we can ask whether it is advisable to wait until the last bid. Since the last
bid is not necessarily the best and previous bids are all lost, waiting until the last may not be good. Accepting the first bid prematurely is not appealing either. So let us consider a family of intermediate strategies indexed by k where we
do not accept the first k bids but afterwards we accept the first bid that beats the first k bids. Suppose that
bids are coming in the following sequence: 7,200, 7,000, 8,300, 6,800, 8,500, 7,400. With k = 1 or k = 2, the
third bid 8,300 is accepted. With k = 3 and k = 4, the fifth bid 8,500 is accepted and this is the best bid. With
k = 5, the last bid 7,400 is accepted. Let Pk (best) be the probability of accepting the best bid with strategy k.
Also let X be the position of the best price. For example, the best bid 8,500 is in position 5 above so X = 5.
Since each permutation is equally likely, X is a discrete uniform random variable over {1, 2 . . . , N }.
We have
P_k(best) = ∑_{i=1}^{N} P_k(best|X = i) P(X = i) = (1/N) ∑_{i=1}^{N} P_k(best|X = i).
If i ≤ k, the best bid is among the first k bids and they are rejected by strategy k, so Pk (best| X = i ) = 0. If
i = k + 1, first k bids are rejected and the k + 1st bid is best and accepted, so Pk (best| X = i ) = 1. In general,
when i > k, to pick the best bid with strategy k, we must reject bids from k + 1 to i − 1 observing that they
are all beaten by a bid among the first k bids. In other words, we pick the best when the best among first i − 1
bids is one of the first k bids. Since the position of the best bid among the first i − 1 is discrete uniform with
pmf 1/(i − 1),
P_k(best|X = i) = k/(i − 1) for i > k.
Hence,
P_k(best) = (1/N) ∑_{i=k+1}^{N} k/(i − 1) = (k/N) ∑_{i=k}^{N−1} 1/i ≈ (k/N) ∫_k^{N−1} (1/x) dx = (k/N) ln((N − 1)/k).
The function (k/N ) ln(( N − 1)/k) has the derivative (1/N ) ln(( N − 1)/k) − 1/N, which is decreasing in
k. Then (k/N ) ln(( N − 1)/k) is concave and is maximized at ln(( N − 1)/k) = 1 or at k = ( N − 1)/e. The
strategy that maximizes Pk (best) has k∗ = ( N − 1)/e and then the probability is Pk∗ (best) = ( N − 1)/( Ne).
This probability starts at 1/(2e) = 0.184 for N = 2 and reaches 1/e = 0.368 as N → ∞. ⋄
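An R sketch, not from the notes, of the strategy indexed by k for an assumed N = 20; the convention that the last bid is taken when no later bid beats the first k is the one used in the example's illustration with k = 5.

set.seed(9)
N <- 20
accepts_best <- function(k) {
  bids <- sample(N)                               # bid qualities in arrival order; N is the best
  later <- which(bids[(k + 1):N] > max(bids[1:k])) + k
  accepted <- if (length(later) > 0) later[1] else N
  bids[accepted] == N
}
k_star <- round((N - 1)/exp(1))                   # approximately (N - 1)/e
mean(replicate(2e4, accepts_best(k_star)))        # close to (N - 1)/(N e) = 0.35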
Example: Let N ( x ) be the number of independent uniform Ui ∼ U (0, 1) random variables summed up to
exceed a given value of 0 < x ≤ 1, i.e., N ( x ) = min{n ≥ 1 : U1 + U2 + · · · + Un > x }. Show that
P( N ( x ) > n) = x n /n! for n ≥ 1.
The smallest value for N(x) is 1 and the event [N(x) > 1] happens whenever [U1 ≤ x]. Hence, P(N(x) > 1) = P(U1 ≤ x) = x = x^1/1!, so the P(N(x) > n) formula holds for n = 1. We assume the formula holds for n
and obtain it for n + 1:
P(N(x) > n + 1) = ∫_0^1 P(N(x) > n + 1|U1 = y) dy = ∫_0^x P(N(x) > n + 1|U1 = y) dy
= ∫_0^x P(N(x − y) > n) dy = ∫_0^x ((x − y)^n/n!) dy = ∫_0^x (u^n/n!) du = x^{n+1}/(n + 1)!.
The first equality is by conditioning, the second is by N ( x ) > n + 1 =⇒ U1 ≤ x, the third is by the fact
that U1 = y and n variables exceed x whenever these n variables exceed x − y, and the fourth equality by the
induction hypothesis. This completes the induction. ⋄
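A short R simulation of N(x) that checks P(N(x) > n) = x^n/n!; the value x = 0.8 is arbitrary.

set.seed(10)
x <- 0.8
N_x <- replicate(1e5, { s <- 0; n <- 0; while (s <= x) { s <- s + runif(1); n <- n + 1 }; n })
n <- 1:4
rbind(simulated = sapply(n, function(k) mean(N_x > k)),
      formula   = x^n/factorial(n))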
Example: Consider the hat matching problem with n hats; we want the probability that exactly k out of n people select their own hats. We start with the probability P_0^n of no matches among n people and condition on whether the first person selects his own hat or not.
P_0^n = P(no matches among n|the first DOES select his) P(the first DOES select his)
+ P(no matches among n|the first does NOT select his) P(the first does NOT select his)
= P(no matches among n|the first does NOT select his) (n − 1)/n.
When the first person p′ does not select his hat, he selects hat h” of person p”. The first person p′ leaves
n − 2 people { p1 , . . . , pn−2 }, their n − 2 hats {h1 , . . . , hn−2 }, his own hat h′ and the hatless person p” whose
hat that he has just selected. So there are n − 2 matching people and hats {( p1 , h1 ), . . . , ( pn−2 , hn−2 )} and 1
hatless person p” and 1 personless hat h′ .
Now we consider what the hatless person p” does. Either the hatless person p” selects the personless hat
h′ and leaves n − 2 matching pairs of people and hats {( p1 , h1 ), . . . , ( pn−2 , hn−2 )}, or the hatless person does
not select the personless hat.
P(no matches among n|the first does NOT select his)
= P(no matches among n AND hatless DOES select the personless|the first does NOT select his)
+ P(no matches among n AND hatless does NOT select the personless|the first does NOT select his).
In the former case of the hatless person p” selecting the personless hat h′, the no-matching probability is P_0^{n−2}/(n − 1):
P(no matches among n AND the hatless DOES select the personless|the first does NOT select his)
= P(no matches among n − 2 AND the hatless DOES select the personless|the first does NOT select his)
= P(no matches among n − 2|the first does NOT select his)
× P(the hatless DOES select the personless|the first does NOT select his)
= P_0^{n−2} (1/(n − 1)),
where the second equality is due to independence of no matches among n − 2 pairs from the actions of the
hatless person p”.
In the latter case of hatless person p” not selecting the personless hat h′ , let us temporarily consider the
personless hat h′ as the hat of the hatless person p”. Once we pair the hatless person with the personless hat
and add this pair to the remaining n − 2 pairs, we obtain n − 1 pairs {( p1 , h1 ), . . . , ( pn−2 , hn−2 )} ∪ {( p”, h′ )}.
The matching probabilities in these n − 1 pairs {( p1 , h1 ), . . . , ( pn−2 , hn−2 )} ∪ {( p”, h′ )} are the same as those
in n − 1 pairs {( p1 , h1 ), . . . , ( pn−1 , hn−1 )}. There are no matches among the n − 1 pairs {( p1 , h1 ), . . . , ( pn−2 , hn−2 )} ∪
{( p”, h′ )} if and only if the hatless does not select the personless hat and the remaining n − 2 pairs have no
matches, which implies
P(no matches among n − 2 AND the hatless does NOT select the personless|the first does NOT select his)
= P(no matches among n − 1) = P0n−1 .
Equipped with this equality we proceed as
P(no matches among n AND the hatless does NOT select the personless|the first does NOT select his)
= P(no matches among n − 2 AND the hatless does NOT select the personless|the first does NOT select his)
= P0n−1 .
Combining the two cases above of hatless person selecting and not selecting the personless hat,
P(no matches among n|the first does NOT select his) = (1/(n − 1)) P_0^{n−2} + P_0^{n−1}.
The paragraphs above yield P0n = ((n − 1)/n)P(no matches among n|the first does not select his)=
(1/n) P0n−2 + (1 − 1/n) P0n−1 , which can be written in an alternative form
P_0^n − P_0^{n−1} = −(1/n)(P_0^{n−1} − P_0^{n−2}).
Separately we can obtain initial conditions P01 = 0 and P02 = 1/2. We observe that P03 − P02 = (−1/3)(1/2 −
0) = −1/3!, P04 − P03 = (−1/4)(−1/3!) = 1/4! and P0n − P0n−1 = (−1)n /n!. Hence,
P_0^n = 1/2! − 1/3! + 1/4! − 1/5! + · · · + (−1)^n/n! = ∑_{i=0}^{n} (−1)^i/i!.
To obtain exactly k matches out of n, we first pick those k in C_k^n ways, the probability of perfect matching among those k is (1/n) · · · (1/(n − (k − 1))), and the probability of no matches among the remaining n − k is P_0^{n−k}. The probability of exactly k matches then is
P_k^n = C_k^n (1/n) · · · (1/(n − (k − 1))) P_0^{n−k} = P_0^{n−k}/k! = (∑_{i=0}^{n−k} (−1)^i/i!)/k!.
For every fixed k, P_0^{n−k} → e^{−1} as n → ∞. That is, when there are many people and hats, the probability of exactly k matches is e^{−1}/k!. An alternative argument to obtain P_0^n is presented in the appendix. ⋄
Example: Ballot Problem. After an election involving two candidates A and B, suppose that you wake up
in the morning to learn that a total of k = n + m people have voted and candidate A has won the election by
winning n votes for n > m. While you were sleeping during the night, the vote count continued and reported
on a news web site. Without knowing exactly how the votes may have arrived, you can safely assume that
each of Cnn+m vote permutations is equally likely. We want to show that the probability Pn,m that candidate A
has led the vote count throughout the night is (n − m)/(n + m).
Let us consider some special cases. For n ≤ m, candidate A cannot be ahead throughout the count (in particular, it is not ahead after the last vote), so P_{n,m} = 0. For n ≥ 1 and m = 0, candidate A always leads the count, so P_{n,m} = 1. Figure 2 shows these special cases and a generic (n, m) for n ≥ m + 1.
Figure 2: Progression of the vote count from the start to the end.
To arrive at the vote count (n, m), we must have the vote count of either (n, m − 1) or (n − 1, m) before the last vote is cast. We first need the probability that the last vote is for candidate A. There are C_n^{n+m} vote permutations and C_{n−1}^{n+m−1} of them end with a vote for candidate A, so
P(last vote is for A) = C_{n−1}^{n+m−1}/C_n^{n+m} = n/(n + m) = 1 − m/(n + m) = 1 − P(last vote is for B),
which are also shown in Figure 2 next to arrows coming to (n, m) from (n − 1, m) and (n, m − 1).
For n ≥ m + 1, we can find Pn,m by conditioning on the last vote:
P_{n,m} = P(A is always ahead|last vote for A) P(last vote for A) + P(A is always ahead|last vote for B) P(last vote for B)
= P(A is always ahead if the last vote count is (n − 1, m)) n/(n + m) + P(A is always ahead if the last vote count is (n, m − 1)) m/(n + m)
= (n/(n + m)) P_{n−1,m} + (m/(n + m)) P_{n,m−1}.
Using this equality and induction on the number k = n + m of votes cast, we can prove Pn,m = (n − m)/(n +
m). For k = 1 = n + m and n ≥ m + 1, we obtain (1, 0) and P1,0 = 1 = (1 − 0)/(1 + 0) so the induction
hypothesis of Pn,m = (n − m)/(n + m) holds for k = 1. Although unnecessary, you can also check to see that
Pn,n = 0 = (n − n)/(n + n) and Pn,0 = 1 = (n − 0)/(n + 0) satisfy the formula Pn,m = (n − m)/(n + m).
What is necessary is to assume Pn,m = (n − m)/(n + m) for n + m = k − 1 and n ≥ m and prove it for
n + m = k and n ≥ m + 1; see how k can progress from the left-hand side to the right-hand side in Figure 2.
We use the recursive equation above and check that the indices of the probability on the right hand-side are
such that the induction hypothesis applies:
P_{n,m} = (n/(n + m)) P_{n−1,m} + (m/(n + m)) P_{n,m−1} = (n/(n + m)) (n − 1 − m)/(n − 1 + m) + (m/(n + m)) (n + 1 − m)/(n − 1 + m) = (n − m)/(n + m),
where the recursion applies since n ≥ m + 1 and the induction hypothesis applies to P_{n−1,m} and P_{n,m−1} since n − 1 ≥ m and n ≥ m − 1 with n + m − 1 votes.
This proves the induction hypothesis for k = n + m and n ≥ m + 1. To complete, we need to check the
formula for n = m. Note that Pn,n = 0 = (n − n)/(n + n) so the formula holds. Hence, we have established
the induction hypothesis for k and [n ≥ m] = [n ≥ m + 1] ∪ [n = m]. ⋄
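An R simulation of the ballot problem for assumed counts n = 6 and m = 4: the n + m votes are randomly permuted and we check whether A stays strictly ahead after every vote.

set.seed(11)
n <- 6; m <- 4
always_ahead <- replicate(1e5, {
  votes <- sample(c(rep(1, n), rep(-1, m)))   # +1 for A, -1 for B, in random order
  all(cumsum(votes) > 0)                      # A strictly ahead throughout the count
})
mean(always_ahead)                            # close to (n - m)/(n + m) = 0.2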
Example: Bankruptcy problem. After an election involving two candidates A and B, suppose that you wake up in the morning to learn that a total of k = r + p people have voted and candidate A has not lost the election by winning r votes for r ≥ p. Without knowing exactly how the votes may have arrived, you can safely assume that each of the C_r^{r+p} vote permutations is equally likely. We want to show that the probability B_{r,p} that candidate A has never been behind throughout the night is 1 − p/(r + 1) for r ≥ p − 1.
This example can be used to find the probability of no bankruptcy for a company that is to receive r unit payments and to pay p unit payments. Given these values of receipts and payments, a bankruptcy happens when the amount of collected receipts is less than the amount of payments due at any time. If the receipts and payments happen in a random sequence, the probability of no bankruptcy is
P(No bankruptcy with r to receive and p to pay) = 1 − p/(r + 1) for r ≥ p − 1.
We can incorporate the region r < p − 1 by using a minimum and then express the bankruptcy probability as
P(Bankruptcy with r to receive and p to pay) = min{ p/(r + 1), 1 }.
Now we turn to proving B_{r,p} = 1 − p/(r + 1). Let us consider some special cases. For r ≤ p − 1, candidate A ends the election behind so B_{r,p} = 0. For r ≥ 0 and p = 0, candidate A is never behind so B_{r,p} = 1. Figure 3 shows these special cases and a generic (r, p) for r ≥ p. To arrive at the vote count (r, p), we must have the vote count of either (r, p − 1) or (r − 1, p) before the last vote is cast. We have
P(last vote is for A) = r/(r + p) = 1 − p/(r + p) = 1 − P(last vote is for B),
which are also shown in Figure 3 next to arrows coming to (r, p) from (r − 1, p) and (r, p − 1).
Figure 3: Progression of the payments and receipts from the start to the end.
For r ≥ p, we can find Br,p by conditioning on the last vote:
B_{r,p} = P(A is never behind|last vote for A) P(last vote for A) + P(A is never behind|last vote for B) P(last vote for B)
= (r/(r + p)) B_{r−1,p} + (p/(r + p)) B_{r,p−1}.
Using this equality and induction on the number k = r + p of votes cast, we can prove B_{r,p} = 1 − p/(r + 1). For k = 0 = r + p and r ≥ p, we obtain (0, 0) and B_{0,0} = 1 = 1 − 0/(0 + 1), so the induction hypothesis of B_{r,p} = 1 − p/(r + 1) holds for k = 0. Although unnecessary, you can also check to see that B_{n,n+1} = 0 =
1 − (n + 1)/(n + 1) and Bn,0 = 1 = 1 − 0/(n + 1) satisfy the formula Br,p = 1 − p/(r + 1). What is necessary
is to assume Br,p = 1 − p/(r + 1) for r + p = k − 1 and r ≥ p − 1 and prove it for r + p = k and r ≥ p. We
use the recursive equation above and check that the indices of the probability on the right hand-side are such
that the induction hypothesis applies:
B_{r,p} = (r/(r + p)) B_{r−1,p} + (p/(r + p)) B_{r,p−1} = (r/(r + p)) (1 − p/r) + (p/(r + p)) (1 − (p − 1)/(r + 1)) = 1 − p/(r + 1),
where the recursion applies since r ≥ p and the induction hypothesis applies to B_{r−1,p} and B_{r,p−1} since r − 1 ≥ p − 1 and r ≥ p − 2 with r + p − 1 votes.
This proves the induction hypothesis for k = r + p and r ≥ p. To complete, we need to check the formula for
r = p − 1. Note that Bn,n+1 = 0 = 1 − (n + 1)/(n + 1) so the formula holds. Hence, we have established the
induction hypothesis for k and [r ≥ p − 1] = [r ≥ p] ∪ [r = p − 1]. ⋄
Example: In volleyball, two teams, say A and B, play at most 5 sets until a team wins 3 out of 5. Each of the first 4 sets is played until one of the teams earns 25 points and separates itself from the other team by at least 2 points. So these sets can end with scores 25 to m for m ≤ 23 or scores 25 + i to 23 + i for i ≥ 1. The fifth set, if necessary, is played until a team earns 15 points and separates itself from the other by 2 points.
To earn a point, a team must win a rally which can start with either team A or B serving the ball (This rule
was different before 1998 when only the serving team could win a point). When team A serves and starts a
rally, it wins with probability p a and team B wins with probability q a . When team B serves and starts a rally,
it wins with probability qb and team A wins with probability pb :
p a = 1 − q a = P(Team A wins the rally|Team A serves to start the rally) and
qb = 1 − pb = P(Team B wins the rally|Team B serves to start the rally).
The winner of a rally not only gets a point but also serves to start the next rally. One of the first four sets ends
with scores (25, m) or (25 + i, 23 + i ) for m ≤ 23 and i ≥ 1 when team A wins; it can also end with scores
(m, 25) or (23 + i, 25 + i) when team B wins. For now, we simplify the ending rule and drop the separation condition, so a set ends at (25, m) or (m, 25) for m ≤ 24. We can come back to the separation condition later.
For generality, we replace 25 with generic numbers, so a set ends at ( N, m) for 0 ≤ m ≤ M − 1 or (n, M ) for
0 ≤ n ≤ N − 1. Let [ FA , FB ] be the final score of a set and find the pmf of this random vector.
In a set, let B(n) be the number of points scored by team B by the time team A scores its nth point and
find the pmf of this random variable. When team A wins all the rallies or team B wins all the rallies in a
set, we have
P( FA = N, FB = 0) = p aN and P( FA = 0, FB = M) = q a qbM−1 .
Note that P( B( N ) = 0) = p aN and P( B( N ) ≥ M ) = 0, so we consider [ B(n) = m] below only for 1 ≤ m ≤
M − 1.
We say that a round starts when team A serves. A round has 1 rally when team A serves and wins. A round has 5 rallies when team A serves and loses, and team B then serves and wins 3 more rallies before losing the fifth rally.
The set starts with team A’s service, which starts the first round. Let Bi be the number of points won by team
B in the ith round. We have P(Bi = 0) = p_a and P(Bi = k) = q_a q_b^{k−1} p_b for k ≥ 1, but we rather write
P(Bi = 0) = p_a and P(Bi = k|Bi > 0) = q_b^{k−1} p_b for k ≥ 1.
Then we observe that Bi is a mixture of a Geometric random variable with success probability pb and mass
at zero:
Bi = { 0 with probability p_a; Geo(p_b) with probability q_a }.
Note that { Bi } is an iid sequence.
We have B(n) = ∑_{i=1}^{n} Bi. But we can take out the zero-valued Bi's from this sum. Let Y be the number of B1, B2, . . . , Bn that are positive. By conditioning on Y,
P(B(n) = m) = ∑_{r=0}^{n} P(B(n) = m|Y = r) P(Y = r) = ∑_{r=1}^{n} P(B(n) = m|Y = r) P(Y = r),
where the last equality follows from P( B(n) = m|Y = 0) = 0 for m ≥ 1. Since { Bi } is an iid sequence and
P( Bi > 0) = q a , we have Y ∼ Bin(n, q a ) and then
P(B(n) = m) = ∑_{r=1}^{n} P(B(n) = m|Y = r) C_r^n q_a^r p_a^{n−r}.
On the other hand, [B(n) = m|Y = r] happens whenever [∑_{i=1}^{r} Bi = m]. The event [∑_{i=1}^{r} Bi = m] happens when the mth trial yields the rth success in Bernoulli experiments with probability of success p_b. Hence, P(B(n) = m|Y = r) = P(∑_{i=1}^{r} Bi = m) = C_{r−1}^{m−1} p_b^r q_b^{m−r} and
P(B(n) = m) = ∑_{r=1}^{n} C_{r−1}^{m−1} p_b^r q_b^{m−r} C_r^n q_a^r p_a^{n−r} = q_b^m p_a^n ∑_{r=1}^{n} C_{r−1}^{m−1} C_r^n (p_b q_a/(q_b p_a))^r for m ≥ 1.
Inserting n = N above, we can obtain the pmf of B( N ). Then
P(F_A = N, F_B = m) = q_b^m p_a^N ∑_{r=1}^{N} C_{r−1}^{m−1} C_r^N (p_b q_a/(q_b p_a))^r for 1 ≤ m ≤ M − 1.
What remains is to find P( FA = n, FB = M) for 1 ≤ n ≤ N − 1. Conditioning on the event [ B(n) = m]
and noticing that P( FA = n, FB = M| B(n) = m) = 0 for m ≥ M,
P(F_A = n, F_B = M) = ∑_{m=0}^{M−1} P(F_A = n, F_B = M|B(n) = m) P(B(n) = m).
In the event [ FA = n, FB = M | B(n) = m], team A wins its nth point at which time team B has m points
and team B wins all the remaining points. Team B wins m + 1st point when team A is serving and wins the
remaining M − (m + 1) points while it is serving. Hence, P( FA = n, FB = M | B(n) = m) = q a qbM−m−1 , which
yields
P(F_A = n, F_B = M) = ∑_{m=0}^{M−1} q_a q_b^{M−m−1} P(B(n) = m) = q_a q_b^{M−1} P(B(n) = 0) + ∑_{m=1}^{M−1} q_a q_b^{M−m−1} P(B(n) = m)
= q_a q_b^{M−1} p_a^n [1 + ∑_{m=1}^{M−1} ∑_{r=1}^{n} C_{r−1}^{m−1} C_r^n (p_b q_a/(q_b p_a))^r] for 1 ≤ n ≤ N − 1,
which completes the pmf for the random vector [ FA , FB ].
Since team A starts serving, there is some asymmetry in the problem and we can compare the probability of team A winning the set to 50%: Consider symmetric (equally powerful) teams that win with the same probability p_a = q_b when they are serving and lose with the same probability q_a = p_b. Suppose that each team is s times more likely to win than to lose when serving, i.e., p_a/q_a = s and q_b/p_b = s. Then p_a = q_b = s/(1 + s) and q_a = p_b = 1/(1 + s).
P(First serving wins; N, s) = (s/(1 + s))^N [1 + ∑_{m=1}^{N−1} (s/(1 + s))^m ∑_{r=1}^{N} C_{r−1}^{m−1} C_r^N (1/s)^{2r}].
P(First serving wins; N, s) is shown in Figure 4 as N and s vary. If s = 2, the serving team has an advantage and is more likely to win the set, but this advantage tapers off as N increases. Interestingly, for s = 0.5, serving is not an advantage and the serving team is more likely to lose. If you watch volleyball games, you will see that the serving team basically hands over the ball to the other team for offense; serving does not seem to be an advantage and so s < 1. ⋄
Figure 4: Left Panel: Probability that the first serving team wins when ( N, s = 0.5), ( N, s = 1.0) and ( N, s =
2.0). Right Panel: Probability that the first serving team wins when ( N = 10, s), ( N = 20, s) and ( N = 30, s).
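The formula for P(First serving wins; N, s) can be evaluated directly; this R sketch implements the double sum and can be used to reproduce curves like those in Figure 4 (the function name is ours, not from the notes).

p_first_serving_wins <- function(N, s) {
  total <- 1                                  # the m = 0 term
  if (N > 1) for (m in 1:(N - 1)) {
    r <- 1:m                                  # C(m-1, r-1) vanishes for r > m
    total <- total + (s/(1 + s))^m * sum(choose(m - 1, r - 1) * choose(N, r) * (1/s)^(2*r))
  }
  (s/(1 + s))^N * total
}
p_first_serving_wins(25, 2)      # above 0.5: serving helps when s = 2
p_first_serving_wins(25, 0.5)    # below 0.5: serving hurts when s = 0.5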
4 Application: Password Security for Information Systems
Let us suppose that you have a password of length n made out of digits {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. Let us further suppose that your password is non-overlapping, i.e., the first k digits of your password are not the same as its last k digits for any 1 ≤ k < n. Some overlapping passwords are 11, 1212, 12312, 1234512, where the first password has 1 at both its beginning and its end and the others have 12 at their beginning and end. In general, if there is 1 ≤ k < n such that the first k digits x1 x2 . . . xk are the same as the last k digits xn−k+1 xn−k+2 . . . xn, then the password is overlapping. Any password that uses a digit i only as the first digit and nowhere else is a non-overlapping password. For example, x1 x2 x3 . . . is non-overlapping if x1 = i and xj ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} \ {i} for j ≥ 2.
Now let us consider a hacker who generates a random sequence X1 X2 X3 . . . , where the elements Xi of the sequence are iid with probabilities pk = P(Xi = k) for k ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. The hacker knows that your password has length n, so he takes substrings of the kind X j+1 X j+2 . . . X j+n for j ≥ 0 and enters these strings one by one to log in to your account. Your computer security system temporarily shuts down account access if a certain number of incorrect passwords are attempted. Both for your security and for the hacker's ability to break into your account, the critical quantity is the average number of password attempts required to guess your password. For K attempts, the hacker needs to create a random string of length N = K + n − 1.
Let N be the length of the hacker's string needed to guess your password x1 x2 . . . xn. If N = j + n, then N > j and X j+1 X j+2 . . . X j+n = x1 x2 . . . xn. Now let us consider the reverse implication: what can we conclude from N > j and X j+1 X j+2 . . . X j+n = x1 x2 . . . xn? For example, is N = j + 1 possible? For N = j + 1, we
need X j+2−n X j+3−n . . . X j+1 = x1 x2 . . . xn . Coupling this with X j+1 X j+2 . . . X j+n = x1 x2 . . . xn , we obtain x1 =
X j+1 = xn so the password is overlapping with k = 1. Is N = j + k possible for 1 ≤ k < n? For N = j + k, we
need X j+k−n+1 X j+k−n+2 . . . X j+k = x1 x2 . . . xn . Coupling this with X j+1 X j+2 . . . X j+n = x1 x2 . . . xn , we obtain
x1 x2 . . . xk = X j+1 X j+2 . . . X j+k = xn−k+1 xn−k+2 . . . xn so the password is overlapping with k ≥ 1. We know
that N ≤ j + n, so we must have N = j + n. In summary, if N > j and X j+1 X j+2 . . . X j+n = x1 x2 . . . xn , then
N = j + n.
The above paragraph yields the first equality below. The second equality below is from observing that
[ N > j] depends only on X1 . . . X j which are independent of X j+1 X j+2 . . . X j+n .
P(N = j + n) = P(N > j, X j+1 X j+2 . . . X j+n = x1 x2 . . . xn) = P(N > j) P(X j+1 X j+2 . . . X j+n = x1 x2 . . . xn) = P(N > j) ∏_{i=1}^{n} p_{x_i}.
The last equality can be summed over j ≥ 0:
1 = ∑_{j=0}^{∞} P(N = j + n) = ∏_{i=1}^{n} p_{x_i} ∑_{j=0}^{∞} P(N > j) = ∏_{i=1}^{n} p_{x_i} E(N).
Hence, we obtain
E(N) = 1/∏_{i=1}^{n} p_{x_i}.
Example: A hacker knows that UTD students are three times as likely to use odd digits in their passwords as even digits and accordingly creates a random sequence of digits to guess a 4-digit password. What is the expected length E(N_{1357}) of his random sequence to guess the password 1357? We have p1 = p3 = p5 = p7 = p9 = 0.75/5 = 0.15 and p2 = p4 = p6 = p8 = p0 = 0.25/5 = 0.05. Then E(N_{1357}) = 1/(0.15)^4 ≈ 1975. Similarly, E(N_{2468}) = 1/(0.05)^4 = 160,000. ⋄
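A small R simulation, with a shorter 2-digit non-overlapping password 13 so that the expected length stays modest, comparing E(N) with 1/(p1 p3); the digit probabilities mirror the example.

set.seed(12)
digits <- 0:9
probs <- ifelse(digits %% 2 == 1, 0.15, 0.05)      # odd digits three times as likely
guess_length <- function(pw) {                     # length of random string until pw first appears
  n <- length(pw); s <- integer(0)
  repeat {
    s <- c(s, sample(digits, 1, prob = probs))
    if (length(s) >= n && all(tail(s, n) == pw)) return(length(s))
  }
}
mean(replicate(2e4, guess_length(c(1, 3))))        # close to 1/(0.15 * 0.15) = 44.4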
5 Application: Multinomial Logit Model for Marketing
A population of consumers is presented with a set S of products to buy from and n = |S|. Each consumer buys the product that maximizes his own utility. The utility of product i to a consumer has deterministic and random parts:
Ui = ui + ξi,
where ui is known and ξi is a random variable with cdf F_ξ and pdf f_ξ. Product j has the same additive form for the utility with deterministic part uj and random part ξj, which is iid with ξi. In general, we assume that {ξi} is an iid sequence and constitutes the random parts of the utilities. We want to compute the probability pi that a consumer chooses product i.
There can be various ways of building consumer choice models. These models can generally be grouped into two categories: utility maximization models and utility satisficing models. The multinomial logit model is a utility maximization model where
[Choose product i ] = [Uj ≤ Ui for j ̸= i ] = [ξ j ≤ ui + ξ i − u j for j ̸= i ].
We can compute pi = P(Choose product i) by conditioning on the random part ξi:
pi = P(ξj ≤ ui + ξi − uj for j ≠ i) = ∫_{−∞}^{∞} ∏_{j=1, j≠i}^{n} F_ξ(ui − uj + x) f_ξ(x) dx.
Such integrals can be evaluated numerically when there are only a few products to consider.
To obtain a closed form expression, we let the random part of the utility be distributed by Gumbel (λ), so
Fξ ( x ) = exp(−e−λx ) and f ξ ( x ) = λe−λx exp(−e−λx ).
Inserting these into the pi expression above and then letting y = e^{−λx} so that dy = −λe^{−λx} dx:
pi = ∫_{−∞}^{∞} ∏_{j=1, j≠i}^{n} exp(−e^{−λ(ui−uj+x)}) λe^{−λx} exp(−e^{−λx}) dx = ∫_0^{∞} ∏_{j=1, j≠i}^{n} exp(−y e^{−λ(ui−uj)}) e^{−y} dy
= ∫_0^{∞} exp(−y(∑_{j=1, j≠i}^{n} e^{−λ(ui−uj)} + 1)) dy = (−exp(−y(∑_{j=1, j≠i}^{n} e^{−λ(ui−uj)} + 1)) / (∑_{j=1, j≠i}^{n} e^{−λ(ui−uj)} + 1))|_{y=0}^{y=∞}
= 1/(∑_{j=1, j≠i}^{n} e^{−λ(ui−uj)} + 1) = e^{λui}/∑_{j=1}^{n} e^{λuj}.
To highlight the dependence of pi on the parameter λ of the Gumbel distribution, we can write pi(λ). Then pi(λ) is the probability that a consumer chooses product i and p1(λ) + p2(λ) + · · · + pn(λ) = 1. When there are N consumers present, we can consider the random vector of choices [N1, N2, . . . , Nn] with N1 + · · · + Nn = N, where Ni is the number of consumers choosing product i. The vector of choices has a multinomial distribution with parameters N and p1(λ), p2(λ), . . . , pn(λ).
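A hedged R check of the closed form: Gumbel(λ) noise, sampled by inverting its cdf, is added to assumed deterministic utilities and the empirical choice frequencies are compared with e^{λui}/∑_j e^{λuj}.

set.seed(13)
u <- c(1.0, 0.5, 0.2); lambda <- 2; n <- length(u)     # assumed utilities and scale
rgumbel <- function(m) -log(-log(runif(m)))/lambda     # inverse of F(x) = exp(-exp(-lambda x))
choices <- replicate(2e5, which.max(u + rgumbel(n)))
rbind(simulated = tabulate(choices, n)/2e5,
      formula   = exp(lambda*u)/sum(exp(lambda*u)))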
Example: Let ξ i′ have a shifted Gumbel distribution with cdf Fξ ′ ( x ) = exp(−e−λ( x−µ) ). Then P(ξ i′ ≤ a) =
exp(−e−λ(a−µ) ) = P(ξ i ≤ a − µ) = P(ξ i + µ ≤ a) and ξ i′ = ξ i + µ. In particular, P(ξ ′j ≤ ui + ξ i′ − u j for j ̸= i )
= P(ξ j + µ ≤ ui + ξ i + µ − u j for j ̸= i ) = P(ξ j ≤ ui + ξ i − u j for j ̸= i ). Hence, shifting the Gumbel random
variable does not alter the choice probability pi (λ). ⋄
Example: Let ξ have a Gumbel distribution with cdf F_ξ(x) = exp(−e^{−λx}) or pdf f_ξ(x) = exp(−e^{−λx}) e^{−λx} λ. Find E(ξ), E(ξ^2) and V(ξ).
E(ξ^n) = ∫_{−∞}^{∞} x^n exp(−e^{−λx}) e^{−λx} λ dx = ∫_{−∞}^{∞} (z/λ)^n exp(−e^{−z}) e^{−z} dz    [z = λx]
= (1/λ)^n ∫_{−∞}^{∞} z^n exp(−e^{−z}) e^{−z} dz = (1/λ)^n ∫_0^{∞} (−ln y)^n exp(−y) dy    [y = e^{−z}]
= γ^n/λ^n + 1I_{n=2} π^2/(6λ^2) for n = 1, 2,
where the last equality is from integration tables and γ = lim_{k→∞}(∑_{i=1}^{k}(1/i) − ln k) = 0.57721566 is Euler's constant. We have E(ξ) = γ/λ and E(ξ^2) = γ^2/λ^2 + π^2/(6λ^2), so V(ξ) = γ^2/λ^2 + π^2/(6λ^2) − γ^2/λ^2 = π^2/(6λ^2). ⋄
From the last two examples, we conclude
P(ξ ≤ x) = exp(−e^{−(λx+γ)})   ⟹   E(ξ) = 0, V(ξ) = π^2/(6λ^2) and pi(λ) = e^{λui}/∑_{j=1}^{n} e^{λuj}.
That is, the choice probabilities above do not change by shifting the utility to bring its expected value to zero.
6 Solved Exercises
7 Exercises
1. Let the joint pdf of [X, Y] be f_{X,Y}(x, y) = ax^{−b} for 0 ≤ y ≤ x ≤ 1 and b < 2.
a) Find parameter a in terms of parameter b to make f X,Y a legitimate density.
b) Find f Y |X (y| x ) and E(Y | X = x ).
c) Find E(Y ), and compute it as b → 2 from left and b → −∞.
2. Suppose that X1 , X2 , . . . , Xn are iid with cdf F and let Yjn = minnj { X1 , X2 , . . . , Xn }. Find P( X1 ≤ Yjn )
and P( X2 ≤ Yjn ), should they be different or the same?
3. Suppose that X1 , X2 , . . . , Xn are iid with cdf F and let Yjn = minnj { X1 , X2 , . . . , Xn }. Let us fix y ≥ 0 to
obtain the double indexed sequence anj = P(Yjn ≤ y).
a) By using conditioning arguments, we can then write for n > j
a_j^n = p a_{j−1}^{n−1} + (1 − p) a_j^{n−1}
for appropriate values of p that can depend only on n, j, y and F or a subset thereof. Find and justify
the value of p.
ANSWER a_j^n = P(min_j^n{X1, X2, . . . , Xn} ≤ y) = P(Xn ≤ y) P(min_j^n{X1, X2, . . . , Xn} ≤ y|Xn ≤ y) + P(Xn > y) P(min_j^n{X1, X2, . . . , Xn} ≤ y|Xn > y) = F(y) P(min_{j−1}^{n−1}{X1, X2, . . . , Xn−1} ≤ y) + (1 − F(y)) P(min_j^{n−1}{X1, X2, . . . , Xn−1} ≤ y) = F(y) a_{j−1}^{n−1} + (1 − F(y)) a_j^{n−1}. We note that P(Y_j^n ≤ y|Xn ≤ y) = P(Y_{j−1}^{n−1} ≤ y): if the jth order statistic out of n observations is less than or equal to y and Xn ≤ y is removed from the observations, the (j − 1)st order statistic out of the remaining n − 1 observations becomes less than or equal to y. Similarly, P(Y_j^n ≤ y|Xn ≥ y) = P(Y_j^{n−1} ≤ y): if the jth order statistic out of n observations is less than or equal to y and Xn ≥ y is removed from the observations, the jth order statistic out of the remaining n − 1 observations becomes less than or equal to y. So p = F(y).
b) Let us fix y so that we can drop it from the notation. Note that a_1^1 = F, a_1^2 = F(2 − F) and a_2^2 = F^2. Find a_1^3, a_1^4 by using the inclusion-exclusion formula and find a_3^3, a_4^4 by using the properties of the minimum of 3 or 4 random variables. Use a_j^n = p a_{j−1}^{n−1} + (1 − p) a_j^{n−1} to compute a_2^3, a_2^4, a_3^4. Therefore, when you are done, you have 10 probabilities {a_j^n : 1 ≤ j ≤ n ≤ 4}.
ANSWER a_1^3 = P(X1 ≤ y or X2 ≤ y or X3 ≤ y) = 3P(X1 ≤ y) − 3P(X1, X2 ≤ y) + P(X1, X2, X3 ≤ y) = 3F − 3F^2 + F^3. a_1^4 = P(X1 ≤ y or X2 ≤ y or X3 ≤ y or X4 ≤ y) = 4P(X1 ≤ y) − 6P(X1, X2 ≤ y) + 4P(X1, X2, X3 ≤ y) − P(X1, X2, X3, X4 ≤ y) = 4F − 6F^2 + 4F^3 − F^4. a_3^3 = P(X1 ≤ y, X2 ≤ y, X3 ≤ y) = F^3, a_4^4 = P(X1 ≤ y, X2 ≤ y, X3 ≤ y, X4 ≤ y) = F^4. Using the formula from a), a_2^3 = F a_1^2 + (1 − F) a_2^2 = F^2(2 − F) + (1 − F)F^2 = 3F^2 − 2F^3. a_2^4 = F a_1^3 + (1 − F) a_2^3 = F(3F − 3F^2 + F^3) + (1 − F)F^2(3 − 2F) = F^2(3F^2 − 8F + 6). a_3^4 = F a_2^3 + (1 − F) a_3^3 = F^3(3 − 2F) + (1 − F)F^3 = F^3(4 − 3F).
4. Suppose that X1 , X2 , . . . , Xn are iid with cdf F and let Yjn = minnj { X1 , X2 , . . . , Xn }. Let us fix y ≥ 0 to
obtain the double indexed sequence anj = P(Yjn ≤ y).
a) By using conditioning arguments, we can write for n ≥ j
a_j^n = q a_{j+1}^{n+1} + (1 − q) a_j^{n+1}
for appropriate values of q that can depend only on n, j, y and F or a subset thereof. Find and justify
the value of q.
ANSWER We have
Y_j^n = { Y_{j+1}^{n+1} if X_{n+1} ≤ Y_j^{n+1}; Y_j^{n+1} if X_{n+1} ≥ Y_{j+1}^{n+1} }.
To see this, consider n + 1 many Xi's and the relative position of X_{n+1}. If X_{n+1} is one of the first j Xi's, removing it makes the (j + 1)st order statistic the jth order statistic: Y_{j+1}^{n+1} = Y_j^n. If X_{n+1} is not one of the first j Xi's, i.e., X_{n+1} ≥ Y_{j+1}^{n+1}, removing it maintains the jth order statistic: Y_j^{n+1} = Y_j^n. Then
a_j^n = P(Y_j^n ≤ y) = P(X_{n+1} ≤ Y_j^{n+1}) P(Y_j^n ≤ y|X_{n+1} ≤ Y_j^{n+1}) + P(X_{n+1} ≥ Y_{j+1}^{n+1}) P(Y_j^n ≤ y|X_{n+1} ≥ Y_{j+1}^{n+1})
= (j/(n + 1)) P(Y_{j+1}^{n+1} ≤ y) + (1 − j/(n + 1)) P(Y_j^{n+1} ≤ y) = (j/(n + 1)) a_{j+1}^{n+1} + (1 − j/(n + 1)) a_j^{n+1}.
So q = j/(n + 1).
b) Verify your formula for a_2^3 by writing it in terms of a_3^4 and a_2^4.
ANSWER (2/4) a_3^4 + (1 − 2/4) a_2^4 = (4F^3 − 3F^4)/2 + F^2(3F^2 − 8F + 6)/2 = F^2(3 − 2F) = a_2^3, which validates the formula. Although not asked, you can also check (1/4) a_2^4 + (1 − 1/4) a_1^4 = F^2(3F^2 − 8F + 6)/4 + 3(4F − 6F^2 + 4F^3 − F^4)/4 = 3F − 3F^2 + F^3 = a_1^3 or (3/4) a_4^4 + (1 − 3/4) a_3^4 = 3F^4/4 + (4F^3 − 3F^4)/4 = F^3.
5. Two perfectly shuffled decks of 52 cards are put side by side and facing down on a table. We draw
one card from the top of each deck and check to see if the cards are identical. The drawn cards are not
returned to the decks. We repeat this 52 times to finish both decks. What is the probability of coming
across at least one draw where the two cards coming from the two decks are identical? What is the
probability of coming across exactly one draw with identical cards? What is the probability of coming
across exactly 51 draws with identical cards?
6. Two perfectly shuffled decks of 52 cards are put side by side and facing down on a table. We draw
one card from the top of each deck and check to see if the cards are identical. The drawn cards are
returned to the decks and decks are reshuffled again. We repeat this 52 times. What is the probability
of coming across at least one draw where the two cards coming from the decks are identical? What
is the probability of coming across exactly one draw with identical cards? What is the probability of
coming across exactly 51 draws with identical cards?
7. A bipolar magnet has north (N) and south (S) ends. When two bipolar bar magnets are placed in
a row next to each other such that opposite ends (N-S or S-N) face each other, magnets attract each
other. Otherwise, they repel. That is (NS)-(NS) and (SN)-(SN) attract while (NS)-(SN) and (SN)-(NS)
repel. Suppose that each magnet is placed with random orientation with probability q = 1/2 for NS
and 1 − q = 1/2 for SN. When there are n = 2 magnets, the probability of having a single chain
due to attraction is 1/2 and the probability of having two chains due to repellency is 1/2. With n
magnets, let Xn be the number of chains due to attraction and repellency, so 1 ≤ Xn ≤ n. In particular,
P( X1 = 1) = 1, P( X2 = 1) = 1/2, P( X2 = 2) = 1/2. In general, we want to compute P( Xn = k) for
k = 1, . . . , n. To obtain a recursive equation for P( Xn = k), condition this probability on the orientation
of the last magnet placed when there are n − 1 magnets, which yields
$$P(X_n = k) = (?)\,P(X_{n-1} = k-1) + (?)\,P(X_{n-1} = k).$$
Find what the constants denoted by (?) above should be. Then solve the recursive equation to find an
expression for P( Xn = k) only in terms of n, k. You may guess the form of P( Xn = k ) and establish it
with an induction on n. Finally, find E( Xn ) and V( Xn ).
ANSWER The last magnet has orientation NS with probability 1/2. If $X_{n-1} = k-1$ and the last magnet repels, we get $X_n = k$. If $X_{n-1} = k$ and the last magnet attracts, we also get $X_n = k$. The probability that the last magnet repels is $1/2 = (1/2)(1/2) + (1/2)(1/2)$, obtained by conditioning on the orientation of the next-to-last magnet. Hence,
$$P(X_n = k) = \frac{1}{2}P(X_{n-1} = k-1) + \frac{1}{2}P(X_{n-1} = k).$$
You can obtain $P(X_3 = 1) = 1/4$, $P(X_3 = 2) = 2/4$, $P(X_3 = 3) = 1/4$, $P(X_4 = 1) = 1/8$, $P(X_4 = 2) = 3/8$, $P(X_4 = 3) = 3/8$, $P(X_4 = 4) = 1/8$ to guess that $P(X_n = k) = C_{k-1}^{n-1}/2^{n-1}$. This formula holds for $n = 1, 2, 3, 4$, so we can provide an induction proof for it. Suppose that $P(X_{n-1} = k) = C_{k-1}^{n-2}/2^{n-2}$ and use this in the recursive equation:
$$P(X_n = k) = \frac{1}{2}P(X_{n-1} = k-1) + \frac{1}{2}P(X_{n-1} = k) = \frac{1}{2}\frac{C_{k-2}^{n-2}}{2^{n-2}} + \frac{1}{2}\frac{C_{k-1}^{n-2}}{2^{n-2}} = \frac{C_{k-2}^{n-2} + C_{k-1}^{n-2}}{2^{n-1}} = \frac{C_{k-1}^{n-1}}{2^{n-1}},$$
where the last equality follows from Pascal's identity for combinations. This probability is also $P(X_n = k) = P(\mathrm{Bin}(n-1, 1/2) = k-1)$, so $X_n$ has the same distribution as $\mathrm{Bin}(n-1, 1/2) + 1$.
From the expected value and variance of a binomial random variable, we immediately obtain E( Xn ) =
(n − 1)/2 + 1 = (n + 1)/2 and V( Xn ) = (n − 1)(1/2)(1 − 1/2) = (n − 1)/4.
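A short simulation makes this distributional identity easy to sanity check. The R sketch below is not from the original notes (simulate_chains and the other names are illustrative): it places n magnets with independent random orientations, counts chains by counting adjacent repelling pairs, and compares the empirical pmf, mean and variance of $X_n$ with $C_{k-1}^{n-1}/2^{n-1}$, $(n+1)/2$ and $(n-1)/4$.
# Simulate X_n, the number of chains formed by n randomly oriented magnets.
set.seed(1);
n <- 10; reps <- 100000;
simulate_chains <- function(n){
  orient <- sample(c(0,1), n, replace=TRUE);  # 0 = NS, 1 = SN, each with probability 1/2
  # Same-orientation neighbors attract and stay in one chain; a new chain starts at every
  # adjacent pair with opposite orientations (which repel).
  1 + sum(diff(orient) != 0);
}
X <- replicate(reps, simulate_chains(n));
c(mean(X), (n+1)/2);                          # empirical vs. exact E(X_n)
c(var(X), (n-1)/4);                           # empirical vs. exact V(X_n)
empirical <- table(factor(X, levels=1:n))/reps;
exact <- dbinom(0:(n-1), n-1, 1/2);           # P(X_n = k) = P(Bin(n-1,1/2) = k-1)
round(rbind(empirical, exact), 4);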
8. A party has n guests, each of whom came with their distinct hats and umbrellas. Upon arrival, each guest leaves his hat and umbrella at the concierge, takes two numbers, one for his hat and another for his umbrella, and joins the party. After all guests arrive, the numbering system breaks down. At their departure time, each guest is randomly presented with an umbrella and a hat. Both hat and umbrella may be correct, either one may be correct or neither may be correct. Let $X_i^h \in \{0,1\}$ and $X_i^u \in \{0,1\}$ be indicator variables taking the value of 1 when the hat and the umbrella, respectively, are selected correctly for guest i.
a) Find $E(X_i^h)$ and $E(X_i^u)$.
b) Let $X_h^n = X_1^h + X_2^h + \cdots + X_n^h$ and $X_u^n = X_1^u + X_2^u + \cdots + X_n^u$. Express $X_h^n$ and $X_u^n$ in English.
c) Let $X^n$ be the number of guests, out of n total guests, who are presented with both the correct hat and the correct umbrella so that they can depart. Express $X^n$ using the X variables defined above. Find $E(X^n)$.
9. After we study the Ballot and Bankruptcy problems and have found $P_{r,p} = (r-p)/(r+p)$ and $B_{r,p} = 1 - p/(r+1)$ for $r > p$, we realize that
$$P_{r,p} = \frac{r}{r+p}\,B_{r-1,p}.$$
This equality can have a probabilistic explanation based on a conditioning argument, because $0 \le r/(r+p) \le 1$ and both $P_{r,p}$ and $B_{r-1,p}$ are probabilities. Furnish such an explanation and provide the event used in the conditioning argument.
10. We want to find the choice probability $p_i(\lambda)$ when the information about the utility is maximum (zero variance) and minimum (infinite variance). To maximize (minimize) the information, we can manipulate $\lambda$ appropriately and monitor the change in the variance of the utility. What are the choice probabilities in the zero-variance and infinite-variance cases? Are these probabilities what you expect? Comment.
11. a) Consider the probability $P(X^n = 0)$ of no matching among n pairs. First show that $P(X^n = 0) \le 1/2$.
ANSWER From the main body, we have for n even
$$P(X^n = 0) = \frac{1}{2!} - \left(\frac{1}{3!} - \frac{1}{4!}\right) - \left(\frac{1}{5!} - \frac{1}{6!}\right) - \cdots - \left(\frac{1}{(n-1)!} - \frac{1}{n!}\right) \le \frac{1}{2}.$$
The terms in the parentheses are all positive. Similarly, for n odd,
$$P(X^n = 0) = \frac{1}{2!} - \left(\frac{1}{3!} - \frac{1}{4!}\right) - \left(\frac{1}{5!} - \frac{1}{6!}\right) - \cdots - \left(\frac{1}{(n-2)!} - \frac{1}{(n-1)!}\right) - \frac{1}{n!} \le \frac{1}{2}.$$
So $P(X^n = 0) \le 1/2$.
b) Use $P(X^n = 0) \le 1/2$ to obtain the following bound on $E(R_n)$:
$$E(R_n) \le 2 + \max_{1 \le i \le n-1} E(R_i).$$
ANSWER $1 - P(X^n = 0) \ge 1/2$, or $(1 - P(X^n = 0))^{-1} \le 2$. We have $(1 - P(X^n = 0))\,E(R_n) = 1 + \sum_{i=1}^n E(R_{n-i})\,P(X^n = i)$, so
$$E(R_n) = \frac{1}{1 - P(X^n = 0)} + \sum_{i=1}^n E(R_{n-i})\,\frac{P(X^n = i)}{P(X^n \ge 1)} = \frac{1}{1 - P(X^n = 0)} + \sum_{i=1}^n E(R_{n-i})\,P(X^n = i \mid X^n \ge 1) \le 2 + \max_{1 \le i \le n-1} E(R_i),$$
where the inequality uses $(1 - P(X^n = 0))^{-1} \le 2$ and the fact that the conditional probabilities $P(X^n = i \mid X^n \ge 1)$ sum to one.
c) Use induction to prove that a sequence of $c_n \in \Re$ such that $c_1 = 1$ and $c_n \le 2 + \max_{1 \le i \le n-1} c_i$ must satisfy $c_n \le 2n$. Hence conclude that $E(R_n) \le 2n$; it cannot grow faster than linearly.
ANSWER To start the induction, note that $c_1 = 1 \le 2$. We assume $c_i \le 2i$ for $i \le n-1$. Then $c_n \le 2 + \max_{1 \le i \le n-1} c_i \le 2 + 2(n-1) = 2n$. This completes the induction.
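The bound can also be checked numerically once $P(X^n = i)$ is available (see the matching appendix below). Assuming, as in the main body, that $R_n$ is the number of rounds of the random coupling process with $E(R_0) = 0$, the recursion $(1 - P(X^n = 0))E(R_n) = 1 + \sum_{i=1}^n E(R_{n-i})P(X^n = i)$ can be iterated directly; the R sketch below (pmatch_prob and ER are illustrative names, not from the notes) does so and confirms $E(R_n) \le 2n$ for small n.
# P(X^n = k): probability of exactly k matches among n pairs (see the matching appendix).
pmatch_prob <- function(n, k){
  if (k > n) return(0);
  sum((-1)^(0:(n-k)) / factorial(0:(n-k))) / factorial(k);
}
# Iterate (1 - P(X^n=0)) E(R_n) = 1 + sum_{i=1}^n E(R_{n-i}) P(X^n = i), with E(R_0) = 0.
n_max <- 12;
ER <- numeric(n_max + 1);                     # ER[m+1] stores E(R_m); ER[1] = E(R_0) = 0
for (n in 1:n_max){
  s <- 0;
  for (i in 1:n){ s <- s + ER[n - i + 1] * pmatch_prob(n, i); }
  ER[n + 1] <- (1 + s) / (1 - pmatch_prob(n, 0));
}
cbind(n = 1:n_max, ER = ER[-1], bound = 2*(1:n_max));   # E(R_n) stays below the bound 2n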
Appendix: Sum of a Particular Sequence
We want to find $\sum_{n=0}^{\infty} n 2^{-n-1}$. We first consider a finite sum:
$$\sum_{n=0}^{K} n 2^{-n-1} = \frac{(0)2^K + (1)2^{K-1} + \cdots + (K)2^0}{2^{K+1}} = \frac{(2^0 + \cdots + 2^{K-1}) + (2^0 + \cdots + 2^{K-2}) + \cdots + (2^0)}{2^{K+1}}$$
$$= \frac{(2^K - 1) + (2^{K-1} - 1) + \cdots + (2^1 - 1)}{2^{K+1}} = -\frac{K}{2^{K+1}} + \frac{1}{2}\left(2^0 + 2^{-1} + \cdots + 2^{-(K-1)}\right)$$
$$= -\frac{K}{2^{K+1}} + \frac{1}{2}\,\frac{1 - 1/2^K}{1 - 1/2} = 1 - \frac{K}{2^{K+1}} - \frac{1}{2^K}.$$
Now taking the limit as $K \to \infty$ on both sides yields
$$\sum_{n=0}^{\infty} n 2^{-n-1} = 1 - \lim_{K \to \infty} \frac{K}{2^{K+1}} - \lim_{K \to \infty} \frac{1}{2^K} = 1.$$
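As a quick numerical sanity check of the closed form above, the R sketch below (partial_sum and closed_form are illustrative names) compares the partial sums with $1 - K/2^{K+1} - 1/2^K$ for several values of K and shows the convergence to 1.
# Compare the partial sums of n*2^(-n-1) with the closed form 1 - K/2^(K+1) - 1/2^K.
partial_sum <- function(K) sum((0:K) * 2^(-(0:K) - 1));
closed_form <- function(K) 1 - K/2^(K+1) - 1/2^K;
K_values <- c(2, 5, 10, 20, 40);
cbind(K = K_values,
      partial = sapply(K_values, partial_sum),
      closed  = sapply(K_values, closed_form));   # the two columns agree and approach 1 as K grows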
Appendix: Probability and Number of Matches in Random Coupling of Pairs
We want to find the probability $P_0^n$ of no matches when n people pick up their mixed-up hats. We can look for $1 - P_0^n$, the probability that at least one person selects his own hat. Let $X_i$ be the indicator that person i selects his own hat for $1 \le i \le n$. We have $P(X_i = 1) = (n-1)!/n!$, where $(n-1)!$ is the number of permutations of the hats of the $n-1$ people left after person i selects his own hat. Similarly, $P(X_i X_j = 1) = (n-2)!/n!$, where $(n-2)!$ is the number of permutations of the hats of the $n-2$ people left after people i and j select their own hats. We also have $P(\prod_{i=1}^k X_i = 1) = (n-k)!/n!$.
Using the inclusion-exclusion formula, we have
$$1 - P_0^n = P\left(\bigcup_{i=1}^n [X_i = 1]\right) = \sum_{i=1}^n P(X_i = 1) - \sum_{i=1}^n \sum_{j=1}^{i-1} P(X_i X_j = 1) + \sum_{i=1}^n \sum_{j=1}^{i-1} \sum_{k=1}^{j-1} P(X_i X_j X_k = 1) + \cdots + (-1)^{n+1} P\left(\prod_{i=1}^n X_i = 1\right)$$
$$= C_1^n \frac{(n-1)!}{n!} - C_2^n \frac{(n-2)!}{n!} + C_3^n \frac{(n-3)!}{n!} + \cdots + (-1)^{n+1} C_n^n \frac{(n-n)!}{n!} = 1 - \frac{1}{2!} + \frac{1}{3!} + \cdots + (-1)^{n+1} \frac{1}{n!}.$$
Then it immediately follows that
$$P_0^n = \frac{1}{2!} - \frac{1}{3!} + \cdots + (-1)^n \frac{1}{n!} = \sum_{i=0}^n (-1)^i \frac{1}{i!}.$$
From this probability, we can obtain the number of derangements, which are permutations of n hats such
that no hat matches its owner. The number of derangements is $n! \sum_{i=0}^n (-1)^i \frac{1}{i!}$. We can let $M_k^n$ denote the number of permutations (of n hats) such that exactly k elements match:
$$M_k^n = C_k^n (n-k)! \sum_{i=0}^{n-k} (-1)^i \frac{1}{i!} = \frac{n!}{(n-k)!\,k!} (n-k)! \sum_{i=0}^{n-k} (-1)^i \frac{1}{i!} = \frac{n!}{k!} \sum_{i=0}^{n-k} (-1)^i \frac{1}{i!}.$$
For example, when n = 3 and people show up in permutation [1, 2, 3], the hats can be in 1 permutation [1, 2, 3] where 3 hats match, in 3 permutations [1, 3, 2], [3, 2, 1], [2, 1, 3] where 1 hat matches, and in 2 permutations [3, 1, 2], [2, 3, 1] where no hat matches. Note that $M_3^3 = 1$, $M_2^3 = 0$, $M_1^3 = 3$ and $M_0^3 = 2$, and $P_3^3 = 1/3!$, $P_1^3 = 3/3!$ and $P_0^3 = 2/3!$.
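These counts are easy to verify by brute force for small n. The R sketch below (all_perms, M_formula and the other names are illustrative, not from the notes) enumerates all permutations of n hats, counts exact matches directly, and compares the counts with $M_k^n = \frac{n!}{k!} \sum_{i=0}^{n-k} (-1)^i \frac{1}{i!}$; for n = 3 it reproduces $M_3^3 = 1$, $M_2^3 = 0$, $M_1^3 = 3$ and $M_0^3 = 2$.
# Enumerate permutations of 1:n, count exact matches, and compare with M_k^n.
all_perms <- function(v){                     # small recursive permutation generator in base R
  if (length(v) <= 1) return(list(v));
  out <- list();
  for (i in seq_along(v)){
    for (rest in all_perms(v[-i])) out <- c(out, list(c(v[i], rest)));
  }
  out;
}
M_formula <- function(n, k) factorial(n)/factorial(k) * sum((-1)^(0:(n-k))/factorial(0:(n-k)));
n <- 4;
matches <- sapply(all_perms(1:n), function(p) sum(p == 1:n));   # fixed points of each permutation
brute <- sapply(0:n, function(k) sum(matches == k));
exact <- sapply(0:n, function(k) M_formula(n, k));
rbind(k = 0:n, brute = brute, exact = exact);  # the two rows agree; dividing by n! gives P_k^n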
R Code to Generate Figure 4
rm(list=ls(all=TRUE));
# Grids of s and N values used in Figure 4.
s_step <- 0.1; s_lower <- 0.1; s_upper <- 2; s_vector <- seq(s_lower,s_upper,s_step);
N_step <- 1; N_lower <- 1; N_upper <- 30; N_vector <- seq(N_lower,N_upper,N_step);
Prob <- matrix(1,0,length(s_vector));              # rows will be indexed by N, columns by s
for (N in N_vector){
  Probrow <- matrix(1,1,0);
  for (s in s_vector){
    sdivs <- s/(1+s);
    Probtemp <- 1;                                 # accumulates the outer sum over m
    if (N>1){
      for (m in 1:(N-1)){
        ProbTemp <- 0;                             # accumulates the inner sum over r
        for (r in 1:N){ ProbTemp <- ProbTemp + choose(m-1,r-1) * choose(N,r) * s^(-2*r); }
        Probtemp <- Probtemp + sdivs^m * ProbTemp;
      }
    }
    Probrow <- cbind(Probrow, sdivs^N * Probtemp); # probability that the first serving team wins for this (N, s)
  }
  Prob <- rbind(Prob, Probrow);
}
# Winning probability as a function of N for three values of s, and as a function of s for three values of N.
plot(N_vector,Prob[,10],type="l",xlab="N", ylab="Probability first serving team wins", col=1);
lines(N_vector,Prob[,5],col=2); lines(N_vector,Prob[,20],col=4);
plot(s_vector,Prob[20,],type="l",xlab="s", ylab="Probability first serving team wins", col=1);
lines(s_vector,Prob[10,],col=2); lines(s_vector,Prob[30,],col=4);