CS 70  Discrete Mathematics and Probability Theory
Spring 2016  Rao and Walrand
Probability Review  Jean Walrand
The objective of these notes is to enable you to check your understanding and knowledge of the probability
concepts studied in this course. These notes are not meant to be self-contained. They are mostly a check-list.
The lecture slides and lecture notes contain the details. You should try to invent problems and see if you can
solve them.
1 Probability Space
Review
The sample space Ω is the set of possible outcomes of some random experiment. The experiment selects
one outcome (only one!) ω ∈ Ω. The probability that it selects ω is Pr[ω]. For now (until Section 8), Ω is
either finite or countable (i.e., its elements can be numbered ω1 , ω2 , . . .). For each ω ∈ Ω, one has Pr[ω] ≥ 0
and ∑ω Pr[ω] = 1. A finite probability space is said to be uniform when all the outcomes have the same
probability 1/|Ω|.
An event is a subset A of Ω, i.e., a set of outcomes. One defines the probability of the event A as Pr[A] = ∑ω∈A Pr[ω]. Note that if A ∩ B = ∅, then Pr[A ∪ B] = Pr[A] + Pr[B]. By induction, if the events Am are
pairwise disjoint, then Pr[A1 ∪ · · · ∪ An] = Pr[A1] + · · · + Pr[An] for all finite n. One can also show that this
identity holds even for countably many events, i.e., Pr[∪m≥1 Am] = ∑m≥1 Pr[Am] if the events Am are pairwise disjoint. One
says that probability is countably additive.
By induction, one also sees that for arbitrary events Am one has Pr[A1 ∪ · · · ∪ An ] ≤ Pr[A1 ] + · · · + Pr[An ].
This inequality is called the union bound. Also, for any two events A and B one has Pr[A ∪ B] = Pr[A] +
Pr[B] − Pr[A ∩ B]. This is called the inclusion-exclusion identity.
Symmetry is a powerful argument to determine that two events have the same probability. For instance, say
that there is a bag with 100 red balls and 200 blue balls. You pick five balls without replacement. The
probability that the fifth ball is red is 1/3. Indeed, this is the same as the probability that the first ball is red.
To see this, think of arranging the 300 balls in a random order (permutation), so that all the permutations are
equally likely. They remain so if you interchange the first and fifth ball. As another example, you deal five
cards from a well-shuffled 52-card deck. The probability that the third card is red is 1/2. The probability
that the fourth card is an ace is 4/52 = 1/13. Of course, the likelihood that the fourth card is an ace depends
on the first three cards. Say that there is only one specific chocolate that you like in a box that you share
with some friends by everyone in turn picking one chocolate randomly uniformly and without replacement.
If you are the last one to pick, you might think that you are less likely to get your favorite chocolate than if
you are first. Not so, by symmetry.
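These notes contain no code, but the symmetry argument is easy to test by simulation. The sketch below is our addition (trial count and seed are arbitrary); it estimates the probability that the fifth ball drawn is red.

```python
import random

def fifth_ball_red(trials=100_000, seed=0):
    """Estimate Pr[5th ball is red] when drawing 5 balls without
    replacement from a bag of 100 red and 200 blue balls."""
    rng = random.Random(seed)
    balls = ["R"] * 100 + ["B"] * 200
    reds = 0
    for _ in range(trials):
        # rng.sample draws without replacement; look at the 5th ball drawn
        if rng.sample(balls, 5)[4] == "R":
            reds += 1
    return reds / trials

print(fifth_ball_red())   # close to 1/3, as symmetry predicts
```

By symmetry, changing the index [4] to [0] (the first ball) does not change the answer.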
Examples
1. Let Ω = {1, 2, 3, 4} be a uniform probability space. For A = {1, 2, 3} and B = {3, 4}, one finds that
Pr[A] = 3/4, Pr[B] = 1/2, Pr[A∩B] = 1/4, Pr[A\B] = 1/2, Pr[Ā] = 1/4, Pr[A∪B] = 1, Pr[A∆B] = 3/4.
2. Let Ω = {HH, HT, TH, TT} be a uniform probability space. This probability space describes the random
experiment “flipping a fair coin twice.” Note that the outcome of the experiment is one of the elements
of Ω. It would be fundamentally wrong to say that this experiment is described by Ω = {H, T } where
one picks two elements of Ω when one performs the experiment. It is wrong because we want to
describe how the coin flips are related. For instance, if we glue the two coins together so that they
both land on the same face, then the probability space becomes the uniform space {HH, T T }. We
lose this relationship if we think of Ω = {H, T } and we select two outcomes. Hence, it is essential
that ω describes the full outcome of the complete experiment, i.e., each ω describes the two coin flips.
Make sure you understand this point. What would the probability space be if you flipped the fair coin
100 times?
3. Let Ω = {H, T} with Pr[H] = 0.6 and Pr[T] = 0.4. This probability space describes the random experiment "flipping a biased coin once."
4. Let Ω = {1, 2, . . .} with Pr[n] = (1 − p)^(n−1) p for n ≥ 1. This probability space describes the random
experiment “flipping a biased coin until it yields the first H.”
2 Conditional Probability and Independence
Review
For two events A and B in Ω, one defines the conditional probability of A given B as Pr[A|B] := Pr[A ∩
B]/Pr[B]. We say that A and B are independent if Pr[A ∩ B] = Pr[A]Pr[B]. Three events A, B,C are mutually
independent if they are independent two by two and if, in addition, Pr[A ∩ B ∩C] = Pr[A]Pr[B]Pr[C]. More
generally, the events {Ai , i ∈ I} are mutually independent if
Pr[Ai1 ∩ Ai2 ∩ · · · ∩ Ain ] = Pr[Ai1 ] × · · · × Pr[Ain ]
for all n ≥ 2 and any {i1, . . . , in} ⊂ I. We say that the events A and B are positively (resp., negatively)
correlated if Pr[A ∩ B] > Pr[A]Pr[B] (resp., Pr[A ∩ B] < Pr[A]Pr[B]). Thus, if A and B are positively correlated, then Pr[B|A] > Pr[B]. Note that this does not imply that A causes B. For instance, one also has
Pr[A|B] > Pr[A], so is it A that causes B or B that causes A? In fact, it may be neither: the events could have
a common cause, as when people who drive a Tesla are more likely to own a vacation home, and vice versa, simply because both are more likely for wealthy people.
Examples
1. Let Ω = {1, 2, 3, 4} be a uniform probability space. Let also A = {1, 2, 3} and B = {3, 4}. Then
Pr[A|B] = 1/2, Pr[B|A] = 1/3. Here, A and B are negatively correlated.
2. Roll a balanced six-sided die. The probability that the outcome is 6 given that it is larger than 4 is
1/2.
3. A couple has two kids, at least one of which is a girl. The probability that they have two girls is
Pr[GG]/Pr[GB, BG, GG] = 1/3. Perhaps surprisingly, it is not 1/2.
4. Roll a balanced six-sided die twice. Let A be the event that the first roll yields m for some m ∈
{1, . . . , 6} and B the event that the second roll yields n for some n in {1, . . . , 6}. Then A and B are
independent. We say that the two rolls are independent. (See Section 4.)
5. Let Ω = {1, . . . , 8} be a uniform probability space. Let A = {1, 2, 3, 4}, B = {4, 5},C = {1, 2, 5, 6}.
Then A, B, C are pairwise independent, but not mutually independent. Note that A, B, C would not be pairwise independent if the uniform probability space were {1, . . . , 6}.
6. Let Ω = {1, . . . , 8} be a uniform probability space. Then A = {1, 2, 3, 4}, B = {2, 4, 6, 8},C = {2, 3, 6, 7}
are mutually independent.
7. Let {Ai , i ∈ I} be mutually independent. Let also J and K be disjoint finite subsets of I. Define E to
be an event obtained by performing set operations on the events {Ai , i ∈ J} and F an event obtained
by performing set operations on the events {Ai , i ∈ K}. For instance, E = (A1 ∪ A2 ) \ (A3 ∩ A4 ) and
F = (A5 ∆A6 ) \ A7 . Then E and F are independent. Here is a general proof, for arbitrary E and F.
First look at the case of three events A1, A2, A3. When you draw the Venn diagram, you find that there are
8 disjoint sets that make up Ω: B1 = A1 ∩ A2 ∩ A3, B2 = A1 ∩ A2^c ∩ A3^c, . . . , B8 = A1^c ∩ A2^c ∩ A3^c. These
events are of the form D1 ∩ D2 ∩ D3 where each Di is either Ai or Aci . Any event that you can create by
performing set operations on A1 , A2 , A3 is a union of some of the eight sets B j . More generally, define
the atoms Bn of {Ai , i ∈ J} to be all the events of the form Bn = ∩i∈J Di where each Di is either Ai or
Aci . Similarly, define the atoms Cm of {Ai , i ∈ K}. These atoms Bn are pairwise disjoint. Similarly, the
atoms Cm are pairwise disjoint. Also, each Bn is independent of every Cm . Moreover, E is a union of
atoms Bn and F is a union of atoms Cm . Thus, Pr[E ∩ F] = Pr[∪m,n (Bn ∩Cm )] = ∑m,n Pr[Bn ∩Cm ] =
∑m,n Pr[Bn ]Pr[Cm ] = (∑n Pr[Bn ])(∑m Pr[Cm ]) = Pr[E]Pr[F].
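One can check this conclusion on a concrete case (a check we added), using the mutually independent events of Example 6: build E from A and B only, F from C only, and verify Pr[E ∩ F] = Pr[E]Pr[F] by enumeration.

```python
from fractions import Fraction

Omega = set(range(1, 9))                  # uniform space of Example 6
A, B, C = {1, 2, 3, 4}, {2, 4, 6, 8}, {2, 3, 6, 7}

def pr(event):
    """Probability of an event in the uniform space Omega."""
    return Fraction(len(event), len(Omega))

E = A ^ B          # symmetric difference: a set operation on A, B only
F = Omega - C      # complement: a set operation on C only

assert pr(E & F) == pr(E) * pr(F)         # E and F are independent
print(pr(E), pr(F), pr(E & F))            # 1/2 1/2 1/4
```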
3 Bayes' Rule
Review
The setup is that Ω is a probability space, {A1, . . . , An} is a partition of Ω and B is some event. Then
Pr[Am|B] = Pr[Am]Pr[B|Am] / (∑_{k=1}^{n} Pr[Ak]Pr[B|Ak]),   m = 1, . . . , n.
The point of this formula is as follows. We think of the Am as possible pairwise exclusive circumstances
under which the symptom B can occur. One knows the prior probability Pr[Ak ] of every circumstance. One
also knows the conditional probability Pr[B|Ak ] that B occurs under circumstance Ak . Bayes’ Rule tells us
how to compute the likelihood Pr[Ak |B] that circumstance Ak is in effect given that B occurs. Think of Ak as
a disease and B as a symptom such as a fever, or a suspicious X-ray reading, or some abnormal value for a
test. One calls Pr[Am |B] the posterior probability of Am given B. Thus, Bayes’ Rule computes the posterior
probabilities Pr[Am |B] given the prior probabilities Pr[Am ] and the conditional probabilities Pr[B|Am ].
The event Am with the maximum value of Pr[Am |B] is said to be the Maximum A Posteriori (MAP) estimate
of the circumstance given the symptom, i.e., the most likely circumstance Am given that B occurs. The
event Am with the maximum value of Pr[B|Am ] is said to be the Maximum Likelihood Estimate (MLE) of
the circumstance given the symptom. The MLE and the MAP coincide if the priors are uniform, i.e., if
Pr[Am ] = 1/n for m = 1, . . . , n. Otherwise, they generally differ. Thus, the MAP is the most likely Am
given B while the MLE is the Am that makes B most likely. For instance, Ebola is the MLE circumstance of
diarrhea whereas food poisoning may be the MAP.
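Bayes' Rule is one line of code. The sketch below is our addition (the function name is ours); it returns the posteriors given the priors and the likelihoods, and the MAP is then the index of the largest posterior.

```python
def posteriors(priors, likelihoods):
    """Pr[A_m | B] from the priors Pr[A_m] and likelihoods Pr[B | A_m]."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                    # Pr[B], by the total probability rule
    return [j / total for j in joint]

# Example 1 below: coin equally likely fair or biased (Pr[H] = 0.7), one H seen.
post = posteriors([0.5, 0.5], [0.5, 0.7])
print(round(post[0], 2))                  # posterior that the coin is fair: 0.42
```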
Examples
1. A coin is equally likely to be fair or biased with Pr[H] = 0.7. You flip it once and get H. The posterior probability that the coin is fair is
   (0.5 × 0.5)/(0.5 × 0.5 + 0.5 × 0.7) ≈ 0.42.
Thus, the coin is a bit more likely to be biased since it produced one H. The probability that the next
coin flip yields H is then
0.42 × 0.5 + 0.58 × 0.7 ≈ 0.62.
If the first flip yields T , then you can verify that the likelihood that the second flip yields H is 0.575.
2. A coin is fair with probability 0.7, otherwise it is such that Pr[H] = 0.6. You flip the coin ten times and get 8 heads. The posterior probability that the coin is fair is
   [0.7 × (10 choose 8)(0.5)^8(0.5)^2] / [0.7 × (10 choose 8)(0.5)^8(0.5)^2 + 0.3 × (10 choose 8)(0.6)^8(0.4)^2] ≈ 0.46.
(The (10 choose 8) factors cancel.) Thus, observing so many heads makes the biased coin more likely than its prior probability 0.3 suggests: the posterior probability that the coin is biased is about 0.54. Consequently, the conditional probability that flip 11 yields heads is approximately 0.46 × 0.5 + 0.54 × 0.6 ≈ 0.55.
3. A coin is fair with probability 0.5, otherwise it is such that Pr[H] = 0.6. You flip the coin repeatedly
and it takes 10 flips until the first H. The posterior probability that it is fair is
   [0.5 × (0.5)^9(0.5)] / [0.5 × (0.5)^9(0.5) + 0.5 × (0.4)^9(0.6)] ≈ 0.86.
Thus, the conditional probability that the coin is fair is much larger than the prior probability 0.5
because it took so long to get the first H. Consequently, the probability that the next flip, i.e., flip 11,
yields heads is approximately 0.86 × 0.5 + 0.14 × 0.6 ≈ 0.51.
4. There are two envelopes. The first one contains five checks in the amounts of {1, 2, 5, 5, 6} and the
second contains four checks in the amounts {2, 5, 6, 8}. You pick an envelope at random and then pick
a check at random. Given that the check you got is in the amount of 5, the posterior probability that
you picked it from the first envelope is
   [0.5 × (2/5)] / [0.5 × (2/5) + 0.5 × (1/4)] ≈ 0.62.
Thus, if you are offered to either keep the envelope you selected or to switch to the other envelope,
you should switch because the second envelope contains more money.
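The envelope computation, and one way to make the "switch" advice quantitative, in code (our addition; we read "keep" as keeping the rest of your envelope and "switch" as taking the other, untouched envelope, the drawn 5 being yours either way):

```python
from fractions import Fraction

env1 = [1, 2, 5, 5, 6]
env2 = [2, 5, 6, 8]
half = Fraction(1, 2)

l1 = Fraction(env1.count(5), len(env1))       # Pr[draw a 5 | envelope 1] = 2/5
l2 = Fraction(env2.count(5), len(env2))       # Pr[draw a 5 | envelope 2] = 1/4
post1 = half * l1 / (half * l1 + half * l2)   # posterior of envelope 1
print(float(post1))                            # 8/13, about 0.615

keep = post1 * (sum(env1) - 5) + (1 - post1) * (sum(env2) - 5)
switch = post1 * sum(env2) + (1 - post1) * sum(env1)
assert switch > keep                           # switching wins on average
```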
4 Random Variables and Expectation
Review
A random variable is a real-valued function of the outcome of a random experiment. Thus, there is a probability space Ω with Pr[ω] defined for each ω ∈ Ω. This probability space defines the random experiment.
A random variable X is a function X : Ω → ℜ that assigns a real number X(ω) to each outcome ω ∈ Ω. For
A ⊂ ℜ, one defines Pr[X ∈ A] := Pr[X^(−1)(A)] where X^(−1)(A) := {ω ∈ Ω | X(ω) ∈ A} is the inverse image
of A under the function X. Similarly, Pr[X = a] := Pr[X^(−1)({a})]. The distribution of a random variable
X is the set of its possible values and their probability. This may seem abstract, so here is a concrete example. Say that a bag contains 100 balls. Among those, 23 are marked with the value 5 and the others are
marked with different values. Let ω be the ball you pick and X(ω) the value marked on the ball. Then
Pr[X = 5] = Pr[X^(−1)(5)] = 23/100. Here, X^(−1)(5) is the set of 23 balls marked with the value 5.
If X,Y are two random variables on Ω and g : ℜ2 → ℜ is a function, then g(X,Y ) is a new random variable
that assigns the real number g(X(ω),Y (ω)) to ω ∈ Ω.
The expected value of X is E[X] := ∑x xPr[X = x] = ∑ω X(ω)Pr[ω]. Thus, E[g(X,Y )] = ∑ω g(X(ω),Y (ω))Pr[ω].
Expectation is linear: E[aX + bY ] = aE[X] + bE[Y ].
Given an event A, one defines the conditional expectation of X given A as E[X|A] := ∑x xPr[X = x|A] =
∑ω X(ω)Pr[ω|A]. Assume that {A1 , . . . , An } is a partition of Ω. Then E[X] = ∑m E[X|Am ]Pr[Am ]. In particular, E[X] = ∑y E[X|Y = y]Pr[Y = y]. If we define the conditional expectation of X given Y as E[X|Y ] := g(Y )
where g(y) := E[X|Y = y], then we see that E[X] = E[E[X|Y ]].
One also finds that E[Xg(Y )|Y ] = g(Y )E[X|Y ], so that E[(X − E[X|Y ])g(Y )] = 0, an identity that shows that
the estimation error X − E[X|Y ] is orthogonal to any function g(Y ). This projection property implies that
E[(X − h(Y))²] is minimized by choosing h(Y) = E[X|Y]. We say that the conditional expectation E[X|Y]
is the Minimum Mean Squares Estimate (MMSE) of X given Y. Another way to appreciate this property is
to note that E[(X − h(Y))²] = ∑y Pr[Y = y]E[(X − h(y))²|Y = y] and that, for each y, E[(X − h(y))²|Y = y]
is minimized by choosing h(y) = E[X|Y = y], as we see by setting to zero the derivative with respect to a of
E[(X − a)²|Y = y] = E[X²|Y = y] − 2aE[X|Y = y] + a².
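A tiny discrete illustration of this projection property (our addition): roll one fair die, let X be the number of pips and Y = 1{X ≥ 4}. Then h(Y) = E[X|Y] has a smaller mean squared error than, say, the constant guess E[X] = 3.5.

```python
outcomes = range(1, 7)                  # one fair die
X = {w: w for w in outcomes}
Y = {w: int(w >= 4) for w in outcomes}

def cond_mean(y):
    """E[X | Y = y] by direct averaging over the uniform outcomes."""
    vals = [X[w] for w in outcomes if Y[w] == y]
    return sum(vals) / len(vals)

def mse(h):
    """E[(X - h(Y))^2] over the uniform space."""
    return sum((X[w] - h(Y[w])) ** 2 for w in outcomes) / 6

best = mse(cond_mean)                   # h(Y) = E[X | Y]: the MMSE
other = mse(lambda y: 3.5)              # constant guess E[X]
assert best < other
print(best, other)                      # 0.667 versus 2.917 (2/3 vs 35/12)
```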
We say that the random variables X and Y are independent if (and only if) Pr[X = x,Y = y] = Pr[X =
x]Pr[Y = y] for all x, y. Equivalently, they are independent if and only if Pr[X ∈ A,Y ∈ B] = Pr[X ∈ A]Pr[Y ∈
B] for all A, B ⊂ ℜ. If X and Y are independent, then g(X) and h(Y ) are independent for any functions g and
h. More generally, the random variables {Xi , i ∈ I} are mutually independent if Pr[Xi1 ∈ A1 , . . . , Xin ∈ An ] =
Pr[Xi1 ∈ A1 ] × · · · × Pr[Xin ∈ An ] for all finite n, all i1 , . . . , in ∈ I and all sets A1 , . . . , An in ℜ. If the random
variables Xi are mutually independent, then E[Xi1 Xi2 · · · Xin ] = E[Xi1 ] · · · E[Xin ]. The converse is not true.
Examples
1. Flip a biased coin with Pr[H] = p. Let X = 1 if the outcome is H and 0 otherwise. Then the distribution
of X is called Bernoulli with parameter p and one writes X = B(p). Thus, E[X] = p and E[X²] =
E[X] = p.
2. Let Xm for m = 1, . . . , n be i.i.d. B(p). Then X = X1 + . . . + Xn = B(n, p). Thus, E[X] = np.
3. Roll a balanced six-sided die once and let X be the number of pips. Then E[X] = ∑_{m=1}^{6} m(1/6) = 3.5.
4. Roll a balanced six-sided die twice. Then Ω = {1, . . . , 6}2 with Pr[ω] = 1/36 for all ω ∈ Ω. Let
X((a, b)) = a + b for (a, b) ∈ Ω. Then E[X] = 2 × 3.5 = 7, by linearity of expectation.
5. Same experiment as above. Let Y ((a, b)) = min{a, b} and Z((a, b)) = max{a, b} for (a, b) ∈ Ω. Then
(it may be useful to draw a picture) Pr[Y = 1] = 11/36, Pr[Y = 2] = 9/36, Pr[Y = 3] = 7/36, Pr[Y =
4] = 5/36, Pr[Y = 5] = 3/36, Pr[Y = 6] = 1/36. Also, Pr[Z = m] = Pr[Y = 7 − m], by symmetry.
Hence, E[Y] = 1(11/36) + 2(9/36) + · · · + 6(1/36) = 91/36 ≈ 2.53. Also, Y + Z is the total number
of pips on the two rolls, so that E[Y + Z] = 7 and it follows that E[Z] = 7 − E[Y] = 161/36 ≈ 4.47.
6. Pick a number uniformly in {1, . . . , n}. Then Ω = {1, . . . , n} with Pr[m] = 1/n for m ∈ Ω. Define
X(m) = 2 + 3m + 4m². Then, by linearity of expectation, E[X] = 2 + 3E[m] + 4E[m²] = 2 + 3(n + 1)/2 + 2(n + 1)(2n + 1)/3, using ∑_{m=1}^{n} m = n(n + 1)/2 and ∑_{m=1}^{n} m² = n(n + 1)(2n + 1)/6.
7. Let A ⊂ Ω and define X(ω) = 1{ω ∈ A} for ω ∈ Ω. The random variable X is called the indicator of
the event A. Sometimes we write it as 1A (ω).
8. Flip n times a biased coin with Pr[H] = p. Then Ω = {H, T}^n and Pr[ω] = p^m (1 − p)^(n−m) if ω has m
heads. Define X(ω) to be the number of heads in ω. Then Pr[X = m] = (n choose m) p^m (1 − p)^(n−m). This is the
binomial distribution and we write it as B(n, p).
9. Flip a biased coin with Pr[H] = p until you get a first H. Then Ω = {1, 2, . . .} and Pr[n] = (1 − p)^(n−1) p
for n ≥ 1. Let X(n) = n. The distribution of X is called the Geometric distribution with parameter p.
We designate it by G(p).
10. A monkey is typing away on a keyboard that has only the 26 letters. After he types 10^12 letters, let X be
the number of times that the word 'walrand' appears. Then E[X] = (10^12 − 6)(26)^(−7) ≈ 124. Indeed,
X = X1 + · · · + Xn with n = 10^12 − 6, where Xm is the indicator of the event that the word 'walrand'
appears starting with the m-th letter (a 7-letter word can start at any of the first 10^12 − 6 positions). One has E[Xm] = (26)^(−7) and E[X] = nE[X1], by linearity of
expectation.
11. An elevator loads n people on the ground floor of a building with k + 1 floors (including the ground
floor 0). Each person chooses one of the k other floors independently, with equal probabilities. The
probability that no one chooses a particular floor is then (1 − 1/k)n . Consequently, if Xm is the indicator that someone chose floor m, one sees that E[Xm ] = 1 − (1 − 1/k)n . Let X = X1 + · · · + Xk be
the number of different floors that people picked. Then, by linearity of expectation, E[X] = kE[X1 ] =
k[1 − (1 − 1/k)n ].
12. Let X = G(p), so that Pr[X = n] = (1 − p)^(n−1) p for n ≥ 1. Then E[X] = 1/p.
13. Let X = G(p) and Y = G(q) be independent. Then min{X,Y } = G(r) with r = 1 − (1 − p)(1 − q).
14. We say that X is Poisson with parameter λ > 0 if Pr[X = n] = (λ^n / n!) exp{−λ} for n ≥ 0. We
write X = P(λ ). Then E[X] = λ .
15. Let X = P(λ ) and Y = P(µ) be independent. Then X +Y = P(λ + µ).
16. Flip a coin. With probability p, the outcome is H and one defines Z = X; otherwise, one defines
Z = Y . Here, X and Y are two random variables that are independent of the coin flip. Find E[Z] and
var(Z) in terms of p and the mean and variance of X and Y .
17. Assume that E[Xn+1 |Xn ] = a+bXn for n ≥ 1. Taking expectation, we get E[Xn+1 ] = a+bE[Xn ]. Hence,
E[X2] = a + bE[X1], E[X3] = a + bE[X2] = a + b(a + bE[X1]) = a(1 + b) + b²E[X1]. Continuing in this
way, we find that E[Xn] = a(1 + b + · · · + b^(n−2)) + b^(n−1)E[X1] = a(1 − b^(n−1))/(1 − b) + b^(n−1)E[X1] (for b ≠ 1).
18. Diluting. There is a bin with 100 red balls. At step 1, we pick a ball from the bin and replace it with
a blue ball. We now have X1 = 99 red balls and 1 blue ball. We continue in this way and let Xn be
the number of red balls in the bin after n steps. At step n + 1, the likelihood that we pick a red ball is
Xn/100. Hence, one has E[Xn+1|Xn] = Xn − Xn/100 = 0.99Xn. Hence, E[Xn] = 100(0.99)^n for n ≥ 1.
19. Mixing. There is a bin with 100 red balls and one with 100 blue balls. At each step, one picks a ball
from each bin and puts it in the other bin. Let Xn be the number of red balls in the first bin at the end
of n steps. Then E[Xn+1|Xn] = Xn − (Xn/100)(Xn/100) + (1 − Xn/100)(1 − Xn/100) = 1 + bXn with
b = 49/50. Hence, E[Xn] = (1 − b^(n−1))/(1 − b) + b^(n−1) · 99.
20. Going Viral. Assume everyone on Twitter has a random number of friends that has mean µ. You
tweet a rumor to those friends. Each of them retweets independently with probability p to each of
their friends, and so on. Let Xn be the number of people who retweet the rumor at step n. Given
Xn = k and the number of friends Y1 , . . . ,Yk of these k people, we see that Xn+1 = B(Y1 + · · · +Yk , p).
Thus, E[Xn+1|Xn = k, Y1, . . . , Yk] = p(Y1 + · · · + Yk). Hence, E[Xn+1|Xn = k] = E[p(Y1 + · · · + Yk)] =
pkµ, i.e., E[Xn+1|Xn] = pµXn. Consequently, E[Xn+1] = pµE[Xn] and we conclude that E[Xn] = (pµ)^(n−1) for n ≥ 1
(because X1 = 1). If X = X1 + X2 + · · · , we see that E[X] < ∞ if and only if pµ < 1.
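Several of the computations above can be verified by brute force; the following check of Example 5 (our addition) enumerates the 36 equally likely outcomes exactly.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))    # the 36 outcomes of two rolls

def pmf(f):
    """Exact distribution of f over the uniform space of two rolls."""
    counts = {}
    for w in rolls:
        counts[f(w)] = counts.get(f(w), 0) + 1
    return {v: Fraction(c, 36) for v, c in counts.items()}

pY = pmf(min)                                   # Y = min of the two rolls
pZ = pmf(max)                                   # Z = max of the two rolls

assert pY[1] == Fraction(11, 36) and pY[6] == Fraction(1, 36)
assert all(pZ[m] == pY[7 - m] for m in range(1, 7))   # Z has the law of 7 - Y

EY = sum(v * p for v, p in pY.items())
EZ = sum(v * p for v, p in pZ.items())
assert EY == Fraction(91, 36) and EY + EZ == 7
```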
5 Covariance, Variance, Linear Regression and Estimation
Review
The covariance of two random variables X and Y is defined as cov(X,Y) := E[XY] − E[X]E[Y]. The random
variables X and Y are said to be uncorrelated if cov(X,Y) = 0, positively correlated if cov(X,Y) > 0 and negatively correlated if cov(X,Y) < 0. Note that if X and Y are independent, then cov(X,Y) = 0. The converse
is not true. The variance of a random variable is defined as var(X) := E[X²] − E[X]². One has var(X + Y) =
var(X) + var(Y) + 2cov(X,Y). In particular, if X and Y are independent, then var(X + Y) = var(X) + var(Y).
Also, var(a + bX) = b²var(X) and cov(aX + bY, cV + dW) = ac·cov(X,V) + ad·cov(X,W) + bc·cov(Y,V) +
bd·cov(Y,W).
If f : ℜ → [0, ∞) is nondecreasing and f(a) > 0, then Pr[X ≥ a] ≤ E[f(X)]/f(a). This is Markov's inequality. Chebyshev's inequality states that Pr[|X − E[X]| ≥ a] ≤ var(X)/a².
The linear function of X that minimizes E[(Y −a−bX)2 ] is L[Y |X] := E[Y ]+[cov(X,Y )/var(X)](X −E[X]).
We call this function the LLSE (linear least squares estimate) of Y given X. In particular, if (X,Y ) is
distributed uniformly in the set {(Xm ,Ym ), m = 1, . . . , n}, then L[Y |X] is called the linear regression of Y
over X. Note that this LR is non-Bayesian: it does not assume a prior distribution of (X,Y ) but uses only
the observed sample values.
Let {Xm, m ≥ 1} be i.i.d. with mean µ and variance σ² and An := (X1 + · · · + Xn)/n. Then, Chebyshev
implies that Pr[|An − µ| ≥ ε] ≤ var(An)/ε² = σ²/(nε²). Thus, for all ε > 0 one has Pr[|An − µ| ≥ ε] → 0 as
n → ∞. This result is called the weak law of large numbers (WLLN). Using the same inequality and choosing
ε so that nε² = 20σ², i.e., ε ≈ 4.5σ/√n, we find that Pr[|An − µ| ≥ 4.5σ/√n] ≤ 5%, which shows that
[An − 4.5σ/√n, An + 4.5σ/√n] is a 95%-confidence interval for µ. Using the Central Limit Theorem, we
will be able to replace 4.5 by 2 in the confidence interval (see Section 9).
Examples
1. Let X = B(p). Then var(X) = E[X²] − E[X]² = E[X] − E[X]² = p(1 − p) ≤ 1/4.
2. Let X = B(n, p). Then we can write X = X1 + · · · + Xn where the Xm are i.i.d. B(p), so that var(X) =
np(1 − p).
3. Let Ω be the uniform probability space {1, . . . , 6} and let X take the values {0, 0, 1, 1, 2, 2} and Y
the values {0, 3, 6, 12, 3, 0}, respectively for the different values of ω. Thus, X(1) = X(2) = 0 and
Y(1) = 0, Y(2) = 3, etc. Then E[X] = (0 + 0 + 1 + 1 + 2 + 2)/6 = 1, E[Y] = (0 + 3 + 6 + 12 + 3 +
0)/6 = 4, E[XY] = (0 + 0 + 6 + 12 + 6 + 0)/6 = 4. Hence, cov(X,Y) = 0 and L[Y|X] = E[Y] = 4.
Also, Pr[Y = 0|X = 0] = 1/2 and Pr[Y = 3|X = 0] = 1/2, so that E[Y|X = 0] = 3/2. Similarly,
E[Y|X = 1] = 9 and E[Y|X = 2] = 3/2. We can write E[Y|X] = (3/4)(X − 1)(X − 2) − 9(X − 0)(X −
2) + (3/4)(X − 0)(X − 1) = −(15/2)X² + 15X + 3/2, since a polynomial that goes through the points
(xm, ym) for m = 1, . . . , n can be written as ∑_{m=1}^{n} ym Π_{k≠m} [(x − xk)/(xm − xk)].
4. Let X, Y, Z be i.i.d. with mean 0 and variance 1. Then L[aX + bY + cZ | dX + eY + fZ] = [(ad + be +
cf)/(d² + e² + f²)](dX + eY + fZ).
5. Let Ω = {1, 2, 3, 4} be a uniform probability space. Let also A = {1, 2}, B = {1, 3},C = {1, 4} and
X = 1A ,Y = 1B , Z = 1C . Then, X,Y, Z are pairwise independent, but not mutually independent. They
are also identically distributed like B(0.5). Note that E[XYZ] = 1/4 ≠ 1/8 = E[X]E[Y]E[Z]. Thus, when
we write ‘i.i.d.’, we mean mutually independent and identically distributed, like three coin flips, for
instance.
6. Let {X1 , . . . , Xn } be pairwise independent. Then var(X1 + · · · + Xn ) = var(X1 ) + · · · + var(Xn ).
7. Let X = B(n, p). Then X = X1 + · · · + Xn where the Xm are i.i.d. B(p). Hence, var(X) = n·var(X1) =
np(1 − p).
8. Let X = P(λ). Then E[X(X − 1)] = ∑n≥0 n(n − 1)λ^n exp{−λ}/n! = λ² ∑n≥2 λ^(n−2) exp{−λ}/(n −
2)! = λ² ∑n≥0 λ^n exp{−λ}/n! = λ². Thus, E[X²] − E[X] = λ², so that E[X²] = λ² + E[X] = λ² + λ
and var(X) = E[X²] − E[X]² = λ.
9. Let X = G(p). Then X = 1 + ZY where Y = G(p) and Z = B(1 − p) are independent. Thus, E[X²] =
E[1 + 2ZY + Z²Y²] = 1 + 2(1 − p)E[Y] + (1 − p)E[Y²] = 1 + 2(1 − p)/p + (1 − p)E[X²]. Hence,
E[X²] = [1 + 2(1 − p)/p]/p = (2 − p)/p². Consequently, var(X) = E[X²] − E[X]² = (2 − p)/p² −
1/p² = (1 − p)/p².
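Example 3 above can be checked exactly with rational arithmetic (a check we added):

```python
from fractions import Fraction as F

Xv = [0, 0, 1, 1, 2, 2]        # X(1), ..., X(6)
Yv = [0, 3, 6, 12, 3, 0]       # Y(1), ..., Y(6)

EX = F(sum(Xv), 6)
EY = F(sum(Yv), 6)
EXY = F(sum(x * y for x, y in zip(Xv, Yv)), 6)
assert (EX, EY, EXY - EX * EY) == (1, 4, 0)   # cov(X, Y) = 0, so L[Y|X] = 4

def cond_EY(x):
    """E[Y | X = x] by averaging Y over the outcomes where X = x."""
    ys = [y for xv, y in zip(Xv, Yv) if xv == x]
    return F(sum(ys), len(ys))

# The interpolating polynomial -(15/2)x^2 + 15x + 3/2 matches E[Y | X = x].
for x in (0, 1, 2):
    assert cond_EY(x) == -F(15, 2) * x * x + 15 * x + F(3, 2)
```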
6 Collisions and Collecting
Review
Say that there are M different coupons. When you buy a box of cereal, you get coupon m with probability
1/M for m = 1, . . . , M. It takes X1 = 1 box to get the first coupon. Then X2 = G((M − 1)/M) to get a new
one, then X3 = G((M − 2)/M) to get a new one, and so on. Thus, the average number of boxes required
to get all the coupons is M/M + M/(M − 1) + M/(M − 2) + · · · + M/1 = MH(M) where H(M) := 1 + 1/2 +
· · · + 1/M ≈ ln(M) + 0.58. The probability that coupon m has not been seen in n boxes is [(M − 1)/M]^n =
(1 − 1/M)^n ≈ exp{−n/M}. Thus, the average number of distinct coupons seen after n boxes is M[1 − (1 − 1/M)^n] ≈
M(1 − exp{−n/M}).
One throws m balls into n > m bins, independently and uniformly at random. The probability that there is no
collision after 1 ball is 1, after 2 balls is (n − 1)/n, after three balls it is [(n − 1)/n] × [(n − 2)/n], after four
balls it is [(n − 1)/n] × [(n − 2)/n] × [(n − 3)/n], and so on. The probability of no collision after m balls is
then Π_{k=1}^{m−1} [(n − k)/n] ≈ Π_{k=1}^{m−1} exp{−k/n} = exp{−∑_{k=1}^{m−1} k/n} ≈ exp{−m²/(2n)}. Bin i is still empty after k
balls with probability (1 − 1/n)^k. Thus, the expected number of empty bins is n(1 − 1/n)^k ≈ n exp{−k/n}.
By Markov's inequality, the probability that there is at least one empty bin is at most n exp{−k/n}.
Examples
1. (Note: This example was corrected on 5/4/16.) Let's use c bits as a CRC for m files. This means that
the m files are thrown into n = 2^c bins. The probability of no collision is approximately exp{−m²/(2n)},
so that the probability of collision is approximately 1 − exp{−m²/(2n)} ≈ m²/(2n). If we want the
probability to be less than ε, we need m²/(2n) ≤ ε, i.e., m²/2^(c+1) ≤ ε, or c ≥ 2 log2(m) − log2(ε) − 1.
For instance, if m = 10^6 ≈ 2^20 and ε = 10^−9, we need c ≥ 69. The implication is that a 9-byte (72-bit) CRC
suffices to sign messages or to detect errors in most applications.
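The bit-count computation can be scripted (our addition; crc_bits is a name we made up):

```python
from math import ceil, log2

def crc_bits(m, eps):
    """Smallest c with m^2 / 2^(c+1) <= eps, i.e., m files hashed into
    2^c bins collide with probability at most eps."""
    return ceil(2 * log2(m) - log2(eps) - 1)

print(crc_bits(10**6, 1e-9))   # 69 bits, so 9 bytes (72 bits) suffice
```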
7 Markov Chains
Review
A Markov chain on the finite state space X = {1, 2, . . . , K} is the random sequence {Xn , n ≥ 0} defined by
Pr[X0 = i0 , X1 = i1 , . . . , Xn = in ] = π0 (i0 )P(i0 , i1 ) · · · P(in−1 , in )
where π0 is a probability distribution called the initial distribution of the Markov chain and P is a K × K
nonnegative matrix whose rows sum to one. In particular, πn(i) := Pr[Xn = i] is such that πn = π0 P^n and
Pr[Xn = j|X0 = i] = P^n(i, j).
A Markov chain (MC) is irreducible if it can go from any state i to any other state j, possibly in more than one step. An
irreducible MC is aperiodic if the return times to some state are coprime, i.e., have greatest common divisor 1. (The return times to any state are then also
coprime.)
The distribution π is invariant if π solves the balance equations πP = π. There is a unique invariant
distribution if the Markov chain is irreducible. Moreover, the fraction of time in state i then converges to
π(i) for all i. The distribution πn converges to π if the Markov chain is also aperiodic.
The average time β (i) to enter a set A of states when starting from state i satisfies the first step equations
β(i) = 1 + ∑_j P(i, j)β(j), ∀i ∉ A
β(i) = 0, ∀i ∈ A.
The probability α(i) of entering a set A of states before a disjoint set B of states when starting from state i
satisfies the first step equations
α(i) = ∑_j P(i, j)α(j), ∀i ∉ A ∪ B
α(i) = 1, ∀i ∈ A
α(i) = 0, ∀i ∈ B.
Note that we call the equations both for α and β first step equations even though they are not the same. The
justification for the terminology is that in both cases one conditions on what happens in the first step of the
MC.
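Numerically, an invariant distribution can be found by iterating π ← πP until it stops changing; this sketch is our addition, applied to the two-state chain of Example 2 below with a = 0.2, b = 0.6.

```python
def invariant(P, iters=1000):
    """Approximate pi with pi P = pi by power iteration, starting uniform."""
    K = len(P)
    pi = [1.0 / K] * K
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(K)) for j in range(K)]
    return pi

# Two-state chain with P(0,1) = a = 0.2 and P(1,0) = b = 0.6.
P = [[0.8, 0.2], [0.6, 0.4]]
pi = invariant(P)
print([round(x, 4) for x in pi])   # [0.75, 0.25] = [b/(a+b), a/(a+b)]
```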
Examples
1. The MC on {0, 1} with P(0, 1) = P(1, 0) = a and P(0, 0) = P(1, 1) = 1−a with a ∈ [0, 1] is irreducible
if a > 0 and aperiodic if 0 < a < 1. The invariant distribution is π = [0.5, 0.5] when a > 0.
2. The MC on {0, 1} with P(0, 1) = a and P(1, 0) = b with a, b ∈ [0, 1] is irreducible if a, b > 0 and is
then aperiodic if a ≠ 1 or b ≠ 1. The invariant distribution is π = [b/(a + b), a/(a + b)] when a, b > 0.
3. You flip a fair coin until you get two heads in a row. The average number of flips is 6.
4. You roll a balanced six-sided die until the sum of the last two rolls is equal to 8. The average number
of rolls is about 8.4.
5. In each step, you get up one rung of a ladder with probability p; otherwise, you fall back to the ground.
The average number of steps to reach rung 20 is (p^(−20) − 1)/(1 − p).
6. At each step, you win one dollar with probability p; otherwise you lose one. Starting with n dollars, the
probability your fortune reaches M > n before it reaches 0 is (1 − ρ^n)/(1 − ρ^M) where ρ = (1 − p)/p (for p ≠ 1/2).
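The first step equations are easy to solve numerically; the sketch below (our addition) recovers the answer 6 of Example 3 by fixed-point iteration, with state 0 = no trailing H and state 1 = exactly one trailing H.

```python
# beta(0) = 1 + 0.5*beta(1) + 0.5*beta(0)   (H: progress to state 1; T: restart)
# beta(1) = 1 + 0.5*0       + 0.5*beta(0)   (H: HH reached, done; T: restart)
b0 = b1 = 0.0
for _ in range(200):                 # iterate the FSEs to their fixed point
    b0, b1 = 1 + 0.5 * b1 + 0.5 * b0, 1 + 0.5 * b0
print(round(b0))                     # 6 flips on average until HH
```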
8 Continuous Probability
Review
Imagine choosing a real number X uniformly in [0, 1]. Clearly Pr[X = x] = 0 for all x ∈ [0, 1]. This would
be the same if we chose the number uniformly in [0, 2]. Thus, the probability of individual outcomes does
not describe properly the random experiment. Instead, one starts by defining the probability of events. For
choosing uniformly in [0, 1], one defines Pr[[a, b]] = b − a, for 0 ≤ a ≤ b ≤ 1. One extends that definition by
additivity to any (countable) union of intervals. For instance, Pr[[0, 0.3] ∪ [0.7, 0.9]] = 0.5. Thus, for a finite
or countable sample space Ω, one starts by defining the probability of each outcome ω and one then defines
the probability of an event as the sum of the probabilities of the outcomes it contains. For an uncountable
sample space Ω, one starts by defining the probability of its events.
Let f : ℜ → [0, ∞) be a function such that ∫_{−∞}^{∞} f(x)dx = 1. Define a random variable X such that Pr[x <
X < x + ε] ≈ f(x)ε for ε ≪ 1. Then F(x) := Pr[X ≤ x] = ∫_{−∞}^{x} f(y)dy. Thus, f(x) is the derivative of F(x).
One calls f(x) the probability density function (pdf) of X and F(x) the cumulative distribution function
(cdf) of X. Sometimes one writes fX(x) := f(x) and FX(x) := F(x) to specify that the pdf and cdf are those
of the random variable X. This is convenient when one deals with more than one random variable. Then
E[X] = ∫ x fX(x)dx and E[h(X)] = ∫ h(x) fX(x)dx.
Examples
1. X = U[a, b] iff fX(x) = (b − a)^(−1) 1{a < x < b}. Consequently, E[X] = ∫_a^b x(b − a)^(−1) dx = (a + b)/2.
2. X = Expo(λ) iff fX(x) = λ exp{−λx}1{x ≥ 0}. Consequently, E[X] = ∫_0^∞ xλ exp{−λx}dx = −∫_0^∞ x d exp{−λx} = ∫_0^∞ exp{−λx}dx = λ^(−1), by integration by parts.
3. One shoots a dart uniformly in a circle with radius r. Let X be the distance of the dart to the center.
Then FX(x) = (πx²)/(πr²) = x²/r² for 0 ≤ x ≤ r. Hence, fX(x) = (2x/r²)1{0 ≤ x ≤ r} and E[X] = 2r/3.
4. Define Y = a + bX, for some a and some b > 0. Then fY(y)ε = Pr[y < a + bX < y + ε] = Pr[(y − a)/b <
X < (y − a)/b + ε/b] = fX((y − a)/b)(ε/b). Thus, fY(y) = (1/b) fX((y − a)/b).
5. Let X and Y be two independent random variables and W = max{X,Y }. Then FW (w) = Pr[W ≤ w] =
Pr[X ≤ w]Pr[Y ≤ w] = FX (w)FY (w).
6. Let X and Y be two independent random variables and V = min{X,Y }. Then 1 − FV (v) = Pr[V >
v] = Pr[X > v]Pr[Y > v] = [1 − FX(v)][1 − FY(v)].
7. Let X and Y be two independent random variables and Z = X + Y. Then fZ(z)ε = Pr[z < Z <
z + ε] = ∫_{−∞}^{∞} Pr[x < X < x + dx, z − x < Y < z − x + ε] = ∫_{−∞}^{∞} fX(x) fY(z − x)ε dx. Hence, fZ(z) =
∫_{−∞}^{∞} fX(x) fY(z − x)dx.
8. Assume that with probability p the random variable X has pdf f0 (x) and it has pdf f1 (x) otherwise.
Given X = x, the probability that it has pdf f0 is [p f0 (x)]/[p f0 (x) + (1 − p) f1 (x)].
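A numeric sanity check of the convolution formula in Example 7 (our addition): for X, Y independent U[0, 1], the density of Z = X + Y is the triangle min(z, 2 − z) on [0, 2].

```python
def f_uniform(x):
    """pdf of U[0, 1]"""
    return 1.0 if 0 < x < 1 else 0.0

def f_Z(z, steps=100_000):
    """Riemann-sum approximation of f_Z(z) = integral f_X(x) f_Y(z - x) dx."""
    dx = 1.0 / steps                     # f_X vanishes outside [0, 1]
    return sum(f_uniform(i * dx) * f_uniform(z - i * dx) * dx
               for i in range(steps))

print(round(f_Z(0.5), 3), round(f_Z(1.5), 3))   # 0.5 0.5, the triangle density
```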
9 Gaussian and CLT
Review
A random variable X is N(0, 1) iff its pdf is fX(x) = (1/√(2π)) exp{−x²/2}. One can verify that E[X] = 0
and var(X) = 1. By definition, the random variable Y = µ + σX is then N(µ, σ²). Then E[Y] = µ +
σE[X] = µ and var(Y) = σ²var(X) = σ². Also, one can check that fY(y) = (1/√(2πσ²)) exp{−(y −
µ)²/(2σ²)} by Example 4 of the previous section. This is the bell-shaped pdf.
If X = N(µ, σ²), then Pr[X > µ + 2σ] ≈ 2.5% and Pr[|X − µ| > 2σ] ≈ 5%. Also, Pr[X > µ + 1.65σ] ≈ 5% and Pr[|X − µ| > 1.65σ] ≈ 10%. Note that Chebyshev shows that Pr[|X − µ| > 2σ] ≤ 1/4 = 25%, which is a very loose bound. This is why the CLT (see below) provides smaller confidence intervals than Chebyshev. To get 5% using Chebyshev, we have to write Pr[|X − µ| ≥ 4.5σ] ≤ 1/(4.5)² ≈ 5%.
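The Gaussian tail figures quoted above can be reproduced with the standard-library error function, since Pr[|X − µ| > kσ] = 1 − erf(k/√2); a short sketch comparing them to the Chebyshev bound:

```python
import math

def gauss_two_sided_tail(k):
    # Pr[|X - mu| > k*sigma] for X ~ N(mu, sigma^2):
    # Pr[|Z| > k] = 2 * (1 - Phi(k)) = 1 - erf(k / sqrt(2)).
    return 1.0 - math.erf(k / math.sqrt(2.0))

print(gauss_two_sided_tail(2.0))    # ≈ 0.0455, i.e. about 5%
print(gauss_two_sided_tail(1.65))   # ≈ 0.099, i.e. about 10%
print(1 / 2.0 ** 2)                 # Chebyshev bound at k = 2: 0.25
```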
The Central Limit Theorem (CLT) states that if the X_n are i.i.d. with mean µ and variance σ² and if A_n = (X_1 + · · · + X_n)/n, then [A_n − µ]√n ≈ N(0, σ²) when n ≫ 1. Thus, we see that Pr[|A_n − µ|√n > 2σ] ≈ 5%, so that Pr[A_n − 2σ/√n ≤ µ ≤ A_n + 2σ/√n] ≈ 95%. Hence, [A_n − 2σ/√n, A_n + 2σ/√n] is a 95%-confidence interval for µ.
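One can check the coverage of this interval by simulation. The sketch below uses i.i.d. Expo(1) samples (so µ = 1, σ = 1; these choices are illustrative) and counts how often the interval contains µ.

```python
import math
import random

random.seed(3)
# Coverage of the CLT 95% interval [A_n - 2σ/√n, A_n + 2σ/√n]
# for i.i.d. Expo(1) samples: µ = 1, σ = 1.
mu, sigma, n, trials = 1.0, 1.0, 400, 2_000
covered = 0
for _ in range(trials):
    an = sum(random.expovariate(1.0) for _ in range(n)) / n
    half = 2 * sigma / math.sqrt(n)
    if an - half <= mu <= an + half:
        covered += 1
print(covered / trials)  # should be close to 0.95
```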
One can show that if X = N (µ1 , σ12 ) and Y = N (µ2 , σ22 ) are independent, then X +Y = N (µ1 + µ2 , σ12 +
σ22 ). Intuitively, X and Y are both sums of many small independent random variables, so that X +Y is also,
and should therefore be Gaussian. The mean and variance are the sum of those of X and Y .
The Law of Large Numbers implies that if the X_m are i.i.d. with mean µ and variance σ², then A_n = (1/n) ∑_{m=1}^{n} X_m ≈ µ and s_n² := (1/n) ∑_{m=1}^{n} (X_m − A_n)² ≈ σ². Thus, replacing the standard deviation σ by s_n in the confidence intervals, we see that [A_n − 2s_n/√n, A_n + 2s_n/√n] is a 95%-confidence interval for µ when n is large enough. In practice, one has to be a bit careful because s_n may be a poor estimate of σ when n is not large enough. Use with care and at your own risk!
Examples
1. Let X = X_1 + · · · + X_n where the X_m are i.i.d. B(p). Let also A_n = (X_1 + · · · + X_n)/n. We saw that [A_n − 2σ/√n, A_n + 2σ/√n] is a 95%-confidence interval for E[X_1] = p. Since σ² = var(X_1) = p(1 − p) ≤ 1/4, one has σ ≤ 1/2, and it follows that [A_n − 1/√n, A_n + 1/√n] is a 95%-confidence interval for p.
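A small sketch of this conservative interval for a simulated coin (the true bias p = 0.3 is chosen for illustration and would be unknown in practice):

```python
import math
import random

random.seed(4)
# Conservative 95% interval [A_n - 1/√n, A_n + 1/√n] for a coin's bias p,
# using σ <= 1/2 as an upper bound on sqrt(p(1-p)).
p, n = 0.3, 10_000   # p is the true bias, unknown in practice
an = sum(random.random() < p for _ in range(n)) / n
lo, hi = an - 1 / math.sqrt(n), an + 1 / math.sqrt(n)
print(lo <= p <= hi)  # True with probability at least 95%
```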
10 Some Important Distributions
1. Bernoulli with parameter p : B(p)
Pr[X = 1] = p; Pr[X = 0] = 1 − p. Mean = p; variance = p(1 − p).
2. Uniform in {1, 2, . . . , n} : U[1, . . . , n]
Pr[X = m] = 1/n, m = 1, . . . , n. Mean = (n + 1)/2; variance = (n² − 1)/12.
3. Binomial with parameters n, p : B(n, p)
Pr[X = m] = (n choose m) p^m (1 − p)^(n−m), m = 0, . . . , n. Mean = np; variance = np(1 − p).
4. Geometric with parameter p : G(p)
Pr[X = n] = (1 − p)^(n−1) p, n = 1, 2, . . .. Mean = 1/p, variance = (1 − p)/p².
5. Poisson with parameter λ : P(λ )
Pr[X = n] = (λ^n/n!) e^{−λ}, n = 0, 1, 2, . . .. Mean = λ, variance = λ.
6. Uniform in [0, 1] : U[0, 1].
fX (x) = 1{0 ≤ x ≤ 1}. Mean = 1/2, variance = 1/12.
7. Exponential with parameter λ : Expo(λ ).
f_X(x) = λe^{−λx} 1{x > 0}, F_X(x) = [1 − e^{−λx}] 1{x ≥ 0}. Mean = λ^{−1}, variance = λ^{−2}.
8. Gaussian with parameters µ, σ² : N(µ, σ²)
f_X(x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)}. Mean = µ, variance = σ².
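The tabulated means and variances are easy to spot-check by simulation; here is a sketch for Expo(λ) with λ = 2 chosen arbitrarily (theory: mean 1/λ, variance 1/λ²).

```python
import random

random.seed(5)
# Empirical mean and variance of Expo(λ); theory gives 1/λ and 1/λ².
lam, n = 2.0, 200_000
xs = [random.expovariate(lam) for _ in range(n)]
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
print(mean, var)  # should be near 0.5 and 0.25
```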
11 Appendix: Some Mathematical Facts
The following definitions and facts are used repeatedly in this course.
1. The following set notation is assumed to be familiar: ∅, A ∩ B, A ∪ B, A∆B, A \ B, Ā = A^c.
2. (A ∩ B)^c = A^c ∪ B^c and (A ∪ B)^c = A^c ∩ B^c.
3. A × B := {(a, b) | a ∈ A, b ∈ B}.
4. A² := A × A; A^n := A^(n−1) × A = {(a_1, . . . , a_n) | a_k ∈ A, k = 1, . . . , n}.
5. A∗ is the set of finite strings with elements in A. Thus A∗ := ∪_{n=0}^{∞} A^n, where A⁰ is the set that contains one string of length 0.
6. For a ≠ 1 and n ≥ 0, one has 1 + a + · · · + a^n = (1 − a^(n+1))/(1 − a).
7. For |a| < 1, one has 1 + a + a² + · · · = 1/(1 − a).
8. 1 + 2a + 3a² + 4a³ + · · · = (d/da)[1 + a + a² + · · · ] = (d/da)(1 − a)^{−1} = (1 − a)^{−2}.
9. For 0 ≤ m ≤ n, one defines (n choose m) := n!/(m!(n − m)!).
10. For n ≥ 0, one has (a + b)^n = ∑_{m=0}^{n} (n choose m) a^m b^(n−m).
11. ln(1 + ε) ≈ ε for |ε| ≪ 1.
12. exp{a} = ∑_{n=0}^{∞} a^n/n!.
13. exp{ε} ≈ 1 + ε for |ε| ≪ 1.
14. exp{a + b} = exp{a} exp{b}.
15. logb (x) = loga (x) logb (a).
16. 1 + 2 + 3 + · · · + n = n(n + 1)/2.
17. ∫_a^b f(x) dg(x) = f(b)g(b) − f(a)g(a) − ∫_a^b g(x) df(x).
18. ∫_0^∞ exp{−ax} dx = 1/a for a > 0.
19. ∫_{−∞}^{∞} exp{−x²/2} dx = √(2π).
20. ∫_{−∞}^{∞} x² exp{−x²/2} dx = √(2π).
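Several of these facts can be spot-checked numerically; the sketch below checks the geometric series (facts 6 and 8, with a = 0.5 and n = 10 chosen arbitrarily) and the Gaussian integral (fact 19) by a simple Riemann sum.

```python
import math

a, n = 0.5, 10
# Fact 6: finite geometric series versus its closed form.
finite = sum(a**k for k in range(n + 1))
closed = (1 - a**(n + 1)) / (1 - a)
# Fact 8: derivative of the geometric series, truncated at 200 terms;
# should approach (1 - a)^{-2} = 4.
deriv = sum((k + 1) * a**k for k in range(200))
# Fact 19: Gaussian integral via a Riemann sum on [-10, 10]
# (the tails beyond |x| = 10 are negligible).
dx = 0.001
gauss = sum(math.exp(-(i * dx) ** 2 / 2) for i in range(-10_000, 10_000)) * dx
print(finite, closed, deriv, gauss, math.sqrt(2 * math.pi))
```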