CS 498/698 Cheat Sheet:
Useful Facts from Probability Theory
Shai Ben-David
Dávid Pál
February 7, 2006
Useful analytical inequality: For any $\varepsilon \in [0, 1]$ and any $m \ge 0$ it holds that
$$(1 - \varepsilon)^m \le e^{-\varepsilon m}.$$
This inequality is typically used when bounding the probability of $m$ independent outcomes, each with probability $1 - \varepsilon$.
Proof. First we show that for any real number $x$ it holds that
$$1 - x \le e^{-x}. \qquad (1)$$
If we draw the function $y = 1 - x$ from the left side of the inequality and the function $y = e^{-x}$ from the right side, we get a picture as in Figure 1. These two functions have the same value at the point $x = 0$, and the derivatives of both functions at $x = 0$ also agree. Since $e^{-x}$ is convex and $y = 1 - x$ is its tangent line at $x = 0$, and a convex function lies above any of its tangent lines, inequality (1) is proved.
For $x \in [0, 1]$ both sides of (1) are non-negative. Hence, we may take the $m$-th power ($m \ge 0$) of both sides and obtain
$$(1 - x)^m \le e^{-xm}.$$
Replacing $x$ by $\varepsilon$, we get the desired inequality.
Figure 1: The functions $y = 1 - x$ and $y = e^{-x}$.
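A quick numerical sanity check of this inequality, as a minimal Python sketch; the grid of values for $\varepsilon$ and $m$ below is an arbitrary choice:

```python
import math

# Check (1 - eps)**m <= exp(-eps * m) on a small, arbitrary grid
# of eps in [0, 1] and m >= 0.
for eps in [0.0, 0.1, 0.5, 0.9, 1.0]:
    for m in [0, 1, 10, 100]:
        assert (1 - eps) ** m <= math.exp(-eps * m), (eps, m)
print("inequality holds on the whole grid")
```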
Union bound: If $A$ and $B$ are two probability events, then
$$\Pr(A \cup B) \le \Pr(A) + \Pr(B).$$
More generally, if $A_1, A_2, \ldots, A_n$ are probability events, then
$$\Pr(A_1 \cup A_2 \cup \cdots \cup A_n) \le \Pr(A_1) + \Pr(A_2) + \cdots + \Pr(A_n).$$
Example: We demonstrate a simple, yet very typical, use of both the union bound and the "useful analytical inequality". Suppose we throw simultaneously 100 fair six-sided dice. We show that with probability very close to 1, all the numbers 1, 2, 3, 4, 5, 6 will appear on at least one of the dice. Let $A_1, A_2, \ldots, A_6$ be the events that, respectively, $1, 2, \ldots, 6$ appears on at least one of the dice. Let $\bar{A}_1, \bar{A}_2, \ldots, \bar{A}_6$ be the complementary events that, respectively, $1, 2, \ldots, 6$ does not appear on any of the dice. We have
$$\Pr(\bar{A}_1) = \Pr(\bar{A}_2) = \cdots = \Pr(\bar{A}_6) = \left(1 - \frac{1}{6}\right)^{100} \le e^{-100/6}.$$
What is the event $\bar{A}_1 \cup \bar{A}_2 \cup \bar{A}_3 \cup \bar{A}_4 \cup \bar{A}_5 \cup \bar{A}_6$? It is the event that at least one of the numbers 1, 2, 3, 4, 5, 6 does not appear on any of the dice. Using the union bound we show that this event has small probability:
$$\Pr(\bar{A}_1 \cup \cdots \cup \bar{A}_6) \le \Pr(\bar{A}_1) + \Pr(\bar{A}_2) + \Pr(\bar{A}_3) + \Pr(\bar{A}_4) + \Pr(\bar{A}_5) + \Pr(\bar{A}_6) \le 6 \cdot e^{-100/6} \approx 0.00000034666491116514.$$
And hence, the probability of the complementary event (i.e. that all numbers will show up) is at least 0.999999.
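To see how conservative the bound is, one can compare it with the exact probability of a missing number, computed by inclusion-exclusion; this cross-check is a standard identity, not part of the argument above:

```python
import math

# Exact probability that at least one of the six numbers is missing
# after 100 throws, via inclusion-exclusion, compared with the two bounds.
p_missing = sum((-1) ** (k + 1) * math.comb(6, k) * (1 - k / 6) ** 100
                for k in range(1, 7))
union_bound = 6 * (5 / 6) ** 100      # union bound on Pr(union of A-bar_i)
analytic = 6 * math.exp(-100 / 6)     # after applying (1 - x)^m <= e^(-xm)

print(f"exact:          {p_missing:.6e}")
print(f"union bound:    {union_bound:.6e}")
print(f"6 * e^(-100/6): {analytic:.6e}")
```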
Expected value: The expected value (average, mean) of a real random variable $X$ is defined, in the discrete case, as
$$E(X) = \sum_{i \in \mathbb{N}} \alpha_i \cdot \Pr(X = \alpha_i),$$
when the values that $X$ can attain form a countable set $\{\alpha_i : i \in \mathbb{N}\}$. In the continuous case the sum is replaced by an integral,
$$E(X) = \int_{-\infty}^{\infty} x\, p(x)\, dx,$$
where $p(x)$ is the probability density at the point $x$.
Example: What is the average number thrown on a die? It is
$$E(X) = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 7/2 = 3.5.$$
A few useful tricks for expectations:
• Linearity of expectation: $E(X + Y) = E(X) + E(Y)$ and, for any $c \in \mathbb{R}$, $E(cX) = c\,E(X)$ (see the simulation sketch after this list).
• More to come . . .
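A minimal simulation of both the die example and linearity of expectation; the sample size and seed below are arbitrary choices:

```python
import random

random.seed(0)   # arbitrary seed for reproducibility
n = 100_000      # arbitrary sample size

# X, Y: two independent fair dice (independence is for convenience only;
# linearity of expectation holds without it).
xs = [random.randint(1, 6) for _ in range(n)]
ys = [random.randint(1, 6) for _ in range(n)]
print(sum(xs) / n)                             # ~3.5 = E(X)
print(sum(x + y for x, y in zip(xs, ys)) / n)  # ~7.0 = E(X) + E(Y)
```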
Markov's inequality: Let $X$ be a non-negative real random variable with expected value (i.e. mean) $\mu$. Then,
$$\Pr(X > t) < \frac{\mu}{t} \quad \text{for any } t > 0,$$
or equivalently,
$$\Pr(X > a\mu) < \frac{1}{a} \quad \text{for any } a > 0.$$
In more condensed form, we usually write: If $X \ge 0$, then
$$\Pr(X > t) < \frac{E(X)}{t} \quad \text{for any } t > 0,$$
or equivalently,
$$\Pr(X > a\,E(X)) < \frac{1}{a} \quad \text{for any } a > 0.$$
Example: Suppose the average height of a human is 2 meters. What is the probability of meeting a human of height 10 meters or more? Well, by Markov's inequality (assuming that the height of a human is non-negative), this probability is at most 20%.
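A small sketch of Markov's inequality in action; the exponential distribution below is an arbitrary non-negative stand-in, since the inequality needs only $X \ge 0$ and a finite mean:

```python
import random

random.seed(0)     # arbitrary seed for reproducibility
n = 100_000        # arbitrary sample size
mu, t = 2.0, 10.0  # mean and threshold, echoing the height example

# Exponentially distributed "heights" with mean mu; any non-negative
# distribution would do for Markov's inequality.
samples = [random.expovariate(1 / mu) for _ in range(n)]
empirical = sum(x > t for x in samples) / n
print(f"empirical Pr(X > {t}) = {empirical:.4f}, Markov bound = {mu / t:.2f}")
```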
Chebyshev's inequality: Let $X$ be a real random variable with expected value (i.e. mean) $\mu = E(X)$ and variance $\sigma^2$. Then
$$\Pr(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2} \quad \text{for any } t > 0,$$
or equivalently,
$$\Pr(|X - \mu| \ge a\sigma) \le \frac{1}{a^2} \quad \text{for any } a > 0.$$
This inequality states that it is improbable that a random variable will deviate much from its expected value.
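A small numerical check of Chebyshev's inequality, using a fair die (mean 3.5, variance 35/12) and a few arbitrary deviation thresholds:

```python
import random

random.seed(0)          # arbitrary seed for reproducibility
n = 100_000             # arbitrary sample size
mu, var = 3.5, 35 / 12  # exact mean and variance of a fair die

samples = [random.randint(1, 6) for _ in range(n)]
for t in [1.5, 2.0, 2.5]:  # arbitrary thresholds
    empirical = sum(abs(x - mu) >= t for x in samples) / n
    print(f"t={t}: empirical Pr(|X-mu| >= t) = {empirical:.4f}, "
          f"Chebyshev bound = {var / t**2:.4f}")
```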
Chernoff's bound: Let $X_1, X_2, \ldots, X_m$ be binary independent random variables with $E(X_i) = p$ for all $i = 1, 2, \ldots, m$. (By binary we mean that $X_i$ attains only the two values 0 and 1. Then $E(X_i) = p$ simply means that $\Pr[X_i = 1] = p$ and $\Pr[X_i = 0] = 1 - p$.) Let
$$X = \frac{1}{m} \sum_{i=1}^{m} X_i$$
be the empirical average. Then for any $\varepsilon > 0$
$$\Pr(X < E(X) - \varepsilon) \le e^{-2\varepsilon^2 m},$$
$$\Pr(X > E(X) + \varepsilon) \le e^{-2\varepsilon^2 m},$$
$$\Pr(|X - E(X)| > \varepsilon) \le 2e^{-2\varepsilon^2 m},$$
or equivalently,
$$\Pr(X < p - \varepsilon) \le e^{-2\varepsilon^2 m},$$
$$\Pr(X > p + \varepsilon) \le e^{-2\varepsilon^2 m},$$
$$\Pr(|X - p| > \varepsilon) \le 2e^{-2\varepsilon^2 m}.$$
Note that the second bound follows from the first bound by taking $Y_i = 1 - X_i$. The third bound follows from the preceding two by the union bound.
Example: Let us demonstrate the use of Chernoff's bound. Suppose we throw simultaneously 1000 fair coins. What is the probability that the number of heads will lie in the interval $[400, 600]$?
Let $X_i$ be the binary random variable that is 1 iff the $i$-th coin falls heads and 0 otherwise. Then $X_1 + X_2 + \cdots + X_{1000}$ is the random variable that counts the number of heads. Using Chernoff's bound we have
$$\Pr\left[X_1 + X_2 + \cdots + X_{1000} \in [400, 600]\right] = 1 - \Pr\left[|X_1 + X_2 + \cdots + X_{1000} - 500| > 100\right]$$
$$= 1 - \Pr\left[\left|\frac{X_1 + X_2 + \cdots + X_{1000}}{1000} - \frac{1}{2}\right| > \frac{100}{1000}\right]$$
$$= 1 - \Pr\left[\left|X - \frac{1}{2}\right| > 0.1\right]$$
$$\ge 1 - 2e^{-2 \cdot (0.1)^2 \cdot 1000} = 1 - 2e^{-20} \ge 0.9999,$$
where $X = \frac{1}{1000}\sum_{i=1}^{1000} X_i$ is the empirical average. Hence the number of heads will be in the interval $[400, 600]$ with probability at least 99.99%.
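A Monte Carlo sketch of the coin example; the number of simulated experiments is an arbitrary choice, and the empirical frequency should sit above the Chernoff lower bound:

```python
import math
import random

random.seed(0)            # arbitrary seed for reproducibility
m, trials = 1000, 10_000  # 1000 coins per experiment; trial count arbitrary

hits = 0
for _ in range(trials):
    heads = sum(random.random() < 0.5 for _ in range(m))
    hits += 400 <= heads <= 600
print(f"empirical frequency:  {hits / trials:.6f}")
print(f"Chernoff lower bound: {1 - 2 * math.exp(-2 * 0.1**2 * m):.10f}")
```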
Chernoff's bound is implied by the much more general Hoeffding's inequality.
Hoeffding's inequality: Let $X_1, X_2, \ldots, X_m$ be independent random variables such that $a_i \le X_i \le b_i$. (Note that the $X_i$ do not have to be identically distributed.) Let
$$S = \sum_{i=1}^{m} X_i$$
be the empirical sum. Then for any $\varepsilon > 0$
$$\Pr(S < E(S) - \varepsilon m) \le e^{-2\varepsilon^2 m^2 / \sum_{i=1}^{m} (b_i - a_i)^2},$$
$$\Pr(S > E(S) + \varepsilon m) \le e^{-2\varepsilon^2 m^2 / \sum_{i=1}^{m} (b_i - a_i)^2},$$
$$\Pr(|S - E(S)| > \varepsilon m) \le 2e^{-2\varepsilon^2 m^2 / \sum_{i=1}^{m} (b_i - a_i)^2}.$$
As before, the second inequality follows from the first one by applying it to $Y_i = b_i - X_i$. Similarly, the third bound follows from the preceding two by the union bound.
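A sketch illustrating Hoeffding's inequality with variables of different ranges; the uniform distributions and the particular ranges below are arbitrary choices, since the inequality needs only independence and boundedness:

```python
import math
import random

random.seed(0)  # arbitrary seed for reproducibility
m = 200
# Independent variables with ranges [a_i, b_i] of widths 1, 2, 3;
# both the distributions and the ranges are arbitrary choices.
bounds = [(0, 1 + (i % 3)) for i in range(m)]
eps = 0.1       # arbitrary deviation parameter

denom = sum((b - a) ** 2 for a, b in bounds)
hoeffding = 2 * math.exp(-2 * (eps * m) ** 2 / denom)
exp_S = sum((a + b) / 2 for a, b in bounds)   # E(S) for uniform X_i

trials = 20_000
hits = 0
for _ in range(trials):
    S = sum(random.uniform(a, b) for a, b in bounds)
    hits += abs(S - exp_S) > eps * m
print(f"empirical Pr(|S - E(S)| > eps*m) = {hits / trials:.4f}, "
      f"Hoeffding bound = {hoeffding:.4f}")
```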