
CSCI-B609: A Theorist’s Toolkit, Fall 2016
Aug 25
Lecture 02: Bounding tail distributions of a random variable
Lecturer: Yuan Zhou
Scribe: Yuan Xie & Yuan Zhou
Let us consider the unbiased coin flips again. I.e., let the outcome of the i-th coin toss be the random variable

X_i = +1 w.p. 1/2,  and  X_i = −1 w.p. 1/2.

We assume all coin tosses are independent and would like to study the sum S of the first n coin tosses,

S = Σ_{i=1}^n X_i.
In this lecture, we would like to study the probability that S greatly deviates from its mean E S = 0. Specifically, for some parameter t, we would like to estimate the probability Pr[S > t]. Intuitively, we know that such a probability should be “small” for large enough t. The goal of this lecture (and part of the next one) is to derive quantitative upper bounds on the tail probability mass, parameterized by t.
As we did in the previous lecture, using the Berry-Esseen theorem, we know that

|Pr[S ≥ √n · t] − Pr[G ≥ t]| ≤ O(1)/√n,

where G ∼ N(0, 1) is a standard Gaussian. For convenience, we may also use the following informal notation

Pr[S ≥ √n · t] = Pr[G ≥ t] ± O(1)/√n.    (1)
Using basic calculus, we can estimate that

Pr[G ≥ t] = ∫_t^{+∞} (1/√(2π)) e^{−x²/2} dx ≤ O(1) · e^{−t²/2}.    (2)
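As a quick numerical illustration of (2) (a sketch added here as an example, not part of the original argument; the O(1) constant is simply taken to be 1), the Gaussian tail can be computed with the complementary error function from the Python standard library:

    import math

    # Compare the exact Gaussian tail Pr[G >= t] = erfc(t/sqrt(2))/2 with the
    # bound from (2); the O(1) constant is taken to be 1 here.
    for t in [1.0, 2.0, 3.0, 5.0]:
        tail = 0.5 * math.erfc(t / math.sqrt(2))
        bound = math.exp(-t * t / 2)
        print(f"t={t}: Pr[G >= t] = {tail:.3e}, e^(-t^2/2) = {bound:.3e}")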
Now let us fix the parameter t = 10√(ln n). Combining (1) and (2), we have

Pr[S ≥ √n · t] ≤ Pr[G ≥ t] + O(1/√n) = O(exp(−(10√(ln n))²/2)) + O(1/√n)
             = O(1/n^50) + O(1/√n) = O(1/√n).
We see that the tail mass of the standard Gaussian is only O(1/n^50). However, the error term O(1/√n) introduced by the Berry-Esseen theorem is much greater. This error term is the main reason that we cannot get better results this way. In the following part of this lecture, we will try various other methods to improve the upper bound.
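To get a feel for the Berry-Esseen approximation and its O(1/√n) error, the following sketch (an added illustration, not part of the notes) estimates Pr[S ≥ √n · t] by simulation for the modest threshold t = 2, where the probability is large enough to measure:

    import math
    import random

    # Monte Carlo illustration of the Berry-Esseen approximation: compare the
    # empirical Pr[S >= sqrt(n)*t] with the Gaussian tail Pr[G >= t] for t = 2
    # (the lecture's t = 10*sqrt(ln n) is far too rare an event to sample).
    random.seed(0)
    t, trials = 2.0, 20_000
    for n in [9, 100, 900]:
        hits = 0
        for _ in range(trials):
            s = sum(random.choices((-1, 1), k=n))
            hits += s >= math.sqrt(n) * t
        gauss_tail = 0.5 * math.erfc(t / math.sqrt(2))
        print(f"n={n:3d}: empirical = {hits / trials:.4f}, "
              f"Gaussian = {gauss_tail:.4f}, 1/sqrt(n) = {1 / math.sqrt(n):.3f}")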
1 Markov inequality
When we only know the mean of a nonnegative random variable, the Markov inequality gives a simple upper bound on the probability that the variable greatly exceeds its mean.
Theorem 1 (Markov inequality). Given a random variable X, assume X ≥ 0. For each parameter t ≥ 1, we have Pr[X ≥ t · E[X]] ≤ 1/t.
Proof. For each α > 0, we have
E[X] = Pr[X ≥ α] · E[X|X ≥ α] + Pr[X < α] · E[X|X < α]
≥ Pr[X ≥ α] · α + Pr[X < α] · 0 = Pr[X ≥ α] · α.
Dividing both sides of the inequality by α > 0, we get

Pr[X ≥ α] ≤ E[X]/α.

Taking α = E[X] · t, we get the desired bound.
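As a quick sanity check of Theorem 1 (an added sketch; the exponential distribution is just an arbitrary nonnegative example, not one used in the lecture):

    import random

    # Monte Carlo check of Markov's inequality Pr[X >= t*E[X]] <= 1/t
    # for a nonnegative random variable; here X is exponential with mean 1.
    random.seed(0)
    samples = [random.expovariate(1.0) for _ in range(200_000)]
    mean = sum(samples) / len(samples)
    for t in [1, 2, 5, 10]:
        tail = sum(x >= t * mean for x in samples) / len(samples)
        print(f"t={t:2d}: empirical Pr[X >= t*E[X]] = {tail:.4f}, bound = {1/t:.4f}")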
Now let us try to apply the Markov inequality to bounding the tail mass of S. Since S is not a nonnegative random variable, we cannot directly apply the inequality. However, note that S ≥ −n always holds. We apply the inequality to T = S + n, where E T = E S + n = n. Let t = 10√(n ln n). We have

Pr[S ≥ t] = Pr[T ≥ t + n] = Pr[T ≥ (E T) · (t + n)/n] ≤ n/(t + n) = n/(n + 10√(n ln n)).
This is a very bad bound – it does not even converge to 0 as n grows!
2 Chebyshev inequality
The Chebyshev inequality not only uses the mean of the random variable, but also needs
the variance (or the second moment). Since we have more information about the random
variable, we may potentially get better bounds.
Theorem 2 (Chebyshev inequality). Assume that E[X] = µ and Var[X] = σ² > 0. For every parameter t > 0, we have Pr[|X − µ| ≥ t · σ] ≤ 1/t².
Proof. Let Y = (X − µ)². We can check that E[Y] = σ² and Y ≥ 0. Applying the Markov inequality, we have

Pr[|X − µ| ≥ t · σ] = Pr[(X − µ)² ≥ t² · σ²] = Pr[Y ≥ t² · E[Y]] ≤ 1/t².
Now let us go back to the scenario discussed at the beginning of this lecture. We compute that

µ = E S = 0   and   σ = √(Var[S]) = √(E S²) = √n.

Therefore

Pr[S ≥ 10√(n ln n)] ≤ Pr[|S| ≥ 10√(n ln n)] = Pr[|S| ≥ σ · 10√(n ln n)/σ] ≤ σ²/(10√(n ln n))² = 1/(100 ln n).
This bound is still not as good as expected. However, at least it converges to 0 as n → ∞.
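For concreteness, one can tabulate the two bounds obtained so far (a small added sketch, not part of the notes); the Markov bound tends to 1 while the Chebyshev bound decays like 1/ln n:

    import math

    # Compare the two bounds on Pr[S >= 10*sqrt(n ln n)] obtained so far:
    # Markov applied to T = S + n gives n/(n + 10*sqrt(n ln n)),
    # while Chebyshev gives 1/(100 ln n).
    for n in [10**2, 10**4, 10**6, 10**8]:
        thr = 10 * math.sqrt(n * math.log(n))
        markov = n / (n + thr)
        chebyshev = 1 / (100 * math.log(n))
        print(f"n = 10^{round(math.log10(n))}: Markov = {markov:.3f}, "
              f"Chebyshev = {chebyshev:.5f}")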
Remark 1. Note that the Chebyshev inequality only needs pairwise independence among the X_i's. Specifically, when computing the variance of S, we have

Var[S] = Var[X_1 + X_2 + ... + X_n]
= E[(X_1 + X_2 + ... + X_n)²] − E[X_1 + X_2 + ... + X_n]² = E[(X_1 + X_2 + ... + X_n)²]
= Σ_{i=1}^n E[X_i²] + Σ_{i≠j} E[X_i X_j] = Σ_{i=1}^n E[X_i²] = n.

In the penultimate equality, we used the fact that X_i is independent from X_j for i ≠ j (so E[X_i X_j] = E[X_i] E[X_j] = 0).
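To illustrate Remark 1, here is a sketch (an added example using the standard construction of pairwise independent ±1 variables from a few truly random bits, which is not discussed in the notes) that empirically confirms Var[S] = n even without full independence:

    import random
    from itertools import combinations

    # Pairwise independent +-1 variables from k truly random bits:
    # for every nonempty subset A of {1,...,k}, set X_A = (-1)^(sum of bits in A).
    # These n = 2^k - 1 variables are pairwise (but not fully) independent,
    # so Var[S] should still be n, as claimed in Remark 1.
    random.seed(0)
    k = 4
    subsets = [A for r in range(1, k + 1) for A in combinations(range(k), r)]
    n = len(subsets)  # 2^k - 1 = 15

    trials = 100_000
    sum_s, sum_s2 = 0.0, 0.0
    for _ in range(trials):
        bits = [random.randint(0, 1) for _ in range(k)]
        s = sum((-1) ** sum(bits[i] for i in A) for A in subsets)
        sum_s += s
        sum_s2 += s * s
    var = sum_s2 / trials - (sum_s / trials) ** 2
    print(f"n = {n}, empirical Var[S] = {var:.2f}")  # should be close to n = 15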
3 The fourth moment method
Using the first two moments, we obtained better bounds than by only using the mean of the random variable. Now let us try to extend this method to the fourth moment.
Let us consider S⁴ ≥ 0. By the Markov inequality, we have

Pr[S ≥ 10√(n ln n)] ≤ Pr[S⁴ ≥ (10√(n ln n))⁴] ≤ E[S⁴]/(10000 n² ln² n).    (3)
Now let us estimate

E[S⁴] = E[(Σ_i X_i)⁴]
= Σ_i E[X_i⁴] + 4 Σ_i Σ_{j≠i} E[X_i³ X_j] + 6 Σ_i Σ_{j≠i} Σ_{k≠i,j} E[X_i² X_j X_k]
  + 3 Σ_i Σ_{j≠i} E[X_i² X_j²] + Σ_i Σ_{j≠i} Σ_{k≠i,j} Σ_{q≠i,j,k} E[X_i X_j X_k X_q].    (4)
Fortunately, because of independence and E X_i = 0, we have that E[X_i³ X_j] = E[X_i² X_j X_k] = E[X_i X_j X_k X_q] = 0 whenever the indices are distinct. Therefore we can simplify (4) to
E[S⁴] = Σ_i E[X_i⁴] + 3 Σ_i Σ_{j≠i} E[X_i² X_j²] = n + 3n(n − 1) ≤ 3n².    (5)
Combining (3) and (5), we get

Pr[S ≥ 10√(n ln n)] ≤ 3n²/(10000 n² ln² n) = 3/(10000 ln² n).
This is a better bound than what we get from Chebyshev.
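As a quick check of (5) and of the improvement over Chebyshev (an added sketch, not part of the notes):

    import math
    from itertools import product

    # Exact check, by enumerating all 2^n sign patterns, that E[S^4] = n + 3n(n-1)
    # for small n; then compare the Chebyshev and fourth-moment tail bounds.
    for n in [2, 4, 6, 8]:
        fourth = sum(sum(x) ** 4 for x in product((-1, 1), repeat=n)) / 2 ** n
        print(f"n={n}: E[S^4] = {fourth:.0f}, n + 3n(n-1) = {n + 3 * n * (n - 1)}")

    for n in [10**4, 10**8]:
        chebyshev = 1 / (100 * math.log(n))
        fourth_moment = 3 / (10000 * math.log(n) ** 2)
        print(f"n={n:.0e}: Chebyshev = {chebyshev:.5f}, "
              f"fourth moment = {fourth_moment:.7f}")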
Remark 2. When using the fourth moment method, we only used the independence among
every quadruple of random variables. Therefore the bound works for 4-wise independent
random variables too.
Remark 3. We can extend this method to considering S^{2k} for positive integer k's, and picking the k that optimizes the upper bound. However, this plan would lead to the painful estimation of E[S^{2k}]. We will use a slightly different method to get better upper bounds.
4 The “Chernoff method”
Instead of S^{2k}, let us consider the function e^{λS} for some positive parameter λ. Since e^x is a monotonically increasing function, we have

Pr[S ≥ 10√(n ln n)] = Pr[λS ≥ 10λ√(n ln n)] ≤ Pr[e^{λS} ≥ e^{10λ√(n ln n)}].
By the Markov inequality (also checking that e^{λS} > 0), we have

Pr[e^{λS} ≥ e^{10λ√(n ln n)}] ≤ E[e^{λS}]/e^{10λ√(n ln n)}.    (6)
Now it remains to upper bound E[e^{λS}]. We have

E[e^{λS}] = E[e^{λ Σ_i X_i}] = E[Π_i e^{λX_i}] = Π_i E[e^{λX_i}].    (7)

Note that in the last equality, we used the full independence among all the X_i's.
On the other hand, by the distribution of X_i, we have

E[e^{λX_i}] = (1/2) e^λ + (1/2) e^{−λ}
= (1/2)(1 + λ + λ²/2! + λ³/3! + λ⁴/4! + ...) + (1/2)(1 − λ + λ²/2! − λ³/3! + λ⁴/4! − ...)   (Taylor expansion)
= 1 + λ²/2! + λ⁴/4! + λ⁶/6! + ... ≤ e^{λ²/2},

where the last inequality holds term by term, since (2k)! ≥ 2^k · k!.
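The estimate E[e^{λX_i}] = cosh(λ) ≤ e^{λ²/2} can also be checked numerically (an added one-line sketch):

    import math

    # Numerical check of the key estimate E[e^{lam*X_i}] = cosh(lam) <= e^{lam^2/2}.
    for lam in [0.1, 0.5, 1.0, 2.0, 5.0]:
        print(f"lam = {lam}: cosh = {math.cosh(lam):.4f}, "
              f"e^(lam^2/2) = {math.exp(lam ** 2 / 2):.4f}")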
Getting back to (7), we have

E[e^{λS}] ≤ Π_i e^{λ²/2} = e^{nλ²/2}.
Combining this with (6), we have

Pr[e^{λS} ≥ e^{10λ√(n ln n)}] ≤ e^{nλ²/2 − 10λ√(n ln n)}.    (8)

Picking λ = 10√(ln n / n), we minimize the right-hand side of (8) and get our desired upper bound

Pr[e^{λS} ≥ e^{10λ√(n ln n)}] ≤ e^{50 ln n − 100 ln n} = 1/n^50.
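Finally, the choice of λ can be double-checked numerically (an added sketch; the grid search is only a crude illustration of the optimization in (8)):

    import math

    # Numerically minimize the exponent n*lam^2/2 - 10*lam*sqrt(n ln n) from (8)
    # over lam and compare with the closed-form choice lam = 10*sqrt(ln n / n),
    # which gives the exponent 50 ln n - 100 ln n = -50 ln n, i.e. the bound 1/n^50.
    n = 10 ** 6

    def exponent(lam):
        return n * lam ** 2 / 2 - 10 * lam * math.sqrt(n * math.log(n))

    grid = [k * 1e-5 for k in range(1, 20_000)]  # crude grid search over (0, 0.2)
    best = min(grid, key=exponent)
    closed_form = 10 * math.sqrt(math.log(n) / n)
    print(f"grid-search lam = {best:.5f}, closed-form lam = {closed_form:.5f}")
    print(f"minimized exponent = {exponent(best):.1f}, "
          f"-50 ln n = {-50 * math.log(n):.1f}")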
At the beginning of the next lecture, we are going to extend this method to more general random variables and general thresholds, and go through the proof of the famous Chernoff bound.