CSCI-B609: A Theorist’s Toolkit, Fall 2016                                      Aug 25
Lecture 02: Bounding tail distributions of a random variable
Lecturer: Yuan Zhou                                  Scribe: Yuan Xie & Yuan Zhou

Let us consider the unbiased coin flips again. I.e., let the outcome of the $i$-th coin toss be a random variable
\[
  X_i = \begin{cases} +1 & \text{w.p. } \tfrac{1}{2}, \\ -1 & \text{w.p. } \tfrac{1}{2}. \end{cases}
\]
We assume all coin tosses are independent, and we would like to study the sum $S$ of the first $n$ coin tosses,
\[
  S = \sum_{i=1}^{n} X_i .
\]
In this lecture, we would like to study the probability that $S$ greatly deviates from its mean $\mathbb{E} S = 0$. Specifically, for some parameter $t$, we would like to estimate the probability $\Pr[S > t]$. Intuitively, such a probability should be “small” for large enough $t$. The goal of this lecture (and part of the next one) is to derive quantitative upper bounds on the tail probability mass, parameterized by $t$.

As we did in the previous lecture, using the Berry-Esseen theorem we know that
\[
  \bigl| \Pr[S \ge \sqrt{n} \cdot t] - \Pr[G \ge t] \bigr| \le \frac{O(1)}{\sqrt{n}},
\]
where $G \sim N(0,1)$ is a standard Gaussian. For convenience, we may also use the following informal notation:
\[
  \Pr[S \ge \sqrt{n} \cdot t] = \Pr[G \ge t] \pm \frac{O(1)}{\sqrt{n}}. \tag{1}
\]
Using basic calculus, we can estimate
\[
  \Pr[G \ge t] = \int_{t}^{+\infty} \frac{1}{\sqrt{2\pi}} e^{-s^2/2} \, ds \le O(1) \cdot e^{-t^2/2}. \tag{2}
\]
Now let us fix the parameter $t = 10\sqrt{\ln n}$. Combining (1) and (2), we have
\[
  \Pr[S \ge \sqrt{n} \cdot t] \le \Pr[G \ge t] + O\!\left(\frac{1}{\sqrt{n}}\right)
  = O\!\left(\exp\!\left(-\frac{(10\sqrt{\ln n})^2}{2}\right)\right) + O\!\left(\frac{1}{\sqrt{n}}\right)
  = O\!\left(\frac{1}{n^{50}}\right) + O\!\left(\frac{1}{\sqrt{n}}\right)
  = O\!\left(\frac{1}{\sqrt{n}}\right).
\]
We see that the tail mass of the standard Gaussian is only $O(1/n^{50})$. However, the error term $O(1/\sqrt{n})$ introduced by the Berry-Esseen theorem is much greater, and this error term is the main reason we cannot get a better bound this way. In the rest of this lecture, we will try several other methods to improve the upper bound.

1 Markov inequality

When we only know the mean of a nonnegative random variable, the Markov inequality gives a simple upper bound on the probability that it deviates from its mean.

Theorem 1 (Markov inequality). Given a random variable $X$, assume $X \ge 0$. For each parameter $t \ge 1$, we have
\[
  \Pr[X \ge t \cdot \mathbb{E}[X]] \le \frac{1}{t}.
\]

Proof. For each $\alpha > 0$, we have
\[
  \mathbb{E}[X] = \Pr[X \ge \alpha] \cdot \mathbb{E}[X \mid X \ge \alpha] + \Pr[X < \alpha] \cdot \mathbb{E}[X \mid X < \alpha]
  \ge \Pr[X \ge \alpha] \cdot \alpha + \Pr[X < \alpha] \cdot 0 = \Pr[X \ge \alpha] \cdot \alpha.
\]
Dividing both sides of the inequality by $\alpha > 0$ gives
\[
  \Pr[X \ge \alpha] \le \frac{\mathbb{E}[X]}{\alpha}.
\]
Taking $\alpha = \mathbb{E}[X] \cdot t$, we get the desired bound.

Now let us try to apply the Markov inequality to bound the tail mass of $S$. Since $S$ is not a nonnegative random variable, we cannot apply the inequality directly. However, note that $S \ge -n$ always holds, so we apply the inequality to $T = S + n$, where $\mathbb{E} T = \mathbb{E} S + n = n$. Let $t = 10\sqrt{n \ln n}$. We have
\[
  \Pr[S \ge t] = \Pr[T \ge t + n] = \Pr\!\left[T \ge (\mathbb{E} T) \cdot \frac{t+n}{n}\right]
  \le \frac{n}{t+n} = \frac{n}{n + 10\sqrt{n \ln n}}.
\]
This is a very bad bound: it does not even converge to $0$ as $n$ grows!

2 Chebyshev inequality

The Chebyshev inequality uses not only the mean of the random variable but also its variance (or the second moment). Since we have more information about the random variable, we may potentially get better bounds.

Theorem 2 (Chebyshev inequality). Assume that $\mathbb{E}[X] = \mu$ and $\mathrm{Var}[X] = \sigma^2 > 0$. For every parameter $t > 0$, we have
\[
  \Pr[|X - \mu| \ge t \cdot \sigma] \le \frac{1}{t^2}.
\]

Proof. Let $Y = (X - \mu)^2$. We can check that $\mathbb{E}[Y] = \sigma^2$ and $Y \ge 0$. Applying the Markov inequality, we have
\[
  \Pr[|X - \mu| \ge t \cdot \sigma] = \Pr[(X - \mu)^2 \ge t^2 \sigma^2] = \Pr[Y \ge t^2 \cdot \mathbb{E}[Y]] \le \frac{1}{t^2}.
\]
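As a quick sanity check (not part of the original notes), the following Python sketch compares the Markov bound $n/(n+t)$ derived above with a Monte Carlo estimate of $\Pr[S \ge t]$ for the coin-flip sum. The function name `empirical_tail` and the particular values of $n$ and `trials` are illustrative choices, not anything from the lecture.

```python
# A small numerical sketch (not part of the original notes): compare the
# Markov bound n/(n + t) derived above with a Monte Carlo estimate of
# Pr[S >= t], where S is the sum of n independent +/-1 coin flips and
# t = 10*sqrt(n ln n).  The values of n and trials are arbitrary choices.
import math
import random

def empirical_tail(n, t, trials=20_000):
    """Monte Carlo estimate of Pr[S >= t] with S = X_1 + ... + X_n, X_i uniform on {-1, +1}."""
    hits = 0
    for _ in range(trials):
        s = sum(random.choices((-1, 1), k=n))
        if s >= t:
            hits += 1
    return hits / trials

n = 1000
t = 10 * math.sqrt(n * math.log(n))
markov_bound = n / (n + t)  # Markov applied to the shifted variable T = S + n >= 0

print(f"threshold t          = {t:.1f}")
print(f"Markov bound         = {markov_bound:.4f}")         # stays bounded away from 0
print(f"empirical Pr[S >= t] = {empirical_tail(n, t):.5f}")  # essentially 0 in practice
```

The gap between the two printed numbers illustrates how much slack the Markov inequality leaves when only the mean is used.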
Now let us go back to the scenario discussed at the beginning of this lecture. We compute that
\[
  \mu = \mathbb{E} S = 0 \quad \text{and} \quad \sigma = \sqrt{\mathrm{Var}[S]} = \sqrt{\mathbb{E} S^2} = \sqrt{n}.
\]
Therefore
\[
  \Pr[S \ge 10\sqrt{n \ln n}] \le \Pr[|S| \ge 10\sqrt{n \ln n}]
  = \Pr\!\left[|S| \ge \sigma \cdot \frac{10\sqrt{n \ln n}}{\sigma}\right]
  \le \frac{\sigma^2}{(10\sqrt{n \ln n})^2} = \frac{1}{100 \ln n}.
\]
This bound is still not as good as expected. However, at least it converges to $0$ as $n \to \infty$.

Remark 1. Note that the Chebyshev inequality only needs pairwise independence among the $X_i$'s. Specifically, when computing the variance of $S$, we have
\[
  \mathrm{Var}[S] = \mathrm{Var}[X_1 + X_2 + \cdots + X_n]
  = \mathbb{E}[(X_1 + X_2 + \cdots + X_n)^2] - \bigl(\mathbb{E}[X_1 + X_2 + \cdots + X_n]\bigr)^2
  = \mathbb{E}[(X_1 + X_2 + \cdots + X_n)^2]
  = \sum_{i=1}^{n} \mathbb{E}[X_i^2] + \sum_{i \ne j} \mathbb{E}[X_i X_j]
  = \sum_{i=1}^{n} \mathbb{E}[X_i^2] = n.
\]
In the penultimate equality, we used the fact that $X_i$ is independent from $X_j$ for $i \ne j$.

3 The fourth moment method

Using the first two moments, we got better bounds than using only the mean of the random variable. Now let us try to extend this method to the fourth moment. Consider $S^4 \ge 0$. By the Markov inequality, we have
\[
  \Pr[S \ge 10\sqrt{n \ln n}] \le \Pr\!\left[S^4 \ge (10\sqrt{n \ln n})^4\right] \le \frac{\mathbb{E}[S^4]}{10000\, n^2 \ln^2 n}. \tag{3}
\]
Now let us estimate
\[
  \mathbb{E}[S^4] = \mathbb{E}\!\left[\Bigl(\sum_i X_i\Bigr)^4\right]
  = \sum_i \mathbb{E}[X_i^4]
  + 4 \sum_i \sum_{j \ne i} \mathbb{E}[X_i^3 X_j]
  + 3 \sum_i \sum_{j \ne i} \mathbb{E}[X_i^2 X_j^2]
  + 6 \sum_i \sum_{j \ne i} \sum_{k \ne i,j} \mathbb{E}[X_i^2 X_j X_k]
  + \sum_i \sum_{j \ne i} \sum_{k \ne i,j} \sum_{q \ne i,j,k} \mathbb{E}[X_i X_j X_k X_q]. \tag{4}
\]
Fortunately, because of independence and $\mathbb{E} X_i = 0$, we have
\[
  \mathbb{E}[X_i^3 X_j] = \mathbb{E}[X_i^2 X_j X_k] = \mathbb{E}[X_i X_j X_k X_q] = 0.
\]
Therefore we can simplify (4) to
\[
  \mathbb{E}[S^4] = \sum_i \mathbb{E}[X_i^4] + 3 \sum_i \sum_{j \ne i} \mathbb{E}[X_i^2 X_j^2]
  = n + 3n(n-1) \le 3n^2. \tag{5}
\]
Combining (3) and (5), we get
\[
  \Pr[S \ge 10\sqrt{n \ln n}] \le \frac{3n^2}{10000\, n^2 \ln^2 n} = \frac{3}{10000 \ln^2 n}.
\]
This is a better bound than the one we got from Chebyshev.

Remark 2. When using the fourth moment method, we only used the independence within every quadruple of random variables. Therefore the bound also works for 4-wise independent random variables.

Remark 3. We can extend this method by considering $S^{2k}$ for positive integers $k$, and picking the $k$ that optimizes the upper bound. However, this plan would lead to a painful estimation of $\mathbb{E}[S^{2k}]$. We will instead use a slightly different method to get better upper bounds.

4 The “Chernoff method”

Instead of $S^{2k}$, let us consider the function $e^{\lambda S}$ for some positive parameter $\lambda$. Since $e^x$ is a monotonically increasing function, we have
\[
  \Pr[S \ge 10\sqrt{n \ln n}] = \Pr[\lambda S \ge 10\lambda\sqrt{n \ln n}] \le \Pr\!\left[e^{\lambda S} \ge e^{10\lambda\sqrt{n \ln n}}\right].
\]
By the Markov inequality (after checking that $e^{\lambda S} > 0$), we have
\[
  \Pr\!\left[e^{\lambda S} \ge e^{10\lambda\sqrt{n \ln n}}\right] \le \frac{\mathbb{E}\bigl[e^{\lambda S}\bigr]}{e^{10\lambda\sqrt{n \ln n}}}. \tag{6}
\]
Now it remains to upper bound $\mathbb{E}[e^{\lambda S}]$. We have
\[
  \mathbb{E}\bigl[e^{\lambda S}\bigr] = \mathbb{E}\Bigl[e^{\lambda \sum_i X_i}\Bigr]
  = \mathbb{E}\Bigl[\prod_i e^{\lambda X_i}\Bigr] = \prod_i \mathbb{E}\bigl[e^{\lambda X_i}\bigr]. \tag{7}
\]
Note that in the last equality we used the full independence among all the $X_i$'s. On the other hand, by the distribution of $X_i$, we have
\[
  \mathbb{E}\bigl[e^{\lambda X_i}\bigr] = \frac{1}{2} e^{\lambda} + \frac{1}{2} e^{-\lambda}
  = \frac{1}{2}\Bigl(1 + \lambda + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \frac{\lambda^4}{4!} + \cdots\Bigr)
  + \frac{1}{2}\Bigl(1 - \lambda + \frac{\lambda^2}{2!} - \frac{\lambda^3}{3!} + \frac{\lambda^4}{4!} - \cdots\Bigr)
  \quad \text{(Taylor expansion)}
\]
\[
  = 1 + \frac{\lambda^2}{2!} + \frac{\lambda^4}{4!} + \frac{\lambda^6}{6!} + \cdots \le e^{\lambda^2/2}.
\]
Getting back to (7), we have
\[
  \mathbb{E}\bigl[e^{\lambda S}\bigr] \le \prod_i e^{\lambda^2/2} = e^{n\lambda^2/2}.
\]
Combining this with (6), we have
\[
  \Pr\!\left[e^{\lambda S} \ge e^{10\lambda\sqrt{n \ln n}}\right] \le e^{n\lambda^2/2 - 10\lambda\sqrt{n \ln n}}. \tag{8}
\]
Picking $\lambda = 10\sqrt{\frac{\ln n}{n}}$, we minimize the right-hand side of (8) and get our desired upper bound
\[
  \Pr\!\left[e^{\lambda S} \ge e^{10\lambda\sqrt{n \ln n}}\right] \le e^{50 \ln n - 100 \ln n} = \frac{1}{n^{50}}.
\]
At the beginning of the next lecture, we will extend this method to more general random variables and general thresholds, and go through the proof of the famous Chernoff bound.
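To close, here is a small sketch (again not part of the original notes) that evaluates the three upper bounds derived above on $\Pr[S \ge 10\sqrt{n \ln n}]$: the Chebyshev bound $1/(100\ln n)$, the fourth-moment bound $3/(10000 \ln^2 n)$, and the Chernoff-style bound $n^{-50}$. The helper name `tail_bounds` and the sample values of $n$ are arbitrary illustrative choices.

```python
# A sketch (not part of the original notes) evaluating the three tail bounds
# derived above for Pr[S >= 10*sqrt(n ln n)].  The sample values of n are
# arbitrary illustrative choices.
import math

def tail_bounds(n):
    a = 10 * math.sqrt(n * math.log(n))            # the threshold 10*sqrt(n ln n)
    chebyshev = 1 / (100 * math.log(n))            # Var[S] / a^2
    fourth = 3 / (10_000 * math.log(n) ** 2)       # E[S^4] / a^4 with E[S^4] <= 3n^2
    lam = a / n                                    # minimizes n*lam^2/2 - lam*a, i.e. lam = 10*sqrt(ln n / n)
    chernoff = math.exp(n * lam**2 / 2 - lam * a)  # equals exp(-50 ln n) = n^(-50)
    return chebyshev, fourth, chernoff

for n in (10**2, 10**4, 10**6):
    c, f, ch = tail_bounds(n)
    print(f"n = {n:>7}: Chebyshev {c:.3e}, 4th moment {f:.3e}, Chernoff {ch:.3e}")
```

The output makes the qualitative ordering of the three methods concrete: each additional moment (and finally the exponential moment) buys a dramatically smaller bound on the same tail event.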