
Math 216A Notes, Week 5
Scribe: Anyastassia Sebolt
Disclaimer: These notes are not nearly as polished (and quite possibly not nearly as correct) as a published
paper. Please use them at your own risk.
1. Thresholds for Random Graphs
As with our last example from last week, we will be looking at applications of Chebyshev’s Inequality.
Theorem 1. If X is any random variable with mean µ, and variance σ 2 , then
P(|X − µ| ≥ λσ) ≤ 1/λ².
This can be useful when we are trying to show that a variable is likely to be large. We first show that it is
large on average (its expectation is large), then show that it is usually close to its expectation (the variance
is small).
For example, consider the following model of Random Graphs (the so-called Erdős-Rényi model): We have n
vertices (where n should be thought of as large). Every edge between two vertices appears with probability
p, which might depend on n, and all edges are independent. Intuitively, p should be thought of here as a
weighting scheme – as p increases from 0 to 1 this model gives more and more weight to graphs with more
and more edges.
Definition 1. A threshold for an event is some probability p∗ so that if p/p∗ → 0, then the event happens with probability o(1), and if p/p∗ → ∞, the probability of the event is 1 − o(1).
Here we use the notation f ≪ g and g ≫ f to mean that f/g tends to 0 as n tends to infinity.
1.1. A Trivial Example. The threshold for having an edge is 1/n². If p ≪ 1/n², the expected number of edges, p·C(n, 2) = p·n(n−1)/2, tends to 0, so by Markov's inequality there is almost surely not an edge. If p ≫ 1/n², then
P(no edges) = (1 − p)^{C(n, 2)} ≤ e^{−p·C(n, 2)} = o(1).
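As a quick numerical illustration of this threshold (not part of the original notes), note that the number of edges in this model is exactly Binomial(C(n, 2), p), so we can sample it directly. The following minimal sketch estimates P(at least one edge) for one value of p well below 1/n² and one well above; the choices of n, the two values of p, and the number of trials are arbitrary.

```python
import numpy as np

# The edge count of G(n, p) is Binomial(n(n-1)/2, p), so sample it directly
# and estimate the probability that at least one edge appears.
rng = np.random.default_rng(0)
n, trials = 1000, 10_000
m = n * (n - 1) // 2  # number of potential edges
for p in (0.1 / n**2, 10 / n**2):
    edges = rng.binomial(m, p, size=trials)
    print(f"p = {p:.2e}: estimated P(at least one edge) = {np.mean(edges > 0):.3f}")
```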
1.2. A More Complicated Example (Counting Triangles). Consider the threshold for the event of
having at least one triangle in the graph (so we want three vertices that are all connected to each other).
The expected number of triangles is C(n, 3)·p³. So
• If p ≪ 1/n, the expected number of triangles tends to 0.
• If p ≫ 1/n, the expected number of triangles tends to infinity.
We would like to use this to say that 1/n is a threshold for the appearance of triangles in G. If we only consider expectations, we can only get halfway there. It is true that if X is any non-negative integer-valued variable such that E(X) tends to 0, then with high probability X = 0 (this is just Markov's inequality). However, it is not always true that if E(X) tends to infinity then with high probability X > 0, since it could be that the high expectation is because X is extremely large a small fraction of the time. (The homework gives an example of how this could happen if we replace triangles by a different graph.)
Here we would like to apply Chebyshev's inequality to X, where X is the number of triangles. In order to do this, we need to compute the variance of X. Note that we can think of X as Σ_i xi, where the index i ranges over all subsets of three vertices, and the variable xi is equal to 1 if the subset spans a triangle and 0 otherwise. In this form we can write
Var(X) = E[(Σ_i (xi − E(xi)))²] = Σ_i Σ_j Covar(xi, xj),
where
Covar(xi, xj) := E((xi − E(xi))(xj − E(xj))) = E(xi xj) − E(xi)E(xj)
is the covariance of xi and xj. Since the xi are non-negative indicator variables, the covariance is always at most E(xi xj).
In our case we need to compute the sum of the covariances over all pairs of potential triangles. If we denote the vertex sets of the two triangles by S1 and S2, we can break this up into four cases depending on the intersection of S1 and S2.
(1) If |S1 ∩ S2| = 0, the covariance is 0, because the appearances of the two triangles are independent.
(2) If |S1 ∩ S2| = 1, the covariance is still 0, because the triangles in question still do not share an edge.
(3) If |S1 ∩ S2| = 2, then 2 vertices are shared, and hence one edge is shared. So for both triangles to be present, five edges must be present. This means P(S1, S2 both present) = p⁵, which also gives an upper bound on the covariance. The number of such pairs (S1, S2) is at most n⁴ (there are 4 vertices involved in the configuration), so the total contribution of such pairs to the variance is at most n⁴p⁵.
(4) If |S1 ∩ S2| = 3, so that S1 = S2, then the covariance is at most p³ and the number of such pairs is at most n³, so the total contribution here is at most n³p³.
Adding up over all the above cases, we have that the variance is at most p⁵n⁴ + p³n³. By Chebyshev's inequality, we can now say that if p ≫ 1/n,
P(no triangles) ≤ P(|X − E(X)| ≥ E(X)) ≤ σ²/µ²,
which, up to constant factors, is at most
(p⁵n⁴ + p³n³)/(p³n³)² = 1/(pn²) + 1/(p³n³),
and this tends to 0.
Remark 1. In a sense you can think of what we’re doing as breaking up the sum defining variance into a sum
over intersecting pairs and a sum over non-intersecting pairs. The first sum is small just because there aren’t
too many intersecting pairs, while the second sum is zero because non-intersecting triangles are pairwise
independent.
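As an aside (not part of the original notes), here is a small simulation sketch of the triangle threshold. It assumes the networkx package is available; n = 300, the two values of p, and the trial count are arbitrary illustrative choices.

```python
import networkx as nx

# Count triangles in G(n, p) for one p well below the threshold 1/n and one well above.
def triangle_counts(n, p, trials=20, seed=0):
    counts = []
    for t in range(trials):
        G = nx.gnp_random_graph(n, p, seed=seed + t)
        # nx.triangles reports, for each vertex, the number of triangles through it,
        # so the total number of triangles is the sum divided by 3.
        counts.append(sum(nx.triangles(G).values()) // 3)
    return counts

n = 300
for p in (0.1 / n, 10.0 / n):
    counts = triangle_counts(n, p)
    avg = sum(counts) / len(counts)
    frac = sum(c > 0 for c in counts) / len(counts)
    print(f"p = {p:.5f}: average #triangles = {avg:.2f}, "
          f"fraction of trials with a triangle = {frac:.2f}")
```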
2. Back to the Ménage Problem
Here's a problem we considered before: How many permutations σ of 1, ..., n have σ(i) ≠ i (mod n) and σ(i) ≠ i + 1 (mod n) for 1 ≤ i ≤ n?
For any n, if we choose σ at random, the probability that "i is bad" is 2/n. Since these events are in a sense "nearly" independent for large n, we may also guess that the probability that no bad events occur is approximately (1 − 2/n)^n ≈ 1/e². Here's one step towards making that intuition rigorous.
Let xi be the indicator function of the "bad" event in which σ(i) = i or i + 1 (mod n). As above, we know that E(xi) = 2/n. Imagine for now that the xi were independent. We would have
E((x1 + ... + xn)²) = Σ_i Σ_j E(xi xj) = n·(2/n) + (n² − n)·(4/n²) = 6 + o(1).
In the last line we broke our sum up according to 2 different cases:
Case (1): If i = j, we have E(xi²) = 2/n. This happens for n terms of our sum.
Case (2): If i ≠ j, then E(xi xj) = (2/n)·(2/n) = 4/n², since xi and xj are assumed independent. There are n² − n terms with this property.
We now turn to the computation of the actual second moment. As before, we write
E((x1 + · · · + xn)²) = Σ_i Σ_j E(xi xj),
but now xi and xj are not independent. We compute E(xi xj) in three cases:
Case (1): If |i − j| > 1, E(xi xj) = (2/n)·(2/(n − 1)).
Case (2): If |i − j| = 1, E(xi xj) = 3/(n(n − 1)). (3 because one of the four ways in which both xi and xj could be bad involves both i and j mapped to the same place by σ.)
Case (3): If i = j, E(xi xj) = 2/n.
We therefore have (adding up over the three cases in reverse order)
E((x1 + · · · + xn)²) = Σ_i Σ_j E(xi xj) = n·(2/n) + 2n·(3/(n(n − 1))) + (n² − 3n)·(4/(n(n − 1))) = 6 + o(1).
So the second moment is asymptotically the same as it would be if the xi were independent. It is possible to show by a similar argument that the same is true for E((x1 + · · · + xn)^k) for any k, which turns out to be enough to make the "nearly independent" intuition rigorous.
Remark 2. The splitting up into cases here was similar to how it was for counting triangles. Almost all of the pairs have |i − j| > 1, in which case we have E(xi xj) ≈ E(xi)E(xj). There are a few pairs i ≠ j where the covariance is larger, but the number of such pairs is so small that the net contribution to the variance is negligible.
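Here is a small simulation sketch of this computation (mine, not part of the notes): it draws uniformly random permutations, counts the bad indices, and averages the square of that count, which should approach 6 for large n. The values of n, the number of trials, and the seed are arbitrary.

```python
import random

# Estimate E((x1 + ... + xn)^2), where xi indicates sigma(i) = i or i+1 (mod n),
# for a uniformly random permutation sigma of {0, ..., n-1}.
def second_moment(n, trials=20_000, seed=0):
    rng = random.Random(seed)
    perm = list(range(n))
    total = 0
    for _ in range(trials):
        rng.shuffle(perm)
        bad = sum(1 for i in range(n) if perm[i] == i or perm[i] == (i + 1) % n)
        total += bad * bad
    return total / trials

for n in (10, 50, 200):
    print(n, second_moment(n))
```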
3. Exponential Moments and the Chernoff Bound
In a sense, Chebyshev’s inequality is as tight as we can hope for. If, for example,
P(X = σ) = P(X = −σ) = 1/2
and λ = 1 then we certainly can’t hope to say anything stronger than
P(|X| ≥ σ) ≤ 1.
But there are also many situations where it isn't close to being as good as reality. For example, suppose that xi = ±1, each sign with probability 1/2, and X = Σ_{i=1}^n xi. We have Var(X) = n, so Chebyshev's Inequality gives
P(X > λ√n) ≤ 1/λ².
However, in reality, X is asymptotically normal, and the probability decays like e^{−λ²/2}. In this case, Chebyshev is very far off from the correct bound for large λ. The key difference between this and the "tight" example above is that here the sum is made up of a lot of independent variables which are not too large. There will usually be a lot of cancellation, and we want to exploit this to get a better tail bound.
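Here is a small numerical illustration of that gap (not from the notes): sample X as a sum of n independent ±1 variables (equivalently, 2·Binomial(n, 1/2) − n) and compare the empirical tail with the Chebyshev bound and with e^{−λ²/2}. The values of n, the sample count, and the λ grid are arbitrary.

```python
import numpy as np

# Compare the empirical tail P(X > lambda*sqrt(n)) with the Chebyshev bound
# 1/lambda^2 and the Gaussian-type decay e^{-lambda^2/2}, where X is a sum of
# n independent +/-1 variables, i.e. X = 2*Binomial(n, 1/2) - n.
rng = np.random.default_rng(0)
n, samples = 10_000, 1_000_000
X = 2 * rng.binomial(n, 0.5, size=samples) - n
for lam in (1.0, 2.0, 3.0):
    empirical = np.mean(X > lam * np.sqrt(n))
    print(f"lambda = {lam}: empirical = {empirical:.1e}, "
          f"Chebyshev = {1 / lam**2:.1e}, exp(-lambda^2/2) = {np.exp(-lam**2 / 2):.1e}")
```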
Here’s a more general framework.
Let x1, . . . , xn be independent random variables with |xi − E(xi)| ≤ 1, and let X = x1 + · · · + xn. Let σ² be the variance of X. We want to find a tail bound on P(|X − E(X)| ≥ λσ). For simplicity, let us assume that E(xi) = 0 (we can always subtract a constant from each xi to make this true without affecting our bound).
Before, our bound from Chebyshev’s inequality was proven by using an argument along the lines of
P(|X − E(X)| > λ) = P((X − E(X))² > λ²) = P(X² > λ²) ≤ E(X²)/λ²,
where the last inequality holds by Markov's inequality. To get a better bound, we'll apply a similar argument to a "steeper" function. Let t be a parameter to be chosen later (we will eventually optimize over t). By Markov's inequality, we have
P(X > λ) = P(e^{tX} > e^{tλ}) ≤ E(e^{tX})/e^{tλ}.
We now exploit that our xi are independent to write
E(e^{tX}) = E(e^{t x1 + t x2 + ... + t xn}) = ∏_{i=1}^n E(e^{t xi}).¹
At this point, we still need to find some bound on e^{t xi}. Assume for now that −1 ≤ t ≤ 1. Then |t xi| ≤ 1 because of our assumption that the xi are bounded. It is a quick calculus exercise to show that for −1 ≤ u ≤ 1 we have e^u ≤ 1 + u + u². This means
E(e^{t xi}) ≤ E(1 + t xi + t² xi²) = 1 + Var(xi)·t² ≤ e^{t² Var(xi)}.
¹This trick is why the Fourier transform (or "characteristic function", as some call it) is useful in probability.
Multiplying over all i, we have
∏_{i=1}^n E(e^{t xi}) ≤ ∏_{i=1}^n e^{t² Var(xi)} = exp(t² Σ_{i=1}^n Var(xi)) = e^{t² σ²}.
Hence, P(X > λ) ≤ E(e^{tX})/e^{tλ} ≤ e^{t² σ²}/e^{tλ}. Now, if we relabel our variables slightly (replacing λ by λσ), P(X > λσ) ≤ e^{t² σ²} e^{−tλσ}. To get the best bound possible, we would like to take t = λ/(2σ). However, we earlier made the assumption that |t| ≤ 1.
Splitting into two cases according to the value of t gives us the following:
(1) If λ ≤ 2σ, take t = λ/(2σ). This gives P(X > λσ) ≤ e^{−λ²/4}.
(2) If λ > 2σ, take t = 1. This gives P(X > λσ) ≤ e^{−λσ + σ²} ≤ e^{−λσ/2}.
An identical argument gives us a bound on the probability that X is very small. Combining the two, we obtain
Theorem 2. (Chernoff's Inequality) Let x1, ..., xn be independent random variables satisfying |xi − E(xi)| ≤ 1. Let X = x1 + ... + xn have variance σ². Then
P(|X − E(X)| ≥ λσ) ≤ 2 max{e^{−λ²/4}, e^{−λσ/2}}.
One particular special case here is useful. Suppose that each xi is the indicator function of some event with probability pi, meaning that xi = 1 with probability pi and xi = 0 otherwise. Then
E(xi) = pi ≥ Var(xi) = pi(1 − pi).
Adding this up over all i, we get E(X) ≥ Var(X) = σ².
Now take λ = εE(X)/σ. Then, since σ² ≤ E(X), Chernoff's bound becomes
P(|X − E(X)| > εE(X)) ≤ 2 max{e^{−ε² E(X)/4}, e^{−ε E(X)/2}}.
If E(X) is very large, this probability is automatically small. In particular, if the pi are bounded away from
0, this probability is exponentially small in n. This is great for applications when we want to take a union
bound over a very large number of events.
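As a quick sanity check of this special case (an illustration of mine, not from the notes), the following compares the empirical tail of a sum of independent indicators with the bound 2·max{e^{−ε²E(X)/4}, e^{−εE(X)/2}}. The choices n = 2000, pi ≡ 0.3, the ε values, and the sample count are arbitrary.

```python
import numpy as np

# X is a sum of n independent indicators, each with probability p, so
# X ~ Binomial(n, p) and E(X) = n*p.  Compare the empirical tail
# P(|X - E(X)| >= eps*E(X)) with 2*max(e^{-eps^2*E(X)/4}, e^{-eps*E(X)/2}).
rng = np.random.default_rng(1)
n, p, samples = 2000, 0.3, 200_000
X = rng.binomial(n, p, size=samples)
mean = n * p
for eps in (0.05, 0.1, 0.2):
    empirical = np.mean(np.abs(X - mean) >= eps * mean)
    bound = 2 * max(np.exp(-eps**2 * mean / 4), np.exp(-eps * mean / 2))
    print(f"eps = {eps}: empirical = {empirical:.2e}, Chernoff-type bound = {bound:.2e}")
```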
4. Balance in Random Graphs
Consider a graph on n vertices, n large, where every edge is independently present/absent with probability 0.5 (that is to say, a graph chosen uniformly from all graphs on n vertices). We would like to say that the edges of this graph are spread out evenly. Here are three senses in which we could make such a claim:
(1) Every vertex has degree between n/2 − 2√(n log(n)) and n/2 + 2√(n log(n)). (So each vertex has about the same degree.)
(2) If we split the vertices into 2 equal-sized subsets X and Y, the number of edges between X and Y is between (n²/4)·(1/2) − n^{3/2}/2 and (n²/4)·(1/2) + n^{3/2}/2. (So most splits have about the same number of edges connecting the two sides.)
(3) For any two disjoint subsets X and Y, the number of edges between X and Y is between |X||Y|/2 − n^{3/2} and |X||Y|/2 + n^{3/2}.
We can show each of these without too much effort by the Chernoff bound and the union bound. For example, consider (1). Each vertex v has n − 1 possible edges. The degree of v can be thought of as Σ_i xi, where xi = 1 if the i-th potential edge at v is present. So deg(v) has mean (n − 1)/2 and variance σ² = (n − 1)/4. By Chernoff,
P(|deg(v) − (n − 1)/2| > λ√(n − 1)/2) ≤ 2 max(e^{−λ²/4}, e^{−λ√(n − 1)/4}).
Now if we take λ = 4√(log n), the right hand side becomes 2n^{−4}. So any given vertex has abnormal degree with probability at most 2n^{−4}, so the probability that some vertex has abnormal degree is at most 2n^{−3}.
For (2), we do exactly the same thing. For any particular split, the number of edges has mean n²/8 and variance n²/16. So σ = n/4, and by Chernoff,
P(|number of edges − n²/8| > λn/4) ≤ 2 max(e^{−λ²/4}, e^{−λn/8}).
The number of splits is (1/2)·C(n, n/2) < 2^n. So we choose λ to make the Chernoff bound smaller than 2^{−n}. It turns out that if we take λ = 2√n, the bound on the probability that any one split fails becomes 2e^{−n}. So the probability that some split fails is at most 2^n · 2e^{−n} ≪ 1.
The proof for (3) is similar.
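Returning to claim (1), here is a simulation sketch (not in the notes) that checks it directly on one sample of the random graph: build a random symmetric adjacency matrix and verify that every degree lies within n/2 ± 2√(n log n). The value n = 2000 and the seed are arbitrary.

```python
import math
import numpy as np

# Check claim (1) on a single sample of the random graph with edge probability 1/2.
rng = np.random.default_rng(2)
n = 2000
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1)  # independent bits above the diagonal
adj = upper + upper.T                                   # symmetric adjacency matrix, zero diagonal
degrees = adj.sum(axis=1)
slack = 2 * math.sqrt(n * math.log(n))
print("observed degree range:", degrees.min(), degrees.max())
print("allowed range: [%.1f, %.1f]" % (n / 2 - slack, n / 2 + slack))
print("all degrees in range:", bool(np.all(np.abs(degrees - n / 2) <= slack)))
```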
Remark 3. Note that for all of these our methods were somewhat similar. Count the number of bad events that we want to avoid, then pick λ in the Chernoff bound so as to make the probability of each bad event much smaller than one over the number of bad events.
4.1. Imbalance in General Graphs. One question we might ask is whether it is possible, by clever construction, to come up with a graph that's more evenly spread out than the random graph (in the sense of (3)). It's certainly reasonable to think so – in the balls-in-bins example it took us about n log n balls to fill all the bins by random dropping, but there's an obvious way of filling all the bins with only n balls. As it turns out, the random graph is actually the best possible here, up to possibly the constant in front of the n^{3/2}. We'll only sketch the proof of a weaker version of this where we assume that the graph is regular (which intuitively should help us spread edges out more evenly).
Theorem 3. There is a δ > 0 such that for large enough even n, any n/2-regular graph on n vertices has two subsets X and Y such that E(X, Y) ≥ |X||Y|/2 + δn^{3/2}, where E(X, Y) denotes the number of edges between X and Y.
Proof (sketch): Let X be a subset of the vertices of the graph chosen randomly, with P(x ∈ X) = .01 for each vertex x, independently.
Definition 2. We will call a vertex v heavy if:
(1) v ∉ X
(2) the number of neighbors of v in X is at least |X|/2 + √n/10000.
We will take Y to be the set of heavy vertices. It can be verified directly that a given v is heavy with
probability at least 1/10000, so E(|Y |) ≥ n/10000. On the other hand, we know that |Y | is at most n. So
we have
E(|Y|) ≤ n·P(|Y| > n/20000) + (n/20000)·P(|Y| ≤ n/20000) ≤ n·P(|Y| > n/20000) + n/20000.
Comparing this with E(|Y|) ≥ n/10000, we get P(|Y| > n/20000) ≥ 1/20000. This means there must be a choice of X for which |Y| is at least n/20000. But then by construction
E(X, Y) − |X||Y|/2 = Σ_{y∈Y} (E(X, {y}) − |X|/2) ≥ |Y|·√n/10000 ≥ δn^{3/2}.
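The one probabilistic claim in this sketch, that a fixed vertex is heavy with probability at least 1/10000, can be checked numerically for a specific regular graph. The sketch below (my construction, not from the notes) uses a circulant n/2-regular graph in which vertex 0 is adjacent to ±1, ..., ±n/4 (mod n); n = 2000, the trial count, and the seed are arbitrary.

```python
import numpy as np

# Estimate P(vertex 0 is heavy) when X contains each vertex independently with
# probability 0.01, for a circulant n/2-regular graph on n vertices.
rng = np.random.default_rng(3)
n, trials = 2000, 10_000
k = n // 4
is_neighbor = np.zeros(n, dtype=bool)
is_neighbor[1:k + 1] = True
is_neighbor[n - k:] = True                     # the n/2 neighbours of vertex 0
member = rng.random((trials, n)) < 0.01        # each row is one random choice of X
in_X = member[:, is_neighbor].sum(axis=1)      # neighbours of 0 that landed in X
size_X = member.sum(axis=1)                    # |X|
heavy = (~member[:, 0]) & (in_X >= size_X / 2 + np.sqrt(n) / 10000)
print("estimated P(vertex 0 is heavy):", heavy.mean())
```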
5. Asymptotic Bases
Definition 3. A set A is an asymptotic basis of order k if every sufficiently large n is the sum of at most k elements of A.
Example 1: Lagrange’s 4 square Theorem (every positive integer is the sum of at most 4 squares) means
that {1, 4, 9, 16, ...} is a basis of order 4.
Example 2: Goldbach’s Conjecture would imply that the primes {2, 3, 5, 7, 11, 13, ...} are a basis of order 3.
Example 3: {1, 2, 4, 8, 16, 32, ...} is not a basis for any k, since a number with k + 1 digits of 1 in its binary
representation cannot be written as the sum of k powers of 2.
Question 1. (Sidon, 1930s): How "thin" can a basis of positive integers be?
We’ll focus on the case k = 2, and start with a couple of cheap bounds.
Let rA (n) be the number of representations of n as a1 + a2 , for a1 , a2 ∈ A.
We first note that if a1 + a2 ≤ N, certainly a1 ≤ N and a2 ≤ N. This means
Σ_{n=1}^N rA(n) ≤ (1/2)·|A ∩ {1, ..., N}|².
Any basis satisfies rA(n) ≥ 1 for large n, which by the above implies
|A ∩ {1, ..., N}| ≥ √(2N).
Conversely, if a1 and a2 ≤ N, then a1 + a2 ≤ 2N. This implies
(1/2)·|A ∩ {1, ..., N}|² ≤ Σ_{n=1}^{2N} rA(n).
So if rA(n) is small (not too much larger than 1), then |A ∩ {1, ..., N}| ≈ √N.
Theorem 4. (Erdős, 1956) There is an A such that rA(n) = Θ(log(n)), meaning that there are constants c1 and c2 such that for large enough n we have
c1 log(n) ≤ rA(n) ≤ c2 log(n).
Proof. By the above, we expect that |A ∩ {1, ..., N}| should be about √(N log N). So we'd like to consider a set formed by taking x ∈ A randomly and independently with probability √(log(x)/x). Well, we can't quite do that, because probabilities are at most 1. In actuality, we'll take
P(n ∈ A) = min( C·√(log(n)/n), 1 ),
where C is a large constant. Then rA(n) is the sum of n/2 indicator events xi of the form {i ∈ A and n − i ∈ A}.
We aim to show that roughly log n of these events hold. To do this, we can safely ignore (or even adjust the probability of) any constant number of events. The events we'll "fix" correspond to the representations n = n and n = n/2 + n/2, along with any event where P(i ∈ A) = 1. For the remaining events, we have
P(xi) = C²·√(log(i)/i)·√(log(n − i)/(n − i)).
Therefore
E(rA(n)) = Σ_{i=1}^{n/2} C²·√(log(i) log(n − i)) / √(i(n − i)) + O(1).
Here, an upper bound for E(rA(n)) is:
Σ_{i=1}^{n/2} C² log(n) / (√i·√(n/2)) = (C² log(n)/√(n/2))·Σ_{i=1}^{n/2} 1/√i ≤ 2C² log(n) + o(1).
And a lower bound for E(rA(n)) is:
Σ_{i=n/4}^{n/2} C²·√(log(i) log(n − i)) / √(i(n − i)) ≥ (n/4)·C² log(n/4) / (3n/4) = (1/3 + o(1))·C² log(n),
using that each of these terms has √(log(i) log(n − i)) ≥ log(n/4) and √(i(n − i)) ≤ 3n/4.
Taking C = 9, we see that for large n, 27 log(n) ≤ E(rA(n)) ≤ 243 log(n).
Now, for a fixed n, the events for the different values of i are independent, so we may use the Chernoff bound to see that
P(|rA(n) − E(rA(n))| ≥ 26 log(n)) ≤ 2e^{−13 log(n)} = 2n^{−13}.
So for sufficiently large n the probability that n doesn't satisfy log(n) ≤ rA(n) ≤ 269 log(n) is at most 2n^{−13}.
This has a finite sum over all n, so by Borel Cantelli we have with probability 1 that the number of n which
fail to satisfy this is finite.
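Here is a simulation sketch of this construction (mine, not part of the notes): include each integer m independently with probability min(C√(log(m)/m), 1), then look at rA(n)/log(n) for a few values of n; with C = 9 as above, the ratio should stay between fixed positive constants. The cutoff N, the sample points, and the seed are arbitrary, and rA(n) is counted here over unordered pairs.

```python
import math
import random

# Build a random set A with P(m in A) = min(C*sqrt(log(m)/m), 1), then count
# representations r_A(n) = #{(a1, a2) : a1 <= a2, a1 + a2 = n, a1 and a2 in A}.
def random_basis(N, C=9, seed=0):
    rng = random.Random(seed)
    return {m for m in range(1, N + 1)
            if rng.random() < min(C * math.sqrt(math.log(m) / m), 1.0)}

def r_A(A, n):
    return sum(1 for a in A if a <= n // 2 and (n - a) in A)

N = 200_000
A = random_basis(N)
for n in (1_000, 10_000, 100_000, 200_000):
    reps = r_A(A, n)
    print(f"n = {n}: r_A(n) = {reps}, r_A(n)/log(n) = {reps / math.log(n):.1f}")
```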
It’s possible, but more complicated, to extend this argument to cover the case of larger k (this was first done
by Erdős and Tetali). The key wrinkle that needs to be overcome is that we no longer have independence.
For example, if n = 10 the representations 5 + 4 + 1 and 5 + 3 + 2 are coupled by the presence of 5 in both
of them.
5.1. Two Open Problems. Here are two infamous conjectures due to Erdős and Turán:
1. You cannot have a basis for which rA(n) ≥ 1 for all large n but rA(n) = o(log(n)).
2. There is no constant C for which there exists a basis satisfying 1 ≤ rA (n) ≤ C.
The second conjecture is of course weaker than the first.