A Definitions of models and algorithms we used

A.1 Steiner tree problem
Definition A.1 (Steiner tree). Let G = (V, E) be an edge-weighted graph and let W be a subset of the nodes in V. The Steiner tree problem is to find a tree in G that spans all the nodes in W and whose total edge weight is minimized.
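As a concrete illustration of the problem (not of the algorithm analyzed in this paper), the following sketch uses the Steiner-tree 2-approximation shipped with networkx; the toy graph and terminal set are hypothetical.

```python
# Minimal illustration of the Steiner tree problem using networkx's
# built-in 2-approximation; the example graph and terminals are hypothetical.
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 2.5),
    ("c", "d", 1.0), ("b", "d", 3.0),
])
W = {"a", "c", "d"}                            # terminals that must be spanned
T = steiner_tree(G, list(W), weight="weight")  # approximately minimum-weight tree spanning W
print(sorted(T.edges(data="weight")))
```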
A.2 Small world
We now review Kleinberg's small world model (Kleinberg, 2000). We remark that Kleinberg's model allows the network to be directed. Here, we shall use a straightforward way to convert a directed graph into an undirected one: there is an undirected edge between u and v if and only if (u, v) or (v, u) is in the directed graph. Thus, Kleinberg's small world model can be described as follows. The nodes reside on the two-dimensional lattice points $\{(i, j) : i, j \in \{1, \dots, \sqrt{n}\}\}$. The lattice distance is defined as $d((i, j), (k, \ell)) = |k - i| + |\ell - j|$. Let $r = 2$ and let $q = \Theta(\log n)$ be a normalization term.
There are two types of edges in the small world graph:
• local edges: if $d(u, v) \le 1$, then there is an edge between u and v.
• long range edges: if $d(u, v) > 1$, then with probability $q^{-1} \cdot d^{-r}(u, v)$ there is an edge between u and v, independently of the rest of the edges.
We shall define $d \triangleq \max_v \mathrm{E}[\mathrm{degree}(v)]$. Since $q = \Theta(\log n)$, $d$ is also a constant.
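For readers who want to experiment with this model, the sketch below generates a Kleinberg-style graph with networkx and converts it to an undirected graph as described above; note that networkx's parameter q counts long-range links per node and is not the normalization term $q = \Theta(\log n)$ used here.

```python
# A sketch of sampling a Kleinberg small-world graph and converting it to an
# undirected graph (edge {u, v} iff (u, v) or (v, u) exists in the directed graph).
# networkx's q is the number of long-range links per node, not a normalization term.
import networkx as nx

side = 20                                                   # the lattice is side x side, n = side**2
D = nx.navigable_small_world_graph(side, p=1, q=1, r=2, dim=2, seed=0)
G = D.to_undirected()
print(G.number_of_nodes(), G.number_of_edges())
```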
A.3 Chung and Lu's random graph
Next, we describe Chung and Lu's random graph model (Chung and Lu, 2002). In this model, we are given an expected degree sequence $w = (w_1, \dots, w_n)$, in which $w_i$ represents the expected degree of $v_i$. The probability that there is an edge between $v_i$ and $v_j$ is $w_i w_j \rho$ with $\rho = 1/\sum_i w_i$, independently of the other edges. Furthermore, we shall assume that the distribution of $\{w_1, \dots, w_n\}$ follows a power law with exponent parameter between 2 and 3.
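A minimal sketch of sampling from this model, assuming a heavy-tailed expected degree sequence with exponent 2.5 (the particular sequence and constants are illustrative only):

```python
# Sample a Chung-Lu graph: networkx's expected_degree_graph connects v_i and v_j
# with probability w_i * w_j / sum(w), matching the model above (capped at 1).
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
n, exponent = 1000, 2.5
w = (rng.pareto(exponent - 1, size=n) + 1.0) * 2.0   # power-law expected degrees
G = nx.expected_degree_graph(w.tolist(), selfloops=False, seed=0)
print(G.number_of_nodes(), G.number_of_edges())
```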
B Proof of Theorem 3.1
We shall partition the support of the signals into three regions: the high interval $H = [1, 2 - \gamma]$, the middle interval $M = (1 - \gamma, 1)$, and the low interval $L = [0, 1 - \gamma]$. When a node's signal is in H, the node is in S; when a node's signal is in L, the node is not in S. Thus, a detector's only goal is to identify the nodes from S among the nodes whose signals are in M.
On the other hand, observe that conditioned on a node's signal being in M, the distribution of the signal is uniform on $(1 - \gamma, 1)$. This holds regardless of whether the node is in S or not. Therefore, no algorithm can do better than random guessing in the region M. It remains to analyze the performance of an algorithm that only makes random guesses. One can see that with high probability: 1. the number of signals in M is $\Theta(\gamma n)$; 2. the number of nodes in S with signals in M is $\Theta(\gamma k)$. Thus, on average, a node in S will be picked up with probability $\Theta\!\left(\frac{\gamma k}{\gamma n}\right) = \Theta(k/n)$. Since we can select up to $O((\alpha - 1)k)$ nodes in the interval M (with high probability), the total number of nodes from S that will finally be in $\hat S$ is $O(k^2/n)$. Thus,
$$\mathrm{E}[|S - \hat S|] = (1 - o(1))\gamma k - O\!\left(\frac{k^2}{n}\right) = (1 - o(1))\gamma k$$
for sufficiently large n and k.
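A small Monte Carlo sketch can make the random-guessing bound concrete. The signal distributions below (S-nodes uniform on $[1-\gamma, 2-\gamma]$, other nodes uniform on $[0, 1]$) are a hypothetical instantiation chosen only because they reproduce the intervals H, M, L above; they are not taken from the theorem statement.

```python
# Hypothetical signal model for illustration: in the ambiguous interval M, the
# fraction of random guesses that land in S concentrates around k/n.
import numpy as np

rng = np.random.default_rng(0)
n, k, gamma = 100_000, 1_000, 0.2
in_S = np.zeros(n, dtype=bool)
in_S[:k] = True
signals = np.where(in_S,
                   rng.uniform(1 - gamma, 2 - gamma, n),
                   rng.uniform(0.0, 1.0, n))
in_M = (signals > 1 - gamma) & (signals < 1)
picked = rng.choice(np.where(in_M)[0], size=k, replace=False)   # random guesses in M
print(f"hit rate {in_S[picked].mean():.4f}  vs  k/n = {k / n:.4f}")
```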
C Warmup: the algorithm for a line case
We now give a solution for the case where the network
forms a line, i.e., E = {{vi , vi+1 } : i < n}. While a line
graph is not a realistic model for social or biological
networks, analyzing this simple example helps us to
understand how we may improve the performance of a
detector algorithm by utilizing the information of the
network structure.
Lemma C.1. Let γ be an arbitrary constant. Consider the community detection problem, in which the
underlying graph is a line and k = o(n) is polynomial
in n. There exists an efficient algorithm that returns
a set Ŝ of size k and |S − Ŝ| = o(k) whp. In other
words, for any γ, there exists a (1, γ)-detector.
Proof. Observe that the subgraph induced by S has to be connected. Hence, there exists a j such that $S = \{v_j, v_{j+1}, \dots, v_{j+k-1}\}$. Our algorithm works as follows: enumerate through all possible connected subgraphs of size k, and output an arbitrary one that contains at least $(1-\gamma)k - (\log n)\sqrt{k}$ signals in the high interval H. In what follows, we say a connected subgraph of size k passes the threshold test if and only if it contains at least $(1-\gamma)k - (\log n)\sqrt{k}$ signals in the high interval.
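On a line, every connected subgraph of size k is an interval, so the enumeration amounts to a sliding window. The sketch below implements this threshold test; the function name and interface are ours.

```python
# Sliding-window sketch of the line-graph detector from Lemma C.1: return the
# first length-k interval whose count of signals in H = [1, 2 - gamma] passes
# the threshold (1 - gamma) * k - log(n) * sqrt(k).
import numpy as np

def line_detector(signals: np.ndarray, k: int, gamma: float):
    n = len(signals)
    in_H = (signals >= 1.0).astype(int)
    threshold = (1 - gamma) * k - np.log(n) * np.sqrt(k)
    window = in_H[:k].sum()                      # signals in H inside the current window
    for j in range(n - k + 1):
        if window >= threshold:
            return np.arange(j, j + k)           # candidate set S-hat
        if j + k < n:
            window += in_H[j + k] - in_H[j]      # slide the window one node to the right
    return None
```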
We need to show that the following two events hold with high probability:
• Event 1: The subgraph induced by S passes the threshold test.
• Event 2: There exists no connected subgraph S′ such that $|S'| = k$, $|S - S'| = \Omega(k)$, and S′ passes the threshold test.
Event 1 implies that with high probability our algorithm will return at least one subgraph. Event 2 implies that with high probability the node set returned
by our algorithm will be almost the same as S.
We may apply a standard Chernoff bound to prove that event 1 happens with high probability. Specifically,
$$\Pr[S \text{ fails the threshold test}] \le \exp\!\left(-\frac{\log^2 n}{3}\right) \le \frac{1}{n^2} \qquad (4)$$
for sufficiently large n.
We next move to analyze the second event. Let $\epsilon$ be an arbitrarily small constant. We shall first show that the probability that a specific connected subset S′ with $|S - S'| \ge \epsilon k$ passes the threshold test is small. Then we shall use a union bound to argue that whp no bad S′ will pass the threshold test.
Let S′ be an arbitrary connected subset of size k such that $|S - S'| \ge \epsilon k$. In expectation, the number of nodes in S′ that have signals in H is at most $(1 - \epsilon)(1 - \gamma)k$. By using a Chernoff bound again, the probability that S′ will have more than $(1-\gamma)k - (\log n)\sqrt{k}$ nodes in H is at most $\frac{1}{n^3}$ (no effort was made to optimize this bound). Finally, since there are at most $(n - k + 1)$ connected subgraphs of size k, we may apply a union bound and get that the probability that event 2 fails is
$$\le \frac{n - k + 1}{n^3} \le \frac{1}{n^2}.$$
To sum up, with probability $1 - \frac{2}{n^2}$, we will find a set $\hat S$ of size k such that $|S - \hat S| = o(k)$.
D Proof of Lemma 4.4
We need to prove two directions: 1. a MinConnect
problem can be reduced to a Steiner tree problem
with uniform edge weights. 2. a Steiner tree problem with uniform edge weights can be reduced to a
MinConnect problem.
We use the following observation in the analysis of both directions: in an arbitrary graph with uniform edge weights, the cost of any spanning tree of a connected subset of size k is $k - 1$. This observation is true because of the simple fact that a tree on $\ell$ nodes contains $\ell - 1$ edges for arbitrary $\ell$.
Now both directions are straightforward. Let W
be the set of nodes that need to be included in the
MinConnect problem. To reduce the MinConnect
problem to a Steiner tree problem, we require the
Steiner tree to cover all the nodes in W in the same
graph. Because of the above observation, a Steiner tree that covers W with the minimum number of edges is also a minimum connected subgraph that covers W.
We can prove the other direction in a similar fashion.
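Under the unit-weight convention used in the observation above, the reduction is a one-liner. The sketch below (function name ours) uses networkx's Steiner-tree 2-approximation, so it returns an approximately smallest connected subgraph rather than the exact optimum discussed in the lemma.

```python
# MinConnect via Steiner tree with uniform (unit) edge weights: a tree that
# spans W with the fewest edges induces a smallest connected subgraph covering W.
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

def min_connect_via_steiner(G: nx.Graph, W) -> set:
    H = G.copy()
    nx.set_edge_attributes(H, 1, "weight")         # uniform edge weights
    T = steiner_tree(H, list(W), weight="weight")  # 2-approximate Steiner tree
    return set(T.nodes)
```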
E Proof of Proposition 4.6
We shall show that with probability $\left(\frac{\tau\ell}{n}\right)^{\ell-1}$ a random subset R of size $\ell$ is connected, where $\tau$ is a suitable constant (the probability is over both the generative model and the choice of the random subset). Then we can show that the expected number of connected subgraphs of size $\ell$ is $\binom{n}{\ell}\left(\frac{\tau\ell}{n}\right)^{\ell-1} = n\tau_0^{\ell}$ for another constant $\tau_0$. We may next apply a Markov inequality to get the proposition.
First, we shall imagine the set R is sampled in
a sequential manner, i.e., the first node v1 from R
is sampled uniformly from V . Then the second node
v2 is sampled uniformly from V /{v1 }. And in general the i-th node vi in R is sampled uniformly from
V /{v1 , ..., vi−1 }. After ` samplings, we would get a set
R of size ` that is sampled uniformly among all the
nodes. The reason we introduce this scheme is to couple the sampling of R with the generative model itself,
by using the principle of deferred decision over long
range edges: the existence of a long range edge is not
revealed until we need it to make conclusions on connectivity. In other words, we want to test the connectivity of R sequentially when each member node of R
and its neighbors are revealed to us step-by-step. The
bottom line is that we need not know all the member
nodes of R to conclude on connectivity.
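The quantity this coupling controls, the probability that a uniformly random size-$\ell$ subset of a small-world graph is connected, can also be estimated directly by simulation. The sketch below is our own illustration (not part of the proof) and uses networkx's Kleinberg-style generator.

```python
# Monte Carlo estimate of Pr[a uniformly random size-ell subset R induces a
# connected subgraph], the quantity bounded by (tau * ell / n)^(ell - 1).
import random
import networkx as nx

def estimate_connect_prob(side=30, ell=4, trials=20_000, seed=0):
    rng = random.Random(seed)
    D = nx.navigable_small_world_graph(side, p=1, q=1, r=2, dim=2, seed=seed)
    G = D.to_undirected()
    nodes = list(G.nodes)
    hits = sum(nx.is_connected(G.subgraph(rng.sample(nodes, ell)))
               for _ in range(trials))
    return hits / trials

print(estimate_connect_prob())
```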
To carry out our scheme, we introduce a coupled branching process that is represented by a stochastic tree $\{T(t)\}_{t\ge 1}$. This tree traces the currently revealed nodes of R. It is rooted at $v_1$ and grows over time according to the following procedure. First, let us introduce a labeling notion to help define the growing process: at each time t, we call a node in V dead if its long-range neighbors have been revealed to us, and active otherwise. The notion of dead and active can apply to any node in V.
We start from a uniformly sampled node v1 in
V , decide v1 ∈ R, and reveal all its long-range neighbors (in addition to the four local neighbors) via the
generative model. Then T (1) contains only v1 and v1
is labelled dead. To grow from T (t) to T (t + 1), we
implement the following steps:
• Step 1. Pick an arbitrary active node v in V (T (t))
and reveal all its neighbors. The active node is
then labeled as dead.
• Step 2. Decide the set of nodes from R/V (T (t))
that are adjacent to v by using this observation: the distribution of R/(V (T (t))) is uniform
over V / (V (T (t)) ∪ Dead(t)), where Dead(t) is the
union of all neighbors of all the dead nodes at time
t.
• Step 3. Expand T (t) to T (t + 1) by adding all the
new neighbors of v that are in R.
Note that $V(T(t)) \subseteq R$ by construction. Moreover, whenever the tree T(t) stops growing before $|T(t)|$ reaches $\ell$, we conclude that the current sample of R is not connected.
We next show by induction over $\ell$ that $\Pr[|T(t)| = \ell] \le \left(\frac{\tau\ell}{n}\right)^{\ell-1}$. One potential obstacle in our analysis is that in the second step of the above procedure, the set $\mathrm{Dead}(t) \cup V(T(t))$ evolves in a complex manner, and it is not straightforward to track the neighbors of v that are in R over time. On the other hand, it is not difficult to show that $\Pr[|\mathrm{Dead}(t)| \ge \frac{n^{2/3}}{2}] \le 2^{-\frac{n^{2/3}}{2}}$ at any moment (by using a Chernoff bound). In the rest of our analysis, we shall silently assume that $|\mathrm{Dead}(t)| \ge \frac{n^{2/3}}{2}$ does not happen. This assumption results in an additive error of order $2^{-\frac{n^{2/3}}{2}}$ for the probability quantities we are interested in, which is asymptotically small. Since our goal is to give an upper bound on the probability that $|T(t)|$ eventually reaches $\ell$, we can imagine that the branching process works as follows: there is an (imaginary) adversary that decides how T(t) grows and tries to maximize the probability that $|T(t)|$ eventually reaches $\ell$. The adversary has to follow the above three-step procedure to grow T(t), but in the second step the adversary decides which set of nodes is not allowed to be chosen (instead of using $\mathrm{Dead}(t) \cup V(T(t))$), as long as $|\mathrm{Dead}(t)|$ is sufficiently small, i.e., $|\mathrm{Dead}(t)| \le \frac{n^{2/3}}{2}$. When the adversary maximizes the probability that $|T(t)|$ eventually reaches $\ell$, this quantity is also an upper bound on the same probability for the original process.
Thus, our goal is to understand the probability that $|T(t)|$ hits $\ell$ under the adversarial setting. Specifically, we want to prove the following statement. Let $\ell \le n^{1/3}$ and let G be a small world graph of size n. Let F be an arbitrary set of size $\frac{n}{2} - 2\ell \cdot n^{2/3}$. We shall refer to F as the forbidden set. Let R be a random subset of $V/F$ of size $\ell$. Let $P(\ell, n)$ be the probability that the coupled branching process T(t) reaches $\ell$ nodes eventually. Then
$$P(\ell, n) \le \left(\frac{\tau\ell}{n}\right)^{\ell-1}. \qquad (5)$$
Roughly speaking, the forbidden set F allows the adversary to choose the set $\mathrm{Dead}(t) \cup V(T(t))$ over time. Also, notice that the set F has to shrink as $\ell$ grows. Imposing such a technical assumption makes the recursive analysis easier at the cost of weakening the bound on k. We shall refer to the branching process in which the adversary controls the set F as $T'(t)$.

The base cases $\ell = 1, 2$ are straightforward. Let us now move to the induction step for computing $P(\ell+1, n)$ when $P(1, n), \dots, P(\ell, n)$ satisfy (5). Let $v_1$ be the first node in R. Let us define the random variables X and Y as follows: X is the total number of neighbors of $v_1$ and Y is the total number of nodes from R that are adjacent to $v_1$. Notice that $\mathrm{E}[X] \le 4 + d$ (4 is the number of local neighbors and d is the maximum expected number of long range edges among all the nodes). By using a Chernoff bound, we also have $\Pr[X \ge g] \le 2^{-g}$ when g is a sufficiently large constant (a Chernoff bound is applicable because the number of long range edges can be expressed as a sum of independent variables).

Furthermore, we may also compute $\Pr[Y = j]$ asymptotically. Specifically, we have the following lemma.

Lemma E.1. Let Y be the variable defined above. For any $i \le \ell$, there exists a constant $c_2$ such that
$$\Pr[Y = i] \le \left(\frac{c_2\ell}{n}\right)^{i}. \qquad (6)$$
Proof of Lemma E.1. Recall that we let X be the number of nodes that are adjacent to $v_1$. Also, recall that $\Pr[X > g] < 2^{-g}$ when g is a sufficiently large constant; we can also write $\Pr[X > g] \le c_4 \cdot 2^{-g}$ for a sufficiently large $c_4$. We have
$$\Pr[Y = i] = \sum_{1 \le j \le n} \Pr[Y = i \mid X = j]\,\Pr[X = j] \le c_4 \sum_{i \le j \le n} \frac{\binom{j}{i}\binom{n_0 - j}{\ell - i}}{\binom{n_0}{\ell}} \cdot 2^{-j}, \qquad (7)$$
where $n_0 = n - |F \cup \{v_1\}|$ is the total number of nodes that R may choose from. Notice that $n_0 = \Theta(n)$.
Let us focus on the terms $\frac{\binom{j}{i}\binom{n_0 - j}{\ell - i}}{\binom{n_0}{\ell}} \cdot 2^{-j}$. One can see that when $j \ge 3i$, these terms decrease more sharply than a geometric progression with ratio 3/4. Thus, the sum of the first $3i$ terms is the dominating term. We now give an upper bound on $\frac{\binom{j}{i}\binom{n_0 - j}{\ell - i}}{\binom{n_0}{\ell}} \cdot 2^{-j}$ for the case $j \le 3i$. Let $j = \alpha i$, where
$\alpha \le 3$. We have
$$\frac{\binom{\alpha i}{i}\binom{n_0-\alpha i}{\ell-i}}{\binom{n_0}{\ell}} \le \left(\frac{c_5\,\alpha}{\sqrt{\ell}}\right)^{i}\cdot \frac{(n_0-\alpha i)^{n_0-\alpha i}\;\ell^{\ell}\;(n_0-\ell)^{n_0-\ell}}{(\ell-i)^{\ell-i}\,(n_0-\ell-(\alpha-1)i)^{n_0-\ell-(\alpha-1)i}\;n_0^{n_0}}$$
($c_5$ is a sufficiently large constant.) Thus, we have
$$\frac{\binom{\alpha i}{i}\binom{n_0-\alpha i}{\ell-i}}{\binom{n_0}{\ell}} \le \left(\frac{c_5\,\alpha}{\sqrt{\ell}}\right)^{i}\cdot \frac{\{(n_0-\alpha i)(n_0-\ell)\}^{n_0-\ell-(\alpha-1)i}}{\{(n_0-\ell-(\alpha-1)i)\,n_0\}^{n_0-\ell-(\alpha-1)i}}\cdot \frac{\{\ell(n_0-\alpha i)\}^{\ell-i}}{\{(\ell-i)\,n_0\}^{\ell-i}}\cdot \frac{(n_0-\ell)^{(\alpha-1)i}\,\ell^{i}}{n_0^{(\alpha-1)i}\,n_0^{i}}.$$
We can see that $(n_0-\alpha i)(n_0-\ell) \le (n_0-\ell-(\alpha-1)i)\,n_0$, $\ell(n_0-\alpha i) \le (\ell-i)\,n_0$, and $n_0-\ell \le n_0$. Thus,
$$\frac{\binom{\alpha i}{i}\binom{n_0-\alpha i}{\ell-i}}{\binom{n_0}{\ell}} \le \left(\frac{c_6\,\ell}{n}\right)^{i}.$$
Together with (7), this finishes the proof of Lemma E.1.
Next, by the law of total probability, we also have $\Pr[|T'(t)| = \ell+1] = \sum_{1\le i\le n}\Pr[|T'(t)| = \ell+1 \mid Y = i]\,\Pr[Y = i]$. We first walk through the analysis for $\Pr[|T'(t)| = \ell+1 \mid Y = 1]$ and $\Pr[|T'(t)| = \ell+1 \mid Y = 2]$. Then we give an asymptotic bound on $\Pr[|T'(t)| = \ell+1 \mid Y = i]$ for general i.

Let us start with the case Y = 1. Let $v_2$ be the node in R that is connected with $v_1$. Here, $\Pr[|T'(t)| = \ell+1 \mid Y = 1]$ reduces to a case that is covered by the induction hypothesis, i.e., the tree $T'(t)$ contains $\ell+1$ nodes if and only if the subtree rooted at $v_2$ contains $\ell$ nodes. This $v_2$-rooted subtree is another branching process coupled with $\ell$ nodes, which are uniformly chosen from $V/(\{v_1\} \cup \Gamma(v_1) \cup F)$, where $\Gamma(v_1)$ is the set of $v_1$'s neighbors. Thus, by the induction hypothesis, we have $\Pr[|T'(t)| = \ell+1 \mid Y = 1] \le P(\ell, n) \le \left(\frac{\tau\ell}{n}\right)^{\ell-1}$.
Next, let us move to the case Y = 2. Let $v_2$ and $v_3$ be the nodes in R that are connected with $v_1$. We may first grow the tree $T'(t)$ from $v_2$, then grow the tree from $v_3$. At the end, let L denote the number of nodes in the subtree grown from $v_2$ (including $v_2$ itself); the subtree grown from $v_3$ then contains $\ell - L$ nodes (including $v_3$). One may check that both subprocesses can be understood by using the induction hypothesis (we also need to check that the size of the forbidden set does not violate the requirements in the induction hypothesis; but this is straightforward because we assumed that the total number of neighbors of R is less than $n^{2/3}/2$).
$$\Pr[|T'(t)| = \ell+1 \mid Y = 2] = \sum_{i=1}^{\ell-1} \Pr[v_2\text{-rooted tree connected},\, v_3\text{-rooted tree connected} \mid L = i, Y = 2]\,\Pr[L = i \mid Y = 2]$$
$$\le \sum_{i=1}^{\ell-1} P(i, n)\,P(\ell-i, n)\binom{\ell-2}{i-1} \le \sum_{i=1}^{\ell-1} \left(\frac{\tau i}{n}\right)^{i-1}\left(\frac{\tau(\ell-i)}{n}\right)^{\ell-i-1}\binom{\ell-2}{i-1} \quad\text{(Induction)}$$
$$= \frac{\tau^{\ell-2}}{n^{\ell-2}}\sum_{i=1}^{\ell-1} i^{\,i-1}(\ell-i)^{\ell-i-1}\binom{\ell-2}{i-1}.$$
Let us define $a_j = j^{\,j-1}(\ell-j)^{\ell-j-1}$. Notice that $a_j = a_{\ell-j}$. So we have
$$\sum_{1\le j\le \ell} a_j \le 2\sum_{1\le j\le \ell/2} a_j.$$
We may continue to compute the sum of the $a_j$'s together with the binomial coefficients:
$$\sum_{1\le j\le \ell/2} a_j\binom{\ell-2}{j-1} = \sum_{1\le j\le \ell/2} j^{\,j-1}(\ell-j)^{\ell-j-1}\binom{\ell-2}{j-1} \le \sum_{1\le j\le \ell/2} j^{\,j-1}\,\ell^{\ell-j-1}\binom{\ell-2}{j-1} \le e\sum_{1\le j\le \ell/2}\ell^{\,j-1}\,\ell^{\ell-j-1} \le c_3\,\ell^{\ell-2}$$
for some constant $c_3$. The second inequality uses the fact that
$$\binom{\ell-2}{j-1}\,j^{\,j-1} \le j^{\,j-1}\,\frac{(\ell-2)^{j-1}}{(j-1)^{j-1}} = \left(\frac{j}{j-1}\right)^{j-1}(\ell-2)^{j-1} \le e\,\ell^{\,j-1}.$$
Thus, we have
$$\Pr[|T'(t)| = \ell+1 \mid Y = 2] \le c_3\left(\frac{\tau\ell}{n}\right)^{\ell-2}. \qquad (8)$$
In general, we may define the $\ell$-hitting probability $P_j(\ell, n)$ for a “branching forest” that grows from j different roots. When j = 1, it corresponds to the case Y = 1, i.e., $P_1(\ell, n) = P(\ell, n)$. When j = 2, (8) gives us $P_2(\ell, n) \le c_3\left(\frac{\tau\ell}{n}\right)^{\ell-2}$.

We now compute the general form of $P_j(\ell, n)$. Specifically, we shall show by induction that
$$P_j(\ell, n) \le c_3^{\,j-1}\left(\frac{\tau\ell}{n}\right)^{\ell-j}. \qquad (9)$$
The base cases for j = 1, 2 are already analyzed above. We now move to the induction case. We have the following recursive relation:
$$P_j(\ell, n) \le \sum_{1\le i\le \ell-j+1} P_1(i, n)\,P_{j-1}(\ell-i, n)\binom{\ell-j}{i-1}.$$
Thus, we have
$$P_j(\ell, n) \le \sum_{1\le i\le \ell-j+1}\left(\frac{\tau i}{n}\right)^{i-1} c_3^{\,j-2}\left(\frac{\tau(\ell-i)}{n}\right)^{\ell-i-j+1}\binom{\ell-j}{i-1} = c_3^{\,j-2}\,\frac{\tau^{\ell-j}}{n^{\ell-j}}\sum_{1\le i\le \ell-j+1} i^{\,i-1}(\ell-i)^{\ell-i-j+1}\binom{\ell-j}{i-1}. \qquad (10)$$
It remains to analyze the term $\sum_{1\le i\le \ell-j+1} i^{\,i-1}(\ell-i)^{\ell-i-j+1}\binom{\ell-j}{i-1}$. Via some straightforward manipulation, we have
$$\sum_{1\le i\le \ell-j+1} i^{\,i-1}(\ell-i)^{\ell-i-j+1}\binom{\ell-j}{i-1} \le c_3\,\ell^{\ell-j} \qquad (11)$$
for some sufficiently large $c_3$. (11) and (10) together give us (9). Now we are ready to compute $P(\ell+1, n)$ (and thus $\Pr[|T'(t)| = \ell+1]$):
$$P(\ell+1, n) = \sum_{1\le j\le n} P(\ell+1, n \mid Y = j)\,\Pr[Y = j] \le \sum_{1\le j\le n} c_3^{\,j-1}\left(\frac{\tau\ell}{n}\right)^{\ell-j}\left(\frac{c_2\ell}{n}\right)^{j}.$$
When $\tau \ge 4c_2c_3$, we have
$$\sum_{1\le j\le n} c_3^{\,j-1}\left(\frac{\tau\ell}{n}\right)^{\ell-j}\left(\frac{c_2\ell}{n}\right)^{j} \le \left(\frac{\tau\ell}{n}\right)^{\ell}\sum_{1\le j\le n} 2^{-j} \le \left(\frac{\tau\ell}{n}\right)^{\ell}.$$
This completes the proof of Proposition 4.6.
F Proof of Proposition 4.8
We shall first describe a way to construct S. Then we will argue that a $\gamma$ fraction of the nodes in S is statistically indistinguishable from a large number of nodes in $V/S$.

Recall that $w_i$ is the expected degree of node $v_i$. Wlog, we shall let $w_1 \le w_2 \le \dots \le w_n$, where $w_n = c_0\sqrt{n}$ for some constant $c_0$. Since the degree distribution is a power law distribution, there exists a constant $c_1$ such that for all $i \le c_1 n$, $w_i = \Theta(1)$. Let us randomly partition the set of nodes $\{v_1, v_2, \dots, v_{c_1 n}\}$ into two subsets of equal size, namely $S_1 = \{v_{i_1}, \dots, v_{i_{c_1 n/2}}\}$ and $S_2 = \{v_{i'_1}, \dots, v_{i'_{c_1 n/2}}\}$. Notice that for any $v_i \in S_1 \cup S_2$, we have $\Pr[\{v_i, v_n\} \in E] = w_i w_n \rho \ge \frac{c_2}{\sqrt{n}}$ for some constant $c_2$.

Next, let us construct the set S: we shall first put $v_n$ in S, and the rest of the nodes in S will be picked from $S_1$. Since with probability at least $\frac{c_2}{\sqrt{n}}$ there is an edge between a node in $S_1$ and $v_n$, in expectation the number of nodes in $S_1$ that are connected with $v_n$ is $\Omega(\sqrt{n})$. Thus, with high probability we are able to find a subset of size $k - 1$ whose nodes are all connected to $v_n$.
Now we analyze an arbitrary algorithm's performance. By using a Chernoff bound, we can see that with high probability S contains $(1 - \gamma \pm o(1))k$ nodes with signals in H and $(\gamma \pm o(1))k$ nodes with signals in M. Let us refer to the subset of nodes in S whose signals are in M as $S_M$. Recall that when the algorithm does not know the network structure, it will not be able to discover most of the nodes in $S_M$. Here, we shall show that the algorithm behaves in a similar way even when it knows the network structure.

Let us focus on the nodes in $S_2$. It is straightforward to see that with high probability the number of nodes in $S_2$ that are both connected with $v_n$ and associated with signals in M is at least $(1 - \epsilon)\gamma\sqrt{n}$ for an arbitrarily small constant $\epsilon$. Let us call the set of these nodes $S_2'$. The nodes in $S_2'$ are statistically indistinguishable from the nodes in $S_M$. Furthermore, the connectivity constraint is met for the nodes in both sets. Thus, no algorithm can do better than randomly guessing. In other words, if an algorithm outputs a set of size O(k), then the number of nodes in $S_M$ that will be included is $o(\gamma k)$. This completes the proof of Proposition 4.8.
G Proof of Proposition 5.1
Let $\ell = |S_{opt}|$ be the size of the output. Wlog, we shall focus on the case $\ell \ge (1+\epsilon)k$. The case $\ell \le (1+\epsilon)k$ can be analyzed in a similar manner. Recall that $\Phi(x)$ is the cdf of the Gaussian variable N(0, 1). Let $\nu_0 = \mathrm{E}[-\log(1-\Phi(X))]$, where $X \sim N(0, 1)$, i.e., the expected score of a node from $V/S$. Similarly, let $\nu_1 = \mathrm{E}[-\log(1-\Phi(X))]$, where $X \sim N(\mu, 1)$. Furthermore, let $\eta \triangleq \nu_1/\nu_0$. Since $\mu$ is a constant, $\nu_0$, $\nu_1$, and $\nu_1 - \nu_0$ are all constants. Also, $\eta$ is a function that grows with $\mu$. Our way of setting the function $c(\cdot)$ is straightforward: we set $c(e) = \omega \triangleq \frac{\nu_1 - \nu_0}{2}$ for all e.
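The constants $\nu_0$, $\nu_1$ and the edge cost $\omega$ have no simple closed form, but they are easy to estimate numerically; the sketch below is our own illustration (the choice $\mu = 2$ is arbitrary).

```python
# Monte Carlo estimates of nu0 = E[-log(1 - Phi(X))] for X ~ N(0, 1),
# nu1 for X ~ N(mu, 1), and the uniform edge cost omega = (nu1 - nu0) / 2.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, m = 2.0, 1_000_000
nu0 = np.mean(-norm.logsf(rng.normal(0.0, 1.0, m)))   # logsf(x) = log(1 - Phi(x))
nu1 = np.mean(-norm.logsf(rng.normal(mu, 1.0, m)))
omega = (nu1 - nu0) / 2
print(f"nu0 ~ {nu0:.3f}, nu1 ~ {nu1:.3f}, omega ~ {omega:.3f}")
```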
We first show that when S is substituted into (3), the objective value is at least $(1-\frac{\epsilon}{10})k\nu_1$. This can be seen by using a Chernoff type bound for the independent variables $-\log(1-\Phi(X))$ with $X \sim N(\mu, 1)$ (see Theorem I.1 in Appendix I for the statements and the proofs):
$$\Pr\Big[\sum_{v_i \in S} b(v_i) \le \big(1 - \tfrac{\epsilon}{10}\big)k\nu_1\Big] \le \exp(-\Theta(\epsilon^2\nu_1 k)).$$
Thus, the objective value is at least $E_s \triangleq (1-\frac{\epsilon}{10})k\nu_1 - \omega\cdot(k-1)$ with high probability. We next show that for any specific S′ of size $\ell$ such that $|S'-S| + |S-S'| \ge \epsilon k$, we have
$$\Pr\Big[\sum_{v_i \in S'} b(v_i) - \omega(|S'|-1) \ge E_s\Big] \le \exp(-g(\mu)\epsilon^2\ell), \qquad (13)$$
where $g(\mu)$ is a monotonic function in $\mu$. Then, using the fact that with probability $\ge 1 - \epsilon$ the number of connected subgraphs of size $\ell$ is $\le n(\tau_0)^{\ell}$ for some constant $\tau_0$, we can conclude that whp any S′ of size $\ell$ such that $|S'-S| + |S-S'| \ge \epsilon k$ cannot be an optimal solution.

We now move to prove (13). Our goal thus is to give a bound on the probability of the event $\sum_{v_i\in S'} b(v_i) - \omega(|S'|-1) \ge E_s$. The probability is maximized when $S \subset S'$. Let us write $S_2' = S' - S$. Thus, we shall find a bound for $\Pr[\sum_{v_i\in S} b(v_i) + \sum_{v_i\in S_2'} b(v_i) - \omega(\ell-1) \ge E_s]$.

When $\sum_{v_i\in S} b(v_i) + \sum_{v_i\in S_2'} b(v_i) \ge E_s + \omega(\ell-1)$, there exists a $\Delta \in \{-n, \dots, n\}$ such that
$$\sum_{v_i\in S} b(v_i) \ge k\nu_1 + \Delta \quad\text{and}\quad \sum_{v_i\in S_2'} b(v_i) \ge -\frac{\epsilon\nu_1 k}{10} - \Delta + (\ell-k)\omega - 1.$$
Let us write $\Delta = \delta k\nu_1$ and $\ell = hk$. When $\Delta \ge 0$, we have
$$\Pr\Big[\sum_{v_i\in S} b(v_i) \ge k\nu_1 + \Delta\Big] \le \exp(-c\delta^2\eta k\nu_0),$$
and when $\Delta \le \Delta_m \triangleq \frac{\eta-1}{2}(\ell-k)\nu_0$:
$$\Pr\Big[\sum_{v_i\in S_2'} b(v_i) \ge -\frac{\epsilon\nu_1 k}{10} - \Delta + (\ell-k)\omega - 1\Big] \le \exp\left(-\frac{c\big(\frac{\eta-1}{2}(\ell-k)\nu_0 - \frac{\epsilon}{10}\nu_1 k - \Delta\big)^2}{\ell\tau_0}\right) \le \exp\left(-\frac{ck\nu_0}{h}\Big(\frac{\eta-1}{2}(h-1) - \frac{\epsilon\eta}{10} - \delta\eta\Big)^2\right).$$
Therefore,
$$\Pr\Big[\sum_{v_i\in S} b(v_i) + \sum_{v_i\in S_2'} b(v_i) - \omega(\ell-1) \ge E_s\Big] \le \sum_{\Delta\le 0}\exp\left(-\frac{ck\nu_0}{h}\Big(\frac{\eta-1}{2}(h-1) - \frac{\epsilon\eta}{10} - \delta\eta\Big)^2\right) + \sum_{1\le\Delta\le\Delta_m}\exp(-c\delta^2\eta k\nu_0)\cdot\exp\left(-\frac{ck\nu_0}{h}\Big(\frac{\eta-1}{2}(h-1) - \frac{\epsilon\eta}{10} - \delta\eta\Big)^2\right) + \sum_{\Delta>\Delta_m}\exp(-c\delta^2\eta k\nu_0). \qquad (12)$$
It is not difficult to see that both the first and the third summations are $\le \exp(-\Theta(\epsilon^2(\eta-1)^2\nu_0\ell/\eta))$. Therefore, it remains to analyze the second summation in the above inequality. Specifically, we want to understand when
$$\exp\left(-c\delta^2\eta k\nu_0 - \frac{ck\nu_0}{h}\Big(\frac{\eta-1}{2}(h-1) - \frac{\epsilon\eta}{10} - \delta\eta\Big)^2\right) \qquad (14)$$
is maximized. One can see that the exponent is a quadratic function in $\delta$, which is maximized when
$$\delta = \frac{\frac{1}{2}(h-1)\big(\frac{2\eta}{5} - 2\epsilon\big)}{\frac{\eta}{h}+1}.$$
When we plug the optimal value of $\delta$ into (14), we have
$$\exp\left(-c\delta^2\eta k\nu_0 - \frac{ck\nu_0}{h}\Big(\frac{\eta-1}{2}(h-1) - \frac{\epsilon\eta}{10} - \delta\eta\Big)^2\right) \le \exp(-\Theta(\epsilon^2(\eta-1)^2\nu_0\ell/\eta)).$$
Thus, (13) indeed holds.

H Proof of Theorem 5.3
Proof. Let $\phi(x)$ be the pdf of N(0, 1). First, observe that for any non-negative functions f and g such that $\phi = f + g$, we may interpret a sample from N(0, 1) as a sample from the mixture of the two distributions induced by f and g, using the following procedure (a small numerical sketch follows the list):
• First let $F = \int_{-\infty}^{\infty} f(x)\,dx$ and $G = \int_{-\infty}^{\infty} g(x)\,dx$.
• Then with probability $\frac{F}{F+G}$ we draw a sample from the distribution with pdf $\frac{f(x)}{F}$, and with probability $\frac{G}{F+G}$ we draw a sample from the distribution with pdf $\frac{g(x)}{G}$.
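The decomposition used in the sketch below ($f = \phi\cdot\mathbf{1}\{x \le 0\}$, $g = \phi\cdot\mathbf{1}\{x > 0\}$) is chosen only because both pieces are clearly non-negative; it is not the specific decomposition used later in the proof.

```python
# Sample N(0, 1) as a two-component mixture: with probability F pick from f/F
# (a standard normal truncated to x <= 0, i.e. -|Z|), otherwise from g/G (+|Z|).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sample_phi_as_mixture(size):
    F = norm.cdf(0.0)                         # mass of f
    G = 1.0 - F                               # mass of g
    pick_f = rng.random(size) < F / (F + G)
    x = np.empty(size)
    x[pick_f] = -np.abs(rng.normal(size=pick_f.sum()))
    x[~pick_f] = np.abs(rng.normal(size=(~pick_f).sum()))
    return x

s = sample_phi_as_mixture(100_000)
print(round(s.mean(), 3), round(s.std(), 3))  # close to 0 and 1
```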
Let $\phi_\mu(x)$ be the pdf of $N(\mu, 1)$ and let $\tau = \frac{\phi(\mu)}{\phi_\mu(\mu)}$. We shall decompose $\phi(x)$ into a mixture of $\tau\cdot\phi_\mu(x)$ and $R(x) \triangleq \phi(x) - \tau\cdot\phi_\mu(x)$. We consider the following strictly simpler problem and give a lower bound on $\alpha$ for this problem: we still have the same setting in which nodes from $V/S$ and from S receive samples from different distributions and we are asked to find S. The only difference here is that when $v_i \in V/S$ receives a sample from N(0, 1), we assume the sample is generated from the mixture of $\tau\cdot\phi_\mu(x)$ and $R(x)$. Furthermore, when $v_i$ is sampled from $R(x)$, we also explicitly label $v_i$ as coming from $R(x)$. In other words, the algorithm knows the set of nodes that are sampled from $R(x)$. Notice that the new problem gives a strict superset of information and thus is information theoretically easier than the original problem.

Next, let us move to find a lower bound on $\alpha$ for the new problem. When a node is labeled as coming from $R(x)$, it is clear that the node should not be part of the output. It remains for us to find S from the rest of the non-labeled nodes. But notice that all the remaining signals are sampled from $N(\mu, 1)$. Thus, we cannot do better than randomly guessing. Since $\mu = \Theta(1)$, we have $\tau = \Theta(1)$. Thus, the number of remaining unlabeled nodes is still $\Theta(n)$ (with high probability). One can see that in order to cover a $\rho$ fraction of the nodes from S, the size of the final output has to be $\Theta(\rho n)$.
I Concentration bounds

In this section, we prove the following large deviations bound.

Theorem I.1. Let $Y_i \sim N(0, 1)$ and $X_i \sim N(\mu, 1)$. Let $\Phi(\cdot)$ be the cdf of N(0, 1). Let $\mu_x = \mathrm{E}[-\log(1-\Phi(X_i))]$ and $\mu_y = \mathrm{E}[-\log(1-\Phi(Y_i))]$. We have
$$\Pr\Big[\Big|\sum_{1\le i\le n} -\log(1-\Phi(X_i)) - n\mu_x\Big| \ge \epsilon n\mu_x\Big] \le \exp(-c\epsilon^2 n\mu_x) \qquad (15)$$
and
$$\Pr\Big[\Big|\sum_{1\le i\le n} -\log(1-\Phi(Y_i)) - n\mu_y\Big| \ge \epsilon n\mu_y\Big] \le \exp(-c\epsilon^2 n\mu_y) \qquad (16)$$
for a suitable constant $c > 0$.

Proof. We shall prove the lower tail bound (17), i.e.,
$$\Pr\Big[\sum_{1\le i\le n} -\log(1-\Phi(X_i)) \le (1-\epsilon)n\mu_x\Big] \le \exp(-c\epsilon^2 n\mu_x) \qquad (17)$$
for some constant $c > 0$. The other cases can be analyzed in a similar manner. Consider the moment generating function (mgf) of $-\log(1-\Phi(X_i))$, where $X_i \sim N(\mu, 1)$ and $\Phi(\cdot)$ is the cdf of N(0, 1). For convenience, we also denote $\bar\Phi(\cdot) = 1 - \Phi(\cdot)$ as the tail distribution of N(0, 1). We use the bound (Williams, 1991)
$$\bar\Phi(x) = \int_x^{\infty}\frac{1}{\sqrt{2\pi}}e^{-y^2/2}\,dy \ge \frac{1}{\sqrt{2\pi}(x+1/x)}e^{-x^2/2} \qquad (18)$$
for $x > 0$. We shall prove that the mgf of $-\log(1-\Phi(X_i))$ exists and is finite in a neighborhood of zero. Namely, consider
$$\phi(\theta) := \mathrm{E}\big[e^{-\theta\log\bar\Phi(X_i)}\big] = \mathrm{E}\left[\frac{1}{(\bar\Phi(X_i))^{\theta}}\right] = \int_{-\infty}^{\infty}\frac{1}{(\bar\Phi(x))^{\theta}}\,\frac{1}{\sqrt{2\pi}}e^{-(x-\mu)^2/2}\,dx. \qquad (19)$$
Using (18), and by considering the regions $\{x \le \eta\}$ and $\{x > \eta\}$ for some $\eta > 0$, the quantity (19) is bounded from above by
$$\max\{\bar\Phi(\eta)^{-\theta}, 1\}\int_{-\infty}^{\eta} e^{-(x-\mu)^2/2}\,dx + \int_{\eta}^{\infty}\Big(\sqrt{2\pi}\big(x+\tfrac{1}{x}\big)e^{x^2/2}\Big)^{\theta}\frac{1}{\sqrt{2\pi}}e^{-(x-\mu)^2/2}\,dx$$
$$= \max\{\bar\Phi(\eta)^{-\theta}, 1\}\int_{-\infty}^{\eta} e^{-(x-\mu)^2/2}\,dx + (2\pi)^{(\theta-1)/2}\int_{\eta}^{\infty}\Big(x+\tfrac{1}{x}\Big)^{\theta}e^{\theta x^2/2-(x-\mu)^2/2}\,dx. \qquad (20)$$
Consider the second term in (20). For $0 < \theta < 1$, it is bounded by
$$(2\pi)^{(\theta-1)/2}\int_{x>\eta} C_1 x^{\theta}e^{\theta x^2/2-(x-\mu)^2/2}\,dx < \infty \qquad (21)$$
for some $C_1 > 0$, and for $-1 < \theta < 0$, it is bounded by
$$(2\pi)^{(\theta-1)/2}\int_{x>\eta} C_2 x^{-\theta}e^{\theta x^2/2-(x-\mu)^2/2}\,dx < \infty \qquad (22)$$
for some $C_2 > 0$. Therefore $(-1, 1)$ is contained in the domain of convergence of $\phi(\theta)$. This implies that $\phi(\theta)$ is infinitely differentiable in $(-1, 1)$. Define $\psi(\theta) = \log\phi(\theta)$ as the logarithmic mgf of $-\log(1-\Phi(X_i))$. The same convergence and differentiability behavior then holds for $\psi(\cdot)$ in the same region $(-1, 1)$.

To proceed, we use the Chernoff inequality
$$\Pr\Big[\sum_{1\le i\le n} -\log\bar\Phi(X_i) \le (1-\epsilon)n\mu_x\Big] \le e^{\theta(1-\epsilon)n\mu_x + n\psi(-\theta)} \qquad (23)$$
for $0 \le \theta < 1$. Using the Taylor expansion $\psi(-\theta) = -\psi'(0)\theta + \psi''(\zeta)\theta^2/2$ for some $\zeta \in (-\theta, 0)$, and the fact that $\psi'(0) = \mu_x$, (23) becomes
$$\exp\Big\{\theta(1-\epsilon)n\mu_x - n\psi'(0)\theta + n\psi''(\zeta)\frac{\theta^2}{2}\Big\} = \exp\Big\{-\epsilon\theta n\mu_x + n\psi''(\zeta)\frac{\theta^2}{2}\Big\} \le \exp\Big\{-\epsilon\theta n\mu_x + n\sup_{u\in[-\theta,0]}\psi''(u)\,\frac{\theta^2}{2}\Big\}. \qquad (24)$$
For $\epsilon < N_1$, we choose $\theta = c_1\epsilon$, and the value of $0 < c_1 < N_2$ will be chosen small enough to get our result. Note that for $\epsilon < N_1$ and $c_1 < N_2$, any choice of $\theta = c_1\epsilon$ implies that $\sup_{u\in[-c_1\epsilon, 0]}\psi''(u) \le M$ for some constant $M > 0$. Hence (24) becomes
$$\exp\left\{-c_1\epsilon^2 n\left(\mu_x - \sup_{u\in[-c_1\epsilon, 0]}\psi''(u)\,\frac{c_1}{2}\right)\right\} \le \exp\left\{-c_1\epsilon^2 n\left(\mu_x - M\frac{c_1}{2}\right)\right\}.$$
Choosing $c_1$ small enough will then give $\mu_x - M\frac{c_1}{2} > 0$, which concludes the theorem.
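The only analytic ingredient imported from outside is the Gaussian tail lower bound (18); it is easy to sanity-check numerically, as in the sketch below (our own illustration).

```python
# Numerical check of the lower bound (18):
# Phi_bar(x) >= exp(-x^2 / 2) / (sqrt(2 * pi) * (x + 1 / x)) for x > 0.
import numpy as np
from scipy.stats import norm

x = np.linspace(0.1, 8.0, 200)
lower = np.exp(-x**2 / 2) / (np.sqrt(2 * np.pi) * (x + 1.0 / x))
print(bool(np.all(norm.sf(x) >= lower)))   # True: the bound holds on this grid
```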
J Missing calculations

J.1 Proof of Equation 2
$$\gamma^{k_0+\Delta k}\cdot\sqrt{1.55k}\,(\tau_0)^{\ell} = \sqrt{1.55k}\,\gamma^{k_0+\Delta k}\,(\tau_0)^{k+\Delta k} = \sqrt{1.55k}\,(\tau_0)^{k}\,\gamma^{k_0}\,(\gamma\tau_0)^{\Delta k}$$
$$\le \sqrt{1.55k}\,(\tau_0)^{k}\,(\gamma^{\lambda\gamma})^{k}\,(\gamma\tau_0)^{\Delta k} \quad (\text{using } k_0 \ge \lambda\gamma k)$$
$$\le (2\tau_0\gamma^{\lambda\gamma})^{k}\,(\gamma\tau_0)^{\Delta k} \quad (\text{using } \sqrt{1.55k} \le 2^{k} \text{ for sufficiently large } k)$$
$$\le c_0^{-k}.$$
The last inequality holds when $\gamma^{\lambda\gamma} \le \frac{1}{2c_0\tau_0}$.