Efficient Enumeration of all Connected
Induced Subgraphs of a Large Undirected
Graph
by
Sean Maxwell
Submitted in partial fulfillment of the requirements
For the degree of Master of Science
Graduate Program in Systems Biology and Bioinformatics
Case Western Reserve University
January, 2014
CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES
We hereby approve the thesis/dissertation of
Sean Maxwell
candidate for the
Master of Science
degree*.
Mark Chance
Harold Connamacher
Mehmet Koyutürk
(date)
June 25, 2013
*We also certify that written approval has been obtained for any
proprietary material contained therein.
For my wife Lea and our daughter Stella.
Contents
Abstract
1 Introduction
2 Problem Definition and Observations
3 Base Case
3.1 Focusing on Direct Neighbors
3.2 Optimized Local Search Tree
4 General Case
4.1 Joining Local Search Trees
4.2 Optimized Joining of Local Search Trees
4.3 Caching Depth First Search (CDFS)
5 Correctness
6 Experimental Results
6.1 Exhaustive Synthetic Testing
6.2 Integration into CRANE
7 Discussion
8 Conclusion
A Extended Definitions of Complex Notation
B Supporting Lemmas
List of Figures
1 Enumeration of all S using anchor vertices
2 Binomial tree generated from an anchor vertex
3 Optimized local search tree construction
4 Extending a local search tree
5 Enumeration tree T
6 Examples of potential overhead during construction of T
7 Rejection rate analysis for arXiv [27]
8 Runtime comparison for arXiv [27]
9 Adjacency overhead comparison for arXiv [27]
10 Rejection rate analysis for HPRD [15]
11 Runtime comparison for HPRD [15]
List of Algorithms
1 CDFS
2 CHOOSELOWEST
Acknowledgments
I would like to thank my committee for all of their guidance: Dr. Mehmet Koyutürk,
my thesis advisor; Dr. Mark Chance, my committee chair; and Dr. Harold Connamacher. I am truly grateful to have had the opportunity to learn and work with each
committee member.
This thesis would not have been possible without my family who supported me
on my journey. I am forever grateful to my wife Lea for her infinite patience and
love, and our daughter Stella who unselfishly sacrificed time with her father on many
occasions.
Symbols
The following table provides a very brief definition of all symbols. Detailed definitions
of the symbols defining complex concepts are provided in Appendix A.

Graph
G               An undirected graph / network
V               Set of all vertices in G
E               Set of all edges in G
v               A vertex v ∈ V
S               A set of vertices S ⊆ V that induce a connected subgraph in G

Tree
T               Tree enumerating all S that contain a given anchor vertex v ∈ V
N               The set of all nodes of T
B               The set of all branches of T
n               A node of T labeled with one v ∈ V. Label vertex denoted as vn
r               The root node of T, labeled by the anchor vertex vr

Set/List
P(nk)           The sequence of nodes along the path from n1 = r to nk in T
Dnk             Neighbor set of nk is all vertices adjacent to vnk in G
Cnk             Cull set of nk is {vr} ∪ {u : u vni ∈ E, 1 ≤ i < k}
χnk             Extension set of nk is Dnk \ Cnk
β(nk, nk+1)     A list of branch nodes defined by nk and nk+1

f(x)
Υ(v)            Adds a new node to N labeled with v
ψ(P(nk))        Maps P(nk) to labeling vertices, ψ(P(nk)) = {vn1, vn2, . . . , vnk}
Γ(n)            Returns the source node m ∈ N that n was cloned from
fb(S)           Evaluates a bounding function on the subset S
K(n)            The set of all children of n ∈ N
p(n)            The immediate parent of n ∈ N
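As an informal reading of the neighbor, cull, and extension set definitions above, the following Python sketch computes them for the last node of a path. The adjacency dictionary, the example graph, and the function names are illustrative assumptions and are not taken from the thesis; a path is represented simply as the list of its label vertices.

```python
def neighbor_set(adj, v):
    """Neighbor set D: all vertices adjacent to v in G."""
    return set(adj[v])

def cull_set(adj, path):
    """Cull set C for the last node of `path`: the anchor vertex plus every
    vertex adjacent to a label vertex that precedes the last node."""
    culled = {path[0]}
    for v in path[:-1]:
        culled |= adj[v]
    return culled

def extension_set(adj, path):
    """Extension set chi = D minus C for the last node of `path`."""
    return neighbor_set(adj, path[-1]) - cull_set(adj, path)

# Hypothetical graph with edges A-B, A-C, C-D, C-E, B-E.
adj = {
    "A": {"B", "C"},
    "B": {"A", "E"},
    "C": {"A", "D", "E"},
    "D": {"C"},
    "E": {"B", "C"},
}

chi_root = extension_set(adj, ["A"])        # {'B', 'C'}
chi_ac = extension_set(adj, ["A", "C"])     # {'D', 'E'}
```

In this reading, only vertices that are new with respect to the earlier part of the path remain in the extension set.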
Efficient Enumeration of all Connected Induced Subgraphs
of a Large Undirected Graph
Abstract
by
Sean Maxwell
In this work we investigate the problem of efficiently enumerating all connected induced subgraphs of an undirected graph. We show that the redundant computations
in a depth first branch and bound search for subgraphs that satisfy a hereditary
property can be reduced using an enumeration tree as a “memory” during the depth
first search to avoid redundant rejections. This approach reduces the runtime of
exhaustive search as well. Our method includes a proof of correctness and computational results on synthetic and real data sets demonstrating improved runtime over
traditional depth first approaches.
1 Introduction
For many applications in systems biology, the connected induced subgraphs of molecular interaction networks are of particular interest since they represent a set of functionally associated biomolecules. In many applications, scientists are interested in finding
groups of functionally associated molecules that together induce coherent patterns in
other types of biological data. For example, in the context of the systems biology of
complex diseases, one is interested in identifying “dysregulated protein subnetworks”
that are sets of proteins connected to each other via protein-protein interactions that
exhibit collective differential expression between different phenotypes [5, 6, 7, 10, 29].
Similarly, gene set enrichment analysis aims to evaluate the statistical significance
of the aggregate disease association of sets of genes that are defined a priori, and
the connected subgraphs of molecular networks provide excellent candidate gene sets
since they are functionally related through physical and functional interactions [24].
At an evolutionary scale, sets of orthologous proteins that induce connected subgraphs
are shown to be useful in gaining insights into the conservation and modularity of
biological processes across diverse taxa [17, 22, 25]. While methods designed to tackle
these problems implement heuristic algorithms to search the space of connected induced subgraphs of a network, it was shown that exhaustive search may lead to the
identification of more biologically relevant patterns as compared to those identified
by simple heuristics [6, 25, 30].
There are many sources of molecular interaction data available such as HPRD[15]
and BioGrid[13]. These networks are usually modeled as undirected graphs, in which
nodes represent proteins and edges represent pairwise interactions between them.
These networks are quite large, but highly sparse. For this reason, when searching
for groups of proteins that together optimize a certain objective function, limiting the
search to interacting proteins greatly reduces the search space while not compromising
the biological relevance of the results. For instance, at the time this document was
prepared, HPRD contained 37,080 protein-protein interactions among 9,455 proteins.
The number of interactions specified by HPRD is far smaller than the number implied by
assuming every protein is functionally related to every other protein, which dramatically
reduces the search space for dysregulated groups of functionally related proteins. For example,
searching for all groups of size k among n proteins, where all proteins interact with
each other, yields (n choose k) subgraphs, whereas the number of protein groups that induce
a connected subgraph of the interaction network is much smaller. We seek to further
reduce the search space by addressing one type of overhead that can arise during
depth first branch and bound enumeration of connected induced subgraphs.
Subgraph enumeration is a problem that arises in many applications. Depending on the application, the problem can be formulated in several ways. While two
frequently used approaches share overlap, it should be noted that algorithms which
enumerate subgraphs by edges (topologically) are fundamentally different from algorithms which enumerate connected induced subgraphs (i.e., the former focuses on sets
of edges while the latter focuses on sets of vertices). Topological enumeration such as
that used for motif mining or querying focuses on generating all edge configurations,
whereas connected induced subgraph enumeration aims to generate vertex sets that
satisfy the criterion of connectedness. However, regardless of the task, most enumeration methods rely on an underlying order of vertices (intrinsic or imposed) to perform
efficient enumeration, and thus certain themes are common.
The ordering of vertices defines a relation on any two vertices v1 , v2 such that
v1 ≺ v2 (read as v1 precedes v2 ) is either true or false. Throughout this work we refer
to the order of vertices as a lexicographic order to establish that a total order exists on
all sequences of alphanumeric symbols used to name vertices, i.e., the lexicographic
order of vertices named A1 and A2 is A1 ≺ A2 while the lexicographic order of vertices named B1 and A2 is A2 ≺ B1. We refer to the lexicographic rank of vertex v1
in relation to v2 as either higher if v2 ≺ v1 or lower if v1 ≺ v2 . Some works from
our literature survey use the ordering of vertices (and edges) to define a lexicographic
order/rank on canonical forms of subgraphs, i.e., if two connected induced subgraphs
had canonical forms A, B, C and C, B, A then A, B, C ≺ C, B, A is true. The lexicographic rank of vertices and of canonical forms of subgraphs is used in several of the
following enumeration methods we surveyed.
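The ordering conventions above can be illustrated with a minimal Python sketch. It assumes vertex names are same-case alphanumeric strings, so that Python's built-in string comparison agrees with the lexicographic order used in the text; the function names are hypothetical.

```python
def precedes(v1, v2):
    """True if v1 precedes v2 in lexicographic order."""
    return v1 < v2

def relative_rank(v1, v2):
    """Rank of v1 in relation to v2: 'lower' if v1 precedes v2,
    'higher' if v2 precedes v1."""
    if precedes(v1, v2):
        return "lower"
    if precedes(v2, v1):
        return "higher"
    return "equal"

# Sorting vertex names yields the total order on vertices.
ordered = sorted(["B1", "A2", "A1"])            # ['A1', 'A2', 'B1']

# Canonical forms are compared element by element, here as tuples.
abc_first = ("A", "B", "C") < ("C", "B", "A")   # True
```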
A good deal of research has focused on identifying motifs or frequent subgraphs in
graphs. Methods such as gSpan [34] and FFSM [20] mine all frequent subgraphs from an
input graph. gSpan utilizes a branch and bound depth first approach to enumerate
candidates. Bounding is performed using the lexicographic rank of the canonical
form of each subgraph such that a subgraph is only extended if its lexicographic rank
is theoretically lowest, thus avoiding much redundancy caused by isomorphisms of
the same graph being searched separately. FFSM[20] utilizes a similar lexicographic
ranking technique, but it uses a hybrid method of joining candidate subgraphs to
form new candidates as well as extending current candidates by a single vertex to
perform a breadth first search. A pruning step removes undesirable candidates from
the search at each iteration.
Another problem that has been studied extensively is motif counting. In motif
counting, rather than unsupervised mining of all frequent subgraphs, one is interested in counting how many instances of a target motif exist in a graph. This can
be viewed as finding all isomorphic instances of a given graph as subgraphs of another graph. An early algorithm that was developed for this task is the backtracking
algorithm for testing subgraph isomorphism proposed by Ullmann [32]. The VF2 algorithm [9] exhibits a performance improvement over Ullmann's algorithm using a depth first branch
and bound approach and an imposed lexicographic ordering of vertices to enumerate
isomorphisms of the target motif. A recent work, the ISAM algorithm [11], demonstrates performance improvements over VF2 and Ullmann's algorithm by implementing a depth
first search as an iterative procedure (similar to Ullmann's algorithm) but using highly optimized
data structures and candidate ranking criteria. A different approach is taken by
Afrati et al.[1] where they investigate methods for parallelizing motif counting using
the well known MapReduce framework to distribute the work of the search across
many different processors.
A special case of motif search that is commonly investigated is finding cliques,
i.e., subgraphs in which each node is connected to every other node of the subgraph.
More specifically, investigators commonly wish to identify all maximal cliques, i.e.,
cliques that are not contained in any larger clique. Algorithm 457 [4] uses a branch
and bound approach to generate maximal cliques, but rather than a vertex ordering
strategy to avoid redundancy it employs a “not” set to track vertices which have
already been explored. In contrast, the depth first backtracking algorithm by Kreher
and Stinson [26] uses an imposed ordering on the vertices to avoid redundancy and find
all maximal cliques. This problem has also been studied in the MapReduce framework
by Wu et al.[33] who develop a novel depth first clique enumeration algorithm that
can be distributed across multiple processors.
Alternatively, investigators may wish to find independent sets of a graph, which
in some ways is the dual of the clique problem, i.e., no two nodes of an independent
set are joined by an edge. Similar to cliques, maximal independent sets are of interest
where a maximal independent set is not contained in any other independent set.
Johnson and Yannakakis [21] perform a theoretical analysis and present an iterative
depth first search algorithm for general graphs that outputs results in lexicographic
order. Eppstein [12] explores an iterative algorithm to perform this task based on the
ReverseSearch framework [2], which is also a form of depth first search.
The general task that relates most closely to ours is enumerating all sets of vertices
of a graph that induce a connected subgraph. A powerful algorithm well suited to
this task is ReverseSearch [2], which is highly efficient in terms of space and can be
distributed to run in parallel. ReverseSearch is itself a form of depth first search that
utilizes a rank ordering of vertices to eliminate redundancy and imposes a child/parent
relationship on all subgraphs. However, the complexity of the general ReverseSearch
algorithm is less efficient than derivatives optimized for specific tasks [12, 28]. There
are also methods like Algorithm 447 [19] that use iterative depth first search and label
vertices as visited to avoid redundancy. This is less efficient in terms of space than
ReverseSearch but is similar to many of the algorithms we surveyed for topological
enumeration.
In this study, we place an additional constraint on the problem of connected induced subgraph enumeration that enables development of efficient branch and bound
algorithms for problems that include finding high-scoring connected subgraphs according to a well-defined scoring criterion. Namely, we focus on the case where all
connected induced subgraphs satisfy a hereditary property. A hereditary property of
a graph G is a property such that all induced subgraphs g of G also satisfy the property [3]. For example, being a clique is a hereditary property because any induced
subgraph of the clique is also a clique. A problem closely related to our search for all
connected induced subgraphs satisfying a hereditary property was recently studied
by Cohen et al.[8] in which they investigate for which classes of hereditary property
P the maximal P-subgraphs problem can be solved in polynomial time. Cohen et
al. also point out that the general problem (for any hereditary property P) cannot
be solved in polynomial time because the output may be exponential in size. This
is closely related to our problem because we do not restrict the class of hereditary
property that each subgraph must satisfy, and thus our result set may be exponential
in size. This is in fact the case when we relax the property such that any S satisfies
the property and the search becomes exhaustive.
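The clique example of a hereditary property can be checked on a small graph. The following is a minimal sketch under stated assumptions: the graph, the adjacency-dictionary representation, and the helper names are hypothetical. It verifies that every nonempty subset of a clique is a clique, and that once a vertex set fails the property, every superset fails as well, which is the fact that branch and bound pruning relies on.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """True if every pair of distinct vertices in `nodes` is adjacent."""
    return all(v in adj[u] for u, v in combinations(sorted(nodes), 2))

def nonempty_subsets(nodes):
    """Yield every nonempty subset of `nodes` as a tuple."""
    nodes = sorted(nodes)
    for r in range(1, len(nodes) + 1):
        yield from combinations(nodes, r)

# Hypothetical graph: triangle A-B-C with a pendant vertex D attached to C.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

# Every induced subgraph of the clique {A, B, C} is again a clique.
clique_is_hereditary = all(is_clique(adj, s) for s in nonempty_subsets("ABC"))

# {B, D} is not a clique, so no superset of {B, D} is a clique either.
supersets_fail = not any(
    is_clique(adj, s) for s in nonempty_subsets("ABCD") if {"B", "D"} <= set(s)
)
```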
The two unifying observations we made from our survey of previous work are: (1)
methods for subgraph enumeration are generally based on depth first search, with a
significant portion using a branch and bound optimization strategy; and (2) an intrinsic
or imposed order of vertices enables the use of diverse strategies to reduce the search
space and avoid redundant solutions. The first observation motivates our interest in
the problem because depth first search exhibits an inherent drawback when applied
to branch and bound search for subgraphs that satisfy a hereditary property fb, a drawback
we will outline in the following section.
In this work we will first clearly define the problem and the type of overhead we
seek to reduce, and we then briefly outline the process through which we have developed our proposed solution with examples to clarify key points. We will then provide
a formal algorithm for our proposed solution with supporting theorems and a set of
computational experiments demonstrating the difference in performance between our
solution and a conventional depth first branch and bound search.
2 Problem Definition and Observations
Let G = (V, E) be an undirected graph. A set V′ ⊆ V is said to be a connected
node set if the subgraph induced by V′ is connected, i.e., if for every pair of nodes
u, v ∈ V′, there is a path in G from u to v that goes only through nodes in V′.
Throughout this work we refer to connected node sets as S where it is implied that
S ⊆ V and S induces a connected subgraph of G.
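The definition of a connected node set translates directly into a breadth first check that only walks through members of the set. This is a minimal sketch; the adjacency-dictionary representation, the function name, and the example graph are assumptions made for illustration.

```python
from collections import deque

def is_connected_node_set(adj, nodes):
    """True if `nodes` induces a connected subgraph of the graph `adj`."""
    nodes = set(nodes)
    if not nodes:
        return False  # convention of this sketch: the empty set is excluded
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            # Only follow edges whose endpoints both lie inside `nodes`.
            if w in nodes and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen == nodes

# Hypothetical path graph A-B-C-D.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}

abc_connected = is_connected_node_set(adj, {"A", "B", "C"})   # True
ac_connected = is_connected_node_set(adj, {"A", "C"})         # False: B is missing
```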
We are interested in enumerating all connected node sets in G. While enumerating
such sets can be useful in the context of many applications, here we are particularly
interested in facilitating branch-and-bound algorithms that are designed to solve optimization problems or enumerate all subgraphs that satisfy a hereditary property.
In particular, we assume that we are given a scoring function f : 2^V → ℝ such that,
for V′ ⊆ V, f(V′) = −∞ if V′ does not induce a connected subgraph in G. In this
setting, branch-and-bound algorithms can be useful in solving two types of problems:
P1: Given a score threshold f*, find all connected node sets S such that f(S) ≥ f*.
P2: Find a connected node set S such that f(S) ≥ f(S′) for any connected node
set S′ in G.
Since our focus is on facilitating branch-and-bound algorithms, we assume that a
“bounding” function fb : 2^V → ℝ is available such that for a given connected node
set S, f(S) ≤ fb(S′) for any connected node set S′ ⊆ S. In other words, the function
fb(S) provides a mechanism for bounding the score of any connected node set that can
be obtained by adding more nodes to S. In the context of problems of type P1, if
fb(S) < f* we say that fb(S) is not satisfied. Alternatively, fb can be defined as a
boolean function fb : 2^V → {0, 1} that determines if a subgraph S satisfies a desired
property, where fb(S′) = 0 means the property is not satisfied. In the case fb(S′) = 0, all
S ⊇ S′ have bound fb(S) = 0, i.e., if S′ does not satisfy the property then no S ⊇ S′
satisfies the property either. Both definitions of fb are hereditary in nature and thus
meet our requirement that enumerated S ⊆ V satisfy a hereditary property.
Observe that, if we can solve problems of the type P1 using a branch-and-bound
algorithm, we can also solve problems of type P2 by adaptively setting the threshold
f ∗ to the score of the best subnetwork found so far. For this reason, we focus on the
first type of problems in the rest of our discussion. For both types of problems, if
we have an efficient way of enumerating all connected node sets, we can develop a
branch-and-bound algorithm that will prune out chunks of the search space efficiently
by bounding the score (f or satisfiability of a desired property) of larger connected
node sets using the bounding function (fb ) for their subsets, which are smaller.
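The reduction of P2 to P1 by adaptive thresholding can be sketched as follows. This is an illustration under stated assumptions and not the algorithm developed in this thesis: the score f is taken to be a sum of hypothetical vertex weights, the bound fb adds every positive weight outside S (so f(S) ≤ fb(S′) holds for any S′ ⊆ S), and connected node sets are grown naively one vertex at a time with a visited set to avoid duplicates.

```python
def best_connected_set(adj, weight):
    """Solve P2 by treating the best score found so far as the threshold f*."""

    def f(S):
        return sum(weight[v] for v in S)

    def f_b(S):
        # Upper bound on the score of any superset of S.
        return f(S) + sum(w for v, w in weight.items() if v not in S and w > 0)

    best_score, best_set = float("-inf"), None
    seen = set()
    stack = [frozenset([v]) for v in sorted(adj)]
    while stack:
        S = stack.pop()
        if S in seen:
            continue
        seen.add(S)
        if f_b(S) < best_score:
            continue  # bound not satisfied: no superset of S can reach f*
        if f(S) > best_score:
            best_score, best_set = f(S), S  # raise the threshold f*
        # Extend S by one adjacent vertex, which keeps it connected.
        frontier = set().union(*(adj[v] for v in S)) - S
        for u in frontier:
            stack.append(S | {u})
    return best_set, best_score

# Hypothetical weighted path graph A-B-C-D.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
weight = {"A": 2, "B": -1, "C": 3, "D": -4}

best_set, best_score = best_connected_set(adj, weight)   # {A, B, C} with score 4
```

Pruning is safe here because a pruned set has fb(S) below the current threshold, so none of its supersets can improve on the best set already found.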
To facilitate efficient and effective branch-and-bound algorithms, we need an algorithm to enumerate the solution space (here, the space of all connected node sets
of the input graph) correctly and efficiently. We observe that such an enumeration algorithm should satisfy the following criteria to result in efficient and effective
branch-and-bound algorithms:
• Completeness: All connected node sets in G satisfying bound fb should be
generated and all generated node sets should be connected.
• No redundant subgraph generation: Each connected node set in G should be
generated exactly once.
• Optimal order of enumeration: If S′ and S are connected node sets and S′ ⊂ S,
then S′ should be generated before S so that if fb(S′) is not satisfied we try to
avoid generating S.
The “completeness” criterion relates to the correctness of the algorithm while the
“no redundant subgraph generation” and “optimal order of enumeration” criteria
relate to efficiency. The “no redundant subgraph generation” criterion asserts that
each candidate solution in the solution space should be considered exactly once since
additional considerations will lead to redundant computation.
The “optimal order of enumeration” criterion, on the other hand, facilitates optimal pruning of the search space by ensuring that all subsets of a connected node
set are considered before the node set itself is considered. To see why this is useful,
consider the definition of the bounding function which guarantees f(S) ≤ fb(S′) for
any S′ ⊆ S. From this it is apparent that evaluating S′ before all S that contain
S′ is desirable because if fb(S′) < f* then any S containing S′ will have f(S) < f*,
and in the context of G there may be an exponential number of S that contain S′.
Whenever an S ⊃ S′ is evaluated where fb(S′) < f* we call it a redundant rejection.
Redundant rejections are a source of overhead and reducing redundant rejections is
the focus of this thesis.
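One way to make the notion of a redundant rejection concrete is to count, in a given evaluation sequence, the sets that properly contain a set known to fail the bound. The sketch below is one interpretation of the definition above; the evaluation orders, the single-character vertex names, and the function name are hypothetical.

```python
def count_redundant_rejections(evaluated, failing_set):
    """Count evaluated sets that are proper supersets of `failing_set`,
    a set S' for which the bound is not satisfied."""
    core = frozenset(failing_set)
    return sum(1 for S in evaluated if core < frozenset(S))

# Hypothetical depth first order: PQR is evaluated before its subset PR.
depth_first_order = ["P", "PQ", "PQR", "PR"]
redundant = count_redundant_rejections(depth_first_order, "PR")      # 1 (PQR)

# If PR is evaluated first and its supersets are pruned, PQR is never evaluated.
pruned_order = ["P", "PQ", "PR"]
redundant_pruned = count_redundant_rejections(pruned_order, "PR")    # 0
```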
Eliminating all redundant rejections likely requires a breadth first search of G, but
in this work we took a more conservative approach to keep the size of the problem
manageable. To begin, we observed that satisfying the “completeness” and “no redundant subgraph generation” criteria can be accomplished by selecting a single v ∈ V as
an anchor vertex and enumerating all subgraphs containing v before removing v from
G. In this way each v ∈ V is chosen as a starting point and all subgraphs containing
it are enumerated before v is removed from G. When V = ∅ all subgraphs have
been enumerated. An example of enumerating connected induced subgraphs from an
anchor vertex is shown in Figure 1.
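The anchor vertex scheme can be sketched in a few lines of Python. This is a naive illustration, not the enumeration procedure developed later: the graph is hypothetical, and the inner search uses a visited set to avoid duplicates within one anchor rather than any ordering strategy. What it demonstrates is that removing each anchor after use yields every connected node set exactly once.

```python
def connected_sets_containing(adj, anchor):
    """All connected node sets of the graph `adj` that contain `anchor`."""
    found = set()
    stack = [frozenset([anchor])]
    while stack:
        S = stack.pop()
        if S in found:
            continue
        found.add(S)
        # Adding any vertex adjacent to S keeps the set connected.
        for u in set().union(*(adj[v] for v in S)) - S:
            stack.append(S | {u})
    return found

def enumerate_by_anchor(adj):
    """Enumerate connected node sets anchor by anchor, removing each anchor
    from the graph once all sets containing it have been generated."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy of G
    result = []
    for anchor in sorted(adj):
        result.extend(connected_sets_containing(adj, anchor))
        for u in adj.pop(anchor):                 # remove the anchor from G
            adj[u].discard(anchor)
    return result

# Hypothetical graph: triangle A-B-C with a pendant vertex D attached to C.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

all_sets = enumerate_by_anchor(adj)   # 12 connected node sets, no duplicates
```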
Figure 1: Example illustrating the enumeration of all connected induced subgraphs of a graph using
anchor vertices. On the left is the graph as each vertex becomes the anchor used for enumeration of
all connected subgraphs that contain the anchor before it is subsequently removed from G. On the
right are the connected subgraphs generated from each anchor vertex.
It is obvious that, for some S′ ⊂ V, this process generates many S ⊃ S′ before
S′ itself, and thus it does not satisfy the criterion of “optimal order of enumeration”.
For example, subgraphs containing DH are generated 4 times before DH itself is
generated, and this can cause redundant rejections if fb(DH) is not satisfied, i.e., if DH
fails our criteria for score or hereditary property. Rather than going breadth first over the
entire network we took a different approach. We instead focused on eliminating
redundant rejections within each search anchored at v. For example, during extension from
A in Figure 1, if fb(ADH) is not satisfied, we do not enumerate ACDH, ABDH or
ABCDH.
3 Base Case
3.1 Focusing on Direct Neighbors
The first structure we explore is a variation of the binomial tree. Any child n of the
root of a binomial tree has children that are copies of all branches rooted at siblings
that precede n in the tree, and it can be used to enumerate the power set of a set [23].
To apply this to a graph, we observe that starting at an anchor vertex v ∈ V , the
neighbors of v can be treated as a set because v and any combination of its neighbors
induce a connected subgraph of G.
This collection of subgraphs can be represented as a binomial tree T with the
root node r labeled by v. We construct the tree such that each node n only contains
descendants labeled by vertices of greater lexicographic rank than the vertex labeling
n. Since all vertices labeling nodes in T are connected to the vertex labeling the
root, the set of vertices that label each path P(nk ) from the root r of T to a node
nk represents a connected subgraph of G. The binomial tree is a special case of our
more general solution. In this special case the input is restricted to a vertex and its
direct neighbors and the bound fb is satisfied by any S. Here, we do not prove that
the binomial tree satisfies our criteria of “completeness”, “no redundant subgraph
generation” and “optimal order of enumeration” since Theorems 1 and 2 related to
the general solution show that depth first search of the binomial tree satisfies these
criteria. An example is illustrated in Figure 2.
Figure 2: Graph G and binomial tree T generated from anchor vertex A and its neighbors B, C and
D. Performing a depth first search of T generates the sets A, AD, AC, ACD, AB, ABD, ABC, ABCD.
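The construction of the binomial tree for an anchor and its direct neighbors can be sketched as follows. The `Node` class and function names are hypothetical; the sketch assumes single-character vertex names so that each path can be written as a string, and it adds neighbors in reverse lexicographic order so that every node only has descendants of greater lexicographic rank.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.children = []

def copy_branch(node):
    """Deep copy of the branch rooted at `node`."""
    clone = Node(node.label)
    clone.children = [copy_branch(c) for c in node.children]
    return clone

def binomial_tree(anchor, neighbors):
    """Each new child of the root receives copies of all branches rooted
    at the siblings that precede it."""
    root = Node(anchor)
    for v in sorted(neighbors, reverse=True):
        n = Node(v)
        n.children = [copy_branch(s) for s in root.children]
        root.children.append(n)
    return root

def dfs_sets(node, prefix=""):
    """Yield the vertex labels along each root-to-node path, depth first."""
    path = prefix + node.label
    yield path
    for child in node.children:
        yield from dfs_sets(child, path)

sets = list(dfs_sets(binomial_tree("A", ["B", "C", "D"])))
# ['A', 'AD', 'AC', 'ACD', 'AB', 'ABD', 'ABC', 'ABCD']
```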
However, this does not yet provide a performance gain as we have still generated
all possible branches first (each branch representing an S) and if an S′ does not
satisfy our bound fb, it is still possible to encounter S ⊃ S′ as searching T continues.
Taking advantage of T for branch and bound algorithms requires that the binomial
tree be constructed in a specific manner. To motivate this statement consider the
worst case scenario for generating all subsets using a full binomial tree T. If the first
subset S′ = AD in Figure 2 does not satisfy fb, then a depth first search of T
rejects 3 additional S ⊃ S′, i.e., we would perform three redundant rejections. On
the other hand, if S′ = AB does not satisfy fb no redundant rejections occur. The
number of redundant rejections of a subgraph depends on the order in which vertices
are considered: in the worst case there are 2^(m−1) − 1 redundant rejections (m being the
number of neighbors of v), and in the best case there are none.
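The worst and best case counts can be reproduced with a short sketch that walks the binomial tree implicitly, pruning below any rejected set. The function name and arguments are hypothetical, vertex names are assumed to be single characters, and the bound is modeled as failing exactly for supersets of one given failing set.

```python
def redundant_rejections(anchor, neighbors, failing_set):
    """Depth first branch and bound over the binomial tree of `anchor` and
    its neighbors; counts rejected sets that properly contain `failing_set`."""
    order = sorted(neighbors, reverse=True)   # order in which neighbors are added
    core = set(failing_set)
    redundant = 0

    def visit(path, limit):
        nonlocal redundant
        # Children of a node are labeled by the neighbors added before it.
        for j in range(limit):
            S = path | {order[j]}
            if core <= S:
                if core < S:
                    redundant += 1   # a superset of the failing set was evaluated
                continue             # bound not satisfied: prune below S
            visit(S, j)

    visit({anchor}, len(order))
    return redundant

worst = redundant_rejections("A", "BCD", "AD")     # 3 = 2^(3-1) - 1
best = redundant_rejections("A", "BCD", "AB")      # 0
larger = redundant_rejections("A", "BCDEF", "AF")  # 15 = 2^(5-1) - 1
```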
Similar to depth first traversal of the binomial tree, performing depth first branch
and bound enumeration directly on G exhibits the same type of redundant rejections.
This is obvious because a depth first search must explore subgraphs of G equivalent
to those represented by the binomial tree and no depth first search strategy can select
nodes to follow a priori that will avoid all redundant rejections. That is, regardless of
the order in which the depth first search of G selects vertices to follow, the number
of redundant rejections encountered varies depending on which S′ ⊂ S is the cause
of the rejection. In other words, depth first branch and bound search is inherently
unstable with regard to the number of redundant rejections, as this quantity
varies unpredictably.
3.2 Optimized Local Search Tree
A simple method that reduces redundant rejections using a binomial tree based approach is as follows. The binomial tree can be constructed by adding each neighbor
vertex to the root as a new node n, and then appending copies of the branches rooted
at each sibling of n as children of n. Neighbor vertices are added to the root in reverse
lexicographic order similar to the construction of the binomial tree in the previous
section. However, branches are copied using a method that evaluates each path of
the branch while copying it, and copying a path P(nk ) terminates whenever the set
S represented by P(nk ) does not satisfy the bound fb . The resulting local search is
similar in spirit to the set enumeration tree (SE-tree) search of Rymon [31]. However,
our approach is more closely related to the binomial tree because we construct an
explicit tree where the set is defined by the path from the root to a node in the tree.
It is important to note that depending on how rejections occur, T may no longer meet
the definition of a binomial tree so moving forward we will refer to T as a local search
tree. An example of constructing a local search tree is shown in Figure 3.
Figure 3: Creating the local search tree T using a branch and bound optimization generates the
sets A, AD, AC, AB, ABD, ABC. (A) The input graph G with the anchor vertex A. (B) Exploring
D with no previous branches yields the D branch. (C) Exploring C with the previous D branch
evaluates ACD which is rejected resulting in branches C and D. (D) Exploring B with previous
branches evaluates ABD and ABC (avoiding ABCD which contains the previously rejected ACD).
In this way, copying of branches stops at any point that fb (S) is not satisfied. It is
obvious this reduces redundant rejections because when we append the branch rooted
at sibling n1 to the new sibling n2 we avoid enumerating S that are rejected while
creating the n1 branch. We stress that this is only a potential reduction in redundant
rejections because we are avoiding one source of redundant rejections, but others exist
that we describe in the Discussion section. The local search tree is a special case of
our general solution. In this special case the input is limited to only a vertex and its
direct neighbors. Here we do not prove that the local search tree satisfies our criterion
of “completeness” since Theorem 3 related to the general solution shows that after
rejecting an S′ all S ⊅ S′ are still enumerated by depth first search of the local search
tree.
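The construction of the local search tree, in which branches are evaluated while they are copied, can be sketched as follows. This is a minimal illustration for an anchor and its direct neighbors only; the `Node` class, the `satisfies` predicate standing in for the bound fb, and the assumption that the anchor alone satisfies the bound are all hypothetical choices of this sketch.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.children = []

def build_local_search_tree(anchor, neighbors, satisfies):
    """Build the tree and return it with the sets generated, in order."""
    root = Node(anchor)
    generated = [frozenset([anchor])]   # assumes the anchor satisfies the bound

    def copy_branch(src, parent, S):
        S = S | {src.label}
        if not satisfies(S):
            return                      # copying this path terminates here
        clone = Node(src.label)
        parent.children.append(clone)
        generated.append(S)
        for child in src.children:
            copy_branch(child, clone, S)

    for v in sorted(neighbors, reverse=True):
        S = frozenset([anchor, v])
        if not satisfies(S):
            continue
        n = Node(v)
        generated.append(S)
        for sibling in root.children:   # branches rooted at preceding siblings
            copy_branch(sibling, n, S)
        root.children.append(n)
    return root, generated

# Bound modeled after Figure 3: any set containing A, C and D is rejected.
def satisfies(S):
    return not {"A", "C", "D"} <= S

root, generated = build_local_search_tree("A", ["B", "C", "D"], satisfies)
names = ["".join(sorted(S)) for S in generated]
# ['A', 'AD', 'AC', 'AB', 'ABD', 'ABC']
```

Because the copy of the D branch under C stops at ACD, the C branch later copied under B has no D child, so ABCD is never evaluated.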
4 General Case
4.1 Joining Local Search Trees
To demonstrate our reasoning behind the next development, we observe that the local
search tree from the previous section contains all subgraphs that satisfy fb (S) using
a single anchor vertex v and its direct neighbors. To generate subgraphs beyond this
initial seed, we can follow a path P(nk ) in T where the vertices that label the nodes of
P(nk ) represent an S ⊆ V . At this point we can treat S as an anchor set by removing
all neighbors of v not in S from G. We can then look at the direct neighbors of the
anchor set and create a T′ rooted at S. An example illustrating this idea is shown in
Figure 4. The process is repeated for each P(nk ) in T until all paths have been used
as anchor sets.
Figure 4: Example illustrating the key idea for joining local search trees of direct neighbors to
generate all connected subgraphs. Initial tree T1 generated by Algorithm 1 anchored at A is then
extended by following path ACD and extending it also using Algorithm 1 to generate T2 .
Generating all S ⊆ V in this manner maintains our original optimization during
generation of each tree rooted at an anchor set, but the optimization does not extend
beyond the individual tree generations. For example, if in Figure 4 we extend AD and
during generation find that fb(ADI) is not satisfied, we would reject its supergraphs
(e.g., ACDI, ACDHI and ACDEHI) again when we extend ACD. To utilize
the information from local search branches globally, we must modify the construction
procedure for the local search tree as described in the next section.
4.2 Optimized Joining of Local Search Trees
We initially investigated creating a local search tree and then following each path to
enumerate connected induced subgraphs containing vertices beyond the anchor vertex
and its direct neighbors. However, if instead we extend each branch as it is added
to generate a depth first search through a neighbor of the anchor v, we can use the
depth first branches as the search continues through other neighbors to leverage the
information from previous rejections. In the context of the enumeration procedure,
this becomes a simultaneous depth-breadth search. An example of this method is
shown in Figure 5.
However, an additional matter that greatly complicates this process is the occurrence of cycles in G. A cycle is a path in G that originates and terminates at a vertex
v without backtracking, i.e., we arrive back at v by only traversing unvisited vertices.
While generating a branch from an anchor set S, if a cycle exists in G that originates
and terminates in S, T can inadvertently become corrupt such that a path P(nk )
contains multiple nodes labeled with the same vertex. As such, the vertices labeling
the nodes along P(nk ) no longer represent a proper set as elements are duplicated.
In addition, if the branch being generated from an anchor set S is joined with
a previously generated branch that contains nodes labeled by vertices adjacent to S
then the joining can disrupt the desired order of generating all S′ ⊂ S before S. If
this occurs it can introduce additional overhead into the subgraph enumeration if an
S′ ⊂ S that does not satisfy fb (S′) is generated after S.
Furthermore, cycles can also lead to redundant P(nk ) in T such that the same
Figure 5: (A) The input graph G with the anchor vertex A highlighted. (B) The enumeration
tree generated by exploring depth first through vertex D, where fb (ADIH) is not satisfied and ADIH
is rejected. (C) The first step of adding the tree generated by AD to the tree being generated
through AC. At this step ACD, ACDI and ACDH are evaluated and fb (ACDH) is not satisfied so
ACDH is rejected. The DI branch is passed to the children of AC discovered by continued depth
first search. (D) The second step of extending AC to neighbor E. After ACE is evaluated, the DI
branch is joined and ACED and ACEDI are evaluated.
S will be enumerated multiple times. Examples of all three types of corruption
are shown in Figure 6 (B). Repeated vertex labels in a path are quite obvious. The
enumeration order is violated where ABCEG is evaluated before ABG.
Redundancy occurs multiple times, as ABCEG is evaluated on every branch.
In order to resolve these issues, we place an additional restriction on adding each
previously generated branch to the branch currently being generated. If we are extending a subgraph S represented by path P(nk ), we prune previous branches at
nodes labeled with v ∈ χnk before joining, i.e., we remove nodes labeled by vertices
adjacent to vnk but not adjacent to any other vertex labeling a node along P(nk−1 ).
We then pass the pruned branch forward as enumeration continues. This method
represents our solution to the general case and is formalized in Algorithm 1. Lemma
3 guarantees that all paths in T represent sets, i.e., every vertex labeling a node of
a path in T is unique. Theorem 1 guarantees that all S ⊆ V containing the anchor
vertex are uniquely enumerated during exhaustive search. Theorem 2 guarantees that
no S ⊃ S′ is enumerated before S′ during exhaustive search. Theorem 3 guarantees
that when an S′ is rejected all S ⊅ S′ are still enumerated. An example of the tree
constructed by Algorithm 1 is shown in Figure 6 (C).
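The pruning restriction can be illustrated with a minimal Python sketch. It assumes a hypothetical nested-tuple encoding of a branch, (vertex, [children]), and a helper named prune_clone; neither comes from the thesis. Only the topological test is applied here, and the bound fb is ignored. The example replays the two prunings of the branch through AC that are described for Figure 6.

```python
def prune_clone(branch, cut):
    """Copy `branch`, dropping every node labeled by a vertex in `cut`
    together with the whole subtree below that node."""
    vertex, children = branch
    if vertex in cut:
        return None
    kept = []
    for child in children:
        copy = prune_clone(child, cut)
        if copy is not None:
            kept.append(copy)
    return (vertex, kept)


# Branch generated through AC in Figure 6: C -> E -> G.
branch_c = ("C", [("E", [("G", [])])])

# Joining at AB: G is in the extension set of B, so the copy is cut at G.
at_ab = prune_clone(branch_c, {"G"})

# Joining at ABG: E is in the extension set of G, so the copy is cut at E.
at_abg = prune_clone(at_ab, {"E"})
```

Because a pruned node takes its subtree with it, no vertex reachable only through the pruned node can reappear further down the copied branch.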
Figure 6: (A) Graph G with anchor vertex A highlighted. (B) The enumeration tree generated
by appending an un-pruned branch generated from S =AC to the branch generated through S =AB
which exhibits several redundant instances of G and E. (C) The tree generated by pruning the
branch generated through AC as it is added to the branch generated through AB. At AB, G is a
neighbor of B so the branch through AC is pruned at G to CE before being added. The pruned
branch is then passed to ABG during extension. At ABG, E is a neighbor of G so the branch from
AC is pruned again at E to C. The C branch is then passed to ABGE where it is added for the final
time. In addition, it can be observed that if G were not removed from the CEG branch when first
joining it to B, CEBG would be generated before BG, thus violating the optimal generation order.
4.3
Caching Depth First Search (CDFS)
Algorithm 1 is a formal presentation of the full algorithm for generating all connected induced subgraphs of G optimized for branch and bound algorithms. It makes
use of the extension set χnk defined in detail in Appendix A. The entry point is
DEPTH(∅,v,[ ]) which performs a CDFS search from anchor vertex v ∈ G. We have
used the boolean function definition of fb that tests S for a hereditary property. We
know by Theorems 1 and 3 that when DEPTH returns, all S ⊆ V containing v that
satisfy fb (S) have been enumerated. Enumerating all connected induced subgraphs
in G only requires calling DEPTH on each v ∈ V and removing the selected v from
G after each call.
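The outer loop over anchor vertices can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used for the experiments: the graph is assumed to be a dict mapping each vertex to a set of neighbors, the names subsets_with_anchor and enumerate_all are hypothetical, and the per-anchor enumerator is a plain depth-first stand-in for DEPTH that keeps no cache and applies no bound.

```python
def subsets_with_anchor(adj, anchor):
    """Stand-in for DEPTH: all connected vertex sets that contain `anchor`."""
    found = []

    def extend(S, candidates):
        found.append(frozenset(S))
        candidates = list(candidates)
        while candidates:
            w = candidates.pop()
            # Vertices already in S or adjacent to S are never re-added.
            covered = set(S)
            for u in S:
                covered |= adj[u]
            fresh = [u for u in adj[w] if u not in covered]
            extend(S | {w}, candidates + fresh)

    extend({anchor}, adj[anchor])
    return found


def enumerate_all(adj):
    """Enumerate from each vertex in turn, deleting it from G afterwards."""
    graph = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    result = []
    for v in sorted(graph):
        result.extend(subsets_with_anchor(graph, v))
        for u in graph.pop(v):          # remove v and its incident edges
            graph[u].discard(v)
    return result


path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
all_subsets = enumerate_all(path)
```

Removing each anchor after its call is what prevents a subgraph from being reported once per vertex it contains.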
Algorithm 1 Enumerate all S that contain anchor vertex v and satisfy fb (S). Returns the root node of the enumeration tree T . Entrance point is DEPTH(∅,v,[ ])
 1: procedure BREADTH(S, n, U)
 2:     if vn ∈ U then                        ▷ Prune branch by topology
 3:         return null
 4:     end if
 5:
 6:     S′ ← S ∪ vn
 7:     if fb(S′) = false then                ▷ Prune branch by bounding function
 8:         return null
 9:     end if
10:
11:     n′ ← Υ(vn)
12:     for all {n∗ : nn∗ ∈ B} do             ▷ Recursively evaluate/prune/clone branch
13:         n″ ← BREADTH(S′, n∗, U)
14:         if n″ ≠ null then
15:             B ← B ∪ n′n″
16:         end if
17:     end for
18:     return n′
19: end procedure
20: procedure DEPTH(S, v, β)
21:     S′ ← S ∪ v
22:     if fb(S′) = false then
23:         return null
24:     end if
25:     n ← Υ(v)                              ▷ Note: Derive χn from S and v
26:     β′ ← [ ]
27:     for i = 1 to |β| do
28:         n′ ← BREADTH(S′, β[i], χn)
29:         if n′ ≠ null then
30:             B ← B ∪ nn′
31:             push(β′, n′)
32:         end if
33:     end for
34:     for all v ∈ χn do
35:         n′ ← DEPTH(S′, v, β′)
36:         if n′ ≠ null then
37:             B ← B ∪ nn′
38:             push(β′, n′)
39:         end if
40:     end for
41:     return n
42: end procedure
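Algorithm 1 can be rendered in Python roughly as follows. This is a minimal sketch under several assumptions rather than the code used for the experiments: the graph is a dict mapping each vertex to a set of neighbors, vertices are mutually comparable, the extension set is visited in reverse sorted order, and the names Node, cdfs, extension_set, breadth and depth are illustrative. Every set that is evaluated and accepted is appended to a list so that the output can be inspected.

```python
class Node:
    """Node of the enumeration tree T, labeled by one graph vertex."""
    def __init__(self, vertex):
        self.vertex = vertex
        self.children = []


def cdfs(adj, anchor, fb=lambda S: True):
    """Return (root of T, list of enumerated vertex sets) for one anchor."""
    found = []

    def extension_set(S, v):
        # Neighbors of v that are neither the anchor nor adjacent to a
        # vertex already on the path (extension set of Appendix A).
        cull = {anchor}
        for u in S:
            cull |= adj[u]
        return sorted(adj[v] - cull, reverse=True)  # visiting order: assumption

    def breadth(S, n, U):
        if n.vertex in U:                  # prune branch by topology
            return None
        S2 = S | {n.vertex}
        if not fb(S2):                     # prune branch by bounding function
            return None
        clone = Node(n.vertex)
        found.append(S2)
        for child in n.children:           # recursively evaluate/prune/clone
            copy = breadth(S2, child, U)
            if copy is not None:
                clone.children.append(copy)
        return clone

    def depth(S, v, beta):
        S2 = S | {v}
        if not fb(S2):
            return None
        n = Node(v)
        found.append(S2)
        chi = extension_set(S, v)
        U = set(chi)
        beta2 = []
        for branch in beta:                # join cached branches
            copy = breadth(S2, branch, U)
            if copy is not None:
                n.children.append(copy)
                beta2.append(copy)
        for u in chi:                      # continue the depth first search
            child = depth(S2, u, beta2)
            if child is not None:
                n.children.append(child)
                beta2.append(child)
        return n

    root = depth(frozenset(), anchor, [])
    return root, found


# The 5-cycle A-B-G-E-C-A described for Figure 6, anchored at A.
cycle = {"A": {"B", "C"}, "B": {"A", "G"}, "C": {"A", "E"},
         "E": {"C", "G"}, "G": {"B", "E"}}
root, found = cdfs(cycle, "A")
_, small = cdfs(cycle, "A", fb=lambda S: len(S) <= 3)
```

In this sketch the cache is simply the list beta2 of subtrees already built below the current node; each later sibling receives pruned copies of them instead of rediscovering the same vertices.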
5
Correctness
The following theorems are based on supporting lemmas in Appendix B. Theorem
1 guarantees that our method satisfies the “completeness” and “no redundant subgraph generation” criteria during exhaustive enumeration, i.e., when fb is satisfied by
any S. Theorem 2 guarantees that our method satisfies the “optimal order of enumeration” criterion during exhaustive enumeration. Theorem 3 guarantees that our
method satisfies the “completeness” criterion when fb is selective.
Theorem 1 Given an input graph G, an anchor vertex v ∈ V and a bound fb that
is satisfied by any S, Algorithm 1 uniquely enumerates all S ⊆ V containing v that
induce a connected subgraph of G.
Proof: By Lemma 4 we know that the set represented by any path P(nk ) in T
induces a connected subgraph of G and by Lemma 5 we know that every path in
T represents a unique set. By Lemma 7 we know that all S ⊆ V containing v are
represented by a path P(nk ) in T . Therefore, we can conclude that because Algorithm
1 enumerates all paths of T , Algorithm 1 uniquely enumerates all S ⊆ V containing
v.
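On small graphs, the statement of Theorem 1 can be spot-checked against an exhaustive oracle. The sketch below is only such a test aid, with hypothetical names is_connected and brute_force and an assumed dict-of-sets graph encoding. It tests every subset containing the anchor for connectivity, so it is exponential in |V| and unrelated to how Algorithm 1 works.

```python
from itertools import combinations


def is_connected(adj, S):
    """True if the vertex set S induces a connected subgraph."""
    S = set(S)
    start = next(iter(S))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w in S and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == S


def brute_force(adj, anchor):
    """All connected vertex sets containing `anchor`, by exhaustive test."""
    others = sorted(v for v in adj if v != anchor)
    result = set()
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            S = frozenset((anchor,) + extra)
            if is_connected(adj, S):
                result.add(S)
    return result


cycle = {"A": {"B", "C"}, "B": {"A", "G"}, "C": {"A", "E"},
         "E": {"C", "G"}, "G": {"B", "E"}}
expected = brute_force(cycle, "A")
```

An enumerator satisfies the theorem on a test graph when its output, taken as a list, has the same length as this set and the same elements.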
Theorem 2 Given an input graph G, an anchor vertex v ∈ V and a bound fb that
is satisfied by any S, Algorithm 1 enumerates all connected induced subgraphs of G
containing v in an order such that all S 0 ⊂ S containing v are enumerated before S.
Proof: By Theorem 1 we know that all S ⊆ V containing v that induce a connected
subgraph of G are enumerated, and by Lemma 8 we know that any S′ ⊂ S containing
v must be generated before S.
Theorem 3 Given an input graph G, an anchor vertex v and a bound fb , if an S′
does not satisfy fb all S ⊅ S′ are still enumerated by Algorithm 1.
Proof: We prove the theorem by contradiction.
For a given set S of connected vertices, let P(nk) be the path that represents S in
an exhaustive enumeration tree T . Assume P(nk) was eliminated because P(mk)
representing S′ was rejected, and assume that S ⊅ S′. Rejecting P(mk) can only
eliminate P(nk) if n is a copy of m created by the BREADTH procedure of Algorithm
1. As n is a copy of m created by the BREADTH procedure, all vertices labeling nodes
along P(mk) also label nodes along P(nk). Thus, ψ(P(nk)) ⊃ ψ(P(mk)) =⇒ S ⊃ S′
and we have a contradiction. Since we know by Theorem 1 that all S ⊆ V containing
v are enumerated by Algorithm 1 when no rejections occur, and that when a rejection
of S′ occurs it can only eliminate S ⊃ S′, we conclude that if an S′ is rejected all S ⊅ S′
are still enumerated.
6
Experimental Results
6.1
Exhaustive Synthetic Testing
We compare the performance of CDFS to DFS based approaches by enumerating
connected induced subgraphs of real world networks. We use the total weight of a
subgraph as our hereditary property and comparison to a threshold t as our bounding
function fb where if weight(S) > t then fb (S) is false, and otherwise fb (S) is true.
Weights were assigned to vertices from a Gaussian distribution with mean m and
standard deviation ρ. In our implementation of Algorithm 1 used for computational
tests we imposed a maximum size k on enumerated subgraphs by adding a test for
|S| = k in both the DEPTH and BREADTH procedures.
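The test setup can be sketched in Python as follows. The helper names gaussian_weights and make_bound, the fixed seed, and the decision to fold the size limit into fb are assumptions made for illustration; in the implementation described above the size test is a separate check inside DEPTH and BREADTH. Note that total weight behaves as a hereditary property only when no vertex weight is negative, which a Gaussian draw does not by itself guarantee.

```python
import random


def gaussian_weights(vertices, m, rho, seed=0):
    """Draw one weight per vertex from a Gaussian with mean m and
    standard deviation rho (seeded so that runs are repeatable)."""
    rng = random.Random(seed)
    return {v: rng.gauss(m, rho) for v in vertices}


def make_bound(weights, t, k=None):
    """Build fb: False when weight(S) > t, or when |S| exceeds k if given."""
    def fb(S):
        if k is not None and len(S) > k:
            return False
        return sum(weights[v] for v in S) <= t
    return fb


weights = gaussian_weights("ABC", m=10, rho=0)
fb = make_bound(weights, t=10)
fb_capped = make_bound(weights, t=100, k=2)
```

With ρ = 0 every weight equals the mean, so fb accepts any single vertex at t = 10 and rejects every pair.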
We utilized the Human Protein Reference Database (HPRD) [15], which consists of 9,455 vertices and 37,080 edges, and a citation network generated by Leskovec et al. [27] from
the on-line arXiv journal that consists of 5,241 vertices and 28,958 edges. Enumeration
was performed across a range of thresholds and standard deviations a total of ten
times and the performance measures such as the number of rejections and execution
time were averaged. It should be noted that for the arXiv network enumeration was
performed to a size of 5 while for HPRD enumeration was performed to a size of 4. A
smaller maximum size of S was imposed on enumeration of HPRD because runtimes
were significantly longer than for arXiv, and thus a smaller size was required to run all
tests in a tractable amount of time. The results for the arXiv network are displayed
in Figures 7, 8 and 9, and the results for the HPRD network are displayed in Figures 10 and 11.
Figures 7 and 10 plot the rejection rate (the number of S enumerated where fb (S)
was false divided by the total number of S enumerated) for each algorithm while
enumerating all S ⊆ V that satisfy fb . The rejection rate for CDFS is consistently
lower than that of DFS though the relationship is more obvious at higher values of ρ.
The correlation of rejection rate and ρ occurs because at low values of ρ the rejections
are strongly correlated with depth, e.g., at t = 10 and ρ = 0 (with m = 10) all single vertex subgraphs
would be accepted but all two vertex subgraphs would be rejected, and this correlation
aligns well with how branch and bound depth first search prunes areas of the search
space. However as ρ increases the rejections become more likely to happen at any
depth, and our strategy for reducing redundancy has more opportunity to prune the
search space in ways that DFS cannot.
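The rejection rate and the depth effect at ρ = 0 can be made concrete with a small Python sketch. The star graph, the helper name rejection_rate and the hand-listed evaluation sequence are assumptions chosen for illustration: with every weight equal to m = 10 and t = 10, each single vertex is accepted and each two-vertex subgraph is rejected, so no larger subgraph is ever evaluated.

```python
def rejection_rate(outcomes):
    """Fraction of evaluated S for which fb(S) was false."""
    outcomes = list(outcomes)
    rejected = sum(1 for accepted in outcomes if not accepted)
    return rejected / len(outcomes)


weights = {v: 10 for v in "ABCD"}               # rho = 0, m = 10
edges = [("A", "B"), ("A", "C"), ("A", "D")]    # star centered at A
t = 10


def fb(S):
    return sum(weights[v] for v in S) <= t


# Every single vertex and every connected pair is evaluated exactly once.
evaluated = [{v} for v in weights] + [set(e) for e in edges]
rate = rejection_rate(fb(S) for S in evaluated)
```

Here four sets are accepted and three are rejected, and all rejections sit at depth two, which is the situation in which plain branch and bound already prunes as well as it can.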
Figures 8 and 11 show that the two algorithms have similar runtimes for small
thresholds t and small standard deviations ρ, but at larger t and ρ the CDFS method
consistently completes before the DFS method. The large discrepancy in rejection
rate between the two algorithms that occurs for values of ρ > 6 appears to contribute
to CDFS outperforming the DFS method for small values of t, but we were unable
to establish a correlation between rejection rate and runtime for the values of t that
showed the greatest difference in runtime between the two algorithms. For this reason
we performed additional experiments to measure what factors were contributing to
the performance gain for large values of t.
We determined the dominating factor of the performance gain was that the DFS
based approach expends a significant amount of effort rediscovering relationships that
it has already established. Both algorithms require a method ADJ(S) that returns
unvisited vertices adjacent to at least one v ∈ S (in Algorithm 1 this returns χn
at line 34). However, the CDFS method does not call ADJ(S) in the BREADTH
procedure where it extends the current S using all previous subgraphs that satisfied
fb . As the search continues, the amount of search space being explored by BREADTH
increases exponentially, and the cache of previously established relationships allows
CDFS to outperform the DFS based method on larger search spaces. The reduction in
calls to ADJ(S) shows larger disparity between methods as the search space expands
demonstrating that the CDFS method reduces runtime complexity for both branch
and bound and exhaustive searches, i.e., searches where any S satisfies fb (S). Figure
9 shows a plot of the number of calls to ADJ(S) versus threshold value t. The cache
provides CDFS a computational edge over methods that use a strictly branch and
bound depth first approach, but it can require exponentially more memory. For this
reason CDFS is best suited to problems where G is very large, but the maximum size
of S to be enumerated is small (|S| < 10) so that memory consumption is reasonable
and the runtime improvement is spread across all vertices of G.
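The quantity plotted in Figure 9 can be instrumented along the following lines. The class name AdjacencyOracle, the explicit visited argument and the dict-of-sets graph encoding are assumptions for illustration; the text above only specifies that ADJ(S) returns unvisited vertices adjacent to at least one v ∈ S.

```python
class AdjacencyOracle:
    """Wraps a graph and counts how often ADJ is called."""

    def __init__(self, adj):
        self.adj = adj
        self.calls = 0

    def ADJ(self, S, visited=frozenset()):
        """Unvisited vertices adjacent to at least one vertex of S."""
        self.calls += 1
        neighbors = set()
        for v in S:
            neighbors |= self.adj[v]
        return neighbors - set(S) - set(visited)


cycle = {"A": {"B", "C"}, "B": {"A", "G"}, "C": {"A", "E"},
         "E": {"C", "G"}, "G": {"B", "E"}}
oracle = AdjacencyOracle(cycle)
frontier = oracle.ADJ({"A", "B"})
frontier_unvisited = oracle.ADJ({"A", "B"}, visited={"C"})
```

Reading oracle.calls after a run gives the per-run count; an enumerator that reuses cached branches would simply reach this method less often.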
6.2
Integration into CRANE
The CRANE algorithm by Chowdhury et al. [6] performs a heuristic branch and
bound depth first search of protein-protein interaction networks to find combinatorially dysregulated subnetworks with binary expression state patterns that are discriminative between two sample classes. The heuristic property of the algorithm is
adjustable in that it extends the best B subgraphs at any point during the depth first
search where if B is made large enough the search becomes exhaustive and B = 1 is
a greedy search.
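The role of B can be illustrated with a short Python sketch. It is not taken from the CRANE implementation: the function name best_extensions, the score table and the candidate sets are hypothetical, and the sketch shows only the selection step, not the objective function.

```python
import heapq


def best_extensions(candidates, score, B):
    """Keep the B highest scoring candidate subgraphs, best first."""
    return heapq.nlargest(B, candidates, key=score)


# Hypothetical scores for three candidate extensions of the subgraph {A}.
scores = {frozenset("AB"): 0.2, frozenset("AC"): 0.9, frozenset("AD"): 0.5}
candidates = list(scores)

greedy = best_extensions(candidates, scores.get, B=1)
everything = best_extensions(candidates, scores.get, B=10)
```

With B = 1 only the single best extension survives, and once B is at least the number of candidates nothing is discarded, which corresponds to the greedy and exhaustive extremes described above.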
The objective function for CRANE is non-trivial as it must compute probabilities
Figure 7: Rejection rate analysis for enumerating all S up to size 5 satisfying fb in the arXiv[27]
citation network. Each pane plots the average rejection rate (ratio of the number of S evaluated that
did not satisfy fb to the total number of S evaluated) of each algorithm versus threshold t where
the node scores were sampled from a Gaussian distribution with m=10 and standard deviation ρ.
based on the contents of the binary expression data for each subgraph explored.
Profiling executions of the code confirmed that most processing time is spent in the
objective function which is different from the previous computational tests where the
objective function complexity was less than that of the enumeration. We implemented
a version of CRANE that can enumerate connected induced subgraphs using either
the CDFS method or the DFS method and then compared the performance of both
methods.
Expression data sets used for comparison were a synthetic test case comprised of
500 genes and 32 samples and a glioblastoma data set comprised of 7,419 genes and
86 samples. We received the glioblastoma data set from Patel et al. [30], who had
constructed it from a data set of Verhaak et al. [16] using additional information from the
Figure 8: Comparison of the average runtime required for each algorithm to enumerate all S
up to size 5 satisfying fb in the arXiv[27] network. Each pane plots the average execution time
versus threshold t where the node scores were sampled from a Gaussian distribution with m=10 and
standard deviation ρ. At low values of t and ρ the average DFS execution time was sometimes less
than the time for CDFS. However when ρ > 6 or t > 25 the average CDFS time was consistently
less than that of the DFS method.
TCGA.
For the synthetic set, a corresponding synthetic network was generated and the
optimal solution was known. We enumerated connected induced subgraphs up to size
eight in the synthetic network with both DFS and CDFS methods and compared the
results. Due to the heuristic nature of the algorithm and differences between how DFS
and CDFS explore the search space, there were discrepancies in the results. However,
the known result was identified by both algorithms as part of their overall result
sets. The DFS method required 2 minutes to complete whereas the CDFS method
required 1 minute. This is not of particular interest from a runtime perspective, but
the analysis helped to underline how both methods behave when used in a heuristic
Figure 9: Comparison of the effort expended by each algorithm to retrieve unvisited vertices while
enumerating all S up to size 5 satisfying fb in the arXiv[27] network. Each pane plots the average
number of calls to a method returning unvisited adjacent vertices versus threshold t where the node
scores were sampled from a Gaussian distribution with m=10 and standard deviation ρ.
algorithm.
For the TCGA expression data, we used the HPRD[15] protein-protein interaction network to enumerate connected induced subgraphs. We enumerated connected
induced subgraphs up to size 8 in HPRD, and both methods identified the same top
results. The DFS method completed in 234 minutes compared to 30 minutes for
CDFS. The top results consisted of several large subgraphs with state patterns of all
genes down regulated and a two gene subgraph with a state pattern of one gene up and
the other gene down regulated. The subgraphs with all genes down regulated were
not of particular interest biologically. However, the two gene subgraph discriminated
roughly half of the short and long term survivors by up regulated MDK and down
regulated SDC1. This is interesting because MDK expression is known to promote
Figure 10: Rejection rate analysis for enumerating all S up to size 4 satisfying fb in the HPRD[15]
network. Each pane plots the average rejection rate (ratio of the number of S evaluated that did
not satisfy fb to the total number of S evaluated) of each algorithm versus threshold t where the
node scores were sampled from a Gaussian distribution with m=10 and standard deviation ρ.
cell migration and angiogenesis during tumorigenesis and SDC1 is a trans-membrane
protein involved in cell migration and cell-matrix interactions [18]. Our result is
intuitive in that increased MDK promotes cell migration while loss of SDC1 could
potentially weaken the extracellular matrix, enabling cancer cells to migrate more easily. A literature survey uncovered a recent paper that concluded that up regulated
MDK plays a pivotal role in promoting human glioma cell resistance to cannabinoid
antitumoral activity [14]. However, we were unable to find literature investigating
the role of SDC1 in glioblastoma, and this may be an interesting avenue for further
research.
Figure 11: Comparison of the average runtime required for each algorithm to enumerate all S
up to size 4 satisfying fb in the HPRD[15] network. Each pane plots the average execution time
versus threshold t where the node scores were sampled from a Gaussian distribution with m=10 and
standard deviation ρ. At low values of t and ρ the average DFS execution time was sometimes less
than the time for CDFS. However when ρ > 6 or t > 25 the average CDFS time was consistently
less than that of the DFS method.
7
Discussion
The computational results support the hypothesis that CDFS can perform enumeration of all connected induced subgraphs that satisfy a bound fb or exhaustive enumeration of all connected induced subgraphs with less overhead than DFS based
approaches. The CDFS method consistently performed equally well or significantly
better during the synthetic tests, and when implemented into the CRANE[6] algorithm it identified the same interesting solutions in dramatically less time.
The CDFS algorithm removes many possible redundant rejections of subgraphs
that do not satisfy fb (S). However, it is only a reduction, and it is possible that an
S′ ⊂ S that does not satisfy fb (S′) will be evaluated multiple times during traversal
of T . For example, it can be observed in Figure 6 (C) that if CE is rejected, then
when branch C is later joined to BGE, the subgraph CE is again contained in BGEC
so a redundant rejection will occur. Another less subtle example is in Figure 3 where
if ACD is not rejected but ABD is, because the CD branch already exists in the β of
AB then ABCD will reject ABD again. In practice this does not appear to happen
often enough to cause CDFS to encounter more redundant rejections than the DFS
method, but it is overhead that we plan to address in future work.
The memory consumption of CDFS also requires consideration. If CDFS will
be used for an enumeration task, it is better suited to tasks of evaluating S that
are small compared to G because the size of T grows as 2^|S|. The potential to use
memory exponential in |S| makes use of CDFS for enumeration of all S in G up to
size |V | infeasible, in which case a low memory approach such as ReverseSearch may
take longer to complete, but is better suited in terms of space requirements.
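The growth can be quantified for the simplest situation, an anchor whose subgraphs are built only from its d direct neighbors. In that case every subset of the neighbors together with the anchor is connected, so the tree has one node per subset. The Python sketch below counts those nodes for a size limit k; the function name local_tree_size is an assumption, and the count covers this base case only, not a general graph.

```python
from math import comb


def local_tree_size(d, k):
    """Number of sets S with |S| <= k formed by the anchor plus any
    subset of its d direct neighbors."""
    return sum(comb(d, i) for i in range(min(k - 1, d) + 1))


unbounded = local_tree_size(10, 11)   # every subset of the 10 neighbors
bounded = local_tree_size(10, 3)      # at most two neighbors per set
```

Capping |S| replaces the full power set by a partial sum of binomial coefficients, which is why a small size limit keeps the memory use reasonable even when the anchor has many neighbors.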
8
Conclusion
We have investigated the problem of reducing the number of subgraphs evaluated
while enumerating all connected induced subgraphs S ⊆ V that satisfy a hereditary
bounding criterion fb . Our proposed method displays a significant decrease in runtime
compared to a classical depth first branch and bound approach. In addition, Theorems 1, 2 and 3 provide proof of correctness that all connected induced subgraphs S
that satisfy fb (S) are enumerated. Finally, our optimization strategy also improves
performance when the search is equivalent to exhaustive enumeration. However, due
to the potential for our method to use space exponential in the maximum size of S
being enumerated, our method is best suited to enumerating all S with |S| ≤ k from G where
k is chosen appropriately for the problem and the available memory.
A
Extended Definitions of Complex Notation
P(nk) : The sequence of nodes along a connected path from the root of T to a tree
node nk. It represents a sequence of n ∈ N such that the first element of P(nk)
is always n1 = r, i.e.,
\[ P(n_k) = \{n_1, n_2, n_3, \ldots, n_k\} \tag{1} \]
We use the subsequence relation to indicate that the path P(nk) is a sub-path
of P(nk+1), i.e., P(nk) ⊏ P(nk+1) means the sequence of nodes defined by P(nk)
matches the sequence of the first k nodes defined by P(nk+1).
Dnk : The neighbor set of nk, defined as the set of vertices adjacent to the vertex
labeling nk, i.e.,
\[ D_{n_k} = \{u \in V : u v_{n_k} \in E\} \tag{2} \]
Cnk : The cull set of nk. If nk = r, the cull set is {vr}. Otherwise the cull set is
the union of {vr} and all vertices adjacent to vertices labeling nodes along the
path P(nk−1), i.e.,
\[ C_{n_k} = \{v_r\} \cup \{u : u v_{n_i} \in E,\ 1 \le i < k\} \tag{3} \]
χnk : The extension set of nk, defined as the set of vertices adjacent to vnk but not
adjacent to any vni with i < k labeling a node of P(nk).
\[ \chi_{n_k} = D_{n_k} \setminus C_{n_k} \tag{4} \]
Γ(n) : The source of a tree node n. Source nodes are added to N by the DEPTH
procedure of Algorithm 1 while the BREADTH procedure only copies nodes
already in N . The source of n is n if n was created by DEPTH. Otherwise, the
source of n is the source of the node that BREADTH copied to create n. We
refer to a node that is not its own source as a clone.
\[
\Gamma(n) =
\begin{cases}
n & \text{if } n \text{ was created by DEPTH} \\
\Gamma(m) & \text{if } n \text{ was cloned by BREADTH from node } m
\end{cases}
\tag{5}
\]
β(nk, nk+1) : The branch list passed from node nk to node nk+1.
Let ni ∈ N be a node in the enumeration tree and let nj ∈ K(ni) be a child of
ni. The branch list β(ni, nj) denotes an ordered list passed by ni to nj by the
enumeration algorithm, defined recursively as follows:
• β(n0, ni) = [nj : vnj ∈ χn0 , vnj ≻ vni] for all ni ∈ K(n0). In other words,
the branch list passed by the root to each of its children contains, in lexicographic order, the nodes labeled with vertices in χn0 that are of greater
lexicographic rank than the vertex labeling the respective child.
• β(nk, nk+1) = [β(nk−1, nk); [nj : vnj ∈ χnk , vnj ≻ vnk+1]] for all nk ∈ N
and nk+1 ∈ K(nk). In other words, the branch list passed to node nk+1
by its parent nk contains the concatenation of the branch list of nk and
the ordered list of nodes labeled with vertices in χnk that are of greater
lexicographic rank than vnk+1.
Equation 6 defines the relationship in a more compact form.
\[
\beta(n_k, n_{k+1}) =
\begin{cases}
[\, n_j : v_{n_j} \in \chi_{n_0},\ v_{n_j} \succ v_{n_{k+1}} \,] & \text{if } k = 0 \\
[\, \beta(n_{k-1}, n_k);\ [\, n_j : v_{n_j} \in \chi_{n_k},\ v_{n_j} \succ v_{n_{k+1}} \,] \,] & \text{if } k > 0
\end{cases}
\tag{6}
\]
B
Supporting Lemmas
The following lemmas support Theorems 1 and 2 in section 5. The general conclusions
of the lemmas used directly in the theorems are as follows:
• Theorem 1
– Lemma 4 shows that each path in T represents a set of vertices that induce
a connected subgraph of G.
– Lemma 5 shows that each path in T represents a unique set of vertices.
– Lemma 7 shows that any connected induced subgraph in G is represented
by a path in T .
• Theorem 2
– Lemma 8 shows that the enumeration order matches the desired order
where all S′ ⊂ S are enumerated before S.
Lemma 1 Let nk+1 ∈ N be a node in the enumeration tree and let nk = p(nk+1 )
be its parent. For all n ∈ β(nk , nk+1 ), there must be a node nj ∈ P(nk+1 ) such that
vn ∈ χnj . In other words, any node in β(nk , nk+1 ) must be labeled by a vertex adjacent
to a vertex labeling a node along P(nk+1 ).
Proof. We prove the lemma by induction on k.
Base case: In the base case, the node of interest is a child of the root node, i.e.,
nk+1 ∈ K(n0 ). In this case, by Equation 6, all vertices in β(n0 , nk+1 ) are in χn0 and
clearly n0 ∈ P(nk+1 ).
Inductive step: Assume that ∀ n ∈ β(nk−1 , nk ), ∃ nj ∈ P(nk ) such that vn ∈ χnj .
Now consider a node n ∈ β(nk , nk+1 ). By Equation 6 at least one of the following
has to be true: (i) vn ∈ χnk or (ii) n ∈ β(nk−1 , nk ). If (i) is true, then the lemma is
proved since nk ∈ P(nk+1 ). If (ii) is true, then by the inductive hypothesis, we know
that there exists nj ∈ P(nk) such that vn ∈ χnj . Since P(nk) ⊏ P(nk+1), we have
nj ∈ P(nk+1), and thus the lemma is proven.
Lemma 2 For any P(nk+1 ) in T , there is at least one node nj ∈ P(nk ) labeled by a
vertex adjacent to vnk+1 in G, i.e., vnj vnk+1 ∈ E.
Proof: We investigate the two possible cases where nk+1 is either created by the
DEPTH procedure or the BREADTH procedure of Algorithm 1.
Case 2.A: nk+1 created by DEPTH
In this case, from the DEPTH procedure, we can see that vnk+1 is either in the
extension set χnk (line 35), or it labels a node in β(nk−1, nk) (line 28). In the first
case (vnk+1 ∈ χnk), we have vnk vnk+1 ∈ E by definition of extension set. In the
second case, since nk+1 ∈ β(nk−1, nk), we know by Lemma 1 that ∃nj ∈ P(nk)
such that vnk+1 ∈ χnj . Therefore, by definition of extension set, it immediately
follows that ∃nj ∈ P(nk) such that vnj vnk+1 ∈ E.
Case 2.B: nk+1 created by BREADTH
In this case, nk+1 is a clone of another node mk+1 ∈ N . Note that cloned nodes
can also be cloned by BREADTH, thus there may be multiple clonings between
mk+1 and nk+1 , but mk+1 is the very first node that is created by DEPTH which
is defined as Γ(nk+1) in Equation 5. Let ℓ be the lowest common ancestor of
nk+1 and mk+1 .
Since mk+1 is created by DEPTH, we know from Case 2.A that there exists
mj ∈ P (mk ) such that vmj vmk+1 ∈ E. Now, since mj ∈ P (mk+1 ), we have two
cases:
• mj ∈ P(ℓ): In this case, since P(ℓ) ⊏ P(nk+1) (because ℓ is along the
path to P(nk+1)) we immediately have mj ∈ P(nk), and thus the lemma is
proven.
• mj ∈ P(mk) − P(ℓ): Since ℓ is the lowest common ancestor of mk+1 and
nk+1, and nk+1 is a clone of mk+1, the sub-path from ℓ to mk+1 is cloned
such that the vertices labeling P(mk+1) − P(ℓ) will label a sub-path of
the path from ℓ to nk+1. Since the set of vertices labeling the nodes on
this cloned sub-path is identical to the set of vertices labeling the nodes
on P(mk+1) − P(ℓ) and mj ∈ P(mk) − P(ℓ), there exists a clone nj of mj
along P(nk) − P(ℓ) labeled by vmj . Therefore because nk+1 is a clone of
mk+1 we know ∃nj along P(nk) s.t. vnj vnk+1 ∈ E.
Lemma 3 For any path node nk ∈ T , no other node on P(nk) can be labeled by vnk .
Stated formally, ∄ ni ∈ P(nk−1) s.t. vni = vnk
Proof: Let ni be a node on P(nk−1). We consider all possible relationships
between ni and nk to show that vni ≠ vnk .
Case 3.A: ni and nk created by DEPTH
In this case, from the DEPTH procedure of Algorithm 1, we know that vnk ∈
χnk−1 at line 35. But since vni ∈ Cnk−1 , we know that vni ∉ χnk−1 by (4). It
immediately follows that vni ≠ vnk .
Case 3.B: ni created by DEPTH, nk created by BREADTH
In this case, nk is a clone of another node nm = Γ(nk ), and thus we have
vnm = vnk . Let nℓ be the lowest common ancestor of nk and nm. There are five
possible relationships between nℓ, nm, and ni.
Case 3.B.1 ni is an ancestor of nℓ.
If ni is an ancestor of nℓ then it is also an ancestor of nm. Thus we can
apply the argument in Case 3.A to conclude that vni ≠ vnm and hence
vni ≠ vnk .
Case 3.B.2 Both ni and nm are children of nℓ.
If ni and nm are children of nℓ, then the nodes are labeled by different
vertices by (4), i.e., vni ≠ vnm . Hence vni ≠ vnk .
Case 3.B.3 ni is a child of nℓ, nm is child of a descendant of nℓ.
If ni is a child of nℓ, vni will be in the cull set of all descendants of nℓ by
(3). By this fact we know that no descendant nm of nℓ can be labeled with
vni . It follows that vni ≠ vnm and hence vni ≠ vnk .
Case 3.B.4 nm is a child of nℓ, ni is child of a descendant of nℓ.
If nm is a child of nℓ, vnm will be in the cull set of all descendants of nℓ by
(3). By this fact we know that no descendant ni of nℓ can be labeled with
vnm . It follows that vni ≠ vnm and hence vni ≠ vnk .
Case 3.B.5 ni and nm are children of descendants of nℓ.
As each branch in β is cloned, nodes labeled with vertices in the extension
set of the node appending the branch are removed by Algorithm 1 at
line 2. When p(ni) is passed the branch containing nm it will copy it by
removing any nodes labeled by vni . Because the pruned branch is passed
to all descendants, we know that if vnm = vni it has been removed and
thus for any nk, vni ≠ vnk .
Case 3.C: ni and nk created by BREADTH.
In this case ni and nk are clones of nodes in a branch cloned by BREADTH.
Let nz be the original node created by DEPTH that is cloned to ni, and let nm
be the descendant of nz that is cloned to nk. Because nz is created by DEPTH,
we can prove that vnz ≠ vnm using Case 3.B, and thus vni ≠ vnk .
Lemma 4 For a given P(nk ), the vertices ψ(P(nk )) induce a connected subgraph of
G.
Proof: This follows directly from Lemma 2 in that any vertex labeling nk of
P(nk ) is adjacent to a vertex labeling at least one other node nj of P(nk ) such that
j < k.
Lemma 5 For any pair of distinct nodes nk, nm ∈ N , ψ(P(nk)) ≠ ψ(P(nm)). In
other words, each path in T represents a unique connected induced subgraph of G.
Proof: Let nℓ be the lowest common ancestor of nk and nm, and let x and y
be the children of nℓ such that x ∈ P(nk) and y ∈ P(nm). Assume without loss of generality
that x is to the left of y.
We will consider all possible relationships between x and y to show that ψ(P(nk)) ≠
ψ(P(nm)).
Case 5.A: Both x and y are created by DEPTH
In this case, vy is in the cull set of x and x is to the left of y so β(nℓ, nx) contains
no nodes labeled by vy . Hence no descendant of x can be labeled by vy , and thus
vy ∉ ψ(P(nk)). But since vy ∈ ψ(P(nm)), it follows that ψ(P(nk)) ≠ ψ(P(nm)).
Case 5.B: x created by BREADTH and y created by DEPTH
In this case, x was passed to nℓ from its parent and by line 2 of Algorithm 1
we know that all nodes in the branch rooted at x labeled by vy were removed.
Hence no descendant of x can be labeled by vy , and thus vy ∉ ψ(P(nk)). But
since vy ∈ ψ(P(nm)), it follows that ψ(P(nk)) ≠ ψ(P(nm)).
Case 5.C: x and y created by BREADTH
We will consider two cases, based on whether nℓ was created by DEPTH or by
BREADTH.
Case 5.C.1: nℓ created by DEPTH
Let w and t denote the “original” (created by DEPTH) nodes that are
cloned to create respectively x and y. So we have vw = vx and vt = vy . In
this case w and t are in β(nℓ−1, nℓ). If p(w) = p(t), the branch rooted at x
cannot contain any nodes labeled by vy by Case 5.A. Otherwise, we know
w is discovered before t because it is to the left of t in T . It follows from
line 2 of Algorithm 1 that any node labeled by vt = vy is removed from
the clone of branch w when t is discovered. Thus no descendant of x can
be labeled by vy .
Case 5.C.2: nℓ created by BREADTH
Let nz be the “original” node (created by DEPTH) that is cloned to create nℓ. Let ni and nj denote the descendants of nz that are cloned to
respectively create nk and nm.
Using Case 5.A - Case 5.C.1 we can prove that ψ(P(ni)) ≠ ψ(P(nj)). As
the paths P(ni) and P(nk) only differ by their respective prefixes P(nz) and
P(nℓ), we can establish the relation P(nk) − P(nℓ) = P(ni) − P(nz), and
hence ψ(P(nk)) = (ψ(P(ni)) \ ψ(P(nz))) ∪ ψ(P(nℓ)). Similarly, we have
ψ(P(nm)) = (ψ(P(nj)) \ ψ(P(nz))) ∪ ψ(P(nℓ)). Therefore, ψ(P(ni)) ≠
ψ(P(nj)) implies ψ(P(nk)) ≠ ψ(P(nm)).
Lemma 6 Let P(nk) be a path representing S \ v, where v was selected from S by
Algorithm 2. Then for the source node m = Γ(nk), either v ∈ χm or v labels a node in
β(p(m), m). In other words, Algorithm 2 selects a vertex that labels a child of nk.
Proof. In Algorithm 2, v will be the last vertex removed from θ, so the vertex
labeling nk must be removed before v. Because we prove the lemma on Γ(nk), we are
proving that when vnk is discovered it has the ability to add a child labeled by v.
We investigate three possible relationships between vnk and v. Cases 6.B and 6.C use
the fact that the vertices in θ in Algorithm 2 are sorted first by the order in which
they are discovered (lines 4 and 14 of Algorithm 2), and second by reverse
lexicographic order (lines 5 and 16).
Case 6.A: v adjacent to vnk
In this case v was added to θ by the iteration that removed vnk, and we know v
is adjacent to vnk and not adjacent to any vertex preceding vnk in P by line 18
of Algorithm 2. This is the definition of extension set (4); therefore v ∈ χΓ(nk).
Case 6.B: v adjacent to the same vertex that discovered vnk
If v is discovered at the same time as vnk, then both vertices are in
the extension set of the vertex that discovered them. Furthermore, we know
that vnk ≺ v because v is the last removed from θ, where the vertices are sorted
in reverse lexicographic order. Because vnk ≺ v, the definition of branch list (6)
implies that Γ(nk) receives a branch list from its parent containing a
node labeled by v.
Case 6.C: v adjacent to an ancestor of the vertex that discovered vnk
If v is discovered before vnk, then v is the last removed from θ by the order of
discovery. Because v was discovered before vnk, the definition of branch list (6)
implies that Γ(nk) receives a branch list from its parent containing a node
labeled by v.
Because v must be adjacent to at least one vertex in S \ v and vnk labels the last
node along P(nk), no other relationships exist between vnk and v, and the lemma is
proven.
Lemma 7 For any S ⊆ V such that v0 ∈ S and S induces a connected subgraph of
G, there exists a path in T that represents S.
Proof. We prove the lemma by induction on |S|.
Base case: In the base case |S| = 1, we know v0 ∈ S so S is represented by the
root of T .
Inductive step: Assume that for any S such that |S| ≤ k, v0 ∈ S, and S induces a
connected subgraph in G, there exists a path in T that represents S.
Consider any set S such that |S| = k + 1, v0 ∈ S, and S induces a connected subgraph in
G. We will show that S is represented by a path in T.
Let g denote the subgraph of G that is induced by S. Let v be the vertex in S
that is selected by Algorithm 2 (where Algorithm 2 selects the v ∈ S that labels the
lowest node in T among all vertices in S). Now define S′ = S \ v, and let g′ be the
subgraph of g induced by S′. By its definition, v is a leaf in the search tree that
results from running Algorithm 2 on g. Thus its removal leaves the tree connected.
Since this remaining tree is a subgraph of g that contains all vertices in S′, we can
conclude that g′ is connected, i.e., S′ induces a connected subgraph in G. Therefore,
by the inductive hypothesis, we know that S′, with |S′| = k, is represented by a path
P(nk) in T. Furthermore, we know that v is adjacent to a vertex labeling a node along
P(nk) since g is connected. By the inductive hypothesis that P(nk) exists in T
representing S \ v and the fact that v is selected by Algorithm 2, we know P(nk+1)
representing S exists in T by one of the following two cases:
Case 7.A: nk = Γ(nk)
If nk is created by DEPTH, we know by Lemma 6 that either v ∈ χnk or v labels
a node in β(nk−1, nk). If v ∈ χnk, then a path P(nk+1) representing S exists by
extending P(nk) with a node labeled by v at line 34 of Algorithm 1. If v labels
a node in β(nk−1, nk), then a path P(nk+1) representing S exists by extending
path P(nk) with a node labeled by v at line 27 of Algorithm 1.
Case 7.B: nk ≠ Γ(nk)
If nk is created by BREADTH, we can use Case 7.A to prove that Γ(nk), created
by DEPTH, has a child labeled by v. Because BREADTH clones Γ(nk) and all of its
descendants, the node nk also has a child labeled by v. Therefore, a path
P(nk+1) exists representing S.
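The inductive step can be traced on a small example. The sketch below is an illustration under stated assumptions, not the thesis implementation: `adj` is a hypothetical adjacency dictionary, and `last_visited` is a plain stack-based traversal standing in for Algorithm 2. Repeatedly removing the last-visited vertex peels S down to {v0}, and every intermediate set should remain connected, which is the chain of subsets the induction relies on.

```python
def last_visited(S, root, adj):
    """Vertex of S reached last by a stack-based search from root.

    Assumption: each vertex is pushed once, by the first vertex that
    discovers it, and neighbors are pushed in reverse sorted order.
    """
    order, seen, stack = [], {root}, [root]
    while stack:
        w = stack.pop()
        order.append(w)
        for x in sorted(adj[w], reverse=True):
            if x in S and x not in seen:
                seen.add(x)
                stack.append(x)
    return order[-1]


def is_connected(S, adj):
    """True if the subgraph induced by the non-empty set S is connected."""
    S = set(S)
    start = next(iter(S))
    seen, stack = {start}, [start]
    while stack:
        w = stack.pop()
        for x in adj[w]:
            if x in S and x not in seen:
                seen.add(x)
                stack.append(x)
    return seen == S


def peel(S, root, adj):
    """Chain of sets from S down to {root}, removing the last-visited vertex each time."""
    S = set(S)
    chain = [frozenset(S)]
    while len(S) > 1:
        S.remove(last_visited(S, root, adj))
        chain.append(frozenset(S))
    return chain


# Hypothetical example graph with edges a-b, a-c, b-d.
adj = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
chain = peel({"a", "b", "c", "d"}, "a", adj)
```

On this example the chain is {a, b, c, d}, {a, b, d}, {a, b}, {a}. The last-visited vertex pushes no new vertex onto the stack, so it has no children in the search tree, which is why its removal cannot disconnect the remaining set.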
Lemma 8 No S′ such that S′ ⊂ S is generated after S.
Proof. We prove the lemma by contradiction.
Assume that an S′ ⊂ S is generated after S, that S′ is represented by P(nk), and
that S is represented by P(mk). We can conclude that P(nk) ⊏̸ P(mk), because
if P(nk) ⊏ P(mk), then S′ would be generated before S. Therefore, the paths must
diverge at a common node p, where the search follows mj on P(mk) before nj on P(nk).
P(mk) must contain a node labeled by vnj in order for S to contain S′ as a subset. If nj
and mj are both created by DEPTH at line 23 of Algorithm 1, then the ordering of
siblings prohibits mj from containing a descendant labeled by vnj, and thus we have
a contradiction. If mj is created by BREADTH at line 11 of Algorithm 1 and nj
is created by DEPTH, then any descendants of mj labeled by vnj would be pruned
during the BREADTH procedure because vnj ∈ χp (line 2 of Algorithm 1), and thus
we have a contradiction. If both nj and mj are created by BREADTH, the same
contradiction can be raised by searching backward from p to r along P(p) for the
node where Γ(nj) is added to β.
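The ordering property of Lemma 8 is easy to test on the output of any enumerator. The following sketch is a hypothetical test helper and does not implement Algorithm 1: the name `first_violation` and the hand-written sequences are assumptions made for illustration only. It scans a list of vertex sets in generation order and reports the first pair in which a proper subset appears after one of its supersets.

```python
def first_violation(generated):
    """Return (j, i) with j < i and generated[i] a proper subset of generated[j].

    Returns None when no set in the sequence is generated after one of its
    proper supersets, i.e. when the sequence satisfies the Lemma 8 ordering.
    """
    earlier_sets = []
    for i, current in enumerate(generated):
        current = frozenset(current)
        for j, earlier in enumerate(earlier_sets):
            if current < earlier:  # proper-subset test on frozensets
                return (j, i)
        earlier_sets.append(current)
    return None


# Hand-written sequences, not produced by Algorithm 1.
respects_order = [{"a"}, {"a", "b"}, {"a", "b", "d"}, {"a", "c"}]
breaks_order = [{"a", "b"}, {"a"}]
```

The check is quadratic in the number of generated sets, so it is only suitable for exhaustive testing on small synthetic graphs.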
Algorithm 2 Given a set S of size k + 1, selects the vertex v to remove from S such
that v will label the lowest node in T among all v ∈ S
 1: procedure CHOOSE(S, r)
 2:     θ ← [], P ← []
 3:     Add all elements of S to P
 4:     Remove r from P and insert at P[0]
 5:     X ← −SORT(x s.t. rx ∈ E)                    ▷ Sort B ≺ A
 6:     for all x ∈ X do
 7:         if x ∈ P then
 8:             push(θ, x)
 9:         end if
10:     end for
11:
12:     t ← 1
13:     while |θ| ≠ 0 do
14:         w ← pop(θ)
15:         Remove w from P, and insert at P[t]
16:         X ← −SORT(x s.t. wx ∈ E)
17:         for all x ∈ X do
18:             if x ∉ θ AND x ∈ P[t..end] then
19:                 push(θ, x)
20:             end if
21:         end for
22:         t ← t + 1
23:     end while
24:     return P[end]
25: end procedure
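A minimal Python sketch of CHOOSE may help when tracing the proofs of Lemmas 6 and 7 by hand. It rests on several assumptions that are not stated in the listing: the graph is given as a dictionary `adj` mapping each vertex to the set of its neighbors, vertex labels are comparable, G has no self-loops, θ behaves as a LIFO stack, and S is connected and contains r. The names `choose` and `adj` are illustrative.

```python
def choose(S, r, adj):
    """Return the vertex of S that the search from r visits last.

    Comments refer to the line numbers of Algorithm 2.
    """
    # Lines 2-4: P holds all of S with r moved to the front.
    P = [r] + sorted(v for v in S if v != r)
    theta = []

    # Lines 5-10: push the neighbors of r in reverse sorted order,
    # so the smallest label is popped first.
    for x in sorted(adj[r], reverse=True):
        if x in P:
            theta.append(x)

    t = 1                                    # line 12
    while theta:                             # line 13
        w = theta.pop()                      # line 14
        P.remove(w)                          # line 15: move w to position t
        P.insert(t, w)
        for x in sorted(adj[w], reverse=True):      # lines 16-17
            # Line 18: x is not already waiting in theta and not yet placed.
            if x not in theta and x in P[t:]:
                theta.append(x)              # line 19
        t += 1                               # line 22
    return P[-1]                             # line 24


# Hypothetical example graph with edges a-b, a-c, b-d.
adj = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
selected = choose({"a", "b", "c", "d"}, "a", adj)
```

On this example the search from a places the vertices in the order a, b, d, c, so c is selected, and removing c leaves {a, b, d} connected.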
References
[1] Foto Afrati, Dimitris Fotakis, and Jeffrey Ullman. Enumerating subgraph instances using map-reduce. arXiv, Nov 2012. 1208.0615v2.
[2] David Avis and Komei Fukuda. Reverse search for enumeration. Discrete Applied
Mathematics, 1993.
[3] Béla Bollobás. Hereditary properties of graphs: asymptotic enumeration, global
structure, and colouring. Documenta Mathematica, pages 333–342, 1998.
[4] Coen Bron and Joep Kerbosch. Finding all cliques of an undirected graph.
Communications of the ACM, 16(9):575–577, September 1973.
[5] Salim Chowdhury and Mehmet Koyutürk. Identification of coordinately dysregulated subnetworks in complex phenotypes. In Pacific Symposium on Biocomputing, pages 133–144, 2010.
[6] Salim Chowdhury, Rod Nibbe, Mark Chance, and Mehmet Koyutürk. Subnetwork state functions define dysregulated subnetworks in cancer. 18(3):263–281, 2011.
[7] Han-Yu Chuang, Eunjung Lee, Yu-Tsueng Liu, Doheon Lee, and Trey Ideker.
Network-based classification of breast cancer metastasis. October 2007.
[8] Sarah Cohen, Benny Kimelfeld, and Yehoshua Sagiv. Generating all maximal induced subgraphs for hereditary and connected-hereditary graph properties. Journal of Computer and System Sciences, 74:1147–1159, June 2008.
[9] Luigi Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. A (sub)graph
isomorphism algorithm for matching large graphs. 10:1367–1372, October 2004.
[10] Phuong Dao, Kendric Wang, Colin Collins, Martin Ester, Anna Lapuk, and
S. Cenk Sahinalp. Optimally discriminative subnetwork markers predict response to chemotherapy. July 2011.
[11] Sofie Demeyer, Tom Michoel, Jan Fostier, Pieter Audenaert, Mario Pickavet, and
Piet Demeester. The index-based subgraph matching algorithm (ISMA): Fast
subgraph enumeration in large networks using optimized search trees. PLoS
ONE, 8(4):e61183, April 2013. doi:10.1371/journal.pone.0061183.
[12] David Eppstein. All maximal independent sets and dynamic dominance for sparse
graphs. arXiv, July 2004. http://arxiv.org/abs/cs/0407036v1.
[13] Chatr-Aryamontri et al. The BioGRID interaction database: 2013 update. 41:816–823, Jan 2013.
[14] Lorente M et al. Stimulation of the midkine/ALK axis renders glioma cells resistant
to cannabinoid antitumoral action. January 2011.
[15] Prasad T. S. K. et al. Human Protein Reference Database: 2009 update. 37:767–772, 2009.
[16] RG Verhaak et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and
NF1. January 2010.
[17] Jason Flannick, Antal Novak, Balaji Srinivasan, Harley McAdams, and Serafim
Batzoglou. Græmlin: General and robust alignment of multiple large interaction
networks. September 2006.
[18] National Center for Biotechnology Information. NCBI Gene, October 2013.
http://www.ncbi.nlm.nih.gov/gene.
[19] John Hopcroft and Robert Tarjan. Efficient algorithms for graph manipulation.
Communications of the ACM, 16(6), 1973.
[20] Jun Huan, Wei Wang, and Jan Prins. Efficient mining of frequent subgraphs
in the presence of isomorphism. In Proceedings of the Third IEEE International
Conference on Data Mining. IEEE, 2003.
[21] David Johnson and Mihalis Yannakakis. On generating all maximal independent
sets. Information Processing Letters, 27:119–123, March 1988.
[22] Maxim Kalaev, Mike Smoot, Trey Ideker, and Roded Sharan. NetworkBLAST:
comparative analysis of protein networks. January 2008.
[23] Donald Knuth. The Art of Computer Programming, volume 4A of Combinatorial
Algorithms Part 1. Addison-Wesley, 2012.
[24] Bin Kong, Tao Yang, Lin Chen, Yong-qin Kuang, Jian-wen Gu, Xun Xia,
Lin Cheng, and Jun-hai Zhang. Protein–protein interaction network analysis and
gene set enrichment analysis in epilepsy patients with brain cancer. November
2013.
[25] Mehmet Koyutürk, Yohan Kim, Shankar Subramaniam, Wojciech Szpankowski,
and Ananth Grama. Detecting conserved interaction patterns in biological networks. October 2006.
[26] Donald Kreher and Douglas Stinson. Combinatorial Algorithms. CRC Press,
1999.
[27] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. 2007.
[28] Ambros Marzetta. ZRAM: A Library of Parallel Search Algorithms and Its Use
in Enumeration and Combinatorial Optimization. PhD thesis, Swiss Federal
Institute of Technology Zurich, 1998.
[29] Rod Nibbe, Sanford Markowitz, Lois Myeroff, Rob Ewing, and Mark Chance.
Discovery and scoring of protein interaction subnetworks discriminative of late
stage human colon cancer. 8(4):827–845, April 2009.
[30] Vishal Patel, Giridharan Gokulrangan, Salim Chowdhury, Yanwen Chen, Andrew Sloan, Mehmet Koyutürk, Jill Barnholtz-Sloan, and Mark Chance. Network
signatures of survival in glioblastoma multiforme. 9, September 2013.
[31] Ron Rymon. Search through systematic set enumeration. Technical report,
University of Pennsylvania, August 1992.
[32] J. R. Ullmann. An algorithm for subgraph isomorphism. 23:31–42, January 1976.
[33] Bin Wu, Shengqi Yang, Haizhou Zhao, and Bai Wang. A distributed algorithm
to enumerate all maximal cliques in mapreduce. In International Conference on
Frontier of Computer Science and Technology, 2009.
[34] Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining.
In Proc. 2002 of Int. Conf. on Data Mining (ICDM’02), 2002.