On Tools for Network Motif Finding

On Network Tools for Network Motif Finding: A
Survey Study
Elisabeth A. Wong1,2, Brittany Baur1,3
1 2010 NSF Bio-Grid REU Research Fellows at Univ of Connecticut
2 Bowdoin College
3 Manhattanville College
Abstract. Network motifs have been called the building blocks of networks [1]. Graph theory
is used to computationally represent and search networks. Many efforts have been put into
developing motif discovery tools to search for and find network motifs, patterns or subgraphs
within the input network that occur more frequently in the input network than in randomized
networks where patterns occur by chance [2]. Complications involved with network motif
discovery include the graph isomorphism problem, for which no polynomial-time algorithm is known. A myriad of tools
and algorithms have been developed for both full enumeration of subgraphs and methods for
avoiding full enumeration in order to lessen runtimes and required computational power.
Experimental data from various tools is provided in this paper including (1) runtimes for
different subgraph sizes, network sizes, and number of random networks generated, (2)
differences in frequencies based on different search restrictions, and (3) protein-protein
interaction (PPI) network results. The limitations that still exist especially concerning size of
motifs and networks that can be searched are also included. This paper presents a survey study
of current network motif discovery tools; algorithms, experimental data, limitations, and pros
and cons of tools are examined and discussed.
Keywords: network, motif, algorithm, isomorphism
1 Introduction
Networks are integral parts of many real systems and thus it has become a priority
in many research fields to analyze them. Emphasis has been placed on the importance
of studying small aspects of networks in order to gain a better understanding of the
entire network. Recently graph theory has been used to allow for computational
analysis of networks. Through graph theory, it has been found that numerous
networks contain network motifs, small sub-graphs that appear more frequently than
expected in randomized networks [1]. Because these motifs are statistically
significant, it has been hypothesized that they are also significant to their respective
networks and systems; higher frequencies of subgraphs than result by chance suggest
that the motifs are present due to factors such as being conserved evolutionarily and
having an important function or purpose [1]. Each network has different motifs that
are more frequent and thus more important to the system or organism that they are in.
For example, gene regulation transcriptional networks and neuronal connectivity
networks have been found to have motifs known as ‘feed forward loops’ [2] and
‘bifans’ [2]. This suggests that these two networks are similar in some design aspects.
The feed-forward loop is thought to be used in information processes where its design
helps with controlling connections and signals [1]. In contrast, food web networks,
which do not deal with information processing, have motifs distinct from those of networks such
as neuronal connectivity and gene regulation. This demonstrates how motifs are
biologically significant in their ability to help analyze, explain, and classify networks.
Due to the significance of network motifs many efforts have been put forth into
developing tools that can detect network motifs.
In order for graph theory to be applied to the study of networks and motifs,
networks need to be represented by graphs. Each entity in a network (e.g. a protein, a
gene, a person) is represented by a node (or vertex) while the connections between the
entities (e.g. an interaction, a regulatory signal, a correspondence) are represented by
edges. In some networks the nodes and edges have different characteristics (e.g.
different types of genes or different signals passed from gene to gene). In these cases
the nodes or edges are ‘colored’ with each color representing a different kind of entity
or connection. Another aspect of a network that must be considered and
included in the graph representation is whether a connection or edge is
‘directed’ or ‘undirected’. Gene regulatory networks have directed connections as it is
important that the signal travels from one specific gene to another. On the other hand,
a network such as a social network that counts handshaking as a connection is not
directed because the shaking of hands is not direction specific. ‘Undirected’ edges
versus ‘directed’ edges can be differentiated on a graph with different edge colors and
node colors or by arrows. A graph can be split up into small graphs known as
subgraphs. As stated before, statistically overrepresented subgraphs are defined as
network motifs. The number of edges that enter and exit a node is summed to
determine the node’s ‘degree’ and the number of nodes that make up a motif
determines the subgraph’s size.
Developing network motif discovery tools that are both efficient and able to find
motifs of all sizes (not just small motifs) has proved very difficult. One of the
larger obstacles in finding an efficient and thorough algorithm is the graph
isomorphism problem. This problem entails determining whether there is a bijection between
the nodes of two graphs such that two nodes are adjacent in one graph exactly when their
corresponding nodes are adjacent in the other [3]. Two isomorphic graphs have the same number of nodes and
edges, and corresponding nodes have the same degree. The graph
isomorphism problem is computationally hard: it is in NP and no polynomial-time algorithm
is known for it, although it has not been shown to be NP-complete (the related subgraph
isomorphism problem is NP-complete). The NAUTY algorithm is a well known and powerful algorithm that has
been developed to test for graph isomorphism. It is used by multiple motif discovery
tools. The NAUTY algorithm utilizes canonical labeling in order to tell which graphs
are isomorphic to each other [3]. If two graphs are isomorphic, then their canonical labels
are the same. A canonical label for a graph is formed by taking the
adjacency matrix of the graph and concatenating it row by row in order to form a binary
number. Because this number depends on the ordering of the vertices, NAUTY refines
partitions of the vertex set down to leaf partitions (partitions consisting only of singleton
sets), each of which fixes one vertex ordering. Automorphisms are found when different
orderings of the vertices give the same adjacency matrix. NAUTY then uses the
automorphisms to prune its search and computes a canonical label, which is the largest or smallest
possible concatenated adjacency matrix [4].
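The idea of a canonical label can be illustrated with a brute-force sketch. This is not the NAUTY algorithm: it simply tries every vertex ordering, so it is only usable for very small subgraphs, and the function names `canonical_label` and `are_isomorphic` are hypothetical names chosen for this illustration. It assumes vertices are numbered 0 to n-1 and takes the largest row-by-row concatenation as the label.

```python
from itertools import permutations

def canonical_label(n, edges, directed=False):
    """Brute-force canonical label: the largest row-by-row concatenation
    of the adjacency matrix over all vertex orderings (illustration only)."""
    adj = set()
    for u, v in edges:
        adj.add((u, v))
        if not directed:
            adj.add((v, u))
    best = None
    for order in permutations(range(n)):
        # Entry (i, j) of the reordered matrix is the edge order[i] -> order[j].
        bits = "".join(
            "1" if (order[i], order[j]) in adj else "0"
            for i in range(n)
            for j in range(n)
        )
        if best is None or bits > best:
            best = bits
    return best

def are_isomorphic(n, edges_a, edges_b, directed=False):
    """Two graphs on n vertices are isomorphic iff their labels match."""
    return canonical_label(n, edges_a, directed) == canonical_label(n, edges_b, directed)
```

Trying all n! orderings is exactly the cost that NAUTY avoids through partition refinement and pruning with automorphisms.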
Additional difficulties in developing a network motif discovery algorithm include
the fact that the number of network motifs exponentially increases with increases in
network size and that there is an absence of the downward closure property in many
networks [5]. These difficulties make it so that full enumeration of subgraphs can be
extremely time consuming and may require large amounts of computational power.
In order to study a network in context, randomized networks are used for
comparison. These randomized networks are developed in such a way that their
structure is random and thus not a result of any constraints or significant design
elements [6]. This allows for aspects of the network in question (such as motifs) to be
compared to the randomized networks to see whether they are a result of the intrinsic
properties of the network or if they are indicative of real world functional constraints
and/or design principles due to selection [2].
Varying parameters are used to describe the network motif occurrences and to
determine whether they are statistically significant. The frequency of a motif is the
number of times the motif appears in a network [7]. Different tools use different
restrictions for counting frequencies based on whether or not overlapping of nodes
and edges is allowed. Some motif discovery tools ask for a user input that sets a
threshold of how many motif occurrences are required for a motif to be considered
‘frequent’. To determine whether a motif is significant in a specific network and not
just present due to intrinsic properties of the network, a uniqueness factor is
sometimes applied [7]. If the motif has a higher frequency in the network in question
than in a certain number of random networks (threshold set by the user) then the motif
is considered to be ‘unique’. In addition, statistics such as the z-score and p-value
are often used to determine whether the frequency of a motif is statistically
significant. The z-score is calculated as the difference between the frequency of the
motif in the specific target network and the mean frequency of the motif in the
randomized networks, divided by the standard deviation of the frequency in the
randomized networks [3]. A higher z-score corresponds to a more
overrepresented motif. The z-score threshold over which the motifs are considered
overrepresented is often 2. The p-value is the probability that the
number of times a motif appears in a randomized network is equal to or greater than
the number of times the motif is present in the network in question [2]. The lower the
p-value, the more significant the motif. The threshold under which the p-value
must fall to be considered significant is commonly 0.01. All of these parameters are
important for setting standards to help distinguish between which subgraphs are
overrepresented and which are not.
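The two statistics described above can be sketched in a few lines. This is a minimal illustration, not the code of any surveyed tool: the function names are hypothetical, the p-value is estimated empirically as the fraction of randomized networks whose frequency is at least that of the input network, and the use of the sample standard deviation is an assumption (the tools may use the population form). At least two randomized networks are required.

```python
from statistics import mean, stdev

def motif_z_score(freq_input, freqs_random):
    """(input frequency - mean random frequency) / std of random frequencies."""
    mu = mean(freqs_random)
    sigma = stdev(freqs_random)  # assumption: sample standard deviation
    if sigma == 0:
        # All randomized networks agree; the z-score is undefined, so
        # report infinity only if the input frequency is strictly larger.
        return float("inf") if freq_input > mu else 0.0
    return (freq_input - mu) / sigma

def motif_p_value(freq_input, freqs_random):
    """Fraction of randomized networks with frequency >= the input frequency."""
    hits = sum(1 for f in freqs_random if f >= freq_input)
    return hits / len(freqs_random)
```

With the thresholds mentioned above, a subgraph would be reported as a motif when `motif_z_score(...) > 2` or `motif_p_value(...) < 0.01`.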
Multiple algorithms and tools have been developed, each with different advantages
and disadvantages, to identify network motifs. Network motif discovery is a crucial
problem to solve in order to gain further insights into the important characteristics,
functions, and inner workings of systems with networks. Therefore it has been the
goal of many researchers to develop ways to efficiently identify network motifs. It is
our goal in this paper to summarize, collect experimental data about, and analyze the
various network motif discovery tools and algorithms that have been created.
2 Methodology
Major aspects of motif discovery tools that must be considered when
examining these tools are the methods of determining frequencies of motifs, the
ways of developing randomized networks, the algorithms used for full
enumeration, the strategies of identifying motifs without full enumeration, and
the data sets the tools can be applied to. All of these factors must be considered
when developing a motif discovery tool and each tool uses different variations
and combinations of all of these factors.
I) Restrictions for Determining Frequencies
An important aspect of each tool and algorithm is the method of determining motif
frequencies. Frequency refers to the number of matches of a motif in a network [8].
Different methods for determining motif frequency depend on restrictions of how
network elements are shared [2]. Different methods lead to different frequency results.
There are three types of frequency concepts: (1) F1, (2) F2, and (3) F3. F1 allows
overlapping of nodes and edges arbitrarily. Only node overlapping is allowed in
concept F2. F3 does not allow for any overlapping of nodes or edges [3]. The method
of determining motif frequencies is very important because motif frequency is used in
the calculation of statistical elements such as z-score and p-value [7]. Numerous tools
use these statistical parameters to indicate whether or not motifs are statistically
significant. It is important to note which frequency concept is used by which tool.
The different restrictions upheld by each concept cause the frequencies calculated by
the different concepts to be significantly different [2]. Sometimes it is also important
to use tools with certain frequency concepts for specific networks. In some networks
the overlapping of edges and nodes may be an important aspect of motifs whereas
sometimes it might be only relevant to find motifs that do not overlap at all. Thus,
paying attention to the frequency concepts is very important when using and
designing motif discovery tools.
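The effect of the three frequency concepts can be seen on a small example. The sketch below assumes that each occurrence is given as a pair (set of nodes, set of edges), and all function names are hypothetical. F1 is an exact count, whereas the F2 and F3 values are obtained with a simple greedy selection, which only gives a lower bound on the largest set of non-overlapping occurrences; this greedy choice is a simplification made here, not the method of any particular tool.

```python
def frequency_f1(occurrences):
    """F1: every occurrence counts, nodes and edges may overlap freely."""
    return len(occurrences)

def _greedy_disjoint(occurrences, key):
    """Greedily keep occurrences whose key items are not yet used."""
    used = set()
    count = 0
    for occ in occurrences:
        items = key(occ)
        if used.isdisjoint(items):
            used.update(items)
            count += 1
    return count

def frequency_f2(occurrences):
    """F2 (greedy lower bound): nodes may be shared, edges may not."""
    return _greedy_disjoint(occurrences, lambda occ: occ[1])

def frequency_f3(occurrences):
    """F3 (greedy lower bound): neither nodes nor edges may be shared."""
    return _greedy_disjoint(occurrences, lambda occ: occ[0])

def path_occurrence(a, center, b):
    """A 3-node path a - center - b as (nodes, edges)."""
    edges = {(min(a, center), max(a, center)), (min(b, center), max(b, center))}
    return frozenset({a, center, b}), frozenset(edges)

# All 3-node paths in a star with center 0 and leaves 1..4.
star_occurrences = [
    path_occurrence(a, 0, b) for a in range(1, 5) for b in range(a + 1, 5)
]
```

For this star network the six paths all pass through the center, so the three concepts give clearly different frequencies for the same pattern.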
II) Random network generation
As mentioned previously, random networks are essential in network motif
discovery because they are needed for comparison with the input network. Subgraph
occurrences in the input network are compared to those in the random networks to see
if differences are present which would indicate a significant motif. Multiple methods
are used to generate randomized networks. Common randomization techniques
include the switching method, the stubs method, and the “go with the winners”
algorithm [6].
(1) The switching method implements the Markov chain method. It
    involves using the nodes of the input network, preserving their
    in-degree and out-degree, and switching the edges between the
    nodes numerous times to obtain randomization. The drawback to
    the switching method is that the time required for proper mixing is
    not known for the Markov chain method. [9]
(2) The stubs method keeps the same in and out degrees of the nodes of
    the input network. Each node has ‘stubs’ that are ‘in-stubs’ (one for
    each incoming edge of the node) and ‘out-stubs’ (one for each outgoing
    edge of the node). A matching algorithm is used to put each in-stub in a
    pair with an out-stub. Theoretically this creates random edges
    between nodes while still preserving the in and out degrees of all
    nodes. The method discards any self edges or multiple edges. This
    becomes a problem because numerous real world networks have
    nodes with degrees such that there will most likely be more than
    one edge between two nodes. [9]
(3) The “go with the winners” algorithm starts with multiple graphs. It
    then carries out the stubs method. To compensate for the graphs that
    are eliminated (due to self or multiple edges) the algorithm
    periodically copies all of its graphs which results in the number of
    graphs being constant on average. Once all stubs have been linked
    the process stops and a random network is chosen from all the
    remaining graphs. This algorithm can be very slow, especially with
    large scale networks. [9]
The switching algorithm has been found to be the ideal method for random graph
generation and is often used in network motif discovery tools.
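A minimal sketch of the switching method for a directed network is given below. It assumes the input is a list of distinct (source, target) edges without self edges; the function name and its parameters (`num_switches`, `max_attempts`) are hypothetical, and the number of switches needed for proper mixing is left to the caller since, as noted above, it is not known in general.

```python
import random

def switch_randomize(edges, num_switches, rng=None, max_attempts=None):
    """Randomize a directed edge list by repeatedly swapping the targets of
    two edges: (a->b, c->d) becomes (a->d, c->b). Every node keeps its
    in-degree and out-degree; swaps creating self or multiple edges are rejected."""
    rng = rng or random.Random()
    edge_list = list(edges)
    edge_set = set(edge_list)
    if len(edge_list) < 2:
        return edge_list
    limit = max_attempts if max_attempts is not None else 100 * num_switches
    done = attempts = 0
    while done < num_switches and attempts < limit:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if a == d or c == b:
            continue  # would create a self edge
        if (a, d) in edge_set or (c, b) in edge_set:
            continue  # would create a multiple edge (or change nothing)
        edge_set.difference_update({(a, b), (c, d)})
        edge_set.update({(a, d), (c, b)})
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list
```

Rejecting invalid swaps rather than discarding the whole graph is what distinguishes this approach from the stubs method described in (2).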
III) Classification of tools based on algorithms:
Network centric tools require that the entire network and all subgraphs have to be
enumerated. On the other hand, non-network centric tools (motif centric tools) allow
for a single specific motif to be examined [10]. Major network discovery tools have
been classified into these two groups and segregated further within each group based
on aspects of their algorithms.
NETWORK CENTRIC ALGORITHMS:
Algorithms that use trees:
NeMoFINDER
NeMoFINDER is a motif discovery algorithm used specifically to find motifs in
PPI networks [5]. This tool uses trees to partition the network in question. It uses
concept F1. By allowing for arbitrary node and edge overlap it ensures uniqueness
and is not downward closed. The required inputs for the algorithm include maximum
motif size, number of randomized networks, the target PPI network, frequency
threshold, and uniqueness [7]. NeMoFINDER generates randomized networks via the
switching method [5] and the Apriori algorithm is used for subgraph frequency
determination [8].
The algorithm can be divided into three main steps [5]. The first step entails
finding all occurrences of a 2 sized tree and subsequently larger sized trees up until all
size trees from 2 to k have been found. This ensures all of the repeated subgraphs
have been found. If the number of k sized trees is larger than a user given frequency
threshold then the subgraph in question is considered statistically significant and is
designated as a motif. Step 2 involves the size k trees being used to partition the
graph. Thus each section of the graph contains trees of size 2 through k. In step 3, for
each size k tree a subgraph is generated with k-1 edges and k nodes. A new set of
subgraphs is then generated by combining each k-1 edge subgraph with a size k tree
resulting in subgraphs with k edges. This new set contains subgraphs that are all
candidates for being a motif. The number of occurrences of each candidate subgraph
is found in the partition of the network by the k sized trees. If the occurrence is more
than a given threshold then the subgraph is added to a set of repeated subgraphs.
These subgraphs are then combined with novel generated subgraphs to find k+1 sized
subgraphs. This process continues until all repeated subgraphs of size 2 through k are
detected. Because the network is partitioned by trees the algorithm is consequently
scalable. [5]
NeMoFINDER also uses the concept of graph cousins to generate possible motif
candidates [5]. However, graph cousin generation can be ambiguous and symmetry
breaking is not used in the NeMoFINDER algorithm resulting in the discovery of
redundant subgraphs [8].
Performance studies have been carried out on NeMoFINDER. This was done by
ranking PPI network motifs of different sizes by frequency, uniqueness, and
individual motif size. Motif strengths were generated and scored from these
parameters. The scores were compared by function homogeneity, localization
coherence, and gene expression correlation. Reliability of each motif was determined
using this scoring method. [5]
Kavosh
Kavosh is a network motif discovery tool that uses trees to enable the detection of
motifs. It can handle both directed and undirected networks [3]. There are four main
parts of the Kavosh algorithm: (1) enumeration, (2) classification, (3) random graph
generation, and (4) motif identification [3]. Enumeration looks at the network in
question and finds all subgraphs of given sizes (also performed on random graphs).
This is achieved by selecting one node and all the combinations of connections with
the neighboring nodes via tree representation. The first level of the tree is the selected
node, the second level consists of the neighbors of this node, the third level of the tree
is made up of the neighbors of the previous neighbors, and so on. If a k sized graph is
being searched for, all compositions of the integer k-1 are found, each specifying how
many nodes are selected from each level below the root. The ‘revolving door
algorithm’ is used to go through all of the nodes at each level ascending from the
bottom level and labeling each node as ‘visited’. This ensures that no tree or subgraph
is constructed more than once. The algorithm finds all of the combinations of the
nodes including subgraphs with nodes in the same level (i.e. a subgraph size 3 can be
made up of an initial node and two neighbors or an initial node, a neighbor, and a
neighbor of a neighbor). After these motifs are found, the node is removed and a new
node is used. This process is also carried out on the randomized networks to find the
frequency and identify the motifs in the randomized cases for comparison. Constraints
are placed on the construction of these trees (some explained above) so that each
specific tree is only generated once. This avoids redundancy and extra computational
time.
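The compositions used in the enumeration step can be generated as follows. This is only a sketch of that one ingredient, with a hypothetical function name: each composition of k-1 is an ordered tuple of positive integers summing to k-1, read here as the number of nodes taken from the second level, the third level, and so on.

```python
def compositions(total):
    """Yield all ordered tuples of positive integers that sum to `total`."""
    if total == 0:
        yield ()
        return
    for first in range(1, total + 1):
        for rest in compositions(total - first):
            yield (first,) + rest
```

For subgraphs of size 3, `compositions(2)` yields `(1, 1)` and `(2,)`, which correspond to the two cases in the example above: a neighbor plus a neighbor of a neighbor, or two neighbors of the initial node.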
Classification involves placing the subgraphs found in the enumeration step into
isomorphic classes. This is done using the NAUTY algorithm [3]. Random graphs are
generated in Kavosh using the switching method. The frequencies of subgraphs in the
input network are compared to frequencies in the random networks. Subgraphs are
dubbed as motifs if frequencies are higher in the input network than they are in the
random networks. Parameters often used include p-value, frequency level, and z
scores. [3]
MA Visto
MA Visto is able to consider all 3 frequency concepts when enumerating
subgraphs [2]. This allows for an effective visual representation of the frequency
concepts. MA Visto finds all of the subgraphs of a certain size and finds the
frequencies for each subgraph using all three frequency concepts. The flexible pattern
finder (FPF) algorithm is used by MA Visto to search for the motifs [13]. The FPF
algorithm looks at patterns that are of the same size as the given target size (i.e. looks
for all patterns of size 4 when looking for size 4 motifs). As the size of the pattern
increases the number of possible patterns of that size also increases meaning that
finding all of the patterns of one size would be computationally costly. A tree is
constructed in which each level is comprised of patterns of a certain size, up
to the level where the desired size is reached. In order to avoid generating all the
possible patterns of a given size, FPF eliminates patterns that are not supported by
(cannot be mapped to) the input network as soon as they appear in the tree. This stops
such a pattern from being extended further, which allows for elimination of
unnecessary branches [14]. Also, since the frequency of a pattern decreases with
increasing pattern size, if an intermediate (and smaller) sized pattern is found to have
a smaller frequency than patterns of the desired (and larger) size the branch of the tree
is discontinued because it will never have a high enough frequency [13]. MA Visto
uses the frequencies of the subgraphs in the input network as well as the frequencies
in the randomized networks in order to find z scores and p values for the different
motifs [2].
Probabilistic algorithms:
Full enumeration can be computationally costly and require a lot of time. As the
size of the subgraphs being searched for increases the possible isomorphic types
increases. This makes exhaustive enumeration algorithms extremely time consuming
and costly because they need to find the frequencies of each different isomorphic
graph of all sizes in both the input network and the randomized networks.
Kashtan et al developed a ‘sampling method for subgraph counting’ which is a
probabilistic algorithm [11]. This algorithm estimates subgraph
frequencies by sampling subgraphs, which is less time consuming than full
enumeration. The runtime of the algorithm is asymptotically independent of
network size. With Kashtan’s sampling algorithm, networks larger than those that
full enumeration algorithms can handle can be analyzed and larger motifs can
be identified.
A random n-sized subgraph is found in this sampling algorithm. An edge is picked
randomly and its neighbors are all made into candidates to be the next edge. One of
the candidates is picked at random and its neighbors are the new candidates. This
process continues with one edge from all the neighbors being chosen randomly to be
the next edge until a subgraph of size n is created. All of the nodes from these edges
and all the edges that connect these nodes make up the sampled subgraph. [11]
An ordered set of n-1 edges needs to be picked for an n sized subgraph to be found.
The probabilities of getting these ordered sets are used to find the probability that an n
sized subgraph will be sampled. From this and a few additional calculations the
estimated subgraph concentrations are found. [11]
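The sampling step described above can be sketched as follows. The sketch assumes an undirected view of the network stored as a dictionary from each node to its set of neighbors, with orderable node labels; the function name is hypothetical. It covers only the drawing of one subgraph: the calculation of the sampling probability and of the corrected concentrations is not included.

```python
import random

def sample_subgraph(adj, n, rng=None):
    """Draw one connected n-node subgraph by random edge expansion.
    Returns (nodes, induced edges), or None if the expansion gets stuck."""
    rng = rng or random.Random()
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    u, v = rng.choice(edges)  # random starting edge
    nodes = {u, v}
    while len(nodes) < n:
        # Candidate edges have exactly one endpoint inside the current node set.
        candidates = [(a, b) for a in nodes for b in adj[a] if b not in nodes]
        if not candidates:
            return None
        _, b = rng.choice(candidates)
        nodes.add(b)
    # The sampled subgraph contains every edge between the chosen nodes.
    induced = {(a, b) for a in nodes for b in adj[a] if b in nodes and a < b}
    return frozenset(nodes), frozenset(induced)
```

Because the chance of reaching a given subgraph depends on how many edge orderings lead to it, repeated calls do not sample subgraphs uniformly.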
A major problem with Kashtan et al’s method is that its sampling is biased [8]. This
means that each subgraph does not have a uniform probability of being sampled [10].
Therefore, occurrences of a subgraph cannot be impartially estimated [8]. The
algorithm tries to take this into account by weighting each subgraph with a value of
1/(probability of the subgraph being chosen) [10].
Other tools that use probabilistic sampling algorithms as alternatives to full
enumeration are MFinder and FANMOD. MFinder uses a biased algorithm like that of
Kashtan’s while FANMOD uses an improved method that achieves unbiased
sampling [10].
MFinder
MFinder is capable of analyzing directed and undirected networks [2]. Concept F1
is used when finding the frequency for the subgraphs. Also, concept F3 is applied in
order to determine a lower bound for uniqueness value [2]. MFinder fully enumerates
subgraphs by starting with an edge. All motifs of different sizes are found that contain
this edge [6]. Once a subset of nodes is found that is connected to the initial edge the
subset is added to a hash table so it cannot be revisited [10]. When no more subgraphs
can be identified the hash tables are cleared and the process begins again with a
different edge. This is repeated until all edges have been used. Because a specific
subgraph will be counted each time one of its edges is examined there is redundancy
and the number of times the subgraph will be counted is a multiple of its edge number
[6]. Therefore, the count for a subgraph must be divided by the number of edges in the
motif.
Since MFinder looks at so many motifs and has redundancy it requires large
amounts of memory. This causes the runtimes to be large and makes it hard for large
motifs to be searched for [10]. Therefore, MFinder uses the biased sampling method
that Kashtan et al developed.
FANMOD
FANMOD is a tool that can be used to analyze both directed and undirected
networks [2]. It is able to identify motifs of sizes 3 – 8. Only induced subgraphs are
found from FANMOD. It determines frequencies of subgraphs with concept F1 and
uses z-score and p-value to deem whether or not a motif is statistically significant [2].
The full enumeration part of the FANMOD algorithm begins with one node and a
list of possible vertices to which this node can be connected (i.e. the node’s
neighbors). Once a possible vertex is extended to, it is removed from the list of
possible extensions and its neighbors are added to the candidates that this vertex can
be connected to next. Different combinations of possible extensions are chosen in
order to form subgraphs of different sizes. Since the list of possible extensions is
constantly changed, each subgraph is only enumerated once. Like Kavosh, FANMOD
uses the NAUTY algorithm to test for graph isomorphism. [10]
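The extension-list idea can be sketched in the style of the ESU enumeration algorithm on which FANMOD is based. The sketch assumes an undirected view of the network given as a dictionary from integer node labels to neighbor sets, and the function name is hypothetical; the isomorphism classification with NAUTY is not included.

```python
def enumerate_subgraphs(adj, k):
    """Enumerate every connected k-node vertex set exactly once."""
    results = []

    def extend(sub, ext, root):
        if len(sub) == k:
            results.append(frozenset(sub))
            return
        ext = set(ext)
        while ext:
            w = ext.pop()  # w is removed from the extension list for good
            new_ext = set(ext)
            for u in adj[w]:
                # Add only 'exclusive' neighbors of w: larger label than the
                # root, not in the subgraph, not adjacent to the subgraph.
                if u > root and u not in sub and all(u not in adj[s] for s in sub):
                    new_ext.add(u)
            extend(sub | {w}, new_ext, root)

    for v in adj:
        extend({v}, {u for u in adj[v] if u > v}, v)
    return results
```

Restricting extensions to labels larger than the root and to exclusive neighbors is what guarantees that no vertex set is produced twice.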
FANMOD’s alternative method uses probabilistic sampling to reduce runtimes for
identifying motifs. It uses a randomized enumeration algorithm known as RAND-ESU. This sampling works by changing the full enumeration algorithm so that it
randomly skips subgraphs. The FANMOD sampling algorithm chooses each size k
subgraph with a certain probability [12]. This means that all subgraphs have the same
probability of being sampled and no subgraph is sampled more than once. Because of these
improvements over the Kashtan et al algorithm, FANMOD is unbiased and results in all
subgraphs having the same probability of being chosen [10].
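A sketch of this randomized skipping is given below. It repeats an ESU-style enumeration and follows each branch only with a given probability; the function name and the `probs` parameter (one probability per depth, `probs[d]` being the chance of adding the node that brings the subgraph to size d+1) are assumptions made for this illustration, as is the undirected dictionary representation with integer node labels.

```python
import random

def rand_esu(adj, k, probs, rng=None):
    """Sample connected k-node vertex sets; each one is reached with
    probability probs[0] * probs[1] * ... * probs[k-1]."""
    rng = rng or random.Random()
    results = []

    def extend(sub, ext, root):
        if len(sub) == k:
            results.append(frozenset(sub))
            return
        ext = set(ext)
        while ext:
            w = ext.pop()
            if rng.random() >= probs[len(sub)]:
                continue  # randomly skip this branch of the enumeration
            new_ext = set(ext)
            for u in adj[w]:
                if u > root and u not in sub and all(u not in adj[s] for s in sub):
                    new_ext.add(u)
            extend(sub | {w}, new_ext, root)

    for v in adj:
        if rng.random() < probs[0]:
            extend({v}, {u for u in adj[v] if u > v}, v)
    return results
```

Since every k-node subgraph lies on exactly one branch of the enumeration, dividing the number of sampled subgraphs by the product of the probabilities gives an estimate of the total count.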
MOTIF CENTRIC ALGORITHMS:
Mapping algorithms:
Grochow
Grochow is a motif centric tool that can be applied to directed and undirected
networks [15]. The algorithm progressively maps a specific target subgraph onto a
global network. By doing this Grochow checks for isomorphism as it maps the query
graph onto the network [10]. This eliminates the extra time and memory it would take
to check for isomorphism and avoids full enumeration. The mapping algorithm goes
through the query subgraph node by node in order to map the subgraph onto the
network. A node will be specified and the tool will find all the “candidate nodes”,
nodes in the network that have the same characteristics (i.e. same degree and
neighbors with correct degrees). As the algorithm goes through each node in the query
subgraph possible matches in the network are found while others that are not exactly
the same are eliminated once any inconsistency is found. This mapping ensures that
only exact isomorphic subgraphs in the network are detected. [15]
Grochow uses a method known as symmetry breaking to make sure that each
subgraph is only mapped to once in order to reduce run time and redundancy [15].
A graph that can be mapped onto itself in more than one way is said to have symmetries. Nodes that can
be mapped to one another by such a self-mapping are defined as equivalent. Therefore, the nodes in a specific
subgraph can be separated into equivalence classes. The Grochow algorithm ensures
that mapping begins only from one representative of each equivalence class so that
multiple mappings are not carried out beginning with equivalent nodes. Also,
restrictions are added to the labeling of each vertex so that symmetry is avoided. [10]
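The mapping idea can be sketched with a plain backtracking search. This is not the Grochow implementation: it assumes undirected graphs stored as dictionaries of neighbor sets, it counts edge-preserving (not necessarily induced) mappings, and it omits symmetry breaking, so each occurrence is found once per symmetry of the query graph. The function names are hypothetical.

```python
def count_mappings(query_adj, net_adj):
    """Count injective mappings of query nodes onto network nodes that
    send every query edge to a network edge."""
    q_nodes = list(query_adj)
    mapping = {}
    used = set()

    def consistent(q, g):
        # Candidate g needs at least as many neighbors as q ...
        if len(net_adj[g]) < len(query_adj[q]):
            return False
        # ... and must be adjacent to the images of q's mapped neighbors.
        return all(
            g2 in net_adj[g] for q2, g2 in mapping.items() if q2 in query_adj[q]
        )

    def backtrack(i):
        if i == len(q_nodes):
            return 1
        q = q_nodes[i]
        total = 0
        for g in net_adj:
            if g not in used and consistent(q, g):
                mapping[q] = g
                used.add(g)
                total += backtrack(i + 1)
                del mapping[q]
                used.discard(g)
        return total

    return backtrack(0)

def count_occurrences(query_adj, net_adj):
    """Divide out the symmetries of the query (its mappings onto itself)."""
    return count_mappings(query_adj, net_adj) // count_mappings(query_adj, query_adj)
```

Symmetry breaking removes the need for the final division by preventing the redundant mappings from being explored in the first place.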
MODA
MODA utilizes a pattern growth algorithm that takes in a query graph [8]. It uses
information based on previously found query graphs. By maintaining information
about formerly found mappings, it reduces computational time. It uses the concept of
expansion trees, which are similar to pattern trees used in MA Visto, but applicable to
the frequency concept F1. The expansion tree starts with a root node at level 0. Then
it finds all minimally connected size-k trees of the root node, which is level 1. It then
adds an edge at each level until a complete graph is obtained. The first level of the
tree therefore represents the number of non-isomorphic trees. Each node of the
expansion tree can be represented by an adjacency matrix consisting of 0’s and 1’s.
For undirected graphs, which are symmetric, only the numbers below the main
diagonal are stored. Expansion trees are stored for every size k-graph. They are a
static data structure which can be stored and retrieved and do not have to be found
each time. [8]
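The storage of only the entries below the main diagonal can be sketched as follows. The function names and the flat list layout are assumptions made for this illustration and are not taken from MODA itself.

```python
def lower_triangle(matrix):
    """Keep only the entries below the main diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(1, n) for j in range(i)]

def from_lower_triangle(bits, n):
    """Rebuild the full symmetric adjacency matrix of an undirected graph."""
    matrix = [[0] * n for _ in range(n)]
    values = iter(bits)
    for i in range(1, n):
        for j in range(i):
            matrix[i][j] = matrix[j][i] = next(values)
    return matrix
```

A size-k node of the expansion tree then needs k(k-1)/2 stored values instead of k*k.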
The mapping algorithm takes the query graph from the first level of the expansion
tree, which is composed of trees themselves, and maps them onto the network. It
holds onto their calculated frequencies. The frequencies at the second level of the
expansion tree can be found with respect to the first level of the expansion tree, which
are their parent nodes. MODA utilizes the symmetry-breaking conditions of the
Grochow algorithm. It only uses the Grochow algorithm for the first level of the
expansion tree. All the information the algorithm finds about the first level can be
exploited to find the frequencies of all the next levels which are supergraphs of the
first level. By exploiting information of formerly found mappings, MODA can be
used to reduce computational costs. Additionally, MODA has a sampling method that
can be used to reduce runtimes with the sacrifice of accuracy. [8]
3 Experiments and Analysis
Data from experiments on runtimes of various algorithms are presented here along
with MA Visto frequency concept data and experimental motif results from PPI
networks.
I) Runtimes
Experiments:
Many experimental runs have been carried out to determine runtimes for network
motif discovery tools. As shown in Table 1, Omidi et al compared runtimes of
MODA, MFinder, Grochow, FPF (algorithm used in MA Visto), and FANMOD [8].
Searches were carried out for subgraphs size 3 – 9. Table 2 shows Chen et al’s
comparison of the runtimes of NeMoFINDER and FPF (algorithm used in MA Visto)
[5]. Kashani et al compared the runtimes for Kavosh, FANMOD, MA Visto, and
MFinder for subgraphs between size 3 and size 10 as shown in Table 3 [3].
Table 1. Data from Figure 7 from Omidi et al [8] showing runtimes (in
seconds) for size 3-9 subgraphs in E. coli transcription network. Tools
compared include MODA, MFinder, Grochow, FPF algorithm, FANMOD.
3
4
5
6
Mfinder
2.0
7.9
7.9x101
3.2x103
FPF(MA Visto)
1.1
1.6
6.3
5.0x101
1.1
Fanmod
1.3
2.0
7
7.9
1
MODA
1.1
1.3
3.2
2.0x10
Grochow
1.3
2.5
1.6x101
2.2x102
8
1.0x103
5.6x104
5.6x10
1
7.9x102
1.8x10
2
3.2x103
9
6.3x104
1.8x104
Table 2. Data from Figure 11 from Chen et al [5] showing runtimes (in
seconds) for size 3-13 subgraphs in Uetz PPI network. Tools compared
include NeMoFINDER, FPF algorithm, sampling algorithm, and full
enumeration algorithm.
Subgraph size   FPF        NeMoFINDER
3               2.2x10^1   2.2x10^1
4               7.9x10^1   7.9x10^1
5               3.2x10^2   2.8x10^2
6               3.5x10^3   1.6x10^3
7               6.3x10^4   6.3x10^3
8               4.0x10^5   1.6x10^4
9               3.2x10^6   2.0x10^4
10              -          3.5x10^4
11              -          4.0x10^4
12              -          5.6x10^4
13              -          7.1x10^4
Table 3. Data compiled from Table 4 from Kashani et al [3]. Runtimes (in
seconds) for identifying subgraphs in yeast S. cerevisiae transcription
network of sizes between 3 and 10 are shown. Tools compared include
Kavosh, FANMOD, MA Visto, and MFinder.
3
Kavosh
FANMOD
MA VISTO
(FPF)
Mfinder
4
3.0x10
-1
8.1x10
-1
1.4x10
4
3.1x101
5
6
1.5x10
1
2.5
1.6x10
1
3.0x102
2.4x104
1.8
7
1.4x10
2
1.3x10
2
8
1.4x10
3
1.2x10
3
9
1.3x10
4
9.3x10
3
1.2x10
10
5
1.1x106
Kashtan et al compared the runtime of their probabilistic sampling method with
that of full enumeration when identifying motifs in networks of different sizes
(Figure 1) [11]. The networks for which these comparisons were made had between
1000 and 8000 nodes.
Figure 1. Figure 4 from Kashtan et al [11] showing runtimes for different
network sizes (on a log-log scale). Kashtan’s probabilistic algorithm and a
full enumeration algorithm were compared.
Runtimes were found for MA Visto when finding subgraphs of size 3-4 and 4-5
(Table 4). Networks analyzed included E. coli transcription network and yeast
transcription network [16].
Table 4. Examples of runtimes (in seconds) for MA Visto analyzing E. coli and
yeast transcription networks [16]. Subgraphs of size 3-4 and 4-5 were searched
for. For each run 100 randomized networks were generated.
Network                                                3-4 Nodes   4-5 Nodes
E. coli transcription network (418 nodes, 519 edges)   909.904     8366.359
Yeast transcription network (688 nodes, 1079 edges)    507.574     >25200
Runtimes were found for FANMOD when finding subgraphs of size 3 – 7 (Table 5).
A protein structure network [16], PPI network [17], yeast transcription network [16],
and E. coli transcription network [16] were used.
Table 5. Runtimes (in seconds) for the FANMOD tool finding subgraphs of size
3 to 7 for networks including protein structure [16], PPI [17], yeast
transcription [16], and E. coli transcription [16]. 1000 random networks were
generated in all of the runs. A dash marks a size for which no runtime is
reported.
Network                                                            3         4         5         6        7
Protein structure (undirected, 96 nodes, 213 edges)                2.157     17.89     315.79    1306.8   1452.08
Protein-protein interaction (undirected, 4470 nodes, 3886 edges)   150.766   9705.86   -         -        -
Yeast transcription network (directed, 689 nodes, 1078 edges)      14.562    312.025   -         -        -
E. coli transcription network (directed, 418 nodes, 519 edges)     6.485     145.504   4084.25   -        -
Omidi et al [8] and Kashani et al [3] both did experimental runs on
FANMOD, MA Visto (or the FPF algorithm), and MFinder. Kashani et al measured
the runtimes of the tools to fully enumerate the input network and to generate and
enumerate 100 random networks. Omidi et al measured the runtimes only for full
enumeration of the input network.
Table 6. Data from Figure 7 of Omidi et al [8] and Table 4 of Kashani et al
[3]. Runtimes (in seconds) for motif searches done by FANMOD, MA Visto
(FPF algorithm), and MFinder. Rows marked 100 give times for runs that fully
enumerated the input network and 100 random networks on a 3.2 GHz AMD Opteron
processor with 8 GB RAM [3]. Rows marked 0 give times for runs that only fully
enumerated the input network on an IBM R50e laptop with an Intel Pentium
1.8 GHz processor and 1 GB RAM [8]. A dash marks a size for which no runtime is
reported.
Tool           Random networks   3           4          5          6          7          8
FANMOD         100               8.1x10^-1   2.5        1.6x10^1   1.3x10^2   1.2x10^3   9.3x10^3
FANMOD         0                 1.1         1.3        2.0        7.9        5.6x10^1   7.9x10^2
MA Visto/FPF   100               1.4x10^4    -          -          -          -          -
MA Visto/FPF   0                 1.1         1.6        6.3        5.0x10^1   1.0x10^3   5.6x10^4
MFinder        100               3.1x10^1    3.0x10^2   2.4x10^4   -          -          -
MFinder        0                 2.0         7.9        7.9x10^1   3.2x10^3   -          -
Analysis and Limitations:
Experimental runs carried out by Omidi et al [8], Chen et al [5], and Kashani
et al [3] allow for comparisons of a variety of network motif discovery tools. A
consistent trend seen in these experiments is the inability of MA Visto and MFinder to
handle subgraphs as large as the other tools. Often they were only able to find
subgraphs of size 5 or less and usually had runtimes larger than most other tools.
FANMOD was able to identify motifs up to size 8, but NeMoFINDER, Kavosh,
MODA, and Grochow were seen to be able to deal with subgraphs larger than 8.
Despite the ability of some tools to handle subgraphs larger than 8 it can be seen that
the runtimes for these experiments are very large. Overall, it can be concluded that the
current motif discovery tools are very limited in the size of subgraphs that they can
handle in reasonable amounts of time. NeMoFINDER shows promise in being able to
search for larger sized motifs.
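One way to make this limitation concrete is to look at how quickly the runtime grows with each added node. The short Python sketch below takes the Kavosh runtimes reported in Table 3 and computes the ratio between consecutive subgraph sizes. The variable and function names are illustrative only.

```python
# Kavosh runtimes in seconds from Table 3 (Kashani et al [3]), keyed by subgraph size.
kavosh_runtime = {3: 3.0e-1, 4: 1.8, 5: 1.5e1, 6: 1.4e2,
                  7: 1.4e3, 8: 1.3e4, 9: 1.2e5, 10: 1.1e6}


def growth_factors(runtimes):
    """Return {size: runtime(size) / runtime(size - 1)} for consecutive sizes."""
    sizes = sorted(runtimes)
    return {k: runtimes[k] / runtimes[k - 1] for k in sizes[1:] if k - 1 in runtimes}


factors = growth_factors(kavosh_runtime)
for size, factor in factors.items():
    print(f"size {size - 1} -> {size}: x{factor:.1f}")
```

With these figures each additional node multiplies the runtime by roughly 6 to 10, which is why the size 10 search takes on the order of 10^6 seconds (about 13 days).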
Another limitation for motif discovery tools is the size of the network that
can be analyzed. As seen in Kashtan et al’s [11] experiment, the runtimes for
exhaustive motif searches increase exponentially as network sizes increase. All of
the tools discussed have difficulty searching larger networks (thousands of nodes)
in reasonable amounts of time. Therefore, networks such as PPI and most social
networks that have thousands of nodes are difficult to fully enumerate in a
reasonable time. Kashtan’s probabilistic sampling method has been shown to produce
a fairly consistent runtime with increases in network size. The sampling method
takes significantly less time than exhaustive enumeration as network size
increases. However, as discussed above, Kashtan’s sampling algorithm is biased and
results in a loss of accuracy.
The number of randomly generated networks is also a limitation to be
considered. Random networks are used in the tools for comparison with the input
network, and many of them are needed for an accurate comparison. However,
runtimes increase with the number of random networks that need to be generated and
searched for motifs. Omidi et al’s [8] experiments only involved full enumeration of
the input network, whereas Kashani et al’s [3] fully enumerated the input network and
generated and enumerated 100 random networks. Despite the fact that Kashani et al
used a computer with greater computational power, the runtimes for Kashani et al’s
experiments were significantly larger than those for Omidi et al’s experiments.
Therefore, even with a more powerful computer, the generation and enumeration of
many random networks adds significant time to searches.
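The cost described above can be illustrated with a small Python sketch. It assumes the random networks are produced by degree-preserving edge switching, one common randomization scheme (the individual tools may differ in detail), and it uses a naive count of connected node triples as a stand-in for the motif search. The toy network and all function names are hypothetical.

```python
import random
from itertools import combinations


def randomize(edges, swaps_per_edge=3, seed=0):
    """Degree-preserving randomization of a directed edge list by edge switching.

    Two edges (a, b) and (c, d) are rewired to (a, d) and (c, b) unless that
    would create a self-loop or a duplicate edge.
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    for _ in range(swaps_per_edge * len(edge_list)):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if a == d or c == b or (a, d) in edge_set or (c, b) in edge_set:
            continue  # rejected swap, the graph stays simple
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def degree_sequence(edges):
    """Return (out-degree, in-degree) dictionaries of a directed edge list."""
    out_deg, in_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1
    return out_deg, in_deg


def count_connected_triples(edges):
    """Naive stand-in for a size 3 search: node triples joined by at least 2 links."""
    nodes = sorted({n for e in edges for n in e})
    linked = {frozenset(e) for e in edges}
    total = 0
    for triple in combinations(nodes, 3):
        links = sum(frozenset(pair) in linked for pair in combinations(triple, 2))
        if links >= 2:
            total += 1
    return total


# Toy directed network (hypothetical data).
network = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 0), (1, 4), (3, 1)]
random_networks = [randomize(network, seed=s) for s in range(100)]

# One search on the input network plus one search per random network.
counts = [count_connected_triples(net) for net in [network] + random_networks]
searches = len(counts)
print(f"{searches} searches for {len(random_networks)} random networks")
```

Because every random network has to be searched just like the input network, the total work scales with one plus the number of random networks, on top of the cost of generating them.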
II) Frequency Concepts
Experiments:
MA Visto was used to compare the frequency results for concepts F1, F2, and F3 for
the same subgraph within the same network (Table 7, Table 8).
Table 7. Values for concept F1, F2, and F3 for each size 3 motif found by
MA Visto in the E. coli gene transcription network [16]. Each row is one motif.
F1     F2    F3
4819   207   47
269    131   23
42     18    7
202    39    18
Table 8. Values for concept F1, F2, and F3 for each size 3 motif found by
MA Visto in the yeast gene transcription network [16]. Each row is one motif.
F1     F2    F3
107    31    10
111    33    13
11     6     5
116    32    13
4      1     1
2      1     1
2      1     1
11     2     1
1      1     1
1      1     1
1      1     1
1      1     1
Analysis and Limitations:
MA Visto’s ability to calculate the frequency of motifs for all three frequency
concepts, F1, F2, and F3, allows the discrepancies between them to be compared.
Experimental runs on two different data sets (E. coli transcription and yeast
transcription) demonstrate that there can be differences in the frequencies calculated
by different concepts. When a subgraph occurs only a few times in the network
the discrepancy is small, but for more frequent subgraphs the frequency
concept results vary substantially.
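The gap between the concepts can be reproduced on a toy example. The Python sketch below assumes the definitions of Schreiber and Schwobbermeyer [14]: F1 counts all matches of a pattern, F2 counts only matches that share no edges, and F3 counts only matches that share no nodes. The pattern used here is the feed-forward loop, the network is made up, and the disjoint sets are chosen greedily, which gives a lower bound on F2 and F3 rather than necessarily the values MA Visto would report.

```python
from itertools import permutations


def feed_forward_loops(edges):
    """All matches of the pattern a->b, b->c, a->c, each as a frozenset of edges.

    Matches are not required to be induced subgraphs.
    """
    edge_set = set(edges)
    nodes = {n for e in edge_set for n in e}
    found = set()
    for a, b, c in permutations(nodes, 3):
        match = ((a, b), (b, c), (a, c))
        if all(e in edge_set for e in match):
            found.add(frozenset(match))
    return found


def greedy_disjoint(matches, items_of):
    """Greedily pick matches whose item sets do not overlap (a lower bound)."""
    used, picked = set(), 0
    for match in sorted(matches, key=sorted):
        items = items_of(match)
        if used.isdisjoint(items):
            used |= items
            picked += 1
    return picked


def frequencies(edges):
    """Return (F1, F2, F3) for the feed-forward loop in a directed edge list."""
    matches = feed_forward_loops(edges)
    f1 = len(matches)                                   # arbitrary overlap allowed
    f2 = greedy_disjoint(matches, lambda m: set(m))     # edge-disjoint matches
    f3 = greedy_disjoint(matches, lambda m: {n for e in m for n in e})  # node-disjoint
    return f1, f2, f3


# Hypothetical network: three feed-forward loops that all start at node 0.
toy_edges = [(0, 1), (1, 2), (0, 2),
             (0, 3), (3, 4), (0, 4),
             (1, 5), (0, 5)]
print(frequencies(toy_edges))
```

For this toy network the three loops all pass through node 0, and two of them share the edge from node 0 to node 1, so the sketch gives F1 = 3, F2 = 2, and F3 = 1. This mirrors the pattern F1 >= F2 >= F3 seen in the experimental values.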
III) Protein-Protein Interaction Network Motifs
Experiments:
An E. coli PPI network [17] was analyzed by FANMOD. Motifs of size 3 and 4 were
found (Figure 2).
Figure 2. Motifs of size 3 and 4 from the E. coli PPI network [17] as
identified by FANMOD. Z-scores for each motif are shown.
Analysis and Limitations:
Although NeMoFINDER had success identifying larger sized motifs, other
tools struggled to find motifs larger than size 5. All tools had issues with identifying
motifs in a reasonable amount of time. Due to the large size of PPI networks, they are
harder to analyze than smaller networks such as E. coli gene transcription networks.
Preliminary findings from runs done by FANMOD show the size 3 and 4 motifs
found in an E. coli PPI network. The motifs and their z scores are listed. For both the
size 3 and size 4 motifs, the most significant motif (the motif with the largest z score)
was that of the complete graph, a graph with an edge between all pairs of nodes. Previous
studies have also found complete graphs with high frequencies in PPI networks [18].
Further studies involving PPI networks may support these findings. Although
some methods have been used to predict PPI network motifs [19], there is still much
about the biological significance of the PPI motifs to be explored.
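For reference, the z score of a subgraph is commonly computed as the difference between its count in the input network and its mean count in the random networks, divided by the standard deviation of the random counts. The Python sketch below assumes this usual definition and the population standard deviation; the counts in the example are made up and are not FANMOD output.

```python
from statistics import mean, pstdev


def z_score(input_count, random_counts):
    """z = (count in input network - mean random count) / std of random counts."""
    mu = mean(random_counts)
    sigma = pstdev(random_counts)
    if sigma == 0:
        raise ValueError("random counts have zero variance, z score is undefined")
    return (input_count - mu) / sigma


# Hypothetical counts: 40 occurrences in the input network, four random networks.
example_z = z_score(40, [10, 12, 8, 10])
print(f"z = {example_z:.2f}")
```

A large positive z score means the subgraph appears far more often in the input network than in the random ensemble, which is the sense in which the complete graphs above stand out.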
4 Conclusion
Increasing interest in network motifs and emphasis on motif significance has led to
an ongoing process of network motif discovery tool development and continual
revision of previous work. As wet lab techniques have become more advanced,
increasing amounts of information about different biological systems and organisms
have been collected. This has allowed for databases to be developed that provide full
information sets concerning networks. The study of networks provides insights into
how organisms and systems work as a whole. Network motifs are the building blocks
of networks [1] and are often biologically significant which makes the identification
of the motifs extremely important in the search for the understanding of networks.
Researchers have struggled to overcome the difficulties in developing network
motif discovery tools. The graph isomorphism problem makes exhaustively finding all
the motifs in a network computationally impractical [2]. Also, dealing with large
networks, discovering large motifs, and generating and searching numerous random
networks are all issues that cause network motif discovery to be extremely
computationally costly. Although these factors are costly, they are also integral parts
of the network motif discovery process and in understanding networks.
Results of various experimental runs carried out using different network motif
discovery tools have helped to determine which tools are more efficient and useful.
Furthermore, these comparisons help to highlight which algorithmic methods improve
tool performance. Overall, MA Visto and MFinder were computationally costly and
had large runtimes in comparison to other tools searching the same network for the
same subgraph sizes. MFinder’s algorithm requires full enumeration and exhaustively
searches using a technique that counts the same subgraph multiple times [6]. This
redundancy contributes to increased computational cost and runtimes. MA Visto
calculates the frequencies for all three frequency concepts which requires more time
than only doing searches with one frequency concept [13].
Other tools were found to perform better than both MFinder and MA Visto; all had
better runtimes and were able to search for larger subgraphs than either MA Visto or
MFinder. FANMOD is a well-established and well-known tool which performs
relatively well partially due to its use of the NAUTY algorithm to test for graph
isomorphism. This, along with the fact that FANMOD’s algorithm ensures that each
subgraph is only counted once, makes full enumeration with FANMOD relatively
reasonable [10]. FANMOD also uses an unbiased sampling algorithm that helps to
reduce runtimes in comparison to full enumeration [12]. FANMOD has been shown
to have smaller runtimes than Grochow and MODA but it can only search for
subgraphs up to size 8. Kavosh, like FANMOD, uses the NAUTY algorithm and has
also shown relatively good runtimes in experimental runs. Restrictions placed on
the tree structures built while searching for motifs ensure that the algorithm
enumerates each subgraph only once [3]. This, along with the use of the NAUTY
algorithm, results in the Kavosh tool having relatively good search efficiency.
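The idea of enumerating each subgraph exactly once can be sketched in a few lines of Python. The code below follows the general ESU-style scheme associated with Wernicke's algorithm [12]: a subgraph is grown only from its smallest node, and only by nodes that are new to the neighbourhood of the current subgraph. It is a minimal illustration for undirected graphs with integer node labels, not the actual FANMOD or Kavosh implementation, and it leaves out the isomorphism classification that those tools delegate to NAUTY.

```python
def enumerate_subgraphs(adj, k):
    """Yield every connected k-node set of an undirected graph exactly once.

    adj maps each integer node label to the set of its neighbours.
    """
    def extend(sub, ext, root):
        if len(sub) == k:
            yield frozenset(sub)
            return
        ext = set(ext)
        while ext:
            w = ext.pop()
            # Nodes already in the subgraph or adjacent to it are not "new".
            closed = sub | set().union(*(adj[u] for u in sub))
            new = {u for u in adj[w] if u > root and u not in closed}
            yield from extend(sub | {w}, ext | new, root)

    for v in adj:
        # Only nodes with a larger label than the root may be added.
        yield from extend({v}, {u for u in adj[v] if u > v}, v)


# Hypothetical undirected graph with 5 nodes and 6 edges.
toy_graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
subgraphs = list(enumerate_subgraphs(toy_graph, 3))
print(len(subgraphs), "connected subgraphs of size 3")
```

Since no node set is produced twice, no bookkeeping of already-seen subgraphs is needed, which is the property credited above for the relatively good runtimes of FANMOD and Kavosh.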
Grochow achieves some efficiency due to its symmetry breaking techniques. With
symmetry breaking, Grochow is able to reduce redundant counts of subgraphs. The
algorithm’s ability to eliminate the subgraphs that are being mapped as soon as it is
discovered that they do not match any patterns in the input network also helps boost
efficiency. This prevents irrelevant subgraphs from being generated which saves time
and computational power. This also ensures that the subgraphs identified are
isomorphic to the subgraph in question, which means that an isomorphism test is not
required [15]. MODA uses some of the techniques from Grochow such as symmetry
breaking and uses the actual Grochow algorithm to find frequencies of some of the
patterns in question [8]. MODA’s algorithm also uses expansion trees to build
patterns that make subgraphs. These expansion trees and the mapping information for
the patterns are stored so that redundancy does not occur and computational time is
saved.
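The redundancy that symmetry breaking removes can be seen on the simplest symmetric pattern, the triangle. The Python sketch below counts mappings of a triangle pattern into an undirected graph with and without an ordering constraint on the mapped nodes. For a fully symmetric pattern such as the triangle, requiring the images to be in increasing order is a sufficient symmetry-breaking condition; the conditions Grochow and Kellis derive for arbitrary patterns [15] are not reproduced here, and the graph and function name are hypothetical.

```python
from itertools import permutations


def triangle_mappings(graph_edges, break_symmetry=False):
    """Count injective mappings of a triangle pattern into an undirected graph."""
    linked = {frozenset(e) for e in graph_edges}
    nodes = sorted({n for e in graph_edges for n in e})
    pattern = [(0, 1), (1, 2), (0, 2)]  # pattern nodes 0, 1, 2
    count = 0
    for image in permutations(nodes, 3):
        # Symmetry-breaking condition: accept only one ordering per node set.
        if break_symmetry and not (image[0] < image[1] < image[2]):
            continue
        if all(frozenset((image[a], image[b])) in linked for a, b in pattern):
            count += 1
    return count


# Hypothetical graph containing two triangles: {0, 1, 2} and {1, 2, 3}.
toy_graph_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (1, 3)]
naive = triangle_mappings(toy_graph_edges)
reduced = triangle_mappings(toy_graph_edges, break_symmetry=True)
print(naive, "mappings without symmetry breaking,", reduced, "with")
```

Without the constraint each of the two triangles is found once per automorphism of the pattern, six times each, so 12 mappings are reported for only 2 distinct occurrences.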
NeMoFinder has been found to be able to identify meso-scale motifs
(specifically, up to size 12) although it is limited to analyzing PPI networks and thus
only undirected networks [5]. By partitioning networks into sets of graphs with
repeated trees, the algorithm is more efficient than some other tools. NeMoFinder
differs from many other network motif discovery tools in its use of graph cousins
to generate possible subgraphs and to determine subgraph frequencies. Although
graph cousins allow for generation of candidate graphs, their use also causes
redundancy, which adds more time to the runs [7].
The strengths and weaknesses of each tool are worth noting so that
algorithmic shortcomings can be avoided in future tools while successful aspects can
be capitalized on. Sampling techniques have shown promise in reduction of runtimes
and should be considered when developing algorithms (along with the sacrifice in
accuracy). Also, the frequency concept that each tool uses is important to take note of
because, as seen from experimental runs, the frequencies vary greatly between the
different concepts. Network motif discovery has proved to be a very complex task.
Although many tools, algorithms and methods have been created for finding network
motifs, further improvements and new developments are a necessity in order to
increase motif discovery capabilities.
Future directions:
These further directions include: (1) the ability to intelligently search respective
networks for possible biologically relevant motifs that have been identified as
significant sub-graphs from experimental runs and literature review, and (2) the idea
of employing modern computing infrastructure to search concurrently for network
motifs that are larger than those that presently available tools can search.
Acknowledgements:
We would like to thank the National Science Foundation for providing the funding for
the Bio-Grid REU program and making this research possible. We would also like to
thank the University of Connecticut for hosting this program and especially Dr. Chun-Hsi Huang for advising and mentoring.
5 References
1. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.:
Network Motifs: Simple Building Blocks of Complex Networks. Science.
298, 824-827 (2002)
2. Schwobbermeyer, H.: Network Motifs. In Junker, B., Schreiber, F. (eds.)
Analysis of Biological Networks. pp. 85-111. NJ: John Wiley & Sons,
Inc (2008)
3. Kashani, Z., Ahrabian, H., Elahi, E., Nowzari-Dalini, A., Ansari, E.,
Asadi, S., Mohammadi, S., Schreiber, F., Masoudi-Nejad, A.: Kavosh: a
new algorithm for finding network motifs. BMC Bioinf. 10:318 (2009)
4. Fortin, S.: The Graph Isomorphism Problem. University of Alberta: Dept
of Computing Science, Alberta (1996)
5. Chen, J., Hsu, M., Lee, L., Ng, SK.: NeMoFinder: dissecting genome-wide
protein-protein interactions with meso-scale network motifs. KDD. 106-115
(2006)
6. Kashtan, N., Itzkovitz, S., Milo, R., Alon, U.: Network Motif Detection
Tool: mfinder Tool Guide. Weizmann Institute of Science: Depts of Mol
Cell Bio and Comp Sci & Applied Math, Rehovot, Israel (2002-2005)
7. Ciriello, G., Guerra, C.: A review on models and algorithms for motif
discovery in protein-protein interaction networks. Briefings in Functional
Genomics and Proteomics Advance Access. (2008)
8. Omidi, S., Schreiber, F., Masoudi-Nejad, A.: MODA: An efficient
algorithm for network motif discovery in biological networks. Genes
Genet. Syst. 84, 385-395 (2009)
9. Milo, R., Kashtan, N., Itzkovitz, S., Newman, M., Alon, U.: Uniform
generation of random graphs with arbitrary degree sequences. (2004)
10. Ribeiro, P., Silva, F., Kaiser, M.: Strategies for network motifs discovery.
IEEE International Conference. 81-86 (2009)
11. Kashtan, N., Itzkovitz, S., Milo, R., Alon, U.: Efficient sampling algorithm
for estimating subgraph concentrations and detecting network motifs.
Bioinformatics. 20, 1746-1758 (2004)
12. Wernicke, S.: A Faster Algorithm for Detecting Network Motifs. In
Casadio, R., Myers, G. (eds.) Algorithms in Bioinformatics: 5th
international workshop. pp. 165-176. Springer (2005)
13. Schreiber, F., Schwobbermeyer, H.: MAVisto: a tool for the exploration of
network motifs. Bioinformatics Applications Note. 21, 3572-3574 (2005)
14. Schreiber, F., Schwobbermeyer, H.: Frequency Concepts and Pattern
Detection for the Analysis of Motifs in Networks. Trans. On Comput.
Syst. Biol. III. 89-104 (2005)
15. Grochow, J., Kellis, M.: Network motif discovery using subgraph
enumeration and symmetry-breaking. Recomb. 92-106 (2007)
16. Collections of complex networks,
http://www.weizmann.ac.il/mcb/UriAlon/groupNetworksData.html
17. Bacteriome,
http://www.compsysbio.org/bacteriome/dataset/core_interactions.txt
18. Przulj, N., Wigle, D., Jurisica, I.: Functional topology in a network of
protein interactions. Bioinformatics. 20, 340-348 (2004)
19. Albert, I., Albert, R.: Conserved network motifs allow protein-protein
interaction prediction. Bioinformatics. 20, 3346-3352 (2004)
