Briefings in Bioinformatics. Vol 16. No 3. 497–525
Advance Access published on 24 June 2014
doi:10.1093/bib/bbu021

Current innovations and future challenges of network motif detection

Ngoc Tam L. Tran, Sominder Mohan, Zhuoqing Xu and Chun-Hsi Huang

Submitted: 20th January 2014; Received (in revised form): 25th May 2014

Abstract
Network motif detection is the search for statistically overrepresented subgraphs present in a larger target network. These subgraphs are thought to represent key structure and control mechanisms. Although the problem is exponential in nature, several algorithms and tools have been developed for efficiently detecting network motifs. This work analyzes 11 network motif detection tools and algorithms. Detailed comparisons and insightful directions for using these tools and algorithms are discussed. Key aspects of network motif detection are investigated. Network motif types and common network motifs as well as their biological functions are discussed. Applications of network motifs are also presented. Finally, the challenges, future improvements and future research directions for network motif detection are also discussed.

Keywords: network motif; random network; graph isomorphism; statistical significance; network motif detection

INTRODUCTION
Network motifs were first theorized by Shen-Orr et al. [1] as patterns of interconnections occurring in many different parts of a network at numbers that are significantly higher than those in random networks. The networks that contain motifs vary widely. Examples include protein–protein interaction (PPI), gene regulation, food webs, neuron connectivity, electronic circuits, World Wide Web (WWW), network traffic and social networks [2–5]. Certain network motifs such as the feed-forward loop (FFL) and the bi-fan have been shown to recur in completely different biological networks [6, 7]. These motifs can be found in Table 1.

Discovered network motifs are theorized to highlight key control mechanisms that regulate target networks. By identifying key control mechanisms in biological networks, researchers could increase the accuracy and efficiency of medications while speeding up their production. Network motifs could also bridge gaps between distinct disciplines and allow for fruitful collaboration. Two networks that share similar significance profiles are thought to have related structures and ways of functioning. The significance profile is a metric used to measure the significance of subgraphs' frequencies [6]. If certain networks, for instance PPI and transistor networks, were found to have similar significance profiles, research could be borrowed and shared by both biologists and very large scale integration (VLSI) designers. Each might have unique insights to offer, such as methods for design, control or analysis.

Although the study of network motif detection is not a nascent field, a significant amount of research is still needed to improve network motif detection. In this work, we analyze 11 network motif detection tools and algorithms. We provide detailed comparisons and insightful directions for using these tools and algorithms. We discuss the key aspects of network motif detection, the types of network motifs and common network motifs with their biological functions. We also present several applications of network motifs. Finally, we present the challenges, future improvements and future research directions for network motif detection.

Corresponding author. Ngoc Tam L. Tran, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-2155, USA. Tel: +1-860-296-7533, E-mail: [email protected]
Ngoc Tam L. Tran is a graduate student at the Department of Computer Science and Engineering, University of Connecticut.
Sominder Mohan is an undergraduate student at Swarthmore College.
Zhuoqing Xu is an undergraduate student at University of Connecticut.
Chun-Hsi Huang is an Associate Professor at the Department of Computer Science and Engineering, University of Connecticut.
© The Author 2014. Published by Oxford University Press. For Permissions, please email: [email protected]

Table 1: Network motifs (Courtesy of [8–16]). Illustrations are not reproduced here.

Type: Single node with self-edge
  Motif: Autoregulation (positive, negative)
Type: Pair-wise
  Motif: Positive feedback loops (double-positive, double-negative)
  Motif: Negative feedback loops (slow and fast interactions)
Type: Cascade
  Motif: Cascades (positive, negative)
Type: Hub
  Motif: Single-input module (SIM)
Type: Bipartite
  Motif: Dense overlapping regulons (DOR)
  Motif: Bi-fan
Type: Clique
  Motif: Protein clique
  Motif: Interacting transcription factors that co-regulate a third gene
  Motif: Feed-forward loop (FFL) (coherent types 1 to 4, incoherent types 1 to 4)
  Motif: Co-regulated interacting proteins
  Motif: Mixed-feedback loop between transcription factors that co-regulate a gene
  Motif: Biparallel

For the motifs SIM, DOR, bi-fan, protein clique, interacting transcription factors that co-regulate a third gene, FFL, co-regulated interacting proteins, and mixed-feedback loop between transcription factors that co-regulate a gene, a directed edge represents an interaction between a transcription factor and its target gene, and a bidirectional edge connects interacting proteins.

DESCRIPTION OF NETWORK MOTIF DETECTION
Network motif detection is the problem of finding smaller graphs (motifs) within a larger graph (target network) that meet certain statistical thresholds. Before stating the problem, we introduce a few definitions. To start, our target network and all motifs found are represented as graphs:

Graph
A graph is a set of vertices V connected to each other by a set of edges E.

Network motifs are defined as being found directly within the target network, meaning that the exact shape of the motif must be present somewhere in the target's structure. Mathematically, we can assert that a motif must be an induced subgraph of its target network.

Induced Subgraph
Let G be a graph with vertices V and edges E. Let $V' \subseteq V$ and $E' \subseteq E$. An induced subgraph H of G is a graph defined by $V'$ and $E'$, where $E'$ contains every edge of E whose endpoints both lie in $V'$. In other words, an induced subgraph H is a graph that can be completely defined by vertices and edges of G.

We know that network motif graphs are subgraphs of the target graph. However, some network motif graphs can have different shapes while their properties are mathematically identical. Consider two network motif graphs: one has a star shape and the other has a pentagon shape, as in Figure 1. These network motif graphs look nothing alike. However, they have the same properties. For instance, both have five edges and five vertices. The degrees of corresponding vertices are the same: each vertex has degree two in both graphs. Each network motif graph is connected, so both have exactly one connected component. Each pair of connected vertices in one graph corresponds to a pair of connected vertices in the other. They have neither loop edges nor parallel edges. Graphs that bear such similarities are known as isomorphic.

Figure 1: Isomorphic graphs.

Graph Isomorphism
Assume graphs $G = \{V, E\}$ and $H = \{V', E'\}$. G and H are isomorphic if there exists a bijective function f between V and $V'$ such that $\{u, v\} \in E$ if and only if $\{f(u), f(v)\} \in E'$.

We can count the number of isomorphic induced subgraphs in a target network to establish their frequency. However, we need to know the layouts of these subgraphs in the target network. There are three different classifications that can be used for counting network motifs [17], and different tools use different ones.
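
The graph isomorphism definition above can be made concrete with a brute-force check. The following is a minimal sketch, not a method taken from any of the surveyed tools: it assumes simple undirected graphs whose vertices are labelled 0 to n-1, and the function name `are_isomorphic` and the edge lists are hypothetical names chosen for illustration. It tries every bijection, so it is only practical for very small graphs such as the pentagon and star shapes discussed above.

```python
from itertools import permutations


def are_isomorphic(n, edges_g, edges_h):
    """Brute-force isomorphism test for two simple undirected graphs.

    Both graphs are assumed to use the vertex labels 0..n-1 (an assumption
    of this sketch). Every bijection f is tried until one maps the edge set
    of G exactly onto the edge set of H.
    """
    g = {frozenset(e) for e in edges_g}
    h = {frozenset(e) for e in edges_h}
    if len(g) != len(h):
        return False
    for f in permutations(range(n)):
        mapped = {frozenset((f[u], f[v])) for u, v in g}
        if mapped == h:
            return True
    return False


# Pentagon and five-pointed star: both are 5-cycles drawn differently.
pentagon = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
star = [(0, 2), (2, 4), (4, 1), (1, 3), (3, 0)]
# A triangle with a two-edge tail: same vertex and edge counts, different shape.
triangle_with_tail = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]

print(are_isomorphic(5, pentagon, star))
print(are_isomorphic(5, pentagon, triangle_with_tail))
```

The second comparison illustrates that matching vertex and edge counts alone do not imply isomorphism, since the degrees of the vertices differ.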

Frequency 1
The frequency of a subgraph is the number of times it occurs in a target network. This is also known as subgraph frequency.

Frequency 2
Frequency also measures how motifs can be overlaid in the target network. There are three separate measurements that determine how freely motifs can share elements [17]:
(i) F1 frequency is completely unrestricted and allows distinct motifs to share both edges and vertices.
(ii) F2 frequency is more restricted: distinct motifs can share only vertices.
(iii) F3 frequency is the most restricted: distinct motifs cannot share any vertices or edges.

The next step after obtaining the frequencies of isomorphic induced subgraphs is to find out whether or not they are significant, using statistical significance testing. By running the same frequency test on a large number of similar random networks, we can accumulate a large set of frequency values that provide some insight into whether or not the value obtained from the target network is significant. The random networks are used to establish default values for frequency and other metrics, otherwise known as a statistical null hypothesis. Generated random networks are also known as null models [18]. Testing all null models yields average values for all our scores, which can be used to set thresholds. If a particular score from the target network exceeds its threshold, this indicates that a significant result has been found.

Scoring Thresholds
Scoring thresholds are used to test whether or not subgraphs are statistically overrepresented and can therefore be called motifs.
(i) Z-score: The Z-score of a motif measures how many more occurrences of the motif are in the target network than in the average random network.
It is calculated as follows [19]:

$$z(m) = \frac{f_{in} - f_{rand}}{\sqrt{\sigma_{rand}^{2}}}$$

where m is a motif, $f_{in}$ is the number of occurrences of the motif in the target network, and $f_{rand}$ and $\sigma_{rand}$ are the mean and standard deviation of its occurrences in the set of random networks.
(ii) P-value: The P-value of a motif is the number of random networks in which the motif appears at least as many times as in the target network, divided by the total number of random networks. It is a probability value ranging from 0 to 1. A motif is considered statistically significant if its P-value is < 0.01 [6].
(iii) Significance Profile: The significance profile is a vector of the Z-scores of a set of motifs, normalized to length 1 [19]. The significance profile of a motif i is calculated as follows [19]:

$$SP_i = \frac{z_i}{\sqrt{\sum_{j=1}^{n} z_j^{2}}}$$

where $z_i$ is the Z-score of motif i, and n is the number of motifs in the set.

The network motif detection problem can be stated generally as follows:

Network Motif Detection
The search for induced isomorphic subgraphs within a target network that occur significantly more often in the target network than in random networks, as judged by scoring thresholds.

COMPLEXITY OF NETWORK MOTIF DETECTION
Network motif detection is computationally very expensive. As target networks grow, there is more room for induced subgraphs to appear. Furthermore, as subgraphs grow larger, there are more potential ways to overlay them within a network as well as a larger total number of possible motifs. All of this cost is multiplied by having to repeat the computation for a large number of random networks. Moreover, the isomorphism checking that motif detection relies on is a problem in NP that is neither known to be NP-complete nor known to be solvable in polynomial time [6].

TYPES OF NETWORK MOTIFS
Generally, there are six types of network motifs, which can be found in Figure 2.
They are single node with self-edge, pair-wise, cascade, hub, bipartite, and clique [8–12]. Some common network motifs identified for each type and their biological functions are discussed in the section below. The illustrations of these network motifs can be found in Table 1.

Figure 2: Types of network motifs (Courtesy of [8–12]).

Single node with self-edge
The network motif identified for this type is the autoregulation motif [10]. There are two types of autoregulation, namely negative autoregulation (NAR) and positive autoregulation (PAR) [9, 10]. The NAR is one of the most abundant network motifs [11]. About 40% of the known transcription factors in Escherichia coli show NAR [11]. This network motif is also abundant in yeast and higher organisms [11]. In the NAR motif, a transcription factor represses the transcription of its own gene [10]. The NAR motif has two essential functions: (i) it accelerates the response time of gene circuits and (ii) it decreases cell-to-cell variation in protein levels [9]. In the PAR motif, a transcription factor increases its own production rate. Thus, its functions are opposite to those of the NAR motif: the response of gene circuits is slowed and the cell-to-cell variation in protein levels is increased [9].

Pair-wise
The network motif of this type can be an interaction between two connected proteins [8, 9]. The identified network motifs are positive feedback loops and negative feedback loops [9, 10]. In positive feedback loops, two transcription factors regulate each other [9]. There are two types of positive feedback loops: the double-positive loop and the double-negative loop. In the double-negative loop, two transcription factors repress each other, so that there are two steady states in which one is off and the other is on, and vice versa. In the double-positive loop, two transcription factors activate each other.
Thus, there are two steady states: both are on or both are off [9, 10]. The negative feedback loops contain interactions between two genes or two proteins in which the interactions happen on different timescales. For instance, gene X slowly activates gene Y, which in turn rapidly inhibits gene X [10].

Cascade
The cascade network motif can be a sequence of activations of genes. When the upstream gene reaches an appropriate threshold, it activates the downstream gene. There are two types of cascades: positive and negative. In a positive cascade, the genes are sequentially activated. In a negative cascade, the genes are sequentially repressed [10].

Hub
The network motif identified for this type can be a pattern in which a regulator regulates a group of target genes [9, 12]. A regulator can also regulate itself [10]. An example of this type is the single-input module (SIM) [9]. The main function of the SIM is to control the synchronized expression of a group of genes with a shared function [9, 10].

Bipartite
The network motif identified for this type is a set of regulators that jointly control a set of genes [9, 10]. Examples of this type include dense overlapping regulons (DOR), also called multi-input motifs (MIMs), and the bi-fan [9]. The DOR contains several input regulators that jointly regulate several output genes. This network motif is found in E. coli and yeast. It is involved in several functions such as carbon utilization, anaerobic growth and stress response [9, 10]. The bi-fan network motif is also found in the transcription regulation networks of E. coli and the yeast Saccharomyces cerevisiae. It contains two input regulators that jointly regulate two output genes. This network motif can be categorized into coherent and incoherent bi-fan network motifs. The coherent bi-fan network motif has both inputs as promoters, while the incoherent bi-fan network motif has one input as a promoter and the other input as a repressor.
In general, the bi-fan network motif controls the order of signal propagation, and it can act as a signal sorter, filter or synchronizer [13, 14].

Clique
This type of network motif can be a protein complex that consists of three or more proteins interacting to form a clique [9]. Several network motifs of this type have been detected, as follows [8, 16].

Protein clique
This network motif contains three proteins interacting with each other. It is the most abundant network motif in PPI networks. Ninety-two percent of the occurrences of this network motif correspond to known protein complexes [8].

Interacting transcription factors that co-regulate a third gene
This network motif has two transcription factors that interact with each other and jointly regulate a third gene. Most of the interacting transcription factor pairs of this network motif have the same function, which is either co-activating or co-repressing genes [8].

Feed-forward loop (FFL)
In this network motif, a transcription factor regulates another transcription factor, and both together regulate a target gene [8, 10, 15]. This network motif is found in E. coli, yeast and other organisms [10]. There are eight types of FFLs, as each interaction in the FFL can be either activation or repression, as in Table 1 [10]. The coherent type 1 FFL and the incoherent type 1 FFL are the most common FFLs [10]. The incoherent FFL acts as a sign-sensitive accelerator. It accelerates the response time of the target gene expression following stimulus steps in one direction, such as from off to on, but not in the opposite direction [15]. The coherent FFL acts as a sign-sensitive delay [15].

Co-regulated interacting proteins
In this network motif, two genes interact with each other and are regulated by a common transcription factor. This network motif is found in many different cellular pathways [8].

Mixed-feedback loop between transcription factors that co-regulate a gene
This network motif can be a combination of two network motifs: two transcription factors that co-regulate a third gene, and the feed-forward loop. Thus, this topology allows combined regulation methods [8].

Biparallel
In this network motif, a regulator controls two other regulators, which co-regulate a target gene. This network motif is found in transcription and phosphorylation networks [16].

APPLICATIONS OF NETWORK MOTIFS
Network motifs have a wide range of applications, as follows.

Network motifs have been used for identifying application protocols in network traffic. This application helps network administrators to secure and manage network resources. The implementation shows that motif profiles outperform traditional profiles for correctly identifying application protocols in network traffic [4].

Similar network motifs found in different networks reveal the structural similarity between these networks. Thus, network motifs can be used to classify networks into superfamilies [19].

Network motifs have been employed to validate the construction of evolutionary trees using parsimony methods. In this application, the correctness of evolutionary trees, which are built based on the character overlap graph, is validated by finding under-represented network motifs, called holes, in the character overlap graph. The network motifs in this application are typically squares without crossing edges [20].

Network motifs found in the human signaling network have been used to identify breast cancer patients. In this application, three-node network motifs in the human signaling network have been screened to identify cancer-associated motifs that distinguish breast cancer samples from normal samples. This method has higher accuracy for identifying breast cancer patients, and it may help in the diagnosis and therapy of breast cancer as well as other types of cancer [21].

Network motifs also help explain the functional roles of some genes in gene regulation. For instance, identifying recurring miRNA-containing motifs in gene regulation networks improves the understanding of the functional roles of miRNAs in gene regulation [22].

Network motifs also allow predicting protein–protein interactions in the PPI network. In this application, three-node and four-node network motifs have been used to predict the correct interaction partner of a protein. This method achieves high accuracy for predicting the interactions in the protein interaction network [23].

Labeled network motifs can be used to predict the functions of unknown proteins in the PPI network. In this application, network motifs are discovered based on structure and biological meaning. The discovered network motifs are labeled so that they can be used to predict the functions of unknown proteins in the PPI network [24].

Network motifs have been utilized for identifying network activity. In this application, network motifs are mapped to applications. This implementation achieves 85% average accuracy. As a result, it improves network resource management as well as security enforcement [25].

In another application, the directed feedback loop and the feed-forward loop have been identified as dominant contributors to local information storage capability in biological and artificial networks. This finding can explain why some recurrent neural networks are known for good memory performance [26].

Lastly, network motifs have been used to explore the mechanisms of the cervical carcinoma response to epidermal growth factor (EGF) in a regulation network.
Because the regulation network is too large and complex to identify directly which of its components are significant, network motifs provide a better understanding of the modularity as well as the large-scale structure of the network. Thus, identifying network motifs may reveal the mechanisms underlying the response to growth factor activation in the regulation network [27].

CLASSIFICATION OF NETWORK MOTIF DETECTION ALGORITHMS
Network motif detection algorithms can be classified into two categories: network-centric and motif-centric algorithms [6]. A network-centric algorithm determines the frequency of subgraphs of a given size k in the target network by using isomorphic subgraph checking. It compares the frequency obtained in the target network with the frequency in the random networks for each subgraph to determine if it is a motif [28]. A motif-centric algorithm enumerates all possible subgraphs of size k. Then, it checks each size k subgraph against the target network to find a match and determine its frequency. This frequency is compared with the frequency in the random networks for this subgraph to determine if it is a motif [28]. The motif-centric algorithm has the drawback that it may spend unnecessary time checking generated subgraphs that do not occur in the target network [6].

The methods used by motif-centric and network-centric algorithms for counting subgraphs in the network can be classified into exact counting and approximation [7]. The former is limited by the enormous computational task in large networks. Thus, it can find small motifs of up to four nodes and motif generalizations of up to six nodes [29]. The latter was developed to overcome the complexity of the exact counting method so that larger motifs can be found [29, 30]. The exact counting methods include exhaustive recursive search (ERS) [3], enumerating subgraphs (ESU) [31] and compact topological motifs [7, 32].
The methods for approximation include edge sampling [3], a randomized version of ESU from a search tree [33] and tree-filtering search [7, 34]. These methods are discussed in the next section.

GENERAL AND BASIC TECHNIQUES USED IN NETWORK MOTIF DETECTION

Random network generations
Generating random networks is an essential step in network motif detection because the random networks serve as the baseline against which motifs in the target network are detected. The generated random network must have the same properties as the target network, such as the number of nodes, the number of edges, the degree of each node and so on. Two algorithms are commonly used for generating random networks: the switching algorithm and the matching algorithm [35].

The switching algorithm utilizes a Markov chain for generating a random graph with a given degree sequence. The algorithm uses Monte Carlo switching steps in which a randomly chosen pair of edges (A → B, C → D) is switched to (A → D, C → B), subject to the rule that multiple edges and self-edges are not allowed. This process is repeated Q × E times, where E is the number of edges in the graph and Q is approximately 100 to achieve sufficient randomization. This algorithm can sample networks uniformly [35].

The matching algorithm for generating random networks contains the following steps. First, the algorithm prepares a set of nodes in which each node is assigned a set of 'stubs', which are the half edges of its incoming and outgoing edges. Next, pairs of in-stubs and out-stubs are randomly selected and joined to form network edges. This step allows self-edges and repeated edges. Next, the algorithm searches for self-edges and repeated edges and rewires them without altering the degree of any node. This step is carried out until no self-edge or repeated edge exists in the network. This algorithm has the drawback that it generates a biased sample of random networks [35].
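
The switching algorithm described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation used by any particular tool: the network is assumed to be a directed simple graph given as a list of (source, target) pairs, and the function name `switch_randomize` and the parameters `q` and `seed` are hypothetical names introduced here. Rejected switches still count toward the Q × E attempts.

```python
import random
from collections import Counter


def switch_randomize(edges, q=100, seed=None):
    """Degree-preserving randomization by repeated edge switching.

    edges: list of directed (source, target) pairs with no self-edges and
    no repeated edges (an assumption of this sketch).
    q: number of switching attempts per edge (about 100 in the text).
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    if len(edge_list) < 2:
        return edge_list
    for _ in range(q * len(edge_list)):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # Proposed switch: (A -> B, C -> D) becomes (A -> D, C -> B).
        if a == d or c == b:
            continue  # would create a self-edge
        if (a, d) in edge_set or (c, b) in edge_set:
            continue  # would create a repeated edge
        edge_set.difference_update([(a, b), (c, d)])
        edge_set.update([(a, d), (c, b)])
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


target = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3)]
randomized = switch_randomize(target, q=100, seed=1)
# In-degrees and out-degrees of every node are unchanged by construction.
print(Counter(s for s, _ in randomized) == Counter(s for s, _ in target))
print(Counter(t for _, t in randomized) == Counter(t for _, t in target))
```

Each accepted switch keeps the source of every edge and only exchanges targets, which is why the degree of every node is preserved.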

Exhaustive recursive search algorithm
The ERS is an exact algorithm, which takes the input network in the form of an adjacency matrix and exhaustively scans the entire matrix for all types of subgraphs of sizes 3 and 4 only. The algorithm counts the number of appearances of each type of subgraph in the target network and also in the random networks. Subsequently, it determines the isomorphic subgraphs for each subgraph type. Then, each subgraph type is assessed for its statistical significance [3].

Edge sampling algorithm
This algorithm belongs to the family of approximate algorithms. It samples an n-node subgraph by selecting random connected edges to expand the subgraph until a set of n nodes is reached. First, a random edge is selected from the network. To expand the subgraph's size by one, the algorithm keeps a list of all neighboring edges and randomly selects an edge from that list. This expansion is repeated until a subgraph of n nodes is reached. The edge sampling algorithm is not uniform, because the probabilities of sampling different specific subgraphs are not equal even if they have the same topology. To compensate for this drawback, the algorithm implements a correction method, which calculates the probability P of sampling a specific subgraph to guarantee an unbiased estimation of subgraph concentrations. The algorithm calculates the concentrations of n-node subgraphs as follows. It assigns to each subgraph type i a score $S_i$, which is initially set to zero. Next, for each sample, a weighted score W is added to the accumulated score $S_i$ of the appropriate subgraph type i. The estimated subgraph concentrations after $S_T$ samples are calculated as follows [30].

$$C_i = \frac{S_i}{\sum_{k=1}^{L} S_k}$$

where $S_i$ is the score of subgraph type i, L is the total number of different subgraph types, and the sum over $S_k$ runs through all the different subgraph types. The concentration of each subgraph is used to determine whether or not it is statistically significant [30].

Frequent pattern finder (FPF)
The FPF algorithm searches for patterns of a given size that occur with maximum frequency under a given frequency concept. The algorithm builds a tree of only those patterns that are supported by the target graph. It traverses the tree and examines only its promising branches. The tree is built starting from the root, which contains the simplest possible pattern with one edge and two vertices. The children are constructed by extending the parent's pattern by one edge. Duplicate patterns are not allowed. A canonical label is assigned to each pattern and is used to identify the pattern. Graphs are identified as isomorphic if they have the same canonical label. When the frequency of a pattern of intermediate size falls below the frequency of a pattern of target size discovered so far, the algorithm discards this branch of the tree. If a pattern of target size with nearly maximum frequency is found early in the search process, then intermediate-size patterns that fall below this frequency threshold are most likely discarded early. Thus, the number of patterns to be searched decreases drastically [36].

Enumerating subgraphs algorithm
The ESU is an exact algorithm, which enumerates all size k subgraphs. The algorithm starts with a vertex v from the input graph and adds to the $V_{Extension}$ set, one at a time, vertices that have two properties. First, the label of these vertices must be larger than the label of v. Second, these vertices can only be neighbors of a newly added vertex w, and they cannot be neighbors of a vertex already in $V_{Subgraph}$. The subgraph is extended until a size k subgraph is reached. The algorithm outputs each size k subgraph exactly once [31].
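
The ESU enumeration described above can be sketched as a short recursive procedure. This is a minimal illustration under stated assumptions, not the reference implementation of [31]: the graph is assumed to be undirected with integer vertex labels, stored as a dictionary of neighbor sets, and the names `esu`, `extend`, `v_subgraph` and `v_extension` are chosen here for readability.

```python
def esu(adj, k):
    """Enumerate the vertex sets of all connected size-k subgraphs once each.

    adj: dict mapping each integer vertex label to the set of its neighbors
    (undirected graph without self-edges; an assumption of this sketch).
    """
    found = []

    def extend(v_subgraph, v_extension, v):
        if len(v_subgraph) == k:
            found.append(frozenset(v_subgraph))
            return
        v_extension = set(v_extension)
        while v_extension:
            w = v_extension.pop()
            # Vertices that are in the current subgraph or adjacent to it.
            covered = set(v_subgraph)
            for s in v_subgraph:
                covered |= adj[s]
            # New candidates: neighbors of w only, with a label larger than v.
            new = {u for u in adj[w] if u > v and u not in covered}
            extend(v_subgraph | {w}, v_extension | new, v)

    for v in adj:
        extend({v}, {u for u in adj[v] if u > v}, v)
    return found


# A path 0-1-2-3 contains two connected 3-vertex subgraphs.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(sorted(sorted(s) for s in esu(path, 3)))

# A complete graph on 4 vertices contains four connected 3-vertex subgraphs.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
print(len(esu(k4, 3)))
```

The two conditions on new candidates (larger label than v, and adjacent only to the newly added vertex w) are what prevent the same vertex set from being reached along two different branches.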
Randomized version of ESU (RANDESU) from a search tree The RAND-ESU is an approximate algorithm, which was designed to overcome the drawbacks of the nonuniform edge sampling and the expensive biased correction method by Kashtan et al. [37]. It can efficiently enumerate all size k subgraphs and randomly omits some subgraphs during its execution so that an unbiased subgraph sampling can be obtained. A general concept of enumerating all size k subgraphs is to follow. First, it starts with a vertex v from the input graph. Then, it extends v by adding vertices to VExtension set that have two properties: (i) the label of these vertices must be greater than the label of v, and (ii) they cannot be neighbors of a vertex in VSubgraph . This procedure results in an ESU-tree with an important property that can be used to efficiently sample random subgraphs uniformly. In addition, it is much faster because there is no biased correction needed [37]. Current innovations and future challenges of network motif detection 505 Tree-filtering search 2. MAVisto This algorithm was designed to find network motifs in PPI networks only. First, the algorithm finds the repeated subgraphs in the network. This step is performed by finding repeated size k trees and then it uses repeated size k trees to partition the network graph. Subsequently, it performs graph join operation for finding repeated size k subgraphs. Second, the algorithm verifies the frequency of repeated subgraphs in the random networks. Finally, it determines the uniqueness values of the repeated subgraphs using their frequencies in the PPI network and in the random networks. The details of this algorithm are discussed in the subsection NeMoFinder [34]. MAVisto (Motif Analysis and Visualization tool) [40] is network motif detection tool for biological networks. The tool was developed in 2005 for analyzing and visualizing network motifs. 
MAVisto relies on an editor called Gravisto [41] for graph visualization and a toolkit for implementing graph algorithms. MAVisto also employs an advanced force-directed layout algorithm [42] for drawing networks [40]. The advanced force-directed layout algorithm is designed for drawing aesthetically pleasing, two dimensional undirected graphs with straight edges. The algorithm has the following characteristics. It distributes the vertices evenly. It constructs edge lengths uniformly, and it reflects inherent symmetry. The algorithm has the advantages of speed and simplicity [42]. MAVisto allows discovering motifs of a given size specified by the number of nodes or the number of edges. MAVisto uses all three different frequencies F1 , F2 and F3 for identifying motifs. It also uses Z-scores to measure the statistical significance of discovered motifs. MAVisto relies on the FPF algorithm [36] discussed above for the motif finding [40]. MAVisto is written in Java. It allows Pajek-.Net[43] and GML [44] as inputs. Its output is detailed. MAVisto contains several views for motif visualizations such as motif table, motif view, motif fingerprint and motif matches. Motif table provides information on unique network motif label, motif’s size, structural properties and so on. Motif view provides visualization of motif’s structure. Motif fingerprint is a diagram of motif frequency spectrum of the target network. Motif matches view allows visual examination of the occurrences of a motif within the analyzed network and their matches. MAVisto is fast for detecting motif sizes 3–5 in directed networks by using a lookup table for isomorphic checking [40]. MAVisto’s user friendliness and its variety of frequency thresholds make it unique, even if it does not incorporate the fastest algorithm. Compact topological motifs Discovering topological motifs using compact notation is an exact counting method. 
This method was designed to overcome the combinatorial explosion of isomorphic subgraphs by using compact location lists, which are location lists of the vertices of the motifs. Instead of enumerating k elements out of n, it uses the $\binom{n}{k}$ form, where k is the number of immediate neighbors of a vertex out of n possible immediate neighbors. Thus, it reduces the size of the output significantly without losing information [38].

NETWORK MOTIF DETECTION TOOLS AND ALGORITHMS

1. mfinder

mfinder [30] is a command-line tool developed in 2004, and it was the first network motif detection tool. It uses the edge sampling algorithm discussed above for subgraph sampling. Because the runtime of this algorithm does not depend on the network size, mfinder can explore large networks and detect larger motifs that are unreachable by the exhaustive enumeration algorithms. mfinder uses an F1 frequency threshold, meaning that the vertices and edges of motifs are freely shared [30]. mfinder also implements several methods, such as switching, stubs and go-with-the-winners, for generating random networks [39]. mfinder is not suitable for finding large motifs because the cost of its sampling procedure grows exponentially [30]. However, its runtime is independent of network size, and it is able to detect subgraphs that have very low concentration [30].

3. NeMoFinder

NeMoFinder [34] is an algorithm developed in 2006 for detecting repeated and unique meso-scale network motifs in large PPI networks. It takes four input parameters specified by the user: PPI network G, frequency threshold F, uniqueness threshold S and maximum network motif size K. Its output is a set of repeated and unique motifs from size 2 to the specified maximum size K [34]. NeMoFinder uses F1 frequency for finding motifs. The algorithm contains the following steps [34].

(i) Discovering repeated subgraphs in the PPI network
(a) Finding repeated size k trees: In this step, the algorithm first finds size 2 trees.
Then, it extends them to size 3 trees, size 4 trees and so on until it reaches size k trees. Next, it counts the occurrences of each size k tree in the network and, using the user-defined frequency threshold F, determines whether it is a repeated tree; if so, it adds the tree to the set Tk.
(b) Using repeated size k trees to partition graph G: In this step, the algorithm uses the size k trees in Tk to divide the graph G into a set of graphs such that each graph contains a size k tree in Tk (2 ≤ k ≤ K).
(c) Performing graph join operation to find repeated size k graphs: In this step, the algorithm generates size k subgraphs for each tree t in Tk. Then, it joins t with each of these subgraphs to generate size k subgraphs with k edges and adds them to the candidate set Ck. Next, it checks the occurrences of each subgraph in Ck and, using the user-defined threshold F, determines whether it is a repeated subgraph; if so, it adds the subgraph to the set of frequent subgraphs. Next, the repeated subgraphs are used to generate all possible k vertex and k edge subgraphs. Subsequently, the repeated subgraphs are joined with the newly generated subgraphs to obtain (k + 1) edge subgraphs, which are added to the set of frequent subgraphs. This process is repeated until no repeated subgraph can be found or a complete graph of k(k − 1)/2 edges is reached. Finally, the algorithm outputs a set of the repeated trees and subgraphs from size 2 to size K.
(ii) Verifying the frequency of repeated subgraphs in the random networks: In this step, the algorithm employs a Markov chain algorithm for generating random networks, which have the same single vertex characteristics as the PPI network. Then, it verifies the frequency of the frequent subgraphs in each random network.
(iii) Verifying the uniqueness values of repeated subgraphs: Lastly, the algorithm calculates the uniqueness value for each frequent subgraph using its frequencies in the PPI network and the random networks.
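Step (ii) above relies on random networks that keep the single vertex characteristics of the input network. A minimal sketch of one common way to obtain such networks, degree-preserving edge switching on a directed edge list, is given below. It is an illustration under stated assumptions and not NeMoFinder's implementation: the function names `degree_sequences` and `switch_edges`, the fixed number of switches, the attempt limit and the seed are all hypothetical choices, and the input is assumed to contain no self-loops or duplicate edges.

```python
import random
from collections import Counter


def degree_sequences(edges):
    """Return (out-degree, in-degree) counters for a directed edge list."""
    out_deg = Counter(u for u, _ in edges)
    in_deg = Counter(v for _, v in edges)
    return out_deg, in_deg


def switch_edges(edges, n_switches, seed=0):
    """Return a randomized copy of `edges` with every in/out-degree preserved.

    Hypothetical helper: repeatedly picks two edges a->b and c->d and rewires
    them to a->d and c->b, rejecting swaps that would create a self-loop or a
    duplicate edge.
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    done, attempts = 0, 0
    max_attempts = 100 * n_switches  # assumption: give up on hard inputs
    while done < n_switches and attempts < max_attempts:
        attempts += 1
        i = rng.randrange(len(edge_list))
        j = rng.randrange(len(edge_list))
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if a == d or c == b:
            continue  # the swap would create a self-loop
        if (a, d) in edge_set or (c, b) in edge_set:
            continue  # the swap would create a duplicate edge
        edge_set.difference_update([(a, b), (c, d)])
        edge_set.update([(a, d), (c, b)])
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


if __name__ == "__main__":
    network = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3)]
    randomized = switch_edges(network, n_switches=20, seed=1)
    print(degree_sequences(randomized) == degree_sequences(network))
```

Each accepted swap leaves the out-degree of a and c and the in-degree of b and d unchanged, which is why the degree sequence survives any number of switches.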
NeMoFinder is scalable because it partitions the network into a set of graphs, so that counting the frequency of a size k subgraph in the network is reduced to finding the number of graphs that contain the subgraph, a measure that is downward closed. Thus, this algorithm can analyze scale-free networks. NeMoFinder also adopts the idea in SPIN [45] of searching for repeated trees and extending them to subgraphs, which reduces the computational complexity [34]. SPIN (SPanning tree-based maximal graph mINing) is a spanning tree-based frequent subgraph mining algorithm that mines only maximal frequent subgraphs from a graph database [45]. Maximal frequent subgraphs are subgraphs that are not contained within any other frequent subgraphs [45]. Frequent subgraph mining extracts subgraphs whose frequency is above a specified threshold in a given dataset [46]. Because SPIN only mines maximal frequent subgraphs, it can reduce the size of the output as well as the computational time significantly [45]. NeMoFinder differs from SPIN in that it counts the occurrences of a subgraph in a network, while SPIN only verifies whether a subgraph occurs in a graph. Moreover, NeMoFinder discovers repeated unlabeled subgraphs from a single graph, while SPIN employs equivalence classes for finding maximal labeled frequent subgraphs in a set of graphs [34].

The algorithm also implements the Graph Cousins technique. This technique reduces the computational time for generating candidate subgraphs and for the frequency counting used to find repeated subgraphs. The traditional way of generating a subgraph candidate from a tree is to add a new edge to that tree. The resulting graph is then checked against the candidate set. However, the candidate set can become very large, and checking whether a graph exists in the candidate set involves graph isomorphism checking. Thus, the Graph Cousins technique is designed to overcome this complexity and reduce the computational time.
There are three types of graph cousins between graphs g and h. Type I or Direct Cousin has h isomorphic to a subgraph g′, which has the same number of vertices and edges as g but g ≠ g′. Type II or Twin Cousin has h isomorphic to subgraph g. Type III or Distant Cousin has h as a disconnected subgraph [34]. NeMoFinder is written in C++, and it can find network motifs up to size 12 in the PPI network [34].

4. FANMOD

FANMOD [33] is a tool developed in 2006 for fast network motif detection. It implements the RAND-ESU algorithm [37] discussed above for enumerating and sampling subgraphs [33]. This algorithm uses F1 frequency for finding motifs. FANMOD is written in C++ and it can detect network motifs up to size 8. The tool can also detect motifs in colored networks. FANMOD uses the canonical graph labeling algorithm implemented in NAUTY [47] for grouping subgraphs into isomorphic subgraph classes [33]. NAUTY (No AUTomorphisms, Yes?) is a software package containing several programs written in C that implement McKay’s algorithm for determining the automorphism group of a vertex-colored graph and for computing the canonical labeling, which is used for testing graph isomorphism. Two graphs are isomorphic if they have the same canonical labeling [48].

FANMOD calculates the frequency of subgraph classes in a number of random graphs specified by the user. These random graphs are created from the original network by switching edges between vertices while preserving the degree sequence of the original network. FANMOD allows selecting different switching schemes for generating random graphs. The tool is much faster than mfinder and MAVisto [33]. FANMOD accepts only an edge-list text file as input. However, its output options are far richer: it is able to generate HTML files containing basic visualizations of the network motifs.
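The canonical labeling idea mentioned in the FANMOD description above (two graphs are isomorphic if they have the same canonical labeling) can be illustrated with a brute-force sketch for very small directed graphs. This is not McKay's algorithm and it does not call NAUTY; it simply tries every vertex ordering, which is only feasible for a handful of vertices. The function names `canonical_label` and `isomorphic` are hypothetical, and vertices are assumed to be numbered 0 to n − 1.

```python
from itertools import permutations


def canonical_label(n, edges):
    """Return the smallest adjacency-matrix bit string over all relabelings.

    Brute force over all n! vertex orderings; a stand-in for the far more
    efficient canonical labeling computed by tools such as NAUTY.
    """
    edge_set = set(edges)
    best = None
    for perm in permutations(range(n)):
        # perm[i] is the original vertex placed at row/column i.
        bits = "".join(
            "1" if (perm[i], perm[j]) in edge_set else "0"
            for i in range(n)
            for j in range(n)
        )
        if best is None or bits < best:
            best = bits
    return best


def isomorphic(n, edges_a, edges_b):
    """Two n-vertex digraphs are isomorphic iff their canonical labels match."""
    return canonical_label(n, edges_a) == canonical_label(n, edges_b)


if __name__ == "__main__":
    feed_forward_loop = [(0, 1), (0, 2), (1, 2)]
    relabelled_loop = [(2, 0), (2, 1), (0, 1)]
    three_cycle = [(0, 1), (1, 2), (2, 0)]
    print(isomorphic(3, feed_forward_loop, relabelled_loop))
    print(isomorphic(3, feed_forward_loop, three_cycle))
```

Because the label depends only on the structure of the graph, it can serve as a dictionary key for grouping enumerated subgraphs into isomorphism classes.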
FANMOD also allows exporting the results to different formats for further analysis [33]. FANMOD’s relative speed, ease of use and rich customization make it one of the most competitive tools available today. However, its memory usage increases remarkably as the subgraph size and network size increase [49].

5. Grochow-Kellis

Grochow-Kellis [29] is a network motif detection algorithm developed in 2007 for detecting large network motifs based on a novel symmetry-breaking technique. Because the algorithms that use the network-centric approach can only detect network motifs up to size 8, this algorithm takes a new approach, called motif-centric, for discovering larger network motifs. The algorithm achieves an exponential speedup because the symmetry-breaking technique eliminates repeated isomorphism checking. As a result, the algorithm can detect network motifs up to size 15. Furthermore, it can be applied to any type of network. The algorithm uses F1 frequency for finding motifs [29].

The algorithm has five distinguishable features. First, instead of enumerating subgraphs, which increases the complexity, the algorithm exhaustively looks for the instances of a single query graph in the network by using McKay’s geng and directg programs [47]. These programs are part of the NAUTY [47] package. The geng program generates small graphs, while the directg program generates small digraphs with a given underlying graph [50]. Second, the algorithm maps the query graph to the network in all possible ways for checking isomorphic subgraphs. Third, it uses a novel technique, subgraph symmetries, which allows each instance of a single query graph to be found only once. This technique speeds up the algorithm by an exponential factor. It also allows writing discovered instances to the disk, which improves memory usage.
Fourth, the algorithm performs better isomorphic subgraph checking than other motif finding algorithms because it considers the degree of each node as well as the degrees of each node’s neighbors. Finally, the algorithm uses a subgraph hashing technique that hashes graphs by their degree sequences. This technique improves the isomorphic subgraph checking process significantly [29].

The algorithm has five noticeable advantages. First, it can find larger motifs up to size 15. Second, it can query a particular subgraph to check its significance. Third, it is able to group all discovered instances of a given subgraph into clusters so that larger structures can be examined from the formation of these clusters. Fourth, it can save time and space. Finally, the algorithm can be easily parallelized, which is advantageous for future improvement. The algorithm was implemented in Java [29].

6. Kavosh

Kavosh [49] is a network motif finding algorithm developed in 2009. It is based on counting all size k subgraphs of the target network. The goal of this algorithm is to find network motifs of any given size with less memory usage and lower CPU time. Kavosh can find network motifs larger than eight nodes. It uses the F1 frequency threshold. Kavosh is written in C++. The algorithm contains four major steps as follows [49].

(i) Enumeration: This step finds all subgraphs of a given size in the target network. The algorithm implements an efficient method for enumerating size k subgraphs as follows. To count all size k subgraphs of a given graph whose vertices are numerically labeled, the algorithm finds all subgraphs that include a particular vertex. Then, it removes that vertex from the network and repeats the process for each successive vertex. To count the size k subgraphs that include a particular vertex, the algorithm builds trees that have special properties and restrictions as follows.
They have a maximum depth of k and they are rooted at this vertex. The children of each vertex are its incoming and outgoing adjacent vertices. Each vertex appears only once, so no duplicate vertices are allowed. The children in a tree must have numerical labels larger than the label of the root of that tree. These properties result in each subgraph being counted only once [49]. Kavosh also implements the revolving door ordering algorithm [49, 51], which is the minimal change order in which two consecutive objects differ by exactly two positions [52]. This ordering saves time because the calculation performed on each object differs only slightly from that of its predecessor [53]. The revolving door algorithm is known to be the fastest algorithm for generating the combinations of vertices used in enumerating subgraphs [49].
(ii) Classification: This step classifies discovered subgraphs into isomorphic classes. The algorithm employs NAUTY [47] for finding isomorphic subgraphs. It passes the adjacency matrix of each subgraph discovered in the previous step to NAUTY, which generates a canonical labeling that serves as the class identifier of that subgraph [49].
(iii) Random graph generation: This step generates random graphs that preserve the degree sequence of the target network. The algorithm implements a switching method that is similar to Milo’s random model [35, 54] for generating random graphs. This switching method is described in the section ‘Random network generations’ [49].
(iv) Motif identification: This step identifies motifs among the discovered subgraphs based on statistical parameters such as frequency, Z-score and P-value [49].

7. MODA

MODA (network MOtif Discovery Algorithm) [55] is a network motif detection algorithm developed in 2009. It was designed to target large network motifs (greater than size 8) efficiently. MODA is written in C#. The algorithm uses F1 frequency for finding network motifs.
MODA implements the pattern growth approach [17], which reduces the cost of isomorphic subgraph checking. This approach starts with size k trees, which are extended until a complete graph with k nodes is reached. If the current query graph is a supergraph of the previous query graph, the algorithm reuses the information from the previous one to calculate the frequency of the current one. This technique reduces the computational time [55].

The algorithm also utilizes expansion trees that extend minimal query graphs by adding edges to them until a complete graph is obtained. The expansion tree Tk has the following distinguishable characteristics. Each node except for the root is a query graph of size k. These query graphs become more complete when traversing down the tree. The root is k, which is the size of a query graph. Each node at the ith level is a graph of size k that contains (k − 2 + i) edges. The first level contains as many nodes as there are nonisomorphic trees of size k. Each node except for the root is a graph that is nonisomorphic to all other graphs in Tk. Each node except for the root is a subgraph of its child. There is only one leaf node, at level $(k^2 - 3k + 4)/2$, which is a complete graph with k nodes and $k(k-1)/2$ edges. Each node also contains an adjacency matrix corresponding to the graph. The expansion tree is generated by following a particular procedure, and it is created only once. The tree is a static data structure, so it can be stored and retrieved whenever the algorithm needs it.

The expansion tree Tk is also used for calculating the frequency of the subgraphs. The algorithm also implements a mapping module for calculating subgraph frequency. The mapping module allows storing calculated mappings in memory for later use. This mapping module implements the symmetry-breaking technique [29] so that each subgraph is counted only once. In addition, the algorithm implements sampling throughout the network inside the mapping module to speed up the mapping process. It also implements an enumeration module, which speeds up the subgraph frequency calculation [55].

8. G-Tries

G-Trie (Graph reTRIEval) [56] is a specialized data structure developed in 2010. It is based on the prefix tree, which allows common topology to be shared. This data structure allows building a multiway tree with the property that the descendants of a node share a common substructure. It also allows storing subgraphs, computing the frequency of subgraphs efficiently, and searching efficiently for network motifs. Because of its shared common structure, G-Trie uses F1 frequency for finding motifs [56].

G-Trie has the following characteristics. Each node of the tree represents a single graph vertex and its corresponding edges to predecessors. Nodes that have common predecessors share common substructures. Each node is a subgraph of its children. A graph becomes more complete when traversing down the tree. Each vertex is assigned an index, and the index increases when traversing down the tree. Each graph is represented by an adjacency matrix. The tree is built by following a particular procedure. It starts with the root, and one subgraph is inserted into the tree at a time. Canonical labeling is used to ensure that isomorphic graphs produce the same adjacency matrix for the same G-Trie. It also guarantees that the vertices with the largest number of edges appear first in the matrix. Because G-Trie allows sharing common substructures, the more substructures are shared, the less memory is needed and the smaller the tree becomes.
When there are few or no common substructures, the tree requires a substantial amount of storage space as the motif size and the network size increase. However, once the tree is constructed, searching and retrieval become more efficient. This data structure eases the subgraph census and isomorphism checking. However, the algorithm still uses the NAUTY tool [47] for isomorphism checking. To avoid overcounting subgraphs, a symmetry-breaking technique, which is similar to the technique in Grochow-Kellis [29], is implemented. G-Trie was implemented in C++ [56].

gtrieScanner [57], developed in 2012, is the only tool that implements the G-Trie data structure. It is a command-line tool, and it only allows finding one network motif size at a time. gtrieScanner takes the input network graph in text format and outputs the result in text or HTML format. The tool is written in C++. Its current release, version 0.1, is only for Linux systems. gtrieScanner is a limited preliminary tool, which is still under active development [57].

9. NetMODE

NetMODE (Network MOtif DEtection) [58] is a network motif detection software package developed in 2012. This is the first software package that does not depend on the NAUTY [47] tool for isomorphic subgraph checking. Although NAUTY is one of the fastest tools for isomorphic subgraph checking, calling it millions or billions of times is still too costly. However, the algorithm pays a cost for this independence by storing k-node subgraph data in memory for k ≤ 5 in its pretreatment phase. NetMODE uses a novel approach when k = 6. However, it can only detect network motifs up to size 6. NetMODE was developed based on Kavosh [49], but it is not a variant of Kavosh, as it has its own features. NetMODE uses F1 frequency for finding motifs. The distinguishable features of NetMODE are as follows.
NetMODE stores all canonical labels in memory using brute-force search so that it does not have to call NAUTY [47]. However, this practice only works for 3 ≤ k ≤ 5. When k = 6, the algorithm uses a different approach, which involves the Reconstruction Conjecture in graph theory. The algorithm also contains two stages for finding size 6 motifs: (i) process the input network, and (ii) process the comparison graphs. When k ≥ 7, this approach is impractical [58].

NetMODE contains a variety of methods for sampling similar graphs. An appropriate method can be chosen based on the input network. It has several variants of the switching method that is similar to the method in [35]. This switching method is described in the section ‘Random network generations’. The switching method in NetMODE is a mixture of advantageous features from both Kavosh [49] and FANMOD [33]. NetMODE samples random graphs using a nonuniform distribution. It has an alternative method, which implements the local constant mode, for sampling similar graphs, but this method is slower than the switching methods. For subgraph enumeration, NetMODE employs the subgraph iteration procedure from Kavosh [49] but without using the revolving door algorithm [58].

NetMODE also has other features as follows. It contains a verbose mode, which allows users to analyze the isomorphic subgraphs returned in the input network and the comparison graphs. Its stdin/stdout can be interfaced with other packages such as R for further analyses such as drawing motifs. It also contains a burn-in feature in which some comparison graphs generated by the switching method are discarded. This feature leads to a better collection of comparison graphs. NetMODE also contains a high-performance computing feature for comparison graphs. Although this feature is a basic coarse-grained parallelism, it allows NetMODE to achieve near linear speedup [58].

10. Acc-MOTIF

Acc-MOTIF (accelerated Motif) [59] is a network motif detection software package developed in 2012. It implements combinatorial techniques for accelerating the motif-finding process. Acc-MOTIF contains a number of algorithms for exactly counting isomorphic subgraph motifs of size 3, 4 and 5 independently. Acc-MOTIF relies on two main techniques. First, instead of listing induced subgraphs, it calculates the number of isomorphic patterns. Second, it assigns an integer variable to each isomorphic pattern and increments it directly instead of checking for isomorphic subgraphs. The algorithms have complexities of $O(m\sqrt{m})$ for motif size 3 and $O(m^2)$ for motif size 4, where m is the number of edges in the network graph [59]. Acc-MOTIF uses F2 frequency for finding motifs. Its speed depends on the size of the network. Thus, it may not be viable for large target networks.

11. QuateXelero

QuateXelero is a network motif detection algorithm developed in 2013. It is written in C++. QuateXelero uses F1 frequency for finding motifs. The algorithm is based on FANMOD [33]. However, it aims to reduce the number of calls to NAUTY [47]: it minimizes this cost by implementing a quaternary tree data structure in the ESU algorithm in FANMOD for faster motif detection. An example of a quaternary tree can be found in Figure 3.

Figure 3: An example of a quaternary tree of depth 3 that has a root and three internal nodes. One internal node has four children. The search for string ‘321’ starts at the root and visits children 3 and 2 in the path. The search is completed by adding a new leaf, which is number 1 (Courtesy of Khakabimamaghani et al. [5]).

A quaternary tree has the following properties. Each internal node can have at most four children and at most five neighbors: one parent and up to four children. An edge connecting a parent to a child can be labeled by a number, character or symbol.
Once the tree is constructed and labeled, it can be used to search for a given string containing the same set of symbols that are used for labeling the tree. The search for a particular string, for example string ‘321’ in Figure 3, starts at the root and propagates down the tree by visiting its children. First, the first symbol is read from the input string, which is 3 in this case. The current pointer is set to the root, and it moves to the child whose connecting edge label matches the symbol read from the input string. Then, the second symbol is read from the input string, and the process is repeated at the current node. If a symbol is not found among the children of the current node, a new child is added to the current node for that symbol, and the current pointer is moved to this new child. The search goes on until the search string is exhausted. Figure 3 depicts this process with the path containing dotted edges for the search string ‘321’. This quaternary tree allows partial classification of enumerated subgraphs and reduces the need to call NAUTY [5, 47].

QuateXelero contains three main steps: enumeration, classification and motif detection. In the enumeration step, QuateXelero uses the quaternary tree for enumerating subgraphs by building and extending the tree. In the classification step, the algorithm checks for isomorphic subgraphs by exploiting the quaternary tree data structure, which reduces the number of calls to NAUTY. Thus, the computational time is reduced drastically. In the motif detection step, the algorithm generates random networks using the same method as used in G-Trie [56]. Then, it calculates Z-scores for determining the significance of the motifs. Like G-Tries, QuateXelero consumes a considerable amount of memory in exchange for its speedup [5].

A summary of all 11 tools and algorithms discussed above can be found in Table 2.
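The quaternary tree search described for QuateXelero above can be sketched as a small trie whose nodes have at most four children. The sketch below is only an illustration of the search-and-insert walk, not QuateXelero's implementation: the class names, the method `search_or_insert` and the four-symbol alphabet "0123" are assumptions made for this example.

```python
class QuaternaryNode:
    """A tree node with at most four labeled children."""

    def __init__(self):
        self.children = {}  # edge label -> child node


class QuaternaryTree:
    ALPHABET = "0123"  # assumption: four symbols label the edges

    def __init__(self):
        self.root = QuaternaryNode()
        self.nodes_created = 0

    def search_or_insert(self, string):
        """Walk the tree along `string`, adding missing children on the way.

        Returns (node, created), where `created` is True if at least one new
        child had to be added, i.e. the string had not been seen before.
        """
        current = self.root  # the current pointer starts at the root
        created = False
        for symbol in string:
            if symbol not in self.ALPHABET:
                raise ValueError(f"unexpected symbol: {symbol!r}")
            child = current.children.get(symbol)
            if child is None:
                # Symbol not found among the children: add a new child.
                child = QuaternaryNode()
                current.children[symbol] = child
                self.nodes_created += 1
                created = True
            current = child  # move the pointer down the tree
        return current, created


if __name__ == "__main__":
    tree = QuaternaryTree()
    tree.search_or_insert("32")           # builds the path 3 -> 2
    _, is_new = tree.search_or_insert("321")
    print(is_new)                         # a new leaf for symbol 1 was added
    _, is_new = tree.search_or_insert("321")
    print(is_new)                         # the path already exists
```

In a motif finder built on this idea, one could cache a class identifier at the node returned by the walk, so that the expensive canonical labeling is needed only when `created` is True.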
RESULTS AND DISCUSSIONS

The network datasets used for evaluating the tools and algorithms are varied. They include biological networks, dictionaries, electronic networks, food webs, a power grid network, social networks, WWW networks and others. Some tools and algorithms use a single dataset. Others use a wide variety of dataset types. All tools and algorithms were evaluated on at least one biological network. Some tools and algorithms were evaluated on the same set of datasets for comparison purposes. A summary of these datasets can be found in Table 3.

mfinder

mfinder was evaluated on five different networks: the transcription network of E. coli (423 nodes and 519 edges), the transcription network of S. cerevisiae yeast (685 nodes, 1052 edges), the Caenorhabditis elegans (C. elegans) neural network (280 nodes, 2170 edges), the WWW network (325 000 nodes, 1 460 000 edges), and the food web of birds, fishes and invertebrates (83 nodes, 391 edges) [30]. The E. coli and the S. cerevisiae yeast networks are available from Uri Alon’s Complex Networks [30, 60]. The WWW network is the network of hyperlinks between web pages in the nd.edu domain [30].

The performance of mfinder was compared with the exhaustive enumeration method [3] on a WWW network [61] for motif size 3 and on a transcriptional regulation network of E. coli [1] for motif sizes 3–5. The results show that mfinder detected all network motifs found by the exhaustive enumeration method. In addition, the evaluation of mfinder on the neural network of C. elegans [62] shows that it can detect larger motifs of sizes 5 and 6 that are unreachable by the exhaustive enumeration algorithm. In general, it is able to find larger network motifs in larger networks, with runtimes that do not depend on the network size [30]. Moreover, the sampling method of mfinder is significantly faster than the exhaustive enumeration method. It is able to estimate the subgraph concentration with very high accuracy, even for subgraphs that have low concentration.
The evaluation results also show that mfinder can detect motifs up to size 7 [30]. The experimental results show that mfinder is able to detect most common network motifs, including the cascade-type motif (positive cascade), the hub-type motif (single-input module), bipartite-type motifs (dense overlapping regulons, bi-fan) and clique-type motifs (feed-forward loop, biparallel). In addition to common network motifs, mfinder is able to detect several other network motifs. Although the tool does not specify the forms of network motifs it is able to detect, it has the capability to discover network motifs from two to eight nodes. mfinder is publicly available. The tool can be run on Windows 2000, Windows XP and Linux. However, it is no longer supported.

MAVisto

The FPF algorithm [36] in MAVisto was tested only on a transcription network of S. cerevisiae yeast. This network comes from the transcriptional regulatory networks in S. cerevisiae, and it contains 62 nodes and 93 edges [36]. The performance of MAVisto was not compared with other tools and algorithms. The evaluation results show that it can detect network motifs up to size 7. In addition, different frequency concepts produce different results. For instance, if the analysis of a particular network searches for all possible occurrences of a pattern, then the frequency concept F1 would produce better results [36]. Thus, an appropriate frequency concept should be chosen for each particular analysis. Although the experimental results do not explicitly identify the forms of discovered network motifs and the tool does not specify the types of network motifs it is able to detect, MAVisto has the capability to detect network motifs from two to nine nodes. MAVisto is publicly available, but it is no longer supported.

NeMoFinder

NeMoFinder was evaluated on two real-life datasets. The Uetz [63] dataset consists of 957 PPIs and 1004 proteins of S. cerevisiae.
The MIPS CYGD dataset [64] has 10 199 PPIs and 4341 proteins after eliminating redundancy and orphan links from the whole genome PPI network of S. cerevisiae [34]. 7 8 12 8 15 >8 >8 Unspecified 9 6 5 Unspecified 12 mfinder MAVisto NeMoFinder FANMOD Grochow-Kellis Kavosh MODA G-Tries NetMODE Acc-Motif QuateXelero 5 6 9 12 15 8 13 9 8 Motif’s max. size Algorithm / tool name Output format Unspecified Unspecified Adjacency list Unspecified Unspecified Unspecified Unspecified Text Unspecified Unspecified Unspecified Unspecified Adjacency list Text (Text format) Unspecified Adjacency list Text, HTML (Text format) Unspecified Pajek-.Net, GML Graphics Adjacency list Text (Text format) Motif’s max. Input format size in acceptable running time Edge Sampling Method Algorithm Software package Software package Data structure Algorithm Networkcentric Only PPI network 2013 Yes Yes N/A 2012 2012 N/A N/A Yes N/A N/A N/A No No Maintenance? 2010 2009 2009 Tested on biolo- Networkcentric gical, social, and electronic networks Pattern growth, Tested on biolo- Motif-centric sampling, gical network symmetryonly breaking, expansion tree Pattern growth Various network Networktree, types centric Symmetrybreaking Exhaustive, pat- Tested on social Networkcentric and biological tern growth networks tree, various switching methods, Parallelism Exhaustive, Various network Motif-centric combinatorial types techniques Exhaustive, pat- Various network Networktern growth types centric quaternary tree 2006 2006 2005 2007 Any network Various network Networktypes centric Networkcentric Biological network only 2004 Network / Published motif centric year Various network Networktypes centric Networkspecific Motif-centric Frequent pattern finder (FPF) with pattern tree Algorithm Graph Cousins, pattern growth Best overall, tool, Randomized visualization enumeration, sampling, pattern growth tree Algorithm Symmetrybreaking, exhaustive, sampling Algorithm Exhaustive, pattern growth tree Tool, visualization 
and analysis Command-line tool Userfriendliness Table 2: A summary of 11 network motif finding tools and algorithms Yes Yes Yes No Yes Yes No Yes No Yes Yes Available online? Reference 34 29 56 55 59 Windows, Linux 5 All platforms Windows 32-bit 58 N/A N/A Windows 32-bit, 49 Linux N/A Windows 32-bit, 33 Mac, Linux N/A Windows 2000, 30 Windows XP, Linux OS All platforms 40 Platform

Table 3: A summary of the datasets used by motif finding tools and algorithms
Tool/algorithm | Dataset | Type | Number of nodes | Number of edges | Data source
mfinder [30] | Transcription network of E. coli | Biological network | 423 | 519 | Uri Alon’s Complex Networks
mfinder [30] | Transcription network of S. cerevisiae yeast | Biological network | 685 | 1052 | Uri Alon’s Complex Networks
mfinder [30] | Neural network of C. elegans | Biological network | 280 | 2170 | N/A
mfinder [30] | WWW network of hyperlinks between web pages in ndu domain | WWW network | 325000 | 1460000 | ndu domain
MAVisto [40] | Food web of birds, fishes and invertebrates | Food web | 83 | 391 | N/A
MAVisto [40] | Transcription network of S. cerevisiae yeast | Biological network | 62 | 93 | Young Lab (Transcriptional Regulatory Networks in S. cerevisiae)
NeMoFinder [34] | PPI network of S. cerevisiae yeast | Biological network | N/A | N/A | N/A
NeMoFinder [34] | PPI network of S. cerevisiae yeast | Biological network | N/A | N/A | N/A
FANMOD [33] | Transcription network of E. coli | Biological network | 423 | 519 | Uri Alon’s Complex Networks
FANMOD [33] | Transcription network of S. cerevisiae yeast | Biological network | 688 | 1079 | Uri Alon’s Complex Networks
FANMOD [33] | Neural network of C. elegans | Biological network | 306 | 2345 | N/A
FANMOD [33] | Food web of the Ythan estuary | Food web | 135 | 597 | N/A
Grochow-Kellis [29] | PPI network of S. cerevisiae | Biological network | 1379 | 2493 | N/A
Grochow-Kellis [29] | Transcription network of S. cerevisiae | Biological network | 685 | 1052 | N/A
Kavosh [49] | Metabolic pathway of E. coli network | Biological network | 672 | 1276 | Uri Alon’s Complex Networks
Kavosh [49] | Transcription network of S. cerevisiae yeast | Biological network | 688 | 1079 | Uri Alon’s Complex Networks
Kavosh [49] | Real social network | Social network | 67 | 182 | N/A
Kavosh [49] | Electronic network | Electronic network | 97 | 189 | N/A
MODA [55] | Transcription network of E. coli | Biological network | 423 | 519 | Uri Alon’s Complex Networks
G-Tries [56] | Network of common associations between a group of dolphins | Dolphins network | 62 | 159 | University of Michigan Network Data
G-Tries [56] | Electronic circuit network | Electronic network | 252 | 399 | Uri Alon’s Complex Networks
G-Tries [56] | Benchmark social network with heterogeneous communities | Social network | 1000 | 7770 | N/A
G-Tries [56] | PPI network of yeast | Biological network | 2361 | 6646 | Pajek datasets
G-Tries [56] | U.S.A. western states power grid network | Power grid network | 4941 | 6594 | University of Michigan Network Data
NetMODE [58] | Real social network | Social network | 67 | 182 | N/A
NetMODE [58] | Metabolic pathway of E. coli network | Biological network | 672 | 1276 | Uri Alon’s Complex Networks
NetMODE [58] | Transcription network of S. cerevisiae yeast | Biological network | 688 | 1079 | Uri Alon’s Complex Networks
NetMODE [58] | Complete directed graph | Directed graph | 50 | 2540 | N/A
Acc-MOTIF [59] | Transcription network of E. coli | Biological network | 418 | 519 | Uri Alon’s Complex Networks
Acc-MOTIF [59] | Transcription network of S. cerevisiae yeast | Biological network | 688 | 1079 | Uri Alon’s Complex Networks
Acc-MOTIF [59] | Roget (Roget.net, a directed network of cross-references in Roget’s Thesaurus) | Dictionary | 1022 | 5074 | Pajek datasets
Acc-MOTIF [59] | CS phd | Genealogy | 1882 | 1740 | Pajek datasets
Acc-MOTIF [59] | Epa | N/A | 4271 | 8965 | N/A
Acc-MOTIF [59] | California | N/A | 6175 | 16150 | N/A
Acc-MOTIF [59] | ODLIS (odlis.net, a directed network based on ODLIS: Online Dictionary of Library and Information Science) | Dictionary | 2900 | 18241 | Pajek datasets
Acc-MOTIF [59] | Words | N/A | 7381 | 46281 | N/A
Acc-MOTIF [59] | E. PairsFSG | N/A | 5018 | 63608 | N/A
Acc-MOTIF [59] | Foldoc (foldoc.net, a directed network of the Free On-line Dictionary of Computing) | Dictionary | 12905 | 109092 | Pajek datasets
QuateXelero [5] | Transcription network of S. cerevisiae yeast | Biological network | 688 | 1079 | Uri Alon’s Complex Networks
QuateXelero [5] | Metabolic pathway of E. coli | Biological network | 672 | 1275 | N/A
QuateXelero [5] | PPI network of the budding yeast | Biological network | 2361 | 6646 | Pajek datasets
QuateXelero [5] | Real social network | Social network | 67 | 182 | N/A
QuateXelero [5] | Dolphins social network | Social network | 62 | 159 | University of Michigan Network Data
QuateXelero [5] | Electronic circuit | Electronic network | 252 | 399 | Uri Alon’s Complex Networks

Figure 4: Runtimes for different network motif sizes for NeMoFinder, FPF (MAVisto), Sampling (mfinder) and Enumeration in the Uetz PPI network of S. cerevisiae [63] (Courtesy of Chen et al. [34]).

The performance of NeMoFinder was compared with other algorithms such as the enumeration method (Exhaustive Recursive Search) [3], the sampling method (edge sampling algorithm) [30] and FPF [17], as shown in Figure 4. The results show that NeMoFinder finds larger motifs and achieves better runtimes, with a 20- to 100-fold speedup in the Uetz PPI network. It can also detect all motifs up to size 13 within an acceptable running time for this network. NeMoFinder outperforms FPF with up to 100-fold speedup under various frequency thresholds. It can find motifs up to size 12 for the MIPS dataset [34].
Although the experimental results do not explicitly identify the forms of the discovered network motifs, and the algorithm does not specify the types of network motifs it is able to detect, NeMoFinder can detect network motifs of 2 to 13 nodes. NeMoFinder is only an algorithm, and its source code is not publicly available.

Figure 5: Comparison of runtimes for different network motif sizes for the Grochow-Kellis algorithm and two versions of the Milo et al. algorithm [3] in the PPI network of S. cerevisiae. The speedup of the Grochow-Kellis algorithm is also indicated (Courtesy of Grochow et al. [29]).
FANMOD
The RAND-ESU [37] algorithm in FANMOD was evaluated on four different networks: the transcription network of E. coli [1] (423 nodes, 519 edges), the transcription network of S. cerevisiae yeast [3] (688 nodes, 1079 edges), the neural network of C. elegans [30] (306 nodes, 2345 edges) and the food web of the Ythan estuary (135 nodes, 597 edges). The E. coli and S. cerevisiae yeast networks are available on Uri Alon’s Complex Networks [33, 60].
RAND-ESU was compared with the edge sampling algorithm (ESA) in mfinder on the four datasets above. The results show that RAND-ESU is faster than ESA by several orders of magnitude for subgraph sizes 5. RAND-ESU also gives more consistent sampling quality than ESA across different networks because it is unbiased and can estimate the total number of subgraphs [37].
Although the experimental results do not explicitly identify the forms of the discovered network motifs, and the tool does not specify the types of network motifs it is able to detect, FANMOD can detect network motifs of three to eight nodes. FANMOD is publicly available. Its latest update was in 2006. The tool can be run on Windows 32-bit, Mac and Linux.
Grochow-Kellis
The Grochow-Kellis algorithm was evaluated on two biological networks: the PPI network of S. cerevisiae yeast (1379 nodes, 2493 edges) and the transcription network of S. cerevisiae yeast (685 nodes, 1052 edges) [29]. The algorithm was compared with two versions of the Milo et al. algorithm, which is the exhaustive recursive search [3]. The result in Figure 5 shows that Grochow-Kellis achieves an exponential improvement in time over the other two algorithms [29].
Although the experimental results do not explicitly identify the forms of the discovered network motifs, and the algorithm does not specify the types of network motifs it is able to detect, Grochow-Kellis can detect all types of network motifs from 1 to 15 nodes. The software implementing this algorithm is not publicly available; it is accessible only by request [29].
Kavosh
Kavosh was evaluated on four different networks: the metabolic pathway of E. coli (672 nodes, 1276 edges), the transcription network of S. cerevisiae yeast (688 nodes, 1079 edges), the real social network (67 nodes, 182 edges) and the electronic network (97 nodes, 189 edges). The E. coli and S. cerevisiae yeast networks are available on Uri Alon’s Complex Networks [49, 60].
Kavosh’s performance was compared with mfinder [30], MAVisto [40] and FANMOD [33] in Table 4. For the E. coli network, Kavosh is comparable to FANMOD, but it outperforms the other tools and can find larger motifs in acceptable running times. For the yeast, social and electronic networks, Kavosh outperforms all other tools, and it can also find larger motifs in acceptable running times [49].
Although the experimental results do not explicitly identify the forms of the discovered network motifs, and the algorithm does not specify the types of network motifs it is able to detect, Kavosh can detect network motifs with three or more nodes. Kavosh’s source code is publicly available. Its latest update was in 2013. Kavosh can be run on Windows 32-bit and Linux.
MODA
MODA was tested only on the E. coli [1] transcription network, which contains 423 nodes and 519 edges. This network is available on Uri Alon’s Complex Networks [55, 60].
MODA was assessed for its computational time for enumerating subgraph appearances but not for determining the occurrences of the motifs. Its runtime was compared with Grochow-Kellis [29], mfinder [30], FANMOD [33] and MAVisto [40]. It could not be compared with NeMoFinder because no implementation of NeMoFinder is available. The comparison result in Figure 6 shows that MODA outperforms mfinder, Grochow-Kellis and MAVisto for enumerating subgraphs in the target network only. This comparison does not include the computational time for the randomized networks. The result shows that MODA is able to find size 9 motifs in an acceptable running time [55].
Although the experimental results do not explicitly identify the forms of the discovered network motifs, and the algorithm does not specify the types of network motifs it is able to detect, MODA can detect network motifs with two or more nodes. MODA’s source code is publicly available. Its latest update was in 2009.

Table 4: Performance comparisons between Kavosh, FANMOD [33], MAVisto [40] and mfinder [30] using E. coli network [64], social network [30] and electronic network [30]
E. coli S. cerevisiae Social Electronic Kavosh FANMOD MAVisto mfinder Kavosh FANMOD MAVisto mfinder Kavosh FANMOD MAVisto mfinder Kavosh FANMOD MAVisto mfinder 3 4 5 6 7 8 9 10 11 12 0.30 0.81 13 532 31 1.35 2.20 15 784 32 0.04 0.46 393 12 0.08 0.53 210.00 7.00 1.84 2.53 – 297 34.59 41.41 – 306 0.23 0.84 1492 49 0.36 1.06 1727.00 14.00 14.91 15.71 – 23 671.8 1003.92 1111.95 – 33 548.2 1.63 3.07 – 798 0.02 4.34 6 696 000.00 109.80 141.98 132.24 – – 20 212.99 24 292.05 – – 10.48 17.63 – 181076.8 11.39 24.24 – 2020.20 1374.01 1205.97 – – 746 385.86 926 745.34 – – 69.43 117.43 – – 77.22 160.00 – – 13173.74 9256.61 – – 17111178.28 18851135.4 – – 415.66 845.93 – – 422.61 967.99 – – 121110.31 – – – 337 076 691.32 – – – 2594.19 – – – 2823.70 – – – 112 0560.16 – – – 7 211199 226.13 – – – 14 611.23 – – – 18 037.56 – – – N/A N/A N/A N/A N/A N/A N/A N/A – – – – N/A N/A N/A N/A 135 752.35 – – – – – – – N/A N/A N/A N/A 997 893.27 – – –
Network motif size is listed across from 3 to 12.
The column underneath each motif size shows the runtimes in seconds for each algorithm (Courtesy of Kashani et al. [49]).

Figure 6: Runtimes of MODA, Grochow-Kellis [29], mfinder [30], FANMOD [33] and MAVisto [40] algorithms for motif sizes 3 to 9 (Courtesy of Omidi et al. [55]).

G-Trie
G-Trie was evaluated on a variety of networks: the dolphins social network [66, 67] (62 nodes, 159 edges), the electronic circuit network [3] (252 nodes, 399 edges), the benchmark social network with heterogeneous communities [68] (1000 nodes, 7770 edges), the PPI network of yeast [69, 70] (2361 nodes, 6646 edges) and the U.S.A. western states power grid network [67, 71] (4941 nodes, 6594 edges) [56]. The dolphins social network and the power grid network are available on the University of Michigan Network Data [56]. The PPI network of yeast is available on the Pajek datasets [72] site. The electronic circuit is accessible on Uri Alon’s Complex Networks [60].
The performance of G-Trie was compared with FANMOD [33] as the network-centric method and Grochow-Kellis [29] as the motif-centric method on the networks above, using the same common C++ platform for the original network and the random networks. The comparison results in Table 5 show that G-Trie outperforms FANMOD [33] and Grochow-Kellis [29] for all networks with different motif size ranges. They also show that G-Trie can detect motifs up to size 9 in efficient running times [56].
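The speedup columns of Table 5 can be checked against its timing columns. The sketch below assumes that each ratio is the baseline's average census time on random networks divided by G-Trie's; this reading is inferred from the published numbers rather than stated in the text, and the names `TABLE5_ROWS` and `speedup` are hypothetical. Because the inputs are rounded to two decimals, the recomputed ratios match the table only to about the second decimal place.

```python
# Average census time on random networks (seconds) and the speedup ratios
# reported for G-Trie, copied from Table 5 for three sample rows.
TABLE5_ROWS = [
    # (network, motif size, FanMod, Grochow, G-Trie, vs FanMod, vs Grochow)
    ("dolphins", 9, 493.98, 366.79, 24.84, 19.88, 14.76),
    ("social", 5, 531.65, 62.66, 22.11, 24.05, 2.83),
    ("yeast", 5, 400.13, 47.16, 14.98, 26.70, 3.15),
]


def speedup(baseline_seconds, gtrie_seconds):
    """Ratio of a baseline's runtime to G-Trie's runtime (hypothetical helper)."""
    return baseline_seconds / gtrie_seconds


recomputed = {}
for network, size, fanmod, grochow, gtrie, rep_fanmod, rep_grochow in TABLE5_ROWS:
    vs_fanmod = speedup(fanmod, gtrie)
    vs_grochow = speedup(grochow, gtrie)
    recomputed[(network, size)] = (vs_fanmod, vs_grochow)
    print(f"{network} k={size}: vs FanMod {vs_fanmod:.2f} (reported {rep_fanmod}), "
          f"vs Grochow {vs_grochow:.2f} (reported {rep_grochow})")
```

Under this assumption the census time on the original network does not enter the ratio, which is consistent with the random-network census dominating the total cost when many random networks are analyzed.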
Although the experimental results do not explicitly identify the forms of the discovered network motifs, and the algorithm does not specify the types of network motifs it is able to detect, G-Trie can detect network motifs with three or more nodes. G-Trie is only a data structure, and its source code is not publicly available.

Table 5: Comparison of G-Trie with FANMOD [33] and Grochow-Kellis [29] on five different networks (dolphins [66, 67], circuit [3], social [68], yeast [69, 70] and power [67, 71] networks)
Network | Motif size | Census original network: FanMod | Grochow | G-Trie | Average census on similar random networks: FanMod | Grochow | G-Trie | Speedup vs FanMod | Speedup vs Grochow
Dolphins | 5 | 0.07 | 0.03 | 0.01 | 0.13 | 0.04 | 0.01 | 16.00 | 4.75
Dolphins | 6 | 0.48 | 0.28 | 0.04 | 1.14 | 0.35 | 0.07 | 17.27 | 5.24
Dolphins | 7 | 3.02 | 3.44 | 0.23 | 8.34 | 3.55 | 0.46 | 18.21 | 7.74
Dolphins | 8 | 19.44 | 73.16 | 1.69 | 67.94 | 37.31 | 4.03 | 16.87 | 9.27
Dolphins | 9 | 100.86 | 2984.22 | 6.98 | 493.98 | 366.79 | 24.84 | 19.88 | 14.76
Circuit | 6 | 0.49 | 0.41 | 0.03 | 0.55 | 0.24 | 0.03 | 19.57 | 8.43
Circuit | 7 | 3.28 | 3.73 | 0.22 | 3.53 | 1.34 | 0.17 | 20.55 | 7.81
Circuit | 8 | 17.78 | 48 | 1.52 | 21.42 | 7.91 | 1.06 | 20.17 | 7.45
Social | 3 | 0.31 | 0.11 | 0.02 | 0.35 | 0.11 | 0.02 | 14.67 | 4.75
Social | 4 | 7.78 | 1.37 | 0.56 | 13.27 | 1.86 | 0.57 | 23.28 | 3.26
Social | 5 | 208.3 | 31.85 | 14.88 | 531.65 | 62.66 | 22.11 | 24.05 | 2.83
Yeast | 3 | 0.47 | 0.33 | 0.02 | 0.57 | 0.35 | 0.02 | 31.67 | 19.33
Yeast | 4 | 10.07 | 2.04 | 0.36 | 12.9 | 2.25 | 0.41 | 31.15 | 5.44
Yeast | 5 | 268.51 | 34.1 | 12.73 | 400.13 | 47.16 | 14.98 | 26.70 | 3.15
Power | 3 | 0.51 | 1.46 | 0 | 0.91 | 1.37 | 0.01 | 113.25 | 171.00
Power | 4 | 1.38 | 4.34 | 0.02 | 3.01 | 4.4 | 0.03 | 107.43 | 157.07
Power | 5 | 4.68 | 16.95 | 0.1 | 12.38 | 17.54 | 0.14 | 91.06 | 128.96
Power | 6 | 20.36 | 95.58 | 0.55 | 67.65 | 92.74 | 0.88 | 76.52 | 104.90
Power | 7 | 101.04 | 765.91 | 3.36 | 408.15 | 630.65 | 5.17 | 78.92 | 121.94
The execution time is in seconds. The speedup ratios are also indicated (Courtesy of Ribeiro et al. [56]).

NetMODE
NetMODE was tested on four different networks: the real social network [49] (67 nodes, 182 edges), the metabolic pathway of E. coli [49] (672 nodes, 1276 edges), the transcription network of S.
cerevisiae yeast [49] (688 nodes, 1079 edges) and the complete directed graph (50 nodes, 2540 edges). The social, E. coli and S. cerevisiae yeast networks come from the Kavosh source code, and they are accessible on Uri Alon’s Complex Networks [58, 60].
NetMODE was compared with Kavosh [49] and FANMOD [33], with and without multiple cores, using several switching methods. The comparison results in Table 6 show that NetMODE achieves better runtimes for 4-node and 6-node subgraphs for the yeast and social networks [58].
Although the experimental results do not explicitly identify the forms of the discovered network motifs, and the software does not specify the types of network motifs it is able to detect, NetMODE can detect network motifs of three to six nodes. NetMODE’s source code is publicly available. Its latest update was in 2012. NetMODE can be run on Windows 32-bit.

Table 6: Comparisons of runtimes in seconds between NetMODE, Kavosh [49] and FANMOD [33] under various switching methods for social network (4-node) [49] and transcription network of S. cerevisiae yeast (6-node) [49] (Courtesy of Li et al. [58])
Network | Tool/algorithm | Fixed bidirectional edges | No regard | Global constant | Local constant | Uniform local constant
Yeast (4-node subgraph census) | Kavosh | 214.2 | – | – | – | –
Yeast (4-node subgraph census) | FANMOD | – | 318 | 319 | 318 | –
Yeast (4-node subgraph census) | NetMODE | 12.1 | 12.6 | 12 | 12.3 | –
Yeast (4-node subgraph census) | NetMODE 4-core | 2.9 | 2.9 | 3 | 3 | –
Social (6-node subgraph census) | Kavosh | 50.5 | – | – | – | –
Social (6-node subgraph census) | FANMOD | – | 464 | 139 | 147 | –
Social (6-node subgraph census) | NetMODE | 33.5 | 87.2 | 31.4 | 34.9 | 34.5
Social (6-node subgraph census) | NetMODE 4-core | 10.3 | 24.1 | 10.1 | 10.8 | 11.6

Acc-MOTIF
Acc-MOTIF was evaluated on various networks selected from Uri Alon’s Complex Networks [60] and the Pajek datasets [72] site. The description of each dataset can be found in Tables 3 and 7.
Acc-MOTIF was compared with FANMOD [33] on the networks listed in Table 7. The result shows that it achieves a significant speedup over FANMOD [33] for motif sizes 3 and 4 [59].
Although the experimental results do not explicitly identify the forms of the discovered network motifs, and the software does not specify the types of network motifs it is able to detect, Acc-MOTIF can detect network motifs of three to five nodes. Acc-MOTIF software is publicly available. Its current version is 2.0, and it is still under active development [59].

Table 7: Comparisons of runtimes between Acc-MOTIF and FANMOD [33] for network motif sizes 3 and 4 on various networks (Courtesy of Meira et al. [59])
Network | Nodes (n), Edges (m) | Motifs k = 3 (milliseconds): acc-MOTIF | FANMOD | Motifs k = 4 (seconds): acc-MOTIF | FANMOD | Reference
Transcription network of E. coli | (418, 519) | 0.9 ± 0.1 | 2.7 ± 0.2 | 0.021 ± 0.001 | 0.08 ± 0.003 | 60
Transcription network of S. cerevisiae yeast | (688, 1079) | 0.9 ± 0.04 | 7.4 ± 0.6 | 0.043 ± 0.0004 | 0.19 ± 0.002 | 60
Roget (Roget’s Thesaurus) | (1022, 5074) | 2 ± 0.04 | 34 ± 0.5 | 0.27 ± 0.01 | 0.76 ± 0.01 | 71
CS phd (genealogy network) | (1882, 1740) | 1.2 ± 0.03 | 3.2 ± 0.2 | 0.055 ± 0.001 | 0.04 ± 0.0005 | 71
Epa | (4271, 8965) | 2.7 ± 0.3 | 131 ± 1 | 0.58 ± 0.01 | 9.2 ± 0.07 | 71
California | (6175, 16150) | 4.3 ± 0.1 | 216 ± 2 | 1.2 ± 0.01 | 12.6 ± 0.02 | 71
ODLIS (Online Dictionary of Library and Information Science) | (2900, 18241) | 8.1 ± 0.3 | 1025 ± 5 | 4.5 ± 0.03 | 210 ± 2 | 71
Words | (7381, 46281) | 46 ± 0.4 | 7028 ± 174 | 105 ± 0.7 | > 7200 | 60
E. PairsFSG | (5018, 63608) | 42 ± 0.3 | 1687 ± 19 | 13 ± 0.4 | 153 ± 3 | 71
Foldoc (Free On-line Dictionary of Computing) | (12905, 109092) | 92 ± 1 | 2938 ± 7 | 13.3 ± 0.5 | 439 ± 8 | 71

QuateXelero
QuateXelero was evaluated on six networks of different types: the transcription network of S. cerevisiae yeast [60] (688 nodes, 1079 edges), the metabolic pathway of E. coli [65] (672 nodes, 1275 edges), the PPI network of the budding yeast [69, 70] (2361 nodes, 6646 edges), the real social network [49] (67 nodes, 182 edges), the dolphins social network [66, 68] (62 nodes, 159 edges) and the electronic circuit network [3] (252 nodes, 399 edges). The E. coli, S. cerevisiae yeast and social networks are directed networks.
The PPI network of budding yeast and the dolphins network are undirected networks. The electronic circuit is both a directed and an undirected network [5]. The S. cerevisiae yeast and electronic circuit networks are available on Uri Alon’s Complex Networks [60]. The PPI network of the budding yeast is accessible on the Pajek datasets [72] site. The dolphins social network is available on the University of Michigan Network Data [5].
QuateXelero was compared with Kavosh [49] and G-Tries [56] on the networks above with different motif size ranges for the target network and the random networks. The comparison results can be found in Table 8. The results show that QuateXelero can detect motifs up to size 12 in acceptable running times. The results also reveal the following strengths and weaknesses of QuateXelero [5].
Comparison with Kavosh
QuateXelero is much faster than Kavosh in all situations. However, QuateXelero consumes a large amount of memory compared with Kavosh for constructing the quaternary tree [5].
Comparison with G-Tries
QuateXelero is always faster than the ESU of G-Tries for enumeration on original networks. It is generally faster for enumeration on random networks for smaller motifs. QuateXelero is better than the ESU of G-Tries for larger motif sizes in directed networks. However, for undirected networks, QuateXelero is better than the ESU of G-Tries for smaller and larger motifs but not for medium-sized motifs. The memory usage of QuateXelero and G-Tries is comparable for some networks. However, QuateXelero does not show better memory usage than G-Tries in general [5].
Although the experimental results do not explicitly identify the forms of the discovered network motifs, and the algorithm does not specify the types of network motifs it is able to detect, QuateXelero can detect network motifs with two or more nodes. QuateXelero’s source code is publicly available. Its latest update was in 2013. QuateXelero can be run on Windows and Linux.

Table 8: Comparisons of runtimes between QuateXelero and Kavosh [49] for network motif sizes 5-11 on various networks (Courtesy of Khakabimamaghani et al. [5])
Yeast Yeast Yeast Yeast Yeast Electronic Electronic Electronic Electronic Electronic Electronic Electronic Electronic E.coli E.coli E.coli E.coli E.coli E.coli E.coli Social Social Social Social Social Social Social 5 6 7 8 9 5 6 7 8 9 10 11 12 5 6 7 8 9 10 11 5 6 7 8 9 10 11 Kavosh QuateXelero Processing times Directed 23.4 0.5 Directed 438.5 8.9 Directed 14 056.2 166.4 Directed 22 4497 2609.5 Directed – 53 852.1 Directed / Undirected 0.13 0 Directed / Undirected 0.8 0.08 Directed / Undirected 5.9 0.3 Directed / Undirected 38.7 1.9 Directed / Undirected 278.2 11.9 Directed / Undirected 2614.2 71.2 Directed / Undirected – 493.3 Directed / Undirected – – Directed 0.48 0.05 Directed 4.3 0.3 Directed 45.3 2.8 Directed 410.7 23.6 Directed 4000 190.7 Directed – – Directed – – Directed 0.11 0.06 Directed 0.82 0.36 Directed 5.4 2.6 Directed 33.3 16.3 Directed 220.3 96.22 Directed – – Directed – – Network Motif size Directionality 46.80 49.27 84.47 86.03 – nan 10.00 19.67 20.37 23.38 36.72 – – 9.60 14.33 16.18 17.40 20.98 – – 1.83 2.28 2.08 2.04 2.29 – – QX G-Tries QX Average census on randoms ESU + G-Tries Total time QX G-Tries Memory 30.846 0.733 0.693 0.955 37.85 10.51 1.5 MB 532.806 11.201 11.909 17.856 651.07 190.20 2.3 MB 12 314.314 164.596 220.656 344.539 14494.60 3611.77 7.1MB – – – – – – – 848186.49 6205.98 13 544.20 23 950.802 915 907.47 125 962.31 711 M 0.184 0.015 0.014 0.009 2.13 1.28 1.2 MB 1.097 0.063 0.068 0.051 8.49 5.59 2.4 MB 7.780 0.390 0.376 0.302 45.81 31.29 8.6 MB – – – – – – – 65.89 2.34 2.360 2.604 79.18 16.11 42 M 483.41 13.89 11.626 14.962 550.76 90.76 206 M 3998.61 82.75 76.793 113.920 4438.76 663.11 1.0 G – 504.40 – 796.268 – 4557.15 – 0.612 0.063 0.126 0.037 15.63 5.51 4.5 MB 5.604 0.546 0.910 0.303 104.07 33.65 22.1MB 51.092 4.430 7.195 2.600 822.42 274.45 135.4 MB – – – – – – – 728.69 47.52 86.493 41.078 1223.27 264.96 1.2 G 6357.95 352.46 929.200 402.146 11461.69 2443.21 7.6 G 53 819.37 – 8834.432 – 101184.86 – 44.0 G 0.094 0.031 0.019 0.009 3.55 1.31 5.4 MB 0.581 0.218 0.118 0.070 22.74 9.00 30.7 MB 3.532 1.451 0.725 0.612 154.91 72.78 184.9 MB – – – – – – – 21.25 9.62 5.830 12.228 119.34 83.37 1.5 G 121.01 54.30 34.570 85.252 731.66 558.70 7.9 G 669.35 273.54 237.761 653.014 4527.95 4368.38 40.0 G ESU of G-Tries Comparison Census on original versus Kavosh 1.8 MB 2.5 MB 8.8 MB – 889 M 3.4 MB 3.9 MB 8.5 MB – 130 M 678 M 4.6 G 25 G 7.7 MB 13.6 MB 74.6 MB – 2.4 G 19 G – 2.7 MB 13.9 MB 143.7 MB – 2.8 G 18 G 59 G QX 60 60 60 60 60 3 3 3 3 3 3 3 3 72 72 72 72 72 72 72 49 49 49 49 49 49 49 Reference

CHALLENGES FOR COMPARING DIFFERENT TOOLS AND ALGORITHMS
mfinder and MAVisto are no longer supported. mfinder can be run only on older versions of Windows or Linux. MAVisto does not run on the current version of Java. There is no implementation of NeMoFinder available for testing this algorithm [55]. The source code for Grochow-Kellis is not publicly available and can be obtained only on request. G-Trie is only a data structure, and its source code is not publicly available. The gtrieScanner tool that implements G-Trie can find only one network motif size at a time [57]. Testing G-Trie and QuateXelero on large networks and larger motifs requires a considerable amount of memory. Different tools and algorithms accept different input formats. Converting the input into a format acceptable to each tool or algorithm involves developing procedures or scripts in a programming language.

OBSERVATIONS ON NETWORK MOTIF DETECTION TOOLS AND ALGORITHMS
mfinder is a tool developed in 2004 to overcome the drawbacks of the exhaustive enumeration method by implementing a subgraph sampling technique.
However, it suffers from biased subgraph sampling, so it must pay an extra, expensive cost to correct the biased estimates. Nonetheless, its gains are noticeable. It is significantly faster than the exhaustive enumeration method, and its runtime does not depend on the network size. It can detect motifs of sizes 5 and 6, which are inaccessible to the exhaustive enumeration method [30].
MAVisto is a tool developed in 2005 to provide the motif analysis and visualization that mfinder does not support. The tool offers more flexible input formats and a rich set of visualizations for motif analysis. MAVisto is fast only for detecting motifs of sizes 3–5 in directed networks, but it can find motifs up to size 8. MAVisto was designed for finding network motifs in biological networks only [40].
NeMoFinder is an algorithm developed in 2006 for detecting repeated and unique network motifs in PPI networks only. It is the first algorithm in the development timeline that is able to detect network motifs up to size 12. Its runtime is better than those of the enumeration method, the sampling method and FPF, which was a big improvement over the tools existing at that time. This algorithm can also analyze scale-free networks [34].
FANMOD is a tool developed in 2006 with the aims of fast network motif detection and better motif analysis through a graphical user interface and flexible output formats. It is an attractive tool because of its relative speed, ease of use, rich customization and flexible export formats. However, FANMOD is not able to find motifs larger than size 8 because of computational explosion: the number of calls to NAUTY for isomorphic subgraph checking becomes enormous as the motif size and network size increase. In addition, its memory usage increases remarkably as the subgraph size and network size increase [33].
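Why isomorphism checks become a bottleneck can be seen in a small experiment. The sketch below is not NAUTY's algorithm; it is a brute-force stand-in that tries every node ordering of a k-node directed subgraph and keeps the smallest flattened adjacency matrix as a canonical label, so two subgraphs receive the same label exactly when they are isomorphic. The function name `canonical_label` and the sample subgraphs are hypothetical. The loop visits k! orderings per subgraph (40 320 for k = 8), and a network-centric census has to classify every enumerated subgraph, which illustrates why the cost grows so quickly with motif size and network size.

```python
from itertools import permutations


def canonical_label(nodes, edges):
    """Brute-force canonical label of a small directed subgraph.

    Tries all orderings of `nodes` and returns the lexicographically smallest
    adjacency matrix, flattened row by row into a tuple of 0/1 entries.
    """
    edge_set = set(edges)
    best = None
    for order in permutations(nodes):
        bits = tuple(
            1 if (u, v) in edge_set else 0
            for u in order
            for v in order
        )
        if best is None or bits < best:
            best = bits
    return best


# Two feed-forward loops with different node names, and a 3-cycle for contrast.
ffl_a = canonical_label(["x", "y", "z"], [("x", "y"), ("y", "z"), ("x", "z")])
ffl_b = canonical_label([1, 2, 3], [(3, 1), (1, 2), (3, 2)])
cycle = canonical_label([1, 2, 3], [(1, 2), (2, 3), (3, 1)])

print(ffl_a == ffl_b)  # same isomorphism class
print(ffl_a == cycle)  # different isomorphism class
```

Dedicated canonical-labelling tools avoid the full k! search, but each call still has a cost, so reducing the number of calls is a natural target for optimization.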
Grochow-Kellis is an algorithm developed in 2007 for detecting large network motifs based on a novel symmetry-breaking technique. All tools and algorithms developed up to that point use the network-centric approach, which limits them in detecting larger network motifs because it requires a subgraph census of the entire target network. Grochow-Kellis is the first algorithm to use the motif-centric approach, which searches for a single query subgraph, for detecting large network motifs. A drawback is that not all of the query subgraphs it generates can be found in the target network, so some computational time is spent unnecessarily. However, the gains of this approach are very noticeable. The algorithm achieves an exponential speedup. It eliminates the limitations of memory usage. It can detect network motifs up to size 15, which makes it surpass all other existing tools and algorithms for finding large network motifs. The algorithm can also be applied to any type of network, and it can be easily parallelized [29].
Kavosh is an algorithm developed in 2009 with the goal of finding network motifs of any given size with less memory usage and lower CPU time. The algorithm was tested on biological, social and electronic networks, and the results show that it is able to detect network motifs larger than size 8. The algorithm surpasses mfinder, MAVisto and FANMOD, but it was not compared with other existing algorithms [49].
MODA is an algorithm developed in 2009, also with the goal of efficiently detecting motifs larger than size 8. It was tested only on the E. coli transcription network [1]. It outperforms Grochow-Kellis [29], mfinder [30], FANMOD [33] and MAVisto [40] for this network only [55].
G-Trie is a multi-way tree data structure developed in 2010 for storing subgraphs, which saves computational time and allows faster motif finding. It was tested on various network types and it is able to detect motifs up to size 9.
It outperforms FANMOD [33] and Grochow-Kellis [29], but it consumes a huge amount of memory in exchange for efficient search and retrieval [56].
NetMODE is a software package developed in 2012. It is the first software package that does not depend on NAUTY for isomorphic subgraph checking. However, the tradeoff for this independence is that it has to store k-node subgraph data in memory, which is a disadvantage because of the large memory consumption. NetMODE was tested on social and biological networks. It was compared with Kavosh [49] and FANMOD [33] but not with other algorithms. It is able to detect motifs up to size 6 [58].
Acc-MOTIF is a software package developed in 2012. It contains several algorithms for finding motifs of sizes 3, 4 and 5 independently. It uses combinatorial techniques for subgraph isomorphism checking and for finding significant motifs. Acc-MOTIF was tested on various network types for detecting motifs of sizes 3 and 4. It was compared with FANMOD [33] only, and the results show that it outperforms FANMOD significantly [59].
QuateXelero is an algorithm developed in 2013 to reduce the number of calls to NAUTY by implementing the quaternary tree data structure. It was tested on various network types. It outperforms Kavosh [49] in all cases and outperforms G-Tries [56] in some cases. It was not compared with other tools and algorithms. Like G-Tries, QuateXelero consumes a substantial amount of memory in exchange for its speedup [5].
Some tools and algorithms were designed for a specific network type. Others were designed for various networks. Newer tools and algorithms were designed to overcome shortcomings such as limited motif size, large memory usage or massive computational time. Because there are many challenges in developing an efficient network motif detection tool, the products released so far are only algorithms or limited tools.
However, newer tools and algorithms always show improvement in some aspects. We have seen these improvements in motif finding tools and algorithms throughout the years. FANMOD is a user-friendly tool for motif visualization and analysis, but it can detect motifs only up to size 8. None of the tools and algorithms that are able to detect motifs larger than size 8 provides a user-friendly interface or motif visualization and analysis. Although they are able to handle large networks and large motif sizes, they consume a large amount of memory or were tested on only one or a few networks.
Kim et al. [7] classified network motifs into structural network motifs and biological network motifs. Structural network motifs are detected based on structural uniqueness using scoring thresholds. Biological network motifs are detected based on biological significance, such as topological property and Gene Ontology (GO) term relevance, regardless of their structure. The authors showed that structural uniqueness can suggest biological network motifs but is not sufficient for determining them. Thus, they developed five algorithms, EDGEGO-BNM, EDGEBETWEENNESS-BNM, NMF-BNM, NMFGO-BNM and VOLTAGEBNM, for efficiently detecting biological network motifs using biological significance criteria. These algorithms also consider non-network motifs for their biological significance. The authors also validated the existence of the discovered biological network motifs based on three criteria: (i) motifs included in a complex, (ii) motifs included in a functional module and (iii) GO term clustering score. The protein complexes are defined as groups of proteins interacting mutually within a cell at the same time and place. The functional modules are groups of binding proteins participating in different cellular processes at different times. The GO term clustering score is the clustering score calculated based on GO term relevance.
Their validations revealed that, when biological significance is used, non-network motifs are also found to be biologically meaningful network motifs. The authors compared their algorithms with mfinder and with ESU and RAND-ESU in FANMOD. The comparisons showed that these algorithms produce more reliable biological network motifs than mfinder, ESU and FANMOD. In addition, they are capable of finding structural network motifs as well. However, these algorithms are not able to detect large biological network motifs [7]. Users may consider these algorithms for finding biological network motifs.
All tools and algorithms discussed above use their own strategies for detecting network motifs based on structural uniqueness. Thus, the detected network motifs can be considered structural network motifs [7]. Although MAVisto was designed for detecting network motifs in biological networks and NeMoFinder was designed for finding meso-scale network motifs in large PPI networks, none of the tools and algorithms above considers biological significance when finding network motifs or uses the three validation criteria above to validate the existence of discovered network motifs in biological networks. Additionally, these tools and algorithms do not consider non-network motifs for their biological significance but instead filter them out.
At this point, users may wonder which tool or algorithm is the right choice for their research. mfinder and MAVisto are no longer supported, and they are limited tools. mfinder does not support recent versions of Windows. MAVisto does not support recent versions of Java. There is no implementation of NeMoFinder available. FANMOD is a fast and user-friendly tool for various network types, but it can detect motifs only up to size 8. Grochow-Kellis is only an algorithm, but it can detect motifs up to size 15 and it can be used for any network. Its source code is not openly available and it can be obtained via request.
Kavosh is also an algorithm, and it was tested on biological, social and electronic networks only. Its authors claim that it can detect motifs larger than size 8, and the comparison results show that it can detect motifs of size 12 in an acceptable running time. MODA is also an algorithm. It was tested only on the E. coli transcription network, and it outperforms Grochow-Kellis, mfinder, FANMOD and MAVisto for this network only. Its authors claim that it can detect motifs larger than size 8, and the comparison results show that it can detect motifs of size 9 in an acceptable running time. G-Tries is only a data structure, but it can be used for various network types. Its source code is not publicly available. G-Tries outperforms FANMOD and Grochow-Kellis, and it can detect motifs up to size 9 in efficient running times. However, it consumes a huge amount of memory. NetMODE is a software package that can detect motifs up to size 6. It was tested on social and biological networks only. Acc-Motif is a tool that can detect motifs of sizes 3, 4 and 5 independently. It can be used for various network types, and it is much faster than FANMOD. It supports all platforms and is under active development. QuateXelero is only an algorithm, but it can be used for various network types. It outperforms Kavosh in all cases and G-Tries in some cases. The algorithm is fast, but it consumes a large amount of memory. It can detect motifs of size 12 in acceptable running times.

We have seen the pros and cons of each tool and algorithm. Hence, an appropriate tool or algorithm should be chosen carefully depending on the type of research being conducted and the computing resources available to the users locally. Any tool or algorithm that is developed should fulfill the needs of its users. The features that probably concern users most are accuracy, speed, motif size and ease of use. Therefore, we offer some general remarks as follows.
(i) Practical Motif Size Limit: This limitation of a tool or an algorithm is critical because discovering larger network motifs may answer several important questions. Does a given motif appear independently in the network, or do the instances of that motif combine to form larger structures? If it is the latter, then what is the function of these larger structures? Do different networks that share a certain network motif also share the same structural combinations of that motif? These questions can be answered by finding and analyzing large subgraphs [73]. Moreover, a small network motif size limits the scale at which features of organization in networks can be discovered. Hence, the possibility of insight into the function of organized structure at different scales is limited [74]. There is no specific number for the size of larger network motifs that need to be explored. However, the capability of a tool or an algorithm to discover larger network motifs provides the opportunity for making further discoveries.

(ii) User-Friendliness: Users rarely encounter the inner workings of a network motif detection tool. They simply use the provided interface to select parameters and observe progress. For this reason, it is vital that competitive tools have sensible user interfaces. Though some tools are very fast, their limited user interfaces and difficult setup procedures have led to a tepid reception by users.

(iii) I/O Formats: Graphs are stored in different formats, and it is important for a tool or an algorithm to accept a wide variety of input formats. It is also vital to provide motif visualization that helps users better observe the results, and to allow them to export the results into various formats for further analysis.

(iv) Sampling or Exhaustive: This is an important aspect of a tool or an algorithm, as it influences the accuracy of potential results. Thus, it is important that a tool or an algorithm be very clear about the kind of method it implements.

(v) Network-Specific: Some tools and algorithms are designed to target one or a few specific types of networks, whereas others can be used for any network type. Some tools were tested on only one or a few networks. Therefore, it is vital that a tool or an algorithm be specific about the type of target network it was developed for.

CONCLUSIONS AND FUTURE WORK

We have analyzed 11 different network motif detection tools and algorithms and discussed their individual strengths and weaknesses. We have seen improvements in network motif detection tools and algorithms over the years. However, several improvements are still needed in the field of network motif detection.

Discover larger motifs
New techniques need to be explored for developing better tools and algorithms that enable researchers to discover and analyze larger motifs. Currently, the literature is saturated with tools and algorithms that can find motifs in the single-digit size range. Discovering larger motifs poses a significant challenge for future development, as the exponential runtime has a significant impact on performance.

Improve runtime
The tools and algorithms that have succeeded up until now have implemented novel ways to speed up the computation. The most obvious way to decrease runtime is to implement parallel algorithms. The network motif detection problem is extremely parallel [58], and only recently have tools begun to exploit this fact. However, these tools and algorithms have only exploited coarse-grained parallelism. There are independent aspects of network motif detection that can be executed simultaneously to reduce the runtime. Some examples follow. Random networks can be processed simultaneously; this step of the algorithm is the simplest one to parallelize. A single query subgraph can be processed concurrently for different subgraph sizes.
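The first of these examples, processing the random networks simultaneously, can be illustrated with a minimal Python sketch. It is not taken from any of the tools surveyed here: the function names (count_ffl, randomize, ffl_z_score), the choice of the feed-forward loop as the query subgraph, degree-preserving edge switching as the randomization method and the z-score as the significance measure are all assumptions made for illustration. Each random network is generated and searched independently of the others, so the work can be handed to a pool of workers with no communication between them.

```python
import random
import statistics
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_ffl(edges):
    """Count feed-forward loops: distinct nodes a, b, c with a->b, b->c, a->c.

    Simplification: `edges` must hold unique directed edges, and extra edges
    among a, b and c are ignored (a non-induced count).
    """
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    total = 0
    for a, b in edges:
        if a == b:
            continue
        for c in succ.get(b, ()):
            if c != a and c != b and c in succ[a]:
                total += 1
    return total


def randomize(edges, seed, swaps_per_edge=10):
    """Return a random network with the same in- and out-degrees (edge switching)."""
    rng = random.Random(seed)  # one private generator per random network
    edge_list = list(edges)
    edge_set = set(edge_list)
    for _ in range(swaps_per_edge * len(edge_list)):
        i, j = rng.randrange(len(edge_list)), rng.randrange(len(edge_list))
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # Reject switches that would create a self-loop or a duplicate edge.
        if a == d or c == b or (a, d) in edge_set or (c, b) in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def _count_in_random_network(task):
    """Worker: build one random network and count the query subgraph in it."""
    edges, seed = task
    return count_ffl(randomize(edges, seed))


def ffl_z_score(edges, n_random=100, workers=4, use_processes=True):
    """Compare the FFL count of the input network with n_random random networks."""
    edges = sorted(set(edges))
    real = count_ffl(edges)
    tasks = [(edges, seed) for seed in range(n_random)]
    # Processes give CPU parallelism in standard CPython; the thread pool is
    # only a fallback for environments where worker processes are unavailable.
    pool_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with pool_cls(max_workers=workers) as pool:
        random_counts = list(pool.map(_count_in_random_network, tasks))
    mean = statistics.mean(random_counts)
    std = statistics.pstdev(random_counts)
    z = (real - mean) / std if std > 0 else float("nan")
    return real, random_counts, z


if __name__ == "__main__":
    # Toy directed network (hypothetical data, for illustration only).
    rng = random.Random(0)
    toy = {(rng.randrange(30), rng.randrange(30)) for _ in range(120)}
    toy = {(u, v) for u, v in toy if u != v}
    real, counts, z = ffl_z_score(toy, n_random=20, workers=2)
    print(f"FFLs in input: {real}; random mean: {statistics.mean(counts):.1f}; z = {z:.2f}")
```

Because every task depends only on the input edge list and its own seed, the counts are the same regardless of the number of workers; only the wall-clock time changes. Finer-grained schemes, such as splitting the enumeration within a single network, would need more careful work distribution than this sketch attempts.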
Isomorphic subgraph checking for different subgraph sizes can also be processed in parallel. Efficient fine-grained parallelism is perhaps the most crucial improvement currently needed for network motif detection.

Provide user-friendly interface
Most recent tools and algorithms that are capable of detecting motifs larger than 8 nodes do not provide a user-friendly interface for motif visualization and analysis. Thus, this feature should be included in future versions of these tools as well as in new tools, as it allows researchers to gain more insight into the behavior of network motifs.

Improve I/O
New tools and algorithms, and future versions of current ones, should accept a wide variety of input formats and allow results to be exported into various formats for further analysis.

Provide web tool
As of this writing, no web tool has been developed for network motif detection. It would be more convenient for users to use a tool via the web. This would save users time and resources, as they would not have to install the tool locally, and some tools might consume more resources than are available on a local machine.

There are many different paths for future development of network motif detection. We address some of them here. One direction would be to predict how motifs appear in a network, which could provide additional possible optimizations for future tools and algorithms. Furthermore, a smaller motif could be part of a bigger motif, which would mean that researchers could do a quick search for small motifs and then use them as seeds to find larger ones. This avenue of research has huge implications for developing efficient, large motif sampling algorithms. Another avenue is to reuse computation with little or no extra memory usage.

Key Points
Most network motif detection tools and algorithms that are able to detect motifs larger than 8 nodes do not provide a user-friendly interface or motif visualization and analysis.
Detecting large network motifs comes at the cost of a large memory tradeoff or unnecessary computational time.
Efficient fine-grained parallelism is one of the most crucial improvements needed for network motif detection.
No web tool has yet been developed for network motif detection.

FUNDING
This work was supported in part by the National Science Foundation (NSF) [OCI-1156837 to S.M., C.-H.H.]; and U.S. Department of Education Graduate Fellowships in Areas of National Need (GAANNs) [P200A130153 to N.T.L.T.].

References
1. Shen-Orr SS, Milo R, Mangan S, et al. Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 2002;31:64–8.
2. Albert I, Albert R. Conserved network motifs allow protein-protein interaction prediction. Bioinformatics 2004;20(18):3346–52.
3. Milo R, Shen-Orr S, Itzkovitz S, et al. Network motifs: simple building blocks of complex networks. Science 2002;298:824–7.
4. Allan EG, Turkett WH, Fulp EW. Using Network Motifs to Identify Application Protocols. Global Telecommunications Conference 2009. GLOBECOM 2009. Honolulu, Hawaii: IEEE, pp. 1–7.
5. Khakabimamaghani S, Sharafuddin I, Dichter N, et al. QuateXelero: an accelerated exact network motif detection algorithm. PLoS One 2013;8(7):e68073.
6. Wong E, Baur B, Quader S, et al. Biological network motif detection: principles and practice. Brief Bioinform 2011;13(2):202–15.
7. Kim W, Li M, Wang J, et al. Biological network motif detection and evaluation. BMC Syst Biol 2011;5:1–13.
8. Yeger-Lotem E, Sattath S, Kashtan N, et al. Network motifs in integrated cellular networks of transcription–regulation and protein–protein interaction. PNAS 2004;101(16):5934–9.
9. Alon U. Network motifs: theory and experimental approaches. Nat Rev Genet 2007;8:450–61.
10. Alon U. SnapShot: network motifs. Cell 2010;143:326.e1.
11. Madar D, Dekel E, Bren A, et al. Negative auto-regulation increases the input dynamic-range of the arabinose system of Escherichia coli. BMC Syst Biol 2011;5:1–9.
12. Jin G, Zhang S, Zhang X, et al. Hubs with network motifs organize modularity dynamically in the protein-protein interaction network of yeast. PLoS One 2007;11:e1207.
13. Ingram PJ, Stumpf MPH, Stark J. Network motifs: structure does not determine function. BMC Genomics 2006;7:108.
14. Lipshtat A, Purushothaman SP, Iyengar R, et al. Functions of bifans in context of multiple regulatory motifs in signaling networks. Biophys J 2008;94:2566–79.
15. Mangan S, Alon U. Structure and function of the feedforward loop network motif. PNAS 2003;100(21):11980–5.
16. Zhu X, Gerstein M, Snyder M. Getting connected: analysis and principles of biological networks. Genes Dev 2007;21:1010–24.
17. Schreiber F, Schwöbbermeyer H. Frequency concepts and pattern detection for the analysis of motifs in networks. Trans Comput Syst Biol III Lect Notes Comput Sci 2005;3737:89–104.
18. Schmidt C, Weiss T, Komusiewicz C, et al. An analytical approach to network motif detection in samples of networks with pairwise different vertex labels. Comput Math Methods Med 2012;2012:1–12.
19. Milo R, Itzkovitz S, Kashtan N, et al. Superfamilies of evolved and designed networks. Science 2004;303(5663):1538–42.
20. Przytycka TM. An important connection between network motifs and parsimony models. Res Comput Mol Biol 2006;3909:321–35.
21. Chen L, Qu X, Cao M, et al. Identification of breast cancer patients based on human signaling network motifs. Sci Rep 2013;3368:1–7.
22. Tsang J, Zhu J, van Oudenaarden A. MicroRNA-mediated feedback and feedforward loops are recurrent network motifs in mammals. Mol Cell 2007;26(5):753–67.
23. Albert I, Albert R. Conserved network motifs allow protein-protein interaction prediction. Bioinformatics 2004;20(18):3346–52.
24. Chen J, Hsu W, Lee ML, et al. Labeling network motifs in protein interactomes for protein function prediction. ICDE 2007;546–55.
25. Turkett W, Fulp E, Lever C. Graph mining of motif profiles for computer network activity inference. MLG 2011;1–8.
26. Lizier JT, Atay FM, Jost J. Information storage, loop motifs, and clustered structure in complex networks. Phys Rev E 2012;86:1–5.
27. Wu SF, Qian WY, Zhang JW, et al. Network motifs in the transcriptional regulation network of cervical carcinoma cells respond to EGF. Arch Gynecol Obstet 2013;287:771–7.
28. Kim W, Diko M, Rawson K. Network motif detection: algorithms, parallel and cloud computing, and related tools. Tsinghua Sci Technol 2013;18(5):469–89.
29. Grochow JA, Kellis M. Network motif discovery using subgraph enumeration and symmetry-breaking. Proceedings of the 11th Annual International Conference on Research in Computational Molecular Biology, 2007, Oakland, CA, USA, Vol. 4453. Springer Berlin Heidelberg, 2007, 92–106.
30. Kashtan N, Itzkovitz S, Milo R, et al. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 2004;20:1746–58.
31. Wernicke S. Efficient detection of network motifs. IEEE/ACM Trans Comput Biol Bioinform 2006;3(4):347–59.
32. Parida L. Discovering topological motifs using a compact notation. J Comput Biol 2007;14(3):300–23.
33. Wernicke S, Rasche F. FANMOD: a tool for fast network motif detection. Bioinformatics 2006;22:1152–3.
34. Chen J, Hsu W, Lee ML, et al. NeMoFinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2006, Philadelphia, PA, USA;106–15.
35. Milo R, Kashtan N, Itzkovitz S, et al. On the uniform generation of random graphs with prescribed degree sequences 2004. http://arxiv.org/abs/cond-mat/0312028 (27 December 2013, date last accessed).
36. Schreiber F, Schwöbbermeyer H. Towards motif detection in networks: frequency concepts and flexible search. Proceedings of the International Workshop on Network Tools and Applications in Biology 2004, Camerino, Italy;91–102.
37. Wernicke S. A faster algorithm for detecting network motifs. Algorithms Bioinform 2005;3692:165–77.
38. Parida L. Discovering topological motifs using a compact notation. J Comput Biol 2007;14(3):300–23.
39. Kashtan N, Itzkovitz S, Milo R, et al. Network motif detection tool MFinder tool guide. Technical report 2005. Rehovot, Israel: Departments of Molecular Cell Biology and Computer Science and Applied Mathematics, Weizmann Institute of Science, 2005.
40. Schreiber F, Schwöbbermeyer H. MAVisto: a tool for the exploration of network motifs. Bioinformatics 2005;21:3572–4.
41. Bachmaier C, Brandenburg FJ, Forster M, et al. Gravisto: graph visualization toolkit. Graph Drawing 2005;3383:502–3.
42. Fruchterman TMJ, Reingold EM. Graph drawing by force-directed placement. Softw Pract Exp 1991;21(11):1129–64.
43. Batagelj V, Mrvar A. Pajek – analysis and visualization of large networks. Graph Drawing 2002;477–8.
44. Himsolt M. Graphlet: design and implementation of a graph editor. Softw Pract Exp 2000;30(11):1303–24.
45. Huan J, Wang W, Prins J, et al. Spin: Mining maximal frequent subgraphs from graph databases. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2004, Seattle, WA, USA;581–6.
46. Jiang C, Coenen F, Zito M. A survey of frequent subgraph mining algorithms. Knowl Eng Rev 2013;28(01):75–105.
47. McKay BD. Practical graph isomorphism. Congr Numer 1981;30:45–87.
48. Hartke SG, Radcliffe AG. McKay’s canonical graph labeling algorithm. Contemp Math 2009;479:99–111.
49. Kashani ZRM, Ahrabian H, Elahi E, et al. Kavosh: a new algorithm for finding network motifs. BMC Bioinformatics 2009;10:318.
50. McKay B, Piperno A. Nauty and Traces. http://pallini.di.uniroma1.it/ (22 February 2014, date last accessed).
51. Kreher D, Stinson D. Combinatorial Algorithms: Generation, Enumeration and Search. Florida: CRC Press LLC, 1998.
52. Alamgir Z, Abbasi S. Combinatorial algorithms for listing paths in minimal change order. Comb Algorithmic Aspects Netw 2007;4852:112–30.
53. Nijenhuis A, Wilf HS. Combinatorial Algorithms for Computers and Calculators. London: Academic Press, 1978.
54. Maslov S, Sneppen K. Specificity and stability in topology of protein networks. Science 2002;296(5569):910–13.
55. Omidi S, Schreiber F, Masoudi-Nejad A. MODA: an efficient algorithm for network motif discovery in biological networks. Genes Genet Syst 2009;84:385–95.
56. Ribeiro P, Silva F. G-Tries: an efficient data structure for discovering network motifs. Proceedings of the 2010 ACM Symposium on Applied Computing 2010, Sierre, Switzerland;1559–66.
57. gtrieScanner - Quick Discovery of Network Motifs. http://www.dcc.fc.up.pt/gtries/ (28 December 2013, date last accessed).
58. Li X, Stones DS, Wang H, et al. NetMODE: Network motif detection without Nauty. PLoS One 2012;7(12):e50093.
59. Meira LAA, Maximo VR, Fazenda L, et al. Accelerated Motif Detection Using Combinatorial Techniques. Signal Image Technology and Internet Based Systems (SITIS), 2012 Eighth International Conference on 25-29 November, 2012;744–53.
60. Uri Alon Lab. http://www.weizmann.ac.il/mcb/UriAlon/ (30 December 2013, date last accessed).
61. Barabasi AL, Albert R. Emergence of scaling in random networks. Science 1999;286:509–12.
62. Achacoso TB, Yamamoto WS. AY’s Neuroanatomy of C. elegans for Computation. Boca Raton, FL: CRC Press, 1992.
63. Uetz P, Giot L, Cagney G, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000;403(6770):623–7.
64. Mewes HW, Frishman D, Guldener U, et al. Mips: a database for genomes and protein sequences. Nucleic Acids Res 2002;30(1):31–34.
65. The E. coli Database. http://www.kegg.com/ (2009, date last accessed).
66. Lusseau D, Schneider K, Boisseau OJ, et al. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Can geographic isolation explain this unique trait? Behav Ecol Sociobiol 2003;54(4):396–405.
67. Newman M. Network data. http://www-personal.umich.edu/~mejn/netdata/ (27 December 2013, date last accessed).
68. Lancichinetti A, Fortunato S, Radicchi F. Benchmark graphs for testing community detection algorithms. Phys Rev E 2008;78:1–6.
69. Bu D, Zhao Y, Cai L, et al. Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Res 2003;31(9):2443–50.
70. Batagelj V, Mrvar A. Pajek datasets. http://vlado.fmf.uni-lj.si/pub/networks/data/ (27 December 2013, date last accessed).
71. Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature 1998;393(6684):440–2.
72. Batagelj V, Mrvar A. Pajek datasets 2006. http://vlado.fmf.uni-lj.si/pub/networks/data/ (30 December 2013, date last accessed).
73. Kashtan N, Itzkovitz S, Milo R, et al. Topological generalizations of network motifs. Phys Rev E Stat Nonlin Soft Matter Phys 2004;70(3 Pt 1):031909.
74. Baskerville K, Paczuski M. Subgraph ensembles and motif discovery using a new heuristic for graph isomorphism. Phys Rev E 2006;74(5 Pt 1):051903.