August 1, 2008 17:26 WSPC - Proceedings Trim Size: 9.75in x 6.5in article

IMPROVED ALGORITHMS FOR ENUMERATING TREE-LIKE CHEMICAL GRAPHS WITH GIVEN PATH FREQUENCY

YUSUKE ISHIDA (1) [email protected]
HIROSHI NAGAMOCHI (1) [email protected]
LIANG ZHAO (1) [email protected]
TATSUYA AKUTSU (2) [email protected]

(1) Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Yoshida, Kyoto 606-8501, Japan
(2) Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan

This paper considers the problem of enumerating all non-isomorphic tree-like chemical graphs with given path frequency, where "tree-like" means that the graph can be viewed as a tree if multiple edges (i.e., edges with the same end points) and a benzene ring are treated as one edge and one vertex, respectively, and "path frequency" is a vector of the numbers of specified vertex-labeled paths that must be realized in every output. This and related problems have several potential applications, such as classification of chemical compounds, structure determination using mass-spectrum and/or NMR, and design of novel chemical compounds. Several studies have addressed this problem. Recently, Fujiwara et al. (2008) presented two formulations and gave a branch-and-bound algorithm for each of them, which combined efficient enumeration of non-isomorphic trees with bounding operations based on the path frequency and the atom-atom bonds to avoid the generation of invalid trees. In this paper, based on their work and a result of Nagamochi (2006), we introduce two new bounding operations, the detachment-cut and the H-cut, to further reduce the size of the search space. We performed computational experiments to compare our proposed algorithms with those of Fujiwara et al. (2008) using chemical compound data obtained from the KEGG LIGAND database (http://www.genome.jp/kegg/ligand.html).
The results show that our proposed algorithms are much faster than those of Fujiwara et al.

Keywords: chemical graph enumeration; chemical tree enumeration; path frequency; feature vector; detachment.

1. Introduction

Enumerating chemical graphs is one of the fundamental issues in chemoinformatics and bioinformatics, and it goes back to the 19th century (Cayley [6]). Its applications include structure determination using mass-spectrum and/or NMR spectrum [5, 11], virtual exploration of the chemical universe [9, 15], reconstruction of molecular structures from their signatures [8, 12], and classification of chemical compounds [7]. In these applications, enumeration of chemical graphs satisfying given constraints is important [2].

This paper considers the enumeration of chemical compounds with given path frequency, i.e., the numbers of specified vertex-labeled paths that must be realized in every output. The problem is motivated by the pre-image problem in machine learning [4]. In the pre-image problem, given a feature vector, a structure consistent with the feature vector is computed. The pre-image problem for chemical graphs has a potential application to the design of novel chemical compounds [2, 4], which is an important target of bioinformatics. Suppose that we have some potential function in a feature space, which reflects the pharmacological activity of chemical compounds and may be learned from training data. Then, a desired object is computed as a point in the feature space using the potential function and some optimization technique. Finally, a pre-image of the point is computed as a candidate for a novel chemical compound. Though this approach has not yet been shown to be better than existing approaches, it may yield chemical structures with better pharmacological activity than those in the training data.
Since feature vectors based on the frequency of labeled paths were successfully applied to classification of chemical compounds [13, 14], we consider the graph pre-image problem with given path frequency.

Akutsu and Fukagawa [1] first studied the computational complexity of the pre-image problem. They proved that the problem is NP-hard even if chemical graphs are restricted to trees. Since the problem is NP-hard and it is quite difficult to handle all chemical graphs, they developed a branch-and-bound algorithm for tree-like chemical graphs [2]. Recently, Fujiwara et al. [10] proposed a much more efficient branch-and-bound algorithm, which combined the tree enumeration algorithm of Nakano and Uno [17, 18] to generate non-isomorphic trees with bounding operations based on the path frequency and the atom-atom bonds to avoid the generation of invalid trees. To improve the efficiency, they also gave an alternative formulation of the problem by removing all hydrogens and replacing each group of multiple edges with a new virtual atom and two new single edges. Their experimental results showed that some instances with up to 61 atoms could be enumerated within 30 minutes on a standard PC. Their algorithms can also be applied to the classical problem of enumerating alkanes ($C_nH_{2n+2}$), which was considered by Cayley [6], and the latter of their algorithms was shown to be at least as fast as the fastest existing algorithm [3].

In order to further improve the efficiency, we introduce two new bounding operations in this paper. The first, the detachment-cut, is based on a result of Nagamochi [16]. The second, the H-cut, uses the information of the removed hydrogens and can therefore be applied only to the second formulation. We show that they can effectively reduce the size of the search space and thereby reduce the running time. Applying our algorithms to the same instances, we show that they are much faster than those of [10].
The proposed algorithms are faster than those of [10] in all the examined cases and are dozens of times faster in many cases. As in [10], our algorithms can also be extended to treat benzene rings.

The rest of the paper is organized as follows. Section 2 gives preliminaries and the first formulation. Section 3 shows the branch-and-bound framework following [10] and the new detachment-cut bounding operation. Section 4 gives the second formulation and the new algorithm that employs both the detachment-cut and the H-cut. We report some experimental results in Section 5 and conclude in Section 6.

2. Preliminaries and problem formulation

A graph is called a multigraph if multiple edges are allowed; otherwise it is simple. A multitree is a multigraph with neither cycles nor self-loops. A path P is a sequence $v_0, e_1, v_1, e_2, v_2, \ldots, e_k, v_k$ of distinct vertices $v_i$ and edges $e_j$ which join $v_{j-1}$ and $v_j$, $j = 1, \ldots, k$. When no confusion arises, we may write $P = (v_0, v_1, \ldots, v_k)$. The length |P| of path P is defined as k, i.e., the number of edges.

We are given a set $\Sigma = \{\ell_1, \ell_2, \ldots, \ell_s\}$ of s labels, which correspond to chemical elements. Let each label $\ell$ be associated with a valence $val(\ell) \in Z_+$, where $Z_+$ denotes the set of non-negative integers. A multigraph G is said to be Σ-labeled if each vertex v has a label $\ell(v) \in \Sigma$, and is called (Σ, val)-labeled if, in addition, the degree of each vertex v is $val(\ell(v))$, i.e., the valence of the element $\ell(v)$. Chemical compounds can be viewed as (Σ, val)-labeled, self-loopless and connected multigraphs, where vertices and labels represent atoms and elements, respectively. For a path $P = (v_0, v_1, \ldots, v_k)$, we call $\ell(P) = \ell(v_0), \ell(v_1), \ldots, \ell(v_k)$ the label sequence of P.
Given a label sequence t, let #t denote the number of paths P with $\ell(P) = t$ in the graph, where multiple edges are treated as a single edge and paths are considered "directed." The feature vector $f_K(G)$ of level $K \in Z_+$ of G is defined as the p(K, s)-dimensional vector whose entry $f_K(G)[t]$ ($|t| \le K$) represents #t, where $p(K, s) = (s^{K+2} - s)/(s - 1)$ for s > 1 and p(K, 1) = K + 1. For instance, s = 3 and K = 1 give p(1, 3) = (27 − 3)/2 = 12, which is the number of entries of the feature vector in Fig. 1. Fig. 1 illustrates an example.

Fig. 1. An illustration of a (Σ, val)-labeled multitree G and $f_1(G)$, where multiple edges are treated as one edge and paths are considered "directed." Here Σ = {C, O, H}, val(C) = 4, val(O) = 2, val(H) = 1, and the feature vector of level 1 is:

    t        H   O   C   HH  HO  HC  OH  OO  OC  CH  CO  CC
    f_1(G)   4   2   2   0   1   3   1   0   2   3   2   2

Let deg(v; G) denote the degree of a vertex v in a graph G. The problem can be formulated as follows (an alternative formulation will be given in Section 4).

Problem 1. Given a set Σ of s labels, a valence function $val : \Sigma \to Z_+$ and a feature vector g of level K, find all (Σ, val)-labeled multitrees T such that $f_K(T) = g$ and $\deg(v; T) = val(\ell(v))$ for all vertices v ∈ T.

For a given feature vector g, the entry g(t) specifies #t in an output graph. In particular, the number n of vertices is given by $n = \sum_{\ell \in \Sigma} g(\ell)$. To solve the problem, we start with an empty graph and repeatedly extend the current tree T by one vertex, trying each label $\ell \in \Sigma$ for the new vertex and keeping only valid trees (trees that have not violated any constraint on output trees), until the tree has n vertices. In order to avoid duplicate outputs, we follow the branch-and-bound framework of [10], which first defines a canonical representation for isomorphic trees, then lists them using the algorithm of [17, 18] (the branching operation) and discards invalid trees using some bounding operations.

3. The enumeration algorithm

Given a simple tree with n vertices, the valence constraint uniquely determines the multiplicities of all edges. Thus we consider listing all non-isomorphic Σ-labeled simple trees and obtain and check the corresponding multitrees by the valence constraint. The framework of the enumeration algorithm is shown in Fig. 2.

    Main:
        FOR all labels ℓ ∈ Σ DO
            Let T be the tree consisting of one (root) vertex labeled by ℓ
            Gen(T)
        DONE

    Gen(T):
        IF T has n vertices THEN
            Check if T is valid. If it is, output it.
        ELSE
            Extend T to T1, T2, ..., Tp for some finite p by appending a new leaf vertex
            FOR all such trees Ti DO
                Check if Ti is valid. If it is, call Gen(Ti) (do nothing otherwise).
            DONE
        ENDIF

Fig. 2. The framework of the enumeration algorithm.

Section 3.1 reviews the way of extending trees (the branching operation), which is exactly the same as in [10]. Section 3.2 describes how the validity is checked by several bounding operations, three from [10] and the new detachment-cut.

3.1. Canonical representation of trees and the branching operation

First of all, we need a representation for the output that is unique for isomorphic trees. For this purpose, we use the idea of the centroid-rooted left-heavy tree [10], where the centroid is defined by the next theorem (see also [3]).

Theorem 3.1 (Jordan 1869). For any tree with n vertices, either there exists a unique vertex $v^*$ such that each subtree obtained by removing $v^*$ contains at most $\lfloor (n-1)/2 \rfloor$ vertices, or there exists a unique edge $e^*$ such that both of the subtrees obtained by removing $e^*$ contain exactly $n/2$ vertices.

Such a vertex $v^*$ (resp., edge $e^*$) is called the unicentroid (resp., bicentroid) of the tree. For example, the tree in Fig. 1 has a bicentroid (the C-C edge).

To introduce "left-heaviness," we need an ordering among rooted trees.
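As an aside on Theorem 3.1, the centroid can be found from subtree sizes in linear time. The following is only a minimal sketch under our own assumptions, not the implementation used in [10] or in this paper: vertices are numbered 0, ..., n−1, the tree is given as a list of simple edges, and the function name `centroid` is hypothetical.

```python
from collections import defaultdict

def centroid(n, edges):
    """Return ("vertex", v) for a unicentroid or ("edge", (u, v)) for a bicentroid.

    Assumption of this sketch: the input is a tree on vertices 0..n-1,
    given as a list of simple (undirected) edges.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    # Iterative DFS from vertex 0; every vertex appears in `order` after its parent.
    parent = {0: None}
    order = []
    stack = [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                stack.append(w)

    # Subtree sizes, accumulated from the leaves upward.
    size = {u: 1 for u in order}
    for u in reversed(order):
        if parent[u] is not None:
            size[parent[u]] += size[u]

    # Bicentroid: an edge whose removal leaves two parts of exactly n/2 vertices.
    for u in order:
        if parent[u] is not None and 2 * size[u] == n:
            return ("edge", (parent[u], u))

    # Unicentroid: every part left after removing u has at most floor((n-1)/2) vertices.
    for u in order:
        parts = [size[w] for w in adj[u] if parent.get(w) == u]
        if parent[u] is not None:
            parts.append(n - size[u])
        if all(p <= (n - 1) // 2 for p in parts):
            return ("vertex", u)
```

For the tree of Fig. 1, encoded with the two carbons as vertices 0 and 1, the sketch reports the C-C edge as the bicentroid, in agreement with the example above.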
Let T be a tree of n vertices rooted at a vertex $v_0$ (which is not necessarily its centroid). Suppose that it is embedded in the plane, where $v_0$ is the top. Let $v_0, v_1, \ldots, v_{n-1}$ be indexed by the depth-first search (DFS) that starts from $v_0$ and visits vertices from left to right. The depth d(v) of a vertex v is defined as the length of the path from $v_0$ to v in T. The depth-label sequence of T is defined as $DL(T) = (d(v_0), \ell(v_0), d(v_1), \ell(v_1), \ldots, d(v_{n-1}), \ell(v_{n-1}))$. We say that T is rooted at an edge $(v_0, v_1)$ if $v_0$ and $v_1$ are the two tops, where we define d(v) as the minimum of the lengths of the $v_0,v$-path and the $v_1,v$-path. Then DL(T) can be defined as before. Now we have a one-to-one mapping between plane-embedded trees and depth-label sequences. See Fig. 3 for an illustration.

Fig. 3. Rooted trees and their depth-label sequences. Notice that $T_1$ and $T_2$ are isomorphic.

Given an (arbitrary) order of labels, define the order of depth-label sequences as follows. For any $T_1$ and $T_2$, we say $DL(T_1) > DL(T_2)$ if $DL(T_1)$ is lexicographically larger than $DL(T_2)$. We define $DL(T_1) \ge DL(T_2)$ analogously. In Fig. 3, we have $DL(T_1) > DL(T_2) > DL(T_3)$, supposing C > O > H. The canonical representation of a rooted tree is defined by the largest depth-label sequence among all its plane embeddings. This is equivalent to the left-heavy plane embedding (see [17, 18]); i.e., any two siblings (vertices having the same parent or the two vertices of the edge root) $v_i$ and $v_j$ with i < j satisfy $DL(T(v_i)) \ge DL(T(v_j))$, where T(v) denotes the subtree consisting of v and all its descendants. For example, $T_1$ and $T_3$ in Fig. 3 are left-heavy whereas $T_2$ is not.

Thus our branching task is to list all centroid-rooted left-heavy trees with n vertices and at most m labels. Following the scheme of [17, 18], we define a parent-child relation between two left-heavy trees.
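The depth-label sequence and the left-heavy embedding defined above can be illustrated with a short sketch. It is not the constant-time generation scheme of [17, 18]; it only canonicalizes one vertex-rooted tree by sorting siblings. The nested `(label, children)` encoding, the `RANK` table (which assumes the label order C > O > H used for Fig. 3) and the function names are assumptions made for this illustration.

```python
# Assumed label order C > O > H, as supposed for Fig. 3.
RANK = {"H": 0, "O": 1, "C": 2}

def dl(tree, depth=0):
    """Depth-label sequence of a plane-embedded rooted tree.

    A tree is encoded as (label, [children]); the sequence is returned as a
    list of (depth, label rank) pairs in DFS (left-to-right) order, which
    compares lexicographically like the flattened sequence DL(T).
    """
    label, children = tree
    seq = [(depth, RANK[label])]
    for child in children:
        seq.extend(dl(child, depth + 1))
    return seq

def left_heavy(tree):
    """Return the left-heavy embedding: siblings sorted by non-increasing DL."""
    label, children = tree
    canon = [left_heavy(c) for c in children]
    # Siblings share the same depth, so comparing their sequences computed
    # from relative depth 0 gives the same order as comparing DL(T(v)).
    canon.sort(key=dl, reverse=True)
    return (label, canon)

# Two plane embeddings of the same rooted tree (hypothetical example).
t_a = ("C", [("H", []), ("O", [("H", [])]), ("C", [("H", []), ("H", []), ("H", [])])])
t_b = ("C", [("C", [("H", []), ("H", []), ("H", [])]), ("H", []), ("O", [("H", [])])])
```

Both `t_a` and `t_b` are mapped to the same left-heavy tree, which is what makes the representation usable for discarding isomorphic duplicates.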
The parent P(T) of a left-heavy tree T is obtained from T by removing its rightmost leaf. If T is rooted at a vertex, or at an edge $(v_0, v_1)$ where $v_1$ is not the rightmost leaf, then the root of P(T) remains unchanged. Otherwise we change the root to vertex $v_0$ since $v_1$ is removed. Clearly P(T) is still left-heavy. In this way we can define a family tree F(n, m) of left-heavy trees whose leaves are exactly what we want, i.e., the centroid-rooted left-heavy trees with n vertices and at most m labels. Notice that, in general, a non-leaf node in the family tree may not be rooted at its own centroid. Therefore we only need to enumerate the (leaf) nodes of F(n, m). This can be done by starting from the empty tree (the root node of F(n, m)) and repeatedly appending a new leaf to some appropriate place on the rightmost path. For that purpose, our branching operation employs the algorithm due to [17, 18], which extends the current tree T (i.e., finds a child of T) in constant time. See [10] for details.

3.2. Bounding operations

Next we explain how to check the validity of a tree T generated during the branching operation. If we can conclude that T and all its descendants are not valid, then we can discard T, i.e., skip the task of appending leaves to T. Our algorithm discards T if at least one of the following criteria is violated.

(C1) The root of T remains the centroid of an output (the centroid constraint);
(C2) $f_K(T) \le g$ (the feature vector constraint);
(C3) $\deg(v; T) \le val(\ell(v))$ for all v ∈ T (the valence constraint);
(C4) T can be extended to a connected and loopless tree with n vertices (the detachment constraint).

The first three are the same as in [10] and are not difficult to check (see [10]). In the following, we explain how to check the last one. We need some definitions. Let G = (V, E) be a multigraph which may have self-loops.
Given a function $r : V \to Z_+$, an r-detachment of G is a multigraph H obtained from G by splitting each vertex v ∈ V into a set of r(v) copies of v, denoted by $W_v = \{v^1, v^2, \ldots, v^{r(v)}\}$, so that each edge (u, v) in G is mapped to a distinct edge $(u^i, v^j)$ in H for some $u^i \in W_u$ and $v^j \in W_v$, where a self-loop (u, u) in G may be mapped to a self-loop $(u^i, u^i)$ or a non-loop edge $(u^i, u^j)$ in H. Notice that, for all vertex pairs {u, v} ⊆ V, the number of edges in H between $W_u$ and $W_v$ is equal to that in G between u and v. (We note that an r-detachment may not be unique in general.) An r-degree specification is a set ρ of vectors $\rho(v) = (\rho^v_1, \rho^v_2, \ldots, \rho^v_{r(v)})$ such that $\sum_{1 \le i \le r(v)} \rho^v_i = \deg(v; G)$ for all v ∈ V. An r-detachment H is called a ρ-detachment if $\deg(v^i; H) = \rho^v_i$ for all v ∈ V and i = 1, 2, ..., r(v). See Fig. 4 for an illustration.

Theorem 3.2 (Nagamochi [16]). Given G = (V, E), $r : V \to Z_+$ and an r-degree specification ρ, G has a connected and loopless ρ-detachment if and only if

$$r(X) + c(G - X) - d(X, V; G) \le 1 \quad \text{for all } \emptyset \ne X \subseteq V,$$
$$1 \le \rho^v_i \le d(v; G) + d(\{v\}, \{v\}; G) \quad \text{for all } v \in V,\ i = 1, 2, \ldots, r(v),$$

where $r(X) = \sum_{v \in X} r(v)$, G − X denotes the graph obtained from a graph G by removing the vertices in X together with all edges incident to them, c(G − X) denotes the number of connected components of graph G − X, and d(A, B; G) denotes the number of edges (u, v) ∈ E with u ∈ A and v ∈ B.

Fig. 4. An illustration of a multigraph G and a ρ-detachment H of G, with r(a) = 1, r(b) = 3, r(c) = 4, r(d) = 2 and ρ(a) = (3), ρ(b) = (3, 3, 1), ρ(c) = (1, 3, 2, 3), ρ(d) = (2, 3).

Using this theorem, we can check if a partial multitree T violates (C4). Let $RP(T) = (r_0, r_1, \ldots, r_k)$ be the rightmost path of T, and let $r_0, \ldots, r_h$ (h ≤ k) be the vertices to which a new leaf can be attached without violating the left-heavy property (see [10] for how to do this). Recall that $\ell_1, \ell_2, \ldots, \ell_s$ and g are the given labels and the feature vector, respectively. Let $n^R_i$ (1 ≤ i ≤ s) be the number of vertices $r_j$ (0 ≤ j ≤ h) with $\ell(r_j) = \ell_i$. Introducing a new label $\ell_{s+1}$ of valence h + 1, we define a new feature vector g' of level 1 by

$$g'(\ell_i) = \begin{cases} g(\ell_i) - \#\ell_i + n^R_i & 1 \le i \le s, \\ 1 & i = s + 1, \end{cases}$$

$$g'(\ell_i \ell_j) = \begin{cases} g(\ell_i \ell_j) - \#\ell_i \ell_j & 1 \le i, j \le s, \\ n^R_i & 1 \le i \le s,\ j = s + 1. \end{cases}$$

(Recall that #t denotes the number of paths in T of label sequence t.) We construct an auxiliary graph G = (V, E) by $V = \{\ell_1, \ldots, \ell_s, \ell_{s+1}\}$ and

$$E = \{ e_{ij} \mid e_{ij} = (\ell_i, \ell_j),\ d(\{\ell_i\}, \{\ell_j\}; G) = g'(\ell_i \ell_j),\ 1 \le i, j \le s + 1 \},$$

where $d(\{\ell_i\}, \{\ell_j\}; G)$ means the multiplicity of edge $e_{ij}$. The function r and the degree specification ρ are defined as follows (see Fig. 5 for an illustration of G).

$$r(v) = g'(\ell_i), \quad \ell(v) = \ell_i,\ 1 \le i \le s + 1,$$

$$\rho^v_i = \begin{cases} val(\ell(v_i)) & v_i \notin \{r_0, \ldots, r_h\},\ 1 \le i \le r(v), \\ val(\ell(v_i)) - \deg(v_i; T) + 1 & v_i \in \{r_0, \ldots, r_h\},\ 1 \le i \le r(v). \end{cases}$$

If G has no connected and loopless ρ-detachment, then T cannot be extended to a connected and loopless tree with n vertices. The new label (label A in Fig. 5) is introduced in order to ensure the existence of the edges $(r_i, r_{i+1})$, i = 0, 1, ..., h − 1. By Theorem 3.2, we only need to check whether at least one of the next two conditions is violated.

(a) $\sum_{1 \le i \le r(v)} \rho^v_i \ge \deg(v; G)$ for all v ∈ V.
(b) $r(X) + c(G - X) - d(X, V; G) \le 1$ for all $\emptyset \ne X \subseteq V$.

Notice that condition (a) is an inequality rather than an equality because the feature vector counts multiple edges as one edge. Our detachment-cut discards T if either (a) or (b) is violated.
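Condition (b) can be checked directly by enumerating all non-empty subsets X of V, which is affordable because the auxiliary graph has only s + 1 vertices. The following is a minimal sketch of that brute-force check, not the authors' implementation: the dictionary-based encoding of the multigraph (`mult` maps an unordered vertex pair, stored once, to its edge multiplicity) and the function names are assumptions made for this illustration. Condition (a) and the degree specification ρ are not covered here.

```python
from itertools import combinations

def components(vertices, mult):
    """Number of connected components of the subgraph induced on `vertices`."""
    vertices = set(vertices)
    seen, count = set(), 0
    for start in vertices:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for (a, b), m in mult.items():
                if m > 0 and u in (a, b):
                    w = b if a == u else a
                    if w in vertices and w not in seen:
                        seen.add(w)
                        stack.append(w)
    return count

def condition_b_holds(vertices, mult, r):
    """Check r(X) + c(G - X) - d(X, V; G) <= 1 for every non-empty X.

    vertices: list of vertex names of the auxiliary graph G
    mult:     dict {(u, v): multiplicity}, each unordered pair stored once;
              (u, u) denotes self-loops
    r:        dict {v: r(v)}
    """
    for k in range(1, len(vertices) + 1):
        for X in combinations(vertices, k):
            Xs = set(X)
            r_X = sum(r[v] for v in Xs)
            # d(X, V; G): edges with at least one end vertex in X.
            d_X = sum(m for (a, b), m in mult.items() if a in Xs or b in Xs)
            rest = [v for v in vertices if v not in Xs]
            if r_X + components(rest, mult) - d_X > 1:
                return False
    return True
```

As a small hypothetical example, a graph with vertices C and H, four C-H edges, r(C) = 1 and r(H) = 4 satisfies the condition, whereas the same graph with only three C-H edges does not: for X = {H} the left-hand side becomes 4 + 1 − 3 = 2.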
Fig. 5. An illustration of how to construct a graph G for checking the validity of T using the detachment-cut, where we omit symmetric and zero entries in the feature vectors. In the example, val(H) = 1, val(O) = 2, val(C) = 4, val(A) = 3, r(H) = 6, r(O) = 2, r(C) = 3, r(A) = 1, ρ(H) = (1, 1, 1, 1, 1, 1), ρ(O) = (2, 2), ρ(C) = (3, 2, 4), and ρ(A) = (3).

We remark that condition (b) consists of $2^{s+1} - 1$ inequalities, but this number is usually small because s is very small. E.g., s is 2 for alkanes and 5 in our experiments.

4. Alternative problem formulation

We also follow the second problem formulation in [10], which uses two kinds of graph transformations. First, the H-removal transformation reduces the size of compounds by removing hydrogens. Then the single-bond transformation replaces multiple edges with a new virtual atom and two new simple edges joining the same end points. Fig. 6 illustrates these two transformations.

Fig. 6. An illustration of the H-removal and single-bond transformations, in which a double bond between O and C is replaced by a vertex labeled {{O, C}, 2}.

When the single-bond transformation replaces multiple edges (u, v) by a new vertex w and two new simple edges (u, w) and (w, v), we define the bond label $\ell(w)$ of w by $\ell(w) = (\{\ell(u), \ell(v)\})$, and define the bond valence of $\ell(w)$ as the multiplicity of (u, v). Let $C_\Sigma$ be the set of all such bond labels and $\Sigma^* = \Sigma \cup C_\Sigma$. For each vertex v of a $\Sigma^*$-labeled tree T, its bond degree $\widetilde{\deg}(v; T)$ is defined as the number of vertices adjacent to v. We consider the next formulation.

Problem 2. Given a set of labels $\Sigma^*$, a feature vector g of level K, and a valence function $val : \Sigma \to Z_+$, find all $\Sigma^*$-labeled simple trees $T^* = (V^*, E^*)$ that satisfy $f_K(T^*) = g$ and $\widetilde{\deg}(v; T^*) \le val(\ell(v))$ for all $v \in V^*$.

To solve this, we follow the aforementioned framework with the same branching operation.
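The two transformations of this section can be sketched in a few lines. This is an illustration under our own assumptions rather than the preprocessing code used in the experiments: a compound is encoded as a vertex-to-label dictionary plus an edge-to-multiplicity dictionary, a bond label is stored as a pair of the end-point label set and the multiplicity (following the label shown in Fig. 6), and the names `h_removal`, `single_bond` and the `("bond", u, v)` vertex identifiers are hypothetical.

```python
def h_removal(labels, edges):
    """Remove all hydrogens together with their incident edges.

    labels: dict {vertex: element label}
    edges:  dict {(u, v): multiplicity}, each unordered pair stored once
    """
    kept = {v: lab for v, lab in labels.items() if lab != "H"}
    kept_edges = {(u, v): m for (u, v), m in edges.items()
                  if u in kept and v in kept}
    return kept, kept_edges

def single_bond(labels, edges):
    """Replace each multiple edge (u, v) by a new vertex w and edges (u, w), (w, v).

    Returns the extended label dictionary and a list of simple edges.
    """
    new_labels = dict(labels)
    simple_edges = []
    for (u, v), m in sorted(edges.items()):
        if m == 1:
            simple_edges.append((u, v))
        else:
            w = ("bond", u, v)  # hypothetical identifier of the virtual atom
            # Bond label: end-point labels plus the bond valence (multiplicity).
            new_labels[w] = (frozenset({labels[u], labels[v]}), m)
            simple_edges.append((u, w))
            simple_edges.append((w, v))
    return new_labels, simple_edges

# Hypothetical input: the compound of Fig. 1 with one C=O double bond.
labels = {0: "C", 1: "C", 2: "H", 3: "H", 4: "H", 5: "O", 6: "O", 7: "H"}
edges = {(0, 1): 1, (0, 2): 1, (0, 3): 1, (0, 4): 1, (1, 5): 2, (1, 6): 1, (6, 7): 1}
labels_no_h, edges_no_h = h_removal(labels, edges)
tree_labels, tree_edges = single_bond(labels_no_h, edges_no_h)
```

On this input the eight-atom multitree shrinks to a simple tree with five vertices and four edges, one of which is the virtual atom standing for the double bond.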
The bounding operations differ somewhat. We can still employ the bounding operations based on the four criteria (C1)-(C4) stated in Section 3.2 (notice that Problem 2 considers only simple trees). In addition, we introduce a new H-cut bounding operation, which discards the partial tree T being checked if the number of hydrogens that must be appended to T and any of its descendants in order to restore the compound exceeds a pre-calculated limit.

Formally, we first calculate the numbers $h^*(\ell)$, $\ell \in \Sigma$, of hydrogens that must be appended to vertices labeled $\ell$. It is easy to see that this can be done from the input feature vector of level 1 and the valence function. The H-cut checks, for each $\ell \in \Sigma$, whether (a lower bound on) the number of hydrogens that must be appended to the $\ell$-labeled vertices in T exceeds $h^*(\ell)$. We use the following lower bound:

$$h(\ell; T) = \sum \{ val(\ell(v)) - \deg(v; T) \mid v \in T \setminus RP(T),\ \ell(v) = \ell \}.$$

(Recall that T and all descendants of T in the family tree share the common structure of $T \setminus RP(T)$.) See Fig. 7 for an illustration.

Fig. 7. An illustration of the H-cut procedure, where only label C is being considered and the number $val(\ell(v)) - \deg(v; T)$ is shown near each carbon not on the rightmost path.

5. Experimental Results

We conducted computational experiments to compare the running time of our algorithms with that of [10] using the same instances, which were obtained by randomly picking some tree-like compounds from the KEGG LIGAND database (http://www.genome.jp/kegg/ligand.html) and replacing each benzene ring by a new virtual element of valence 6. Feature vectors were calculated for levels 1, 2, ..., 7. For Problem 2, we preprocessed the instances with the H-removal and single-bond transformations. The experiments were performed on a PC with a Pentium4 3.00GHz CPU.
Tables 1 and 2 show the experimental results for Problems 1 and 2, respectively. We observe that the new algorithms run considerably faster than those of [10]; for example, for C03690 with K = 7 in Table 1, the running time drops from 1287.30 seconds to 10.27 seconds.

Table 1. Experimental Results of Problem 1.

                           Fujiwara et al.'s algorithm [10]   Our algorithm (this paper)
Entry   Formula     n1  K  time     nnt            fs         time     nnt          fs
C03343  C16H22O4    37  1  T.O.     1,334,417,908  N.F.       158.23   25,149,700   570,773
                        2  3.11     830,298        9          0.48     46,311       9
                        3  3.25     614,413        2          0.30     28,106       2
                        4  3.06     428,440        1          0.27     21,688       1
                        5  3.42     391,046        1          0.26     18,616       1
                        6  2.33     210,246        1          0.21     12,129       1
                        7  1.85     146,605        1          0.19     10,551       1
C07530  C17H28N2O   43  1  T.O.     1,407,334,896  N.F.       109.27   7,966,323    73,711
                        2  50.55    16,339,119     55         1.40     95,639       55
                        3  16.78    3,265,086      1          0.61     35,025       1
                        4  7.14     994,926        1          0.34     15,734       1
                        5  3.28     366,628        1          0.18     7,929        1
                        6  3.37     299,518        1          0.16     6,862        1
                        7  3.88     299,518        1          0.18     6,862        1
C07178  C21H28N2O5  46  1  T.O.     1,237,087,310  N.F.       500.78   31,003,703   70,170
                        2  51.72    15,827,372     16         3.51     158,597      16
                        3  4.26     915,962        2          0.32     15,427       2
                        4  0.94     146,789        1          0.16     6,677        1
                        5  1.02     123,251        1          0.15     5,485        1
                        6  1.13     118,295        1          0.16     5,450        1
                        7  1.00     93,947         1          0.15     5,036        1
C03690  C24H38O4    61  1  T.O.     1,428,804,364  N.F.       T.O.     456,703,633  N.F.
                        2  T.O.     499,544,612    N.F.       318.68   32,927,230   1,198
                        3  T.O.     338,357,072    N.F.       188.13   16,574,164   8
                        4  T.O.     254,834,091    N.F.       44.07    3,469,929    4
                        5  T.O.     198,785,929    N.F.       36.54    2,385,611    2
                        6  T.O.     129,353,817    1          16.02    854,956      1
                        7  1287.30  77,002,582     1          10.27    477,305      1

Note: (1) C03343, C07530, C07178, and C03690 are the entries of 2-Ethylhexyl phthalate, Etidocaine, Trimethobenzamide, and Bis (2-ethylhexyl) phthalate in the KEGG LIGAND database, respectively; (2) n1 is the number of atoms in an instance preprocessed by replacing each benzene ring with a new atom of valence 6; (3) K is the level of a given feature vector; (4) "time" is the CPU time in seconds; (5) "T.O." means "time over" (the time limit was set to 1800 seconds); (6) "nnt" is the number of nodes in the family trees that are checked; (7) "fs" is the number of feasible solutions found within the time limit; and (8) "N.F." means "not found".

Table 2. Experimental Results of Problem 2.

                           Fujiwara et al.'s algorithm [10]   Our algorithm (this paper)
Entry   Formula     n2  K  time     nnt            fs         time     nnt          fs
C03343  C16H22O4    17  1  66.06    28,683,656     570,773    13.31    5,865,685    570,773
                        2  0.03     5,157          9          0.01     3,091        9
                        3  0.03     4,607          2          0.02     2,780        2
                        4  0.04     4,086          1          0.02     2,453        1
                        5  0.04     3,470          1          0.02     2,098        1
                        6  0.04     2,909          1          0.02     1,739        1
                        7  0.04     2,675          1          0.02     1,596        1
C07530  C17H28N2O   16  1  10.26    4,029,246      73,711     1.00     424,121      73,711
                        2  0.16     43,513         55         0.06     14,900       55
                        3  0.09     16,090         1          0.04     6,385        1
                        4  0.06     8,006          1          0.02     3,736        1
                        5  0.04     5,624          1          0.02     2,522        1
                        6  0.04     4,642          1          0.02     2,245        1
                        7  0.04     4,642          1          0.02     2,245        1
C07178  C21H28N2O5  19  1  222.29   96,006,467     70,170     9.03     3,909,283    70,170
                        2  0.11     21,460         16         0.02     4,321        16
                        3  0.09     11,950         2          0.02     2,984        2
                        4  0.03     3,152          1          0.01     1,062        1
                        5  0.02     2,143          1          0.01     819          1
                        6  0.02     2,088          1          0.01     794          1
                        7  0.02     2,088          1          0.01     794          1
C03690  C24H38O4    25  1  T.O.     664,265,016    5,305,243  T.O.     708,264,977  60,257,365
                        2  23.36    2,984,162      1,198      8.10     1,113,024    1,198
                        3  15.87    1,464,436      8          5.66     570,616      8
                        4  7.12     509,870        4          2.46     197,027      4
                        5  4.97     283,418        2          1.90     120,718      2
                        6  2.66     132,434        1          1.12     60,310       1
                        7  2.10     101,097        1          0.88     46,319       1
C04036  C19H39O7P   29  1  T.O.     734,327,164    2,653,617  T.O.     759,794,526  11,587,705
                        2  T.O.     228,786,134    161        1543.37  300,524,875  2,520
                        3  184.54   14,517,014     1          45.36    4,745,395    1
                        4  11.86    638,457        1          3.60     262,162      1
                        5  5.95     225,966        1          2.27     107,378      1
                        6  4.34     127,250        1          1.57     60,557       1
                        7  3.38     81,532         1          1.26     40,493       1
C03630  C21H39O7P   33  1  T.O.     667,687,809    3,959      T.O.     639,689,202  96,245
                        2  T.O.     168,054,487    77         T.O.     239,538,772  1,736
                        3  T.O.     115,797,466    11         438.19   37,803,253   13
                        4  118.48   5,104,899      11         25.65    1,519,286    11
                        5  50.63    1,554,928      9          12.24    515,752      9
                        6  27.83    673,426        7          6.44     225,620      7
                        7  11.97    244,166        5          3.14     92,431       5

Note: (1) C03343, C07530, C07178, C03690, C04036, and C03630 are the entries of 2-Ethylhexyl phthalate, Etidocaine, Trimethobenzamide, Bis (2-ethylhexyl) phthalate, 1-Palmitoylglycerol 3-phosphate, and Oleoylglycerone phosphate in the KEGG LIGAND database, respectively; (2) n2 is the number of vertices in an instance preprocessed by replacing benzene rings with new atoms of valence 6 and by the H-removal and single-bond transformations; (3) K is the level of a given feature vector; (4) "time" is the CPU time in seconds; (5) "T.O." means "time over" (the time limit was set to 1800 seconds); (6) "nnt" is the number of nodes in the family trees that are checked; (7) "fs" is the number of feasible solutions found within the time limit.

6. Conclusion

In this paper, we presented two branch-and-bound algorithms for enumerating tree-like chemical graphs with given path frequency, which are based on the framework of [10] and improve on its results. In particular, we proposed two bounding operations, the detachment-cut and the H-cut. As future work, we are considering the enumeration of more general graph classes, e.g., outerplanar graphs, which are known to cover most chemical graphs. A preliminary work can be found in [19].

We note that the depth-label sequences defined in this paper only represent the graphical structures of compounds from the viewpoint of planarity and may lose stereochemical information, especially for stereoisomers. Thus, designing better representations is another interesting topic for future research.

Acknowledgments

This work was supported in part by Grant-in-Aid #19200022 from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan. We thank Hiroki Fujiwara and Jiexun Wang for their helpful discussions.

References

[1] Akutsu, T., Fukagawa, D., Inferring a Graph from Path Frequency, LNCS, 3537, 371–382, 2005.
[2] Akutsu, T., Fukagawa, D., Inferring a Chemical Structure from a Feature Vector Based on Frequency of Labeled Paths and Small Fragments, Series on Advances in Bioinformatics and Computational Biology, Proc. 5th Asia-Pacific Bioinformatics Conf., Sankoff, D., Wang, L., Chin, F., Eds.; Imperial College Press, 165–174, 2007.
[3] Aringhieri, R., Hansen, P., Malucelli, F., Chemical Trees Enumeration Algorithms, 4OR, 1, 67–83, 2003.
[4] Bakır, G. H., Zien, A., Tsuda, K., Learning to Find Graph Pre-Images, LNCS, 3175, 253–261, 2004.
[5] Buchanan, B. G., Feigenbaum, E. A., DENDRAL and Meta-DENDRAL - Their Applications Dimension, Artif. Intell., 1, 5–24, 1978.
[6] Cayley, A., On the Analytic Forms Called Trees, with Applications to the Theory of Chemical Combinations, Reports British Assoc. Adv. Sci., 45, 257–305, 1875.
[7] Deshpande, M., Kuramochi, M., Wale, N., Karypis, G., Frequent Substructure-Based Approaches for Classifying Chemical Compounds, IEEE Transactions on Knowledge and Data Engineering, 17, 1036–1050, 2005.
[8] Faulon, J. L., Churchwell, C. J., Visco, Jr., D. P., The Signature Molecular Descriptor. 2. Enumerating Molecules from Their Extended Valence Sequences, J. Chem. Inf. Comp. Sci., 43, 721–734, 2003.
[9] Fink, T., Reymond, J. L., Virtual Exploration of the Chemical Universe up to 11 Atoms of C, N, O, F: Assembly of 26.4 Million Structures (110.9 Million Stereoisomers) and Analysis for New Ring Systems, Stereochemistry, Physicochemical Properties, Compound Classes, and Drug Discovery, J. Chem. Inf. Comp. Sci., 47, 342–353, 2007.
[10] Fujiwara, H., Wang, J., Zhao, L., Nagamochi, H., Akutsu, T., Enumerating Tree-like Chemical Graphs with Given Path Frequency, J. Chem. Inf. Model., 2008 (to appear).
[11] Funatsu, K., Sasaki, S., Recent Advances in the Automated Structure Elucidation System, CHEMICS. Utilization of Two-Dimensional NMR Spectral Information and Development of Peripheral Functions for Examination of Candidates, J. Chem. Inf. Comp. Sci., 36, 190–204, 1996.
[12] Hall, L. H., Dailey, E. S., Design of Molecules from Quantitative Structure-Activity Relationship Models. 3. Role of Higher Order Path Counts: Path 3, J. Chem. Inf. Comp. Sci., 33, 598–603, 1993.
[13] Kashima, H., Tsuda, K., Inokuchi, A., Marginalized Kernels between Labeled Graphs, Proc. 20th International Conference on Machine Learning, Fawcett, T., Mishra, N., Eds., The AAAI Press, Menlo Park, California, 321–328, 2003.
[14] Mahé, P., Ueda, N., Akutsu, T., Perret, J. L., Vert, J. P., Graph Kernels for Molecular Structure-Activity Relationship Analysis with Support Vector Machines, J. Chem. Inf. Model., 45, 939–951, 2005.
[15] Mauser, H., Stahl, M., Chemical Fragment Spaces for De Novo Design, J. Chem. Inf. Comp. Sci., 47, 318–324, 2007.
[16] Nagamochi, H., A Detachment Algorithm for Inferring a Graph from Path Frequency, LNCS, 4112, 274–283, 2006.
[17] Nakano, S., Uno, T., Efficient Generation of Rooted Trees, Technical Report, NII-2003-005E, ISSN: 1346-5597; National Inst. of Informatics: Tokyo, Japan, July 3, 2003.
[18] Nakano, S., Uno, T., Generating Colored Trees, LNCS, 3787, 249–260, 2005.
[19] Wang, J., Zhao, L., Nagamochi, H., Akutsu, T., An Efficient Algorithm for Generating Colored Outerplanar Graphs, LNCS, 4484, 573–583, 2007.