August 1, 2008
17:26
WSPC - Proceedings Trim Size: 9.75in x 6.5in
article
IMPROVED ALGORITHMS FOR ENUMERATING TREE-LIKE
CHEMICAL GRAPHS WITH GIVEN PATH FREQUENCY
YUSUKE ISHIDA1
HIROSHI NAGAMOCHI1
LIANG ZHAO1
TATSUYA AKUTSU2
1 Department of Applied Mathematics and Physics, Graduate School of Informatics,
Kyoto University, Yoshida, Kyoto 606-8501, Japan
2 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho,
Uji, Kyoto 611-0011, Japan
This paper considers the problem of enumerating all non-isomorphic tree-like chemical
graphs with given path frequency, where “tree-like” means that the graph can be viewed
as a tree if multiple edges (i.e., edges with the same end points) and a benzene ring
are treated as one edge and one vertex, respectively, and “path frequency” is a vector
of the numbers of specified vertex-labeled paths that must be realized in every output.
This and related problems have several potential applications such as classification of
chemical compounds, structure determination using mass-spectrum and/or NMR and
design of novel chemical compounds.
Several studies have addressed this problem. Recently, Fujiwara et al. (2008)
gave two formulations and, for each of them, a branch-and-bound algorithm that
combines efficient enumeration of non-isomorphic trees with bounding operations
based on the path frequency and the atom-atom bonds to avoid generating invalid
trees. In this paper, based on their work and a result of Nagamochi (2006), we
introduce two new bounding operations, the detachment-cut and the H-cut, to further
reduce the size of the search space. We performed computational experiments to
compare our proposed algorithms with those of Fujiwara et al. (2008) using chemical
compound data obtained from the KEGG LIGAND database
(http://www.genome.jp/kegg/ligand.html). The results show that our proposed
algorithms are much faster than theirs.
Keywords: chemical graph enumeration; chemical tree enumeration; path frequency;
feature vector; detachment.
1. Introduction
Enumerating chemical graphs is one of the fundamental issues in chemoinformatics
and bioinformatics, and it dates back to the 19th century (Cayley [6]). Its
applications include structure determination using mass spectra and/or NMR
spectra [5, 11], virtual exploration of the chemical universe [9, 15], reconstruction of
molecular structures from their signatures [8, 12], and classification of chemical
compounds [7]. In these applications, enumeration of chemical graphs satisfying given
constraints is important [2].
This paper considers the enumeration of chemical compounds with given path
frequency, i.e., the numbers of specified vertex-labeled paths that must be realized in
every output. The problem is motivated by the pre-image problem in machine
learning [4]. In the pre-image problem, given a feature vector, a structure consistent
with the feature vector is computed. The pre-image problem for chemical graphs
has a potential application to the design of novel chemical compounds [2, 4], which is an
important target of bioinformatics. Suppose that we have some potential function
in a feature space, which reflects the pharmacological activity of chemical compounds
and may be learned from training data. Then, a desired object is computed
as a point in the feature space using the potential function and some optimization
technique. Finally, a pre-image of the point is computed as a candidate for a novel
chemical compound. Though this approach has not yet been shown to be better
than existing approaches, it may yield chemical structures with better
pharmacological activity than those in the training data. Since feature vectors
based on the frequency of labeled paths were successfully applied to the classification of
chemical compounds [13, 14], we consider the graph pre-image problem with given
path frequency.
Akutsu and Fukagawa [1] first studied the computational complexity of the
pre-image problem. They proved that the problem is NP-hard even if chemical graphs
are restricted to trees. Since the problem is NP-hard and it is quite difficult to
handle all chemical graphs, they developed a branch-and-bound algorithm for
tree-like chemical graphs [2]. Recently, Fujiwara et al. [10] proposed a much more efficient
branch-and-bound algorithm, which combines the tree enumeration algorithm of
Nakano and Uno [17, 18] for generating non-isomorphic trees with bounding operations
based on the path frequency and the atom-atom bonds to avoid generating
invalid trees. To improve the efficiency, they also gave an alternative formulation
of the problem by removing all hydrogens and replacing each group of multiple
edges with a new virtual atom and two new single edges. Their experimental results
showed that some instances with up to 61 atoms could be enumerated within 30 minutes
on a normal PC. Their algorithms can also be applied to the classical problem of
enumerating alkanes ($C_n H_{2n+2}$), which was considered by Cayley [6], and the
latter of their algorithms was shown to be at least as fast as the fastest existing
algorithm [3].
To further improve the efficiency, we introduce two new bounding operations
in this paper. The first, the detachment-cut, is based on a result of Nagamochi [16].
The second, the H-cut, uses the information of the removed hydrogens and can
therefore be applied only to the second formulation. We show that they effectively
reduce the size of the search space and thereby the running time. On the same
instances, the proposed algorithms are faster than those of [10] in all the examined
cases and dozens of times faster in many cases. As in [10], our algorithms
can also be extended to treat benzene rings.
The rest of the paper is organized as follows. Section 2 gives preliminaries and
the first formulation. Section 3 shows the branch-and-bound framework following
[10] and the new detachment-cut bounding operation. Section 4 gives the second
formulation and the new algorithm that employs both the detachment-cut and the
H-cut. We report in Section 5 some experimental results and conclude in Section 6.
2. Preliminaries and problem formulation
A graph is called a multigraph if multiple edges are allowed; otherwise it is simple.
A multitree is a multigraph with neither cycles nor self-loops. A path P is a sequence
$v_0, e_1, v_1, e_2, v_2, \ldots, e_k, v_k$ of distinct vertices $v_i$ and edges $e_j$ which join $v_{j-1}$ and
$v_j$, $j = 1, \ldots, k$. When no confusion arises we write $P = (v_0, v_1, \ldots, v_k)$. The length
|P| of path P is defined as k, i.e., the number of edges.
We are given a set $\Sigma = \{\ell_1, \ell_2, \ldots, \ell_s\}$ of s labels, which correspond to chemical
elements. Let each label ℓ be associated with a valence val(ℓ) ∈ Z+, where Z+
denotes the set of non-negative integers. A multigraph G is said to be Σ-labeled if each
vertex v has a label ℓ(v) ∈ Σ, and is called (Σ, val)-labeled if, in addition, the
degree of each vertex v is val(ℓ(v)), i.e., the valence of the element ℓ(v). Chemical
compounds can be viewed as (Σ, val)-labeled, self-loopless and connected
multigraphs, where vertices and labels represent atoms and elements, respectively.
For a path $P = (v_0, v_1, \ldots, v_k)$, we call $\ell(P) = \ell(v_0), \ell(v_1), \ldots, \ell(v_k)$ the label
sequence of P. Given a label sequence t, let #t denote the number of paths P with
ℓ(P) = t in the graph, where multiple edges are treated as a single edge and paths
are considered "directed." The feature vector $f_K(G)$ of level K ∈ Z+ of G is defined
as the p(K, s)-dimensional vector whose entry $f_K(G)[t]$ (|t| ≤ K) represents #t,
where $p(K, s) = (s^{K+2} - s)/(s - 1)$ for s > 1 and p(K, 1) = K + 1. Fig. 1 illustrates
an example.
[Figure 1: a (Σ, val)-labeled multitree G with Σ = {C, O, H}, val(C) = 4, val(O) = 2,
val(H) = 1, and its feature vector of level 1:

t          H  O  C  HH  HO  HC  OH  OO  OC  CH  CO  CC
f1(G)[t]   4  2  2   0   1   3   1   0   2   3   2   2]

Fig. 1. An illustration of a (Σ, val)-labeled multitree G and f1(G), where multiple edges are
treated as one edge and paths are considered "directed."
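The definition of the feature vector can be checked with a short sketch. The code below is only an illustration, not the authors' implementation: the function names `feature_vector` and `p`, the tuple-keyed counter, and the vertex numbering of the example molecule are all assumptions. The molecule is chosen so that its level-1 counts agree with the values listed for Fig. 1.

```python
from collections import Counter


def feature_vector(labels, edges, K):
    """Count label sequences of all directed simple paths with at most K edges.

    labels: dict vertex -> label; edges: iterable of (u, v) pairs.
    Neighbours are stored in sets, so multiple edges count as one edge.
    """
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    counts = Counter()

    def extend(path):
        # Record the label sequence of the current path, then try to grow it.
        counts[tuple(labels[v] for v in path)] += 1
        if len(path) - 1 < K:
            for w in adj[path[-1]]:
                if w not in path:
                    extend(path + [w])

    for v in labels:
        extend([v])
    return counts


def p(K, s):
    """Dimension of a level-K feature vector over s labels."""
    return K + 1 if s == 1 else (s ** (K + 2) - s) // (s - 1)


# Hypothetical vertex numbering for a multitree matching the counts in Fig. 1.
labels = {0: "C", 1: "C", 2: "O", 3: "O", 4: "H", 5: "H", 6: "H", 7: "H"}
edges = [(0, 1), (1, 2), (1, 2), (1, 3), (0, 4), (0, 5), (0, 6), (3, 7)]
f1 = feature_vector(labels, edges, K=1)
```

Keys are tuples of labels, so `f1[("H", "C")]` is the number of directed H-to-C paths; a missing key reads as zero. The recursion enumerates every path explicitly, which is fine for small examples but is not meant to be efficient.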
Let deg(v; G) denote the degree of a vertex v in a graph G. The problem can be
formulated as follows (an alternative formulation will be given in Section 4).
Problem 1. Given a set Σ of s labels, a valence function val : Σ → Z+ and a
feature vector g of level K, find all (Σ, val)-labeled multitrees T such that $f_K(T) = g$
and deg(v; T) = val(ℓ(v)) for all vertices v ∈ T.
For a given feature vector g, the entry g(t) specifies #t in an output graph. In
particular, the number n of vertices is determined by $n = \sum_{\ell \in \Sigma} g(\ell)$. To solve the
problem, we start with an empty graph and repeatedly extend the current tree T by
one vertex, appending a new vertex with each label ℓ ∈ Σ in turn and keeping only
valid trees (trees that have not violated any constraint on output trees), until we
get n vertices. In order to avoid duplicate outputs, we follow the branch-and-bound
framework of [10], which first defines a canonical representation for isomorphic
trees, then lists them using the algorithm of [17, 18] (the branching operation) and
discards invalid trees using some bounding operations.
3. The enumeration algorithm
Given a simple tree with n vertices, the valence constraint uniquely determines the
multiplicities of all edges. Thus we consider listing all non-isomorphic Σ-labeled
simple trees and get/check the corresponding multitrees by the valence constraint.
The framework of the enumeration algorithm is shown in Fig. 2.
Main:
  FOR all labels ℓ ∈ Σ DO
    Let T be the tree consisting of one (root) vertex labeled by ℓ
    Gen(T)
  DONE
Gen(T):
  IF T has n vertices THEN
    Check if T is valid. If it is, output it.
  ELSE
    Extend T to T1, T2, ..., Tp for some finite p by appending a new leaf vertex
    FOR all such trees Ti DO
      Check if Ti is valid. If it is, call Gen(Ti) (do nothing otherwise).
    DONE
  ENDIF
Fig. 2. The framework of the enumeration algorithm.
Section 3.1 reviews the way of extending trees (the branching operation), which
is exactly the same as [10]. Section 3.2 describes how the validity is checked by
several bounding operations, three from [10] and a new detachment-cut.
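The observation at the start of this section, that the valence constraint fixes the multiplicity of every edge of a simple tree, can be illustrated by peeling leaves: the single edge of a leaf must carry the whole remaining valence of that leaf. The following is a minimal sketch under that reading; the function name `edge_multiplicities`, the data layout and the example numbering are assumptions and do not come from [10].

```python
def edge_multiplicities(labels, edges, val):
    """Return {frozenset({u, v}): multiplicity} for a simple tree, or None
    if no assignment gives every vertex v degree exactly val[labels[v]]."""
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    residual = {v: val[labels[v]] for v in labels}  # valence still unused
    mult = {}
    leaves = [v for v in labels if len(adj[v]) == 1]
    remaining = len(labels)
    while leaves and remaining > 1:
        leaf = leaves.pop()
        if not adj[leaf]:
            continue
        (parent,) = adj[leaf]
        m = residual[leaf]          # the leaf's only edge must absorb all of it
        if m < 1 or m > residual[parent]:
            return None
        mult[frozenset((leaf, parent))] = m
        residual[parent] -= m
        residual[leaf] = 0
        adj[parent].discard(leaf)
        adj[leaf].clear()
        remaining -= 1
        if len(adj[parent]) == 1:
            leaves.append(parent)
    return mult if all(x == 0 for x in residual.values()) else None


# Hypothetical numbering of the simple tree underlying the molecule of Fig. 1.
labels = {0: "C", 1: "C", 2: "O", 3: "O", 4: "H", 5: "H", 6: "H", 7: "H"}
tree_edges = [(0, 1), (1, 2), (1, 3), (0, 4), (0, 5), (0, 6), (3, 7)]
val = {"C": 4, "O": 2, "H": 1}
mult = edge_multiplicities(labels, tree_edges, val)
```

In this example the only edge that receives multiplicity 2 is the one between vertex 1 (C) and vertex 2 (O); a tree such as a single C-H edge has no valid assignment and yields `None`.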
3.1. Canonical representation of trees and the branching operation
First of all, we need a representation of the output that is unique for isomorphic
trees. For this purpose, we use the idea of the centroid-rooted left-heavy tree [10],
where the centroid is defined by the following theorem (see also [3]).
Theorem 3.1 (Jordan 1869). For any tree with n vertices, either there exists a
unique vertex $v^*$ such that each subtree obtained by removing $v^*$ contains at most
$\lfloor (n-1)/2 \rfloor$ vertices, or there exists a unique edge $e^*$ such that both of the subtrees
obtained by removing $e^*$ contain exactly $n/2$ vertices.
Such a vertex $v^*$ (resp., edge $e^*$) is called the unicentroid (resp., bicentroid) of
the tree. For example, the tree in Fig. 1 has a bicentroid (the C-C edge). To
introduce "left-heaviness," we need an ordering among rooted trees.
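Theorem 3.1 translates directly into a linear-time search based on subtree sizes. The sketch below is one possible way to locate the centroid; the function name `find_centroid` and the adjacency-dictionary input are assumptions made for illustration.

```python
def find_centroid(adj):
    """Return ('vertex', v) or ('edge', (u, v)) for a tree {v: set_of_neighbours}."""
    n = len(adj)
    root = next(iter(adj))
    parent = {root: None}
    order = [root]
    for v in order:                      # BFS; the list grows while we iterate
        for w in adj[v]:
            if w != parent[v]:
                parent[w] = v
                order.append(w)
    size = {v: 1 for v in adj}
    for v in reversed(order):            # accumulate subtree sizes bottom-up
        if parent[v] is not None:
            size[parent[v]] += size[v]
    for v in order:
        # Sizes of the subtrees that remain after removing v.
        parts = [size[w] for w in adj[v] if w != parent[v]]
        if parent[v] is not None:
            parts.append(n - size[v])
        if all(2 * x <= n - 1 for x in parts):   # x <= floor((n - 1) / 2)
            return ("vertex", v)
    for v in order:
        if parent[v] is not None and 2 * size[v] == n:
            return ("edge", (parent[v], v))


# Hypothetical numbering of the tree of Fig. 1 (vertices 0 and 1 are the carbons).
acetic = {0: {1, 4, 5, 6}, 1: {0, 2, 3}, 2: {1}, 3: {1, 7},
          4: {0}, 5: {0}, 6: {0}, 7: {3}}
path3 = {0: {1}, 1: {0, 2}, 2: {1}}
```

For `acetic` the function reports the C-C edge as bicentroid, in line with the remark above, and for the three-vertex path it reports the middle vertex as unicentroid.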
Let T be a tree of n vertices rooted at a vertex v0 (which is not necessarily
its centroid). Suppose that it is embedded in the plane, where v0 is the top. Let
v0 , v1 , . . . , vn−1 be indexed by the depth-first search (DFS) that starts from v0 and
visits vertices from the left to the right. The depth d(v) of a vertex v is defined as
the length of the path from v0 to v in T . The depth-label sequence of T is defined as
DL(T ) = (d(v0 ), `(v0 ), d(v1 ), `(v1 ), . . . , d(vn−1 ), `(vn−1 )).
We say that T is rooted at an edge (v0 , v1 ) if v0 and v1 are the two tops, where
we define d(v) by the minimum of the lengths of the v0 , v-path and the v1 , v-path.
Then DL(T ) can be defined as before. Now we have a one-to-one mapping between
plane-embedded trees and label sequences. See Fig. 3 for an illustration.
Fig. 3. Rooted trees and their depth-label sequences. Notice that T1 and T2 are isomorphic.
Given an (arbitrary) order of labels, define the order of depth-label sequences as
follows. For any T1 and T2, we say DL(T1) > DL(T2) if DL(T1) is lexicographically
larger than DL(T2). The relation DL(T1) ≥ DL(T2) is defined analogously. In
Fig. 3, we have DL(T1) > DL(T2) > DL(T3), supposing C > O > H. The canonical
representation of a rooted tree is defined by the largest depth-label sequence among
all its plane embeddings. This is equivalent to the left-heavy plane embedding (see
[17, 18]); i.e., any two siblings (vertices having the same parent or the two vertices
of the edge root) $v_i$ and $v_j$ with i < j satisfy $DL(T(v_i)) \ge DL(T(v_j))$, where T(v)
denotes the subtree consisting of v and all its descendants. For example, T1 and T3
in Fig. 3 are left-heavy whereas T2 is not.
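The depth-label sequence and the left-heavy embedding can be sketched for vertex-rooted trees as follows. This is an illustration only: the nested-tuple tree encoding, the `RANK` table (encoding C > O > H) and the function names are assumptions, and edge-rooted trees are not handled. The sequence is returned as a list of (depth, rank) pairs, whose list comparison agrees with the lexicographic comparison of the flattened sequence.

```python
RANK = {"C": 2, "O": 1, "H": 0}   # assumed label order C > O > H


def dl(tree, depth=0):
    """Depth-label sequence of a rooted ordered tree (label, [children])."""
    label, children = tree
    seq = [(depth, RANK[label])]
    for child in children:
        seq += dl(child, depth + 1)
    return seq


def canonical(tree):
    """Reorder children recursively so that larger subtrees come first."""
    label, children = tree
    kids = sorted((canonical(c) for c in children), key=dl, reverse=True)
    return (label, kids)


def is_left_heavy(tree):
    """Check DL(T(v_i)) >= DL(T(v_j)) for consecutive siblings, recursively."""
    _, children = tree
    ordered = all(dl(a) >= dl(b) for a, b in zip(children, children[1:]))
    return ordered and all(is_left_heavy(c) for c in children)


# A small hypothetical example: root C with children H, O(-H) and C.
t = ("C", [("H", []), ("O", [("H", [])]), ("C", [])])
t_canon = canonical(t)
```

Here `t` is not left-heavy because its first child (a lone H) is smaller than the O subtree that follows it, whereas `t_canon` lists the children in the order C, O(-H), H and has the larger depth-label sequence.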
Thus our branching task is to list all centroid-rooted left-heavy trees with n
vertices and at most m labels. Following the scheme of [17, 18], we define a
parent-child relation between two left-heavy trees. The parent P(T) of a left-heavy tree T is
obtained from T by removing its rightmost leaf. If T is rooted at a vertex, or at an edge
(v0, v1) where v1 is not the rightmost leaf, then the root of P(T) remains unchanged.
Otherwise we change the root to vertex v0 since v1 is removed. Clearly P(T) is still
left-heavy. In this way we can define a family tree F(n, m) of left-heavy trees whose
leaves are exactly what we want, i.e., the centroid-rooted left-heavy trees with n
vertices and at most m labels. Notice that, in general, a non-leaf node in the family
tree may not be rooted at its own centroid.
Therefore we only need to enumerate the (leaf) nodes of F(n, m). This can be
done by starting from the empty tree (the root node of F(n, m)) and repeatedly
appending a new leaf to some appropriate place on the rightmost path. For that
purpose, our branching operation employs the algorithm of [17, 18], which
extends the current tree T (i.e., finds a child of T) in constant time. See [10] for
details.
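The parent-child relation can be illustrated on a simplified case: unlabeled trees rooted at a vertex, without the centroid restriction. A tree is stored as its DFS depth sequence, so removing the rightmost leaf means dropping the last entry, and a child is obtained by appending a depth d that attaches a new leaf to the rightmost path. The sketch below tests left-heaviness by brute force; it is not the constant-time procedure of [17, 18], and all names are assumptions.

```python
def is_left_heavy(depths):
    """depths: DFS depth sequence of a rooted ordered tree (root has depth 0)."""
    def subtree_end(i):
        j = i + 1
        while j < len(depths) and depths[j] > depths[i]:
            j += 1
        return j

    for i in range(len(depths)):
        end = subtree_end(i)
        # depths[end] == depths[i] means the vertex at `end` is the next sibling.
        if end < len(depths) and depths[end] == depths[i]:
            if depths[i:end] < depths[end:subtree_end(end)]:
                return False
    return True


def left_heavy_trees(n):
    """List the depth sequences of all left-heavy rooted trees with n vertices."""
    results = []

    def gen(depths):
        if len(depths) == n:
            results.append(tuple(depths))
            return
        # Depth d attaches the new leaf to the rightmost-path vertex at depth d - 1.
        for d in range(1, depths[-1] + 2):
            child = depths + [d]
            if is_left_heavy(child):
                gen(child)

    gen([0])
    return results


counts = [len(left_heavy_trees(n)) for n in range(1, 7)]
```

Because the parent of a left-heavy tree is again left-heavy, pruning non-left-heavy children loses nothing, and each rooted tree is produced exactly once; `counts` therefore lists the numbers of unlabeled rooted trees on 1 to 6 vertices.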
3.2. Bounding operations
Next we explain how to check the validity of a tree T generated during the branching
operation. If we can conclude that T and all its descendants are not valid, then we
can discard T, i.e., skip the task of appending leaves to T. Our bounding operations
discard T if at least one of the following criteria is violated.
(C1) The root of T remains the centroid of an output (the centroid constraint);
(C2) $f_K(T) \le g$ (the feature vector constraint);
(C3) deg(v; T) ≤ val(ℓ(v)) for all v ∈ T (the valence constraint);
(C4) T can be extended to a connected and loopless tree with n vertices (the
detachment constraint).
The first three are the same as in [10] and are not difficult to check (see [10]). In the
following, we explain how to check the last one. We need some definitions.
Let G = (V, E) be a multigraph which may have self-loops. Given a function
r : V → Z+, an r-detachment of G is a multigraph H obtained from G by splitting
each vertex v ∈ V into a set of r(v) copies of v, denoted by $W_v = \{v^1, v^2, \ldots, v^{r(v)}\}$,
so that each edge (u, v) in G is mapped to a distinct edge $(u^i, v^j)$ in H for some
$u^i \in W_u$ and $v^j \in W_v$, where a self-loop (u, u) in G may be mapped to a self-loop
$(u^i, u^i)$ or a non-loop edge $(u^i, u^j)$ in H. Notice that, for all vertex pairs {u, v} ⊆ V,
the number of edges in H between $W_u$ and $W_v$ is equal to that in G between u and
v. (We note that an r-detachment may not be unique in general.)
An r-degree specification is a set ρ of vectors $\rho(v) = (\rho^v_1, \rho^v_2, \ldots, \rho^v_{r(v)})$ such that
$\sum_{1 \le i \le r(v)} \rho^v_i = \deg(v; G)$ for all v ∈ V. An r-detachment H is called a ρ-detachment
if $\deg(v^i; H) = \rho^v_i$ for all v ∈ V and i = 1, 2, ..., r(v). See Fig. 4 for an illustration.
Theorem 3.2 (Nagamochi [16]). Given G = (V, E), r : V → Z+ and an
r-degree specification ρ, G has a connected and loopless ρ-detachment if and only if
$$r(X) + c(G - X) - d(X, V; G) \le 1 \quad \text{for all } \emptyset \ne X \subseteq V,$$
$$1 \le \rho^v_i \le d(v; G) + d(\{v\}, \{v\}; G) \quad \text{for all } v \in V,\ i = 1, 2, \ldots, r(v),$$
where $r(X) = \sum_{v \in X} r(v)$, G − X denotes the graph obtained from a graph G by
removing the vertices in X together with all edges incident to them, c(G − X) denotes
the number of connected components of graph G − X, and d(A, B; G) denotes the
number of edges (u, v) ∈ E with u ∈ A and v ∈ B.
[Figure 4: a multigraph G on vertices a, b, c, d with r(a) = 1, r(b) = 3, r(c) = 4, r(d) = 2,
and a ρ-detachment H of G with ρ(a) = (3), ρ(b) = (3, 3, 1), ρ(c) = (1, 3, 2, 3), ρ(d) = (2, 3).]
Fig. 4. An illustration of a multigraph G and a ρ-detachment H of G.
Using this theorem, we can check whether a partial multitree T violates (C4). Let
$RP(T) = (r_0, r_1, \ldots, r_k)$ be the rightmost path of T, and let $r_0, \ldots, r_h$ (h ≤ k) be
the vertices to which a new leaf can be attached without violating the left-heavy
property (see [10] for how to find them). Recall that $\ell_1, \ell_2, \ldots, \ell_s$ and g are the given labels
and the feature vector, respectively. Let $n^R_i$ (1 ≤ i ≤ s) be the number of vertices
$r_j$ (0 ≤ j ≤ h) with $\ell(r_j) = \ell_i$. Introducing a new label $\ell_{s+1}$ of valence h + 1, we
define a new feature vector g' of level 1 by
$$g'(\ell_i) = \begin{cases} g(\ell_i) - \#\ell_i + n^R_i & 1 \le i \le s, \\ 1 & i = s + 1, \end{cases}$$
$$g'(\ell_i \ell_j) = \begin{cases} g(\ell_i \ell_j) - \#\ell_i \ell_j & 1 \le i, j \le s, \\ n^R_i & 1 \le i \le s,\ j = s + 1. \end{cases}$$
(Recall that #t denotes the number of paths in T with label sequence t.) We construct an
auxiliary graph G = (V, E) by $V = \{\ell_1, \ldots, \ell_s, \ell_{s+1}\}$ and
$E = \{e_{ij} \mid e_{ij} = (\ell_i, \ell_j),\ d(\{\ell_i\}, \{\ell_j\}; G) = g'(\ell_i \ell_j),\ 1 \le i, j \le s + 1\}$,
where $d(\{\ell_i\}, \{\ell_j\}; G)$ means the multiplicity of edge $e_{ij}$. The function r and the
degree specification ρ are defined as follows (see Fig. 5 for an illustration of G).
$$r(v) = g'(\ell_i), \quad \ell(v) = \ell_i,\ 1 \le i \le s + 1,$$
$$\rho^v_i = \begin{cases} val(\ell(v_i)) & v_i \notin \{r_0, \ldots, r_h\},\ 1 \le i \le r(v), \\ val(\ell(v_i)) - \deg(v_i; T) + 1 & v_i \in \{r_0, \ldots, r_h\},\ 1 \le i \le r(v). \end{cases}$$
If G has no connected and loopless ρ-detachment, then T cannot be extended to a
connected and loopless tree with n vertices. The new label (label A in Fig. 5) is
introduced in order to ensure the existence of the edges $(r_i, r_{i+1})$, i = 0, 1, ..., h − 1.
By Theorem 3.2, we only need to check whether one or more of the next two conditions
is violated.
(a) $\sum_{1 \le i \le r(v)} \rho^v_i \ge \deg(v; G)$ for all v ∈ V.
(b) $r(X) + c(G - X) - d(X, V; G) \le 1$ for all $\emptyset \ne X \subseteq V$.
Notice that condition (a) is an inequality rather than an equality because the feature
vector counts multiple edges as one edge. Our detachment-cut discards T if (a) or (b)
is violated.
[Figure 5: a partial tree T, the feature vectors g and g', and the auxiliary graph G on the
labels H, O, C and the new label A, with val(H) = 1, val(O) = 2, val(C) = 4, val(A) = 3;
r(H) = 6, r(O) = 2, r(C) = 3, r(A) = 1; ρ(H) = (1, 1, 1, 1, 1, 1), ρ(O) = (2, 2),
ρ(C) = (3, 2, 4), ρ(A) = (3).]
Fig. 5. An illustration of how to construct a graph G for checking the validity of T using the
detachment-cut, where we omit symmetric and zero entries in the feature vectors.
We remark that condition (b) consists of $2^{s+1} - 1$ inequalities, but this number is
usually small because s is very small; e.g., s is 2 for alkanes and 5 in our experiments.
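Because the number of subsets X is small, condition (b) can be checked by plain enumeration. The sketch below does this for an arbitrary small multigraph on labels. It is an illustration only: the function name and data layout are assumptions, and d(X, V; G) is taken here to count every edge with at least one end point in X exactly once, which is an interpretation rather than something stated above. Condition (a) is not covered.

```python
from itertools import combinations


def satisfies_condition_b(vertices, mult, r):
    """Check r(X) + c(G - X) - d(X, V; G) <= 1 for every non-empty X.

    mult: {frozenset({u, v}): multiplicity}; a self-loop is frozenset({u}).
    r: {vertex: number of copies}.
    """
    def components(keep):
        seen, count = set(), 0
        for start in keep:
            if start in seen:
                continue
            count += 1
            seen.add(start)
            stack = [start]
            while stack:
                u = stack.pop()
                for e, m in mult.items():
                    if m > 0 and u in e:
                        for w in e:
                            if w in keep and w not in seen:
                                seen.add(w)
                                stack.append(w)
        return count

    vs = list(vertices)
    for k in range(1, len(vs) + 1):
        for subset in combinations(vs, k):
            X = set(subset)
            r_X = sum(r[v] for v in X)
            # Assumed reading of d(X, V; G): edges touching X, each counted once.
            d_X = sum(m for e, m in mult.items() if e & X)
            c_X = components(set(vs) - X)
            if r_X + c_X - d_X > 1:
                return False
    return True


# Hypothetical toy instance: one carbon joined to hydrogens by four C-H edges.
ch_edges = {frozenset(("C", "H")): 4}
ok = satisfies_condition_b(["C", "H"], ch_edges, {"C": 1, "H": 4})
too_many = satisfies_condition_b(["C", "H"], ch_edges, {"C": 1, "H": 5})
```

With four hydrogens the instance passes (a tree on five vertices needs exactly four edges), while asking for five hydrogen copies fails for X = {H}, since 5 + 1 − 4 = 2 > 1.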
4. Alternative problem formulation
We also follow the second problem formulation in [10], which uses two kinds of graph
transformation. First, the H-removal transformation reduces the size of compounds
by removing hydrogens. Then the single-bond transformation replaces each group of
multiple edges with a new virtual atom and two new simple edges joining the same end
points. Fig. 6 illustrates these two transformations.
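[Figure 6: a compound before and after the H-removal transformation, and after the
single-bond transformation, in which a double bond between O and C is replaced by a new
vertex labeled {{O, C}, 2}.]
Fig. 6. An illustration of the H-removal and single-bond transformations.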
When the single-bond transformation replaces multiple edges (u, v) by a new
vertex w and two new simple edges (u, w) and (w, v), we define the bond label ℓ(w)
of w by ℓ(w) = ({ℓ(u), ℓ(v)}), and define the bond valence of ℓ(w) by the multiplicity
of (u, v). Let $C_\Sigma$ be the set of all such bond labels and $\Sigma^* = \Sigma \cup C_\Sigma$. For each vertex
v of a $\Sigma^*$-labeled tree T, its bond degree $\widetilde{\deg}(v; T)$ is defined as the number of
vertices adjacent to v. We consider the next formulation.
Problem 2. Given a set of labels $\Sigma^*$, a feature vector g of level K, and a valence
function val : Σ → Z+, find all $\Sigma^*$-labeled simple trees $T^* = (V^*, E^*)$ that satisfy
$f_K(T^*) = g$ and $\widetilde{\deg}(v; T^*) \le val(\ell(v))$ for all $v \in V^*$.
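The H-removal and single-bond transformations described above can be sketched on a multitree stored as a label dictionary plus edge multiplicities. This is an illustration under assumptions: the function names, the `("bond", u, v)` naming of the new vertices and the choice to store a bond label together with its bond valence as a pair (mirroring the {{O, C}, 2} annotation of Fig. 6) are not taken from [10].

```python
def h_removal(labels, mult):
    """Drop all hydrogen vertices and the edges incident to them."""
    kept = {v: lab for v, lab in labels.items() if lab != "H"}
    edges = {e: m for e, m in mult.items() if all(v in kept for v in e)}
    return kept, edges


def single_bond(labels, mult):
    """Replace every edge of multiplicity m > 1 by a bond vertex and two simple edges."""
    new_labels = dict(labels)
    new_edges = set()
    for e, m in mult.items():
        u, v = sorted(e)
        if m == 1:
            new_edges.add(frozenset((u, v)))
        else:
            w = ("bond", u, v)   # hypothetical name for the new virtual atom
            new_labels[w] = (frozenset((labels[u], labels[v])), m)
            new_edges.add(frozenset((u, w)))
            new_edges.add(frozenset((w, v)))
    return new_labels, new_edges


# Hypothetical numbering of the molecule of Fig. 1; the C=O bond has multiplicity 2.
labels = {0: "C", 1: "C", 2: "O", 3: "O", 4: "H", 5: "H", 6: "H", 7: "H"}
mult = {frozenset((0, 1)): 1, frozenset((1, 2)): 2, frozenset((1, 3)): 1,
        frozenset((0, 4)): 1, frozenset((0, 5)): 1, frozenset((0, 6)): 1,
        frozenset((3, 7)): 1}
heavy_labels, heavy_mult = h_removal(labels, mult)
t_labels, t_edges = single_bond(heavy_labels, heavy_mult)
```

On this example the eight-atom multitree shrinks to four heavy atoms, and the single-bond step then yields a simple tree with five vertices and four edges, one of which vertices is the bond vertex between the carbon and the doubly bonded oxygen.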
To solve this, we follow the aforementioned framework with the same branching
operation. The bounding operations are somewhat different, however. In fact, we can
still employ bounding operations based on the four criteria (C1)-(C4) as stated
in Section 3.2 (notice that Problem 2 considers only simple trees). Moreover, we
introduce a new H-cut bounding operation, which discards the partial tree T being
checked if the number of hydrogens that must be appended to T and any of its
descendants in order to restore the compound exceeds a pre-calculated limit.
Formally, we first calculate the numbers $h^*(\ell)$, ℓ ∈ Σ, of hydrogens that must
be appended to vertices labeled ℓ. It is easy to see that this can be done from the
input feature vector of level 1 and the valence function. The H-cut checks, for each
ℓ ∈ Σ, whether (a lower bound on) the number of hydrogens that must be appended to
the ℓ-labeled vertices in T exceeds $h^*(\ell)$. We use the following lower bound:
$$h(\ell; T) = \sum \{\, val(\ell(v)) - \deg(v; T) \mid v \in T \setminus RP(T),\ \ell(v) = \ell \,\}.$$
(Recall that T and all descendants of T in the family tree share the common structure
of T \ RP(T).) See Fig. 7 for an illustration.
Fig. 7. An illustration of the H-cut procedure, where only label C is being considered and the
numbers val(ℓ(v)) − deg(v; T) are shown near each carbon not on the rightmost path.
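The H-cut test itself is a short summation over the vertices that are off the rightmost path. The sketch below assumes that the limits $h^*(\ell)$ have already been computed and are passed in as a dictionary, and that `deg[v]` holds the valence already used by v in T; the function name, argument layout and the toy instance are hypothetical.

```python
def h_cut_discards(labels, deg, rightmost_path, val, h_star):
    """Return True if the lower bound h(l; T) exceeds h*(l) for some label l.

    labels: {vertex: label}; deg: {vertex: valence already used in T};
    rightmost_path: vertices of RP(T); h_star: {label: hydrogen limit}.
    """
    on_path = set(rightmost_path)
    need = {}
    for v, lab in labels.items():
        # Only vertices off the rightmost path are final; labels without a
        # limit (e.g. bond labels) are ignored.
        if v not in on_path and lab in h_star:
            need[lab] = need.get(lab, 0) + val[lab] - deg[v]
    return any(need[lab] > h_star[lab] for lab in need)


# Hypothetical partial tree: carbon 0 with children 1 and 2; RP(T) = (0, 2).
labels = {0: "C", 1: "C", 2: "C"}
deg = {0: 2, 1: 1, 2: 1}
val = {"C": 4}
cut = h_cut_discards(labels, deg, [0, 2], val, {"C": 2})
keep = h_cut_discards(labels, deg, [0, 2], val, {"C": 3})
```

Vertex 1 is off the rightmost path and can never gain another neighbour, so it needs 4 − 1 = 3 hydrogens; the tree is discarded when only 2 hydrogens are available for carbons and kept when 3 are.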
5. Experimental Results
We conducted computational experiments to compare the running time of our
algorithms with that of the algorithms of [10] on the same instances, which were
obtained by randomly picking some tree-like compounds from the KEGG LIGAND database
(http://www.genome.jp/kegg/ligand.html) and replacing each benzene ring by a
new virtual element of valence 6. Feature vectors were calculated for levels 1, 2, ..., 7.
For Problem 2, we preprocessed the instances with the H-removal and single-bond
transformations. The experiments were performed on a PC with a Pentium 4
3.00 GHz CPU. Tables 1 and 2 show the experimental results for Problems 1 and 2,
respectively. We observe that the new algorithms run considerably faster than those
of [10]; for example, for C07530 with K = 2 in Table 1, the running time drops from
50.55 to 1.40 seconds.
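As a small worked example of how the speed-up can be read off Table 1, the following sketch divides the running time reported for the algorithm of [10] by the running time reported for the new algorithm on four rows where both finished within the time limit. The choice of rows and the variable names are ours.

```python
# (entry, K, time of [10] in seconds, time of the new algorithm in seconds),
# copied from four rows of Table 1 where neither algorithm timed out.
rows = [
    ("C03343", 2, 3.11, 0.48),
    ("C07530", 2, 50.55, 1.40),
    ("C07178", 2, 51.72, 3.51),
    ("C03690", 7, 1287.30, 10.27),
]

speedup = {(entry, k): old / new for entry, k, old, new in rows}

for (entry, k), ratio in speedup.items():
    print(f"{entry} K={k}: {ratio:.1f}x")
```

On these rows the ratio ranges from roughly 6 to roughly 125.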
Table 1. Experimental results for Problem 1.

                            | Fujiwara et al.'s algorithm [10] | Our algorithm (this paper)
Entry   Formula       n1  K |     time            nnt    fs |     time           nnt       fs
C03343  C16H22O4      37  1 |     T.O.  1,334,417,908  N.F. |   158.23    25,149,700  570,773
                          2 |     3.11        830,298     9 |     0.48        46,311        9
                          3 |     3.25        614,413     2 |     0.30        28,106        2
                          4 |     3.06        428,440     1 |     0.27        21,688        1
                          5 |     3.42        391,046     1 |     0.26        18,616        1
                          6 |     2.33        210,246     1 |     0.21        12,129        1
                          7 |     1.85        146,605     1 |     0.19        10,551        1
C07530  C17H28N2O     43  1 |     T.O.  1,407,334,896  N.F. |   109.27     7,966,323   73,711
                          2 |    50.55     16,339,119    55 |     1.40        95,639       55
                          3 |    16.78      3,265,086     1 |     0.61        35,025        1
                          4 |     7.14        994,926     1 |     0.34        15,734        1
                          5 |     3.28        366,628     1 |     0.18         7,929        1
                          6 |     3.37        299,518     1 |     0.16         6,862        1
                          7 |     3.88        299,518     1 |     0.18         6,862        1
C07178  C21H28N2O5    46  1 |     T.O.  1,237,087,310  N.F. |   500.78    31,003,703   70,170
                          2 |    51.72     15,827,372    16 |     3.51       158,597       16
                          3 |     4.26        915,962     2 |     0.32        15,427        2
                          4 |     0.94        146,789     1 |     0.16         6,677        1
                          5 |     1.02        123,251     1 |     0.15         5,485        1
                          6 |     1.13        118,295     1 |     0.16         5,450        1
                          7 |     1.00         93,947     1 |     0.15         5,036        1
C03690  C24H38O4      61  1 |     T.O.  1,428,804,364  N.F. |     T.O.   456,703,633     N.F.
                          2 |     T.O.    499,544,612  N.F. |   318.68    32,927,230    1,198
                          3 |     T.O.    338,357,072  N.F. |   188.13    16,574,164        8
                          4 |     T.O.    254,834,091  N.F. |    44.07     3,469,929        4
                          5 |     T.O.    198,785,929  N.F. |    36.54     2,385,611        2
                          6 |     T.O.    129,353,817     1 |    16.02       854,956        1
                          7 |  1287.30     77,002,582     1 |    10.27       477,305        1

Note: (1) C03343, C07530, C07178, and C03690 are the entries of 2-Ethylhexyl phthalate,
Etidocaine, Trimethobenzamide, and Bis (2-ethylhexyl) phthalate in the KEGG LIGAND database,
respectively; (2) n1 is the number of atoms in an instance preprocessed by replacing each benzene
ring with a new atom of valence 6; (3) K is the level of a given feature vector; (4) "time" is the
CPU time in seconds; (5) "T.O." means "time over" (the time limit was set to 1800 seconds); (6)
"nnt" is the number of nodes in the family trees that are checked; (7) "fs" is the number of feasible
solutions found within the time limit; and (8) "N.F." means "not found".

6. Conclusion
In this paper, we presented two branch-and-bound algorithms for enumerating
tree-like chemical graphs with given path frequency, which are based on the framework
of [10] and improve on its results. In particular, we have proposed two bounding
operations, the detachment-cut and the H-cut. As future work, we are considering
the enumeration of more general graph classes, e.g., outerplanar graphs, which are
known to cover most chemical graphs. Preliminary work can be found in [19].
We note that the depth-label sequences defined in this paper only represent
the graphical structures of compounds from the viewpoint of planarity and may lose
information on stereochemistry, especially for stereoisomers. Thus, designing better
representations is another interesting topic for future research.
Acknowledgments
This work was supported in part by Grant-in-Aid #19200022 from the Ministry of
Education, Culture, Sports, Science and Technology (MEXT) of Japan. We thank
Hiroki Fujiwara and Jiexun Wang for their helpful discussions.
Table 2. Experimental results for Problem 2.

                            |  Fujiwara et al.'s algorithm [10]   |    Our algorithm (this paper)
Entry   Formula       n2  K |     time            nnt         fs |     time           nnt          fs
C03343  C16H22O4      17  1 |    66.06     28,683,656    570,773 |    13.31     5,865,685     570,773
                          2 |     0.03          5,157          9 |     0.01         3,091           9
                          3 |     0.03          4,607          2 |     0.02         2,780           2
                          4 |     0.04          4,086          1 |     0.02         2,453           1
                          5 |     0.04          3,470          1 |     0.02         2,098           1
                          6 |     0.04          2,909          1 |     0.02         1,739           1
                          7 |     0.04          2,675          1 |     0.02         1,596           1
C07530  C17H28N2O     16  1 |    10.26      4,029,246     73,711 |     1.00       424,121      73,711
                          2 |     0.16         43,513         55 |     0.06        14,900          55
                          3 |     0.09         16,090          1 |     0.04         6,385           1
                          4 |     0.06          8,006          1 |     0.02         3,736           1
                          5 |     0.04          5,624          1 |     0.02         2,522           1
                          6 |     0.04          4,642          1 |     0.02         2,245           1
                          7 |     0.04          4,642          1 |     0.02         2,245           1
C07178  C21H28N2O5    19  1 |   222.29     96,006,467     70,170 |     9.03     3,909,283      70,170
                          2 |     0.11         21,460         16 |     0.02         4,321          16
                          3 |     0.09         11,950          2 |     0.02         2,984           2
                          4 |     0.03          3,152          1 |     0.01         1,062           1
                          5 |     0.02          2,143          1 |     0.01           819           1
                          6 |     0.02          2,088          1 |     0.01           794           1
                          7 |     0.02          2,088          1 |     0.01           794           1
C03690  C24H38O4      25  1 |     T.O.    664,265,016  5,305,243 |     T.O.   708,264,977  60,257,365
                          2 |    23.36      2,984,162      1,198 |     8.10     1,113,024       1,198
                          3 |    15.87      1,464,436          8 |     5.66       570,616           8
                          4 |     7.12        509,870          4 |     2.46       197,027           4
                          5 |     4.97        283,418          2 |     1.90       120,718           2
                          6 |     2.66        132,434          1 |     1.12        60,310           1
                          7 |     2.10        101,097          1 |     0.88        46,319           1
C04036  C19H39O7P     29  1 |     T.O.    734,327,164  2,653,617 |     T.O.   759,794,526  11,587,705
                          2 |     T.O.    228,786,134        161 |  1543.37   300,524,875       2,520
                          3 |   184.54     14,517,014          1 |    45.36     4,745,395           1
                          4 |    11.86        638,457          1 |     3.60       262,162           1
                          5 |     5.95        225,966          1 |     2.27       107,378           1
                          6 |     4.34        127,250          1 |     1.57        60,557           1
                          7 |     3.38         81,532          1 |     1.26        40,493           1
C03630  C21H39O7P     33  1 |     T.O.    667,687,809      3,959 |     T.O.   639,689,202      96,245
                          2 |     T.O.    168,054,487         77 |     T.O.   239,538,772       1,736
                          3 |     T.O.    115,797,466         11 |   438.19    37,803,253          13
                          4 |   118.48      5,104,899         11 |    25.65     1,519,286          11
                          5 |    50.63      1,554,928          9 |    12.24       515,752           9
                          6 |    27.83        673,426          7 |     6.44       225,620           7
                          7 |    11.97        244,166          5 |     3.14        92,431           5

Note: (1) C03343, C07530, C07178, C03690, C04036, and C03630 are the entries of 2-Ethylhexyl
phthalate, Etidocaine, Trimethobenzamide, Bis (2-ethylhexyl) phthalate, 1-Palmitoylglycerol
3-phosphate, and Oleoylglycerone phosphate in the KEGG LIGAND database, respectively; (2) n2 is
the number of vertices in an instance preprocessed by replacing benzene rings with new atoms of
valence 6 and by the H-removal and single-bond transformations; (3) K is the level of a given
feature vector; (4) "time" is the CPU time in seconds; (5) "T.O." means "time over" (the time
limit was set to 1800 seconds); (6) "nnt" is the number of nodes in the family trees that are
checked; (7) "fs" is the number of feasible solutions found within the time limit.
References
[1] Akutsu, T., Fukagawa, D., Inferring A Graph from Path Frequency, LNCS, 3537,
371–382, 2005.
[2] Akutsu, T., Fukagawa, D., Inferring a Chemical Structure from a Feature Vector
Based on Frequency of Labeled Paths and Small Fragments, Series on Advances in
Bioinformatics and Computational Biology, Proc. 5th Asia-Pacific Bioinformatics
Conf. Sankoff, D., Wang, L., Chin, F., Eds.; Imperial College Press, 165–174, 2007.
[3] Aringhieri, R., Hansen, P., Malucelli, F., Chemical Trees Enumeration Algorithms,
4OR, 1, 67–83, 2003.
[4] Bakır, G. H., Zien, A., Tsuda, K., Learning to Find Graph Pre-Images, LNCS, 3175,
253–261, 2004.
[5] Buchanan, B. G., Feigenbaum, E. A., DENDRAL and Meta-DENDRAL - Their Applications Dimension, Artif. Intell., 1, 5–24, 1978.
[6] Cayley, A., On the Analytic Forms Called Trees, with Applications to the Theory of
Chemical Combinations, Reports British Assoc. Adv. Sci., 45, 257-305, 1875.
[7] Deshpande, M., Kuramochi, M., Wale, N., Karypis, G., Frequent Substructure-Based
Approaches for Classifying Chemical Compounds, IEEE Transactions on Knowledge
and Data Engineering, 17, 1036–1050, 2005.
[8] Faulon, J. L., Churchwell, C. J., Visco, Jr., D.P., The Signature Molecular Descriptor.
2. Enumerating Molecules from Their Extended Valence Sequences, J. Chem. Inf.
Comp. Sci., 43, 721–734, 2003.
[9] Fink, T., Reymond, J. L., Virtual Exploration of the Chemical Universe up to 11
Atoms of C, N, O, F: Assembly of 26.4 Million Structures (110.9 Million Stereoisomers) and Analysis for New Ring Systems, Stereochemistry, Physicochemical Properties, Compound Classes, and Drug Discovery, J. Chem. Inf. Comp. Sci., 47, 342–353,
2007.
[10] Fujiwara, H., Wang, J., Zhao, L., Nagamochi, H., Akutsu, T., Enumerating Tree-like
Chemical Graphs with Given Path Frequency, J. Chem. Inf. Model., 2008 (to appear).
[11] Funatsu, K., Sasaki, S., Recent Advances in the Automated Structure Elucidation
System, CHEMICS. Utilization of Two-Dimensional NMR Spectral Information and
Development of Peripheral Functions for Examination of Candidates, J. Chem. Inf.
Comp. Sci., 36, 190–204, 1996.
[12] Hall, L. H., Dailey, E. S., Design of Molecules from Quantitative Structure-Activity
Relationship Models. 3. Role of Higher Order Path Counts: Path 3, J. Chem. Inf.
Comp. Sci., 33, 598–603, 1993.
[13] Kashima, H., Tsuda, K., Inokuchi, A., Marginalized Kernels between Labeled Graphs,
Proc. 20th International Conference on Machine Learning, Fawcett, T., Mishra, N.
Eds., The AAAI Press, Menlo Park, California, 321–328, 2003.
[14] Mahé, P., Ueda N., Akutsu, T., Perret, J. L., Vert, J. P., Graph Kernels for Molecular
Structure-Activity Relationship Analysis with Support Vector Machines, J. Chem.
Inf. Model., 45, 939–951, 2005.
[15] Mauser, H., Stahl, M., Chemical Fragment Spaces for De Novo Design, J. Chem. Inf.
Comp. Sci., 47, 318–324, 2007.
[16] Nagamochi, H., A Detachment Algorithm for Inferring A Graph from Path Frequency,
LNCS, 4112, 274–283, 2006.
[17] Nakano, S., Uno, T., Efficient Generation of Rooted Trees, Technical Report,
NII-2003-005E, ISSN:1346-5597; National Inst. of Informatics: Tokyo, Japan, July 3, 2003.
[18] Nakano, S., Uno, T., Generating Colored Trees, LNCS, 3787, 249–260, 2005.
[19] Wang, J., Zhao, L., Nagamochi, H., Akutsu, T., An Efficient Algorithm for Generating
Colored Outerplanar Graphs, LNCS, 4484, 573–583, 2007.