
BIOINFORMATICS
ORIGINAL PAPER
Vol. 23 no. 16 2007, pages 2038–2045
doi:10.1093/bioinformatics/btm298
Structural bioinformatics
Classification of small molecules by two- and three-dimensional
decomposition kernels
Alessio Ceroni, Fabrizio Costa* and Paolo Frasconi
Machine Learning and Neural Networks Group, Dipartimento di Sistemi e Informatica,
Università degli Studi di Firenze, Italy
Received on February 19, 2007; revised on May 14, 2007; accepted on May 28, 2007
Advance Access publication June 5, 2007
Associate Editor: Anna Tramontano
ABSTRACT
Motivation: Several kernel-based methods have been recently
introduced for the classification of small molecules. Most available
kernels on molecules are based on 2D representations obtained
from chemical structures, but far less work has focused so far on the
definition of effective kernels that can also exploit 3D information.
Results: We introduce new ideas for building kernels on small
molecules that can effectively use and combine 2D and 3D
information. We tested these kernels in conjunction with support
vector machines for binary classification on the 60 NCI cancer
screening datasets as well as on the NCI HIV data set. Our results
show that 3D information leveraged by these kernels can consistently improve prediction accuracy in all datasets.
Availability: An implementation of the small molecule classifier is
available from http://www.dsi.unifi.it/neural/src/3DDK
Contact: [email protected]
1 INTRODUCTION
Prediction of the biological activity of chemical compounds is
one of the major goals of chemoinformatics, mainly prompted
by the need to improve current drug development methods
(van de Waterbeemd and Gifford, 2003) and to gain deeper
insights into toxicology (Helma, 2005; Helma and Kramer, 2003).
Recent developments in combinatorial technologies and ultra-high-throughput screening have paved the way towards large-scale exploration of potential drugs. In this context, prediction algorithms are needed to reduce attrition in drug development and to select subsets of interesting compounds from
the large set of molecules that can be potentially synthesized
(van de Waterbeemd and Gifford, 2003). Machine-learning
algorithms for prediction are justified by the reasonable
assumption that compounds having similar geometrical
and physicochemical characteristics should also exhibit similar
effects on the same biological system. A major challenge is to
define an effective quantitative measure of similarity. In this
article, we extend recent research in the area of kernel machines
for defining similarity between small molecules (Horváth et al.,
2004; Kashima et al., 2003; Mahe et al., 2005; Menchetti et al.,
*To whom correspondence should be addressed.
2005; Swamidass et al., 2005) and propose new methods that
combine 2D and 3D information.
Starting from the seminal work of Hansch et al. (1962) on
quantitative structure activity relationship (QSAR), several
researchers have investigated supervised learning techniques
for predicting the biological activity of small molecules. All
these techniques must effectively solve two separate
subproblems: one related to representation, and another
focused on statistical learning issues. The former essentially
amounts to devising a proper set of descriptors that effectively
capture the representative characteristics of the molecules
(ideally, molecules exhibiting similar behavior should
be represented by similar descriptors). The latter amounts to
finding a smooth and stable function that maps descriptors
either into a real number expressing the activity of the molecule
(regression) or into a boolean decision such as active/inactive
(classification).
In classical approaches, after descriptor analysis has been
performed on the chemical structure, a molecule is always
represented by a vector of real numbers and the similarity
between two molecules is implicitly or explicitly computed from
their descriptor vectors. The simplicity of vector-based
representations (as opposed to relational representations) has
made it possible to consider several alternative algorithms for
solving the supervised learning problem including linear
regression, ridge regression (Hawkins et al., 2001), artificial
neural networks (Devillers, 1996; Jalali-Heravi and Parastar,
2000) and support vector machines (SVMs) (Burbidge et al.,
2001). Although there are some crucial statistical aspects that
may affect the quality of prediction [most importantly
the problem of identifying outliers (Lipnick, 1991) and problems
of feature selection and dimensionality reduction (L’Heureux
et al., 2004)] most of the research has focused on descriptors.
Descriptors have been traditionally obtained from both the
chemical and the 3D structure of the compounds using steric
and hydrophobic properties (Hansch et al., 1995), topological
properties (Devillers and Balaban, 1999), connectivity indices
(Kier and Hall, 1986) or quantum-chemical effects (Karelson
et al., 1996). Alternatively, 3D-QSAR (Kubinyi et al., 2002)
methods have been proposed to generate the descriptors
exclusively from the 3D structure of the molecule. Among the most prominent 3D-QSAR methods are CoMFA (comparative molecular field analysis; Cramer et al., 1988)
and CoMSIA (comparative molecular similarity indices
analysis, Klebe et al., 1994). In both approaches, the molecular
structures are first aligned and then placed into a 3D grid to
calculate their interaction energy with an atomic probe.
The molecular fields obtained in this way are then correlated
with the biological activity by partial least squares analysis.
However, these grid-based approaches are strongly influenced
by the alignment of the compound structures, which is the critical
step of these methods. Pastor et al. (2000) proposed an
alternative approach based on the extraction of pairwise
pharmacophore distances that does not require alignment.
Other (and somewhat simpler) approaches for incorporating
geometrical information have been proposed by Deshpande
et al. (2003) and Swamidass et al. (2005). These are also limited
to pairwise atom distances, which are less effective than
interaction energies.
Alternatives to descriptor-vector approaches aim to preserve
as much information as possible about the molecular structure
by recognizing the relational nature of the data. In particular,
molecules can be represented as labeled graphs whose vertices
and edges contain attributes describing physico-chemical
properties of atoms and bonds, respectively. Relational
learning algorithms have been mainly developed within the
inductive logic programming (ILP) framework (King et al.,
1996). ILP has also the advantage that domain expertise can be
encoded in the form of a logic program and that the result of
the learning algorithm is human readable in the form of first-order predicates. The complexity of ILP algorithms, however,
may not scale well with the training set size (Giordana
and Saitta, 2000), making it difficult to apply them to large
scale problems. For this reason, a number of researchers have
explored the much simpler graph-based representation of
the compound chemical structure and transformed the problem
of finding chemical substructures to that of finding subgraphs
in this graph-based representation (Berthold and Borgelt, 2002;
Deshpande et al., 2003). An even simpler representation
was proposed by Kramer et al. (2001): in this scheme each
chemical compound is represented as a SMILES string
(Weininger, 1988, 1989), a 1D representation of chemical
structures encoding atomic content and connectivity. A single
compound can be represented by different SMILES strings,
unless a canonical ordering is defined. This representation
simplifies the problem of discovering frequently occurring
substructures at the cost of losing some information on
the represented molecules.
Different statistical learning algorithms for structured data
have been successfully applied to QSAR problems, including
recursive neural networks (Micheli et al., 2001) and kernel
machines (Horváth et al., 2004; Kashima et al., 2003; Mahe
et al., 2005; Menchetti et al., 2005; Swamidass et al., 2005).
Micheli et al. (2001) first transformed chemical structures into
ordered trees (this is possible for the particular class of
compounds, benzodiazepines, investigated in that study)
and then used a standard RNN for learning the structure/
activity relation. On the other hand, kernel methods offer
higher flexibility and do not require the input structure to be
ordered. Horváth et al. (2004) have proposed a decomposition
kernel that compares the sets of tree and cyclic patterns present
in the input molecules. This kernel has been applied to the
HIV dataset and will be used as a comparison in this work
(Section 3.2). The kernel proposed by Kashima et al. (2003) is
based on counting the number of common label paths
originated by random walks on the graph representation of
the molecules. Mahe et al. (2005) improved this kernel
and obtained successful results on two mutagenesis datasets.
Swamidass et al. (2005) tested several new kernels based on 1D,
2D and 3D representations of the input molecules on the NCI
Cancer dataset (Section 3.1). Finally, the 2D kernel introduced
by Menchetti et al. (2005) (and also exploited in this article in
conjunction with a novel 3D kernel) is based on the intuition
that large subgraphs can be matched in a soft way by flattening
them into histograms (see Section 2.3 for more details).
Three-dimensional information can be obtained by running
rule-based prediction programs such as CORINA (Gasteiger
et al., 1990; Sadowski et al., 2003) on the chemical
compound. Atomic coordinates predicted in this way may be
affected by errors but nonetheless they may help to derive a
better measure of similarity between molecules. However, as
briefly summarized above, research about kernels on molecular
structures has mainly focused on 2D structures (i.e. sparse
graphs) and only few methods have considered the possibility
of incorporating features extracted from predicted atom coordinates into an SVM classifier. Deshpande et al. (2003) have
defined an extension of their frequent subgraph mining
algorithm that also checks the average distance between
all atom pairs in a (sub)molecule as a condition for detecting
subgraphs. Swamidass et al. (2005) have proposed a kernel that
takes 3D coordinates into account by computing the inner
product between histograms of pairwise interatomic distances.
Deshpande et al. (2003) report that 3D information in
conjunction with their method helps improve accuracy on the NCI HIV dataset, while Swamidass et al. (2005) found that using only 3D information worsens prediction accuracy on
the NCI cancer dataset.
By computing distance averages or histograms, previous
methods only take into account proximity relationships
between pairs of atoms but not the 3D similarity between
larger sets of atoms (as could be inferred by preserving the fully
connected distance graph). The novel 3D kernel presented in
this article is based on the comparison between entire 3D
shapes and can in principle preserve more information. We
show that a combination of the weighted decomposition kernel (WDK) of Menchetti et al. (2005) and the new 3D decomposition kernel (3DDK) allows us to reach state-of-the-art
performance.
2 METHODS

2.1 Background on kernel methods for structured data
Our approach to the classification of small molecules follows
the classical supervised learning framework of SVM (Schölkopf
and Smola, 2002). In this class of algorithms, a classification function
f(x) is obtained from data as
f(x) = \sum_{i=1}^{m} \alpha_i y_i K(x_i, x)    (1)

In the above formula, (x_i, y_i) is a generic training example, consisting of an input object x_i and its associated target y_i ∈ {−1, 1} representing a binary class. The coefficients α_i are determined by the learning procedure and α_i ≠ 0 if x_i is a support vector. Finally, K is a kernel function that measures the similarity between two examples. Positive semi-definite kernels are preferred since they yield a convex optimization problem for finding the coefficients α_i.
Kernels on structured data (such as molecular graphs) are typically special cases of decomposition or convolution kernels (Haussler, 1999), a vast class of functions that relies on two main concepts: (1) decomposition of the structured objects into parts and (2) composition of kernels between these parts. Let X be a set of structured objects (such as labeled graphs) and, for x ∈ X, suppose (x_1, ..., x_D) is a tuple of parts of x, with x_d ∈ X_d for each part type d = 1, ..., D, which can be formally expressed by a decomposition relation R(x_1, ..., x_D, x). Assuming that for each part type a Mercer kernel κ_d : X_d × X_d → ℝ is available, the decomposition kernel between two structured objects x and x' is defined as

K(x, x') = \sum_{\substack{(x_1,\dots,x_D) \in R^{-1}(x) \\ (x'_1,\dots,x'_D) \in R^{-1}(x')}} \prod_{d=1}^{D} \kappa_d(x_d, x'_d)    (2)

where R^{-1}(x) = {(x_1, ..., x_D) : R(x_1, ..., x_D, x)} denotes the set of all possible decompositions of x.
In practice, popular decomposition kernels are very often 'all-substructures kernels' that rely on a single part type (i.e. D = 1) and are based on the exact match kernel κ_1 = δ, where δ(x, x') = 1 if x ≅ x' and 0 otherwise. Common examples of 'all-substructures kernels'
are the spectrum kernel for strings (Leslie et al., 2002), the kernel
defined by Collins and Duffy (2002) for trees and the cyclic pattern
kernel for molecules (Horváth et al., 2004). Although some of
the above kernels can be computed efficiently, matching arbitrary
substructures may be intractable because it involves subgraph
isomorphism (Gärtner, 2003).
All-substructures kernels may induce sparseness in the feature space induced by the kernel if the parts of x that are informative for prediction are only a small subset of R^{-1}(x). In this case, the similarity between x and x' can typically be much smaller than the similarity between x and itself, leading to a problem of diagonal
dominance (Schölkopf et al., 2002) and a consequent form of rote
learning in which generalization to new cases is unsatisfactory.
In practice, in order to obtain good performance in a particular
prediction task, it is very important to carefully tune the decomposition
relation R and the kernel between parts. The 2D and 3D kernels
described in the following aim to reduce sparseness by limiting
the number of parts and by introducing a soft notion of similarity
for matching them.
2.2 Representation of small molecules

A molecule is represented as an attributed undirected graph x = (V, E). Each vertex v ∈ V represents an atom and is labeled by three kinds of attributes: t[v] is the atom type (defined as in the Tripos Sybyl MOL2 format¹), c[v] is the partial charge, discretized into negative if c[v] < −0.05, uncharged if −0.05 ≤ c[v] ≤ 0.05 and positive if c[v] > 0.05, and \vec{r}[v] ∈ ℝ³ are the 3D atom coordinates. Each edge (u, v) ∈ E represents a chemical bond and is characterized by a type b[u,v] that can be one of single, double, triple, amide or aromatic.
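A minimal sketch of this attributed-graph representation, assuming a simple dictionary-based encoding; the class and attribute names are illustrative and not taken from the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Molecule:
    """Attributed undirected graph x = (V, E) for one compound (illustrative encoding)."""
    atom_type: Dict[int, str] = field(default_factory=dict)     # t[v], Sybyl MOL2 type, e.g. "C.3"
    charge_class: Dict[int, str] = field(default_factory=dict)  # discretized partial charge c[v]
    coords: Dict[int, Tuple[float, float, float]] = field(default_factory=dict)  # 3D coordinates
    bond_type: Dict[Tuple[int, int], str] = field(default_factory=dict)  # single/double/triple/amide/aromatic
    adj: Dict[int, List[int]] = field(default_factory=dict)     # adjacency lists of the chemical graph

def discretize_charge(partial_charge: float) -> str:
    """Discretize a partial charge with the +/-0.05 thresholds described above."""
    if partial_charge < -0.05:
        return "negative"
    if partial_charge > 0.05:
        return "positive"
    return "uncharged"
```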
2.3 Weighted decomposition kernels for 2D chemical structures
Menchetti et al. (2005) extended the definition of all-substructures
kernels by weighting the match between identical parts (called here
selectors) according to contextual information. The resulting kernel can be applied to general data types, but in this article we are interested in the case of labeled graphs.

¹ Available from www.tripos.com/data/support/mol2.pdf

Fig. 1. Comparing substructures in a weighted decomposition kernel.

A weighted decomposition kernel (WDK)
is characterized by a decomposition R(s,z,x) where s is a subgraph of
x called the selector and z is a subgraph of x called the context
of occurrence of s in x (generally a subgraph containing s). This setting
results in the following general form of the kernel:
K_{2D}(x, x') = \sum_{\substack{(s,z) \in R^{-1}(x) \\ (s',z') \in R^{-1}(x')}} \delta(s, s')\,\kappa(z, z')    (3)

where δ is the exact matching kernel applied to selectors and κ is
a kernel on contexts. The basic idea behind WDK is that each
substructure into which a graph is decomposed is enriched with
its graphical context. Figure 1 provides an example on molecule-like
graphs where selectors are single nodes and relative contexts are vertices
and edges that are adjacent to the selector.
In this article, selectors are always single atoms and the match δ(s, s') is defined by the coincidence between the types of s and s'. The context kernel κ is based on a soft match between substructures, defined by the distributions of label contents after discarding topology.
Specifically, let L denote the total number of attributes labeling vertices and edges and, for ℓ = 1, ..., L, let p_ℓ(j) be the observed frequency of value j for the ℓ-th attribute in a substructure z. Then we compare substructures by means of a histogram intersection kernel (Odone et al., 2005), defined as follows:

\kappa_\ell(z, z') = \sum_{j=1}^{m_\ell} \min\{p_\ell(j), p'_\ell(j)\}    (4)

\kappa(z, z') = \sum_{\ell=1}^{L} \kappa_\ell(z, z')    (5)

where m_ℓ is the number of possible values of attribute ℓ. In the following we shall use L = 3: atom type, atom charge and bond type, while atom coordinates are discarded for computing the WDK.
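The context kernel of Equations (4) and (5) amounts to summing one histogram intersection per attribute; a sketch follows under the assumption that a substructure is encoded as a list of attribute-value dictionaries (one per vertex or edge), an encoding chosen here purely for illustration.

```python
from collections import Counter

def histogram_intersection(values, values_prime):
    """kappa_l of Equation (4): intersection of the two relative-frequency histograms."""
    p = Counter(values)
    p_prime = Counter(values_prime)
    n, n_prime = len(values), len(values_prime)
    return sum(min(p[j] / n, p_prime[j] / n_prime) for j in set(p) | set(p_prime))

def context_kernel(z, z_prime, attributes=("atom_type", "charge", "bond_type")):
    """kappa of Equation (5): sum over the L attributes of the per-attribute intersections."""
    total = 0.0
    for attr in attributes:
        vals = [elem[attr] for elem in z if attr in elem]
        vals_prime = [elem[attr] for elem in z_prime if attr in elem]
        if vals and vals_prime:
            total += histogram_intersection(vals, vals_prime)
    return total
```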
Contexts are formed as follows. Given a vertex v ∈ V and an integer r ≥ 0, let x(v, r) denote the substructure of x obtained by retaining all the vertices that are reachable from v by a path of length at most r, and all the edges that touch at least one of these vertices. The decomposition relation R_r, dependent on the context radius r, is then defined as R_r = {(s, z, x) : x ∈ X, s = {v}, z = x(v, r), v ∈ V}, where s is the selector and z is the context for vertex v.
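The context x(v, r) can be collected with a breadth-first search of depth r over the adjacency lists; a short sketch, with hypothetical helper names and the dictionary-based adjacency encoding assumed above.

```python
from collections import deque

def neighborhood(adj, v, r):
    """Vertices of x(v, r): those reachable from v by a path of length at most r."""
    dist = {v: 0}
    frontier = deque([v])
    while frontier:
        u = frontier.popleft()
        if dist[u] == r:
            continue
        for w in adj.get(u, []):
            if w not in dist:
                dist[w] = dist[u] + 1
                frontier.append(w)
    return set(dist)

def decomposition_r(adj, r):
    """R_r: one (selector, context) pair per vertex, with context = x(v, r)."""
    return [({v}, neighborhood(adj, v, r)) for v in adj]
```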
Menchetti et al. (2005) also proposed a further extension to Equation (3) in which each selector is associated not only with its context but also with the complement c of the context taken with respect to the whole graph, obtaining a decomposition relation R_r = {(s, z, c, x) : x ∈ X, s = {v}, z = x(v, r), c = x \ z, v ∈ V}. The same kernel is employed on both the context and the complement parts, obtaining K(x, x') = Σ δ(s, s')(κ(z, z') + κ(c, c')), where the sum extends over (s, z, c) ∈ R^{-1}(x) and the corresponding parts for x'. The inclusion of the complement is a means to add a similarity factor that takes into consideration the whole graph/molecule, i.e. two selectors are considered similar if they are embedded in molecules with similar attribute distributions (in our case element types and partial charge).
2.4 Three-dimensional decomposition kernels
A 3D molecular structure is interpreted here as a special kind of
relational data object where atoms are related by chemical bonds
but also by their spatial distances. According to this view, a novel
decomposition kernel for 3D structures is defined in this section.
The molecule is first decomposed into a set of overlapping 3D
substructures of varied geometry, called shapes. The kernel between
two molecules is then computed by summing kernels between all pairs
of shapes.
Given a molecule x = (V, E), a shape of order n is a set of n distinct vertices σ = {u_1, u_2, ..., u_n}, u_i ∈ V, for i = 1, ..., n. For example, shapes of order two can be interpreted as segments and shapes of order three as triangles.² We first introduce a kernel between shapes as follows. Given a shape σ of order n and two vertices u, v ∈ σ, let e = (t[u], t[v], b[u,v]) denote a labeled edge of the shape, formed by considering the two atom types t[u] and t[v] and the bond type b[u,v] (conventionally, b[u,v] = 0 if there is no chemical bond between atoms u and v). Since atom and bond types both take values on a finite alphabet, labeled edges can be sorted lexicographically.³ Then, let ⟨e_1, ..., e_{n(n−1)/2}⟩ denote the lexicographically ordered sequence of all labeled edges in σ. For example, the shape (C1, C2, C3, O1) for the molecule NSC_1027 (see Fig. 2) yields the sequence (C.2,C.3,1) (C.2,C.3,1) (C.2,O.2,2) (C.3,C.3,0) (C.3,O.2,0) (C.3,O.2,0).
The kernel between two shapes σ and σ' of equal order n has complexity O(n²) and is defined as:

k_{shapes}(\sigma, \sigma') = \prod_{i=1}^{n(n-1)/2} \delta(e_i, e'_i)\, e^{-\gamma (d_i - d'_i)^2}    (6)

where γ is a kernel hyperparameter and d_i = ‖\vec{r}[u_i] − \vec{r}[v_i]‖ is the length of edge e_i, i.e. the Euclidean distance between atoms u_i and v_i. The kernel between two shapes of different order is null.
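A sketch of the shape kernel of Equation (6): the labeled edges of a shape are built as (atom type, atom type, bond type) triples together with their Euclidean lengths, canonically ordered, and two shapes are compared edge by edge. The canonicalization by sorting and the default γ = 2.5 (the value used in the experiments of Section 3) are illustrative choices, not a description of the authors' code.

```python
import math
from itertools import combinations

def shape_edges(shape, atom_type, bond_type, coords):
    """Canonically ordered labeled edges of a shape (an iterable of atom ids),
    each paired with its Euclidean length d_i."""
    edges = []
    for u, v in combinations(sorted(shape), 2):
        label = tuple(sorted((atom_type[u], atom_type[v]))) + \
                (bond_type.get((u, v)) or bond_type.get((v, u)) or "0",)  # "0" = no chemical bond
        edges.append((label, math.dist(coords[u], coords[v])))
    return sorted(edges, key=lambda e: e[0])

def k_shapes(edges, edges_prime, gamma=2.5):
    """Equation (6): product of exact label matches weighted by Gaussian terms on edge lengths."""
    if len(edges) != len(edges_prime):
        return 0.0  # the kernel between shapes of different order is null
    value = 1.0
    for (lab, d), (lab_p, d_p) in zip(edges, edges_prime):
        if lab != lab_p:
            return 0.0
        value *= math.exp(-gamma * (d - d_p) ** 2)
    return value
```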
We can now move on to the definition of the 3D kernel between two molecules. One simplistic approach consists of decomposing a molecule into a bag of all possible shapes and then comparing the two bags of shapes using the above defined kernel. However, this would quickly result in a combinatorial explosion, as there are \binom{m}{n} shapes of order n in a molecule having m atoms. Even for moderate values of n (e.g. n = 3 to consider 3D triangles), this approach has two important drawbacks: the kernel would be computationally costly and the feature space too large, leading to the sparseness problems mentioned in Section 2.1. To overcome these drawbacks, we can restrict the selection of vertices used to make shapes using the notion of topological distance induced by the chemical graph.
Let x = (V, E). Given a vertex v ∈ V and an integer r, a 2D-supported shape anchored in v is a set of vertices σ = {v, w} ∪ adj[w] such that w ∈ x(v, r), where adj[w] is the adjacency list of w. Informally, this corresponds to selecting just the adjacency lists of vertices that are within distance r from v.

² Shapes could also be interpreted as hyperedges of a hypergraph having vertex set V.
³ Lexicographical order between two n-tuples t and s can be defined as follows: t < s if and only if there exists a position i such that the first i − 1 items of t and s are all equal but the i-th item of t is less than the i-th item of s.

Fig. 2. Illustration of the definition of 2D-supported shapes. The three 2D-supported shapes of radius 1, anchored to atom C3 in the molecule NSC_1027, are (Br3,C3), (Br2,C3) and (C1,C2,C3,O1). Atom identifiers and types (in parentheses) are formatted according to the Tripos Sybyl MOL2 conventions.

The size of the set of shapes anchored in v is O(|x(v, r)|). Note that the order of each shape is variable and depends on
the topology of the graph (Fig. 2 shows an example). The shape set of
radius r of x, denoted S_r(x), is defined as the set of all 2D-supported shapes anchored in some vertex v of x. The 3DDK between two molecules x and x' is then defined as

K_{3D}(x, x') = \sum_{\sigma \in S_r(x)} \sum_{\sigma' \in S_r(x')} k_{shapes}(\sigma, \sigma')    (7)
It is immediate to recognize that shape sets are a decomposition of
a molecule and, in turn, lexicographically ordered edges are a
decomposition of a shape. Therefore, the positive definiteness of K3D
follows as a corollary of results proved in Haussler (1999) for
the general class of decomposition kernels.
The 3DDK thus defined is able to compare molecular structures by using both chemical (atom and bond types) and geometrical (atom distances) information. The 3DDK performs a decomposition of a molecular structure into shapes defined by variable-sized groups of atoms with their relative distances. The shapes are invariant to the coordinate system (no alignment is needed) and represent n-wise relations between groups of n (≥ 2) atoms. The conformations of two shapes with identical composition (atom and bond types) are compared by looking at the variations between atom distances. The similarity between two shapes is thus a smooth function of the similarity between their dimensions.
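Putting the pieces together, a sketch of the 3DDK of Equation (7): 2D-supported shapes of radius r are enumerated around every atom and the shape kernel is summed over all pairs. The neighborhood function and the shape kernel (which here is expected to compare two vertex sets, e.g. by first building their edge sequences) are passed in as callables so the sketch can reuse the illustrative helpers above; it is a reading aid, not the authors' implementation.

```python
from itertools import product

def anchored_shapes(adj, neighborhood, v, r):
    """2D-supported shapes anchored in v: {v, w} union adj[w] for every w within radius r of v."""
    return [frozenset({v, w} | set(adj.get(w, []))) for w in neighborhood(adj, v, r)]

def shape_set(adj, neighborhood, r):
    """S_r(x): all 2D-supported shapes anchored in some vertex of the molecule."""
    shapes = set()
    for v in adj:
        shapes.update(anchored_shapes(adj, neighborhood, v, r))
    return shapes

def k_3d(shapes_x, shapes_x_prime, shape_kernel):
    """Equation (7): sum of the shape kernel over all pairs of shapes from the two molecules."""
    return sum(shape_kernel(s, s_prime) for s, s_prime in product(shapes_x, shapes_x_prime))
```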
3 EXPERIMENTS

3.1 NCI cancer dataset
The cancer dataset has been made publicly available by
the National Cancer Institute and provides screening results
for the ability of more than 70 000 compounds to suppress
or inhibit the growth of a panel of 60 human tumor cell
lines. The dataset corresponding to the parameter GI50,
the concentration that causes 50% growth inhibition, is used
here. A version of this dataset suitable for machine learning
purposes has been used in Swamidass et al. (2005), and it is
available for download from the ChemDB project (Chen et al.,
2005).4 For each cell line, approximately 3500 compounds are
provided together with information on their cancer-inhibiting
action. A two-class classification problem is thus defined,
with roughly 50% of examples in each class.
For all the compounds, 3D coordinates were taken from
the ChemDB [generated by running CORINA (Gasteiger et al.,
1990; Sadowski et al., 2003) on the 2D structure].
The WDK and 3DDK used in this experiment both had radius r = 3, and no graph complement was used for the WDK. The parameter γ in Equation (6) was set to 2.5. A combination of the two kernels was also tested (WDK+3DDK). Each kernel was then composed with a polynomial kernel of degree 2:

K(x, x') = (1 + κ(x, x'))²    (8)

where κ is either K_{2D} or K_{3D}, or κ(x, x') = K_{2D}(x, x') + K_{3D}(x, x'). The choice of the polynomial degree and the radius for both methods was optimized with a k-fold procedure on a smaller subset, and these values were then kept fixed in the remaining experiments.
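Since Equation (8) acts entrywise on kernel values, the composition can be applied directly to precomputed Gram matrices, and the 2D and 3D kernels can be summed before composition; a brief numpy sketch under that assumption.

```python
import numpy as np

def polynomial_composition(gram, degree=2):
    """Equation (8): K = (1 + kappa)**degree, applied entrywise to a precomputed Gram matrix."""
    return (1.0 + np.asarray(gram)) ** degree

# kappa can be the 2D Gram matrix, the 3D one, or their sum, e.g.:
# gram = polynomial_composition(gram_2d + gram_3d, degree=2)
```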
We consider three performance measures: binary classification accuracy, area under the ROC curve (AUC) and area under the precision/recall curve. These measures were estimated by 10-fold cross-validation. Model selection for choosing the C value was performed by holding out a development set from each training set of the 10-fold cross-validation.
The results of the experiments on the 60 cell lines are
reported in Figure 3. The performances of WDK and 3DDK
are compared to the results obtained by the top performing
kernel (MinMax) proposed by Swamidass et al.
(2005). The values of accuracy of the 3D histogram kernel
from the same work are also reported, in order to compare
3DDK with the only other method that makes use of atom
coordinates on this problem.
The results of the experiments demonstrate the superiority
of the kernels proposed in this work on this dataset. Comparing
the accuracy values, the 3DDK performs better than MinMax on 50 out of 60 cell lines, the WDK on 59 out of 60 cell lines, and the combination of the two kernels always performs better than MinMax. Almost the same conclusions can be drawn by comparing the AUC values: the 3DDK performs better than MinMax on 47 out of 60 cell lines, the WDK on 53 out of 60 cell lines, and their combination always performs better than
MinMax. Moreover, the 3DDK outperforms the 3D
Histogram kernel on all the cell lines, thus showing
the importance of using spatial similarity between sets of
several atoms when comparing 3D structures. The area under
the precision/recall curve clearly shows the advantage of
the combination of the two methods over each single one.
We also extracted ROC50 values (results not reported), which confirm the aforementioned conclusions. Finally, we tested whether the performance of any single classifier is equivalent to that of the classifier combination. By running
the Wilcoxon matched-pairs signed-ranks test on the accuracy, AUC and precision/recall results, we can reject the null hypothesis at the 99% confidence level.

⁴ ftp://cdb.ics.uci.edu/pub/LEARNING/data/nci/gi50/
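A sketch of the paired test described above, using SciPy's Wilcoxon signed-rank test on per-cell-line scores of a single kernel versus the combination; the score arrays are placeholders, not values from the paper.

```python
from scipy.stats import wilcoxon

# Per-cell-line scores (e.g. accuracy or AUC) of one kernel and of the combined kernel.
single_kernel = [0.710, 0.690, 0.730, 0.700, 0.720, 0.680]
combined_kernel = [0.730, 0.701, 0.760, 0.741, 0.725, 0.713]

statistic, p_value = wilcoxon(single_kernel, combined_kernel)
print(statistic, p_value, p_value < 0.01)  # reject equivalence at the 99% confidence level?
```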
3.2 NCI HIV dataset
The HIV dataset contains 42 687 compounds evaluated for evidence of anti-HIV activity from the DTP AIDS antiviral screen program of the National Cancer Institute.⁵ The compounds are divided into three classes: 422 compounds are confirmed active (CA), 1081 are moderately active (CM) and 41 184 are confirmed inactive (CI). A compound is considered inactive if a test showed less than 50% protection of human CEM cells. All other compounds were re-tested. Compounds showing less than 50% protection in the second test are also classified as inactive. The other compounds are classified as active if they provided 50% protection in both tests, and as moderately active otherwise. For all the compounds, 3D coordinates have been generated by running CORINA (Gasteiger et al., 1990; Sadowski et al., 2003) on the 2D structure.
Following the customary experimental setup, three classification problems are formulated on this dataset: in the first problem (CA versus CM), positive examples are confirmed active compounds, while moderately active compounds form the negative class; in the second problem (CA+CM versus CI), the positive class is formed by the combination of moderately active and confirmed active compounds; and in the last problem (CA versus CI), the positive examples are confirmed active compounds and the negative class is formed by confirmed inactive compounds.
For the WDK, we used the graph complement and context radius r = 4, as in Menchetti et al. (2005). For the 3DDK, we set the radius to r = 3. The parameter γ in Equation (6) was set to 2.5 (parameters chosen as in Section 3.1). A combination of the two kernels was also tested (WDK+3DDK). Each of these three kernels was then composed with a Gaussian kernel with σ = 1,⁶ yielding

K(x, x') = e^{-\frac{1}{2\sigma^2}\left(\kappa(x, x) + \kappa(x', x') - 2\kappa(x, x')\right)}    (9)

where κ is either K_{2D} or K_{3D}, or κ(x, x') = K_{2D}(x, x') + K_{3D}(x, x').
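Equation (9) is the usual Gaussian composition computed from kernel values; a brief numpy sketch, again assuming the base Gram matrix has been precomputed.

```python
import numpy as np

def gaussian_composition(gram, sigma=1.0):
    """Equation (9): exp(-(kappa(x,x) + kappa(x',x') - 2*kappa(x,x')) / (2*sigma**2)), entrywise."""
    gram = np.asarray(gram, dtype=float)
    diag = np.diag(gram)
    sq_dist = diag[:, None] + diag[None, :] - 2.0 * gram  # squared distances in feature space
    return np.exp(-sq_dist / (2.0 * sigma ** 2))
```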
AUC performance was estimated by 5-fold cross-validation
in which the original class distribution was preserved in each
fold (we do not report binary classification accuracy as the ratio
of positive to negative examples is very small). Model selection
for choosing the C value was performed by holding out a
development set from each training set of the 5-fold
cross-validation.
The results of the experiments are reported in Table 1.
The performances of WDK and 3DDK are compared to the
best results obtained with the frequent subgraphs method
proposed by Deshpande et al. (2003) and the cyclic pattern
kernel described in Horváth et al. (2004) on the same dataset.
The combination of 3DDK and WDK obtained the highest performance measures, although in this case the improvement is not statistically significant.
⁵ The dataset is available at http://dtp.nci.nih.gov/docs/aids/aids_data.html
⁶ We observed in our previous experiments (Menchetti et al., 2005) that the choice of optimizing C while keeping σ = 1 yields models with performance comparable to those obtained by allowing the joint optimization of both C and σ. We cannot exclude that some improvements could be gained by jointly fine-tuning these hyper-parameters.
Fig. 3. Results on all the 60 cell lines of the NCI Cancer screening dataset. The 3DDK, the WDK and their sum are compared to the MinMax and 3D-Hist kernels. Top: prediction accuracy. Center: ROC AUC values. Bottom: area under the precision/recall curve.
Table 1. Results of the experiments on the NCI Anti-HIV screening dataset

Method      CA versus CM   CA+CM versus CI   CA versus CI
FSG         0.786          0.786             0.914
FSG+3D      0.811          0.819             0.940
CPK         0.840          0.837             0.947
WDK         0.854          0.841             0.945
3DDK        0.853          0.844             0.951
WDK+3DDK    0.861          0.848             0.951

Standard deviations reported for the cross-validated results: 0.010, 0.019, 0.040, 0.028, 0.012, 0.006, 0.007, 0.009, 0.008, 0.009, 0.006, 0.007.

The 3DDK and WDK are compared to the frequent subgraphs approach (FSG) and to the cyclic pattern kernel (CPK). The table reports the value of AUC for the various methods.
3.3 Experiments with geometrical structure only
We ran a different set of experiments on the CA versus CM NCI HIV dataset to demonstrate that the chemical structure of a compound is not necessary (albeit useful for efficiency reasons) for correct predictions. In these experiments, the decomposition strategy collects all spatially related atoms: for each atom a_i, all the atoms a_ij that are less than d Ångströms away from a_i are collected; from these atoms all the combinations of m − 1 atoms are generated; a shape is then formed using a_i and these m − 1 atoms. In this case, the shape contains no information about bond types. Note that in this case the shape order is fixed a priori and that we do not use the 2D support graph to select a subset of possible edges (i.e. we take into consideration a fully connected version of the subgraph induced by spheres of radius d centered at each atom).
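A sketch of this purely geometrical decomposition: for each anchor atom, all atoms closer than d Ångströms are collected and every combination of m − 1 of them, together with the anchor, forms a shape of fixed order m. Names and the dictionary-based coordinate encoding are illustrative.

```python
import math
from itertools import combinations

def geometric_shapes(coords, d, m):
    """Shapes of fixed order m: an anchor atom plus every combination of m - 1 atoms
    lying within d Angstroms of it (no bond information is used)."""
    atoms = list(coords)
    shapes = set()
    for anchor in atoms:
        near = [b for b in atoms
                if b != anchor and math.dist(coords[anchor], coords[b]) < d]
        for combo in combinations(near, m - 1):
            shapes.add(frozenset((anchor,) + combo))
    return shapes
```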
Results, reported in Table 2, show that kernels based on shapes of variable order formed by selecting atoms at distance 3 on the chemical graph and kernels based on shapes of order three at a distance of 4 Ångströms achieve comparable results. Note, however, that not using the chemical information implies a much higher computational cost, as the number of atoms selected with the physical distance criterion increases much more rapidly than when using the distance based on the chemical topology.

Table 2. Results of the experiments on the CA versus CM NCI HIV dataset with the 3DDK using only the geometrical information. Shape order n: 2, 3, 4; distance d: 4, 5, 6 Å. AUC values: 0.820, 0.845, 0.845, 0.831, 0.843, 0.831, 0.841.
4 CONCLUSIONS
We have introduced a novel decomposition kernel for 3D
objects. Combined usage of 2D and 3D kernels yields accuracy
improvements in most of the 60 cell lines of the NCI cancer
screening dataset, showing that predictive information
not available in the atom-bond representation is indeed
contained in the calculated atom coordinates. The combination
of WDK and 3DDK also achieved state-of-the-art results in
the HIV screening dataset.
Other researchers have found it difficult to improve
prediction accuracy by exploiting 3D information, conjecturing
that the noise level in the CORINA predicted coordinates may
play a crucial role (Swamidass et al., 2005). While this noise is
likely to have a significant effect, the definition of a kernel
function based on 3D coordinates can indeed capture some
additional useful information. We believe that two related
reasons can account for the improvement that we have
observed in the 3DDK. First, when using simple distance
histograms, information about the local arrangement of atoms
is buried in the global counts of distances for the whole
molecule. In this way, specific local 3D patterns (that may
be responsible for the molecular activity) are not identified. The
3DDK, on the other hand, introduces a more selective notion
of similarity, because several atoms must simultaneously be
arranged in a similar way in the two molecules being compared,
in order to give a contribution to their similarity. Second, in
the 3DDK we keep distance comparisons local within smaller
subgraphs. In this way, we obtain a similarity measure that
allows the rearrangement of entire portions of the graph
without incurring too severe a penalty. To show this, in a
preliminary experiment, we have used a subgraph version of
the distance histogram (i.e. we enrich each atom description
with the pairwise distance histogram of the closest atoms up to
a specified small distance) obtaining better results compared to
the global histogram kernel (although worse than those
obtained with the 3DDK).
In future work, it will be of interest to characterize the class
of molecules on which the use of 3D features helps to achieve a
correct prediction. Also, it would be interesting to increase
the amount of information available globally for each molecule
(such as the melting point) or for parts of a molecule (such as
functional group specifications) and explore the possible ways
of information composition under the unifying framework of
kernel machines.
ACKNOWLEDGEMENTS
This research is supported by EU STREP APrIL II (contract
no. FP6-508861) and EU NoE BIOPATTERN (contract no.
FP6-508803).
Conflict of Interest: none declared.
REFERENCES
Berthold,M. and Borgelt,C. (2002) Mining molecular fragments: finding relevant
substructures of molecules. In Proceedings of the Second IEEE International
Conference on Data Mining.
Burbidge,R. et al., (2001) Drug design by machine learning: support vector
machines for pharmaceutical data analysis. Comput. Chem., 26, 5–14.
Chen,J., Swamidass,S.J., Dou,Y., Bruand,J. and Baldi,P. (2005) ChemDB: a
public database of small molecules and related chemoinformatics resources.
Bioinformatics, 21, 4133–4139.
Collins,M. and Duffy,N. (2002) Convolution kernels for natural language. In
Dietterich,T. et al. (eds.) Advances in Neural Information Processing Systems
14. MIT Press, Cambridge, MA.
Deshpande,M. et al. (2003) Frequent substructure-based approaches for
classifying chemical compounds. In Proceedings of IEEE International
Conference on Data Mining, 35–42. A longer version is available as
Technical Report #03-016, University of Minnesota.
Devillers,J. (1996) Neural Networks in QSAR and Drug Design. Academic Press,
Burlington, MA.
Devillers,J. and Balaban,A.T. (eds.) (1999) Topological Indices and Related
Descriptors in QSAR and QSPR. Gordon & Breach, Amsterdam,
The Netherlands.
Gärtner,T. (2003) A survey of kernels for structured data. SIGKDD Exploration
Newsletter, 5, 49–58.
Gasteiger,J. et al. (1990) Automatic generation of 3D atomic coordinates for
organic molecules. Tetrahedron Comput. Methods, 3, 537–547.
Giordana,A. and Saitta,L. (2000) Phase transitions in relational learning. Mach.
Learn., 41, 217.
Hansch,C. et al. (1995) Exploring QSAR : Hydrophobic, Electronic, and Steric
Constants. American Chemical Society, Wayne, PA.
Hansch,C. et al. (1962) Correlation of biological activity of phenoxyacetic acids
with hammett substituent constants and partition coefficients. Nature, 194,
178–180.
Haussler,D. (1999) Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz.
Hawkins,D. et al. (2001) QSAR with few compounds and many features. J. Chem.
Inf. Comput. Sci., 41, 663–670.
Helma,C. (ed.) (2005) Predictive Toxicology. CRC Press, Boca Raton.
Helma,C. and Kramer,S. (2003) A survey of the predictive toxicology challenge
2000-2001. Bioinformatics, 19, 1179–1182.
Horváth,T. et al. (2004) Cyclic pattern kernels for predictive graph mining. In
Proceedings of KDD 04. ACM Press, New York, NY, pp. 158–167.
Jalali-Heravi,M. and Parastar,F. (2000) Use of artificial neural networks in a
QSAR study of anti-HIV activity for a large group of HEPT derivatives.
J. Chem. Inf. Comput. Sci., 40, 147–154.
Karelson,M. et al. (1996) Quantum-chemical descriptors in QSAR/QSPR studies.
Chem. Rev., 96, 1027–1044.
Kashima,H. et al. (2003) Marginalized kernels between labeled graphs. In
Proceedings of ICML’03, 321–328.
Kier,L. B. and Hall,L. H. (1986) Molecular Connectivity in Structure-Activity
Analysis. John Wiley, Indianapolis, IN.
King,R., Muggleton,S., Srinivasan,A. and Sternberg,M. (1996) Structure-activity
relationships derived by machine learning: the use of atoms and their bond
connectivities to predict mutagenicity by inductive logic programming.
Proc. Nat. Acad. Sci., 93, 438–442.
Klebe,G. et al. (1994) Molecular similarity indices in a comparative analysis
(CoMSIA) of drug molecules to correlate and predict their biological activity.
J. Med. Chem., 37, 4130–4146.
Kramer,S. et al. (2001) Molecular feature mining in HIV data. In Provost,F. and
Srikant,R. (eds.) Proceedings of the 7th ACM International Conference on
Knowledge Discovery and Data Mining. ACM Press, pp. 136–143.
Kubinyi,H. et al. (eds.) (2002) 3D QSAR in Drug Design. Kluwer, Dordrecht, The
Netherlands.
Leslie,C. et al. (2002) The spectrum kernel: A string kernel for SVM protein
classification. Proc. Pac. Symp. Biocomput., 564–575.
L’Heureux,P.J. et al. (2004) Locally linear embedding for dimensionality
reduction in QSAR. J. Comput. Aided. Mol. Des., 18, 475–482.
Lipnick,R.L. (1991) Outliers: their origin and use in the classification of
molecular mechanisms of toxicity. Sci. Total Environ., 109–110, 131–153.
Mahe,P. et al. (2005) Graph kernels for molecular structure-activity relationship
analysis with support vector machines. J Chem. Inf. Model., 45, 939–951.
Menchetti,S. et al. (2005) Weighted decomposition kernels. In Proceedings of the
22nd International Conference on Machine Learning.
Micheli,A. et al. (2001) Analysis of the internal representations developed by
neural networks for structures applied to quantitative structure-activity
relationship studies of benzodiazepines. J. Chem. Inf. Comput. Sci., 41,
202–218.
Odone,F. et al. (2005) Building kernels from binary strings for image matching.
IEEE Trans. Image Process., 14, 169–180.
Pastor,M. et al. (2000) GRid-INdependent descriptors (GRIND): a novel class of
alignment-independent three-dimensional molecular descriptors. J. Med.
Chem., 43, 3233–3243.
Cramer,R.D. III, Patterson,D.E. and Bunce,J.D. (1988) Comparative molecular
field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier
proteins. J. Am. Chem. Soc, 110, 5959–5967.
Sadowski,J. et al. (2003) 3D structure generation and conformational searching.
In Computational Medicinal Chemistry and Drug Discovery. Dekker Inc,
New York, NY, 151–212.
Schölkopf,B. and Smola,A.J. (2002) Learning with Kernels. MIT Press,
Cambridge, MA.
Schölkopf,B. et al. (2002) A kernel approach for learning from almost orthogonal
patterns. Proceedings of ECML'02, 511–528.
Swamidass,S.J. et al. (2005) Kernels for small molecules and the prediction of
mutagenicity, toxicity and anti-cancer activity. Bioinformatics, 21 (suppl1),
359–368.
van de Waterbeemd,H. and Gifford,E. (2003) ADMET in silico
modelling: towards prediction paradise? Nat. Rev. Drug Discov., 2, 192–204.
Weininger,D. (1988) SMILES, a chemical language and information
system: 1. introduction to methodology and encoding rules. J. Chem. Inf.
Comput. Sci., 28, 31–36.
Weininger,D. (1989) SMILES: 2. algorithm for generation of unique
SMILES notation. J. Chem. Inf. Comput. Sci., 29, 97–101.