BIOINFORMATICS ORIGINAL PAPER Vol. 23 no. 16 2007, pages 2038–2045 doi:10.1093/bioinformatics/btm298

Structural bioinformatics

Classification of small molecules by two- and three-dimensional decomposition kernels

Alessio Ceroni, Fabrizio Costa* and Paolo Frasconi
Machine Learning and Neural Networks Group, Dipartimento di Sistemi e Informatica, Università degli Studi di Firenze, Italy
*To whom correspondence should be addressed.

Received on February 19, 2007; revised on May 14, 2007; accepted on May 28, 2007
Advance Access publication June 5, 2007
Associate Editor: Anna Tramontano

ABSTRACT

Motivation: Several kernel-based methods have recently been introduced for the classification of small molecules. Most available kernels on molecules are based on 2D representations obtained from chemical structures, but much less work has so far focused on the definition of effective kernels that can also exploit 3D information.

Results: We introduce new ideas for building kernels on small molecules that can effectively use and combine 2D and 3D information. We tested these kernels in conjunction with support vector machines for binary classification on the 60 NCI cancer screening datasets as well as on the NCI HIV dataset. Our results show that the 3D information leveraged by these kernels can consistently improve prediction accuracy in all datasets.

Availability: An implementation of the small molecule classifier is available from http://www.dsi.unifi.it/neural/src/3DDK

Contact: [email protected]

1 INTRODUCTION

Prediction of the biological activity of chemical compounds is one of the major goals of chemoinformatics, mainly prompted by the need to improve current drug development methods (van de Waterbeemd and Gifford, 2003) and to gain deeper insights into toxicology (Helma, 2005; Helma and Kramer, 2003). Recent developments in combinatorial technologies and ultra-high-throughput screening have paved the way towards large-scale exploration of potential drugs. In this context, prediction algorithms are needed to reduce attrition in drug development and to select subsets of interesting compounds from the large set of molecules that could potentially be synthesized (van de Waterbeemd and Gifford, 2003). Machine-learning algorithms for prediction are justified by the reasonable assumption that compounds having similar geometrical and physicochemical characteristics should also exhibit similar effects on the same biological system. A major challenge is to define an effective quantitative measure of similarity. In this article, we extend recent research in the area of kernel machines for defining similarity between small molecules (Horváth et al., 2004; Kashima et al., 2003; Mahe et al., 2005; Menchetti et al., 2005; Swamidass et al., 2005) and propose new methods that combine 2D and 3D information.

Starting from the seminal work of Hansch et al. (1962) on quantitative structure-activity relationships (QSAR), several researchers have investigated supervised learning techniques for predicting the biological activity of small molecules. All these techniques must effectively solve two separate subproblems: one related to representation, and another focused on statistical learning issues. The former essentially amounts to devising a proper set of descriptors that effectively capture the representative characteristics of the molecules (ideally, molecules exhibiting similar behavior should be represented by similar descriptors).
The latter amounts to finding a smooth and stable function that maps descriptors either into a real number expressing the activity of the molecule (regression) or into a Boolean decision such as active/inactive (classification). In classical approaches, after descriptor analysis has been performed on the chemical structure, a molecule is always represented by a vector of real numbers, and the similarity between two molecules is implicitly or explicitly computed from their descriptor vectors. The simplicity of vector-based representations (as opposed to relational representations) has made it possible to consider several alternative algorithms for solving the supervised learning problem, including linear regression, ridge regression (Hawkins et al., 2001), artificial neural networks (Devillers, 1996; Jalali-Heravi and Parastar, 2000) and support vector machines (SVMs) (Burbidge et al., 2001). Although there are some crucial statistical aspects that may affect the quality of prediction [most importantly the problem of identifying outliers (Lipnick, 1991) and problems of feature selection and dimensionality reduction (L'Heureux et al., 2004)], most of the research has focused on descriptors.

Descriptors have been traditionally obtained from both the chemical and the 3D structure of the compounds using steric and hydrophobic properties (Hansch et al., 1995), topological properties (Devillers and Balaban, 1999), connectivity indices (Kier and Hall, 1986) or quantum-chemical effects (Karelson et al., 1996). Alternatively, 3D-QSAR methods (Kubinyi et al., 2002) have been proposed to generate the descriptors exclusively from the 3D structure of the molecule. Among the most prominent 3D-QSAR methods are CoMFA (comparative molecular field analysis, Cramer et al., 1988) and CoMSIA (comparative molecular similarity indices analysis, Klebe et al., 1994). In both approaches, the molecular structures are first aligned and then placed into a 3D grid to calculate their interaction energy with an atomic probe. The molecular fields obtained in this way are then correlated with the biological activity by partial least squares analysis. However, these grid-based approaches are strongly influenced by the alignment of the compound structures, which is the critical step of these methods. Pastor et al. (2000) proposed an alternative approach, based on the extraction of pairwise pharmacophore distances, that does not require alignment. Other (and somewhat simpler) approaches for incorporating geometrical information have been proposed by Deshpande et al. (2003) and Swamidass et al. (2005). These are also limited to pairwise atom distances, which are less effective than interaction energies.

Alternatives to descriptor-vector approaches aim to preserve as much information as possible about the molecular structure by recognizing the relational nature of the data. In particular, molecules can be represented as labeled graphs whose vertices and edges carry attributes describing physico-chemical properties of atoms and bonds, respectively. Relational learning algorithms have been mainly developed within the inductive logic programming (ILP) framework (King et al., 1996). ILP also has the advantage that domain expertise can be encoded in the form of a logic program and that the result of the learning algorithm is human-readable in the form of first-order predicates.
The complexity of ILP algorithms, however, may not scale well with the training set size (Giordana and Saitta, 2000), making it difficult to apply them to large-scale problems. For this reason, a number of researchers have explored the much simpler graph-based representation of the compound chemical structure and transformed the problem of finding chemical substructures into that of finding subgraphs in this graph-based representation (Berthold and Borgelt, 2002; Deshpande et al., 2003). An even simpler representation was proposed by Kramer et al. (2001): in this scheme each chemical compound is represented as a SMILES string (Weininger, 1988, 1989), a 1D representation of chemical structures encoding atomic content and connectivity. A single compound can be represented by different SMILES strings, unless a canonical ordering is defined. This representation simplifies the problem of discovering frequently occurring substructures at the cost of losing some information about the represented molecules.

Different statistical learning algorithms for structured data have been successfully applied to QSAR problems, including recursive neural networks (Micheli et al., 2001) and kernel machines (Horváth et al., 2004; Kashima et al., 2003; Mahe et al., 2005; Menchetti et al., 2005; Swamidass et al., 2005). Micheli et al. (2001) first transformed chemical structures into ordered trees (this is possible for the particular class of compounds, benzodiazepines, investigated in that study) and then used a standard RNN for learning the structure/activity relation. Kernel methods, on the other hand, offer higher flexibility and do not require the input structure to be ordered. Horváth et al. (2004) have proposed a decomposition kernel that compares the sets of tree and cyclic patterns present in the input molecules. This kernel has been applied to the HIV dataset and will be used as a comparison in this work (Section 3.2). The kernel proposed by Kashima et al. (2003) is based on counting the number of common label paths originated by random walks on the graph representation of the molecules. Mahe et al. (2005) improved this kernel and obtained successful results on two mutagenesis datasets. Swamidass et al. (2005) tested several new kernels based on 1D, 2D and 3D representations of the input molecules on the NCI Cancer dataset (Section 3.1). Finally, the 2D kernel introduced by Menchetti et al. (2005) (and also exploited in this article in conjunction with a novel 3D kernel) is based on the intuition that large subgraphs can be matched in a soft way by flattening them into histograms (see Section 2.3 for more details).

Three-dimensional information can be obtained by running rule-based prediction programs such as CORINA (Gasteiger et al., 1990; Sadowski et al., 2003) on the chemical compound. Atomic coordinates predicted in this way may be affected by errors, but they may nonetheless help to derive a better measure of similarity between molecules. However, as briefly summarized above, research about kernels on molecular structures has mainly focused on 2D structures (i.e. sparse graphs) and only a few methods have considered the possibility of incorporating features extracted from predicted atom coordinates into an SVM classifier. Deshpande et al. (2003) have defined an extension of their frequent subgraph mining algorithm that also checks the average distance between all atom pairs in a (sub)molecule as a condition for detecting subgraphs.
Swamidass et al. (2005) have proposed a kernel that takes 3D coordinates into account by computing the inner product between histograms of pairwise interatomic distances. Deshpande et al. (2003) report that 3D information in conjunction with their method helps improve accuracy on the NCI HIV dataset, while Swamidass et al. (2005) found that using 3D information only worsens prediction accuracy on the NCI cancer dataset. By computing distance averages or histograms, previous methods only take into account proximity relationships between pairs of atoms, but not the 3D similarity between larger sets of atoms (as could be inferred by preserving the fully connected distance graph). The novel 3D kernel presented in this article is based on the comparison between entire 3D shapes and can in principle preserve more information. We show that a combination of the weighted decomposition kernel (WDK) and the new 3D decomposition kernel (3DDK) allows us to reach state-of-the-art performance.

2 METHODS

2.1 Background on kernel methods for structured data

Our approach to the classification of small molecules follows the classical supervised learning framework of SVMs (Schölkopf and Smola, 2002). In this class of algorithms, a classification function f(x) is obtained from data as

    f(x) = \sum_{i=1}^{m} \alpha_i y_i K(x_i, x)    (1)

In the above formula, (x_i, y_i) is a generic training example, consisting of an input object x_i and its associated target y_i ∈ {−1, 1} representing a binary class. The coefficients α_i are determined by the learning procedure, and α_i ≠ 0 if x_i is a support vector. Finally, K is a kernel function that measures the similarity between two examples. Positive semi-definite kernels are preferred since they yield a convex optimization problem for finding the coefficients α_i.

Kernels on structured data (such as molecular graphs) are typically special cases of decomposition or convolution kernels (Haussler, 1999), a vast class of functions that relies on two main concepts: (1) decomposition of the structured objects into parts and (2) composition of kernels between these parts. Let X be a set of structured objects (such as labeled graphs) and, for x ∈ X, suppose (x_1, ..., x_D) is a tuple of parts of x, with x_d ∈ X_d for each part type d = 1, ..., D, which can be formally expressed by a decomposition relation R(x_1, ..., x_D, x). Assuming that for each part type a Mercer kernel \kappa_d : X_d × X_d → \mathbb{R} is available, the decomposition kernel between two structured objects x and x' is defined as

    K(x, x') = \sum_{(x_1, ..., x_D) \in R^{-1}(x)} \sum_{(x'_1, ..., x'_D) \in R^{-1}(x')} \prod_{d=1}^{D} \kappa_d(x_d, x'_d)    (2)

where R^{-1}(x) = {(x_1, ..., x_D) : R(x_1, ..., x_D, x)} denotes the set of all possible decompositions of x. In practice, popular decomposition kernels are very often 'all-substructures kernels' that rely on a single part type (i.e. D = 1) and are based on the exact match kernel \kappa_1 = \delta, where \delta(x, x') = 1 if x ≅ x' and 0 otherwise. Common examples of 'all-substructures kernels' are the spectrum kernel for strings (Leslie et al., 2002), the kernel defined by Collins and Duffy (2002) for trees and the cyclic pattern kernel for molecules (Horváth et al., 2004). Although some of the above kernels can be computed efficiently, matching arbitrary substructures may be intractable because it involves subgraph isomorphism (Gärtner, 2003).
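To make the structure of Equation (2) concrete, the following is a minimal, hypothetical sketch of an all-substructures decomposition kernel with a single part type (D = 1) and the exact match kernel, using the k-mers of a string as parts in the spirit of the spectrum kernel. The molecular kernels discussed in this article decompose graphs rather than strings, but the summation has exactly the same form.

```python
from collections import Counter

def parts(s, k=3):
    """Single-part-type decomposition R^{-1}(x): all k-mers of a string
    (spectrum-kernel style; a toy stand-in for graph substructures)."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def exact_match_decomposition_kernel(x, xp, k=3):
    """Equation (2) with D = 1 and kappa_1 = delta: the double sum over all
    part pairs reduces to counting identical pairs of parts."""
    cx, cxp = Counter(parts(x, k)), Counter(parts(xp, k))
    return sum(cx[p] * cxp[p] for p in cx)  # each matching pair contributes 1

if __name__ == "__main__":
    # Toy example on SMILES-like strings.
    print(exact_match_decomposition_kernel("CC(=O)OC", "CC(=O)NC"))
```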
All-substructures kernels may induce sparseness in the feature space induced by the kernel if the parts of x that are informative for prediction are only a small subset of R^{-1}(x). In this case, the similarity between x and x' can typically be much smaller than the similarity between x and itself, leading to a problem of diagonal dominance (Schölkopf et al., 2002) and a consequent form of rote learning in which generalization to new cases is unsatisfactory. In practice, in order to obtain good performance in a particular prediction task, it is very important to carefully tune the decomposition relation R and the kernel between parts. The 2D and 3D kernels described in the following aim to reduce sparseness by limiting the number of parts and by introducing a soft notion of similarity for matching them.

2.2 Representation of small molecules

A molecule is represented as an attributed undirected graph x = (V, E). Each vertex v ∈ V represents an atom and is labeled by three kinds of attributes: t[v] is the atom type (defined as in the Tripos Sybyl MOL2 format¹), c[v] is the partial charge, discretized into negative if c[v] < −0.05, uncharged if −0.05 ≤ c[v] ≤ 0.05 and positive if c[v] > 0.05, and r[v] ∈ \mathbb{R}^3 are the 3D atom coordinates. Each edge (u, v) ∈ E represents a chemical bond and is characterized by a type b[u, v] that can be one of single, double, triple, amide or aromatic.

¹ Available from www.tripos.com/data/support/mol2.pdf

2.3 Weighted decomposition kernels for 2D chemical structures

Menchetti et al. (2005) extended the definition of all-substructures kernels by weighting the match between identical parts (called here selectors) according to contextual information. The resulting kernel can be applied to general data types, but in this article we are interested in the case of labeled graphs. A weighted decomposition kernel (WDK) is characterized by a decomposition R(s, z, x) where s is a subgraph of x called the selector and z is a subgraph of x called the context of occurrence of s in x (generally a subgraph containing s). This setting results in the following general form of the kernel:

    K_{2D}(x, x') = \sum_{(s, z) \in R^{-1}(x)} \sum_{(s', z') \in R^{-1}(x')} \delta(s, s') \kappa(z, z')    (3)

where \delta is the exact matching kernel applied to selectors and \kappa is a kernel on contexts. The basic idea behind the WDK is that each substructure into which a graph is decomposed is enriched with its graphical context. Figure 1 provides an example on molecule-like graphs where selectors are single nodes and the relative contexts are the vertices and edges adjacent to the selector.

Fig. 1. Comparing substructures in a weighted decomposition kernel.

In this article, selectors are always single atoms and the match \delta(s, s') is defined by the coincidence between the atom types of s and s'. The context kernel is based on a soft match between substructures, defined by the distributions of label contents after discarding topology. Specifically, let L denote the total number of attributes labeling vertices and edges and, for ℓ = 1, ..., L, let p_ℓ(j) be the observed frequency of value j for the ℓ-th attribute in a substructure z. Then we compare substructures by means of a histogram intersection kernel (Odone et al., 2005), defined as follows:

    \kappa_ℓ(z, z') = \sum_{j=1}^{m_ℓ} \min\{p_ℓ(j), p'_ℓ(j)\}    (4)

    \kappa(z, z') = \sum_{ℓ=1}^{L} \kappa_ℓ(z, z')    (5)

where m_ℓ is the number of possible values of attribute ℓ. In the following we shall use L = 3: atom type, atom charge and bond type, while atom coordinates are discarded for computing the WDK.
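As an illustration, the following is a minimal sketch of the histogram intersection kernel of Equations (4)-(5). It assumes substructures are given as mappings from attribute names to their observed values (a hypothetical encoding, with L = 3 attributes as in the text); it is not the authors' implementation.

```python
from collections import Counter

def attribute_frequencies(values):
    """Observed frequency p(j) of each value j of one attribute in a substructure."""
    counts = Counter(values)
    total = sum(counts.values())
    return {j: c / total for j, c in counts.items()}

def histogram_intersection(z, zp):
    """kappa(z, z') of Equations (4)-(5): sum over the L attributes of the
    intersection of the two frequency histograms. z and zp map attribute name
    to the list of observed values, e.g. {'atom_type': [...], 'charge': [...],
    'bond_type': [...]} (L = 3 here)."""
    total = 0.0
    for attr in z.keys() & zp.keys():
        p, pp = attribute_frequencies(z[attr]), attribute_frequencies(zp[attr])
        total += sum(min(p[j], pp[j]) for j in p.keys() & pp.keys())
    return total

if __name__ == "__main__":
    z  = {"atom_type": ["C.3", "C.3", "O.2"], "charge": ["neg", "unch", "unch"], "bond_type": ["1", "2"]}
    zp = {"atom_type": ["C.3", "N.3", "O.2"], "charge": ["unch", "unch", "pos"], "bond_type": ["1", "1"]}
    print(histogram_intersection(z, zp))
```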
Contexts are formed as follows. Given a vertex v ∈ V and an integer r ≥ 0, let x(v, r) denote the substructure of x obtained by retaining all the vertices that are reachable from v by a path of length at most r, and all the edges that touch at least one of these vertices. The decomposition relation R_r, which depends on the context radius r, is then defined as R_r = {(s, z, x) : x ∈ X, s = {v}, z = x(v, r), v ∈ V}, where s is the selector and z is the context for vertex v. In Menchetti et al. (2005), a further extension to Equation (3) was proposed in which each selector is associated not only with its context but also with the complement c of the context taken with respect to the whole graph, yielding the decomposition relation R_r = {(s, z, c, x) : x ∈ X, s = {v}, z = x(v, r), c = x \ z, v ∈ V}. The same kernel is employed on both the context and the complement parts, obtaining K(x, x') = \sum \delta(s, s') (\kappa(z, z') + \kappa(c, c')), where the sum extends over (s, z, c) ∈ R^{-1}(x) and the corresponding parts of x'. The inclusion of the complement is a means to add a similarity factor that takes into consideration the whole graph/molecule, i.e. two selectors are considered similar if they are embedded in molecules with similar attribute distributions (in our case element types and partial charges).

2.4 Three-dimensional decomposition kernels

A 3D molecular structure is interpreted here as a special kind of relational data object where atoms are related by chemical bonds but also by their spatial distances. According to this view, a novel decomposition kernel for 3D structures is defined in this section. The molecule is first decomposed into a set of overlapping 3D substructures of varied geometry, called shapes. The kernel between two molecules is then computed by summing kernels between all pairs of shapes.

Given a molecule x = (V, E), a shape of order n is a set of n distinct vertices σ = {u_1, u_2, ..., u_n}, u_i ∈ V, for i = 1, ..., n. For example, shapes of order two can be interpreted as segments and shapes of order three as triangles.²

² Shapes could also be interpreted as hyperedges of a hypergraph having vertex set V.

We first introduce a kernel between shapes as follows. Given a shape σ of order n and two vertices u, v ∈ σ, let e = (t[u], t[v], b[u, v]) denote a labeled edge of the shape, formed by considering the two atom types t[u] and t[v] and the bond type b[u, v] (conventionally, b[u, v] = 0 if there is no chemical bond between atoms u and v). Since atom and bond types both take values on a finite alphabet, labeled edges can be sorted lexicographically.³ Then, let ⟨e_1, ..., e_{n(n−1)/2}⟩ denote the lexicographically ordered sequence of all labeled edges in σ. For example, the shape (C1,C2,C3,O1) for the molecule NSC_1027 (see Fig. 2) yields the sequence (C.2,C.3,1) (C.2,C.3,1) (C.2,O.2,2) (C.3,C.3,0) (C.3,O.2,0) (C.3,O.2,0).

³ Lexicographical order between two n-tuples t and s can be defined as follows: t < s if and only if there exists a position i such that the first i−1 items of t and s are all equal but the i-th item of t is less than the i-th item of s.

The kernel between two shapes σ and σ' of equal order n has complexity O(n²) and is defined as

    k_{shapes}(σ, σ') = \prod_{i=1}^{n(n−1)/2} \delta(e_i, e'_i) \, e^{−\gamma (d_i − d'_i)^2}    (6)

where γ is a kernel hyperparameter and d_i = ||r[u_i] − r[v_i]|| is the length of edge e_i, i.e. the Euclidean distance between atoms u_i and v_i. The kernel between two shapes of different order is null.
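The following is a minimal sketch of the shape kernel of Equation (6), under the assumption that each shape has already been converted into its lexicographically ordered list of labeled edges (atom type, atom type, bond type, Euclidean length in Angstroms). It is an illustration, not the authors' implementation.

```python
import math

def k_shapes(edges_a, edges_b, gamma=2.5):
    """Sketch of Equation (6). Each argument is the lexicographically ordered
    list of labeled edges of one shape, as tuples (atom_type_1, atom_type_2,
    bond_type, euclidean_length); gamma = 2.5 is the value used in Section 3."""
    if len(edges_a) != len(edges_b):
        return 0.0                      # shapes of different order: kernel is null
    value = 1.0
    for (t1, t2, b, d), (s1, s2, c, dp) in zip(edges_a, edges_b):
        if (t1, t2, b) != (s1, s2, c):  # exact-match kernel delta on the labels
            return 0.0
        value *= math.exp(-gamma * (d - dp) ** 2)
    return value

if __name__ == "__main__":
    # Two order-3 shapes (triangles), i.e. three labeled edges each (lengths in Angstroms).
    a = [("C.2", "C.3", 1, 1.50), ("C.2", "O.2", 2, 1.22), ("C.3", "O.2", 0, 2.40)]
    b = [("C.2", "C.3", 1, 1.52), ("C.2", "O.2", 2, 1.21), ("C.3", "O.2", 0, 2.35)]
    print(k_shapes(a, b))
```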
We can now move on to the definition of the 3D kernel between two molecules. One simplistic approach consists of decomposing a molecule into a bag of all possible shapes and then comparing the two bags of shapes using the above defined kernel. However, this would quickly result in a combinatorial explosion, as there are \binom{m}{n} shapes of order n in a molecule having m atoms. Even for moderate values of n (e.g. n = 3 to consider 3D triangles), this approach has two important drawbacks: the kernel would be computationally costly and the feature space too large, leading to the sparseness problems mentioned in Section 2.1.

To overcome these drawbacks, we can restrict the selection of vertices used to build shapes using the notion of topological distance induced by the chemical graph. Let x = (V, E). Given a vertex v ∈ V and an integer r, a 2D-supported shape anchored in v is a set of vertices σ = {v, w} ∪ adj[w] such that w ∈ x(v, r), where adj[w] is the adjacency list of w. Informally, this corresponds to selecting just the adjacency lists of the vertices that are within distance r from v. The size of the set of shapes anchored in v is O(|x(v, r)|). Note that the order of each shape is variable and depends on the topology of the graph (Fig. 2 shows an example).

Fig. 2. Illustration of the definition of 2D-supported shapes. The three 2D-supported shapes of radius 1, anchored to atom C3 in the molecule NSC_1027, are (Br3,C3), (Br2,C3) and (C1,C2,C3,O1). Atom identifiers and types (in parentheses) are formatted according to the Tripos Sybyl MOL2 conventions.

The shape set of radius r of x, denoted S_r(x), is defined as the set of all 2D-supported shapes anchored in some vertex v of x. The 3DDK between two molecules x and x' is then defined as

    K_{3D}(x, x') = \sum_{σ \in S_r(x)} \sum_{σ' \in S_r(x')} k_{shapes}(σ, σ')    (7)

It is immediate to recognize that shape sets are a decomposition of a molecule and that, in turn, lexicographically ordered edges are a decomposition of a shape. Therefore, the positive definiteness of K_{3D} follows as a corollary of results proved in Haussler (1999) for the general class of decomposition kernels.

The 3DDK thus defined is able to compare molecular structures by using both chemical (atom and bond types) and geometrical (interatomic distances) information. The 3DDK performs a decomposition of a molecular structure into shapes defined by variable-sized groups of atoms together with their relative distances. The shapes are invariant to the coordinate system (no alignment is needed) and represent n-wise relations between groups of n (≥ 2) atoms. The conformations of two shapes with identical composition (atom and bond types) are compared by looking at the variations between interatomic distances. The similarity between two shapes is thus a smooth function of the similarity between their dimensions.
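The following is a minimal sketch of the shape-set construction and of Equation (7), under a hypothetical toy encoding of a molecule (a dict of atoms with types and coordinates, and a dict of bonds). It sketches the idea rather than the authors' implementation; following the three shapes listed in Fig. 2, the anchor itself is not used as the vertex w.

```python
import math
from collections import deque
from itertools import combinations

# Hypothetical toy encoding of a molecule:
#   mol = {'atoms': {atom_id: (atom_type, (x, y, z))},
#          'bonds': {(u, v): bond_type}}      with u < v

def adjacency(mol):
    adj = {a: set() for a in mol['atoms']}
    for (u, v) in mol['bonds']:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def neighborhood(adj, v, r):
    """Vertices reachable from v by a path of length at most r (the x(v, r) of the text)."""
    dist, queue = {v: 0}, deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def shape_set(mol, r=3):
    """2D-supported shapes of radius r: for each anchor v and each w in x(v, r),
    the vertex set {v, w} united with the adjacency list of w (the anchor itself
    is skipped, consistent with the example in Fig. 2)."""
    adj = adjacency(mol)
    shapes = set()
    for v in mol['atoms']:
        for w in neighborhood(adj, v, r):
            if w != v:
                shapes.add(frozenset({v, w} | adj[w]))
    return shapes

def labeled_edges(mol, shape):
    """Lexicographically ordered labeled edges of a shape: (type, type, bond, distance)."""
    edges = []
    for u, v in combinations(sorted(shape), 2):
        (tu, pu), (tv, pv) = mol['atoms'][u], mol['atoms'][v]
        bond = mol['bonds'].get((min(u, v), max(u, v)), 0)   # 0 = no chemical bond
        edges.append((*sorted((tu, tv)), bond, math.dist(pu, pv)))
    return sorted(edges, key=lambda e: e[:3])

def k3d(mol_a, mol_b, r=3, gamma=2.5):
    """Sketch of Equation (7): sum of the shape kernel over all pairs of shapes
    (r = 3 and gamma = 2.5 are the 3DDK settings used in the experiments below)."""
    edges_a = [labeled_edges(mol_a, s) for s in shape_set(mol_a, r)]
    edges_b = [labeled_edges(mol_b, s) for s in shape_set(mol_b, r)]
    total = 0.0
    for ea in edges_a:
        for eb in edges_b:
            if len(ea) != len(eb) or any(x[:3] != y[:3] for x, y in zip(ea, eb)):
                continue   # different order or non-matching labels: kernel is null
            total += math.prod(math.exp(-gamma * (x[3] - y[3]) ** 2) for x, y in zip(ea, eb))
    return total
```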
3 EXPERIMENTS

3.1 NCI cancer dataset

The cancer dataset has been made publicly available by the National Cancer Institute and provides screening results for the ability of more than 70 000 compounds to suppress or inhibit the growth of a panel of 60 human tumor cell lines. The dataset corresponding to the parameter GI50, the concentration that causes 50% growth inhibition, is used here. A version of this dataset suitable for machine learning purposes has been used in Swamidass et al. (2005), and it is available for download from the ChemDB project (Chen et al., 2005).⁴ For each cell line, approximately 3500 compounds are provided together with information on their cancer-inhibiting action. A two-class classification problem is thus defined, with roughly 50% of the examples in each class.

⁴ ftp://cdb.ics.uci.edu/pub/LEARNING/data/nci/gi50/

For all the compounds, 3D coordinates were taken from the ChemDB [generated by running CORINA (Gasteiger et al., 1990; Sadowski et al., 2003) on the 2D structure]. The WDK and 3DDK used in this experiment both had radius r = 3, and no graph complement was used for the WDK. The parameter γ in Equation (6) was set to 2.5. A combination of the two kernels was also tested (WDK+3DDK). Each kernel was then composed with a polynomial kernel of degree 2:

    K(x, x') = (1 + \kappa(x, x'))^2    (8)

where \kappa is either K_{2D} or K_{3D}, or \kappa(x, x') = K_{2D}(x, x') + K_{3D}(x, x'). The choice of the polynomial degree and of the radius for both methods was optimized with a k-fold procedure on a smaller subset; these values were then kept fixed in the remaining experiments.

We consider three performance measures: binary classification accuracy, area under the ROC curve (AUC) and area under the precision/recall curve. These measures were estimated by a 10-fold cross-validation. Model selection for choosing the C value was performed by holding out a development set from each training set of the 10-fold cross-validation.

The results of the experiments on the 60 cell lines are reported in Figure 3. The performances of the WDK and 3DDK are compared to the results obtained by the top performing kernel (MinMax) proposed by Swamidass et al. (2005). The accuracy values of the 3D histogram kernel from the same work are also reported, in order to compare the 3DDK with the only other method that makes use of atom coordinates on this problem. The results of the experiments demonstrate the superiority of the kernels proposed in this work on this dataset. Comparing the accuracy values, the 3DDK performs better than MinMax on 50 out of 60 cell lines, the WDK on 59 out of 60, and the combination of the two kernels always performs better than MinMax. Almost the same conclusions can be drawn by comparing the AUC values: the 3DDK performs better than MinMax on 47 out of 60 cell lines, the WDK on 53 out of 60, and their combination always performs better than MinMax. Moreover, the 3DDK outperforms the 3D histogram kernel on all the cell lines, thus showing the importance of using spatial similarity between sets of several atoms when comparing 3D structures. The area under the precision/recall curve clearly shows the advantage of the combination of the two methods over each single one. We also extracted ROC50 values (results not reported), which confirm the aforementioned conclusions. Finally, we tested whether the performance of any single classifier is equivalent to that of the classifier combination. By running the Wilcoxon matched-pairs signed-ranks test on the accuracy, AUC and precision/recall results, we can reject the null hypothesis at the 99% confidence level.
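As an illustration of the composition in Equation (8) and of the model selection protocol described above (precomputed Gram matrices, with C chosen on a held-out development set), the following is a minimal sketch assuming scikit-learn and NumPy are available. The C grid, split sizes and fold indices (e.g. produced by a KFold split) are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def compose_polynomial(gram, degree=2):
    """Equation (8): compose a precomputed base Gram matrix (K_2D, K_3D, or
    their sum) with a polynomial kernel of the given degree."""
    return (1.0 + gram) ** degree

def evaluate_fold(gram, y, train_idx, test_idx, C_grid=(0.1, 1.0, 10.0)):
    """Choose C on a development split held out from the training fold, then
    report the test AUC; `gram` is the composed Gram matrix over all examples
    and `y` a NumPy array of +1/-1 labels (the C grid here is illustrative)."""
    dev_tr, dev_te = train_test_split(
        train_idx, test_size=0.2, stratify=y[train_idx], random_state=0)
    best_C, best_auc = None, -np.inf
    for C in C_grid:
        clf = SVC(kernel="precomputed", C=C)
        clf.fit(gram[np.ix_(dev_tr, dev_tr)], y[dev_tr])
        auc = roc_auc_score(y[dev_te], clf.decision_function(gram[np.ix_(dev_te, dev_tr)]))
        if auc > best_auc:
            best_C, best_auc = C, auc
    clf = SVC(kernel="precomputed", C=best_C)
    clf.fit(gram[np.ix_(train_idx, train_idx)], y[train_idx])
    return roc_auc_score(y[test_idx], clf.decision_function(gram[np.ix_(test_idx, train_idx)]))
```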
3.2 NCI HIV dataset

The HIV dataset contains 42 687 compounds evaluated for evidence of anti-HIV activity from the DTP AIDS antiviral screen program of the National Cancer Institute.⁵ The compounds are divided into three classes: 422 compounds are confirmed active (CA), 1081 are moderately active (CM) and 41 184 are confirmed inactive (CI). A compound is considered inactive if a test showed less than 50% protection of human CEM cells. All other compounds were re-tested. Compounds showing less than 50% protection in the second test are also classified as inactive. The other compounds are classified as active if they provided 100% protection in both tests, and as moderately active otherwise.

⁵ The dataset is available at http://dtp.nci.nih.gov/docs/aids/aids_data.html

For all the compounds, 3D coordinates have been generated by running CORINA (Gasteiger et al., 1990; Sadowski et al., 2003) on the 2D structure. Following the customary experimental setup, three classification problems are formulated on this dataset: in the first problem (CA versus CM), positive examples are confirmed active compounds, while moderately active compounds form the negative class; in the second problem (CA+CM versus CI), the positive class is formed by the combination of moderately active and confirmed active compounds and the negative class by confirmed inactive compounds; in the last problem (CA versus CI), the positive examples are confirmed active compounds and the negative class is formed by confirmed inactive compounds.

For the WDK, we used the graph complement and context radius r = 4, as in Menchetti et al. (2005). For the 3DDK, we set the radius to r = 3. The parameter γ in Equation (6) was set to 2.5 (parameters chosen as in Section 3.1). A combination of the two kernels was also tested (WDK+3DDK). Each of these three kernels was then composed with a Gaussian kernel with σ = 1,⁶ yielding

    K(x, x') = e^{−(\kappa(x, x) + \kappa(x', x') − 2\kappa(x, x')) / (2σ^2)}    (9)

where \kappa is either K_{2D} or K_{3D}, or \kappa(x, x') = K_{2D}(x, x') + K_{3D}(x, x').

⁶ We observed in our previous experiments (Menchetti et al., 2005) that the choice of optimizing C while keeping σ = 1 yields models with performance comparable to those obtained by allowing the joint optimization of both C and σ. We cannot exclude that some improvements could be gained by jointly fine-tuning these hyperparameters.

AUC performance was estimated by a 5-fold cross-validation in which the original class distribution was preserved in each fold (we do not report binary classification accuracy as the ratio of positive to negative examples is very small). Model selection for choosing the C value was performed by holding out a development set from each training set of the 5-fold cross-validation.

The results of the experiments are reported in Table 1. The performances of the WDK and 3DDK are compared to the best results obtained with the frequent subgraphs method proposed by Deshpande et al. (2003) and with the cyclic pattern kernel described in Horváth et al. (2004) on the same dataset. The combination of WDK and 3DDK obtained the highest performance measures, although in this case the improvement is not statistically significant.
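A minimal NumPy sketch of the composition in Equation (9), applied to a precomputed base Gram matrix (for example the sum of precomputed K_2D and K_3D matrices); the variable names are hypothetical and this is an illustration rather than the authors' code.

```python
import numpy as np

def compose_gaussian(gram, sigma=1.0):
    """Equation (9): Gaussian kernel on the feature-space distance induced by a
    precomputed base kernel kappa,
    K = exp(-(kappa(x,x) + kappa(x',x') - 2*kappa(x,x')) / (2*sigma**2))."""
    diag = np.diag(gram)
    sq_dist = diag[:, None] + diag[None, :] - 2.0 * gram   # squared feature-space distances
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Example usage with hypothetical precomputed Gram matrices:
#   gram = compose_gaussian(gram_2d + gram_3d, sigma=1.0)   # the WDK+3DDK combination
```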
Fig. 3. Results on all the 60 cell lines of the NCI Cancer screening dataset. The 3DDK, the WDK and their sum are compared to the MinMax and 3D-Hist kernels. Top: prediction accuracy. Center: ROC AUC values. Bottom: area under the precision/recall curve.

Table 1. Results of the experiments on the NCI Anti-HIV screening dataset

Method        CA versus CM     CA+CM versus CI    CA versus CI
FSG           0.786            0.786              0.914
FSG+3D        0.811            0.819              0.940
CPK           0.840 ± 0.010    0.837 ± 0.012      0.947 ± 0.008
WDK           0.854 ± 0.019    0.841 ± 0.006      0.945 ± 0.009
3DDK          0.853 ± 0.040    0.844 ± 0.007      0.951 ± 0.006
WDK+3DDK      0.861 ± 0.028    0.848 ± 0.009      0.951 ± 0.007

The 3DDK and WDK are compared to the frequent subgraphs approach (FSG) and to the cyclic pattern kernel (CPK). The table reports the value of AUC for the various methods.

3.3 Experiments with geometrical structure only

We ran a different set of experiments on the CA versus CM NCI HIV dataset to demonstrate that the chemical structure of a compound is not necessary (albeit useful for efficiency reasons) for correct predictions. In these experiments, the decomposition strategy collects all spatially related atoms: for each atom a_i, all the atoms that are less than d Angstroms apart from a_i are collected; from these atoms, all the combinations of n − 1 atoms are generated; a shape of order n is then formed using a_i and these n − 1 atoms. In this case, the shape contains no information about bond types. Note that in this case the shape order is fixed a priori and that we do not use the 2D support graph to select a subset of possible edges (i.e. we take into consideration a fully connected version of the subgraph induced by spheres of radius d centered at each atom).

Results, reported in Table 2, show that kernels based on shapes of variable order, formed by selecting atoms at distance 3 on the chemical graph, and kernels based on shapes of order three formed with a distance of 4 Angstroms achieve comparable results. Note, however, that not using the chemical information implies a much higher computational cost, as the number of atoms selected with the physical distance criterion increases much more rapidly than when using the distance based on the chemical topology.

Table 2. Results of the experiments on the CA versus CM NCI HIV dataset with the 3DDK using only the geometrical information (AUC)

                 Distance d (Angstroms)
Shape order n    4        5        6
2                0.820    0.831    0.841
3                0.845    0.843
4                0.845    0.831
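The purely geometrical decomposition described in Section 3.3 can be sketched as follows, assuming a hypothetical toy encoding of the atoms as a list of (atom_type, (x, y, z)) tuples. The resulting shapes are then compared with the shape kernel of Equation (6), with the bond type conventionally set to 0; this is an illustration, not the authors' implementation.

```python
import math
from itertools import combinations

def geometric_shapes(atoms, d=4.0, n=3):
    """Section 3.3 decomposition: for each atom a_i, collect the atoms less than
    d Angstroms away, then form one shape of fixed order n for every combination
    of n - 1 of them together with a_i. No chemical-graph information is used."""
    shapes = []
    for i, (ti, pi) in enumerate(atoms):
        near = [j for j, (tj, pj) in enumerate(atoms)
                if j != i and math.dist(pi, pj) < d]
        for combo in combinations(near, n - 1):
            shapes.append((i,) + combo)
    return shapes

def shape_edges(atoms, shape):
    """Labeled edges of a geometric shape; the bond type is 0 throughout because
    bonds are ignored in this setting."""
    edges = []
    for u, v in combinations(shape, 2):
        (tu, pu), (tv, pv) = atoms[u], atoms[v]
        edges.append((*sorted((tu, tv)), 0, math.dist(pu, pv)))
    return sorted(edges, key=lambda e: e[:3])
```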
We believe that two related reasons can account for the improvement that we have 2044 2 3 4 Distance d 4 5 6 0.820 0.845 0.845 0.831 0.843 0.831 0.841 observed in the 3DDK. First, when using simple distance histograms, information about the local arrangement of atoms is buried in the global counts of distances for the whole molecule. In this way, specific local 3D patterns (that may be responsible for the molecular activity) are not identified. The 3DDK, on the other hand, introduces a more selective notion of similarity, because several atoms must simultaneously be arranged in a similar way in the two molecules being compared, in order to give a contribution to their similarity. Second, in the 3DDK we keep distance comparisons local within smaller subgraphs. In this way, we obtain a similarity measure that allows the rearrangement of entire portions of the graph without incurring in too severe a penalty. To show this, in a preliminary experiment, we have used a subgraph version of the distance histogram (i.e. we enrich each atom description with the pairwise distance histogram of the closest atoms up to a specified small distance) obtaining better results compared to the global histogram kernel (although worse than those obtained with the 3DDK). In future work, it will be of interest to characterize the class of molecules on which the use of 3D features helps to achieve a correct prediction. Also, it would be interesting to increase the amount of information available globally for each molecule (such as the melting point) or parts of molecule (such as functional group specifications) and explore the possible ways of information composition under the unifying framework of kernel machines. ACKNOWLEDGEMENTS This research is supported by EU STREP APrIL II (contract no. FP6-508861) and EU NoE BIOPATTERN (contract no. FP6-508803). Conflict of Interest: none declared. Classification of small Molecules REFERENCES Berthold,M. and Borgelt,C. (2002) Mining molecular fragments: finding relevant substructures of molecules In Proceedings of the Second IEEE International Conference on Data Mining. Burbidge,R. et al., (2001) Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput. Chem., 26, 5–14. Chen,J., Swamidass,S.J., Dou,Y., Bruand,J. and Baldi,P. (2005) Chemdb: a public database of small molecules and related chemoinformatics resources. Bioinformatics, 21, 4133–4139. Collins,M. and Duffy,N. (2002) Convolution kernels for natural language. In Dietterich,T. et al. (eds.) Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA. Deshpande,M. et al. (2003) Frequent sub structure based approaches for classifying chemical compounds. In Proceedings of IEEE International Conference on Data Mining, 35–42. A longer version is available as Tecnical Report #03-016, University of Minnesota. Devillers,J. (1996) Neural Networks in QSAR and Drug Design. Academic Press, Burlington, MA. Devillers,J. and Balaban,A.T. (eds.) (1999) Topological Indices and Related Descriptors in QSAR and QSPR. Gordon & Breach, Amsterdam, The Netherlands. Gärtner,T. (2003) A survey of kernels for structured data. SIGKDD Exploration Newsletter, 5, 49–58. Gasteiger,J. et al. (1990) Automatic generation of 3d-atomic coordinates for organic molecules. Tetrahedron Comput. Methods, 3, 537–547. Giordana,A. and Saitta,L. (2000) Phase transitions in relational learning. Mach. Learn., 41, 217. Hansch,C. et al. (1995) Exploring QSAR : Hydrophobic, Electronic, and Steric Constants. 
Hansch,C. et al. (1962) Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature, 194, 178–180.
Haussler,D. (1999) Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz.
Hawkins,D. et al. (2001) QSAR with few compounds and many features. J. Chem. Inf. Comput. Sci., 41, 663–670.
Helma,C. (ed.) (2005) Predictive Toxicology. CRC Press, Boca Raton.
Helma,C. and Kramer,S. (2003) A survey of the predictive toxicology challenge 2000–2001. Bioinformatics, 19, 1179–1182.
Horváth,T. et al. (2004) Cyclic pattern kernels for predictive graph mining. In Proceedings of KDD '04. ACM Press, New York, NY, pp. 158–167.
Jalali-Heravi,M. and Parastar,F. (2000) Use of artificial neural networks in a QSAR study of anti-HIV activity for a large group of HEPT derivatives. J. Chem. Inf. Comput. Sci., 40, 147–154.
Karelson,M. et al. (1996) Quantum-chemical descriptors in QSAR/QSPR studies. Chem. Rev., 96, 1027–1044.
Kashima,H. et al. (2003) Marginalized kernels between labeled graphs. In Proceedings of ICML'03, 321–328.
Kier,L.B. and Hall,L.H. (1986) Molecular Connectivity in Structure-Activity Analysis. John Wiley, Indianapolis, IN.
King,R., Muggleton,S., Srinivasan,A. and Sternberg,M. (1996) Structure-activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. Proc. Natl. Acad. Sci., 93, 438–442.
Klebe,G. et al. (1994) Molecular similarity indices in a comparative analysis (CoMSIA) of drug molecules to correlate and predict their biological activity. J. Med. Chem., 37, 4130–4146.
Kramer,S. et al. (2001) Molecular feature mining in HIV data. In Provost,F. and Srikant,R. (eds.) Proceedings of the 7th ACM International Conference on Knowledge Discovery and Data Mining. ACM Press, pp. 136–143.
Kubinyi,H. et al. (eds.) (2002) 3D QSAR in Drug Design. Kluwer, Dordrecht, The Netherlands.
Leslie,C. et al. (2002) The spectrum kernel: a string kernel for SVM protein classification. Proc. Pac. Symp. Biocomput., 564–575.
L'Heureux,P.J. et al. (2004) Locally linear embedding for dimensionality reduction in QSAR. J. Comput. Aided Mol. Des., 18, 475–482.
Lipnick,R.L. (1991) Outliers: their origin and use in the classification of molecular mechanisms of toxicity. Sci. Total Environ., 109–110, 131–153.
Mahe,P. et al. (2005) Graph kernels for molecular structure-activity relationship analysis with support vector machines. J. Chem. Inf. Model., 45, 939–951.
Menchetti,S. et al. (2005) Weighted decomposition kernels. In Proceedings of the 22nd International Conference on Machine Learning.
Micheli,A. et al. (2001) Analysis of the internal representations developed by neural networks for structures applied to quantitative structure-activity relationship studies of benzodiazepines. J. Chem. Inf. Comput. Sci., 41, 202–218.
Odone,F. et al. (2005) Building kernels from binary strings for image matching. IEEE Trans. Image Process., 14, 169–180.
Pastor,M. et al. (2000) GRid-INdependent descriptors (GRIND): a novel class of alignment-independent three-dimensional molecular descriptors. J. Med. Chem., 43, 3233–3243.
Sadowski,J. et al. (2003) 3D structure generation and conformational searching. In Computational Medicinal Chemistry and Drug Discovery. Dekker Inc., New York, NY, 151–212.
Schölkopf,B. and Smola,A.J. (2002) Learning with Kernels. MIT Press, Cambridge, MA.
Schölkopf,B. et al. (2002) A kernel approach for learning from almost orthogonal patterns. In Proceedings of ECML'02, 511–528.
Swamidass,S.J. et al. (2005) Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics, 21 (Suppl. 1), 359–368.
van de Waterbeemd,H. and Gifford,E. (2003) ADMET in silico modelling: towards prediction paradise? Nat. Rev. Drug Discov., 2, 192–204.
Weininger,D. (1988) SMILES, a chemical language and information system: 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci., 28, 31–36.
Weininger,D. (1989) SMILES: 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci., 29, 97–101.