BIOINFORMATICS ORIGINAL PAPER Vol. 22 no. 24 2006, pages 3082–3088 doi:10.1093/bioinformatics/btl535 Systems biology Prediction of oxidoreductase-catalyzed reactions based on atomic properties of metabolites Fangping Mu1, Pat J. Unkefer2, Clifford J. Unkefer2 and William S. Hlavacek1,3, 1 Theoretical Biology and Biophysics Group, Theoretical Division, 2National Stable Isotope Resource, Bioscience Division and 3Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM 87545, USA Received on July 13, 2006; revised on September 28, 2006; accepted on October 12, 2006 Advance Access publication October 23, 2006 Associate Editor: Golan Yona ABSTRACT Motivation: Our knowledge of metabolism is far from complete, and the gaps in our knowledge are being revealed by metabolomic detection of small-molecules not previously known to exist in cells. An important challenge is to determine the reactions in which these compounds participate, which can lead to the identification of gene products responsible for novel metabolic pathways. To address this challenge, we investigate how machine learning can be used to predict potential substrates and products of oxidoreductase-catalyzed reactions. Results: We examined 1956 oxidation/reduction reactions in the KEGG database. The vast majority of these reactions (1626) can be divided into 12 subclasses, each of which is marked by a particular type of functional group transformation. For a given transformation, the local structures of reaction centers in substrates and products can be characterized by patterns. These patterns are not unique to reactants but are widely distributed among KEGG metabolites. To distinguish reactants from non-reactants, we trained classifiers (linear-kernel Support Vector Machines) using negative and positive examples. The input to a classifier is a set of atomic features that can be determined from the 2D chemical structure of a compound. Depending on the subclass of reaction, the accuracy of prediction for positives (negatives) is 64 to 93% (44 to 92%) when asking if a compound is a substrate and 71 to 98% (50 to 92%) when asking if a compound is a product. Sensitivity analysis reveals that this performance is robust to variations of the training data. Our results suggest that metabolic connectivity can be predicted with reasonable accuracy from the presence or absence of local structural motifs in compounds and their readily calculated atomic features. Availability: Classifiers reported here can be used freely for noncommercial purposes via a Java program available upon request. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. 1 INTRODUCTION High-throughput analytic techniques of metabolomics now allow large-scale characterization of the metabolic state of a cell (Fiehn, 2002; Kell, 2004; Dunn et al., 2005). Metabolomic approaches are designed to detect the metabolites in a complex mixure as comprehensively as possible. The various techniques, To whom correspondence should be addressed. 3082 when used in combination, can be highly sensitive. A goal of metabolomics is to define and monitor the entire metabolite complement of a cell, and progress is being made in this direction (Fernie et al., 2004; Goodacre et al., 2004). Although our knowledge of metabolism is extensive, it is still incomplete. Many metabolites present in cells, and their reactions and associated enzymes, remain to be discovered. Systematic and rapid approaches for determining how a newly discovered metabolite is produced and consumed are lacking. Thus, a present challenge for bioinformatics in support of metabolomics is the development of computational methods that will help place novel metabolites in the context of biochemical pathways efficiently and accurately. The prediction of metabolic reactions has been a topic of interest for some time, in part because the metabolic breakdown products of a drug can sometimes be toxic (Anari and Baillie, 2005). Bioremediation is another area in which predictive understanding of metabolic breakdown products is important (Langowski and Long, 2002). Because enzyme-catalyzed reactions are generally too complex to be predicted from first principles, a variety of knowledge-based expert systems have been developed for predicting metabolic transformations (Greene et al., 1999; Langowski and Long, 2002; Payne, 2004), and various software tools are available, largely, for the prediction of biodegradation pathways, such as MetabolExpert (Darvas, 1998), METEOR (Greene et al., 1999), META (Klopman et al., 1994; Talafous et al., 1994), and others (Hou et al., 2003, Jaworska et al., 2002; McShan et al., 2004; Mayeno et al., 2005). These systems are based on expert-defined rules for biotransformations that generalize the transformations of known reactions. For example, a recently developed system, BNICE, is based on around 250 enzyme-reaction rules derived from the Enzyme Commission (EC) classification system (Hatzimanikatis et al., 2005). The rules are essentially generators of chemical reactions, and they are applied to enumerate the reactions in which a molecule of interest may potentially participate. The number of rules varies from system to system [cf., META, which is based on 1118 rules (Klopman et al., 1994; Talafous et al., 1994; Anari and Baillie, 2005)]. The number of rules reflects the level of abstraction judged to be appropriate, the intended scope of application, and the data considered when formulating rules. A common problem of systems for rule-based reaction prediction is the number of false positives. Given a molecule, rule application tends to generate a large number of possible reactions, and a combinatorial explosion occurs if rules are applied iteratively to the products of predicted reactions (Hatzimanikatis et al., 2005). Published by Oxford University Press 2006 Prediction of oxidoreductase-catalyzed reactions To mitigate this problem, researchers have attempted to use expert knowledge to limit the generality of rules. Some systems provide priority indices or other filters based on expert knowledge to aid a user in assessing the likelihood of a reaction (Payne, 2004). Here, focusing on oxidoreductase-catalyzed reactions, which account for more than one-third of all known enzymatic reactions documented in the KEGG database (Goto et al., 2002), we extend rule-based reaction prediction by incorporating the use of machine learning. Using information about reactions and metabolites available in KEGG, we train a Support Vector Machine (SVM) to classify a potential reactant in an oxidoreductase-catalyzed reaction as a true or false reactant. The classification is based on seven quantitative structure-property relationship (QSPR) descriptors. We consider 12 classes of reactions, which are marked by characteristic functional group transformations. Most reaction classes correspond to subclasses of the EC 1 class of enzymes. 2 2.1 METHODS Class of reaction Description Predominant EC 1 subclass Number of reactions 1 Dehydrogenation of alcohols or hemiacetals Oxidation of an aldehyde or ketone Dehydrogenation of a CH-CH group introducing a double bond Oxidation of an amine yielding an imine which hydrolyzes to an aldehyde or ketone Dehydrogenation of an amine introducing a C¼N double bond Oxidation of a C-H to generate a C-OH Oxidation of C¼C to generate C(OH)-C(OH) Breaking a C¼C bond Breaking a C-C bond Oxidation of a nitrogenous group Oxidation of thiols with formation of a disulfide bond (S-S) Oxidation introducing a C-halide group 1.1.-.- 540 1.2.-.- 182 1.3.-.- 217 1.4.-.- 123 1.5.-.- 39 1.14.-.and 1.17.-.1.14.-.- 265 65 1.13.-.1.14.-.1.7.-.1.8.-.- 77 57 27 30 1.97.-.- 17 2 3 4 5 6 Data On December 8, 2005, the KEGG LIGAND composite database (Goto et al., 2002) contained information about 6455 reactions, of which 4993 are associated with a full EC number and 537 are associated with a partial EC number. These 5530 reactions involve a total of 4139 distinct metabolites as substrates, products or cofactors. For each of these metabolites, we downloaded its associated MDL Mol file (if available) from the COMPOUND section of the database. This file encodes the 2D chemical structure except for hydrogen atoms, which are usually missing. To obtain complete structures, we added hydrogen atoms consistent with valency using CDK (Steinbeck et al., 2003) and JOELib (Wegner, 2004) (http://joelib. sourceforge.net/). Chirality and other 3D structural features of the metabolites were not considered; results are unaffected because classifications are made based on QSPR descriptors that depend only on 2D chemical structure (see below). Of the 5530 reactions with EC information, 1956 are classified as oxidoreductase-catalyzed reactions (i.e. assigned an EC 1.-.-.- number). We downloaded the information available about each of these reactions from the REACTION section of the database. In some cases, a metabolite in a reaction is indicated as having a generic R group. An example is the reaction with KEGG reference number R04429. In these cases, we replaced the R group with the simplest possible alkane group, which was usually a methyl group (-CH3). 2.2 Table 1. Classification of oxidoreductase-catalyzed reactions based on functional group chemistry Reaction classification based on functional group chemistry Oxidoreductases catalyze oxidation and reduction reactions, which involve electron transfer from one metabolite to another. In the EC classification scheme, oxidoreductase-catalyzed reactions are divided into 22 subclasses based on the donor group that undergoes oxidation. Sub-subclasses are defined based on the acceptor group. Here, we consider 12 novel classes of oxidoreductase-catalyzed reactions (Table 1), which we define based on functional group transformations of substrates, ignoring cofactors, such as NAD(P)+, FAD+ and O2. These classes largely correspond to EC 1 subclasses, though the correspondence is not 1 to 1. The 12 classes include 1626 of the 1956 oxidoreductasecatalyzed reactions in KEGG. The other 330 reactions, which we do not consider further, are exotic (e.g. they involve oxidation of a metal ion, sulfur group or cofactor) or, in 114 cases, involve metabolites for which structural information is unavailable from KEGG. Most of the 1626 reactions classified in Table 1 belong to one of the 10 most populated EC 1 subclasses, which are 1.1, 1.2, 1.3, 1.4, 1.5, 1.7, 1.8, 1.13, 1.14 and 1.17. Within each of these EC subclasses, reactions tend to involve a particular functional group transformation or set of transformations, although there are a few exceptions 7 8 9 10 11 12 (e.g. reaction R02260 in KEGG, which is assigned EC 1.2.1.49, is better placed in Class 1 of Table 1 rather than Class 2 like other EC 1.2.-.reactions). Reactions in EC 1.14, which involve paired donors and incorporation of molecular oxygen, were split into three categories (Classes 6, 7 and 9 in Table 1) based on the functional group to which oxygen attaches. In some cases, we were able to assign reactions in EC subclasses outside those listed above to classes of Table 1 (e.g. reaction R04059 in KEGG, which is assigned EC 1.10.1.1, is included in Class 1). Also, Class 12 of Table 1 includes only reactions from EC 1.97, a catchall category for EC 1 reactions. The reactions included in each class of Table 1 are listed in the Supplementary material. 2.3 Reaction center patterns A metabolite is a potential reactant in a reaction class only if it has structural features compatible with a class-specific reaction center, an example of which is shown in Figure 1. Because our 12 reaction classes are defined based on functional group transformations, each reaction in a class must involve a reaction center that corresponds to the functional group affected by the characteristic transformation of the class. For example, a substrate in a Class 1 reaction must have a functional group that qualifies it as a primary or secondary alcohol or a hemiacetal. As can be seen in Table 2, reaction centers are comprised of one to three atoms. The reaction centers highlighted in Table 2, like any molecular pattern in fact can be precisely encoded using SMARTS strings (James et al., 2005). For example, the SMARTS string [C;!h0][Oh] encodes the first reaction center indicated in Table 2 (upper left). This pattern matches a hydroxyl group bound to a carbon atom of an alkyl group; thus, it can be used to identify alcohols. We manually identified reaction centers for all 1626 oxidoreductasecatalyzed reactions considered in Table 1 and encoded these reaction centers as SMARTS strings. We then sought, using a combination of expert 3083 F.Mu et al. A substructure was considered to be a positive example if the metabolite containing it is documented in KEGG to participate in a reaction corresponding to the pattern the substructure matches. Other metabolite substructures were considered to be negative examples. 2.5 Fig. 1. A reaction center is highlighted. This reaction center is specific to Class 3 reactions. The reaction shown, R00408 in KEGG, is the reaction catalyzed by succinate dehydrogenase: succinate + FAD+ , FADH2 + fumarate. Indices are assigned to atoms, except for hydrogens. knowledge (Silverman, 2000) and the results of graph mining (described in the paragraph below), to find more stringent structural requirements on the reactants in each reaction class by inspecting the surrounding structures of their reaction centers, looking for both typical and atypical features. For each reaction center in a reaction, we initiated a breadth-first traversal of the graph corresponding to the chemical structure of each metabolite participating in the reaction—the substrate(s) and product(s) are considered in turn. The traversal begins with the atoms of the reaction center. At each depth of the traversal, we store the substructure (the reaction center plus possibly some local surrounding structure) corresponding to the subgraph visited. For example, for either the substrate or product of Figure 1, atoms 2 and 5 and the bond between them are stored at depth 0. This initial subgraph corresponds to the reaction center. Atoms 1 and 6 and the bonds that connect these atoms to the reaction center are added at depth 1, and all other atoms and bonds are added at depth 2. Traversal, even if extension is possible, is halted at depth 2. In this manner, we obtained 25 collections of reaction center containing substructures for the substrates and products of the 12 sets of reactions indicated in Table 1. A collection of substructures, one obtained for the substrates of the 540 reactions of Class 1, is illustrated in Figure 2. Note that, because Class 4 reactions have two products with distinct reaction centers, there are 25 rather than 24 collections. Each of the 25 collections was inspected and a set of patterns, encoded using SMARTS strings, was defined. These patterns, which we will refer to as reaction center patterns, were selected with consideration of chemistry and with the goal of finding one or two patterns for each reaction class that together match as many of the reactants in the corresponding class of reactions as possible. We used JOELib to perform SMARTS-based matching (Wegner, 2004). The patterns we defined through this process are illustrated in Table 2; the SMARTS encodings of these patterns are provided in the Supplementary material. There are one, one, and two reaction center containing substructures not matched by a pattern for reactions in Classes 1, 8 and 9, respectively. These rare substructures are found in only 10 metabolites, which are indicated in the Supplementary material. In some cases, a reaction center pattern encompasses only a reaction center. In other cases, the pattern extends beyond the reaction center, which suggests that the surrounding local structure is important for reactivity. Our assumption is that the reaction center patterns of Table 2, some of which include negative application conditions, represent structural motifs that are prerequisites for participation of a metabolite, as a substrate or product, in reactions of Classes 1–12. 2.4 Negative and positive examples for training Negative and positive examples are required to train a binary classifier through supervised learning. Using routines for SMARTS string matching in JOELib (Wegner, 2004), we identified the set of substructures among the 4139 metabolites in KEGG that contain each reaction center pattern of Table 2. In finding matches, we used routines in JOELib, based on Hückel’s rule, to identify aromatic atoms and bonds, and we used the UniversalIsomorphismTester routine in CDK (Steinbeck et al., 2003) to avoid multiple matches to a single substructure because of symmetry. We then divided each set of metabolite substructures into negative and positive examples. 3084 Atomic properties Classifiers map inputs to outputs. Here, the outputs of interest will indicate whether a metabolite is a substrate or product in a reaction of a given type. The inputs that we will consider are atomic properties, namely, a set of seven QSPR empirical or semi-empirical descriptors calculated for each of the atoms in a reaction center. These descriptors were determined from 2D chemical structures using CDK (Steinbeck et al., 2003) and JOELib (Wegner, 2004). Each descriptor is briefly described below. Graph potential (Walters and Yalkowsky, 1996). This topological descriptor provides a measure of the distance of atoms from the graphical center of a molecule. Atoms that are close to the center will have a higher graph potential than atoms that are far from the center. Electrogeometrical state index, electrotopological state index, and conjugated electrotopological state index (Wegner and Zell, 2003). These three indices characterize the electronic state of an atom in the context of other atoms in a molecule. Gasteiger–Marsili partial charges (Gasteiger and Marsili, 1980). Gasteiger–Marsili partial charges characterize the distribution of electrons in a molecule, which is one of the most important factors influencing its chemical properties. These partial charges are calculated through an iterative procedure that balances electronegativity of charge and charge transfer through electronegativity difference. The partial charge of an atom depends on both the type of atom and its molecular environment. Effective polarizability (Gasteiger and Hutchings, 1984). This property measures the effect of an electric field on the electron cloud of an atom. Sigma electronegativity (Hutchings and Gasteiger, 1983). This property reflects the electron-attracting ability of an atom in its molecular environment. We considered these descriptors for two reasons. First, they measure topological and electronic properties, which we expect to be important for formation and breakage of covalent bonds. Second, these descriptors are readily calculated. 2.6 Classifiers To predict the potential of a given metabolite to participate in an oxidoreductase-catalyzed reaction, we trained binary classifiers for each reaction class. The classification problem is to predict whether a metabolite matching a given reaction center pattern (i.e. containing a substructure with a potential reaction center plus possibly some additional surrounding local structure) is a substrate or product of a reaction of the type corresponding to the pattern. Classification is made based on the QSPR descriptors listed in Section 2.5, which are calculated for each atom in the putative reaction center. Thus, in classification, we consider from 7 to 21 features of a metabolite, depending on the number of atoms in the putative reaction center, which varies from 1 to 3 (Table 2). To the best of our knowledge, this classification problem is novel. Thus, to establish a proof of principle and baseline performance expectations for a classifier, we decided to use a predictor method that is both simple and computationally efficient: a linear-kernel SVM (Cortes and Vapnik, 1995; Vapnik, 1998). SVMs often perform well at correctly classifying data not in training data (i.e. at generalization). Various kernels are popular and nonlinear-kernels, such as radial basis function kernels, arguably offer the best SVM performance, although there is little theoretical guidance available about which kernel might be best for our problem. We used a linear-kernel, corresponding to the simplest SVM, because this type of kernel has no adjustable parameters, and with this kernel, the weight vector w of the SVM (see below) provides a measure of which features have the most influence on classification outcome. We considered 25 training sets, one (or two in one case) for each entry in Table 2, as discussed in Section 2.4. Each metabolite substructure i in Prediction of oxidoreductase-catalyzed reactions Table 2. Reaction centers (highlighted in red) and local structural motifs of substrates and products of the reactions considered in Table 1 Class of reaction Reaction center patterns for reactants Reaction center patterns for products Class of reaction Reaction center patterns for reactants Reaction center patterns for products 7 1 8b 2 9b 3 10 4a 11c 5 6 12 In patterns, aromatic atoms are indicated by lower case letters, and aromatic bonds are indicated by broken lines (--–—). The symbol ‘!’ denotes a negative application condition, i.e. a condition that prevents a match. For example, ‘!¼O’ indicates ‘not doubly bound to oxygen’. a Reactions in Class 4 have two products, which have distinct reaction centers. b Reactions in Classes 8 and 9 have two products, which have indistinguishable reaction center patterns. c Reactions in Class 11 have two substrates, which have indistinguishable reaction center patterns. a training set has an ouptut yi ¼ {1,1}, depending on whether the metabolite substructure is a negative (1) or positive (1) example. Each metabolite substructure is also associated with an input vector xi ¼ (xi1, . . . , xid), which is based on the atomic properties discussed in Section 2.5. Unadjusted values of inputs can be on different scales. We linearly scaled the inputs xij such that mini xij ¼ 1 and maxi xij ¼ 1 for any given j. The length of the input vector, d, is the product of the number of atomic properties considered (7) and the number of atoms (from 1 to 3) in the reaction center associated with the training set of interest. The elements of the input vector are ordered according to the corresponding reaction center pattern, in which atoms in the reaction center are specified to have a fixed, arbitrary ordering. For symmetric reaction centers, elements of the input vector are further ordered according to a metabolite-specific ordering of atoms in the reaction center, which is illustrated in Figure 3. Given inputs xi and output yi for all i in a training set, the idea of a linear SVM is to find a hyperplane in the space of inputs x, satisfying the equation wTx + b ¼ 0, that splits the negative and positive examples with maximum margin. The margin of a separating hyperplane is the width by which it can be increased before touching a data point. The parameters of the SVM include the vectors w and b. Additional parameters include slack variables ji and two penalty parameters C and C+. The slack variables, which introduce a soft margin, are included to account for negative and positive examples that are not linearly separable (Cortes and Vapnik, 1995), and the penalty parameters are included to effectively balance unequal numbers of negative and positive examples (Chang and Lin, 2001) (http://www.csie.ntu.edu.tw/~cjlin/libsvm). For simplicity, we use the following relations to specify the penalty parameters: C ¼ 1 and C+ ¼ Nn/Np, where Nn is the number of negative examples in a training set and 3085 F.Mu et al. while the other nine sets are used for training. In a sensitivity analysis, we varied the number of negative examples included in training sets, randomly excluding some fraction. Each of the above procedures was repeated 10 times; the mean values of Qp and Qn were determined in each case. 3 Fig. 2. These seven distinct substructures are found around the reaction centers (highlighted) of the 540 substrates of Class 1 reactions after a depth 1 breadth-first traversal starting from the reaction center. The number above each substructure indicates the number of substrates containing the substructure. The first six substructures, accounting for 538 out of 540 substrates, are matched by one of the two reaction center patterns defined in Table 2. The first, second, fourth, fifth and sixth substructures with correspond to primary and secondary alcohols, which are matched by the SMARTS string [C;!h0;!$(C(O)O);!$(C ¼ O)][Oh]. Fig. 3. Atoms in a symmetric reaction center (matched by the first substrate pattern of Class 3 reactions) are highlighted. Although the reaction center is symmetric, Atoms 4 and 8 are distinct because of their different chemical environments. Atoms in this reaction center, and all other symmetric reaction centers, are ordered based on their Gasteiger–Marsili partial charges, from most negative to most positive. Thus, Atom 8, which has partial charge 0.0123 eV, comes before Atom 4, which has partial charge, 0.0343 eV. Np is the number of positive examples. The other parameters are found by solving a minimization problem: X X 1 min wT w þ Cþ ji þ C ji w‚ b‚ j 2 y ¼1 y ¼1 i i Minimization is subject to the following constraints: yi ðwT xi þ bÞ 1 ji and ji 0‚8i SVMs found as indicated above are used as follows. Given a SVM with parameters w and b and a metabolite substructure with inputs xi, we calculate the quantity wTxi + b. If the sign of this quantity is positive, the metabolite is predicted to be reactive in the type of reaction corresponding to the SVM. Conversely, if the sign is negative, the metabolite is predicted not to be reactive. To measure the performance of a SVM, we count the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) that it predicts for test data. We then calculate two ratios: sensitivity, which is the fraction of positive examples that are predicted to be positives, Qp ¼ TP/(TP + FN), and specificity, which is the fraction of negative examples that are predicted to be negatives: Qn ¼ TN/(TN + FP). These quantities are calculated as part of a 10-fold cross validation (CV) procedure. In the CV procedure, the relevant training dataset is divided randomly into 10 equal sets. Then, each of the 10 sets is used for testing 3086 RESULTS As described in the Methods section, binary classifiers were developed for predicting the reactivity of potential substrates and products in the 12 classes of oxidoreductase-catalyzed reactions defined in Table 1. A potential substrate or product is a metabolite that contains a substructure matching one of the reaction center patterns defined in Table 2 and is classified as reactive or not reactive in the corresponding type of reaction based on seven QSPR descriptors for each atom in the putative reaction center. These descriptors provide a characterization of the electronic and topological atomic properties of the potential reaction center. The accuracy of predictions was assessed by 10-fold CV. Results are shown in Table 3. We measure the quality of a classifier by its sensitivity, Qp, and specificity, Qn. The average sensitivity and specificity obtained for 10 sets of tests in a 10-fold CV procedure are given for each classifier in Table 3, which also indicates the numbers of positive and negative examples used in each of the training sets. Sensitivity ranges from 64 to 93% for substrates and from 71 to 98% for products. Specificity ranges from 44 to 92% for substrates and from 50 to 92% for products. The standard deviation (data not shown) is always small, being less than 0.03 for all cases. The substrate and product classifiers for a given type of reaction vary in performance because these classifiers were derived independently using distinct training sets. In training a classifier for a particular type of reaction, we used metabolites in the KEGG database not known to participate in that type of reaction as negative examples. However, our knowledge of metabolism is incomplete, and some of these putative negative examples might actually be positive examples. Moreover, negative examples include only metabolites that are substrates and/or products of known enzymatic reactions, which might weaken the ability of a classifier to generalize when applied to xenobiotic compounds or novel metabolites. To determine how classifier sensitivity and specificity depend on the negative examples included in a training set, we removed a fraction of the negative examples, which were randomly selected, and repeated the analysis of Table 3. Fractions of different sizes were removed. Results for classifiers of potential substrates and products in Class 1 reactions are shown in Figure 4. These results, which indicate that sensitivity and selectivity are robust to variations of the training data, are typical. The results for the other group transformations are available in the Supplementary material. To determine the properties of reaction centers that are most relevant for predicting reactivity, we can use the weight parameters w of a classifier to rank the relative importance of its inputs x. The inputs weighted most heavily are most informative. In Figure 5, we report the weights of the classifier for substrates of Class 3 reactions, in which a CH-CH group is dehydrogenated with the introduction of a double bond. As can be seen, the Gasteiger– Marsili partial charges of the two carbon atoms in the CH-CH reaction center are responsible for the largest positive contributions to the classifier output. This result suggests that reactivity requires Prediction of oxidoreductase-catalyzed reactions Table 3. Performance of classifiers for potential substrates and products in 12 types of oxidoreductase-catalyzed reactions Class of reaction Predictions based on reaction center patterns of substrates Training set CV performance Qn Positive Negative Qp Predictions based on reaction center patterns of products Training set CV performance Positive Negative Qp Qn 1 2 3 4 538 182 217 123 4140 209 17 143 2618 0.87 0.82 0.72 0.85 0.77 0.75 0.69 0.80 5 6 7 8 9 10 11 12 39 263 65 77 57 23 60 17 2178 21 766 10 504 10 479 11 070 2858 34 6865 0.93 0.78 0.85 0.86 0.87 0.64 0.78 0.66 0.78 0.44 0.75 0.85 0.80 0.92 0.53 0.84 539 182 217 123 50 39 265 65 149 111 27 30 17 Fig. 4. Classifier performance remains steady as negative examples are removed from training data. The horizontal axis indicates the fraction of negative examples removed (randomly). The classifiers considered here predict the reactivity of potential substrates (left panel) and products (right panel) in Class 1 reactions. Mean values and standard deviations are indicated for Qp (solid line) and Qn (broken line), which are calculated from 10 tests in a 10fold CV procedure. the deficiency of electrons in the reaction center, which is consistent with chemical intuition. However, in general, there is no easily discernible chemical interpretation of the weight distribution of a classifier. Weight distributions for all classifiers are provided in the Supplementary material. 4 DISCUSSION The genome sequence of an organism can be used to identify genes coding for enzymes and to infer the metabolic network of the organism (Francke et al., 2005), about which we may know little. However, only the reactions that have been characterized experimentally in some system can be inferred, and we have probably only scratched the surface. It is estimated that <1% of microorganisms can be cultured for laboratory study (Schloss and Handelsman, 2005). Technological advances, particularly in mass spectrometry, have enabled the detection of many signals for metabolites in cell extracts that have never been documented or 4498 2256 10 412 4203 2490 4328 13 821 6061 6119 16 248 73 10 176 0.89 0.80 0.76 0.90 0.90 0.95 0.84 0.93 0.91 0.85 0.72 0.98 0.71 0.83 0.74 0.66 0.89 0.77 0.83 0.75 0.92 0.86 0.85 0.84 0.50 0.80 Fig. 5. Weight parameter distribution for the linear-kernel SVM that classifies substrates of Class 3 reactions. The reaction center associated with this reaction class (CH-CH) is comprised of two carbon atoms. Thus, there are 14 inputs, x1, . . . , x14, the indices of which are indicated on the horizontal axis, and the same number of weights, the values of which are indicated on the vertical axis. Odd indices of inputs correspond to the first atom in the reaction center, whereas even indices correspond to the second atom. Recall that ordering of atoms in a reaction center is determined as described in Section 2.6. The first inputs considered (at left) are graph potentials. The other inputs are considered in the same order as they are listed in Section 2.5. Thus, the 9th and 10th inputs correspond to Gasteiger–Marsili partial charges of the first and second atoms, respectively. These inputs are associated with the two largest positive-valued weights (i.e. the tallest bars in the plot). As a result, a metabolite with larger, more positive partial charges will be more likely to be classified as a true substrate than one with smaller, more negative charges. characterized before (Wagner et al., 2003; Breitling et al., in press). We are now faced with the challenge of characterizing these metabolites and discovering the metabolic networks in which they are produced and consumed, as well as their regulatory roles, which may be significant (Wall et al., 2004). Here, we have developed binary classifiers, linear-kernel SVMs, that predict whether a potential substrate or product in one of 12 types of oxidoreductase-catalyzed reactions is a true reactant or not. The classifiers require structural information but no genomic sequence information. A 10-fold cross validation procedure was used to assess the performance of each classifier. Typically, but 3087 F.Mu et al. not uniformly, we find both high sensitivity and specificity (Table 3), which suggests that the classifier inputs reflect the ability of a metabolite to participate in an oxidoreductase-catalyzed reaction in many cases. These inputs are descriptors of the topological and electronic properties of atoms in the potential reaction center. The overall results establish the feasibility of our approach and baseline performance expectations, and they should motivate the development of more sophisticated classifiers. For example, one might consider SVMs based on nonlinear kernels. A nonlinear kernel, such as a radial basis function or polynomial kernel, should, with careful selection of kernel parameters, perform better. In cases where sensitivity or specificity is less than satisfactory, the particular descriptors we considered may simply be uninformative, in which case additional descriptors and/or more relevant or predictive descriptors might yield better results. Alternatively, it may be that reactivity is determined not by metabolite structure alone but rather by enzyme properties or the conformational changes induced by enzyme-substrate binding. The influence of enzyme-dependent properties on reactivity can be difficult to address (Colombo and Carrea, 2002; Henkelman et al., 2005), but more predictive descriptors of atomic and molecular properties can be obtained through straightforward application of quantum chemistry methods (Karelson et al., 1996), which require a 3D chemical structure. The drawback of these methods is that they are more computationally expensive and technically challenging than the ones used here. However, again, our overall results should provide motivation for further study, including the consideration of more expensive calculations of metabolite features. An advantage of the prediction system presented here over earlier rule-based approaches is the quantitative and automatic assessment of the reactivity of a given metabolite beyond a checking of the structural requirements of a transformation rule. Thus, experimental follow up can be focused faster on fewer metabolites, those with atomic properties characteristic of known reactants. The development of the methods presented here, which have been applied to a single EC class of enzyme-catalyzed reactions but can be extended systematically to the other five major EC reaction classes, represent a step toward the development of computational screening tools that can aid in elucidating the functional roles of newly discovered small-molecule metabolites. ACKNOWLEDGEMENTS The authors thank Ilya Nemenman and Robert F. Williams for helpful discussions. This work was supported by the Department of Energy, in part through the Genomics:GTL Program. Conflict of Interest: none declared. REFERENCES Anari,M.R. and Baillie,T.A. (2005) Bridging chemoinformatic metabolite prediction and tandem mass spectrometry. Drug Discov. Today, 10, 711–717. Breitling,R. et al. Precision mapping of the metabolome. Trends Biotechnol., doi:10.1016/j.tibtech.2006.10.006. Chang,C.-C. and Lin,C.-J. (2001) LIBSVM : a library for support vector machines. Colombo,G. and Carrea,G. (2002) Modeling enzyme reactivity in organic solvents and water through computer simulations. J. Biotechnol., 96, 23–33. Cortes,C. and Vapnik,V. (1995) Support-vector networks. Mach. Learn., 20, 273–297. Darvas,F. (1998) Predicting metabolic pathways by logic programming. J. Mol. Graphics, 6, 80–86. 3088 Dunn,W.B. et al. (2005) Measuring the metabolome: current analytical technologies. Analyst, 130, 606–625. Fernie,A.R. et al. (2004) Metabolite profiling: from diagnostics to systems biology. Nat. Rev. Mol. Cell. Biol., 5, 763–769. Fiehn,O. (2002) Metabolomics: the link between genotypes and phenotypes. Plant Mol. Biol., 48, 155–171. Francke,C. et al. (2005) Reconstructing the metabolic network of a bacterium from its genome. Trends Microbiol., 13, 550–558. Gasteiger,J. and Hutchings,M.G. (1984) Quantitative models of gas-phase protontransfer reactions involving alcohols, ethers, and their thio analoguess. Correlation analyses based on residual electronegativity and effective polarizability. J. Am. Chem. Soc., 106, 6489–6495. Gasteiger,J. and Marsili,M. (1980) Iterative partial equalization of orbital electronegativity—a rapid access to atomic charges. Tetrahedron, 36, 3219–3228. Goodacre,R. et al. (2004) Metabolomics by numbers: acquiring and understanding global metabolite data. Trends Biotechnol., 22, 245–252. Goto,S. et al. (2002) LIGAND: database of chemical compounds and reactions in biological pathways. Nucleic Acids Res., 30, 402–404. Greene,N. et al. (1999) Knowledge-based expert systems for toxicity and metabolism prediction: DEREK, StAR and METEOR. SAR QSAR Environ. Res., 10, 299–313. Hatzimanikatis,V. et al. (2005) Exploring the diversity of complex metabolic networks. Bioinformatics, 21, 1603–1609. Henkelman,G. et al. (2005) Conformational dependence of a protein kinase phosphate transfer reaction. Proc. Natl Acad. Sci. USA, 102, 15347–15351. Hou,B.K. et al. (2003) Microbial pathway prediction: a functional group approach. J. Chem. Inf. Comput. Sci., 43, 1051–1057. Hutchings,M.G. and Gasteiger,J. (1983) Residual electronegativity—an empirical quantification of polar influences and its application to the proton affinity of amines. Tetrahedron Lett., 24, 2541–2544. James,C.A. et al. (2005) Daylight Theory Manual, version 4.9, Ch. 4. Daylight Chemical Information Systems, Inc., Aliso Viejo, CA. Jaworska,J. et al. (2002) Probabilistic assessment of biodegradability based on meta-bolic pathways: CATBOL System. SAR QSAR Environ. Res., 13, 307–323. Karelson,M. et al. (1996) Quantum-chemical descriptors in QSAR/QSPR studies. Chem. Rev., 96, 1027–1043. Kell,D.B. (2004) Metabolomics and systems biology: making sense of the soup. Curr. Opin. Microbiol., 7, 296–307. Klopman,G. et al. (1994) META. 1. A program for the evaluation of metabolic transformation of chemicals. J. Chem. Inf. Comput. Sci., 34, 1320–1325. Langowski,J. and Long,A. (2002) Computer systems for the prediction of xenobiotic metabolism. Adv. Drug Deliv. Rev. 54, 407–415. Mayeno,A.N. et al. (2005) Biochemical reaction network modeling: predicting metabolism of organic chemical mixtures. Environ. Sci. Technol., 39, 5363–5371. McShan,D.C. et al. (2004) Symbolic inference of xenobiotic metabolism. Pac. Symp. Biocomput., 9, 545–546. Payne,M.P. (2004) Computer-based methods for the prediction of chemical metabolism and biotransformation within biological organisms. In Cronin,M.T.D. and Livingstone,D.J. (eds), Predicting Chemical Toxicity and Fate. CRC Press, Boca Raton, pp. 205–227. Schloss,P.D. and Handelsman,J. (2005) Metagenomics for studying unculturable microorganisms: cutting the Gordian knot. Genome Biol., 6, 229. Silverman,R.B. (2000) The Organic Chemistry of Enzyme-Catalyzed Reactions. Academic Press, San Diego, CA. Steinbeck,C. et al. (2003) The Chemistry Development Kit (CDK): an open-source Java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci., 43, 493–500. Talafous,J. et al. (1994) META. 2. A dictionary model of mammalian xenobiotic metabolism. J. Chem. Inf. Comput. Sci., 34, 1326–1333. Vapnik,V.N. (1998) Statistical Learning Theory. Wiley, NY. Wagner,C. et al. (2003) Construction and application of a mass spectral and retention time index database generated from plant GC/EI-TOF-MS metabolite profiles. Phytochemistry, 62, 887–900. Wall, M.E. et al. (2004) Design of gene circuits: lessons from bacteria. Nat. Rev. Genet., 5, 34–42. Walters,W.P. and Yalkowsky,S.H. (1996) ESCHER—a computer program for the determination of external rotational symmetry numbers from molecular topology. J. Chem. Inf. Comput. Sci., 36, 1015–1017. Wegner,J.K. (2004) JOELib Tutorial: A Java based cheminformatics/computational chemistry package. Wegner,J.K. and Zell,A. (2003) Prediction of aqueous solubility and partition coefficient optimized by a genetic algorithm based descriptor selection method. J. Chem. Inf. Comput. Sci, 43, 1077–1084.
© Copyright 2026 Paperzz