Prediction of oxidoreductase-catalyzed reactions based on atomic

BIOINFORMATICS
ORIGINAL PAPER
Vol. 22 no. 24 2006, pages 3082–3088
doi:10.1093/bioinformatics/btl535
Systems biology
Prediction of oxidoreductase-catalyzed reactions based
on atomic properties of metabolites
Fangping Mu1, Pat J. Unkefer2, Clifford J. Unkefer2 and William S. Hlavacek1,3,
1
Theoretical Biology and Biophysics Group, Theoretical Division, 2National Stable Isotope Resource, Bioscience
Division and 3Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
Received on July 13, 2006; revised on September 28, 2006; accepted on October 12, 2006
Advance Access publication October 23, 2006
Associate Editor: Golan Yona
ABSTRACT
Motivation: Our knowledge of metabolism is far from complete, and
the gaps in our knowledge are being revealed by metabolomic detection
of small-molecules not previously known to exist in cells. An important
challenge is to determine the reactions in which these compounds
participate, which can lead to the identification of gene products
responsible for novel metabolic pathways. To address this challenge,
we investigate how machine learning can be used to predict potential
substrates and products of oxidoreductase-catalyzed reactions.
Results: We examined 1956 oxidation/reduction reactions in the
KEGG database. The vast majority of these reactions (1626) can be
divided into 12 subclasses, each of which is marked by a particular
type of functional group transformation. For a given transformation,
the local structures of reaction centers in substrates and products
can be characterized by patterns. These patterns are not unique
to reactants but are widely distributed among KEGG metabolites. To
distinguish reactants from non-reactants, we trained classifiers
(linear-kernel Support Vector Machines) using negative and positive
examples. The input to a classifier is a set of atomic features that
can be determined from the 2D chemical structure of a compound.
Depending on the subclass of reaction, the accuracy of prediction for
positives (negatives) is 64 to 93% (44 to 92%) when asking if a
compound is a substrate and 71 to 98% (50 to 92%) when asking if
a compound is a product. Sensitivity analysis reveals that this performance is robust to variations of the training data. Our results suggest
that metabolic connectivity can be predicted with reasonable accuracy
from the presence or absence of local structural motifs in compounds
and their readily calculated atomic features.
Availability: Classifiers reported here can be used freely for noncommercial purposes via a Java program available upon request.
Contact: [email protected]
Supplementary information: Supplementary data are available at
Bioinformatics online.
1
INTRODUCTION
High-throughput analytic techniques of metabolomics now allow
large-scale characterization of the metabolic state of a cell
(Fiehn, 2002; Kell, 2004; Dunn et al., 2005). Metabolomic
approaches are designed to detect the metabolites in a complex
mixure as comprehensively as possible. The various techniques,
To whom correspondence should be addressed.
3082
when used in combination, can be highly sensitive. A goal of
metabolomics is to define and monitor the entire metabolite complement of a cell, and progress is being made in this direction
(Fernie et al., 2004; Goodacre et al., 2004).
Although our knowledge of metabolism is extensive, it is still
incomplete. Many metabolites present in cells, and their reactions
and associated enzymes, remain to be discovered. Systematic and
rapid approaches for determining how a newly discovered metabolite is produced and consumed are lacking. Thus, a present challenge
for bioinformatics in support of metabolomics is the development of
computational methods that will help place novel metabolites in the
context of biochemical pathways efficiently and accurately.
The prediction of metabolic reactions has been a topic of interest
for some time, in part because the metabolic breakdown products
of a drug can sometimes be toxic (Anari and Baillie, 2005). Bioremediation is another area in which predictive understanding of
metabolic breakdown products is important (Langowski and
Long, 2002). Because enzyme-catalyzed reactions are generally
too complex to be predicted from first principles, a variety of
knowledge-based expert systems have been developed for predicting metabolic transformations (Greene et al., 1999; Langowski and
Long, 2002; Payne, 2004), and various software tools are available,
largely, for the prediction of biodegradation pathways, such as
MetabolExpert (Darvas, 1998), METEOR (Greene et al., 1999),
META (Klopman et al., 1994; Talafous et al., 1994), and others
(Hou et al., 2003, Jaworska et al., 2002; McShan et al., 2004;
Mayeno et al., 2005). These systems are based on expert-defined
rules for biotransformations that generalize the transformations of
known reactions. For example, a recently developed system,
BNICE, is based on around 250 enzyme-reaction rules derived
from the Enzyme Commission (EC) classification system
(Hatzimanikatis et al., 2005). The rules are essentially generators
of chemical reactions, and they are applied to enumerate the reactions in which a molecule of interest may potentially participate.
The number of rules varies from system to system [cf., META,
which is based on 1118 rules (Klopman et al., 1994; Talafous
et al., 1994; Anari and Baillie, 2005)]. The number of rules reflects
the level of abstraction judged to be appropriate, the intended scope
of application, and the data considered when formulating rules.
A common problem of systems for rule-based reaction prediction is the number of false positives. Given a molecule, rule application tends to generate a large number of possible reactions, and
a combinatorial explosion occurs if rules are applied iteratively to
the products of predicted reactions (Hatzimanikatis et al., 2005).
Published by Oxford University Press 2006
Prediction of oxidoreductase-catalyzed reactions
To mitigate this problem, researchers have attempted to use expert
knowledge to limit the generality of rules. Some systems provide
priority indices or other filters based on expert knowledge to aid a
user in assessing the likelihood of a reaction (Payne, 2004).
Here, focusing on oxidoreductase-catalyzed reactions, which
account for more than one-third of all known enzymatic reactions
documented in the KEGG database (Goto et al., 2002), we extend
rule-based reaction prediction by incorporating the use of machine
learning. Using information about reactions and metabolites available in KEGG, we train a Support Vector Machine (SVM) to classify a potential reactant in an oxidoreductase-catalyzed reaction as a
true or false reactant. The classification is based on seven quantitative structure-property relationship (QSPR) descriptors. We consider 12 classes of reactions, which are marked by characteristic
functional group transformations. Most reaction classes correspond
to subclasses of the EC 1 class of enzymes.
2
2.1
METHODS
Class of
reaction
Description
Predominant
EC 1 subclass
Number of
reactions
1
Dehydrogenation of alcohols or
hemiacetals
Oxidation of an aldehyde or
ketone
Dehydrogenation of a CH-CH
group introducing a double
bond
Oxidation of an amine yielding
an imine which hydrolyzes
to an aldehyde or ketone
Dehydrogenation of an amine
introducing a C¼N double
bond
Oxidation of a C-H to generate a
C-OH
Oxidation of C¼C to
generate C(OH)-C(OH)
Breaking a C¼C bond
Breaking a C-C bond
Oxidation of a nitrogenous group
Oxidation of thiols with formation
of a disulfide bond (S-S)
Oxidation introducing a C-halide
group
1.1.-.-
540
1.2.-.-
182
1.3.-.-
217
1.4.-.-
123
1.5.-.-
39
1.14.-.and 1.17.-.1.14.-.-
265
65
1.13.-.1.14.-.1.7.-.1.8.-.-
77
57
27
30
1.97.-.-
17
2
3
4
5
6
Data
On December 8, 2005, the KEGG LIGAND composite database (Goto et al.,
2002) contained information about 6455 reactions, of which 4993 are
associated with a full EC number and 537 are associated with a partial
EC number. These 5530 reactions involve a total of 4139 distinct metabolites
as substrates, products or cofactors. For each of these metabolites, we downloaded its associated MDL Mol file (if available) from the COMPOUND
section of the database. This file encodes the 2D chemical structure except
for hydrogen atoms, which are usually missing. To obtain complete
structures, we added hydrogen atoms consistent with valency using CDK
(Steinbeck et al., 2003) and JOELib (Wegner, 2004) (http://joelib.
sourceforge.net/). Chirality and other 3D structural features of the metabolites were not considered; results are unaffected because classifications are
made based on QSPR descriptors that depend only on 2D chemical structure
(see below). Of the 5530 reactions with EC information, 1956 are classified
as oxidoreductase-catalyzed reactions (i.e. assigned an EC 1.-.-.- number).
We downloaded the information available about each of these reactions from
the REACTION section of the database. In some cases, a metabolite in a
reaction is indicated as having a generic R group. An example is the reaction
with KEGG reference number R04429. In these cases, we replaced the
R group with the simplest possible alkane group, which was usually a methyl
group (-CH3).
2.2
Table 1. Classification of oxidoreductase-catalyzed reactions based on
functional group chemistry
Reaction classification based on functional
group chemistry
Oxidoreductases catalyze oxidation and reduction reactions, which involve
electron transfer from one metabolite to another. In the EC classification
scheme, oxidoreductase-catalyzed reactions are divided into 22 subclasses
based on the donor group that undergoes oxidation. Sub-subclasses are
defined based on the acceptor group.
Here, we consider 12 novel classes of oxidoreductase-catalyzed reactions
(Table 1), which we define based on functional group transformations of
substrates, ignoring cofactors, such as NAD(P)+, FAD+ and O2. These
classes largely correspond to EC 1 subclasses, though the correspondence
is not 1 to 1. The 12 classes include 1626 of the 1956 oxidoreductasecatalyzed reactions in KEGG. The other 330 reactions, which we do not
consider further, are exotic (e.g. they involve oxidation of a metal ion, sulfur
group or cofactor) or, in 114 cases, involve metabolites for which structural
information is unavailable from KEGG. Most of the 1626 reactions classified
in Table 1 belong to one of the 10 most populated EC 1 subclasses, which are
1.1, 1.2, 1.3, 1.4, 1.5, 1.7, 1.8, 1.13, 1.14 and 1.17. Within each of these EC
subclasses, reactions tend to involve a particular functional group transformation or set of transformations, although there are a few exceptions
7
8
9
10
11
12
(e.g. reaction R02260 in KEGG, which is assigned EC 1.2.1.49, is better
placed in Class 1 of Table 1 rather than Class 2 like other EC 1.2.-.reactions). Reactions in EC 1.14, which involve paired donors and incorporation of molecular oxygen, were split into three categories (Classes 6,
7 and 9 in Table 1) based on the functional group to which oxygen attaches.
In some cases, we were able to assign reactions in EC subclasses outside
those listed above to classes of Table 1 (e.g. reaction R04059 in KEGG,
which is assigned EC 1.10.1.1, is included in Class 1). Also, Class 12 of
Table 1 includes only reactions from EC 1.97, a catchall category for EC
1 reactions. The reactions included in each class of Table 1 are listed in the
Supplementary material.
2.3
Reaction center patterns
A metabolite is a potential reactant in a reaction class only if it has structural
features compatible with a class-specific reaction center, an example of
which is shown in Figure 1. Because our 12 reaction classes are defined
based on functional group transformations, each reaction in a class
must involve a reaction center that corresponds to the functional group
affected by the characteristic transformation of the class. For example, a
substrate in a Class 1 reaction must have a functional group that qualifies it
as a primary or secondary alcohol or a hemiacetal. As can be seen in Table 2,
reaction centers are comprised of one to three atoms. The reaction centers
highlighted in Table 2, like any molecular pattern in fact can be precisely
encoded using SMARTS strings (James et al., 2005). For example, the
SMARTS string [C;!h0][Oh] encodes the first reaction center indicated in
Table 2 (upper left). This pattern matches a hydroxyl group bound to a
carbon atom of an alkyl group; thus, it can be used to identify alcohols.
We manually identified reaction centers for all 1626 oxidoreductasecatalyzed reactions considered in Table 1 and encoded these reaction
centers as SMARTS strings. We then sought, using a combination of expert
3083
F.Mu et al.
A substructure was considered to be a positive example if the metabolite
containing it is documented in KEGG to participate in a reaction corresponding to the pattern the substructure matches. Other metabolite substructures
were considered to be negative examples.
2.5
Fig. 1. A reaction center is highlighted. This reaction center is specific to
Class 3 reactions. The reaction shown, R00408 in KEGG, is the reaction
catalyzed by succinate dehydrogenase: succinate + FAD+ , FADH2 + fumarate. Indices are assigned to atoms, except for hydrogens.
knowledge (Silverman, 2000) and the results of graph mining (described in
the paragraph below), to find more stringent structural requirements on the
reactants in each reaction class by inspecting the surrounding structures of
their reaction centers, looking for both typical and atypical features.
For each reaction center in a reaction, we initiated a breadth-first
traversal of the graph corresponding to the chemical structure of each
metabolite participating in the reaction—the substrate(s) and product(s)
are considered in turn. The traversal begins with the atoms of the reaction
center. At each depth of the traversal, we store the substructure (the reaction
center plus possibly some local surrounding structure) corresponding to
the subgraph visited. For example, for either the substrate or product of
Figure 1, atoms 2 and 5 and the bond between them are stored at depth 0. This
initial subgraph corresponds to the reaction center. Atoms 1 and 6 and
the bonds that connect these atoms to the reaction center are added at
depth 1, and all other atoms and bonds are added at depth 2. Traversal,
even if extension is possible, is halted at depth 2. In this manner, we obtained
25 collections of reaction center containing substructures for the substrates
and products of the 12 sets of reactions indicated in Table 1. A collection
of substructures, one obtained for the substrates of the 540 reactions of
Class 1, is illustrated in Figure 2. Note that, because Class 4 reactions
have two products with distinct reaction centers, there are 25 rather than
24 collections. Each of the 25 collections was inspected and a set of patterns,
encoded using SMARTS strings, was defined. These patterns, which we will
refer to as reaction center patterns, were selected with consideration of
chemistry and with the goal of finding one or two patterns for each reaction
class that together match as many of the reactants in the corresponding class
of reactions as possible. We used JOELib to perform SMARTS-based
matching (Wegner, 2004). The patterns we defined through this process
are illustrated in Table 2; the SMARTS encodings of these patterns are
provided in the Supplementary material. There are one, one, and two reaction
center containing substructures not matched by a pattern for reactions in
Classes 1, 8 and 9, respectively. These rare substructures are found in only
10 metabolites, which are indicated in the Supplementary material. In some
cases, a reaction center pattern encompasses only a reaction center. In
other cases, the pattern extends beyond the reaction center, which suggests
that the surrounding local structure is important for reactivity. Our assumption is that the reaction center patterns of Table 2, some of which include
negative application conditions, represent structural motifs that are prerequisites for participation of a metabolite, as a substrate or product, in reactions
of Classes 1–12.
2.4
Negative and positive examples for training
Negative and positive examples are required to train a binary classifier
through supervised learning. Using routines for SMARTS string matching
in JOELib (Wegner, 2004), we identified the set of substructures among
the 4139 metabolites in KEGG that contain each reaction center pattern of
Table 2. In finding matches, we used routines in JOELib, based on Hückel’s
rule, to identify aromatic atoms and bonds, and we used the UniversalIsomorphismTester routine in CDK (Steinbeck et al., 2003) to avoid multiple
matches to a single substructure because of symmetry. We then divided
each set of metabolite substructures into negative and positive examples.
3084
Atomic properties
Classifiers map inputs to outputs. Here, the outputs of interest will indicate
whether a metabolite is a substrate or product in a reaction of a given type.
The inputs that we will consider are atomic properties, namely, a set of seven
QSPR empirical or semi-empirical descriptors calculated for each of the
atoms in a reaction center. These descriptors were determined from 2D
chemical structures using CDK (Steinbeck et al., 2003) and JOELib
(Wegner, 2004). Each descriptor is briefly described below.
Graph potential (Walters and Yalkowsky, 1996). This topological
descriptor provides a measure of the distance of atoms from the graphical
center of a molecule. Atoms that are close to the center will have a higher
graph potential than atoms that are far from the center.
Electrogeometrical state index, electrotopological state index, and
conjugated electrotopological state index (Wegner and Zell, 2003). These
three indices characterize the electronic state of an atom in the context of
other atoms in a molecule.
Gasteiger–Marsili partial charges (Gasteiger and Marsili, 1980).
Gasteiger–Marsili partial charges characterize the distribution of electrons
in a molecule, which is one of the most important factors influencing its
chemical properties. These partial charges are calculated through an iterative
procedure that balances electronegativity of charge and charge transfer
through electronegativity difference. The partial charge of an atom depends
on both the type of atom and its molecular environment.
Effective polarizability (Gasteiger and Hutchings, 1984). This property
measures the effect of an electric field on the electron cloud of an atom.
Sigma electronegativity (Hutchings and Gasteiger, 1983). This
property reflects the electron-attracting ability of an atom in its molecular
environment.
We considered these descriptors for two reasons. First, they measure
topological and electronic properties, which we expect to be important
for formation and breakage of covalent bonds. Second, these descriptors
are readily calculated.
2.6
Classifiers
To predict the potential of a given metabolite to participate in an
oxidoreductase-catalyzed reaction, we trained binary classifiers for each
reaction class. The classification problem is to predict whether a metabolite
matching a given reaction center pattern (i.e. containing a substructure with a
potential reaction center plus possibly some additional surrounding local
structure) is a substrate or product of a reaction of the type corresponding to
the pattern. Classification is made based on the QSPR descriptors listed in
Section 2.5, which are calculated for each atom in the putative reaction
center. Thus, in classification, we consider from 7 to 21 features of a metabolite, depending on the number of atoms in the putative reaction center, which
varies from 1 to 3 (Table 2). To the best of our knowledge, this classification
problem is novel. Thus, to establish a proof of principle and baseline performance expectations for a classifier, we decided to use a predictor method
that is both simple and computationally efficient: a linear-kernel SVM
(Cortes and Vapnik, 1995; Vapnik, 1998). SVMs often perform well at
correctly classifying data not in training data (i.e. at generalization). Various
kernels are popular and nonlinear-kernels, such as radial basis function
kernels, arguably offer the best SVM performance, although there is little
theoretical guidance available about which kernel might be best for our
problem. We used a linear-kernel, corresponding to the simplest SVM,
because this type of kernel has no adjustable parameters, and with this
kernel, the weight vector w of the SVM (see below) provides a measure
of which features have the most influence on classification outcome.
We considered 25 training sets, one (or two in one case) for each entry
in Table 2, as discussed in Section 2.4. Each metabolite substructure i in
Prediction of oxidoreductase-catalyzed reactions
Table 2. Reaction centers (highlighted in red) and local structural motifs of substrates and products of the reactions considered in Table 1
Class of
reaction
Reaction center
patterns for reactants
Reaction center
patterns for products
Class of
reaction
Reaction center
patterns for reactants
Reaction center
patterns for products
7
1
8b
2
9b
3
10
4a
11c
5
6
12
In patterns, aromatic atoms are indicated by lower case letters, and aromatic bonds are indicated by broken lines (--–—). The symbol ‘!’ denotes a negative application condition, i.e. a
condition that prevents a match. For example, ‘!¼O’ indicates ‘not doubly bound to oxygen’.
a
Reactions in Class 4 have two products, which have distinct reaction centers.
b
Reactions in Classes 8 and 9 have two products, which have indistinguishable reaction center patterns.
c
Reactions in Class 11 have two substrates, which have indistinguishable reaction center patterns.
a training set has an ouptut yi ¼ {1,1}, depending on whether the metabolite substructure is a negative (1) or positive (1) example. Each metabolite
substructure is also associated with an input vector xi ¼ (xi1, . . . , xid), which
is based on the atomic properties discussed in Section 2.5. Unadjusted values
of inputs can be on different scales. We linearly scaled the inputs xij such that
mini xij ¼ 1 and maxi xij ¼ 1 for any given j. The length of the input vector,
d, is the product of the number of atomic properties considered (7) and the
number of atoms (from 1 to 3) in the reaction center associated with the
training set of interest. The elements of the input vector are ordered according to the corresponding reaction center pattern, in which atoms in the
reaction center are specified to have a fixed, arbitrary ordering. For symmetric reaction centers, elements of the input vector are further ordered
according to a metabolite-specific ordering of atoms in the reaction center,
which is illustrated in Figure 3.
Given inputs xi and output yi for all i in a training set, the idea of a
linear SVM is to find a hyperplane in the space of inputs x, satisfying
the equation wTx + b ¼ 0, that splits the negative and positive examples
with maximum margin. The margin of a separating hyperplane is the width
by which it can be increased before touching a data point. The parameters of
the SVM include the vectors w and b. Additional parameters include slack
variables ji and two penalty parameters C and C+. The slack variables,
which introduce a soft margin, are included to account for negative and
positive examples that are not linearly separable (Cortes and Vapnik, 1995),
and the penalty parameters are included to effectively balance unequal
numbers of negative and positive examples (Chang and Lin, 2001)
(http://www.csie.ntu.edu.tw/~cjlin/libsvm). For simplicity, we use the
following relations to specify the penalty parameters: C ¼ 1 and C+ ¼
Nn/Np, where Nn is the number of negative examples in a training set and
3085
F.Mu et al.
while the other nine sets are used for training. In a sensitivity analysis, we
varied the number of negative examples included in training sets, randomly
excluding some fraction. Each of the above procedures was repeated
10 times; the mean values of Qp and Qn were determined in each case.
3
Fig. 2. These seven distinct substructures are found around the reaction
centers (highlighted) of the 540 substrates of Class 1 reactions after a depth
1 breadth-first traversal starting from the reaction center. The number above
each substructure indicates the number of substrates containing the substructure. The first six substructures, accounting for 538 out of 540 substrates, are
matched by one of the two reaction center patterns defined in Table 2. The
first, second, fourth, fifth and sixth substructures with correspond to primary
and secondary alcohols, which are matched by the SMARTS string
[C;!h0;!$(C(O)O);!$(C ¼ O)][Oh].
Fig. 3. Atoms in a symmetric reaction center (matched by the first substrate
pattern of Class 3 reactions) are highlighted. Although the reaction center is
symmetric, Atoms 4 and 8 are distinct because of their different chemical
environments. Atoms in this reaction center, and all other symmetric reaction
centers, are ordered based on their Gasteiger–Marsili partial charges, from
most negative to most positive. Thus, Atom 8, which has partial charge 0.0123
eV, comes before Atom 4, which has partial charge, 0.0343 eV.
Np is the number of positive examples. The other parameters are found
by solving a minimization problem:
X
X
1
min wT w þ Cþ
ji þ C
ji
w‚ b‚ j 2
y ¼1
y ¼1
i
i
Minimization is subject to the following constraints:
yi ðwT xi þ bÞ 1 ji
and ji 0‚8i
SVMs found as indicated above are used as follows. Given a SVM
with parameters w and b and a metabolite substructure with inputs xi, we
calculate the quantity wTxi + b. If the sign of this quantity is positive, the
metabolite is predicted to be reactive in the type of reaction corresponding to
the SVM. Conversely, if the sign is negative, the metabolite is predicted not
to be reactive.
To measure the performance of a SVM, we count the numbers of true
positives (TP), true negatives (TN), false positives (FP) and false negatives
(FN) that it predicts for test data. We then calculate two ratios: sensitivity,
which is the fraction of positive examples that are predicted to be positives,
Qp ¼ TP/(TP + FN), and specificity, which is the fraction of negative
examples that are predicted to be negatives: Qn ¼ TN/(TN + FP).
These quantities are calculated as part of a 10-fold cross validation (CV)
procedure. In the CV procedure, the relevant training dataset is divided
randomly into 10 equal sets. Then, each of the 10 sets is used for testing
3086
RESULTS
As described in the Methods section, binary classifiers were
developed for predicting the reactivity of potential substrates and
products in the 12 classes of oxidoreductase-catalyzed reactions
defined in Table 1. A potential substrate or product is a metabolite
that contains a substructure matching one of the reaction center
patterns defined in Table 2 and is classified as reactive or not
reactive in the corresponding type of reaction based on seven
QSPR descriptors for each atom in the putative reaction center.
These descriptors provide a characterization of the electronic and
topological atomic properties of the potential reaction center. The
accuracy of predictions was assessed by 10-fold CV. Results are
shown in Table 3.
We measure the quality of a classifier by its sensitivity, Qp,
and specificity, Qn. The average sensitivity and specificity obtained
for 10 sets of tests in a 10-fold CV procedure are given for each
classifier in Table 3, which also indicates the numbers of positive
and negative examples used in each of the training sets. Sensitivity
ranges from 64 to 93% for substrates and from 71 to 98%
for products. Specificity ranges from 44 to 92% for substrates
and from 50 to 92% for products. The standard deviation (data
not shown) is always small, being less than 0.03 for all cases.
The substrate and product classifiers for a given type of reaction
vary in performance because these classifiers were derived independently using distinct training sets.
In training a classifier for a particular type of reaction, we used
metabolites in the KEGG database not known to participate in that
type of reaction as negative examples. However, our knowledge of
metabolism is incomplete, and some of these putative negative
examples might actually be positive examples. Moreover, negative
examples include only metabolites that are substrates and/or products of known enzymatic reactions, which might weaken the ability
of a classifier to generalize when applied to xenobiotic compounds
or novel metabolites. To determine how classifier sensitivity and
specificity depend on the negative examples included in a training
set, we removed a fraction of the negative examples, which were
randomly selected, and repeated the analysis of Table 3. Fractions
of different sizes were removed. Results for classifiers of potential
substrates and products in Class 1 reactions are shown in Figure 4.
These results, which indicate that sensitivity and selectivity are
robust to variations of the training data, are typical. The results
for the other group transformations are available in the Supplementary material.
To determine the properties of reaction centers that are most
relevant for predicting reactivity, we can use the weight parameters
w of a classifier to rank the relative importance of its inputs x. The
inputs weighted most heavily are most informative. In Figure 5,
we report the weights of the classifier for substrates of Class 3
reactions, in which a CH-CH group is dehydrogenated with the
introduction of a double bond. As can be seen, the Gasteiger–
Marsili partial charges of the two carbon atoms in the CH-CH
reaction center are responsible for the largest positive contributions
to the classifier output. This result suggests that reactivity requires
Prediction of oxidoreductase-catalyzed reactions
Table 3. Performance of classifiers for potential substrates and products in 12 types of oxidoreductase-catalyzed reactions
Class of reaction
Predictions based on reaction center patterns of substrates
Training set
CV performance
Qn
Positive
Negative
Qp
Predictions based on reaction center patterns of products
Training set
CV performance
Positive
Negative
Qp
Qn
1
2
3
4
538
182
217
123
4140
209
17 143
2618
0.87
0.82
0.72
0.85
0.77
0.75
0.69
0.80
5
6
7
8
9
10
11
12
39
263
65
77
57
23
60
17
2178
21 766
10 504
10 479
11 070
2858
34
6865
0.93
0.78
0.85
0.86
0.87
0.64
0.78
0.66
0.78
0.44
0.75
0.85
0.80
0.92
0.53
0.84
539
182
217
123
50
39
265
65
149
111
27
30
17
Fig. 4. Classifier performance remains steady as negative examples are removed from training data. The horizontal axis indicates the fraction of negative examples removed (randomly). The classifiers considered here predict
the reactivity of potential substrates (left panel) and products (right panel) in
Class 1 reactions. Mean values and standard deviations are indicated for Qp
(solid line) and Qn (broken line), which are calculated from 10 tests in a 10fold CV procedure.
the deficiency of electrons in the reaction center, which is consistent
with chemical intuition. However, in general, there is no easily
discernible chemical interpretation of the weight distribution of a
classifier. Weight distributions for all classifiers are provided in the
Supplementary material.
4
DISCUSSION
The genome sequence of an organism can be used to identify genes
coding for enzymes and to infer the metabolic network of the
organism (Francke et al., 2005), about which we may know little.
However, only the reactions that have been characterized experimentally in some system can be inferred, and we have probably
only scratched the surface. It is estimated that <1% of microorganisms can be cultured for laboratory study (Schloss and
Handelsman, 2005). Technological advances, particularly in mass
spectrometry, have enabled the detection of many signals for
metabolites in cell extracts that have never been documented or
4498
2256
10 412
4203
2490
4328
13 821
6061
6119
16 248
73
10
176
0.89
0.80
0.76
0.90
0.90
0.95
0.84
0.93
0.91
0.85
0.72
0.98
0.71
0.83
0.74
0.66
0.89
0.77
0.83
0.75
0.92
0.86
0.85
0.84
0.50
0.80
Fig. 5. Weight parameter distribution for the linear-kernel SVM that classifies substrates of Class 3 reactions. The reaction center associated with this
reaction class (CH-CH) is comprised of two carbon atoms. Thus, there are 14
inputs, x1, . . . , x14, the indices of which are indicated on the horizontal axis,
and the same number of weights, the values of which are indicated on the
vertical axis. Odd indices of inputs correspond to the first atom in the reaction
center, whereas even indices correspond to the second atom. Recall that
ordering of atoms in a reaction center is determined as described in Section
2.6. The first inputs considered (at left) are graph potentials. The other inputs
are considered in the same order as they are listed in Section 2.5. Thus, the 9th
and 10th inputs correspond to Gasteiger–Marsili partial charges of the first
and second atoms, respectively. These inputs are associated with the two
largest positive-valued weights (i.e. the tallest bars in the plot). As a result, a
metabolite with larger, more positive partial charges will be more likely to be
classified as a true substrate than one with smaller, more negative charges.
characterized before (Wagner et al., 2003; Breitling et al., in press).
We are now faced with the challenge of characterizing these
metabolites and discovering the metabolic networks in which
they are produced and consumed, as well as their regulatory
roles, which may be significant (Wall et al., 2004).
Here, we have developed binary classifiers, linear-kernel SVMs,
that predict whether a potential substrate or product in one of
12 types of oxidoreductase-catalyzed reactions is a true reactant
or not. The classifiers require structural information but no genomic
sequence information. A 10-fold cross validation procedure was
used to assess the performance of each classifier. Typically, but
3087
F.Mu et al.
not uniformly, we find both high sensitivity and specificity
(Table 3), which suggests that the classifier inputs reflect the ability
of a metabolite to participate in an oxidoreductase-catalyzed reaction in many cases. These inputs are descriptors of the topological
and electronic properties of atoms in the potential reaction center.
The overall results establish the feasibility of our approach and
baseline performance expectations, and they should motivate the
development of more sophisticated classifiers. For example, one
might consider SVMs based on nonlinear kernels. A nonlinear
kernel, such as a radial basis function or polynomial kernel, should,
with careful selection of kernel parameters, perform better.
In cases where sensitivity or specificity is less than satisfactory,
the particular descriptors we considered may simply be uninformative, in which case additional descriptors and/or more relevant or
predictive descriptors might yield better results. Alternatively, it
may be that reactivity is determined not by metabolite structure
alone but rather by enzyme properties or the conformational
changes induced by enzyme-substrate binding. The influence of
enzyme-dependent properties on reactivity can be difficult to
address (Colombo and Carrea, 2002; Henkelman et al., 2005),
but more predictive descriptors of atomic and molecular properties
can be obtained through straightforward application of quantum
chemistry methods (Karelson et al., 1996), which require a 3D
chemical structure. The drawback of these methods is that they
are more computationally expensive and technically challenging
than the ones used here. However, again, our overall results should
provide motivation for further study, including the consideration of
more expensive calculations of metabolite features.
An advantage of the prediction system presented here over earlier
rule-based approaches is the quantitative and automatic assessment
of the reactivity of a given metabolite beyond a checking of the
structural requirements of a transformation rule. Thus, experimental
follow up can be focused faster on fewer metabolites, those with
atomic properties characteristic of known reactants. The development of the methods presented here, which have been applied to a
single EC class of enzyme-catalyzed reactions but can be extended
systematically to the other five major EC reaction classes, represent
a step toward the development of computational screening tools that
can aid in elucidating the functional roles of newly discovered
small-molecule metabolites.
ACKNOWLEDGEMENTS
The authors thank Ilya Nemenman and Robert F. Williams for helpful discussions. This work was supported by the Department of
Energy, in part through the Genomics:GTL Program.
Conflict of Interest: none declared.
REFERENCES
Anari,M.R. and Baillie,T.A. (2005) Bridging chemoinformatic metabolite prediction
and tandem mass spectrometry. Drug Discov. Today, 10, 711–717.
Breitling,R. et al. Precision mapping of the metabolome. Trends Biotechnol.,
doi:10.1016/j.tibtech.2006.10.006.
Chang,C.-C. and Lin,C.-J. (2001) LIBSVM : a library for support vector machines.
Colombo,G. and Carrea,G. (2002) Modeling enzyme reactivity in organic solvents and
water through computer simulations. J. Biotechnol., 96, 23–33.
Cortes,C. and Vapnik,V. (1995) Support-vector networks. Mach. Learn., 20, 273–297.
Darvas,F. (1998) Predicting metabolic pathways by logic programming. J. Mol.
Graphics, 6, 80–86.
3088
Dunn,W.B. et al. (2005) Measuring the metabolome: current analytical technologies.
Analyst, 130, 606–625.
Fernie,A.R. et al. (2004) Metabolite profiling: from diagnostics to systems biology.
Nat. Rev. Mol. Cell. Biol., 5, 763–769.
Fiehn,O. (2002) Metabolomics: the link between genotypes and phenotypes. Plant
Mol. Biol., 48, 155–171.
Francke,C. et al. (2005) Reconstructing the metabolic network of a bacterium from its
genome. Trends Microbiol., 13, 550–558.
Gasteiger,J. and Hutchings,M.G. (1984) Quantitative models of gas-phase protontransfer reactions involving alcohols, ethers, and their thio analoguess. Correlation
analyses based on residual electronegativity and effective polarizability.
J. Am. Chem. Soc., 106, 6489–6495.
Gasteiger,J. and Marsili,M. (1980) Iterative partial equalization of orbital
electronegativity—a rapid access to atomic charges. Tetrahedron, 36, 3219–3228.
Goodacre,R. et al. (2004) Metabolomics by numbers: acquiring and understanding
global metabolite data. Trends Biotechnol., 22, 245–252.
Goto,S. et al. (2002) LIGAND: database of chemical compounds and reactions in
biological pathways. Nucleic Acids Res., 30, 402–404.
Greene,N. et al. (1999) Knowledge-based expert systems for toxicity and metabolism
prediction: DEREK, StAR and METEOR. SAR QSAR Environ. Res., 10, 299–313.
Hatzimanikatis,V. et al. (2005) Exploring the diversity of complex metabolic networks.
Bioinformatics, 21, 1603–1609.
Henkelman,G. et al. (2005) Conformational dependence of a protein kinase phosphate
transfer reaction. Proc. Natl Acad. Sci. USA, 102, 15347–15351.
Hou,B.K. et al. (2003) Microbial pathway prediction: a functional group approach.
J. Chem. Inf. Comput. Sci., 43, 1051–1057.
Hutchings,M.G. and Gasteiger,J. (1983) Residual electronegativity—an empirical
quantification of polar influences and its application to the proton affinity of
amines. Tetrahedron Lett., 24, 2541–2544.
James,C.A. et al. (2005) Daylight Theory Manual, version 4.9, Ch. 4. Daylight
Chemical Information Systems, Inc., Aliso Viejo, CA.
Jaworska,J. et al. (2002) Probabilistic assessment of biodegradability based on
meta-bolic pathways: CATBOL System. SAR QSAR Environ. Res., 13, 307–323.
Karelson,M. et al. (1996) Quantum-chemical descriptors in QSAR/QSPR studies.
Chem. Rev., 96, 1027–1043.
Kell,D.B. (2004) Metabolomics and systems biology: making sense of the soup.
Curr. Opin. Microbiol., 7, 296–307.
Klopman,G. et al. (1994) META. 1. A program for the evaluation of metabolic transformation of chemicals. J. Chem. Inf. Comput. Sci., 34, 1320–1325.
Langowski,J. and Long,A. (2002) Computer systems for the prediction of xenobiotic
metabolism. Adv. Drug Deliv. Rev. 54, 407–415.
Mayeno,A.N. et al. (2005) Biochemical reaction network modeling: predicting metabolism of organic chemical mixtures. Environ. Sci. Technol., 39, 5363–5371.
McShan,D.C. et al. (2004) Symbolic inference of xenobiotic metabolism. Pac. Symp.
Biocomput., 9, 545–546.
Payne,M.P. (2004) Computer-based methods for the prediction of chemical metabolism
and biotransformation within biological organisms. In Cronin,M.T.D. and
Livingstone,D.J. (eds), Predicting Chemical Toxicity and Fate. CRC Press,
Boca Raton, pp. 205–227.
Schloss,P.D. and Handelsman,J. (2005) Metagenomics for studying unculturable
microorganisms: cutting the Gordian knot. Genome Biol., 6, 229.
Silverman,R.B. (2000) The Organic Chemistry of Enzyme-Catalyzed Reactions.
Academic Press, San Diego, CA.
Steinbeck,C. et al. (2003) The Chemistry Development Kit (CDK): an open-source Java
library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci., 43, 493–500.
Talafous,J. et al. (1994) META. 2. A dictionary model of mammalian xenobiotic
metabolism. J. Chem. Inf. Comput. Sci., 34, 1326–1333.
Vapnik,V.N. (1998) Statistical Learning Theory. Wiley, NY.
Wagner,C. et al. (2003) Construction and application of a mass spectral and retention
time index database generated from plant GC/EI-TOF-MS metabolite profiles.
Phytochemistry, 62, 887–900.
Wall, M.E. et al. (2004) Design of gene circuits: lessons from bacteria. Nat. Rev.
Genet., 5, 34–42.
Walters,W.P. and Yalkowsky,S.H. (1996) ESCHER—a computer program for the
determination of external rotational symmetry numbers from molecular topology.
J. Chem. Inf. Comput. Sci., 36, 1015–1017.
Wegner,J.K. (2004) JOELib Tutorial: A Java based cheminformatics/computational
chemistry package.
Wegner,J.K. and Zell,A. (2003) Prediction of aqueous solubility and partition
coefficient optimized by a genetic algorithm based descriptor selection method.
J. Chem. Inf. Comput. Sci, 43, 1077–1084.