BIOINFORMATICS
ORIGINAL PAPER
Vol. 25 no. 23 2009, pages 3135–3142
doi:10.1093/bioinformatics/btp549
Systems biology
Automatic assignment of reaction operators to
enzymatic reactions
Markus Leber1 , Volker Egelhofer1,2 , Ida Schomburg1,2 and Dietmar Schomburg1,2,∗
1 Institute for Biochemistry, University of Cologne, Zülpicher Straße 47, 50674 and 2 Department for Bioinformatics
and Biochemistry, Technische Universität Braunschweig, Langer Kamp 19B, 38106 Braunschweig, Germany
Received on December 19, 2009; revised on July 21, 2009; accepted on September 14, 2009
Advance Access publication September 25, 2009
Associate Editor: John Quackenbush
ABSTRACT
Background: Enzymes are classified in a numerical classification
scheme introduced by the Nomenclature Committee of the
IUBMB based on the overall reaction chemistry. Due to the
manifold of enzymatic reactions the system has become highly
complex. Assignment of enzymes to the enzyme classes requires
a detailed knowledge of the system and manual analysis. Frequently
rearrangements and deletions of enzymes and sub-subclasses are
necessary.
Results: We use the Dugundji–Ugi model for coding of biochemical
reactions which is based on electron shift patterns occurring during
reactions. Changes of the bonds or of non-bonded valence electrons
are expressed by reaction matrices. Our program calculates reaction
matrices automatically on the sole basis of substrate and product
chemical structures based on a new strategy for maximal common
substructure determination, which allows an accurate atom mapping
of the substrate and product atoms. The system has been tested for
a large set of enzymatic reactions including all sub-subclasses of
the EC classification system. Altogether 147 different representative
reaction operators were found in the classified enzymes, 121 of
which are unique with respect to an EC sub-subclass. The other
26 comprise groups of enzymes with very similar reactions, being
identical with respect to the bonds formed and broken.
Conclusion: The analysis and comparison of enzymatic reactions
according to their electron shift patterns defines enzyme groups
characterised by unique reaction cores. Our results demonstrate
the applicability of the Dugundji–Ugi model as a reasonable
pre-classification system allowing an objective and rational view of
biochemical reactions.
Availability: The program to generate reaction matrix descriptors is
available upon request.
Contact: [email protected]
1 BACKGROUND
Classification of enzymes is a complex procedure performed by
the joint nomenclature committee of the International Union of
Pure and Applied Chemistry (IUPAC) and the International Union of
Biochemistry and Molecular Biology (IUBMB) (Tipton and Boyce,
2000). In general, the nomenclature and classification of enzymes are
based on the overall reaction chemistry of biochemical processes.
∗ To whom correspondence should be addressed.
In 1961 the first enzyme commission devised a system of code
numbers to describe the chemical features of enzymatic reactions.
These Enzyme Commission numbers (EC numbers) start with the
prefix EC followed by four numbers, separated by points. The first
number indicates which of the six main classes the enzyme belongs
to. The main classes are divided into subclasses, which are in turn
subdivided into sub-subclasses depending on the reaction type and
on which bonds are formed or cleaved during the reaction. Accordingly,
the second number represents the subclass and the third number the
sub-subclass of the respective enzyme. The fourth and last number
is a serial number of the enzyme within the sub-subclass.
EC numbers therefore indicate which reaction is catalysed by
the enzymes. Different proteins can receive the same EC number if
they catalyse identical chemical reactions. Over the years it turned
out that the chemical properties of enzymatic reactions allow a
reasonable classification of enzymes. With the growing number
of known enzymes the correct assignment has become a difficult
task for researchers. In addition enzymes sometimes are grouped
together for historical reasons or based on physiological aspects
(Webb, 1990).
This was the motivation for us to test a system for coding of
enzymatic reactions that can be automated and can be used as
a pre-classification scheme. We found the Dugundji–Ugi model
(Dugundji and Ugi, 1973) to be an efficient method to characterise
biochemical reactions automatically. Similar to the classification
system introduced by the first Enzyme Commission, the Dugundji–
Ugi model considers the chemical properties of the overall reaction.
Reactions are described by mathematical operators, which are
named transition or reaction matrices and represent electron
shift patterns occurring during reactions. In this model substrate
molecules and product molecules are depicted by binding electron
matrices (BE matrices), in which bonds are marked by values.
Reactions can be described by the fundamental equation
B + R = E, (1)
where B is the BE-matrix of the substrate molecules and
E is the BE-matrix of the product molecules. As Equation (1) shows,
the reaction matrix (R) transforms the substrate matrix (B) into the
product matrix (E). Values in the reaction
matrix correspond to changes of the bonds or of non-bonded valence
electrons. A positive value in the reaction matrix indicates bond
formation, whereas a negative value indicates the cleavage of a bond.
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
Fig. 1. As demonstrated by the reaction of the cyanide hydratase, reaction
matrix R converts substrate matrix B to product matrix E. The matrices
B and E are reduced to the atoms involved in the reaction.
Values on the main diagonal correspond to changes of the number
of free valence electrons of the respective atom. The matrices B, E
and R are symmetric and obey algebraic rules. As an advantage, the
Dugundji–Ugi model allows the atoms involved in the reaction to be
detected by simple arithmetic operations. Thus the Dugundji–Ugi
model allows a very general and objective view of reactions.
In the present work we describe the calculation, comparison and
grouping of reaction matrices for biochemical reactions. An example
for reaction matrix calculation is given by Figure 1.
Although chemical reactions were classified by the Dugundji–
Ugi model in the past (Brandt et al., 1981), the model has not
been used extensively in recent years. The method requires
the assignment of the equivalent atoms between substrates and
products. In principle, a reaction matrix is easily calculated
by Equation (2):
R = E − B. (2)
However, within the matrices B, E and R, each atom is represented by a
certain row and column number. Thus, if the correspondence of the
substrate and product atoms is unknown, it is not possible to arrange
the atoms within the matrices. For this reason reaction matrices
have been generated manually in the past. In order to avoid this
time-consuming procedure, we developed a method that allows the
automatic calculation of the reaction matrices based on the chemical
structure of the substrate and product molecules. The atom-mapping
problem is solved by an algorithm for maximal common subgraph
determination. Since the maximal common subgraph problem is
NP-complete, it was necessary to develop an algorithm that is
fast, flexible and able to reproduce complicated changes of the
structure as well.
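Equation (2) can be illustrated with a short sketch. The Python below is not the authors' program; it assumes that the atom mapping is already known, so that row i refers to the same atom in B and in E. The example describes the hydration of hydrogen cyanide to formamide, reduced to carbon, nitrogen, the water oxygen and the two water hydrogens. The atom order, the function names and the matrix layout are illustrative assumptions and need not coincide with Figure 1.

```python
# Illustrative sketch only: atom order and names are assumptions.
# Off-diagonal entries are formal bond orders, diagonal entries are
# the numbers of free valence electrons of each atom.
ATOMS = ["C", "N", "O", "H1", "H2"]

# Substrates: H-C#N and H2O (the hydrogen on carbon is omitted, it does not change)
B = [
    [0, 3, 0, 0, 0],  # C: triple bond to N
    [3, 2, 0, 0, 0],  # N: one lone pair
    [0, 0, 4, 1, 1],  # O: two lone pairs, bonds to H1 and H2
    [0, 0, 1, 0, 0],  # H1
    [0, 0, 1, 0, 0],  # H2
]

# Product: formamide, H-C(=O)-NH2
E = [
    [0, 1, 2, 0, 0],  # C: single bond to N, double bond to O
    [1, 2, 0, 1, 1],  # N: one lone pair, bonds to H1 and H2
    [2, 0, 4, 0, 0],  # O: two lone pairs
    [0, 1, 0, 0, 0],  # H1
    [0, 1, 0, 0, 0],  # H2
]


def reaction_matrix(b, e):
    """R = E - B, computed entry by entry."""
    return [[e_ij - b_ij for b_ij, e_ij in zip(b_row, e_row)]
            for b_row, e_row in zip(b, e)]


def involved_atoms(r, atoms):
    """Atoms whose row in R contains at least one non-zero entry."""
    return [atom for atom, row in zip(atoms, r) if any(row)]


R = reaction_matrix(B, E)
for atom, row in zip(ATOMS, R):
    print(f"{atom:>2}", row)
print("reaction centre:", involved_atoms(R, ATOMS))
```

In this example the entries of R sum to zero, because the reaction only redistributes valence electrons, and every listed atom takes part in the reaction.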
2 RELATED WORK
The Dugundji–Ugi model is applicable to any kind of reaction
and was the subject of intensive studies in the past. Techniques of
reaction generation, reaction classification, reaction documentation
and elucidation of reaction mechanisms have been developed
(Ugi et al., 1994). An attempt to classify reactions on the basis of
electron shift patterns was proposed by Brandt et al. (Brandt and
von Scholley, 1983; Brandt et al., 1981). In these studies a set of
canonisation rules is defined to generate a unique reaction matrix.
In other studies the Dugundji–Ugi model was used as a reaction
generator. The programs IGOR (Bauer et al., 1989) and RAIN
(Fontain and Reitsam, 1991) are two examples. Both programs
work only in combination with transition tables and sets of formal
constraints in order to avoid combinatorial explosions (Ugi et al.,
1994). In the most recent application the Dugundji–Ugi model
was used by Hatzimanikatis et al. (2005) for de novo synthesis
of metabolic pathways. Almost 250 reaction matrices were generated
manually and utilised by a pathway generation algorithm. This study
demonstrates the efficiency of the Dugundji–Ugi system in modelling
each type of biochemical reaction. In comparison to these previous
studies, our program is the first approach that is able to calculate
R matrices automatically.
Due to the complexity of the atom-mapping problem, only a
few approaches have been published to handle the assignment
problem. Initially, databases were constructed manually, like
the BioPath database, which contains biochemical pathways and
reaction information (Reitz et al., 2004). The Kyoto Encyclopedia
of Genes and Genomes (KEGG) developed a tool for computational
assignment of EC numbers published by Kotera et al. (2004). In
this approach each reaction formula has to be decomposed by the
user into sets of corresponding substrate and product molecules,
which are called reactant pairs. In the second step every reactant
pair is analysed by the structure comparison method SIMCOMP
developed by Hattori et al. (2003). As reported by KEGG (Oh et al.,
2007), many alignments generated by SIMCOMP were corrected
manually. Another approach proposed by Körner and Apostolakis
(2008) and Apostolakis et al. (2008) considers reaction energetics
to predict reaction sites. However, none of these methods relies on
fully automatic reaction matrix calculation to characterise enzymatic
reactions by electron shift patterns. Based on the Dugundji–Ugi
model and reaction matrix canonisation, our system leads to a
new grouping of enzymes, which are represented only by their
reaction core.
3 METHODS
The full algorithm requires the following steps:
(1) Calculation of maximal common substructure between substrates and
products;
(2) Atom–atom mapping;
(3) Calculation of BE-matrices of substrates and products;
(4) Calculation of reaction matrices by subtraction of BE-matrices.
The comparison of reaction matrices requires one additional step, the
conversion of the reaction matrices to a canonical form.
3.1 MCS algorithm
In order to generate an atom–atom mapping and to calculate reaction matrices
of biochemical reactions, we developed a new approach for Maximal
Common Subgraph (MCS) determination.
First, the molecules are compared pairwise by the MCS algorithm. As
input we used molfiles in the MDL format, which contain information on the
atoms and the bonds of a molecule.
MCS determination is the method of choice for an efficient similarity
analysis of graphs. This method is applicable to the comparison of molecules,
Fig. 2. Example reaction, where the variant of the Bron-Kerbosch algorithm
fails due to absence of corresponding edges in the compatibility graph.
because molecules can be considered as graphs, with the atoms
representing the nodes and the bonds the edges of a graph. The problem of
finding all maximal common substructures of two molecules is identical to
the search for all maximal common subgraphs in two graphs. This method
can also be used as an atom mapping approach for reactions. According to the
underlying principle, an atom can be mapped if it is in the same substructure
before and after the reaction.
Since the MCS-problem is NP-complete, several algorithms have been
developed to solve the problem in acceptable time for a sufficient number
of nodes (Hattori et al., 2003; Marialke et al., 2007; Raymond and Willett,
2002). Especially if the method is used to consider reactions, there is a conflict
between improvement of the runtime and flexibility of the algorithm. On
the one hand an MCS algorithm has to be flexible, because it must be able
to reproduce complex changes of molecule structures. Therefore it is not
possible to integrate restrictions like double bonds, atomic environments or
three-dimensional information of the structure. These characteristics may
change during a reaction. On the other hand the MCS algorithm must be fast
enough to compare large metabolites as well. Several MCS algorithms with
various properties have been developed over time. It turned out that the best
solution for the present project is a combination of two MCS algorithms in
order to combine the favourable characteristics of these different approaches.
The first algorithm is a variant of the Bron–Kerbosch (Bron and Kerbosch,
1973) algorithm, which solves the MCS search by transforming the MCS
problem into a clique problem. In general this procedure is based on the
assumption that certain vertices and edges of a graph G1 are compatible
to certain vertices and edges of a second graph G2. All of these possible
compatibilities are stored in a new graph, called compatibility graph.
Corresponding features of G1 and G2 are given by cliques within the
compatibility graph. A clique is a complete subgraph. This means that within
the clique each node is connected with all the other nodes by an edge. A
clique is maximal if it cannot be extended by a further
graph node. As proved by Levi (1972), a maximal common induced subgraph
of the graphs G1 and G2 corresponds to a maximal complete subgraph
in the compatibility graph. Accordingly the search for all maximal cliques
delivers all maximal common induced subgraphs of G1 and G2.
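The reduction of the MCS search to a clique search can be sketched as follows. This is a minimal illustration under stated assumptions, not the program described in this paper: it builds a vertex compatibility graph for two small labelled graphs and enumerates maximal cliques with the basic Bron–Kerbosch recursion (no pivoting and none of the connectivity restrictions introduced by Koch). The toy molecules, the atom identifiers and all function names are hypothetical.

```python
from itertools import combinations


def compatibility_graph(labels1, bonds1, labels2, bonds2):
    """Vertex compatibility graph of two labelled graphs.

    A node (i, j) pairs atom i of G1 with an atom j of G2 of the same element.
    Two nodes are adjacent if the two atom pairs are either bonded in both
    graphs or non-bonded in both graphs.
    """
    nodes = [(i, j) for i in labels1 for j in labels2
             if labels1[i] == labels2[j]]
    adj = {node: set() for node in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # an atom may be mapped only once
        if (frozenset((i, k)) in bonds1) == (frozenset((j, l)) in bonds2):
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj


def bron_kerbosch(adj, r=frozenset(), p=None, x=None):
    """Yield all maximal cliques (basic Bron-Kerbosch, without pivoting)."""
    if p is None:
        p, x = set(adj), set()
    if not p and not x:
        yield r
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p.remove(v)
        x.add(v)


# Toy input: backbone of ethanol (C-C-O) against backbone of methanol (C-O).
g1_labels = {0: "C", 1: "C", 2: "O"}
g1_bonds = {frozenset((0, 1)), frozenset((1, 2))}
g2_labels = {"a": "C", "b": "O"}
g2_bonds = {frozenset(("a", "b"))}

adj = compatibility_graph(g1_labels, g1_bonds, g2_labels, g2_bonds)
cliques = set(bron_kerbosch(adj))
largest = max(cliques, key=len)
print(sorted(largest, key=str))
```

Each maximal clique is a candidate atom mapping; here the largest one maps the C–O fragment of the first graph onto the second graph.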
Cliques can be detected by backtracking methods such as branch and bound
algorithms with exponential runtimes. The widely used Bron–
Kerbosch algorithm mentioned above is an example. We use a modified version of the Bron–
Kerbosch algorithm proposed by Koch (2001), which considers connected
maximal common subgraphs and not maximal common induced subgraphs.
This variant solves a new problem by searching only for those cliques which
represent connected subgraphs. The result is a drastic reduction of the
runtime. We modified this algorithm and use it for the comparison of molecules.
On the basis of a vertex compatibility graph our approach searches for all
connected maximal common substructures of two molecules. However, we
noticed that this algorithm has difficulty reconstructing complicated conversions of
the structure. For instance, Figure 2 shows the reaction of the Catechol:oxygen
1,2-oxidoreductase (decyclising), in which an aromatic ring is cleaved.
Fig. 3. Reaction of the pyruvate decarboxylase (A). Comparison of pyruvate
to acetaldehyde (B) leads to four MCS, but only MCS number 3
corresponds to the correct position.
In this case catechol must be compared to muconate. However, the comparison
yields an incomplete mapping of only six backbone atoms, whereas all eight
backbone atoms of catechol are transferred to muconate. The reason is the
absence of corresponding edges in the vertex compatibility graph.
Due to this drawback, we combined the Bron–Kerbosch variant with
a flexible approach proposed by McGregor et al. (1982). This algorithm
has been developed for analysing chemical reactions and enumerating
bond changes. The McGregor algorithm uses a backtracking search as its
basic component. We therefore integrated techniques to reduce the number of
possibilities to be considered.
In our approach, the McGregor algorithm is added to the Bron–Kerbosch
variant developed by Koch. First the molecules are compared by the fast
Koch algorithm. If unmapped atoms remain in both molecules,
the McGregor algorithm is used to extend the common substructures. To
ensure that the maximal common substructures are still connected, the
core structures given by the Bron–Kerbosch variant are only extended by
neighbouring bonds. The extension is performed within a backtracking
procedure, which systematically tests all possibilities to extend the core
structures until no further extension is possible.
The combination of these two algorithms also allowed an improvement
of the run-time. In contrast to small molecules, where a higher percentage of
atoms are involved in the reaction, substantial parts of large molecules remain
unchanged. Therefore if large molecules are compared, atoms together with
their environment are considered to reduce the size of the vertex compatibility
graph.
Altogether the result is a fast and flexible MCS approach, which is able
to reproduce complex changes of the structure and allows the comparison of
large metabolites as well.
3.2 Atom–atom mapping
The MCS algorithm is used as a core component to generate an atom mapping
of all substrate atoms to all product atoms. For this purpose the algorithm
compares each substrate molecule to each product molecule.
An example is given by the reaction of the pyruvate decarboxylase
(Fig. 3), where pyruvate is compared to acetaldehyde and carbon dioxide.
The comparison of pyruvate to acetaldehyde leads to four maximal
common substructures, but only one of them is a valid match in the context of
the reaction. The other three maximal common substructures contain atoms
that belong to the cleaved carbon dioxide. Thus, to generate a correct atom
mapping it is necessary to select those combinations of maximal common
substructures which do not contain overlaps and where all substrate atoms
can be mapped on all product atoms. For this purpose the maximal common
Fig. 4. Molecule comparison and ranking system. First, each substrate molecule is compared to each product molecule. In the second step the MCS are
ordered by size. Third, within an iterative procedure, overlapping regions of smaller MCS are rejected.
substructures of the different sets are joined in all possible combinations
and afterwards those combinations are selected, which exhibit a complete
mapping of the atoms.
Figure 4 demonstrates a more complex example, where two substrate
molecules react to form two product molecules. In this case the maximal common
substructures have overlaps within all combinations. This problem has been
solved by a ranking system, in which the maximal common substructures
of a combination are ordered according to their size. Large maximal
common substructures have a higher rank than smaller maximal common
substructures. Within an iterative procedure the respective largest maximal
common substructure is added to the solution set, whereas overlapping atoms
of smaller maximal common substructures are deleted. This means that the size
of the remaining maximal common substructures may decrease during this
process. Therefore it is important to rearrange the order of the remaining
maximal common substructures after each step.
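The ranking procedure described above can be sketched as a greedy loop. The sketch assumes that each maximal common substructure is given as a set of (substrate atom, product atom) pairs; the function name and the atom labels are hypothetical. It ignores a point the real method has to handle, namely that a trimmed substructure may fall apart into disconnected pieces.

```python
def rank_and_select(candidates):
    """Greedy selection of non-overlapping common substructures.

    candidates: iterable of sets of (substrate_atom, product_atom) pairs.
    The largest remaining substructure is accepted, atoms it uses are
    deleted from all smaller ones, and the rest is re-ranked by size.
    """
    remaining = [set(c) for c in candidates if c]
    solution = []
    while remaining:
        remaining.sort(key=len, reverse=True)  # re-rank after every step
        best = remaining.pop(0)
        solution.append(best)
        used_substrate = {s for s, _ in best}
        used_product = {p for _, p in best}
        trimmed = [{(s, p) for s, p in c
                    if s not in used_substrate and p not in used_product}
                   for c in remaining]
        remaining = [c for c in trimmed if c]  # drop emptied substructures
    return solution


# Hypothetical candidates: the second one overlaps the first in atom s3.
candidates = [
    {("s1", "p1"), ("s2", "p2"), ("s3", "p3")},
    {("s3", "p4"), ("s4", "p5")},
    {("s5", "p6")},
]
solution = rank_and_select(candidates)
print([sorted(m) for m in solution])
```

After the largest substructure is accepted, the second candidate loses its overlapping pair and shrinks to a single atom, which is why the order has to be recomputed in each round.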
Our method leads to an assignment of the structures comparable to a
manual arrangement. In some cases it is possible to generate more than one
complete mapping for a reaction. The reason can be symmetries of partial
structures or whole molecules. In other cases substrate atoms of the same
type can match several targets on the product side. Initially, our underlying
algorithm calculates all valid possibilities to map the atoms. Redundant atom
mappings can be easily removed during reaction matrix calculation, because
equivalent atom mappings lead to identical reaction matrices. However, in
some cases, the method to combine the maximal common substructures may
result in a large number of combination sets. Thus, in order to avoid runtime
problems it was necessary to integrate heuristics to reduce the number
of combination sets to be considered.
Especially if cofactors are involved the number of combination sets
increases rapidly. On the other hand the chemical states of the cofactors
are known and it is not essential to compare their structures with the other
reactants. Therefore a mechanism is integrated, which maps the atoms of the
cofactors separately. Afterwards the mappings of the cofactors are added to
the mappings of the effective reactants.
Comparisons of very small molecules with large metabolites are
problematic, too. They result in a high number of maximal common
substructures and in consequence a rapid increase of combination sets.
Furthermore, these comparisons deliver targets which are not significant in
most cases. For this reason small molecules like water, ammonia, methane
and further molecules of this size are not considered in the first comparison
step. The small molecules are mapped on remaining unmapped structures
after the other molecules are compared and combined to preliminary solution
sets. The mapping of the small molecules is achieved by a repetitive variant
of the Bron–Kerbosch algorithm.
Although the ranking system in combination with the heuristics appears
complicated at first sight, the outcome is a fast method, which allows an
accurate assignment of the atoms.
3.3 Hydrogen atom mapping
Depending on element type, charge, number of valence electrons and radical
label, hydrogen atoms are added to the backbone atoms. The number of
hydrogen atoms directly bonded to each backbone atom in the substrate
is compared to the number of hydrogen atoms on the mapped atom in the product. This procedure reveals
whether a hydrogen atom is removed from or added to the corresponding backbone
atom. Removed hydrogen atoms are mapped to added hydrogen atoms on
the product side. If the number of hydrogen atoms removed on the substrate side does
not match the number of added hydrogen atoms on the product side, the
surplus hydrogen atoms are considered as protons.
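The bookkeeping of this step can be sketched in a few lines. The sketch assumes that the backbone atoms are already mapped and that the hydrogen count of each backbone atom is known on both sides; the function name, the atom labels and the pairing order (removed and added hydrogens are paired in list order) are assumptions made for illustration. The counts loosely follow an alcohol oxidation with a hydride acceptor.

```python
def map_hydrogens(h_substrate, h_product):
    """Pair removed with added hydrogens; leftovers are treated as protons.

    h_substrate, h_product: dicts mapping a (mapped) backbone atom label
    to the number of hydrogen atoms attached to it.
    Returns (pairs, protons_released, protons_taken_up).
    """
    lost, gained = [], []
    for atom, n_before in h_substrate.items():
        diff = h_product[atom] - n_before
        if diff < 0:
            lost.extend([atom] * -diff)   # hydrogens removed from this atom
        else:
            gained.extend([atom] * diff)  # hydrogens added to this atom
    n_pairs = min(len(lost), len(gained))
    pairs = list(zip(lost, gained))       # zip stops at the shorter list
    return pairs, lost[n_pairs:], gained[n_pairs:]


# Hypothetical counts: C1 and O each lose one hydrogen, C4 gains one.
h_substrate = {"C1": 2, "O": 1, "C4": 1}
h_product = {"C1": 1, "O": 0, "C4": 2}
pairs, released, taken_up = map_hydrogens(h_substrate, h_product)
print(pairs, released, taken_up)
```

One removed hydrogen is mapped to the added hydrogen, and the remaining one is reported as a released proton.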
3.4 Stereoisomerism
The reactions of the isomerases are summarised in the fifth main class of the
EC classification system. A fraction of enzymes within this class catalyses
stereoisomeric changes of single atoms without a net cleavage or forming of
bonds. Hence the characterisation of these reactions required an extension
of the method.
In our system the absolute configuration of an atom is indicated by values
on the main diagonal. An entry of -10 represents an S-configuration and
an entry of 10 an R-configuration. Hence an entry of 20 in the reaction
matrix means that a stereoisomeric centre is changed from an S- to an
R-configuration.
3.5 Reaction matrix calculation and canonisation
The comparison or identification of identical reaction matrices requires in
the worst case n! row/column permutations, where n is the dimension of the
matrix. Therefore canonisation rules have been defined to find a unique
reaction matrix. These rules reflect the observation from chemical reactions
that electrons are shifted along a single path or a cycle. If the succession
of atoms within this path or cycle is transferred to a reaction matrix, a
unique canonical reaction matrix is obtained. The characteristic feature of
canonical matrices is a continuous succession of entries along the first side
diagonal, where the entries alternate in signs. The determination of canonical
Fig. 5. Canonical matrix transformation can be achieved by a graph based procedure, which enumerates all continuous electron shift patterns of maximal length.
reaction matrices can be a time-consuming procedure if all n! permutations
are generated.
Therefore methods have been developed to calculate canonical reaction
matrices within acceptable computing time. But it turned out that these
methods rely on heuristic assumptions, which cannot be adapted to
biochemical reactions for several reasons. First, the valence electrons are
not always shifted along a single path or a cycle. For instance, protons
are often generated in biochemical reactions. Protons can lead to interruptions
of an electron shift pattern, because the corresponding hydrogen atom is
involved in the cleavage, but not in the formation, of a bond. Metal ions or atoms
with changes of the stereochemistry are other examples where the electron
shift pattern is interrupted. In these cases a graph-based representation of the
electron shift pattern leads to a fragmentation of the corresponding graph
(disconnected graph). Another problem is the occurrence of branches within
a graph, which results in several paths of different length.
Faced with these problems we implemented a graph-based method for
canonisation. This means the reaction matrix is converted to a graph, where
entries of the reaction matrix represent edges and the atoms form nodes.
A longest path of this graph represents a continuous electron shift pattern
of maximal length. In order to find these maximal electron shift patterns
we implemented a backtracking algorithm, which searches for the largest
sub-graphs that are Hamiltonian paths, i.e. each vertex is visited exactly once.
Figure 5 shows one reaction matrix of the 2-hydroxyglutarate
dehydrogenase calculated by the program. Matrix conversion to a graph
reveals the cyclic electron shift pattern of the reaction. Every node in the
graph can serve as a starting point for a path and the paths can be
extended in two directions. Hence 16 paths can be generated. To reduce
the number of solutions, the longest paths are investigated with respect to
canonisation rules. In particular, solutions are selected which exhibit an
alternation of the sign and start with a negatively labelled edge.
In practice this method still leads to a number of valid solutions. Therefore
we integrated the atomic weight as a further factor. The succession of atoms
of a longest path is examined for the atomic weight and solutions are selected
which exhibit a longest sequence of heaviest atoms in a front position. In
many cases this method leads to a drastic reduction of the solution set.
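The graph-based canonisation described above can be sketched for a connected electron shift pattern. The code is an illustration under assumptions, not the authors' implementation: the reaction matrix is a hypothetical four-atom core in which a P–O bond and an O–H bond are broken and a P–O bond and an O–H bond are formed, atomic weights are rounded, and "heaviest atoms in a front position" is interpreted as a lexicographic comparison of the weight sequences. Disconnected graphs are not handled.

```python
def matrix_to_graph(R):
    """Atoms become nodes; each non-zero off-diagonal entry becomes an edge."""
    n = len(R)
    return {i: {j: R[i][j] for j in range(n) if j != i and R[i][j] != 0}
            for i in range(n)}


def longest_paths(graph):
    """Backtracking enumeration of all simple paths of maximal length."""
    best = []

    def extend(path):
        nonlocal best
        grown = False
        for nxt in graph[path[-1]]:
            if nxt not in path:
                grown = True
                extend(path + [nxt])
        if not grown:  # dead end: this path cannot be extended further
            if not best or len(path) > len(best[0]):
                best = [path]
            elif len(path) == len(best[0]):
                best.append(path)

    for start in graph:
        extend([start])
    return best


def alternates_from_negative(path, graph):
    """True if edge signs alternate and the first edge is a bond cleavage."""
    negative = [graph[a][b] < 0 for a, b in zip(path, path[1:])]
    return (bool(negative) and negative[0]
            and all(s != t for s, t in zip(negative, negative[1:])))


def canonical_order(R, weights):
    """Pick the admissible longest path with the heaviest atoms in front."""
    graph = matrix_to_graph(R)
    candidates = [p for p in longest_paths(graph)
                  if alternates_from_negative(p, graph)]
    return max(candidates, key=lambda p: [weights[i] for i in p])


# Hypothetical core: atoms 0=P, 1=O (leaving group), 2=O (water), 3=H
elements = ["P", "O", "O", "H"]
weights = [31, 16, 16, 1]  # rounded atomic weights
R = [
    [0, -1, 1, 0],
    [-1, 0, 0, 1],
    [1, 0, 0, -1],
    [0, 1, -1, 0],
]
order = canonical_order(R, weights)
print([elements[i] for i in order])
```

For this cyclic pattern the search finds eight longest paths, four of which alternate in sign and start with a cleaved bond; the weight criterion then selects the one beginning at phosphorus.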
The whole procedure becomes much more complex if the underlying
graph is disconnected and consists of various fragments. In this case the
longest paths within the fragments are determined separately, before they are
concatenated to the whole electron shift pattern of the reaction. The maximal
electron shift patterns of the different fragments are concatenated according
to their size. Larger electron shift patterns have a higher rank than smaller
patterns. If electron shift patterns have the same size several permutations
are evaluated.
3.6 Comparison of canonised reaction matrices
The canonisation procedure allows the construction of unique reaction
matrices. The canonical reaction matrix is converted to a linear notation
of two strings. The first string is formed by reaction matrix entries,
which are concatenated according to a scheme defined by Brandt
et al. (Brandt and von Scholley, 1983; Brandt et al., 1981). The second
string consists of the element symbols of the atoms, which correspond to the
reaction matrix entry string. Both strings, called R-strings, allow an efficient
comparison and identification of identical reaction matrices.
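The idea of comparing reactions through linear strings can be sketched as follows. The concatenation scheme of Brandt et al. is not reproduced here; as a stand-in, the sketch simply reads the upper triangle of the canonical matrix row by row, which is an assumption made only to show how string comparison replaces matrix comparison. The function name and the example matrix are hypothetical.

```python
def r_strings(canonical_matrix, elements):
    """Return (entry string, element string) for a canonical reaction matrix.

    Stand-in encoding: upper triangle including the diagonal, row by row.
    """
    n = len(canonical_matrix)
    entries = ",".join(str(canonical_matrix[i][j])
                       for i in range(n) for j in range(i, n))
    return entries, "-".join(elements)


# Hypothetical canonical matrix: entries alternate in sign along the
# first side diagonal, and one corner entry closes the cycle.
canonical = [
    [0, -1, 0, 1],
    [-1, 0, 1, 0],
    [0, 1, 0, -1],
    [1, 0, -1, 0],
]

first = r_strings(canonical, ["P", "O", "H", "O"])
second = r_strings(canonical, ["P", "O", "H", "O"])
third = r_strings(canonical, ["C", "O", "H", "O"])

print(first)
print(first == second)  # same electron shift pattern and same elements
print(first == third)   # same pattern, different atoms
```

Two reactions then share a reaction core exactly when both of their strings are equal, which makes grouping a matter of string or dictionary-key comparison.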
4 RESULTS
The reaction matrix calculator was tested on a set of 3600 enzymatic
reactions with representatives of all sub-subclasses. Reactions which
are described as free text in the EC-classification scheme and
reactions with generic molecules are not included. The set is derived
from the BRENDA database (Schomburg et al., 2004). The reaction
matrix dimensions vary between 1 and 90, ranging from the
reaction matrices of racemases with a dimension of 1 to the reaction
of the nitrogenase (with 48 substrate molecules). The most frequent
Fig. 6. Sub-subclass consistence.
matrix dimension is 4, which is typical for many reactions of
the transferases, hydrolases and lyases. Reaction matrices of the
oxidoreductases exhibit the highest variance in dimension among
the main classes. Reactions of this main class often include cofactors
in different numbers. Cofactors have a substantial influence on the
dimension of the reaction matrix. For example, the conversion of
NAD+ to NADH increases the reaction matrix dimension by approximately six
rows and columns, whereas cytochrome leads to an increase of one
row and column of the reaction matrix.
A comparison of the reaction matrix grouping with the EC-class
system is instructive: The reaction matrices within most of the
sub-subclasses of the EC system are homogeneous. Figure 6 shows the
consistence of reaction matrices within the sub-subclasses. This
result is not surprising, because sub-subclasses contain reactions
that are basically identical with the exception of different substrate
specificity. Nevertheless 18% of the sub-subclasses have a reaction
matrix identity of <50%.
Differences of the electron shift pattern within a sub-subclass
can have various reasons. For instance, an enzymatic reaction can
be associated with parallel reactions caused by metabolites, which
merely serve as electron acceptors or donors for the real reaction.
Such reactions can be found in sub-subclasses of type –.–.99, which
form a high proportion of the sub-subclasses with a reaction matrix
identity below 50%. This is not surprising, because in these
sub-subclasses heterogeneous reactions are collected, which do not
match the characteristics of any other sub-subclass of the
respective subclass. An example is sub-subclass 1.10.99, which
contains reactions acting on diphenols or related substances as
donors. The respective acceptor molecules have different properties,
leading to diverse electron shift patterns and a low reaction matrix
identity of only 33.3%.
Nevertheless altogether an average value of 79% reaction matrix
consistence within a subclass has been determined. Therefore we
considered the reaction matrix with the highest occurrence as
representative for the respective sub-subclass. Reaction matrices,
which differ from their respective sub-subclass, were investigated
for correctness. The representative reaction matrices of the
different sub-subclasses were compared and grouped as shown in
Table 1.
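The selection of a representative reaction matrix and the subsequent grouping can be sketched with a few lines of Python. All labels below are invented placeholders (they are neither EC numbers nor R-strings from this study), and the function names are hypothetical; the sketch only shows the counting logic.

```python
from collections import Counter


def representative(r_string_list):
    """Most frequent reaction matrix label and its share within the list."""
    (label, count), = Counter(r_string_list).most_common(1)
    return label, count / len(r_string_list)


def group_by_representative(representatives):
    """Split representatives into unique ones and shared ones (clusters)."""
    groups = {}
    for sub_subclass, label in representatives.items():
        groups.setdefault(label, []).append(sub_subclass)
    unique = {lab: subs[0] for lab, subs in groups.items() if len(subs) == 1}
    shared = {lab: subs for lab, subs in groups.items() if len(subs) > 1}
    return unique, shared


# Placeholder data: reaction matrix labels observed per sub-subclass.
observed = {
    "a.1.1": ["R1", "R1", "R1", "R2"],
    "b.2.1": ["R1", "R1"],
    "c.3.1": ["R3", "R3", "R4"],
}
reps = {}
consistency = {}
for sub_subclass, labels in observed.items():
    reps[sub_subclass], consistency[sub_subclass] = representative(labels)

unique, shared = group_by_representative(reps)
print(consistency)
print(unique, shared)
```

In this toy data set one sub-subclass keeps a unique representative, while two sub-subclasses share the same representative and would fall into one group.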
Two hundred and twenty-eight sub-subclasses were examined in
total. One hundred and twenty-one sub-subclasses have a unique
reaction matrix, which is not found in any other sub-subclass
whereas the remaining 107 sub-subclasses do not have unique
reaction matrices. Comparison and grouping of these reaction
matrices result in 26 clusters. The largest groups are characterised
by a basic reaction core, where two bonds are broken on the
substrate side and two bonds are formed on the product side. This
mechanism is widely found in biochemical reactions and leads
to heterogeneous groups, which are formed by sub-subclasses of
different main classes. For example, group number 1 consists of
transferases and hydrolases. Group number 3 exhibits the highest
EC-class diversity. In all reactions of this group a P–O bond and
an O–H bond are broken and a P–O bond and an O–H bond are established.
This reaction is catalysed by phosphotransferases, hydrolases that
hydrolyse phosphoric esters, and phosphorus–oxygen lyases. Of the
26 heterogeneous groups, six even contain enzymes from different
main classes.
Stereoisomeric information is considered only in reactions of the
fifth main class, because stereoisomeric information is not always
encoded correctly in the molfiles (MDL format) and can lead to false
assignments. Group number 6 (EC subclass 5.1) contains enzymes
that catalyse the change of a stereoisomeric centre from an S- to an
R-configuration or vice versa.
When comparing the method described with published methods,
in particular the E-zyme method published by the KEGG team,
a number of advantages are obvious. The most prominent is
the fully automatic search for the atoms of the reaction centre,
in contrast to the error-prone manual assignment of
corresponding substrate/product pairs that E-zyme requires. In addition,
the reaction prediction method E-zyme developed by KEGG does not use a
ranking system, so the prediction can fail. An example is the reaction
of the glycine
amidinotransferase, which we used to describe the ranking system
of our approach (cf. Section 3.2). Computation of the reaction
pairs of the glycine amidinotransferase (R00565) by E-zyme
leads to overlapping substructures. First, the comparison of the
main RPairs (C00062_C00077, C00037_C00581) produces correct
assignments but leaves a number of atoms unassigned. When
the trans RPair (C00062_C00581) is added, the result overlaps with
the previous substructures and delivers false operators. Therefore it
is not sufficient just to compare corresponding substrate and product
molecules.
A systematic analysis of the correspondence between the EC system
and the two different methods is not possible, as the data from the
E-zyme system can only be obtained reaction by reaction. In
order to get an estimate of the correctness, we manually inspected
a number of results that had been identified as especially complicated
for published methods and were also used for the evaluation of the
method published by Körner and Apostolakis (2008). Of the 19 problematic
cases mentioned in this publication in Table 2 and in the following
paragraphs, our method gave a correct or chemically reasonable
solution in 15 cases. The method failed in those cases where larger
rearrangements/cyclisations meant that the largest possible
substructure was not preserved in the reaction, as in EC 5.4.99.7 or
4.2.1.75, or in cases such as 5.4.2.1, where chemical reasoning
indicates a transfer of the phosphate group whereas automatic MCS atom
mapping would suggest a transfer of the (smaller) CO2 group. Most
of these cases could be identified because they gave a reaction matrix
atypical for the corresponding EC sub-subclass, and they were manually
corrected.
Table 1. Groups of sub-subclasses with equal reaction matrices

Group 1
  Sub-subclasses: 2.4.1.−, 2.4.99.−, 3.1.1.−, 3.1.3.−, 3.1.4.−, 3.1.5.−,
    3.1.7.−, 3.1.8.−, 3.1.11.−, 3.1.13.−, 3.1.15.−, 3.1.16.−, 3.1.21.−,
    3.1.25.−, 3.1.26.−, 3.1.30.−, 3.2.1.−, 3.3.2.−
  Cleaved bonds or donated electrons: C–O, O–H
  Formed bonds or accepted electrons: C–O, O–H

Group 2
  Sub-subclasses: 2.4.2.−, 3.2.2.−, 3.4.11.−, 3.4.13.−, 3.4.14.−, 3.4.15.−,
    3.4.16.−, 3.4.17.−, 3.4.18.−, 3.4.19.−, 3.4.21.−, 3.4.22.−, 3.4.23.−,
    3.4.24.−, 3.4.25.−, 3.5.1.−, 3.5.2.−
  Cleaved bonds or donated electrons: C–N, O–H
  Formed bonds or accepted electrons: C–O, N–H
Group 3
  Sub-subclasses: 2.7.1.−, 2.7.2.−, 2.7.4.−, 2.7.6.−, 2.7.8.−, 2.7.10.−,
    2.7.11.−, 2.7.12.−, 2.7.99.−, 3.6.1.−, 3.6.3.−, 3.6.4.−, 3.6.5.−,
    4.6.1.−
  Cleaved bonds or donated electrons: P–O, O–H
  Formed bonds or accepted electrons: P–O, O–H
4
5
6
7
8
9
10
3.1.14.−
3.1.22.−
3.1.27.−
3.1.31.−
3.7.1.−
4.1.1.−
4.1.2.−
4.1.3.−
Formed
bonds or
accepted
electrons
P–O, O–H
P–O, N–
C,
O–H
13
1.1.2.−
1.2.2.−
O–H, C–
H
C–O,
Fe , Fe
14
1.1.3.−
1.2.3.−
15
1.2.4.−
1.4.4.−
1.4.1.−
1.5.1.−
C–O, O–
H,
O–H
S–C, O–
C, S–H
C=O, C–
C,
N–H, C–
H / N:
C–O, R–H
24
4.2.1.−
4.2.99.−
C–O, S–H
25
5.2.1.−
5.3.3.−
C–C, C– C–C, C–
H
H
S–O, O–H
26
5.3.2.−
5.5.1.−
C–O, C– C–C, O–
H
H
C–O, C–H
16
17
1.4.99.−
1.5.99.−
18
1.5.3.−
1.17.3.−
19
1.16.1.−
1.18.1.−
1.17.4.−
1.17.5.−
2.1.4.−
2.3.2.−
2.7.3.−
2.7.13.−
3.5.3.−
3.5.4.−
5.1.1.
5.1.2.
5.1.3.
5.1.99.
SRconfigura- configuration
tion
20
1.1.1.−
1.2.1.−
1.17.1.−
C–N, C–C, C–O, C–
C–H, C–H C,
C–H, N:
22
1.1.99.− C–H, O–H
1.2.99.−
1.17.99.−
S–C, O–H
2.3.1.−
3.1.2.−
3.3.1.−
S–O, O–H
2.8.2.−
3.1.6.−
3.6.2.−
Formed
bonds or
accepted
electrons
C–C, N–
H
O–O, O–
H,
C–H
S–S, C–
C, O–H
C–C, C–
N,
C–N, O–
H,
O–H, C–
H
C–N, O–
H,
O–H, C–
H
C–N, C–
H,
O–O, O–
H,
O–H
N–C, C–
C, Fe ,
Fe
S–S, O–
H, C–H
C–N, N–
H
P–O, N–
H
N–C, N–
C,
O–H, O–
H
O–C, C–
H
P–O, C– P–O, C–
O,
O,
O–H, O–H O–H, O–H
C–C, O–H
Group SubCleaved
number sub
bond or
classes donated
electrons
11
4.3.1.− C–N, C–
4.3.2.− H
4.3.3.−
12
6.3.1.− P–O, C–
6.3.2.− O,
6.3.3.− N–H
21
23
C=O, N–
H,
R–H
C=O, N–
H,
O–H, O–
H
C–C, C–
H, N:
C–O, S–
H, S–H
C–N, N–
H
P–N, O–
H
C=O, N–
H,
N–H
C–C, O–
H
5 SUMMARY
We present a novel approach for the automatic characterisation of
biochemical reactions. The approach is based on the Dugundji–Ugi
model, which describes electron shift patterns by reaction matrices.
Although this model allows a rational description of the reaction
core, it requires an assignment of the substrate atoms to the
product atoms. Hence, reaction matrices had to be constructed
manually in the past. Our program, which calculates reaction
matrices automatically, represents a substantial improvement in
the handling of the Dugundji–Ugi model. Our reaction matrix calculator
calculates reaction matrices on the basis of the substrate and product
molecules only, using molfiles in MDL format as input.
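To make the role of the atom assignment concrete, the following sketch computes a reaction matrix as the difference R = E − B between the bond-electron matrices of the products (E) and the substrates (B), which is the basic relation of the Dugundji–Ugi model. It is an illustration under simplifying assumptions, not our program: the atom mapping is taken as given (atom i of B corresponds to atom i of E), free valence electrons on the diagonal are left at zero because they do not change in this example, and all function names are hypothetical.

```python
def be_matrix(n_atoms, bonds, free_electrons=None):
    """Build a symmetric bond-electron (BE) matrix.

    Off-diagonal entries hold bond orders, diagonal entries hold free
    valence electrons (0 unless given in free_electrons).
    """
    m = [[0] * n_atoms for _ in range(n_atoms)]
    for (i, j), order in bonds.items():
        m[i][j] = m[j][i] = order
    for i, n in (free_electrons or {}).items():
        m[i][i] = n
    return m

def reaction_matrix(b, e):
    """R = E - B for identically numbered (already mapped) atoms."""
    n = len(b)
    return [[e[i][j] - b[i][j] for j in range(n)] for i in range(n)]

def bond_changes(r, labels):
    """List cleaved (negative) and formed (positive) bonds of R."""
    cleaved, formed = [], []
    for i in range(len(r)):
        for j in range(i + 1, len(r)):
            name = f"{labels[i]}-{labels[j]}"
            if r[i][j] < 0:
                cleaved.append(name)
            elif r[i][j] > 0:
                formed.append(name)
    return cleaved, formed

# Reaction core of the P-O / O-H exchange: atoms P, O, O, H.
labels = ["P", "O", "O", "H"]
B = be_matrix(4, {(0, 1): 1, (2, 3): 1})   # substrates: P-O and O-H
E = be_matrix(4, {(0, 2): 1, (1, 3): 1})   # products:   P-O and O-H
R = reaction_matrix(B, E)
cleaved, formed = bond_changes(R, labels)
```

The entries of R sum to zero, reflecting that electrons are only redistributed. Without the atom mapping, B and E could not be subtracted element by element, which is why the mapping has to be determined first.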
The atom-mapping problem is treated by an algorithm for
maximal common subgraph (MCS) determination. Because
the MCS problem is NP-complete, we developed a new approach
that combines the favourable characteristics of two different MCS
algorithms. The fast Koch algorithm, which is a variant of the
Bron–Kerbosch algorithm and searches for connected MCSs, is
combined with the McGregor algorithm developed for the analysis
of chemical reactions.
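The clique-based view of the MCS problem can be sketched as follows. Note the limits of this example: it uses the plain Bron–Kerbosch algorithm with pivoting on the modular product of two small molecular graphs, so it is neither the Koch variant for connected substructures nor the McGregor algorithm, and it ignores bond orders. The graph encoding (a list of element labels plus a set of frozenset bonds) and all function names are assumptions made for illustration.

```python
from itertools import combinations

def modular_product(labels1, bonds1, labels2, bonds2):
    """Compatibility graph: nodes are atom pairs with equal element labels."""
    nodes = [(i, j) for i, a in enumerate(labels1)
             for j, b in enumerate(labels2) if a == b]
    adj = {n: set() for n in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # an atom may be mapped only once
        # Two pairs are compatible if both atom pairs are bonded or both are not.
        if (frozenset((i, k)) in bonds1) == (frozenset((j, l)) in bonds2):
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj

def bron_kerbosch(adj, r, p, x, out):
    """Collect all maximal cliques (Bron-Kerbosch with pivoting)."""
    if not p and not x:
        out.append(set(r))
        return
    pivot = max(p | x, key=lambda v: len(adj[v] & p))
    for v in list(p - adj[pivot]):
        bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v], out)
        p = p - {v}
        x = x | {v}

def maximum_common_subgraph(labels1, bonds1, labels2, bonds2):
    """Return the largest atom mapping found as a set of (i, j) pairs."""
    adj = modular_product(labels1, bonds1, labels2, bonds2)
    cliques = []
    bron_kerbosch(adj, set(), set(adj), set(), cliques)
    return max(cliques, key=len)

# Toy substrate C-C-O and toy product C-C-O-P (hydrogens omitted).
sub_labels = ["C", "C", "O"]
sub_bonds = {frozenset((0, 1)), frozenset((1, 2))}
prod_labels = ["C", "C", "O", "P"]
prod_bonds = {frozenset((0, 1)), frozenset((1, 2)), frozenset((2, 3))}
mapping = maximum_common_subgraph(sub_labels, sub_bonds, prod_labels, prod_bonds)
```

Each clique of the compatibility graph is a consistent atom mapping, and the largest clique is a maximum common (induced) subgraph. The exhaustive enumeration of cliques is what makes the problem expensive for larger molecules.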
This MCS algorithm is used to compare the substrate molecules
with the product molecules. The MCS method allows a very accurate
assignment of the atoms in most cases. However, there remain
cases in which the maximal common substructure does not allow
an unambiguous identification of the part of the molecule that is
transferred during the reaction. It was thus necessary to integrate
heuristic information for the treatment of those reactions. In 112
reactions, the algorithms fail to calculate the reaction matrix
owing to highly complex changes of the structure (e.g. pentalenene
synthase). However, almost 3500 enzymatic reaction operators
have been calculated, covering 228 sub-subclasses of the EC
classification system. Comparison and grouping of reaction matrices
are performed by a novel graph-based canonisation method, which is
able to handle interruptions within the electron shift pattern. The
identity of the reaction matrices within the sub-subclasses was
examined. In most cases the reaction matrices within a sub-subclass
are homogeneous. Nevertheless, the system also reveals sub-subclasses
with inconsistent electron shift patterns.
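Why a canonical form is needed for comparison can be shown with a small sketch. It is deliberately not the graph-based canonisation method referred to above: it simply tries all orderings of the reacting atoms and keeps the lexicographically smallest labelled matrix, which is feasible only for very small reaction cores. The function name and the example matrices are hypothetical.

```python
from itertools import permutations

def canonical_key(labels, r):
    """Brute-force canonical form of a reaction matrix.

    Atoms whose row is entirely zero do not take part in the reaction
    and are dropped; the remaining atoms are renumbered in every
    possible order and the smallest (labels, matrix) pair is returned.
    """
    n = len(r)
    core = [i for i in range(n) if any(r[i][j] for j in range(n))]
    best = None
    for perm in permutations(core):
        key = (tuple(labels[i] for i in perm),
               tuple(tuple(r[i][j] for j in perm) for i in perm))
        if best is None or key < best:
            best = key
    return best

# The same P-O / O-H exchange written with two different atom numberings;
# the second version carries an additional, non-reacting carbon atom.
labels1 = ["P", "O", "O", "H"]
r1 = [[0, -1, 1, 0],
      [-1, 0, 0, 1],
      [1, 0, 0, -1],
      [0, 1, -1, 0]]
labels2 = ["H", "O", "P", "O", "C"]
r2 = [[0, -1, 0, 1, 0],
      [-1, 0, 1, 0, 0],
      [0, 1, 0, -1, 0],
      [1, 0, -1, 0, 0],
      [0, 0, 0, 0, 0]]
same = canonical_key(labels1, r1) == canonical_key(labels2, r2)

# A C-O / O-H exchange has the same shift pattern but different elements.
labels3 = ["C", "O", "O", "H"]
different = canonical_key(labels3, r1) != canonical_key(labels1, r1)
```

Two reactions receive the same key exactly when their reaction cores agree up to renumbering of the atoms, so the keys can be used directly for counting and grouping.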
In a further examination, reaction matrices of different
sub-subclasses were compared and grouped. One hundred and twenty-one
sub-subclasses have a unique reaction matrix. This means that
enzymes of these sub-subclasses can be recognised by the
corresponding reaction matrix. On the other hand, 107 sub-subclasses
do not have unique reaction matrices. They form 26 groups,
each of which shares an identical reaction matrix. The reaction
matrices of the largest groups show basic electron shift patterns.
Although these sub-subclasses have identical reaction matrices,
they sometimes belong to different main classes within the EC
classification system. The electron shift patterns thus differ
in many respects from the EC classification system.
The Dugundji–Ugi model is known as an objective and rational
system to characterise chemical reactions. We were able to show that
electron shift patterns can be used for a classification of biochemical
reactions and allow a verification of the classes of the EC system
according to chemical properties as well.
Funding: The German Minister of Research (Bundesministerium für
Bildung und Forschung, Bonn, grant no. 031U211A).
Conflict of Interest: none declared.
REFERENCES
Apostolakis,J. et al. (2008) Automatic determination of reaction mappings and reaction
center information. 2. Validation on a biochemical reaction database. J. Chem.
Inform. Model., 48, 1190–1198.
Bauer,J. (1989) IGOR2: a PC-program for generating new reactions and molecular
structures. Tetrahedron Comput. Methodol., 2, 269–280.
Brandt,J. and von Scholley,A. (1983) An efficient algorithm for the computation of the
canonical numbering of reaction matrices. Comput. Chem., 7, 51–59.
Brandt,J. et al. (1981) Classification of reactions by electron shift patterns. Chemica
Scripta., 18, 53–60.
Bron,C. and Kerbosch,J. (1973) Algorithm 457—finding all cliques of an undirected
graph. Commun. ACM, 16, 575–577.
Dugundji,J. and Ugi,I. (1973) An algebraic model of constitutional chemistry as a basis
for chemical computer programs. Topics Curr. Chem., 39, 19–64.
Fontain,E. and Reitsam,K. (1991) The generation of reaction networks with RAIN. 1.
The reaction generator. J. Chem. Inform. Comp. Sci., 31, 96–101.
Hattori,M. et al. (2003) Development of a chemical structure comparison method for
integrated analysis of chemical and genomic information in the metabolic pathways.
J. Amer. Chem. Soc., 125, 11853–11865.
Hatzimanikatis,V. et al. (2005) Exploring the diversity of complex metabolic networks.
Bioinformatics, 21, 1603–1609.
Koch,I. (2001) Enumerating all connected maximal common subgraphs in two graphs.
Theoret. Comp. Sci., 250, 1–30.
Körner,R. and Apostolakis,J. (2008) Automatic determination of reaction mappings
and reaction center information. 1. The imaginary transition state energy approach.
J. Chem. Inform. Model., 48, 1181–1189.
Kotera,M. et al. (2004) Computational assignment of the EC-numbers for genomic-scale
analysis of enzymatic reactions. J. Am. Chem. Soc., 126, 16487–16498.
Levi,G. (1972) A note on the derivation of maximal common subgraphs of two directed
or undirected graphs. Calcolo, 9, 341–352.
Marialke,J. et al. (2007) Graph-based molecular alignment (GMA). J. Chem. Inform.
Model., 47, 591–601.
McGregor,J.J. (1982) Backtrack search algorithms and the maximal common subgraph
problem. Software: Pract. Exp., 12, 23–34.
Oh,M. et al. (2007) Systematic analysis of enzyme-catalyzed reaction patterns and
prediction of microbial biodegradation pathways. J. Chem. Inf. Model., 47,
1702–1712.
Raymond,J.W. and Willett,P. (2002) Maximum common subgraph isomorphism
algorithms for the matching of chemical structures. J. Comp.-Aid. Mol. Des., 16,
521–533.
Reitz,M. et al. (2004) Enabling the exploration of biochemical pathways. Org. Biomol.
Chem., 2, 3226–3237.
Schomburg,I. et al. (2004) BRENDA, the enzyme database: updates and major new
developments. Nucleic Acids Res., 32, D431–D433.
Tipton,K.F. and Boyce,S. (2000) History of the enzyme nomenclature system.
Bioinformatics, 16, 34–40.
Ugi,I. et al. (1994) Models, concepts, theories, and formal languages in chemistry and
their use as a basis for computer assistance in chemistry. J. Chem. Inform. Comp.
Sci., 34, 3–16.
Webb,E.C. (1990) Enzyme Nomenclature—recommendations—1984—Supplement3—corrections and additions. Eur. J. Biochem., 187, 263–281.