BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 23 2009, pages 3135–3142 doi:10.1093/bioinformatics/btp549 Systems biology Automatic assignment of reaction operators to enzymatic reactions Markus Leber1 , Volker Egelhofer1,2 , Ida Schomburg1,2 and Dietmar Schomburg1,2,∗ 1 Institute for Biochemistry, University of Cologne, Zülpicher Straße 47, 50674 and 2 Department for Bioinformatics and Biochemistry, Technische Universität Braunschweig, Langer Kamp 19B, 38106 Braunschweig, Germany Received on December 19, 2009; revised on July 21, 2009; accepted on September 14, 2009 Advance Access publication September 25, 2009 Associate Editor: John Quackenbush ABSTRACT Background: Enzymes are classified in a numerical classification scheme introduced by the Nomenclature Committee of the IUBMB based on the overall reaction chemistry. Due to the manifold of enzymatic reactions the system has become highly complex. Assignment of enzymes to the enzyme classes requires a detailed knowledge of the system and manual analysis. Frequently rearrangements and deletions of enzymes and sub-subclasses are necessary. Results: We use the Dugundji–Ugi model for coding of biochemical reactions which is based on electron shift patterns occurring during reactions. Changes of the bonds or of non-bonded valence electrons are expressed by reaction matrices. Our program calculates reaction matrices automatically on the sole basis of substrate and product chemical structures based on a new strategy for maximal common substructure determination, which allows an accurate atom mapping of the substrate and product atoms. The system has been tested for a large set of enzymatic reactions including all sub-subclasses of the EC classification system. Altogether 147 different representative reaction operators were found in the classified enzymes, 121 of which are unique with respect to an EC sub-subclass. The other 26 comprise groups of enzymes with very similar reactions, being identical with respect to the bonds formed and broken. Conclusion: The analysis and comparison of enzymatic reactions according to their electron shift patterns is defining enzyme groups characterised by unique reaction cores. Our results demonstrate the applicability of the Dugundji–Ugi model as a reasonable preclassification system allowing an objective and rational view on biochemical reactions. Availability: The program to generate reaction matrix descriptors is available upon request. Contact: [email protected] 1 BACKGROUND Classification of enzymes is a complex procedure performed by the joined nomenclature committee of the International Union of Pure and Applied Chemistry (IUPAC) and International Union of Biochemistry and Molecular Biology (IUBMB) (Tipton and Boyce, 2000). In general nomenclature and classification of enzymes are based on the overall reaction chemistry of biochemical processes. ∗ To whom correspondence should be addressed. In 1961 the first enzyme commission devised a system of code numbers to describe the chemical features of enzymatic reactions. These Enzyme Commission numbers (EC numbers) start with the prefix EC followed by four numbers, separated by points. The first number indicates which of the six main classes the enzyme belongs to. The main classes are divided into subclasses, which are again subdivided to sub-subclasses depending on the reaction type and which bonds are formed or cleaved during the reaction. For instance, the second number represents the subclass and the third number the sub-subclass of the respective enzyme. The fourth and last number indicates a serial number of the enzyme within the sub-subclass. EC numbers therefore indicate which reaction is catalysed by the enzymes. Different proteins can receive the same EC number if they catalyse identical chemical reactions. Over the years it turned out that the chemical properties of enzymatic reactions allow a reasonable classification of enzymes. With the growing number of known enzymes the correct assignment has become a difficult task for researchers. In addition enzymes sometimes are grouped together for historical reasons or based on physiological aspects (Webb, 1990). This was the motivation for us to test a system for coding of enzymatic reactions that can be automated and can be used as a pre-classification scheme. We found the Dugundji–Ugi model (Dugundji and Ugi, 1973) to be an efficient method to characterise biochemical reactions automatically. Similar to the classification system introduced by the first Enzyme Commission, the Dugundji– Ugi model considers the chemical properties of the overall reaction. Reactions are described by mathematical operators, which are named transition or reaction matrices and represent electron shift patterns occurring during reactions. In this model substrate molecules and product molecules are depicted by binding electron matrices (BE matrices), in which bonds are marked by values. Reactions can be described by this fundamental equation: B+R = E, whereas matrix B is the BE-matrix of the substrate molecules and E is the BE-matrix of the product molecules. As can be seen by the Equation (1) the substrate matrix (B) can be transformed to the product matrix (E) by the reaction matrix (R). Values in the reaction matrix correspond to changes of the bonds or of non-bonded valence electrons. A positive value in the reaction matrix indicates bond formation, whereas negative values indicate the cleavage of bonds. © The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] [20:14 3/11/2009 Bioinformatics-btp549.tex] (1) 3135 Page: 3135 3135–3142 M.Leber et al. Fig. 1. As demonstrated by the reaction of the cyanide hydratase, reaction matrix R converts substrate matrix B to product matrix E. The matrices of B and E are reduced to the atoms involved into the reaction. Values on the main diagonal correspond to changes of the number of free valence electrons of the respective atom. The matrices B, E and R are symmetric and obey algebraic rules. As an advantage the Dugundji–Ugi model allows a detection of the atoms involved into the reaction by simple arithmetic operations. Thus the Dugundji–Ugi model allows a very general and objective view on reactions. In the present work we describe the calculation, comparison and grouping of reaction matrices for biochemical reactions. An example for reaction matrix calculation is given by Figure 1. Although chemical reactions were classified by the Dugundji– Ugi model in the past (Brandt et al., 1981), the model has not been used excessively within the last years. The method requires the assignment of the equivalent atoms between substrates and products. For instance a reaction matrix can be easily calculated by the following Equation (2): R = E −B. (2) But within the matrices of B,E and R, each atom is represented by a certain row and column number. Thus, if the correspondence of the substrate and product atoms is unknown it is not possible to arrange the atoms within the matrices. For this reason reaction matrices have been generated manually in the past. In order to avoid this time consuming procedure, we developed a method that allows the automatic calculation of the reaction matrices based on the chemical structure of the substrate and product molecules. The atom-mapping problem is solved by an algorithm for maximal common subgraph determination. Since the maximal common subgraph problem is NP-complete it was necessary to develop an algorithm, which is fast, flexible and allows to reproduce complicated changes of the structure as well. 2 RELATED WORK The Dugundji–Ugi model is applicable to any kind of reaction and was subject of intensive studies in the past. Techniques of reaction generation, reaction classification, reaction documentation and elucidation of reaction mechanisms have been developed (Ugi et al., 1994). An attempt to classify reactions on the basis of electron shift patterns was proposed by Brandt et al. (Brandt and von Scholley, 1983; Brandt et al., 1981). In these studies a set of canonisation rules are defined to generate a unique reaction matrix. In other studies the Dugundji–Ugi model was used as reaction generator. The programs IGOR (Bauer et al., 1989) and RAIN (Fontain and Reitsam, 1991) are two examples. Both programs just work in combination with transition tables and sets of formal constraints in order to avoid combinatorial explosions (Ugi et al., 1994). In the most recent application the Dugundji–Ugi model was used by Hatzimanikatis et al. (2005) for de novo synthesis of metabolic pathways. Almost 250 reaction matrices are generated manually and utilised by a pathway generation algorithm. This study demonstrates the efficiency of the Dugundji–Ugi system to model each type of biochemical reactions. In comparison to these previous studies our program is the first approach, which is able to calculate R matrices automatically. Due to the complexity of the atom-mapping problem just few approaches have been published to handle the assignment problem. At first databases have been constructed manually, like the BioPath database, which contains biochemical pathways and reaction information (Reitz et al., 2004). The Kyoto Encyclopedia of Genes and Genomes (KEGG) developed a tool for computational assignment of EC numbers published by Kotera et al. (2004). In this approach each reaction formula has to be decomposed by the user into sets of corresponding substrate and product molecules, which are called reactant pairs. In the second step every reactant pair is analysed by the structure comparison method SIMCOMP developed by Hattori et al. (2003). As reported by KEGG (Oh et al., 2007) many alignments generated by SIMPCOMP were corrected manually. Another approach proposed by Körner and Apostolakis (2008) and Apostolakis et al. (2008) considers reaction energetics to predict reaction sites. However, none of these methods rely on full automatic reaction matrix calculation to characterise enzymatic reactions by electron shift patterns. Based on the Dugundji–Ugi model and reaction matrix canonisation our system leads to a new grouping of enzymes, which are only represented by their reaction core. 3 METHODS The full algorithm requires the following steps: (1) Calculation of maximal common substructure between substrates and products; (2) Atom–atom mapping; (3) Calculation of BE-matrices of substrates and products; (4) Calculation of reaction matrices by subtraction of BE-matrices. The comparison of reaction matrices requires one additional step, the conversion of the reaction matrices to a canonical form. 3.1 MCS algorithm In order to generate an atom–atom mapping and to calculate reaction matrices of biochemical reactions, we developed a new approach for Maximal Common Subgraph (MCS) determination. At first the molecules are compared pairwise by the MCS algorithm. As input we used molfiles in the MDL format containing information of the atoms and the bonds of a molecule. MCS determination is the method of choice for an efficient similarity analysis of graphs. This method is applicable to the comparison of molecules, 3136 [20:14 3/11/2009 Bioinformatics-btp549.tex] Page: 3136 3135–3142 Automatic assignment of reaction operators to enzymatic reactions Fig. 2. Example reaction, where the variant of the Bron-Kerbosch algorithm fails due to absence of corresponding edges in the compatibility graph. because molecules can be considered as graphs, whereas the atoms can represent the nodes and the bonds the edges of a graph. The problem of finding all maximal common substructures of two molecules is identical to the search for all maximal common subgraphs in two graphs. This method also can be used as an atom mapping approach for reactions. According to the underlying principle an atom can be mapped, if it is in the same substructure before and after the reaction. Since the MCS-problem is NP-complete several algorithms have been developed to solve the problem in acceptable time for a sufficient amount of nodes (Hattori et al., 2003, Marialke et al., 2007, Raymond and Willett, 2002). Especially if the method is used to consider reactions there is a conflict between improvement of the runtime and flexibility of the algorithm. On the one hand a MCS algorithm has to be flexible, because it must be able to reproduce complex changes of molecule structures. Therefore it is not possible to integrate restrictions like double bonds, atomic environments or three-dimensional information of the structure. These characteristics may change during a reaction. On the other hand the MCS algorithm must be fast enough to compare large metabolites as well. Several MCS algorithms with various properties have been developed during time. It turned out that the best solution for the present project is a combination of two MCS algorithms in order to combine the favourable characteristics of these different approaches. The first algorithm is a variant of the Bron–Kerbosch (Bron and Kerbosch, 1973) algorithm, which solves the MCS search by transforming the MCS problem into a clique problem. In general this procedure is based on the assumption that certain vertices and edges of a graph G1 are compatible to certain vertices and edges of a second graph G2. All of these possible compatibilities are stored in a new graph, called compatibility graph. Corresponding features of G1 and G2 are given by cliques within the compatibility graph. A clique is a complete subgraph. This means within the clique each node is connected with all the other nodes by an edge. A maximal clique is maximal if the clique cannot be extended by a further graph node. As proved by Levi (1972) a maximal common induced subgraph of the graphs of G1 and G2 corresponds to a maximal complete subgraph in the compatibility graph. Accordingly the search for all maximal cliques delivers all maximal common induced subgraphs of G1 and G2. Cliques can be detected by backtracking methods as branch and bound algorithms with exponential runtimes. The mentioned widely used Bron– Kerbosch algorithm is an example. We use a modified version of the Bron– Kerbosch algorithm proposed by Koch (2001), which considers connected maximal common subgraphs and not maximal common induced subgraphs. This variant solves a new problem by searching only for those cliques which represent connected subgraphs. The result is a drastic reduction of the runtime. We modified this algorithm and use it for the comparison of molecules. On the basis of a vertex compatibility graph our approach searches for all connected maximal common substructures of two molecules. However we noticed this algorithm has problems to reconstruct complicate conversions of the structure. For instance Figure 2 shows the reaction of the Catechol:oxygen 1,2-oxidoreductase (decyclising), in which an aromatic ring is cleaved. Fig. 3. Reaction of the pyruvate decarboxylase (A). Comparison of pyruvate to acetaldehyde (B) is leading to 4 MCS. But only MCS number 3 corresponds to the correct position. In this case catechol must be compared to muconate. But the mapping leads to an incomplete mapping of six backbone atoms, whereas all eight backbone atoms of catechol are transformed to muconate. The reason is the absence of corresponding edges in the vertex compatibility graph. Due to this drawback, we combined the Bron–Kerbosch variant with a flexible approach proposed by McGregor et al. (1982). This algorithm has been developed for analysing chemical reactions and enumerating bond changes. The McGregor algorithm uses a backtracking search as its basic component. Thus we integrated techniques to reduce the number of possibilities to be considered. In our approach, the McGregor algorithm is added to the Bron–Kerbosch variant developed by Koch. First the molecules are compared by the fast Koch algorithm. In case that unmapped atoms remain in both molecules, the McGregor algorithm is used to extend the common substructures. To assure that the maximal common substructures are still connected, the core structures given by the Bron–Kerbosch variant are only extended by neighbouring bonds. The extension is performed within a backtracking procedure, which systematically tests all possibilities to extent the core structures as long as it is not longer possible. The combination of these two algorithms also allowed an improvement of the run-time. In contrast to small molecules, where a higher percentage of atoms are involved in the reaction, substantial parts of large molecules remain unchanged. Therefore if large molecules are compared, atoms together with their environment are considered to reduce the size of the vertex compatibility graph. Altogether the result is a fast and flexible MCS approach, which is able to reproduce complex changes of the structure and allows the comparison of large metabolites as well. 3.2 Atom–atom mapping The MCS algorithm is used as a core component to generate an atom mapping of all substrate atoms to all product atoms. For this purpose the algorithm compares each substrate molecule to each product molecule. An example is given by the reaction of the pyruvate decarboxylase (Fig. 3), where pyruvate is compared to acetaldehyde and carbon dioxide. The comparison of pyruvate to acetaldehyde is leading to four maximal common substructures. But only one of them is a valid match in the context of the reaction. The other three maximal common substructures contain atoms, which belong to the cleaved carbon dioxide. Thus, to generate a correct atom mapping it is necessary to select those combinations of maximal common substructures, which do not contain overlaps and where all substrate atoms can be mapped on all product atoms. For this purpose the maximal common 3137 [20:14 3/11/2009 Bioinformatics-btp549.tex] Page: 3137 3135–3142 M.Leber et al. Fig. 4. Molecule comparison and ranking system. At first each substrate molecule is compared to each product molecule. In the second step the MCS are ordered by size. Third within an iterative procedure overlapping regions of smaller MCS are rejected. substructures of the different sets are joined in all possible combinations and afterwards those combinations are selected, which exhibit a complete mapping of the atoms. Figure 4 demonstrates a more complex example, where two substrate molecules react to two product molecules. In this case the maximal common substructures have overlaps within all combinations. This problem has been solved by a ranking system, in which the maximal common substructures of a combination are ordered according to their size. Large maximal common substructures have a higher rank than smaller maximal common substructures. Within an alternating procedure the respective largest maximal common substructure is added to the solution set, whereas overlapping atoms of smaller maximal common substructures are deleted. This means the size of the remaining maximal common substructures may decrease along this process. Therefore it is important to rearrange the order of the remaining maximal common substructures after each step. Our method leads to an assignment of the structures comparable to a manual arrangement. In some cases it is possible to generate more than one complete mapping for a reaction. The reason can be symmetries of partial structures or whole molecules. In other cases substrate atoms of the same type can match to several targets of the product side. At first our underlying algorithm calculates all valid possibilities to map the atoms. Redundant atom mappings can be easily removed during reaction matrix calculation, because equivalent atom mappings lead to identical reaction matrices. However, in some cases, the method to combine the maximal common substructures may result in a large number of combination sets. Thus, in order to avoid run time problems it was necessary to integrate heuristics to reduce the number of combination sets to be considered. Especially if cofactors are involved the number of combination sets increases rapidly. On the other hand the chemical states of the cofactors are known and it is not essential to compare their structures with the other reactants. Therefore a mechanism is integrated, which maps the atoms of the cofactors separately. Afterwards the mappings of the cofactors are added to the mappings of the effective reactants. Comparisons of very small molecules with large metabolites are problematic, too. They result in a high number of maximal common substructures and in consequence a rapid increase of combination sets. Furthermore these comparisons deliver targets, which are not significant in most cases. Due to this reason small molecules like water, ammonia, methane and further molecules of this size are not considered in the first comparison step. The small molecules are mapped on remaining unmapped structures after the other molecules are compared and combined to preliminary solution sets. The mapping of the small molecules is achieved by a repetitive variant of the Bron–Kerbosch algorithm. Although the ranking system in combination with the heuristics appears complicated on the first view, the outcome is a fast method, which allows an accurate assignment of the atoms. 3.3 Hydrogen atom mapping Depending on element type, charge, number of valence electrons and radical label hydrogen atoms are added to the backbone atoms. The number of hydrogen atoms directly bonded to each backbone in the substrate atom is compared to the product hydrogen atom numbers. This procedure reveals whether a hydrogen atom is removed or added to the corresponding backbone atom. Removed hydrogen atoms are mapped to added hydrogen atoms on the product side. If the number of hydrogen atoms on the substrate side does not match the number of added hydrogen atoms on the product side, the supernumerous hydrogen atoms are considered as protons. 3.4 Stereoisomerism The reactions of the isomerases are summarised in the fifth main class of the EC classification system. A fraction of enzymes within this class catalyses stereoisomeric changes of single atoms without a net cleavage or forming of bonds. Hence the characterisation of these reactions required an extension of the method. In our system the absolute configuration of an atom is indicated by values on the main diagonal. An entry of -10 represents an S-configuration and an entry of 10 an R-configuration. Hence an entry of 20 in the reaction matrix means that a stereoisomeric centre is changed from an S- to an R-configuration. 3.5 Reaction matrix calculation and canonisation The comparison or identification of identical reaction matrices requires in worst-case n! row/column permutations, where n is the dimension of the matrix. Therefore canonisation rules have been defined to find a unique reaction matrix. These rules reflect the observation from chemical reactions that electrons are shifted along a single path or a cycle. If the succession of atoms within this path or cycle is transferred to a reaction matrix, a unique canonical reaction matrix is obtained. The characteristic feature of canonical matrices is a continuous succession of entries along the first side diagonal, where the entries alternate in signs. The determination of canonical 3138 [20:14 3/11/2009 Bioinformatics-btp549.tex] Page: 3138 3135–3142 Automatic assignment of reaction operators to enzymatic reactions Fig. 5. Canonical matrix transformation can be achieved by a graph based procedure, which enumerates all continuous electron shift patterns of maximal length. reaction matrices can be a time-consuming procedure if all n! permutations are generated. Therefore methods have been developed to calculate canonical reaction matrices within acceptable computing time. But it turned out that these methods rely on heuristic assumptions, which cannot be adapted to biochemical reactions for several reasons. At first the valence electrons are not always shifted along a single path or a cycle. For instance often protons are generated in biochemical reactions. Protons can lead to interruptions of an electron shift pattern, because the corresponding hydrogen atom is involved in a cleavage, but not in a formation of a bond. Metal ions or atoms with changes of the stereochemistry are other examples, where the electron shift pattern is interrupted. In these cases a graph based representation of the electron shift pattern leads to a fragmentation of the corresponding graph (disconnected graph). Another problem is the occurrence of branches within a graph, which results in several paths of different length. Faced with these problems we implemented a graph-based method for canonisation. This means the reaction matrix is converted to a graph, where entries of the reaction matrix represent edges and the atoms form nodes. A longest path of this graph represents a continuous electron shift pattern of maximal length. In order to find these maximal electron shift patterns we implemented a backtracking algorithm, which searches for the largest sub-graphs that are Hamiltonian paths, e.g. each vertex is visited once. Figure 5 shows one reaction matrix of the 2-hydroxyglutarate dehydrogenase calculated by the program. Matrix conversion to a graph reveals the cyclic electron shift pattern of the reaction. Every node in the graph can serve as a starting point for a path and the pathways can be extended in two directions. Hence 16 paths can be generated. To reduce the number of solutions the longest paths are investigated with respect to canonisation rules, in particular solutions which exhibit an alternation of the sign and start with a negatively labelled edge are selected. In practice this method still leads to a number of valid solutions. Therefore we integrated the atomic weight as a further factor. The succession of atoms of a longest path is examined for the atomic weight and solutions are selected which exhibit a longest sequence of heaviest atoms in a front position. In many cases this method leads to a drastic reduction of the solution set. The whole procedure becomes much more complex if the underlying graph is disconnected and consists of various fragments. In this case the longest paths within the fragments are determined separately, before they are concatenated to the whole electron shift pattern of the reaction. The maximal electron shift patterns of the different fragments are concatenated according to their size. Larger electron shift patterns have a higher rank than smaller patterns. If electron shift patterns have the same size several permutations are evaluated. 3.6 Comparison of canonisised reaction matrices: The canonisation procedure allows the construction of unique reaction matrices. The canonical reaction matrix is converted to a linear notation of two R-strings. The first string is formed by reaction matrix entries, which are concatenated according to a predefined system defined by Brandt et al. (Brandt and von Scholley, 1983; Brandt et al., 1981). The second string consists of element symbols of the atoms, which correspond to the reaction matrix entry string. Both strings, called R-strings, allow an efficient comparison and identification of identical reaction matrices. 4 RESULTS The reaction matrix calculator was tested for a set of 3600 enzymatic reactions with representatives of all sub-subclasses. Reactions which are described as free text in the EC-classification scheme and reactions with generic molecules are not included. The set is derived from the BRENDA database (Schomburg et al., 2004). The reaction matrix dimensions vary between 1 and 90, beginning with the reaction matrices of racemases with a dimension of 1 to the reaction of the nitrogenase (with 48 substrate molecules). The most frequent 3139 [20:14 3/11/2009 Bioinformatics-btp549.tex] Page: 3139 3135–3142 M.Leber et al. Fig. 6. Sub-subclass consistence. matrix dimension is 4, which is typical for many reactions of the transferases, hydrolases and lyases. Reaction matrices of the oxidoreductases exhibit the highest variance in dimension among the main classes. Reactions of this main class often include cofactors in different numbers. Cofactors have a substantial influence on the dimension of the reaction matrix. For example the conversion of NAD+ to NADH increases the reaction matrix dimension by ∼six rows and columns, whereas cytochrome leads to an increase of one row and column of the reaction matrix. A comparison of the reaction matrix grouping with the EC-class system is instructive: The reaction matrices within most of the subsubclasses of the EC system are homogenous. Figure 6 shows the consistence of reaction matrices within the sub-subclasses. This result is not surprising, because sub-subclasses contain reactions that are basically identical with the exception of different substrate specificity. Nevertheless 18% of the sub-subclasses have a reaction matrix identity of <50%. Differences of the electron shift pattern within a sub-subclass can have different reasons. For instance an enzymatic reaction can be associated with parallel reactions caused by metabolites, which merely serve as electron acceptors or donors for the real reaction. Such reactions can be found in sub-subclasses of type –.–.99, that form a high portion of the sub-subclasses with a reaction matrix identity below 50%. This is not surprising, because in these subsubclasses heterogeneous reactions are collected, which do not match with the characteristics of any other sub-subclass of the respective subclass. An example is sub-subclass 1.10.99 which contains reactions acting on diphenols or related substances as donors. The respective acceptor molecules have different properties, leading to diverse electron shift patterns and a low reaction matrix identity of only 33.3 %. Nevertheless altogether an average value of 79% reaction matrix consistence within a subclass has been determined. Therefore we considered the reaction matrix with the highest occurrence as representative for the respective sub-subclass. Reaction matrices, which differ from their respective sub-subclass, were investigated for correctness. The representative reaction matrices of the different sub-subclasses were compared and grouped as shown in Table 1. Two hundred and twenty-eight sub-subclasses were examined in total. One hundred and twenty-one sub-subclasses have a unique reaction matrix, which is not found in any other sub-subclass whereas the remaining 107 sub-subclasses do not have unique reaction matrices. Comparison and grouping of these reaction matrices result in 26 clusters. The largest groups are characterised by a basic reaction core, where two bonds are broken on the substrate side and two bonds are formed on the product side. This mechanism is widely-found in biochemical reactions and is leading to heterogeneous groups, which are formed by sub-subclasses of different main classes. For example group number 1 consists of transferases and hydrolases. Group number 3 exhibits the highest EC-class diversity. In all reactions of this group a P–O bond and O–H bond is broken and a P–O bond and O–H bond is established. This reaction is catalysed by phosphotransferases, hydrolases that hydrolyse phosphoric-esters, and phosphorus–oxygen lyases. Of the 26 heterogeneous groups, 6 contain even enzymes from different main classes. Stereoisomeric information is just considered in reactions of the fifth main class, because stereoisomeric information is not always encoded correctly in the molfiles (MDL format) and can lead to false assignments. Group number 6 (EC subclass 5.1) contains enzymes that catalyse the change of a stereoisomeric centre from S- to an R-configuration or vice versa. When comparing the method described with published methods, in particular the E-zyme method published by the KEGG team, a number of advantages are obvious, the most prominent being the fully automatic search for the atoms of the reaction centre, in contrast to the required error-prone manual assignment of corresponding substrate/product pairs. In addition the reaction prediction method E-zyme does not use a ranking system. Hence the prediction can fail. The reaction prediction method E-zyme developed by KEGG does not use a ranking system. Hence the prediction can fail. An example is the reaction of the glycine amidinotransferase, which we used to describe the ranking system of our approach (cf. Section 3.2). Computation of the reaction pairs of the glycine amidinotransferase (R00565) by E-zyme is leading to overlapping substructures. At first the comparison of the main RPairs (C00062_C00077, C00037_C00581) produces correct assignments, but leaving a number of atoms unassigned. When adding the trans RPair (C00062_C00581) the result overlaps with the previous substructures and delivers false operators. Therefore it is not sufficient just to compare corresponding substrate and product molecules. A systematic analysis of correspondence between the EC-system and the two different methods is not possible as the data from the E-zyme system cannot only be obtained reaction by reaction. In order to get an estimate of the correctness a number of results were manually inspected that were identified to be especially complicated for published methods and were also used for the evaluation of the method published by Körner et al. (2008). From the 19 problematic cases mentioned in this publication in Table 2 and in the following paragraphs our method gave a correct or chemically reasonable solution in 15 cases. The method failed in those cases where larger rearrangements/cyclisations meant that not the largest possible substructure was preserved in the reaction like in EC 5.4.99.7 or 4.2.1.75 or in those cases like in 5.4.2.1 where chemical reasoning indicates a transfer of the phosphate group and automatic MCS atom mapping would suggest a transfer of the (smaller) CO2 group. Most of these cases could be identified because they gave a reaction matrix atypical for the corresponding EC-sub-subclass and were manually corrected. 3140 [20:14 3/11/2009 Bioinformatics-btp549.tex] Page: 3140 3135–3142 Automatic assignment of reaction operators to enzymatic reactions Table 1. Groups of sub-subclasses with equal reaction matrices Group SubCleaved number sub bonds or classes donated electrons 1 2.4.1.− C–O, O– 2.4.99.− H 3.1.1.− 3.1.3.− 3.1.4.− 3.1.5.− 3.1.7.− 3.1.8.− 3.1.11.− 3.1.13.− 3.1.15.− 3.1.16.− 3.1.21.− 3.1.25.− 3.1.26.− 3.1.30.− 3.2.1.− 3.3.2.− 2 Formed bonds or accepted electrons C–O, O– H 2.4.2.− C–N, O– C–O, N– H 3.2.2.− H 3.4.11.− 3.4.13.− 3.4.14.− 3.4.15.− 3.4.16.− 3.4.17.− 3.4.18.− 3.4.19.− 3.4.21.− 3.4.22.− 3.4.23.− 3.4.24.− 3.4.25.− 3.5.1.− 3.5.2.− Group Sub-sub Cleaved number classes bond or donated electrons P–O, O–H 3 2.7.1.− 2.7.2.− 2.7.4.− 2.7.6.− 2.7.8.− 2.7.10.− 2.7.11.− 2.7.12.− 2.7.99.− 3.6.1.− 3.6.3.− 3.6.4.− 3.6.5.− 4.6.1.− 4 5 6 7 8 9 10 3.1.14.− 3.1.22.− 3.1.27.− 3.1.31.− 3.7.1.− 4.1.1.− 4.1.2.− 4.1.3.− Formed bonds or accepted electrons P–O, O–H P–O, N– C, O–H 13 1.1.2.− 1.2.2.− O–H, C– H C–O, Fe , Fe 14 1.1.3.− 1.2.3.− 15 1.2.4.− 1.4.4.− 1.4.1.− 1.5.1.− C–O, O– H, O–H S–C, O– C, S–H C=O, C– C, N–H, C– H / N: C–O, R–H 24 4.2.1.− 4.2.99.− C–O, S–H 25 5.2.1.− 5.3.3.− C–C, C– C–C, C– H H S–O, O–H 26 5.3.2.− 5.5.1.− C–O, C– C–C, O– H H C–O, C–H 16 17 1.4.99.− 1.5.99.− 18 1.5.3.− 1.17.3.− 19 1.16.1.− 1.18.1.− 1.17.4.− 1.17.5.− 2.1.4.− 2.3.2.− 2.7.3.− 2.7.13.− 3.5.3.− 3.5.4.− 5.1.1. 5.1.2. 5.1.3. 5.1.99. SRconfigura- configuration tion 20 1.1.1.− 1.2.1.− 1.17.1.− C–N, C–C, C–O, C– C–H, C–H C, C–H, N: 22 1.1.99.− C–H, O–H 1.2.99.− 1.17.99.− S–C, O–H 2.3.1.− 3.1.2.− 3.3.1.− S–O, O–H 2.8.2.− 3.1.6.− 3.6.2.− Formed bonds or accepted electrons C–C, N– H O–O, O– H, C–H S–S, C– C, O–H C–C, C– N, C–N, O– H, O–H, C– H C–N, O– H, O–H, C– H C–N, C– H, O–O, O– H, O–H N–C, C– C, Fe , Fe S–S, O– H, C–H C–N, N– H P–O, N– H N–C, N– C, O–H, O– H O–C, C– H P–O, C– P–O, C– O, O, O–H, O–H O–H, O–H C–C, O–H Group SubCleaved number sub bond or classes donated electrons 11 4.3.1.− C–N, C– 4.3.2.− H 4.3.3.− 12 6.3.1.− P–O, C– 6.3.2.− O, 6.3.3.− N–H 21 23 C=O, N– H, R–H C=O, N– H, O–H, O– H C–C, C– H, N: C–O, S– H, S–H C–N, N– H P–N, O– H C=O, N– H, N–H C–C, O– H 3141 [20:14 3/11/2009 Bioinformatics-btp549.tex] Page: 3141 3135–3142 M.Leber et al. 5 SUMMARY We present a novel approach for an automatic characterisation of biochemical reactions. The approach is based on the Dugundji–Ugi model, which describes electron shift patterns by reaction matrices. Although this model allows a rational description of the reaction core, the system requires an assignment of the substrate atoms to the product atoms. Hence it was necessary to construct reaction matrices manually in the past. Our program, that is able to calculate reaction matrices automatically, represents a substantial improvement in handling of the Dugundji–Ugi-model. Our reaction matrix calculator calculates reaction matrices on the basis of the substrate and product molecules only, using molfiles of MDL format as input. The atom-mapping problem is treated by an algorithm for maximal common subgraph determination. Due to the fact that the MCS problem is NP-complete, we developed a new approach, which combines the favourable characteristics of two different MCS algorithms. The fast Koch algorithm, which is a variant of the Bron–Kerbosch algorithm and is searching for connected MCS, is combined with the McGregor algorithm developed for the analysis of chemical reactions. This MCS algorithm is used to compare the substrate molecules to the product molecules. The MCS method allows a very accurate assignment of the atoms in the most cases. However there remain cases, where the maximal common substructure does not allow the univocal identification of the part of the molecule, which is transferred during the reaction. Thus it was necessary to integrate heuristic information for the treatment of those reactions. In 112 reactions, the algorithms fail to calculate the reaction matrix due to highly complex changes of the structure (e.g. pentalene synthase). However, almost 3500 enzymatic reaction operators have been calculated including 228 sub-subclasses of the EC classification system. Comparison and grouping of reaction matrices is performed by a novel graph-based canonisation method, which is able to handle interruptions within the electron shift pattern. The reaction matrix identities within the sub-subclasses are examined. In most cases the reaction matrices within a sub-subclass are homogeneous. Nevertheless the system also reveals sub-subclasses with inconsistent electron shift patterns. In another examination reaction matrices of different subsubclasses were compared and grouped. One hundred and twentyone sub-subclasses have a unique reaction matrix. This means enzymes of these sub-subclasses can be recognised by the corresponding reaction matrix. On the other hand 107 sub-subclasses do not have unique reaction matrices. They form 26 groups, whereas each group shares an identical reaction matrix. The reaction matrices of the largest groups show basic electron shift patterns. Although these sub-subclasses have identical reaction matrices, they sometimes belong to different main classes within the EC classification system. Obviously the electron shift patterns differ in many aspects from the EC classification system. The Dugundji–Ugi model is known as an objective and rational system to characterise chemical reactions. We were able to show that electron shift patterns can be used for a classification of biochemical reactions and allow a verification of the classes of the EC system according to chemical properties as well. Funding: The German Minister of Research (Bundesministerium für Bildung und Forschung, Bonn, grant no. 031U211A). Conflict of Interest: none declared. REFERENCES Apostolakis,J. et al. (2008) Automatic determination of reaction mappings and reaction center information. 2. Validation on a biochemical reaction database. J. Chem. Inform. Model., 48, 1190–1198. Bauer,J. (1989) IGOR2: a PC-program for generating new reactions and molecular structures. Tetrahedron Comput. Methodol., 2, 269–280. Brandt,J. and von Scholley,A. (1983) An efficient algorithm for the computation of the canonical numbering of reaction matrices. Comput. Chem., 7, 51–59. Brandt,J. et al. (1981) Classification of reactions by electron shift patterns. Chemica Scripta., 18, 53–60. Bron,C. and Kerbosch,J. (1973) Algorithm 457—finding all cliques of an undirected graph. Commun. ACM, 16, 575–577. Dugundji,J. and Ugi,I. (1973) An algebraic model of constitutional chemistry as a basis for chemical computer programs. Topics Curr. Chem., 39, 19–64. Fontain,E. and Reitsam,K. (1991) The generation of reaction networks with RAIN. 1. The reaction generator. J. Chem. Inform. Comp. Sci., 31, 96–101. Hattori,M. et al. (2003) Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. J. Amer. Chem. Soc., 125, 11853–11865. Hatzimanikatis,V. et al. (2005) Exploring the diversity of complex metabolic networks. Bioinformatics, 21, 1603–1609. Koch,I. (2001) Enumerating all connected maximal common subgraphs in two graphs. Theoret. Comp. Sci., 250, 1–30. Körner,R. and Apostolakis,J. (2008) Automatic determination of reaction mappings and reaction center information. 1. The imaginary transition state energy approach. J. Chem. Inform. Model., 48, 1181–1189. Kotera,M. et al. (2004) Computational assignment of the EC-numbers for genomic-scale analysis of enzymatic reactions. J. Am. Chem. Soc., 126, 16487–16498. Levi,G. (1972) A note on the derivation of maximal common subgraphs of two directed or undirected graphs. Calcolo, 9, 341–352. Marialke,J. et al. (2007) Graph-based molecular alignment (GMA). J. Chem. Inform. Model., 47, 591–601. McGregor,J.J., (1982) Backtrack search algorithms and the maximal common subgraph problem. Software: Pract. Exp., 12, 23–34. Oh,M. et al. (2007) Systematic analysis of enzyme-catalyzed reaction patterns and prediction of microbial biodegradation pathways. J. Chem. Inf. Model., 47, 1702–1712. Raymond,J.W. and Willett,P. (2002) Maximum common subgraph isomorphism algorithms for the matching of chemical structures. J. Comp.-Aid. Mol. Des., 16, 521–533. Reitz,M. et al. (2004) Enabling the exploration of biochemical pathways. Org. Biomol. Chem., 2, 3226–3237. Schomburg,I. et al. (2004) BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res., 32, D431–D433. Tipton,K.F. and Boyce,S. (2000) History of the enzyme nomenclature system. Bioinformatics, 16, 34–40. Ugi,I. et al. (1994) Models, concepts, theories, and formal languages in chemistry and their use as a basis for computer assictance in chemistry. J. Chem. Inform. Comp. Sci., 34, 3–16. Webb,E.C. (1990) Enzyme Nomenclature—recommendations—1984—Supplement3—corrections and additions. Eur. J. Biochem., 187, 263–281. 3142 [20:14 3/11/2009 Bioinformatics-btp549.tex] Page: 3142 3135–3142
© Copyright 2026 Paperzz