Pathway evolution, structurally speaking
Stuart CG Rison* and Janet M Thornton*†‡

Small-molecule metabolism forms the core of the metabolic processes of all living organisms. As early as 1945, possible mechanisms for the evolution of such a complex metabolic system were considered. The problem is to explain the appearance and development of a highly regulated complex network of interacting proteins and substrates from a limited structural and functional repertoire. By permitting the co-analysis of phylogeny and metabolism, the combined exploitation of pathway and structural databases, as well as the use of multiple-sequence alignment search algorithms, sheds light on this problem. Much of the current research suggests a chemistry-driven ‘patchwork’ model of pathway evolution, but other mechanisms may play a role. In the future, as metabolic structure and sequence space are further explored, it should become easier to trace the finer details of pathway development and understand how complexity has evolved.

Addresses
*Department of Biochemistry and Molecular Biology, University College London, Darwin Building, Gower Street, London WC1E 6BT, UK
†Department of Crystallography, Birkbeck College, Malet Street, London WC1E 7HX, UK
‡Current address: European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; e-mail: [email protected]

Current Opinion in Structural Biology 2002, 12:374–382
0959-440X/02/$ (see front matter)
© 2002 Elsevier Science Ltd. All rights reserved.
Abbreviations
CATH    Class Architecture Topology Homology
EC      Enzyme Commission
HMM     hidden Markov model
KEGG    Kyoto Encyclopaedia of Genes and Genomes
NAD(P)  nicotinamide adenine dinucleotide (phosphate)
SCOP    Structural Classification of Proteins
SMM     small-molecule metabolism
TIM     triose phosphate isomerase
WIT     What Is There

Introduction: the evolution of metabolic pathways
A number of theories have been advanced to explain the evolution of an enzyme-catalysed metabolic network from the constituents of the prebiotic soup [1•]. In the retrograde model, proposed by Horowitz in 1945 [2], pathways evolve ‘backwards’ from a key metabolite. The model presupposes the existence of a chemical environment in which both key metabolites and potential intermediates are available. An organism heterotrophic for molecule A will use up environmental reserves of the metabolite to the point at which falling availability limits growth; in such an environment, an organism capable of synthesising molecule A from environmental precursors B and C will have a distinct selective advantage. Any mutant evolving an enzyme that catalyses this synthesis will rapidly spread through the environment; in addition, in the continued absence of environmental A, any null mutation of the evolved enzyme will be lethal, thereby favouring its preservation. In turn, as the environmental concentration of B or C drops, the process will be repeated with the similar recruitment of further enzymes. The retrograde model of pathway evolution is illustrated in Figure 1. In addition, Horowitz suggested that the simultaneous unavailability of two intermediates (say B and C) would favour symbiotic association between two mutants, one capable of synthesising B and the other of synthesising C from other environmental precursors.
However, the retrograde model of pathway evolution fails to account for the development of pathways that include labile metabolites, which could not accumulate in the environment long enough for retrograde recruitment to take place. Furthermore, the theory can only explain pathway evolution in an environment rich in metabolic intermediates; the ultimate destruction of the organic environment would prevent the evolution of pathways by retrograde evolution [1,2].

Considering the possible earlier states of biochemical systems, Ycas [3] proposed an alternative to the retrograde evolution theory in 1974. In 1976, Jensen [4] proposed his theory of pathway evolution. Jensen’s theory expands and refines that of Ycas, but, in essence, both propose that metabolic pathways evolved from a system of broad-specificity enzymes, a concept that has come to be known as the ‘patchwork evolution’ model [5]. In the patchwork model, enzymes exhibit broad substrate specificities and catalyse classes of reaction [3]. In addition to spontaneous nonenzymatic reactions, these broad specificities would mean that many metabolic chains, synthesising key metabolites, may have existed, albeit at a very low level. The duplication of genes in such pathways (advantageous because increased levels of the enzyme would generate more of the key metabolites), followed by their specialisation, would account for extant pathways (see Figure 2). Furthermore, the fortuitous evolution of a novel chemistry, together with the biological leakiness of such a system, could allow the production of a key metabolite from a novel intermediate, even if it is several enzymatic steps away from the original substrate required [4].

A number of other pathway evolution theories have been advanced (see, for example, the review by Lazcano and Miller [1•]), but the retrograde and patchwork models of pathway evolution are generally thought to be the main contenders.
Herein, we briefly survey pathway evolution theories and some of the available pathway-related resources. We then investigate recent structure-based research by considering four themes that bear upon the analysis of metabolic pathway evolution.

Pathway resources
The study of biochemical pathways is age-old and yet the advent of metabolic databases is relatively recent [6]. Such databases range from simple online reproductions of textbook pathways to complex interactive databases listing pathways, reactions, enzymes, reactants, cofactors and so on. Metabolic databases are the logical consequence of the accumulation of large amounts of biochemical and genomic data: from genomes we deduce putative enzymes and from biochemistry we derive the patterns of interaction between the enzymes and their substrates. Any large-scale investigation of metabolic pathways must exploit such repositories, so we briefly discuss some of these resources below.

Certain databases, such as the EcoCyc database, are specific to one organism [7]. Other databases pertain to many organisms, for example, the PATHWAY database from KEGG (Kyoto Encyclopaedia of Genes and Genomes) [8] and the WIT (What Is There) database [9], or focus on a particular section of metabolism, such as biocatalysis and biodegradation, as illustrated by UM-BBD (the University of Minnesota Biocatalysis/Biodegradation Database) [10]. PATHWAY and WIT employ different strategies to contend with multiple organisms. In the former, pathways are consensus views not specific to a particular organism. For each consensus pathway view, enzymes thought to exist in a particular organism can be highlighted. In WIT, consensus views exist, but pathway collections are organised by species. The EcoCyc metabolic pathway set, however, has the advantage of being thought to be complete and experimentally verified [6].
Recently, the repertoire of species integrated within the EcoCyc architecture has been extended to include 11 further species, but pathways for these were computationally derived using the PathoLogic program [7]. As for all biological databases, key research requirements remain: their public availability; accessibility to the data (e.g. the ability to download them for further analysis); a high level and quality of annotation; good coverage; and ease of integration with other databases.

Analysing pathways
The evolutionary analysis of metabolic pathways requires two key elements: pathway data (e.g. enzymes, compounds and their interactions) and (phylo)genetic data (i.e. knowledge of the genes encoding the small-molecule metabolism [SMM] enzymes and their evolutionary relationships). This information may be obtained from a variety of sources, but usually from the combined exploitation of metabolic and structural databases. Below we discuss recent relevant literature within the context of four main themes.

Figure 1. The retrograde (Horowitz) model of pathway evolution [2,30]. An organism heterotrophic for key metabolite A uses up all of the environmental supply of the metabolite. The fortuitous recruitment of an enzyme (Enz 1) capable of synthesising A from B and C confers a survival advantage to the organism. In turn, environmental concentrations of B and E drop, compensated by the recruitment of enzymes ‘Enz 2’ and ‘Enz 3’, respectively. [Schematic of the successive recruitment of Enz 1, Enz 2 and Enz 3; not reproduced.]

Detecting evolutionary relationships in small-molecule metabolism pathways
If we set aside for a moment the complexity of interactions within SMM networks, we can think of SMM as being performed by the concerted action of a number of proteins. In this ‘bag of proteins’, certain enzymes will be homologous (i.e. share a common evolutionary ancestor).
Identifying such homologues is one of the requirements for analysing pathway evolution. Pairwise comparison of protein sequences is the simplest way of detecting homology; proteins with detectable similarity are probably homologous, and those with a high percentage of sequence identity have diverged only recently from their common ancestor. Below a certain level of similarity (around 30%), homology between proteins that share only a distant common ancestor may not be detected. Two main strategies are used to detect such distantly related homologues: multiple-sequence alignment algorithms and comparison of the three-dimensional structures of proteins, which are often conserved even in the absence of detectable sequence similarity.

Figure 2. The patchwork model of pathway evolution [3,4]. In the patchwork model, enzymes may have a favoured substrate and catalytic mechanism (a), but exhibit broad substrate specificities and are capable of catalysing other reactions (b). Therefore, many metabolic chains synthesising key metabolites (e.g. yellow square) may have existed, such as the one catalysed by the olive circle, the green cross and the pink doughnut. Duplication of any gene in such a pathway (c) would be advantageous, as more of the key metabolite would be synthesised. This duplication, followed by enzyme specialisation (d), would account for extant pathways. [Schematic with panels (a) to (d); not reproduced.]

A further issue is the existence of multidomain proteins composed of two or more evolutionary units capable of independent duplication and recombination. The task is therefore to identify the domain make-up of SMM proteins and to define which of the units are evolutionarily related, grouping proteins with identical domains in the same superfamily. Domains identified in proteins of known atomic structure are classified in databases such as CATH (Class Architecture Topology Homology) [11] and SCOP (Structural Classification of Proteins) [12].
By considering sequence, structure and functional similarities, these databases distinguish between similar domains belonging to the same family (i.e. the product of divergent evolution) and domains belonging to different families. Quite often, structural databases are used in conjunction with multiple-sequence alignment methods, the latter used to identify structural domains in proteins. Tsoka and Ouzounis [13 ••] clustered the metabolic proteins of Escherichia coli into families on the basis of sequence similarity alone using the GeneRAGE package, which can automatically cluster a large protein data set [14]. GeneRAGE begins with a BLAST-based ‘all-versus-all’ comparison, and verifies BLAST assignments and putative multidomain protein divisions using a Smith–Waterman alignment algorithm. GeneRAGE clustered 548 metabolic enzymes into 405 protein families, of which 316 (57%) were single-member families. Sequence information was also combined with a comparison of the underlying metabolic networks in order to derive a ‘phylogeny of pathways’ [15]. Furthermore, Jardine et al. recently compared the ‘structural make-up’ of SMM enzymes in the prokaryote E. coli with that of SMM enzymes in the eukaryote Saccharomyces cerevisiae (see Update). Copley and Bork [16••] investigated homology among the triose phosphate isomerase (TIM) (βα)8-barrel superfamilies in SCOP and its implications for the evolution of metabolic pathways. They obtained the sequences of SCOP (βα)8-barrel proteins and detected homologies among them using PSI-BLAST. The ubiquity and diversity of (βα)8 barrels make their evolutionary relationships difficult to define, in particular with respect to distinguishing instances of convergent and divergent evolution. In the SCOP database, 23 superfamilies of (βα)8 barrels are defined; within these, members probably have a common evolutionary origin, but the SCOP curators consider that there is insufficient evidence to further merge any of these 23 superfamilies. 
Copley and Bork, however, using carefully validated PSI-BLAST searches, identified probable homology between six of these (βα)8-barrel superfamilies, all of which are phosphate binding. A further six SCOP superfamilies were linked to this canonical phosphate-binding extended superfamily on the basis of PSI-BLAST searches, structural alignments and careful analysis of key residues. As well as predicting homology between 12 of the 23 SCOP (βα)8-barrel superfamilies, Copley and Bork derived a phylogeny, based on sequence, structure and function, for the members of these 12 superfamilies that are involved in central metabolism (i.e. glycolysis, the TCA cycle, the pentose phosphate pathway, amino acid biosynthesis and nucleotide biosynthesis). They also analysed the distribution of these members in central metabolism (see below).

Figure 3. Domain families in EcoCyc pathways. The 82 EcoCyc pathways analysed by Rison et al. [21••] are ordered, from top to bottom, by the number of distinct domain families identified in their enzymes. A coloured square indicates that at least one member of the 337 domain families identified in the E. coli SMM enzymes has been detected in that pathway. These domains include ‘standard’ CATH domains (classes 1–4: mainly α, mainly β, mixed α/β and few secondary structures, respectively); CATH hyperfamilies [36] (class 77), which cluster distinct CATH superfamilies now thought to be distantly evolutionarily related (e.g. certain TIM barrels); and sequence families (classes 6 and 88). Similar diagrams for SCOP assignments to the KEGG and EcoCyc pathways can be found in [20••]. [Matrix plot: rows, EcoCyc pathway; columns, domain family; not reproduced.]

A comprehensive investigation of E. coli SMM pathways was performed by Teichmann et al. [17••,18] in order to define their structural anatomy. The study investigated 581 genes involved in 106 EcoCyc SMM pathways. Structural assignments for the proteins encoded by these genes were obtained by scanning the proteins against a library of hidden Markov models (HMMs) for SCOP domains, an assignment strategy now encapsulated in the SUPERFAMILY database [19•]. When no structural assignment was available, proteins were, when possible, clustered into sequence families. This provided domain composition and evolutionary relationship information for 510 proteins (88% of the total number).

SCOP was also used in a recent structural census of metabolic networks in E. coli [20••]: SCOP domain sequences were integrated into a nonredundant protein sequence database and E. coli SMM proteins ‘PSI-BLASTed’ against this database. 440 out of 660 proteins (71%) had at least one match to a SCOP domain.

In a recent study, we used a conceptually similar database to SUPERFAMILY to identify the evolutionary relationships among 586 E. coli SMM enzymes [21••]. The Gene3D database comprises structural assignments for whole genes and genomes in the CATH domain database [22•]. Instead of HMMs for domains, Gene3D uses PSI-BLAST profiles for CATH domains. We assigned 382 (65.1%) proteins to at least one CATH superfamily. Again, structurally unassigned sequences were clustered using sequence comparison methods and an additional 98 enzymes were classified into a sequence family, bringing the total number of evolutionarily mapped proteins to 480 (82%). A graphical overview of these assignments (inspired by Saqi and Sternberg [20••]) is shown in Figure 3.

In all of these studies, the percentage of enzymes assigned a putative structure is high. This is probably because enzymes are ‘over-represented’ in protein atomic structure databases and E. coli is a model organism. The E.
coli SMM protein repertoire was also analysed in terms of its suitability for comparative modelling, a procedure known to perform poorly below 35% identity. The distribution of percentage identities for the alignment of E. coli genes with structural matches was bimodal, peaking at 10–20% and 90–100% [20••]. This means that many SMM enzymes, even in well-characterised organisms such as E. coli, will still prove challenging to model.

Naturally, the most effective way of unequivocally detecting evolutionary relationships would be to solve the structures of all metabolic enzymes in all organisms or at least of representative examples of all SMM enzymes, an aim that may be made easier to reach using structural genomics initiatives [23,24]. For a defined set of proteins (i.e. metabolic proteins in a model organism), this should be achievable.

The domain composition of small-molecule metabolism enzymes
Domains containing both α helices and β strands (α/β domains) form by far the largest proportion of domains in SMM enzymes, a trend maintained at the level of each pathway [17••,20••,25,26]. This bias can be observed in Figure 3. The most common fold (i.e. topological arrangement of secondary structure) in SMM enzymes is the TIM (βα)8 barrel [20••]; the same census identified the three most commonly occurring superfamilies as the NAD(P)-binding Rossmann domain, the PLP-dependent transferase domain and the P-loop-containing nucleotide triphosphate hydrolase domain. Two of these are coenzyme binding and the P-loop hydrolase domain is involved in the supply of energy to a reaction [20••]. Such ‘battery domains’ are therefore critical in SMM networks.

Teichmann et al. [17••] analysed 581 SMM proteins in E. coli; 772 domains, nearly all of which were homologous to proteins of known structure, formed all or part of 510 of these proteins. From these data, the authors derived a structural anatomy of the SMM pathways.
Approximately half the SMM proteins were composed of a single domain and half were multidomain. In multidomain proteins, the repertoire of domain combinations was limited, that is, members of one domain family were often found to combine only with members of a restricted set of other domain families (usually only one or two). However, members of some versatile domain families (e.g. Rossmann NAD[P]-binders) combine with members of a large number of other domain families. For proteins with identical domain composition, the order of domains in the proteins was usually conserved. Interestingly, when using a purely sequence-based clustering method, only six two-domain proteins were identified by Tsoka and Ouzounis [13••]; this illustrates the power of sophisticated sequence and structure methods to identify relatives that are not found using simpler methods and to detect protein domains as evolutionary units.

Versatility and diversity of small-molecule metabolism
Knowledge of the evolutionary make-up of SMM pathways permits a number of analyses of the distribution of homologues within and between pathways, as well as an investigation of the properties of protein families.

Copley and Bork [16••] found that TIM (βα)8-barrel homologues were widely distributed both within and between SMM pathways, and that multiple homologues occurring in the same pathways were not necessarily adjacent enzymes (although adjacent TIM barrels were observed in tryptophan and histidine biosynthesis, and in glycolysis). Similarly, considering other homologous families, it was observed that domains within the same family were widely distributed across pathways, although the presence of homologues within pathways was observed [17••]. Homologues usually have conserved catalytic mechanisms and/or cofactor binding, whereas conservation of substrate binding with modification of chemistry was rarely observed [17••,21••].
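The restricted repertoire of domain combinations reported by Teichmann et al. [17••] can be made concrete with a small tally of which families co-occur in the same protein. The following is a minimal sketch only, not the authors' procedure: the function name `partner_families`, the protein identifiers and the family labels are all hypothetical, and each protein is assumed to be described simply by an ordered list of domain-family names.

```python
from collections import defaultdict


def partner_families(architectures):
    """Map each domain family to the set of other families it combines with.

    `architectures` maps a protein identifier to its ordered list of
    domain-family labels (hypothetical input format).
    """
    partners = defaultdict(set)
    for domains in architectures.values():
        distinct = set(domains)
        for family in distinct:
            # Every other family in the same protein is a combination partner.
            partners[family] |= distinct - {family}
    return dict(partners)


# Hypothetical toy data: one versatile family and three restricted ones.
toy_architectures = {
    "protA": ["Rossmann", "famX"],
    "protB": ["Rossmann", "famY"],
    "protC": ["famX"],
    "protD": ["Rossmann", "famZ"],
}
toy_partners = partner_families(toy_architectures)
for family in sorted(toy_partners):
    print(family, sorted(toy_partners[family]))
```

In this toy set, the versatile family has three partners and every other family has one, which mirrors the qualitative pattern described above without reproducing any real assignment.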
Using their sequence families, Tsoka and Ouzounis [13••] investigated two mirror aspects of SMM: functional versatility (i.e. the association of families with distinct reactions and pathways) and molecular diversity (i.e. the distribution of reactions and pathways across families). The authors found that 91% of the enzyme families spanned only one or two distinct Enzyme Commission (EC) numbers, with this trend even more pronounced when only the higher levels of the EC hierarchy were considered. A different picture of the functional versatility of SMM enzymes was observed when they considered participation in an SMM pathway as a description of function: the distribution ‘widened’ towards multifunctional families (i.e. families with members participating in more than one pathway). These correlations were ‘inverted’ to investigate molecular diversity: 86% of reaction types were catalysed by a single enzyme family; however, only 12% of pathways spanned a single enzyme family. To Tsoka and Ouzounis, these data suggested that functional versatility (as described by EC number) tended to be well conserved within families, a picture admittedly affected by the large number of single-member families in their data set. The reverse relationship, the number of enzyme families spanned by a pathway, suggested that biochemical pathways only require a small number of different enzyme types to be effective, again with one enzyme type multiply recruited. Saqi and Sternberg [20••] also found that the majority of families had only one or two members in the SMM repertoire, and occurred in only one or two networks, indicating specialisation for a specific biological context.

Context-based analysis of small-molecule metabolism pathways
In many analyses of SMM networks, each individual pathway is considered a separate entity and distinctions such as domain recruitment between and within pathways are made.
Nevertheless, SMM is a complex and complete network, and, ignoring irreversible reactions, any metabolite in one part of the network is theoretically ‘synthesisable’ from another. The division of the SMM network into distinct pathways is therefore arbitrary [27]. A possible way to deal with this is to ignore these divisions and consider instead SMM as a whole. In such an analysis, the concept of recruitment ‘within and between’ pathways becomes meaningless. Instead, a measure of distance between enzymes can be used, a metric that has been called pathway distance [21••] and metabolic distance [28•]. Pathway distance is a measure of the number of metabolic steps separating two enzymes. By metabolic step, we mean the enzyme-catalysed modification of one or more substrates into chemically distinct compounds [21••]. Such a metric requires a transition from the traditional metabolite-centric representation of pathways to a protein-centric one [27].

Figure 4. Homology and pathway distance. At each pathway distance, the percentage of enzyme pairs at that distance sharing homology in at least one domain is plotted (see Rison et al. [21••]). Observed percentages found by simulation to be statistically significant are in bold type. The dashed line indicates the average percentage of homologous pairs expected if SMM enzymes were randomly distributed (~1.7%). [Bar chart: x-axis, pathway distance (1–11); y-axis, percentage homologous pairs (0–6%); not reproduced.]

Pathway distance can be correlated with a number of metrics; recently, we investigated the relationship between pathway distance, protein homology and the chromosomal localisation of SMM protein encoding genes [21••]. The study revealed that metabolically close enzymes are more likely to be homologous than distant ones (see Figure 4).
This dependency was only statistically significant at short distances (1–3 steps). Beyond that distance, the number of homologous pairs observed is not significantly different from that which might be expected by chance. Overall, homologous enzymes within a metabolic neighbourhood (1–11 steps) are rare, accounting for, at most, 5% of the enzyme pairs encountered. For the homologous proteins, the most common explanation for domain duplication was conservation of chemistry, with conservation of cofactor binding a close second. The relationship between pathway distance and gene interval (i.e. the number of genes separating two SMM enzyme encoding genes on the E. coli chromosome) was also investigated (see Figure 5). There was a clear correlation between pathway distance and gene interval, with enzymes encoded by nearby genes in the E. coli genome more likely than those encoded by distant ones to be close in a pathway. This observation was neither unexpected nor novel (see, for example, Overbeek et al. [29]), but the work demonstrated this correlation to hold true for all of E. coli’s SMM. The correlation shown in Figure 5 was shown to be nearly entirely due to the clustering of metabolic genes into operons; we were observing not only an operon effect, but also a short-range effect, essentially only clustering genes that encode proteins found separated by at most four metabolic steps [21••]. Furthermore, serial recruitments (recruitment, in the same order, of two enzymes in one pathway to another), although identified, were rare, suggesting that novel pathways are not, in general, derived from block duplication of existing ones [18]. A number of other correlations were also investigated (e.g. the relationship between homology and gene interval), as well as related aspects of domain recruitment, such as the use of isozymes and the reuse of an enzyme several times within a pathway [21••]. 
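The pathway-distance analysis described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the procedure of Rison et al. [21••]: the enzyme network is assumed to be an undirected, protein-centric graph in which enzymes linked by one metabolic step are at distance 1, two enzymes count as homologous if they share at least one domain family, and the chromosome is treated as circular when counting the genes between two loci. All function names and the toy data are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def pathway_distances(adjacency, source):
    """Breadth-first search over the enzyme graph.

    Returns the number of metabolic steps from `source` to every
    reachable enzyme (adjacent enzymes are one step apart).
    """
    distance = {source: 0}
    queue = deque([source])
    while queue:
        enzyme = queue.popleft()
        for neighbour in adjacency.get(enzyme, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[enzyme] + 1
                queue.append(neighbour)
    return distance


def homologous_pair_percentages(adjacency, domain_families, max_distance=11):
    """Percentage of enzyme pairs sharing a domain family, per distance."""
    all_distances = {e: pathway_distances(adjacency, e) for e in adjacency}
    total = defaultdict(int)
    homologous = defaultdict(int)
    for a, b in combinations(sorted(adjacency), 2):
        d = all_distances[a].get(b)
        if d is None or d > max_distance:
            continue  # unreachable, or outside the metabolic neighbourhood
        total[d] += 1
        if domain_families[a] & domain_families[b]:
            homologous[d] += 1
    return {d: 100.0 * homologous[d] / total[d] for d in sorted(total)}


def gene_interval(index_a, index_b, genome_size):
    """Genes separating two distinct genes (circular chromosome assumed)."""
    gap = abs(index_a - index_b)
    return min(gap, genome_size - gap) - 1


# Hypothetical toy network: a linear chain E1 - E2 - E3 - E4.
toy_adjacency = {
    "E1": {"E2"},
    "E2": {"E1", "E3"},
    "E3": {"E2", "E4"},
    "E4": {"E3"},
}
# Hypothetical domain-family assignments for each enzyme.
toy_families = {
    "E1": {"famA"},
    "E2": {"famA", "famB"},
    "E3": {"famC"},
    "E4": {"famB"},
}
percentages = homologous_pair_percentages(toy_adjacency, toy_families)
print(percentages)
```

Assessing significance, as in Figure 4, would additionally require comparing the observed percentages with those obtained from randomised assignments of enzymes to network positions; that step is omitted from this sketch.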
The overall picture was complex, suggesting that a number of evolutionary mechanisms might occur in concert, involving not only catalytic constraints (i.e. the necessity to evolve chemically efficient networks for the production of small molecules) but also regulatory constraints (e.g. the clustering of metabolic genes into operon structures). The work also demonstrated the validity of ‘mining’ the interaction between the metabolic context, the genome context and the evolutionary context. This is well illustrated by the SNAP algorithm for finding functionally related genes [28•].

Figure 5. Pathway distance and gene intervals. The gene interval measures the number of genes separating two SMM enzyme encoding genes on the E. coli chromosome [21••]. At each pathway distance (x-axis), the percentage of enzyme pairs with a gene interval of 0–5 genes (blue diamonds), 6–500 genes (pink squares) and 501+ genes (red triangles) is plotted. The larger bins show no distinguishable trend, but the 0–5 bin shows that metabolically close enzymes tend to be encoded by chromosomally close genes. [Plot: x-axis, pathway distance (1–11); y-axis, percentage of pairs within gene interval bin; not reproduced.]

Conclusions
Horowitz’s theory of retrograde evolution is generally supposed to lead to three observable consequences: clustering of evolutionarily related proteins in metabolic pathways; within such clusters, identification of the enzyme catalysing the last step in a metabolic chain as the deepest branch (i.e. ancestral) when a phylogeny is constructed [16••]; and a tendency for substrate-driven recruitment. In 1965, Horowitz [30] restated his theory to take into account the discovery of operons. At the time, the clustering of genes involved in known pathways into operons (e.g. leucine and tryptophan), along with a consideration of the probable origin of operons, led him to suggest that operons would cluster genes with overlapping specificities favouring structural homology and common ancestry. This clustering is not, however, thought to be essential [30]. Indeed, in its purest form, “the stepwise backwards route does not demand that the enzymes are evolutionarily related” [3]. If evolutionary relatedness of recruited enzymes is not a sine qua non condition for retrograde evolution, then the theory should perhaps be thought of as an extension of the patchwork model, with both based on the ad hoc recruitment of enzymes, but driven by different selective advantages. There is the difference that the retrograde model is thought to be substrate-driven (i.e. enzymes recruited because nearby metabolic enzymes are likely to act on similar chemical moieties) and the patchwork model is thought to be chemistry-driven (i.e. enzymes recruited for their catalytic potential). Again, however, substrate-driven recruitment is not a necessary condition for the retrograde model to be valid, just an interpreted speculation. It may therefore be unwise to see these two theories as different and competing.

Some observations are nevertheless clear. Homologues are widely distributed within and between pathways. Nearly half of the sequence families identified in E. coli SMM had members spanning more than one pathway [13••]. Related TIM barrels had a diffuse distribution in SMM pathways [16••]. It was also shown that homologues were more commonly found distributed across pathways than within them [17••]. However, Saqi and Sternberg [20••] did find that the majority of superfamilies had only one or two members in the SMM repertoire and occurred only in one or two networks, suggesting that some families do specialise for a particular biological context. There was little order in the process of recruitment [17••] and, when derived for TIM barrel homologues [16••], phylogenies did not support the notion that the last enzyme in a metabolic chain was necessarily the most ancestral.
In the majority of cases, recruitment of domains conserved either chemistry or minor substrate/cofactor binding, and conservation of substrate binding with modification of catalytic activity was rarely observed [17••,21••]. This general recruitment of enzymes with similar functions (either a particular catalytic activity or specifically binding particular groups) is commented upon by Copley and Bork [16••], but interestingly these authors suggest that, over long timescales, the catalytic mechanism of enzymes does not appear to be conserved. These observations are consistent with patchwork recruitment of enzymes, a pattern of recruitment also observed in a recently evolved pathway [31].

Consideration of pathway distance [21••] supported the notion of patchwork evolution insofar as enzymes were not commonly found to be recruited from the metabolic neighbourhood: only 2.6% of enzymes within 11 metabolic steps of one another were found to be homologous. Within these 2.6%, homology was found to be more likely at short pathway distances, suggesting that some pathway distance dependent evolutionary mechanism may have been involved. However, this observation may have been due to cases of repeated reaction types carried out by homologous enzymes (e.g. repeated phosphorylation reactions carried out by homologous phosphorylases in glycogen catabolism [17••]) and these homologues need not necessarily have been recruited from one another (i.e. they might individually have been recruited from proteins more than 11 steps away).

Taken together, these observations do support patchwork evolution of SMM, with enzymes recruited with no particular order or bias. This recruitment does favour conservation of chemistry, perhaps because modifying substrate specificity is less evolutionarily costly than modifying catalysis [32–34].
In addition to the need to develop catalytically viable metabolic networks is the requirement for them to be efficiently regulated [35]. This undoubtedly generates further constraints on SMM, leading to strategies such as the use of operons, isozymes and the reuse of enzymes [21••].

Between SMM proteins, homology can be difficult to detect. Knowledge of protein structure is the most suitable means for revealing distant evolutionary relationships; it also helps shed light on the actual mechanisms of catalysis [23]. As such, the computational assignment of structures to metabolic proteins is commendable [19•,20••,22•]; actually solving the structures is even more useful [24]. We have much to gain in doing so, including perhaps finding a definitive answer to a quandary already articulated by Horowitz in 1945 [2]: how to account for the macroevolution of pathways in terms of microevolutionary steps.

Update
Jardine et al. [37] compared the enzymes in E. coli to those in the unicellular eukaryote S. cerevisiae (yeast). At most, one half to two thirds of the gene products involved in SMM are common to E. coli and yeast. The 271 enzymes that are common have been largely conserved since the separation of prokaryotes and eukaryotes: 70% of the common enzymes consist entirely of homologous domains in E. coli and yeast, and a further 20% have homologous domains linked to other domains that are unique to E. coli, yeast or both.

Acknowledgements
SCGR was funded by GlaxoSmithKline. We thank Gail Bartlett and Sarah Teichmann for useful comments on our manuscript, and Ian Sillitoe for the use of his print-matrix program used to generate Figure 3.

References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. • Lazcano A, Miller SL: On the origin of metabolic pathways. J Mol Evol 1999, 49:424-431.
Lazcano and Miller survey the principal theories of pathway evolution and propose their own theory similar to one of these. Horowitz [2] proposes the retrograde theory of pathway evolution (see also [30]), in which pathways are evolved in a direction opposite to the metabolic flow to a key metabolite. Ycas [3] and Jensen [4] propose a patchwork model of pathway evolution, with ancestral enzymes, with broad specificities and catalysing classes of reaction, forming a large network of possible pathways, and duplication and specialisation of these enzymes accounting for extant pathways. Lazcano and Miller propose the semi-enzymatic theory of the origin of metabolism; this theory is similar to the Horowitz hypothesis, but includes the use of compounds leaking from pre-existing pathways, as well as prebiotic compounds from the environment.

2. Horowitz NH: On the evolution of biochemical synthesis. Proc Natl Acad Sci USA 1945, 31:153-157.

3. Ycas M: On earlier states of the biochemical system. J Theor Biol 1974, 44:145-160.

4. Jensen RA: Enzyme recruitment in evolution of new function. Annu Rev Microbiol 1976, 30:409-425.

5. Lazcano A, Miller SL: The origin and early evolution of life: prebiotic chemistry, the pre-RNA world, and time. Cell 1996, 85:793-798.

6. Karp PD: Metabolic databases. Trends Biochem Sci 1998, 23:114-116.

7. Karp PD, Riley M, Saier M, Paulsen IT, Collado-Vides J, Paley SM, Pellegrini-Toole A, Bonavides C, Gama-Castro S: The EcoCyc database. Nucleic Acids Res 2002, 30:56-58.

8. Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucleic Acids Res 2002, 30:42-46.

9. Overbeek R, Larsen N, Pusch GD, D’Souza M, Selkov E Jr, Kyrpides N, Fonstein M, Maltsev N, Selkov E: WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res 2000, 28:123-125.

10. Ellis LB, Hershberger CD, Bryan EM, Wackett LP: The University of Minnesota Biocatalysis/Biodegradation Database: emphasizing enzymes. Nucleic Acids Res 2001, 29:340-343.

11. Pearl FM, Martin N, Bray JE, Buchan DW, Harrison AP, Lee D, Reeves GA, Shepherd AJ, Sillitoe I, Todd AE: A rapid classification protocol for the CATH Domain Database to support structural genomics. Nucleic Acids Res 2001, 29:223-227.

12. Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res 2002, 30:264-267.

13. •• Tsoka S, Ouzounis CA: Functional versatility and molecular diversity of the metabolic map of Escherichia coli. Genome Res 2001, 11:1503-1510.
This paper presents an analysis of E. coli metabolic networks. Metabolic enzymes were clustered into sequence families and the distribution of these families across metabolic pathways was investigated. The distribution of reaction types and pathways across sequence families was also investigated.

14. Enright AJ, Ouzounis CA: GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics 2000, 16:451-457.

15. Forst CV, Schulten K: Phylogenetic analysis of metabolic pathways. J Mol Evol 2001, 52:471-489.

16. •• Copley RR, Bork P: Homology among (βα)8 barrels: implications for the evolution of metabolic pathways. J Mol Biol 2000, 303:627-641.
The authors present an in-depth analysis of (βα)8 barrels. SCOP identifies 23 distinct superfamilies of (βα)8 barrels; however, Copley and Bork, using carefully validated PSI-BLAST searches, detected homology between 12 of these. The distribution of members of these 12 superfamilies involved in central metabolism was investigated. Members are found to be widely distributed within and between pathways, in a pattern suggesting patchwork evolution. A manually derived phylogeny of central metabolism (βα)8 barrels further confirms this pattern.

17. •• Teichmann SA, Rison SCG, Thornton JM, Riley M, Gough J, Chothia C: The evolution and structural anatomy of the small molecule metabolic pathways in Escherichia coli. J Mol Biol 2001, 311:693-708.
This paper describes a large-scale analysis of E. coli SMM (see also [18]). 510 of the 581 different proteins that form part of the SMM pathways of E. coli were computationally assigned at least one sequence or structure domain. These assignments were used to define evolutionary relationships between these proteins. Combinations of domains within proteins were investigated and the authors showed that members of most families only combine in multidomain proteins with one or two other domains. A few more versatile families combine with many other domains. The distribution of domains within and across pathways was investigated, with domains more commonly distributed across pathways; however, block recruitment of enzymes is rarely observed. These observations suggest a ‘mosaic’ model for the formation of pathways.

18. Teichmann SA, Rison SCG, Thornton JM, Riley M, Gough J, Chothia C: Small-molecule metabolism: an enzyme mosaic. Trends Biotechnol 2001, 19:482-486.

19. • Gough J, Chothia C: SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucleic Acids Res 2002, 30:268-272.
See annotation to [22•].

20. •• Saqi MAS, Sternberg MJE: A structural census of metabolic networks for E. coli. J Mol Biol 2001, 313:1195-1206.
In this paper, the authors perform a structural survey of E. coli metabolic networks. The paper principally exploits the KEGG database (see [8]). 21 pathways are found to have a structural coverage of 50% or more. Levels of sequence identity suggest that many of the proteins computationally assigned a SCOP domain will nevertheless prove challenging to model. A few of the superfamilies are found in many pathways, but the authors suggest that a particular superfamily has specificity for a particular pathway.

21. •• Rison SCG, Teichmann SA, Thornton JM: Homology, pathway distance and chromosomal localisation of the small molecule metabolism enzymes in Escherichia coli. J Mol Biol 2002, 318:911-932.
This paper expands on work presented in [17••]. It makes use of a pathway distance metric: a measure of the number of metabolic steps separating two enzymes. This metric is correlated to homology and gene interval (i.e. the number of genes separating two enzyme-encoding genes on the E. coli chromosome). The analyses suggest chemistry-driven patchwork evolution of pathways, but indicate that other mechanisms may also be involved, some probably related to the need to evolve tight control over metabolism. Additionally, the clustering of enzyme-encoding genes is discussed and the rationales behind the use of isozymes and the reuse of enzymes investigated.

22. • Buchan DWA, Shepherd AJ, Lee D, Pearl F, Rison SCG, Thornton JM, Orengo CA: Gene 3D: structural assignment for whole genes and genomes in the CATH domain structure database. Genome Res 2002, 12:503-514.
Gene3D and SUPERFAMILY [19•] are both resources derived from structural databases, respectively, CATH and SCOP. A library of models for each superfamily in the database was generated. The Gene3D library is composed of PSI-BLAST profiles, whereas SUPERFAMILY uses HMMs. Sequences may be scanned against these libraries to obtain structural assignments. In both Gene3D and SUPERFAMILY, structural assignments for complete genomes are available.

23. Erlandsen H, Abola EE, Stevens RC: Combining structural genomics and enzymology: completing the picture in metabolic pathways and enzyme active sites. Curr Opin Struct Biol 2000, 10:719-730.

24. Bonanno JB, Edo C, Eswar N, Pieper U, Romanowski MJ, Ilyin V, Gerchman SE, Kycia H, Studier FW, Sali A, Burley SK: Structural genomics of enzymes involved in sterol/isoprenoid biosynthesis. Proc Natl Acad Sci USA 2001, 98:12896-12901.

25. Martin AC, Orengo CA, Hutchinson EG, Jones S, Karmirantzou M, Laskowski RA, Mitchell JB, Taroni C, Thornton JM: Protein folds and functions. Structure 1998, 6:875-884.

26. Hegyi H, Gerstein M: The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J Mol Biol 1999, 288:147-164.

27. Gerrard JA, Sparrow AD, Wells JA: Metabolic databases – what next? Trends Biochem Sci 2001, 26:137-140.

28. • Kolesov G, Mewes HW, Frishman D: SNAPping up functionally related genes based on context information: a colinearity-free approach. J Mol Biol 2001, 311:639-656.
In this elegant paper, the authors present a computational approach to finding genes that are functionally related, but do not possess any noticeable sequence similarity. Orthologous genes in different genomes are connected by S-edges and adjacent genes in the same genome are connected by N-edges. Closed graphs alternating S-edges and N-edges, known as SN-cycles, are found to be very likely to connect functionally related genes.

29. Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 1999, 96:2896-2901.

30. Horowitz NH: The evolution of biochemical synthesis – retrospect and prospect. In Evolving Genes and Proteins. Edited by Bryson V, Vogel H. New York: Academic Press Inc; 1965:15-23.

31. Copley SD: Evolution of a metabolic pathway for degradation of a toxic xenobiotic: the patchwork approach. Trends Biochem Sci 2000, 25:261-265.

32. Petsko GA, Kenyon GL, Gerlt JA, Ringe D, Kozarich JW: On the origin of enzymatic species. Trends Biochem Sci 1993, 18:372-376.

33. Gerlt JA, Babbitt PC: Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu Rev Biochem 2001, 70:209-246.

34. Todd AE, Orengo CA, Thornton JM: Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol 2001, 307:1113-1143.

35. van der Meer JR: Evolution of novel metabolic pathways for the degradation of chloroaromatic compounds. Antonie Van Leeuwenhoek 1997, 71:159-178.

36. Pearl FM, Lee D, Bray JE, Buchan DW, Shepherd AJ, Orengo CA: The CATH extended protein-family database: providing structural annotations for genome sequences. Protein Sci 2002, 11:233-244.

37. Jardine O, Gough J, Chothia C, Teichmann SA: Comparison of the small molecule metabolic enzymes of Escherichia coli and Saccharomyces cerevisiae. Genome Res 2002, in press.