Pathway evolution, structurally speaking, Current Opinion in

Pathway evolution, structurally speaking
Stuart CG Rison* and Janet M Thornton*†‡
Small-molecule metabolism forms the core of the metabolic
processes of all living organisms. As early as 1945, possible
mechanisms for the evolution of such a complex metabolic
system were considered. The problem is to explain the
appearance and development of a highly regulated complex
network of interacting proteins and substrates from a limited
structural and functional repertoire. By permitting the
co-analysis of phylogeny and metabolism, the combined
exploitation of pathway and structural databases, as well as the
use of multiple-sequence alignment search algorithms, sheds
light on this problem. Much of the current research suggests a
chemistry-driven ‘patchwork’ model of pathway evolution, but
other mechanisms may play a role. In the future, as metabolic
structure and sequence space are further explored, it should
become easier to trace the finer details of pathway
development and understand how complexity has evolved.
Addresses
*Department of Biochemistry and Molecular Biology, University College
London, Darwin Building, Gower Street, London WC1E 6BT, UK
† Department of Crystallography, Birkbeck College, Malet Street,
London WC1E 7HX, UK
‡ Current address: European Bioinformatics Institute, Wellcome Trust
Genome Campus, Hinxton, Cambridge CB10 1SD, UK;
e-mail: [email protected]
Current Opinion in Structural Biology 2002, 12:374–382
0959-440X/02/$ — see front matter
© 2002 Elsevier Science Ltd. All rights reserved.
Abbreviations
CATH    Class Architecture Topology Homology
EC      Enzyme Commission
HMM     hidden Markov model
KEGG    Kyoto Encyclopaedia of Genes and Genomes
NAD(P)  nicotinamide adenine dinucleotide (phosphate)
SCOP    Structural Classification of Proteins
SMM     small-molecule metabolism
TIM     triose phosphate isomerase
WIT     What Is There
Introduction: the evolution of metabolic
pathways
A number of theories have been advanced to explain the
evolution of an enzyme-catalysed metabolic network from
the constituents of the prebiotic soup [1•].
In the retrograde model, proposed by Horowitz in 1945 [2],
pathways evolve ‘backwards’ from a key metabolite.
The model presupposes the existence of a chemical environment in which both key metabolites and potential
intermediates are available. An organism heterotrophic for
molecule A will use up environmental reserves of the
metabolite to the point at which falling availability limits
growth; in such an environment, an organism capable of
synthesising molecule A from environmental precursors B
and C will have a distinct selective advantage. Any mutant
evolving an enzyme that catalyses this synthesis will rapidly
spread through the environment; in addition, in the
continued absence of environmental A, any null mutation
of the evolved enzyme will be lethal, thereby favouring its
preservation. In turn, as the environmental concentration
of B or C drops, the process will be repeated with the
similar recruitment of further enzymes. The retrograde
model of pathway evolution is illustrated in Figure 1. In
addition, Horowitz suggested that the simultaneous
unavailability of two intermediates (say B and C) would
favour symbiotic association between two mutants, one
capable of synthesising B and the other of synthesising C
from other environmental precursors.
However, the retrograde model of pathway evolution fails
to account for the development of pathways that include
labile metabolites, which could not accumulate in the
environment long enough for retrograde recruitment to
take place. Furthermore, the theory can only explain
pathway evolution in an environment rich in metabolic
intermediates; the ultimate destruction of the organic
environment would prevent the evolution of pathways by
retrograde evolution [1,2].
Considering the possible earlier states of biochemical
systems, Ycas [3] proposed an alternative to the retrograde
evolution theory in 1974. In 1976, Jensen [4] proposed his
theory of pathway evolution. Jensen’s theory expands and
refines that of Ycas, but, in essence, both propose that
metabolic pathways evolved from a system of broad-specificity enzymes, a concept that has come to be known
as the ‘patchwork evolution’ model [5].
In the patchwork model, enzymes exhibit broad substrate
specificities and catalyse classes of reaction [3]. In addition
to spontaneous nonenzymatic reactions, these broad
specificities would mean that many metabolic chains,
synthesising key metabolites, may have existed, albeit at a
very low level. The duplication of genes in such pathways
(advantageous because increased levels of the enzyme
would generate more of the key metabolites), followed by
their specialisation, would account for extant pathways
(see Figure 2). Furthermore, the fortuitous evolution of a
novel chemistry, together with the biological leakiness of
such a system, could allow the production of a key metabolite
from a novel intermediate, even if it is several enzymatic
steps away from the original substrate required [4].
A number of other pathway evolution theories have been
advanced (see, for example, the review by Lazcano and
Miller [1•]), but the retrograde and patchwork models of
pathway evolution are generally thought to be the main
contenders. Herein, we briefly survey pathway evolution
theories and some of the available pathway-related
resources. We then investigate recent structure-based
research by considering four themes that bear upon the
analysis of metabolic pathway evolution.
Pathway resources
The study of biochemical pathways is age-old and yet the
advent of metabolic databases is relatively recent [6].
Such databases range from simple online reproductions of
textbook pathways to complex interactive databases listing
pathways, reactions, enzymes, reactants, cofactors and so
on. Metabolic databases are the logical consequence of the
accumulation of large amounts of biochemical and genomic
data: from genomes we deduce putative enzymes and
from biochemistry we derive the patterns of interaction
between the enzymes and their substrates. Any large-scale
investigation of metabolic pathways must exploit such
repositories, so we briefly discuss some of these resources
below. Certain databases, such as the EcoCyc database, are
specific to one organism [7]. Other databases pertain to
many organisms, for example, the PATHWAY database
from KEGG (Kyoto Encyclopaedia of Genes and
Genomes) [8] and the WIT (What Is There) database [9],
or focus on a particular section of metabolism, such as
biocatalysis and biodegradation, as illustrated by UM-BBD
(the University of Minnesota Biocatalysis/Biodegradation
Database) [10]. PATHWAY and WIT employ different
strategies to contend with multiple organisms. In the
former, pathways are consensus views not specific to a
particular organism. For each consensus pathway view,
enzymes thought to exist in a particular organism can be
highlighted. In WIT, consensus views exist, but pathway
collections are organised by species. The EcoCyc metabolic
pathway set, however, has the advantage of being thought
to be complete and experimentally verified [6]. Recently,
the repertoire of species integrated within the EcoCyc
architecture has been extended to include 11 further species,
but pathways for these were computationally derived using
the PathoLogic program [7]. As for all biological databases,
key research requirements remain: their public availability;
accessibility to the data (e.g. the ability to download them
for further analysis); a high level and quality of annotation;
good coverage; and ease of integration with other databases.
Analysing pathways
The evolutionary analysis of metabolic pathways requires
two key elements: pathway data (e.g. enzymes, compounds and their interactions) and (phylo)genetic data
(i.e. knowledge of the genes encoding the small-molecule
metabolism [SMM] enzymes and their evolutionary relationships). This information may be obtained from a variety
of sources, but usually from the combined exploitation of
metabolic and structural databases. Below we discuss recent
relevant literature within the context of four main themes.
Figure 1. The retrograde (Horowitz) model of pathway evolution [2,30].
An organism heterotrophic for key metabolite A uses up all of the
environmental supply of the metabolite. The fortuitous recruitment
of an enzyme (Enz 1) capable of synthesising A from B and C confers
a survival advantage to the organism. In turn, environmental
concentrations of B and E drop, compensated by the recruitment of
enzymes ‘Enz 2’ and ‘Enz 3’, respectively.
Detecting evolutionary relationships in small-molecule
metabolism pathways
If we set aside for a moment the complexity of interactions
within SMM networks, we can think of SMM as being
performed by the concerted action of a number of proteins.
In this ‘bag of proteins’, certain enzymes will be homologous
(i.e. share a common evolutionary ancestor). Identifying
such homologues is one of the requirements for analysing
pathway evolution. Pairwise comparison of protein
sequences is the simplest way of detecting homology;
proteins with detectable similarity probably are homologous — proteins with a high percentage of sequence
identity having diverged only recently from the common
ancestor. Below a certain level of similarity (around 30%),
homology between proteins and a distant common ancestor
may not be detected. Two main strategies are used to
detect such distantly related homologues: multiple-sequence alignment algorithms and comparison of the
three-dimensional structures of proteins, which are often
conserved even in the absence of detectable sequence
similarity. A further issue is the existence of multidomain
proteins composed of two or more evolutionary units
capable of independent duplication and recombination.
The task is therefore to identify the domain make-up of
SMM proteins and to define which of the units are
evolutionarily related, grouping proteins with identical
domains in the same superfamily.
Figure 2. The patchwork model of pathway evolution
[3,4]. In the patchwork model, enzymes may
have a favoured substrate and catalytic
mechanism (a), but exhibit broad substrate
specificities and are capable of catalysing
other reactions (b). Therefore, many metabolic
chains synthesising key metabolites
(e.g. yellow square) may have existed, such as
the one catalysed by the olive circle, the
green cross and the pink doughnut.
Duplication of any gene in such a pathway
(c) would be advantageous, as more of the
key metabolite would be synthesised. This
duplication, followed by enzyme specialisation
(d), would account for extant pathways.
Domains identified in
proteins of known atomic structure are classified in
databases such as CATH (Class Architecture Topology
Homology) [11] and SCOP (Structural Classification of
Proteins) [12]. By considering sequence, structure and
functional similarities, these databases distinguish between
similar domains belonging to the same family (i.e. the
product of divergent evolution) and domains belonging to
different families. Quite often, structural databases are used
in conjunction with multiple-sequence alignment methods,
the latter used to identify structural domains in proteins.
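As a minimal sketch of the pairwise comparison discussed above, the following computes percentage identity over the aligned (non-gap) columns of two pre-aligned sequences and flags pairs that fall below the roughly 30% level mentioned in the text. The function names, the gap character and the toy sequences are assumptions made for illustration; none of them comes from the cited studies, which rely on alignment programs and statistical scores rather than raw identity alone.

```python
IDENTITY_THRESHOLD = 30.0  # approximate level quoted in the text (per cent)


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if res_a == res_b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def needs_sensitive_search(aligned_a: str, aligned_b: str) -> bool:
    """True when identity is low enough that profile or structure methods may be needed."""
    return percent_identity(aligned_a, aligned_b) < IDENTITY_THRESHOLD


if __name__ == "__main__":
    # Hypothetical toy alignment, not real protein sequences.
    print(percent_identity("ACDE-GH", "ACDFWGH"))
    print(needs_sensitive_search("ACDEFGHIKL", "AYYYYYYYYY"))
```

Low identity does not rule out homology; it only indicates that simple pairwise comparison is unlikely to detect it.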
Tsoka and Ouzounis [13••] clustered the metabolic
proteins of Escherichia coli into families on the basis of
sequence similarity alone using the GeneRAGE package,
which can automatically cluster a large protein data set
[14]. GeneRAGE begins with a BLAST-based ‘all-versus-all’
comparison, and verifies BLAST assignments and putative
multidomain protein divisions using a Smith–Waterman
alignment algorithm. GeneRAGE clustered 548 metabolic
enzymes into 405 protein families, of which 316 (57%)
were single-member families. Sequence information was
also combined with a comparison of the underlying
metabolic networks in order to derive a ‘phylogeny of
pathways’ [15]. Furthermore, Jardine et al. recently
compared the ‘structural make-up’ of SMM enzymes in
the prokaryote E. coli with that of SMM enzymes in the
eukaryote Saccharomyces cerevisiae (see Update).
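The idea of grouping proteins into sequence families from an all-versus-all comparison can be sketched as follows. This is a deliberately simplified single-linkage clustering over a list of already accepted pairwise hits, using a union-find structure; it is not the GeneRAGE algorithm, which additionally verifies hits by Smith–Waterman alignment and splits multidomain proteins. The protein identifiers and hit list are hypothetical.

```python
from collections import defaultdict


def cluster_families(proteins, hits):
    """Group proteins into families: two proteins share a family if linked by a chain of hits."""
    parent = {p: p for p in proteins}

    def find(x):
        # Walk to the root, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = defaultdict(set)
    for p in proteins:
        families[find(p)].add(p)
    return list(families.values())


if __name__ == "__main__":
    # Hypothetical proteins and significant pairwise hits.
    proteins = ["p1", "p2", "p3", "p4"]
    hits = [("p1", "p2"), ("p2", "p3")]
    families = cluster_families(proteins, hits)
    singletons = [f for f in families if len(f) == 1]
    print(len(families), "families,", len(singletons), "single-member")
```

Single linkage tends to merge families through shared domains in multidomain proteins, which is one reason a domain-aware method is needed in practice.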
Copley and Bork [16••] investigated homology among the
triose phosphate isomerase (TIM) (βα)8-barrel superfamilies
in SCOP and its implications for the evolution of metabolic
pathways. They obtained the sequences of SCOP
(βα)8-barrel proteins and detected homologies among
them using PSI-BLAST. The ubiquity and diversity of
(βα)8 barrels make their evolutionary relationships difficult
to define, in particular with respect to distinguishing
instances of convergent and divergent evolution. In the
SCOP database, 23 superfamilies of (βα)8 barrels are defined;
within these, members probably have a common evolutionary
origin, but the SCOP curators consider that there is
insufficient evidence to further merge any of these
23 superfamilies. Copley and Bork, however, using carefully
validated PSI-BLAST searches, identified probable
homology between six of these (βα)8-barrel superfamilies,
all of which are phosphate binding. A further six SCOP
superfamilies were linked to this canonical phosphate-binding extended superfamily on the basis of PSI-BLAST
searches, structural alignments and careful analysis of key
residues. As well as predicting homology between 12 of
the 23 SCOP (βα)8-barrel superfamilies, Copley and Bork
derived a phylogeny, based on sequence, structure and
function, for the members of these 12 superfamilies that
are involved in central metabolism (i.e. glycolysis, the
TCA cycle, the pentose phosphate pathway, amino acid
biosynthesis and nucleotide biosynthesis). They also
analysed the distribution of these members in central
metabolism (see below).
Figure 3. Domain families in EcoCyc pathways. The 82 EcoCyc pathways
analysed by Rison et al. [21••] are ordered, from top to bottom, by
the number of distinct domain families identified in their enzymes.
A coloured square indicates that at least one member of the
337 domain families identified in the E. coli SMM enzymes has
been detected in that pathway. These domains include ‘standard’
CATH domains (classes 1–4: mainly α, mainly β, mixed α/β and few
secondary structures, respectively); CATH hyperfamilies [36]
(class 77), which cluster distinct CATH superfamilies now thought
to be distantly evolutionarily related (e.g. certain TIM barrels);
and sequence families (classes 6 and 88). Similar diagrams for SCOP
assignments to the KEGG and EcoCyc pathways can be
found in [20••]. Axes: domain family (x); EcoCyc pathway (y).
A comprehensive investigation of E. coli SMM pathways
was performed by Teichmann et al. [17••,18] in order to
define their structural anatomy. The study investigated
581 genes involved in 106 EcoCyc SMM pathways.
Structural assignments for the proteins encoded by these
genes were obtained by scanning the proteins against a
library of hidden Markov models (HMMs) for SCOP
domains, an assignment strategy now encapsulated in
the SUPERFAMILY database [19•]. When no structural
assignment was available, proteins were, when possible,
clustered into sequence families. This provided domain
composition and evolutionary relationship information for
510 proteins (88% of the total number). SCOP was also
used in a recent structural census of metabolic networks in
E. coli [20••]: SCOP domain sequences were integrated
into a nonredundant protein sequence database and E. coli
SMM proteins ‘PSI-BLASTed’ against this database. 440 out
of 660 proteins (71%) had at least one match to a SCOP
domain. In a recent study, we used a conceptually similar
database to SUPERFAMILY to identify the evolutionary
relationships among 586 E. coli SMM enzymes [21••]. The
Gene3D database comprises structural assignments for
whole genes and genomes in the CATH domain database
[22•]. Instead of HMMs for domains, Gene3D uses
PSI-BLAST profiles for CATH domains. We assigned
382 (65.1%) proteins to at least one CATH superfamily.
Again, structurally unassigned sequences were clustered
using sequence comparison methods and an additional
98 enzymes were classified into a sequence family, bringing
the total number of evolutionarily mapped proteins to 480
(82%). A graphical overview of these assignments (inspired
by Saqi and Sternberg [20••]) is shown in Figure 3.
In all of these studies, the percentage of enzymes assigned
a putative structure is high. This is probably because
enzymes are ‘over-represented’ in protein atomic structure
databases and E. coli is a model organism. The E. coli SMM
protein repertoire was also analysed in terms of its suitability
for comparative modelling, a procedure known to perform
poorly below 35% identity. The distribution of percentage
identities for the alignment of E. coli genes with structural
matches was bimodal, peaking at 10–20% and 90–100%
[20••]. This means that many SMM enzymes, even in
well-characterised organisms such as E. coli, will still prove
challenging to model. Naturally, the most effective way of
unequivocally detecting evolutionary relationships would
be to solve the structures of all metabolic enzymes in
all organisms or at least of representative examples of all
SMM enzymes — an aim that may be made easier to reach
using structural genomics initiatives [23,24]. For a defined
set of proteins (i.e. metabolic proteins in a model organism),
this should be achievable.
The domain composition of small-molecule
metabolism enzymes
Domains containing both α helices and β strands (α/β
domains) form by far the largest proportion of domains in
SMM enzymes, a trend maintained at the level of each
pathway [17••,20••,25,26]. This bias can be observed in
Figure 3. The most common fold (i.e. topological
arrangement of secondary structure) in SMM enzymes is
the TIM (βα)8 barrel [20••]; the same census identified
the three most commonly occurring superfamilies as the
NAD(P)-binding Rossmann domain, the PLP-dependent
transferase domain and the P-loop-containing nucleotide
triphosphate hydrolase domain. Two of these are coenzyme
binding and the P-loop hydrolase domain is involved
in the supply of energy to a reaction [20••]. Such ‘battery
domains’ are therefore critical in SMM networks.
Teichmann et al. [17••] analysed 581 SMM proteins in
E. coli; 772 domains, nearly all of which were homologous
to proteins of known structure, formed all or part of 510 of
these proteins. From these data, the authors derived a
structural anatomy of the SMM pathways. Approximately
half the SMM proteins were composed of a single domain
and half were multidomain. In multidomain proteins, the
repertoire of domain combinations was limited, that is,
members of one domain family were often found to
combine only with members of a restricted set of other
domain families (usually only one or two). However,
members of some versatile domain families (e.g. Rossmann
NAD[P]-binders) combine with members of a large
number of other domain families. For proteins with
identical domain composition, the order of domains in the
proteins was usually conserved. Interestingly, when using
a purely sequence-based clustering method, only six
two-domain proteins were identified by Tsoka and
Ouzounis [13••] — this illustrates the power of sophisticated
sequence and structure methods to identify relatives that
are not found using simpler methods and to detect protein
domains as evolutionary units.
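The observation that most domain families combine with only one or two partner families, whereas a few versatile families combine with many, can be tabulated with a short script. The sketch below assumes each protein is given as an ordered list of domain family identifiers; the protein names and family labels are hypothetical and are not taken from the data of Teichmann et al.

```python
from itertools import combinations


def partner_families(protein_domains):
    """Map each domain family to the set of other families it co-occurs with in a protein."""
    partners = {}
    for domains in protein_domains.values():
        families = set(domains)
        for fam in families:
            partners.setdefault(fam, set())  # record families seen only in single-domain proteins
        for fam_a, fam_b in combinations(families, 2):
            partners[fam_a].add(fam_b)
            partners[fam_b].add(fam_a)
    return partners


if __name__ == "__main__":
    # Hypothetical domain architectures, listed from N to C terminus.
    protein_domains = {
        "protA": ["rossmann", "catA"],
        "protB": ["rossmann", "catB"],
        "protC": ["catC"],
        "protD": ["rossmann", "catC"],
    }
    for family, others in sorted(partner_families(protein_domains).items()):
        print(family, len(others), sorted(others))
```

In this toy input the hypothetical "rossmann" family plays the role of a versatile family, with three partners, while every other family has a single partner.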
Versatility and diversity of small-molecule metabolism
Knowledge of the evolutionary make-up of SMM pathways
permits a number of analyses of the distribution of
homologues within and between pathways, as well as an
investigation of the properties of protein families.
Copley and Bork [16••] found that TIM (βα)8-barrel homologues were widely distributed both within and between
SMM pathways, and that multiple homologues occurring in
the same pathways were not necessarily adjacent enzymes
(although adjacent TIM barrels were observed in
tryptophan and histidine biosynthesis, and in glycolysis).
Similarly, considering other homologous families, it was
observed that domains within the same family were widely
distributed across pathways, although the presence of
homologues within pathways was observed [17••].
Homologues usually have conserved catalytic mechanisms
and/or cofactor binding, whereas conservation of substrate
binding with modification of chemistry was rarely observed
[17••,21••]. Using their sequence families, Tsoka and
Ouzounis [13••] investigated two mirror aspects of SMM:
functional versatility (i.e. the association of families with
distinct reactions and pathways) and molecular diversity
(i.e. the distribution of reactions and pathways across families).
The authors found that 91% of the enzyme families
spanned only one or two distinct Enzyme Commission
(EC) numbers, with this trend even more pronounced
when only the higher levels of the EC hierarchy were
considered. A different picture of the functional versatility
of SMM enzymes was observed when they considered
participation in an SMM pathway as a description of function:
the distribution ‘widened’ towards multifunctional families
(i.e. families with members participating in more than one
pathway). These correlations were ‘inverted’ to investigate
molecular diversity: 86% of reaction types were catalysed
by a single enzyme family; however, only 12% of pathways
spanned a single enzyme family. To Tsoka and Ouzounis,
these data suggested that functional versatility (as
described by EC number) tended to be well conserved
within families — a picture admittedly affected by the large
number of single-member families in their data set. The
reverse relationship, the number of enzyme families
spanned by a pathway, suggested that biochemical pathways
only require a small number of different enzyme types to
be effective, again with one enzyme type multiply recruited.
Saqi and Sternberg [20••] also found that the majority of
families had only one or two members in the SMM repertoire, and occurred in only one or two networks, indicating
specialisation for a specific biological context.
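The two ‘mirror’ views described above, functional versatility (family to reactions) and molecular diversity (reaction to families), amount to a mapping and its inverse. The sketch below illustrates this with a small, hypothetical family-to-EC-number table; the family names and their assignments are invented for illustration, and the helper names are assumptions rather than anything defined by Tsoka and Ouzounis.

```python
from collections import defaultdict


def invert(mapping):
    """Turn {key: set(values)} into {value: set(keys)}."""
    inverted = defaultdict(set)
    for key, values in mapping.items():
        for value in values:
            inverted[value].add(key)
    return dict(inverted)


def fraction_with_at_most(mapping, limit):
    """Fraction of keys whose value set has at most `limit` members."""
    if not mapping:
        return 0.0
    return sum(1 for values in mapping.values() if len(values) <= limit) / len(mapping)


if __name__ == "__main__":
    # Hypothetical assignments of EC numbers to sequence families.
    family_to_ec = {
        "famA": {"1.1.1.1"},
        "famB": {"2.7.1.1", "2.7.1.2"},
        "famC": {"1.1.1.1", "4.2.1.1", "5.3.1.1"},
    }
    ec_to_family = invert(family_to_ec)
    # Versatility: share of families spanning one or two EC numbers.
    print(fraction_with_at_most(family_to_ec, 2))
    # Diversity: share of reactions catalysed by a single family.
    print(fraction_with_at_most(ec_to_family, 1))
```

The same two functions apply unchanged if pathway membership is used in place of EC numbers as the description of function.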
Context-based analysis of small-molecule metabolism
pathways
In many analyses of SMM networks, each individual
pathway is considered a separate entity and distinctions
such as domain recruitment between and within pathways
are made. Nevertheless, SMM is a complex and complete
network, and, ignoring irreversible reactions, any metabolite
in one part of the network is theoretically ‘synthesisable’
from another. The division of the SMM network into
distinct pathways is therefore arbitrary [27]. A possible way
to deal with this is to ignore these divisions and consider
instead SMM as a whole. In such an analysis, the concept
of recruitment ‘within and between’ pathways becomes
meaningless. Instead, a measure of distance between
enzymes can be used, a metric that has been called pathway
distance [21••] and metabolic distance [28•]. Pathway
distance is a measure of the number of metabolic steps
separating two enzymes. By metabolic step, we mean the
enzyme-catalysed modification of one or more substrates
into chemically distinct compounds [21••]. Such a metric
requires a transition from the traditional metabolite-centric
representation of pathways to a protein-centric one [27].
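One way to make pathway distance concrete is to build a protein-centric graph, in which two enzymes are linked when a product of one is a substrate of the other, and then count steps by breadth-first search. The sketch below follows that reading. Treating links as undirected and counting directly linked enzymes as distance 1 are assumptions of this example, as are all enzyme and metabolite names; a real analysis would also have to decide how to treat ubiquitous metabolites, which this toy ignores.

```python
from collections import deque


def enzyme_graph(reactions):
    """Build an undirected enzyme graph from {enzyme: (substrates, products)}."""
    graph = {enzyme: set() for enzyme in reactions}
    enzymes = list(reactions)
    for i, first in enumerate(enzymes):
        subs_1, prods_1 = (set(x) for x in reactions[first])
        for second in enzymes[i + 1:]:
            subs_2, prods_2 = (set(x) for x in reactions[second])
            # Link enzymes that hand a metabolite from one to the other.
            if prods_1 & subs_2 or prods_2 & subs_1:
                graph[first].add(second)
                graph[second].add(first)
    return graph


def pathway_distance(graph, start, goal):
    """Number of metabolic steps between two enzymes, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


if __name__ == "__main__":
    # Hypothetical linear pathway A -> B -> C -> D plus an unrelated reaction.
    reactions = {
        "E1": (["A"], ["B"]),
        "E2": (["B"], ["C"]),
        "E3": (["C"], ["D"]),
        "E4": (["X"], ["Y"]),
    }
    graph = enzyme_graph(reactions)
    print(pathway_distance(graph, "E1", "E3"))
    print(pathway_distance(graph, "E1", "E4"))
```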
Figure 4. Homology and pathway distance. At each
pathway distance, the percentage of enzyme
pairs at that distance sharing homology in at
least one domain is plotted (see Rison et al.
[21••]). Observed percentages found by
simulation to be statistically significant are in
bold type. The dashed line indicates the
average percentage of homologous pairs
expected if SMM enzymes were randomly
distributed (~1.7%). Axes: pathway distance, 1–11 (x);
percentage homologous pairs, 0–6% (y). Bar labels (%):
5.00, 3.87, 2.59, 2.45, 2.33, 1.55, 1.06, 1.05, 1.14, 1.02, 0.79.
Pathway distance can be correlated with a number of
metrics; recently, we investigated the relationship between
pathway distance, protein homology and the chromosomal
localisation of SMM protein encoding genes [21••]. The
study revealed that metabolically close enzymes are more
likely to be homologous than distant ones (see Figure 4).
This dependency was only statistically significant at short
distances (1–3 steps). Beyond that distance, the number of
homologous pairs observed is not significantly different
from that which might be expected by chance. Overall,
homologous enzymes within a metabolic neighbourhood
(1–11 steps) are rare, accounting for, at most, 5% of the
enzyme pairs encountered. For the homologous proteins,
the most common explanation for domain duplication was
conservation of chemistry, with conservation of cofactor
binding a close second.
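The kind of calculation behind such a plot can be sketched as follows: for each pathway distance, count the enzyme pairs at that distance and the share of them that are homologous, then compare the observed share with what label shuffling gives by chance. The source states only that significance was assessed by simulation, so the shuffling scheme here is an assumption, as is the simplification of one family per enzyme (the study considered homology in at least one domain). All identifiers and data are hypothetical.

```python
import random
from collections import defaultdict


def percent_homologous_by_distance(distances, family):
    """distances: {(enzyme1, enzyme2): steps}; family: {enzyme: family id}."""
    total = defaultdict(int)
    homologous = defaultdict(int)
    for (first, second), steps in distances.items():
        total[steps] += 1
        if family[first] == family[second]:
            homologous[steps] += 1
    return {steps: 100.0 * homologous[steps] / total[steps] for steps in total}


def shuffled_p_value(distances, family, steps, trials=1000, seed=0):
    """Fraction of label shufflings giving at least the observed percentage at `steps`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = percent_homologous_by_distance(distances, family).get(steps, 0.0)
    enzymes = list(family)
    labels = [family[e] for e in enzymes]
    at_least = 0
    for _ in range(trials):
        rng.shuffle(labels)  # randomly reassign families to enzymes
        shuffled = dict(zip(enzymes, labels))
        if percent_homologous_by_distance(distances, shuffled).get(steps, 0.0) >= observed:
            at_least += 1
    return at_least / trials


if __name__ == "__main__":
    # Hypothetical enzymes, family labels and pairwise pathway distances.
    family = {"e1": "F1", "e2": "F1", "e3": "F2"}
    distances = {("e1", "e2"): 1, ("e2", "e3"): 1, ("e1", "e3"): 2}
    print(percent_homologous_by_distance(distances, family))
    print(shuffled_p_value(distances, family, steps=1))
```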
The relationship between pathway distance and gene
interval (i.e. the number of genes separating two SMM
enzyme encoding genes on the E. coli chromosome) was
also investigated (see Figure 5). There was a clear correlation between pathway distance and gene interval, with
enzymes encoded by nearby genes in the E. coli genome
more likely than those encoded by distant ones to be close
in a pathway. This observation was neither unexpected nor
novel (see, for example, Overbeek et al. [29]), but the work
demonstrated this correlation to hold true for all of E. coli’s
SMM. The correlation in Figure 5 was found to be almost
entirely due to the clustering of metabolic genes into
operons. The effect was also short-range: clustering was
essentially confined to genes encoding proteins separated by
at most four metabolic steps [21••]. Furthermore, serial recruitments
(recruitment, in the same order, of two enzymes in one
pathway to another), although identified, were rare, suggesting
that novel pathways are not, in general, derived from block
duplication of existing ones [18].
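The gene interval used above, the number of genes separating two SMM enzyme encoding genes, can be computed from gene order along the chromosome and then placed in the three bins used in Figure 5 (0–5, 6–500 and 501+ genes). The sketch below assumes genes are indexed by position and, because the E. coli chromosome is circular, optionally measures the interval the short way round; whether the original analysis did so is not stated in the text, and the gene count in the example is a hypothetical round number.

```python
def gene_interval(pos_a: int, pos_b: int, n_genes: int, circular: bool = True) -> int:
    """Number of genes lying between two genes, given their positional indices."""
    if pos_a == pos_b:
        raise ValueError("positions must refer to two different genes")
    gap = abs(pos_a - pos_b)
    if circular:
        gap = min(gap, n_genes - gap)  # take the shorter arc of the chromosome
    return gap - 1  # adjacent genes are separated by zero genes


def interval_bin(interval: int) -> str:
    """Assign a gene interval to one of the three bins used in Figure 5."""
    if interval <= 5:
        return "0-5"
    if interval <= 500:
        return "6-500"
    return "501+"


if __name__ == "__main__":
    n_genes = 4000  # hypothetical genome size for the example
    for a, b in [(10, 11), (0, 3999), (100, 900)]:
        interval = gene_interval(a, b, n_genes)
        print(a, b, interval, interval_bin(interval))
```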
A number of other correlations were also investigated
(e.g. the relationship between homology and gene interval),
as well as related aspects of domain recruitment, such as
the use of isozymes and the reuse of an enzyme several
times within a pathway [21••]. The overall picture was
complex, suggesting that a number of evolutionary
mechanisms might occur in concert, involving not only
catalytic constraints (i.e. the necessity to evolve chemically
efficient networks for the production of small molecules)
but also regulatory constraints (e.g. the clustering of
metabolic genes into operon structures).
The work also demonstrated the validity of ‘mining’ the
interaction between the metabolic context, the genome
context and the evolutionary context. This is well illustrated by the SNAP algorithm for finding functionally
related genes [28•].
Conclusions
Horowitz’s theory of retrograde evolution is generally
supposed to lead to three observable consequences:
clustering of evolutionarily related proteins in metabolic
pathways; within such clusters, identification of the
enzyme catalysing the last step in a metabolic chain as the
deepest branch (i.e. ancestral) when a phylogeny is
constructed [16••]; and a tendency for substrate-driven
recruitment. In 1965, Horowitz [30] restated his theory to
take into account the discovery of operons. At the time, the
clustering of genes involved in known pathways into
operons (e.g. leucine and tryptophan), along with a consideration of the probable origin of operons, led him to
suggest that operons would cluster genes with overlapping
specificities favouring structural homology and common
ancestry. This clustering is not, however, thought to be
essential [30]. Indeed, in its purest form, “the stepwise
backwards route does not demand that the enzymes are
evolutionarily related” [3]. If evolutionary relatedness of
recruited enzymes is not a sine qua non condition for
retrograde evolution, then the theory should perhaps be
thought of as an extension of the patchwork model, with both
based on the ad hoc recruitment of enzymes but driven by
different selective advantages. There is the difference that
the retrograde model is thought to be substrate-driven
(i.e. enzymes recruited because nearby metabolic enzymes
are likely to act on similar chemical moieties) and the
patchwork model is thought to be chemistry-driven
(i.e. enzymes recruited for their catalytic potential). Again,
however, substrate-driven recruitment is not a necessary
condition for the retrograde model to be valid, just an
interpreted speculation. It may therefore be unwise to see
these two theories as different and competing.
Figure 5. Pathway distance and gene intervals. The
gene interval measures the number of genes
separating two SMM enzyme encoding genes
on the E. coli chromosome [21••]. At each
pathway distance (x-axis), the percentage of
enzyme pairs with a gene interval of
0–5 genes (blue diamonds), 6–500 genes
(pink squares) and 501+ genes (red triangles)
is plotted. The larger bins show no
distinguishable trend, but the 0–5 bin shows
that metabolically close enzymes tend to be
encoded by chromosomally close genes.
Axes: pathway distance, 1–11 (x); percentage pairs
within gene interval bin (y).
Some observations are nevertheless clear. Homologues are
widely distributed within and between pathways. Nearly
half of the sequence families identified in E. coli SMM had
members spanning more than one pathway [13••]. Related
TIM barrels had a diffuse distribution in SMM pathways
[16••]. It was also shown that homologues were more
commonly distributed across pathways than within
them [17••]. However, Saqi and Sternberg [20••] did
find that the majority of superfamilies had only one or two
members in the SMM repertoire and occurred only in
one or two networks — suggesting that some families do
specialise for a particular biological context. There was
little order in the process of recruitment [17••] and, when
derived for TIM barrel homologues [16••], phylogenies did
not support the notion that the last enzyme in a metabolic
chain was necessarily the most ancestral. In the majority of
cases, recruitment of domains conserved either chemistry
or minor substrate/cofactor binding, and conservation of
substrate binding with modification of catalytic activity
was rarely observed [17••, 21••]. This general recruitment
of enzymes with similar functions (either a particular
catalytic activity or specifically binding particular groups)
is commented upon by Copley and Bork [16••], but interestingly these authors suggest that, over long timescales,
the catalytic mechanism of enzymes does not appear to be
conserved. These observations are consistent with patchwork recruitment of enzymes, a pattern of recruitment also
observed in a recently evolved pathway [31].
Consideration of pathway distance [21••] supported the
notion of patchwork evolution insofar as enzymes were not
commonly found to be recruited from the metabolic
neighbourhood: only 2.6% of enzymes within 11 metabolic
steps of one another were found to be homologous. Within
these 2.6%, homology was found to be more likely at short
pathway distances, suggesting that some pathway distance
dependent evolutionary mechanism may have been involved.
However, this observation may have been due to cases of
repeated reaction types carried out by homologous enzymes
(e.g. repeated phosphorylation reactions carried out by
homologous phosphorylases in glycogen catabolism [17••])
and these homologues need not necessarily have been
recruited from one another (i.e. they might individually have
been recruited from proteins more than 11 steps away).
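The pathway distance analysis described above can be illustrated with a minimal sketch. It assumes that pathway distance is the shortest number of metabolic steps between two enzymes in an undirected enzyme graph, found by breadth-first search; the source only defines the metric as the number of steps separating two enzymes, so this choice, the function names and the four-enzyme toy network are all hypothetical and are not taken from [21••].

```python
from collections import deque
from itertools import combinations


def pathway_distances(graph, source):
    """Breadth-first search: steps from `source` to every reachable enzyme."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def homologous_fraction(graph, family, max_steps):
    """Fraction of enzyme pairs within `max_steps` that share a family."""
    all_dist = {enzyme: pathway_distances(graph, enzyme) for enzyme in graph}
    close_pairs = 0
    homologous_pairs = 0
    for a, b in combinations(sorted(graph), 2):
        d = all_dist[a].get(b)  # None if b is unreachable from a
        if d is not None and d <= max_steps:
            close_pairs += 1
            if family[a] == family[b]:
                homologous_pairs += 1
    return homologous_pairs / close_pairs if close_pairs else 0.0


# Hypothetical toy network (assumption): a linear chain E1 - E2 - E3 - E4,
# with every enzyme listed as a key and edges given in both directions.
toy_graph = {
    "E1": ["E2"],
    "E2": ["E1", "E3"],
    "E3": ["E2", "E4"],
    "E4": ["E3"],
}
# Hypothetical family assignments (assumption): only E1 and E3 are homologous.
toy_family = {"E1": "famA", "E2": "famB", "E3": "famA", "E4": "famC"}

fraction_within_two = homologous_fraction(toy_graph, toy_family, max_steps=2)
```

In this toy chain, five enzyme pairs lie within two steps of one another and one of them (E1 and E3) is homologous. The same counting applied to real pathway and family assignments, with a cutoff of 11 steps, is the kind of calculation that would yield a figure comparable to the 2.6% quoted in the text.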
Taken together, these observations do support patchwork
evolution of SMM, with enzymes recruited with no
particular order or bias. This recruitment does favour
conservation of chemistry, perhaps because modifying
substrate specificity is less evolutionarily costly than
modifying catalysis [32–34]. Beyond the need to
develop catalytically viable metabolic networks, there is a
requirement for them to be efficiently regulated [35]. This
undoubtedly generates further constraints on SMM,
leading to strategies such as the use of operons, isozymes
and the reuse of enzymes [21••].
Between SMM proteins, homology can be difficult to
detect. Knowledge of protein structure is the most suitable
means for revealing distant evolutionary relationships; it
also helps shed light on the actual mechanisms of catalysis
[23]. As such, the computational assignment of structures
to metabolic proteins is commendable [19•,20••,22•];
actually solving the structures is even more useful [24]. We
have much to gain in doing so, including perhaps finding a
definitive answer to a quandary already articulated by
Horowitz in 1945 [2]: how to account for the macroevolution
of pathways in terms of microevolutionary steps.
Update
Jardine et al. [37] compared the enzymes in E. coli to those
in the unicellular eukaryote S. cerevisiae (yeast). At most,
one half to two thirds of the gene products involved in
SMM are common to E. coli and yeast. The 271 enzymes
that are common have been largely conserved since the
separation of prokaryotes and eukaryotes: 70% of the
common enzymes consist entirely of homologous domains
in E. coli and yeast, and a further 20% have homologous
domains linked to other domains that are unique to E. coli,
yeast or both.
Acknowledgements
SCGR was funded by GlaxoSmithKline. We thank Gail Bartlett and
Sarah Teichmann for useful comments on our manuscript, and Ian Sillitoe
for the use of his print-matrix program to generate Figure 3.
References and recommended reading
Papers of particular interest, published within the annual period of review,
have been highlighted as:
• of special interest
•• of outstanding interest
1. Lazcano A, Miller SL: On the origin of metabolic pathways. J Mol
• Evol 1999, 49:424-431.
Lazcano and Miller survey the principal theories of pathway evolution and
propose their own theory similar to one of these. Horowitz [2] proposes the
retrograde theory of pathway evolution (see also [30]), in which pathways are
evolved in a direction opposite to the metabolic flow to a key metabolite. Ycas
[3] and Jensen [4] propose a patchwork model of pathway evolution, with
ancestral enzymes, with broad specificities and catalysing classes of reaction,
forming a large network of possible pathways, and duplication and specialisation of these enzymes accounting for extant pathways. Lazcano and Miller propose the semi-enzymatic theory of the origin of metabolism; this theory is similar
to the Horowitz hypothesis, but includes the use of compounds leaking from
pre-existing pathways, as well as prebiotic compounds from the environment.
2. Horowitz NH: On the evolution of biochemical synthesis. Proc Natl
Acad Sci USA 1945, 31:153-157.
3. Ycas M: On earlier states of the biochemical system. J Theor Biol
1974, 44:145-160.
4. Jensen RA: Enzyme recruitment in evolution of new function. Annu
Rev Microbiol 1976, 30:409-425.
5. Lazcano A, Miller SL: The origin and early evolution of life:
prebiotic chemistry, the pre-RNA world, and time. Cell 1996,
85:793-798.
6. Karp PD: Metabolic databases. Trends Biochem Sci 1998,
23:114-116.
7. Karp PD, Riley M, Saier M, Paulsen IT, Collado-Vides J, Paley SM,
Pellegrini-Toole A, Bonavides C, Gama-Castro S: The EcoCyc
database. Nucleic Acids Res 2002, 30:56-58.
8. Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases
at GenomeNet. Nucleic Acids Res 2002, 30:42-46.
9. Overbeek R, Larsen N, Pusch GD, D’Souza M, Selkov E Jr,
Kyrpides N, Fonstein M, Maltsev N, Selkov E: WIT: integrated
system for high-throughput genome sequence analysis
and metabolic reconstruction. Nucleic Acids Res 2000,
28:123-125.
10. Ellis LB, Hershberger CD, Bryan EM, Wackett LP: The University of
Minnesota Biocatalysis/Biodegradation Database: emphasizing
enzymes. Nucleic Acids Res 2001, 29:340-343.
11. Pearl FM, Martin N, Bray JE, Buchan DW, Harrison AP, Lee D,
Reeves GA, Shepherd AJ, Sillitoe I, Todd AE: A rapid classification
protocol for the CATH Domain Database to support structural
genomics. Nucleic Acids Res 2001, 29:223-227.
12. Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG:
SCOP database in 2002: refinements accommodate structural
genomics. Nucleic Acids Res 2002, 30:264-267.
13. Tsoka S, Ouzounis CA: Functional versatility and molecular
•• diversity of the metabolic map of Escherichia coli. Genome Res
2001, 11:1503-1510.
This paper presents an analysis of E. coli metabolic networks. Metabolic
enzymes were clustered into sequence families and the distribution of these
families across metabolic pathways was investigated. The distribution of
reaction types and pathways across sequence families was also investigated.
14. Enright AJ, Ouzounis CA: GeneRAGE: a robust algorithm for
sequence clustering and domain detection. Bioinformatics 2000,
16:451-457.
15. Forst CV, Schulten K: Phylogenetic analysis of metabolic
pathways. J Mol Evol 2001, 52:471-489.
16. Copley RR, Bork P: Homology among (βα)8 barrels: implications
•• for the evolution of metabolic pathways. J Mol Biol 2000,
303:627-641.
The authors present an in-depth analysis of (βα)8 barrels. SCOP identifies
23 distinct superfamilies of (βα)8 barrels; however, Copley and Bork, using
carefully validated PSI-BLAST searches, detected homology between 12 of
these. The distribution of members of these 12 superfamilies involved in central
metabolism was investigated. Members are found to be widely distributed
within and between pathways, in a pattern suggesting patchwork evolution.
A manually derived phylogeny of central metabolism (βα)8 barrels further
confirms this pattern.
17. Teichmann SA, Rison SCG, Thornton JM, Riley M, Gough J,
•• Chothia C: The evolution and structural anatomy of the small
molecule metabolic pathways in Escherichia coli. J Mol Biol 2001,
311:693-708.
This paper describes a large-scale analysis of E. coli SMM (see also [18]).
510 of the 581 different proteins that form part of the SMM pathways of
E. coli were computationally assigned at least one sequence or structure
domain. These assignments were used to define evolutionary relationships
between these proteins. Combinations of domains within proteins were
investigated and the authors showed that members of most families only
combine in multidomain proteins with one or two other domains. A few more
versatile families combine with many other domains. The distribution of
domains within and across pathways was investigated, with domains more
commonly distributed across pathways; however, block recruitment of
enzymes is rarely observed. These observations suggest a ‘mosaic’ model
for the formation of pathways.
18. Teichmann SA, Rison SCG, Thornton JM, Riley M, Gough J, Chothia C:
Small-molecule metabolism: an enzyme mosaic. Trends
Biotechnol 2001, 19:482-486.
19. Gough J, Chothia C: SUPERFAMILY: HMMs representing all
• proteins of known structure. SCOP sequence searches, alignments
and genome assignments. Nucleic Acids Res 2002, 30:268-272.
See annotation to [22•].
20. Saqi MAS, Sternberg MJE: A structural census of metabolic
•• networks for E. coli. J Mol Biol 2001, 313:1195-1206.
In this paper, the authors perform a structural survey of E. coli metabolic
networks. The paper principally exploits the KEGG database (see [8]). 21
pathways are found to have a structural coverage of 50% or more. Levels of
sequence identity suggest that many of the proteins computationally
assigned a SCOP domain will nevertheless prove challenging to model. A
few of the superfamilies are found in many pathways, but the authors
suggest that a particular superfamily has specificity for a particular pathway.
21. Rison SCG, Teichmann SA, Thornton JM: Homology, pathway distance
•• and chromosomal localisation of the small molecule metabolism
enzymes in Escherichia coli. J Mol Biol 2002, 318:911-932.
This paper expands on work presented in [17••]. It makes use of a pathway
distance metric: a measure of the number of metabolic steps separating two
enzymes. This metric is correlated to homology and gene interval (i.e. the
number of genes separating two enzyme-encoding genes on the E. coli
chromosome). The analyses suggest chemistry-driven patchwork evolution
of pathways, but indicate that other mechanisms may also be involved, some
probably related to the need to evolve tight control over metabolism.
Additionally, the clustering of enzyme-encoding genes is discussed and the
rationales behind the use of isozymes and the reuse of enzymes investigated.
22. Buchan DWA, Shepherd AJ, Lee D, Pearl F, Rison SCG, Thornton JM,
• Orengo CA: Gene 3D: structural assignment for whole genes and
genomes in the CATH domain structure database. Genome Res
2002, 12:503-514.
Gene3D and SUPERFAMILY [19•] are both resources derived from structural
databases, respectively, CATH and SCOP. A library of models for each
superfamily in the database was generated. The Gene3D library is composed
of PSI-BLAST profiles, whereas SUPERFAMILY uses HMMs. Sequences
may be scanned against these libraries to obtain structural assignments. In
both Gene3D and SUPERFAMILY, structural assignments for complete
genomes are available.
23. Erlandsen H, Abola EE, Stevens RC: Combining structural genomics
and enzymology: completing the picture in metabolic pathways
and enzyme active sites. Curr Opin Struct Biol 2000, 10:719-730.
24. Bonanno JB, Edo C, Eswar N, Pieper U, Romanowski MJ, Ilyin V,
Gerchman SE, Kycia H, Studier FW, Sali A, Burley SK: Structural
genomics of enzymes involved in sterol/isoprenoid biosynthesis.
Proc Natl Acad Sci USA 2001, 98:12896-12901.
25. Martin AC, Orengo CA, Hutchinson EG, Jones S, Karmirantzou M,
Laskowski RA, Mitchell JB, Taroni C, Thornton JM: Protein folds and
functions. Structure 1998, 6:875-884.
26. Hegyi H, Gerstein M: The relationship between protein structure
and function: a comprehensive survey with application to the
yeast genome. J Mol Biol 1999, 288:147-164.
27. Gerrard JA, Sparrow AD, Wells JA: Metabolic databases – what
next? Trends Biochem Sci 2001, 26:137-140.
28. Kolesov G, Mewes HW, Frishman D: SNAPping up functionally
• related genes based on context information: a colinearity-free
approach. J Mol Biol 2001, 311:639-656.
In this elegant paper, the authors present a computational approach to finding
genes that are functionally related, but do not possess any noticeable
sequence similarity. Orthologous genes in different genomes are connected
by S-edges and adjacent genes in the same genome are connected by
N-edges. Closed graphs alternating S-edges and N-edges, known as
SN-cycles, are found to be very likely to connect functionally related genes.
29. Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N: The use
of gene clusters to infer functional coupling. Proc Natl Acad Sci
USA 1999, 96:2896-2901.
30. Horowitz NH: The evolution of biochemical synthesis – retrospect
and prospect. In Evolving Genes and Proteins. Edited by Bryson V,
Vogel H. New York: Academic Press Inc; 1965:15-23.
31. Copley SD: Evolution of a metabolic pathway for degradation of a
toxic xenobiotic: the patchwork approach. Trends Biochem Sci
2000, 25:261-265.
32. Petsko GA, Kenyon GL, Gerlt JA, Ringe D, Kozarich JW: On the
origin of enzymatic species. Trends Biochem Sci 1993,
18:372-376.
33. Gerlt JA, Babbitt PC: Divergent evolution of enzymatic function:
mechanistically diverse superfamilies and functionally distinct
suprafamilies. Annu Rev Biochem 2001, 70:209-246.
34. Todd AE, Orengo CA, Thornton JM: Evolution of function in protein
superfamilies, from a structural perspective. J Mol Biol 2001,
307:1113-1143.
35. van der Meer JR: Evolution of novel metabolic pathways for the
degradation of chloroaromatic compounds. Antonie Van
Leeuwenhoek 1997, 71:159-178.
36. Pearl FM, Lee D, Bray JE, Buchan DW, Shepherd AJ, Orengo CA:
The CATH extended protein-family database: providing
structural annotations for genome sequences. Protein Sci 2002,
11:233-244.
37. Jardine O, Gough J, Chothia C, Teichmann SA: Comparison of the
small molecule metabolic enzymes of Escherichia coli and
Saccharomyces cerevisiae. Genome Res 2002, in press.