The Relationship Between Domain Duplication and

ARTICLE IN PRESS
doi:10.1016/j.jmb.2004.11.050
J. Mol. Biol. (2005) xx, 1–11
The Relationship Between Domain Duplication and
Recombination
Christine Vogel*, Sarah A. Teichmann and Jose Pereira-Leal
MRC Laboratory of Molecular
Biology, Hills Rd, Cambridge
CB2 2QH, UK
Protein domains represent the basic evolutionary units that form proteins.
Domain duplication and shuffling by recombination are probably the most
important forces driving protein evolution and hence the complexity of the
proteome. While the duplication of whole genes as well as domainencoding exons increases the abundance of domains in the proteome,
domain shuffling increases versatility, i.e. the number of distinct contexts in
which a domain can occur. Here, we describe a comprehensive, genomewide analysis of the relationship between these two processes.
We observe a strong and robust correlation between domain versatility
and abundance: domains that occur more often also have many different
combination partners. This supports the view that domain recombination
occurs in a random way. However, we do not observe all the different
combinations that are expected from a simple random recombination
scenario, and this is due to frequent duplication of specific domain
combinations. When we simulate the evolution of the protein repertoire
considering stochastic recombination of domains followed by extensive
duplication of the combinations, we approximate the observed data well.
Our analyses are consistent with a stochastic process that governs
domain recombination and thus protein divergence with respect to
domains within a polypeptide chain. At the same time, they support a
scenario in which domain combinations are formed only once during the
evolution of the protein repertoire, and are then duplicated to various
extents. The extent of duplication of different combinations varies widely
and, in nature, will depend on selection for the domain combination based
on its function. Some of the pair-wise domain combinations that are highly
duplicated also recur frequently with other partner domains, and thus
represent evolutionary units larger than single protein domains, which we
term “supra-domains”.
q 2004 Elsevier Ltd. All rights reserved.
*Corresponding author
Keywords: domain recombination; neutral evolution; domain versatility;
domain shuffling
Introduction
Protein domains represent the basic evolutionary
units that form proteins.1,2 A domain is defined here
as a structural unit observed in nature either in
isolation or in more than one context in multidomain proteins.2 Examination of their properties
provides a key to understanding the evolution of
proteomes and of an organism’s complexity. The
protein domains examined here are classified into
Abbreviation used: SCOP, structural classification of
proteins.
E-mail address of the corresponding author:
[email protected]
superfamilies2,3 of domains with a common ancestor. Thus, two proteins with domains of a common
superfamily are likely to be evolutionarily related,
originating by processes such as duplication and/or
shuffling of whole genes or domain-encoding
exons.4
Figure 1(a) illustrates how an organism’s protein
repertoire can be represented as a collection of
proteins with different domain architectures. The
combinations of domains in an organism can also be
represented as a network (Figure 1(b)) in which the
nodes denote different domain superfamilies and
the edges denote that the two domains are adjacent
to each other in a protein with arrows pointing
from N to C terminus. These domain combination
0022-2836/$ - see front matter q 2004 Elsevier Ltd. All rights reserved.
ARTICLE IN PRESS
2
Domain Duplication and Recombination
Figure 1. The duplication and recombination of domains form the protein repertoire. The Figure illustrates, from a
domain perspective, the main processes that formed the protein repertoire and which are discussed here: the duplication
and recombination of domains. (a) Proteome of an organism, consisting of eight different proteins (black lines) with
domains of different superfamilies represented as boxes of different colours. (b) An alternative representation of the
proteome: the colour of a circle represents a particular superfamily, the size of the circle corresponds to the abundance of
the domains of that superfamily in the genome. The connections between the circles denote domain combinations.
(c) Abundance is the result of a variety of duplication processes. The abundance of each domain superfamily is the
number of domains in the proteins of the proteome, disregarding tandem repeats. (d) Versatility is the number of
different N and C-terminal partner domains of a particular superfamily that are observed. The example shows the
combinations of the blue superfamily: it occurs with one N-terminal and two different C-terminal partner superfamilies.
networks are known to be scale-free, which means
that a few superfamilies are connected to many
different superfamilies, while most superfamilies
are adjacent to only one or two types of neighbours.5,6
Duplication, transposition and horizontal gene
transfer result in the expansion of domain superfamilies in terms of their abundance; that is, the
number of occurrences of a domain in the proteome
of one organism (Figure 1(c)). Protein and domain
duplication have been studied extensively and have
been connected with an increase in an organism’s
complexity in evolution, for example expansions of
particular families in higher eukaryotes.7–15
Domains can recombine to form multi-domain
proteins, and proteins with two or more domains
constitute the majority of proteins in all organisms
studied.16,17 Thus, the recombination of existing
domains may be a major mechanism that modifies
protein function18 and increases proteome
complexity.16,19 The combination or shuffling of
domains increases what we term the versatility of a
domain superfamily; that is, the number of different
partner domains that domains of a particular
superfamily are adjacent to (Figure 1(d)).
The exact interplay of the mechanisms that
underlie the evolution of the protein repertoire is
still not completely resolved. In particular, it is
unclear to what extent the recombination of protein
domains, or domain versatility, is under selection,
or if it is the result of a purely random shuffling
process.20 Next, we discuss the arguments for both
mechanisms.
The frequency distribution of domain family
sizes, of protein length measured as number of
domains per protein and the distribution of domain
versatility are all best approximated by a power-law
function.5,6,21 These mathematical relationships are
often taken as support of a neutral or random mode
of evolution.5,20–23 However, it is still unclear to
what extent a random model holds true for detailed
examples.
The observation that the number of domain
combinations in nature is only a small fraction of
the possible number of combinations suggests that
domain recombination is under strong selection.5,24
In fact, several lines of evidence support this view.
First, eukaryotic proteins have more different
domains per protein than prokaryotic ones.11,16,25–27
They also have more different domain combinations
than simpler organisms.5,11,16,27,28 This means that
individual domain superfamilies are observed with
a greater variety of combination partners in higher
organisms. However, eukaryotes also have larger
genomes and higher rates of retention of duplicated
genes, which questions the extent to which these
observations are evidence for selection. Furthermore, a detailed analysis of the number of
co-occurring domains in human and other
eukaryotes showed no change in versatility in
terms of co-occurring domains, but just a change
in the repertoire.15
ARTICLE IN PRESS
Domain Duplication and Recombination
Next, domains with “generic” or especially
“useful” functions clearly are the most versatile
superfamilies in many organisms.5,6 These are, for
example, cofactor or co-substrate-binding domains,
like the P-loop nucleotide triphosphate hydrolase
or the NADP(P)-binding Rossmann domains.
Protein–protein interaction domains are also
found in multi-domain proteins adjacent to a
variety of other domains, and these different
combinations are used to regulate distinct aspects
of cellular organization.29 These domain superfamilies are, however, also very abundant in
genomes,30 thus it is unclear if their great versatility
is suggestive of selection or just a consequence of
their high abundance.
Furthermore, domains that co-occur in proteins
are more likely to display similar function19 or
localisation31 than domains in separate proteins,
and may support selection acting on domain
combinations. However, this could be due to a
bias in the function classification schemes towards
annotation of whole proteins rather than domains.32
Finally, since some highly abundant folds tend to
have particular structures,33 it is possible that the
three-dimensional structure of a domain may
impose constraints on its ability to combine with
other domains and hence reduce its versatility.
However, to the best of our knowledge no such
constraints have been observed.
In summary, although there is support for
selective forces acting on the recombination of
protein domains, there are also observations that
suggest a random process. Here, we present a
comprehensive, genome-wide study of domain
recombination, testing the hypothesis that it is a
random process. We use a data-driven approach
combined with simulations to show that the
domain combinations observed are consistent
with stochastic recombination together with
differential duplication of domain combinations.
Duplication of domain combinations is much more
common than invention of new combinations.
Selection at this level has resulted in a few highly
duplicated, very abundant domain combinations in
genomes, while the bulk of domain combinations
are rare, occurring in only a few proteins.
Results and Discussion
The abundance of domain superfamilies
correlates with their number of combination
partners
In a random scenario, more trials result in higher
success rates. For domains, the domain abundance
represents the number of trials for recombination.
This means that the more abundant domains would
be expected to be “luckier” and have more
combination partners than the less abundant
domains. This is illustrated in Figure 1(b) where
the number of edges pointing to and from a node
3
Figure 2. Domain versatility and abundance. The
abundance and versatility of all superfamilies is plotted
on a log–log-scale. Each point represents the abundance
and versatility of a particular superfamily. The plot shows
the data for human, and the distribution is very similar
for all other eukaryotes. In bacteria and archaea, the
abundance and versatility is much smaller, thus a
relationship between the two measures is much less
obvious. The plotted data fit a power-law function with
VwAq and qobsZ0.42G0.04 (r2Z0.75) best, compared to
an exponential, linear or logarithmic function.
(connectivity) would be expected to correlate with
the size of the node.
A random scenario would also imply that the
abundance of a domain alone should determine
how often it is found with a different combination
partner. Thus, we compared abundance and versatility of each superfamily in various genomes, and
the data for the human proteome is plotted in
Figure 2. We observe that the absolute versatility
(V; see Methods and Data Sets for definitions)
correlates positively with its absolute abundance
(A). This relationship is best approximated by a
power-law, where VwAq, with qZ0.42 (r2Z0.75).
The finding of a power-law relationship between
abundance and versatility is consistent with the null
hypothesis of a random process of domain combination. However, as different domains may differ in
their propensity for retention after duplication and
recombination, this result does not exclude selection at the level of domain recombination.
In order to further investigate the hypothesis that
domain recombination is a random process we
study the robustness of the observation that
versatility correlates with abundance. If domain
recombination is indeed a random process, then we
expect the relationship between versatility and
abundance of any group of protein domains to be
constant. In order to investigate the robustness of
this relationship, we define the exponent q as the
relative versatility of a protein domain, describing
the relationship between the absolute versatility of a
domain and its absolute abundance. Note that q is
identical with the slope of the fitted line in Figure 2.
We observe that the relative versatility is roughly
the same for each of the bacterial, archaeal or
ARTICLE IN PRESS
4
Domain Duplication and Recombination
Figure 3. The relationship between versatility and abundance across different domain properties. Versatility is plotted
as a function of abundance similar to Figure 2. In each panel, the superfamily data on 42 different genomes is grouped
according to four different criteria: structural class and phylogeny (affiliation to kingdom) domain molecular function
and biological process. See Methods and Data Sets for a detailed description of the groupings. Left: each panel shows the
lines fitted through the superfamily groupings. Right: a graph displays the slopes q of the fitted lines on the left together
with their standard deviations.
ARTICLE IN PRESS
Domain Duplication and Recombination
eukaryote genomes in our data set; similar to the
human data shown in Figure 2.
The versatility relative to abundance is constant
Next, we consider the factors that can affect
domain versatility, as discussed in the Introduction.
We ask whether distinct groups of domains,
defined according to their function, structure and
evolutionary history, display different relative
versatilities. Such differences would reveal selection, as some combinations would be more duplicated, thus having higher abundances without the
corresponding increase in versatility expected from
a random scenario.
In order to address this hypothesis, we test
whether the relationship VwAq holds true in
particular subsets of domains and, whether the
exponent q is constant. If the relative rates of
domain duplication and combination are dictated
by characteristics of their component domains, one
would expect that: (i) the power-law relationship
does not hold for one or more of the domain groups;
or that (ii) the slope qx for domains of group x is
different from the slope qy for domains of group y.
This will tell us whether the relative rate is different
for distinct groups of domains.
We divide protein domains according to structural classes and folds, phylogenetic affiliation,
molecular function and biological process as
described in Methods and Data Sets. We examine
the relationship between versatility and
abundance in each subset (Figure 3). For each
grouping, we observe a remarkable consistency of
(i) the power-law function being the best fit to the
data and (ii) the slope q, the relative versatility
across all subgroups (Figure 3). Thus, the relationship between versatility and abundance is very
similar in all subsets of domains.
This confirms that the versatility of a domain in a
particular genome can be predicted relatively
accurately purely from its abundance in that
genome, independently of other properties of the
domain. The combination of domains into novel
contexts is constant relative to its duplication. With
respect to the illustration in Figure 1(b), this means
that the connectivity of the node is correlated with
its size.
Given the vast variety of domain functions,
structures and other characteristics considered
here, this is a surprising result. It is consistent
with the hypothesis that the versatility of protein
domains is the result of a stochastic process in
which more abundant domains naturally have a
higher chance of combining with different partner
domains than less abundant ones. In other words,
our findings can be explained by a random model of
domain shuffling; and the null hypothesis of
random recombination cannot be rejected.
S0: a random model of domain combination
The dependence of domain versatility on
5
abundance suggests that new combinations of
domains occur in a random or neutral fashion,
independently of any other particular circumstances.
If this is the case, random shuffling of domains given
their distribution of abundances should result in a
similar absolute versatility for individual superfamilies and thus a similar relationship between
abundance and versatility as that observed. We term
this model S0.
We performed 10,000 random shuffling experiments where the abundance of each domain, the
number and their size in terms of domains are
maintained for each genome considered. The
expected maximal versatility we obtained in this
random shuffling is also related to abundance by a
power-law function (Figure 4, blue), which supports a
random model. Highly abundant domains are more
versatile, and conversely, less abundant domains are
less versatile.
However, the expected maximal relative versatility of each domain according to this random
scenario (qrandZ0.72G0.01) is considerably higher
than what we observe in the genomes (qobsZ0.42G
0.04). This means that overall the observed relative
versatility of protein domains is lower than that
expected by simple random shuffling of domains.
This represents a marked discrepancy between the
model S0 and the observed data (Figure 4).
Inspection of this discrepancy reveals that a small
proportion of the domain combinations are more
frequent than expected by random shuffling of
domains, whereas a larger proportion of combinations are under-represented; the random
shuffling of domains results in a more even
distribution of domain combinations and abundances. This means that in the genomes, many
domain superfamilies have fewer combination
partners than expected by chance, which is consistent with previously reported results.20
For example, in the human genome, these overrepresented combinations correspond to 7%
(61/856) of all observed pair-wise combinations
and under-represented combinations correspond to
41% (352/856). One example is the SH3 domain,
which is a highly abundant domain (226 copies in
the human genome). This domain combines with
the SH2 and protein kinase-like domains much
more frequently than would be expected given the
abundance of the two domain superfamilies: 31
times with an SH2 domain and 27 times with a
protein kinase domain given a single expected
occurrence of each type. This suggests that the two
domain combinations have been duplicated
repeatedly, and as a consequence other domain
combinations involving the SH3 domain are underrepresented.
Thus, when the combination is duplicated, the
abundance of each component domain is increased
but the versatility is kept constant. This explains
why the versatility of protein domains is not at the
maximal level possible according to their
abundance in a random shuffling model. The
discrepancy between the predictions of the S0
ARTICLE IN PRESS
6
Figure 4. Two different random scenarios on how
domain combinations were formed. The versatility of
domain superfamilies is plotted against their abundance.
The simulation procedure is described in the text. The
observed data are plotted in black diamonds, the
simulated data in blue (S0) or red (S1) crosses. S0, simple
random model. S1, random model with duplication of
combinations.
model and the observed data is because the model
does not take into account the duplication of
domain combinations.
Domain Duplication and Recombination
compared to model S0: for superfamilies that occur
in ten or more proteins, model S1 predicts 54%
correctly, and for superfamilies that occur in 30 or
more proteins, model S1 even predicts two-thirds
correctly (66%). Model S0 correctly predicts only 8%
and 3% of the abundant superfamilies, respectively
(Figure 4). In contrast to model S0, model S1 is best
approximated by a linear relationship between
domain versatility and abundance (r2Z0.991)
but is also well approximated by a power-law
(r2Z0.948). Since we lack data especially for
domains of high abundance and versatility, it
remains inconclusive when the approximations
from model S1 would break down.
Given that the model S1, emphasizing duplication of combinations, is much more accurate than
the first random model, we conclude that duplication and retention of domain combinations due to
selection plays an important role in the evolution of
the protein repertoire. We show that recombination
and duplication of domain combinations can be
approximated by our stochastic simulation, and this
suggests that selection favours duplicates of an
existing combination much more often than novel
combinations. This means that it is easier or more
likely that a functional protein retained by selection
evolves by sequence divergence and mutation of an
existing combination, rather than combining existing domains in a new way.
S1: a random model of domain combination with
duplication of combinations
Prevalent domain combinations
Given the observation that domain combinations
in genomes are often more duplicated than simulated combinations in the random model S0, we
implemented a different model, which we termed
S1. In this model we explicitly consider the
duplication of domain combinations. As in model
S0, the domain abundance, number and size of
proteins are maintained. Thus, we examine recombination given selection processes that may have
produced the observed number of domains per
superfamily. All domains are randomly shuffled. As
soon as one combination is formed, we regard this
combination as fixed and it is duplicated until all
the instances of at least one of the component
domains are used up. This reduces the abundance
of the component domains that are available for
further re-shuffling. The recombination and subsequent duplication is repeated until all domains
are placed into combinations. This simulation was
repeated 10,000 times. The relationship between the
average versatility and abundance predicted by this
model is shown in Figure 4. It is a substantially
better approximation of the real data than model S0.
Model S1 reproduces the observed data four
times better than model S0: model S1 correctly
predicts the number of different combination
partners for 39% of all superfamilies in human,
whereas model S0 predicts only 10% (within 20%
error). A correct prediction of domain versatility is
most meaningful for large domain superfamilies,
and this is where model S1 improves enormously
Model S1 is consistent with the observed data,
predicting versatilities that are lower than S0, and
approximating better the overall relationship
between versatility and abundance. It also predicts
that for a given domain, a few combinations will be
highly duplicated whereas most combinations will
be infrequent. To what extent do we observe this
prediction in the protein repertoire? In order to
answer this question, we analysed the abundance
distribution of combinations of particular superfamilies: we wanted to know whether for each
particular superfamily, there are one or a few
prevailing combinations.
The abundance distribution for the pair-wise
combinations of large superfamilies in the human
genome is shown in Figure 5. For example, the SH3
domain occurs 31 times in combination with the
SH2 domain, as mentioned above. The next most
abundant combination of the SH3 domain is with
the PDZ peptide-binding domain and the P-loop
nucleotide triphosphate hydrolase domains: each
combination occurs 12 times in the human genome.
The other 24 combinations of the SH3 domain occur
much less frequently. The distribution in Figure 5 is
in accordance with our prediction that large superfamilies have only one or few combinations with
many duplicates.
These results and the striking fact that model S1
accounts for about two-thirds of the large superfamilies support the possibility that the mechanisms simulated in this model are an approximation
ARTICLE IN PRESS
Domain Duplication and Recombination
7
Figure 5. Duplication of domain
combinations. The Figure shows
for eight large domain superfamilies in the human genome the
abundance of their combinations.
The different colours correspond to
different superfamilies, and each
entry on the x-axis corresponds to a
combination of that superfamily
with another domain. The y-axis
denotes how often the particular
combination is observed in the
human genome. We omit tandem
combinations where the two component domains are the same and
combinations with unknown
domains. The plot is restricted to
the 13 most abundant combinations
of each domain superfamily.
of the evolution of protein repertoires. We do not
claim that the model resolves the chronological or
evolutionary sequence of events, but simply that
the concerted action of random re-shuffling and
selection and duplication of specific combinations
can have formed the protein repertoire. Furthermore, while model S1 clearly does not reproduce
the exact versatility of each individual superfamily,
we demonstrate how the observed relationship can
result from an underlying stochastic process of
evolution, involving random domain shuffling. The
model also includes selection at the level of duplication of domain combinations: a few combinations
are highly duplicated, while most have been duplicated and retained to a much lesser extent. As is
obvious from Figure 4, the observed domain combinations have a broad distribution of abundance and
versatility, suggesting that nature employs some
combination of model S0, S1 and other processes.
Conclusions
The interplay of neutral processes and selection
has been debated for a long time.34,35 Upon gene
duplication, the redundant copy is under
selection, and the “usefulness” and ability to subfunctionalise, i.e. modify function, is thought to
increase a duplicate’s chance of retention.36,37 In
order to modify function, the duplicate diverges, for
example, with respect to the sequence or the spatial
and temporal expression.38,39 Another mode of
divergence that can change protein function18 has
been discussed here: the recombination with other
protein domains.
We described an analysis of the general processes
that may have governed the recombination of
protein domains. We compared the known domain
superfamilies and their arrangements in proteins in
a variety of genomes to stochastic models of domain
recombination and duplication. We defined domain
versatility as the number of domain superfamilies
that a particular domain can combine with, and we
observed a robust correlation between domain
abundance and versatility by a power-law. This
correlation is independent of domain function,
structure or evolutionary history. It suggests that,
similar to transcriptome evolution,40,41 domain
reshuffling represents a form of random protein
divergence.
Despite this evidence for random evolution of
domain combinations, the observed domain versatility in genomes is very different to that resulting
from a simple random model (S0). This is due to
highly duplicated domain combinations, and many
of these qualify as supra-domains, which are
evolutionary units larger than a single domain
and which recur in many different proteins.24,28 One
such supra-domain is the combination of the P-loop
nucleotide triphosphate hydrolase domain with the
translation proteins domain.28 Proteins with this
domain combination hydrolyse GTP and interact
with the ribosome, and these are functions that are
needed by all translation factors. Thus, the combination recurs in 23 proteins in human. When
incorporating the duplication of particular domain
combinations into a more elaborate model, we were
able to approximate the observed versatility fairly
accurately. Thus, selection appears to occur at the
level of retaining particular domains and domain
combinations after duplication, while the footprint
of selection is less clear at the level of domain
recombination.
The emphasis on duplication of domain combinations supports the view that most combinations
formed once during evolution, and all examples of a
combination are related duplicates rather than
emerging from independent recombination events.
This is consistent with: (i) the prevalence of most
combinations in only one N to C-terminal order,5
ARTICLE IN PRESS
8
Domain Duplication and Recombination
“easier” for nature to retain duplicates of this
combination and re-use them in different proteins.
despite no obvious structural or functional constraints on the sequential order; (ii) structural
evidence of an evolutionary relationship between
all instances of the same domain combination;42 and
(iii) the lack of functional conservation if the domain
order is modified.42,43
Such a view implies that duplication of combinations is more parsimonious than several independent recombination events, and predicts the
prevalence of one or two combination partners for a
particular domain superfamily in a genome. We
observe such a distribution of combination partners
in Figure 5. Usually, modification of an existing
domain occurs by incremental mutation of the coding
sequence. The recombination with a novel partner
domain, however, is a different and more drastic type
of modification as it can require the evolution of a new
domain interface, for example. Once the domain
combination is established as a functional unit, it is
Methods and Data Sets
Genomic and domain assignment data
The 14 eukaryote, 14 bacterial and 14 archaeal genomes
used in our analysis are listed in Table 1. The gene
predictions and domain assignments to the gene predictions were taken from the SUPERFAMILY database
version 1.63.30 The domain superfamilies are defined
and classified in the SCOP database.3 All domains within
a SCOP superfamily are related and can be regarded as
descendants from one common ancestral domain. All
eukaryotic genomes, except for the two yeasts and the
two plants, were made non-redundant with respect to
predicted splice variants: for each gene only the longest
transcript was included.
Table 1. The genomes used in this analysis
Name
Anopheles gambiae release 19.2a
Arabidopsis thaliana
Aspergillus nidulans release 1 r3.1
Caenorhabditis briggsae release Aug03
Caenorhabditis elegans release 19.102
Danio rerio release 19.2
Drosophila melanogaster release 3.1
Fugu rubripes release19.2
Homo sapiens release19.34a
Mus musculus release19.30
Neurospora crassa release3
Oryza sativa ssp. japonica Nipponbare
Saccharomyces cerevisiae
Schizosaccharomyces pombe
Bacillus subtilis ssp. subtilis 168
Bifidobacterium longum NCC2705
Buchnera aphidicola Bp
Chlamydophila pneumoniae J138
Escherichia coli K12
Helicobacter pylori 26695
Mycoplasma genitalium
Prochlorococcus marinus ssp. marinus CCMP1375
Salmonella typhimurium LT2
Streptomyces coelicolor A3(2)
Thermotoga maritima
Tropheryma whipplei Twist
Vibrio cholerae
Wigglesworthia glossinidia
Aeropyrum pernix
Archaeoglobus fulgidus DSM 4304
Halobacterium sp. NRC-1
Methanobacterium thermoautotrophicum
Methanopyrus kandleri AV19
Methanosarcina acetivorans C2A
Methanothermobacter thermautotrophicus Delta H
Nanoarchaeum equitans Kin4-M
Pyrococcus abyssi
Pyrococcus furiosus DSM 3638
Sulfolobus solfataricus
Sulfolobus tokodaii
Thermoplasma acidophilum
Thermoplasma volcanium
Kingdoma
E
E
E
E
E
E
E
E
E
E
E
E
E
E
B
B
B
B
B
B
B
B
B
B
B
B
B
B
A
A
A
A
A
A
A
A
A
A
A
A
A
A
Total number of predicted proteinsb
16,148
28,677
9541
19,507
21,631
26,587
18,484
37,245
29,802
32,911
10,082
56,056
6703
4993
4112
1727
504
1069
4311
1576
484
1882
4451
7769
1858
808
3835
611
1841
2420
2075
1869
1687
4540
1873
563
1896
2125
2977
2826
1482
1499
All assignment data and genome information were taken from SUPERFAMILY version 1.63.30 On average, about 45% of the sequences
had at least one domain predicted.
a
A, archaea; B, bacteria; E, eukaryotes.
b
Eukaryote genomes were made non-redundant in terms of splice variants.
ARTICLE IN PRESS
9
Domain Duplication and Recombination
Domain properties
Annotation of domain function and process
Manual procedure
Abundance and versatility
The duplication or abundance of a domain superfamily
in each genome was measured as the number of domains
belonging to the particular superfamily (Figure 1(c)). The
recombination or absolute versatility of a particular
superfamily was measured as the sum of different N
and C-terminal domains that are immediately adjacent to
the domain (Figure 1(d)). Tandem duplications of the
same domain within one protein are ignored. The powerlaw function was fitted using regression. The minimal
versatility of a domain and thus the intercept of the graph
is known (no combination partners), so it seemed meaningful to restrict the analysis to the slope or exponent q.
Alternative definitions of abundance and versatility
We tested our results on alternative definitions of
abundance and versatility, and found the same general
relationships. When we considered the number of
proteins instead of the number of domains as a measure
of abundance, or when we considered the number of cooccurring domains instead of immediately adjacent
domains as a measure of versatility, the exponent q
increased slightly. In both cases, the versatility relative to
the abundance is slightly higher, but their power-law
relationship is the same. The same holds true if we do
account for tandem duplications of domains within one
protein.
Structural classification
Our analysis is based on the four main structural
classes in the SCOP database: alpha, beta, alpha/beta and
alphaCbeta.2 Each class contains several folds, and each
fold contains several superfamilies. For the hierarchical
classification of protein domains, please refer to the SCOP
database.2,3
All domain superfamilies were manually annotated
with respect to their dominant molecular function, i.e. the
function within the protein, and their dominant biological
process, i.e. their role in the cell. Using a manual scheme,
we ensured a high-quality annotation and a domaincentred annotation instead of the widely used whole-gene
or whole-protein centred annotation. The different
schemes are described below.
Scheme for biological processes of domains. For an
analysis of the biological process in which domain
superfamilies act, we annotated all superfamilies with
respect to their affiliation to one of 43 different process
categories (Table 2). All 43 functional categories were then
mapped onto one of eight process classes which are
defined as follows.
(i) Information: storage, maintenance of the genetic
code.
(ii) Regulation: regulation of gene expression, protein
activity; information processing in response to
environmental input.
(iii) Metabolism: all anabolic and catabolic processes;
cell maintenance/homeostasis.
(iv) Energy: energy production/storage.
(v) Processes_IC: intra-cellular processes; cell
motility/division; transport.
(vi) Processes_EC: inter or extra-cellular processes;
organismal processes (such as blood clotting).
(vii) General: general enzymatic reactions and multiple
functions.
(viii) Unknown/Other: unknown or unclassifiable
superfamilies, viral functions.
The functional distribution of all SCOP superfamilies
and those in the respective genomes is available upon
request.
Scheme for molecular functions of domains. All domain
superfamilies were assigned to one of 27 functional
Table 2. Classification scheme for biological processes of domains
Biological process class
Biological process category
Information
Chromatin structure and dynamics; translation, ribosomes, ribosome biogenesis; tRNA
metabolism; transcription, DNA replication, recombination, repair; RNA processing,
dynamics; nuclear structure
RNA processing and modification; DNA-binding (transcription factors); kinases and
phosphatases and inhibitors; signal transduction
Electron transfer/transport; amino acid transport and metabolism; nitrogen metabolism; nucleotide transport and metabolism; carbohydrate transport and metabolism;
polysaccharide metabolism; lipid/polysaccharide storage; coenzyme metabolism;
lipid transport and metabolism; cell envelope biogenesis, outer membrane; secondary
metabolites biosynthesis, transport and catabolism; oxidation/reduction; transferases; other enzymes
Energy production and conversion; photosynthesis
Cell division and chromsome partitioning, cell cycle; phospholipid metabolism; cell
motility, cytoskeleton; intracellular trafficking and secretion; post-translational
modification, protein turnover, chaperones; proteases, peptidases and their inhibitors;
inorganic ion transport and metabolism; transport
Cell adhesion; immune response; blood clotting; toxins and defense enzymes
Small molecule binding; general or several functions; protein–protein interaction
Function unknown; viral proteins
Regulation
Metabolism
Energy
Intra-cellular processes, Processes_IC
Inter-cellular processes, Processes_EC
General
Unknown/Other
ARTICLE IN PRESS
10
Domain Duplication and Recombination
Table 3. Classification scheme for the molecular functions of domains
Functional class
Functional category
Enzyme/related function
Cofactor binding
Nucleic acid binding
Protein binding
Catalytic; substrate-binding; regulatory only; other enzyme function
Cofactor-binding; ion-binding; ion-cluster-binding; other molecule-binding function
DNA-binding; RNA-binding; other nucleic acid binding function
Permanent protein–protein interaction; transient protein–protein interaction; part of complex;
other protein binding function
Lipid binding; other lipid binding function
Carbohydrate binding; other carbohydrate binding function
Channel protein; electron transfer; other transport function
Structural role; general/multiple functions; viral proteins; toxin; other/ambiguous/unknown
function
Lipid binding
Carbohydrate binding
Transport
Other function
categories. Each category maps to one of the eight rough
functional classes (Table 3). The annotation was based on
information in SCOP3 and the literature.
Controls for our functional classification
Automated domain annotation
As a control, we used the automated annotation of GO
process, function and location44 to Pfam45 domains in
InterPro.46 Pfam domains were mapped onto SCOP
domain superfamilies based on sequence similarity. This
provided annotation for 647, 667 and 343 domain superfamilies, respectively. The manual domain annotation
was largely consistent with the Gene-Ontology annotation for Pfam domains and their mappings to the domains
described in SUPERFAMILY.30
Random annotation of function
As a further control, all domain superfamilies were
randomly assigned to 15 categories and the relationship
between versatility and abundance tested as described
above (data not shown). The relationship between
abundance and versatility and the slope q were the
same for all subsets.
Random models
For a simulation of the versatility expected from
random shuffling, we considered each genome separately.
The abundance of a particular domain superfamily and
the number of proteins and their lengths in terms of
domains were kept constant. The resulting versatility was
recorded for each superfamily; and the average was taken
after 10,000 repeats of the simulation.
S0. All domains were considered at their observed
abundance. The domains were then picked and assigned
to random positions in the proteins until each position
was filled.
S1. All domains were considered at their observed
abundance. Two different domain superfamilies were
chosen at random to combine with each other. The
combination was then duplicated until one of the
component domains was “used up”. The process was
repeated taking the new abundances for both superfamilies into account. Eventually, all domains were
assigned to combinations.
Acknowledgements
We are grateful to Cyrus Chothia, M. Madan
Babu, Shinhan Shiu and Christine Orengo for
helpful discussion. C.V. acknowledges financial
support by the Boehringer Ingelheim Fonds.
References
1. Rossmann, M. G., Moras, D. & Olsen, K. W. (1974).
Chemical and biological evolution of nucleotidebinding protein. Nature, 250, 194–199.
2. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia,
C. (1995). SCOP: a structural classification of proteins
database for the investigation of sequences and
structures. J. Mol. Biol. 247, 536–540.
3. Andreeva, A., Howorth, D., Brenner, S. E., Hubbard,
T. J., Chothia, C., Murzin, A. G. et al. (2004). SCOP
database in 2004: refinements integrate structure and
sequence family data. Nucl. Acids. Res. 32, D226–D229.
4. Doolittle, R. F. (1995). The multiplicity of domains in
proteins. Annu. Rev. Biochem. 64, 287–314.
5. Apic, G., Gough, J. & Teichmann, S. A. (2001). Domain
combinations in archaeal, eubacterial and eukaryotic
proteomes. J. Mol. Biol. 310, 311–325.
6. Wuchty, S. (2001). Scale-free behavior in protein
domain networks. Mol. Biol. Evol. 18, 1694–1702.
7. Patthy, L. (2003). Modular assembly of genes and the
evolution of new functions. Genetica, 118, 217–231.
8. Sinha, S., van Nimwegen, E. & Siggia, E. D. (2003).
A probabilistic method to detect regulatory modules.
Bioinformatics, 19, i292–i301.
9. Chervitz, S. A., Aravind, L., Sherlock, G., Ball, C. A.,
Koonin, E. V., Dwight, S. S. et al. (1998). Comparison of
the complete protein sets of worm and yeast:
orthology and divergence. Science, 282, 2022–2028.
10. Rubin, G. M., Yandell, M. D., Wortman, J. R., Gabor
Miklos, G. L., Nelson, C. R., Hariharan, I. K. et al.
(2000). Comparative genomics of the eukaryotes.
Science, 287, 2204–2215.
11. Aravind, L. & Subramanian, G. (1999). Origin of
multicellular eukaryotes—insights from proteome
comparisons. Curr. Opin. Genet. Dev. 9, 688–694.
12. Lespinet, O., Wolf, Y. I., Koonin, E. V. & Aravind, L.
(2002). The role of lineage-specific gene family
expansion in the evolution of eukaryotes. Genome
Res. 12, 1048–1059.
13. Ranea, J. A., Buchan, D. W., Thornton, J. M. & Orengo,
C. A. (2004). Evolution of protein superfamilies and
bacterial genome size. J. Mol. Biol. 336, 871–887.
14. Vogel, C., Teichmann, S. A. & Chothia, C. (2003). The
ARTICLE IN PRESS
11
Domain Duplication and Recombination
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.
immunoglobulin
superfamily
in
Drosophila
melanogaster and Caenorhabditis elegans and the
evolution of complexity. Development, 130, 6317–6328.
Muller, A., MacCallum, R. M. & Sternberg, M. J.
(2002). Structural characterization of the human
proteome. Genome Res. 12, 1625–1641.
Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C.,
Zody, M. C., Baldwin, J. et al. (2001). Initial sequencing
and analysis of the human genome. Nature, 409,
860–921.
Teichmann, S. A., Park, J. & Chothia, C. (1998).
Structural assignments to the Mycoplasma genitalium
proteins show extensive gene duplications and
domain rearrangements. Proc. Natl Acad. Sci. USA,
95, 14658–14663.
Todd, A. E., Orengo, C. A. & Thornton, J. M. (2001).
Evolution of function in protein superfamilies, from a
structural perspective. J. Mol. Biol. 307, 1113–1143.
Ye, Y. & Godzik, A. (2004). Comparative analysis of
protein domain organization. Genome Res. 14, 343–353.
Apic, G., Huber, W. & Teichmann, S. A. (2003). Multidomain protein families and domain pairs:
comparison with known structures and a random
model of domain recombination. J. Struct. Funct.
Genome, 4, 67–78.
Koonin, E. V., Wolf, Y. I. & Karev, G. P. (2002). The
structure of the protein universe and genome
evolution. Nature, 420, 218–223.
Barabasi, A. L. & Albert, R. (1999). Emergence of
scaling in random networks. Science, 286, 509–512.
Qian, J., Luscombe, N. M. & Gerstein, M. (2001).
Protein family and fold occurrence in genomes:
power-law behaviour and evolutionary model.
J. Mol. Biol. 313, 673–681.
Chothia, C., Gough, J., Vogel, C. & Teichmann, S. A.
(2003). Evolution of the protein repertoire. Science,
300, 1701–1703.
Koonin, E. V., Aravind, L. & Kondrashov, A. S. (2000).
The impact of comparative genomics on our understanding of evolution. Cell, 101, 573–576.
Gerstein, M. (1998). How representative are the
known structures of the proteins in a complete
genome? A comprehensive structural census. Fold.
Des. 3, 497–512.
Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers,
J., Abril, J. F., Agarwal, P. et al. (2002). Initial
sequencing and comparative analysis of the mouse
genome. Nature, 420, 520–562.
Vogel, C., Berzuini, C., Bashton, M., Gough, J. &
Teichmann, S. A. (2004). Supra-domains—evolutionary units larger than single protein domains. J. Mol.
Biol. 336, 809–823.
Pawson, T. & Nash, P. (2003). Assembly of cell
regulatory systems through protein interaction
domains. Science, 300, 445–452.
Madera, M., Vogel, C., Kummerfeld, S. K., Chothia, C.
31.
32.
33.
34.
35.
36.
37.
38.
39.
40.
41.
42.
43.
44.
45.
46.
& Gough, J. (2004). The SUPERFAMILY database in
2004: additions and improvements. Nucl. Acids Res.
32, D235–D239.
Mott, R., Schultz, J., Bork, P. & Ponting, C. P. (2002).
Predicting protein cellular localization using a
domain projection method. Genome Res. 12, 1168–1174.
Ouzounis, C. A., Coulson, R. M., Enright, A. J., Kunin,
V. & Pereira-Leal, J. B. (2003). Classification schemes
for protein structure and function. Nature Rev. Genet.
4, 508–519.
Li, H., Helling, R., Tang, C. & Wingreen, N. (1996).
Emergence of preferred structures in a simple model
of protein folding. Science, 273, 666–669.
Kimura, M. (1983). The Neutral Theory of Molecular
Evolution, Cambridge University Press, Cambridge,
UK.
Darwin, C. (1859). The Origin of Species by Means of
Natural Selection, or the Preservation of Favoured Races in
the Struggle for Life, John Murray, London.
Lynch, M. & Conery, J. S. (2000). The evolutionary fate
and consequences of duplicate genes. Science, 290,
1151–1155.
Wolfe, K. H. (2001). Yesterday’s polyploids and the
mystery of diploidization. Nature Rev. Genet. 2, 333–
341.
Gu, Z., Rifkin, S. A., White, K. P. & Li, W. H. (2004).
Duplicate genes increase gene expression diversity
within and between species. Nature Genet. 36, 577–579.
Wagner, A. (2000). Decoupled evolution of coding
region and mRNA expression patterns after gene
duplication: implications for the neutralist–selectionist debate. Proc. Natl Acad. Sci. USA, 97, 6579–6584.
Khaitovich, P., Weiss, G., Lachmann, M., Hellmann, I.,
Enard, W., Muetzel, B. et al. (2004). A neutral model of
transcriptome evolution. PLoS Biol. 2, E132.
van Noort, V., Snel, B. & Huynen, M. A. (2004). The
yeast coexpression network has a small-world, scalefree architecture and can be explained by a simple
model. EMBO Rep. 5, 280–284.
Bashton, M. & Chothia, C. (2002). The geometry of
domain combination in proteins. J. Mol. Biol. 315,
927–939.
Hegyi, H. & Gerstein, M. (2001). Annotation transfer
for genomics: measuring functional divergence in
multi-domain proteins. Genome Res. 11, 1632–1640.
Harris, M. A., Clark, J., Ireland, A., Lomax, J.,
Ashburner, M., Foulger, R. et al. (2004). The gene
ontology (GO) database and informatics resource.
Nucl. Acids Res. 32, D258–D261.
Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich,
V., Griffiths-Jones, S. et al. (2004). The Pfam protein
families database. Nucl. Acids Res. 32, D138–D141.
Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch,
A., Barrell, D., Bateman, A. et al. (2003). The InterPro
Database, 2003 brings increased coverage and new
features. Nucl. Acids Res. 31, 315–318.
Edited by G. von Heijne
(Received 23 September 2004; received in revised form 18 November 2004; accepted 19 November 2004)