Proteins with Highly Evolvable Domain Architectures Are
Nonessential but Highly Retained
Chia-Hsin Hsu,†,1,2,3 Austin W. T. Chiang,†,1,2,3 Ming-Jing Hwang,*,1,2,3 and Ben-Yang Liao*,4
1 Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan, ROC
2 Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan, ROC
3 Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan, ROC
4 Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Miaoli County, Taiwan, ROC
† These authors contributed equally to this work.
*Corresponding author: E-mail: [email protected]; [email protected].
Associate editor: Naoko Takezaki
Abstract
The functions of proteins are usually determined by domains, and the sequential order in which domains are connected
to make up a protein chain is known as the domain architecture. Here, we constructed evolutionary networks of protein
domain architectures in species from three major life lineages (bacteria, fungi, and metazoans) by connecting any two
architectures between which an evolutionary event could be inferred by a model that assumes maximum parsimony. We
found that proteins with domain architectures with a higher level of evolvability, indicated by a greater number of
connections in the evolutionary network, are present in a wider range of species. However, these proteins tend to be less
essential to the organism, are duplicated more often during evolution, have more isoforms, and, intriguingly, tend to be
associated with functional categories important for organismal adaptation. These results reveal the presence, in many
genomes, of genes coding for a core set of nonessential proteins that have a highly evolvable domain architecture and
thus a repertoire of genetic materials accessible for organismal adaptation.
Key words: protein domain architecture, protein evolution, evolvability, essentiality, evolutionarily inferred network.
© The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please
e-mail: [email protected]
Mol. Biol. Evol. 33(5):1219–1230 doi:10.1093/molbev/msw006 Advance Access publication January 14, 2016
Introduction
The majority of proteins encoded in prokaryotic and eukaryotic genomes are composed of more than one structurally
separable domain, which may or may not be independent of
each other functionally or evolutionarily (Liu and Rost 2004).
Domain architecture refers to the sequential order of protein
domains and has been an important subject in studies of
protein evolution (Ekman et al. 2007; Fong et al. 2007;
Forslund and Sonnhammer 2012; Zhang et al. 2012). The
domains determine a protein’s structure and physical interactions with other macromolecules, and modifications of
domain architectures, including domain duplication, insertion, deletion, fusion, and fission, therefore play a central
role in creating evolutionary novelties for adaptation. For instance, during evolution, Dicer proteins from animals and
plants have undergone changes in domain architecture that
allowed the development of specialized gene regulatory and
antiviral functions (Mukherjee et al. 2013).
To survive, organisms must maintain the ability to adapt to
a changing environment (Pigliucci 2008; Brookfield 2009).
Organismal adaptation is a complex process and can start
either by the acquisition of advantageous genetic variations
by positive selection or, alternatively, by the fixation of neutral
or slightly deleterious variations in the population by genetic
drift as the sources for preadaptation (Rajon and Masel 2011;
Payne and Wagner 2014). Regardless of the mechanism,
genetic variations important for evolution must emerge by
mutations of functional elements of the genome, and protein-coding genes are among the most well-characterized. We
hypothesized that protein-coding genes contribute significantly to an organism’s ability to evolve and that this contribution is related to the ability of highly evolvable architectures
to evolve new architectures for adaptation. Here, we tested
this hypothesis by examining the domain architecture of proteins in species from three major lineages of life, namely bacteria
(prokaryotes), fungi (eukaryotes), and metazoans (eukaryotes), and asking 1) whether there is a core set of genes/
proteins that are utilized by a wide range of species and can be
more readily modified to produce new forms (i.e., new protein domain architectures) by natural selection and 2) if so,
what the molecular, functional, and evolutionary properties
of this core set of genes/proteins are.
Results
The Network and Evolvability of Protein Domain
Architectures
Although there are various definitions (Gregory 2002; Pigliucci
2008; Brookfield 2009; Payne and Wagner 2014), “evolvability”
can be defined as the propensity to evolve novel structures
that are fixed in the population or have not yet been eradicated by natural selection (Kirschner and Gerhart 1998;
Brookfield 2001). Accordingly, we constructed evolutionary
networks of existing protein domain architectures in species
with a sequenced genome, covering three lineages of living
organisms—bacteria (1,557 genomes), fungi (138 genomes),
and metazoans (82 genomes) (see Materials and Methods).
We constructed a separate network for each of the three life
lineages, and, in each network, connected all pairs of domain
architectures (nodes) that can be explained by an evolutionary event (the edge) inferred by a maximum parsimony approach in which a new domain architecture arises by fission/
fusion or insertion/deletion of domain(s) from a precursor
architecture present in an ancestral species (Fong et al. 2007)
(fig. 1A; also see supplementary fig. S1, Supplementary
Material online, for an example of a subnetwork highlighting
the architectures containing the DEAD domain). Besides this
evolutionarily inferred network, we generated another type of
network in which architectures were connected if they differed by only one domain (fig. 1B), as such events are most
frequent in the generation of new domain architectures
(Björklund et al. 2005). Note that this type of network, hereafter referred to as the “simplified” network as opposed to the
“inferred” network mentioned above, is essentially an assembly of all the domain-centered architecture networks we have
studied previously (Hsu et al. 2013), the difference being that
the architectures of a network of this type in the present work
were not restricted to those containing a specific single
domain of interest. Below, we present results of the inferred
networks, while providing those of the simplified networks in
supplementary figures and tables, as both types of networks
produced similar results.
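
As a rough illustration of how a simplified network could be assembled, the sketch below links two architectures whenever one can be obtained from the other by inserting or deleting a single domain. The single insertion/deletion rule, the tuple representation, the toy architectures, and the function names are assumptions made for illustration only; the procedure of Hsu et al. (2013) and the parsimony model behind the inferred network are not reproduced here.

```python
from itertools import combinations

def differ_by_one_domain(a, b):
    """True if deleting exactly one domain from the longer architecture yields the shorter one."""
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

def build_simplified_network(architectures):
    """Return an undirected adjacency map {architecture: set of neighbouring architectures}."""
    nodes = set(architectures)
    network = {arch: set() for arch in nodes}
    for a, b in combinations(nodes, 2):
        if differ_by_one_domain(a, b):
            network[a].add(b)
            network[b].add(a)
    return network

# Toy architectures (hypothetical) written as ordered tuples of domain names.
toy = [("DEAD",), ("DEAD", "Helicase_C"), ("DEAD", "Helicase_C", "dsrm"), ("dsrm",)]
network = build_simplified_network(toy)

# In the simplified network, evolvability is the number of edges of a node.
evolvability = {arch: len(neighbours) for arch, neighbours in network.items()}
```

In this toy set, the two-domain architecture has two neighbours, whereas the isolated single-domain architecture has none.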
The evolvability of a domain architecture (node) was defined as the number of edges (i.e., connectivity) of the examined node that go out to (the inferred network) or simply
connect with (the simplified network) other nodes, which are
created from the examined node and are not eradicated evolutionarily. When the probability of measuring a particular
value of some quantity varies inversely with a power of that
value, the quantity is said to follow a power law (Newman
2006). Analysis of the networks constructed for the genomes
of bacteria, fungi, and metazoans showed that the evolvability
of protein domain architectures followed a power law distribution; that is, for each network (lineage), most of the evolvability was contributed by a small number of highly evolvable
architectures, although many architectures of a low evolvability were also present. Significantly, the same power law
distribution (i.e., a negative linear relationship between the
logarithm of the probability of observing domain architectures with a given evolvability and the logarithm of that evolvability) was observed for all networks, whether inferred
(fig. 2) or simplified (supplementary fig. S2, Supplementary
Material online), regardless of the life lineage.
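
The negative linear relationship on a log-log scale can be illustrated with a short sketch that tabulates P(K) from a list of evolvability values and fits a straight line to log10 P(K) against log10 K. The degree values are made up, and the function names are hypothetical; this shows only the regression step and does not reproduce the goodness-of-fit tests used in the study.

```python
import math
from collections import Counter

def degree_distribution(evolvabilities):
    """Map each evolvability K > 0 to P(K), the fraction of architectures with that K."""
    positive = [k for k in evolvabilities if k > 0]  # log(0) is undefined
    counts = Counter(positive)
    return {k: c / len(positive) for k, c in sorted(counts.items())}

def log_log_fit(distribution):
    """Least-squares slope, intercept, and Pearson r of log10 P(K) against log10 K."""
    xs = [math.log10(k) for k in distribution]
    ys = [math.log10(p) for p in distribution.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / math.sqrt(sxx * syy)

# Made-up degrees: many poorly connected architectures, few highly connected ones.
toy_degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope, intercept, r = log_log_fit(degree_distribution(toy_degrees))
```

Because the toy counts fall off as the inverse square of K, the fitted slope is close to -2 and r is close to -1; real data would scatter around the fitted line.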
Protein Abundance and Genome Prevalence
We found that domain architectures with a higher evolvability tended to be present in a larger number of bacterial,
fungal, or metazoan proteins than those with a lower evolvability (fig. 3 and supplementary fig. S3, Supplementary
Material online). Note that although each protein has one
corresponding domain architecture annotated in Pfam-A
(Finn et al. 2013), the database used to construct the architecture networks (see Materials and Methods), many proteins
are annotated with the same architecture. In other words,
although the probability of finding a highly evolvable architecture in all unique architectures is low (fig. 2 and supplementary fig. S2, Supplementary Material online), many
proteins were found to be composed of highly evolvable architectures. For architectures with higher evolvability, this
phenomenon might result from a higher retainability (i.e.,
retention rate, defined as the percentage of genomes in the
lineage containing at least one gene encoding a protein with a
given architecture) and/or a greater protein abundance
(number of proteins with this architecture within a genome).
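
The two quantities defined above can be written down directly. The following is a minimal sketch assuming each genome is available as a list holding one architecture per protein; the data layout, genome names, and function names are hypothetical.

```python
def retainability(architecture, genomes):
    """Percentage of genomes with at least one protein of the given architecture."""
    present = sum(1 for proteins in genomes.values() if architecture in proteins)
    return 100.0 * present / len(genomes)

def protein_abundance(architecture, proteins):
    """Number of proteins in one genome annotated with the given architecture."""
    return sum(1 for arch in proteins if arch == architecture)

# Hypothetical lineage: each genome is a list with one architecture per protein.
genomes = {
    "genome_1": [("DEAD", "Helicase_C"), ("DEAD", "Helicase_C"), ("dsrm",)],
    "genome_2": [("DEAD", "Helicase_C")],
    "genome_3": [("dsrm",)],
    "genome_4": [("DEAD",)],
}
arch = ("DEAD", "Helicase_C")
print(retainability(arch, genomes))                  # 50.0
print(protein_abundance(arch, genomes["genome_1"]))  # 2
```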
FIG. 1. A schematic example of the construction of an evolutionarily inferred network (A) or a simplified network (B) of protein domain architectures.
Domain architectures, annotated and numbered according to Pfam-A (“No.” above or below each box), are presented as a set of sequentially linked
individual domains (boxed), each of which is represented by a unique symbol, which is listed in the box on the right, together with the domain name
and Pfam accession number. Note that, in (A) (the inferred network), the evolutionary events connecting all pairs of parent (arrow tail) to child (arrow
head) architectures were determined by the evolution model of Fong et al. (2007), while in (B) (the simplified network), any two architectures differing
only by one domain would be connected, provided that the difference could be explained as the result of a single evolutionary event (Hsu et al. 2013). A
zoom-in view of a segment of the inferred network for the lineage of metazoans is shown in supplementary figure S1, Supplementary Material online.
To determine the cause, we performed rank correlation
analysis. For each species (only those species with a fully sequenced genome were considered, see Materials and
Methods), we tested whether the evolvability of its protein
domain architectures correlated with architecture retainability or with protein abundance. We plotted domain architecture evolvability against either architecture retainability or
protein abundance for each species examined and calculated
the Spearman’s correlation coefficient (r), and found that
evolvability was positively correlated with retainability and
also with protein abundance for almost all species examined
(fig. 4 and supplementary fig. S4, Supplementary Material
online). This indicates that protein domain architectures
with a higher evolvability are present not only in more proteins encoded in a single genome (greater protein abundance), but also in more genomes (higher retainability).
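
The per-species rank correlation can be sketched as follows: Spearman's coefficient is the Pearson correlation of the rank-transformed values, with tied values sharing the mean of their ranks. The species names and numbers below are invented for illustration, and the helper names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman's coefficient: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical per-species data: (evolvability, retainability in %) per architecture.
species = {
    "species_A": ([1, 2, 3, 10, 25], [5.0, 12.0, 30.0, 60.0, 95.0]),
    "species_B": ([1, 1, 4, 8, 8], [40.0, 10.0, 20.0, 70.0, 90.0]),
}
coefficients = {name: spearman(k, ret) for name, (k, ret) in species.items()}
```

This sketch returns only the coefficient; a library routine such as `scipy.stats.spearmanr` would additionally report a P value for each species.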
Essentiality and Duplicability
Besides protein domain architecture, the function of a gene
and, ultimately, the fate of its evolution are affected by other
properties. In theory, homologous genes should encode proteins of similar domain architectures due to a common origin.
Homologous genes, that is, genes with shared ancestry through either speciation events (orthologs) or duplication events (paralogs),
are likely to encode proteins with an identical domain architecture (Lin et al. 2006), which means that genes encoding
proteins of the same architecture in the same genome tend to
be paralogs that have arisen from gene duplication events. On
the other hand, it has also been shown that essential genes are
more likely to be retained in distantly related genomes
(Gustafson et al. 2006; Waterhouse et al. 2011) and are less
likely to undergo duplication during evolution (He and Zhang
2006; Liang and Li 2007). The results of these previous studies,
in conjunction with our observation in figure 4, imply a complex relationship among duplicability and essentiality of genes
and the retainability and evolvability of the domain architecture of the encoded protein. Accordingly, we computed these
four properties (see Materials and Methods) for three model
organisms, Escherichia coli, Saccharomyces cerevisiae (budding
yeast), and Mus musculus (house mouse), representing, respectively, the three lineages of bacteria, fungi, and metazoans, and examined their correlations.
FIG. 2. Power law distribution of domain architecture evolvability for the inferred networks of the bacterial (left panel), fungal (center panel), and
metazoan (right panel) lineages (those for the corresponding simplified networks are shown in supplementary fig. S2, Supplementary Material online).
The linear regression data and the resulting Pearson’s correlation coefficient (r) are presented in each panel. K, domain architecture evolvability
measured by network connectivity (see Materials and Methods); P(K), probability of observing K in the data analyzed. Goodness-of-fit statistical tests
were used to determine that these distributions followed a power law equation, as indicated by a very small P value.
FIG. 3. Domain architectures with a higher evolvability (K) tended to be present in a larger number of proteins. The dot plots were generated based on
the entire set of bacterial, fungal, or metazoan proteins (see Materials and Methods). The Spearman’s correlation coefficient (r) and the P value under
the null hypothesis of no correlation are shown. Evolvability values were computed using the inferred networks; those for the simplified networks are
shown in supplementary figure S3, Supplementary Material online.
FIG. 4. Boxplots of Spearman’s coefficients for the correlation between
architecture evolvability and retainability (or protein abundance) for
every bacterial, fungal, or metazoan species (genome) investigated in
this study. The architecture evolvabilities (K) were computed from
the inferred networks (see supplementary fig. S4, Supplementary
Material online, for results of the simplified networks). The total numbers of species used to construct the boxplots are indicated in parentheses under lineage on the x-axis. The percentage of species showing a
positive rank correlation with statistical significance (P < 0.05) between
K and genome retention rate or between K and protein abundance in
the genome is given above each box. The values for the upper quartile,
median, and lower quartile are indicated for each box, with outliers
indicated by crosses.
We defined gene essentiality using data from targeted gene
deletion experiments and gene duplicability by the number of
paralogs in the genome (see Materials and Methods).
Architecture evolvability and retainability were computed
from data for protein domain architectures; in order to correlate them with duplicability and essentiality, which are
properties of genes, we assigned evolvability and retainability
of a given protein domain architecture to the gene that encodes the protein from which the domain architecture is
derived. In some cases, especially in the case of the mouse,
one gene can have more than one protein product (isoforms)
and some may have different domain architectures. In such
cases, the evolvabilities and retainabilities of all the different
isoforms of the same gene were averaged. Rank correlation
analysis using all pairs of the four properties (evolvability,
retainability, duplicability, and essentiality) was then performed. Because these properties are interrelated, we also
performed partial correlation analyses to estimate a direct
association by controlling for potential confounding factors.
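
One common way to obtain a partial rank correlation is to rank-transform every variable and then apply the recursive partial correlation formula, removing one control variable at a time. The sketch below follows that route; it is an assumption about how such an analysis could be set up, not a record of the software used in the study, and the gene-level values and helper names are invented.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, controls):
    """Correlation of x and y after removing the linear effect of each control."""
    if not controls:
        return pearson(x, y)
    z, rest = controls[0], controls[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def partial_spearman(x, y, controls):
    """Partial rank correlation: partial correlation computed on ranks."""
    return partial_corr(average_ranks(x), average_ranks(y),
                        [average_ranks(c) for c in controls])

# Invented values for eight genes (essentiality coded 1 = essential, 0 = nonessential).
evolvability = [1, 3, 2, 8, 5, 13, 9, 21]
retainability = [10, 20, 35, 30, 60, 55, 80, 90]
duplicability = [0, 2, 1, 1, 4, 3, 6, 5]
essentiality = [1, 0, 1, 0, 1, 0, 0, 0]

rp = partial_spearman(evolvability, essentiality, [retainability, duplicability])
```

Here `rp` estimates the association between evolvability and essentiality with retainability and duplicability held fixed, mirroring the layout of one cell of table 1.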
The Spearman’s rank correlation coefficients between
examined factors for the inferred networks without using
partial correlation are shown in supplementary table S1,
Supplementary Material online, and those using partial correlation in table 1. Although we focused here on finding factors that can be directly influenced by domain architecture
evolvability, we note that the observed positive correlation
between gene essentiality and retainability is consistent with
gene-based findings reported by others (Gustafson et al. 2006;
Waterhouse et al. 2011). The observed negative correlations
between gene essentiality and gene duplicability (table 1 and
supplementary table S1, Supplementary Material online) are
also in accordance with the observation that essential genes
are less likely to undergo duplication during evolution (He
and Zhang 2006; Liang and Li 2007). When the simplified
networks were used, or when a single isoform was randomly
selected to represent genes with multiple protein products,
similar correlations were observed (supplementary tables S2
and S3, Supplementary Material online).
Table 1 shows that the patterns of correlation found for
domain architecture evolvability across the lineages of bacteria,
fungi, and metazoans were consistent. For example, in addition
to exhibiting higher retainability, genes encoding proteins with
domain architectures of higher evolvability tended to be less
essential to the organism. Similar results were obtained when
only single-copy genes were considered (supplementary table
S4, Supplementary Material online), or when duplicability was
alternatively defined by the number of proteins with identical
architectures in the genome (supplementary table S5,
Supplementary Material online). A positive correlation between evolvability and duplicability was also consistently observed for the model organisms in the three life lineages (table
1 and supplementary tables S1–S3 and S5, Supplementary
Material online), although for yeast the correlation was not
statistically significant (P = 0.11, table 1).
Assignment of protein domains from genome sequences
can be erroneous due to poor genome assembly (Nagy and
Patthy 2013). To address this issue, we recalculated the correlations of table 1 using only high-quality genomes (those
annotated as “manually curated” in KEGG; Kanehisa et al.
2014; see Materials and Methods). The results, presented in
supplementary table S6, Supplementary Material online, reinforced the observations made from table 1 where a much
larger (about three times) set of genomes was used. How
protein domains and domain architectures were assigned
could also affect these correlations; accordingly, we recalculated these correlations using different Pfam domain-defining
thresholds and methods, and InterPro (Hunter et al. 2012), an
integrated database of protein domains collected from multiple sources including Pfam-A (see Materials and Methods).
As can be seen from the results shown in supplementary
tables S7–S10, Supplementary Material online, the aforementioned positive and negative correlations between the four
properties studied here largely held, although some correlations, especially those of the mouse involving essentiality,
disappeared when putative (supplementary table S8,
Supplementary Material online) or orphan domains (ODs;
supplementary table S9, Supplementary Material online)
were included. The influence of these two treatments was
minimal on E. coli, possibly because putative domains and ODs tend to
be disordered linkers (protein regions without a well-defined
conformation in their native state that connect well-characterized protein domains) (Ekman et al. 2005) and disordered linkers are evolutionarily reduced in prokaryotes
(Wang et al. 2011). The correlation between evolvability
and duplicability for E. coli lost statistical significance or
became negative when a higher hierarchy of protein
domain classification, “clan” instead of “family”, or InterPro
domains was used in defining architectures (supplementary
tables S7 or S10, Supplementary Material online). This suggests that the use of clan, in which domains of an ancient
origin (i.e., those from different subfamilies) are grouped as
homologs, to define and connect architectures that are supposed to be closely related (i.e., explainable by one speciation
event) for the computation of evolvability could dilute some
of the evolvability correlations observed. The cause underlying
the change in the direction of correlation between evolvability and duplicability of E. coli protein architectures defined by
InterPro domains remains unclear. Nevertheless, the effects of
all these additional considerations were small; therefore, taken
as a whole, these results suggest that our approach is fairly
robust and that different definitions of protein domains will
not significantly alter the main observation that evolvability is
positively correlated with retainability and negatively correlated with essentiality.
Table 1. Partial Rank Correlations (rp) between Domain Architecture Evolvability and the Other Three Evolutionary Properties.ᵃ
Property        Species            Architecture Evolvability   Retainability          Duplicability
                                   rp       P valueᵇ           rp      P valueᵇ       rp       P valueᵇ
Retainability   Escherichia coli   0.59     <10⁻³²⁵
                Yeast              0.65     <10⁻³²⁵
                Mouse              0.54     <10⁻³²⁵
Duplicability   E. coli            0.13     <10⁻¹⁴             0.09    <10⁻⁷
                Yeast              0.02     0.11               0.03    0.08
                Mouse              0.24     <10⁻⁶⁴             0.01    0.49
Essentiality    E. coli            −0.12    <10⁻¹³             0.23    <10⁻⁴⁷         −0.13    <10⁻¹⁴
                Yeast              −0.15    <10⁻²²             0.08    <10⁻⁶          −0.14    <10⁻¹⁹
                Mouse              −0.05    <10⁻³              0.06    <10⁻⁴          −0.08    <10⁻⁷
ᵃ Data computed from the inferred networks constructed using Pfam-A for 2,753 E. coli, 4,039 yeast, and 5,109 mouse genes.
ᵇ P value is the probability of obtaining a result equal to or more extreme than what was actually observed under the null hypothesis of no correlation.
Evolvability and Alternative Splicing
The vast majority of mammalian genes undergo alternative
splicing (Kim et al. 2007; Merkin et al. 2012), whereas this
process is not seen in bacteria (Sorek and Cossart 2010)
and is rare in yeasts (Kim et al. 2008). To determine whether
alternative splicing plays a role in the creation of new protein
domain architectures, we included an additional factor, the
number of isoforms per gene (see Materials and Methods), in
the analysis of mouse gene products, and found that genes
producing more isoforms tended to encode proteins with a
highly evolvable domain architecture, evidenced by a weak
but significant positive correlation between architecture evolvability and the number of isoforms (r = 0.09, P < 10⁻⁹;
supplementary table S11, Supplementary Material online).
Not all annotated mRNA isoforms of a gene are translated
into proteins in reality. According to UniProtKB (UniProt
Knowledgebase) (Magrane and Consortium U 2011), the existence of an annotated protein can be supported at five
levels, from strongest to weakest: 1) “Experimental evidence
at protein level,” 2) “Experimental evidence at transcript
level,” 3) “Protein inferred from homology,” 4) “Protein predicted,” or 5) “Protein uncertain”. When exclusively focusing
our analysis on isoforms translated into proteins with the
strongest evidence code of “Experimental evidence at protein
level,” we obtained consistent results (architecture evolvability vs. number of isoforms: r = 0.05, P < 10⁻³; supplementary
table S12, Supplementary Material online). These results
imply that new domain architectures can arise through the
emergence of alternatively spliced forms. As a result, the original domain architecture and function of the gene product
can be maintained without invoking duplication, which could
potentially have a deleterious effect, like dosage imbalance
(Papp et al. 2003; Qian et al. 2010; Chang and Liao 2012), to
the organism.
Functional Characterization of Highly Evolvable
Proteins
To understand the surprising finding that architecture evolvability, while positively correlated with retainability, was
negatively correlated with gene essentiality, we investigated
which functional groups of nonessential genes (see Materials
and Methods) tended to encode proteins with high architecture evolvability. We performed a clustering analysis on
clusters of orthologous gene (COG) functional categories
(Tatusov et al. 2000), in which we divided all nonessential
genes equally into five evolvability levels (Level 1 denoting
the lowest and Level 5 the highest evolvability) according to
the evolvability of the architecture of the encoded protein as
determined from the analysis of the inferred networks described above. For each COG functional category, the percentage of nonessential genes grouped into each of the five
evolvability level bins was calculated, resulting in an occupancy profile of evolvability (fig. 5A; see supplementary fig.
S5A, Supplementary Material online, for the simplified networks). Three clusters denoted, respectively, as “high,” “medium,” and “low” evolvability of COG functional categories
emerged from a hierarchical clustering (see Materials and
Methods) for each of the three model organisms (E. coli,
yeast, and mouse) examined (fig. 5A). All but two of the
evolvability cluster assignments of these COG categories
were shown to be statistically significant (P < 0.05) by a
permutation test (supplementary fig. S6, Supplementary
Material online). The evolvability patterns of COG categories for the three model organisms were quite similar between the inferred networks and the simplified networks,
especially for metabolism categories (cf. fig. 5 and supplementary fig. S5, Supplementary Material online).
FIG. 5. Clusters of COG functional categories based on evolvability occupancy profiles of nonessential genes. (A) In each model organism (Escherichia
coli, yeast, or mouse), three clusters were seen: The high evolvability cluster (red), in which the percentage of nonessential genes increased as evolvability
increased; the “medium evolvability” cluster (green), in which the percentage increased, then decreased, as evolvability increased; and the “low
evolvability” cluster (blue), in which the percentage decreased as evolvability increased. The occupancy of genes at each evolvability level (percentage
of nonessential genes in a given COG category having a given evolvability level) was normalized and color coded as indicated by the spectrum shown at
the top of the panel. COG categories are represented by one-letter codes, which are color coded to indicate the cluster to which they belong. The
occupancy profile for each functional category is shown as a line in the same color as the cluster and the averaged profile for a cluster is shown as a black
line. (B) Classification of COG functional categories for Escherichia coli (E), yeast (Y), or mouse (M). The color of the box indicates the evolvability cluster
to which the COG functional category was found to belong, with red indicating the high evolvability cluster, green the medium evolvability cluster, and
blue the low evolvability cluster; COG categories with fewer than ten proteins are indicated by white boxes. “*” and “#” denote those letter codes
(categories) in the high evolvability cluster for all, or just one or two, of the three model organisms, respectively.
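
The occupancy profile described above can be sketched in a few lines: genes are sorted by evolvability, split into five equally sized levels, and the percentage of each COG category's genes falling in each level is tabulated. Gene identifiers, category assignments, and the arbitrary handling of tied evolvability values are assumptions of this illustration; the subsequent hierarchical clustering and permutation test are not shown.

```python
from collections import defaultdict

def assign_levels(evolvability_by_gene, n_levels=5):
    """Split genes into equally sized bins by ascending evolvability (Level 1 = lowest)."""
    ordered = sorted(evolvability_by_gene, key=evolvability_by_gene.get)
    return {gene: rank * n_levels // len(ordered) + 1
            for rank, gene in enumerate(ordered)}

def occupancy_profiles(levels, cog_by_gene, n_levels=5):
    """For each COG category, the percentage of its genes found in each level."""
    counts = defaultdict(lambda: [0] * n_levels)
    for gene, level in levels.items():
        counts[cog_by_gene[gene]][level - 1] += 1
    return {cog: [100.0 * c / sum(row) for c in row] for cog, row in counts.items()}

# Ten hypothetical nonessential genes with evolvability 0..9 and two COG categories.
evolvability_by_gene = {f"g{i}": i for i in range(10)}
cog_by_gene = {"g0": "J", "g1": "J", "g2": "J", "g3": "J", "g5": "J",
               "g4": "Q", "g6": "Q", "g7": "Q", "g8": "Q", "g9": "Q"}

levels = assign_levels(evolvability_by_gene)
profiles = occupancy_profiles(levels, cog_by_gene)
print(profiles["J"])  # [40.0, 40.0, 20.0, 0.0, 0.0]
print(profiles["Q"])  # [0.0, 0.0, 20.0, 40.0, 40.0]
```

In this toy case, category Q shows the rising profile characteristic of a “high evolvability” cluster, and category J the falling profile of a “low evolvability” cluster.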
Two functional categories (I: Lipid transport and metabolism, Q: Secondary metabolite biosynthesis, transport, and
catabolism; marked with * in fig. 5B) were found to belong
to the “high evolvability” cluster in all three model organisms
(fig. 5B). The genes in these categories probably play an important role in the adaptation of the three lineages examined,
although the underlying mechanisms may not necessarily be
the same from one lineage to another.
Some categories (K: Transcription; V: Defense mechanisms;
T: Signal transduction mechanisms; M: Cell wall/membrane/
envelope biogenesis; Z: Cytoskeleton; C: Energy production
and conversion; G: Carbohydrate transport and metabolism;
E: Amino acid transport and metabolism; P: Inorganic ion
transport and metabolism; H: Coenzyme transport and metabolism; marked with # in fig. 5B) were only of “high
evolvability” in one or two of the three model organisms.
Two “high evolvability” COG functional categories (G and
C) shared by E. coli and yeast may be linked to their shared
unicellular lifestyle. In contrast to multicellular organisms,
which require a complex cell–cell communication system to
complete their developmental process and sustain life, the
reproduction (division) of unicellular organisms is coupled
with the growth of cell volume largely determined by biomass
production (Jorgensen and Tyers 2004), which might explain
why, in unicellular species, many categories associated with
metabolism were found in the high evolvability cluster
(categories G and C in E. coli and yeast, and E, P, and H in
E. coli; fig. 5B). Cytoskeletal proteins can direct cellular development, and their evolution is responsible for the emergence
of complex tissue and organ structures in plants and animals
(Meagher et al. 2008, 2009). Consistent with this hypothesis,
proteins in the functional category “Cytoskeleton (Z)” were
found to be of high evolvability only in the mouse, the multicellular eukaryote. Interestingly, genes associated with defense
mechanisms (category V; e.g., CRISPR-associated proteins) are expected to be fast evolving because of their involvement
in the host–pathogen arms race, but previous studies failed to
identify an overrepresentation of positive selection on nucleotide
substitutions in these genes (Chen et al. 2006; Takeuchi et al.
2012). Our result thus provides an alternative explanation:
the adaptive changes of these genes in bacteria may
occur at the domain architecture level in general.
For comparison, we also carried out an analysis of COG
functional categories for essential genes (supplementary fig.
S7, Supplementary Material online). Although the evolvability
occupancy profile for many COG categories could not be
determined because of a small sample size, the patterns of
COG classification for essential genes were generally different
from those shown in fig. 5 (or supplementary fig. S5,
Supplementary Material online, from the simplified networks) for nonessential genes. For example, many high evolvability categories (V, M, Q, G, C, E, P, H) of nonessential genes
did not retain the high evolvability status for essential genes in
any of the three model organisms, whereas three new high
evolvability categories (L: Replication, recombination and
repair; D: Cell cycle control, cell division, chromosome
partitioning; and F: Nucleotide transport and metabolism)
emerged. The functional categories enriched by high evolvability essential genes were also often lineage specific (e.g.,
yeast: K; E. coli: D and F; and mouse: Z), and no category of
high evolvability was shared by all the three model organisms
(supplementary fig. S7, Supplementary Material online).
These results reflect the known functional differences between essential genes and nonessential genes and strengthen
the notion that the evolutionarily retained “core” set of highly
evolvable proteins shared among species/lineages is more
likely to be encoded by nonessential genes.
Taken together, these results suggest that the ability of an
organism to adapt requires the presence in the genome of a
group of genes responsible for cellular functions involved in
adaptation, and that many of those genes, while being nonessential, encode proteins with a highly evolvable domain
architecture.
Discussion
Much of our understanding of how biological systems operate
has been gained from studies of biological networks, such as
protein–protein interaction (PPI) networks (Schwikowski
et al. 2000), genetic interaction networks (Costanzo et al.
2010), and gene coexpression networks (Stuart et al. 2003).
The edges of these networks usually imply functional relatedness between the genes or proteins (nodes), and the connectivity of a node is an index of pleiotropy (number of
functions performed by a gene or a protein) (Promislow
2004; Tyler et al. 2009; Nguyen et al. 2011). It has also been
shown that the connectivity of these networks follows a
power law distribution characteristic of scale-free networks
(Yook et al. 2004; Xulvi-Brunet and Li 2010; Hu et al. 2011; Xu
et al. 2011). Although having a similar frequency distribution
of degrees (the probability of observing domain architectures
with a given evolvability varies inversely with a power of that
evolvability; fig. 2), our networks are inherently different from
previously reported gene/protein networks in one important
aspect: The edges in our networks of domain architectures do
not indicate functional relatedness, but hypothetical evolutionary paths. In this regard, our networks are more similar to
the neutral network of RNA secondary structures (van
Nimwegen et al. 1999; Jörg et al. 2008), in which fixed-length RNA sequences (nodes) are connected in sequence
space through a series of nucleotide changes (edges), although, in contrast to our networks, the edge distribution
of the RNA network does not follow a power law distribution
(Aguirre et al. 2011) and the amino acid sequences of protein
domain architectures are not of fixed length.
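The inverse power relationship described above (the probability of observing an evolvability k falls off as a power of k) can be checked on any list of node degrees. The following is only a minimal sketch: the function names and the toy data are hypothetical, and an ordinary least-squares fit on log-log axes is a crude estimator used here for brevity; it is not claimed to be the procedure behind fig. 2.

```python
import math
from collections import Counter

def degree_distribution(degrees):
    """Return {k: P(k)} over the positive degrees (evolvabilities)."""
    positive = [k for k in degrees if k > 0]  # log(0) is undefined
    counts = Counter(positive)
    return {k: c / len(positive) for k, c in sorted(counts.items())}

def loglog_slope(dist):
    """Ordinary least-squares slope of log P(k) against log k."""
    xs = [math.log(k) for k in dist]
    ys = [math.log(p) for p in dist.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical toy data: the count halves each time k doubles.
toy_evolvability = [1] * 8 + [2] * 4 + [4] * 2 + [8]
gamma = -loglog_slope(degree_distribution(toy_evolvability))
```

In the toy data the frequency halves whenever the evolvability doubles, so the fitted exponent gamma is 1; real degree lists are noisier and would normally call for a more careful estimator.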
One likely scenario underlying the power law distribution
of protein architecture evolvabilities is that some domains are
more prone (i.e., evolvable) than others to duplication or combination with other domains, and that evolvability can, to some extent, carry over from one architecture
level to the next (i.e., from single domain to di-domain, to tri-domain, and so on), allowing a mechanism such as preferential attachment (Barabasi and Albert 1999) to operate during
evolution. To test this hypothesis, we performed a permutation experiment to randomly redistribute the domains
contained in the domain architectures present in the inferred
network, then used the same procedures described above to
regenerate the network. For each domain, we tested (Mann–Whitney U test) whether the architectures containing it had a significantly lower evolvability after the permutation, that is, whether their observed evolvability was higher than expected from the random permutation. We identified 22, 43, and 14 such domains (dubbed “driver domains”) for protein architectures of bacteria, fungi, and
metazoans, respectively. The results showed that driver domains were observed in increasing frequency in protein architectures of a higher evolvability (supplementary fig. S8A,
Supplementary Material online) and also among the pool of
domains used by protein architectures of a higher evolvability
(supplementary fig. S8B, Supplementary Material online), regardless of the lineage examined. These results can be interpreted as an evolutionary consequence resulting from the
propagation of driver domains in highly evolvable domain
architectures, likely due to the engagement of driver domains
in promoting intraarchitectural duplication or interarchitectural recombination, supporting our propagation hypothesis.
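The permutation experiment above has two ingredients that are easy to illustrate: redistributing domains among architectures, and comparing the evolvabilities of one domain's architectures before and after permutation with a Mann–Whitney U test. The sketch below is hypothetical in its names and toy numbers, keeps every architecture at its original length (an assumption, since the exact shuffling scheme is not specified here), uses a normal approximation that is rough for small samples, and omits the network regeneration step.

```python
import math
import random

def shuffle_domains(architectures, seed=0):
    """Randomly redistribute domains among architectures, keeping each length."""
    rng = random.Random(seed)
    pool = [d for arch in architectures for d in arch]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for arch in architectures:
        shuffled.append(tuple(pool[start:start + len(arch)]))
        start += len(arch)
    return shuffled

def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney U test of x > y.
    Normal approximation with tie and continuity corrections.
    Returns (U statistic of x, p-value)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1..j
        tie_term += (j - i) ** 3 - (j - i)
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return u, 1.0
    z = (u - n1 * n2 / 2.0 - 0.5) / math.sqrt(var)
    return u, 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical evolvabilities of the architectures containing one domain.
observed = [5, 6, 7]   # in the inferred network
permuted = [1, 2, 3]   # after permutation and network regeneration
u_stat, p_value = mann_whitney_greater(observed, permuted)
```

A domain would be flagged as a candidate driver domain when the observed evolvabilities are significantly higher than the permuted ones; the significance cutoff to apply is left to the reader because it is not stated in this section.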
Scale-free networks are robust against the random removal
of nodes and are able to propagate perturbation through the
network within a few steps (Newman 2003). Thus, it is possible that the power law distribution of protein evolvability is
an evolutionary consequence of increasing the robustness
and efficiency of adaptation for the entire system (proteome).
Whether the distribution observed is shaped by natural selection or is merely a byproduct of processes unrelated to
adaptation (Lynch 2007) requires further investigation.
Previous studies on PPI networks have established a centrality–lethality rule that deletion of proteins with a greater
number of interacting partners, that is, those connecting to
more nodes in the network and hence being of a higher
centrality, is more lethal to the organism in both yeast and
humans (Jeong et al. 2001; Liang and Li 2007). A similar tendency has been observed in networks constructed using other
types of data, such as coexpression of protein-coding genes
(Bhardwaj and Lu 2005). As noted above, the network of
protein domain architectures did not follow the centrality–
lethality (i.e., essentiality) rule. We found that the evolvability
of a protein’s architecture in our network and its connectivity
in the PPI network (see Materials and Methods) were uncorrelated in E. coli and mouse and negatively correlated in yeast
(E. coli: r = 0.01, P = 0.60; yeast: r = -0.04, P = 0.02; mouse:
r = 0.0, P = 0.95), in spite of the positive correlation between
essentiality and PPI connectivity (E. coli: r = 0.08, P < 10^-6;
yeast: r = 0.30, P < 10^-85; mouse: r = 0.02, P = 0.02). These
results corroborate the notion that the nature of the edges in
our architecture network differs fundamentally from that in
conventional PPI networks.
Every edge (evolutionary event) of the inferred network
can be further classified according to three criteria:
1) whether the event occurs at a terminal or internal
position of the architecture, 2) whether the change is an addition or a deletion, and 3) whether or not the added domain is
new (novel). Consistent with findings reported in the literature (Björklund et al. 2005; Pasek et al. 2006; Weiner et al.
2006; Ekman et al. 2007; Buljan and Bateman 2009),
supplementary figure S9A, Supplementary Material online,
shows that terminal, addition, and novel domain links dominated the types of evolutionary events, although the dominance was less pronounced in metazoans than in bacteria or
fungi. As a result, restricting the analysis to the major types of evolutionary events did not significantly change the
correlation between evolvability and retainability (supplementary fig. S9B, Supplementary Material online) or between evolvability and essentiality
(supplementary fig. S9C, Supplementary Material online).
The dominance of terminal events also suggests that
a lengthy architecture may not necessarily be of high evolvability, despite having more domains and more locations for
recombination, insertion, and deletion. Indeed, we found that
evolvability was negatively correlated with architecture length
(bacteria: r = -0.23, P < 10^-325; fungi: r = -0.22, P < 10^-325;
metazoans: r = -0.11, P < 10^-325; supplementary fig. S10,
Supplementary Material online). Many architectures with extremely high evolvability, such as F-box-like (PF12937), zf-C2H2 (PF00096), Pkinase (PF00069), and immunoglobulin
domain (PF13895), were single-domain architectures that
are known to be promiscuous and can interact or recombine
with many different partner domains (Basu et al. 2008; Hsu
et al. 2013). To examine the effect of a promiscuous single
domain on the evolvability of its residing multidomain architectures, we plotted the largest evolvability of component
single-domain architecture versus the evolvability of its residing multidomain architecture, and found that a weak, but
significant, positive correlation was evident only in mouse
(supplementary fig. S11, Supplementary Material online).
This indicates that the promiscuity of single domains does not
sufficiently account for the architecture evolvability of multidomain proteins, although how evolvability propagates and
diminishes as domains accrue in an architecture awaits further
investigation. In addition, excluding proteins containing promiscuous domains (see Materials and Methods) from the
analysis or focusing analysis on architectures composed of a
similar range of domain numbers did not alter the correlation
trends observed in figure 3 (supplementary figs. S12 and S13,
Supplementary Material online) and table 1 (supplementary
tables S13 and S14, Supplementary Material online), indicating that neither could be a determining factor of the correlations observed.
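The first two criteria introduced at the start of this paragraph (terminal versus internal position, addition versus deletion) can be illustrated for the simplest case, in which two architectures differ by exactly one domain. This is a hypothetical sketch, not the model of Fong et al. (2007): it ignores fusion and fission events, it does not assess the third criterion (novelty of the added domain), and the function name and tuple representation are assumptions made for illustration.

```python
def classify_single_domain_change(parent, child):
    """Classify the change from `parent` to `child` (tuples of domain IDs)
    when they differ by exactly one inserted or deleted domain.

    Returns (kind, position, domain), or None if no single-domain
    insertion or deletion explains the difference. When repeated domains
    make the position ambiguous, the leftmost candidate is reported.
    """
    if len(child) == len(parent) + 1:
        kind, longer, shorter = "addition", child, parent
    elif len(child) == len(parent) - 1:
        kind, longer, shorter = "deletion", parent, child
    else:
        return None
    for i in range(len(longer)):
        # Removing position i from the longer architecture must give the shorter one.
        if longer[:i] + longer[i + 1:] == shorter:
            at_end = i in (0, len(longer) - 1)
            return kind, "terminal" if at_end else "internal", longer[i]
    return None

# Hypothetical examples using Pfam accessions mentioned in the text.
event_a = classify_single_domain_change(("PF00069",), ("PF00069", "PF00096"))
event_b = classify_single_domain_change(
    ("PF12937", "PF00096", "PF13895"), ("PF12937", "PF13895"))
```

Here `event_a` describes a terminal addition of PF00096 and `event_b` an internal deletion of PF00096.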
The positive correlation between evolvability and retainability (table 1 and supplementary table S1, Supplementary
Material online) suggests that proteins with a highly evolvable
domain architecture might be preferentially preserved and/or
repeatedly generated during evolution. In order to test these
two possibilities, we calculated the longest preserved duration
(Tmax) and the number of repeatedly generated events (Ngen)
during evolution for each of the domain architectures found
in proteins encoded in the entire set of bacterial, fungal, or
metazoan genomes (see Materials and Methods; supplementary fig. S14, Supplementary Material online). We found that
Tmax and Ngen were positively correlated (bacteria: r = 0.73,
P < 10^-325; fungi: r = 0.48, P < 10^-325; metazoans: r = 0.40,
P < 10^-325). Therefore, an analysis of partial rank correlation
controlling for the intercorrelation between Tmax and Ngen
was carried out. The results showed that architecture
evolvability was directly associated with both Tmax and Ngen in
all three lineages, and the correlation was much stronger with
the former than with the latter (supplementary table S15,
Supplementary Material online). These results suggest that
the retention of highly evolvable architectures was mostly
attributable to longer preservation duration (larger Tmax) of
the architecture during evolution.
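A partial rank correlation of the kind used above can be sketched as follows. The assumptions are that the rank correlation is Spearman's coefficient (Pearson correlation of average ranks) and that a single covariate is removed with the standard first-order partial correlation formula; neither choice is confirmed by the text, and the function names and toy vectors are hypothetical.

```python
import math

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            result[order[k]] = (i + 1 + j) / 2.0
        i = j
    return result

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    return pearson(ranks(a), ranks(b))

def partial_spearman(x, y, z):
    """Rank correlation of x and y after controlling for z."""
    rxy, rxz, ryz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical toy vectors for five architectures.
evolvability = [1, 2, 3, 4, 5]
t_max = [2, 1, 4, 3, 5]
n_gen = [1, 3, 2, 5, 4]
r_evol_tmax_given_ngen = partial_spearman(evolvability, t_max, n_gen)
```

The same call with `t_max` and `n_gen` swapped gives the association with Ngen after controlling for Tmax, which is how the two candidate explanations can be compared.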
The finding that proteins with a domain architecture of
high evolvability tended to be nonessential for the organism,
as suggested by the observed negative correlation between
evolvability and essentiality (table 1 and supplementary table
S1, Supplementary Material online), seems to run counter to
the notion that genes with a high retention rate in genomes
of organisms across the tree of life generally have essential
cellular functions, which was concluded from a comparison of
the essentiality data of lineage-specific genes with those of
non–lineage-specific genes (Castillo-Davis and Hartl 2003) as
well as a correlation between gene essentiality and the propensity of gene loss in eukaryotes (Krylov et al. 2003). Our
finding therefore suggests that this widely held notion needs
to be modified, as retainability does not necessarily equate
with essentiality. Note that the evolvability of protein domain
architectures had not previously been defined or estimated, and the
complicated associations among gene essentiality, retainability, and protein “evolvability” were unknown previously.
Analysis of functional categories of nonessential genes
(fig. 5) suggested that these genes are probably retained in
genomes to provide organisms the capability to adapt to
environmental changes during evolution. Nonessential
genes that encode highly evolvable proteins, particularly
those in the high evolvability cluster (fig. 5B), can be considered as a core set of nonessential genes/proteins commonly
employed for adaptation during the evolution of bacterial,
fungal, or metazoan species, and species without such genes
can survive and reproduce in the present ecosystem, but can
easily become extinct because of decreased ability to adapt to
a changing environment.
Concluding Remarks
We have found that proteins with a highly evolvable domain
architecture are present in a wide range of species.
Surprisingly, the genes encoding these highly retained proteins are less essential to the organism. Our analysis also indicates that these genes tend to duplicate more often and
also generate more isoforms in species in which alternative
splicing occurs. The tendencies described above suggest an
evolutionary route by which genes/proteins can explore new
domain architectures, and consequently create new functions, providing the ability to adapt and evolve. Although
we did not examine plants and archaea because of the lack
of gene essentiality data for their model organisms, the consistent patterns of these tendencies seen in three kingdoms of
life (bacteria, fungi, and metazoans) suggest a common need
for nonessential genes in the genome that is probably associated with the ability to adapt to a changing environment
and is related to the evolvability of protein domain
architecture.
Materials and Methods
Data on protein domains and architectures were downloaded
from Pfam (version 27.0) (Finn et al. 2013), which contains
annotations for 9,845,277 proteins from the completely sequenced genomes of 3,051 species. To avoid a bias toward
more intensively studied protein families, we only considered
domain architectures derived from species with a fully sequenced genome. In the analyses performed on species
from a single lineage, we focused on the subset of genes
from bacteria (1,557 species), fungi (138 species), and metazoans (82 species), coding, respectively, for 5,087,744,
1,209,787, and 1,747,372 proteins. In all analyses, only the
domain architectures of full-length proteins were considered.
In this work, results using four different Pfam domain definitions, Pfam-A, Pfam-clan, Pfam-A+B (putative domains),
and Pfam-A+B+OD, are reported (in the main text for Pfam-A and in the supplement for the rest). Domain architectures for
Pfam-A were directly downloaded from the Pfam site
(http://pfam.xfam.org/, last accessed July 29, 2014), while for
those derived using Pfam-A+B and Pfam-A+B+OD, we followed the procedures described by Ekman et al. (2005). Pfam-clan architectures were deduced using a mapping of Pfam-A
to Pfam-clan available from the Pfam site. Pfam-A provides
three levels of cut offs—“trusted,” “gathering,” and “noise”—
to assign domains. Domains derived using the gathering cut
off, which is Pfam-A’s default, were used in this work. To
investigate the effects of using a different threshold for
domain classification on our results, we also analyzed domains assigned by the trusted and the noise threshold.
However, because only a very small percentage (1% or less)
of proteins changed their architecture using either of the two
different thresholds, no significant changes in the correlation
values to those using the gathering threshold (table 1) were
observed (supplementary tables S16 and S17, Supplementary
Material online). Similar results were also obtained when a
different set of protein domains, as assigned by InterPro (version 39) (Hunter et al. 2012), was used (supplementary table S10,
Supplementary Material online) or when internal duplications of a repeat unit were treated as one domain (supplementary
table S18, Supplementary Material online). The 25
promiscuous domains used in some of the analyses were those
reported by Basu et al. (2008).
To place a species in one of the three life lineages (bacteria,
fungi, and metazoans) studied, the taxonomy information for
the source genomes retrieved from NCBI’s Taxonomy
Database (Federhen 2012) was used. The retainability, or retention rate, of a domain architecture was calculated as the
ratio of the number of species in which this architecture was
found to the total number of species (only those with a fully
sequenced genome were considered) in the lineage examined.
The evolutionary relationships between architectures for the
inferred networks were deduced using the model of evolution
developed by Fong et al. (2007) while considering only fission/
fusion and insertion/deletion events and ignoring events of
the more complicated “other” rearrangement class (Fong
et al. 2007). Including the more complicated events under a
maximal cost of 3 (Fong et al. 2007) resulted in up to twice as
many network connectivities (events), but the correlations
between the examined properties remained similar (supplementary table S19, Supplementary Material online). The simplified networks were constructed following the procedure
described in our previous report (Hsu et al. 2013).
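The retainability defined in this paragraph (the number of species in a lineage that carry an architecture, divided by the total number of species in that lineage) reduces to a simple count. The sketch below assumes a hypothetical input layout, a dictionary from species name to the list of architectures found in its proteome, with each architecture written as a tuple of domain identifiers.

```python
def retainability(species_to_architectures):
    """Map each architecture to the fraction of species that contain it."""
    total = len(species_to_architectures)
    counts = {}
    for architectures in species_to_architectures.values():
        for arch in set(architectures):  # count each species at most once
            counts[arch] = counts.get(arch, 0) + 1
    return {arch: n / total for arch, n in counts.items()}

# Hypothetical lineage of four fully sequenced species.
toy_lineage = {
    "species_1": [("A",), ("A", "B")],
    "species_2": [("A",)],
    "species_3": [("A", "B"), ("A", "B")],  # two proteins, same architecture
    "species_4": [],
}
toy_retainability = retainability(toy_lineage)
```

Species that lack the architecture still count in the denominator, so both toy architectures have a retainability of 0.5.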
The outward connectivity (out-degree) of each node in the
inferred networks was used to define the evolvability of the
domain architecture (node) examined. The connectivity of
the simplified networks does not indicate a parent-to-child
relationship because no phylogenetic information was used in
constructing this type of network (Hsu et al. 2013); for simplicity, the undirected degree was used to define the evolvability of a given
domain architecture.
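Both definitions of evolvability above are node-degree counts and can be sketched in a few lines. The edge lists here are hypothetical, architectures are written as tuples of domain identifiers, and multiple events between the same pair of architectures are assumed to count once; that last point is not stated in the text.

```python
from collections import defaultdict

def out_degree(directed_edges):
    """Evolvability in an inferred network: number of distinct children
    reached from each parent architecture."""
    children = defaultdict(set)
    for parent, child in directed_edges:
        children[parent].add(child)
        children.setdefault(child, set())  # leaves appear with out-degree 0
    return {node: len(kids) for node, kids in children.items()}

def degree(undirected_edges):
    """Evolvability in a simplified network: number of distinct neighbours."""
    neighbours = defaultdict(set)
    for a, b in undirected_edges:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {node: len(adj) for node, adj in neighbours.items()}

# Hypothetical parent-to-child events.
toy_edges = [
    (("A",), ("A", "B")),
    (("A",), ("C", "A")),
    (("A", "B"), ("A", "B", "C")),
]
toy_out = out_degree(toy_edges)
toy_deg = degree(toy_edges)
```

In the toy network the single-domain architecture ("A",) has an out-degree of 2, while ("A", "B") has an out-degree of 1 but an undirected degree of 2.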
The genomic data for E. coli, S. cerevisiae, and M. musculus
were retrieved from Pfam to compute and assign four evolutionary properties to the genes that encode the protein
(hence domain architecture) under consideration; these
were evolvability, retainability, essentiality, and duplicability.
A gene’s evolvability and retainability were set to the same
values calculated for its protein product’s domain architecture from the lineage network (of bacteria, fungi, or metazoans) to which the gene belongs. In the case of one gene
producing multiple protein products (isoforms), the evolvabilities and retainabilities of all isoforms generated from this
gene were averaged. However, the results of randomly keeping one isoform for each gene were also produced (supplementary table S3, Supplementary Material online).
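The gene-level assignment described above, including the averaging over isoforms, might look like the following sketch. The input dictionaries and the function name are hypothetical, and skipping isoforms whose architecture has no value in the lineage network is an assumption.

```python
def gene_level_property(isoform_architectures, architecture_value):
    """Average a per-architecture property over the isoforms of each gene.

    isoform_architectures: dict gene -> list of architectures, one per isoform
    architecture_value:    dict architecture -> evolvability or retainability
    """
    result = {}
    for gene, architectures in isoform_architectures.items():
        values = [architecture_value[a] for a in architectures
                  if a in architecture_value]
        if values:  # genes with no usable isoform are left out
            result[gene] = sum(values) / len(values)
    return result

# Hypothetical evolvabilities from a lineage network.
toy_evolvability = {("A",): 3, ("A", "B"): 5}
toy_genes = {
    "gene_1": [("A",), ("A", "B")],  # two isoforms
    "gene_2": [("A", "B")],
    "gene_3": [("Z",)],              # architecture absent from the network
}
toy_gene_evolvability = gene_level_property(toy_genes, toy_evolvability)
```

For the toy input, `gene_1` receives the mean of its two isoform values and `gene_3` is omitted.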
In analyses involving essentiality (table 1 and supplementary tables S1–S14 and S16–S19, Supplementary Material
online), only proteins produced from genes with experimentally measured essentiality data were used. Gene essentiality
data for E. coli and budding yeast were calculated as 1 minus
the relative growth rate of the knockout mutant strain obtained from, respectively, GenoBase (Baba et al. 2006; Baba
and Mori 2008) or the Stanford yeast deletion project
(Steinmetz et al. 2002). Genes with the phenotype of zero
growth rate in these experiments were regarded as essential
and the others as nonessential. Mouse essential genes (essentiality = 1) and nonessential genes (essentiality = 0) were
defined according to previous studies (Liao and Zhang
2008; Liao et al. 2010). To define gene duplicability, paralogs
of S. cerevisiae genes and mouse genes identified by the
method of Vilella et al. (2009) were retrieved using the
BioMart interface (Haider et al. 2009). As in a previous
study (He and Zhang 2006), paralogs of E. coli were identified
by a BLASTP sequence search (Altschul et al. 1997), using the
threshold of 60% amino acid sequence identity and an E-value
of 10^-10.
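For E. coli and yeast, the essentiality measure described in this paragraph is a one-line transformation of the knockout growth data. In the sketch below the function name is hypothetical, the relative growth rate is assumed to be expressed as a fraction of wild-type growth, and no clipping is applied to mutants that grow faster than wild type.

```python
def essentiality_from_growth(relative_growth_rate):
    """Return (essentiality score, is_essential) for one knockout strain.

    The score is 1 minus the relative growth rate; a gene is called
    essential only when the knockout shows zero growth.
    """
    score = 1.0 - relative_growth_rate
    return score, relative_growth_rate == 0

lethal_knockout = essentiality_from_growth(0.0)   # no growth
mild_knockout = essentiality_from_growth(0.75)    # 75% of wild-type growth
```

A lethal knockout therefore has a score of 1 and is essential, while a knockout growing at 75% of the wild-type rate has a score of 0.25 and is nonessential.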
A total of 3,753 E. coli genes, 4,039 yeast genes, and 5,109
mouse genes with domain architecture information (evolvability and retainability) and gene information (essentiality and
duplicability) were used in the correlation analyses. These
numbers of genes were subject to change due to domain
architecture assignments used in different considerations
(see each supplementary table for the number of genes
used in a given calculation). Note that the properties of essentiality and duplicability were computed for each gene
within the genome of E. coli, yeast, or mouse, whereas the
properties of evolvability and retainability were computed
using the aggregate network of the lineage to which E. coli,
yeast, or mouse belongs, because networks in a single species
were too fragmented to be used. Supplementary table S20,
Supplementary Material online, lists the four properties of
every architecture in the genome of each of the three model
organisms.
The COG functional categories of genes were downloaded
from eggNOG (version 4.0) (Powell et al. 2014). We used a
Euclidean distance-based hierarchical clustering method
(Ward 1963) for the classification of COG functional categories. The correlation analysis and clustering of COG categories
were performed using MATLAB’s Bioinformatics Toolbox (release 2012b). Data for PPIs for S. cerevisiae and M. musculus
were retrieved from the BioGRID database (Chatr-Aryamontri
et al. 2013), while those for E. coli were taken from Rajagopala
et al. (2014).
Based on the presence of a given domain architecture in
both the sequenced genomes and the hypothetical ancestral
genomes inferred from the maximum parsimony method of
Fong et al. (2007), Tmax and Ngen were determined for every
domain architecture: Tmax was the longest continuous path in
which a given domain architecture was preserved from root
to termini of the taxonomy tree, and Ngen was the number of
events in which the examined architecture was newly generated across the entire taxonomy tree (supplementary fig. S14,
Supplementary Material online).
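Given a taxonomy tree and the set of nodes (sequenced and inferred ancestral genomes) in which an architecture is present, Tmax and Ngen as defined above can be computed by walking the tree. The sketch below makes several assumptions that the text does not settle: the tree is a hypothetical child-to-parent dictionary, Tmax is measured as the number of consecutive branches along a root-to-tip path on which the architecture is preserved, and presence at the root counts as one generation event.

```python
def tmax_ngen(parent, present):
    """parent:  dict node -> parent node (the root maps to None)
    present: set of nodes in which the architecture is found
    Returns (Tmax, Ngen)."""
    run = {}  # node -> preserved branches in the unbroken run ending at node

    def run_length(start):
        node, path = start, []
        while node not in run:
            p = parent.get(node)
            if p is None or p not in present:
                run[node] = 0          # the run starts at this node
            else:
                path.append(node)
                node = p
        for n in reversed(path):       # fill in from the top of the run down
            run[n] = run[parent[n]] + 1
        return run[start]

    t_max = max((run_length(n) for n in present), default=0)
    # A generation event: present here but absent from the parent (or at the root).
    n_gen = sum(1 for n in present
                if parent.get(n) is None or parent[n] not in present)
    return t_max, n_gen

# Hypothetical tree: root -> (A, B); A -> (a1, a2); B -> (b1).
toy_parent = {"root": None, "A": "root", "B": "root",
              "a1": "A", "a2": "A", "b1": "B"}
toy_present = {"A", "a1", "b1"}
toy_tmax, toy_ngen = tmax_ngen(toy_parent, toy_present)
```

In the toy tree the architecture arises twice (at A and at b1) and is preserved along one branch (A to a1), giving Ngen = 2 and Tmax = 1.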
Supplementary Material
Supplementary tables S1–S20 and figures S1–S14 are available
at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgments
This study was supported by intramural funding from the
Academia Sinica (to M.J.H.) and the National Health
Research Institutes (to B.Y.L.) and research grants from the
Ministry of Science and Technology (MOST 101-2311-B-400-001-MY3 and 104-2311-B-400-002-MY3 to B.Y.L.). We thank
Dr Barkas for English editing.
References
Aguirre J, Buldu JM, Stich M, Manrubia SC. 2011. Topological structure
of the space of phenotypes: the case of RNA neutral networks. PLoS
One 6:e26324.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W,
Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Res.
25:3389–3402.
Baba T, Ara T, Hasegawa M, Takai Y, Okumura Y, Baba M, Datsenko KA,
Tomita M, Wanner BL, Mori H. 2006. Construction of Escherichia
coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Mol Syst Biol. 2:2006.0008.
Baba T, Mori H. 2008. The construction of systematic in-frame, single-gene knockout mutant collection in Escherichia coli K-12. Methods
Mol Biol. 416:171–181.
Barabasi AL, Albert R. 1999. Emergence of scaling in random networks.
Science 286:509–512.
Basu MK, Carmel L, Rogozin IB, Koonin EV. 2008. Evolution of protein
domain promiscuity in eukaryotes. Genome Res. 18:449–461.
Bhardwaj N, Lu H. 2005. Correlation between gene expression profiles
and protein-protein interactions within and across genomes.
Bioinformatics 21:2730–2738.
Björklund ÅK, Ekman D, Light S, Frey-Skött J, Elofsson A. 2005. Domain
rearrangements in protein evolution. J Mol Biol. 353:911–923.
Brookfield JF. 2001. Evolution: the evolvability enigma. Curr Biol.
11:R106–R108.
Brookfield JFY. 2009. Evolution and evolvability: celebrating Darwin 200.
Biol Lett. 5:44–46.
Buljan M, Bateman A. 2009. The evolution of protein domain families.
Biochem Soc Trans. 37:751–755.
Castillo-Davis CI, Hartl DL. 2003. Conservation, relocation and duplication in genome evolution. Trends Genet. 19:593–597.
Chang AY, Liao BY. 2012. DNA methylation rebalances gene dosage
after mammalian gene duplications. Mol Biol Evol. 29:133–144.
Chatr-Aryamontri A, Breitkreutz BJ, Heinicke S, Boucher L, Winter A,
Stark C, Nixon J, Ramage L, Kolas N, O’Donnell L, et al. 2013. The
BioGRID interaction database: 2013 update. Nucleic Acids Res.
41:D816–D823.
Chen SL, Hung CS, Xu J, Reigstad CS, Magrini V, Sabo A, Blasiar D, Bieri T,
Meyer RR, Ozersky P, et al. 2006. Identification of genes subject to
positive selection in uropathogenic strains of Escherichia coli: a
comparative genomics approach. Proc Natl Acad Sci U S A.
103:5977–5982.
Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H,
Koh JL, Toufighi K, Mostafavi S, et al. 2010. The genetic landscape of
a cell. Science 327:425–431.
Ekman D, Bjorklund AK, Elofsson A. 2007. Quantification of the elevated
rate of domain rearrangements in metazoa. J Mol Biol. 372:1337–
1348.
Ekman D, Bjorklund AK, Frey-Skott J, Elofsson A. 2005. Multi-domain
proteins in the three kingdoms of life: orphan domains and other
unassigned regions. J Mol Biol. 348:231–243.
Federhen S. 2012. The NCBI Taxonomy database. Nucleic Acids Res.
40:D136–D143.
Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger
A, Hetherington K, Holm L, Mistry J, et al. 2013. Pfam: the protein
families database. Nucleic Acids Res. 42:D222–D230.
Fong JH, Geer LY, Panchenko AR, Bryant SH. 2007. Modeling the evolution of protein domain architectures using maximum parsimony.
J Mol Biol. 366:307–315.
Forslund K, Sonnhammer EL. 2012. Evolution of protein domain architectures. Methods Mol Biol. 856:187–216.
Gregory TR. 2002. A bird’s-eye view of the C-value enigma: genome
size, cell size, and metabolic rate in the class Aves. Evolution
56:121–130.
Gustafson AM, Snitkin ES, Parker SCJ, DeLisi C, Kasif S. 2006. Towards the
identification of essential genes using targeted genome sequencing
and comparative analysis. BMC Genomics 7:265–280.
Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A. 2009.
BioMart Central Portal—unified access to biological data. Nucleic
Acids Res. 37:W23–W27.
He X, Zhang J. 2006. Higher duplicability of less important genes in yeast
genomes. Mol Biol Evol. 23:144–151.
Hsu CH, Chen CK, Hwang MJ. 2013. The architectural design of networks of protein domain architectures. Biol Lett. 9:20130268.
Hu T, Sinnott-Armstrong NA, Kiralis JW, Andrew AS, Karagas MR,
Moore JH. 2011. Characterizing genetic interactions in human disease association studies using statistical epistasis networks. BMC
Bioinformatics 12:364.
Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A,
Bernard T, Binns D, Bork P, Burge S, et al. 2012. InterPro in 2011: new
developments in the family and domain prediction database.
Nucleic Acids Res. 40:D306–D312.
Jeong H, Mason SP, Barabasi AL, Oltvai ZN. 2001. Lethality and centrality
in protein networks. Nature 411:41–42.
Jörg T, Martin OC, Wagner A. 2008. Neutral network sizes of biological
RNA molecules can be computed and are not atypically small. BMC
Bioinformatics 9:464.
Jorgensen P, Tyers M. 2004. How cells coordinate growth and division.
Curr Biol. 14:R1014–R1027.
Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. 2004. The KEGG
resource for deciphering the genome. Nucleic Acids Res. 32:D277–D280.
Kim E, Goren A, Ast G. 2008. Alternative splicing: current perspectives.
Bioessays 30:38–47.
Kim E, Magen A, Ast G. 2007. Different levels of alternative splicing
among eukaryotes. Nucleic Acids Res. 35:125–131.
Kirschner M, Gerhart J. 1998. Evolvability. Proc Natl Acad Sci U S A.
95:8420–8427.
Krylov DM, Wolf YI, Rogozin IB, Koonin EV. 2003. Gene loss, protein
sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res.
13:2229–2235.
Liang H, Li WH. 2007. Gene essentiality, gene duplicability and protein
connectivity in human and mouse. Trends Genet. 23:375–378.
Liao BY, Weng MP, Zhang J. 2010. Impact of extracellularity on the
evolutionary rate of mammalian proteins. Genome Biol Evol.
2010:39–43.
Liao BY, Zhang J. 2008. Null mutations in human and mouse orthologs
frequently result in different phenotypes. Proc Natl Acad Sci U S A.
105:6987–6992.
Lin K, Zhu L, Zhang DY. 2006. An initial strategy for comparing proteins
at the domain architecture level. Bioinformatics 22:2081–2086.
Liu J, Rost B. 2004. CHOP proteins into structural domain-like fragments.
Proteins 55:678–688.
Lynch M. 2007. The evolution of genetic networks by non-adaptive
processes. Nat Rev Genet. 8:803–813.
Magrane M, Consortium U. 2011. UniProt Knowledgebase: a hub of
integrated protein data. Database (Oxford) 2011:bar009.
Meagher RB, Kandasamy MK, McKinney EC. 2008. Multicellular
development and protein-protein interactions. Plant Signal Behav.
3:333–336.
Meagher RB, Kandasamy MK, McKinney EC, Roy E. 2009. Chapter 5.
Nuclear actin-related proteins in epigenetic control. Int Rev Cell Mol
Biol. 277:157–215.
Merkin J, Russell C, Chen P, Burge CB. 2012. Evolutionary dynamics of
gene and isoform regulation in mammalian tissues. Science
338:1593–1599.
Mukherjee K, Campos H, Kolaczkowski B. 2013. Evolution of animal and
plant dicers: early parallel duplications and recurrent adaptation of
antiviral RNA binding in plants. Mol Biol Evol. 30:627–641.
Nagy A, Patthy L. 2013. MisPred: a resource for identification of erroneous
protein sequences in public databases. Database (Oxford) 2013:bat053.
Newman MEJ. 2003. The structure and function of complex networks.
SIAM Rev. 45:167–256.
Newman MEJ. 2006. Power laws, Pareto distributions and Zipf’s law.
Contemp Phys. 46:323–351.
Nguyen TP, Liu WC, Jordan F. 2011. Inferring pleiotropy by network analysis: linked diseases in the human PPI network. BMC Syst Biol. 5:179.
Papp B, Pal C, Hurst LD. 2003. Dosage sensitivity and the evolution of
gene families in yeast. Nature 424:194–197.
Pasek S, Risler JL, Brezellec P. 2006. Gene fusion/fission is a major contributor to evolution of multi-domain bacterial proteins.
Bioinformatics 22:1418–1423.
Payne JL, Wagner A. 2014. The robustness and evolvability of transcription factor binding sites. Science 343:875–877.
Pigliucci M. 2008. Is evolvability evolvable? Nat Rev Genet. 9:75–82.
Powell S, Forslund K, Szklarczyk D, Trachana K, Roth A, Huerta-Cepas J,
Gabaldon T, Rattei T, Creevey C, Kuhn M, et al. 2014. eggNOG v4.0:
nested orthology inference across 3686 organisms. Nucleic Acids Res.
42:D231–D239.
Promislow DE. 2004. Protein networks, pleiotropy and the evolution of
senescence. Proc Biol Sci. 271:1225–1234.
Qian W, Liao BY, Chang AY, Zhang J. 2010. Maintenance of duplicate
genes and their functional redundancy by reduced expression.
Trends Genet. 26:425–430.
Rajagopala SV, Sikorski P, Kumar A, Mosca R, Vlasblom J, Arnold R,
Franca-Koh J, Pakala SB, Phanse S, Ceol A, et al. 2014. The binary
protein-protein interaction landscape of Escherichia coli. Nat
Biotechnol. 32:285–290.
Rajon E, Masel J. 2011. Evolution of molecular error rates and the consequences for evolvability. Proc Natl Acad Sci U S A. 108:1082–1087.
Schwikowski B, Uetz P, Fields S. 2000. A network of protein-protein
interactions in yeast. Nat Biotechnol. 18:1257–1261.
Sorek R, Cossart P. 2010. Prokaryotic transcriptomics: a new view on
regulation, physiology and pathogenicity. Nat Rev Genet. 11:9–16.
Steinmetz LM, Scharfe C, Deutschbauer AM, Mokranjac D, Herman ZS,
Jones T, Chu AM, Giaever G, Prokisch H, Oefner PJ, et al. 2002.
Systematic screen for human disease genes in yeast. Nat Genet.
31:400–404.
Stuart JM, Segal E, Koller D, Kim SK. 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science
302:249–255.
Takeuchi N, Wolf YI, Makarova KS, Koonin EV. 2012. Nature and intensity of selection pressure on CRISPR-associated genes. J Bacteriol.
194:1216–1225.
Tatusov RL, Galperin MY, Natale DA, Koonin EV. 2000. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28:33–36.
Tyler AL, Asselbergs FW, Williams SM, Moore JH. 2009. Shadows of
complexity: what biological networks reveal about epistasis and
pleiotropy. Bioessays 31:220–227.
van Nimwegen E, Crutchfield JP, Huynen M. 1999. Neutral evolution of
mutational robustness. Proc Natl Acad Sci U S A. 96:9716–9720.
Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. 2009.
EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 19:327–335.
Wang M, Kurland CG, Caetano-Anolles G. 2011. Reductive evolution of
proteomes and protein structures. Proc Natl Acad Sci U S A.
108:11954–11958.
Ward JH Jr. 1963. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 58:236–244.
Waterhouse RM, Zdobnov EM, Kriventseva EV. 2011. Correlating
traits of gene retention, sequence divergence, duplicability and essentiality in vertebrates, arthropods, and fungi. Genome Biol Evol.
3:75–86.
Weiner J 3rd, Beaussart F, Bornberg-Bauer E. 2006. Domain deletions
and substitutions in the modular protein evolution. FEBS J.
273:2037–2047.
Xu L, Jiang H, Chen H, Gu Z. 2011. Genetic architecture of growth traits
revealed by global epistatic interactions. Genome Biol Evol. 3:909–
914.
Xulvi-Brunet R, Li H. 2010. Co-expression networks: graph properties
and topological comparisons. Bioinformatics 26:205–214.
Yook SH, Oltvai ZN, Barabasi AL. 2004. Functional and topological
characterization of protein interaction networks. Proteomics
4:928–942.
Zhang XC, Wang Z, Zhang X, Le MH, Sun J, Xu D, Cheng J, Stacey G.
2012. Evolutionary dynamics of protein domain architecture in
plants. BMC Evol Biol. 12:6.