Proteins with Highly Evolvable Domain Architectures Are Nonessential but Highly Retained

Chia-Hsin Hsu,†,1,2,3 Austin W. T. Chiang,†,1,2,3 Ming-Jing Hwang,*,1,2,3 and Ben-Yang Liao*,4

1 Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan, ROC
2 Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan, ROC
3 Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan, ROC
4 Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Miaoli County, Taiwan, ROC
† These authors contributed equally to this work.
*Corresponding author: E-mail: [email protected]; [email protected].
Associate editor: Naoko Takezaki

Abstract

The functions of proteins are usually determined by domains, and the sequential order in which domains are connected to make up a protein chain is known as the domain architecture. Here, we constructed evolutionary networks of protein domain architectures in species from three major life lineages (bacteria, fungi, and metazoans) by connecting any two architectures between which an evolutionary event could be inferred by a model that assumes maximum parsimony. We found that proteins with domain architectures with a higher level of evolvability, indicated by a greater number of connections in the evolutionary network, are present in a wider range of species. However, these proteins tend to be less essential to the organism, are duplicated more often during evolution, have more isoforms, and, intriguingly, tend to be associated with functional categories important for organismal adaptation. These results reveal the presence, in many genomes, of genes coding for a core set of nonessential proteins that have a highly evolvable domain architecture and thus a repertoire of genetic materials accessible for organismal adaptation.
Key words: protein domain architecture, protein evolution, evolvability, essentiality, evolutionarily inferred network.

© The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected] Mol. Biol. Evol. 33(5):1219–1230 doi:10.1093/molbev/msw006 Advance Access publication January 14, 2016

Introduction

The majority of proteins encoded in prokaryotic and eukaryotic genomes are composed of more than one structurally separable domain, and these domains may or may not be independent of each other functionally or evolutionarily (Liu and Rost 2004). Domain architecture refers to the sequential order of protein domains and has been an important subject in studies of protein evolution (Ekman et al. 2007; Fong et al. 2007; Forslund and Sonnhammer 2012; Zhang et al. 2012). The domains determine a protein’s structure and physical interactions with other macromolecules, and modifications of domain architectures, including domain duplication, insertion, deletion, fusion, and fission, therefore play a central role in creating evolutionary novelties for adaptation. For instance, during evolution, Dicer proteins from animals and plants have undergone changes in domain architecture that allowed the development of specialized gene regulatory and antiviral functions (Mukherjee et al. 2013).

To survive, organisms must maintain the ability to adapt to a changing environment (Pigliucci 2008; Brookfield 2009). Organismal adaptation is a complex process and can start either by the acquisition of advantageous genetic variations by positive selection or, alternatively, by the fixation of neutral or slightly deleterious variations in the population by genetic drift as the sources for preadaptation (Rajon and Masel 2011; Payne and Wagner 2014). Regardless of the mechanism, genetic variations important for evolution must emerge by mutations of functional elements of the genome, and protein-coding genes are among the best characterized of these elements. We hypothesized that protein-coding genes contribute significantly to an organism’s ability to evolve and that this contribution is related to the ability of highly evolvable architectures to evolve new architectures for adaptation. Here, we tested this hypothesis by examining the domain architecture of proteins in species from three major lineages of life, namely bacteria (prokaryotes), fungi (eukaryotes), and metazoans (eukaryotes), and asking 1) whether there is a core set of genes/proteins that are utilized by a wide range of species and can be more readily modified to produce new forms (i.e., new protein domain architectures) by natural selection and 2) if so, what the molecular, functional, and evolutionary properties of this core set of genes/proteins are.

Results

The Network and Evolvability of Protein Domain Architectures

Although there are various definitions (Gregory 2002; Pigliucci 2008; Brookfield 2009; Payne and Wagner 2014), “evolvability” can be defined as the propensity to evolve novel structures that are fixed in the population or have not yet been eradicated by natural selection (Kirschner and Gerhart 1998; Brookfield 2001). Accordingly, we constructed evolutionary networks of existing protein domain architectures in species with a sequenced genome, covering three lineages of living organisms: bacteria (1,557 genomes), fungi (138 genomes), and metazoans (82 genomes) (see Materials and Methods). We constructed a separate network for each of the three life lineages, and, in each network, connected all pairs of domain architectures (nodes) that can be explained by an evolutionary event (the edge) inferred by a maximum parsimony approach in which a new domain architecture arises by fission/fusion or insertion/deletion of domain(s) from a precursor architecture present in an ancestral species (Fong et al. 2007) (fig.
1A; also see supplementary fig. S1, Supplementary Material online, for an example of a subnetwork highlighting the architectures containing the DEAD domain). Besides this evolutionarily inferred network, we generated another type of network in which architectures were connected if they differed by only one domain (fig. 1B), as such events are most frequent in the generation of new domain architectures (Björklund et al. 2005). Note that this type of network, hereafter referred to as the “simplified” network as opposed to the “inferred” network mentioned above, is essentially an assembly of all the domain-centered architecture networks we have studied previously (Hsu et al. 2013), the difference being that the architectures of a network of this type in the present work were not restricted to those containing a specific single domain of interest. Below, we present results of the inferred networks, while providing those of the simplified networks in supplementary figures and tables, as both types of networks produced similar results.

The evolvability of a domain architecture (node) was defined as its connectivity, that is, the number of edges of the examined node that go out to (the inferred network) or simply connect with (the simplified network) other nodes. These other nodes represent architectures that were created from the examined node and have not been eradicated during evolution. When the probability of measuring a particular value of some quantity varies inversely with a power of that value, the quantity is said to follow a power law (Newman 2006). Analysis of the networks constructed for the genomes of bacteria, fungi, and metazoans showed that the evolvability of protein domain architectures followed a power law distribution; that is, for each network (lineage), most of the evolvability was contributed by a small number of highly evolvable architectures, although many architectures of a low evolvability were also present.
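As an illustration of the two definitions above, the following minimal Python sketch counts outgoing edges per architecture (evolvability K in an inferred, directed network) and estimates the slope of log10 P(K) against log10 K by ordinary least squares. The function names and the toy edge list are hypothetical, and the simple regression shown here is only an assumption standing in for the goodness-of-fit procedure of the study; it needs at least two distinct K values.

```python
import math
from collections import Counter


def out_degree(edges):
    """Evolvability K of each architecture: number of outgoing (parent -> child) edges."""
    k = Counter()
    for parent, child in edges:
        k[parent] += 1
        k[child] += 0  # make sure leaf architectures appear with K = 0
    return dict(k)


def loglog_slope(degrees):
    """Least-squares slope of log10 P(K) versus log10 K, using only K >= 1."""
    counts = Counter(d for d in degrees if d >= 1)
    total = sum(counts.values())
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c / total) for c in counts.values()]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical toy network: architecture "A" gives rise to "A+B" and "A+C".
toy_edges = [("A", "A+B"), ("A", "A+C"), ("A+B", "A+B+C")]
toy_k = out_degree(toy_edges)
```

A slope close to a negative constant on the log-log scale is what the negative linear relationship described in the text would look like; for a simplified (undirected) network, the same counting would apply to all edges touching a node rather than only outgoing ones.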
Significantly, the same power law distribution (i.e., a negative linear relationship between the logarithm of the probability of observing domain architectures with a given evolvability and the logarithm of that evolvability) was observed for all networks, whether inferred (fig. 2) or simplified (supplementary fig. S2, Supplementary Material online), regardless of the life lineage.

Protein Abundance and Genome Prevalence

We found that domain architectures with a higher evolvability tended to be present in a larger number of bacterial, fungal, or metazoan proteins than those with a lower evolvability (fig. 3 and supplementary fig. S3, Supplementary Material online). Note that although each protein has one corresponding domain architecture annotated in Pfam-A (Finn et al. 2013), the database used to construct the architecture networks (see Materials and Methods), many proteins are annotated with the same architecture. In other words, although the probability of finding a highly evolvable architecture among all unique architectures is low (fig. 2 and supplementary fig. S2, Supplementary Material online), many proteins were found to be composed of highly evolvable architectures. For architectures with higher evolvability, this phenomenon might result from a higher retainability (i.e., retention rate, defined as the percentage of genomes in the lineage containing at least one gene encoding a protein with a given architecture) and/or a greater protein abundance (number of proteins with this architecture within a genome). To determine the cause, we performed rank correlation analysis. For each species (only those species with a fully sequenced genome were considered, see Materials and Methods), we tested whether the evolvability of its protein domain architectures correlated with architecture retainability or with protein abundance. We plotted domain architecture evolvability against either architecture retainability or protein abundance for each species examined and calculated the Spearman’s correlation coefficient (r), and found that evolvability was positively correlated with retainability and also with protein abundance for almost all species examined (fig. 4 and supplementary fig. S4, Supplementary Material online). This indicates that protein domain architectures with a higher evolvability are present not only in more proteins encoded in a single genome (greater protein abundance), but also in more genomes (higher retainability).

FIG. 1. A schematic example of the construction of an evolutionarily inferred network (A) or a simplified network (B) of protein domain architectures. Domain architectures, annotated and numbered according to Pfam-A (“No.” above or below each box), are presented as a set of sequentially linked individual domains (boxed), each of which is represented by a unique symbol, which is listed in the box on the right, together with the domain name and Pfam accession number. Note that, in (A) (the inferred network), the evolutionary events connecting all pairs of parent (arrow tail) to child (arrow head) architectures were determined by the evolution model of Fong et al. (2007), while in (B) (the simplified network), any two architectures differing only by one domain would be connected, provided that the difference could be explained as the result of a single evolutionary event (Hsu et al. 2013). A zoom-in view of a segment of the inferred network for the lineage of metazoans is shown in supplementary figure S1, Supplementary Material online.

Essentiality and Duplicability

Besides protein domain architecture, the function of a gene and, ultimately, the fate of its evolution are affected by other properties.
In theory, homologous genes should encode proteins of similar domain architectures due to a common origin. Homologous genes, genes with shared ancestry either by speciation event (orthologs) or by duplication events (paralogs), are likely to encode proteins with an identical domain architecture (Lin et al. 2006), which means that genes encoding proteins of the same architecture in the same genome tend to be paralogs that have arisen from gene duplication events. On the other hand, it has also been shown that essential genes are more likely to be retained in distantly related genomes (Gustafson et al. 2006; Waterhouse et al. 2011) and are less likely to undergo duplication during evolution (He and Zhang 2006; Liang and Li 2007). The results of these previous studies, in conjunction with our observation in figure 4, imply a complex relationship among duplicability and essentiality of genes and the retainability and evolvability of the domain architecture of the encoded protein. Accordingly, we computed these four properties (see Materials and Methods) for three model organisms, Escherichia coli, Saccharomyces cerevisiae (budding yeast), and Mus musculus (house mouse), representing, respectively, the three lineages of bacteria, fungi, and metazoans, and examined their correlations. We defined gene essentiality using data from targeted gene deletion experiments and gene duplicability by the number of paralogs in the genome (see Materials and Methods). Architecture evolvability and retainability were computed from data for protein domain architectures; in order to correlate them with duplicability and essentiality, which are properties of genes, we assigned evolvability and retainability of a given protein domain architecture to the gene that encodes the protein from which the domain architecture is derived. In some cases, especially in the case of the mouse, one gene can have more than one protein product (isoforms) and some may have different domain architectures. In such cases, the evolvabilities and retainabilities of all the different isoforms of the same gene were averaged.

FIG. 2. Power law distribution of domain architecture evolvability for the inferred networks of the bacterial (left panel), fungal (center panel), and metazoan (right panel) lineages (those for the corresponding simplified networks are shown in supplementary fig. S2, Supplementary Material online). The linear regression data and the resulting Pearson’s correlation coefficient (r) are presented in each panel. K, domain architecture evolvability measured by network connectivity (see Materials and Methods); P(K), probability of observing K in the data analyzed. Goodness-of-fit statistical tests were used to determine that these distributions followed a power law equation, as indicated by a very small P value.

FIG. 3. Domain architectures with a higher evolvability (K) tended to be present in a larger number of proteins. The dot plots were generated based on the entire set of bacterial, fungal, or metazoan proteins (see Materials and Methods). The Spearman’s correlation coefficient (r) and the P value under the null hypothesis of no correlation are shown. Evolvability values were computed using the inferred networks; those for the simplified networks are shown in supplementary figure S3, Supplementary Material online.

FIG. 4. Boxplots of Spearman’s coefficients for the correlation between architecture evolvability and retainability (or protein abundance) for every bacterial, fungal, or metazoan species (genome) investigated in this study. The architecture evolvabilities (K) were computed from the inferred networks (see supplementary fig. S4, Supplementary Material online, for results of the simplified networks). The total numbers of species used to construct the boxplots are indicated in parentheses under lineage on the x-axis. The percentage of species showing a positive rank correlation with statistical significance (P < 0.05) between K and genome retention rate or between K and protein abundance in the genome is given above each box. The values for the upper quartile, median, and lower quartile are indicated for each box, with outliers indicated by crosses.

Rank correlation analysis using all pairs of the four properties (evolvability, retainability, duplicability, and essentiality) was then performed. Because these properties are interrelated, we also performed partial correlation analyses to estimate a direct association by controlling for potential confounding factors. The Spearman’s rank correlation coefficients between examined factors for the inferred networks without using partial correlation are shown in supplementary table S1, Supplementary Material online, and those using partial correlation in table 1. Although we focused here on finding factors that can be directly influenced by domain architecture evolvability, we note that the observed positive correlation between gene essentiality and retainability is consistent with gene-based findings reported by others (Gustafson et al. 2006; Waterhouse et al. 2011). The observed negative correlations between gene essentiality and gene duplicability (table 1 and supplementary table S1, Supplementary Material online) are also in accordance with the observation that essential genes are less likely to undergo duplication during evolution (He and Zhang 2006; Liang and Li 2007). When the simplified networks were used, or when a single isoform was randomly selected to represent genes with multiple protein products, similar correlations were observed (supplementary tables S2 and S3, Supplementary Material online).
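The idea of a partial rank correlation can be sketched as follows: each variable is converted to ranks, and the correlation between two of them is then corrected for their shared correlation with the controlled variables. The Python sketch below uses the standard recursive formula for partial correlation applied to ranks; it is a minimal illustration under that assumption, not the authors' pipeline, and the function names are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            r[order[t]] = mean_rank
        i = j + 1
    return r


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_spearman(x, y, controls):
    """Spearman correlation of x and y, controlling for each sequence in `controls`."""
    rx, ry = ranks(x), ranks(y)
    rc = [ranks(c) for c in controls]

    def partial(a, b, zs):
        if not zs:
            return pearson(a, b)
        z, rest = zs[0], zs[1:]
        r_ab = partial(a, b, rest)
        r_az = partial(a, z, rest)
        r_bz = partial(b, z, rest)
        return (r_ab - r_az * r_bz) / math.sqrt((1 - r_az ** 2) * (1 - r_bz ** 2))

    return partial(rx, ry, rc)
```

For example, under this sketch the association between evolvability and essentiality would be computed as `partial_spearman(evolvability, essentiality, [retainability, duplicability])`, where the four per-gene lists are hypothetical inputs in the same gene order. With an empty control list the function reduces to the ordinary Spearman coefficient.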
Table 1 shows that the patterns of correlation found for domain architecture evolvability across the lineages of bacteria, fungi, and metazoans were consistent. For example, in addition to exhibiting higher retainability, genes encoding proteins with domain architectures of higher evolvability tended to be less essential to the organism. Similar results were obtained when only single-copy genes were considered (supplementary table S4, Supplementary Material online), or when duplicability was alternatively defined by the number of proteins with identical architectures in the genome (supplementary table S5, Supplementary Material online). A positive correlation between evolvability and duplicability was also consistently observed for the model organisms in the three life lineages (table 1 and supplementary tables S1–S3 and S5, Supplementary Material online), although in one case of the yeast the statistics for the correlation was not significant (P = 0.11, table 1). Assignment of protein domains from genome sequences can be erroneous due to poor genome assembly (Nagy and Patthy 2013). To address this issue, we recalculated the correlations of table 1 using only high-quality genomes (those annotated as “manually curated” in KEGG; Kanehisa et al. 2014; see Materials and Methods). The results, presented in supplementary table S6, Supplementary Material online, reinforced the observations made from table 1 where a much larger (about three times) set of genomes was used. How protein domains and domain architectures were assigned could also affect these correlations; accordingly, we recalculated these correlations using different Pfam domain-defining thresholds and methods, and InterPro (Hunter et al. 2012), an integrated database of protein domains collected from multiple sources including Pfam-A (see Materials and Methods). 
As can be seen from the results shown in supplementary tables S7–S10, Supplementary Material online, the aforementioned positive and negative correlations between the four properties studied here largely held, although some correlations, especially those of the mouse involving essentiality, disappeared when putative (supplementary table S8, Supplementary Material online) or orphan domains (ODs; supplementary table S9, Supplementary Material online) were included. The influence of these two treatments was minimal on E. coli, possibly because putative domains and ODs tend to be disordered linkers (protein regions without a well-defined conformation in their native state that connect well-characterized protein domains) (Ekman et al. 2005) and disordered linkers are evolutionarily reduced in prokaryotes (Wang et al. 2011). The correlation between evolvability and duplicability for E. coli lost statistical significance or became negative when a higher hierarchy of protein domain classification (“clan” instead of “family”) or InterPro domains were used in defining architectures (supplementary tables S7 or S10, Supplementary Material online). This suggests that the use of clan, in which domains of an ancient origin (i.e., those from different subfamilies) are grouped as homologs, to define and connect architectures that are supposed to be closely related (i.e., explainable by one speciation event) for the computation of evolvability could dilute some of the evolvability correlations observed. The cause underlying the change in the direction of correlation between evolvability and duplicability of E. coli protein architectures defined by InterPro domains remains unclear. Nevertheless, the effects of all these additional considerations were small; therefore, taken as a whole, these results suggest that our approach is fairly robust and that different definitions of protein domains will not significantly alter the main observation that evolvability is positively correlated with retainability and negatively correlated with essentiality.

Table 1. Partial Rank Correlations (rp) between Domain Architecture Evolvability and the Other Three Evolutionary Properties.a

Property        Species            Architecture Evolvability    Retainability          Duplicability
                                   rp       P value b           rp      P value b      rp      P value b
Retainability   Escherichia coli   0.59     <10^-325
                Yeast              0.65     <10^-325
                Mouse              0.54     <10^-325
Duplicability   E. coli            0.13     <10^-14             0.09    <10^-7
                Yeast              0.02     0.11                0.03    0.08
                Mouse              0.24     <10^-64             0.01    0.49
Essentiality    E. coli            -0.12    <10^-13             0.23    <10^-47        -0.13   <10^-14
                Yeast              -0.15    <10^-22             0.08    <10^-6         -0.14   <10^-19
                Mouse              -0.05    <10^-3              0.06    <10^-4         -0.08   <10^-7

a Data computed from the inferred networks constructed using Pfam-A for 2,753 E. coli, 4,039 yeast, and 5,109 mouse genes.
b P value is the probability of obtaining a result equal to or more extreme than what was actually observed under the null hypothesis of no correlation.

Evolvability and Alternative Splicing

The vast majority of mammalian genes undergo alternative splicing (Kim et al. 2007; Merkin et al. 2012), whereas this process is not seen in bacteria (Sorek and Cossart 2010) and is rare in yeasts (Kim et al. 2008). To determine whether alternative splicing plays a role in the creation of new protein domain architectures, we included an additional factor, the number of isoforms per gene (see Materials and Methods), in the analysis of mouse gene products, and found that genes producing more isoforms tended to encode proteins with a highly evolvable domain architecture, evidenced by a weak but significant positive correlation between architecture evolvability and the number of isoforms (r = 0.09, P < 10^-9; supplementary table S11, Supplementary Material online). Not all annotated mRNA isoforms of a gene are actually translated into proteins.
According to UniProtKB (UniProt Knowledgebase) (Magrane and Consortium U 2011), the existence of an annotated protein can be supported at five levels, from strongest to weakest: 1) “Experimental evidence at protein level,” 2) “Experimental evidence at transcript level,” 3) “Protein inferred from homology,” 4) “Protein predicted,” or 5) “Protein uncertain”. When we restricted our analysis to isoforms with the strongest evidence code, “Experimental evidence at protein level,” we obtained consistent results (architecture evolvability vs number of isoforms: r = 0.05, P < 10^-3; supplementary table S12, Supplementary Material online). These results imply that new domain architectures can arise through the emergence of alternatively spliced forms. As a result, the original domain architecture and function of the gene product can be maintained without invoking duplication, which could potentially have a deleterious effect on the organism, such as dosage imbalance (Papp et al. 2003; Qian et al. 2010; Chang and Liao 2012).

Functional Characterization of Highly Evolvable Proteins

To understand the surprising finding that architecture evolvability, while positively correlated with retainability, was negatively correlated with gene essentiality, we investigated which functional groups of nonessential genes (see Materials and Methods) tended to encode proteins with high architecture evolvability. We performed a clustering analysis on Clusters of Orthologous Groups (COG) functional categories (Tatusov et al. 2000), in which we divided all nonessential genes equally into five evolvability levels (Level 1 denoting the lowest and Level 5 the highest evolvability) according to the evolvability of the architecture of the encoded protein as determined from the analysis of the inferred networks described above.
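The equal five-way division described above, and the per-category percentages derived from it, can be sketched in a few lines of Python. This is only an illustration: the function names and gene identifiers are hypothetical, and the way tied evolvability values are split across levels (here, by sort order) is an assumption, since the text does not specify it.

```python
from collections import defaultdict


def evolvability_levels(evolvability, n_levels=5):
    """Map each gene to a level 1..n_levels by splitting the evolvability-sorted genes into equal groups."""
    genes = sorted(evolvability, key=lambda g: evolvability[g])
    n = len(genes)
    return {g: (i * n_levels) // n + 1 for i, g in enumerate(genes)}


def occupancy_profiles(levels, category, n_levels=5):
    """For each functional category, the percentage of its genes falling in each level."""
    counts = defaultdict(lambda: [0] * n_levels)
    for gene, level in levels.items():
        counts[category[gene]][level - 1] += 1
    return {c: [100.0 * v / sum(row) for v in row] for c, row in counts.items()}


# Hypothetical toy input: ten genes with evolvability 1..10 in two COG categories.
toy_evolvability = {f"g{i}": i for i in range(1, 11)}
toy_category = {f"g{i}": ("K" if i <= 5 else "Q") for i in range(1, 11)}
toy_profiles = occupancy_profiles(evolvability_levels(toy_evolvability), toy_category)
```

In this toy input the profile of category "Q" rises toward Level 5 while that of "K" falls, which are the shapes that a subsequent clustering step would separate into high and low evolvability groups.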
For each COG functional category, the percentage of nonessential genes grouped into each of the five evolvability level bins was calculated, resulting in an occupancy profile of evolvability (fig. 5A; see supplementary fig. S5A, Supplementary Material online, for the simplified networks). Three clusters denoted, respectively, as “high,” “medium,” and “low” evolvability of COG functional categories emerged from a hierarchical clustering (see Materials and Methods) for each of the three model organisms (E. coli, yeast, and mouse) examined (fig. 5A). All but two of the evolvability cluster assignments of these COG categories were shown to be statistically significant (P < 0.05) by a permutation test (supplementary fig. S6, Supplementary Material online). The evolvability patterns of COG categories for the three model organisms were quite similar between the inferred networks and the simplified networks, especially for metabolism categories (cf. fig. 5 and supplementary fig. S5, Supplementary Material online).

FIG. 5. Clusters of COG functional categories based on evolvability occupancy profiles of nonessential genes. (A) In each model organism (Escherichia coli, yeast, or mouse), three clusters were seen: the “high evolvability” cluster (red), in which the percentage of nonessential genes increased as evolvability increased; the “medium evolvability” cluster (green), in which the percentage increased, then decreased, as evolvability increased; and the “low evolvability” cluster (blue), in which the percentage decreased as evolvability increased. The occupancy of genes at each evolvability level (percentage of nonessential genes in a given COG category having a given evolvability level) was normalized and color coded as indicated by the spectrum shown at the top of the panel. COG categories are represented by one-letter codes, which are color coded to indicate the cluster to which they belong. The occupancy profile for each functional category is shown as a line in the same color as the cluster and the averaged profile for a cluster is shown as a black line. (B) Classification of COG functional categories for Escherichia coli (E), yeast (Y), or mouse (M). The color of the box indicates the evolvability cluster to which the COG functional category was found to belong, with red indicating the high evolvability cluster, green the medium evolvability cluster, and blue the low evolvability cluster; COG categories with fewer than ten proteins are indicated by white boxes. “*” and “#” denote those letter codes (categories) in the high evolvability cluster for all, or just one or two, of the three model organisms, respectively.

Two functional categories (I: Lipid transport and metabolism, Q: Secondary metabolite biosynthesis, transport, and catabolism; marked with * in fig. 5B) were found to belong to the “high evolvability” cluster in all three model organisms (fig. 5B). The genes in these categories probably play an important role in the adaptation of the three lineages examined, although the underlying mechanisms may not necessarily be the same from one lineage to another. Some categories (K: Transcription; V: Defense mechanisms; T: Signal transduction mechanisms; M: Cell wall/membrane/envelope biogenesis; Z: Cytoskeleton; C: Energy production and conversion; G: Carbohydrate transport and metabolism; E: Amino acid transport and metabolism; P: Inorganic ion transport and metabolism; H: Coenzyme transport and metabolism; marked with # in fig. 5B) were only of “high evolvability” in one or two of the three model organisms. Two “high evolvability” COG functional categories (G and C) shared by E. coli and yeast are linked to their commonality in unicellular lifestyle.
In contrast to multicellular organisms, which require a complex cell–cell communication system to complete their developmental process and sustain life, the reproduction (division) of unicellular organisms is coupled with the growth of cell volume, which is largely determined by biomass production (Jorgensen and Tyers 2004). This might explain why, in unicellular species, many categories associated with metabolism were found in the high evolvability cluster (categories G and C in E. coli and yeast, and E, P, and H in E. coli; fig. 5B). Cytoskeletal proteins can direct cellular development, and their evolution is responsible for the emergence of complex tissue and organ structures in plants and animals (Meagher et al. 2008, 2009). Consistent with this hypothesis, proteins in the functional category “Cytoskeleton (Z)” were found to be of high evolvability only in the mouse, the multicellular eukaryote. Interestingly, genes associated with defense mechanisms (category V; e.g., CRISPR-associated proteins) are expected to be fast evolving because of their involvement in the host–pathogen arms race, but previous studies failed to identify an overrepresentation of positive selection on nucleotide substitutions in these genes (Chen et al. 2006; Takeuchi et al. 2012). Our result thus provides an alternative explanation: the adaptive changes of these genes in bacteria may generally occur at the domain architecture level.

For comparison, we also carried out an analysis of COG functional categories for essential genes (supplementary fig. S7, Supplementary Material online). Although the evolvability occupancy profile for many COG categories could not be determined because of a small sample size, the patterns of COG classification for essential genes were generally different from those shown in fig. 5 (or supplementary fig. S5, Supplementary Material online, from the simplified networks) for nonessential genes.
For example, many high evolvability categories (V, M, Q, G, C, E, P, H) of nonessential genes did not retain the high evolvability status for essential genes in any of the three model organisms, whereas three new high evolvability categories (L: Replication, recombination and repair; D: Cell cycle control, cell division, chromosome partitioning; and F: Nucleotide transport and metabolism) emerged. The functional categories enriched by high evolvability essential genes were also often lineage specific (e.g., yeast: K; E. coli: D and F; and mouse: Z), and no category of high evolvability was shared by all the three model organisms (supplementary fig. S7, Supplementary Material online). These results reflect the known functional differences between essential genes and nonessential genes and strengthen the notion that the evolutionarily retained “core” set of highly evolvable proteins shared among species/lineages is more likely to be encoded by nonessential genes. Taken together, these results suggest that the ability of an organism to adapt requires the presence in the genome of a group of genes responsible for cellular functions involved in adaptation, and that many of those genes, while being nonessential, encode proteins with a highly evolvable domain architecture.

Discussion

Much of our understanding of how biological systems operate has been gained from studies of biological networks, such as protein–protein interaction (PPI) networks (Schwikowski et al. 2000), genetic interaction networks (Costanzo et al. 2010), and gene coexpression networks (Stuart et al. 2003). The edges of these networks usually imply functional relatedness between the genes or proteins (nodes), and the connectivity of a node is an index of pleiotropy (number of functions performed by a gene or a protein) (Promislow 2004; Tyler et al. 2009; Nguyen et al. 2011).
It has also been shown that the connectivity of these networks follows a power law distribution characteristic of scale-free networks (Yook et al. 2004; Xulvi-Brunet and Li 2010; Hu et al. 2011; Xu et al. 2011). Although they have a similar frequency distribution of degrees (the probability of observing domain architectures with a given evolvability varies inversely with a power of that evolvability; fig. 2), our networks are inherently different from previously reported gene/protein networks in one important aspect: the edges in our networks of domain architectures do not indicate functional relatedness, but hypothetical evolutionary paths. In this regard, our networks are more similar to the neutral network of RNA secondary structures (van Nimwegen et al. 1999; Jörg et al. 2008), in which fixed-length RNA sequences (nodes) are connected in sequence space through a series of nucleotide changes (edges). In contrast to our networks, however, the edge distribution of the RNA network does not follow a power law distribution (Aguirre et al. 2011), and the amino acid sequences of protein domain architectures are not of fixed length.

One likely scenario underlying the power law distribution of protein architecture evolvabilities is that some domains are more prone (i.e., evolvable) than others to undergo duplication or combination with other domains, and that evolvability can, to some extent, carry over from one architecture level to the next (i.e., from single domain to di-domain, to tri-domain, and so on), allowing a mechanism such as preferential attachment (Barabasi and Albert 1999) to operate during evolution. To test this hypothesis, we performed a permutation experiment to randomly redistribute the domains contained in the domain architectures present in the inferred network, then used the same procedures described above to regenerate the network.
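The redistribution step of such a permutation experiment can be illustrated with a short Python sketch. It assumes one possible reading of "randomly redistribute the domains": all domain occurrences are pooled and shuffled across architectures while each architecture keeps its original number of domains. The function name `permute_domains`, the seed, and the toy architectures are hypothetical, and the subsequent network regeneration with the model of Fong et al. (2007) is not reproduced.

```python
import random


def permute_domains(architectures, seed=0):
    """Shuffle domain occurrences across architectures.

    Assumption (not stated in the text): each architecture keeps its
    original length, and the pool of domain occurrences is unchanged.
    """
    rng = random.Random(seed)
    # Pool every domain occurrence from every architecture.
    pool = [domain for arch in architectures for domain in arch]
    rng.shuffle(pool)
    # Deal the shuffled pool back out, one slice per original architecture.
    permuted, start = [], 0
    for arch in architectures:
        permuted.append(tuple(pool[start:start + len(arch)]))
        start += len(arch)
    return permuted


# Toy architectures written as tuples of Pfam accessions (illustrative only).
archs = [
    ("PF00069",),
    ("PF00096", "PF00096"),
    ("PF12937", "PF00069", "PF13895"),
]
shuffled = permute_domains(archs, seed=1)
```

Keeping architecture lengths fixed is a design choice of this sketch; it prevents the permutation from changing the length distribution, which is itself correlated with evolvability.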
For each domain, we examined whether the architectures containing it had a significantly lower evolvability after the permutation (Mann–Whitney U test), that is, whether their observed evolvability was higher than expected at random. We identified 22, 43, and 14 such domains (dubbed “driver domains”) for protein architectures of bacteria, fungi, and metazoans, respectively. Driver domains were observed with increasing frequency in protein architectures of higher evolvability (supplementary fig. S8A, Supplementary Material online) and also among the pool of domains used by protein architectures of higher evolvability (supplementary fig. S8B, Supplementary Material online), regardless of the lineage examined. These results can be interpreted as an evolutionary consequence of the propagation of driver domains in highly evolvable domain architectures, likely due to the engagement of driver domains in promoting intra-architectural duplication or inter-architectural recombination, which supports our propagation hypothesis.

Scale-free networks are robust against the random removal of nodes and are able to propagate perturbation through the network within a few steps (Newman 2003). Thus, it is possible that the power law distribution of protein evolvability is an evolutionary consequence of increasing the robustness and efficiency of adaptation for the entire system (proteome). Whether the distribution observed is shaped by natural selection or is merely a byproduct of processes unrelated to adaptation (Lynch 2007) requires further investigation.

Previous studies on PPI networks have established a centrality–lethality rule: deletion of proteins with a greater number of interacting partners, that is, those connecting to more nodes in the network and hence being of higher centrality, is more lethal to the organism in both yeast and humans (Jeong et al. 2001; Liang and Li 2007).
A similar tendency has been observed in networks constructed using other types of data, such as coexpression of protein-coding genes (Bhardwaj and Lu 2005). As noted above, the network of protein domain architectures did not follow the centrality–lethality (i.e., essentiality) rule. We found that the evolvability of a protein’s architecture in our network and its connectivity in the PPI network (see Materials and Methods) were uncorrelated in E. coli and mouse and negatively correlated in yeast (E. coli: r = 0.01, P = 0.60; yeast: r = −0.04, P = 0.02; mouse: r = 0.0, P = 0.95), in spite of the positive correlation between essentiality and PPI connectivity (E. coli: r = 0.08, P < 10⁻⁶; yeast: r = 0.30, P < 10⁻⁸⁵; mouse: r = 0.02, P = 0.02). These results corroborate the notion that the nature of the edges in our architecture network differs fundamentally from that in conventional PPI networks.

Every edge (evolutionary event) of the inferred network can be further classified along three dimensions: 1) whether the event occurs at a terminal or internal position of the architecture, 2) whether the change is an addition or a deletion, and 3) whether or not the added domain is new (novel). Consistent with findings reported in the literature (Björklund et al. 2005; Pasek et al. 2006; Weiner et al. 2006; Ekman et al. 2007; Buljan and Bateman 2009), supplementary figure S9A, Supplementary Material online, shows that terminal, addition, and novel domain links dominated the types of evolutionary events, although the dominance was less pronounced in metazoans than in bacteria or fungi. As a result, there was no significant difference in the correlations between evolvability and retainability (supplementary fig. S9B) or between evolvability and essentiality (supplementary fig. S9C, Supplementary Material online) when only the major types of evolutionary events were considered.
The dominance of terminal events also suggests that a lengthy architecture may not necessarily be of high evolvability, despite having more domains and more locations for recombination, insertion, and deletion. Indeed, we found that evolvability was negatively correlated with architecture length (bacteria: r = −0.23, P < 10⁻³²⁵; fungi: r = −0.22, P < 10⁻³²⁵; metazoans: r = −0.11, P < 10⁻³²⁵; supplementary fig. S10, Supplementary Material online). Many architectures with extremely high evolvability, such as F-box-like (PF12937), zf-C2H2 (PF00096), Pkinase (PF00069), and immunoglobulin domain (PF13895), were single-domain architectures that are known to be promiscuous and can interact or recombine with many different partner domains (Basu et al. 2008; Hsu et al. 2013). To examine the effect of a promiscuous single domain on the evolvability of the multidomain architectures in which it resides, we plotted the largest evolvability among the component single-domain architectures against the evolvability of the multidomain architecture, and found that a weak, but significant, positive correlation was evident only in mouse (supplementary fig. S11, Supplementary Material online). This indicates that promiscuity of single domains does not sufficiently account for the architecture evolvability of multidomain proteins, although how evolvability propagates and declines as domains accrue in an architecture awaits further investigation. In addition, excluding proteins containing promiscuous domains (see Materials and Methods) from the analysis or restricting the analysis to architectures composed of a similar range of domain numbers did not alter the correlation trends observed in figure 3 (supplementary figs. S12 and S13, Supplementary Material online) and table 1 (supplementary tables S13 and S14, Supplementary Material online), indicating that neither could be a determining factor of the correlations observed.
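The pairing used for the single-domain comparison above (largest evolvability among the component single-domain architectures versus the evolvability of the multidomain architecture) can be sketched as follows. This is only an illustration: the function name `max_component_evolvability` and all evolvability values are hypothetical, and architectures are written as tuples of Pfam accessions for convenience.

```python
def max_component_evolvability(architecture, single_domain_evolvability):
    """Return the largest evolvability among the single-domain architectures
    of the domains that make up `architecture`, or None if none is known."""
    values = [
        single_domain_evolvability[domain]
        for domain in set(architecture)
        if domain in single_domain_evolvability
    ]
    return max(values) if values else None


# Hypothetical evolvability (out-degree) of single-domain architectures.
single = {"PF00069": 120, "PF00096": 95, "PF13895": 60}

# Hypothetical evolvability of multidomain architectures.
multi = {
    ("PF00069", "PF13895"): 7,
    ("PF00096", "PF00096", "PF12937"): 3,
}

# One (x, y) point per multidomain architecture, ready for a scatter plot
# or a rank correlation.
pairs = [(max_component_evolvability(arch, single), evo)
         for arch, evo in multi.items()]
```

Domains without a known single-domain architecture are skipped here; how such cases were handled in the original analysis is not stated, so this is an assumption of the sketch.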
The positive correlation between evolvability and retainability (table 1 and supplementary table S1, Supplementary Material online) suggests that proteins with a highly evolvable domain architecture might be preferentially preserved and/or repeatedly generated during evolution. To test these two possibilities, we calculated the longest preserved duration (Tmax) and the number of repeatedly generated events (Ngen) during evolution for each of the domain architectures found in proteins encoded in the entire set of bacterial, fungal, or metazoan genomes (see Materials and Methods; supplementary fig. S14, Supplementary Material online). We found that Tmax and Ngen were positively correlated (bacteria: r = 0.73, P < 10⁻³²⁵; fungi: r = 0.48, P < 10⁻³²⁵; metazoans: r = 0.40, P < 10⁻³²⁵). Therefore, an analysis of partial rank correlation controlling for the intercorrelation between Tmax and Ngen was carried out. The results showed that architecture evolvability was directly associated with both Tmax and Ngen in all three lineages, and the correlation was much stronger with the former than with the latter (supplementary table S15, Supplementary Material online). These results suggest that the retention of highly evolvable architectures was mostly attributable to a longer preservation duration (larger Tmax) of the architecture during evolution.
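As an illustration of the partial rank correlation mentioned above, the following Python sketch computes a first-order partial Spearman correlation from the three pairwise rank correlations. It is a minimal sketch under stated assumptions, not the analysis pipeline used here (the correlation analyses were run in MATLAB); the function names and the toy values for evolvability, Tmax, and Ngen are hypothetical, and no P-values are computed.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def partial_spearman(x, y, z):
    """Rank correlation between x and y, controlling for z."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical toy values for five architectures.
evolvability = [1, 2, 3, 4, 5]
tmax = [2, 1, 4, 3, 5]
ngen = [1, 3, 2, 5, 4]

r_evo_tmax_given_ngen = partial_spearman(evolvability, tmax, ngen)
r_evo_ngen_given_tmax = partial_spearman(evolvability, ngen, tmax)
```

Comparing the two partial coefficients mirrors the question asked in the text: whether evolvability remains associated with Tmax once Ngen is held fixed, and vice versa.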
The finding that proteins with a domain architecture of high evolvability tended to be nonessential for the organism, as suggested by the observed negative correlation between evolvability and essentiality (table 1 and supplementary table S1, Supplementary Material online), seems to run counter to the notion that genes with a high retention rate in genomes across the tree of life generally have essential cellular functions. That notion was drawn from a comparison of the essentiality data of lineage-specific genes with those of non–lineage-specific genes (Castillo-Davis and Hartl 2003), as well as from a correlation between gene essentiality and the propensity for gene loss in eukaryotes (Krylov et al. 2003). Our finding therefore suggests that this widely held notion needs to be modified, as retainability does not necessarily equate with essentiality. Note that the evolvability of protein domain architectures had not previously been defined or estimated, and the complicated associations among gene essentiality, retainability, and protein “evolvability” were previously unknown. Analysis of functional categories of nonessential genes (fig. 5) suggested that these genes are probably retained in genomes to provide organisms with the capability to adapt to environmental changes during evolution. Nonessential genes that encode highly evolvable proteins, particularly those in the high evolvability cluster (fig. 5B), can be considered a core set of nonessential genes/proteins commonly employed for adaptation during the evolution of bacterial, fungal, or metazoan species. Species without such genes may survive and reproduce in the present ecosystem but could easily become extinct because of a decreased ability to adapt to a changing environment.

Concluding Remarks
We have found that proteins with a highly evolvable domain architecture are present in a wide range of species. Surprisingly, the genes encoding these highly retained proteins are less essential to the organism.
Our analysis also indicates that these genes tend to duplicate more often and also generate more isoforms in species in which alternative splicing occurs. The tendencies described above suggest an evolutionary route by which genes/proteins can explore new domain architectures, and consequently create new functions, providing the ability to adapt and evolve. Although we did not examine plants and archaea because of the lack of gene essentiality data for their model organisms, the consistent patterns of these tendencies seen in three kingdoms of life (bacteria, fungi, and metazoans) suggest a common need for nonessential genes in the genome that is probably associated with the ability to adapt to a changing environment and is related to the evolvability of protein domain architecture.

Materials and Methods
Data on protein domains and architectures were downloaded from Pfam (version 27.0) (Finn et al. 2013), which contains annotations for 9,845,277 proteins from the completely sequenced genomes of 3,051 species. To avoid a bias toward more intensively studied protein families, we only considered domain architectures derived from species with a fully sequenced genome. In the analyses performed on species from a single lineage, we focused on the subset of genes from bacteria (1,557 species), fungi (138 species), and metazoans (82 species), coding, respectively, for 5,087,744, 1,209,787, and 1,747,372 proteins. In all analyses, only the domain architectures of full-length proteins were considered. In this work, results using four different Pfam domain definitions, Pfam-A, Pfam-clan, Pfam-A+B (putative domains), and Pfam-A+B+OD, were reported (in the main text for Pfam-A and in the supplement for the rest). Domain architectures for Pfam-A were directly downloaded from the Pfam site (http://pfam.xfam.org/, last accessed July 29, 2014), while for those derived using Pfam-A+B and Pfam-A+B+OD, we followed the procedures described by Ekman et al. (2005).
Pfam-clan architectures were deduced using a mapping of Pfam-A to Pfam-clan available from the Pfam site. Pfam-A provides three cutoff levels (“trusted,” “gathering,” and “noise”) for assigning domains. Domains derived using the gathering cutoff, which is Pfam-A’s default, were used in this work. To investigate the effects of using a different threshold for domain classification on our results, we also analyzed domains assigned by the trusted and the noise thresholds. Because only a very small percentage (1% or less) of proteins changed their architecture under either alternative threshold, the correlation values did not change significantly from those obtained with the gathering threshold (table 1; supplementary tables S16 and S17, Supplementary Material online). Similar results were also obtained when a different set of protein domains, as assigned by InterPro (version 39) (Hunter et al. 2012), was used (supplementary table S10, Supplementary Material online) or when internal duplications of a repeat unit were treated as one domain (supplementary table S18, Supplementary Material online). The 25 promiscuous domains used in some of the analyses were those reported by Basu et al. (2008). To place a species in one of the three life lineages (bacteria, fungi, and metazoans) studied, the taxonomy information for the source genomes retrieved from NCBI’s Taxonomy Database (Federhen 2012) was used. The retainability, or retention rate, of a domain architecture was calculated as the ratio of the number of species in which this architecture was found to the total number of species (only those with a fully sequenced genome were considered) in the lineage examined. The evolutionary relationships between architectures for the inferred networks were deduced using the model of evolution developed by Fong et al.
(2007) while considering only fission/fusion and insertion/deletion events and ignoring events of the more complicated “other” rearrangement class (Fong et al. 2007). Including the more complicated events under a maximal cost of 3 (Fong et al. 2007) resulted in up to twice as many network connections (events), but the correlations between the examined properties remained similar (supplementary table S19, Supplementary Material online). The simplified networks were constructed following the procedure described in our previous report (Hsu et al. 2013). The outward connectivity (out-degree) of each node in the inferred networks was used to define the evolvability of the domain architecture (node) examined. The connectivity of the simplified networks does not indicate a parent-to-child relationship because no phylogenetic information was used in constructing this type of network (Hsu et al. 2013); for simplicity, degree was used to define the evolvability of a given domain architecture. The genomic data for E. coli, S. cerevisiae, and M. musculus were retrieved from Pfam to compute and assign four evolutionary properties to the genes that encode the protein (hence domain architecture) under consideration: evolvability, retainability, essentiality, and duplicability. A gene’s evolvability and retainability were set to the values calculated for its protein product’s domain architecture from the lineage network (of bacteria, fungi, or metazoans) to which the gene belongs. In the case of one gene producing multiple protein products (isoforms), the evolvabilities and retainabilities of all isoforms generated from this gene were averaged. Results obtained by randomly keeping one isoform for each gene were also produced (supplementary table S3, Supplementary Material online).
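The three definitions given above (retainability as a fraction of species, evolvability as out-degree in the inferred network, and averaging over isoforms) can be sketched in Python. This is an illustrative sketch only: the function names, the toy species, and the toy edges are hypothetical, architectures are represented as tuples of Pfam accessions purely for convenience, and the network inference itself (the model of Fong et al. 2007) is not reproduced.

```python
from collections import defaultdict


def retainability(species_architectures):
    """Fraction of species in a lineage in which each architecture is found.

    species_architectures maps a species name to the set of architectures
    observed in its genome.
    """
    total = len(species_architectures)
    counts = defaultdict(int)
    for architectures in species_architectures.values():
        for arch in set(architectures):
            counts[arch] += 1
    return {arch: n / total for arch, n in counts.items()}


def evolvability(edges):
    """Out-degree of every node given directed (parent, child) edges."""
    children = defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
        children.setdefault(child, set())  # leaves get an out-degree of 0
    return {node: len(kids) for node, kids in children.items()}


def gene_value(isoform_architectures, per_architecture):
    """Average a per-architecture property over a gene's isoforms."""
    values = [per_architecture[arch] for arch in isoform_architectures]
    return sum(values) / len(values)


# Toy architectures (illustrative accessions only).
A = ("PF00069",)
AA = ("PF00069", "PF00069")
AB = ("PF00069", "PF13895")
ABC = ("PF00069", "PF13895", "PF00096")
B = ("PF13895",)

lineage = {"sp1": {A, AB}, "sp2": {A}, "sp3": {A, AB, ABC}, "sp4": {B}}
network_edges = [(A, AB), (A, AA), (AB, ABC)]

ret = retainability(lineage)
evo = evolvability(network_edges)
gene_evo = gene_value([A, AB], evo)  # a gene with two isoforms
```

For the simplified networks, where edges carry no direction, the same idea would apply with plain degree in place of out-degree.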
In analyses involving essentiality (table 1 and supplementary tables S1–S14 and S16–S19, Supplementary Material online), only proteins produced from genes with experimentally measured essentiality data were used. Gene essentiality for E. coli and budding yeast was calculated as 1 minus the relative growth rate of the knockout mutant strain obtained from, respectively, GenoBase (Baba et al. 2006; Baba and Mori 2008) or the Stanford yeast deletion project (Steinmetz et al. 2002). Genes with the phenotype of zero growth rate in these experiments were regarded as essential and the others as nonessential. Mouse essential genes (essentiality = 1) and nonessential genes (essentiality = 0) were defined according to previous studies (Liao and Zhang 2008; Liao et al. 2010). To define gene duplicability, paralogs of S. cerevisiae genes and mouse genes identified by the method of Vilella et al. (2009) were retrieved using the BioMart interface (Haider et al. 2009). As in a previous study (He and Zhang 2006), paralogs of E. coli were identified by a BLASTP sequence search (Altschul et al. 1997), using the threshold of 60% amino acid sequence identity and an e-value of 10⁻¹⁰. A total of 3,753 E. coli genes, 4,039 yeast genes, and 5,109 mouse genes with domain architecture information (evolvability and retainability) and gene information (essentiality and duplicability) were used in the correlation analyses. These numbers of genes varied with the domain architecture assignments used in different analyses (see each supplementary table for the number of genes used in a given calculation). Note that the properties of essentiality and duplicability were computed for each gene within the genome of E. coli, yeast, or mouse, whereas the properties of evolvability and retainability were computed using the aggregate network of the lineage to which E. coli, yeast, or mouse belongs, because networks in a single species were too fragmented to be used.
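The essentiality score and the paralog threshold described above can be written down as a small Python sketch. The function names, gene names, and growth rates are hypothetical. The direction of the BLASTP cutoffs (identity at least 60%, e-value at most 10⁻¹⁰) is an assumption, since the text gives only the threshold values, and the way duplicability is derived from the resulting paralogs is not shown because it is not specified here.

```python
def essentiality_from_growth(relative_growth_rate):
    """Essentiality = 1 - relative growth rate of the knockout strain."""
    return 1.0 - relative_growth_rate


def is_essential(relative_growth_rate):
    """A knockout with zero growth marks the gene as essential."""
    return relative_growth_rate == 0


def passes_paralog_threshold(percent_identity, e_value,
                             min_identity=60.0, max_e_value=1e-10):
    """Check a BLASTP hit against the identity and e-value cutoffs.

    Assumption: identity must be at least `min_identity` and the e-value
    at most `max_e_value`.
    """
    return percent_identity >= min_identity and e_value <= max_e_value


# Hypothetical relative growth rates of knockout strains.
growth = {"geneA": 0.0, "geneB": 0.85, "geneC": 1.0}

scores = {gene: essentiality_from_growth(rate) for gene, rate in growth.items()}
essential = {gene for gene, rate in growth.items() if is_essential(rate)}
```

For mouse, where essentiality is binary, the score would simply be 1 for essential genes and 0 for nonessential genes, so no growth-rate conversion is needed.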
Supplementary table S20 (Supplementary Material online) lists the four properties of every architecture in the genome of each of the three model organisms. The COG functional categories of genes were downloaded from eggNOG (version 4.0) (Powell et al. 2014). We used a Euclidean distance-based hierarchical clustering method (Ward 1963) for the classification of COG functional categories. The correlation analysis and clustering of COG categories were performed using MATLAB’s Bioinformatics Toolbox (release 2012b). Data for PPIs for S. cerevisiae and M. musculus were retrieved from the BioGRID database (Chatr-Aryamontri et al. 2013), while those for E. coli were taken from Rajagopala et al. (2014). Based on the presence of a given domain architecture in both the sequenced genomes and the hypothetical ancestral genomes inferred from the maximum parsimony method of Fong et al. (2007), Tmax and Ngen were determined for every domain architecture: Tmax was the longest continuous path in which a given domain architecture was preserved from root to termini of the taxonomy tree, and Ngen was the number of events in which the examined architecture was newly generated across the entire taxonomy tree (supplementary fig. S14, Supplementary Material online).

Supplementary Material
Supplementary tables S1–S20 and figures S1–S14 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments
This study was supported by intramural funding from the Academia Sinica (to M.J.H.) and the National Health Research Institutes (to B.Y.L.) and research grants from the Ministry of Science and Technology (MOST 101-2311-B-400-001-MY3 and 104-2311-B-400-002-MY3 to B.Y.L.). We thank Dr Barkas for English editing.

References
Aguirre J, Buldu JM, Stich M, Manrubia SC. 2011. Topological structure of the space of phenotypes: the case of RNA neutral networks. PLoS One 6:e26324.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.
Baba T, Ara T, Hasegawa M, Takai Y, Okumura Y, Baba M, Datsenko KA, Tomita M, Wanner BL, Mori H. 2006. Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Mol Syst Biol. 2:2006.0008.
Baba T, Mori H. 2008. The construction of systematic in-frame, single-gene knockout mutant collection in Escherichia coli K-12. Methods Mol Biol. 416:171–181.
Barabasi AL, Albert R. 1999. Emergence of scaling in random networks. Science 286:509–512.
Basu MK, Carmel L, Rogozin IB, Koonin EV. 2008. Evolution of protein domain promiscuity in eukaryotes. Genome Res. 18:449–461.
Bhardwaj N, Lu H. 2005. Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics 21:2730–2738.
Björklund ÅK, Ekman D, Light S, Frey-Skött J, Elofsson A. 2005. Domain rearrangements in protein evolution. J Mol Biol. 353:911–923.
Brookfield JF. 2001. Evolution: the evolvability enigma. Curr Biol. 11:R106–R108.
Brookfield JFY. 2009. Evolution and evolvability: celebrating Darwin 200. Biol Lett. 5:44–46.
Buljan M, Bateman A. 2009. The evolution of protein domain families. Biochem Soc Trans. 37:751–755.
Castillo-Davis CI, Hartl DL. 2003. Conservation, relocation and duplication in genome evolution. Trends Genet. 19:593–597.
Chang AY, Liao BY. 2012. DNA methylation rebalances gene dosage after mammalian gene duplications. Mol Biol Evol. 29:133–144.
Chatr-Aryamontri A, Breitkreutz BJ, Heinicke S, Boucher L, Winter A, Stark C, Nixon J, Ramage L, Kolas N, O’Donnell L, et al. 2013. The BioGRID interaction database: 2013 update. Nucleic Acids Res. 41:D816–D823.
Chen SL, Hung CS, Xu J, Reigstad CS, Magrini V, Sabo A, Blasiar D, Bieri T, Meyer RR, Ozersky P, et al. 2006. Identification of genes subject to positive selection in uropathogenic strains of Escherichia coli: a comparative genomics approach. Proc Natl Acad Sci U S A. 103:5977–5982.
Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, et al. 2010. The genetic landscape of a cell. Science 327:425–431.
Ekman D, Bjorklund AK, Elofsson A. 2007. Quantification of the elevated rate of domain rearrangements in metazoa. J Mol Biol. 372:1337–1348.
Ekman D, Bjorklund AK, Frey-Skott J, Elofsson A. 2005. Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J Mol Biol. 348:231–243.
Federhen S. 2012. The NCBI Taxonomy database. Nucleic Acids Res. 40:D136–D143.
Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2013. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.
Fong JH, Geer LY, Panchenko AR, Bryant SH. 2007. Modeling the evolution of protein domain architectures using maximum parsimony. J Mol Biol. 366:307–315.
Forslund K, Sonnhammer EL. 2012. Evolution of protein domain architectures. Methods Mol Biol. 856:187–216.
Gregory TR. 2002. A bird’s-eye view of the C-value enigma: genome size, cell size, and metabolic rate in the class Aves. Evolution 56:121–130.
Gustafson AM, Snitkin ES, Parker SCJ, DeLisi C, Kasif S. 2006. Towards the identification of essential genes using targeted genome sequencing and comparative analysis. BMC Genomics 7:265–280.
Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A. 2009. BioMart Central Portal—unified access to biological data. Nucleic Acids Res. 37:W23–W27.
He X, Zhang J. 2006. Higher duplicability of less important genes in yeast genomes. Mol Biol Evol. 23:144–151.
Hsu CH, Chen CK, Hwang MJ. 2013.
The architectural design of networks of protein domain architectures. Biol Lett. 9:20130268.
Hu T, Sinnott-Armstrong NA, Kiralis JW, Andrew AS, Karagas MR, Moore JH. 2011. Characterizing genetic interactions in human disease association studies using statistical epistasis networks. BMC Bioinformatics 12:364.
Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, Bernard T, Binns D, Bork P, Burge S, et al. 2012. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 40:D306–D312.
Jeong H, Mason SP, Barabasi AL, Oltvai ZN. 2001. Lethality and centrality in protein networks. Nature 411:41–42.
Jörg T, Martin OC, Wagner A. 2008. Neutral network sizes of biological RNA molecules can be computed and are not atypically small. BMC Bioinformatics 9:464.
Jorgensen P, Tyers M. 2004. How cells coordinate growth and division. Curr Biol. 14:R1014–R1027.
Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. 2004. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32:D277–D280.
Kim E, Goren A, Ast G. 2008. Alternative splicing: current perspectives. Bioessays 30:38–47.
Kim E, Magen A, Ast G. 2007. Different levels of alternative splicing among eukaryotes. Nucleic Acids Res. 35:125–131.
Kirschner M, Gerhart J. 1998. Evolvability. Proc Natl Acad Sci U S A. 95:8420–8427.
Krylov DM, Wolf YI, Rogozin IB, Koonin EV. 2003. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13:2229–2235.
Liang H, Li WH. 2007. Gene essentiality, gene duplicability and protein connectivity in human and mouse. Trends Genet. 23:375–378.
Liao BY, Weng MP, Zhang J. 2010. Impact of extracellularity on the evolutionary rate of mammalian proteins. Genome Biol Evol. 2010:39–43.
Liao BY, Zhang J. 2008. Null mutations in human and mouse orthologs frequently result in different phenotypes. Proc Natl Acad Sci U S A. 105:6987–6992.
Lin K, Zhu L, Zhang DY. 2006. An initial strategy for comparing proteins at the domain architecture level. Bioinformatics 22:2081–2086.
Liu J, Rost B. 2004. CHOP proteins into structural domain-like fragments. Proteins 55:678–688.
Lynch M. 2007. The evolution of genetic networks by non-adaptive processes. Nat Rev Genet. 8:803–813.
Magrane M, Consortium U. 2011. UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford) 2011:bar009.
Meagher RB, Kandasamy MK, McKinney EC. 2008. Multicellular development and protein-protein interactions. Plant Signal Behav. 3:333–336.
Meagher RB, Kandasamy MK, McKinney EC, Roy E. 2009. Chapter 5. Nuclear actin-related proteins in epigenetic control. Int Rev Cell Mol Biol. 277:157–215.
Merkin J, Russell C, Chen P, Burge CB. 2012. Evolutionary dynamics of gene and isoform regulation in mammalian tissues. Science 338:1593–1599.
Mukherjee K, Campos H, Kolaczkowski B. 2013. Evolution of animal and plant dicers: early parallel duplications and recurrent adaptation of antiviral RNA binding in plants. Mol Biol Evol. 30:627–641.
Nagy A, Patthy L. 2013. MisPred: a resource for identification of erroneous protein sequences in public databases. Database (Oxford) 2013:bat053.
Newman MEJ. 2003. The structure and function of complex networks. SIAM Rev. 45:167–256.
Newman MEJ. 2006. Power laws, Pareto distributions and Zipf’s law. Contemp Phys. 46:323–351.
Nguyen TP, Liu WC, Jordan F. 2011. Inferring pleiotropy by network analysis: linked diseases in the human PPI network. BMC Syst Biol. 5:179.
Papp B, Pal C, Hurst LD. 2003. Dosage sensitivity and the evolution of gene families in yeast. Nature 424:194–197.
Pasek S, Risler JL, Brezellec P. 2006. Gene fusion/fission is a major contributor to evolution of multi-domain bacterial proteins. Bioinformatics 22:1418–1423.
Payne JL, Wagner A. 2014. The robustness and evolvability of transcription factor binding sites. Science 343:875–877.
Pigliucci M. 2008. Is evolvability evolvable?
Nat Rev Genet. 9:75–82.
Powell S, Forslund K, Szklarczyk D, Trachana K, Roth A, Huerta-Cepas J, Gabaldon T, Rattei T, Creevey C, Kuhn M, et al. 2014. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res. 42:D231–D239.
Promislow DE. 2004. Protein networks, pleiotropy and the evolution of senescence. Proc Biol Sci. 271:1225–1234.
Qian W, Liao BY, Chang AY, Zhang J. 2010. Maintenance of duplicate genes and their functional redundancy by reduced expression. Trends Genet. 26:425–430.
Rajagopala SV, Sikorski P, Kumar A, Mosca R, Vlasblom J, Arnold R, Franca-Koh J, Pakala SB, Phanse S, Ceol A, et al. 2014. The binary protein-protein interaction landscape of Escherichia coli. Nat Biotechnol. 32:285–290.
Rajon E, Masel J. 2011. Evolution of molecular error rates and the consequences for evolvability. Proc Natl Acad Sci U S A. 108:1082–1087.
Schwikowski B, Uetz P, Fields S. 2000. A network of protein-protein interactions in yeast. Nat Biotechnol. 18:1257–1261.
Sorek R, Cossart P. 2010. Prokaryotic transcriptomics: a new view on regulation, physiology and pathogenicity. Nat Rev Genet. 11:9–16.
Steinmetz LM, Scharfe C, Deutschbauer AM, Mokranjac D, Herman ZS, Jones T, Chu AM, Giaever G, Prokisch H, Oefner PJ, et al. 2002. Systematic screen for human disease genes in yeast. Nat Genet. 31:400–404.
Stuart JM, Segal E, Koller D, Kim SK. 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science 302:249–255.
Takeuchi N, Wolf YI, Makarova KS, Koonin EV. 2012. Nature and intensity of selection pressure on CRISPR-associated genes. J Bacteriol. 194:1216–1225.
Tatusov RL, Galperin MY, Natale DA, Koonin EV. 2000. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28:33–36.
Tyler AL, Asselbergs FW, Williams SM, Moore JH. 2009. Shadows of complexity: what biological networks reveal about epistasis and pleiotropy.
Bioessays 31:220–227.
van Nimwegen E, Crutchfield JP, Huynen M. 1999. Neutral evolution of mutational robustness. Proc Natl Acad Sci U S A. 96:9716–9720.
Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. 2009. EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 19:327–335.
Wang M, Kurland CG, Caetano-Anolles G. 2011. Reductive evolution of proteomes and protein structures. Proc Natl Acad Sci U S A. 108:11954–11958.
Ward JH Jr. 1963. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 58:236–244.
Waterhouse RM, Zdobnov EM, Kriventseva EV. 2011. Correlating traits of gene retention, sequence divergence, duplicability and essentiality in vertebrates, arthropods, and fungi. Genome Biol Evol. 3:75–86.
Weiner J 3rd, Beaussart F, Bornberg-Bauer E. 2006. Domain deletions and substitutions in the modular protein evolution. FEBS J. 273:2037–2047.
Xu L, Jiang H, Chen H, Gu Z. 2011. Genetic architecture of growth traits revealed by global epistatic interactions. Genome Biol Evol. 3:909–914.
Xulvi-Brunet R, Li H. 2010. Co-expression networks: graph properties and topological comparisons. Bioinformatics 26:205–214.
Yook SH, Oltvai ZN, Barabasi AL. 2004. Functional and topological characterization of protein interaction networks. Proteomics 4:928–942.
Zhang XC, Wang Z, Zhang X, Le MH, Sun J, Xu D, Cheng J, Stacey G. 2012. Evolutionary dynamics of protein domain architecture in plants. BMC Evol Biol. 12:6.