General Trends in the Utilization of Structural Factors Contributing to Biological Complexity

Dong Yang,1,2 Fan Zhong,3 Dong Li,1,2 Zhongyang Liu,1,2 Handong Wei,1,2 Ying Jiang,*,1,2 and Fuchu He*,1,2,3

1 State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing, P. R. China
2 National Engineering Research Center for Protein Drugs, Beijing, P. R. China
3 Institutes of Biomedical Sciences and Department of Chemistry, Fudan University, Shanghai, P. R. China
*Corresponding author: E-mail: [email protected]; [email protected].
Associate editor: Todd Oakley

Research article. Mol. Biol. Evol. 29(8):1957–1968. 2012 doi:10.1093/molbev/mss064 Advance Access publication February 10, 2012
© The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

Abstract

During evolution, proteins containing newly emerged domains and the increasing proportion of multidomain proteins in the full genome-encoded proteome (GEP) have substantially contributed to increasing biological complexity. However, it is not known how these two potential structural factors are preferentially utilized at given physiological states. Here, we classified proteins according to domain number and domain age and explored the general trends across species for the utilization of proteins from GEP to various certain-state proteomes (CSPs, i.e., all the proteins expressed at certain physiological states). We found that multidomain proteins or only older domain–containing proteins are significantly overrepresented in CSPs compared with GEP, a trend that is stronger in multicellular organisms than in unicellular organisms. Interestingly, the strengths of overrepresentation decreased during the evolution of multicellular eukaryotes. When comparing across CSPs, we found that multidomain proteins are more overrepresented in complex tissues than in simpler ones, whereas no difference among proteins with domains of different ages is evident between complex and simple tissues. Thus, biological complexity under certain conditions is more significantly realized by diverse domain organization than by the emergence of new types of domain. In addition, we found that multidomain or only older domain–containing proteins tend to evolve slowly and generally are under stronger purifying selection, which may partly result from their general overrepresentation trends in CSPs.

Key words: biological complexity, protein domain, proteome, genome, purifying selection.

Introduction

One of the fundamental questions in biology is how species evolve increased biological complexity. In general, biological complexity can be assessed at the level of organism, organ, tissue, cell, or protein interaction networks (Oakley and Rivera 2008). At the organism level, it is clear that biological complexity correlates well with genomic complexity, measured not only by the number of protein-coding genes (PCGs) but also by the complexity of gene functions (Koonin et al. 2002; Vogel and Chothia 2006), the connectivity of gene regulation networks (Szathmary et al. 2001), and the proportion of spliceosomal introns leading to alternative splicing (Nilsen and Graveley 2010). In this view, the genome can be regarded as a "library" of potential biological information, which includes the factors contributing to biological complexity. Previous studies paid much more attention to the increase of these factors in GEP during evolution (Adami et al. 2000; Szathmary et al. 2001; Koonin et al. 2002; Koonin 2009; Lynch and Conery 2003; Doebeli and Ispolatov 2010; Nilsen and Graveley 2010; Prochnik et al. 2010; Rivera et al. 2010), but it is not known how these factors are preferentially utilized to realize biological complexity under different conditions.

It is known that different CSPs utilize different proteins from GEP, but this process is not random and there may be some general trends. Among all the factors contributing to biological complexity, protein domain characters, that is, the potential structural factors, are of particular importance because of their direct relationship with the functional complexity of proteins and networks (Koonin et al. 2002). To extract potential rules that may govern the utilization of proteins with different domain characters from GEP to CSP, we investigated how the overrepresentation strength of proteins in CSPs, relative to GEP, depends on two protein domain properties: the number and the age of the domains in a protein. In addition, we explored the mechanisms underlying the general trends and their evolutionary effects.

Materials and Methods

Data Sets

(1) GEP (the full genome-encoded proteome). The PCGs in the Entrez Gene database (Maglott et al. 2007) for six species (Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, and Escherichia coli) at the NCBI ftp site (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/) were used as the genome background. These six species represent higher multicellular eukaryota, lower multicellular eukaryota, unicellular eukaryota, and prokaryote, respectively. The corresponding protein sequences for the PCGs were obtained, and the redundant predicted splice variants of each PCG were removed: for each PCG, only the longest transcript was retained (Vogel et al. 2005).

Table 1. Overview of the Expression Data Sets Used in the Analyses.

| Species | Technology | Data Sets (GEO data set ID) | Reference | Description |
| Homo sapiens | Microarray | GSE1133 | Su et al. (2004) | 73 normal OTCs |
| Homo sapiens | MPSS | GSE1747 | Jongeneel et al. (2005) | 32 normal OTCs |
| Mus musculus | Microarray | GSE1133 | Su et al. (2004) | 61 normal OTCs |
| Mus musculus | MPSS | GSE1581 | - | 87 normal OTCs |
| Drosophila melanogaster | Microarray | GSE7763, GSE5430 | Chintapalli et al. (2007); Qin et al. (2007) | Embryonic, larval or adult fly, whole body, or different OTCs |
| Caenorhabditis elegans | Microarray | GSE11055, GSE2180, GSE8231, GSE8462, GSE8004, GSE5793 | Baugh et al. (2009); Baugh et al. (2005); Fox et al. (2007); Fox et al. (2007); Von Stetina et al. (2007); Troemel et al. (2006) | Embryonic, larval or adult worm, whole body or muscle, neural cells with various treatments |
| Saccharomyces cerevisiae | Microarray | GSE14227, GSE5185, GSE15936 | Ge et al. (2010); Alper et al. (2006); Mitchell et al. (2009) | Samples with various treatments |
| Escherichia coli | Microarray | GSE1121, GSE6425, GSE10345, GSE13982 | Covert et al. (2004); Reigstad et al. (2007); Cardinale et al. (2008); Nobre et al. (2009) | E. coli (K12, MG1655) samples (containing aerobic or anaerobic samples and samples treated with different drugs) |

(2) Certain-state proteome (CSP). CSPs were constructed from published gene expression data sets of various conditions downloaded from the GEO database (Barrett et al. 2009) (for detailed information and corresponding references, see table 1). To test whether the core conclusion of this study holds across conditions, we selected gene expression data sets that were as complete as possible, covering as many developmental stages and anatomical locations as were available. For microarray data, the presence or absence of a gene in a certain OTC (organ, tissue, or cell type) or condition was determined using the presence/absence calls of the MAS 5.0 algorithm (MAS5) (Hubbell et al. 2002). For massively parallel signature sequencing (MPSS) data, if the transcripts per million (TPM) value of a gene is larger than 3 (Peters et al. 2007), the gene's transcript is considered to be present in the OTC or condition.
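The presence rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names and the toy gene identifiers are hypothetical, and treating MAS5 "M" (marginal) calls as absent is an assumption the text does not settle.

```python
from typing import Dict, Set

# MPSS cutoff stated in the text (Peters et al. 2007): present if TPM > 3.
TPM_THRESHOLD = 3.0


def csp_from_mpss(tpm_by_gene: Dict[str, float],
                  threshold: float = TPM_THRESHOLD) -> Set[str]:
    """Return genes whose TPM is strictly larger than the threshold."""
    return {gene for gene, tpm in tpm_by_gene.items() if tpm > threshold}


def csp_from_mas5_calls(call_by_gene: Dict[str, str]) -> Set[str]:
    """Return genes with a MAS5 'P' (present) detection call.

    Assumption: 'M' (marginal) and 'A' (absent) calls both count as absent.
    """
    return {gene for gene, call in call_by_gene.items() if call == "P"}


# Toy input with hypothetical gene identifiers.
mpss_sample = {"geneA": 12.5, "geneB": 3.0, "geneC": 0.4}
mas5_sample = {"geneA": "P", "geneB": "M", "geneC": "A"}

csp_mpss = csp_from_mpss(mpss_sample)            # geneB sits exactly at 3, so it is not present
csp_microarray = csp_from_mas5_calls(mas5_sample)
```

Each CSP is then simply the set of genes called present in one OTC or condition, which is compared with the GEP in the later analyses.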
Identification of Protein Domains and Protein–Protein Interaction Domains

The identification of protein domains was performed as described previously (Basu et al. 2008). Briefly, to identify the known domains present in all the sequences of each GEP, we used RPS-BLAST to search against the Pfam (Finn et al. 2008) and SMART (Letunic et al. 2006) databases. When running the program, we selected the option of "masking low-complexity regions." Among the search results, only hits with an E value smaller than 0.001 were kept, and among overlapping hits only the one with the smallest E value was retained. When hits from the Pfam and SMART databases overlapped, the Pfam hits were retained and the SMART hits were discarded regardless of the E value.

We obtained the list of known protein–protein interaction (PPI) domains from the InterDom database (Ng et al. 2003) and calculated the PPI domain numbers in the proteins of each GEP. The InterDom database was downloaded from the Web site http://interdom.i2r.a-star.edu.sg/interdom. Potential false-positive interaction pairs were removed according to the annotation in the database. The number of interactions for each Pfam domain was counted. The CDD database (Marchler-Bauer et al. 2009) domain neighbor list was used to extract synonymous Pfam domain IDs of SMART domains in our data set.

Domain Number in Proteins

The number of domains in a protein was calculated with two methods, one including and the other excluding domain repeats within the protein. These two methods yield two kinds of domain number, called DNIR (Domain Number Including Repeats in one protein) and DNER (Domain Number Excluding Repeats in one protein) in this paper. According to the domain identification results, the number of Pfam- or SMART-annotated domains in each protein was calculated automatically with both methods.
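The hit filtering and the two domain counts can be illustrated with a minimal Python sketch. It is one possible reading of the rules above rather than the authors' implementation: the `Hit` record, the function names, and the greedy overlap resolution are assumptions, and synonymous Pfam/SMART identifiers are not merged here.

```python
from collections import namedtuple

# Hypothetical record for one RPS-BLAST hit on a protein (1-based, inclusive coordinates).
Hit = namedtuple("Hit", "domain source start end evalue")

E_CUTOFF = 1e-3                       # hits must have E value < 0.001
SOURCE_RANK = {"Pfam": 0, "SMART": 1}  # Pfam is preferred over SMART


def overlaps(a, b):
    """True if the two hits share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def filter_hits(hits):
    """Keep significant, non-overlapping hits.

    Assumed greedy reading of the text: consider Pfam hits before SMART hits,
    and within a database the smallest E value first; keep a hit only if it
    does not overlap a hit that was already kept.
    """
    significant = [h for h in hits if h.evalue < E_CUTOFF]
    kept = []
    for hit in sorted(significant, key=lambda h: (SOURCE_RANK[h.source], h.evalue)):
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


def domain_numbers(kept_hits):
    """Return (DNIR, DNER): counts including and excluding repeated domains."""
    dnir = len(kept_hits)
    dner = len({h.domain for h in kept_hits})
    return dnir, dner


# Toy protein with made-up coordinates and E values.
hits = [
    Hit("PF00069", "Pfam", 10, 260, 1e-50),
    Hit("SM00220", "SMART", 12, 258, 1e-80),  # overlaps the Pfam hit, so it is discarded
    Hit("PF00017", "Pfam", 300, 380, 1e-20),
    Hit("PF00017", "Pfam", 400, 480, 1e-15),  # repeat of the same domain
    Hit("PF00018", "Pfam", 390, 450, 1e-2),   # fails the E-value cutoff
]
kept = filter_hits(hits)
dnir, dner = domain_numbers(kept)  # three retained domains, two distinct types
```

The repeated domain is what separates the two counts: it raises DNIR but leaves DNER unchanged.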
For the domain number grading, DNIRs or DNERs of proteins were divided into three grades (1, 2, >2), corresponding to single-domain, two-domain, and multidomain proteins, respectively.

Domain Age

Domain age is defined by the taxonomic group in which the domain first appeared. To define this group, a procedure previously used for gene age class definition (Wolf et al. 2009) was used. Briefly, the protein sequences of all species in the RefSeq database were employed to predict the known protein domains using the method mentioned above. If the sequence of a domain failed to show significant similarity to protein sequences outside a given taxon, the domain was considered to belong to this class (taxon). Based on this rule, the protein domains from the six representative species were each partitioned into five or fewer age classes. Domains of H. sapiens, M. musculus, D. melanogaster, and C. elegans are divided into five age grades: cellular organisms (I), Eukaryota (II), Fungi/Metazoa group (III), Metazoa (IV), and Chordata (V, for H. sapiens and M. musculus), Insecta (V, for D. melanogaster), or Nematoda (V, for C. elegans). Domains of S. cerevisiae are divided into four age grades: cellular organisms (I), Eukaryota (II), Fungi/Metazoa group (III), and Fungi (IV). Domains of E. coli are divided into three age grades: cellular organisms (I), Bacteria (II), and Enterobacteriaceae (III). For H. sapiens, M. musculus, D. melanogaster, and C. elegans, Grades I and II are the older grades and Grades IV and V are the younger grades. For S. cerevisiae, Grades I and II are the older grades and Grade IV is the younger grade. For E. coli, Grade I is the older grade and Grade III is the younger grade. We characterize the domain age grade of a protein by the youngest domain in it, so that each protein in GEP is assigned a domain age grade.

FIG. 1. Over- or underrepresentation strengths of proteins with different domain numbers at various conditions of six representative species. Domain numbers are calculated including repeats (DNIR). According to DNIR, proteins are divided into three classes: 1, 2, and >2. The over- or underrepresentation strengths are represented by −log(P) or +log(P), respectively (for detailed information, see Materials and Methods). The red/blue dashed line represents the −/+log(P) value corresponding to significant over- or underrepresentation. Gene expression data sets of six representative species are used. (A) Seventy-three normal OTCs (organs, tissues, or cells) of Homo sapiens, including fetal organs and various organs, tissues, or cell types of the adult human body. Tissues composed primarily of neurons are indicated with a purple bar. Leukomonocytes in peripheral blood or bone marrow are indicated with a green bar. (B) Sixty-one normal OTCs of Mus musculus, including embryos at different developmental stages and various organs, tissues, or cell types of the adult mouse. Purple and green bars indicate the same kinds of OTCs as in A. (C) Embryonic, larval, or adult fly (Drosophila melanogaster). Embryos were collected in 2-h intervals and aged to generate animals 0–2, 4–6, and 8–10 h old. Larval fly samples include one whole body sample and 10 OTCs. Adult fly samples include one whole body sample and 17 OTCs. The purple bar indicates the same kind of OTCs as in A. (D) Embryonic, larval, or adult worm (Caenorhabditis elegans). Embryonic worm samples (indicated by a blue bar) come from various sources: ten samples at time points after the 4-cell embryonic stage at 22 °C, one embryonic sample at 2 h before hatching, four embryonic muscle samples at different time points of heat shock HLH-1, and four embryo or embryonic muscle samples treated with chitinase. Larval worm samples (indicated with a green bar) contain 17 samples of larvae hatching in the presence and absence of food and 5 samples of larval neural cells or whole body (reference). Adult worm samples treated with some drugs are indicated with a red bar. (E) Samples of Saccharomyces cerevisiae contain 10 time course samples (every 12 h from 12 to 120 h), 2 samples treated with EtOH, and 22 samples exposed to different series of mild stresses, including heat shock (HS), oxidative, and osmotic stresses. (F) Samples of Escherichia coli are composed of 13 samples at various time points during aerobic or anaerobic conditions and 8 samples treated with different drugs, including bcm (bicyclomycin) and CORM (carbon monoxide releasing molecule) at anaerobic (ANA) or aerobic (AE) conditions.

General Approach for Over- or Underrepresentation Analysis

To explore the distribution characteristics of proteins with different domain characters in CSP compared with GEP, we employed the hypergeometric distribution model, which has been used in many similar analyses (Jaffe et al. 2004; Li et al. 2005; de Godoy et al. 2008) (for details, see supplementary method, Supplementary Material online). The over- or underrepresentation strengths of each group of proteins with certain domain characters in CSP against the background of GEP were estimated, and the P values were corrected using the Benjamini–Hochberg method (Benjamini and Hochberg 1995). When the upper or lower tail probability for a given class of proteins was no more than 0.05, this class of proteins was regarded as significantly over- or underrepresented in CSP against GEP, respectively. The negative or positive logarithm of P (the upper or lower tail probability, respectively; logP) is used to represent the strength of over- or underrepresentation of proteins containing certain domain characters in CSPs.
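The test described above can be sketched with the Python standard library. This is a hedged illustration, not the authors' code (their exact procedure is in the supplementary method, which is not reproduced here): the function names are hypothetical, the logarithm base is assumed to be 10, and choosing the smaller tail to decide the sign is an assumption.

```python
import sys
from math import comb, log10


def hypergeom_tails(N, K, n, k):
    """Tail probabilities for drawing k class members in a CSP.

    N: number of proteins in the GEP (background)
    K: number of GEP proteins in the class of interest (e.g., DNIR > 2)
    n: number of proteins in the CSP
    k: number of CSP proteins in the class
    Returns (upper, lower) = (P(X >= k), P(X <= k)).
    """
    total = comb(N, n)
    lo, hi = max(0, n - (N - K)), min(K, n)
    pmf = {x: comb(K, x) * comb(N - K, n - x) / total for x in range(lo, hi + 1)}
    upper = sum(p for x, p in pmf.items() if x >= k)
    lower = sum(p for x, p in pmf.items() if x <= k)
    return upper, lower


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def signed_strength(p_upper, p_lower):
    """-log10(P) for overrepresentation, +log10(P) for underrepresentation.

    Assumptions: base-10 logarithm; the smaller tail decides the direction.
    """
    floor = sys.float_info.min  # avoid log10(0) when a tail underflows
    if p_upper <= p_lower:
        return -log10(max(p_upper, floor))
    return log10(max(p_lower, floor))


# Toy numbers: 10 background proteins, 5 in the class, CSP of 4, all 4 in the class.
upper, lower = hypergeom_tails(N=10, K=5, n=4, k=4)
strength = signed_strength(upper, lower)  # positive, i.e., overrepresented
```

In practice the tail probabilities for all protein classes and conditions would be collected, passed through `benjamini_hochberg`, and only then converted to signed strengths and compared with the 0.05 cutoff.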
We calculated the average logP among the groups of proteins with given domain characters (logP is taken as positive if the trend is consistent with the general trend and as negative otherwise) and named the resulting value T_grade; it represents how consistently a group follows the general over- or underrepresentation trends of most OTCs/conditions.

Retrieving Orthology Relationships and Evolutionary Rate Measurements

The evolutionary rates of the proteins in the six GEPs were calculated based on these ortholog pairs: H. sapiens versus Pan troglodytes, M. musculus versus Rattus norvegicus, D. melanogaster versus D. erecta, C. elegans versus C. brenneri, S. cerevisiae versus S. paradoxus, and E. coli str. K-12 substr. MG1655 versus E. coli O157:H7 EC4115. These orthology relationships, based on Ensembl release 63 (http://www.ensembl.org/), were retrieved using BioMart (http://www.biomart.org/). The number of synonymous substitutions per synonymous site (dS) and the number of nonsynonymous substitutions per nonsynonymous site (dN) between the above ortholog pairs were retrieved using BioMart, except that the values of dS and dN between S. cerevisiae and S. paradoxus were calculated in a previous study (Liu et al. 2011). The values of dN and dS from BioMart or the previous study were all estimated by the maximum likelihood method using the PAML package (Yang 1997). The ratio of dN to dS (dN/dS) was used to represent the evolutionary rate of a protein. The values of dN/dS are classified into five grades (0–0.05, 0.05–0.1, 0.1–0.2, 0.2–0.35, and >0.35) for the proteins of H. sapiens, M. musculus, and D. melanogaster and into four grades (0–0.05, 0.05–0.1, 0.1–0.2, and >0.2) for the proteins of C. elegans, S. cerevisiae, and E. coli.
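The dN/dS grading can be written as a small Python helper. This is a sketch under assumptions the text leaves open: the function and constant names are hypothetical, each interval is taken to include its upper boundary, and pairs with dS of zero are excluded because the ratio is undefined.

```python
from bisect import bisect_left

# Upper boundaries of all but the last grade, taken from the text.
FIVE_GRADE_BOUNDS = [0.05, 0.1, 0.2, 0.35]  # H. sapiens, M. musculus, D. melanogaster
FOUR_GRADE_BOUNDS = [0.05, 0.1, 0.2]        # C. elegans, S. cerevisiae, E. coli


def dnds_grade(dn, ds, bounds):
    """Return the 1-based dN/dS grade, or None if the ratio is undefined.

    Assumption: a ratio equal to a boundary falls into the lower grade
    (for example, exactly 0.05 is Grade 1).
    """
    if ds <= 0:
        return None  # assumption: such ortholog pairs are left out
    ratio = dn / ds
    return bisect_left(bounds, ratio) + 1


# Toy values, not taken from the paper.
slow = dnds_grade(0.0, 0.5, FIVE_GRADE_BOUNDS)       # ratio 0    -> Grade 1
medium = dnds_grade(0.03, 0.2, FIVE_GRADE_BOUNDS)    # ratio 0.15 -> Grade 3
fast = dnds_grade(0.4, 1.0, FIVE_GRADE_BOUNDS)       # ratio 0.4  -> Grade 5
fast_four = dnds_grade(0.4, 1.0, FOUR_GRADE_BOUNDS)  # ratio 0.4  -> Grade 4
```

Keeping the boundaries in a list makes it easy to switch between the five-grade and four-grade schemes per species.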
The orthology relationships among all six representative species were obtained from the OrthoMCL database (v5.0) (Chen et al. 2006), and these orthologous pairs were used in the analysis of domain composition evolution across the six species.

Measuring the Organismal Complexity

The biological complexity at the organism level is measured by the number of cell types of the representative species. In terms of reported cell type number, H. sapiens > M. musculus > D. melanogaster > C. elegans > S. cerevisiae (Vogel and Chothia 2006). Saccharomyces cerevisiae is considered more complex than E. coli because it has more types of subcellular components, although both are unicellular organisms.

Results and Discussion

Multidomain Proteins Are Significantly Overrepresented in CSPs

Protein domains are the basic structural and functional units/parts in a protein (Koonin et al. 2002; Basu et al. 2008), and the complexity of a system is represented by its number of constituting "parts" (McShea 2000; Oakley and Rivera 2008). McShea found that the number of structural parts generally is a good indicator of functional complexity (McShea 2000). Thus, domain number (DN) can be used to represent the complexity of protein structure and function, that is, multidomain proteins have more complex functions than single-domain proteins. This inference is supported by the genome-wide analysis of the molecular functional complexity of the proteins in each DN class (supplementary fig. S1, Supplementary Material online). For example, multidomain proteins tend to be the links in multiprotein complexes or in signaling pathways and have critical roles in all living cells (Koonin et al. 2002; Prochnik et al. 2010). Both our analysis and previous studies thus indicate that multidomain proteins tend to carry out more complex functions than single-domain proteins.

During evolution, the proportion of multidomain proteins in GEP increased gradually (supplementary fig. S2A, Supplementary Material online; see the last section of Materials and Methods for the measurement of organismal complexity). The proportion of younger domain–containing proteins is much higher among multidomain proteins than among single-domain proteins, especially in multicellular species (supplementary fig. S2C–G, Supplementary Material online), which indicates that the emergence of younger domains is one of the causes of the formation of multidomain proteins. In addition, duplication or combination of older domains (supplementary fig. S3, Supplementary Material online) during the evolution of orthologous proteins may also lead to the formation of multidomain proteins (Chothia et al. 2003). The increasing proportion of multidomain proteins has substantially contributed to increasing biological complexity. In this study, we focus on the utilization of proteins in each DN class under certain conditions instead of only at the genome level.

To calculate DN, two methods, including or excluding repeats of the same domain in a single polypeptide, were used. Similar results were obtained: in most CSPs, especially in multicellular organisms, single-domain proteins are underrepresented compared with their corresponding GEPs, whereas multidomain proteins are markedly overrepresented (fig. 1 and supplementary fig. S4, Supplementary Material online). That is, relative to all genes encoded in the genome, genes encoding multidomain proteins are expressed in significantly higher proportions than genes encoding single-domain proteins. These results indicate that proteins with complex structures tend to be utilized preferentially in CSPs.

Multidomain proteins tend to form more complex PPI networks than single-domain proteins. PPIs are mainly mediated by protein–protein interacting domains (PPI domains) and also contribute to biological complexity (Wuchty 2001). As expected, we found that proteins containing multiple PPI domains are also significantly overrepresented in CSPs (supplementary fig. S5, Supplementary Material online), supporting the conclusion that proteins with complex structures tend to be utilized preferentially.

When comparing across CSPs, it is interesting that in the data of H. sapiens, M. musculus, and D. melanogaster, the general trend is more noticeable in more complex organs or tissues (with complexity measured by the number of cell types), such as those of the nervous system (Panman and Perlmann 2011), whereas it is less noticeable or not significant in simple samples, such as leukomonocytes in peripheral blood or bone marrow (fig. 1, supplementary figs. S4 and S5, Supplementary Material online).

FIG. 2. Over- or underrepresentation strengths of proteins with different domain age grades at various conditions of six representative species. The age grade of the youngest domain in a protein is used to describe the domain age character of the protein. Protein domains of Homo sapiens, Mus musculus, Drosophila melanogaster, and Caenorhabditis elegans are divided into five age grades: cellular organisms (I), Eukaryota (II), Fungi/Metazoa group (III), Metazoa (IV), and Chordata (V, for H. sapiens and M. musculus), Insecta (V, for D. melanogaster), or Nematoda (V, for C. elegans). Domains of Saccharomyces cerevisiae proteins are divided into four age grades: cellular organisms (I), Eukaryota (II), Fungi/Metazoa group (III), and Fungi (IV). Domains of Escherichia coli proteins are divided into three age grades: cellular organisms (I), Bacteria (II), and Enterobacteriaceae (III). The meaning of the red/blue dashed lines and the data used are the same as in figure 1.
When comparing across species, the general trend is much stronger in multicellular organisms than in unicellular ones, although the strength of the trend decreased during the evolution of multicellular eukaryotes (fig. 1, supplementary figs. S4–S6, Supplementary Material online). These findings further support the conclusion that complex proteins tend to be utilized preferentially to realize biological complexity.

Proteins Containing Younger Domains Are Significantly Underrepresented in CSPs

The origin time of a protein domain, that is, its evolutionary age, is one of its important inherent properties. During evolution, new domains appeared continuously, meeting new requirements of biological function, which directly contributed to the increase in biological complexity (Chothia et al. 2003; Vogel and Chothia 2006). In general, older domains provide common and essential functions, whereas younger domains provide newly evolved biological functions, which in turn specify the functional characteristics of the protein (Vinogradov 2004; Cohen-Gihon et al. 2005). Based on this consideration, we used the age of the youngest domain in a protein to define the domain age character of the protein. Consistent with the functional characteristics at the domain level, in multicellular organisms, proteins containing only older domains generally take part in critical and primitive cellular processes, whereas proteins containing younger domains mainly take part in biological processes specific to multicellular organisms (supplementary fig. S7, Supplementary Material online).

As shown in figure 2, we found that in most CSPs, proteins containing only older domains are significantly overrepresented, whereas proteins containing younger domains are underrepresented. Notably, this trend is also stronger in multicellular species than in unicellular species.
Interestingly, in the H. sapiens data, the general trend is stronger in leukomonocytes in peripheral blood or bone marrow (fig. 2A). We also found that in liver and kidney, proteins containing only the oldest domains (Grade I) are more enriched than those of Grade II, which is not consistent with most tissues (fig. 2A and B). This inconsistency may relate to the primitive physiological functions of liver and kidney. When comparing across species, it is also obvious that the strength of overrepresentation is much higher in multicellular species than in unicellular ones, although the strengths decreased during the evolution of multicellular species (fig. 2, supplementary fig. S6, Supplementary Material online). Our findings indicate that proteins involved in older biological processes tend to be used more than proteins involved in newer biological processes.

Mechanisms Underlying the General Trends and Their Evolutionary Effects

The next fundamental question is what mechanisms underlie the general trends. Biological systems tend to meet their basic functional requirements while expending as little energy as possible. Thus, two constraints may partly contribute to the formation of the general trends.

The first one is functional constraint. The trend may partly result from the functional traits of multidomain and only older domain–containing proteins. Previous studies have shown that multidomain proteins have critical roles and tend to be the linker proteins in the network (Koonin et al. 2002; Prochnik et al. 2010), and our analyses revealed that only older domain–containing proteins generally take part in critical and primitive cellular processes (supplementary fig. S7, Supplementary Material online). Thus, biological systems tend to utilize multidomain or only older domain–containing proteins preferentially to meet their fundamental functional requirements.
The second one is cost constraint. Transcription and translation are two slow and expensive processes (Izban and Luse 1992; Castillo-Davis et al. 2002; Wagner 2005). Thus, biological systems tend to reduce the number of genes that are expressed and to express genes with lower cost, so as to save energy. Multidomain proteins generally are functionally pleiotropic (supplementary fig. S1, Supplementary Material online) and tend to be the linker proteins in the network (Koonin et al. 2002; Prochnik et al. 2010). Thus, preferentially utilizing multidomain proteins can form a more compact protein network, which needs fewer genes to be expressed. On the other hand, we found that only older domain–containing proteins tend to be short proteins (supplementary fig. S8, Supplementary Material online), except in C. elegans (similar to the previous investigation of the unexpected correlation between protein abundance and length; Duret and Mouchiroud 1999) and E. coli. The synthesis of short proteins has a lower energy cost (Eisenberg and Levanon 2003; Urrutia and Hurst 2003). Thus, to reduce the cost of gene expression, biological systems tend to utilize multidomain or only older domain–containing proteins preferentially.

Besides the over/underrepresentation trends, another familiar aspect of gene expression regulation is its tissue specificity. In fact, the universality of the over/underrepresentation trend across CSPs is related to the proportion of housekeeping/tissue-specific proteins in a category: protein categories containing more housekeeping proteins tend to be widely overrepresented across CSPs, whereas categories containing more tissue-specific proteins tend to be widely underrepresented across CSPs. Our analyses found that multidomain or only older domain–containing proteins have a higher proportion of housekeeping proteins (supplementary figs.
S9 and S10, Supplementary Material online), which is consistent with the results of the over- or underrepresentation analyses.

Next, we discuss the evolutionary effects of the general trends. Since domain sequences are relatively conserved across species, we presumed that protein domain characters are closely related to the evolutionary rates of proteins, measured by the ratio of dN to dS (see the "Retrieving Orthology Relationships and Evolutionary Rate Measurements" section in Materials and Methods). We found that proteins without any known conserved domain sequences evolve at much higher rates than proteins containing known domains (fig. 3A–L). Among the proteins containing known domains, multidomain proteins tend to evolve more slowly than single-domain proteins (fig. 3A–F), and similarly, proteins containing only older domains tend to evolve more slowly than those containing younger domains (fig. 3G–L). Because multidomain proteins or only older domain–containing proteins are significantly overrepresented in CSPs, we presumed that proteins with lower evolutionary rates may be significantly overrepresented and vice versa. This reasoning was confirmed by further analyses (supplementary fig. S11, Supplementary Material online).

In general, dN/dS can represent the strength of purifying selection, that is, the lower the value of dN/dS, the greater the strength of purifying selection on the protein (Hurst 2002; Liao and Zhang 2006; Basu et al. 2008; Liao et al. 2010; Oldmeadow et al. 2010). Thus, we can conclude that multidomain or only older domain–containing proteins are subjected to stronger purifying selection. We have shown that multidomain proteins generally are more pleiotropic (supplementary fig. S1, Supplementary Material online), so this conclusion is consistent with the view of previous studies (Wang et al. 2010) that more pleiotropic genes have larger effects on trait values and are subjected to stronger selection.
Based on our findings and the view of purifying selection, the stronger selection on multidomain proteins and proteins containing only older domains may partly result from their general overrepresentation in CSPs. More generally, we presume that proteins significantly overrepresented in CSPs tend to be subject to stronger purifying selection. In this way, biological systems can avoid the harm resulting from mistranslation (Drummond and Wilke 2008) or other protein errors.

As a result of regulatory and evolutionary processes, the strength of the general trends differs markedly across species and CSPs (figs. 1 and 2 and supplementary figs. S4–S6, Supplementary Material online). The general trend is stronger in multicellular organisms than in unicellular organisms, and surprisingly, its strength decreased during the evolution of multicellular eukaryotes. This phenomenon may have resulted from the increase in differentiation, which may have decreased the number of part types in specialized systems. This is similar to a previous test conducted at the cell level, which found a complexity drain on cells in the evolution of multicellularity (McShea 2002). In addition, we estimated the effect of E. coli operons on the strength of the general trends (Gama-Castro et al. 2011). We found that among the 737 operons containing more than one gene, almost half consisted of genes with different domain number and domain age characters (for detailed data, see supplementary table S1, Supplementary Material online). This indicates that many genes with different domain characters are turned on or off simultaneously because they reside in the same operon, which may partly contribute to the reduced intensity in E. coli of the trends found in this study.

Yang et al. · doi:10.1093/molbev/mss064 MBE

FIG. 3. The correlation between protein evolutionary rates and domain characters. (A–F) Domain number character; (G–L) domain age character. The ratio of nonsynonymous to synonymous distance (dN/dS) was used to represent the evolutionary rate of each protein (for details, see Materials and Methods). The upper and lower quartiles are indicated by the upper and lower edges of the box, and the median is indicated by a red bar in the box. The maximum whisker length is set to 1.5, which means that points are drawn as outliers (dotted individually outside the bars) if they are larger than q3 + 1.5(q3 − q1) (shown as the upper bar) or smaller than q1 − 1.5(q3 − q1) (shown as the lower bar), where q1 and q3 are the 25th and 75th percentiles, respectively. The P values shown in the figure are from a Mann–Whitney U test. Abbreviations in this figure: No, proteins without known conserved domains; Single, single-domain proteins; Multiple, multiple-domain proteins; Old, proteins only containing older domains (see the definition in Materials and Methods); and Young, proteins containing younger domains.

Deep Understanding of the Causes of Biological Complexity

The origin and evolution of biological complexity are old and fundamental questions in biology (Oakley and Rivera 2008). Our study focused on a basic question: How are structural factors preferentially utilized under certain conditions to contribute to biological complexity? We grouped proteins into categories and analyzed each group as a whole rather than individual genes or proteins as in previous studies (Lehner and Fraser 2004; Vinogradov 2004). Our major finding is that the proportion of each group of proteins in CSPs differs from that in GEP, all the proteins encoded by the genome. In particular, multidomain proteins and proteins containing only older domains are significantly overrepresented in various CSPs across species.
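Overrepresentation of a protein group in a CSP relative to GEP can be illustrated with a one-sided hypergeometric test. This is a sketch under the assumption that such a test is a reasonable stand-in; the passage here does not state which test was used, and the function name `hypergeom_upper_tail` and all counts below are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N: proteins in the background (e.g., GEP)
    K: background proteins belonging to the category of interest
    n: proteins in the sample (e.g., one CSP)
    k: sample proteins belonging to the category
    """
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    favorable = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, upper + 1))
    return favorable / total


# Hypothetical counts: 8 of 20 background proteins are multidomain,
# and 4 of the 5 proteins in the sample are multidomain.
p_over = hypergeom_upper_tail(k=4, K=8, n=5, N=20)
print(f"P(X >= 4) = {p_over:.4f}")
```

A small upper-tail probability suggests that the category is more frequent in the sample than expected from the background. When many categories or samples are tested, a multiple-testing correction would normally be applied to the resulting P values.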
Notably, this conclusion is not an artifact of focusing on one particular experiment, because all the conclusions remain valid when the analyses are based on data sets acquired with other techniques (supplementary fig. S12, Supplementary Material online).

Our findings provide a deeper understanding of the causes of biological complexity under certain conditions. During evolution, gene duplication, divergence, and recombination all contribute to the diverse domain characters of proteins (Chothia et al. 2003), a process driven by extremely general mechanisms based on the preferential attachment principle (Koonin et al. 2002). Both new domains and the increasing proportion of multiple-domain proteins in GEP directly increase biological complexity (Chothia et al. 2003; Vogel and Chothia 2006). However, they only provide more potential new and complex function executors that may contribute to biological complexity in higher species; it has not been known how these two potential causes operate in the formation of biological complexity under certain conditions. Our findings reveal that biological complexity under certain conditions is realized more significantly by diverse domain organization than by the emergence of new types of domain. We anticipate that our analyses will be a starting point for systematic studies of the rules governing the utilization of the potential information stored in the genome.

Supplementary Material

Supplementary methods, figures S1–S12, and table S1 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

We acknowledge the Associate Editor (Prof. Todd Oakley from the University of California) and the anonymous reviewers for their constructive comments and valuable suggestions. Sincere thanks are also due to Prof. Jun Qin (Beijing Proteome Research Center, Beijing, China) for his valuable advice.
This work was partially supported by Chinese State Key Projects for Basic Research (973 Program) (Nos. 2011CB910601, 2011CB910700, 2010CB912700, and 2011CB505304), National Natural Science Foundation of China (81000192, 81170378, 30972909, 81001470, 81170399, and 81010064), and International Scientific Collaboration Program (2009DFB33070, 2010DFA31260, and 2011DFB30370).

References

Adami C, Ofria C, Collier TC. 2000. Evolution of biological complexity. Proc Natl Acad Sci U S A. 97:4463–4468. Alper H, Moxley J, Nevoigt E, Fink GR, Stephanopoulos G. 2006. Engineering yeast transcription machinery for improved ethanol tolerance and production. Science 314:1565–1568. Barrett T, Troup DB, Wilhite SE, et al. (14 co-authors). 2009. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 37:D885–D890. Basu MK, Carmel L, Rogozin IB, Koonin EV. 2008. Evolution of protein domain promiscuity in eukaryotes. Genome Res. 18:449–461. Baugh LR, Demodena J, Sternberg PW. 2009. RNA Pol II accumulates at promoters of growth genes during developmental arrest. Science 324:92–94. Baugh LR, Hill AA, Claggett JM, Hill-Harfe K, Wen JC, Slonim DK, Brown EL, Hunter CP. 2005. The homeodomain protein PAL-1 specifies a lineage-specific regulatory network in the C. elegans embryo. Development 132:1843–1854. Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 57:289–300. Cardinale CJ, Washburn RS, Tadigotla VR, Brown LM, Gottesman ME, Nudler E. 2008. Termination factor Rho and its cofactors NusA and NusG silence foreign DNA in E. coli. Science 320:935–938. Castillo-Davis CI, Mekhedov SL, Hartl DL, Koonin EV, Kondrashov FA. 2002. Selection for short introns in highly expressed genes. Nat Genet. 31:415–418. Chen F, Mackey AJ, Stoeckert CJ Jr, Roos DS. 2006. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 34:D363–D368.
Chintapalli VR, Wang J, Dow JA. 2007. Using FlyAtlas to identify better Drosophila melanogaster models of human disease. Nat Genet. 39:715–720. Chothia C, Gough J, Vogel C, Teichmann SA. 2003. Evolution of the protein repertoire. Science 300:1701–1703. Cohen-Gihon I, Lancet D, Yanai I. 2005. Modular genes with metazoan-specific domains have increased tissue specificity. Trends Genet. 21:210–213. Covert MW, Knight EM, Reed JL, Herrgard MJ, Palsson BO. 2004. Integrating high-throughput and computational data elucidates bacterial networks. Nature 429:92–96. de Godoy LM, Olsen JV, Cox J, Nielsen ML, Hubner NC, Frohlich F, Walther TC, Mann M. 2008. Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature 455:1251–1254. Doebeli M, Ispolatov I. 2010. Complexity and diversity. Science 328:494–497. Drummond DA, Wilke CO. 2008. Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution. Cell 134:341–352. Duret L, Mouchiroud D. 1999. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc Natl Acad Sci U S A. 96:4482–4487. Eisenberg E, Levanon EY. 2003. Human housekeeping genes are compact. Trends Genet. 19:362–365. Finn RD, Tate J, Mistry J, et al. (11 co-authors). 2008. The Pfam protein families database. Nucleic Acids Res. 36:D281–D288. Fox RM, Watson JD, Von Stetina SE, McDermott J, Brodigan TM, Fukushige T, Krause M, Miller DM 3rd. 2007. The embryonic muscle transcriptome of Caenorhabditis elegans. Genome Biol. 8:R188. Gama-Castro S, Salgado H, Peralta-Gil M, et al. (28 co-authors). 2011. RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units). Nucleic Acids Res. 39:D98–D105. Ge H, Wei M, Fabrizio P, Hu J, Cheng C, Longo VD, Li LM. 2010.
Comparative analyses of time-course gene expression profiles of the long-lived sch9Delta mutant. Nucleic Acids Res. 38:143–158. Hubbell E, Liu WM, Mei R. 2002. Robust estimators for expression analysis. Bioinformatics 18:1585–1592. Hurst LD. 2002. The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet. 18(9):486. Izban MG, Luse DS. 1992. Factor-stimulated RNA polymerase II transcribes at physiological elongation rates on naked DNA but very poorly on chromatin templates. J Biol Chem. 267:13647–13655. Jaffe JD, Stange-Thomann N, Smith C, et al. (19 co-authors). 2004. The complete genome and proteome of Mycoplasma mobile. Genome Res. 14:1447–1461. Jongeneel CV, Delorenzi M, Iseli C, et al. (11 co-authors). 2005. An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Res. 15:1007–1014. Koonin EV. 2009. Darwinian evolution in the light of genomics. Nucleic Acids Res. 37:1011–1034. Koonin EV, Wolf YI, Karev GP. 2002. The structure of the protein universe and genome evolution. Nature 420:218–223. Lehner B, Fraser AG. 2004. Protein domains enriched in mammalian tissue-specific or widely expressed genes. Trends Genet. 20:468–472. Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P. 2006. SMART 5: domains in the context of genomes and networks. Nucleic Acids Res. 34:D257–D260. Li D, Li JQ, Ouyang SG, Wang J, Xu XJ, Zhu YP, He FC. 2005. An integrated strategy for functional analysis in large-scale proteomic research by gene ontology. Prog Biochem Biophys. 32:1026–1029. Liao BY, Weng MP, Zhang J. 2010. Contrasting genetic paths to morphological and physiological evolution. Proc Natl Acad Sci U S A. 107:7353–7358. Liao BY, Zhang J. 2006. Low rates of expression profile divergence in highly expressed genes and tissue-specific genes during mammalian evolution. Mol Biol Evol. 23:1119–1128. Liu ZY, Liu QJ, Sun HC, Hou L, Guo H, Zhu YP, Li D, He FC. 2011. 
Evidence for the additions of clustered interacting nodes during the evolution of protein interaction networks from network motifs. BMC Evol Biol. 11:133. Lynch M, Conery JS. 2003. The origins of genome complexity. Science 302:1401–1404. Maglott D, Ostell J, Pruitt KD, Tatusova T. 2007. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 35:D26–D31. Marchler-Bauer A, Anderson JB, Chitsaz F, et al. (28 co-authors). 2009. CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 37:D205–D210. McShea DW. 2000. Functional complexity in organisms: parts as proxies. Biol Philos. 15:28. McShea DW. 2002. A complexity drain on cells in the evolution of multicellularity. Evolution 56:441–452. Mitchell A, Romano GH, Groisman B, Yona A, Dekel E, Kupiec M, Dahan O, Pilpel Y. 2009. Adaptive prediction of environmental changes by microorganisms. Nature 460:220–224. Ng SK, Zhang Z, Tan SH, Lin K. 2003. InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Res. 31:251–254. Nilsen TW, Graveley BR. 2010. Expansion of the eukaryotic proteome by alternative splicing. Nature 463:457–463. Nobre LS, Al-Shahrour F, Dopazo J, Saraiva LM. 2009. Exploring the antimicrobial action of a carbon monoxide-releasing compound through whole-genome transcription profiling of Escherichia coli. Microbiology 155:813–824. Oakley TH, Rivera AS. 2008. Genomics and the evolutionary origins of nervous system complexity. Curr Opin Genet Dev. 18:479–492. Oldmeadow C, Mengersen K, Mattick JS, Keith JM. 2010. Multiple evolutionary rate classes in animal genome evolution. Mol Biol Evol. 27(4):942–953. Panman L, Perlmann T. 2011. Tracing lineages to uncover neuronal identity. BMC Biol. 9:51. Peters LM, Belyantseva IA, Lagziel A, Battey JF, Friedman TB, Morell RJ. 2007.
Signatures from tissue-specific MPSS libraries identify transcripts preferentially expressed in the mouse inner ear. Genomics 89:197–206. Prochnik SE, Umen J, Nedelcu AM, et al. (28 co-authors). 2010. Genomic analysis of organismal complexity in the multicellular green alga Volvox carteri. Science 329:223–226. Qin X, Ahn S, Speed TP, Rubin GM. 2007. Global analyses of mRNA translational control during early Drosophila embryogenesis. Genome Biol. 8:R63. Reigstad CS, Hultgren SJ, Gordon JI. 2007. Functional genomic studies of uropathogenic Escherichia coli and host urothelial cells when intracellular bacterial communities are assembled. J Biol Chem. 282:21259–21267. Rivera AS, Pankey MS, Plachetzki DC, Villacorta C, Syme AE, Serb JM, Omilian AR, Oakley TH. 2010. Gene duplication and the origins of morphological complexity in pancrustacean eyes, a genomic approach. BMC Evol Biol. 10:123. Su AI, Wiltshire T, Batalov S, et al. (13 co-authors). 2004. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 101:6062–6067. Szathmary E, Jordan F, Pal C. 2001. Molecular biology and evolution. Can genes explain biological complexity? Science 292:1315–1316. Troemel ER, Chu SW, Reinke V, Lee SS, Ausubel FM, Kim DH. 2006. p38 MAPK regulates expression of immune response genes and contributes to longevity in C. elegans. PLoS Genet. 2:e183. Urrutia AO, Hurst LD. 2003. The signature of selection mediated by expression on human genes. Genome Res. 13:2260–2264. Vinogradov AE. 2004. Compactness of human housekeeping genes: selection for economy or genomic design? Trends Genet. 20:248–253. Vogel C, Chothia C. 2006. Protein family expansions and biological complexity. PLoS Comput Biol. 2:e48. Vogel C, Teichmann SA, Pereira-Leal J. 2005. The relationship between domain duplication and recombination. J Mol Biol. 346:355–365. Von Stetina SE, Watson JD, Fox RM, Olszewski KL, Spencer WC, Roy PJ, Miller DM 3rd. 2007. 
Cell-specific microarray profiling experiments reveal a comprehensive picture of gene expression in the C. elegans nervous system. Genome Biol. 8:R135. Wagner A. 2005. Energy constraints on the evolution of gene expression. Mol Biol Evol. 22:1365–1374. Wang Z, Liao BY, Zhang J. 2010. Genomic patterns of pleiotropy and the evolution of complexity. Proc Natl Acad Sci U S A. 107:18034–18039. Wolf YI, Novichkov PS, Karev GP, Koonin EV, Lipman DJ. 2009. The universal distribution of evolutionary rates of genes and distinct characteristics of eukaryotic genes of different apparent ages. Proc Natl Acad Sci U S A. 106:7273–7280. Wuchty S. 2001. Scale-free behavior in protein domain networks. Mol Biol Evol. 18(9):1694–1702. Yang Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 13:555–556.