General Trends in the Utilization of Structural Factors
Contributing to Biological Complexity
Dong Yang,1,2 Fan Zhong,3 Dong Li,1,2 Zhongyang Liu,1,2 Handong Wei,1,2 Ying Jiang,*,1,2 and
Fuchu He*,1,2,3
1 State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing, P. R. China
2 National Engineering Research Center for Protein Drugs, Beijing, P. R. China
3 Institutes of Biomedical Sciences and Department of Chemistry, Fudan University, Shanghai, P. R. China
*Corresponding author: E-mail: [email protected]; [email protected].
Associate editor: Todd Oakley
Abstract
During evolution, proteins containing newly emerged domains and the increasing proportion of multidomain proteins in the full genome-encoded proteome (GEP) have substantially contributed to increasing biological complexity. However, it is not known how these two potential structural factors are preferentially utilized at given physiological states. Here, we classified proteins according to domain number and domain age and explored the general trends across species for the utilization of proteins from GEP to various certain-state proteomes (CSPs, i.e., all the proteins expressed at certain physiological states). We found that multidomain proteins or only older domain–containing proteins are significantly overrepresented in CSPs compared with GEP, a trend that is stronger in multicellular organisms than in unicellular organisms. Interestingly, the strengths of overrepresentation decreased during evolution of multicellular eukaryotes. When comparing across CSPs, we found that multidomain proteins are more overrepresented in complex tissues than in simpler ones, whereas no difference among proteins with domains of different ages is evident between complex and simple tissues. Thus, biological complexity under certain conditions is more significantly realized by diverse domain organization than by the emergence of new types of domain. In addition, we found that multidomain or only older domain–containing proteins tend to evolve slowly and generally are under stronger purifying selection, which may partly result from their general overrepresentation trends in CSPs.
Key words: biological complexity, protein domain, proteome, genome, purifying selection.
Research article
Mol. Biol. Evol. 29(8):1957–1968. 2012 doi:10.1093/molbev/mss064 Advance Access publication February 10, 2012
© The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]
Introduction
One of the fundamental questions in biology is how species evolve increased biological complexity. In general, biological complexity can be assessed at the level of organism, organ, tissue, cell, or protein interaction networks (Oakley and Rivera 2008). At the organism level, it is clear that biological complexity correlates well with genomic complexity, measured not only by the number of protein-coding genes (PCGs) but also by the complexity of gene functions (Koonin et al. 2002; Vogel and Chothia 2006), the connectivity of gene regulation networks (Szathmary et al. 2001), and the proportion of spliceosomal introns leading to alternative splicing (Nilsen and Graveley 2010). In this view, the genome could be regarded as a “library” of potential biological information, among which are the factors contributing to biological complexity. Previous studies paid much more attention to the increase of these factors in the genome-encoded proteome (GEP) during evolution (Adami et al. 2000; Szathmary et al. 2001; Koonin et al. 2002; Koonin 2009; Lynch and Conery 2003; Doebeli and Ispolatov 2010; Nilsen and Graveley 2010; Prochnik et al. 2010; Rivera et al. 2010), but it is not known how these factors are preferentially utilized to realize biological complexity under different conditions. It is known that different certain-state proteomes (CSPs) utilize different proteins from GEP, but this process is not random and there may be some general trends.
Among all the factors contributing to biological complexity, protein domain characters, that is, the potential structural factors, are of significant importance because of their direct relationship with the functional complexity of proteins and networks (Koonin et al. 2002). To extract potential rules that may govern the utilization of proteins with different domain characters from GEP to CSP, we investigated how the overrepresentation strength of proteins in CSPs, compared with GEP, varies with two protein domain properties: the number and the age of the domains in a protein. In addition, we explored the mechanisms underlying the general trends and their evolutionary effects.
Materials and Methods
Data Sets
(1) GEP (the full genome-encoded proteome). The PCGs in the Entrez Gene database (Maglott et al. 2007) for six species (Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, and Escherichia coli) at the NCBI ftp site (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/) were used as genome background. These six species represent higher multicellular eukaryotes, lower multicellular eukaryotes, unicellular eukaryotes, and prokaryotes, respectively. The corresponding protein sequences for the PCGs were obtained, and the redundant predicted splice variants of each PCG were removed: for each PCG, only the longest transcript was retained (Vogel et al. 2005).
(2) Certain-state proteome (CSP). CSPs were constructed based on published gene expression data sets of various conditions downloaded from the GEO database (Barrett et al. 2009) (for detailed information and corresponding references, see table 1). To test whether the core conclusion of this study holds across conditions, we selected gene expression data sets that were as complete as possible, covering as many developmental stages and anatomical sites as possible. For microarray data, the presence or absence of a gene in a certain OTC (organ, tissue, or cell type) or condition was determined from the presence/absence calls of the MAS 5.0 algorithm (MAS5) (Hubbell et al. 2002). For massively parallel signature sequencing (MPSS) data, if the transcripts per million (TPM) value of a gene was larger than 3 (Peters et al. 2007), the gene’s transcript was considered to be present in the OTC or condition.
Table 1. Overview of the Expression Data Sets Used in the Analyses.
| Species | Technology | Data Sets (GEO data set ID) | Reference | Description |
|---|---|---|---|---|
| Homo sapiens | Microarray | GSE1133 | Su et al. (2004) | 73 normal OTCs |
| Homo sapiens | MPSS | GSE1747 | Jongeneel et al. (2005) | 32 normal OTCs |
| Mus musculus | Microarray | GSE1133 | Su et al. (2004) | 61 normal OTCs |
| Mus musculus | MPSS | GSE1581 | - | 87 normal OTCs |
| Drosophila melanogaster | Microarray | GSE7763, GSE5430 | Chintapalli et al. (2007); Qin et al. (2007) | Embryonic, larval or adult fly, whole body, or different OTCs |
| Caenorhabditis elegans | Microarray | GSE11055, GSE2180, GSE8231, GSE8462, GSE8004, GSE5793 | Baugh et al. (2009); Baugh et al. (2005); Fox et al. (2007); Fox et al. (2007); Von Stetina et al. (2007); Troemel et al. (2006) | Embryonic, larval or adult worm, whole body or muscle, neural cells with various treatments |
| Saccharomyces cerevisiae | Microarray | GSE14227, GSE5185, GSE15936 | Ge et al. (2010); Alper et al. (2006); Mitchell et al. (2009) | Samples with various treatments |
| Escherichia coli | Microarray | GSE1121, GSE6425, GSE10345, GSE13982 | Covert et al. (2004); Reigstad et al. (2007); Cardinale et al. (2008); Nobre et al. (2009) | E. coli (K12, MG1655) samples (containing aerobic or anaerobic samples and samples treated with different drugs) |
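The presence rules described for building a CSP can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names and gene identifiers are hypothetical, and treating MAS5 marginal ("M") calls as absent is an assumption, because the text does not say how they were handled.

```python
from typing import Dict, Set

TPM_CUTOFF = 3.0  # cutoff stated in the text (Peters et al. 2007)


def csp_from_mpss(tpm_by_gene: Dict[str, float], cutoff: float = TPM_CUTOFF) -> Set[str]:
    """Genes whose TPM is strictly larger than the cutoff are called present."""
    return {gene for gene, tpm in tpm_by_gene.items() if tpm > cutoff}


def csp_from_mas5(call_by_gene: Dict[str, str]) -> Set[str]:
    """Genes with a MAS5 'P' (present) detection call are called present.

    Assumption: 'M' (marginal) and 'A' (absent) calls are both treated as
    absent; the source text does not specify this.
    """
    return {gene for gene, call in call_by_gene.items() if call == "P"}


# Hypothetical gene identifiers and values, for illustration only.
mpss_csp = csp_from_mpss({"geneA": 12.0, "geneB": 3.0, "geneC": 0.5, "geneD": 3.1})
mas5_csp = csp_from_mas5({"geneA": "P", "geneB": "M", "geneC": "A"})
print(sorted(mpss_csp), sorted(mas5_csp))
```

Note that a gene with a TPM of exactly 3 is not called present, because the text requires the value to be larger than 3.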
Identification of Protein Domains and Protein–Protein Interaction Domains
The identification of protein domains was performed as described previously (Basu et al. 2008). Briefly, to identify the known domains present in all the sequences of each GEP, we used RPS-BLAST to search against the Pfam (Finn et al. 2008) and SMART (Letunic et al. 2006) databases. When running the program, we selected the option of “masking low-complexity regions.” Among the search results, only hits with an E value smaller than 0.001 were kept, and among overlapping hits only the one with the smallest E value was retained. When hits from the Pfam and SMART databases overlapped, the Pfam hits were retained and the SMART hits were discarded regardless of the E value.
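The hit-filtering rules above can be sketched as a greedy selection. This is a minimal sketch under assumptions rather than the authors' implementation: the `Hit` record, the function names, and the domain identifiers are hypothetical, and "overlapping" is assumed to mean sharing at least one residue, which the text does not define.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    source: str    # "Pfam" or "SMART"
    domain: str    # domain identifier (hypothetical in the example below)
    start: int     # first residue of the hit, inclusive
    end: int       # last residue of the hit, inclusive
    evalue: float


def overlaps(a: Hit, b: Hit) -> bool:
    """Assumption: two hits overlap if they share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def select_hits(hits: List[Hit], max_evalue: float = 1e-3) -> List[Hit]:
    """Keep significant, mutually non-overlapping hits, preferring Pfam."""
    significant = [h for h in hits if h.evalue < max_evalue]
    # Pfam hits are ranked before SMART hits regardless of E value;
    # within each database, smaller E values come first.
    ranked = sorted(significant, key=lambda h: (h.source != "Pfam", h.evalue))
    kept: List[Hit] = []
    for hit in ranked:
        if not any(overlaps(hit, other) for other in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


example_hits = [
    Hit("Pfam", "PF_A", 10, 100, 1e-10),
    Hit("Pfam", "PF_B", 50, 120, 1e-20),    # overlaps PF_A with a smaller E value
    Hit("SMART", "SM_C", 90, 150, 1e-50),   # overlaps a Pfam hit, so it is dropped
    Hit("SMART", "SM_D", 200, 260, 1e-5),   # no overlap, so it is kept
    Hit("Pfam", "PF_E", 300, 350, 1e-2),    # fails the E value cutoff
]
selected = [h.domain for h in select_hits(example_hits)]
print(selected)
```

The greedy order encodes both rules at once: sorting Pfam first implements the database preference, and sorting by E value within a database implements the smallest-E-value rule for overlapping hits.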
We obtained the list of known protein–protein interaction (PPI) domains from the InterDom database (Ng et al. 2003) and calculated the number of PPI domains in the proteins of each GEP. The InterDom database was downloaded from the Web site http://interdom.i2r.a-star.edu.sg/interdom. The potential false-positive interaction pairs were removed according to the annotation in the database. The number of interactions for each Pfam domain was counted. The CDD database (Marchler-Bauer et al. 2009) domain neighbor list was used to extract synonymous Pfam domain IDs of SMART domains in our data set.
Domain Number in Proteins
The number of domains in one protein was calculated in two ways, one including and the other excluding domain repeats within the protein. We refer to the resulting counts as DNIR (Domain Number Including Repeats in one protein) and DNER (Domain Number Excluding Repeats in one protein).
Based on the domain identification results, the number of Pfam- or SMART-annotated domains in each protein was counted automatically under both definitions. For domain number grading, the DNIRs or DNERs of proteins were divided into three grades (1, 2, >2), corresponding to single-domain, two-domain, and multidomain proteins, respectively.
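The two counting schemes and the three-level grading can be illustrated as follows. This is a sketch with hypothetical function names and domain labels; returning `None` for proteins without any annotated domain is an assumption, since the grading in the text covers only domain-containing proteins.

```python
from typing import List, Optional, Tuple


def domain_numbers(domains: List[str]) -> Tuple[int, int]:
    """Return (DNIR, DNER) for the ordered domain list of one protein."""
    dnir = len(domains)        # every occurrence counts, repeats included
    dner = len(set(domains))   # each domain type counts once
    return dnir, dner


def domain_number_grade(count: int) -> Optional[str]:
    """Map a domain count to the grades 1, 2, and >2 used in the text."""
    if count < 1:
        return None            # assumption: proteins without domains are ungraded
    if count == 1:
        return "1"             # single-domain protein
    if count == 2:
        return "2"             # two-domain protein
    return ">2"                # multidomain protein


# Hypothetical domain architectures, for illustration only.
repeat_protein = domain_numbers(["domA", "domA"])
mixed_protein = domain_numbers(["domA", "domA", "domB", "domC"])
print(repeat_protein, mixed_protein)
```

A protein made of two copies of the same domain is a two-domain protein under DNIR but a single-domain protein under DNER, which is why the paper reports results for both definitions.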
Domain Age
Domain age is defined by the taxon in which the domain first appeared. To determine this taxon, we used a procedure previously applied to gene age class definition (Wolf et al. 2009). Briefly, the protein sequences of all species in the RefSeq database were used to predict the known protein domains by the method described above. If the sequence of a domain failed to show significant similarity to protein sequences outside a given taxon, the domain was assigned to that age class (taxon). Based on this rule, the protein domains from the six representative species were each partitioned into five or fewer age classes. Domains of H. sapiens, M. musculus, D. melanogaster, and C. elegans are divided into five age grades: cellular organisms (I), Eukaryota (II), Fungi/Metazoa group (III), Metazoa (IV), and Chordata (V, for H. sapiens and M. musculus), Insecta (V, for D. melanogaster), or Nematoda (V, for C. elegans). Domains of S. cerevisiae are divided into four age grades: cellular organisms (I), Eukaryota (II), Fungi/Metazoa group (III), and Fungi (IV). Domains of E. coli are divided into three age grades: cellular organisms (I), Bacteria (II), and Enterobacteriaceae (III). For H. sapiens, M. musculus, D. melanogaster, and C. elegans, Grades I and II are the older grades and Grades IV and V are the younger grades. For S. cerevisiae, Grades I and II are the older grades and Grade IV is the younger grade. For E. coli, Grade I is the older grade and Grade III is the younger grade. We characterized the domain age grade of a protein by the youngest domain it contains, so that each protein in GEP was assigned a domain age grade.
FIG. 1. Over- or underrepresentation strengths of proteins with different domain numbers at various conditions of six representative species. Domains are counted including repeats, marked as “DNIR.” According to DNIR, proteins are divided into three classes: 1, 2, and >2. The over- or underrepresentation strengths are represented by −log (P) or log (P), respectively (for detailed information, see Materials and Methods). The red/blue dashed line represents the −/+log (P) value corresponding to significant over- or underrepresentation. Gene expression data sets of six representative species are used. (A) Seventy-three normal OTCs (organ, tissue, or cells) of Homo sapiens, including fetal organs and various organs, tissues, or cell types of the adult human body. Tissues composed primarily of neurons are indicated with a purple bar. Leukomonocytes in peripheral blood or bone marrow are indicated with a green bar. (B) Sixty-one normal OTCs of Mus musculus, including embryos at different developmental stages and various organs, tissues, or cell types of adult mouse. The purple bar and green bar indicate the same OTCs as in A. (C) Embryonic, larval, or adult fly (Drosophila melanogaster). Embryos were collected in 2-h intervals and aged to generate animals 0–2, 4–6, and 8–10 h old. Larval fly samples include one whole body sample and 10 OTCs. Adult fly samples include one whole body sample and 17 OTCs. The purple bar indicates the same OTCs as in A. (D) Embryonic, larval, or adult worm (Caenorhabditis elegans). Embryonic worm samples (indicated by a blue bar) come from several sources: ten samples at time points after the 4-cell embryonic stage at 22 °C, one embryonic sample at 2 h before hatching, four embryonic muscle samples at different time points of heat shock HLH-1, and four embryo or embryonic muscle samples treated with chitinase. Larval worm samples (indicated with a green bar) contain 17 samples of larvae hatching in the presence and absence of food and 5 samples of larval neural or whole body (reference). Adult worm samples treated with some drugs are indicated with a red bar. (E) Samples of Saccharomyces cerevisiae contain 10 time course samples (every 12 h from 12 to 120 h), 2 samples treated by EtOH, and 22 samples exposed to different series of mild stresses, including heat shock (HS), oxidative, and osmotic stresses. (F) Samples of Escherichia coli are composed of 13 samples of various time points during aerobic or anaerobic conditions and 8 samples treated with different drugs, including bcm (bicyclomycin) and CORM (carbon monoxide releasing molecule) at anaerobic (ANA) or aerobic (AE) conditions.
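The rule that a protein takes the age grade of its youngest domain can be expressed compactly. The sketch below assumes that age grades are encoded as integers following the Roman numerals in the text (I = 1 is the oldest), so that the youngest domain is the one with the largest grade; the function name and the example proteins are hypothetical.

```python
from typing import Dict, List

# Age grades for H. sapiens and M. musculus as listed in the text (I to V).
AGE_GRADE_CHORDATE: Dict[str, int] = {
    "cellular organisms": 1,
    "Eukaryota": 2,
    "Fungi/Metazoa group": 3,
    "Metazoa": 4,
    "Chordata": 5,
}


def protein_age_grade(domain_taxa: List[str], grade_of: Dict[str, int]) -> int:
    """Age grade of a protein = grade of its youngest (highest-grade) domain."""
    if not domain_taxa:
        raise ValueError("protein has no annotated domain")
    return max(grade_of[taxon] for taxon in domain_taxa)


# Hypothetical proteins described by the taxon of origin of each domain.
old_only = protein_age_grade(["cellular organisms", "Eukaryota"], AGE_GRADE_CHORDATE)
with_young = protein_age_grade(["cellular organisms", "Metazoa"], AGE_GRADE_CHORDATE)
print(old_only, with_young)
```

Under this encoding, a protein counts as "only older domain–containing" exactly when its grade falls in the older grades, because a single younger domain raises the maximum.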
General Approach for Over- or
Underrepresentation Analysis
To explore how proteins with different domain characters are distributed in CSP compared with GEP, we employed the hypergeometric distribution model, which has been used in many similar analyses (Jaffe et al. 2004; Li et al. 2005; de Godoy et al. 2008) (for details, see supplementary method, Supplementary Material online). The over- or underrepresentation strengths of each group of proteins with certain domain characters in CSP against the background of GEP were estimated, and the P values were corrected using the Benjamini–Hochberg method (Benjamini and Hochberg 1995). When the upper or lower tail probability for a given class of proteins was no more than 0.05, this class of proteins was regarded as significantly over- or underrepresented in CSP against GEP, respectively. The negative or positive logarithm of P (upper or lower tail probability, logP) is used to represent the strength of over- or underrepresentation of proteins containing certain domain characters in CSPs. We calculated the average logP among the groups of proteins with given domain characters (logP was taken as positive if the trend was consistent with the general trend and as negative otherwise) and named the resulting value T_grade, which represents how consistently the group follows the general over- or underrepresentation trends of most OTCs/conditions.
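The test described above can be sketched with the standard library alone. This is an illustrative sketch, not the supplementary method itself: the function names are hypothetical, the logarithm is assumed to be base 10 (the text does not state the base), and the choice of which tail to report is assumed to be the smaller of the two.

```python
from math import comb, log10
from typing import List, Tuple


def hypergeom_tails(gep_size: int, gep_in_class: int,
                    csp_size: int, csp_in_class: int) -> Tuple[float, float]:
    """Return (upper, lower) tail probabilities P(X >= k) and P(X <= k)."""
    total = comb(gep_size, csp_size)
    lowest = max(0, csp_size - (gep_size - gep_in_class))
    highest = min(gep_in_class, csp_size)
    upper = lower = 0.0
    for x in range(lowest, highest + 1):
        p = comb(gep_in_class, x) * comb(gep_size - gep_in_class, csp_size - x) / total
        if x >= csp_in_class:
            upper += p
        if x <= csp_in_class:
            lower += p
    return min(upper, 1.0), min(lower, 1.0)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def signed_strength(upper: float, lower: float) -> float:
    """-log10(upper tail) if overrepresented, +log10(lower tail) otherwise."""
    tiny = 1e-300                          # guards against log10(0)
    if upper <= lower:
        return -log10(max(upper, tiny))    # positive value: overrepresentation
    return log10(max(lower, tiny))         # negative value: underrepresentation


# Toy numbers: a GEP of 10 proteins, 4 of them multidomain; a CSP of 5 proteins.
over_tails = hypergeom_tails(10, 4, 5, 4)   # all 4 multidomain proteins expressed
under_tails = hypergeom_tails(10, 4, 5, 0)  # none of them expressed
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(signed_strength(*over_tails), signed_strength(*under_tails), adjusted)
```

Exact integer binomial coefficients keep the sketch dependency-free; for genome-scale inputs, a library routine working in log space would be a more practical choice.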
Retrieving Orthology Relationships and
Evolutionary Rate Measurements
The evolutionary rates of the proteins in the six GEPs were calculated based on the following ortholog pairs: H. sapiens versus Pan troglodytes, M. musculus versus Rattus norvegicus, D. melanogaster versus D. erecta, C. elegans versus C. brenneri, S. cerevisiae versus S. paradoxus, and E. coli str. K-12 substr. MG1655 versus E. coli O157:H7 EC4115. These orthology relationships, based on Ensembl release 63 (http://www.ensembl.org/), were retrieved using BioMart (http://www.biomart.org/). The number of synonymous substitutions per synonymous site (dS) and the number of nonsynonymous substitutions per nonsynonymous site (dN) between the above ortholog pairs were also retrieved using BioMart, except that the values of dS and dN between S. cerevisiae and S. paradoxus were taken from a previous study (Liu et al. 2011). The values of dN and dS from BioMart or the previous study were all estimated by the maximum likelihood method using the PAML package (Yang 1997). The ratio of dN to dS (dN/dS) was used to represent the evolutionary rate of a protein. The values of dN/dS were classified into five grades (0–0.05, 0.05–0.1, 0.1–0.2, 0.2–0.35, and >0.35) for the proteins of H. sapiens, M. musculus, and D. melanogaster and into four grades (0–0.05, 0.05–0.1, 0.1–0.2, and >0.2) for the proteins of C. elegans, S. cerevisiae, and E. coli.
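The grading of dN/dS values can be sketched with a binary search over the grade boundaries. Two points are assumptions not stated in the text: each grade is taken to include its upper boundary (so that 0.35 falls in the 0.2–0.35 grade, consistent with the last grade being strictly greater than 0.35), and pairs with dS equal to zero are left ungraded. The function and constant names are hypothetical.

```python
from bisect import bisect_left
from typing import Optional, Sequence

# Upper boundaries of all grades except the last, open-ended one.
FIVE_GRADE_BOUNDS = (0.05, 0.1, 0.2, 0.35)  # H. sapiens, M. musculus, D. melanogaster
FOUR_GRADE_BOUNDS = (0.05, 0.1, 0.2)        # C. elegans, S. cerevisiae, E. coli


def dnds_grade(dn: float, ds: float,
               upper_bounds: Sequence[float] = FIVE_GRADE_BOUNDS) -> Optional[int]:
    """Return the 1-based dN/dS grade, or None when the ratio is undefined."""
    if ds <= 0:
        return None                          # assumption: skip pairs with dS = 0
    ratio = dn / ds
    # bisect_left places a ratio equal to a boundary in the lower grade.
    return bisect_left(upper_bounds, ratio) + 1


slow = dnds_grade(0.01, 0.5)                      # ratio 0.02
fast_five = dnds_grade(0.04, 0.1)                 # ratio about 0.4
fast_four = dnds_grade(0.04, 0.1, FOUR_GRADE_BOUNDS)
print(slow, fast_five, fast_four)
```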
The orthology relationships among all the six representative species were obtained from the OrthoMCL database
(v5.0) (Chen et al. 2006), and these orthologous pairs were
used in the analysis of domain composition evolution
across the six species.
Measuring the Organismal Complexity
The biological complexity at the organism level is measured by the number of cell types of the representative species. In terms of cell type number, H. sapiens > M. musculus > D. melanogaster > C. elegans > S. cerevisiae (Vogel and Chothia 2006). Saccharomyces cerevisiae is considered more complex than E. coli because it has more kinds of subcellular components, although both are unicellular organisms.
Results and Discussion
Multidomain Proteins Are Significantly
Overrepresented in CSPs
Protein domains are the basic structural and functional units/parts in a protein (Koonin et al. 2002; Basu et al. 2008), and the complexity of a system is represented by its number of constituting “parts” (McShea 2000; Oakley and Rivera 2008). McShea found that the number of structural parts generally is a good indicator of functional complexity (McShea 2000). Thus, domain number (DN) could be used to represent the complexity of protein structure and function, that is, multidomain proteins have more complex functions than single-domain proteins. This inference is supported by the genome-wide analysis of the molecular functional complexity of the proteins in each DN class (supplementary fig. S1, Supplementary Material online). For example, multidomain proteins tend to be the links in multiprotein complexes or in signaling pathways and have critical roles in all living cells (Koonin et al. 2002; Prochnik et al. 2010). Both our analysis and previous studies thus indicate that multidomain proteins tend to carry out more complex functions than single-domain proteins.
FIG. 2. Over- or underrepresentation strengths of proteins with different domain age grades at various conditions of six representative species. The age grade of the youngest domain in one protein is used to describe the domain age character of the protein. Protein domains of Homo sapiens, Mus musculus, Drosophila melanogaster, and Caenorhabditis elegans are divided into five age grades: cellular organisms (I), Eukaryota (II), Fungi/Metazoa group (III), Metazoa (IV), and Chordata (V, for H. sapiens and M. musculus), Insecta (V, for D. melanogaster), or Nematoda (V, for C. elegans). Domains of Saccharomyces cerevisiae proteins are divided into four age grades: cellular organisms (I), Eukaryota (II), Fungi/Metazoa group (III), and Fungi (IV). Domains of Escherichia coli proteins are divided into three age grades: cellular organisms (I), Bacteria (II), and Enterobacteriaceae (III). The meaning of the red/blue dashed lines and the data used are the same as in figure 1.
During evolution, the proportion of multidomain proteins in GEP increased gradually (supplementary fig. S2A, Supplementary Material online; see the last section of Materials and Methods for the measurement of organismal complexity). The proportion of younger domain–containing proteins is much higher in multidomain proteins than in single-domain proteins, especially in multicellular species (supplementary fig. S2C–G, Supplementary Material online), which indicates that the emergence of younger domains is one of the causes of the formation of multidomain proteins. Moreover, duplication or combination of older domains (supplementary fig. S3, Supplementary Material online) during the evolution of orthologous proteins may also lead to the formation of multidomain proteins (Chothia et al. 2003). The increasing proportion of multidomain proteins has substantially contributed to increasing biological complexity. In this study, we focus on the utilization of proteins in each DN class under certain conditions instead of only at the genome level.
To calculate DN, two methods, including or excluding repeats of the same domain in a single polypeptide, were used. Similar results were obtained: in most CSPs, especially in multicellular organisms, single-domain proteins are underrepresented compared with their corresponding GEPs, whereas multidomain proteins are markedly overrepresented (fig. 1 and supplementary fig. S4, Supplementary Material online). These results indicate that genes encoding multidomain proteins account for a significantly higher proportion of expressed genes than of all genes encoded in the genome; that is, proteins with complex structures tend to be utilized preferentially in CSPs.
Multidomain proteins tend to form more complex PPI networks than single-domain proteins. PPIs are mainly mediated by protein–protein interacting domains (PPI domains) and also contribute to biological complexity (Wuchty 2001). As expected, we found that proteins containing multiple PPI domains are also significantly overrepresented in CSPs (supplementary fig. S5, Supplementary Material online), supporting the conclusion that proteins with complex structures tend to be utilized preferentially.
When comparing across CSPs, it is interesting that in the
data of H. sapiens, M. musculus, and D. melanogaster, the
general trend is more noticeable in more complex organs
or tissues, measured by the number of cell types, such as
those of the nervous system (Panman and Perlmann 2011),
whereas it is less noticeable or not significant in simple
samples, such as leukomonocytes in peripheral blood or
bone marrow (fig. 1, supplementary figs. S4 and S5, Supplementary Material online). When comparing across species,
the general trend is much stronger in multicellular organisms than unicellular ones, although the strength of the
trend decreased during evolution of multicellular eukaryotes (fig. 1, supplementary figs. S4–S6, Supplementary Material online). These findings further support the
conclusion that complex proteins tend to be utilized preferentially to realize biological complexity.
Proteins Containing Younger Domains Are
Significantly Underrepresented in CSPs
The origin time of a protein domain, that is, its evolutionary age, is one of its important inherent properties. During evolution, new domains appeared continuously, meeting new functional requirements, which directly contributed to the increase in biological complexity (Chothia et al. 2003; Vogel and Chothia 2006). In general, older domains provide common and essential functions, whereas younger domains provide newly evolved biological functions, which in turn specify the functional characteristics of the protein (Vinogradov 2004; Cohen-Gihon et al. 2005). Based on this consideration, we used the age of the youngest domain in one protein to define the domain age character of the protein. Consistent with the functional characteristics at the domain level, in multicellular organisms, proteins only containing older domains generally take part in critical and primitive cellular processes, whereas proteins containing younger domains mainly take part in multicellular organism–specific biological processes (supplementary fig. S7, Supplementary Material online).
As shown in figure 2, we found that in most CSPs, proteins only containing older domains are significantly overrepresented, whereas proteins containing younger domains are underrepresented. Notably, this trend is also stronger in multicellular organisms than in unicellular organisms. Interestingly, in the H. sapiens data, the general trend is stronger in leukomonocytes in peripheral blood or bone marrow (fig. 2A). We also found that in liver and kidney, the proteins only containing the oldest domains (Grade I) are more enriched than those of Grade II, which is not consistent with most tissues (fig. 2A and B). This inconsistency may relate to the primitive physiological function of liver and kidney. When comparing across species, the strength of overrepresentation is also much higher in multicellular species than in unicellular ones, although the strengths decreased during evolution of multicellular species (fig. 2, supplementary fig. S6, Supplementary Material online). Our findings indicate that proteins involved in older biological processes tend to be used more than proteins involved in newer biological processes.
Mechanisms Underlying the General Trends and
Their Evolutionary Effects
The next fundamental question is what mechanisms underlie the general trends. Biological systems tend to meet their basic functional requirements while expending as little energy as possible. Two constraints may therefore partly account for the formation of the general trends.
The first one is functional constraint. The trend may partly result from the functional traits of multidomain and only older domain–containing proteins. Previous studies have shown that multidomain proteins have critical roles and tend to be the linker proteins in the network (Koonin et al. 2002; Prochnik et al. 2010), and our analyses revealed that only older domain–containing proteins generally take part in critical and primitive cellular processes (supplementary fig. S7, Supplementary Material online). So, biological systems tend to utilize multidomain or only older domain–containing proteins preferentially to meet their fundamental functional requirements.
The second one is cost constraint. Transcription and translation are two slow and expensive processes (Izban and Luse 1992; Castillo-Davis et al. 2002; Wagner 2005). Thus, biological systems tend to reduce the number of expressed genes and to express genes with lower cost so as to save energy. Multidomain proteins generally are functionally pleiotropic (supplementary fig. S1, Supplementary Material online) and tend to be the linker proteins in the network (Koonin et al. 2002; Prochnik et al. 2010). So, preferentially utilizing multidomain proteins can form a more compact protein network, which requires fewer genes to be expressed. On the other hand, we found that only older domain–containing proteins tend to be short proteins (supplementary fig. S8, Supplementary Material online), except in C. elegans (similar to the previous investigation of the unexpected correlation between protein abundance and length; Duret and Mouchiroud 1999) and E. coli. The synthesis of short proteins needs lower energy cost (Eisenberg and Levanon 2003; Urrutia and Hurst 2003). So, to reduce the cost of gene expression, biological systems tend to utilize multidomain or only older domain–containing proteins preferentially.
Besides the over/underrepresentation trends, another
familiar aspect of gene expression regulation is its tissue
specificity. In fact, the universality of the trend of over/underrepresentation across CSPs is related to the proportion
of housekeeping/tissue-specific proteins in one category,
that is, those protein categories containing more housekeeping proteins tend to be widely overrepresented across
CSPs, whereas those categories of proteins containing more
tissue-specific proteins tend to be widely underrepresented
across CSPs. Our analyses found that multidomain or only older domain–containing proteins have a higher proportion of housekeeping proteins (supplementary figs. S9 and S10, Supplementary Material online), which is consistent with the results of the over- or underrepresentation analyses.
Next, we discuss the evolutionary effects of the general trends. Because domain sequences are relatively conserved across species, we presumed that protein domain characters are closely related to the evolutionary rates of proteins measured by the ratio of dN to dS (see the “Retrieving Orthology Relationships and Evolutionary Rate Measurements” section in Materials and Methods). We found that proteins without any known conserved domain sequences evolve at much higher rates than proteins containing known domains (fig. 3A–L). Among the proteins containing known domains, multidomain proteins tend to evolve more slowly than single-domain proteins (fig. 3A–F), and similarly, proteins only containing older domains tend to evolve more slowly than those containing younger domains (fig. 3G–L). Because we found that multidomain proteins or only older domain–containing proteins are significantly overrepresented in CSPs, we presumed that proteins with lower evolutionary rates may also be significantly overrepresented, and vice versa. This reasoning was confirmed by further analyses (supplementary fig. S11, Supplementary Material online). In general, dN/dS could represent the strength of purifying selection, that is, the lower the value of dN/dS, the greater the strength of purifying selection for the protein (Hurst 2002; Liao and Zhang 2006; Basu et al. 2008; Liao et al. 2010; Oldmeadow et al. 2010). Thus, we can conclude that multidomain or only older domain–containing proteins are subjected to stronger purifying selection. We have shown that multidomain proteins generally are more pleiotropic (supplementary fig. S1, Supplementary Material online), so this conclusion is consistent with the view of previous studies (Wang et al. 2010) that more pleiotropic genes have larger effects on trait values and are subjected to stronger selection.
Based on our findings and the view of purifying selection, the stronger selection on multidomain and only older domain–containing proteins may partly result from their general overrepresentation in CSPs. More generally, we can presume that proteins significantly overrepresented in CSPs tend to be subjected to stronger purifying selection. In this way, biological systems can avoid the harm resulting from mistranslation (Drummond and Wilke 2008) or other errors of proteins.
As a result of regulatory and evolutionary processes, obvious differences in the strength of the general trends across species and CSPs can be found (figs. 1 and 2 and supplementary figs. S4–S6, Supplementary Material online). The general trend is stronger in multicellular organisms than in unicellular organisms, and surprisingly, the strengths of the general trend decreased during the evolution of multicellular eukaryotes. This interesting phenomenon may have resulted from the increase of differentiation, which may have decreased the number of part types in specialized systems. This is similar to a previous test conducted at the cell level, in which the authors found a complexity drain on cells in the evolution of multicellularity (McShea 2002).
In addition, we estimated the effect of E. coli operons
(Gama-Castro et al. 2011) on the strength of the general
trends. We found that among the 737 operons containing
more than one gene, almost half consisted of genes
with different domain number and domain age characters
(for detailed data, see supplementary table S1, Supplementary Material online). This indicates that many genes with
different domain characters are turned on/off simultaneously because they reside in the same operon, which
may partly contribute to the reduced intensity in E. coli
of the trends found in this study.
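The operon check described above can be sketched as a simple count. This is only a minimal illustration, not the pipeline used in the study: the function name `mixed_operon_fraction`, the toy operon and gene identifiers, and the encoding of each gene's domain characters as a (domain number, domain age) label pair are all assumptions made for the example. It also assumes that an operon counts as mixed when its genes differ in either character.

```python
def mixed_operon_fraction(operon_genes, gene_characters):
    """Return the fraction of multi-gene operons whose genes differ in domain characters.

    operon_genes: dict mapping an operon ID to a list of gene IDs (hypothetical layout).
    gene_characters: dict mapping a gene ID to a (domain_number, domain_age) label pair.
    """
    multi_gene = 0
    mixed = 0
    for genes in operon_genes.values():
        if len(genes) < 2:
            continue  # single-gene operons cannot mix characters
        multi_gene += 1
        # An operon is "mixed" if its genes carry more than one distinct label pair.
        if len({gene_characters[g] for g in genes}) > 1:
            mixed += 1
    return mixed / multi_gene if multi_gene else 0.0


# Toy data (hypothetical identifiers, not taken from RegulonDB).
toy_operons = {
    "op1": ["g1", "g2"],
    "op2": ["g3", "g4", "g5"],
    "op3": ["g6"],
}
toy_characters = {
    "g1": ("multiple", "old"),
    "g2": ("single", "young"),
    "g3": ("single", "old"),
    "g4": ("single", "old"),
    "g5": ("single", "old"),
    "g6": ("single", "young"),
}
fraction = mixed_operon_fraction(toy_operons, toy_characters)
print(fraction)
```

In the toy input, one of the two multi-gene operons mixes domain characters, so the function reports one half. A real analysis would substitute operon membership and domain annotations for the toy dictionaries.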
Yang et al. · doi:10.1093/molbev/mss064
FIG. 3. The correlation between protein evolutionary rates and their domain characters. (A–F) Domain number character; (G–L) domain
age character. The ratio of nonsynonymous to synonymous distance (dN/dS) was used to represent the evolutionary rate of each protein (for
details, see Materials and Methods). The upper and lower quartiles are indicated by the upper and lower edges of the box, and the
median is indicated by a red bar in the box. The maximum whisker length is set to 1.5 times the interquartile range, which means points are drawn as outliers
(dotted individually outside the bars) if they are larger than q3 + 1.5 × (q3 − q1) (shown as the upper bar) or smaller than q1 − 1.5 × (q3 − q1)
(shown as the lower bar), where q1 and q3 are the 25th and 75th percentiles, respectively. The P values shown in the figure are from a Mann–
Whitney U test. Abbreviations in this figure: No, proteins without known conserved domains; Single, single-domain proteins; Multiple, multiple-domain proteins; Old, proteins only containing older domains (see the definition in Materials and Methods); and Young, proteins containing
younger domains.
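The outlier rule and the group comparison in this caption can be sketched with the Python standard library. This is an illustration under stated assumptions rather than the software used to draw the figure: the function names are hypothetical, the quartiles use the `inclusive` interpolation of `statistics.quantiles` (plotting packages may use slightly different conventions), and the Mann–Whitney P value comes from a plain normal approximation without tie or continuity correction.

```python
import math
import statistics


def outlier_fences(values, whisker=1.5):
    """Return (lower, upper) fences: q1 - w*(q3 - q1) and q3 + w*(q3 - q1)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr


def mann_whitney_u(x, y):
    """Return (U, two-sided p) for sample x versus y using a normal approximation.

    No tie or continuity correction is applied, so p is only approximate.
    """
    # U counts the (x, y) pairs in which the x value is larger; ties count as 0.5.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


# Toy dN/dS values (hypothetical, for illustration only).
multiple_domain = [0.05, 0.08, 0.10, 0.12, 0.15, 0.90]
single_domain = [0.20, 0.25, 0.30, 0.35, 0.40, 0.45]

low, high = outlier_fences(multiple_domain)
outliers = [v for v in multiple_domain if v < low or v > high]
u_stat, p_value = mann_whitney_u(multiple_domain, single_domain)
print(outliers, u_stat, p_value)
```

For real data with many ties or small groups, a dedicated statistics package with an exact or tie-corrected test would be the safer choice.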
Deep Understanding of the Causes of Biological
Complexity
The origin and evolution of biological complexity are old
and fundamental questions in biology (Oakley and Rivera
2008). Our study focused on a basic question: How are
structural factors preferentially utilized under certain conditions to contribute to biological complexity? We grouped
proteins into categories and analyzed each group as a whole
rather than individual genes/proteins as in previous studies (Lehner and Fraser 2004; Vinogradov 2004). Our major
finding is that the proportion of each group of proteins in
CSPs differs from that in GEP, all the proteins encoded by the
genome. In particular, multidomain proteins and proteins
containing only older domains are significantly overrepresented in various CSPs across species. Notably, this conclusion is not an artifact of relying
on one particular experimental technique, because all the conclusions remain valid when the analyses are based on data sets
acquired with other techniques (supplementary fig. S12,
Supplementary Material online).
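One common way to ask whether a protein group is overrepresented in a CSP relative to GEP is a one-sided hypergeometric test. The sketch below assumes that test purely for illustration; the statistical procedure actually used in the study is not restated in this section, and the function name, argument names, and counts are hypothetical.

```python
from math import comb


def enrichment_p_value(gep_total, gep_in_group, csp_total, csp_in_group):
    """Upper-tail hypergeometric probability P(X >= csp_in_group).

    Models the CSP as csp_total proteins drawn without replacement from a GEP of
    gep_total proteins, gep_in_group of which belong to the group of interest.
    """
    upper = min(gep_in_group, csp_total)
    total_ways = comb(gep_total, csp_total)
    # math.comb returns 0 when k > n, so impossible draws contribute nothing.
    tail_ways = sum(
        comb(gep_in_group, k) * comb(gep_total - gep_in_group, csp_total - k)
        for k in range(csp_in_group, upper + 1)
    )
    return tail_ways / total_ways


# Hypothetical counts: 1,000 of 4,000 genome-encoded proteins are multidomain,
# and 180 of the 500 proteins detected in one CSP are multidomain.
p = enrichment_p_value(4000, 1000, 500, 180)
print(p)
```

A small probability would indicate that the group is more frequent in the CSP than expected from its share of GEP. When many groups, species, or conditions are tested, the resulting P values would normally be adjusted for multiple testing.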
Our findings provide a deeper understanding of the
causes of biological complexity under certain conditions.
During evolution, gene duplication, divergence, and recombination all contribute to the diverse domain characters of
proteins (Chothia et al. 2003), a process driven by extremely
general mechanisms based on the preferential attachment
principle (Koonin et al. 2002). Both new domains and the
increasing proportion of multiple-domain proteins in GEP
directly increase biological complexity (Chothia et al.
2003; Vogel and Chothia 2006). However, they only provide
more potential executors of new and complex functions
that may contribute to biological complexity in higher
species; how these two potential causes
work in the formation of biological complexity under
certain conditions was not known. Our findings reveal that biological
complexity under certain conditions is realized more significantly
by diverse domain organization than by the emergence of new types of domains. We anticipate our analyses
to be a starting point for systematic studies of the rules
governing the utilization of the potential information stored in the
genome.
Supplementary Material
Supplementary methods, figures S1–S12, and table S1 are
available at Molecular Biology and Evolution online
(http://www.mbe.oxfordjournals.org/).
Acknowledgments
We acknowledge the Associate Editor (Prof. Todd Oakley
from the University of California) and the anonymous reviewers for their constructive comments and valuable suggestions. Sincere thanks are also due to Prof. Jun Qin
(Beijing Proteome Research Center, Beijing, China) for
valuable advice. This work was partially supported
by the Chinese State Key Projects for Basic Research (973
Program) (Nos. 2011CB910601, 2011CB910700,
2010CB912700, and 2011CB505304), the National Natural
Science Foundation of China (81000192, 81170378,
30972909, 81001470, 81170399, and 81010064), and the International Scientific Collaboration Program (2009DFB33070,
2010DFA31260, and 2011DFB30370).
References
Adami C, Ofria C, Collier TC. 2000. Evolution of biological
complexity. Proc Natl Acad Sci U S A. 97:4463–4468.
Alper H, Moxley J, Nevoigt E, Fink GR, Stephanopoulos G. 2006.
Engineering yeast transcription machinery for improved ethanol
tolerance and production. Science 314:1565–1568.
Barrett T, Troup DB, Wilhite SE, et al. (14 co-authors). 2009. NCBI
GEO: archive for high-throughput functional genomic data.
Nucleic Acids Res. 37:D885–D890.
Basu MK, Carmel L, Rogozin IB, Koonin EV. 2008. Evolution of protein
domain promiscuity in eukaryotes. Genome Res. 18:449–461.
Baugh LR, Demodena J, Sternberg PW. 2009. RNA Pol II accumulates
at promoters of growth genes during developmental arrest.
Science 324:92–94.
Baugh LR, Hill AA, Claggett JM, Hill-Harfe K, Wen JC, Slonim DK,
Brown EL, Hunter CP. 2005. The homeodomain protein PAL-1
specifies a lineage-specific regulatory network in the C. elegans
embryo. Development 132:1843–1854.
Benjamini Y, Hochberg Y. 1995. Controlling the false discovery
rate—a practical and powerful approach to multiple testing. J R
Stat Soc B. 57:289–300.
Cardinale CJ, Washburn RS, Tadigotla VR, Brown LM,
Gottesman ME, Nudler E. 2008. Termination factor Rho and
its cofactors NusA and NusG silence foreign DNA in E. coli.
Science 320:935–938.
Castillo-Davis CI, Mekhedov SL, Hartl DL, Koonin EV,
Kondrashov FA. 2002. Selection for short introns in highly
expressed genes. Nat Genet. 31:415–418.
Chen F, Mackey AJ, Stoeckert CJ Jr, Roos DS. 2006. OrthoMCL-DB:
querying a comprehensive multi-species collection of ortholog
groups. Nucleic Acids Res. 34:D363–D368.
Chintapalli VR, Wang J, Dow JA. 2007. Using FlyAtlas to identify
better Drosophila melanogaster models of human disease. Nat
Genet. 39:715–720.
Chothia C, Gough J, Vogel C, Teichmann SA. 2003. Evolution of the
protein repertoire. Science 300:1701–1703.
Cohen-Gihon I, Lancet D, Yanai I. 2005. Modular genes with
metazoan-specific domains have increased tissue specificity.
Trends Genet. 21:210–213.
Covert MW, Knight EM, Reed JL, Herrgard MJ, Palsson BO. 2004.
Integrating high-throughput and computational data elucidates
bacterial networks. Nature 429:92–96.
de Godoy LM, Olsen JV, Cox J, Nielsen ML, Hubner NC, Frohlich F,
Walther TC, Mann M. 2008. Comprehensive mass-spectrometry-based
proteome quantification of haploid versus diploid yeast.
Nature 455:1251–1254.
Doebeli M, Ispolatov I. 2010. Complexity and diversity. Science
328:494–497.
Drummond DA, Wilke CO. 2008. Mistranslation-induced protein
misfolding as a dominant constraint on coding-sequence
evolution. Cell 134:341–352.
Duret L, Mouchiroud D. 1999. Expression pattern and, surprisingly,
gene length shape codon usage in Caenorhabditis, Drosophila,
and Arabidopsis. Proc Natl Acad Sci U S A. 96:4482–4487.
Eisenberg E, Levanon EY. 2003. Human housekeeping genes are
compact. Trends Genet. 19:362–365.
Finn RD, Tate J, Mistry J, et al. (11 co-authors). 2008. The Pfam
protein families database. Nucleic Acids Res. 36:D281–D288.
Fox RM, Watson JD, Von Stetina SE, McDermott J, Brodigan TM,
Fukushige T, Krause M, Miller DM 3rd. 2007. The embryonic
muscle transcriptome of Caenorhabditis elegans. Genome Biol.
8:R188.
Gama-Castro S, Salgado H, Peralta-Gil M, et al. (28 co-authors).
2011. RegulonDB version 7.0: transcriptional regulation of
Escherichia coli K-12 integrated within genetic sensory response
units (Gensor Units). Nucleic Acids Res. 39:D98–D105.
Ge H, Wei M, Fabrizio P, Hu J, Cheng C, Longo VD, Li LM. 2010.
Comparative analyses of time-course gene expression profiles
of the long-lived sch9Delta mutant. Nucleic Acids Res.
38:143–158.
Hubbell E, Liu WM, Mei R. 2002. Robust estimators for expression
analysis. Bioinformatics 18:1585–1592.
Hurst LD. 2002. The Ka/Ks ratio: diagnosing the form of sequence
evolution. Trends Genet. 18(9):486.
Izban MG, Luse DS. 1992. Factor-stimulated RNA polymerase II
transcribes at physiological elongation rates on naked DNA but
very poorly on chromatin templates. J Biol Chem.
267:13647–13655.
Jaffe JD, Stange-Thomann N, Smith C, et al. (19 co-authors). 2004.
The complete genome and proteome of Mycoplasma mobile.
Genome Res. 14:1447–1461.
Jongeneel CV, Delorenzi M, Iseli C, et al. (11 co-authors). 2005. An
atlas of human gene expression from massively parallel signature
sequencing (MPSS). Genome Res. 15:1007–1014.
Koonin EV. 2009. Darwinian evolution in the light of genomics.
Nucleic Acids Res. 37:1011–1034.
Koonin EV, Wolf YI, Karev GP. 2002. The structure of the protein
universe and genome evolution. Nature 420:218–223.
Lehner B, Fraser AG. 2004. Protein domains enriched in mammalian
tissue-specific or widely expressed genes. Trends Genet.
20:468–472.
Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P. 2006. SMART
5: domains in the context of genomes and networks. Nucleic
Acids Res. 34:D257–D260.
Li D, Li JQ, Ouyang SG, Wang J, Xu XJ, Zhu YP, He FC. 2005. An
integrated strategy for functional analysis in large-scale
proteomic research by gene ontology. Prog Biochem Biophys.
32:1026–1029.
Liao BY, Weng MP, Zhang J. 2010. Contrasting genetic paths to
morphological and physiological evolution. Proc Natl Acad Sci
U S A. 107:7353–7358.
Liao BY, Zhang J. 2006. Low rates of expression profile divergence in
highly expressed genes and tissue-specific genes during
mammalian evolution. Mol Biol Evol. 23:1119–1128.
Liu ZY, Liu QJ, Sun HC, Hou L, Guo H, Zhu YP, Li D, He FC. 2011.
Evidence for the additions of clustered interacting nodes during
the evolution of protein interaction networks from network
motifs. BMC Evol Biol. 11:133.
Lynch M, Conery JS. 2003. The origins of genome complexity. Science
302:1401–1404.
Maglott D, Ostell J, Pruitt KD, Tatusova T. 2007. Entrez Gene: gene-centered
information at NCBI. Nucleic Acids Res. 35:D26–D31.
Marchler-Bauer A, Anderson JB, Chitsaz F, et al. (28 co-authors).
2009. CDD: specific functional annotation with the Conserved
Domain Database. Nucleic Acids Res. 37:D205–D210.
McShea DW. 2000. Functional complexity in organisms: parts as
proxies. Biol Philos. 15:28.
McShea DW. 2002. A complexity drain on cells in the evolution of
multicellularity. Evolution 56:441–452.
Mitchell A, Romano GH, Groisman B, Yona A, Dekel E, Kupiec M,
Dahan O, Pilpel Y. 2009. Adaptive prediction of environmental
changes by microorganisms. Nature 460:220–224.
Ng SK, Zhang Z, Tan SH, Lin K. 2003. InterDom: a database of
putative interacting protein domains for validating predicted
protein interactions and complexes. Nucleic Acids Res.
31:251–254.
Nilsen TW, Graveley BR. 2010. Expansion of the eukaryotic
proteome by alternative splicing. Nature 463:457–463.
Nobre LS, Al-Shahrour F, Dopazo J, Saraiva LM. 2009. Exploring the
antimicrobial action of a carbon monoxide-releasing compound
through whole-genome transcription profiling of Escherichia
coli. Microbiology 155:813–824.
Oakley TH, Rivera AS. 2008. Genomics and the evolutionary origins
of nervous system complexity. Curr Opin Genet Dev. 18:479–492.
Oldmeadow C, Mengersen K, Mattick JS, Keith JM. 2010. Multiple
evolutionary rate classes in animal genome evolution. Mol Biol
Evol. 27(4):942–953.
Panman L, Perlmann T. 2011. Tracing lineages to uncover neuronal
identity. BMC Biol. 9:51.
Peters LM, Belyantseva IA, Lagziel A, Battey JF, Friedman TB,
Morell RJ. 2007. Signatures from tissue-specific MPSS libraries
identify transcripts preferentially expressed in the mouse inner
ear. Genomics 89:197–206.
Prochnik SE, Umen J, Nedelcu AM, et al. (28 co-authors). 2010.
Genomic analysis of organismal complexity in the multicellular
green alga Volvox carteri. Science 329:223–226.
Qin X, Ahn S, Speed TP, Rubin GM. 2007. Global analyses of mRNA
translational control during early Drosophila embryogenesis.
Genome Biol. 8:R63.
Reigstad CS, Hultgren SJ, Gordon JI. 2007. Functional genomic
studies of uropathogenic Escherichia coli and host urothelial
cells when intracellular bacterial communities are assembled.
J Biol Chem. 282:21259–21267.
Rivera AS, Pankey MS, Plachetzki DC, Villacorta C, Syme AE, Serb JM,
Omilian AR, Oakley TH. 2010. Gene duplication and the origins
of morphological complexity in pancrustacean eyes, a genomic
approach. BMC Evol Biol. 10:123.
Su AI, Wiltshire T, Batalov S, et al. (13 co-authors). 2004. A gene
atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 101:6062–6067.
Szathmary E, Jordan F, Pal C. 2001. Molecular biology and evolution.
Can genes explain biological complexity? Science 292:1315–1316.
Troemel ER, Chu SW, Reinke V, Lee SS, Ausubel FM, Kim DH. 2006.
p38 MAPK regulates expression of immune response genes and
contributes to longevity in C. elegans. PLoS Genet. 2:e183.
Urrutia AO, Hurst LD. 2003. The signature of selection mediated by
expression on human genes. Genome Res. 13:2260–2264.
Vinogradov AE. 2004. Compactness of human housekeeping genes:
selection for economy or genomic design? Trends Genet.
20:248–253.
Vogel C, Chothia C. 2006. Protein family expansions and biological
complexity. PLoS Comput Biol. 2:e48.
Vogel C, Teichmann SA, Pereira-Leal J. 2005. The relationship
between domain duplication and recombination. J Mol Biol.
346:355–365.
Von Stetina SE, Watson JD, Fox RM, Olszewski KL, Spencer WC,
Roy PJ, Miller DM 3rd. 2007. Cell-specific microarray profiling
experiments reveal a comprehensive picture of gene expression
in the C. elegans nervous system. Genome Biol. 8:R135.
Wagner A. 2005. Energy constraints on the evolution of gene
expression. Mol Biol Evol. 22:1365–1374.
Wang Z, Liao BY, Zhang J. 2010. Genomic patterns of pleiotropy and
the evolution of complexity. Proc Natl Acad Sci U S A.
107:18034–18039.
Wolf YI, Novichkov PS, Karev GP, Koonin EV, Lipman DJ. 2009. The
universal distribution of evolutionary rates of genes and distinct
characteristics of eukaryotic genes of different apparent ages.
Proc Natl Acad Sci U S A. 106:7273–7280.
Wuchty S. 2001. Scale-free behavior in protein domain networks.
Mol Biol Evol. 18(9):1694–1702.
Yang Z. 1997. PAML: a program package for phylogenetic analysis by
maximum likelihood. Comput Appl Biosci. 13:555–556.