Protein Function, Connectivity, and Duplicability in Yeast

Anuphap Prachumwat* and Wen-Hsiung Li
*Committee on Genetics, University of Chicago; and Department of Ecology and Evolution, University of Chicago
Protein-protein interaction networks have evolved mainly through connectivity rewiring and gene duplication. However,
how protein function influences these processes and how a network grows in time have not been well studied. Using
protein-protein interaction data and genomic data from the budding yeast, we first examined whether there is a correlation
between the age and connectivity of yeast proteins. A steady increase in connectivity with protein age is observed for yeast
proteins except for those that can be traced back to Eubacteria. Second, we investigated whether protein connectivity and
duplicability vary with gene function. We found a higher average duplicability for proteins interacting with external
environments than for proteins localized within intracellular compartments. For example, proteins that function in the
cell periphery (mainly transporters) show a high duplicability but are lowly connected. Conversely, proteins that function within the nucleus (e.g., transcription, RNA and DNA metabolisms, and ribosome biogenesis and assembly) are
highly connected but have a low duplicability. Finally, we found a negative correlation between protein connectivity
and duplicability.
Introduction
Biological processes, which contribute to the phenotypes of living cells, are wired by interaction networks of
various cellular components such as proteins, DNA, RNA,
and metabolites. Such network data, especially protein-protein interactions in the budding yeast (Saccharomyces
cerevisiae), can now be generated in a high-throughput
manner, allowing large-scale analyses. We are interested
in the yeast protein interaction network that is organized,
similar to nonbiological networks, into a small world and a
scale-free topology (Barabasi and Oltvai 2004). A small
world has a high probability that any two neighbors of
a node are connected with each other, while a scale-free
topology shows a power-law distribution of node connectivities (for a review, see Barabasi and Oltvai 2004) and
contributes to a high tolerance to disturbance (Albert and
Barabasi 2000).
Barabasi and Albert (1999) proposed that growth of
a network with a preferential attachment behavior is sufficient to explain the emergence of a scale-free network topology. This model requires that a new node preferentially
connects to a well-connected node, predicting that old
nodes should tend to have a higher connectivity than young
ones. This prediction, however, was not supported by a
recent analysis of the yeast protein network by Kunin,
Pereira-Leal, and Ouzounis (2004), who therefore suggested that to understand the scale-free topology of the
protein network, protein function should also be taken
into account.
In this study, we use a larger set of data or a set of
better quality data than that of Kunin, Pereira-Leal, and
Ouzounis (2004) to re-examine the prediction of the preferential attachment model by checking whether a correlation
exists between the age and connectivity of yeast proteins.
We also investigate whether protein connectivity and gene
duplicability vary with gene function. Because yeast, which
is a single-cell organism, inhabits a wide range of environmental niches, genetic diversity for proteins that are exposed to or interact with extracellular environments may
confer benefits to the organism. As duplication may increase such diversity (or produce a new adaptive function,
e.g., Francino 2005), we hypothesize a higher duplicability
for proteins exposed to extracellular environments than for
those localized to intracellular compartments. Moreover,
because gene duplication plays a major role in network
growth (e.g., Barabasi and Albert 1999; Pastor-Satorras,
Smith, and Sole 2003) and conversely, connectivity may
affect gene duplicability, we investigate whether a relationship exists between protein connectivity and duplicability.
Key words: protein interaction network, protein connectivity, gene
duplicability, network evolution, protein localization.
Mol. Biol. Evol. 23(1):30–39. 2006
doi:10.1093/molbev/msi249
Advance Access publication August 24, 2005
© The Author 2005. Published by Oxford University Press on behalf of
the Society for Molecular Biology and Evolution. All rights reserved.
Materials and Methods
Protein-Protein Interaction Data
Protein-protein interaction pairs are collected from
various high-throughput experiments (Fromont-Racine
et al. 2000; Newman, Wolf, and Kim 2000; Uetz et al.
2000; Dress et al. 2001; Ito et al. 2001; Gavin et al.
2002; Ho et al. 2002; Tong et al. 2002) and databases
(Munich Information Center for Protein Sequences, Database of Interacting Proteins, Biomolecular Interaction
Network Database, and Yeast Protein Database). This collection (denoted by ALL_K) includes 5,015 proteins and
16,747 interactions. Because high-throughput interaction
data come with high false-positive rates, we also use
a set of high-confidence data (denoted by BaderSTD) from
Bader et al. (2004) that comprises 2,759 proteins and
5,785 interactions. Further, ‘‘true interactions’’ inferred
from many small-scale experiments are also considered
(denoted by SSE). Given that SSE is a small data set, we
combine it with BaderSTD to obtain a larger high-confidence
data set (denoted by SSBader). Descriptive statistics of
these data are shown in table 1. The connectivity (denoted
by k) of a protein in a network of interest is defined by the
number of interactions of the protein with other proteins in
that network. In addition to using the mean and median of
k as measures of connectivity for the proteins in a category
of interest, we also use the proportion of hubs in the category. We define a hub as a protein with $k \geq a$, where $a$ is
5 or 7 (the two cutoff points give similar results). We
show the results of analyses on SSBader and ALL_K
but not the results on other data sets because they are essentially the same.
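The definitions of connectivity and hubs given above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the toy interaction list are hypothetical, and skipping self-interactions is an assumption based on the phrase "with other proteins".

```python
from collections import defaultdict
from statistics import mean, median


def connectivity(interactions):
    """Return {protein: k}, where k is the number of distinct partners."""
    partners = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # assumption: self-interactions do not count toward k
        partners[a].add(b)
        partners[b].add(a)
    return {protein: len(others) for protein, others in partners.items()}


def hub_proportion(k_by_protein, proteins, cutoff=5):
    """Fraction of the given proteins with k >= cutoff (cutoff is 5 or 7 in the text)."""
    ks = [k_by_protein[p] for p in proteins if p in k_by_protein]
    return sum(k >= cutoff for k in ks) / len(ks)


# Hypothetical toy network; the protein names are placeholders.
toy = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F"),
       ("B", "C"), ("E", "F")]
k = connectivity(toy)
mean_k = mean(k.values())
median_k = median(k.values())
hubs = hub_proportion(k, k.keys(), cutoff=5)
```

In this toy network only protein A reaches the cutoff of 5, so one of the six proteins is a hub.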
Table 1
Descriptive Statistics of the Protein-Protein Interaction Data Sets Used in This Study

Data Set^a   Interactions   Proteins   Median k   Mean k   Degree Exponent^b
ALL_K        16,747         5,015      3          6.61     1.76
SSE          3,789          1,796      3          4.11     2.11
BaderSTD     5,785          2,759      3          4.19     2.02
SSBader      8,666          3,218      3          5.32     2.05

^a ALL_K denotes all data collected; BaderSTD denotes the high-confidence interaction data from Bader et al. (2004); and SSBader denotes a combined set of the SSE and BaderSTD data sets. Connectivity is denoted by k.
^b The degree exponent ($\gamma$) is calculated using the power-law model, $P(k) \propto k^{-\gamma}$, with a standard regression method in R.
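One common way to estimate a degree exponent is an ordinary least-squares fit of log P(k) against log k. The sketch below assumes that this is what "a standard regression method" refers to; the source does not say so explicitly, and the function name and synthetic sample are hypothetical.

```python
import math
from collections import Counter


def degree_exponent(k_values):
    """Estimate gamma in P(k) ~ k^(-gamma) by least squares on log-log axes.

    Assumption: an unweighted, unbinned regression; the original analysis
    may have used a different variant.
    """
    counts = Counter(k for k in k_values if k > 0)
    n = sum(counts.values())
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c / n) for c in counts.values()]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return -sxy / sxx  # the fitted slope is -gamma


# Synthetic sample whose counts are exactly 3600 / k^2, so gamma should be 2.
sample = []
for k_val in range(1, 7):
    sample.extend([k_val] * (3600 // k_val ** 2))
gamma = degree_exponent(sample)
```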
Classification of Proteins into Age Groups
For each yeast protein, we identified homologous proteins from other genomes that have been sequenced. These
homologous groups of yeast proteins were obtained from
KOG and COG (Tatusov et al. 2003), Inparanoid (O’Brien,
Remm, and Sonnhammer 2005), Génolevures (Dujon et al.
2004), Kellis et al. (2003), Cliften et al. (2003), and Kunin,
Pereira-Leal, and Ouzounis (2004). Although yeast proteins
can be assigned into 10 age categories (groups) by their
shared ancestral origins (10 lineages) from these orthologous groups (fig. 1), this categorization gives a small number of proteins for some categories. For statistical purposes,
we classify yeast proteins into five age categories (denoted
by I–V; fig. 1 and table 2); we exclude the 380 spurious
open reading frames (ORFs) defined by both Kellis et al.
(2003) and Ghaemmaghami et al. (2003).
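The age assignment can be read as "the oldest age group in which a homolog is found", with proteins lacking any homolog outside the Saccharomyces sensu-stricto species falling into group V (fig. 1). A minimal sketch under that reading follows; the function name and the input format (a set of age-group labels per protein) are hypothetical.

```python
# Age groups from oldest to youngest, as labeled in figure 1.
AGE_GROUPS = ["I", "II", "III", "IV", "V"]


def assign_age_group(groups_with_homologs):
    """Return the oldest age group in which a homolog was detected.

    Proteins with no detected homolog default to the youngest group (V),
    following the description of group V in the figure 1 legend.
    """
    for group in AGE_GROUPS:
        if group in groups_with_homologs:
            return group
    return "V"


example_old = assign_age_group({"III", "IV"})   # homologs in groups III and IV
example_orphan = assign_age_group(set())        # no homolog detected
```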
Identification of Duplicate and Singleton Genes
The whole set of S. cerevisiae protein sequences was
downloaded from SGD (http://www.yeastgenome.org/).
Duplicate genes were identified as described in Gu et al.
(2003) ($E < 10^{-10}$). A singleton was defined as a gene with
only one copy in the genome.
Protein Subcellular Localization and Biological Process
The protein localization profile for S. cerevisiae grown
in synthetic medium (downloaded from http://yeastgfp.ucsf.edu; Huh et al. 2003) is combined with subcellular localization defined by the Gene Ontology (GO) classification
(downloaded from SGD on April 5, 2005). Mislocalization
of some proteins from Huh et al. (2003) is corrected according to the authors’ supplementary data. The GO subcellular
localization categories are translated to the subcellular
localization categories of Huh et al. (2003) because GO
subcellular localizations are at a deeper level than those
from Huh et al. (2003) (e.g., GO distinguishes between
membrane and lumen of mitochondrion, while Huh et al.
[2003] does not). The GO extracellular category, which contains only a small number of proteins, is merged into
the cell periphery category. A protein is associated with more than
one localization category if it is found in multiple localizations (e.g., shuttle and transport proteins). Biological processes of each ORF are assigned according to the GO Slim
classification, which gives a high-level view of protein
functions (downloaded from SGD on April 5, 2005).
FIG. 1.—The evolutionary path leading to the yeast (Saccharomyces
cerevisiae) is shown in thick branches on this species tree. Yeast protein
age is inferred by the presence of an ortholog in other species. The oldest
age group includes yeast proteins that can be traced back to eubacterial
genomes, while the youngest one includes proteins with orthologs only
within the Saccharomyces sensu-stricto species or without any ortholog
in the other genomes. Horizontal dashed lines represent the age groups
and are numbered I, II, III, IV, and V. The tree is not drawn
to scale.
Measures of Gene Duplicability
Similar to Marland et al. (2004), for each category
(i.e., a subcellular localization category or a biological process) under study, the number of unique types of genes
is defined as the number of singletons plus the number
of duplicated gene types in that category. The number of
duplications per gene (n) is the total number of genes divided by the total number of unique types of genes. The
proportion of unduplicated genes (P) is the proportion of
singletons in the total number of unique types of genes.
While n roughly indicates the average number of paralogs
per gene in the category, $1 - P$ denotes the proportion of
gene types that have been duplicated. Both n and $1 - P$ can
be used as measures of gene duplicability (Yang, Lusk, and
Li 2003). In addition, we also consider the proportion of
duplicate genes in each category (Q). Q and n are less desirable than P because they can be strongly affected by the
presence of large gene families.
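The three measures follow directly from the counts defined above. The sketch below is illustrative only; the function name is hypothetical, and the example counts are the cell periphery and whole-data-set rows of table 4 (the number of duplicated gene types is obtained as unique types minus singletons).

```python
def duplicability(singletons, duplicates, duplicated_gene_types):
    """Return (n, P, Q) for one category.

    n: duplications per gene = total genes / unique types of genes
    P: proportion of unduplicated genes = singletons / unique types of genes
    Q: proportion of duplicate genes = duplicates / total genes
    """
    total_genes = singletons + duplicates
    unique_types = singletons + duplicated_gene_types
    n = total_genes / unique_types
    p = singletons / unique_types
    q = duplicates / total_genes
    return n, p, q


# Cell periphery (table 4): 141 singletons, 188 duplicates, 229 unique types.
n_cp, p_cp, q_cp = duplicability(141, 188, 229 - 141)

# Whole data set (table 4): 3,531 singletons, 1,429 duplicates, 4,098 unique types.
n_all, p_all, q_all = duplicability(3531, 1429, 4098 - 3531)
```

Rounded, these reproduce the tabulated values: n = 1.44, P = 61.57%, and Q = 57.14% for the cell periphery, against n = 1.21, P = 86.16%, and Q = 28.81% for the whole data set.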
Our statistical analyses are conducted in R (version
2.0.1, http://www.r-project.org/). The statistical tests used
are Fisher’s exact test and the Mann-Whitney test (also
called the Wilcoxon rank sum two-sample test), which,
in contrast to the parametric two-sample t test, is a nonparametric method replacing the protein connectivity data by
ranks, which reduces the influence of outliers. The test is
more appropriate than the t test because protein connectivities are not normally distributed.
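The effect of replacing connectivities by ranks can be seen in a small sketch of the Mann-Whitney U statistic. This is not the R routine used in the study; it computes only the statistic (no P value), and the function names and the two toy samples are hypothetical.

```python
def midranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for position in range(i, j + 1):
            ranks[order[position]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """U statistic of sample x: its rank sum minus the minimum possible rank sum."""
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    return rank_sum_x - len(x) * (len(x) + 1) / 2


# Hypothetical connectivities for two groups of proteins.
low = [1, 2, 2, 3, 4]
high = [3, 5, 6, 8, 40]
high_extreme = [3, 5, 6, 8, 400]  # same ranks, far larger outlier

u = mann_whitney_u(low, high)
u_extreme = mann_whitney_u(low, high_extreme)
```

Both calls give the same U because the outlier keeps the same rank whether its value is 40 or 400, whereas the group mean used by a t test would change substantially.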
Results
Origins of Proteins and Their Connectivity
To determine whether the connectivity (k) correlates
with the age of a protein, the mean and median k values
for each age group are obtained. It appears that young
proteins (e.g., those found in yeasts only) have a lower
mean k than those in the older age groups (e.g., Archaea and
Plasmodium-Plants-Animals) for both the complete data
set (ALL_K) and the high-confidence data set (SSBader)
(table 2 and fig. 2A and B). However, those proteins traceable to Eubacteria show a lower mean k and a slightly lower
median k than those in the Archaea group (table 2 and
fig. 2A and B). Further, the younger age groups have a
lower proportion of hubs than the older age groups, except
the Eubacteria, which shows a lower proportion of hubs
than the Archaea and the Plasmodium-Plants-Animals
(fig. 2C).

Table 2
Descriptive Statistics of Each Age Group for the Number of Proteins (t) and Median and Mean k Values

                                                    All Proteins^a   ALL_K^b                      SSBader^b
Group   Description                                 t                t       Median k   Mean k    t       Median k   Mean k
I       Eubacteria                                  2,275            1,801   3          6.72      1,243   3          5.07
II      Archaea                                     403              370     5          10.64     293     5          7.12
III     Plasmodium-Plants-Animals                   1,648            1,314   4          7.89      979     4          6.14
IV      Microspora–S. pombe–Saccharomyces complex   1,150            917     3          5.38      574     3          4.21
V       Saccharomyces sensu-stricto complex         270              193     2          3.40      74      1          2.42
        Total set                                   5,746            4,595   3          6.96      3,163   3          5.33

^a ‘‘All proteins’’ are the protein data set collected for the classification of proteins into age categories, which is described in text and figure 1.
^b The data sets are described in table 1.
Performing the Mann-Whitney test on these data, we
first ask whether two adjacent age groups have different
connectivities. The test shows that the Eubacteria age group
has a significantly lower k than Archaea in both data sets
($P < 5 \times 10^{-8}$; fig. 2A and B). The Archaea age group
has a significantly higher k than the Plasmodium-Plants-Animals group in ALL_K ($P = 2 \times 10^{-4}$), though the level of significance is lower in SSBader ($P = 0.068$). Second, we
pick an age group as a pivot group and perform two tests:
(1) between this pivot group and the older proteins and (2)
between the pivot group and the younger proteins. The tests
reveal that the Eubacteria group ‘‘does not’’ show a different
k from the rest of the proteins in the network. The other
groups show a significantly different k from their older
and/or younger counterparts ($P \leq 0.006$; data not shown).
Clearly, the oldest proteins (the Eubacteria group) do not
have the highest k in the protein network, and for this reason
there is no positive correlation between connectivity and
age. However, a significant correlation is seen when the
Eubacteria group is excluded.
Protein Function and Connectivity
In the following analysis, we consider protein localization and perform the Mann-Whitney test on both data
sets; although we show only the results for SSBader, a similar pattern is observed for ALL_K. Note that the mean k
values for the proteins localized to nucleus and nucleolus
are 6.85 and 8.81, respectively, which are significantly
higher than the mean k (5.33) for the whole network
($P < 5 \times 10^{-6}$, table 3). Some other localization categories
such as cytoplasm, mitochondrion, cell periphery, and
endoplasmic reticulum show a significantly lower k than
the other proteins ($P < 0.003$, table 3).
Similarly, when biological processes are considered,
proteins involved in protein biosynthesis and catabolism,
ribosome biogenesis and assembly, DNA and RNA metabolisms, and transcription show a significantly higher k than
the proteins involved in other biological processes (mean
and median k are greater than 5.33 and 3, respectively;
$P < 5 \times 10^{-6}$, table 3). Although proteins involved in lipid,
carbohydrate, and amino acid metabolisms and cellular respiration show a significantly lower k than the average in
SSBader (table 3), only lipid metabolism proteins show
a significantly lower k in ALL_K; nonetheless, the proteins
in the other three categories still have a low k (data not
shown).
Protein Function Versus Connectivity Within
the Same Age Group
It is interesting to ask whether within the same age
group the function of a protein affects its connectivity.
To answer this question, we categorize proteins by their localization or biological processes for each protein age group
and perform the Mann-Whitney test between a functional
group of interest and the rest within the same age group
(only mean k values for functional categories are shown
in Supplementary Fig. 1S, Supplementary Material online).
Proteins localized to nucleus and nucleolus show a significantly higher k in the Eubacteria and Archaea age groups;
proteins localized to nucleus also show a significantly higher
k in the Plasmodium-Plants-Animals and Microspora–
Schizosaccharomyces pombe–Saccharomyces complex
groups ($P < 7 \times 10^{-4}$). For biological processes, proteins
involved in ribosome biogenesis and assembly, RNA metabolism, and protein catabolism are significantly more highly
connected than those with other functions in the Eubacteria, Archaea,
and Plasmodium-Plants-Animals groups. Although the
younger age groups (IV and V) largely do not show a significant
difference in connectivity among biological process categories (probably because of small sample sizes), proteins involved in transcription show a significantly higher k than
those in the other biological processes in the Microspora–Schizosaccharomyces pombe–Saccharomyces complex age group. Proteins involved in carbohydrate and
amino acid and derivative metabolisms show a significantly
lower k than other proteins in the Eubacteria group, while
proteins involved in cell wall and membrane organization
and biogenesis are lowly connected in the Microspora–
S. pombe–Saccharomyces complex group.
FIG. 2.—The patterns of connectivity (k) for each age group in the
ALL_K (A) and the SSBader (B) are represented by mean (black bars)
and median (white bars) k. The P values from the Mann-Whitney test performed between the adjacent age groups are indicated under the graphs.
The bar marked by * indicates a P value of 0.068. (C) The proportion
of hubs (proteins with $k \geq 5$) among proteins in the same age group also
indicates a level of connectivity for each age group. Similar patterns are
observed for both the ALL_K and the SSBader, but only the ALL_K is
shown. The P values from Fisher’s exact test performed between the
adjacent age groups are indicated under the graph.
Table 3
Descriptive Statistics for the Number of Proteins (t) and Mean
and Median k in the SSBader Data Set When Categorized
by Subcellular Localization and Biological Process
Functional Category    Mean k    Median k    t    P value^a
Subcellular localization
Nucleolus
Actin
Golgi
Nucleus
Bud
Spindle pole
Cytoplasmic vesicle
Mitochondrion
Cytoplasm
Peroxisome
Lipid particle
Cell periphery
Vacuole
Microtubule
Endoplasmic recticulum
Endosome
8.81
8.18
7.31
6.85
5.78
5.60
5.39
5.08
4.83
4.59
4.17
4.12
4.01
3.77
3.70
3.63
7
5
6.5
4
4
5
4
3
3
3
4
3
3
3.5
2
3
160
44
36
1,251
107
55
157
314
1,273
34
6
139
81
26
123
35
****
*
*
****
9.48
7
135
****
8.83
8.11
8.10
7.68
7.50
7
5
6
6
5
266
71
251
31
272
****
**
****
*
***
6.54
6
41
*
6.31
6.09
5.76
5.64
5.63
5.44
5.31
5.13
4
4
3
3
4
3
4
4
35
119
63
203
236
113
363
88
5.12
4.71
3
3
86
86
4.71
3
103
4.00
3.73
4
2
13
56
*
3.50
3.17
3.07
2.95
2.83
2.64
2
2
2.5
2
2
2
26
58
28
22
66
69
*
**
*
*
***
****
2.54
2
35
**
Biological process
Ribosome biogenesis and
assembly
RNA metabolism
Cell budding and cytokinesis
Transcription
Pseudohyphal growth
Protein biosynthesis and
catabolism
Nuclear organization and
biogenesis
Conjugation
Cell cycle
Signal transduction
Protein modification
DNA metabolism
Response to stress
Transport
Cytoskeleton organization
and biogenesis
Meiosis
Cell wall and membrane
organization and
biogenesis
Organelle organization and
biogenesis
Morphogenesis
Generation of precursor
metabolites and energy
Cell homeostasis
Lipid metabolism
Sporulation
Vitamin metabolism
Carbohydrate metabolism
Amino acid and derivative
metabolism
Cellular respiration
***
****
**
***
**
^a The P values less than 0.05 from the Mann-Whitney test performed between a category of interest and other categories combined are indicated: *$0.005 < P < 0.05$, **$5 \times 10^{-4} < P \leq 0.005$, ***$5 \times 10^{-6} < P \leq 5 \times 10^{-4}$, and ****$P \leq 5 \times 10^{-6}$. The italicized category names indicate a significant difference between that category and the average of all categories after Bonferroni correction.
Protein Function and Duplicability
We investigate the proportion of unduplicated genes
(P) for each localization category. A low P value indicates
a high duplicability. The P values are significantly lower in
cell periphery, bud, and vacuole categories but significantly
higher in nucleus and nucleolus ($P < 0.003$, table 4); all
tests for this section are Fisher’s exact test. The categories
with a significantly lower P value have a higher proportion
of duplicate genes (Q) than that of the whole genome and
vice versa ($P < 0.003$, table 4). A significantly different
duplicability in cytoplasm (higher) and spindle pole (lower)
from average is indicated by Q. The significantly high duplicability in cell periphery is also revealed by the number of
duplications per gene ($n = 1.44$; $n = 1.21$ for the whole-genome average). Similarly, the n values are relatively low
(between 1.03 and 1.11) for mitochondrion, nucleus, nucleolus, and spindle pole.

Table 4
Duplication Patterns of Proteins Localized to 16 Subcellular Compartment Categories as
Measured by the Proportion of Duplicates (Q) and the Proportion of Unduplicated Genes (P)

Subcellular Compartment^a   Singletons   Duplicates   Total Genes   Q (%)     Unique Types of Genes^b   P^c (%)
Cell periphery              141          188          329           57.14^d   229                       61.57^d
Bud                         81           56           137           40.88^d   117                       69.23^d
Vacuole                     134          94           228           41.23^d   192                       69.79^d
Peroxisome                  32           15           47            31.91     45                        71.11
Lipid particle              16           7            23            30.43     22                        72.73
Cytoplasmic vesicle         160          63           223           28.25     205                       78.05
Golgi                       47           18           65            27.69     59                        79.66
Cytoplasm                   1,320        641          1,961         32.69^d   1,657                     79.66
Actin                       33           11           44            25.00     41                        80.49
Mitochondrion               515          170          685           24.82^e   623                       82.66
Endoplasmic reticulum       259          82           341           24.05^e   312                       83.01
Endosome                    42           10           52            19.23     49                        85.71
Microtubule                 25           7            32            21.88     29                        86.21
Nucleus                     1,339        368          1,707         21.56^f   1,531                     87.46^f
Nucleolus                   166          39           205           19.02^f   185                       89.73^f
Spindle pole                59           8            67            11.94^f   65                        90.77
The whole data set          3,531        1,429        4,960         28.81     4,098                     86.16

^a The subcellular compartments are ordered by P.
^b Unique types of genes = singletons + duplicated gene types, where duplicated gene types (duplication groups) are defined as the number of gene type families of proteins that are localized to such a subcellular compartment.
^c Proportion of unduplicated genes (P) = singletons/unique types of genes.
^d Significantly above average; $P < 0.003$, Fisher’s exact test.
^e $0.003 \leq P < 0.05$, Fisher’s exact test.
^f Significantly below average; $P < 0.003$, Fisher’s exact test.
When biological processes are considered, we find that
approximately 1/4 of yeast proteins are uncharacterized. Among the
remaining proteins, duplicates in carbohydrate metabolism,
generation of precursor metabolites and energy, protein biosynthesis and catabolism, transport, and response to stress
are significantly overrepresented, whereas in DNA metabolism, RNA metabolism, transcription, and ribosome
biogenesis and assembly, duplicates are significantly underrepresented ($P < 0.002$, table 5). Among all proteins annotated with their biological processes, those involved in
transport, protein biosynthesis and catabolism, RNA
metabolism, transcription, protein modification, and
DNA metabolism are among the most highly represented (between 7% and 17%). Relative to the whole-proteome average,
these categories show either a high or a low number of duplicates (table 5). Generally speaking, low P values are supported by high Q values. Duplicates in the unknown
biological process category, however, are significantly underrepresented ($P < 0.002$).
Protein Connectivity and Duplicability
Figure 3A shows that P is positively correlated with
both mean and median k for biological processes ($R^2 = 0.35$
and 0.45 for mean and median k, respectively, $P < 0.002$).
A similar pattern is also observed when we consider
only significant categories from table 3 ($R^2 = 0.66$
and 0.79) or table 5 ($R^2 = 0.74$ and 0.83 for mean and
median k, respectively, all $P < 0.008$). Moreover, this pattern is also found when the proportion of hubs is used as
a measure of connectivity ($R^2 = 0.43$, $P = 0.0001$; fig.
3B). In addition, we observe essentially the same results
when using protein localization categories and/or the Q values (data not shown). Furthermore, duplicabilities are, on average,
approximately 8% higher in the nonhub proteins than in
the hub proteins (P = 79% and 88% and Q = 30% and
22% for the nonhubs and hubs, respectively, $P < 1 \times 10^{-6}$).
This pattern suggests that proteins with a lower connectivity have, on average, a higher gene duplicability.
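The kind of category-level relationship described here can be illustrated with a simple least-squares fit of P against mean k. The sketch below is not a reproduction of figure 3: it pairs only six biological processes, using mean k values read from table 3 (SSBader) and P values from table 5, and the choice of this subset and the function name are assumptions made for illustration.

```python
def linear_fit_r2(xs, ys):
    """Return (slope, R^2) of an ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# (mean k from table 3, P in % from table 5) for an illustrative subset.
categories = {
    "Ribosome biogenesis and assembly": (9.48, 94.37),
    "RNA metabolism": (8.83, 93.62),
    "Transcription": (8.10, 90.04),
    "Transport": (5.31, 81.62),
    "Carbohydrate metabolism": (2.83, 62.34),
    "Amino acid and derivative metabolism": (2.64, 71.70),
}
mean_k_values = [v[0] for v in categories.values()]
p_values = [v[1] for v in categories.values()]
slope, r_squared = linear_fit_r2(mean_k_values, p_values)
```

A positive slope means that categories with higher connectivity tend to have a larger proportion of unduplicated genes, which is the direction of the trend reported in the text.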
A summary of protein connectivity and gene duplicability of nuclear, cytoplasmic, and external and cell peripheral proteins is shown in table 6. In general, nuclear proteins
are highly connected but show a low duplicability, while
those external and cell peripheral ones show a high duplicability but are lowly connected. The connectivity and gene
duplicability of cytoplasmic proteins are between those of
the nuclear and the external and cell peripheral proteins.
Discussion
Our finding that proteins in the oldest group (the
Eubacteria group) do not exhibit higher connectivities (k)
than proteins in the Archaea and Plasmodium-Plants-Animals groups is similar to that of Kunin, Pereira-Leal,
and Ouzounis (2004). However, the connectivities of the
pre-Eukaryotes group (the union of the Eubacteria and
Archaea) are, on average, only slightly lower than those
of the Plasmodium-Plants-Animals group (i.e., the Crown-Eukaryotes in the study of Kunin, Pereira-Leal, and Ouzounis
[2004]). Moreover, proteins in the Archaea age group show
a significantly higher k than those in the Plasmodium-Plants-Animals age group (fig. 2). Thus, only the Eubacteria group contradicts the prediction of the preferential
attachment model, and actually a positive correlation
between age and k is seen when the Eubacteria group
is excluded (table 2 and fig. 2).

Table 5
Distribution of Duplicates for Each Biological Process That Is Defined According to
GO Slim Classification

Biological Process^a                                   Singletons   Duplicates   Total Genes   Q (%)     Unique Types of Genes^b   P^c (%)
Generation of precursor metabolites and energy         41           46           87            52.87^d   66                        62.12^d
Carbohydrate metabolism                                48           55           103           53.40^d   77                        62.34^d
Cell homeostasis                                       28           22           50            44.00^e   42                        66.67^e
Response to stress                                     99           81           180           45.00^d   148                       66.89^d
Lipid metabolism                                       76           55           131           41.98^e   109                       69.72^e
Pseudohyphal growth                                    30           18           48            37.50     43                        69.77
Amino acid and derivative metabolism                   76           50           126           39.68^e   106                       71.70^e
Vitamin metabolism                                     28           21           49            42.86     39                        71.79
Signal transduction                                    50           34           84            40.48     69                        72.46
Sporulation                                            47           19           66            28.79     64                        73.44
Protein biosynthesis and catabolism                    268          186          454           40.97^d   358                       74.86^d
Cell wall and membrane organization and biogenesis     98           61           159           38.36     130                       75.38
Cell budding and cytokinesis                           58           32           90            35.56     76                        76.32
Cellular respiration                                   50           20           70            28.57     64                        78.13
Conjugation                                            42           15           57            26.32     52                        80.77
Cell cycle                                             102          37           139           26.62     125                       81.60
Transport                                              413          242          655           36.95^d   506                       81.62
Cytoskeleton organization and biogenesis               77           29           106           27.36     94                        81.91
Morphogenesis                                          14           4            18            22.22     17                        82.35
Meiosis                                                89           26           115           22.61     106                       83.96
Organelle organization and biogenesis                  121          31           152           20.39^e   142                       85.21
Protein modification                                   227          77           304           25.33^e   261                       86.97^e
Nuclear organization and biogenesis                    38           9            47            19.15     43                        88.37
DNA metabolism                                         246          61           307           19.87^f   276                       89.13^f
Transcription                                          253          39           292           13.36^f   281                       90.04^f
RNA metabolism                                         279          50           329           15.20^f   298                       93.62^f
Ribosome biogenesis and assembly                       134          23           157           14.65^f   142                       94.37^f

^a The average proportion of duplicates (Q) and of unduplicated genes (P) are 28.71% and 87.02%, respectively. The biological processes are ordered by P.
^b Unique types of genes = singletons + duplicated gene types, where duplicated gene types (duplication groups) are defined as the number of gene type families in such a biological process category.
^c Proportion of unduplicated genes (P) = singletons/unique types of gene.
^d Significantly above average; $P < 0.002$, Fisher’s exact test.
^e $0.002 \leq P < 0.05$, Fisher’s exact test.
^f Significantly below average; $P < 0.002$, Fisher’s exact test.
The higher protein connectivity for the Archaea and
Plasmodium-Plants-Animals age groups than the Eubacteria group could be due to connection gains through new
gene creation (e.g., gene duplication or gene fusion). Possibly, during the early evolution of eukaryotic cells whose
nucleus evolved from Archaea, proteins for eukaryotic cell
formation might have increased in number, and some became
hubs for such functional modules (e.g., fig. 2C). Moreover,
domain shuffling and length extension (which increase protein
complexity) of proteins in the Archaea and Plasmodium-Plants-Animals groups could have created new connections for these proteins.
A constraint by gene function may influence protein
network evolution (Kunin, Pereira-Leal, and Ouzounis
2004). To investigate this, we defined protein function
by both localization and biological processes according
to the GO annotation. Because localization partly determines the function of a protein, a combination of localization and biological process increases confidence in our
function classification. Proteins involved in transcription,
RNA metabolism, protein biosynthesis and catabolism,
and ribosome biogenesis and assembly tend to be highly connected. Although the majority of our results are consistent
with those reported by Kunin, Pereira-Leal, and Ouzounis
(2004), translational proteins (e.g., protein biosynthesis
and catabolism) are highly connected, contrary to their finding. In support of our observation, the majority of these
proteins localized to nucleus and nucleolus are highly
connected. On the other hand, proteins localized to cell
periphery and vacuole are lowly connected (tables 3 and 6).
It appears that protein function affects connectivity
across protein age groups (see ‘‘Protein Function Versus
Connectivity Within the Same Age Group’’). This pattern,
however, may have resulted from the emergence time of
these highly connected protein functions because proteins
that emerged in the same evolutionary period tend to interact
with one another (Qin et al. 2003), and proteins with similar
functions are likely clustered (von Mering et al. 2002). We
find that the emergence time of a protein contributes partly to
the high k for ‘‘only’’ some gene functions. For example,
transport and RNA metabolism categories have comparable
numbers of proteins (and prevalently emerged) in the Eubacteria and Plasmodium-Plants-Animals age groups, but
transport proteins are not highly connected (Supplementary
Table 1S and Fig. 1SB, Supplementary Material online).
Biological processes with proteins that largely emerged
in the Eubacteria group (e.g., carbohydrate, amino acid
and derivative metabolisms, and generation of precursor
metabolites and energy) are also relatively lowly connected
(Supplementary Table 1S and Fig. 1SB, Supplementary Material online). Likewise, proteins localized in cell
periphery, cytoplasm, endoplasmic reticulum, nucleus,
and nucleolus largely emerged in the Eubacteria and
Plasmodium-Plants-Animals age groups, but only those
localized in nucleus and nucleolus are coincidentally highly
connected (Supplementary Table 1S and Fig. 1SA, Supplementary Material online). This finding supports the view of
Kunin, Pereira-Leal, and Ouzounis (2004) that age alone is
not sufficient to explain the observed connectivities of proteins and that protein function also needs to be considered.
Importantly, evidence that for almost all of the function
categories proteins in the Eubacteria group show a lower k
than those in the Archaea and Plasmodium-Plants-Animals
groups (Supplementary Fig. 1S, Supplementary Material
online) confirms our previous finding.
FIG. 3.—A positive correlation between connectivity (k) and proportion of unduplicated proteins (P) for the biological process classification.
Because a similar trend is observed for median k, only (A) mean k and (B)
the proportion of hubs ($k \geq 5$) are shown. The trend lines are provided for
visualization only.
The observed patterns of gene duplication suggest that
duplicate genes in the yeast are unequally represented in
both subcellular localization and biological process categorizations (tables 4–6). A higher duplicability is observed
for proteins localized to cell periphery, bud, vacuole, and
cytoplasm and for proteins involved in transport, carbohydrate metabolisms, protein biosynthesis and catabolism,
response to stress, and generation of precursor metabolites
and energy, but not for proteins in other subcellular compartments or biological processes. Some functions such as
transcription, DNA and RNA metabolisms, and ribosome
biogenesis and assembly have a low duplicability. From
these observations, we suggest that gene function is a major
determinant of gene duplicability in S. cerevisiae.
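The per-category quantities compared in this section (mean k, the proportion of hubs, the proportion of unduplicated proteins P, and duplicability 1 − P) can be illustrated with a minimal sketch. The records, the function names, and the hub cutoff (k ≥ 5 here) are assumptions made for illustration only; they are not the data or the procedure used in this study.

```python
from collections import defaultdict
from statistics import mean

HUB_THRESHOLD = 5  # assumed cutoff for calling a protein a hub

# Hypothetical records: (functional category, connectivity k, has a duplicate)
PROTEINS = [
    ("transport", 2, True),
    ("transport", 3, True),
    ("transport", 6, False),
    ("transcription", 9, False),
    ("transcription", 7, False),
    ("transcription", 4, True),
    ("RNA metabolism", 12, False),
    ("RNA metabolism", 8, False),
]


def summarize(records, hub_threshold=HUB_THRESHOLD):
    """Return mean k, hub fraction, P, and 1 - P for each category."""
    by_category = defaultdict(list)
    for category, k, duplicated in records:
        by_category[category].append((k, duplicated))
    summary = {}
    for category, rows in by_category.items():
        ks = [k for k, _ in rows]
        n = len(rows)
        n_dup = sum(1 for _, duplicated in rows if duplicated)
        summary[category] = {
            "mean_k": mean(ks),
            "hub_fraction": sum(1 for k in ks if k >= hub_threshold) / n,
            "P": (n - n_dup) / n,          # proportion of unduplicated proteins
            "duplicability": n_dup / n,    # 1 - P
        }
    return summary


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


summary = summarize(PROTEINS)
r = pearson([s["mean_k"] for s in summary.values()],
            [s["P"] for s in summary.values()])
```

In this toy example, r is positive: categories with a higher mean k have a larger proportion of unduplicated proteins, which is the same as a negative relationship between connectivity and duplicability.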
Duplicate genes of some functions may not have
a good chance to confer selective advantages, leading to
a low gene duplicability. Proteins involved in transcription,
DNA and RNA metabolisms, and ribosome biogenesis and
assembly may face such a constraint. For example,
duplication of a global transcription regulator likely affects
many downstream genes, presumably being deleterious in
the majority of cases and leaving the duplicate a slim chance of survival.
These functions (e.g., ribosome biogenesis
and assembly) may also be constrained by the dosage balance of protein complexes
(Papp, Pal, and Hurst 2003; Yang,
Lusk, and Li 2003). However, other factors may also affect gene
duplicability, because the proportion of transcription
proteins is higher in multicellular organisms than in yeast (Babu et al.
2004). Moreover, the observation that yeast duplicate genes,
especially those retained from the whole-genome duplication, tend to have a higher
gene complexity (measured by protein length, number of domains, or number of
cis-regulatory elements) than other genes has led to the conclusion that gene
complexity may contribute to duplicate retention (He
and Zhang 2005). However, analyzing protein length in
our data set, we find that in approximately half of the functional categories
duplicates are longer than singletons, and
in a few of these cases the difference is statistically significant (data not shown).
Our results (table 6) support the hypothesis that
a higher duplicability for proteins interacting with fluctuating external environments
may confer benefits to the organism. For example, in yeast, nutrient capture through the cell
periphery is the first stage of cell growth, and so the chance
that duplication of a gene in this process is beneficial is
high. A high duplicability for proteins localized to the cell periphery is also seen
in fruit fly, nematode, mouse, and
humans (unpublished data), along with an increase in the
total numbers of these proteins from yeast to nematode
and fruit fly (Hazkani-Covo et al. 2004). Moreover, the majority of highly duplicated
genes in bacterial or multicellular eukaryotic genomes encode various types of membrane
or secreted proteins such as membrane transporters, receptors, and secreted signaling
molecules (Kondrashov et al.
2002). Together, these results support a higher duplicability
for proteins that interact with external environments.
Living in habitats where nutrients are often scarce, yeasts inevitably compete among themselves or with other species
for limited nutrients. Therefore, duplication of a transport
Table 6
A Summary of Protein Connectivity (k) and Gene Duplicability (1 − P) for Nuclear, Cytoplasmic, and External and
Cell Peripheral Proteins Categorized by Functions
Column groups: Nuclear Proteins; Cytoplasmic Proteins; External and Cell Peripheral Proteins
Columns in each group: Functions; k (a); 1 − P (b)
6.9
8.8
12.5
10.3
Cytoplasm
Cytoplasmic vesicle
4.8
5.3
20.3
21.9
Cell periphery
4.1
38.4
3.2
7.6
39.4
33.7
2.8
5.1
32.7
33.3
7.6
10.8
5.9
7.2
5.9
8.2
8.2
9.4
9.7
25.0
23.9
13.2
10.7
10.5
10.0
9.6
5.8
5.6
6.6
5.9
5.8
6.8
4.9
4.5
6.9
8.4
9.5
20.0
40.8
20.2
20.0
12.1
23.2
16.8
13.5
9.8
Localizations (c)
Nucleus
Nucleolus
Biological processes (d)
Carbohydrate metabolism
Cell wall and membrane organization and biogenesis
Signal transduction
Protein biosynthesis and catabolism
Transport
Nuclear organization and biogenesis
DNA metabolism
Protein modification
Transcription
RNA metabolism
Ribosome biogenesis and assembly
3.8
4.4
3.1
4.0
39.1
66.7
53.8
60.0
(a) Protein connectivity is represented by mean k.
(b) Gene duplicability is represented by 1 − P. An empty cell in both k and 1 − P columns indicates that data are not available or that the total number of proteins is too small.
(c) Localization categories differ among columns as indicated.
(d) Biological processes are the same for all columns in the same row and are indicated only on the rows in the first column. These proteins are localized to the localization
categories indicated above.
protein may be advantageous because it increases the efficiency of nutrient uptake.
Similarly, substrate transport
between subcellular compartments, or into or out of the
cell, is a basic requirement of eukaryotic cells. In addition to
nutrient uptake, yeast transporters play diverse roles such as
drug resistance, salt tolerance, control of cell volume, efflux
of undesirable metabolites, and sensing of extracellular
nutrients (Van Belle and Andre 2001). A high duplicability
of transport proteins is also observed in bacterial genomes
(Gevers et al. 2004). Therefore, duplication of such a protein may increase the chance
of functional specialization or
diversification.
Using transporter subfamilies characterized phylogenetically (De Hertogh et al. 2002),
we find a unique set
of transporters in the mitochondrion but a shared set between
the cell periphery and the vacuole. In the cell periphery and vacuole,
three subfamilies are present in high numbers: the yeast
amino acid transporters (YATs), the drug H+ antiporters
(DHAs), and the sugar porters (SPs). In particular, the
DHAs directly interact with, and protect the cell from, a number
of extracellular compounds that are growth inhibitory or unusual in natural
environments (Sá-Correia and Tenreiro
2002). Most DHAs are characterized as nonessential because of their functional
redundancy and overlapping specificity (Rogers et al. 2001; Giaever et al. 2002). Furthermore,
these genes are only activated by environmental stress factors. In general, DHAs and a
large number of YATs and SPs
are undetected under normal growth conditions. The SPs
are usually involved in the first step in carbohydrate metabolism after di- and
trisaccharides are hydrolyzed outside the
cell. Therefore, the variability and efficiencies of transporters directly affect the
metabolic and growth rates of yeast.
Furthermore, a high duplicability in yeast metabolism,
especially in the central metabolism and upstream of the
central metabolism pathways, has been observed (Marland
et al. 2004).
Although recent evidence for the prevalence of partial
duplication of yeast protein complexes (i.e., a large fraction of protein complexes
with a strong homology to others)
lends support to functional specialization (Pereira-Leal
and Teichmann 2005), how protein connectivity plays a role
in gene duplicability is unclear. The preferential attachment
model also does not suggest any bias in duplicability of a
node type (hub vs. nonhub). Our results suggest that highly
connected proteins (i.e., hubs) have a low duplicability
(fig. 3 and table 6). Despite its high tolerance against random perturbation, the
protein network relies mainly on its hubs for its integrity and is sensitive to targeted hub removal
(Albert, Jeong, and Barabasi 2000). Indeed, lethality increases threefold if a hub is
deleted (Jeong et al. 2001;
Han et al. 2004). Along with these observations, a slow
evolutionary rate (Fraser 2005) and highly conserved orthologs (Wuchty 2004; Fraser 2005)
for hubs suggest a strong
selection pressure on them. Likely, duplication of a hub is
deleterious because it affects a large number of proteins
(i.e., a high pleiotropy), especially for a hub whose partners
participate in different functions (an intermodule hub).
However, the pleiotropy is likely reduced if such a hub is
situated within a functional module (an intramodule hub).
Recently, however, a greater constraint on intramodule than
intermodule hubs was found (Fraser 2005). Below, we discuss this issue further.
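The contrast between tolerance to random perturbation and sensitivity to targeted hub removal can be illustrated on a toy graph. The sketch below is not the analysis of Albert, Jeong, and Barabasi (2000); the edge list and function names are hypothetical, and the size of the largest connected component is used here as one simple proxy for network integrity.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def largest_component(graph, removed=frozenset()):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)  # deleted nodes are never visited
    best = 0
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            size += 1
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


# Hypothetical toy network: one hub "H" with six partners and two extra edges.
EDGES = [("H", f"p{i}") for i in range(1, 7)] + [("p1", "p2"), ("p6", "q1")]
graph = build_graph(EDGES)

hub = max(graph, key=lambda node: len(graph[node]))   # highest-degree node
intact = largest_component(graph)
without_hub = largest_component(graph, removed={hub})
without_leaf = largest_component(graph, removed={"p3"})
```

In this toy network, deleting a low-degree node leaves the remaining nodes in a single component, whereas deleting the hub fragments the network into small pieces.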
A hub protein may be part of a large (stable) protein
complex; in this case, a dosage increase caused by a single-gene duplication would likely
affect the balance of complex formation (Veitia 2002). A larger proportion of intramodule
hubs (81%) than of intermodule
hubs (18%) are in a complex. Conversely, the majority of the intermodule hubs
are mediators, regulators, or adapters (Han et al. 2004).
These intermodule hubs globally integrate signals between
functional modules and are likely to localize to various subcellular compartments. Duplication of an intermodule hub
can destroy the network integrity and disrupt the informational flow because of a subsequent interaction change or
misexpression of a duplicate. Using a small data set characterized by Han et al. (2004), we find that the intermodule
hubs show a slightly lower duplicability (12.6%) than the
intramodule hubs (16.3%). This is contrary to Fraser’s
(2005) observation. Further research is needed to find out
whether duplicability of a hub is more constrained within
or between functional modules. It is, however, clear that
the duplicability of an intramodule or an
intermodule hub is usually lower than the average gene
duplicability in the genome.
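When two duplicability percentages come from a small data set, one way to judge whether they differ by more than chance is Fisher's exact test on the underlying counts. The sketch below is an illustration only: the text does not state that such a test was applied, the function name is hypothetical, and the counts are invented (they are not taken from Han et al. 2004 or from this study).

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups; columns are duplicated / unduplicated counts.
    The p-value sums the probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x duplicated genes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts for illustration only (duplicated, unduplicated).
INTER_DUP, INTER_SINGLE = 3, 17   # made-up intermodule hub counts
INTRA_DUP, INTRA_SINGLE = 6, 19   # made-up intramodule hub counts

p_value = fisher_exact_two_sided(INTER_DUP, INTER_SINGLE, INTRA_DUP, INTRA_SINGLE)
print(f"two-sided p = {p_value:.3f}")
```

With real counts in place of the invented ones, a large p-value would indicate that the difference between the two hub classes cannot be distinguished from sampling noise, which is consistent with the call above for further research.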
Supplementary Material
Supplementary Table 1S and Figure 1S are available at
Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgments
We thank V. Kunin for sending us data and R. Lusk
and M. Chou for their help in the protein interaction data
collection, Y.-W. Chang for her help in the gene function
classification, and G. Morris, J. Yang, and Z. Gu for helpful
discussions. We are grateful to two anonymous reviewers
for their valuable comments. This study was supported by
the International Balzan Foundation.
Literature Cited
Albert, R., and A. L. Barabasi. 2000. Topology of evolving
networks: local events and universality. Phys. Rev. Lett.
85:5234–5237.
Albert, R., H. Jeong, and A. L. Barabasi. 2000. Error and attack
tolerance of complex networks. Nature 406:378–382.
Babu, M. M., N. M. Luscombe, L. Aravind, M. Gerstein, and
S. A. Teichmann. 2004. Structure and evolution of transcriptional regulatory networks. Curr. Opin. Struct. Biol. 14:
283–291.
Bader, J. S., A. Chaudhuri, J. M. Rothberg, and J. Chant. 2004.
Gaining confidence in high-throughput protein interaction
networks. Nat. Biotechnol. 22:78–85.
Barabasi, A. L., and R. Albert. 1999. Emergence of scaling in
random networks. Science 286:509–512.
Barabasi, A. L., and Z. N. Oltvai. 2004. Network biology: understanding the cell’s functional organization. Nat. Rev. Genet.
5:101–113.
Cliften, P., P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton,
J. Majors, R. Waterston, B. A. Cohen, and M. Johnston. 2003.
Finding functional features in Saccharomyces genomes by
phylogenetic footprinting. Science 301:71–76.
De Hertogh, B., E. Carvajal, E. Talla, B. Dujon, P. Baret, and
A. Goffeau. 2002. Phylogenetic classification of transporters
and other membrane proteins from Saccharomyces cerevisiae.
Funct. Integr. Genomics 2:154–170.
Drees, B. L., B. Sundin, E. Brazeau et al. (22 co-authors). 2001. A
protein interaction map for cell polarity development. J. Cell
Biol. 154:549–571.
Dujon, B., D. Sherman, G. Fischer et al. (19 co-authors). 2004.
Genome evolution in yeasts. Nature 430:35–44.
Francino, M. P. 2005. An adaptive radiation model for the origin
of new gene functions. Nat. Genet. 37:573–577.
Fraser, H. B. 2005. Modularity and evolutionary constraint on
proteins. Nat. Genet. 37:351–352.
Fromont-Racine, M., A. E. Mayes, A. Brunet-Simon et al. (11 co-authors). 2000. Genome-wide protein interaction screens reveal functional networks involving Sm-like proteins. Yeast
17:95–110.
Gavin, A. C., M. Bosche, R. Krause et al. (38 co-authors). 2002.
Functional organization of the yeast proteome by systematic
analysis of protein complexes. Nature 415:141–147.
Gevers, D., K. Vandepoele, C. Simillon, and Y. Van de Peer.
2004. Gene duplication and biased functional retention of
paralogs in bacterial genomes. Trends Microbiol. 12:148–154.
Ghaemmaghami, S., W. K. Huh, K. Bower, R. W. Howson,
A. Belle, N. Dephoure, E. K. O’Shea, and J. S. Weissman.
2003. Global analysis of protein expression in yeast. Nature
425:737–741.
Giaever, G., A. M. Chu, L. Ni et al. (74 co-authors). 2002. Functional profiling of the Saccharomyces cerevisiae genome.
Nature 418:387–391.
Gu, Z., L. M. Steinmetz, X. Gu, C. Scharfe, R. W. Davis, and
W. H. Li. 2003. Role of duplicate genes in genetic robustness
against null mutations. Nature 421:63–66.
Han, J. D., N. Bertin, T. Hao et al. (11 co-authors). 2004. Evidence
for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430:88–93.
Hazkani-Covo, E., E. Y. Levanon, G. Rotman, D. Graur, and
A. Novik. 2004. Evolution of multicellularity in Metazoa:
comparative analysis of the subcellular localization of proteins
in Saccharomyces, Drosophila and Caenorhabditis. Cell Biol.
Int. 28:171–178.
He, X., and J. Zhang. 2005. Gene complexity and gene duplicability. Curr. Biol. 15:1016–1021.
Ho, Y., A. Gruhler, A. Heilbut et al. (20 co-authors). 2002. Systematic identification of protein complexes in Saccharomyces
cerevisiae by mass spectrometry. Nature 415:180–183.
Huh, W. K., J. V. Falvo, L. C. Gerke, A. S. Carroll, R. W. Howson,
J. S. Weissman, and E. K. O’Shea. 2003. Global analysis of
protein localization in budding yeast. Nature 425:686–691.
Ito, T., T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and
Y. Sakaki. 2001. A comprehensive two-hybrid analysis to
explore the yeast protein interactome. Proc. Natl. Acad. Sci.
USA 98:4569–4574.
Jeong, H., S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. 2001.
Lethality and centrality in protein networks. Nature 411:
41–42.
Kellis, M., N. Patterson, M. Endrizzi, B. Birren, and E. S. Lander.
2003. Sequencing and comparison of yeast species to identify
genes and regulatory elements. Nature 423:241–254.
Kondrashov, F. A., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin.
2002. Selection in the evolution of gene duplications. Genome
Biol. 3:research0008.1–0008.9.
Kunin, V., J. B. Pereira-Leal, and C. A. Ouzounis. 2004. Functional evolution of the yeast protein interaction network.
Mol. Biol. Evol. 21:1171–1176.
Marland, E., A. Prachumwat, N. Maltsev, Z. Gu, and W. H. Li.
2004. Higher gene duplicabilities for metabolic proteins than
for nonmetabolic proteins in yeast and E. coli. J. Mol. Evol.
59:806–814.
Newman, J. R., E. Wolf, and P. S. Kim. 2000. A computationally
directed screen identifying interacting coiled coils from
Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA
97:13203–13208.
O’Brien, K. P., M. Remm, and E. L. Sonnhammer. 2005. Inparanoid: a comprehensive database of eukaryotic orthologs.
Nucleic Acids Res. 33(Database Issue):D476–D480.
Papp, B., C. Pal, and L. D. Hurst. 2003. Dosage sensitivity and the
evolution of gene families in yeast. Nature 424:194–197.
Pastor-Satorras, R., E. Smith, and R. V. Sole. 2003. Evolving protein interaction networks through gene duplication. J. Theor.
Biol. 222:199–210.
Pereira-Leal, J. B., and S. A. Teichmann. 2005. Novel specificities
emerge by stepwise duplication of functional modules.
Genome Res. 15:552–559.
Qin, H., H. H. Lu, W. B. Wu, and W. H. Li. 2003. Evolution of the
yeast protein interaction network. Proc. Natl. Acad. Sci. USA
100:12820–12824.
Rogers, B., A. Decottignies, M. Kolaczkowski, E. Carvajal, E.
Balzi, and A. Goffeau. 2001. The pleitropic drug ABC transporters from Saccharomyces cerevisiae. J. Mol. Microbiol.
Biotechnol. 3:207–214.
Sá-Correia, I., and S. Tenreiro. 2002. The multidrug resistance
transporters of the major facilitator superfamily, 6 years after
disclosure of Saccharomyces cerevisiae genome sequence.
J. Biotechnol. 98:215–226.
Tatusov, R. L., N. D. Fedorova, J. D. Jackson et al. (17 co-authors). 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.
Tong, A. H., B. Drees, G. Nardelli et al. (16 co-authors). 2002. A
combined experimental and computational strategy to define
protein interaction networks for peptide recognition modules.
Science 295:321–324.
Uetz, P., L. Giot, G. Cagney et al. (20 co-authors). 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403:623–627.
Van Belle, D., and B. Andre. 2001. A genomic view of yeast membrane transporters. Curr. Opin. Cell Biol. 13:389–398.
Veitia, R. A. 2002. Exploring the etiology of haploinsufficiency.
Bioessays 24:175–184.
von Mering, C., R. Krause, B. Snel, M. Cornell, S. G. Oliver,
S. Fields, and P. Bork. 2002. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417:
399–403.
Wuchty, S. 2004. Evolution and topology in the yeast protein
interaction network. Genome Res. 14:1310–1314.
Yang, J., R. Lusk, and W. H. Li. 2003. Organismal complexity,
protein complexity, and gene duplicability. Proc. Natl. Acad.
Sci. USA 100:15661–15665.
Takashi Gojobori, Associate Editor
Accepted August 18, 2005