Protein Function, Connectivity, and Duplicability in Yeast

Anuphap Prachumwat* and Wen-Hsiung Li

*Committee on Genetics, University of Chicago; and Department of Ecology and Evolution, University of Chicago

Protein-protein interaction networks have evolved mainly through connectivity rewiring and gene duplication. However, how protein function influences these processes and how a network grows in time have not been well studied. Using protein-protein interaction data and genomic data from the budding yeast, we first examined whether there is a correlation between the age and connectivity of yeast proteins. A steady increase in connectivity with protein age is observed for yeast proteins except for those that can be traced back to Eubacteria. Second, we investigated whether protein connectivity and duplicability vary with gene function. We found a higher average duplicability for proteins interacting with external environments than for proteins localized within intracellular compartments. For example, proteins that function in the cell periphery (mainly transporters) show a high duplicability but are lowly connected. Conversely, proteins that function within the nucleus (e.g., transcription, RNA and DNA metabolisms, and ribosome biogenesis and assembly) are highly connected but have a low duplicability. Finally, we found a negative correlation between protein connectivity and duplicability.

Introduction

Biological processes, which contribute to the phenotypes of living cells, are wired by interaction networks of various cellular components such as proteins, DNA, RNA, and metabolites. Such network data, especially protein-protein interactions in the budding yeast (Saccharomyces cerevisiae), can now be generated in a high-throughput manner, allowing large-scale analyses. We are interested in the yeast protein interaction network that is organized, similar to nonbiological networks, into a small world and a scale-free topology (Barabasi and Oltvai 2004).
A small world has a high probability that any two neighbors of a node are connected with each other, while a scale-free topology shows a power-law distribution of node connectivities (for a review, see Barabasi and Oltvai 2004) and contributes to a high tolerance to disturbance (Albert and Barabasi 2000).

Barabasi and Albert (1999) proposed that growth of a network with a preferential attachment behavior is sufficient to explain the emergence of a scale-free network topology. This model requires that a new node preferentially connects to a well-connected node, predicting that old nodes should tend to have a higher connectivity than young ones. This prediction, however, was not supported by a recent analysis of the yeast protein network by Kunin, Pereira-Leal, and Ouzounis (2004), who therefore suggested that to understand the scale-free topology of the protein network, protein function should also be taken into account.

In this study, we use a larger data set or a data set of better quality than that of Kunin, Pereira-Leal, and Ouzounis (2004) to re-examine the prediction of the preferential attachment model by checking whether a correlation exists between the age and connectivity of yeast proteins. We also investigate whether protein connectivity and gene duplicability vary with gene function. Because yeast, which is a single-cell organism, inhabits a wide range of environmental niches, genetic diversity for proteins that are exposed to or interact with extracellular environments may confer benefits to the organism.

Key words: protein interaction network, protein connectivity, gene duplicability, network evolution, protein localization.
E-mail: [email protected].
Mol. Biol. Evol. 23(1):30–39. 2006 doi:10.1093/molbev/msi249 Advance Access publication August 24, 2005
© The Author 2005. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]
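The prediction of the preferential attachment model described above (old nodes tend to be better connected than young ones) can be illustrated with a small simulation. The following is a minimal sketch in Python, not the procedure of Barabasi and Albert (1999) as published: the function name `grow_network`, the fully connected starting core, and the parameters `n_nodes`, `m`, and `seed` are assumptions made for illustration only.

```python
import random


def grow_network(n_nodes, m=2, seed=0):
    """Grow a network by preferential attachment; return the degree of each node.

    Nodes are numbered by order of arrival, so a lower index means an older node.
    Illustrative assumption: the network starts from a fully connected core of
    m + 1 nodes, and every new node attaches to m distinct existing nodes.
    """
    rng = random.Random(seed)
    degree = [m] * (m + 1)  # each core node is linked to the m other core nodes
    # 'stubs' holds every node once per edge end, so a uniform draw from it
    # selects a node with probability proportional to its current degree.
    stubs = [node for node in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        degree.append(m)
        for target in targets:
            degree[target] += 1
            stubs.extend((target, new))
    return degree


degrees = grow_network(200, m=2, seed=1)
older_half = sum(degrees[:100]) / 100
younger_half = sum(degrees[100:]) / 100
print("mean degree, older half:", older_half)
print("mean degree, younger half:", younger_half)
```

In such a toy network, age and connectivity are tied together by construction, which is the relationship the analyses in this study test against real interaction data.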
As duplication may increase such diversity (or produce a new adaptive function, e.g., Francino 2005), we hypothesize a higher duplicability for proteins exposed to extracellular environments than for those localized to intracellular compartments. Moreover, because gene duplication plays a major role in network growth (e.g., Barabasi and Albert 1999; Pastor-Satorras, Smith, and Sole 2003) and, conversely, connectivity may affect gene duplicability, we investigate whether a relationship exists between protein connectivity and duplicability.

Materials and Methods

Protein-Protein Interaction Data

Protein-protein interaction pairs are collected from various high-throughput experiments (Fromont-Racine et al. 2000; Newman, Wolf, and Kim 2000; Uetz et al. 2000; Dress et al. 2001; Ito et al. 2001; Gavin et al. 2002; Ho et al. 2002; Tong et al. 2002) and databases (Munich Information Center for Protein Sequences, Database of Interacting Proteins, Biomolecular Interaction Network Database, and Yeast Protein Database). This collection (denoted by ALL_K) includes 5,015 proteins and 16,747 interactions. Because high-throughput interaction data come with high false-positive rates, we also use a set of high-confidence data (denoted by BaderSTD) from Bader et al. (2004) that comprises 2,759 proteins and 5,785 interactions. Further, "true interactions" inferred from many small-scale experiments are also considered (denoted by SSE). Given that SSE is a small data set, we combine it with BaderSTD to obtain a larger high-confidence data set (denoted by SSBader). Descriptive statistics of these data are shown in table 1.

The connectivity (denoted by k) of a protein in a network of interest is defined as the number of interactions of the protein with other proteins in that network. In addition to using the mean and median of k as measures of connectivity for the proteins in a category of interest, we also use the proportion of hubs in the category.
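The connectivity measures defined above can be sketched as follows. This is a minimal Python illustration, not the authors' pipeline: the helper names `connectivity` and `summarize`, the toy protein labels, and the decisions to ignore self-interactions and to count a repeated pair only once are assumptions. The hub cutoff is left as a parameter.

```python
from collections import defaultdict
from statistics import mean, median


def connectivity(pairs):
    """Return {protein: k}, where k is the number of distinct interaction partners."""
    partners = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue  # assumption: self-interactions do not count toward k
        partners[a].add(b)
        partners[b].add(a)
    return {protein: len(others) for protein, others in partners.items()}


def summarize(k_values, hub_cutoff):
    """Mean k, median k, and proportion of hubs (k >= hub_cutoff) for one category."""
    ks = list(k_values)
    return {
        "mean_k": mean(ks),
        "median_k": median(ks),
        "hub_fraction": sum(k >= hub_cutoff for k in ks) / len(ks),
    }


# Toy interaction list with made-up protein labels; ("B", "A") repeats ("A", "B").
toy_pairs = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "A")]
k = connectivity(toy_pairs)
print(k)
print(summarize(k.values(), hub_cutoff=3))
```

Storing partners as sets makes the count insensitive to the same pair being reported by several experiments, which matters when interaction lists from different sources are pooled.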
We define a hub as a protein with k ≥ a, where a is 5 or 7 (the two cutoff points give similar results). We show the results of analyses on SSBader and ALL_K but not the results on other data sets because they are essentially the same.

Table 1
Descriptive Statistics of the Protein-Protein Interaction Data Sets Used in This Study

Data Set(a)   Interactions   Proteins   Median k   Mean k   Degree exponent(b)
ALL_K         16,747         5,015      3          6.61     1.76
SSE           3,789          1,796      3          4.11     2.11
BaderSTD      5,785          2,759      3          4.19     2.02
SSBader       8,666          3,218      3          5.32     2.05

(a) ALL_K denotes all data collected; BaderSTD denotes the high-confidence interaction data from Bader et al. (2004); and SSBader denotes a combined set of the SSE and BaderSTD data sets. Connectivity is denoted by k.
(b) The degree exponent ($\gamma$) is calculated using the power-law model, $P(k) \propto k^{-\gamma}$, with a standard regression method in R.

Classification of Proteins into Age Groups

For each yeast protein, we identified homologous proteins from other genomes that have been sequenced. These homologous groups of yeast proteins were obtained from KOG and COG (Tatusov et al. 2003), Inparanoid (O’Brien, Remm, and Sonnhammer 2005), Génolevures (Dujon et al. 2004), Kellis et al. (2003), Cliften et al. (2003), and Kunin, Pereira-Leal, and Ouzounis (2004). Although yeast proteins can be assigned to 10 age categories (groups) by their shared ancestral origins (10 lineages) from these orthologous groups (fig. 1), this categorization gives a small number of proteins for some categories. For statistical purposes, we classify yeast proteins into five age categories (denoted by I–V; fig. 1 and table 2); we exclude the 380 spurious open reading frames (ORFs) defined by both Kellis et al. (2003) and Ghaemmaghami et al. (2003).

Identification of Duplicate and Singleton Genes

The whole set of S. cerevisiae protein sequences was downloaded from SGD (http://www.yeastgenome.org/).
Duplicate genes were identified as described in Gu et al. (2003) (E < 10⁻¹⁰). A singleton was defined as a gene with only one copy in the genome.

Protein Subcellular Localization and Biological Process

The protein localization profile for S. cerevisiae grown in synthetic medium (downloaded from http://yeastgfp.ucsf.edu; Huh et al. 2003) is combined with subcellular localization defined by the Gene Ontology (GO) classification (downloaded from SGD on April 5, 2005). Mislocalization of some proteins from Huh et al. (2003) is corrected according to the authors’ supplementary data. The GO subcellular localization categories are translated to the subcellular localization categories of Huh et al. (2003) because GO subcellular localizations are at a deeper level than those from Huh et al. (2003) (e.g., GO distinguishes between membrane and lumen of mitochondrion, while Huh et al. [2003] does not). The GO extracellular category, which contains a small number of proteins, is merged into the cell periphery category. A protein is associated with more than one localization category if it is found in multiple localizations (e.g., shuttle and transport proteins). Biological processes of each ORF are assigned according to the GO Slim classification, which groups proteins to give a high-level view of their functions (downloaded from SGD on April 5, 2005).

FIG. 1. The evolutionary path leading to the yeast (Saccharomyces cerevisiae) is shown in thick branches on this species tree. Yeast protein age is inferred by the presence of an ortholog in other species. The oldest age group includes yeast proteins that can be traced back to eubacterial genomes, while the youngest one includes proteins with orthologs only within the Saccharomyces sensu-stricto species or without any ortholog in the other genomes. Horizontal dashed lines represent the age groups and are numbered I, II, III, IV, and V. The tree is not drawn to scale.

Measures of Gene Duplicability

Similar to Marland et al. (2004), for each category (i.e., a subcellular localization category or a biological process) under study, the number of unique types of genes is defined as the number of singletons plus the number of duplicated gene types in that category. The number of duplications per gene (n) is the total number of genes divided by the total number of unique types of genes. The proportion of unduplicated genes (P) is the proportion of singletons in the total number of unique types of genes. While n roughly indicates the average number of paralogs per gene in the category, 1 − P denotes the proportion of gene types that have been duplicated. Both n and 1 − P can be used as measures of gene duplicability (Yang, Lusk, and Li 2003). In addition, we also consider the proportion of duplicate genes in each category (Q). Q and n are less desirable than P because they can be strongly affected by the presence of large gene families.

Our statistical analyses are conducted in R (version 2.0.1, http://www.r-project.org/). The statistical tests used are Fisher’s exact test and the Mann-Whitney test (also called the Wilcoxon rank sum two-sample test). In contrast to the parametric two-sample t test, the Mann-Whitney test is a nonparametric method that replaces the protein connectivity data by ranks, which reduces the influence of outliers. It is more appropriate than the t test because protein connectivities are not normally distributed.

Results

Origins of Proteins and Their Connectivity

To determine whether the connectivity (k) correlates with the age of a protein, the mean and median k values for each age group are obtained. It appears that young proteins (e.g., those found in yeasts only) have a lower mean k than that in the older age groups (e.g., Archaea and Plasmodium-Plants-Animals) for both the complete data set (ALL_K) and the high-confidence data set (SSBader) (table 2 and fig. 2A and B). However, those proteins traceable to Eubacteria show a lower mean k and a slightly lower median k than those in the Archaea group (table 2 and fig. 2A and B). Further, the younger age groups have a lower proportion of hubs than the older age groups, except the Eubacteria, which shows a lower proportion of hubs than the Archaea and the Plasmodium-Plants-Animals (fig. 2C).

Table 2
Descriptive Statistics of Each Age Group for the Number of Proteins (t) and Median and Mean k Values

                                                          All Proteins(a)   ALL_K(b)                      SSBader(b)
Group   Description                                       t                 t       Median k   Mean k     t       Median k   Mean k
I       Eubacteria                                        2,275             1,801   3          6.72       1,243   3          5.07
II      Archaea                                           403               370     5          10.64      293     5          7.12
III     Plasmodium-Plants-Animals                         1,648             1,314   4          7.89       979     4          6.14
IV      Microspora–S. pombe–Saccharomyces complex         1,150             917     3          5.38       574     3          4.21
V       Saccharomyces sensu-stricto complex               270               193     2          3.40       74      1          2.42
        Total set                                         5,746             4,595   3          6.96       3,163   3          5.33

(a) "All proteins" are the protein data set collected for the classification of proteins into age categories, which is described in the text and figure 1.
(b) The data sets are described in table 1.

Performing the Mann-Whitney test on these data, we first ask whether two adjacent age groups have different connectivities. The test shows that the Eubacteria age group has a significantly lower k than Archaea in both data sets (P < 5 × 10⁻⁸; fig. 2A and B). The Archaea age group has a significantly higher k than the Plasmodium-Plants-Animals group in ALL_K (P = 2 × 10⁻⁴), though the level of significance is lower in SSBader (P = 0.068). Second, we pick an age group as a pivot group and perform two tests: (1) between this pivot group and the older proteins and (2) between the pivot group and the younger proteins. The tests reveal that the Eubacteria group "does not" show a different k from the rest of the proteins in the network. The other groups show a significantly different k from their older and/or younger counterparts (P ≤ 0.006; data not shown).
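The rank-based comparison of two age groups can be sketched as below. This is a minimal pure-Python illustration of the Mann-Whitney U statistic with midranks for ties and a two-sided P value from a plain normal approximation with a tie-corrected variance. It is not a reimplementation of the R routine used in this study, which may apply an exact test or a continuity correction, and the two lists of k values are invented for illustration.

```python
import math
from collections import Counter


def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for i in order[start:end + 1]:
            ranks[i] = shared
        start = end + 1
    return ranks


def mann_whitney(x, y):
    """Return (U statistic of x, two-sided P value by normal approximation)."""
    pooled = list(x) + list(y)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = midranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: each group of c tied values reduces the variance.
    ties = sum(c ** 3 - c for c in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0  # all values identical, so no evidence of a difference
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return u1, math.erfc(abs(z) / math.sqrt(2))


# Invented connectivity values for two hypothetical groups of proteins.
group_a = [1, 2, 2, 3, 3, 4, 12]
group_b = [3, 5, 5, 6, 8, 9, 30]
print(mann_whitney(group_a, group_b))
```

Because only ranks enter the statistic, the single very large k in each toy group has no more influence than any other top-ranked value, which is the property that motivates the test for skewed connectivity data.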
Clearly, the oldest proteins (the Eubacteria group) do not have the highest k in the protein network, and for this reason there is no positive correlation between connectivity and age. However, a significant correlation is seen when the Eubacteria group is excluded.

Protein Function and Connectivity

In the following analysis, we consider protein localization and perform the Mann-Whitney test on both data sets; although we show only the results for SSBader, a similar pattern is observed for ALL_K. Note that the mean k values for the proteins localized to nucleus and nucleolus are 6.85 and 8.81, respectively, which are significantly higher than the mean k (5.33) for the whole network (P < 5 × 10⁻⁶, table 3). Some other localization categories such as cytoplasm, mitochondrion, cell periphery, and endoplasmic reticulum show a significantly lower k than the other proteins (P < 0.003, table 3). Similarly, when biological processes are considered, proteins involved in protein biosynthesis and catabolism, ribosome biogenesis and assembly, DNA and RNA metabolisms, and transcription show a significantly higher k than the proteins involved in other biological processes (mean and median k are greater than 5.33 and 3, respectively; P < 5 × 10⁻⁶, table 3). Although proteins involved in lipid, carbohydrate, and amino acid metabolisms and cellular respiration show a significantly lower k than the average in SSBader (table 3), only lipid metabolism proteins show a significantly lower k in ALL_K; nonetheless, the proteins in the other three categories still have a low k (data not shown).

Protein Function Versus Connectivity Within the Same Age Group

It is interesting to ask whether within the same age group the function of a protein affects its connectivity.
To answer this question, we categorize proteins by their localization or biological processes for each protein age group and perform the Mann-Whitney test between a functional group of interest and the rest within the same age group (only mean k values for functional categories are shown in Supplementary Fig. 1S, Supplementary Material online). Proteins localized to nucleus and nucleolus show a significantly higher k in the Eubacteria and Archaea age groups; proteins localized to nucleus also show a significantly higher k in the Plasmodium-Plants-Animals and Microspora–Schizosaccharomyces pombe–Saccharomyces complex groups (P < 7 × 10⁻⁴). For biological processes, proteins involved in ribosome biogenesis and assembly, RNA metabolism, and protein catabolism are significantly more highly connected than other functions for the Eubacteria, Archaea, and Plasmodium-Plants-Animals groups. Although many younger age groups (IV and V) do not show a significant difference in connectivity among biological process categories (probably because of small sample sizes), proteins involved in transcription show a significantly higher k than those in the other biological processes in the Microspora–Schizosaccharomyces pombe–Saccharomyces complex age group. Proteins involved in carbohydrate and amino acid and derivative metabolisms show a significantly lower k than other proteins in the Eubacteria group, while proteins involved in cell wall and membrane organization and biogenesis are lowly connected in the Microspora–S. pombe–Saccharomyces complex group.

FIG. 2. The patterns of connectivity (k) for each age group in the ALL_K (A) and the SSBader (B) are represented by mean (black bars) and median (white bars) k. The P values from the Mann-Whitney test performed between the adjacent age groups are indicated under the graphs. The bar marked by * indicates a P value of 0.068. (C) The proportion of hubs (proteins with k ≥ 5) among proteins in the same age group also indicates a level of connectivity for each age group. Similar patterns are observed for both the ALL_K and the SSBader, but only the ALL_K is shown. The P values from Fisher’s exact test performed between the adjacent age groups are indicated under the graph.

Table 3
Descriptive Statistics for the Number of Proteins (t) and Mean and Median k in the SSBader Data Set When Categorized by Subcellular Localization and Biological Process

Functional Category                                   Mean k   Median k   t       P value(a)
Subcellular localization
  Nucleolus                                           8.81     7          160     ****
  Actin                                               8.18     5          44      *
  Golgi                                               7.31     6.5        36      *
  Nucleus                                             6.85     4          1,251   ****
  Bud                                                 5.78     4          107
  Spindle pole                                        5.60     5          55
  Cytoplasmic vesicle                                 5.39     4          157
  Mitochondrion                                       5.08     3          314
  Cytoplasm                                           4.83     3          1,273
  Peroxisome                                          4.59     3          34
  Lipid particle                                      4.17     4          6
  Cell periphery                                      4.12     3          139
  Vacuole                                             4.01     3          81
  Microtubule                                         3.77     3.5        26
  Endoplasmic reticulum                               3.70     2          123
  Endosome                                            3.63     3          35
Biological process
  Ribosome biogenesis and assembly                    9.48     7          135     ****
  RNA metabolism                                      8.83     7          266     ****
  Cell budding and cytokinesis                        8.11     5          71      **
  Transcription                                       8.10     6          251     ****
  Pseudohyphal growth                                 7.68     6          31      *
  Protein biosynthesis and catabolism                 7.50     5          272     ***
  Nuclear organization and biogenesis                 6.54     6          41      *
  Conjugation                                         6.31     4          35
  Cell cycle                                          6.09     4          119
  Signal transduction                                 5.76     3          63
  Protein modification                                5.64     3          203
  DNA metabolism                                      5.63     4          236
  Response to stress                                  5.44     3          113
  Transport                                           5.31     4          363
  Cytoskeleton organization and biogenesis            5.13     4          88
  Meiosis                                             5.12     3          86
  Cell wall and membrane organization and biogenesis  4.71     3          86
  Organelle organization and biogenesis               4.71     3          103
  Morphogenesis                                       4.00     4          13
  Generation of precursor metabolites and energy      3.73     2          56
  Cell homeostasis                                    3.50     2          26      *
  Lipid metabolism                                    3.17     2          58      **
  Sporulation                                         3.07     2.5        28      *
  Vitamin metabolism                                  2.95     2          22      *
  Carbohydrate metabolism                             2.83     2          66      ***
  Amino acid and derivative metabolism                2.64     2          69      ****
  Cellular respiration                                2.54     2          35      **

Markers whose rows cannot be recovered from the available text: ***, ****, **, ***, ** (five rows), and * (either Morphogenesis or Generation of precursor metabolites and energy).

(a) The P values less than 0.05 from the Mann-Whitney test performed between a category of interest and other categories combined are indicated: *0.005 < P < 0.05, **5 × 10⁻⁴ < P ≤ 0.005, ***5 × 10⁻⁶ < P ≤ 5 × 10⁻⁴, and ****P ≤ 5 × 10⁻⁶. The italicized category names indicate a significant difference between that category and the average of all categories after Bonferroni correction.

Protein Function and Duplicability

We investigate the proportion of unduplicated genes (P) for each localization category. A low value of P indicates a high duplicability. The values of P are significantly lower in cell periphery, bud, and vacuole categories but significantly higher in nucleus and nucleolus (P < 0.003, table 4); all tests for this section are Fisher’s exact test. The categories with a significantly lower value of P have a higher proportion of duplicate genes (Q) than that of the whole genome and vice versa (P < 0.003, table 4). A significantly different duplicability in cytoplasm (higher) and spindle pole (lower) from average is indicated by Q. The significantly high duplicability in cell periphery is also revealed by the number of duplications per gene (n = 1.44; n = 1.21 for the whole-genome average). Similarly, the n values are relatively low (between 1.03 and 1.11) for mitochondrion, nucleus, nucleolus, and spindle pole.

Table 4
Duplication Patterns of Proteins Localized to 16 Subcellular Compartment Categories as Measured by the Proportion of Duplicates (Q) and the Proportion of Unduplicated Genes (P)

Subcellular Compartments(a)   Singletons   Duplicates   Total Genes   Q          Unique Types of Genes(b)   P(c)
Cell periphery                141          188          329           57.14(d)   229                        61.57(d)
Bud                           81           56           137           40.88(d)   117                        69.23(d)
Vacuole                       134          94           228           41.23(d)   192                        69.79(d)
Peroxisome                    32           15           47            31.91      45                         71.11
Lipid particle                16           7            23            30.43      22                         72.73
Cytoplasmic vesicle           160          63           223           28.25      205                        78.05
Golgi                         47           18           65            27.69      59                         79.66
Cytoplasm                     1,320        641          1,961         32.69(d)   1,657                      79.66
Actin                         33           11           44            25.00      41                         80.49
Mitochondrion                 515          170          685           24.82(e)   623                        82.66
Endoplasmic reticulum         259          82           341           24.05(e)   312                        83.01
Endosome                      42           10           52            19.23      49                         85.71
Microtubule                   25           7            32            21.88      29                         86.21
Nucleus                       1,339        368          1,707         21.56(f)   1,531                      87.46(f)
Nucleolus                     166          39           205           19.02(f)   185                        89.73(f)
Spindle pole                  59           8            67            11.94(f)   65                         90.77
The whole data set            3,531        1,429        4,960         28.81      4,098                      86.16

(a) The subcellular compartments are ordered by P.
(b) Unique types of genes = singletons + duplicated gene types, where duplicated gene types (duplication groups) are defined as the number of gene type families of proteins that are localized to such a subcellular compartment.
(c) Proportion of unduplicated genes (P) = singletons/unique types of genes.
(d) Significantly above average; P < 0.003, Fisher’s exact test.
(e) 0.003 ≤ P < 0.05, Fisher’s exact test.
(f) Significantly below average; P < 0.003, Fisher’s exact test.

When biological processes are considered, we find that ~1/4 of yeast proteins are uncharacterized. Among the remaining proteins, duplicates in carbohydrate metabolism, generation of precursor metabolites and energy, protein biosynthesis and catabolism, transport, and response to stress are significantly overrepresented, whereas in DNA metabolism, RNA metabolism, transcription, and ribosome biogenesis and assembly, duplicates are significantly underrepresented (P < 0.002, table 5). Among all proteins annotated with their biological processes, those involved in transport, protein biosynthesis and catabolism, RNA metabolism, transcription, protein modification, and DNA metabolism are among the most highly represented (between 7% and 17%). Relative to the whole-proteome average, these categories show either a high or a low number of duplicates (table 5). Generally speaking, low values of P are accompanied by high values of Q. Duplicates in the unknown biological process category, however, are significantly underrepresented (P < 0.002).
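The three duplicability measures can be checked directly against the counts reported in table 4. The following is a minimal Python sketch; the function name `duplicability` and its argument names are chosen here for illustration, while the counts for the cell periphery and for the whole data set are taken from table 4.

```python
def duplicability(singletons, duplicates, unique_types):
    """Compute the duplicability measures for one category.

    n: total genes / unique types of genes (duplications per gene)
    P: singletons / unique types of genes (proportion of unduplicated genes)
    Q: duplicates / total genes (proportion of duplicate genes)
    """
    total_genes = singletons + duplicates
    return {
        "n": total_genes / unique_types,
        "P": singletons / unique_types,
        "Q": duplicates / total_genes,
    }


# Counts from table 4: singletons, duplicates, unique types of genes.
cell_periphery = duplicability(141, 188, 229)
whole_data_set = duplicability(3531, 1429, 4098)

for name, m in (("cell periphery", cell_periphery), ("whole data set", whole_data_set)):
    print(f"{name}: n = {m['n']:.2f}, P = {100 * m['P']:.2f}%, Q = {100 * m['Q']:.2f}%")
```

The cell periphery values reproduce n = 1.44, P = 61.57%, and Q = 57.14%, and the whole data set reproduces n = 1.21, P = 86.16%, and Q = 28.81%, matching the figures quoted in the text and in table 4.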
Protein Connectivity and Duplicability

Figure 3A shows that P is positively correlated with both mean and median k for biological processes (R² = 0.35 and 0.45 for mean and median k, respectively; P < 0.002). A similar pattern is also observed when we consider only significant categories from table 3 (R² = 0.66 and 0.79) or table 5 (R² = 0.74 and 0.83 for mean and median k, respectively; all P < 0.008). Moreover, this pattern is also found when the proportion of hubs is used as a measure of connectivity (R² = 0.43, P = 0.0001; fig. 3B). In addition, we observe essentially the same results when using protein localization categories and/or the Q values (data not shown). Furthermore, duplicability is, on average, ~8% higher in the nonhub proteins than in the hub proteins (P = 79% and 88% and Q = 30% and 22% for the nonhubs and hubs, respectively; P < 1 × 10⁻⁶). This pattern suggests that proteins with a lower connectivity have, on average, a higher gene duplicability.

A summary of protein connectivity and gene duplicability of nuclear, cytoplasmic, and external and cell peripheral proteins is shown in table 6. In general, nuclear proteins are highly connected but show a low duplicability, while external and cell peripheral proteins show a high duplicability but are lowly connected. The connectivity and gene duplicability of cytoplasmic proteins are between those of the nuclear and the external and cell peripheral proteins.

Discussion

Our finding that proteins in the oldest group (the Eubacteria group) do not exhibit higher connectivities (k) than proteins in the Archaea and Plasmodium-Plants-Animals groups is similar to that of Kunin, Pereira-Leal, and Ouzounis (2004). However, the connectivities of the pre-Eukaryotes group (the union of the Eubacteria and Archaea) are, on average, only slightly lower than those of the Plasmodium-Plants-Animals group (i.e., the Crown-Eukaryotes in the study of Kunin, Pereira-Leal, and Ouzounis [2004]).
Moreover, proteins in the Archaea age group show a significantly higher k than those in the Plasmodium-Plants-Animals age group (fig. 2). Thus, only the Eubacteria group contradicts the prediction of the preferential attachment model, and actually a positive correlation between age and k is seen when the Eubacteria group is excluded (table 2 and fig. 2).

Table 5
Distribution of Duplicates for Each Biological Process That Is Defined According to GO Slim Classification

Biological Processes(a)                               Singletons   Duplicates   Total Genes   Q          Unique Types of Genes(b)   P(c)
Generation of precursor metabolites and energy        41           46           87            52.87(d)   66                         62.12(d)
Carbohydrate metabolism                               48           55           103           53.40(d)   77                         62.34(d)
Cell homeostasis                                      28           22           50            44.00(e)   42                         66.67(e)
Response to stress                                    99           81           180           45.00(d)   148                        66.89(d)
Lipid metabolism                                      76           55           131           41.98(e)   109                        69.72(e)
Pseudohyphal growth                                   30           18           48            37.50      43                         69.77
Amino acid and derivative metabolism                  76           50           126           39.68(e)   106                        71.70(e)
Vitamin metabolism                                    28           21           49            42.86      39                         71.79
Signal transduction                                   50           34           84            40.48      69                         72.46
Sporulation                                           47           19           66            28.79      64                         73.44
Protein biosynthesis and catabolism                   268          186          454           40.97(d)   358                        74.86(d)
Cell wall and membrane organization and biogenesis    98           61           159           38.36      130                        75.38
Cell budding and cytokinesis                          58           32           90            35.56      76                         76.32
Cellular respiration                                  50           20           70            28.57      64                         78.13
Conjugation                                           42           15           57            26.32      52                         80.77
Cell cycle                                            102          37           139           26.62      125                        81.60
Transport                                             413          242          655           36.95(d)   506                        81.62
Cytoskeleton organization and biogenesis              77           29           106           27.36      94                         81.91
Morphogenesis                                         14           4            18            22.22      17                         82.35
Meiosis                                               89           26           115           22.61      106                        83.96
Organelle organization and biogenesis                 121          31           152           20.39(e)   142                        85.21
Protein modification                                  227          77           304           25.33(e)   261                        86.97(e)
Nuclear organization and biogenesis                   38           9            47            19.15      43                         88.37
DNA metabolism                                        246          61           307           19.87(f)   276                        89.13(f)
Transcription                                         253          39           292           13.36(f)   281                        90.04(f)
RNA metabolism                                        279          50           329           15.20(f)   298                        93.62(f)
Ribosome biogenesis and assembly                      134          23           157           14.65(f)   142                        94.37(f)

(a) The average proportions of duplicates (Q) and of unduplicated genes (P) are 28.71% and 87.02%, respectively. The biological processes are ordered by P.
(b) Unique types of genes = singletons + duplicated gene types, where duplicated gene types (duplication groups) are defined as the number of gene type families in such a biological process category.
(c) Proportion of unduplicated genes (P) = singletons/unique types of genes.
(d) Significantly above average; P < 0.002, Fisher’s exact test.
(e) 0.002 ≤ P < 0.05, Fisher’s exact test.
(f) Significantly below average; P < 0.002, Fisher’s exact test.

The higher protein connectivity for the Archaea and Plasmodium-Plants-Animals age groups than the Eubacteria group could be due to connection gains through new gene creation (e.g., gene duplication or gene fusion). Possibly, during the early evolution of eukaryotic cells whose nucleus evolved from Archaea, proteins for eukaryotic cell formation might have increased in number, and some became hubs for such functional modules (e.g., fig. 2C). Moreover, domain shuffling and length extension of proteins in the Archaea and Plasmodium-Plants-Animals groups, both of which increase protein complexity, could have added new connections for these proteins.

A constraint by gene function may influence protein network evolution (Kunin, Pereira-Leal, and Ouzounis 2004). To investigate this, we defined protein function by both localization and biological processes according to the GO annotation. Because localization partly determines the function of a protein, a combination of localization and biological process increases confidence in our function classification. Proteins involved in transcription, RNA metabolism, protein biosynthesis and catabolism, and ribosome biogenesis and assembly tend to be highly connected.
Although the majority of our results are consistent with those reported by Kunin, Pereira-Leal, and Ouzounis (2004), translational proteins (e.g., protein biosynthesis and catabolism) are highly connected, contrary to their finding. In support of our observation, the majority of these proteins localized to nucleus and nucleolus are highly connected. On the other hand, proteins localized to cell periphery and vacuole are lowly connected (tables 3 and 6).

FIG. 3. A positive correlation between connectivity (k) and proportion of unduplicated proteins (P) for the biological process classification. Because a similar trend is observed for median k, only (A) mean k and (B) the proportion of hubs (k ≥ 5) are shown. The trend lines are provided for visualization only.

It appears that protein function affects connectivity across protein age groups (see "Protein Function Versus Connectivity Within the Same Age Group"). This pattern, however, may have resulted from the emergence time of these highly connected protein functions because proteins that emerged in the same evolutionary period tend to interact with one another (Qin et al. 2003), and proteins with similar functions are likely clustered (von Mering et al. 2002). We find that the emergence time of a protein contributes partly to the high k for "only" some gene functions. For example, transport and RNA metabolism categories have comparable numbers of proteins in (and emerged predominantly in) the Eubacteria and Plasmodium-Plants-Animals age groups, but transport proteins are not highly connected (Supplementary Table 1S and Fig. 1SB, Supplementary Material online). Biological processes with proteins that largely emerged in the Eubacteria group (e.g., carbohydrate, amino acid and derivative metabolisms, and generation of precursor metabolites and energy) are also relatively lowly connected (Supplementary Table 1S and Fig. 1SB, Supplementary Material online).
Likewise, proteins localized in cell periphery, cytoplasm, endoplasmic reticulum, nucleus, and nucleolus largely emerged in the Eubacteria and Plasmodium-Plants-Animals age groups, but only those localized in nucleus and nucleolus are coincidentally highly connected (Supplementary Table 1S and Fig. 1SA, Supplementary Material online). This finding supports the view of Kunin, Pereira-Leal, and Ouzounis (2004) that age alone is not sufficient to explain the observed connectivities of proteins and that protein function also needs to be considered. Importantly, the evidence that for almost all of the function categories proteins in the Eubacteria group show a lower k than those in the Archaea and Plasmodium-Plants-Animals groups (Supplementary Fig. 1S, Supplementary Material online) confirms our previous finding.

The observed patterns of gene duplication suggest that duplicate genes in the yeast are unequally represented in both subcellular localization and biological process categorizations (tables 4–6). A higher duplicability is observed for proteins localized to cell periphery, bud, vacuole, and cytoplasm and for proteins involved in transport, carbohydrate metabolism, protein biosynthesis and catabolism, response to stress, and generation of precursor metabolites and energy, but not for proteins in other subcellular compartments or biological processes. Some functions such as transcription, DNA and RNA metabolisms, and ribosome biogenesis and assembly have a low duplicability. From these observations, we suggest that gene function is a major determinant of gene duplicability in S. cerevisiae.

Duplicate genes of some functions may not have a good chance to confer selective advantages, leading to a low gene duplicability. Proteins involved in transcription, DNA and RNA metabolisms, and ribosome biogenesis and assembly may face such a constraint.
For example, duplication of a global transcription regulator likely affects many downstream genes, presumably being deleterious in the majority of cases and leading to a slim chance of duplicate survival. These functions (e.g., ribosome biogenesis and assembly) may also be constrained by the dosage balance of protein complexes (Papp, Pal, and Hurst 2003; Yang, Lusk, and Li 2003). However, other factors may also affect gene duplicability, because the proportion of transcription proteins is higher in multicellular organisms than in yeast (Babu et al. 2004). Moreover, the observation that yeast duplicate genes, especially those retained from the whole-genome duplication, tend to have a higher gene complexity (measured by protein length, number of domains, or number of cis-regulatory elements) than other genes has led to the conclusion that gene complexity may contribute to duplicate retention (He and Zhang 2005). However, analyzing protein length in our data set, we find that duplicates are longer than singletons in approximately half of the functional categories, and the difference is statistically significant in only a few of these cases (data not shown).

Our results (table 6) support the hypothesis that a higher duplicability for proteins interacting with fluctuating external environments may confer benefits to the organism. For example, in yeast, nutrient capture through the cell periphery is the first stage of cell growth, and so the chance that duplication of a gene in this process is beneficial is high. A high duplicability for proteins localized to the cell periphery is also seen in fruit fly, nematode, mouse, and human (unpublished data), along with an increase in the total numbers of these proteins from yeast to nematode and fruit fly (Hazkani-Covo et al. 2004).
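
As a minimal illustration of how per-category summaries of the kind reported in table 6 could be computed, the Python sketch below groups proteins by functional category and returns the mean connectivity (k) and the duplicability (1−P, in percent, where P is the proportion of unduplicated proteins). The record layout (category, k, has_paralog), the function name summarize_by_category, and the toy records are assumptions made for illustration only; they are not the data or code used in this study, and the criterion for calling a gene duplicated is left to the caller.

```python
from collections import defaultdict


def summarize_by_category(proteins):
    """Return {category: (mean_k, duplicability_percent)}.

    `proteins` is an iterable of (category, k, has_paralog) tuples, where k is
    the number of interaction partners of a protein and has_paralog is True
    when the gene is counted as duplicated (hypothetical record layout).
    """
    ks = defaultdict(list)   # connectivities observed in each category
    dup = defaultdict(int)   # number of duplicated genes in each category
    for category, k, has_paralog in proteins:
        ks[category].append(k)
        if has_paralog:
            dup[category] += 1

    summary = {}
    for category, values in ks.items():
        n = len(values)
        mean_k = sum(values) / n
        one_minus_p = 100.0 * dup[category] / n  # duplicability, 1 - P, in percent
        summary[category] = (mean_k, one_minus_p)
    return summary


# Toy, made-up records for illustration only (not data from this study).
toy = [
    ("cell periphery", 2, True),
    ("cell periphery", 4, True),
    ("cell periphery", 3, False),
    ("nucleus", 8, False),
    ("nucleus", 6, False),
    ("nucleus", 10, True),
    ("nucleus", 4, False),
]
summary = summarize_by_category(toy)
for category, (mean_k, one_minus_p) in sorted(summary.items()):
    print(f"{category}: mean k = {mean_k:.1f}, 1-P = {one_minus_p:.1f}%")
```

Keeping the duplicate flag as an input, rather than computing it inside the function, lets the same summary be reused with any definition of duplicate genes.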
Moreover, the majority of highly duplicated genes in bacterial or multicellular eukaryotic genomes encode various types of membrane or secreted proteins such as membrane transporters, receptors, and secreted signaling molecules (Kondrashov et al. 2002). Together, these results support a higher duplicability for proteins that interact with external environments. Living in habitats where nutrients are often scarce, yeasts inevitably compete among themselves or with other species for limited nutrients. Therefore, duplication of a transport protein may be advantageous because it increases the efficiency of nutrient uptake. Similarly, substrate transport between subcellular compartments, or into or out of the cell, is a basic requirement of eukaryotic cells. In addition to nutrient uptake, yeast transporters play diverse roles such as drug resistance, salt tolerance, control of cell volume, efflux of undesirable metabolites, and sensing of extracellular nutrients (Van Belle and Andre 2001). A high duplicability of transport proteins is also observed in bacterial genomes (Gevers et al. 2004). Therefore, duplication of such a protein may increase the chance of functional specialization or diversification.

Table 6
A Summary of Protein Connectivity (k) and Gene Duplicability (1−P) for Nuclear, Cytoplasmic, and External and Cell Peripheral Proteins Categorized by Functions
Columns: Nuclear Proteins (Functions, kᵃ, 1−Pᵇ); Cytoplasmic Proteins (Functions, k, 1−P); External and Cell Peripheral Proteins (Functions, k, 1−P)
6.9 8.8 12.5 10.3 Cytoplasm Cytoplasmic vesicle 4.8 5.3 20.3 21.9 Cell periphery 4.1 38.4 3.2 7.6 39.4 33.7 2.8 5.1 32.7 33.3 7.6 10.8 5.9 7.2 5.9 8.2 8.2 9.4 9.7 25.0 23.9 13.2 10.7 10.5 10.0 9.6 5.8 5.6 6.6 5.9 5.8 6.8 4.9 4.5 6.9 8.4 9.5 20.0 40.8 20.2 20.0 12.1 23.2 16.8 13.5 9.8
Localizationsᶜ: Nucleus, Nucleolus
Biological processesᵈ: Carbohydrate metabolism; Cell wall and membrane organization and biogenesis; Signal transduction; Protein biosynthesis and catabolism; Transport; Nuclear organization and biogenesis; DNA metabolism; Protein modification; Transcription; RNA metabolism; Ribosome biogenesis and assembly
3.8 4.4 3.1 4.0 39.1 66.7 53.8 60.0
ᵃ Protein connectivity is represented by mean k.
ᵇ Gene duplicability is represented by 1−P. An empty cell in both k and 1−P columns indicates that data are not available or that the total number of proteins is too small.
ᶜ Localization categories differ among columns as indicated.
ᵈ Biological processes are the same for all columns in the same row and are indicated only in the first column. These proteins are localized to the localization categories indicated above.

Using transporter subfamilies characterized phylogenetically (De Hertogh et al. 2002), we find a unique set of transporters in the mitochondrion but a shared set between the cell periphery and the vacuole. In the cell periphery and vacuole, three subfamilies are present in high numbers: the yeast amino acid transporters (YATs), the drug H+ antiporters (DHAs), and the sugar porters (SPs). In particular, the DHAs directly interact with and protect the cell from a number of extracellular compounds that are growth inhibitory or unusual to natural environments (Sá-Correia and Tenreiro 2002). Most DHAs are typically characterized as nonessential due to their functional redundancy and specificity overlap (Rogers et al. 2001; Giaever et al. 2002). Furthermore, these genes are only activated by environmental stress factors. In general, DHAs and a large number of YATs and SPs are undetected under normal growth conditions. The SPs are usually involved in the first step in carbohydrate metabolism after di- and trisaccharides are hydrolyzed outside the cell. Therefore, the variability and efficiencies of transporters directly affect the metabolic and growth rates of yeast.
Furthermore, a high duplicability in yeast metabolism, especially in the central metabolism and upstream of the central metabolism pathways, has been observed (Marland et al. 2004).

Although recent evidence of prevalent partial duplication of yeast protein complexes (i.e., a large fraction of protein complexes show strong homology to others) lends support to functional specialization (Pereira-Leal and Teichmann 2005), how protein connectivity affects gene duplicability is unclear. The preferential attachment model also does not suggest any bias in duplicability by node type (hub vs. nonhub). Our results suggest that highly connected proteins (i.e., hubs) have a low duplicability (fig. 3 and table 6). Despite its high tolerance against random perturbation, the integrity of the protein network relies mainly on its hubs and is sensitive to targeted hub removal (Albert, Jeong, and Barabasi 2000). Indeed, lethality increases threefold if a hub is deleted (Jeong et al. 2001; Han et al. 2004). Along with these observations, a slow evolutionary rate (Fraser 2005) and highly conserved orthologs (Wuchty 2004; Fraser 2005) for hubs suggest a strong selection pressure on them. Duplication of a hub is likely deleterious because it affects a large number of proteins (i.e., a high pleiotropy), especially for a hub whose partners participate in different functions (an intermodule hub). However, the pleiotropy is likely reduced if such a hub is situated within a functional module (an intramodule hub). Recently, however, a greater constraint on intramodule than intermodule hubs was found (Fraser 2005). Below, we discuss this issue further. A hub protein may be part of a large (stable) protein complex; in this case, a dosage increase by a single-gene duplication would likely affect the balance of complex formation (Veitia 2002). A larger proportion of the intramodule hubs (81%) than of the intermodule hubs (18%) are in a complex.
Conversely, the majority of the intermodule hubs are mediators, regulators, or adapters (Han et al. 2004). These intermodule hubs globally integrate signals between functional modules and are likely to localize to various subcellular compartments. Duplication of an intermodule hub can destroy the network integrity and disrupt the informational flow because of a subsequent interaction change or misexpression of a duplicate. Using a small data set characterized by Han et al. (2004), we find that the intermodule hubs show a slightly lower duplicability (12.6%) than the intramodule hubs (16.3%). This is contrary to Fraser’s (2005) observation. Further research is needed to find out whether duplicability of a hub is more constrained within or between functional modules. It is, however, clear that the survivability of duplication of an intramodule or an intermodule hub is usually lower than the average gene duplicability in the genome.

Supplementary Material

Supplementary Table 1S and Figure 1S are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

We thank V. Kunin for sending us data and R. Lusk and M. Chou for their help in the protein interaction data collection, Y.-W. Chang for her help in the gene function classification, and G. Morris, J. Yang, and Z. Gu for helpful discussions. We are grateful to two anonymous reviewers for their valuable comments. This study was supported by the International Balzan Foundation.

Literature Cited

Albert, R., and A. L. Barabasi. 2000. Topology of evolving networks: local events and universality. Phys. Rev. Lett. 85:5234–5237.
Albert, R., H. Jeong, and A. L. Barabasi. 2000. Error and attack tolerance of complex networks. Nature 406:378–382.
Babu, M. M., N. M. Luscombe, L. Aravind, M. Gerstein, and S. A. Teichmann. 2004. Structure and evolution of transcriptional regulatory networks. Curr. Opin. Struct. Biol. 14:283–291.
Bader, J. S., A. Chaudhuri, J. M. Rothberg, and J. Chant. 2004. Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol. 22:78–85.
Barabasi, A. L., and R. Albert. 1999. Emergence of scaling in random networks. Science 286:509–512.
Barabasi, A. L., and Z. N. Oltvai. 2004. Network biology: understanding the cell’s functional organization. Nat. Rev. Genet. 5:101–113.
Cliften, P., P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton, J. Majors, R. Waterston, B. A. Cohen, and M. Johnston. 2003. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301:71–76.
De Hertogh, B., E. Carvajal, E. Talla, B. Dujon, P. Baret, and A. Goffeau. 2002. Phylogenetic classification of transporters and other membrane proteins from Saccharomyces cerevisiae. Funct. Integr. Genomics 2:154–170.
Drees, B. L., B. Sundin, E. Brazeau et al. (22 co-authors). 2001. A protein interaction map for cell polarity development. J. Cell Biol. 154:549–571.
Dujon, B., D. Sherman, G. Fischer et al. (19 co-authors). 2004. Genome evolution in yeasts. Nature 430:35–44.
Francino, M. P. 2005. An adaptive radiation model for the origin of new gene functions. Nat. Genet. 37:573–577.
Fraser, H. B. 2005. Modularity and evolutionary constraint on proteins. Nat. Genet. 37:351–352.
Fromont-Racine, M., A. E. Mayes, A. Brunet-Simon et al. (11 co-authors). 2000. Genome-wide protein interaction screens reveal functional networks involving Sm-like proteins. Yeast 17:95–110.
Gavin, A. C., M. Bosche, R. Krause et al. (38 co-authors). 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415:141–147.
Gevers, D., K. Vandepoele, C. Simillon, and Y. Van de Peer. 2004. Gene duplication and biased functional retention of paralogs in bacterial genomes. Trends Microbiol. 12:148–154.
Ghaemmaghami, S., W. K. Huh, K. Bower, R. W. Howson, A. Belle, N. Dephoure, E. K. O’Shea, and J. S. Weissman. 2003. Global analysis of protein expression in yeast. Nature 425:737–741.
Giaever, G., A. M. Chu, L. Ni et al. (74 co-authors). 2002. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418:387–391.
Gu, Z., L. M. Steinmetz, X. Gu, C. Scharfe, R. W. Davis, and W. H. Li. 2003. Role of duplicate genes in genetic robustness against null mutations. Nature 421:63–66.
Han, J. D., N. Bertin, T. Hao et al. (11 co-authors). 2004. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430:88–93.
Hazkani-Covo, E., E. Y. Levanon, G. Rotman, D. Graur, and A. Novik. 2004. Evolution of multicellularity in Metazoa: comparative analysis of the subcellular localization of proteins in Saccharomyces, Drosophila and Caenorhabditis. Cell Biol. Int. 28:171–178.
He, X., and J. Zhang. 2005. Gene complexity and gene duplicability. Curr. Biol. 15:1016–1021.
Ho, Y., A. Gruhler, A. Heilbut et al. (20 co-authors). 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183.
Huh, W. K., J. V. Falvo, L. C. Gerke, A. S. Carroll, R. W. Howson, J. S. Weissman, and E. K. O’Shea. 2003. Global analysis of protein localization in budding yeast. Nature 425:686–691.
Ito, T., T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. 2001. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA 98:4569–4574.
Jeong, H., S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. 2001. Lethality and centrality in protein networks. Nature 411:41–42.
Kellis, M., N. Patterson, M. Endrizzi, B. Birren, and E. S. Lander. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423:241–254.
Kondrashov, F. A., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Selection in the evolution of gene duplications. Genome Biol. 3:research0008.1–0008.9.
Kunin, V., J. B. Pereira-Leal, and C. A. Ouzounis. 2004. Functional evolution of the yeast protein interaction network. Mol. Biol. Evol. 21:1171–1176.
Marland, E., A. Prachumwat, N. Maltsev, Z. Gu, and W. H. Li. 2004. Higher gene duplicabilities for metabolic proteins than for nonmetabolic proteins in yeast and E. coli. J. Mol. Evol. 59:806–814.
Newman, J. R., E. Wolf, and P. S. Kim. 2000. A computationally directed screen identifying interacting coiled coils from Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA 97:13203–13208.
O’Brien, K. P., M. Remm, and E. L. Sonnhammer. 2005. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 33(Database Issue):D476–D480.
Papp, B., C. Pal, and L. D. Hurst. 2003. Dosage sensitivity and the evolution of gene families in yeast. Nature 424:194–197.
Pastor-Satorras, R., E. Smith, and R. V. Sole. 2003. Evolving protein interaction networks through gene duplication. J. Theor. Biol. 222:199–210.
Pereira-Leal, J. B., and S. A. Teichmann. 2005. Novel specificities emerge by stepwise duplication of functional modules. Genome Res. 15:552–559.
Qin, H., H. H. Lu, W. B. Wu, and W. H. Li. 2003. Evolution of the yeast protein interaction network. Proc. Natl. Acad. Sci. USA 100:12820–12824.
Rogers, B., A. Decottignies, M. Kolaczkowski, E. Carvajal, E. Balzi, and A. Goffeau. 2001. The pleitropic drug ABC transporters from Saccharomyces cerevisiae. J. Mol. Microbiol. Biotechnol. 3:207–214.
Sá-Correia, I., and S. Tenreiro. 2002. The multidrug resistance transporters of the major facilitator superfamily, 6 years after disclosure of Saccharomyces cerevisiae genome sequence. J. Biotechnol. 98:215–226.
Tatusov, R. L., N. D. Fedorova, J. D. Jackson et al. (17 co-authors). 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.
Tong, A. H., B. Drees, G. Nardelli et al. (16 co-authors). 2002. A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 295:321–324.
Uetz, P., L. Giot, G. Cagney et al. (20 co-authors). 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403:623–627.
Van Belle, D., and B. Andre. 2001. A genomic view of yeast membrane transporters. Curr. Opin. Cell Biol. 13:389–398.
Veitia, R. A. 2002. Exploring the etiology of haploinsufficiency. Bioessays 24:175–184.
von Mering, C., R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork. 2002. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417:399–403.
Wuchty, S. 2004. Evolution and topology in the yeast protein interaction network. Genome Res. 14:1310–1314.
Yang, J., R. Lusk, and W. H. Li. 2003. Organismal complexity, protein complexity, and gene duplicability. Proc. Natl. Acad. Sci. USA 100:15661–15665.

Takashi Gojobori, Associate Editor
Accepted August 18, 2005