Cellular crowding imposes global constraints on the chemistry and evolution of proteomes

Emmanuel D. Levy (a,b,c,1), Subhajyoti De (a,d,e), and Sarah A. Teichmann (a,1)

(a) Medical Research Council Laboratory of Molecular Biology, Cambridge CB2 0QH, United Kingdom; (b) Département de Biochimie, Université de Montréal, Montréal, QC, Canada H3T 1J4; (c) Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel; (d) Department of Medicine, University of Colorado School of Medicine, Aurora, CO 80045; and (e) Molecular Oncology Program, University of Colorado Cancer Center, Aurora, CO 80045

In living cells, functional protein–protein interactions compete with a much larger number of nonfunctional, or promiscuous, interactions. Several cellular properties contribute to avoiding unwanted protein interactions, including regulation of gene expression, cellular compartmentalization, and high specificity and affinity of functional interactions. Here we investigate whether other mechanisms exist that shape the sequence and structure of proteins to favor their correct assembly into functional protein complexes. To examine this question, we project evolutionary and cellular abundance information onto 397, 196, and 631 proteins of known 3D structure from Escherichia coli, Saccharomyces cerevisiae, and Homo sapiens, respectively. On the basis of amino acid frequencies in interface patches versus the solvent-accessible protein surface, we define a propensity or “stickiness” scale for each of the 20 amino acids. We find that the propensity to interact in a nonspecific manner is inversely correlated with abundance. In other words, high-abundance proteins have less sticky surfaces. We also find that stickiness constrains protein evolution, whereby residues in sticky surface patches are more conserved than those found in nonsticky patches.
Finally, we find that the constraint imposed by stickiness on protein divergence is proportional to protein abundance, which provides mechanistic insights into the correlation between protein conservation and protein abundance. Overall, the avoidance of nonfunctional interactions significantly influences the physico-chemical and evolutionary properties of proteins. Remarkably, the effects observed are consistently larger in E. coli and S. cerevisiae than in H. sapiens, suggesting that promiscuous protein–protein interactions may be freer to accumulate in the human lineage.

promiscuity | protein structure | interaction potential

The interior of cells is a highly crowded environment where proteins continuously encounter each other (1). Thus, for cells to function properly, it is important that casual encounters do not outweigh functional ones. Statistically, the competition from nonfunctional interactions should be severe (2–4), given that the huge number of possible interactions far outweighs the comparatively small number of functional interactions: the Escherichia coli proteome contains about 4,200 proteins, yielding over 8,000,000 potential distinct pairwise interactions. Eukaryotic proteomes are even larger and require additional mechanisms to minimize the impact of nonfunctional interactions (3, 5, 6). For example, Zhang et al. showed that, in Saccharomyces cerevisiae, the average concentration of coexpressed and colocalized proteins is close to the upper tolerable limit (3), implying that compartmentalization of proteins in time and space was crucial to allow the expansion of eukaryotic protein repertoires.

In addition to cellular mechanisms such as compartmentalization and regulation of protein abundance, shown to be important for intrinsically unstructured proteins, for example (7), specific physicochemical properties contribute to minimizing nonfunctional protein–protein interactions (PPIs).
This has been observed within the protein core (8) and within interface patches (9), which, due to their hydrophobic character, have a potential to mediate nonfunctional interactions. Pechmann et al. showed that interface regions are often aggregation-prone but protected by strategically placed disulfide bonds and salt bridges (9). Such aggregation-prone regions have also been shown to be less frequent among highly expressed proteins, which, according to the law of mass action, are potentially more deleterious to the cell than lowly expressed proteins (10). Importantly, in these studies aggregation is measured along the protein sequence and therefore reflects the potential for aggregation of the unfolded state.

Most previous studies have highlighted “negative-design” principles at known binding regions (9) or examined nonfunctional interactions through aggregation (10–13). In contrast, here we concentrate on the surface regions of proteins in their folded state. Specifically, we ask if the folded state of proteins is evolutionarily constrained by nonfunctional interactions. This means, in particular, that we consider surface residues but not amino acids buried in the protein core, as these cannot be involved in protein–protein interactions. In a molecular evolution-oriented study, Yang et al. recently observed that such surface-specific evolutionary constraints exist in yeast (14). Here we present a complementary analysis that places the emphasis on the physico-chemical properties of proteins associated with constraints from nonfunctional interactions and describe these properties in two additional species to better cover the tree of life. We thus assembled three datasets of proteins of known structure in their biological state (“biological unit”), resulting in 397, 196, and 631 proteins for E. coli, S. cerevisiae, and Homo sapiens, respectively.

www.pnas.org/cgi/doi/10.1073/pnas.1209312109

Results

Defining an Interaction Propensity Scale.
To investigate the impact of promiscuous interactions, we first define an interaction propensity scale to use as a proxy for an amino acid “stickiness” scale. We derive this scale purely from structural data by taking the log ratio of amino acid frequencies observed in protein–protein interfaces versus at the protein surface, as previously defined (15–17) and as illustrated in Fig. 1B. As we consider protein structures in terms of biological units, surface amino acids as defined here are not involved in interfacial protein–protein contacts in the crystal structure. This scale thus reflects a tradeoff between the probability of finding a given amino acid in a solvated environment versus the residue being involved in an interaction with another protein. For example, lysine is frequent at the surface (∼15% of amino acids) but rare in interface core regions (<5% of amino acids), which makes it an interaction-resistant or “nonsticky” amino acid (17). We used only E. coli proteins to derive this scale, but our conclusions are not dependent on the organism used because the scales based on S. cerevisiae and H. sapiens proteins are almost identical to that of E. coli (Rcoli-yeast = 0.94, Rcoli-human = 0.97; Fig. S2).

Author contributions: E.D.L., S.D., and S.A.T. designed research; E.D.L. and S.D. performed research; E.D.L., S.D., and S.A.T. analyzed data; and E.D.L. and S.A.T. wrote the paper.
The authors declare no conflict of interest.
*This Direct Submission article had a prearranged editor.
Freely available online through the PNAS open access option.
Data deposition: The data processed in this paper are available at: www.tinyurl.com/structuralregions.
1 To whom correspondence may be addressed. E-mail: [email protected] or sat@mrc-lmb.cam.ac.uk.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1209312109/-/DCSupplemental.
Edited* by Ken A. Dill, Stony Brook University, Stony Brook, NY, and approved October 30, 2012 (received for review June 1, 2012)
PNAS Early Edition | BIOPHYSICS AND COMPUTATIONAL BIOLOGY

Fig. 1. The solvent-accessible surfaces of high-abundance proteins are enriched in nonsticky amino acids compared with low-abundance proteins. (A) Illustration of the approach taken in this study. (B) We first define a stickiness scale for each amino acid using its interface propensity. The propensity is defined by the log ratio of amino acid frequencies at interfaces versus surfaces. The definition of the structural regions used is explained in more detail in Fig. S1. (C and D) We calculate a stickiness score by averaging interface propensity scores of residues in the region considered (surface or interior). We then plot this score against the abundance of the protein and indicate the Spearman rank correlation coefficients of the relationships, as well as the P value associated with the linear association obtained by analysis of variance. The contour lines mark the 2/6, 3/6, 4/6, and 5/6 percentile of the density function range.
Panel annotations: (A) projection of protein abundance and evolutionary information onto structures, with residues classified as interior, surface, or interface; datasets: E. coli (397 structures), S. cerevisiae (196 structures), H. sapiens (631 structures). (B) amino acid interface propensity = log(freqAA interface / freqAA surface), used as a proxy for an amino acid “stickiness” scale; axis from −1.0 to 1.0, amino acids shown in the order K, E, D, Q, N, P, G, R, A, T, S, H, V, L, W, Y, M, I, C, F. (C) Protein abundance (A.U.) versus surface stickiness: E. coli, R = −0.48, p = 9.4e−10; S. cerevisiae, R = −0.36, p = 7.2e−07; H. sapiens, R = −0.25, p = 2.0e−05. (D) Protein abundance (A.U.) versus interior stickiness: E. coli, R = −0.08, p = 0.4; S. cerevisiae, R = −0.26, p = 0.00018; H. sapiens, R = −0.07, p = 0.33.
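The interface-propensity definition used for the stickiness scale (log ratio of amino acid frequencies at interfaces versus surfaces) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name, the use of the natural logarithm, the pseudocount, and the toy residue lists are assumptions made for illustration, and the toy composition only echoes the lysine example from the text (about 15% at the surface, 5% at interfaces).

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def interface_propensity(interface_residues, surface_residues, pseudocount=1.0):
    """Return {amino acid: log(freq at interfaces / freq at surfaces)}.

    The pseudocount is an assumption of this sketch (it avoids log(0) for
    amino acids absent from a small sample); the natural log is also assumed.
    """
    iface_counts = Counter(interface_residues)
    surf_counts = Counter(surface_residues)
    n_iface = sum(iface_counts[aa] for aa in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    n_surf = sum(surf_counts[aa] for aa in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    scale = {}
    for aa in AMINO_ACIDS:
        freq_iface = (iface_counts[aa] + pseudocount) / n_iface
        freq_surf = (surf_counts[aa] + pseudocount) / n_surf
        scale[aa] = math.log(freq_iface / freq_surf)
    return scale


# Toy data (made up): lysine is common at the surface and rare at interfaces,
# leucine shows the opposite trend, alanine is equally frequent in both.
toy_surface = "K" * 15 + "L" * 5 + "A" * 80
toy_interface = "K" * 5 + "L" * 15 + "A" * 80
toy_scale = interface_propensity(toy_interface, toy_surface)
```

With these toy counts lysine receives a negative score (nonsticky), leucine a positive score (sticky), and alanine a score of zero. Real values would come from the structural datasets, not from this example.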
Fig. 3. The relative evolutionary rate of an amino acid is influenced by the stickiness of its environment. (A) Illustration of the procedure used to calculate the stickiness score of a residue’s environment. We use this score as a proxy for the probability of the central residue to trigger a promiscuous interaction upon mutation. Note that, although the central residue is classified according to its context, its chemical composition remains independent of the context and follows an average surface composition, even for the most sticky category of patches (Fig. S7). (B) An evolutionary conservation ratio is calculated for each surface amino acid. The ratio is equal to the median evolutionary rate of the entire protein divided by the evolutionary rate of the residue. We bin all residues into five classes of equal size and increasing stickiness and show the boxplot distribution of evolutionary rates for each class. In all three organisms, the stickier the environment of a residue, the more the residue is conserved relative to the rest of the protein. Note that in this analysis we consider the conservation of the central residue and not that of the patch surrounding it. P values are calculated using the Wilcoxon test.
Panel annotations: (A) a 400-Å² surface patch around a central residue, shown in a nonsticky context and in a sticky context; the composition of central residues is independent of their context. (B) x axis: average stickiness of the surrounding surface patch; differences indicated: E. coli, 35%; S. cerevisiae, 65%; H. sapiens, 12%; P values shown: 9e−120, 4e−101, 6e−52.

Chemical Constraints on Surfaces of Highly Abundant Proteins. Nonfunctional interactions are, on average, detrimental to fitness because they sequester interaction partners (18). According to the law of mass action, the number of nonfunctional interactions that a protein participates in should be proportional to its abundance (19). Therefore, an abundant protein with a sticky surface is expected to be more deleterious than a low-abundance protein with the same surface stickiness. If cellular crowding and its associated promiscuous interactions were a constraint in cellular systems, we would expect an anticorrelation between protein surface stickiness and protein abundance. We quantified the stickiness of a protein surface as the average of interface-propensity scores, thus reflecting the tendency of its solvent-accessible residues to interact with other protein surfaces (Fig. 1). For all three organisms, we used all of the available experimental data on protein abundance provided by the PaxDb database (http://pax-db.org) (20). These values are linearly proportional to protein copy numbers in cells. Plotting surface stickiness against protein abundance reveals a significant anticorrelation in all three organisms studied (pcoli = 9 × 10^−10, pyeast = 7 × 10^−7, phuman = 2 × 10^−5; Fig. 1C; these and subsequent P values associated with correlations were calculated using the F-statistic obtained by analysis of variance of the linear association between abundance and stickiness). However, the magnitude of the anticorrelation, as measured by the Spearman rank correlation coefficient, varies greatly. The strongest anticorrelation is found in E. coli (R = −0.48), followed by yeast (R = −0.36), followed by human (R = −0.25). This result shows that the surface of highly abundant proteins has adapted to become less sticky and more soluble than that of low-abundance proteins, especially in E. coli and, to a smaller extent, in yeast and humans. The weaker signal in eukaryotes might reflect the fact that eukaryotic cells are more compartmentalized than bacterial cells, which may introduce a bias in the measure of protein concentration approximated here with abundance.
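The two quantities behind this comparison, a per-protein surface stickiness score (the average of interface-propensity scores over surface residues) and its Spearman rank correlation with abundance, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline (the paper reports using R): the function names, the toy scale values, and the toy proteins are hypothetical, and the P values quoted in the text come from an analysis of variance that is not reproduced here.

```python
import math


def mean_stickiness(residues, scale):
    """Average the per-residue stickiness values of a region (e.g., the surface)."""
    values = [scale[aa] for aa in residues if aa in scale]
    if not values:
        raise ValueError("no scorable residues in region")
    return sum(values) / len(values)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scale values and proteins, for illustration only
# (these are NOT the values of Table S2 or any PaxDb entry).
toy_scale = {"K": -1.0, "E": -0.8, "A": 0.0, "L": 0.6, "F": 1.0}
toy_proteins = [("KKEEA", 2000.0), ("KEAAL", 300.0), ("AALLF", 20.0), ("LLFFF", 2.0)]
scores = [mean_stickiness(surface, toy_scale) for surface, _ in toy_proteins]
abundances = [abundance for _, abundance in toy_proteins]
rho = spearman(scores, abundances)
```

In this toy set the stickiest surfaces belong to the least abundant proteins, so the rank correlation is perfectly negative. The real datasets show much weaker, but significant, anticorrelations.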
An analysis of protein stickiness as a function of localization indeed reveals significant differences across different cellular compartments. Interestingly, nuclear proteins are more sticky than the rest of the proteome taken as an average (pcerevisiae = 0.023; psapiens = 0.016) whereas mitochondrial proteins are less sticky (pcerevisiae = 0.0021, psapiens = 0.0045). Remarkably, in H. sapiens the gene ontology (GO) term most enriched in nonsticky proteins is “soluble fraction” (psapiens = 3.6 × 10^−5; Fig. S3).

The amino acid potential provided in this analysis yields results that are significantly different from those obtained on the basis of the commonly used hydrophobicity scale of Kyte and Doolittle (21). When considering this hydrophobicity scale, the association described in Fig. 1 disappears in S. cerevisiae and H. sapiens and greatly weakens in E. coli (Fig. 2 and Fig. S4). We further tested 71 additional scales associated with “hydrophobicity” from the AAindex database (22) (Table S1). Interestingly, the scale of Wimley and White (23) yields the best correlation (R = −0.44) in E. coli, and is based on the transfer of amino acids from a hydrophobic environment (lipid bilayer interface) to water. This is different from the Kyte and Doolittle scale, which is based on measures of transfers of amino acids between two polar environments (e.g., ethanol and water). The similarity between the stickiness scale and the Wimley and White scale may reflect the fact that an interaction resembles more a transfer from water to a hydrophobic environment than a transfer between two relatively polar environments. Fig. S5 provides a comparison of these three scales, and Table S2 presents the values for our stickiness scale.

Current views of protein evolution emphasize stability, which must be maintained to avoid misfolding and thereby prevent loss of function or aggregation (24, 25).
To assess the extent to which the anticorrelation observed here is linked to the unfolded state of the protein, we reproduce the same plots but now consider amino acids at the protein interior instead of the surface (Fig. 1D). For two organisms, the correlation disappears almost entirely when we consider amino acids at the interior. The surface–interior difference is most marked in E. coli, where the correlation vanishes almost completely (R = −0.08) and becomes insignificant (P = 0.4). In humans, the weaker anticorrelation observed in Fig. 1C is also lost with interior amino acids (R = −0.07, P = 0.33), whereas in yeast a weak correlation persists (R = −0.26, P = 4 × 10^−2).

Fig. 2. Protein hydrophobicity is less strongly tuned as a function of abundance than stickiness. We calculate a “hydrophobicity score” for the surface and interior regions of a protein by averaging Kyte and Doolittle hydrophobicity scores of residues in the region (21). We then plot this score against the abundance of the protein and indicate the Spearman rank correlation coefficients of the relationships, as well as the P value associated with the linear association obtained by analysis of variance. The hydrophobicity analysis for all species and surface as well as interior regions is shown in Fig. S4. (Panel x axes: surface hydrophobicity.)

Considering protein length provides a further piece of evidence showing that misassembly rather than misfolding is responsible for the anticorrelation between surface stickiness and abundance. It is known that the small hydrophobic core of short proteins (26) requires compensating mechanisms (27) that increase their stability. In line with this, we find an increase in interior stickiness among small proteins relative to larger proteins for all three species (Fig. S6; pcoli = 2 × 10^−5; pyeast = 0.015; phuman = 2 × 10^−8).
The increased stickiness associated with the core of small proteins suggests that a strong amino acid interaction potential can lead to an increase in stability. Comparatively, however, the lack of association between surface stickiness and protein length (Fig. S6; pcoli = 0.05; pyeast = 0.89; phuman = 0.15) implies that stability is unlikely to drive the evolution of protein surfaces toward nonsticky amino acids.

Taken together, these results suggest that, in addition to selection against misfolding and aggregation of polypeptide chains, avoidance of nonfunctional interactions by folded proteins is an important constraint that is proportional to abundance. Moreover, adaptation to this constraint is achieved through a bias in surface amino acid composition toward nonsticky amino acids.

(Fig. 2 panel annotations: protein abundance (A.U.) versus surface hydrophobicity; E. coli, R = −0.22, p = 0.0056; S. cerevisiae, R = −0.02, p = 0.51. Fig. 3B y axis: rate(protein)/rate(surface residue).)

Surface Stickiness Is an Evolutionary Constraint. To assess whether nonfunctional interactions place a constraint on protein evolution, we study conservation at the amino acid level. We ask whether, within a protein, amino acids surrounded by a sticky environment are more conserved than amino acids surrounded by a nonsticky environment. We computed rates of evolution for each amino acid for all three species and projected these data onto protein structures of each organism (Materials and Methods). In parallel, we calculated a surrounding stickiness score for every surface amino acid of each protein (Fig. 3A). This score is calculated from the amino acid composition of the 400-Å² surface patch surrounding the residue of interest by averaging the stickiness values of its amino acids (note that the stickiness of the central residue is independent from that of the patch). Residues are then binned into five “surrounding stickiness” classes of equal size for each organism, and evolutionary conservation is compared across the five classes (Fig. 3B). We reason that residues in more sticky environments are expected to have a higher probability of triggering nonfunctional interactions upon mutation and on average should be more constrained than those in less sticky environments. Importantly in Fig. 3, the evolutionary rate of each residue is normalized as we divide the rate of the protein by that of each residue. Therefore, the larger the ratio, the more conserved the residue relative to the protein. This shows the clear effect of a residue’s environment stickiness on its degree of conservation relative to the protein: residues in nonsticky environments (leftmost bin) are 35%, 65%, and 12% freer to evolve than residues in stickier environments (rightmost bin) for E. coli, S. cerevisiae, and H. sapiens, respectively. Because these values are obtained after a normalization per protein, they reflect the impact of stickiness on conservation relative to the conservation of the protein. This normalization is necessary to single out the effect of stickiness because lowly expressed proteins are poorly conserved (28) but also carry most of the sticky patches, as shown in Fig. 1C. Interestingly, the weaker adaptation of human proteins against nonfunctional interactions observed in Fig. 1C is reproduced here, as differences in evolutionary conservation across the five probability classes are weakest in the human data set.

It can be argued that the conservation of residues found in sticky surface patches is due to those patches being unknown biological interfaces. However, several pieces of evidence suggest otherwise. First, if this were the case, we would not expect to see such a difference in signal between species (i.e., decreasing signal strength from E. coli to H. sapiens) because functional interfaces should, on average, be conserved in all species. Second, we would expect the central residue within sticky patches to resemble interface amino acids. To assess this, we compared the frequency distribution of amino acids in sticky patches (Fsticky) with that of amino acids at the interface (Finterface) and surface (Fsurface). Because amino acids such as cysteine are rare in all regions, we normalized these distributions by the average frequencies observed in all regions (Ftotal). As expected, the linear regression between (Fsticky/Ftotal) and (Finterface/Ftotal) was not significant (pcoli = 0.27, pcerevisiae = 0.66, psapiens = 0.48). Residues found in sticky patches are in fact nearly identical in their composition to surface residues, as reflected by the highly significant linear regression between (Fsticky/Ftotal) and (Fsurface/Ftotal): pcoli = 3.1e-14, pcerevisiae = 4.1e-14, psapiens = 9.2e-12, as obtained by analysis of variance. These results are detailed in Fig. S7 and show that for biological units in the Protein Data Bank (PDB) the surfaces are largely solvent-exposed as opposed to being involved in cryptic stable interfaces. Considering isolated subunits, however, we observe the opposite because the sticky patches include genuine interfaces. For this data set, the distribution of residues at the center of sticky patches is closer to interface amino acids (Fig. S7, pcoli = 0.019, pcerevisiae = 0.16, psapiens = 0.028) than to surface ones (for the latter, the regression slopes are actually negative: slopecoli = −0.85, slopecerevisiae = −1.02, slopesapiens = −0.19). Finally, the increasing conservation of residues in increasingly sticky environments holds true even at known interfaces, both at the rim and at the core (Fig. S8), showing that, even within protein–protein contact regions, stickiness is controlled. The latter observation supports the notion of negative design (29) in sensitive interface regions (8, 9, 30). Although unknown biological interfaces must exist, these observations make us confident that they are unlikely to contribute significantly to the signal observed.

Fig. 4. The strength of selection against changes in protein stickiness is proportional to protein abundance. (A) Ratio of frequencies of two substitution types: one between charged residues of equal stickiness (D and E) and one between charged residues with a change in stickiness (K and R). The ratio is plotted for five bins of increasing protein abundance, each containing the same number of these charged residues. The sixth bin contains the top 5% abundant proteins. The ratio, r, defined in the figure, increases by 160%, 78%, and 13% in E. coli, S. cerevisiae, and H. sapiens, respectively, for the most abundant proteins relative to the least abundant ones. Thus, substitutions between K and R become less frequent than substitutions between D and E among highly abundant proteins. The red intervals show the SD of the ratios r obtained from 1,000 datasets where abundance data are randomized. (B) Scheme illustrating the constraints from misfolding and promiscuous interactions. Selection against misfolding provides an explanation for the relationship between protein abundance and evolutionary conservation for residues buried in the interior because the deleterious effects of misfolded aggregates increase with abundance. Avoidance of promiscuous interactions provides a further mechanism that explains negative selection proportional to abundance for residues on the solvent-accessible surface of proteins.
Panel annotations: (A) ratio r = (substitution frequency between D and E)/(substitution frequency between K and R), with the frequencies built from the counts ∑(D-E), ∑(E-D), ∑(D-D), ∑(E-E) and ∑(K-R), ∑(R-K), ∑(K-K), ∑(R-R); x axis: protein abundance class (%): 0-20, 20-40, 40-60, 60-80, 80-100, Top 5; y axis: ratio r; one panel each for E. coli, S. cerevisiae, and H. sapiens. (B) Low-abundance and high-abundance proteins with sticky surface, nonsticky surface, and interior regions; mutations Mut1 to Mut4 are each rated Low or High for constraint by promiscuous PPIs and by misfolding toxicity.

Nonfunctional Interactions Might Contribute to the Differential Conservation Between Highly and Lowly Expressed Proteins. In the first part of this study, we observed an anticorrelation between protein abundance and protein surface stickiness. Subsequently we saw that stickiness is correlated with conservation within a protein. This prompts us to ask whether protein stickiness might be involved in the well-established correlation between protein abundance and evolutionary conservation. If so, we would expect low-copy proteins to be more tolerant than abundant proteins to amino acid substitutions that significantly change their surface stickiness. To test this hypothesis, we took advantage of the properties of two pairs of charged amino acids: aspartic (D) and glutamic (E) acids have similar stickiness scores, whereas arginine (R) and lysine (K) do not (Fig. 1B) (15, 17, 31). Arginine is more frequently found at protein–protein interfaces than lysine, making it a stickier amino acid according to our definition. This characteristic enables us to make the following prediction: among high-copy proteins, where significant changes in stickiness have a greater impact, substitutions between K and R should be less frequent than substitutions between D and E. Also, because K, R, E, and D are mostly present at protein surfaces (15, 17), we do not need to restrict ourselves to proteins of known structure and can measure substitution rates from whole proteomes. We thus measured the substitution frequencies between K and R (fK↔R) as well as between D and E (fD↔E) among orthologs of three species pairs: E. coli–Salmonella typhimurium, S. cerevisiae–Saccharomyces paradoxus, and H. sapiens–Mus musculus, as detailed in Table S3. Fig. 4 shows ratios of these frequencies (fD↔E/fK↔R) as a function of protein abundance. Substitutions between K and R are rare among abundant proteins relative to substitutions between D and E. In contrast, among low-copy proteins, both substitution types occur at more comparable frequencies. Interestingly, the magnitude of the effect observed, again, decreases in strength from E. coli (160% change between lowest and highest abundance classes) to yeast (78% change) and to humans (13% change).

Taken together, these observations provide mechanistic insights into the well-established correlation between protein abundance and evolutionary conservation. Although this correlation has been known for over a decade (28), the biological mechanisms associated with it are still not entirely clear. Selection against misfolding can explain part of the correlation (24, 25), where the assumption is that toxicity of misfolded proteins is proportional to their abundance. Our results support the notion that avoidance of promiscuous interactions, or negative pleiotropy (32), represents an additional mechanistic explanation (Fig. 4B).

Discussion

It has been shown previously that mutations tend to arise faster at the protein surface than in the interior (33, 34). In fact, Toth-Petroczy and Tawfik recently showed that mutations at the interior accumulate more rapidly once the surface has drifted sufficiently (35). Therefore, by lowering the tolerance for mutations at the surface, the divergence of the entire protein becomes constrained (35). Promiscuous interactions, which constrain mutations at the surface, could thereby limit the evolutionary rate of the entire protein. This is consistent with the results of a recent study by Yang et al. showing, in a theoretical molecular evolutionary model using S. cerevisiae, that protein misinteraction represents an evolutionary constraint (14). Considering two additional species and taking a complementary approach placing more emphasis on the physico-chemical properties of proteins, we also find that protein misinteractions represent an evolutionary constraint.

We provide a physicochemical rationalization of nonfunctional interactions through the stickiness scale. This scale is significantly different from the Kyte and Doolittle hydrophobic scale, which is commonly used as in, e.g., Yang et al. (14). Our stickiness scale is more similar to the Wimley and White scale, although differences, e.g., between lysine and arginine, suggest that it is important to consider the “interaction” potential of amino acids in interpreting nonfunctional interactions. Interestingly, lysine underrepresentation at nonbiological crystal contacts also supports the notion that lysine and arginine have different potentials to be involved in nonfunctional interactions (17, 36). We thus hope that the stickiness scale proposed here will help to refine models that couple protein chemistry to cellular crowding (5). Furthermore, taken together, the work by Yang et al. and our work suggest that proteins are constrained to avoid nonfunctional interactions, adding to the commonly accepted stability and solubility constraints on the amino acid composition of proteins.

Finally, the impact of promiscuous interactions appears most prominent among the unicellular organisms E. coli and S. cerevisiae. It is thus tempting to speculate that nonfunctional interactions may have accumulated in the human lineage (37) in a similar fashion to the accumulation of noncoding DNA (38). In a further analogy to noncoding DNA, nonfunctional interactions may represent the raw material for exploring and ultimately selecting functional interactions (39, 40) through mechanisms such as colocalization (41). These speculations should nevertheless be considered with care, as the weaker signal observed for H. sapiens may also result from the ill-defined nature of protein abundance in multicellular organisms. Future studies will thus be needed to explore these ideas further and better understand the properties of proteomes across the tree of life.

Methods

Sequence Data. Sequences of proteins and their respective orthologs were aligned with MUSCLE (42). Orthology information was taken from ref. 43 for E. coli and from ENSEMBL v.48 (44) for H. sapiens. Multiple sequence alignments of S. cerevisiae proteins with their orthologs were taken from Wapinsky et al. (45). The details of the species used are in Table S4. Protein multiple alignments were concatenated to obtain three proteome-wide multiple alignments (one for each species). These were used to calculate amino acid evolutionary rates using Rate4Site (46).

Structural Data. Species-specific structures were retrieved by sequence homology. We searched for structures where the sequence from the SEQRES field was similar to proteins from E. coli, S. cerevisiae, or H. sapiens proteomes. We imposed a minimal sequence identity of 90% and a minimum overlap of 70%. We used protein structures from the PDB (47), and the dataset includes all structures present in the second release of 3DComplex (48). All structures for which the biological state was manually annotated in the PiQSi database (49) as “error,” “probable error,” or “undefined” were discarded, as well as all DNA-binding and membrane proteins. Finally, we kept only structures with a resolution below 3 Å.
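The structure-selection criteria listed above (sequence identity of at least 90%, overlap of at least 70%, no PiQSi annotation of “error,” “probable error,” or “undefined,” no DNA-binding or membrane proteins, resolution below 3 Å) can be written as a simple filter. The sketch below is only an illustration: the record type, its field names, and the example entries are hypothetical and do not correspond to any actual PDB, 3DComplex, or PiQSi data format.

```python
from dataclasses import dataclass

# PiQSi annotations that lead to a structure being discarded (from the text).
REJECTED_ANNOTATIONS = {"error", "probable error", "undefined"}


@dataclass
class StructureHit:
    """Hypothetical record for one candidate structure (field names assumed)."""
    label: str
    sequence_identity: float    # fraction of identical residues, 0 to 1
    overlap: float              # fraction of the sequence covered, 0 to 1
    resolution_angstrom: float
    piqsi_annotation: str
    is_dna_binding: bool
    is_membrane: bool


def passes_filters(hit: StructureHit) -> bool:
    """Apply the selection criteria described in the Structural Data section."""
    return (
        hit.sequence_identity >= 0.90
        and hit.overlap >= 0.70
        and hit.piqsi_annotation.lower() not in REJECTED_ANNOTATIONS
        and not hit.is_dna_binding
        and not hit.is_membrane
        and hit.resolution_angstrom < 3.0
    )


# Made-up candidates, for illustration only.
candidates = [
    StructureHit("hit_a", 0.98, 0.95, 1.8, "yes", False, False),
    StructureHit("hit_b", 0.98, 0.95, 3.4, "yes", False, False),       # resolution too low
    StructureHit("hit_c", 0.85, 0.95, 2.0, "yes", False, False),       # identity too low
    StructureHit("hit_d", 0.99, 0.90, 2.1, "probable error", False, False),
    StructureHit("hit_e", 0.99, 0.90, 2.1, "yes", True, False),        # DNA-binding
]
kept = [hit.label for hit in candidates if passes_filters(hit)]
```

Treating the identity and overlap thresholds as inclusive and the resolution threshold as strict follows the wording of the text (“minimal,” “minimum,” “below”), but the exact boundary handling is an assumption.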
A summary of the number of structures per organism and complex type is given in Table S5. Structural regions were defined as in Levy (15). The environment stickiness for a given residue was calculated based on its surrounding residues, i.e., residues with the Cα within a 400-Å² patch centered on the Cα of the residue of interest.

Abundance Data. Protein abundance data were taken from PaxDb (20) (http://pax-db.org). Because of the uncertainty associated with very low abundance proteins, we discarded all proteins with an abundance unit below 1. Statistical analyses and plots were done with R. Data used in this study are available at www.tinyurl.com/structuralregions.

ACKNOWLEDGMENTS. We thank Dan Tawfik, Joël Janin, Eugene Shakhnovich, Sergei Maslov, David Liberles, Joseph Marsh, Eviatar Natan, Gideon Schreiber, and Peter Tompa for their comments on the manuscript. We also thank the two anonymous referees for their constructive comments that significantly helped improve the paper. E.D.L. acknowledges the Human Frontier Science Project for financial support through a long-term fellowship; Stephen Michnick and Université de Montréal for hosting part of this research; and the Weizmann Institute of Science for hosting part of this research. S.D. acknowledges support from the University of Colorado School of Medicine and the National Cancer Institute Physical Sciences Oncology Center initiative (U54-CA143798). E.D.L. and S.A.T. were supported by the Medical Research Council (file Reference U105161047).

1. McGuffee SR, Elcock AH (2010) Diffusion, crowding & protein stability in a dynamic molecular model of the bacterial cytoplasm. PLOS Comput Biol 6(3):e1000694. 2. Janin J (1996) Quantifying biological specificity: The statistical mechanics of molecular recognition. Proteins 25(4):438–445. 3. Zhang J, Maslov S, Shakhnovich EI (2008) Constraints imposed by non-functional protein-protein interactions on gene expression and proteome size. Mol Syst Biol 4:210. 4.
Tompa P, Rose GD (2011) The Levinthal paradox of the interactome. Protein Sci 20(12):2074–2079. 5. Heo M, Maslov S, Shakhnovich E (2011) Topology of protein interaction network shapes protein abundances and strengths of their functional and nonspecific interactions. Proc Natl Acad Sci USA 108(10):4258–4263. 6. Johnson ME, Hummer G (2011) Nonspecific binding limits the number of proteins in a cell and shapes their interaction networks. Proc Natl Acad Sci USA 108(2):603–608. 7. Gsponer J, Futschik ME, Teichmann SA, Babu MM (2008) Tight regulation of unstructured proteins: From transcript synthesis to protein degradation. Science 322 (5906):1365–1368. 8. Fleishman SJ, Baker D (2012) Role of the biomolecular energy gap in protein design, structure, and evolution. Cell 149(2):262–273. 9. Pechmann S, Levy ED, Tartaglia GG, Vendruscolo M (2009) Physicochemical principles that regulate the competition between functional and dysfunctional association of proteins. Proc Natl Acad Sci USA 106(25):10159–10164. 10. Tartaglia GG, Pechmann S, Dobson CM, Vendruscolo M (2009) A relationship between mRNA expression levels and protein solubility in E. coli. J Mol Biol 388(2):381–389. 11. Hamada D, et al. (2009) Competition between folding, native-state dimerisation and amyloid aggregation in beta-lactoglobulin. J Mol Biol 386(3):878–890. 12. Tartaglia GG, Pechmann S, Dobson CM, Vendruscolo M (2007) Life on the edge: A link between gene expression levels and aggregation rates of human proteins. Trends Biochem Sci 32(5):204–206. 13. Münch C, Bertolotti A (2010) Exposure of hydrophobic surfaces initiates aggregation of diverse ALS-causing superoxide dismutase-1 mutants. J Mol Biol 399(3):512–525. 14. Yang JR, Liao BY, Zhuang SM, Zhang J (2012) Protein misinteraction avoidance causes highly expressed proteins to evolve slowly. Proc Natl Acad Sci USA 109(14):E831–E840. 15. Levy ED (2010) A simple definition of structural regions in proteins and its use in analyzing interface evolution. 
J Mol Biol 403(4):660–670. 16. Lo Conte L, Chothia C, Janin J (1999) The atomic structure of protein-protein recognition sites. J Mol Biol 285(5):2177–2198. 17. Janin J, Bahadur RP, Chakrabarti P (2008) Protein-protein interaction and quaternary structure. Q Rev Biophys 41(2):133–180. 18. Vavouri T, Semple JI, Garcia-Verdugo R, Lehner B (2009) Intrinsic protein disorder and interaction promiscuity are widely associated with dosage sensitivity. Cell 138(1): 198–208. 19. Levy ED, Michnick SW, Landry CR (2012) Protein abundance is key to distinguish promiscuous from functional phosphorylation based on evolutionary information. Philos Trans R Soc Lond B Biol Sci 367(1602):2594–2606. 20. Wang M, et al. (2012) PaxDb, a database of protein abundance averages across all three domains of life. Mol Cell Proteomics 11(8):492–500. 21. Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157(1):105–132. 22. Kawashima S, et al. (2008) AAindex: Amino acid index database, progress report 2008. Nucleic Acids Res 36(Database issue):D202–D205. 23. Wimley WC, White SH (1996) Experimentally determined hydrophobicity scale for proteins at membrane interfaces. Nat Struct Biol 3(10):842–848. 24. Yang JR, Zhuang SM, Zhang J (2010) Impact of translational error-induced and errorfree misfolding on the rate of protein evolution. Mol Syst Biol 6:421. 25. Drummond DA, Wilke CO (2008) Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution. Cell 134(2):341–352. 26. Chothia C (1975) Structural invariants in protein folding. Nature 254(5498):304–308. 27. Pereira de Araújo AF, Gomes AL, Bursztyn AA, Shakhnovich EI (2008) Native atomic burials, supplemented by physically motivated hydrogen bond constraints, contain sufficient information to determine the tertiary structure of small globular proteins. Proteins 70(3):971–983. 28. Pál C, Papp B, Hurst LD (2001) Highly expressed genes in yeast evolve slowly. 
Genetics 158(2):927–931. 29. Doye JP, Louis AA, Vendruscolo M (2004) Inhibition of protein crystallization by evolutionary negative design. Phys Biol 1(1–2):9–13. 30. Levin KB, et al. (2009) Following evolutionary paths to protein-protein interactions with high affinity and selectivity. Nat Struct Mol Biol 16(10):1049–1055. 31. MacCallum JL, Tieleman DP (2011) Hydrophobicity scales: A thermodynamic looking glass into lipid-protein interactions. Trends Biochem Sci 36(12):653–662. 32. Liberles DA, Tisdell MD, Grahnen JA (2011) Binding constraints on the evolution of enzymes and signalling proteins: The important role of negative pleiotropy. Proc Biol Sci 278(1714):1930–1935. 33. Sasidharan R, Chothia C (2007) The selection of acceptable protein mutations. Proc Natl Acad Sci USA 104(24):10080–10085. 34. Franzosa EA, Xia Y (2009) Structural determinants of protein evolution are contextsensitive at the residue level. Mol Biol Evol 26(10):2387–2395. 35. Tóth-Petróczy A, Tawfik DS (2011) Slow protein evolutionary rates are dictated by surface-core association. Proc Natl Acad Sci USA 108(27):11151–11156. 36. Cieslik M, Derewenda ZS (2009) The role of entropy and polarity in intermolecular contacts in protein crystals. Acta Crystallogr D Biol Crystallogr 65(Pt 5):500–509. 37. Fernández A, Lynch M (2011) Non-adaptive origins of interactome complexity. Nature 474(7352):502–505. 38. Lynch M (2007) The Origins of Genome Architecture (Sinauer Associates, Inc., Sunderland, MA), pp 494. 39. Tawfik DS (2010) Messy biology and the origins of evolutionary innovations. Nat Chem Biol 6(10):692–696. 40. Nobeli I, Favia AD, Thornton JM (2009) Protein promiscuity and its implications for biotechnology. Nat Biotechnol 27(2):157–167. 41. Kuriyan J, Eisenberg D (2007) The origin of protein interactions and allostery in colocalization. Nature 450(7172):983–990. 42. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797. 
43. Moreno-Hagelsieb G, Janga SC (2008) Operons and the effect of genome redundancy in deciphering functional relationships using phylogenetic profiles. Proteins 70(2):344–352. 44. Flicek P, et al. (2008) Ensembl 2008. Nucleic Acids Res 36(Database issue):D707–D714. 45. Wapinski I, Pfeffer A, Friedman N, Regev A (2007) Natural history and evolutionary principles of gene duplication in fungi. Nature 449(7158):54–61. 46. Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N (2002) Rate4Site: An algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 18(Suppl 1):S71–S77. 47. Berman HM, et al. (2002) The Protein Data Bank. Acta Crystallogr D Biol Crystallogr 58(Pt 6 No 1):899–907. 48. Levy ED, Pereira-Leal JB, Chothia C, Teichmann SA (2006) 3D complex: A structural classification of protein complexes. PLOS Comput Biol 2(11):e155. 49. Levy ED (2007) PiQSi: Protein quaternary structure investigation. Structure 15(11):1364–1367.

Supporting Information
Levy et al. 10.1073/pnas.1209312109

[Fig. S1 schematic: cross-section of a protein subunit and its interacting partner, with the interior, surface, support, rim, and core regions labeled.]
INTERIOR: rASAc < 25% and ΔrASA = 0
SURFACE: rASAc > 25% and ΔrASA = 0
SUPPORT: ΔrASA > 0 and rASAm < 25%
RIM: ΔrASA > 0 and rASAc > 25%
CORE: ΔrASA > 0 and rASAm > 25% and rASAc < 25%
ΔrASA = rASAm − rASAc; rASAm = relative ASA in monomer; rASAc = relative ASA in complex.

Fig. S1. The regions of protein structure used in this study. We use the definitions of interface, surface, interface support, rim, and core as defined in Levy (1). Amino acid interface propensities are computed as the log ratios of their frequencies at the interface core (orange) and surface (blue).
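The region definitions of Fig. S1 amount to a small decision rule on two numbers per residue. The sketch below restates that rule in Python for illustration; the function name, the use of fractions rather than percentages, the tolerance `eps` for testing ΔrASA = 0, and the treatment of values exactly equal to 25% (which the figure leaves unspecified) are assumptions.

```python
def classify_region(rasa_monomer: float, rasa_complex: float,
                    threshold: float = 0.25, eps: float = 1e-9) -> str:
    """Assign a residue to a structural region from its relative ASA values.

    rasa_monomer: relative accessible surface area in the isolated subunit (0..1)
    rasa_complex: relative accessible surface area in the complex (0..1)
    """
    delta = rasa_monomer - rasa_complex   # ΔrASA = rASAm − rASAc
    if delta <= eps:
        # No accessibility is lost on complex formation: not an interface residue.
        # Assumption: a value exactly at the threshold counts as surface.
        return "interior" if rasa_complex < threshold else "surface"
    if rasa_monomer < threshold:
        return "support"                  # buried already in the monomer
    if rasa_complex > threshold:
        return "rim"                      # still exposed in the complex
    return "core"                         # exposed in monomer, buried in complex


examples = {
    "interior": classify_region(0.10, 0.10),
    "surface": classify_region(0.60, 0.60),
    "support": classify_region(0.20, 0.05),
    "rim": classify_region(0.70, 0.40),
    "core": classify_region(0.50, 0.10),
}
```

Each key of `examples` names the region expected for the pair of values passed to the function, so the dictionary doubles as a quick consistency check of the rule.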
1. Levy ED (2010) A simple definition of structural regions in proteins and its use in analyzing interface evolution. J Mol Biol 403(4):660–670.

[Fig. S2 plots: interface core propensity relative to surface for each of the 20 amino acids in S. cerevisiae and H. sapiens (upper panels); S. cerevisiae propensities and H. sapiens propensities plotted against E. coli propensities (lower panels). All axes span −1.5 to 1.5.]

Fig. S2. Amino acid interface propensities are similar across distant species. (Upper panels) Residue interface-to-surface propensities for Saccharomyces cerevisiae and Homo sapiens. (Lower panels) S. cerevisiae and H. sapiens residue interface propensities are very similar to those in Escherichia coli.

Levy et al. www.pnas.org/cgi/content/short/1209312109 1 of 9

[Fig. S3 plots: median stickiness score of protein surfaces for each GO cellular component category (for example organelle lumen, cytosol, mitochondrial part, endomembrane system, nucleus, and protein complex) in S. cerevisiae and H. sapiens, with the Z-score and P value of each category, the top 10% and lowest 10% bars, and the standard deviation obtained on simulated random data.]

Fig. S3. Changes in protein surface stickiness as a function of subcellular localization. Stickiness scores of surface residues are binned according to the gene ontology (GO) annotation of the protein to which they correspond. Note that, for proteins with multiple GO annotations, residues are counted several times. For each bin or GO category, the median stickiness score is shown. The red lines are the SDs of the medians when GO annotations are shuffled (100,000 iterations). The first two bars are the scores of the top and lowest 10% quantiles. This illustrates the similarity in stickiness across different cellular compartments. There are, however, significant differences which, remarkably, are conserved across Saccharomyces cerevisiae and Homo sapiens; e.g., proteins in mitochondria tend to be less sticky than average (p_cerevisiae = 0.023, p_sapiens = 0.016), whereas nuclear proteins tend to be more sticky (p_cerevisiae = 0.0021, p_sapiens = 0.0045). The least sticky GO cellular component is the “soluble fraction” (p_sapiens = 3.6 × 10^−5).
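The shuffling control described in the Fig. S3 legend can be sketched as a label permutation: the median score of one category is compared with the medians obtained after shuffling the annotations. This is an illustrative simplification, not the authors' R code. It assumes one annotation per residue (the legend notes that residues can be counted several times), uses far fewer iterations than the 100,000 reported, and defines the Z-score as (observed − null mean) / null SD, which is an assumption. The function name and the toy data are hypothetical.

```python
import random
import statistics


def shuffled_median_stats(scores, labels, category, n_iter=1000, seed=0):
    """Return (observed median, null mean, null SD, Z-score) for one category."""
    rng = random.Random(seed)
    observed = statistics.median(
        s for s, lab in zip(scores, labels) if lab == category)

    shuffled = list(labels)
    null_medians = []
    for _ in range(n_iter):
        rng.shuffle(shuffled)  # break the link between scores and annotations
        null_medians.append(statistics.median(
            s for s, lab in zip(scores, shuffled) if lab == category))

    null_mean = statistics.fmean(null_medians)
    null_sd = statistics.pstdev(null_medians)
    z_score = (observed - null_mean) / null_sd
    return observed, null_mean, null_sd, z_score


# Toy stickiness scores, invented for illustration only.
scores = [-0.5, -0.4, -0.3, 0.1, 0.2, 0.3]
labels = ["mitochondrion"] * 3 + ["nucleus"] * 3
observed, null_mean, null_sd, z_score = shuffled_median_stats(
    scores, labels, "mitochondrion", n_iter=200)
```

In this toy example the "mitochondrion" residues were deliberately given the lowest scores, so the observed median lies below the shuffled medians and the Z-score is negative.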
Fig. S4. Protein hydrophobicity is less strongly tuned as a function of abundance than stickiness. We calculate a “hydrophobicity score” for the surface and interior regions of a protein by averaging hydrophobicity scores (1) of residues in the region. We then plot this score against the abundance of the protein and indicate the Spearman rank correlation coefficients of the relationships, as well as the P value associated with the linear association obtained by analysis of variance.

1. Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157(1):105–132.

[Fig. S5 plots: the stickiness scale plotted against Kyte & Doolittle hydrophobicity and against Wimley & White hydrophobicity, with one labeled point per amino acid.]

Fig. S5. Comparison of the stickiness scale with the Kyte and Doolittle scale and the Wimley and White scale. The stickiness scale is distinct from the other two scales, particularly with respect to K, R, and W.

[Fig. S6 plots: interior stickiness and surface stickiness versus protein length for E. coli, S. cerevisiae, and H. sapiens. Panel annotations, in extracted order: R = −0.27, p = 0.015; R = −0.12, p = 0.05; R = −0.41, p = 2e-8; R = −0.08, p = 0.89; R = −0.1, p = 0.15; R = −0.36, p = 2e-5.]

Fig. S6. The stickiness scale captures information on protein stability and suggests that surface amino acids contribute little to stability compared with interior amino acids. Small proteins have a smaller volume-to-surface ratio (1) and thus a smaller hydrophobic core than large proteins.
To meet the stability requirement, small proteins are therefore expected to exhibit particular properties (2). We observed in Fig. 1 that abundant proteins have a decreased surface stickiness. To test whether this observation may be linked to stability, we compare the stickiness of surface (Lower panels) and interior (Upper panels) amino acids as a function of protein length, which should be linked to stability. Interior stickiness appears to be one such property, because small proteins have a stickier interior. However, surface amino acids appear to have comparatively little to no influence on stability, as the correlation between stickiness and protein length tends to disappear for surface amino acids.

1. Chothia C (1975) Structural invariants in protein folding. Nature 254(5498):304–308. 2. Pereira de Araújo AF, Gomes AL, Bursztyn AA, Shakhnovich EI (2008) Native atomic burials, supplemented by physically motivated hydrogen bond constraints, contain sufficient information to determine the tertiary structure of small globular proteins. Proteins 70(3):971–983.

[Fig. S7 plots: frequency (y axis, 0 to 12) of each of the 20 amino acids (x axis) for E. coli, S. cerevisiae, and H. sapiens. Series: a.a. interface frequency; a.a. surface frequency of Biological Units; central a.a. frequency in the 10% most sticky patches at known biological interfaces; central a.a. frequency in the 10% most sticky patches at the surface of single chains; central a.a. frequency in the 10% most sticky patches at the surface of Biological Units.]

Fig. S7. Residue frequencies are independent of the stickiness of the 400-Å² surrounding surface patch. This implies that sticky patches on surfaces are unlikely to be cryptic interfaces. The 10% most “interface-like” patches found at the protein surface were identified, and their central residue was recorded.
The frequencies of these central residues (dark blue) closely follow those of surface residues (light blue) rather than the frequencies of interface residues (yellow). Therefore, highly sticky patches are unlikely to represent unknown biological protein interfaces. To further verify that this strategy can detect interfaces, we carried out the same analysis but considered the entire protein surface (including known interfaces). When these real interfaces are included, the stickiness of central residues shifts radically and becomes close to that of known interfaces (gray).

[Fig. S8 plots: boxplots for E. coli, S. cerevisiae, and H. sapiens. y axis (0 to 4): rate(protein)/rate(interface rim residue) in A and rate(protein)/rate(interface core residue) in B; x axis: propensity of a residue to trigger a promiscuous interaction upon mutation. Panels are annotated with P values and with percentage differences (9%, 65%, 59%).]

Fig. S8. The divergence rate of residues at interface cores depends on the stickiness of their environment. (A and B) Interface core residues are grouped into five classes on the basis of their surrounding 400-Å² environment, i.e., from residues in nonsticky environments (gray) to those in sticky environments (orange). The boxplot distribution of evolutionary rates of residues is shown for each group. Importantly, the rate of each residue is normalized by the median rate of the entire protein. All classes have a median (black thick line) greater than 1 because interface cores are in general more conserved than the rest of the protein.
The significant difference in rates between the first and last bins indicates that the environment of a residue correlates with the divergence rate. A possible explanation for this observation is that residues in highly sticky environments are more likely to trigger nonfunctional interactions if they mutate and are therefore subject to stronger selective pressure. Consistent with the trend observed throughout this work, the difference is less marked among Homo sapiens proteins, suggesting that there is more tolerance for promiscuous interactions in human than in yeast and Escherichia coli. P values were computed using the Wilcoxon test.

Table S1. Correlation between protein abundance and protein surface scores

Scale ID WIMW960101 PONP800103 PONP800108 ROSG850102 PONP800102 GOLD730101 KANM800102 PONP800101 BLAS910101 ROSG850101 VENT840101 LAWE840101 KANM800104 PONP800106 MANP780101 CHAM820101 WILM950101 NAKH900105 NAKH900112 ARGP820101 JOND750101 ZIMJ680101 PONP800107 GOLD730102 BIGC670101 COWR900101 NAKH900103 NAKH900111 CHAM820102 WILM950102 WILM950103 WOLR790101 YUTK870102 FAUJ830101 PONP800105 PONP930101 NAKH900107 CIDH920101 NAKH900109 NAKH900108 WILM950104 JURD980101 KIDA850101 EISD860103 EISD860102 YUTK870101 FASG890101 EISD840101 CIDH920103 CIDH920105 CIDH920104 PONP800104 KANM800103 CIDH920102 SWER830101 NAKH900104 NAKH900106 YUTK870104 NAKH900101 YUTK870103 Description E. coli S. cerevisiae H.
sapiens Free energies of transfer of acwl-x-ll peptides from bilayer interface to water (1) Average gain ratio in surrounding hydrophobicity (2) Average number of surrounding residues (2) Mean fractional area loss (3) Average gain in surrounding hydrophobicity (2) Hydrophobicity factor (4) Average relative probability of β-sheet (5) Surrounding hydrophobicity in folded form (2) Scaled side-chain hydrophobicity values (6) Mean area buried on transfer (3) Bitterness (7) Transfer free energy between N-cyclohexyl-2-pyrrolidone and water (8) Average relative probability of inner β-sheet (5) Surrounding hydrophobicity in turn (2) Average surrounding hydrophobicity (9) Polarizability parameter (10) Hydrophobicity coefficient in reverse phase high-performance liquid chromatography (rp-hplc) (11) Amino acid composition of mitochondrial proteins from animal (12) Transmembrane regions of mitochondrial proteins (12) Hydrophobicity index (13) Hydrophobicity (14) Hydrophobicity (15) Accessibility reduction ratio (2) Residue volume (4) Residue volume (16) Hydrophobicity index, 3.0 ph (17) Amino acid composition of mt proteins (12) Transmembrane regions of non-mt proteins (12) Free energy of solution in water (kcal/mol) (10) Hydrophobicity coefficient in rp-hplc (11) Hydrophobicity coefficient in rp-hplc (11) Hydrophobicity index (18) unfolding gibbs energy in water, ph 9.0 (19) Hydrophobic parameter π (20) Surrounding hydrophobicity in β-sheet (2) Hydrophobicity scales (21) Amino acid composition of mt proteins from fungi and plant (12) Normalized hydrophobicity scales for α-proteins (22) Amino acid composition of membrane proteins (12) Normalized composition from fungi and plant (12) Hydrophobicity coefficient in rp-hplc (11) Modified Kyte–Doolittle hydrophobicity scale (23) Hydrophobicity-related index (24) Direction of hydrophobic moment (25) Atom-based hydrophobic moment (25) Unfolding Gibbs energy in water, ph 7.0 (19) Hydrophobicity index (26) Consensus normalized 
hydrophobicity scale (27) Normalized hydrophobicity scales for α+β-proteins (22) Normalized average hydrophobicity scales (22) Normalized hydrophobicity scales for α/β-proteins (22) Surrounding hydrophobicity in α-helix (2) Average relative probability of inner helix (5) Normalized hydrophobicity scales for β-proteins (22) Optimal matching hydrophobicity (28) Normalized composition of mt proteins (12) Normalized composition from animal (12) Activation Gibbs energy of unfolding, ph 9.0 (19) Amino acid composition of total proteins (12) Activation Gibbs energy of unfolding, ph 7.0 (19) −0.44 −0.28 −0.17 −0.44 −0.42 −0.41 −0.39 −0.35 −0.35 −0.32 −0.32 −0.27 −0.26 −0.26 −0.27 −0.24 −0.26 −0.28 −0.27 −0.29 −0.27 −0.11 −0.47 −0.39 −0.19 −0.22 −0.21 −0.21 −0.21 −0.13 −0.16 −0.19 −0.10 −0.28 −0.15 −0.11 −0.25 −0.25 −0.23 −0.21 −0.20 −0.22 −0.33 −0.20 −0.43 −0.39 −0.16 −0.28 −0.15 −0.23 −0.19 −0.19 −0.18 −0.18 −0.18 −0.17 −0.17 −0.17 −0.16 −0.16 −0.15 −0.15 −0.14 −0.13 −0.13 −0.10 −0.09 −0.09 −0.08 −0.07 −0.05 −0.02 0.00 0.01 0.02 0.03 0.03 0.05 0.05 0.05 0.05 0.06 0.08 0.08 0.08 0.12 0.12 0.13 0.13 0.17 0.18 0.18 0.19 0.20 −0.16 −0.15 −0.19 −0.19 −0.17 −0.21 −0.39 −0.41 −0.31 −0.18 0.06 −0.17 −0.23 −0.08 0.01 −0.09 −0.39 0.01 −0.22 −0.20 −0.19 0.24 −0.31 −0.27 −0.24 −0.17 −0.39 −0.10 −0.11 −0.28 −0.13 −0.31 −0.37 −0.34 0.17 0.05 −0.31 −0.32 −0.23 −0.14 0.23 0.32 0.24 −0.06 −0.07 −0.09 −0.09 −0.09 −0.14 −0.19 −0.19 −0.18 −0.04 −0.05 −0.12 −0.05 −0.16 −0.15 −0.01 −0.08 −0.05 0.00 0.02 −0.05 0.06 −0.06 −0.06 −0.10 −0.06 −0.13 −0.02 0.05 −0.07 −0.04 0.04 −0.05 −0.04 0.07 −0.02 −0.04 −0.06 −0.01 0.00 0.17 0.20 0.17 Levy et al. www.pnas.org/cgi/content/short/1209312109 6 of 9 Table S1. Cont. 
Scale ID     Description                                                E. coli   S. cerevisiae   H. sapiens
NAKH900102   SD of amino acid composition of total proteins (12)        0.20      0.42            0.17
RACS770101   Average reduced distance for c-α (29)                      0.21      0.34            0.17
NAKH900113   Ratio of average and computed composition (12)             0.22      0.34            0.15
KANM800101   Average relative probability of helix (5)                  0.24      0.13            0.04
LEVM760101   Hydrophobic parameter (30)                                 0.26      −0.02           0.04
CASG920101   Hydrophobicity scale from native protein structures (31)   0.29      −0.02           0.11
NAKH900110   Normalized composition of membrane proteins (12)           0.30      0.08            0.06
PRAM900101   Hydrophobicity (32)                                        0.31      0.00            0.06
RACS770103   Side chain orientational preference (29)                   0.31      0.23            0.16
ENGD860101   Hydrophobicity index (33)                                  0.31      0.00            0.06
RACS770102   Average reduced distance for side chain (29)               0.34      0.28            0.18

Scores were computed using scales from the AAindex database instead of the “stickiness” scale.

1. Wimley WC, White SH (1996) Experimentally determined hydrophobicity scale for proteins at membrane interfaces. Nature Structural Biology 3(10):842–848. 2. Ponnuswamy PK, Prabhakaran M, Manavalan P (1980) Hydrophobic packing and spatial arrangement of amino acid residues in globular proteins. Biochim Biophys Acta 623(2):301–316. 3. Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH (1985) Hydrophobicity of amino acid residues in globular proteins. Science 229(4716):834–838. 4. Goldsack DE, Chalifoux RC (1973) Contribution of the free energy of mixing of hydrophobic side chains to the stability of the tertiary structure of proteins. Journal of Theoretical Biology 39(3):645–651. 5. Kanehisa MI, Tsong TY (1980) Local hydrophobicity stabilizes secondary structures in proteins. Biopolymers 19(9):1617–1628. 6. Black SD, Mould DR (1991) Development of hydrophobicity parameters to analyze proteins which bear post- or cotranslational modifications. Anal Biochem 193(1):72–82. 7. Venanzi TJ (1984) Hydrophobicity parameters and the bitter taste of L-amino acids. Journal of Theoretical Biology 111(3):447–450.
8. Lawson EQ, et al. (1984) A simple experimental model for hydrophobic interactions in proteins. J Biol Chem 259(5):2910–2912. 9. Manavalan P, Ponnuswamy PK (1978) Hydrophobic character of amino acid residues in globular proteins. Nature 275(5681):673–674. 10. Charton M, Charton BI (1982) The structural dependence of amino acid hydrophobicity parameters. Journal of Theoretical Biology 99(4):629–644. 11. Wilce MCJ, Aguilar M-I, Hearn MTW (1995) Physicochemical basis of amino acid hydrophobicity scales: Evaluation of four new scales of amino acid hydrophobicity coefficients derived from RP-HPLC of peptides. Analytical Chemistry 67(7):1210–1219. 12. Nakashima H, Nishikawa K, Ooi T (1990) Distinct character in hydrophobicity of amino acid compositions of mitochondrial proteins. Proteins: Structure, Function, and Bioinformatics 8(2):173–178. 13. Argos P, Rao JK, Hargrave PA (1982) Structural prediction of membrane-bound proteins. Eur J Biochem 128(2-3):565–575. 14. Jones DD (1975) Amino acid properties and side-chain orientation in proteins: a cross correlation appraoch. Journal of Theoretical Biology 50(1):167–183. 15. Zimmerman JM, Eliezer N, Simha R (1968) The characterization of amino acid sequences in proteins by statistical methods. Journal of Theoretical Biology 21(2):170–201. 16. Bigelow CC (1967) On the average hydrophobicity of proteins and the relation between it and protein structure. Journal of Theoretical Biology 16(2):187–211. 17. Cowan R, Whittaker RG (1990) Hydrophobicity indices for amino acid residues as determined by high-performance liquid chromatography. Peptide Research 3(2):75–80. 18. Wolfenden RV, Cullis PM, Southgate CC (1979) Water, protein folding, and the genetic code. Science 206(4418):575–577. 19. Yutani K, Ogasahara K, Tsujita T, Sugino Y (1987) Dependence of conformational stability on hydrophobicity of the amino acid residue in a series of variant proteins substituted at a unique position of tryptophan synthase alpha subunit. 
Proc Natl Acad Sci USA 84(13):4441–4444. 20. Fauchere JL, Pliska V (1983) Hydrophobic parameters pi of amino acid side chains from the partitioning of N-acetyl-amino acid amides. Eur J Med Chem 18:369–375. 21. Ponnuswamy PK (1993) Hydrophobic characteristics of folded proteins. Prog Biophys Mol Biol 59(1):57–103. 22. Cid H, Bunster M, Canales M, Gazitua F (1992) Hydrophobicity and structural classes in proteins. Protein Eng 5(5):373–375. 23. Juretic D, Lucic B, Zucic D, Trinajstic N (1998) Protein transmembrane structure: recognition and prediction by using hydrophobicity scales through preference functions. Theoretical and Computational Chemistry, ed Cyril P (Elsevier), Vol 5, pp 405–445. 24. Kidera A, Konishi Y, Oka M, Ooi T, Scheraga H (1985) Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J Protein Chem 4(1):23–55. 25. Eisenberg D, McLachlan AD (1986) Solvation energy in protein folding and binding. Nature 319(6050):199–203. 26. Fasman GD (1989) Prediction of protein structure and the principles of protein conformation (Plenum, New York). 27. Eisenberg D (1984) Three-dimensional structure of membrane and surface proteins. Annu Rev Biochem 53:595–623. 28. Sweet RM, Eisenberg D (1983) Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure. J Mol Biol 171(4):479–488. 29. Rackovsky S, Scheraga HA (1977) Hydrophobicity, hydrophilicity, and the radial and orientational distributions of residues in native proteins. Proc Natl Acad Sci USA 74(12):5248–5251. 30. Levitt M (1976) A simplified representation of protein conformations for rapid simulation of protein folding. J Mol Biol 104(1):59–107. 31. Casari G, Sippl MJ (1992) Structure-derived hydrophobic potential. Hydrophobic potential derived from X-ray structures of globular proteins is able to identify native folds. J Mol Biol 224(3):725–732. 32. 
Prabhakaran M (1990) The distribution of physical, chemical and conformational properties in signal and nascent peptides. Biochem J 269(3):691–696. 33. Engelman DM, Steitz TA, Goldman A (1986) Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annual Review of Biophysics and Biophysical Chemistry 15:321–353. 34. Kawashima S, et al. (2008) AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 36(Database issue):D202–D205.

Table S2. Interface propensities used as the stickiness score

Amino acid   Interface propensity
A            0.0062
C            1.0372
D            −0.7485
E            −0.7893
F            1.2727
G            −0.1771
H            0.1204
I            1.1109
K            −1.1806
L            0.9138
M            1.0124
N            −0.2693
P            −0.1799
Q            −0.4114
R            −0.0876
S            0.1376
T            0.1031
V            0.7599
W            0.7925
Y            0.8806

Table S3. Information on species pairs used for the arginine–lysine and aspartic acid–glutamic acid substitution frequency analysis

Species 1                                 Escherichia coli         Saccharomyces cerevisiae   Homo sapiens
Species 2                                 Salmonella typhimurium   Saccharomyces paradoxus    Mus musculus
No. orthologous pairs                     3,102 (868)              3,798 (3,529)              7,314 (4,435)
Average % sequence divergence             13.8 (10.4)              8.0 (7.4)                  11.3 (10.8)
No. conserved D, E, R, or K               180,634 (65,231)         463,813 (440,319)          810,931 (525,740)
No. of substitutions between D-E or R-K   8,818 (2,743)            14,822 (13,603)            34,766 (21,876)

The numbers in parentheses refer to the subset of proteins with known abundance data.

Table S4. Species used in alignments of orthologous proteins

Escherichia coli set (no. orthogroups with at least 10/13 species and a known structure = 397): E. coli (K12 MG1655), Vibrio parahaemolyticus, Burkholderia cenocepacia (J2315), Proteus mirabilis, Pseudomonas fluorescens (SBW25), Shewanella baltica (OS223), Serratia proteamaculans (568), Aeromonas salmonicida (A449), Salmonella enterica Enteritidis (P125109), Pseudomonas aeruginosa (LESB58), Salmonella typhimurium (LT2), Yersinia enterocolitica (8081), Aeromonas hydrophila (ATCC7966)

Saccharomyces cerevisiae set (no. orthogroups with at least 13/15 species and a known structure = 196): S. cerevisiae, Saccharomyces paradoxus, Saccharomyces mikatae, Saccharomyces bayanus, Candida glabrata, Saccharomyces castellii, Kluyveromyces lactis, Ashbya gossypii, Kluyveromyces waltii, Debaryomyces hansenii, Candida albicans, Yarrowia lipolytica, Candida tropicalis, Candida guilliermondii, Candida lusitaniae

Homo sapiens set (no. orthogroups with at least 8/9 species and a known structure = 701): H. sapiens, Mus musculus, Gallus gallus, Pan troglodytes, Rattus norvegicus, Danio rerio, Xenopus tropicalis, Bos taurus, Malus domestica

Proteins within an orthologous group were aligned with MUSCLE (1), and evolutionary rates of each amino acid were calculated using Rate4Site (2).

1. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797. 2. Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N (2002) Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics (Oxford, England) 18(Suppl 1):S71–S77.

Table S5. General statistics on the structural datasets used

Species name               No. structural chains
Escherichia coli           397 (172)
Saccharomyces cerevisiae   196 (193)
Homo sapiens               631 (495)

Numbers in parentheses correspond to the number of structures with corresponding abundance data available.
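As a worked illustration of how the interface propensities of Table S2 can serve as a stickiness scale, the sketch below averages them over the residues of a region. Averaging follows the way the hydrophobicity score is described in the Fig. S4 legend; applying the same rule to stickiness is an assumption here, and the function name and example sequences are hypothetical.

```python
# Interface propensities from Table S2, keyed by one-letter amino acid code.
STICKINESS = {
    "A": 0.0062, "C": 1.0372, "D": -0.7485, "E": -0.7893, "F": 1.2727,
    "G": -0.1771, "H": 0.1204, "I": 1.1109, "K": -1.1806, "L": 0.9138,
    "M": 1.0124, "N": -0.2693, "P": -0.1799, "Q": -0.4114, "R": -0.0876,
    "S": 0.1376, "T": 0.1031, "V": 0.7599, "W": 0.7925, "Y": 0.8806,
}


def region_stickiness(residues: str) -> float:
    """Mean Table S2 propensity over the residues of one region.

    `residues` is a string of one-letter codes, e.g. the surface residues
    of a protein. Averaging is assumed by analogy with Fig. S4.
    """
    if not residues:
        raise ValueError("region contains no residues")
    return sum(STICKINESS[aa] for aa in residues) / len(residues)


# Made-up residue sets: a lysine/glutamate-rich surface versus an aromatic one.
charged_surface = region_stickiness("KKEEDK")
aromatic_surface = region_stickiness("FWYLIV")
```

With these toy inputs the charged surface scores well below zero and the aromatic surface well above, which is the contrast the scale is meant to capture.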