Cellular crowding imposes global constraints on
the chemistry and evolution of proteomes
Emmanuel D. Levy (a,b,c,1), Subhajyoti De (a,d,e), and Sarah A. Teichmann (a,1)
(a) Medical Research Council Laboratory of Molecular Biology, Cambridge CB2 0QH, United Kingdom; (b) Département de Biochimie, Université de Montréal, Montréal, QC, Canada H3T 1J4; (c) Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel; (d) Department of Medicine, University of Colorado School of Medicine, Aurora, CO 80045; and (e) Molecular Oncology Program, University of Colorado Cancer Center, Aurora, CO 80045
In living cells, functional protein–protein interactions compete with
a much larger number of nonfunctional, or promiscuous, interactions. Several cellular properties contribute to avoiding unwanted
protein interactions, including regulation of gene expression, cellular compartmentalization, and high specificity and affinity of functional interactions. Here we investigate whether other mechanisms
exist that shape the sequence and structure of proteins to favor
their correct assembly into functional protein complexes. To examine this question, we project evolutionary and cellular abundance
information onto 397, 196, and 631 proteins of known 3D structure
from Escherichia coli, Saccharomyces cerevisiae, and Homo sapiens,
respectively. On the basis of amino acid frequencies in interface
patches versus the solvent-accessible protein surface, we define
a propensity or “stickiness” scale for each of the 20 amino acids.
We find that the propensity to interact in a nonspecific manner is
inversely correlated with abundance. In other words, high abundance proteins have less sticky surfaces. We also find that stickiness
constrains protein evolution, whereby residues in sticky surface
patches are more conserved than those found in nonsticky patches.
Finally, we find that the constraint imposed by stickiness on protein
divergence is proportional to protein abundance, which provides
mechanistic insights into the correlation between protein conservation and protein abundance. Overall, the avoidance of nonfunctional
interactions significantly influences the physico-chemical and evolutionary properties of proteins. Remarkably, the effects observed are
consistently larger in E. coli and S. cerevisiae than in H. sapiens,
suggesting that promiscuous protein–protein interactions may be
freer to accumulate in the human lineage.
promiscuity | protein structure | interaction potential
The interior of cells is a highly crowded environment where
proteins continuously encounter each other (1). Thus, for
cells to function properly, it is important that casual encounters
do not outweigh functional ones. Statistically, the competition
from nonfunctional interactions should be severe (2–4), given
that the huge number of possible interactions far outweighs the
comparatively small number of functional interactions: the
Escherichia coli proteome contains about 4,200 proteins, yielding
over 8,000,000 potential distinct pairwise interactions. Eukaryotic proteomes are even larger and require additional mechanisms to minimize the impact of nonfunctional interactions (3, 5,
6). For example, Zhang et al. showed that, in Saccharomyces
cerevisiae, the average concentration of coexpressed and colocalized proteins is close to the upper tolerable limit (3), implying
that compartmentalization of proteins in time and space was
crucial to allow the expansion of eukaryotic protein repertoires.
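As a quick check of the arithmetic above, the number of distinct unordered pairs among n proteins is n(n − 1)/2. The sketch below assumes the rounded proteome size of 4,200 quoted in the text and excludes self-interactions; the function name `distinct_pairs` is illustrative only.

```python
from math import comb


def distinct_pairs(n_proteins: int) -> int:
    """Number of unordered pairs of different proteins: n * (n - 1) / 2."""
    return comb(n_proteins, 2)


# Rounded E. coli proteome size quoted in the text.
n_pairs = distinct_pairs(4200)
print(n_pairs)  # 8817900
```

The result, roughly 8.8 million, is consistent with the "over 8,000,000" figure given in the text.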
In addition to cellular mechanisms such as compartmentalization
and regulation of protein abundance, shown to be important for
intrinsically unstructured proteins, for example (7), specific physicochemical properties contribute to minimizing nonfunctional protein-protein interactions (PPIs). This has been observed within the
protein core (8) and within interface patches (9), which, due to their
hydrophobic character, have a potential to mediate nonfunctional
interactions. Pechmann et al. showed that interface regions are
often aggregation-prone but protected by strategically placed
disulfide bonds and salt bridges (9). Such aggregation-prone
regions have also been shown to be less frequent among highly
expressed proteins, which, according to the law of mass action, are
potentially more deleterious to the cell than lowly expressed
proteins (10). Importantly, in these studies aggregation is measured along the protein sequence and therefore reflects the potential for aggregation of the unfolded state.
Most previous studies have highlighted “negative-design” principles at known binding regions (9) or examined nonfunctional
interactions through aggregation (10–13). In contrast, here we
concentrate on the surface regions of proteins in their folded state.
Specifically, we ask if the folded state of proteins is evolutionarily
constrained by nonfunctional interactions. This means, in particular, that we consider surface residues but not amino acids buried
in the protein core, as these cannot be involved in protein–protein
interactions. In a molecular evolution-oriented study, Yang et al.
recently observed that such surface-specific evolutionary constraints exist in yeast (14). Here we present a complementary
analysis that places the emphasis on the physico-chemical properties of proteins associated with constraints from nonfunctional
interactions and describe these properties in two additional species to better cover the tree of life. We thus assembled three
datasets of proteins of known structure in their biological state
(“biological unit”), resulting in 397, 196, and 631 proteins for
E. coli, S. cerevisiae, and Homo sapiens, respectively.
Results
Defining an Interaction Propensity Scale. To investigate the impact
of promiscuous interactions, we first define an interaction propensity scale to use as a proxy for an amino acid “stickiness”
scale. We derive this scale purely from structural data by taking
the log ratio of amino acid frequencies observed in protein–protein
interfaces versus at the protein surface, as previously defined (15–17) and as illustrated in Fig. 1B. As we consider protein structures in terms of biological units, surface amino acids as
defined here are not involved in interfacial protein–protein
contacts in the crystal structure. This scale thus reflects a tradeoff between the probability of finding a given amino acid in a
solvated environment versus the residue being involved in an
interaction with another protein. For example, lysine is frequent at the surface (∼15% of amino acids) but rare in interface core regions (<5% of amino acids), which makes it an
interaction-resistant or “nonsticky” amino acid (17). We used
only E. coli proteins to derive this scale, but our conclusions are
not dependent on the organism used because the scales based on
S. cerevisiae and H. sapiens proteins are almost identical to that
of E. coli (Rcoli-yeast = 0.94, Rcoli-human = 0.97; Fig. S2).
Author contributions: E.D.L., S.D., and S.A.T. designed research; E.D.L. and S.D. performed
research; E.D.L., S.D., and S.A.T. analyzed data; and E.D.L. and S.A.T. wrote the paper.
The authors declare no conflict of interest.
*This Direct Submission article had a prearranged editor.
Freely available online through the PNAS open access option.
Data deposition: The data processed in this paper are available at: www.tinyurl.com/structuralregions.
1To whom correspondence may be addressed. E-mail: [email protected] or sat@mrc-lmb.cam.ac.uk.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1209312109/-/DCSupplemental.
PNAS Early Edition | 1 of 6
BIOPHYSICS AND COMPUTATIONAL BIOLOGY
Edited* by Ken A. Dill, Stony Brook University, Stony Brook, NY, and approved October 30, 2012 (received for review June 1, 2012)
[Figure 1 panels. (A) Projection of protein abundance and evolutionary information onto structures, distinguishing interior, surface, and interface regions; datasets: E. coli (397 structures), S. cerevisiae (196 structures), H. sapiens (631 structures). (B) Amino acid interface propensity, log(freqAA interface / freqAA surface), used as a proxy for an amino acid “stickiness” scale; axis from −1.0 to 1.0, with amino acids ordered K, E, D, Q, N, P, G, R, A, T, S, H, V, L, W, Y, M, I, C, F. (C) Protein abundance (A.U.) versus surface stickiness: E. coli R = −0.48, p = 9.4e−10; S. cerevisiae R = −0.36, p = 7.2e−07; H. sapiens R = −0.25, p = 2.0e−05. (D) Protein abundance (A.U.) versus interior stickiness: E. coli R = −0.08, p = 0.4; S. cerevisiae R = −0.26, p = 0.00018; H. sapiens R = −0.07, p = 0.33.]
Fig. 1. The solvent-accessible surfaces of high-abundance proteins are enriched in nonsticky amino acids compared with low-abundance proteins. (A)
Illustration of the approach taken in this study. (B) We first define a stickiness scale for each amino acid using its interface propensity. The propensity is
defined by the log ratio of amino acid frequencies at interfaces versus surfaces. The definition of the structural regions used is explained in more detail in
Fig. S1. (C and D) We calculate a stickiness score by averaging interface propensity scores of residues in the region considered (surface or interior). We then
plot this score against the abundance of the protein and indicate the Spearman rank correlation coefficients of the relationships, as well as the P value associated with the linear association obtained by analysis of variance. The contour lines mark the 2/6, 3/6, 4/6, and 5/6 percentile of the density function range.
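The interface propensity described in this section can be sketched as a log ratio of per-region amino acid frequencies. The example below is a minimal illustration, not the authors' pipeline: the function name, the use of the natural logarithm, and the toy residue counts are all assumptions made here.

```python
from collections import Counter
from math import log


def interface_propensity(interface_residues, surface_residues):
    """Return {amino acid: log(freq at interface / freq at surface)}.

    Inputs are iterables of one-letter codes, one entry per residue
    observed in each region. Amino acids missing from either region are
    skipped, because the log ratio is undefined for a zero frequency.
    The natural log is an assumption; the text does not state the base.
    """
    interface_counts = Counter(interface_residues)
    surface_counts = Counter(surface_residues)
    n_interface = sum(interface_counts.values())
    n_surface = sum(surface_counts.values())
    scale = {}
    for aa in interface_counts.keys() & surface_counts.keys():
        f_interface = interface_counts[aa] / n_interface
        f_surface = surface_counts[aa] / n_surface
        scale[aa] = log(f_interface / f_surface)
    return scale


# Toy data (invented for illustration): lysine is common at the surface
# and rare at the interface, so it gets a negative ("nonsticky") score.
toy_interface = "K" * 5 + "L" * 25 + "A" * 70
toy_surface = "K" * 15 + "L" * 5 + "A" * 80
scale = interface_propensity(toy_interface, toy_surface)
```

With these toy counts lysine scores log(0.05/0.15), a negative value, whereas leucine scores log(0.25/0.05), a positive one, mirroring the lysine example in the text.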
Chemical Constraints on Surfaces of Highly Abundant Proteins.
Nonfunctional interactions are, on average, detrimental to fitness because they sequester interaction partners (18). According
to the law of mass action, the number of nonfunctional interactions that a protein participates in should be proportional to its
abundance (19). Therefore, an abundant protein with a sticky
surface is expected to be more deleterious than a low-abundance
protein with the same surface stickiness. If cellular crowding
and its associated promiscuous interactions were a constraint in
cellular systems, we would expect an anticorrelation between
protein surface stickiness and protein abundance. We quantified
the stickiness of a protein surface as the average of interface-propensity scores, thus reflecting the tendency of its solvent-accessible residues to interact with other protein surfaces (Fig. 1).
For all three organisms, we used all of the available experimental
data on protein abundance provided by the PaxDb database
(http://pax-db.org) (20). These values are linearly proportional to
protein copy numbers in cells.
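The surface stickiness score described above is a plain average of per-residue propensity scores. A minimal sketch follows; the function name and the numeric scale are placeholders invented for illustration and are not the published values (those are given in Table S2).

```python
def surface_stickiness(surface_sequence, scale):
    """Mean interface-propensity score over a protein's surface residues."""
    scores = [scale[aa] for aa in surface_sequence if aa in scale]
    if not scores:
        raise ValueError("no surface residue has a score in the scale")
    return sum(scores) / len(scores)


# Placeholder scores, NOT the published values (those are in Table S2).
toy_scale = {"K": -1.0, "E": -0.8, "A": 0.0, "L": 0.5, "F": 1.0}
print(surface_stickiness("KKEEAL", toy_scale))
```

A surface dominated by lysine and glutamate yields a negative (nonsticky) average, whereas one rich in phenylalanine or leucine yields a positive one.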
[Figure 3 panels. (A) A central residue and its surrounding 400-Å² surface patch, shown in a nonsticky and in a sticky context; the composition of central residues is independent of their context. (B) Conservation ratio, rate(protein)/rate(surface residue), plotted against the average stickiness of the surrounding surface patch; differences annotated: E. coli 35%, S. cerevisiae 65%, H. sapiens 12%; annotated P values: 9e−120, 4e−101, 6e−52.]
Fig. 3. The relative evolutionary rate of an amino acid is influenced by the
stickiness of its environment. (A) Illustration of the procedure used to calculate
the stickiness score of a residue’s environment. We use this score as a proxy for
the probability of the central residue to trigger a promiscuous interaction
upon mutation. Note that, although the central residue is classified according
to its context, its chemical composition remains independent of the context
and follows an average surface composition, even for the most sticky category
of patches (Fig. S7). (B) An evolutionary conservation ratio is calculated for
each surface amino acid. The ratio is equal to the median evolutionary rate of
the entire protein divided by the evolutionary rate of the residue. We bin all
residues into five classes of equal size and increasing stickiness and show the
boxplot distribution of evolutionary rates for each class. In all three organisms,
the stickier the environment of a residue, the more the residue is conserved
relative to the rest of the protein. Note that in this analysis we consider the
conservation of the central residue and not that of the patch surrounding it.
P values are calculated using the Wilcoxon test.
Plotting surface stickiness against protein abundance reveals
a significant anticorrelation in all three organisms studied (pcoli = 9 × 10^−10, pyeast = 7 × 10^−7, phuman = 2 × 10^−5; Fig. 1C; these and
subsequent P values associated with correlations were calculated
using the F-statistic obtained by analysis of variance of the linear
association between abundance and stickiness). However, the
magnitude of the anticorrelation, as measured by the Spearman
rank correlation coefficient, varies greatly. The strongest anticorrelation is found in E. coli (R = −0.48), followed by yeast (R =
−0.36), followed by human (R = −0.25).
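For readers who want to reproduce the kind of Spearman rank correlation coefficient reported here, the sketch below computes it as the Pearson correlation of ranks, using only the standard library. The helper names and the toy abundance and stickiness values are invented for illustration; the P values in the text come from an analysis of variance, which this sketch does not attempt.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman coefficient = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy values: stickiness tends to fall as abundance rises.
abundance = [1, 5, 20, 100, 500]
stickiness = [-0.10, -0.15, -0.30, -0.25, -0.45]
rho = spearman(abundance, stickiness)
```

Because the coefficient depends only on ranks, it is unaffected by the logarithmic abundance axis used in the figures.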
This result shows that the surface of highly abundant proteins
has adapted to become less sticky and more soluble than that of
lowly abundant proteins, especially in E. coli and, to a smaller
extent, in yeast and humans. This weaker signal might reflect the
fact that eukaryotic cells are more compartmentalized than
bacterial cells, which may introduce a bias in the measure of
protein concentration approximated here with abundance. An
analysis of protein stickiness as a function of localization indeed
reveals significant differences across different cellular compartments. Interestingly, nuclear proteins are more sticky than the
rest of the proteome taken as an average (pcerevisiae = 0.023;
psapiens = 0.016) whereas mitochondrial proteins are less sticky
(pcerevisiae = 0.0021, psapiens = 0.0045). Remarkably, in H. sapiens
the gene ontology (GO) term most enriched in nonsticky proteins is “soluble fraction” (psapiens = 3.6 × 10^−5; Fig. S3).
The amino acid potential provided in this analysis yields
results that are significantly different from those obtained on the
basis of the commonly used hydrophobicity scale of Kyte and
Doolittle (21). When considering this hydrophobicity scale, the
association described in Fig. 1 disappears in S. cerevisiae and
H. sapiens and greatly weakens in E. coli (Fig. 2 and Fig. S4).
We further tested 71 additional scales associated with “hydrophobicity” from the AAindex database (22) (Table S1). Interestingly, the scale of Wimley and White (23) yields the best
correlation (R = −0.44) in E. coli, and is based on the transfer of
amino acids from a hydrophobic environment (lipid bilayer
interface) to water. This is different from the Kyte and Doolittle
scale, which is based on measures of transfers of amino acids
between two polar environments (e.g., ethanol and water). The
similarity between the stickiness scale and the Wimley and White
scale may reflect the fact that an interaction resembles more a
transfer from water to a hydrophobic environment than a transfer between two relatively polar environments. Fig. S5 provides
a comparison of these three scales, and Table S2 presents the
values for our stickiness scale.
Current views of protein evolution emphasize stability, which
must be maintained to avoid misfolding and thereby prevent loss
of function or aggregation (24, 25). To assess the extent to which
the anticorrelation observed here is linked to the unfolded state of
the protein, we reproduce the same plots but now consider amino
acids at the protein interior instead of the surface (Fig. 1D). For
two organisms, the correlation disappears almost entirely when we
consider amino acids at the interior. The surface–interior difference is most marked in E. coli, where the correlation vanishes
almost completely (R = −0.08) and becomes insignificant (P =
0.4). In humans, the weaker anticorrelation observed in Fig. 1C is
also lost with interior amino acids (R = −0.07, P = 0.33), whereas
in yeast a weak correlation persists (R = −0.26, P = 4 × 10^−2).
Considering protein length provides a further piece of evidence showing that misassembly rather than misfolding is responsible for the anticorrelation between surface stickiness and
abundance. It is known that the small hydrophobic core of short
proteins (26) requires compensating mechanisms (27) that increase their stability. In line with this, we find an increase in
interior stickiness among small proteins relative to larger proteins for all three species (Fig. S6; pcoli = 2 × 10^−5; pyeast = 0.015;
phuman = 2 × 10^−8). The increased stickiness associated with the
core of small proteins suggests that a strong amino acid interaction potential can lead to an increase in stability. Comparatively, however, the lack of association between surface
stickiness and protein length (Fig. S6; pcoli = 0.05; pyeast = 0.89;
phuman = 0.15) implies that stability is unlikely to drive the evolution of protein surfaces toward nonsticky amino acids.
Taken together, these results suggest that, in addition to selection against misfolding and aggregation of polypeptide chains,
avoidance of nonfunctional interactions by folded proteins is an
important constraint that is proportional to abundance. Moreover, adaptation to this constraint is achieved through a bias in
surface amino acid composition toward nonsticky amino acids.
[Figure 2 panels: protein abundance (A.U.) plotted against surface hydrophobicity. S. cerevisiae: R = −0.02, p = 0.51. E. coli: R = −0.22, p = 0.0056.]
Fig. 2. Protein hydrophobicity is less strongly tuned as a function of abundance than stickiness. We calculate a “hydrophobicity score” for the surface
and interior regions of a protein by averaging Kyte and Doolittle hydrophobicity scores of residues in the region (21). We then plot this score against the
abundance of the protein and indicate the Spearman rank correlation coefficients of the relationships, as well as the P value associated with the linear
association obtained by analysis of variance. The hydrophobicity analysis for
all species and surface as well as interior regions is shown in Fig. S4.
residue. Therefore, the larger the ratio, the more conserved the
residue relative to the protein. This shows the clear effect of
a residue’s environment stickiness on its degree of conservation
relative to the protein: residues in nonsticky environments (leftmost bin) are 35%, 65%, and 12% freer to evolve than residues
in stickier environments (right-most bin) for E. coli, S. cerevisiae,
and H. sapiens, respectively. Because these values are obtained
after a normalization per protein, they reflect the impact of
stickiness on conservation relative to the conservation of the
protein. This normalization is necessary to single out the effect
of stickiness because lowly expressed proteins are poorly conserved (28) but also carry most of the sticky patches, as shown in
Fig. 1C. Interestingly, the weaker adaptation of human proteins
against nonfunctional interactions observed in Fig. 1C is reproduced here, as differences in evolutionary conservation across
the five probability classes are weakest in the human data set.
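The per-protein normalization discussed in this paragraph can be sketched as follows. The sketch assumes strictly positive per-residue evolutionary rates (for example, as produced by a rate-estimation tool); the function name and the toy rates are invented for illustration.

```python
from statistics import median


def conservation_ratios(residue_rates, surface_indices):
    """Ratio = median rate of the whole protein / rate of each surface residue.

    residue_rates holds one evolutionary rate per residue of the protein
    (all positive); surface_indices lists the positions to score.
    A ratio above 1 means the residue evolves more slowly than the
    protein median, i.e. it is relatively conserved.
    """
    protein_rate = median(residue_rates)
    return {i: protein_rate / residue_rates[i] for i in surface_indices}


# Invented rates for a six-residue toy protein.
rates = [0.5, 1.0, 2.0, 1.0, 4.0, 0.25]
ratios = conservation_ratios(rates, surface_indices=[0, 2, 5])
print(ratios)  # {0: 2.0, 2: 0.5, 5: 4.0}
```

Dividing by the protein's own median rate removes the protein-level effect of abundance on conservation, so that only the within-protein contrast between sticky and nonsticky environments remains.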
It can be argued that the conservation of residues found in sticky
surface patches is due to those patches being unknown biological
interfaces. However, several pieces of evidence suggest otherwise.
First, if this were the case, we would not expect to see such a difference in signal between species (i.e., decreasing signal strength
from E. coli to H. sapiens) because functional interfaces should, on
average, be conserved in all species. Second, we would expect the
central residue within sticky patches to resemble interface amino
acids. To assess this, we compared the frequency distribution of
amino acids in sticky patches (Fsticky) with that of amino acids at
the interface (Finterface) and surface (Fsurface). Because amino acids
[Figure 4A panels: ratio r plotted against protein abundance class (%; bins 0-20, 20-40, 40-60, 60-80, 80-100, and Top 5) for E. coli, S. cerevisiae, and H. sapiens, where r = (substitution frequency between D and E) / (substitution frequency between K and R), with f(D↔E) = [Σ(D-E) + Σ(E-D)] / [Σ(D-D) + Σ(E-E)] and f(K↔R) = [Σ(K-R) + Σ(R-K)] / [Σ(K-K) + Σ(R-R)]. Figure 4B labels: low-abundance protein, high-abundance protein, Mut1, Mut2, Mut3, sticky surface.]
Surface Stickiness Is an Evolutionary Constraint. To assess whether
nonfunctional interactions place a constraint on protein evolution,
we study conservation at the amino acid level. We ask whether,
within a protein, amino acids surrounded by a sticky environment
are more conserved than amino acids surrounded by a nonsticky
environment. We computed rates of evolution for each amino acid
for all three species and projected these data onto protein structures of each organism (Materials and Methods). In parallel, we
calculated a surrounding stickiness score for every surface amino
acid of each protein (Fig. 3A). This score is calculated from the
amino acid composition of the 400-Å² surface patch surrounding
the residue of interest by averaging the stickiness values of its amino acids
(note that the stickiness of the central residue is independent from that of the patch). Residues are then binned into
five “surrounding stickiness” classes of equal size for each organism, and evolutionary conservation is compared across the five
classes (Fig. 3B). We reason that residues in more sticky environments are expected to have a higher probability of triggering
nonfunctional interactions upon mutation and on average should
be more constrained than those in less sticky environments.
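The binning step described above, five "surrounding stickiness" classes of equal size, can be sketched with a rank-based assignment. The function name, the tie-breaking by input order, and the toy scores are assumptions made for this illustration.

```python
def equal_size_bins(scores, n_bins=5):
    """Assign each score to one of n_bins classes of (near) equal size.

    Class 0 holds the lowest scores and class n_bins - 1 the highest.
    Ties are broken by input order, which is an arbitrary choice.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for position, index in enumerate(order):
        labels[index] = position * n_bins // len(scores)
    return labels


# Invented surrounding-stickiness scores for ten surface residues.
patch_scores = [-0.4, 0.1, -0.2, 0.3, -0.5, 0.0, 0.2, -0.1, -0.3, 0.4]
classes = equal_size_bins(patch_scores)
print(classes)  # [0, 3, 1, 4, 0, 2, 3, 2, 1, 4]
```

Binning by rank rather than by fixed score thresholds keeps the classes equally populated, which makes the per-class conservation distributions directly comparable.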
Importantly in Fig. 3, the evolutionary rate of each residue is
normalized as we divide the rate of the protein by that of each
[Figure 4B scheme labels: Mut4, interior, nonsticky surface; a table rates mutations Mut1–Mut4 as Low or High for constraint by promiscuous PPIs and by misfolding toxicity.]
Fig. 4. The strength of selection against changes in protein stickiness is proportional to protein abundance. (A) Ratio of frequencies of two substitution
types: one between charged residues of equal stickiness (D and E) and one between charged residues with a change in stickiness (K and R). The ratio is plotted
for five bins of increasing protein abundance, each containing the same number of these charged residues. The sixth bin contains the top 5% abundant
proteins. The ratio, r, defined in the figure, increases by 160%, 78%, and 13% in E. coli, S. cerevisiae, and H. sapiens, respectively, for the most abundant
proteins relative to the least abundant ones. Thus, substitutions between K and R become less frequent than substitutions between D and E among highly
abundant proteins. The red intervals show the SD of the ratios r obtained from 1,000 datasets where abundance data are randomized. (B) Scheme illustrating
the constraints from misfolding and promiscuous interactions. Selection against misfolding provides an explanation for the relationship between protein
abundance and evolutionary conservation for residues buried in the interior because the deleterious effects of misfolded aggregates increase with abundance. Avoidance of promiscuous interactions provides a further mechanism that explains negative selection proportional to abundance for residues on the
solvent-accessible surface of proteins.
4 of 6 | www.pnas.org/cgi/doi/10.1073/pnas.1209312109
Nonfunctional Interactions Might Contribute to the Differential
Conservation Between Highly and Lowly Expressed Proteins. In the
first part of this study, we observed an anticorrelation between
protein abundance and protein surface stickiness. Subsequently
we saw that stickiness is correlated with conservation within
a protein. This prompts us to ask whether protein stickiness
might be involved in the well-established correlation between
protein abundance and evolutionary conservation. Thus, we
would expect low-copy proteins to be more tolerant than
abundant proteins to amino acid substitutions that significantly
change their surface stickiness.
To test this hypothesis, we took advantage of the properties of
two pairs of charged amino acids: aspartic (D) and glutamic (E)
acids have similar stickiness scores, whereas arginine (R) and lysine (K) do not (Fig. 1B) (15, 17, 31). Arginine is more frequently
found at protein–protein interfaces than lysine, making it a stickier
amino acid according to our definition. This characteristic enables
us to make the following prediction: among high-copy proteins,
where significant changes in stickiness have a greater impact,
substitutions between K and R should be less frequent than substitutions between D and E. Also, because K, R, E, and D are
mostly present at protein surfaces (15, 17), we do not need to
restrict ourselves to proteins of known structure and can measure
substitution rates from whole proteomes.
We thus measured the substitution frequencies between K
and R (fK↔R) as well as between D and E (fD↔E) among
orthologs of three species pairs: E. coli–Salmonella typhimurium,
S. cerevisiae–Saccharomyces paradoxus, and H. sapiens–Mus
musculus, as detailed in Table S3. Fig. 4 shows ratios of these
frequencies (fD↔E/fK↔R) as a function of protein abundance.
Substitutions between K and R are rare among abundant proteins relative to substitutions between D and E. In contrast,
among low-copy proteins, both substitution types occur at more
comparable frequencies. Interestingly, the magnitude of the effect observed, again, decreases in strength from E. coli (160%
change between lowest and highest abundance classes) to yeast
(78% change) and to humans (13% change).
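The ratio of substitution frequencies used in this analysis can be sketched from pairwise ortholog alignments. The sketch assumes that a substitution frequency is the count of substitutions between two residue types divided by the count of conserved positions of those types, as the formula in Fig. 4A appears to indicate; the function names and the toy alignment are invented for illustration.

```python
from collections import Counter


def pair_counts(aligned_a, aligned_b):
    """Count aligned residue pairs between two gap-aligned ortholog sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return Counter(zip(aligned_a, aligned_b))


def substitution_frequency(counts, x, y):
    """Substitutions between x and y divided by conserved x and y positions.

    Raises ZeroDivisionError if no conserved x or y position was counted.
    """
    substituted = counts[(x, y)] + counts[(y, x)]
    conserved = counts[(x, x)] + counts[(y, y)]
    return substituted / conserved


def ratio_r(counts):
    """r = f(D<->E) / f(K<->R); undefined if no K/R substitution was counted."""
    f_de = substitution_frequency(counts, "D", "E")
    f_kr = substitution_frequency(counts, "K", "R")
    return f_de / f_kr


# Invented toy alignment: 2 D/E swaps over 4 conserved D/E positions,
# 1 K/R swap over 4 conserved K/R positions.
seq_a = "DDEEDEKKRRK"
seq_b = "DDEEEDKKRRR"
counts = pair_counts(seq_a, seq_b)
print(ratio_r(counts))  # 2.0
```

In practice the counts would be pooled over all orthologs within an abundance class before taking the ratio, so that r increases when K↔R substitutions become rare relative to D↔E substitutions.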
Taken together, these observations provide mechanistic insights
into the well-established correlation between protein abundance
and evolutionary conservation. Although this correlation has
been known for over a decade (28), the biological mechanisms
associated with it are still not entirely clear. Selection against
misfolding can explain part of the correlation (24, 25), where the
assumption is that toxicity of misfolded proteins is proportional to
their abundance. Our results support the notion that avoidance of
promiscuous interactions, or negative pleiotropy (32), represents
an additional mechanistic explanation (Fig. 4B).
Discussion
It has been shown previously that mutations tend to arise faster at
the protein surface than in the interior (33, 34). In fact, Toth-Petroczy and Tawfik recently showed that mutations at the interior
accumulate more rapidly once the surface has drifted sufficiently
(35). Therefore, by lowering the tolerance for mutations at the
surface, the divergence of the entire protein becomes constrained
(35). Promiscuous interactions, which constrain mutations at the
surface, could thereby limit the evolutionary rate of the entire
protein. This is consistent with the results of a recent study by
Yang et al. showing, in a theoretical molecular evolutionary model
using S. cerevisiae, that protein misinteraction represents an evolutionary constraint (14).
Considering two additional species and taking a complementary approach placing more emphasis on the physico-chemical
properties of proteins, we also find that protein misinteractions
represent an evolutionary constraint. We provide a physicochemical rationalization of nonfunctional interactions through
the stickiness scale. This scale is significantly different from the
Kyte and Doolittle hydrophobicity scale, which is commonly used,
e.g., by Yang et al. (14). Our stickiness scale is more similar to
the Wimley and White scale, although differences, e.g., between
lysine and arginine, suggest that it is important to consider the
“interaction” potential of amino acids in interpreting nonfunctional interactions. Interestingly, lysine underrepresentation
at nonbiological crystal contacts also supports the notion that
lysine and arginine have different potentials to be involved in
nonfunctional interactions (17, 36). We thus hope that the
stickiness scale proposed here will help to refine models that
couple protein chemistry to cellular crowding (5). Furthermore,
taken together, the work by Yang et al. and our work suggest that
proteins are constrained to avoid nonfunctional interactions,
adding to the commonly accepted stability and solubility constraints on the amino acid composition of proteins.
Finally, the impact of promiscuous interactions appears most
prominent among the unicellular organisms E. coli and S. cerevisiae. It is thus tempting to speculate that nonfunctional
interactions may have accumulated in the human lineage (37) in
a similar fashion to the accumulation of noncoding DNA (38). In
a further analogy to noncoding DNA, nonfunctional interactions
may represent the raw material for exploring and ultimately
selecting functional interactions (39, 40) through mechanisms
such as colocalization (41). These speculations should nevertheless be considered with care, as the weaker signal observed for
H. sapiens may also result from the ill-defined nature of protein
abundance in multicellular organisms. Future studies will thus be
needed to explore these ideas further and better understand the
properties of proteomes across the tree of life.
Methods
Sequence Data. Sequences of proteins and their respective orthologs were
aligned with MUSCLE (42). Orthology information was taken from ref. 43 for
E. coli and from ENSEMBL v.48 (44) for H. sapiens. Multiple sequence
alignments of S. cerevisiae proteins with their orthologs were taken from
Wapinski et al. (45). The details of the species used are in Table S4. Protein
multiple alignments were concatenated to obtain three proteome-wide
multiple alignments (one for each species). These were used to calculate
amino acid evolutionary rates using Rate4Site (46).
such as cysteine are rare in all regions, we normalized these distributions by the average frequencies observed in all regions
(F_total). As expected, the linear regression between (F_sticky/F_total)
and (F_interface/F_total) was not significant (p_coli = 0.27, p_cerevisiae = 0.66,
p_sapiens = 0.48). Residues found in sticky patches are in fact nearly
identical in their composition to surface residues, as reflected by
the highly significant linear regression between (F_sticky/F_total) and
(F_surface/F_total): p_coli = 3.1e-14, p_cerevisiae = 4.1e-14, p_sapiens = 9.2e-12,
as obtained by analysis of variance. These results are detailed in
Fig. S7 and show that for biological units in the Protein Data Bank
(PDB) the surfaces are largely solvent-exposed as opposed to
being involved in cryptic stable interfaces. Considering isolated
subunits, however, we observe the opposite because the sticky
patches include genuine interfaces. For this data set, the distribution of residues at the center of sticky patches is closer to interface amino acids (Fig. S7, p_coli = 0.019, p_cerevisiae = 0.16, p_sapiens =
0.028) than to surface ones (for the latter, the regression slopes are
actually negative: slope_coli = −0.85, slope_cerevisiae = −1.02, slope_sapiens =
−0.19). Finally, the increasing conservation of residues in increasingly sticky environments holds true even at known interfaces,
both at the rim and at the core (Fig. S8), showing that, even within
protein–protein contact regions, stickiness is controlled. The latter
observation supports the notion of negative design (29) in sensitive
interface regions (8, 9, 30). Although unknown biological interfaces must exist, these observations make us confident that they
are unlikely to contribute significantly to the signal observed.
Structural Data. Species-specific structures were retrieved by sequence homology. We searched for structures where the sequence from the SEQRES
field was similar to proteins from E. coli, S. cerevisiae, or H. sapiens
proteomes. We imposed a minimal sequence identity of 90% and a minimum overlap of 70%. We used protein structures from the PDB (47), and
the dataset includes all structures present in the second release of
3DComplex (48). All structures for which the biological state was manually
annotated in the PiQSi database (49) as “error,” “probable error,” or
“undefined” were discarded, as well as all DNA-binding and membrane
proteins. Finally, we kept only structures with a resolution below 3 Å. A
summary of the number of structures per organism and complex type is
given in Table S5. Structural regions were defined as in Levy (15). The environment stickiness for a given residue was calculated based on its surrounding residues, i.e., residues with the Cα within a 400-Å² patch centered
on the Cα of the residue of interest.
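The environment stickiness described above can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes that the 400-Å² patch is a disc (radius √(400/π) ≈ 11.3 Å), that the score is the unweighted mean of the Table S2 propensities of the neighbouring residues, and that the central residue itself is excluded. The function name and the input layout (a list of one-letter code and Cα coordinate pairs) are hypothetical.

```python
import math

# Interface propensities ("stickiness" scale) copied from Table S2.
STICKINESS = {
    "A": 0.0062, "C": 1.0372, "D": -0.7485, "E": -0.7893, "F": 1.2727,
    "G": -0.1771, "H": 0.1204, "I": 1.1109, "K": -1.1806, "L": 0.9138,
    "M": 1.0124, "N": -0.2693, "P": -0.1799, "Q": -0.4114, "R": -0.0876,
    "S": 0.1376, "T": 0.1031, "V": 0.7599, "W": 0.7925, "Y": 0.8806,
}

PATCH_AREA = 400.0  # Å², from the Methods
# Assumption: the patch is treated as a disc, so radius = sqrt(area / pi).
PATCH_RADIUS = math.sqrt(PATCH_AREA / math.pi)


def environment_stickiness(index, residues):
    """Mean stickiness of residues whose Cα lies within PATCH_RADIUS.

    residues: list of (one_letter_code, (x, y, z)) tuples, coordinates in Å.
    The residue at `index` is excluded (assumption).
    Returns None when the residue has no neighbour inside the patch.
    """
    _, center = residues[index]
    scores = [
        STICKINESS[aa]
        for i, (aa, xyz) in enumerate(residues)
        if i != index and math.dist(center, xyz) <= PATCH_RADIUS
    ]
    return sum(scores) / len(scores) if scores else None
```

In practice the input would presumably be restricted to surface residues of a given structure; that filtering step is left out of the sketch.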
Abundance Data. Protein abundance data were taken from PaxDb (20) (http://
pax-db.org). Because of the uncertainty associated with very low abundance
proteins, we discarded all proteins with an abundance unit below 1. Statistical analyses and plots were done with R. Data used in this study are
available at www.tinyurl.com/structuralregions.
ACKNOWLEDGMENTS. We thank Dan Tawfik, Joël Janin, Eugene Shakhnovich, Sergei Maslov, David Liberles, Joseph Marsh, Eviatar Natan, Gideon
Schreiber and Peter Tompa for their comments on the manuscript. We also
thank the two anonymous referees for their constructive comments that
significantly helped improve the paper. E.D.L. acknowledges the Human
Frontier Science Project for financial support through a long-term fellowship; Stephen Michnick and Université de Montréal for hosting part of this
research; and the Weizmann Institute of Science for hosting part of this
research. S.D. acknowledges support from the University of Colorado School
of Medicine and the National Cancer Institute Physical Sciences Oncology
Center initiative (U54-CA143798). E.D.L. and S.A.T. were supported by the
Medical Research Council (file Reference U105161047).
1. McGuffee SR, Elcock AH (2010) Diffusion, crowding & protein stability in a dynamic
molecular model of the bacterial cytoplasm. PLOS Comput Biol 6(3):e1000694.
2. Janin J (1996) Quantifying biological specificity: The statistical mechanics of molecular
recognition. Proteins 25(4):438–445.
3. Zhang J, Maslov S, Shakhnovich EI (2008) Constraints imposed by non-functional protein-protein interactions on gene expression and proteome size. Mol Syst Biol 4:210.
4. Tompa P, Rose GD (2011) The Levinthal paradox of the interactome. Protein Sci
20(12):2074–2079.
5. Heo M, Maslov S, Shakhnovich E (2011) Topology of protein interaction network
shapes protein abundances and strengths of their functional and nonspecific interactions. Proc Natl Acad Sci USA 108(10):4258–4263.
6. Johnson ME, Hummer G (2011) Nonspecific binding limits the number of proteins in
a cell and shapes their interaction networks. Proc Natl Acad Sci USA 108(2):603–608.
7. Gsponer J, Futschik ME, Teichmann SA, Babu MM (2008) Tight regulation of unstructured proteins: From transcript synthesis to protein degradation. Science 322
(5906):1365–1368.
8. Fleishman SJ, Baker D (2012) Role of the biomolecular energy gap in protein design,
structure, and evolution. Cell 149(2):262–273.
9. Pechmann S, Levy ED, Tartaglia GG, Vendruscolo M (2009) Physicochemical principles
that regulate the competition between functional and dysfunctional association of
proteins. Proc Natl Acad Sci USA 106(25):10159–10164.
10. Tartaglia GG, Pechmann S, Dobson CM, Vendruscolo M (2009) A relationship between
mRNA expression levels and protein solubility in E. coli. J Mol Biol 388(2):381–389.
11. Hamada D, et al. (2009) Competition between folding, native-state dimerisation and
amyloid aggregation in beta-lactoglobulin. J Mol Biol 386(3):878–890.
12. Tartaglia GG, Pechmann S, Dobson CM, Vendruscolo M (2007) Life on the edge: A link
between gene expression levels and aggregation rates of human proteins. Trends
Biochem Sci 32(5):204–206.
13. Münch C, Bertolotti A (2010) Exposure of hydrophobic surfaces initiates aggregation
of diverse ALS-causing superoxide dismutase-1 mutants. J Mol Biol 399(3):512–525.
14. Yang JR, Liao BY, Zhuang SM, Zhang J (2012) Protein misinteraction avoidance causes
highly expressed proteins to evolve slowly. Proc Natl Acad Sci USA 109(14):E831–E840.
15. Levy ED (2010) A simple definition of structural regions in proteins and its use in
analyzing interface evolution. J Mol Biol 403(4):660–670.
16. Lo Conte L, Chothia C, Janin J (1999) The atomic structure of protein-protein recognition sites. J Mol Biol 285(5):2177–2198.
17. Janin J, Bahadur RP, Chakrabarti P (2008) Protein-protein interaction and quaternary
structure. Q Rev Biophys 41(2):133–180.
18. Vavouri T, Semple JI, Garcia-Verdugo R, Lehner B (2009) Intrinsic protein disorder and
interaction promiscuity are widely associated with dosage sensitivity. Cell 138(1):
198–208.
19. Levy ED, Michnick SW, Landry CR (2012) Protein abundance is key to distinguish
promiscuous from functional phosphorylation based on evolutionary information.
Philos Trans R Soc Lond B Biol Sci 367(1602):2594–2606.
20. Wang M, et al. (2012) PaxDb, a database of protein abundance averages across all
three domains of life. Mol Cell Proteomics 11(8):492–500.
21. Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character
of a protein. J Mol Biol 157(1):105–132.
22. Kawashima S, et al. (2008) AAindex: Amino acid index database, progress report 2008.
Nucleic Acids Res 36(Database issue):D202–D205.
23. Wimley WC, White SH (1996) Experimentally determined hydrophobicity scale for
proteins at membrane interfaces. Nat Struct Biol 3(10):842–848.
24. Yang JR, Zhuang SM, Zhang J (2010) Impact of translational error-induced and error-free misfolding on the rate of protein evolution. Mol Syst Biol 6:421.
25. Drummond DA, Wilke CO (2008) Mistranslation-induced protein misfolding as
a dominant constraint on coding-sequence evolution. Cell 134(2):341–352.
26. Chothia C (1975) Structural invariants in protein folding. Nature 254(5498):304–308.
27. Pereira de Araújo AF, Gomes AL, Bursztyn AA, Shakhnovich EI (2008) Native atomic
burials, supplemented by physically motivated hydrogen bond constraints, contain
sufficient information to determine the tertiary structure of small globular proteins.
Proteins 70(3):971–983.
28. Pál C, Papp B, Hurst LD (2001) Highly expressed genes in yeast evolve slowly. Genetics
158(2):927–931.
29. Doye JP, Louis AA, Vendruscolo M (2004) Inhibition of protein crystallization by
evolutionary negative design. Phys Biol 1(1–2):9–13.
30. Levin KB, et al. (2009) Following evolutionary paths to protein-protein interactions
with high affinity and selectivity. Nat Struct Mol Biol 16(10):1049–1055.
31. MacCallum JL, Tieleman DP (2011) Hydrophobicity scales: A thermodynamic looking
glass into lipid-protein interactions. Trends Biochem Sci 36(12):653–662.
32. Liberles DA, Tisdell MD, Grahnen JA (2011) Binding constraints on the evolution of
enzymes and signalling proteins: The important role of negative pleiotropy. Proc Biol
Sci 278(1714):1930–1935.
33. Sasidharan R, Chothia C (2007) The selection of acceptable protein mutations. Proc
Natl Acad Sci USA 104(24):10080–10085.
34. Franzosa EA, Xia Y (2009) Structural determinants of protein evolution are context-sensitive at the residue level. Mol Biol Evol 26(10):2387–2395.
35. Tóth-Petróczy A, Tawfik DS (2011) Slow protein evolutionary rates are dictated by
surface-core association. Proc Natl Acad Sci USA 108(27):11151–11156.
36. Cieslik M, Derewenda ZS (2009) The role of entropy and polarity in intermolecular
contacts in protein crystals. Acta Crystallogr D Biol Crystallogr 65(Pt 5):500–509.
37. Fernández A, Lynch M (2011) Non-adaptive origins of interactome complexity. Nature
474(7352):502–505.
38. Lynch M (2007) The Origins of Genome Architecture (Sinauer Associates, Inc., Sunderland, MA), pp 494.
39. Tawfik DS (2010) Messy biology and the origins of evolutionary innovations. Nat
Chem Biol 6(10):692–696.
40. Nobeli I, Favia AD, Thornton JM (2009) Protein promiscuity and its implications for
biotechnology. Nat Biotechnol 27(2):157–167.
41. Kuriyan J, Eisenberg D (2007) The origin of protein interactions and allostery in colocalization. Nature 450(7172):983–990.
42. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Res 32(5):1792–1797.
43. Moreno-Hagelsieb G, Janga SC (2008) Operons and the effect of genome redundancy
in deciphering functional relationships using phylogenetic profiles. Proteins 70(2):
344–352.
44. Flicek P, et al. (2008) Ensembl 2008. Nucleic Acids Res 36(Database issue):D707–D714.
45. Wapinski I, Pfeffer A, Friedman N, Regev A (2007) Natural history and evolutionary
principles of gene duplication in fungi. Nature 449(7158):54–61.
46. Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N (2002) Rate4Site: An algorithmic tool
for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 18(Suppl 1):S71–S77.
47. Berman HM, et al. (2002) The Protein Data Bank. Acta Crystallogr D Biol Crystallogr
58(Pt 6 No 1):899–907.
48. Levy ED, Pereira-Leal JB, Chothia C, Teichmann SA (2006) 3D complex: A structural
classification of protein complexes. PLOS Comput Biol 2(11):e155.
49. Levy ED (2007) PiQSi: Protein quaternary structure investigation. Structure 15(11):
1364–1367.
6 of 6 | www.pnas.org/cgi/doi/10.1073/pnas.1209312109
Levy et al.
Supporting Information
Levy et al. 10.1073/pnas.1209312109
[Fig. S1 schematic: cross-section of a protein and its interacting partner, showing the INTERIOR, SURFACE, SUPPORT, RIM, and CORE regions.]
Region     Definition
INTERIOR   rASAc < 25% and ΔrASA = 0
SURFACE    rASAc > 25% and ΔrASA = 0
SUPPORT    ΔrASA > 0 and rASAm < 25%
RIM        ΔrASA > 0 and rASAc > 25%
CORE       ΔrASA > 0 and rASAm > 25% and rASAc < 25%
ΔrASA = rASAm − rASAc; rASAm = relative ASA in monomer; rASAc = relative ASA in complex.
Fig. S1. The regions of protein structure used in this study. We use the definitions of interface, surface, interface support, rim, and core as defined in Levy
et al. (1). Amino acid interface propensities are computed as the log ratios of their frequencies at the interface core (orange) and surface (blue).
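The region definitions of Fig. S1 translate directly into a small decision function. The sketch below is illustrative only: the function name is hypothetical, inputs are relative accessible surface areas in percent, and the handling of values exactly equal to the 25% cutoff (which the definitions leave open) is an arbitrary choice noted in the comments.

```python
def classify_region(rasa_monomer, rasa_complex, cutoff=25.0):
    """Assign a residue to a structural region from its relative ASA (%).

    rasa_monomer: relative ASA in the isolated monomer (rASAm).
    rasa_complex: relative ASA in the complex (rASAc).
    Ties at the cutoff are not specified in the source; here they fall
    into the "buried" side of each comparison (assumption).
    """
    delta = rasa_monomer - rasa_complex  # ΔrASA = rASAm − rASAc
    if delta <= 0:
        # No change in accessibility upon complex formation.
        return "surface" if rasa_complex > cutoff else "interior"
    if rasa_monomer < cutoff:
        return "support"  # buried in the monomer, buried further in the complex
    if rasa_complex > cutoff:
        return "rim"      # remains exposed in the complex
    return "core"         # exposed in the monomer, buried in the complex


if __name__ == "__main__":
    for pair in [(10, 10), (40, 40), (20, 5), (60, 30), (40, 10)]:
        print(pair, classify_region(*pair))
```

Interface propensities are then computed from the residues labeled "core" versus those labeled "surface", as stated in the caption.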
1. Levy ED (2010) A simple definition of structural regions in proteins and its use in analyzing interface evolution. J Mol Biol 403(4):660–670.
[Fig. S2 plots. Upper panels: interface core propensity relative to surface for each amino acid (ordered K E D Q N P G R A T S H V L W Y M I C F) in S. cerevisiae and H. sapiens. Lower panels: S. cerevisiae propensities and H. sapiens propensities plotted against E. coli propensities. All axes span −1.5 to 1.5.]
Fig. S2. Amino acid interface propensities are similar across distant species. (Upper panels) Residue interface-to-surface propensities for Saccharomyces
cerevisiae and Homo sapiens. (Lower panels) S. cerevisiae and H. sapiens residue interface propensities are very similar to those in Escherichia coli.
Levy et al. www.pnas.org/cgi/content/short/1209312109
1 of 9
[Fig. S3 plots: bar charts of the stickiness score of protein surfaces per GO cellular component for S. cerevisiae and H. sapiens, with Z-score and P value columns. The first two bars in each panel are "Top 10%" and "Lowest 10%". Legend: standard deviation obtained on simulated random data.
S. cerevisiae categories: organelle lumen; intracellular organelle lumen; cytosol; mitochondrial part; endomembrane system; membrane part; organelle membrane; endoplasmic reticulum; envelope; membrane; organelle envelope; organelle part; nucleus; intracellular organelle part; protein complex; macromolecular complex; nuclear part; mitochondrion.
H. sapiens categories: endosome; neuron projection; cytoplasmic vesicle; nucleolus; Golgi apparatus part; cell projection; Golgi apparatus; subsynaptic reticulum; membrane fraction; mitochondrial envelope; insoluble fraction; envelope; organelle envelope; mitochondrial membrane; endoplasmic reticulum; extracellular region part; extracellular space; endoplasmic reticulum part; mitochondrial inner membrane; organelle inner membrane; endoplasmic reticulum membrane; microtubule cytoskeleton; plasma membrane part; organelle membrane; cytoskeletal part; cytoskeleton; vacuole; endomembrane system; soluble fraction; cell fraction; nucleoplasm; nuclear part; extracellular region; nuclear lumen; intrinsic to membrane; mitochondrial lumen; mitochondrial matrix; mitochondrial part; integral to membrane; vesicle.]
Fig. S3. Changes in protein surface stickiness as a function of subcellular localization. Stickiness scores of surface residues are binned according to the gene
ontology (GO) annotation of the protein to which they correspond. Note that, for proteins with multiple GO annotations, residues are counted several times.
For each bin or GO category, the median stickiness score is shown. The red lines are the SDs of the medians when GO annotations are shuffled (100,000
iterations). The first two bars are the scores of the top and lowest 10% quantiles. This illustrates the similarity in stickiness across different cellular compartments.
There are, however, significant differences which, remarkably, are conserved across Saccharomyces cerevisiae and Homo sapiens; e.g., proteins in mitochondria
tend to be less sticky than average (p_cerevisiae = 0.023; p_sapiens = 0.016), whereas nuclear proteins tend to be more sticky (p_cerevisiae = 0.0021, p_sapiens = 0.0045). The
least sticky GO cellular component is the “soluble fraction” (p_sapiens = 3.6 × 10^−5).
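The shuffling control mentioned in the caption can be illustrated with a short permutation sketch. It is not the authors' implementation: it assumes one GO label per residue (the caption notes that residues of multiply annotated proteins are counted several times), uses far fewer iterations than the 100,000 quoted, and defines the Z-score as (observed median − mean shuffled median) / SD, which is an assumption. All function and parameter names are hypothetical.

```python
import random
import statistics


def shuffled_median_stats(scores, labels, category, iterations=1000, seed=0):
    """Mean and SD of the category median after shuffling the labels.

    scores: one stickiness score per residue.
    labels: one GO category per residue (same length as scores).
    The category must occur at least once and iterations must be >= 2.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    medians = []
    for _ in range(iterations):
        rng.shuffle(shuffled)  # break the score-to-annotation link
        picked = [s for s, lab in zip(scores, shuffled) if lab == category]
        medians.append(statistics.median(picked))
    return statistics.mean(medians), statistics.stdev(medians)


def z_score(scores, labels, category, **kwargs):
    """Observed median expressed in SD units of the shuffled medians."""
    observed = statistics.median(
        [s for s, lab in zip(scores, labels) if lab == category]
    )
    mean, sd = shuffled_median_stats(scores, labels, category, **kwargs)
    return None if sd == 0 else (observed - mean) / sd
```

A negative value would indicate a category that is less sticky than expected from randomly assigned annotations, and a positive value a stickier one.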
Fig. S4. Protein hydrophobicity is less strongly tuned as a function of abundance than stickiness. We calculate a “hydrophobicity score” for the surface and interior
regions of a protein by averaging hydrophobicity scores (1) of residues in the region. We then plot this score against the abundance of the protein and indicate the
Spearman rank correlation coefficients of the relationships, as well as the P value associated with the linear association obtained by analysis of variance.
1. Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157(1):105–132.
[Fig. S5 plots: the stickiness scale plotted against Kyte & Doolittle hydrophobicity and against Wimley & White hydrophobicity; each point is labeled with the one-letter code of the amino acid.]
Fig. S5. Comparison of the stickiness scale with the Kyte and Doolittle scale and the Wimley and White scale. The stickiness scale is distinct from the other
two scales, particularly with respect to K, R, and W.
[Fig. S6 plots: interior stickiness (upper panels, y-axis 0.2 to 0.8) and surface stickiness (lower panels) versus protein length for E. coli, S. cerevisiae, and H. sapiens. Correlation annotations recovered from the panels, in extraction order: R = −0.27, p = 0.015; R = −0.12, p = 0.05; R = −0.41, p = 2e-8; R = −0.08, p = 0.89; R = −0.1, p = 0.15; R = −0.36, p = 2e-5.]
Fig. S6. The stickiness scale captures information on protein stability and suggests that surface amino acids contribute little to stability compared with interior
amino acids. Small proteins have a smaller volume-to-surface ratio (1) and thus a smaller hydrophobic core than large proteins. To meet the stability requirement, small proteins are therefore expected to exhibit particular properties (2). We observed in Fig. 1 that abundant proteins have a decreased surface
stickiness. To test whether this observation may be linked to stability, we compare the stickiness of surface (Lower panels) and interior (Upper panels) amino
acids as a function of protein length, which should be linked to stability. Stickiness of interior amino acids appears as such a requirement because we observe
that small proteins have a stickier interior. However, surface amino acids have comparatively little to no influence on stability, as the correlation between
stickiness and protein length tends to disappear for surface amino acids.
1. Chothia C (1975) Structural invariants in protein folding. Nature 254(5498):304–308.
2. Pereira de Araújo AF, Gomes AL, Bursztyn AA, Shakhnovich EI (2008) Native atomic burials, supplemented by physically motivated hydrogen bond constraints, contain sufficient
information to determine the tertiary structure of small globular proteins. Proteins 70(3):971–983.
[Fig. S7 plots: frequency (y-axis, 0 to 12) of each amino acid (x-axis: A C D E F G H I K L M N P Q R S T V W Y) for E. coli, S. cerevisiae, and H. sapiens. Series: a.a. interface frequency; a.a. surface frequency of Biological Units; central a.a. frequency in the 10% most sticky patches at known biological interfaces; central a.a. frequency in the 10% most sticky patches at the surface of single chains; central a.a. frequency in the 10% most sticky patches at the surface of Biological Units.]
Fig. S7. Residue frequencies are independent of the stickiness of the 400-Å² surrounding surface patch. This implies that sticky patches on surfaces are unlikely
to be cryptic interfaces. The 10% most “interface-like” patches found at the protein surface were identified, and their central residue was recorded. The
frequencies of these central residues (dark blue) closely follow those of surface residues (light blue) rather than frequencies of interface residues (yellow).
Therefore, highly sticky patches are unlikely to represent unknown biological protein interfaces. To further control that this strategy is likely to detect interfaces, we carried out the same analysis but considered the entire protein surface (including known interfaces). When these real interfaces are included, the
stickiness of central residues radically shifts and becomes close to that of known interfaces (gray).
[Fig. S8 plots: boxplots for E. coli, S. cerevisiae, and H. sapiens. (A) y-axis: ratio of rate(protein) to rate(interface rim residue); (B) y-axis: ratio of rate(protein) to rate(interface core residue); y-axes span 0 to 4. x-axis in both panels: propensity of a residue to trigger a promiscuous interaction upon mutation. Annotations on the panels include pairwise P values and percentage differences (9%, 65%, and 59%); the exact P values shown are p = 0.0021 in panel A and p = 1.2e−15, p = 8.9e−10, and p = 9.4e−16 in panel B.]
Fig. S8. The divergence rate of residues at interface cores depends on the stickiness of their environment. (A and B) Interface core residues are grouped into
five classes on the basis of their surrounding 400-Å² environment, i.e., from residues in nonsticky environments (gray) to those in sticky environments (orange).
The boxplot distribution of evolutionary rates of residues is shown for each group. Importantly, the rate of each residue is normalized by the median rate
of the entire protein. All classes have a median (black thick line) greater than 1 because interface cores are in general more conserved than the rest of the
protein. The significant difference in rates between the first and last bins indicates that the environment of a residue correlates with the divergence rate. A
possible explanation for this observation is that residues in highly sticky environments are more likely to trigger nonfunctional interactions if they mutate and
are therefore subject to more selective pressure. Consistent with the trend observed throughout this work, the difference is less marked among Homo sapiens
proteins, suggesting that there is more tolerance for promiscuous interactions in human versus yeast and Escherichia coli. P values were computed using the
Wilcoxon test.
Table S1. Correlation between protein abundance and protein surface scores
Columns: Scale ID; Description; correlation in E. coli, S. cerevisiae, and H. sapiens.
Scale ID    Description
WIMW960101  Free energies of transfer of AcWL-X-LL peptides from bilayer interface to water (1)
PONP800103  Average gain ratio in surrounding hydrophobicity (2)
PONP800108  Average number of surrounding residues (2)
ROSG850102  Mean fractional area loss (3)
PONP800102  Average gain in surrounding hydrophobicity (2)
GOLD730101  Hydrophobicity factor (4)
KANM800102  Average relative probability of β-sheet (5)
PONP800101  Surrounding hydrophobicity in folded form (2)
BLAS910101  Scaled side-chain hydrophobicity values (6)
ROSG850101  Mean area buried on transfer (3)
VENT840101  Bitterness (7)
LAWE840101  Transfer free energy between N-cyclohexyl-2-pyrrolidone and water (8)
KANM800104  Average relative probability of inner β-sheet (5)
PONP800106  Surrounding hydrophobicity in turn (2)
MANP780101  Average surrounding hydrophobicity (9)
CHAM820101  Polarizability parameter (10)
WILM950101  Hydrophobicity coefficient in reverse phase high-performance liquid chromatography (RP-HPLC) (11)
NAKH900105  Amino acid composition of mitochondrial proteins from animal (12)
NAKH900112  Transmembrane regions of mitochondrial proteins (12)
ARGP820101  Hydrophobicity index (13)
JOND750101  Hydrophobicity (14)
ZIMJ680101  Hydrophobicity (15)
PONP800107  Accessibility reduction ratio (2)
GOLD730102  Residue volume (4)
BIGC670101  Residue volume (16)
COWR900101  Hydrophobicity index, pH 3.0 (17)
NAKH900103  Amino acid composition of mt proteins (12)
NAKH900111  Transmembrane regions of non-mt proteins (12)
CHAM820102  Free energy of solution in water (kcal/mol) (10)
WILM950102  Hydrophobicity coefficient in RP-HPLC (11)
WILM950103  Hydrophobicity coefficient in RP-HPLC (11)
WOLR790101  Hydrophobicity index (18)
YUTK870102  Unfolding Gibbs energy in water, pH 9.0 (19)
FAUJ830101  Hydrophobic parameter π (20)
PONP800105  Surrounding hydrophobicity in β-sheet (2)
PONP930101  Hydrophobicity scales (21)
NAKH900107  Amino acid composition of mt proteins from fungi and plant (12)
CIDH920101  Normalized hydrophobicity scales for α-proteins (22)
NAKH900109  Amino acid composition of membrane proteins (12)
NAKH900108  Normalized composition from fungi and plant (12)
WILM950104  Hydrophobicity coefficient in RP-HPLC (11)
JURD980101  Modified Kyte–Doolittle hydrophobicity scale (23)
KIDA850101  Hydrophobicity-related index (24)
EISD860103  Direction of hydrophobic moment (25)
EISD860102  Atom-based hydrophobic moment (25)
YUTK870101  Unfolding Gibbs energy in water, pH 7.0 (19)
FASG890101  Hydrophobicity index (26)
EISD840101  Consensus normalized hydrophobicity scale (27)
CIDH920103  Normalized hydrophobicity scales for α+β-proteins (22)
CIDH920105  Normalized average hydrophobicity scales (22)
CIDH920104  Normalized hydrophobicity scales for α/β-proteins (22)
PONP800104  Surrounding hydrophobicity in α-helix (2)
KANM800103  Average relative probability of inner helix (5)
CIDH920102  Normalized hydrophobicity scales for β-proteins (22)
SWER830101  Optimal matching hydrophobicity (28)
NAKH900104  Normalized composition of mt proteins (12)
NAKH900106  Normalized composition from animal (12)
YUTK870104  Activation Gibbs energy of unfolding, pH 9.0 (19)
NAKH900101  Amino acid composition of total proteins (12)
YUTK870103  Activation Gibbs energy of unfolding, pH 7.0 (19)
Correlation values for the E. coli, S. cerevisiae, and H. sapiens columns, in extraction order:
−0.44, −0.28, −0.17, −0.44, −0.42, −0.41, −0.39, −0.35, −0.35, −0.32, −0.32, −0.27,
−0.26, −0.26, −0.27, −0.24, −0.26, −0.28, −0.27, −0.29, −0.27, −0.11, −0.47, −0.39,
−0.19, −0.22, −0.21, −0.21, −0.21, −0.13, −0.16, −0.19, −0.10, −0.28, −0.15, −0.11,
−0.25, −0.25, −0.23, −0.21, −0.20, −0.22, −0.33, −0.20, −0.43, −0.39, −0.16, −0.28,
−0.15, −0.23, −0.19, −0.19, −0.18, −0.18, −0.18, −0.17, −0.17, −0.17, −0.16, −0.16,
−0.15, −0.15, −0.14, −0.13, −0.13, −0.10, −0.09, −0.09, −0.08, −0.07, −0.05, −0.02,
0.00, 0.01, 0.02, 0.03, 0.03, 0.05, 0.05, 0.05, 0.05, 0.06, 0.08, 0.08,
0.08, 0.12, 0.12, 0.13, 0.13, 0.17, 0.18, 0.18, 0.19, 0.20, −0.16, −0.15,
−0.19, −0.19, −0.17, −0.21, −0.39, −0.41, −0.31, −0.18, 0.06, −0.17, −0.23, −0.08,
0.01, −0.09, −0.39, 0.01, −0.22, −0.20, −0.19, 0.24, −0.31, −0.27, −0.24, −0.17,
−0.39, −0.10, −0.11, −0.28, −0.13, −0.31, −0.37, −0.34, 0.17, 0.05, −0.31, −0.32,
−0.23, −0.14, 0.23, 0.32, 0.24, −0.06, −0.07, −0.09, −0.09, −0.09, −0.14, −0.19,
−0.19, −0.18, −0.04, −0.05, −0.12, −0.05, −0.16, −0.15, −0.01, −0.08, −0.05, 0.00,
0.02, −0.05, 0.06, −0.06, −0.06, −0.10, −0.06, −0.13, −0.02, 0.05, −0.07, −0.04,
0.04, −0.05, −0.04, 0.07, −0.02, −0.04, −0.06, −0.01, 0.00, 0.17, 0.20, 0.17
Table S1. Cont.
Scale ID    Description                                               E. coli  S. cerevisiae  H. sapiens
NAKH900102  SD of amino acid composition of total proteins (12)       0.20     0.42           0.17
RACS770101  Average reduced distance for C-α (29)                     0.21     0.34           0.17
NAKH900113  Ratio of average and computed composition (12)            0.22     0.34           0.15
KANM800101  Average relative probability of helix (5)                 0.24     0.13           0.04
LEVM760101  Hydrophobic parameter (30)                                0.26     −0.02          0.04
CASG920101  Hydrophobicity scale from native protein structures (31)  0.29     −0.02          0.11
NAKH900110  Normalized composition of membrane proteins (12)          0.30     0.08           0.06
PRAM900101  Hydrophobicity (32)                                       0.31     0.00           0.06
RACS770103  Side chain orientational preference (29)                  0.31     0.23           0.16
ENGD860101  Hydrophobicity index (33)                                 0.31     0.00           0.06
RACS770102  Average reduced distance for side chain (29)              0.34     0.28           0.18
Scores were computed using scales from the AAindex database instead of the “stickiness” scale.
1. Wimley WC, White SH (1996) Experimentally determined hydrophobicity scale for proteins at membrane interfaces. Nature Structural Biology 3(10):842–848.
2. Ponnuswamy PK, Prabhakaran M, Manavalan P (1980) Hydrophobic packing and spatial arrangement of amino acid residues in globular proteins. Biochim Biophys Acta 623(2):301–316.
3. Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH (1985) Hydrophobicity of amino acid residues in globular proteins. Science 229(4716):834–838.
4. Goldsack DE, Chalifoux RC (1973) Contribution of the free energy of mixing of hydrophobic side chains to the stability of the tertiary structure of proteins. Journal of Theoretical
Biology 39(3):645–651.
5. Kanehisa MI, Tsong TY (1980) Local hydrophobicity stabilizes secondary structures in proteins. Biopolymers 19(9):1617–1628.
6. Black SD, Mould DR (1991) Development of hydrophobicity parameters to analyze proteins which bear post- or cotranslational modifications. Anal Biochem 193(1):72–82.
7. Venanzi TJ (1984) Hydrophobicity parameters and the bitter taste of L-amino acids. Journal of Theoretical Biology 111(3):447–450.
8. Lawson EQ, et al. (1984) A simple experimental model for hydrophobic interactions in proteins. J Biol Chem 259(5):2910–2912.
9. Manavalan P, Ponnuswamy PK (1978) Hydrophobic character of amino acid residues in globular proteins. Nature 275(5681):673–674.
10. Charton M, Charton BI (1982) The structural dependence of amino acid hydrophobicity parameters. Journal of Theoretical Biology 99(4):629–644.
11. Wilce MCJ, Aguilar M-I, Hearn MTW (1995) Physicochemical basis of amino acid hydrophobicity scales: Evaluation of four new scales of amino acid hydrophobicity coefficients derived
from RP-HPLC of peptides. Analytical Chemistry 67(7):1210–1219.
12. Nakashima H, Nishikawa K, Ooi T (1990) Distinct character in hydrophobicity of amino acid compositions of mitochondrial proteins. Proteins: Structure, Function, and Bioinformatics
8(2):173–178.
13. Argos P, Rao JK, Hargrave PA (1982) Structural prediction of membrane-bound proteins. Eur J Biochem 128(2-3):565–575.
14. Jones DD (1975) Amino acid properties and side-chain orientation in proteins: a cross correlation approach. Journal of Theoretical Biology 50(1):167–183.
15. Zimmerman JM, Eliezer N, Simha R (1968) The characterization of amino acid sequences in proteins by statistical methods. Journal of Theoretical Biology 21(2):170–201.
16. Bigelow CC (1967) On the average hydrophobicity of proteins and the relation between it and protein structure. Journal of Theoretical Biology 16(2):187–211.
17. Cowan R, Whittaker RG (1990) Hydrophobicity indices for amino acid residues as determined by high-performance liquid chromatography. Peptide Research 3(2):75–80.
18. Wolfenden RV, Cullis PM, Southgate CC (1979) Water, protein folding, and the genetic code. Science 206(4418):575–577.
19. Yutani K, Ogasahara K, Tsujita T, Sugino Y (1987) Dependence of conformational stability on hydrophobicity of the amino acid residue in a series of variant proteins substituted at
a unique position of tryptophan synthase alpha subunit. Proc Natl Acad Sci USA 84(13):4441–4444.
20. Fauchere JL, Pliska V (1983) Hydrophobic parameters pi of amino acid side chains from the partitioning of N-acetyl-amino acid amides. Eur J Med Chem 18:369–375.
21. Ponnuswamy PK (1993) Hydrophobic characteristics of folded proteins. Prog Biophys Mol Biol 59(1):57–103.
22. Cid H, Bunster M, Canales M, Gazitua F (1992) Hydrophobicity and structural classes in proteins. Protein Eng 5(5):373–375.
23. Juretic D, Lucic B, Zucic D, Trinajstic N (1998) Protein transmembrane structure: recognition and prediction by using hydrophobicity scales through preference functions. Theoretical
and Computational Chemistry, ed Cyril P (Elsevier), Vol 5, pp 405–445.
24. Kidera A, Konishi Y, Oka M, Ooi T, Scheraga H (1985) Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J Protein Chem 4(1):23–55.
25. Eisenberg D, McLachlan AD (1986) Solvation energy in protein folding and binding. Nature 319(6050):199–203.
26. Fasman GD (1989) Prediction of protein structure and the principles of protein conformation (Plenum, New York).
27. Eisenberg D (1984) Three-dimensional structure of membrane and surface proteins. Annu Rev Biochem 53:595–623.
28. Sweet RM, Eisenberg D (1983) Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure. J Mol Biol 171(4):479–488.
29. Rackovsky S, Scheraga HA (1977) Hydrophobicity, hydrophilicity, and the radial and orientational distributions of residues in native proteins. Proc Natl Acad Sci USA 74(12):5248–5251.
30. Levitt M (1976) A simplified representation of protein conformations for rapid simulation of protein folding. J Mol Biol 104(1):59–107.
31. Casari G, Sippl MJ (1992) Structure-derived hydrophobic potential. Hydrophobic potential derived from X-ray structures of globular proteins is able to identify native folds. J Mol Biol
224(3):725–732.
32. Prabhakaran M (1990) The distribution of physical, chemical and conformational properties in signal and nascent peptides. Biochem J 269(3):691–696.
33. Engelman DM, Steitz TA, Goldman A (1986) Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annual Review of Biophysics and Biophysical
Chemistry 15:321–353.
34. Kawashima S, et al. (2008) AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 36(Database issue):D202–D205.
Table S2. Interface propensities used as the stickiness score
Amino acid  Interface propensity
A            0.0062
C            1.0372
D           −0.7485
E           −0.7893
F            1.2727
G           −0.1771
H            0.1204
I            1.1109
K           −1.1806
L            0.9138
M            1.0124
N           −0.2693
P           −0.1799
Q           −0.4114
R           −0.0876
S            0.1376
T            0.1031
V            0.7599
W            0.7925
Y            0.8806
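As a worked illustration of how such a scale is derived and applied, the sketch below computes a propensity as the log ratio of an amino acid's frequency at the interface core to its frequency at the surface (as stated in the Fig. S1 caption) and averages scale values over a set of residues. The use of the natural logarithm, the unweighted mean, and all function names are assumptions made for the example; the input frequencies in the usage note are invented and are not data from the study.

```python
import math


def interface_propensity(freq_core, freq_surface):
    """Log ratio of interface-core to surface frequency (natural log assumed)."""
    return math.log(freq_core / freq_surface)


def mean_score(residues, scale):
    """Unweighted mean of scale values over a sequence of one-letter codes."""
    return sum(scale[aa] for aa in residues) / len(residues)


# Three entries copied from Table S2, enough for a small demonstration.
SCALE_SUBSET = {"K": -1.1806, "F": 1.2727, "A": 0.0062}

if __name__ == "__main__":
    # Hypothetical frequencies: twice as common at the core as at the surface.
    print(interface_propensity(0.02, 0.01))
    # Mean stickiness of a toy surface made of K, F, and A.
    print(mean_score("KFA", SCALE_SUBSET))
```

A residue that is enriched at interface cores gets a positive propensity and one that is depleted gets a negative value, which matches the signs of lysine and phenylalanine in the table.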
Table S3. Information on species pairs used for the arginine–lysine and aspartic acid–glutamic acid substitution frequency analysis
Species 1                 Species 2                 No. orthologous pairs  Average % sequence divergence  No. conserved D, E, R, or K  No. of substitutions between D-E or R-K
Escherichia coli          Salmonella typhimurium    3,102 (868)            13.8 (10.4)                    180,634 (65,231)             8,818 (2,743)
Saccharomyces cerevisiae  Saccharomyces paradoxus   3,798 (3,529)          8.0 (7.4)                      463,813 (440,319)            14,822 (13,603)
Homo sapiens              Mus musculus              7,314 (4,435)          11.3 (10.8)                    810,931 (525,740)            34,766 (21,876)
The numbers in parentheses refer to the subset of proteins with known abundance data.
Table S4. Species used in alignments of orthologous proteins

Escherichia coli set (no. orthogroups with at least 10/13 species and a known structure = 397)
E. coli (K12 MG1655)
Vibrio parahaemolyticus
Burkholderia cenocepacia (J2315)
Proteus mirabilis
Pseudomonas fluorescens (SBW25)
Shewanella baltica (OS223)
Serratia proteamaculans (568)
Aeromonas salmonicida (A449)
Salmonella enterica Enteritidis (P125109)
Pseudomonas aeruginosa (LESB58)
Salmonella typhimurium (LT2)
Yersinia enterocolitica (8081)
Aeromonas hydrophila (ATCC7966)

Saccharomyces cerevisiae set (no. orthogroups with at least 13/15 species and a known structure = 196)
S. cerevisiae
Saccharomyces paradoxus
Saccharomyces mikatae
Saccharomyces bayanus
Candida glabrata
Saccharomyces castellii
Kluyveromyces lactis
Ashbya gossypii
Kluyveromyces waltii
Debaryomyces hansenii
Candida albicans
Yarrowia lipolytica
Candida tropicalis
Candida guilliermondii
Candida lusitaniae

Homo sapiens set (no. orthogroups with at least 8/9 species and a known structure = 631)
H. sapiens
Mus musculus
Gallus gallus
Pan troglodytes
Rattus norvegicus
Danio rerio
Xenopus tropicalis
Bos taurus
Malus domestica

Proteins within an orthologous group were aligned with MUSCLE (1), and evolutionary rates of each amino acid were calculated using Rate4Site (2).
1. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797.
2. Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N (2002) Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 18(Suppl 1):S71–S77.
Table S5. General statistics on the structural datasets used
Species name               No. structural chains
Escherichia coli           397 (172)
Saccharomyces cerevisiae   196 (193)
Homo sapiens               631 (495)
Numbers in parentheses correspond to the number of structures with corresponding abundance data available.