BIOINFORMATICS
ORIGINAL PAPER
Vol. 22 no. 11 2006, pages 1335–1342
doi:10.1093/bioinformatics/btl079
Structural bioinformatics
Predicting protein interaction sites: binding hot-spots in
protein–protein and protein–ligand interfaces
Nicholas J. Burgoyne and Richard M. Jackson
Institute of Molecular and Cellular Biology, Faculty of Biological Sciences, University of Leeds, Leeds LS2 9JT, UK
Received on January 20, 2006; revised on February 14, 2006; accepted on February 28, 2006
Advance Access publication March 7, 2006
Associate Editor: Dmitrij Frishman
ABSTRACT
Motivation: Protein assemblies are currently poorly represented in
structural databases and their structural elucidation is a key goal in
biology. Here we analyse clefts in protein surfaces, likely to correspond
to binding ‘hot-spots’, and rank them according to sequence conservation and simple measures of physical properties including hydrophobicity, desolvation, electrostatic and van der Waals potentials, to predict
which are involved in binding in the native complex.
Results: The resulting differences between predicting binding-sites at
protein–protein and protein–ligand interfaces are striking. There is a
high level of prediction accuracy (93%) for protein–ligand interactions,
based on the following attributes: van der Waals potential, electrostatic
potential, desolvation and surface conservation. Generally, the prediction accuracy for protein–protein interactions is lower, with the exception of enzymes. Our results show that the ease of cleft desolvation is
strongly predictive of interfaces and strongly maintained across all
classes of protein-binding interface.
Contact: [email protected]
Supplementary information: Supplementary data are available at
Bioinformatics online.
INTRODUCTION
Protein interactions are critical for all aspects of cellular function,
and determining how they interact has become a major goal in the
post-genomic era (Sali et al., 2003; Rhodes et al., 2005). An emerging new approach is to take advantage of structural information
to predict physical binding. In particular, prediction of protein
binding-sites can guide the structural elucidation of protein complexes, allowing function prediction for numerous unannotated
structural genomics targets and the design of molecules that
can modulate biological function at a systems-level (Russell
et al., 2004).
Protein–protein interfaces are generally considered to be circular
and relatively flat (Jones and Thornton, 1997a). Studies of their
properties have also shown that hydrophobic area prevails while
electrostatic residues and hydrogen-bonding groups are evenly
spread across the surface (Xu et al., 1997; Lo Conte et al.,
1999), suggesting a uniform distribution of properties across the
binding interfaces. However, alanine-scanning mutagenesis has
shown that the stability of a complex is determined by only a
fraction of the interface residues. This was first shown in the
complex between human-growth hormone and the human-growth
hormone-binding protein (Clackson and Wells, 1995). They showed
that a far greater loss in affinity was seen when two tryptophan
residues were mutated to alanine when compared with other mutations made in the interface. Similar results have been collated for a
number of other protein complexes (Bogan and Thorn, 1998; Thorn
and Bogan, 2001). Analysis of these data shows that these ‘hot-spot’
residues are usually found in the centre of the interface, and are
surrounded by residues that have a lesser effect on stability (Bogan
and Thorn, 1998). Structural analysis has shown that the clusters of
conserved, hot-spot, residues form clefts in the surface of the protein
(Li et al., 2004). Enrichment of hot-spot areas with tryptophan,
tyrosine and arginine residues has been shown based on protein
family conservation (Hu et al., 2000; Ma et al., 2003; Keskin et al.,
2005). However, other residues, including polar residues, may also
be highly conserved (Hu et al., 2000). Solvent exclusion around
polar or charged interactions lowers the effective dielectric constant
thus strengthening the interaction. It has been suggested that the
hydrophobic effect of hot-spot residues is almost double that of
other residues in an antibody/antigen interface (Li et al., 2005).
Multiple hot-spot regions can be found in a single interface (Ma
et al., 2003); these tend to cluster at the centre of the interface, and
interact with equivalent hot-spot residues in the binding partner
protein (Halperin et al., 2004; Keskin et al., 2005). The clustering
of conserved residues has been used to predict possible interacting
partners based on alignment to a known pair of interacting
proteins (Aytuna et al., 2005; Espadaler et al., 2005). Clefts in
the protein surface are also important for the prediction of
protein–ligand interactions. Indeed, ligand-binding sites can often
be predicted by identifying the largest exposed cleft (Laskowski
et al., 1996). Ligand-binding sites may also be detected as cleft
regions with high predicted binding affinity, based on energetic
contouring with interacting molecular probes (Laurie and
Jackson, 2005).
Here we analyse the ability of different key physicochemical
attributes and other binding surface properties, such as surface
conservation, to predict the binding interface in protein–protein
and protein–ligand complexes. Our approach attempts to define
binding hot-spots on the protein surface. These are defined by clefts
on the protein surface that are further described by their key attributes. We predict clefts on the protein surface using Q-SiteFinder
(Laurie and Jackson, 2005), an energy-based method for the prediction of protein binding clefts. The properties of these predicted
clefts are then investigated and compared. Given that the key attributes we use to describe clefts are the same for both protein–protein
The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
and protein–ligand interactions, the resulting differences we find
between their binding-sites are striking. This has important implications for understanding the nature of protein interactions, for identifying the suitability of sites as drug targets and for identifying
critical regions for docking and structure-based drug design.
METHODS
Interaction datasets
Enzyme–inhibitor and antibody–antigen complexes of the protein–protein
docking benchmark 1.0 (Chen et al., 2003) were supplemented with a
selection of other available protein complexes to create a protein–protein
interaction dataset of 97 pairwise non-obligate hetero-complexes. These
additional complexes satisfied the following criteria: (1) No two complexes
had a sequence-identity of >60% in an alignment over >80% of the sequence
of both proteins involved in the interaction. (2) In keeping with the other
complexes of benchmark 1.0 no complex had a resolution of >3.25 Å.
(3) No complex had large disordered regions in the interface. Oligomeric
interfaces for all obligate protein–protein interactions of the dataset were
obscured using the appropriate multimeric complex as the interacting
monomer. This was achieved by analysis of the protein quaternary structure
database (Henrick and Thornton, 1998). As such, monomers mentioned in
the text may refer to protein structures of more than one peptide chain.
Occasionally this left several independent interfaces between interacting
proteins. In the cases where more than one independent interface to identical
proteins exists, all interfaces of the protein were considered. Where proteins
have only one interface, only that interface was assessed. Cognate ligands
required for the function of the protein were retained, while those not indicated as such in the literature were discarded. The final dataset consisted of
22 enzyme–inhibitor complexes, 19 antibody–antigen complexes and 56
protein complexes that could not be included in either classification
(other complexes). Details of the proteins used can be found in the Supplementary Material.
A total of 134 ligand–protein complexes from the Gold dataset were used
(Nissink et al., 2002) and were prepared as described by Laurie and Jackson
(2005). These constitute a subset of the full Gold dataset where proteins of
high structural similarity are removed. The dataset was further classified into
protein–ligand interactions in enzymes (95 complexes) and non-enzymes
(39 complexes). The binding-site of a single ligand is assessed for each
protein. Additional ligands covalently bound to the protein are treated
as protein while other solvent molecules are removed. As with the protein–
protein interaction dataset the interacting proteins are assessed in the absence
of the binding molecule.
Pocket detection and occupancy using Q-SiteFinder
Clefts were identified in the protein-surface using Q-SiteFinder (Laurie and
Jackson, 2005). The method is described briefly here. A grid is built around each
protein, such that the spacing between grid points is 0.9 Å in all orthogonal directions and the entire protein is covered. The non-bonded interaction
energy is then calculated using the GRID forcefield (Goodford, 1985) parameters for every grid-point that is sufficiently far from the protein to allow a
methyl (–CH3) group to be positioned without steric overlap with any protein
atom (Jackson, 2002). Probes with calculated interaction energies of less
than −1.3 kcal/mol are retained and clustered according to spatial proximity whereby no probe in any cluster has a centre that is >1.0 Å from the
centre of another probe in the same cluster. The total non-bonded interaction
energy is then calculated across all probes in the cluster and serves as the
means by which the clusters are ranked. The highest ranking site is the
cluster with the highest cumulative interaction energy. Occupancy of a
cleft is defined as the percentage fill of the cleft by atoms of the binding
protein or ligand; a threshold of 25% is applied to define a successful cleft.
Proteins with no occupied clefts were excluded from further analysis. These
consisted of the larger proteins of two protein–protein interactions (1HE8 and
1DE4) and six protein–ligand complexes (1CGY, 1DR1, 1HDY, 1LDM,
1PBD and 2PDM).
Pocket ranking
Hydrophobicity
The atomistic Solvent Accessible Surface Area (SASA)
covered by each cleft was calculated by NACCESS (Hubbard and Thornton,
1993). The solvent accessibilities of all protein atoms were calculated in both
presence and absence of the probes of each cluster. The difference between
these values represents the covered area of the cleft. Hydrophobic area was
defined as the area exposed by atoms that are either carbon or sulphur (as in
NACCESS). The cleft with the highest proportion of hydrophobic accessible
atom area was ranked first.
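The hydrophobicity ranking can be illustrated with a minimal sketch. It assumes that per-atom SASA values have already been computed twice (without and with the probe cluster), and that the "proportion" is taken relative to the total area covered by the cleft; the function names and the example numbers are hypothetical and are not part of NACCESS or Q-SiteFinder.

```python
def hydrophobic_fraction(atoms):
    """Fraction of the cleft-covered area contributed by carbon or sulphur atoms.

    atoms: iterable of (element, sasa_without_probes, sasa_with_probes), areas in Å^2.
    """
    covered_total = 0.0
    covered_hydrophobic = 0.0
    for element, sasa_free, sasa_probed in atoms:
        # Area covered by the cleft = loss of accessibility when probes are present.
        covered = max(sasa_free - sasa_probed, 0.0)
        covered_total += covered
        if element.upper() in ("C", "S"):
            covered_hydrophobic += covered
    return covered_hydrophobic / covered_total if covered_total > 0 else 0.0


def rank_by_hydrophobicity(clefts):
    """Return cleft names ordered from most to least hydrophobic."""
    return sorted(clefts, key=lambda name: hydrophobic_fraction(clefts[name]),
                  reverse=True)


# Made-up example values, for illustration only.
clefts = {
    "cleft_1": [("C", 12.0, 4.0), ("N", 8.0, 4.0)],
    "cleft_2": [("C", 10.0, 2.0), ("S", 6.0, 3.0), ("O", 5.0, 5.0)],
}
ranking = rank_by_hydrophobicity(clefts)
```

In this toy input all of the area covered in cleft_2 belongs to carbon or sulphur atoms, so it is ranked ahead of cleft_1.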
Desolvation
The entropy associated with the removal of water from
the surfaces covered by the probe clusters was estimated using coefficients
for the transfer of N-acetyl derivative amino acids between water and
octanol. The total transfer energies have previously been converted to
atomistic-based values by linear fitting (Fauchere and Pliska, 1983) and
optimized for protein–protein docking (Fernandez-Recio et al., 2004).
The coverage of each atom was calculated as described above, allowing
the desolvation energy of the cleft to be calculated by multiplying the area
by appropriate solvation parameter taken from Fernandez-Recio et al.
(2004). Clefts were ranked such that the cleft most easily desolvated was
ranked first.
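As a rough sketch of the desolvation score, the cleft energy can be written as a sum over covered atoms of area multiplied by an atomic solvation parameter. The parameter values below are placeholders invented for illustration (they are not the values of Fernandez-Recio et al.), and the convention that a lower energy means easier desolvation is an assumption of this sketch.

```python
def desolvation_energy(covered_areas, solvation_params):
    """Sum of covered area (Å^2) times the solvation parameter for each atom.

    covered_areas: iterable of (atom_type, covered_area).
    solvation_params: mapping atom_type -> energy per unit area.
    """
    return sum(area * solvation_params[atom_type]
               for atom_type, area in covered_areas)


def rank_by_desolvation(clefts, solvation_params):
    """Order cleft names so the most easily desolvated (lowest energy) is first."""
    return sorted(clefts,
                  key=lambda name: desolvation_energy(clefts[name], solvation_params))


# Hypothetical parameters: negative = favourable burial, positive = unfavourable.
params = {"C": -0.02, "N": 0.03}
clefts = {
    "polar_cleft": [("C", 2.0), ("N", 10.0)],
    "apolar_cleft": [("C", 10.0), ("N", 2.0)],
}
ranking = rank_by_desolvation(clefts, params)
```

With these placeholder parameters the mostly apolar cleft has the lower energy and is therefore ranked first.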
Electrostatics
The peak, average and total electrostatic potentials for each
cleft were calculated using the DelPhi v.4 package (Rocchia et al., 2001,
2002). Protons were added to the protein as described by Q-SiteFinder
(Laurie and Jackson, 2005). A grid of 101 points in all dimensions was
built around each protein such that the molecule occupied 50% of the cubic
grid’s volume. Amber charges and radii were used to describe the protein and
the associated cognate ligands (protein–protein interface analysis only)
(Giammona, 1984; Weiner et al., 1984; Schneider and Suhnel, 1999;
Antony et al., 2000; Meagher et al., 2003). A dielectric boundary condition
was used, with the dielectric constants of 4 within and 80 outside the
molecular surfaces defined by a water probe radius of 1.4 Å. Salt concentration
in the solvent was set at 0.15 M and a 2 Å ion exclusion radius was applied
around the protein. The finite difference Poisson–Boltzmann calculations
were performed until the potentials at the grid points converged to 0.0001 kT
e⁻¹, after which the potentials were extrapolated to the centre of the individual probes. The total cleft potential was defined as the sum of the modulus
of individual probe potentials that form the cluster, while the average potential was this total value divided by the number of probes in the cluster. The
peak values for each cleft were defined as the highest modulus individual
probe potentials of each cluster. The same calculations were repeated with all
proteins with a unit charge applied to all atoms. The same Amber atom radii
were used as described, as were all other variables bar the convergence
criteria. Unit charge calculations were run until they converged to within
0.01 kT e⁻¹.
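The three per-cleft electrostatic descriptors defined above (total, average and peak of the modulus of the probe potentials) can be sketched as follows. The sketch assumes the potentials have already been extrapolated to the probe centres by a Poisson–Boltzmann solver; the function name and the example values are hypothetical.

```python
def cleft_potential_summary(probe_potentials):
    """Total, average and peak of |potential| over the probes of one cluster.

    probe_potentials: potentials (e.g. in kT/e) at the probe centres.
    """
    moduli = [abs(p) for p in probe_potentials]
    if not moduli:
        raise ValueError("a cleft must contain at least one probe")
    total = sum(moduli)
    return {
        "total": total,                  # sum of the moduli
        "average": total / len(moduli),  # total divided by the number of probes
        "peak": max(moduli),             # largest individual modulus
    }


summary = cleft_potential_summary([-2.0, 1.0, 0.5])
```

Because the total scales with the number of probes, it also carries information about cleft size, whereas the average and peak do not.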
Conservation
The surface conservation for each cleft was calculated by
extracting the protein sequences of each chain in the protein from the associated ATOM records of their Protein Data Bank coordinates. For each chain
of the protein, close-homologues were found by performing a PSI-BLAST
search over the non-redundant Swiss-Prot database release 47.6 (Altschul
et al., 1997; Bairoch et al., 2005). Three iterations were performed and the
search refined using sequences with similarity to the query sequence defined
at an E-value <0.001. Redundant sequences were removed and the remaining
full-length sequences were then aligned by Muscle (Edgar, 2004). The
conservation of each position in the alignment relative to the initial protein
sequence was calculated by Scorecons (Valdar, 2002). Conservation scores
are based on the sum of the weighted pairwise exchange of residues, for each
pair of sequences, at each position in the alignment. Values lie between 0
(no conservation) and 1 (completely conserved). SASA was calculated for
the protein structure of the same chain by MSMS with vertices spread evenly
across the protein surface at a density of 1 vertex per Å² (Sanner et al., 1996).
The conservation scores for each residue were then mapped to the appropriate vertices. Cleft conservation was defined as the average conservation
score of vertices within 2.0 Å of the centre of a probe belonging to the
cluster.
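The final averaging step can be sketched as below, assuming the surface vertices already carry the conservation score of their parent residue. The function name, the coordinate tuples and the handling of clefts with no nearby vertex are assumptions made for illustration.

```python
import math


def cleft_conservation(probe_centres, vertices, vertex_scores, cutoff=2.0):
    """Average conservation of surface vertices within `cutoff` Å of any probe.

    probe_centres, vertices: sequences of (x, y, z) coordinates in Å.
    vertex_scores: conservation score (0 to 1) mapped onto each vertex.
    Returns None when no vertex lies within the cutoff (an assumption of this sketch).
    """
    selected = [
        score
        for vertex, score in zip(vertices, vertex_scores)
        if any(math.dist(vertex, probe) <= cutoff for probe in probe_centres)
    ]
    if not selected:
        return None
    return sum(selected) / len(selected)


# Made-up coordinates and scores, for illustration only.
probes = [(0.0, 0.0, 0.0)]
vertices = [(1.0, 0.0, 0.0), (0.0, 1.5, 0.0), (5.0, 0.0, 0.0)]
scores = [0.8, 0.4, 0.1]
conservation = cleft_conservation(probes, vertices, scores)
```

Only the first two vertices lie within 2.0 Å of the probe, so the third does not contribute to the average.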
Calculation of true and false positive rates
The true positive rate (TPR), or sensitivity, was calculated as the number
of genuine interface clefts that are ranked in the top k clefts (true positives,
TP) divided by the total number of interface clefts identified for this monomer (TP plus false negatives, FN), where k increases by one each time. The
false positive rate (FPR), one minus specificity, was the number of clefts
ranked in the top k clefts that are not in the interface (FP) divided by the total
number of non-interface clefts (FP plus true negatives, TN). Equal TPR and
FPR values at all values of k indicate no discrimination by the ranking
method.
The sensitivity [TPR = TP/(TP + FN)] and false positive rates [FPR =
FP/(FP + TN)] were calculated for the clefts ranked by the above attributes.
An attribute that is a successful predictor of interface clefts will show high
sensitivity and a low error rate. Plotting TPR against FPR gives a receiver
operating characteristic (ROC) plot.
In order to give a single measure of prediction accuracy, which is
independent of any decision threshold, we have calculated the ROC integral
or area under the curve (AUC) (Hanley and McNeil, 1982). A value of
0.5 indicates no correlation, a value <0.5 indicates negative correlation
and a value approaching 1.0 indicates the theoretical maximum or perfect
prediction.
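The ROC construction can be sketched for a single ranked list of clefts as follows. This is an illustrative implementation only: it assumes the list is ordered best rank first, that it contains at least one interface and one non-interface cleft, and that the AUC is integrated with the trapezoid rule; how results are pooled across proteins is not specified here.

```python
def roc_points(ranked_is_interface):
    """(FPR, TPR) after each cutoff k = 0, 1, ..., n on a ranked list of clefts.

    ranked_is_interface: booleans ordered best rank first; True = interface cleft.
    """
    positives = sum(ranked_is_interface)
    negatives = len(ranked_is_interface) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one interface and one non-interface cleft")
    points = [(0.0, 0.0)]
    tp = fp = 0
    for is_interface in ranked_is_interface:
        if is_interface:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))  # (FPR, TPR) for the top k
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


perfect = auc(roc_points([True, True, False, False]))
mixed = auc(roc_points([True, False, True, False]))
inverted = auc(roc_points([False, False, True, True]))
```

A ranking that places every interface cleft first gives an AUC of 1.0, and the fully inverted ranking gives 0.0, matching the interpretation of values above and below 0.5.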
RESULTS
Analysis of protein clefts
Q-SiteFinder was used to define the clefts found on the protein
surface (see Methods). Q-SiteFinder was run on both bound monomers of each hetero-protein complex in the non-obligate protein–
protein interaction datasets and on all proteins in the protein–ligand
dataset. Following evaluation at different probe interaction thresholds a value of −1.3 kcal/mol was chosen, based on the success of
the method in protein–ligand complexes around this value (Laurie
and Jackson, 2005) and also based on the visualization of the clefts
involved in protein–protein interactions, since it also best describes
the clefts of the protein that were filled by the sidechains of the
opposing binding protein. We found that altering the threshold
did not greatly affect the predictive power for protein–protein
interactions.
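The thresholding and clustering step described in the Methods can be sketched as below. This is not the Q-SiteFinder code: it assumes single-linkage clustering (probes joined whenever their centres are within 1.0 Å), reads "highest cumulative interaction energy" as the most favourable (most negative) total, and uses hypothetical function names and made-up probe data.

```python
import math


def cluster_probes(probes, energy_cutoff=-1.3, link_distance=1.0):
    """Keep favourable probes, link neighbours, and rank clusters by total energy.

    probes: list of ((x, y, z), energy_kcal_per_mol).
    Returns clusters (lists of probes), most favourable total energy first.
    """
    kept = [(xyz, e) for xyz, e in probes if e < energy_cutoff]
    parent = list(range(len(kept)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            if math.dist(kept[i][0], kept[j][0]) <= link_distance:
                parent[find(i)] = find(j)

    clusters = {}
    for i, probe in enumerate(kept):
        clusters.setdefault(find(i), []).append(probe)
    # Most negative summed energy is treated as the top-ranked site (assumption).
    return sorted(clusters.values(), key=lambda c: sum(e for _, e in c))


# Made-up probes on a 0.9 Å grid line, for illustration only.
probes = [
    ((0.0, 0.0, 0.0), -2.0), ((0.9, 0.0, 0.0), -1.5), ((1.8, 0.0, 0.0), -1.4),
    ((10.0, 0.0, 0.0), -3.0), ((10.9, 0.0, 0.0), -1.0), ((20.0, 0.0, 0.0), -0.5),
]
clusters = cluster_probes(probes)
```

The pairwise loop is quadratic in the number of retained probes, which is adequate for a sketch; a grid-neighbour lookup would be the natural choice for real protein-sized grids.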
Q-SiteFinder was used to calculate 99 clefts on the protein
surface. Occupied or true cleft predictions for each protein are
defined as those occupied to >25% of their volume by atoms of
the interacting molecule. Figure 1 shows that for all 194 monomers
of the protein–protein dataset there is a slightly skewed distribution
of the number of true interface clefts per protein monomer with a
peak around six clefts. The separate distributions of smaller and
larger monomers show differences. The peak number of successful
interface clefts for the smaller (ligand) monomers, at six, is higher
than the peak, at 2–4 clefts, for larger (receptor) monomers; however, the latter has a broad peak spanning 2–9 clefts. Figure 2 shows
there are approximately equal percentages of interface areas
covered in both sets of monomers. Protein–ligand receptors show
far fewer occupied clefts with a clear peak at one. Also true interface
clefts have a greater coverage than do the non-interface regions in
contrast to protein–protein interfaces. However, given the large
standard deviations it is not possible to conclude this observation
is significant. Below we attempt to separate those clefts that are
occupied from those that are not, based on properties that have
been suggested to be important for the prediction and stability of
intermolecular interactions.
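The occupancy criterion can be sketched as the percentage of a cleft's probes that lie close to an atom of the binding partner. The text defines occupancy only as the "percentage fill" of the cleft, so the contact distance used below (1.6 Å) and the function names are assumptions of this sketch rather than the published definition.

```python
import math


def cleft_occupancy(probe_centres, partner_atoms, contact_distance=1.6):
    """Percentage of probes within `contact_distance` Å of any partner atom."""
    if not probe_centres:
        raise ValueError("a cleft must contain at least one probe")
    filled = sum(
        1
        for probe in probe_centres
        if any(math.dist(probe, atom) <= contact_distance for atom in partner_atoms)
    )
    return 100.0 * filled / len(probe_centres)


def is_occupied(probe_centres, partner_atoms, threshold=25.0):
    """A cleft counts as occupied when more than `threshold` percent is filled."""
    return cleft_occupancy(probe_centres, partner_atoms) > threshold


# Made-up coordinates, for illustration only.
probes = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
occupancy = cleft_occupancy(probes, [(0.5, 0.0, 0.0)])
```

Counting the occupied clefts per monomer with `is_occupied` would give the kind of distribution summarized in Figure 1.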
Analysis of cleft properties
Intermolecular interactions are stabilized by a number of factors;
these include the burial of areas of hydrophobicity, the formation of
hydrogen-bonds and electrostatic complementarity (Chothia and
Janin, 1975). We have found clefts in the protein surfaces using
Q-SiteFinder and re-ranked them according to simple measures of
the cleft properties (see Methods). All the different properties were
used to rank the clefts to assess their ability to predict which ones
are involved in binding in the native complex. ROC plots are
used for this purpose (see Methods). Whilst most of the properties
used (van der Waals, electrostatic, hydrophobicity, desolvation, and
conservation) are readily interpretable, the unit electrostatic properties involve the calculation of electrostatic potential from a
uniform charge density on all protein atoms, as opposed to the
standard atom-specific charges. The method typically generates
large potentials in larger or enclosed clefts (Bate and Warwicker,
2004), and does not reflect atom-specific electrostatic properties of
the site.
Enzyme–inhibitor complexes
Results of re-ranking the clefts for the enzyme–inhibitor monomers
are shown in Figure 3a and b. The results for the enzymes (receptors) show that surface residue conservation is the best predictor of
true protein interface regions as well as for their protein inhibitors
(ligands). However, it is a much better predictor for enzymes, in
agreement with the study of Bradford and Westhead (2003). The
unit electrostatic properties (unit peak and average) are also good
predictors of interface regions in the enzymes but show no predictive power (or even slight anti-correlation) in the inhibitors. These
results can be rationalized in terms of known protein function, in
that the substrate/inhibitor binding cleft of the enzymes is conserved
in sequence owing to the conserved nature of the catalytic mechanism, and also active sites are often enclosed, pre-organized clefts,
a property that may be important for enzyme catalysis. For example,
the serine-protease family is well represented in the enzyme class.
The sequence conserved catalytic oxy-anion hole in serine proteases
is a deep cleft which accommodates the conserved binding conformation of the protein inhibitor ‘canonical’ loop (Jackson and
Russell, 2000). Therefore, it is unsurprising that unit electrostatic
properties capture this characteristic for the enzymes only. Both
of these cleft characteristics may capture the characteristics of the
binding clefts of enzyme protein families very well but are not
expected to generalize well to other protein–protein interactions.
The only other strong predictor of interface regions is ranking clefts
by desolvation energy. This is discussed further below.
[Figure 1: four histogram panels (All 194 Protein-Protein Monomers; 97 Larger Protein-Protein Monomers; 97 Smaller Protein-Protein Monomers; 134 Protein-Ligand Monomers). x-axis: Number of Clefts with Occupancies Greater than 25%; y-axis: Percentage of Monomers.]
Fig. 1. Distribution of the number of occupied interface clefts across the 97 protein–protein complexes and 134 protein–ligand complexes, as generated by
Q-SiteFinder.
Antibody–antigen complexes
The ROC plots for the predictions of the interface clefts of the 38
bound monomers of the 19 antibody–antigen complexes can be seen
in Figure 3c and d. Although the results for the antigens (ligands)
do not show any strong trends, those for the prediction of interface
clefts in the antibodies (receptors) do. The highly significant
anti-correlation of residue conservation in the antibodies is striking,
yet entirely logical given the biological role of antibodies. This
reflects antibody structure in which highly sequence-diverse loops
form the complementarity determining regions (CDRs) and antigen
binding interface. The CDRs determine specificity on what is
otherwise a highly conserved protein framework. Both Q-SiteFinder
and total unit electrostatics scores show some weak predictive
capacity that may reflect their ability to define the size and
position of clefts between the loops that form the CDRs of the
antibodies. However, the results for the antibodies show that
cleft desolvation energy is by far the best predictor of protein
interface regions. It is interesting to see that it is also the most
prominent of all attributes in the antigen proteins as well, albeit
to a much lesser degree.
Other protein–protein complexes
The ROC curves for the 112 bound monomers of the 56 ‘other’
complexes can be seen in Figure 3e. Unlike antibody and enzyme
results there are few strong correlations to be seen in either the
ligand or receptor sets of proteins (see Supplementary Data). This is
probably because this is a highly non-homogenous set of protein–
protein complexes. This diversity almost certainly neutralizes
protein sub-class specific characteristics such as those seen in the
enzyme and antibody sets. The strongest correlation between interface cleft and high rank in both the receptor and ligand proteins is
the desolvation energy. Conservation is only weakly predictive, and
fails to re-rank the interface clefts highly for a more diverse set
of protein complexes, consistent with the results of Caffrey et al.
(2004). By changing the area over which conservation is assessed
[from a large patch in the investigation of Caffrey et al. (2004) to a small
cleft in ours] it was hoped to improve these results in-line with the
findings of Li et al. (2004). Li et al. (2004) showed that the
conserved residues of interface regions cluster around clefts in
the protein surface. By re-ranking the clefts according to conservation score, we anticipated an improved relationship for the prediction of interfaces. However this was not the case. Consistent with
the results of all the protein–protein interfaces analyzed in this piece
of work, desolvation energy is the most effective common factor in
the identification of interface clefts.
Protein–ligand interactions
The ROC curves for the 134 protein–ligand complexes can be found
in Figure 3f. The difference between these results and those of the
protein–protein interface clefts is striking. Unlike the protein–
protein interfaces, several properties are excellent predictors of
protein interface regions. The ranking of clefts based on
Q-SiteFinder supports both the relative merits of using this method
for ligand-binding site prediction and the previous observation that
this method has a 90% success rate in the top three predicted clusters
when tested on the same dataset (Laurie and Jackson, 2005). Not so
surprisingly the electrostatic total (based on Amber charges) and
unit electrostatic total show a very similar pattern, with the latter
being slightly more successful than Q-SiteFinder. These two measures may be most dependent on the number of probe centres that
make up the cleft, as defined by Q-SiteFinder, rather than the nature
of the properties that define the cleft. Of all the attributes the unit
electrostatics (average and peak) stand out and rank slightly above
Q-SiteFinder overall, however, their initial prioritization of clefts
is in fact weaker. The finding that using unit charges rather than the
more physicochemically realistic Amber charges improves results
(Bate and Warwicker, 2004) is also corroborated by our results.
Other strong predictors are cleft desolvation energy and surface
residue conservation. The latter is not unexpected since active/ligand-binding sites have been shown to be sequence conserved across
many different protein families (Bartlett et al., 2002) and consequently this property is useful in the identification of enzyme
active sites (Greaves and Warwicker, 2005). This is further corroborated here by the difference in predictive power of conservation for
enzymes versus non-enzyme protein–ligand complexes (see Supplementary Data).
[Figure 2: bar chart of Relevant Surface Area Covered (%) for the Interface, Non-Interface and Total Surface in four groups: All Protein-Protein, Smaller Protein-Protein, Larger Protein-Protein and Protein-Ligand.]
Fig. 2. Average coverage of the protein surface area by the predicted clefts generated by Q-SiteFinder in 97 protein–protein complexes and 134 protein–ligand
complexes. Error bars represent the standard deviations of each distribution.
With the exception of the enzymes, protein–protein interaction
clefts have a poor correlation with each of the three measures for
the unit electrostatics calculations, implying that clefts in protein–
protein interfaces (defined by Q-SiteFinder) are fundamentally
different to those in protein–ligand complexes. The observation
that high electrostatics potential is indicative of a ligand-binding
site seems at odds with the finding that the ease of desolvation of
the cleft also correlates well. The ease of desolvation used in this
analysis would be expected to favour non-polar surface with a high
aliphatic or aromatic content rather than a charged surface. However, it is primarily the unit electrostatic properties that are highly
predictive, and these may be most dependent on the shape and depth
of the cavity rather than true atom-based electrostatic properties.
Furthermore, the poor predictive power of hydrophobicity in
protein–ligand and most protein–protein interactions indicates fundamental differences between clefts defined by hydrophobicity
and desolvation. This challenges the classical view that hydrophobicity, as defined by non-polar surface area, is a useful predictor of
interaction interfaces (Chothia and Janin, 1975).
DISCUSSION
This study has assessed the applicability of using protein surface
clefts for the prediction of protein–protein and protein–ligand interfaces. There are a number of multi-attribute methods for the prediction of protein–protein interfaces that use averages of properties
across a patch of protein surface in order to determine whether that
patch lies in an interface (Jones and Thornton, 1997b; Zhou and
Shan, 2001; Fariselli et al., 2002; Neuvirth et al., 2004; Bordner and
Abagyan, 2005; Bradford and Westhead, 2005). Combinations of
properties allow the prediction of protein–protein interactions with
successes of up to 70%, despite treating the protein interface as an
average of its properties. There is some evidence to suggest that the
stability of a protein complex is not spread across the interface but
instead localized around clefts in the protein surface. By identifying
clefts in the protein surface we investigated whether any single
property could discriminate those in the protein interface from
those elsewhere on the protein surface.
An overall ROC integral or AUC gives a single measure of
prediction accuracy (Hanley and McNeil, 1982) that is used in many scientific studies. A table of the AUC results for all the studies
presented is given in the Supplementary Data. For all protein–
protein interactions, conservation scores appear to be less effective
than anticipated (AUC: 58%), confirming the results of Caffrey
et al. (2004) who concluded that protein–protein interfaces are
not significantly more conserved than the rest of the protein surface.
Only in the enzyme (AUC: 79%) and enzyme protein–ligand (AUC:
78%) interfaces is conservation of significant predictive value.
Electrostatics also failed to show any general trends in the prediction of protein–protein interfaces on monomers (AUC: 45–54% for
all interactions) other than enzymes (AUC: 65%, electrostatic total)
consistent with the theory that long-range electrostatics do not play
an important role in the funnel concept of protein–protein interactions (Schlosshauer and Baker, 2004), but are rather used for orientation (Schreiber and Fersht, 1996). Of the properties studied in
this investigation only desolvation energy of the clefts had any
general predictive power across all protein–protein interaction
types (AUC: 68%). Fernández-Recio et al. (2005) showed that
desolvation scores over larger regions of protein surface area, as
defined by circular patches, correlated with the interface in 58% of
proteins. It appears that the cleft-based approach employed here is
more general. Charged interfaces, such as those between enzymes
and inhibitors, and those of the antigens form the majority of interfaces that Fernández-Recio et al. (2005) failed to predict. Desolvation scores, as implemented by our cleft-based analysis, correlate
nearly as well with charged and antigen interfaces as they do with
other types of interface, with the enzyme (AUC: 74%), inhibitor
(AUC: 61%) and antigen (AUC: 62%) interfaces being predicted
1339
N.J.Burgoyne and R.M.Jackson
(a)
(b)
(c)
(d)
(e)
(f)
Fig. 3. ROC plots for the prediction of interface clefts in (a–e) protein–protein and (f) protein–ligand complexes. See text for description.
1340
Binding hot-spots in protein interfaces
well. A similar high level of success is also seen in protein–ligand
interfaces (AUC: 72%), which also have charged interfaces with a
high electrostatic potential. The level of success may be attributed to
the fact that in defining the protein surface in terms of clefts we
are only sampling a portion of the protein interface, as opposed to
the whole surface of a circular patch. It appears that sites of high
electrostatic potential and favourable desolvation coexist independently in the same interface. This is supported by the fact that the
correlation between sites ranked by these two attributes for both
protein–protein and protein–ligand complexes is insignificant (R² =
0.02–0.1) and slightly negative (data not shown).
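The independence of two attribute rankings can be checked with a rank correlation. The following is a minimal sketch, not the procedure used in this study: it assumes each cleft has one score per attribute, ranks the clefts by each attribute (averaging tied ranks) and reports the Pearson correlation of the ranks together with its square. The function names and the example scores are hypothetical.

```python
from statistics import mean


def ranks(scores):
    """Rank scores from 1 (lowest) upward, averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson_r(x, y):
    """Pearson correlation coefficient; undefined if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def rank_r_squared(attr_a, attr_b):
    """Return (r, r**2) for the correlation between two attribute rankings."""
    r = pearson_r(ranks(attr_a), ranks(attr_b))
    return r, r * r


# Hypothetical per-cleft scores for two attributes (illustrative values only).
electrostatic = [-3.1, -0.4, -1.8, -2.2, -0.9]
desolvation = [1.2, 0.3, 2.5, 0.8, 1.9]
r, r2 = rank_r_squared(electrostatic, desolvation)
print(f"r = {r:.2f}, R2 = {r2:.2f}")
```

An R² close to zero, as reported above, would indicate that a cleft ranked highly by one attribute is no more likely to be ranked highly by the other.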
Protein–protein and protein–ligand interface clefts show several
differences as well as some similarities. The most obvious difference is the failure of electrostatic descriptors overall to identify
protein–protein interface clefts. However, identification of enzyme
active sites is possible by electrostatic methods, and the results were
improved when using a unit charge applied to each atom in the
protein (Bate and Warwicker, 2004). Our results confirm these
observations. The power of electrostatic descriptors in the protein–
ligand interactions, particularly those of the enzyme subset, where
the ligands bind at active sites, is only maintained in the enzyme
class of protein–protein interactions. In fact, the protein–protein
enzyme class shows a high degree of correlation with protein–ligand
interactions for other attributes, and for both classes Q-SiteFinder
and residue conservation also show significant predictive power, in contrast to all other protein–protein interfaces. Therefore, it would
appear that the binding-sites of enzymes are much more predictable
than those of other protein–protein complexes. The very high predictive power of Q-SiteFinder (AUC: 88%) and all the unit electrostatic properties (AUC: 93%) for protein–ligand interactions is
very encouraging and will allow the targeting of functionally
important ligand-binding sites in functional genomics.
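The AUC values quoted throughout can be read as the probability that a randomly chosen interface cleft is ranked above a randomly chosen non-interface cleft, following the equivalence between the area under the ROC curve and the Wilcoxon statistic (Hanley and McNeil, 1982). The sketch below illustrates that calculation under stated assumptions and is not the code used to produce the results reported here: the function name and example data are hypothetical, and a higher score is assumed to mean a more likely interface (energy-like attributes, where lower is more favourable, would need to be negated first).

```python
def roc_auc(scores, is_interface):
    """Area under the ROC curve from pairwise comparisons.

    scores       -- one score per cleft; higher means more likely interface
    is_interface -- truthy for clefts that lie in the native interface
    """
    pos = [s for s, y in zip(scores, is_interface) if y]
    neg = [s for s, y in zip(scores, is_interface) if not y]
    if not pos or not neg:
        raise ValueError("need at least one interface and one non-interface cleft")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # interface cleft correctly ranked higher
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical example: four clefts, two of which are in the interface.
scores = [0.9, 0.4, 0.7, 0.2]
labels = [True, False, False, True]
print(roc_auc(scores, labels))
```

An AUC of 50% corresponds to a ranking no better than random, which is why values of 45–54% are interpreted above as an absence of predictive power.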
Overall, protein–protein and protein–ligand interface clefts do
have similarities. Our results show that the ease of cleft desolvation
or ‘de-wetting’ is strikingly similar across all classes of interface
and more indicative of interface regions than electrostatics,
Q-SiteFinder, or residue conservation, which, although highly predictive in some cases, do not generalize well to all classes of protein
interaction. Recent studies of the role of water in protein–protein
association of the melittin tetramer (Liu et al., 2005) and in
protein–ligand binding (Barratt et al., 2005) in mouse major urinary
protein (MUP) by molecular dynamics simulation illustrate the
importance of the phenomenon of de-wetting in molecular interactions. In the former, the strongly hydrophobic surfaces induce the
evaporation (drying) of water in the interface as the melittin protein
subunits approach one another. The authors conclude that sufficiently hydrophobic protein surfaces can induce a liquid–vapour
transition, providing the driving force towards protein association. In
the latter study, the MUP ligand-binding-site cleft is pre-organized,
hydrophobic and poorly solvated in the unbound form, which may
explain the largely enthalpy-driven (as opposed to entropy-driven) thermodynamics of ligand binding. We have found that hydrophobicity
of the surface does not correlate strongly with interface clefts in
either protein–protein or protein–ligand interfaces. Hydrophobic,
non-polar surfaces will also correlate closely with areas of low
desolvation energy; however, in proteins that form non-obligate
protein/ligand complexes (and hence must be independently stable
in water) extended hydrophobic solvent-exposed surfaces are
unlikely to exist. Therefore, a more subtle phenomenon exists:
a relationship between ease of desolvation (de-wetting) and interface propensity, which may be an intrinsic feature of all protein
binding surfaces.
ACKNOWLEDGEMENTS
The authors would like to thank Alasdair Laurie for help with
Q-SiteFinder and the protein-small molecule dataset. NJB is funded
by the Medical Research Council.
Conflict of Interest: none declared.
REFERENCES
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Antony,J. et al. (2000) Theoretical study of electron transfer between the photolyase
catalytic cofactor FADH(−) and DNA thymine dimer. J. Am. Chem. Soc., 122,
1057–1065.
Aytuna,A.S. et al. (2005) Prediction of protein–protein interactions by combining
structure and sequence conservation in protein interfaces. Bioinformatics, 21,
2850–2855.
Bairoch,A. et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res.,
33, D154–D159.
Barratt,E. et al. (2005) Van der Waals interactions dominate ligand–protein association
in a protein binding site occluded from solvent water. J. Am. Chem. Soc., 127,
11827–11834.
Bartlett,G.J. et al. (2002) Analysis of catalytic residues in enzyme active sites.
J. Mol. Biol., 324, 105–121.
Bate,P. and Warwicker,J. (2004) Enzyme/non-enzyme discrimination and prediction of
enzyme active site location using charge-based methods. J. Mol. Biol., 340,
263–276.
Bogan,A.A. and Thorn,K.S. (1998) Anatomy of hot spots in protein interfaces.
J. Mol. Biol., 280, 1–9.
Bordner,A.J. and Abagyan,R. (2005) Statistical analysis and prediction of protein–
protein interfaces. Proteins, 60, 353–366.
Bradford,J.R. and Westhead,D.R. (2003) Asymmetric mutation rates at enzyme-inhibitor interfaces: implications for the protein–protein docking problem.
Protein Sci., 12, 2099–2103.
Bradford,J.R. and Westhead,D.R. (2005) Improved prediction of protein–protein binding sites using a support vector machines approach. Bioinformatics, 21, 1487–1494.
Caffrey,D.R. et al. (2004) Are protein–protein interfaces more conserved in sequence
than the rest of the protein surface? Protein Sci., 13, 190–202.
Chen,R. et al. (2003) A protein–protein docking benchmark. Proteins, 52, 88–91.
Chothia,C. and Janin,J. (1975) Principles of protein–protein recognition. Nature, 256,
705–708.
Clackson,T. and Wells,J.A. (1995) A hot spot of binding energy in a hormone-receptor
interface. Science, 267, 383–386.
Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and
high throughput. Nucleic Acids Res., 32, 1792–1797.
Espadaler,J. et al. (2005) Prediction of protein–protein interactions using distant conservation of sequence patterns and structure relationships. Bioinformatics, 21,
3360–3368.
Fariselli,P. et al. (2002) Prediction of protein–protein interaction sites in heterocomplexes with neural networks. Eur. J. Biochem., 269, 1356–1361.
Fauchere,J.L. and Pliska,V. (1983) Hydrophobic parameters π of amino acid side
chains from the partitioning of N-acetyl-amino-acid amides. Eur. J. Med.
Chem., 18, 369–375.
Fernandez-Recio,J. et al. (2004) Identification of protein–protein interaction sites from
docking energy landscapes. J. Mol. Biol., 335, 843–865.
Fernandez-Recio,J. et al. (2005) Optimal docking area: a new method for predicting
protein–protein interaction sites. Proteins, 58, 134–143.
Giammona,D.A. (1984) Ph.D. Thesis, University of California, Davis, CA.
Goodford,P.J. (1985) A computational procedure for determining energetically favorable binding sites on biologically important macromolecules. J. Med. Chem., 28,
849–857.
Greaves,R. and Warwicker,J. (2005) Active site identification through geometry-based
and sequence profile-based calculations: burial of catalytic clefts. J. Mol. Biol.,
349, 547–557.
Halperin,I. et al. (2004) Protein–protein interactions; coupling of structurally conserved residues and of hot spots across interfaces. Implications for docking.
Structure, 12, 1027–1038.
Hanley,J.A. and McNeil,B.J. (1982) The meaning and use of the area under a receiver
operating characteristic (ROC) curve. Radiology, 143, 29–36.
Henrick,K. and Thornton,J.M. (1998) PQS: a protein quaternary structure file server.
Trends Biochem. Sci., 23, 358–361.
Hu,Z. et al. (2000) Conservation of polar residues as hot spots at protein interfaces.
Proteins, 39, 331–342.
Hubbard,S.J. and Thornton,J.M. (1993) NACCESS. Manchester University,
Manchester, UK.
Jackson,R.M. (2002) Q-fit: a probabilistic method for docking molecular fragments by
sampling low energy conformational space. J. Comput. Aided Mol. Des., 16, 43–57.
Jackson,R.M. and Russell,R.B. (2000) The serine protease inhibitor canonical loop
conformation: examples found in extracellular hydrolases, toxins, cytokines and
viral proteins. J. Mol. Biol., 296, 325–334.
Jones,S. and Thornton,J.M. (1997a) Analysis of protein–protein interaction sites using
surface patches. J. Mol. Biol., 272, 121–132.
Jones,S. and Thornton,J.M. (1997b) Prediction of protein–protein interaction sites
using patch analysis. J. Mol. Biol., 272, 133–143.
Keskin,O. et al. (2005) Hot regions in protein–protein interactions: the organization
and contribution of structurally conserved hot spot residues. J. Mol. Biol., 345,
1281–1294.
Laskowski,R.A. et al. (1996) Protein clefts in molecular recognition and function.
Protein Sci., 5, 2438–2452.
Laurie,A.T. and Jackson,R.M. (2005) Q-SiteFinder: an energy-based method for the
prediction of protein–ligand binding sites. Bioinformatics, 21, 1908–1916.
Li,X. et al. (2004) Protein–protein interactions: hot spots and structurally conserved
residues often locate in complemented pockets that pre-organized in the unbound
states: implications for docking. J. Mol. Biol., 344, 781–795.
Li,Y. et al. (2005) Magnitude of the hydrophobic effect at central versus peripheral sites
in protein–protein interfaces. Structure, 13, 297–307.
Liu,P. et al. (2005) Observation of a dewetting transition in the collapse of the melittin
tetramer. Nature, 437, 159–162.
Lo Conte,L. et al. (1999) The atomic structure of protein–protein recognition sites.
J. Mol. Biol., 285, 2177–2198.
Ma,B. et al. (2003) Protein–protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc. Natl Acad. Sci.
USA, 100, 5772–5777.
Meagher,K.L. et al. (2003) Development of polyphosphate parameters for use with the
AMBER force field. J. Comput. Chem., 24, 1016–1025.
Neuvirth,H. et al. (2004) ProMate: a structure based prediction program to
identify the location of protein–protein binding sites. J. Mol. Biol., 338,
181–199.
Nissink,J.W. et al. (2002) A new test set for validating predictions of protein–ligand
interaction. Proteins, 49, 457–471.
Rhodes,D.R. et al. (2005) Probabilistic model of the human protein–protein interaction
network. Nat. Biotechnol., 23, 951–959.
Rocchia,W. et al. (2001) Extending the applicability of the nonlinear Poisson–
Boltzmann equation: multiple dielectric constants and multivalent ions. J. Phys.
Chem. B, 105, 6507–6514.
Rocchia,W. et al. (2002) Rapid grid-based construction of the molecular surface
and the use of induced surface charge to calculate reaction field energies: applications to the molecular systems and geometric objects. J. Comput. Chem., 23,
128–137.
Russell,R.B. et al. (2004) A structural perspective on protein–protein interactions.
Curr. Opin. Struct. Biol., 14, 313–324.
Sali,A. et al. (2003) From words to literature in structural proteomics. Nature, 422,
216–225.
Sanner,M.F. et al. (1996) Reduced surface: an efficient way to compute molecular
surfaces. Biopolymers, 38, 305–320.
Schlosshauer,M. and Baker,D. (2004) Realistic protein–protein association rates from a
simple diffusional model neglecting long-range interactions, free energy barriers,
and landscape ruggedness. Protein Sci., 13, 1660–1669.
Schneider,C. and Suhnel,J. (1999) A molecular dynamics simulation of the flavin
mononucleotide-RNA aptamer complex. Biopolymers, 50, 287–302.
Schreiber,G. and Fersht,A.R. (1996) Rapid, electrostatically assisted association of
proteins. Nat. Struct. Biol., 3, 427–431.
Thorn,K.S. and Bogan,A.A. (2001) ASEdb: a database of alanine mutations and their
effects on the free energy of binding in protein interactions. Bioinformatics, 17,
284–285.
Valdar,W.S. (2002) Scoring residue conservation. Proteins, 48, 227–241.
Weiner,S.J. et al. (1984) A new force-field for molecular mechanical simulation of
nucleic-acids and proteins. J. Am. Chem. Soc., 106, 765–784.
Xu,D. et al. (1997) Hydrogen bonds and salt bridges across protein–protein interfaces.
Protein Eng., 10, 999–1012.
Zhou,H.X. and Shan,Y. (2001) Prediction of protein interaction sites from sequence
profile and residue neighbor list. Proteins, 44, 336–343.