research papers
How many water molecules are detected in X-ray
protein crystal structures?
ISSN 1600-5767
Marco Gnesi^a and Oliviero Carugo^b,c*
Received 14 July 2016
Accepted 23 November 2016
Edited by D. I. Svergun, European Molecular
Biology Laboratory, Hamburg, Germany
Keywords: multiple regression; Poisson
regression; regression model; protein hydration;
structure refinement; water molecules; structure
validation.
Supporting information: this article has
supporting information at journals.iucr.org/j
^aSection of Biostatistics and Clinical Epidemiology, Department of Public Health, Experimental and Forensic Medicine,
University of Pavia, via Forlanini 2, Pavia 27100, Italy, ^bDepartment of Chemistry, University of Pavia, viale Taramelli 12,
Pavia 27100, Italy, and ^cDepartment of Structural and Computational Biology, Max F. Perutz Laboratories, Vienna
University, Campus Vienna Biocenter 5, Vienna A-1030, Austria. *Correspondence e-mail: [email protected]
The positions of several water molecules can be determined in protein
crystallography, either buried in internal cavities or at the protein surface. It is
important to be able to estimate the expected number of these water molecules
to facilitate crystal structure determination. Here, a multiple Poisson regression
model implemented on nearly 10 000 protein crystal structures shows that the
number of detectable water molecules depends on eight variables: crystallographic resolution, R factor, percentage of solvent in the crystal, average B
factor of the protein atoms, percentage of amino acid residues in loops, average
solvent-accessible surface area of the amino acid residues, grand average of
hydropathy of the protein(s) in the asymmetric unit and normalized number of
heteroatoms that are not water molecules. Furthermore, a secondary analysis
tested the effect of different software packages. Given the values of these eight
variables, it is possible to compute the expected number of water molecules
detectable in electron-density maps with reasonable accuracy (as suggested by
an external validation of the model).
© 2017 International Union of Crystallography
1. Introduction
The positions of the oxygen atoms of several water molecules
can be observed in protein crystal structures. A few of these
water molecules are buried in the protein core and may
stabilize the molecular structure by forming hydrogen bonds
with polar atoms that are confined in the protein core because
of protein folding (Carugo, 2016a, 2015). However, most of the
water molecules cover the protein surface, by forming
hydrogen bonds with hydrogen-donor and hydrogen-acceptor protein atoms and with other water molecules (and
occasionally with other types of molecules) (Matsuoka &
Nakasako, 2009). This agrees with the fact that the physicochemical properties of globular protein surfaces induce the
formation of one or two layers of water molecules over the
protein surface (Fogarty & Laage, 2014) and a minimum
monolayer hydration shell seems to be necessary for protein
function (Nakasako, 2004; Rupley & Careri, 1991).
Nearly two decades ago, we published a paper dealing with
the number of water molecules that are detectable with
protein crystallography (Carugo & Bordo, 1999). By means of
multiple regression analysis of 873 protein crystal structures
determined at room temperature and of only 33 structures
determined at low temperature, it emerged that the number of
water molecules included in crystallographic models was
related primarily to the resolution at which the structure was
solved and refined. Other variables, like the number of residues in the asymmetric unit, the fraction of apolar protein
surface, the secondary structure and the temperature, seemed
to have only a marginal influence.
https://doi.org/10.1107/S1600576716018719
J. Appl. Cryst. (2017). 50, 96–101
Nowadays, most protein crystal structures are determined at
low temperature to reduce radiation damage induced by
extremely brilliant synchrotron beamlines (Garman &
Schneider, 1997; Carugo & Djinovic-Carugo, 2005). On the basis of
datasets much larger than those available in 1999, it has been
possible to recognize that low-temperature protein crystal
structures have 1.5–3.0 times more water molecules than
structures determined at room temperature (Matsuoka &
Nakasako, 2009; Nakasako, 2004). This might be due to a
series of non-mutually exclusive causes: structural rearrangements caused by freezing, reduced thermal motion with
concomitant sharpening of the electron density or higher
resolution of cryogenic data (Dunlop et al., 2005).
Here, we analyse a non-redundant dataset of 9894 protein
structures with a robust and rigorous statistical treatment. The
aim is to develop a model that could highlight the contributions of several factors in determining the number of water
molecules, and to verify its performance on a different dataset
of 1066 proteins.
2. Methods
2.1. Data selection
Protein crystal structures were downloaded from the
Protein Data Bank (PDB; Bernstein et al., 1977; Berman et al.,
2000) and selected according to the following criteria: (i) only
X-ray crystal structures with experimental data deposited in
the database were considered; (ii) the crystallographic resolution
had to be 4.5 Å or better (i.e. a numerical value no larger than
4.5 Å); (iii) the working R factor had to be no higher than 0.35;
(iv) the free R factor had to be no higher than 0.40; (v)
the fraction of nucleic acid atoms or of heteroatoms different
from water molecules had to be lower than 5% of the atoms
located in the asymmetric unit; (vi) at least 30 amino acids had
to be found in the asymmetric unit; (vii) the average B factor
of the protein atoms had to be smaller than 100 Å²; (viii) the
maximal pairwise sequence identity had to be lower than 30%.
Data from the PDB were inspected to ensure their consistency. The structures were divided into two datasets, a
‘training dataset’, containing structures recorded in the PDB
before 2016, and a ‘test dataset’, including structures recorded
during 2016.
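As an illustration of how criteria (i)–(vii) and the training/test split could be applied to per-entry summaries, a minimal Python sketch follows. The `Entry` record, its field names and the example identifiers are hypothetical and are not taken from the paper; the thresholds are read as "at least as good as" the stated values. Criterion (viii) requires pairwise sequence comparison across the whole set and is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical per-entry summary; field names are illustrative only."""
    pdb_id: str
    method: str                  # e.g. "X-RAY DIFFRACTION"
    has_experimental_data: bool  # criterion (i)
    resolution: float            # in Angstrom
    r_work: float
    r_free: float
    non_protein_fraction: float  # nucleic acid + non-water heteroatoms, as a fraction of all atoms
    n_residues: int
    mean_b_factor: float         # in Angstrom^2
    deposition_year: int


def passes_selection(e: Entry) -> bool:
    """Apply criteria (i)-(vii); criterion (viii) is a dataset-level filter."""
    return (
        e.method == "X-RAY DIFFRACTION"
        and e.has_experimental_data
        and e.resolution <= 4.5
        and e.r_work <= 0.35
        and e.r_free <= 0.40
        and e.non_protein_fraction < 0.05
        and e.n_residues >= 30
        and e.mean_b_factor < 100.0
    )


def split_by_year(entries):
    """Training set: recorded before 2016; test set: recorded during 2016."""
    kept = [e for e in entries if passes_selection(e)]
    training = [e for e in kept if e.deposition_year < 2016]
    test = [e for e in kept if e.deposition_year == 2016]
    return training, test
```

The split is purely chronological, so the test set is never seen while the regression coefficients are estimated.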
The values of 22 variables (listed in Table S1 in the
supporting information) were extracted or computed for each
crystal structure. In the present work, the dependent variable
to be predicted is the load of water within a crystal, obtained
by normalizing the number of water molecules by the
number of amino acid residues in the protein’s primary
structure (H2O/aa ratio). The dependent variable is derived
from count data and was therefore treated as following a Poisson distribution rather than a normal distribution.
After a careful evaluation of several potentially relevant
predictors, considering both experimental and statistical
requirements (like non-collinearity), for our modelling
analyses we considered the following as potential covariates:
crystallographic resolution (Res), working R factor (R),
percentage of solvent in the crystal (Solv%), average B factor
of the protein and nucleic acid atoms (Bave), percentage of
amino acid residues in loops (aaLoops%), average solvent-accessible surface area of the amino acid residues (Sasaaa),
grand average of hydropathy of the proteins in the asymmetric
unit (GRAVY), total electric charge of the proteins in the
asymmetric unit (Tectp), and number of heteroatoms that are
not water molecules, normalized on the number of amino acid
residues in the protein’s primary structure (Het/aa ratio). The
type of software package used to refine the structure (Software) was added to the above-mentioned variables in the
context of the secondary analysis.
2.2. Statistical analyses
All the variables collected from the PDB were quantitative
and were expressed as mean values, standard deviations (SD)
and medians. The only exception is the categorical variable
Software, which was described in terms of absolute and relative frequencies.
In the main analysis, the linear or nonlinear nature of the
relationship between the dependent variable and each of the
continuous independent variables was assessed graphically on
the training dataset. When a linear relationship was assumed,
its strength was initially evaluated through bivariate analyses,
by means of the Spearman non-parametric correlation coefficient (given that the dependent variable is not normally
distributed). When relationships were assumed to be other
than linear, suitable mathematical transformations were
applied to the independent variable.
A multiple Poisson regression model, appropriate for our
count-derived dependent variable, was fitted on the training
dataset to assess the adjusted effect of every candidate
covariate on the dependent variable (H2O/aa ratio). Only
those covariates showing a relevant effect were kept (statistical significance on Wald’s test, thus meaning the estimated
regression coefficient is different from a null effect). Robust
standard errors and confidence intervals (95% CI) for
regression coefficients were estimated using the Huber–White
formula. With regard to Poisson regression, given the values of
the independent variables, the predicted values of the
dependent variable should be computed as follows:
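\[
\hat{y} = \exp(\beta_0)\,[\exp(\beta_1)]^{x_1} \cdots [\exp(\beta_I)]^{x_I}
        = \mathrm{intercept} \times \prod_{i=1}^{I} [\exp(\beta_i)]^{x_i}. \tag{1}
\]
Consequently, if $\exp(\beta_i) > 1$, the predicted value increases as $x_i$ increases; on
the other hand, if $\exp(\beta_i) < 1$, the predicted value decreases.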
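To make equation (1) concrete, the following Python sketch evaluates it with the exp(β) values reported for the main analysis (Table 3). It is an illustration only, under two assumptions that are not stated explicitly in the text: the GRAVY2 term is taken to be the square of GRAVY, and covariates are entered in the units and with the values as printed in Table 1. The function name and the example input (the medians of Table 1) are chosen here for illustration.

```python
# exp(beta) values copied from Table 3 (main analysis).
EXP_BETA = {
    "Res": 0.4290, "R": 0.0516, "Solv%": 1.0094, "Bave": 0.9819,
    "aaLoops%": 1.0016, "Sasaaa": 1.0030, "GRAVY": 0.4151,
    "GRAVY2": 0.1376, "Het/aa": 1.1448,
}
INTERCEPT = 4.1506


def predict_h2o_per_aa(covariates):
    """Equation (1): intercept * product of exp(beta_i) ** x_i."""
    values = dict(covariates)
    # Assumption: the quadratic term is GRAVY squared.
    values["GRAVY2"] = values["GRAVY"] ** 2
    y = INTERCEPT
    for name, exp_beta in EXP_BETA.items():
        y *= exp_beta ** values[name]
    return y


# Example input: the medians listed in Table 1.
medians = {"Res": 1.9, "R": 0.19, "Solv%": 52.0, "Bave": 28.0,
           "aaLoops%": 37.0, "Sasaaa": 49.0, "GRAVY": 0.053, "Het/aa": 0.0874}
print(f"predicted H2O/aa at the median covariates: {predict_h2o_per_aa(medians):.3f}")
```

Because every factor has the form exp(β_i) raised to x_i, a covariate with exp(β_i) below 1 (for instance Res) lowers the prediction as its value grows, which is the behaviour described above.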
The predictive performance of the thus-developed regression model was then tested by analysing the regression’s
residuals, both within the sample (i.e. apparent validation on
the same training dataset as was used to estimate the regression coefficients) and outside the sample (i.e. external validation on the test dataset). The Spearman correlation
coefficient between the predicted and observed values was
also computed, even though it is a naïve indicator and should
be interpreted cautiously.
A secondary analysis, performed again on the training dataset,
assessed possible differences in the H2O/aa ratio among the
groups identified by a categorical variable (Software)
through the Kruskal–Wallis method (a non-parametric
method, analogous to the one-way ANOVA, suitable for a
response variable that does not fit a normal distribution). The
categories of this variable were then added to the Poisson
multiple regression model fitted in the main analysis. The
model thus obtained underwent validation both within and
outside the sample.
In inferential testing, the level of statistical significance was
set at 0.05. In the case of Wald’s tests on regression coefficients,
the significance level was adjusted according to the
Bonferroni method by dividing it by the number of terms estimated
within the model (predictors and intercept). Statistical
analyses were performed using Stata13 (StataCorp, 2013).
3. Results
The final training dataset consisted of experimental data from
9894 experiments on protein crystals. In more than 90% of
these cases, data collection had taken place at a temperature
of 100 K (or, in any case, within the range 63–298 K; 100.29 K
on average). A general description of the data collected from
the PDB is reported in Table 1.

Table 1
General descriptions of the dependent and independent variables, each
expressed in terms of average value (Mean), standard error (SE) and
median (Median).

Variable        Mean       SE         Median
H2O/aa          0.6617     0.4525     0.6096
Res (Å)         1.9442     0.4421     1.9000
R               0.1898     0.0301     0.1900
Solv% (%)       53.1841    9.3651     52.0000
Bave (Å²)       31.6188    16.7318    28.0000
aaLoops% (%)    36.2741    9.2523     37.0000
Sasaaa (Å²)     50.3714    9.3405     49.0000
GRAVY           0.0577     0.0798     0.0530
Tectp           1.7980     17.2095    2.0000
Het/aa          0.1298     0.1604     0.0874

A graphical analysis suggested the H2O/aa ratio to have an
apparently linear relationship with some of the covariates
(Res, R, Solv%, Sasaaa, Bave and Het/aa), while Tectp was
transformed to the opposite of its absolute value. Other
covariates (aaLoops%, Sasaaa and GRAVY) were suspected
to have a second-degree (i.e. parabolic) association with the
log-transformed dependent variable, and the regression
analysis led us to discard these second-degree terms, with the
single exception of GRAVY. Scatter plots of each variable
versus the H2O/aa ratio are shown in Fig. S1 in the supporting
information.
Spearman correlation coefficients for the crude (bivariate)
linear relationships between the covariates (or the transformed
value for Tectp) and the number of water molecules per amino
acid residue are reported in Table 2. The coefficients indicate a
weak relationship for the percentage of amino acid residues in
loops (aaLoops%), average solvent-accessible areas of the
amino acid residues (Sasaaa), transformed total electric
charge of the asymmetric unit (Tectp) and normalized number
of heteroatoms other than water molecules (Het/aa). The
coefficient was not computed for the grand average of
hydropathy of the proteins in the asymmetric unit (GRAVY),
as its quadratic relation with the H2O/aa ratio is not monotonic.

Table 2
Non-parametric correlation between the number of water molecules per
amino acid residue and the covariates.

Covariate    Type of transformation    Spearman correlation coefficient    p value
Res          None                      0.806                               <0.0001
R            None                      0.637                               <0.0001
Solv%        None                      0.367                               <0.0001
Bave         None                      0.735                               <0.0001
aaLoops%     None                      0.111                               <0.0001
Sasaaa       None                      0.056                               <0.0001
GRAVY        Second-degree             Not applicable                      Not applicable
Tectp        Opposite of modulus       0.139                               <0.0001
Het/aa       None                      0.161                               <0.0001

Table 3
Results of the Poisson multiple regression model – main analysis.

Covariate    exp(β)    Standard error    Wald’s test†    95% CI
Res          0.4290    0.0080            p < 0.001       0.4136–0.4450
R            0.0516    0.0099            p < 0.001       0.0354–0.0751
Solv%        1.0094    0.0006            p < 0.001       1.0083–1.0105
Bave         0.9819    0.0005            p < 0.001       0.9809–0.9829
aaLoops%     1.0016    0.0005            p = 0.002       1.0006–1.0026
Sasaaa       1.0030    0.0005            p < 0.001       1.0020–1.0041
GRAVY        0.4151    0.0347            p < 0.001       0.3523–0.4890
GRAVY²       0.1376    0.0570            p < 0.001       0.0612–0.3097
Het/aa       1.1448    0.0280            p < 0.001       1.0911–1.2010
Intercept    4.1506    0.1960            p < 0.001       3.7837–4.5530

† Bonferroni-adjusted significance level 0.005.
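The confidence intervals in Table 3 can be cross-checked against the tabulated exp(β) values and standard errors. The sketch below assumes (this is not stated in the text) that the reported standard error refers to exp(β) and was obtained by the delta method, so that SE(β) = SE[exp(β)] / exp(β), and that the interval is computed on the log scale with z = 1.96 and then exponentiated. Under that assumption the tabulated intervals are reproduced to within rounding of the inputs.

```python
import math


def ci_from_exp_beta(exp_beta, se_exp_beta, z=1.96):
    """95% CI on the exp(beta) scale.

    Assumption: se_exp_beta is the delta-method standard error of exp(beta),
    so the standard error of beta itself is se_exp_beta / exp_beta.
    """
    se_beta = se_exp_beta / exp_beta
    half_width = z * se_beta
    return exp_beta * math.exp(-half_width), exp_beta * math.exp(half_width)


# (exp(beta), standard error) pairs copied from Table 3.
rows = {
    "Res": (0.4290, 0.0080),
    "R": (0.0516, 0.0099),
    "Intercept": (4.1506, 0.1960),
}
for name, (exp_beta, se) in rows.items():
    low, high = ci_from_exp_beta(exp_beta, se)
    print(f"{name}: {low:.4f} to {high:.4f}")
```

Small discrepancies in the last digit are expected, because the inputs are themselves rounded to four decimal places.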
3.1. Modelling through Poisson multiple regression
The covariates shown in Table 2 were included in a Poisson
multiple regression model with the ratio H2O/aa as a dependent variable. Non-relevant covariates were discarded (see
Methods). The results of the final regression model are
reported in Table 3; our model is appropriate for describing
in-sample data, as confirmed by the Pearson goodness-of-fit
test (χ² = 1048.38, p ≈ 1.000). The Spearman correlation
coefficients between all pairs of independent variables, as
reported in Table S2 in the supporting information, are weak
enough to exclude multicollinearity issues according to standard statistical practice.
No substantive change was observed when the analysis was
applied to a subsample of the most recent structures
(deposition year 2005–2014).
For the experiments included in our training dataset, the
values of the H2O/aa ratio predicted by the regression model
lie between 0.0237 and 3.1118. For interested readers, the
tolerance intervals for predicted values of the H2O/aa ratio are
reported and explained in the supporting information
(Table S3).
3.2. Validation of the regression model
According to the apparent validation criteria (see
Methods), the regression model appears able to predict the
water load properly, as confirmed by an analysis of the
regression residuals. The mean of the residuals’ distribution is
4.38 × 10⁻⁹ with a standard deviation (SD) of 0.2694; the
median absolute value is 0.1467. The predicted H2O/aa ratio
values are highly correlated with the observed ones
(Spearman correlation coefficient = 0.849, p < 0.0001).
Moreover, a fully external validation was performed on the
test dataset (see Methods). The predicted values of the H2O/aa
ratio, in the range 0.0313–2.3372, adequately fit the observed
values, as shown in Fig. S4 in the supporting information, and
their correlation is as high as in the apparent validation
(Spearman correlation coefficient = 0.893, p < 0.0001). The
residuals have an average of 0.0265 with an SD of 0.2747.
Looking at absolute values, the median residual is 0.1302. The
most striking outlier is 5gnn, where the observed ratio
between the water molecules and the residues (4.14) is much
higher than the predicted value (0.61). However, we observe
that this PDB entry is characterized by extremely bad scores in
the PDB validation report: it is affected by numerous interatomic clashes, it has a very bad Ramachandran plot and many
side chains are severely distorted; moreover, 519 of its water
molecules are more than 5 Å away from the nearest polymer
chain. Very recently, the quality of this PDB entry has been
critically debated on a public forum (information can be
retrieved at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind1609&L=ccp4bb#6).
We might argue that this is one
example of a problematic crystal structure which would
deserve further attention, although its real-space validation
scores are quite good.
However, despite a more-than-acceptable performance of
the model, a modest underestimation of the water load could
be hypothesized, thus accounting for the slight negative
asymmetry observed in the distribution of residuals both
within and outside the sample.
3.3. Investigating the effect of the different software
packages used in crystallographic refinements
The software used to refine the crystal structure could
potentially influence the number of water molecules identified
in a structure. However, including the type of software package
(Software variable) in the main regression model would have
imposed restrictions. For instance, any
chance to use the model would be undermined by the lack of
specific attributes for rare or, in the future, new software
packages, which were not included in our model.

Table 4
Distribution of software packages (Software variable).

Program title    Program reference          No. (%)
REFMAC           Murshudov et al. (2011)    6476 (65.7%)
PHENIX           Adams et al. (2010)        1016 (10.3%)
CNS              Brünger et al. (1998)      1908 (19.3%)
BUSTER           Bricogne et al. (2016)     349 (3.5%)
X-PLOR           Brünger (1992)             59 (0.6%)
SHELXL           Sheldrick (2008, 2015)     56 (0.6%)
Total            –                          9864 (100.0%)

Table 5
Results of the Poisson multiple regression model – secondary analysis.

Covariate           exp(β)    Standard error    Wald’s test†    95% CI
Res                 0.4286    0.0079            p < 0.001       0.4134–0.4444
R                   0.0139    0.0029            p < 0.001       0.0092–0.0209
Solv%               1.0098    0.0006            p < 0.001       1.0087–1.0109
Bave                0.9825    0.0005            p < 0.001       0.9814–0.9835
aaLoops%            1.0013    0.0005            p = 0.008       1.0004–1.0023
Sasaaa              1.0037    0.0005            p < 0.001       1.0027–1.0048
GRAVY               0.4217    0.0357            p < 0.001       0.3572–0.4979
GRAVY²              0.1476    0.0609            p < 0.001       0.0658–0.3312
Het/aa              1.1343    0.0269            p < 0.001       1.0827–1.1884
Software‡ REFMAC    1         (Reference category)
          PHENIX    1.0277    0.0149            p = 0.059       0.9989–1.0573
          CNS       1.2172    0.0145            p < 0.001       1.1891–1.2460
          BUSTER    1.1486    0.0269            p < 0.001       1.0971–1.2025
          X-PLOR    1.1710    0.0672            p = 0.006       1.0464–1.3105
          SHELXL    0.7926    0.0383            p < 0.001       0.7211–0.8713
Intercept           4.7658    0.2304            p < 0.001       4.3349–5.2394

† Bonferroni-adjusted significance level 0.0033. ‡ When computing the predicted
H2O/aa ratio, x_i = 1 for the software package used to refine the structure and x_i = 0
otherwise.
To overcome the aforementioned limitations, the effect of
the Software variable on the water load of crystals was
investigated in a secondary analysis on the training dataset.
Data from 9864 structures were considered, meaning that 30
structures were excluded because their PDB files contained
insufficient information about the software used. These data are
presented in Table 4.
A bivariate analysis disclosed a significant overall difference in the observed H2O/aa ratio among Software types
(Kruskal–Wallis non-parametric test, χ² = 181.384, p = 0.0001).
Post hoc testing was not performed, as we chose to estimate
the individual effect of each software package directly through
the subsequent multivariate analysis carried out by means of
Poisson multiple regression. The results of this model, shown
in Table 5, are substantially consistent with the main analysis
(Table 3); in addition, the role of the Software variable is
confirmed. Most of the software packages, measured against
REFMAC (which was taken as the reference type), have a
statistically relevant impact on the H2O/aa ratio, with the
exception of PHENIX and X-PLOR. For the latter two, it
must be considered that Wald’s test on the regressor is
borderline significant under a strongly conservative
criterion of statistical significance. Moreover, only 59 structures
refined with X-PLOR are present in the training
dataset (Table 4), and this could make Wald’s test even more conservative, because such a small number of cases could increase
the uncertainty in the estimation of the regression coefficient
(i.e. the standard error and confidence interval).
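The conservative criterion mentioned above can be reproduced with a short calculation. In the sketch below, the number of estimated terms is an inference rather than a figure given in the text: counting nine predictors plus the intercept for the main model, and adding five software indicators for the secondary model, yields the adjusted levels quoted in the footnotes of Tables 3 and 5. Only the two software p values that are reported exactly are compared with the threshold.

```python
def bonferroni_threshold(alpha, n_terms):
    """Bonferroni-adjusted significance level: alpha divided by the number of tests."""
    return alpha / n_terms


# Assumed term counts (inferred, not stated in the text).
main_level = bonferroni_threshold(0.05, 9 + 1)           # 9 predictors + intercept
secondary_level = bonferroni_threshold(0.05, 9 + 5 + 1)  # plus 5 software indicators

print(f"main analysis: {main_level:.4f}")
print(f"secondary analysis: {secondary_level:.4f}")

# Exactly reported Wald p values for software packages (Table 5).
software_p = {"PHENIX": 0.059, "X-PLOR": 0.006}
for name, p in software_p.items():
    verdict = "significant" if p < secondary_level else "not significant"
    print(f"{name}: p = {p} is {verdict} at the adjusted level")
```

With the unadjusted level of 0.05, X-PLOR would be judged significant and PHENIX would fall just short, which is why both are described as borderline.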
The performance of this model, in terms of apparent
validity, seems reasonable and comparable to the main model.
The predicted values of the H2O/aa ratio, ranging from 0.0210
to 3.2827, are highly correlated with the observed values
(Spearman correlation coefficient = 0.855, p < 0.0001). On
average, the residuals are very close to zero (4.43 × 10⁻⁹) with
an SD of 0.2682, again with a slight negative asymmetry of
distribution, and the median absolute value is 0.1453. External
validation substantially confirms the apparent validation: the
predicted values of the H2O/aa ratio, within the range 0.0256–
2.3559, are strongly correlated with the observed ones
(Spearman correlation coefficient = 0.894, p < 0.0001); the
residuals have an average of 0.0566 with an SD of 0.0555,
and the median absolute residual is 0.1253. We observe that in
the test dataset only three software packages are present
(REFMAC, PHENIX and CNS).
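The validation summaries quoted for both models (mean and SD of the residuals, median absolute residual, Spearman correlation between predicted and observed values) can be computed as in the following sketch. It uses only the Python standard library; the residual is taken here as observed minus predicted, which is an assumption because the sign convention is not stated in the text, and the toy numbers are invented for illustration.

```python
import statistics


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = (sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)) ** 0.5
    return num / den


def spearman(a, b):
    """Spearman coefficient as the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def residual_summary(observed, predicted):
    # Assumption: residual = observed - predicted.
    residuals = [o - p for o, p in zip(observed, predicted)]
    return {
        "mean": statistics.fmean(residuals),
        "sd": statistics.stdev(residuals),
        "median_abs": statistics.median([abs(r) for r in residuals]),
        "spearman": spearman(observed, predicted),
    }


# Invented toy values, for illustration only.
observed = [0.50, 0.80, 0.30, 1.20, 0.65]
predicted = [0.55, 0.70, 0.35, 1.00, 0.60]
print(residual_summary(observed, predicted))
```

Because the Spearman coefficient depends only on ranks, it can be high even when predictions are systematically shifted, which is one reason to read it together with the residual statistics.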
4. Discussion
It is necessary to note that some of the water molecules
deposited in the PDB files may result from misinterpretation
of the electron-density maps, e.g. if weak density peaks were
interpreted as water molecules with high B factors or low
occupancies. However, we did not exclude these potentially
erroneous water molecules from our analysis, as no widely
accepted criteria are available to filter out these water molecules automatically and since it is obviously impossible to
examine nearly 10 000 electron-density maps visually. The
results presented here are therefore not an attempt to predict
how many water molecules hydrate proteins in the crystals; we
simply examine the number of water molecules that crystallographers include, on average, in the structures published and
deposited in the PDB.
In the main analysis presented in this paper, the multivariable model has highlighted eight variables with a significant impact on the normalized number of water molecules
observed: crystallographic resolution (Res), R factor (R),
percentage of solvent in the crystal (Solv%), average B factor
of the protein atoms (Bave), percentage of amino acid residues
in loops (aaLoops%), average solvent-accessible surface area
of the amino acid residues (Sasaaa), grand average of
hydropathy of the protein(s) in the asymmetric unit (GRAVY,
GRAVY²) and normalized number of heteroatoms that are not
water molecules (Het/aa). The distributions of these variables
are shown in Fig. S3 in the supporting information. Here, we
briefly examine the adjusted relation of each covariate to the
dependent variable. In each case, the effect on the dependent
variable was computed for a variation of approximately one
standard deviation of the covariate.
Fewer water molecules per residue are observed if the
crystallographic resolution (Res) worsens. If the numerical value
of the resolution increases by 0.44 Å (i.e. the resolution worsens),
the dependent variable falls by approximately 31%. This is expected,
since the electron-density maps
are sharper when diffraction intensities can be
measured to higher resolution. In an independent analysis, it
has recently been shown that it is necessary to reach a resolution
of about 1.5–1.6 Å to observe a continuous hydration
layer at the protein surface (Carugo, 2016b).
If the average B factor of the protein atoms (Bave) increases,
fewer water molecules per residue are detected. For example,
if the B factor increases by 15 Å², the dependent variable is
reduced by roughly 24%. To understand this, it is necessary to
recall that the B factor monitors not only the amplitude of the
oscillations of the atoms around their average positions but
also the conformational disorder of the atoms (Carugo &
Argos, 1997; Ringe & Petsko, 1986). Consequently, it is
reasonable to expect that fewer water molecules can be
observed if the protein has larger B factors and is more flexible. In particular, one must remember that most of the water
molecules observed in crystal structures are found in the first,
second and perhaps third hydration layers, in contact with the
atoms of the residues that are at the protein surface, which are
more flexible than the protein core.
Analogously, more water molecules per residue are
observed if the R factor (R) improves. For instance, when the
R factor is increased by 0.05, the dependent variable decreases
by 14%. As for the resolution, this is largely expected, since
electron-density maps become sharper when differences
between computed and observed diffraction intensities
diminish.
The dependent variable also increases with the
percentage of solvent in the crystal (Solv%), even though the
effect is far from being pivotal (an increment of 10 percentage points in the
percentage of solvent in the crystal, which is approximately a
standard deviation, increases the H2O/aa ratio by about 10%). This
is surprising, at least in part. In fact, crystals containing more
liquid solvent are expected to diffract poorly, with the
consequence that fewer protein atoms and fewer water
molecules are observable in the electron-density maps.
The relationship between the H2O/aa ratio and GRAVY
includes two terms, one linear and the other quadratic, and as
a consequence it is impossible to give an intuitive and
straightforward explanation. Fig. S2 in the supporting information shows a scatter plot of the predicted values of the H2O/
aa ratio versus observed values of GRAVY. From the plot, we
can infer that the predicted H2O/aa ratio reaches a maximum
when GRAVY is around 0.10; the greater the distance from
this point, the lower the predicted ratio.
Three other independent variables have an extremely low
impact on the number of water molecules per amino acid
residue, which increases marginally if the average solvent-accessible surface area of the residues (Sasaaa), the number of
heteroatoms that are not water molecules (Het/aa) or the
percentage of residues in loops (aaLoops%) increases.
According to our analysis, an increment of 10 Å² in Sasaaa
implies that the dependent variable increases by only 3%. An
increment of 0.15 in the Het/aa ratio, which is approximately
one standard deviation, corresponds to a 2% increase in the
H2O/aa ratio. If the percentage of residues in loops increases by 10 percentage points,
there is only a 1.6% increment in the dependent variable. To
some extent, these trends are surprising. One must observe
that, although these effects are statistically significant, they
have a very modest impact on the dependent variable.
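The percentage effects quoted in this section follow directly from equation (1): changing a covariate by Δx multiplies the predicted ratio by exp(β)^Δx. The sketch below recomputes them from the exp(β) values of Table 3 and the increments mentioned in the text; the function name is chosen here for illustration.

```python
# exp(beta) values copied from Table 3 (main analysis).
EXP_BETA = {"Res": 0.4290, "R": 0.0516, "Solv%": 1.0094, "Bave": 0.9819,
            "aaLoops%": 1.0016, "Sasaaa": 1.0030, "Het/aa": 1.1448}

# Increments discussed in the text (roughly one standard deviation each).
INCREMENTS = {"Res": 0.44, "Bave": 15.0, "R": 0.05, "Solv%": 10.0,
              "Sasaaa": 10.0, "Het/aa": 0.15, "aaLoops%": 10.0}


def percent_change(exp_beta, delta_x):
    """Relative change (in %) of the predicted ratio when x changes by delta_x."""
    return (exp_beta ** delta_x - 1.0) * 100.0


for name, delta_x in INCREMENTS.items():
    change = percent_change(EXP_BETA[name], delta_x)
    print(f"{name}: {change:+.1f}% for an increment of {delta_x}")
```

The effect is multiplicative, so the same increment produces the same percentage change whatever the starting value of the covariate.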
In addition, a secondary analysis pinpointed the effect of
the type of software package used to refine the structures
(Software variable) on the H2O/aa ratio. The effect exerted by
Software seems to be independent of those of other predictors,
whose regression coefficients remained stable after the inclusion of this independent variable. The impact of crystallographic software is interesting. Fewer water molecules are
localized when using SHELXL and more with CNS. It is
reasonable to suppose that this reflects a systematic difference
amongst different packages. However, a detailed comparison
between different software packages is beyond the scope of
the present study and further studies are necessary to investigate this issue.
This analysis of each variable clearly shows that nearly
every variable has a predictable impact on the number of
water molecules per residue. However, the correlation of
several individual variables with the number of water molecules per residue is relatively modest. Only when they are
taken together do these variables correlate well with the
number of water molecules per residue. The performance of
this multivariate model is remarkable, as indicated by the out-of-sample predictive validation.
5. Summary
We have analysed the relationships between the number of
water molecules that are detected in a protein crystal structure
and a series of variables that might have an impact on it. By
means of Poisson multiple regression analyses, we have
identified eight variables that significantly affect the number
of water molecules per residue: the crystallographic resolution, the R factor, the average B factor of the protein atoms,
the percentage of solvent in the crystal, the average solvent-accessible surface area of the amino acid residues, the grand
average of hydropathy of the proteins in the asymmetric unit,
the percentage of amino acids in loops and the number of
heteroatoms that are not water molecules (normalized on the
number of amino acids). A secondary analysis also disclosed
the fact that the software packages which are routinely used to
refine crystal structures can affect the number of water
molecules located in the electron-density maps.
This analysis extends previous studies on room-temperature
protein crystal structures, published nearly 20 years ago
(Carugo & Bordo, 1999), which can be considered outdated,
since diffraction data are nowadays commonly collected at low
temperature (usually about 100 K) and since the present
quantity of data is significantly larger than 20 years ago.
However, a direct comparison between the model presented
here and the multiple linear regression published 20 years ago
is impossible, since the two models are based on different
covariates and have been built with different regression
techniques.
These results should prove useful in identifying anomalous
structures containing fewer or more water molecules per
residue than expected. It is, however, necessary to remember
that some of the water molecules deposited in the PDB files
may be erroneous interpretations of electron-density maps. As
a consequence, the multivariate model described in the
present manuscript cannot be utilized directly to validate the
degree of hydration of proteins in crystals.
6. Related literature
For additional literature relating to the supporting information, see Carugo (2003), Eisenhaber & Argos (1993), Frishman
& Argos (1995), Murphy (1948) and Winn et al. (2011).
References
Adams, P. D. et al. (2010). Acta Cryst. D66, 213–221.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N.,
Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids
Res. 28, 235–242.
Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. F. Jr, Brice,
M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M.
(1977). J. Mol. Biol. 112, 535–542.
Brünger, A. T. (1992). X-PLOR, Version 3.1. A System for X-ray
Crystallography and NMR. New Haven: Yale University Press.
Brünger, A. T., Adams, P. D., Clore, G. M., DeLano, W. L., Gros, P.,
Grosse-Kunstleve, R. W., Jiang, J.-S., Kuszewski, J., Nilges, M.,
Pannu, N. S., Read, R. J., Rice, L. M., Simonson, T. & Warren, G. L.
(1998). Acta Cryst. D54, 905–921.
Bricogne, G., Blanc, E., Brandl, M., Flensburg, C., Keller, P., Paciorek,
W., Roversi, P., Sharff, A., Smart, O. S., Vonrhein, C. & Womack, T.
O. (2016). BUSTER. Global Phasing Ltd, Cambridge, UK.
http://www.globalphasing.com/.
Carugo, O. (2003). In Silico Biol. 3, 417–428.
Carugo, O. (2015). Curr. Protein Pept. Sci. 16, 259–265.
Carugo, O. (2016a). Amino Acids, 48, 193–202.
Carugo, O. (2016b). Int. J. Biol. Macromol. 89, 137–143.
Carugo, O. & Argos, P. (1997). Protein Eng. 10, 777–787.
Carugo, O. & Bordo, D. (1999). Acta Cryst. D55, 479–483.
Carugo, O. & Djinovic-Carugo, K. (2005). Trends Biochem. Sci. 30,
213–219.
Dunlop, K. V., Irvin, R. T. & Hazes, B. (2005). Acta Cryst. D61, 80–87.
Eisenhaber, F. & Argos, P. (1993). J. Comput. Chem. 14, 1272–1280.
Fogarty, A. C. & Laage, D. (2014). J. Phys. Chem. B, 118, 7715–7729.
Frishman, D. & Argos, P. (1995). Proteins, 23, 566–579.
Garman, E. F. & Schneider, T. R. (1997). J. Appl. Cryst. 30, 211–237.
Matsuoka, D. & Nakasako, M. (2009). J. Phys. Chem. B, 113, 11274–
11292.
Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner,
R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011).
Acta Cryst. D67, 355–367.
Murphy, R. B. (1948). Ann. Math. Stat. 19, 581–589.
Nakasako, M. (2004). Philos. Trans. R. Soc. London B, 359, 1191–
1206.
Ringe, D. & Petsko, G. A. (1986). Methods Enzymol. 131, 389–433.
Rupley, J. A. & Careri, G. (1991). Adv. Protein Chem. 41, 37–172.
StataCorp (2013). Stata13. Statistical Software, Release 13. StataCorp
Lp, College Station, Texas, USA.
Sheldrick, G. M. (2008). Acta Cryst. A64, 112–122.
Sheldrick, G. M. (2015). Acta Cryst. C71, 3–8.
Winn, M. D. et al. (2011). Acta Cryst. D67, 235–242.