research papers

How many water molecules are detected in X-ray protein crystal structures?

Marco Gnesi (a) and Oliviero Carugo (b,c)*

(a) Section of Biostatistics and Clinical Epidemiology, Department of Public Health, Experimental and Forensic Medicine, University of Pavia, via Forlanini 2, Pavia 27100, Italy, (b) Department of Chemistry, University of Pavia, viale Taramelli 12, Pavia 27100, Italy, and (c) Department of Structural and Computational Biology, Max F. Perutz Laboratories, Vienna University, Campus Vienna Biocenter 5, Vienna A-1030, Austria. *Correspondence e-mail: [email protected]

J. Appl. Cryst. (2017). 50, 96–101. ISSN 1600-5767. https://doi.org/10.1107/S1600576716018719
Received 14 July 2016. Accepted 23 November 2016.
Edited by D. I. Svergun, European Molecular Biology Laboratory, Hamburg, Germany
Keywords: multiple regression; Poisson regression; regression model; protein hydration; structure refinement; water molecules; structure validation.
Supporting information: this article has supporting information at journals.iucr.org/j
© 2017 International Union of Crystallography

The positions of several water molecules can be determined in protein crystallography, either buried in internal cavities or at the protein surface. It is important to be able to estimate the expected number of these water molecules to facilitate crystal structure determination. Here, a multiple Poisson regression model implemented on nearly 10 000 protein crystal structures shows that the number of detectable water molecules depends on eight variables: crystallographic resolution, R factor, percentage of solvent in the crystal, average B factor of the protein atoms, percentage of amino acid residues in loops, average solvent-accessible surface area of the amino acid residues, grand average of hydropathy of the protein(s) in the asymmetric unit and normalized number of heteroatoms that are not water molecules. Furthermore, a secondary analysis tested the effect of different software packages. Given the values of these eight variables, it is possible to compute the expected number of water molecules detectable in electron-density maps with reasonable accuracy (as suggested by an external validation of the model).

1. Introduction

The positions of the oxygen atoms of several water molecules can be observed in protein crystal structures. A few of these water molecules are buried in the protein core and may stabilize the molecular structure by forming hydrogen bonds with polar atoms that are confined in the protein core because of protein folding (Carugo, 2016a, 2015). However, most of the water molecules cover the protein surface, by forming hydrogen bonds with hydrogen-donor and hydrogen-acceptor protein atoms and with other water molecules (and occasionally with other types of molecules) (Matsuoka & Nakasako, 2009). This agrees with the fact that the physicochemical properties of globular protein surfaces induce the formation of one or two layers of water molecules over the protein surface (Fogarty & Laage, 2014) and a minimum monolayer hydration shell seems to be necessary for protein function (Nakasako, 2004; Rupley & Careri, 1991).

Nearly two decades ago, we published a paper dealing with the number of water molecules that are detectable with protein crystallography (Carugo & Bordo, 1999). By means of multiple regression analysis of 873 protein crystal structures determined at room temperature and of only 33 structures determined at low temperature, it emerged that the number of water molecules included in crystallographic models was related primarily to the resolution at which the structure was solved and refined. Other variables, like the number of residues in the asymmetric unit, the fraction of apolar protein surface, the secondary structure and the temperature, seemed to have only a marginal influence.
Nowadays, most protein crystal structures are determined at low temperature to reduce radiation damage induced by extremely brilliant synchrotron beamlines (Garman & Schneider, 1997; Carugo & Djinovic-Carugo, 2005). On the basis of datasets much larger than those available in 1999, it has been possible to recognize that low-temperature protein crystal structures have 1.5–3.0 times more water molecules than structures determined at room temperature (Matsuoka & Nakasako, 2009; Nakasako, 2004). This might be due to a series of non-mutually exclusive causes: structural rearrangements caused by freezing, reduced thermal motion with concomitant sharpening of the electron density or higher resolution of cryogenic data (Dunlop et al., 2005).

Here, we analyse a non-redundant dataset of 9894 protein structures with a robust and rigorous statistical treatment. The aim is to develop a model that could highlight the contributions of several factors in determining the number of water molecules, and to verify its performance on a different dataset of 1066 proteins.

2. Methods

2.1. Data selection

Protein crystal structures were downloaded from the Protein Data Bank (PDB; Bernstein et al., 1977; Berman et al., 2000) and selected according to the following criteria: (i) only X-ray crystal structures with experimental data deposited in the database were considered; (ii) the crystallographic resolution had to be 4.5 Å or better; (iii) the working R factor had to be no higher than 0.35; (iv) the free R factor had to be no higher than 0.40; (v) the fraction of nucleic acid atoms or of heteroatoms different from water molecules had to be lower than 5% of the atoms located in the asymmetric unit; (vi) at least 30 amino acids had to be found in the asymmetric unit; (vii) the average B factor of the protein atoms had to be smaller than 100 Å²; (viii) the maximal pairwise sequence identity had to be lower than 30%. Data from the PDB were inspected to ensure their consistency.
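The per-entry selection criteria above can be sketched as a simple filter. This is only an illustrative sketch: the `Entry` class and its field names are hypothetical, criteria (ii)–(iv) are read here as upper bounds on resolution and R factors, and criterion (viii), which concerns pairwise sequence identity across the whole dataset, is not covered.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical per-structure record; field names are illustrative only."""
    resolution: float            # crystallographic resolution, in Å
    r_work: float                # working R factor
    r_free: float                # free R factor
    non_protein_fraction: float  # nucleic acid / non-water heteroatoms, as a fraction of all atoms
    n_residues: int              # amino acids in the asymmetric unit
    mean_b_factor: float         # average B factor of protein atoms, in Å^2
    has_experimental_data: bool  # experimental data deposited in the PDB


def passes_selection(e: Entry) -> bool:
    """Apply criteria (i)-(vii), assuming (ii)-(iv) are upper bounds."""
    return (
        e.has_experimental_data            # (i)
        and e.resolution <= 4.5            # (ii)
        and e.r_work <= 0.35               # (iii)
        and e.r_free <= 0.40               # (iv)
        and e.non_protein_fraction < 0.05  # (v)
        and e.n_residues >= 30             # (vi)
        and e.mean_b_factor < 100.0        # (vii)
    )


typical = Entry(1.9, 0.19, 0.23, 0.01, 250, 31.6, True)
print(passes_selection(typical))
```

A redundancy filter for criterion (viii) would have to operate on the full set of sequences and is deliberately left out of this sketch.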
The structures were divided into two datasets, a ‘training dataset’, containing structures recorded in the PDB before 2016, and a ‘test dataset’, including structures recorded during 2016.

The values of 22 variables (listed in Table S1 in the supporting information) were extracted or computed for each crystal structure. In the present work, the dependent variable to be predicted is the load of water within a crystal, obtained by normalizing the number of molecules of water on the number of amino acid residues in the protein’s primary structure (H2O/aa ratio). The dependent variable is derived from count data and, therefore, it follows a Poisson distribution rather than a normal distribution.

After a careful evaluation of several potentially relevant predictors, considering both experimental and statistical requirements (like non-collinearity), for our modelling analyses we considered the following as potential covariates: crystallographic resolution (Res), working R factor (R), percentage of solvent in the crystal (Solv%), average B factor of the protein and nucleic acid atoms (Bave), percentage of amino acid residues in loops (aaLoops%), average solvent-accessible surface areas of the amino acid residues (Sasaaa), grand average of hydropathy of the proteins in the asymmetric unit (GRAVY), total electric charge of the proteins in the asymmetric unit (Tectp), and number of heteroatoms that are not water molecules, normalized on the number of amino acid residues in the protein’s primary structure (Het/aa ratio). The type of software package used to refine the structure (Software) was added to the above-mentioned variables in the context of the secondary analysis.

2.2. Statistical analyses

All the variables collected from the PDB were quantitative and were expressed as mean values, standard deviations (SD) and medians.
The only exception is the categorical variable Software, which was described in terms of absolute and relative frequencies.

In the main analysis, the linear or nonlinear nature of the relationship between the dependent variable and each of the continuous independent variables was assessed graphically on the training dataset. When a linear relationship was assumed, its strength was initially evaluated through bivariate analyses, by means of the Spearman non-parametric correlation coefficient (given that the dependent variable is not normally distributed). When relationships were assumed to be other than linear, suitable mathematical transformations were applied to the independent variable.

A multiple Poisson regression model, appropriate for our count-derived dependent variable, was fitted on the training dataset to assess the adjusted effect of every candidate covariate on the dependent variable (H2O/aa ratio). Only those covariates showing a relevant effect were kept (statistical significance on Wald’s test, thus meaning the estimated regression coefficient is different from a null effect). Robust standard errors and confidence intervals (95% CI) for regression coefficients were estimated using the Huber–White formula.

With regard to Poisson regression, given the values of the independent variables, the predicted values of the dependent variable should be computed as follows:

\[
\hat{y} = \exp(\beta_0) \times [\exp(\beta_1)]^{x_1} \times \cdots \times [\exp(\beta_I)]^{x_I} = \mathrm{intercept} \times \prod_{i=1}^{I} [\exp(\beta_i)]^{x_i}. \qquad (1)
\]

Consequently, if $\exp(\beta_i) > 1$, the predicted value increases as $x_i$ increases; on the other hand, if $\exp(\beta_i) < 1$, the predicted value decreases.

The predictive performance of the thus-developed regression model was then tested by analysing the regression’s residuals, both within the sample (i.e. apparent validation on the same training dataset as was used to estimate the regression coefficients) and outside the sample (i.e. external validation on the test dataset). The Spearman correlation coefficient between the predicted and observed values was
also computed, even though it is a naïve indicator and should be interpreted cautiously.

A secondary analysis, performed again on the training dataset, outlined the eventual differences in H2O/aa ratio among different groups identified by a categorical variable (Software) through the Kruskal–Wallis method (a non-parametric method, analogous to the one-way ANOVA, suitable for a response variable that does not fit a normal distribution). The categories of this variable were then added to the Poisson multiple regression model fitted in the main analyses. The model thus obtained underwent validation both within and outside the sample.

In inferential testing, the level of statistical significance was set at 0.05. In the case of Wald’s tests on regression coefficients, the significance level was adjusted according to the Bonferroni method by dividing by the number of terms estimated within the model (predictors and intercept). Statistical analyses were performed using Stata13 (StataCorp, 2013).

3. Results

The final training dataset consisted of experimental data from 9894 experiments on protein crystals. In more than 90% of these cases, data collection had taken place at a temperature of 100 K (or, in any case, within the range 63–298 K; 100.29 K on average). A general description of the data collected from the PDB is reported in Table 1.

Table 1
General descriptions of the dependent and independent variables, each expressed in terms of average value (Mean), standard error (SE) and median (Median).

Variable        Mean       SE         Median
H2O/aa          0.6617     0.4525     0.6096
Res (Å)         1.9442     0.4421     1.9000
R               0.1898     0.0301     0.1900
Solv% (%)       53.1841    9.3651     52.0000
Bave (Å²)       31.6188    16.7318    28.0000
aaLoops% (%)    36.2741    9.2523     37.0000
Sasaaa (Å²)     50.3714    9.3405     49.0000
GRAVY           0.0577     0.0798     0.0530
Tectp           1.7980     17.2095    2.0000
Het/aa          0.1298     0.1604     0.0874

A graphical analysis suggested the H2O/aa ratio to have an apparently linear relationship with some of the covariates (Res, R, Solv%, Sasaaa, Bave and Het/aa), while Tectp was transformed to the opposite of its absolute value. Other covariates (aaLoops%, Sasaaa and GRAVY) were suspected to have a second-degree (i.e. parabolic) association with the log-transformed dependent variable, and the regression analysis led us to discard these second-degree terms, with the single exception of GRAVY. Scatter plots of each variable versus the H2O/aa ratio are shown in Fig. S1 in the supporting information.

Spearman correlation coefficients for the crude (bivariate) linear relationships between the covariates (or the transformed value, for Tectp) and the number of water molecules per amino acid residue are reported in Table 2. The coefficients indicate a weak relationship for the percentage of amino acid residues in loops (aaLoops%), average solvent-accessible areas of the amino acid residues (Sasaaa), transformed total electric charge of the asymmetric unit (Tectp) and normalized number of heteroatoms other than water molecules (Het/aa). The coefficient was not computed for the grand average of hydropathy of the proteins in the asymmetric unit (GRAVY), as its quadratic relation with the H2O/aa ratio is not monotonic.

Table 2
Non-parametric correlation between the number of water molecules per amino acid residue and the covariates.

Covariate    Type of transformation    Spearman correlation coefficient    p value
Res          None                      0.806                               <0.0001
R            None                      0.637                               <0.0001
Solv%        None                      0.367                               <0.0001
Bave         None                      0.735                               <0.0001
aaLoops%     None                      0.111                               <0.0001
Sasaaa       None                      0.056                               <0.0001
GRAVY        Second-degree             Not applicable                      Not applicable
Tectp        Opposite of modulus       0.139                               <0.0001
Het/aa       None                      0.161                               <0.0001

3.1. Modelling through Poisson multiple regression

The covariates shown in Table 2 were included in a Poisson multiple regression model with the ratio H2O/aa as a dependent variable. Non-relevant covariates were discarded (see Methods). The results of the final regression model are reported in Table 3; our model is appropriate for describing in-sample data, as confirmed by the Pearson goodness-of-fit test (χ² = 1048.38, p ≃ 1.000). The Spearman correlation coefficients between all pairs of independent variables, as reported in Table S2 in the supporting information, are weak enough to exclude multicollinearity issues according to standard statistical practice. No substantive change was observed when the analysis was applied to a subsample of the most recent structures (deposition year 2005–2014).

Table 3
Results of the Poisson multiple regression model – main analysis.

Covariate    exp(β)    Standard error    Wald’s test†    95% CI
Res          0.4290    0.0080            p < 0.001       0.4136–0.4450
R            0.0516    0.0099            p < 0.001       0.0354–0.0751
Solv%        1.0094    0.0006            p < 0.001       1.0083–1.0105
Bave         0.9819    0.0005            p < 0.001       0.9809–0.9829
aaLoops%     1.0016    0.0005            p = 0.002       1.0006–1.0026
Sasaaa       1.0030    0.0005            p < 0.001       1.0020–1.0041
GRAVY        0.4151    0.0347            p < 0.001       0.3523–0.4890
GRAVY²       0.1376    0.0570            p < 0.001       0.0612–0.3097
Het/aa       1.1448    0.0280            p < 0.001       1.0911–1.2010
Intercept    4.1506    0.1960            p < 0.001       3.7837–4.5530

† Bonferroni-adjusted significance level 0.005.
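As a minimal sketch of how equation (1) might be applied with the exp(β) values of Table 3, the following Python snippet computes a predicted H2O/aa ratio. The function name, the dictionary keys and the choice of the Table 1 means as example input are illustrative assumptions, not part of the original analysis (which was carried out in Stata); the GRAVY value is entered exactly as printed in Table 1.

```python
import math

# exp(beta) values transcribed from Table 3 (main analysis).
EXP_BETA = {
    "Res": 0.4290,
    "R": 0.0516,
    "Solv%": 1.0094,
    "Bave": 0.9819,
    "aaLoops%": 1.0016,
    "Sasaaa": 1.0030,
    "GRAVY": 0.4151,
    "GRAVY2": 0.1376,  # coefficient of the squared GRAVY term
    "Het/aa": 1.1448,
}
INTERCEPT = 4.1506  # exp(beta_0) from Table 3


def predict_h2o_per_aa(covariates):
    """Equation (1): intercept times the product of exp(beta_i) ** x_i."""
    x = dict(covariates)
    x["GRAVY2"] = x["GRAVY"] ** 2  # quadratic term is derived from GRAVY
    # Sum in log space, then exponentiate (same result as the raw product).
    log_y = math.log(INTERCEPT)
    for name, exp_beta in EXP_BETA.items():
        log_y += x[name] * math.log(exp_beta)
    return math.exp(log_y)


# Illustrative input: the mean values listed in Table 1.
table1_means = {
    "Res": 1.9442, "R": 0.1898, "Solv%": 53.1841, "Bave": 31.6188,
    "aaLoops%": 36.2741, "Sasaaa": 50.3714, "GRAVY": 0.0577, "Het/aa": 0.1298,
}
print(round(predict_h2o_per_aa(table1_means), 3))
```

Multiplying the predicted ratio by the number of residues in the asymmetric unit would give an expected number of water molecules, under the same assumptions.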
For the experiments included in our training dataset, the values of the H2O/aa ratio predicted by the regression model lie between 0.0237 and 3.1118. For interested readers, the tolerance intervals for predicted values of the H2O/aa ratio are reported and explained in the supporting information (Table S3).

3.2. Validation of the regression model

According to the apparent validation criteria (see Methods), the regression model appears able to predict the water load properly, as confirmed by an analysis of the regression residuals. The mean of the residuals’ distribution is 4.38 × 10⁻⁹ with a standard deviation (SD) of 0.2694; the median absolute value is 0.1467. The predicted H2O/aa ratio values are highly correlated with the observed ones (Spearman correlation coefficient = 0.849, p < 0.0001).

Moreover, a fully external validation was performed on the test dataset (see Methods). The predicted values of the H2O/aa ratio, in the range 0.0313–2.3372, adequately fit the observed values, as shown in Fig. S4 in the supporting information, and their correlation is as high as in the apparent validation (Spearman correlation coefficient = 0.893, p < 0.0001). The residuals have an average of 0.0265 with an SD of 0.2747. Looking at absolute values, the median residual is 0.1302.

The most striking outlier is 5gnn, where the observed ratio between the water molecules and the residues (4.14) is much higher than the predicted value (0.61). However, we observe that this PDB entry is characterized by extremely bad scores in the PDB validation report: it is affected by numerous interatomic clashes, it has a very bad Ramachandran plot and many side chains are severely distorted; moreover, 519 of its water molecules are more than 5 Å away from the nearest polymer chain. Very recently, the quality of this PDB entry has been critically debated on a public forum (information can be retrieved at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind1609&L=ccp4bb#6). We might argue that this is one example of a problematic crystal structure which would deserve further attention, although its real-space validation scores are quite good.

However, despite a more-than-acceptable performance of the model, a modest underestimation of the water load could be hypothesized, thus accounting for the slight negative asymmetry observed in the distribution of residuals both within and outside the sample.

3.3. Investigating the effect of the different software packages used in crystallographic refinements

The software used to resolve the crystal structure could potentially influence the number of water molecules identified in a structure. However, including the type of software package (Software variable) in the main regression model would have implied restrictions. For instance, any chance to use the model would be undermined by the lack of specific attributes for rare or, in the future, new software packages, which were not included in our model.

To overcome the aforementioned limitations, the effect of the Software variable on the water load of crystals was investigated in a secondary analysis on the training dataset. Data from 9864 structures were considered, meaning that 30 structures were excluded for having insufficient information about the software used in their PDB files. These data are presented in Table 4.

Table 4
Distribution of software packages (Software variable).

Program title    Program reference           No. (%)
REFMAC           Murshudov et al. (2011)     6476 (65.7%)
PHENIX           Adams et al. (2010)         1016 (10.3%)
CNS              Brünger et al. (1998)       1908 (19.3%)
BUSTER           Bricogne et al. (2016)      349 (3.5%)
X-PLOR           Brünger (1992)              59 (0.6%)
SHELXL           Sheldrick (2008, 2015)      56 (0.6%)
Total            –                           9864 (100.0%)

A bivariate analysis disclosed a significant overall difference in the observed H2O/aa ratio among Software types (Kruskal–Wallis non-parametric test, χ² = 181.384, p = 0.0001). Post hoc testing was not performed, as we chose to estimate the individual effect of each software package directly through the subsequent multivariate analysis carried out by means of Poisson multiple regression.

Table 5
Results of the Poisson multiple regression model – secondary analysis.

Covariate    exp(β)    Standard error    Wald’s test†    95% CI
Res          0.4286    0.0079            p < 0.001       0.4134–0.4444
R            0.0139    0.0029            p < 0.001       0.0092–0.0209
Solv%        1.0098    0.0006            p < 0.001       1.0087–1.0109
Bave         0.9825    0.0005            p < 0.001       0.9814–0.9835
aaLoops%     1.0013    0.0005            p = 0.008       1.0004–1.0023
Sasaaa       1.0037    0.0005            p < 0.001       1.0027–1.0048
GRAVY        0.4217    0.0357            p < 0.001       0.3572–0.4979
GRAVY²       0.1476    0.0609            p < 0.001       0.0658–0.3312
Het/aa       1.1343    0.0269            p < 0.001       1.0827–1.1884
Software‡
  REFMAC     1         (Reference category)
  PHENIX     1.0277    0.0149            p = 0.059       0.9989–1.0573
  CNS        1.2172    0.0145            p < 0.001       1.1891–1.2460
  BUSTER     1.1486    0.0269            p < 0.001       1.0971–1.2025
  X-PLOR     1.1710    0.0672            p = 0.006       1.0464–1.3105
  SHELXL     0.7926    0.0383            p < 0.001       0.7211–0.8713
Intercept    4.7658    0.2304            p < 0.001       4.3349–5.2394

† Bonferroni-adjusted significance level 0.0033. ‡ When computing the predicted H2O/aa ratio, xi = 1 for the software package used to resolve the structure and xi = 0 otherwise.

The results of this model, shown in Table 5, substantially agree with the main analysis (Table 3); in addition, the role of the Software variable is confirmed. Most of the software packages, measured against REFMAC (which was taken as the reference type), have a statistically relevant impact on the H2O/aa ratio, with the exception of PHENIX and X-PLOR. For the latter two, it must be considered that Wald’s test on the regressor is borderline significant against a strongly conservative criterion of statistical significance.
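The Bonferroni-adjusted significance levels quoted in the footnotes of Tables 3 and 5 can be reproduced with a short calculation. The sketch below assumes that the divisor is the number of estimated terms: nine predictors plus the intercept in the main model, and nine predictors, five software indicator terms and the intercept in the secondary model. This reading is consistent with the reported values of 0.005 and 0.0033, but the term counts are our interpretation rather than something stated explicitly. Only the two software p values reported exactly in Table 5 are checked.

```python
def bonferroni_threshold(alpha, n_terms):
    """Bonferroni-adjusted significance level: alpha divided by the number of tests."""
    return alpha / n_terms


# Main model: 9 predictors + intercept (assumed count).
main_threshold = bonferroni_threshold(0.05, 9 + 1)
# Secondary model: 9 predictors + 5 software indicators + intercept (assumed count).
secondary_threshold = bonferroni_threshold(0.05, 9 + 5 + 1)

# p values given exactly in Table 5; the other packages are reported as p < 0.001.
software_p = {"PHENIX": 0.059, "X-PLOR": 0.006}
not_significant = [name for name, p in software_p.items() if p >= secondary_threshold]

print(main_threshold, round(secondary_threshold, 4), not_significant)
```

Under these assumptions, both PHENIX and X-PLOR fall short of the adjusted threshold, while X-PLOR would pass the unadjusted level of 0.05.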
Moreover, only 59 cases using X-PLOR are reported in the training dataset (Table 4), and this could make Wald’s test even more conservative, because such a small number of cases could increase the uncertainty in the estimation of the regression coefficient (i.e. the standard error and confidence interval).

The performance of this model, in terms of apparent validity, seems reasonable and comparable to the main model. The predicted values of the H2O/aa ratio, ranging from 0.0210 to 3.2827, are highly correlated with the observed values (Spearman correlation coefficient = 0.855, p < 0.0001). On average, the residuals are very close to zero (4.43 × 10⁻⁹) with an SD of 0.2682, again with a slight negative asymmetry of distribution, and the median absolute value is 0.1453. External validation substantially confirms the apparent validation: the predicted values of the H2O/aa ratio, within the range 0.0256–2.3559, are strongly correlated with the observed ones (Spearman correlation coefficient = 0.894, p < 0.0001); the residuals have an average of 0.0566 with an SD of 0.0555, and the median absolute residual is 0.1253. We observe that in the test dataset only three software packages are present (REFMAC, PHENIX and CNS).

4. Discussion

It is necessary to note that some of the water molecules deposited in the PDB files may result from misinterpretation of the electron-density maps, e.g. if weak density peaks were interpreted as water molecules with high B factors or low occupancies. However, we did not exclude these potentially erroneous water molecules from our analysis, as no widely accepted criteria are available to filter out these water molecules automatically and since it is obviously impossible to examine nearly 10 000 electron-density maps visually.
The results presented here are therefore not an attempt to predict how many water molecules hydrate proteins in the crystals; we simply examine the number of water molecules that crystallographers include, on average, in the structures published and deposited in the PDB.

In the main analysis presented in this paper, the multivariable model has highlighted eight variables with a significant impact on the normalized number of water molecules observed: crystallographic resolution (Res), R factor (R), percentage of solvent in the crystal (Solv%), average B factor of the protein atoms (Bave), percentage of amino acid residues in loops (aaLoops%), average solvent-accessible surface area of the amino acid residues (Sasaaa), grand average of hydropathy of the protein(s) in the asymmetric unit (GRAVY, GRAVY²) and normalized number of heteroatoms that are not water molecules (Het/aa). The distributions of these variables are shown in Fig. S3 in the supporting information. Here, we briefly examine the adjusted relation of each covariate to the dependent variable. In each case, the effect on the dependent variable was computed for a variation of approximately one standard deviation of the covariate.

Fewer water molecules per residue are observed if the crystallographic resolution (Res) worsens. If the numerical value of the resolution increases by 0.44 Å (i.e. the resolution worsens), the dependent variable falls by approximately 31%. This is expected, since the electron-density maps are obviously sharper when more diffraction intensities can be measured to higher resolution. In an independent analysis, it has recently been shown that it is necessary to reach a resolution of about 1.5–1.6 Å to observe a continuous hydration layer at the protein surface (Carugo, 2016b).

If the average B factor of the protein atoms (Bave) increases, fewer water molecules per residue are detected.
For example, if the B factor increases by 15 Å², the dependent variable is reduced by roughly 24%. To understand this, it is necessary to recall that the B factor monitors not only the amplitude of the oscillations of the atoms around their average positions but also the conformational disorder of the atoms (Carugo & Argos, 1997; Ringe & Petsko, 1986). Consequently, it is reasonable to expect that fewer water molecules can be observed if the protein has larger B factors and is more flexible. In particular, one must remember that most of the water molecules observed in crystal structures are found in the first, second and perhaps third hydration layers, in contact with the atoms of the residues that are at the protein surface, which are more flexible than the protein core.

Analogously, more water molecules per residue are observed if the R factor (R) improves. For instance, when the R factor is increased by 0.05, the dependent variable decreases by 14%. As for the resolution, this is largely expected, since electron-density maps become sharper when differences between computed and observed diffraction intensities diminish.

The dependent variable also increases with the percentage of solvent in the crystal (Solv%), even though the effect is far from being pivotal (an increment of 10% in the percentage of solvent in the crystal, which is approximately a standard deviation, increases the H2O/aa ratio by 10%). This is surprising, at least in part. In fact, crystals containing more liquid solvent are expected to diffract poorly, with the consequence that fewer protein atoms and fewer water molecules are observable in the electron-density maps.

The relationship between the H2O/aa ratio and GRAVY includes two terms, one linear and the other quadratic, and as a consequence it is impossible to give an intuitive and straightforward explanation. Fig.
S2 in the supporting information shows a scatter plot of the predicted values of the H2O/aa ratio versus observed values of GRAVY. From the plot, we can infer that the predicted H2O/aa ratio reaches a maximum when GRAVY is around 0.10; the greater the distance from this point, the lower the predicted ratio.

Three other independent variables have an extremely low impact on the number of water molecules per amino acid residue, which increases marginally if the average solvent-accessible surface area of the residues (Sasaaa), the number of heteroatoms that are not water molecules (Het/aa) or the percentage of residues in loops (aaLoops%) increases. According to our analysis, an increment of 10 Å² of Sasaaa implies the dependent variable increases by only 3%. An increment of 0.15 in the Het/aa ratio, which is approximately one standard deviation, corresponds to a 2% increase in the H2O/aa ratio. If residues in loops are incremented by 10% there is only a 1.6% increment in the dependent variable. To some extent, these trends are surprising. One must observe that, although these effects are statistically significant, they have a very modest impact on the dependent variable.

In addition, a secondary analysis pinpointed the effect of the type of software package used to resolve the structures (Software variable) on the H2O/aa ratio. The effect exerted by Software seems to be independent of those of other predictors, whose regression coefficients remained stable after the inclusion of this independent variable. The impact of crystallographic software is interesting. Fewer water molecules are localized when using SHELXL and more with CNS. It is reasonable to suppose that this reflects a systematic difference amongst different packages. However, a detailed comparison between different software packages is beyond the purpose of the present study and further studies are necessary to investigate this issue.
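The percentage effects quoted in this section follow from equation (1): changing a covariate by Δx multiplies the predicted ratio by exp(β)^Δx. The sketch below recomputes them from the exp(β) values of Table 3; the helper name `percent_change` and the labels are illustrative, and the increments are the ones used in the text.

```python
def percent_change(exp_beta, delta):
    """Relative change (in %) of the predicted ratio when a covariate changes by delta."""
    return (exp_beta ** delta - 1.0) * 100.0


# (exp(beta) from Table 3, increment discussed in the text)
effects = {
    "Res +0.44 A": (0.4290, 0.44),
    "Bave +15 A^2": (0.9819, 15.0),
    "R +0.05": (0.0516, 0.05),
    "Solv% +10": (1.0094, 10.0),
    "Sasaaa +10 A^2": (1.0030, 10.0),
    "Het/aa +0.15": (1.1448, 0.15),
    "aaLoops% +10": (1.0016, 10.0),
}

changes = {label: percent_change(eb, dx) for label, (eb, dx) in effects.items()}
for label, value in changes.items():
    print(f"{label}: {value:+.1f}%")
```

Because the model is multiplicative, these changes hold whatever the values of the other covariates, which is why each effect can be discussed separately.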
This analysis of each variable clearly shows that nearly every variable has a predictable impact on the number of water molecules per residue. However, the correlation of several individual variables with the number of water molecules per residue is relatively modest. Only when they are taken together do these variables correlate well with the number of water molecules per residue. The performance of this multivariate model is remarkable, as indicated by the out-of-sample predictive validation.

5. Summary

We have analysed the relationships between the number of water molecules that are detected in a protein crystal structure and a series of variables that might have an impact on it. By means of Poisson multiple regression analyses, we have identified eight variables that significantly affect the number of water molecules per residue: the crystallographic resolution, the R factor, the average B factor of the protein atoms, the percentage of solvent in the crystal, the average solvent-accessible surface area of the amino acid residues, the grand average of hydropathy of the proteins in the asymmetric unit, the percentage of amino acids in loops and the number of heteroatoms that are not water molecules (normalized on the number of amino acids). A secondary analysis also disclosed the fact that the software packages which are routinely used to refine crystal structures can affect the number of water molecules located in the electron-density maps.

This analysis extends previous studies on room-temperature protein crystal structures, published nearly 20 years ago (Carugo & Bordo, 1999), which can be considered outdated, since diffraction data are nowadays commonly collected at low temperature (usually about 100 K) and since the present quantity of data is significantly larger than 20 years ago.
However, a direct comparison between the model presented here and the multiple linear regression published 20 years ago is impossible, since the two models are based on different covariates and have been built with different regression techniques.

These results should prove useful in identifying anomalous structures containing fewer or more water molecules per residue than expected. It is, however, necessary to remember that some of the water molecules deposited in the PDB files may be erroneous interpretations of electron-density maps. As a consequence, the multivariate model described in the present manuscript cannot be utilized directly to validate the degree of hydration of proteins in crystals.

6. Related literature

For additional literature relating to the supporting information, see Carugo (2003), Eisenhaber & Argos (1993), Frishman & Argos (1995), Murphy (1948) and Winn et al. (2011).

References

Adams, P. D. et al. (2010). Acta Cryst. D66, 213–221.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242.
Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535–542.
Brünger, A. T. (1992). X-PLOR, Version 3.1. A System for X-ray Crystallography and NMR. New Haven: Yale University Press.
Brünger, A. T., Adams, P. D., Clore, G. M., DeLano, W. L., Gros, P., Grosse-Kunstleve, R. W., Jiang, J.-S., Kuszewski, J., Nilges, M., Pannu, N. S., Read, R. J., Rice, L. M., Simonson, T. & Warren, G. L. (1998). Acta Cryst. D54, 905–921.
Bricogne, G., Blanc, E., Brandl, M., Flensburg, C., Keller, P., Paciorek, W., Roversi, P., Sharff, A., Smart, O. S., Vonrhein, C. & Womack, T. O. (2016). BUSTER. Global Phasing Ltd, Cambridge, UK. http://www.globalphasing.com/.
Carugo, O. (2003). In Silico Biol. 3, 417–428.
Carugo, O. (2015). Curr. Protein Pept. Sci. 16, 259–265.
Carugo, O. (2016a). Amino Acids, 48, 193–202.
Carugo, O. (2016b). Int. J. Biol. Macromol. 89, 137–143.
Carugo, O. & Argos, P. (1997). Protein Eng. 10, 777–787.
Carugo, O. & Bordo, D. (1999). Acta Cryst. D55, 479–483.
Carugo, O. & Djinovic-Carugo, K. (2005). Trends Biochem. Sci. 30, 213–219.
Dunlop, K. V., Irvin, R. T. & Hazes, B. (2005). Acta Cryst. D61, 80–87.
Eisenhaber, F. & Argos, P. (1993). J. Comput. Chem. 14, 1272–1280.
Fogarty, A. C. & Laage, D. (2014). J. Phys. Chem. B, 118, 7715–7729.
Frishman, D. & Argos, P. (1995). Proteins, 23, 566–579.
Garman, E. F. & Schneider, T. R. (1997). J. Appl. Cryst. 30, 211–237.
Matsuoka, D. & Nakasako, M. (2009). J. Phys. Chem. B, 113, 11274–11292.
Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355–367.
Murphy, R. B. (1948). Ann. Math. Stat. 19, 581–589.
Nakasako, M. (2004). Philos. Trans. R. Soc. London B, 359, 1191–1206.
Ringe, D. & Petsko, G. A. (1986). Methods Enzymol. 131, 389–433.
Rupley, J. A. & Careri, G. (1991). Adv. Protein Chem. 41, 37–172.
StataCorp (2013). Stata13. Statistical Software, Release 13. StataCorp LP, College Station, Texas, USA.
Sheldrick, G. M. (2008). Acta Cryst. A64, 112–122.
Sheldrick, G. M. (2015). Acta Cryst. C71, 3–8.
Winn, M. D. et al. (2011). Acta Cryst. D67, 235–242.