Estimation of the quality of refined protein crystal structures

Jimin Wang*

Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520

Received 20 October 2014; Accepted 5 January 2015
DOI: 10.1002/pro.2639
Published online 7 January 2015 proteinscience.org

Abstract: Crystallographic Rwork and Rfree values, which are measures of the ability of the models of macromolecular structures to explain the crystallographic data on which they are based, are often used to assess structure quality. It is widely known, and confirmed here, that both are sensitive to the methods used to compute them, and can be manipulated to improve the apparent quality of the model. As an alternative, it is proposed here that the quality of crystallographic models should be assessed using a global goodness-of-fit metric, RO2A/Rwork, where RO2A is the number of reflections used for refinement divided by the number of nonhydrogen atoms in the structure, and Rwork is the working R-factor of the refined structure. Also, analysis of structures in the Protein Data Bank suggests that many data sets have been truncated at high resolution, thereby improving the R-factor statistics. To discourage this practice, it is proposed that the resolution of a data set be defined as the resolution of the shell of data where <I/σI> falls to 1. The proposed goodness-of-fit metric encourages investigators to use all the data available rather than a truncated subset.
Keywords: resolution; protein quality; statistical gaming; statistical manipulation; free R-factors; working R-factors; data truncation

Additional Supporting Information may be found in the online version of this article.
Grant sponsor: National Institutes of Health Grants; Grant number: P01 GM022778; Grant sponsor: Steitz Center for Structural Biology, Gwangju Institute of Science and Technology, Republic of Korea.
*Correspondence to: Jimin Wang, Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520. E-mail: [email protected]
© 2015 The Protein Society. Published by Wiley-Blackwell.

Introduction

A large number of statistics are used to assess the geometric and crystallographic quality of macromolecular crystal structures [1]. Unlike the statistics used to gauge geometric quality, which are largely independent of the methods used to solve structures, those used to characterize crystallographic quality are sensitive to the decisions made consciously by crystallographers as they solve structures or, in some cases without their knowledge, by the programs they use. This applies both to the agreement R-factors for the measured data (i.e., Rmerge, Rmeas, RPIM, CC1/2, etc.) and to the statistics used to judge the correspondence between the model being refined and the measured data (i.e., Rwork and Rfree). These statistics can make structures appear better than they really are. Crystallographic quality statistics can be “improved” by systematically excluding weak-intensity, high-resolution (WIHR) reflections from data sets. The net effect is to trade an apparent improvement in structural quality for a reduction in resolution [2,3].

The resolution of the crystal structure of a macromolecule is the single most important determinant of its quality. However, while everyone is sure they know what the word “resolution” means, there is no generally accepted method for estimating it.
Worse, investigators may omit available high-resolution data during structure refinement because by doing so they can improve the apparent working and free R-factors of the resulting structures. One reason investigators may do so is that the free R-factor, rather than resolution, is commonly used today as a measure of structure quality [1,4]. For example, the structure with PDB code 4HYO, which was solved at a resolution of 1.65 Å with an Rfree value of 18.1%, is likely to be regarded more favorably than the structure of the same molecule with PDB code 3LDC, which was solved at a resolution of 1.45 Å with an Rfree of 20.2% [5,6]. The tendency of editors and authors to have this preference is perverse [7].

There are many examples in the literature of the benefits to be gained by using all of the WIHR data available when refining structures. For example, over a decade ago, the structure of GroEL reported in 1DER was rerefined (PDB accession: 1KP8) using all of the WIHR data between a resolution of 2.0 Å and 2.4 Å, the latter being the resolution of the 1DER structure [8,9]. At 2.4 Å, <I/σI> is 1.0 in the highest resolution shell, but at 2.0 Å, <I/σI> is only 0.5. The inclusion of these WIHR data made it possible to correct several significant errors in the 1DER structure, and upon further refinement, a structure was obtained (1KP8) that has both working and free R-factors more than 10% lower than those reported for 1DER out to a resolution of 2.4 Å (Fig. 1). It was argued then that the omission of the WIHR data from the data set used to refine the 1DER structure had trapped it in a local minimum from which it could not escape until the WIHR data were taken into account.

Figure 1. Distribution of working (filled) and free R-factors (open) for GroEL 1DER (dashed lines) and 1KP8 (solid lines) as a function of reciprocal resolution (Å⁻¹). The reciprocal resolution of 2.4 Å is marked with a green vertical line.

PROTEIN SCIENCE 2015 VOL 24:661–669
Here, evidence will be presented that most of the structures in the Protein Data Bank (PDB) would probably be improved if they were refined using all the WIHR data available. In order to encourage crystallographers to use all the data available, a new metric needs to be developed for judging structural quality that takes account not only of the correspondence between the data and the model, which Rfree certainly does, but also of the explanatory power of the data, which Rfree does not. The explanatory power of the data is the ratio of the number of independent reflections in the data set used for refinement to the number of adjustable parameters in the structural model obtained using those data. Since the number of nonhydrogen atoms in a structure should be proportional to the number of parameters a molecular model must specify, the observation-to-atom (O2A) ratio (RO2A), which is easily estimated, ought to be a useful measure of the explanatory power of the data sets. A structure that has a comparatively high Rfree value but also a high O2A ratio can be superior in quality to one with a low Rfree value but also a low O2A ratio. A number of global goodness-of-fit (GGOF) metrics based on this principle are proposed here to encourage further discussion.

The two GroEL structures mentioned above provide a case in point. The R-factors for the GroEL structure originally published (1DER; R = 24.7%, Rfree = 29.8%, resolution 2.4 Å) are not much different from those that characterize the model obtained for GroEL using all the WIHR data available (1KP8; R = 24.3%, Rfree = 25.8%, resolution 2.0 Å) [8,9]. However, since the 1KP8 GroEL model explains nearly twice the number of experimental observations as the 1DER model, it is clearly superior.
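The gain in observations from extending the GroEL data from 2.4 Å to 2.0 Å can be estimated with a back-of-the-envelope calculation. The sketch below is an illustration rather than part of the original analysis: it assumes that the number of unique reflections scales with the volume of the reciprocal-space sphere (that is, with 1/d³) and ignores completeness and the low-resolution cutoff. The function name `reflection_gain` is hypothetical.

```python
def reflection_gain(d_old: float, d_new: float) -> float:
    """Approximate factor by which the number of unique reflections grows
    when the high-resolution limit moves from d_old to d_new (both in Å).

    Assumption: the count scales with the volume of the reciprocal-space
    sphere, i.e. with 1/d**3; completeness and the low-resolution cutoff
    are ignored.
    """
    if d_old <= 0 or d_new <= 0:
        raise ValueError("resolution limits must be positive")
    return (d_old / d_new) ** 3


# GroEL: 1DER was refined to 2.4 Å, 1KP8 to 2.0 Å.
gain = reflection_gain(2.4, 2.0)
print(f"expected gain in observations: {gain:.2f}x")
```

Under this assumption the 2.0 Å data set holds roughly 1.7 times as many reflections as the 2.4 Å one. The RO2A values listed for the two entries in Table IV (5.8 and 9.5) give a ratio of the same order.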
Results

Rfree values can be manipulated

For at least the last decade, the Rfree values reported for the refined crystal structures of macromolecules have been used as a quality metric, in ways that were never intended [4]. Investigators may be concerned that they cannot publish structures if their Rfree values are too high, or are significantly poorer than those of the average structure in the PDB of similar resolution. Yet, referees seldom object to structures on the grounds that their Rfree values are too low, even when, as sometimes happens, they are smaller than the corresponding working R-factors, which is essentially impossible.

As the following example illustrates, it is easy to reduce the Rfree value of a structure without doing anything to improve its quality. A 2.5-Å resolution crystal structure was reported recently for E. coli YfbU (4LR3) that has a working R-factor of 21.0% and a free R-factor of 24.7% [10]. The free and working R-factors of the YfbU structure vary as a function of resolution the same way they do for most macromolecular crystal structures: they are large at both high and low resolution, and small in the middle (Fig. 2). Thus one can improve both the overall free and working R-factors of this structure simply by discarding the part of the data that is poorly explained by the model (Table I). As the resolution of the data used to compute Rfree is reduced from 2.50 Å to 2.64, 2.79, 3.10, and finally 3.18 Å, Rfree falls from 24.70% to 23.46%, 22.33%, 20.57%, and finally 20.15%, while the <I/σI> values in the highest resolution shell increase from 0.46 to 1.00, 2.00, 3.00, and finally 4.00. These statistics show that it is unrealistic to compare the working and free R-factors of the 2.5-Å resolution version of the YfbU structure with those of other structures in the PDB having similar nominal resolutions but much higher values of <I/σI> in the highest resolution shells.

Figure 2. Crystallographic R-factor statistics as a function of reciprocal resolution (Å⁻¹). (a) Working (black filled spheres) and free (red open circles) R-factors of the YfbU structure. (b) For other rerefined structures, RtcB (black lines), catalase in C2 (red), catalase in P2₁ (blue), and PSII (green) structures.

The <I/σI> value in the highest resolution shell for many structures in the PDB is high

In connection with another study [7], structure factors (or intensities) were retrieved for all of the P2₁2₁2₁ entries in the PDB in April 2014, and are also used here. Of all the space groups in which macromolecules crystallize, the most common is P2₁2₁2₁, which includes 23.0% of the entries in the PDB, followed by P2₁ (15.4%), and C2 (9.6%) (Supporting Information Tables S1–S3). Thus, the statistical properties of these P2₁2₁2₁ data sets should be representative of the entire PDB [11]. Using the program SCALEPACK [12], the structure factors (or intensities) of 11,265 P2₁2₁2₁ entries in the PDB were binned into 20 resolution shells each containing approximately equal numbers of reflections, and <I/σI> values were computed for each shell. The value reported for <I/σI> in the highest resolution shell was greater than 30 for 37 of these sets. Because these values were so high, these data sets were excluded from the analysis (Fig. 3, Table II, Supporting Information Tables S4, S5). Even so, the mean value for <I/σI> in the highest resolution shell for the remaining data sets is 3.88 ± 2.75 (Table III). Surprisingly, 3.8% (432) of the data sets have been truncated at <I/σI> = 10.0 (Table II). Another 20.4% (2,298) of the P2₁2₁2₁ entries have excluded all the data with <I/σI> less than 5.0.
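The shell statistics described above were obtained with SCALEPACK. As a rough illustration of the binning step only, the following sketch splits a reflection list into shells of approximately equal size and reports the mean I/σI of each. It is not SCALEPACK's algorithm: the function name `shell_i_over_sigma` and the input layout are hypothetical, and averaging the per-reflection ratio I/σI (rather than, say, dividing the mean intensity by the mean σ) is an assumption.

```python
from statistics import mean


def shell_i_over_sigma(reflections, n_shells=20):
    """Return [(d_min, mean I/sigma), ...] per shell, low to high resolution.

    `reflections` is an iterable of (d_spacing, intensity, sigma) tuples
    with sigma > 0. Shells hold approximately equal numbers of reflections.
    Assumption: <I/sigma> is the mean of the per-reflection ratios.
    """
    # Sort from low resolution (large d) to high resolution (small d).
    ordered = sorted(reflections, key=lambda r: r[0], reverse=True)
    n = len(ordered)
    if n < n_shells:
        raise ValueError("need at least one reflection per shell")
    shells = []
    for k in range(n_shells):
        # Integer arithmetic spreads any remainder evenly across the shells.
        chunk = ordered[k * n // n_shells:(k + 1) * n // n_shells]
        d_min = chunk[-1][0]  # high-resolution edge of this shell
        shells.append((d_min, mean(i / s for _, i, s in chunk)))
    return shells


# Synthetic data: 40 reflections whose I/sigma falls off with resolution.
demo = [(4.0 - 0.05 * j, 40.0 - j, 1.0) for j in range(40)]
shells = shell_i_over_sigma(demo, n_shells=20)
print(shells[-1])  # statistics for the highest-resolution shell
```

The last element of the returned list corresponds to the highest resolution shell, which is the quantity tabulated for the P2₁2₁2₁ entries.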
For many of these entries, an analysis shows that Rwork and Rfree in the highest resolution shells are often smaller (or not much higher) than their overall values [7], which suggests that the WIHR data were truncated during structure refinement, as in the YfbU example discussed above. Given that the crystals of most macromolecules diffract weakly, if one were to use <I/σI> = 10.0 as the resolution cut-off criterion, the amount of data discarded from most data sets would often be far greater than the amount of data used for structure determination. Discussions with the authors of a few of the structures that have very high values reported for <I/σI> in the highest resolution shell indicated that from their point of view, resolution was not an issue. It did not matter whether the resolution of a structure was 1.5 Å or 2.5 Å, as long as the structure enabled well-founded follow-up biochemical experiments.

Table I. Statistical Gaming of Rfree Values by Data Truncation for the YfbU Structure

| Resolution range (Å) | Number of reflections | <I/σI> (a) | Rwork (%) | Rfree (%) | Reduction in RO2A (b) | Reduction in Rwork | Reduction in Rfree |
|---|---|---|---|---|---|---|---|
| 56–2.50 | 131,307 | 0.46 | 20.79 | 24.70 | 1.00 | 1.00 | 1.00 |
| 56–2.64 | 112,400 | 1.00 | 18.49 | 23.46 | 0.86 | 0.89 | 0.95 |
| 56–2.79 | 95,391 | 2.00 | 17.25 | 22.33 | 0.73 | 0.83 | 0.90 |
| 56–3.10 | 69,662 | 3.00 | 15.54 | 20.57 | 0.53 | 0.75 | 0.83 |
| 56–3.18 | 64,565 | 4.00 | 15.14 | 20.15 | 0.49 | 0.73 | 0.82 |

(a) The mean <I/σI> value in the corresponding highest resolution shell.
(b) The total number of atoms is 23,395 and the total number of parameters is 93,580, which results in an RO2A of 5.61 at 2.50-Å resolution, the reference resolution for the reductions in R-factors.

Figure 3. The distribution of <I/σI> values for the highest resolution shells as a function of reciprocal resolution (Å⁻¹) (a) and a histogram as a function of the I/σI ratio (b) for all P2₁2₁2₁ entries. The shell-averaged <I/σI> for all P2₁2₁2₁ entries is shown as a red line.

Table II. Intensity Distribution of <I/σI> in the Highest Resolution Shells for all P2₁2₁2₁ Entries in the PDB (a)

| <I/σI> condition | Number (percentage) |
|---|---|
| <2.0 | 1550 (13.8%) |
| 2.0 < <I/σI> < 5.0 | 7417 (65.8%) |
| >5.0 | 2298 (20.4%) |
| >10.0 | 432 (3.83%) |
| >20.0 | 99 (0.88%) |
| >30.0 | 47 (0.42%) |
| >40.0 | 35 (0.31%) |
| >50.0 | 32 (0.22%) |
| >100.0 | 14 (0.12%) |

(a) An initial analysis included all the 14,376 P2₁2₁2₁ entries, and the final analysis included only 11,265 entries after some entries containing questionable Friedel pair columns with the pdbx prefix were excluded following discussions with Dr. S. Burley and colleagues at the PDB.

The Rfree-Rwork differences of the structures in the PDB are useful measures of refinement quality

Both Rwork and Rfree values can be adjusted to some degree by manipulating the resolution range of the data for structure refinement. Rfree values are likely to be more sensitive to the details of the way data are treated than Rwork values because Rfree is usually calculated from only 5% of the data. For the following reasons, it is harder to manipulate the difference between Rfree and Rwork. First, this difference should always be positive because if the data have been processed properly, Rfree must always be greater than Rwork. In addition, Rfree should gradually approach Rwork during refinement. Also, once the refinement has converged, the difference between them should vary in a predictable way as a function of both resolution and RO2A. As Figure 4 shows, the high-resolution limit of the value of that difference appears to be about 0.020 for all the P2₁2₁2₁ entries in the PDB. Even though it is essentially impossible for Rfree to be less than Rwork, 24 of the P2₁2₁2₁ structures in the PDB have differences that are zero or negative (Supporting Information Table S6).
Another 57 entries are characterized by differences less than 0.005, and the total number of entries having differences less than the asymptotic limit is 1010 (6.4%) (Supporting Information Table S6). Finally, the average R-factor difference for all structures at resolutions lower than 2.85 Å is much smaller than the value predicted by the trends using all the P2₁2₁2₁ structures in the PDB (Fig. 4). Explanations for most of these anomalies remain unknown. Nevertheless, the fact that so many structures can be refined in ways that result in very small differences between the two R-factors does call into question the wisdom of using Rfree to judge structure quality.

Table III. Distribution of the Mean <I/σI> Values in the Highest Resolution Shell for all P2₁2₁2₁ Entries in the PDB

| <I/σI> condition | Mean value |
|---|---|
| <5 | 2.96 ± 0.99 |
| <10 | 3.59 ± 1.72 |
| <20 | 3.88 ± 2.42 |
| <30 | 3.98 ± 2.78 |
| <40 | 4.01 ± 2.94 |

Figure 4. The distribution of differences between Rfree and Rwork as a function of reciprocal resolution (Å⁻¹) (a) and of RO2A (b) for all P2₁2₁2₁ entries. Shell-averaged ΔR-factors are shown as red lines, zero lines in blue, and the high-resolution (high RO2A) asymptotic line in green. Data outside the boxes that were included in this analysis are not shown.

The importance of the observation-to-atom ratios for structures in the PDB

The most obvious shortcoming of R-factors as measures of structure quality is that they do not reflect the capacity of the X-ray data to determine the parameters of the structure. A simple metric that could be used to provide this information is the number of (independent) reflections used to refine a structure, divided by the number of non-hydrogen atoms in the asymmetric unit. This ratio is designated here as RO2A. It is an imperfect measure of the quality of the structural model because the number of adjustable parameters in the model depends in part on the method used for its refinement, and on the way B-factors are treated.
It is also the case that the number of parameters per nonhydrogen atom needed to model solvent molecules is not the same as the number per atom for the macromolecule itself. Nevertheless, RO2A will increase as the cube of the reciprocal of the resolution (Fig. 5), as it should, and it automatically takes into account solvent-content variations in crystals. In an analysis done in April 2014, the average solvent content for all entries in the PDB was 50.2 ± 8.2%, and 48.2 ± 8.5% for the P2₁2₁2₁ entries (Supporting Information Tables S1–S3). However, there is a considerable variation from one crystal to the next. For example, 95% of the volume of the unit cell in the crystals used to solve the 2YQ3 structure is occupied by solvent [13]. Thus, this structure would have an RO2A 10-fold higher than that of a structure solved at the same resolution using crystals that have a solvent content of 50%.

When RO2A is plotted against reciprocal resolution cubed for all the P2₁2₁2₁ entries in the PDB, the resulting distribution can be fitted to a line with an overall correlation coefficient of 0.898 (Fig. 5). However, the line does not pass through the origin of the plot, as one would anticipate. This failure may be caused by a tendency of increases in solvent content to correlate with decreases in the resolution of macromolecular crystal structures. The average solvent content is about 50% for all the structures in the PDB, but it is 70% for all structures with resolutions lower than 5.0 Å (Supporting Information Table S1). Under-representation of low-resolution crystal structures in the set of structures considered may also contribute, as may systematic differences in the way WIHR data are treated, with more WIHR data being used for low-resolution structure determinations.

Figure 5. The distribution of RO2A values for all P2₁2₁2₁ entries. (a) As a function of reciprocal resolution cubed (Å⁻³). (b) As a function of reciprocal resolution (Å⁻¹) for the fitted model from (a). Intercepts at RO2A of 4 (green) and 9 (blue) are also shown.

The regression line in Figure 5 indicates that on average, RO2A reaches 4 at 3.11 Å and 9 at 1.94 Å, which implies that unconstrained refinement of atomic positions and isotropic B-factors is likely to fail for macromolecular structures solved at resolutions of 3.11 Å or lower, and that unconstrained refinement of atomic positions and individual anisotropic B-factor parameters cannot be expected to work well unless the resolution of the data exceeds 1.94 Å. These estimates agree well with conventional wisdom in the macromolecular crystallographic community (e.g., Ref. [14]).

A proposal for some new measures of structure quality

The goal of all structure refinements is to arrive at the physically plausible model for the molecule of concern that best explains all the observations available. For this reason alone, all the WIHR data available ought to be included in the structure refinement. It is also proposed that structure quality be assessed using a global goodness-of-fit (GGOF) statistic, the simplest of which, GGOF1, is defined as follows.

$$\mathrm{GGOF1} = \frac{N_{obs}/N_{atom}}{\sum_{work}\left|F_{obs}-F_{calc}\right| \,/\, \sum_{work} F_{obs}} = \frac{R_{O2A}}{R_{work}}$$

Experience shows that structures having a GGOF1 > 100 should be considered “high quality” (Table IV), which implies that a structure with an RO2A of 15 would have to have an Rwork below 15% to be considered high quality. If the solvent content of a structure determined at a resolution of about 5 Å were 70%, which is unusually high (Supporting Information Table S1), the RO2A value would be about 3.5. In order for such a structure to be considered high quality, its Rwork would have to be less than 3.5%, which has never been achieved for a 5.0-Å resolution structure.
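The two thresholds quoted above follow directly from the definition GGOF1 = RO2A/Rwork. A minimal sketch of that arithmetic is given below; the function name `max_rwork_for_ggof1` is hypothetical and the threshold of 100 is the “high quality” value suggested in the text.

```python
def max_rwork_for_ggof1(ro2a: float, threshold: float = 100.0) -> float:
    """Rwork (as a fraction) at which GGOF1 = RO2A / Rwork equals `threshold`.

    A structure needs an Rwork below this value for its GGOF1 to exceed
    the threshold.
    """
    if ro2a <= 0 or threshold <= 0:
        raise ValueError("RO2A and threshold must be positive")
    return ro2a / threshold


for ro2a in (15.0, 3.5):
    limit = max_rwork_for_ggof1(ro2a)
    print(f"RO2A = {ro2a:4.1f}: Rwork must be below {limit:.1%}")
```

For RO2A values of 15 and 3.5 the limits are 15% and 3.5%, matching the figures given in the text.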
The results obtained with the YfbU structure by trimming the WIHR data used to refine it suggest that Rfree values fall much more slowly than Rwork values, and that the decrease of Rwork appears to be proportional to RO2A, while that of Rfree appears to be proportional to √RO2A (Supporting Information Table S1), assuming that the structure refinement has fully converged. These observations suggest that a second GGOF metric might be considered, GGOF2:

$$\mathrm{GGOF2} = \frac{\sqrt{R_{O2A}}}{R_{free}}$$

When dealing with the WIHR data, it is important to include the R-factors weighted by the measurement errors, which many refinement programs report but which are not quoted in many publications. These weighted R-factors can be used for calculations of GGOF metrics as well.

Table IV. Application of the GGOF Structural Quality Metrics (a)

| PDB | Reso (Å) | RO2A | Rwork | Rfree | GGOF1 | GGOF2 | Rsigma |
|---|---|---|---|---|---|---|---|
| 4F1U | 0.98 | 44.7 | 0.088 | 0.096 | 508.0 | 69.6 | 0.041 |
| 4F1V | 0.88 | 61.3 | 0.125 | 0.140 | 490.4 | 55.9 | 0.059 |
| 4F1V/Trim | 0.98 | 38.2 | 0.107 | 0.123 | 357.0 | 50.3 | 0.045 |
| 4F18 | 0.96 | 47.1 | 0.095 | 0.110 | 495.8 | 62.4 | 0.053 |
| 4F19 | 0.95 | 52.0 | 0.096 | 0.111 | 541.4 | 64.9 | 0.064 |
| 2OL9 | 0.85 | 47.9 | 0.073 | 0.078 | 655.6 | 69.2 | 0.042 |
| 4AYO | 0.85 | 67.3 | 0.095 | 0.105 | 708.4 | 78.1 | 0.056 |
| 4AYP | 0.85 | 72.0 | 0.097 | 0.106 | 742.3 | 80.0 | 0.059 |
| 4AYQ | 1.10 | 33.2 | 0.088 | 0.105 | 377.3 | 54.9 | 0.053 |
| 4AYR | 1.10 | 30.6 | 0.084 | 0.102 | 364.3 | 54.2 | 0.031 |
| 4GHO | 1.10 | 44.3 | 0.097 | 0.117 | 605.2 | 56.9 | 0.092 |
| 4LTG | 1.18 | 28.6 | 0.089 | 0.113 | 321.3 | 47.3 | 0.042 |
| 4MJ9 | 0.97 | 53.0 | 0.086 | 0.096 | 616.2 | 75.8 | 0.041 |
| 4LR3 | 2.50 | 5.6 | 0.208 | 0.247 | 27.5 | 9.8 | 0.138 |
| 4LR3/Trim | 2.64 | 4.8 | 0.185 | 0.235 | 26.0 | 9.3 | 0.122 |
| 4LR3/Trim | 2.79 | 4.1 | 0.172 | 0.223 | 23.7 | 9.1 | 0.107 |
| 4LR3/Trim | 3.10 | 3.0 | 0.155 | 0.206 | 19.2 | 8.4 | 0.084 |
| 4LR3/Trim | 3.18 | 2.8 | 0.151 | 0.202 | 18.3 | 8.2 | 0.080 |
| 1DER | 2.40 | 5.8 | 0.247 | 0.283 | 23.4 | 8.5 | 0.121 |
| 1KP8 | 2.00 | 9.5 | 0.243 | 0.258 | 39.0 | 11.9 | 0.164 |
| RtcB/Mn (b) | 1.48 | 19.1 | 0.098 | 0.144 | 194.9 | 30.3 | 0.102 |
| C2 catalase (b) | 1.53 | 9.9 | 0.075 | 0.134 | 131.7 | 23.5 | 0.084 |
| 3P9Q | 1.48 | 16.0 | 0.143 | 0.177 | 111.9 | 22.6 | 0.123 |
| 3P9Q/New (b) | 1.48 | 13.7 | 0.096 | 0.143 | 142.7 | 25.9 | 0.123 |
| 3ARC | 1.90 | 10.8 | 0.177 | 0.204 | 61.4 | 16.2 | 0.047 |
| 3ARC/New (b) | 1.90 | 10.3 | 0.114 | 0.162 | 90.1 | 19.8 | 0.047 |

(a) The first group of PDB entries consists of structures with a reported Rwork of less than 10%, with the exception of 4F1V, which is closely related to 4F1U. The second group of PDB entries consists of structures discussed in this manuscript, including some unpublished rerefined structures. The trimming of the WIHR data is also included in this table. See text for definitions of the parameters used in this table. Rsigma is 1/<I/σI> for all the data.
(b) These new structures included rerefinement of structures published from the author’s laboratory as well as from some other laboratories.

Applications of GGOF metrics

Table IV provides the GGOF metrics for a handful of structures in the PDB. It should be noted that GGOF1 for 1KP8, the higher resolution structure of the two GroEL structures mentioned earlier, is much better than that of 1DER, its lower resolution mate, as it should be, and that 1KP8 is also superior to 1DER based on their GGOF2 statistics. GGOF metrics can also be used to decide between two rerefined versions of 1EGW [15]. In the first, two identical copies of a DNA duplex are bound in two different orientations (i.e., two alternative conformations for the entire DNA duplex) to the two monomers of that homodimeric protein. In the second, one asymmetric DNA duplex is bound to both monomers, plus three nucleotides per strand (six nucleotides in total) in two alternative conformations that differ in orientation between the two monomers [7]. The first model has a working R-factor of 14.2%, a free R-factor of 18.2%, and 3,165 atoms, so that RO2A is 9.24. The second model has a working R-factor of 15.3%, a free R-factor of 19.2%, and 2,591 atoms (a smaller number than the first structure), so that RO2A is 11.28.
Thus, the first model has a GGOF1 of 65.1 and a GGOF2 of 16.7, whereas the second model has a GGOF1 of 73.7 and a GGOF2 of 17.5. Both criteria suggest that the second model with fewer atoms is better than the first. The 4HYO structure [6], which was originally determined in P1, provides another instructive example. When it is refined in the space group appropriate for the crystals it forms, P42₁2, its GGOF metric is 30% better than it is when it is rerefined in P1 under identical conditions [6,7].

To give the reader some idea of the kinds of GGOF values that should be aspired to, Table IV includes data on some structures of exceptional quality, which were chosen from among the 67 single-crystal structures in the PDB that have working R-factors less than 10% at resolutions in the atomic to sub-atomic range [16–21]. These values range from 320 for 4LTG, which was solved at a resolution of 1.18 Å, to 740 for 4AYP, which is a 0.85-Å resolution structure (Table IV). The structure described by 4F1V, which was determined at 0.88-Å resolution with a working R-factor of 12.5%, is included in this high-resolution group because it is instructive to compare its statistics with those of a closely related entry, 4F1U, which was solved at 0.98-Å resolution with a working R-factor of 9.6% [16]. For these structures, the number of measured observations at a resolution of 0.98 Å is 176,838, but at a resolution of 0.88 Å the number is 248,683, and the corresponding RO2A values are 44.7 and 61.3. Omission of the WIHR data from the data used for refining the 4F1V structure consistently resulted in poorer GGOF metrics, even though it led to improved apparent working and free R-factor values. Thus, if the GGOF metrics proposed here were used to assess quality, it would be apparent that there is no justification for omission of any of the WIHR data available for the 4F1V structure, no matter what the effect on R-factors (Table IV).
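The GGOF values quoted for the two 1EGW models can be reproduced from the numbers given in the text. The sketch below is illustrative only: the function names `ggof1` and `ggof2` are hypothetical, and the R-factors are the unweighted values quoted above, entered as fractions.

```python
from math import sqrt


def ggof1(ro2a: float, r_work: float) -> float:
    """GGOF1 = RO2A / Rwork, with Rwork given as a fraction."""
    return ro2a / r_work


def ggof2(ro2a: float, r_free: float) -> float:
    """GGOF2 = sqrt(RO2A) / Rfree, with Rfree given as a fraction."""
    return sqrt(ro2a) / r_free


# (RO2A, Rwork, Rfree) for the two rerefined 1EGW models, as quoted in the text.
models = {
    "1EGW, two DNA orientations": (9.24, 0.142, 0.182),
    "1EGW, one asymmetric duplex": (11.28, 0.153, 0.192),
}

for name, (ro2a, r_work, r_free) in models.items():
    print(f"{name}: GGOF1 = {ggof1(ro2a, r_work):.1f}, "
          f"GGOF2 = {ggof2(ro2a, r_free):.1f}")
```

The second model scores higher on both metrics even though both of its R-factors are worse, because its smaller atom count raises RO2A by more than enough to compensate.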
Discussion

Consequences of excluding poorly measured observations in data processing

It is surprising that one in every five structures in the PDB has an <I/σI> value of 5.0 or higher in the highest resolution shell, and that this value is 10.0 or higher for one in every thirty structures in the PDB. There are at least two ways data can be processed so that such high <I/σI> values are obtained in the highest resolution shells: (i) exclusion of all the poorly measured weak-intensity reflections whatever the resolution, and/or (ii) exclusion of all the WIHR data. Both will improve the statistics of the processed data, but neither is good practice.

As the HKL Users Manual explains, individual observations of specific reflections from the data sets should not be excluded on the grounds that they have been poorly measured [22]. Some reflections in any data set obtained from a macromolecular crystal are bound to be weak, and when a weak reflection crosses the Ewald sphere, the value recorded for its intensity may be negative as a consequence of counting statistics. Those apparently nonphysical values for intensities must be averaged with all the other observations made of the intensities of the corresponding reflections. If not, the intensity estimates that emerge for those reflections when observations are averaged will be larger than they should be. This practice could give the impression that intensities of reflections that are zero because they are systematically absent due to crystal symmetry are non-zero, and thus lead investigators to incorrect conclusions about crystal symmetry [7]. Questions related to negative intensity measurements are best addressed using techniques that rely on Bayesian statistics [23]. These measurements should not be omitted.

During structure refinement, the knowledge that a particular reflection is weak is as important as the knowledge that some other reflection is strong.
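The upward bias that results from discarding negative intensity measurements can be seen in a toy calculation. The sketch below uses invented numbers for a single very weak reflection and a plain unweighted mean; real merging programs weight observations by their estimated errors, so this is only an illustration of the direction of the effect. The function name `merge_observations` is hypothetical.

```python
from statistics import mean


def merge_observations(observations, drop_negative=False):
    """Unweighted mean of repeated intensity measurements of one reflection.

    Assumption: a plain average stands in for the error-weighted merging
    that data-processing programs actually perform.
    """
    kept = [x for x in observations if x >= 0] if drop_negative else list(observations)
    if not kept:
        raise ValueError("no observations left to merge")
    return mean(kept)


# Hypothetical repeated measurements of one very weak reflection; counting
# statistics make some of the recorded intensities negative.
weak_reflection = [-2.0, 1.0, -0.5, 2.0, -1.0, 2.0]

print(merge_observations(weak_reflection))                      # 0.25
print(merge_observations(weak_reflection, drop_negative=True))  # about 1.67
```

With all six measurements the merged intensity is close to zero, whereas keeping only the positive ones inflates it several-fold, which is the effect described above for systematically absent reflections.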
What counts are the differences between predicted and measured amplitudes, not the absolute values of measured amplitudes as such. Thus, if weak data are omitted during structure refinement, and the model that emerges fails to predict that these data should be weak, the model cannot be accurate, nor can the phases calculated using that model be relied upon. Thus, an important distinction exists between setting the intensities of weak reflections to zero and omitting them altogether during refinement. If the intensity of a weak reflection is not taken into account during refinement, its value will not constrain the model being refined. By contrast, if it is taken into account but its amplitude is set to zero, models will be favored that predict that its amplitude should be weak, which is as it should be.

A proposed metric to measure crystallographic quality

The major problem with the metrics commonly used for assessing crystallographic quality today is that they place too little emphasis on resolution and too much emphasis on the correspondence between the model and the data. The primary virtue of the GGOF parameters proposed here as metrics of crystallographic quality is that they will “reward” use of all the data available.

If GGOF parameters of the sort being advocated here become widely adopted, it will become even more important than it is today for the community to arrive at an agreement as to how “resolution” is defined. In earlier times, when data collection was much more difficult than it is today, the resolution of a crystal structure was taken to be the Bragg spacing at which <I/σI> falls to 2.0. Today, as pointed out above, “resolution” turns out to be the Bragg spacing of the highest resolution shell of data used to refine the structure, no matter what the cutoff value of <I/σI> may be.
Given the quality of the data sets being collected today, it would be reasonable to define the resolution of a data set, and hence that of the structure obtained from it, as the resolution at which <I/σI> falls to 1.0. By itself, this change in the operational definition of resolution might encourage the crystallographic community to be more aggressive in its use of high-resolution data than it has been in the recent past.

The service the PDB provides as a repository of both structures and data cannot be overstated. As most crystallographers know, it can be remarkably hard to locate and retrieve data sets that are more than a few years old from one’s own computer system. Obsolescence of the media on which the data are stored can become a problem for data sets that are more than a few years old, and but for the PDB, closure of the laboratory that collected them would be the equivalent of a death sentence. Thus, it is important that the authors responsible for the 20% of the structures in the PDB for which there are no data deposited beyond <I/σI> = 5 reprocess the data they have to the highest resolution possible before they are lost forever. It is less urgent, but also important, that the authors of the 86% of all the structures in the PDB that lack data beyond <I/σI> = 2.0 do what they can to extend the resolution of their data sets (Table II). The availability of the WIHR data for these structures will make it possible for anyone who becomes interested in the future to rerefine them using the programs available then, which are all but certain to be even better than those available today.
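The proposed operational definition of resolution can be applied mechanically once shell statistics are available. The sketch below is one possible reading of it, not a prescribed procedure: it walks through the shells from low to high resolution and reports the last shell whose <I/σI> is still at or above the cutoff. The function name `resolution_limit` is hypothetical, and the example values are the YfbU shell statistics from Table I.

```python
def resolution_limit(shells, cutoff=1.0):
    """Return the d-spacing (Å) of the last shell, going from low to high
    resolution, whose mean I/sigma is still at or above `cutoff`.

    `shells` is an iterable of (d_min, mean_i_over_sigma) pairs.
    Returns None if even the lowest-resolution shell is below the cutoff.
    """
    limit = None
    # Sorting by d-spacing in descending order runs from low to high resolution.
    for d_min, i_over_sigma in sorted(shells, reverse=True):
        if i_over_sigma < cutoff:
            break  # stop at the first shell that falls below the cutoff
        limit = d_min
    return limit


# YfbU (4LR3): high-resolution limit and <I/sigma> of the outermost shell,
# taken from Table I.
yfbu = [(3.18, 4.00), (3.10, 3.00), (2.79, 2.00), (2.64, 1.00), (2.50, 0.46)]

print(resolution_limit(yfbu))              # 2.64
print(resolution_limit(yfbu, cutoff=2.0))  # 2.79
```

Under the proposed definition the YfbU data set would be described as a 2.64-Å data set, and as a 2.79-Å data set under the older <I/σI> = 2.0 convention. Stopping at the first failing shell, rather than taking the smallest passing d-spacing, is a design choice that avoids crediting an isolated strong shell beyond a weak one.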
Materials and Methods

All structure factors were retrieved in mid-April 2014 from the PDB and analyzed as described elsewhere.7 The program SCALEPACK was used to analyze the <I/σI> distribution in the highest resolution shell of 20 shells, each containing approximately the same number of expected observations.12

Acknowledgments

The author acknowledges Professors Peter Moore and Brian Matthews for extensively editing this manuscript. The 2YQ3 entry cited in this paper had the highest reported solvent content among all the PDB entries retrieved and analyzed in April 2014; this was no longer the case after November 26, 2014, when the solvent content for the 2YQ3 entry was revised.

References

1. Brown EN, Ramaswamy S (2007) Quality of protein crystal structures. Acta Cryst D63:941–950.
2. Diederichs K, Karplus PA (2013) Better models by discarding data? Acta Cryst D69:1215–1222.
3. Karplus PA, Diederichs K (2012) Linking crystallographic model and data quality. Science 336:1030–1033.
4. Brunger AT (1992) Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature 355:472–475.
5. Ye S, Li Y, Jiang Y (2010) Novel insights into K+ selectivity from high-resolution structures of an open K+ channel pore. Nat Struct Mol Biol 17:1019–1023.
6. Posson DJ, McCoy JG, Nimigean CM (2013) The voltage-dependent gate in MthK potassium channels is located at the selectivity filter. Nat Struct Mol Biol 20:159–166.
7. Wang J (2015) On the validation of crystallographic symmetry and the quality of structures. Protein Sci 24:621–632.
8. Boisvert DC, Wang J, Otwinowski Z, Horwich AL, Sigler PB (1996) The 2.4 Å crystal structure of the bacterial chaperonin GroEL complexed with ATP gamma S. Nat Struct Biol 3:170–177.
9. Wang J, Boisvert DC (2003) Structural basis for GroEL-assisted protein folding from the crystal structure of (GroEL-KMgATP)14 at 2.0 Å resolution. J Mol Biol 327:843–855.
10. Wang J, Wing R (2014) Diamonds in the rough: a strong case for the inclusion of weak-intensity X-ray diffraction data. Acta Cryst D70:1491–1497.
11. Berman HM, Bhat TN, Bourne PE, Feng Z, Gilliland G, Weissig H, Westbrook J (2000) The Protein Data Bank and the challenge of structural genomics. Nat Struct Biol 7 Suppl:957–959.
12. Otwinowski Z, Minor W (1997) Processing of X-ray diffraction data collected in oscillation mode. Macromol Cryst A 276:307–326.
13. El Omari K, Iourin O, Harlos K, Grimes JM, Stuart DI (2013) Structure of a pestivirus envelope glycoprotein E2 clarifies its role in cell entry. Cell Rep 3:30–35.
14. Moore PB (2012) Visualizing the invisible: imaging techniques for the structural biologist. New York: Oxford University Press.
15. Santelli E, Richmond TJ (2000) Crystal structure of MEF2A core bound to DNA at 1.5 Å resolution. J Mol Biol 297:437–449.
16. Elias M, Wellner A, Goldin-Azulay K, Chabriere E, Vorholt JA, Erb TJ, Tawfik DS (2012) The molecular basis of phosphate discrimination in arsenate-rich environments. Nature 491:134–137.
17. Sawaya MR, Sambashivan S, Nelson R, Ivanova MI, Sievers SA, Apostol MI, Thompson MJ, Balbirnie M, Wiltzius JJ, McFarlane HT, Madsen AO, Riekel C, Eisenberg D (2007) Atomic structures of amyloid cross-beta spines reveal varied steric zippers. Nature 447:453–457.
18. Thompson AJ, Dabin J, Iglesias-Fernandez J, Ardevol A, Dinev Z, Williams SJ, Bande O, Siriwardena A, Moreland C, Hu TC, Smith DK, Gilbert HJ, Rovira C, Davies GJ (2012) The reaction coordinate of a bacterial GH47 alpha-mannosidase: a combined quantum mechanical and structural approach. Angew Chem Int Ed Engl 51:10997–11001.
19. Pace CN, Fu H, Lee Fryar K, Landua J, Trevino SR, Schell D, Thurlkill RL, Imura S, Scholtz JM, Gajiwala K, Sevcik J, Urbanikova L, Myers JK, Takano K, Hebert EJ, Shirley BA, Grimsley GR (2014) Contribution of hydrogen bonds to protein stability. Protein Sci 23:652–661.
20. Hall JP, Sanches-Weatherby J, O'Sullivan K, Kelly JM, Cardin CJ (To be published) Dehydration/rehydration of a nucleic acid system containing a polypyridyl ruthenium complex at 74% relative humidity.
21. Hall JP, Beer H, Buchner K, Cardin DJ, Cardin CJ (To be published) Lambda-[Ru(TAP)2(dppz-10-Me)]2+ bound to a synthetic DNA oligomer.
22. Gerwirth D (1995) The fourth edition of the HKL manual. New Haven, CT: Yale University Press.
23. French S, Wilson K (1978) Treatment of negative intensity observations. Acta Cryst A34:517–525.

PROTEIN SCIENCE VOL 24:661–669