shelxc/d/e

The SHELX approach to the
experimental phasing of macromolecules
IUCr 2011 Madrid
George M. Sheldrick, Göttingen University
http://shelx.uni-ac.gwdg.de/SHELX/
Experimental phasing of macromolecules
Except in relatively rare cases where atomic resolution data permit the
phase problem to be solved by ab initio direct methods, experimental
phasing usually implies the presence of ‘heavy’ atoms. In order to
locate the heavy atoms, we need their structure factors FA.
The phases φA calculated for the heavy atom substructure enable us to
estimate starting phases φT for the full macromolecular structure by:
φT = φA + α
After refining these phases we can use them and the native structure
factors FT to calculate an electron density map of the macromolecule.
As we will see, α, FA and FT can all be estimated from the experimental
data.
Data files used for shelxc/d/e
shelxc reads the experimental data and writes three files in addition to
some useful statistics that are output to the console (and can be
displayed graphically by Thomas Schneider’s hkl2map). These files are:
name.hkl – reflection indices and merged native intensities, used for
density modification with shelxe and possibly later refinement with
shelxl.
name_fa.ins – instruction file for running shelxd.
name_fa.hkl – reflection indices, FA and α.
The analysis of MAD data
Karle (1980) and Hendrickson, Smith & Sheriff (1985) showed by
algebra that the measured intensities in a MAD experiment, assuming
only one type of heavy atom, should be given by:
|F±|² = |FT|² + a|FA|² + b|FT||FA|cos α ± c|FT||FA|sin α
where a = (f″² + f′²)/f0², b = 2f′/f0, c = 2f″/f0 and α = φT − φA. We need to
know f′ and f″ for each wavelength.
In a 2-wavelength MAD experiment, we have 4 equations for the 3
unknowns FA, FT and α, so with error-free data we should get a
perfect map!
For SIRAS, we have 2 equations for the derivative plus a native
dataset. Given perfect isomorphism, the phase problem is solved.
Unfortunately for SAD we only have 2 equations for the 3 unknowns.
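For illustration only, a minimal Python sketch (hypothetical numbers, not part of SHELXC) of the Karle/Hendrickson forward model: for assumed |FT|, |FA|, α and tabulated f′, f″ at each wavelength it returns |F+|² and |F−|², so a 2-wavelength experiment indeed supplies 4 observations for the 3 unknowns.

# Illustrative sketch of the Karle/Hendrickson MAD equations (not SHELXC code).
import math

def mad_intensities(FT, FA, alpha, f0, f_prime, f_dprime):
    """Return (|F+|^2, |F-|^2) for one wavelength; alpha = phiT - phiA."""
    a = (f_dprime**2 + f_prime**2) / f0**2
    b = 2.0 * f_prime / f0
    c = 2.0 * f_dprime / f0
    base = FT**2 + a * FA**2 + b * FT * FA * math.cos(alpha)
    anom = c * FT * FA * math.sin(alpha)
    return base + anom, base - anom

# Two hypothetical wavelengths -> four observations for the three unknowns.
for f_prime, f_dprime in [(-8.0, 4.0), (-6.0, 6.0)]:      # made-up f', f'' values
    print(mad_intensities(FT=100.0, FA=10.0, alpha=1.0,
                          f0=34.0, f_prime=f_prime, f_dprime=f_dprime))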
SAD phasing
For a single wavelength SAD experiment, we only have equations for
|F+|² and |F−|². Subtracting the second from the first, we obtain:
|F+|² − |F−|² = 2c |FA| |FT| sin α
For a small anomalous difference, we can assume:
|FT| ≈ ½ (|F+| + |F−|)
So using |F+|² − |F−|² = (|F+| + |F−|)(|F+| − |F−|), we obtain:
|F+| − |F−| = c |FA| sin α
where c = 2f″/f0. c could be calculated, but we would still require |FA|,
not |FA| sin α, to find the heavy atoms. Lacking a better estimate, we
have to input ||F+| − |F−|| (generated by SHELXC) into SHELXD.
For MAD and SIRAS we can use |FA|, so there is no problem.
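A small numpy sketch (hypothetical data, not the SHELXC code) of what is fed to SHELXD in the SAD case: ||F+| − |F−|| formed from the merged Friedel-pair intensities, together with the |FT| estimate ½(|F+| + |F−|).

# Illustrative only: form ||F+| - |F-|| (the SAD estimate of c|FA|sin(alpha))
# from merged Friedel-pair intensities (roughly what SHELXC prepares for SHELXD).
import numpy as np

I_plus  = np.array([400.0, 900.0, 121.0])    # hypothetical merged I(+)
I_minus = np.array([390.0, 930.0, 100.0])    # hypothetical merged I(-)

F_plus  = np.sqrt(np.clip(I_plus,  0.0, None))
F_minus = np.sqrt(np.clip(I_minus, 0.0, None))

dF_anom = np.abs(F_plus - F_minus)           # input for the SHELXD search
F_T     = 0.5 * (F_plus + F_minus)           # |FT| ~ 1/2 (|F+| + |F-|)
print(dF_anom, F_T)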
Dual-space recycling for SHELXD substructure solution
Based on the method introduced in the SnB program by Weeks, Hauptman
et al. (1993).
[Flowchart: starting atoms consistent with the Patterson → SF calculation →
reciprocal space: refine phases (reflections with E > Emin) → FFT →
real space: select atoms → refine occupancies, save best solution so far →
many cycles.]
First the FA values are normalized to E-values. The correlation coefficient
CC between Eobs and Ecalc is used to select the best solution. CC(weak) is
also calculated for the reflections with E < Emin; these were not used for
the substructure solution.
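The cycle can be illustrated with a deliberately tiny one-dimensional toy (a sketch of the idea only; Patterson seeding, occupancy refinement and CC(weak) are omitted and this is not the SHELXD implementation): phases calculated from the current atoms are combined with the observed magnitudes, the result is Fourier transformed, and the strongest peaks become the atoms for the next cycle.

# Toy 1-D illustration of dual-space recycling (not the real SHELXD code).
import numpy as np

rng = np.random.default_rng(0)
N, n_atoms, n_cycles = 128, 5, 30
true_atoms = rng.choice(N, n_atoms, replace=False)

def magnitudes(atoms):
    rho = np.zeros(N)
    rho[atoms] = 1.0                          # point-atom density
    return np.abs(np.fft.fft(rho))            # "observed" E-like magnitudes

E_obs = magnitudes(true_atoms)
atoms = rng.choice(N, n_atoms, replace=False) # random starting atoms

for cycle in range(n_cycles):
    rho = np.zeros(N)
    rho[atoms] = 1.0
    F_calc = np.fft.fft(rho)                               # SF calculation
    phases = F_calc / np.maximum(np.abs(F_calc), 1e-9)     # reciprocal space
    rho_new = np.real(np.fft.ifft(E_obs * phases))         # FFT back to real space
    atoms = np.argsort(rho_new)[-n_atoms:]                 # select strongest atoms
    cc = np.corrcoef(E_obs, magnitudes(atoms))[0, 1]       # CC(Eobs, Ecalc)
    print(cycle, round(cc, 3))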
SAD substructure determination
To find the heavy atoms using SHELXD, originally written for ab initio
phasing using atomic resolution data, we would like to input |FA|, but we
actually have to input ||F+| − |F−|| = |c|FA|sin α|. Why does this work?
1. SHELXD first normalizes ||F+| − |F−|| to get E-values. This gets rid of
c and its resolution dependence!
2. Although all the reflections are used at the end for the occupancy
refinement and the calculation of CC values, the direct-methods
location of the heavy atoms only uses the strongest reflections.
These are likely to be the ones with sin α close to ±1. So to an
adequate approximation, ||F+| − |F−|| is equal to |FA|!
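A hypothetical numpy sketch of the normalization in point 1 (not SHELXD's exact protocol): each squared difference is divided by the mean of its resolution shell, which removes c and its resolution dependence.

# Sketch of normalization to E-like values in resolution shells (illustrative only).
import numpy as np

def normalize(dF, d_spacing, n_shells=10):
    """Divide each dF^2 by the mean dF^2 of its resolution shell."""
    s = 1.0 / d_spacing                                    # resolution ordering
    edges = np.quantile(s, np.linspace(0, 1, n_shells + 1)[1:-1])
    shells = np.digitize(s, edges)
    E2 = np.empty_like(dF)
    for shell in np.unique(shells):
        sel = shells == shell
        E2[sel] = dF[sel]**2 / np.mean(dF[sel]**2)
    return np.sqrt(E2)                                     # resolution dependence removed

# hypothetical anomalous differences and d-spacings
rng = np.random.default_rng(1)
dF = np.abs(rng.normal(size=1000))
d  = rng.uniform(1.5, 20.0, size=1000)
print(normalize(dF, d)[:5])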
SHELXD histograms and occupancies for Elastase
[Diagrams from the hkl2map GUI for S-SAD and I-SIRAS; 1 solution in
100 trials.]
Critical parameters for SHELXD
The Patterson-seeded dual-space recycling in SHELXD is effective at
finding the heavy atoms; however, attention needs to be paid to:
1. The resolution at which the data are truncated, e.g. where the
internal CC between the signed anomalous differences of two
randomly chosen reflection subsets falls below 30% (see the sketch
after this list).
2. The number of sites requested should be within about 20% of the
true value so that the occupancy refinement works well (and
reveals the true number).
3. In the case of a soak, the rejection of sites on special positions
should be switched off.
4. For S-SAD, DSUL (search for disulfides in addition to atoms) can
be very useful.
5. In difficult cases it may be necessary to run more trials (say
50000). The multiple-CPU version of SHELXD is recommended!
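A minimal sketch (hypothetical data and function names) of the half-dataset check mentioned in point 1: the signed anomalous differences of two random subsets are correlated shell by shell, and the data are truncated where the CC drops below about 30%.

# Illustrative half-dataset CC of signed anomalous differences per resolution shell.
import numpy as np

def anom_cc_by_shell(dano_half1, dano_half2, d_spacing, n_shells=8):
    """CC between the signed anomalous differences of two random
    half-datasets, computed shell by shell."""
    s = 1.0 / d_spacing
    edges = np.quantile(s, np.linspace(0, 1, n_shells + 1))
    cc = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (s >= lo) & (s <= hi)
        if sel.sum() > 2:
            cc.append(np.corrcoef(dano_half1[sel], dano_half2[sel])[0, 1])
    return cc          # truncate the data where this drops below about 0.30

# hypothetical signal that fades at high resolution, plus noise
rng = np.random.default_rng(3)
d = rng.uniform(1.8, 30.0, 5000)
signal = np.exp(-4.0 / d**2) * rng.normal(size=d.size)
cc = anom_cc_by_shell(signal + 0.5 * rng.normal(size=d.size),
                      signal + 0.5 * rng.normal(size=d.size), d)
print([round(c, 2) for c in cc])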
Tendamistat
[Plots of CC, CC(weak) and hits per 10000 tries, and of the SHELXD
occupancies of supersulfur peaks 8 and 9.]
SHELXD_MP – multi-CPU heavy atom location
The heavy atom search may require a large number of trials for weak
SAD signals or when there are a large number of heavy atoms. It is,
however, a good candidate for multi-tasking because only very limited
communication is needed between different trials.
shelxd has been parallelized using OpenMP. The scaling is quite
respectable (a speed-up of about 29 on a computer with 32 real CPUs).
In addition to reorganizing the output, an improved criterion is used for
selecting the ‘best’ solution:
CFOM = CC + CC(weak)
A beta-test is available on email request.
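The embarrassingly parallel structure can be illustrated with a toy Python analogy (shelxd_mp itself uses OpenMP; the trial function here is a placeholder): independent trials are farmed out to worker processes and the trial with the best CFOM = CC + CC(weak) is kept.

# Toy analogy of farming out independent trials (shelxd_mp itself uses OpenMP).
from multiprocessing import Pool
import random

def one_trial(seed):
    """Stand-in for one dual-space trial; returns (CFOM, seed)."""
    rng = random.Random(seed)
    cc, cc_weak = rng.uniform(5, 45), rng.uniform(0, 25)   # placeholder scores
    return cc + cc_weak, seed                              # CFOM = CC + CC(weak)

if __name__ == "__main__":
    with Pool() as pool:                     # the trials need no communication
        results = pool.map(one_trial, range(10000))
    print("best trial:", max(results))       # keep the solution with the best CFOM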
The heavy atom enantiomorph problem
The location of the heavy atoms from the |FA|-values does not define
the enantiomorph of the heavy-atom substructure; there is exactly a
50% chance of getting the enantiomorph right. When the protein
phases are calculated from the heavy atom reference phases, only one
of the two possible maps should look like a protein, which enables the
correct heavy atom enantiomorph to be chosen.
If the space group is one of an enantiomorphic pair (e.g. P41212 and
P43212) the space group must be inverted as well as the atom
coordinates.
For three of the 65 Sohnke space groups possible for chiral molecules,
the coordinates have to be inverted in a point other than the origin!
These space groups and inversion operations are:
I41 (1−x, ½−y, 1−z); I4122 (1−x, ½−y, ¼−z); F4132 (¼−x, ¼−y, ¼−z).
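A small sketch of the coordinate inversion (hypothetical helper in fractional coordinates, not part of the SHELX programs): most Sohnke space groups invert through the origin (with a space-group swap for enantiomorphic pairs), while I41, I4122 and F4132 use the special points listed above.

# Sketch of heavy-atom substructure inversion in fractional coordinates.
SPECIAL_INVERSION = {                 # the three special cases listed above
    "I41":   (1.0, 0.5, 1.0),
    "I4122": (1.0, 0.5, 0.25),
    "F4132": (0.25, 0.25, 0.25),
}

def invert_substructure(sites, space_group):
    """Return the inverted sites; all other Sohnke groups invert through the
    origin (enantiomorphic pairs such as P41212/P43212 also need the
    space-group swap)."""
    tx, ty, tz = SPECIAL_INVERSION.get(space_group, (0.0, 0.0, 0.0))
    return [((tx - x) % 1.0, (ty - y) % 1.0, (tz - z) % 1.0) for x, y, z in sites]

print(invert_substructure([(0.12, 0.30, 0.45)], "I4122"))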
Simulations with one heavy atom in P1
[Simulated maps for MAD or SIRAS, SAD and SIR phasing.]
A perfect MAD or SIRAS experiment
should give perfect phases! A
centrosymmetric array of heavy
atoms is fatal for SIR, but one heavy
atom is enough for SAD even in
space group P1, because it is easy
to remove the negative image by
setting negative density to zero!
Density modification
The heavy atoms can be used to calculate reference phases; initial
estimates of the protein phases can then be obtained by adding the
phase shifts α to the heavy atom phases.
These phases are then improved by density modification. Clearly, if we
simply do an inverse Fourier transform of the unmodified density we
get back the phases we put in. So we try to make a chemically sensible
modification to the density before doing the inverse FFT in the hope
that this will lead to improved estimates for the phases.
Many such density modifications have been tried, some of them very
sophisticated. Major contributions have been made by Peter Main,
Kevin Cowtan and Tom Terwilliger. One of the simplest ideas,
truncating negative density to zero, is actually not too bad (it is the
basic idea behind the program ACORN).
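The simplest idea can be written in a few lines; a toy numpy sketch of one cycle (illustrative only, not the ACORN or SHELXE code): set negative density to zero, transform back, and keep the new phases with the observed amplitudes.

# Toy cycle of the simplest density modification: truncate negative density.
import numpy as np

def dm_cycle(F_obs, phases, grid=(32, 32, 32)):
    """Map from (F_obs, phases) -> set rho < 0 to zero -> return new phases."""
    F = (F_obs * np.exp(1j * phases)).reshape(grid)
    rho = np.real(np.fft.ifftn(F))
    rho[rho < 0.0] = 0.0                      # the chemically sensible modification
    return np.angle(np.fft.fftn(rho)).ravel() # hopefully improved phase estimates

# hypothetical starting amplitudes and phases on a 32^3 grid
rng = np.random.default_rng(4)
F_obs  = np.abs(rng.normal(size=32**3))
phases = rng.uniform(-np.pi, np.pi, size=32**3)
new_phases = dm_cycle(F_obs, phases)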
The sphere of influence algorithm
The variance V of the density on a spherical surface of radius 2.42 Å is
calculated for each pixel in the map. The use of a spherical surface
rather than a spherical volume was intended to save time and add a
little chemical information (2.42 Å is a typical 1,3-distance in proteins
and DNA). V gives an indication of the probability that a pixel
corresponds to a true atomic position.
Pixels with low V are flipped (ρ′ = −γρ, where γ is usually set to 1.0).
For pixels with high V, ρ is replaced by [ρ⁴/(ν²σ²(ρ) + ρ²)]^½ (with ν
usually 0.5) if positive and by zero if negative. This has a similar effect
to the procedure used in the CCP4 program ACORN.
For intermediate values of V, a suitably weighted mixture of the two
treatments is used.
An empirical weighting scheme for phase recombination is used to
combat model bias.
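A rough numpy approximation of the classification step (illustrative only, with arbitrary cut-offs; not the SHELXE code): the variance of the density sampled on a 2.42 Å spherical surface around each pixel decides whether that pixel is flipped or sharpened.

# Rough sketch of the sphere-of-influence classification (not the SHELXE code).
import numpy as np
from scipy.ndimage import map_coordinates

def sphere_of_influence(rho, voxel_size, radius=2.42, n_dirs=30, gamma=1.0, nu=0.5):
    dirs = np.random.default_rng(0).normal(size=(n_dirs, 3))
    dirs /= np.linalg.norm(dirs, axis=1)[:, None]          # points on a unit sphere
    offsets = dirs * (radius / voxel_size)                 # surface offsets in voxels

    idx = np.indices(rho.shape).reshape(3, -1).astype(float)
    samples = np.empty((n_dirs, idx.shape[1]))
    for k, off in enumerate(offsets):                      # density on the 2.42 Å surface
        samples[k] = map_coordinates(rho, idx + off[:, None], order=1, mode='wrap')
    V = samples.var(axis=0).reshape(rho.shape)             # variance per pixel

    sigma = rho.std()
    lo, hi = np.quantile(V, [0.3, 0.7])                    # arbitrary illustrative cut-offs
    sharpened = np.where(rho > 0.0,
                         np.sqrt(rho**4 / (nu**2 * sigma**2 + rho**2)), 0.0)
    # low V: flip; high V: sharpen; intermediate V is left unchanged here
    # (SHELXE instead uses a weighted mixture of the two treatments).
    return np.where(V < lo, -gamma * rho, np.where(V > hi, sharpened, rho))

rho_mod = sphere_of_influence(np.random.default_rng(7).normal(size=(24, 24, 24)),
                              voxel_size=0.8)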
[Density histograms (Bernhard Rupp & Peter Zwart).]
Although SHELXE makes no use of histogram matching, the sphere of
influence algorithm is able to bring the histogram much closer to the
one for the correct structure!
The free lunch algorithm
The free lunch algorithm (FLA) is an attempt to extend the resolution
of the data by including, in the density modification, reflections at
higher resolution than have been measured. Although discovered
independently and first published by the Bari group (Acta Cryst. D61
(2005) 556-565 and 1080-1087), the first successful implementation of
the FLA was probably in 2001 in the program ACORN, but was not
published until 2005 (Acta Cryst. D61 (2005) 1465-1475).
The unexpected conclusion was that if the phases obtained for these
extrapolated reflections are now used to recalculate the density, using
very rough guesses for the (unmeasured) amplitudes, the density
actually improves! The FLA is
incorporated in SHELXE and tests confirm that the phases of the
observed reflections improve, at least when the native data have been
measured to a resolution of 2 Å or better.
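Schematically (a toy numpy sketch, not the SHELXE implementation), the extrapolation amounts to taking the phases of the unmeasured high-resolution reflections from the Fourier transform of the modified map and pairing them with rough amplitude guesses.

# Toy sketch of "free lunch" extrapolation beyond the measured resolution.
import numpy as np

def free_lunch(rho_modified, measured, F_obs):
    """Observed amplitudes where measured; elsewhere a rough amplitude guess.
    Phases for ALL reflections come from the density-modified map."""
    F_map = np.fft.fftn(rho_modified)
    guess = np.mean(np.abs(F_obs[measured]))          # very rough amplitude guess
    amplitudes = np.where(measured, np.abs(F_obs), guess)
    return np.real(np.fft.ifftn(amplitudes * np.exp(1j * np.angle(F_map))))

# hypothetical 16^3 example in which only the low-resolution half is "measured"
rho = np.random.default_rng(5).normal(size=(16, 16, 16))
f = np.fft.fftfreq(16)
fx, fy, fz = np.meshgrid(f, f, f, indexing='ij')
measured = np.sqrt(fx**2 + fy**2 + fz**2) < 0.25
rho_new = free_lunch(rho, measured, np.fft.fftn(rho))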
Maps before and after a free lunch
[Left: best experimental phases after density modification (MapCC 0.57).
Right: after expansion to 1.0 Å with virtual data (MapCC 0.94).]
Isabel Usón et al. (2007), Acta Cryst. D63, 1069-1074.
Why do we get a free lunch?
It is not immediately obvious why inventing extra data improves
the maps. Possible explanations are:
1. The algorithm corrects Fourier truncation errors that may have
had a more serious effect on the maps than we had realised.
2. Phases are more important than amplitudes (see Kevin Cowtan’s
ducks and cats!), so as long as the extrapolated phases are OK
any amplitudes will do.
3. Zero is a very poor estimate of the amplitude of a reflection that
we did not measure.
The SHELXE autotracing algorithm
A fast but very crude autotracing algorithm has been incorporated
into the density modification in SHELXE. It is primarily designed for
iterative phase improvement starting from very poor phases. The
tracing proceeds as follows:
1. Find potential α-helices in the density and try to extend them at
both ends. Then find other potential tripeptides and try to extend
them at both ends in the same way.
2. Tidy up and splice the traces as required, applying any necessary
symmetry operations.
3. Use the traced residues to estimate phases and combine these with
the initial phase information using sigma-A weights, then restart
density modification. The refinement of one B-value per residue
provides a further opportunity to suppress wrongly traced
residues.
Extending chains at both ends
The chain extension algorithm looks two residues ahead of the
residue currently being added, and employs a simplex algorithm to
find a best fit to the density at the atom centers as well as at ‘holes’ in
the chain. The quality of each completed trace is then assessed
independently before accepting it.
Important features of the algorithm are the generation of a no-go map
that defines regions that should not be traced into, e.g. because of
symmetry elements or existing atoms, and the efficient use of
crystallographic symmetry. The trace is not restricted to a predefined
volume, and the splicing algorithm takes symmetry equivalents into
account.
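The fit itself can be illustrated with a Nelder-Mead simplex from scipy (a hypothetical stand-in that refines only a translational shift, unlike the full SHELXE search): the score is the summed density at the proposed atom centres minus that at the ‘hole’ positions.

# Illustrative simplex fit of a small fragment to the density (not the SHELXE code).
import numpy as np
from scipy.optimize import minimize
from scipy.ndimage import map_coordinates

def fit_fragment(rho, atom_xyz, hole_xyz, start_shift):
    """Refine a translational shift (in voxels) so that the density is high at
    the proposed atom centres and low at the 'hole' positions."""
    def score(shift):
        at = map_coordinates(rho, (atom_xyz + shift).T, order=1, mode='wrap')
        ho = map_coordinates(rho, (hole_xyz + shift).T, order=1, mode='wrap')
        return -(at.sum() - ho.sum())                   # minimize the negative score
    res = minimize(score, start_shift, method='Nelder-Mead')   # simplex algorithm
    return res.x, -res.fun

# hypothetical density grid and fragment coordinates (in voxel units)
rho = np.random.default_rng(6).normal(size=(24, 24, 24))
atoms = np.array([[10.0, 10.0, 10.0], [11.5, 10.0, 10.0]])
holes = np.array([[10.7, 11.2, 10.0]])
print(fit_fragment(rho, atoms, holes, start_shift=np.zeros(3)))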
Criteria for accepting chains
The following criteria are combined into a single figure of merit for
accepting traced chains:
1. The overall fit to the density should be good.
2. The chains must be long enough (in general at least 7 amino acids); longer chains are given a higher weight.
3. There should not be too many Ramachandran outliers.
4. There should be a well defined secondary structure (φ/ψ pairs
should tend to be similar for consecutive residues).
5. On average, there should be significant positive density 2.9 Å from
N in the N→H direction, i.e. at a hydrogen-bond acceptor (N―H····O).
Is the structure solved?
If the CC for the structure factors calculated for the trace against the
native data is better than 25%, it is extremely likely that the structure is
solved. Another good indication is whether one can see side-chains.
Fibronectin autotracing
This structure illustrates the ability of the autotracing to start from a
noisy S-SAD map. Recycling the partial (but rather accurate) traces
leads to better phases and an almost complete structure.
[Cα traces for cycles 1-3, coloured by Cα deviation (< 0.3 Å, < 0.6 Å,
< 1.0 Å, < 2.0 Å); the N- and C-termini and an incorrectly traced Cα in
cycle 1 are marked.]
In the first cycle, 41% was traced with Cα within 1.0 Å, 33% within 0.5 Å
and 4% false. After 3 cycles the figures were 94%, 87% and 0%.
Fibronectin map quality
                                                     MPE [°]   mapCC
Standard S-SAD [−h −s0.35]:                           53.4      0.63
S-SAD with FLA [−m200 −h −s0.5 −e1]:                  42.9      0.70
S-SAD with autotracing [−h −s0.35 −a3]:               32.3      0.84
S-SAD, autotracing and FLA [−h −s0.35 −a3 −e1]:       31.6      0.86
However, combining the FLA (free lunch algorithm) with autotracing
did not produce much further improvement. Although the FLA had
proved very useful in solving several borderline cases, with the
phase improvement that arises from autotracing the FLA has almost
been relegated to the role of cosmetic map improvements!
[SHELXE poly-Ala trace for the 1y13 test structure.]
In this case, assuming threefold NCS (−n3) found 50 residues more than
without NCS. However, the treatment of NCS in shelxe is ‘quick and dirty’
and needs rewriting.
Extension of small MR fragments using SHELXE
It often happens that a search model for structure solution by
molecular replacement (MR) corresponds to only a small fraction of
the total scattering power, and in such cases expansion to the full
structure can be tedious. For input to the beta-test shelxe, the PDB file
of the MR solution is renamed to name.pda where the merged intensity
data (e.g. from shelxc) are in the file name.hkl. Then e.g.
shelxe name.pda -a30 -s0.5 -y2.0 -q -e1
can be used to run shelxe. A large number of tracing cycles may be
needed (-a30)! A critical parameter is -y, the resolution at which to
truncate the phases calculated from the MR solution. Experience
suggests that -y2.0 is often best, with the implication that this
approach works best with native data to a resolution of 2.0 Å or better.
Progress of fragment extension
For this test, the variation of the CC value with the iteration number was
unexpected. Instead of gradually improving, it meanders randomly at
about 10% and then suddenly, in the course of four or five iterations,
jumps to a value well above 25%, indicating a solved structure. This
strongly resembles the behavior of small-molecule direct methods, with
the important differences that they involve data to about 1 Å or better,
and start from random phases.
ARCIMBOLDO – ab initio protein structure solution?
Isabel Usón’s arcimboldo uses a supercomputer to produce a large
number of phaser MR solutions using very small but precise search
fragments such as a 14-residue α-helix. Many potential solutions with
(say) 1, 2 or 3 placed fragments are fed into shelxe, using the phaser
TFZ as a broad-pass filter. If one or more shelxe attempts exceed the
magic CC value of 25%, the structure has been solved!
The only requirements are native data to about 2.1 Å or better, massive
computing power and the presence of at least one α-helix (the more the
better) in the structure. In principle any common fragment would do,
but α-helices tend to have the most tightly conserved geometry.
Since the median resolution of protein structures in the PDB is about
2.1 Å, the method should be able to solve at least 25% of the structures
in the current PDB without using any experimental phase or other
information! Isabel <[email protected]> is currently setting up an
arcimboldo supercomputer server.
ANODE
The new program ANODE reads a PDB format file to calculate φT and a
file from shelxc containing FA and α. The heavy atom substructure
phases are then calculated using φA = φT − α. A Fourier map calculated
with phases φA and amplitudes FA then reveals the heavy atom
substructure from a SAD, MAD or SIRAS etc. experiment.
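The central step is tiny; a hypothetical numpy sketch of the map coefficients (FA as amplitude, φA = φT − α as phase), not the ANODE code itself:

# Sketch of the ANODE map coefficients: amplitude FA, phase phiA = phiT - alpha.
import numpy as np

def anomalous_map_coefficients(FA, alpha_deg, phiT_deg):
    """phiT from the PDB model, FA and alpha from the shelxc name_fa.hkl file."""
    phiA = np.deg2rad(phiT_deg - alpha_deg)          # substructure phases
    return FA * np.exp(1j * phiA)                    # Fourier coefficients of the
                                                     # anomalous-density map

# hypothetical values for three reflections (angles in degrees)
FA    = np.array([12.0, 5.0, 8.5])
alpha = np.array([90.0, 30.0, 200.0])
phiT  = np.array([45.0, 120.0, 310.0])
print(anomalous_map_coefficients(FA, alpha, phiT))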
‘anode lyso’ would read the PDB format file lyso.ent and the
lyso_fa.hkl file from shelxc. Lysozyme SAD-phased by four I3C
'sticky triangles' gave the following averaged anomalous densities:
Averaged anomalous densities (sigma)
23.207 I2_I3C
23.103 I1_I3C
21.607 I3_I3C
2.705 SD_MET
2.302 S_EPE (EPE is HEPES buffer)
2.300 SG_CYS
0.539 C9_EPE
0.289 C4_I3C
ANODE (continued)
This table is followed by a list of the highest unique anomalous peaks
and the nearest atoms to each. In addition, a .pha file is written for
displaying the anomalous density in coot. This approach always
produces density with the same unit-cell origin as the original PDB file.
Where alternative reflection indexing is possible it may be necessary
to take it into account (with the switch -i).
This figure clearly shows disulfides and other sulfurs, even though the
anomalous data were too weak to find them using shelxd.
Acknowledgements
I am particularly grateful to Isabel Usón, Thomas Schneider, Tim
Gruene, Tobias Beck, Christian Grosse, Andrea Thorn and Navdeep
Sidhu for discussions and testing various half-baked ideas. Andrea
also prepared many of the pictures.
Posters (Saturday 27th / Sunday 28th)
MS58.P01/C591 Isabel Usón, Arcimboldo
MS58.P05/C592 Christian Hübschle, shelXle
MS58.P11/C595 Andrea Thorn, MR + shelxe
MS58.P14/C596 Dayté Rodríguez, Arcimboldo.
References:
SHELXC/D/E: Sheldrick (2010), Acta Cryst. D66, 479-485.
ARCIMBOLDO: Rodríguez, Grosse, Himmel, González, de Ilarduya,
Becker, Sheldrick & Usón (2009), Nature Methods 6, 651-653.
The beta-tests of anode, shelxc, shelxd_mp and the autotracing shelxe
are available on email request to [email protected]