Comprehensive peptide ion structure studies using ion mobility techniques: Part
1. An advanced protocol for molecular dynamics simulations and collision cross
section calculation
Samaneh Ghassabi Kondalaji1*, Mahdiar Khakinejad1*, Amirmahdi Tafreshian2, Stephen J. Valentine1
1. C. Eugene Bennett Department of Chemistry, West Virginia University, Morgantown WV 26506
2. Department of Statistics P.O. Box 6330 West Virginia University Morgantown, WV 26506
* Samaneh Ghassabi Kondalaji and Mahdiar Khakinejad contributed equally to this manuscript.
Supplementary Information
Cis-trans conversion restraints. To analyze the effect of the applied restraints,
1,000 cycles of 40-ps SA runs have been performed on the peptide ion with the charge
arrangement of K(6)-K(11)-K(21) without any chirality and cis-omega dihedral restraints.
The resulting structures have been subjected to helicity and pairwise RMSD
calculations. Comparison of these data with the results obtained from
restrained SA runs with the same cooling period (40 ps) suggests the same range
(within ~3%) of conformational space coverage for both simulations. Additionally, the
helicity comparison shows a high degree of similarity between the NP values for both
SA methods. Because the presence of omega dihedral angle values deviating from 180°
(ideal trans peptide bond) can dramatically affect the structure assignments in non-restrained conformers, the values generated from the STRIDE algorithm can only be
considered reliable in providing an overall picture of structural diversity in non-restrained
simulations and cannot be used in determining the secondary structure for the peptide
ions with cis-trans transitions along the backbone.
The overall number of peptide bonds with a cis configuration is determined using
the Cispeptide plugin [1] implemented in the VMD software package [2] with an omega
dihedral angle threshold of 85°. The data demonstrate that during the first several
cycles of simulations, all the sampled structures at 0 K have omega dihedral values
corresponding to a trans peptide bond; however, as the cycles proceed, the number of
peptide bonds in a cis configuration increases as the majority of annealed structures
have all their omega dihedral values transition to the cis configuration. The presence of
a cis peptide bond in the annealed structures decreases the candidate sample pool
volume; only structures with trans peptide bonds are acceptable and therefore the
number of acceptable annealed structures decreases. For the purposes of this study,
the cis-trans restraints have been applied over all simulations to create a large pool of
structures. It is worth mentioning that for this particular system, no chirality inversions
have been observed in non-restrained simulations; however, because such inversions occurred in
previous SA runs on different protein systems, chirality restraints with a force constant
value of 10 kcal·mol⁻¹·rad⁻² have also been applied to all SA analyses.
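The cis/trans bookkeeping described above can be illustrated with a minimal sketch. It is not the Cispeptide plugin itself: the function names are hypothetical, the omega dihedral is assumed to be defined by the atoms CA(i), C(i), N(i+1), CA(i+1), and a bond is assumed to be flagged as cis when the absolute value of omega falls below the 85° threshold.

```python
import math

def dihedral_deg(p0, p1, p2, p3):
    """Dihedral angle in degrees for four 3-D points (sign convention not relied upon)."""
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)       # normals of the two planes
    b1_len = math.sqrt(dot(b1, b1))
    m1 = cross(n1, tuple(x / b1_len for x in b1))
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))

def is_cis(omega_deg, threshold=85.0):
    """Assumption: a peptide bond counts as cis when |omega| < threshold."""
    return abs(omega_deg) < threshold

def count_cis(omegas, threshold=85.0):
    """Number of cis peptide bonds in one structure."""
    return sum(1 for w in omegas if is_cis(w, threshold))

def acceptable(omegas, threshold=85.0):
    """A structure is kept in the candidate pool only if every bond is trans."""
    return count_cis(omegas, threshold) == 0
```

Applied to every annealed structure, `acceptable` would give the size of the usable candidate pool as the SA cycles proceed.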
Summary of Molecular Dynamics (MD) simulations. In Supplementary Figure 1,
the pink boxes represent the main workflow from initial structure generation to selection
of candidate structures through a series of extensive cluster analysis and collision cross
section (CCS) calculations. The grey rectangles show the process applied in calculating
the force field parameters for the C-terminal, positively-charged lysine residues (LYSCOOH) through a series of geometry optimizations and charge determination steps.
The blue rectangles show the CCS calculations and linear selection of 10 reference
structures from the entire CCS range of annealed structures, followed by CCS
determinations of their exhibited trajectories (5,000 frames sampled per trajectory) and
comparison of these accurate CCS values (Ω*) with CCS measurements obtained
through benchmarking procedures (pink rectangles). The analysis of simulated
annealing algorithms and the related optimizations are shown by the rectangles
highlighted in yellow. The filled arrows indicate the order in which the processes are performed.
The green rectangles demonstrate the different selections for the original step (pink rectangle).
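The linear selection of reference structures from the CCS range can be sketched as follows. This is one plausible reading, not the authors' script: it assumes that "linear selection" means choosing the structures whose CCS values lie closest to evenly spaced targets between the minimum and maximum CCS. The function name is hypothetical.

```python
def select_reference_structures(ccs_values, n_refs=10):
    """Return indices of structures closest to n_refs evenly spaced CCS targets.

    Assumes n_refs >= 2 and len(ccs_values) >= n_refs.
    """
    lo, hi = min(ccs_values), max(ccs_values)
    step = (hi - lo) / (n_refs - 1)
    chosen = []
    for r in range(n_refs):
        target = lo + r * step
        # Skip structures that were already selected for an earlier target.
        candidates = [i for i in range(len(ccs_values)) if i not in chosen]
        best = min(candidates, key=lambda i: abs(ccs_values[i] - target))
        chosen.append(best)
    return chosen
```

Each selected index would then seed one reference trajectory for the Ω* determination.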
Cluster analysis and data mining. k-means clustering is one of the simplest (and
therefore most applied) methods in data mining to divide n vectors of p dimensions into
k distinct clusters. In this method, k centroids are selected, each vector is assigned to a group of similar
vectors, and the distances of the vectors from their corresponding centroids are minimized. That is, the
following function is minimized:
$$\sum_{i=1}^{k} \sum_{j \in C(i)} (x_j - \mu_i)^2 \qquad (1)$$
In Equation 1, x_j is a p×1 vector, C(i) is the ith cluster and μ_i is the corresponding
centroid vector, and the term in the parentheses is denoted as RMSD_ij in the manuscript.
Moreover, C and μ represent the set of clusters and their centroids (i = 1, 2, …, k),
respectively. Given a large number of clusters, this is an NP-hard problem, which means
that it is computationally difficult to find a globally optimal solution. However, Hartigan
and Wong [3] introduced an algorithm that provides locally optimal solutions, which
guarantee that there is no single movement of a vector from one cluster to another that would
reduce the objective function above. This algorithm can be implemented in the R
software environment using the kmeans function in the stats package.
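To make the objective in Equation 1 concrete, the following is a minimal sketch of k-means in plain Python. Note that it uses the simpler Lloyd iteration rather than the Hartigan and Wong algorithm used by R's kmeans, so it only illustrates the quantity being minimized. The function names and the random initialization are assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two p-dimensional vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def within_cluster_ss(points, labels, centroids):
    """Objective of Equation 1: total squared distance of each vector to its centroid."""
    return sum(sq_dist(p, centroids[c]) for p, c in zip(points, labels))

def kmeans_lloyd(points, k, n_iter=100, seed=0):
    """Lloyd iteration: alternate assignment and centroid update until labels settle."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # converged to a local optimum
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                dim = len(members[0])
                centroids[c] = tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(dim))
    return labels, centroids
```

As with any local method, the result depends on the starting centroids, which is why several random starts are commonly compared.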
It is worth noting that if the members of each cluster are similar enough to their
centroids, these k centroids can be assumed to be the representative of their associated
clusters and used in a subsequent analysis [4].
Note that for the k-means algorithm introduced above, the number of clusters, k,
must be set in advance, which is the main shortcoming of the algorithm. However,
a variety of methods exist to tackle this issue, including the Elbow-point method, which
can be utilized to specify an appropriate number of clusters [5]. In this method, the k
value is iterated from 1 to a reasonably large number and, for each iteration, the k-means algorithm is applied and the total within-cluster sum of squares is
computed. A descending trend is expected when plotting the total within-cluster sum of
squares against k, because as k increases the members of each cluster more closely
approximate the centroid and consequently the within-cluster sum of squares
decreases. However, beyond a certain value of k, defined as the Elbow point, the objective
function does not change dramatically and, because minimizing the number of clusters
decreases the computation time, the Elbow point is selected as
the suitable number of clusters. This procedure can be implemented in R using
the function fviz_nbclust() in the factoextra package. Supplementary Figure 2 provides a
histogram of the elbow-point occurrence for all production MD simulation data.
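An elbow can also be located numerically rather than by eye. The sketch below is an assumption, not the criterion used in this study: it picks the k whose point on the normalized curve of within-cluster sum of squares lies farthest below the straight line joining the first and last points. The function name and the example values are hypothetical.

```python
def elbow_point(k_values, wss_values):
    """Return the k farthest below the chord joining the first and last (k, WSS) points.

    Both axes are rescaled to [0, 1] so that neither dominates the distance.
    Assumes WSS decreases overall from the first to the last k.
    """
    k_lo, k_hi = k_values[0], k_values[-1]
    w_hi, w_lo = wss_values[0], wss_values[-1]
    best_k, best_gap = k_values[0], float("-inf")
    for k, w in zip(k_values, wss_values):
        x = (k - k_lo) / (k_hi - k_lo)
        y = (w - w_lo) / (w_hi - w_lo)
        gap = 1.0 - x - y  # proportional to the distance below the line x + y = 1
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k

# Hypothetical WSS curve that flattens after k = 4.
ks = [1, 2, 3, 4, 5, 6, 7, 8]
wss = [100.0, 60.0, 30.0, 12.0, 10.0, 9.0, 8.0, 7.0]
print(elbow_point(ks, wss))
```

Running such a routine on each production trajectory would give one elbow value per trajectory, which is the kind of data a histogram of elbow-point occurrences summarizes.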
Arrival time distribution of [M+3H]³⁺ and [M+4H]⁴⁺ ions. Upon electrospraying the
model peptide Acetyl-PAAAAKAAAAKAAAAKAAAAK, quadruply-, triply- and doubly-protonated ions of the peptide are produced. Supplementary Figure 3 shows the arrival
time distribution of these ions generated by integrating values over a narrow m/z range
across all drift time (tD) values. The tD distribution for triply-protonated ions is depicted
in panel a of Supplementary Figure 3. Two well-resolved peaks at ~7.9 and ~8.3 ms are
indicative of a compact (Ω=417 Å²) and a more diffuse (Ω=438 Å²) structure type.
[M+4H]⁴⁺ ions exhibit two unresolved conformers at ~7 ms (Ω=492 Å²) and ~7.2 ms
(Ω=506 Å²) as the most prevalent structural types, while an unresolved shoulder at 7.6
ms (Ω=534 Å²) is an indicator of a more diffuse structure.
Effect of clustering after SA runs. Structure clustering and using the closest
structure to the cluster centroid can be used as a data reduction method, which can
significantly decrease computation time. To examine the effects of such an approach on
the sampled structures, after the simulated annealing step all of the structures have
been subjected to cluster analysis (for more information see the manuscript).
Supplementary Figure 4 shows the effect of the procedure on the secondary structure of
the sampled conformers. The blue line is representative of data without cluster analysis
while the red line shows the trace for the output of the data reduction procedure.
Because such an approach can lead to unwanted elimination of some structure types,
no data reduction has been carried out.
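The comparison behind this conclusion can be sketched as a pair of normalized histograms. The code below is illustrative only: the per-structure helical residue counts are assumed to be already available (for example from a secondary structure assignment), the function names are hypothetical, and the example numbers are invented.

```python
from collections import Counter

def normalized_population(helical_counts, n_residues):
    """Fraction of structures having r helical residues, for r = 0..n_residues."""
    counts = Counter(helical_counts)
    total = len(helical_counts)
    return [counts.get(r, 0) / total for r in range(n_residues + 1)]

def lost_structure_types(full_counts, reduced_counts):
    """Helical residue counts present in the full set but absent after data reduction."""
    return sorted(set(full_counts) - set(reduced_counts))

# Hypothetical data: helical residue counts before and after clustering.
full = [0, 0, 5, 5, 5, 12, 21]
reduced = [0, 5]
print(normalized_population(full, 21)[5])
print(lost_structure_types(full, reduced))
```

A non-empty result from `lost_structure_types` corresponds to the unwanted elimination of structure types mentioned above.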
Centroid as the cluster representative. As mentioned above, the k-means method
with the centroid as the representative of each cluster can be used as a data reduction
method. This approach is used to calculate a CCS value for a structural type; all the
frames of the production MD runs are subjected to cluster analysis and centroid CCS
values can be used to approximate CCS values exhibited by the original annealed
structure (Equation 4 in the manuscript). Supplementary Figure 5 summarizes the
results for such a procedure. A narrow CCS energy distribution and small CCS values,
along with a large error associated with this method can be indicative of its
ineffectiveness (for more discussion see the manuscript).
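The difference between the two kinds of representative can be shown with a short sketch. It assumes each cluster member is described by a numeric vector (for example flattened, pre-aligned coordinates). The function names are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of the cluster members."""
    n = len(vectors)
    dim = len(vectors[0])
    return tuple(sum(v[d] for v in vectors) / n for d in range(dim))

def closest_to_centroid(vectors):
    """Index of the sampled member with the smallest squared distance to the centroid."""
    c = centroid(vectors)
    return min(range(len(vectors)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vectors[i], c)))

cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
print(centroid(cluster))             # an average, not one of the members
print(closest_to_centroid(cluster))  # index of an actual sampled member
```

The centroid is an average geometry that need not coincide with any sampled frame, whereas the closest-to-the-centroid choice always returns a structure that was actually visited in the simulation.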
Effect of no clustering on the CCS calculation. As discussed in the manuscript,
the closest structure to the centroid can be used as the representative structure to
calculate CCS values. The accuracy of this approach can be drastically affected by the
number of clusters. Supplementary Figure 6 compares accurate CCS values for 10
selected reference structures (see manuscript) versus their calculated values with no
clustering. This approach produced an average error of 5.9% with a maximum error of
15%.
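The average and maximum errors quoted above are simple summary statistics. A minimal sketch follows, assuming the error is the absolute difference relative to the accurate value Ω*. The function names are hypothetical and the example numbers are invented, not the data of Supplementary Figure 6.

```python
def percent_errors(calculated, accurate):
    """Absolute percent error of each calculated CCS relative to its accurate value."""
    return [abs(c - a) / a * 100.0 for c, a in zip(calculated, accurate)]

def summarize_errors(calculated, accurate):
    """Return (mean percent error, maximum percent error)."""
    errs = percent_errors(calculated, accurate)
    return sum(errs) / len(errs), max(errs)

# Hypothetical CCS values in Å² for two reference structures.
mean_err, max_err = summarize_errors([420.0, 450.0], [400.0, 500.0])
print(mean_err, max_err)
```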
Preliminary comparison of solution and gas-phase structures. Supplementary
Figure 7 shows the similarity between solution- and gas-phase structures generated by
MD simulations. To generate the solution structures using the Amber ff12SB force field,
an initial, fully-helical structure of the model peptide with amino acid residues exhibiting
the appropriate charge for water solvent at pH 7 was generated. The structure was
neutralized with Cl⁻ as the counter ion, solvated using a truncated octahedral box of
TIP3P water [6], and relaxed. Over a 4-ns timescale in the canonical (NVT)
ensemble, the system was gradually heated to 300 K using
Langevin dynamics with a collision frequency of 1.0 ps⁻¹ [7]. This was followed by a 40-ns
equilibration in the isothermal–isobaric (NPT) ensemble with an average target pressure of
1 atm. The obtained structure was subjected to 0.5 μs of production MD in the NPT
ensemble with periodic boundary conditions. The non-bonded cutoff for calculating the
long-range electrostatic and van der Waals interactions was set to 12 Å. All the
simulations were performed using the AMBER software package [8].
After removal of water molecules, the generated trajectory was subjected to
dihedral cluster analysis with a mask on ψ and φ dihedral angles spanning the 10
middle amino acid residues. This led to the formation of 4 clusters. The most similar
conformer to the centroid of each cluster was selected as a solution-phase
representative structure shown in Supplementary Figure 7.
Using backbone-only RMSD calculations and the cpptraj module [9] implemented
in the AMBER software package [8], the structures closest to the centroids obtained from
production MD simulations in the gas phase and cluster analysis with k=50 were
compared with the newly-generated solution-phase representative structures. The most
similar conformations (lowest RMSD values) were extracted as the solution-like, gas-phase structures. Notably, this analysis does not make any claim about the degree of
preservation of solution structure into the gas phase. Rather, it demonstrates that the
wide range of structures accessible by the advanced MD simulations could contain
conformer subsets resembling solution structures.
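The lowest-RMSD matching step can be sketched in a few lines. This is a simplified illustration, not the cpptraj workflow: it assumes the backbone coordinates have already been superimposed onto the reference, so no fitting is performed here. The function names are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) points, without fitting."""
    n = len(coords_a)
    total = sum(sum((x - y) ** 2 for x, y in zip(a, b))
                for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / n)

def most_similar(reference, candidates):
    """Return (index, rmsd) of the candidate structure closest to the reference."""
    scores = [rmsd(reference, c) for c in candidates]
    best = min(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]
```

Calling `most_similar` once per solution-phase representative, with the gas-phase cluster representatives as candidates, would yield one solution-like gas-phase structure per conformer type.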
References
1. Schreiner, E., et al., Stereochemical errors and their implications for molecular dynamics simulations. BMC Bioinformatics, 2011. 12: p. 190.
2. Humphrey, W., A. Dalke, and K. Schulten, VMD: Visual molecular dynamics. Journal of Molecular Graphics & Modelling, 1996. 14(1): p. 33-38.
3. Hartigan, J.A. and M.A. Wong, Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 1979. 28(1): p. 100-108.
4. Faber, V., Clustering and the continuous k-means algorithm. Los Alamos Science, 1994. 22: p. 138-144.
5. Kodinariya, T.M. and P.R. Makwana, Review on determining number of Cluster in K-Means Clustering. International Journal, 2013. 1(6): p. 90-95.
6. Jorgensen, W.L., et al., Comparison of simple potential functions for simulating liquid water. The Journal of Chemical Physics, 1983. 79(2): p. 926-935.
7. Section 5: Running Minimization and MD (in explicit solvent). [cited 2015]; Available from: http://ambermd.org/tutorials/basic/tutorial1/section5.htm.
8. Case, D.A., T.A. Darden, T.E. Cheatham III, C.L. Simmerling, et al., Amber12. University of California, San Francisco, 2012.
9. AmberTools12 Reference Manual. Available from: http://ambermd.org/doc12/AmberTools12.pdf.
[Supplementary Figure 1 image: workflow schematic. Box labels: Initial Structure Generation; Energy Minimization; QM Optimization; MEP Calculation; Charge Fitting; Force Field Parameterization; 40-ps SA; 1200-ps SA; SA Algorithm Optimization; Cooling-time Variation; Structure Analysis; Heating/Equilibration; Production MD simulation at 300 K; RMSD Method Determination; Backbone-only; All-atom; k-means Trajectory Clustering; Elbow-point Determination; Cluster Analysis; 10-k; 50-k; Cluster-representative Structure Determination; Centroid; Closest-to-the-centroid; CCS Calculation; CCS Calculations; Reference trajectory Selection; Ω* Determination; Candidate Structure Determination; K(6), K(11), K(21); K(6), K(16), K(21).]
Supplementary Figure 1. Schematic summarizing the overall process for MD
simulations. For more information see the description above as well as in the
manuscript.
[Supplementary Figure 2 image: histogram. x-axis: Number of Clusters (27 to 60); y-axis: Frequency of Elbow-Points (0 to 100).]
Supplementary Figure 2. Histogram of elbow-point occurrences for all of the
production simulations at 300 K.
[Supplementary Figure 3 image: panels a and b. x-axis: Drift Time (ms); y-axis: Intensity (Arbitrary Units).]
Supplementary Figure 3. Drift time distribution of [M+3H]³⁺ (panel a) and [M+4H]⁴⁺
(panel b) ions of the model peptide Acetyl-PAAAAKAAAAKAAAAKAAAAK. The
distribution is obtained by integrating intensities for all m/z values for these ions across
the drift time range.
[Supplementary Figure 4 image. x-axis: Number of Residues (0 to 20); y-axis: Normalized Population (0 to 0.4).]
Supplementary Figure 4. Normalized populations of structure types with a specific
number of residues in helical structure (R) are depicted. The blue trace shows the values
exhibited by all structures, while the red trace illustrates the values after data reduction with
cluster analysis.
[Supplementary Figure 5 image: panels (a) to (d). x-axis: Collision Cross Section (Å²), 300 to 600; y-axis: Potential Energy.]
Supplementary Figure 5. CCS values versus potential energy for the [M+3H]³⁺ ions.
The centroid geometry has been used to calculate the CCS value. For the top panels,
structures are oriented by backbone-only RMSD prior to cluster analysis, while for the
bottom panels orientation is performed according to all-atom criteria. Left and right
panels show the results for 10-k and 50-k cluster analyses, respectively.
[Supplementary Figure 6 image: bar graph. x-axis: reference structures 1 to 10; y-axis: Collision Cross Section (Å²), 350 to 500.]
Supplementary Figure 6. Bar graphs showing the calculated CCS values (Ωtotal) for 10
selected reference structures versus their accurate values (Ω*). Red bars show the
CCS values calculated for these 10 structures using the closest structure to the centroid
as the representative structure and no cluster analysis. Blue bars are representative of
the accurate CCS values (Ω*) for these reference conformers.
Supplementary Figure 7. Comparison between solution and gas-phase structures for
4 different conformer types. The solution structures were obtained after clustering the
trajectory obtained from 0.5 μs of production MD in explicit solvent in the NPT
ensemble at 300 K. The gas-phase structures with the lowest RMSD values relative to
these 4 conformer types are illustrated.