Comprehensive peptide ion structure studies using ion mobility techniques: Part 1. An advanced protocol for molecular dynamics simulations and collision cross section calculation

Samaneh Ghassabi Kondalaji1*, Mahdiar Khakinejad1*, Amirmahdi Tafreshian2, Stephen J. Valentine1

1. C. Eugene Bennett Department of Chemistry, West Virginia University, Morgantown, WV 26506
2. Department of Statistics, P.O. Box 6330, West Virginia University, Morgantown, WV 26506
* Samaneh Ghassabi Kondalaji and Mahdiar Khakinejad contributed equally to this manuscript.

Supplementary Information

Cis-trans conversion restraints. To analyze the effect of the applied restraints, 1,000 cycles of 40-ps SA runs were performed on the peptide ion with the K(6)-K(11)-K(21) charge arrangement without any chirality or cis-omega dihedral restraints. The resulting structures were subjected to helicity and pairwise RMSD calculations. Comparison of these data with the results obtained from restrained SA runs with the same cooling period (40 ps) suggests that both simulations cover the same range of conformational space (within ~3%). Additionally, the helicity comparison shows a high degree of similarity between the NP values for both SA methods. Because omega dihedral angle values deviating from 180° (the ideal trans peptide bond) can dramatically affect the structure assignments in non-restrained conformers, the values generated by the STRIDE algorithm can only be considered reliable for providing an overall picture of structural diversity in non-restrained simulations; they cannot be used to determine the secondary structure of peptide ions with cis-trans transitions along the backbone. The overall number of peptide bonds with a cis configuration was determined using the Cispeptide plugin [1] implemented in the VMD software package [2] with an omega dihedral angle threshold of 85°.
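The cis-bond count described above can be illustrated with a minimal Python sketch. It does not reproduce the VMD Cispeptide plugin used in this work: the function names, the use of four backbone atoms (for example CA, C, N, CA) to define omega, and the rule that a bond counts as cis when |ω| is below the 85° threshold are all assumptions made for illustration, and the plugin's exact criterion may differ.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points in 3D."""
    u1, u2, u3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(u1, u2), _cross(u2, u3)  # normals of the two planes
    y = math.sqrt(_dot(u2, u2)) * _dot(u1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def count_cis_bonds(omegas_deg, threshold_deg=85.0):
    """Count omega angles treated as cis (assumed rule: |omega| < threshold)."""
    return sum(1 for w in omegas_deg if abs(w) < threshold_deg)

# Hypothetical planar geometries: a zigzag (trans) and a U-shape (cis).
trans_omega = dihedral_deg((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
cis_omega = dihedral_deg((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
n_cis = count_cis_bonds([180.0, -175.0, 10.0, -60.0, 90.0])
```

In this sketch an ideal trans bond gives |ω| = 180° and an ideal cis bond gives ω = 0°, so only values well inside the cis region fall under the threshold.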
The data demonstrate that during the first several cycles of simulations, all the sampled structures at 0 K have omega dihedral values corresponding to trans peptide bonds; however, as the cycles proceed, the number of peptide bonds in a cis configuration increases, as the majority of annealed structures have all their omega dihedral values transition to the cis configuration. The presence of a cis peptide bond in the annealed structures shrinks the pool of candidate structures: only structures with trans peptide bonds are acceptable, so the number of acceptable annealed structures decreases. For the purposes of this study, the cis-trans restraints were applied in all simulations to create a large pool of structures. For this particular system, no chirality inversions were observed in non-restrained simulations; however, because such inversions occurred in previous SA runs on different protein systems, chirality restraints with a force constant of 10 kcal·mol⁻¹·rad⁻² were also applied to all SA analyses.

Summary of Molecular Dynamics (MD) simulations. In Supplementary Figure 1, the pink rectangles represent the main workflow, from initial structure generation to the selection of candidate structures through a series of extensive cluster analyses and collision cross section (CCS) calculations. The grey rectangles show the process used to calculate the force field parameters for the C-terminal, positively charged lysine residues (LYSCOOH) through a series of geometry optimizations and charge determination steps. The blue rectangles show the CCS calculations and the linear selection of 10 reference structures from the entire CCS range of annealed structures, followed by CCS determinations for their trajectories (5,000 frames sampled per trajectory) and comparison of these accurate CCS values (Ω*) with CCS measurements obtained through benchmarking procedures (pink rectangles).
The analysis of the simulated annealing algorithms and the related optimizations is shown by the rectangles highlighted in yellow. The filled arrows indicate the order in which the processes are performed. The green rectangles show the different options for the corresponding original step (pink rectangle).

Cluster analysis and data mining. k-means clustering is one of the simplest (and therefore most widely applied) methods in data mining for dividing n vectors of p dimensions into k distinct clusters. In this method, k centroids are selected, the vectors are assigned to groups of similar members, and the distances of the vectors from their corresponding centroids are minimized. That is, the following function is minimized:

$$\sum_{i=1}^{k} \sum_{j \in C(i)} (x_j - \mu_i)^2 \qquad (1)$$

In Equation 1, x_j is a p×1 vector, C(i) is the ith cluster, μ_i is the corresponding centroid vector, and the term in parentheses is denoted RMSD_ij in the manuscript. Moreover, C and μ represent the set of clusters and their centroids (i = 1, 2, …, k), respectively. Given a large number of clusters, this is an NP-hard problem, which means that it is computationally difficult to find a globally optimal solution. However, Hartigan and Wong [3] introduced an algorithm that provides locally optimal solutions, which guarantee that no single movement of a point from one cluster to another would reduce the objective function above. This algorithm can be run in the R software environment using the kmeans function in the stats package. It is worth noting that if the members of each cluster are similar enough to their centroids, these k centroids can be assumed to be representative of their associated clusters and used in a subsequent analysis [4]. Note that for the k-means algorithm introduced above, the number of clusters, k, must be set in advance, which is the main shortcoming of the algorithm. However, a variety of methods address this issue, including the Elbow-point method, which can be used to specify an appropriate number of clusters [5].
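A minimal Python sketch of the objective in Equation 1 and of an Elbow-point search over k may make the idea concrete. It is not the procedure used in this work: it substitutes a simple Lloyd-style iteration with a deterministic farthest-first initialization for the Hartigan and Wong algorithm, and it locates the elbow with a chord-distance heuristic. All function names, the heuristic, and the toy data are assumptions for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic initial centroids: start at points[0], then repeatedly
    add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids

def kmeans_wss(points, k, n_iter=100):
    """Lloyd-style k-means; returns the total within-cluster sum of squares."""
    centroids = farthest_first(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(col) / len(m) for col in zip(*m)) if m else centroids[i]
            for i, m in enumerate(clusters)
        ]
        if updated == centroids:  # assignments no longer change
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def elbow_point(ks, wss):
    """Heuristic elbow: the k whose normalized WSS lies farthest below the
    straight line joining the first and last points of the curve.
    Assumes the first and last WSS values differ."""
    k0, k1, w0, w1 = ks[0], ks[-1], wss[0], wss[-1]
    def gap(k, w):
        return 1.0 - (k - k0) / (k1 - k0) - (w - w1) / (w0 - w1)
    return max(zip(ks, wss), key=lambda kw: gap(*kw))[0]

# Toy data: three tight, well-separated groups of four 2D points each.
offsets = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2)]
points = [(cx + dx, cy + dy)
          for cx, cy in [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
          for dx, dy in offsets]
ks = list(range(1, 7))
wss = [kmeans_wss(points, k) for k in ks]
best_k = elbow_point(ks, wss)
```

For this toy data set the within-cluster sum of squares drops sharply up to k = 3 and is nearly flat afterwards, so the heuristic selects three clusters.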
In the Elbow-point method, the k value is iterated from 1 to a reasonably large number; for each iteration, the k-means algorithm is applied and the total within-cluster sum of squares is computed. A descending trend is expected when plotting the total within-cluster sum of squares against k because, as k increases, the members of each cluster more closely approximate the centroid and consequently the within-cluster sum of squares decreases. However, beyond a value of k defined as the Elbow-point, the objective function no longer changes dramatically; because using fewer clusters decreases the computation time, the Elbow-point is selected as the suitable number of clusters. This procedure can be carried out in R using the function fviz_nbclust() in the factoextra package. Supplementary Figure 2 provides a histogram of the Elbow-point occurrence for all production MD simulation data.

Arrival time distribution of [M+3H]³⁺ and [M+4H]⁴⁺ ions. Upon electrospraying the model peptide Acetyl-PAAAAKAAAAKAAAAKAAAAK, quadruply-, triply- and doubly-protonated ions of the peptide are produced. Supplementary Figure 3 shows the arrival time distribution of these ions, generated by integrating values over a narrow m/z range across all drift time (tD) values. The tD distribution for triply-protonated ions is depicted in panel a of Supplementary Figure 3. Two well-resolved peaks at ~7.9 and ~8.3 ms are indicative of a compact (Ω = 417 Å²) and a more diffuse (Ω = 438 Å²) structure type. [M+4H]⁴⁺ ions exhibit two unresolved conformers at ~7 ms (Ω = 492 Å²) and ~7.2 ms (Ω = 506 Å²) as the most prevalent structural types, while an unresolved shoulder at 7.6 ms (Ω = 534 Å²) indicates a more diffuse structure.

Effect of clustering after SA runs. Clustering the structures and using the structure closest to each cluster centroid can serve as a data reduction method, which can significantly decrease computation time.
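As a minimal Python sketch of this data reduction step, the code below picks, for each cluster, the frame closest to the cluster centroid. It assumes that frames are already aligned and stored as flat coordinate lists (x1, y1, z1, x2, ...) and that cluster labels are available from a prior clustering step; the function names and toy data are hypothetical and do not reflect the cpptraj or R workflow used in this work.

```python
import math

def rmsd(a, b):
    """RMSD between two aligned frames given as flat coordinate lists."""
    n_atoms = len(a) // 3
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / n_atoms)

def centroid(frames):
    """Coordinate-wise mean of a list of frames."""
    return [sum(col) / len(frames) for col in zip(*frames)]

def representatives(frames, labels):
    """Map each cluster label to the index of the frame nearest its centroid."""
    members = {}
    for idx, lab in enumerate(labels):
        members.setdefault(lab, []).append(idx)
    reps = {}
    for lab, idxs in members.items():
        c = centroid([frames[i] for i in idxs])
        reps[lab] = min(idxs, key=lambda i: rmsd(frames[i], c))
    return reps

# Toy data: six single-atom "frames" already assigned to two clusters.
frames = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0],
          [10.0, 0.0, 0.0], [11.0, 0.0, 0.0], [15.0, 0.0, 0.0]]
labels = [0, 0, 0, 1, 1, 1]
reps = representatives(frames, labels)
```

Because a centroid is a coordinate average, it need not correspond to any sampled structure, whereas the closest-to-the-centroid frame always does.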
To examine the effects of such an approach on the sampled structures, all of the structures were subjected to cluster analysis after the simulated annealing step (for more information see the manuscript). Supplementary Figure 4 shows the effect of the procedure on the secondary structure of the sampled conformers. The blue line represents the data without cluster analysis, while the red line shows the output of the data reduction procedure. Because such an approach can lead to unwanted elimination of some structure types, no data reduction was carried out.

Centroid as the cluster representative. As mentioned above, the k-means method with the centroid as the representative of each cluster can be used as a data reduction method. This approach is used to calculate a CCS value for a structural type: all the frames of the production MD runs are subjected to cluster analysis, and the centroid CCS values can be used to approximate the CCS values exhibited by the original annealed structure (Equation 4 in the manuscript). Supplementary Figure 5 summarizes the results of this procedure. A narrow CCS energy distribution and small CCS values, along with the large error associated with this method, can be indicative of its ineffectiveness (for more discussion see the manuscript).

Effect of no clustering on the CCS calculation. As discussed in the manuscript, the structure closest to the centroid can be used as the representative structure to calculate CCS values. The accuracy of this approach can be drastically affected by the number of clusters. Supplementary Figure 6 compares the accurate CCS values for the 10 selected reference structures (see manuscript) with their values calculated with no clustering. This approach produced an average error of 5.9% with a maximum error of 15%.

Preliminary comparison of solution and gas-phase structures. Supplementary Figure 7 shows the similarity between solution- and gas-phase structures generated by MD simulations.
To generate the solution structures using the Amber ff12SB force field, an initial, fully helical structure of the model peptide was generated, with amino acid residues exhibiting the appropriate charge for water solvent at pH = 7. The structure was neutralized with Cl⁻ as the counter ion, solvated in a truncated octahedral box of TIP3P water [6], and relaxed. Over 4 ns in the canonical (NVT) ensemble, the system was gradually heated to 300 K using Langevin dynamics with a collision frequency of 1.0 ps⁻¹ [7]. This was followed by a 40-ns equilibration in the isothermal–isobaric (NPT) ensemble with an average target pressure of 1 atm. The obtained structure was subjected to 0.5 μs of production MD in the NPT ensemble with periodic boundary conditions. The non-bonded cutoff for calculating the long-range electrostatic and van der Waals interactions was set to 12 Å. All the simulations were performed using the AMBER software package [8]. After removal of water molecules, the generated trajectory was subjected to dihedral cluster analysis with a mask on the ψ and φ dihedral angles of the 10 middle amino acid residues. This led to the formation of 4 clusters. The conformer most similar to the centroid of each cluster was selected as a solution-phase representative structure, shown in Supplementary Figure 7.

Using backbone-only RMSD calculations and the cpptraj module [9] implemented in the AMBER software package [8], the structures closest to the centroids obtained from gas-phase production MD simulations and cluster analysis with k = 50 were compared with the newly generated solution-phase representative structures. The most similar conformations (lowest RMSD values) were extracted as the solution-like, gas-phase structures. Notably, this analysis does not make any claim about the degree of preservation of solution structure into the gas phase.
Rather, it demonstrates that the wide range of structures accessible by the advanced MD simulations could contain conformer subsets resembling solution structures.

References

1. Schreiner, E., et al., Stereochemical errors and their implications for molecular dynamics simulations. BMC Bioinformatics, 2011. 12: p. 190.
2. Humphrey, W., A. Dalke, and K. Schulten, VMD: Visual molecular dynamics. Journal of Molecular Graphics & Modelling, 1996. 14(1): p. 33-38.
3. Hartigan, J.A. and M.A. Wong, Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 1979. 28(1): p. 100-108.
4. Faber, V., Clustering and the continuous k-means algorithm. Los Alamos Science, 1994. 22: p. 138-144.
5. Kodinariya, T.M. and P.R. Makwana, Review on determining number of Cluster in K-Means Clustering. International Journal, 2013. 1(6): p. 90-95.
6. Jorgensen, W.L., et al., Comparison of simple potential functions for simulating liquid water. The Journal of Chemical Physics, 1983. 79(2): p. 926-935.
7. Section 5: Running Minimization and MD (in explicit solvent). [cited 2015]; Available from: http://ambermd.org/tutorials/basic/tutorial1/section5.htm.
8. Case, D.A., T.A. Darden, T.E. Cheatham III, C.L. Simmerling, et al., Amber12. University of California, San Francisco, 2012.
9. AmberTools12 Reference Manual. Available from: http://ambermd.org/doc12/AmberTools12.pdf.
[Flowchart. Box labels: MEP Calculation; QM Optimization; Energy Minimization; Initial Structure Generation; Force Field Parameterization; 1200-ps SA; SA Algorithm Optimization; 40-ps SA; Heating/Equilibration; Production MD simulation at 300 K; RMSD Method Determination; k-means Trajectory Clustering; Elbow-point Determination; Cluster Analysis; CCS Calculation; K(6), K(11), K(21); K(6), K(16), K(21); Closest-to-the-centroid; Centroid; Backbone-only; All-atom; Structure Analysis; Cooling-time Variation; Cluster-representative Structure Determination; 50-k; 10-k; Charge Fitting; CCS Calculations; Reference trajectory Selection; Ω* Determination; Candidate Structure Determination.]
Supplementary Figure 1. Schematic summarizing the overall process for MD simulations. For more information see the description above as well as the manuscript.

[Histogram. x-axis: Number of Clusters (27 to 60); y-axis: Frequency of Elbow-Points (0 to 100).]
Supplementary Figure 2. Histogram of elbow-point occurrences for all of the production simulations at 300 K.

[Two panels, a and b. x-axis: Drift Time (ms); y-axis: Intensity (Arbitrary Units).]
Supplementary Figure 3. Drift time distribution of [M+3H]³⁺ (panel a) and [M+4H]⁴⁺ (panel b) ions of the model peptide Acetyl-PAAAAKAAAAKAAAAKAAAAK. The distribution is obtained by integrating intensities for all m/z values for these ions across the drift time range.

[Line plot. x-axis: Number of Residues (0 to 20); y-axis: Normalized Population (0 to 0.4).]
Supplementary Figure 4. Normalized populations of structure types with a specific number of residues in helical structure (R). The blue trace shows the values exhibited by all structures, while the red trace shows the values after data reduction with cluster analysis.
[Four panels, (a) to (d). x-axis: Collision Cross Section (Å²), 300 to 600; y-axis: Potential Energy.]
Supplementary Figure 5. CCS values versus potential energy for the [M+3H]³⁺ ions. The centroid geometry has been used to calculate the CCS value. For the top panels, structures are oriented by backbone-only RMSD prior to cluster analysis, while for the bottom panels orientation is performed according to all-atom criteria. Left and right panels show the results for 10-k and 50-k cluster analyses, respectively.

[Bar graph. x-axis: reference structures 1 to 10; y-axis: Collision Cross Section (Å²), 350 to 500.]
Supplementary Figure 6. Bar graphs showing the calculated CCS values (Ωtotal) for 10 selected reference structures versus their accurate values (Ω*). Red bars show the CCS values calculated for these 10 structures using the closest structure to the centroid as the representative structure and no cluster analysis. Blue bars represent the accurate CCS values (Ω*) for these reference conformers.

Supplementary Figure 7. Comparison between solution and gas-phase structures for 4 different conformer types. The solution structures were obtained after clustering the trajectory obtained from 0.5 μs of production MD in explicit solvent in the NPT ensemble at 300 K. The gas-phase structures with the lowest RMSD values relative to these 4 conformer types are illustrated.