BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 20 2009, pages 2663–2669 doi:10.1093/bioinformatics/btp506
Structural bioinformatics

MISTRAL: a tool for energy-based multiple structural alignment of proteins

Cristian Micheletti1,∗ and Henri Orland2
1 SISSA, CNR-INFM Democritos and Italian Institute of Technology, Via Beirut 2-4, 34014 Trieste, Italy and 2 Institut de Physique Théorique, CEA, IPhT, F-91191 Gif-sur-Yvette, France

Received on May 11, 2009; revised on July 20, 2009; accepted on August 13, 2009
Advance Access publication August 19, 2009
Associate Editor: Thomas Lengauer

ABSTRACT
Motivation: The steady growth of the number of available protein structures has constantly motivated the development of new algorithms for detecting structural correspondences in proteins. Detecting structural equivalences in two or more proteins is computationally demanding as it typically entails the exploration of the combinatorial space of all possible amino acid pairings in the parent proteins. The search is often aided by the introduction of various constraints such as considering protein fragments, rather than single amino acids, and/or seeking only sequential correspondences in the given proteins. An additional challenge is represented by the difficulty of associating to a given alignment a reliable a priori measure of its statistical significance.
Results: Here, we present and discuss MISTRAL (Multiple STRuctural ALignment), a novel strategy for multiple protein alignment based on the minimization of an energy function over the low-dimensional space of the relative rotations and translations of the molecules. The energy minimization avoids combinatorial searches and returns pairwise alignment scores for which a reliable a priori statistical significance can be given.
Availability: MISTRAL is freely available for academic users as a standalone program and as a web service at http://ipht.cea.fr/protein.php.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
∗ To whom correspondence should be addressed.

1 INTRODUCTION
One of the ubiquitous bioinformatics tasks is the alignment of protein sequences and structures. The development of these computational tools has a long tradition and was especially motivated by the interest in tracing the evolutionary relationships of various protein families (Needleman and Wunsch, 1970; Phillips, 1970). Sequence alignment methods are particularly useful for this purpose when the sequence identity of the involved proteins is larger than ∼30%. Below this threshold there is no stringent relationship between sequence and structural similarity (Chothia, 1984; Chothia and Lesk, 1986; Gan et al., 2002; Lesk, 2004; Wood and Pearson, 1999) and the detection of evolutionary and/or functional relatedness is appropriately complemented by structural alignment techniques (Koehl, 2001; Lichtarge and Sowa, 2002). Structural alignment, which is the focus of the present study, is applied in several other contexts including the grouping and classification of known protein folds, the identification of conserved structural fragments for molecular replacement in crystallography and homology modelling (Holm et al., 1992; Orengo et al., 1997; Schwarzenbacher et al., 2008).

The result of a pairwise alignment procedure consists of a one-to-one correspondence list between amino acids of the two proteins. In rigid structural alignments, the two sets of corresponding Cα (or mainchain) atoms can be put in good spatial proximity after an optimal structural superposition which minimizes the root mean square distance (RMSD).
Indeed, within these approaches the best alignment is obtained by minimizing a suitable measure of geometric compatibility, such as the similarity of distance matrices in DALI, over the possible amino acid pairings (Holm and Sander, 1993) or weighted sums of distances of equivalent Cα’s in CE (Shindyalov and Bourne, 1998) or in MAMMOTH (Ortiz et al., 2002). Amino acid correspondences can also be established without resorting to criteria of rigid spatial superposability, as in ‘flexible’ alignment methods such as RAPIDO, FlexProt, FATCAT and POSA (Mosca and Schneider, 2008; Shatsky et al., 2004b; Ye and Godzik, 2003, 2005).

As the combinatorial space of possible amino acid correspondences is enormous, the search space for the optimal alignment is often limited by the introduction of various constraints. For instance, pairings can be sought not between single amino acids but among fragments or secondary structure elements of suitable minimal length (Camproux et al., 2004; Kolodny et al., 2002; Micheletti et al., 2000) such as in DALI, CE, MAMMOTH, GANGSTA (Kolbeck et al., 2006) or CLEMAPS (Liu et al., 2008). A second commonly employed constraint is the requirement of sequentiality, that is, subsequent amino acids in one protein must correspond to subsequent amino acids in the partner protein. This restriction, while computationally convenient, can impair the capability to detect evolutionary relationships (Xie and Bourne, 2008) as new protein structures can arise from the combination and permutation of subdomains (Bashton and Chothia, 2007; Fong et al., 2007). The number of structural alignment methods that are non-sequential is still limited and includes CA, MULTIPROT, MASS, TOPOFIT, SCALI and GANGSTA+ (Bachar et al., 1993; Dror et al., 2003; Guerler and Knapp, 2008; Ilyin et al., 2004; Shatsky et al., 2004a; Yuan and Bystroff, 2005).

© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Besides the list of equivalent amino acids in two or more proteins, a highly desirable output of an alignment procedure is the indication of the statistical significance of the returned structural correspondence. In a comprehensive study, Sierk and Pearson (2004) have applied several tests, based on the capability to correctly identify known homologous pairs in a reference set of protein domains, to validate the statistical significance returned by several structural alignment methods. Their analysis pointed out that the provided nominal statistical significance of alignment scores largely exceeded the true one, and hence highlighted the need for a better control of the scoring statistics. This task appears particularly challenging for alignment methods based on the combination of several criteria rather than on the optimization of a single scoring function.

Here, we discuss a novel scheme named MISTRAL (after MultIple STRuctural ALignment), which establishes structural correspondences between two or more proteins without exploring the combinatorial space of amino acid pairings and allows for a good control of the statistical significance of the returned alignments. The optimal superposition of a given set of proteins is obtained by minimizing a suitable protein–protein interaction energy. The minimization of the energy function is carried out in the low-dimensional space corresponding to the orientations (translations and rotations) of the molecules in a given Cartesian reference frame. Once the molecules are optimally oriented, the residue–residue correspondences are established by using simple criteria based on spatial proximity.
Such correspondences are therefore not sought by assuming any a priori sequence directionality among or within blocks of aligned residues in the two proteins which, in fact, can even be multimeric. The method is here applied and tested in several cases that aptly illustrate its non-sequential character and applicability in multiple alignment contexts. Finally, a detailed discussion of the returned nominal statistical significance, based on the database of Sierk and Pearson (2004), is presented. MISTRAL is freely available for academic users as a web server at http://ipht.cea.fr/protein.php where multiple alignments of up to 20 proteins (of chain length up to 500 amino acids) can be submitted. To align more or longer proteins users can obtain the MISTRAL program from the authors.

2 METHODS
The optimal structural superposition of two proteins A and B is identified by minimizing the following interaction energy term over their relative translations and orientations:

$$H_{AB} = \sum_{i \in A,\, j \in B} f_{i,j}\, f_{i+1,j+1}\, f_{i+2,j+2} \qquad (1)$$

where the indexes i and j run over all amino acids in proteins A and B, respectively, and f is a weight function rewarding short spatial separations of amino acid pairs. Specifically, we considered the following piecewise-linear sigmoidal weight,

$$f_{i,j} = \begin{cases} -1 & \text{if } d_{ij} \le \Delta/2 \\ -2\left(1 - \dfrac{d_{ij}}{\Delta}\right) & \text{if } \Delta/2 < d_{ij} < \Delta \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $d_{ij}$ is the separation of the Cα atoms of amino acids i and j, and $\Delta$ is the interaction range which controls the spatial tolerance of the resulting superposition. By default, $\Delta$ is set to 8 Å. The product of the f’s in Equation (1) embodies a multi-body interaction between pairs of amino acids that are nearby in sequence and consequently rewards the superposition of small protein fragments. The parameters over which the minimization is done are the six degrees of freedom necessary to specify the displacement and orientation of protein B relative to A, which is held fixed in space.
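As an illustration of Equations (1) and (2), the energy of a fixed relative placement of two chains can be evaluated directly from their Cα coordinates. The following is a minimal sketch, not the MISTRAL implementation: the function names, the argument name `delta` (standing for the interaction range) and the use of NumPy are assumptions made here, and chain interruptions or missing Cα atoms are not handled.

```python
import numpy as np

def weight(d, delta=8.0):
    """Piecewise-linear weight of Equation (2).

    Returns -1 for d <= delta/2, rises linearly to 0 at d = delta,
    and is 0 beyond the interaction range.
    """
    d = np.asarray(d, dtype=float)
    return np.where(d <= delta / 2, -1.0,
                    np.where(d < delta, -2.0 * (1.0 - d / delta), 0.0))

def interaction_energy(ca_a, ca_b, delta=8.0):
    """Energy of Equation (1) for two arrays of C-alpha coordinates (N x 3, M x 3)."""
    ca_a = np.asarray(ca_a, dtype=float)
    ca_b = np.asarray(ca_b, dtype=float)
    # Matrix of all C-alpha separations d_ij between the two chains.
    dist = np.linalg.norm(ca_a[:, None, :] - ca_b[None, :, :], axis=-1)
    f = weight(dist, delta)
    # Sum over i, j of f[i, j] * f[i+1, j+1] * f[i+2, j+2].
    return float(np.sum(f[:-2, :-2] * f[1:-1, 1:-1] * f[2:, 2:]))
```

Because each weight lies between -1 and 0, the triple product is never positive, so better superpositions give more negative energies. In an actual minimization this function would be re-evaluated after every trial rotation and translation of the second chain.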
The extensive search over the relative orientations was employed before in alignment contexts; in STRUCTAL (Gerstein and Levitt, 1998) the spatial positioning of a protein pair is iteratively guided by finding equivalent pairs of amino acids. Here, the exploration of the relative orientations is, instead, guided by the minimization of the energy function in Equation (1). The optimization is carried out within a simulated annealing scheme starting from a tentative orientation of the proteins obtained by optimally superimposing fragments of 10–20 amino acids (using longer fragments typically adds computational overhead without altering the result).

Notice that, unlike the case of effective pairwise amino acid interactions used, for example, in models for protein docking or protein folding, the potential energy f does not introduce any excluded-volume penalty at small separations. This allows protein chains to ‘pass through’ each other as the minimum of Equation (1) is sought.

We point out that the method is seamlessly applicable if one or both proteins to be aligned are multimeric (though it should be noted that ‘flexible’ alignment schemes are better suited to deal with arbitrary displacements of separate domains). In this case, those amino acids i and j for which a chain interruption occurs in the intervals [i,i+2] or [j,j+2] are excluded from the sum in Equation (1). The same summation restriction is applied when dealing with proteins with missing structural information about Cα atoms.

The energy-based approach is readily generalizable to perform multiple alignments where all proteins are treated in a symmetric, equal way. For example, in the case of alignment of three proteins A, B and C, the interaction energy to minimize is the sum of all pairwise interaction energies (possibly length-normalized):

$$H = H_{AB} + H_{BC} + H_{AC} \qquad (3)$$

The dimensionality of the parameter space entailed by the symmetric multiple alignment of n proteins is 6(n−1) as one protein, the largest for computational convenience, is held fixed in space while all possible rotations and translations are considered for the others. However, for the simplicity of the formulation, we shall not resort to the minimization of Equation (3), but will consider separately the possible n(n−1)/2 alignments and combine them a posteriori to identify the alignment core, as explained in a following section.

Finally, we observe that unless small values of the interaction range $\Delta$ are used, the ‘cooperative’ terms in Equation (1) can assign a favorable energy not only to superposable segments having the same sequence directionality in the two proteins but also (albeit to a lesser extent) to segments with opposite directionality. This property, which is illustrated in Section 3, further highlights that the energy-based alignment can be employed without any a priori assumptions on the sequentiality and sequence directionality of the blocks to be aligned. The energy function in Equation (1) is, however, generalizable to strongly promote the matches of segments with opposite directionality by substituting (or complementing) the term in Equation (1) with

$$\sum_{i \in A,\, j \in B} f_{i,j+2}\, f_{i+1,j+1}\, f_{i,j-1} \qquad (4)$$

2.1 Statistical significance of pairwise alignments
The optimized alignment energies of $10^4$ pairs of protein chains were analyzed to characterize their statistical distribution. These reference pairwise alignments were produced by randomly pairing the members of a comprehensive set of about 2000 protein chains comprising between 19 and 500 amino acids (the average length being 230) with limited mutual sequence identity. The average pairwise sequence identity (the number of identically aligned amino acids divided by the length of the shortest protein (May, 2004)) returned by Clustalw (Chenna et al., 2003) over the $10^4$ pairs was 25%.
No more than 80 pairs out of $10^4$ overcome the 40% threshold. In view of the limited sequence relatedness of the set, the $10^4$ pairings are expected to involve mostly structurally unrelated protein chains (Chothia, 1984; Chothia and Lesk, 1986; Gan et al., 2002; Levitt and Gerstein, 1998; Wood and Pearson, 1999) and are thus taken as a random background reference to calibrate the statistical significance of structural alignments.

As a first step, the $10^4$ minimized pairwise energies were divided by the length of the longest of the two proteins, $l_{\max}$. This energy rescaling favors the detection of matches that involve a substantial fraction of the longest protein. This is desirable as short proteins can be fitted against much longer ones which may have a different overall topological organization. The rescaled energies were found to have a residual dependence on $l_{\max}$. In particular, the energies were shifted toward smaller absolute values for increasing $l_{\max}$, as visible in the histograms of Figure 1a and b. This is consistent with the intuitive expectation that it is easier to find random alignments involving, e.g. 30% of a 100-residue protein, than 30% of a 300-residue one.

Fig. 1. Distribution of length-rescaled alignment energies (notice the minus sign) for random protein pairs where the length of the longest protein, $l_{\max}$, is in the range (a) [25–75] and (b) [275–325]. The superposed curves provide the Gaussian and Gumbel fits to the histograms obtained by minimizing the $\chi^2$ (assuming a Poisson uncertainty on the histogram heights) computed only over the bins with energy more favorable than average, so as to reproduce accurately the extremal tail of the distributions. (c) Trend of the theoretical and observed P-values as a function of the z-score for the set of ≈ $10^4$ random alignments of pairs with sequence identity <40%.

The length dependence of the rescaled alignment energies was characterized by fitting separately the energy distributions involving various intervals of $l_{\max}$. Among the commonly employed statistical distributions it was found that the Gumbel extreme-value statistics, which often applies to alignment scores (Karlin and Altschul, 1990), provided a $\chi^2$ of order unity for the considered intervals of $l_{\max}$. The quality of the Gumbel fits is visible in Figure 1, where it can be compared against the Gaussian distribution which yields $\chi^2$ values larger by a factor of about 4. The Gumbel distribution:

$$\mathrm{prob}(E/l_{\max}) = \frac{1}{a_2}\, e^{(E/l_{\max} + a_1)/a_2}\, \exp\!\left[-\exp\!\left[(E/l_{\max} + a_1)/a_2\right]\right], \qquad (5)$$

where E is the energy of the alignment and $a_1$ and $a_2$ are the so-called location and scale parameters, was therefore taken as a viable statistical model for the statistics of rescaled alignment energies. The analysis of the Gumbel fits for various length ranges (see e.g. Fig. 1a and b) indicated that the parametric dependence of the $a_1$ and $a_2$ parameters on $l_{\max}$ can be approximated as

$$a_1 = 0.867 - 9.1 \times 10^{-4} \cdot l_{\max}, \qquad a_2 = 0.1934 - 3.8 \times 10^{-5} \cdot l_{\max} \qquad (6)$$

Based on the above considerations, the statistical significance of the alignment of two proteins is equivalently conveyed by the Gumbel z-score:

$$z(E) = \frac{-E/l_{\max} - a_1 - \gamma a_2}{\pi a_2 / \sqrt{6}}, \qquad (7)$$

where $\gamma \approx 0.5772$, or the associated P-value:

$$p(E) = 1.0 - \exp\!\left[-\exp\!\left[-\frac{\pi}{\sqrt{6}}\, z(E) - \gamma\right]\right] \qquad (8)$$

where $a_1$ and $a_2$ have the aforementioned dependence on the length of the longest of the two proteins. The z-score measures the separation (in terms of SDs) of the observed alignment energy from the random average. The P-value, instead, provides the expected probability to have a random alignment with interaction energy equal or better (i.e. more negative) than the observed one.
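The z-score and P-value of Equations (7) and (8), with the length-dependent parameters of Equation (6), can be evaluated in a few lines. This is a minimal sketch under the stated default parameterization (interaction range of 8 Å); the function and variable names are chosen here for illustration and are not part of MISTRAL.

```python
import math

EULER_GAMMA = 0.5772  # value of gamma quoted in the text

def gumbel_params(l_max):
    """Location (a1) and scale (a2) parameters of Equation (6)."""
    a1 = 0.867 - 9.1e-4 * l_max
    a2 = 0.1934 - 3.8e-5 * l_max
    return a1, a2

def z_score(energy, l_max):
    """Gumbel z-score of Equation (7) for an alignment energy (negative is better)."""
    a1, a2 = gumbel_params(l_max)
    return (-energy / l_max - a1 - EULER_GAMMA * a2) / (math.pi * a2 / math.sqrt(6))

def p_value(energy, l_max):
    """P-value of Equation (8): probability of a random energy at least as favorable."""
    z = z_score(energy, l_max)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny (large z).
    return -math.expm1(-math.exp(-math.pi / math.sqrt(6) * z - EULER_GAMMA))
```

A more negative energy for the same $l_{\max}$ gives a larger z-score and a smaller P-value, as expected from the definitions above.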
The viability and consistency of the above analysis to model the statistics of random alignment energies is verified a posteriori over the reference dataset by comparing the expected and observed P-values, as a function of the z-score threshold. The observed P-value is straightforwardly calculated as the fraction of alignments whose energy has a z-score larger than the assigned threshold; the 80 pairs (out of $10^4$) having sequence identity >40% were excluded from the analysis. The result is shown in Figure 1c and indicates that the expected and observed P-values are in very close agreement up to z-scores of about 3. Beyond this range, the differences are nevertheless limited, as the observed and expected numbers of alignments are still of the same order of magnitude even for a z-score of 4.5 (the relative difference being a factor of 3). The fact that, at these values of z-score, the small number of observed alignments exceeds the expected one is also consistent with the fact that, despite the limited mutual sequence identity of members of the set, a limited number of genuine pairwise structural correspondences is still expected to occur in the dataset. The above considerations indicate that the parameterized Gumbel statistics provides a consistent and viable model for estimating the significance of the structural alignments. An independent assessment of the viability of the statistical significance is presented in Section 3.

As the characterization of the statistics is particularly time consuming (both in terms of computation and data analysis), it was carried out only for pairwise alignments obtained with the default interaction range, $\Delta$ = 8 Å. Consequently, while further extensions of the statistical analysis may be undertaken in the future, presently the z-scores and P-values are provided by MISTRAL only for pairwise alignments carried out with the default interaction range.
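The a posteriori check described in this section, comparing observed and expected P-values at a given z-score threshold, can be sketched as follows. The function names and the plain-list input are assumptions made for illustration; the expected value simply restates Equation (8) as a function of the threshold.

```python
import math

def observed_p_value(z_scores, threshold):
    """Fraction of reference alignments whose z-score exceeds the threshold."""
    return sum(z > threshold for z in z_scores) / len(z_scores)

def expected_p_value(threshold, gamma=0.5772):
    """Gumbel P-value of Equation (8) evaluated at z = threshold."""
    return -math.expm1(-math.exp(-math.pi / math.sqrt(6) * threshold - gamma))
```

Tabulating both quantities over a range of thresholds would give the kind of comparison shown in Figure 1c.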
2.2 One-to-one amino acid correspondences
The direct output of the minimization of Equation (1) or Equation (3) is the best spatial positioning of two or more proteins. The output is further processed to return one-to-one correspondences among amino acids in different proteins. The post-processing consists of an iterative scheme which attempts to match segments of length l in the two proteins. For two such segments, (i,i′) and (j,j′), with |i−i′| = |j−j′| = l, amino acids are tentatively matched in the scheme i : j; i+1 : j+1; ... or i : j′; i+1 : j′−1; .... The latter type of matching can be disallowed in MISTRAL if reversal of sequence directionality in matching segments is not considered appropriate. The segments represent a putative match if they contain at least two corresponding amino acids at distance below the ‘seed’ threshold of 3.5 Å, and no pair at distance beyond the ‘alignment tolerance’ threshold of 4.0 Å. Among all viable segment pairs of length l, (i,i+l) and (j,j±l), the one with the lowest RMSD is picked and assigned. Once this assignment is made, another search is performed for other viable segments of the same length. If no other viable segment is found, l is decreased by one unit and the search is repeated. The search is run starting from the maximum possible value of l = min(lA,lB) and stopping at l = 4, which is the minimum considered length for aligned segments. The relative position and orientation of the two proteins obtained by the minimization of the interaction energy is finally adjusted by optimally superimposing the corresponding amino acids.

In the present online and distributed version of MISTRAL, multiple alignments are constructed by computing separately all one-to-one correspondences in the possible pairwise alignments of the set.
Each protein in the set is next considered in turn as a tentative pivot for the multiple alignment and the size of the alignment core is computed. The latter is given by the set of amino acids that have a matching amino acid in all pairwise alignments with the other proteins in the set. The protein with the largest core is picked as pivot for the multiple alignment and relative position of the proteins is finally adjusted by optimally superimposing the core-corresponding region of each non-pivotal protein over the core of the pivot one. Besides the size of the core, the quality of the resulting alignment is conveyed by calculating the RMSD of all pairs of corresponding amino acids; e.g. if n proteins share a core of N residues, the number of spatial distances considered for the average is N n(n−1)/2. 3 RESULTS The MISTRAL alignment strategy complements traditional approaches for structural alignments in two main respects. On the one hand structural alignment between two or more proteins is identified by minimizing an energy function in a very low-dimensional parameter space (e.g. six parameters per pairwise alignment) rather than exploring the highly dimensional combinatorial space of possible amino acid correspondences. This aspect entails not only a simpler formulation and implementation of the alignment algorithm, but also makes the algorithm free of constraints on the sequentiality of the matched blocks, and also their sequence directionality, that are employed in several alignment methods. A further point is the fact that the distribution of MISTRAL alignment scores (energies) of random pairs of proteins chain is well modeled by the extreme value (Gumbel) statistics, implying a good control of the statistical significance of the found alignments. This property represents an important aspect to the method as appreciable discrepancies in the expected and observed significance of various alignment schemes were reported. 
In the following, we shall illustrate and discuss the generality and performance of the MISTRAL algorithm by considering various cases for pairwise and multiple structural alignments. 3.1 Validation against curated pairwise alignments We first discuss the performance of MISTRAL in a context of curated pairwise alignments taken from the SISYPHUS database (Andreeva et al., 2007). Specifically, the reference set consists of 69 protein pairs recently used by Mayr et al. (2007) in a comparative evaluation of six alignment methods: CE, DALI, FATCAT, CA, MATRAS (Kawabata, 2003) and SHEBA (Jung and Lee, 2000). The accuracy of each alignment is measured by computing the fraction of reference amino acid pairings that are correctly aligned by MISTRAL. The overall performance of the method is conveyed by the average and median accuracy calculated over the 69 queries. The average accuracy was found to be 66% and the median one was 82%. These values were obtained for the default parameters (seed tolerance and alignment tolerance equal to 3.5 Å and 4.0 Å, respectively). A detailed profiling of the performance in a wide range of parameters are provided in the Supplementary Material. The accuracy of MISTRAL compares well against the performance of the six alignments methods considered by (Mayr et al., 2007) for which the typical average accuracy is in the range of 51–76% and the median one is in the range of ∼ 72–91%. In particular, for both the average and median accuracy, MISTRAL is behind only DALI (average: 76%, median: 91%) and MATRAS (average: 67%, median: 88%). An important additional element to take into account in the above evaluation is the average length of the alignments, as the latter impacts on the method sensitivity. In this respect, it is noted that the average length of MISTRAL alignment is 100, which is appreciably closer to the average reference length (77) than, for example, DALI which has an average alignment length larger than 115. 
Finally, we applied the method of Kim and Lee (2007) and computed how the fraction of correctly aligned residues, Fcar, changes if a maximum shift error (mismatch along the sequence) is allowed. The average Fcar was computed for maximum shift errors ranging from 0 to 10, see Supplementary Material. The performance improvement of MISTRAL for increasing shift error is in line with that of most alignment methods considered by Kim and Lee (2007). In particular, increasing the allowed shift from 0 (no mismatch allowed) to 4 takes the average Fcar from 66% to 75%, while the median Fcar increases from 82% to 90%.

3.2 Non-sequentiality of MISTRAL alignments
We now discuss the capability of MISTRAL to deal with non-sequential alignments. The method can, in fact, detect correspondences between segments with arbitrary order in the matching proteins and is tolerant to their sequence directionality. In addition, it is seamlessly applicable to multimeric proteins. These features are illustrated and discussed in the context of pairwise alignments between two representatives of the OB-fold family, which gathers proteins that bind nucleic acids. Canonical OB-fold members can exhibit considerable variability for both structure and nucleic acid binding modes (Theobald et al., 2003), yet they share a common spatial organization of several β-strands, which is highlighted in Figure 2a for protein RPA70. By means of NMR techniques, it has recently been possible to resolve the structure of nucleic acid binding proteins that adopt a non-canonical OB-fold (de Chiara et al., 2005). An example is provided by ATX:HBP1 (PDB: 1v06) which is represented in Figure 2b. From Figure 2a, it can be perceived that the sequence order of the β-strands and also their directionality is different with respect to the canonical OB-fold.
The visual inspection of Figure 2a and b, however, suggests that a general alignment algorithm, insensitive to the succession and directionality of the matching regions, ought to detect the superposability of the two structures (Zen et al., 2009). This expectation is indeed confirmed by their MISTRAL alignment which yielded a set of 59 corresponding amino acids (with an RMSD of 2.0 Å) covering intervals with different sequence order and directionality in the two proteins (Fig. 2c and d).

Table 1. Set of proteins used in MISTRAL multiple alignments

Set  PDB IDs                                                                       Avg. length  Core size  RMSD
1    1hhoA, 1hhoB, 1mbd, 2dhbA, 2dhbB                                              145          136        1.4 Å
2    1dlw, 1dly, 1eco, 1hhoA, 1hhoB, 1idrA, 1mbd, 2dhbA, 2dhbB, 2lh7, 4vhbA        138          72         2.1 Å
3    1a75A, 1omd, 1pal, 1pvaA, 1rtp1, 5cpv, 5pal                                   108          100        0.7 Å
4    1bmv1, 1cwpA, 1smvA, 2bukA, 2tbvA, 4sbvA                                      200          54         2.0 Å

The length and RMSD of the aligned core are provided.

Fig. 2. (a–d) Alignment of canonical and non-canonical OB-folds. Cartoon representation of (a) a canonical OB-fold from protein RPA70 (PDB: 1jmc residues 183–298) and (b) a non-canonical OB-fold from protein ATX:HBP1 (PDB: 1v06). The color coding reflects the indexing of the amino acids in the proteins. (c) MISTRAL alignment of the proteins in (a) and (b) which are shown, respectively, in blue and yellow color. Matching regions, comprising 62 amino acids, were highlighted with a thick backbone. (d) 2D matrix of corresponding amino acids in the aligned proteins. Blocks where amino acids have the same (opposite) sequence directionality appear as bands parallel (orthogonal) to the diagonal of the matrix. The multiple structural alignments returned by MISTRAL for the Sets 1–4 in Table 1 are shown in (e–h), respectively. A different color is used to represent the backbone (thin lines) of each protein in the set. The thick red regions correspond to the aligned core. Protein structures have been rendered with the VMD package (Humphrey et al., 1996).

3.3 Multiple alignments
As anticipated in Section 2, the algorithm can be used for multiple alignments either by combining information from all pairwise alignments among the provided proteins or through the minimization of the energy function in Equation (3) that treats simultaneously, and on equal footing, all possible protein pairings. In both cases, it is necessary to evaluate n(n−1)/2 energy terms, n being the number of proteins to align. The difference between the two approaches is that, in the first case, the minimization is carried over a 6D space for each of the n(n−1)/2 alignments, while in the second case, the minimization is over a 6(n−1)-dimensional parameter space. By comparison against available multiple structural alignment algorithms, we found that the first strategy, based on the prior computation of pairwise alignments, was adequate to detect known common alignment cores, and therefore we adopted it for its simplicity.

We considered four groups of proteins previously used as test cases for multiple alignments. In particular, we considered two sets of globins, which had been originally the object of manually curated structural superpositions (Lesk and Chothia, 1980), and two sets of the Homstrad database (Mizuguchi et al., 1998). The list of proteins defining the four groups is provided in Table 1, along with quantitative indicators of their multiple alignment. The resulting multiple structural superpositions are shown in Figure 2e–h.

We first discuss the two sets of globins, previously considered by Konagurthu et al. (2006) when evaluating the performance of the Mustang method. The proteins in Set 1 are highly homologous and hence admit an almost complete structural superposition. Their alignment core (Fig. 2e) consists of 136 amino acids, with an average RMSD of 1.4 Å.
The second set of globins, which comprises those in Set 1, is more heterogeneous, a fact reflected in a smaller core. The latter, in fact, consists of 72 amino acids with an average RMSD of 2.1 Å (Fig. 2f). The running time over the two sets on a present-day personal computer was 16 s for Set 1 and 80 s for Set 2. Over the same sets Mustang, POSA and CE-MC yielded a core of 138–139 amino acids at an average RMSD of 1.4–1.5 Å for Set 1 and a core of 90–92 amino acids at an average RMSD of 2.2–2.5 Å for Set 2. As already remarked in the context of the MISTRAL selectivity about SISYPHUS reference alignments, the method returns somewhat smaller cores than other alignment methods, but the associated RMSD is also smaller. Indeed, over the sets in Table 1, the values of RMSD100 [which provides a length-rescaled RMSD (Carugo and Pongor, 2001)] of MISTRAL are smaller than those of the mentioned alignment methods. The results refer to the default parameters for the interaction range, $\Delta$ = 8 Å, seed tolerance of 3.5 Å and alignment tolerance of 4.0 Å, which can be adjusted for specific needs (see Supplementary Material for an illustration of the dependence of alignment length on the parameters). For example, setting the tolerance to 5 Å returns a core of 138 residues with 1.7 Å RMSD for Set 1 and a core of 85 residues with 2.5 Å RMSD for Set 2.

Finally, we consider the multiple alignment of two groups of proteins from the Homstrad database. The first set (Set 3 in Table 1) comprises seven proteins, while the second, Set 4, gathers six proteins. Their multiple alignment returns a core of 100 residues with an average RMSD of 0.7 Å for the first set (Fig. 2g), and 54 residues with average RMSD of 2.0 Å for the second one (Fig. 2h).
In both cases, the results compare well with those of MULTIPROT, which yields cores of 101–107 residues for Set 3 and of about 33–76 residues for Set 4 upon varying the algorithm alignment tolerance threshold from 3 Å to 4 Å.

3.4 Selectivity and sensitivity of the method

As discussed in Section 2, the distribution of the alignment energy of random pairs of protein chains is well described by extreme value (Gumbel) statistics. This fact indicates that the alignment energy of MISTRAL can adequately capture the statistical significance of structural alignments. The need to develop reliable statistical scores for structural alignments emerged some years ago from the work of Sierk and Pearson (2004), who showed that structural alignment methods were largely overconfident in the expected statistical significance of the alignments (which was shown to exceed the observed one even by orders of magnitude). As a further test of the viability of the statistical significance provided by MISTRAL for pairwise alignments, we have carried out statistical tests on the same dataset considered by Sierk and Pearson (2004). The use of the same dataset allows a straightforward comparison of the performance of MISTRAL relative to other methods and of its statistical significance. The dataset of Sierk and Pearson (2004) consists of 86 protein domains used as queries and of a database of 2771 domains (including the queries themselves). Each query was run against all members of the database, thus returning 86×2771 pairwise alignments, which include 1111 pairings of domains sharing the same CATH code [following Sierk and Pearson (2004), these pairings are referred to as homologs]. The CATH subdivision in groups of homologs is taken as the ‘gold standard’ for defining the ‘true’ structural correspondences present in the set. While this is a natural criterion, it should be borne in mind that it is not free of limitations (Kolodny et al., 2005; Sippl and Wiederstein, 2008).
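To illustrate how a Gumbel description of the scores turns a z-score into a significance estimate, the following minimal Python sketch computes the upper-tail probability of a Gumbel distribution. It assumes that the z-score has been standardized to zero mean and unit variance and that larger z-scores indicate more significant alignments; the normalization actually adopted in Section 2 may differ, and the function name `gumbel_p_value` is hypothetical.

```python
import math

# Euler-Mascheroni constant: it enters the mean of a Gumbel distribution.
EULER_GAMMA = 0.5772156649015329


def gumbel_p_value(z: float) -> float:
    """Probability that a standardized Gumbel variable is >= z.

    Assumption (not taken from the paper): z has zero mean and unit
    variance, so the reduced Gumbel variable is (pi/sqrt(6)) * z + gamma.
    """
    t = math.pi / math.sqrt(6.0) * z + EULER_GAMMA
    if t < -700.0:
        # exp(-t) would overflow; the tail probability is 1 to machine precision.
        return 1.0
    # 1 - exp(-exp(-t)), written with expm1 to stay accurate for tiny P-values.
    return -math.expm1(-math.exp(-t))


if __name__ == "__main__":
    for z in (0.0, 2.0, 5.0, 10.0):
        print(f"z = {z:5.1f}  P = {gumbel_p_value(z):.3e}")
```

For large z the P-value decays as a simple exponential in z, which is much slower than the Gaussian tail; treating extreme-value scores as Gaussian would therefore overstate their significance.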
We first used the dataset to analyze the overall sensitivity and selectivity of MISTRAL. For a given level of statistical significance, a more sensitive method will identify a larger number of correct correspondences (homologs) among a given set of pairwise alignments. A more selective method will instead yield a smaller number of false positives (non-homologs). As customary, both features are assessed by plotting the errors per query (EPQ) versus coverage (Brenner et al., 1998). Specifically, the set of alignments was ordered by decreasing z-score and the number of correct homolog pairs, n_c, among the top n alignments was computed. For a given level of significance (z-score), the EPQ is given by (n−n_c)/86. The coverage, which is the detected fraction of correct alignments, is instead given by n_c/1111. The plot is shown in Figure 3a. The resulting performance is consistent with the average one exhibited by the seven structural alignment methods considered by Sierk and Pearson (2004) (which included methods that were sequential and not applicable to multimeric proteins). In particular, the coverage at the threshold values of 0.1, 1 and 10 EPQ is 21%, 32% and 51%, respectively. The average coverage values found among the seven methods analyzed in Sierk and Pearson (2004) for the same EPQ thresholds are approximately 15%, 30% and 50%, respectively. As for several of the methods considered in the mentioned study, most of the top false positives of MISTRAL correspond to topologs, that is, pairs of domains that share the same CATH topology. In fact, among the top 100 non-homolog pairs, there are 71 topologs.

Fig. 3. (a) EPQ versus coverage curve computed from MISTRAL alignments of the dataset used in Sierk and Pearson (2004). (b) Comparison of the expected and observed number of false positives in the same dataset.
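The EPQ versus coverage construction described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping: the function name `epq_coverage_curve` and the toy input are hypothetical, and the defaults of 86 queries and 1111 homolog pairs are the dataset sizes quoted in the text.

```python
from typing import Iterable, List, Tuple


def epq_coverage_curve(
    scored_pairs: Iterable[Tuple[float, bool]],
    n_queries: int = 86,
    n_homologs: int = 1111,
) -> List[Tuple[float, float, float]]:
    """Return (z_score, EPQ, coverage) for every cutoff along the ranking.

    scored_pairs holds one (z_score, is_homolog) tuple per pairwise alignment.
    """
    # Rank the alignments from the most to the least significant.
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    curve = []
    n_correct = 0  # correct homolog pairs among the top n alignments
    for n, (z_score, is_homolog) in enumerate(ranked, start=1):
        if is_homolog:
            n_correct += 1
        epq = (n - n_correct) / n_queries   # false positives per query
        coverage = n_correct / n_homologs   # detected fraction of homologs
        curve.append((z_score, epq, coverage))
    return curve


if __name__ == "__main__":
    # Toy data: four alignments, two queries, three true homolog pairs.
    toy = [(5.0, True), (9.0, True), (6.0, False), (7.5, True)]
    for z, epq, cov in epq_coverage_curve(toy, n_queries=2, n_homologs=3):
        print(f"z >= {z:4.1f}  EPQ = {epq:.2f}  coverage = {cov:.2f}")
```

Each point of the curve corresponds to accepting all alignments with a z-score at or above the listed value, so reading the coverage at fixed EPQ values such as 0.1, 1 and 10 amounts to scanning the returned list for the last entry below each threshold.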
As a complement to the sensitivity/selectivity analysis, we examined the reliability of the statistical significance of the alignments. This is an important complementary aspect of the above analysis, as it aims at clarifying whether the incidence of false positives can be kept under a priori control. In this respect, all structural alignment methods considered in Sierk and Pearson (2004) proved to be ‘overconfident’, in that the number of encountered top false positives was found to exceed by orders of magnitude the one expected on the basis of the score statistics. We have undertaken this analysis by following the method of Brenner et al. (1998) which, at variance with the approach of Sierk and Pearson (2004), is not limited to the first false positive of each query. The method we used consists of comparing the number of observed and expected EPQ (non-homologs) as a function of the z-score threshold. The number of expected EPQ is obtained by multiplying the P-value by the size of the database. The resulting curve is shown in Figure 3b. It is seen that the number of expected false positives (or, equivalently, of EPQ) follows the observed one within about a factor of 3. In agreement with what we discussed in Section 2, the results of Figure 3b provide a clear indication that, also on this comprehensive set of 86×2771 alignments (which is unrelated to the one used for modeling and calibrating the alignment score), the returned z-scores and P-values provide an adequate estimate of the statistical significance of the alignments.

4 CONCLUSIONS

We presented and discussed MISTRAL, a novel multiple structural alignment method based on the minimization of a scoring function representing a suitable pairwise interaction energy between the given proteins. The parameters over which the energy function is minimized are the relative rotations and translations of the molecules.
The dimensionality of the phase space to be explored during the minimization is consequently drastically limited (e.g. six parameters per pairwise alignment) compared with the high-dimensional combinatorial space of possible amino acid correspondences. After a suitable normalization depending on the proteins’ lengths, the optimized energies returned by MISTRAL for alignments of unrelated proteins are found to follow closely the Gumbel extreme-value distribution. This property is reflected in a good a priori control of the statistical significance of the method, as confirmed by running MISTRAL over a standard reference database of protein domains (Sierk and Pearson, 2004).

ACKNOWLEDGEMENTS

We thank A. Capdepon for setting up the MISTRAL web server and R. Potestio and A. Zen for useful discussions.

Funding: The Italian Ministry of Education, grant PRIN.

Conflict of Interest: none declared.

REFERENCES

Andreeva,A. et al. (2007) SISYPHUS–structural alignments for proteins with nontrivial relationships. Nucleic Acids Res., 35, 253–259.
Bachar,O. et al. (1993) A computer vision based technique for 3D sequence-independent structural comparison of proteins. Protein Eng., 6, 279–288.
Bashton,M. and Chothia,C. (2007) The generation of new protein functions by the combination of domains. Structure, 15, 85–99.
Brenner,S.E. et al. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078.
Camproux,A.C. et al. (2004) A hidden Markov model derived structural alphabet for proteins. J. Mol. Biol., 339, 591–605.
Carugo,O. and Pongor,S. (2001) A normalized root-mean-square distance for comparing protein three-dimensional structures. Protein Sci., 10, 1470–1473.
Chenna,R. et al. (2003) Multiple sequence alignment with the Clustal series of programs.
Nucleic Acids Res., 31, 3497–3500.
Chothia,C. (1984) Principles that determine the structure of proteins. Annu. Rev. Biochem., 53, 537–572.
Chothia,C. and Lesk,A.M. (1986) The relation between the divergence of sequence and structure in proteins. EMBO J., 5, 823–826.
de Chiara,C. et al. (2005) The AXH domain adopts alternative folds: the solution structure of HBP1 AXH. Structure, 13, 743–753.
Dror,O. et al. (2003) MASS: multiple structural alignment by secondary structures. Bioinformatics, 19 (Suppl. 1), 95–104.
Fong,J.H. et al. (2007) Modeling the evolution of protein domain architectures using maximum parsimony. J. Mol. Biol., 366, 307–315.
Gan,H.H. et al. (2002) Analysis of protein sequence/structure similarity relationships. Biophys. J., 83, 2781–2791.
Gerstein,M. and Levitt,M. (1998) Comprehensive assessment of automatic structural alignment against a manual standard. Protein Sci., 7, 445–456.
Guerler,A. and Knapp,E.W. (2008) Novel protein folds and their nonsequential structural analogs. Protein Sci., 17, 1374–1382.
Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123–138.
Holm,L. et al. (1992) A database of protein structure families with common folding motifs. Protein Sci., 1, 1691–1698.
Humphrey,W. et al. (1996) VMD: visual molecular dynamics. J. Mol. Graph., 14, 33–38.
Ilyin,V.A. et al. (2004) Structural alignment of proteins by a novel TOPOFIT method. Protein Sci., 13, 1865–1874.
Jung,J. and Lee,B. (2000) Protein structure alignment using environmental profiles. Protein Eng., 13, 535–543.
Karlin,S. and Altschul,S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA, 87, 2264–2268.
Kim,C. and Lee,B. (2007) Accuracy of structure-based sequence alignment of automatic methods. BMC Bioinformatics, 8, 355–355.
Kawabata,T. (2003) MATRAS: a program for protein 3D structure comparison.
Nucleic Acids Res., 31, 3367–3369.
Koehl,P. (2001) Protein structure similarities. Curr. Opin. Struct. Biol., 11, 348–353.
Kolbeck,B. et al. (2006) Connectivity independent protein-structure alignment. BMC Bioinformatics, 7, 510–510.
Kolodny,R. et al. (2002) Small libraries of protein fragments model native protein structures accurately. J. Mol. Biol., 323, 297–307.
Kolodny,R. et al. (2005) Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol., 346, 1173–1188.
Konagurthu,A.S. et al. (2006) MUSTANG: a multiple structural alignment algorithm. Proteins, 64, 559–574.
Lesk,A.M. (2004) Introduction to Protein Science: Architecture, Function and Genomics. Oxford University Press, Oxford, UK.
Lesk,A.M. and Chothia,C. (1980) How different amino acid sequences determine similar protein structures. J. Mol. Biol., 136, 225–270.
Levitt,M. and Gerstein,M. (1998) A unified statistical framework for sequence comparison and structure comparison. Proc. Natl Acad. Sci. USA, 95, 5913–5920.
Lichtarge,O. and Sowa,M.E. (2002) Evolutionary predictions of binding surfaces and interactions. Curr. Opin. Struct. Biol., 12, 21–27.
Liu,X. et al. (2008) CLEMAPS: multiple alignment of protein structures based on conformational letters. Proteins, 71, 728–736.
May,A. (2004) Percent sequence identity, the need to be explicit. Structure, 12, 737–738.
Mayr,G. et al. (2007) Comparative analysis of protein structure alignments. BMC Struct. Biol., 7, 50–50.
Micheletti,C. et al. (2000) Recurrent oligomers in proteins: an optimal scheme reconciling accurate and concise backbone representations in automated folding and design studies. Proteins, 40, 662–674.
Mizuguchi,K. et al. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, 2469–2471.
Mosca,R. and Schneider,T.R. (2008) RAPIDO: a web server for the alignment of protein structures in the presence of conformational changes.
Nucleic Acids Res., 36 (Web Server issue), 42–46.
Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
Orengo,C.A. et al. (1997) CATH–a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
Ortiz,A.R. et al. (2002) MAMMOTH: an automated method for model comparison. Protein Sci., 11, 2606–2621.
Phillips,D.C. (1970) The development of crystallographic enzymology. Biochem. Soc. Symp., 30, 11–28.
Schwarzenbacher,R. et al. (2008) The JCSG MR pipeline. Acta Crystallogr. D Biol. Crystallogr., 64 (Pt 1), 133–140.
Shatsky,M. et al. (2004a) A method for simultaneous alignment of multiple protein structures. Proteins, 56, 143–156.
Shatsky,M. et al. (2004b) FlexProt: alignment of flexible protein structures without a predefinition of hinge regions. J. Comput. Biol., 11, 83–106.
Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.
Sierk,M.L. and Pearson,W.R. (2004) Sensitivity and selectivity in protein structure comparison. Protein Sci., 13, 773–785.
Sippl,M.J. and Wiederstein,M. (2008) A note on difficult structure alignment problems. Bioinformatics, 24, 426–427.
Theobald,D.L. et al. (2003) Nucleic acid recognition by OB-fold proteins. Annu. Rev. Biophys. Biomol. Struct., 32, 115–133.
Wood,T.C. and Pearson,W.R. (1999) Evolution of protein sequences and structures. J. Mol. Biol., 291, 977–995.
Xie,L. and Bourne,P.E. (2008) Detecting evolutionary relationships across existing fold space. Proc. Natl Acad. Sci. USA, 105, 5441–5446.
Ye,Y. and Godzik,A. (2003) Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics, 19, 246–255.
Ye,Y. and Godzik,A. (2005) Multiple flexible structure alignment using partial order graphs. Bioinformatics, 21, 2362–2369.
Yuan,X. and Bystroff,C.
(2005) Non-sequential structure-based alignments reveal topology-independent core packing arrangements in proteins. Bioinformatics, 21, 1010–1019.
Zen,A. et al. (2009) Using dynamics-based comparisons to predict nucleic acid binding sites in proteins: an application to OB-fold domains. Bioinformatics, 25, 1010–1019.