MISTRAL: a tool for energy-based multiple

BIOINFORMATICS
ORIGINAL PAPER
Vol. 25 no. 20 2009, pages 2663–2669
doi:10.1093/bioinformatics/btp506
Structural bioinformatics
MISTRAL: a tool for energy-based multiple structural
alignment of proteins
Cristian Micheletti1,∗ and Henri Orland2
1 SISSA,
2 Institut
CNR-INFM Democritos and Italian Institute of Technology, Via Beirut 2-4, 34014 Trieste, Italy and
de Physique Théorique, CEA, IPhT, F- 91191 Gif-sur-Yvette, France
Received on May 11, 2009; revised on July 20, 2009; accepted on August 13, 2009
Advance Access publication August 19, 2009
Associate Editor: Thomas Lengauer
ABSTRACT
Motivation:
The steady growth of the number of available
protein structures has constantly motivated the development of
new algorithms for detecting structural correspondences in proteins.
Detecting structural equivalences in two or more proteins is
computationally demanding as it typically entails the exploration
of the combinatorial space of all possible amino acid pairings in
the parent proteins. The search is often aided by the introduction
of various constraints such as considering protein fragments,
rather than single amino acids, and/or seeking only sequential
correspondences in the given proteins. An additional challenge is
represented by the difficulty of associating to a given alignment, a
reliable a priori measure of its statistical significance.
Results: Here, we present and discuss MISTRAL (Multiple
STRuctural ALignment), a novel strategy for multiple protein
alignment based on the minimization of an energy function over the
low-dimensional space of the relative rotations and translations of the
molecules. The energy minimization avoids combinatorial searches
and returns pairwise alignment scores for which a reliable a priori
statistical significance can be given.
Availability: MISTRAL is freely available for academic users as a
standalone program and as a web service at http://ipht.cea.fr/protein
.php.
Contact: [email protected]
Supplementary information: Supplementary data are available at
Bioinformatics online.
1
INTRODUCTION
One of the ubiquitous bioinformatics tasks is the alignment of protein
sequences and structures. The development of these computational
tools has a long tradition and was especially motivated by the
interest in tracing the evolutionary relationships of various protein
families (Needleman and Wunsch, 1970; Phillips, 1970). Sequence
alignment methods are particularly useful for this purpose when
the sequence identity of the involved proteins is larger than ∼30%.
Below this threshold there is no stringent relationship between
sequence and structural similarity (Chothia, 1984; Chothia and
Lesk, 1986; Gan et al., 2002; Lesk, 2004; Wood and Pearson,
1999) and the detection of evolutionary and/or functional relatedness
is appropriately complemented by structural alignment techniques
∗ To
whom correspondence should be addressed.
(Koehl, 2001; Lichtarge and Sowa, 2002). Structural alignment,
which is the focus of the present study, is applied in several
other contexts including the grouping and classification of known
protein folds, the identification of conserved structural fragments for
molecular replacement in crystallography and homology modelling
(Holm et al., 1992; Orengo et al., 1997; Schwarzenbacher et al.,
2008).
The result of a pairwise alignment procedure consists of a oneto-one correspondence list between amino acids of the two proteins.
In rigid structural alignments, the two sets of corresponding Cα
(or mainchain) atoms can be put in good spatial proximity after
an optimal structural superposition which minimizes the root mean
square distance (RMSD). Indeed, within these approaches the
best alignment is obtained by minimizing a suitable measure of
geometric compatibility, such as the similarity of distance matrices
in DALI, over the possible amino acid pairings (Holm and Sander,
1993) or weighted sums of distances of equivalent Cα’s in CE
(Shindyalov and Bourne, 1998) or in MAMMOTH (Ortiz et al.,
2002). Amino acid correspondences can also be established without
resorting to criteria of rigid spatial superposability, as in ‘flexible’
alignment methods such as RAPIDO, FlexProt, FATCAT and POSA
(Mosca and Schneider, 2008; Shatsky et al., 2004b; Ye and Godzik,
2003, 2005).
As the combinatorial space of possible amino acid
correspondences is enormous, the search space for the optimal
alignment is often limited by the introduction of various constraints.
For instance, pairings can be sought not between single amino
acids but among fragments or secondary structure elements of
suitable minimal length (Camproux et al., 2004; Kolodny et al.,
2002; Micheletti et al., 2000) such as in DALI, CE, MAMMOTH,
GANGSTA (Kolbeck et al., 2006) or CLEMAPS (Liu et al., 2008).
A second commonly employed constraint is the requirement
of sequentiality, that is subsequent amino acids in one protein
must correspond to subsequent amino acids in the partner protein.
This restriction, while computationally convenient, can impair the
capability to detect evolutionary relationships (Xie and Bourne,
2008) as new protein structures can arise from the combination and
permutation of subdomains (Bashton and Chothia, 2007; Fong et al.,
2007). The number of structural alignment methods that are nonsequential is still limited and include CA, MULTIPROT, MASS,
TOPOFIT, SCALI and GANGSTA+ (Bachar et al., 1993; Dror et al.,
2003; Guerler and Knapp, 2008; Ilyin et al., 2004; Shatsky et al.,
2004a; Yuan and Bystroff, 2005).
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
[14:47 17/9/2009 Bioinformatics-btp506.tex]
2663
Page: 2663
2663–2669
C.Micheletti and H.Orland
Besides the list of equivalent amino acids in two or more
proteins, a highly desirable output of an alignment procedure is the
indication of the statistical significance of the returned structural
correspondence.
In a comprehensive study, Sierk and Pearson (2004) have applied
several tests, based on the capability to correctly identify known
homologous pairs in a reference set of protein domains, to validate
the statistical significance returned by several structural alignment
methods. Their analysis pointed out that the provided nominal
statistical significance of alignment scores largely exceeded the true
one, and hence posed the necessity to develop a better control of
the scoring statistics. This task appears particularly challenging for
alignment methods based on the combination of several criteria
rather than on the optimization of a single scoring function.
Here, we discuss a novel scheme named MISTRAL (after
MultIple STRuctural ALignment), which establishes structural
correpondences between two or more proteins without exploring the
combinatorial space of amino acid pairings and allows for a good
control of the statistical significance of the returned alignments.
The optimal superposition of a given set of proteins is obtained
by minimizing a suitable protein–protein interaction energy. The
minimization of the energy function is carried out in the lowdimensional space corresponding to the orientations (translations
and rotations) of the molecules in a given Cartesian reference frame.
Once the molecules are optimally oriented, the residue–residue
correspondences are established by using simple criteria based on
spatial proximity. Such correspondences are therefore not sought
by assuming any a priori sequence directionality among or within
blocks of aligned residues in the two proteins which, in fact, can even
be multimeric. The method is here applied and tested in several cases
that aptly illustrate its non-sequential character and applicability in
multiple alignment contexts. Finally, a detailed discussion of the
returned nominal statistical significance, based on the database of
Sierk and Pearson (2004) is presented.
MISTRAL is freely available for academic users as a web server at
http://ipht.cea.fr/protein.php where multiple alignments of up to 20
proteins (of chain length up to 500 amino acids) can be submitted.
To align more or longer proteins users can obtain the MISTRAL
program from the authors.
2
METHODS
The optimal structural superposition of two proteins A and B, is identified
by minimizing the following interaction energy term over their relative
translations and orientations:
fi,j fi+1, j+1 fi+2,j+2
(1)
HAB =
i∈A,j∈B
where the indexes i and j run over all amino acids in proteins A and B,
respectively, and f is a weight function rewarding short spatial separations of
amino acid pairs. Specifically, we considered the following piecewise-linear
sigmoidal weight,
⎧
⎪
if dij ≤ /2
⎪
⎨−1 dij
if /2 < dij < (2)
fi,j = −2 1− ⎪
⎪
⎩0
otherwise
where dij is the separation of the Cα atoms of amino acids i and j, and is
the interaction range which controls the spatial tolerance of the resulting
superposition. By default, is set to 8 Å. The product of the f ’s in
Equation (1) embodies a multi-body interaction between pairs of amino acids
that are nearby in sequence and consequently rewards the superposition of
small protein fragments.
The parameters over which the minimization is done are the six degrees
of freedom necessary to specify the displacement and orientation of protein
B relative to A, which is held fixed in space. The extensive search over
the relative orientations was employed before in alignment contexts; in
STRUCTAL (Gerstein and Levitt, 1998) the spatial positioning of a protein
pair is iteratively guided by finding equivalent pairs of amino acids. Here, the
exploration of the relative orientations is, instead, guided by the minimization
of the energy function in Equation (1). The optimization is carried out
within a simulated annealing scheme starting from a tentative orientation
of the proteins obtained by optimally superimposing fragments of 10–20
amino acids (using longer fragments typically adds additional computational
overhead, without altering the result). Notice that, unlike the case of effective
pairwise amino acid interactions used, for example, in models for protein
docking or protein folding, the potential energy f , does not introduce any
excluded-volume penalty at small separations. This allows protein chains to
‘pass-through’ each other as the minimum of Equation (2) is sought.
We point out that the method is seamlessly applicable if one or both
proteins to be aligned are multimeric (though it should be noted that ‘flexible’
alignments scheme are better suited to deal with arbitrary displacements of
separate domains). In this case, those amino acids i and j for which a chain
interruption occurs in the intervals [i,i+2] or [j,j +2] are excluded from
the sum in Equation (1). The same summation restriction is applied when
dealing with proteins with missing structural informations about Cα atoms.
The energy-based approach is readily generalizable to perform multiple
alignments where all proteins are treated in a symmetric, equal way. For
example, in the case of alignment of three proteins A, B and C, the interaction
energy to minimize is the sum of all pairwise interaction energies (possibly
length-normalized):
H = HAB +HBC +HAC .
(3)
The dimensionality of the parameter space entailed by the symmetric
multiple alignment of n proteins is 6(n−1) as one protein, the largest for
computational convenience, is held fixed in space while all possible rotations
and translations are considered for the others.
However, for the simplicity of the formulation, we shall not resort
to the minimization of Equation (3), but will consider separately the
possible n(n−1)/2 alignments and combine them a posteriori to identify
the alignment core, as explained in a following section.
Finally, we observe that unless small values of are used, the
‘cooperative’ terms in Equation (1), can assign a favorable energy not
only to superposable segments having the same sequence directionality
in the two proteins but also (albeit to a lesser extent) to segments with
opposite directionality. This property, which is illustrated in Section 3, further
highlights that the energy-based alignment can be employed without any
a priori assumptions on the sequentiality and sequence directionality of
the blocks to be aligned. The energy function in Equation (1) is, however,
generalizable to strongly promote the matches of segments with opposite
directionality by substituting (or complementing) the term in Equation (1)
with
fi,j+2 fi+1,j+1 fi,j−1
(4)
i∈A,j∈B
2.1
Statistical significance of pairwise alignments
The optimized alignment energies of 104 pairs of protein chains were
analyzed to characterize their statistical distribution. These reference
pairwise alignments were produced by randomly pairing the members of a
comprehensive set of about 2000 protein chains comprising between 19 and
500 amino acids (the average length being 230) with limited mutual sequence
identity. The average pairwise sequence identity (the number of identically
aligned amino acids divided by the length of the shortest protein (May,
2004)) returned by Clustalw (Chenna et al., 2003) over the 104 pairs was
25%. No more than 80 pairs out of 104 overcome the 40% threshold. In view
of the limited sequence relatedness of the set, the 104 pairings are expected to
2664
[14:47 17/9/2009 Bioinformatics-btp506.tex]
Page: 2664
2663–2669
MISTRAL: energy-based multiple structural alignment
(a)
for the considered intervals of lmax . The quality of the Gumbel fits is visible in
Figure 1, where it can be compared against the Gaussian distribution which
yields χ2 values larger than a factor of about 4. The Gumbel distribution:
prob(E/lmax ) =
(b)
1 (E/lmax +a1)/a2
e
exp[−exp[(E/lmax +a1 )/a2 ]],
a2
(5)
where E is the energy of the alignment and a1 and a2 are the so-called location
and scale parameters, was therefore taken as a viable statistical model for
modeling the statistics of rescaled alignment energies. The analysis of the
Gumbel fits for various length ranges (see e.g. Fig. 1a and b) indicated
that the parametric dependence of the a1 and a2 parameters on lmax can be
approximated as
a1 = 0.867−9.110−4 ·lmax ,
a2 = 0.1934−3.810−5 ·lmax
(6)
Based on the above considerations, the statistical significance of the
alignment of two proteins is equivalently conveyed by the Gumbel z-score:
√
(7)
z(E) = (−E/lmax −a1 −γa2 )/(πa2 / 6);
where γ ≈ 0.5772, or the associated P-value:
√
p(E) = 1.0−exp[−exp[−π/ 6 z(E)−γ]]
(c)
Fig. 1. Distribution of length-rescaled alignment energies (notice the minus
sign) for random protein pairs where the length of the longest protein, lmax is
in the range (a) [25:75] and (b) [275–325]. The superposed curves provide
the Gaussian and Gumbel fits to the histograms obtained by minimizing the
χ2 (assuming a Poisson uncertainty on the histogram heights) computed only
over the bins with energy more favorable than average, so as to reproduce
accurately the extremal tail of the distributions. (c) Trend of the theoretical
and observed P-values as a function of the z-score for the set of ≈ 104 random
alignments of pairs with sequence identity <40%.
involve mostly structurally unrelated proteins chains (Chothia, 1984; Chothia
and Lesk, 1986; Gan et al., 2002; Levitt and Gerstein, 1998; Wood and
Pearson, 1999) and are thus taken as a random background reference to
calibrate the statistical significance of structural alignments.
As a first step, the 104 minimized pairwise energies were divided by the
length of the longest of the two proteins, lmax . This energy rescaling favors the
detection of matches that involve a substantial fraction of the longest protein.
This is desirable as short proteins can be fitted against much longer ones
which may have a different overall topological organization. The rescaled
energies were found to have a residual dependence on lmax . In particular,
the energies were shifted toward smaller absolute values for increasing lmax ,
as visible in the histograms of Figure 1a and b. This is consistent with the
intuitive expectation that it is easier to find random alignments involving,
e.g. 30% of a 100-residue protein, than 30% of a 300-residue one.
The length dependence of the rescaled alignment energies was
characterized by fitting separately the energy distributions involving various
intervals of lmax . Among the commonly employed statistical distributions it
was found that the Gumbel extreme-value statistics, which often applies to
alignment scores (Karlin and Altschul, 1990), provided a χ2 of order unity
(8)
where a1 and a2 have the aforementioned dependence on the length of
the longest of the two proteins. The z-score measures the separation (in
terms of SDs) of the observed alignment energy from the random average.
The P-value, instead, provides the expected probability to have a random
alignment with interaction energy equal or better (i.e. more negative) than
the observed one.
The viability and consistency of the above analysis to model the statistics
of random alignment energies is verified a posteriori over the reference
dataset by comparing the expected and observed P-values, as a function of
the z-score threshold. The observed P-value is straightforwardly calculated as
the fraction of alignments whose energy has a z-score larger than the assigned
threshold; the 80 pairs (out of 104 ) having sequence identity >40% were
excluded from the analysis. The result is shown in Figure 1c and indicates
that the expected and observed P-values are in very close agreement up
to z-scores of about 3. Beyond this range, the differences are nevertheless
limited, as the observed and expected number of alignments are still of the
same order of magnitude even for z-score of 4.5 (the relative difference being
a factor of 3). The fact that, at these values of z-score, the small number of
observed alignments overcomes the expected one is also consistent with the
fact that, despite the limited mutual sequence identity of members of the
set, a limited number of genuine pairwise structural correspondences is still
expected to occur in the dataset.
The above considerations indicate that the parameterized Gumbel statistics
provides a consistent and viable model for estimating the significance of
the structural alignments. An independent assessment of the viability of the
statistical significance is presented in Section 3.
As the characterization of the statistics is particularly time consuming
(both in terms of computation and data analysis), it was carried out only
for pairwise alignments obtained with the default interaction range = 8 Å.
Consequently, while further extensions of the statistical analysis may be
undertaken in the future, presently the z-scores and P-values are provided
by MISTRAL only for pairwise alignments carried out with the default
interaction range.
2.2
One-to-one amino acid correspondences
The direct output of the minimization of Equation (1) or Equation (3) is
the best spatial positioning of two or more proteins. The output is further
processed to return one-to-one correspondences among amino acids in
different proteins. The post-processing consists of an iterative scheme which
attempts to match segments of length l in the two proteins. For two such
segments, (i,i ) and (j,j ), with |i−i | = |j −j | = l, amino acids are tentatively
matched in the scheme i : j; i+1 : j +1; ... or i : j ; i+1 : j −1; .... The latter
type of matching can be disallowed in MISTRAL if reversal of sequence
2665
[14:47 17/9/2009 Bioinformatics-btp506.tex]
Page: 2665
2663–2669
C.Micheletti and H.Orland
directionality in matching segments is not considered appropriate. The
segments represent a putative match if they contain at least two corresponding
amino acids at distance below the ‘seed’ threshold of 3.5 Å, and no pair at
distance beyond the ‘alignment tolerance’ threshold of 4.0 Å. Among all
viable segment pairs of length l, (i,i+l) (j,j ±l), the one with the lowest
RMSD is picked and assigned. Once this assignment is made, another search
is performed for other viable segments of the same length. If no other viable
segment is found, l is decreased by one unit and the search is repeated. The
search is run starting from the maximum possible value of l = min(lA ,lB )
and stopping at l = 4 which is the minimum considered length for aligned
segments. The relative position and orientation of the two proteins obtained
by the minimization of the interaction energy is finally adjusted by optimally
superimposing the corresponding amino acids.
In the present online and distributed version of MISTRAL, multiple
alignments are constructed by computing separately all one-to-one
correspondences in the possible pairwise alignments of the set. Each protein
in the set is next considered in turn as a tentative pivot for the multiple
alignment and the size of the alignment core is computed. The latter is
given by the set of amino acids that have a matching amino acid in all
pairwise alignments with the other proteins in the set. The protein with
the largest core is picked as pivot for the multiple alignment and relative
position of the proteins is finally adjusted by optimally superimposing the
core-corresponding region of each non-pivotal protein over the core of the
pivot one. Besides the size of the core, the quality of the resulting alignment
is conveyed by calculating the RMSD of all pairs of corresponding amino
acids; e.g. if n proteins share a core of N residues, the number of spatial
distances considered for the average is N n(n−1)/2.
3
RESULTS
The MISTRAL alignment strategy complements traditional
approaches for structural alignments in two main respects.
On the one hand structural alignment between two or more
proteins is identified by minimizing an energy function in a
very low-dimensional parameter space (e.g. six parameters per
pairwise alignment) rather than exploring the highly dimensional
combinatorial space of possible amino acid correspondences. This
aspect entails not only a simpler formulation and implementation
of the alignment algorithm, but also makes the algorithm free of
constraints on the sequentiality of the matched blocks, and also
their sequence directionality, that are employed in several alignment
methods.
A further point is the fact that the distribution of MISTRAL
alignment scores (energies) of random pairs of proteins chain is well
modeled by the extreme value (Gumbel) statistics, implying a good
control of the statistical significance of the found alignments. This
property represents an important aspect to the method as appreciable
discrepancies in the expected and observed significance of various
alignment schemes were reported.
In the following, we shall illustrate and discuss the generality and
performance of the MISTRAL algorithm by considering various
cases for pairwise and multiple structural alignments.
3.1 Validation against curated pairwise alignments
We first discuss the performance of MISTRAL in a context of curated
pairwise alignments taken from the SISYPHUS database (Andreeva
et al., 2007).
Specifically, the reference set consists of 69 protein pairs
recently used by Mayr et al. (2007) in a comparative evaluation
of six alignment methods: CE, DALI, FATCAT, CA, MATRAS
(Kawabata, 2003) and SHEBA (Jung and Lee, 2000). The accuracy
of each alignment is measured by computing the fraction of reference
amino acid pairings that are correctly aligned by MISTRAL. The
overall performance of the method is conveyed by the average
and median accuracy calculated over the 69 queries. The average
accuracy was found to be 66% and the median one was 82%. These
values were obtained for the default parameters (seed tolerance
and alignment tolerance equal to 3.5 Å and 4.0 Å, respectively).
A detailed profiling of the performance in a wide range of parameters
are provided in the Supplementary Material. The accuracy of
MISTRAL compares well against the performance of the six
alignments methods considered by (Mayr et al., 2007) for which
the typical average accuracy is in the range of 51–76% and
the median one is in the range of ∼ 72–91%. In particular, for
both the average and median accuracy, MISTRAL is behind only
DALI (average: 76%, median: 91%) and MATRAS (average: 67%,
median: 88%).
An important additional element to take into account in the above
evaluation is the average length of the alignments, as the latter
impacts on the method sensitivity. In this respect, it is noted that the
average length of MISTRAL alignment is 100, which is appreciably
closer to the average reference length (77) than, for example, DALI
which has an average alignment length larger than 115.
Finally, we applied the method of (Kim and Lee, 2007)
and computed how the fraction of correctly aligned residues,
Fcar changes if a maximum shift error (mismatch along the
sequence), , is allowed. The average Fcar was computed for the
range 0 ≤ ≤ 10, see Supplementary Material. The performance
improvement of MISTRAL for increasing is in line with that of
most alignment methods considered by (Kim and Lee, 2007). In
particular, increasing from 0 (no mismatch allowed) to 4 takes
the average Fcar from 66% to 75%, while the median Fcar increases
from 82% to 90%.
3.2
Non-sequentiality of MISTRAL alignments
We now discuss the capability of MISTRAL to deal with
non-sequential alignments. The method can, in fact, detect
correspondences between segments with arbitrary order in the
matching proteins and is tolerant to their sequence directionality.
In addition, it is seamlessly applicable to multimeric proteins.
These features are illustrated and discussed in the context of
pairwise alignments between two representatives of the OB-fold
family, which gathers proteins that bind nucleic acids. Canonical
OB-fold members can exhibit considerable variability for both
structure and nucleic acids binding modes (Theobald et al., 2003),
yet they share a common spatial organization of several β-strands,
which is highlighted in Figure 2a for protein RPA70. By means
of NMR techniques, it has recently been possible to resolve the
structure of nucleic acids binding proteins that adopt a non-canonical
OB-fold (de Chiara et al., 2005). An example is provided by
ATX:HBP1 (PDB: 1v06) which is represented in Figure 2b. From
Figure 2a, it can be perceived that the sequence order of the
β-strands and also their directionality is different with respect to
the canonical OB-fold. The visual inspection of Figure 2a and b,
however, suggest that a general alignment algorithm, insensitive to
the succession and directionality of the matching regions ought to
detect the superposability of the two structures (Zen et al., 2009).
This expectation is indeed confirmed by their MISTRAL
alignment which yielded a set of 59 corresponding amino acids (with
2666
[14:47 17/9/2009 Bioinformatics-btp506.tex]
Page: 2666
2663–2669
MISTRAL: energy-based multiple structural alignment
(a)
(b)
Table 1. Set of proteins used in MISTRAL multiple alignments
Set
PDB IDs
Avg. length
Core size
RMSD
1
1hhoA, 1hhoB, 1mbd
2dhbA, 2dhbB
1dlw, 1dly, 1eco
1hhoA, 1hhoB, 1idrA
1mbd, 2dhbA, 2dhbB
2lh7, 4vhbA
1a75A, 1omd, 1pal, 1pvaA
1rtp1, 5cpv, 5pal
1bmv1, 1cwpA, 1smvA
2bukA, 2tbvA, 4sbvA
145
136
1.4 Å
138
72
2.1 Å
108
100
0.7 Å
200
54
2
(c)
(d)
3
4
2.0
The length and RMSD of aligned core are provided.
(e)
(f)
(g)
(h)
Fig. 2. (a–d) Alignment of canonical and non-canonical OB-folds. Cartoon
representation of (a) a canonical OB-fold from protein RPA70 (PDB: 1jmc
residues 183–298) and (b) a non-canonical OB-fold from protein ATX:HBP1
(PDB: 1v06). The color coding reflects the indexing of the amino acids in
the proteins. (c) MISTRAL alignment of the proteins in (a) and (b) which are
shown, respectively, in blue and yellow color. Matching regions, comprising
62 amino acids, were highlighted with a thick backbone. (d) 2D matrix of
corresponding amino acids in the aligned proteins. Blocks where amino acids
have the same (opposite) sequence directionality appear as bands parallel
(orthogonal) to the diagonal of the matrix. The multiple structural alignments
returned by MISTRAL for the Sets 1–4 in Table 1 are shown in (e–h),
respectively. A different color is used to represent the backbone (thin lines) of
each protein in the set. The thick red regions correspond to the aligned core.
Protein structures have been rendered with the VMD package (Humphrey
et al., 1996).
an RMSD of 2.0 Å) covering intervals with different sequence order
and directionality in the two proteins (Fig. 2c and d).
3.3
Multiple alignments
As anticipated in Section 2, the algorithm can be used for multiple
alignments either by combining information from all pairwise
alignments among the provided proteins or through the minimization
of the energy function in Equation 3 that treats simultaneously and
on equal footing, all possible protein pairings. In both the cases, it is
necessary to evaluate n(n−1)/2 energy terms, n being the number of
proteins to align. The difference of the two approaches is that, in the
first case, the minimization is carried over a 6D space for each of the
n(n−1)/2 alignments, while in the second case, the minimization
is over a 6(n−1)-dimensional parameter space. By comparison
against available multiple structural alignment algorithms, we found
that the first strategy, based on the prior computation of pairwise
alignments, was adequate to detect known common alignment cores,
and therefore we adopted it for its simplicity.
We considered four groups of proteins previously used as test
cases for multiple alignments. In particular, we considered two sets
of globins, which had been originally the object of manually curated
structural superpositions (Lesk and Chothia, 1980), and two sets of
the Homstrad database (Mizuguchi et al., 1998).
The list of proteins defining the four groups are provided
in Table 1, along with quantitative indicators of their multiple
alignment. The resulting multiple structures superpositions are
shown Figure 2e–h.
We first discuss the two sets of globins, previously considered
by Konagurthu et al. (2006) when evaluating the performance
of the Mustang method. The group of proteins in Set 1 are
highly homologous and hence admit an almost complete structural
superposition. Their alignment core (Fig. 2e), consists of 136 amino
acids, with an average RMSD of 1.4 Å. The second set of globins,
which comprises those in Set 1, is more heterogeneous, a fact
reflected in a smaller core. The latter in fact, consists of 72 amino
acids with an average RMSD of 2.1 Å (Fig. 2f). The running time
over the two sets on a present-day personal computer was 16 s
for set 1 and 80 s for set 2. Over the same sets Mustang, POSA
and CE-MC yielded a core of 138–139 amino acids at an average
RMSD of 1.4–1.5 Å for Set1 and a core of 90–92 amino acids at
an average RMSD of 2.2–2.5 Å for Set 2. As already remarked in
the context of the MISTRAL selectivity about SISYPHUS reference
alignments, the method returns somewhat smaller cores than other
alignment methods, but the associated RMSD is also smaller. Indeed,
over the sets in Table 1, the values of RMSD100 [which provides a
length-rescaled RMSD (Carugo and Pongor, 2001)] of MISTRAL
are smaller than those of the mentioned alignment methods.
The results refer to the default parameters for interaction range,
= 8 Å, seed tolerance of 3.5 Å and alignment tolerance of 4.0 Å,
which can be adjusted for specific needs (see Supplementary
Material for an illustration of the dependence of alignment length
on the parameters). For example, setting the tolerance to 5 Å returns
a core of 138 residues with 1.7 Å RMSD for Set 1 and a core of 85
residues with 2.5 Å RMSD for Set 2.
2667
[14:47 17/9/2009 Bioinformatics-btp506.tex]
Page: 2667
2663–2669
C.Micheletti and H.Orland
(a)
Finally, we consider the multiple alignment of two groups of
proteins from the Homstrad database. The first set (Set 3 in Table 1)
comprises seven proteins, while the second, Set 4, gathers six
proteins. Their multiple alignment returns a core of 100 residues
with an average RMSD of 0.7 Å for the first set (Fig. 2g), and 54
residues with average RMSD of 2.0 Å for the second one (Fig. 2h). In
both cases, the results compare well with the results of MULTIPROT
which yield cores of 101–107 residues for Set 3 and of about 33–76
residues for Set 4 upon varying the algorithm alignment tolerance
threshold from 3 Å to 4 Å.
3.4
Selectivity and sensitivity of the method
As discussed in Section 2, the distribution of alignment energy of
random pairs of protein chains is well described by the extreme value
(Gumbel) statistics. This fact indicates that the alignment energy
of MISTRAL can adequately capture the statistical significance
of structural alignments. The need to develop reliable statistical
scores for structural alignments emerged some years ago from
the work of Sierk and Pearson (2004) who showed that structural
alignment methods were largely overconfident in the expected
statistical significance of the alignments (which was shown to exceed
even by orders of magnitude the observed one). As a further test of
the viability of the statistical significance provided by MISTRAL
for pairwise alignments we have carried out statistical tests on the
same dataset considered by Sierk and Pearson (2004). The use
of the same dataset, allows a straightforward comparison of the
performance of MISTRAL relative to other methods and of its
statistical significance.
The dataset of Sierk and Pearson (2004), consists of 86 protein
domains used as queries and of a database of 2771 domains
(including the queries themselves). Each query was run against
all members of the database, thus returning 86×2771 pairwise
alignments which include 1111 pairings of domains sharing the same
CATH code [following Sierk and Pearson (2004) these pairings
are referred to as homologs]. The CATH subdivision in groups
of homologs is taken as the ‘gold standard’ for defining the ‘true’
structural correspondences present in the set. While this is a natural
criterion, it should be borne in mind that it is not free of limitations
(Kolodny et al., 2005; Sippl and Wiederstein, 2008).
We first used the dataset to analyze the overall sensitivity
and selectivity of MISTRAL. For a given level of statistical
significance, a more sensitive method will identify a larger number of
correct correspondences (homologs) among a given set of pairwise
alignments. A more selective method will instead yield a smaller
number of false positives (non-homologs). As customary, both
features are assessed by plotting the errors per query (EPQ) versus
coverage (Brenner et al., 1998). Specifically, the set of alignments
was ordered by decreasing z-score and the number of correct
homolog pairs nc among the top n alignments was computed.
For a given level of significance (z-score), the EPQ are given by
(n−nc)/86. The coverage, which is the detected fraction of correct
alignments, is instead given by nc/1111.
The plot is shown in Figure 3. The resulting performance is
consistent with the average one exhibited by the seven structural
alignment methods considered by Sierk and Pearson (which included
methods that were sequential and not applicable to multimeric
proteins). In particular, the coverage for the threshold values of 0.1,
1 and 10 EPQ correspond to a coverage of 21%, 32% and 51%,
(b)
Fig. 3. (a) EPQ versus coverage curve computed from MISTRAL
alignments of the dataset used in ref. (Sierk and Pearson, 2004). (b)
Comparison of the expected and observed number of false positives in the
same dataset.
respectively. The average coverage ranges found among the seven
methods analyzed in Sierk and Pearson (2004) for the same EPQ
thresholds are approximately: 15%, 30% and 50%, respectively.
As for several of the methods considered in the mentioned study,
most of the top false positives of MISTRAL correspond to topologs,
that is pairs of domains that share the same CATH topology. In fact,
among the top 100 non-homolog pairs, there are 71 topologs. As
a complement of the sensitivity/selectivity analysis, we examined
the reliability of the statistical significance of the alignments. This
represents an important complementary aspect of the above analysis
as it aims at clarifying if the incidence of false positives can be
kept under a priori control. In this respect, all structural alignment
methods considered in Sierk and Pearson (2004) proved to be
‘overconfident’, in that the number of encountered top false positives
was found to exceed by orders of magnitude the one expected on
the basis of the score statistics. We have undertaken this analysis by
following the method of Brenner et al. (1998) which, at variance
with the approach of Sierk and Pearson (2004), is not limited to
the first false positive of each query. The method we used consists
of comparing the number of observed and expected EPQ (nonhomologs) as a function of the z-score threshold. The number of
expected EPQ is obtained by multiplying the P-value by the size
of the database. The resulting curve is shown in Figure 3. It is seen
that the number of expected false positives (or equivalently of EPQ)
follows the observed one within about a factor of 3. In agreement
with what we exposed in Section 2, the results of Figure 3 provide
a clear indication that also on this comprehensive set of 86×2771
alignments (which is unrelated to the one used for modeling and
calibrating the alignment score), the returned z-score and P-values
provide an adequate estimate of the statistical significance of the
alignments.
4
CONCLUSIONS
We presented and discussed MISTRAL, a novel multiple structural
alignment method based on the minimization of a scoring function
representing a suitable pairwise interaction energy between the given
proteins.
The parameters over which the energy function is minimized
are the relative rotations and translations of the molecules. The
dimensionality of the phase space to be explored during the
minimization is consequently drastically limited (e.g. six parameters
per pairwise alignment) compared with the highly dimensional
combinatorial space of possible amino acid correspondences.
2668
[14:47 17/9/2009 Bioinformatics-btp506.tex]
Page: 2668
2663–2669
MISTRAL: energy-based multiple structural alignment
After the suitable normalization depending on the proteins’ length,
the optimized energies returned by MISTRAL for alignment of
unrelated proteins are found to follow closely the Gumbel extremal
statistics distribution. This property reflects in a good a priori control
of the statistical significance of the method, as confirmed by running
MISTRAL over a standard reference database of protein domains
(Sierk and Pearson, 2004).
ACKNOWLEDGEMENTS
We thank A. Capdepon for setting up the MISTRAL web server and
R. Potestio and A. Zen for useful discussions.
Funding: The Italian Ministry of Education, grant PRIN.
Conflict of Interest: none declared.
REFERENCES
Andreeva,A. et al. (2007) SISYPHUS–structural alignments for proteins with nontrivial relationships. Nucleic Acids Res., 35, 253–259.
Bachar,O. et al. (1993) A computer vision based technique for 3D sequence-independent
structural comparison of proteins. Protein Eng., 6, 279–288.
Bashton,M. and Chothia,C. (2007) The generation of new protein functions by the
combination of domains. Structure, 15, 85–99.
Brenner,S.E. et al. (1998) Assessing sequence comparison methods with reliable
structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA,
95, 6073–6078.
Camproux,A.C. et al. (2004) A hidden Markov model derived structural alphabet for
proteins. J. Mol. Biol., 339, 591–605.
Carugo,O. and Pongor,S. (2001) A normalized root-mean-square distance for comparing
protein three-dimensional structures. Protein Sci., 10, 1470–1473.
Chenna,R. et al. (2003) Multiple sequence alignment with the Clustal series of programs.
Nucleic Acids Res., 31, 3497–3500.
Chothia,C. (1984) Principles that determine the structure of proteins. Annu. Rev.
Biochem., 53, 537–572.
Chothia,C. and Lesk,A.M. (1986) The relation between the divergence of sequence and
structure in proteins. EMBO J., 5, 823–826.
de Chiara,C. et al. (2005) The AXH domain adopts alternative folds the solution
structure of HBP1 AXH. Structure, 13, 743–753.
Dror,O. et al. (2003) MASS: multiple structural alignment by secondary structures.
Bioinformatics, 19 (Suppl. 1), 95–104.
Fong,J.H. et al. (2007) Modeling the evolution of protein domain architectures using
maximum parsimony. J. Mol. Biol., 366, 307–315.
Gan,H.H. et al. (2002) Analysis of protein sequence/structure similarity relationships.
Biophys J., 83, 2781–2791.
Gerstein,M. and Levitt,M. (1998) Comprehensive assessment of automatic structural
alignment against a manual standard. Prot. Sci., 7, 445-456.
Guerler,A. and Knapp,E.W. (2008) Novel protein folds and their nonsequential
structural analogs. Protein Sci., 17, 1374–1382.
Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of distance
matrices. J. Mol. Biol., 233, 123–138.
Holm,L. et al. (1992) A database of protein structure families with common folding
motifs. Protein Sci., 1, 1691–1698.
Humphrey,W. et al. (1996) VMD: visual molecular dynamics. J. Mol. Graph, 14, 33–38.
Ilyin,V.A. et al. (2004) Structural alignment of proteins by a novel TOPOFIT method
Protein Sci., 13, 1865–1874.
Jung,J. and Lee,B. (2000) Protein structure alignment using environmental profiles.
Protein Eng., 13, 535–543.
Karlin,S. and Altschul,S.F. (1990) Methods for assessing the statistical significance of
molecular sequence features by using general scoring schemes. Proc. Natl Acad.
Sci. USA, 87, 2264–2268.
Kim,C. and Lee,B. (2007) Accuracy of structure-based sequence alignment of automatic
methods. BMC Bioinformatics, 8, 355–355.
Kawabata,T. (2003) MATRAS: a program for protein 3D structure comparison. Nucleic
Acids Res., 31, 3367–3369.
Koehl,P. (2001) Protein structure similarities. Curr. Opin. Struct. Biol., 11, 348–353.
Kolbeck,B. et al. (2006) Connectivity independent protein-structure alignment. BMC
Bioinformatics, 7, 510–510.
Kolodny,R. et al. (2002) Small libraries of protein fragments model native protein
structures accurately. J. Mol. Biol., 323, 297–307.
Kolodny,R. et al. (2005) Comprehensive evaluation of protein structure alignment
methods: scoring by geometric measures. J. Mol. Biol., 346, 1173–1188.
Konagurthu,A.S. et al. (2006) MUSTANG: a multiple structural alignment algorithm.
J. Mol. Biol., 64, 559–574.
Lesk,A.M. (2004) Introduction to Protein Science: Architecture, Function and
Genomics. Oxford University Press, Oxford, UK.
Lesk,A.M. and Chothia,C. (1980) How different amino acid sequences determine
similar protein structures. J. Mol. Biol., 136, 225–270.
Levitt,M. and Gerstein,M. (1998) A unified statistical framework for sequence
comparison and structure comparison. Proc. Natl Acad. Sci. USA, 95, 5913–5920.
Lichtarge,O. and Sowa,M.E. (2002) Evolutionary predictions of binding surfaces and
interactions. Curr. Opin. Struct. Biol., 12, 21–27.
Liu,X. et al. (2008) CLEMAPS: multiple alignment of protein structures based on
conformational letters. Proteins, 71, 728–736.
May,A. (2004) Percent sequence identity, the need to be explicit. Structure, 12, 737–738.
Mayr,G. et al. (2007) Comparative analysis of protein structure alignments. BMC Struct.
Biol., 7, 50–50.
Micheletti,C. et al. (2000) Recurrent oligomers in proteins: an optimal scheme
reconciling accurate and concise backbone representations in automated folding
and design studies. Proteins, 40, 662–674.
Mizuguchi,K. et al. (1998) HOMSTRAD: a database of protein structure alignments
for homologous families. Protein Sci., 7, 2469–2471.
Mosca,R. and Schneider,T.R. (2008) RAPIDO: a web server for the alignment of protein
structures in the presence of conformational changes. Nucleic Acids Res., 36 (Web
Server issue), 42–46.
Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for
similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
Orengo,C.A. et al. (1997) CATH–a hierarchic classification of protein domain
structures. Structure, 5, 1093–1108.
Ortiz,A.R. et al. (2002) MAMMOTH: an automated method for model comparison.
Protein Sci., 11, 2606–2621.
Phillips,D.C. (1970) The development of crystallographic enzymology. Biochem. Soc.
Symp., 30, 11–28.
Schwarzenbacher,R. et al. (2008) The JCSG MR pipeline Acta Crystallogr. D Biol.
Crystallogr., 64 (Pt 1), 133–140.
Shatsky,M. et al. (2004a) A method for simultaneous alignment of multiple protein
structures. Proteins, 56, 143–156.
Shatsky,M. et al. (2004b) FlexProt: alignment of flexible protein structures without a
predefinition of hinge regions. J. Comput. Biol., 11, 83–106.
Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental
combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.
Sierk,M.L. and Pearson,W.R. (2004) Sensitivity and selectivity in protein structure
comparison. Protein Sci., 13, 773–785.
Sippl,M.J. and Wiederstein,M. (2008) A note on difficult structure alignment problems.
Bioinformatics, 24, 426–427.
Theobald,D.L. et al. (2003) Nucleic acid recognition by OB-fold proteins. Annu. Rev.
Biophys. Biomol. Struct., 32, 115–133.
Wood,T.C. and Pearson,W.R. (1999) Evolution of protein sequences and structures.
J. Mol. Biol., 291, 977–995.
Xie,L. and Bourne,P.E. (2008) Detecting evolutionary relationships across existing fold
space Proc. Natl Acad. Sci. USA, 105, 5441–5446.
Ye,Y. and Godzik,A. (2003) Flexible structure alignment by chaining aligned fragment
pairs allowing twists. Bioinformatics, 19, 246–255.
Ye,Y. and Godzik,A. (2005) Multiple flexible structure alignment using partial order
graphs. Bioinformatics, 21, 2362–2369.
Yuan,X. and Bystroff,C. (2005) Non-sequential structure-based alignments reveal
topology-independent core packing arrangements in proteins. Bioinformatics, 21,
1010–1019.
Zen,A. et al. (2009) Using dynamics-based comparisons to predict nucleic acid binding
sites in proteins: an application to OB-fold domains. Bioinformatics, 25, 1010–1019.
2669
[14:47 17/9/2009 Bioinformatics-btp506.tex]
Page: 2669
2663–2669