BIOINFORMATICS
ORIGINAL PAPER
Vol. 23 no. 19 2007, pages 2558–2565
doi:10.1093/bioinformatics/btm377
Structural bioinformatics
Comparative protein structure modeling by combining multiple
templates and optimizing sequence-to-structure alignments
Narcis Fernandez-Fuentes, Brajesh K. Rai†, Carlos J. Madrid-Aliste, J. Eduardo Fajardo
and András Fiser*
Department of Biochemistry and Seaver Center for Bioinformatics, Albert Einstein College of Medicine,
1300 Morris Park Avenue, Bronx, NY 10461, USA, Institute of Enzymology and Alfred Renyi Institute of Mathematics,
Hungarian Academy of Sciences, H-1113 Budapest, Karolina ut 29, Hungary
Received on March 8, 2007; revised on June 20, 2007; accepted on July 14, 2007
Advance Access publication September 6, 2007
Associate Editor: Burkhard Rost
ABSTRACT
Motivation: Two major bottlenecks in advancing comparative
protein structure modeling are the efficient combination of multiple
template structures and the generation of a correct input target-template alignment.
Results: A novel method, Multiple Mapping Method with Multiple
Templates (M4T) is introduced that implements an algorithm to
automatically select and combine Multiple Template structures (MT)
and an alignment optimization protocol (Multiple Mapping Method,
MMM). The MT module of M4T selects and combines multiple
template structures through an iterative clustering approach that
takes into account the ‘unique’ contribution of each template, their
sequence similarity among themselves and to the target sequence,
and their experimental resolution. MMM is a sequence-to-structure
alignment method that optimally combines alternatively aligned
regions according to their fit in the structural environment of the
template structure. The resulting M4T alignment is used as input to a
comparative modeling module. The performance of M4T has been
benchmarked on CASP6 comparative modeling target sequences
and on a larger independent test set, and compared favorably with
current state-of-the-art methods.
Availability: A web server was established for the method at
http://www.fiserlab.org/servers/M4T
Contact: [email protected] or [email protected]
1 INTRODUCTION
Comparative protein structure modeling relies on detectable
similarity spanning most of the modeled sequence and at least
one known structure (Marti-Renom et al., 2000). When the
structure of one protein in a family has been determined by
experiment, the other members of the family can be modeled
based on their alignment to the known structure. Comparative
modeling approaches usually consist of four major steps:
(1) identifying one or more templates; (2) calculating an
accurate alignment between the target sequence and template
*To whom correspondence should be addressed.
†Present address: Wyeth Research, CN8000, Princeton, New Jersey,
08543-8000, USA.
structure(s); (3) modeling the target; and (4) evaluating the target
model (Fiser and Sali, 2003). Each step determines the success
of all subsequent ones. For instance, an incorrect template
selection cannot be corrected at the alignment step or an
alignment error cannot be corrected at the model building step.
Accordingly, the first two steps are the most critical ones in
comparative modeling.
The first step in homology modeling (i.e. template selection
step) is aided by several available methods developed for fold-recognition (Domingues et al., 1999; McGuffin et al., 2000; Shi
et al., 2001) and profile-alignment (Altschul et al., 1997;
Li et al., 2000) that allow efficient recognition of remotely
related sequences. Using these methods, it is most often
possible to identify more than one template structure.
Obviously, this trend is strengthening due to the rapid
expansion of the Protein Data Bank (PDB) (Berman et al., 2000)
and in particular to worldwide structural genomics efforts
(Chance et al., 2004). However, due to the complexity of the
problem to optimally select and combine multiple templates,
currently available modeling programs, and especially the
automated servers, typically consider only one template for
building a model for a target sequence. Meanwhile, results at
CASP experiments, as early as at CASP2 in 1996, indicated that
multiple templates help to improve the quality of comparative
models (Sanchez and Sali, 1997; Venclovas and Margelevicius,
2005). Multiple template structures can be useful in two ways:
first, multiple template structures may be aligned with different
parts/domains of the target, with little overlap between them, in
which case, the modeling procedure can construct a homology-based model of the whole target sequence (improving model
coverage). Therefore, it is frequently beneficial to include in the
modeling process all the templates that have a unique
contribution to the target sequence (Fiser, 2004). Second, the
template structures may be aligned with the same part of the
target, in which case the model can be built on the locally best
template (improving model quality).
Although the idea of combining multiple templates sounds
straightforward, its implementation is fairly complex. The real
challenge is not the identification of a list of suitable template
candidates, but an optimal combination of these. This is
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
because template search methods ‘outperform’ the needs of
comparative modeling in the sense that they are able to locate
sequences so remotely related that no reliable comparative
model can be built for them. The reason for this is that sequence
relationships are often established on short conserved segments,
while a successful comparative modeling exercise requires an
overall correct alignment for the entire modeled part of the
protein. The MT module of the M4T algorithm addresses this
very important issue.
The second step in comparative modeling (i.e. the calculation
of an accurate alignment of a target sequence to a template
structure) remains a bottleneck in producing good quality
homology models. A number of alignment methods have been
developed and are publicly available [MUSCLE (Edgar, 2004),
CLUSTALW (Thompson et al., 1994), Align2d (Madhusudhan
et al., 2006), T-coffee (Notredame et al., 2000), FFAS
(Jaroszewski et al., 2000) and SATCHMO (Edgar and
Sjolander, 2003)]. However, none of these alignment methods
consistently produces the best solution in all cases (Prasad et al.,
2003; Rai and Fiser, 2006). Furthermore, alignments
produced by two different methods are often better in some
regions and worse in others when compared to each other.
One possible solution to this problem is to consider
several alignment methods and combine better-aligned parts
into a unique solution (Kosinski et al., 2005; Rai and Fiser,
2006).
M4T has been developed to produce accurate alignments and
models by minimizing the errors associated with the first two
steps in comparative modeling (recognizing and combining
templates and generating an optimal input alignment). In the
first step, the MT module uses an iterative clustering approach
to select and combine multiple protein structures to serve as
templates. Next, to reduce errors associated with alignments, an
iterative implementation of the earlier published Multiple
Mapping Method (MMM) (Rai and Fiser, 2006) is used that
considers solutions from several alignment methods and
combines better-aligned parts into a unique solution. The
performance of M4T has been rigorously tested using
various benchmarks. We demonstrate that M4T produces
better models when multiple templates are used as opposed to
the cases using only the single best available template; the
superior performance of M4T stands out in the low sequence identity
region, which presents a major challenge to homology modeling.
Furthermore, M4T also compares favorably with other
competitive approaches and with the performance of expert
users at CASP.
2 METHODS
2.1 Template selection method: MT module
The target sequence is used as a query to search for homologous protein
structure(s) that could serve as template(s) by running three iterations
of PSI-BLAST (Altschul et al., 1997) against PDB (Berman et al.,
2000), with an E-value cutoff of 0.0001. Only those hits are selected
where the sequence overlap with the target sequence is >60% of the
actual SCOP domain length or more than 75% of the PDB chain length
in case of a missing SCOP classification. Next, the hits are clustered
using an iterative clustering procedure that identifies the most suitable
PDB structures to combine as templates. The goal of the clustering step
is to identify the smallest number of templates that can contribute the most to
the model. Templates are selected or discarded according to the
following procedure [Fig. 1, also Fig. 2 in Fernandez-Fuentes et al. (2007)]:
(1) Cluster initiation. The hit with the smallest E-value is selected
and is used to seed a cluster. All hits that align in the same region
(within 10 flanking residues of the first selected hit) are added to
this cluster.
(2) Sequence identity hits to query. The sequence identity is
calculated between query and all hits in the cluster according to
the PSI-BLAST alignment. If the sequence identity of the best
available hit is larger than 50%, only those additional hits are
kept in the cluster whose identity is within 20% of the best hit.
(3) Characterize hits as unique and non-unique. A hit is unique if it
contains at least one stretch of 8 or more residues aligned to a
region of the target sequence that is not covered by any other hit.
The current limit of 8 residues corresponds approximately to the
upper limit up to which a reliable loop conformation can be built
using available approaches, and is therefore subject to change as
loop modeling techniques improve over time (Fernandez-Fuentes et al., 2006; Fiser et al., 2000). Unique and non-unique
attributes are assigned to all hits that form a cluster and then all
hits are ranked within a cluster according to their crystal
resolution. Thus, a hit with the best crystal resolution is always
unique and the remaining hits can be unique only if they
contribute to a unique region (e.g. to an insertion that is solved in
that one structure only and not in any other).
(4) Consolidating the clusters. Once the hits that form the cluster are
classified into ‘unique’ and ‘non-unique’ a purging process is
started. It has three consecutive qualifying steps and applies to
non-unique hits only:
(a) The first step is a sequence identity comparison using a
greedy algorithm, where only those non-unique hits that have
a sequence identity between 30 and 90% to any unique hit are
kept; the rest are discarded. Note that once a non-unique hit
is selected the remaining non-unique hits will be compared
against the unique plus the selected non-unique hits. Again,
the order of comparisons is set by crystal resolution. The
sequence identity is calculated using the alignments between
hits and target sequence given by PSI-BLAST. In general,
this step ensures that structurally neither too similar nor too
dissimilar templates will be selected.
(b) Next, a filtering step takes place that consolidates templates
with varying crystal resolution. Non-unique hits are discarded if the difference in crystal resolution to the experimentally best-solved unique template is larger than 1.5 Å.
This step guarantees that significantly poorer resolution
templates are not used. NMR structures are assigned a virtual
4.5 Å resolution, which means that NMR solution is used
only if it is the only template or if a similar X-ray structure
has a worse resolution than 3 Å.
(c) The last filter determines if a hit is contributing to an
‘underrepresented’ part of the target, i.e. a non-unique hit is
kept only if it is aligned to a region of 8 or more residues that
is covered by two or fewer hits.
(5) Return to point (1) if there are hits that are not assigned to any
cluster and iterate again, if necessary by initiating and establishing new clusters.
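The unique/non-unique classification of step (3) and the resolution filter of step (4b) can be sketched as follows. This is a minimal illustration, not the M4T implementation: the `Hit` record, the function names and the example hits are hypothetical, and the sketch assumes that "not covered by any other hit" is evaluated against hits with better resolution, in line with the statement that the best-resolved hit is always unique.

```python
from dataclasses import dataclass

MIN_STRETCH = 8     # residues; threshold from step (3)
MAX_RES_GAP = 1.5   # Angstrom; threshold from step (4b)


@dataclass
class Hit:
    pdb_id: str          # hypothetical identifier
    covered: frozenset   # target residue indices aligned to this hit
    resolution: float    # Angstrom


def longest_run(positions):
    """Length of the longest run of consecutive integers in `positions`."""
    best = run = 0
    prev = None
    for p in sorted(positions):
        run = run + 1 if prev is not None and p == prev + 1 else 1
        best = max(best, run)
        prev = p
    return best


def classify(hits):
    """Split hits into (unique, non_unique), visiting them by resolution.

    Assumption: a hit is unique if it covers at least MIN_STRETCH
    consecutive target residues not covered by any better-resolved hit.
    """
    seen = set()
    unique, non_unique = [], []
    for hit in sorted(hits, key=lambda h: h.resolution):
        novel = hit.covered - seen
        if longest_run(novel) >= MIN_STRETCH:
            unique.append(hit)
        else:
            non_unique.append(hit)
        seen |= hit.covered
    return unique, non_unique


def resolution_filter(unique, non_unique):
    """Step (4b): drop non-unique hits much worse than the best unique hit."""
    best = min(h.resolution for h in unique)
    return [h for h in non_unique if h.resolution - best <= MAX_RES_GAP]


# Hypothetical cluster of four hits on a 110-residue target.
hits = [
    Hit("hitA", frozenset(range(0, 100)), 1.8),
    Hit("hitB", frozenset(range(0, 110)), 2.0),  # adds residues 100-109
    Hit("hitC", frozenset(range(10, 90)), 3.6),  # nothing new, poor resolution
    Hit("hitD", frozenset(range(20, 80)), 2.5),  # nothing new, fair resolution
]
unique, non_unique = classify(hits)
kept = resolution_filter(unique, non_unique)
print([h.pdb_id for h in unique], [h.pdb_id for h in kept])
```

In this toy cluster, hitB qualifies as unique because of its 10-residue extension, while hitC is removed by the 1.5 Å resolution criterion. The sequence identity filter (4a) and the coverage filter (4c) are omitted for brevity.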
The result of this iterative clustering process is one or more clusters of
templates containing one or more template structures. Next, within
each cluster, all templates are aligned to the corresponding target
sequence using the iterative-MMM approach (see below).
In a last consolidation step, sequence-to-structure alignments of clusters
that overlap are combined. The overlapping parts of the templates are
superposed and an LGA_S score (Zemla, 2003) is calculated on that
superposition. If this score is larger than 70%, then the overlapping
clusters are combined using their alignment to the (same) target
sequence as reference. If clusters of templates are not overlapping or the
overlap between them cannot be structurally accurately superposed,
then individual models are built for each ‘modelable’ part of the target
sequence for each cluster of templates.
2.2 Target to template(s) alignment: MMM module
The target-to-template(s) alignments are calculated using an iterative
implementation of the Multiple Mapping Method (Rai and Fiser,
2006). To construct profiles, the sequences of the target and template(s)
are independently searched against the non-redundant database
[NR (Boeckmann et al., 2003)] of NCBI using five iterations of
PSI-BLAST and with E-value cutoff of 0.0001. Next, BlastProfiler
(Rai et al., 2007) is run to build sequence profiles for both the target and
template sequences. The program parses all iterations of PSI-BLAST
outputs, locates and stores those pairwise alignments between the query
and database sequences that meet the filtering criteria. The values
specified for filtering are: (i) Lower and upper cutoffs for percent
sequence identities between the hit and the query, as reported in the
pairwise Blast alignment; default: 30 and 90%, respectively. (ii) Lower
bound for alignment length; default: 30 residues. (iii) Maximal E-value
for each hit; default: 0.0001. (iv) Minimal required coverage of the
query in the alignment, in percentage; default: 30%. Typically, the PSI-BLAST output contains more than one alignment for the same hit
sequence, especially when multiple iterations are performed. Such
alternative alignments may include either the same or different regions
of the hit sequence. Alignments to different regions of the target are
kept as separate entries. Two alignments that involve the same hit
sequence are considered redundant if the overlap is >50%. Because
alignments produced in later iterations contain more specific information about the sequence profile, these alignments are preferred over
earlier ones in case of overlaps. The second major step in the selection
of a set of representative hit sequences is to remove sequence
redundancy using the CD-HIT clustering program (Li et al., 2002) at
40% identity level.
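The filtering criteria listed above can be sketched as a small selection routine. This is an illustration only and not the BlastProfiler code: the `Aln` record and function names are hypothetical, alignment length is approximated by the span on the query, and the overlap between two alignments of the same hit is assumed to be measured on query coordinates relative to the shorter alignment, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Aln:
    hit_id: str       # database sequence identifier (hypothetical)
    iteration: int    # PSI-BLAST iteration that produced the alignment
    identity: float   # percent identity reported for the pairwise alignment
    evalue: float
    q_start: int      # first aligned query position (inclusive)
    q_end: int        # last aligned query position (inclusive)

    @property
    def length(self):
        return self.q_end - self.q_start + 1


def passes(a, query_len, min_id=30.0, max_id=90.0,
           min_len=30, max_e=1e-4, min_cov=30.0):
    """Apply the four default cutoffs (i)-(iv) quoted in the text."""
    coverage = 100.0 * a.length / query_len
    return (min_id <= a.identity <= max_id
            and a.length >= min_len
            and a.evalue <= max_e
            and coverage >= min_cov)


def overlap_fraction(a, b):
    """Shared query span divided by the shorter alignment (an assumption)."""
    shared = min(a.q_end, b.q_end) - max(a.q_start, b.q_start) + 1
    return max(shared, 0) / min(a.length, b.length)


def select(alns, query_len):
    """Keep filtered alignments, preferring later iterations on overlap."""
    kept = []
    for a in sorted(alns, key=lambda x: -x.iteration):
        if not passes(a, query_len):
            continue
        redundant = any(k.hit_id == a.hit_id and overlap_fraction(k, a) > 0.5
                        for k in kept)
        if not redundant:
            kept.append(a)
    return kept


# Hypothetical alignments against a 200-residue query.
alns = [
    Aln("seq1", 2, 45.0, 1e-20, 10, 150),
    Aln("seq1", 5, 47.0, 1e-30, 5, 160),   # later iteration, same region
    Aln("seq2", 3, 95.0, 1e-50, 1, 200),   # identity above 90%
    Aln("seq3", 1, 35.0, 1e-6, 20, 40),    # shorter than 30 residues
    Aln("seq4", 4, 60.0, 1e-3, 1, 180),    # E-value above 0.0001
]
selected = select(alns, query_len=200)
print([(a.hit_id, a.iteration) for a in selected])
```

Sorting by descending iteration makes the preference for later iterations explicit: an earlier alignment of the same hit is dropped whenever it overlaps an already accepted later one by more than 50%. The subsequent CD-HIT redundancy removal is not part of this sketch.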
Starting from the collected sequences, three separate profiles
are calculated for each template and for the target sequence, namely
clustalw_d_profile, clustalw_m_profile and muscle_profile. The
clustalw_d_profile and clustalw_m_profile are obtained by aligning
the sequences using CLUSTALW (Thompson et al., 1994) with default
gap penalty function (clustalw_d_profile) and with modified gap penalty
function (clustalw_m_profile), and muscle_profile is obtained using
MUSCLE (Edgar, 2004). At the end of this step, three alternative
profile-to-profile-based sequence alignments are available, which are
used as input to MMM (Rai and Fiser, 2006). These three alternative
profile-to-profile based sequence alignments are combined in the
following manner: clustalw_d_profile is combined with muscle_profile,
generating an MMM alignment, mmm_alignment_1; clustalw_m_profile is
combined with muscle_profile, generating mmm_alignment_2. Finally,
mmm_alignment_1 and mmm_alignment_2 are
used as inputs to MMM for the final MMM alignment (Fig. 1).
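The order in which the three input alignments are combined can be sketched as below. This is a toy illustration, not the MMM algorithm: alignments are represented as lists mapping each target position to a template position, `combine` simply keeps, for every region where two alignments disagree, the variant with the higher summed score, and the `score` function is a hypothetical stand-in for the structure-based environment scoring of MMM (Rai and Fiser, 2006). The sketch does not check that the combined mapping stays consistent across gapped columns.

```python
def differing_regions(a, b):
    """Yield half-open (start, end) index ranges where mappings a and b disagree."""
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y and start is None:
            start = i
        elif x == y and start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(a)


def combine(a, b, score):
    """Keep shared columns; in each differing region take the better-scoring variant."""
    out = list(a)
    for s, e in differing_regions(a, b):
        score_a = sum(score(i, a[i]) for i in range(s, e))
        score_b = sum(score(i, b[i]) for i in range(s, e))
        if score_b > score_a:
            out[s:e] = b[s:e]
    return out


def iterative_mmm(clustalw_d, clustalw_m, muscle, score):
    """Two pairwise combinations followed by a final combination, as in the text."""
    mmm_alignment_1 = combine(clustalw_d, muscle, score)
    mmm_alignment_2 = combine(clustalw_m, muscle, score)
    return combine(mmm_alignment_1, mmm_alignment_2, score)


# Toy data: a hypothetical reference mapping is used only to fake a score.
reference = [0, 1, 3, 4, 6, 7]


def toy_score(target_pos, template_pos):
    return 1 if reference[target_pos] == template_pos else 0


clustalw_d = [0, 1, 2, 3, 6, 7]   # wrong at target positions 2 and 3
clustalw_m = [0, 2, 3, 4, 6, 7]   # wrong at target position 1
muscle = [0, 1, 3, 4, 5, 7]       # wrong at target position 4

final = iterative_mmm(clustalw_d, clustalw_m, muscle, toy_score)
print(final)
```

In the toy data each input alignment is wrong in a different place, and the two-stage combination recovers the reference mapping, which is the behaviour the iterative scheme is designed to exploit.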
2.3 Model building
Models are built with the MODELLER program (Fiser and Sali, 2003;
Sali and Blundell, 1993) using the default values for __model.top
routine. Selected template(s) and optimized alignment(s) are provided
as inputs.
Fig. 1. Flowchart for model building. General overview of the
algorithm: starting from a query sequence a search is performed using
PSI-BLAST, and template(s) are selected in MT-module; subsequently,
the MMM-module performs sequence alignment(s), and finally
MODELLER builds the protein(s) model(s). See Methods section for
further explanations.
2.4 Benchmark sets
Two different test sets were used to benchmark our method. The first
benchmark set was composed of sequences used in the CASP6
experiment for comparative modeling assessments. The target
sequences were downloaded from http://predictioncenter.gc.ucdavis.edu/casp6/
and only those target sequences that produced a hit against
a tailored PDB (Berman et al., 2000) dataset (see below) with
PSI-BLAST (Altschul et al., 1997) were kept. In total 24 targets from 17
target protein sequences were considered (CASP target identifications:
T0204, T0229, T0231, T0233, T0240, T0246, T0247, T0264, T0266,
T0268, T0269, T0271, T0274, T0275, T0276, T0277 and T0282). The
second benchmark set was composed of 765 selected protein sequences
with known structures, taken out of 1160 from a previous work (Rai
and Fiser, 2006); for each of these selected sequences the MT module
returned more than one hit or template. Each query sequence of both
benchmark sets was modeled using a tailored PDB (MT module) and a
tailored NR database (MMM module). The tailored databases did not
contain any structure or sequence that was deposited after the
expiration date set by the CASP organizers.
2.5 Measure of model quality
Three measures were used to assess the quality of the models, i.e. the
similarity between the generated comparative models and the
Table 1. List of CASP6 targets and the accuracy of the comparative models built using a template with the best PSI-BLAST E-value

Target    Template   Nt    Nm    RMSDseq (Å)  RMSDstr (Å)  Nr    GDT_TS
T0204     1HXP_A     297   270   3.67         1.84         248   73.05
T0229_1   1ML8_A     23    23    0.77         0.77         23    95.83
T0229_2   1ML8_A     102   98    2.42         1.99         96    78.28
T0231     1F7S_A     137   126   3.31         1.93         120   79.16
T0233_1   1KHD_D     66    65    1.49         1.47         65    89.61
T0233_2   1KHD_D     270   262   2.13         1.72         257   80.15
T0240     1QXX_A     90    76    21.75        2.87         45    40.78
T0246     1A05_A     354   354   2.29         2.09         347   75.77
T0247_1   1PJ6_A     139   98    2.58         1.73         88    82.06
T0247_2   1PJ6_A     134   134   2.48         1.66         114   86.11
T0247_3   1PJ6_A     76    76    3.67         2.27         70    69.40
T0264_1   1VHV_A     116   99    1.94         1.81         98    86.61
T0264_2   1VHV_A     173   131   4.24         2.17         115   66.98
T0266     1DBU_A     150   121   1.67         1.67         121   82.02
T0268_1   1N2X_A     172   168   1.00         1.01         75    93.66
T0268_2   1N2X_A     109   109   1.25         1.25         109   90.82
T0269_1   1QMV_A     158   158   2.35         1.56         150   84.65
T0269_1   1QQ2_A     158   157   2.54         1.71         146   80.41
T0271     1RLH_A     161   132   2.22         1.67         127   82.00
T0274     1I0R__A    156   150   2.98         1.52         142   81.83
T0275     1MJH_A     135   135   6.95         1.9          108   55.37
T0276     1SOU_A     168   166   3.44         2.41         157   62.95
T0277     1JOG_A     117   91    1.62         1.63         91    86.53
T0282     1PQ3_A     323   287   4.35         2.00         251   70.47

Nt: number of residues in target structure; Nm: number of residues in model; RMSDseq: root mean square deviation of Cα atoms based on a sequence-dependent
superposition; RMSDstr: root mean square deviation of Cα atoms based on a structure-dependent superposition; Nr: number of residues considered for RMSD
calculation and GDT_TS: global distance test total score (see Methods section for more information).
corresponding experimental structure: RMSDseq, RMSDstr and
GDT_TS score. RMSDseq is the root mean square deviation that is
calculated on Calpha atoms after a sequence-dependent superposition
of Calpha positions using a 5.0 Å distance cutoff. RMSDstr is the same
as RMSDseq but on a sequence-independent superposition (i.e. using
the best structural superimposition). Finally, the GDT_TS score (global
distance test total score) was calculated. GDT_TS is a main metric
used to evaluate CASP experiments and it accounts for the structural
similarity between the model and experimental solution structure by
measuring the fraction of superposable residues at distance cutoffs of
1.0, 2.0, 4.0 and 8.0 Å. All these measures were calculated using the
LGA program (Zemla, 2003).
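The two kinds of measure can be illustrated on a list of per-residue Cα distances. This is a simplified sketch and not the LGA procedure: it assumes the model has already been superposed on the experimental structure and that one fixed superposition is used for all cutoffs, whereas LGA searches for superpositions that maximize the number of residues under each cutoff. The function names and the example distances are hypothetical.

```python
import math


def gdt_ts(distances, n_target, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of target residues within each distance cutoff."""
    fractions = [sum(d <= c for d in distances) / n_target for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


def rmsd_within(distances, cutoff=5.0):
    """RMSD over residue pairs closer than `cutoff`, plus the number used (Nr)."""
    used = [d for d in distances if d <= cutoff]
    rmsd = math.sqrt(sum(d * d for d in used) / len(used))
    return rmsd, len(used)


# Hypothetical Calpha-Calpha distances (in Angstrom) for a five-residue target.
distances = [0.5, 1.5, 3.0, 6.0, 9.0]
score = gdt_ts(distances, n_target=len(distances))
rmsd, n_used = rmsd_within(distances)
print(round(score, 2), round(rmsd, 2), n_used)
```

With these distances, one, two, three and four of the five residues fall under the 1, 2, 4 and 8 Å cutoffs, respectively, so the score averages 20, 40, 60 and 80%. The RMSD uses only the three residues within 5 Å, which is why Nr can be smaller than the model length in Tables 1 and 2.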
3 RESULTS
3.1 Performance of M4T
The performance of the method has been benchmarked in two
different scenarios. M4T performance was tested on CASP6
comparative modeling targets and compared to models that
were based on the single best template and then on the single
best model produced by any group at CASP6. Finally, on a
larger independent set the overall performance of M4T was
tested by building models on single and multiple templates for
765 cases.
3.2 Single versus multiple templates at CASP
All comparative model targets were tested by building models
with M4T using the single best identified template and then by
using multiple templates. In this setup, we used the MMM
alignment module of M4T to generate input alignments for
both cases. For 11 out of 24 CASP comparative modeling
targets, it was possible to combine multiple templates. For all
cases but one (T0269) the use of multiple templates provides a
superior model in terms of RMSDseq, RMSDstr and GDT_TS
scores than the one based on a single best template (Tables 1
and 2). The most impressive improvement takes place in case of
target sequence T0275 where the GDT_TS score increases from
55.37 to 72.41 when multiple templates are combined. These
observations confirm the anecdotal reports of CASP participants that suggested that use of multiple templates is
advantageous (Sanchez and Sali, 1997; Venclovas and
Margelevicius, 2005).
3.3 Comparison with current methods and expert knowledge
M4T also compared well with state-of-the-art methods and
human experts in protein modeling. Table 3 shows the
performance of M4T as compared with the single best models
submitted to CASP6 by any group. These results often differ
from the ones reported in the previous section because
alignments may be different due to different methods used,
different profiles employed or manual editing. Certain users
may have used information on multiple structures. In addition,
expert users may have attempted side chain and loop modeling
in certain parts of the models. An ultimate goal of automated
Table 2. List of CASP6 target sequences and the accuracy of its prediction using multiple templates

Target    Templates                         Nt    Nm    RMSDseq (Å)  RMSDstr (Å)  Nr    GDT_TS
T0204     1HXP_A, 1GUP_A                    297   260   3.61         1.77         245   73.46
T0231     1F7S_A, 1M4J_A, 1AHQ_, 1AK6_      137   134   2.11         1.57         130   81.91
T0233_1   1KHD_{A,C,D}, 1BRW_A, 1V8G_A      66    62    1.13         1.14         62    91.13
T0233_2   1KHD_{A,C,D}, 1BRW_A, 1V8G_A      270   263   2.01         1.45         259   85.51
T0246     1A05_A, 1HQS_A, 1CNZ_A, 1CM7_A    354   352   2.02         1.95         347   78.69
T0268_1   1N2X_A, 1M6Y_B                    172   169   0.99         1.00         76    93.75
T0268_2   1N2X_A, 1M6Y_B                    109   109   1.20         1.21         109   91.28
T0269_1   1QMV_A, 1N8J_A                    158   152   3.46         1.68         143   81.42
T0269_1   1QQ2_A, 1ST9_A                    158   157   2.58         1.89         149   79.33
T0275     1MJH_A, 1JMV_A, 1TQ8_A            135   135   6.89         1.73         109   72.41
T0282     1PQ3_A, 1CEV_A                    323   278   3.24         1.92         251   73.47

See Table 1 for explanation of headers.
structure prediction is to deliver models with an accuracy
competitive with that of models created by ‘expert users’, and to do
so in a fully automated way and in a short time. In 9 out of 24 cases,
M4T outperformed the single best model submitted to CASP
(Table 3). As another qualitative comparison, in 9 cases the
differences between the best CASP model and M4T were small,
and in 5 other cases M4T was significantly better, while in
9 cases CASP models turned out to be more accurate (for one
case M4T did not return a model). Among the 24 best CASP
models, the largest number submitted by a single research group
was 9, and the second largest was 2. In this
simplified comparison, M4T would rank as the second best
individual performer, with five models superior to any
other submission. While it is true that statistical significance
is hard to establish from a small number of test cases, such as at
CASP (Marti-Renom et al., 2002), we regard this
performance as encouraging and as a sign that automated
methods are becoming competitive with the best expert users.
3.4 Benchmarking on an independent test set
The benefit of using multiple templates was also confirmed on
an independent benchmarking set consisting of 765 proteins
taken from an earlier study (Rai and Fiser, 2006). Two sets of
models were built: (a) using multiple templates, and (b) using
the single best template. In Figure 2, RMSDseq is shown
versus sequence identity (comparing the quality of models to
the sequence identity between the target and the best template).
Below 50% sequence identity, models built using multiple
templates are more accurate than those built using a single
template only and this trend is accentuated as one moves into
more remote target-template pair cases. Meanwhile, the
advantage of using multiple templates gradually disappears
above 50% target-template sequence identity cases. This result
is also consistent with the performance on the CASP6 set where
hits usually have a low sequence identity with their corresponding query.
Besides improving the model quality, the use of multiple
templates also increases model coverage, i.e. the resulting
models cover a larger fraction of the target sequence, sometimes
as much as 50 residues longer (Fig. 3).
3.5 Two examples of models predicted using single and multiple templates
Figure 4 shows the structure prediction of PDB: 1ekx, chain A.
After searching in a tailored PDB database, the hit with the best
E-value was 9atc (E-value 1E-176). The MT module returned a
cluster of three templates: 1acm, 1a1s and 1oth. Both models
are very accurate for the core of the protein; however, the
model built using multiple templates (red) is more accurate in
two regions, marked A and B, than the model built using a
single template.
An additional advantage of using multiple templates is that
the resulting model is more complete. Figure 5 shows the model
for PDB 1hix, chain B. The model built with multiple templates
is 187 residues long and was built using 2bvv, 1enx, 1f5f
and 1yna as templates. For comparison, the length of the model
using the single best E-value hit, 1xyn, is 167 residues only. The
longer model includes an additional supersecondary element,
a beta-turn-beta-turn element, which is not present in the model
built with the single best template.

Table 3. Comparison of prediction accuracy between the best possible model using our method and the best model submitted to CASP6

          BEST M4T                BEST CASP 6
Target    GDT_TS   RMSDseq (Å)    GROUP               GDT_TS   RMSDseq (Å)
T0204     73.46    3.61           GINALSKY            74.49    3.96
T0229_1   95.83    0.77           CBRC-3D             97.92    0.76
T0229_2   78.28    2.42           CHIMERA             80.15    2.11
T0231     81.91    2.11           NANOMODEL           92.34    1.66
T0233_1   91.13    1.13           ROHL                94.69    0.94
T0233_2   85.51    2.01           GINALSKY            90.66    1.67
T0240     40.79    21.75          GINALSKY            64.44    9.34
T0246     78.69    2.02           HONIGLAB            89.05    1.29
T0247_1   82.06    2.58           ALSO-RAN_U          78.83    3.21
T0247_2   86.11    2.48           GINALSKY            85.56    1.87
T0247_3   69.41    3.67           TOME_U              81.91    2.03
T0264_1   86.62    1.94           JONES-UCL           85.78    2.11
T0264_2   66.98    4.24           GINALSKY            64.31    2.45
T0266     82.03    1.67           SKOLNICK-ZHANG      82.83    1.60
T0268_1   93.75    0.99           CASPITA             90.41    1.32
T0268_2   91.28    1.21           CBSU                90.59    1.3
T0269_1   84.65    2.35           GINALSKY            88.61    1.77
T0269_2   N/A      N/A            GINALSKY            64.34    5.66
T0271     82.01    2.22           GENESILICO-GROUP    76.86    2.65
T0274     81.83    2.98           GINALSKY            80.93    3.59
T0275     72.41    6.89           VENCLOVAS           81.11    2.88
T0276     62.95    3.44           TOME_U              76.49    2.44
T0277     86.54    1.62           KOLINSKY&BUJNICKI   88.03    1.61
T0282     73.47    3.24           GINALSKY            70.66    6.02

Fig. 2. RMSD(seq) versus sequence identity. Using a dataset of 765
proteins with known structure, two sets of models were built: (1) using
one template (best E-value hit only; light bars), (2) using multiple
templates (gray bars). The percentage of sequence identity is calculated
between the hit with the best E-value and the query sequence. The
error of the mean is shown.

4 DISCUSSION AND CONCLUSIONS
We described a new algorithm, M4T, for fully automated
comparative modeling that makes it possible to: (1) efficiently
select and combine multiple template structures; and
(2) generate an accurate target-to-template alignment.
For the template selection step, we introduced an iterative
clustering approach of potential templates that is driven by a
set of filtering and ranking criteria and is based on sequence
signal, crystal resolution and on the ‘relative sequence novelty’
contribution to the target. For aligning the selected templates
with the target sequence, we used a new version of the MMM
method. The novelty comes from employing a sequence profile
building module so that profile-to-profile alignments are used
as inputs to MMM instead of pairwise alignments. The other
difference to the earlier implementation of MMM is that
the input alignments are combined in an automated iterative
way, unlike before when the actual combination required
supervision (Rai and Fiser, 2006). The original version of
MMM showed a statistically significant improvement over
existing methods by reducing alignment errors in the range of
3–17% over the inputs. MMM also compared favorably
with two alignment meta-servers tested (Lambert et al., 2002;
Prasad et al., 2003). Meanwhile, the iterative version of MMM
has been illustrated here to outperform its own earlier
implementation (Rai et al., 2007).
We have shown that the fully automated M4T performs
as well as or better than the most advanced methods in protein
structure modeling in the hands of expert users. M4T also
performs better at low sequence identity signal, both in terms of
model quality and model coverage.
4.1 Web-server
M4T is accessible as a web-server at http://www.fiserlab.org/servers/M4T/
(Fernandez-Fuentes et al., 2007). The web-server
has a straightforward interface. The user only needs to provide
a target sequence, which can be entered in a text box or
uploaded as a text file, a short description for the
sequence and a valid e-mail address. The target sequence must
be in pure text containing one-letter amino acid codes (without
any header). The server returns full-atom model(s) in
PDB format as output, plus the alignment(s) used for modeling.
All the jobs are submitted to a queuing system, so the delay in
execution depends on the number of active queries. Once the
prediction is completed, results are sent by e-mail in the form of
a link pointing to a temporary web page that stores results for
1 month.
ACKNOWLEDGEMENT
This work was supported by NIH GM62519-04.
Conflict of Interest: none declared.

Fig. 3. Histogram of the increase of model coverage. Each query
sequence is modeled using single and multiple template(s). The
histogram shows the frequency of the difference between the length of
the model built using multiple templates (Lm) and the length of the
model built using a single template (Ls).
Fig. 4. Model for pdb 1ekx chain A using single and multiple templates.
The X-ray structure, model with multiple templates, and model with
a single template are shown in gray, red and blue, respectively. Although
both models agree very well with the core of the X-ray protein, the
model constructed using multiple templates agrees much better in
two exposed regions, A and B, than the model built using a single
template. Figures 4 and 5 were generated using PyMOL
(http://pymol.sourceforge.net/).
Fig. 5. Model for pdb 1hix chain B using single and multiple templates.
The X-ray structure, the model with multiple templates, and the model
built with a single template are shown in gray, red and blue,
respectively. The combination of multiple templates resulted in a
more complete model that includes an extra beta-turn-beta-turn region
(20 amino acids), depicted in ribbon in the figure.
REFERENCES
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res., 25, 3389.
Berman,H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235.
Boeckmann,B. et al. (2003) The SWISS-PROT protein knowledgebase and its
supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365.
Chance,M.R. et al. (2004) High-throughput computational and experimental
techniques in structural genomics. Genome Res., 14, 2145.
Domingues,F.S. et al. (1999) Sustained performance of knowledge-based
potentials in fold recognition. Proteins, 37, 112.
Edgar,R.C. (2004) MUSCLE: a multiple sequence alignment method with
reduced time and space complexity. BMC Bioinformatics, 5, 113.
Edgar,R.C. and Sjolander,K. (2003) SATCHMO: sequence alignment and tree
construction using hidden Markov models. Bioinformatics, 19, 1404.
Fernandez-Fuentes,N. et al. (2006) A supersecondary structure library and search
algorithm for modeling loops in protein structures. Nucleic Acids Res., 34,
2085.
Fernandez-Fuentes,N. et al. (2007) M4T: a comparative protein structure
modeling server. Nucleic Acids Res.
Fiser,A. (2004) Protein structure modeling in the proteomics era. Expert Rev
Proteomics, 1, 9–11.
Fiser,A. and Sali,A. (2003) Modeller: generation and refinement of homology-based
protein structure models. Methods Enzymol., 374, 461.
Fiser,A. et al. (2000) Modeling of loops in protein structures. Protein Sci., 9,
1753.
Jaroszewski,L. et al. (2000) Improving the quality of twilight-zone alignments.
Protein Sci., 9, 1487.
Kosinski,J. et al. (2005) FRankenstein becomes a cyborg: the automatic
recombination and realignment of fold recognition models in CASP6.
Proteins, 61, 106–113.
Lambert,C. et al. (2002) ESyPred3D: prediction of proteins 3D structures.
Bioinformatics, 18, 1250–1256.
Li,W. et al. (2000) Saturated BLAST: an automated multiple intermediate
sequence search used to detect distant homology. Bioinformatics, 16, 1105.
Li,W. et al. (2002) Tolerating some redundancy significantly speeds up clustering
of large protein databases. Bioinformatics, 18, 77–82.
Madhusudhan,M.S. et al. (2006) Variable gap penalty for protein sequence-structure
alignment. Protein Eng. Des. Sel., 19, 129–133.
Marti-Renom,M.A. et al. (2000) Comparative protein structure modeling of
genes and genomes. Annu. Rev. Biophys. Biomol. Struct., 29, 291.
Marti-Renom,M.A. et al. (2002) Reliability of assessment of protein structure
prediction methods. Structure (Camb.), 10, 435.
McGuffin,L.J. et al. (2000) The PSIPRED protein structure prediction server.
Bioinformatics, 16, 404.
Notredame,C. et al. (2000) T-Coffee: a novel method for fast and accurate
multiple sequence alignment. J. Mol. Biol., 302, 205.
Prasad,J.C. et al. (2003) Consensus alignment for reliable framework prediction
in homology modeling. Bioinformatics, 19, 1682.
Rai,B.K. and Fiser,A. (2006) Multiple mapping method: a novel approach to the
sequence-to-structure alignment problem in comparative protein structure
modeling. Proteins, 63, 644–661.
Rai,B.K. et al. (2007) MMM: a sequence-to-structure alignment protocol.
Bioinformatics, 22, 2691–2692.
Sali,A. and Blundell,T.L. (1993) Comparative protein modelling by satisfaction
of spatial restraints. J. Mol. Biol., 234, 779.
Sanchez,R. and Sali,A. (1997) Evaluation of comparative protein structure
modeling by MODELLER-3. Proteins, (Suppl. 1), 50.
Shi,J. et al. (2001) FUGUE: sequence-structure homology recognition using
environment-specific substitution tables and structure-dependent gap
penalties. J. Mol. Biol., 310, 243.
Thompson,J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive
multiple sequence alignment through sequence weighting, position-specific
gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673.
Venclovas,C. and Margelevicius,M. (2005) Comparative modeling in CASP6
using consensus approach to template selection, sequence-structure
alignment, and structure assessment. Proteins, 61, 99–105.
Zemla,A. (2003) LGA: a method for finding 3D similarities in protein structures.
Nucleic Acids Res., 31, 3370–3374.