BIOINFORMATICS ORIGINAL PAPER
Vol. 23 no. 19 2007, pages 2558–2565
doi:10.1093/bioinformatics/btm377

Structural bioinformatics

Comparative protein structure modeling by combining multiple templates and optimizing sequence-to-structure alignments

Narcis Fernandez-Fuentes, Brajesh K. Rai†, Carlos J. Madrid-Aliste, J. Eduardo Fajardo and András Fiser*

Department of Biochemistry and Seaver Center for Bioinformatics, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA, Institute of Enzymology and Alfred Renyi Institute of Mathematics, Hungarian Academy of Sciences, H-1113 Budapest, Karolina ut 29, Hungary

Received on March 8, 2007; revised on June 20, 2007; accepted on July 14, 2007
Advance Access publication September 6, 2007
Associate Editor: Burkhard Rost

ABSTRACT
Motivation: Two major bottlenecks in advancing comparative protein structure modeling are the efficient combination of multiple template structures and the generation of a correct input target-template alignment.
Results: A novel method, Multiple Mapping Method with Multiple Templates (M4T), is introduced that implements an algorithm to automatically select and combine Multiple Template structures (MT) and an alignment optimization protocol (Multiple Mapping Method, MMM). The MT module of M4T selects and combines multiple template structures through an iterative clustering approach that takes into account the ‘unique’ contribution of each template, their sequence similarity among themselves and to the target sequence, and their experimental resolution. MMM is a sequence-to-structure alignment method that optimally combines alternatively aligned regions according to their fit in the structural environment of the template structure. The resulting M4T alignment is used as input to a comparative modeling module.
The performance of M4T has been benchmarked on CASP6 comparative modeling target sequences and on a larger independent test set, and it compared favorably with current state-of-the-art methods.
Availability: A web server was established for the method at http://www.fiserlab.org/servers/M4T
Contact: [email protected] or [email protected]

*To whom correspondence should be addressed.
†Present address: Wyeth Research, CN8000, Princeton, New Jersey, 08543-8000, USA.

1 INTRODUCTION
Comparative protein structure modeling relies on detectable similarity spanning most of the modeled sequence and at least one known structure (Marti-Renom et al., 2000). When the structure of one protein in a family has been determined by experiment, the other members of the family can be modeled based on their alignment to the known structure. Comparative modeling approaches usually consist of four major steps: (1) identifying one or more templates; (2) calculating an accurate alignment between the target sequence and template structure(s); (3) modeling the target; and (4) evaluating the target model (Fiser and Sali, 2003). Each step determines the success of all subsequent ones. For instance, an incorrect template selection cannot be corrected at the alignment step, and an alignment error cannot be corrected at the model building step. Accordingly, the first two steps are the most critical ones in comparative modeling.

The first step in homology modeling (i.e. the template selection step) is aided by several available methods developed for fold recognition (Domingues et al., 1999; McGuffin et al., 2000; Shi et al., 2001) and profile alignment (Altschul et al., 1997; Li et al., 2000) that allow efficient recognition of remotely related sequences. Using these methods, it is most often possible to identify more than one template structure.
Obviously, this trend is strengthening due to the rapid expansion of the Protein Data Bank (PDB) (Berman et al., 2000) and in particular to worldwide structural genomics efforts (Chance et al., 2004). However, because optimally selecting and combining multiple templates is a complex problem, currently available modeling programs, and especially the automated servers, typically consider only one template for building a model for a target sequence. Meanwhile, results at CASP experiments, as early as CASP2 in 1996, indicated that multiple templates help to improve the quality of comparative models (Sanchez and Sali, 1997; Venclovas and Margelevicius, 2005).

Multiple template structures can be useful in two ways. First, multiple template structures may be aligned with different parts/domains of the target, with little overlap between them, in which case the modeling procedure can construct a homology-based model of the whole target sequence (improving model coverage). Therefore, it is frequently beneficial to include in the modeling process all the templates that have a unique contribution to the target sequence (Fiser, 2004). Second, the template structures may be aligned with the same part of the target, in which case the model can be built on the locally best template (improving model quality).

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Although the idea of combining multiple templates sounds straightforward, its implementation is fairly complex. The real challenge is not the identification of a list of suitable template candidates, but an optimal combination of these. This is because template search methods ‘outperform’ the needs of comparative modeling in the sense that they are able to locate sequences so remotely related that no reliable comparative model can be built for them.
The reason for this is that sequence relationships are often established on short conserved segments, while a successful comparative modeling exercise requires an overall correct alignment for the entire modeled part of the protein. The MT module of the M4T algorithm addresses this very important issue.

The second step in comparative modeling (i.e. the calculation of an accurate alignment of a target sequence to a template structure) remains a bottleneck in producing good quality homology models. A number of alignment methods have been developed and are publicly available [MUSCLE (Edgar, 2004), CLUSTALW (Thompson et al., 1994), Align2d (Madhusudhan et al., 2006), T-Coffee (Notredame et al., 2000), FFAS (Jaroszewski et al., 2000) and SATCHMO (Edgar and Sjolander, 2003)]. However, none of these alignment methods consistently produces the best solution in all cases (Prasad et al., 2003; Rai and Fiser, 2006). Furthermore, alignments produced by two different methods are often better in some regions and worse in others when compared to each other. One possible solution to this problem is to consider several alignment methods and combine better-aligned parts into a unique solution (Kosinski et al., 2005; Rai and Fiser, 2006).

M4T has been developed to produce accurate alignments and models by minimizing the errors associated with the first two steps in comparative modeling (recognizing and combining templates and generating an optimal input alignment). In the first step, the MT module uses an iterative clustering approach to select and combine multiple protein structures to serve as templates. Next, to reduce errors associated with alignments, an iterative implementation of the earlier published Multiple Mapping Method (MMM) (Rai and Fiser, 2006) is used that considers solutions from several alignment methods and combines better-aligned parts into a unique solution. The performance of M4T has been rigorously tested using various benchmarks.
We demonstrate that M4T produces better models when multiple templates are used than when only the single best available template is used; M4T's superior performance stands out in the low sequence identity region, which presents a major challenge to homology modeling. Furthermore, M4T also compares favorably with other competitive approaches and with the performance of expert users at CASP.

2 METHODS
2.1 Template selection method: MT module
The target sequence is used as a query to search for homologous protein structure(s) that could serve as template(s) by running three iterations of PSI-BLAST (Altschul et al., 1997) against PDB (Berman et al., 2000), with an E-value cutoff of 0.0001. Only those hits are selected where the sequence overlap with the target sequence is >60% of the actual SCOP domain length or more than 75% of the PDB chain length in case of a missing SCOP classification. Next, the hits are clustered using an iterative clustering procedure that identifies the most suitable PDB structures to combine as templates. The goal of the clustering step is to identify the smallest number of templates that can contribute the most to the model. Templates are selected or discarded according to the following procedure [Fig. 1, also Fig. 2 in Fernandez-Fuentes et al. (2007)]:
(1) Cluster initiation. The hit with the smallest E-value is selected and is used to seed a cluster. All hits that align in the same region (within 10 flanking residues of the first selected hit) are added to this cluster.
(2) Sequence identity of hits to query. The sequence identity is calculated between the query and all hits in the cluster according to the PSI-BLAST alignment. If the sequence identity of the best available hit is larger than 50%, only those additional hits are kept in the cluster whose identity is within 20% of the best hit.
(3) Characterize hits as unique and non-unique.
A hit is unique if it contains at least one stretch of 8 or more residues aligned to a region of the target sequence that is not covered by any other hit. The current limit of 8 residues approximately corresponds to the upper limit up to which a reliable loop conformation can be built using available approaches, and it is therefore subject to change as loop modeling techniques improve (Fernandez-Fuentes et al., 2006; Fiser et al., 2000). Unique and non-unique attributes are assigned to all hits that form a cluster and then all hits are ranked within a cluster according to their crystal resolution. Thus, a hit with the best crystal resolution is always unique and the remaining hits can be unique only if they contribute to a unique region (e.g. to an insertion that is solved in that one structure only and not in any other).
(4) Consolidating the clusters. Once the hits that form the cluster are classified into ‘unique’ and ‘non-unique’, a purging process is started. It has three consecutive qualifying steps and applies to non-unique hits only:
(a) The first step is a sequence identity comparison using a greedy algorithm, where only those non-unique hits that have a sequence identity between 30 and 90% to any unique hit are kept; the rest are discarded. Note that once a non-unique hit is selected, the remaining non-unique hits will be compared against the unique plus the selected non-unique hits. Again, the order of comparisons is set by crystal resolution. The sequence identity is calculated using the alignments between hits and target sequence given by PSI-BLAST. In general, this step ensures that the selected templates are structurally neither too similar nor too dissimilar.
(b) Next, a filtering step takes place that consolidates templates with varying crystal resolution. Non-unique hits are discarded if the difference in crystal resolution to the experimentally best-solved unique template is larger than 1.5 Å.
This step guarantees that templates of significantly poorer resolution are not used. NMR structures are assigned a virtual 4.5 Å resolution, which means that an NMR solution is used only if it is the only template or if a similar X-ray structure has a resolution worse than 3 Å.
(c) The last filter determines whether a hit contributes to an ‘underrepresented’ part of the target, i.e. a non-unique hit is kept only if it is aligned to a region of 8 or more residues that is covered by two or fewer hits.
(5) Return to point (1) if there are hits that are not assigned to any cluster and iterate again, if necessary by initiating and establishing new clusters.

The result of this iterative clustering process is one or more clusters of templates containing one or more template structures. Next, within each cluster, all templates are aligned to the corresponding target sequence using the iterative-MMM approach (see below). In a last consolidation step, sequence-to-structure alignments of clusters that overlap are combined. The overlapping parts of the templates are superposed and an LGA_S score (Zemla, 2003) is calculated on that superposition. If this score is larger than 70%, then the overlapping clusters are combined using their alignment to the (same) target sequence as reference. If clusters of templates are not overlapping, or the overlap between them cannot be superposed with sufficient structural accuracy, then individual models are built for each ‘modelable’ part of the target sequence for each cluster of templates.

2.2 Target to template(s) alignment: MMM module
The target-to-template(s) alignments are calculated using an iterative implementation of the Multiple Mapping Method (Rai and Fiser, 2006). To construct profiles, the sequences of the target and template(s) are independently searched against the non-redundant database [NR (Boeckmann et al., 2003)] of NCBI using five iterations of PSI-BLAST with an E-value cutoff of 0.0001.
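
To make steps (3) and (4b) of the MT module in Section 2.1 more concrete, the following is a minimal Python sketch and not the authors' implementation. The `Hit` class, the function names and the toy cluster are hypothetical. The sketch assumes that "not covered by any other hit" can be read as "not covered by any better-resolved hit in the cluster", which is one way to reconcile the uniqueness rule with the statement that the best-resolved hit is always unique. Steps (4a) and (4c) are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

MIN_UNIQUE_STRETCH = 8        # residues, from step (3)
MAX_RESOLUTION_GAP = 1.5      # Angstrom, from step (4b)
NMR_VIRTUAL_RESOLUTION = 4.5  # Angstrom, assigned to NMR structures


@dataclass
class Hit:
    """Hypothetical container for one PSI-BLAST hit in a cluster."""
    name: str
    covered: Set[int]                   # target residue indices aligned to this hit
    resolution: Optional[float] = None  # None marks an NMR structure

    @property
    def effective_resolution(self) -> float:
        return NMR_VIRTUAL_RESOLUTION if self.resolution is None else self.resolution


def longest_run(positions: Set[int]) -> int:
    """Length of the longest stretch of consecutive residue indices."""
    best = run = 0
    previous = None
    for pos in sorted(positions):
        run = run + 1 if previous is not None and pos == previous + 1 else 1
        best = max(best, run)
        previous = pos
    return best


def split_unique(hits: List[Hit]) -> Tuple[List[Hit], List[Hit]]:
    """Rank hits by resolution and label each as unique or non-unique."""
    unique: List[Hit] = []
    non_unique: List[Hit] = []
    seen: Set[int] = set()  # target residues covered by better-resolved hits
    for hit in sorted(hits, key=lambda h: h.effective_resolution):
        novel = hit.covered - seen
        # The best-resolved hit is always unique; later hits need a novel stretch.
        if not unique or longest_run(novel) >= MIN_UNIQUE_STRETCH:
            unique.append(hit)
        else:
            non_unique.append(hit)
        seen |= hit.covered
    return unique, non_unique


def resolution_filter(unique: List[Hit], non_unique: List[Hit]) -> List[Hit]:
    """Step (4b): drop non-unique hits resolved much worse than the best unique hit."""
    best = min(h.effective_resolution for h in unique)
    return [h for h in non_unique
            if h.effective_resolution - best <= MAX_RESOLUTION_GAP]


# Toy cluster (invented data, for illustration only).
cluster = [
    Hit("A", set(range(0, 100)), 1.8),
    Hit("B", set(range(0, 110)), 2.0),  # adds residues 100-109
    Hit("C", set(range(10, 91)), 2.5),  # covers nothing new
    Hit("D", set(range(0, 51))),        # NMR structure, covers nothing new
]
unique_hits, other_hits = split_unique(cluster)
kept_hits = resolution_filter(unique_hits, other_hits)
print([h.name for h in unique_hits], [h.name for h in kept_hits])
```

In this toy cluster, A and B are labeled unique, C survives the resolution filter, and the NMR entry D is dropped because its virtual 4.5 Å resolution is more than 1.5 Å worse than that of A.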
Next, BlastProfiler (Rai et al., 2007) is run to build sequence profiles for both the target and template sequences. The program parses all iterations of the PSI-BLAST output, and locates and stores those pairwise alignments between the query and database sequences that meet the filtering criteria. The values specified for filtering are:
(i) Lower and upper cutoffs for percent sequence identity between the hit and the query, as reported in the pairwise BLAST alignment; default: 30 and 90%, respectively.
(ii) Lower bound for alignment length; default: 30 residues.
(iii) Maximal E-value for each hit; default: 0.0001.
(iv) Minimal required coverage of the query in the alignment, in percentage; default: 30%.

Typically, the PSI-BLAST output contains more than one alignment for the same hit sequence, especially when multiple iterations are performed. Such alternative alignments may include either the same or different regions of the hit sequence. Alignments to different regions of the target are kept as separate entries. Two alignments that involve the same hit sequence are considered redundant if the overlap is >50%. Because alignments produced in later iterations contain more specific information about the sequence profile, these alignments are preferred over earlier ones in case of overlaps. The second major step in the selection of a set of representative hit sequences is to remove sequence redundancy using the CD-HIT clustering program (Li et al., 2002) at the 40% identity level.

Starting from the collected sequences, three separate profiles are calculated for each template and for the target sequence, namely clustalw_d_profile, clustalw_m_profile and muscle_profile. The clustalw_d_profile and clustalw_m_profile are obtained by aligning the sequences using CLUSTALW (Thompson et al., 1994) with the default gap penalty function (clustalw_d_profile) and with a modified gap penalty function (clustalw_m_profile), and muscle_profile is obtained using MUSCLE (Edgar, 2004).
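
The filtering criteria (i) to (iv) and the redundancy rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not BlastProfiler's code: the `Alignment` record and all function names are hypothetical, the overlap is measured on hit-sequence coordinates relative to the shorter of the two aligned regions (the text does not say how the >50% overlap is normalized), and parsing of PSI-BLAST output is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one pairwise PSI-BLAST alignment."""
    hit_id: str
    iteration: int         # PSI-BLAST iteration that produced the alignment
    identity: float        # percent identity to the query
    length: int            # alignment length in residues
    e_value: float
    query_coverage: float  # percent of the query covered
    hit_start: int         # aligned region on the hit sequence (inclusive)
    hit_end: int


def passes_filters(a: Alignment, min_identity: float = 30.0,
                   max_identity: float = 90.0, min_length: int = 30,
                   max_e_value: float = 1e-4, min_coverage: float = 30.0) -> bool:
    """Criteria (i)-(iv) with the default values quoted in the text."""
    return (min_identity <= a.identity <= max_identity
            and a.length >= min_length
            and a.e_value <= max_e_value
            and a.query_coverage >= min_coverage)


def overlap_fraction(a: Alignment, b: Alignment) -> float:
    """Overlap of two hit regions relative to the shorter one (an assumption)."""
    shared = min(a.hit_end, b.hit_end) - max(a.hit_start, b.hit_start) + 1
    shorter = min(a.hit_end - a.hit_start, b.hit_end - b.hit_start) + 1
    return max(shared, 0) / shorter


def select_alignments(alignments: List[Alignment],
                      max_overlap: float = 0.5) -> List[Alignment]:
    """Keep filtered alignments, preferring later iterations among redundant ones."""
    kept: List[Alignment] = []
    # Visit later iterations first so they win over redundant earlier ones.
    for a in sorted(alignments, key=lambda x: -x.iteration):
        if not passes_filters(a):
            continue
        redundant = any(k.hit_id == a.hit_id
                        and overlap_fraction(k, a) > max_overlap
                        for k in kept)
        if not redundant:
            kept.append(a)
    return kept


# Invented example records, for illustration only.
parsed = [
    Alignment("hitA", 1, 45.0, 100, 1e-20, 80.0, 1, 100),
    Alignment("hitA", 3, 42.0, 106, 1e-30, 85.0, 5, 110),   # supersedes iteration 1
    Alignment("hitA", 2, 38.0, 61, 1e-8, 35.0, 200, 260),   # different region, kept
    Alignment("hitB", 3, 95.0, 120, 1e-50, 90.0, 1, 120),   # identity above 90%
    Alignment("hitC", 2, 40.0, 80, 1e-2, 60.0, 1, 80),      # E-value too large
]
selected = select_alignments(parsed)
print([(a.hit_id, a.iteration) for a in selected])
```

Sorting by descending iteration is only one simple way to express the stated preference for later iterations; the redundancy removal with CD-HIT at 40% identity would follow as a separate step.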
At the end of this step, three alternative profile-to-profile-based sequence alignments are available, which are used as input to MMM (Rai and Fiser, 2006). These three alternative alignments are combined in the following manner: clustalw_d_profile is combined with muscle_profile, generating an MMM alignment, mmm_alignment_1; clustalw_m_profile is combined with muscle_profile, generating mmm_alignment_2. Finally, mmm_alignment_1 and mmm_alignment_2 are used as inputs to MMM for the final MMM alignment (Fig. 1).

2.3 Model building
Models are built with the MODELLER program (Fiser and Sali, 2003; Sali and Blundell, 1993) using the default values for the __model.top routine. Selected template(s) and optimized alignment(s) are provided as inputs.

Fig. 1. Flowchart for model building. General overview of the algorithm: starting from a query sequence, a search is performed using PSI-BLAST and template(s) are selected in the MT module; subsequently, the MMM module performs the sequence alignment(s), and finally MODELLER builds the protein model(s). See Methods section for further explanations.

2.4 Benchmark sets
Two different test sets were used to benchmark our method. The first benchmark set was composed of sequences used in the CASP6 experiment for comparative modeling assessments. The target sequences were downloaded from http://predictioncenter.gc.ucdavis.edu/casp6/ and only those target sequences that produced a hit against a tailored PDB (Berman et al., 2000) dataset (see below) with PSI-BLAST (Altschul et al., 1997) were kept. In total 24 targets from 17 target protein sequences were considered (CASP target identifications: T0204, T0229, T0231, T0233, T0240, T0246, T0247, T0264, T0266, T0268, T0269, T0271, T0274, T0275, T0276, T0277 and T0282).
The second benchmark set was composed of 765 selected protein sequences with known structures, taken out of 1160 from a previous work (Rai and Fiser, 2006); for each of these selected sequences the MT module returned more than one hit or template. Each query sequence of both benchmark sets was modeled using a tailored PDB (MT module) and a tailored NR database (MMM module). The tailored databases did not contain any structure or sequence that was deposited after the expiration date set by the CASP organizers.

2.5 Measure of model quality
Three measures were used to assess the quality of the models, i.e. the similarity between the generated comparative models and the corresponding experimental structure: RMSDseq, RMSDstr and GDT_TS score. RMSDseq is the root mean square deviation calculated on Cα atoms after a sequence-dependent superposition of Cα positions using a 5.0 Å distance cutoff. RMSDstr is the same as RMSDseq but on a sequence-independent superposition (i.e. using the best structural superimposition). Finally, the GDT_TS score, or global distance test total score, was calculated. GDT_TS is one of the main metrics used to evaluate CASP experiments, and it accounts for the structural similarity between the model and the experimental structure by measuring the fraction of superposable residues at distance cutoffs of 1.0, 2.0, 4.0 and 8.0 Å. All these measures were calculated using the LGA program (Zemla, 2003).

Table 1. List of CASP6 targets and the accuracy of the comparative models built using a template with the best PSI-BLAST E-value

Target    Template  Nt   Nm   RMSDseq (Å)  RMSDstr (Å)  Nr   GDT_TS
T0204     1HXP_A    297  270  3.67         1.84         248  73.05
T0229_1   1ML8_A    23   23   0.77         0.77         23   95.83
T0229_2   1ML8_A    102  98   2.42         1.99         96   78.28
T0231     1F7S_A    137  126  3.31         1.93         120  79.16
T0233_1   1KHD_D    66   65   1.49         1.47         65   89.61
T0233_2   1KHD_D    270  262  2.13         1.72         257  80.15
T0240     1QXX_A    90   76   21.75        2.87         45   40.78
T0246     1A05_A    354  354  2.29         2.09         347  75.77
T0247_1   1PJ6_A    139  98   2.58         1.73         88   82.06
T0247_2   1PJ6_A    134  134  2.48         1.66         114  86.11
T0247_3   1PJ6_A    76   76   3.67         2.27         70   69.40
T0264_1   1VHV_A    116  99   1.94         1.81         98   86.61
T0264_2   1VHV_A    173  131  4.24         2.17         115  66.98
T0266     1DBU_A    150  121  1.67         1.67         121  82.02
T0268_1   1N2X_A    172  168  1.00         1.01         75   93.66
T0268_2   1N2X_A    109  109  1.25         1.25         109  90.82
T0269_1   1QMV_A    158  158  2.35         1.56         150  84.65
T0269_1   1QQ2_A    158  157  2.54         1.71         146  80.41
T0271     1RLH_A    161  132  2.22         1.67         127  82.00
T0274     1I0R_A    156  150  2.98         1.52         142  81.83
T0275     1MJH_A    135  135  6.95         1.9          108  55.37
T0276     1SOU_A    168  166  3.44         2.41         157  62.95
T0277     1JOG_A    117  91   1.62         1.63         91   86.53
T0282     1PQ3_A    323  287  4.35         2.00         251  70.47

Nt: number of residues in target structure; Nm: number of residues in model; RMSDseq: root mean square deviation of Cα atoms based on a sequence-dependent superposition; RMSDstr: root mean square deviation of Cα atoms based on a structure-dependent superposition; Nr: number of residues considered for RMSD calculation; GDT_TS: global distance test total score (see Methods section for more information).

3 RESULTS
3.1 Performance of M4T
The performance of the method has been benchmarked in two different scenarios. M4T performance was tested on CASP6 comparative modeling targets and compared to models that were based on the single best template and then to the single best model produced by any group at CASP6. Finally, on a larger independent set, the overall performance of M4T was tested by building models on single and multiple templates for 765 cases.

3.2 Single versus multiple templates at CASP
All comparative modeling targets were tested by building models with M4T using the single best identified template and then by using multiple templates. In this setup, we used the MMM alignment module of M4T to generate input alignments for both cases. For 11 out of 24 CASP comparative modeling targets, it was possible to combine multiple templates.
For all cases but one (T0269), the use of multiple templates provides a model that is superior in terms of RMSDseq, RMSDstr and GDT_TS scores to the one based on a single best template (Tables 1 and 2). The most impressive improvement takes place in the case of target sequence T0275, where the GDT_TS score increases from 55.37 to 72.41 when multiple templates are combined. These observations confirm the anecdotal reports of CASP participants that suggested that the use of multiple templates is advantageous (Sanchez and Sali, 1997; Venclovas and Margelevicius, 2005).

Table 2. List of CASP6 target sequences and the accuracy of their prediction using multiple templates

Target    Templates                        Nt   Nm   RMSDseq (Å)  RMSDstr (Å)  Nr   GDT_TS
T0204     1HXP_A, 1GUP_A                   297  260  3.61         1.77         245  73.46
T0231     1F7S_A, 1M4J_A, 1AHQ_, 1AK6_     137  134  2.11         1.57         130  81.91
T0233_1   1KHD_{A,C,D}, 1BRW_A, 1V8G_A     66   62   1.13         1.14         62   91.13
T0233_2   1KHD_{A,C,D}, 1BRW_A, 1V8G_A     270  263  2.01         1.45         259  85.51
T0246     1A05_A, 1HQS_A, 1CNZ_A, 1CM7_A   354  352  2.02         1.95         347  78.69
T0268_1   1N2X_A, 1M6Y_B                   172  169  0.99         1.00         76   93.75
T0268_2   1N2X_A, 1M6Y_B                   109  109  1.20         1.21         109  91.28
T0269_1   1QMV_A, 1N8J_A                   158  152  3.46         1.68         143  81.42
T0269_1   1QQ2_A, 1ST9_A                   158  157  2.58         1.89         149  79.33
T0275     1MJH_A, 1JMV_A, 1TQ8_A           135  135  6.89         1.73         109  72.41
T0282     1PQ3_A, 1CEV_A                   323  278  3.24         1.92         251  73.47

See Table 1 for explanation of headers.

3.3 Comparison with current methods and expert knowledge
M4T also compared well with state-of-the-art methods and human experts in protein modeling. Table 3 shows the performance of M4T as compared with the single best models submitted to CASP6 by any group. These results often differ from the ones reported in the previous section because alignments may be different due to different methods used, different profiles employed or manual editing. Certain users may have used information on multiple structures. In addition, expert users may have attempted side chain and loop modeling in certain parts of the models. An ultimate goal of automated structure prediction is to deliver models with an accuracy competitive with that of models created by ‘expert users’, and to do so in a fully automated way and in a short time. In 9 out of 24 cases, M4T outperformed the single best model submitted to CASP (Table 3). As another qualitative comparison, in 9 cases the differences between the best CASP model and M4T were small, and in 5 other cases M4T was significantly better, while in 9 cases CASP models turned out to be more accurate (for one case M4T did not return a model). Of the 24 best CASP models, the largest number submitted by a single research group was 9; the second largest was 2. In this simplified comparison, M4T would fare as the second best individual performer, with five models clearly superior to any other submission. While it is true that it is hard to establish statistical significance from a small number of test cases, such as at CASP (Marti-Renom et al., 2002), we perceive this performance as encouraging and as a sign that automated methods are becoming competitive with the best expert users.
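
The GDT_TS and RMSDseq values compared above can be illustrated with a simplified Python sketch. It is not the LGA program: LGA searches for the best superposition at each distance cutoff, whereas the sketch assumes that per-residue Cα distances from a single superposition are already available. The function names, the normalization by the number of target residues and the toy distances are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence, Tuple


def gdt_ts(distances: Sequence[float], n_target: int,
           cutoffs: Sequence[float] = (1.0, 2.0, 4.0, 8.0)) -> float:
    """Average percentage of target residues within each distance cutoff."""
    fractions = [sum(d <= c for d in distances) / n_target for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


def rmsd(distances: Sequence[float],
         cutoff: Optional[float] = None) -> Tuple[float, int]:
    """RMSD over residue pairs within the cutoff, plus the number of pairs used."""
    used = [d for d in distances if cutoff is None or d <= cutoff]
    if not used:
        raise ValueError("no residue pairs within the cutoff")
    return math.sqrt(sum(d * d for d in used) / len(used)), len(used)


# Invented Calpha-Calpha distances (Angstrom) for a five-residue toy target.
ca_distances = [0.5, 1.5, 3.0, 6.0, 10.0]
score = gdt_ts(ca_distances, n_target=5)
value, n_used = rmsd(ca_distances, cutoff=5.0)
print(round(score, 2), round(value, 2), n_used)
```

With the 5.0 Å cutoff, only three of the five toy residue pairs enter the RMSD, which mirrors the role of the Nr column in Tables 1 and 2.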
3.4 Benchmarking on an independent test set The benefit of using multiple templates was also confirmed on an independent benchmarking set consisting of 765 proteins taken from an earlier study (Rai and Fiser, 2006). Two sets of models were built: (a) using multiple templates, and (b) using 2562 the single best template. On Figure 2, RMSDseq is shown versus sequence identity (comparing the quality of models to the sequence identity between the target and the best template). Below 50% sequence identity, models built using multiple templates are more accurate than those built using a single template only and this trend is accentuated as one moves into more remote target-template pair cases. Meanwhile, the advantage of using multiple templates gradually disappears above 50% target-template sequence identity cases. This result is also consistent with the performance on the CASP6 set where hits usually have a low sequence identity with their corresponding query. Besides improving the model quality, the use of multiple templates also increases model coverage, i.e. the resulting models cover a larger fraction of the target sequence, sometimes as much as 50 residues longer (Fig. 3). 3.5 Two examples of models predicted using single and multiple templates Figure 4 shows the structure prediction of PDB: 1ekx, chain A. After searching in a tailored PDB database, the hit with highest E-value was 9atc (E-value 1E 176). MT module returned a cluster of three templates: 1acm, 1a1s and 1oth. Both models are very accurate for the core of the protein, however, the Comparative protein structure modeling Table 3. 
Comparison of prediction accuracy between the best possible model using our method and the best model submitted to CASP6 Target T0204 T0229_1 T0229_2 T0231 T0233_1 T0233_2 T0240 T0246 T0247_1 T0247_2 T0247_3 T0264_1 T0264_2 T0266 T0268_1 T0268_2 T0269_1 T0269_2 T0271 T0274 T0275 T0276 T0277 T0282 BEST M4T BEST CASP 6 GDT_TS RMSDseq (Å) GROUP GDT_TS RMSDseq (Å) 73.46 95.83 78.28 81.91 91.13 85.51 40.79 78.69 82.06 86.11 69.41 86.62 66.98 82.03 93.75 91.28 84.65 N/A 82.01 81.83 72.41 62.95 86.54 73.47 3.61 0.77 2.42 2.11 1.13 2.01 21.75 2.02 2.58 2.48 3.67 1.94 4.24 1.67 0.99 1.21 2.35 N/A 2.22 2.98 6.89 3.44 1.62 3.24 GINALSKY CBRC-3D CHIMERA NANOMODEL ROHL GINALSKY GINALSKY HONIGLAB ALSO-RAN_U GINALSKY TOME_U JONES-UCL GINALSKY SKOLNICK-ZHANG CASPITA CBSU GINALSKY GINALSKY GENESILICO-GROUP GINALSKY VENCLOVAS TOME_U KOLINSKY&BUJNICKI GINALSKY 74.49 97.92 80.15 92.34 94.69 90.66 64.44 89.05 78.83 85.56 81.91 85.78 64.31 82.83 90.41 90.59 88.61 64.34 76.86 80.93 81.11 76.49 88.03 70.66 3.96 0.76 2.11 1.66 0.94 1.67 9.34 1.29 3.21 1.87 2.03 2.11 2.45 1.60 1.32 1.3 1.77 5.66 2.65 3.59 2.88 2.44 1.61 6.02 and 1yna as templates. For comparison, the length of the model using the single best E-value hit, 1xyn, is 167 residues only. The longer model includes an additional supersecondary element, a beta-turn-beta-turn element, which is not present in the model built with single best template. 4 Fig. 2. RMSD(seq) versus sequence identity. Using a dataset of 765 proteins with known structure, two sets of models were built: (1) using one template (best E-value hit only; light bars), (2) using multiple templates (gray bars). The percentage of sequence identity is calculated between the hit with the highest E-value and the query sequence. The error of the mean is shown. model built using multiple templates (red) is more accurate in two regions, marked A and B, than the model built using a single template. 
An additional advantage of using multiple templates is that the resulting model is more complete. Figure 5 shows the model for PDB 1hix, chain B. The length of the model built with multiple templates is 187 and was built using 2bvv, 1enx, 1f5f DISCUSSION AND CONCLUSIONS We described a new algorithm, M4T, for fully automated comparative modeling that makes it possible to: (1) efficiently selects and combines multiple template structures; and (2) generates an accurated target-to-template alignment. For template selection step, we introduced an iterative clustering approach of potential templates that is driven by a set of filtering and ranking criteria and is based on sequence signal, crystal resolution and on the ‘relative sequence novelty’ contribution to the target. For aligning the selected templates with the target sequence, we used a new version of the MMM method. The novelty comes from employing a sequence profile building module so that profile-to-profile alignments are used as inputs to MMM instead of pairwise alignments. The other difference to the earlier implementation of MMM is that the input alignments are combined in an automated iterative way, unlike before when the actual combination required supervision (Rai and Fiser, 2006). The original version of MMM showed a statistically significant improvement over existing methods by reducing alignment errors in the range of 3–17% over the inputs. MMM also compared favorably over two alignment meta-servers tested (Lambert et al., 2002; 2563 N.Fernandez-Fuentes et al. Fig. 3. Histogram of the increase of model coverage. Each query sequence is modeled using single and multiple template(s). The histogram shows the frequency of difference between the length of model built using multiple templates (Lm), and length of the model built using a single template (Ls) sequence identity. Fig. 5. Model for pdb 1hix chain B using single and multiple templates. 
The X-ray structure, the model with multiple templates, and the model built with a single template are shown in gray, red and blue, respectively. The combination of multiple templates resulted in a more complete model that includes an extra beta-turn-beta-turn region (20 amino acids), depicted in ribbon in the figure.

Fig. 4. Model for pdb 1ekx chain A using single and multiple templates. The X-ray structure, the model with multiple templates, and the model with a single template are shown in gray, red and blue, respectively. Although both models agree very well with the core of the X-ray structure, the model constructed using multiple templates agrees much better in two exposed regions, A and B, than the model built using a single template. Figures 4 and 5 were generated using PyMOL (http://pymol.sourceforge.net/).

Prasad et al., 2003). Meanwhile, the iterative version of MMM has been demonstrated here to outperform its own earlier implementation (Rai et al., 2007). We have shown that the fully automated M4T performs as well as or better than the most advanced methods in protein structure modeling in the hands of expert users. M4T also performs better at low sequence identity, both in terms of model quality and model coverage.

4.1 Web-server
M4T is accessible as a web-server at http://www.fiserlab.org/servers/M4T/ (Fernandez-Fuentes et al., 2007). The web-server has a straightforward interface. The user only needs to provide a target sequence, which can be entered in a text box or uploaded as a text file, a short description of the sequence, and a valid e-mail address. The target sequence must be in plain text containing one-letter amino acid codes (without any header). The server returns one or more full-atom models in PDB format as output, plus the alignment(s) used for modeling. All jobs are submitted to a queuing system, so the delay in execution depends on the number of active queries. Once the prediction is completed, results are sent by e-mail in the form of a link pointing to a temporary web page that stores the results for 1 month.

ACKNOWLEDGEMENT
This work was supported by NIH GM62519-04.
Conflict of Interest: none declared.

REFERENCES
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389.
Berman,H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235.
Boeckmann,B. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365.
Chance,M.R. et al. (2004) High-throughput computational and experimental techniques in structural genomics. Genome Res., 14, 2145.
Domingues,F.S. et al. (1999) Sustained performance of knowledge-based potentials in fold recognition. Proteins, 37, 112.
Edgar,R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5, 113.
Edgar,R.C. and Sjolander,K. (2003) SATCHMO: sequence alignment and tree construction using hidden Markov models. Bioinformatics, 19, 1404.
Fernandez-Fuentes,N. et al. (2006) A supersecondary structure library and search algorithm for modeling loop in protein structures. Nucleic Acids Res., 14, 2085.
Fernandez-Fuentes,N. et al. (2007) M4T: a comparative protein structure modeling server. Nucleic Acids Res.
Fiser,A. (2004) Protein structure modeling in the proteomics era. Expert Rev. Proteomics, 1, 9–11.
Fiser,A. and Sali,A. (2003) Modeller: generation and refinement of homology-based protein structure models. Methods Enzymol., 374, 461.
Fiser,A. et al. (2000) Modeling of loops in protein structures. Protein Sci., 9, 1753.
Jaroszewski,L. et al. (2000) Improving the quality of twilight-zone alignments. Protein Sci., 9, 1487.
Kosinski,J. et al. (2005) FRankenstein becomes a cyborg: the automatic recombination and realignment of fold recognition models in CASP6. Proteins, 61, 106–113.
Lambert,C. et al. (2002) ESyPred3D: prediction of proteins 3D structures. Bioinformatics, 18, 1250–1256.
Li,W. et al. (2000) Saturated BLAST: an automated multiple intermediate sequence search used to detect distant homology. Bioinformatics, 16, 1105.
Li,W. et al. (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18, 77–82.
Madhusudhan,M.S. et al. (2006) Variable gap penalty for protein sequence-structure alignment. Protein Eng. Des. Sel., 19, 129–33.
Marti-Renom,M.A. et al. (2000) Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct., 29, 291.
Marti-Renom,M.A. et al. (2002) Reliability of assessment of protein structure prediction methods. Structure (Camb.), 10, 435.
McGuffin,L.J. et al. (2000) The PSIPRED protein structure prediction server. Bioinformatics, 16, 404.
Notredame,C. et al. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205.
Prasad,J.C. et al. (2003) Consensus alignment for reliable framework prediction in homology modeling. Bioinformatics, 19, 1682.
Rai,B.K. and Fiser,A. (2006) Multiple mapping method: a novel approach to the sequence-to-structure alignment problem in comparative protein structure modeling. Proteins, 63, 644–661.
Rai,B.K. et al. (2007) MMM: a sequence-to-structure alignment protocol. Bioinformatics, 22, 2691–2692.
Sali,A. and Blundell,T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol., 234, 779.
Sanchez,R. and Sali,A. (1997) Evaluation of comparative protein structure modeling by MODELLER-3. Proteins, (Suppl. 1), 50.
Shi,J. et al. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol., 310, 243.
Thompson,J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673.
Venclovas,C. and Margelevicius,M. (2005) Comparative modeling in CASP6 using consensus approach to template selection, sequence-structure alignment, and structure assessment. Proteins, 61, 99–105.
Zemla,A. (2003) LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res., 31, 3370–3374.
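Note on input preparation. Section 4.1 requires the target sequence as plain one-letter amino acid codes without a header. The following Python sketch, which is not part of M4T, shows one way such input could be prepared from a FASTA-style record before submission. The function name `prepare_target_sequence` and the restriction to the 20 standard residues are assumptions made for illustration; how the server treats non-standard codes is not stated in the text.

```python
# Hypothetical helper, not part of the M4T server or its code base.
# Assumption: only the 20 standard one-letter amino acid codes are allowed.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def prepare_target_sequence(raw: str) -> str:
    """Return a header-free, upper-case, one-letter amino acid string.

    FASTA header lines (starting with '>') are dropped, all whitespace is
    removed, and any character outside the standard codes raises ValueError.
    """
    lines = [line.strip() for line in raw.splitlines()]
    # Keep only non-empty sequence lines; skip FASTA headers.
    body = "".join(line for line in lines if line and not line.startswith(">"))
    # Remove internal whitespace and normalise case.
    residues = "".join(body.split()).upper()
    if not residues:
        raise ValueError("no sequence found in input")
    invalid = sorted(set(residues) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError("unsupported characters: " + "".join(invalid))
    return residues


if __name__ == "__main__":
    # Made-up record used only to exercise the function.
    example = ">hypothetical target\nMKTAYIAKQR\nqislvk\n"
    print(prepare_target_sequence(example))
```

The returned string can be pasted into the text box or saved as a plain text file for upload. Rejecting unknown characters up front is a conservative choice that avoids submitting a job that might fail after waiting in the queue.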