Ab-initio prediction and reliability of protein structural genomics by PROPAINOR algorithm

Rajani R. Joshi a,b,*, S. Jyothi b,1
a BJM School of Bioscience and Bioengineering, Indian Institute of Technology, Powai, Mumbai 400076, India
b Department of Mathematics, Indian Institute of Technology, Powai, Mumbai 400076, India

Abstract

We have formulated the ab-initio prediction of the 3D-structure of proteins as a probabilistic programming problem where the inter-residue 3D-distances are treated as random variables. Lower and upper bounds for these random variables and the corresponding probabilities are estimated by nonparametric statistical methods and knowledge-based heuristics. In this paper we focus on the probabilistic computation of the 3D-structure using these distance interval estimates. Validation of the predicted structures shows our method to be more accurate than other computational methods reported so far. Our method is also found to be computationally more efficient than other existing ab-initio structure prediction methods. We also provide a reliability index for the predicted structures. Because of its computational simplicity and its applicability to any random sequence, our algorithm, called PROPAINOR (PROtein structure Prediction by AI and Nonparametric Regression), has significant scope in computational protein structural genomics.

Keywords: Ab-initio protein structure prediction; Distance geometry; Nonparametric discriminant analysis; Nonparametric regression; Probabilistic programming; Reliability index

1. Introduction

Knowledge of the three-dimensional structure of a protein often greatly enhances our understanding of its function and its interaction with other proteins. With the development of sequencing techniques, protein sequencing has now become a routine task.
But the determination of the 3D-structures still remains difficult: experimental methods of structure determination like NMR spectroscopy and X-ray crystallography continue to be expensive, time-consuming techniques constrained by the availability of pure extracts and crystals. Computational methods of structure prediction have become extremely important in the post-genomic period because they can be applied directly to the primary sequence (or to synthetic peptides); hence their importance in protein structural genomics. Modelling by homology is at present the most reliable computational technique for 3D-structure prediction. But these alignment methods prove to be of limited use for many protein sequences due to lack of a suitable structure template. Ab-initio methods of structure prediction involving potential energy minimization, in turn, are computationally tractable only for very small proteins. Thorough statistical analysis that addresses the implicit heterogeneity and stochastic nature of protein data has therefore become very important. Though few in number, recent approaches in this direction for 3D-structure prediction have shown good promise (Xia et al., 2000; Jyothi and Joshi, 2001; Simons et al., 2001).

We had developed a new algorithm called PROPAINOR (protein structure prediction by AI and nonparametric regression; Jyothi and Joshi, 2001) for the ab-initio prediction of protein structure. Here, high probability Cα distance intervals were estimated for short and medium-range inter-residue separation by nonparametric regression on statistically significant features of the primary chain. Heuristics based on the globular geometry of protein folds were used in lieu of long-range constraints to impose compactness on the predicted structures. These constraints were used in the estimation of the 3D-structures by the distance geometry approach.
Performance of PROPAINOR was found (Jyothi and Joshi, 2001) to be better than the extensively used programs like X-PLOR (Nilges et al., 1988), DRAGON (Aszodi et al., 1995) and the threading methods (Sippl, 1990). The sensitivity analysis also showed the possibility of a drastic improvement in the prediction accuracy with the incorporation of a few long-range restraints. The results of this analysis hence motivated us to look into the possibility of enhancing the accuracy of PROPAINOR by improving upon the distance estimation and structure calculation procedures. We have modified the distance estimators using the multivariate technique of nonparametric discriminant analysis (NDA) for the short and medium range distance intervals (Jyothi and Joshi, 2002) and NDA together with some knowledge-based heuristics for the long range distance intervals and contacts (outlined in Appendix A). These refinements are highlighted in Section 2. Under the new approach, we also obtain the probability of correctness of the distance intervals. We also assign weights (worth-coefficients) to each type of distance interval according to its relative accuracy in the training sample. These weights are used along with the probability assigned to the distance interval, in a two-stage probabilistic (distance geometry) programming problem for the calculation of 3D-structures. The new 3D-structure computation method is discussed in Section 3 followed by validation results on a large set of real data obtained from the PDB. We discuss the problem of assessment of the predicted protein structures in Section 4. Here, we present a novel technique for the estimation of a reliability index as a measure of the expected accuracy score of the predicted structures. We also present here a novel technique for the estimation of the expected RMSD, the most likely RMSD and the expected RMSD ranges for the predicted structure of a protein, based on its average posterior probability and a score function. 
Discussion of the results and scope of our method is presented in Section 5.

2. Estimation of distance intervals for short, medium and long inter-residue ranges

We consider the following probabilistic minimization problem for the determination of the 3D-structure:

$$\min_{\bar{x}_1 \ldots \bar{x}_N} f(\bar{x}_1, \ldots, \bar{x}_N), \quad \text{where} \quad f = \sum_{i} \sum_{j>i} \left( \|\bar{x}_i - \bar{x}_j\|^2 - D_{ij}^2 \right)^2 \tag{1}$$

subject to distance constraints on $D_{ij}$.

Here, $\bar{x}_i = (x_{i1}, x_{i2}, x_{i3})$ is the vector representing the coordinates of the ith residue centroid or Cα atom in $\mathbb{R}^3$ and the $D_{ij}$ are the random variables representing the 3D-distance between the centroids or Cα atoms of residue i and residue j; i, j = 1, 2, ..., N. We use multivariate NDA for the estimation of the probabilities for inter-Cα 3D-distance intervals and contacts for short, medium and long range separation of residues on the primary chain. We estimate the range (lij, uij) for Dij separately for each of the above categories and also compute the posterior probability pij of the constraint Dij ∈ (lij, uij). Under the new probabilistic approach, weights (worth-coefficients) are assigned to each type of distance interval according to its relative accuracy in the training sample. These weights are used along with the probability pij in a two-stage probabilistic (distance geometry) programming problem for the calculation of 3D-structures.

2.1. Estimation of distances for short and medium-range residue separation

The Dij values for 1 ≤ |j − i| ≤ 2 are termed short-range distances and those for 3 ≤ |j − i| ≤ 4 are termed medium-range. Accurate estimation of the short and medium range distances (for |j − i| ≥ 2) is very important here as it corresponds to determining the secondary fold. We also need them to refine the estimators of certain parameters that we found significant in the prediction of long-range distances. In PROPAINOR (Jyothi and Joshi, 2001) we had used a sliding window model of the protein chain (Fig. 1) to estimate the expected values of the short and medium range 3D-distances in proteins as functions of their primary distances, the correlation (rm) between the primary and 3D-distances in the mth window (Wm), and a few other important features of the primary sequence. These primary sequence features were found to be significant with more than 99% confidence. Here, the distance correlation rm was found to be a dominant parameter in the regression model. We have refined the method for the estimation of rm using a 2-stage NDA and heuristics based on the distinct primary and 3D-distance correlation patterns in different secondary structure states as observed in the training sample,³ TRAIN-93. This increased the average accuracy of the short and medium-range distance estimates from 93 to 98% (Jyothi and Joshi, 2002).

Fig. 1. Sliding-window model.

2.2. Estimation of distance intervals and contacts for long-range residue separation

Long-range distance estimates play a crucial role in the process of protein structure determination by distance geometry methods. We have extended our sliding window model (Fig. 2) to the estimation of selected long-range distance intervals also, using NDA. Here, we consider only residue pairs (i, j) that are separated by at least 19 residues in the sequence, i.e. {Dij : |j − i| ≥ 20}. Distance interval estimation is then carried out by separating the window pairs Wi2 and Wj2 into different categories based on the residue separation and sequence environments of residue i and residue j. Accordingly, we define window-pair categories for estimation of intervals for Dij as specified below:

hmm category: Wi2 and Wj2 contain all hydrophilic residues;
hpp category: both Wi2 and Wj2 contain more than three hydrophobic residues;
hpm category: Wi2 contains all hydrophilic residues and Wj2 contains more than three hydrophobic residues, or vice-versa;
L-20 category: {j : |j − i| = 20; i = 4, 6, ...};
L-30 category: {j : |j − i| = 30; i = 4, 6, ...}; and
nc category: j > N − 5 and i < 5.

³ To get this training sample, around 400 proteins of size between 70 and 150 residues were taken randomly from the PDB. These sequences were randomly numbered 1, 2, 3, ... The sequence numbered 1 was selected first; all the sequences having more than 40% homology with it were removed. The next higher labelled sequence was then selected and the process was repeated successively until no further removal occurred. The resulting sample contained 93 proteins. Interestingly, in this set, <1% of the protein pairs had mutual sequence homology around 40%. All others had much less or no homology. We label this sample TRAIN-93.

Prediction of likely classes (based on some estimated thresholds) for the distances Dij is carried out separately for each window-pair category using NDA on the feature vector P̄ of the protein sequence (cf. section A.1 of Appendix A). This feature vector takes into account the sequence environments in the neighbourhood of both residue i and residue j as well as the properties of the segment between residue i and residue j. The corresponding probabilistic constraints are given in section A.1 of Appendix A. The prediction accuracy for the long-range distance intervals estimated here varies between 70 and 90% and the Matthews correlation coefficient varies between 0.38 and 0.71 for the validation sample,⁴ VALID-40. This is much higher than for the recent method of Lund et al. (1997), who report a prediction accuracy of 57–63% and a maximum Matthews correlation of 0.42 for the estimation of similar distance intervals.

We have refined the globular heuristics giving rise to hydrophobic core constraints, introduced in (Jyothi and Joshi, 2001), taking into account turn predictions using β-turn propensity (Chou and Fasman, 1974) and maximum hydrophilicity in each window (estimated by moving average technique).
These turn predictions are also used along with the estimates of rm for the formulation of β-strand packing heuristics (sections A.2 and A.3 of Appendix A). The average accuracy of the globular constraints was around 96%.

Stability of the predicted structures by distance geometry methods is sensitive to the constraints on residues that are likely to be in close vicinity in 3D. But prediction of long-range 3D-contacts remains a challenging problem in protein structure research. We have also formulated here a consensus approach based on NDA and heuristics for the prediction of 3D-contacts (section A.4 of Appendix A). Validation on the set of proteins in VALID-40 showed that the precision for contact predictions by our method is 55%. Here, we note that the computational methods based on multiple sequence alignments and correlated mutation analysis for medium and long-range contact predictions (Fariselli and Casadio, 1999; Olmea and Valencia, 1997) report a precision of only 20–25%.

⁴ The validation sample was also selected by a procedure similar to that applied to TRAIN-93 to ensure that no two proteins in this set would have more than 40% homology. The random sets thus found were further screened so that no protein in this set had more than 40% sequence homology with the training sample, TRAIN-93. This way, 40 proteins were obtained for the validation sample. Only three of the proteins in this set had around 40% homology with three proteins in the training sample. We call this set VALID-40.

Fig. 2. Sliding-window for long range distance estimation.

3. Three-dimensional structure computations

In the earlier formulation of the 3D-structure prediction problem by PROPAINOR we had constraints of the type dij ∈ (lij, uij) where the lower and upper bounds for the short and medium-range distances were estimated by nonparametric regression. These were supplemented by knowledge-based long-range constraints derived using heuristics on the globularity of the 3D-fold.
These distance bounds are now refined and we now have distance intervals (lij, uij) of different types, i.e. short and medium-range, long-range, globular, turn, contact and β-strand packing heuristics. Attached to each distance interval, we now also have a specified probability pij, where

$$p_{ij} = P[\, l_{ij} \le D_{ij} \le u_{ij} \,]$$

These probabilities are computed using the posterior probabilities obtained from NDA. The formulae used for different categories of residue pairs are given in Appendix A. We apply a two-stage probabilistic programming technique to compute the 3D-coordinates by a distance geometry approach under these probabilistic constraints and the bump distance constraints.

3.1. Probabilistic distance geometry optimization

3.1.1. Stage 1

The short and medium-range distance interval constraints obtained from nonparametric regression, together with (i) the hmm, hpp, hpm and nc distance constraints obtained from NDA and (ii) the hydrophobic core and β-turn constraints, are first used in the distance geometry program dgsol (Moré and Wu, 1999) for the fast generation of a set of initial structures. No probabilities are attached to the distance constraints at this stage.

3.1.2. Stage 2

The set of distinct initial solutions obtained in stage 1 is then refined through probabilistic minimization of the following modified distance geometry objective function f:

$$f = \sum_{t} \sum_{(i,j) \in G_t} w_t \left[ p_{ij} \cdot \min{}^2\!\left( \frac{\|x_i - x_j\|^2 - l_{ij}^2}{l_{ij}^2},\ 0 \right) + p_{ij} \cdot \max{}^2\!\left( \frac{\|x_i - x_j\|^2 - u_{ij}^2}{u_{ij}^2},\ 0 \right) \right] + w_{bump} \cdot 1 \cdot \sum_{i=1}^{N-4} \sum_{j=i+4}^{N} \min{}^2\!\left( \frac{\|x_i - x_j\|^2 - l_{bump}^2}{l_{bump}^2},\ 0 \right) \tag{2}$$

Here t indexes the type of constraint, i.e. short and medium-range, long-range (i.e. hmm, hpp, hpm, L-20, L-30 and nc), globular, turn, contact and β-strand packing. Gt denotes the set of constraints of type t. The probabilities pij are multiplied by weights wt derived from heuristics on the relative importance and accuracy of each class of constraint. The assignment of weights to the various constraints is discussed below. The conjugate gradient algorithm (Gilbert and Nocedal, 1992) is used for minimization of the objective function given by Eq. (2).

3.1.3. Assignment of weights

Weights (wt) for the constraints are assigned based on the relative accuracy of the distance intervals as obtained in the training sample. Accordingly, we chose:

wt = 10000 for bump distance constraints.
wt = 1000 for constraints involving short-range distances dij with j = i + 1 or j = i + 2.
wt = 100 for constraints involving short-range distances dij with j = i + 3 or j = i + 4.
wt = 10 for globular, β, hmm, hpp, hpm, nc and long-range contact constraints.
wt = 1 for L-20 and L-30 distance constraints.

3.2. Optimal solution

As in Jyothi and Joshi (2001), the best solutions are selected based on the minimum objective function value, the theoretical radius of gyration and the best satisfaction of the hydrophobic residue distribution. First, the solution having the minimum objective function value is identified, and all the solutions (models) whose objective function value lies within 0.25 of this minimum are taken into consideration. From these selected solutions, only those which give predicted structures with radius of gyration within ±0.5 Å of the theoretical value (Andrzej and Skolnick 1997) are retained. From these solutions, all those which do not satisfy the hydrophobic core residue distribution are also eliminated. Using this procedure, we got around 5–15 solutions each for the proteins in the validation sample. These solutions are used as the set of predicted structures.

3.2.1. Validation

Using the refined PROPAINOR, we have predicted the 3D-structures for all the 52 proteins in VALID-52.⁵ The average RMSD, Ravg, of the set of predicted structures for each protein in this validation sample is shown in Table 1. For globular proteins it was found to vary between 4.40 and 7.60 Å with the modal value around 6 Å.

⁵ This set includes all the proteins in VALID-40 plus 12 new proteins (selected by similar random procedure) on which no NDA validation was run earlier.

4. Reliability assessment

Suitable estimates of the reliability of predictions are highly desirable in any protein structure prediction algorithm. However, no ab-initio prediction method has yet reported such a measure. We have developed a method, based only on the primary sequence of the protein, to assess the reliability of the predicted 3D-structure. Apart from a reliability index, we also compute estimates for its expected and most likely RMSD. These computations are based on an analogy with the calculation of the most likely job completion times in PERT (Wiest and Levy, 1977).

4.1. Overall reliability index

Here, we describe the development of a regression model for the computation of an overall reliability index for the predicted structure. We compute this index based on the precision score (Ŝc) and the average probability (p̄) of the predicted structure(s). Ŝc and p̄ are calculated from the weights wt and the probabilities pij attached to the distance intervals as shown below:

$$\hat{S}_c = \sum_{t} \sum_{(i,j) \in G_t} w_t\, p_{ij} \tag{3}$$

and

$$\bar{p} = \frac{\sum_{t} \sum_{(i,j) \in G_t} w_t\, p_{ij}}{\sum_{t} \sum_{(i,j) \in G_t} w_t} \tag{4}$$

Here the summation runs over the set of all the imposed probabilistic distance constraints. To arrive at the regression equation, we first chose a random subset of 15 proteins from VALID-52. We refer to this sample as VALID-52.15.
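As a minimal sketch of Eqs. (3) and (4), the snippet below computes the precision score Ŝc and the average probability p̄ from a list of weighted probabilistic constraints, and also evaluates the probabilistic part of the stage-2 objective of Eq. (2) with the bump term left out. The `Constraint` record, the dictionary keys naming the constraint types and the toy numbers are assumptions made for illustration only; the weights follow Section 3.1.3, and pairing the lower bound with min and the upper bound with max is our reading of Eq. (2), based on the usual distance geometry penalty.

```python
from dataclasses import dataclass

# Weights w_t from Section 3.1.3; the dictionary keys are illustrative names.
WEIGHTS = {
    "short_1_2": 1000.0,  # j = i + 1 or j = i + 2
    "short_3_4": 100.0,   # j = i + 3 or j = i + 4
    "globular": 10.0,
    "contact": 10.0,
    "L-20": 1.0,
    "L-30": 1.0,
}


@dataclass
class Constraint:
    i: int        # index of the first residue
    j: int        # index of the second residue
    lower: float  # l_ij in angstrom
    upper: float  # u_ij in angstrom
    prob: float   # p_ij, probability that D_ij lies in [lower, upper]
    kind: str     # constraint type t, a key of WEIGHTS


def precision_score(constraints):
    """Eq. (3): sum of w_t * p_ij over all imposed constraints."""
    return sum(WEIGHTS[c.kind] * c.prob for c in constraints)


def average_probability(constraints):
    """Eq. (4): weighted mean of p_ij with weights w_t."""
    total_weight = sum(WEIGHTS[c.kind] for c in constraints)
    return precision_score(constraints) / total_weight


def stage2_penalty(coords, constraints):
    """Probabilistic part of Eq. (2); the bump term is omitted in this sketch."""
    total = 0.0
    for c in constraints:
        d2 = sum((a - b) ** 2 for a, b in zip(coords[c.i], coords[c.j]))
        low = min((d2 - c.lower ** 2) / c.lower ** 2, 0.0)  # lower bound violated
        up = max((d2 - c.upper ** 2) / c.upper ** 2, 0.0)   # upper bound violated
        total += WEIGHTS[c.kind] * c.prob * (low ** 2 + up ** 2)
    return total


# Toy example: one short-range and one L-20 constraint (made-up values).
constraints = [
    Constraint(0, 1, 3.7, 3.9, 0.9, "short_1_2"),
    Constraint(0, 30, 4.5, 12.0, 0.7, "L-20"),
]
coords = {0: (0.0, 0.0, 0.0), 1: (3.8, 0.0, 0.0), 30: (10.0, 0.0, 0.0)}

score = precision_score(constraints)
p_bar = average_probability(constraints)
penalty = stage2_penalty(coords, constraints)
print(score, p_bar, penalty)
```

Because the short-range weight dominates, p̄ in this toy case stays close to the probability of the short-range constraint, which illustrates how the weighting lets the more accurate constraint classes drive both the objective and the reliability scores.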
Since we wanted the reliability index for a protein (of N residues) to reflect the deviation (say Rdiff) of its RMSD from the RMSD of a random structure of N residues, Rran_N (Cohen, 1980), we computed

$$\hat{R}_{diff} = R^{ran}_N - R^{avg}$$

for these 15 proteins, where Ravg is the average RMSD for the protein based on the set of best solutions. In order to estimate E[Rdiff] for a new protein, we regress Rdiff on Ŝc and p̄. Computational experiments showed the model regressing the discretized ⟨Rdiff⟩ on ⟨Ŝc⟩ and ⟨p̄⟩ to be the best, where

$$\langle \hat{S}_c \rangle = \begin{cases} 0 & \text{if } \hat{S}_c \le 1000 \\ 1 & \text{if } \hat{S}_c > 1000 \end{cases} \tag{5}$$

$$\langle \bar{p} \rangle = \begin{cases} 0 & \text{if } \bar{p} \le 0.8 \\ 1 & \text{if } \bar{p} > 0.8 \end{cases} \tag{6}$$

and

$$\langle R_{diff} \rangle = \begin{cases} 0 & \text{if } R_{diff} \le 1 \\ 1 & \text{if } R_{diff} > 1 \end{cases} \tag{7}$$

The equation of the best fit was then obtained (using SAS software, SAS, 1995) as:

$$E[\langle R_{diff} \rangle \mid \langle \hat{S}_c \rangle, \langle \bar{p} \rangle] = 0.19 + 0.23\,\langle \hat{S}_c \rangle + 0.72\,\langle \bar{p} \rangle \tag{8}$$

This model was found significant with a p-value of 0.0002. The discretized score ⟨Ŝc⟩ was significant with a p-value of 0.0845 and ⟨p̄⟩ was significant with a p-value of 0.0001. The adjusted R² value for the model was 0.7184.

Table 1
Validation results: accuracy and reliability indices for the predicted protein structures

PDB code  No. of residues  Ŝc       p̄       Ravg (Å)  Rran_N (Å)  Actual RMSD range (Å)  R̂f    Ê (Å)  Pred RMSD range (Å)  M̂ (Å)
1a1w      83               636.31   0.8406  5.49      7.81        (5.17, 5.75)           0.92  5.41   (4.71, 6.11)         5.17
1a1x      106              991.86   0.7810  7.37      8.43        (7.24, 7.51)           0.19  7.72   (7.43, 8.00)         7.72
1ag4      103              935.28   0.8358  5.83      8.35        (5.63, 6.23)           0.92  6.31   (5.61, 7.01)         6.31
1aj4      90               6017.19  0.8889  4.90      8.00        (4.84, 5.44)           1.14  5.77   (5.07, 6.47)         5.57
1aok      130              2131.40  0.8362  6.32      9.08        (6.18, 6.48)           1.14  6.58   (5.88, 7.28)         6.60
1aqe      110              234.78   0.7158  7.62      8.54        (7.32, 7.92)           0.19  7.77   (7.54, 8.00)         7.77
1azf      129              2745.11  0.7932  6.46      9.06        (6.37, 6.61)           0.42  6.63   (5.93, 7.33)         6.72
1bhu      102              539.89   0.6281  7.92      8.32        (7.66, 8.61)           0.19  7.66   (7.32, 8.00)         7.66
1bl4      107              1419.74  0.8421  5.56      8.46        (5.52, 5.62)           1.14  6.34   (5.64, 7.04)         6.36
1bo0      76               625.79   0.7972  7.73      7.62        (7.58, 7.90)           0.19  7.12   (6.60, 7.62)         7.12
1c15      97               1197.39  0.8474  6.48      8.19        (6.20, 6.90)           1.14  6.21   (5.51, 6.91)         6.05
1ccd      75               386.88   0.8197  5.66      7.59        (5.59, 5.73)           0.92  5.15   (4.45, 5.85)         4.92
1dun      110              2428.69  0.8995  6.82      8.54        (6.62, 7.06)           1.14  6.29   (5.59, 6.99)         6.27
1e68      70               778.00   0.8801  5.30      7.46        (5.17, 5.42)           0.92  4.82   (4.13, 5.52)         4.55
1eo0      77               470.61   0.8065  4.61      7.65        (4.13, 5.17)           0.92  5.27   (4.57, 5.97)         5.05
1f0m      71               291.69   0.8217  5.48      7.48        (5.41, 5.60)           0.92  4.95   (4.25, 5.65)         4.72
1g7d      105              866.04   0.8047  6.61      8.41        (6.55, 6.69)           0.92  6.37   (5.67, 7.07)         6.40
1gdc      72               317.41   0.6077  7.08      7.51        (6.93, 7.16)           0.19  7.01   (6.51, 7.51)         7.01
1ghj      79               587.34   0.7937  7.22      7.70        (6.97, 7.99)           0.19  7.20   (6.70, 7.70)         7.20
1gsp      104              1231.28  0.7763  5.73      8.38        (5.60, 5.97)           0.42  6.40   (5.70, 7.10)         6.51
1hf,c     123              1375.54  0.8689  7.02      8.89        (7.02, 7.02)           1.14  6.46   (5.76, 7.16)         6.46
1hnil     123              1549.06  0.9006  6.36      8.89        (5.82, 6.62)           1.14  6.42   (5.72, 7.12)         6.39
1imy      85               1099.04  0.8208  5.32      7.86        (5.18, 5.58)           1.14  5.65   (4.95, 6.35)         5.32
1jug      125              1697.24  0.7736  6.42      8.95        (6.11, 6.80)           0.42  6.62   (5.92, 7.32)         6.72
1mbl      98               969.21   0.8465  6.77      8.22        (6.67, 6.98)           0.92  6.30   (5.60, 7.00)         6.06
1mho      88               1821.48  0.9355  5.33      7.94        (5.19, 5.58)           1.14  5.64   (4.94, 6.34)         5.41
1nkl      78               553.20   0.8041  5.79      7.67        (5.66, 5.86)           0.92  5.32   (4.62, 6.02)         5.11
1plm      130              1952.87  0.8278  6.38      9.08        (6.19, 6.69)           1.14  6.59   (5.89, 7.29)         6.62
1rhl      104              1350.87  0.7872  6.04      8.38        (5.92, 6.22)           0.42  6.39   (5.69, 7.09)         6.48
1trf      76               1222.94  0.8704  5.07      7.62        (5.03, 5.12)           1.14  5.13   (4.43, 5.83)         4.95
1upj      99               831.49   0.8808  6.21      8.24        (5.79, 6.96)           0.92  6.32   (5.62, 7.02)         6.05
1utg      70               484.02   0.8148  5.32      7.46        (5.22, 5.46)           0.92  4.91   (4.21, 5.61)         4.61
1who      94               1378.07  0.8362  6.39      8.11        (5.88, 6.59)           1.14  6.08   (5.38, 6.78)         5.92
1xer      103              3128.86  0.8186  6.83      8.35        (6.50, 7.22)           1.14  6.33   (5.63, 7.03)         6.37
1ytj      99               904.91   0.7959  7.61      8.24        (7.22, 7.89)           0.92  7.62   (7.24, 8.00)         7.62
1yvs      108              1059.81  0.7166  18.60     8.49        (18.2, 18.7)           0.42  6.53   (5.80, 7.23)         6.67
1zib      124              1879.35  0.8454  6.84      8.92        (6.84, 6.84)           1.14  6.51   (5.81, 7.21)         6.52
2af8      86               851.02   0.8485  5.86      7.89        (5.56, 6.02)           0.92  5.58   (4.88, 6.28)         5.33
2fhi      105              791.18   0.8940  6.26      8.41        (5.82, 6.55)           0.92  6.25   (5.55, 6.95)         6.21
2psr      96               1586.88  0.6987  6.13      8.16        (5.78, 6.34)           0.42  6.39   (5.69, 7.09)         6.06
2pyp      125              606.38   0.8194  6.56      8.95        (6.23, 6.87)           0.92  6.55   (5.85, 7.25)         6.56
2sxl      88               908.84   0.8136  6.11      7.94        (5.51, 6.35)           0.92  5.75   (5.05, 6.45)         5.53
2tgi      112              1542.15  0.9229  10.79     8.60        (10.7, 10.9)           1.14  6.28   (5.50, 6.98)         6.24
2tmy      118              2366.38  0.8694  6.76      8.76        (6.43, 6.89)           1.14  6.41   (5.71, 7.11)         6.41
2tn4      85               3829.47  0.9388  5.66      7.86        (5.40, 5.84)           1.14  5.40   (4.70, 6.10)         5.17
2trx      108              818.87   0.8953  6.51      8.49        (6.26, 6.78)           0.92  6.28   (5.58, 6.98)         6.23
3ait      74               714.86   0.8014  5.60      7.57        (5.36, 5.88)           0.92  4.93   (4.23, 5.63)         4.71
3crd      100              656.50   0.8537  6.43      8.27        (6.26, 6.55)           0.92  6.35   (5.65, 7.05)         6.10
3icb      75               375.70   0.8903  4.44      7.59        (4.25, 4.64)           0.92  4.86   (4.16, 5.56)         4.59
3mba      146              1856.17  0.8503  7.50      9.52        (7.41, 7.58)           1.14  6.72   (6.02, 7.42)         6.73
3vub      87               931.66   0.7245  7.01      7.92        (6.80, 7.51)           0.19  7.42   (6.92, 7.92)         7.42
5hpg      83               1076.78  0.7820  6.79      7.81        (6.69, 6.89)           0.42  5.50   (4.80, 6.20)         5.11

Here, Ravg represents the average RMSD of the predicted models of the protein with respect to its native structure. Rran_N is the RMSD for a random structure calculated as in Cohen (1980). Actual RMSD range gives the range of RMSD observed in the distinct solutions, R̂f is the reliability factor for the protein, Ê is the calculated expected RMSD, Pred RMSD range gives the lower and upper bounds within which the observed RMSD is expected to lie. M̂ is the most likely RMSD.

For any protein, the conditional expectation of ⟨Rdiff⟩ computed from the model (8) serves as a measure of the overall reliability of the predicted structure. We denote this measure by R̂f. The reliability indices for all the proteins in VALID-52 were computed using Eq. (8). These are listed in Table 1 along with the corresponding score Ŝc and probability p̄.

4.2. Estimation of the expected and the most likely RMSD

Here we estimate the expected RMSD (Ê), the most likely RMSD (M̂) and the interval (R̂min, R̂max) in which the RMSD of a predicted structure would lie. The sample VALID-52.15 is used here for building the model. It was observed in this sample that the RMSD for proteins with larger p̄ was lower than the sample average as compared to proteins with smaller p̄. We model this scaling with respect to p̄ in the computation of mnew below. This model for the computation of mnew also takes into account the observation that the RMSD for proteins with larger N (N > 100) was higher than the sample average and that the RMSD for proteins with smaller N (N < 100) was lower than the sample average. However, since this trend was increasing rapidly for proteins up to N = 100 and was almost stationary with respect to larger N, we also incorporate a correction factor c as shown below. From model (8), we note that R̂f > 0.20 iff ⟨Ŝc⟩ = 1 or ⟨p̄⟩ = 1. In other words, R̂f > 0.20 indicates a good prediction. Hence, Ê, M̂, R̂min and R̂max are estimated separately for proteins with R̂f > 0.20 and for proteins with lower R̂f values.

4.2.1.
Estimation for proteins with R̂f > 0.20

Let mR and sR denote respectively the average and the standard deviation of the RMSD in the model sample (for VALID-52.15 we obtained mR = 6.05 Å and sR = 0.70 Å). The following model was found to be the best for the computation of mnew and c to incorporate the aforementioned trends. mnew is estimated as:

$$m_{new} = \begin{cases} m_R - \bar{p} \cdot 2 s_R & \text{if } N \le 100 \\ m_R + (1 - \bar{p}) \cdot 2 s_R & \text{if } N > 100 \end{cases} \tag{9}$$

The correction factor, c, for the size of the protein is given by

$$c = \begin{cases} 0.06 \cdot (N - N_{min}) & \text{if } N \le 100 \\ 0.01 \cdot (N - 100) & \text{if } N > 100 \end{cases} \tag{10}$$

Here Nmin is the size of the smallest sequence in the model sample (for VALID-52.15 Nmin was 74). The expected value of the RMSD, Ê, is calculated as

$$\hat{E} = m_{new} + c \tag{11}$$

From Ê, the lower and upper bounds (R̂min and R̂max) for the RMSD are calculated as:

$$\hat{R}_{min} = \hat{E} - s_R \tag{12}$$

$$\hat{R}_{max} = \hat{E} + s_R \tag{13}$$

The most likely RMSD, M̂, is calculated in analogy with the calculation of the most likely estimates of job completion times in PERT. For this, we need the most optimistic estimate (moe) and the most pessimistic estimate (mpe) of the RMSD. These are computed as follows:

$$moe = \begin{cases} m_R - 1.75 \cdot \hat{R}_f & \text{if } N \le 100 \\ m_R - 0.5 \cdot \hat{R}_f & \text{if } N > 100 \end{cases} \tag{14}$$

$$mpe = \begin{cases} m_R + 0.25 \cdot \hat{R}_f & \text{if } N \le 100 \\ \min\{ m_R + \hat{R}_f,\ 7.00 \} & \text{if } N > 100 \end{cases} \tag{15}$$

Then

$$\hat{M} = \frac{1}{4}\left( 6 \cdot m_{new} - moe - mpe \right) + c \tag{16}$$

where mnew and c are computed as in Eqs. (9) and (10), respectively.

4.2.2. Estimation for proteins with R̂f < 0.20

The proteins in this category had RMSD > 7 Å. So instead of using mR and sR, the following simple model was found to be sufficient here.

$$\hat{R}_{min} = R^{ran}_N - 1 \tag{17}$$

$$\hat{R}_{max} = R^{ran}_N + 1 \tag{18}$$

and

$$\hat{E} = \hat{M} = \frac{\hat{R}_{min} + \hat{R}_{max}}{2} \tag{19}$$

The computed values of Ê, M̂, R̂min and R̂max for all proteins in VALID-52 are given in Table 1. We note that there is a close correspondence between Ravg, Ê and M̂ and that the computed RMSD ranges include Ravg in almost all the cases in Table 1.
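The estimation steps of Eq. (8) and Eqs. (9)-(16) can be sketched in a few lines. The function names below are hypothetical, only the branch for proteins with R̂f > 0.20 is covered, and the plus and minus signs are a reading of the extracted formulas chosen so that the first and third rows of Table 1 are reproduced; they should be treated as an assumption rather than as the authors' exact implementation.

```python
# Sample statistics reported for VALID-52.15.
M_R = 6.05    # mean RMSD in angstrom
S_R = 0.70    # standard deviation of the RMSD in angstrom
N_MIN = 74    # size of the smallest sequence in the model sample


def reliability_factor(score, p_bar):
    """Eq. (8) applied to the discretized score and probability, Eqs. (5)-(6)."""
    score_flag = 1.0 if score > 1000 else 0.0
    prob_flag = 1.0 if p_bar > 0.8 else 0.0
    return 0.19 + 0.23 * score_flag + 0.72 * prob_flag


def rmsd_estimates(n_residues, p_bar, r_f):
    """Eqs. (9)-(16): expected RMSD, predicted range and most likely RMSD.

    Only meant for proteins with reliability factor above 0.20.
    """
    if n_residues <= 100:
        m_new = M_R - p_bar * 2 * S_R          # Eq. (9), small proteins
        c = 0.06 * (n_residues - N_MIN)        # Eq. (10)
        moe = M_R - 1.75 * r_f                 # Eq. (14), optimistic estimate
        mpe = M_R + 0.25 * r_f                 # Eq. (15), pessimistic estimate
    else:
        m_new = M_R + (1 - p_bar) * 2 * S_R
        c = 0.01 * (n_residues - 100)
        moe = M_R - 0.5 * r_f
        mpe = min(M_R + r_f, 7.00)
    expected = m_new + c                               # Eq. (11)
    bounds = (expected - S_R, expected + S_R)          # Eqs. (12)-(13)
    most_likely = (6 * m_new - moe - mpe) / 4 + c      # Eq. (16), PERT analogy
    return expected, bounds, most_likely


# First row of Table 1: N = 83, score 636.31, average probability 0.8406.
r_f = reliability_factor(636.31, 0.8406)
expected, bounds, most_likely = rmsd_estimates(83, 0.8406, r_f)
print(round(r_f, 2), round(expected, 2), round(most_likely, 2))
# For these inputs the rounded estimates are 5.41 (expected) and 5.17 (most likely).
```

Equation (16) is the PERT expectation (optimistic + 4 × most likely + pessimistic) / 6 solved for the most likely value, with mnew in the role of the expectation. Note that Eq. (8) with the printed coefficients gives 0.91 for this row, while Table 1 lists 0.92, presumably because the published coefficients are rounded.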
We obtain a correlation coefficient of 0.42 between Ravg and M̂. But when the two non-globular proteins 1yvs and 2tgi are removed from the validation sample, this correlation increases to 0.84, with confidence level above 95%.

5. Results and discussion

We had developed a new algorithm, PROPAINOR, for ab-initio protein structure prediction from a set of distance restraints that are estimated by nonparametric regression and knowledge-based heuristics (Jyothi and Joshi, 2001). In this paper, we have reported the refinements made to PROPAINOR in the structure computation methodology. In contrast to our earlier deterministic distance geometry optimization, we now adopt a two-stage probabilistic programming approach to compute the 3D-structure of proteins from a set of probabilistic distance constraints. Validation of our new method on a sample of 52 proteins (VALID-52) shows that the RMSD of the globular protein structures predicted by the new approach, as compared to the native structure, varies between 4.40 and 7.60 Å. Moreover, for globular proteins of sizes between 70 and 130 residues the average value of the RMSD in the validation sample, VALID-52, is found to be 6.05 Å (Table 1). This RMSD level is the lowest reported so far for any ab-initio method. It is also clear from Table 1 that the RMSD for any protein, Ravg, is much less than the RMSD for a random structure with the same number of residues, Rran_N. Further, continuous fragments of sizes varying between 50 and 75 residues are now predicted with an RMSD < 4 Å (Fig. 3(d–f) and Table 2). A comparison of our method with some of the recent well-received computational methods of protein structure prediction (for example, Simons et al., 1998; Ortiz et al., 1998; Huang et al., 1999; Xia et al., 2000; Osguthorpe, 1999; Simons et al., 2001) shows it to be better or comparable (Table 2).
Most of these methods use scoring or fitness functions to select the optimal solutions from a large number of conformations (decoys⁶). In contrast, in our method, we do not require such a set of decoys as we estimate the constraints exhaustively based on nonparametric statistical estimates (with high confidence) that take care of the random variations. We then choose the optimal solution based on the heuristics as described in Section 3.2. Further, it is important to note that most of the other computational methods use multiple sequence alignments during some stage of structure prediction and hence are not efficient for proteins that do not have many homologous sequences in the sequence databanks.

⁶ E.g. URL: dd.compbio.washington.edu

Table 2
A comparison of ab-initio structure prediction methods

Group/algorithm and computing time:
Simons et al. (1999) (ROSETTA): 2 days on 5 α-533 MHz processors for proteins around 120 residues
Ortiz et al. (1998) (MONSSTER): 4 days on a SGI MIPS 180 MHz processor
Huang et al. (1999): not mentioned
Xia et al. (2000): 3 CPU days on a 533 MHz α-processor for 40–80 residue proteins
Osguthorpe (1999): not mentioned
Pillardy et al. (2001): 100 h of wall-clock time using 64 processors of IBM2 supercomputer
Joshi and Jyothi (2002) PROPAINOR: around 8–10 h on a SUN ULTRA sparc, 128 MB RAM, 167 MHz CPU

RMSD for test sets of 15–20 proteinsᵃ (entire protein) and other comments: 8.32–11.87 Å; –; –; 3.5–6.7 Å for 70–100 residue fragments; 5.37–9.99 Å; 4.95–13.90 Å; applicable only to helical proteins up to 90 residues; 6–7 Å for fragments up to 80 residues; 9.95–19.06 Å; best: 4.3 Å for 70 residue 1e68, worst: not mentioned; 4.40–7.60 Å; –; 6.0 Å for 52–68 residue fragments of α-helical proteins, worse for β and α/β proteins; <4 Å for 50–75 residue fragments.

It may be noted here that the methods of Pillardy et al. (2001) and Osguthorpe (1999) use potentials which mimic the physical forces instead of knowledge-based potentials. Also the computing times mentioned in column 2 for all the methods represent the time taken to get an ensemble of structures. The best structures are selected from this ensemble through various heuristics.
ᵃ Our algorithm, PROPAINOR, has been validated on a test set of 52 proteins. No NMR or other experimental data were used in our method.

Fig. 3. (a–c) Superpositions of the Cα traces of the actual and predicted structures. (a) α-Helical protein 3icb, (b) β-strand protein 1bl4 and (c) α/β protein 2trx. These (a–c) are also, respectively, the illustrations of the best, average and the worst results of topological matching in the validation sample. The bad results for 2trx are mainly due to the chirality problem. (d–f) Superpositions of the Cα traces of the actual and predicted structures for maximum length fragments having RMSD just below 4 Å. (d) 3icb (residues 1–73: 73 residues). (e) 1bl4 (residues 16–76: 61 residues). (f) 2trx (residues 32–56: 55 residues). In all these figures, the native structure is depicted in grey and the predicted structure in black. The residue numbers are labelled only at some positions to make the comparison clear.

We also note that there is a good overall topological fit between the predicted and the native Cα traces. This is evident from the plots depicting the superposition of the predicted Cα trace onto the native Cα trace (Fig. 3). Because of the small standard deviation of the RMSD, all the optimal structures predicted by our method have qualitatively similar topologies too. Therefore a predicted structure is selected at random from the set of optimal solutions for the purpose of superposition. Fig. 3 also shows the topological fit for 3 proteins for maximum length fragments with RMSD just below 4 Å. It is seen that continuous fragments containing about 50–99% of the residues in the protein can be superimposed with a very good fit.
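The text does not state how the superpositions and RMSD values were computed. A common choice, shown below purely as an assumption, is the Kabsch least-squares superposition of Cα coordinates using proper rotations only. The function name `kabsch_rmsd` and the random test coordinates are hypothetical. Because reflections are excluded, a mirror-image model keeps a large RMSD, which is one way the chirality problem of distance geometry becomes visible.

```python
import numpy as np


def kabsch_rmsd(pred, native):
    """RMSD between two (N, 3) coordinate sets after optimal superposition.

    Uses the Kabsch algorithm with proper rotations only (no reflections).
    """
    p = np.asarray(pred, dtype=float)
    q = np.asarray(native, dtype=float)
    p = p - p.mean(axis=0)          # remove translation
    q = q - q.mean(axis=0)
    h = p.T @ q                     # 3 x 3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    correction = np.diag([1.0, 1.0, d])   # forbid an improper rotation
    rot = vt.T @ correction @ u.T
    diff = p @ rot.T - q
    return float(np.sqrt((diff ** 2).sum() / len(p)))


# Toy check with made-up coordinates: a rotated and shifted copy superimposes
# almost exactly, whereas a mirror image does not.
rng = np.random.default_rng(0)
native = rng.normal(size=(20, 3))
angle = 0.7
rot_z = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
moved = native @ rot_z.T + np.array([5.0, -2.0, 1.0])
mirrored = native * np.array([-1.0, 1.0, 1.0])

rmsd_moved = kabsch_rmsd(moved, native)
rmsd_mirror = kabsch_rmsd(mirrored, native)
print(rmsd_moved, rmsd_mirror)
```

Applying the same function to a sliding range of residues would give fragment RMSD values of the kind quoted for Fig. 3(d–f), again under the assumption that a superposition of this type was used.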
Our method also gives a confidence interval, a score and a most likely estimate of the RMSD for the predicted structures. Such measures are always desirable and, to the best of our knowledge, no other ab-initio method provides them. Whenever a structure is predicted by a computational or statistical method, the user's natural question is how reliable the prediction is. The aforesaid reliability parameters answer this question; they are as important for a predicted structure of a given sequence as the p value or confidence intervals are in any statistical, data-driven decision making. Ab-initio structures play a key role in exploring the structure and function of a new protein, as they at least provide a preliminary solution from which to proceed with more sophisticated and expensive methods of refinement. Our estimates, such as the most likely RMSD, give a quantitative measure of the resolution of the preliminary structure.

PROPAINOR is also found to be computationally highly efficient. The entire two-stage process for getting an ensemble of distinct Cα structures takes less than 8 h on a Sun ULTRA Sparc station (128 MB RAM, 167 MHz CPU). In contrast, other recent methods take more than 3 days on multiple high-speed workstations for similar sized proteins. A comparison of the accuracy and computation times for a few other ab-initio structure prediction methods is given in Table 2.

Most importantly, since our method does not rely on homology, it is well suited for the structure prediction of proteins like human seminal plasma prostatic inhibin (HSPI), which lack sequence homology with known proteins and are difficult to study by NMR and crystallography. We had predicted the structure of this protein and validated it against the available experimental observations with good results. We had also undertaken certain docking and other computational studies on the predicted structure.
These were found to give significant insight into its important biological functions (Joshi and Jyothi, 2002). Further, we have also predicted the structure of a calcium binding protein, EhCaBP. The predicted structure of EhCaBP (with and without NMR restraints) was also found to agree well with the recently obtained experimental structure (Atreya et al., 2000). Docking studies give further directions for its possible immunosuppression.

In view of the above applications and because of its computational efficiency, high reliability and applicability to any random sequence, our method offers promising applications in structural and functional genomics and proteomics research. At present it is successfully implemented on proteins of sizes 70–150 residues. We are currently working on its extension to shorter and longer sequences.

Acknowledgements

Part of this research has been funded by the Dept. of Biotechnology (DBT), Govt. of India through a project undertaken by Prof. Rajani R. Joshi. The author thanks the DBT for the same. Jyothi S. thanks the Council of Scientific and Industrial Research, Govt. of India for the financial support in undertaking this research work.

Appendix A

A.1. Estimation of long-range distance intervals

Different subsets of the following sequence feature vector (P) were found to be significant (confidence level ≥ 99%) for the window-pair categories when NDA was applied to classify the window-pairs into either of the two distance classes (say, long-0 and long-1) that were formed using the mean 3D-distance for the $t$th category, $m_t$, as threshold.

\[
P = \left(H,\ C_1,\ C_2,\ C_3,\ n_\alpha,\ n_\beta,\ L,\ H^{ij},\ C_1^{ij},\ C_2^{ij},\ C_3^{ij},\ n_\alpha^{ij},\ n_\beta^{ij},\ L^{ij},\ h^m,\ c_1^m,\ c_2^m,\ c_3^m,\ f^m\right) \qquad \text{(A1)}
\]

Here $H$ is the percentage concentration of hydrophobic residues in the primary chain; $C_k$ is the percentage concentration of residues in the chain belonging to the $k$th amino-acid cluster, $k = 1, 2, 3$; $n_\alpha$ is the number of windows in the chain with average propensity greater than 0.06; $n_\beta$ is the number of windows in the chain with average propensity less than −0.06; $L$ is the length of the chain; $H^{ij}$ is the percentage concentration of hydrophobic residues in the portion from residue i/5 to j/5 of the primary chain; $C_k^{ij}$ is the percentage concentration of residues in the chain belonging to the $k$th amino-acid cluster and lying in the portion i/5 to j/5 of the primary chain, $k = 1, 2, 3$; $n_\alpha^{ij}$ is the number of windows in the portion from residue i/5 to j/5 of the primary chain with average propensity greater than 0.06; $n_\beta^{ij}$ is the number of windows in the portion from residue i/5 to j/5 of the primary chain with average propensity less than −0.06; $L^{ij}$ is the length of the portion of the chain from residue i/5 to j/5; $h^m$ is the percentage concentration of hydrophobic residues in window $W_m$, m ∈ {i/4, i/2, i, j/4, j/2, j}; $c_k^m$ is the percentage concentration of residues belonging to the $k$th amino-acid cluster in window $W_m$, m ∈ {i/4, i/2, i, j/4, j/2, j} and $k \in \{1, 2, 3\}$; $f^m$ is the average propensity of residues in $W_m$ multiplied by 100 (to make this scale on par with the other features), m ∈ {i/4, i/2, i, j/4, j/2, j}.

Here the average propensity is

\[
f_m = \sum_{i=m}^{m+4} \frac{a_i - b_i}{10},
\]

where $a_i$ and $b_i$ are, respectively, the propensities of residue $i$ for α-helix and β-sheet (Chou and Fasman, 1974).

A.1.1. Probability assignment for long-range distance intervals

NDA gives us the posterior probabilities, $p^0_{ij}$ and $p^1_{ij}$, of membership of window-pair (Wi2, Wj2) in distance class long-0 and class long-1, respectively. In view of the geometrical constraints on globular folding (Jyothi and Joshi, 2001), we set

\[
P\left[\,l_{\mathrm{bump}} \le D_{ij} < m_t\,\right] = p^0_{ij} \quad \text{and} \quad P\left[\,m_t \le D_{ij} \le u_t^{\mathrm{rad}}\,\right] = p^1_{ij}.
\]

Here $l_{\mathrm{bump}}$ is the bump distance threshold, set to 4.5 Å, and, using globular heuristics (Jyothi and Joshi, 2001),

\[
u_t^{\mathrm{rad}} =
\begin{cases}
2R_g & \text{if } t = \text{hpp},\\[4pt]
\dfrac{10R_g}{3} & \text{if } t = \text{hmm or nc},\\[8pt]
\dfrac{8R_g}{3} & \text{if } t = \text{hpm, L20 or L30}.
\end{cases}
\]

Now, if $p^0_{ij} > p^1_{ij}$, we set $l_{ij} = l_{\mathrm{bump}}$, $u_{ij} = m_t$ and $p_{ij} = p^0_{ij}$. Else, if $p^1_{ij} > p^0_{ij}$, we assign $l_{ij} = m_t$, $u_{ij} = u_t^{\mathrm{rad}}$ and $p_{ij} = p^1_{ij}$.

A.2. Refinement of globular heuristics

We have modified our globular constraints (derived previously in Jyothi and Joshi, 2001) by taking into account the positions of the possible turns in the protein.

A.2.1. Turn heuristics

Residues $t_i$ that have a tendency to lie on the surface of the protein are determined by a consensus of Chou's β-turn propensity (Chou and Fasman, 1974) and the maximum hydrophilicity calculated for each window using a moving average technique. Let $\{t_1, t_2, \ldots, t_n\}$ be the residue numbers of the turns thus predicted. We have formulated the new turn heuristics based on the consideration that adjacent turn residues, $t_i$ and $t_{i+1}$, would be separated by a regular secondary structure element and hence would lie diametrically opposite to each other in 3D. Also, neither $t_i$ nor $t_{i+1}$ would come inside the protein core. Therefore, using the protein data in TRAIN-93, we estimated that

\[
P\left[\,D_{t_i,t_{i+1}} \ge \min\{2R_g - d,\ 20\ \text{Å}\}\,\right] \ge 0.75 \qquad \text{(A2)}
\]

Here $R_g$ (in Å) is the radius of gyration for the protein (Andrzej and Skolnick, 1997) and $d$ is determined using heuristics on known protein structures.

A.2.2. Modified hydrophobic core constraints

These turn predictions are then used to modify the hydrophobic core constraints derived previously in Jyothi and Joshi (2001). Since predicted turns are constrained to lie on the surface of the protein, any hydrophobic residue adjacent to a predicted turn would also essentially lie on the surface. Hence, in our reformulated core heuristics, residue $i$ is constrained to be in the core of the protein only if it is hydrophobic and

\[
|t_j - i| \ge 3 \qquad \text{(A3)}
\]

Here $t_j$ represents the primary position number of the $j$th (predicted) turn residue.
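The core condition (A3), a hydrophobic residue at least three positions away from every predicted turn, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `core_residues`, the one-letter hydrophobic set and the toy sequence are assumptions introduced here, and applying (A3) to every predicted turn is our reading of the text.

```python
# Assumed hydrophobic alphabet (one-letter codes); the paper does not list its set.
HYDROPHOBIC = frozenset("AVLIMFWC")


def core_residues(sequence, turn_positions, min_sep=3, hydrophobic=HYDROPHOBIC):
    """Return 1-based positions i that are hydrophobic and satisfy
    |t_j - i| >= min_sep for every predicted turn position t_j (condition A3)."""
    core = []
    for pos, residue in enumerate(sequence, start=1):
        if residue not in hydrophobic:
            continue  # only hydrophobic residues are candidates for the core
        if all(abs(t - pos) >= min_sep for t in turn_positions):
            core.append(pos)
    return core


# Toy 10-residue sequence with one predicted turn at position 7 (illustrative only).
print(core_residues("AVGLKPDGIL", [7]))
```

In this toy case the isoleucine at position 9 is hydrophobic but lies within two residues of the predicted turn, so it is excluded from the core set.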
Using data from TRAIN-93, it was further found that

\[
P\left[\,D_{ij} < d\,\right] \ge 0.96, \quad \text{where} \quad
d =
\begin{cases}
\dfrac{3R_g}{2} & \text{if } |j - i| < 20,\\[8pt]
\dfrac{5R_g}{3} & \text{if } 20 \le |j - i| < 25,\\[8pt]
2R_g & \text{otherwise.}
\end{cases}
\qquad \text{(A4)}
\]

Here $i$ and $j$ are hydrophobic residues satisfying the condition (A3) for the new core definition.

A.2.3. Probability assignment for turn and hydrophobic core constraints

The lower bounds $l_{t_i,t_{i+1}}$ for the distances between adjacent predicted turns are set to $\min\{2R_g - d,\ 20\ \text{Å}\}$. The upper bounds $u_{t_i,t_{i+1}}$ are set to $8R_g/3$, the estimated diameter of the molecule. Hence, from Eq. (A2), we get $p_{t_i,t_{i+1}} = 0.75$, where $t_i$ is the primary position number of the $i$th predicted turn residue.

For the hydrophobic core constraints, the lower bounds $l_{ij}$ are set to the bump distance threshold of 4.5 Å and the upper bounds $u_{ij}$ are set to $d$ in Eq. (A4), depending on the residue separation between residues $i$ and $j$. Also, from Eq. (A4), $p_{ij} = 0.96$ for $i$ and $j$ being core residues (satisfying Eq. (A3)).

A.3. β-Strand packing heuristics

From the training sample, the upper bound for the distances between the Cα atoms of residues in two adjacent β-strands separated by a β-turn was estimated to be 12 Å. Similarly, the upper bound for the inter-Cα distances between residues in β-strands separated by two β-turns was estimated at 16 Å. Hence we have the following heuristic:

\[
l_{ij} = 4.5\ \text{Å}, \quad u_{ij} = 12\ \text{Å} \qquad \text{(A5)}
\]

if residue $i$ and residue $j$ are corresponding residues in adjacent β-strands separated by a β-turn (Fig. 4(a)). Also,

\[
l_{ij} = 4.5\ \text{Å}, \quad u_{ij} = 16\ \text{Å} \qquad \text{(A6)}
\]

if residue $i$ and residue $j$ are corresponding residues in adjacent β-strands separated by two β-turns (Fig. 4(b)). In the training sample, TRAIN-93, it was found that 73% of the residues in β-strands separated by a β-turn satisfied Eqs. (A5) and (A6). We therefore set $p_{ij} = 0.73$ for residues $i$ and $j$ predicted to be in the β-strand class (by NDA) and separated by a (predicted) β-turn.

Fig. 4. Packing of β-strands into β-sheets.

A.4. Estimation of medium and long-range contacts

We define two residues to be in the contact class if their Cα positions are within 10 Å of each other and in the non-contact class otherwise. To overcome the vast difference in the contact and non-contact class sizes (Fariselli and Casadio, 1999), we have formulated a balanced consensus approach for the prediction of residue contacts in proteins using different training data selected at random from TRAIN-93. In doing so, we consider only residue pairs $(i, j)$ such that $\{|j - i| \ge 10$ and $|j - i| \le 30;\ i = 5, 8, \ldots;\ j = 5, 8, \ldots\}$, because it is seen that contacts cluster predominantly in the short and medium-range residue separation (Fariselli and Casadio, 1999).

Contacts for the validation data are predicted using a sixfold cross-validation approach with six different sets of training data. Each training data set contains all the contact data from TRAIN-93 and a different set of non-contact data selected at random from the set of non-contact residue pairs in TRAIN-93. The ratio of the number of contacts to the number of non-contacts is kept approximately the same for all six training data sets. The classification is done using all the sequence parameters in P and the Normal kernel function.

For each protein, the initial set of predicted contacts is further filtered of false positives to get a new set S2 of predicted contacts, using the majority rule with respect to the ordered average posterior probabilities of a contact. The following set of rules is then applied to further prune S2 of false positives.

A.4.1. Prune heuristics

(i) No contacts are allowed between residues which are within ±3 residues of adjacent predicted turn residues. For example, if residue 10 and residue 25 are predicted as turns, and no other residue in between residues 10 and 25 is predicted to be a turn residue (Fig. 5(a)), then any predicted contact between residues (9, 24), (9, 27), (10, 25) and so on is removed from S2.

(ii) No contacts are allowed between residues if they are within ±1 residue of any predicted turn residue. For example, if residue 10 and residue 42 are predicted as turns, then any predicted contact between (9, 43), (10, 41) and so on is removed from S2 (Fig. 5(b)).

(iii) No contacts are allowed between residues if they are not separated by at least one turn. For example, if residue 11 and residue 21 are predicted to be in contact and no residue is predicted as a turn in between residues 11 and 21, then the predicted contact between residues 11 and 21 is deleted (Fig. 5(c)).

Only the subset S3 of contacts obtained from S2 after applying the above rules is considered in the 3D-structure computations.

Fig. 5. Illustration of the three cases of Prune heuristics. The residues marked by '*' are the predicted turn residues.

A.4.2. Probability assignment for long-range contacts

For $(i, j) \in$ S3 we set $l_{ij} = 4.5$ Å, the bump distance threshold. Though the upper bound in the contact definition was set to 10 Å, to allow for the larger randomness in contact predictions we set $u_{ij} = 12$ Å. The probability $p_{ij}$ is calculated as

\[
p_{ij} = \frac{1}{6} \sum_{k=1}^{6} p_{ij}^{(k)}.
\]

Here $p_{ij}^{(k)}$ represents the posterior probability of contact prediction obtained using NDA on the $k$th training sample, $k = 1, 2, \ldots, 6$, under the best experimental design.

References

Andrzej, K., Skolnick, J., 1997. Secondary structure of polypeptide chains – interplay between short range and burial interactions. J. Chem. Phys. 107, 953–964.
Aszodi, A., Gradwell, M.J., Taylor, W.R., 1995. Global fold determination from a small number of distance restraints. J. Mol. Biol. 251, 308–326.
Atreya, S., Sahu, C., Chary, K.V.R., Govil, G., 2000. A tracked approach for automated NMR assignments in proteins (TATAPRO). J. Biomol. NMR 17, 125–136.
Chou, P.Y., Fasman, G.D., 1974. Prediction of protein conformation. Biochemistry 13, 222–245.
Cohen, F.E., Sternberg, M.J.E., 1980. On the prediction of protein structure: the significance of the root mean square deviation. J.
Mol. Biol. 138, 321–333.
Fariselli, P., Casadio, R., 1999. A neural network based predictor of residue contacts in proteins. Protein Eng. 12, 15–21.
Gilbert, J.C., Nocedal, J., 1992. Global convergence properties of conjugate gradient methods. SIAM J. Optim. 2, 21–42.
Huang, E.S., Samudrala, R., Ponder, J.W., 1999. Ab-initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J. Mol. Biol. 290, 267–281.
Joshi, R.R., Jyothi, S., 2002. Ab-initio structure of human seminal plasma prostatic inhibin gives significant insight into its biological functions. J. Mol. Model. 8, 50–57.
Jyothi, S., Joshi, R.R., 2001. Protein structure determination by nonparametric regression and knowledge-based constraints. Comput. Chem. 25, 283–299.
Jyothi, S., Joshi, R.R., 2002. Ab-initio computation of the 3D-structures of proteins by nonparametric statistical methods – medium and short range distance estimates (communicated).
Lund, O., Frimand, K., Gorodkin, J., Bohr, H., Bohr, J., Hansen, J., Brunak, S., 1997. Protein distance constraints predicted by neural networks and probability density functions. Protein Eng. 10, 1241–1248.
More, J.J., Zhijun, Wu., 1999. Distance geometry optimization for protein structures. J. Global Optim. 15, 219–234.
Nilges, M., Clore, G.M., Gronenborn, A.M., 1988. X-PLOR – a hybrid distance geometry dynamical simulated annealing calculation strategy. FEBS Lett. 289, 317–324.
Olmea, O., Valencia, A., 1997. Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold. Des. 2, 25–32.
Ortiz, A.R., Kolinski, A., Skolnick, J., 1998. Fold assembly of small proteins using Monte Carlo simulations driven by restraints derived from multiple sequence alignments. J. Mol. Biol. 277, 419–448.
Osguthorpe, D.J., 1999. Improved ab-initio predictions with a simplified flexible geometry model. Proteins Struct. Funct. Genet. S3, 186–193.
Pillardy, J., Czaplewski, C., Liwo, A. et al., 2001. Recent improvements in the prediction of protein structure by global optimization of a potential energy function. Proc. Natl. Acad. Sci. USA 98, 2329–2333.
SAS/STAT, 1995. Software usage and reference, version 6, vols. 1 and 2, 1st ed. SAS Institute Inc, NC.
Simons, K.T., Ruczinski, I., Kooperberg, C., Fox, B.A., Bystroff, C., Baker, D., 1998. Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins Struct. Funct. Genet. 34, 82–95.
Simons, K.T., Strauss, C., Baker, D., 2001. Prospects for ab-initio protein structural genomics. J. Mol. Biol. 306, 1191–1199.
Sippl, M., 1990. Calculation of conformational ensembles from potentials of mean force. J. Mol. Biol. 213, 859–883.
Wiest, J.D., Levy, F.K., 1977. A Management Guide to PERT/CPM, 2nd ed. Prentice Hall, Englewood Cliffs, NJ.
Xia, Y., Huang, E.S., Levitt, M., Samudrala, R., 2000. Ab-initio construction of protein tertiary structures using a hierarchical approach. J. Mol. Biol. 300, 171–185.