
Ab-initio prediction and reliability of protein structural genomics by
PROPAINOR algorithm
Rajani R. Joshi (a, b, *), S. Jyothi (b, 1)
(a) BJM School of Bioscience and Bioengineering, Indian Institute of Technology, Powai, Mumbai 400076, India
(b) Department of Mathematics, Indian Institute of Technology, Powai, Mumbai 400076, India
Abstract
We have formulated the ab-initio prediction of the 3D-structure of proteins as a probabilistic programming problem where the
inter-residue 3D-distances are treated as random variables. Lower and upper bounds for these random variables and the
corresponding probabilities are estimated by nonparametric statistical methods and knowledge-based heuristics. In this paper we
focus on the probabilistic computation of the 3D-structure using these distance interval estimates. Validation of the predicted
structures shows our method to be more accurate than other computational methods reported so far. Our method is also found to be
computationally more efficient than other existing ab-initio structure prediction methods. Moreover, we provide a reliability index
for the predicted structures too. Because of its computational simplicity and its applicability to any random sequence, our algorithm
called PROPAINOR (PROtein structure Prediction by AI and Nonparametric Regression) has significant scope in computational
protein structural genomics.
Keywords: Ab-initio protein structure prediction; Distance geometry; Nonparametric discriminant analysis; Nonparametric regression; Probabilistic
programming; Reliability index
1. Introduction
Knowledge of the three-dimensional structure of a
protein often greatly enhances our understanding of its
function and its interaction with other proteins. With
the development of sequencing techniques, protein
sequencing has now become a routine task. But the
determination of 3D-structures still remains
difficult: experimental methods of structure determination like NMR spectroscopy and X-ray crystallography
continue to be expensive, time-consuming techniques
constrained by the availability of pure extracts and
crystals. Computational methods of structure prediction
have become extremely important in the post-genomic
period, as they can be applied directly to the primary
sequence (or to synthetic peptides); hence their
importance in protein structural genomics.
Modelling by homology is at present the most reliable
computational technique for 3D-structure prediction.
But these alignment methods prove to be of limited use
for many protein sequences due to lack of a suitable
structure template. Ab-initio methods of structure prediction involving potential energy minimization too are
computationally tractable only for very small proteins.
The importance of thorough statistical analysis addressing the challenges of implicit heterogeneity and stochastic nature of the protein data has therefore become
very significant. Though few in number, recent approaches in this direction for 3D-structure prediction
have shown good promise (Xia et al., 2000; Jyothi and
Joshi, 2001; Simons et al., 2001).
We had developed a new algorithm called PROPAINOR (protein structure prediction by AI and nonparametric regression; Jyothi and Joshi, 2001) for the ab-initio prediction of protein structure. Here, high-probability Cα
distance intervals were estimated for short and medium-range inter-residue separation by nonparametric regression on statistically significant features of the primary
chain. Heuristics based on the globular geometry of
protein folds were used in lieu of long-range constraints
to impose compactness on the predicted structures.
These constraints were used in the estimation of the
3D-structures by the distance geometry approach.
Performance of PROPAINOR was found (Jyothi and
Joshi, 2001) to be better than the extensively used
programs like X-PLOR (Nilges et al., 1988), DRAGON
(Aszodi et al., 1995) and the threading methods (Sippl,
1990). The sensitivity analysis also showed the possibility of a drastic improvement in the prediction accuracy
with the incorporation of a few long-range restraints.
The results of this analysis hence motivated us to look
into the possibility of enhancing the accuracy of
PROPAINOR by improving upon the distance estimation and structure calculation procedures. We have
modified the distance estimators using the multivariate
technique of nonparametric discriminant analysis
(NDA) for the short and medium range distance
intervals (Jyothi and Joshi, 2002) and NDA together
with some knowledge-based heuristics for the long range
distance intervals and contacts (outlined in Appendix
A). These refinements are highlighted in Section 2.
Under the new approach, we also obtain the probability of correctness of the distance intervals. We also
assign weights (worth-coefficients) to each type of
distance interval according to its relative accuracy in
the training sample. These weights are used along with
the probability assigned to the distance interval, in a
two-stage probabilistic (distance geometry) programming problem for the calculation of 3D-structures. The
new 3D-structure computation method is discussed in
Section 3 followed by validation results on a large set of
real data obtained from the PDB.
We discuss the problem of assessment of the predicted
protein structures in Section 4. Here, we present a novel
technique for the estimation of a reliability index as a
measure of the expected accuracy score of the predicted
structures. We also present here a novel technique for
the estimation of the expected RMSD, the most likely
RMSD and the expected RMSD ranges for the predicted structure of a protein, based on its average
posterior probability and a score function. Discussion
of the results and scope of our method is presented in
Section 5.
2. Estimation of distance intervals for short, medium and
long inter-residue ranges
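We consider the following probabilistic minimization
problem for the determination of the 3D-structure:

$$\min_{\underline{x}_1, \ldots, \underline{x}_N} f(\underline{x}_1, \ldots, \underline{x}_N), \quad \text{where} \quad f = \sum_{i} \sum_{j > i} \left( \lVert \underline{x}_i - \underline{x}_j \rVert^2 - D_{ij}^2 \right)^2 \tag{1}$$

subject to distance constraints on $D_{ij}$.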
Here, $\underline{x}_i = (x_{i1}, x_{i2}, x_{i3})$ is the vector representing the
coordinates of the ith residue centroid or Cα atom in $\mathbb{R}^3$,
and the $D_{ij}$ are the random variables representing the 3D-distance between the centroids or Cα atoms of residue i
and residue j; $i, j = 1, 2, \ldots, N$.
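As an illustration of Eq. (1), the following minimal sketch evaluates the objective for a fixed set of target distances. It is not the authors' implementation: the function names, the dictionary layout for the targets and the toy three-residue chain are assumptions made here for illustration, and the random variables $D_{ij}$ are replaced by point values.

```python
from itertools import combinations


def squared_distance(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def objective(coords, target):
    """Eq. (1): sum over i < j of (||x_i - x_j||^2 - D_ij^2)^2.

    coords : list of (x, y, z) tuples, one per residue
    target : dict mapping (i, j) with i < j to a target distance D_ij
             (hypothetical layout, chosen for this sketch)
    """
    total = 0.0
    for i, j in combinations(range(len(coords)), 2):
        d2 = squared_distance(coords[i], coords[j])
        total += (d2 - target[(i, j)] ** 2) ** 2
    return total


# Toy chain of three residues with 3.8 Å spacing (illustrative values only).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
exact = {(0, 1): 3.8, (1, 2): 3.8, (0, 2): (2 * 3.8 ** 2) ** 0.5}
print(objective(coords, exact))  # essentially zero: all targets are met
```

When every target distance is met the objective vanishes, and each violated pair adds the fourth power of its distance mismatch.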
We use multivariate NDA for the estimation of the
probabilities for inter-Cα 3D-distance intervals and
contacts for short, medium and long range separation
of residues on the primary chain. We estimate the range
$(l_{ij}, u_{ij})$ for $D_{ij}$ separately for each of the above
categories and also compute the posterior probability
$p_{ij}$ of the constraint $D_{ij} \in (l_{ij}, u_{ij})$.
Under the new probabilistic approach, weights
(worth-coefficients) are assigned to each type of distance
interval according to its relative accuracy in the training
sample. These weights are used along with the probability $p_{ij}$ in a two-stage probabilistic (distance geometry) programming problem for the calculation of 3D-structures.
2.1. Estimation of distances for short and medium-range
residue separation
The $D_{ij}$ values for $1 \le |j - i| \le 2$ are termed short-range distances and those for $3 \le |j - i| \le 4$ are termed
medium-range. Accurate estimation of the short and
medium range distances (for $|j - i| \ge 2$) is very important
here as it corresponds to determining the secondary fold.
We need them in refinement of the estimators of certain
parameters that we found significant in the prediction of
long-range distances.
In PROPAINOR (Jyothi and Joshi, 2001) we had
used a sliding window model of the protein chain (Fig.
1) to estimate the expected values of the short and
medium range 3D-distances in proteins as functions of
their primary distances, the correlation (rm ) between the
primary and 3D-distances in the m th window (Wm ), and
a few other important features of the primary sequence.
These primary sequence features were found to be
significant with more than 99% confidence.
Here, the distance correlation $r_m$ was found to be a
dominant parameter in the regression model. We have
refined the method for the estimation of $r_m$ using a 2-stage NDA and heuristics based on the distinct primary
and 3D-distance correlation patterns in different secondary structure states as observed in the training
sample, TRAIN-93 (footnote 3). This resulted in increasing the
average accuracy of predictions of the short and
medium-range distance estimates from 93 to 98% (Jyothi
and Joshi, 2002).

Fig. 1. Sliding-window model.
2.2. Estimation of distance intervals and contacts for
long-range residue separation
Long-range distance estimates play a crucial role in
the process of protein structure determination by
distance geometry methods. We have extended our
sliding window model (Fig. 2) for the estimation of
selected long-range distance intervals also, using NDA.
Here, we consider only residue pairs (i, j) that are
separated by at least 19 residues in the sequence, i.e.
$\{D_{ij} : |j - i| \ge 20\}$.
Distance interval estimation is then carried out by
separating the window pairs Wi2 and Wj2 into
different categories based on the residue separation
and sequence environments of residue i and residue j.
Accordingly, we define window-pair categories for
estimation of intervals for $D_{ij}$ as specified below:
hmm category: Wi2 and Wj2 contain all hydrophilic residues;
hpp category: both Wi2 and Wj2 contain more than three hydrophobic residues;
hpm category: Wi2 contains all hydrophilic residues and Wj2 contains more than three hydrophobic residues, or vice versa;
L-20 category: $\{|j - i| = 20;\ i = 4, 6, \ldots\}$;
L-30 category: $\{|j - i| = 30;\ i = 4, 6, \ldots\}$;
nc category: $j = N - 5$ and $i = 5$.
Footnote 3. To get this training sample, around 400 proteins of size between
70 and 150 residues were taken randomly from the PDB. These
sequences were randomly numbered 1, 2, 3, ... The sequence numbered
1 was selected first; all the sequences having more than 40% homology
with it were removed. The next higher labelled sequence was then
selected and the process was repeated successively until no further
sequences were removed. The resulting sample contained 93 proteins.
Interestingly, in this set, fewer than 1% of the protein pairs had mutual
sequence homology around 40%. All others had much less or no
homology. We label this sample TRAIN-93.
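The selection procedure used for TRAIN-93 can be sketched as a greedy filter. This is only an illustration under stated assumptions: the `homology` callable is hypothetical (the text does not say how sequence homology was measured), and the toy `percent_identity` function below stands in for it on equal-length strings. The caller is assumed to have shuffled the sequences beforehand, matching the random numbering in the text.

```python
def select_nonhomologous(sequences, homology, threshold=40.0):
    """Greedy filter: keep the first sequence, drop every later sequence
    whose homology with it exceeds `threshold` percent, then repeat with
    the next surviving sequence until none are left to examine."""
    remaining = list(sequences)
    selected = []
    while remaining:
        current = remaining.pop(0)
        selected.append(current)
        remaining = [s for s in remaining if homology(current, s) <= threshold]
    return selected


def percent_identity(a, b):
    """Toy stand-in for a homology measure (assumption, not from the paper)."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


toy = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC", "CCCCCDDDDD", "DDDDDDDDDD"]
print(select_nonhomologous(toy, percent_identity))
# ['AAAAAAAAAA', 'CCCCCCCCCC', 'DDDDDDDDDD']
```

By construction, no two retained sequences exceed the threshold with respect to each other, which is the property the footnote relies on.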
Prediction of likely classes (based on some estimated
thresholds) for the distances $D_{ij}$ is carried out separately
for each window-pair category using NDA on the
feature vector $\underline{P}$ of the protein sequence (cf. section A.1
of Appendix A). This feature vector takes into account
the sequence environments in the neighbourhood of
both residue i and residue j as well as the properties of the
segment in between residue i and residue j. The corresponding probabilistic constraints are given in section
A.1 of Appendix A.
The prediction accuracy for the long-range distance
intervals estimated here varies between 70 and 90% and
the Matthews correlation coefficient varies between 0.38
and 0.71 for the validation sample, VALID-40 (footnote 4). These values are
much higher than those of the recent method of
Lund et al. (1997), who report a prediction accuracy of
57–63% and a maximum Matthews correlation of 0.42
for the estimation of similar distance intervals.
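For reference, the two measures quoted above can be computed from a two-class confusion matrix as in the short sketch below. It uses the standard definition of the Matthews correlation coefficient; the counts are made-up illustrative numbers, not data from this study.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified cases."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a 2x2 confusion matrix.
    Returns 0.0 when the denominator vanishes (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Illustrative counts only.
print(accuracy(40, 40, 10, 10))           # 0.8
print(matthews_corrcoef(40, 40, 10, 10))  # 0.6
```

The example shows why both numbers are reported: an accuracy of 80% corresponds here to a correlation of only 0.6, since the coefficient also penalizes the misclassified cases in both classes.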
We have refined the globular heuristics giving rise to
hydrophobic core constraints, introduced in Jyothi and
Joshi (2001), taking into account turn predictions using
β-turn propensity (Chou and Fasman, 1974) and maximum hydrophilicity in each window (estimated by a
moving average technique). These turn predictions are
also used along with the estimates of $r_m$ for the
formulation of β-strand packing heuristics (sections
A.2 and A.3 of Appendix A). The average accuracy of
the globular constraints was around 96%.
Stability of the structures predicted by distance
geometry methods is sensitive to the constraints on
residues that are likely to be in close vicinity in 3D. But
prediction of long-range 3D-contacts remains a challenging problem in protein structure research. We have also
formulated here a consensus approach based on NDA
and heuristics for the prediction of 3D-contacts (section
A.4 of Appendix A).

Footnote 4. The validation sample was also selected by a procedure similar to that
applied to TRAIN-93 to ensure that no two proteins in this set would
have more than 40% homology. The random sets thus found were
further screened so that no protein in this set had more than 40%
sequence homology with the training sample, TRAIN-93. This way, 40
proteins were obtained for the validation sample. Only three of the
proteins in this set had around 40% homology with three proteins in
the training sample. We call this set VALID-40.

Fig. 2. Sliding-window for long range distance estimation.
Validation on the set of proteins in VALID-40
showed that the precision for contact predictions by
our method is 55%. Here, we note that the computational methods based on multiple sequence alignments
and correlated mutation analysis for medium and long-range contact predictions (Fariselli and Casadio, 1999;
Olmea and Valencia, 1997) report a precision of only
20–25%.
3. Three-dimensional structure computations
In the earlier formulation of the 3D-structure prediction problem by PROPAINOR we had constraints of
the type $d_{ij} \in (l_{ij}, u_{ij})$, where the lower and upper bounds
for the short and medium-range distances were estimated by nonparametric regression. These were supplemented by knowledge-based long-range constraints
derived using heuristics on the globularity of the 3D-fold. These distance bounds are now refined and we now
have distance intervals $(l_{ij}, u_{ij})$ of different types, i.e.
short and medium-range, long-range, globular, turn,
contacts and β-strand packing heuristics. Attached to
each distance interval, we now also have a specified
probability $p_{ij}$, where

$$p_{ij} = P[\, l_{ij} \le D_{ij} \le u_{ij} \,]$$
These probabilities are computed using the posterior
probabilities obtained from NDA. The formulae used
for different categories of residue pairs are given in
Appendix A.
We apply a two-stage probabilistic programming
technique to compute the 3D-coordinates by a distance
geometry approach under these probabilistic constraints
and the bump distance constraints.

3.1. Probabilistic distance geometry optimization

3.1.1. Stage 1
The short and medium-range distance interval constraints obtained from nonparametric regression, together with (i) the hmm, hpp, hpm and nc distance
constraints obtained from NDA and (ii) the hydrophobic core and β-turn constraints, are first used in the
distance geometry program dgsol (More and Wu, 1999)
for the fast generation of a set of initial structures. No
probabilities are attached to the distance constraints at
this stage.

3.1.2. Stage 2
The set of distinct initial solutions obtained in stage 1
is then refined through probabilistic minimization of
the following modified distance geometry objective
function f:

$$f = \sum_{t} w_t \sum_{(i,j) \in G_t} \left[ p_{ij} \cdot \min{}^{2}\left\{ \frac{\lVert x_i - x_j \rVert^2 - l_{ij}^2}{l_{ij}^2},\, 0 \right\} + p_{ij} \cdot \max{}^{2}\left\{ \frac{\lVert x_i - x_j \rVert^2 - u_{ij}^2}{u_{ij}^2},\, 0 \right\} \right] + w_{bump} \cdot 1 \cdot \sum_{i=1}^{N-4} \sum_{j=i+4}^{N} \min{}^{2}\left\{ \frac{\lVert x_i - x_j \rVert^2 - l_{bump}^2}{l_{bump}^2},\, 0 \right\} \tag{2}$$

Here t indexes over the type of constraint, i.e. short and
medium-range, long-range (i.e. hmm, hpp, hpm, L-20,
L-30 and nc), globular, turn, contact and β-strand
packing. $G_t$ denotes the set of constraints of type t.
The probabilities $p_{ij}$ are multiplied by weights $w_t$
derived from heuristics on the relative importance and
accuracy of each class of constraint. The assignment of
weights to the various constraints is discussed below.
The conjugate gradient algorithm (Gilbert and Nocedal,
1992) is used for minimization of the objective function
given by Eq. (2).
3.1.3. Assignment of weights
Weights ($w_t$) for the constraints are assigned based on
the relative accuracy of the distance intervals as
obtained in the training sample. Accordingly, we chose:
$w_t = 10000$ for bump distance constraints.
$w_t = 1000$ for constraints involving short-range distances $d_{ij}$ with $j - i = 1$ or $j - i = 2$.
$w_t = 100$ for constraints involving medium-range distances $d_{ij}$ with $j - i = 3$ or $j - i = 4$.
$w_t = 10$ for globular, β, hmm, hpp, hpm, nc and long-range contact constraints.
$w_t = 1$ for L-20 and L-30 distance constraints.
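The weighted objective of Eq. (2) with the weights listed above can be sketched as follows. This is a minimal illustration, not the authors' code: the constraint-type labels, the tuple layout of a constraint and the bump distance of 4.0 Å are assumptions (the value of $l_{bump}$ is not given in this section), and the bump probability is taken as 1.

```python
# Weights from Section 3.1.3; the dictionary keys are hypothetical labels.
WEIGHTS = {"short_1_2": 1000.0, "short_3_4": 100.0, "globular": 10.0,
           "contact": 10.0, "L-20": 1.0, "L-30": 1.0}
W_BUMP = 10000.0


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def interval_penalty(d2, lower, upper):
    """Squared relative violation of the lower and of the upper bound."""
    low = min((d2 - lower ** 2) / lower ** 2, 0.0) ** 2
    high = max((d2 - upper ** 2) / upper ** 2, 0.0) ** 2
    return low + high


def weighted_objective(coords, constraints, l_bump=4.0):
    """Eq. (2). Each constraint is (type, i, j, lower, upper, p_ij).
    l_bump = 4.0 Å is an assumed placeholder value."""
    total = 0.0
    for ctype, i, j, lower, upper, p in constraints:
        d2 = squared_distance(coords[i], coords[j])
        total += WEIGHTS[ctype] * p * interval_penalty(d2, lower, upper)
    n = len(coords)
    for i in range(n - 4):          # bump term: residue pairs with j >= i + 4
        for j in range(i + 4, n):
            d2 = squared_distance(coords[i], coords[j])
            total += W_BUMP * min((d2 - l_bump ** 2) / l_bump ** 2, 0.0) ** 2
    return total


# Fully extended toy chain: five residues spaced 3.8 Å apart on a line.
chain = [(3.8 * k, 0.0, 0.0) for k in range(5)]
constraints = [("short_1_2", 0, 1, 3.7, 3.9, 0.9),   # satisfied
               ("contact", 0, 4, 4.0, 8.0, 0.5)]     # violated: d = 15.2 Å
print(round(weighted_objective(chain, constraints), 4))  # 34.0605
```

Only the violated contact constraint contributes here: ((15.2² − 8²)/8²)² × 10 × 0.5 = 34.0605. In a refinement run this function would be handed to a gradient-based minimizer such as conjugate gradient.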
3.2. Optimal solution
As in Jyothi and Joshi (2001), the selection of the best
solutions is based on the minimum objective
function value, the theoretical radius of gyration
and the best satisfaction of the hydrophobic residue
distribution. For this purpose, the solution having the
minimum objective function value is identified first. All the
solutions (models) having an objective function value
within 0.25 of the minimum are taken
into consideration. From these selected solutions, only
those that give predicted structures with a radius of
gyration within ±0.5 Å of the theoretical value (Andrzej
and Skolnick, 1997) are retained. From these solutions,
all those that do not satisfy the hydrophobic core
residue distribution are also eliminated. Using this
procedure, we got around 5–15 solutions each for the
proteins in the validation sample. These solutions are
used as the set of predicted structures.
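The three-step filter described above can be sketched as follows. The dictionary layout of a solution and the boolean `core_ok` flag (standing in for the hydrophobic core residue distribution check, whose details are not given in this section) are assumptions made for illustration; the theoretical radius of gyration is passed in rather than computed.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    msd = sum(sum((c[k] - centroid[k]) ** 2 for k in range(3)) for c in coords) / n
    return math.sqrt(msd)


def select_solutions(solutions, rg_theoretical, f_tol=0.25, rg_tol=0.5):
    """Keep solutions whose objective is within f_tol of the minimum,
    whose radius of gyration is within rg_tol Å of the theoretical value,
    and which pass the (externally evaluated) hydrophobic core check."""
    f_min = min(s["f"] for s in solutions)
    kept = []
    for s in solutions:
        if s["f"] - f_min > f_tol:
            continue
        if abs(radius_of_gyration(s["coords"]) - rg_theoretical) > rg_tol:
            continue
        if not s["core_ok"]:
            continue
        kept.append(s)
    return kept


def two_point_model(r):
    """Toy structure with radius of gyration exactly r."""
    return [(-r, 0.0, 0.0), (r, 0.0, 0.0)]


solutions = [
    {"name": "A", "f": 1.00, "coords": two_point_model(10.0), "core_ok": True},
    {"name": "B", "f": 1.20, "coords": two_point_model(10.4), "core_ok": True},
    {"name": "C", "f": 1.30, "coords": two_point_model(10.0), "core_ok": True},
    {"name": "D", "f": 1.10, "coords": two_point_model(11.0), "core_ok": True},
    {"name": "E", "f": 1.05, "coords": two_point_model(10.2), "core_ok": False},
]
print([s["name"] for s in select_solutions(solutions, rg_theoretical=10.0)])
# ['A', 'B']
```

In the toy data, C fails the objective window, D fails the radius-of-gyration window and E fails the core check, so only A and B survive.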
3.2.1. Validation
Using the refined PROPAINOR, we have predicted
the 3D-structures for all the 52 proteins in VALID-52 (footnote 5).
The average RMSD, $R^{avg}$, of the set of predicted
structures for each protein in this validation sample is
shown in Table 1. For globular proteins it was found to
vary between 4.40 and 7.60 Å with the modal value
around 6 Å.
4. Reliability assessment
Suitable estimates of the reliability of predictions are
highly desired in any protein structure prediction algorithm. However, no ab-initio prediction method has yet
reported such a measure. We have developed a
method, based only on the primary sequence of the
protein, to assess the reliability of the predicted 3D-structure. Apart from a reliability index, we also
compute estimates for its expected and most likely
RMSD. These computations are based on an analogy
with the calculation of the most likely job completion
times in PERT (Wiest and Levy, 1977).

Footnote 5. This set includes all the proteins in VALID-40 plus 12 new
proteins (selected by a similar random procedure) on which no NDA
validation was run earlier.
4.1. Overall reliability index
Here, we describe the development of a regression
model for the computation of an overall reliability index
for the predicted structure. We compute this index based
on the precision score ($\hat{S}_c$) and the average probability
($\bar{p}$) of the predicted structure(s). $\hat{S}_c$ and $\bar{p}$ are calculated
from the weights $w_t$ and the probabilities $p_{ij}$ attached
to the distance intervals as shown below:

$$\hat{S}_c = \sum_{t} \sum_{(i,j) \in G_t} w_t \, p_{ij} \tag{3}$$

and

$$\bar{p} = \frac{\sum_{t} \sum_{(i,j) \in G_t} w_t \, p_{ij}}{\sum_{t} \sum_{(i,j) \in G_t} w_t} \tag{4}$$

Here the summation runs over the set of all the imposed
probabilistic distance constraints.
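Eqs. (3) and (4) amount to a weighted sum and a weighted mean of the constraint probabilities. The sketch below illustrates this reading; the (type, probability) tuple layout and the type labels are assumptions, and the weights shown are illustrative values rather than the set actually used to produce Table 1.

```python
def score_and_mean_probability(constraints, weights):
    """Return (S_c, p_bar) as in Eqs. (3) and (4).

    constraints : iterable of (constraint_type, p_ij) pairs
    weights     : dict mapping constraint_type to its weight w_t
    """
    constraints = list(constraints)
    score = sum(weights[t] * p for t, p in constraints)
    total_weight = sum(weights[t] for t, _ in constraints)
    return score, score / total_weight


# Illustrative input: two heavily weighted constraints and one light one.
weights = {"short": 1000.0, "globular": 10.0}
constraints = [("short", 0.9), ("short", 0.8), ("globular", 0.5)]
s_c, p_bar = score_and_mean_probability(constraints, weights)
print(s_c)              # 1705.0
print(round(p_bar, 4))  # 0.8483
```

Because the average is weighted, the low-probability globular constraint barely lowers $\bar{p}$: the result stays close to the mean of the two short-range probabilities.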
To arrive at the regression equation, we first chose a
random subset of 15 proteins from VALID-52. We refer
to this sample as VALID-52.15. Since we wanted the
reliability index for a protein (of N residues) to reflect
the deviation (say $R_{diff}$) of its RMSD from the RMSD
of a random structure of N residues, $R^{ran}_N$ (Cohen, 1980),
we computed

$$\hat{R}_{diff} = R^{ran}_N - R^{avg}$$

for these 15 proteins, where $R^{avg}$ is the average RMSD
for the protein based on the set of best solutions.
regress Rdiff on Ŝc and p̄: Computational experiments
showed that the model regressing Rdiff on hŜc i and
hp̄i as the best, where,
0 if Ŝc 51000
hŜc i
(5)
1 if Ŝc 1000
0 if p̄5 0:8
hp̄i
(6)
1 if p̄ 0:8
and
hRdiff i
0
1
if Rdiff 51
if Rdiff 1
(7)
The equation of the best fit was then obtained (using
SAS software, SAS, 1995) as:
E[hRdiff ijhŜc i; hp̄i]0:190:23hŜc i0:72hp̄i
(8)
This model was found significant with a p -value of
0.0002. The discretized score hŜc i was significant with a
p -value of 0.0845 and hp̄i was significant with a p-value
of 0.0001. The adjusted R2 value for the model was
0.7184.
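Eqs. (5), (6) and (8) reduce to a small lookup, sketched below with the published coefficients. The function name is an assumption. Note that with these rounded coefficients the case of a low score and a high probability evaluates to 0.91, whereas Table 1 prints 0.92 for such proteins; the difference is presumably due to rounding of the fitted coefficients.

```python
def reliability_index(score, p_bar):
    """Reliability index from Eq. (8), using the discretizations of
    Eq. (5) (score > 1000) and Eq. (6) (average probability > 0.8)."""
    score_flag = 1 if score > 1000 else 0
    prob_flag = 1 if p_bar > 0.8 else 0
    return 0.19 + 0.23 * score_flag + 0.72 * prob_flag


# (score, average probability) pairs taken from rows of Table 1.
print(round(reliability_index(991.86, 0.7810), 2))   # 0.19
print(round(reliability_index(2745.11, 0.7932), 2))  # 0.42
print(round(reliability_index(636.31, 0.8406), 2))   # 0.91
print(round(reliability_index(6017.19, 0.8889), 2))  # 1.14
```

The index therefore takes only four distinct values, one per combination of the two binary indicators.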
Table 1
Validation results: accuracy and reliability indices for the predicted protein structures

| PDB code | No. of residues | $\hat{S}_c$ | $\bar{p}$ | $R^{avg}$ (Å) | $R^{ran}_N$ (Å) | Actual RMSD range (Å) | $\hat{R}_f$ | $\hat{E}$ (Å) | Pred RMSD range (Å) | $\hat{M}$ (Å) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1a1w | 83 | 636.31 | 0.8406 | 5.49 | 7.81 | (5.17, 5.75) | 0.92 | 5.41 | (4.71, 6.11) | 5.17 |
| 1a1x | 106 | 991.86 | 0.7810 | 7.37 | 8.43 | (7.24, 7.51) | 0.19 | 7.72 | (7.43, 8.00) | 7.72 |
| 1ag4 | 103 | 935.28 | 0.8358 | 5.83 | 8.35 | (5.63, 6.23) | 0.92 | 6.31 | (5.61, 7.01) | 6.31 |
| 1aj4 | 90 | 6017.19 | 0.8889 | 4.90 | 8.00 | (4.84, 5.44) | 1.14 | 5.77 | (5.07, 6.47) | 5.57 |
| 1aok | 130 | 2131.40 | 0.8362 | 6.32 | 9.08 | (6.18, 6.48) | 1.14 | 6.58 | (5.88, 7.28) | 6.60 |
| 1aqe | 110 | 234.78 | 0.7158 | 7.62 | 8.54 | (7.32, 7.92) | 0.19 | 7.77 | (7.54, 8.00) | 7.77 |
| 1azf | 129 | 2745.11 | 0.7932 | 6.46 | 9.06 | (6.37, 6.61) | 0.42 | 6.63 | (5.93, 7.33) | 6.72 |
| 1bhu | 102 | 539.89 | 0.6281 | 7.92 | 8.32 | (7.66, 8.61) | 0.19 | 7.66 | (7.32, 8.00) | 7.66 |
| 1bl4 | 107 | 1419.74 | 0.8421 | 5.56 | 8.46 | (5.52, 5.62) | 1.14 | 6.34 | (5.64, 7.04) | 6.36 |
| 1bo0 | 76 | 625.79 | 0.7972 | 7.73 | 7.62 | (7.58, 7.90) | 0.19 | 7.12 | (6.60, 7.62) | 7.12 |
| 1c15 | 97 | 1197.39 | 0.8474 | 6.48 | 8.19 | (6.20, 6.90) | 1.14 | 6.21 | (5.51, 6.91) | 6.05 |
| 1ccd | 75 | 386.88 | 0.8197 | 5.66 | 7.59 | (5.59, 5.73) | 0.92 | 5.15 | (4.45, 5.85) | 4.92 |
| 1dun | 110 | 2428.69 | 0.8995 | 6.82 | 8.54 | (6.62, 7.06) | 1.14 | 6.29 | (5.59, 6.99) | 6.27 |
| 1e68 | 70 | 778.00 | 0.8801 | 5.30 | 7.46 | (5.17, 5.42) | 0.92 | 4.82 | (4.13, 5.52) | 4.55 |
| 1eo0 | 77 | 470.61 | 0.8065 | 4.61 | 7.65 | (4.13, 5.17) | 0.92 | 5.27 | (4.57, 5.97) | 5.05 |
| 1f0m | 71 | 291.69 | 0.8217 | 5.48 | 7.48 | (5.41, 5.60) | 0.92 | 4.95 | (4.25, 5.65) | 4.72 |
| 1g7d | 105 | 866.04 | 0.8047 | 6.61 | 8.41 | (6.55, 6.69) | 0.92 | 6.37 | (5.67, 7.07) | 6.40 |
| 1gdc | 72 | 317.41 | 0.6077 | 7.08 | 7.51 | (6.93, 7.16) | 0.19 | 7.01 | (6.51, 7.51) | 7.01 |
| 1ghj | 79 | 587.34 | 0.7937 | 7.22 | 7.70 | (6.97, 7.99) | 0.19 | 7.20 | (6.70, 7.70) | 7.20 |
| 1gsp | 104 | 1231.28 | 0.7763 | 5.73 | 8.38 | (5.60, 5.97) | 0.42 | 6.40 | (5.70, 7.10) | 6.51 |
| 1hfc | 123 | 1375.54 | 0.8689 | 7.02 | 8.89 | (7.02, 7.02) | 1.14 | 6.46 | (5.76, 7.16) | 6.46 |
| 1hml | 123 | 1549.06 | 0.9006 | 6.36 | 8.89 | (5.82, 6.62) | 1.14 | 6.42 | (5.72, 7.12) | 6.39 |
| 1imy | 85 | 1099.04 | 0.8208 | 5.32 | 7.86 | (5.18, 5.58) | 1.14 | 5.65 | (4.95, 6.35) | 5.32 |
| 1jug | 125 | 1697.24 | 0.7736 | 6.42 | 8.95 | (6.11, 6.80) | 0.42 | 6.62 | (5.92, 7.32) | 6.72 |
| 1mbl | 98 | 969.21 | 0.8465 | 6.77 | 8.22 | (6.67, 6.98) | 0.92 | 6.30 | (5.60, 7.00) | 6.06 |
| 1mho | 88 | 1821.48 | 0.9355 | 5.33 | 7.94 | (5.19, 5.58) | 1.14 | 5.64 | (4.94, 6.34) | 5.41 |
| 1nkl | 78 | 553.20 | 0.8041 | 5.79 | 7.67 | (5.66, 5.86) | 0.92 | 5.32 | (4.62, 6.02) | 5.11 |
| 1plm | 130 | 1952.87 | 0.8278 | 6.38 | 9.08 | (6.19, 6.69) | 1.14 | 6.59 | (5.89, 7.29) | 6.62 |
| 1rhl | 104 | 1350.87 | 0.7872 | 6.04 | 8.38 | (5.92, 6.22) | 0.42 | 6.39 | (5.69, 7.09) | 6.48 |
| 1trf | 76 | 1222.94 | 0.8704 | 5.07 | 7.62 | (5.03, 5.12) | 1.14 | 5.13 | (4.43, 5.83) | 4.95 |
| 1upj | 99 | 831.49 | 0.8808 | 6.21 | 8.24 | (5.79, 6.96) | 0.92 | 6.32 | (5.62, 7.02) | 6.05 |
| 1utg | 70 | 484.02 | 0.8148 | 5.32 | 7.46 | (5.22, 5.46) | 0.92 | 4.91 | (4.21, 5.61) | 4.61 |
| 1who | 94 | 1378.07 | 0.8362 | 6.39 | 8.11 | (5.88, 6.59) | 1.14 | 6.08 | (5.38, 6.78) | 5.92 |
| 1xer | 103 | 3128.86 | 0.8186 | 6.83 | 8.35 | (6.50, 7.22) | 1.14 | 6.33 | (5.63, 7.03) | 6.37 |
| 1ytj | 99 | 904.91 | 0.7959 | 7.61 | 8.24 | (7.22, 7.89) | 0.92 | 7.62 | (7.24, 8.00) | 7.62 |
| 1yvs | 108 | 1059.81 | 0.7166 | 18.60 | 8.49 | (18.2, 18.7) | 0.42 | 6.53 | (5.80, 7.23) | 6.67 |
| 1zib | 124 | 1879.35 | 0.8454 | 6.84 | 8.92 | (6.84, 6.84) | 1.14 | 6.51 | (5.81, 7.21) | 6.52 |
| 2af8 | 86 | 851.02 | 0.8485 | 5.86 | 7.89 | (5.56, 6.02) | 0.92 | 5.58 | (4.88, 6.28) | 5.33 |
| 2fhi | 105 | 791.18 | 0.8940 | 6.26 | 8.41 | (5.82, 6.55) | 0.92 | 6.25 | (5.55, 6.95) | 6.21 |
| 2psr | 96 | 1586.88 | 0.6987 | 6.13 | 8.16 | (5.78, 6.34) | 0.42 | 6.39 | (5.69, 7.09) | 6.06 |
| 2pyp | 125 | 606.38 | 0.8194 | 6.56 | 8.95 | (6.23, 6.87) | 0.92 | 6.55 | (5.85, 7.25) | 6.56 |
| 2sxl | 88 | 908.84 | 0.8136 | 6.11 | 7.94 | (5.51, 6.35) | 0.92 | 5.75 | (5.05, 6.45) | 5.53 |
| 2tgi | 112 | 1542.15 | 0.9229 | 10.79 | 8.60 | (10.7, 10.9) | 1.14 | 6.28 | (5.50, 6.98) | 6.24 |
| 2tmy | 118 | 2366.38 | 0.8694 | 6.76 | 8.76 | (6.43, 6.89) | 1.14 | 6.41 | (5.71, 7.11) | 6.41 |
| 2tn4 | 85 | 3829.47 | 0.9388 | 5.66 | 7.86 | (5.40, 5.84) | 1.14 | 5.40 | (4.70, 6.10) | 5.17 |
| 2trx | 108 | 818.87 | 0.8953 | 6.51 | 8.49 | (6.26, 6.78) | 0.92 | 6.28 | (5.58, 6.98) | 6.23 |
| 3ait | 74 | 714.86 | 0.8014 | 5.60 | 7.57 | (5.36, 5.88) | 0.92 | 4.93 | (4.23, 5.63) | 4.71 |
| 3crd | 100 | 656.50 | 0.8537 | 6.43 | 8.27 | (6.26, 6.55) | 0.92 | 6.35 | (5.65, 7.05) | 6.10 |
| 3icb | 75 | 375.70 | 0.8903 | 4.44 | 7.59 | (4.25, 4.64) | 0.92 | 4.86 | (4.16, 5.56) | 4.59 |
| 3mba | 146 | 1856.17 | 0.8503 | 7.50 | 9.52 | (7.41, 7.58) | 1.14 | 6.72 | (6.02, 7.42) | 6.73 |
| 3vub | 87 | 931.66 | 0.7245 | 7.01 | 7.92 | (6.80, 7.51) | 0.19 | 7.42 | (6.92, 7.92) | 7.42 |
| 5hpg | 83 | 1076.78 | 0.7820 | 6.79 | 7.81 | (6.69, 6.89) | 0.42 | 5.50 | (4.80, 6.20) | 5.11 |

Here, $R^{avg}$ represents the average RMSD of the predicted models of the protein with respect to its native structure. $R^{ran}_N$ is the RMSD for a random
structure calculated as in Cohen (1980). Actual RMSD range gives the range of RMSD observed in the distinct solutions, $\hat{R}_f$ is the reliability factor for the
protein, $\hat{E}$ is the calculated expected RMSD, and Pred RMSD range gives the lower and upper bounds within which the observed RMSD is
expected to lie. $\hat{M}$ is the most likely RMSD.
For any protein, the conditional expectation of
$R_{diff}$ computed from the model (8) serves as a measure
of the overall reliability of the predicted structure. We
denote this measure by $\hat{R}_f$. The reliability indices for all
the proteins in VALID-52 were computed using Eq. (8).
These are listed in Table 1 along with the corresponding
score $\hat{S}_c$ and probability $\bar{p}$.
4.2. Estimation of the expected and the most likely
RMSD
Here we estimate the expected RMSD ($\hat{E}$), the most
likely RMSD ($\hat{M}$) and the interval ($\hat{R}_{min}$, $\hat{R}_{max}$) in which
the RMSD of a predicted structure would lie.
The sample VALID-52.15 is used here for building
the model. It was observed in this sample that the
RMSD for proteins with larger $\bar{p}$ was lower, relative to the
sample average, than that for proteins with smaller $\bar{p}$.
We model this scaling with respect to $\bar{p}$ in the computation of $\mu_{new}$ below.
This model for the computation of $\mu_{new}$ also takes into
account the observation that the RMSD for proteins
with larger N (N > 100) was higher than the sample
average and that the RMSD for proteins with smaller N
(N < 100) was lower than the sample average. However,
since this trend was increasing rapidly for proteins up to
N = 100 and was almost stationary with respect to larger
N, we also incorporate a correction factor c as shown
below.
From model (8), we note that $\hat{R}_f > 0.20$ iff $\langle \hat{S}_c \rangle = 1$
or $\langle \bar{p} \rangle = 1$. In other words, $\hat{R}_f > 0.20$ indicates a good
prediction. Hence, $\hat{E}$, $\hat{M}$, $\hat{R}_{min}$ and $\hat{R}_{max}$ are estimated
separately for proteins with $\hat{R}_f > 0.20$ and for proteins
with lower $\hat{R}_f$ values.
4.2.1. Estimation for proteins with $\hat{R}_f > 0.20$
Let $\mu_R$ and $\sigma_R$ denote respectively the average and the
standard deviation of the RMSD in the model sample
(for VALID-52.15 we obtained $\mu_R = 6.05$ Å and $\sigma_R = 0.70$ Å).
The following model was found to be the best for the
computation of $\mu_{new}$ and c to incorporate the aforementioned trends. $\mu_{new}$ is estimated as:

$$\mu_{new} = \begin{cases} \mu_R - \bar{p} \cdot 2\sigma_R & \text{if } N \le 100 \\ \mu_R + (1 - \bar{p}) \cdot 2\sigma_R & \text{if } N > 100 \end{cases} \tag{9}$$

The correction factor, c, for the size of the protein is
given by

$$c = \begin{cases} 0.06 \cdot (N - N_{min}) & \text{if } N \le 100 \\ 0.01 \cdot (N - 100) & \text{if } N > 100 \end{cases} \tag{10}$$

Here $N_{min}$ is the size of the smallest sequence in the
model sample (for VALID-52.15, $N_{min}$ was 74).
The expected value of the RMSD, $\hat{E}$, is calculated as

$$\hat{E} = \mu_{new} + c \tag{11}$$

From $\hat{E}$, the lower and upper bounds ($\hat{R}_{min}$ and $\hat{R}_{max}$)
for the RMSD are calculated as:

$$\hat{R}_{min} = \hat{E} - \sigma_R \tag{12}$$

$$\hat{R}_{max} = \hat{E} + \sigma_R \tag{13}$$
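Eqs. (9) to (13) can be checked numerically against Table 1. The sketch below uses the constants quoted in the text for VALID-52.15; the function names are assumptions. For the first protein in Table 1 (83 residues, average probability 0.8406) it gives an expected RMSD of 5.41 Å and a range of (4.71, 6.11) Å, matching the tabulated values.

```python
MU_R = 6.05      # mean RMSD (Å) in VALID-52.15, as quoted in the text
SIGMA_R = 0.70   # standard deviation of the RMSD (Å), as quoted in the text
N_MIN = 74       # smallest sequence length in VALID-52.15, as quoted


def mu_new(n_residues, p_bar):
    """Eq. (9): sample mean rescaled by the average probability."""
    if n_residues <= 100:
        return MU_R - p_bar * 2 * SIGMA_R
    return MU_R + (1 - p_bar) * 2 * SIGMA_R


def size_correction(n_residues):
    """Eq. (10): correction factor c for the size of the protein."""
    if n_residues <= 100:
        return 0.06 * (n_residues - N_MIN)
    return 0.01 * (n_residues - 100)


def expected_rmsd(n_residues, p_bar):
    """Eq. (11): expected RMSD."""
    return mu_new(n_residues, p_bar) + size_correction(n_residues)


def rmsd_range(n_residues, p_bar):
    """Eqs. (12) and (13): expected RMSD minus/plus one standard deviation."""
    e_hat = expected_rmsd(n_residues, p_bar)
    return e_hat - SIGMA_R, e_hat + SIGMA_R


print(round(expected_rmsd(83, 0.8406), 2))               # 5.41
print(tuple(round(v, 2) for v in rmsd_range(83, 0.8406)))  # (4.71, 6.11)
print(round(expected_rmsd(129, 0.7932), 2))              # 6.63
```

The third call reproduces the tabulated expected RMSD of a protein with more than 100 residues (129 residues, average probability 0.7932), exercising the second branch of both piecewise formulas.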
The most likely RMSD, $\hat{M}$, is calculated in analogy
with the calculation of the most likely estimates of job
completion times in PERT. For this, we need
the most optimistic estimate (moe) and the most pessimistic estimate (mpe) of the
RMSD. These are computed as follows:

$$\text{moe} = \begin{cases} \mu_R - 1.75 \cdot \hat{R}_f & \text{if } N \le 100 \\ \mu_R - 0.5 \cdot \hat{R}_f & \text{if } N > 100 \end{cases} \tag{14}$$

$$\text{mpe} = \begin{cases} \mu_R + 0.25 \cdot \hat{R}_f & \text{if } N \le 100 \\ \min\{\mu_R + \hat{R}_f,\ 7.00\} & \text{if } N > 100 \end{cases} \tag{15}$$

Then

$$\hat{M} = \frac{1}{4} \left( 6 \cdot \mu_{new} - \text{moe} - \text{mpe} \right) + c \tag{16}$$

where $\mu_{new}$ and c are computed as in Eqs. (9) and (10),
respectively.
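The PERT analogy behind Eq. (16) is the three-point relation expected = (optimistic + 4 × most likely + pessimistic)/6, solved for the most likely value, with the size correction c added afterwards. The following sketch implements Eqs. (14) to (16) under that reading; the function names are assumptions and the constants are those quoted for VALID-52.15. With the tabulated reliability factors it reproduces the most likely RMSD of two Table 1 entries.

```python
MU_R = 6.05      # mean RMSD (Å) in VALID-52.15, as quoted in the text
SIGMA_R = 0.70   # standard deviation of the RMSD (Å), as quoted in the text
N_MIN = 74       # smallest sequence length in VALID-52.15, as quoted


def mu_new(n_residues, p_bar):
    """Eq. (9)."""
    if n_residues <= 100:
        return MU_R - p_bar * 2 * SIGMA_R
    return MU_R + (1 - p_bar) * 2 * SIGMA_R


def size_correction(n_residues):
    """Eq. (10)."""
    if n_residues <= 100:
        return 0.06 * (n_residues - N_MIN)
    return 0.01 * (n_residues - 100)


def most_optimistic(n_residues, r_f):
    """Eq. (14)."""
    factor = 1.75 if n_residues <= 100 else 0.5
    return MU_R - factor * r_f


def most_pessimistic(n_residues, r_f):
    """Eq. (15)."""
    if n_residues <= 100:
        return MU_R + 0.25 * r_f
    return min(MU_R + r_f, 7.00)


def most_likely_rmsd(n_residues, p_bar, r_f):
    """Eq. (16): PERT relation solved for the most likely value, plus c."""
    moe = most_optimistic(n_residues, r_f)
    mpe = most_pessimistic(n_residues, r_f)
    core = (6 * mu_new(n_residues, p_bar) - moe - mpe) / 4
    return core + size_correction(n_residues)


# Inputs (N, average probability, reliability factor) taken from Table 1.
print(round(most_likely_rmsd(83, 0.8406, 0.92), 2))   # 5.17
print(round(most_likely_rmsd(129, 0.7932, 0.42), 2))  # 6.72
```

Both values agree with the most likely RMSD listed for the corresponding proteins, which supports the signs used in the reconstruction of Eqs. (14) to (16).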
4.2.2. Estimation for proteins with $\hat{R}_f < 0.20$
The proteins in this category had RMSD > 7 Å. So
instead of using $\mu_R$ and $\sigma_R$, the following simple model
was found to be sufficient here:

$$\hat{R}_{min} = R^{ran}_N - 1 \tag{17}$$

$$\hat{R}_{max} = R^{ran}_N + 1 \tag{18}$$

and

$$\hat{E} = \hat{M} = \frac{\hat{R}_{min} + \hat{R}_{max}}{2} \tag{19}$$
The computed values of $\hat{E}$, $\hat{M}$, $\hat{R}_{min}$ and $\hat{R}_{max}$ for all
proteins in VALID-52 are given in Table 1. We note that
there is a close correspondence between $R^{avg}$, $\hat{E}$ and $\hat{M}$
and that the computed RMSD range includes $R^{avg}$ in
almost all the cases in Table 1. We obtain a correlation
coefficient of 0.42 between $R^{avg}$ and $\hat{M}$. But when the
two non-globular proteins 1yvs and 2tgi are removed
from the validation sample, this correlation increases to
0.84, with confidence level above 95%.
5. Results and discussion
We had developed a new algorithm, PROPAINOR,
for ab-initio protein structure prediction from a set of
distance restraints that are estimated by nonparametric
regression and knowledge based heuristics (Jyothi and
Joshi, 2001). In this paper, we have reported the
refinements made to PROPAINOR in the structure
computation methodology. In contrast to our earlier
deterministic distance geometry optimization, we now
adopt a two-stage probabilistic programming approach
to compute the 3D-structure of proteins from a set of
probabilistic distance constraints.
Validation of our new method on a sample of 52
proteins (VALID-52) shows that the RMSD of the
globular protein structures predicted by the new approach, as compared to the native structure, varies
between 4.40 and 7.60 Å. Moreover, for globular
proteins of sizes between 70 and 130 residues the average
value of the RMSD in the validation sample, VALID-52,
is found to be 6.05 Å (Table 1). This RMSD level is the
lowest reported so far for any ab-initio method.
It is also clear from Table 1 that the RMSD for almost every
protein, $R^{avg}$, is much less than the RMSD for a random
structure with the same number of residues, $R^{ran}_N$.
Further, continuous fragments of sizes varying between
50 and 75 residues are now predicted with an RMSD < 4 Å
(Fig. 3(d–f) and Table 2).
A comparison of our method with some of the recent
well-received computational methods of protein structure prediction (for example, Simons et al., 1998; Ortiz
et al., 1998; Huang et al., 1999; Xia et al., 2000;
Osguthorpe, 1999; Simons et al., 2001) shows it to be
better or comparable (Table 2). Most of these methods
use scoring or fitness functions to select the optimal
solutions from a large number of conformations (decoys; footnote 6). In contrast, our method does not require
such a set of decoys, as we estimate the constraints
exhaustively based on nonparametric statistical estimates (with high confidence) that take care of the
random variations. We then choose the optimal solution
based on the heuristics described in Section 3.2.
Further, it is important to note that most of the other
computational methods use multiple sequence alignments during some stage of structure prediction and
hence are not efficient for proteins that do not have
many homologous sequences in the sequence databanks.
We also note that there is a good overall topological
fit between the predicted and the native Cα traces. This is
evident from the plots depicting the superposition of the
predicted Cα trace onto the native Cα trace (Fig. 3).
Because of the small standard deviation of the RMSD,
all the optimal structures predicted by our method have
qualitatively similar topologies too. Therefore a predicted structure is selected at random from the set of
optimal solutions for the purpose of superposition.

Footnote 6. E.g. URL: dd.compbio.washington.edu

Fig. 3. (a–c) Superpositions of the Cα traces of the actual and predicted structures. (a) α-Helical protein 3icb, (b) β-strand protein 1bl4 and (c) α/β
protein 2trx. These (a–c) are also, respectively, illustrations of the best, average and worst results of topological matching in the validation
sample. The bad results for 2trx are mainly due to the chirality problem. (d–f) Superpositions of the Cα traces of the actual and predicted structures
for maximum length fragments having RMSD just below 4 Å. (d) 3icb (residues 1–73: 73 residues). (e) 1bl4 (residues 16–76: 61 residues). (f) 2trx
(residues 32–56: 55 residues). In all these figures, the native structure is depicted in grey and the predicted structure in black. The residue numbers are
labelled only at some positions to make the comparison clear.

Table 2
A comparison of ab-initio structure prediction methods

| Group/algorithm | Computing time | RMSD for test sets of 15–20 proteins (a) (entire protein) | Other comments |
|---|---|---|---|
| Simons et al. (1999) (ROSETTA) | 2 days on 5 α-533 MHz processors for proteins around 120 residues | 8.32–11.87 Å | – |
| Ortiz et al. (1998) (MONSSTER) | 4 days on a SGI MIPS 180 MHz processor | – | 3.5–6.7 Å for 70–100 residue fragments |
| Huang et al. (1999) | Not mentioned | 5.37–9.99 Å | Applicable only to helical proteins up to 90 residues |
| Xia et al. (2000) | 3 CPU days on a 533 MHz α-processor for 40–80 residue proteins | 4.95–13.90 Å | 6–7 Å for fragments up to 80 residues |
| Osguthorpe (1999) | Not mentioned | 9.95–19.06 Å | – |
| Pillardy et al. (2001) | 100 h of wall-clock time using 64 processors of IBM2 supercomputer | Best: 4.3 Å for 70 residue 1e68. Worst: not mentioned | 6.0 Å for 52–68 residue fragments of α-helical proteins. Worse for β and α/β proteins |
| Joshi and Jyothi (2002) PROPAINOR | Around 8–10 h on a SUN ULTRA sparc 128 MB RAM 167 MHz CPU | 4.40–7.60 Å | < 4 Å for 50–75 residue fragments |

It may be noted here that the methods of Pillardy et al. (2001) and Osguthorpe (1999) use potentials which mimic the physical forces instead of
knowledge-based potentials. Also, the computing times mentioned in column 2 for all the methods represent the time taken to get an ensemble of
structures. The best structures are selected from this ensemble through various heuristics.
(a) Our algorithm, PROPAINOR, has been validated on a test set of 52 proteins. No NMR or other experimental data were used in our method.
Fig. 3 also shows the topological fit for 3 proteins for
maximum length fragments with RMSD just below 4 Å.
It is seen that continuous fragments containing about
50 /99% of the residues in the protein can be superimposed with a very good fit.
Our method also gives a confidence interval, a score
and a most likely estimate of the RMSD for the
predicted structures. To the best of our knowledge, no other
ab-initio method provides this. Whenever a structure
is predicted by a computational or statistical method, the
user's natural question is how reliable the prediction is.
The aforesaid reliability parameters answer this question; they are as important
for a predicted structure of a given sequence as p-values or confidence intervals are in any statistical,
data-driven decision making.
Ab-initio structures play a key role in exploring the
structure/function of a new protein as they at least
provide a preliminary solution to optimally proceed
with more sophisticated and expensive methods of
refinement. Our estimates like the most likely RMSD
would give a quantitative measure of the resolution of
the preliminary structure.
PROPAINOR is also found to be computationally
efficient. The entire two-stage process for getting an
ensemble of distinct Cα structures takes less than 8 h on
a Sun ULTRA Sparc station (128 MB RAM, 167 MHz
CPU). In contrast, other recent methods take more than
3 days on multiple high-speed workstations for
similar sized proteins. A comparison of the accuracy and
computation times for a few other ab-initio structure
prediction methods is given in Table 2.
Most importantly, since our method does not rely on
homology, it is well-suited for the structure prediction of
proteins like human seminal plasma prostatic inhibin
(HSPI), which lacks sequence homology with known
proteins and is difficult to study by NMR
and crystallography. We have predicted the structure of
this protein and validated it against the
available experimental observations with good results.
We have also undertaken docking and other
computational studies on the predicted structure. These
were found to give significant insight into its important
biological functions (Joshi and Jyothi, 2002). Further,
we have also predicted the structure of a calcium
binding protein, EhCaBP. The predicted structure of
EhCaBP (with and without NMR restraints) was also
found to agree well with the recently obtained experimental structure (Atreya et al., 2000). Docking studies
give further directions for its possible immunosuppression.
In view of the above applications and because of its
computational efficiency, high reliability and applicability to any random sequence, our method offers promising applications in structural and functional genomics
and proteomics research. At present it is successfully
implemented on proteins of sizes 70 /150 residues. We
are currently working on its extensions to shorter and
longer sequences.
Acknowledgements
Part of this research has been funded by the Dept. of
Biotechnology (DBT), Govt. of India through a project
undertaken by Prof. Rajani R. Joshi. The author thanks
the DBT for the same. Jyothi S. thanks the Council of
Scientific and Industrial Research, Govt. of India for the
financial support in undertaking this research work.
Appendix A
A.1. Estimation of long-range distance intervals
Different subsets of the following sequence feature
vector $\mathbf{P}$ were found to be significant (confidence level
>99%) for the window-pair categories when NDA was
applied for the classification of the window-pairs in
either of the two distance-classes (say, long/0 and
long/1) that were formed using the mean 3D-distance
for the $t$th category, $m_t$, as threshold.
\[
\mathbf{P} = (H, C_1, C_2, C_3, n_\alpha, n_\beta, L, H^{ij}, C_1^{ij}, C_2^{ij}, C_3^{ij}, n_\alpha^{ij}, n_\beta^{ij}, L^{ij}, h^m, c_1^m, c_2^m, c_3^m, f^m) \tag{A1}
\]
Here $H$ is the percentage concentration of hydrophobic
residues in the primary chain; $C_k$ is the percentage
concentration of residues in the chain belonging to the
$k$th amino-acid cluster, $k = 1, 2, 3$; $n_\alpha$ is the number of
windows in the chain with average propensity greater
than 0.06; $n_\beta$ is the number of windows in the chain with
average propensity less than $-0.06$; $L$ is the length of
the chain; $H^{ij}$ is the percentage concentration of
hydrophobic residues in the portion from residue $i+5$
to $j-5$ of the primary chain; $C_k^{ij}$ is the percentage
concentration of residues in the chain belonging to the
$k$th amino acid cluster and lying in the portion $i+5$ to
$j-5$ of the primary chain, $k = 1, 2, 3$; $n_\alpha^{ij}$ is the number of
windows in the portion from residue $i+5$ to $j-5$ of the
primary chain with average propensity greater than
0.06; $n_\beta^{ij}$ is the number of windows in the portion from
residue $i+5$ to $j-5$ of the primary chain with average
propensity less than $-0.06$; $L^{ij}$ is the length of the
portion of the chain from residue $i+5$ to $j-5$; $h^m$ is the
percentage concentration of hydrophobic residues in
window $W_m$, $m \in \{i-4, i-2, i, j-4, j-2, j\}$; $c_k^m$ is the
percentage concentration of residues belonging to the
$k$th amino acid cluster in window $W_m$, $m \in \{i-4, i-2, i, j-4, j-2, j\}$
and $k \in \{1, 2, 3\}$; $f^m$ is the average
propensity of residues in $W_m$ multiplied by 100 (to make
this scale on par with the other features), $m \in \{i-4, i-2, i, j-4, j-2, j\}$.
Here the average propensity $f^m = \sum_{i=m}^{m+4} (a_i - b_i)/10$,
where $a_i$ and $b_i$ are, respectively, the propensity of
residue $i$ for α-helix and β-sheet (Chou and Fasman,
1974).
A.1.1. Probability assignment for long-range distance
intervals
NDA gives us the posterior probabilities, $p^0_{ij}$ and $p^1_{ij}$,
of membership of window-pair $(W_{i-2}, W_{j-2})$ in distance-class long/0 and class long/1, respectively. In
view of the geometrical constraints on globular folding
(Jyothi and Joshi, 2001), we set
\[
P[\,l_{bump} \le D_{ij} < m_t\,] = p^0_{ij}
\]
and
\[
P[\,m_t \le D_{ij} \le u_t^{rad}\,] = p^1_{ij}
\]
Here $l_{bump}$ is the bump distance threshold set to 4.5 Å
and, using globular heuristics (Jyothi and Joshi, 2001),
\[
u_t^{rad} =
\begin{cases}
2R_g & \text{if thpp} \\
10R_g/3 & \text{if thmm or nc} \\
8R_g/3 & \text{if thpm, L20 or L30}
\end{cases}
\]
Now, if $p^0_{ij} > p^1_{ij}$, we set $l_{ij} = l_{bump}$, $u_{ij} = m_t$ and
$p_{ij} = p^0_{ij}$. Else, if $p^1_{ij} > p^0_{ij}$, we assign $l_{ij} = m_t$, $u_{ij} = u_t^{rad}$ and
$p_{ij} = p^1_{ij}$.
A.2. Refinement of globular heuristics
We have modified our globular constraints (derived
previously in Jyothi and Joshi, 2001) by taking into
account the positions of the possible turns in the protein.
A.2.1. Turn heuristics
Residues $t_i$ that have a tendency to lie on the surface
of the protein are determined by a consensus of Chou's
β-turn propensity (Chou and Fasman, 1974) and maximum hydrophilicity calculated for each window using
a moving average technique. Let $\{t_1, t_2, \ldots, t_n\}$ be the
residue numbers of the turns thus predicted. We have
formulated the new turn heuristics based on the
consideration that adjacent turn residues, $t_i$ and $t_{i+1}$,
would be separated by a regular secondary structure
element and hence would lie diametrically opposite to
each other in 3D. Also neither $t_i$ nor $t_{i+1}$ would come
inside the protein core. Therefore, using the protein data
in TRAIN-93, we estimated that
\[
P[\,D_{t_i,t_{i+1}} \ge \min\{2R_g - d,\ 20\ \text{Å}\}\,] \ge 0.75 \tag{A2}
\]
Here $R_g$ (in Å) is the radius of gyration for the protein
(Andrzej and Skolnick, 1997) and $d$ is determined using
heuristics on known protein structures.
A.2.2. Modified hydrophobic core constraints
These turn predictions are then used to modify the
hydrophobic core constraints derived previously in
Jyothi and Joshi (2001).
Since predicted turns are constrained to lie on the
surface of the protein, any hydrophobic residue adjacent
to a predicted turn would also essentially lie on the
surface. Hence, in our reformulated core heuristics,
residue $i$ is constrained to be in the core of the protein
only if it is hydrophobic and
\[
|t_j - i| \ge 3 \tag{A3}
\]
Here $t_j$ represents the primary position number of the
$j$th (predicted) turn residue. Using data from TRAIN-93, it was further found that $P[D_{ij} < d] \ge 0.96$, where
\[
d =
\begin{cases}
3R_g/2 & \text{if } |j - i| < 20 \\
5R_g/3 & \text{if } 20 \le |j - i| < 25 \\
2R_g & \text{otherwise}
\end{cases}
\tag{A4}
\]
Here $i$ and $j$ are hydrophobic residues satisfying the
condition (A3) for the new core definition.
A.2.3. Probability assignment for turn and hydrophobic
core constraints
The lower bounds $l_{t_i,t_{i+1}}$ for the distances between
adjacent predicted turns are set to $\min\{2R_g - d,\ 20\ \text{Å}\}$.
The upper bounds $u_{t_i,t_{i+1}}$ are set to $8R_g/3$, the estimated
diameter of the molecule. Hence, from Eq. (A2), we get
\[
p_{t_i,t_{i+1}} = 0.75
\]
where $t_i$ is the primary position number of the $i$th
predicted turn residue.
For the hydrophobic core constraints, the lower
bounds $l_{ij}$ are set to the bump distance threshold of
4.5 Å and the upper bounds $u_{ij}$ are set to $d$ as in Eq. (A4),
depending on the residue separation between residues $i$
and $j$. Also, from Eq. (A4), $p_{ij} = 0.96$ for $i$ and $j$ being
core residues (satisfying Eq. (A3)).
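As an illustration only, the turn and hydrophobic core constraints of Sections A.2.1 to A.2.3 can be sketched in Python as below. The function names and the tuple layout (i, j, lower, upper, probability) are hypothetical and are not taken from the PROPAINOR implementation. The sketch assumes that Eq. (A2) reads as the lower bound min{2Rg - d, 20 Å}, that condition (A3) must hold for every predicted turn, and that Rg, d, the predicted turns and the set of hydrophobic positions are supplied by the caller.

```python
from typing import List, Set, Tuple

BUMP = 4.5  # bump distance threshold in angstroms, as stated in the text


def turn_bounds(rg: float, d: float) -> Tuple[float, float, float]:
    """(lower, upper, probability) for adjacent predicted turns, after Eq. (A2)."""
    lower = min(2.0 * rg - d, 20.0)
    upper = 8.0 * rg / 3.0  # estimated diameter of the molecule
    return lower, upper, 0.75


def is_core(i: int, hydrophobic: Set[int], turns: List[int]) -> bool:
    """Eq. (A3), assumed to apply to all turns: hydrophobic and >= 3 away."""
    return i in hydrophobic and all(abs(t - i) >= 3 for t in turns)


def core_upper_bound(i: int, j: int, rg: float) -> float:
    """Eq. (A4): upper bound d as a function of the sequence separation."""
    sep = abs(j - i)
    if sep < 20:
        return 1.5 * rg
    if sep < 25:
        return 5.0 * rg / 3.0
    return 2.0 * rg


def core_constraints(hydrophobic: Set[int], turns: List[int],
                     rg: float) -> List[Tuple[int, int, float, float, float]]:
    """One (i, j, l_ij, u_ij, p_ij) tuple per pair of predicted core residues."""
    core = sorted(i for i in hydrophobic if is_core(i, hydrophobic, turns))
    return [(i, j, BUMP, core_upper_bound(i, j, rg), 0.96)
            for a, i in enumerate(core) for j in core[a + 1:]]
```

For example, with hypothetical inputs `core_constraints({20, 30, 50}, [10, 29], 12.0)`, residue 30 is excluded from the core because it is adjacent to the predicted turn at position 29, and a single constraint between residues 20 and 50 remains.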
A.3. b-Strand packing heuristics
From the training sample, the upper bound for the
distances between the Cα atoms of residues in two adjacent β-strands separated by a β-turn was estimated to be 12 Å.
Similarly, the upper bound for the inter-Cα distances
between residues in β-strands separated by two β-turns
was estimated at 16 Å. Hence we have the following
heuristic.
\[
l_{ij} = 4.5\ \text{Å}, \quad u_{ij} = 12\ \text{Å} \tag{A5}
\]
if residue $i$ and residue $j$ are corresponding residues in
adjacent β-strands separated by a β-turn (Fig. 4(a)).
Also,
\[
l_{ij} = 4.5\ \text{Å}, \quad u_{ij} = 16\ \text{Å} \tag{A6}
\]
if residue $i$ and residue $j$ are corresponding residues in
adjacent β-strands separated by two β-turns (Fig. 4(b)).
In the training sample, TRAIN-93, it was found that
73% of the residues in β-strands separated by a β-turn
satisfied Eqs. (A5) and (A6). We therefore set $p_{ij} = 0.73$
for residues $i$ and $j$ predicted to be in the β-strand class
(by NDA) and separated by a (predicted) β-turn.

Fig. 4. Packing of β-strands into β-sheets.
A.4. Estimation of medium and long-range contacts
We define two residues to be in the contact class if
their Cα positions are within 10 Å of each other and in
the non-contact class otherwise. To overcome the vast
difference in the contact and non-contact class sizes
(Fariselli and Casadio, 1999), we have formulated a
balanced consensus approach for the prediction of
residue contacts in proteins using different training
data selected at random from TRAIN-93. In doing so,
we consider only residue pairs $(i, j)$ such that
\[
\{\, |j - i| \ge 10 \ \text{and}\ |j - i| \le 30, \quad i = 5, 8, \ldots; \quad j = 5, 8, \ldots \,\}
\]
because it is seen that contacts cluster predominantly in
the short and medium-range residue separation (Fariselli and Casadio, 1999).
Contacts for the validation data are predicted using a
sixfold cross-validation approach using six different sets
of training data. Each training data set contains all the
contact data from TRAIN-93 and a different set of non-contact data selected at random from the set of non-contact residue pairs in TRAIN-93. The ratio of the
number of contacts to the number of non-contacts is
kept approximately the same for all the six training
datasets. The classification is done using all the sequence
parameters in $\mathbf{P}$ and the Normal kernel function.
For each protein, the initial set of predicted contacts
is further filtered of false-positives to get a new set $S_2$ of
predicted contacts using majority rule with respect to the
ordered average posterior probabilities of a contact. The
following set of rules is then applied to further prune
$S_2$ of false-positives.
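A minimal Python sketch of the balanced consensus set-up described above is given here for illustration. All names are hypothetical. The contact to non-contact ratio (`ratio`), the posterior cut-off of 0.5 and the exact form of the majority rule are assumptions, since the text does not specify them; the NDA classifier itself is not reproduced, so `posteriors` stands for the six posterior probabilities it would return for each residue pair.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]


def candidate_pairs(length: int, start: int = 5, step: int = 3,
                    min_sep: int = 10, max_sep: int = 30) -> List[Pair]:
    """Pairs (i, j), i < j, on the grid 5, 8, ... with 10 <= j - i <= 30."""
    grid = range(start, length + 1, step)
    return [(i, j) for i in grid for j in grid if min_sep <= j - i <= max_sep]


def balanced_training_sets(contacts: Sequence, non_contacts: Sequence,
                           n_sets: int = 6, ratio: float = 1.0,
                           seed: int = 0) -> List[list]:
    """Each set holds all contacts plus a fresh random sample of non-contacts."""
    rng = random.Random(seed)
    n_neg = min(len(non_contacts), round(ratio * len(contacts)))
    return [list(contacts) + rng.sample(list(non_contacts), n_neg)
            for _ in range(n_sets)]


def consensus_contacts(posteriors: Dict[Pair, Sequence[float]],
                       threshold: float = 0.5) -> List[Pair]:
    """Keep pairs called a contact by a majority of the classifiers,
    ordered by decreasing average posterior probability (assumed rule)."""
    kept = []
    for pair, probs in posteriors.items():
        votes = sum(p > threshold for p in probs)
        if 2 * votes > len(probs):
            kept.append((sum(probs) / len(probs), pair))
    return [pair for _, pair in sorted(kept, reverse=True)]
```

Sampling the non-contacts afresh for each of the six sets keeps every training set balanced while still exposing the classifiers to different parts of the much larger non-contact class.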
A.4.1. Prune heuristics
(i) No contacts are allowed between residues which are within ±3
residues of adjacent predicted turn residues. For example, if residue 10 and residue 25 are predicted as turns,
and no other residue between residues 10 and 25 is
predicted to be a turn residue (Fig. 5(a)), then any
predicted contacts between residues (9, 24), (9, 27),
(10, 25) and so on are removed from $S_2$.
(ii) No contacts are allowed between residues if they are within ±1
residue of any predicted turn residue. For example, if
residue 10 and residue 42 are predicted as turns, then
any predicted contacts between (9, 43), (10, 41) and so on
are removed from $S_2$ (Fig. 5(b)).
(iii) No contacts are allowed between residues if they are not
separated by at least one turn. For example, if residue 11
and residue 21 are predicted to be in contact and
no residue between residues 11 and 21 is predicted as a turn,
then the predicted contact between residues 11
and 21 is deleted (Fig. 5(c)).
Only the subset $S_3$ of contacts obtained from $S_2$ after
applying the above rules is considered in the 3D-structure
computations.

Fig. 5. Illustration of the three cases of Prune heuristics. The residues marked by '*' are the predicted turn residues.
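The three Prune rules can be sketched as a single filter, shown below as a hedged illustration rather than the authors' code. The function names are hypothetical. Rule (ii) is read here as "each residue of the pair lies within one position of a different predicted turn", which is an interpretation chosen to match the worked example with turns at residues 10 and 42.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[int, int]


def near(res: int, turn: int, width: int) -> bool:
    """True if residue `res` lies within +/- `width` positions of `turn`."""
    return abs(res - turn) <= width


def prune_contacts(contacts: Iterable[Pair], turns: Iterable[int]) -> List[Pair]:
    """Return the contacts that survive rules (i) to (iii)."""
    t = sorted(turns)
    adjacent = list(zip(t, t[1:]))  # consecutive predicted turns
    kept = []
    for i, j in contacts:
        i, j = min(i, j), max(i, j)
        # (i) residues within +/-3 of a pair of adjacent predicted turns
        rule1 = any(near(i, a, 3) and near(j, b, 3) for a, b in adjacent)
        # (ii) residues within +/-1 of two different predicted turns
        rule2 = any(near(i, a, 1) and near(j, b, 1)
                    for a in t for b in t if a != b)
        # (iii) no predicted turn lies strictly between the two residues
        rule3 = not any(i < a < j for a in t)
        if not (rule1 or rule2 or rule3):
            kept.append((i, j))
    return kept
```

With predicted turns at residues 10, 25 and 42, the pairs used as examples in the text, such as (9, 24), (9, 43) and (11, 21), are removed, while a pair such as (5, 20), which straddles the turn at residue 10 without lying close to two turns, is retained.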
A.4.2. Probability assignment for long-range contacts
For $(i, j) \in S_3$ we set $l_{ij} = 4.5$ Å, the bump distance
threshold. Though the upper bound in the contact
definition was set to 10 Å, to allow for the larger
randomness in contact predictions we set $u_{ij} = 12$ Å.
The probability $p_{ij}$ is calculated as $\sum_{k=1}^{6} p_{ij}^{(k)}/6$. Here $p_{ij}^{(k)}$
represents the posterior probability of contact prediction
obtained using NDA on the $k$th training sample, $k = 1, 2, \ldots, 6$, under the best experimental design.
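As a small worked illustration of this assignment, the following Python sketch builds the distance-interval constraint for one predicted contact. The function name and the tuple layout are hypothetical; only the bounds (4.5 Å and 12 Å) and the averaging over the six posterior probabilities come from the text.

```python
from statistics import mean
from typing import Sequence, Tuple

BUMP = 4.5            # lower bound (angstroms), the bump distance threshold
CONTACT_UPPER = 12.0  # relaxed upper bound (angstroms) used instead of 10


def contact_constraint(pair: Tuple[int, int], posteriors: Sequence[float]
                       ) -> Tuple[int, int, float, float, float]:
    """Return (i, j, l_ij, u_ij, p_ij), with p_ij the mean of the posteriors."""
    if len(posteriors) != 6:
        raise ValueError("expected one posterior per training sample (six)")
    i, j = pair
    return i, j, BUMP, CONTACT_UPPER, mean(posteriors)
```

For instance, `contact_constraint((5, 20), [0.5, 0.7, 0.9, 0.6, 0.8, 0.7])` yields bounds of 4.5 Å and 12 Å with a probability of about 0.7.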
References
Andrzej, K., Skolnick, J., 1997. Secondary structure of polypeptide
chains - interplay between short range and burial interactions. J.
Chem. Phys. 107, 953–964.
Aszodi, A., Gradwell, M.J., Taylor, W.R., 1995. Global fold determination from a small number of distance restraints. J. Mol. Biol.
251, 308–326.
Atreya, S., Sahu, C., Chary, K.V.R., Govil, G., 2000. A tracked
approach for automated NMR assignments in proteins (TATAPRO). J. Biomol. NMR 17, 125–136.
Chou, P.Y., Fasman, G.D., 1974. Prediction of protein conformation.
Biochemistry 13, 222–245.
Cohen, F.E., Sternberg, M.J.E., 1980. On the prediction of protein
structure: the significance of the root mean square deviation.
J. Mol. Biol. 138, 321–333.
Fariselli, P., Casadio, R., 1999. A neural network based predictor of
residue contacts in proteins. Protein Eng. 12, 15–21.
Gilbert, J.C., Nocedal, J., 1992. Global convergence properties of
conjugate gradient methods. SIAM J. Optim. 2, 21–42.
Huang, E.S., Samudrala, R., Ponder, J.W., 1999. Ab-initio fold
prediction of small helical proteins using distance geometry and
knowledge-based scoring functions. J. Mol. Biol. 290, 267–281.
Joshi, R.R., Jyothi, S., 2002. Ab-initio structure of human seminal
plasma prostatic inhibin gives significant insight into its biological
functions. J. Mol. Model. 8, 50–57.
Jyothi, S., Joshi, R.R., 2001. Protein structure determination by
nonparametric regression and knowledge-based constraints. Comput. Chem. 25, 283–299.
Jyothi, S., Joshi, R.R., 2002. Ab-initio computation of the 3D-structures of proteins by nonparametric statistical methods -
medium and short range distance estimates (communicated).
Lund, O., Frimand, K., Gorodkin, J., Bohr, H., Bohr, J., Hansen, J., Brunak, S., 1997. Protein distance constraints predicted by neural networks
and probability density functions. Protein Eng. 10, 1241–1248.
More, J.J., Zhijun, Wu., 1999. Distance geometry optimization for
protein structures. J. Global Optim. 15, 219–234.
Nilges, M., Clore, G.M., Gronenborn, A.M., 1988. X-PLOR - a hybrid
distance geometry dynamical simulated annealing calculation
strategy. FEBS Lett. 289, 317–324.
Olmea, O., Valencia, A., 1997. Improving contact predictions by the
combination of correlated mutations and other sources of sequence
information. Fold. Des. 2, 25–32.
Ortiz, A.R., Kolinski, A., Skolnick, J., 1998. Fold assembly of small
proteins using Monte Carlo simulations driven by restraints derived
from multiple sequence alignments. J. Mol. Biol. 277, 419–448.
Osguthorpe, D.J., 1999. Improved ab-initio predictions with a simplified
flexible geometry model. Proteins Struct. Funct. Genet. S3, 186–193.
Pillardy, J., Czaplewski, C., Liwo, A. et al., 2001. Recent improvements in
the prediction of protein structure by global optimization of a
potential energy function. Proc. Natl. Acad. Sci. USA 98, 2329–2333.
SAS/STAT, 1995. Software usage and reference, version 6, vols. 1 and
2, 1st ed. SAS Institute Inc, NC.
Simons, K.T., Ruczinski, I., Kooperberg, C., Fox, B.A., Bystroff, C.,
Baker, D., 1998. Improved recognition of native-like protein
structures using a combination of sequence-dependent and
sequence-independent features of proteins. Proteins Struct.
Funct. Genet. 34, 82–95.
Simons, K.T., Strauss, C., Baker, D., 2001. Prospects for ab-initio
protein structural genomics. J. Mol. Biol. 306, 1191–1199.
Sippl, M., 1990. Calculation of conformational ensembles from
potentials of mean force. J. Mol. Biol. 213, 859–883.
Wiest, J.D., Levy, F.K., 1977. A Management Guide to PERT/CPM,
2nd ed. Prentice Hall, Englewood Cliffs, NJ.
Xia, Y., Huang, E.S., Levitt, M., Samudrala, R., 2000. Ab-initio
construction of protein tertiary structures using a hierarchical
approach. J. Mol. Biol. 300, 171–185.