BIOINFORMATICS APPLICATIONS NOTE
Vol. 17 no. 2 2001
Pages 202–203
RCNPRED: prediction of the residue
co-ordination numbers in proteins
Piero Fariselli and Rita Casadio ∗
CIRB Biocomputing Unit, Department of Biology, University of Bologna,
via Irnerio 42, 40126 Bologna, Italy
Received on July 20, 2000; revised on September 19, 2000; accepted on October 13, 2000
ABSTRACT
Summary: The RCNPRED server implements a neural
network-based method to predict the co-ordination numbers of residues starting from the protein sequence. Using
evolutionary information as input, RCNPRED predicts the
residue states of the proteins in the database with 69%
accuracy and scores 12 percentage points higher than a
simple statistical method. Moreover, the server implements
a neural network to predict the relative solvent accessibility
of each residue. A protein sequence can be directly submitted to RCNPRED: residue co-ordination numbers and
solvent accessibility for each chain are returned via e-mail.
Availability: Freely available to non-commercial users at
http://prion.biocomp.unibo.it/rcnpred.html
Contact: [email protected];
[email protected]
In the post-genomic era, efficient automatic methods for
prediction of protein features are becoming increasingly
important to cope with the amount of data arising from
sequencing projects. We tackled the problem of predicting
residue co-ordination numbers by developing a neural
network-based method. Correct predictions of residue co-ordination numbers in a sequence are particularly relevant
in helping to find the correct protein fold. Methods
that predict contacts between residue pairs can benefit
from imposing constraints on the maximum number of
contacts that each residue can make. Tools that address
the problem of predicting contacts among protein residues
have been developed with some success (Thomas
et al., 1996; Olmea and Valencia, 1997; Fariselli and
Casadio, 1999). The correct assignment of the positions of
residue contacts in proteins has proven extremely effective
in determining the three-dimensional structure of a given
protein, as was recently demonstrated in the CASP3
competition (Ortiz et al., 1999). The present method was
trained to discriminate between two different states of
residue contact numbers (or co-ordination numbers). For
each residue type, the contact number in a given position
of the protein sequence can be greater or lower than the
average value of the contact distribution of the residue
type in the database. The residue co-ordination number
is computed by centring a spherical cut-off on each
residue and counting the number of residues that fall
inside the defined volume (Flöckner et al., 1995).
∗ To whom correspondence should be addressed.
The contact distributions computed from the database for
each of the 20 residue types have different
average values. The threshold for discriminating whether
a contact number is greater or lower than the average value
therefore depends on the residue type. This procedure
ensures a direct comparison between contact numbers of
different residue types, irrespective of their size and steric
hindrance.
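The two-state labelling described above can be illustrated with a minimal sketch. It is not the authors' code: the cut-off radius, the use of one representative point per residue, and the treatment of a contact number equal to the average as "lower" are all assumptions made here, since the text does not specify them. The function names are hypothetical.

```python
import math
from collections import defaultdict


def contact_numbers(coords, radius):
    """Count, for each residue, the other residues lying within `radius`.

    coords: one (x, y, z) point per residue (which atom represents a
    residue is an assumption; the text does not say).
    """
    counts = []
    for i, a in enumerate(coords):
        n = 0
        for j, b in enumerate(coords):
            if i != j and math.dist(a, b) <= radius:
                n += 1
        counts.append(n)
    return counts


def residue_type_means(sequences, counts_per_chain):
    """Average contact number of each residue type over a set of chains."""
    totals = defaultdict(int)
    seen = defaultdict(int)
    for seq, counts in zip(sequences, counts_per_chain):
        for aa, n in zip(seq, counts):
            totals[aa] += n
            seen[aa] += 1
    return {aa: totals[aa] / seen[aa] for aa in totals}


def two_state_labels(seq, counts, means):
    """1 if the contact number exceeds the residue-type mean, else 0.

    Ties are assigned to the lower state (an assumption).
    """
    return [1 if n > means[aa] else 0 for aa, n in zip(seq, counts)]
```

Because each residue is compared with the mean of its own type, a glycine and a tryptophan with the same raw contact number may receive different labels.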
We have previously shown that the best performing
method to predict residue co-ordination numbers is a
neural network trained with evolutionary information in
the form of sequence profile (Fariselli and Casadio, 2000).
The neural network implemented in RCNPRED operates
with an input window comprising 15 residues. This choice
was made since the prediction accuracy is unaffected by
changing the window dimension from 7 to 17 residues.
The number of hidden neurons is set to 8 for similar
reasons (the explored range was from 2 to 32 nodes).
A baseline predictor (Richardson and Barlow, 1999) was
used as the simplest possible predictor to score the neural
network accuracy. This comparison showed that the neural
network trained with evolutionary information as input
performs 12 percentage points better than the baseline one
(Fariselli and Casadio, 2000).
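As an illustration of the 15-residue input window mentioned above, the following sketch builds one input vector per residue from a sequence profile. The zero-padding at the chain ends and the flat concatenation of profile rows are assumptions, not details given in the text; with 20 values per position, a 15-residue window would give 300 inputs. The function name is hypothetical.

```python
def window_inputs(profile, window=15):
    """Build one flat input vector per residue from a sequence profile.

    profile: list of per-position rows (for example 20 amino-acid
    frequencies each). Positions beyond either chain end are filled
    with zeros (an assumption).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centred on a residue")
    half = window // 2
    width = len(profile[0])
    blank = [0.0] * width
    inputs = []
    for centre in range(len(profile)):
        vec = []
        for pos in range(centre - half, centre + half + 1):
            row = profile[pos] if 0 <= pos < len(profile) else blank
            vec.extend(row)
        inputs.append(vec)
    return inputs
```

Each vector would then be fed to the network, which predicts the state of the residue at the centre of the window.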
Although a strict connection between accessibility and
contact numbers is commonly accepted, for each residue
the surface accessibility is distributed differently from the
number of residue contacts in the database (Fariselli and
Casadio, 2000). Therefore RCNPRED also implements
a neural network predicting whether a given residue is
exposed or not to the solvent. Evolutionary information is
used as input, similarly to previously described methods
(Rost and Sander, 1994; Cuff and Barton, 2000). The
two-state prediction of the solvent accessibility reaches an
accuracy of 76% on a cross-validated set comprising 651
proteins endowed with a low sequence identity (<25%)
(Fariselli and Casadio, 2000).
© Oxford University Press 2001
S    = Query sequence (residue 1-letter code)
CN   = Range of predicted residue co-ordination number
Prob = Probability of the residue contact assignment
RACC = Relative accessibility (Exposed (E) >= 16%; Buried (B) < 16%)
P(E)/P(B) = Network outputs relative to the exposed (E)
            or buried (B) classes, respectively.
____________________________________________________________
S  CN                                Prob  RACC  P(E)   P(B)
____________________________________________________________
G  6<= CN <= 12 with probability of  0.98  E     0.995  0.005
K  5<= CN <= 10 with probability of  0.95  E     0.999  0.001
K  5<= CN <= 10 with probability of  0.96  E     0.999  0.001
K  5<= CN <= 10 with probability of  0.93  E     0.999  0.001
D  5<= CN <= 11 with probability of  0.90  E     0.994  0.006
R  5<= CN <= 11 with probability of  0.98  E     0.987  0.013
K  5<= CN <= 10 with probability of  0.91  E     0.950  0.050
G  0<= CN <= 6 with probability of   0.94  E     0.925  0.075
E  0<= CN <= 5 with probability of   0.84  E     0.847  0.153
D  0<= CN <= 5 with probability of   0.79  E     0.950  0.050
A  0<= CN <= 6 with probability of   0.51  B     0.252  0.748
R  5<= CN <= 11 with probability of  0.75  E     0.825  0.175
Y  6<= CN <= 11 with probability of  0.73  E     0.749  0.251
.................................
Fig. 1. The output of RCNPRED. S is the query sequence given
by the user and pasted on the web interface. CN (the co-ordination
number) gives the range of the predicted number of contacts for each
residue. Prob is the reliability of the prediction (its range is [0, 1]).
RACC is the predicted relative residue accessibility (Buried (B) or
Exposed (E)). The classification depends on a relative accessibility
value lower or higher than 16%. Associated probability values (P(B)
and P(E)) are also computed.
The architecture of the RCNPRED server is extremely
simple. It takes a single sequence from the web page
and it uses the PSI-BLAST program to search against
SWISSPROT for similar sequences. Subsequently, a
script directly uses the PSI-BLAST output for building
a sequence profile suited for the net input. After this,
two types of predictions (residue co-ordination number
and relative solvent accessibility) are computed and both
results are mailed back to the user in ASCII format.
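The profile-building step of this pipeline can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the server's script derives the profile, so the sketch assumes that the PSI-BLAST hits have already been parsed into equal-length aligned sequences and that the profile holds plain relative frequencies of the 20 amino acids. The names AMINO_ACIDS and sequence_profile are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_profile(aligned):
    """Per-column amino-acid frequencies for equal-length aligned sequences.

    Gaps and non-standard symbols are ignored; a column containing no
    standard residue yields an all-zero row.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for col in range(length):
        counts = dict.fromkeys(AMINO_ACIDS, 0)
        for seq in aligned:
            aa = seq[col].upper()
            if aa in counts:
                counts[aa] += 1
        total = sum(counts.values())
        profile.append(
            [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]
        )
    return profile
```

For example, sequence_profile(["GKK", "GRK", "G-K", "AKK"]) returns three rows of 20 frequencies, one row per alignment column.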
A typical server output is shown in Figure 1, where for
each residue of the query sequence, different predicted
features are reported in a column format. The first column
lists the predicted range of the residue co-ordination
numbers (CN), giving the minimum and the maximum
predicted values. The second column represents the level
of confidence of the prediction (Prob), evaluated as the
absolute value of the difference between the two output
values of the network. This is a real number ranging from
0 (the lowest reliability) to 1 (the highest reliability). The
last column is the relative accessibility predicted for each
residue by our system. Two labels highlight buried (B)
or exposed (E) residues and their probability values (P(B)
and P(E)). The decision threshold for this prediction is set
equal to 16% of the relative solvent accessibility (Rost and
Sander, 1994).
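The reliability measure and the 16% threshold described above can be written down in a few lines. This is a sketch, not the server code: the function names are hypothetical, and choosing the class with the larger network output (with ties going to the first class) is an assumption, although it agrees with the rows of Figure 1.

```python
def reliability(out_a, out_b):
    """Confidence of a two-output prediction: |out_a - out_b|, in [0, 1]
    when both outputs lie in [0, 1]."""
    return abs(out_a - out_b)


def pick_state(out_a, out_b, labels):
    """Return the label of the larger output together with its reliability.

    labels: pair of names for the two output classes. Ties go to the
    first label (an assumption).
    """
    label = labels[0] if out_a >= out_b else labels[1]
    return label, reliability(out_a, out_b)


def accessibility_class(relative_accessibility, threshold=16.0):
    """'E' (exposed) if the relative accessibility in percent is at or
    above the threshold, otherwise 'B' (buried)."""
    return "E" if relative_accessibility >= threshold else "B"
```

For instance, applying pick_state to the accessibility outputs 0.252 and 0.748 with labels ("E", "B") selects the buried class, as in the alanine row of Figure 1. The Prob column of the figure refers to the co-ordination number network, whose two raw outputs are not listed in the server output.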
In summary, RCNPRED is a predictor for discriminating whether a given residue, depending on its sequence context,
has a number of contacts greater or lower than its average
value in the database. This type of classification is complementary to predicting residue solvent accessibility and
can be used to improve protein structure prediction.
ACKNOWLEDGEMENTS
Contract grant sponsors were Ministero della Università e
della Ricerca Scientifica e Tecnologica (MURST) and the
Italian Consiglio Nazionale delle Ricerche (CNR).
REFERENCES
Cuff,J.A. and Barton,G.J. (2000) Application of multiple sequence
alignment profiles to improve protein secondary structure prediction. Proteins, 40, 502–511.
Fariselli,P. and Casadio,R. (1999) Neural network based predictor
of residue contacts in proteins. Protein Eng., 12, 15–21.
Fariselli,P. and Casadio,R. (2000) Prediction of the number of
residue contacts in proteins. In Proceedings of the Eighth
International Conference on Intelligent Systems for Molecular
Biology. AAAI Press, Menlo Park, CA, pp. 146–151.
Flöckner,H., Braxenthaler,M., Lackner,P., Jaitz,M., Ortner,M. and
Sippl,M.J. (1995) Progress in fold recognition. Proteins, 23, 376–386.
Olmea,O. and Valencia,A. (1997) Improving contact predictions by
the combination of correlated mutations and other sources of
sequence information. Fold Des., 2, S25–32.
Ortiz,A.R., Kolinski,A., Rotkiewicz,P., Ilkowski,B. and Skolnick,J.
(1999) Ab initio folding of proteins using restraints derived from
evolutionary information. Proteins, 3 (Suppl.), 177–185.
Richardson,C.J. and Barlow,D.J. (1999) The bottom line for prediction of residue solvent accessibility. Protein Eng., 12, 1051–1054.
Rost,B. and Sander,C. (1994) Conservation and prediction of
solvent accessibility in protein families. Proteins, 20, 216–226.
Thomas,D.J., Casari,G. and Sander,C. (1996) The prediction of
protein contacts from multiple sequence alignments. Protein
Eng., 9, 941–948.