Protein structure prediction in genomics

David T. Jones
is currently Professor of
Bioinformatics at the Institute
for Cancer Genetics and
Pharmacogenomics at Brunel
University. As a Wellcome
Trust Research Fellow at
University College London and
later as a Royal Society
University Research Fellow at
the University of Warwick,
Prof. Jones has published in
many areas of bioinformatics,
but mainly in the area of
protein structure prediction
and analysis. He is also one of
the Founders of Inpharmatica, a
bioinformatics driven drug
discovery company located in
Central London.
Keywords: protein structure
prediction, protein folding,
functional genomics
David T. Jones,
Department of Biological Sciences,
Brunel University,
Uxbridge
UB8 3PH, UK
E-mail: [email protected]
Date received (in revised form): 16th March 2001
Abstract
As the number of completely sequenced genomes rapidly increases, including now the
complete Human Genome sequence, the post-genomic problems of genome-scale protein
structure determination and the issue of gene function identification become ever more
pressing. In fact, these problems can be seen as interrelated in that experimentally determining
or predicting the structure of proteins encoded by genes of interest is one possible means
to glean subtle hints as to the functions of these genes. The applicability of this approach to
gene characterisation is reviewed, along with a brief survey of the reliability of large-scale
protein structure prediction methods and the prospects for the development of new
prediction methods.
INTRODUCTION
The release of the complete human
genome sequence in early 2001 was a
milestone event that marked the transition
of modern biology into a new 'post-genome' era. In addition to the human
genome, sequencing efforts for simpler
organisms are also continuing to generate
increasing volumes of valuable data, and
at the time of writing, some 40 or so
complete microbial genome sequences are
now available, along with the genomes
for the nematode worm, the fruitfly and
thale cress.
As we move into the post-sequencing
phase of many genome projects, attention
is becoming increasingly focused on the
correct identification of gene products.
Assigning a possible function to a gene is
an important first step to characterising its
role in the various cellular processes, and
without this information, it is impossible
to realise the true value of genome
sequencing. Of course, straightforward
sequence comparison algorithms are by
far the most widely used techniques for
making an initial identification of a
particular gene product. By identifying
homology between a new gene product
and a gene of known function some
inferences can be made as to the function
of the new gene. How reliably the
function can be extrapolated to the new
gene depends on a number of factors, but
the principal factor is of course the degree
of sequence similarity observed.
In recent years, sequence comparison
algorithms such as PSI-BLAST1 or
techniques based on hidden Markov
models2 have 'pushed the envelope' as far
as detecting homologous relationships
goes. Of course, as more and more
remote relationships are being considered,
it becomes less clear as to how reliably
one can map the function of one gene to
another.3,4 Nevertheless, sensitive
sequence comparison algorithms remain
the most vital technology that we have for
rapidly characterising new gene products.
Despite the power of current-day
sequence comparison algorithms, there
are still open reading frames (ORFs) that
either match no existing entries in the
sequence data banks, or that match
proteins that are also uncharacterised
'unknowns'. These sequence orphans or
'ORFans'5 are something of a puzzle.
Clearly, at present, this class of ORF
represents an uncertain, but significant,
fraction of the larger completely
sequenced genomes. No matter what the
true number of sequence orphans happens
to be, however, the fact is that
there remains a 'hard core' of small
sequence families across a wide variety of
genomes for which no functional
information can be derived by homology-based methods.

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 2. NO 2. 111–125. MAY 2001

Direct experimental function
determination is perhaps the ideal
approach to characterising these orphan
proteins. Gene knockout experiments and
expression array techniques are just two of
many experimental techniques now being
widely applied to function determination.
New algorithms for predicting gene
function have also been described.6,7 Two
basic ideas are represented in these
methods. Firstly, proteins of similar
function may 'co-evolve'.6 In other
words, groups of proteins that are found
in some organisms but not others may
share some common function. Proteins
that might be found, for example, in
aerobic organisms but never in anaerobic
organisms may well have a role in the
utilisation of oxygen. Of course, because
functional classes inferred in this way are
so broad, the value of such
assignments is rather
uncertain. Nonetheless, this kind of
information might well produce some
unique hints for at least a few sequence
orphans. The second idea, which has been
proposed independently by two groups,6,7
has been called the 'Rosetta sequence'
algorithm by Eisenberg and colleagues.6
In this case, possible protein–protein
interactions are predicted by identifying
separate protein domains which are
sometimes observed to be fused together
in some species of organism. Of course,
there are a number of well-known
exceptions to this rule (many protein
modules for example), but despite this,
Rosetta sequences may well offer some
tantalising hints as to the network of
interactions that are present in living cells.
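The co-evolution idea can be illustrated with a minimal Python sketch, assuming each protein is reduced to a presence/absence vector over a fixed set of genomes. The ORF names and the toy profiles are hypothetical, and matching profiles by Hamming distance is a simplification chosen for illustration rather than the published procedure.

```python
from itertools import combinations


def profile_distance(a, b):
    """Hamming distance between two presence/absence profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x != y for x, y in zip(a, b))


def co_occurring_pairs(profiles, max_distance=0):
    """Return protein pairs whose profiles differ in at most max_distance genomes."""
    pairs = []
    for (name_a, prof_a), (name_b, prof_b) in combinations(sorted(profiles.items()), 2):
        if profile_distance(prof_a, prof_b) <= max_distance:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical toy data: each column is one genome, 1 = a homologue is present.
profiles = {
    "orfA": (1, 1, 0, 0),
    "orfB": (1, 1, 0, 0),
    "orfC": (0, 1, 1, 1),
}

print(co_occurring_pairs(profiles))
```

Pairs reported by such a comparison are only candidates for a shared function; as noted above, the functional classes implied are very broad.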
WHAT CAN STRUCTURE
TELL US ABOUT PROTEIN
FUNCTION?
It is now common knowledge that the
tertiary structure of a protein family is
much more highly conserved than the
sequences of the proteins within the
family. It is also apparent that it is the
tertiary structure of a protein that creates
the chemical microenvironment, which,
in turn, produces its biochemical activity.
Given these two observations, it is not
surprising, therefore, that the 3D structure
of a protein can provide valuable
information as to its function and
mechanism. One result of this belief is the
strong impetus to solve, experimentally,
the structures of every protein encoded by
a bacterial genome. Some such structural
genomics initiatives are already
underway8,9 but as yet none of these
projects has generated large numbers of
new structures as they remain in the pilot
stage of development.
Despite great improvements to the
basic methods of X-ray crystallography,10
particularly the use of synchrotron
radiation sources,11 the rate-limiting step
in structure determination still remains
the expression, purification and
crystallisation of the target proteins.
Nuclear magnetic resonance (NMR)
techniques offer some scope for avoiding
some of these difficulties, but are still
limited with respect to the size of protein
that can be tackled on a routine basis.
Despite these technical improvements
in experimental structure determination,
none of the ongoing structural genomics
projects is based on the idea of solving the
structure for every single gene product.
Instead, it is expected that theoretical
methods will help 'fill the gaps' in 'fold
space'. Given that it has been estimated
that the probability of a novel gene
product having an entirely new fold is less
than 30 per cent,12,13 algorithms for
recognising known folds are of course
expected to be a powerful means for
obtaining structural information about a
new gene. Beyond fold recognition there
also lies the hope that algorithms will
become available that might calculate an
approximate fold for a given protein
sequence without reference to a template
structure.
The question of prediction will be
covered later, but for the time being let us
ignore this problem. Suppose an algorithm
did exist for accurately predicting protein
tertiary structure, or that we had solved
the structures for all of the proteins in a
genome of interest. The question still
remains as to how useful protein structure
is for elucidating the function of a
protein.
At the most basic level there are
apparent trends between the broad
functional class of a protein and its
structural classification. In a recent survey
of the structures in the Protein Data Bank
(PDB),14 for example, Thornton and
colleagues show that the majority of
enzyme structures are found to be in the
αβ fold class,15 as are those of nucleotide-binding domains. Unfortunately, these
observations, while suggesting a
relationship between fold and function,
have little or no obvious predictive value.
Clearly, it would be foolish to say that just
because a protein has an αβ fold, it is
likely to have enzymatic activity, even
though one would be more frequently
right than wrong.
Hegyi and Gerstein16 have looked
more closely at the relationship between
fold classification and enzyme
classification (EC number), where they
used BLAST1 to cross-reference between
the SCOP database and SWISS-PROT.
In terms of fold class biases, their data are
in broad agreement with the observations
made by Thornton et al. and earlier work
by Martin et al.17 However, by extending
their data set by counting not just entries
in PDB, but also the homologues of these
structures in SWISS-PROT, Hegyi and
Gerstein were also able to make some
statements about the statistical relationship
between functional class and protein
topology. They found that the average
number of functions found to be
associated with a particular fold is 1.2 for
both enzymes and non-enzymes, and 1.8
for enzyme-related folds alone.
Furthermore, they found the average
number of folds for a given function to be
3.6 (2.5 for enzymes alone). One
interpretation of this is that, on average at
least, the correct prediction of a protein's
fold might be a very good indicator as to
its function. Unfortunately, this apparent
good news is somewhat marred by the
observed biases in fold distributions. The
superfolds18 (Figure 1) such as the (αβ)8
(TIM; triose phosphate isomerase) barrel
have long been known to be associated
with a very large number of functions.
Hegyi and Gerstein similarly found the
top five 'multifunctional folds' to be the
TIM barrel, the αβ hydrolase fold, the
Rossmann fold, the P-loop containing
NTP hydrolase fold and the ferredoxin-like fold (Table 1).
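The two averages quoted above (functions per fold and folds per function) can be computed from any list of fold/function annotations. The following Python sketch assumes a flat list of (fold, function) pairs; the toy annotations are hypothetical and are not the data of Hegyi and Gerstein.

```python
from collections import defaultdict


def fold_function_averages(annotations):
    """Return (mean functions per fold, mean folds per function).

    annotations is an iterable of (fold, function) pairs.
    """
    functions_by_fold = defaultdict(set)
    folds_by_function = defaultdict(set)
    for fold, function in annotations:
        functions_by_fold[fold].add(function)
        folds_by_function[function].add(fold)
    per_fold = sum(len(s) for s in functions_by_fold.values()) / len(functions_by_fold)
    per_function = sum(len(s) for s in folds_by_function.values()) / len(folds_by_function)
    return per_fold, per_function


# Hypothetical toy annotations, for illustration only.
annotations = [
    ("TIM barrel", "isomerase"),
    ("TIM barrel", "aldolase"),
    ("TIM barrel", "hydrolase"),
    ("Rossmann fold", "dehydrogenase"),
    ("alpha-beta hydrolase fold", "hydrolase"),
    ("globin-like", "oxygen transport"),
]

per_fold, per_function = fold_function_averages(annotations)
print(per_fold, per_function)
```

Even in this toy set a single multifunctional fold raises the mean number of functions per fold, which is the kind of bias discussed in the text.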
From this we can see that assigning the
TIM barrel fold to a particular gene
product will give very little information as
to its function. In all probability, in the
case of the TIM barrel fold, the gene
would encode an enzyme (as almost all
proteins with the TIM barrel fold are
enzymes) but beyond this, very little
functional insight would be gained. To
add to the problem, the superfolds also
account for the bulk of observed
structural similarities. Orengo et al.18
estimated that approximately half of the
observed structural similarities were found
to be between the 10 superfolds.
One positive point to make about these
structural similarities is that whenever a
non-superfold structure can be assigned to
a new gene, based on current observations
it would appear that the functions of the
template protein and the target protein
would be expected to be broadly similar.
As we have seen, under the right
circumstances, assigning a correct fold to a
gene product can provide significant hints
as to its function. This assumes, of course,
that the fold has already been associated
with a known function. Fortunately, the
vast bulk of proteins of known 3D
structure belong to well-characterised
families for which a lot of biochemical
knowledge has been collected. The
various structural genomics initiatives
may, however, start to change this
picture. Perhaps the first clue of things to
come was seen in the recently determined
structure for the Escherichia coli protein,
HDEA, by Yang et al.19 The structure of
this protein was actually solved by
accident as it turned out to have an almost
identical molecular weight to the protein
the crystallographers were trying to
investigate. Despite this, the HDEA
structure (Figure 1) offers a stark lesson to
experimentalists and theoreticians
alike, as it is a protein of entirely unknown
function. Worse still, HDEA is currently a
sequence orphan, and so algorithms such
as evolutionary profiling6 could not be
applied.

Figure 1: The 10 so-called 'superfolds' currently found in the CATH protein structure classification scheme. These folds account for more than 50 per cent of the observed structural similarities between protein domains. Despite the recurrence of these folds, there appears to be no indication of common ancestry between many of the proteins which exhibit these folds

Table 1: The five most functionally diverse protein folds according to Hegyi and Gerstein16

Fold                                    No. of functions
TIM barrel                              16
αβ hydrolase fold                       9
Rossmann fold                           6
P-loop containing NTP hydrolase fold    6
Ferredoxin fold                         6

Other attempts at genomic functional
assignment by means of structure
determination have been
documented.20,21 In the first of these
cases,20 despite a structural resemblance to
chorismate mutase, no similarity was
observed between the active sites and the
crystal structure of the yjgF gene product
from E. coli revealed rather few hints as to
the protein's function. In the second
case,21 the crystal structure of
Methanococcus jannaschii ORF MJ0577
not only clearly indicated the presence of
a bound ATP (suggesting a probable
ATPase or an ATP-mediated molecular
switch) but also incorporated several
structural motifs known to be frequently
associated with ATP binding. Despite the
fact that this is a positive example where
structural studies have revealed functional
information, the fact that part of the
functional characterisation was based on
the presence of a co-crystallised ATP
means that this result is less applicable to
the case of structural prediction and fold
assignment, where information regarding
ligand binding would not be produced.
Of course, in the absence of similarity
at the level of sequence or even structure
to proteins of known function, the
possibility remains that the function of a
protein might be inferred ab initio from an
analysis of the 3D structure alone. Several
ideas have been put forward for trying to
identify particular arrangements of atom
groups which might form an active site in
a given structure.22–25
Both Russell22 and Wallace et al.23 have
proposed methods to detect particular side
chain conformational patterns which
relate to the active site geometry of
enzymes with a similar function, even
with entirely different folds. Both groups
propose that this should permit the
creation of active site templates which
might allow the recognition of the active
site in a protein structure of unknown
function. Wallace et al. have now created
a library of such templates, called
PROCAT.24
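The general notion of an active site template can be sketched in Python as a geometric search: a small set of residue types with reference positions is compared against residues in a structure, and a match is reported when all pairwise distances agree within a tolerance. This is only an illustration of the idea under strong assumptions (one representative point per residue, invented coordinates, hypothetical residue labels); it is not the method of Russell or of Wallace et al.

```python
from itertools import permutations
from math import dist


def match_template(template, residues, tolerance=1.0):
    """Find residues whose pairwise distances reproduce the template geometry.

    template: list of (residue_type, (x, y, z))
    residues: list of (residue_id, residue_type, (x, y, z))
    Returns a tuple of residue ids, or None if nothing matches.
    """
    n = len(template)
    for combo in permutations(residues, n):
        # Residue types must agree position by position.
        if any(res[1] != tmpl[0] for res, tmpl in zip(combo, template)):
            continue
        # Every inter-residue distance must agree within the tolerance.
        if all(
            abs(dist(combo[i][2], combo[j][2]) - dist(template[i][1], template[j][1]))
            <= tolerance
            for i in range(n)
            for j in range(i + 1, n)
        ):
            return tuple(res[0] for res in combo)
    return None


# Hypothetical template and structure; all coordinates are invented.
template = [("SER", (0.0, 0.0, 0.0)), ("HIS", (3.0, 0.0, 0.0)), ("ASP", (3.0, 4.0, 0.0))]
residues = [
    ("S20", "SER", (30.0, 0.0, 0.0)),
    ("S195", "SER", (10.0, 10.0, 10.0)),
    ("H57", "HIS", (13.0, 10.0, 10.0)),
    ("D102", "ASP", (13.0, 14.0, 10.0)),
]

print(match_template(template, residues))
```

Because only internal distances are compared, the search is independent of the overall fold, which is the property that makes templates of this kind attractive for proteins with entirely different folds.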
Fetrow et al.25,26 have also suggested an
algorithm for identifying the function of a
given protein structure based on side
chain conformational patterns. However,
in their case they explicitly apply the
technique to predicted protein structures.
As a proof of concept, Fetrow et al.
generated a 'fuzzy template' for the thiol-disulfide oxidoreductase activity of the
glutaredoxin/thioredoxin protein family,
and used this template to assess models
generated by a fold recognition algorithm
applied to ORFs found in E. coli.
Although the potential of these
template approaches to recognising active
sites and other functional regions is very
clear, there is no evidence so far
that functions can be assigned to
completely novel structures. We will have
to wait for these methods to be tested on
a large number of structures with both
novel folds and unknown functions
before we can properly evaluate their
merits.
EVALUATING METHODS
FOR STRUCTURE
PREDICTION
Given the evident importance of 3D
structure in providing insights into the
function and mechanism of proteins, the
next question relates to the applicability
and reliability of available structure
prediction techniques. Is there a role for
protein structure prediction in structural
genomics? Clearly, a theoretical approach
to accurately modelling the structure of
many proteins would have a great impact
on genomics as a whole. However, if the
use of prediction algorithms is going to be
generally accepted by the biology
community at large, then it is essential
that the reliability of these methods be
assessed in such a way as to convince this
rather sceptical audience. Although
individual authors of automatic prediction
methods do attempt to benchmark their
methods properly and attempt to provide
useful measures of confidence alongside
their predictions, there still remains the
possibility that the published results are
somewhat better than might be expected
in cases where the true structure is not
known. The Fourth Critical Assessment
in Structure Prediction (CASP4)
Experiment was carried out in 2000,
in the same way as the three
previous experiments, and this continues to
allow some indication to be gained as to
the reliability of truly blind predictions
using different approaches. Detailed
results from the experiment will be
published in a special issue of the journal
Proteins, along the same lines as for
CASP3.27 The raw data from the CASP4
evaluation are also available across the
Internet.28
COMPARATIVE
MODELLING
At present, the modelling of unknown
protein structures by homology represents
the most reliable and most widely applied
method for protein structure prediction.
The reliability and simplicity of the
method stems from the fact that it is
limited to predicting the structure of
proteins that are closely related to a
template protein of known structure. The
comparative modelling process can be
divided into five basic steps: alignment of
the target sequence with the sequence of a
protein of known 3D structure; building
of a framework structure based on the
alignment; loop building; addition and
optimisation of side chains; and finally
model refinement.
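Since the method depends on the target being closely related to its template, the first step (alignment) is usually judged by sequence identity. The Python sketch below computes percentage identity over the aligned, ungapped positions of a pairwise alignment. The 30 per cent cut-off used to choose between comparative modelling and fold recognition is an assumption made purely for illustration, as are the example sequences.

```python
def percent_identity(aligned_target, aligned_template):
    """Percentage of identical residues over positions aligned without gaps."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def suggest_method(identity, threshold=30.0):
    """Hypothetical routing rule; the threshold is an illustrative assumption."""
    return "comparative modelling" if identity >= threshold else "fold recognition"


# Hypothetical aligned sequences ('-' marks a gap).
identity = percent_identity("ACDE-G", "ACDFHG")
print(identity, suggest_method(identity))
```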
In recent years there has been a definite
advance in the accuracy of sequence
alignments for target–template pairs
which are only distantly related. Indeed,
some of these pairs would previously have
been considered to be so distantly related
as to be only suitable for fold recognition.
This has come from the common usage of
sensitive sequence profile alignment
methods such as PSI-BLAST1 or one of
the several methods based on hidden
Markov models.
For comparative modelling to be used
routinely for genome annotation, it
should be possible to build good quality
3D models without requiring human
intervention. Given the fact that
progress does seem to have been made
in terms of full automation of
comparative modelling, and producing
accurate sequence-structure alignments,
it is not surprising that comparative
modelling techniques form a central part
of structural genomics initiatives.
Sanchez et al.29 have already
demonstrated that a large fraction of the
yeast genome can be automatically
modelled by homology to known 3D
structures using their program
MODELLER, but so far progress has
been limited to ORFs with relatively
high sequence similarity to the template
protein structures.
FOLD RECOGNITION
In the absence of suitable homologous
template structures with which to build a
model for a given sequence, and the lack
of success that is evident in the ab initio
approaches, fold recognition algorithms
provide another option for constructing
useful tertiary structural models. It was
clear at CASP4 that these algorithms are
now beginning to converge, with many
different groups all heavily relying on
sensitive sequence comparison in addition
to more traditional fold recognition
methods.
A simple, but interesting, view of the
CASP4 fold recognition results can be
obtained by dividing the prediction
targets into different categories.
Considering just the 11 target domains
which can readily be assigned to known
superfamilies, all but one of the folds
were correctly assigned by the best
performing group in this category.
Beyond the targets which were obvious
members of existing superfamilies, there
was very little success. Of the 11 or so
targets that had known folds, but that
were probably only structural analogues,
at best only three or four folds were
recognised even by the better performing
groups. This is perhaps the clearest
evidence from CASP4 that the
recognition of superfamily membership is
a very different problem from the
recognition of actual folds. Of course, to
be able to recognise a new sequence as
being in a particular superfamily is of
great biological value, particularly with
respect to function identification. The
overall progress since CASP1 that is
evident in recognising distant homology
is even more impressive when one
considers that PSI-BLAST1 alone was
unable to assign any of the 11
homologous domains to their correct
superfamily. The fact that a combination
of sequence analysis, 3D structural
analysis (and in most cases some human
insight) can identify 10 out of 11 difficult
superfamily level matches correctly bodes
very well for the continued success of
fold recognition and distant homology
detection methods in structurally
characterising genomic sequences.
Nonetheless, the conclusion we have
to take from this very crude analysis is that
in order to produce reasonably accurate
3D models there must be some detectable
sequence similarity between the target
protein and at least one template protein
of known 3D structure. Despite the fact
that the sample sizes here are low, it
would appear that where there is at least
some detectable sequence similarity, fold
recognition methods based on sequence profiles are presently sufficient to build
useful models. Beyond these cases,
however, fold recognition methods not
reliant on sequence alignment (ie true
threading methods that ignore the
sequence of the template proteins) are
much more limited in their ability to
recognise folds, and in the accuracy of the
models they can produce. Nevertheless,
even these relatively poor models may be
enough to gain some insight into the
function of a new gene sequence. As
discussed earlier, even fold recognition
algorithms which are able to correctly
recognise folds but are entirely incapable
of producing sensible alignments may
offer some advantage in the narrowing down of potential gene functions.
FOLD RECOGNITION
METHODS FOR GENOME
ANALYSIS
Given the potential bene®ts of assigning a
correct structure to a newly discovered
gene product, it is unsurprising that
several groups have applied existing fold
recognition algorithms to genome
analysis. These techniques can be
divided into roughly three classes:
sequence profile methods (eg refs 1, 2,
30), structural (3D–1D) profile methods
(eg refs 31, 32) and threading algorithms
(eg refs 33–35).
The first attempt at assigning folds to
genome sequences made use of a
structural profile method. Fischer and
Eisenberg36 used a development of the
original 3D–1D profile method31 to
assign folds to the ORFs found in
Mycoplasma genitalium, the smallest known
bacterial genome. They found that
approximately 16 per cent of the ORFs
could be assigned to a known fold by
means of straightforward sequence
comparison, and that an additional 6 per cent
could be assigned to a known fold at high
confidence using their fold recognition
method. Of course, as the structure
databases are now much larger, it is very
likely that these fractions would now be
somewhat higher.
Although many different threading
(purely pair potential-based fold
recognition) methods have been
developed, only a single attempt at
applying these methods to genome
analysis has been described.37 Grandori
applied the ProFit method35 to analyse the
ORFs in M. pneumoniae, a slightly larger
genome than M. genitalium. In this work,
to save time, proteins which could be
matched to known structures by
straightforward sequence comparison
were excluded from the analysis along
with proteins longer than 200 residues
(which were assumed to be multidomain
proteins). Of the 124 ORFs remaining,
Grandori was able to recognise folds for
12, giving a recognition rate of 10 per
cent. Interestingly, a number of
disagreements were reported when the
results were compared with those
of Fischer and Eisenberg (by
identifying M. pneumoniae homologues in
M. genitalium). This is perhaps to be expected given
the relatively low overall reliability of
pure fold recognition algorithms, but is
more surprising in that in some cases
both predictions were apparently very
significant.
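The filtering and the resulting recognition rate can be reproduced with a short Python sketch. The figures of 12 recognised folds among 124 remaining ORFs and the 200-residue cut-off come from the text; the record layout and the numbers of excluded ORFs are hypothetical.

```python
def recognition_rate(orfs, max_length=200):
    """Apply the exclusion rules, then return (recognised, considered, rate in per cent)."""
    considered = [
        orf
        for orf in orfs
        if not orf["has_sequence_match"] and orf["length"] <= max_length
    ]
    recognised = sum(1 for orf in considered if orf["fold_recognised"])
    rate = 100.0 * recognised / len(considered) if considered else 0.0
    return recognised, len(considered), rate


# Toy genome: 12 + 112 ORFs pass the filters; the excluded counts are invented.
orfs = (
    [{"length": 150, "has_sequence_match": False, "fold_recognised": True}] * 12
    + [{"length": 150, "has_sequence_match": False, "fold_recognised": False}] * 112
    + [{"length": 450, "has_sequence_match": False, "fold_recognised": False}] * 30
    + [{"length": 120, "has_sequence_match": True, "fold_recognised": False}] * 40
)

recognised, considered, rate = recognition_rate(orfs)
print(f"{recognised}/{considered} = {rate:.1f} per cent")
```

The rate is therefore a fraction of the filtered subset only, not of the whole genome, which should be borne in mind when comparing it with the figures of Fischer and Eisenberg.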
Despite the fact that both Fischer and
Eisenberg and Grandori performed basic
sequence comparisons to detect clear
homologues to known structures, it is
clear that the possibility existed that better
sequence comparison algorithms could
have been applied, and these techniques
could have assigned a greater number of
folds to ORFs. In answer to this, a
number of groups38–41 have used PSI-BLAST,1 which is an iterative sequence
profile method based on the standard
gapped-BLAST method.1 PSI-BLAST
has proven to be not only a very sensitive
sequence comparison method, but also
very reliable. To get the best results from
PSI-BLAST, however, it should be used
in both 'directions'.38,41 Normally, each
ORF is scanned against a set of PSI-BLAST profiles, each of which
corresponds to a single protein structure
or structural domain. Although
these profiles are slow to calculate, this
process has to be done only once for each
sequence of known structure. Assigning
folds to ORFs using this procedure is thus
fairly efficient. To achieve the second
search direction, a PSI-BLAST profile
must be calculated for each ORF, and this
profile can be scanned against a library of
sequences relating to known structures.
Given that the calculation of a single PSI-BLAST profile takes 10 minutes on
average using a modern workstation, for
large genomes, this second approach is
very computationally expensive, and
relatively impractical. Despite this
disadvantage, extra matches can be found
when both searches are carried out.
Although it has been claimed that
intermediate sequence searching methods
(ISL) using PSI-BLAST such as those
described by Salamov et al.41 or
Teichmann et al.42 can produce results
equivalent to a two-direction PSI-BLAST
search, in practice it is quite clear that
these approaches can miss matches
which can be found by more rigorous use
of PSI-BLAST itself.
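The trade-off between the two search directions can be made concrete with a small Python sketch. It does not run PSI-BLAST; it assumes the hits from each direction have already been collected as sets of structure identifiers per ORF. The identifiers and the genome size in the example are hypothetical; the figure of 10 minutes per profile is the estimate given above.

```python
def combine_directions(forward_hits, reverse_hits):
    """Union of hits per ORF from the two search directions.

    Each argument maps an ORF id to a set of matched structure ids.
    """
    combined = {}
    for orf in set(forward_hits) | set(reverse_hits):
        combined[orf] = forward_hits.get(orf, set()) | reverse_hits.get(orf, set())
    return combined


def reverse_search_cpu_days(n_orfs, minutes_per_profile=10):
    """Rough cost of building one profile per ORF, in CPU days."""
    return n_orfs * minutes_per_profile / (60 * 24)


# Hypothetical hits: the reverse direction recovers one extra match per ORF.
forward = {"orf1": {"1abc"}}
reverse = {"orf1": {"2xyz"}, "orf2": {"1abc"}}

print(combine_directions(forward, reverse))
print(reverse_search_cpu_days(4000))
```

For a hypothetical genome of 4,000 ORFs the second direction alone would take roughly four weeks on a single workstation of the kind assumed in the text, which is why it is described as relatively impractical for large genomes.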
Rychlewski et al.43,44 attempted to
exploit this asymmetry in PSI-BLAST
profile comparisons by means of a
comparison algorithm based on the
alignment of one profile with another.
Their technique, BASIC, requires profiles
to be computed for each sequence in the
3D structure library and also for each
ORF. These two sets of profiles are then
compared by means of a local dynamic
programming method.
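A local dynamic programming comparison of two profiles can be sketched in Python as a Smith-Waterman recursion in which profile columns, rather than residues, are scored against each other. The column score used here (a dot product of frequency vectors minus a constant shift), the linear gap penalty and the three-letter toy alphabet are all assumptions for illustration; they are not the scoring scheme of BASIC.

```python
def column_score(col_a, col_b, shift=0.5):
    """Similarity of two profile columns: dot product minus a constant shift."""
    return sum(x * y for x, y in zip(col_a, col_b)) - shift


def local_profile_alignment_score(profile_a, profile_b, gap_penalty=1.0, shift=0.5):
    """Best local alignment score between two profiles (Smith-Waterman style)."""
    rows, cols = len(profile_a), len(profile_b)
    table = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            score = column_score(profile_a[i - 1], profile_b[j - 1], shift)
            table[i][j] = max(
                0.0,                              # start a new local alignment
                table[i - 1][j - 1] + score,      # align the two columns
                table[i - 1][j] - gap_penalty,    # gap in profile_b
                table[i][j - 1] - gap_penalty,    # gap in profile_a
            )
            best = max(best, table[i][j])
    return best


# Hypothetical profiles over a three-letter alphabet (one frequency vector per column).
profile_a = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
profile_b = [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(local_profile_alignment_score(profile_a, profile_b))
```

Because both inputs are profiles, the comparison is symmetric, which is the property that motivates profile-profile alignment in the first place.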
Jones has developed a hybrid method
for assigning folds to genome sequences,
called GenTHREADER.45
GenTHREADER uses a traditional
sequence-profile alignment method to
produce alignments which are evaluated
by a method derived from threading
methods. As a last step, the alignment
scores and threading energy sums for each
threaded model are evaluated by a neural
network in order to produce a single
measure of confidence in the proposed
prediction. The method was applied to
the genome of M. genitalium, where
analysis of the results showed that as many
as 46 per cent of the proteins derived
from the predicted protein coding regions
had a significant relationship to a protein
of known structure.
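The final step, in which several scores are merged into one confidence value, can be illustrated with the simplest possible stand-in for a neural network: a single logistic unit. The Python sketch below is not the GenTHREADER network; the weights, the bias, the input names and the assumption that lower threading energies are more favourable are all hypothetical.

```python
from math import exp

# Hypothetical parameters; a real network would be trained on known examples.
WEIGHTS = {"alignment_score": 0.05, "energy_sum": -0.01}
BIAS = -4.0


def confidence(alignment_score, threading_energies):
    """Map an alignment score and threading energy terms to a value in (0, 1)."""
    energy_sum = sum(threading_energies)
    activation = (
        BIAS
        + WEIGHTS["alignment_score"] * alignment_score
        + WEIGHTS["energy_sum"] * energy_sum
    )
    return 1.0 / (1.0 + exp(-activation))


# Hypothetical inputs: a strong match and a weak match.
strong = confidence(80.0, [-250.0, -50.0])
weak = confidence(20.0, [-50.0, 0.0])
print(strong, weak)
```

The value of a single output is that predictions for an entire genome can be ranked and thresholded without manual inspection of the individual scores.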
In a recent review, Teichmann et al.46
compared the results from several
attempts to assign folds to the M.
genitalium genome. Being the smallest
bacterial genome, M. genitalium provides
a useful benchmark for different
approaches to fold assignment as most
groups have made predictions for this
genome. Although
a high degree of agreement
was apparent between the different
algorithms, some results were not found
by all techniques. This suggests that to
maximise success in assigning folds to
genomes, some kind of consensus of
algorithms might be useful. At present,
this is difficult as there are no agreed
standards for how structural annotations
should be represented.
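One simple form such a consensus could take is a majority vote over the folds assigned by different methods. The Python sketch below assumes every method reports its assignment using the same fold names, which is exactly the standardisation that is currently lacking; the method names and the voting rule are hypothetical.

```python
from collections import Counter


def consensus_fold(assignments, min_votes=2):
    """Return the fold chosen by at least min_votes methods, or None.

    assignments maps a method name to a fold name (or None for no prediction).
    A tie for first place is treated as no consensus.
    """
    votes = Counter(fold for fold in assignments.values() if fold is not None)
    top = votes.most_common(2)
    if not top:
        return None
    if len(top) > 1 and top[0][1] == top[1][1]:
        return None
    fold, count = top[0]
    return fold if count >= min_votes else None


# Hypothetical assignments for a single ORF.
assignments = {
    "sequence_profile": "TIM barrel",
    "structural_profile": "TIM barrel",
    "threading": "Rossmann fold",
}

print(consensus_fold(assignments))
```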
A number of Web resources are
available where precompiled fold
assignments can be accessed for different
subsets of genomes. These resources are
predominantly based on PSI-BLAST
comparisons. For example, the GTOP
database at the National Institute of
Genetics in Japan contains fold
assignments for 26 completed genomes
based on PSI-BLAST similarity searches,
and can be accessed from its web site.47
As far as fold recognition methods are
concerned, most of the results currently
available on the Web relate to small
subsets of genomes. For example,
comprehensive fold assignments are
available for M. genitalium, E. coli and
Helicobacter pylori at the Burnham
Institute.48
To date, the only available set of
comprehensive fold assignments using
fold recognition techniques is that
derived by GenTHREADER, which
has been compiled for 25 complete
genomes (plus the currently confirmed
gene products from the draft human
genome) and stored in a Web-accessible database.49
A preliminary analysis of these data
shows that while certain folds are
prevalent in all genomes, some folds are
more common in some genomes than in
others. The Rossmann-like fold is the
most commonly occurring fold in almost
every organism studied, mainly by virtue
of the high recurrence of the P-loop
hydrolase superfamily. One interesting
exception to this is the human genome, in
which the immunoglobulin fold currently
appears to be the most common fold.
CAFASP
One feature of the CASP process that
continues to be a concern is the difficulty
in separating out wholly automatic
predictions from those that have been
made using various degrees of human
intervention. It is not at all clear how
much of the success shown in CASP
comes from the algorithms being used and
how much comes from the expert
biological knowledge of the people using
the algorithms. It is very clear that people
do add some value, and all of the most
successful groups at CASP4 made use of
the scientific literature to identify
functionally related proteins from a
shortlist of possibilities. It therefore seems
fair to say that human intervention is
required to make the best predictions, but
what can a non-expert hope to achieve
using automated methods alone?
Also, is it possible to achieve the same
high levels of success shown in CASP4 on
many targets? If, instead of 30 or so
targets, which could easily be analysed
individually by human predictors, there
had been, say, 1,000 targets, how good
would the results have been? Clearly, if
fold recognition, or protein structure
prediction more generally, is to play an
important part in structural genomics then
it is essential that we characterise the
success of fully automated methods for
structure prediction.
Fischer et al.50 have attempted to
address this issue by creating a subsection
of the CASP process, called CAFASP
(Critical Assessment of Fully Automated
Structure Prediction). The basic idea of
CAFASP is to evaluate Web-based
prediction tools in a fully automated
fashion, thus eliminating the possibility of
human assistance in the prediction
process. CAFASP1 was carried out shortly
after the CASP3 meeting, and must
therefore be considered a pilot study as
the predictions were not blind.
Nonetheless, the results were interesting
and the process allowed some technical
issues to be resolved in good time for
CAFASP2, which was an official part of
CASP4.
The detailed results for
CAFASP2 are available for viewing from
the associated web site,51 but it is possible to
conclude broadly that, although skilled
human intervention is clearly beneficial,
entirely automated methods still
performed fairly well. The bad news is
that, as with the human predictors in
CASP4, the success of the automatic
methods mainly came from relatively easy
targets in the superfamily category.
Targets in the analogous category were
predicted very poorly by the automatic
servers.
AB INITIO METHODS
In cases where it is not possible to build a
useful model by comparative modelling, it
might be hoped that a 3D structure could
be calculated directly from the amino acid
sequence by means of pure physics or a
knowledge of the rules of protein folding.
This has proven to be, and remains, a very
difficult challenge. It is not difficult to
understand the practical applications of an
entirely general method for predicting the
tertiary structure of novel gene products,
but is there any hope that such approaches
will provide useful levels of success in the
short to medium term?
Figure 2: The three main classes of protein
structure prediction methods in order of
`complexity'. Comparative modelling
approaches are fairly easy methods to apply
to genome sequences, and produce accurate
3D models for the proteins concerned.
Unfortunately, only a small percentage
(20–30 per cent) of the proteins encoded in
most genomes are closely enough related to
a protein of known 3D structure to allow
comparative modelling techniques to be
applied. At the other extreme, ab initio
methods can in theory be applied to any
sequence, but these methods are not
currently able to produce useful 3D models.
Fold recognition methods occupy the
mid-ground between these two classical
approaches.
Of course, not all ab initio methods are
aimed at the prediction of tertiary
structure. A great deal of progress has
been made in recent years in methods that
attempt to predict protein secondary
structure. The reason there remains great
interest in secondary structure prediction
is because it is often used as a component
of a wide range of 3D prediction
methods. Indeed, although it is rarely
used in isolation, accurate secondary
structure prediction is exploited by the
vast majority of prediction groups taking
part in CASP. It has also been suggested
that careful analysis of accurate secondary
structure predictions can also provide
functional information on a new gene
sequence (e.g. King et al.52).
Up until around two years ago, the
best and by far the most widely used
method for predicting secondary
structure was the PHD method
developed by Burkhard Rost.53 At
CASP3, however, the PSIPRED
method54 showed a marked improvement
in prediction accuracy over previous
methods. Although PSIPRED was very
similar to PHD in concept (using two
levels of neural networks to analyse
sequence profiles), it used PSI-BLAST to
provide more sensitive and more accurate
sequence profiles. Added to this was a
highly redundant training set including
nearly 2,000 separate profiles. At CASP4,
PSIPRED was still ranked at the top of the
20 or so methods evaluated, achieving an
overall 3-state prediction accuracy (Q3
score) of 80.6 per cent for all 40 target
domains with no obvious sequence
similarity to existing structures.
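For readers unfamiliar with the measure, the Q3 score is simply the percentage of residues whose predicted state (helix, strand or coil) matches the state observed in the experimental structure. The short sketch below shows the calculation; the one-letter codes H, E and C and the two ten-residue strings are invented for illustration and are not CASP data.

```python
def q3_score(predicted, observed):
    """Percentage of residues whose 3-state label (H, E or C) is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Invented 10-residue example: two of the ten states are mispredicted.
observed = "CHHHHHCEEC"
predicted = "CHHHHCCEEE"
print(f"Q3 = {q3_score(predicted, observed):.1f} per cent")
```

Because Q3 is a per-residue average, it says nothing about whether whole helices or strands are placed correctly, which is one reason segment-based measures are sometimes reported alongside it.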
PREDICTING NEW FOLDS
Somewhat more interesting than
secondary structure prediction is, of course,
the ab initio prediction of protein tertiary
structure. The concept behind most ab
initio approaches to protein structure
prediction is quite simple. Firstly, a large
number of different chain conformations are
generated for the target protein. At the
very simplest, these conformations can be
enumerated exhaustively, i.e. virtually
every distinct 3D structure is generated
for the protein. Clearly, this is only
practical for very small proteins, since
the number of possible conformations
grows exponentially with the number of
amino acid residues. One way of slightly
reducing the number of possible structures
is to build the structure on a fixed lattice,
which restricts the positions of the atoms
in the structure to a fixed set of
coordinates. For larger proteins, more
intelligent search strategies are needed,
which include molecular dynamics,
simulated annealing, genetic algorithms
and a number of other efficient
search strategies.
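The exponential growth in the number of conformations can be made concrete with a toy calculation. The sketch below counts every self-avoiding chain on a two-dimensional square lattice by exhaustive enumeration; the choice of a 2D square lattice and the function name are assumptions made purely for illustration, and real lattice models of proteins are three-dimensional and considerably more elaborate.

```python
def count_conformations(n_bonds):
    """Count self-avoiding chains of n_bonds bonds on a 2D square lattice.

    The first residue is fixed at the origin; chains related by rotation
    or reflection are counted separately.
    """
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]

    def extend(position, occupied, remaining):
        if remaining == 0:
            return 1
        total = 0
        x, y = position
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                occupied.add(nxt)
                total += extend(nxt, occupied, remaining - 1)
                occupied.remove(nxt)
        return total

    return extend((0, 0), {(0, 0)}, n_bonds)


for n in range(1, 11):
    print(n, count_conformations(n))
```

Even on this very restrictive lattice, a chain of only 10 bonds has 44,100 distinct conformations, and the count grows by a factor of roughly 2.7 for each added bond, so exhaustive enumeration quickly becomes impractical.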
Traditionally, very little success has
been demonstrated in the ab initio
prediction of protein tertiary structure in
the various CASP experiments. However,
in the second CASP experiment, the best
ab initio prediction55 was close enough (α-carbon
root mean square deviation of 6.2 Å)
for it to be confidently claimed that
at least the fold was correctly reproduced
in the model. This prediction was
generated by a Monte Carlo approach
where fragments of protein structures are
spliced together, and the resulting chain
conformations evaluated using a simple
energy function.
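The general idea of fragment splicing followed by energy evaluation can be sketched on a toy model. The code below is not the method of reference 55, nor ROSETTA: it assumes a two-dimensional lattice chain, a two-letter hydrophobic/polar (HP) sequence, a hand-written list of three-move `FRAGMENTS` standing in for a library of fragments cut from known structures, and a contact-counting energy with a Metropolis acceptance rule. The sequence, temperature and step count are likewise arbitrary illustrative choices.

```python
import math
import random

DIRECTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # east, north, west, south

# Hypothetical stand-in for a library of fragments cut from known structures:
# each fragment is three consecutive lattice moves.
FRAGMENTS = [(0, 0, 0), (0, 1, 2), (0, 3, 2), (1, 0, 3), (1, 2, 3), (0, 1, 0)]


def trace(moves):
    """Return lattice coordinates for a chain, or None if it overlaps itself."""
    pos = (0, 0)
    coords = [pos]
    seen = {pos}
    for m in moves:
        dx, dy = DIRECTIONS[m]
        pos = (pos[0] + dx, pos[1] + dy)
        if pos in seen:
            return None
        seen.add(pos)
        coords.append(pos)
    return coords


def energy(coords, sequence):
    """Toy energy: -1 for each pair of non-bonded hydrophobic (H) neighbours."""
    index = {c: i for i, c in enumerate(coords)}
    e = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in ((1, 0), (0, 1)):  # look east and north so each pair counts once
            j = index.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                e -= 1
    return e


def fragment_monte_carlo(sequence, steps=2000, temperature=0.5, seed=0):
    """Splice random fragments into the chain, accepting by the Metropolis rule."""
    rng = random.Random(seed)
    moves = [0] * (len(sequence) - 1)  # start from a fully extended chain
    current_e = energy(trace(moves), sequence)
    best_moves, best_e = list(moves), current_e
    for _ in range(steps):
        start = rng.randrange(len(moves) - 2)
        trial = list(moves)
        trial[start:start + 3] = rng.choice(FRAGMENTS)
        coords = trace(trial)
        if coords is None:
            continue  # reject self-overlapping chains outright
        trial_e = energy(coords, sequence)
        delta = trial_e - current_e
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            moves, current_e = trial, trial_e
            if current_e < best_e:
                best_moves, best_e = list(moves), current_e
    return best_moves, best_e


best_moves, best_e = fragment_monte_carlo("HPHPPHHPHPPHPHHPPHPH")
print(best_e, best_moves)
```

The fixed random seed makes a run reproducible; in practice many independent runs from different seeds would be needed, since any single trajectory can become trapped in a poor local minimum.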
At CASP3 the group of David Baker
took these ideas further with some
success.56 As an aside, this kind of
approach to folding proteins has become
nicknamed `mini-threading' by some
predictors. This terminology is perhaps
useful to distinguish such knowledge-based prediction methods from methods
that attempt to simulate protein folding
using physical principles, but is otherwise
quite misleading.
At CASP4, however, the Baker group
showed an even higher degree of success
across a much wider range of target folds
than had been seen at earlier CASP
experiments, where successes had
been limited mainly to alpha-helical
folds. Figure 3 shows an example of a
prediction from the Baker group which
accurately models the unique fold of
CASP4 target T0091, which is a
hypothetical protein from Haemophilus
influenzae (HI0442). It of course remains
to be seen how much functional insight
can be gleaned from a knowledge of this
rather unusual fold.
Figure 3: Example of a correctly predicted
novel fold for a hypothetical gene product
(ORF HI0442 from H. influenzae) using the
ROSETTA method of Simons and
coworkers.56,57 The figure was generated
using Molscript.58
CONCLUSIONS
This review has demonstrated that protein
structure prediction does have a role to
play in the annotation of genome
sequences. In many cases, the correct
assignment of a fold to a novel gene can
provide useful clues as to its likely
function.
It seems that ab initio methods for
protein structure prediction, while
sometimes achieving interesting results on
fragments of proteins, are unlikely to be
used for genome analysis in the near
future. The success of ab initio algorithms
has never been tested rigorously on a large
number of test cases, and so the chance of
finding a reasonable model for a target
protein is unknown. However, results in
the four CASP experiments suggest that
the chance of any single algorithm
producing a reasonable model for a given
sequence is very low.
Fold recognition algorithms, on the
other hand, have now reached a point
where fair models for target proteins can
be found on a routine basis, especially
where a homologous template structure
can be found. Not surprisingly, therefore,
different fold recognition methods have
already been applied to the problem of
assigning folds to genome sequences. The
simplest and most reliable predictions are
based purely on sequence similarity, and
in particular PSI-BLAST1 is proving to be
a valuable tool for detecting remote
homologous relationships between
protein sequences. At the other extreme,
fold recognition methods which typically
ignore sequence similarity and make use
of structural information have also been
applied, but with somewhat less success.
Hybrid methods, which combine
sequence comparison and fold
recognition methods, are expected to be
an important development in this area.
Such methods can detect homologous
relationships just beyond the detection
threshold of methods such as PSI-BLAST,
as evidenced by the results for the first
such algorithm.39
So how will the relationship between
computational approaches to protein
folding and the ongoing structural
genomics projects probably develop? It
is clear that protein structure prediction
is never likely to challenge experimental
methods in the determination of
accurate structural models for proteins.
The role of protein structure prediction
and modelling lies in the rapid analysis
and annotation of proteins, the analysis
of proteins for which experimental
structure determination has proven to be
dif®cult, and the extrapolation from
existing experimental structures to other
members of the protein's family and
superfamily.
In the fullness of time, of course, the
protein structure data banks will be so
complete as to render protein structure
prediction a more or less academic
problem. Eventually there will almost
always be available a closely related
protein of known structure with which
to build an accurate model for any
given target protein. When might this
occur? Based on current rates of
structure determination and the
observed growth of the structure
databases, this scenario might arise in
around 15 years or longer.
On the other hand, perhaps the
structural genomics initiatives will
stimulate the development of novel
methods for rapid structure
determination and reduce this time to
around 10 years. In either case, will
theoreticians suddenly find their skills of
little value? No. Even with a highly
optimistic view of success, present
approaches to structural genomics will
only scratch the surface of
understanding the working of cells at a
molecular level. There will still remain
a rich selection of unsolved problems to
keep theoreticians in structural biology
busy for as long as they wish: protein
misfolding, protein–protein interactions
and biomolecular recognition, drug
design, membrane proteins and even de
novo protein design are just a few of the
many challenging future possibilities for
continuing theoretical studies of protein
structure.
References
1. Altschul, S. F., Madden, T. L., Schaffer, A. A. et al. (1997), `Gapped BLAST and PSI-BLAST: A new generation of protein database search programs', Nucleic Acids Res., Vol. 25, pp. 3389–3402.
2. Eddy, S. R. (1996), `Hidden Markov models', Curr. Opin. Struct. Biol., Vol. 6, pp. 361–365.
3. Bork, P. and Koonin, E. V. (1998), `Predicting functions from protein sequences – where are the bottlenecks?', Nature Genetics, Vol. 18, pp. 313–318.
4. Wilson, C. A., Kreychman, J. and Gerstein, M. (2000), `Assessing annotation transfer for genomics: Quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores', J. Mol. Biol., Vol. 297, pp. 233–249.
5. Fischer, D. and Eisenberg, D. (1999), `Finding families for genomic ORFans', Bioinformatics, Vol. 15, pp. 759–762.
6. Marcotte, E. M., Pellegrini, M., Thompson, M. J. et al. (1999), `A combined algorithm for genome-wide prediction of protein function', Nature, Vol. 402, pp. 83–86.
7. Enright, A. J., Iliopoulos, I., Kyrpides, N. C. and Ouzounis, C. A. (1999), `Protein interaction maps for complete genomes based on gene fusion events', Nature, Vol. 402, pp. 86–90.
8. Anonymous (1999), `Editorial: Money for structural genomics', Nat. Struct. Biol., Vol. 6, pp. 707–708.
9. Shapiro, L. and Lima, C. D. (1998), `The Argonne Structural Genomics Workshop: Lamaze class for the birth of a new science', Structure, Vol. 6, pp. 265–267.
10. Lamzin, V. S. and Perrakis, A. (2000), `Current state of automated crystallographic data analysis', Nat. Struct. Biol., Vol. 7, Nov., Suppl., pp. 978–981.
11. Hendrickson, W. A. (2000), `Synchrotron crystallography', Trends Biochem. Sci., Vol. 25, pp. 637–643.
12. Orengo, C. A., Michie, A. D., Jones, S. et al. (1997), `CATH – a hierarchic classification of protein domain structures', Structure, Vol. 5, pp. 1093–1108.
13. Brenner, S. E. and Levitt, M. (2000), `Expectations from structural genomics', Protein Sci., Vol. 9, pp. 197–200.
14. Sussman, J. L., Lin, D. W., Jiang, J. S. et al. (1998), `Protein Data Bank (PDB): Database of three-dimensional structural information of biological macromolecules', Acta Crystallogr. D, Vol. 54, pp. 1078–1084.
15. Thornton, J. M., Orengo, C. A., Todd, A. E. and Pearl, F. M. G. (1999), `Protein folds, functions and evolution', J. Mol. Biol., Vol. 293, pp. 333–342.
16. Hegyi, H. and Gerstein, M. (1999), `The relationship between protein structure and function: A comprehensive survey with application to the yeast genome', J. Mol. Biol., Vol. 288, pp. 147–164.
17. Martin, A. C., Orengo, C. A., Hutchinson, E. G. et al. (1998), `Protein folds and functions', Structure, Vol. 6, pp. 875–884.
18. Orengo, C. A., Jones, D. T. and Thornton, J. M. (1994), `Protein superfamilies and domain superfolds', Nature, Vol. 372, pp. 631–634.
19. Yang, F., Gustafson, K. R., Boyd, M. R. and Wlodawer, A. (1998), `Crystal structure of Escherichia coli HdeA', Nat. Struct. Biol., Vol. 5, pp. 763–764.
20. Volz, K. (1999), `A test case for structure-based functional assignment: The 1.2 angstrom crystal structure of the yjgF gene product from Escherichia coli', Protein Sci., Vol. 8, pp. 2428–2437.
21. Zarembinski, T. I., Hung, L. W., Mueller-Dieckmann, H. J. et al. (1998), `Structure-based assignment of the biochemical function of a hypothetical protein: A test case of structural genomics', Proc. Natl Acad. Sci. USA, Vol. 95, pp. 15189–15193.
22. Russell, R. B. (1998), `Detection of protein three-dimensional side-chain patterns: New examples of convergent evolution', J. Mol. Biol., Vol. 279, pp. 1211–1227.
23. Wallace, A. C., Borkakoti, N. and Thornton, J. M. (1997), `TESS: A geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites', Protein Sci., Vol. 6, pp. 2308–2323.
24. URL: http://www.biochem.ucl.ac.uk/bsm/PROCAT
25. Fetrow, J. S., Godzik, A. and Skolnick, J. (1998), `Functional analysis of the Escherichia coli genome using the sequence-to-structure-to-function paradigm: Identification of proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase activity', J. Mol. Biol., Vol. 282, pp. 703–711.
26. Fetrow, J. S. and Skolnick, J. (1998), `Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T-1 ribonucleases', J. Mol. Biol., Vol. 281, pp. 949–968.
27. Moult, J., Hubbard, T., Fidelis, K. and Pedersen, J. T. (1999), `Critical assessment of methods of protein structure prediction (CASP): Round III', Proteins, Vol. S3, pp. 2–6.
28. URL: http://predictioncenter.llnl.gov
29. Sanchez, R., Pieper, U., Melo, F. et al. (2000), `Protein structure modeling for structural genomics', Nat. Struct. Biol., Vol. 7, Nov., Suppl., pp. 986–990.
30. Overington, J., Donnelly, D., Johnson, M. S. et al. (1992), `Environment-specific amino-acid substitution tables – tertiary templates and prediction of protein folds', Protein Sci., Vol. 1, pp. 216–226.
31. Bowie, J. U., Lüthy, R. and Eisenberg, D. (1991), `A method to identify protein sequences that fold into a known three-dimensional structure', Science, Vol. 253, pp. 164–170.
32. Ouzounis, C., Sander, C., Scharf, M. and Schneider, R. (1993), `Prediction of protein structure by evaluation of sequence-structure fitness. Aligning sequences to contact profiles derived from three-dimensional structures', J. Mol. Biol., Vol. 232, pp. 805–825.
33. Jones, D. T., Taylor, W. R. and Thornton, J. M. (1992), `A new approach to protein fold recognition', Nature, Vol. 358, pp. 86–89.
34. Bryant, S. H. and Lawrence, C. E. (1993), `An empirical energy function for threading protein-sequence through the folding motif', Proteins: Struct. Function Genet., Vol. 16, pp. 92–112.
35. Flöckner, H., Braxenthaler, M., Lackner, P. et al. (1995), `Progress in fold recognition', Proteins, Vol. 23, pp. 376–386.
36. Fischer, D. and Eisenberg, D. (1997), `Assigning folds to the proteins encoded by the genome of Mycoplasma genitalium', Proc. Natl Acad. Sci. USA, Vol. 94, pp. 11929–11934.
37. Grandori, R. (1998), `Systematic fold recognition analysis of the sequences encoded by the genome of Mycoplasma pneumoniae', Prot. Engng, Vol. 11, pp. 1129–1135.
38. Teichmann, S. A., Park, J. and Chothia, C. (1998), `Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements', Proc. Natl Acad. Sci. USA, Vol. 95, pp. 14658–14663.
39. Huynen, M., Doerks, T., Eisenhaber, F. et al. (1998), `Homology-based fold predictions for Mycoplasma genitalium proteins', J. Mol. Biol., Vol. 280, pp. 323–326.
40. Wolf, Y. I., Brenner, S. E., Bash, P. A. and Koonin, E. V. (1999), `Distribution of protein folds in the three superkingdoms of life', Genome Res., Vol. 9, pp. 17–26.
41. Salamov, A. A., Suwa, M., Orengo, C. A. and Swindells, M. B. (1999), `Genome analysis: Assigning protein coding regions to three-dimensional structures', Protein Sci., Vol. 8, pp. 771–777.
42. Teichmann, S. A., Chothia, C., Church, G. M. and Park, J. (2000), `Fast assignment of protein structures to sequences using the intermediate sequence library PDB-ISL', Bioinformatics, Vol. 16, pp. 117–124.
43. Rychlewski, L., Zhang, B. H. and Godzik, A. (1998), `Fold and function predictions for Mycoplasma genitalium proteins', Folding Design, Vol. 3, pp. 229–238.
44. Rychlewski, L., Zhang, B. H. and Godzik, A. (1999), `Functional insights from structural predictions: Analysis of the Escherichia coli genome', Protein Sci., Vol. 8, pp. 614–624.
45. Jones, D. T. (1999), `GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences', J. Mol. Biol., Vol. 287, pp. 797–815.
46. Teichmann, S. A., Chothia, C. and Gerstein, M. (1999), `Advances in structural genomics', Curr. Opin. Struct. Biol., Vol. 9, pp. 390–399.
47. URL: http://spock.genes.nig.ac.jp/~genome/summary.html
48. URL: http://bioinformatics.burnhaminst.org/pages/
49. URL: http://insulin.brunel.ac.uk/genomes/
50. Fischer, D., Barret, C., Bryson, K. et al. (1999), `CAFASP-1: Critical assessment of fully automated structure prediction methods', Proteins, Vol. S3, pp. 209–217.
51. URL: http://www.cs.bgu.ac.il/~dfischer/CAFASP2/
52. King, R. D., Karwath, A., Clare, A. and Dehaspe, L. (2000), `Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining', Yeast, Vol. 17, pp. 283–293.
53. Rost, B. (1996), `PHD: Predicting one-dimensional protein structure by profile-based neural networks', Methods Enzymol., Vol. 266, pp. 525–539.
54. Jones, D. T. (1999), `Protein secondary structure prediction based on position-specific scoring matrices', J. Mol. Biol., Vol. 292, pp. 195–202.
55. Jones, D. T. (1997), `Successful ab initio prediction of the tertiary structure of NK-lysin using multiple sequences and recognized supersecondary structural motifs', Proteins, Vol. S1, pp. 185–191.
56. Simons, K. T., Bonneau, R., Ruczinski, I. and Baker, D. (1999), `Ab initio protein structure prediction of CASP III targets using ROSETTA', Proteins, Vol. S3, pp. 171–176.
57. Baker, D. (2000), `A surprising simplicity to protein folding', Nature, Vol. 405, pp. 39–42.
58. Kraulis, P. J. (1991), `MOLSCRIPT: A program to produce both detailed and schematic plots of protein structures', J. Appl. Crystallogr., Vol. 24, pp. 946–950.
APPENDIX: SOME LINKS TO PROTEIN STRUCTURE PREDICTION RESOURCES ON THE WEB
Server | Organisation | Web site
Comparative modelling servers
3D-Jigsaw | Imperial Cancer Research Fund, UK | http://www.bmm.icnet.uk/servers/3djigsaw/
CPHmodels | Center for Biological Sequence Analysis, Denmark | http://www.cbs.dtu.dk/services/CPHmodels/
SWISS-MODEL | Swiss Institute of Bioinformatics | http://www.expasy.ch/swissmod/SWISS-MODEL.html
Sequence-based homology recognition servers
FFAS | Burnham Institute, USA | http://bioinformatics.ljcrf.edu/FFAS/
Fugue | Cambridge University, UK | http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html
PDB-ISL | Laboratory of Molecular Biology, Cambridge, UK | http://stash.mrc-lmb.cam.ac.uk
PSI-BLAST and CDD | National Center for Biotechnology Information, USA | http://www.ncbi.nlm.nih.gov
SAM | University of California, Santa Cruz, USA | http://www.cse.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html
Fold recognition servers
123D | National Cancer Institute, USA | http://www-lmmb.ncifcrf.gov/~nicka/123D+.html
3D-PSSM | Imperial Cancer Research Fund, UK | http://www.bmm.icnet.uk/~3dpssm/html/ffrecog.html
BioInBgu | Ben Gurion University, Israel | http://www.cs.bgu.ac.il/~bioinbgu/form.html
GenTHREADER | Brunel University, UK | http://www.psipred.net
General protein structure prediction servers
PredictProtein | Columbia University, USA | http://cubic.bioc.columbia.edu/predictprotein/
PSIPRED | Brunel University, UK | http://www.psipred.net
Meta-servers (portals to multiple prediction methods)
META | Columbia University, USA | http://cubic.bioc.columbia.edu/predictprotein/doc/meta_intro.html
Meta-Server | International Institute of Molecular and Cell Biology, Poland | http://BioInfo.PL/meta/
Collections of genome fold assignments
GTD | Brunel University, UK | http://insulin.brunel.ac.uk/genomes
GTOP | National Institute of Genetics, Japan | http://spock.genes.nig.ac.jp/~genome/summary.html
MODBASE | Rockefeller University, USA | http://pipe.rockefeller.edu/modbase/
PEDANT | Max-Planck Institute, Germany | http://pedant.mips.biochem.mpg.de/
PRESAGE | Berkeley University, USA | http://presage.berkeley.edu/
Benchmarking resources for protein structure prediction
CAFASP | Ben Gurion University, Israel | http://www.cs.bgu.ac.il/~dfischer/CAFASP2/
CASP | Lawrence Livermore National Laboratory, USA | http://predictioncenter.llnl.gov/
EVA | Columbia University, USA | http://cubic.bioc.columbia.edu/eva
LiveBench | International Institute of Molecular and Cell Biology, Poland | http://BioInfo.PL/LiveBench/