Protein structure prediction in the postgenomic era
David T Jones
As the number of completely sequenced genomes rapidly
increases, the postgenomic problem of gene function
identification becomes ever more pressing. Predicting the
structures of proteins encoded by genes of interest is one
possible means to glean subtle clues as to the functions of
these proteins. There are limitations to this approach to gene
identification and a survey of the expected reliability of different
protein structure prediction techniques has been undertaken.
Addresses
Department of Biological Sciences, Brunel University, Uxbridge,
Middlesex UB8 3PH, UK; e-mail: [email protected]
Current Opinion in Structural Biology 2000, 10:371–379
0959-440X/00/$ — see front matter
© 2000 Elsevier Science Ltd. All rights reserved.
Abbreviations
CAFASP  Critical Assessment of Fully Automated Structure Prediction
CASP    Critical Assessment in Structure Prediction
HMM     hidden Markov model
ORF     open reading frame
PDB     Protein Data Bank
RMSD    root mean square deviation
Introduction
It is expected that a first draft of the complete human
genome sequence will be available sometime in the year
2000. Although the completion of the sequence will probably take several more years, this milestone alone represents
a major breakthrough in molecular biology. Sequencing
efforts for simpler organisms are also continuing to produce
increasing volumes of valuable data and, at the time of writing, some 30 or so complete bacterial genome sequences
are available in the sequence databanks.
As we are now clearly moving into the postsequencing
phase of many genome projects, attention is becoming
more and more focused on the correct identification of
gene function. Assigning a function to a gene is an
important first step in characterising its role in the various cellular processes and, without this information, the
value of genome sequencing is greatly reduced. Of
course, simple sequence comparison techniques are by
far the most widely used method for making an initial
identification of a particular gene product. By identifying homology between a new gene and a gene of known
function, some inferences can be made as to the function
of the new gene. How reliably the function can be
extrapolated to the new gene depends on a number of
factors, but the principal factor is, of course, the degree
of sequence similarity observed.
In recent years, sequence comparison methods, such as
PSI-BLAST [1], or methods based on hidden Markov
models (HMMs) [2] have ‘pushed the envelope’ as far as
detecting homologous relationships goes. Of course, as
more and more remote relationships are being considered,
it becomes less clear as to how reliably one can map the
function of one gene to another [3]. Nonetheless, sensitive sequence comparison techniques are still the most
important technology that we have for rapidly characterising new gene products.
Despite the power of modern sequence comparison techniques, there still remain open reading frames (ORFs) that
either match no other entry in existing sequence databanks or match proteins that are also of unknown function.
These sequence orphans or ‘ORFans’ [4•] are a source of
great debate. Certainly, at present, this class of gene represents a large fraction of the larger completely sequenced
genomes. However, estimates of the number of orphans
vary greatly [4•,5] and, consequently, different opinions
exist as to how much of a problem these genes represent.
The recent paper by Fischer and Eisenberg [4•] suggests
that the fraction of genomes falling into the category of
sequence orphans is unlikely to change rapidly.
These conclusions are based on the observation that the
number of uncharacterisable ORFs in the yeast genome,
for example, has not been reduced substantially in the two
years or so since it was first completely sequenced. This is
despite the fact that the sizes of the sequence databanks
have more than doubled over this period. Underlying this
kind of estimate is, of course, the fact that, even for yeast,
it is still not possible to be certain about the true number
of expressed genes. The fraction of the approximately
6000 ORFs found in the yeast genome that really relate to
expressed proteins is still unknown. Indeed, at the
extreme end, one recent calculation [5] on the yeast
genome suggests that the true fraction of yeast ORFs that
are sequence orphans may be as little as 5%.
Whatever the true number of sequence orphans,
the fact is that there remains a ‘hard core’ of small
sequence families across a wide variety of genomes for
which no functional information apparently exists.
Direct experimental function determination is perhaps the
ideal approach to characterising these orphan proteins.
Gene knockout experiments and expression array techniques are just two of many experimental techniques that
are now being widely applied to function determination.
New theoretical methods for predicting gene function
have also been proposed [6••,7••]. Two basic ideas are represented by these methods. Firstly, proteins of similar
function may ‘co-evolve’ [6••]. In other words, groups of
proteins that are found in some organisms, but not others,
may share some common function. Proteins that might be
found, for example, in aerobic organisms, but never in
anaerobic organisms, may well have a role in the utilisation
of molecular oxygen. Of course, because this level of
functional classification is so broad in scope, its value is
rather uncertain.
Nevertheless, this kind of information might well produce
some unique clues for at least a few sequence orphans.
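The co-evolution idea can be illustrated by comparing presence/absence patterns across genomes. What follows is a minimal sketch and not the published method of [6••]: the protein names, the genome names and the rule of grouping only identical patterns are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical presence/absence data: protein family -> genomes where it is found.
PRESENCE = {
    "protA": {"aerobe1", "aerobe2", "aerobe3"},
    "protB": {"aerobe1", "aerobe2", "aerobe3"},
    "protC": {"aerobe1", "anaerobe1"},
    "protD": {"anaerobe1", "anaerobe2"},
}
GENOMES = ["aerobe1", "aerobe2", "aerobe3", "anaerobe1", "anaerobe2"]


def profile(protein):
    """Return the presence/absence pattern of a protein as a tuple of 0/1 flags."""
    return tuple(int(g in PRESENCE[protein]) for g in GENOMES)


def group_by_profile(proteins):
    """Group proteins that share an identical pattern across all genomes."""
    groups = defaultdict(list)
    for p in proteins:
        groups[profile(p)].append(p)
    # Only groups with at least two members suggest a possible shared function.
    return [sorted(members) for members in groups.values() if len(members) > 1]


candidate_groups = group_by_profile(PRESENCE)
print(candidate_groups)
```

In this toy data set, only the two proteins confined to the aerobic genomes are grouped together. A practical implementation would presumably tolerate near-identical patterns rather than demand exact matches.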
The second idea, which has been proposed independently
by two groups [6••,7••], has been called the ‘Rosetta
sequence’ method by Eisenberg and colleagues [6••]. In
this case, possible protein–protein interactions are predicted by identifying separate protein domains that are
sometimes observed to be fused together in some organisms. Of course, there are a number of well-known
exceptions to this rule (many protein modules for example) but, despite this, Rosetta sequences may well offer
some tantalising clues as to the network of interactions that
is present in living cells.
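The domain fusion idea can be sketched in a few lines. This is an illustrative simplification, not the procedure used in [6••,7••]: the domain annotations are invented and the restriction to single-domain query proteins is an assumption made to keep the example short.

```python
from itertools import combinations

# Hypothetical domain annotations: protein -> ordered list of domain families.
PROTEOME_X = {"x1": ["domA"], "x2": ["domB"], "x3": ["domC"]}
PROTEOME_Y = {"y1": ["domA", "domB"], "y2": ["domC"]}


def fused_pairs(proteome):
    """Collect every pair of distinct domains found within a single protein chain."""
    pairs = set()
    for domains in proteome.values():
        for a, b in combinations(sorted(set(domains)), 2):
            pairs.add((a, b))
    return pairs


def predict_interactions(query, reference):
    """Pair up single-domain query proteins whose domains are fused in the reference."""
    fusions = fused_pairs(reference)
    single = {name: doms[0] for name, doms in query.items() if len(doms) == 1}
    predicted = []
    for (p, dp), (q, dq) in combinations(sorted(single.items()), 2):
        if tuple(sorted((dp, dq))) in fusions:
            predicted.append((p, q))
    return predicted


interactions = predict_interactions(PROTEOME_X, PROTEOME_Y)
print(interactions)
```

Here x1 and x2 are predicted to interact because their domains occur fused in y1. The well-known exceptions mentioned above (promiscuous protein modules) would generate false positives in a scheme this naive and would need to be filtered.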
Structure/function paradigm
It is now well known that protein structure is much more
highly conserved than protein sequence. It is also evident
that it is the tertiary structure of a protein that creates the
means by which it functions. Given these two observations, it is not surprising that many people believe that
determining the 3D structure of a protein can provide
valuable information as to its function and mechanism.
One aspect of this belief is the impetus to solve experimentally the structure of every protein encoded by a
bacterial genome. Some such structural genomics initiatives are already underway [8,9] but, as yet, none of these
projects has produced large numbers of new structures, as
they remain in the pilot stage of development. Despite
great improvements to the basic techniques of X-ray crystallography, particularly the use of synchrotron radiation
sources, the rate-limiting step in structure determination
remains the expression, purification and crystallisation of
the target proteins. NMR techniques offer some scope for
avoiding some of these difficulties, but they are still limited with respect to the size of protein that can be
routinely tackled.
In the absence of any revolution in the experimental structure determination fields, it is hoped that theoretical
methods may help ‘fill the gaps’ in ‘fold space’. Given that
it has been estimated that the probability of a novel gene
product having an entirely new fold is less than 30% [10],
methods for recognising known folds are, of course,
expected to be a powerful means for obtaining structural
information about a new gene. Beyond fold recognition,
there also lies the hope that methods will become available
that might calculate an approximate fold for a given protein sequence without reference to a template structure.
The question of prediction will be covered later but, for
the time being, let us ignore this problem. Suppose a
method did exist for accurately predicting protein tertiary
structure or that we had solved the structures of all the
proteins in a genome of interest. The question remains as
to how useful protein structure is for elucidating the
function of a protein.
At the most basic level, there are apparent correlations
between the broad structural class of a protein and its
function. In a recent survey of the structures in the
PDB [11], for example, Thornton and colleagues [12•]
show that the majority of enzyme structures are found to
be in the αβ-fold class, as are nucleotide-binding
domains. Unfortunately, these observations, although
suggesting a relationship between fold and function, have
little or no predictive value. Clearly, it would be foolish to
say that just because a protein is in the αβ-fold class it is
likely to be an enzyme, even though one would be more
often right than wrong.
Hegyi and Gerstein [13•] have looked more closely at the
relationship between fold classification and enzyme classification (EC number), using BLAST [1] to cross-reference
the SCOP database and SWISS-PROT. In terms of fold
class biases, their data are in broad agreement with the
observations made by Thornton et al. [12•] and earlier work
by Martin et al. [14]. By extending their dataset by counting not just entries in the PDB, but also homologues in
SWISS-PROT, Hegyi and Gerstein [13•] were also able to
make some statements about the statistical relationship
between functional class and protein topology. They estimate that the average number of functions found to be
associated with a particular fold is 1.2 for both enzymes and
nonenzymes, and 1.8 for enzyme-related folds alone.
Furthermore, they found the average number of folds for a
given function to be 3.6 (2.5 for enzymes alone). This suggests that, on average at least, the correct prediction of a
protein fold is a very good pointer to its function.
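The two averages quoted above (functions per fold and folds per function) are simple to compute once fold and function assignments are paired. The sketch below uses invented assignments, not the data of [13•], and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical (fold, function) assignments; not data from [13•].
ASSIGNMENTS = [
    ("TIM barrel", "isomerase"),
    ("TIM barrel", "aldolase"),
    ("TIM barrel", "glycosidase"),
    ("Rossmann fold", "dehydrogenase"),
    ("ferredoxin fold", "electron transfer"),
    ("flavodoxin fold", "electron transfer"),
]


def average_set_size(pairs, key_index):
    """Average number of distinct partners per key (key_index 0 = fold, 1 = function)."""
    partners = defaultdict(set)
    for pair in pairs:
        partners[pair[key_index]].add(pair[1 - key_index])
    return sum(len(s) for s in partners.values()) / len(partners)


functions_per_fold = average_set_size(ASSIGNMENTS, 0)
folds_per_function = average_set_size(ASSIGNMENTS, 1)
print(f"functions per fold: {functions_per_fold:.2f}")
print(f"folds per function: {folds_per_function:.2f}")
```

Even in this toy example the average hides a skew: a single fold accounts for half of all the assignments, so a low mean does not imply that every fold is a good pointer to function.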
Unfortunately, this apparent good news is somewhat
marred by the observed biases in fold distributions. The
superfolds [15], such as the (αβ)8 (TIM) barrel, have been
long known to be associated with a very large number of
functions. Hegyi and Gerstein found the top five ‘multifunctional folds’ to be the TIM barrel, the αβ hydrolase
fold, the Rossmann fold, the P-loop-containing NTP
hydrolase fold and the ferredoxin fold (Table 1).
Table 1
The five most ‘multifunctional’ folds [13•].
Fold                                    Number of functions
TIM barrel                              16
αβ Hydrolase fold                       9
Rossmann fold                           6
P-loop-containing NTP hydrolase fold    6
Ferredoxin fold                         6

From this, we can see that assigning the TIM barrel fold
to a particular gene product will give little information as
to its function. In all probability, the gene would encode
an enzyme (as almost all proteins with the TIM barrel fold
are known to be enzymes) but, beyond this, very little
functional insight would be gained. Sadly, the superfolds
also account for the majority of observed structural similarities. Orengo et al. [15] estimated that roughly half of
the observed structural similarities were found to be
among the 10 superfolds. Whenever a nonsuperfold structure is assigned to a new gene, however, the chances are
that the functions of the template protein and the gene
are indeed similar.

Figure 1
(a) The X-ray structure of HDEA (PDB code 1BG8) is shown using the
program MOLSCRIPT [55]. This structure is interesting because, despite
the fact that a highly resolved structure was obtained experimentally
[16] and a reasonable 3D model for the protein was obtained by ab
initio prediction [50•], the function of the protein remains unknown. Not
only is HDEA a sequence orphan but also, at the time of writing, no
significant similarity can be found between the fold of HDEA and that of
any other protein. (b) The structure of the protein encoded by
M. jannaschii ORF MJ0577 (PDB code 1MJH) is shown [18]. Several
of the incorporated structural motifs are known to be associated with
ATP binding. ATP was found in the X-ray structure and the binding of
ATP was confirmed experimentally. (c) The X-ray structure of the yjgF
gene product from E. coli (PDB code 1QU9) revealed that the fold of
this protein is similar to that of chorismate mutase from Bacillus subtilis.
However, the similarity did not, in this case, help to reveal anything
about the function of the gene, as the active sites were clearly different.

Other attempts at genomic functional assignment by
means of structure determination have been documented
[17,18]. In the first of these cases [17], despite a structural resemblance to chorismate mutase, no similarity was
observed between their active sites and the crystal structure of the yjgF gene product from E. coli revealed very
few clues to the protein’s function. In the second case
[18], the crystal structure of Methanococcus jannaschii ORF
MJ0577 not only revealed the presence of bound ATP
(suggesting a probable ATPase or an ATP-mediated molecular switch), but also incorporated several structural
motifs known to be associated with ATP binding.
Although this is a positive example whereby structural
studies have revealed functional information, the fact
that part of the functional characterisation was based on
the presence of a co-crystallised ATP means that this
result is less applicable to the case of structural prediction
and fold assignment.
As we have seen, under the right circumstances, assigning
a correct fold to a gene product can provide significant
clues as to its function. This assumes, of course, that the
fold has already been associated with a known function.
Fortunately, the vast majority of proteins with known 3D
structures belong to well-characterised families for which a
lot of biochemical knowledge has been collected. The various structural genomics initiatives may, however, start to
change this picture. Perhaps the first hint of things to come
was the recently determined structure of the Escherichia
coli protein HDEA by Yang et al. [16]. The structure of this
protein was actually solved by accident, as it turned out to
have an almost identical molecular weight to the protein
the crystallographers were trying to investigate. Despite
this, the HDEA structure (Figure 1) offers a stark lesson to
both experimentalists and theoreticians alike, as it is a protein of entirely unknown function. Worse still, HDEA is
currently a sequence orphan and so techniques such as
evolutionary profiling [6••,7••] are not applicable.
Of course, in the absence of similarity at the level of
sequence or structure to proteins of known function, the
possibility remains that the function of a protein might be
deduced from an analysis of its 3D structure alone. Several
ideas have been put forward for trying to identify particular arrangements of atom groups that might form an active
site in a given structure [19–22].
Russell et al. [20] have looked at the spatial arrangement of
active site residues in different fold families and have
found that the distributions are nonrandom. Despite the
detailed differences in the active sites of the different fold
families, the active sites are often located in a similar
region of the fold. The authors call these regions ‘supersites’ and suggest that certain geometric features of a
particular fold make certain regions of the fold more likely
to form a binding site than other regions. At a more
detailed level, Russell et al. [20] also identify similarities in
the active site geometry of enzymes with similar functions,
but different folds. They suggest that this should permit
the creation of active site templates that might allow the
recognition of the active site in a protein structure of
unknown function. Wallace et al. [19] have also suggested
a very similar idea.
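The notion of an active site template can be made concrete with a small geometric search. The sketch below is an assumption-laden illustration, not the method of [19] or [20]: the three-residue template, its distances and the tolerance are invented, and each residue is reduced to a single representative atom.

```python
from itertools import permutations
from math import dist

# Hypothetical three-residue template: residue types plus the pairwise
# distances (in angstroms) between representative atoms.
TEMPLATE_TYPES = ("SER", "HIS", "ASP")
TEMPLATE_DISTANCES = {(0, 1): 3.0, (1, 2): 4.0, (0, 2): 5.0}


def matches_template(residues, tolerance=0.5):
    """Return triples of residue indices whose types and geometry fit the template.

    `residues` is a list of (residue_type, (x, y, z)) tuples.
    """
    hits = []
    for triple in permutations(range(len(residues)), 3):
        types = tuple(residues[i][0] for i in triple)
        if types != TEMPLATE_TYPES:
            continue
        ok = all(
            abs(dist(residues[triple[a]][1], residues[triple[b]][1]) - d) <= tolerance
            for (a, b), d in TEMPLATE_DISTANCES.items()
        )
        if ok:
            hits.append(triple)
    return hits


# Toy structure: the first three residues form a 3-4-5 right triangle.
structure = [
    ("SER", (0.0, 0.0, 0.0)),
    ("HIS", (3.0, 0.0, 0.0)),
    ("ASP", (3.0, 4.0, 0.0)),
    ("ASP", (20.0, 20.0, 20.0)),
]
site_hits = matches_template(structure)
print(site_hits)
```

Matching on distances alone cannot distinguish a site from its mirror image, and the exhaustive search over triples scales poorly; both points would need attention in a serious implementation.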
Fetrow et al. [21,22] have also proposed a method for identifying the function of a given protein structure. In their
case, however, they explicitly apply the technique to predicted protein structures. As a proof of concept, Fetrow
et al. generated a ‘fuzzy template’ for the thiol/disulfide
oxidoreductase activity of the glutaredoxin/thioredoxin
protein family and used this template to assess models
produced by a fold recognition method applied to ORFs
found in E. coli.
Fold recognition methods for genome analysis
Given the potential benefits of assigning a correct structure to a newly discovered gene product, it is not surprising
that several groups have applied existing fold recognition
methods to genome analysis. These methods can be classified into roughly three classes: sequence profile methods
(e.g. [1,2,23]), structural (3-D-1-D) profile methods
(e.g. [24,25]) and threading methods (e.g. [26–28]).
The first attempt at assigning folds to genome sequences
made use of a structural profile method. Fischer and
Eisenberg [29] used a development of the original
3-D-1-D profile method [24] to assign folds to the ORFs
found in Mycoplasma genitalium, the smallest known bacterial genome. They found that 16% of the ORFs could
be assigned to a known fold by means of simple sequence
comparison and that an additional 6% could be assigned
to a known fold at high confidence using their fold recognition method.
Despite the fact that many different threading (pair potential based fold recognition) methods have been developed,
only a single attempt at applying these methods to genome
analysis has been described [30]. Grandori applied the
ProFit method [28] to analyse the ORFs in
Mycoplasma pneumoniae, a slightly larger genome than
M. genitalium. In this instance, to save time, proteins that
could be matched to known structures by simple sequence
comparison were excluded from the analysis, along with
proteins longer than 200 residues. This latter restriction
was to remove multidomain proteins from consideration.
Of the 124 ORFs remaining, Grandori was able to recognise folds for 12, giving a recognition rate of 10%.
Interestingly, a number of disagreements were reported
when these results were compared with the results from
Fischer and Eisenberg [29] (by identifying M. pneumoniae
homologues in M. genitalium). This is not surprising given
the relatively low overall reliability of pure fold recognition
methods, but is surprising because, in some cases, both
predictions were apparently very significant.
Although Fischer and Eisenberg [29] and Grandori [30]
performed basic sequence comparisons to detect obvious
homologues to known structures, it is clear that the possibility existed that better sequence comparison techniques
could have been applied and that these techniques could
have assigned a greater number of folds to ORFs. In
answer to this, a number of groups [31–34] have used
PSI-BLAST [1], which is an iterative sequence profile
method based on the standard gapped BLAST method [1].
PSI-BLAST has proven to be not only a very sensitive
sequence comparison method, but also very reliable. To
get the best results from PSI-BLAST, it should be used in
both ‘directions’, as shown by Teichmann et al. [31].
Normally, each ORF is scanned against a set of precalculated PSI-BLAST profiles. Although these profiles are
slow to calculate, this process only has to be done once for
each sequence of known structure. Assigning folds to
ORFs using this procedure is thus fairly efficient. To
achieve the second search direction, a PSI-BLAST profile
must be calculated for each ORF and this profile can be
scanned against a library of sequences relating to known
structures. Given that the calculation of a single
PSI-BLAST profile takes 10 minutes on average using a
modern work station, for large genomes, this second
approach is very computationally expensive and relatively
impractical. Despite this disadvantage, extra matches can
be found when both searches are carried out.
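The trade-off between the two search directions can be sketched as follows. The hit sets are invented and the function names are hypothetical; only the figure of 10 minutes per profile and the rough count of 6000 yeast ORFs come from the text.

```python
MINUTES_PER_PROFILE = 10  # average figure quoted in the text


def profile_cost_hours(n_sequences, minutes_per_profile=MINUTES_PER_PROFILE):
    """Estimate the time needed to build one PSI-BLAST profile per sequence."""
    return n_sequences * minutes_per_profile / 60


def combine_directions(structure_profile_hits, orf_profile_hits):
    """Merge ORF -> fold assignments from both search directions.

    Each argument maps an ORF name to the set of folds matched in that direction.
    """
    combined = {}
    for orf in set(structure_profile_hits) | set(orf_profile_hits):
        combined[orf] = structure_profile_hits.get(orf, set()) | orf_profile_hits.get(orf, set())
    return combined


# Hypothetical hits for three ORFs.
forward = {"orf1": {"TIM barrel"}, "orf2": {"Rossmann fold"}}
reverse = {"orf2": {"Rossmann fold"}, "orf3": {"ferredoxin fold"}}
merged = combine_directions(forward, reverse)
extra = sorted(set(merged) - set(forward))

print(extra)                     # ORFs matched only by the second direction
print(profile_cost_hours(6000))  # roughly 6000 yeast ORFs
```

At 10 minutes per profile, a genome of about 6000 ORFs needs about 1000 hours, or roughly six weeks, on a single workstation, which is why the second direction is described as relatively impractical for large genomes.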
Rychlewski et al. [35,36] attempted to exploit this asymmetry in PSI-BLAST profile comparisons using a
comparison method based on the alignment of one profile with another. Their method, BASIC, requires that
profiles be computed for each sequence in the 3D structure library and also for each ORF. These two sets of
profiles are then compared by means of a local dynamic
programming algorithm.
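A profile-profile comparison by local dynamic programming can be sketched as below. This is not the BASIC algorithm itself, whose scoring scheme is not described here: the dot-product column score, the baseline shift and the linear gap penalty are all assumptions chosen for brevity.

```python
def column_score(col_a, col_b, shift=0.5):
    """Score two profile columns by their dot product minus a baseline.

    Columns are dicts mapping residue letters to frequencies. The shift makes
    dissimilar columns score negative, which local alignment requires.
    """
    dot = sum(freq * col_b.get(res, 0.0) for res, freq in col_a.items())
    return dot - shift


def local_profile_alignment(profile_a, profile_b, gap=1.0):
    """Return the best Smith-Waterman-style local score between two profiles."""
    rows, cols = len(profile_a), len(profile_b)
    # score[i][j] is the best local alignment ending at columns i-1 and j-1.
    score = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            match = score[i - 1][j - 1] + column_score(profile_a[i - 1], profile_b[j - 1])
            score[i][j] = max(0.0, match, score[i - 1][j] - gap, score[i][j - 1] - gap)
            best = max(best, score[i][j])
    return best


# Hypothetical three-column profiles; the last two columns of each agree.
orf_profile = [{"A": 1.0}, {"L": 0.8, "I": 0.2}, {"G": 1.0}]
fold_profile = [{"W": 1.0}, {"L": 0.7, "V": 0.3}, {"G": 0.9, "A": 0.1}]
best_score = local_profile_alignment(orf_profile, fold_profile)
print(round(best_score, 2))
```

Because both inputs are profiles, the comparison is symmetric: swapping the two arguments gives the same score, which is the property that removes the asymmetry of a sequence-against-profile search.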
Jones [37••] has developed a hybrid method for assigning
folds to genome sequences, called GenTHREADER.
GenTHREADER uses a traditional sequence profile
alignment algorithm to generate alignments that are evaluated by a method derived from threading techniques. As a
final step, the alignment scores and threading energy sums
for each threaded model are evaluated by a neural network
in order to produce a single measure of confidence in the
proposed prediction. The method was applied to the
genome of M. genitalium; analysis of the results showed that
as many as 46% of the proteins derived from the predicted
protein coding regions had a significant relationship to a
protein of known structure.
Figure 2
[Bar chart. Vertical axis: percentage alignment accuracy (0 to 80). Horizontal axis: CASP3 target (T0081, T0068, T0074, T0054, T0083, T0046, T0079, T0044, T0071, T0067, T0059, T0053, T0085, T0077, T0063, T0043).]
This graph shows the percentage alignment
accuracy of the best fold recognition
predictions submitted to CASP3 (white bars)
compared with the best percentage alignment
accuracy obtained using a sequence
comparison method (black bars). Alignment
accuracy is defined as the percentage of
alignment positions in exact register with the
alignment obtained from structural
superposition of the target and template
protein structures. An accuracy of 100%
would indicate that the fold recognition
method produced an alignment in exact
agreement with the structural alignment. Note
that some methods could not be accurately
classified from the deposited information.
More recently, GenTHREADER predictions have been
compiled for 25 complete genomes and have been
stored in a Web-accessible database at URL: http://globin.bio.warwick.ac.uk/genomes.
Third Critical Assessment in Structure Prediction
If prediction methods are going to be accepted by the biology community at large, then it is essential that the
reliability of these methods be assessed in such a way as to
convince a sceptical audience. Despite the fact that individual authors of automatic prediction methods do attempt
both to properly benchmark their methods and to provide
useful measures of confidence alongside their predictions,
there still remains the possibility that the published results
are somewhat better than might be expected in cases in
which the true structure is unknown. The recent third
Critical Assessment in Structure Prediction (CASP3)
experiment was carried out in 1999 and allows some indication to be gained concerning the reliability of truly blind
predictions using different methods.
Detailed results from the experiment have been published in a special issue of Proteins [38••], along the same
lines as previous special issues. The raw data from the
CASP3 evaluation are also available on the Internet
(URL: http://predictioncenter.llnl.gov).
Comparative modelling
Again, there seem to have been very few significant developments in the actual process of comparative modelling in
recent years; this was apparent from the remarkable homogeneity of the methods used at CASP3. One issue that
perhaps did emerge from the CASP3 meeting was that
some progress seems to have been made in terms of the
full automation of comparative modelling. Ideally, for
genome annotation, it should be possible to make
sequence alignments and build good quality 3D models
from these alignments without requiring human intervention. Sanchez and Sali [39•] have already shown that a large
fraction of the yeast genome can be automatically modelled by homology to known 3D structures using their
program MODELLER but, so far, progress has been limited to ORFs with relatively high sequence similarity to
the template protein structures. Although at CASP3 the
better comparative modelling results were obtained from
groups that employed a number of nonautomatic refinement techniques [40,41], several groups did manage to
produce remarkably good models for quite distantly related target–template pairs using sequence profile methods
[42] or fold recognition methods [43].
Fold recognition
In the absence of suitable homologous template structures
with which to build a model for a given sequence, fold
recognition methods provide another option for constructing useful tertiary structural models. It was clear at CASP3
that these methods are now beginning to mature, but it
was also clear that such methods seem to have reached a
plateau in terms of the accuracy of the models that are produced. Although there are clearly many ways of
representing the results from CASP3 (as will be clear
from the many different suggested schemes documented
in the special issue of Proteins), perhaps the main ‘take-home message’ from the fold recognition results can be
seen in Figure 2. In this figure, the highest alignment
accuracy from any of the assessed fold recognition methods
is plotted alongside the alignment accuracy from the best
sequence-based method (e.g. HMMs or PSI-BLAST). As
can be seen, in only three cases were alignments produced
with more than 50% of the alignment positions in exact
agreement with an alignment based on a structural superposition of the two structures and, in all three cases, the
alignments produced by the sequence-based method were
almost as accurate as the best threading method. In five
cases (T0063, T0077, T0053, T0079 and T0046), there
was clearly some benefit to be gained from using a threading method, as at least partially correct models were
obtained in situations in which sequence profile methods
were of little or no use. Nevertheless, the conclusion we
have to take from this very crude analysis is that, in order
to produce reasonably accurate alignments (and, consequently, accurate 3D models), there must be some
detectable sequence similarity between the target protein
and at least one template protein of known 3D structure.
Although the sample sizes in the analysis are low, it would
appear that, in cases in which there is at least some
detectable sequence similarity, fold recognition methods
based on PSI-BLAST or on other kinds of sequence profiles are currently sufficient to build reliable models.
Beyond these cases, fold recognition methods not reliant
on sequence alignment (i.e. true threading methods) are
much more limited with regard to the accuracy of the
alignments they can produce, but even these relatively
poor models may be enough to gain some insight into the
function of a new gene sequence. As discussed above,
even fold recognition methods that are able to correctly
recognise folds, but are entirely incapable of producing
sensible alignments may offer some advantage in the narrowing down of potential gene functions.
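The alignment accuracy measure used in Figure 2 can be sketched as a comparison of two residue-to-residue mappings. The data below are invented, and taking the length of the structural alignment as the denominator is an assumption, since the text does not state it explicitly.

```python
def alignment_accuracy(predicted, structural):
    """Percentage of structurally aligned positions reproduced in exact register.

    Each alignment maps a target residue index to the template residue index
    it is aligned with.
    """
    if not structural:
        raise ValueError("structural alignment is empty")
    correct = sum(1 for t, s in structural.items() if predicted.get(t) == s)
    return 100.0 * correct / len(structural)


# Hypothetical alignments: the prediction is shifted by one residue after position 2.
structural = {1: 10, 2: 11, 3: 12, 4: 13}
predicted = {1: 10, 2: 11, 3: 13, 4: 14}
accuracy = alignment_accuracy(predicted, structural)
print(accuracy)
```

The measure is deliberately strict: a shift of a single residue counts as wrong, so a model with the correct fold can still score poorly if its alignment drifts out of register.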
Critical Assessment of Fully Automated Structure Prediction
One aspect of the CASP process that became more apparent at CASP3 was the difficulty in separating entirely
automatic predictions from those that have been made
using various degrees of human intervention. The predictions made by my own group certainly cover the spectrum
between those generated entirely automatically and those
for which the final choice of model from a shortlist of alternatives was made using our own ‘biological neural
networks’. On a case by case basis, of course, it is always
better to use prediction tools as guides, rather than simple
black boxes. Biology can frequently provide important
clues to the best model to select, as was demonstrated very
ably by the predictions made by Murzin and Bateman at
CASP2 [44]. In the context of genome annotation, however, such manual approaches to protein structure prediction
are clearly limited. For prediction methods to be useful on
a large scale, they clearly must be automatic.
Fischer et al. [45••] have attempted to address this issue
by creating a subsection of the CASP process, called
CAFASP (Critical Assessment of Fully Automated
Structure Prediction). The basic idea of CAFASP is to
evaluate Web-based prediction tools in a fully automated
fashion, thus eliminating the possibility of human assistance in the prediction process. The first CAFASP was
carried out shortly after the CASP3 meeting and must
therefore be considered a pilot study, as the predictions
were not blind. Nevertheless, the results were interesting
and the process allowed some technical issues to be
resolved in good time for CAFASP2, which will be an
official part of CASP4.
Ab initio methods
The ultimate goal of protein structure prediction is to
predict the structure of a protein without any reference to
known structures. If fold recognition methods are not
able to find a suitable template structure with which to
build a model for a gene product, in an ideal world it
would be possible to apply an ab initio prediction method
to build an approximate model. For this type of prediction, CASP3 provides a good snapshot of the current
state-of-the-art.
At the lowest level, the ab initio prediction category
encompasses simple secondary structure prediction.
Although no new methods have been developed in this
area, by making use of sensitive sequence alignment
methods (PSI-BLAST and HMMs again) definite
progress was evident in the level of accuracy one can
expect for secondary structure prediction. The most accurate method evaluated at CASP3, PSIPRED [46••], was
developed in my own laboratory and comprises a pair of
feed-forward neural networks designed to analyse the profiles generated by PSI-BLAST. Using this straightforward
approach, an average Q3 (three-state prediction accuracy)
score of 77% was achieved over all the submitted predictions. This compares well with the results obtained from
benchmarking [46••], where PSIPRED was found to have
a per residue Q3 score of 76.5% (per chain 76.0 ± 7.8%).
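The difference between the per residue and per chain Q3 figures comes down to how the average is taken. A minimal sketch, using invented three-state strings (H, E and C) rather than any PSIPRED output:

```python
def q3_score(predicted, observed):
    """Percentage of residues whose three-state (H, E, C) label is predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def per_residue_and_per_chain(chains):
    """Return (per-residue Q3, mean per-chain Q3) for (predicted, observed) pairs."""
    total = sum(len(obs) for _, obs in chains)
    correct = sum(sum(p == o for p, o in zip(pred, obs)) for pred, obs in chains)
    per_chain = [q3_score(pred, obs) for pred, obs in chains]
    return 100.0 * correct / total, sum(per_chain) / len(per_chain)


# Hypothetical predictions for two chains of different lengths.
chains = [
    ("HHHHCCEEEE", "HHHHCCEEEC"),  # 9 of 10 correct
    ("CCHH", "CCCC"),              # 2 of 4 correct
]
per_residue, per_chain = per_residue_and_per_chain(chains)
print(round(per_residue, 1), round(per_chain, 1))
```

The per residue figure pools all residues, so long chains dominate it, whereas the per chain figure weights every chain equally; this is why the two values differ whenever accuracy varies with chain length.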
Much more interesting than secondary structure prediction is, of course, the ab initio prediction of protein tertiary
structure. At CASP1 and CASP2, very little success was
demonstrated in this area. At CASP2, the best of these ab initio predictions
came from my own laboratory [47] and were close enough
(Cα RMSD of 6.2 Å) for us to confidently claim that the
fold was correctly reproduced in the model. This prediction was generated by a Monte Carlo approach, whereby
fragments of highly resolved protein structures are joined
together and the resulting chain conformations evaluated
using a potential function. Interestingly, these ideas had
been developed further by CASP3 (e.g. [48]) and this
kind of approach to folding proteins was nicknamed
‘mini-threading’ by some predictors. This was to distinguish such knowledge-based prediction methods from
methods that attempt to simulate protein folding using
physical principles.
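The fragment-joining Monte Carlo idea can be sketched in a toy form. Nothing below reproduces the method of [47] or [48]: the fragment library, the torsion-angle representation and, in particular, the energy function are hypothetical stand-ins chosen so that the example is self-contained.

```python
import math
import random

# Hypothetical fragment library: each fragment is three backbone torsion values
# (degrees) imagined to come from highly resolved structures.
FRAGMENTS = [(-60.0, -60.0, -60.0), (-120.0, -120.0, -120.0), (-60.0, -120.0, -60.0)]
# Hypothetical target used only to define a toy energy function.
TARGET = [-60.0] * 9


def toy_energy(conformation):
    """Stand-in for a knowledge-based potential: squared deviation from TARGET."""
    return sum((a - b) ** 2 for a, b in zip(conformation, TARGET)) / 1000.0


def fragment_assembly(n_steps=500, temperature=1.0, seed=0):
    """Metropolis Monte Carlo search in which each move inserts a library fragment."""
    rng = random.Random(seed)
    current = [-120.0] * len(TARGET)
    current_e = toy_energy(current)
    start_e = current_e
    best, best_e = list(current), current_e
    for _ in range(n_steps):
        fragment = rng.choice(FRAGMENTS)
        pos = rng.randrange(len(current) - len(fragment) + 1)
        trial = list(current)
        trial[pos:pos + len(fragment)] = fragment
        trial_e = toy_energy(trial)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if trial_e <= current_e or rng.random() < math.exp((current_e - trial_e) / temperature):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return start_e, best, best_e


start_energy, best_conformation, best_energy = fragment_assembly()
print(start_energy, best_energy)
```

The essential point is that every trial conformation is built from pieces of known structures, which is what separates this knowledge-based search from a simulation of folding based on physical principles. In a real method the toy energy would be replaced by a potential function that does not depend on knowing the answer.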
Figure 3
[Plot. Vertical axis: RMSD (Å), 0 to 20. Horizontal axis: percentage of structure.]
Plot of RMSD against fragment size (given as a percentage of the total
number of residues in the polypeptide chain) for the best ab initio
prediction model submitted to CASP3 for protein HDEA (target
T0061; Figure 1a) [50•].

Blue gene
As an interesting footnote to the above discussion of
developments in ab initio prediction methods, the development of a ‘petaflop’ computer, called ‘Blue Gene’,
aimed at tackling the protein folding problem was recently announced by IBM [52].

Conclusions
This review has hopefully shown that protein structure
prediction does have a role to play in the annotation of
genome sequences. In many cases, the correct assignment
of a fold to a novel gene can provide useful clues as to its
likely function.

So was clear evidence of progress in the ab initio prediction field demonstrated at CASP3? Certainly, it can be
said that some interesting predictions were submitted
and, in one or two cases, some useful features of the
overall chain folds were correctly predicted. When the
actual evaluation statistics are taken into account, however, it is difficult to identify any of the submitted
models as having enough accuracy to be useful for routine genome analysis. The most commonly used statistic
for evaluating the quality of a 3D model is the RMSD
between the model and the correct experimentally
determined structure. Considering the whole chain Cα
RMSD values, none of the ab initio models submitted to
CASP3 approached the ‘correct fold’ threshold of
6.0 Å; in fact, on this basis, none of the models submitted achieved an RMSD lower than 8.0 Å. Better news
was obtained when substructures were considered. At
CASP3, a new way of visualising structural similarity
over different window lengths was used [49]. As an
example, the results for the best prediction of target
T0061 (HDEA) [50•] are shown in Figure 3. Notice that
the RMSD for this prediction is less than 4.5 Å for
around 80% of the model. Clearly, in this case, despite
the poor RMSD for the whole model, the best subset of
the model is much more impressive and it would be reasonable to class this prediction as an overall success.
In a previous review [51], published three years ago, I
made the following statement: “…it is very clear that there
is still a long way to go for these (ab initio) approaches to
structure prediction, and it is unlikely that a practical tool
based on this kind of method will be available within less
than five years”. Five years is, of course, a long time in a
rapidly developing field, but on the basis of the CASP3
results as a whole, there seems little reason to amend the
statement. Certainly, no single method exists that can be
relied upon to successfully fold proteins on a routine basis.
It is certainly impressive to see even a few results of the
quality shown in Figure 3, but this result must be considered the exception, rather than the rule. On this basis, it is
very hard to envisage ab initio folding methods finding significant applications in genome analysis — certainly in the
short term at least.
It seems that ab initio methods of protein structure prediction, although sometimes achieving interesting results
for fragments of proteins, are unlikely to be used for
genome analysis. The success of ab initio methods has
never been tested rigorously on a large number of test
cases and so the chance of finding a reasonable model for
a target protein is unknown. However, results from the
three CASP experiments suggest that the chance of any
single method producing a reasonable model for a given
sequence is very low.
Fold recognition methods, on the other hand, have now
reached a point at which reasonable models for target proteins can be found on a routine basis, particularly if a
homologous template structure can be found. Not surprisingly, therefore, different fold recognition methods have
already been applied to the problem of assigning folds to
genome sequences. The simplest and most reliable predictions are based purely on sequence similarity and, in
particular, PSI-BLAST [1] is proving to be a valuable tool
for detecting distant homologous relationships among protein sequences. At the other extreme, fold recognition
methods that tend to ignore sequence similarity and make
use of structural information have also been applied, but
with somewhat less success. Hybrid methods, which combine sequence comparison and fold recognition
techniques, are likely to be an important development in
this area. Such methods can detect homologous relationships just beyond the detection threshold of methods such
as PSI-BLAST, as is evident from the results for the first
such method to be developed [37••].
In a recent review, Teichmann et al. [53•] compared the
results from several attempts to assign folds to the M. genitalium genome. Being the smallest bacterial genome,
M. genitalium provides a useful benchmark for different
approaches to fold assignment, as most groups have made
predictions for this genome. Although the different methods were
found to agree to a high degree, some assignments were not made by
all methods.
This suggests that to maximise success in assigning folds
to genomes, some kind of consensus of methods might
be useful. At present, this is difficult as there are no
agreed standards for how structural annotations should
be represented. The PRESAGE database created by
Brenner et al. [54] is a useful first attempt at providing an
archive for different genome-related fold assignments,
but such an archive is not a substitute for agreed standards among different groups working in this field. It is
vital that some kind of agreement is reached on how best
to exchange data among different fold assignment projects, before the problem becomes unmanageable.
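As a rough illustration of what a consensus over fold assignment methods might look like once a shared representation exists, the sketch below takes a simple majority vote per ORF. The data layout (one mapping of ORF identifier to fold name per method), the method names, the ORF identifiers and the `min_votes` threshold are all hypothetical; none of them corresponds to an agreed standard or to the PRESAGE schema.

```python
from collections import Counter

def consensus_folds(assignments, min_votes=2):
    """Return {orf: fold} for ORFs where one fold wins at least min_votes votes.

    assignments maps a method name to a dict of {orf_id: fold_name}.
    ORFs with a tie for first place, or with too few agreeing methods,
    are left unassigned rather than guessed.
    """
    votes = {}
    for orf_to_fold in assignments.values():
        for orf, fold in orf_to_fold.items():
            votes.setdefault(orf, Counter())[fold] += 1

    consensus = {}
    for orf, counter in votes.items():
        ranked = counter.most_common(2)
        top_fold, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        if top_count >= min_votes and not tied:
            consensus[orf] = top_fold
    return consensus

# Hypothetical predictions from three hypothetical methods.
example_assignments = {
    "method_a": {"ORF1": "TIM barrel", "ORF2": "Rossmann fold"},
    "method_b": {"ORF1": "TIM barrel", "ORF2": "ferredoxin-like"},
    "method_c": {"ORF1": "TIM barrel", "ORF3": "globin"},
}
example_consensus = consensus_folds(example_assignments)
```

Even this trivial scheme depends on every group using the same ORF identifiers and the same fold names, which is exactly the kind of agreement the text argues is still missing.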
References and recommended reading
Papers of particular interest, published within the annual period of review,
have been highlighted as:
• of special interest
•• of outstanding interest
1. Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W,
Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res 1997,
25:3389-3402.
2. Eddy SR: Hidden Markov models. Curr Opin Struct Biol 1996,
6:361-365.
3. Bork P, Koonin EV: Predicting functions from protein sequences –
where are the bottlenecks? Nat Genet 1998, 18:313-318.
4. • Fischer D, Eisenberg D: Finding families for genomic ORFans.
Bioinformatics 1999, 15:759-762.
Sequence orphans (unique functionally uncharacterised sequences) represent the toughest challenges for theoretical approaches to genome analysis.
It is not clear how quickly the number of sequence orphans in different
genomes will decrease. The results in this paper suggest that we will be facing the problem of sequence orphans for a long time.
5. Kowalczuk M, Mackiewicz P, Gierlik A, Dudek MR, Cebrat S: Total
number of coding open reading frames in the yeast genome.
Yeast 1999, 15:1031-1034.
6. •• Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D:
A combined algorithm for genome-wide prediction of protein
function. Nature 1999, 402:83-86.
This paper describes several methods for defining the function of a gene
product by theoretical means and also for identifying possible protein–protein interactions in which the encoded proteins take part. Importantly, these
methods do not rely upon knowing the function of a related gene.
7. •• Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein
interaction maps for complete genomes based on gene fusion
events. Nature 1999, 402:86-90.
This paper describes how identifying a sequence in which the homologues
of two domains are found fused together can indicate that the two domains
are involved in protein–protein interactions. This is similar to the theory
behind one of the methods described in the previous paper [6••], whereby
the fused sequence is called a ‘Rosetta sequence’.
8. Editorial: Money for structural genomics. Nat Struct Biol 1999,
6:707-708.
9. Shapiro L, Lima CD: The Argonne Structural Genomics Workshop:
Lamaze class for the birth of a new science. Structure 1998,
6:265-267.
10. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB,
Thornton JM: CATH – a hierarchic classification of protein domain
structures. Structure 1997, 5:1093-1108.
11. Sussman JL, Lin DW, Jiang JS, Manning NO, Prilusky J, Ritter O,
Abola EE: Protein Data Bank (PDB): database of three-dimensional
structural information of biological macromolecules. Acta
Crystallogr D 1998, 54:1078-1084.
12. • Thornton JM, Orengo CA, Todd AE, Pearl FMG: Protein folds,
functions and evolution. J Mol Biol 1999, 293:333-342.
This paper documents the distribution of enzyme functions (as described by
enzyme classification [EC] numbers) in the CATH structure classification database.
13. • Hegyi H, Gerstein M: The relationship between protein structure
and function: a comprehensive survey with application to the
yeast genome. J Mol Biol 1999, 288:147-164.
A slightly different approach to the evaluation of the relationship between
fold and function is described. In this case, the SWISS-PROT databank is
used to extend the size of the dataset and the SCOP structure classification
scheme is primarily used.
14. Martin AC, Orengo CA, Hutchinson EG, Jones S, Karmirantzou M,
Laskowski RA, Mitchell JB, Taroni C, Thornton JM: Protein folds and
functions. Structure 1998, 6:875-884.
15. Orengo CA, Jones DT, Thornton JM: Protein superfamilies and
domain superfolds. Nature 1994, 372:631-634.
16. Yang F, Gustafson KR, Boyd MR, Wlodawer A: Crystal structure of
Escherichia coli HdeA. Nat Struct Biol 1998, 5:763-764.
17. Volz K: A test case for structure-based functional assignment: the
1.2 Å crystal structure of the yjgF gene product from Escherichia
coli. Protein Sci 1999, 8:2428-2437.
18. Zarembinski TI, Hung LW, Mueller-Dieckmann HJ, Kim KK, Yokota H,
Kim R, Kim SH: Structure-based assignment of the biochemical
function of a hypothetical protein: a test case of structural
genomics. Proc Natl Acad Sci USA 1998, 95:15189-15193.
19. Wallace AC, Borkakoti N, Thornton JM: TESS: a geometric hashing
algorithm for deriving 3D coordinate templates for searching
structural databases. Application to enzyme active sites. Protein
Sci 1997, 6:2308-2323.
20. Russell RB, Sasieni PD, Sternberg MJE: Supersites within
superfolds. Binding site similarity in the absence of homology.
J Mol Biol 1998, 282:903-918.
21. Fetrow JS, Godzik A, Skolnick J: Functional analysis of the
Escherichia coli genome using the sequence-to-structure-to-function
paradigm: identification of proteins exhibiting the
glutaredoxin/thioredoxin disulfide oxidoreductase activity. J Mol
Biol 1998, 282:703-711.
22. Fetrow JS, Skolnick J: Method for prediction of protein function
from sequence using the sequence-to-structure-to-function
paradigm with application to glutaredoxins/thioredoxins and T-1
ribonucleases. J Mol Biol 1998, 281:949-968.
23. Overington J, Donnelly D, Johnson MS, Sali A, Blundell TL:
Environment-specific amino-acid substitution tables – tertiary
templates and prediction of protein folds. Protein Sci 1992, 1:216-226.
24. Bowie JU, Lüthy R, Eisenberg D: A method to identify protein
sequences that fold into a known three-dimensional structure.
Science 1991, 253:164-170.
25. Ouzounis C, Sander C, Scharf M, Schneider R: Prediction of protein
structure by evaluation of sequence-structure fitness. Aligning
sequences to contact profiles derived from three-dimensional
structures. J Mol Biol 1993, 232:805-825.
26. Jones DT, Taylor WR, Thornton JM: A new approach to protein fold
recognition. Nature 1992, 358:86-89.
27. Bryant SH, Lawrence CE: An empirical energy function for
threading protein-sequence through the folding motif. Proteins
1993, 16:92-112.
28. Flöckner H, Braxenthaler M, Lackner P, Jaritz M, Ortner M, Sippl MJ:
Progress in fold recognition. Proteins 1995, 23:376-386.
29. Fischer D, Eisenberg D: Assigning folds to the proteins encoded by
the genome of Mycoplasma genitalium. Proc Natl Acad Sci USA
1997, 94:11929-11934.
30. Grandori R: Systematic fold recognition analysis of the sequences
encoded by the genome of Mycoplasma pneumoniae. Protein Eng
1998, 11:1129-1135.
31. Teichmann SA, Park J, Chothia C: Structural assignments to the
Mycoplasma genitalium proteins show extensive gene
duplications and domain rearrangements. Proc Natl Acad Sci USA
1998, 95:14658-14663.
32. Huynen M, Doerks T, Eisenhaber F, Orengo C, Sunyaev S, Yuan YP,
Bork P: Homology-based fold predictions for Mycoplasma
genitalium proteins. J Mol Biol 1998, 280:323-326.
33. Wolf YI, Brenner SE, Bash PA, Koonin EV: Distribution of protein folds
in the three superkingdoms of life. Genome Res 1999, 9:17-26.
34. Salamov AA, Suwa M, Orengo CA, Swindells MB: Genome
analysis: assigning protein coding regions to three-dimensional
structures. Protein Sci 1999, 8:771-777.
35. Rychlewski L, Zhang BH, Godzik A: Fold and function predictions
for Mycoplasma genitalium proteins. Fold Des 1998, 3:229-238.
36. Rychlewski L, Zhang BH, Godzik A: Functional insights from
structural predictions: analysis of the Escherichia coli genome.
Protein Sci 1999, 8:614-624.
37. •• Jones DT: GenTHREADER: an efficient and reliable protein fold
recognition method for genomic sequences. J Mol Biol 1999,
287:797-815.
GenTHREADER represents a fold recognition method that combines features of sequence profile comparison and threading. In testing, it was able to
assign folds to a greater fraction of the M. genitalium open reading frames at
high confidence (1% false positive rate) than other fold assignment methods.
38. •• Moult J, Hubbard T, Fidelis K, Pedersen JT: Critical assessment of
methods of protein structure prediction (CASP): round III. Proteins
1999, S3:2-6.
This special issue of Proteins relating to the third CASP experiment is a
‘must read’ for anyone wishing to know what are the current abilities and limitations of modern protein structure prediction methods.
39. • Sanchez R, Sali A: Comparative protein structure modeling in
genomics. J Comp Phys 1999, 151:388-401.
This paper describes how comparative modelling techniques can be applied
to large-scale genome analysis. In particular, the authors’ method allows the
quality of a model to be predicted.
40. Jones TA, Kleywegt GJ: CASP3 comparative modeling evaluation.
Proteins 1999, S3:30-46.
41. Bates PA, Sternberg MJE: Model building by comparison at
CASP3: using expert knowledge and computer automation.
Proteins 1999, S3:47-54.
42. Dunbrack RL: Comparative modeling of CASP3 targets using
PSI-BLAST and SCWRL. Proteins 1999, S3:81-87.
43. Fischer D: Modeling three-dimensional protein structures for
amino acid sequences of the CASP3 experiment using
sequence-derived predictions. Proteins 1999, S3:61-65.
44. Murzin AG, Bateman A: Distant homology recognition using
structural classification of proteins. Proteins 1997, S1:105-112.
45. •• Fischer D, Barret C, Bryson K, Elofsson A, Godzik A, Jones D,
Karplus KJ, Kelley LA, MacCallum RM, Pawlowski K et al.: CAFASP-1:
critical assessment of fully automated structure prediction
methods. Proteins 1999, S3:209-217.
Fully automatic prediction methods are essential if they are to be applied to
genome analysis. This paper describes a ‘pilot study’ that explores how various Web-based prediction servers can be evaluated in a systematic way.
46. •• Jones DT: Protein secondary structure prediction based on
position-specific scoring matrices. J Mol Biol 1999, 292:195-202.
PSIPRED (URL: http://globin.bio.warwick.ac.uk/psipred) is probably the
most accurate secondary structure prediction method currently available. At
CASP3, it was found to be the most accurate method.
47. Jones DT: Successful ab initio prediction of the tertiary
structure of NK-lysin using multiple sequences and
recognized supersecondary structural motifs. Proteins 1997,
S1:185-191.
48. Simons KT, Bonneau R, Ruczinski I, Baker D: Ab initio protein
structure prediction of CASP III targets using ROSETTA. Proteins
1999, S3:171-176.
49. Hubbard TJP: RMS/coverage graphs: a qualitative method for
comparing three-dimensional protein structure predictions.
Proteins 1999, S3:15-21.
50. • Lee J, Liwo A, Ripoll DR, Pillardy J, Scheraga HA: Calculation of
protein conformation by global optimization of a potential energy
function. Proteins 1999, S3:204-208.
This prediction of the HDEA structure (see text) was amongst the best of the
CASP3 predictions. It was particularly notable because the method used
was based on physical principles and was not a ‘knowledge-based’
approach. However, the result required hours of computation on a large
supercomputer and other results obtained using this approach were not of
comparable quality to this one.
51. Jones DT: Progress in protein structure prediction. Curr Opin Struct
Biol 1997, 7:377-387.
52. Butler D: IBM promises scientists 500-fold leap in
supercomputing power. Nature 1999, 402:705-706.
53. • Teichmann SA, Chothia C, Gerstein M: Advances in structural
genomics. Curr Opin Struct Biol 1999, 9:390-399.
This review includes a very informative comparison of genomic fold recognition studies.
54. Brenner SE, Barken D, Levitt M: The PRESAGE database for
structural genomics. Nucleic Acids Res 1999, 27:251-253.
55. Kraulis PJ: MOLSCRIPT: a program to produce both detailed and
schematic plots of protein structures. J App Crystallogr 1991,
24:946-950.