371 Protein structure prediction in the postgenomic era David T Jones As the number of completely sequenced genomes rapidly increases, the postgenomic problem of gene function identification becomes ever more pressing. Predicting the structures of proteins encoded by genes of interest is one possible means to glean subtle clues as to the functions of these proteins. There are limitations to this approach to gene identification and a survey of the expected reliability of different protein structure prediction techniques has been undertaken. Addresses Department of Biological Sciences, Brunel University, Uxbridge, Middlesex UB8 3PH, UK; e-mail: [email protected] Current Opinion in Structural Biology 2000, 10:371–379 0959-440X/00/$ — see front matter © 2000 Elsevier Science Ltd. All rights reserved. Abbreviations CAFASP Critical Assessment of Fully Automated Structure Prediction CASP Critical Assessment in Structure Prediction HMM hidden Markov model ORF open reading frame PDB Protein Data Bank RMSD root mean square deviation Introduction It is expected that a first draft of the complete human genome sequence will be available sometime in the year 2000. Although the completion of the sequence will probably take several more years, this milestone alone represents a major breakthrough in molecular biology. Sequencing efforts for simpler organisms are also continuing to produce increasing volumes of valuable data and, at the time of writing, some 30 or so complete bacterial genome sequences are available in the sequence databanks. As we are now clearly moving into the postsequencing phase of many genome projects, attention is becoming more and more focused on the correct identification of gene function. Assigning a function to a gene is an important first step in characterising its role in the various cellular processes and, without this information, the value of genome sequencing is greatly reduced. Of course, simple sequence comparison techniques are by far the most widely used method for making an initial identification of a particular gene product. By identifying homology between a new gene and a gene of known function, some inferences can be made as to the function of the new gene. How reliably the function can be extrapolated to the new gene depends on a number of factors, but the principle factor is, of course, the degree of sequence similarity observed. In recent years, sequence comparison methods, such as PSI-BLAST [1], or methods based on hidden Markov models (HMMs) [2] have ‘pushed the envelope’ as far as detecting homologous relationships goes. Of course, as more and more remote relationships are being considered, it becomes less clear as to how reliably one can map the function of one gene to another [3]. Nonetheless, sensitive sequence comparison techniques are still the most important technology that we have for rapidly characterising new gene products. Despite the power of modern sequence comparison techniques, there still remain open reading frames (ORFs) that either match no other entry in existing sequence databanks or match proteins that are also of unknown function. These sequence orphans or ‘ORFans’ [4•] are a source of great debate. Certainly, at present, this class of gene represents a large fraction of larger completely sequenced genomes. However, estimates of the number of orphans vary greatly [4•,5] and, consequently, different opinions exist as to how much of a problem these genes represent. The recent paper by Fischer and Eisenberg [4•] suggests that the fraction of genomes falling into the category of sequence orphans is unlikely to change rapidly. These conclusions are based on the observation that the number of uncharacterisable ORFs in the yeast genome, for example, has not been reduced substantially in the two years or so since it was first completely sequenced. This is despite the fact that the sizes of the sequence databanks have more than doubled over this period. Underlying this kind of estimate is, of course, the fact that, even for yeast, it is still not possible to be certain about the true number of expressed genes. The fraction of the approximately 6000 ORFs found in the yeast genome that really relate to expressed proteins is still unknown. Indeed, at the extreme end, one recent calculation [5] on the yeast genome suggests that the true fraction of yeast ORFs that are sequence orphans may be as little as 5%. No matter what is the true number of sequence orphans, the fact is that there remains a ‘hard core’ of small sequence families across a wide variety of genomes for which no functional information apparently exists. Direct experimental function determination is perhaps the ideal approach to characterising these orphan proteins. Gene knockout experiments and expression array techniques are just two of many experimental techniques that are now being widely applied to function determination. New theoretical methods for predicting gene function have also been proposed [6••,7••]. Two basic ideas are represented by these methods. Firstly, proteins of similar function may ‘co-evolve’ [6••]. In other words, groups of proteins that are found in some organisms, but not others, may share some common function. Proteins that might be found, for example, in aerobic organisms, but never in 372 Sequences and topology anaerobic organisms, may well have a role in the utilisation of molecular oxygen. Of course, because of the broad scope of this level of functional classification, the value of this kind of functional classification is rather uncertain. Nevertheless, this kind of information might well produce some unique clues for at least a few sequence orphans. The second idea, which has been proposed independently by two groups [6••,7••], has been called the ‘Rosetta sequence’ method by Eisenberg and colleagues [6••]. In this case, possible protein–protein interactions are predicted by identifying separate protein domains that are sometimes observed to be fused together in some organisms. Of course, there are a number of well-known exceptions to this rule (many protein modules for example) but, despite this, Rosetta sequences may well offer some tantalising clues as to the network of interactions that is present in living cells. Structure/function paradigm It is now well known that protein structure is much more highly conserved than protein sequence. It is also evident that it is the tertiary structure of a protein that creates the means by which it functions. Given these two observations, it is not surprising that many people believe that determining the 3D structure of a protein can provide valuable information as to its function and mechanism. One aspect of this belief is the impetus to solve experimentally the structure of every protein encoded by a bacterial genome. Some such structural genomics initiatives are already underway [8,9] but, as yet, none of these projects has produced large numbers of new structures, as they remain in the pilot stage of development. Despite great improvements to the basic techniques of X-ray crystallography, particularly the use of synchrotron radiation sources, the rate-limiting step in structure determination remains the expression, purification and crystallisation of the target proteins. NMR techniques offer some scope for avoiding some of these difficulties, but they are still limited with respect to the size of protein that can be routinely tackled. In the absence of any revolution in the experimental structure determination fields, it is hoped that theoretical methods may help ‘fill the gaps’ in ‘fold space’. Given that it has been estimated that the probability of a novel gene product having an entirely new fold is less than 30% [10], methods for recognising known folds are, of course, expected to be a powerful means for obtaining structural information about a new gene. Beyond fold recognition, there also lies the hope that methods will become available that might calculate an approximate fold for a given protein sequence without reference to a template structure. The question of prediction will be covered later but, for the time being, let us ignore this problem. Suppose a method did exist for accurately predicting protein tertiary structure or that we had solved the structures of all the proteins in a genome of interest. The question remains as to how useful is protein structure for elucidating the function of a protein. At the most basic level, there are apparent correlations between the broad functional class of a protein and its function. In a recent survey of the structures in the PDB [11], for example, Thornton and colleagues [12•] show that the majority of enzyme structures are found to be in the αβ-fold class, as are nucleotide-binding domains. Unfortunately, these observations, although suggesting a relationship between fold and function, have little or no predictive value. Clearly, it would be foolish to say that just because a protein is in the αβ-fold class it is likely to be an enzyme, even though one would be more often right than wrong. Hegyi and Gerstein [13•] have looked more closely at the relationship between fold classification and enzyme classification (EC number), using BLAST [1] to cross-reference the SCOP database and SWISS-PROT. In terms of fold class biases, their data are in broad agreement with the observations made by Thornton et al. [12•] and earlier work by Martin et al. [14]. By extending their dataset by counting not just entries in the PDB, but also homologues in SWISS-PROT, Hegyi and Gerstein [13•] were also able to make some statements about the statistical relationship between functional class and protein topology. They estimate that the average number of functions found to be associated with a particular fold is 1.2 for both enzymes and nonenzymes, and 1.8 for enzyme-related folds alone. Furthermore, they found the average number of folds for a given function to be 3.6 (2.5 for enzymes alone). This suggests that, on average at least, the correct prediction of a protein fold is a very good pointer to its function. Unfortunately, this apparent good news is somewhat marred by the observed biases in fold distributions. The superfolds [15], such as the (αβ)8 (TIM) barrel, have been long known to be associated with a very large number of functions. Hegyi and Gerstein found the top five ‘multifunctional folds’ to be the TIM barrel, the αβ hydrolase fold, the Rossmann fold, the P-loop-containing NTP hydrolase fold and the ferredoxin fold (Table 1). From this, we can see that assigning the TIM barrel fold to a particular gene product will give little information as to its function. In all probability, the gene would encode an enzyme (as almost all proteins with the TIM barrel fold Table 1 The five most ‘multifunctional’ folds [13•]. Fold TIM barrel αβ Hydrolase fold Rossmann fold P-loop-containing NTP hydrolase fold Ferredoxin fold Number of functions 16 9 6 6 6 Protein structure prediction Jones 373 Figure 1 (a) (b) (c) Current Opinion in Structural Biology (a) The X-ray structure of HDEA (PDB code 1BG8) is shown using the program MOLSCRIPT [55]. This structure is interesting because, despite the fact that a highly resolved structure was obtained experimentally [16] and a reasonable 3D model for the protein was obtained by ab initio prediction [50•], the function of the protein remains unknown. Not only is HDEA a sequence orphan but also, at the time of writing, no significant similarity can be found between the fold of HDEA and that of any other protein. (b) The structure of the protein encoded by M. jannaschii ORF MJ0577 (PDB code 1MJH) is shown [18]. Several of the incorporated structural motifs are known to be associated with ATP binding. ATP was found in the X-ray structure and the binding of ATP was confirmed experimentally. (c) The X-ray structure of the yjgF gene product from E. coli (PDB code 1QU9) revealed that the fold of this protein is similar to that of chorismate mutase from Bacillus subtilis. However, the similarity did not, in this case, help to reveal anything about the function of the gene, as the active sites were clearly different. are known to be enzymes) but, beyond this, very little functional insight would be gained. Sadly, the superfolds also account for the majority of observed structural similarities. Orengo et al. [15] estimated that roughly half of the observed structural similarities were found to be among the 10 superfolds. Whenever a nonsuperfold structure is assigned to a new gene, however, the chances are that the functions of the template protein and the gene are indeed similar. Other attempts at genomic functional assignment by means of structure determination have been documented [17,18]. In the first of these cases [17], despite a structural resemblance to chorismate mutase, no similarity was observed between their active sites and the crystal structure of the yjgF gene product from E. coli revealed very few clues to the protein’s function. In the second case [18], the crystal structure of Methanococcus jannaschii ORF MJ0577 not only revealed the presence of bound ATP (suggesting a probable ATPase or an ATP-mediated molecular switch), but also incorporated several structural motifs known to be associated with ATP binding. Although this is a positive example whereby structural studies have revealed functional information, the fact that part of the functional characterisation was based on the presence of a co-crystallised ATP means that this result is less applicable to the case of structural prediction and fold assignment. As we have seen, under the right circumstances, assigning a correct fold to a gene product can provide significant clues as to its function. This assumes, of course, that the fold has already been associated with a known function. Fortunately, the vast majority of proteins with known 3D structures belong to well-characterised families for which a lot of biochemical knowledge has been collected. The various structural genomics initiatives may, however, start to change this picture. Perhaps the first hint of things to come was the recently determined structure of the Escherichia coli protein HDEA by Yang et al. [16] The structure of this protein was actually solved by accident, as it turned out to have an almost identical molecular weight to the protein the crystallographers were trying to investigate. Despite this, the HDEA structure (Figure 1) offers a stark lesson to both experimentalists and theoreticians alike, as it is a protein of entirely unknown function. Worse still, HDEA is currently a sequence orphan and so techniques such as evolutionary profiling [6••,7••] are not applicable. Of course, in the absence of similarity at the level of sequence or structure to proteins of known function, the possibility remains that the function of a protein might be deduced from an analysis of its 3D structure alone. Several ideas have been put forward for trying to identify particular arrangements of atom groups that might form an active site in a given structure [19–22]. Russell et al. [20] have looked at the spatial arrangement of active site residues in different fold families and have 374 Sequences and topology found that the distributions are nonrandom. Despite the detailed differences in the active sites of the different fold families, the active sites are often located in a similar region of the fold. The authors call these regions ‘supersites’ and suggest that certain geometric features of a particular fold make certain regions of the fold more likely to form a binding site than other regions. At a more detailed level, Russell et al. [20] also identify similarities in the active site geometry of enzymes with similar functions, but different folds. They suggest that this should permit the creation of active site templates that might allow the recognition of the active site in a protein structure of unknown function. Wallace et al. [19] have also suggested a very similar idea. Fetrow et al. [21,22] have also proposed a method for identifying the function of a given protein structure. In their case, however, they explicitly apply the technique to predicted protein structures. As a proof of concept, Fetrow et al. generated a ‘fuzzy template’ for the thiol/disulfide oxidoreductase activity of the glutaredoxin/thioredoxin protein family and used this template to assess models produced by a fold recognition method applied to ORFs found in E. coli. Fold recognition methods for genome analysis Given the potential benefits of assigning a correct structure to a newly discovered gene product, it is not surprising that several groups have applied existing fold recognition methods to genome analysis. These methods can be classified into roughly three classes: sequence profile methods (e.g. [1,2,23]), structural (3-D-1-D) profile methods (e.g. [24,25]) and threading methods (e.g. [26–28]). The first attempt at assigning folds to genome sequences made use of a structural profile method. Fischer and Eisenberg [29] used a development of the original 3-D-1-D profile method [24] to assign folds to the ORFs found in Mycoplasma genitalium, the smallest known bacterial genome. They found that 16% of the ORFs could be assigned to a known fold by means of simple sequence comparison and that an additional 6% could be assigned to a known fold at high confidence using their fold recognition method. Despite the fact that many different threading (pair potential based fold recognition) methods have been developed, only a single attempt at applying these methods to genome analysis has been described [30]. Grandori applied the ProFit method [28] to analyse the ORFs in Mycoplasma pneumoniae, a slightly larger genome than M. genitalium. In this instance, to save time, proteins that could be matched to known structures by simple sequence comparison were excluded from the analysis, along with proteins longer than 200 residues. This latter restriction was to remove multidomain proteins from consideration. Of the 124 ORFs remaining, Grandori was able to recognise folds for 12, giving a recognition rate of 10%. Interestingly, a number of disagreements were reported when these results were compared with the results from Fischer and Eisenberg [29] (by identifying M. pneumoniae homologues in M. genitalium). This is not surprising given the relatively low overall reliability of pure fold recognition methods, but is surprising because, in some cases, both predictions were apparently very significant. Although Fischer and Eisenberg [29] and Grandori [30] performed basic sequence comparisons to detect obvious homologues to known structures, it is clear that the possibility existed that better sequence comparison techniques could have been applied and that these techniques could have assigned a greater number of folds to ORFs. In answer to this, a number of groups [31–34] have used PSI-BLAST [1], which is an iterative sequence profile method based on the standard gapped BLAST method [1]. PSI-BLAST has proven to be not only a very sensitive sequence comparison method, but also very reliable. To get the best results from PSI-BLAST, it should be used in both ‘directions’, as shown by Teichmann et al. [31]. Normally, each ORF is scanned against a set of precalculated PSI-BLAST profiles. Although these profiles are slow to calculate, this process only has to be done once for each sequence of known structure. Assigning folds to ORFs using this procedure is thus fairly efficient. To achieve the second search direction, a PSI-BLAST profile must be calculated for each ORF and this profile can be scanned against a library of sequences relating to known structures. Given that the calculation of a single PSI-BLAST profile takes 10 minutes on average using a modern work station, for large genomes, this second approach is very computationally expensive and relatively impractical. Despite this disadvantage, extra matches can be found when both searches are carried out. Rychlewski et al. [35,36] attempted to exploit this asymmetry in PSI-BLAST profile comparisons using a comparison method based on the alignment of one profile with another. Their method, BASIC, requires that profiles be computed for each sequence in the 3D structure library and also for each ORF. These two sets of profiles are then compared by means of a local dynamic programming algorithm. Jones [37••] has developed a hybrid method for assigning folds to genome sequences, called GenTHREADER. GenTHREADER uses a traditional sequence profile alignment algorithm to generate alignments that are evaluated by a method derived from threading techniques. As a final step, the alignment scores and threading energy sums for each threaded model are evaluated by a neural network in order to produce a single measure of confidence in the proposed prediction. The method was applied to the genome of M. genitalium; analysis of the results showed that as many as 46% of the proteins derived from the predicted protein coding regions had a significant relationship to a protein of known structure. Protein structure prediction Jones 375 Figure 2 80 70 Percentage alignment accuracy This graph shows the percentage alignment accuracy of the best fold recognition predictions submitted to CASP3 (white bars) compared with the best percentage alignment accuracy obtained using a sequence comparison method (black bars). Alignment accuracy is defined as the percentage of alignment positions in exact register with the alignment obtained from structural superposition of the target and template protein structures. An accuracy of 100% would indicate that the fold recognition method produced an alignment in exact agreement with the structural alignment. Note that some methods could not be accurately classified from the deposited information. 60 50 40 30 20 T0081 T0068 T0074 T0054 T0083 T0046 T0079 T0044 T0071 T0067 T0059 T0053 T0085 T0077 T0063 0 T0043 10 Target Current Opinion in Structural Biology More recently, GenTHREADER predictions have been compiled for 25 complete genomes and have been stored in a Web-accessible database at URL: http://globin.bio.warwick.ac.uk/genomes. Third Critical Assessment in Structure Prediction If prediction methods are going to be accepted by the biology community at large, then it is essential that the reliability of these methods be assessed in such a way as to convince a sceptical audience. Despite the fact that individual authors of automatic prediction methods do attempt both to properly benchmark their methods and to provide useful measures of confidence alongside their predictions, there still remains the possibility that the published results are somewhat better than might be expected in cases in which the true structure is unknown. The recent third Critical Assessment in Structure Prediction (CASP3) experiment was carried out in 1999 and allows some indication to be gained concerning the reliability of truly blind predictions using different methods. Detailed results from the experiment have been published in a special issue of Proteins [38••], along the same lines as previous special issues. The raw data from the CASP3 evaluation are also available on the Internet (URL: http:/predictioncenter.llnl.gov). Comparative modelling Again, there seem to have been very few significant developments in the actual process of comparative modelling in recent years; this was apparent from the remarkable homogeneity of the methods used at CASP3. One issue that perhaps did emerge from the CASP3 meeting was that some progress seems to have been made in terms of the full automation of comparative modelling. Ideally, for genome annotation, it should be possible to make sequence alignments and build good quality 3D models from these alignments without requiring human intervention. Sanchez and Sali [39•] have already shown that a large fraction of the yeast genome can be automatically modelled by homology to known 3D structures using their program MODELLER but, so far, progress has been limited to ORFs with relatively high sequence similarity to the template protein structures. Although at CASP3 the better comparative modelling results were obtained from groups that employed a number of nonautomatic refinement techniques [40,41], several groups did manage to produce remarkably good models for quite distantly related target–template pairs using sequence profile methods [42] or fold recognition methods [43]. Fold recognition In the absence of suitable homologous template structures with which to build a model for a given sequence, fold recognition methods provide another option for constructing useful tertiary structural models. It was clear at CASP3 that these methods are now beginning to mature, but it was also clear that such methods seem to have reached a plateau in terms of the accuracy of the models that are produced. Although there are clearly many ways of representing the results from CASP3 — as will be clear from the many different suggested schemes documented in the special issue of Proteins — perhaps the main ‘takehome message’ from the fold recognition results can be seen in Figure 2. In this figure, the highest alignment accuracy from any of the assessed fold recognition methods is plotted alongside the alignment accuracy from the best sequence-based method (e.g. HMMs or PSI-BLAST). As can be seen, in only three cases were alignments produced with more than 50% of the alignment positions in exact 376 Sequences and topology agreement with an alignment based on a structural superposition of the two structures and, in all three cases, the alignments produced by the sequence-based method were almost as accurate as the best threading method. In five cases (T0063, T0077, T0053, T0079 and T0046), there was clearly some benefit to be gained from using a threading method, as at least partially correct models were obtained in situations in which sequence profile methods were of little or no use. Nevertheless, the conclusion we have to take from this very crude analysis is that, in order to produce reasonably accurate alignments (and, consequently, accurate 3D models), there must be some detectable sequence similarity between the target protein and at least one template structure of known 3D structure. Although the sample sizes in the analysis are low, it would appear that, in cases in which there is at least some detectable sequence similarity, fold recognition methods based on PSI-BLAST or on other kinds of sequence profiles are currently sufficient to build reliable models. Beyond these cases, fold recognition methods not reliant on sequence alignment (i.e. true threading methods) are much more limited with regards to the accuracy of the alignments they can produce, but even these relatively poor models may be enough to gain some insight into the function of a new gene sequence. As discussed above, even fold recognition methods that are able to correctly recognise folds, but are entirely incapable of producing sensible alignments may offer some advantage in the narrowing down of potential gene functions. Critical Assessment of Fully Automated Structure Prediction One aspect of the CASP process that became more apparent at CASP3 was the difficulty in separating entirely automatic predictions from those that have been made using various degrees of human intervention. The predictions made by my own group certainly cover the spectrum between those generated entirely automatically and those for which the final choice of model from a shortlist of alternatives was made using our own ‘biological neural networks’. On a case by case basis, of course, it is always better to use prediction tools as guides, rather than simple black boxes. Biology can frequently provide important clues to the best model to select, as was demonstrated very ably by the predictions made by Murzin and Bateman at CASP2 [44]. In the context of genome annotation, however, such manual approaches to protein structure prediction are clearly limited. For prediction methods to be useful on a large scale, they clearly must be automatic. Fischer et al. [45••] have attempted to address this issue by creating a subsection of the CASP process, called CAFASP (Critical Assessment of Fully Automated Structure Prediction). The basic idea of CAFASP is to evaluate Web-based prediction tools in a fully automated fashion, thus eliminating the possibility of human assistance in the prediction process. The first CAFASP was carried out shortly after the CASP3 meeting and must therefore be considered a pilot study, as the predictions were not blind. Nevertheless, the results were interesting and the process allowed some technical issues to be resolved in good time for CAFASP2, which will be an official part of CASP4. Ab initio methods The ultimate goal of protein structure prediction is to predict the structure of a protein without any reference to known structures. If fold recognition methods are not able to find a suitable template structure with which to build a model for a gene product, in an ideal world it would be possible to apply an ab initio prediction method to build an approximate model. For this type of prediction, CASP3 provides a good snapshot of the current state-of-the-art. At the lowest level, the ab initio prediction category encompasses simple secondary structure prediction. Although no new methods have been developed in this area, by making use of sensitive sequence alignment methods (PSI-BLAST and HMMs again) definite progress was evident in the level of accuracy one can expect for secondary structure prediction. The most accurate method evaluated at CASP3, PSIPRED [46••], was developed in my own laboratory and comprises a pair of feed-forward neural networks designed to analyse the profiles generated by PSI-BLAST. Using this straightforward approach, an average Q3 (three-state prediction accuracy) score of 77% was achieved over all the submitted predictions. This compares well with the results obtained from benchmarking [46••], where PSIPRED was found to have a per residue Q3 score of 76.5% (per chain 76.0 ± 7.8%). Much more interesting than secondary structure prediction is, of course, the ab initio prediction of protein tertiary structure. At CASP1 and CASP2, very little success was demonstrated in the ab initio prediction of protein tertiary structure. At CASP2, the best of these ab initio predictions came from my own laboratory [47] and were close enough (Cα RMSD of 6.2 Å) for us to confidently claim that the fold was correctly reproduced in the model. This prediction was generated by a Monte Carlo approach, whereby fragments of highly resolved protein structures are joined together and the resulting chain conformations evaluated using a potential function. Interestingly, these ideas had been developed further by CASP3 (e.g. [48]) and this kind of approach to folding proteins was nicknamed ‘mini-threading’ by some predictors. This was to distinguish such knowledge-based prediction methods from methods that attempt to simulate protein folding using physical principles. So was clear evidence of progress in the ab initio prediction field demonstrated at CASP3? Certainly, it can be said that some interesting predictions were submitted and, in one or two cases, some useful features of the overall chain folds were correctly predicted. When the Protein structure prediction Jones Blue gene Figure 3 As an interesting footnote to the above discussion of developments in ab initio prediction methods, the development of a ‘petaflop’ computer, called ‘Blue Gene’, aimed at tackling the protein folding problem was recently announced by IBM [52]. 20 16 RMSD (Å) 377 12 Conclusions 8 4 0 30 60 Percentage of structure 90 This review has hopefully shown that protein structure prediction does have a role to play in the annotation of genome sequences. In many cases, the correct assignment of a fold to a novel gene can provide useful clues as to its likely function. Current Opinion in Structural Biology Plot of RMSD against fragment size (given as a percentage of the total number of residues in the polypeptide chain) for the best ab initio prediction model submitted to CASP3 for protein HDEA (target T0061; Figure 1a) [50•]. actual evaluation statistics are taken into account, however, it is difficult to identify any of the submitted models as having enough accuracy to be useful for routine genome analysis. The most commonly used statistic for evaluating the quality of a 3D model is the RMSD between the model and the correct experimentally determined structure. Considering the whole chain Cα RMSD values, none of the ab initio models submitted to CASP3 approached the ‘correct fold’ threshold of 6.0 Å — in fact, on this basis, none of the models submitted achieved an RMSD lower than 8.0 Å. Better news was obtained when substructures were considered. At CASP3, a new way of visualising structural similarity over different window lengths was used [49]. As an example, the results for the best prediction of target T0061 (HDEA) [50•] are shown in Figure 3. Notice that the RMSD for this prediction is less than 4.5 Å for around 80% of the model. Clearly, in this case, despite the poor RMSD for the whole model, the best subset of the model is much more impressive and it would be reasonable to class this prediction as an overall success. In a previous review [51], published three years ago, I made the following statement: “…it is very clear that there is still a long way to go for these (ab initio) approaches to structure prediction, and it is unlikely that a practical tool based on this kind of method will be available within less than five years”. Five years is, of course, a long time in a rapidly developing field, but on the basis of the CASP3 results as a whole, there seems little reason to amend the statement. Certainly, no single method exists that can be relied upon to successfully fold proteins on a routine basis. It is certainly impressive to see even a few results of the quality shown in Figure 3, but this result must be considered the exception, rather than the rule. On this basis, it is very hard to envisage ab initio folding methods finding significant applications in genome analysis — certainly in the short term at least. It seems that ab initio methods of protein structure prediction, although sometimes achieving interesting results for fragments of proteins, are unlikely to be used for genome analysis. The success of ab initio methods has never been tested rigorously on a large number of test cases and so the chance of finding a reasonable model for a target protein is unknown. However, results from the three CASP experiments suggest that the chance of any single method producing a reasonable model for a given sequence is very low. Fold recognition methods, on the other hand, have now reached a point at which reasonable models for target proteins can be found on a routine basis, particularly if a homologous template structure can be found. Not surprisingly, therefore, different fold recognition methods have already been applied to the problem of assigning folds to genome sequences. The simplest and most reliable predictions are based purely on sequence similarity and, in particular, PSI-BLAST [1] is proving to be a valuable tool for detecting distant homologous relationships among protein sequences. At the other extreme, fold recognition methods that tend to ignore sequence similarity and make use of structural information have also been applied, but with somewhat less success. Hybrid methods, which combine sequence comparison and fold recognition techniques, are likely to be an important development in this area. Such methods can detect homologous relationships just beyond the detection threshold of methods such as PSI-BLAST, as evident from the results for the first such method to be developed [37••]. In a recent review, Teichmann et al. [53•] compared the results from several attempts to assign folds to the M. genitalium genome. Being the smallest bacterial genome, M. genitalium provides a useful benchmark for different approaches to fold assignment, as most groups have made predictions for this genome. Although it was found that a high degree of agreement was evident for the different methods, some results were not found by all methods. This suggests that to maximise success in assigning folds to genomes, some kind of consensus of methods might be useful. At present, this is difficult as there are no agreed standards for how structural annotations should 378 Sequences and topology be represented. The PRESAGE database created by Brenner et al. [54] is a useful first attempt at providing an archive for different genome-related fold assignments, but such an archive is not a substitute for agreed standards among different groups working in this field. It is vital that some kind of agreement is reached on how best to exchange data among different fold assignment projects, before the problem becomes unmanageable. References and recommended reading Papers of particular interest, published within the annual period of review, have been highlighted as: • of special interest •• of outstanding interest 1. Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402. 2. Eddy SR: Hidden Markov models. Curr Opin Struct Biol 1996, 6:361-365. 3. Bork P, Koonin EV: Predicting functions from protein sequences — where are the bottlenecks? Nat Genet 1998, 18:313-318. 4. Fischer D, Eisenberg D: Finding families for genomic ORFans. • Bioinformatics 1999, 15:759-762. Sequence orphans (unique functionally uncharacterised sequences) represent the toughest challenges for theoretical approaches to genome analysis. It is not clear how quickly the number of sequence orphans in different genomes will decrease. The results in this paper suggest that we will be facing the problem of sequence orphans for a long time. 5. Kowalczuk M, Mackiewicz P, Gierlik A, Dudek MR, Cebrat S: Total number of coding open reading frames in the yeast genome. Yeast 1999, 15:1031-1034. 6. •• Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402:83-86. This paper describes several methods for defining the function of a gene product by theoretical means and also for identifying possible protein–protein interactions in which the encoded proteins take part. Importantly, these methods do not rely upon knowing the function of a related gene. 7. •• Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events. Nature 1999, 402:86-90. This paper describes how identifying a sequence in which the homologues of two domains are found fused together can indicate that the two domains are involved in protein–protein interactions. This is similar to the theory behind one of the methods described in the previous paper [6••], whereby the fused sequence is called a ‘Rosetta sequence’. 14. Martin AC, Orengo CA, Hutchinson EG, Jones S, Karmirantzou M, Laskowski RA, Mitchell JB, Taroni C, Thornton JM: Protein folds and functions. Structure 1998, 6:875-884. 15. Orengo CA, Jones DT, Thornton JM: Protein superfamilies and domain superfolds. Nature 1994, 372:631-634. 16. Yang F, Gustafson KR, Boyd MR, Wlodawer A: Crystal structure of Escherichia coli HdeA. Nat Struct Biol 1998, 5:763-764. 17. Volz K: A test case for structure-based functional assignment: the 1.2 Å crystal structure of the yjgF gene product from Escherichia coli. Protein Sci 1999, 8:2428-2437. 18. Zarembinski TI, Hung LW, Mueller-Dieckmann HJ, Kim KK, Yokota H, Kim R, Kim SH: Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics. Proc Natl Acad Sci USA 1998, 95:15189-15193. 19. Wallace AC, Borkakoti N, Thornton JM: TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci 1997, 6:2308-2323. 20. Russell RB, Sasieni PD, Sternberg MJE: Supersites within superfolds. Binding site similarity in the absence of homology. J Mol Biol 1998, 282:903-918. 21. Fetrow JS, Godzik A, Skolnick J: Functional analysis of the Escherichia coli genome using the sequence-to-structure-tofunction paradigm: identification of proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase activity. J Mol Biol 1998, 282:703-711. 22. Fetrow JS, Skolnick J: Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T-1 ribonucleases. J Mol Biol 1998, 281:949-968. 23. Overington J, Donnelly D, Johnson MS, Sali A, Blundell TL: Environmentspecific amino-acid substitution tables — tertiary templates and prediction of protein folds. Protein Sci 1992, 1:216-226. 24. Bowie JU, Lüthy R, Eisenberg D: A method to identify protein sequences that fold into a known three-dimensional structure. Science 1991, 253:164-170. 25. Ouzounis C, Sander C, Scharf M, Schneider R: Prediction of protein structure by evaluation of sequence-structure fitness. Aligning sequences to contact profiles derived from three-dimensional structures. J Mol Biol 1993, 232:805-825. 26. Jones DT, Taylor WR, Thornton JM: A new approach to protein fold recognition. Nature 1992, 358:86-89. 27. Bryant SH, Lawrence CE: An empirical energy function for threading protein-sequence through the folding motif. Proteins 1993, 16:92-112. 28. Flöckner H, Braxenthaler M, Lackner P, Jaritz M, Ortner M, Sippl MJ: Progress in fold recognition. Proteins 1995, 23:376-386. 8. Editorial: Money for structural genomics. Nat Struct Biol 1999, 6:707-708. 29. Fischer D, Eisenberg D: Assigning folds to the proteins encoded by the genome of Mycoplasma genitalium. Proc Natl Acad Sci USA 1997, 94:11929-11934. 9. Shapiro L, Lima CD: The Argonne Structural Genomics Workshop: Lamaze class for the birth of a new science. Structure 1998, 6:265-267. 30. Grandori R: Systematic fold recognition analysis of the sequences encoded by the genome of Mycoplasma pneumoniae. Protein Eng 1998, 11:1129-1135. 10. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH — a hierarchic classification of protein domain structures. Structure 1997, 5:1093-1108. 31. Teichmann SA, Park J, Chothia C: Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc Natl Acad Sci USA 1998, 95:14658-14663. 11. Sussman JL, Lin DW, Jiang JS, Manning NO, Prilusky J, Ritter O, Abola EE: Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr D 1998, 54:1078-1084. 32. Huynen M, Doerks T, Eisenhaber F, Orengo C, Sunyaev S, Yuan YP, Bork P: Homology-based fold predictions for Mycoplasma genitalium proteins. J Mol Biol 1998, 280:323-326. 12. Thornton JM, Orengo CA, Todd AE, Pearl FMG: Proteins folds, • functions and evolution. J Mol Biol 1999, 293:333-342. This paper documents the distribution of enzyme functions (as described by enzyme classification [EC] numbers) in the CATH structure classification database. 13. Hegyi H, Gerstein M: The relationship between protein structure • and function: a comprehensive survey with application to the yeast genome. J Mol Biol 1999, 288:147-164. A slightly different approach to the evaluation of the relationship between fold and function is described. In this case, the SWISS-PROT databank is used to extend the size of the dataset and the SCOP structure classification scheme is primarily used. 33. Wolf YI, Brenner SE, Bash PA, Koonin EV: Distribution of protein folds in the three superkingdoms of life. Genome Res 1999, 9:17-26. 34. Salamov AA, Suwa M, Orengo CA, Swindells MB: Genome analysis:assigning protein coding regions to three-dimensional structures. Protein Sci 1999, 8:771-777. 35. Rychlewski L, Zhang BH, Godzik A: Fold and function predictions for Mycoplasma genitalium proteins. Fold Des 1998, 3:229-238. 36. Rychlewski L, Zhang BH, Godzik A: Functional insights from structural predictions: analysis of the Escherichia coli genome. Protein Sci 1999, 8:614-624. Protein structure prediction Jones 37 •• Jones DT: GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 1999, 287:797-815. GenTHREADER represents a fold recognition method that combines features of sequence profile comparison and threading. In testing, it was able to assign folds to a greater fraction of the M. genitalium open reading frames at high confidence (1% false positive rate) than other fold assignment methods. 38. Moult J, Hubbard T, Fidelis K, Pedersen JT: Critical assessment of •• methods of protein structure prediction (CASP): round III. Proteins 1999, S3:2-6. This special issue of Proteins relating to the third CASP experiment is a ‘must read’ for anyone wishing to know what are the current abilities and limitations of modern protein structure prediction methods. 39. Sanchez R, Sali A: Comparative protein structure modeling in • genomics. J Comp Phys 1999, 151:388-401. This paper describes how comparative modelling techniques can be applied to large-scale genome analysis. In particular, the authors’ method allows the quality of a model to be predicted. 40. Jones TA, Kleywegt GJ: CASP3 comparative modeling evaluation. Proteins 1999, S3:30-46. 41. Bates PA, Sternberg MJE: Model building by comparison at CASP3: using expert knowledge and computer automation. Proteins 1999, S3:47-54. 42. Dunbrack RL: Comparative modeling of CASP3 targets using PSI-BLAST and SCWRL. Proteins 1999, S3:81-87. 379 46. Jones DT: Protein secondary structure prediction based on •• position-specific scoring matrices. J Mol Biol 1999, 292:195-202. PSIPRED (URL: http://globin.bio.warwick.ac.uk/psipred) is probably the most accurate secondary structure prediction method currently available. At CASP3, it was found to be the most accurate method. 47. Jones DT: Successful ab initio prediction of the tertiary structure of NK-lysin using multiple sequences and recognized supersecondary structural motifs. Proteins 1997, S1:185-191. 48. Simons KT, Bonneau R, Ruczinski I, Baker D: Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins 1999, S3:171-176. 49. Hubbard TJP: RMS/coverage graphs: a qualitative method for comparing three-dimensional protein structure predictions. Proteins 1999, S3:15-21. 50. Lee J, Liwo A, Ripoll DR, Pillardy J, Scheraga HA: Calculation of • protein conformation by global optimization of a potential energy function. Proteins 1999, S3:204-208. This prediction of the HDEA structure (see text) was amongst the best of the CASP3 predictions. It was particularly notable because the method used was based on physical principles and was not a ‘knowledge-based’ approach. However, the result required hours of computation on a large supercomputer and other results obtained using this approach were not of comparable quality to this one. 51. Jones DT: Progress in protein structure prediction. Curr Opin Struct Biol 1997, 7:377-387. 43. Fischer D: Modeling three-dimensional protein structures for amino acid sequences of the CASP3 experiment using sequencederived predictions. Proteins 1999, S3:61-65. 52. Butler D: IBM promises scientists 500-fold leap in supercomputing power. Nature 1999, 402:705-706. 44. Murzin AG, Bateman A: Distant homology recognition using structural classification of proteins. Proteins 1997, S1:105-112. 53. Teichmann SA, Chothia C, Gerstein M: Advances in structural • genomics. Curr Opin Struct Biol 1999, 9:390-399. This review includes a very informative comparison of genomic fold recognition studies. 45. Fischer D, Barret C, Bryson K, Elofsson A, Godzik A, Jones D, •• Karplus KJ, Kelley LA, MacCallum RM, Pawowski K et al.: CAFASP-1: critical assessment of fully automated structure prediction methods. Proteins 1999, S3:209-217. Fully automatic prediction methods are essential if they are to be applied to genome analysis. This paper describes a ‘pilot study’ that explores how various Web-based prediction servers can be evaluated in a systematic way. 54. Brenner SE, Barken D, Levitt M: The PRESAGE database for structural genomics. Nucleic Acids Res 1999, 27:251-253. 55. Kraulis PJ: MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J App Crystallogr 1991, 24:946-950.
© Copyright 2025 Paperzz