David T. Jones is currently Professor of Bioinformatics at the Institute for Cancer Genetics and Pharmacogenomics at Brunel University. As a Wellcome Trust Research Fellow at University College London and later as a Royal Society University Research Fellow at the University of Warwick, Prof. Jones has published in many areas of bioinformatics, but mainly in the area of protein structure prediction and analysis. He is also one of the founders of Inpharmatica, a bioinformatics-driven drug discovery company located in central London.

Keywords: protein structure prediction, protein folding, functional genomics

David T. Jones, Department of Biological Sciences, Brunel University, Uxbridge UB8 3PH, UK
E-mail: [email protected]

Protein structure prediction in genomics
David T. Jones
Date received (in revised form): 16th March 2001

Abstract
As the number of completely sequenced genomes rapidly increases, including now the complete Human Genome sequence, the post-genomic problems of genome-scale protein structure determination and gene function identification become ever more pressing. In fact, these problems can be seen as interrelated, in that experimentally determining or predicting the structure of proteins encoded by genes of interest is one possible means to glean subtle hints as to the functions of these genes. The applicability of this approach to gene characterisation is reviewed, along with a brief survey of the reliability of large-scale protein structure prediction methods and the prospects for the development of new prediction methods.

INTRODUCTION
The release of the complete human genome sequence in early 2001 was a milestone event that marked the transition of modern biology into a new 'postgenome' era.
In addition to the human genome, sequencing efforts for simpler organisms are also continuing to generate increasing volumes of valuable data, and at the time of writing some 40 complete microbial genome sequences are available, along with the genomes for the nematode worm, the fruit fly and thale cress. As we move into the post-sequencing phase of many genome projects, attention is becoming increasingly focused on the correct identification of gene products. Assigning a possible function to a gene is an important first step to characterising its role in the various cellular processes, and without this information it is impossible to realise the true value of genome sequencing.

Straightforward sequence comparison algorithms are by far the most widely used techniques for making an initial identification of a particular gene product. By identifying homology between a new gene product and a gene of known function, some inferences can be made as to the function of the new gene. How reliably the function can be extrapolated to the new gene depends on a number of factors, but the principal factor is the degree of sequence similarity observed. In recent years, sequence comparison algorithms such as PSI-BLAST¹ or techniques based on hidden Markov models² have 'pushed the envelope' as far as detecting homologous relationships goes. Of course, as more and more remote relationships are considered, it becomes less clear how reliably one can map the function of one gene to another.³,⁴ Nevertheless, sensitive sequence comparison algorithms remain the most vital technology that we have for rapidly characterising new gene products.

Despite the power of current-day sequence comparison algorithms, there are still open reading frames (ORFs) that either match no existing entries in the sequence data banks, or that match proteins that are also uncharacterised 'unknowns'. These sequence orphans or 'ORFans'⁵ are somewhat of a puzzle.
Clearly, at present, this class of ORF represents an uncertain, but significant, fraction of the larger completely sequenced genomes. No matter what the true number of sequence orphans happens to be, however, the fact remains that there is a 'hard core' of small sequence families across a wide variety of genomes for which no functional information can be derived by homology-based methods.

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 2. NO 2. 111–125. MAY 2001

Direct experimental function determination is perhaps the ideal approach to characterising these orphan proteins. Gene knockout experiments and expression array techniques are just two of many experimental techniques now being widely applied to function determination.

New algorithms for predicting gene function have also been described.⁶,⁷ Two basic ideas are represented in these methods. Firstly, proteins of similar function may 'co-evolve'.⁶ In other words, groups of proteins that are found in some organisms but not others may share some common function. Proteins that are found, for example, in aerobic organisms but never in anaerobic organisms may well have a role in the utilisation of oxygen. Because functional classification at this level is so broad, its value is rather uncertain. Nonetheless, this kind of information might well produce some unique hints for at least a few sequence orphans.

The second idea, which has been proposed independently by two groups,⁶,⁷ has been called the 'Rosetta sequence' algorithm by Eisenberg and colleagues.⁶ In this case, possible protein–protein interactions are predicted by identifying separate protein domains which are observed to be fused together into a single protein in some other species of organism.
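The domain-fusion idea can be illustrated with a minimal Python sketch. It assumes that every protein has already been decomposed into domain family labels (for example by sequence comparison); the function name `rosetta_links`, the input layout and the toy identifiers are all hypothetical and are not taken from the published methods.

```python
from itertools import combinations


def rosetta_links(domain_content):
    """Predict interacting protein pairs from domain fusions.

    domain_content maps a protein id to (organism, tuple of domain family
    labels). This layout is an assumption made for the sketch.
    Returns sorted (protein_a, protein_b, fusion_protein) tuples.
    """
    # Record every pair of families that is seen fused in a single chain.
    fused = {}
    for pid, (_, domains) in domain_content.items():
        for a, b in combinations(sorted(set(domains)), 2):
            fused.setdefault((a, b), []).append(pid)

    # Index single-domain proteins by organism and family.
    singles = {}
    for pid, (organism, domains) in domain_content.items():
        if len(set(domains)) == 1:
            singles.setdefault(organism, {}).setdefault(domains[0], []).append(pid)

    # Two separate proteins in one organism are linked if their families
    # are found fused in some protein.
    links = []
    for by_family in singles.values():
        for (a, b), fusion_ids in fused.items():
            for pa in by_family.get(a, []):
                for pb in by_family.get(b, []):
                    links.append((pa, pb, fusion_ids[0]))
    return sorted(links)


# Toy input with invented identifiers and family labels.
proteins = {
    "orgA_p1": ("organism A", ("fam1",)),
    "orgA_p2": ("organism A", ("fam2",)),
    "orgA_p3": ("organism A", ("fam3",)),
    "orgB_fusion": ("organism B", ("fam2", "fam1")),
}
links = rosetta_links(proteins)
print(links)
```

The sketch encodes the working rule that two domains found fused in one organism are candidates for interaction when they occur as separate proteins elsewhere. It makes no attempt to filter out domains that are fused to many different partners.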
Of course, there are a number of well-known exceptions to this rule (many protein modules for example), but despite this, Rosetta sequences may well offer some tantalising hints as to the network of interactions that is present in living cells.

WHAT CAN STRUCTURE TELL US ABOUT PROTEIN FUNCTION?
It is now common knowledge that the tertiary structure of a protein family is much more highly conserved than the sequences of the proteins within the family. It is also apparent that it is the tertiary structure of a protein that creates the chemical microenvironment which, in turn, produces its biochemical activity. Given these two observations, it is not surprising that the 3D structure of a protein can provide valuable information as to its function and mechanism.

One result of this belief is the strong impetus to solve, experimentally, the structure of every protein encoded by a bacterial genome. Some such structural genomics initiatives are already underway,⁸,⁹ but as yet none of these projects has generated large numbers of new structures, as they remain in the pilot stage of development. Despite great improvements to the basic methods of X-ray crystallography,¹⁰ particularly the use of synchrotron radiation sources,¹¹ the rate-limiting step in structure determination remains the expression, purification and crystallisation of the target proteins. Nuclear magnetic resonance (NMR) techniques offer some scope for avoiding some of these difficulties, but are still limited with respect to the size of protein that can be tackled on a routine basis.

Despite these technical improvements in experimental structure determination, none of the ongoing structural genomics projects is based on the idea of solving the structure for every single gene product. Instead, it is expected that theoretical methods will help 'fill the gaps' in 'fold space'.
Given that the probability of a novel gene product having an entirely new fold has been estimated to be less than 30 per cent,¹²,¹³ algorithms for recognising known folds are expected to be a powerful means of obtaining structural information about a new gene. Beyond fold recognition there also lies the hope that algorithms will become available that might calculate an approximate fold for a given protein sequence without reference to a template structure.

The question of prediction will be covered later, but for the time being let us ignore this problem. Suppose an algorithm did exist for accurately predicting protein tertiary structure, or that we had solved the structures for all of the proteins in a genome of interest. The question still remains as to how useful protein structure is for elucidating the function of a protein.

At the most basic level there are apparent trends between the broad functional class of a protein and its structural classification. In a recent survey of the structures in the Protein Data Bank (PDB),¹⁴ for example, Thornton and colleagues show that the majority of enzyme structures are found to be in the αβ fold class,¹⁵ as are those of nucleotide-binding domains. Unfortunately these observations, while suggesting a relationship between fold and function, have little or no obvious predictive value. Clearly, it would be foolish to say that just because a protein has an αβ fold it is likely to have enzymatic activity, even though one would be more frequently right than wrong.
Hegyi and Gerstein¹⁶ have looked more closely at the relationship between fold classification and enzyme classification (EC number), using BLAST¹ to cross-reference between the SCOP database and SWISS-PROT. In terms of fold class biases, their data are in broad agreement with the observations made by Thornton et al. and earlier work by Martin et al.¹⁷ However, by extending their data set to count not just entries in PDB but also the homologues of these structures in SWISS-PROT, Hegyi and Gerstein were also able to make some statements about the statistical relationship between functional class and protein topology. They found that the average number of functions associated with a particular fold is 1.2 for both enzymes and non-enzymes, and 1.8 for enzyme-related folds alone. Furthermore, they found the average number of folds for a given function to be 3.6 (2.5 for enzymes alone). One interpretation of this is that, on average at least, the correct prediction of a protein's fold might be a very good indicator as to its function.

Unfortunately, this apparent good news is somewhat marred by the observed biases in fold distributions. The superfolds (ref. 18; Figure 1) such as the (αβ)₈ (TIM; triose phosphate isomerase) barrel have long been known to be associated with a very large number of functions. Hegyi and Gerstein similarly found the top five 'multifunctional folds' to be the TIM barrel, the αβ hydrolase fold, the Rossmann fold, the P-loop containing NTP hydrolase fold and the ferredoxin-like fold (Table 1). From this we can see that assigning the TIM barrel fold to a particular gene product will give very little information as to its function. In all probability the gene would encode an enzyme (as almost all proteins with the TIM barrel fold are enzymes), but beyond this very little functional insight would be gained.
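The averages quoted above are simple counts over a fold-to-function mapping, and a toy calculation shows how a low average can coexist with a highly multifunctional fold. The sketch below assumes a list of (fold, function) pairs; the labels are invented and the numbers are not those of Hegyi and Gerstein.

```python
from collections import defaultdict


def fold_function_stats(pairs):
    """Return (mean functions per fold, mean folds per function,
    fold with the most functions) for a list of (fold, function) pairs."""
    functions_by_fold = defaultdict(set)
    folds_by_function = defaultdict(set)
    for fold, function in pairs:
        functions_by_fold[fold].add(function)
        folds_by_function[function].add(fold)

    mean_functions = sum(len(s) for s in functions_by_fold.values()) / len(functions_by_fold)
    mean_folds = sum(len(s) for s in folds_by_function.values()) / len(folds_by_function)
    # The fold with the largest function set plays the role of a 'superfold'.
    most_diverse = max(functions_by_fold, key=lambda f: len(functions_by_fold[f]))
    return mean_functions, mean_folds, most_diverse


# Invented toy data: one fold carries three functions, the rest carry one.
pairs = [
    ("fold_A", "func_1"), ("fold_A", "func_2"), ("fold_A", "func_3"),
    ("fold_B", "func_3"),
    ("fold_C", "func_4"),
    ("fold_D", "func_5"),
]
mean_functions, mean_folds, most_diverse = fold_function_stats(pairs)
print(mean_functions, mean_folds, most_diverse)
```

In this toy set most folds map to a single function, so the mean stays low, yet an assignment to `fold_A` narrows the function far less than an assignment to any other fold.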
To add to the problem, the superfolds also account for the bulk of observed structural similarities. Orengo et al.¹⁸ estimated that approximately half of the observed structural similarities are between the 10 superfolds. One positive point to make about these structural similarities is that whenever a non-superfold structure can be assigned to a new gene, current observations suggest that the functions of the template protein and the target protein would be expected to be broadly similar.

Figure 1: The 10 so-called 'superfolds' currently found in the CATH protein structure classification scheme. These folds account for more than 50 per cent of the observed structural similarities between protein domains. Despite the recurrence of these folds, there appears to be no indication of common ancestry between many of the proteins which exhibit these folds

Table 1: The five most functionally diverse protein folds according to Hegyi and Gerstein¹⁶

Fold                                    No. of functions
TIM barrel                              16
αβ hydrolase fold                       9
Rossmann fold                           6
P-loop containing NTP hydrolase fold    6
Ferredoxin fold                         6

As we have seen, under the right circumstances, assigning a correct fold to a gene product can provide significant hints as to its function. This assumes, of course, that the fold has already been associated with a known function. Fortunately, the vast bulk of proteins of known 3D structure belong to well-characterised families for which a lot of biochemical knowledge has been collected. The various structural genomics initiatives may, however, start to change this picture. Perhaps the first clue of things to come was seen in the recently determined structure of the Escherichia coli protein HDEA by Yang et al.¹⁹ The structure of this protein was actually solved by accident, as it turned out to have an almost identical molecular weight to the protein the crystallographers were trying to investigate. Despite this, the HDEA structure (Figure 1) offers a stark lesson to experimentalists and theoreticians alike, as it is a protein of entirely unknown function. Worse still, HDEA is currently a sequence orphan, and so algorithms such as evolutionary profiling⁶ could not be applied.

Other attempts at genomic functional assignment by means of structure determination have been documented.²⁰,²¹ In the first of these cases,²⁰ the crystal structure of the yjgF gene product from E. coli revealed rather few hints as to the protein's function: despite a structural resemblance to chorismate mutase, no similarity was observed between the active sites. In the second case,²¹ the crystal structure of Methanococcus jannaschii ORF MJ0577 not only clearly indicated the presence of a bound ATP (suggesting a probable ATPase or an ATP-mediated molecular switch) but also incorporated several structural motifs known to be frequently associated with ATP binding. Although this is a positive example where structural studies have revealed functional information, part of the functional characterisation was based on the presence of a co-crystallised ATP. The result is therefore less applicable to structure prediction and fold assignment, where information regarding ligand binding would not be produced.
Of course, in the absence of similarity at the level of sequence or even structure to proteins of known function, the possibility remains that the function of a protein might be inferred ab initio from an analysis of the 3D structure alone. Several ideas have been put forward for trying to identify particular arrangements of atom groups which might form an active site in a given structure.²²⁻²⁵ Both Russell²² and Wallace et al.²³ have proposed methods to detect particular side chain conformational patterns which relate to the active site geometry of enzymes with a similar function, even with entirely different folds. Both groups propose that this should permit the creation of active site templates which might allow the recognition of the active site in a protein structure of unknown function. Wallace et al. have now created a library of such templates, called PROCAT.²⁴

Fetrow et al.²⁵,²⁶ have also suggested an algorithm for identifying the function of a given protein structure based on side chain conformational patterns. However, in their case they explicitly apply the technique to predicted protein structures. As a proof of concept, Fetrow et al. generated a 'fuzzy template' for the thiol-disulfide oxidoreductase activity of the glutaredoxin/thioredoxin protein family, and used this template to assess models generated by a fold recognition algorithm applied to ORFs found in E. coli.

Although the potential of these template approaches to recognising active sites and other functional regions is very clear, there is no evidence so far that functions can be assigned to completely novel structures. We will have to wait for these methods to be tested on a large number of structures with both novel folds and unknown functions before we can properly evaluate their merits.
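The general idea of an active site template can be sketched as a geometric search: find residues of the right types whose mutual distances agree with a stored pattern. The Python below is a minimal illustration under stated assumptions, not the method of any of the cited groups. It represents each residue by a single point, uses a brute-force search, and all coordinates, the tolerance and the name `match_template` are invented for the example.

```python
from itertools import product
from math import dist


def match_template(template, structure, tol=1.0):
    """Find residue sets in `structure` that match `template`.

    Both arguments are lists of (residue_type, (x, y, z)), one
    representative point per residue (an assumption of this sketch).
    Returns tuples of indices into `structure`, one index per template
    position, whose pairwise distances all agree within `tol`.
    """
    # Candidate residues for each template position, selected by type.
    candidates = [
        [i for i, (rtype, _) in enumerate(structure) if rtype == t_type]
        for t_type, _ in template
    ]
    n = len(template)
    hits = []
    for combo in product(*candidates):
        if len(set(combo)) < n:
            continue  # one residue cannot fill two template positions
        matched = all(
            abs(dist(template[a][1], template[b][1])
                - dist(structure[combo[a]][1], structure[combo[b]][1])) <= tol
            for a in range(n) for b in range(a + 1, n)
        )
        if matched:
            hits.append(combo)
    return hits


# Hypothetical three-residue template (distances 3, 4 and 5 units).
template = [("SER", (0, 0, 0)), ("HIS", (3, 0, 0)), ("ASP", (3, 4, 0))]

# Hypothetical structure: a displaced copy of the pattern plus decoys.
structure = [
    ("GLY", (10, 10, 10)),
    ("SER", (1, 1, 1)),
    ("HIS", (1, 4, 1)),
    ("ASP", (5, 4, 1)),
    ("SER", (20, 0, 0)),
    ("ASP", (9, 9, 9)),
]
hits = match_template(template, structure)
print(hits)
```

Comparing distances rather than coordinates makes the search independent of how the structure is oriented, at the price of also accepting mirror images of the pattern. A practical implementation would need several atoms per residue and a far more efficient search.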
EVALUATING METHODS FOR STRUCTURE PREDICTION
Given the evident importance of 3D structure in providing insights into the function and mechanism of proteins, the next question relates to the applicability and reliability of available structure prediction techniques. Is there a role for protein structure prediction in structural genomics? Clearly, a theoretical approach to accurately modelling the structure of many proteins would have a great impact on genomics as a whole. However, if the use of prediction algorithms is going to be generally accepted by the biology community at large, then it is essential that the reliability of these methods be assessed in such a way as to convince this rather sceptical audience. Although individual authors of automatic prediction methods do attempt to benchmark their methods properly and to provide useful measures of confidence alongside their predictions, there still remains the possibility that the published results are somewhat better than might be expected in cases where the true structure is not known.

The Fourth Critical Assessment in Structure Prediction (CASP4) Experiment was carried out in 2000, along similar lines to the three previous experiments, and this continues to allow some indication to be gained as to the reliability of truly blind predictions using different approaches. Detailed results from the experiment will be published in a special issue of the journal Proteins, along the same lines as for CASP3.²⁷ The raw data from the CASP4 evaluation are also available across the Internet.²⁸

COMPARATIVE MODELLING
At present, the modelling of unknown protein structures by homology represents the most reliable and most widely applied method for protein structure prediction. The reliability and simplicity of the method stem from the fact that it is limited to predicting the structure of proteins that are closely related to a template protein of known structure. The comparative modelling process can be divided into five basic steps: alignment of the target sequence with the sequence of a protein of known 3D structure; building of a framework structure based on the alignment; loop building; addition and optimisation of side chains; and finally model refinement.

In recent years there has been a definite advance in the accuracy of sequence alignments for target–template pairs which are only distantly related. Indeed, some of these pairs would previously have been considered to be so distantly related as to be suitable only for fold recognition. This has come from the common usage of sensitive sequence profile alignment methods such as PSI-BLAST¹ or one of the several methods based on hidden Markov models.

For comparative modelling to be used routinely for genome annotation, it should be possible to build good quality 3D models without requiring human intervention. Given that progress does seem to have been made both in fully automating comparative modelling and in producing accurate sequence-structure alignments, it is not surprising that comparative modelling techniques form a central part of structural genomics initiatives.
Sanchez et al.²⁹ have already demonstrated that a large fraction of the yeast genome can be automatically modelled by homology to known 3D structures using their program MODELLER, but so far progress has been limited to ORFs with relatively high sequence similarity to the template protein structures.

FOLD RECOGNITION
In the absence of suitable homologous template structures with which to build a model for a given sequence, and the lack of success that is evident in the ab initio approaches, fold recognition algorithms provide another option for constructing useful tertiary structural models. It was clear at CASP4 that these algorithms are now beginning to converge, with many different groups all heavily relying on sensitive sequence comparison in addition to more traditional fold recognition methods.

A simple, but interesting, view of the CASP4 fold recognition results can be obtained by dividing the prediction targets into different categories. Considering just the 11 target domains which can readily be assigned to known superfamilies, all but one of the folds were correctly assigned by the best performing group in this category. Beyond the targets which were obvious members of existing superfamilies, there was very little success. Of the 11 or so targets that had known folds, but that were probably only structural analogues, at best only three or four folds were recognised even by the better performing groups. This is perhaps the clearest evidence in CASP4 that the recognition of superfamily membership is a very different problem from the recognition of actual folds. Of course, to be able to recognise a new sequence as being in a particular superfamily is of great biological value, particularly with respect to function identification.
The overall progress since CASP1 that is evident in recognising distant homology is even more impressive when one considers that PSI-BLAST¹ alone was unable to assign any of the 11 homologous domains to their correct superfamily. The fact that a combination of sequence analysis, 3D structural analysis (and in most cases some human insight) can identify 10 out of 11 difficult superfamily level matches correctly bodes very well for the continued success of fold recognition and distant homology detection methods in structurally characterising genomic sequences.

Nonetheless, the conclusion we have to take from this very crude analysis is that in order to produce reasonably accurate 3D models there must be some detectable sequence similarity between the target protein and at least one template protein of known 3D structure. Although the sample sizes here are low, it would appear that where there is at least some detectable sequence similarity, fold recognition methods based on sequence profiles are presently sufficient to build useful models. Beyond these cases, however, fold recognition methods not reliant on sequence alignment (ie true threading methods that ignore the sequence of the template proteins) are much more limited in their ability to recognise folds, and in the accuracy of the models they can produce. Nevertheless, even these relatively poor models may be enough to gain some insight into the function of a new gene sequence.
As discussed earlier, even fold recognition algorithms which are able to recognise folds correctly but are entirely incapable of producing sensible alignments may offer some advantage in the narrowing down of potential gene functions.

FOLD RECOGNITION METHODS FOR GENOME ANALYSIS
Given the potential benefits of assigning a correct structure to a newly discovered gene product, it is unsurprising that several groups have applied existing fold recognition algorithms to genome analysis. These techniques can be divided into roughly three classes: sequence profile methods (eg refs 1, 2, 30), structural (3D–1D) profile methods (eg refs 31, 32) and threading algorithms (eg refs 33–35).

The first attempt at assigning folds to genome sequences made use of a structural profile method. Fischer and Eisenberg³⁶ used a development of the original 3D–1D profile method³¹ to assign folds to the ORFs found in Mycoplasma genitalium, the smallest known bacterial genome. They found that approximately 16 per cent of the ORFs could be assigned to a known fold by means of straightforward sequence comparison, and that an additional 6 per cent could be assigned to a known fold at high confidence using their fold recognition method. Of course, as the structure databases are now much larger, it is very likely that these fractions would now be somewhat higher.

Although many different threading (purely pair potential-based fold recognition) methods have been developed, only a single attempt at applying these methods to genome analysis has been described.³⁷ Grandori applied the ProFit method³⁵ to analyse the ORFs in M. pneumoniae, a slightly larger genome than M. genitalium. In this work, to save time, proteins which could be matched to known structures by straightforward sequence comparison were excluded from the analysis, along with proteins longer than 200 residues (which were assumed to be multidomain proteins).
Of the 124 ORFs remaining, Grandori was able to recognise folds for 12, giving a recognition rate of 10 per cent. Interestingly, a number of disagreements were reported when these results were compared with those of Fischer and Eisenberg (by identifying M. pneumoniae homologues in M. genitalium). Disagreement in itself is not surprising given the relatively low overall reliability of pure fold recognition algorithms, but it is more surprising that in some cases both predictions were apparently very significant.

Although both Fischer and Eisenberg and Grandori performed basic sequence comparisons to detect clear homologues to known structures, better sequence comparison algorithms could have been applied, and these could have assigned a greater number of folds to ORFs. In answer to this, a number of groups³⁸⁻⁴¹ have used PSI-BLAST,¹ which is an iterative sequence profile method based on the standard gapped-BLAST method.¹ PSI-BLAST has proven to be not only a very sensitive sequence comparison method, but also very reliable. To get the best results from PSI-BLAST, however, it should be used in both 'directions'.³⁸,⁴¹ Normally, each ORF is scanned against a set of PSI-BLAST profiles, each of which corresponds to a single protein structure or structural domain. Although these profiles are slow to calculate, this has to be done only once for each sequence of known structure. Assigning folds to ORFs using this procedure is thus fairly efficient.
To achieve the second search direction, a PSI-BLAST profile must be calculated for each ORF, and this profile can then be scanned against a library of sequences relating to known structures. Given that the calculation of a single PSI-BLAST profile takes 10 minutes on average using a modern workstation, for large genomes this second approach is very computationally expensive, and relatively impractical. Despite this disadvantage, extra matches can be found when both searches are carried out. Although it has been claimed that intermediate sequence searching methods (ISL) using PSI-BLAST, such as those described by Salamov et al.⁴¹ or Teichmann et al.,⁴² can produce results equivalent to a two-direction PSI-BLAST search, in practice it is quite clear that these approaches can still miss matches which can be found by more rigorous use of PSI-BLAST itself.

Rychlewski et al.⁴³,⁴⁴ attempted to exploit this asymmetry in PSI-BLAST profile comparisons by means of a comparison algorithm based on the alignment of one profile with another. Their technique, BASIC, requires profiles to be computed for each sequence in the 3D structure library and also for each ORF. These two sets of profiles are then compared by means of a local dynamic programming method.

Jones has developed a hybrid method for assigning folds to genome sequences, called GenTHREADER.⁴⁵ GenTHREADER uses a traditional sequence-profile alignment method to produce alignments which are evaluated by a method derived from threading methods. As a last step, the alignment scores and threading energy sums for each threaded model are evaluated by a neural network in order to produce a single measure of confidence in the proposed prediction. The method was applied to the genome of M. genitalium, where analysis of the results showed that as many as 46 per cent of the proteins derived from the predicted protein coding regions had a significant relationship to a protein of known structure.
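The cost of the second PSI-BLAST search direction discussed in this section can be made concrete with some back-of-envelope arithmetic. The sketch below uses the figure of 10 minutes per profile quoted in the text and assumes a single processor. The genome sizes are round illustrative numbers, not measured ORF counts, and the function name is hypothetical.

```python
def profile_build_hours(n_sequences, minutes_per_profile=10.0):
    """Hours needed to build one PSI-BLAST profile per sequence on a
    single processor, at a fixed average cost per profile."""
    return n_sequences * minutes_per_profile / 60.0


# Illustrative sizes only: a very small genome and a large one.
small_genome_hours = profile_build_hours(500)
large_genome_hours = profile_build_hours(30_000)
large_genome_days = large_genome_hours / 24.0

print(f"small genome: {small_genome_hours:.1f} h")
print(f"large genome: {large_genome_hours:.0f} h ({large_genome_days:.0f} days)")
```

Under these assumptions a genome of a few hundred ORFs needs a few days of computing, whereas one with tens of thousands of ORFs needs several months on one machine. The profiles for the structure library, by contrast, are computed once and reused for every genome.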
In a recent review, Teichmann et al.⁴⁶ compared the results from several attempts to assign folds to the M. genitalium genome. Being the smallest bacterial genome, M. genitalium provides a useful benchmark for different approaches to fold assignment, as most groups have made predictions for this genome. Although a high degree of agreement was apparent between the different algorithms, some results were not found by all techniques. This suggests that to maximise success in assigning folds to genomes, some kind of consensus of algorithms might be useful. At present, this is difficult as there are no agreed standards for how structural annotations should be represented.

A number of Web resources are available where precompiled fold assignments can be accessed for different subsets of genomes. These resources are predominantly based on PSI-BLAST comparisons. For example, the GTOP database at the National Institute of Genetics in Japan contains fold assignments for 26 completed genomes based on PSI-BLAST similarity searches, and can be accessed from its web site.⁴⁷ As far as fold recognition methods are concerned, most of the results currently available on the Web relate to small subsets of genomes. For example, comprehensive fold assignments are available for M. genitalium, E. coli and Helicobacter pylori at the Burnham Institute.⁴⁸ To date, the only available set of comprehensive fold assignments using fold recognition techniques is that derived by GenTHREADER, which has been compiled for 25 complete genomes (plus the currently confirmed gene products from the draft human genome) and stored in a Web-accessible database.⁴⁹

A preliminary analysis of these data shows that while certain folds are prevalent in all genomes, certain folds are more common in some genomes than others. The Rossmann-like fold is the most commonly occurring fold in almost every organism studied, mainly by virtue of the high recurrence of the P-loop hydrolase superfamily. One interesting exception to this is the human genome, in which the immunoglobulin fold currently appears to be the most common fold.

One feature of the CASP process that continues to be a concern is the difficulty in separating out wholly automatic predictions from those that have been made using various degrees of human intervention. It is not at all clear how much of the success shown in CASP comes from the algorithms being used and how much comes from the expert biological knowledge of the people using the algorithms. It is very clear that people do add some value, and all of the most successful groups at CASP4 made use of the scientific literature to identify functionally related proteins from a shortlist of possibilities. It therefore seems fair to say that human intervention is required to make the best predictions, but what can a non-expert hope to achieve using automated methods alone? Also, is it possible to achieve the same high levels of success shown in CASP4 on many targets? If, instead of 30 or so targets, which could easily be analysed individually by human predictors, there had been, say, 1,000 targets, how good would the results have been?
Clearly, if fold recognition, or protein structure prediction more generally, is to play an important part in structural genomics, then it is essential that we characterise the success of fully automated methods for structure prediction. Fischer et al.50 have attempted to address this issue by creating a subsection of the CASP process called CAFASP (Critical Assessment of Fully Automated Structure Prediction). The basic idea of CAFASP is to evaluate Web-based prediction tools in a fully automated fashion, thus eliminating the possibility of human assistance in the prediction process. CAFASP1 was carried out shortly after the CASP3 meeting, and must therefore be considered a pilot study as the predictions were not blind. Nonetheless, the results were interesting, and the process allowed some technical issues to be resolved in good time for CAFASP2, which was an official part of CASP4. The detailed results for CAFASP2 are available from the associated web site,51 but it is possible to conclude broadly that, although skilled human intervention is clearly beneficial, entirely automated methods still performed fairly well. The bad news is that, as with the human predictors in CASP4, the success of the automatic methods came mainly from relatively easy targets in the superfamily category. Targets in the analogous category were predicted very poorly by the automatic servers.

AB INITIO METHODS

In cases where it is not possible to build a useful model by comparative modelling, it might be hoped that methods would be available to calculate a 3D structure directly from the amino acid sequence, by means of pure physics or a knowledge of the rules of protein folding. This has proven to be, and remains, a very difficult challenge. It is not difficult to understand the practical applications of an entirely general method for predicting the tertiary structure of novel gene products, but is there any hope that such approaches will provide useful levels of success in the short to medium term?

Figure 2: The three main classes of protein structure prediction methods in order of `complexity'. Comparative modelling approaches are fairly easy methods to apply to genome sequences, and produce accurate 3D models for the proteins concerned. Unfortunately, only a small percentage (20–30 per cent) of the proteins encoded in most genomes are closely enough related to a protein of known 3D structure to allow comparative modelling techniques to be applied. At the other extreme, ab initio methods can in theory be applied to any sequence, but these methods are not currently able to produce useful 3D models. Fold recognition methods occupy the mid-ground between these two classical approaches.

Of course, not all ab initio methods are aimed at the prediction of tertiary structure. A great deal of progress has been made in recent years in methods that attempt to predict protein secondary structure. Great interest in secondary structure prediction remains because it is often used as a component of a wide range of 3D prediction methods. Indeed, although it is rarely used in isolation, accurate secondary structure prediction is exploited by the vast majority of prediction groups taking part in CASP. It has also been suggested that careful analysis of accurate secondary structure predictions can provide functional information on a new gene sequence (eg King et al.52). Up until around two years ago, the best and by far the most widely used method for predicting secondary structure was the PHD method developed by Burkhard Rost.53 At CASP3, however, the PSIPRED method54 showed a marked improvement in prediction accuracy over previous methods.
Although PSIPRED was very similar to PHD in concept (using two levels of neural networks to analyse sequence profiles), it used PSI-BLAST to provide more sensitive and more accurate sequence profiles. Added to this was a highly redundant training set including nearly 2,000 separate profiles. At CASP4, PSIPRED was still ranked at the top of the 20 or so methods evaluated, achieving an overall 3-state prediction accuracy (Q3 score) of 80.6 per cent for all 40 target domains with no obvious sequence similarity to existing structures.

PREDICTING NEW FOLDS

Somewhat more interesting than secondary structure prediction is, of course, the ab initio prediction of protein tertiary structure. The concept behind most ab initio approaches to protein structure is quite simple. Firstly, a large number of different chain conformations are generated for the target protein. At the very simplest, these conformations can be enumerated exhaustively, ie virtually every distinct 3D structure is generated for the protein. Clearly, this is only practical for very small proteins, since the number of possible conformations grows exponentially with the number of amino acid residues. One way of slightly reducing the number of possible structures is to build the structure on a fixed lattice, which restricts the positions of the atoms in the structure to a fixed set of coordinates. For larger proteins, more intelligent search strategies are needed, which include molecular dynamics, simulated annealing, genetic algorithms and a number of other efficient search strategies.
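The exponential growth in the number of conformations can be made concrete with a toy calculation. The sketch below counts self-avoiding chains on a 2D square lattice by exhaustive enumeration. The 2D lattice and the function name `count_self_avoiding_walks` are assumptions made purely for illustration; lattice models used for real proteins are three-dimensional and considerably more elaborate.

```python
def count_self_avoiding_walks(n_steps):
    """Count chains of `n_steps` bonds on a 2D square lattice that never
    revisit a site, starting from the origin.

    Every walk is counted, including rotations and reflections of the
    same shape, so the numbers are an upper bound on distinct folds.
    """
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))

    def extend(pos, visited, remaining):
        # Depth-first enumeration: place one more residue on any free
        # neighbouring site, recurse, then backtrack.
        if remaining == 0:
            return 1
        total = 0
        for dx, dy in moves:
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt not in visited:
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)
        return total

    return extend((0, 0), {(0, 0)}, n_steps)


if __name__ == "__main__":
    # A chain of N residues has N - 1 bonds.
    for n in range(1, 11):
        print(n, count_self_avoiding_walks(n))
```

Even in this restricted setting the count grows by a factor of nearly three for each added bond, which is why exhaustive enumeration is confined to very short chains and heuristic search strategies are needed for anything larger.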
Traditionally, very little success has been demonstrated in the ab initio prediction of protein tertiary structure in the various CASP experiments. However, in the second CASP experiment, the best ab initio prediction55 was close enough (α-carbon root mean square deviation of 6.2 Å) for it to be confidently claimed that at least the fold was correctly reproduced in the model. This prediction was generated by a Monte Carlo approach in which fragments of protein structures are spliced together and the resulting chain conformations evaluated using a simple energy function. At CASP3 the group of David Baker took these ideas further with some success.56 As an aside, this kind of approach to folding proteins has been nicknamed `mini-threading' by some predictors. This terminology is perhaps useful to distinguish such knowledge-based prediction methods from methods that attempt to simulate protein folding using physical principles, but is otherwise quite misleading. At CASP4, however, the Baker group showed an even higher degree of success across a much wider range of target folds than at earlier CASP experiments, where successes had been limited mainly to alpha-helical folds. Figure 3 shows an example of a prediction from the Baker group which accurately models the unique fold of CASP4 target T0091, a hypothetical protein from Haemophilus influenzae (HI0442). It of course remains to be seen how much functional insight can be gleaned from a knowledge of this rather unusual fold.

Figure 3: Example of a correctly predicted novel fold for a hypothetical gene product (ORF HI0442 from H. influenzae) using the ROSETTA method of Simons and coworkers.56,57 The figure was generated using Molscript.58
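The fragment-splicing Monte Carlo idea described in this section can be sketched in a few lines. This is a toy under strong assumptions and not the ROSETTA method or any published algorithm. The chain is reduced to a string of backbone states (H, E, C), the fragment library `FRAGMENTS` is invented, and `toy_energy` simply counts disagreements with a hypothetical per-residue preference string that stands in for the knowledge-based terms of a real energy function.

```python
import math
import random

# Hypothetical fragment library: short runs of backbone states
# (H = helix, E = strand, C = coil) standing in for real structure fragments.
FRAGMENTS = ["HHH", "EEE", "CCC", "HHC", "CEE", "ECC", "CHH"]


def toy_energy(conformation, preferred):
    """Toy stand-in for an energy function: the number of positions that
    disagree with a hypothetical preferred-state string (lower is better)."""
    return sum(1 for a, b in zip(conformation, preferred) if a != b)


def fragment_monte_carlo(preferred, n_steps=2000, temperature=0.5, seed=0):
    """Metropolis Monte Carlo search using fragment insertion moves.

    Returns the lowest-energy conformation seen and its energy.
    Assumes len(preferred) >= 3 (the fragment length used here).
    """
    rng = random.Random(seed)
    n = len(preferred)
    current = ["C"] * n                      # start from an all-coil chain
    current_e = toy_energy(current, preferred)
    best, best_e = current[:], current_e
    for _ in range(n_steps):
        frag = rng.choice(FRAGMENTS)
        start = rng.randrange(n - len(frag) + 1)
        trial = current[:]
        trial[start:start + len(frag)] = frag   # splice the fragment in
        trial_e = toy_energy(trial, preferred)
        delta = trial_e - current_e
        # Metropolis criterion: always accept downhill moves, accept
        # uphill moves with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = current[:], current_e
    return "".join(best), best_e


if __name__ == "__main__":
    model, energy = fragment_monte_carlo("CCHHHHHHCCEEEECC")
    print(model, energy)
```

The occasional acceptance of uphill moves is what lets the search escape local minima. A real method would replace the state string with 3D coordinates and the mismatch count with an energy function evaluated on the resulting structure.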
CONCLUSIONS

This review has demonstrated that protein structure prediction does have a role to play in the annotation of genome sequences. In many cases, the correct assignment of a fold to a novel gene can provide useful clues as to its likely function. It seems that ab initio methods for protein structure prediction, while sometimes achieving interesting results on fragments of proteins, are unlikely to be used for genome analysis in the near future. The success of ab initio algorithms has never been tested rigorously on a large number of test cases, and so the chance of finding a reasonable model for a target protein is unknown. However, results in the four CASP experiments suggest that the chances of any single algorithm producing a reasonable model for a given sequence are very low.

Fold recognition algorithms, on the other hand, have now reached a point where fair models for target proteins can be found on a routine basis, especially where a homologous template structure can be found. Not surprisingly, therefore, different fold recognition methods have already been applied to the problem of assigning folds to genome sequences. The simplest and most reliable predictions are based purely on sequence similarity, and in particular PSI-BLAST1 is proving to be a valuable tool for detecting remote homologous relationships between protein sequences. At the other extreme, fold recognition methods which typically ignore sequence similarity and make use of structural information have also been applied, but with somewhat less success. Hybrid methods, which combine sequence comparison and fold recognition methods, are expected to be an important development in this area.
Such methods can detect homologous relationships just beyond the detection threshold of methods such as PSI-BLAST, as evidenced by the results for the first such algorithm.39

So how will the relationship between computational approaches to protein folding and the ongoing structural genomics projects probably develop? It is clear that protein structure prediction is never likely to challenge experimental methods in the determination of accurate structural models for proteins. The role of protein structure prediction and modelling lies in the rapid analysis and annotation of proteins, the analysis of proteins for which experimental structure determination has proven to be difficult, and the extrapolation from existing experimental structures to other members of the protein's family and superfamily. In the fullness of time, of course, the protein structure data banks will be so complete as to render protein structure prediction a more or less academic problem. Eventually there will almost always be available a closely related protein of known structure with which to build an accurate model for any given target protein. When might this occur? Based on current rates of structure determination and the observed growth of the structure databases, this scenario might arise in around 15 years or longer. On the other hand, perhaps the structural genomics initiatives will stimulate the development of novel methods for rapid structure determination and reduce this time to around 10 years. In either case, will theoreticians suddenly find their skills of little value? No. Even with a highly optimistic view of success, present approaches to structural genomics will only scratch the surface of understanding the working of cells at a molecular level.
There will still remain a rich selection of unsolved problems to keep theoreticians in structural biology busy for as long as they wish: protein misfolding, protein–protein interactions and biomolecular recognition, drug design, membrane proteins and even de novo protein design are just a few of the many challenging future possibilities for continuing theoretical studies of protein structure.

References

1. Altschul, S. F., Madden, T. L., Schaffer, A. A. et al. (1997), `Gapped BLAST and PSI-BLAST: A new generation of protein database search programs', Nucleic Acids Res., Vol. 25, pp. 3389–3402.
2. Eddy, S. R. (1996), `Hidden Markov models', Curr. Opin. Struct. Biol., Vol. 6, pp. 361–365.
3. Bork, P. and Koonin, E. V. (1998), `Predicting functions from protein sequences – where are the bottlenecks?', Nature Genetics, Vol. 18, pp. 313–318.
4. Wilson, C. A., Kreychman, J. and Gerstein, M. (2000), `Assessing annotation transfer for genomics: Quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores', J. Mol. Biol., Vol. 297, pp. 233–249.
5. Fischer, D. and Eisenberg, D. (1999), `Finding families for genomic ORFans', Bioinformatics, Vol. 15, pp. 759–762.
6. Marcotte, E. M., Pellegrini, M., Thompson, M. J. et al. (1999), `A combined algorithm for genome-wide prediction of protein function', Nature, Vol. 402, pp. 83–86.
7. Enright, A. J., Iliopoulos, I., Kyrpides, N. C. and Ouzounis, C. A. (1999), `Protein interaction maps for complete genomes based on gene fusion events', Nature, Vol. 402, pp. 86–90.
8. Anonymous (1999), `Editorial: Money for structural genomics', Nature Struct. Biol., Vol. 6, pp. 707–708.
9. Shapiro, L. and Lima, C. D. (1998), `The Argonne Structural Genomics Workshop: Lamaze class for the birth of a new science', Structure, Vol. 6, pp. 265–267.
10. Lamzin, V. S. and Perrakis, A. (2000), `Current state of automated crystallographic data analysis', Nature Struct. Biol., Vol. 7, Nov., Suppl., pp. 978–981.
11. Hendrickson, W. A. (2000), `Synchrotron crystallography', Trends Biochem. Sci., Vol. 25, pp. 637–643.
12. Orengo, C. A., Michie, A. D., Jones, S. et al. (1997), `CATH – a hierarchic classification of protein domain structures', Structure, Vol. 5, pp. 1093–1108.
13. Brenner, S. E. and Levitt, M. (2000), `Expectations from structural genomics', Protein Sci., Vol. 9, pp. 197–200.
14. Sussman, J. L., Lin, D. W., Jiang, J. S. et al. (1998), `Protein Data Bank (PDB): Database of three-dimensional structural information of biological macromolecules', Acta Crystallogr. D, Vol. 54, pp. 1078–1084.
15. Thornton, J. M., Orengo, C. A., Todd, A. E. and Pearl, F. M. G. (1999), `Protein folds, functions and evolution', J. Mol. Biol., Vol. 293, pp. 333–342.
16. Hegyi, H. and Gerstein, M. (1999), `The relationship between protein structure and function: A comprehensive survey with application to the yeast genome', J. Mol. Biol., Vol. 288, pp. 147–164.
17. Martin, A. C., Orengo, C. A., Hutchinson, E. G. et al. (1998), `Protein folds and functions', Structure, Vol. 6, pp. 875–884.
18. Orengo, C. A., Jones, D. T. and Thornton, J. M. (1994), `Protein superfamilies and domain superfolds', Nature, Vol. 372, pp. 631–634.
19. Yang, F., Gustafson, K. R., Boyd, M. R. and Wlodawer, A. (1998), `Crystal structure of Escherichia coli HdeA', Nature Struct. Biol., Vol. 5, pp. 763–764.
20. Volz, K. (1999), `A test case for structure-based functional assignment: The 1.2 angstrom crystal structure of the yjgF gene product from Escherichia coli', Protein Sci., Vol. 8, pp. 2428–2437.
21. Zarembinski, T. I., Hung, L. W., Mueller-Dieckmann, H. J. et al. (1998), `Structure-based assignment of the biochemical function of a hypothetical protein: A test case of structural genomics', Proc. Natl Acad. Sci. USA, Vol. 95, pp. 15189–15193.
22. Russell, R. B. (1998), `Detection of protein three-dimensional side-chain patterns: New examples of convergent evolution', J. Mol. Biol., Vol. 279, pp. 1211–1227.
23. Wallace, A. C., Borkakoti, N. and Thornton, J. M. (1997), `TESS: A geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites', Protein Sci., Vol. 6, pp. 2308–2323.
24. URL: http://www.biochem.ucl.ac.uk/bsm/PROCAT
25. Fetrow, J. S., Godzik, A. and Skolnick, J. (1998), `Functional analysis of the Escherichia coli genome using the sequence-to-structure-to-function paradigm: Identification of proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase activity', J. Mol. Biol., Vol. 282, pp. 703–711.
26. Fetrow, J. S. and Skolnick, J. (1998), `Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T-1 ribonucleases', J. Mol. Biol., Vol. 281, pp. 949–968.
27. Moult, J., Hubbard, T., Fidelis, K. and Pedersen, J. T. (1999), `Critical assessment of methods of protein structure prediction (CASP): Round III', Proteins, Vol. S3, pp. 2–6.
28. URL: http://predictioncenter.llnl.gov
29. Sanchez, R., Pieper, U., Melo, F. et al. (2000), `Protein structure modeling for structural genomics', Nature Struct. Biol., Vol. 7, Nov., Suppl., pp. 986–990.
30. Overington, J., Donnelly, D., Johnson, M. S. et al. (1992), `Environment-specific amino acid substitution tables – tertiary templates and prediction of protein folds', Protein Sci., Vol. 1, pp. 216–226.
31. Bowie, J. U., Lüthy, R. and Eisenberg, D. (1991), `A method to identify protein sequences that fold into a known three-dimensional structure', Science, Vol. 253, pp. 164–170.
32. Ouzounis, C., Sander, C., Scharf, M. and Schneider, R. (1993), `Prediction of protein structure by evaluation of sequence-structure fitness. Aligning sequences to contact profiles derived from three-dimensional structures', J. Mol. Biol., Vol. 232, pp. 805–825.
33. Jones, D. T., Taylor, W. R. and Thornton, J. M. (1992), `A new approach to protein fold recognition', Nature, Vol. 358, pp. 86–89.
34. Bryant, S. H. and Lawrence, C. E. (1993), `An empirical energy function for threading protein-sequence through the folding motif', Proteins: Struct. Function Genet., Vol. 16, pp. 92–112.
35. Flöckner, H., Braxenthaler, M., Lackner, P. et al. (1995), `Progress in fold recognition', Proteins, Vol. 23, pp. 376–386.
36. Fischer, D. and Eisenberg, D. (1997), `Assigning folds to the proteins encoded by the genome of Mycoplasma genitalium', Proc. Natl Acad. Sci. USA, Vol. 94, pp. 11929–11934.
37. Grandori, R. (1998), `Systematic fold recognition analysis of the sequences encoded by the genome of Mycoplasma pneumoniae', Prot. Engng, Vol. 11, pp. 1129–1135.
38. Teichmann, S. A., Park, J. and Chothia, C. (1998), `Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements', Proc. Natl Acad. Sci. USA, Vol. 95, pp. 14658–14663.
39. Huynen, M., Doerks, T., Eisenhaber, F. et al. (1998), `Homology-based fold predictions for Mycoplasma genitalium proteins', J. Mol. Biol., Vol. 280, pp. 323–326.
40. Wolf, Y. I., Brenner, S. E., Bash, P. A. and Koonin, E. V. (1999), `Distribution of protein folds in the three superkingdoms of life', Genome Res., Vol. 9, pp. 17–26.
41. Salamov, A. A., Suwa, M., Orengo, C. A. and Swindells, M. B. (1999), `Genome analysis: Assigning protein coding regions to three-dimensional structures', Protein Sci., Vol. 8, pp. 771–777.
42. Teichmann, S. A., Chothia, C., Church, G. M. and Park, J. (2000), `Fast assignment of protein structures to sequences using the intermediate sequence library PDB-ISL', Bioinformatics, Vol. 16, pp. 117–124.
43. Rychlewski, L., Zhang, B. H. and Godzik, A. (1998), `Fold and function predictions for Mycoplasma genitalium proteins', Folding Design, Vol. 3, pp. 229–238.
44. Rychlewski, L., Zhang, B. H. and Godzik, A. (1999), `Functional insights from structural predictions: Analysis of the Escherichia coli genome', Protein Sci., Vol. 8, pp. 614–624.
45. Jones, D. T. (1999), `GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences', J. Mol. Biol., Vol. 287, pp. 797–815.
46. Teichmann, S. A., Chothia, C. and Gerstein, M. (1999), `Advances in structural genomics', Curr. Opin. Struct. Biol., Vol. 9, pp. 390–399.
47. URL: http://spock.genes.nig.ac.jp/~genome/summary.html
48. URL: http://bioinformatics.burnhaminst.org/pages/
49. URL: http://insulin.brunel.ac.uk/genomes/
50. Fischer, D., Barret, C., Bryson, K. et al. (1999), `CAFASP-1: Critical assessment of fully automated structure prediction methods', Proteins, Vol. S3, pp. 209–217.
51. URL: http://www.cs.bgu.ac.il/~dfischer/CAFASP2/
52. King, R. D., Karwath, A., Clare, A. and Dehaspe, L. (2000), `Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining', Yeast, Vol. 17, pp. 283–293.
53. Rost, B. (1996), `PHD: Predicting one-dimensional protein structure by profile-based neural networks', Methods Enzymol., Vol. 266, pp. 525–539.
54. Jones, D. T. (1999), `Protein secondary structure prediction based on position-specific scoring matrices', J. Mol. Biol., Vol. 292, pp. 195–202.
55. Jones, D. T. (1997), `Successful ab initio prediction of the tertiary structure of NK-lysin using multiple sequences and recognized supersecondary structural motifs', Proteins, Vol. S1, pp. 185–191.
56. Simons, K. T., Bonneau, R., Ruczinski, I. and Baker, D. (1999), `Ab initio protein structure prediction of CASP III targets using ROSETTA', Proteins, Vol. S3, pp. 171–176.
57. Baker, D. (2000), `A surprising simplicity to protein folding', Nature, Vol. 405, pp. 39–42.
58. Kraulis, P. J. (1991), `MOLSCRIPT: A program to produce both detailed and schematic plots of protein structures', J. Appl. Crystallogr., Vol. 24, pp. 946–950.

APPENDIX: SOME LINKS TO PROTEIN STRUCTURE PREDICTION RESOURCES ON THE WEB

Server | Organisation | Web site

Comparative modelling servers
3D-Jigsaw | Imperial Cancer Research Fund, UK | http://www.bmm.icnet.uk/servers/3djigsaw/
CPHmodels | Center for Biological Sequence Analysis, Denmark | http://www.cbs.dtu.dk/services/CPHmodels/
SWISS-MODEL | Swiss Institute of Bioinformatics | http://www.expasy.ch/swissmod/SWISS-MODEL.html

Sequence-based homology recognition servers
FFAS | Burnham Institute, USA | http://bioinformatics.ljcrf.edu/FFAS/
Fugue | Cambridge University, UK | http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html
PDB-ISL | Laboratory of Molecular Biology, Cambridge, UK | http://stash.mrc-lmb.cam.ac.uk
PSI-BLAST and CDD | National Center for Biotechnology Information, USA | http://www.ncbi.nlm.nih.gov
SAM | University of California, Santa Cruz, USA | http://www.cse.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html

Fold recognition servers
123D | National Cancer Institute, USA | http://www-lmmb.ncifcrf.gov/~nicka/123D+.html
3D-PSSM | Imperial Cancer Research Fund, UK | http://www.bmm.icnet.uk/~3dpssm/html/ffrecog.html
BioInBgu | Ben Gurion University, Israel | http://www.cs.bgu.ac.il/~bioinbgu/form.html
GenTHREADER | Brunel University, UK | http://www.psipred.net

General protein structure prediction servers
PredictProtein | Columbia University, USA | http://cubic.bioc.columbia.edu/predictprotein/
PSIPRED | Brunel University, UK | http://www.psipred.net

Meta-servers (portals to multiple prediction methods)
META | Columbia University, USA | http://cubic.bioc.columbia.edu/predictprotein/doc/meta_intro.html
Meta-Server | International Institute of Molecular and Cell Biology, Poland | http://BioInfo.PL/meta/

Collections of genome fold assignments
GTD | Brunel University, UK | http://insulin.brunel.ac.uk/genomes
GTOP | National Institute of Genetics, Japan | http://spock.genes.nig.ac.jp/~genome/summary.html
MODBASE | Rockefeller University, USA | http://pipe.rockefeller.edu/modbase/
PEDANT | Max-Planck Institute, Germany | http://pedant.mips.biochem.mpg.de/
PRESAGE | Berkeley University, USA | http://presage.berkeley.edu/

Benchmarking resources for protein structure prediction
CAFASP | Ben Gurion University, Israel | http://www.cs.bgu.ac.il/~dfischer/CAFASP2/
CASP | Lawrence Livermore National Laboratory, USA | http://predictioncenter.llnl.gov/
EVA | Columbia University, USA | http://cubic.bioc.columbia.edu/eva
LiveBench | International Institute of Molecular and Cell Biology, Poland | http://BioInfo.PL/LiveBench/

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 2. NO 2. 111–125. MAY 2001