Protein Engineering vol.11 no.12 pp.1129–1135, 1998 Systematic fold recognition analysis of the sequences encoded by the genome of Mycoplasma pneumoniae Rita Grandori1 Center for Applied Molecular Engineering, Institute for Chemistry and Biochemistry, University of Salzburg, A-5020 Salzburg, Austria 1Present address: Institute of Chemistry, Johannes Kepler University, Altenberger Strasse 69, A-4040 Linz, Austria. E-mail: [email protected] A robust tool for fold recognition was applied to the systematic analysis of the sequences below 200 residues encoded by the genome of Mycoplasma pneumoniae. The goal was to determine the additional information gain achievable in genome analysis by fold recognition, beyond the intrinsic limits of homology studies. A list of 124 sequences encoding for soluble proteins or domains not homologous to each other, or to proteins with known threedimensional structure, was analyzed, resulting in significant Z scores for the energy of the structural models in 12 of these cases. This result indicates that systematic application of fold recognition techniques to the analysis of structurally unassigned soluble proteins can lead to high-confidence structural predictions with an efficiency of about 10%, a relevant contribution besides the complementary approach of homology analysis. Four of the predictions presented include mapping of the putative active site of the target sequence and lead to the detection of probable catalytic and binding residues. The data are discussed with reference to the functional implications of the structural models and to the results reported for the homologous genome of Mycoplasma genitalium. Keywords: active site analysis/genome analysis/knowledgebased potentials/protein function prediction/protein structure prediction Introduction The development of highly efficient sequencing techniques has made possible the recent accomplishment of exhaustive sequencing projects, including already the entire genome of several organisms (Fleischmann et al., 1995; Fraser et al., 1995, 1997, 1998; Bult et al., 1996; Goffeau et al., 1996; Himmelreich et al., 1996; Kaneko et al., 1996; Blattner et al., 1997; Klenk et al., 1997; Kunst et al., 1997; Smith et al., 1997; Tomb et al., 1997; Cole et al., 1998; Deckert et al., 1998). Genome projects provide raw data with a tremendous information potential that could bring important contributions to our understanding of the metabolism, regulation and evolution of biological systems. The interpretation of DNA sequencing results is limited, at present, by our tools for the prediction of structural and functional features of the encoded protein products. This task has become relatively easy in the presence of homology to already characterized proteins, but it represents one of the central challenges of theoretical biology whenever the prediction cannot rely on sequence similarity. At the present level of expansion of databases, homology © Oxford University Press analysis leads, on average, to a functional assignment or at least coarse-grained classification, for ~60% of the open reading frames (ORFs) of newly sequenced genomes (see references above), and also to a structural assignment for 10–15% of the cases (Rost, 1998). Complementary approaches to protein structure prediction, like fold recognition methods, are expected to push forward these limits to the information we can extract from the results of genome projects. Indeed, fold recognition methods try to detect structural similarities that could not be detected on the sequence level, searching for known protein structures that might represent a good template for a given sequence of unknown native conformation. A knowledge-based fold recognition method (Sippl, 1993a; Flöckner et al., 1995; Sippl and Flöckner, 1996) was applied in this work to the systematic analysis of the ORFs of the genome of Mycoplasma pneumoniae (Himmelreich et al., 1996). The method makes use of mean force potential functions for residue-pair interactions and solvent accessibility to evaluate the fitness of a given sequence for alternative conformations. The input sequence is threaded on to a database of different known protein structures. The output is a list of sequence– structure alignments ranked on the basis of their relative energies, where low-energy conformations represent candidate native-like models for that sequence. This method has been demonstrated to be a robust tool for fold recognition and to perform with high alignment accuracy (Flöckner et al., 1995, 1997; Levitt, 1997). The choice of M.pneumoniae genome was motivated by its small size and by the availability of an exhaustive and accurate sequence analysis (Himmelreich et al., 1996). Hence this genome represents a good system to test how much new structural and functional information we can gain by fold recognition methods, beyond the limits of the most sophisticated analysis of sequence homology. The M.pneumonaie genome has 677 predicted ORFs. Of these, 376 have an assigned putative function on the basis of sequence homology, 198 show sequence homology only to proteins with unknown function and 103 show no significant sequence homology to any sequence in the database (Himmelreich et al., 1996). This genome contains 155 sequences that show significant sequence similarity (.25% identity for more than 80 residues aligned; Sander and Schneider, 1991) to proteins with known three-dimensional (3D) structure (F.Domingues, personal communication). In this work an attempt was made to identify new high-confidence templates for all the structurally unassigned sequences below 200 amino acids and to interpret the structural models with reference to the putative function of the protein. During this project, Eisenberg and co-workers performed a similar analysis on the homologous genome of M.genitalium (Fischer and Eisenberg, 1997).The present results are discussed in comparison with the reported data for M.genitalium. Materials and methods The amino acid sequences encoded by the genome of M.pneumoniae were downloaded from the internet site 1129 R.Grandori http://www.zmbh.uni-heidelberg.de/M pneumoniae and sequences from M.genitalium were downloaded from the internet site http://www.tigr.org/tdb/mdb/mgdb/mgdb.html. Protein sequences are named in this work with the original identifier of the corresponding ORFs. The PDB identifiers are used for structures, followed when necessary by the letter denoting the subunit chain and the number denoting one of multiple structural models. Only sequences of M.pneumoniae below 200 amino acids were considered, in order to limit the analysis to the range of highest reliability of fold recognition methods, owing to the technical difficulties of detecting multi-domain structures (Levitt, 1997). Only one sequence longer than 200 amino acids was examined (A05_orf290). This sequence is 290 residues long but, since it belongs to a relatively large homology group in the category of functionally unassigned sequences that would have not been represented among the target sequences of the present work and since three putative domains shorter than 200 residues could be recognized in it, the sequences corresponding to such domains were included in the list of the target sequences in the attempt to identify new functions. Sequences with predicted trans-membrane regions according to the authors of the sequence (Himmelreich et al., 1996) and to the program ProSEQ (M.Jaritz, unpublished work) were considered only when at least one putative soluble domain could be identified (a stretch longer than 50 residues without trans-membrane regions). In those cases, the fragment corresponding to the putative soluble domain was used independently for fold recognition. Every sequence was searched for the presence of putative structural domains also by homology search with FastA (Pearson and Lipman, 1988) against the Swissprot database. Whenever a homology domain could be detected within a sequence, the corresponding fragment was analyzed independently, as well as the complementary one. Sequence homology within the M.pneumoniae genome was also taken into account; a group of homologous sequences is considered as one fold recognition target. Sequences homologous (.25% identity, .80 residues aligned) to proteins with known 3D structure were also filtered out from the list of target proteins. The program ProFIT (Sippl, 1993a; Flöckner et al., 1995, 1997; Sippl and Flöckner, 1996) was used for fold recognition. Each target sequence was threaded on to a database containing 1700 structures. This database is a representative subset of the Brookhaven database, still containing redundancy for each structural class. ProFIT scores (Sippl, 1993b; Sippl and Jaritz, 1994) were used for the ranking of the sequence–structure alignments and were the main criterion for the evaluation of the outputs. However, ProFIT outputs were also inspected by the application of several other criteria of judgement: the quality of the alignments (number and position of gaps), the structural similarity among the highest scoring templates, the agreement between the models and the secondary structure prediction (when multiple homologous sequences were available), the energy profiles of the models as calculated by the program ProSA (Sippl, 1993b) and the consistency of ProFIT results for homologous sequences, when available. Only those models that generated consistent evidence by these different kinds of test were considered for further analysis. The program ProSUP (Feng and Sippl, 1996) was used for structure comparison. The goodness of the selected models was finally evaluated by the calculation of the Z scores for their energy values by 1130 the program ProSA (Sippl, 1993b). The raw models generated by ProFIT were refined by the program Modeler (S̆ali and Blundell, 1993) to add side-chains, close gaps and model loops. The energy of the models was then calculated by means of the combined residue-pair and surface potential functions (Sippl, 1993a) and compared with the distribution of the energy values obtained by threading the target sequence on to a polyprotein of known structures. The Z scores were calculated according to the equation Z 5 E – Ē/σ, where E is the energy of the model and Ē and σ are the mean and the standard deviation of the random distribution, respectively. A Z score value of –5 was considered as the threshold for high-confidence models. Results The list of ORFs from M.pneumoniae was filtered, as described in Materials and methods, by size, sequence hydrophobicity and homology relations, resulting in a total of 124 fold recognition targets analyzed in this work. Of these 124 sequences, 57 have a putative function by homology, 40 have uncharacterized homologous sequences and 27 are so far unique sequences. High confidence fold recognition results were obtained for 12 sequences of M.pneumoniae, as summarized in Table I. Six of these proteins belong to the functionally assigned category, five to the functionally unassigned category and one to the group of unique sequences. The sequence identity between each target and its predicted template is below 20%. The Z scores for the energies of the models, calculated with reference to the energy distribution of alternative folds for each target sequence, range between –5 and –8.6, indicating significant low energies of the proposed models (Sippl, 1993b). The alignments obtained by fold recognition were inspected for matching of functional residues of the template protein and for conservation of sequence signatures that could reflect similarities in the function of the two proteins. In four cases of sequences belonging to the functionally assigned category, these results bring some new insights about the putative active site of the molecule, indicating probable catalytic and binding residues. Despite the extensive sequence similarity between the genomes of M.pneumoniae and M.genitalium, there is a very small overlap between the predictions presented here and those reported by Eisenberg and co-workers (Fischer and Eisenberg, 1997), which can be partially explained by the size limit applied to the ORFs examined in this work. Indeed, almost all the predictions reported by Eisenberg and co-workers refer to proteins longer that 200 residues. Only two sequences are given a structural assignment in both studies: B01_orf178 of M.pneumoniae (MG030 of M.genitalium) and P01_orf193 of M.pneumoniae (MG335 of M.genitalium). In the first case the prediction is the same, whereas in the second case the predicted folds belong to the same SCOP superfamily (Murzin et al., 1995) but present some relevant differences. The prediction results for those cases with functional implications, which include B01_orf178 and P01_orf193, are discussed below in greater detail. All the alignments and models resulting from this work are available from the author. B01_orf178 (uracil phosphoribosyltransferase) A putative transmembrane region could be detected in this protein by the authors of the sequence (Himmelreich et al., 1996) but it was not confirmed by the program ProSEQ for Fold recognition for M. pneumoniae sequences Table I. Summary of the structural predictions ORF Domain (function) A19_orf200 B01_orf178 D02_orf152 F10_orf100B F11_orf133 P01_orf193 (no function) A05_orf290 B01_orf108 C09_orf159 F10_orf153 P01_orf197 (unique) GT9_orf113 10–76 Homologa Model Classb % IDEc Z scores Fctd MG264 MG030 MG396 MG232 MG276 MG335 1ukd.-.1hgx.A.1bmt.A.1htt.A.1oro.A.1gua.A.- α/β α/β α/β α/β α/β α/β 17 14 14 15 20 18 –7.1 –6.4 –6.7 –6.9 –6.6 –6.7 1 1 MG125 MG029 MG207 MG230 MG333 1ctf.-.1put.-.1 4fxn.-.4fxn.-.1qrd.A.- α1β α1β α/β α/β α/β 17 13 11 17 18 –6.3 –6.4 –5.0 –6.9 –8.6 1arn.-.- β 14 –5.1 1 1 aClosest homologous sequence. bStructural class of the template cPercentage dFunctional according to the SCOP database (Murzin et al., 1995). of identical residues between the target sequence and the template, calculated on the alignments obtained by ProFIT (Sippl, 1993a). implications discussed in this work. sequence analysis (M.Jaritz, unpublished work) using a multiple sequence alignment. Therefore, the whole sequence was used as input for fold recognition. The result is 1hgx.A.- as a template for this protein, in agreement with Eisenberg’s result; 1hgx.A.- corresponds to one subunit of the homodimeric hypoxanthine–guanine–xanthine phosphoribosyltransferase (HGXPRTase) from Tritrichomonas foetus (Somoza et al., 1996). Phosphoribosyltransferases (PRTases) catalyze the reversible transfer of a 5-phosphoribosyl group between pyrophosphate and a nitrogenous base. B01_orf178 and HGXPRTase have similar activity, although different specificity belonging to the pyrimidine and purine salvage pathway, respectively. Recently reported structures for enzymes of nucleotide synthesis show that several PRTases with very little sequence homology and different specificity share a common fold (Smith, 1995). The common core of the PRTase fold (residues 10–131 of HGXPRTase) consists of a five-stranded parallel β-sheet surrounded by three or four α-helices. Additional structural elements at the N- and C-termini of the molecule build up the so-called ‘hood’ subdomain, that carries the determinants for the base specificity of these enzymes. PRTases show only one region of sequence conservation, which follows the consensus [LIVMFYWCTA]–[LIVM]–[LIVMA]–[LIVMFC]–[DE]– D–[LIVMS]–[LIVM]–[STAVD]–[STAR]–[GAC]–X–[STAR] (Bairoch, 1992) and which encompasses the 59-phosphatebinding loop of the active site. As shown in Figure 1A, sequence B01_orf178 is aligned to both the core and the ‘hood’ of the template structure. A ribbon drawing of the resulting structural model for B01_orf178 is shown in Figure 2A. The target sequence presents only one mismatch to the consensus sequence of PRTases (Figure 1A), suggesting that this region of B01_orf178 (residues 96–108) is also involved in 59-phosphate binding. In particular, on the basis of the protein-GMP contacts in 1hgx (Somoza et al., 1996), it can be predicted that the side chains of Thr 105 and Thr 108 in B01_orf178 make hydrogen bonds with the 59-phosphate group of the substrate, like the aligned threonine residues of HGXPRTase and that Asp 100 is involved in the binding of the ribosyl moiety, by analogy to Glu 87 of HGXPRTase. The second acidic residue (Asp 88 in HGXPRTase), usually conserved in PRTases, is not present in this loop of B01_orf178. Furthermore, the residue corresponding to Thr 41 of HGXPRTase is always a lysine or an arginine in the other PRTases and it is probably involved in pyrophosphate binding (Somoza et al., 1996). The lack of this basic residue in HGXPRTase is thought to account for the high Km for pyrophosphate of this enzyme (Somoza et al., 1996). Similarly to all the other known PRTases, B01_orf178 has an arginine residue at this position. Other active site contacts suggested by the alignment on the basis of the GMP complex of HGXPRTase are hydrophobic interactions of Met 102 and Tyr 162 with the base. These results suggest that uracil-PRTases share the general fold and configuration of the active site observed so far in other PRTases. By analogy to other known PRTases (Eads et al., 1994; Scapin et al., 1995; Somoza et al., 1996), it is expected that B01_orf178 will also form dimers in solution. Nevertheless, the variability of the position and nature of the residues making up the dimer interface in PRTases (Somoza et al., 1996) does not allow one to derive further support for this hypothesis from sequence inspection. The same limitation applies to the interactions between the ‘hood’ and the base, which involve main-chain atoms. F11_orf133 (adenine phosphoribosyltransferase) F11_orf133 and B01_orf178 have similar putative functions, despite their low sequence similarity (16% identity). The prediction result for F11_orf133 (Figure 2D) is the orotate phosphoribosyltransferase (OPRTase) from Escherichia coli (1oro; Henriksen et al., 1996), another protein of the PRTases family, structurally related to HGXPRTase. In the results reported in Figure 1B, the sequence of F11_orf133 is aligned to the core region of OPRTase, excluding the structural elements of the ‘hood’ subdomain (residues 1–41 and 184– 213 of OPRTase). F11_orf133 perfectly matches the PTRases consensus signature (residues 76–88 of F11_orf133, Figure 1B). On the basis of the active-site interactions reported for OPRTases (Scapin et al., 1995; Henriksen et al., 1996), including those with complexed 5-phosphoribosyl-1-pyrophosphate (Scapin et al., 1995), the alignment reported in Figure 1B supports prediction of the 59-phosphate binding site 1131 R.Grandori Fig. 1. Sequence–structure alignments obtained by the program ProFIT. The upper sequence corresponds to the target protein, the lower sequence corresponds to the template. The secondary structure of the template, as calculated by the program ProFIT, is reported below its sequence (e, β-strand; h, α-helix). The sequence signatures referred to in the text are highlighted in gray. Vertical bars indicate residue identities, pairs of dots indicate residue similarities (scores higher than 0 according to the Blosum62 substitution matrix; Henikoff and Henikoff, 1992). of F11_orf133 (residues 80–88), and also detection of the putative pyrophosphate-binding region (residues 17–20 and 39–43). In particular, all the basic residues of OPRTases known to be required for catalysis (Grubmeyer et al., 1993) and/or to make contacts with the pyrophosphate (Scapin et al., 1995) are conserved in F11_orf133 (Arg19, Arg39, Lys40 and Lys43). The two aspartates involved in ribose binding are also conserved (Asp80 and Asp81 of F11_orf133), whereas the two threonines probably making hydrogen bonds to the 59-phosphate (Thr85 and Thr88 of F11_orf133) are shifted by one residue from the active Thr128 and Thr131 of OPRTase. On the basis of these findings, it seems likely that F11_orf133 shares the fold and the active residues typical of the core of PRTases. It would be of interest to investigate further the base specificity of F11_orf133 and its structural determinants with the lack of the ‘hood’ subdomain. P01_orf193 (hypothetical GTP-binding protein) In agreement with the hypothetical GTP-binding activity of this protein (Himmelreich et al., 1996), the results presented here indicate that this is a ras-related protein, with 1gua.A.(rap1A; Nassar et al., 1995) being the highest scoring template (Figure 2E). The prediction reported by Eisenberg and coworkers is, instead, 1gky (Stehle and Schulz, 1992), the yeast 1132 guanylate kinase (GK). Rap1A and GK are both α/β proteins but present different architectures. Whereas rap1A consists of a single domain folded around a central β-sheet, GK folds into two domains each centered on a distinct β-sheet. The minor domain of GK, containing the GMP-binding site, has no counterpart in the so-called G fold of the ras-like proteins (Stehle and Schulz, 1992). The two proteins also catalyze different reactions, where GTP and ATP are, respectively, the hydrolyzable substrates. The major domain of GK shows some structural similarities to the G fold, particularly in the N-terminal βα motif that includes the P loop typical of both GTP- and ATP-binding proteins (Saraste et al., 1990). The Nterminal region of P01_orf193 (residues 27–34) contains the generalized consensus sequence for the P loop G–X(4)–G–K– [TS] (Saraste et al., 1990), which is aligned to the corresponding region of rap1A (Figure 1C). Nonetheless, there is also good sequence conservation in other regions involved in GTP binding, that are typical of ras-related proteins (Valencia et al., 1991) and not of guanylate kinases. These are the NKXD and the SAK/L boxes (residues 136–139 and 168–170 of P01_orf193, respectively), which are involved in the guanine base binding and a DXXG box (residues 70–73 of P01_orf193) shifted by three residues from the corresponding box in rap1A. Fold recognition for M. pneumoniae sequences Fig. 2. Ribbon representation of the structural models presented in this work. The raw models generated by ProFIT were refined by the program Modeler (S̆ali and Blundell, 1993). The images were produced using the program RasMol (Sayle and Milner-White, 1995) and rendered using the program MolScript (Kraulis, 1991). Terminal residues of the input sequence not aligned to the template were not included in the models. A, B01_orf178/1hgx.A.-; B, D02_orf152/1bmt.A.-; C, F10_orf100B/1htt.A.-; D, F11_orf133/1oro.A.-; E, P01_orf193(res. 18–188)/1gua.A.-; F, A05_orf290(res. 10–76)/1ctf.-.-; G, A19_orf200(res. 1–187)/1ukd.-.-; H, B01_orf108/1put.-.1; I, C09_orf159(res. 1–136)/4fxn.-.-; J, F10_orf153(res. 20–153)/4fxn.-.-; K, P01_orf197/1qrd.A.-; L, GT9_orf113/1arn.-.-. 1133 R.Grandori The residue following the DXXG box (Gln61 in ras p21 and T61 in rap1A) has been implicated in the catalytic mechanism as the activator (via hydrogen bonding) of the water molecule that performs the nucleophilic attack on the γ-phosphate of GTP (Pai et al., 1990; Nassar et al., 1995). The presence of a threonine residue at this position in rap1A accounts for its reduced intrinsic GTPase activity relative to ras p21 (Frech et al., 1990). It is interesting that the corresponding residue in P01_orf193 would be a tyrosine (Tyr 74). In conclusion, inspection of the sequence alignment strengthens the prediction of 1gua as a good template for this sequence and allows at the same time assignment of the corresponding functional residues to P01_orf193. Therefore, P01_orf193 is with high probability a regulatory GTP-binding protein. Nevertheless, its sequence shows poor conservation of the effector region of rap1A and ras p21 (residues 32–40 of rap1A; Nassar et al., 1995), and also no match to the GTP-binding elongation factors signature (Bairoch, 1992), suggesting that it plays a distinct physiological role. A19_orf200 (nucleotide-binding protein) The predicted template for this sequence (Figure 2G) is 1ukd.-.-, the UMP/CMP kinase (UK) from slime mold (Scheffzek et al., 1996). A glycine-rich motif typical of the P loop of nucleotide-binding proteins could be already detected in its N-terminal region (see Swissprot:y264_mycge). In the alignment reported in Figure 1D the glycine-rich sequence of A19_orf200 (residues 7–16) matches the corresponding region of UK at the end of the first β strand. The high sequence conservation observable in this region, including the lysine and threonine side-chains (residues 13 and 15 of A19_orf200, respectively) that make hydrogen bonds with the phosphoryl groups of the competitive inhibitor in 1ukd, suggests that A19_orf200 binds ATP in a similar fashion to UK. The lysine of the P loop corresponding to Lys13 of A19_orf200 has been shown to be essential for the catalytic activity of adenylate kinases (Reinstein et al., 1990; Tian et al., 1990; Byeon et al., 1995), a family of enzymes highly related to UK (Scheffzek et al., 1996). Most of the hydrophobic residues of the UMP-binding site of UK (residues 31, 35, 62, 63, 68, 89 of A19_orf200) appear to be conserved. Nevertheless, A19_orf200 shows poor conservation of the sequence signature of adenylate and UMP/CMP kinases (residues 85–96 of UK; Bairoch, 1992), and also low sequence similarity to the region corresponding to the ‘lid’ of UK (residues 130–140 of UK), which carries several catalytically important residues. Therefore, A19_orf200 is with high probability a kinase, but its functional equivalence to UK remains ambiguous. Discussion This work represents one of the first attempts at the systematic application of fold recognition techniques to genome databases, although the analysis is restricted by the size limit of 200 amino acids. The implementation of automated procedures for fragmentation of the input sequence will allow one to extend systematic fold recognition analysis to large protein sequences. The genome of M.pneumoniae contains 144 non-homologous sequences below 200 amino acids encoding for soluble proteins. Of these, only 20 (14%) show significant sequence similarity to proteins with known 3D structure and can, therefore, be assigned a structure by normal sequence analysis. Highconfidence fold recognition results were obtained for another 12 sequences, representing an additional 8% (10% of the target 1134 sequences) to the intrinsic limits of structural prediction by sequence homology. Although the efficiency of structural prediction might decrease for large protein sequences, these results show that fold recognition techniques can make a significant contribution to the developing new discipline of functional genomics (Hieter and Boguski, 1997). The results obtained in this work also show that fold recognition can provide, in several cases, useful information about the function and the active site of the target protein. Both processes of evolutionary divergence and convergence can generate structurally related proteins containing similar active residues in the lack of overall sequence similarity. Detection of such cases is of great predictive importance but cannot rely on sequence–sequence, rather on sequence– structure, alignment methods. Inspection of residue conservation on the alignments generated by fold recognition led, in this study, to the identification of putative active residues of four proteins, providing heuristic models for future functional studies. The results presented here were compared with those reported for the homologous genome of M.genitalium (Fischer and Eisenberg, 1997). With the exception of one, so far unique, sequence, the proteins predicted in this work have a close homolog in M.genitalium, which generates ProFIT results consistent with those obtained with the corresponding sequence from M.pneumoniae. Nevertheless, only two of these sequences appear among the predictions reported for M.genitalium. The data obtained in this work are in agreement with one of these predictions (B01_orf178) but strongly suggest an alternative model for the second one (F11_orf133). The comparison of the two sets of results also indicates that two sequences shorter than 200 residues (MG287/F11_orf84 and MG353/ G12_orf109) and reported among the predictions for M.genitalium are not included in the results presented here. In the case of MG287/F11_orf84, the reason is that these sequences show significant sequence similarity (~25% identity) to the acyl carrier protein (1acp) and therefore were not targets of the present work. The second sequence, instead, did not generate high-confidence fold recognition results with the method employed in this work. The reported prediction for MG353 is 1huea, the histone-like protein HU. It is interesting that HU is not a compact globular protein; rather, it contains two extended, partially disordered β ribbons (Tanaka et al., 1984). It is therefore expected that the program ProFIT, based on distance analysis of globular proteins, will fail to detect cases of this kind. A critical problem in fold recognition is the method for the evaluation of the output. Extensive human inspection of the results, as performed in this work, is an approach that allows for high sensitivity, although it is time consuming. On the other hand, fully automated fold-assignment procedures allow for high efficiency but require the application of stringent score thresholds, resulting in lower sensitivity. Highly automated procedures will be required for efficient application of fold recognition techniques to large databases, such as genome sequences. On the other hand, this work emphasizes the importance of complex and articulated evaluation systems. Therefore, the optimization of the scoring systems to maximize both accuracy and sensitivity represents an urgent demand in this field in order to develop higher automated procedures for functional genomics. Fold recognition for M. pneumoniae sequences Note added in proof After submission of this paper, a related paper, ‘Homologybased fold predictions for Mycoplasma genitalium proteins’ by M.Huynen, T.Doerks, F.Eisenhaber, C.Orengo, S.Sunyaev, Y.Yuan and P.Bork, was published (J. Mol. Biol., 1998, 280, 323–326). By use of the program PSI-BLAST, the authors obtained identical or very similar structural predictions to those presented here for the sequences MG030, MG276, MG335, MG264 and MG333. The authors also report a higher efficiency (23% of the target residues) of homology-based structural prediction for new protein sequences than previously achieved by other methods. Acknowledgments Francisco Domingues (CAME, Salzburg) is thanked for useful discussions and technical assistence, Manfred Sippl (CAME, Salzburg) for providing the opportunity to carry out this study and Jannette Carey (Princeton University) for critical reading of the manuscript. This work was supported by FWF grant P11601GEN. Scapin,G., Ozturk,D.H., Grubmeyer,C. and Sacchettini,J.C. (1995) Biochemistry, 34, 10744–10754. Scheffzek,K., Kliche, W., Wiesmüller,L. and Reinstein,J. (1996) Biochemistry, 35, 9716–9727. Sippl,M.J. (1993a) J. Comput.-Aided Mol. Des., 7, 473–501. Sippl,M.J. (1993b) Proteins, 17, 335–362. Sippl,M.J. and Flöckner,H. (1996) Structure, 4, 15–19. Sippl,M.J. and Jaritz,M. (1994) In Bohr,H. and Brunak,S. (eds), Protein Structure by Distance Analysis. IOS Press, Amsterdam, pp. 113–134. Smith,D.R. et al. (1997) J. Bacteriol., 179, 7135–7155. Smith,J.L. (1995) Curr. Opin. Struct. Biol., 5, 752–757. Somoza,J.R., Chin,M.S., Focia,P.J., Wang,C.C. and Fletterick,R.J. (1996) Biochemistry, 35, 7032–7040. Stehle,T. and Schulz,G.E. (1992) J. Mol. Biol., 224, 1127–1141. Tanaka,I., Appelt,K., Dijk,J., White,S.W. and Wilson,K.S. (1984) Nature ,310, 376–381. Tian,G., Yan,H., Jiang,R.-T., Kishi,F., Nakazawa,A. and Tsai,M.-D. (1990) Biochemistry, 29, 4296–4304. Tomb,J.-F. et al. (1997) Nature, 388, 539–547. Valencia,A., Kjeldgaard,M., Pai,E.F. and Sander,C. (1991) Proc. Natl Acad. Sci. USA, 88, 5443–5447. Received March 11, 1998; revised August 12, 1998; accepted August 17, 1998 References Bairoch,A. (1992) Nucleic Acids Res., 20, 2013–2018. Blattner,F.R. et al. (1997) Science, 277, 1453–1474. Bult,C.J. et al. (1996) Science, 273, 1058–1073. Byeon,I.J.L., Shi,Z.T. and Tsai,M.-D. (1995) Biochemistry, 34, 3172–3182. Cole,S.T. et al. (1998) Nature, 393, 537–544. Deckert,G. et al. (1998) Nature, 392, 353–358. Eads,J.C., Scapin,G., Xu,Y., Grubmeyer,C. and Sacchettini,J.C. (1994) Cell, 78, 325–334. Feng,Z.-K. and Sippl,M.J. (1996) Folding Des., 1, 123–132. Fischer,D. and Eisenberg,D. (1997) Proc. Natl Acad. Sci. USA, 94, 11929– 11934. Fleischmann,R.D. et al. (1995) Science, 269, 496–512. Flöckner,H., Braxenthaler,M., Lackner,P., Jaritz,M., Ortner,M. and Sippl,M.J. (1995) Proteins, 23, 376–386. Flöckner,H., Domingues,F. and Sippl,M.J. (1997) Proteins, Suppl. 1, 129–133. Fraser,C.M. et al. (1995) Science, 270, 397–403. Fraser,C.M. et al. (1997) Nature, 390, 580–586. Fraser,C.M. et al. (1998) Science, 281, 375–388. Frech,M., John,J., Pizon,V., Chardin,P., Tavitian,A., Clark,R., McCormick,F. and Wittinghofer,A. (1990) Science, 249, 169–171. Goffeau,A. et al. (1996) Science, 274, 546–567. Grubmeyer,C., Segura,E. and Dorfman,R. (1993) J. Biol. Chem., 268, 20299–20304. Henikoff,S. and Henikoff,J.G. (1992) Proc. Natl Acad. Sci. USA, 89, 10915–10919. Henriksen,A., Aghajari,N., Jensen,K.F. and Gajhede,M. (1996) Biochemistry, 35, 3803–3809. Hieter,P. and Boguski,M. (1997) Science, 278, 601–602. Himmelreich,R., Hilbert,H., Plagens,H., Pirkl,E., Li,B.-C. and Herrmann,R. (1996) Nucleic Acids Res., 24, 4420–4449. Kaneko,T. et al. (1996) DNA Res., 3, 109–136. Klenk,H.-P. et al. (1997) Nature, 390, 364–370. Kraulis,P.J. (1991) J. Appl. Crystallogr., 24, 946–950. Kunst,F. et al. (1997) Nature, 390, 249–256. Levitt,M. (1997) Proteins, Suppl. 1, 92–104. Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540. Nassar,N., Horn,G., Herrmann,C., Scherer,A., McCormick,F. and Wittinghofer,A. (1995) Nature, 375, 554–560. Pai,E.F., Krengel,U., Petsko,G.A., Goody,R.S., Kabsch,W. and Wittinghofer,A. (1990) EMBO J., 9, 2351–2359. Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 2444– 2448. Reinstein,J., Schlichting,I. and Wittinghofer,A. (1990) Biochemistry, 29, 7451–7459. Rost,B. (1998) Structure, 6, 259–263. S̆ali,A. and Blundell,T.L. (1993) J. Mol. Biol., 234, 779–815. Sander,C. and Schneider,R. (1991) Proteins, 9, 56–68. Saraste,M., Sibbald,P.R. and Wittinghofer,A. (1990) Trends Biochem. Sci., 15, 430–434. Sayle,R.A. and Milner-White,E.J. (1995) Trends Biochem. Sci., 20, 374. 1135
© Copyright 2026 Paperzz