BIOINFORMATICS Vol. 17 no. 1 2001 Pages 13–15 Pro-Frame: similarity-based gene recognition in eukaryotic DNA sequences with errors Andrey A. Mironov, Pavel S. Novichkov and Mikhail S. Gelfand State Scientific Center for Biotechnology NIIGenetika, Moscow, 113545, Russia Received on September 22, 1999; revised on June 5, 2000; accepted on July 27, 2000 ABSTRACT Summary: Performance of existing algorithms for similarity-based gene recognition in eukaryotes drops when the genomic DNA has been sequenced with errors. A modification of the spliced alignment algorithm allows for gene recognition in sequences with errors, in particular frameshifts. It tolerates up to 5% of sequencing errors without considerable drop of prediction reliability when a sufficiently close homologous protein is available (normalized evolutionary distance similarity score 50% or higher). Availability: The program is free for academic users and available upon request at http://www.anchorgen.com Contact: [email protected] Analysis of sequence similarity is a powerful tool for gene recognition. It is employed in a number of database search programs, most notably BLASTX (Gish and States, 1993), and programs for exact prediction of exon–intron structure, in particular, Procrustes (Gelfand et al., 1996; Mironov et al., 1998), INFO (Hultner et al., 1994; Laub and Smith, 1998), GeneWise (Birney and Durbin, 1997). The common idea behind these algorithms is that among numerous possible exon chains, an algorithm chooses the chain having the highest similarity to a related protein (target). This is done by modified dynamic programming treating introns as a special case of gaps (GeneWise) or by spliced alignment (Procrustes). Testing of the similarity-based gene recognition programs demonstrated that given sufficiently close relatives, they produce highly reliable predictions. In particular, the correlation between predicted and real human genes is 96–99% when homologous vertebrate genes are available (Mironov et al., 1998; Laub and Smith, 1998). However, the quality of gene predictions when the genomic DNA contains sequencing errors is much lower (Burset and Guigo, 1996). One possibility to avoid this problem is to use the DNA spliced alignment instead of aligning translated candidate exons with proteins (Sze and Pevzner, 1997). However, it is well known that protein alignments are much more sensitive to distant similarities than nucleotide alignments. Thus it is indicative that there exist c Oxford University Press 2001 numerous protein–DNA alignment algorithms accounting for frameshifts (Posfai and Roberts, 1992; Birney et al., 1996; Guan and Uberbacher, 1996; Zhang et al., 1997; Pearson et al., 1997). However, none of them handles introns. We have implemented a modified version of the spliced alignment algorithm performing gene recognition in the presence of frameshift errors. The algorithm treats introns as non-penalized gaps that may start only at dinucleotide GT and end at dinucleotide AG. Frameshifts and in-frame stop codons in the genomic sequence are allowed, but heavily penalized. There is an option for acceleration of the dynamic programming, using the k-tuple alignment technique due to M.Roytberg (Nazipova et al., 1995). Since sequencing errors can destroy the invariant dinucleotides at splicing sites, the program has a postprocessing step. At this step the program identifies runs of deletions at exon termini, and moves the exon–intron boundary even if there are no suitable dinucleotides. More exactly, observing more than 50% deleted positions in the region (−30, +30) around the exon junction, the program searches for the optimal position of the donor and acceptor splicing sites allowing for a single deviation from the invariant dinucleotide at each site. The program outputs the exon positions before and after the correction and the alignment of the predicted exons and the target protein. Results of testing the algorithm on a sample of human genes and related proteins from (Mironov et al., 1998) are given in Figure 1. This sample consists of 256 genes. The average length of genomic sequences is approximately 8100 nucleotides, with the longest sequence exceeding 180 000 nucleotides. The number of exons ranges from 1 through 54, the average number of exons per gene is 5.5. The average length of exons in multi-exon genes is 140 nucleotides. The total number of protein targets is 731, their average length is 575 amino acids. Five independent rounds of mutations were performed for each sequence. The 3655 predictions (five times 731 comparisons) were done in about 5 h on a PC with Pentium II 400 MHz processor under Windows NT. The performance at different error levels is estimated using the standard correlation coefficient measure (Burset 13 A.A.Mironov et al. 100 100 90 90 80 Pro 0% 70 1% 2% 60 0% 1% 70 2% 3% 60 3% 4% 50 4% 50 5% 5% 6% 40 6% 40 8% 10% 30 8% 30 10% 20 12% 12% 15% 20 15% 10 10 0 0 0 10 20 30 40 50 60 70 80 90 100 Fig. 1. Testing of Pro-Frame on a sample of human genes. Horizontal axis: similarity between actual genes and related proteins (in %, see the text for definition). Vertical axis: correlation coefficient (in %). Each curve corresponds to a specific level of sequencing errors (the percent of erroneous positions is given in the legend on the right). ‘Pro’ corresponds to the original Procrustes algorithm. and Guigo, 1996; Mironov et al., 1998). For comparison we also present the correlation coefficient demonstrated by the original Procrustes algorithm and results of gapped BLASTX (Altschul et al., 1997) using the target protein as the query sequence (Figure 2). Since the performance depends on the similarity between the gene and a target, the figures feature plots of the correlation coefficient at different similarity levels. The similarity measure is the score of the alignment of the actual and target proteins divided by the half-sum of the scores of (trivial) alignments of the actual protein and the target protein with themselves. Such normalization accounts for varying protein length and amino acid composition. Sequencing errors were modeled as random nucleotide substitutions (80%), insertions (10%) and deletions (10%); the latter two types of errors had length one through three with equal probabilities. For instance, at the error rate 15% this means that nucleotides at 12% of all positions have been changed, 3% of nucleotides were deleted (0.5% of single nucleotide deletions, 0.5% of dinucleotide deletions, and 0.5% of trinucleotide deletions), and 1.5% positions contained insertions (with 0.5% being single nucleotides, 0.5%, dinucleotides, and 0.5%, trinucleotides). Errors at GT-AG invariant intron termini were not treated as a special case, however, the fraction of mutated termini can be easily estimated given the overall error rate. At all similarity and error levels Pro-Frame provides better recognition than straightforward BLASTX (cf. Figures 1 and 2). It is noteworthy that in the absence of sequencing errors Pro-Frame performs almost as well as Procrustes when the target protein is close to the analyzed gene, but the performance drops for distant relatives. This agrees with our observations about importance of the 14 80 0 10 20 30 40 50 60 70 80 90 100 Fig. 2. Prediction of coding regions in the sample of Figure 1 using BLASTX. Axes and notation as in Figure 1. statistical filtering procedure implemented in Procrustes (Mironov et al., 1998). On the other hand, up to 3% rate of sequencing errors does not considerably influence the reliability of predictions, and further, up to 6% of errors are easily tolerated if the target protein is sufficiently close to the analyzed gene. The above results demonstrate that Pro-Frame may be a useful tool for analysis of preliminary sequencing data, e.g. phase I or II output of major sequencing projects, draft human genome sequences, etc. The initial identification of target proteins should be done by BLASTX, and then ProFrame can be used to exactly map the exon boundaries and to find relatively short exons that could be missed by the straightforward similarity analyses. Since Pro-Frame does not rely on the statistical properties of the analyzed genome (the only requirement is the GT–AG rule for the intron termini), the program can be used for gene recognition in invertebrate, plant, fungal, and even protist sequences. ACKNOWLEDGEMENTS We are grateful to Drs V.Bafna, J.Fickett, P.Pevzner and M.Roytberg for useful discussions. This work was partially supported by grants from the Russian State Scientific Program ‘Human Genome’, the Russian Fund for Basic Research (99-04-48347 and 00-15-99362), the Merck Genome Research Institute (244), and INTAS (991476). REFERENCES Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. Birney,E., Thompson,J.D. and Gibson,T.J. (1996) PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Res., 24, 2730–2739. Birney,E. and Durbin,R. (1997) Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology AAAI Press, Menlo Park, CA, pp. 56–64. Burset,M. and Guigo,R. (1996) Evaluation of gene structure prediction programs. 34, 353–367. Gelfand,M.S., Mironov,A.A. and Pevzner,P.A. (1996) Gene recognition via spliced sequence alignment. Proc. Natl. Acad. Sci. USA, 93, 9061–9066. Gish,W. and States,D.J. (1993) Identification of protein-coding regions by database similarity search. Nature Genet., 3, 266–272. Guan,X. and Uberbacher,E.C. (1996) Alignments of DNA and protein sequences containing frameshift errors. Comput. Applic. Biosci., 12, 31–40. Hultner,M., Smith,D.W. and Wills,C. (1994) Similarity lanscapes: a way to detect many structural and sequence motifs in both introns and exons. J. Mol. Evol., 38, 188–203. Laub,M.T. and Smith,D.W. (1998) Finding intron/exon splice junctions using INFO, INterruption Finder and Organizer. J. Comput. Biol., 5, 307–321. Mironov,A.A., Roytberg,M.A., Pevzner,P.A. and Gelfand,M.S. (1998) Performance-guarantee gene predictions via spliced alignment. Genomics, 51, 332–339. Nazipova,N.N., Shabalina,S.A., Ogurtsov,A.Yu., Kondrashov,A.S., Roytberg,M.A., Buryakov,G.V. and Vernoslov,S.E. (1995) SAMSON: a software package for the biopolymer primary structure analysis. Comput. Applic. Biosci., 11, 423–426. Pearson,W.R., Wood,T., Zhang,Z. and Miller,W. (1997) Comparison of DNA sequences with protein sequences. Genomics, 46, 24–36. Posfai,J. and Roberts,R.J. (1992) Finding errors in DNA sequences. Proc. Natl. Acad. Sci. USA, 89, 4698–4702. Sze,S.-H. and Pevzner,P.A. (1997) Las Vegas algorithms for gene recognition: suboptimal and error-tolerant spliced alignment. J. Comput. Biol., 4, 297–310. Zhang,Z., Pearson,W.Ri. and Miller,W. (1997) Aligning a DNA sequence with a protein sequence. J. Comput. Biol., 4, 339–349. 15
© Copyright 2026 Paperzz