A Case-Based Approach to Gene Finding

Edwin Costello and David C. Wilson
Computer Science Department, University College Dublin, Dublin 4, Ireland
edwin [email protected], [email protected]

Abstract. Advances in molecular biology and the tools and techniques available for analysing molecular structure are providing a rapidly increasing deluge of information that maps fundamental genetic structures in humans and other organisms. Intelligent support is essential for managing and interpreting this overwhelming amount of data, and one of the most important current tasks is the analysis of sequences of nucleotides in order to locate the areas of DNA that actually encode functional biological information. Such analysis has a substantial impact on the health sciences as the foundation for identifying genetic structures that are related to disease and that can serve as the basis for pharmacological or gene-therapeutic response. This paper describes our initial research in developing a CBR approach to the problem of finding regions in mammalian DNA that code for proteins essential to life.

1 Introduction

Bioinformatics incorporates the fields of biology, computer science and information technology, with the goal of discovering new biological insights and enhancing diagnostic and pharmaceutical medicine. Sequence analysis, the study of an organism's DNA in an effort to understand its molecular structure and underlying functionality, is of major importance to the area of gene therapy; it has led to the discovery of mutations in DNA and of chromosomal abnormalities indicative of diseases such as cancer. Thus the analysis of nucleotide sequences, in particular the identification of DNA segments that encode functional biological information, can provide the medical profession with invaluable insight into the pathology and treatment of disease.

Sequence analysis first involves determining the basic molecular structure of an organism's DNA, which is simply a molecule made up of two strands, each composed of nucleotides drawn from a finite set. There are four different nucleotides (adenine, guanine, thymine and cytosine), and the first letter of each provides an alphabet {A, G, T, C} for representing DNA. A nucleotide-to-nucleotide bond holds the two strands together, with each nucleotide bonded to its complementary match: A bonds only to T and C bonds only to G. Therefore, given one strand (i.e., half a DNA molecule), the complementary strand can be reconstructed relatively easily. Determining this basic molecular sequence is a well-understood task that is providing a deluge of information for interpretation.

The task of "gene finding", the identification of coding regions in an organism's DNA, is the next essential step in analysing an organism's genome. These coding regions are called exons, and put together they form an entire gene. It is the genes that tell the body how to create proteins, and it is these proteins that give rise to biological function. Exons are contiguous sequences in a strand that the body uses to produce proteins; the parts in between do not contribute to protein production and are called introns, or noncoding regions. A human gene, for example, can consist of up to 2000 base pairs (bp), split up into, on average, 10 exons. What makes finding these exons most difficult is that they can be spread at random in a string of up to one million base pairs.
In fact, almost 97% of human DNA is so-called "junk" DNA that does not code for any proteins [Pevzner 2000]. It must be noted, however, that this intron-exon structure applies only to mammalian-type DNA, which is the focus of this research.

We are interested in applying CBR to the gene-finding problem by employing a case library of nucleotide segments that have previously been categorised as coding (exon) or noncoding (intron), in order to locate the coding regions of a new DNA strand. The primary research issues addressed in this work are establishing a similarity metric for nucleotide segments and combining the results of multiple cases to categorise entire new DNA strands.

This paper describes our initial research in developing a CBR approach to the problem of finding regions in mammalian DNA that code for proteins essential to life. Section 2 gives an overview of related work, Section 3 describes our similarity metric, Section 4 presents an evaluation of the initial approach, and we conclude with a discussion of future work.

2 Gene Finding Methods

The structure of genes is well understood, and their characteristics are exploited by many of the techniques used for exon prediction and gene classification. While we employ only direct nucleotide comparison, it is important to note that other information may be available, such as promoter and terminator signals, which occur before the first and after the last exon respectively, and donor and acceptor sites, which can help to indicate intron-exon boundaries [Brunak, Engelbrecht, & Knudsen 1997].

A number of approaches have been applied to signal detection and gene finding, including neural networks (e.g., [Towell, Shavlik, & Noordewier 1990; Farber, Lapedes, & Sirotkin 1992]), Bayesian approaches (e.g., [Staden & McLachlan 1982]), and hidden Markov models (e.g., [Krogh 1998; Kulp et al. 1996; Burge & Karlin 1997]).

Overton and Haas [1998] describe a case-based system that used grammars to describe features of genes such as promoter regions and other signals. The cases that these grammars described were mapped to the unknown sequence in an attempt to generate predictions for exons, introns and certain regulatory signals; the cases used were generally full genes. The system proposed here instead uses the way an identified gene is broken up in its original DNA sequence, i.e., its individual component exons, and these exons serve as the cases in the case base.

Shavlik [1991] discusses a system called FIND-IT, which translates a query sequence of DNA into its possible protein translations based on the different reading frames and then searches for matching cases in a large database of protein sequences. Our approach matches in nucleotide space rather than protein space; while matching in protein space may have some advantages, it can also suffer from frame-shift errors in translation.

3 Sequence Similarity

For a CBR system to work effectively, it is necessary to be able to compare a case (i.e., a known exon) with the query strand of DNA in which we want to identify exons. We have investigated a number of possible approaches to similarity, including longest common subsequence [Cormen, Leiserson, & Rivest 1992] and sequence alignment methods [Jiang b], but we have chosen an edit distance method for our initial work.
The edit distance between two sequences is defined as the minimum cost of transforming one sequence into the other using a sequence of the following operations: deleting a character, inserting a character, or substituting one character for another, with no character taking part in more than one operation [Hoang 1993]. Each of these operations can be given a weight in order to penalise certain operations. Most implementations [Jiang a] apply a weight of 1 to an insertion, deletion or substitution, and 0 to a nucleotide match.

Figure 1 shows an example edit distance computation between a case exon and a target DNA sequence segment, with which the case is aligned. The minimal transformation between the exon case and the target segment requires a deletion (d) from the target, two substitutions (s), and an insertion (i) in the case, giving an edit distance of 4. This type of similarity is also conceptually appealing, as it computes similarity using adaptability (e.g., [Smyth & Keane 1995]).

Fig. 1. Example edit distance computation.

Computing the edit distance can be expensive, so we use a dynamic programming approach. Memoization is employed by creating a table t of size (m + 1) × (n + 1), where m and n are the lengths of the two sequences, to store the values of subproblems of the original. The algorithm compares each character in one sequence with every character in the other and, at each comparison, assigns a score based on the previous scores kept in the table. When generating the score for the comparison at index (i, j): if the characters match, the cell takes the score from index (i − 1, j − 1); otherwise it takes the minimum of (i − 1, j) + 1 (deletion), (i, j − 1) + 1 (insertion) and (i − 1, j − 1) + 1 (substitution). In this way, the smallest score is propagated through the table.

3.1 Sequence Scoring

Given a measure of sequence similarity, we need to employ the case library segments in a way that enables us to isolate regions of a DNA sequence and point to them as potential protein-coding regions. Since library exons are likely to be much shorter than a new strand, we adopt an approach that combines many retrieved cases (e.g., [Ram & Francis 1996]) in order to arrive at the new solution. We assign an activation to all the nucleotides involved in an optimal edit distance alignment. When all the case exons have been compared to the query strand, each nucleotide of the strand has an activation that can be used to decide whether or not it is involved in a coding region. The scoring of a nucleotide is straightforward: the edit distance calculated for the best-scoring subsequence of the query strand is applied to each of the bases in that subsequence, as shown in Figure 2.

Fig. 2. Example sequence scoring.

As more exons line up with a strand region, the activation increases. Keeping track of the cumulative edit distance alone would not be enough, however, as a large distance score would skew the results. To compensate, we also store a value indicating how many times an individual nucleotide has been involved in an optimal alignment. In the example shown in Figure 2, the three nucleotides with a score of 7 would have a participation score of 2, and the others a participation score of 1.
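For concreteness, the following is a minimal Python sketch of the comparison and scoring process described above. It is illustrative only, not the authors' implementation: the unit operation costs follow the weights mentioned earlier in this section, and the helper best_alignment_window is a hypothetical simplification (it slides a fixed-length window over the strand) standing in for whatever routine locates the best-scoring subsequence for a case exon.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Minimum cost of transforming `a` into `b` with unit-cost insertions,
    deletions and substitutions, and zero cost for a nucleotide match."""
    m, n = len(a), len(b)
    # t[i][j] holds the edit distance between a[:i] and b[:j].
    t = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        t[i][0] = i                       # delete all of a[:i]
    for j in range(n + 1):
        t[0][j] = j                       # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1]            # match: cost 0
            else:
                t[i][j] = 1 + min(t[i - 1][j],       # deletion
                                  t[i][j - 1],       # insertion
                                  t[i - 1][j - 1])   # substitution
    return t[m][n]


def best_alignment_window(exon: str, strand: str) -> Tuple[int, int, int]:
    """Hypothetical helper: return (start, end, distance) of the window of
    `strand` (of the exon's length) with the smallest edit distance."""
    best = (0, min(len(exon), len(strand)), edit_distance(exon, strand[:len(exon)]))
    for start in range(1, max(1, len(strand) - len(exon) + 1)):
        window = strand[start:start + len(exon)]
        d = edit_distance(exon, window)
        if d < best[2]:
            best = (start, start + len(window), d)
    return best


def score_strand(case_exons: List[str], strand: str) -> Tuple[List[int], List[int]]:
    """Accumulate per-nucleotide activation and participation scores
    over all case exons, as described in Section 3.1."""
    activation = [0] * len(strand)      # cumulative edit-distance-based score
    participation = [0] * len(strand)   # number of optimal alignments joined
    for exon in case_exons:
        start, end, dist = best_alignment_window(exon, strand)
        for pos in range(start, end):
            activation[pos] += dist
            participation[pos] += 1
    return activation, participation
```

In the actual system, the best-scoring subsequence need not be exactly the exon's length, since insertions and deletions change the aligned span, so the window-based helper above should be read only as an approximation of that step.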
Given scores for activation and participation, we employ three measures for classifying an individual nucleotide:

– Measure 1: the nucleotide activation score, normalised by the maximum nucleotide activation score
– Measure 2: the number of optimal matches the nucleotide participates in, normalised by the maximum nucleotide participation score
– Measure 3: the product of the first two measures

A parameterised threshold is applied to the measure value in order to determine coding status. When the analysis of a given test DNA sequence is finished, the result can be visualised, as shown in Figure 3: the base (wrapped) line represents the sequence being analysed, its thickness represents the degree of nucleotide activation, and the segments above the baseline show the positions of the actual exons in the test sequence.

Fig. 3. Visualization of nucleotide activation.

4 Evaluation

We were interested in comparing our approach with other types of analysis, and we chose to evaluate the system with a test dataset developed by Burset and Guigó [1996] to evaluate gene-finding programs. It consists of 570 sequences obtained from the vertebrate divisions of GenBank release 85.0 [Burset & Guigó 1996] and contains a total of 2649 coding exons. In choosing a set of sequences to act as the case base for our system, we wanted to ensure that none of the sequences in the test set appeared in the case base. The dataset chosen, HMR195, was constructed specifically for the evaluation of seven recently developed programs for gene finding in mammalian sequences [Rogic, Mackworth, & Ouellette 2001]; it consists of 195 strands of human, rat and mouse DNA.

In order to evaluate the system, the entire set of test sequences was analysed. Before reviewing the results, however, it is necessary to present the accuracy measures used. Burset and Guigó [1996] outlined the measures used in their evaluation of gene-finding programs, and the same measures are used here:

– Sensitivity (Sn): the proportion of coding nucleotides that are correctly classified as coding
– Specificity (Sp): the proportion of noncoding nucleotides that are correctly classified as noncoding
– Correlation Coefficient (CC): combines the values for sensitivity and specificity
– Approximate Correlation (AC): approximates CC, but is defined for all values

In this work, we use the AC as our primary measure of accuracy, since it integrates Sn and Sp and is defined for all values (a short computational sketch of these measures is given below).

4.1 Results

The system was allowed to run on the entire test set of sequences. This involved comparing the whole case base of exons to each of the test set sequences.

Cumulative Results. In order to determine the average behaviour of the system across the entire set and to see how each new case affects the overall result, we kept track of cumulative results. This involves calculating the average, across the entire test set, of each accuracy measure after each successive library case has been applied. This provides not only the average accuracy trend across all the test sequences but also, for the final case added, the average of each accuracy statistic over all the tested strands.
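The nucleotide-level measures above can be computed from the usual confusion counts over coding and noncoding nucleotides (TP, FP, TN, FN). The sketch below follows the definitions given in this section; CC is the standard Matthews correlation, and the AC formula is our reading of Burset and Guigó [1996] (the mean of the defined conditional probabilities, rescaled to the range of CC), so it should be treated as an assumption rather than a restatement of their exact formulation.

```python
from math import sqrt


def nucleotide_measures(tp: int, fp: int, tn: int, fn: int):
    """Sensitivity, specificity, correlation coefficient and approximate
    correlation at the nucleotide level, per the definitions in Section 4."""
    sn = tp / (tp + fn) if tp + fn else None        # coding correctly called coding
    sp = tn / (tn + fp) if tn + fp else None        # noncoding correctly called noncoding
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    cc = ((tp * tn - fp * fn) / sqrt(denom)) if denom else None  # undefined if a margin is 0
    # Approximate correlation (assumed form): average the conditional probabilities
    # that are defined, then rescale from [0, 1] to [-1, 1] so it is comparable to CC.
    probs = [p for p in (
        tp / (tp + fn) if tp + fn else None,
        tp / (tp + fp) if tp + fp else None,
        tn / (tn + fp) if tn + fp else None,
        tn / (tn + fn) if tn + fn else None,
    ) if p is not None]
    ac = 2 * (sum(probs) / len(probs)) - 1 if probs else None
    return sn, sp, cc, ac
```

In the cumulative evaluation, these values are averaged over the whole test set after each successive library case has been applied.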
Figure 4 shows the AC values for Measure 1 (middle curve), Measure 2 (top curve) and Measure 3 (bottom curve). We can see a number of things in these results. First, Measure 2 provides the most accurate method of scoring nucleotides. Second, there is an indication that there may be critical points in terms of the number of cases necessary for analysis, as positive correlations begin only around the 50th case and take a significant turn around the 850th case. The correlation itself is perhaps most informative: a correlation value of zero would indicate random behaviour by the approach, and our method, while relatively weak compared to some others, does provide a significant measure of predictive power. Given that other methods can employ thousands of training examples, we achieve reasonable accuracy with 948 exons from our 195 base strands, with a clear upward trend continuing.

Fig. 4. Approximate correlation cumulative accuracy results for Measure 1 (middle curve), Measure 2 (top curve), and Measure 3 (bottom curve).

Strand Results. Another way of viewing the results is to examine how well the approach performs on individual strands over the set. For example, of the sequences analysed, 23% have a CC value greater than 0.4 and over 43% have a value greater than the average of 0.28. Figure 5 compares the results for each of the strands tested against the results for the same strand as returned by a representative program from the [Burset & Guigó 1996] analysis, GeneId. The results are ordered by increasing GeneId accuracy, represented by the smooth curve, along with the results from our approach, which appear more dispersed given the ordering. This arrangement provides a visualisation of the proportion of the test population for which our approach outperforms GeneId, given by the points that lie above the GeneId curve. There is a significant segment of the population (13%) for which the CBR approach performs better than GeneId. While this comparison argues that GeneId would be the better single choice, there is clearly a segment of the population that would benefit from a complementary CBR approach.

Fig. 5. Strand results in increasing order of GeneId (smooth curve) accuracy.

4.2 Metric Comparison

Table 1 shows a comparison of our average Measure 2 accuracy results with those of the programs tested in [Burset & Guigó 1996]. Although the CBR approach outlined here may not be as accurate as the other programs, the comparison suggests that its value should not be overlooked. In summary, having tested the initial system and evaluated the results, it is evident that a simple case-based reasoning approach to the recognition of coding regions is certainly possible. The results show that there are sequences in the test set that can be classified more accurately by this approach. The statistical measure used (AC) rules out the effect of random influence on the results and indicates that the accuracy levels are indeed indicative of the emergence of coding-region features through comparison to example exons.

Table 1. Average Accuracy Measures for Nucleotide Level.

Program          Sn    Sp    AC    CC
FgeneH           0.77  0.88  0.78  0.80
GeneId           0.63  0.81  0.67  0.65
GeneParser2      0.66  0.79  0.67  0.65
GenLang          0.72  0.79  0.67  0.65
GRAILII          0.72  0.87  0.75  0.76
SORFIND          0.71  0.85  0.73  0.72
Xpound           0.61  0.87  0.68  0.69
CBR Measure 2    0.77  0.30  0.49  0.35

5 Conclusion and Future Work

We have presented our initial work in applying a case-based approach to the problem of gene finding in mammalian DNA. The results obtained from the approach taken here indicate that it is certainly feasible to do DNA-to-DNA comparisons in order to isolate relevant coding regions. Using DNA sequences avoids the need to translate a sequence into its different possible protein translations.
This, in turn, is believed to reduce the effect of any frame-shift errors on the final results. Our first iteration has employed a very straightforward edit-distance method for similarity comparison in nucleotide space, ignoring other context such as promoter/terminator signals and donor/acceptor sites. Using a case library that is relatively small compared to the training sets of other approaches, we have (1) achieved significant levels of accuracy, albeit low in comparison to other approaches, and (2) achieved higher accuracy than other approaches on a significant segment of the test population. This argues that our CBR approach can be useful, certainly as a complement to other methods. We intend to test updated versions of the system that take into account additional contextual information, such as signals and protein encodings, as well as larger case repositories. Given the simple approach taken here, we expect that results can be improved dramatically and that a CBR approach to gene finding will prove a viable complement, or even alternative, to other methods.

References

Brunak, S.; Engelbrecht, J.; and Knudsen, S. 1997. Prediction of human mRNA donor and acceptor sites from the DNA sequence. Journal of Molecular Biology 220:49–65.
Burge, C., and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 268:78–94.
Burset, M., and Guigó, R. 1996. Evaluation of gene structure prediction programs. Genomics 34:353–367.
Cormen, T. H.; Leiserson, C. E.; and Rivest, R. L. 1992. Introduction to Algorithms. MIT Press.
Farber, R.; Lapedes, A.; and Sirotkin, K. 1992. Determination of eukaryotic protein coding regions using neural networks and information theory. Journal of Molecular Biology 226:471–479.
Hoang, D. 1993. Searching genetic databases on Splash 2. In IEEE Workshop on FPGAs for Custom Computing Machines, 185–191.
Jiang, T. Approximation algorithms for multiple sequence alignment. URL: http://www.iis.sinica.edu.tw/~hil/summer/jiang2.ppt. University of California Lecture Notes.
Jiang, T. Fundamental algorithmic problems and techniques in sequence alignment. URL: http://www.iis.sinica.edu.tw/~hil/summer/jiang1.ppt. University of California Lecture Notes.
Krogh, A. 1998. An introduction to hidden Markov models for biological sequences. In Salzberg, S.; Searls, D.; and Kasif, S., eds., Computational Methods in Molecular Biology. Elsevier Science. Chapter 4.
Kulp, D.; Haussler, D.; Reese, M.; and Eeckman, F. 1996. A generalized hidden Markov model for the recognition of human genes in DNA. In Proceedings of ISMB-96, 134–142.
Overton, C. G., and Haas, J. 1998. Case-based reasoning gene annotation. In Salzberg, S.; Searls, D.; and Kasif, S., eds., Computational Methods in Molecular Biology. Elsevier Science. Chapter 5.
Pevzner, P. A. 2000. Computational Molecular Biology: An Algorithmic Approach. The MIT Press. Chapters 4, 6.
Ram, A., and Francis, A. 1996. Multi-plan retrieval and adaptation in an experience-based agent. In Leake, D., ed., Case-Based Reasoning: Experiences, Lessons, and Future Directions. Menlo Park, CA: AAAI Press.
Rogic, S.; Mackworth, A.; and Ouellette, B. 2001. Evaluation of gene-finding programs on mammalian sequences. Genome Research 11(5):817–832.
Shavlik, J. 1991. Finding genes by case-based reasoning in the presence of noisy case boundaries. In Proceedings of the 1991 DARPA Workshop on Case-Based Reasoning, volume 14, 861–866.
Smyth, B., and Keane, M. 1995. Experiments on adaptation-guided retrieval in case-based design. In Proceedings of the First International Conference on Case-Based Reasoning.
Staden, R., and McLachlan, A. 1982. Codon preference and its use in identifying protein coding regions in long DNA sequences. Nucleic Acids Research 10(1):141–156.
Towell, G. G.; Shavlik, J. W.; and Noordewier, M. O. 1990. Refinement of approximate domain theories by knowledge-based neural networks. In Proceedings of the Eighth National Conference on Artificial Intelligence, 861–866.