A Case-Based Approach to Gene Finding

Edwin Costello and David C. Wilson
Computer Science Department, University College Dublin, Dublin 4, Ireland
edwin [email protected],[email protected]
Abstract. Advances in molecular biology and the tools and techniques
available for analysing molecular structure are providing a rapidly increasing deluge of information that maps fundamental genetic structures
in humans and other organisms. Intelligent support is essential for managing and interpreting this overwhelming amount of data, and one of
the most important tasks faced currently is the analysis of sequences of
nucleotides in order to locate the areas of DNA that actually encode
functional biological information. Such analysis has a substantial impact
on the health sciences as the foundation of identifying genetic structures
that are related to disease and that can serve as the basis for pharmacological or gene-therapeutic response. This paper describes our initial
research in developing a CBR approach to the problem of finding regions
in mammalian DNA that code proteins essential for life.
1 Introduction
Bioinformatics incorporates the fields of biology, computer science and information technology with the goal of discovering new biological insights and the enhancement of diagnostic and pharmaceutical medicine. Sequence analysis, which
involves the study of an organism’s DNA in an effort to understand its molecular structure and underlying functionality, is of major importance to the area of
gene therapy, which has led to the discovery of mutations in DNA and chromosomal abnormalities indicative of diseases such as cancer. Thus the analysis of
nucleotide sequences, in particular the identification of DNA segments that encode functional biological information, can provide the medical profession with
invaluable insight into the pathology of disease state and treatment.
Sequence analysis first involves determining the basic molecular structure of
an organism’s DNA, which is simply a molecule made up of two strands, with
each strand comprised of nucleotides from a finite set. There are four different
nucleotides (adenine, guanine, thymine and cytosine), and the first letter of
each provides an alphabet {A, G, T, C} for representing DNA. A nucleotide-to-nucleotide bond holds the two strands together with each nucleotide being
bonded to its complementary match, A only bonding to T and C only bonding
to G. Therefore, given one strand (i.e., half a DNA molecule), the complementary
strand can be reconstructed relatively easily.
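The pairing rule above can be illustrated with a minimal Python sketch. The names `COMPLEMENT` and `complement` are hypothetical and are not taken from the system described in this paper; the sketch only applies the A-T and C-G pairing position by position and does not reverse the strand.

```python
# Base-pairing rule from the text: A bonds only to T, C bonds only to G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the complementary strand, position by position.

    Assumes `strand` contains only the letters A, G, T and C.
    """
    return "".join(COMPLEMENT[base] for base in strand)


if __name__ == "__main__":
    print(complement("GTAGCC"))
```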
Determining the basic DNA sequence is a well-understood task that is providing a deluge of information for interpretation. The task
of “gene finding,” identification of coding regions in an organism’s DNA, is the
next essential step in analysing an organism’s genome. These coding regions are
called exons and when these are put together they form an entire gene. It is the
genes that tell the body how to create proteins, and it is these that give rise to
biological function. Exons are continuous sequences in a strand that the body
uses to replicate proteins; the parts in between these exons do not contribute to
protein replication and are called introns, or noncoding regions. A human gene,
for example, can consist of up to 2000 base-pairs (bp’s), with the gene being
split up into, on average, 10 exons. The aspect that makes the finding of these
exons most difficult is that the exons can be spread at random in a string of up
to one million base pairs. In fact, almost 97% of human DNA is
so-called “junk” DNA that does not code for any proteins [Pevzner 2000]. It must
be noted, however, that this intron-exon structure only applies to mammalian-type
DNA, which we focus on in this research.
We are interested in applying CBR to the gene-finding problem by employing
a case library of nucleotide segments that have previously been categorised as
coding (exon) or non-coding (intron), in order to locate the coding regions of
a new DNA strand. The primary research issues addressed in this work are
establishing a similarity metric for nucleotide segments and combining the results
of multiple cases to categorise entire new DNA strands. This paper describes our
initial research in developing a CBR approach to the problem of finding regions in
mammalian DNA that code proteins essential for life. Section 2 gives an overview
of related work, section 3 describes our similarity metric, section 4 describes an
evaluation of the initial approach, and we conclude with a discussion of future
work.
2 Gene Finding Methods
The structure of genes is well understood and their characteristics are used in
many of the techniques that are used for exon prediction and gene classification. While we employ only direct nucleotide comparison, it is important to note
that other information may be available, such as Promoter and Terminator signals that occur before the first and after the last exon respectively, and Donor
and Acceptor sites that can help to indicate intron-exon boundaries [Brunak,
Engelbrecht, & Knudsen 1997].
A number of approaches have been applied to signal detection and gene-finding, including neural networks (e.g., [Towell, Shavlik, & Noordewier 1990;
Farber, Lapedes, & Sirotkin 1992]), Bayesian approaches (e.g., [Staden & McLachlan 1982]), and Hidden Markov Models (e.g., [Krogh 1998; Kulp et al. 1996;
Burge & Karlin 1997]).
Overton and Haas [1998] describe a case-based system that used grammars,
describing features of genes such as promoter regions and other signals. The
cases that these grammars described were mapped to the unknown sequence
in an attempt to generate predictions for exons, introns and certain regulatory
signals. The cases used were generally full genes. For the system proposed here,
the way an identified gene was broken up in its original DNA sequence is used,
i.e. the individual component exons. These exons will be used as the cases in the
case-base.
Shavlik [1991] discusses a system called FIND-IT, which translates a query
sequence of DNA into its possible protein translations based on the different reading frames. It then performs a search for matching cases from a large database
of protein sequences. Our approach matches in nucleotide space, rather than
protein space; while matching in protein space may have some advantages, it
can also suffer from frame-shift errors in translation.
3 Sequence Similarity
For a CBR system to work effectively, it is necessary to be able to compare the
case (i.e., a known exon), with the query strand of DNA in which we want to
identify exons. We have investigated a number of possible approaches to similarity, including Longest Common Subsequence [Cormen, Leiserson, & Rivest 1992]
and sequence alignment methods [Jiang b], but we have chosen an edit distance
method for our initial work.
The edit distance between two sequences is defined as the minimum cost of
transforming one sequence to the other with a sequence of the following operations: deleting a character, inserting a character or substituting one character
for another, with no character taking part in more than one operation [Hoang
1993]. Each of these operations can be given a weight in order to penalise certain operations. Most implementations [Jiang a] seem to apply a weight of 1 to
an insertion, 1 to a substitution and 0 for a nucleotide match. Figure 1 shows
an example edit distance computation between a case exon and a target DNA
sequence segment, with which the case is aligned. The minimal transformation
cost between the exon case and the target segment requires a deletion (d) from
the target, two substitutions (s), and an insertion (i) in the case, giving an
edit distance of 4. This type of similarity is also conceptually appealing, as it
computes similarity using adaptability (e.g., [Smyth & Keane 1995]).
[Figure: target sequence ... G T A G C C G A A T C G ... aligned with case exon ACGAAGATC; operations d, s, s, i, each of cost 1; edit distance = 4]
Fig. 1. Example edit distance computation.
Computing the edit distance can be expensive, so we use a dynamic programming approach. For sequences of lengths m and n, memoization is employed by creating a table t of size
(m + 1) × (n + 1) to store values for subproblems of the original. The edit distance algorithm compares each character in one sequence with every character
in the comparison sequence. At each comparison it assigns a score based on
the previous scores kept in the table. When generating a score for a character
comparison at index (i, j), if the characters match it is given the score from
index (i − 1, j − 1); otherwise it is given the minimum of (i − 1, j − 1) + 1
(a substitution), (i − 1, j) + 1 and (i, j − 1) + 1 (a deletion or an insertion).
This way, the smallest score is propagated through the table.
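As an illustration of the table-filling procedure, the following Python sketch computes a unit-cost edit distance between two complete sequences. It is a minimal sketch rather than the implementation used in our system: the function name `edit_distance` is hypothetical, every insertion, deletion and substitution is assumed to cost 1, and the substitution case from the definition of edit distance is included alongside insertion and deletion.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between sequences a and b.

    t[i][j] holds the distance between the first i characters of a
    and the first j characters of b.
    """
    m, n = len(a), len(b)
    t = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        t[i][0] = i          # delete all i characters of a
    for j in range(n + 1):
        t[0][j] = j          # insert all j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1]            # match, no cost
            else:
                t[i][j] = 1 + min(t[i - 1][j - 1],   # substitution
                                  t[i - 1][j],       # deletion
                                  t[i][j - 1])       # insertion
    return t[m][n]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))
```

The table has (m + 1)(n + 1) entries and each entry is filled in constant time, so the cost grows with the product of the two sequence lengths.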
3.1 Sequence Scoring
Given a measure of sequence similarity, we need to employ the case library
segments in a way that will enable one to isolate regions of a sequence of DNA
and point to them as potential protein coding regions. Since library exons are
likely to be much shorter than a new strand, we adopt an approach that combines
many retrieved cases (e.g., [Ram & Francis 1996]) in order to arrive at the new
solution.
We assign an activation to all the nucleotides involved in an optimal edit
distance alignment. When all the case exons have been compared to the query
strand, each nucleotide of the strand will have an activation that can be used to
decide whether it is involved in a coding region or not. The scoring of a nucleotide
is straightforward: the edit distance calculated for the best scoring subsequence
of the query strand is applied to each of the bases in that subsequence, as shown
in Figure 2.
Fig. 2. Example sequence scoring.
As more exons line up with a strand region, the activation increases. However,
keeping track of the cumulative edit distance alone would not be enough, as a single large
distance score would skew the results. To compensate for this, we also store a
value that indicates how many times an individual nucleotide has been involved
in an optimal alignment. In the example shown in Figure 2, the three nucleotides
with a score of 7 would have a participation score of 2, with a score of 1 for the
others.
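The bookkeeping described in this section can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not our implementation: the names `best_match` and `accumulate` are hypothetical, all edit operations cost 1, the best scoring subsequence is found with a standard approximate substring matching variant of the edit distance table (the first row is zero, so an alignment may start anywhere in the strand), and ties between equally good alignments are broken arbitrarily.

```python
def best_match(case, strand):
    """Find a substring of strand with minimal edit distance to case.

    Returns (start, end, distance) with the match at strand[start:end].
    """
    m, n = len(case), len(strand)
    prev_d = [0] * (n + 1)            # empty case matches anywhere at cost 0
    prev_s = list(range(n + 1))       # start index of each alignment
    for i in range(1, m + 1):
        cur_d = [i] + [0] * n
        cur_s = [0] * (n + 1)
        for j in range(1, n + 1):
            sub = prev_d[j - 1] + (case[i - 1] != strand[j - 1])
            skip_case = prev_d[j] + 1       # case character left unmatched
            skip_strand = cur_d[j - 1] + 1  # strand character left unmatched
            best = min(sub, skip_case, skip_strand)
            cur_d[j] = best
            if best == sub:
                cur_s[j] = prev_s[j - 1]
            elif best == skip_case:
                cur_s[j] = prev_s[j]
            else:
                cur_s[j] = cur_s[j - 1]
        prev_d, prev_s = cur_d, cur_s
    end = min(range(n + 1), key=prev_d.__getitem__)
    return prev_s[end], end, prev_d[end]


def accumulate(strand, cases):
    """Per-nucleotide activation (summed distances) and participation counts."""
    activation = [0] * len(strand)
    participation = [0] * len(strand)
    for case in cases:
        start, end, dist = best_match(case, strand)
        for k in range(start, end):
            activation[k] += dist      # distance applied to every aligned base
            participation[k] += 1      # one more optimal alignment covers k
    return activation, participation


if __name__ == "__main__":
    print(accumulate("TTACGTT", ["ACG", "CGT"]))
```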
Given scores for activation and participation, we employ three measures for
classifying an individual nucleotide:
– Measure 1 nucleotide activation score, normalised by the maximum nucleotide activation score
– Measure 2 number of optimal matches the nucleotide participates in, normalised by the maximum nucleotide participation score
– Measure 3 the first two measures combined as a product
A parameterized threshold is applied to the metric value, in order to determine
coding status.
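A minimal Python sketch of the three measures and the threshold step is given below. The function name `classify`, the default threshold of 0.5, and the rule that a nucleotide is labelled coding when its measure is at or above the threshold are assumptions made for illustration; the text above does not fix these details.

```python
def classify(activation, participation, measure=2, threshold=0.5):
    """Label each nucleotide as coding (True) or noncoding (False).

    activation, participation: non-empty per-nucleotide score lists of
    equal length. measure selects Measure 1, 2 or 3.
    """
    max_a = max(activation) or 1        # avoid division by zero
    max_p = max(participation) or 1
    m1 = [a / max_a for a in activation]        # Measure 1
    m2 = [p / max_p for p in participation]     # Measure 2
    m3 = [x * y for x, y in zip(m1, m2)]        # Measure 3: product
    scores = {1: m1, 2: m2, 3: m3}[measure]
    # Assumed decision rule: at or above the threshold means coding.
    return [s >= threshold for s in scores]


if __name__ == "__main__":
    print(classify([0, 2, 4], [0, 1, 2], measure=3))
```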
When the analysis of a given test DNA sequence is finished, the result can be
visualized, as shown in Figure 3. The base (wrapped) line represents the sequence
being analysed, the thickness represents the degree of nucleotide activation, and
the segments above the baseline show the position of the actual exons in the test
sequence.
Fig. 3. Visualization of nucleotide activation.
4 Evaluation
We were interested in comparing our approach with other types of analysis,
and we chose to evaluate the system with a test dataset developed by Burset
and Guigó [1996] to evaluate gene-finding programs. It consists of 570 sequences
obtained from the vertebrate divisions of GenBank release 85.0 [Burset & Guigó
1996] and contains a total of 2649 coding exons.
In choosing a set of sequences to act as the case-base for our system, we
wanted to select a dataset such that none of the sequences in the test set were
the same as any of those in the case-base. The dataset that was chosen was one
that was constructed specifically for the evaluation of seven recently developed
programs for gene finding in mammalian sequences [Rogic, Mackworth, & Ouellette 2001]. The dataset is named HMR195 and consists altogether of 195
strands of human, rat and mouse DNA.
In order to evaluate the system, the entire set of test sequences was analysed.
Before the results are reviewed, however, it is necessary to present the accuracy
measures used. Burset and Guigó [1996] outlined the measures used in their
evaluation of gene-finding programs, and we adopt them here.
These measures are:
– Sensitivity (Sn) proportion of coding nucleotides that are correctly classified
as coding
– Specificity (Sp) proportion of noncoding nucleotides that are correctly classified as noncoding
– Correlation Coefficient (CC) combines the values for sensitivity and specificity
– Approximate Correlation (AC) approximates CC, but is defined for all values
In this work, we use the AC as our primary measure of accuracy, since it integrates Sn and Sp and is defined for all values.
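The measures can be computed from nucleotide-level counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), as in the hedged Python sketch below. The function name `accuracy` is hypothetical. Sn and Sp follow the wording of the list above. The formulas for CC (the usual correlation of a 2×2 table) and AC (twice the average of the defined conditional probabilities, minus one) are stated here as our reading of the standard definitions and should be checked against [Burset & Guigó 1996] before use.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Return (Sn, Sp, CC, AC); undefined values are returned as None."""
    def ratio(num, den):
        return num / den if den else None

    sn = ratio(tp, tp + fn)     # coding nucleotides classified as coding
    sp = ratio(tn, tn + fp)     # noncoding classified as noncoding
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else None
    # AC averages four conditional probabilities, skipping undefined ones.
    terms = [sn, ratio(tp, tp + fp), sp, ratio(tn, tn + fn)]
    defined = [t for t in terms if t is not None]
    ac = 2 * (sum(defined) / len(defined) - 0.5) if defined else None
    return sn, sp, cc, ac


if __name__ == "__main__":
    print(accuracy(40, 40, 10, 10))
```

In the second test case of this sketch, a strand with no coding nucleotides leaves CC undefined while AC still yields a value, which is the property that motivates using AC as the primary measure.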
4.1 Results
The system was allowed to run on the entire test set of sequences. This involved
comparing the whole case-base of exons to each of the test set sequences.
Cumulative Results In order to determine the average behaviour of the system
across the entire set and to see how each new case affects the overall result, we
kept track of cumulative results. This involves calculating the average, across
the entire test set, of each of the accuracy measures after each successive library
case has been applied. The measure provides not only the average accuracy trend
across all the test sequences, but also, for the final case added it gives the average
of each accuracy statistic for all the tested strands. Figure 4 shows the AC values
for Measure 1 (middle curve), Measure 2 (top curve), and Measure 3 (bottom
curve). We can see a number of things in these results. First, Measure 2 provides
the most accurate method of scoring nucleotides. Second, there is an indication
that there may be critical points in terms of numbers of cases necessary for
analysis, as positive correlations begin only around the 50th case and take a
significant turn around the 850th case.
The correlation itself is perhaps most informative. A correlation value of
zero would indicate random behaviour by the approach. Our method, while relatively weak when compared to some others, does provide a significant measure
of predictive power. Given that other methods can employ thousands of training
examples, we can achieve a reasonable accuracy with 948 exons from our 195
base strands with a clear upward trend continuing.
Fig. 4. Approximate correlation cumulative accuracy results for Measure 1 (middle
curve), Measure 2 (top curve), and Measure 3 (bottom curve).
Strand Results Another way of viewing the results is to examine how well
the approach performs on individual strands over the set. For example, of the
sequences analysed, 23% have a CC value greater than 0.4 and over 43% have a
value greater than the average of 0.28. Figure 5 compares the results of each of
the strands tested against the results for the same strand as returned by a
representative program from the [Burset & Guigó 1996] analysis, GeneId. The
results are ordered by increasing GeneId accuracy, represented by the smooth
curve, along with the results from our approach that appear more dispersed given
the ordering. This arrangement provides a visualization of the proportion of the
test population for which our approach outperforms GeneId, given by the points
that lie above the GeneId curve. There is a significant segment of the population
(13%) for which the CBR approach performs better than GeneId. While this
comparison argues that GeneId would be the better single choice, there is clearly
a segment of the population that would benefit from a complementary CBR
approach.
4.2 Metric Comparison
Table 1 shows a comparison of our average Measure 2 accuracy results with the
tested programs from [Burset & Guigó 1996]. Although the CBR approach outlined here
is not as accurate as the other programs, the results
suggest that its value should not be overlooked.
Fig. 5. Strand results in increasing order of GeneId (smooth curve) accuracy.
In summary, having tested the initial system and evaluated the results, it is
evident that a simple case-based reasoning approach to the recognition of coding
regions is certainly possible. The results show that there are sequences in the
test set that can be classified more accurately by this approach. The statistical
measure used (AC) rules out the effect of random influence on the results and
indicates that the accuracy levels reflect the emergence of coding
region features through comparison to example exons.
Table 1. Average Accuracy Measures for Nucleotide Level.
Program         Sn    Sp    AC    CC
FgeneH          0.77  0.88  0.78  0.80
GeneId          0.63  0.81  0.67  0.65
GeneParser2     0.66  0.79  0.67  0.65
GenLang         0.72  0.79  0.67  0.65
GRAILII         0.72  0.87  0.75  0.76
SORFIND         0.71  0.85  0.73  0.72
Xpound          0.61  0.87  0.68  0.69
CBR Measure 2   0.77  0.30  0.49  0.35
5 Conclusion and Future Work
We have presented our initial work in applying a case-based approach to the
problem of gene-finding in mammalian DNA. The results obtained from the approach taken here indicate that it is certainly feasible to do DNA-to-DNA comparisons in order to isolate relevant coding regions. Using DNA sequences avoids
the need to translate the sequence into its different possible protein sequences. This in turn, it is believed, will reduce the effect of any frame-shift errors
on the final results. Our first iteration has employed a very straightforward edit-distance method for similarity comparison in nucleotide space, ignoring other
context such as promoter/terminator signals and donor/acceptor sites. Using a
relatively small library in comparison to the training sets from other approaches,
we have (1) achieved significant levels of accuracy, albeit low in comparison to
other approaches, and (2) achieved higher accuracy in a significant segment of
the test population than other approaches. This argues that our CBR approach
can be useful, certainly as a complement to other methods. We intend to test
updated versions of the system that take into account additional contextual
information, such as signals and protein encodings, as well as larger case repositories. Given the simple approach taken here, we expect that results can be
improved dramatically and that a CBR approach to gene finding will prove a
viable complement, or even alternative, to other methods.
References
Brunak, S.; Engelbrecht, J.; and Knudsen, S. 1997. Prediction of human mRNA
donor and acceptor sites from the DNA sequence. Journal of Molecular Biology
220:49–65.
Burge, C., and Karlin, S. 1997. Prediction of complete gene structures in
human genomic DNA. Journal of Molecular Biology 268:78–94.
Burset, M., and Guigó, R. 1996. Evaluation of gene structure prediction programs. Genomics 34:353–367.
Cormen, T. H.; Leiserson, C. E.; and Rivest, R. L. 1992. Introduction to
Algorithms. MIT Press.
Farber, R.; Lapedes, A.; and Sirotkin, K. 1992. Determination of eukaryotic
protein coding regions using neural networks and information theory. Journal
of Molecular Biology 226:471–479.
Hoang, D. 1993. Searching genetic databases in splash 2. IEEE Workshop on
FPGAs for Custom Computing Machines 185–191.
Jiang, T. Approximation algorithms for multiple sequence alignment. URL:
http://www.iis.sinica.edu.tw/~hil/summer/jiang2.ppt. University of California
Lecture Notes.
Jiang, T. Fundamental algorithmic problems and techniques in sequence alignment. URL: http://www.iis.sinica.edu.tw/~hil/summer/jiang1.ppt. University
of California Lecture Notes.
Krogh, A. 1998. An introduction into hidden markov models for biological sequences. In Salzberg, S.; Searls, D.; and Kasif, S., eds., Computational Methods
in Molecular Biology. Elsevier Science. chapter 4.
Kulp, D.; Haussler, D.; Reese, M.; and Eeckman, F. 1996. A generalized hidden
markov model for the recognition of human genes in DNA. In Proceedings of
ISMB-96, 134–142.
Overton, C. G., and Haas, J. 1998. Case-based reasoning gene annotation. In
Salzberg, S.; Searls, D.; and Kasif, S., eds., Computational Methods in Molecular
Biology. Elsevier Science. chapter 5.
Pevzner, P. A. 2000. Computational Molecular Biology: An Algorithmic Approach. The MIT Press. chapter 4,6.
Ram, A., and Francis, A. 1996. Multi-plan retrieval and adaptation in an
experience-based agent. In Leake, D., ed., Case-Based Reasoning: Experiences,
Lessons, and Future Directions. Menlo Park, CA: AAAI Press.
Rogic, S.; Mackworth, A.; and Ouellette, B. 2001. Evaluation of gene-finding
programs on mammalian sequences. Genome Research 11(5):817–832.
Shavlik, J. 1991. Finding genes by case-based reasoning in the presence of noisy
case boundaries. In Proceedings of the 1991 DARPA Workshop on Case-Based
Reasoning, volume 14, 861–866.
Smyth, B., and Keane, M. 1995. Experiments on adaptation-guided retrieval in
case-based design. In Proceedings of First International Conference on Case-Based Reasoning.
Staden, R., and McLachlan, A. 1982. Codon preferences and its use in identifying protein coding regions in long DNA sequences. Nucleic Acids Research
10(1):141–156.
Towell, G. G.; Shavlik, J. W.; and Noordewier, M. O. 1990. Refinement of approximate domain theories by knowledge-based neural networks. In Proceedings
of the Eighth National Conference on Artificial Intelligence, 861–866.