J. Mol. Biol. (1996) 264, 46–55

Very Fast Identification of RNA Motifs in Genomic DNA. Application to tRNA Search in the Yeast Genome

Nadia El-Mabrouk1 and Frédérique Lisacek2*

1 Institut Gaspard Monge, Université de Marne-la-Vallée, 2 rue de la Butte Verte, 93166 Noisy-le-Grand Cedex, France
2 Laboratoire Informatique et Génome, Université de Versailles-Saint-Quentin, 45 avenue des Etats-Unis, 78035 Versailles Cedex, France
*Corresponding author

A common strategy characterises the various methods independently defined to identify almost unambiguously different types of RNA molecules in DNA fragments. So far, the quality of detection of RNA motifs has been the primary motivation and has effectively delayed the optimisation of programs. As an illustration of possible improvements, a modified version of tRNAscan is described. The previous algorithm was altered to run 500 times faster and to lower the rates of both false positives and false negatives. Both strands of the newly sequenced genome of Saccharomyces cerevisiae are scanned in less than three minutes, and the results match the annotations found in databanks with three exceptions, two of which are arguably not real tRNAs.

© 1996 Academic Press Limited

Keywords: pattern matching; optimised search; RNA motif; tRNA; yeast genome

Introduction

The identification of functional regions in genomic DNA increasingly relies on the coupling of experimental work to computer processing of sequences. Newly sequenced fragments of a genome may be analysed with computer programs and the result of an automated search may guide a new set of experiments. Among other tasks, the identification of more or less complex RNA “motifs” is achieved by a variety of programs. There are basically two approaches to the identification of RNA motifs.
It is either part of a general-purpose method designed for searching and/or folding RNA sequences (Gautheret et al., 1990; Chiu & Kolodziejczak, 1991; Eddy & Durbin, 1994; Searls & Dong, 1992; Searls, 1993) or a self-contained method tailor-made for searching tRNA genes (Fichant & Burks, 1991; Pavesi et al., 1994), Escherichia coli transcription terminators (d’Aubenton Carafa et al., 1990), self-splicing introns (Lisacek et al., 1994), etc. By construction, reported results are usually more accurate in the latter case and only these methods will be considered here.

The outline of a common strategy for searching these motifs has been given by Dandekar & Hentze (1995). A similar approach characterises the various methods independently defined to identify almost unambiguously different types of RNA molecules in DNA fragments. Such dedicated searches are based on the principle that the more conserved a region, the more easily it is recognised. They depend on the use of “weight” or “consensus” matrices (Staden, 1988). Weight matrices make the definition of RNA motifs more flexible than consensus sequences do. Primary and secondary structure features are then used to gradually refine the identification process.

Most algorithms following the above strategy were not optimised. However, the forthcoming availability of complete genomes emphasises the need for efficient and fast computer search tools. Efficiency depends on how sensitive the search is, and speed on how optimised the algorithm is. So far, the quality of detection of RNA motifs has been the primary motivation and has effectively postponed the optimisation of programs.

The speed of search algorithms first became a concern in the late 1980s with the fast-growing sequence databases and databanks. The exponential increase of operations involved in sequence comparison and the absence of obvious means of speeding up rigorous methods relying on dynamic programming (e.g.
Smith & Waterman, 1981) prompted the appearance of heuristic methods (Pearson & Lipman, 1988; Altschul et al., 1990). In contrast, RNA motif searching has been based on heuristic methods, circumventing combinatorial difficulties with a loose theoretical background. Fast and efficient techniques can be imported from formally solved string-matching problems. Searching for invariant or semi-invariant nucleotides can be considered as searching for patterns with don’t care symbols. A number of algorithms have been defined and improved for both approximate and exact string matching (Stephen, 1994). The optimisation of RNA motif search introduced here relies on the Shift-Add algorithm, which was defined for searching with mismatches (Baeza-Yates & Gonnet, 1992). Testing and using this algorithm for the identification of tRNA genes considerably reduced computing time.

Over the last 20 years, the tRNA molecule has been extensively studied. The corresponding gene is a short sequence that folds into a cloverleaf. Tertiary interactions are known (Haselman et al., 1988). Many sequences are available and aligned (Sprinzl et al., 1992). Conserved regions appear in the alignment, which consists of only 76 positions. The first reliable algorithm, tRNAscan, was written by Fichant & Burks (1991). More recently, another sensitive method based on the use of weight matrices was defined (Pavesi et al., 1994).

The modified version of tRNAscan (FAStRNA) yields good results. Scanning is performed in a few seconds for small genomes and in less than three minutes for both strands of the newly sequenced Saccharomyces cerevisiae genome. Setting an appropriate hierarchy of searching operations also contributes to minimising computing time, as tested by Wozniak & Makalowski (1990). The hierarchy introduced in tRNAscan is slightly modified.
Other parameters and definitions are made more flexible, helping to reduce the rates of both false positives (a selected sequence that does not correspond to a known tRNA) and false negatives (a non-selected sequence corresponding to a known tRNA). Throughout the text, references to positions in the tRNA sequence alignment are made as given by Sprinzl et al. (1992).

Results

Pattern matching versus consensus matrices

From a computing point of view, there are several advantages to the pattern-matching approach over the probabilistic approach:
(1) the signal search is uninterrupted by score calculations;
(2) score calculations are not necessary and score thresholds need not be specified;
(3) a signal is considered as a whole, as against a collection of invariant, semi-invariant and variable nucleotides; segments are selected upon an overall minimal number of errors.
As far as the quality of results is concerned, both approaches yield accurate results.

† Potential tRNAs found in unannotated sequences are not considered.
‡ Truncated, partial, putative, repeated gene or pseudogene sequences are not considered, nor are sequences containing too many unidentified/ambiguous nucleotides.
§ Mitochondrial and chloroplastic sequences are not included in the study.

Speed

Both tRNAscan and FAStRNA were written in C and run on an HP9000 computer. Whereas tRNAscan, based on a naive search method, scans a relatively small genome (ca 2 × 200,000 nucleotides) in 16 minutes, the modified version runs in two seconds. Both strands of the 16 chromosomes of the recently completed genome of Saccharomyces cerevisiae are scanned in 2.5 minutes. For instance, the longest chromosome (IV), made of 1,522,191 nucleotides, is exhaustively analysed in 19.5 seconds, and the shortest (VI), made of 270,148 nucleotides, in 3.5 seconds. Both strands of chromosome III (2 × 315,000 nucleotides) are scanned in four seconds, as against 35 minutes with tRNAscan and 11 minutes with Genlang (Searls & Dong, 1992).
Pavesi et al. (1994) reported that both strands of a genomic fragment (2 × 318,444 nucleotides) were scanned in two hours with a GWBASIC program implemented on a PC 486DX/33 computer.

Sensitivity of the search

Databank searching

The modified version of the tRNAscan algorithm yields good results. The rates of both false positives† and false negatives‡ are reduced. Moreover, some predicted sequences that are not annotated are over 70% similar to known tRNAs and are likely to correspond to tRNA genes.

Both tRNAscan and FAStRNA were tested on a set of 95,143 sequences of EMBL (Rel.38)§. FAStRNA is 97.9% accurate as against 96% for tRNAscan. In both cases, the rate of false positives is of the order of 10⁻⁴, though slightly lower for FAStRNA. The corresponding sequences overlap but may also differ. Results are shown in Table 1.

A detailed analysis of their own results has been provided by Pavesi et al. (1994). With their method, the overall number of false negatives is estimated to be three times lower than that given by Fichant & Burks (1991). Conversely, more false positives (0.0137% as against 0.0033%) were found, and two sequences out of a set of 30 potential genes were successfully tested for transcriptional activity in vitro. Most of these potential genes are predicted by FAStRNA.

FAStRNA was tested with available fragments of the Escherichia coli genome and Bacillus subtilis large genome fragments, with 100% success in both cases.

Table 1. Results of database searching

A. Global results.
Comparison between tRNAscan and FAStRNA

Taxonomic category | Number of nucleotides tested(a) | Number of tRNA genes(b) | % False negatives(c), tRNAscan | % False negatives(c), FAStRNA | % False positives(d), tRNAscan | % False positives(d), FAStRNA
Primates | 25,454,414 (30,313 seq) | 10,060 (77) | 2.60 (2 genes) | 1.30 (1 gene) | 0.0005 (4 genes) | 0.0 (0 genes)
Rodents | 19,490,297 (19,584 seq) | 6532 (90) | 3.33 (3 genes) | 2.22 (2 genes) | 0.0005 (3 genes) | 0.0 (0 genes)
Invertebrates | 14,048,275 (10,818 seq) | 28,708 (374) | 4.28 (16 genes) | 2.14 (8 genes) | 0.0005 (2 genes) | 0.0005 (2 genes)
Plants | 9,017,972 (7831 seq) | 9001 (122) | 6.56 (8 genes) | 1.64 (2 genes) | 0.0 (0 genes) | 0.0 (0 genes)
Prokaryotes | 20,680,405 (13,069 seq) | 45,998 (618) | 3.88 (24 genes) | 2.59 (16 genes) | 0.0009 (5 genes) | 0.0005 (3 genes)
Fungi | 3,918,817 (2905 seq) | 7484 (92) | 2.17 (2 genes) | 1.08 (1 gene) | 0.0019 (2 genes) | 0.0009 (1 gene)
Other mammals | 5,133,144 (5088 seq) | 2836 (36) | 8.33 (3 genes) | 2.77 (1 gene) | 0.0007 (1 gene) | 0.0014 (2 genes)
Other vertebrates | 5,796,710 (5535 seq) | 4956 (61) | 0.00 (0 genes) | 0.0 (0 genes) | 0.0 (0 genes) | 0.0006 (1 gene)
Total | 103,540,034 (95,143 seq) | 115,575 (1470) | 3.94 (58 genes) | 2.11 (31 genes) | 0.0006 (17 genes) | 0.0003 (9 genes)

a The number of nucleotides corresponds to a single DNA strand. The total number of processed nucleotides is twice this number.
b Number of nucleotides corresponding to tRNA genes and, in parentheses, number of tRNA genes in the test set.
c The percentage of false negatives is obtained by dividing the number of tRNA genes that are not predicted by the total number of known tRNA genes.
d The percentage of false positives is calculated according to Fichant & Burks (1991): the value in column 3 is subtracted from that in column 2. The obtained value is divided by the average tRNA length and multiplied by 2 to obtain the number of tRNA-sized objects in the test set. Finally, the number of false positives is divided by this obtained value.

B.
False negatives obtained by FAStRNA Taxonomica category Entry namea tRNA typeb Primates tRNASer Rodents Invertebrates Plants Prokaryotes Fungi Other mammals a b HSTGSS SeCys MMSEC tRNA MMTGCI ACTRANRNA DMTGSCA CCTRNAA CETGSCA tRNASeCys tRNALeu tRNASeCys DMTGYC NC02677 PM12SRRNA tRNATyr tRNAAla tRNALys TBTRQVK ATPATY3 PUDNAK AATGL ALRRNA ALRRNA ALTRNA11 ALTRNAGLY ALTRNAIA ASPTRNLEU CJTRNLAG CJTRNLAG EHRTRNA HVTGW MG02104 MPTGV PLTRNA PVSELC SCTRPSER SPCEN3XB BTSERSEC tRNAVal tRNASer tRNAGly tRNALeu tRNALys tRNAVal tRNAAla tRNAGly tRNAIle tRNALeu tRNAAla tRNALeu tRNAAla tRNATrp tRNASeCys tRNALeu tRNALeu tRNASeCys tRNASer tRNAAsp tRNASer Reason why FAStRNA does not predict the gene Too many mismatches in the D signal and too low score for the aminoacyl arm Too many mismatches in the D signal and too low score for the aminoacyl arm Too low sg Too low sg Too many mismatches in the D signal and too low sg Too low score for the aminoacyl arm Too many mismatches in the D signal, too low score for the aminoacyl arm Too large intron (114 nucleotides) Too low sg Deletion between the D arm and the aminoacyl arm and too many mismatches in the TCC signal Too low score for the aminoacyl arm Too low sg Deletion in the TCC loop and insertion in the TCC arm Too large intron (292 nucleotides) Too low score for the aminoacyl arm Insertion in the TCC loop Deletion in the TCC loop Deletion between the D arm and the aminoacyl arm Deletion in the TCC loop Too low sg Too many mismatches in the TCC signal . . . Too many mismatches in the TCC signal . . . Deletion in the TCC loop Too large intron (106 nucleotides) Too many mismatches in the TCC signal . . . Deletion in the TCC loop Too low score for the anticodon arm Too many mismatches in the D signal Insertion in the D loop Deletion in the TCC loop Too many mismatches in the D signal and too low score for the aminoacyl arm Location of the non-predicted gene in EMBL (release 38). Family of the non-predicted tRNA. 
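The false-positive percentages of Table 1A follow the rule stated in footnote d. A minimal sketch of that arithmetic is given below; the function name is hypothetical and the average tRNA length of 76 nucleotides is an assumption (it is the number of positions in the alignment, not a value stated for this calculation).

```python
def false_positive_percent(nucleotides_tested, trna_nucleotides,
                           false_positives, avg_trna_length=76):
    """Percentage of false positives per tRNA-sized object (footnote d).

    avg_trna_length=76 is an assumed value, not one given in the text.
    """
    # Non-tRNA nucleotides on one strand, converted to tRNA-sized objects,
    # then doubled because both strands are scanned.
    objects = 2 * (nucleotides_tested - trna_nucleotides) / avg_trna_length
    return 100.0 * false_positives / objects


# "Total" row of Table 1A: 17 false positives for tRNAscan, 9 for FAStRNA.
print(false_positive_percent(103_540_034, 115_575, 17))
print(false_positive_percent(103_540_034, 115_575, 9))
```

Under this assumption the two values fall close to the 0.0006 and 0.0003 reported in the Total row.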
Table 1 (continued)

C. Potential tRNA genes predicted by FAStRNA

Rows are grouped by taxonomic category(a), in the order Invertebrates, Plants, Prokaryotes, Fungi.

Entry name(a) | Positions(a) | tRNA type(b) | Compared sequence(c) | Similarity(d) (%)
CEF42H10 | 533-608 | tRNALys(CUU) | DK7560 | 100
CEGPA1 (p) | 2929-3012 | tRNALeu(CAG) | DMTGL* | 77
CEUNC22 (p) | 1474-1545 | tRNAPro(CGG) | CETGPX* | 93
EHRPTARQ (p) | 278-362 | tRNAThr(UGU) | TATRTY1* | 93
ATCPYLP (p) | 234-326 | tRNATyr(GUA) | DY6740 | 90
ATENGE (p) | 4380-4462 | tRNALeu(UAA) | SCLEURNA* | 74
DCGSTA | 4096-4021 | tRNASer(GAA) | RF7020 | 87
BCPSAM2AT | 17-93 | tRNAPro(GGG) | DP1560 | 100
HHRNAPOP | 1243-1168 | tRNAAsp(GUC) | RD0500 | 93
HIINT | 200-125 | tRNALys(UUU) | DK2000 | 99
HMRGHM | 187-113 | tRNAGly(GCC) | RG0380 | 99
MCRRNA5 | 225-308 | tRNALeu(UAG) | DL1141 | 100
MLB577COS | 7676-7752 | tRNAPro(CGG) | DP1400 | 100
PAKPILIN | 972-1048 | tRNAThr(CGU) | DT1820 | 100
PPTGRH | 403-480 | tRNAPro(UGG) | DP1740 | 100
PPTGRH | 489-564 | tRNAHis(GUG) | DH1740 | 100
SASAM2B | 25-101 | tRNAPro(CGG) | DP1360 | 100
NCGLA1 (p) | 3470-3541 | tRNAArg(ACG) | DMRP* | 82
PPURNA1 (p) | 343-272 | tRNAMet(CAU) | MCTRFM* | 70
SPCEN114 | 1774-1860 | tRNATyr(GUA) | RY6320 | 99
SPCEN163 | 2624-2547 | tRNAAla(AGC) | DA6320 | 99

a Location of the predicted gene in EMBL (Rel.38). (p) indicates potential tRNA genes also recognized by Pavesi et al. (1994).
b Family of the potential tRNA.
c tRNA gene present in EMBL used for the computation of the percentage of similarity with the predicted tRNA gene. References with * are entry names in EMBL. Others are entry names in the tRNA database file of EMBL (Rel.38).
d Percentage of similarity between predicted and compared sequence.

Searching the yeast genome

The 16 chromosomes are annotated in the yeast database found at the Martinsried Institute for Protein Sequence: 84 tRNA genes had been identified in the early sequenced chromosomes I, II, III, V, VIII, IX and XI. Another 191 are annotated in chromosomes IV, VI, VII, X, XII, XIII, XIV, XV and XVI. All of the annotated genes but three, and no others, were found with FAStRNA.
Results corresponding to all chromosomes are summarised in Table 2. No false positive was found by FAStRNA, although one sequence listed in the results is missing from the annotations: a copy of a tRNASer(GCT) found in chromosome XII at position 784,352 is repeated in chromosome IV at position 1,141,093 and is not reported as such in the database.

Sequences of false negatives were aligned with other detected sequences corresponding to the same anticodon, when available. The tRNAAsp annotated in chromosome XIV at position 519,096 is found to match exactly other tRNAAsp genes identified on different chromosomes, except for the first eight nucleotides. FAStRNA rejects this sequence precisely because it cannot form an aminoacyl arm. Possible deletions or sequencing errors can explain the mismatches.

Table 2. Results of yeast genome scanning

Chromosome | Number of annotated genes | Number of matches identified by FAStRNA | Number of false positives identified by FAStRNA
I | 4 | 4 | 0
II | 13 | 13 | 0
III | 10 | 10 | 0
IV | 26 | 26 | 1
V | 20 | 20 | 0
VI | 9 | 9 | 0
VII | 36 | 36 | 0
VIII | 11 | 11 | 0
IX | 10 | 10 | 0
X | 24 | 24 | 0
XI | 16 | 16 | 0
XII | 23 | 22 | 0
XIII | 21 | 21 | 0
XIV | 15 | 13 | 0
XV | 20 | 20 | 0
XVI | 17 | 17 | 0

Figure 1. False negative found in the yeast genome.

The tRNALeu annotated in chromosome XII at position 725,746 is supposed to correspond to a CAG anticodon that is found nowhere else in the genome. As a result, the interpretation of the alignment is not so straightforward. The sequence corresponding to the GAG anticodon is over 75% homologous to the TAG anticodon sequence, as shown in Figure 1. This similarity drops to 20% for the CAG anticodon sequence.

The tRNAGly annotated in chromosome XIV at position 71,772 is cited as a potential tRNA detected by tRNAscan. To begin with, the alignment with the GCC anticodon sequences (there are 16 exact copies of this particular gene in the whole genome) shows a lack of similarity.
Moreover, none of the 21 tRNAGly sequences annotated in the genome contains an intron. The sequence is not selected by FAStRNA because of the poor quality of the D and TCC regions.

Discussion

Various approaches for RNA structure prediction have been defined and trained on the structure of the tRNA molecule (Gautheret et al., 1990; Chiu & Kolodziejczak, 1991; Eddy & Durbin, 1994; Searls & Dong, 1992; Searls, 1993). Progressively, the general framework of RNA prediction was dropped in favour of sensitive and specialised methods, as seen in the latest attempts (Fichant & Burks, 1991; Pavesi et al., 1994). Such an approach is characterised by low rates of false positives and false negatives. The latter rate is higher in the method of Fichant & Burks (1991) and, conversely, the former is higher in that of Pavesi et al. (1994). But these methods are, in fact, instances of a general strategy of exhaustive search for specific RNA motifs (Dandekar & Hentze, 1995). Nucleotide sequence filters are organised in a sequential algorithm and information is gradually refined. The general definition of a tRNA consensus sequence matches a set of necessary and sufficient conditions relying on primary and/or secondary structure features. Though the generality of the target of search is lost, the method appears to be transposable from one type of RNA to another. The notion of generalisation is shifted from the object to the method, which justifies seeking an optimal definition of the method.

The results presented above show that RNA motif search can be optimised with respect to both exhaustiveness and speed. Exhaustiveness was partly achieved by modifying a few definitions, in particular the scoring functions for base-pairing. It was enhanced as well by modifying the algorithm, since some arbitrary thresholds and scores need not be kept in the optimised version.

In the methods of Fichant & Burks (1991) and Pavesi et al.
(1994), priority was not given to how quickly searching is performed. Nevertheless, within years, complete genomes of various organisms will be available and fast sequence scanning has already become a concern. The issue of sequence comparison was addressed with the definition of methods such as FASTA (Pearson & Lipman, 1988) and BLAST (Altschul et al., 1990). They were designed to reduce both the time and space required for assessing similarities. Other attempts, especially those relying on advances in the field of string matching, are aimed at minimising the number of operations needed to compare two sequences. For instance, this number can be reduced by considering sequence repeats only once through the use of position trees (Lefèvre & Ikeda, 1994).

Exact string-matching problems gave rise to fast and efficient techniques, such as the Boyer-Moore algorithm. An example of implementation is given by Prunella et al. (1993). The efficiency of this algorithm is obtained by skipping stretches of sequence where no match can possibly be found. Since sequence shifts and character skips are involved, the size of the alphabet of symbols and that of the pattern are determining factors. The larger the alphabet and the longer the pattern, the faster the method. The use of this algorithm is made difficult by the approximate or ambiguous definition of patterns in nucleotide sequences. In tRNAs, the TCC and D patterns are defined on a small alphabet and they may contain errors. An alternative had to be sought for searching tRNA sequences.

There are a number of algorithms for approximate string-matching (Stephen, 1994), in particular those designed to handle don’t care symbols. Some were implemented specifically for searching biological patterns, such as the one forming the basis of the language ANREP (Mehldau & Myers, 1993). Recent contributions in this area, such as the Shift-Add algorithm, rely on a pre-processing stage. Information about the current state of search is considered.
An integer number represents a vector of states holding the results of comparisons between the prefixes of the pattern and the corresponding stretches of the scanned sequence. The search proceeds by reading symbols and updating the state vector.

The Shift-Add algorithm proved successful for identifying the TCC and D patterns in tRNAs. It can be implemented as part of the search for other complex RNA motifs, such as self-splicing introns. As detailed by Lisacek et al. (1994), the most conserved region surrounding and including the catalytic site of group I introns defines two patterns, namely Ā(UGC)AN(AGC)GRC(UGC) and R(AGU)(C CNRNRC(UGG)C. Once similar patterns are identified, a succession of search procedures is run and the structure of a group I intron is gradually elucidated (results not shown).

Conclusion

Techniques for approximate pattern-matching with mismatches were used to optimise search procedures to identify RNA motifs, in particular tRNA sequences. A possible extension to this work would consist of implementing an algorithm accommodating insertions and deletions in the definition of patterns, as described by Wu & Manber (1992). Diverse possibilities have to be explored. Furthermore, as the definition of patterns and motifs varies depending on the type of RNA, the limitations of the method need to be assessed.

Methods

Basic assumptions

FAStRNA depends on two essential characteristics of the primary and secondary structure of the tRNA gene as observed in the initial alignment of 546† sequences.
(1) The presence of invariant (i.e. universal) and semi-invariant nucleotides located in two highly conserved regions forming, respectively, the TCC and D signals.
(2) The cloverleaf structure consisting of four arms (paired bases) and three loops (unpaired bases), one of which, the D loop, is of variable size.

The localisation of a signal is followed by an attempt to fold the stretch containing and spreading around this signal into a base-paired arm. Potential TCC and D arms are thus identified. Compatibility between TCC and D arms is established from the initial alignment of sequences. A maximum (or minimum) distance between two arms corresponds to the maximum (or minimum) number of nucleotides connecting successive strands involved in distinct arms in the alignment.

† Our training set is drawn from the tRNA database file of Rel.38 of EMBL. Mitochondrial and chloroplastic tRNA genes do not share all the features common to tRNA sequences and are not included in the study. After eliminating redundant sequences, we obtain a training set containing 546 sequences.

Signal search

The probabilistic approach

Signals can be represented in different ways. In tRNAscan, a probabilistic representation was chosen by means of weight matrices. In the original alignment of sequences, the frequency of nucleotide occurrence between position 48 and position 62 is used to delimit the TCC signal. Then, the composition of any 15-nucleotide segment can be compared with that of a TCC signal. By analogy, the D signal corresponds to unusual nucleotide frequencies between position 8 and position 15. For each compared segment, a similarity score depending on the corresponding weight matrix is computed. Moreover, a fixed number of invariant nucleotides (frequencies closest to 1) is imposed a priori. For instance, the D signal contains three invariant bases: T in position 8, G in position 10 and A in position 14. A potential D signal is required to contain at least two of these three invariant bases. Consequently, the identification of a signal relies on two necessary conditions: a segment is selected as a signal if it approximately matches the sequence of invariant nucleotides and if the similarity score is above a set threshold.

A comparable similarity-based selection of signals is implemented in FAStRNA, with minor changes. Invariant bases are not fixed a priori, but deduced from the weight matrix.
Moreover, in the score calculation, the logarithm of the frequency is used in order to favour high frequencies to the detriment of low ones. When matches with invariant bases occur, a segment is selected upon the similarity score $S$, as follows:

\[ S = \frac{\sum_{i=1}^{n} \log f_{b,i}}{\sum_{i=1}^{n} \log f_{\max,i}} \]

where $n$ is the length of the signal, $f_{b,i}$ is the frequency (in the weight matrix) of the nucleotide $b$ found in position $i$ in the segment, and $f_{\max,i}$ is the highest frequency in position $i$. As $\log 1 = 0$, invariant nucleotides do not contribute to the score. A score equal to 1 corresponds to a perfect match. The higher the score, the worse the match.

The pattern-matching approach

A signal can be represented as a class of patterns. Given the alphabet N = {A, T, G, C}, each position in the segment corresponds to a subset of N. Searching for a signal is equivalent to a pattern-matching problem, with possible errors. The initial alignment is used to define the class of patterns characterising a signal. A nucleotide is assumed to occur at a given position if it is found at least 28 times in the alignment of 546 sequences (i.e. 5%). For instance, at position 48 only C and T satisfy this criterion. The TCC and D signals remain 15 and 8 nucleotides long, respectively.

As an alternative to the similarity-based function described above, a pattern-matching approach can be used in FAStRNA. It identifies a TCC signal by matching 15-nucleotide segments to YT NNRGTTCRAC YCY (R and Y represent purine and pyrimidine, respectively, and X̄ represents the absence of symbol X) with up to k errors†, and likewise a D signal by matching 8-nucleotide segments to TRGYNNAR.

† The first five nucleotides do not seem to occur randomly but a more definite pattern emerges between G(6) and C(14).

Table 3.
Weights associated with base-pairing

    |   A |   G |   C |   T
A   | −10 | −10 |  −2 |   5
G   | −10 | −10 |   7 |   3
C   |  −2 |   7 | −10 | −10
T   |   5 |   3 | −10 |  −2

When searching such a class of patterns with errors, the naive way is to check all symbols in each and every overlapping segment, resulting in a slow progression along the scanned sequence. Various algorithms have been developed for searching different sets of patterns, and especially patterns with don’t care symbols. However, most of them deal with exact string matching or are not adapted to a four-symbol alphabet. Consequently, to tackle the question of speed, a specific class of algorithms was considered (Baeza-Yates & Gonnet, 1992; Baeza-Yates & Perleberg, 1992; Wu & Manber, 1992). The main properties of these algorithms are:
(1) Simplicity; they consist of a pattern preprocessing step and a search step. Each of them is simple and only a few bitwise logical operations are used.
(2) Speed; the time needed to process one character is bounded by a constant depending only on the pattern length. Therefore, these algorithms are linear and remain efficient even if the size of the alphabet is small.
(3) Flexibility; they are flexible enough to allow searching for a class of patterns without making the implementation less practical.

They are all based on the same idea, which consists of finding, at a given position in the sequence, all approximate pattern prefixes ending at this position. Speed is increased by representing the state of the search as a bit number or an array, and by using the ability of programming languages to handle bit words.

The Shift-Add algorithm (Baeza-Yates & Gonnet, 1992) is one of these algorithms, dealing with pattern matching with mismatches. It was used to search for TCC and D signals. The main idea of this algorithm is to represent the state of the search as a bit number (or a vector) and, at each step, perform a few simple arithmetic and logical operations.
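For reference, the naive search that these algorithms replace can be sketched as follows. This is only an illustration, not the tRNAscan code: the function name and the IUPAC lookup table are assumptions introduced here, and the example sequences are made up. The D pattern TRGYNNAR is taken from the text.

```python
# Nucleotide classes for the pattern symbols (N is the don't care symbol).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "N": "ACGT",  # any nucleotide
}


def naive_search(pattern, sequence, k):
    """Return (start, mismatches) for every window with at most k mismatches.

    Every symbol of every overlapping window is checked, which is the slow
    baseline described in the text. Start positions are 0-based.
    """
    m = len(pattern)
    hits = []
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        mismatches = sum(s not in IUPAC[p] for p, s in zip(pattern, window))
        if mismatches <= k:
            hits.append((start, mismatches))
    return hits


D_SIGNAL = "TRGYNNAR"  # D pattern given in the text
print(naive_search(D_SIGNAL, "CCTAGCGCAGCC", 0))
print(naive_search(D_SIGNAL, "CCTAGCGCCGCC", 1))
```

Each window costs up to m comparisons, so the total work grows as the product of the sequence length and the pattern length.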
Let $P = p_1 \cdots p_m$ be a pattern of length $m$ and $s = s_1 \cdots s_n$ be a sequence of length $n$ over a finite alphabet N. The problem is to find in $s$ all occurrences of $P$ with at most $k$ mismatches ($0 \le k \le m$). In other words, the distance between two strings of the same length is defined as the number of their mismatching characters (the Hamming distance). Given a current position $j$ in the sequence, let $S_j$ be the state vector at this position. $S_j$ contains the individual states of the search between each prefix of $P$ and the corresponding substring of $s$. Namely, for $1 \le i \le m$, $S_j[i]$ is the number of mismatches between $p_1 \cdots p_i$ and $s_{j-i+1} \cdots s_j$. $P$ matches at $j$ if and only if $S_j[m] < k + 1$. When $s_{j+1}$ is read, the number of mismatches for each prefix of $P$ needs to be updated. Values of the Boolean expressions $s_{j+1} = p_i$, for $1 \le i \le m$, can be computed during a preprocessing step. For each character $a$ in N, a vector $T_a$ of size $m$ is constructed such that, for $1 \le i \le m$:

\[ T_a[i] = \begin{cases} 0 & \text{if } a = p_i \\ 1 & \text{otherwise} \end{cases} \]

It is sufficient to construct the $T$ arrays only for characters appearing in the pattern. Finally, with $S_j[0] = 0$,

\[ S_{j+1}[i] = S_j[i-1] + T_{s_{j+1}}[i] \]

Possible values of the state vector components are $0, \ldots, m$. Thus, to represent each component, $b = \lceil \log_2 (m + 1) \rceil$ bits are required. However, since we need only to compare the number of mismatches with $k$, it is enough to represent values from 0 to $k$. In this case, one more bit is needed for carrying over additions. The improved algorithm uses $b = \lceil \log_2 (k + 1) \rceil + 1$ bits. At each position $j$ in the sequence, the overflow bits are recorded in an overflow state $R_j$ and the overflow bits of $S_j$ are reset. The Shift-Add algorithm works in $O(n)$ time and, at each step, three basic operations are performed: one shift, one addition and one test to determine whether $P$ matches at position $j$. In order to consider large patterns, one solution is to use more than one word per number.
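The recurrence and the overflow handling described above can be sketched in a few lines. This is a hedged illustration rather than the FAStRNA code, which was written in C: the function name and the IUPAC table are assumptions, and a Python integer masked to m·b bits stands in for one or more machine words.

```python
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}


def shift_add_search(pattern, sequence, k):
    """Return (start, mismatches) for windows with at most k mismatches.

    The state vector is packed into one integer, b bits per pattern prefix.
    Start positions are 0-based.
    """
    m = len(pattern)
    b = k.bit_length() + 1             # equals ceil(log2(k + 1)) + 1
    word = (1 << (m * b)) - 1          # emulates a word of m * b bits
    ones = sum(1 << (i * b) for i in range(m))   # a 1 in every field
    high = ones << (b - 1)             # overflow bit of every field
    # Preprocessing: field i of T[a] is 1 when a mismatches pattern symbol i.
    T = {a: sum(1 << (i * b) for i, p in enumerate(pattern)
                if a not in IUPAC[p])
         for a in "ACGT"}
    limit = (k + 1) << ((m - 1) * b)
    state = overflow = 0
    hits = []
    for j, c in enumerate(sequence):
        # One shift and one addition update all m counters at once;
        # unknown sequence symbols count as mismatches everywhere.
        state = ((state << b) + T.get(c, ones)) & word
        # Record overflow bits separately, then reset them in the state.
        overflow = ((overflow << b) | (state & high)) & word
        state &= ~high
        # One test: the last field must hold fewer than k + 1 mismatches.
        if j >= m - 1 and state + overflow < limit:
            hits.append((j - m + 1, state >> ((m - 1) * b)))
    return hits


print(shift_add_search("TRGYNNAR", "CCTAGCGCCGCC", 1))
```

Each sequence character is processed with a constant number of integer operations, whatever the content of the window.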
If the number of words per number remains small, the Shift-Add algorithm is a good practical choice for searching a class of patterns with mismatches.

In order to obtain $S_{j+1}$ from $S_j$ by simple arithmetic and logical operations, vectors are considered as numbers and represented in base $2^b$, where $b$ is the number of bits needed to represent each vector component. Thus:

\[ S_j = \sum_{i=1}^{m} S_j[i] \, 2^{(i-1)b} \quad \text{and} \quad T_a = \sum_{i=1}^{m} T_a[i] \, 2^{(i-1)b} \]

If representations do not exceed the word size $v$, namely $mb \le v$, the transition from $S_j$ to $S_{j+1}$ is obtained by the assignment:

\[ S_{j+1} = (S_j \ll b) + T_{s_{j+1}} \]

where $\ll$ denotes the bitwise shift-left operation. $P$ matches at $j$ if and only if $S_j < (k + 1) \, 2^{(m-1)b}$. For $v = 32$, only one word for the D signal and two words for the TCC signal are required.

Other changes involving the definition of thresholds and scoring functions were made. In the probabilistic approach, matching a segment with invariant bases of the TCC or D signal is considered as a string-matching problem with don’t care symbols. Therefore, the Shift-Add algorithm is useful in this case.

Figure 2. Flow chart of the algorithm.

Structure prediction

In tRNAscan, the definition of the tRNA secondary structure depends only on Watson-Crick and (G, T) pairs. To address the question of flexibility in defining arms, each of the ten possible pairs (regardless of the orientation) is given a weight as described by Lisacek et al. (1994). The values shown in Table 3 reflect the stability of a base-pair and its frequency in natural RNA helices. As a result, a selection threshold for a potential arm is not simply a minimal number of Watson-Crick or (G, T) pairs but, more accurately, a minimal weight of successive pairings. In practice, a score is computed by cumulating weights of successive pairings, and a potential arm is recognised if its score is above a fixed threshold. Weak base-pairings at both ends of the arm are not considered.

Table 4.
Parameter and threshold values used in FAStRNA

Region (a) | Perfect match (b) | Threshold match (c)
TCC signal | At most 1 mismatch | Three mismatches
TCC arm | Base-pairing score >26 | Base-pairing score >10 and at least 3 base-pairings
D signal | No mismatch | Two mismatches
Aminoacyl arm | Base-pairing score >36 | Base-pairing score >18 and at least 4 base-pairings
D arm | Score of the 4 base-pairs >16 | Score of the first 3 base-pairs >11 (MAX), or score of the first 3 base-pairs >7 (MIN) and 4th base-pairing >0
Bases 18, 19 and 21 | Base 18 = G, base 19 = G and base 21 = A | Other bases
Base 33 | T | Other base
Anticodon arm, without intron | Base-pairing score >19 | Base-pairing score >11
Anticodon arm, with intron | Base-pairing score >26 | Base-pairing score >17

(a) Regions of the tRNA sequence, in the order in which they are analysed by the algorithm.
(b) Each time a condition is verified, the general score sg is incremented.
(c) Minimal conditions for accepting a region.

Let S be the segment TTTGTCCGA and S' be the potential complementary segment CAGAACGAT:

S:  TTTGTCCGA
     ====o=
    TAGCAAGAC  :S'

According to Table 1 the corresponding cumulative sum is: 0 5 8 15 20 18 25, such that an arm is defined as S(2)–S(7) paired with S'(3)–S'(8) with a score of 25.

Another small change was introduced in assessing the quality of base-pairing in the D arm. Minimally, the D arm is defined as the coupling of positions 10 and 25, 11 and 24, and 12 and 23. If the cumulative sum of such pairings is below a MIN value, the arm cannot be formed, and if it is above a MAX value, the arm is formed. When it is between MIN and MAX and the weight of the pair defined by positions 13 and 22 is positive, then the arm is formed.

Improved algorithm and whole-sequence scanning

The improved algorithm is completed in eight steps (Figure 2). The first two are dedicated to identifying a potential TCC region. It is assumed that seven nucleotides form a TCC loop and five base-pairs form a TCC arm.
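Arm detection in these steps relies on the cumulative base-pair score illustrated with S and S'. A minimal sketch of that scoring follows, under stated assumptions: it keeps the best-scoring run of consecutive pairings using a maximum-subarray scan (Kadane's algorithm), which is one plausible reading of "weak base-pairings at both ends are not considered" and is not confirmed by the source as the authors' procedure. The four weights are inferred from the worked example alone; the `default` weight applied to every other pair and the name `best_arm` are hypothetical placeholders.

```python
# Weights inferred from the worked example only (A-T 5, G-C 7, G-T 3, A-C -2).
# They are NOT a transcription of the paper's weight table.
EXAMPLE_WEIGHTS = {
    frozenset("AT"): 5,
    frozenset("GC"): 7,
    frozenset("GT"): 3,
    frozenset("AC"): -2,
}


def best_arm(s, s_comp, weights=EXAMPLE_WEIGHTS, default=-2):
    """Return (score, start, end): best run of consecutive pairings.

    start and end are 1-based positions on s. `default` is a hypothetical
    weight for pairs missing from `weights`.
    """
    partner = s_comp[::-1]            # s_comp read backwards faces s base by base
    best = (0, 0, 0)
    run, run_start = 0, 1
    for i, (x, y) in enumerate(zip(s, partner), start=1):
        if run <= 0:                  # a non-positive prefix never helps: restart
            run, run_start = 0, i
        run += weights.get(frozenset((x, y)), default)
        if run > best[0]:             # weak pairings at the ends are left out
            best = (run, run_start, i)
    return best
```

With these assumptions, `best_arm("TTTGTCCGA", "CAGAACGAT")` returns `(25, 2, 7)`, that is positions 2 to 7 of S with a score of 25, in agreement with the example. A candidate arm would then be accepted or rejected by comparing the score with the relevant threshold of Table 4.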
The following three steps are focused on identifying potential D regions. D regions are searched in areas where distance requirements with the previously determined TCC are satisfied. The length of connecting sequences between potential TCC and D regions is constrained by a possible intronic insertion, occurring between positions 37 and 38 (intron length varies between eight and 60 nucleotides), as well as by the presence of the variable loop (minimally comprising three nucleotides). Moreover, it must accommodate the formation of an aminoacyl arm with at most seven base-pairs. It is assumed that 6† to 11 nucleotides form a D loop and at most four base-pairs form a D arm.

A global similarity score, sg, is computed at each step on the basis of compliance with a minimal set of characteristics common to most tRNAs. The sixth step consists of discarding all partial solutions with a low sg score, that is, lacking too many characteristics common to most tRNAs.

The seventh step corresponds to looking for the anticodon arm, taking into account the potential presence of an intron, as mentioned earlier. It is assumed that in an intronless sequence, seven nucleotides form an anticodon loop and five base-pairs form an anticodon arm. Finally, all solutions with a newly incremented but low sg score are discarded in the eighth step. All parameter and threshold values are given in Table 4.

Throughout such a stepwise search, an unsuccessful outcome results in going back to the last loop preceding this step and moving forward to the next window to identify a new potential region. When the direct sequence strand has been scanned completely, the algorithm is applied to the complementary strand.

† Five tRNA genes were missed by tRNAscan as this parameter was set to 7 by Fichant & Burks (1991).

Acknowledgements

We are grateful to M. Crochemore and A. Hénaut for helpful comments. This program is available at the following URL:http: //www-igm.univ-m/v.fr/nmabrouk/.

References

Altschul, S.
F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). A basic local alignment search tool. J. Mol. Biol. 215, 403–410.
Baeza-Yates, R. & Gonnet, G. H. (1992). A new approach to text searching. Commun. ACM, 35(10), 74–82.
Baeza-Yates, R. & Perleberg, C. H. (1992). Fast and practical approximate string matching. In Lecture Notes in Computer Science, vol. 644, pp. 185–191, Springer-Verlag.
Chiu, D. K. Y. & Kolodziejczak, T. (1991). Inferring consensus structure from nucleic acid sequences. Comput. Appl. Biosci. 7(3), 347–352.
Dandekar, T. & Hentze, M. W. (1995). Finding the hairpin in the haystack: searching for RNA motifs. Trends Genet. 11(2), 45–50.
d’Aubenton Carafa, Y., Brody, E. & Thermes, C. (1990). Prediction of rho-independent Escherichia coli transcription terminators. J. Mol. Biol. 216, 835–858.
Eddy, S. R. & Durbin, R. (1994). RNA sequence analysis using covariance models. Nucl. Acids Res. 22(11), 2079–2088.
Fichant, G. A. & Burks, C. (1991). Identifying potential tRNA genes in genomic DNA sequences. J. Mol. Biol. 220, 659–671.
Gautheret, D., Major, F. & Cedergren, R. (1990). Pattern searching/alignment with RNA primary and secondary structures: an effective descriptor for tRNA. Comput. Appl. Biosci. 6(4), 325–331.
Haselman, T., Chappelear, J. E. & Fox, G. E. (1988). Fidelity of secondary and tertiary interactions in tRNA. Nucl. Acids Res. 16, 5673–5684.
Lefèvre, C. & Ikeda, J. E. (1984). A fast word search algorithm for the representation of sequence similarity in genomic DNA. Nucl. Acids Res. 22(3), 404–411.
Lisacek, F., Diaz, Y. & Michel, F. (1994). Automatic identification of group I intron cores in genomic DNA sequences. J. Mol. Biol. 235, 1206–1217.
Mehldau, G. & Myers, G. (1993). A system for pattern matching applications on biosequences. Comput. Appl. Biosci. 9(3), 299–314.
Pavesi, A., Conterio, F., Bolchi, A., Dieci, G. & Ottonello, S. (1994). Identification of new eukaryotic tRNA genes in genomic DNA databases by a multistep weight matrix analysis of transcriptional control regions. Nucl. Acids Res. 22(7), 1247–1256.
Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.
Prunella, N., Liuni, S., Attimonelli, M. & Pesole, G. (1993). FASTPAT: a fast and efficient algorithm for string searching in DNA sequences. Comput. Appl. Biosci. 9(5), 541–545.
Searls, D. (1993). The computational linguistics of biological sequences. In Artificial Intelligence and Molecular Biology (Hunter, L., ed.), pp. 47–120, AAAI Press.
Searls, D. & Dong, S. (1992). A syntactic pattern recognition system for DNA sequences. In The Second International Conference on Bioinformatics, Supercomputing and Complex Genome Analysis (Lim, H. A., Fickett, J. W., Cantor, C. R. & Robbins, R. J., eds), pp. 89–101. World Scientific, Teaneck, NJ.
Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
Sprinzl, M., Steegborn, C., Huebel, F. & Steinberg, S. (1992). Compilation of tRNA sequences and sequences of tRNA genes. Nucl. Acids Res. 24, 68–72.
Staden, R. (1988). Methods to define and locate patterns of motifs in sequences. Comput. Appl. Biosci. 4, 53–60.
Stephen, G. A. (1994). String Search Algorithms. Lecture Notes Series on Computing, vol. 3, World Scientific, Singapore.
Wozniak, P. & Mokalowski, W. (1990). Searching for tRNA genes in DNA sequences—an IBM microcomputer program. Comput. Appl. Biosci. 6, 49–50.
Wu, S. & Manber, U. (1992). Fast text searching allowing errors. Commun. ACM, 35(10), 83–91.

Edited by J. Karn

(Received 30 July 1996; accepted 4 September 1996)