J. Mol. Biol. (1996) 264, 46–55
Very Fast Identification of RNA Motifs in
Genomic DNA. Application to tRNA Search
in the Yeast Genome
Nadia El-Mabrouk1 and Frédérique Lisacek2*
1
Institut Gaspard Monge
Université de
Marne-la-Vallée, 2 rue de la
Butte Verte, 93166
Noisy-le-Grand Cedex, France
2
Laboratoire Informatique et
Génome, Université de
Versailles-Saint-Quentin
45 avenue des Etats-Unis
78035 Versailles Cedex
France
*Corresponding author
A common strategy characterises the various methods independently
defined to identify almost unambiguously different types of RNA
molecules in DNA fragments. So far, the quality of detection of RNA
motifs has been the primary motivation, which has effectively delayed the
optimisation of programs. As an illustration of possible improvements, a
modified version of tRNAscan is described. The previous algorithm was
altered to run 500 times faster and to lower both rates of false positives
and false negatives. The newly sequenced genome of Saccharomyces
cerevisiae is scanned both ways in less than three minutes and results match
annotations found in databanks with three exceptions, two of which are
arguably not real tRNAs.
© 1996 Academic Press Limited
Keywords: pattern matching; optimised search; RNA motif; tRNA; yeast
genome
Introduction
The identification of functional regions in
genomic DNA increasingly relies on the coupling
of experimental work to computer processing
of sequences. Newly sequenced fragments of a
genome may be analysed with computer programs
and the result of an automated search may guide
a new set of experiments. Among others, the
identification of more or less complex RNA
‘‘motifs’’ is achieved by a variety of programs.
There are basically two approaches to the
identification of RNA motifs. It is either part of a
general purpose method designed for searching
and/or folding RNA sequences (Gautheret et al.,
1990; Chiu & Kolodziejczak, 1991; Eddy & Durbin,
1994; Searls & Dong, 1992; Searls, 1993) or a
self-contained method tailor-made for searching
tRNA genes (Fichant & Burks, 1991; Pavesi et al.,
1994), Escherichia coli transcription terminators
(d’Aubenton Carafa et al., 1990), self-splicing
introns (Lisacek et al., 1994), etc. By construction,
reported results are usually more accurate in the
latter case and only these will be considered here.
The outline of a common strategy for searching
these motifs has been given by Dandekar & Hentze
(1995). A similar approach characterises the various
methods independently defined to identify almost
unambiguously different types of RNA molecules
in DNA fragments. Such dedicated searches are
based on the principle that the more conserved a
region, the more easily recognisable. They depend
on the use of ‘‘weight’’ or ‘‘consensus’’ matrices
(Staden, 1988). Weight matrices make the definition
of the RNA motifs more flexible than consensus
sequences. Primary and secondary structure features are then used to gradually refine the
identification process.
Most algorithms following the above strategy
were not optimised. However, the forthcoming availability of complete genomes emphasises the need
for efficient and fast computer search tools.
Efficiency depends on the sensitivity of the search,
and speed on how well the algorithm is optimised. So
far, the quality of detection of RNA motifs has
been the primary motivation, which has effectively postponed the optimisation of programs.
The speed of search algorithms first became a
concern in the late 80s with the fast-growing
sequence databases and databanks. The exponential increase of operations involved in sequence
comparison and the absence of obvious means of
speeding up rigorous methods relying on dynamic
programming (e.g. Smith & Waterman, 1981)
prompted the appearance of heuristic methods
(Pearson & Lipman, 1988; Altschul et al., 1990). In
contrast, RNA motif searching has been based on
heuristic methods, circumventing combinatorial
difficulties with a loose theoretical background.
Fast and efficient techniques can be imported
from formally solved string-matching problems.
Searching for invariant or semi-invariant
nucleotides can be considered as searching for
patterns with don’t care symbols. A number of
algorithms have been defined and improved
whether string matching is approximate or exact
(Stephen, 1994). The optimisation of RNA motif
search introduced here relies on the Shift-Add
algorithm, defined for searching with
mismatches (Baeza-Yates & Gonnet, 1992). Testing
and using this algorithm for the identification of
tRNA genes considerably reduced computing
time.
Over the last 20 years, the tRNA molecule
has been extensively studied. The corresponding
gene is a short sequence that folds in the form
of a cloverleaf. Tertiary interactions are known
(Haselman et al., 1988). Many sequences are
available and aligned (Sprinzl et al., 1992). Conserved regions appear in the alignment, which
consists of only 76 positions. The first reliable
algorithm, tRNAscan, was written by Fichant &
Burks (1991). More recently, another sensitive
method based on the use of weight matrices was
defined (Pavesi et al., 1994).
The modified version (FAStRNA) of tRNAscan
yields good results. Scanning is performed in a few
seconds for small genomes and in less than three
minutes for both strands of the newly sequenced
Saccharomyces cerevisiae genome. Setting an appropriate hierarchy of searching operations also
contributes to minimising computing time, as
tested by Wozniak & Makalowski (1990). The
hierarchy introduced in tRNAscan is slightly
modified. Other parameters and definitions are
made more flexible, helping to reduce both rates of
false positive (a selected sequence that does not
correspond to a known tRNA) and false negative
(a non-selected sequence corresponding to a known
tRNA).
Throughout the text, references to positions in the
tRNA sequence alignment follow the numbering given by
Sprinzl et al. (1992).
Results
Pattern matching versus consensus matrices
From a computing point of view, there are
several advantages to the pattern-matching approach over the probabilistic approach: (1) the
signal search is uninterrupted by score calculations;
(2) score calculations are not necessary and score
thresholds need not be specified; (3) a signal is
considered as a whole as against a collection of
invariant, semi-invariant and variable nucleotides;
segments are selected upon an overall minimal
number of errors. As far as the quality of results
is concerned, both approaches yield accurate
results.
† Potential tRNA found in unannotated sequences
are not considered.
‡ Truncated, partial, putative, repeated gene or
pseudogene sequences are not considered, nor are
sequences containing too many
unidentified/ambiguous nucleotides.
§ Mitochondrial and chloroplastic sequences are not
included in the study.
Speed
Both tRNAscan and FAStRNA were
written in C and run on an HP9000 computer.
Whereas tRNAscan, which is based on a naive search method,
scans a relatively small genome (ca 2 × 200,000
nucleotides) in 16 minutes, the modified version
runs in two seconds.
Both strands of the 16 chromosomes of the
recently completed genome of Saccharomyces cerevisiae are scanned in 2.5 minutes. For instance,
the longest chromosome (IV) made of 1,522,191
nucleotides, is exhaustively analysed in 19.5
seconds, and the shortest (VI) made of 270,148
nucleotides in 3.5 seconds.
Both strands of chromosome III (2 × 315,000
nucleotides) are scanned in four seconds, as against
35 minutes with tRNAscan and 11 minutes with
Genlang (Searls & Dong, 1992). Pavesi et al. (1994)
reported that both strands of a genomic fragment
(2 × 318,444 nucleotides) were scanned in two
hours with a GWBASIC program implemented on
a PC 486DX/33 computer.
Sensitivity of the search
Databank searching
The modified version of the tRNAscan algorithm
yields good results. Both rates of false positives†
and false negatives‡ are reduced. Moreover, some
predicted sequences, not annotated, are over 70%
similar to known tRNA and are likely to
correspond to tRNA genes.
Both tRNAscan and FAStRNA were tested on a
set of 95,143 sequences of EMBL (Rel.38)§. FAStRNA
is 97.9% accurate as against 96% for tRNAscan. In
both cases, the rate of false positives is of the order
of $10^{-4}$, though slightly lower for FAStRNA.
Corresponding sequences overlap but may also
differ. Results are shown in Table 1.
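The two rates reported in Table 1 follow simple arithmetic, described in footnotes c and d of that table. The sketch below is only an illustration of that arithmetic: the function names are hypothetical, and the mean tRNA length of 76 nucleotides is an assumption taken from the length of the alignment, since the exact value used for the table is not stated.

```python
def false_negative_pct(missed_genes, known_genes):
    """Percentage of known tRNA genes that were not predicted (footnote c)."""
    return 100.0 * missed_genes / known_genes


def false_positive_pct(false_positives, nt_tested, nt_in_trna_genes,
                       mean_trna_length=76):
    """Percentage of false positives among tRNA-sized objects (footnote d).

    mean_trna_length=76 is an assumed value, not one given in the paper.
    """
    # Nucleotides outside annotated tRNA genes on one strand.
    background = nt_tested - nt_in_trna_genes
    # Number of tRNA-sized objects on both strands.
    objects = 2 * background / mean_trna_length
    return 100.0 * false_positives / objects


# Totals of Table 1A, FAStRNA columns.
fn = false_negative_pct(31, 1470)
fp = false_positive_pct(9, 103_540_034, 115_575)
print(f"false negatives: {fn:.2f}%  false positives: {fp:.4f}%")
```

Under the assumed mean length, the totals of Table 1A are reproduced to the number of digits printed there.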
A detailed analysis of results has been provided
by Pavesi et al. (1994). The overall number of false
negatives is estimated to be three times lower than
that given by Fichant & Burks (1991). Conversely,
more false positives (0.0137% as against 0.0033%)
were found and two sequences out of a set of
30 potential genes were successfully tested for
transcriptional activity in vitro. Most of these
potential genes are predicted by FAStRNA.
FAStRNA was tested with available fragments of
the Escherichia coli genome and Bacillus subtilis large
genome fragments, with 100% success in both
cases.
Table 1. Results of database searching
A. Global results. Comparison between tRNAscan and FAStRNA

Taxonomic          Number of nucleotides     tRNA genes(b)         % False negatives(c)                % False positives(d)
category           tested(a)                                       tRNAscan          FAStRNA           tRNAscan           FAStRNA
Primates           25,454,414 (30,313 seq)   10,060 (77 genes)     2.60 (2 genes)    1.30 (1 gene)     0.0005 (4 genes)   0.0 (0 genes)
Rodents            19,490,297 (19,584 seq)   6532 (90 genes)       3.33 (3 genes)    2.22 (2 genes)    0.0005 (3 genes)   0.0 (0 genes)
Invertebrates      14,048,275 (10,818 seq)   28,708 (374 genes)    4.28 (16 genes)   2.14 (8 genes)    0.0005 (2 genes)   0.0005 (2 genes)
Plants             9,017,972 (7831 seq)      9001 (122 genes)      6.56 (8 genes)    1.64 (2 genes)    0.0 (0 genes)      0.0 (0 genes)
Prokaryotes        20,680,405 (13,069 seq)   45,998 (618 genes)    3.88 (24 genes)   2.59 (16 genes)   0.0009 (5 genes)   0.0005 (3 genes)
Fungi              3,918,817 (2905 seq)      7484 (92 genes)       2.17 (2 genes)    1.08 (1 gene)     0.0019 (2 genes)   0.0009 (1 gene)
Other mammals      5,133,144 (5088 seq)      2836 (36 genes)       8.33 (3 genes)    2.77 (1 gene)     0.0007 (1 gene)    0.0014 (2 genes)
Other vertebrates  5,796,710 (5535 seq)      4956 (61 genes)       0.00 (0 genes)    0.0 (0 genes)     0.0 (0 genes)      0.0006 (1 gene)
Total              103,540,034 (95,143 seq)  115,575 (1470 genes)  3.94 (58 genes)   2.11 (31 genes)   0.0006 (17 genes)  0.0003 (9 genes)

a The number of nucleotides corresponds to a single DNA strand. The total number of processed nucleotides is twice this number.
b Number of nucleotides corresponding to tRNA genes and number of tRNA genes in the test set.
c The percentage of false negatives is obtained by dividing the number of tRNA genes that are not predicted by the total number
of known tRNA genes.
d The percentage of false positives is calculated according to Fichant & Burks (1991): the value in column 3 is subtracted from that
in column 2. The obtained value is divided by average tRNA length and multiplied by 2 to obtain the number of tRNA-sized objects
in the test set. Finally the number of false positives is divided by this obtained value.

B. False negatives obtained by FAStRNA
Taxonomica
category
Entry namea
tRNA typeb
Primates
tRNASer
Rodents
Invertebrates
Plants
Prokaryotes
Fungi
Other mammals
HSTGSS
SeCys
MMSEC
tRNA
MMTGCI
ACTRANRNA
DMTGSCA
CCTRNAA
CETGSCA
tRNASeCys
tRNALeu
tRNASeCys
DMTGYC
NC02677
PM12SRRNA
tRNATyr
tRNAAla
tRNALys
TBTRQVK
ATPATY3
PUDNAK
AATGL
ALRRNA
ALRRNA
ALTRNA11
ALTRNAGLY
ALTRNAIA
ASPTRNLEU
CJTRNLAG
CJTRNLAG
EHRTRNA
HVTGW
MG02104
MPTGV
PLTRNA
PVSELC
SCTRPSER
SPCEN3XB
BTSERSEC
tRNAVal
tRNASer
tRNAGly
tRNALeu
tRNALys
tRNAVal
tRNAAla
tRNAGly
tRNAIle
tRNALeu
tRNAAla
tRNALeu
tRNAAla
tRNATrp
tRNASeCys
tRNALeu
tRNALeu
tRNASeCys
tRNASer
tRNAAsp
tRNASer
Reason why FAStRNA does not
predict the gene
Too many mismatches in the D signal and too low score for the
aminoacyl arm
Too many mismatches in the D signal and too low score for the
aminoacyl arm
Too low sg
Too low sg
Too many mismatches in the D signal and too low sg
Too low score for the aminoacyl arm
Too many mismatches in the D signal, too low score for the
aminoacyl arm
Too large intron (114 nucleotides)
Too low sg
Deletion between the D arm and the aminoacyl arm and too many
mismatches in the TCC signal
Too low score for the aminoacyl arm
Too low sg
Deletion in the TCC loop and insertion in the TCC arm
Too large intron (292 nucleotides)
Too low score for the aminoacyl arm
Insertion in the TCC loop
Deletion in the TCC loop
Deletion between the D arm and the aminoacyl arm
Deletion in the TCC loop
Too low sg
Too many mismatches in the TCC signal . . .
Too many mismatches in the TCC signal . . .
Deletion in the TCC loop
Too large intron (106 nucleotides)
Too many mismatches in the TCC signal . . .
Deletion in the TCC loop
Too low score for the anticodon arm
Too many mismatches in the D signal
Insertion in the D loop
Deletion in the TCC loop
Too many mismatches in the D signal and too low score for the
aminoacyl arm
a Location of the non-predicted gene in EMBL (release 38).
b Family of the non-predicted tRNA.
Table 1. Continued
C. Potential tRNA genes predicted by FAStRNA

Taxonomic       Entry name(a)   Positions(a)  tRNA type(b)    Compared      Similarity(d)
category(a)                                                   sequence(c)   (%)
Invertebrates   CEF42H10        533-608       tRNALys(CUU)    DK7560        100
                CEGPA1 (p)      2929-3012     tRNALeu(CAG)    DMTGL*        77
                CEUNC22 (p)     1474-1545     tRNAPro(CGG)    CETGPX*       93
                EHRPTARQ (p)    278-362       tRNAThr(UGU)    TATRTY1*      93
Plants          ATCPYLP (p)     234-326       tRNATyr(GUA)    DY6740        90
                ATENGE (p)      4380-4462     tRNALeu(UAA)    SCLEURNA*     74
                DCGSTA          4096-4021     tRNASer(GAA)    RF7020        87
Prokaryotes     BCPSAM2AT       17-93         tRNAPro(GGG)    DP1560        100
                HHRNAPOP        1243-1168     tRNAAsp(GUC)    RD0500        93
                HIINT           200-125       tRNALys(UUU)    DK2000        99
                HMRGHM          187-113       tRNAGly(GCC)    RG0380        99
                MCRRNA5         225-308       tRNALeu(UAG)    DL1141        100
                MLB577COS       7676-7752     tRNAPro(CGG)    DP1400        100
                PAKPILIN        972-1048      tRNAThr(CGU)    DT1820        100
                PPTGRH          403-480       tRNAPro(UGG)    DP1740        100
                PPTGRH          489-564       tRNAHis(GUG)    DH1740        100
                SASAM2B         25-101        tRNAPro(CGG)    DP1360        100
Fungi           NCGLA1 (p)      3470-3541     tRNAArg(ACG)    DMRP*         82
                PPURNA1 (p)     343-272       tRNAMet(CAU)    MCTRFM*       70
                SPCEN114        1774-1860     tRNATyr(GUA)    RY6320        99
                SPCEN163        2624-2547     tRNAAla(AGC)    DA6320        99

a Location of the predicted gene in EMBL (Rel.38). (p) indicates potential tRNA genes also recognized by Pavesi et al. (1994).
b Family of the potential tRNA.
c tRNA gene present in EMBL used for the computation of the percentage of similarity with the predicted tRNA gene. References
with * are entry names in EMBL. Others are entry names in the tRNA database file of EMBL (Rel.38).
d Percentage of similarity between predicted and compared sequence.

Searching the yeast genome
The 16 chromosomes are annotated in the yeast
database found at the Martinsried Institute for
Protein Sequence: 84 tRNA genes had been
identified in early sequenced chromosomes I, II, III,
V, VIII, IX and XI. Another 191 are annotated in
chromosomes IV, VI, VII, X, XII, XIII, XIV, XV and
XVI. FAStRNA found all of them but three, and no
others. Results corresponding to all
chromosomes are summarised in Table 2.
No false positive was found by FAStRNA,
although there is one missing sequence in
annotations listed in the results. A copy of a
tRNASer(GCT) found in chromosome XII at
position 784,352 is repeated in chromosome IV at
position 1,141,093 and not reported as such in the
database.
Sequences of false negatives were aligned with
other detected sequences corresponding to the
same anticodon, when available. The tRNAAsp
annotated in chromosome XIV at position 519,096
is found to match exactly other tRNAAsp genes
identified on different chromosomes, except for the
first eight nucleotides. FAStRNA precisely rejects
this sequence because of its inability to form an
aminoacyl arm. Possible deletions or sequencing
errors can explain the mismatches.
Table 2. Results of yeast genome scanning

Chromosome   Number of          Number of matches        Number of false positives
             annotated genes    identified by FAStRNA    identified by FAStRNA
I            4                  4                        0
II           13                 13                       0
III          10                 10                       0
IV           26                 26                       1
V            20                 20                       0
VI           9                  9                        0
VII          36                 36                       0
VIII         11                 11                       0
IX           10                 10                       0
X            24                 24                       0
XI           16                 16                       0
XII          23                 22                       0
XIII         21                 21                       0
XIV          15                 13                       0
XV           20                 20                       0
XVI          17                 17                       0

Figure 1. False negative found in the yeast genome.
The tRNALeu annotated in chromosome XII at
position 725,746 is supposed to correspond to a
CAG anticodon that is found nowhere else in the
genome. As a result, the interpretation of the
alignment is not straightforward. The sequence
corresponding to the GAG anticodon is over 75%
homologous to the TAG anticodon sequence, as
shown in Figure 1. This similarity drops to 20%
for the CAG anticodon sequence.
The tRNAGly annotated in chromosome XIV
at position 71,772 is cited as a potential tRNA
detected by tRNAscan. To begin with, the alignment corresponding to GCC anticodon sequences
(there are 16 exact copies of this particular
gene in the whole genome) shows a lack of
similarity. Moreover, none of the 21 tRNAGly
sequences annotated in the genome contains an
intron. The sequence is not selected by FAStRNA
because of the poor quality of the D and TCC
regions.
Discussion
Various approaches for RNA structure prediction
have been defined and trained on the structure of
the tRNA molecule (Gautheret et al., 1990; Chiu &
Kolodziejczak, 1991; Eddy & Durbin, 1994; Searls &
Dong, 1992; Searls, 1993). Progressively, the general
framework of RNA prediction was dropped in
favour of sensitive and specified methods as seen in
the latest attempts (Fichant & Burks, 1991; Pavesi
et al., 1994). Such an approach is characterised by
the low rate of generated false positives and false
negatives. The latter is higher in the method of
Fichant & Burks (1991) and conversely, the former
is higher in that of Pavesi et al. (1994). But these
methods are, in fact, instances of a general strategy
of exhaustive search for specific RNA motifs
(Dandekar & Hentze, 1995). Nucleotide sequence
filters are organised in a sequential algorithm and
information is gradually refined. The general
definition of a tRNA consensus sequence matches
a set of necessary and sufficient conditions relying
on primary and/or secondary structure features.
Though the generality of the target of search is lost,
the method appears to be transposable from one
type of RNA to another. The notion of generalisation is shifted from the object to the method, which
justifies seeking an optimal definition of the
method. Results presented above account for the
possibility of optimising RNA motif search with
respect to exhaustiveness and speed. Exhaustiveness was partly achieved by modifying a few
definitions, in particular, scoring functions for
base-pairing. It was enhanced as well by modifying
the algorithm, since some arbitrary thresholds and
scores need not be kept in the optimised version.
In the methods of Fichant & Burks (1991) and
Pavesi et al. (1994), priority was not given to how
quickly searching is performed. Nevertheless,
within years, complete genomes of various organisms will be available and fast sequence scanning
has already become a concern. The issue of
sequence comparison was addressed with the
definition of methods such as FASTA (Pearson &
Lipman, 1988) and BLAST (Altschul et al., 1990).
They were designed to reduce both the time and
space required for assessing similarities. Other
attempts, especially relying on advances in the field
of string matching, are aimed at minimising the
number of operations for comparing two sequences. For instance, this number can be reduced
by considering sequence repeats only once through
the use of position trees (Lefèvre & Ikeda, 1994).
Exact string-matching problems gave rise to fast
and efficient techniques, such as the Boyer-Moore
algorithm. An example of implementation is given
by Prunella et al. (1993). The efficiency of this
algorithm is obtained by overlooking stretches of
sequence where no match can possibly be found.
Since sequence shifts and character skip are
involved, the size of the alphabet of symbols and
that of the pattern are determinant factors. The
larger the alphabet and the longer the pattern, the
faster the method. The use of this algorithm is
made difficult by the approximate or ambiguous
definition of patterns in nucleotide sequences. In
tRNAs, the TCC and D patterns are defined on a
small alphabet and they may contain errors.
Another alternative had to be sought for searching
tRNA sequences.
There are a number of algorithms for approximate string-matching (Stephen, 1994), in particular
those designed to handle don’t care symbols. Some
were implemented specifically for searching biological patterns such as the basis of the language
ANREP (Mehldau & Myers, 1993). Recent contri-
butions in this area, such as the Shift-Add
algorithm, rely on a pre-processing stage. Information about the current state of search is
considered. An integer number represents a vector
of states holding the results of comparisons
between prefixes and the corresponding pattern.
The search proceeds by reading symbols and
updating the state vector.
The Shift-Add algorithm was proven successful
for identifying the TCC and D patterns in tRNAs.
It can be implemented as part of the search for other
complex RNA motifs, such as self-splicing introns.
As detailed by Lisacek et al. (1994), the most
conserved region surrounding and including the
catalytic site of group I introns, defines two
patterns, namely Ā(UGC)AN(AGC)GRC(UGC)
and R(AGU)(C CNRNRC(UGG)C. Once similar
patterns are identified, a succession of search
procedures are run and the structure of a group I
intron is gradually elucidated (results not shown).
† Our training set is drawn from tRNA database file
of Rel.38 of EMBL. Mitochondrial and chloroplastic
tRNA genes do not share all the features common to
tRNA sequences and are not included in the study.
After eliminating redundant sequences, we obtain a
training set containing 546 sequences.
Signal search
Conclusion
Techniques for approximate pattern-matching
with mistakes were used to optimise search
procedures to identify RNA motifs, in particular
tRNA sequences. A possible extension to this work
would consist of implementing an algorithm
accommodating insertions and deletions in the
definition of patterns, as described by Wu &
Manber (1992). Diverse possibilities have to be
explored. Furthermore, as the definition of patterns
and motifs varies depending on the type of RNA,
the limitations of the method need to be assessed.
Methods
Basic assumptions
FAStRNA depends on two essential characteristics of
the primary and secondary structure of the tRNA gene as
it is observed in the initial alignment of 546† sequences.
(1) The presence of invariant (i.e. universal) and
semi-invariant nucleotides located in two highly conserved regions forming respectively the TCC and D
signals. (2) The cloverleaf structure consisting of four
arms (paired bases) and three loops (unpaired bases), one
of which, the D loop, is of variable size.
The localisation of a signal is followed by an attempt
to fold the stretch containing and spreading around this
signal into a base-paired arm. Potential TCC and D arms
are thus identified.
Compatibility between TCC and D arms is established
from the initial alignment of sequences. A maximum (or
minimum) distance between two arms corresponds to the
maximum (or minimum) number of nucleotides connecting successive strands involved in distinct arms in the
alignment.
The probabilistic approach
Signals can be represented in different ways. In
tRNAscan, a probabilistic representation was chosen by
means of weight matrices. In the original alignment of
sequences, the frequency of nucleotide occurrence
between position 48 and position 62 is used to delimit
the TCC signal. Then, the composition of any 15-nucleotide segment can be compared with that of a
TCC signal. By analogy, the D signal corresponds to
unusual nucleotide frequencies between position 8 and
position 15.
For each compared segment, a similarity score,
depending on the corresponding weight matrix is
computed. Moreover, a fixed number of invariant
nucleotides (frequencies closest to 1) is imposed a priori.
For instance, the D signal contains three invariant bases:
T in position 8, G in position 10 and A in position 14. It
is minimally imposed for a potential D signal to contain
at least two of these three invariant bases.
Consequently, the identification of a signal relies on
two necessary conditions: a segment is selected as a
signal if it approximately matches the sequence of
invariant nucleotides and if the similarity score is above
a set threshold.
A comparable similarity-based selection of signals is
implemented in FAStRNA, with minor changes. Invariant bases are not fixed a priori, but deduced from the
weight matrix. Moreover, in the score calculation,
the logarithm of the frequency is used, which favours high
frequencies to the detriment of low ones. When matches with
invariant bases occur, a segment is selected upon the
similarity score S, as follows:
$$S = \frac{\sum_{i=1}^{n} \log f_{b,i}}{\sum_{i=1}^{n} \log f_{\max,i}}$$
where $n$ is the length of the signal, $f_{b,i}$ is the frequency
(in the weight matrix) of the nucleotide $b$ found in
position $i$ in the segment, and $f_{\max,i}$ is the highest
frequency in position $i$. As $\log 1 = 0$, invariant nucleotides
do not contribute to the score. A score of 1 corresponds
to a perfect match. Because every logarithm is zero or negative,
$S \ge 1$, and the higher the score, the worse the
match.
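As an illustration of the score S, the following sketch evaluates it for a toy weight matrix. The function name, the representation of the matrix as one dictionary per position, and the small floor applied to unseen nucleotides (needed because log 0 is undefined) are assumptions made for the example, not details of FAStRNA.

```python
from math import log


def similarity_score(segment, weight_matrix, floor=1e-3):
    """Ratio of summed log-frequencies: observed bases over most frequent bases.

    weight_matrix is a list of {nucleotide: frequency} dictionaries, one per
    position. The floor for unseen nucleotides is an assumption of this sketch.
    """
    observed = sum(log(max(column.get(base, 0.0), floor))
                   for base, column in zip(segment, weight_matrix))
    best = sum(log(max(column.values())) for column in weight_matrix)
    if best == 0:
        # Every position is invariant: only an exact match is meaningful.
        return 1.0 if observed == 0 else float("inf")
    return observed / best


# Toy three-position signal: an invariant T, then two variable positions.
toy_matrix = [{"T": 1.0}, {"A": 0.5, "G": 0.5}, {"G": 0.8, "A": 0.2}]
print(similarity_score("TAG", toy_matrix))  # most frequent base everywhere
print(similarity_score("TAA", toy_matrix))  # rarer base in the last position
```

The first call uses the most frequent nucleotide at every position and therefore gives the minimum score of 1; the second gives a larger value, reflecting a worse match.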
The pattern-matching approach
A signal can be represented as a class of patterns.
Given the alphabet N = {A, T, G, C}, each position in the
segment corresponds to a subset of N. Searching for a
signal is equivalent to a pattern-matching problem, with
possible errors. The initial alignment is used to define the
class of patterns characterising a signal. A nucleotide is
assumed to occur at a given position if it is found at least
28 times in the alignment of 546 sequences (i.e. 5%). For
instance, at position 48 only C and T satisfy this criterion.
The TCC and D signals remain 15 and 8 nucleotides long,
respectively.
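The 5% rule used to define the class of patterns can be sketched as follows. This is a minimal illustration on a toy alignment: the function name and the integer `min_percent` parameter are hypothetical, and gap or ambiguity symbols are simply ignored, which is an assumption.

```python
from collections import Counter


def position_classes(alignment, min_percent=5):
    """Return, for each alignment column, the set of nucleotides that occur
    in at least min_percent % of the sequences (28 of 546 for 5%)."""
    n_seq = len(alignment)
    # Integer ceiling of n_seq * min_percent / 100, avoiding float rounding.
    min_count = -(-n_seq * min_percent // 100)
    classes = []
    for column in zip(*alignment):
        counts = Counter(base for base in column if base in "ACGT")
        classes.append({base for base, c in counts.items() if c >= min_count})
    return classes


# Toy alignment of 40 two-nucleotide sequences: the threshold is 2 occurrences,
# so the single A in the first column is discarded.
toy_alignment = ["CG"] * 30 + ["TG"] * 9 + ["AG"]
print(position_classes(toy_alignment))
```

In this toy case the first position admits only C and T, which corresponds to the symbol Y, and the second is an invariant G.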
As an alternative to the similarity-based function
described above, a pattern-matching approach can be
used in FAStRNA. It identifies a TCC signal by matching
15-nucleotide segments to YTNNRGTTCRACYCY (R and
Y represent respectively purine and pyrimidine, and X̄
represents the absence of symbol X) with up to k errors†
and likewise a D signal by matching 8-nucleotide
segments to TRGYNNAR.
† The first five nucleotides do not seem to occur
randomly but a more definite pattern emerges
between G(6) and C(14).
Table 3. Weights associated with base-pairing

       A      G      C      T
A    −10    −10     −2      5
G    −10    −10      7      3
C     −2      7    −10    −10
T      5      3    −10     −2

When searching such a class of patterns with errors, the
naive way is to check all symbols in each and every
overlapping segment, resulting in a slow progression
along the scanned sequence. Various algorithms have
been developed for searching different sets of patterns,
and especially patterns with don’t care symbols. However,
most of them deal with exact string matching or are not
adapted to a four-symbol alphabet.
Consequently, to tackle the question of speed, a specific
class of algorithms was considered (Baeza-Yates &
Gonnet, 1992; Baeza-Yates & Perleberg, 1992; Wu &
Manber, 1992). The main properties of these algorithms
are: (1) Simplicity; they consist of a pattern preprocessing
step and a search step. Each of them is simple and
only few bitwise logical operations are used. (2) Speed;
the time-delay to process one character is bounded by
a constant depending only on the pattern length.
Therefore, these algorithms are linear and remain
efficient even if the size of the alphabet is small. (3)
Flexibility; they are flexible enough to allow searching for
a class of patterns, without damaging the programming
practicality.
They are all based on the same idea, which consists of
finding, at a given position in the sequence, all
approximate pattern prefixes ending at this position.
Speed is increased by representing the state of the search
as a bit number or an array, and by using the ability of
programming languages to handle bit words.
The Shift-Add algorithm (Baeza-Yates & Gonnet, 1992)
is one of these algorithms concerning pattern matching
with mismatches. It was used to search for TCC and D
signals. The main idea of this algorithm is to represent the
state of the search as a bit number (or a vector) and, at
each step, perform a few simple arithmetic and logical
operations.
Let $P = p_1 \cdots p_m$ be a pattern of length $m$ and $s = s_1 \cdots s_n$
be a sequence of length $n$ over a finite alphabet $N$. The
problem is to find in $s$ all occurrences of $P$ with at most
$k$ mismatches ($0 \le k \le m$). In other words, the distance
between two patterns of the same length will be defined
as the number of their mismatching characters (the
Hamming distance).
Given a current position $j$ in the sequence, let $S_j$ be the
state vector at this position. $S_j$ contains individual states
of the search between each prefix of $P$ and the
corresponding substring of $s$. Namely, for $1 \le i \le m$,
$S_j[i]$ is the number of mismatches between $p_1 \cdots p_i$ and
$s_{j-i+1} \cdots s_j$. $P$ matches at $j$ if and only if $S_j[m] < k + 1$.
When $s_{j+1}$ is read, the number of mismatches for each
prefix of $P$ needs to be completed. Values of Boolean
expressions $s_{j+1} = p_i$, for $1 \le i \le m$, can be computed
during a preprocessing step. For each character $a$ in $N$, a
vector $T_a$ of size $m$ is constructed such that, for $1 \le i \le m$:
$$T_a[i] = \begin{cases} 0 & \text{if } a = p_i \\ 1 & \text{otherwise} \end{cases}$$
It is sufficient to construct the T arrays only for characters
appearing in the pattern.
Finally, $S_{j+1}[i] = S_j[i-1] + T_{s_{j+1}}[i]$
Possible values of the state vector components are
$0, \ldots, m$. Thus, to represent each component,
$b = \lceil \log_2(m + 1) \rceil$ bits are required. However, since we
need only to compare the number of mismatches with $k$,
it is enough to represent values from 0 to $k$. In this case,
one more bit is needed for carrying over additions. The
improved algorithm uses $b = \lceil \log_2(k + 1) \rceil + 1$ bits. At
each position $j$ in the sequence, the overflow bits are
recorded in an overflow state $R_j$ and the overflow bits of
$S_j$ are reset.
The Shift-Add algorithm works in O(n) time, and, at
each step, three basic operations are performed: one shift,
one addition and one test to determine whether P
matches at position j.
In order to consider large patterns, one solution is to
use more than one word per number. Still, if the number of
words per number is small, the Shift-Add algorithm is a
good practical choice for searching a class of patterns
with mismatches.
For a word size $v = 32$, only one word for the D signal and two
words for the TCC signal are required.
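The search described in this section can be sketched in a few lines. The sketch below is not the FAStRNA source: the function name, the IUPAC table and the tuple output are hypothetical, and it uses the simpler variant with $\lceil \log_2(m+1) \rceil$ bits per counter, so that no overflow handling is needed. The $m$ mismatch counters are packed into one integer, and each character read costs one shift, one addition and one comparison. Python integers are unbounded, so the word-size limit discussed above does not arise here.

```python
from math import ceil, log2

# Nucleotide classes for the pattern symbols used in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}


def shift_add_search(pattern, sequence, k):
    """Return (start, mismatches) for every window of the sequence that
    matches the pattern class with at most k mismatches."""
    m = len(pattern)
    b = max(1, ceil(log2(m + 1)))          # bits per mismatch counter
    # Preprocessing: T[a] holds a 1 in field i when a is not allowed there.
    table = {a: sum((0 if a in IUPAC[p] else 1) << (i * b)
                    for i, p in enumerate(pattern))
             for a in "ACGT"}
    all_mismatch = sum(1 << (i * b) for i in range(m))  # unknown characters
    mask = (1 << (m * b)) - 1              # keeps exactly m fields
    limit = (k + 1) << ((m - 1) * b)       # match iff state < limit
    state, hits = 0, []
    for j, char in enumerate(sequence):
        # Shift every counter to the next prefix length, then add mismatches.
        state = ((state << b) + table.get(char, all_mismatch)) & mask
        # Windows shorter than the pattern are not reported.
        if j + 1 >= m and state < limit:
            hits.append((j - m + 1, state >> ((m - 1) * b)))
    return hits


# D signal pattern from the text, searched with no mismatch, then up to one.
print(shift_add_search("TRGYNNAR", "CCTAGCTCAGTT", 0))
print(shift_add_search("TRGYNNAR", "TAGCTCCG", 1))
```

Bounding the counters by $k$ instead of $m$, as in the improved algorithm, would shorten each field at the cost of handling the overflow bits explicitly.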
Other changes involving the definition of thresholds
and scoring functions were made. In the probabilistic
approach, matching a segment with invariant bases of the
TCC or D signal is considered as a string-matching
problem with don’t care symbols. Therefore, the Shift-Add
algorithm is useful in this case.
Figure 2. Flow chart of the algorithm.
In order to obtain $S_{j+1}$ from $S_j$ by simple arithmetic and
logical operations, vectors are considered as numbers
and represented in base $2^b$, where $b$ is the number of bits
needed to represent each vector component. Thus:
$$S_j = \sum_{i=1}^{m} S_j[i]\,2^{(i-1)b} \quad \text{and} \quad T_a = \sum_{i=1}^{m} T_a[i]\,2^{(i-1)b}$$
If representations do not exceed the word size $v$,
namely $mb \le v$, the transition from $S_j$ to $S_{j+1}$ is obtained
by the assignment:
$$S_{j+1} = (S_j \ll b) + T_{s_{j+1}}$$
where $\ll$ denotes the bitwise shift-left operation. $P$
matches at $j$ if and only if $S_j < (k + 1)\,2^{(m-1)b}$.
Structure prediction
In tRNAscan, the definition of the tRNA
secondary structure depends only on Watson-Crick
and (G, T) pairs. To address the question of flexibility
in defining arms, each of the possible ten pairs
(regardless of the orientation) is given a weight as
described by Lisacek et al. (1994). The values shown in
Table 3 reflect the stability of a base-pair and its
frequency in natural RNA helices. As a result, a
selection threshold for a potential arm is not simply a
minimal number of Watson-Crick or (G, T) pairs but
more accurately a minimal weight of successive
pairings. In practice, a score is computed by accumulating the weights of successive pairings, and a potential arm
is recognised if its score is above a fixed threshold.
Weak base-pairings at both ends of the arm are not
considered.
Table 4. Parameter and threshold values used in FAStRNA
Region (a) | Perfect match (b) | Threshold match (c)
TCC signal | At most 1 mismatch | Three mismatches
TCC arm | Base-pairing score >26 | Base-pairing score >10 and at least 3 base-pairing
D signal | No mismatch | Two mismatches
Aminoacyl arm | Base-pairing score >36 | Base-pairing score >18 and at least 4 base-pairing
D arm | Score of the 4 base-pairs >16 | Score of the 3 first base-pairs >11 (MAX), or score of the 3 first base-pairs >7 (MIN) and 4th base-pairing >0
Base 18, 19 and 21 | Base 18 = G, base 19 = G and base 21 = A | Other bases
Base 33 | T | Other base
Anticodon arm, without intron | Base-pairing score >19 | Base-pairing score >11
Anticodon arm, with intron | Base-pairing score >26 | Base-pairing score >17
(a) Regions of the tRNA-sequence chronologically analysed by the algorithm.
(b) Each time a condition is verified, the general score sg is incremented.
(c) Minimal conditions for accepting a region.
Let S be the segment TTTGTCCGA and S' be the
potential complementary segment CAGAACGAT:
S:  TTTGTCCGA
     ====o=
    TAGCAAGAC  :S'
According to Table 1 the corresponding cumulative sum is: 0 5 8 15 20 18 25, such that an arm is
defined as S(2)–S(7) paired with S'(3)–S'(8) with a score
of 25.
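The cumulative scoring of this example can be sketched as below. The weights are assumptions inferred solely from the cumulative sum quoted above (A-T 5, G-C 7, G-T 3, and -2 for any other pair); they are not taken from the published weight table. Resetting the running sum to zero when it is not positive is also an assumption, chosen because it reproduces the leading 0 and leaves out weak pairings at both ends.

```python
# Hypothetical weights, inferred from the worked example only.
WEIGHTS = {
    frozenset("GC"): 7,
    frozenset("AT"): 5,
    frozenset("GT"): 3,
}
MISMATCH = -2  # assumed weight of every other pair


def pair_weight(x, y):
    """Weight of the pair (x, y), regardless of orientation."""
    return WEIGHTS.get(frozenset((x, y)), MISMATCH)


def best_arm(s, s_prime):
    """Return (score, start, end): best cumulative score of successive
    pairings and the 1-based positions in s where that run starts and ends."""
    partner = s_prime[::-1]          # s[i] faces partner[i]
    best = (0, 0, 0)
    running, start = 0, 1
    for pos, (x, y) in enumerate(zip(s, partner), start=1):
        running += pair_weight(x, y)
        if running <= 0:             # weak start: restart after this position
            running, start = 0, pos + 1
        elif running > best[0]:      # new best end of arm
            best = (running, start, pos)
    return best


arm = best_arm("TTTGTCCGA", "CAGAACGAT")
```

Under these assumptions the call returns a score of 25 for positions 2 to 7 of S, which corresponds to the arm S(2)–S(7) described in the example.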
Another small change was introduced in assessing the
quality of base-pairing in the D arm. Minimally, the D
arm is defined as the coupling of positions 10 and 25, 11
and 24, and 12 and 23. If the cumulative sum of such
pairings is below a MIN value, the arm cannot be formed
and if it is above a MAX value, the arm is formed. When
it is between MIN and MAX and the weight of the pair
defined by position 13 and 22 is positive, then the arm is
formed.
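The decision rule for the D arm can be written as a small predicate. This is a sketch: the function and parameter names are hypothetical, and the default MIN and MAX values are those listed for the D arm in Table 4 (7 and 11), with strict inequalities as in that table.

```python
def d_arm_is_formed(core_score, fourth_pair_weight, min_score=7, max_score=11):
    """core_score: cumulative weight of pairs 10-25, 11-24 and 12-23.
    fourth_pair_weight: weight of the pair formed by positions 13 and 22."""
    if core_score > max_score:
        return True                  # strong enough on its own
    if core_score > min_score and fourth_pair_weight > 0:
        return True                  # intermediate score rescued by the 4th pair
    return False                     # too weak, or not supported by the 4th pair


strong = d_arm_is_formed(12, -2)
rescued = d_arm_is_formed(9, 3)
rejected = d_arm_is_formed(9, 0)
```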
Improved algorithm and whole-sequence scanning
The improved algorithm is completed in eight steps
(Figure 2). The first two are dedicated to identifying a
potential TCC region. It is assumed that seven
nucleotides form a TCC loop and five base-pairs form a
TCC arm.
The following three steps are focused on identifying
potential D regions. D regions are searched in areas
where distance requirements with the previously
determined TCC are satisfied. The length of connecting sequences between potential TCC and D is
constrained by a possible intronic insertion, occurring
between positions 37 and 38 (intron length varies
between eight and 60 nucleotides) as well as the presence of the variable loop (minimally comprising
three nucleotides). Moreover, it must accommodate
the formation of an aminoacyl arm with at most
seven base-pairs. It is assumed that 6† to 11 nucleotides form a D loop and at most four base-pairs form
a D arm.
A global similarity score, sg, is computed at each step
on the basis of compliance to a minimal set of
characteristics common to most tRNAs. The sixth step
consists of discarding all partial solutions with a low sg
score, that is, lacking too many characteristics common to
most tRNAs.
The seventh step corresponds to looking for the
anticodon arm, taking into account the potential
presence of an intron, as mentioned earlier. It is
assumed that in an intronless sequence, seven
nucleotides form an anticodon loop and five base-pairs
form an anticodon arm.
Finally, all solutions with a newly incremented but low
sg score are discarded in the eighth step. All parameter
and threshold values are given in Table 4.
Throughout this stepwise search, an unsuccessful
outcome at any step results in going back to the last loop
preceding that step and moving forward to the next window to
identify a new potential region.
When the direct sequence strand has been scanned
completely, the algorithm is applied to the complementary strand.
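Scanning both strands can be illustrated as follows. The sketch assumes a hypothetical `scan` callable that returns 1-based inclusive (start, end) hits on the strand it is given; it is not the FAStRNA search itself. Hits found on the complementary strand are mapped back to coordinates on the direct strand.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_both_strands(seq, scan):
    """Apply scan to the direct strand, then to the complementary strand.
    Returns (strand, start, end) with coordinates on the direct strand."""
    n = len(seq)
    hits = [("+", s, e) for s, e in scan(seq)]
    for s, e in scan(reverse_complement(seq)):
        hits.append(("-", n - e + 1, n - s + 1))
    return hits


def toy_scan(seq, motif="GAT"):
    """Stand-in for the real search: exact occurrences of a short motif."""
    k = len(motif)
    return [(i + 1, i + k) for i in range(len(seq) - k + 1) if seq[i:i + k] == motif]


hits = scan_both_strands("AAGATCC", toy_scan)
```

In this toy case the motif is found at positions 3 to 5 of the direct strand and, on the complementary strand, at a site that covers positions 4 to 6 of the direct strand.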
† Five tRNA genes were missed by tRNAscan as this
parameter was set to 7 by Fichant & Burks (1991).
Acknowledgements
We are grateful to M. Crochemore and A. Hénaut for
helpful comments.
This program is available at the following URL:
http://www-igm.univ-mlv.fr/nmabrouk/.
References
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. &
Lipman, D. J. (1990). A basic local alignment search
tool. J. Mol. Biol. 215, 403–410.
Baeza-Yates, R. & Gonnet, G. H. (1992). A new approach
to text searching. Commun. ACM, 35(10), 74–82.
Baeza-Yates, R. & Perleberg, C. H. (1992). Fast and
practical approximate string matching. In Lecture
Notes in Computer Science, vol. 644, pp. 185–191,
Springer-Verlag.
Chiu, D. K. Y. & Kolodziejczak, T. (1991). Inferring
consensus structure from nucleic acid sequences.
Comput. Appl. Biosci. 7(3), 347–352.
Dandekar, T. & Hentze, M. W. (1995). Finding the hairpin
in the haystack: searching for RNA motifs. Trends
Genet. 11(2), 45–50.
d’Aubenton Carafa, Y., Brody, E. & Thermes, C. (1990).
Prediction of rho-independent Escherichia coli transcription terminators. J. Mol. Biol. 216, 835–858.
Eddy, S. R. & Durbin, R. (1994). RNA sequence analysis
using covariance models. Nucl. Acids Res. 22(11),
2079–2088.
Fichant, G. A. & Burks, C. (1991). Identifying potential
tRNA genes in genomic DNA sequences. J. Mol. Biol.
220, 659–671.
Gautheret, D., Major, F. & Cedergren, R. (1990). Pattern
searching/alignment with RNA primary and secondary structures: an effective descriptor for tRNA.
Comput. Appl. Biosci. 6(4), 325–331.
Haselman, T., Chappelear, J. E. & Fox, G. E. (1988).
Fidelity of secondary and tertiary interactions in
tRNA. Nucl. Acids Res. 16, 5673–5684.
Lefèvre, C. & Ikeda, J. E. (1984). A fast word search
algorithm for the representation of sequence
similarity in genomic DNA. Nucl. Acids Res. 22(3),
404–411.
Lisacek, F., Diaz, Y. & Michel, F. (1994). Automatic
identification of group I intron cores in genomic
DNA sequences. J. Mol. Biol. 235, 1206–1217.
Mehldau, G. & Myers, G. (1993). A system for pattern
matching applications on biosequences. Comput.
Appl. Biosci. 9(3), 299–314.
Pavesi, A., Conterio, F., Bolchi, A., Dieci, G. & Ottonello,
S. (1994). Identification of new eukaryotic tRNA
genes in genomic DNA databases by a multistep
weight matrix analysis of transcriptional control
regions. Nucl. Acids Res. 22(7), 1247–1256.
Pearson, W. R. & Lipman, D. J. (1988). Improved tools for
biological sequence comparison. Proc. Natl Acad. Sci.
USA, 85, 2444–2448.
Prunella, N., Liuni, S., Attimonelli, M. & Pesole, G. (1993).
FASTPAT: a fast and efficient algorithm for string
searching in DNA sequences. Comput. Appl. Biosci.
9(5), 541–545.
Searls, D. (1993). The computational linguistics of
biological sequences. In Artificial Intelligence and
Molecular Biology (Hunter, L., ed.), pp. 47–120, AAAI
Press.
Searls, D. & Dong, S. (1992). A syntactic pattern
recognition system for DNA sequences. In The Second
International Conference on Bioinformatics, Supercomputing
and Complex Genome Analysis (Lim, H. A.,
Fickett, J. W., Cantor, C. R. & Robbins, R. J., eds),
pp. 89–101. World Scientific, Teaneck, NJ.
Smith, T. F. & Waterman, M. S. (1981). Identification of
common molecular subsequences. J. Mol. Biol. 147,
195–197.
Sprinzl, M., Steegborn, C., Huebel, F. & Steinberg, S.
(1992). Compilation of tRNA sequences and sequences of tRNA genes. Nucl. Acids Res. 24,
68–72.
Staden, R. (1988). Methods to define and locate patterns of
motifs in sequences. Comput. Appl. Biosci. 4, 53–60.
Stephen, G. A. (1994). String Search Algorithms. Lecture
Notes Series on Computing, vol. 3, World Scientific,
Singapore.
Wozniak, P. & Makalowski, W. (1990). Searching for tRNA
genes in DNA sequences: an IBM microcomputer
program. Comput. Appl. Biosci. 6, 49–50.
Wu, S. & Manber, U. (1992). Fast text searching allowing
errors. Commun. ACM, 35(10), 83–91.
Edited by J. Karn
(Received 30 July 1996; accepted 4 September 1996)