CLEANUP: a fast computer program for

Vol. 12 no. 1 1996
Pages 1-8
CLEANUP: a fast computer program for
removing redundancies from nucleotide
sequence databases
Giorgio Grillo, Marcella Attimonelli1, Sabino Liuni and Graziano
Pesole12
Abstract
A key concept in comparing sequence collections is the issue
of redundancy. The production of sequence collections free
from redundancy is undoubtedly very useful, both in
performing statistical analyses and accelerating extensive
database searching on nucleotide sequences. Indeed, publicly
available databases contain multiple entries of identical or
almost identical sequences. Performing statistical analysis
on such biased data makes the risk of assigning high
significance to non-significant patterns very high. In order to
carry out unbiased statistical analysis as well as more
efficient database searching it is thus necessary to analyse
sequence data that have been purged of redundancy. Given
that a unambiguous definition of redundancy is impracticable for biological sequence data, in the present program a
quantitative description of redundancy will be used, based on
the measure of sequence similarity. A sequence is considered
redundant if it shows a degree of similarity and overlapping
with a longer sequence in the database greater than a
threshold fixed by the user.
In this paper we present a new algorithm based on an
'approximate string matching' procedure, which is able to
determine the overall degree of similarity between each pair
of sequences contained in a nucleotide sequence database
and to generate automatically nucleotide sequence collections free from redundancies.
Introduction
The way data are stored in biosequence databases strongly
affects the scientific approach to be used and the quality of
science that can be done with the database. In this context,
a key concept is redundancy even if biological data are too
complex to fit a simple definition of redundancy.
The sequence databases now available contain a
considerable amount of redundancy as they may contain
multiple copies of the same sequence. Redundant entries
can be originated by multiple submission of the same
Cenlro di Studio sui Milocondri e Metabolismo Energetico, CNR and
'Dipartimento di Biochimica e Biologia Molecolare. Universita di Bari, Italy
2
To whom correspondence should be addressed. E-mail:[email protected]
) Oxford University Press
sequence or by multiple sequencing of the same gene in the
same organisms. Even if in the latter case redundancy may
have biological significance in studying polymorphic
patterns, it can still prevent a given sequence collection
being used for some statistical analyses, e.g. methods to
identify sequence patterns peculiar to the sequence
collection under investigation, or can greatly slow down
extensive database searching.
As a qualitative definition of sequence redundancy is
impracticable, we shall apply a quantitative description
based on the measure of sequence similarity, depending on
the particular application we wish to carry out. According
to the above considerations the identification of redundant sequences in a given sequence collection can be
simply achieved by performing pairwise alignments
between all the sequences in the database. However, this
approach is impracticable for large sequence collections as
it would require a considerable amount of computational
time and is not user-friendly.
Here we present a new algorithm, based on an
'approximate string matching' procedure, which is able
to determine the overall degree of similarity between each
pair of sequences contained in a nucleotide sequence
database without performing the time-consuming task of
pairwise global best-alignments. The algorithm generates
a new dataset from the redundant nucleotide sequence
collection purified from redundancy according to cutoff
parameters set by the user.
This method is very fast and sensitive and allows the
user to generate automatically and in a user-friendly way
sequence databases that are free of redundancies.
Systems and methods
The algorithm presented here has been implemented in
CLEANUP, a computer program written in C and
running on VMS and UNIX operating systems.
CLEANUP allows Pearson/FastA and GCG input formats for nucleotide sequences and automatically produces
a new dataset purged of redundant sequences in both of
these formats. It is available free of charge by anonymous
FTP (IP address: area.ba.cnr.it, directory: pub/software/
cleanup).
C.Crillo el al.
c)
b)
a)
1
1
1
e)
d)
II
1
--V
\
k
f)
1 1
g)
1
1
Fig. 1. Similarity patterns (shaded areas) between primary (upper) and secondary (lower) sequences taken into account by the CLEANUP algorithm.
Cases (f )-(g) refer respectively to similar domains found in unsimilar surrounding regions and to possible domain shuffling. Arrows indicate 'guide
patterns' used for hooking similar segments in the compared sequences.
The algorithm has been tested on two sequence datasets:
(i) 362 nucleotide sequences extracted from the EMBL
database (release 41) through ACNUC retrieval program
(Gouy el al., 1985) using 'SP = Homo sapiens and O =
mitochondrion' selection criteria which extract all those
entries from human with report 'Mitochondrion' in the
OG field of EMBL entry line; (ii) 2400 nucleotide
sequences (5 523 925 bases) defined by the 'in:dro*'
GCG logical name which selects all Drosophila entries of
the invertebrate division of the Genbank collection
(release 90).
Algorithm
The computational model adopted to recognize redundancies (totally or partially overlapping sequences) within
collections of nucleotide sequences is based on the
observation that the occurrence probability of oligos
with length > 8 nucleotides in two or more sequences is
rather rare in the case of evolutionarily unrelated
sequences. The occurrence of a common oligo will be
thus considered as a clue of possible redundancy that will
be further tested by performing a progressive search of
matching oligos through the full length of both sequences
under examination.
The searching process needs to perform pairwise
comparisons of each sequence with all the others. We
define the first sequence in the comparison as the 'primary
sequence" and all the others as 'secondary sequences'.
The algorithm is structured in the following steps:
1. Sorting of the sequences according to decreasing
length.
2. Numeric coding of the sequences and generation of the
index structure.
3. Generation of the primary sequences data structure
containing the pattern positions and compositions.
4. Similarity searching between primary and secondary
sequences.
5. Creation of a sequence collection free from redundancy according to user-selected parameters.
As the sequences are sorted according to decreasing
length, the primary sequence will certainly have a length
equal or greater than any secondary sequence. The
searching phase will be greatly improved by this sorting
both in the case of similar sequences partially or totally
overlapping and in the case of sequences sharing similar
segments.
In the first case only three possibilities can occur: (i)
total coincidence, (ii) partial overlapping on the right or
(iii) partial overlapping on the left (Figure la-c). The
finding of a match between the rightmost or leftmost
element of the secondary sequence with any element of the
primary one will provide a remarkable clue of high
sequence similarity and will thus prime the overall
searching procedure. In the second case (Figure ld-g),
the sorting will minimize the number of comparisons to be
carried out between the primary and the secondary
sequences.
Figure 2 schematically shows how the numeric coding
and the index structure are generated. Given a sequence S,
L nucleotides long, extract all possible L- w + 1 iv-mers.
From each »v-mer P = S\S2 . . . sw (sj = A, C, G, T) all
possible /t-tuples (k < w), ph are generated, formed by
CLEANUP: removing redundancies from nucleotide sequences
guide
pattern
N(p)
Pi
i ©
f A'
'©
©
©
(A)
©
©
| ACGTACG.
k-tuples
©
coded
k-tuples
©
w-mer
i A )
(A)
L-,0
L©
CD
©
primary
Sequences
secondary
Sequences
0 12
1
3
1
4 .5
1819
4-2 4 -
1 1 II
1
C i 1 IllOlllllI
position composition
II
1 i n H1
•
(.
)i
(
)l
II
I
1
!
position composition
. i
position composition
I
( • • )l
1
position composition
Data Structure
Fig. 2. Schematic description of how the data structure is generated for each primary sequence. To each data-structure element, corresponding to a
specific fc-tuple, the information is linked through pointers on the positions and compositions of all fc-tuples extracted from the primary sequence. It may
happen that the same Ar-tuple occurs more than once or that it is not found at all. The finding of common A-tuples between primary and secondary
sequences (shaded box) define the guide pattern used for the hooking of similar segments in the compared sequences.
concatenating nucleotides, even non-contiguous, extending over the string of interest.
where /|,. . ., ih, . . ., ik is one of the possible combinations
of the w nucleotides of P, with i,,<ih+[
and no
repetitions. The coding of/?, into a unique integer N(pt)
can be obtained with the following:
h =
'
where numO,) is 0, 1, 2, 3 for s,•= A, C, G, T respectively,
G.Grillo el at.
and pos(j,v,) gives the position of sih within the zth pattern
which is in the range 1, . . . , k.
an 'approximate string matching' method (Baeza-Yates
and Perleberg, 1992). The searching algorithm compares
each primary sequence to all secondary ones looking for
sequence similarities over one (Figure la-c) or more
Example
segments (Figure ld-g) of the compared sequences.
From the octamer ACGTACGT, eight eptamers can be
A crucial step for assessing sequence similarity between
generated: .CGTACGT, A.GTACGT, AC.TACGT,
compared sequences is the correct hooking of similar
ACG.ACGT, ACGT.CGT, ACGTA.GT, ACGTAC.T,
segments in the compared sequences. If no suitable
ACGTACG. with, for example, N(.CGTACGT) =
hooking point is found for a pair of sequences, the
searching phase is completely skipped, which saves a
1.46 + 2.45 + 3.4 4 + 0.4 3 + 1.42+ 2.41 + 3.4° = 6939.
considerable amount of computation time. These partiThus, each sequence can be translated into (L - w + 1)
cular patterns (w-mers) involved in hooking are defined as
(w\jk\ (w-k)\) numbers, where L is the sequence length.
'guide patterns'. For a single overlapping segment, the
The basic idea of our computational model is the
guide pattern coincides with the leftmost or rightmost
generation of a set of highly descriptive indices for each
pattern of the secondary sequence. In multiple oversequence position (Califano and Rigoutsos, 1995). Indeed,
lapping segments as many guide patterns are found as
to each fc-tuple (whose numeric code corresponds to a
many similar segments are common in the compared
unique record of the data structure) the information is
sequences (see Figure 1).
linked, through pointers, on the position of the relevant wmer within sequence S and its composition (i.e. the
The data structure previously described allows the
particular combination of nucleotides extracted from the
detection of a similarity match between two specific >t>basic vv-mer). The data structure generated as a vector of
mers, obtained by comparing their extracted Ar-tuples, of
size 4k with a space value of 0..4*—1 is dynamically
the primary and secondary sequence, and this will prime
updated for each primary sequence, as this is compared to
further similarity searching even if they have up to
the secondary ones. It allows us to detect very efficiently,
<f> = w - k differences (see the example in Figure 2).
both in terms of computational speed and of memory
From information about the &-tuple composition we can
allocation, any similarity match between sequence strings
also infer the kind of difference, if any, between the
perturbated by up to w-k differences (i.e. mutations,
corresponding M'-mers (i.e. mismatch and/or insertioninsertions, deletions).
deletion and their position). The searching phase proceeds, from the guide pattern, from left to right or from
In the example shown in Figure 2 the similarity match
right to left, depending on whether the left or the right
between two octamers differing at a single position can be
guide pattern is found to match within the primary
detected if their component heptamers are compared, e.g.
sequence, comparing subsequent M'-mers with a step of w.
the similarity match between ACGTACGT and
In the case of a pattern mismatch, a similarity function is
ACCTACGT is determined by the common heptamer
used to evaluate the degree of similarity within a window
ACTACGT. In addition, the information about the
of a given size starting from the observed mismatch. This
heptamer composition (in this example 11011111) allows
function calculates a score for the window local alignment
us to detect the type and position of the difference (i.e.
based on sequence similarity. Only if the similarity
mutation, indel) between the compared octamers.
function gets a score value above a cutoff fixed by the
A suitable choice of k is fundamental to ensure a good
user does the sequence comparison proceed. Conversely,
compromise between sensibility, selectivity and efficiency
if the similarity score is below the cutoff, the algorithm
of the computational model. In our case it is advisable to
looks for a new segment or jumps to a new sequence
have k long enough for the occurrence probability of the
comparison.
^-tuples to be sufficiently low to allow selective similarity
searches. In our experience w = 8 and k = 7 realize an
A continuously updated estimate of regional similarity
efficient compromise between selectivity and sensibility
is computed in the searching phase as the comparison goes
and computation time. The quantity w - k is defined here
on. This method avoids the more complex and timeas the precision factor <f>, and sets the maximum number of consuming task of global alignment, e.g. in our case where
differences (mismatches, indels) allowed between two
only identical or nearly identical sequences need to be
matching segments. The precision factor also determines
detected, only those sequence segments with a potentially
the number of indices (coded ^-tuples) generated for each
high sequence similarity are compared.
sequence position and is defined as the index density
The search algorithm has been defined as 'approximate
(Califano and Rigoutsos, 1995).
string matching' because it allows the recognition, based
The searching phase, aimed at evaluating the redunon the pattern indexing approach, of the occurrence of
dancy between pairs of sequences, is carried out by using
small variability regions (mismatches or gaps) between
CLEANUP: removing redundancies from nucleotide sequences
primary and secondary sequences. However, the searching
phase will produce a very reliable overall alignment if any
significant sequence similarity is detected.
For all pairs of sequences a degree of relatedeness in
terms of % overlapping region (where % refers to total
length of the secondary sequence) and % similarity (i.e.
number of identical nucleotides over the complete overlapping region length) will be determined. Then a 'clean'
sequence database will be automatically generated that
will contain only non-redundant entries where those
secondary sequences are considered as redundant when
they display an overall degree of similarity and overlap
above the threshold fixed by the user. In the case of
partial sequence overlapping (cases b - g of Figure 1) the
secondary sequence will be either removed or left in the
database depending on a specific parameter previously set
by the user.
Results and discussion
The CLEANUP algorithm has been tested initially on a
dataset of 362 sequences accounting for all the human
mitochondrial DNA sequences available in release 41 of
the EMBL database. This dataset provides a very simple
check of the cleaning accurateness of the algorithm as we
can expect that after cleaning only a single sequence
corresponding to the complete human genome is retained
and all the others are erased. Table I reports part of the
output produced by applying CLEANUP to the human
mtDNA collection. CLEANUP automatically generates a
cleaned sequence collection by erasing all those sequences
that are found to show a degree of similarity and
overlapping with a longer sequence above the user fixed
threshold. In our test application, 20 sequences have been
retained out of the 362 instead of the expected single one
corresponding to the complete human mtDNA genome.
Table II shows the list of sequence entries left after the
cleaning process carried out by CLEANUP. It is striking
to note that most of the non-redundant sequences are in
fact nuclear encoded. This means that the annotators/
submitters have mistakenly considered the sequences as
encoded by the mtDNA, whereas they were actually
nuclear encoded and expressed in the mitochondrion.
Indeed, it is not so surprising to find errors in the
nucleotide sequence database. Quite remarkably, only five
mitochondrial-encoded sequences have been found not to
show a sufficiently high degree of similarity with the
complete mtDNA human genome. Three of them (e.g.
S43671, S43672, S43673) are sequence fragments from
deletion boundaries of mtDNA from individuals with
defects of the mitochondrial function and do not score
above the cutoff because one of the sequence segments is
shorther than the minimum size allowed by CLEANUP
for spaced segments (cases d-f in Figure 1). Entry S56589
has ~350 nucleotides displaying a high sequence similarity
with the complete human mtDNA genome, with the
remaining 150 nucleotides not matching any nucleotide
sequence contained in the database. The latter entry,
S70096, represents by far the most amazing case. The
Genbank annotator who created this entry from the
original article (Bodenteich et at., 1991) forgot that
sequencing gels should be read from bottom to top and
not vice versa when reading the nucleotide sequence
written on the side of a sequencing gel. Indeed, the
inverted sequence is identical to a specific region of the
human mtDNA.
It can nevertheless be stated that the CLEANUP
algorithm has proved to be efficient in this test. Indeed,
it has produced the expected results; the odd items
retained have been in fact due to errors in the database
entry or to the similarity threshold fixed by the user.
Figure 3 shows a CLEANUP session on a dataset of
2400 sequences (5 523 925 nt) comprising all Drosophila
entries in the Invertebrate division of the Genbank
collection (release 90). Four hundred out of the 2400
sequences (420 707 nucleotides) were removed by
CLEANUP as both their degree of sequence similarity
and overlap with other sequences present in the dataset
resulted in a similarity value above the fixed threshold of
95%.
The CLEANUP pairwise alignment procedure has also
proven to be more efficient than usual alignment programs
such as GAP or BESTFIT (GCG 1993) when the
compared sequences are very long and/or have very
different lengths. For example, in the attempt to align
two complete human mitochondrial genomes with a
different gene order, only CLEANUP succeeded in
determining the correct alignment (data not shown).
CLEANUP is also very fast. The above described
applications both on the 362 sequences of the mitochondrial DNA dataset involving 65 341 pairwise sequence
comparisons and on the 2400 Drosophila sequences
involving 2 878 800 pairwise sequence comparisons took
respectively 1.86 and 158.8 s of CPU time on a DEC 2100
4/275 OSF/1 machine.
The production of a cleaned sequence database will be
undoubtedly very useful in performing statistical analyses
of nucleotide sequences as well as to speed up database
searches.
Very few attempts to provide non-redundant nucleotide
databases have been made, 'nr'—a database maintained
by the NCBI as a target for BLAST (Altschul et al., 1990)
search—is a composite of Genbank, Genbank updates
and EMBL updates in which entries with absolutely
identical sequences have been merged. Indeed, if multiple
gene copies from the same organism have been stored in
G.Grillo et ai.
Table 1. Output produced by applying C L E A N U P to a sample of 362 sequences extracted from release 41 of the E M B L database
C L E A N U P - M A I N SEARCH FILE
Sequence 1
Sequence 2
MIHSCG
HSCOXII
HSMTALT1
HSMTNA02
HSMTNA03
HSMTNA04
HSMTNA05
HSMTNA06
HSMTNA07
HSMTNA08
HSMTNA09
HUMMTALTO
MIHSOI
MIHSAIA
MIHSAIAA
MIHSAIAB
MIHSAIB
MIHSAIC
MIHSAID
MIHSAIE
MIHSAIF
MIHSAIG
MIHSAIH
MIHSAII
MIHSAIJ
MIHSAIK
MIHSAIL
MIHSAIM
MIHSAIN
MIHSAIO
MIHSAIP
MIHSAIQ
MIHSAIR
MIHSAIS
MIHSAIT
MIHSAIU
MIHSAIV
MIHSAIW
MIHSAIX
MIHSAIY
MIHSALT01
MIHSALT02
MIHSALT03
MIHSALT04
MIHSALT05
MIHSALT06
MIHSALT07
MIHSALT08
MIHSALT10
LI
16569
L2
708
360
360
360
360
360
360
360
360
360
360
608
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
PI
P2
7584
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
324
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16024
16287
16024
1
OSl
OS2
%Ident.
%Overl.
709
360
360
360
360
360
360
360
360
360
360
610
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
708
360
360
360
360
360
360
360
360
360
360
608
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
360
318
360
360
360
360
360
360
360
262
97
360
99.86
100.00
98.33
98.06
98.89
98.33
98.06
97.78
98.61
98.89
99.72
99.51
98.89
99.44
99.17
99.17
99.44
99.17
98.06
98.61
98.06
98.33
98.33
98.33
98.61
98.89
98.61
98.33
98.06
98.89
98.89
98.89
98.61
98.61
99.17
99.17
98.89
98.89
98.43
98.61
98.89
99.17
98.61
98.61
99.17
98.89
98.47
98.97
98.61
100.00
100.00
360
360
360
318
1
360
360
1
360
360
1
1
1
1
264
1
360
360
360
262
97
360
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00 (c)
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
88.33
100.00
100.00
100.00
100.00
100.00
100.00
100.00
72.78
26.94
100.00
LI, L2? length of primary (LI) and secondary (L2) sequence; PI, P2, starting position of the similarity region in the primary (PI) and secondary (P2)
sequence; OSl, OS2, length of the overlapping segment in the primary (OSl) and secondary (OS2) sequence, (c) similarity match found on the reverse/
complementary strand of the secondary sequence.
the database it becomes very difficult to look for weaker
sequence relationships.
At the NCBI repository a specialized database of nonredundant functionally equivalent sequences (NRFES) is
maintained (Konopka, 1993). Indeed, suitable statistical
analyses carried out on sequence collections of functionally equivalent sequences could be very useful for assessing
function-associated pattern dictionaries which also might
CLEANUP: removing redundancies from nucleotide sequences
Table II. List of sequence entries left in the cleaned dataset after applying CLEANUP to the 362 mtDNA sequence dataset extracted from release 41 of
the EMBL database
Entry
Description
Genome
MIHSCG(JO4I5)
HSDF1FO5(X63423)
HSPSRNAB(L06218)
MIHSALD(M63967)
MIHSAPT1 (D1656I)
MIHSB(L11932)
MIHSFPS(D30648)
MIHSHTPA(D16480)
MIHSHTPB(D16481)
MIHSTF2(X76338)
M1HSTG(X76317)
MIHSTR3 (X76339)
MIHSTR4(X76340)
MIHSUMPS1 (L062I9)
MIHSUMPSR(L06217)
S43671
S43672
S43673
S55589
S70096
mtDNA complete genome
Fl FO ATP-synthase 5-subunit
16S ribosomal RNA pseudogene
aldehyde dehydrogenase
ATP synthase 7 subunit
serine hydroxymethyltransferase
complex II (succinate-ubiquinone oxidorcductase)
3-hydroxyacyl-CoA dehydrogenase
3-ketoacyl-CoA thiolase
transferrin partial gene
transferrin partial gene
transferrin partial gene
transferrin partial gene
mitochondrial I6S ribosomal RNA pseudogene
mitochondrial 16S ribosomal RNA pseudogene
mtDNA segment with deletion (10656.. 10676 + 11920.. 11925)
mtDNA segment with deletion (8459..8467+ 13445.. 13477)
mtDNA segment with deletion (10180.. 10199+ 13761.. 13770)
mtDNA segment with deletion
mtDNA glycine tRNA mutant
mitochondrial
nuclear
nuclear
nuclear
nuclear
nuclear
nuclear
nuclear
nuclear
nuclear
nuclear
nuclear
nuclear
nuclear
nuclear
mitochondral
mitochondral
mitochondral
?
mitocondrial
The overlapping segment positions, with respect to the mtDNA complete genome numbering, are reported in brackets for mtDNA segments with
deletions.
provide an excellent tool for the classification of the
anonymous sequences now produced in large amounts in
megasequencing projects.
NRSub, a non-redundant database for the Bacillus
subtilis genome where duplicate entries identified by
% Cleanup
-sim=95
BLAST similarity searches are removed, has recently
been reported (Perriere et al., 1994).
The construction of specialized databases such as those
described above could take great advantage of a fast,
automatic procedure to remove redundancy.
-overlap=95
Cleanup generates a non redundant sequence data library from any set of
sequences.
Cleanup of what sequences ?
inrdro*
What should I call the output searching file (* in_dro.s *) ?
What should I call the file listing erased sequences (* in_dro.e *) ?
What should I call the file listing retained sequences (* in_dro.nr *) ?
Sequences
% Overlapping for cleaning
Sequence matches
= 2400
= 95.00
= 424
Erased sequences
% Erased sequences
= 400
=16.67
Nucleotides
% Similarity for cleaning
Searching C.P.U. time
Erased nucleotides
% Erased nucleotides
= 5523925
= 95.00
= 158.84
420707
= 7.62
Fig. 3. Sample run of a CLEANUP session on a dataset of 2400 Drosophila entries. User inputs are in bold typed. CLEANUP application generates a file
listing all non-redundant sequence entry names which can be used as input with other programs of the GCG package as well as with retrieval programs
such as SRS (Etzold and Argos, 1993) or ACNUC (Gouy el al., 1985). It optionally also generates a sequence data file in Pearson/FASTA format.
G.Grillo et at.
To give an idea of the many possible uses of the
CLEANUP algorithm we have carried out a statistical
analysis aimed at the determination of the possible
nucleotide context around the AUG initiator codon of
protein-coding genes. A consensus signal describing the
favourable context for the AUG initiator codon has been
previously described by Kozak (1989) based on an analysis
carried out on a dataset of 699 vertebrate genes. In order
to carry out such an analysis on a larger dataset of human
genes we extracted the sequence regions spanning from
position - 2 0 to +20 with respect to the initiator AUG
codon from human genes coding for proteins. This
sequence extraction, carried out by using the retrieval
program ACNUC on release 88 of the Genbank collection, obtained 8004 sequences. In order to avoid the bias
produced by redundancy we need to remove all duplicated
sequence entries. Indeed, oligonucleotides present in
redundant entries would result to be overrepresented
because of the presence of multiple copies of the same
sequence and not because they are genuine biological
signals.
CLEANUP is particularly suited to this task as it
automatically generates a cleaned collection, free from
repetitions. In this particular case, by using the
CLEANUP program to remove 100% identical sequences
we have obtained a cleaned dataset of 5184 sequences,
thus showing that the original dataset contained > 35% of
redundancy.
CLEANUP is thus invaluable for constructing specific
specialized collections, rather than for the use on
repository databases such as EMBL/Genbank/DDBJ.
Indeed, biological data are too complex to fit a simple
definition of redundancy, and repository databases do not
attempt to be non-redundant, but rather sacrifice this goal
for the sake of completeness. Of course, the sequences of
the same gene from two individuals belonging to the same
species cannot be considered biologically redundant (i.e.
non-informative) even if their primary sequences are
perfectly identical; in fact, these sequences give us useful
information on the genetic variability of the population at
a given locus. In the same way the presence, in some
genomes, of several copies of repetitive sequences such as
SINES or LINES elements as well as prokaryotic rRNA
operons, has a biological relevance that should be taken
into account. Indeed, a peculiar feature of many genomes
is that they are highly biased in favour of some repetitive
sequence elements.
Given that a 'qualitative' definition of redundancy is
impracticable, the present algorithm allows any database
user/provider to extract from nucleotide databases specific
non-redundant collections based on 'quantitative' parameters, i.e. below a fixed threshold of sequence similarity,
depending on the particular application. To sum up, the
application of CLEANUP will be thus an invaluable and
efficient tool for all database users/developers to remove
or classify redundant information.
Acknowledgements
We thank C.Saccone for helpful comments. This work was partially
financed by MURST, Italy, Progetto Finalizzato Ingegneria Genetica,
CNR, Italy and by EMBnet Research and Development Project
Committee.
References
Altschul,S.F., Gish.W., Miller,W., Myers,E.W. and Lipman,D.J. (1990)
Basic local alignment search tool. J. Mot. Biol., 215, 403-410.
Baeza-Yates.R.A. and Perleberg,C.H. (1992) Fast and practical approximate string matching. In Apostolico,A., Crochemore,M., Galil, Z. and
Manber,U. (eds), Third Annual Symposium on Combinatorial Pattern
Matching, Tucson, AZ. Springer-Verlag, Berlin, pp. 185-192.
Bodenteich,A., Mitchell.L.G. and Merril.C.R. (1991) A lifetime of retinal
light exposure does not appear to increase mitochondrial mutations.
Gene, 108, 305-309.
Califano,A. and Rigoutsos,I. (1996) FLASH: a fast look-up algorithm
for string homology. Comput. Applic. Biosci., in press.
Etzold,T. and Argos,P. (1993) SRS an indexing and retrieval tool for flat
file data libraries. Comput. Applic. Biosci., 9, 49-57.
GCG (1993) Program Manual for the GCG Package. Genetic Computer
Group GCG, Madison, WI.
Gouy,M., Gautier,C, Attimonelli,M., Lanave.C. and Di Paola.G. (1985)
ACNUC—a portable retrieval system for nucleic acid sequence
databases: logical and physical designs and usage. Comput. Applic.
Biosci., 1, 167-172.
Konopka,A.K. (1993) Plausible classification codes and local compositional complexity of nucleotide sequences. In Lim.H.A., Fickett,J.W.,
Cantor.C.R. andobbins, R.J. (eds), The Second International Conference on Bioinformatics, Supercomputing and Complex Genome
Analysis, Singapore. World Scientific, New York, pp. 69-87.
Kozak,M. (1989). The scanning model for translation: an update. J. Cell.
Biol., 108,229-241.
Perriere,G., Gouy,M. and Gojobori,T. (1994). NRSub: a non-redundant
data base for the Bacillus subtilis genome. Nucleic Acids Res., 22, 5525—
5529.
Received on June 5, 1995; revised and accepted on October 2, 1995