Vol. 12 no. 1 1996 Pages 1-8 CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases Giorgio Grillo, Marcella Attimonelli1, Sabino Liuni and Graziano Pesole12 Abstract A key concept in comparing sequence collections is the issue of redundancy. The production of sequence collections free from redundancy is undoubtedly very useful, both in performing statistical analyses and accelerating extensive database searching on nucleotide sequences. Indeed, publicly available databases contain multiple entries of identical or almost identical sequences. Performing statistical analysis on such biased data makes the risk of assigning high significance to non-significant patterns very high. In order to carry out unbiased statistical analysis as well as more efficient database searching it is thus necessary to analyse sequence data that have been purged of redundancy. Given that a unambiguous definition of redundancy is impracticable for biological sequence data, in the present program a quantitative description of redundancy will be used, based on the measure of sequence similarity. A sequence is considered redundant if it shows a degree of similarity and overlapping with a longer sequence in the database greater than a threshold fixed by the user. In this paper we present a new algorithm based on an 'approximate string matching' procedure, which is able to determine the overall degree of similarity between each pair of sequences contained in a nucleotide sequence database and to generate automatically nucleotide sequence collections free from redundancies. Introduction The way data are stored in biosequence databases strongly affects the scientific approach to be used and the quality of science that can be done with the database. In this context, a key concept is redundancy even if biological data are too complex to fit a simple definition of redundancy. The sequence databases now available contain a considerable amount of redundancy as they may contain multiple copies of the same sequence. Redundant entries can be originated by multiple submission of the same Cenlro di Studio sui Milocondri e Metabolismo Energetico, CNR and 'Dipartimento di Biochimica e Biologia Molecolare. Universita di Bari, Italy 2 To whom correspondence should be addressed. E-mail:[email protected] ) Oxford University Press sequence or by multiple sequencing of the same gene in the same organisms. Even if in the latter case redundancy may have biological significance in studying polymorphic patterns, it can still prevent a given sequence collection being used for some statistical analyses, e.g. methods to identify sequence patterns peculiar to the sequence collection under investigation, or can greatly slow down extensive database searching. As a qualitative definition of sequence redundancy is impracticable, we shall apply a quantitative description based on the measure of sequence similarity, depending on the particular application we wish to carry out. According to the above considerations the identification of redundant sequences in a given sequence collection can be simply achieved by performing pairwise alignments between all the sequences in the database. However, this approach is impracticable for large sequence collections as it would require a considerable amount of computational time and is not user-friendly. Here we present a new algorithm, based on an 'approximate string matching' procedure, which is able to determine the overall degree of similarity between each pair of sequences contained in a nucleotide sequence database without performing the time-consuming task of pairwise global best-alignments. The algorithm generates a new dataset from the redundant nucleotide sequence collection purified from redundancy according to cutoff parameters set by the user. This method is very fast and sensitive and allows the user to generate automatically and in a user-friendly way sequence databases that are free of redundancies. Systems and methods The algorithm presented here has been implemented in CLEANUP, a computer program written in C and running on VMS and UNIX operating systems. CLEANUP allows Pearson/FastA and GCG input formats for nucleotide sequences and automatically produces a new dataset purged of redundant sequences in both of these formats. It is available free of charge by anonymous FTP (IP address: area.ba.cnr.it, directory: pub/software/ cleanup). C.Crillo el al. c) b) a) 1 1 1 e) d) II 1 --V \ k f) 1 1 g) 1 1 Fig. 1. Similarity patterns (shaded areas) between primary (upper) and secondary (lower) sequences taken into account by the CLEANUP algorithm. Cases (f )-(g) refer respectively to similar domains found in unsimilar surrounding regions and to possible domain shuffling. Arrows indicate 'guide patterns' used for hooking similar segments in the compared sequences. The algorithm has been tested on two sequence datasets: (i) 362 nucleotide sequences extracted from the EMBL database (release 41) through ACNUC retrieval program (Gouy el al., 1985) using 'SP = Homo sapiens and O = mitochondrion' selection criteria which extract all those entries from human with report 'Mitochondrion' in the OG field of EMBL entry line; (ii) 2400 nucleotide sequences (5 523 925 bases) defined by the 'in:dro*' GCG logical name which selects all Drosophila entries of the invertebrate division of the Genbank collection (release 90). Algorithm The computational model adopted to recognize redundancies (totally or partially overlapping sequences) within collections of nucleotide sequences is based on the observation that the occurrence probability of oligos with length > 8 nucleotides in two or more sequences is rather rare in the case of evolutionarily unrelated sequences. The occurrence of a common oligo will be thus considered as a clue of possible redundancy that will be further tested by performing a progressive search of matching oligos through the full length of both sequences under examination. The searching process needs to perform pairwise comparisons of each sequence with all the others. We define the first sequence in the comparison as the 'primary sequence" and all the others as 'secondary sequences'. The algorithm is structured in the following steps: 1. Sorting of the sequences according to decreasing length. 2. Numeric coding of the sequences and generation of the index structure. 3. Generation of the primary sequences data structure containing the pattern positions and compositions. 4. Similarity searching between primary and secondary sequences. 5. Creation of a sequence collection free from redundancy according to user-selected parameters. As the sequences are sorted according to decreasing length, the primary sequence will certainly have a length equal or greater than any secondary sequence. The searching phase will be greatly improved by this sorting both in the case of similar sequences partially or totally overlapping and in the case of sequences sharing similar segments. In the first case only three possibilities can occur: (i) total coincidence, (ii) partial overlapping on the right or (iii) partial overlapping on the left (Figure la-c). The finding of a match between the rightmost or leftmost element of the secondary sequence with any element of the primary one will provide a remarkable clue of high sequence similarity and will thus prime the overall searching procedure. In the second case (Figure ld-g), the sorting will minimize the number of comparisons to be carried out between the primary and the secondary sequences. Figure 2 schematically shows how the numeric coding and the index structure are generated. Given a sequence S, L nucleotides long, extract all possible L- w + 1 iv-mers. From each »v-mer P = S\S2 . . . sw (sj = A, C, G, T) all possible /t-tuples (k < w), ph are generated, formed by CLEANUP: removing redundancies from nucleotide sequences guide pattern N(p) Pi i © f A' '© © © (A) © © | ACGTACG. k-tuples © coded k-tuples © w-mer i A ) (A) L-,0 L© CD © primary Sequences secondary Sequences 0 12 1 3 1 4 .5 1819 4-2 4 - 1 1 II 1 C i 1 IllOlllllI position composition II 1 i n H1 • (. )i ( )l II I 1 ! position composition . i position composition I ( • • )l 1 position composition Data Structure Fig. 2. Schematic description of how the data structure is generated for each primary sequence. To each data-structure element, corresponding to a specific fc-tuple, the information is linked through pointers on the positions and compositions of all fc-tuples extracted from the primary sequence. It may happen that the same Ar-tuple occurs more than once or that it is not found at all. The finding of common A-tuples between primary and secondary sequences (shaded box) define the guide pattern used for the hooking of similar segments in the compared sequences. concatenating nucleotides, even non-contiguous, extending over the string of interest. where /|,. . ., ih, . . ., ik is one of the possible combinations of the w nucleotides of P, with i,,<ih+[ and no repetitions. The coding of/?, into a unique integer N(pt) can be obtained with the following: h = ' where numO,) is 0, 1, 2, 3 for s,•= A, C, G, T respectively, G.Grillo el at. and pos(j,v,) gives the position of sih within the zth pattern which is in the range 1, . . . , k. an 'approximate string matching' method (Baeza-Yates and Perleberg, 1992). The searching algorithm compares each primary sequence to all secondary ones looking for sequence similarities over one (Figure la-c) or more Example segments (Figure ld-g) of the compared sequences. From the octamer ACGTACGT, eight eptamers can be A crucial step for assessing sequence similarity between generated: .CGTACGT, A.GTACGT, AC.TACGT, compared sequences is the correct hooking of similar ACG.ACGT, ACGT.CGT, ACGTA.GT, ACGTAC.T, segments in the compared sequences. If no suitable ACGTACG. with, for example, N(.CGTACGT) = hooking point is found for a pair of sequences, the searching phase is completely skipped, which saves a 1.46 + 2.45 + 3.4 4 + 0.4 3 + 1.42+ 2.41 + 3.4° = 6939. considerable amount of computation time. These partiThus, each sequence can be translated into (L - w + 1) cular patterns (w-mers) involved in hooking are defined as (w\jk\ (w-k)\) numbers, where L is the sequence length. 'guide patterns'. For a single overlapping segment, the The basic idea of our computational model is the guide pattern coincides with the leftmost or rightmost generation of a set of highly descriptive indices for each pattern of the secondary sequence. In multiple oversequence position (Califano and Rigoutsos, 1995). Indeed, lapping segments as many guide patterns are found as to each fc-tuple (whose numeric code corresponds to a many similar segments are common in the compared unique record of the data structure) the information is sequences (see Figure 1). linked, through pointers, on the position of the relevant wmer within sequence S and its composition (i.e. the The data structure previously described allows the particular combination of nucleotides extracted from the detection of a similarity match between two specific >t>basic vv-mer). The data structure generated as a vector of mers, obtained by comparing their extracted Ar-tuples, of size 4k with a space value of 0..4*—1 is dynamically the primary and secondary sequence, and this will prime updated for each primary sequence, as this is compared to further similarity searching even if they have up to the secondary ones. It allows us to detect very efficiently, <f> = w - k differences (see the example in Figure 2). both in terms of computational speed and of memory From information about the &-tuple composition we can allocation, any similarity match between sequence strings also infer the kind of difference, if any, between the perturbated by up to w-k differences (i.e. mutations, corresponding M'-mers (i.e. mismatch and/or insertioninsertions, deletions). deletion and their position). The searching phase proceeds, from the guide pattern, from left to right or from In the example shown in Figure 2 the similarity match right to left, depending on whether the left or the right between two octamers differing at a single position can be guide pattern is found to match within the primary detected if their component heptamers are compared, e.g. sequence, comparing subsequent M'-mers with a step of w. the similarity match between ACGTACGT and In the case of a pattern mismatch, a similarity function is ACCTACGT is determined by the common heptamer used to evaluate the degree of similarity within a window ACTACGT. In addition, the information about the of a given size starting from the observed mismatch. This heptamer composition (in this example 11011111) allows function calculates a score for the window local alignment us to detect the type and position of the difference (i.e. based on sequence similarity. Only if the similarity mutation, indel) between the compared octamers. function gets a score value above a cutoff fixed by the A suitable choice of k is fundamental to ensure a good user does the sequence comparison proceed. Conversely, compromise between sensibility, selectivity and efficiency if the similarity score is below the cutoff, the algorithm of the computational model. In our case it is advisable to looks for a new segment or jumps to a new sequence have k long enough for the occurrence probability of the comparison. ^-tuples to be sufficiently low to allow selective similarity searches. In our experience w = 8 and k = 7 realize an A continuously updated estimate of regional similarity efficient compromise between selectivity and sensibility is computed in the searching phase as the comparison goes and computation time. The quantity w - k is defined here on. This method avoids the more complex and timeas the precision factor <f>, and sets the maximum number of consuming task of global alignment, e.g. in our case where differences (mismatches, indels) allowed between two only identical or nearly identical sequences need to be matching segments. The precision factor also determines detected, only those sequence segments with a potentially the number of indices (coded ^-tuples) generated for each high sequence similarity are compared. sequence position and is defined as the index density The search algorithm has been defined as 'approximate (Califano and Rigoutsos, 1995). string matching' because it allows the recognition, based The searching phase, aimed at evaluating the redunon the pattern indexing approach, of the occurrence of dancy between pairs of sequences, is carried out by using small variability regions (mismatches or gaps) between CLEANUP: removing redundancies from nucleotide sequences primary and secondary sequences. However, the searching phase will produce a very reliable overall alignment if any significant sequence similarity is detected. For all pairs of sequences a degree of relatedeness in terms of % overlapping region (where % refers to total length of the secondary sequence) and % similarity (i.e. number of identical nucleotides over the complete overlapping region length) will be determined. Then a 'clean' sequence database will be automatically generated that will contain only non-redundant entries where those secondary sequences are considered as redundant when they display an overall degree of similarity and overlap above the threshold fixed by the user. In the case of partial sequence overlapping (cases b - g of Figure 1) the secondary sequence will be either removed or left in the database depending on a specific parameter previously set by the user. Results and discussion The CLEANUP algorithm has been tested initially on a dataset of 362 sequences accounting for all the human mitochondrial DNA sequences available in release 41 of the EMBL database. This dataset provides a very simple check of the cleaning accurateness of the algorithm as we can expect that after cleaning only a single sequence corresponding to the complete human genome is retained and all the others are erased. Table I reports part of the output produced by applying CLEANUP to the human mtDNA collection. CLEANUP automatically generates a cleaned sequence collection by erasing all those sequences that are found to show a degree of similarity and overlapping with a longer sequence above the user fixed threshold. In our test application, 20 sequences have been retained out of the 362 instead of the expected single one corresponding to the complete human mtDNA genome. Table II shows the list of sequence entries left after the cleaning process carried out by CLEANUP. It is striking to note that most of the non-redundant sequences are in fact nuclear encoded. This means that the annotators/ submitters have mistakenly considered the sequences as encoded by the mtDNA, whereas they were actually nuclear encoded and expressed in the mitochondrion. Indeed, it is not so surprising to find errors in the nucleotide sequence database. Quite remarkably, only five mitochondrial-encoded sequences have been found not to show a sufficiently high degree of similarity with the complete mtDNA human genome. Three of them (e.g. S43671, S43672, S43673) are sequence fragments from deletion boundaries of mtDNA from individuals with defects of the mitochondrial function and do not score above the cutoff because one of the sequence segments is shorther than the minimum size allowed by CLEANUP for spaced segments (cases d-f in Figure 1). Entry S56589 has ~350 nucleotides displaying a high sequence similarity with the complete human mtDNA genome, with the remaining 150 nucleotides not matching any nucleotide sequence contained in the database. The latter entry, S70096, represents by far the most amazing case. The Genbank annotator who created this entry from the original article (Bodenteich et at., 1991) forgot that sequencing gels should be read from bottom to top and not vice versa when reading the nucleotide sequence written on the side of a sequencing gel. Indeed, the inverted sequence is identical to a specific region of the human mtDNA. It can nevertheless be stated that the CLEANUP algorithm has proved to be efficient in this test. Indeed, it has produced the expected results; the odd items retained have been in fact due to errors in the database entry or to the similarity threshold fixed by the user. Figure 3 shows a CLEANUP session on a dataset of 2400 sequences (5 523 925 nt) comprising all Drosophila entries in the Invertebrate division of the Genbank collection (release 90). Four hundred out of the 2400 sequences (420 707 nucleotides) were removed by CLEANUP as both their degree of sequence similarity and overlap with other sequences present in the dataset resulted in a similarity value above the fixed threshold of 95%. The CLEANUP pairwise alignment procedure has also proven to be more efficient than usual alignment programs such as GAP or BESTFIT (GCG 1993) when the compared sequences are very long and/or have very different lengths. For example, in the attempt to align two complete human mitochondrial genomes with a different gene order, only CLEANUP succeeded in determining the correct alignment (data not shown). CLEANUP is also very fast. The above described applications both on the 362 sequences of the mitochondrial DNA dataset involving 65 341 pairwise sequence comparisons and on the 2400 Drosophila sequences involving 2 878 800 pairwise sequence comparisons took respectively 1.86 and 158.8 s of CPU time on a DEC 2100 4/275 OSF/1 machine. The production of a cleaned sequence database will be undoubtedly very useful in performing statistical analyses of nucleotide sequences as well as to speed up database searches. Very few attempts to provide non-redundant nucleotide databases have been made, 'nr'—a database maintained by the NCBI as a target for BLAST (Altschul et al., 1990) search—is a composite of Genbank, Genbank updates and EMBL updates in which entries with absolutely identical sequences have been merged. Indeed, if multiple gene copies from the same organism have been stored in G.Grillo et ai. Table 1. Output produced by applying C L E A N U P to a sample of 362 sequences extracted from release 41 of the E M B L database C L E A N U P - M A I N SEARCH FILE Sequence 1 Sequence 2 MIHSCG HSCOXII HSMTALT1 HSMTNA02 HSMTNA03 HSMTNA04 HSMTNA05 HSMTNA06 HSMTNA07 HSMTNA08 HSMTNA09 HUMMTALTO MIHSOI MIHSAIA MIHSAIAA MIHSAIAB MIHSAIB MIHSAIC MIHSAID MIHSAIE MIHSAIF MIHSAIG MIHSAIH MIHSAII MIHSAIJ MIHSAIK MIHSAIL MIHSAIM MIHSAIN MIHSAIO MIHSAIP MIHSAIQ MIHSAIR MIHSAIS MIHSAIT MIHSAIU MIHSAIV MIHSAIW MIHSAIX MIHSAIY MIHSALT01 MIHSALT02 MIHSALT03 MIHSALT04 MIHSALT05 MIHSALT06 MIHSALT07 MIHSALT08 MIHSALT10 LI 16569 L2 708 360 360 360 360 360 360 360 360 360 360 608 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 PI P2 7584 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 324 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16024 16287 16024 1 OSl OS2 %Ident. %Overl. 709 360 360 360 360 360 360 360 360 360 360 610 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 708 360 360 360 360 360 360 360 360 360 360 608 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 360 318 360 360 360 360 360 360 360 262 97 360 99.86 100.00 98.33 98.06 98.89 98.33 98.06 97.78 98.61 98.89 99.72 99.51 98.89 99.44 99.17 99.17 99.44 99.17 98.06 98.61 98.06 98.33 98.33 98.33 98.61 98.89 98.61 98.33 98.06 98.89 98.89 98.89 98.61 98.61 99.17 99.17 98.89 98.89 98.43 98.61 98.89 99.17 98.61 98.61 99.17 98.89 98.47 98.97 98.61 100.00 100.00 360 360 360 318 1 360 360 1 360 360 1 1 1 1 264 1 360 360 360 262 97 360 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 (c) 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 88.33 100.00 100.00 100.00 100.00 100.00 100.00 100.00 72.78 26.94 100.00 LI, L2? length of primary (LI) and secondary (L2) sequence; PI, P2, starting position of the similarity region in the primary (PI) and secondary (P2) sequence; OSl, OS2, length of the overlapping segment in the primary (OSl) and secondary (OS2) sequence, (c) similarity match found on the reverse/ complementary strand of the secondary sequence. the database it becomes very difficult to look for weaker sequence relationships. At the NCBI repository a specialized database of nonredundant functionally equivalent sequences (NRFES) is maintained (Konopka, 1993). Indeed, suitable statistical analyses carried out on sequence collections of functionally equivalent sequences could be very useful for assessing function-associated pattern dictionaries which also might CLEANUP: removing redundancies from nucleotide sequences Table II. List of sequence entries left in the cleaned dataset after applying CLEANUP to the 362 mtDNA sequence dataset extracted from release 41 of the EMBL database Entry Description Genome MIHSCG(JO4I5) HSDF1FO5(X63423) HSPSRNAB(L06218) MIHSALD(M63967) MIHSAPT1 (D1656I) MIHSB(L11932) MIHSFPS(D30648) MIHSHTPA(D16480) MIHSHTPB(D16481) MIHSTF2(X76338) M1HSTG(X76317) MIHSTR3 (X76339) MIHSTR4(X76340) MIHSUMPS1 (L062I9) MIHSUMPSR(L06217) S43671 S43672 S43673 S55589 S70096 mtDNA complete genome Fl FO ATP-synthase 5-subunit 16S ribosomal RNA pseudogene aldehyde dehydrogenase ATP synthase 7 subunit serine hydroxymethyltransferase complex II (succinate-ubiquinone oxidorcductase) 3-hydroxyacyl-CoA dehydrogenase 3-ketoacyl-CoA thiolase transferrin partial gene transferrin partial gene transferrin partial gene transferrin partial gene mitochondrial I6S ribosomal RNA pseudogene mitochondrial 16S ribosomal RNA pseudogene mtDNA segment with deletion (10656.. 10676 + 11920.. 11925) mtDNA segment with deletion (8459..8467+ 13445.. 13477) mtDNA segment with deletion (10180.. 10199+ 13761.. 13770) mtDNA segment with deletion mtDNA glycine tRNA mutant mitochondrial nuclear nuclear nuclear nuclear nuclear nuclear nuclear nuclear nuclear nuclear nuclear nuclear nuclear nuclear mitochondral mitochondral mitochondral ? mitocondrial The overlapping segment positions, with respect to the mtDNA complete genome numbering, are reported in brackets for mtDNA segments with deletions. provide an excellent tool for the classification of the anonymous sequences now produced in large amounts in megasequencing projects. NRSub, a non-redundant database for the Bacillus subtilis genome where duplicate entries identified by % Cleanup -sim=95 BLAST similarity searches are removed, has recently been reported (Perriere et al., 1994). The construction of specialized databases such as those described above could take great advantage of a fast, automatic procedure to remove redundancy. -overlap=95 Cleanup generates a non redundant sequence data library from any set of sequences. Cleanup of what sequences ? inrdro* What should I call the output searching file (* in_dro.s *) ? What should I call the file listing erased sequences (* in_dro.e *) ? What should I call the file listing retained sequences (* in_dro.nr *) ? Sequences % Overlapping for cleaning Sequence matches = 2400 = 95.00 = 424 Erased sequences % Erased sequences = 400 =16.67 Nucleotides % Similarity for cleaning Searching C.P.U. time Erased nucleotides % Erased nucleotides = 5523925 = 95.00 = 158.84 420707 = 7.62 Fig. 3. Sample run of a CLEANUP session on a dataset of 2400 Drosophila entries. User inputs are in bold typed. CLEANUP application generates a file listing all non-redundant sequence entry names which can be used as input with other programs of the GCG package as well as with retrieval programs such as SRS (Etzold and Argos, 1993) or ACNUC (Gouy el al., 1985). It optionally also generates a sequence data file in Pearson/FASTA format. G.Grillo et at. To give an idea of the many possible uses of the CLEANUP algorithm we have carried out a statistical analysis aimed at the determination of the possible nucleotide context around the AUG initiator codon of protein-coding genes. A consensus signal describing the favourable context for the AUG initiator codon has been previously described by Kozak (1989) based on an analysis carried out on a dataset of 699 vertebrate genes. In order to carry out such an analysis on a larger dataset of human genes we extracted the sequence regions spanning from position - 2 0 to +20 with respect to the initiator AUG codon from human genes coding for proteins. This sequence extraction, carried out by using the retrieval program ACNUC on release 88 of the Genbank collection, obtained 8004 sequences. In order to avoid the bias produced by redundancy we need to remove all duplicated sequence entries. Indeed, oligonucleotides present in redundant entries would result to be overrepresented because of the presence of multiple copies of the same sequence and not because they are genuine biological signals. CLEANUP is particularly suited to this task as it automatically generates a cleaned collection, free from repetitions. In this particular case, by using the CLEANUP program to remove 100% identical sequences we have obtained a cleaned dataset of 5184 sequences, thus showing that the original dataset contained > 35% of redundancy. CLEANUP is thus invaluable for constructing specific specialized collections, rather than for the use on repository databases such as EMBL/Genbank/DDBJ. Indeed, biological data are too complex to fit a simple definition of redundancy, and repository databases do not attempt to be non-redundant, but rather sacrifice this goal for the sake of completeness. Of course, the sequences of the same gene from two individuals belonging to the same species cannot be considered biologically redundant (i.e. non-informative) even if their primary sequences are perfectly identical; in fact, these sequences give us useful information on the genetic variability of the population at a given locus. In the same way the presence, in some genomes, of several copies of repetitive sequences such as SINES or LINES elements as well as prokaryotic rRNA operons, has a biological relevance that should be taken into account. Indeed, a peculiar feature of many genomes is that they are highly biased in favour of some repetitive sequence elements. Given that a 'qualitative' definition of redundancy is impracticable, the present algorithm allows any database user/provider to extract from nucleotide databases specific non-redundant collections based on 'quantitative' parameters, i.e. below a fixed threshold of sequence similarity, depending on the particular application. To sum up, the application of CLEANUP will be thus an invaluable and efficient tool for all database users/developers to remove or classify redundant information. Acknowledgements We thank C.Saccone for helpful comments. This work was partially financed by MURST, Italy, Progetto Finalizzato Ingegneria Genetica, CNR, Italy and by EMBnet Research and Development Project Committee. References Altschul,S.F., Gish.W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mot. Biol., 215, 403-410. Baeza-Yates.R.A. and Perleberg,C.H. (1992) Fast and practical approximate string matching. In Apostolico,A., Crochemore,M., Galil, Z. and Manber,U. (eds), Third Annual Symposium on Combinatorial Pattern Matching, Tucson, AZ. Springer-Verlag, Berlin, pp. 185-192. Bodenteich,A., Mitchell.L.G. and Merril.C.R. (1991) A lifetime of retinal light exposure does not appear to increase mitochondrial mutations. Gene, 108, 305-309. Califano,A. and Rigoutsos,I. (1996) FLASH: a fast look-up algorithm for string homology. Comput. Applic. Biosci., in press. Etzold,T. and Argos,P. (1993) SRS an indexing and retrieval tool for flat file data libraries. Comput. Applic. Biosci., 9, 49-57. GCG (1993) Program Manual for the GCG Package. Genetic Computer Group GCG, Madison, WI. Gouy,M., Gautier,C, Attimonelli,M., Lanave.C. and Di Paola.G. (1985) ACNUC—a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. Comput. Applic. Biosci., 1, 167-172. Konopka,A.K. (1993) Plausible classification codes and local compositional complexity of nucleotide sequences. In Lim.H.A., Fickett,J.W., Cantor.C.R. andobbins, R.J. (eds), The Second International Conference on Bioinformatics, Supercomputing and Complex Genome Analysis, Singapore. World Scientific, New York, pp. 69-87. Kozak,M. (1989). The scanning model for translation: an update. J. Cell. Biol., 108,229-241. Perriere,G., Gouy,M. and Gojobori,T. (1994). NRSub: a non-redundant data base for the Bacillus subtilis genome. Nucleic Acids Res., 22, 5525— 5529. Received on June 5, 1995; revised and accepted on October 2, 1995
© Copyright 2026 Paperzz