BIOINFORMATICS ORIGINAL PAPER Vol. 22 no. 2 2006, pages 134–141 doi:10.1093/bioinformatics/bti774 Genome analysis WindowMasker: window-based masker for sequenced genomes Aleksandr Morgulis, E. Michael Gertz, Alejandro A. Schäffer and Richa Agarwala National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Building 38A, Room 1003N, 8600 Rockville Pike, Bethesda, MD 20894, USA Received on July 26, 2005; revised and accepted on November 8, 2005 Advance Access publication November 15, 2005 Associate Editor: Chris Stoeckert ABSTRACT Motivation: Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query, if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes. Results: We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis. Availability: WM is included in the NCBI C11 toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the instructions for building WindowMasker application in the UNIX environment can be found in file src/app/winmasker/README.build. Contact: [email protected] Supplementary information: Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/window masker_suppl.pdf. 1 INTRODUCTION In searching a DNA database, subsequences repeated hundreds of times in the database may be found to match the query, but such repetitive matches are usually not of biological interest. Finding matches to sequences that consist entirely of repetitive elements [such as microsatellites (Weber and May, 1989), Alus (Schmid and Jelinek, 1982) and LINEs (Smit et al., 1995)] can be avoided by To whom correspondence should be addressed. masking such sequences in the database. Because running times of database searches typically grow at least linearly with the number of alignments found, masking the database to remove repetitive sequences in the first place is much more efficient than filtering intermediate outputs. Therefore, we consider the problem of masking a genomic database that is to be used in BLAST searches. Our central aim is to preserve BLAST matches to sequences that are unique or have low copy number, while preventing BLAST from finding and reporting matches to repeated sequences with many nearly identical copies. We introduce the software package WindowMasker (WM) that implements a method to mask a DNA genome for repeats using only the sequence of the genome itself. NCBI’s BLAST Web page service received over 28 000 queries to genome-specific DNA databases during the first week of June 2005, and an average of 3000 queries per day in the entire month. These queries included thousands of queries for organisms such as the dog (405 dogspecific queries recorded in the first week) that have been recently sequenced and have no adequate masking for repeats. Since WM uses only the sequence of the genome, it can be applied each time the genome sequence is ‘finished’ or even in draft stages. To our knowledge, RepeatMasker/MaskerAid (Bedell et al., 2000; Smit and Green, 1996, http://www.repeatmasker.org/ webrepeatmaskerhelp.html) (RM) is currently the most widely used software for masking DNA databases. RM uses local alignment methods to compare subsequences in the genome with a curated library of repeat sequences, such as RepBase (Jurka, 2000), to determine the regions to be masked. In addition to masking repetitive sequences, RM uses the annotation in a repeat library to annotate the masked sequences with a description of what types of repetitive elements are contained therein. WM does not use repeat libraries, which are labor-intensive to create and curate, and may not exist for newly sequenced genomes. The aim of WM is to replace the use of RM for masking DNA databases prior to homology searches. WM is not capable of annotating masked regions at this time, but this does not compromise its utility for creating a masked database to be used by BLAST searches. Annotation of repeats is of use to biologists, but it is not used by BLAST searches. Another software package is RECON (Bao and Eddy, 2002). Like RM, RECON uses local alignment methods to compare subsequences in the genome. Unlike RM, RECON compares the Published by Oxford University Press 2005. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected] WindowMasker: masker for sequenced genomes genome with itself. The design of RECON suggests that its intended use is the compilation of repeat families for studying genomic evolution. WM uses separate methods to identify two categories of likely repeats, low-complexity sequences and global repeats. A low-complexity sequence contains tens of letters and does not necessarily occur many times throughout the genome. The following is an example of a sequence with a low-complexity interval. (R2) (R3) GGTTGGTCAAATAAAAAGTGATGTATGAAAAAGAGG CAAAACAACAAGAAGAAAAGATTGAAAAAATGAGAG CTGAAGATGGTGAAAATTATGACATCAAAAAGCAGG This sequence occupies positions 85 653 through 85 760 in a large clone AL451144.5 from human chromosome X; this sequence is part of a pseudogene that lies in an intron of the dystrophin gene (Koenig et al., 1988). A portion of the above sequence is considered to have low complexity, mostly because of the multiple A monomers. Lowcomplexity sequences are identified by a module called DUST that has been part of the NCBI C (and now C++) toolkit for many years. We present DUST in supplementary material because it has not been described in print, and we have made some improvements. By global repeats we mean families of nearly identical sequences that occur multiple times in not necessarily adjacent places in the genome. One expects that global repeats originated as one sequence in some primordial genome, and that sequence was duplicated or reinserted and subsequently mutated numerous times on the way to the genome being masked. WM identifies likely global repeats using a module named WinMask that finds all Nmers that occur far more often than expected in windows of size N + 4. The distinction between low-complexity sequences and global repeats is operational, not mathematical; it is defined by what sequences are masked by the two modules of WM. The two modules of WM are applied independently, and the set of letters masked in the output is the union of the sets masked by each module. Counting of Nmers has been used in other types of genome analysis. For example, identifying unique Nmers is important in primer design for STS markers (Healy et al., 2003). Masking based on exactly repeated Nmers has been used in several genome assembly packages such as Reps (Wang et al., 2002) and Atlas (Havlak et al., 2004). These packages make masking decisions based on the repetition count of each individual Nmer, so N has to be large enough (e.g. 20 in Reps and 32 in Atlas) that the multiple occurrence of the same Nmer is much more likely due to a duplication than due to chance. In WinMask, we use much shorter Nmers (N 15), combined with simple statistics on multiple, distinct Nmers in a sliding window, to distinguish highly repetitive sequences from chance recurrences. The effectiveness of WM for database searching is demonstrated by showing that among the regions of sequences from genomes tested, regions that have a large number of matches to the unmasked genome, indicating that the region is likely repetitive, are usually masked by WM. Conversely, we also show that regions that have a small number of matches to the unmasked genome, indicating that the entire region is likely non-repetitive, are usually left unmasked by WM. Five types of regions that were tested are as follows: (R1) The 50 longest contiguous regions of genomic database sequences that were masked by WM but that did not intersect a region masked by RM. (R4) (R5) The 50 longest contiguous regions of genomic database sequences that were masked by RM but that did not intersect a region masked by WM. Regions from high quality BLAST matches found in an RM-masked genome and not found in the WM-masked genome. Note that the matches are missed by BLAST due to masking of WM but not missed by BLAST due to masking of RM. The matches were found using sample queries (described below) and the regions are taken from the sample queries. A match is defined to be a high quality match if it is at least 92 bases long and has at least 97% identity and a match M is said to be not found in a set S if for every match N in S, the query or subject interval of N does not overlap the query or subject interval of M. The 92 base length and 97% identity thresholds are used in several MegaBLAST applications at NCBI. Regions from high quality BLAST matches found in a WMmasked genome and not found in the RM-masked genome. Note that the matches are missed by BLAST due to masking of RM but not missed by BLAST due to masking of WM. Regions from high quality BLAST matches found in an unmasked genome and not found in the WM-masked genome. Note that the matches are missed by BLAST due to masking of WM. The regions in sets (R1) and (R2) are selected to compare RM and WM directly, sets (R3) and (R4) compare the effect of RM and WM on BLAST searches for the genomes for which a RepBase library is available, and set (R5) shows the effectiveness of WM for genomes for which no RepBase library is available. We show that WM performs the desired masking, is more effective than RM for the task of masking a genome for BLAST searches, and does not require a library of repeats. In Section 2, we describe WinMask and methods used in evaluating the performance of WM. Section 3 describes tests of WM on several genomes and reports on direct comparisons with RM. We also show that WM does not excessively mask transcribed regions and substantially reduces the number of matches to consider for making gene models. We close with a discussion of related work and open problems. Source code for WM is available as part of the NCBI software toolkit ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/ CURRENT/. Once the toolkit source is unpacked, the instructions for building WindowMasker application in the UNIX environment can be found in file src/app/winmasker/README.build. 2 MATERIALS AND METHODS The WM program is written in C++ and included in the NCBI software toolkit. Numerous auxiliary programs used to automate tests and summarize results were written in C, C++ and Perl. 2.1 List of genomes tested We developed and tuned the WinMask algorithm described below using build 34.1 of the human genome. The combination of DUST and WinMask was then tested on five other genomes without further tuning of parameters. In the text below, these genomes are referred to by their English, nonscientific names: human (Homo sapiens), mouse (Mus musculus), rat (Rattus norvegicus), honeybee (Apis mellifera) and fruitfly (Drosophila melanogaster), except for Drosophila pseudoobscura. DNA sequences for the D. pseudoobscura genome were retrieved in FASTA format from GenBank (Benson et al., 2004). All other genome sequences were retrieved via NCBI’s 135 A.Morgulis et al. MapViewer where current sequence releases (called ‘builds’) are available as hyperlinks. Data on older builds are available on the NCBI ftp site under ftp://ftp.ncbi.nlm.nih.gov/genomes/ or may be requested by sending an email to [email protected] 2.2 The WinMask procedure for finding global repeats WinMask is a two-pass algorithm. In the first pass, it calculates the size N of Nmer to consider, Nmer frequency counts for the genome, threshold scores for the algorithm, and a score function that is based on the frequencies and threshold scores. In the second pass, it calculates masked regions for the genome using the previously generated score function and thresholds. A genome is presented to WinMask as a collection of FASTA formatted files, each file containing one or more sequences. Each sequence, commonly called a contig, represents a contiguous region of the genome that includes no large gaps (typically 100 or more bases). Contigs are not concatenated; an Nmer must occur entirely between the endpoints of a single contig. The genome sequences should include only one strand from each complementary pair of DNA strands. We found that this is not always the case; therefore, the first pass of WinMask has an option to check the input data for likely duplicate contigs. The first pass of the WinMask algorithm proceeds as follows: (1) Determine the size of Nmer to use when masking this genome. Let L be the sum of the lengths of all contigs. Then N is the smallest integer such that L/(4N) < 5. (2) Scan the contigs to determine frequency counts of all Nmers. Add the frequency counts for reverse complement of contigs. We use the notation freq(S) to denote the number of times Nmer S occurs in the genome. (3) Calculate cutoff values Tthreshold, Textend, Tlow and Thigh. Let C be the total number of distinct Nmers that occur at least once in the genome. The cutoff Tthreshold is the largest value of T that satisfies sizeðfS j S is an Nmer and freqðSÞ TgÞ 0:995 · C: This is equivalent to the 99.5 percentile. Similarly, the set Textend, Tlow and Thigh are defined as the 99.0, 90.0 and 99.8 percentiles of the same empirical cumulative distribution function, respectively. The score of an Nmer S is defined as 8 if freqðSÞ T high ; < T high Nmer scoreðSÞ¼ dT low =2e if freqðSÞ < T low ; and : freqðSÞ otherwise: An Nmer with a frequency count greater than Thigh is assigned a score Thigh rather than freq(S). This rule limits the effect of some highly repetitive Nmers. Similarly, any Nmer with a frequency count below Tlow is assigned score dTlow/2e. This rule allows us to avoid storing frequency counts for a large number of infrequently occurring Nmers, while not compromising the results in practice. The second pass of WinMask computes masked regions for all or part of the same genome. For each sequence to be masked, WinMask applies the following algorithm: (1) Scan all windows W‘ ¼ a‘ am of length N + 4 in the sequence to be masked, and assign each window a score ! ‘+4 X 5‚ Nmer_scoreðaj ai + N1 win_scoreðWÞ¼ Values of N, Tthreshold, Textend, Thigh and Tlow for different genomes we considered are shown in the Table S1 in Supplementary Information. Portions of the cumulative Nmer frequency distribution are plotted for human, rat, honeybee and fruitfly genomes in the Figure S1 in Supplementary Information. Distributions for mouse and D. pseudoobscura are not shown as they are indistinguishable from those for human and fruitfly, respectively. The Nmers having the same frequency or rank differ substantially for different genomes, although this is not apparent from the plot of frequency distributions. 2.3 Usage of RepeatMasker and MaskerAid Masking done with RM used the May 15, 2002, version of RepeatMasker (Smit and Green, 1996) in conjunction with MaskerAid (Bedell et al., 2000). Parameter settings were those recommended on the MaskerAid Web site. The time-consuming part of RepeatMasker is the repeated use of the program cross_match for the local alignment of DNA sequences. The MaskerAid module is used to identify most global repeats with permissive (and hence relatively fast) parameters to cross_match via the MaskerAid parameters -w -nolow -norna. Then, cross_match is used directly with the more restrictive parameter -noint to identify low-complexity sequences. The runs of RM were done using a recent version of RepBase for the genome being masked, and it was the same version being used in the NCBI genome pipeline (Kitts, 2003, http://www.ncbi.nlm.nih.gov/books) at the time of the experiments. 2.4 Usage of tandem repeats finder Preliminary analysis suggested that many regions in sets (R1)–(R5) are not entirely covered by low-complexity regions or global repeats, but contain a substantial subinterval generated by tandem repetition of a smaller motif. To investigate this formally, we used the Tandem repeats finder (TRF) software, version 3.21, with the parameters recommended in the documentation (Benson, 1999), to identify regions containing a tandem repeat. If a tandem repeat is found, we suspect that the minimal repetitive motif may be found many times across the genome, and this motif is evaluated further instead of the region that had the motif. We restricted attention to only those tandem repeats that cover at least 50% of the region of interest. The tandem repeats found are further subdivided into small patterns or generative patterns based on the length of the repeat. The reason for using the term ‘generative’ rather than the antonym ‘large’ is to reinforce the intuition that these larger patterns effectively ‘generate’ at least 50% of the region via a process of duplication and mutation. A repeat is called small [generative] if its length is less than [at least] 19 bases for human, mouse and rat genomes or less than [at least] 17 bases for Drosophila, honeybee and D. pseudoobscura genomes. The lengths 19 and 17 are the window sizes used in the WinMask module when masking those genomes with WM. Any region that has neither a small nor a generative pattern is called a region with no patterns. Almost all small patterns observed are di- or tri-nucleotide patterns. 2.5 BLAST runs and queries for testing In Section 3, we describe numerous results based on outputs of MegaBLAST (Zhang et al., 2000) or blastn (Altschul et al., 1997) modules of the BLAST software package for sequence alignment. All runs of MegaBLAST and blastn were done with default parameter settings, with the following exceptions: i¼‘ (2) Mask any window whose score is at least as great as Tthreshold. We turned off the DUST filtering of the query sequence using parameter -FF in order to assess the low-complexity filtering provided by RM and WM. (3) Mask the interval between two consecutive windows masked in the previous step if, and only if, every base in the interval is in a window that has a score at least as great as Textend. For certain smaller genomes, we ran MegaBLAST with word size 16 as well as with default word size 28; we clearly mark results generated using the smaller word size in Section 3. where integer division is used. 136 WindowMasker: masker for sequenced genomes The tests of alignment to transcribed regions carried out by the NCBI genome pipeline group use parameter settings -r10 -q-25 -X500 -Fm for MegaBLAST. Both blastn and MegaBLAST search for short exact alignments that are known as seeds. One major difference is that blastn uses by default a minimum seed size of 11, while MegaBLAST uses by default a minimum seed size of 28; some MegaBLAST tests were done with a seed size of 16. Partly because it uses a smaller seed size, blastn is slower and may print many more matches in the output. The effect that masking a database has on a blastn or MegaBLAST search is to forbid any masked base from participating in a seed. The algorithm ultimately extends seeds into gapped alignments. Masked bases are permitted to participate in gapped extensions exactly as if they were not masked. We note, however, that both blastn and MegaBLAST provide the option of treating masked bases as if they were a special base that mismatched every other base. We did not use this option in performing our tests. 2.5.1 Sample queries To test the effect of masking on MegaBLAST searches, we selected a separate set of 300 sample query sequences for each genome. Each set of 300 query sequences is divided into subsets of 100 each in three length ranges: small (500 bases), medium (10 kb), and large (100 kb). Queries for each length range were randomly selected from within bacterial artificial chromosome (BAC) sequences from the genome being queried, with the exception of D. pseudoobscura for which we used query sequences from D. melanogaster. MegaBLAST runs using the chosen sample queries were performed using both a WindowMasked and an unmasked database for each genome. For those genomes for which a library of repeats is available, an additional set of runs was performed using a RepeatMasked database. Results of the MegaBLAST runs were filtered for high quality matches, as defined in Section 1, to ensure a high probability that what we identify in our automated processing as a match is a true positive. 2.5.2 Query regions selected for further study It is common for a nearly identical interval in a sample query to participate in many distinct alignments that yield regions in sets (R1)–(R5). Therefore, intervals within each set were coalesced if they overlapped by at least 90%, and the ends also did not extend more than 20 bases. 2.5.3 Investigation of regions We classified all regions in sets (R1)–(R5) as having small, generative, or no patterns as described in Subsection 2.4. Regions having a small pattern were not analyzed further; they are low-complexity regions. Any generative patterns found within the regions were used as queries instead of the region they generate and aligned using blastn against the unmasked genome from which they originated. Results were filtered to require a minimum of 95% identity and a minimum length of 19 or 17 bases, whichever corresponds to the window size for the genome. Regions that had no discernible pattern were used in their entirety as queries and aligned against the appropriate unmasked genome. Query sequences that originated from sets (R1) and (R2) were aligned to genomes using blastn, whereas query sequences that originated from sets (R3), (R4) or (R5) were aligned using MegaBLAST because of the prohibitive amount of output generated if blastn is used. These results were filtered to require a minimum percentage identity of 95% and a minimum length of 75% of the query length. For all genomes except rat, we used a minimum of 95% identity; for rat we used 97% identity, as there were a large number of matches even with this more stringent criterion. 2.5.4 Evaluation of results The primary means of evaluating the results of a BLAST run is to count the number of output alignments meeting the prescribed length and identity criteria. We classify such runs and their output counts into three ranges: 1–10 matches, 11–50 matches and >50 matches. We suggest that output with 1–10 matches is desirable and should not be eliminated by masking, while output with >50 near exact matches is undesirable and indicates insufficient masking in the matched database regions. Whether outputs falling in the intermediate range are desirable Table 1. Percentage of base pairs masked by RM, WM, and by both RM and WM Genome Genome (bp) RM (%) WM (%) Common (%) Human 34.1 Mouse 32.1 Mouse 33.1 Fruitfly 6.3 Honeybee 1.1 Pseudoobscura Rat 2.1 2 862 648 857 2 763 483 754 2 748 835 566 116 781 562 205 571 295 127 928 143 2 839 476 130 48.4 45.4 45.6 8.8 — — — 37.4 36.9 36.9 21.6 44.2 24.0 41.6 29.6 29.9 29.9 6.4 — — — or not depends on the user’s taste and the biological motivation for the query. In our tests, the number of cases falling in the intermediate range is usually a small minority. Therefore, the qualitative evaluation does not depend much on the specific choices of 11 and 50 as the bounds of the intermediate range. 2.5.5 Test of alignment to transcribed regions For a test of alignments to human regions that are likely to be transcribed, the NCBI genome pipeline group collected sets of mRNAs and ESTs to align. Then they used MegaBLAST and other programs exactly as described in (Kitts, 2003) so that results would be comparable with NCBI’s annotation of the human genome that utilized RM. The mRNAs and ESTs are aligned as full length transcripts predicted by the genome pipeline. These predicted full-length transcripts are called gene models below. The test of alignment to transcribed regions by gene models was applied to human chromosome 1, not the entire genome. 3 RESULTS WM was developed using the human genome and then tested on the genomes of mouse, fruitfly, rat, honeybee and D. pseudoobscura. RepBase (Jurka, 2000) has good libraries for human, mouse and fruitfly. It currently does not have libraries for the recently sequenced rat genome or the partially sequenced honeybee and D. pseudoobscura genomes. For all genomes, we document the amount of masking and the effect of masking on BLAST searches. For the three genomes with RepBase libraries, we do analysis for regions in sets (R1)–(R4), while for the three genomes without RepBase libraries, we do analysis for regions in set (R5). For honeybee and D. pseudoobscura, the difference in results between using an unmasked database and a WindowMasked database is reduced if MegaBLAST seed size is 16 instead of the default seed size of 28. We do additional analysis for matches found only when using an unmasked database and a word size of 16. Results of aligning mRNAs and ESTs to human chromosome 1 masked using either RM or WM are also presented. 3.1 Tests of the amount of masking We did a direct comparison of the portions of the genome masked by RM against the portions masked by WM. Table 1 gives the percentages of the genomes masked by each method and the percentages masked in common. Pairs of contiguous regions can be classified as completely overlapping (same end points found by both methods), partially overlapping (regions overlap, but not completely) and nonoverlapping. Table S2 in Supplementary Information summarizes the lengths of contiguous regions found by one method that do not overlap a region found by the other. A subset of these regions forms sets (R1) and (R2) for further study. 137 A.Morgulis et al. Table 2. MegaBLAST results where comparison with WM uses RM database for human, mouse and Drosophila and unmasked (UM) database for honeybee, D.pseudoobscura and rat Match count UM Human 34.1 Mouse 32.1 Mouse 33.1 Fruitfly 6.3 Honeybee 1.1 Honeybee 1.1 W16 Pseudoobscura Pseudoobscura W16 Rat 2.1 RM 199 650 1 311 680 1 273 865 7 379 15 978 16 167 328 323 10 815 737 WM 2 069 2 925 1 933 3 307 12 989 13 462 276 314 11 820 2083 3994 4895 1698 — — — — — Same alignment Partial overlap RM UM WM No overlap RM UM WM 1 895 2 509 1 539 1 678 12 784 12 832 276 312 11 410 25 97 81 11 — — — — — 44 57 39 41 170 511 0 0 297 163 (82) 1388 (245) 3275 (256) 9 (7) — — — — — — — — — 2 589 2 383 52 11 10 803 598 130 359 355 1588 35 119 0 2 113 — — — — 605 952 0 0 729 (457) (294) (50) (9) (4473) (65) (91) (58) (268) (35) (119) (2) (94) Matches reported are at least 92 bases long and have at least 97% identity. Rows labeled W16 use MegaBLAST with initial match size 16, instead of the default 28. Each entry in the second, third and fourth columns counts the number of matches of that type totalled over 300 queries. The fifth column counts matches in common to the two databases being compared. ‘Partial overlap’ columns count cases where corresponding matches overlap in both query and subject, but are distinct. The last three columns count the number of matches unique to each database; in those columns the number in parentheses are the number of coalesced regions that are added to the sets (R3), (R4) or (R5), as described in Subsection 2.5.2. Table 3. Regions (R1) and (R2) for each genome are classified by the number of blastn matches they yield against the same unmasked genome Match count Human 34.1 (R1); Generative (R1); No pattern (R2); No pattern Mouse 32.1 (R1); Generative (R2); Generative (R2); No pattern Mouse 33.1 (R1); Generative (R2); Generative (R2); No pattern Fruitfly 6.3 (R1); Generative (R1); No pattern (R2); No pattern 11–50 >50 Total 1–10 44 6 50 0 4 50 1 0 0 43 2 0 pattern pattern 50 3 47 0 0 47 2 3 0 48 0 0 pattern pattern 50 3 47 0 0 47 3 3 0 47 0 0 pattern 44 6 50 0 0 50 10 6 0 34 0 0 Match count pattern Matches counted had at least 75% query length coverage and at least 95% identity. We suggest that cases with 1–10 matches should appear in the output, cases with >50 matches should not appear, and cases in the middle range may sometimes be desirable to have in BLAST output. 3.2 Effect on BLAST searches The number of alignments found by MegaBLAST between sample queries and the various masked and unmasked genome databases are listed in Table 2. The first number that should stand out is the 10 803 598 matches unique to an unmasked rat database; only 11 820 matches are found using a WindowMasked database, a reduction by a factor of about 1000. The large number of matches to the unmasked rat genome quantifies the need for a masking method that can be used when repeat libraries are not available. In the ‘Partial overlap’ and ‘No overlap’ columns of Table 2, we note that for those rows where we did not run RM, there are a few cases where we find an alignment when querying the WindowMasked genome, but do not find the same alignment 138 Table 4. Taking the coalesced query regions (R3)–(R5), enumerated in the ‘No overlap’ column of Table 2, as a starting point, this table classifies the query regions according to the number of matches covering at least 75% of the query region that are obtained when querying an unmasked database Total 1–10 Human 34.1 (No small patterns) (R3); Generative pattern 7 0 (R3); No pattern 75 6 (R4); Generative pattern 2 0 (R4); No pattern 63 49 Mouse 32.1 (1 RM and 1 WM small pattern) (R3); Generative pattern 15 0 (R3); No pattern 229 5 (R4); No pattern 90 66 Mouse 33.1 (1 RM and 1 WM small pattern) (R3); Generative pattern 18 0 (R3); No pattern 237 11 (R4); No pattern 57 40 Fruitfly 6.3 (No small patterns) (R3); No pattern 7 5 (R4); No pattern 268 211 Honeybee 1.1 UM W16 (49 small patterns) (R5); Generative pattern 111 2 (R5); No Pattern 134 95 Pseudoobscura UM W16 (1 small pattern) (R5); No pattern 8 8 Rat 2.1 UM (251 small patterns) (R5); Generative pattern 33 0 (R5); No Pattern 4189 2334 11–50 >50 0 7 2 14 7 62 0 0 1 29 23 14 195 1 1 25 13 17 201 4 1 57 1 0 2 39 107 0 0 0 0 442 33 1413 Matches for all rows except the last are filtered for 95% identity; matches for the last row are filtered at 97% identity. when querying the unmasked genome. These discrepant cases are not investigated further because they are owing to either the heuristic nature of MegaBLAST or the 97% identity filtering. Query portions of remaining alignments in ‘No overlap’ columns of Table 2 are transformed into regions for sets (R3), (R4) and (R5) as described in Subsection 2.5.2. WindowMasker: masker for sequenced genomes 3.3 Analysis of regions (R1)–(R5) Regions are investigated as per the protocol in Subsection 2.5.3. Tables 3 and 4 tabulate regions in sets (R1)–(R5) according to the number of alignments to the unmasked genome counting only the alignments that cover at least 75% of the length of the region with 95% identity. We note that there is an insignificant amount of change if we tabulate results at 25% length coverage instead of 75% (data not shown). Next, we discuss Tables 3 and 4. 3.3.1 MegaBLAST performance on regions masked only by WM or RM We did not find any regions in sets (R1) and (R2) that have a small pattern. As shown by the second column in Table 3, the vast majority of the regions in set (R2) do not have a generative tandem repeat, while the vast majority regions in set (R1) do. This should be interpreted as a hint that the regions masked only by WM are likely to be generated by globally repetitive patterns that were not included in the RepBase library. Almost all regions in set (R2) have fewer than 10 matches, which suggests that they should not be masked. Almost all the regions in set (R1) have 51 matches, strongly suggesting that WM is justified in masking them. The most troubling exceptions to this are 4 sequences in the human build 34.1 that are masked by WM, have no generative pattern and get 10 matches. Two out of these four are from the portion of contig NT_004538.15 that is derived from AL357500.17; both sequences get over 100 000 blastn matches that are not big enough to be counted at 75% length coverage. The third sequence also belongs to the same contig, while the last sequence has two patterns that are covering the sequence such that neither one is a tandem repeat and hence neither is reported by TRF. 3.3.2 MegaBLAST performance on RepeatMasked and WindowMasked genomes For the human, mouse and fruitfly genomes, we collected regions in sets (R3) and (R4). Table 4 shows that there are only five regions in (R4) that generate more than 50 matches to an unmasked database. This suggests that only those five regions align to repetitive DNA, and that the DNA sequences that would align to the other regions in (R4) were correctly left unmasked by WM. Table 4 also shows that a significant number of regions in (R3) also have many matches to an unmasked database. We infer that WM is justified in masking the sequences that would align to these regions. There are 27 cases where one might question the results of WM because the query, originally identified because of a unique match to the RepeatMasked genome, yielded at most 10 matches when compared with the unmasked genome. For these 27 cases, we investigated the subject portion of the original unique alignments to the RepeatMasked database. In all cases but one, the subject portion is heavily masked by both RM and WM, but a small amount of additional masking by WM eliminates the seed for a MegaBLAST alignment. The lone exception is a 108 base match at 97.22% identity between AL451144.5 and NT_006713.13, where 80 out of 108 bases are found to have low complexity by DUST. 3.3.3 MegaBLAST performance on genomes without a RepBase library For the honeybee, rat and D. pseudoobscura genomes, we collected regions in set (R5). Table 4 shows that generative patterns from these regions are highly repetitive. Some regions that do not have a generative pattern, particularly those from rat, do not often align to an unmasked genome. The row with 4189 ‘No pattern’ for Table 5. Counts of intermediate and final outputs when running NCBI’s genome annotation pipeline using a DNA database of human chromosome 1 masked by either RM or WM Category mRNA queries RM WM MegaBLAST matches 203 751 Transcripts aligned 14 874 (16 920) Unique transcripts 36 (97) EST queries RM WM 127 664 1 677 974 14 948 562 989 (16 833) (658 437) 110 8 521 (147) (11 943) 1 344 116 565 745 (652 112) 11 277 (13 539) Numbers in brackets are total number of gene models found for transcripts in that category. rat in Table 4 counts matches filtered for 97% identity. For rat, we did not generate a complete histogram of the number of matches to each query portion when matches are filtered for 95% identity, but we determined that only 675 query regions match less than 50 times against the unmasked genome. We did no further analysis to see if many more alignments would be found using more sensitive parameters to MegaBLAST or a more sensitive algorithm such as blastn. Of the 10 803 598 matches to the unmasked rat genome in the ‘No overlap’ column of Table 2, 6 240 141 matches, which comprise 4473 query regions, cover at least 75% of the query. Of these 6 240 141 matches, the 675 non-repetitive query regions account for only 1335 (<0.02%) matches. This seems to be acceptable performance for the WM program, especially when no alternative method of masking is available. 3.4 A test of alignment to transcribed regions The above tests and analysis concern entire genomes and do not distinguish between transcribed and untranscribed intervals. Since many searches are intended to identify matches in transcribed regions, we assessed WM performance in leaving the non-repetitive portions of these regions unmasked for matching. NCBI’s genome pipeline group applied their software (Kitts, 2003) to a DNA database for human chromosome 1. As above, the database was masked using either RM or WM. To measure the final output quantity and quality, we evaluated how many mRNAs or ESTs can be aligned as gene models. The masked databases are used by the genome pipeline software for the intermediate step of MegaBLAST searching. Therefore, another useful measure of performance is the number of MegaBLAST matches, which indicates how many candidate pieces of transcript may have to be spliced together to make a gene model. Table 5 shows that using WM, rather than RM, to mask the database enables the identification of slightly more (<1%) mRNAs, along with a substantial reduction (20–30%) in the number of MegaBLAST matches that are considered potential alignment pieces. Table 6 shows that using WM, rather than RM, also results in the creation of more gene models with higher length coverage. The improvement in final outputs is not substantial but does show that WM is not excessively masking transcribed regions, while substantially reducing the number of matches to consider. 3.5 Computation time To compare running times, we ran RM and WM on the human genome, one chromosome at a time, using multiple uniprocessors 139 A.Morgulis et al. Table 6. Alignments in the second row of Table 5 are classified by the percentage of the transcript covered by the gene model % Coverage All 50 50–75 75–90 90–95 95–99 >99 Total Longest 50 50–75 75–90 90–95 95–99 >99 Total RM mRNA WM mRNA RM EST WM EST 403 458 553 638 4266 10 602 16 920 381 452 536 634 4273 10 557 16 833 27 190 47 448 65 429 52 453 130 601 335 316 658 437 23 866 45 723 64 021 51 973 130 666 335 863 652 112 191 221 301 457 3839 9865 14 874 189 222 299 453 3856 9929 14 948 18 621 39 211 55 517 42 652 110 625 296 363 562 989 16 254 37 964 55 439 43 118 112 227 300 743 565 745 ‘All’ refers to all gene models found for transcripts, while ‘Longest’ refers to the subset of gene models where only the longest gene model for each transcript is considered. on a cluster of processors. RM took 1045 h of CPU time, while WM took under 11 h of CPU time. Of these 11 h, 1.2 h are consumed by the first pass of Nmer counting; if one also checks for duplicate contigs in the first pass, then the time increases to 4.5 h. Thus, we conclude that WM without the check for duplicate contigs is nearly two orders of magnitude faster than RM for genomes that are roughly the size of the human genome (3 Gb). 4 DISCUSSION We describe new software called WindowMasker (WM) that identifies and masks highly repetitive subsequences in the DNA sequence of a genome. Unlike the widely used RepeatMasker (RM) software (Bedell et al., 2000; Smit and Green, 1996), WM is not dependent on a library of known repeat sequences, such as RepBase (Jurka, 2000). WM is orders of magnitude faster than RM or other alternatives, such as RECON (Bao and Eddy, 2002), that use local alignment to identify the location of repeated sequences. WM has been put into production usage at NCBI, and the first new genome to which it was applied is that of the cow (Bos taurus). We tuned WinMask using the human genome and then tested the combination of DUST and WinMask on the genomes of mouse, rat, fruitfly, D. pseudoobscura and honeybee. The tests combine WM with the intended application of sequence database searching by evaluating the quality of MegaBLAST (Zhang et al., 2000) outputs when searching a masked database. Our tests show that MegaBLAST produces better results when searching a database masked by WM than it does when searching a database masked by RM. This improvement is seen even for the human genome, for which there has been adequate time to develop a good library of repeats. Therefore, WM will be particularly useful for masking recently sequenced genomes for which no repeat library yet exists. One limitation of our tests with RM is that we set the parameters according to the developers’ recommendations, but did not consider the possibility that other parameter settings would give 140 better results. When our tests show that RM over or under masks some region, we do not attempt to assess whether this is due primarily to the RM parameters or to the entries in the RepBase library. The RM documentation suggests that the interesting regions are those that code for proteins. In this study, we chose to define the interesting regions as those that are transcribed because (1) this simplified our testing and (2) it acknowledges the growing interest in untranslated, functional RNAs and the 50 and 30 untranslated regions adjacent to translated RNAs. Our tests on human chromosome 1 show that slightly more gene models are created using a WindowMasked database than using a RepeatMasked database and that the use of a WindowMasked database substantially reduces the number of partial matches that need to be processed. The principal disadvantage of using WM instead of RM is that WM does not annotate the processed sequence, whereas regions masked by RM are annotated as belonging to at least one known family of repeat sequences. One basic problem suggested by the contrasting methods is whether the intervals masked by WM can be used to generate an initial library of repeats for a genome? Two approaches to this problem might be (1) to cluster the WinMasked regions by length and some properties of the unit scores they contain or (2) to use the RECON method on a cleverly chosen subset of pairs of intervals. In its current implementation, WM takes as input one or more FASTA-formatted files and produces as output a FASTA-formatted file or a more compact representation of the masked intervals. In practice, genome sequences are initially presented as traces from a sequencing device, including real number measurements (dye intensity or mass or radioactivity) for each possible letter at each position. It may be useful to determine if masking can be done directly on sequence traces without interposing additional software to ‘call’ a unique letter for each position. For the subproblem of masking low-complexity sequences, we used a modified version of DUST, because (1) DUST has been used successfully for many years to filter DNA queries to BLAST and (2) our focus is on the WinMask method for masking global repeats. It would be interesting to try other software methods, such as STAR (Delgrange and Rivals, 2004), Sputnik (Abajian, 1994, http://espressosoftware.com/pages/sputnik.jsp; Morgante et al., 2002), Poly (Bizzaro and Mark, 2003) that are specifically designed to find microsatellite sequences with some mutations. Longer approximate tandem repeats can be found with Tandem Repeats Finder (Benson, 1999) or ATRHUNTER (Wexler et al., 2005). An older alternative, similar to DUST, is the Simple method used in (Hancock, 2002). Several other groups have proposed combinatorial approaches to repeat identification that also do not use a library. Such software packages include RAP (Campagna et al., 2005), also based on word counts; REPuter (Kurtz et al., 2001), based on suffix trees; FORrepeats (Lefebvre et al., 2003), based on an automaton construction; an unnamed software package from Cold Spring Harbor Laboratory (Healy et al., 2003), based on the Burrows-Wheeler transform and suffix arrays; and repeat detection module of FragmentGluer (Pevzner et al., 2004), based on a A-Bruijn graph approach. In general, these programs are interested in detecting repeats that occur more than once without concern for the frequency or variability. To our knowledge, none of these packages has been used to produce a database of masked sequences for use in WindowMasker: masker for sequenced genomes homology searching. Of the packages mentioned, RAP is closest in method to our word-counting approach. The developers of RAP chose to use discontiguous words (Campagna et al., 2005), whereas WM uses counts of contiguous words. We tested some discontiguous words in an early implementation of WinMask and found no advantage (data not shown). However, this issue deserves further study, especially in consideration of the success of discontiguous words in homology searching (see e.g. Ma et al., 2002). Besides the choice of contiguous versus discontiguous words, three other design decisions for WinMask, described in Subsection 2.2, may be of concern: the choice of score thresholds in the first pass, the definition of the window_score function (at step 1 in the second pass), and the rule (at step 3 in the second pass) for masking in between consecutive already masked (at step 2 in the second pass) regions. For the score threshold, we chose plausible values based on Nmer counts for several genomes; we display those counts in Table S1 and Figure S1. A more principled method of choosing the thresholds would be of interest. For the window_score function we also considered the minimum unit score and median unit score instead of the average; using the average gave better results on the human genome (data not shown). For step 3 in the WinMask algorithm, we considered instead (1) masking between consecutive masked intervals if the distance in bases between them is sufficiently low regardless of score and (2) masking between already masked intervals if the average of the unit scores in the coalesced interval is above some threshold. The rule described above as step 3 performed better than these two alternatives on the human genome (data not shown). Another direction for further investigation is whether either the combinatorial or the library-based approach to masking can be extended from inputs of single genomes to species-heterogeneous databases such as GenBank’s non-redundant (nr) database? In conclusion, WM produces better results than RM without requiring a library of repeats. ACKNOWLEDGEMENTS We express our thanks to Stephen Altschul, David Lipman and Jim Ostell for advice and encouragement. We also thank Jo McEntyre for editorial assistance. We thank Victor Sapojnikov for assistance in testing both RM and WM and for guidance in enhancing the input/ output options so that WM could be used in NCBI’s genome pipeline. We also thank the NCBI genome pipeline group for carrying out the tests of alignment to transcribed regions. A.M. is supported under a contract to MSD Inc., Fairfax, VA, USA. Funding to pay the Open Access publication charges for this article was provided by the Intramural Research Program of the National Institules of Health, NLM. Conflict of Interest: none declared. REFERENCES Abajian,C. (1994) Sputnik. Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST—a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. Bao,Z. and Eddy,S.R. (2002) Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res., 12, 1269–1276. Bedell,J.A. et al. (2000) Maskeraid: a performance enhancement to repeatmasker. Bioinformatics, 16, 1040–1041. Benson,D.A. et al. (2004) Genbank: update. Nucleic Acids Res., 32, D23–D26. Benson,G. (1999) Tandem Repeats Finder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573–580. Bizzaro,J.W. and Marx,K.A. (2003) Poly: a quantitative analysis tool for simple sequence repeat (SSR) tracts in DNA. BMC Bioinformatics, 4, 22. Campagna,D. et al. (2005) RAP: a new computer program for de novo identification of repeated sequences in whole genomes. Bioinformatics, 21, 582–588. Delgrange,O. and Rivals,E. (2004) Star: an algorithm to search for tandem and approximate repeats. Bioinformatics, 20, 2812–2820. Hancock,J.M. (2002) Genome size and the accumulation of simple sequence repeats: implications of new data from sequencing projects. Genetica, 115, 93–103. Havlak,P. et al. (2004) The Atlas genome assembly system. Genome Res., 14, 721–732. Healy,J. et al. (2003) Annotating large genomes with exact word matches. Genome Res., 13, 2306–2315. Jurka,J. (2000) Repbace update: a database and an electronic journal of repetitive elements. Trends Genet., 16, 418–420. Kitts,P. (2003) Genome assembly and annotation process. In J.McEntyre and J.Ostell (eds). The NCBI Handbook, National Library of Medicine, US, chapter 14. Koenig,M. et al. (1988) The complete sequence of dystrophin predicts a rod-shaped cytokeletal protein. Cell, 53, 219–228. Kurtz,S. et al. (2001) Reputer: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res., 29, 4633–4642. Lefebvre,A. et al. (2003) Forrepeats: detects repeats on entire chromosomes and between genomes. Bioinformatics, 19, 319–326. Ma,B. et al. (2002) Patternhunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445. Morgante,M. et al. (2002) Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nat. Genet., 30, 194–200. Pevzner,P.A. et al. (2004) De novo repeat classification and fragment assembly. Genome Res., 14, 1786–1796. Schmid,C.W. and Jelinek,W.R. (1982) The Alu family of dispersed repetitive sequences. Science, 216, 1065–1070. Smit,A.F.A. and Green,P. (1996) Repeatmasker. Smit,A.F.A. et al. (1995) Ancestral, mammalian-wide subfamilies of LINE-1 repetitive sequences. J. Mol. Biol., 246, 401–417. Wang,J. et al. (2002) RePs: a sequence assembler that masks exact repeats identified from shotgun data. Genome Res., 12, 824–831. Weber,J.L. and May,P.E. (1989) Abundant class of human DNA plymorphisms which can be typed using the polymerase chain reaction. Am. J. Hum. Genet., 44, 388–396. Wexler,Y. et al. (2005) Finding approximate tandem repeats in genomic sequences. J. Comp. Biol., 12, 928–942. Zhang,Z. et al. (2000) A greedy algorithm for aligning DNA sequences. J. Comp. Biol., 7, 203–214. 141
© Copyright 2026 Paperzz