WindowMasker: window-based masker for sequenced genomes

BIOINFORMATICS
ORIGINAL PAPER
Vol. 22 no. 2 2006, pages 134–141
doi:10.1093/bioinformatics/bti774
Genome analysis
WindowMasker: window-based masker for sequenced genomes
Aleksandr Morgulis, E. Michael Gertz, Alejandro A. Schäffer and Richa Agarwala
National Center for Biotechnology Information, National Institutes of Health, Department of Health
and Human Services, Building 38A, Room 1003N, 8600 Rockville Pike, Bethesda, MD 20894, USA
Received on July 26, 2005; revised and accepted on November 8, 2005
Advance Access publication November 15, 2005
Associate Editor: Chris Stoeckert
ABSTRACT
Motivation: Matches to repetitive sequences are usually undesirable
in the output of DNA database searches. Repetitive sequences need
not be matched to a query, if they can be masked in the database.
RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of
repetitive template sequences, such as a manually curated RepBase
library, that may not exist for newly sequenced genomes.
Results: We have developed a software tool called WindowMasker
(WM) that identifies and masks highly repetitive DNA sequences in a
genome, using only the sequence of the genome itself. WM is orders of
magnitude faster than RM because WM uses a few linear-time scans of
the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate
WM by comparing BLAST outputs from large sets of queries applied to
two versions of the same genome, one masked by WM, and the other
masked by RM. Even for genomes such as the human genome, where a
good RepBase library is available, searching the database as masked
with WM yields more matches that are apparently non-repetitive and
fewer matches to repetitive sequences. We show that these results hold
for transcribed regions as well. WM also performs well on genomes for
which much of the sequence was in draft form at the time of the analysis.
Availability: WM is included in the NCBI C11 toolkit. The source
code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/
ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the
instructions for building WindowMasker application in the UNIX environment can be found in file src/app/winmasker/README.build.
Contact: [email protected]
Supplementary information: Supplementary data are available at
ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/window
masker_suppl.pdf.
1
INTRODUCTION
In searching a DNA database, subsequences repeated hundreds of
times in the database may be found to match the query, but such
repetitive matches are usually not of biological interest. Finding
matches to sequences that consist entirely of repetitive elements
[such as microsatellites (Weber and May, 1989), Alus (Schmid and
Jelinek, 1982) and LINEs (Smit et al., 1995)] can be avoided by
To whom correspondence should be addressed.
masking such sequences in the database. Because running times of
database searches typically grow at least linearly with the number of
alignments found, masking the database to remove repetitive
sequences in the first place is much more efficient than filtering
intermediate outputs. Therefore, we consider the problem of masking a genomic database that is to be used in BLAST searches. Our
central aim is to preserve BLAST matches to sequences that are
unique or have low copy number, while preventing BLAST from
finding and reporting matches to repeated sequences with many
nearly identical copies.
We introduce the software package WindowMasker (WM) that
implements a method to mask a DNA genome for repeats using
only the sequence of the genome itself. NCBI’s BLAST Web page
service received over 28 000 queries to genome-specific DNA
databases during the first week of June 2005, and an average of
3000 queries per day in the entire month. These queries included
thousands of queries for organisms such as the dog (405 dogspecific queries recorded in the first week) that have been recently
sequenced and have no adequate masking for repeats. Since WM
uses only the sequence of the genome, it can be applied each time
the genome sequence is ‘finished’ or even in draft stages.
To our knowledge, RepeatMasker/MaskerAid (Bedell et al.,
2000; Smit and Green, 1996, http://www.repeatmasker.org/
webrepeatmaskerhelp.html) (RM) is currently the most widely
used software for masking DNA databases. RM uses local alignment methods to compare subsequences in the genome with a
curated library of repeat sequences, such as RepBase (Jurka,
2000), to determine the regions to be masked. In addition to masking
repetitive sequences, RM uses the annotation in a repeat library to
annotate the masked sequences with a description of what types of
repetitive elements are contained therein. WM does not use repeat
libraries, which are labor-intensive to create and curate, and may not
exist for newly sequenced genomes. The aim of WM is to replace
the use of RM for masking DNA databases prior to homology
searches. WM is not capable of annotating masked regions at
this time, but this does not compromise its utility for creating a
masked database to be used by BLAST searches. Annotation of
repeats is of use to biologists, but it is not used by BLAST searches.
Another software package is RECON (Bao and Eddy, 2002).
Like RM, RECON uses local alignment methods to compare
subsequences in the genome. Unlike RM, RECON compares the
Published by Oxford University Press 2005.
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access
version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University
Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its
entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected]
WindowMasker: masker for sequenced genomes
genome with itself. The design of RECON suggests that its
intended use is the compilation of repeat families for studying
genomic evolution.
WM uses separate methods to identify two categories of
likely repeats, low-complexity sequences and global repeats.
A low-complexity sequence contains tens of letters and does not
necessarily occur many times throughout the genome. The following is an example of a sequence with a low-complexity interval.
(R2)
(R3)
GGTTGGTCAAATAAAAAGTGATGTATGAAAAAGAGG
CAAAACAACAAGAAGAAAAGATTGAAAAAATGAGAG
CTGAAGATGGTGAAAATTATGACATCAAAAAGCAGG
This sequence occupies positions 85 653 through 85 760 in a large
clone AL451144.5 from human chromosome X; this sequence is part
of a pseudogene that lies in an intron of the dystrophin gene (Koenig
et al., 1988). A portion of the above sequence is considered to have
low complexity, mostly because of the multiple A monomers. Lowcomplexity sequences are identified by a module called DUST that
has been part of the NCBI C (and now C++) toolkit for many years. We
present DUST in supplementary material because it has not been
described in print, and we have made some improvements.
By global repeats we mean families of nearly identical sequences
that occur multiple times in not necessarily adjacent places in the
genome. One expects that global repeats originated as one sequence
in some primordial genome, and that sequence was duplicated or
reinserted and subsequently mutated numerous times on the way to
the genome being masked. WM identifies likely global repeats using
a module named WinMask that finds all Nmers that occur far more
often than expected in windows of size N + 4.
The distinction between low-complexity sequences and global
repeats is operational, not mathematical; it is defined by what
sequences are masked by the two modules of WM. The two modules
of WM are applied independently, and the set of letters masked in
the output is the union of the sets masked by each module.
Counting of Nmers has been used in other types of genome
analysis. For example, identifying unique Nmers is important in
primer design for STS markers (Healy et al., 2003). Masking
based on exactly repeated Nmers has been used in several genome
assembly packages such as Reps (Wang et al., 2002) and Atlas
(Havlak et al., 2004). These packages make masking decisions
based on the repetition count of each individual Nmer, so N has
to be large enough (e.g. 20 in Reps and 32 in Atlas) that the
multiple occurrence of the same Nmer is much more likely due
to a duplication than due to chance. In WinMask, we use much
shorter Nmers (N 15), combined with simple statistics on multiple, distinct Nmers in a sliding window, to distinguish highly
repetitive sequences from chance recurrences.
The effectiveness of WM for database searching is demonstrated
by showing that among the regions of sequences from genomes
tested, regions that have a large number of matches to the unmasked
genome, indicating that the region is likely repetitive, are usually
masked by WM. Conversely, we also show that regions that have a
small number of matches to the unmasked genome, indicating that
the entire region is likely non-repetitive, are usually left unmasked
by WM. Five types of regions that were tested are as follows:
(R1)
The 50 longest contiguous regions of genomic database
sequences that were masked by WM but that did not intersect
a region masked by RM.
(R4)
(R5)
The 50 longest contiguous regions of genomic database
sequences that were masked by RM but that did not intersect
a region masked by WM.
Regions from high quality BLAST matches found in an
RM-masked genome and not found in the WM-masked
genome. Note that the matches are missed by BLAST due
to masking of WM but not missed by BLAST due to
masking of RM. The matches were found using sample
queries (described below) and the regions are taken from
the sample queries. A match is defined to be a high quality
match if it is at least 92 bases long and has at least 97% identity
and a match M is said to be not found in a set S if for every
match N in S, the query or subject interval of N does not
overlap the query or subject interval of M. The 92 base length
and 97% identity thresholds are used in several MegaBLAST
applications at NCBI.
Regions from high quality BLAST matches found in a WMmasked genome and not found in the RM-masked genome.
Note that the matches are missed by BLAST due to masking of
RM but not missed by BLAST due to masking of WM.
Regions from high quality BLAST matches found in an
unmasked genome and not found in the WM-masked genome.
Note that the matches are missed by BLAST due to masking
of WM.
The regions in sets (R1) and (R2) are selected to compare RM and
WM directly, sets (R3) and (R4) compare the effect of RM and WM
on BLAST searches for the genomes for which a RepBase library is
available, and set (R5) shows the effectiveness of WM for genomes
for which no RepBase library is available. We show that WM
performs the desired masking, is more effective than RM for the
task of masking a genome for BLAST searches, and does not require
a library of repeats.
In Section 2, we describe WinMask and methods used in evaluating the performance of WM. Section 3 describes tests of WM on
several genomes and reports on direct comparisons with RM. We
also show that WM does not excessively mask transcribed regions
and substantially reduces the number of matches to consider for
making gene models. We close with a discussion of related work
and open problems. Source code for WM is available as part of the
NCBI software toolkit ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/
CURRENT/. Once the toolkit source is unpacked, the instructions
for building WindowMasker application in the UNIX environment
can be found in file src/app/winmasker/README.build.
2
MATERIALS AND METHODS
The WM program is written in C++ and included in the NCBI software
toolkit. Numerous auxiliary programs used to automate tests and summarize
results were written in C, C++ and Perl.
2.1
List of genomes tested
We developed and tuned the WinMask algorithm described below using
build 34.1 of the human genome. The combination of DUST and WinMask
was then tested on five other genomes without further tuning of parameters.
In the text below, these genomes are referred to by their English, nonscientific names: human (Homo sapiens), mouse (Mus musculus), rat (Rattus
norvegicus), honeybee (Apis mellifera) and fruitfly (Drosophila melanogaster), except for Drosophila pseudoobscura. DNA sequences for the D.
pseudoobscura genome were retrieved in FASTA format from GenBank
(Benson et al., 2004). All other genome sequences were retrieved via NCBI’s
135
A.Morgulis et al.
MapViewer where current sequence releases (called ‘builds’) are available
as hyperlinks. Data on older builds are available on the NCBI ftp site under
ftp://ftp.ncbi.nlm.nih.gov/genomes/ or may be requested by sending an email
to [email protected]
2.2
The WinMask procedure for finding
global repeats
WinMask is a two-pass algorithm. In the first pass, it calculates the size N of
Nmer to consider, Nmer frequency counts for the genome, threshold scores
for the algorithm, and a score function that is based on the frequencies and
threshold scores. In the second pass, it calculates masked regions for the
genome using the previously generated score function and thresholds.
A genome is presented to WinMask as a collection of FASTA formatted
files, each file containing one or more sequences. Each sequence, commonly
called a contig, represents a contiguous region of the genome that includes
no large gaps (typically 100 or more bases). Contigs are not concatenated;
an Nmer must occur entirely between the endpoints of a single contig. The
genome sequences should include only one strand from each complementary
pair of DNA strands. We found that this is not always the case; therefore, the
first pass of WinMask has an option to check the input data for likely
duplicate contigs.
The first pass of the WinMask algorithm proceeds as follows:
(1) Determine the size of Nmer to use when masking this genome. Let L
be the sum of the lengths of all contigs. Then N is the smallest integer
such that L/(4N) < 5.
(2) Scan the contigs to determine frequency counts of all Nmers. Add the
frequency counts for reverse complement of contigs. We use the notation freq(S) to denote the number of times Nmer S occurs in the genome.
(3) Calculate cutoff values Tthreshold, Textend, Tlow and Thigh. Let C be the
total number of distinct Nmers that occur at least once in the genome.
The cutoff Tthreshold is the largest value of T that satisfies
sizeðfS j S is an Nmer and freqðSÞ TgÞ 0:995 · C:
This is equivalent to the 99.5 percentile. Similarly, the set Textend,
Tlow and Thigh are defined as the 99.0, 90.0 and 99.8 percentiles of
the same empirical cumulative distribution function, respectively.
The score of an Nmer S is defined as
8
if freqðSÞ T high ;
< T high
Nmer scoreðSÞ¼ dT low =2e if freqðSÞ < T low ; and
:
freqðSÞ otherwise:
An Nmer with a frequency count greater than Thigh is assigned a score Thigh
rather than freq(S). This rule limits the effect of some highly repetitive
Nmers. Similarly, any Nmer with a frequency count below Tlow is assigned
score dTlow/2e. This rule allows us to avoid storing frequency counts for a
large number of infrequently occurring Nmers, while not compromising the
results in practice.
The second pass of WinMask computes masked regions for all or part of
the same genome. For each sequence to be masked, WinMask applies the
following algorithm:
(1) Scan all windows W‘ ¼ a‘ am of length N + 4 in the sequence to
be masked, and assign each window a score
!
‘+4
X
5‚
Nmer_scoreðaj ai + N1
win_scoreðWÞ¼
Values of N, Tthreshold, Textend, Thigh and Tlow for different genomes we
considered are shown in the Table S1 in Supplementary Information. Portions of the cumulative Nmer frequency distribution are plotted for human,
rat, honeybee and fruitfly genomes in the Figure S1 in Supplementary
Information. Distributions for mouse and D. pseudoobscura are not
shown as they are indistinguishable from those for human and fruitfly,
respectively. The Nmers having the same frequency or rank differ substantially for different genomes, although this is not apparent from the plot of
frequency distributions.
2.3
Usage of RepeatMasker and MaskerAid
Masking done with RM used the May 15, 2002, version of RepeatMasker
(Smit and Green, 1996) in conjunction with MaskerAid (Bedell et al., 2000).
Parameter settings were those recommended on the MaskerAid Web site.
The time-consuming part of RepeatMasker is the repeated use of the program
cross_match for the local alignment of DNA sequences. The MaskerAid
module is used to identify most global repeats with permissive (and hence
relatively fast) parameters to cross_match via the MaskerAid parameters
-w -nolow -norna. Then, cross_match is used directly with the more
restrictive parameter -noint to identify low-complexity sequences. The runs
of RM were done using a recent version of RepBase for the genome being
masked, and it was the same version being used in the NCBI genome
pipeline (Kitts, 2003, http://www.ncbi.nlm.nih.gov/books) at the time of
the experiments.
2.4
Usage of tandem repeats finder
Preliminary analysis suggested that many regions in sets (R1)–(R5) are not
entirely covered by low-complexity regions or global repeats, but contain a
substantial subinterval generated by tandem repetition of a smaller motif.
To investigate this formally, we used the Tandem repeats finder (TRF)
software, version 3.21, with the parameters recommended in the documentation (Benson, 1999), to identify regions containing a tandem repeat. If a
tandem repeat is found, we suspect that the minimal repetitive motif may be
found many times across the genome, and this motif is evaluated further
instead of the region that had the motif.
We restricted attention to only those tandem repeats that cover at least
50% of the region of interest. The tandem repeats found are further subdivided into small patterns or generative patterns based on the length of the
repeat. The reason for using the term ‘generative’ rather than the antonym
‘large’ is to reinforce the intuition that these larger patterns effectively
‘generate’ at least 50% of the region via a process of duplication and mutation. A repeat is called small [generative] if its length is less than [at least]
19 bases for human, mouse and rat genomes or less than [at least] 17 bases
for Drosophila, honeybee and D. pseudoobscura genomes. The lengths 19
and 17 are the window sizes used in the WinMask module when masking
those genomes with WM. Any region that has neither a small nor a generative pattern is called a region with no patterns. Almost all small patterns
observed are di- or tri-nucleotide patterns.
2.5
BLAST runs and queries for testing
In Section 3, we describe numerous results based on outputs of MegaBLAST
(Zhang et al., 2000) or blastn (Altschul et al., 1997) modules of the BLAST
software package for sequence alignment. All runs of MegaBLAST and
blastn were done with default parameter settings, with the following
exceptions:
i¼‘
(2) Mask any window whose score is at least as great as Tthreshold.
We turned off the DUST filtering of the query sequence using
parameter -FF in order to assess the low-complexity filtering provided
by RM and WM.
(3) Mask the interval between two consecutive windows masked in the
previous step if, and only if, every base in the interval is in a window
that has a score at least as great as Textend.
For certain smaller genomes, we ran MegaBLAST with word size 16 as
well as with default word size 28; we clearly mark results generated using
the smaller word size in Section 3.
where integer division is used.
136
WindowMasker: masker for sequenced genomes
The tests of alignment to transcribed regions carried out by the NCBI
genome pipeline group use parameter settings -r10 -q-25 -X500 -Fm
for MegaBLAST.
Both blastn and MegaBLAST search for short exact alignments that are
known as seeds. One major difference is that blastn uses by default a minimum seed size of 11, while MegaBLAST uses by default a minimum seed
size of 28; some MegaBLAST tests were done with a seed size of 16. Partly
because it uses a smaller seed size, blastn is slower and may print many more
matches in the output. The effect that masking a database has on a blastn
or MegaBLAST search is to forbid any masked base from participating
in a seed. The algorithm ultimately extends seeds into gapped alignments.
Masked bases are permitted to participate in gapped extensions exactly as if
they were not masked. We note, however, that both blastn and MegaBLAST
provide the option of treating masked bases as if they were a special base
that mismatched every other base. We did not use this option in performing
our tests.
2.5.1 Sample queries To test the effect of masking on MegaBLAST
searches, we selected a separate set of 300 sample query sequences for each
genome. Each set of 300 query sequences is divided into subsets of 100 each
in three length ranges: small (500 bases), medium (10 kb), and large
(100 kb). Queries for each length range were randomly selected from
within bacterial artificial chromosome (BAC) sequences from the genome
being queried, with the exception of D. pseudoobscura for which we used
query sequences from D. melanogaster.
MegaBLAST runs using the chosen sample queries were performed using
both a WindowMasked and an unmasked database for each genome. For
those genomes for which a library of repeats is available, an additional set of
runs was performed using a RepeatMasked database. Results of the MegaBLAST runs were filtered for high quality matches, as defined in Section 1,
to ensure a high probability that what we identify in our automated processing as a match is a true positive.
2.5.2 Query regions selected for further study It is common for a
nearly identical interval in a sample query to participate in many distinct
alignments that yield regions in sets (R1)–(R5). Therefore, intervals within
each set were coalesced if they overlapped by at least 90%, and the ends also
did not extend more than 20 bases.
2.5.3 Investigation of regions We classified all regions in sets
(R1)–(R5) as having small, generative, or no patterns as described in
Subsection 2.4. Regions having a small pattern were not analyzed further;
they are low-complexity regions. Any generative patterns found within the
regions were used as queries instead of the region they generate and aligned
using blastn against the unmasked genome from which they originated.
Results were filtered to require a minimum of 95% identity and a minimum
length of 19 or 17 bases, whichever corresponds to the window size for the
genome. Regions that had no discernible pattern were used in their entirety
as queries and aligned against the appropriate unmasked genome. Query
sequences that originated from sets (R1) and (R2) were aligned to genomes
using blastn, whereas query sequences that originated from sets (R3), (R4)
or (R5) were aligned using MegaBLAST because of the prohibitive amount
of output generated if blastn is used. These results were filtered to require
a minimum percentage identity of 95% and a minimum length of 75% of
the query length. For all genomes except rat, we used a minimum of 95%
identity; for rat we used 97% identity, as there were a large number of
matches even with this more stringent criterion.
2.5.4 Evaluation of results The primary means of evaluating the
results of a BLAST run is to count the number of output alignments meeting
the prescribed length and identity criteria. We classify such runs and their
output counts into three ranges: 1–10 matches, 11–50 matches and >50
matches. We suggest that output with 1–10 matches is desirable and should
not be eliminated by masking, while output with >50 near exact matches is
undesirable and indicates insufficient masking in the matched database
regions. Whether outputs falling in the intermediate range are desirable
Table 1. Percentage of base pairs masked by RM, WM, and by both RM
and WM
Genome
Genome (bp)
RM (%)
WM (%)
Common (%)
Human 34.1
Mouse 32.1
Mouse 33.1
Fruitfly 6.3
Honeybee 1.1
Pseudoobscura
Rat 2.1
2 862 648 857
2 763 483 754
2 748 835 566
116 781 562
205 571 295
127 928 143
2 839 476 130
48.4
45.4
45.6
8.8
—
—
—
37.4
36.9
36.9
21.6
44.2
24.0
41.6
29.6
29.9
29.9
6.4
—
—
—
or not depends on the user’s taste and the biological motivation for the
query. In our tests, the number of cases falling in the intermediate range
is usually a small minority. Therefore, the qualitative evaluation does not
depend much on the specific choices of 11 and 50 as the bounds of the
intermediate range.
2.5.5 Test of alignment to transcribed regions For a test of alignments to human regions that are likely to be transcribed, the NCBI genome
pipeline group collected sets of mRNAs and ESTs to align. Then they used
MegaBLAST and other programs exactly as described in (Kitts, 2003) so that
results would be comparable with NCBI’s annotation of the human genome
that utilized RM. The mRNAs and ESTs are aligned as full length transcripts
predicted by the genome pipeline. These predicted full-length transcripts are
called gene models below. The test of alignment to transcribed regions by
gene models was applied to human chromosome 1, not the entire genome.
3
RESULTS
WM was developed using the human genome and then tested on the
genomes of mouse, fruitfly, rat, honeybee and D. pseudoobscura.
RepBase (Jurka, 2000) has good libraries for human, mouse and
fruitfly. It currently does not have libraries for the recently
sequenced rat genome or the partially sequenced honeybee and
D. pseudoobscura genomes. For all genomes, we document the
amount of masking and the effect of masking on BLAST searches.
For the three genomes with RepBase libraries, we do analysis for
regions in sets (R1)–(R4), while for the three genomes without
RepBase libraries, we do analysis for regions in set (R5). For honeybee and D. pseudoobscura, the difference in results between using
an unmasked database and a WindowMasked database is reduced if
MegaBLAST seed size is 16 instead of the default seed size of 28.
We do additional analysis for matches found only when using an
unmasked database and a word size of 16. Results of aligning
mRNAs and ESTs to human chromosome 1 masked using either
RM or WM are also presented.
3.1
Tests of the amount of masking
We did a direct comparison of the portions of the genome masked
by RM against the portions masked by WM.
Table 1 gives the percentages of the genomes masked by each
method and the percentages masked in common. Pairs of contiguous
regions can be classified as completely overlapping (same end
points found by both methods), partially overlapping (regions overlap, but not completely) and nonoverlapping. Table S2 in Supplementary Information summarizes the lengths of contiguous regions
found by one method that do not overlap a region found by the other.
A subset of these regions forms sets (R1) and (R2) for further study.
137
A.Morgulis et al.
Table 2. MegaBLAST results where comparison with WM uses RM database for human, mouse and Drosophila and unmasked (UM) database for honeybee,
D.pseudoobscura and rat
Match count
UM
Human 34.1
Mouse 32.1
Mouse 33.1
Fruitfly 6.3
Honeybee 1.1
Honeybee 1.1 W16
Pseudoobscura
Pseudoobscura W16
Rat 2.1
RM
199 650
1 311 680
1 273 865
7 379
15 978
16 167
328
323
10 815 737
WM
2 069
2 925
1 933
3 307
12 989
13 462
276
314
11 820
2083
3994
4895
1698
—
—
—
—
—
Same
alignment
Partial overlap
RM
UM
WM
No overlap
RM
UM
WM
1 895
2 509
1 539
1 678
12 784
12 832
276
312
11 410
25
97
81
11
—
—
—
—
—
44
57
39
41
170
511
0
0
297
163 (82)
1388 (245)
3275 (256)
9 (7)
—
—
—
—
—
—
—
—
—
2 589
2 383
52
11
10 803 598
130
359
355
1588
35
119
0
2
113
—
—
—
—
605
952
0
0
729
(457)
(294)
(50)
(9)
(4473)
(65)
(91)
(58)
(268)
(35)
(119)
(2)
(94)
Matches reported are at least 92 bases long and have at least 97% identity. Rows labeled W16 use MegaBLAST with initial match size 16, instead of the default 28. Each entry in the
second, third and fourth columns counts the number of matches of that type totalled over 300 queries. The fifth column counts matches in common to the two databases being compared.
‘Partial overlap’ columns count cases where corresponding matches overlap in both query and subject, but are distinct. The last three columns count the number of matches unique to each
database; in those columns the number in parentheses are the number of coalesced regions that are added to the sets (R3), (R4) or (R5), as described in Subsection 2.5.2.
Table 3. Regions (R1) and (R2) for each genome are classified by the
number of blastn matches they yield against the same unmasked genome
Match count
Human 34.1
(R1); Generative
(R1); No pattern
(R2); No pattern
Mouse 32.1
(R1); Generative
(R2); Generative
(R2); No pattern
Mouse 33.1
(R1); Generative
(R2); Generative
(R2); No pattern
Fruitfly 6.3
(R1); Generative
(R1); No pattern
(R2); No pattern
11–50
>50
Total
1–10
44
6
50
0
4
50
1
0
0
43
2
0
pattern
pattern
50
3
47
0
0
47
2
3
0
48
0
0
pattern
pattern
50
3
47
0
0
47
3
3
0
47
0
0
pattern
44
6
50
0
0
50
10
6
0
34
0
0
Match count
pattern
Matches counted had at least 75% query length coverage and at least 95% identity. We
suggest that cases with 1–10 matches should appear in the output, cases with >50 matches
should not appear, and cases in the middle range may sometimes be desirable to have in
BLAST output.
3.2
Effect on BLAST searches
The number of alignments found by MegaBLAST between sample
queries and the various masked and unmasked genome databases
are listed in Table 2. The first number that should stand out is the
10 803 598 matches unique to an unmasked rat database; only
11 820 matches are found using a WindowMasked database, a
reduction by a factor of about 1000. The large number of matches
to the unmasked rat genome quantifies the need for a masking
method that can be used when repeat libraries are not available.
In the ‘Partial overlap’ and ‘No overlap’ columns of Table 2,
we note that for those rows where we did not run RM, there are
a few cases where we find an alignment when querying the
WindowMasked genome, but do not find the same alignment
138
Table 4. Taking the coalesced query regions (R3)–(R5), enumerated in the
‘No overlap’ column of Table 2, as a starting point, this table classifies the
query regions according to the number of matches covering at least 75% of
the query region that are obtained when querying an unmasked database
Total
1–10
Human 34.1 (No small patterns)
(R3); Generative pattern
7
0
(R3); No pattern
75
6
(R4); Generative pattern
2
0
(R4); No pattern
63
49
Mouse 32.1 (1 RM and 1 WM small pattern)
(R3); Generative pattern
15
0
(R3); No pattern
229
5
(R4); No pattern
90
66
Mouse 33.1 (1 RM and 1 WM small pattern)
(R3); Generative pattern
18
0
(R3); No pattern
237
11
(R4); No pattern
57
40
Fruitfly 6.3 (No small patterns)
(R3); No pattern
7
5
(R4); No pattern
268
211
Honeybee 1.1 UM W16 (49 small patterns)
(R5); Generative pattern
111
2
(R5); No Pattern
134
95
Pseudoobscura UM W16 (1 small pattern)
(R5); No pattern
8
8
Rat 2.1 UM (251 small patterns)
(R5); Generative pattern
33
0
(R5); No Pattern
4189
2334
11–50
>50
0
7
2
14
7
62
0
0
1
29
23
14
195
1
1
25
13
17
201
4
1
57
1
0
2
39
107
0
0
0
0
442
33
1413
Matches for all rows except the last are filtered for 95% identity; matches for the last row
are filtered at 97% identity.
when querying the unmasked genome. These discrepant cases
are not investigated further because they are owing to either the
heuristic nature of MegaBLAST or the 97% identity filtering. Query
portions of remaining alignments in ‘No overlap’ columns of
Table 2 are transformed into regions for sets (R3), (R4) and (R5)
as described in Subsection 2.5.2.
WindowMasker: masker for sequenced genomes
3.3
Analysis of regions (R1)–(R5)
Regions are investigated as per the protocol in Subsection 2.5.3.
Tables 3 and 4 tabulate regions in sets (R1)–(R5) according to the
number of alignments to the unmasked genome counting only the
alignments that cover at least 75% of the length of the region with
95% identity. We note that there is an insignificant amount of
change if we tabulate results at 25% length coverage instead of
75% (data not shown). Next, we discuss Tables 3 and 4.
3.3.1 MegaBLAST performance on regions masked only by WM
or RM We did not find any regions in sets (R1) and (R2) that have
a small pattern. As shown by the second column in Table 3, the vast
majority of the regions in set (R2) do not have a generative tandem
repeat, while the vast majority regions in set (R1) do. This should be
interpreted as a hint that the regions masked only by WM are likely
to be generated by globally repetitive patterns that were not included
in the RepBase library.
Almost all regions in set (R2) have fewer than 10 matches, which
suggests that they should not be masked. Almost all the regions in
set (R1) have 51 matches, strongly suggesting that WM is justified
in masking them. The most troubling exceptions to this are 4
sequences in the human build 34.1 that are masked by WM,
have no generative pattern and get 10 matches. Two out of
these four are from the portion of contig NT_004538.15 that is
derived from AL357500.17; both sequences get over 100 000 blastn
matches that are not big enough to be counted at 75% length
coverage. The third sequence also belongs to the same contig,
while the last sequence has two patterns that are covering the
sequence such that neither one is a tandem repeat and hence neither
is reported by TRF.
3.3.2 MegaBLAST performance on RepeatMasked and WindowMasked genomes For the human, mouse and fruitfly genomes, we
collected regions in sets (R3) and (R4). Table 4 shows that there are
only five regions in (R4) that generate more than 50 matches to an
unmasked database. This suggests that only those five regions
align to repetitive DNA, and that the DNA sequences that would
align to the other regions in (R4) were correctly left unmasked
by WM.
Table 4 also shows that a significant number of regions in (R3)
also have many matches to an unmasked database. We infer that
WM is justified in masking the sequences that would align to these
regions. There are 27 cases where one might question the results of
WM because the query, originally identified because of a unique
match to the RepeatMasked genome, yielded at most 10 matches
when compared with the unmasked genome. For these 27 cases, we
investigated the subject portion of the original unique alignments to
the RepeatMasked database. In all cases but one, the subject portion
is heavily masked by both RM and WM, but a small amount of
additional masking by WM eliminates the seed for a MegaBLAST
alignment. The lone exception is a 108 base match at 97.22%
identity between AL451144.5 and NT_006713.13, where 80 out
of 108 bases are found to have low complexity by DUST.
3.3.3 MegaBLAST performance on genomes without a RepBase
library For the honeybee, rat and D. pseudoobscura genomes, we
collected regions in set (R5). Table 4 shows that generative patterns
from these regions are highly repetitive. Some regions that do not
have a generative pattern, particularly those from rat, do not often
align to an unmasked genome. The row with 4189 ‘No pattern’ for
Table 5. Counts of intermediate and final outputs when running NCBI’s
genome annotation pipeline using a DNA database of human chromosome 1
masked by either RM or WM
Category
mRNA queries
RM
WM
MegaBLAST matches 203 751
Transcripts aligned
14 874
(16 920)
Unique transcripts
36
(97)
EST queries
RM
WM
127 664 1 677 974
14 948
562 989
(16 833) (658 437)
110
8 521
(147) (11 943)
1 344 116
565 745
(652 112)
11 277
(13 539)
Numbers in brackets are total number of gene models found for transcripts in that
category.
rat in Table 4 counts matches filtered for 97% identity. For rat, we
did not generate a complete histogram of the number of matches to
each query portion when matches are filtered for 95% identity, but
we determined that only 675 query regions match less than 50 times
against the unmasked genome. We did no further analysis to see if
many more alignments would be found using more sensitive parameters to MegaBLAST or a more sensitive algorithm such as blastn.
Of the 10 803 598 matches to the unmasked rat genome in the
‘No overlap’ column of Table 2, 6 240 141 matches, which comprise
4473 query regions, cover at least 75% of the query. Of these
6 240 141 matches, the 675 non-repetitive query regions account
for only 1335 (<0.02%) matches. This seems to be acceptable
performance for the WM program, especially when no alternative
method of masking is available.
3.4
A test of alignment to transcribed regions
The above tests and analysis concern entire genomes and do not
distinguish between transcribed and untranscribed intervals. Since
many searches are intended to identify matches in transcribed
regions, we assessed WM performance in leaving the non-repetitive
portions of these regions unmasked for matching. NCBI’s genome
pipeline group applied their software (Kitts, 2003) to a DNA database for human chromosome 1. As above, the database was masked
using either RM or WM. To measure the final output quantity and
quality, we evaluated how many mRNAs or ESTs can be aligned as
gene models. The masked databases are used by the genome pipeline software for the intermediate step of MegaBLAST searching.
Therefore, another useful measure of performance is the number of
MegaBLAST matches, which indicates how many candidate pieces
of transcript may have to be spliced together to make a gene model.
Table 5 shows that using WM, rather than RM, to mask the
database enables the identification of slightly more (<1%)
mRNAs, along with a substantial reduction (20–30%) in the number
of MegaBLAST matches that are considered potential alignment
pieces. Table 6 shows that using WM, rather than RM, also results
in the creation of more gene models with higher length coverage.
The improvement in final outputs is not substantial but does show
that WM is not excessively masking transcribed regions, while
substantially reducing the number of matches to consider.
3.5
Computation time
To compare running times, we ran RM and WM on the human
genome, one chromosome at a time, using multiple uniprocessors
139
A.Morgulis et al.
Table 6. Alignments in the second row of Table 5 are classified by the
percentage of the transcript covered by the gene model
% Coverage
All
50
50–75
75–90
90–95
95–99
>99
Total
Longest
50
50–75
75–90
90–95
95–99
>99
Total
RM mRNA
WM mRNA
RM EST
WM EST
403
458
553
638
4266
10 602
16 920
381
452
536
634
4273
10 557
16 833
27 190
47 448
65 429
52 453
130 601
335 316
658 437
23 866
45 723
64 021
51 973
130 666
335 863
652 112
191
221
301
457
3839
9865
14 874
189
222
299
453
3856
9929
14 948
18 621
39 211
55 517
42 652
110 625
296 363
562 989
16 254
37 964
55 439
43 118
112 227
300 743
565 745
‘All’ refers to all gene models found for transcripts, while ‘Longest’ refers to the subset
of gene models where only the longest gene model for each transcript is considered.
on a cluster of processors. RM took 1045 h of CPU time, while WM
took under 11 h of CPU time. Of these 11 h, 1.2 h are consumed by
the first pass of Nmer counting; if one also checks for duplicate
contigs in the first pass, then the time increases to 4.5 h. Thus, we
conclude that WM without the check for duplicate contigs is nearly
two orders of magnitude faster than RM for genomes that are
roughly the size of the human genome (3 Gb).
4
DISCUSSION
We describe new software called WindowMasker (WM) that identifies and masks highly repetitive subsequences in the DNA
sequence of a genome. Unlike the widely used RepeatMasker
(RM) software (Bedell et al., 2000; Smit and Green, 1996), WM
is not dependent on a library of known repeat sequences, such as
RepBase (Jurka, 2000). WM is orders of magnitude faster than RM
or other alternatives, such as RECON (Bao and Eddy, 2002), that
use local alignment to identify the location of repeated sequences.
WM has been put into production usage at NCBI, and the first new
genome to which it was applied is that of the cow (Bos taurus).
We tuned WinMask using the human genome and then tested the
combination of DUST and WinMask on the genomes of mouse,
rat, fruitfly, D. pseudoobscura and honeybee. The tests combine
WM with the intended application of sequence database searching
by evaluating the quality of MegaBLAST (Zhang et al., 2000)
outputs when searching a masked database. Our tests show that
MegaBLAST produces better results when searching a database
masked by WM than it does when searching a database masked
by RM. This improvement is seen even for the human genome, for
which there has been adequate time to develop a good library of
repeats. Therefore, WM will be particularly useful for masking
recently sequenced genomes for which no repeat library yet exists.
One limitation of our tests with RM is that we set the parameters
according to the developers’ recommendations, but did not
consider the possibility that other parameter settings would give
140
better results. When our tests show that RM over or under masks
some region, we do not attempt to assess whether this is due
primarily to the RM parameters or to the entries in the RepBase
library.
The RM documentation suggests that the interesting regions
are those that code for proteins. In this study, we chose to define
the interesting regions as those that are transcribed because (1) this
simplified our testing and (2) it acknowledges the growing interest
in untranslated, functional RNAs and the 50 and 30 untranslated
regions adjacent to translated RNAs. Our tests on human chromosome 1 show that slightly more gene models are created using a
WindowMasked database than using a RepeatMasked database and
that the use of a WindowMasked database substantially reduces the
number of partial matches that need to be processed.
The principal disadvantage of using WM instead of RM is that
WM does not annotate the processed sequence, whereas regions
masked by RM are annotated as belonging to at least one known
family of repeat sequences. One basic problem suggested by the
contrasting methods is whether the intervals masked by WM can be
used to generate an initial library of repeats for a genome? Two
approaches to this problem might be (1) to cluster the WinMasked
regions by length and some properties of the unit scores they contain
or (2) to use the RECON method on a cleverly chosen subset of pairs
of intervals.
In its current implementation, WM takes as input one or more
FASTA-formatted files and produces as output a FASTA-formatted
file or a more compact representation of the masked intervals. In
practice, genome sequences are initially presented as traces from
a sequencing device, including real number measurements (dye
intensity or mass or radioactivity) for each possible letter at each
position. It may be useful to determine if masking can be done
directly on sequence traces without interposing additional software
to ‘call’ a unique letter for each position.
For the subproblem of masking low-complexity sequences, we
used a modified version of DUST, because (1) DUST has been used
successfully for many years to filter DNA queries to BLAST and
(2) our focus is on the WinMask method for masking global repeats.
It would be interesting to try other software methods, such as
STAR (Delgrange and Rivals, 2004), Sputnik (Abajian, 1994,
http://espressosoftware.com/pages/sputnik.jsp; Morgante et al.,
2002), Poly (Bizzaro and Mark, 2003) that are specifically designed
to find microsatellite sequences with some mutations. Longer
approximate tandem repeats can be found with Tandem Repeats
Finder (Benson, 1999) or ATRHUNTER (Wexler et al., 2005). An
older alternative, similar to DUST, is the Simple method used
in (Hancock, 2002).
Several other groups have proposed combinatorial approaches
to repeat identification that also do not use a library. Such software
packages include RAP (Campagna et al., 2005), also based on
word counts; REPuter (Kurtz et al., 2001), based on suffix trees;
FORrepeats (Lefebvre et al., 2003), based on an automaton construction; an unnamed software package from Cold Spring Harbor
Laboratory (Healy et al., 2003), based on the Burrows-Wheeler
transform and suffix arrays; and repeat detection module of
FragmentGluer (Pevzner et al., 2004), based on a A-Bruijn graph
approach. In general, these programs are interested in detecting
repeats that occur more than once without concern for the frequency
or variability. To our knowledge, none of these packages has
been used to produce a database of masked sequences for use in
WindowMasker: masker for sequenced genomes
homology searching. Of the packages mentioned, RAP is closest
in method to our word-counting approach. The developers of
RAP chose to use discontiguous words (Campagna et al., 2005),
whereas WM uses counts of contiguous words. We tested some
discontiguous words in an early implementation of WinMask and
found no advantage (data not shown). However, this issue
deserves further study, especially in consideration of the success
of discontiguous words in homology searching (see e.g. Ma et al.,
2002).
Besides the choice of contiguous versus discontiguous
words, three other design decisions for WinMask, described in
Subsection 2.2, may be of concern: the choice of score thresholds
in the first pass, the definition of the window_score function (at step
1 in the second pass), and the rule (at step 3 in the second pass) for
masking in between consecutive already masked (at step 2 in the
second pass) regions. For the score threshold, we chose plausible
values based on Nmer counts for several genomes; we display those
counts in Table S1 and Figure S1. A more principled method of
choosing the thresholds would be of interest. For the window_score
function we also considered the minimum unit score and median
unit score instead of the average; using the average gave better
results on the human genome (data not shown). For step 3 in the
WinMask algorithm, we considered instead (1) masking between
consecutive masked intervals if the distance in bases between them
is sufficiently low regardless of score and (2) masking between
already masked intervals if the average of the unit scores in the
coalesced interval is above some threshold. The rule described
above as step 3 performed better than these two alternatives on
the human genome (data not shown).
Another direction for further investigation is whether either the
combinatorial or the library-based approach to masking can be
extended from inputs of single genomes to species-heterogeneous
databases such as GenBank’s non-redundant (nr) database? In conclusion, WM produces better results than RM without requiring
a library of repeats.
ACKNOWLEDGEMENTS
We express our thanks to Stephen Altschul, David Lipman and Jim
Ostell for advice and encouragement. We also thank Jo McEntyre for
editorial assistance. We thank Victor Sapojnikov for assistance in
testing both RM and WM and for guidance in enhancing the input/
output options so that WM could be used in NCBI’s genome pipeline.
We also thank the NCBI genome pipeline group for carrying out the
tests of alignment to transcribed regions. A.M. is supported under a
contract to MSD Inc., Fairfax, VA, USA. Funding to pay the Open
Access publication charges for this article was provided by the
Intramural Research Program of the National Institules of Health,
NLM.
Conflict of Interest: none declared.
REFERENCES
Abajian,C. (1994) Sputnik.
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST—a new generation of
protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bao,Z. and Eddy,S.R. (2002) Automated de novo identification of repeat sequence
families in sequenced genomes. Genome Res., 12, 1269–1276.
Bedell,J.A. et al. (2000) Maskeraid: a performance enhancement to repeatmasker.
Bioinformatics, 16, 1040–1041.
Benson,D.A. et al. (2004) Genbank: update. Nucleic Acids Res., 32, D23–D26.
Benson,G. (1999) Tandem Repeats Finder: a program to analyze DNA sequences.
Nucleic Acids Res., 27, 573–580.
Bizzaro,J.W. and Marx,K.A. (2003) Poly: a quantitative analysis tool for simple
sequence repeat (SSR) tracts in DNA. BMC Bioinformatics, 4, 22.
Campagna,D. et al. (2005) RAP: a new computer program for de novo identification of
repeated sequences in whole genomes. Bioinformatics, 21, 582–588.
Delgrange,O. and Rivals,E. (2004) Star: an algorithm to search for tandem and approximate repeats. Bioinformatics, 20, 2812–2820.
Hancock,J.M. (2002) Genome size and the accumulation of simple sequence repeats:
implications of new data from sequencing projects. Genetica, 115, 93–103.
Havlak,P. et al. (2004) The Atlas genome assembly system. Genome Res., 14, 721–732.
Healy,J. et al. (2003) Annotating large genomes with exact word matches. Genome
Res., 13, 2306–2315.
Jurka,J. (2000) Repbace update: a database and an electronic journal of repetitive
elements. Trends Genet., 16, 418–420.
Kitts,P. (2003) Genome assembly and annotation process. In J.McEntyre and J.Ostell
(eds). The NCBI Handbook, National Library of Medicine, US, chapter 14.
Koenig,M. et al. (1988) The complete sequence of dystrophin predicts a rod-shaped
cytokeletal protein. Cell, 53, 219–228.
Kurtz,S. et al. (2001) Reputer: the manifold applications of repeat analysis on a
genomic scale. Nucleic Acids Res., 29, 4633–4642.
Lefebvre,A. et al. (2003) Forrepeats: detects repeats on entire chromosomes and
between genomes. Bioinformatics, 19, 319–326.
Ma,B. et al. (2002) Patternhunter: faster and more sensitive homology search.
Bioinformatics, 18, 440–445.
Morgante,M. et al. (2002) Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nat. Genet., 30, 194–200.
Pevzner,P.A. et al. (2004) De novo repeat classification and fragment assembly.
Genome Res., 14, 1786–1796.
Schmid,C.W. and Jelinek,W.R. (1982) The Alu family of dispersed repetitive
sequences. Science, 216, 1065–1070.
Smit,A.F.A. and Green,P. (1996) Repeatmasker.
Smit,A.F.A. et al. (1995) Ancestral, mammalian-wide subfamilies of LINE-1 repetitive
sequences. J. Mol. Biol., 246, 401–417.
Wang,J. et al. (2002) RePs: a sequence assembler that masks exact repeats identified
from shotgun data. Genome Res., 12, 824–831.
Weber,J.L. and May,P.E. (1989) Abundant class of human DNA plymorphisms which
can be typed using the polymerase chain reaction. Am. J. Hum. Genet., 44, 388–396.
Wexler,Y. et al. (2005) Finding approximate tandem repeats in genomic sequences.
J. Comp. Biol., 12, 928–942.
Zhang,Z. et al. (2000) A greedy algorithm for aligning DNA sequences. J. Comp. Biol.,
7, 203–214.
141