Computer Note

Journal of Heredity 2013:104(1):154–157
doi:10.1093/jhered/ess082
Advance Access publication November 9, 2012
© The American Genetic Association. 2012. All rights reserved.
For permissions, please email: [email protected].
MSDB: A User-Friendly Program for
Reporting Distribution and Building
Databases of Microsatellites from
Genome Sequences
Lianming Du, Yuzhi Li, Xiuyue Zhang, and
Bisong Yue
From the Key Laboratory of Bio-resources and
Eco-environment, Ministry of Education, College of Life
Science, Sichuan University. Chengdu, Sichuan 610064, P.R.
China (Du and Zhang); and Sichuan Key Laboratory of
Conservation Biology on Endangered Wildlife, College of
Life Sciences, Sichuan University, Chengdu, Sichuan 610064,
P.R. China (Li and Yue).
Address correspondence to Bisong Yue at the address
above, or email: [email protected].
Microsatellite Search and Building Database (MSDB) is a new
Perl program that provides a user-friendly interface for identifying microsatellites in complete genome sequences and building databases of them.
The general aims of MSDB are to store microsatellite information in a database and
to facilitate the management, classification, and statistical analysis of
microsatellites. The user-friendly interface facilitates the treatment of large datasets. The program can find
various types of pure, compound, and complex microsatellites in sequences and generate a detailed statistical report in worksheet format. MSDB also contains two other
subprograms: SWR, which exports microsatellites from the database according to the user’s requirements, and
SWP, which automatically invokes R to draw a sliding window plot showing the distribution of the density or
frequency of the identified microsatellites. MSDB is freely available under the GNU General Public License for Windows and
Linux from the following website: http://msdb.biosv.com/.
Key words: database, microsatellite, statistics
Introduction
Microsatellites, or simple sequence repeats (SSRs), are DNA sequences
consisting of tandemly repeated nucleotide motifs of 1–6 base pairs (bp)
that are widely distributed in eukaryotic genomes. Because
of their high polymorphism, codominance, and potential for
high throughput analysis, microsatellites have been preferred
as powerful tools for population biology and genetic studies
in many organisms (Zane et al. 2002; Li et al. 2010). However,
the conventional procedure for isolating microsatellite markers is challenging, costly, labor-intensive, and
time-consuming (Castagnone-Sereno et al. 2010). In recent
years, the availability of complete genome sequences for a
wide range of organisms has made it possible to search and
investigate genome-wide microsatellites.
Years ago, FASTA (Pearson 1994) and BLAST (Altschul
et al. 1990) packages were used to study microsatellites.
Later, more specific tools were developed to search for
microsatellites in genomic sequences, such as TRF (Benson
1999), SSRFinder (Gao et al. 2003), SSRIT (Temnykh et al.
2001), MISA (Thiel et al. 2003), Mreps (Kolpakov et al.
2003), TROLL (Castelo et al. 2002), IMEx (Mudunuri
and Nagarajaram 2007), SciRoKo (Kofler et al. 2007),
MSATCOMMANDER (Faircloth 2008), and QDD (Meglécz
et al. 2010). Although these tools are able to perform well in
searching for microsatellites, many of these tools are not user
friendly because they do not provide a graphical user interface
and support only the Linux operating system. Some of them
are not able to obtain the flanking sequence of microsatellites
and some do not provide detailed summary statistics. Only
a few of them can process multiple sequence files at once.
Importantly, none of them support a convenient output
format which can be used to facilitate a secondary search
and generate a sliding window plot to show distribution of
microsatellites on a sequence or chromosome (Table 1). In this
study, we developed a new tool for detecting microsatellites in
genomic sequences to resolve these problems.
MSDB (Microsatellite Search and Building Database)
is an efficient and user-friendly tool that 1) finds various types of pure, compound, and complex microsatellites;
2) builds a SQLite database of the microsatellites in a genome
sequence; 3) generates a detailed statistical report of the density
and frequency of microsatellites; 4) invokes R to draw a sliding window plot showing the distribution of microsatellites;
5) exports a file in Primer3 input format for designing primers; and 6) lets the user export microsatellites from the database in a format that meets the user’s requirements.
Implementation
There are many approaches for finding repeat sequences.
MSDB uses regular expression pattern matching within
each DNA sequence to locate microsatellite arrays. MSDB
contains three programs: 1) the MSDB main program, which
finds microsatellites and generates basic relevant statistics;
2) SWR (search within results), which retrieves data from
the database generated by the main program according to
the end user’s requirements; and 3) SWP (sliding window plot), which
is used to automatically generate a sliding window plot of
density or frequency of microsatellites.
Table 1 Comparison of several microsatellite search tools

               Perfect  Imperfect  Compound or  Programming  Operating            User                Flanking
Search tool    SSR      SSR        interrupted  language     system               interface    Batch  sequence  Statistics  Database  Primer3
               search   search     SSR search
TRF            No       Yes        No           ?            Windows, Linux, Mac  GUI/web      No     No        Yes         No        No
SSRIT          Yes      No         No           Perl         Linux                Web          No     No        No          No        No
Mreps          Yes      Yes        No           C            Windows, Linux, Mac  Console/web  No     No        No          No        No
MISA           Yes      No         Yes          Perl         Linux                Console      No     Yes       Yes         No        Yes
TROLL          Yes      No         No           C++          Linux                Console      Yes    No        No          No        No
SciRoKo        Yes      Yes        Yes          C#           Windows              GUI          Yes    Yes       Yes         No        No
IMEx           Yes      Yes        No           C            Linux                Console/web  No     Yes       Yes         No        Yes
MSATCOMMANDER  Yes      No         No           Python       Windows, Linux, Mac  GUI          No     No        No          No        Yes
MSDB           Yes      No         Yes          Perl         Windows, Linux       GUI          Yes    Yes       Yes         Yes       Yes
Microsatellite Search Modes
Microsatellites can be grouped into six categories: 1) pure
microsatellites or perfect microsatellites consist of identical repeats, 2) interrupted pure microsatellites consist of
two or more individual pure microsatellites with the same
motif, 3) compound microsatellites consist of two adjacent pure microsatellites with different motifs, 4) interrupted compound microsatellites consist of two repetitive
sequences of compound microsatellites interrupted by a
short, non-repetitive sequence, 5) complex microsatellites
consist of several different perfect repetitive sequences,
and 6) interrupted complex microsatellites consist of several different perfect repetitive sequences interrupted by
non-repetitive sequences (Chambers and MacAvoy 2000;
Bachmann et al. 2004).
MSDB has two search modes: the perfect search mode
searches for perfect (pure) microsatellites, and the imperfect search mode searches for all types
of microsatellites described above. Both search modes
share one important parameter, the minimum number of repeats, which determines whether a repetitive
sequence detected in a genome sequence counts as an individual microsatellite. The imperfect search mode has a second
important parameter, the maximum distance (dMAX), which is
the maximum distance allowed between two adjacent microsatellites. The dMAX and the motifs of the individual microsatellites together determine the category to which a microsatellite
found in a sequence belongs.
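As a rough illustration of how dMAX and the motifs could jointly decide the category, the following Python sketch groups adjacent pure repeats whose gap is at most dMAX and then labels each group. The names Hit, group_hits, and classify, as well as the exact decision rules, are assumptions made for this example. They follow one reading of the six categories listed above and are not taken from the MSDB source code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """One pure repeat found in a sequence (hypothetical record type)."""
    motif: str   # repeat unit, e.g. "AC"
    start: int   # 0-based start of the repeat
    end: int     # 0-based end of the repeat (exclusive)


def group_hits(hits: List[Hit], d_max: int) -> List[List[Hit]]:
    """Group hits, sorted by position, whose gap to the previous hit is <= d_max."""
    groups: List[List[Hit]] = []
    for hit in sorted(hits, key=lambda h: h.start):
        if groups and hit.start - groups[-1][-1].end <= d_max:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups


def classify(group: List[Hit]) -> str:
    """Label a group of neighbouring pure repeats (assumed rules, see lead-in)."""
    if len(group) == 1:
        return "pure"
    same_motif = len({h.motif for h in group}) == 1
    # A group is "interrupted" if any two neighbours are separated by a gap.
    interrupted = any(b.start - a.end > 0 for a, b in zip(group, group[1:]))
    if same_motif:
        return "interrupted pure"
    kind = "compound" if len(group) == 2 else "complex"
    return f"interrupted {kind}" if interrupted else kind
```

In this sketch a larger dMAX merges more neighbouring repeats into one locus, which is why the parameter matters only in the imperfect search mode.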
Neither search mode produces redundant repetitive sequences. MSDB uses regular expressions to match a
sequence from beginning to end, so that each microsatellite locus is matched only once. For instance, the sequence
ACACACACACAC is identified as an (AC)6 dinucleotide microsatellite rather than as an (ACAC)3 tetranucleotide
microsatellite.
Building Microsatellite Database
The overall microsatellite content in a genome correlates
with the genome size of the organism (Ellegren 2004).
Generally, there are thousands of microsatellite loci in
eukaryotic genomes. If all those microsatellites are saved
in one regular text file, it is virtually impossible to manage
and view those microsatellites in their genomic sequence;
they appear more like a heap of chaotic data. Therefore, we
chose SQLite (http://www.sqlite.org) to build a microsatellite database for each genome. SQLite is
a widely used, cross-platform, self-contained, embeddable, serverless,
transactional SQL database engine (Zhu et al. 2008). It is more convenient
to view, manage, retrieve, and export microsatellites from a
database than from other file formats.
At the beginning of each search, MSDB automatically generates a SQLite database file to save microsatellite
information. The database contains two tables: SSR is used
to store information of each microsatellite and FILE is
used to store information of each sequence file. The motif,
type, repeats, length, location, sequence, flanking sequence,
and source of each microsatellite are inserted into table
SSR of the database as a record. For each sequence,
the sequence or file name, total sequence length,
total microsatellite count, and total microsatellite length are
inserted into table FILE of the database as a record.
More detailed information about the structure and usage
of the database is available in the Supplementary Material
online.
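A two-table layout of this kind can be sketched with Python's built-in sqlite3 module. The table names SSR and FILE come from the text above, but every column name and type below is a guess made for illustration; the actual schema is documented in the Supplementary Material.

```python
import sqlite3


def create_ssr_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the two tables described in the text (column names are assumed)."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS SSR (
            id          INTEGER PRIMARY KEY,
            motif       TEXT,
            type        TEXT,
            repeats     INTEGER,
            length      INTEGER,
            start_pos   INTEGER,
            end_pos     INTEGER,
            sequence    TEXT,
            left_flank  TEXT,
            right_flank TEXT,
            source      TEXT
        );
        CREATE TABLE IF NOT EXISTS FILE (
            name        TEXT PRIMARY KEY,
            seq_length  INTEGER,
            ssr_count   INTEGER,
            ssr_length  INTEGER
        );
        """
    )
    return conn


def insert_ssr(conn: sqlite3.Connection, record: dict) -> None:
    """Insert one microsatellite record given as a dict keyed by column name."""
    conn.execute(
        "INSERT INTO SSR (motif, type, repeats, length, start_pos, end_pos, "
        "sequence, left_flank, right_flank, source) "
        "VALUES (:motif, :type, :repeats, :length, :start_pos, :end_pos, "
        ":sequence, :left_flank, :right_flank, :source)",
        record,
    )
```

Once the records are in a database, a secondary search is a plain SQL query, for example SELECT motif, repeats FROM SSR WHERE type = 'pure' AND repeats >= 6.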
Microsatellite Statistics
MSDB uses the SQLite database described above to provide two
microsatellite statistical modes: one
produces a single statistical file for all input sequence files as a
whole; the other produces a statistical file for each
individual input sequence file. The statistical information
is written to an Excel worksheet file. It
includes the SSR count, total SSR length (bp), average SSR length (bp),
frequency (loci/Mb), and density (bp/Mb). More detailed
information about microsatellite statistics is available in the
Supplementary Material online.
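The units given above suggest how the summary values relate to one another: frequency is the number of loci per megabase of sequence, and density is the number of microsatellite base pairs per megabase. The Python sketch below computes them under that reading; the function name ssr_statistics and the dictionary keys are hypothetical.

```python
from typing import Dict, Sequence


def ssr_statistics(ssr_lengths: Sequence[int], sequence_length: int) -> Dict[str, float]:
    """Summarize microsatellites of the given lengths (bp) found in a sequence."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    count = len(ssr_lengths)
    total = sum(ssr_lengths)
    megabases = sequence_length / 1_000_000
    return {
        "count": count,
        "total_length_bp": total,
        "average_length_bp": total / count if count else 0.0,
        "frequency_loci_per_mb": count / megabases,  # loci/Mb
        "density_bp_per_mb": total / megabases,      # bp/Mb
    }
```

For three microsatellites of 12, 18, and 30 bp in a 2 Mb sequence, this gives an average length of 20 bp, a frequency of 1.5 loci/Mb, and a density of 30 bp/Mb.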
Other Features
Users can use SWR to select microsatellites and export them to
an Excel file, or to generate a text file in Primer3 input format
for designing PCR primers. Sliding window analysis is a commonly used method
for studying the properties of molecular sequences: data
are plotted as moving averages of a particular criterion for
a window of a certain length slid along a sequence.
Users can also use SWP to generate a sliding window plot
showing the distribution of microsatellites in a sequence or
chromosome.
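The values behind such a plot can be computed with a simple window scan, sketched below in Python. SWP itself hands the plotting to R; this example only counts loci per window, and the function name sliding_window_counts and its parameters are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def sliding_window_counts(
    starts: Sequence[int], seq_length: int, window: int, step: int
) -> List[Tuple[int, int]]:
    """Return (window_start, number_of_loci) for each window slid along the sequence.

    starts holds the 0-based start position of each microsatellite.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    result = []
    for w_start in range(0, max(seq_length - window, 0) + 1, step):
        w_end = w_start + window
        n_loci = sum(1 for s in starts if w_start <= s < w_end)
        result.append((w_start, n_loci))
    return result
```

Dividing each count by the window size in megabases would turn the counts into frequencies (loci/Mb); summing microsatellite lengths instead of counting loci would give densities (bp/Mb).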
Input and Output
MSDB accepts FASTA, GenBank, EMBL, and plain text files
containing DNA sequence data, with any file extension, as
input. The input format is recognized automatically by
MSDB, and each file is parsed to pure sequence for subsequent processing. The output of the MSDB main program consists of
a SQLite database file that contains the information for each
microsatellite and an Excel file that contains the statistical information for the microsatellites.
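How MSDB recognizes the input format is not described here. One plausible approach, sketched below in Python, is to inspect the first non-empty line: FASTA records begin with ">", GenBank flat files with a LOCUS line, and EMBL flat files with an ID line. The function name guess_format and its return values are hypothetical.

```python
def guess_format(text: str) -> str:
    """Guess the sequence file format from the first non-empty line."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            return "fasta"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("ID "):
            return "embl"
        # Anything else is treated as raw sequence text.
        return "plain"
    return "plain"
```

Because the decision rests on file content and not on the file name, a check like this works with any file extension.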
System Requirements
MSDB is written in Perl and runs as a standalone application with a user-friendly interface on Windows and Linux
systems. The Perl modules are bundled into an executable
file, so it is unnecessary to install a Perl interpreter or libraries
when installing the program. However, because the statistical file produced by MSDB
is an Excel file, Microsoft Office
Excel 2003 or later should be installed. On Linux, the free
software OpenOffice (http://www.openoffice.org/)
can be installed instead of Microsoft Office.
Supplementary Material
Supplementary material can be found at http://www.jhered.oxfordjournals.org/.
Funding
National Science Foundation (31172118); National Basic
Research Program of China (973 Project: 2011CB111503).
Acknowledgments
We thank Professor Moermond for valuable help with the article.
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol. 215:403–410.
Bachmann L, Bareiss P, Tomiuk J. 2004. Allelic variation, fragment length analyses and population genetic models: a case study on Drosophila microsatellites. J Zool Syst Evol Res. 42:215–222.
Benson G. 1999. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27:573–580.
Castagnone-Sereno P, Danchin EG, Deleury E, Guillemaud T, Malausa T, Abad P. 2010. Genome-wide survey and analysis of microsatellite in nematodes, with a focus on the plant-parasitic species Meloidogyne incognita. BMC Genomics. 11:598.
Castelo A, Martins W, Gao G. 2002. TROLL-tandem repeat occurrence locator. Bioinformatics. 18:634–636.
Chambers GK, MacAvoy ES. 2000. Microsatellites: consensus and controversy. Comp Biochem Physiol B. 126:455–476.
Ellegren H. 2004. Microsatellites: simple sequences with complex evolution. Nat Rev Genet. 5:435–445.
Faircloth BC. 2008. MSATCOMMANDER: detection of microsatellite repeat arrays and automated, locus-specific primer design. Mol Ecol Resour. 8:92–94.
Gao L, Tang J, Li H, Jia J. 2003. Analysis of microsatellites in major crops assessed by computational and experimental approaches. Mol Breed. 12:245–261.
Kofler R, Schlotterer C, Lelley T. 2007. SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics. 23:1683–1685.
Kolpakov R, Bana G, Kucherov G. 2003. Mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res. 31:3672–3678.
Li YZ, Xu X, Shen FJ, Zhang WP, Zhang ZH, Hou R, Yue BS. 2010. Development of new tetranucleotide microsatellite loci and assessment of genetic variation of giant panda in two largest giant panda captive breeding populations. J Zool. 282:39–46.
Meglécz E, Costedoat C, Dubut V, Gilles A, Malausa T, Pech N, Martin J. 2010. QDD: a user-friendly program to select microsatellite markers and design primers from large sequencing projects. Bioinformatics. 26:403–404.
Mudunuri SB, Nagarajaram HA. 2007. IMEx: Imperfect Microsatellite Extractor. Bioinformatics. 23:1181–1187.
Pearson WR. 1994. Using the FASTA program to search protein and DNA sequence databases. Methods Mol Biol. 24:307–331.
Temnykh S, DeClerck G, Lukashova A, Lipovich L, Cartinhour S, McCouch S. 2001. Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential. Genome Res. 11:1441–1452.
Thiel T, Michalek W, Varshney RK, Graner A. 2003. Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor Appl Genet. 106:411–422.
Zane L, Bargelloni L, Patarnello T. 2002. Strategies for microsatellite isolation: a review. Mol Ecol. 11:1–16.
Zhu Y, Davis S, Stephens R, Meltzer PS, Chen Y. 2008. GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus. Bioinformatics. 24:2798–2800.
Received May 9, 2012; Revised August 2, 2012;
Accepted August 15, 2012
Corresponding Editor: Howard Ross