Journal of Heredity 2013:104(1):154–157
doi:10.1093/jhered/ess082
Advance Access publication November 9, 2012
© The American Genetic Association. 2012. All rights reserved. For permissions, please email: [email protected].

Computer Note

MSDB: A User-Friendly Program for Reporting Distribution and Building Databases of Microsatellites from Genome Sequences

Lianming Du, Yuzhi Li, Xiuyue Zhang, and Bisong Yue

From the Key Laboratory of Bio-resources and Eco-environment, Ministry of Education, College of Life Science, Sichuan University, Chengdu, Sichuan 610064, P.R. China (Du and Zhang); and Sichuan Key Laboratory of Conservation Biology on Endangered Wildlife, College of Life Sciences, Sichuan University, Chengdu, Sichuan 610064, P.R. China (Li and Yue).

Address correspondence to Bisong Yue at the address above, or email: [email protected].

Microsatellite Search and Building Database (MSDB) is a new Perl program that provides a user-friendly interface for identifying microsatellites in complete genome sequences and building databases of them. MSDB stores microsatellite information in a database to facilitate the management, classification, and statistical analysis of microsatellites. The user-friendly interface facilitates the treatment of large datasets. The program finds various types of pure, compound, and complex microsatellites in sequences and generates a detailed statistical report in worksheet format. MSDB also contains two other subprograms: SWR, which exports microsatellites from the database to meet the user's requirements, and SWP, which automatically invokes R to draw a sliding window plot displaying the distribution of the density or frequency of the identified microsatellites. MSDB is freely available under the GNU General Public License for Windows and Linux from the following website: http://msdb.biosv.com/.
Key words: database, microsatellite, statistics

Introduction

Microsatellites, or simple sequence repeats (SSRs), are tandemly repeated DNA sequences with motifs of 1–6 base pairs (bp) that are widely distributed in eukaryotic genomes. Because of their high polymorphism, codominance, and potential for high-throughput analysis, microsatellites have become preferred tools for population biology and genetic studies in many organisms (Zane et al. 2002; Li et al. 2010). However, the conventional procedures for isolating microsatellite markers are challenging, costly, labor-intensive, and time-consuming (Castagnone-Sereno et al. 2010). In recent years, the availability of complete genome sequences for a wide range of organisms has made it possible to search for and investigate microsatellites genome-wide. Initially, the FASTA (Pearson 1994) and BLAST (Altschul et al. 1990) packages were used to study microsatellites. More specific tools were later developed to search for microsatellites in genomic sequences, such as TRF (Benson 1999), SSRFinder (Gao et al. 2003), SSRIT (Temnykh et al. 2001), MISA (Thiel et al. 2003), Mreps (Kolpakov et al. 2003), TROLL (Castelo et al. 2002), IMEx (Mudunuri and Nagarajaram 2007), SciRoKo (Kofler et al. 2007), MSATCOMMANDER (Faircloth 2008), and QDD (Meglécz et al. 2010).

Although these tools perform well in searching for microsatellites, many of them are not user-friendly because they do not provide a graphical user interface and support only the Linux operating system. Some cannot extract the flanking sequences of microsatellites, and some do not provide detailed summary statistics. Only a few can process multiple sequence files at once. Importantly, none of them provides a convenient output format that can be used for a secondary search and for generating a sliding window plot showing the distribution of microsatellites along a sequence or chromosome (Table 1).
In this study, we developed a new tool for detecting microsatellites in genomic sequences to resolve these problems. MSDB (Microsatellite Search and Building Database) is an efficient and user-friendly tool that 1) finds various types of pure, compound, and complex microsatellites; 2) builds a SQLite database of the microsatellites in a genome sequence; 3) generates a detailed statistical report of the density and frequency of microsatellites; 4) invokes R to draw a sliding window plot showing the distribution of microsatellites; 5) exports a file in Primer3 input format for designing primers; and 6) lets the user export microsatellites from the database in a format that meets the user's requirements.

Implementation

There are many approaches for finding repeat sequences. MSDB uses regular expression pattern matching within each DNA sequence to locate microsatellite arrays. MSDB contains three programs: 1) the MSDB main program, which finds microsatellites and generates basic relevant statistics; 2) SWR (search within results), which retrieves data from the database generated by the main program according to the end user's requirements; and 3) SWP (sliding window plot), which automatically generates a sliding window plot of the density or frequency of microsatellites.

Table 1 Comparison of several microsatellite search tools
Perfect SSR search No Yes Yes Yes Yes Yes Yes Yes Yes
Search tool TRF SSRIT Mreps MISA TROLL SciRoKo IMEx MSATCOMMANDER MSDB
No No No Yes Yes No No Yes Yes Imperfect SSR search
Yes Yes No Yes No No No No No Compound or interrupted SSR search
Perl Perl C++ C# C Python Perl C ? Programming language
Windows, Linux, Mac Linux Windows, Linux, Mac Linux Linux Windows Linux Windows, Linux, Mac Windows, Linux Operating system
GUI Console Console GUI Console/web GUI Web Console/web GUI/web User interface
Yes No Yes Yes No No No No No Batch
Yes Yes No Yes Yes No No No No Flanking sequence
Yes Yes No Yes Yes No No No Yes Statistics
Yes No No No No No No No No Database
Yes Yes No No Yes Yes No No No Primer3

Microsatellite Search Modes

Microsatellites can be grouped into six categories: 1) pure (or perfect) microsatellites consist of identical repeats; 2) interrupted pure microsatellites consist of two or more individual pure microsatellites with the same motif; 3) compound microsatellites consist of two adjacent pure microsatellites with different motifs; 4) interrupted compound microsatellites consist of the two repetitive sequences of a compound microsatellite interrupted by a short, non-repetitive sequence; 5) complex microsatellites consist of several different perfect repetitive sequences; and 6) interrupted complex microsatellites consist of several different perfect repetitive sequences interrupted by non-repetitive sequences (Chambers and MacAvoy 2000; Bachmann et al. 2004).

MSDB has two search modes: the perfect search mode finds perfect (pure) microsatellites, and the imperfect search mode finds all of the types of microsatellites mentioned above. Both modes share one important parameter, the minimum number of repeats, which determines whether a repetitive sequence detected in a genome sequence counts as an individual microsatellite. The imperfect search mode has a second important parameter, the maximum distance (dMAX), which is the maximum distance allowed between two adjacent microsatellites. The dMAX and the motifs of the individual microsatellites determine the category to which a microsatellite found in a sequence belongs.
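The two parameters described above can be illustrated with a minimal sketch, written in Python rather than MSDB's Perl. It is not MSDB's code: the function names, the regular expression, the single repeat threshold for all motif lengths, and the classification rule are assumptions made for illustration only.

```python
import re


def find_perfect(seq, min_repeats=3):
    """Return non-overlapping perfect repeats, scanning left to right."""
    # Assumed pattern: the shortest possible 1-6 bp motif, followed by at
    # least (min_repeats - 1) further copies of that motif.
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        hits.append({
            "motif": motif,
            "repeats": len(m.group(0)) // len(motif),
            "start": m.start(),  # 0-based, inclusive
            "end": m.end(),      # 0-based, exclusive
        })
    return hits


def group_by_dmax(hits, dmax=10):
    """Merge neighbouring perfect repeats whose gap is at most dmax bp."""
    groups = []
    for hit in hits:
        if groups and hit["start"] - groups[-1][-1]["end"] <= dmax:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups


def classify(group):
    """Assumed rule mapping a group of perfect repeats to a category."""
    if len(group) == 1:
        return "pure"
    if len({hit["motif"] for hit in group}) == 1:
        return "interrupted pure"
    interrupted = any(b["start"] > a["end"] for a, b in zip(group, group[1:]))
    base = "compound" if len(group) == 2 else "complex"
    return "interrupted " + base if interrupted else base


def search_imperfect(seq, min_repeats=3, dmax=10):
    """Categories of all microsatellites found in seq (imperfect mode)."""
    groups = group_by_dmax(find_perfect(seq, min_repeats), dmax)
    return [classify(group) for group in groups]


if __name__ == "__main__":
    print(find_perfect("ACACACACACAC"))
    print(search_imperfect("ACACACAC" + "TTG" + "ACACACAC"))
```

With these defaults, the two (AC)4 arrays separated by TTG are reported as one interrupted pure microsatellite; setting dmax below 3 reports them as two pure microsatellites. The sketch treats motifs such as AC and CA as different, which a real tool may not.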
The two search modes do not produce redundant repetitive sequences. MSDB uses a regular expression to match a sequence from beginning to end, so the same microsatellite locus is matched only once. For instance, the sequence ACACACACACAC is identified as an (AC)6 dinucleotide microsatellite rather than as an (ACAC)3 tetranucleotide microsatellite.

Building Microsatellite Database

The overall microsatellite content of a genome correlates with the genome size of the organism (Ellegren 2004). Generally, there are thousands of microsatellite loci in eukaryotic genomes. If all of those microsatellites are saved in one plain text file, it is virtually impossible to manage and view them in their genomic sequence; they appear more like a heap of chaotic data. Therefore, we chose SQLite (http://www.sqlite.org) to build a microsatellite database for a genome. SQLite is a widely used, cross-platform, self-contained, embeddable, serverless, transactional SQL database engine (Zhu et al. 2008), and it is more convenient to view, manage, retrieve, and export microsatellites from a database than from other file formats.

At the beginning of each search, MSDB automatically generates a SQLite database file to save the microsatellite information. The database contains two tables: SSR stores the information for each microsatellite, and FILE stores the information for each sequence file. The motif, type, repeats, length, location, sequence, flanking sequence, and source of each microsatellite are inserted into table SSR as one record. For each sequence, its sequence or file name, total sequence length, total microsatellite count, and total microsatellite length are inserted into table FILE as one record. More detailed information about the structure and usage of the database is available in the Supplementary Material online.
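To make the two-table layout concrete, the following is a minimal sketch using Python's standard sqlite3 module. Only the table names SSR and FILE and the kinds of fields stored come from the description above; the column names, the column types, the helper functions, and the toy records are hypothetical, and the actual schema is the one documented in the Supplementary Material.

```python
import sqlite3

# Hypothetical column names for the fields listed in the text.
SSR_COLUMNS = ("motif", "type", "repeats", "length", "start_pos", "end_pos",
               "sequence", "left_flank", "right_flank", "source")

# Made-up example records, for illustration only.
TOY_SSRS = [
    ("AC", "pure", 6, 12, 101, 112, "ACACACACACAC", "GGT", "TTA", "chr_demo"),
    ("AG", "pure", 5, 10, 301, 310, "AGAGAGAGAG", "CCA", "GAT", "chr_demo"),
    ("AC", "pure", 4, 8, 501, 508, "ACACACAC", "TGA", "CCT", "chr_demo"),
]
TOY_FILE = ("chr_demo", 1000, 3, 30)  # name, length, SSR count, SSR length


def build_database(ssr_records, file_record, path=":memory:"):
    """Create the SSR and FILE tables and insert the given records."""
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE SSR (
        id INTEGER PRIMARY KEY, motif TEXT, type TEXT, repeats INTEGER,
        length INTEGER, start_pos INTEGER, end_pos INTEGER, sequence TEXT,
        left_flank TEXT, right_flank TEXT, source TEXT)""")
    con.execute("""CREATE TABLE FILE (
        name TEXT PRIMARY KEY, seq_length INTEGER,
        ssr_count INTEGER, ssr_length INTEGER)""")
    placeholders = ", ".join("?" * len(SSR_COLUMNS))
    con.executemany(
        "INSERT INTO SSR (%s) VALUES (%s)"
        % (", ".join(SSR_COLUMNS), placeholders),
        ssr_records)
    con.execute("INSERT INTO FILE VALUES (?, ?, ?, ?)", file_record)
    con.commit()
    return con


def summarize_by_motif(con, ssr_type="pure"):
    """Count loci and sum their lengths per motif for one category."""
    return con.execute(
        "SELECT motif, COUNT(*), SUM(length) FROM SSR "
        "WHERE type = ? GROUP BY motif ORDER BY motif",
        (ssr_type,)).fetchall()


if __name__ == "__main__":
    con = build_database(TOY_SSRS, TOY_FILE)
    for motif, count, total_bp in summarize_by_motif(con):
        print(motif, count, total_bp)
```

Once the records are in a database, retrieval of this kind is a single query instead of a custom parser over a flat text file, which is the convenience the paragraph above argues for.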
Microsatellite Statistic

MSDB uses the SQLite database described above to provide two statistical modes: one produces a single statistical file for all input sequence files as a whole; the other produces a statistical file for each individual input sequence file. The statistical information is written to an Excel worksheet file and includes SSR counts, SSR length (bp), average length (bp), frequency (loci/Mb), and density (bp/Mb). More detailed information about microsatellite statistics is available in the Supplementary Material online.

Other Features

Users can use SWR to select microsatellites and export them to an Excel file, or to generate a text file in Primer3 input format for designing PCR primers, according to the user's requirements. Sliding window analysis is a commonly used method for studying the properties of molecular sequences: data are plotted as moving averages of a particular criterion for a window of a certain length slid along a sequence. Users can also use SWP to generate a sliding window plot showing the distribution of microsatellites in a sequence or chromosome.

Input and Output

MSDB accepts FASTA, GenBank, EMBL, and plain text files containing DNA sequence data, with any file extension, as input. MSDB automatically recognizes the input format and parses each file to pure sequence for subsequent processing. The output of the MSDB main program consists of a SQLite database file containing the information for each microsatellite and an Excel file containing the statistical information for the microsatellites.

System Requirements

MSDB is written in Perl and can be run as a standalone application with a user-friendly interface on Windows and Linux systems. The Perl modules were combined into an executable file, so it is unnecessary to install a Perl interpreter or libraries when installing the program.
However, because the statistical file produced by MSDB is an Excel file, Microsoft Office Excel 2003 or later should be installed. On Linux, the free software Open Office (http://www.openoffice.org/) can be installed instead of Microsoft Office.

Supplementary Material

Supplementary material can be found at http://www.jhered.oxfordjournals.org/.

Funding

National Science Foundation (31172118); National Basic Research Program of China (973 Project: 2011CB111503).

Acknowledgments

We thank Professor Moermond for valuable help with the article.

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol. 215:403–410.
Bachmann L, Bareiss P, Tomiuk J. 2004. Allelic variation, fragment length analyses and population genetic models: a case study on Drosophila microsatellites. J Zool Syst Evol Res. 42:215–222.
Benson G. 1999. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27:573–580.
Castagnone-Sereno P, Danchin EG, Deleury E, Guillemaud T, Malausa T, Abad P. 2010. Genome-wide survey and analysis of microsatellite in nematodes, with a focus on the plant-parasitic species Meloidogyne incognita. BMC Genomics. 11:598.
Castelo A, Martins W, Gao G. 2002. TROLL-tandem repeat occurrence locator. Bioinformatics. 18:634–636.
Chambers GK, MacAvoy ES. 2000. Microsatellites: consensus and controversy. Comp Biochem Physiol B. 126:455–476.
Ellegren H. 2004. Microsatellites: simple sequences with complex evolution. Nat Rev Genet. 5:435–445.
Faircloth BC. 2008. MSATCOMMANDER: detection of microsatellite repeat arrays and automated, locus-specific primer design. Mol Ecol Resour. 8:92–94.
Gao L, Tang J, Li H, Jia J. 2003. Analysis of microsatellites in major crops assessed by computational and experimental approaches. Mol Breed. 12:245–261.
Kofler R, Schlotterer C, Lelley T. 2007. SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics. 23:1683–1685.
Kolpakov R, Bana G, Kucherov G. 2003. Mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res. 31:3672–3678.
Li YZ, Xu X, Shen FJ, Zhang WP, Zhang ZH, Hou R, Yue BS. 2010. Development of new tetranucleotide microsatellite loci and assessment of genetic variation of giant panda in two largest giant panda captive breeding populations. J Zool. 282:39–46.
Meglécz E, Costedoat C, Dubut V, Gilles A, Malausa T, Pech N, Martin J. 2010. QDD: a user-friendly program to select microsatellite markers and design primers from large sequencing projects. Bioinformatics. 26:403–404.
Mudunuri SB, Nagarajaram HA. 2007. IMEx: Imperfect Microsatellite Extractor. Bioinformatics. 23:1181–1187.
Pearson WR. 1994. Using the FASTA program to search protein and DNA sequence databases. Methods Mol Biol. 24:307–331.
Temnykh S, DeClerck G, Lukashova A, Lipovich L, Cartinhour S, McCouch S. 2001. Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential. Genome Res. 11:1441–1452.
Thiel T, Michalek W, Varshney RK, Graner A. 2003. Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor Appl Genet. 106:411–422.
Zane L, Bargelloni L, Patarnello T. 2002. Strategies for microsatellite isolation: a review. Mol Ecol. 11:1–16.
Zhu Y, Davis S, Stephens R, Meltzer PS, Chen Y. 2008. GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus. Bioinformatics. 24:2798–2800.

Received May 9, 2012; Revised August 2, 2012; Accepted August 15, 2012

Corresponding Editor: Howard Ross