White Paper
RNA-Seq Open Source Tools 101

Software used in mRNA-seq differential expression analysis on the Maverix Analytic Platform.

Introduction

Next generation sequencing (NGS) is enabling a massive increase in the scale and molecular detail of new data available for biomedical research. However, studies leveraging NGS data are commonly limited by two factors: (1) acquiring the tools and infrastructure necessary to process, interpret, and manage the data, and (2) the availability of well-trained computational biologists to carry out these specialized, interdisciplinary tasks. As a result, many researchers do not take advantage of the technology, or face unnecessary delays in analyzing the NGS data they have generated. These hurdles limit the pace and scale of the technology’s adoption in the broader research community.

Maverix Biomics was founded to address this bottleneck, and has created a unique analytic environment that integrates five key elements: (1) ease of use for any scientist; (2) the most current, industry-standard open-source tools; (3) carefully constructed “analytic kits” which are ready to use off the shelf or with expert customization; (4) flexibility to analyze and visualize data from any species; and (5) on-demand, cloud-based computing power and storage to rapidly scale to almost any sized project. We provide the know-how, computing resources, and interactive visualization environment, leaving researchers to focus on the science.

One rapidly growing area in the NGS realm is RNA-seq [1], a high-throughput sequencing technology that provides a genome-wide assessment of the RNA content of an organism, tissue or cell. RNA-seq takes an unbiased approach that allows a researcher to define and analyze the transcriptome, identify transcription start sites (TSSs), and perform alternative splicing analysis.
RNA-seq analysis on the Maverix Analytic Platform offers a simple solution to complex NGS analytics, utilizing open-source bioinformatics tools in a secure cloud-based environment.

Open Source

Open-source software is computer software whose development is based on the sharing and collaborative improvement of the software source code [2]. While there are a number of different open-source licenses in common use, most enable free source code distribution, where the copyright holder provides the right to download, alter and distribute the software, resulting in an open development process carried out in a public, collaborative manner.

Software developers in the bioinformatics community were early adopters of the open-source ideal and have continued to embrace the concept as the field has expanded [3]. Leading open-source tools benefit from regular input from the bioinformatics community, resulting in continued improvement and increased utility, robustness and stability. In addition, emerging data types from the burgeoning NGS field create a constant demand for new tools, and existing algorithms often provide the building blocks for expansion and innovation to meet those needs. Maverix Biomics leverages these leading peer-reviewed open-source tools, which have become standard, effective methods for NGS analysis.

Having focused initially on the advantages of using open-source bioinformatics tools, it is only reasonable to mention some of the limitations. The level of support and frequency of updates are variable, which can result in the software being less stable than a commercial product. This is frequently the case in an academic environment, where long-term support and maintenance are not considered in ongoing grant budgets. In addition, licensing is often limited to academic or non-profit institutions, making it difficult for commercial companies to incorporate the gold-standard tools into their platforms.
Despite its challenges, open-source software benefits from a community of users and from collaborative development, each of which serves to advance innovation. The free, or low-cost, licensing makes it easy to download the software and begin using it immediately, and the flexibility of open-source tools allows users to enhance and expand the utility of the software to meet their needs.

Data Analysis

One of the most rapidly expanding areas in the NGS arena is RNA-seq [1] for transcriptome profiling, which has the potential to reveal the full RNA complement, including mRNA, rRNA, tRNA and other non-coding RNAs (ncRNAs). Analysis of RNA-seq data can extend far beyond the description of the transcript complement to include transcription start site (TSS) determination, alternative splicing and transcript isoform analysis, and RNA-protein interaction analysis via CLIP-seq and RIP-seq.

RNA-seq analysis on the Maverix Analytic Platform provides on-demand analysis for any type of RNA, uses widely accepted open-source analytic applications, and supports any type of organism, including human, animal, plant or microbe.

In this white paper, we focus on a specific type of RNA-seq analysis provided by Maverix Biomics as an introduction to the protocol and tools: mRNA-Seq for Differential Expression in Eukaryotes (Fig. 1). The analysis begins with data quality assessment and preprocessing before launching the read mapping step. Once the improved set of RNA-seq reads is aligned to the reference genome, transcript assembly and abundance determination are carried out. As the analysis progresses, QC reports are generated, mapping statistics are summarized in tables and charts, and genome browser tracks are generated for visualization in the integrated UCSC Genome Browser [4]. The analysis completes with differential expression analysis, with results provided in interactive tables and heat maps.

Figure 1. Analysis overview for the mRNA-Seq for Differential Expression in Eukaryotes analysis kit on the Maverix Analytic Platform.

The Tools

The Maverix Biomics team brings its bioinformatics expertise to the table. We have created an integrated protocol that combines the following steps of analysis into a streamlined process that utilizes gold-standard bioinformatics tools (Table 1) to provide a complete analysis of raw mRNA-seq data, including differential expression.

Table 1. Open-source tools used in the mRNA-Seq Differential Expression analysis. List of the main steps of the analysis and visualization, the open-source software used, and the citation for each tool.

QC/Preprocessing: FastQC [5]; PRINSEQ [6]; Reaper [7]; ea-utils [8]
Read Alignment: STAR [9]; TopHat [10]; Bowtie [11]; SAMtools [14]
Transcript Assembly and Expression Analysis: Cufflinks [15]; Cuffdiff [15]; R [18]; BEDTools [16]; UCSC Genome Browser, Kent source utilities [4]

Quality Control

The analysis protocol begins with data quality assessment of the raw input sequence data. The open-source tool used in this step is FastQC, a quality control (QC) tool for high-throughput sequence data [5]. FastQC analyzes the uploaded raw sequence reads to reveal the quality of the data and inform the subsequent preprocessing steps in the pipeline. FastQC imports the FASTQ files containing the raw sequence data and first outputs a set of basic statistics, including the number of raw reads and the read length. A set of analyses is then carried out to determine sequence quality, read length distribution, and GC content, as well as the presence of duplications and overrepresented sequences. The analysis gives an overview of the mRNA-seq data quality and identifies potential sources of contamination and problem areas that can be managed in the preprocessing step.
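To make the basic statistics concrete, the following is a minimal Python sketch of the kind of summary described above: read count, read length distribution, GC content, and mean base quality. It is not FastQC's implementation. The function names (`fastq_records`, `basic_stats`), the four-lines-per-record layout, and the Phred+33 quality encoding are assumptions made for illustration.

```python
from collections import Counter
from io import StringIO


def fastq_records(handle):
    """Yield (sequence, quality_string) pairs, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def basic_stats(handle, phred_offset=33):
    """Summarize read count, read lengths, GC content and mean base quality.

    phred_offset=33 assumes Sanger / Illumina 1.8+ quality encoding.
    """
    n_reads = 0
    lengths = Counter()
    gc = 0
    bases = 0
    qual_sum = 0
    for seq, qual in fastq_records(handle):
        n_reads += 1
        lengths[len(seq)] += 1
        gc += sum(1 for base in seq.upper() if base in "GC")
        bases += len(seq)
        qual_sum += sum(ord(char) - phred_offset for char in qual)
    return {
        "reads": n_reads,
        "length_distribution": dict(lengths),
        "gc_percent": 100.0 * gc / bases if bases else 0.0,
        "mean_quality": qual_sum / bases if bases else 0.0,
    }


# Two made-up reads standing in for an uploaded FASTQ file.
example = StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nGGCCAT\n+\nIIIIII\n"
)
stats = basic_stats(example)
print(stats)
```

A real QC report also covers per-position quality, duplication levels, and overrepresented sequences, which this sketch leaves out.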
Data Preprocessing

Following QC, the analysis moves to preprocessing of the RNA-seq reads to improve the quality of the data input for the downstream read mapping. The trimming, filtering and formatting of short read sequencing data is carried out via the open-source tools PRINSEQ [6], Reaper [7], and FastqMcf, part of the ea-utils package [8]. Data preprocessing detects and removes N’s at the ends of reads, trims sequencing adapters, and filters reads for quality and length.

Once the preprocessing is complete, FastQC analysis is carried out on the trimmed and filtered set of reads to perform a follow-up data quality assessment. This provides a comparison between the raw input sequence data and the quality of the improved set of reads (Fig. 2). The summary report generated provides a quality assurance check to validate the data used in the subsequent read mapping step.

Improving the set of reads before alignment is a critical step to prevent the introduction of alignment errors and to improve the overall mapping rate. The removal of adapter sequence and low-quality sequence from the ends of reads, as well as the removal of reads whose overall quality or length falls below a designated threshold, can improve the number of mapped reads, increase the speed of the alignment step, and prevent errors of misalignment.

Figure 2. Quality scores from FastQC analysis. Per base sequence quality for mRNA-seq reads, demonstrating quality improvement from raw reads (A) to trimmed and filtered reads (B).

Figure 3. Mapping summary charts. Example mapping summarization showing percent reads and total read counts for four samples, with a comparison of trimmed, mapped and unmapped reads.

Read Alignment

As the analysis moves into the read mapping step, researchers using the Maverix Analytic Platform have a choice between the de novo splice alignment tools STAR [9] and TopHat [10] to map their reads to the reference genome.
When selecting the mapping tool to use, the decision is often based on the type and quantity of reads, the complexity of the reference genome, or how quickly results are needed.

STAR (Spliced Transcripts Alignment to a Reference) is an algorithm designed to address the limitations of earlier tools: poor scalability with read length, low mapping rates, and restrictions on the number of mismatches, indels, and splice junctions per read [9], which made it difficult to detect non-linear transcripts. STAR can align reads significantly faster than TopHat and has been reported to yield more uniquely mapped reads and more read pairs with both mates mapped. In addition, STAR identifies non-canonical splicing and chimeric transcripts, and is capable of mapping full-length RNA sequences.

The second option is TopHat, which uses the fast, memory-efficient short read aligner Bowtie [11] as an alignment ‘engine’ [12]. When Bowtie fails to align a sequence read, TopHat splits the read into smaller segments; when those segments align to the genome at a distance from one another, TopHat can infer splice junctions, with or without available splice site annotations. TopHat has an advantage in that it maps to the transcriptome and genome independently, choosing the best alignment, and has been reported to be accurate with highly repetitive genomes, in the presence of pseudogenes, and across fusion breaks. TopHat is the more highly cited of the two tools [13] and is considered the standard for spliced alignment of RNA-seq reads.

When the read alignment step of the analysis is complete, a mapping summary report is generated with the assistance of the open-source software SAMtools [14], a package of utilities for manipulating alignments in the SAM (Sequence Alignment/Map) format. The mapping summarization includes the number of aligned reads, as read percent and total read counts, for trimmed, mapped and unmapped reads in a tabular view, as well as the number of paired-end reads that align properly.
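The counts behind such a mapping summary can be read from the FLAG field of each SAM record. The Python sketch below illustrates the idea on a few made-up SAM lines; it is not the platform's reporting code, and the function name `mapping_summary` is hypothetical. The bit values used (0x1 paired, 0x2 properly paired, 0x4 unmapped, 0x100 secondary, 0x800 supplementary) come from the SAM format specification.

```python
def mapping_summary(sam_lines):
    """Count mapped, unmapped and properly paired records from SAM text lines.

    Each primary alignment record is counted once, so the two mates of a
    pair contribute two records.
    """
    counts = {"total": 0, "mapped": 0, "unmapped": 0, "properly_paired": 0}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue  # skip secondary and supplementary alignments
        counts["total"] += 1
        if flag & 0x4:
            counts["unmapped"] += 1
        else:
            counts["mapped"] += 1
            if flag & 0x1 and flag & 0x2:
                counts["properly_paired"] += 1
    total = counts["total"]
    counts["percent_mapped"] = 100.0 * counts["mapped"] / total if total else 0.0
    return counts


# Made-up records: one properly paired pair, one unmapped read,
# one single-end hit, and a secondary alignment that is ignored.
example_sam = [
    "@HD\tVN:1.0",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r1\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t256\tchr2\t50\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
summary = mapping_summary(example_sam)
print(summary)
```

In practice the same numbers are usually taken from a SAMtools report on the BAM file rather than from parsing text by hand; the sketch only shows which flag bits the summary depends on.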
A columnar chart of the percent aligned reads and total aligned reads is also provided to facilitate visualization and comparison across samples (Fig. 3).

The final analysis steps include transcript assembly and expression analysis, which are carried out by an open-source software package called Cufflinks [12, 15]. The following sections describe the tools and their use in the mRNA-seq for Differential Expression analysis in more detail. In brief, Cufflinks is used in the transcript assembly step and, in conjunction with the utility Cuffcompare, in abundance determination. Cuffmerge merges the GTF files produced by Cufflinks that will be used to compare samples in the subsequent differential analysis. Finally, Cuffdiff is used in the expression analysis, identifying differential expression levels in genes and transcripts, as well as differential splicing and promoter use.

Transcript Assembly

Following the read mapping step, transcript assembly is carried out by Cufflinks, which generates spliced transcripts based on the individual mRNA-seq reads aligned to the reference genome by STAR or TopHat. Cufflinks assembles the reads into transcripts and identifies known and novel splice variants. If more than one replicate is available, Cufflinks performs transcript assembly independently for each replicate, rather than assembling transcripts from a pooled set of reads. This prevents the incorrect assembly of transcripts that could occur in a more complex mixture of reads. If the analysis includes replicate sets, once they have been independently processed, Cuffmerge merges the individual transcript assemblies together, creating a final transcriptome assembly that is used in the subsequent expression analysis steps.
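The order of operations described above (one Cufflinks assembly per replicate, a Cuffmerge step over the resulting GTF files, then Cuffdiff across conditions) can be sketched as a set of command lines. The Python below only builds and prints the commands; it does not run them, and it is not the Maverix platform's configuration. The options shown (`-p`, `-o`, `-g`, `-s`, `-L`) follow the published TopHat and Cufflinks protocol [12] and should be checked against the installed version. All file names, directory names, and sample labels are hypothetical.

```python
def cufflinks_cmd(bam, out_dir, threads=4):
    """Assemble transcripts for a single replicate's alignments."""
    return ["cufflinks", "-p", str(threads), "-o", out_dir, bam]


def assembly_list_text(out_dirs):
    """Contents of the text file given to cuffmerge: one GTF path per line."""
    return "\n".join(f"{d}/transcripts.gtf" for d in out_dirs) + "\n"


def cuffmerge_cmd(assembly_list, reference_gtf, genome_fasta,
                  out_dir="merged_asm", threads=4):
    """Merge the per-replicate assemblies into one transcriptome."""
    return ["cuffmerge", "-o", out_dir, "-g", reference_gtf,
            "-s", genome_fasta, "-p", str(threads), assembly_list]


def cuffdiff_cmd(merged_gtf, conditions, out_dir="diff_out", threads=4):
    """Test for differential expression between conditions.

    conditions maps a label to its replicate BAM files; replicates of one
    condition are passed as a single comma-separated argument.
    """
    labels = ",".join(conditions)
    groups = [",".join(bams) for bams in conditions.values()]
    return (["cuffdiff", "-o", out_dir, "-p", str(threads),
             "-L", labels, merged_gtf] + groups)


# Hypothetical two-condition experiment with two replicates each.
conditions = {
    "control": ["control_rep1.bam", "control_rep2.bam"],
    "knockdown": ["knockdown_rep1.bam", "knockdown_rep2.bam"],
}
commands = []
out_dirs = []
for bams in conditions.values():
    for bam in bams:
        out_dir = bam.replace(".bam", "_clout")
        out_dirs.append(out_dir)
        commands.append(cufflinks_cmd(bam, out_dir))
commands.append(cuffmerge_cmd("assemblies.txt", "genes.gtf", "genome.fa"))
commands.append(cuffdiff_cmd("merged_asm/merged.gtf", conditions))

for command in commands:
    print(" ".join(command))
```

Building the commands as argument lists keeps the per-replicate assembly step visibly separate from the merge, which is the design point the section makes about avoiding pooled assembly.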
Expression Analysis

In addition to transcript assembly, Cufflinks quantifies the expression level of the sets of transcripts per gene, taking steps to filter out background noise and artifacts. Cufflinks is assisted by the utility program Cuffcompare, which compares the transcript assemblies to known annotation on the reference genome. This can often help in cases where the mRNA-seq reads are relatively sparse in a region, where known annotation can assist in assigning new genes and identifying known ones. The read depth, combined with the comparisons to reported annotation, helps estimate expression levels of alternatively spliced transcripts, known genes, and novel loci.

The final objective of the analysis is the differential expression analysis, which utilizes the final component of the Cufflinks package: Cuffdiff. Using the final transcriptome assemblies from the transcript assembly step of the analysis, Cuffdiff identifies genes and transcripts that are differentially expressed between samples. The algorithm evaluates expression in two or more samples and determines the statistical significance of the difference in expression between them [15]. Cuffdiff is capable of finding genes and transcripts with differential levels of expression, as well as identifying distinct patterns of splicing and promoter usage.

Cuffdiff supplies several output files at this stage of the analysis. Changes in the gene and transcript expression levels are output in tabular form, providing the fold change in expression, P values, and feature attributes, including identifiers and chromosomal locations. The differential expression data can then be evaluated to identify genes with significant changes in expression relative to the control or to another sample condition. In the final stage of our analysis, the file outputs are processed to provide summary reports, charts, interactive tables and heat maps, as described in the final section of our tools overview.

Figure 4. mRNA-seq genome browser tracks. Example read alignment and read coverage tracks for control and experimental samples in the UCSC Genome Browser. In the presence of HOXA1 knockdown, the CCNA2 gene shows a decrease in expression relative to the control condition.

Visualization

One of the main outputs from the mRNA-seq analysis is the set of files used in the creation of browser tracks, which are made available for viewing and analysis in a private instance of the UCSC Genome Browser, integrated in the platform and available securely from your account. The UCSC Genome Browser is the most widely used genome browser, providing web access to current versions of annotated reference genome sequence in the context of public datasets. Files are generated, converted, deployed to the genome browser, and made available for download from within the platform, all via an automated, streamlined process that requires no input or interaction from the platform user.

The read alignment output from STAR is initially a file in SAM (Sequence Alignment/Map) format, a generic, compact, sortable and indexable format for storing large nucleotide sequence alignments against reference sequences. SAMtools is used to convert from SAM to BAM format, the compressed binary version of SAM. TopHat creates output directly in the BAM format. The BAM files are then deployed directly to the integrated UCSC Genome Browser as custom tracks that allow visualization of the aligned mRNA-seq reads in a genomic environment (Fig. 4), and in the context of public tracks and shared data sets.

In addition to the read alignment track, the read mapping step of the analysis also generates an mRNA-seq read coverage browser track to allow a graphic view of the depth of coverage and to assist in comparing expression levels between samples (Fig. 4).
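As an illustration of what a read coverage track contains, the following Python sketch turns a handful of aligned intervals into runs of constant depth, similar in shape to the rows of a bedGraph file. It is a simplified stand-in rather than the tooling the platform uses: the function name `coverage_runs` is hypothetical, coordinates are assumed to be 0-based and half-open, and each alignment is treated as one contiguous block, so a spliced read would first have to be split into its exon blocks.

```python
from collections import defaultdict


def coverage_runs(alignments):
    """Return (chrom, start, end, depth) runs for 0-based half-open intervals.

    Depth is computed with a sweep over start (+1) and end (-1) events.
    Positions with zero depth are omitted.
    """
    deltas = defaultdict(lambda: defaultdict(int))
    for chrom, start, end in alignments:
        deltas[chrom][start] += 1
        deltas[chrom][end] -= 1

    runs = []
    for chrom in sorted(deltas):
        depth = 0
        prev = None
        for pos in sorted(deltas[chrom]):
            if prev is not None and depth > 0:
                runs.append((chrom, prev, pos, depth))
            depth += deltas[chrom][pos]
            prev = pos
    return runs


# Three made-up alignments: two overlapping reads and one isolated read.
reads = [("chr1", 0, 10), ("chr1", 5, 15), ("chr1", 20, 25)]
runs = coverage_runs(reads)
for chrom, start, end, depth in runs:
    print(chrom, start, end, depth, sep="\t")
```

Adjacent runs with equal depth are not merged here; that keeps the sweep short and does not change the depth reported at any position.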
Read coverage in the genome browser is visualized through a bigWig file, which is generated from the BAM file output of the read alignment step using BEDTools [16] and Kent source utilities [4].

Cufflinks outputs a transcript file in GTF format (Gene Transfer Format), containing chromosome location, coordinates by transcript and individual exons, features overlapped in the reference genome, and FPKM (Fragments Per Kilobase of transcript per Million mapped reads) values. The abundance determination from FPKM values mirrors read density measurements using RPKM values [17], normalizing for RNA length and total number of reads and allowing comparison of transcript levels within and between samples. From the GTF file, a BED file is generated and loaded as a track to the genome browser, showing the location of the transcript fragments generated via Cufflinks and allowing visualization of the differential pattern of gene expression and splice isoforms between samples.

The differential expression step provides one of the most powerful visualizations from the mRNA-seq differential expression analysis in the form of a heat map. Cuffdiff generates an expression output file for each control/sample pair. For the mRNA-seq differential expression analysis, a gene-level expression file is created, which collates read counts for all isoforms of a transcription unit. Once the data is joined, filtering is done to remove genes that fall below a threshold of significance for differential expression, then gene annotations are added to the composite file that is used for heat map generation.

Heat map clustering is carried out using the 'hclust' function of the open-source software R [18]. The hierarchical clustering step uses the sample FPKM value for each gene with Pearson correlation as the distance metric, and clustering is based on the similarity of the expression between two genes. The interactive heat map can be navigated by hovering over the sample rows and visualizing the underlying gene features, viewing names, chromosomal locations, and differential expression values. Clicking on a gene’s location on the heat map will load the associated chromosomal region in the integrated UCSC Genome Browser on the right.

Figure 5. Example heat map and integrated UCSC Genome Browser on the Maverix Analytic Platform. The heat map on the left shows differential expression between four samples. Mouseover reveals the feature name, chromosomal location and differential expression values. Clicking a feature on the heat map will load the region in the genome browser on the right.

Summary

As technological improvements in NGS methodologies steadily advance and the cost of sequencing simultaneously declines, the generation of NGS data is increasing at a rate that is leading to a data analysis bottleneck. Maverix Biomics was founded to address the situation by providing computational analysis designed by bioinformatics experts using gold-standard open-source tools, the infrastructure to manage your data, and the platform and resources to visualize your results. The solutions that Maverix Biomics provides allow researchers to leverage their NGS data without delay so they can bypass the technical roadblocks and focus on the science. Maverix has sought to make the clear advantages of open-source NGS analysis tools readily accessible to the non-expert so that high-performance NGS bioinformatics are open to all who need them, whenever they need them, with the support they deserve.

References

1. RNA-Seq: a revolutionary tool for transcriptomics. Wang Z, Gerstein M, Snyder M. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/nrg2484
2. The Open Source Definition. The Open Source Initiative. http://opensource.org/docs/osd
3. Open source tools and toolkits for bioinformatics: significance, and where are we? Stajich JE, Lapp H. Brief Bioinform. 2006 Sep;7(3):287-96.
4. The UCSC genome browser and associated tools. Kuhn RM, Haussler D, Kent WJ. Brief Bioinform. 2013 Mar;14(2):144-61. doi: 10.1093/bib/bbs038
5. FastQC: A quality control tool for high throughput sequence data. Simon Andrews. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
6. Quality control and preprocessing of metagenomic datasets. Schmieder R, Edwards R. Bioinformatics. 2011 Mar 15;27(6):863-4. doi: 10.1093/bioinformatics/btr026
7. Reaper: demultiplexing, trimming and filtering short read sequencing data. Stijn van Dongen. http://www.ebi.ac.uk/~stijn/reaper/reaper.html
8. ea-utils: "Command-line tools for processing biological sequencing data". Erik Aronesty, 2011; http://code.google.com/p/ea-utils
9. STAR: ultrafast universal RNA-seq aligner. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics. 2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635
10. TopHat: discovering splice junctions with RNA-Seq. Trapnell C, Pachter L, Salzberg SL. Bioinformatics. 2009 May 1;25(9):1105-11. doi: 10.1093/bioinformatics/btp120
11. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Langmead B, Trapnell C, Pop M, Salzberg SL. Genome Biol. 2009;10(3):R25. doi: 10.1186/gb-2009-10-3-r25
12. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Nat Protoc. 2012 Mar 1;7(3):562-78. doi: 10.1038/nprot.2012.016
13. Google Scholar, http://scholar.google.com, citations as of 7/15/2013; STAR (doi: 10.1093/bioinformatics/bts635): 10; TopHat (doi: 10.1093/bioinformatics/btp120): 1163.
14. The Sequence Alignment/Map format and SAMtools. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352
15. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Nat Biotechnol. 2010 May;28(5):511-5. doi: 10.1038/nbt.1621
16. BEDTools: a flexible suite of utilities for comparing genomic features. Quinlan AR, Hall IM. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033
17. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Nat Methods. 2008 Jul;5(7):621-8. doi: 10.1038/nmeth.1226
18. R: A language and environment for statistical computing. R Development Core Team (2008). R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org

Appendix - Open Source Tools

For the most up-to-date list of open-source tools deployed on the Maverix Analytic Platform, please visit www.maverixbio.com/opensource

BEDTools
Flexible suite of utilities for comparing genomic features, such as finding feature overlaps and computing coverage.
BEDTools: a flexible suite of utilities for comparing genomic features. Quinlan AR, Hall IM. Bioinformatics. 2010 Mar 15;26(6):841-2. DOI: 10.1093/bioinformatics/btq033; PubMed: 20110278
License: GPLv2
Website: http://code.google.com/p/bedtools/

BLAST-Like Alignment Tool (BLAT)
Performs rapid mRNA/DNA and cross-species protein alignments.
BLAT–the BLAST-like alignment tool. Kent WJ. Genome Res. 2002 Apr;12(4):656-64. DOI: 10.1101/gr.229202; PubMed: 19357099
License: Authorized Use of Commercial License
Website: http://www.kentinformatics.com/

Bowtie
Ultrafast, memory-efficient short read aligner.
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Langmead B, Trapnell C, Pop M, Salzberg SL. Genome Biol. 2009;10(3):R25. DOI: 10.1186/gb-2009-10-3-r25; PubMed: 19261174
License: Artistic License
Website: http://bowtie-bio.sourceforge.net

Bowtie 2
Ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. Supports gapped, local, and paired-end alignment modes.
Fast gapped-read alignment with Bowtie 2. Langmead B, Salzberg SL. Nat Methods. 2012 Mar 4;9(4):357-9. DOI: 10.1038/nmeth.1923; PubMed: 22388286
License: Artistic License
Website: http://bowtie-bio.sourceforge.net/bowtie2/

Bpipe
Provides a platform for running and managing big bioinformatics jobs that consist of a series of processing stages, known as ‘pipelines’.
Bpipe: a tool for running and managing bioinformatics pipelines. Sadedin SP, Pope B, Oshlack A. Bioinformatics. 2012 Jun 1;28(11):1525-6. DOI: 10.1093/bioinformatics/bts167; PubMed: 22500002
License: BSD
Website: https://code.google.com/p/bpipe/

Cufflinks
Assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples.
Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Nat Biotechnol. 2010 May;28(5):511-5. DOI: 10.1038/nbt.1621; PubMed: 20436464
License: OSI-Approved Boost License
Website: http://cufflinks.cbcb.umd.edu/

ea-utils
Command-line tools for processing biological sequencing data. Barcode demultiplexing, adapter trimming, etc.
ea-utils: “Command-line tools for processing biological sequencing data”. Erik Aronesty
License: MIT
Website: http://code.google.com/p/ea-utils

eXpress
Software package for efficient probabilistic assignment of ambiguously mapping sequenced fragments. eXpress uses a streaming algorithm with linear run time and constant memory use, and can determine abundances of sequenced molecules in real time.
Streaming fragment assignment for real-time analysis of sequencing experiments. Roberts A, Pachter L. Nat Methods. 2013 Jan;10(1):71-3. DOI: 10.1038/nmeth.2251; PubMed: 23160280
License: Artistic License 2.0
Website: http://bio.math.berkeley.edu/eXpress/index.html

FastQC
Quality control tool for high throughput sequence data.
License: GPLv3
Website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

NCBI C++ Toolkit
Provides source code and configuration scripts that make it easy to compile NCBI software to run on a variety of computing platforms. The toolkit includes BLAST, the Basic Local Alignment Tool code and interface.
The NCBI C++ Toolkit Book [Internet]. Vakatov D, editor. Bethesda (MD): National Center for Biotechnology Information (US); 2004-. http://www.ncbi.nlm.nih.gov/books/NBK7160/
License: Public Domain
Website: http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/

Perl
High-level programming language for tasks involving quick prototyping, system utilities, software tools, system management tasks, database access, graphical programming, networking, and web programming.
License: Artistic License, GPLv3
Website: http://www.perl.org/

PRINSEQ
Bioinformatics tool to PRe-process and show INformation of SEQuence data. The tool is written in Perl and can be helpful if you want to filter, reformat, or trim your sequence data. It also generates basic statistics for your sequences.
Quality control and preprocessing of metagenomic datasets. Schmieder R, Edwards R. Bioinformatics. 2011 Mar 15;27(6):863-4. DOI: 10.1093/bioinformatics/btr026; PubMed: 21278185
License: GPLv3
Website: http://prinseq.sourceforge.net/

R
Provides a wide variety of statistical and graphical techniques, and is highly extensible.
R: A language and environment for statistical computing. R Development Core Team (2008). R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
License: GPLv2
Website: http://www.r-project.org/

Reaper
Program for demultiplexing, trimming and filtering short read sequencing data. Includes Tally, a program for deduplicating sequence fragments for both single and paired end input.
License: GPLv3
Website: http://www.ebi.ac.uk/~stijn/reaper/reaper.html

SAMtools
Various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
The Sequence Alignment/Map format and SAMtools. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. Bioinformatics. 2009 Aug 15;25(16):2078-9. DOI: 10.1093/bioinformatics/btp352; PubMed: 19505943
License: BSD, MIT
Website: http://samtools.sourceforge.net/

SRA Toolkit
Supports conversion of SRA data to several popular formats.
License: free
Website: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc

STAR
Spliced Transcripts Alignment to a Reference (STAR), an ultrafast universal RNA-seq aligner, uses sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure.
STAR: ultrafast universal RNA-seq aligner. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics. 2013 Jan 1;29(1):15-21. DOI: 10.1093/bioinformatics/bts635; PubMed: 23104886
License: GPLv3
Website: https://code.google.com/p/rna-star/

TMAP
The torrent mapping alignment program is a fast and accurate alignment software for short and long nucleotide sequences produced by next-generation sequencing technologies.
License: GPLv2
Website: https://github.com/iontorrent/TMAP

TopHat
Fast splice junction mapper for RNA-Seq reads.
TopHat: discovering splice junctions with RNA-Seq. Trapnell C, Pachter L, Salzberg SL. Bioinformatics. 2009 May 1;25(9):1105-11. DOI: 10.1093/bioinformatics/btp120; PubMed: 19289445
License: Artistic License
Website: http://tophat.cbcb.umd.edu/

tRNAscan-SE
Search for tRNA genes in genomic sequence.
tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Lowe TM, Eddy SR. Nucleic Acids Res. 1997 Mar 1;25(5):955-64. DOI: 10.1093/nar/25.5.0955; PubMed: 9023104
License: GPLv2
Website: http://lowelab.ucsc.edu/tRNAscan-SE/

UCSC Genome Browser and Tool Set
An integrated tool set for visualizing, comparing, analyzing and sharing both publicly available and user-generated genomic data sets. The UCSC Genome Browser is a mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with aligned annotation tracks.
The UCSC genome browser and associated tools. Kuhn RM, Haussler D, Kent WJ. Brief Bioinform. 2012 Oct 31. DOI: 10.1093/bib/bbs038; PubMed: 22908213
License: Authorized Use of Commercial License
Website: http://genome.ucsc.edu/

Complex NGS analytics made simple.
1670 S. Amphlett Blvd, Suite 214, San Mateo CA 94402 • 650-388-9277
www.maverixbio.com
© 2013 Maverix Biomics, Inc. All Rights Reserved.