RNAseq Open Source Tools 101 revision 3

White Paper
RNA-Seq Open Source Tools 101
Software used in mRNA-seq differential expression analysis on the Maverix Analytic Platform.
Introduction
Next-generation sequencing (NGS) is enabling a massive
increase in scale and molecular detail of new data available
for biomedical research. However, studies leveraging NGS
data are commonly limited by two factors: (1) acquiring the
tools and infrastructure necessary to process, interpret, and
manage the data, and (2) availability of well-trained
computational biologists to carry out these specialized,
interdisciplinary tasks. As a result, many researchers do not
take advantage of the technology, or may face unnecessary
delays in analyzing the NGS data they have generated.
These hurdles limit the pace and scale of the technology’s
adoption in the broader research community.
Maverix Biomics was founded to address this bottleneck,
and has created a unique analytic environment that
integrates key elements: (1) ease of use for any scientist; (2)
the most current, industry-standard open source tools; (3)
carefully constructed “analytic kits” which are ready to use off
the shelf or with expert customization; (4) flexibility to
analyze and visualize data from any species; and (5) on-demand,
cloud-based computing power and storage to
rapidly scale to almost any sized project. We provide the
know-how, computing resources, and interactive
visualization environment, leaving researchers to focus on
the science.
One rapidly growing area in the NGS realm is RNA-seq [1], a
high-throughput sequencing technology that provides a
genome-wide assessment of the RNA content of an
organism, tissue or cell. RNA-seq utilizes an unbiased
approach that allows a researcher to define and analyze the
transcriptome, identify transcription start sites (TSSs), and
perform alternative splicing analysis. RNA-seq analysis on
the Maverix Analytic Platform offers a simple solution to
complex NGS analytics, utilizing open-source bioinformatics
tools in a secure cloud-based environment.
Open Source
Open-source software is computer software whose
development is based on the sharing and collaborative
improvement of the software source code [2]. While there
are a number of different open-source licenses in common
use, most enable free source code distribution where the
copyright holder of the software provides the right to
download, alter and distribute the software, resulting in an
open development process in a public, collaborative manner.
Software developers in the bioinformatics community were
early adopters of the open-source ideal and have continued
to embrace the concept as the field has expanded [3].
Leading open-source tools benefit from regular input from
the bioinformatics community, resulting in continued
improvement and increased utility, robustness and stability.
In addition, emerging datatypes from the burgeoning NGS
field result in a constant demand for new tools, and existing
algorithms often provide the building blocks for expansion
and innovation to meet those needs. Maverix Biomics
leverages these leading peer-reviewed open-source tools
that have become standard, effective methods for NGS
analysis.
Having focused initially on the advantages of using open-source
bioinformatics tools, it is only reasonable to mention
some of the limitations. The level of support and frequency of
updates are variable, which can result in the software being
less stable than a commercial product. This is frequently the
case in an academic environment where long-term support
and maintenance are not considered in ongoing grant
budgets. In addition, licensing is often limited to academic or
non-profit institutions, making it difficult for commercial
companies to incorporate the gold standard tools into their
platforms. Despite its challenges, open-source software
does benefit from a community of users and its collaborative
development, each of which serves to advance innovation.
The free, or low cost, licensing makes it easy to download
and begin using immediately, and the flexibility of open-source
tools allows users to enhance and expand the utility of
the software to meet their needs.
In this white paper, we focus on a specific type of RNA-seq
analysis provided by Maverix Biomics as an introduction to
the protocol and tools: mRNA-Seq for Differential Expression
in Eukaryotes (Fig. 1). The analysis begins with data quality
assessment and preprocessing before launching the read
mapping step. Once the improved set of RNA-seq reads is
aligned to the reference genome, transcript assembly and
abundance determination are carried out. As the analysis
progresses, QC reports are generated, mapping statistics
are summarized in tables and charts, and genome browser
tracks are generated for visualization in the integrated UCSC
Genome Browser [4]. The analysis concludes
with differential expression analysis, with results provided in
interactive tables and heat maps.
Figure 1. Analysis overview for the mRNA-Seq for Differential
Expression in Eukaryotes analysis kit on the Maverix Analytic Platform.
Data Analysis
One of the most rapidly expanding areas in the NGS arena
is RNA-seq [1] for transcriptome profiling, which has the
potential to reveal the full RNA complement including mRNA,
rRNA, tRNA and other non-coding RNAs (ncRNAs). Analysis
of RNA-seq data can extend far beyond the description of
the transcript complement to include transcription start site
(TSS) determination, alternative splicing and transcript
isoform analysis, and RNA-protein interaction analysis via
CLIP-seq and RIP-seq. RNA-seq analysis on the Maverix
Analytic Platform provides on-demand analysis for any type
of RNA, utilization of widely used and accepted open-source
analytic applications, and support for any type of organism,
including human, animal, plant or microbe.
The Tools
The Maverix Biomics team brings its bioinformatics expertise
to the table. We have created an integrated protocol that
combines the following steps of analysis into a streamlined
process that utilizes gold-standard bioinformatics tools
(Table 1) to provide a complete analysis of raw mRNA-seq
data, including differential expression.
Table 1. Open-source tools used in the mRNA-Seq Differential Expression analysis. List of the main
steps of the analysis and visualization, the open-source software used, and the citation for each tool.
QC/Preprocessing: FastQC [5], PRINSEQ [6], Reaper [7], ea-utils [8]
Read Alignment: STAR [9], TopHat [10], Bowtie [11], SAMtools [14]
Transcript Assembly: Cufflinks [15]
Expression Analysis: Cuffdiff [15], R [18], BEDTools [16], UCSC Genome Browser, Kent source utilities [4]
Quality Control
The analysis protocol initiates with data quality assessment
of the raw input sequence data. The open-source tool used
in this step is FastQC, a quality control (QC) tool for
high-throughput sequence data [5]. FastQC runs analyses of the
uploaded raw sequence reads to reveal the quality of the
data and inform the subsequent preprocessing steps in the
pipeline.
FastQC imports the FASTQ files containing the raw
sequence data and first outputs a set of basic statistics,
number of raw reads, and read length. A set of analyses is
then carried out to determine sequence quality, read length
distribution, GC content, as well as the presence of
duplications and overrepresented sequences. The analysis
gives an overview of the mRNA-seq data quality and
identifies potential sources of contamination and problem
areas that can be managed in the preprocessing step.
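To make the basic statistics described above concrete, the following is a minimal Python sketch that computes read count, read length distribution, GC content, and a simple duplicate count from FASTQ records. It is not FastQC's implementation; the function names and the in-memory example records are assumptions made for illustration only.

```python
from collections import Counter

def parse_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ lines (4 lines per record)."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual

def basic_stats(records):
    """Summarize read count, length distribution, GC percentage and exact duplicates."""
    lengths = Counter()
    seq_counts = Counter()
    gc_bases = total_bases = 0
    for _, seq, _ in records:
        lengths[len(seq)] += 1
        seq_counts[seq] += 1
        gc_bases += sum(base in "GC" for base in seq.upper())
        total_bases += len(seq)
    return {
        "reads": sum(lengths.values()),
        "length_distribution": dict(lengths),
        "gc_percent": 100.0 * gc_bases / total_bases if total_bases else 0.0,
        # Reads whose sequence was already seen at least once before.
        "duplicated_reads": sum(count - 1 for count in seq_counts.values()),
    }

# Hypothetical example records, not real sequencing data.
EXAMPLE_FASTQ = """@read1
ACGTACGT
+
IIIIIIII
@read2
GGCC
+
IIII
@read3
ACGTACGT
+
IIIIIIII
""".splitlines()

stats = basic_stats(parse_fastq(EXAMPLE_FASTQ))
print(stats)
```

A real QC report adds per-base quality profiles and overrepresented-sequence detection, which are omitted here to keep the sketch short.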
Data Preprocessing
Following QC, the analysis moves to preprocessing of the
RNA-seq reads to improve the quality of data input for the
downstream read mapping. The trimming, filtering and
formatting of short read sequencing data is carried out via
the open-source tools PRINSEQ [6], Reaper [7], and
FastqMcf, part of the ea-utils package [8]. Data
preprocessing detects and removes N’s at the ends of reads,
trims sequencing adapters, and filters reads for quality and
length. Once the preprocessing is complete, FastQC
analysis is carried out on the trimmed and filtered set of
reads to perform a follow-up data quality assessment. This
provides a comparison between the raw input sequence data
and the quality of the improved set of reads (Fig. 2). The
summary report generated provides a quality assurance
check to validate the data used in the subsequent read
mapping step. Improving the set of reads before alignment is
a critical step to prevent the introduction of alignment errors
and to improve the overall mapping rate. The removal of
adapter sequence and low-quality sequence from the ends
of reads, as well as the removal of reads whose overall
quality or length falls below a designated threshold, can
increase the number of mapped reads, speed up
the alignment step, and prevent errors of misalignment.
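The trimming and filtering logic described above can be sketched in a few lines of Python. This is an illustration only, not the behavior of PRINSEQ, Reaper, or FastqMcf: the function names, the Phred+33 encoding, and the default thresholds are assumptions, and adapter trimming is left out.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in qual]

def trim_read(seq, qual, min_quality=20, min_length=5):
    """Trim terminal N's and low-quality 3' bases; return None if the read fails the length filter."""
    # Strip N's from both ends, keeping the quality string in step with the sequence.
    start = len(seq) - len(seq.lstrip("Nn"))
    end = len(seq.rstrip("Nn"))
    if start >= end:
        return None  # the read consisted only of N's
    seq, qual = seq[start:end], qual[start:end]

    # Walk in from the 3' end while the base quality is below the threshold.
    scores = phred33(qual)
    keep = len(scores)
    while keep > 0 and scores[keep - 1] < min_quality:
        keep -= 1
    seq, qual = seq[:keep], qual[:keep]

    # Discard reads that end up shorter than the minimum length.
    if len(seq) < min_length:
        return None
    return seq, qual

# Hypothetical read: '#' encodes Phred 2 and 'I' encodes Phred 40.
trimmed = trim_read("NNACGTACGTAN", "##IIIIIIII##")
print(trimmed)
```

Reads that return None would be dropped before alignment, which is the filtering step the text describes.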
Figure 2. Quality scores from FastQC analysis. Per base sequence quality for mRNA-seq reads,
demonstrating quality improvement from raw reads (A) to trimmed and filtered reads (B).
Figure 3. Mapping summary charts. Example mapping summarization showing percent reads and
total read counts for four samples, with a comparison of trimmed, mapped and unmapped reads.
Read Alignment
As the analysis moves into the read mapping step,
researchers using the Maverix Analytic Platform will have a
choice between the de novo splice alignment tools STAR [9]
and TopHat [10] to map their reads to the reference
genome. When selecting the mapping tool to use, the
decision is often based on the type and quantity of reads,
the complexity of the reference genome, or the immediacy
of acquiring results.
STAR (Spliced Transcripts Alignment to a Reference) is an
algorithm that was designed to deal with previous tools’
poor scalability with read length, low mapping rates, and the
restrictions placed on the number of mismatches, indels,
and splice junctions per read [9], which resulted in difficulty
detecting non-linear transcripts. STAR can align reads
significantly faster than TopHat and has reported more
uniquely mapped reads and more reads with both mates
mapped. In addition, STAR identifies non-canonical splicing
and chimeric transcripts, and is capable of mapping
full-length RNA sequences.
The second option is TopHat, which uses the fast,
memory-efficient short read aligner Bowtie [11] as an alignment
‘engine’ [12]. When Bowtie fails to align a sequence read,
TopHat splits the read into smaller segments; when the segments
align to the genome at a distance from one another, TopHat
can infer splice junctions, with or without available splice
site annotations. TopHat has an advantage in that it maps to
the transcriptome and genome independently, choosing the
best alignment, and has reported accuracy with highly
repetitive genomes, in the presence of pseudogenes, and
across fusion breaks. TopHat is the more highly cited of the
two tools [13] and is considered the standard for spliced
alignment of RNA-seq sequencing reads.
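The segment-based idea described above can be illustrated with a toy Python sketch. This is not TopHat's algorithm: it uses exact string matching instead of Bowtie, assumes each segment occurs once in the genome, and assumes the junction falls exactly on a segment boundary. The sequences and function names are invented for the example.

```python
def split_segments(read, segment_length):
    """Cut a read into consecutive, non-overlapping segments."""
    return [read[i:i + segment_length] for i in range(0, len(read), segment_length)]

def infer_junction(genome, read, segment_length):
    """Return (intron_start, intron_end) as a 0-based half-open interval, or None."""
    if genome.find(read) != -1:
        return None  # the read aligns contiguously, so no junction is needed
    segments = split_segments(read, segment_length)
    positions = [genome.find(segment) for segment in segments]  # exact matches only
    for index in range(len(segments) - 1):
        left, right = positions[index], positions[index + 1]
        if left == -1 or right == -1:
            continue
        left_end = left + len(segments[index])
        if right > left_end:
            # Adjacent segments map at a distance: the gap is the candidate intron.
            return left_end, right
    return None

# Hypothetical toy genome: two exons separated by a GT...AG intron.
EXON_1 = "ATGGCCATTGTAATGGGCCG"
INTRON = "GTAAGTCTCTCTCTTTTCAG"
EXON_2 = "CTGAAAGGGTGCCCGATAGT"
GENOME = EXON_1 + INTRON + EXON_2
SPLICED_READ = EXON_1[-10:] + EXON_2[:10]

junction = infer_junction(GENOME, SPLICED_READ, segment_length=10)
print(junction)
```

In practice a segment can itself span the junction, and mismatches and repeats must be handled, which is where a dedicated spliced aligner is needed.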
When the read alignment step of the analysis is complete, a
mapping summary report is generated with the assistance of
the open-source software SAMtools [14], a package of
utilities for manipulating alignments in the SAM (Sequence
Alignment/Map) format. The mapping summarization
includes the number of aligned reads as read percent and total
read counts for trimmed, mapped and unmapped reads in a
tabular view, as well as the number of paired-end reads that
align properly. A columnar chart of the percent aligned
reads and total aligned reads is also provided to facilitate
visualization and comparison across samples (Fig. 3).
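As a rough illustration of how such a summary can be derived, the Python sketch below tallies mapped, unmapped, and properly paired reads from the FLAG field of SAM records. The FLAG bit values come from the SAM specification; the function name, the assumption that the trimmed read count is supplied separately, and the example records are illustrative and do not describe the platform's own report code.

```python
# FLAG bits defined by the SAM specification.
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800

def mapping_summary(sam_lines, trimmed_reads):
    """Count mapped and properly paired reads; derive unmapped from the trimmed read total."""
    mapped = properly_paired = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank and header lines
        flag = int(line.split("\t")[1])
        # Skip unmapped records and extra alignments so each read is counted once.
        if flag & (FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
            continue
        mapped += 1
        if flag & FLAG_PROPER_PAIR:
            properly_paired += 1
    unmapped = trimmed_reads - mapped

    def percent(count):
        return round(100.0 * count / trimmed_reads, 2) if trimmed_reads else 0.0

    return {
        "trimmed": trimmed_reads,
        "mapped": mapped,
        "unmapped": unmapped,
        "properly_paired": properly_paired,
        "percent_mapped": percent(mapped),
        "percent_unmapped": percent(unmapped),
    }

# Hypothetical SAM records for two read pairs plus one secondary alignment.
EXAMPLE_SAM = [
    "@HD\tVN:1.4",
    "r1\t99\tchr1\t100\t60\t8M\t=\t200\t108\tACGTACGT\tIIIIIIII",
    "r1\t147\tchr1\t200\t60\t8M\t=\t100\t-108\tACGTACGT\tIIIIIIII",
    "r2\t73\tchr1\t300\t60\t8M\t=\t300\t0\tACGTACGT\tIIIIIIII",
    "r2\t133\tchr1\t300\t0\t*\t=\t300\t0\tACGTACGT\tIIIIIIII",
    "r1\t355\tchr2\t50\t0\t8M\t=\t150\t108\tACGTACGT\tIIIIIIII",
]

summary = mapping_summary(EXAMPLE_SAM, trimmed_reads=4)
print(summary)
```

The trimmed read count is passed in because aligners do not always write unmapped reads to the same output file as the alignments.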
The final analysis steps include transcript assembly and
expression analysis, which are carried out by an open-source
software package called Cufflinks [12, 15]. The
following sections will describe the tools and their use in the
mRNA-seq for Differential Expression analysis in more
detail. In brief, Cufflinks is used in the transcript
assembly step and, in conjunction with the utility
Cuffcompare, in abundance determination. Cuffmerge
merges GTF files produced by Cufflinks that will be used to
compare samples in the subsequent differential analysis.
Finally, Cuffdiff is used in the expression analysis,
identifying differential expression levels in genes and
transcripts, as well as differential splicing and promoter use.
Transcript Assembly
Following the read mapping step, transcript assembly is
carried out by Cufflinks, which generates spliced transcripts
based on the individual mRNA-seq reads aligned to the
reference genome by STAR or TopHat. Cufflinks assembles
the reads into transcripts and identifies known and novel
splice variants. If more than one replicate is available,
Cufflinks will perform transcript assembly independently for
each replicate, rather than assembling transcripts from a
pooled set of reads. This prevents incorrect assembly of
transcripts that could occur in a more complex mixture of
reads. If the analysis includes replicate sets, once they
have been independently processed, Cuffmerge assists in
merging the individual transcript assemblies together,
creating a final transcriptome assembly that is used in the
subsequent expression analysis steps.
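The per-replicate assembly followed by a merge can be sketched as a small Python helper that only builds the command lines, without running anything. The options shown (-p, -o for cufflinks; -g, -s, -p for cuffmerge) follow the published TopHat and Cufflinks protocol [12] as best recalled here and should be checked against the installed version. All file and directory names are hypothetical, and this is not the platform's own pipeline code.

```python
def assembly_commands(bam_by_replicate, annotation_gtf, genome_fasta, threads=4):
    """Build one cufflinks command per replicate plus a cuffmerge command.

    Returns (cufflinks_commands, manifest_text, cuffmerge_command). Nothing is executed.
    """
    cufflinks_commands = []
    gtf_paths = []
    for name, bam_path in bam_by_replicate.items():
        out_dir = f"{name}_clout"  # hypothetical output directory name
        cufflinks_commands.append(["cufflinks", "-p", str(threads), "-o", out_dir, bam_path])
        # Cufflinks writes its assembly as transcripts.gtf inside the output directory.
        gtf_paths.append(f"{out_dir}/transcripts.gtf")
    # Cuffmerge reads a text file that lists one assembly GTF per line.
    manifest_text = "\n".join(gtf_paths) + "\n"
    cuffmerge_command = [
        "cuffmerge", "-g", annotation_gtf, "-s", genome_fasta,
        "-p", str(threads), "assemblies.txt",
    ]
    return cufflinks_commands, manifest_text, cuffmerge_command

# Hypothetical replicate names and alignment files.
replicates = {
    "C1_R1": "C1_R1_thout/accepted_hits.bam",
    "C1_R2": "C1_R2_thout/accepted_hits.bam",
}
cufflinks_cmds, manifest, cuffmerge_cmd = assembly_commands(replicates, "genes.gtf", "genome.fa")
for command in cufflinks_cmds + [cuffmerge_cmd]:
    print(" ".join(command))
```

Writing the manifest text to assemblies.txt and passing each list to a process runner would complete the step; that part is left out so the sketch stays independent of any installed tools.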
Expression Analysis
In addition to transcript assembly, Cufflinks follows up by
quantifying the expression level of the sets of transcripts per
gene, taking steps to filter out background noise and
artifacts. Cufflinks is assisted by the utility program
Cuffcompare, which compares the transcript assemblies to
known annotation on the reference genome. This can often
help in cases where the mRNA-seq reads are
relatively sparse in a region and known annotation can
assist in assigning new genes and identifying known ones.
The observed read depth, combined with the comparisons
to reported annotation, helps estimate expression levels of
alternatively spliced transcripts, known genes, and novel
loci.
The final objective of the analysis is the differential
expression analysis, which utilizes the final component of
the Cufflinks package: Cuffdiff. Using the final transcriptome
assemblies from the transcript assembly step of the
analysis, Cuffdiff identifies genes and transcripts that are
differentially expressed between samples. The algorithm
evaluates expression in two or more samples and
determines the statistical significance of the difference in
expression between them [15]. Cuffdiff is capable of finding
genes and transcripts with differential levels of expression,
as well as identifying distinct patterns of splicing and
promoter usage.
Figure 4. mRNA-seq genome browser tracks. Example read alignment and read coverage tracks
for control and experimental samples in the UCSC Genome Browser. In the presence of HOXA1
knockdown, the CCNA2 gene shows a decrease in expression relative to the control condition.
There are several output files that Cuffdiff supplies at this
stage of the analysis. Changes in the gene and transcript
expression levels are output in tabular form, providing the
fold change in expression, P values, and feature attributes,
including identifiers and chromosomal locations. The
differential expression data can then be evaluated to
identify genes with significant changes in expression
relative to the control or to another sample condition. In the
final stage of our analysis, the file outputs are processed to
provide summary reports, charts, interactive tables and heat
maps, as described in the final section of our tools
overview.
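Evaluating a tabular differential expression output of the kind described above can be sketched as follows in Python. The column names (gene, locus, value_1, value_2, p_value) are modeled on Cuffdiff's gene-level table but should be treated as assumptions, as should the thresholds; the rows are invented placeholder values, not results.

```python
import csv
import io
import math

# Hypothetical tab-delimited table; gene names and values are placeholders.
EXAMPLE_TABLE = (
    "gene\tlocus\tvalue_1\tvalue_2\tp_value\n"
    "GENE_A\tchr1:1000-5000\t40.0\t10.0\t0.0005\n"
    "GENE_B\tchr2:200-900\t12.0\t13.0\t0.60\n"
    "GENE_C\tchr3:50-800\t5.0\t20.0\t0.01\n"
)

def significant_genes(handle, max_p=0.05, min_abs_log2fc=1.0):
    """Return (gene, locus, log2 fold change, p value) for rows passing both thresholds."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        control, treated = float(row["value_1"]), float(row["value_2"])
        if control <= 0 or treated <= 0:
            continue  # fold change is undefined here; this sketch simply skips such genes
        log2fc = math.log2(treated / control)
        p_value = float(row["p_value"])
        if p_value <= max_p and abs(log2fc) >= min_abs_log2fc:
            hits.append((row["gene"], row["locus"], log2fc, p_value))
    # Most significant changes first.
    return sorted(hits, key=lambda hit: hit[3])

hits = significant_genes(io.StringIO(EXAMPLE_TABLE))
for gene, locus, log2fc, p_value in hits:
    print(gene, locus, log2fc, p_value)
```

A negative log2 fold change indicates lower expression in the second condition relative to the first, and a positive value indicates higher expression.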
Visualization
One of the main outputs from the mRNA-seq analysis is the
set of files used in the creation of browser tracks, which are
made available for viewing and analysis in a private
instance of the UCSC Genome Browser, integrated in the
platform and available securely from your account. The
UCSC Genome Browser is the most widely used genome
browser, providing web-based access to current
versions of annotated reference genome sequences in the
context of public datasets. Files are generated, converted,
deployed to the genome browser, and made available for
download from within the platform all via an automated,
streamlined process that requires no input or interaction
from the platform user.
The read alignment output from STAR initially creates a file
in SAM (Sequence Alignment/Map) format, a generic,
compact, sortable and indexable format for storing large
nucleotide sequence alignments against reference
sequences. SAMtools is used to convert from SAM to BAM
format, the compressed binary version of SAM. TopHat
creates output directly in the BAM format. The BAM files are
then deployed directly to the integrated UCSC Genome
Browser as custom tracks that allow visualization of the
aligned mRNA-seq reads in a genomic environment (Fig.
4), and in the context of public tracks and shared data sets.
In addition to the read alignment track, the read mapping
step of the analysis also generates an mRNA-seq read
coverage browser track to allow a graphic view of the depth
of coverage and to assist in comparing expression levels
between samples (Fig. 4). Read coverage in the genome
browser is visualized through a bigWig file which is
generated from the BAM file output from the read alignment
step, using BEDTools [16] and Kent source utilities [4].
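To show what a read coverage track contains, the Python sketch below converts aligned intervals into bedGraph-style rows of (chromosome, start, end, depth), the kind of intermediate data a bigWig file is built from. It is a simplified stand-in, not the BEDTools or Kent utilities implementation; the function name and example intervals are assumptions, and a spliced read would need to be supplied as one interval per aligned block.

```python
from collections import defaultdict

def coverage_bedgraph(alignments):
    """Turn (chrom, start, end) intervals (0-based, half-open) into depth runs."""
    # Record +1 where an alignment starts and -1 where it ends.
    events = defaultdict(lambda: defaultdict(int))
    for chrom, start, end in alignments:
        events[chrom][start] += 1
        events[chrom][end] -= 1

    rows = []
    for chrom in sorted(events):
        depth = 0
        previous = None
        for pos in sorted(events[chrom]):
            delta = events[chrom][pos]
            if delta == 0:
                continue  # depth does not change here, so the current run continues
            if depth > 0:
                rows.append((chrom, previous, pos, depth))
            depth += delta
            previous = pos
    return rows

# Hypothetical alignments: two overlapping reads and one isolated read.
example_rows = coverage_bedgraph([("chr1", 0, 10), ("chr1", 5, 15), ("chr1", 20, 25)])
for row in example_rows:
    print("\t".join(str(field) for field in row))
```

Regions with zero depth are simply omitted, which keeps the track compact for sparsely covered genomes.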
Figure 5. Example heat map and integrated UCSC Genome Browser on the Maverix Analytic Platform. The heat map on the
left shows differential expression between four samples. Mouseover reveals the feature name, chromosomal location and
differential expression values. Clicking a feature on the heat map will load the region in the genome browser on the right.
Cufflinks outputs a transcript file in GTF format (Gene
Transfer Format), containing chromosome location,
coordinates by transcript and individual exons, features
overlapped in the reference genome, and FPKM
(Fragments Per Kilobase of transcript per Million mapped
reads) values. The abundance determination from FPKM
values mirrors read density measurements using RPKM
values [17], normalizing for RNA length and total number of
reads and allowing comparison of transcript levels within
and between samples. From the GTF file, a BED file is
generated and loaded as a track to the genome browser,
showing the location of the transcript fragments generated
via Cufflinks and allowing visualization of the differential
pattern of gene expression and splice isoforms between
samples.
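The normalization behind FPKM can be written out as a short Python sketch. It shows only the arithmetic of scaling a fragment count by transcript length and library size; Cufflinks estimates the fragment assignment per isoform with a statistical model that is not reproduced here. The function name and the example numbers are assumptions for illustration.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # fragments / (length in kb) / (library size in millions)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

# Hypothetical library of 10 million mapped fragments.
long_transcript = fpkm(fragments=500, transcript_length_bp=2000, total_mapped_fragments=10_000_000)
short_transcript = fpkm(fragments=250, transcript_length_bp=1000, total_mapped_fragments=10_000_000)
print(long_transcript, short_transcript)
```

The two transcripts receive the same value even though one collected twice as many fragments, because it is also twice as long; this is the length normalization that makes transcript levels comparable.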
The differential expression step provides one of the most
powerful visualizations from the mRNA-seq differential
expression analysis in the form of a heat map. Cuffdiff
generates an expression output file for each control/sample
pair. For the mRNA-seq differential expression analysis, a
gene-level expression file is created, which collates read
counts for all isoforms of a transcription unit. Once the data
are joined, filtering is done to remove genes that fall below a
threshold of significance for differential expression, then
gene annotations are added to the composite file that is
used for heat map generation.
Heat map clustering is carried out using the 'hclust' function
of the open-source software R [18]. The hierarchical
clustering step uses the sample FPKM value for each gene
with Pearson correlation as the distance metric, and
clustering is based on the similarity of the expression
between two genes. The interactive heat map can be
navigated by hovering over the sample rows and visualizing
the underlying gene features, viewing names, chromosomal
locations, and differential expression values. Clicking on a
gene’s location on the heat map will load the associated
chromosomal region in the integrated UCSC Genome
Browser on the right.
Summary
As technological improvements in NGS methodologies
steadily advance and the cost of sequencing simultaneously
declines, the generation of NGS data is increasing at a rate
that is leading to a data analysis bottleneck. Maverix
Biomics was founded to address the situation by providing
computational analysis designed by bioinformatics experts
using gold-standard open-source tools, the infrastructure to
manage your data, and the platform and resources to
visualize your results. The solutions that Maverix Biomics
provides allow researchers to leverage their NGS data
without delay so they can bypass the technical roadblocks
and focus on the science.
Maverix has sought to make the clear advantages of open
source NGS analysis tools readily accessible to the
non-expert so that high performance NGS bioinformatics are
open to all who need them, whenever they need them, with
the support they deserve.
References
1. RNA-Seq: a revolutionary tool for transcriptomics.
Wang Z, Gerstein M, Snyder M. Nat Rev Genet. 2009
Jan;10(1):57-63. doi: 10.1038/nrg2484
2. The Open Source Definition. The Open Source
Initiative. http://opensource.org/docs/osd
3. Open source tools and toolkits for bioinformatics:
significance, and where are we? Stajich JE, Lapp H.
Brief Bioinform. 2006 Sep;7(3):287-96.
4. The UCSC genome browser and associated tools.
Kuhn RM, Haussler D, Kent WJ. Brief Bioinform. 2013
Mar;14(2):144-61. doi: 10.1093/bib/bbs038
5. FastQC: A quality control tool for high throughput
sequence data. Simon Andrews.
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
6. Quality control and preprocessing of metagenomic
datasets. Schmieder R, Edwards R. Bioinformatics.
2011 Mar 15;27(6):863-4. doi: 10.1093/bioinformatics/btr026
7. Reaper: demultiplexing, trimming and filtering short
read sequencing data. Stijn van Dongen.
http://www.ebi.ac.uk/~stijn/reaper/reaper.html
8. ea-utils: "Command-line tools for processing biological
sequencing data". Erik Aronesty, 2011;
http://code.google.com/p/ea-utils
9. STAR: ultrafast universal RNA-seq aligner. Dobin A,
Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S,
Batut P, Chaisson M, Gingeras TR. Bioinformatics.
2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635
10. TopHat: discovering splice junctions with RNA-Seq.
Trapnell C, Pachter L, Salzberg SL. Bioinformatics.
2009 May 1;25(9):1105-11. doi: 10.1093/bioinformatics/btp120
11. Ultrafast and memory-efficient alignment of short DNA
sequences to the human genome. Langmead B,
Trapnell C, Pop M, Salzberg SL. Genome Biol.
2009;10(3):R25. doi: 10.1186/gb-2009-10-3-r25
12. Differential gene and transcript expression analysis of
RNA-seq experiments with TopHat and Cufflinks.
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley
DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Nat
Protoc. 2012 Mar 1;7(3):562-78. doi: 10.1038/nprot.2012.016
13. Google Scholar, http://scholar.google.com, citations as
of 7/15/2013; STAR (doi: 10.1093/bioinformatics/bts635): 10;
TopHat (doi: 10.1093/bioinformatics/btp120): 1163.
14. The Sequence Alignment/Map format and SAMtools. Li
H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer
N, Marth G, Abecasis G, Durbin R; 1000 Genome
Project Data Processing Subgroup. Bioinformatics.
2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352
15. Transcript assembly and quantification by RNA-Seq
reveals unannotated transcripts and isoform switching
during cell differentiation. Trapnell C, Williams BA,
Pertea G, Mortazavi A, Kwan G, van Baren MJ,
Salzberg SL, Wold BJ, Pachter L. Nat Biotechnol. 2010
May;28(5):511-5. doi: 10.1038/nbt.1621
16. BEDTools: a flexible suite of utilities for comparing
genomic features. Quinlan AR, Hall IM. Bioinformatics.
2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033
17. Mapping and quantifying mammalian transcriptomes by
RNA-Seq. Mortazavi A, Williams BA, McCue K,
Schaeffer L, Wold B. Nat Methods. 2008 Jul;5(7):621-8.
doi: 10.1038/nmeth.1226
18. R: A language and environment for statistical
computing. R Development Core Team (2008). R
Foundation for Statistical Computing, Vienna, Austria.
ISBN 3-900051-07-0, http://www.R-project.org
Appendix - Open Source Tools
For the most up to date list of Open Source Tools deployed
on the Maverix Analytic Platform, please visit
www.maverixbio.com/opensource
BEDTools
Flexible suite of utilities for comparing genomic features,
such as finding feature overlaps and computing coverage.
BEDTools: a flexible suite of utilities for comparing genomic
features. Quinlan AR, Hall IM. Bioinformatics. 2010 Mar
15;26(6):841-2.
DOI: 10.1093/bioinformatics/btq033; PubMed: 20110278
License: GPLv2
Website: http://code.google.com/p/bedtools/
BLAST-Like Alignment Tool (BLAT)
Performs rapid mRNA/DNA and cross-species protein
alignments.
BLAT–the BLAST-like alignment tool. Kent WJ. Genome
Res. 2002 Apr;12(4):656-64.
DOI: 10.1101/gr.229202; PubMed: 11932250
License: Authorized Use of Commercial License
Website: http://www.kentinformatics.com/
Bowtie
Ultrafast, memory-efficient short read aligner.
Ultrafast and memory-efficient alignment of short DNA
sequences to the human genome. Langmead B, Trapnell C,
Pop M, Salzberg SL. Genome Biol. 2009;10(3):R25.
DOI: 10.1186/gb-2009-10-3-r25; PubMed: 19261174
License: Artistic License
Website: http://bowtie-bio.sourceforge.net
Bowtie 2
Ultrafast and memory-efficient tool for aligning sequencing
reads to long reference sequences. Supports gapped, local,
and paired-end alignment modes.
Fast gapped-read alignment with Bowtie 2. Langmead B,
Salzberg SL. Nat Methods. 2012 Mar 4;9(4):357-9.
DOI: 10.1038/nmeth.1923; PubMed: 22388286
License: Artistic License
Website: http://bowtie-bio.sourceforge.net/bowtie2/
Bpipe
Provides a platform for running and managing big
bioinformatics jobs that consist of a series of processing
stages – known as ‘pipelines’.
Bpipe: a tool for running and managing bioinformatics
pipelines. Sadedin SP, Pope B, Oshlack A. Bioinformatics.
2012 Jun 1;28(11):1525-6.
DOI: 10.1093/bioinformatics/bts167; PubMed: 22500002
License: BSD
Website: https://code.google.com/p/bpipe/
Cufflinks
Assembles transcripts, estimates their abundances, and
tests for differential expression and regulation in RNA-Seq
samples.
Transcript assembly and quantification by RNA-Seq reveals
unannotated transcripts and isoform switching during cell
differentiation. Trapnell C, Williams BA, Pertea G, Mortazavi
A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter
L. Nat Biotechnol. 2010 May;28(5):511-5.
DOI: 10.1038/nbt.1621; PubMed: 20436464
License: OSI-Approved Boost License
Website: http://cufflinks.cbcb.umd.edu/
ea-utils
Command-line tools for processing biological sequencing
data. Barcode demultiplexing, adapter trimming, etc.
ea-utils: “Command-line tools for processing biological
sequencing data”. Erik Aronesty
License: MIT
Website: http://code.google.com/p/ea-utils
eXpress
Software package for efficient probabilistic assignment of
ambiguously mapping sequenced fragments. eXpress uses
a streaming algorithm with linear run time and constant
memory use, and can determine abundances of sequenced
molecules in real time.
Streaming fragment assignment for real-time analysis of
sequencing experiments. Roberts A, Pachter L. Nat
Methods. 2013 Jan;10(1):71-3.
DOI: 10.1038/nmeth.2251; PubMed: 23160280
License: Artistic License 2.0
Website: http://bio.math.berkeley.edu/eXpress/index.html
FastQC
Quality control tool for high throughput sequence data.
License: GPLv3
Website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
NCBI C++ Toolkit
Provides source code and configuration scripts that make it
easy to compile NCBI software to run on a variety of
computing platforms. The toolkit includes BLAST, the Basic
Local Alignment Search Tool code and interface.
The NCBI C++ Toolkit Book [Internet]. Vakatov D, editor.
Bethesda (MD): National Center for Biotechnology
Information (US); 2004-.
http://www.ncbi.nlm.nih.gov/books/NBK7160/
License: Public Domain
Website: http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/
Perl
High-level programming language for tasks involving quick
prototyping, system utilities, software tools, system
management tasks, database access, graphical
programming, networking, and web programming.
License: Artistic License, GPLv3
Website: http://www.perl.org/
PRINSEQ
Bioinformatics tool to PRe-process and show INformation of
SEQuence data. The tool is written in Perl and can be
helpful if you want to filter, reformat, or trim your sequence
data. It also generates basic statistics for your sequences.
Quality control and preprocessing of metagenomic
datasets. Schmieder R, Edwards R. Bioinformatics. 2011
Mar 15;27(6):863-4.
DOI: 10.1093/bioinformatics/btr026; PubMed: 21278185
License: GPLv3
Website: http://prinseq.sourceforge.net/
R
Provides a wide variety of statistical and graphical
techniques, and is highly extensible.
R: A language and environment for statistical computing.
R Development Core Team (2008). R Foundation for
Statistical Computing, Vienna, Austria.
ISBN 3-900051-07-0.
License: GPLv2
Website: http://www.r-project.org/
Reaper
Program for demultiplexing, trimming and filtering short read
sequencing data. Includes Tally, a program for deduplicating
sequence fragments for both single and paired end input.
License: GPLv3
Website: http://www.ebi.ac.uk/~stijn/reaper/reaper.html
SAMtools
Various utilities for manipulating alignments in the SAM
format, including sorting, merging, indexing and generating
alignments in a per-position format.
The Sequence Alignment/Map format and SAMtools. Li H,
Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N,
Marth G, Abecasis G, Durbin R; 1000 Genome Project Data
Processing Subgroup. Bioinformatics. 2009 Aug 15;25(16):2078-9.
DOI: 10.1093/bioinformatics/btp352; PubMed: 19505943
License: BSD, MIT
Website: http://samtools.sourceforge.net/
SRA Toolkit
Supports conversion of SRA data to several popular
formats.
License: free
Website: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc
STAR
Spliced Transcripts Alignment to a Reference (STAR), an
ultrafast universal RNA-seq aligner, uses sequential
maximum mappable seed search in uncompressed suffix
arrays followed by seed clustering and stitching procedure.
STAR: ultrafast universal RNA-seq aligner. Dobin A, Davis
CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P,
Chaisson M, Gingeras TR. Bioinformatics. 2013 Jan
1;29(1):15-21.
DOI: 10.1093/bioinformatics/bts635; PubMed: 23104886
License: GPLv3
Website: https://code.google.com/p/rna-star/
TMAP
The torrent mapping alignment program is a fast and
accurate alignment software for short and long nucleotide
sequences produced by next-generation sequencing
technologies.
License: GPLv2
Website: https://github.com/iontorrent/TMAP
TopHat
Fast splice junction mapper for RNA-Seq reads.
TopHat: discovering splice junctions with RNA-Seq. Trapnell
C, Pachter L, Salzberg SL. Bioinformatics. 2009 May
1;25(9):1105-11.
DOI: 10.1093/bioinformatics/btp120; PubMed: 19289445
License: Artistic License
Website: http://tophat.cbcb.umd.edu/
tRNAscan-SE
Search for tRNA genes in genomic sequence.
tRNAscan-SE: a program for improved detection of transfer
RNA genes in genomic sequence. Lowe TM, Eddy SR.
Nucleic Acids Res. 1997 Mar 1;25(5):955-64.
DOI: 10.1093/nar/25.5.0955; PubMed: 9023104
License: GPLv2
Website: http://lowelab.ucsc.edu/tRNAscan-SE/
UCSC Genome Browser and Tool Set
An integrated tool set for visualizing, comparing, analyzing
and sharing both publicly available and user-generated
genomic data sets. The UCSC Genome Browser is a
mature web tool for rapid and reliable display of any
requested portion of the genome at any scale, together with
aligned annotation tracks.
The UCSC genome browser and associated tools. Kuhn
RM, Haussler D, Kent WJ. Brief Bioinform. 2012 Oct 31.
DOI: 10.1093/bib/bbs038; PubMed: 22908213
License: Authorized Use of Commercial License
Website: http://genome.ucsc.edu/
Complex NGS analytics made simple.
1670 S. Amphlett Blvd, Suite 214, San Mateo CA 94402 • 650-388-9277
www.maverixbio.com
© 2013 Maverix Biomics, Inc. All Rights Reserved.