Practical: Representing and manipulating alignments

R / Bioconductor for High-Throughput Sequence Analysis
Nicolas Delhomme
21 October - 26 October, 2013
Contents
1 Day2 of the workshop
1.1 Introduction
1.2 Main Bioconductor packages of interest for the day
1.3 A word on High-throughput sequence analysis
1.4 A word on Integrated Development Environment (IDE)
1.5 Today's schedule
2 Prelude
2.1 Purpose
2.2 Creating GAlignments objects from BAM files
2.3 Processing the files in parallel
2.4 Processing the files one chunk at a time
2.5 Pros and cons of the current solution
2.5.1 Pros
2.5.2 Cons
3 Sequences and Short Reads
3.1 Alignments and Bioconductor packages
3.1.1 The pasilla data set
3.1.2 Alignments and the ShortRead package
3.1.3 Alignments and the Rsamtools package
3.1.4 Alignments and other Bioconductor packages
3.1.5 Resources
4 Interlude
5 Estimating Expression over Genes and Exons
5.1 Counting reads over known genes and exons
5.1.1 The alignments
5.2 Discovering novel transcribed regions
5.3 Using easyRNASeq
5.4 Where to from here
Chapter 1
Day2 of the workshop
1.1 Introduction
This portion of the workshop introduces the use of R [15] and Bioconductor [5] for the analysis of
high-throughput sequence (HTS) data; specifically, the manipulation of HTS read alignments and their use
to estimate expression over exons, transcripts and genes. The workshop is structured as a series of short
remarks followed by group exercises. The exercises explore the diversity of tasks for which R / Bioconductor
are appropriate, but are far from comprehensive.
The goals of this part of the workshop are to: (1) develop familiarity with R / Bioconductor packages for
high-throughput analysis; (2) focus on the packages needed to manipulate HTS read alignment files
and to derive expression estimates over genic features; and (3) provide inspiration and a framework for
further independent exploration.
1.2 Main Bioconductor packages of interest for the day
Bioconductor is a collection of R packages for the analysis and comprehension of high-throughput genomic
data. We will focus principally on three of them: ShortRead, Rsamtools and GenomicRanges.
1.3 A word on High-throughput sequence analysis
Recent technological developments have introduced high-throughput sequencing approaches. A variety of
experimental protocols and analysis workflows address gene expression, regulation, and encoding of genetic
variants. Experimental protocols produce a large number (tens of millions per sample) of short (e.g.,
35-250 nucleotides, single or paired-end) sequences. These are aligned to a reference or other genome.
Analysis workflows use the alignments to infer levels of gene expression (RNA-seq), binding of regulatory
elements to genomic locations (ChIP-seq), or prevalence of structural variants (e.g., SNPs, short indels,
large-scale genomic rearrangements). Sample sizes range from minimal replication (e.g., 2 samples per
treatment group) to thousands of individuals.
1.4 A word on Integrated Development Environment (IDE)
There are numerous tools to support developing programs and software in R. For this course, we have
selected one of them: the RStudio environment, which provides a feature-rich, user-friendly, cross-platform
environment for working with R.
1.5 Today's schedule
Table 1.1: EMBO2013 AHTSD workshop day2 Schedule
Time   Description
09:00  Lecture: Representing and manipulating alignments
09:45  Practical: Representing and manipulating alignments
10:30  Coffee break
10:45  Practical (continued): Representing and manipulating alignments
12:30  Lunch
13:30  Lecture: Estimating expression over genes and exons
14:30  Practical: Estimating expression over genes and exons
15:30  Coffee break
15:45  Lecture: Working without a "reference" genome
16:30  Practical: Discovering novel transcribed regions
17:30  Question and Answer session - preferably at the Red Lion
18:30  Dinner
Chapter 2
Prelude
2.1 Purpose
Before getting familiar with the Bioconductor package functionalities that were presented in the lecture,
we will first apply the knowledge you have gathered so far to the computational challenges
faced when using HTS data, i.e. resource and time consumption.
In the lecture, the readGAlignmentsFromBam function from the Rsamtools package was introduced and
used to extract a GAlignments object. However, most of the time, an experiment will NOT consist of a
single sample (of only 2.5M reads!) and an obvious way to speed up the process is to parallelize. In the
following three sections, we will see how to do this, before discussing the pros and cons
of the implemented method.
2.2 Creating GAlignments objects from BAM files
Exercise 1
First of all, locate the BAM files and implement a function to read them sequentially. Have a look at
the lapply function man page for doing so.
Solution:
> library(Rsamtools)
> bamfiles <- dir(system.file("bigdata","bam",package="EMBO2013Day2"),
+                 pattern="\\.bam$",full.names=TRUE)
> gAlns <- lapply(bamfiles,readGAlignmentsFromBam)
Nothing complicated so far - if it is, speak up. We process both files sequentially and get a list
of GAlignments objects stored in the gAlns object. Apart from the coding enhancement (with one line,
we can process all our samples) there is no other gain.
2.3 Processing the files in parallel
Modern laptop CPUs possess several cores that can perform tasks independently, commonly 2 to 4.
Computational servers usually have many CPUs (commonly 8), each having several cores. An obvious
enhancement to our previous solution is to take advantage of this CPU architecture and to process our
samples in parallel.
Exercise 2
Have a look at the parallel package and in particular at the mclapply function to re-implement the
previous function in a parallel manner.
Solution:
> library(parallel)
> gAlns <- mclapply(bamfiles,readGAlignmentsFromBam)
Exercise 3
Could you figure out how many cores were used in parallel when running the previous line? Can you
explain why that was so?
Solution: It is NOT because there were 2 files to process. The mclapply function has a number of default
parameters - see ?mclapply for details - including mc.cores, which defaults to 2. If you want to
process more samples in parallel, set that parameter accordingly.
This new implementation has the obvious advantage of being up to X times faster (X being the number
of cores used; slightly less in practice, as parallelization comes with a small processing overhead), but it
puts a different strain on the system. As several files are processed in parallel, the memory requirement
also increases by a factor X (assuming the files are of roughly equal size). This might be fine
on a computational server but, given the constant increase in the number of sequencing reads produced
per run, it will eventually become a limitation.
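As a minimal sketch of the point about mc.cores, the number of workers can be set explicitly rather than left at its default. The toy task and the names toy_task and n_cores are assumptions made for illustration; they stand in for the per-file BAM reading and are not part of the course material. Note that mclapply relies on forking, so on Windows only mc.cores = 1 is supported.

```r
library(parallel)

# Hypothetical stand-in for an expensive per-file task such as reading a BAM file.
toy_task <- function(i) sum(seq_len(i))

# Use 2 workers here; forking is unavailable on Windows, so fall back to 1 there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

seq_res <- lapply(1:4, toy_task)                        # sequential
par_res <- mclapply(1:4, toy_task, mc.cores = n_cores)  # parallel (forked workers)

# Both approaches return the same list, in the same order.
identical(seq_res, par_res)
```

The same mc.cores argument can be passed to the mclapply call over the BAM files; each additional worker holds one more file in memory at the same time.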
Exercise 4
Can you think of a way to address this memory issue? I.e. what could we modify in the way
we read/process the files to limit the memory required at any given moment?
Solution: No, buying more memory is usually not an option. And anyway, at the moment, the number
of reads sequenced per run increases faster than memory capacity doubles. So, let us move to the
next section and have a go at addressing the issue.
2.4 Processing the files one chunk at a time
To limit the memory required at any moment, one approach is to process the file not as a whole,
but chunk-wise. As we can assume that reads are stored independently in BAM files (or almost so, think
of how paired-end data is stored!), we can simply decide to parse, e.g., 1,000,000 reads at a time. This
requires a new way to represent a BAM file in R, i.e. not just as a character string,
as has been the case until now in our bamfiles object.
Exercise 5
The Rsamtools package again comes in handy. Look up the ?BamFile help page and work out how we
could take advantage of the BamFile or BamFileList classes for our purpose.
Solution: The yieldSize parameter of either class looks like exactly what we want. Let us convert our
bamfiles character vector into a BamFileList.
> bamFileList <- BamFileList(bamfiles,yieldSize=10^6)
Now that the BAM files are described in a way that lets us process them chunk-wise, let us do
so. The paradigm is as follows:
> bamFile <- bamFileList[[1]]   # pick one BamFile to illustrate
> open(bamFile)
> while(length(chunk <- readGAlignmentsFromBam(bamFile))){
+   message(length(chunk))
+ }
> close(bamFile)
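The open / read-until-empty / close pattern is not specific to BAM files. As an analogy only, the sketch below applies it to a plain text file in base R, with readLines(con, n = ...) playing the role that yieldSize plays for a BamFile. The temporary file, its content and the name yield_size are made up for illustration.

```r
# Toy input: 25 "reads", one per line (hypothetical data).
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("read%d", 1:25), tmp)

yield_size <- 10L            # number of records to pull per iteration
chunk_sizes <- integer(0)

con <- file(tmp, open = "r") # an open connection remembers its position
while (length(chunk <- readLines(con, n = yield_size)) > 0) {
  chunk_sizes <- c(chunk_sizes, length(chunk))
}
close(con)
unlink(tmp)

chunk_sizes                  # 10 10 5: the last chunk is the remainder
```

As with the BAM paradigm, the loop stops when a read returns zero records, and only one chunk is held in memory at a time.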
Exercise 6
In the paradigm above, we process one BAM file chunk-wise and report the sizes of the chunks. In our
case, these would be 1M reads, except for the last one, which would be smaller than or equal to 1M
(it is unlikely that a sequencing file contains an exact multiple of our chunk size).
Now, try to apply the above paradigm to the function we implemented previously - see the solution in section 2.3 - so as to process both our BAM files in parallel, chunk-wise.
Solution:
> gAlns <- mclapply(bamFileList,function(bamFile){
+   open(bamFile)
+   gAln <- GAlignments()
+   while(length(chunk <- readGAlignmentsFromBam(bamFile))){
+     gAln <- c(gAln,chunk)
+   }
+   close(bamFile)
+   return(gAln)
+ })
2.5 Pros and cons of the current solution
Exercise 7
Before reading my comments below, take the time to jot down what you think are the advantages and
drawbacks of the method implemented above. My own comments below are certainly not exhaustive and
I would be curious to hear any of yours that do not match mine.
Solution:
2.5.1 Pros
a. We have written a streamlined piece of code, using up-to-date functionalities from other packages.
Hence, it is both easy to maintain and easy to update.
b. We have reduced the time consumption by a factor of 2, and it can be reduced
further by using a computer with more CPUs or even a compute farm - provided, obviously, that we have
more than 2 samples to process.
c. We have implemented the processing of the BAM files by chunks.
2.5.2 Cons
a. There is really only one big con: we have NOT addressed the memory requirement issue satisfactorily.
We do process the BAM files by chunks, but then we simply aggregate these chunks without further
processing, so we eventually end up using the same amount of memory. This is the best we can
do so far given the Bioconductor functionalities introduced, so let us move to the next step in the
pipeline, which will help us resolve that - see Chapter 4 if you are impatient - but first we
should recap the usage of the Bioconductor packages for obtaining and manipulating sequencing
read information in R, which is the next chapter's topic.
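One way to picture the missing "further processing" is to reduce each chunk to a small summary before moving on, so that only the summary accumulates instead of the reads themselves. The sketch below is a base R analogy on a plain text file with made-up chromosome names; it does not use the Bioconductor classes above, and the names totals and yield_size are assumptions for illustration.

```r
# Toy input: one chromosome name per "read" (hypothetical data).
tmp <- tempfile(fileext = ".txt")
writeLines(rep(c("2L", "2R", "X"), times = c(12, 7, 6)), tmp)

yield_size <- 10L
totals <- c("2L" = 0L, "2R" = 0L, "X" = 0L)   # running summary of constant size

con <- file(tmp, open = "r")
while (length(chunk <- readLines(con, n = yield_size)) > 0) {
  # Summarize the chunk, add it to the running totals, then let the chunk go.
  counts <- table(factor(chunk, levels = names(totals)))
  totals <- totals + as.integer(counts)
}
close(con)
unlink(tmp)

totals
```

Memory use is then bounded by the chunk size plus the size of the summary, whatever the size of the input file.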
Chapter 3
Sequences and Short Reads
Most downstream analysis of short read sequences is based on reads aligned to reference genomes. There
are many aligners available, including BWA [13, 12], Bowtie2 [9], GSNAP [21], STAR [4], etc.; the merits of
these are discussed in the literature. There are also alignment algorithms implemented in Bioconductor
(e.g., matchPDict in the Biostrings package and the gmapR, Rbowtie, Rsubread packages); matchPDict is
particularly useful for flexible alignment of moderately sized subsets of data.
3.1 Alignments and Bioconductor packages
The following sections introduce core tools for working with high-throughput sequence data; key packages
for representing reads and alignments are summarized in Table 3.1.
Moreover, Martin introduced yesterday resources for annotating sequences, which will come in handy in
the next two chapters of this tutorial (Chapters 4 and 5).
Exercise 8
Read the man page of the GAlignments and GAlignmentPairs classes and pay attention to the very
important comments on multi-reads and paired-end processing.
Solution: Really just ?GAlignments. However, KEEP these details in mind, as they are essential and a likely
source of erroneous conclusions. Remember the example about RNA editing from this morning's lecture.
Table 3.1: Selected Bioconductor packages for extracting and manipulating sequence read alignments.
Package        Description
ShortRead      In addition to the functionalities described yesterday for manipulating raw read files
               (e.g. the ShortReadQ class and functions for manipulating fastq files), this package
               offers the possibility to load numerous HTS format classes. These are mostly
               sequencer-manufacturer specific (e.g. sff for 454) or pre-BAM, aligner-proprietary
               formats (e.g. MAQ or bowtie). These functionalities rely heavily on Biostrings and
               somewhat on Rsamtools.
GenomicRanges  GAlignments and GAlignmentPairs store single- and paired-end aligned reads.
Rsamtools      Provides access to BAM alignment and other large sequence-related files.
rtracklayer    Input and output of bed, wig and similar files.
3.1.1 The pasilla data set
As a running example, we use the pasilla data set, derived from [2]. The authors investigate conservation
of RNA regulation between D. melanogaster and mammals. Part of their study used RNAi and RNAseq to
identify exons regulated by Pasilla (ps), the D. melanogaster ortholog of mammalian NOVA1 and
NOVA2. Briefly, their experiment compared gene expression as measured by RNAseq in S2-DRSC cells
cultured with, or without, a 444bp dsRNA fragment corresponding to the ps mRNA sequence. Their
assessment investigated differential exon use, but our worked example will focus on gene-level differences.
In the following sections, we look at a subset of the ps data, corresponding to reads obtained from
lanes of their RNA-seq experiment, and aligned to a D. melanogaster reference genome. These are the
same reads that were used yesterday for the demonstration of the raw-read-based functionalities of the
ShortRead package. As a side note, reads were retrieved from GEO and the Short Read Archive (SRA),
and were aligned to the D. melanogaster reference genome dm3 as described in the pasilla experiment
data package.
3.1.2 Alignments and the ShortRead package
Yesterday, Martin introduced the ShortRead package to manipulate raw reads and to perform Quality
Assessment (QA) on raw data files, e.g. fastq formatted files. These are not the only functionalities of the
ShortRead package, which also offers the possibility to read in alignment files in many different
formats.
Exercise 9
Two files of the pasilla dataset have been aligned using bowtie [9]; locate them in the bigdata folder of
the EMBO2013Day2 package.
Solution:
> bwtFiles <- dir(path=system.file("bigdata","bowtie",package="EMBO2013Day2"),
+                 pattern="\\.bwt$",full.names=TRUE)
As we will be accessing this bigdata folder frequently, we create a function called bigdata to do so
more conveniently.
> library(EMBO2013Day2)
> bigdata <- function()
+   system.file("bigdata",package="EMBO2013Day2")
Exercise 10
Have a peek at one of the files and try to decipher its format. Hint: it is a tab-delimited format, so check
the read.delim function. As you may not want to read all the lines to get an idea, look up an appropriate
argument for that.
Solution: You might want to check http://bowtie-bio.sourceforge.net/manual.shtml#default-bowtie-output
to see whether your guesses were correct. Here is how to read 10 lines of the first file.
> read.delim(file=file.path(bigdata(),"bowtie","SRR074430.bwt"),
+            header=FALSE,nrows=10)
Exercise 11
Now, as was presented in the lecture, use the readAligned function to read in the bowtie alignment files.
Solution:
> alignedRead <- readAligned(dirPath=file.path(bigdata(),"bowtie"),
+                            pattern="\\.bwt$",type="Bowtie")
Exercise 12
What is peculiar about the returned object? Determine its class. Can you tell where the data from both
input files are?
Solution: We obtained a single object of the AlignedRead class. The documentation, i.e. the Value section
of ?readAligned, tells us that all files are concatenated into a single object with NO
guarantee on the order in which files are read. This is convenient when we want to merge several sequencing
runs of the same sample, but we need to be cautious and process independent samples by calling
the readAligned function separately for every sample.
Exercise 13
Finally, taking another look at the lecture, select only the reads that align to chromosome 2L. Hint, use
the appropriate SRFilter filter.
Solution:
> alignedRead2L <- readAligned(dirPath=file.path(bigdata(),"bowtie"),
+                              pattern="\\.bwt$",type="Bowtie",
+                              filter=chromosomeFilter("2L"))
This concludes the overview of the ShortRead package. As the BAM format has become a de facto
standard, you are less likely to use that package to process reads in R than the Rsamtools
package, which you will be using next.
3.1.3 Alignments and the Rsamtools package
Alignment formats Most mainstream aligners produce output in SAM (text-based) or BAM format.
A SAM file is a text file, with one line per aligned read, and fields separated by tabs. Here is an example
of a single SAM line, split into fields.
> fl <- system.file("extdata", "ex1.sam", package="Rsamtools")
> strsplit(readLines(fl, 1), "\t")[[1]]
 [1] "B7_591:4:96:693:509"
 [2] "73"
 [3] "seq1"
 [4] "1"
 [5] "99"
 [6] "36M"
 [7] "*"
 [8] "0"
 [9] "0"
[10] "CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG"
[11] "<<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7"
[12] "MF:i:18"
[13] "Aq:i:73"
[14] "NM:i:0"
[15] "UQ:i:0"
[16] "H0:i:1"
[17] "H1:i:0"
Fields in a SAM file are summarized in Table 3.2. We recognize from the FASTQ file introduced
yesterday, the identifier string, read sequences and qualities. The alignment is to a chromosome ‘seq1’
starting at position 1. The strand of alignment is encoded in the ‘flag’ field. The alignment record also
includes a measure of mapping quality, and a CIGAR string describing the nature of the alignment. In
this case, the CIGAR is 36M, indicating that the alignment consisted of 36 Matches or mismatches, with
no indels or gaps; indels are represented by I and D; gaps (e.g., from alignments spanning introns) by N.
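Since the strand is encoded in the bitwise FLAG field, a small decoding helper may make this concrete. The sketch below uses base R only; the bit values follow the SAM specification, while the names flag_bits and decode_flag are hypothetical helpers introduced here for illustration.

```r
# Selected FLAG bits as defined in the SAM specification.
flag_bits <- c(paired = 0x1, proper_pair = 0x2, unmapped = 0x4,
               mate_unmapped = 0x8, reverse_strand = 0x10,
               mate_reverse = 0x20, first_in_pair = 0x40,
               second_in_pair = 0x80)

decode_flag <- function(flag) {
  # A bit is set when the bitwise AND of the flag with its mask is non-zero.
  masks <- as.integer(flag_bits)
  set <- bitwAnd(rep(as.integer(flag), length(masks)), masks) != 0
  names(flag_bits)[set]
}

decode_flag(73)   # the FLAG of the example record: 73 = 64 + 8 + 1
```

For the example record, the reverse_strand bit (0x10) is not set, so the read aligns to the forward strand.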
Table 3.2: Fields in a SAM record. From http://samtools.sourceforge.net/samtools.shtml
Field  Name   Value
1      QNAME  Query (read) NAME
2      FLAG   Bitwise FLAG, e.g., strand of alignment
3      RNAME  Reference sequence NAME
4      POS    1-based leftmost POSition of sequence
5      MAPQ   MAPping Quality (Phred-scaled)
6      CIGAR  Extended CIGAR string
7      MRNM   Mate Reference sequence NaMe
8      MPOS   1-based Mate POSition
9      ISIZE  Inferred insert SIZE
10     SEQ    Query SEQuence on the reference strand
11     QUAL   Query QUALity
12+    OPT    OPTional fields, format TAG:VTYPE:VALUE
Note that mismatches are not represented in the CIGAR string but might be detailed in the additional
attributes; this depends on the aligner used.
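To make the CIGAR notation more tangible, the following base R sketch splits a CIGAR string into its operations and derives the read length and the width covered on the reference. The function name parse_cigar is a hypothetical helper, not a Bioconductor function, and the rule for which operations consume query or reference bases is taken from the SAM specification.

```r
parse_cigar <- function(cigar) {
  # Split e.g. "14M4I17M" into lengths (14, 4, 17) and operations (M, I, M).
  tokens  <- regmatches(cigar, gregexpr("[0-9]+[MIDNSHP=X]", cigar))[[1]]
  lengths <- as.integer(sub("[MIDNSHP=X]$", "", tokens))
  ops     <- sub("^[0-9]+", "", tokens)
  data.frame(op = ops, length = lengths, stringsAsFactors = FALSE)
}

cig <- parse_cigar("14M4I17M")
cig

# Read length: operations that consume query bases (M, I, S, =, X).
sum(cig$length[cig$op %in% c("M", "I", "S", "=", "X")])   # 35

# Width on the reference: operations that consume reference bases (M, D, N, =, X).
sum(cig$length[cig$op %in% c("M", "D", "N", "=", "X")])   # 31
```

The 4-base insertion belongs to the read but not to the reference, which is why a 35-base read covers only 31 reference positions here.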
BAM files encode the same information as SAM files, but in a format that is more efficiently parsed
by software; BAM files are the primary way in which aligned reads are imported into R.
Aligned reads in R As introduced in section 3.1, three different packages can read
alignments into R:
• ShortRead
• GenomicRanges
• Rsamtools
The last two are described in more detail in the next paragraphs.
GenomicRanges The readGAlignments function from the GenomicRanges package reads essential
information from a BAM file into R. The result is an instance of the GAlignments class. The GAlignments
class has been designed to allow useful manipulation of many reads (e.g., 20 million) under moderate
memory requirements (e.g., 4 GB).
Note that the readGAlignments function and the GAlignments class have replaced the readGappedAlignments function and the GappedAlignments class, respectively, from previous releases of Bioconductor;
this is the case as of version 2.13, released at the beginning of October this year.
Exercise 14
Use the readGAlignments function to read in the "ex1.bam" file that can be found in the "extdata" folder of the
Rsamtools package.
Solution:
> alnFile <- system.file("extdata", "ex1.bam", package="Rsamtools")
> aln <- readGAlignments(alnFile)
> head(aln, 3)
The readGAlignments function takes an additional argument, param, allowing the user to specify regions
of the BAM file (e.g., known gene coordinates) from which to extract alignments.
A GAlignments instance is like a data frame, but with accessors as suggested by the column names.
It is easy to query, e.g., the distribution of reads aligning to each strand, the width of reads, or the CIGAR
strings.
Exercise 15
Summarize the strand, width and CIGAR information from that file.
Solution:
> table(strand(aln))

   +    -    *
1647 1624    0
> table(width(aln))

  30   31   32   33   34   35   36   38   40
   2   21    1    8   37 2804  285    1  112
> head(sort(table(cigar(aln)), decreasing=TRUE))

     35M      36M      40M      34M      33M 14M4I17M
    2804      283      112       37        6        4
Rsamtools The Rsamtools readGAlignmentsFromBam function (introduced earlier, see Chapter 2),
like the GenomicRanges readGAlignments function, only parses some of the fields of a BAM file,
and that may not be appropriate for all uses. In these cases, the scanBam function in Rsamtools provides
greater flexibility. The idea is to view BAM files as a kind of database. Particular regions of interest
can be selected, and the information in the selection restricted to particular fields. These operations are
determined by the values of a ScanBamParam object, passed as the named param argument to scanBam.
Exercise 16
Consult the help page for ScanBamParam, and construct an object that restricts the information returned
by a scanBam query to the aligned read DNA sequence. Your solution will use the what parameter to the
ScanBamParam function.
Use the ScanBamParam object to query a BAM file, and calculate the GC content of all aligned reads.
Summarize the GC content as a histogram (Figure 3.1).
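Solution:
> param <- ScanBamParam(what="seq")
> seqs <- scanBam(bamfiles[[1]], param=param)
> # gcFunction() is not defined in this document; it is assumed to be available, e.g. from the course package
> readGC <- gcFunction(seqs[[1]][["seq"]])
> hist(readGC)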
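The gcFunction helper used above is not defined in this text. As a rough idea of what such a function computes, here is a base R sketch working on plain character strings rather than on the DNAStringSet returned by scanBam. The name gc_content is hypothetical and the input sequences are toy examples (the third one is the read from the SAM record shown earlier).

```r
gc_content <- function(seqs) {
  seqs <- toupper(as.character(seqs))
  # Count G and C by deleting every other character and measuring what remains.
  n_gc <- nchar(gsub("[^GC]", "", seqs))
  n_gc / nchar(seqs)
}

readGC <- gc_content(c("GGCC", "ATAT", "CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG"))
readGC
```

The resulting numeric vector holds one fraction per read and can be passed to hist() in the same way as in the solution.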
Advanced Rsamtools usage
The Rsamtools package has more advanced functionalities:
1. functions to count, index, filter and sort BAM files
2. a function to access the header only
3. the possibility to access SAM attributes (tags)
4. functions to manipulate the CIGAR string
5. the possibility to create BAM libraries to represent a study set (BamViews)
6. ...
Figure 3.1: GC content in aligned reads
Exercise 17
Find out which function scans the BAM header, and retrieve the header of the first BAM file
in the bigdata() bam subfolder. What information does it contain?
Solution: It contains the reference sequence lengths and names, as well as the name, version and command
line of the tool used for performing the alignments.
> scanBamHeader(bamfiles[1])
Exercise 18
The SAM/BAM format contains a tag, "NH", that defines the total number of valid alignments reported
for a read. How can you extract that information from the same first BAM file and plot it as a histogram?
Solution:
> param <- ScanBamParam(tag="NH")
> nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH
> hist(nhs)
So it seems a majority of our reads have multiple alignments! Some processing might be required to
deal with these; e.g. if reads were aligned to the transcriptome, there exist tools that can deconvolve
the transcript-specific expression, for example MMSEQ [20] or BitSeq [6], the latter also existing as an R
package: BitSeq. Otherwise, if reads were aligned to the genome, one should consider filtering these
multiple alignments to avoid introducing artifactual noise in the subsequent analyses.
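Once the NH values are available as an integer vector, summarizing and filtering them needs only base R. The values below are invented for illustration and are not taken from the course data; unique_idx is a hypothetical name for the indices of the alignments one would keep.

```r
# Hypothetical NH values for eight alignments (not taken from the course data).
nhs <- c(1L, 1L, 2L, 4L, 1L, 3L, 2L, 1L)

table(nhs)                        # how many alignments report 1, 2, 3, ... hits
frac_multi <- mean(nhs > 1)       # fraction of alignments from multi-mapping reads
unique_idx <- which(nhs == 1L)    # positions of uniquely aligned reads

frac_multi
unique_idx
```

The index vector can then be used to subset any object that is parallel to the NH vector, i.e. one that was read from the same file with the same parameters.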
Exercise 19
The CIGAR string contains interesting information, in particular whether or not a given alignment contains
indels. Using the first BAM file, can you get a matrix of all seven CIGAR operations? And find out the
intron size distribution?
Solution:
> param <- ScanBamParam(what="cigar")
> cigars <- scanBam(bamfiles[[1]], param=param)[[1]]$cigar
> cigar.matrix <- cigarOpTable(cigars)
> intron.size <- cigar.matrix[,"N"]
> intron.size[intron.size>0]
> plot(density(intron.size[intron.size>0]))
> library(lattice)   # provides histogram()
> histogram(log10(intron.size[intron.size>0]),xlab="intron size (log10 bp)")
Exercise 20
Look up the documentation for the BamViews class and, using the leeBamViews package, create a BamViews
instance. Afterwards, use some of the accessors of that object to retrieve e.g. the file paths or the sample
names.
Solution:
> library(leeBamViews)
> bpaths = dir(system.file("bam", package="leeBamViews"), full.names=TRUE, pattern="bam$")
> gt<-do.call(rbind,strsplit(basename(bpaths),"_"))[,1]
> geno<-substr(gt,1,nchar(gt)-1)
> lane<-substr(gt,nchar(gt),nchar(gt))
> pd = DataFrame(geno=geno, lane=lane, row.names=paste(geno,lane,sep="."))
> bs1 = BamViews(bamPaths=bpaths, bamSamples=pd,
+                bamExperiment=list(annotation="org.Sc.sgd.db"))
> bamPaths(bs1)
> bamSamples(bs1)
Exercise 21
Finally, extract the coverage for the locus 861250:863000 on chromosome "Scchr13" for every sample in
the bs1 object.
Solution:
> sel <- GRanges(seqnames = "Scchr13", IRanges(start = 861250, end = 863000),strand="+")
> covex = RleList(lapply(bamPaths(bs1), function(x) coverage(readGAlignments(x))[[1]]))
This offers an interesting way to process multiple samples at the same time when you are interested in
a particular locus.
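For readers wondering what a coverage vector contains, here is a base R sketch of the underlying computation on a handful of made-up intervals. It is an illustration of the idea only, not of how the Bioconductor coverage function is implemented; the positions and the names chrom_length, delta and depth are assumptions.

```r
# Hypothetical alignments: 1-based, inclusive start/end positions on one sequence.
starts <- c(1L, 3L, 3L, 8L)
ends   <- c(5L, 6L, 4L, 10L)
chrom_length <- 12L

# +1 where an alignment starts, -1 just after it ends;
# the cumulative sum of these steps is the depth at each position.
delta <- tabulate(starts, nbins = chrom_length + 1L) -
         tabulate(ends + 1L, nbins = chrom_length + 1L)
depth <- cumsum(delta)[seq_len(chrom_length)]

depth   # 1 1 3 3 2 1 0 1 1 1 0 0
```

Subsetting such a vector to a range of positions, e.g. depth[3:6], gives the coverage of a locus of interest.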
3.1.4 Alignments and other Bioconductor packages
In the following, an excerpt of additional functionalities offered by Bioconductor packages is presented.
It is far from being a complete overview, and as such only aims at giving a feel for what’s out there.
Retrieving data using SRAdb Most journals require the raw data to be deposited in a public
repository, such as GEO, SRA or ENA. The SRAdb package offers the possibility to list the content of
these archives, and to retrieve raw (fastq or sra) files.
Exercise 22
Using the pasilla package, retrieve the submission accession of that dataset (check out that package's
vignette).
Solution:
> vignette(package="pasilla")
> vignette("create_objects")
> geo.acc <- "GEO: GSE18508"
As we only have the GEO ID, we need to convert it to an SRA ID. You can either use the
GEO, SRA or ENA website for this or, if you are slightly familiar with SQL, just use the SRAdb package.
Exercise 23
Look into the SRAdb package vignette to figure out how to do this.
Solution: Access the vignette:
> library(SRAdb)
> vignette("SRAdb")
Reading it tells us that:
a. we need to download the SRAdb sqlfile
b. we need to create a connection to the locally downloaded database
c. we need to query that database with our submission alias, "GEO: GSE18508", to retrieve the SRA
submission accession.
The first step requires downloading a large (280+ MB) compressed file; the command to use for this
is getSRAdbFile. To avoid the download time, connect instead to the file on the shared folder:
> sqlfile <- "replace-with-your-path-to-the-SRAmetadb.sqlite-file"
> sra_con <- dbConnect(SQLite(),sqlfile)
> sra.acc <- dbGetQuery(sra_con,paste("select submission_accession ",
+                                     "from submission ",
+                                     'where submission_alias = "',
+                                     geo.acc,'";',sep=""))
The retrieved sra.acc is: "SRA010243".
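The query text can be assembled and inspected before it is sent to the database, which makes quoting mistakes easier to spot (the closing double quote has to come before the final semicolon). The following is a minimal sketch in base R; it only builds the string, so no database connection is needed, and the variable name query is an assumption made for illustration.

```r
geo.acc <- "GEO: GSE18508"

# Build the SQL text with sprintf(); %s is replaced by the submission alias.
query <- sprintf(
  'select submission_accession from submission where submission_alias = "%s";',
  geo.acc
)

cat(query, "\n")
```

The resulting string can then be passed as the second argument of dbGetQuery in place of the paste() call.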
Now that we have that accession, the vignette tells us how to get every experiment, sample, run, ...
associated with this submission accession.
Exercise 24
There are at least two possibilities to do so, one using an SQL query and the other using a function
of the package. What would that function be?
Solution: For those that like SQL:
> run.acc <- dbGetQuery(sra_con,paste("select run_accession ",
+                                     "from run ",
+                                     'where submission_accession = "',
+                                     sra.acc,'";',sep=""))$run_accession
For those that like functions:
> sraConvert(sra.acc,sra_con=sra_con)
> run.acc <- sraConvert(sra.acc,"run",sra_con=sra_con)$run
Exercise 25
Now that we’ve got the list of runs, it would be interesting to get more information about the corresponding fastq file.
Solution:
> info <- getFASTQinfo(run.acc,srcType="ftp")
And the final step would be to download the fastq file(s) of interest.
Exercise 26
Retrieve the shortest fastq file from that particular submission.
Solution:
> getSRAfile(in_acc=info[which.min(info[,"run.read.count"]),"run"],
+            sra_con, destDir = getwd(),
+            fileType = 'fastq', srcType = 'ftp' )
Well, that's almost it. As we are tidy people, we clean up after ourselves.
> dbDisconnect( sra_con )
Demultiplexing using easyRNASeq Note: This section does not apply to all datasets but only to
multiplexed ones. Since the data we have loaded into R so far was not multiplexed, we will use a different
dataset here.
Nowadays, NGS machines produce so many reads (e.g. 40M for an Illumina GAIIx, 100M for an ABI
SOLiD4 and 160M for an Illumina HiSeq) that the coverage obtained per lane for the transcriptome of
organisms with small genomes is very high. Sometimes it is more valuable to sequence more samples at
lower coverage than to sequence only one at very high coverage, so techniques have been optimised for sequencing several samples in a single lane, using 4-6bp barcodes to uniquely identify each sample within the
library [10]. This is called multiplexing, and one can on average sequence 12 yeast samples at 30X coverage
in a single lane of an Illumina GenomeAnalyzer GAIIx (100bp reads, single end). This approach is very
advantageous for researchers, especially in terms of cost, but it adds a layer of pre-processing
that is not as trivial as one might think. Extracting the barcodes would be fairly straightforward were it
not for the average 0.1-1 percent sequencing error rate, which introduces a lot of multiplicity in the
barcodes actually present in the samples. A proper design of the barcodes, maximising the Hamming distance [8]
between them, is an essential step for proper de-multiplexing.
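To illustrate the Hamming distance, the sketch below computes it for every pair of the four barcodes used in the remainder of this section. It relies on base R only; the helper name hamming and the matrix name dist_mat are hypothetical and introduced here for illustration.

```r
barcodes <- c("ATGGCT", "TTGCGA", "ACACTG", "ACTAGC")

hamming <- function(a, b) {
  # Number of positions at which two equal-length strings differ.
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Pairwise distances between all barcodes.
dist_mat <- outer(barcodes, barcodes, Vectorize(hamming))
dimnames(dist_mat) <- list(barcodes, barcodes)
dist_mat

# Smallest distance between two distinct barcodes.
min(dist_mat[upper.tri(dist_mat)])   # 4
```

With a minimum distance of 4, a single sequencing error cannot turn one of these barcodes into another, which is the property a good barcode design aims for.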
The data we loaded into R in the previous section was not multiplexed, so we now load a different FASTQ file in which the 4 different samples sequenced were identified by the barcodes "ATGGCT",
"TTGCGA", "ACACTG" and "ACTAGC".
> reads <- readFastq(file.path(bigdata(),"multiplex","multiplex.fq.gz"))
> # filter out reads with more than 2 Ns
> filter <- nFilter(threshold=2)
> reads <- reads[filter(reads)]
> # access the read sequences
> seqs <- sread(reads)
> # this is the length of your barcodes
> barcodeLength <- 6
> # get the first 6 bases of each read
> seqs <- narrow(seqs, end=barcodeLength)
> seqs
> length(table(as.character(seqs)))
So it seems we have 1953 barcodes instead of 4 ...
Exercise 27
Which barcode is most represented in this library? Plot the relative frequency of the top 20 barcodes.
Try:
• use the function table to count how many times each barcode occurs in the library; you cannot apply
this function to seqs directly, you must first convert it to a character vector with the as.character
function
• sort the counts object you just created with the function sort; use decreasing=TRUE as an argument
so that the elements are sorted from high to low (use sort( ..., decreasing=TRUE ))
• look at the first element of your sorted counts object to find out which barcode is most represented
• find out the relative frequency of each barcode by dividing your counts object by the total
number of reads (the function sum might be useful)
• plot the relative frequency of the top 20 barcodes by adapting these function calls:
> # set up larger margins for the plot so we can read the barcode names
> par(mar=c(5, 5, 4, 2))
> barplot(..., horiz=TRUE, las=1, col="orange" )
Solution:
> barcount = sort(table(as.character(seqs)), decreasing=TRUE)
> barcount[1:10]
# TTGCGA
> barcount = barcount/sum(barcount)
> par( mar=c(5, 5, 4, 2))
> barplot(barcount[1:20], horiz=TRUE, las=1, col="orange" )
Exercise 28
The designed barcodes ("ATGGCT", "TTGCGA", "ACACTG" and "ACTAGC") seem to be equally distributed. What is the percentage of reads that cannot be assigned to a barcode?
Solution:
> signif((1-sum(barcount[1:4]))*100,digits=2)
# ~6.4%
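To see what the two solutions above compute without needing the course data, here is a small base R sketch on a made-up barcode vector. The object toy and its values are assumptions chosen for illustration; they stand in for as.character(seqs).

```r
# toy barcode vector (made-up values standing in for as.character(seqs))
toy <- c(rep("ATGGCT", 24), rep("TTGCGA", 25), rep("ACACTG", 23),
         rep("ACTAGC", 22), "ATGGCA", "NTGCGA", "ACACTT",
         "TTGCGT", "ACTAGA", "CCACTG")

# count, sort from high to low and convert to relative frequencies
barcount <- sort(table(toy), decreasing = TRUE)
barcount <- barcount / sum(barcount)

# the most represented barcode
names(barcount)[1]

# percentage of reads not carrying one of the 4 most frequent barcodes
unassigned <- signif((1 - sum(barcount[1:4])) * 100, digits = 2)
unassigned
```

In this toy set, 6 of the 100 barcodes are erroneous variants, so 6 percent of the reads cannot be assigned.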
We will now iterate over the 4 barcodes, split the reads between them and save a new fastq file for
each:
> barcodes = c("ATGGCT", "TTGCGA", "ACACTG", "ACTAGC")
> # iterate through each of these 4 barcodes and write
> # output to fastq files
> for(barcode in barcodes) {
+   seqs <- sread(reads) # get sequence list
+   qual <- quality(reads) # get quality score list
+   qual <- quality(qual) # strip quality score type
+   mismatchVector <- 0 # allow no mismatches
+   # trim sequences looking for a 5' pattern
+   # gets IRanges object with trimmed coordinates
+   trimCoords <- trimLRPatterns(Lpattern=barcode,
+     subject=seqs, max.Lmismatch=mismatchVector, ranges=TRUE)
+   # use IRanges coordinates to trim sequences and quality scores
+   seqs <- DNAStringSet(seqs, start=start(trimCoords),
+     end=end(trimCoords))
+   qual <- BStringSet(qual, start=start(trimCoords),
+     end=end(trimCoords))
+   qual <- SFastqQuality(qual) # reapply quality score type
+   # rebuild reads object with trimmed sequences and quality scores
+   trimmed <- ShortReadQ(sread=seqs, quality=qual, id=id(reads))
+   # keep only reads which trimmed the full barcode
+   trimmed <- trimmed[start(trimCoords) == barcodeLength + 1]
+   # export filtered and trimmed reads to fastq file
+   outputFileName <- paste(barcode, ".fq", sep="")
+   writeFastq(trimmed, outputFileName)
+   print(paste("wrote", length(trimmed),
+     "reads to file", outputFileName))
+ }
You should have four new FASTQ files: ACACTG.fq, ACTAGC.fq, ATGGCT.fq and TTGCGA.fq with
the reads (the barcodes have been trimmed) corresponding to each multiplexed sample. The next step
would be to align these reads against your reference genome.
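The loop above allows no mismatches in the barcode, so reads with a single sequencing error in their barcode are lost. As a hedged sketch of an alternative (not the approach used in this practical), the following base R code assigns each read to the closest designed barcode when it is within a given number of mismatches. The reads are made up, and assignBarcode and maxMismatch are hypothetical names introduced for illustration.

```r
barcodes <- c("ATGGCT", "TTGCGA", "ACACTG", "ACTAGC")
barcodeLength <- 6

# toy reads (made-up sequences); the first 6 bases are the barcode
toyReads <- c("ATGGCTAAACCCGGG",   # exact match
              "ATGGCAAAACCCGGG",   # one mismatch to ATGGCT
              "TTGCGATTTTGGGCC",   # exact match
              "GGGGGGAAACCCTTT")   # no barcode within one mismatch

# hypothetical helper: return the closest barcode, or NA when none is
# within maxMismatch substitutions
assignBarcode <- function(read, barcodes, maxMismatch = 1) {
  prefix <- strsplit(substr(read, 1, nchar(barcodes[1])), "")[[1]]
  d <- sapply(strsplit(barcodes, ""), function(b) sum(b != prefix))
  if (min(d) <= maxMismatch) barcodes[which.min(d)] else NA_character_
}

assigned <- sapply(toyReads, assignBarcode, barcodes = barcodes,
                   USE.NAMES = FALSE)
assigned

# trim the barcode off and split the assigned reads per sample
trimmed <- substring(toyReads, barcodeLength + 1)
demux <- split(trimmed[!is.na(assigned)], assigned[!is.na(assigned)])
demux
```

Allowing one mismatch is unambiguous only because the designed barcodes differ from each other at several positions; with more similar barcodes, a read could be equally close to two of them.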
Aligning reads using Rsubread. Note that since last week and the latest release of Bioconductor,
i.e. version 2.13, I have encountered weird errors using Rsubread on Mac OS X 10.6.8. If that occurs on
the course machines too, read the code and feel free to ask me any questions.
> library(Rsubread)
> library(BSgenome.Dmelanogaster.UCSC.dm3)
> chr4 <- DNAStringSet(unmasked(Dmelanogaster[["chr4"]]))
> names(chr4) <- "chr4"
> writeXStringSet(chr4,file="dm3-chr4.fa")
> ## create the indexes
> dir.create("indexes")
> buildindex(basename=file.path("indexes","dm3-chr4"),
+            reference="dm3-chr4.fa",memory=1000)
> ## align the reads
> sapply(dir(pattern="\\.fq$"),function(fil){
+   ## align; readfile1 is the FASTQ file itself
+   align(index=file.path("indexes","dm3-chr4"),
+         readfile1=fil,
+         nsubreads=2,TH1=1,
+         output_file=sub("\\.fq$",".sam",fil)
+   )
+   ## create bam files
+   asBam(file=sub("\\.fq$",".sam",fil),
+         destination=sub("\\.fq$","",fil),
+         indexDestination=TRUE)
+ })
And that’s it: you have filtered, demultiplexed and aligned your reads!
3.1.5 Resources
There are extensive vignettes for the Biostrings and GenomicRanges packages. A useful on-line resource is
from Thomas Girke’s group.
Chapter 4
Interlude
Now that we have seen the GenomicRanges functionalities to find, count or summarize overlaps between
reads and annotations, we can refine our preferred function. We had left it as:
> gAlns <- mclapply(bamFileList,function(bamFile){
+   open(bamFile)
+   gAln <- GAlignments()
+   while(length(chunk <- readGAlignmentsFromBam(bamFile))){
+     gAln <- c(gAln,chunk)
+   }
+   close(bamFile)
+   return(gAln)
+ })
Exercise 29
Using the synthetic transcript annotation prepared during the lecture (dmel_synthetic_transcript_r5-52.rda), implement the count by chunks.
Solution:
> load("~/Day2/dmel_synthetic_transcript_r5-52.rda")
> count.list <- mclapply(bamFileList,function(bamFile){
+   open(bamFile)
+   counts <- vector(mode="integer",length=length(annot))
+   while(length(chunk <- readGAlignmentsFromBam(bamFile))){
+     counts <- counts + assays(summarizeOverlaps(annot,chunk,mode="Union"))$counts
+   }
+   close(bamFile)
+   return(counts)
+ })
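The point of the solution above is that only a running total is kept in memory, never all the alignments at once. The same accumulation pattern can be sketched in base R on toy data. All values below are made up; readHits, nFeatures and chunkSize are hypothetical names, with chunkSize playing the role of the chunk size set on the BAM file.

```r
# toy data: for each read, the index of the feature it overlaps
# (made-up values; NA means the read overlaps no feature)
readHits <- c(1, 3, 3, NA, 2, 3, 1, NA, 3, 2, 2, 3)
nFeatures <- 3
chunkSize <- 5   # number of reads processed per iteration

counts <- integer(nFeatures)
chunkStarts <- seq(1, length(readHits), by = chunkSize)
for (s in chunkStarts) {
  chunk <- readHits[s:min(s + chunkSize - 1, length(readHits))]
  # count the reads of this chunk only and add them to the running total
  counts <- counts + tabulate(chunk[!is.na(chunk)], nbins = nFeatures)
}
counts
```

The result is identical to counting all reads in one go; only the peak memory use differs.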
This gives us a list of counts per sample; to get a count matrix, do:
> count.table <- do.call("cbind",count.list)
> head(count.table)
              reads reads
FBgn0000008.0     1     3
FBgn0000014.0     0     0
FBgn0000015.0     0     0
FBgn0000017.0   107   110
FBgn0000018.0     5    11
FBgn0000022.0     0     0
Such a count.table object is the minimal input that downstream analysis software (e.g. DESeq2,
edgeR, etc.) uses.
A function similar to this one is probably all you will need to process your reads and get a count table
from a standard Illumina-based RNA-Seq experiment. However, you might want more flexibility for your
projects, and Bioconductor certainly offers the possibility to do that; examples of which are given in the
next chapter.
Chapter 5
Estimating Expression over Genes and Exons
This chapter¹ describes part of an RNA-Seq analysis use-case. RNA-Seq [14] was introduced as a new
method to perform Gene Expression Analysis, using the advantages of the high throughput of Next-Generation Sequencing (NGS) machines.
5.1 Counting reads over known genes and exons
The goal of this use-case is to generate a count table for the selected genic features of interest, i.e. exons,
transcripts, gene models, etc.
To achieve this, we need to take advantage of all the steps performed previously in the workshop
on Day1 and Day2:
1. the alignment information has to be retrieved
2. the corresponding annotation needs to be fetched and possibly adapted, e.g. as was done in the
preceding lecture
3. the read coverage per genic feature of interest has to be determined
Exercise 30
Can you associate at least one Bioconductor package with each of these tasks?
Solution: There are numerous choices; as an example, we will go for the following set
of packages:
a. Rsamtools
b. genomeIntervals - this was already done during the lecture
c. GenomicRanges
5.1.1 The alignments
This was introduced in section 3.1.3, page 9. In this section we will import the data using the Rsamtools
readGAlignmentsFromBam function. This will create a GAlignments object that contains only the reads that aligned
to the genome.
Exercise 31
Using what was introduced in section 3.1.3, read in the first bam file from the bigdata() bam folder.
Remember that the protocol used was not strand-specific.
¹ The author wants to thank Ângela Gonçalves for parts of the present chapter.
Figure 5.1: Overlap modes; Image from the HTSeq package developed by Simon Anders.
Solution: First we scan the bam directory:
> bamfiles <- dir(file.path(bigdata(), "bam"), ".bam$", full=TRUE)
> names(bamfiles) <- sub("_.*", "", basename(bamfiles))
Then we read the first file:
> aln <- readGAlignments(bamfiles[1])
> strand(aln) <- "*"
As we have seen, many of these reads actually align to multiple locations. In a first basic analysis - i.e. to get a feel for the data - such reads could be ignored.
Exercise 32
Filter the multiple alignment reads. Think of the “NH” tag.
Solution:
> param <- ScanBamParam(tag="NH")
> nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH
> aln <- aln[nhs==1,]
Now that we have the alignments (aln object) and the synthetic transcript annotation (annot object;
the one from the lecture, the same as used in the Interlude, Chapter 4, page 18), we can quantify gene expression by
counting reads over all exons of a gene and summing them together. One thing to keep in mind is that
special care must be taken in dealing with reads that overlap more than one feature (e.g. overlapping
genes, isoforms), and thus might be counted several times in different features. To deal with this we can
use any of the approaches summarised in Figure 5.1.
The GenomicRanges summarizeOverlaps function offers different possibilities to summarize reads per feature:
> load("~/Day2/dmel_synthetic_transcript_r5-52.rda")
> counts1 <- summarizeOverlaps(annot, aln, mode="Union")
> counts2 <- summarizeOverlaps(annot, aln, mode="IntersectionStrict")
> counts3 <- summarizeOverlaps(annot, aln, mode="IntersectionNotEmpty")
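To make the three modes concrete, here is a toy per-base re-implementation in base R that follows the set definitions illustrated in Figure 5.1. It is only a sketch under stated assumptions: the feature coordinates are made up, featuresAt and assignRead are hypothetical helpers, and this is not how summarizeOverlaps is implemented internally.

```r
# toy annotation: two overlapping features (made-up coordinates)
features <- data.frame(name  = c("geneA", "geneB"),
                       start = c(100, 180),
                       end   = c(200, 300),
                       stringsAsFactors = FALSE)

# hypothetical helper: names of the features covering a single position
featuresAt <- function(pos) {
  features$name[features$start <= pos & pos <= features$end]
}

# hypothetical helper: assign one read (given by start/end) under a mode
assignRead <- function(start, end, mode) {
  perBase <- lapply(start:end, featuresAt)
  hit <- switch(mode,
    Union = unique(unlist(perBase)),
    IntersectionStrict = Reduce(intersect, perBase),
    IntersectionNotEmpty = {
      nonEmpty <- perBase[lengths(perBase) > 0]
      if (length(nonEmpty)) Reduce(intersect, nonEmpty) else character(0)
    })
  # a read is only counted when exactly one feature remains
  if (length(hit) == 1) hit else NA_character_
}

modes <- c("Union", "IntersectionStrict", "IntersectionNotEmpty")
# read fully inside geneA only
inside <- sapply(modes, function(m) assignRead(120, 150, m))
# read starting in geneA only and ending in the geneA/geneB overlap
shared <- sapply(modes, function(m) assignRead(170, 190, m))
# read hanging over the start of geneA
overhang <- sapply(modes, function(m) assignRead(90, 110, m))
rbind(inside, shared, overhang)
```

In this toy case the three modes agree for the read inside geneA, while the read reaching into the overlap is ambiguous (NA) under Union only, and the overhanging read is dropped under IntersectionStrict only.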
Exercise 33
Create a data.frame or a matrix of the results above and figure out if any differences can be observed.
E.g. check for differences in the row standard deviation (using the apply and sd functions).
Solution:
> synthTrxCountsTable <- data.frame(
+   assays(counts1)$counts,
+   assays(counts2)$counts,
+   assays(counts3)$counts)
> colnames(synthTrxCountsTable) <- c("union","intStrict","intNotEmpty")
> rownames(synthTrxCountsTable) <- rownames(counts1)
> sds <- apply(synthTrxCountsTable,1,sd)
> sum(sds!=0)
> sum(sds!=0)/length(sds)
> synthTrxCountsTable[which.max(sds),]
> annot[which.max(sds),]
So it appears that we have 3,872 cases where these counting modes generate different results (28% of the total!),
and that the synthetic transcript “FBgn0003942.0” shows the largest difference.
For a detailed analysis, it would be important to adequately choose one of the intersection modes
above; however, for the remainder of this section, we will use the “union” set. As before for reads aligning
to multiple places in the genome, choosing to take the union when reads overlap several features is a
simplification we may not want to make. There are several methods that probabilistically estimate the
expression of overlapping features [11, 19, 20].
This concludes the section on counting reads per known features. In the next section, we will look
at how novel transcribed regions could be identified.
5.2 Discovering novel transcribed regions
One main advantage of RNA-seq experiments over microarrays is that they can be used to identify any
transcribed molecule, including unknown transcripts and isoforms, as well as other regulatory transcribed
elements. To identify such new elements, several methods are available to recreate and annotate transcripts, e.g. Cufflinks [19], Oases [17] and Trinity [7], to mention some of them. We can use Bioconductor tools
as well, to identify loci and quantify counts without prior annotation knowledge. The example here is
very crude and is really just a proof of concept of what one could do in a few commands, i.e. R rules.
Nonetheless, to make the results more precise, the reads have been realigned using STAR [4], a
very fast and accurate aligner that uses the recent approach of Maximum Exact Matches (MEMs); see
https://code.google.com/p/rna-star/ for more details. This MEM approach allows STAR to identify
exon-exon junctions without prior knowledge, e.g. there is no need for an annotation gff. To start, we re-read
one of the sample alignments using the Rsamtools readGAlignmentsFromBam function.
> aln <- readGAlignmentsFromBam(
+   BamFile(file.path(bigdata(),"STAR","SRR074431.bam")))
Defining transcribed regions. The process begins with calculating the coverage, using the coverage method
from the GenomicRanges package:
> cover <- coverage(aln)
> cover
> # this object is compressed to save space. It is an Rle (run-length encoding)
> # we can look at a section of chromosome 3R, say between bp 1 and 1000,
> # which gives us the number of reads overlapping each of those bases
> as.vector(cover[["3R"]])[1:1000]
The coverage shows us how many reads overlap every single base in the genome. It is actually split
per chromosome.
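What the coverage and its run-length encoding contain can be illustrated with base R on a toy example. The read coordinates and the chromosome length below are made up; base R's rle() is used here as a stand-in for the Rle class, which stores the same kind of information.

```r
# toy alignments on a 20 bp "chromosome" (made-up coordinates)
starts <- c(2, 4, 4, 12)
ends   <- c(8, 10, 10, 15)
chrLength <- 20

# per-base coverage: add 1 over every position spanned by a read
perBase <- integer(chrLength)
for (i in seq_along(starts)) {
  idx <- starts[i]:ends[i]
  perBase[idx] <- perBase[idx] + 1L
}
perBase

# run-length encoding of the same vector: runs of identical values
# are stored as (length, value) pairs
toyRle <- rle(perBase)
toyRle

# decompressing gives the per-base vector back
identical(inverse.rle(toyRle), perBase)
```

The 20 per-base values collapse into 7 runs; on a real chromosome, where long stretches have identical coverage, the saving is much larger.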
The next step is to define “islands” of expression. These can be created using the slice function.
The peak height for the islands can be determined with the viewMaxs function and the island widths can
be found using the width function:
> islands <- slice(cover, 1)
> islandPeakHeight <- viewMaxs(islands)
> islandWidth <- width(islands)
While some more sophisticated approaches can be used to find exons de novo, we can use a simple
approach whereby we select islands whose maximum peak height is 2 or more and whose width is 114 bp
(150% of the read size) or more to be candidate exons. The elementLengths function can show how many
of these candidate exons appear on each chromosome; here we look at those on chromosome 3R:
> candidateExons <- islands[islandPeakHeight >= 2L & islandWidth >=114L]
> candidateExons[["3R"]]
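The island selection can be mimicked in base R on a toy coverage vector, which may help to see what slice, viewMaxs and width return. The coverage values and the toy width threshold minWidth are made up for illustration.

```r
# toy per-base coverage (made-up values)
perBase <- c(0, 1, 1, 3, 3, 3, 2, 0, 0, 1, 1, 0, 2, 4, 4, 1, 0)

# islands are maximal runs of positions with a coverage of 1 or more
covered  <- rle(perBase >= 1)
runEnd   <- cumsum(covered$lengths)
runStart <- runEnd - covered$lengths + 1
islands <- data.frame(start = runStart[covered$values],
                      end   = runEnd[covered$values])
islands$width <- islands$end - islands$start + 1
islands$peak  <- mapply(function(s, e) max(perBase[s:e]),
                        islands$start, islands$end)
islands

# keep islands with a peak of 2 or more and a width of at least minWidth
minWidth <- 4   # hypothetical toy threshold; the text uses 114 bp
candidates <- islands[islands$peak >= 2 & islands$width >= minWidth, ]
candidates
```

Of the three toy islands, the short one with a peak of 1 is discarded and the other two are kept as candidates.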
Remember that we used an aligner which is capable of mapping reads across splice junctions in the
genome.
> sum(cigarOpTable(cigar(aln))[,"N"] > 0)
[1] 99677
There are 99,677 reads that span exon-exon junctions (EEJs).
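The count above relies on the N operation of the CIGAR string, which marks a skipped region of the reference, i.e. a putative intron. As a base R sketch of what is being tabulated, the CIGAR strings below are taken from spliced alignments of this data set, plus one made-up unspliced read ("76M").

```r
# three spliced CIGAR strings and one made-up unspliced read
cigars <- c("58M68N18M", "40M75N36M", "69M174N7M", "76M")

# split each CIGAR into its operation lengths and operation codes
opLengths <- lapply(regmatches(cigars, gregexpr("[0-9]+", cigars)),
                    as.integer)
opCodes   <- regmatches(cigars, gregexpr("[A-Z=]", cigars))

# number of N operations per read, and number of reads with at least one
nSkips <- sapply(opCodes, function(op) sum(op == "N"))
sum(nSkips > 0)

# width on the reference: M, D, N, = and X consume reference bases
refWidth <- mapply(function(len, op) {
  sum(len[op %in% c("M", "D", "N", "=", "X")])
}, opLengths, opCodes)
refWidth
```

For a 76 bp read with CIGAR 58M68N18M, the width on the reference is 58 + 68 + 18 = 144 bp, although only 76 bases are actually aligned.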
Let’s look up such a potential EEJ:
> aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "3R",]
GAlignments with 22070 alignments and 0 metadata columns:
          seqnames strand       cigar    qwidth     start       end     width      ngap
             <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
      [1]       3R      -   58M68N18M        76       452       595       144         1
      [2]       3R      +   40M75N36M        76     20556     20706       151         1
      [3]       3R      +   39M75N37M        76     20557     20707       151         1
      [4]       3R      +   38M75N38M        76     20558     20708       151         1
      [5]       3R      -   69M174N7M        76     23216     23465       250         1
      ...      ...    ...         ...       ...       ...       ...       ...       ...
  [22066]       3R      - 43M7183N33M        76  27884825  27892083      7259         1
  [22067]       3R      - 29M7183N47M        76  27884839  27892097      7259         1
  [22068]       3R      - 29M7183N47M        76  27884839  27892097      7259         1
  [22069]       3R      + 17M7183N59M        76  27884851  27892109      7259         1
  [22070]       3R      + 15M7183N61M        76  27884853  27892111      7259         1
  ---
  seqlengths:
     YHet ...   XHet
   347038 ... 204112
> aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "3R",][2:10,]
GAlignments with 9 alignments and 0 metadata columns:
      seqnames strand       cigar    qwidth     start       end     width      ngap
         <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
  [1]       3R      +   40M75N36M        76     20556     20706       151         1
  [2]       3R      +   39M75N37M        76     20557     20707       151         1
  [3]       3R      +   38M75N38M        76     20558     20708       151         1
  [4]       3R      -   69M174N7M        76     23216     23465       250         1
  [5]       3R      +  63M174N13M        76     23222     23471       250         1
  [6]       3R      +  45M174N31M        76     23240     23489       250         1
  [7]       3R      +  41M174N35M        76     23244     23493       250         1
  [8]       3R      +  40M174N36M        76     23245     23494       250         1
  [9]       3R      +  32M174N44M        76     23253     23502       250         1
  ---
  seqlengths:
     YHet ...   XHet
   347038 ... 204112
There are respectively 3 and 7 reads spanning what look like introns of 75 and 174 bp.
Note that the GenomicRanges GAlignments class is aware of splicing junctions. Have a look at the
coverage for the first intron:
> cover[["3R"]][20556:20706]
integer-Rle of length 151 with 5 runs
Lengths: 1 1 38 75 36
Values : 1 2 3 0 3
Now, if we select a few of these EEJs, we can have a look if we can identify a specific motif.
> splice.reads <- aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "3R",]
> read.start <- start(splice.reads)[c(2,5,16,30,37)]
> donor.pos <- read.start - 1 +
+   as.integer(sapply(strsplit(cigar(splice.reads)[c(2,5,16,30,37)],"M"),"[",1))
> acceptor.pos <- read.start - 1 +
+   sapply(
+     lapply(
+       lapply(strsplit(cigar(splice.reads)[c(2,5,16,30,37)],"M|N"),"[",1:2),
+       as.integer),
+     sum)
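The arithmetic behind donor.pos and acceptor.pos can be followed on a single read. The sketch below uses the plus-strand read that starts at position 20556 with CIGAR 40M75N36M; the variable names are introduced here for illustration only.

```r
# one spliced read: start 20556, CIGAR 40M75N36M
readStart <- 20556
cigar <- "40M75N36M"

# lengths of the CIGAR operations, in order: 40 (M), 75 (N), 36 (M)
opLengths <- as.integer(strsplit(cigar, "[MN]")[[1]])

# last exonic base before the intron
donorPos <- readStart - 1 + opLengths[1]
# last intronic base before the second exon
acceptorPos <- readStart - 1 + sum(opLengths[1:2])

c(donor = donorPos, acceptor = acceptorPos)

# the intron spans donorPos + 1 to acceptorPos
acceptorPos - donorPos
```

This gives 20595 and 20670, i.e. an intron from 20596 to 20670, which agrees with the 75 uncovered bases seen in the coverage of that region.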
Now read the chromosome 3R sequence (actually just a subset of the first 30,000 bp):
> chr3R <- readDNAStringSet(file.path(bigdata(),"..",
+                                     "fasta",
+                                     "dmel-chromosome-3R-1-30000bp.fasta"))
Now locate the acceptor and donor sites, but think of the strand! Let’s just look at the ones on the
plus strand.
> sel <- as.logical(strand(splice.reads)[c(2,5,16,30,37)] == "+")
> plus.donor <- Views(subject=chr3R[[1]],start=donor.pos[sel]-8,
+                     end=donor.pos[sel]+11)
> plus.acceptor <- Views(subject=chr3R[[1]],start=acceptor.pos[sel]-10,
+                        end=acceptor.pos[sel]+9)
Let’s see if there’s a consensus in the sequences of 20 bp centered around the potential acceptor and
donor sites. Note that you might have to install the seqLogo package.
> library(seqLogo)
> pwm <- makePWM(cbind(
+   alphabetByCycle(DNAStringSet(plus.donor))[c("A","C","G","T"),]/3,
+   alphabetByCycle(DNAStringSet(plus.acceptor))[c("A","C","G","T"),]/3)
+ )
> seqLogo(pwm)
Clearly the logo - Figure 5.2 - is not exceptional, but from only 3 EEJs, we can already see that the
donor site at position 10-11 is GT and the acceptor site at position 30-31 is AG, i.e. the canonical
sites. Moreover, we can see a relative - or at least I want to see it because I know it must be there
- enrichment for Ts in the intron sequence, a known phenomenon. Hence, using a de novo approach
complemented by additional criteria can prove very efficient.
Figure 5.2: GC content in aligned reads
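The matrix given to makePWM above is a table of per-position nucleotide frequencies; the division by 3 is the number of plus-strand junctions selected. A base R sketch of the same computation, dividing by the number of sequences instead of a hard-coded value, could look as follows. The sequences are made up and are not read from chromosome 3R.

```r
# toy aligned sequences of equal length (made-up)
sites <- c("AAGGTAAGT", "CAGGTGAGT", "AAGGTATGT")

# one row per sequence, one column per position
bases <- do.call(rbind, strsplit(sites, ""))

# count each nucleotide per position, then divide by the number of
# sequences to obtain frequencies
pfm <- sapply(seq_len(ncol(bases)), function(j) {
  table(factor(bases[, j], levels = c("A", "C", "G", "T")))
}) / length(sites)
colnames(pfm) <- seq_len(ncol(pfm))
pfm

# every column sums to 1
colSums(pfm)
```

Dividing by length(sites) keeps the columns summing to 1 whatever the number of junctions selected.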
This concludes the section on summarizing counts. As you could realize, juggling with the different
packages for manipulating the alignments and annotation requires some coding. To facilitate this, a number
of “workflow” packages are available at Bioconductor. The next section gives a brief introduction to
easyRNASeq (obviously, a biased selection...).
5.3 Using easyRNASeq
Let us redo what was done in the previous section. Note that most of the RNAseq object slots are optional.
However, it is advised to set them, especially the readLength and the organismName, to help
properly document your analysis. The organismName slot is actually mandatory if you want to
get genomic annotation using biomaRt. In that case, you need to provide the name as specified in the
corresponding BSgenome package, i.e. “Dmelanogaster” for the BSgenome.Dmelanogaster.UCSC.dm3
package.
> ## load the library
> library("easyRNASeq")
> count.table <- easyRNASeq(filesDirectory=dirname(bamfiles[1]),
+                           filenames=basename(bamfiles),
+                           organism="Dmelanogaster",
+                           readLength=76L,
+                           annotationMethod="gff",
+                           annotationFile=file.path("~/Day2",
+                             "dmel_synthetic_transcript_r5-52.gff3"),
+                           format="bam",
+                           gapped=TRUE,
+                           count="transcripts")
> head(count.table)
> dim(count.table)
That is all. In one command, you got the count table for your 2 samples!
Warnings. As you could see when running the previous example, warnings were emitted, and quite
rightly so.
1. about the annotation: Although we have created synthetic transcripts (sometimes called gene
models), the annotation we are using here is still redundant, as genes located on opposing strands
overlap. Therefore counting reads using this annotation will result in counting some of the reads
several times. As this can be a very significant source of error, all the examples here will emit
this warning. The ideal solution is to further refine the annotation object so that it contains no
overlapping features, or to validate the affected genes either in silico (e.g. by looking at the raw read
coverage in a genome browser) or by an appropriate wet-lab method.
2. about potential naming issues in the input file: It is (sadly) very frequent that the sequencing
facilities use different naming conventions for the chromosomes they report in the alignment files.
It is therefore very frequent that the annotation provided to easyRNASeq uses different chromosome
names than the alignment file. These warnings are there to inform you about this issue.
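A quick way to check for such a naming mismatch before counting is to compare the two sets of chromosome names. The base R sketch below uses made-up name vectors, and stripChr is a hypothetical helper; it only illustrates the kind of check one could do and is not what easyRNASeq does internally.

```r
# made-up example: names as found in an alignment file and in an annotation
bamChr   <- c("chr2L", "chr2R", "chr3L", "chr3R", "chr4", "chrX")
annotChr <- c("2L", "2R", "3L", "3R", "4", "X", "YHet")

# names shared as they are: none, so no read would overlap any feature
intersect(bamChr, annotChr)

# harmonise by stripping a leading "chr" prefix from both sets
stripChr <- function(x) sub("^chr", "", x)   # hypothetical helper
shared <- intersect(stripChr(bamChr), stripChr(annotChr))
shared

# names present in only one of the two sets deserve a closer look
onlyAnnot <- setdiff(stripChr(annotChr), stripChr(bamChr))
onlyAnnot
```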
Details. The easyRNASeq function currently accepts the following annotationMethods:
• “biomaRt” use biomaRt to retrieve the annotation
• “env” use a RangedData or GRanges class object present in the environment
• “gff” reads in a gff version 3 file
• “gtf” reads in a gtf file
• “rda” load an RData object. The object needs to be named gAnnot and of class RangedData or
GRanges.
The reads can be read in from BAM files or any format supported by ShortRead.
The reads can be summarized by:
• exons
• features (any features such as introns, enhancers, etc.)
• transcripts
• geneModels (a geneModel is the set of non-overlapping loci (i.e. synthetic exons) that represents
all the possible exons and UTRs of a gene). Such geneModels are essential when counting reads
as they ensure that no read will be counted several times. E.g., a gene can have different
isoforms, using different exons or overlapping exons, in which case summarizing by exons might result
in counting a read several times, once per overlapping exon. N.B. Assessing differential expression
between transcripts based on synthetic exons has been possible since the release 2.11 of R,
using the DEXSeq package available from Bioconductor. Note that this geneModels approach is
actually an older implementation of the one we have taken during the lecture to create the synthetic
transcripts. The latter should be the preferred one. In the coming version of easyRNASeq,
geneModels will be deprecated in favor of the synthetic transcripts generation approach.
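The idea of collapsing the exons of all isoforms of a gene into non-overlapping synthetic exons can be sketched in base R by merging overlapping intervals. The coordinates below are made up, and this is only an illustration of the principle, similar in spirit to the reduce function for ranges, not the easyRNASeq implementation.

```r
# toy exons of one gene coming from two isoforms (made-up coordinates)
exons <- data.frame(start = c(100, 150, 400, 420, 700),
                    end   = c(200, 250, 500, 450, 800))

# sort by start, then open a new synthetic exon whenever an exon starts
# after the end of everything seen so far
exons <- exons[order(exons$start), ]
runningEnd <- cummax(exons$end)
newGroup <- c(TRUE, exons$start[-1] > head(runningEnd, -1))
group <- cumsum(newGroup)

synthetic <- data.frame(
  start = as.vector(tapply(exons$start, group, min)),
  end   = as.vector(tapply(exons$end, group, max)))
synthetic
```

The five toy exons collapse into three synthetic exons, so a read falling in the 150-200 overlap is counted once instead of twice.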
The results can be exported in five different formats:
• a count table (the default, an n (features) x m (samples) matrix).
• a DESeq [1] countDataSet class object. Useful to perform further analyses using the DESeq package.
• an edgeR [16] DGEList class object. Useful to perform further analyses using the edgeR package.
• an RNAseq class object. Useful for performing additional pre-processing without re-loading the
reads and annotations.
• a SummarizedExperiment class object. This should be the output of choice and will be made the
default in the easyRNASeq version to be released with Bioconductor version 2.14 next spring.
The obtained results can optionally be corrected as Reads per Kilobase of feature per Million reads
in the library (RPKM, [14]) or normalized using the DESeq or edgeR packages.
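As a minimal sketch of the RPKM correction mentioned above, written in base R on made-up numbers (the counts, feature lengths and library size are assumptions for illustration):

```r
# toy values (made-up): counts for 3 features in one sample
counts        <- c(geneA = 500,  geneB = 1000, geneC = 500)
featureLength <- c(geneA = 1000, geneB = 4000, geneC = 500)   # in bp
librarySize   <- 2e6   # total number of reads in the library

# reads per kilobase of feature per million reads in the library
rpkm <- counts / (featureLength / 1e3) / (librarySize / 1e6)
rpkm
```

geneA and geneC have the same raw count, but geneC is half as long, so its RPKM is twice that of geneA.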
For more details and a complete overview of the easyRNASeq package capabilities, have a look at
the easyRNASeq vignette.
> vignette("easyRNASeq")
Exercise 34
From the same input files and annotations, generate an object of class SummarizedExperiment .
Solution:
Note that recent changes to the Bioconductor API have affected the following functionality. I only
realized it yesterday, so I have not had the time to devise a fix, but I will do so asap. Sorry. Well, anyway,
it should be the end of the day when you reach this point, so you probably will not mind so much, I hope.
Especially since it is not something essential.
> sumExp <- easyRNASeq(filesDirectory=dirname(bamfiles[1]),
+                      filenames=basename(bamfiles),
+                      organism="Dmelanogaster",
+                      readLength=76L,
+                      annotationMethod="gff",
+                      annotationFile=file.path("~/Day2",
+                        "dmel_synthetic_transcript_r5-52.gff3"),
+                      format="bam",
+                      gapped=TRUE,
+                      count="transcripts",
+                      outputFormat="SummarizedExperiment")
See the GenomicRanges package SummarizedExperiment class for more details on the three accessors
used in the following.
> ## the counts
> assays(sumExp)
> ## the sample info
> colData(sumExp)
> ## the 'features' info
> rowData(sumExp)
Caveats. easyRNASeq is still under active development and as such the current version still lacks some
essential data processing (e.g. strand-specific sequencing is not yet supported). The new version to be
released with Bioconductor 2.13, in early October this year, fills in these gaps:
1. The easyRNASeq function used above is deprecated in favor of the simpleRNASeq function,
which takes advantage of numerous core Bioconductor packages, e.g. the use of BamFile and
BamFileList objects from the Rsamtools package to locate and access BAM formatted files.
2. Secondly, it benefits from the standardization of the RNA-Seq field in the sense that the number of
necessary arguments to be provided by default has plummeted. It benefits as well from a refactoring
of how these arguments are provided; they are abstracted and combined in a way similar to
the ScanBamParam parameter of the Rsamtools package scanBam function.
3. Then, its performance - e.g. memory management - has been optimized through parallelization.
4. In addition, advanced checks are conducted on the data provided by the user to ensure the overall
process suitability. More comprehensive warnings or errors are thrown, should it be necessary.
5. The concerns raised by Robinson et al. in the analysis reported at
https://stat.ethz.ch/pipermail/bioc-devel/2012-September/003608.html have been addressed too. Both the original
easyRNASeq method and the GenomicRanges approach are provided, the latter being the
default.
6. And last but not least, it provides access to the latest tools for differential expression analysis such as DESeq2 and DEXSeq. An integration of limma is planned, to enable
the voom+limma paradigm. Ideally, easyRNASeq would select the most appropriate analysis to be
conducted based on the report by Soneson and Delorenzi [18].
5.4 Where to from here
After obtaining the count table, numerous downstream analyses are available. Most often, such count
tables are generated in a differential expression experimental setup. In that case, packages such as DESeq,
DEXSeq, edgeR, limma (see voom+limma in the limma vignette), etc. are some of the possibilities
available in Bioconductor. Have a look at [3] and [18] to decide which tool/approach is best suited
to your experimental design. But, of course, counts can also be used for other purposes such as
visualization, using e.g. the rtracklayer and Gviz packages.
Actually, there is no real limit to what one can achieve with a count table, and it need not
come from an RNA-Seq experiment; look at the DiffBind package for an example of using ChIP-Seq data for
differential binding analyses.
Acknowledgments
1. Thanks to the Workshop organizers, in particular Gabriella Rustici
2. Thanks to the other lecturers, it is always fun around you.
3. Finally, thanks to you the reader - whatever the medium you’re reading this on - for having made
it this far.
Bibliography
[1] S. Anders and W. Huber. Differential expression analysis for sequence count data. Genome Biology,
11:R106, 2010.
[2] A. N. Brooks, L. Yang, M. O. Duff, K. D. Hansen, J. W. Park, S. Dudoit, S. E. Brenner, and B. R.
Graveley. Conservation of an RNA regulatory map between Drosophila and mammals. Genome
Research, pages 193–202, 2011.
[3] M.-A. Dillies, A. Rau, J. Aubert, C. Hennequet-Antier, M. Jeanmougin, N. Servant, C. Keime,
G. Marot, D. Castel, J. Estelle, G. Guernec, B. Jagla, L. Jouneau, D. Laloë, C. L. Gall, B. Schaëffer,
S. L. Crom, M. Guedj, F. Jaffrézic, and on behalf of The French StatOmique Consortium. A
comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing
data analysis. Briefings in Bioinformatics, Sep 2012.
[4] A. Dobin, C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, and
T. R. Gingeras. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, Jan 2013.
[5] R. C. Gentleman et al. Bioconductor: open software development for computational biology and
bioinformatics. Genome Biology, 5(10):R80, 2004.
[6] P. Glaus, A. Honkela, and M. Rattray. Identifying differentially expressed transcripts
from RNA-seq data with biological variation. Bioinformatics, 28(13):1721–1728, 2012.
[7] M. G. Grabherr, B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson, I. Amit, X. Adiconis, L. Fan,
R. Raychowdhury, Q. Zeng, Z. Chen, E. Mauceli, N. Hacohen, A. Gnirke, N. Rhind, F. D. Palma,
B. W. Birren, C. Nusbaum, K. Lindblad-Toh, N. Friedman, and A. Regev. Full-length transcriptome
assembly from RNA-Seq data without a reference genome. Nat Biotechnol, 29(7):644–652, May 2011.
[8] R. W. Hamming. Error detecting and error correcting codes. The Bell System Technical Journal,
29(2):147–160, Apr 1950.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignment
of short DNA sequences to the human genome. Genome Biol., 10:R25, 2009.
[10] P. Lefrançois, G. M. Euskirchen, R. K. Auerbach, J. Rozowsky, T. Gibson, C. M. Yellman, M. Gerstein, and M. Snyder. Efficient yeast ChIP-Seq using multiplex short-read DNA sequencing. BMC
Genomics, 10(1):37, Jan 2009.
[11] B. Li, V. Ruotti, R. M. Stewart, J. A. Thomson, and C. N. Dewey. RNA-Seq gene expression
estimation with read mapping uncertainty. Bioinformatics, 26(4):493–500, Feb 2010.
[12] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics, 25:1754–1760, Jul 2009.
[13] H. Li and R. Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform.
Bioinformatics, 26:589–595, Mar 2010.
[14] A. Mortazavi et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7):621–8, Jul 2008.
[15] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2009. ISBN 3-900051-07-0.
[16] M. D. Robinson, D. J. McCarthy, and G. K. Smyth. edgeR: a Bioconductor package for differential
expression analysis of digital gene expression data. Bioinformatics, 26:139–140, Jan 2010.
[17] M. H. Schulz, D. R. Zerbino, M. Vingron, and E. Birney. Oases: robust de novo RNA-seq assembly
across the dynamic range of expression levels. Bioinformatics, 28(8):1086–92, Apr 2012.
[18] C. Soneson and M. Delorenzi. A comparison of methods for differential expression analysis of RNA-seq
data. BMC Bioinformatics, 14:91, Jan 2013.
[19] C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg,
B. J. Wold, and L. Pachter. Transcript assembly and quantification by RNA-Seq reveals unannotated
transcripts and isoform switching during cell differentiation. Nat Biotechnol, 28(5):511–5, May 2010.
[20] E. Turro, S.-Y. Su, Â. Gonçalves, L. J. M. Coin, S. Richardson, and A. Lewin. Haplotype and
isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biol, 12(2):R13,
Jan 2011.
[21] T. D. Wu and C. K. Watanabe. GMAP: a genomic mapping and alignment program for mRNA and
EST sequences. Bioinformatics, 21(9):1859–75, May 2005.