R / Bioconductor for High-Throughput Sequence Analysis
Nicolas Delhomme (1)
21 October - 26 October, 2013
(1) [email protected]

Contents
1 Day2 of the workshop
  1.1 Introduction
  1.2 Main Bioconductor packages of interest for the day
  1.3 A word on High-throughput sequence analysis
  1.4 A word on Integrated Development Environment (IDE)
  1.5 Today's schedule
2 Prelude
  2.1 Purpose
  2.2 Creating GAlignments objects from BAM files
  2.3 Processing the files in parallel
  2.4 Processing the files one chunk at a time
  2.5 Pros and cons of the current solution
    2.5.1 Pros
    2.5.2 Cons
3 Sequences and Short Reads
  3.1 Alignments and Bioconductor packages
    3.1.1 The pasilla data set
    3.1.2 Alignments and the ShortRead package
    3.1.3 Alignments and the Rsamtools package
    3.1.4 Alignments and other Bioconductor packages
    3.1.5 Resources
4 Interlude
5 Estimating Expression over Genes and Exons
  5.1 Counting reads over known genes and exons
    5.1.1 The alignments
  5.2 Discovering novel transcribed regions
  5.3 Using easyRNASeq
  5.4 Where to from here

Chapter 1 Day2 of the workshop

1.1 Introduction
This portion of the workshop introduces the use of R [15] and Bioconductor [5] for the analysis of high-throughput sequence (HTS) data; specifically, the manipulation of HTS read alignments and the estimation of expression over exons, transcripts and genes from these alignments. The workshop is structured as a series of short remarks followed by group exercises. The exercises explore the diversity of tasks for which R / Bioconductor are appropriate, but are far from comprehensive. The goals of this part of the workshop are to: (1) develop familiarity with R / Bioconductor packages for high-throughput analysis; (2) in particular with those needed to manipulate HTS read alignment files and to derive expression over genic features; and (3) provide inspiration and a framework for further independent exploration.

1.2 Main Bioconductor packages of interest for the day
Bioconductor is a collection of R packages for the analysis and comprehension of high-throughput genomic data. Among these, we will focus principally on three: ShortRead, Rsamtools and GenomicRanges.

1.3 A word on High-throughput sequence analysis
Recent technological developments have introduced high-throughput sequencing approaches. A variety of experimental protocols and analysis workflows address gene expression, regulation, and encoding of genetic variants.
Experimental protocols produce a large number (tens of millions per sample) of short (e.g., 35-250, single or paired-end) nucleotide sequences. These are aligned to a reference or other genome. Analysis workflows use the alignments to infer levels of gene expression (RNA-seq), binding of regulatory elements to genomic locations (ChIP-seq), or prevalence of structural variants (e.g., SNPs, short indels, large-scale genomic rearrangements). Sample sizes range from minimal replication (e.g., 2 samples per treatment group) to thousands of individuals.

1.4 A word on Integrated Development Environment (IDE)
There are numerous tools to support developing programs and software in R. For this course, we have selected one of them: the RStudio environment, which provides a feature-rich, user-friendly, cross-platform environment for working with R.

1.5 Today's schedule

Table 1.1: EMBO2013 AHTSD workshop day2 Schedule
Time   Description
09:00  Lecture: Representing and manipulating alignments
09:45  Practical: Representing and manipulating alignments
10:30  Coffee break
10:45  Practical continued: Representing and manipulating alignments
12:30  Lunch
13:30  Lecture: Estimating expression over genes and exons
14:30  Practical: Estimating expression over genes and exons
15:30  Coffee break
15:45  Lecture: Working without a "reference" genome
16:30  Practical: Discovering novel transcribed regions
17:30  Question and Answer session - preferably at the Red Lion
18:30  Dinner

Chapter 2 Prelude

2.1 Purpose
Before getting familiar with the functionalities of the Bioconductor packages that were presented in the lecture, we will first apply the knowledge you have gathered so far to addressing the computational challenges faced when using HTS data, i.e. resource and time consumption. In the lecture, the readGAlignmentsFromBam function from the Rsamtools package was introduced and used to extract a GAlignments object.
However, most of the time, an experiment will NOT consist of a single sample (of only 2.5M reads!) and an obvious way to speed up the process is to parallelize. In the following three sections, we will see how to do this before ultimately discussing the pros and cons of the implemented method.

2.2 Creating GAlignments objects from BAM files

Exercise 1
First of all, locate the BAM files and implement a function to read them sequentially. Have a look at the lapply function man page for doing so.
Solution:
> library(Rsamtools)
> bamfiles <- dir(system.file("bigdata","bam",package="EMBO2013Day2"),
+                 pattern="\\.bam$",full.names=TRUE)
> gAlns <- lapply(bamfiles,readGAlignmentsFromBam)
Nothing complicated so far - if there is, speak up. We process both files sequentially and get a list of GAlignments objects stored in the gAlns object. Apart from the coding convenience - a single line processes all our samples - there is no other gain.

2.3 Processing the files in parallel
Modern laptop CPUs possess several cores that can perform tasks independently, commonly 2 to 4. Computational servers usually have many CPUs (commonly 8), each having several cores. An obvious enhancement to our previous solution is to take advantage of this CPU architecture and to process our samples in parallel.

Exercise 2
Have a look at the parallel package and in particular at the mclapply function to re-implement the previous function in a parallel manner.
Solution:
> library(parallel)
> gAlns <- mclapply(bamfiles,readGAlignmentsFromBam)

Exercise 3
Could you figure out how many cores were used in parallel when running the previous line? Can you explain why that was so?
Solution: It is NOT because there were 2 files to process. The mclapply function has a number of default parameters - see ?mclapply for details - including mc.cores, which defaults to 2. If you want to process more samples in parallel, set that parameter accordingly.
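As a minimal sketch of the mc.cores setting discussed in the solution above, the following runs a toy task on up to four workers. The slowSquare function and the cap of four workers are hypothetical stand-ins chosen for illustration; in the workshop setting the task would be the BAM reader and the input would be bamfiles.

```r
library(parallel)

# Number of cores reported by the operating system (may be NA on some platforms).
nCores <- detectCores()

# Hypothetical stand-in for a per-file task; replace with the real reader.
slowSquare <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Never ask for more workers than there are cores (illustrative cap of 4).
# mclapply relies on forking, so mc.cores > 1 is not supported on Windows.
workers <- if (.Platform$OS.type == "windows" || is.na(nCores)) 1L else min(4L, nCores)

# Same call shape as in the exercise, with mc.cores set explicitly.
res <- mclapply(1:4, slowSquare, mc.cores = workers)
```

Setting mc.cores explicitly, rather than relying on the default of 2, makes the degree of parallelism visible in the code and easy to adapt to the machine at hand.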
This new implementation has the obvious advantage of being X times faster (with X being the number of cores used, or almost so, as parallelization comes with a slight processing cost), but it puts a different strain on the system. As several files are being processed in parallel, the memory requirement also increases by a factor X (assuming the files are of roughly equal size). This might be fine on a computational server, but given the constant increase in sequencing reads produced per run, this will eventually become a limit.

Exercise 4
Can you think of a way this memory issue could be addressed? I.e., what could we modify in the way we read/process the file to limit the memory required at a given moment?
Solution: No, buying more memory is usually not an option. And anyway, at the moment, the number of reads sequenced per run grows faster than memory capacity doubles. So, let us just move to the next section and have a go at addressing the issue.

2.4 Processing the files one chunk at a time
To limit the memory required at any moment, one approach is to process the file not as a whole, but chunk-wise. As we can assume that reads are stored independently in BAM files (or almost so, think of how paired-end data is stored!), we can simply decide to parse, e.g., 1,000,000 reads at a time. This of course requires a new way to represent a BAM file in R, i.e. not just as a character string, as we have had until now in our bamfiles object.

Exercise 5
The Rsamtools package again comes in handy. Look up the ?BamFile help page and try to work out how we could take advantage of the BamFile or BamFileList classes for our purpose.
Solution: The yieldSize parameter of either class looks like exactly what we want. Let us recode our bamfiles character object into a BamFileList.
> bamFileList <- BamFileList(bamfiles,yieldSize=10^6)
Now that we have the BAM files described in a way that we can process them chunk-wise, let us do so.
The paradigm is as follows:
> open(bamFile)
> while(length(chunk <- readGAlignmentsFromBam(bamFile))){
+   message(length(chunk))
+ }
> close(bamFile)

Exercise 6
In the paradigm above, we process one BAM file chunk-wise and report the sizes of the chunks, i.e. these would be 1M reads - in our case - apart from the last one, which would be smaller than or equal to 1M (it is unlikely that a sequencing file contains an exact multiple of our chunk size). Now, try to implement the above paradigm in the function we implemented previously - see the solution in Section 2.3 - so as to process both our BAM files in parallel and chunk-wise.
Solution:
> gAlns <- mclapply(bamFileList,function(bamFile){
+   open(bamFile)
+   gAln <- GAlignments()
+   while(length(chunk <- readGAlignmentsFromBam(bamFile))){
+     gAln <- c(gAln,chunk)
+   }
+   close(bamFile)
+   return(gAln)
+ })

2.5 Pros and cons of the current solution

Exercise 7
Before reading my comments below, take the time to jot down what you think are the advantages and drawbacks of the method implemented above. My own comments below are certainly not exhaustive and I would be curious to hear any of yours that do not match mine.
Solution:

2.5.1 Pros
a. We have written a streamlined piece of code, using up-to-date functionalities from other packages. Hence, it is both easy to maintain and easy to update.
b. We have reduced the time consumption by a factor of 2, and it can be reduced further by using a computer with more CPUs, or even a compute farm - obviously only if we have more than 2 samples to process.
c. We have implemented the processing of the BAM files by chunk.

2.5.2 Cons
a. There is really only one big con: we have NOT addressed the memory requirement issue satisfactorily. We do process the BAM files by chunks, but then we simply aggregate these chunks without further processing, so we eventually end up using the same amount of memory.
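The memory argument made in this con can be illustrated with a toy sketch that uses base R only. It rests on loudly stated assumptions: a temporary text file stands in for a BAM file, readLines(con, n = chunkSize) stands in for a yieldSize-limited read, and counting lines stands in for any per-chunk summary. None of this is Rsamtools code.

```r
# Toy stand-in for a BAM file: one "read" per line (hypothetical data).
tmp <- tempfile()
writeLines(sprintf("read%05d", 1:2500), tmp)

chunkSize <- 1000L

# Variant (1): aggregate every chunk.
# The result grows with the file, so the memory footprint is that of the whole file.
con <- file(tmp, open = "r")
allReads <- character()
while (length(chunk <- readLines(con, n = chunkSize))) {
  allReads <- c(allReads, chunk)
}
close(con)

# Variant (2): reduce every chunk to a small summary (here, a count).
# Only one chunk is held in memory at any time.
con <- file(tmp, open = "r")
nReads <- 0L
while (length(chunk <- readLines(con, n = chunkSize))) {
  nReads <- nReads + length(chunk)
}
close(con)

unlink(tmp)
```

Both loops visit the same three chunks (1000, 1000 and 500 lines), but only variant (2) keeps the memory use bounded by the chunk size. The Bioconductor solution of Section 2.4 corresponds to variant (1).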
Aggregating the chunks is the best we can do so far with the Bioconductor functionalities introduced until now, so let us move to the next step in the pipeline, which will help us resolve that - see Chapter 4 if you are impatient. But first we should recap the usage of the Bioconductor packages for obtaining and manipulating sequencing read information in R, which is the next chapter's topic.

Chapter 3 Sequences and Short Reads
Most down-stream analysis of short read sequences is based on reads aligned to reference genomes. There are many aligners available, including BWA [13, 12], Bowtie2 [9], GSNAP [21], STAR [4], etc.; the merits of these are discussed in the literature. There are also alignment algorithms implemented in Bioconductor (e.g., matchPDict in the Biostrings package and the gmapR, Rbowtie, Rsubread packages); matchPDict is particularly useful for flexible alignment of moderately sized subsets of data.

3.1 Alignments and Bioconductor packages
The following sections introduce core tools for working with high-throughput sequence data; key packages for representing reads and alignments are summarized in Table 3.1. Moreover, yesterday Martin introduced resources for annotating sequences, which will come in handy in the next two chapters of this tutorial (Chapter 4 and Chapter 5).

Table 3.1: Selected Bioconductor packages for extracting and manipulating sequence reads alignments.
Package        Description
ShortRead      In addition to the functionalities described yesterday to manipulate raw read files (e.g. the ShortReadQ class and functions for manipulating fastq files), this package offers the possibility to load numerous HTS file formats. These are mostly sequencer manufacturer specific, e.g. sff for 454, or pre-BAM aligner proprietary formats, e.g. MAQ or bowtie. These functionalities rely heavily on Biostrings and somewhat on Rsamtools.
GenomicRanges  GAlignments and GAlignmentPairs store single- and paired-end aligned reads.
Rsamtools      Provides access to BAM alignment and other large sequence-related files.
rtracklayer    Input and output of bed, wig and similar files.

Exercise 8
Read the man page of the GAlignments and GAlignmentPairs classes and pay attention to the very important comments on multi-reads and paired-end processing.
Solution: Really just ?GAlignments. However, KEEP these details in mind, as they are essential and a likely source of erroneous conclusions. Remember the example from this morning's lecture about RNA editing.

3.1.1 The pasilla data set
As a running example, we use the pasilla data set, derived from [2]. The authors investigate conservation of RNA regulation between D. melanogaster and mammals. Part of their study used RNAi and RNA-seq to identify exons regulated by Pasilla (ps), the D. melanogaster ortholog of mammalian NOVA1 and NOVA2. Briefly, their experiment compared gene expression as measured by RNA-seq in S2-DRSC cells cultured with, or without, a 444bp dsRNA fragment corresponding to the ps mRNA sequence. Their assessment investigated differential exon use, but our worked example will focus on gene-level differences.
In the following sections, we look at a subset of the ps data, corresponding to reads obtained from lanes of their RNA-seq experiment, and aligned to a D. melanogaster reference genome. These are the same reads that were used yesterday for the demonstration of the raw read based functionalities of the ShortRead package. As a side note, reads were retrieved from GEO and the Short Read Archive (SRA), and were aligned to the D. melanogaster reference genome dm3 as described in the pasilla experiment data package.

3.1.2 Alignments and the ShortRead package
Yesterday, Martin introduced the ShortRead package to manipulate raw reads and to perform Quality Assessment (QA) on raw data files, e.g. fastq formatted files.
These are not the only functionalities of the ShortRead package, which also offers the possibility to read in alignment files in many different formats.

Exercise 9
Two files of the pasilla dataset have been aligned using bowtie [9]; locate them in the bigdata folder of the EMBO2013Day2 package.
Solution:
> bwtFiles <- dir(path=system.file("bigdata","bowtie",package="EMBO2013Day2"),
+                 pattern="\\.bwt$",full.names=TRUE)
As we will be accessing this bigdata folder frequently, we create a function called bigdata to do so more conveniently.
> library(EMBO2013Day2)
> bigdata <- function()
+   system.file("bigdata",package="EMBO2013Day2")

Exercise 10
Have a peek at one of the files and try to decipher its format. Hint: it is a tab delimited format, so check the read.delim function. As you may not want to read all the lines to get an idea, look up an appropriate argument for that.
Solution: You might want to check http://bowtie-bio.sourceforge.net/manual.shtml#default-bowtie-output to see whether your guesses were correct. Here is how to read 10 lines of the first file.
> read.delim(file=file.path(bigdata(),"bowtie","SRR074430.bwt"),
+            header=FALSE,nrows=10)

Exercise 11
Now, as was presented in the lecture, use the readAligned function to read in the bowtie alignment files.
Solution:
> alignedRead <- readAligned(dirPath=file.path(bigdata(),"bowtie"),
+                            pattern="\\.bwt$",type="Bowtie")

Exercise 12
What is peculiar about the returned object? Determine its class. Can you tell where the data from both input files are?
Solution: We obtained a single object of the AlignedRead class. The documentation, i.e. the Value section of ?readAligned, tells us that all files are concatenated into a single object with NO guarantee on the order in which files are read. This is convenient when we want to merge several sequencing runs of the same sample, but we need to be cautious and process independent samples by calling the readAligned function individually for every sample.
Exercise 13
Finally, taking another look at the lecture, select only the reads that align to chromosome 2L. Hint: use the appropriate SRFilter filter.
Solution:
> alignedRead2L <- readAligned(dirPath=file.path(bigdata(),"bowtie"),
+                              pattern="\\.bwt$",type="Bowtie",
+                              filter=chromosomeFilter("2L"))
This concludes the overview of the ShortRead package. As the BAM format has become a de-facto standard, you are less likely to use that package to process aligned reads in R than the Rsamtools package, which you will be using next.

3.1.3 Alignments and the Rsamtools package

Alignment formats
Most main-stream aligners produce output in SAM (text-based) or BAM format. A SAM file is a text file, with one line per aligned read, and fields separated by tabs. Here is an example of a single SAM line, split into fields.
> fl <- system.file("extdata", "ex1.sam", package="Rsamtools")
> strsplit(readLines(fl, 1), "\t")[[1]]
 [1] "B7_591:4:96:693:509"
 [2] "73"
 [3] "seq1"
 [4] "1"
 [5] "99"
 [6] "36M"
 [7] "*"
 [8] "0"
 [9] "0"
[10] "CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG"
[11] "<<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7"
[12] "MF:i:18"
[13] "Aq:i:73"
[14] "NM:i:0"
[15] "UQ:i:0"
[16] "H0:i:1"
[17] "H1:i:0"
Fields in a SAM file are summarized in Table 3.2. We recognize from the FASTQ file introduced yesterday the identifier string, read sequences and qualities. The alignment is to a chromosome 'seq1' starting at position 1. The strand of alignment is encoded in the 'flag' field. The alignment record also includes a measure of mapping quality, and a CIGAR string describing the nature of the alignment. In this case, the CIGAR is 36M, indicating that the alignment consisted of 36 Matches or mismatches, with no indels or gaps; indels are represented by I and D; gaps (e.g., from alignments spanning introns) by N.

Table 3.2: Fields in a SAM record. From http://samtools.sourceforge.net/samtools.shtml
Field  Name   Value
1      QNAME  Query (read) NAME
2      FLAG   Bitwise FLAG, e.g., strand of alignment
3      RNAME  Reference sequence NAME
4      POS    1-based leftmost POSition of sequence
5      MAPQ   MAPping Quality (Phred-scaled)
6      CIGAR  Extended CIGAR string
7      MRNM   Mate Reference sequence NaMe
8      MPOS   1-based Mate POSition
9      ISIZE  Inferred insert SIZE
10     SEQ    Query SEQuence on the reference strand
11     QUAL   Query QUALity
12+    OPT    OPTional fields, format TAG:VTYPE:VALUE

Note that mismatches are not represented in the CIGAR string but might be detailed in the additional attributes; this depends on the aligner used. BAM files encode the same information as SAM files, but in a format that is more efficiently parsed by software; BAM files are the primary way in which aligned reads are imported into R.

Aligned reads in R
As introduced - cf. Section 3.1 - there are three different packages to read alignments in R:
• ShortRead
• GenomicRanges
• Rsamtools
The last two will be described in more detail in the next paragraphs.

GenomicRanges
The readGAlignments function from the GenomicRanges package reads essential information from a BAM file into R. The result is an instance of the GAlignments class. The GAlignments class has been designed to allow useful manipulation of many reads (e.g., 20 million) under moderate memory requirements (e.g., 4 GB). Note that the readGAlignments function and the GAlignments class have replaced the readGappedAlignments function and the GappedAlignments class, respectively, from the previous releases of Bioconductor; this is the case as of version 2.13, released at the beginning of October this year.

Exercise 14
Use the readGAlignments function to read in the "ex1.bam" file that can be found in the "extdata" folder of the Rsamtools package.
Solution:
> alnFile <- system.file("extdata", "ex1.bam", package="Rsamtools")
> aln <- readGAlignments(alnFile)
> head(aln, 3)
The readGAlignments function takes an additional argument, param, allowing the user to specify regions of the BAM file (e.g., known gene coordinates) from which to extract alignments. A GAlignments instance is like a data frame, but with accessors as suggested by the column names. It is easy to query, e.g., the distribution of reads aligning to each strand, the width of reads, or the cigar strings.

Exercise 15
Summarize the strand, width and CIGAR information from that file.
Solution:
> table(strand(aln))
   +    -    *
1647 1624    0
> table(width(aln))
  30   31   32   33   34   35   36   38   40
   2   21    1    8   37 2804  285    1  112
> head(sort(table(cigar(aln)), decreasing=TRUE))
     35M      36M      40M      34M      33M 14M4I17M
    2804      283      112       37        6        4

Rsamtools
The Rsamtools readGAlignmentsFromBam function (introduced earlier, see Chapter 2), like the GenomicRanges readGAlignments function, only parses some of the fields of a BAM file, and that may not be appropriate for all uses. In these cases the scanBam function in Rsamtools provides greater flexibility. The idea is to view BAM files as a kind of database. Particular regions of interest can be selected, and the information in the selection restricted to particular fields. These operations are determined by the values of a ScanBamParam object, passed as the named param argument to scanBam.

Exercise 16
Consult the help page for ScanBamParam, and construct an object that restricts the information returned by a scanBam query to the aligned read DNA sequence. Your solution will use the what parameter to the ScanBamParam function. Use the ScanBamParam object to query a BAM file, and calculate the GC content of all aligned reads. Summarize the GC content as a histogram (Figure 3.1).
Solution:
> param <- ScanBamParam(what="seq")
> seqs <- scanBam(bamfiles[[1]], param=param)
> readGC <- gcFunction(seqs[[1]][["seq"]])
> hist(readGC)

Figure 3.1: GC content in aligned reads

Advanced Rsamtools usage
The Rsamtools package has more advanced functionalities:
1. functions to count, index, filter and sort BAM files
2. a function to access the header only
3. the possibility to access SAM attributes (tags)
4. functions to manipulate the CIGAR string
5. the creation of BAM libraries to represent a study set (BamViews)
6. ...

Exercise 17
Find the function that scans the BAM header and retrieve the header of the first BAM file in the bigdata() bam subfolder. What information does it contain?
Solution: It contains the reference sequence lengths and names as well as the name, version and command line of the tool used for performing the alignments.
> scanBamHeader(bamfiles[1])

Exercise 18
The SAM/BAM format contains a tag, "NH", that defines the total number of valid alignments reported for a read. How can you extract that information from the same first BAM file and plot it as a histogram?
Solution:
> param <- ScanBamParam(tag="NH")
> nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH
So it seems a majority of our reads have multiple alignments! Some processing might be required to deal with these; e.g. if reads were aligned to the transcriptome, there exist tools that can deconvolute the transcript-specific expression, for example MMSEQ [20] or BitSeq [6], the latter also existing as an R package: BitSeq. Otherwise, if reads were aligned to the genome, one should consider filtering these multiple alignments to avoid introducing artifactual noise in the subsequent analyses.

Exercise 19
The CIGAR string contains interesting information, in particular whether or not a given match contains indels. Using the first BAM file, can you get a matrix of all seven CIGAR operations? And find out the intron size distribution?
Solution:
> param <- ScanBamParam(what="cigar")
> cigars <- scanBam(bamfiles[[1]], param=param)[[1]]$cigar
> cigar.matrix <- cigarOpTable(cigars)
> intron.size <- cigar.matrix[,"N"]
> intron.size[intron.size>0]
> plot(density(intron.size[intron.size>0]))
> library(lattice) # provides histogram()
> histogram(log10(intron.size[intron.size>0]),xlab="intron size (log10 bp)")

Exercise 20
Look up the documentation for the BamViews class and, using the leeBamViews package, create a BamViews instance. Afterwards, use some of the accessors of that object to retrieve e.g. the file paths or the sample names.
Solution:
> library(leeBamViews)
> bpaths = dir(system.file("bam", package="leeBamViews"), full.names=TRUE, pattern="bam$")
> gt<-do.call(rbind,strsplit(basename(bpaths),"_"))[,1]
> geno<-substr(gt,1,nchar(gt)-1)
> lane<-substr(gt,nchar(gt),nchar(gt))
> pd = DataFrame(geno=geno, lane=lane, row.names=paste(geno,lane,sep="."))
> bs1 = BamViews(bamPaths=bpaths, bamSamples=pd,
+                bamExperiment=list(annotation="org.Sc.sgd.db"))
> bamPaths(bs1)
> bamSamples(bs1)

Exercise 21
Finally, extract the coverage for the locus 861250:863000 on chromosome "Scchr13" for every sample in the bs1 object.
Solution:
> sel <- GRanges(seqnames = "Scchr13", IRanges(start = 861250, end = 863000),strand="+")
> covex = RleList(lapply(bamPaths(bs1), function(x) coverage(readGAlignments(x))[[1]]))
This offers an interesting way to process multiple samples at the same time when you are interested in a particular locus.

3.1.4 Alignments and other Bioconductor packages
In the following, an excerpt of additional functionalities offered by Bioconductor packages is presented. It is far from being a complete overview, and as such only aims at giving a feel for what is out there.

Retrieving data using SRAdb
Most journals require the raw data to be deposited in a public repository, such as GEO, SRA or ENA. The SRAdb package offers the possibility to list the content of these archives, and to retrieve raw (fastq or sra) files.
Exercise 22
Using the pasilla package, retrieve the submission accession of that dataset (check out that package's vignette).
Solution:
> vignette(package="pasilla")
> vignette("create_objects")
> geo.acc <- "GEO: GSE18508"
As we only have the GEO ID so far, we need to convert it to an SRA ID. You can either use the GEO, SRA or ENA website for this or, if you are slightly familiar with SQL, just use the SRAdb package.

Exercise 23
Look into the SRAdb package vignette to figure out how to do this.
Solution: We access the vignette:
> library(SRAdb)
> vignette("SRAdb")
Reading it tells us that:
a. we need to download the SRAdb sqlfile
b. we need to create a connection to the locally downloaded database
c. we need to query that database with our submission alias, "GEO: GSE18508", to retrieve the SRA submission accession.
The first step requires the download of a large (280+ Mb) compressed file (the command to use for that download is getSRAdbFile), so to avoid the downloading time, connect to the file on the shared folder:
> sqlfile <- "replace-with-your-path-to-the-SRAmetadb.sqlite-file"
> sra_con <- dbConnect(SQLite(),sqlfile)
> sra.acc <- dbGetQuery(sra_con,paste("select submission_accession ",
+                                     "from submission ",
+                                     'where submission_alias = "',
+                                     geo.acc,'";',sep=""))
The retrieved sra.acc is: "SRA010243". Now that we have that accession, the vignette tells us how to get every experiment, sample, run, ... associated with this submission accession.

Exercise 24
There are at least two possibilities to do so, one using an SQL query and the other one using a function of the package. What would be that function?
Solution: For those who like SQL:
> run.acc <- dbGetQuery(sra_con,paste("select run_accession ",
+                                     "from run ",
+                                     'where submission_accession = "',
+                                     sra.acc,'";',sep=""))$run_accession
For those who like functions:
> sraConvert(sra.acc,sra_con=sra_con)
> run.acc <- sraConvert(sra.acc,"run",sra_con=sra_con)$run

Exercise 25
Now that we have the list of runs, it would be interesting to get more information about the corresponding fastq files.
Solution:
> info <- getFASTQinfo(run.acc,srcType="ftp")
And the final step would be to download the fastq file(s) of interest.

Exercise 26
Retrieve the smallest fastq file (the one with the fewest reads) from that particular submission.
Solution:
> getSRAfile(in_acc=info[which.min(info[,"run.read.count"]),"run"],
+            sra_con, destDir = getwd(),
+            fileType = 'fastq', srcType = 'ftp' )
Well, that's almost it. As we are tidy people, we clean up after ourselves.
> dbDisconnect( sra_con )

Demultiplexing using easyRNASeq
Note: This section does not apply to all datasets but only to multiplexed ones. Since the data we loaded so far into R was not multiplexed, we will use a different dataset here.
Nowadays, NGS machines produce so many reads (e.g. 40M for an Illumina GAIIx, 100M for an ABI SOLiD4 and 160M for an Illumina HiSeq) that the coverage obtained per lane for the transcriptome of organisms with small genomes is very high. Sometimes it is more valuable to sequence more samples at lower coverage than a single one at very high coverage, so techniques have been optimised for sequencing several samples in a single lane, using 4-6bp barcodes to uniquely identify the sample within the library [10]. This is called multiplexing, and one can on average sequence 12 yeast samples at 30X coverage in a single lane of an Illumina GenomeAnalyzer GAIIx (100bp reads, single end). This approach is very advantageous for researchers, especially in terms of cost, but it adds an additional layer of pre-processing that is not as trivial as one would think.
Extracting the barcodes would be fairly straightforward, were it not for the average 0.1-1 percent sequencing error rate, which introduces a lot of multiplicity in the actual barcodes present in the samples. A proper design of the barcodes, maximising the Hamming distance [8], is an essential step for proper de-multiplexing.
The data we loaded into R in the previous section was not multiplexed, so we now load a different FASTQ file in which the 4 different samples sequenced were identified by the barcodes "ATGGCT", "TTGCGA", "ACACTG" and "ACTAGC".
> reads <- readFastq(file.path(bigdata(),"multiplex","multiplex.fq.gz"))
> # filter out reads with more than 2 Ns
> filter <- nFilter(threshold=2)
> reads <- reads[filter(reads)]
> # access the read sequences
> seqs <- sread(reads)
> # this is the length of your adapters
> barcodeLength <- 6
> # get the first 6 bases of each read
> seqs <- narrow(seqs, end=barcodeLength)
> seqs
> length(table(as.character(seqs)))
So it seems we have 1953 barcodes instead of 4 ...

Exercise 27
Which barcode is most represented in this library? Plot the relative frequency of the top 20 barcodes.
Try:
• using the function table to count how many times each barcode occurs in the library; you cannot apply this function to seqs directly, you must first convert it to a character vector with the as.character function
• sorting the counts object you just created with the function sort; use decreasing=TRUE as an argument so that the elements are sorted from high to low (use sort( ..., decreasing=TRUE ))
• looking at the first element of your sorted counts object to find out which barcode is most represented
• finding out the relative frequency of each barcode by dividing your counts object by the total number of reads (the function sum might be useful)
• plotting the relative frequency of the top 20 barcodes by adapting these function calls:
> # set up larger margins for the plot so we can read the barcode names
> par(mar=c(5, 5, 4, 2))
> barplot(..., horiz=TRUE, las=1, col="orange" )
Solution:
> barcount = sort(table(as.character(seqs)), decreasing=TRUE)
> barcount[1:10] # TTGCGA
> barcount = barcount/sum(barcount)
> par( mar=c(5, 5, 4, 2))
> barplot(barcount[1:20], horiz=TRUE, las=1, col="orange" )

Exercise 28
The designed barcodes ("ATGGCT", "TTGCGA", "ACACTG" and "ACTAGC") seem to be equally distributed; what is the percentage of reads that cannot be assigned to a barcode?
Solution:

> signif((1-sum(barcount[1:4]))*100,digits=2) # ~6.4%

We will now iterate over the 4 barcodes, split the reads between them and save a new fastq file for each:

> barcodes = c("ATGGCT", "TTGCGA", "ACACTG", "ACTAGC")
> # iterate through each of these 4 barcodes and write
> # output to fastq files
> for(barcode in barcodes) {
+   seqs <- sread(reads)    # get sequence list
+   qual <- quality(reads)  # get quality score list
+   qual <- quality(qual)   # strip quality score type
+   mismatchVector <- 0     # allow no mismatches
+
+   # trim sequences looking for a 5' pattern
+   # gets IRanges object with trimmed coordinates
+   trimCoords <- trimLRPatterns(Lpattern=barcode, subject=seqs,
+                                max.Lmismatch=mismatchVector, ranges=TRUE)
+
+   # generate trimmed ShortReadQ object
+   # use IRanges coordinates to trim sequences and quality scores
+   seqs <- DNAStringSet(seqs, start=start(trimCoords), end=end(trimCoords))
+   qual <- BStringSet(qual, start=start(trimCoords), end=end(trimCoords))
+   # reapply quality score type
+   qual <- SFastqQuality(qual)
+   # rebuild reads object with trimmed sequences and quality scores
+   trimmed <- ShortReadQ(sread=seqs, quality=qual, id=id(reads))
+
+   # keep only reads which trimmed the full barcode
+   trimmed <- trimmed[start(trimCoords) == barcodeLength + 1]
+
+   # write reads to Fastq file
+   outputFileName <- paste(barcode, ".fq", sep="")
+   # export filtered and trimmed reads to fastq file
+   writeFastq(trimmed, outputFileName)
+   print(paste("wrote", length(trimmed), "reads to file", outputFileName))
+ }

You should have four new FASTQ files: ACACTG.fq, ACTAGC.fq, ATGGCT.fq and TTGCGA.fq, containing the reads (with the barcodes trimmed) that correspond to each multiplexed sample. The next step would be to align these reads against your reference genome.

Aligning reads using Rsubread

Note that since last week and the latest release of Bioconductor, i.e. version 2.13, I have encountered weird errors using Rsubread on Mac OSX 10.6.8.
If that occurs on the course machines too, read the code and feel free to ask me any questions.

> library(Rsubread)
> library(BSgenome.Dmelanogaster.UCSC.dm3)
> chr4 <- DNAStringSet(unmasked(Dmelanogaster[["chr4"]]))
> names(chr4) <- "chr4"
> writeXStringSet(chr4,file="dm3-chr4.fa")
> ## create the indexes
> dir.create("indexes")
> buildindex(basename=file.path("indexes","dm3-chr4"),
+            reference="dm3-chr4.fa",memory=1000)
> ## align the reads
> sapply(dir(pattern="\\.fq$"),function(fil){
+   ## align
+   align(index=file.path("indexes","dm3-chr4"),
+         readfile1=sub("\\.fq$","",fil),
+         nsubreads=2,TH1=1,
+         output_file=sub("\\.fq$","\\.sam",fil)
+   )
+   ## create bam files
+   asBam(file=sub("\\.fq$","\\.sam",fil),
+         destination=sub("\\.fq$","",fil),
+         indexDestination=TRUE)
+ })

And that's it: you have filtered, demultiplexed and aligned your reads!

3.1.5 Resources

There are extensive vignettes for the Biostrings and GenomicRanges packages. A useful on-line resource is from Thomas Girke's group.

Chapter 4 Interlude

Now that we have seen the GenomicRanges functionalities to find, count or summarize overlaps between reads and annotations, we can refine our preferred function. We had left it as:

> gAlns <- mclapply(bamFileList,function(bamFile){
+   open(bamFile)
+   gAln <- GAlignments()
+   while(length(chunk <- readGAlignmentsFromBam(bamFile))){
+     gAln <- c(gAln,chunk)
+   }
+   close(bamFile)
+   return(gAln)
+ })

Exercise 29 Using the synthetic transcript annotation prepared during the lecture, dmel_synthetic_transcript_r5-52.rda, implement the count by chunks.
Solution:

> load("~/Day2/dmel_synthetic_transcript_r5-52.rda")
> count.list <- mclapply(bamFileList,function(bamFile){
+   open(bamFile)
+   counts <- vector(mode="integer",length=length(annot))
+   while(length(chunk <- readGAlignmentsFromBam(bamFile))){
+     counts <- counts + assays(summarizeOverlaps(annot,chunk,mode="Union"))$counts
+   }
+   close(bamFile)
+   return(counts)
+ })

This gives us a list of counts per sample; to get a count matrix do:

> count.table <- do.call("cbind",count.list)
> head(count.table)
              reads reads
FBgn0000008.0     1     3
FBgn0000014.0     0     0
FBgn0000015.0     0     0
FBgn0000017.0   107   110
FBgn0000018.0     5    11
FBgn0000022.0     0     0

Such a count.table object is the minimal input that downstream analysis software (e.g. DESeq2, edgeR, etc.) uses. A function similar to this one is probably all you'll need to process your reads and get a count table from a standard Illumina-based RNA-Seq experiment. However, you might want more flexibility for your projects, and Bioconductor certainly offers the possibility to do that; examples are given in the next chapter.

Chapter 5 Estimating Expression over Genes and Exons

This chapter¹ describes part of an RNA-Seq analysis use-case. RNA-Seq [14] was introduced as a new method to perform Gene Expression Analysis, using the advantages of the high throughput of Next-Generation Sequencing (NGS) machines.

5.1 Counting reads over known genes and exons

The goal of this use-case is to generate a count table for the selected genic features of interest, i.e. exons, transcripts, gene models, etc. To achieve this, we need to take advantage of all the steps performed previously in the workshop Day1 and Day2.

1. the alignment information has to be retrieved
2. the corresponding annotation needs to be fetched and possibly adapted, e.g. as was done in the preceding lecture
3. the read coverage per genic feature of interest has to be determined

Exercise 30 Can you associate at least one Bioconductor package with each of these tasks?
Solution: There are numerous choices; as an example, in the following we will go for this set of packages:

a. Rsamtools
b. genomeIntervals - this was already done during the lecture
c. GenomicRanges

5.1.1 The alignments

This was introduced in section 3.1.3, page 9. In this section we will import the data using the Rsamtools readGAlignmentsFromBam function. This will create a GAlignments object that contains only the reads that aligned to the genome.

Exercise 31 Using what was introduced in section 3.1.3, read in the first bam file from the bigdata() bam folder. Remember that the protocol used was not strand-specific.

¹ The author wants to thank Ângela Gonçalves for parts of the present chapter.

Figure 5.1: Overlap modes; image from the HTSeq package developed by Simon Anders.

Solution: First we scan the bam directory:

> bamfiles <- dir(file.path(bigdata(), "bam"), "\\.bam$", full.names=TRUE)
> names(bamfiles) <- sub("_.*", "", basename(bamfiles))

Then we read the first file:

> aln <- readGAlignments(bamfiles[1])
> strand(aln) <- "*"

As we have seen, many of these reads actually align to multiple locations. In a first basic analysis, i.e. to get a feel for the data, such reads could be ignored.

Exercise 32 Filter the multiple alignment reads. Think of the "NH" tag.

Solution:

> param <- ScanBamParam(tag="NH")
> nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH
> aln <- aln[nhs==1,]

Now that we have the alignments (aln object) and the synthetic transcript annotation (annot object; the one from the lecture, the same one used in the Interlude, chapter 4, page 18), we can quantify gene expression by counting reads over all exons of a gene and summing them together. One thing to keep in mind is that special care must be taken in dealing with reads that overlap more than one feature (e.g. overlapping genes, isoforms), and thus might be counted several times in different features.
To deal with this we can use any of the approaches summarised in Figure 5.1. The GenomicRanges summarizeOverlaps function offers different possibilities to summarize reads per feature:

> load("~/Day2/dmel_synthetic_transcript_r5-52.rda")
> counts1 <- summarizeOverlaps(annot, aln, mode="Union")
> counts2 <- summarizeOverlaps(annot, aln, mode="IntersectionStrict")
> counts3 <- summarizeOverlaps(annot, aln, mode="IntersectionNotEmpty")

Exercise 33 Create a data.frame or a matrix of the results above and figure out if any differences can be observed. E.g. check for differences in the row standard deviation (using the apply and sd functions).

Solution:

> synthTrxCountsTable <- data.frame(
+   assays(counts1)$counts,
+   assays(counts2)$counts,
+   assays(counts3)$counts)
> colnames(synthTrxCountsTable) <- c("union","intStrict","intNotEmpty")
> rownames(synthTrxCountsTable) <- rownames(counts1)
> sds <- apply(synthTrxCountsTable,1,sd)
> sum(sds!=0)
> sum(sds!=0)/length(sds)
> synthTrxCountsTable[which.max(sds),]
> annot[which.max(sds),]

So it appears that we have 3,872 cases where these counting modes generate different results (28% of the total!!), and that the synthetic transcript "FBgn0003942.0" shows the largest difference. For a detailed analysis, it would be important to adequately choose one of the intersection modes above; however, for the remainder of this section, we will use the "union" set. As before for reads aligning to multiple places in the genome, choosing to take the union when reads overlap several features is a simplification we may not want to make. There are several methods that probabilistically estimate the expression of overlapping features [11, 19, 20].

This concludes the section on counting reads per known features. In the next section, we will look at how novel transcribed regions could be identified.
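To make the three counting modes compared in this section more concrete, here is a minimal base R sketch. It does not call summarizeOverlaps; the feature coordinates, the helper names featuresPerBase and assignRead, and the three toy reads are all hypothetical, and the per-base logic is only a reading of the HTSeq scheme of Figure 5.1 for a single ungapped, unstranded read.

```r
# Toy annotation (hypothetical coordinates): two genes overlapping on 180-200
features <- data.frame(id = c("geneA", "geneB"),
                       start = c(100, 180),
                       end = c(200, 300),
                       stringsAsFactors = FALSE)

# For every base of a read, list the ids of the features covering that base
featuresPerBase <- function(readStart, readEnd, features) {
  lapply(seq(readStart, readEnd), function(pos) {
    features$id[features$start <= pos & features$end >= pos]
  })
}

# Assign one read to a feature, or to NA when it is ambiguous or unassigned
assignRead <- function(readStart, readEnd, features,
                       mode = c("Union", "IntersectionStrict",
                                "IntersectionNotEmpty")) {
  mode <- match.arg(mode)
  perBase <- featuresPerBase(readStart, readEnd, features)
  hit <- switch(mode,
    # every feature touched by at least one base of the read
    Union = unique(unlist(perBase)),
    # only features covering every base of the read
    IntersectionStrict = Reduce(intersect, perBase),
    # same, but bases falling outside any feature are ignored
    IntersectionNotEmpty = {
      nonEmpty <- perBase[lengths(perBase) > 0]
      if (length(nonEmpty)) Reduce(intersect, nonEmpty) else character(0)
    })
  # a read is counted only when exactly one feature remains
  if (length(hit) == 1) hit else NA_character_
}

# Three hypothetical reads given as c(start, end)
reads <- list(inside = c(120, 150),      # fully inside geneA
              straddling = c(170, 190),  # enters the geneA/geneB overlap
              overhang = c(290, 310))    # runs past the end of geneB
modes <- c("Union", "IntersectionStrict", "IntersectionNotEmpty")
assigned <- sapply(modes, function(m)
  sapply(reads, function(r) assignRead(r[1], r[2], features, m)))
print(assigned)
```

With these toy coordinates, the read that enters the geneA/geneB overlap is left unassigned by Union but attributed to geneA by both intersection modes, while the read overhanging the end of geneB is dropped only by IntersectionStrict. This is the kind of disagreement that the row standard deviations of Exercise 33 pick up.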
5.2 Discovering novel transcribed regions

One main advantage of RNA-seq experiments over microarrays is that they can be used to identify any transcribed molecule, including unknown transcripts and isoforms, as well as other regulatory transcribed elements. To identify such new elements, several methods are available to recreate and annotate transcripts, e.g. Cufflinks [19], Oases [17] and Trinity [7], to mention some of them. We can use Bioconductor tools as well, to identify loci and quantify counts without prior annotation knowledge. The example here is very crude and is really just a proof of concept of what one could do in a few commands, i.e. R rules. Nonetheless, to make the results more precise, the reads have been realigned using STAR [4], a very fast and accurate aligner that uses the recent approach of Maximum Exact Matches (MEMs); see https://code.google.com/p/rna-star/ for more details. This MEM approach allows STAR to identify exon-exon junctions without prior knowledge, e.g. there is no need for an annotation gff.

To start, we re-read one of the sample alignments using the Rsamtools readGAlignmentsFromBam function.

> aln <- readGAlignmentsFromBam(
+   BamFile(file.path(bigdata(),"STAR","SRR074431.bam")))

Defining transcribed regions

The process begins with calculating the coverage, using the method from the GenomicRanges package:

> cover <- coverage(aln)
> cover
> # this object is compressed to save space. It is an Rle (run-length encoding)
> # we can look at a section of chromosome 3R, say between bp 1 and 1000,
> # which gives us the number of reads overlapping each of those bases
> as.vector(cover[["3R"]])[1:1000]

The coverage shows us how many reads overlap every single base in the genome. It is actually split per chromosome. The next step is to define "islands" of expression. These can be created using the slice function.
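Before running the Bioconductor calls on real data, the two ideas involved (per-base coverage stored as run lengths, and islands as maximal covered stretches) can be illustrated on a toy example. The following is only a base R sketch under stated assumptions: the four reads and the 60 bp chromosome are hypothetical, and base R's rle stands in for the Rle class returned by coverage, with which it shares the compression principle but not the API.

```r
# Hypothetical reads on a 60 bp toy chromosome (1-based, inclusive coordinates)
reads <- data.frame(start = c(5, 8, 10, 40), end = c(14, 17, 19, 49))
chromLength <- 60

# Per-base coverage: +1 at each read start, -1 just after each read end,
# then a cumulative sum over the chromosome
delta <- integer(chromLength + 1)
for (i in seq_len(nrow(reads))) {
  delta[reads$start[i]] <- delta[reads$start[i]] + 1L
  delta[reads$end[i] + 1] <- delta[reads$end[i] + 1] - 1L
}
cover <- cumsum(delta)[seq_len(chromLength)]

# Run-length encoding of the coverage: consecutive equal values are stored once
coverRle <- rle(cover)

# Islands: maximal runs of consecutive bases with a coverage of at least 1
covered <- rle(cover >= 1)
runEnd <- cumsum(covered$lengths)
runStart <- runEnd - covered$lengths + 1
islands <- data.frame(start = runStart[covered$values],
                      end = runEnd[covered$values])
islands$width <- islands$end - islands$start + 1
islands$maxHeight <- mapply(function(s, e) max(cover[s:e]),
                            islands$start, islands$end)
print(islands)

# Candidate selection in the same spirit as in this section, with toy thresholds
candidates <- islands[islands$maxHeight >= 2 & islands$width >= 12, ]
```

Here the first three reads pile up into one island and the fourth read forms a second, shallow one; only the first island passes the toy thresholds. The thresholds 2 and 12 are arbitrary values chosen for this small example, not the ones used on the real data.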
The peak height for the islands can be determined with the viewMaxs function and the island widths can be found using the width function:

> islands <- slice(cover, 1)
> islandPeakHeight <- viewMaxs(islands)
> islandWidth <- width(islands)

While some more sophisticated approaches can be used to find exons de novo, we can use a simple approach whereby we select islands whose maximum peak height is 2 or more and whose width is 114 bp (150% of the read size) or more to be candidate exons. The elementLengths function shows how many of these candidate exons appear on each chromosome:

> candidateExons <- islands[islandPeakHeight >= 2L & islandWidth >=114L]
> candidateExons[["3R"]]

Remember that we used an aligner which is capable of mapping reads across splice junctions in the genome.

> sum(cigarOpTable(cigar(aln))[,"N"] > 0)
[1] 99677

There are 99,677 reads that span exon-exon junctions (EEJs). Let's look up such potential EEJs:

> aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "3R",]
GAlignments with 22070 alignments and 0 metadata columns:
          seqnames strand       cigar    qwidth     start       end     width      ngap
             <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
      [1]       3R          58M68N18M        76       452       595       144         1
      [2]       3R      +   40M75N36M        76     20556     20706       151         1
      [3]       3R      +   39M75N37M        76     20557     20707       151         1
      [4]       3R      +   38M75N38M        76     20558     20708       151         1
      [5]       3R          69M174N7M        76     23216     23465       250         1
      ...      ...    ...         ...       ...       ...       ...       ...       ...
  [22066]       3R      - 43M7183N33M        76  27884825  27892083      7259         1
  [22067]       3R      - 29M7183N47M        76  27884839  27892097      7259         1
  [22068]       3R      - 29M7183N47M        76  27884839  27892097      7259         1
  [22069]       3R      + 17M7183N59M        76  27884851  27892109      7259         1
  [22070]       3R      + 15M7183N61M        76  27884853  27892111      7259         1
  ---
  seqlengths:
     YHet ...   XHet
   347038 ... 204112

> aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "3R",][2:10,]
GAlignments with 9 alignments and 0 metadata columns:
      seqnames strand       cigar    qwidth     start       end     width      ngap
         <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
  [1]       3R      +   40M75N36M        76     20556     20706       151         1
  [2]       3R      +   39M75N37M        76     20557     20707       151         1
  [3]       3R      +   38M75N38M        76     20558     20708       151         1
  [4]       3R          69M174N7M        76     23216     23465       250         1
  [5]       3R      +  63M174N13M        76     23222     23471       250         1
  [6]       3R      +  45M174N31M        76     23240     23489       250         1
  [7]       3R      +  41M174N35M        76     23244     23493       250         1
  [8]       3R      +  40M174N36M        76     23245     23494       250         1
  [9]       3R      +  32M174N44M        76     23253     23502       250         1
  ---
  seqlengths:
     YHet ...   XHet
   347038 ... 204112

There are 3 and 7 reads spanning what look like introns of 75 and 174 bp, respectively. Note that the GenomicRanges GAlignments class is aware of splice junctions. Have a look at the coverage for the first intron:

> cover[["3R"]][20556:20706]
integer-Rle of length 151 with 5 runs
  Lengths:  1  1 38 75 36
  Values :  1  2  3  0  3

Now, if we select a few of these EEJs, we can check whether we can identify a specific motif.

> splice.reads <- aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "3R",]
> read.start <- start(splice.reads)[c(2,5,16,30,37)]
> donor.pos <- read.start - 1 +
+   as.integer(sapply(strsplit(cigar(splice.reads)[c(2,5,16,30,37)],"M"),"[",1))
> acceptor.pos <- read.start - 1 +
+   sapply(
+     lapply(
+       lapply(strsplit(cigar(splice.reads)[c(2,5,16,30,37)],"M|N"),"[",1:2),
+       as.integer),
+     sum)

Now read the chromosome 3R sequence (actually just a subset of the first 30,000 bp):

> chr3R <- readDNAStringSet(file.path(bigdata(),"..",
+                                     "fasta",
+                                     "dmel-chromosome-3R-1-30000bp.fasta"))

Now locate the acceptor and donor sites, but think of the strand! Let's just look at the ones on the plus strand.
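As an aside, the CIGAR arithmetic behind donor.pos and acceptor.pos can be checked by hand on the reads printed above. The sketch below is plain base R and handles only CIGAR strings of the form xMyNzM, i.e. a single junction; the helper names parseCigar and junction are hypothetical and not part of any Bioconductor package.

```r
# Split a CIGAR string made of M (aligned) and N (skipped) operations
parseCigar <- function(cigar) {
  ops <- regmatches(cigar, gregexpr("[0-9]+[MN]", cigar))[[1]]
  data.frame(length = as.integer(sub("[MN]$", "", ops)),
             op = sub("^[0-9]+", "", ops),
             stringsAsFactors = FALSE)
}

# Positions flanking the junction of a read with exactly one skipped region
junction <- function(start, cigar) {
  ops <- parseCigar(cigar)
  stopifnot(identical(ops$op, c("M", "N", "M")))  # single junction only
  lastExonBase <- start - 1L + ops$length[1]      # same formula as donor.pos
  lastIntronBase <- lastExonBase + ops$length[2]  # same formula as acceptor.pos
  c(donor.pos = lastExonBase,
    acceptor.pos = lastIntronBase,
    read.end = lastIntronBase + ops$length[3])
}

# Second read of the 3R listing: start 20556, CIGAR 40M75N36M
j <- junction(20556L, "40M75N36M")
print(j)
```

For this read the sketch gives a donor position of 20595 and an acceptor position of 20670, so the skipped region runs from 20596 to 20670. That agrees with the run of 75 zero-coverage bases in cover[["3R"]][20556:20706] and with the alignment end of 20706 reported in the table.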
> sel <- as.logical(strand(splice.reads)[c(2,5,16,30,37)] == "+")
> plus.donor <- Views(subject=chr3R[[1]],start=donor.pos[sel]-8,
+                     end=donor.pos[sel]+11)
> plus.acceptor <- Views(subject=chr3R[[1]],start=acceptor.pos[sel]-10,
+                        end=acceptor.pos[sel]+9)

Let's see if there's a consensus in the sequences of 20 bp centered around the potential acceptor and donor sites. Note that you might have to install the seqLogo package.

> library(seqLogo)
> pwm <- makePWM(cbind(
+   alphabetByCycle(DNAStringSet(plus.donor))[c("A","C","G","T"),]/3,
+   alphabetByCycle(DNAStringSet(plus.acceptor))[c("A","C","G","T"),]/3)
+ )
> seqLogo(pwm)

Clearly the logo (Figure 5.2) is not exceptional, but from only 3 EEJs we can already see that the donor site at positions 10-11 is GT and the acceptor site at positions 30-31 is AG, i.e. the canonical sites. Moreover, we can see a relative enrichment for Ts in the intron sequence, a known phenomenon (or at least I want to see it because I know it must be there). Hence, using a de-novo approach complemented by additional criteria can prove very efficient.

Figure 5.2: GC content in aligned reads

This concludes the section on summarizing counts. As you could realize, juggling the different packages for manipulating the alignments and annotation requires some coding. To facilitate this, a number of "workflow" packages are available at Bioconductor. The next section gives a brief introduction to easyRNASeq (obviously, a biased selection . . . ).

5.3 Using easyRNASeq

Let us redo what was done in the previous section. Note that most of the RNAseq object slots are optional. However, it is advised to set them, especially the readLength and the organismName, to help keep a proper documentation of your analysis. The organismName slot is actually mandatory if you want to get genomic annotation using biomaRt. In that case, you need to provide the name as specified in the corresponding BSgenome package, i.e.
"Dmelanogaster" for the BSgenome.Dmelanogaster.UCSC.dm3 package.

> ## load the library
> library("easyRNASeq")
> count.table <- easyRNASeq(filesDirectory=dirname(bamfiles[1]),
+                           filenames=basename(bamfiles),
+                           organism="Dmelanogaster",
+                           readLength=76L,
+                           annotationMethod="gff",
+                           annotationFile=file.path("~/Day2",
+                             "dmel_synthetic_transcript_r5-52.gff3"),
+                           format="bam",
+                           gapped=TRUE,
+                           count="transcripts")
> head(count.table)
> dim(count.table)

That is all. In one command, you got the count table for your 2 samples!

Warnings

As you could see when running the previous example, warnings were emitted, and quite rightly so.

1. about the annotation: Although we have created synthetic transcripts (sometimes called gene models), the annotation we are using here is still redundant, as genes located on opposite strands overlap. Therefore counting reads using this annotation will result in counting some of the reads several times. As this can be a very significant source of error, all the examples here will emit this warning. The ideal solution is to further refine the annotation object so that it contains no overlapping features, or to validate the affected genes either in-silico (e.g. by looking at the raw read coverage in a genome browser) or by an appropriate wet-lab method.
2. about potential naming issues in the input file: It is (sadly) very frequent that sequencing facilities use different naming conventions for the chromosomes they report in the alignment files. It is therefore very frequent that the annotation provided to easyRNASeq uses different chromosome names than the alignment file. These warnings are there to inform you about this issue.

Details

The easyRNASeq function currently accepts the following annotationMethods:
• "biomaRt": use biomaRt to retrieve the annotation
• "env": use a RangedData or GRanges class object present in the environment
• "gff": read in a gff version 3 file
• "gtf": read in a gtf file
• "rda": load an RData object.
The object needs to be named gAnnot and be of class RangedData or GRanges.

The reads can be read in from BAM files or any format supported by ShortRead.

The reads can be summarized by:
• exons
• features (any features such as introns, enhancers, etc.)
• transcripts
• geneModels (a geneModel is the set of non-overlapping loci, i.e. synthetic exons, that represents all the possible exons and UTRs of a gene). Such geneModels are essential when counting reads as they ensure that no read will be accounted for several times. E.g., a gene can have different isoforms, using different exons or overlapping exons, in which case summarizing by exons might result in counting a read several times, once per overlapping exon. N.B. Assessing differential expression between transcripts based on synthetic exons has been possible since the release 2.11 of R, using the DEXSeq package available from Bioconductor.

Note that this geneModels approach is actually an older implementation of the one we have taken during the lecture to create the synthetic transcripts. The latter should be the preferred one. In the coming version of easyRNASeq, geneModels will be deprecated in favor of the synthetic transcripts generation approach.

The results can be exported in five different formats:
• a count table (the default, an n (features) x m (samples) matrix).
• a DESeq [1] countDataSet class object. Useful to perform further analyses using the DESeq package.
• an edgeR [16] DGEList class object. Useful to perform further analyses using the edgeR package.
• an RNAseq class object. Useful for performing additional pre-processing without re-loading the reads and annotations.
• a SummarizedExperiment class object. This should be the output of choice and will be made the default in the easyRNASeq version to be released with Bioconductor version 2.14 next spring.
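The geneModels idea described above, collapsing the exons of all isoforms of a gene into non-overlapping loci so that a read cannot be counted once per overlapping exon, can be sketched in a few lines of base R. This is an illustration only and not how easyRNASeq implements it: the exon coordinates and the helper names mergeIntervals and countHits are hypothetical, and only strictly overlapping exons are merged.

```r
# Hypothetical exons of one gene, pooled over its isoforms
exons <- data.frame(start = c(100, 150, 400, 420, 700),
                    end   = c(200, 250, 500, 450, 800))

# Merge overlapping intervals into non-overlapping loci
mergeIntervals <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  # furthest position reached by any interval seen so far
  reach <- cummax(end)
  # a new locus begins when an interval starts beyond that position
  newLocus <- c(TRUE, start[-1] > reach[-length(reach)])
  locus <- cumsum(newLocus)
  data.frame(start = as.vector(tapply(start, locus, min)),
             end = as.vector(tapply(end, locus, max)))
}

loci <- mergeIntervals(exons$start, exons$end)
print(loci)

# Number of ranges overlapped by a read spanning readStart..readEnd
countHits <- function(readStart, readEnd, ranges) {
  sum(ranges$start <= readEnd & ranges$end >= readStart)
}

hitsExons <- countHits(160, 190, exons)  # counted once per overlapping exon
hitsLoci <- countHits(160, 190, loci)    # counted once per merged locus
```

In this toy gene the five exons collapse into three loci, and a read at 160-190 that would be counted twice when summarizing by exons is counted once against the merged loci.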
The obtained results can optionally be corrected as Reads per Kilobase of feature per Million reads in the library (RPKM, [14]) or normalized using the DESeq or edgeR packages. For more details and a complete overview of the easyRNASeq package capabilities, have a look at the easyRNASeq vignette.

> vignette("easyRNASeq")

Exercise 34 From the same input files and annotations, generate an object of class SummarizedExperiment.

Solution: Note that a recent change to the Bioconductor API has affected the following functionality. I only realized it yesterday, so I have not had the time to devise a fix, but I will do so asap. Sorry. Well, anyway, it should be the end of the day when you reach this point, so I hope you will not mind too much, especially since it is not something essential.

> sumExp <- easyRNASeq(filesDirectory=dirname(bamfiles[1]),
+                      filenames=basename(bamfiles),
+                      organism="Dmelanogaster",
+                      readLength=76L,
+                      annotationMethod="gff",
+                      annotationFile=file.path("~/Day2",
+                        "dmel_synthetic_transcript_r5-52.gff3"),
+                      format="bam",
+                      gapped=TRUE,
+                      count="transcripts",
+                      outputFormat="SummarizedExperiment")

See the GenomicRanges package SummarizedExperiment class for more details on the three accessors used in the following.

> ## the counts
> assays(sumExp)
> ## the sample info
> colData(sumExp)
> ## the 'features' info
> rowData(sumExp)

Caveats

easyRNASeq is still under active development and as such the current version still lacks some essential data processing (e.g. strand specific sequencing is not yet supported). The new version to be released with Bioconductor 2.13, in early October this year, fills in these gaps:

1. The easyRNASeq function used above actually gets deprecated in favor of the simpleRNASeq function, which takes advantage of numerous core Bioconductor packages, e.g. the use of BamFile and BamFileList objects from the Rsamtools package to locate and access BAM formatted files.
2.
Secondly, it benefits from the RNA-Seq field standardization in the sense that the number of necessary arguments to be provided by default has plummeted. It benefits as well from a refactoring of how these arguments are provided; they are indeed abstracted and combined in a way similar to the ScanBamParam parameter of the Rsamtools package scanBam.
3. Then, its performance (e.g. memory management) has been optimized through parallelization.
4. In addition, advanced checks are conducted on the data provided by the user to ensure the overall process suitability. More comprehensive warnings or errors are thrown, should it be necessary.
5. The concerns raised by the analysis reported at https://stat.ethz.ch/pipermail/bioc-devel/2012-September/003608.html by Robinson et al. have been addressed too. Both the original easyRNASeq method and the GenomicRanges approach are provided, the latter being the default.
6. And last but not least, it provides access to the latest tools for differential expression analysis such as DESeq2 and DEXSeq. Planned is an integration of limma to enable the voom+limma paradigm. Ideally, easyRNASeq would select the most appropriate analysis to be conducted based on the report by Soneson and Delorenzi [18].

5.4 Where to from here

After obtaining the count table, numerous downstream analyses are available. Most often, such count tables are generated in a differential expression experimental setup. In that case, packages such as DESeq, DEXSeq, edgeR, limma (see voom+limma in the limma vignette), etc. are some of the possibilities available in Bioconductor. Have a look at [3] and [18] to decide which tool/approach is best suited for your experimental design. But, of course, counts can as well be used for other purposes such as visualization, using e.g. the rtracklayer and Gviz packages.
Actually, there's no real limitation to what one can achieve with a count table and it does not need to be an RNA-Seq experiment; look at the DiffBind package for an example of using ChIP-Seq data for differential binding analyses.

Acknowledgments

1. Thanks to the Workshop organizers, in particular Gabriella Rustici.
2. Thanks to the other lecturers, it is always fun around you.
3. Finally, thanks to you the reader - whatever the support you're reading this on - for having made it that far.

Bibliography

[1] S. Anders and W. Huber. Differential expression analysis for sequence count data. Genome Biology, 11:R106, 2010.
[2] A. N. Brooks, L. Yang, M. O. Duff, K. D. Hansen, J. W. Park, S. Dudoit, S. E. Brenner, and B. R. Graveley. Conservation of an RNA regulatory map between Drosophila and mammals. Genome Research, pages 193–202, 2011.
[3] M.-A. Dillies, A. Rau, J. Aubert, C. Hennequet-Antier, M. Jeanmougin, N. Servant, C. Keime, G. Marot, D. Castel, J. Estelle, G. Guernec, B. Jagla, L. Jouneau, D. Laloë, C. L. Gall, B. Schaëffer, S. L. Crom, M. Guedj, F. Jaffrézic, and on behalf of The French StatOmique Consortium. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinformatics, Sep 2012.
[4] A. Dobin, C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, and T. R. Gingeras. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, Jan 2013.
[5] R. C. Gentleman et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 5(10):R80, 2004.
[6] P. Glaus, A. Honkela, and M. Rattray. Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics, 28(13):1721–1728, 2012.
[7] M. G. Grabherr, B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson, I. Amit, X. Adiconis, L. Fan, R. Raychowdhury, Q. Zeng, Z. Chen, E. Mauceli, N. Hacohen, A. Gnirke, N. Rhind, F. D. Palma, B. W. Birren, C. Nusbaum, K. Lindblad-Toh, N. Friedman, and A. Regev. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat Biotechnol, 29(7):644–652, May 2011.
[8] R. W. Hamming. Error detecting and error correcting codes. The Bell System Technical Journal, XXIX(2):1–14, Nov 1950.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10:R25, 2009.
[10] P. Lefrançois, G. M. Euskirchen, R. K. Auerbach, J. Rozowsky, T. Gibson, C. M. Yellman, M. Gerstein, and M. Snyder. Efficient yeast ChIP-seq using multiplex short-read DNA sequencing. BMC Genomics, 10(1):37, Jan 2009.
[11] B. Li, V. Ruotti, R. M. Stewart, J. A. Thomson, and C. N. Dewey. RNA-seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4):493–500, Feb 2010.
[12] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25:1754–1760, Jul 2009.
[13] H. Li and R. Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26:589–595, Mar 2010.
[14] A. Mortazavi et al. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods, 5(7):621–8, Jul 2008.
[15] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2009. ISBN 3-900051-07-0.
[16] M. D. Robinson, D. J. McCarthy, and G. K. Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26:139–140, Jan 2010.
[17] M. H. Schulz, D. R. Zerbino, M. Vingron, and E. Birney. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics, 28(8):1086–92, Apr 2012.
[18] C. Soneson and M. Delorenzi. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics, 14:91, Jan 2013.
[19] C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, and L. Pachter. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol, 28(5):511–5, May 2010.
[20] E. Turro, S.-Y. Su, Â. Gonçalves, L. J. M. Coin, S. Richardson, and A. Lewin. Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biol, 12(2):R13, Jan 2011.
[21] T. D. Wu and C. K. Watanabe. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics, 21(9):1859–75, May 2005.