From basic data types to complex objects - GenomicRanges

Cremaschi Paolo
Centro d’Analisi Bioinformatiche per la Genomica (CABGEN), Pavia
Bioinformatic course 2014
Acknowledgments
The course was created by:
Luigi Marchionni
Department of Oncology
Johns Hopkins School of Medicine
Additional material was provided by:
Silvia Parolo
CABGEN
IGM-CNR Pavia
Additional thanks to:
• Kasper Hansen
• Michael Lawrence
Goals
• Introduce the use of R packages
• Use complex R objects: GenomicRanges
• Perform basic operations with GenomicRanges
• Plot GenomicRanges
What are genomic ranges?
Nearly everything in a genome browser is a
genomic range
Genomic Range
• The genomic range
– Represents genomic features, like genes and alignments
– Indexes into genomic vectors, like sequence and coverage
– Links summaries, like reads per kilobase per million of total
reads (RPKMs), to genomic locations
• The genome acts as a scaffold for data integration
• Ranges have a specialized structure and algebra,
requiring specialized data types and algorithms
Genomic range file formats
Genome browsers (UCSC, for example) use two
basic formats for genomic data: BED and
WIG (Wiggle).
Binary (not human readable), indexed versions of
these formats exist for large datasets: bigBed and
bigWig.
See: http://genome.ucsc.edu/FAQ/FAQformat.html
BED format
• A collection of genomic intervals, 3 required columns:
– chrom, chromStart, chromEnd
• Nine additional optional columns, for display purposes:
– name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts
• The block columns describe interval sub-divisions (e.g., exons)
See: http://genome.ucsc.edu/FAQ/FAQformat.html#format1
[Figure: example of a BED file]
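As a rough illustration of the required columns, the following Python sketch parses one BED line. The sample line and the helper names `parse_bed_line` and `bed_to_one_based` are made up for illustration. BED coordinates are 0-based and half-open, so the sketch also shows the conversion to the 1-based, closed convention used by IRanges and GenomicRanges.

```python
def parse_bed_line(line):
    """Parse up to the first six columns of a tab-separated BED line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "chrom": fields[0],
        "chromStart": int(fields[1]),  # 0-based, inclusive
        "chromEnd": int(fields[2]),    # 0-based, exclusive
    }
    # Optional columns are stored only if the line is long enough.
    for name, idx in (("name", 3), ("score", 4), ("strand", 5)):
        if len(fields) > idx:
            record[name] = fields[idx]
    return record


def bed_to_one_based(record):
    """Convert a BED interval to a 1-based closed (start, end) pair."""
    return record["chromStart"] + 1, record["chromEnd"]


# Made-up example line, not taken from a real dataset.
example = "chr1\t999\t2000\tfeatureA\t0\t+"
rec = parse_bed_line(example)
start, end = bed_to_one_based(rec)
print(rec["chrom"], start, end, end - start + 1)  # chr1 1000 2000 1001
```

Note that the width of a BED interval is simply chromEnd - chromStart, which equals end - start + 1 after the conversion.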
WIGGLE format
• Wiggle format (WIG) allows the display of continuous-valued data
in a track format
• A vector of numbers associated with genomic locations at a single
base (e.g., coverage)
• For speed and efficiency, wiggle data is compressed (unlike in the
“bedGraph” format)
• Exists in two distinct formats:
– variableStep for data with irregular intervals
– fixedStep for data with regular intervals
See: http://genome.ucsc.edu/goldenPath/help/wiggle.html
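To make the fixedStep layout concrete, here is a minimal Python sketch that expands one fixedStep block into (chrom, start, end, value) tuples. The sample data and the function name `expand_fixed_step` are illustrative assumptions; the sketch handles a single declaration line only and ignores track lines and variableStep blocks. WIG positions are 1-based.

```python
def expand_fixed_step(lines):
    """Expand one fixedStep block into (chrom, start, end, value) tuples."""
    header = lines[0].split()
    if header[0] != "fixedStep":
        raise ValueError("expected a fixedStep declaration line")
    params = dict(item.split("=", 1) for item in header[1:])
    chrom = params["chrom"]
    start = int(params["start"])       # 1-based position of the first value
    step = int(params["step"])         # distance between consecutive values
    span = int(params.get("span", 1))  # number of bases each value covers
    rows = []
    for i, raw in enumerate(lines[1:]):
        pos = start + i * step
        rows.append((chrom, pos, pos + span - 1, float(raw)))
    return rows


# Made-up data for illustration.
block = ["fixedStep chrom=chr3 start=101 step=100 span=5", "11", "22", "33"]
for row in expand_fixed_step(block):
    print(row)
```

Because the positions are implied by start and step, a fixedStep file stores only one number per data point, which is why it suits regularly spaced data.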
Available Tools
• UCSC tools for BED, bigBed, WIG, and bigWig formats:
– bedToBigBed
– wigToBigWig
– bedGraphToBigWig
• To be used on the command line:
– BEDtools (popular and mature)
– WiggleTools (also for bigWig and bigBed)
• In R from Bioconductor:
– IRanges
– GenomicRanges
– GenomicFeatures
Bioconductor Packages
• The IRanges, GenomicRanges, and GenomicFeatures packages
provide all the infrastructure needed to deal with genomic features
and ranges in R
• At first, these packages can appear daunting, with many classes and
methods… However, they also contain very extensive and useful
vignettes: read them!
Lawrence M et al, PLoS Comput Biol. 2013;9(8):e1003118.
IRanges package
“The IRanges package is designed to represent sequences, ranges
representing indices along those sequences, and data related to those
ranges”
• A sequence is an ordered and finite collection of elements, such as
a vector of integers – not necessarily only for nucleic acid
sequences
• Consecutive indices can be represented as a range to save memory
and computation: for example, instead of saving c(1,2,3,4,5,6,7), just
save 1 and 7, the start and end indices
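The idea of storing only the start and end of each run can be sketched in a few lines of plain Python. This is an illustration of the concept, not the IRanges implementation; the function name `compress_to_ranges` is hypothetical and the input is assumed to be sorted and free of duplicates.

```python
def compress_to_ranges(indices):
    """Collapse sorted integer indices into (start, end) runs of consecutive values."""
    ranges = []
    for i in indices:
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], i)  # extend the current run
        else:
            ranges.append((i, i))            # start a new run
    return ranges


print(compress_to_ranges([1, 2, 3, 4, 5, 6, 7]))   # [(1, 7)]
print(compress_to_ranges([1, 2, 3, 10, 11, 20]))   # [(1, 3), (10, 11), (20, 20)]
```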
IRanges package
A basic IRanges instance is the minimal representation of a range
(start, end, width). It can be displayed as:
- a table
- a graph
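Start, end, and width are redundant: with 1-based closed ranges, width = end - start + 1, so any two values determine the third. The following Python sketch illustrates that relationship; `make_range` is a hypothetical helper, not the IRanges constructor.

```python
def make_range(start=None, end=None, width=None):
    """Return (start, end, width) given at least two of the three values."""
    if sum(v is not None for v in (start, end, width)) < 2:
        raise ValueError("supply at least two of start, end and width")
    if start is None:
        start = end - width + 1
    if end is None:
        end = start + width - 1
    if width is None:
        width = end - start + 1
    # When all three are supplied they must agree with each other.
    if width < 0 or width != end - start + 1:
        raise ValueError("start, end and width are inconsistent")
    return start, end, width


print(make_range(start=1, end=7))    # (1, 7, 7)
print(make_range(start=5, width=3))  # (5, 7, 3)
print(make_range(end=10, width=4))   # (7, 10, 4)
```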
GenomicRanges package
• Designed to represent genomic intervals (CpG islands, genes, exons,
TF binding sites, …)
• Based on the IRanges package; provides support for other
Bioconductor packages (GenomicFeatures, BSgenome, …)
• Contains three major classes:
– GRanges: single interval range features – genomic features with single start
and end locations
– GRangesList: multiple interval range features – features with multiple
start/end locations (e.g., a transcript with multiple exons)
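– GappedAlignments: gapped alignments (in current Bioconductor releases this
class is called GAlignments and lives in the GenomicAlignments package)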
GenomicRanges package
A basic GRanges instance extends IRanges representation with
information on chromosome and strand
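Conceptually, a genomic range is an integer range plus a sequence name and a strand. The Python sketch below models that idea with a small record type; the class name `GenomicRange` and the example coordinates are made up, and this is not how the R class is implemented. As in GRanges, the strand is one of "+", "-", or "*" (unspecified).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicRange:
    """A 1-based closed interval on a named sequence (illustrative only)."""
    chrom: str
    start: int
    end: int
    strand: str = "*"

    def __post_init__(self):
        if self.strand not in ("+", "-", "*"):
            raise ValueError("strand must be '+', '-' or '*'")
        if self.end < self.start - 1:
            raise ValueError("end must not be smaller than start - 1")

    @property
    def width(self):
        return self.end - self.start + 1


# Made-up coordinates for illustration.
gene = GenomicRange("chr1", 1000, 2000, "+")
print(gene, gene.width)
```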
GenomicFeatures package
• Tools and methods for making and manipulating transcript-centric
annotations
• It can download, from the UCSC Genome Browser or BioMart, the
genomic locations of genes, transcripts, exons, and CDS of a given
organism
• The package keeps track of the relationships between transcripts,
exons, CDS, and genes
• It also provides flexible methods for exporting the desired feature
information to convenient formats
Ranges algebra
• Summaries:
– coverage(), reduce(), disjoin()
• Arithmetic operations:
– shift(), resize(), restrict(), flank()
• Set operations:
– intersect(), union(), setdiff(), gaps()
• Comparisons:
– findOverlaps(), findMatches(), nearest(), order()
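The arithmetic operations listed above can be sketched in plain Python on (start, end) tuples with 1-based closed coordinates. These functions mimic the basic behaviour of shift(), resize(), and restrict() under simplifying assumptions (one range at a time, no strand); they are not the Bioconductor functions themselves.

```python
def shift(r, offset):
    """Move a range by `offset` positions without changing its width."""
    start, end = r
    return (start + offset, end + offset)


def resize(r, width, fix="start"):
    """Set the width of a range, keeping either its start or its end fixed."""
    start, end = r
    if fix == "start":
        return (start, start + width - 1)
    if fix == "end":
        return (end - width + 1, end)
    raise ValueError("fix must be 'start' or 'end'")


def restrict(r, lower, upper):
    """Clip a range to [lower, upper]; return None if nothing is left."""
    start, end = r
    new_start, new_end = max(start, lower), min(end, upper)
    return (new_start, new_end) if new_start <= new_end else None


print(shift((10, 20), 5))               # (15, 25)
print(resize((10, 20), 5))              # (10, 14)
print(resize((10, 20), 5, fix="end"))   # (16, 20)
print(restrict((10, 20), 15, 30))       # (15, 20)
print(restrict((10, 20), 30, 40))       # None
```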
Joining with reduce()
E.g., finding "gene" regions
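A minimal Python sketch of the merging idea behind reduce(): sort the ranges, then fuse each one into the previous range when they overlap or are directly adjacent. The function name `reduce_ranges` and the exon coordinates are made up; ranges are (start, end) tuples with 1-based closed coordinates.

```python
def reduce_ranges(ranges):
    """Merge overlapping or adjacent (start, end) ranges into a minimal set."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Made-up exons from several transcripts of the same gene.
exons = [(1, 5), (3, 8), (9, 12), (20, 25)]
print(reduce_ranges(exons))  # [(1, 12), (20, 25)]
```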
Disjoining with disjoin()
E.g., which bases belong to the same set of exons?
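Disjoining can be sketched as cutting the covered region at every range boundary, so that each resulting piece lies inside exactly the same set of input ranges. The plain-Python function below is an illustration of that idea with made-up coordinates, not the Bioconductor implementation.

```python
def disjoin(ranges):
    """Split (start, end) ranges into non-overlapping pieces at every boundary."""
    # A new piece begins at each start and just after each end.
    breaks = sorted({s for s, _ in ranges} | {e + 1 for _, e in ranges})
    pieces = []
    for left, right in zip(breaks, breaks[1:]):
        piece = (left, right - 1)
        # Keep the piece only if some input range covers it.
        if any(s <= piece[0] and piece[1] <= e for s, e in ranges):
            pieces.append(piece)
    return pieces


print(disjoin([(1, 10), (5, 15), (20, 25)]))
# [(1, 4), (5, 10), (11, 15), (20, 25)]
```

In the example, positions 5 to 10 belong to both of the first two ranges, while 1 to 4 and 11 to 15 each belong to only one.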
Finding flanking regions with flank()
E.g., identify the promoter region of a gene
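For a stranded feature, the upstream flank lies before the start on the "+" strand and after the end on the "-" strand. The Python sketch below illustrates this; `flank_upstream` is a hypothetical helper, the 200-base width and gene coordinates are arbitrary, and an unspecified strand ("*") is treated like "+". The result is not clipped to the chromosome bounds.

```python
def flank_upstream(start, end, strand, width):
    """Return the (start, end) of the `width` bases upstream of a feature."""
    if strand == "-":
        # On the minus strand, upstream means higher coordinates.
        return (end + 1, end + width)
    return (start - width, start - 1)


# Made-up gene coordinates; 200 bases is an arbitrary promoter size.
print(flank_upstream(1000, 2000, "+", 200))  # (800, 999)
print(flank_upstream(1000, 2000, "-", 200))  # (2001, 2200)
```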
Computing the “coverage”
Counting the number of times a position is represented in a
set of ranges with coverage()
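Coverage can be sketched in plain Python with a difference array: add 1 where each range starts, subtract 1 just after it ends, then take a running sum. This illustrates the concept only; the R function returns a run-length encoded vector (Rle) rather than a plain list. The function below assumes all ranges fall within positions 1 to `length`.

```python
def coverage(ranges, length):
    """Return per-position depth for positions 1..length as a list."""
    diff = [0] * (length + 2)
    for start, end in ranges:
        diff[start] += 1      # depth rises at the start
        diff[end + 1] -= 1    # and falls just after the end
    depth, running = [], 0
    for pos in range(1, length + 1):
        running += diff[pos]
        depth.append(running)
    return depth


# Made-up ranges on a sequence of length 8.
print(coverage([(1, 4), (3, 6), (3, 3)], 8))  # [1, 1, 3, 2, 1, 1, 0, 0]
```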
Set operations
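The set operations treat each collection of ranges as the set of positions it covers. The Python sketch below makes this literal by expanding ranges into position sets and compressing the result back into ranges. It is deliberately simple and inefficient for long ranges; the helper names and coordinates are made up, and the gaps shown are those of plain integer ranges between the first start and the last end.

```python
def to_positions(ranges):
    """Expand (start, end) ranges into the set of positions they cover."""
    return {p for s, e in ranges for p in range(s, e + 1)}


def to_ranges(positions):
    """Compress a set of positions back into sorted (start, end) ranges."""
    out = []
    for p in sorted(positions):
        if out and p == out[-1][1] + 1:
            out[-1] = (out[-1][0], p)
        else:
            out.append((p, p))
    return out


x = [(1, 10), (20, 30)]
y = [(5, 25)]
px, py = to_positions(x), to_positions(y)

print(to_ranges(px & py))  # intersect: [(5, 10), (20, 25)]
print(to_ranges(px | py))  # union:     [(1, 30)]
print(to_ranges(px - py))  # setdiff:   [(1, 4), (26, 30)]

# gaps: positions between the first start and last end of x not covered by x
span = set(range(min(px), max(px) + 1))
print(to_ranges(span - px))  # [(11, 19)]
```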
Overlaps
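Two closed ranges overlap when each one starts no later than the other ends. The following Python sketch finds all overlapping (query, subject) pairs by brute force and derives per-query counts from them. It illustrates the idea behind findOverlaps() with made-up coordinates and 1-based indices; the function name `find_overlaps` is hypothetical, and a real implementation would use an indexed search rather than comparing every pair.

```python
def find_overlaps(query, subject):
    """Return 1-based (query_index, subject_index) pairs of overlapping ranges."""
    hits = []
    for qi, (qs, qe) in enumerate(query, start=1):
        for si, (ss, se) in enumerate(subject, start=1):
            if qs <= se and ss <= qe:
                hits.append((qi, si))
    return hits


query = [(1, 5), (10, 15), (30, 40)]
subject = [(4, 12), (14, 20)]

hits = find_overlaps(query, subject)
print(hits)  # [(1, 1), (2, 1), (2, 2)]

# Number of subject ranges overlapping each query range.
counts = [sum(1 for q, _ in hits if q == i) for i in range(1, len(query) + 1)]
print(counts)  # [1, 2, 0]
```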
Views
• Often we have a big object (e.g., a genome) and we are only interested in subsets
of this object
• These subsets may be stored as IRanges: the key idea is to store the big object
once and just use the ranges
• For example, we could extract a “view” corresponding to a single chromosome
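The view idea can be sketched in plain Python: keep the large object once, store only (start, end) pairs, and materialise a subset when it is needed. The sequence, the coordinates, and the helper name `extract_views` are made up for illustration; coordinates are 1-based and closed.

```python
def extract_views(subject, ranges):
    """Return the pieces of `subject` selected by 1-based closed ranges."""
    return [subject[start - 1:end] for start, end in ranges]


# Made-up sequence standing in for a large genome.
genome = "ACGTACGTTTGACCA"
views = [(1, 4), (9, 12)]

pieces = extract_views(genome, views)
print(pieces)  # ['ACGT', 'TTGA']

# A per-view summary, here the number of G or C bases in each view.
print([sum(base in "GC" for base in piece) for piece in pieces])  # [2, 1]
```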
Summary
• The range integrates the different types of genomic data
• IRanges and GenomicRanges define the fundamental abstractions,
data types and utilities for representing, manipulating, comparing,
and summarizing ranges
• The data structures support storage of arbitrary metadata, and are
well integrated with reference annotation sources and visualization
packages (e.g., GenomicFeatures package)
• These tools can be used for transcript expression analysis and
junction counting in the context of RNA-seq data
• Broader applications include: variant calling, ChIP-seq, proteomics,
and even general fields like time series analysis