From basic data types to complex objects: GenomicRanges
Cremaschi Paolo
Centro d’Analisi Bioinformatiche per la Genomica (CABGEN), Pavia
Bioinformatic course 2014

Acknowledgments
The course was created by: Luigi Marchionni, Department of Oncology, Johns Hopkins School of Medicine
Additional material was provided by: Silvia Parolo, CABGEN IGM-CNR Pavia
Additional thanks to:
• Kasper Hansen
• Michael Lawrence

Goals
• Introduce the use of R packages
• Learn how to use complex R objects: GenomicRanges
• Perform basic operations with GenomicRanges
• Plot GenomicRanges

What are genomic ranges?
Nearly everything shown in a genome browser is a genomic range

Genomic Range
• The genomic range
– Represents genomic features, like genes and alignments
– Indexes into genomic vectors, like sequence and coverage
– Links summaries, like reads per kilobase per million of total reads (RPKMs), to genomic locations
• The genome acts as a scaffold for data integration
• Ranges have a specialized structure and algebra, requiring specialized data types and algorithms

Genomic Range file formats
Genome browsers (UCSC, for example) use two basic formats for genomic data: BED and WIG (Wiggle). Binary (not human-readable), indexed versions of these formats exist for large datasets: bigBed and bigWig.
See: http://genome.ucsc.edu/FAQ/FAQformat.html

BED format
• A collection of genomic intervals, 3 required columns:
– chrom, chromStart, chromEnd
• Nine additional optional columns, used for display purposes:
– name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts
• Block columns are for interval subdivisions (e.g., exons)
See: http://genome.ucsc.edu/FAQ/FAQformat.html#format1
Here is an example of a BED file:
[Figure: example of a BED file]

WIGGLE format
• Wiggle format (WIG) allows the display of continuous-valued data in a track format
• A vector of numbers associated with genomic locations at a single base (e.g., coverage)
• For speed and efficiency, wiggle data is compressed (unlike in the “bedGraph” format)
• Exists in two distinct formats:
– variableStep for data with irregular intervals
– fixedStep for data with regular intervals
See: http://genome.ucsc.edu/goldenPath/help/wiggle.html

Available Tools
• UCSC tools for BED, bigBed, WIG, and bigWig formats:
– bedToBigBed
– wigToBigWig
– bedGraphToBigWig
• Command-line tools:
– BEDtools (popular and mature)
– WiggleTools (also for bigWig and bigBed)
• In R from Bioconductor:
– IRanges
– GenomicRanges
– GenomicFeatures

Bioconductor Packages
• The IRanges, GenomicRanges, and GenomicFeatures packages provide all the infrastructure needed to deal with genomic features and ranges in R
• At first, these packages can appear daunting, with many classes and methods… However, they also contain very extensive and useful vignettes: read them!
Lawrence M et al, PLoS Comput Biol. 2013;9(8):e1003118.
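
Going back to the BED format described above, here is a minimal sketch of reading its columns. It is plain Python rather than any of the tools listed, the record type `BedRecord` and the helper `parse_bed_line` are hypothetical names, and the two example lines are made up for illustration. It assumes whitespace-separated fields and no "track" or "browser" header lines. Note that BED coordinates are 0-based with an exclusive chromEnd, so the width of an interval is simply chromEnd minus chromStart.

```python
from typing import NamedTuple, Optional


class BedRecord(NamedTuple):
    """Hypothetical container for the first six BED columns."""
    chrom: str
    start: int                    # chromStart: 0-based, inclusive
    end: int                      # chromEnd: 0-based, exclusive
    name: Optional[str] = None
    score: Optional[str] = None
    strand: Optional[str] = None


def parse_bed_line(line: str) -> BedRecord:
    """Parse one BED line; only the 3 required columns must be present."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs chrom, chromStart and chromEnd")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    optional = fields[3:6]
    # Pad missing optional columns with None.
    optional += [None] * (3 - len(optional))
    return BedRecord(chrom, start, end, *optional)


def width(record: BedRecord) -> int:
    """Interval width; valid because chromEnd is exclusive."""
    return record.end - record.start


# Made-up example content, not taken from a real track.
EXAMPLE_BED = "chr1\t100\t200\tfeatureA\t0\t+\nchr2\t500\t650\n"

records = [parse_bed_line(l) for l in EXAMPLE_BED.splitlines() if l.strip()]
for rec in records:
    print(rec.chrom, rec.start, rec.end, width(rec), rec.strand)
```

In R, these records would normally be handled as GRanges objects, which use 1-based inclusive coordinates, so a BED start is shifted by one on import.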

IRanges package
“The IRanges package is designed to represent sequences, ranges representing indices along those sequences, and data related to those ranges”
• A sequence is an ordered and finite collection of elements, such as a vector of integers (not necessarily only nucleic acid sequences)
• Consecutive indices can be represented as a range to save memory and computation: for example, instead of saving c(1,2,3,4,5,6,7), just save 1 and 7 (the start and end indices)

IRanges package
A basic IRanges instance is the minimal representation of a range (start, end, width), represented by:
– Tabular format
– Graph

GenomicRanges package
• Designed to represent genomic intervals (CpG islands, genes, exons, TF binding sites, …)
• Based on the IRanges package; provides support for other Bioconductor packages (GenomicFeatures, BSgenome, …)
• Contains three major classes:
– GRanges: single-interval range features, i.e., genomic features with single start and end locations
– GRangesList: multiple-interval range features, i.e., features with multiple start/end locations (e.g., a transcript with multiple exons)
– GappedAlignments: gapped alignments

GenomicRanges package
A basic GRanges instance extends the IRanges representation with information on chromosome and strand

GenomicFeatures package
• Tools and methods for making and manipulating transcript-centric annotations
• It can download, from the UCSC Genome Browser or BioMart, the genomic locations of genes, transcripts, exons, and cds of a given organism
• The package keeps track of the relationships between transcripts, exons, cds and genes
• It also provides flexible methods for exporting the desired feature information to convenient formats

Ranges algebra
• Summaries:
– coverage(), reduce(), disjoin()
• Arithmetic operations:
– shift(), resize(), restrict(), flank()
• Set operations:
– intersect(), union(), setdiff(), gaps()
• Comparisons:
– findOverlaps(), findMatches(), nearest(), order()

Joining with reduce()
E.g., finding "gene" regions

Disjoining with disjoin()
E.g., which bases belong to the same set of exons?

Finding flanking regions with flank()
E.g., identify the promoter region of a gene

Computing the “coverage”
Counting the number of times a position is represented in a set of ranges with coverage()

Set operations
[Figure: set operations on ranges]

Overlaps
[Figure: overlaps between ranges]

Views
• Often we have a big object (e.g., a genome) and we are only interested in subsets of this object
• These subsets may be stored as IRanges: the key idea is to store the big object and just use the ranges
• We could extract a “view” corresponding to a single chromosome

Summary
• The range integrates the different types of genomic data
• IRanges and GenomicRanges define the fundamental abstractions, data types and utilities for representing, manipulating, comparing, and summarizing ranges
• The data structures support storage of arbitrary metadata, and are well integrated with reference annotation sources and visualization packages (e.g., GenomicFeatures package)
• These tools can be used for transcript expression analysis and junction counting in the context of RNA-seq data
• Broader applications include: variant calling, ChIP-seq, proteomics, and even general fields like time series analysis
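
To make the ranges algebra more concrete, here is a rough sketch of what reduce(), disjoin(), coverage() and flank() compute. It is plain Python written for illustration, not the IRanges or GenomicRanges API: the function names only echo the R functions, ranges are (start, end) tuples with 1-based inclusive coordinates, strand is ignored, and the three example "exons" are made up.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end), 1-based and inclusive


def reduce_ranges(ranges: List[Range]) -> List[Range]:
    """Merge overlapping or adjacent ranges (e.g., exons into a 'gene' region)."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def disjoin(ranges: List[Range]) -> List[Range]:
    """Split ranges at every start/end so that no two pieces overlap."""
    cuts = sorted({s for s, _ in ranges} | {e + 1 for _, e in ranges})
    pieces: List[Range] = []
    for a, b in zip(cuts, cuts[1:]):
        # Keep the piece only if some input range covers it.
        if any(s <= a and b - 1 <= e for s, e in ranges):
            pieces.append((a, b - 1))
    return pieces


def coverage(ranges: List[Range], length: int) -> List[int]:
    """Count how many ranges cover each position 1..length."""
    diff = [0] * (length + 2)
    for start, end in ranges:
        diff[start] += 1
        diff[end + 1] -= 1
    counts, running = [], 0
    for pos in range(1, length + 1):
        running += diff[pos]
        counts.append(running)
    return counts


def flank_upstream(r: Range, width: int) -> Range:
    """The `width` positions just before the start (plus strand assumed)."""
    return (r[0] - width, r[0] - 1)


exons = [(1, 5), (3, 8), (12, 15)]    # made-up example ranges
print(reduce_ranges(exons))           # [(1, 8), (12, 15)]
print(disjoin(exons))                 # [(1, 2), (3, 5), (6, 8), (12, 15)]
print(coverage(exons, 15))
print(flank_upstream((12, 15), 4))    # (8, 11)
```

The Bioconductor functions additionally handle chromosome names, strand and metadata, which this sketch leaves out.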