panda Documentation

Release 1.0
Daniel Vera
February 12, 2014
functions:
CHAPTER 1
mat.make
mat.make creates two-dimensional matrices of scores aligned at user-specified genomic intervals. Each row in a matrix
corresponds to a genomic interval, and each column corresponds to a distance from the aligned feature. An example
of a matrix created by this function is a matrix of ChIP-seq signals relative to transcription start sites. The matrices
produced by this function can be used to generate heatmaps (mat.heatmap) and aggregate plots (mat.plotaverages), and
for other tasks. A matrix is a convenient way of storing genomic data when the analysis of such data can be
focused on particular regions. When genomic intervals are used to generate scores, mat.make can, apart from calculating
their density, also create a two-dimensional matrix listing the sizes of these intervals. This type of matrix is useful
with paired-end MNase-seq or DNase-seq data, which can be used to create fragment size vs. distance plots (V-plots,
Henikoff et al., 2009).
Input: The scores can be supplied in bedGraph, wiggle, or bigWig format. Alternatively, mat.make can create scores
from genomic intervals in the form of interval densities (such as calculating the density of reads supplied in bed or
bam format). Feature densities can be calculated from bed, bigBed, sam, bam, narrowPeak, or broadPeak files.
Output: Each matrix is named with the following convention: scorefile_featurefile.mat[windowsize]. For example,
if scores in H3K36me3signal.bw are aligned relative to TSSs.bed (the feature) and the window size is 10 bp,
the file for this matrix will be named H3K36me3signal_TSSs.mat10. For a given feature, the matrices for all
datasets (scores) have identical dimensions, so different scores aligned at the same set of features
can be easily compared. Score matrices aligned at a given set of features are saved in a directory bearing the
name of the feature, with a suffix of _mat[windowsize].
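The naming convention can be illustrated with a short R sketch. The helpers strip_ext, mat_file_name, and mat_dir_name are hypothetical names used only for illustration and are not part of panda; the sketch also assumes that only the last file extension is removed from each file name.

```r
# Hypothetical helpers (not part of panda) that reproduce the naming
# convention described above.
# Assumption: only the last extension of each file name is stripped.
strip_ext <- function(path) sub("\\.[^.]*$", "", basename(path))

# scorefile_featurefile.mat[windowsize]
mat_file_name <- function(scorefile, featurefile, windowsize) {
  paste0(strip_ext(scorefile), "_", strip_ext(featurefile), ".mat", windowsize)
}

# featurefile_mat[windowsize]
mat_dir_name <- function(featurefile, windowsize) {
  paste0(strip_ext(featurefile), "_mat", windowsize)
}

mat_file_name("H3K36me3signal.bw", "TSSs.bed", 10)  # "H3K36me3signal_TSSs.mat10"
mat_dir_name("TSSs.bed", 10)                        # "TSSs_mat10"
```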
1.1 Usage and option summary
Usage:
mat.make (scorefiles, features, closest = NULL, cores = "max", meta = FALSE, metaflank = 1000, maskbe
1.2 Arguments
Main options        Description
scorefiles          A list of file names from which to calculate scores. Can be one of the following formats:
                    bed, bigBed, narrowPeak, broadPeak, bedGraph, wig, bigWig, bam.
featurefiles        A list of file names from which to create matrices. Can be one of the following formats:
                    bed, bigBed, narrowPeak.
regionsize          Distance (in bp) around the feature from which to create the matrix. When meta = TRUE,
                    size (in bp) of the meta-feature. Defaults to 1000.
windowsize          Size (in bp) of the non-overlapping windows in which data are binned. Increasing
                    windowsize proportionally decreases the number of columns in the matrix. Must be a
                    factor of regionsize. Defaults to 10 (bp).

Alignment options   Description
strand              When TRUE, reverses scores in rows of minus-stranded features.
start               Column in featurefiles around which to align the center of the matrix. When
                    strand = TRUE, minus-stranded intervals are aligned by the column defined by ‘stop’.
                    Ignored when meta = TRUE or narrowPeak = TRUE. Defaults to 2.
stop                When strand = TRUE, column in featurefiles around which to align the center of the
                    matrix for minus-stranded features. Ignored when meta = TRUE or narrowPeak = TRUE.
prunefeaturesto     bed file name of intervals with which to intersect featurefiles before creating the
                    matrix. Useful to remove features outside microarray or sequence-capture regions.
                    Defaults to NULL.
narrowpeak          When TRUE, if a featurefile is a narrowPeak file, aligns the center of the matrix on
                    the peak summit defined by column 10 in the narrowPeak file. Defaults to FALSE.
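The effect of windowsize and strand on a single matrix row can be sketched in base R. This is only an illustration, not panda's implementation: it assumes that regionsize is the total width of the matrix in bp and that scores within a window are averaged, neither of which is stated in the option descriptions. The function bin_row is a hypothetical name.

```r
# Toy example (not panda code).
# Assumptions: regionsize is the total matrix width in bp, and scores within
# each window are averaged.
regionsize <- 20
windowsize <- 5
stopifnot(regionsize %% windowsize == 0)  # windowsize must be a factor of regionsize

scores <- seq_len(regionsize)             # per-base scores 1, 2, ..., 20

# Average the scores in non-overlapping windows: one matrix column per window.
bin_row <- function(x, windowsize) {
  colMeans(matrix(x, nrow = windowsize))
}

binned <- bin_row(scores, windowsize)     # 3 8 13 18
length(binned)                            # regionsize / windowsize = 4 columns

# With strand = TRUE, rows of minus-stranded features are reversed.
feature_strand <- "-"
aligned <- if (feature_strand == "-") rev(binned) else binned
aligned                                   # 18 13 8 3
```

Doubling windowsize to 10 in this sketch would halve the number of columns to 2, which is the proportional decrease described for windowsize.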
Score options       Description
maskbed             bed file name whose intervals are the only intervals that should have scores. Assigns
                    NA to all regions outside these intervals in the matrix. Used to prevent regions with
                    no information from artificially being assigned zeroes (e.g., regions without
                    microarray probes or sequence-capture probes). Defaults to NULL.
bgfiller            When a scorefile is a bedGraph, wig, or bigWig, value in the matrix to assign to
                    windows that have no overlapping scores in the scorefile. Useful to prevent windows
                    outside probed regions from artificially being assigned zeroes. Defaults to 0. “NA”
                    suggested for probe-specific data.
prunescores         When TRUE, removes scores in scorefiles that do not overlap with regions included in
                    matrices. May speed up matrix creation for very large scorefiles. Defaults to FALSE.
featurecenter       When TRUE, aligns the center of the matrix on the center of the intervals in
                    featurefiles. Ignored when meta = TRUE or narrowPeak = TRUE. Defaults to FALSE.
rpm                 When a scorefile is a bed, bigBed, or bam file, adjusts the calculated coverage in the
                    matrix to RPM defined by the number of lines in the scorefile
                    (RPM = coverage * 1000000 / file lines). Defaults to FALSE.
scoremat            When TRUE, creates score matrices. Set to FALSE to only create fragment-size matrices
                    (fragmats). Defaults to TRUE.

Misc. options       Description
fragmats            Indices of scorefiles for which to create fragment-size matrices for creating v-plots.
                    Defaults to NULL.
closest             bed file name whose intervals are used to assign names to rows in the matrix based on
                    the nearest interval in featurefiles. Used, for example, to assign the closest gene’s
                    name to each transcription-factor binding site defined in featurefiles. Defaults to
                    NULL.
cores               Number of scorefiles to process simultaneously for each featurefile. Defaults to
                    “max”, or all but one core.
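The RPM formula given for the rpm option, and the reason maskbed and bgfiller assign NA rather than zero, can be shown with toy numbers. The function to_rpm and all values below are made up for illustration and are not part of panda.

```r
# RPM = coverage * 1000000 / file lines (formula from the rpm option).
# to_rpm is a hypothetical helper, not a panda function.
to_rpm <- function(coverage, file_lines) {
  coverage * 1e6 / file_lines
}

coverage   <- c(0, 5, 0, 40)   # toy per-window coverage
file_lines <- 2e6              # toy number of lines in the scorefile
rpm <- to_rpm(coverage, file_lines)
rpm                            # 0.0 2.5 0.0 20.0

# Suppose the third window lies outside the probed regions. Leaving it as a
# zero lowers the average; masking it with NA keeps it out of the calculation.
probed <- c(TRUE, TRUE, FALSE, TRUE)
masked <- ifelse(probed, rpm, NA)

mean(rpm)                      # 5.625 (unprobed window counted as zero)
mean(masked, na.rm = TRUE)     # 7.5   (unprobed window ignored)
```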
Metafeature options Description
meta                When TRUE, creates a meta-matrix: a matrix that aligns features by their 5’ and 3’
                    ends by scaling all features to the same size, as defined by ‘regionsize’ (bp).
                    Defaults to FALSE.
metaflank           When meta = TRUE, determines the distance from the meta-features (in bp) that defines
                    the matrix boundary. Defaults to 1000.
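One way to picture the scaling step of a meta-matrix is to bin features of different lengths into the same number of columns. The sketch below is an assumption about how such scaling could work, not a description of panda's method; scale_feature is a hypothetical function, and it requires each feature to be at least as long as the number of bins.

```r
# Hypothetical sketch of scaling features to a common size (not panda code).
# Each base is assigned to one of nbins bins in proportion to its position
# along the feature, and the scores in each bin are averaged.
# Assumption: length(x) >= nbins.
scale_feature <- function(x, nbins) {
  bin <- ceiling(seq_along(x) * nbins / length(x))
  as.numeric(tapply(x, bin, mean))
}

short_feature <- c(2, 2, 4, 4, 6, 6, 8, 8)     # 8 bp feature
long_feature  <- rep(c(2, 4, 6, 8), each = 3)  # 12 bp feature

# Both features end up with 4 columns and can be stacked as matrix rows.
meta <- rbind(scale_feature(short_feature, 4),
              scale_feature(long_feature, 4))
meta
```

After scaling, the first column of every row corresponds to the 5’ end of a feature and the last column to its 3’ end, regardless of the original feature length.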
1.3 Dependencies
bedtools 2.18: mat.make heavily relies on bedtools, a suite of genomic calculation software created by Aaron Quinlan.
kent source utilities: mat.make requires kent source utilities if wiggle or bigWig formats are used. Written by Jim
Kent of UCSC.
1.4 Examples
make a list of files which you would like to plot signal from
> scorefiles <- c( "h3k27me3-signal.bw" , "ctcf-chipseq-signal.bg" , "polII-chipseq-reads.bed" , "htt
make a list of files which specify where to align scores at
> featurefiles <- c( "protein-coding-genes.bed" , "start-codons.bed" , "ctcf-binding-sites.narrowPeak
make matrix of data in scorefiles aligned at featurefiles
> mat.make ( scorefiles , featurefiles )
$ cat variants.bed
chr1  100   200    nasty  1  -
chr2  500   1000   ugly   2  +
chr3  1000  5000   big    3  -

$ cat genes.bed
chr1  150   200    geneA  1  +
chr1  175   250    geneB  2  +
chr3  0     10000  geneC  3  -

$ cat conserve.bed
chr1  0     10000  cons1  1  +
chr2  700   10000  cons2  2  -
chr3  4000  10000  cons3  3  +

$ cat known_var.bed
chr1  0     120    known1  -
chr1  150   160    known2  -
chr2  0     10000  known3  +

$ bedtools annotate -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1  100   200    nasty  1  -  0.500000  1.000000  0.300000
chr2  500   1000   ugly   2  +  0.000000  0.600000  1.000000
chr3  1000  5000   big    3  -  1.000000  0.250000  0.000000
CHAPTER 2
mat.heatmap
mat.heatmap draws heatmaps of matrices, sorted and/or grouped by user-defined criteria. It also plots aggregate
profiles and, optionally, fragment-size plots.
2.1 Usage and option summary
Usage:
make a list of files which you would like to plot signal from
scorefiles <- c( "h3k27me3-signal.bw" , "ctcf-chipseq-signal.bg" , "polII-chipseq-reads.bed" , "http:
make a list of files which specify where to align scores at
featurefiles <- c( "protein-coding-genes.bed" , "start-codons.bed" , "ctcf-binding-sites.narrowPeaks"
make matrix of data in scorefiles aligned at featurefiles
mat.make ( scorefiles , featurefiles )
Main options        Description
mats                A vector of matrix file names from which to draw heatmaps.
sorting             A vector of strings defining how to sequentially group and/or sort the data and, when
                    applicable, which portion of the matrix to use in determining how to sort the data.
                    Each string gives a method (kmeans, mean, median, min, max, minloc, maxloc, sd,
                    chrom) and the left and right distances (in bp) from the center of the matrix to
                    which to limit the sorting/clustering method. Within each string, the method, left
                    distance, and right distance must be separated by commas and must not contain
                    spaces. Clustering/grouping methods include kmeans and chromosome. Defaults to none
                    (no sorting/clustering).
numgroups           A numeric vector that defines how many groups or clusters rows are sequentially
                    divided into, corresponding to the sorting/clustering methods defined in ‘sorting’.
                    For kmeans, defines how many kmeans clusters are created. For all sorting methods,
                    divides genes into equally sized groups based on the corresponding value in
                    ‘numgroups’.
genegroups          A list of character vectors of gene names, defining the genes belonging to each group.
                    Only used when sorting[1] is “genelist”.
normalize           Logical. When TRUE, normalizes each score matrix row. For 1-tailed data, each row is
                    divided by the mean of the row, and the entire matrix is multiplied by the mean of
                    the matrix. For 2-tailed data, the row mean is subtracted from each score in the row
                    and the result is divided by the row standard deviation (z-score normalization).
                    Defaults to FALSE.
cores               A natural number defining the number of scorefiles to process simultaneously for each
                    featurefile. Defaults to “max”, or all but one core.
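The two normalization schemes described for the normalize option can be written out in base R on a toy matrix. This is a sketch of the description rather than mat.heatmap's own code; the function names are hypothetical, and the "mean of the matrix" is assumed to be the mean of the matrix before row scaling.

```r
# Toy score matrix: 2 features (rows) x 3 windows (columns).
m <- rbind(c(1, 2, 3),
           c(2, 4, 6))

# 1-tailed data: divide each row by its mean, then multiply the whole matrix
# by the matrix mean (assumption: the mean of the original matrix).
normalize_one_tailed <- function(m) {
  (m / rowMeans(m, na.rm = TRUE)) * mean(m, na.rm = TRUE)
}

# 2-tailed data: z-score each row (subtract the row mean, divide by the row
# standard deviation).
normalize_two_tailed <- function(m) {
  (m - rowMeans(m, na.rm = TRUE)) / apply(m, 1, sd, na.rm = TRUE)
}

normalize_one_tailed(m)  # both rows become 1.5 3.0 4.5
normalize_two_tailed(m)  # both rows become -1 0 1
```

In both cases the two rows, which differ only in scale, become identical after normalization, so rows are compared by the shape of their profile rather than by absolute signal.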
View options        Description
defaultlims         A character string or vector of character strings that defines the range of values
                    corresponding to the edges of the color gradient in heatmaps defined in ‘plotcolors’.
                    Defaults to c("auto","auto"), which uses the 3rd and 97th percentiles of each data set.
forcescore          Logical. When TRUE, NAs are converted to zeroes before the heatmap is drawn. This
                    prevents regions with no scores from showing as white, which may hinder or distract
                    from visualization of other colors in the heatmap. Defaults to TRUE.
plotcolors          A character string defining the colors used to create the color gradient for the
                    heatmap (from low scores to high scores). Colors must be separated by spaces.
                    Defaults to “white black”, where white is the lowest score and black is the highest
                    score.
centername          A character string naming the features to which matrices are aligned, which is used
                    to label the x-axis on aggregate plots.

V-plot options      Description
fragmats            A vector of fragment-matrix file names from which to draw v-plots.
fragrange           Range of fragment sizes that defines the y-axis in v-plots.
vdefaultlims        Numeric vector of length 2 that defines the range of values corresponding to the edges
                    of the color gradient in v-plots defined in ‘vplotcolors’. Defaults to
                    c("auto","auto"), which uses the 3rd and 97th percentiles of each data set.
vplotcolors         A string defining the colors used to create the color gradient for the v-plot (from
                    low scores to high scores). Colors must be separated by spaces. Defaults to
                    “black blue yellow red”, where black is the lowest score and red is the highest score.
rpm                 Logical. When TRUE, normalizes v-plot scores to RPM.
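The "auto" color limits, the forcescore behavior, and the space-separated color strings can be approximated in base R as follows. This is an illustrative sketch with toy data; the order of the steps and the variable names are assumptions, not mat.heatmap's actual implementation.

```r
# Toy scores with one missing value.
scores <- c(NA, 0:100)

# "auto" limits: 3rd and 97th percentiles of the data (NAs ignored).
lims <- quantile(scores, probs = c(0.03, 0.97), na.rm = TRUE)

# forcescore = TRUE: convert NAs to zeroes before drawing.
filled <- scores
filled[is.na(filled)] <- 0

# Values beyond the limits take the color at the edge of the gradient.
clamped <- pmin(pmax(filled, lims[1]), lims[2])

# plotcolors: colors separated by spaces, ordered from low to high scores.
plotcolors <- "white black"
gradient <- grDevices::colorRampPalette(strsplit(plotcolors, " ")[[1]])(100)

range(clamped)   # spans the two limits
gradient[c(1, 100)]
```

Using percentiles instead of the minimum and maximum keeps a few extreme windows from compressing the rest of the heatmap into a narrow band of colors.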
2.2 Default behavior - plot scores aligned at the 5’ end of intervals in featurefiles
By default, the fraction of each feature covered by each annotation file is reported after the complete feature in the file
to be annotated.
$ cat variants.bed
chr1  100   200    nasty  1  -
chr2  500   1000   ugly   2  +
chr3  1000  5000   big    3  -

$ cat genes.bed
chr1  150   200    geneA  1  +
chr1  175   250    geneB  2  +
chr3  0     10000  geneC  3  -

$ cat conserve.bed
chr1  0     10000  cons1  1  +
chr2  700   10000  cons2  2  -
chr3  4000  10000  cons3  3  +

$ cat known_var.bed
chr1  0     120    known1  -
chr1  150   160    known2  -
chr2  0     10000  known3  +

$ bedtools annotate -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1  100   200    nasty  1  -  0.500000  1.000000  0.300000
chr2  500   1000   ugly   2  +  0.000000  0.600000  1.000000
chr3  1000  5000   big    3  -  1.000000  0.250000  0.000000
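The fractions reported in this example can be reproduced by hand: for each feature, count the bases covered by at least one annotation interval and divide by the feature length. The R function below is a hypothetical illustration of that arithmetic, not bedtools code; it ignores strand and expects only annotation intervals from the same chromosome as the feature.

```r
# Hypothetical helper: number of overlapping annotation intervals and fraction
# of the feature [start, end) covered by their union. Strand is ignored, and
# all intervals are assumed to be on the same chromosome as the feature.
annotate_interval <- function(start, end, ann_start, ann_end) {
  hits <- ann_start < end & ann_end > start
  covered <- logical(end - start)          # one flag per base of the feature
  for (i in which(hits)) {
    from <- max(ann_start[i], start) - start + 1
    to   <- min(ann_end[i], end) - start
    covered[from:to] <- TRUE
  }
  c(count = sum(hits), fraction = mean(covered))
}

# chr1:100-200 (nasty) against the chr1 intervals of each annotation file.
annotate_interval(100, 200, c(150, 175), c(200, 250))  # genes.bed:     count 2, fraction 0.5
annotate_interval(100, 200, 0, 10000)                  # conserve.bed:  count 1, fraction 1.0
annotate_interval(100, 200, c(0, 150), c(120, 160))    # known_var.bed: count 2, fraction 0.3
```

The two genes overlap each other between positions 175 and 200, but each base is counted once, which is why the fraction for genes.bed is 0.5 rather than 0.75.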
2.3 -counts Report the count of hits from the annotation files
$ bedtools annotate -counts -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1  100   200    nasty  1  -  2  1  2
chr2  500   1000   ugly   2  +  0  1  1
chr3  1000  5000   big    3  -  1  1  0
2.4 -both Report both the count of hits and the fraction covered from the annotation files
$ bedtools annotate -both -i variants.bed -files genes.bed conserve.bed known_var.bed
#chr  start  end    name   score  +/-  cnt1  pct1      cnt2  pct2      cnt3  pct3
chr1  100    200    nasty  1      -    2     0.500000  1     1.000000  2     0.300000
chr2  500    1000   ugly   2      +    0     0.000000  1     0.600000  1     1.000000
chr3  1000   5000   big    3      -    1     1.000000  1     0.250000  0     0.000000
2.5 -s Restrict the reporting to overlaps on the same strand.
$ bedtools annotate -s -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1  100   200    nasty  1  -  0.000000  0.000000  0.000000
chr2  500   1000   ugly   2  +  0.000000  0.000000  0.000000
chr3  1000  5000   big    3  -  1.000000  0.000000  0.000000
2.6 -S Restrict the reporting to overlaps on the opposite strand.
$ bedtools annotate -S -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1  100   200    nasty  1  -  0.500000  1.000000  0.300000
chr2  500   1000   ugly   2  +  0.000000  0.600000  1.000000
chr3  1000  5000   big    3  -  0.000000  0.250000  0.000000
workflows:
CHAPTER 3
Indices and tables
• genindex
• modindex
• search