panda Documentation
Release 1.0

Daniel Vera
February 12, 2014

functions:

CHAPTER 1
mat.make

mat.make creates two-dimensional matrices of scores aligned at user-specified genomic intervals. Each row in a matrix corresponds to a genomic interval, and each column corresponds to a distance from the aligned feature. An example of a matrix created by this function is a matrix of ChIP-seq signals relative to transcription start sites. The matrices produced by this function can be used to generate heatmaps (mat.heatmap), aggregate plots (mat.plotaverages), and other downstream analyses. A matrix is a convenient way of storing genomic data when the analysis can be focused on particular regions.

When genomic intervals are used to generate scores, mat.make can, in addition to calculating their density, create a two-dimensional matrix listing the sizes of these intervals. This type of matrix is useful with paired-end MNase-seq or DNase-seq data, which can be used to create fragment size vs. distance plots (V-plots, Henikoff et al., 2009).

Input: The scores can be supplied in bedGraph, wiggle, or bigWig format. Alternatively, mat.make can create scores from genomic intervals in the form of interval densities (such as the density of reads supplied in bed or bam format). Feature densities can be calculated from bed, bigBed, sam, bam, narrowPeak, or broadPeak files.

Output: Each matrix will be named with the following convention: scorefile_featurefile.mat[windowsize]. For example, if scores in H3K36me3signal.bw are aligned relative to TSSs.bed (feature), and the window size is 10 bp, the file for this matrix will be named H3K36me3signal_TSSs.mat10. For a given feature file, every matrix built from a score dataset has identical dimensions, so different scores aligned at the same set of features can be easily compared.
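The naming convention can be illustrated with a short R sketch. The helper mat_name is hypothetical and is not part of panda; it assumes the matrix name is built from the base names of the score and feature files with their extensions removed, as in the H3K36me3signal example.

```r
# Hypothetical helper (not part of panda): build a matrix file name of the
# form scorefile_featurefile.mat[windowsize].
mat_name <- function(scorefile, featurefile, windowsize) {
  # Assumption: directory and file extension are dropped from both names.
  score   <- tools::file_path_sans_ext(basename(scorefile))
  feature <- tools::file_path_sans_ext(basename(featurefile))
  paste0(score, "_", feature, ".mat", windowsize)
}

mat_name("H3K36me3signal.bw", "TSSs.bed", 10)
# "H3K36me3signal_TSSs.mat10"
```

Because the feature name and window size are both encoded in the file name, matrices that share a suffix and feature name can be expected to have the same dimensions.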
Score matrices aligned at a given set of features will be saved in a directory bearing the name of the feature, with a suffix of _mat[windowsize].

1.1 Usage and option summary

Usage:

mat.make (scorefiles, features, closest = NULL, cores = "max", meta = FALSE, metaflank = 1000, maskbe

1.2 Arguments

Main options

scorefiles
    A list of file names from which to calculate scores. Each file can be in one of the following formats: bed, bigBed, narrowPeaks, broadPeaks, bedGraph, wig, bigWig, bam.
featurefiles
    A list of file names from which to create matrices. Each file can be in one of the following formats: bed, bigBed, narrowPeaks.
regionsize
    Distance (in bp) around the feature from which to create the matrix. When meta = TRUE, size (in bp) of the meta-feature. Defaults to 1000.
windowsize
    Size (in bp) of the nonoverlapping windows used to bin the data. Increasing windowsize proportionally decreases the number of columns in the matrix. Must be a factor of regionsize. Defaults to 10 (bp).

Alignment options

strand
    When TRUE, reverses scores in rows of minus-stranded features.
start
    Column in featurefiles around which to align the center of the matrix. When strand = TRUE, minus-stranded intervals are aligned by the column defined by 'stop'. Ignored when meta = TRUE or narrowPeak = TRUE. Defaults to 2.
stop
    When strand = TRUE, column in featurefiles on which to align the center of the matrix for minus-stranded features. Ignored when meta = TRUE or narrowPeak = TRUE.
prunefeaturesto
    bed file name of intervals with which to intersect featurefiles before creating the matrix. Useful to remove features outside microarray or sequence-capture regions. Defaults to NULL.
narrowpeak
    When TRUE, if a featurefile is a narrowPeak file, aligns the center of the matrix on the peak summit defined by column 10 in the narrowPeak file. Defaults to FALSE.

Score options

maskbed
    bed file name whose intervals are the only intervals that should have scores. Assigns NA to all regions outside these intervals in the matrix. Used to prevent regions with no information from artificially being assigned zeroes (e.g., regions without microarray probes or sequence-capture probes). Defaults to NULL.
bgfiller
    When a scorefile is a bedGraph, wig, or bigWig file, the value assigned to matrix windows that have no overlapping scores in the scorefile. Useful to prevent windows outside probed regions from artificially being assigned zeroes. Defaults to 0. "NA" is suggested for probe-specific data.
prunescores
    When TRUE, removes scores in scorefiles that do not overlap with regions included in matrices. May speed up matrix creation for very large scorefiles. Defaults to FALSE.
featurecenter
    When TRUE, aligns the center of the matrix on the center of the intervals in featurefiles. Ignored when meta = TRUE or narrowPeak = TRUE. Defaults to FALSE.
rpm
    When a scorefile is a bed, bigBed, or bam file, adjusts the calculated coverage in the matrix to RPM defined by the number of lines in the scorefile (RPM = coverage * 1000000 / file lines). Defaults to FALSE.
scoremat
    When TRUE, creates score matrices. Set to FALSE to create only fragment-size matrices (fragmats). Defaults to TRUE.

Misc. options

fragmats
    Indices of scorefiles for which to create fragment-size matrices for v-plots. Defaults to NULL.
closest
    bed file name whose intervals are used to assign names to rows in the matrix, based on the interval nearest to each interval in featurefiles. Used, for example, to assign the closest gene's name to each transcription-factor binding site defined in featurefiles. Defaults to NULL.
cores
    Number of scorefiles to process simultaneously for each featurefile. Defaults to "max", or all but one core.

Metafeature options

meta
    When TRUE, creates a meta-matrix: a matrix that aligns features by their 5' and 3' ends by scaling all features to the same size, as defined by 'regionsize' (bp). Defaults to FALSE.
metaflank
    When meta = TRUE, determines the distance from the meta-features (in bp) that defines the matrix boundary. Defaults to 1000.

1.3 Dependencies

bedtools 2.18: mat.make relies heavily on bedtools, a suite of genomic calculation software created by Aaron Quinlan.

kent source utilities: mat.make requires the kent source utilities, written by Jim Kent of UCSC, if wiggle or bigWig formats are used.

1.4 Examples

make a list of files which you would like to plot signal from

> scorefiles <- c( "h3k27me3-signal.bw" , "ctcf-chipseq-signal.bg" , "polII-chipseq-reads.bed" , "htt

make a list of files which specify where to align scores at

> featurefiles <- c( "protein-coding-genes.bed" , "start-codons.bed" , "ctcf-binding-sites.narrowPeak

make matrix of data in scorefiles aligned at featurefiles

> mat.make ( scorefiles , featurefiles )

$ cat variants.bed
chr1  100   200    nasty  1  -
chr2  500   1000   ugly   2  +
chr3  1000  5000   big    3  -

$ cat genes.bed
chr1  150   200    geneA  1  +
chr1  175   250    geneB  2  +
chr3  0     10000  geneC  3  -

$ cat conserve.bed
chr1  0     10000  cons1  1  +
chr2  700   10000  cons2  2  -
chr3  4000  10000  cons3  3  +

$ cat known_var.bed
chr1  0     120    known1    -
chr1  150   160    known2    -
chr2  0     10000  known3    +

$ bedtools annotate -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1  100   200    nasty  1  -  0.500000  1.000000  0.300000
chr2  500   1000   ugly   2  +  0.000000  0.600000  1.000000
chr3  1000  5000   big    3  -  1.000000  0.250000  0.000000

CHAPTER 2
mat.heatmap

mat.heatmap draws heatmaps of matrices, sorted and/or grouped by user-defined criteria. It also plots aggregate profiles and, optionally, fragment-size plots.

2.1 Usage and option summary

Usage:

make a list of files which you would like to plot signal from

scorefiles <- c( "h3k27me3-signal.bw" , "ctcf-chipseq-signal.bg" , "polII-chipseq-reads.bed" , "http:

make a list of files which specify where to align scores at

featurefiles <- c( "protein-coding-genes.bed" , "start-codons.bed" , "ctcf-binding-sites.narrowPeaks"

make matrix of data in scorefiles aligned at featurefiles

mat.make ( scorefiles , featurefiles )

Main options

mats
    a vector of matrix file names from which to draw heatmaps.
sorting
    a vector of strings defining how to sequentially group and/or sort the data and, when applicable, which portion of the matrix to use in determining how to sort the data. Each string gives a method (kmeans, mean, median, min, max, minloc, maxloc, sd, chrom) followed by the left and right distances (in bp) from the center of the matrix to which the sorting/clustering method is limited. Within each string, the method, left distance, and right distance must be separated by commas and must not contain spaces. Clustering/grouping methods include kmeans and chromosome. Defaults to none (no sorting/clustering).
numgroups
    a numeric vector defining how many groups or clusters the rows are sequentially divided into, corresponding to the sorting/clustering methods defined in 'sorting'. For kmeans, defines how many kmeans clusters are created. For all sorting methods, divides genes into equally sized groups based on the corresponding value in 'numgroups'.
genegroups
    a list of character vectors of gene names, defining the genes belonging to each group. Only used when sorting[1] is "genelist".
normalize
    logical. When TRUE, normalizes each score matrix row. For 1-tailed data, each row is divided by the mean of the row, and the entire matrix is multiplied by the mean of the matrix. For 2-tailed data, the row mean is subtracted from each score in the row and the result is divided by the row standard deviation (z-score normalization). Defaults to FALSE.
cores
    a natural number defining the number of scorefiles to process simultaneously for each featurefile. Defaults to "max", or all but one core.

View options

defaultlims
    a character string or vector of character strings defining the range of values that correspond to the edges of the color gradient in heatmaps defined in 'plotcolors'. Defaults to c("auto","auto"), which uses the 3rd and 97th percentiles of each data set.
forcescore
    logical. When TRUE, NAs are converted to zeroes before the heatmap is drawn. This prevents regions with no scores from showing as white, which may hinder or distract from visualization of other colors in the heatmap. Defaults to TRUE.
plotcolors
    a character string defining the colors used to create the color gradient for the heatmap (from low scores to high scores). Colors must be separated by spaces. Defaults to "white black", where white is the lowest score and black is the highest score.
centername
    a character string naming the features to which the matrices are aligned, used to label the x-axis on aggregate plots.

V-plot options

fragmats
    a vector of fragment-matrix file names from which to draw v-plots.
fragrange
    range of fragment sizes defining the y-axis in v-plots.
vdefaultlims
    numeric vector of length 2 defining the range of values that correspond to the edges of the color gradient in v-plots defined in 'vplotcolors'. Defaults to c("auto","auto"), which uses the 3rd and 97th percentiles of each data set.
vplotcolors
    string defining the colors used to create the color gradient for the v-plot (from low scores to high scores). Colors must be separated by spaces. Defaults to "black blue yellow red", where black is the lowest score and red is the highest score.
rpm
    logical. When TRUE, normalizes v-plot scores to RPM.
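
The row normalization described under 'normalize' and the "auto" limits described under 'defaultlims' can be sketched in base R as follows. This is an illustration of the written descriptions only, not panda's internal code. The function names are hypothetical, and two points are assumptions: the 1-tailed case multiplies by the mean of the original (unnormalized) matrix, and the percentiles are taken over all values in the matrix.

```r
# Illustrative sketch only; these helpers are hypothetical, not panda functions.

# 1-tailed data: divide each row by its mean, then multiply the whole matrix
# by the mean of the original matrix (assumption: mean taken before scaling).
normalize_one_tailed <- function(m) {
  (m / rowMeans(m, na.rm = TRUE)) * mean(m, na.rm = TRUE)
}

# 2-tailed data: row-wise z-scores (subtract the row mean, divide by row sd).
normalize_two_tailed <- function(m) {
  (m - rowMeans(m, na.rm = TRUE)) / apply(m, 1, sd, na.rm = TRUE)
}

# "auto" color limits: 3rd and 97th percentiles of the data.
auto_lims <- function(m) {
  quantile(m, probs = c(0.03, 0.97), na.rm = TRUE, names = FALSE)
}

# Toy matrix: two features (rows), three windows (columns).
m <- matrix(c(1, 2, 3,
              10, 20, 30), nrow = 2, byrow = TRUE)

normalize_one_tailed(m)  # both rows become 5.5 11.0 16.5
normalize_two_tailed(m)  # both rows become -1 0 1
auto_lims(m)
```

In the toy matrix the two rows differ only in scale, so both normalizations make them identical. Removing per-row scale in this way lets the shape of each profile, rather than its magnitude, drive the heatmap colors.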

2.2 Default behavior - plot scores aligned at the 5' end of intervals in featurefiles

By default, the fraction of each feature covered by each annotation file is reported after the complete feature in the file to be annotated.

$ cat variants.bed
chr1  100   200    nasty  1  -
chr2  500   1000   ugly   2  +
chr3  1000  5000   big    3  -

$ cat genes.bed
chr1  150   200    geneA  1  +
chr1  175   250    geneB  2  +
chr3  0     10000  geneC  3  -

$ cat conserve.bed
chr1  0     10000  cons1  1  +
chr2  700   10000  cons2  2  -
chr3  4000  10000  cons3  3  +

$ cat known_var.bed
chr1  0     120    known1    -
chr1  150   160    known2    -
chr2  0     10000  known3    +

$ bedtools annotate -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1  100   200    nasty  1  -  0.500000  1.000000  0.300000
chr2  500   1000   ugly   2  +  0.000000  0.600000  1.000000
chr3  1000  5000   big    3  -  1.000000  0.250000  0.000000

2.3 -counts Report the count of hits from the annotation files

$ bedtools annotate -counts -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1  100   200    nasty  1  -  2  1  2
chr2  500   1000   ugly   2  +  0  1  1
chr3  1000  5000   big    3  -  1  1  0

2.4 -both Report both the count of hits and the fraction covered from the annotation files

$ bedtools annotate -both -i variants.bed -files genes.bed conserve.bed known_var.bed
#chr  start  end    name   score  +/-  cnt1  pct1      cnt2  pct2      cnt3  pct3
chr1  100    200    nasty  1      -    2     0.500000  1     1.000000  2     0.300000
chr2  500    1000   ugly   2      +    0     0.000000  1     0.600000  1     1.000000
chr3  1000   5000   big    3      -    1     1.000000  1     0.250000  0     0.000000

2.5 -s Restrict the reporting to overlaps on the same strand.

$ bedtools annotate -s -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1  100   200    nasty  1  -  0.000000  0.000000  0.000000
chr2  500   1000   ugly   2  +  0.000000  0.000000  0.000000
chr3  1000  5000   big    3  -  1.000000  0.000000  0.000000

2.6 -S Restrict the reporting to overlaps on the opposite strand.

$ bedtools annotate -S -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1  100   200    nasty  1  -  0.500000  1.000000  0.300000
chr2  500   1000   ugly   2  +  0.000000  0.600000  1.000000
chr3  1000  5000   big    3  -  0.000000  0.250000  0.000000

workflows:

CHAPTER 3
Indices and tables

• genindex
• modindex
• search