DeepToolsMacs GCC2016

ChIP-seq analysis using deepTools and MACS
Devon Ryan
Twitter: @dpryan79
Biostars: dpryan79
Who are we?
Deeptools.ie-freiburg.mpg.de
Topics to be covered
• Assess ChIP quality/IP strength
• Understand the difference between BAM files and bigWig files
• Understand why, when and how one needs to normalize ChIP-seq data
• Know the basics of peak calling (why and how is it done?)
• Be able to work with the output from MACS2 (e.g., filtering of peaks, visualization in a Genome Browser)
• Know how to generate coverage plots, e.g., heatmaps
ChIP-seq
[Szalkowski & Schmid, Brief Bioinf, 2011]
Outline
• ChIP efficiency
• Coverage Files
• Depth normalization
• Input normalization
• Peak calling
  – Why? How?
  – Types of peaks: Sharp, broad, mixed
• Downstream processing
  – This is the actually interesting part
BAM fingerprints
Calculating the fingerprint
[Figure: read counts per bin along the genome are sorted and then scaled]
genome bins: 2 3 4 2 3 3 4 4 5 7 8
sorting: 2 2 3 3 3 4 4 4 5 7 8
scaling: 0.25 0.25 0.37 0.37 0.37 0.5 0.5 0.5 0.5 0.87 1
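The sort-and-scale idea can be sketched in a few lines of Python. The scaled values on the slide look like counts divided by the maximum. The sketch below instead uses a cumulative-sum form, which is an assumption about how plotFingerprint draws its curve, not a copy of its code. The function name `fingerprint` is made up for illustration.

```python
from itertools import accumulate


def fingerprint(bin_counts):
    """Return (rank fraction, cumulative read fraction) pairs.

    Assumption: the curve is the cumulative sum of the sorted bin counts,
    with both axes scaled to end at 1.
    """
    ordered = sorted(bin_counts)
    total = sum(ordered)
    if total == 0:
        raise ValueError("no reads in any bin")
    n = len(ordered)
    return [((i + 1) / n, c / total) for i, c in enumerate(accumulate(ordered))]


# Read counts per bin as shown on the slide.
bins = [2, 3, 4, 2, 3, 3, 4, 4, 5, 7, 8]
curve = fingerprint(bins)
for rank, fraction in curve:
    print(f"{rank:.2f}\t{fraction:.2f}")
```

A perfectly uniform sample would give a straight diagonal. The more the reads pile up in a few bins, the longer the curve stays low before rising steeply at the end.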
BAM fingerprints
IP Strength
[Figure: fingerprint plots of input and ChIP samples, annotated with:]
• High IP enrichment
• Similar genome coverage (ca. 90%)
• Input deviates from straight line (uniformity)
• Insufficient genome coverage
IP Strength
[Figure: fingerprint plots of input and ChIP samples showing weak IP enrichment over input]
Practical I: plotFingerprint
NGS file formats ~ analysis steps
@Read1
GATTTGGGGTTCAAAGCAGTAT
+
@Read2
CGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
FASTQ
sequenced DNA fragments (reads): DNA sequence only
read alignment
SAM
BAM
r1 163 chr1 7 30 8M2I4M1D3M = 37 39 TTAGATGATTG *
r2 0 chr1 9 30 3S6M1P1I4M * 0 0 AAAAGGATA *
reads: DNA sequence + genomic localization
counting reads, normalizing for sequencing depth etc.
chr1 10 20 1.5
chr1 20 30 1.7
chr1 30 40 2.0
chr1 40 50 1.8
bedGraph
bigWig
coverage files: read numbers per genomic region
DOWNSTREAM ANALYSES
Computing coverage
aim: reduce the vast amount of information from
the BAM file to the simple information:
How many reads do I have (per bp/genomic bin/…)?
BAM record:
39V34V1:38:C0RLHACXX:4:1216:16137:31969 163 chr1 3000307 42 51M = 3000408 152 CTGTAGTTACTGTTTGCTTACCTAGATTCTTCTTTTCCAGAATTCTCTTAG CCCFFFFFHHHGHIIJIJJJJIIGHFGIGIJIIJJJHIHEHIGIIIIJJGF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:0 YT:Z:CP
Coverage (bedGraph/bigWig):
chr2 100100 100120 5
chr2 100121 100141 3.2
chr2 100142 100163 13.8
size reduction leads to many advantages of bigWig files over BAMs:
• data storage
• data sharing
• intuitive visualization via genome browsers
• more efficient for downstream analyses
• …
Computing coverage
DNA → Sonicated to ~200bp frags. → 50-100 base reads
[Figure: reads and the fragments they come from, aligned to the genome]
counts per bin (e.g. 50 bp): 1 2 2 5 5 6 6 6 6 4 4
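A minimal sketch of how such per-bin counts could be computed, assuming each read is extended to a fixed fragment length in the direction of its strand. This is an illustration, not the bamCoverage implementation. The function name, the toy reads and the 200 bp fragment length are all assumptions.

```python
def binned_coverage(reads, region_length, bin_size=50, fragment_length=200):
    """Count extended fragments overlapping each bin.

    reads: iterable of (start, end, strand), 0-based half-open coordinates.
    Each read is extended to `fragment_length` along its strand.
    """
    n_bins = (region_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for start, end, strand in reads:
        if strand == "+":
            frag_start, frag_end = start, max(end, start + fragment_length)
        else:
            frag_start, frag_end = min(start, end - fragment_length), end
        # Clip the fragment to the region.
        frag_start = max(frag_start, 0)
        frag_end = min(frag_end, region_length)
        if frag_start >= frag_end:
            continue
        first_bin = frag_start // bin_size
        last_bin = (frag_end - 1) // bin_size
        for b in range(first_bin, last_bin + 1):
            counts[b] += 1
    return counts


# Three hypothetical 50-base reads on a 400 bp region.
reads = [(0, 50, "+"), (100, 150, "+"), (300, 350, "-")]
coverage = binned_coverage(reads, region_length=400)
print(coverage)
```

Without the extension step, only the sequenced 50-100 bases at one end of each ~200 bp fragment would contribute, so the signal would sit beside the actual binding site rather than on it.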
Depth normalization
[Figure: reads and fragments aligned to the genome]
counts per bin (e.g. 50 bp): 1 2 2 5 5 6 6 6 6 4 4
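Two common ways to make the per-bin counts comparable between samples of different sequencing depth are sketched below: scaling to an average coverage of 1x, and counts per million. Both formulas are stated here as assumptions about what "depth normalization" means in practice. The library size, fragment length and genome size are made-up toy numbers.

```python
def scale_to_1x(bin_counts, n_fragments, fragment_length, effective_genome_size):
    """Divide by the genome-wide mean coverage so the average becomes 1."""
    mean_coverage = n_fragments * fragment_length / effective_genome_size
    return [c / mean_coverage for c in bin_counts]


def counts_per_million(bin_counts, n_fragments):
    """Express each bin as counts per million sequenced fragments."""
    return [c * 1_000_000 / n_fragments for c in bin_counts]


# Per-bin counts from the slide, with hypothetical library parameters.
bins = [1, 2, 2, 5, 5, 6, 6, 6, 6, 4, 4]
scaled = scale_to_1x(bins, n_fragments=2_000_000, fragment_length=200,
                     effective_genome_size=100_000_000)
cpm = counts_per_million(bins, n_fragments=2_000_000)
print(scaled)
print(cpm)
```

With these toy numbers the genome-wide mean coverage is 4, so a bin with 6 fragments ends up at 1.5 times the average.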
Practical II: bamCoverage
Input samples – They’re important!!!
Input controls should be treated exactly the same* as ChIP samples except for the antibody treatment!
*same cell type, same shearing, same PCRs, same experimenter, …
Input samples – They’re important!!!
(not only) gene-rich regions = bias-rich regions
(especially applicable to old sequencing data)
Comparative coverage of BAM files
typical application:
• input-normalization for a ChIP sample
• aim:
  – reduce the background in the ChIP signal using the input
• main caveat:
  – the same genomic region is never covered exactly the same way, which is neither the input’s nor the ChIP’s fault
Normalization by total read count:
very straightforward, perhaps too simple?
Normalization by SES (Diaz et al., 2012):
more sophisticated, based on bamFingerprint (greatest distance between input and ChIP), not recommended for broad marks due to weaker enrichment
Which normalization method?
• ratio
• log2 (ratio)
• difference
• sum
• reciprocal_ratio
• SES
Base your decision on the kind of questions you would like to answer!
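To make the listed operations concrete, here is a small sketch that compares ChIP and input counts bin by bin after scaling both libraries by total read count. It is not the bamCompare implementation. The pseudocount, the function name and the toy numbers are assumptions, and the `reciprocal_ratio` rule (negative inverse for ratios below 1) reflects one reading of the option. SES scaling is not shown.

```python
import math


def compare_bins(chip, control, chip_total, control_total, pseudocount=1.0):
    """Compare two samples bin by bin after read-count scaling."""
    # Scale the larger library down to the size of the smaller one.
    smallest = min(chip_total, control_total)
    chip_scale = smallest / chip_total
    control_scale = smallest / control_total
    results = []
    for c, i in zip(chip, control):
        c_scaled = c * chip_scale
        i_scaled = i * control_scale
        # The pseudocount avoids division by zero in empty input bins.
        ratio = (c_scaled + pseudocount) / (i_scaled + pseudocount)
        results.append({
            "ratio": ratio,
            "log2_ratio": math.log2(ratio),
            "difference": c_scaled - i_scaled,
            "sum": c_scaled + i_scaled,
            "reciprocal_ratio": ratio if ratio >= 1 else -1 / ratio,
        })
    return results


# Two hypothetical bins; the ChIP library is twice as deep as the input.
results = compare_bins(chip=[10, 2], control=[4, 8],
                       chip_total=2000, control_total=1000)
for row in results:
    print(row)
```

The log2 ratio treats enrichment and depletion symmetrically around 0, while the difference keeps the signal in units of reads. Which one suits best depends on the downstream question.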
Practical III: bamCompare
Peak calling
From DNA reads to protein binding sites
3 main tasks of peak calling programs:
1. identify original fragments
2. identify enriched regions
3. assign significance
different solutions for each step
> 30 different programs
[Figure: sequenced reads (CTACGGT…, ATCGCTG…, CATCGA…, GCATTG…) around a protein binding site]
Which peak caller to choose?
Again: base your decision on the kind of questions you would like to answer!
Table from Wilbanks & Facciotti, 2010
Why do we focus on MACS2?
• One of the most widely used peak callers, also used by big consortia, e.g., (mod)ENCODE, Blueprint, NIH Roadmap (reproducibility! comparability!)
• Under active development
• Can be used for sharp and mixed signals
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C,
Myers RM, Brown M, Li W, Liu XS. Model-based analysis of ChIP-Seq (MACS).
Genome Biol. 2008;9(9):R137
Peak Calling with MACS
1. Identify regions with enrichment, i.e., a large no. of mapped reads (modeling by read shifting)
2. Determine peaks based on enrichments passing p-value* threshold
3. Estimate false discovery rate (FDR*) for each detected peak
*p-value: probability of seeing an enrichment at least this strong if the null hypothesis were true (small p-value = PEAK!); null hypothesis: reads are randomly distributed throughout the genome following a Poisson distribution (input is used to parameterize the statistical model)
*q-value/FDR: expected fraction of false positives among the peaks called at this threshold, given that testing is done genome-wide
Figure from Park, Nature Reviews Genetics, 2009
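The two statistics in the footnotes can be sketched as follows: a Poisson upper-tail p-value for the read count in a window, and a Benjamini-Hochberg adjustment to obtain q-values. This is a simplified illustration, not MACS code. The function names, the window counts and the expected rate of 5 reads per window are assumptions, and Benjamini-Hochberg is used here as one common choice of FDR procedure.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), i.e. the upper-tail p-value."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    # Note: 1 - cdf loses precision for very small p-values.
    return max(0.0, 1.0 - cdf)


def benjamini_hochberg(p_values):
    """Return q-values in the same order as the input p-values."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical ChIP read counts in three windows; input suggests 5 reads each.
window_counts = [3, 12, 25]
p_values = [poisson_sf(k, 5.0) for k in window_counts]
q_values = benjamini_hochberg(p_values)
for k, p, q in zip(window_counts, p_values, q_values):
    print(k, p, q)
```

With thousands to millions of windows tested genome-wide, a p-value threshold alone lets many chance enrichments through, which is why the q-value is usually the number to filter on.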
How does MACS2 work?
● Fragment length modeling
  – Sliding window (2x bandwidth)
  – >M-fold enrichment over background (effective genome size!)
  – Filter “peaks” (>5x, <50x enrichment)
● Peak calling
  – Reads extended
  – Samples scaled (need input)
  – Sliding window (2x frag. len.)
  – Enrichment according to Poisson variance
    ● Use local noise for filtering
    ● Use input variance for filtering
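The "local noise" idea can be illustrated with a dynamic background rate: the expected read count near a candidate peak is taken as the largest of the genome-wide rate and the rates in several windows of input data around it, following the description in the MACS paper. The sketch below works on binned input counts. The window half-widths (in bins), the function name and the toy data are assumptions, not MACS2 defaults.

```python
def local_lambda(control_bins, center, genome_lambda, half_widths=(10, 50, 100)):
    """Expected input reads per bin near `center`, using the largest estimate.

    control_bins: input read counts per bin along one chromosome.
    center: index of the bin holding the candidate peak.
    """
    estimates = [genome_lambda]
    for half in half_widths:
        lo = max(0, center - half)
        hi = min(len(control_bins), center + half + 1)
        window = control_bins[lo:hi]
        estimates.append(sum(window) / len(window))
    return max(estimates)


# Hypothetical input track: flat background with a local pile-up at bins 95-105.
control_bins = [1] * 200
for b in range(95, 106):
    control_bins[b] = 9
genome_lambda = sum(control_bins) / len(control_bins)

print(local_lambda(control_bins, 100, genome_lambda))  # inside the pile-up
print(local_lambda(control_bins, 20, genome_lambda))   # far away from it
```

Taking the maximum makes the test conservative exactly where the input itself is enriched, so a ChIP pile-up that is mirrored in the input is less likely to be called a peak.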
Properties influencing the peak calling
Know thy data!
• Library complexity (How many duplicates? Overrepresented regions?)
• Enrichment strength (IP success)
• Width and nature of the enriched regions (narrow vs. broad vs. mixed)
• No. of occupied sites
• Range of the ChIP signal intensities
ALWAYS inspect your data visually and manually!
It’s not the peak caller that’s making sense of your data, it’s you!
FASTQC
plotFingerprint
computeGCbias
Genome Browser
Important MACS parameters
• Specify effective/mappable genome size (-g)
• Fragment size: might be set manually, especially for paired-end data (for which fragment size can be determined separately, e.g., by Picard CollectInsertSizeMetrics)!
• Broad peak calling (--broad) should be turned on for basically all histone marks except H3K4me3 and perhaps H3K27ac
Think about stringent filtering criteria on the peak lists computed by MACS!
If you’re not satisfied, play with the parameters!
You could make a workflow for the QC of peaks
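One way to apply stricter filtering to a peak list is sketched below for the narrowPeak format, where column 7 holds the fold enrichment and column 9 the -log10 q-value. The cutoffs (5-fold, q ≤ 0.01), the function name and the two example rows are assumptions chosen for illustration. Sensible thresholds depend on the experiment.

```python
import math


def filter_narrowpeak(lines, min_fold=5.0, max_q=0.01):
    """Keep narrowPeak rows passing fold-enrichment and q-value cutoffs."""
    min_neglog10_q = -math.log10(max_q)
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and browser/track header lines.
        if not line or line.startswith(("#", "track")):
            continue
        fields = line.split("\t")
        fold = float(fields[6])        # column 7: fold enrichment
        neglog10_q = float(fields[8])  # column 9: -log10(q-value)
        if fold >= min_fold and neglog10_q >= min_neglog10_q:
            kept.append(fields)
    return kept


# Two made-up peaks in narrowPeak layout.
example = [
    "chr1\t100\t400\tpeak_1\t250\t.\t8.2\t12.5\t9.8\t150",
    "chr1\t900\t1100\tpeak_2\t30\t.\t2.1\t3.0\t1.2\t90",
]
strong = filter_narrowpeak(example)
print([fields[3] for fields in strong])
```

The same function works on a real file by passing an open file handle instead of the list.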
Practical IV: Peak Calling
Powerful visualization: heatmaps
use deepTools:
computeMatrix
plotHeatmap
plotProfile
Requirements:
• bigWig file
• bed file
• question in mind!
Powerful visualization: heatmaps
Possible questions:
• What kinds of signal distributions do I see in my peaks?
• How does my signal look around the TSS/TES/my favorite region?
• How does my signal look when I assume the same size for all genes?
• …
computeMatrix
plotHeatmap
plotProfile
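The idea behind these tools can be sketched without deepTools: cut the same-sized window of binned signal around every reference point (for example each TSS) to get one matrix row per region, then average the columns for a summary profile. This is a conceptual sketch, not how computeMatrix is implemented. The names and the toy signal are assumptions, and strand handling is left out.

```python
def signal_matrix(signal, reference_points, upstream=2, downstream=3):
    """One row per reference point: bins from -upstream to +downstream."""
    matrix = []
    for point in reference_points:
        start, end = point - upstream, point + downstream
        if start < 0 or end > len(signal):
            continue  # skip windows running off the chromosome
        matrix.append(signal[start:end])
    return matrix


def mean_profile(matrix):
    """Column-wise mean, i.e. the average signal at each relative position."""
    return [sum(column) / len(column) for column in zip(*matrix)]


# Hypothetical binned coverage with reference points at bins 4 and 12.
signal = [0, 1, 2, 5, 9, 5, 2, 1, 0, 0, 1, 3, 8, 4, 1, 0]
matrix = signal_matrix(signal, reference_points=[4, 12])
profile = mean_profile(matrix)
print(matrix)
print(profile)
```

The matrix rows correspond to the rows of a heatmap and the column means to the summary line of a profile plot.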
Powerful visualization: PCA/etc.
Practical V: Visualization with deepTools
Advanced topics: GC Bias
use deepTools:
computeGCbias
correctGCbias
Advanced Topics: Blacklisted Regions
• Peaks in the same place
  – Regardless of ChIP type
  – Regardless of cell type
  – Regardless of experiment
• These are false positives – known sites to ignore
• https://sites.google.com/site/anshulkundaje/projects/blacklists
• These regions can screw up scaling!
• deepTools: can ignore blacklisted sites
• NGS utils: “BAM filter”
• Bedtools intersect (with peaks to remove overlaps)
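Removing peaks that overlap a blacklist boils down to an interval-overlap test, comparable to what `bedtools intersect -v` reports. A minimal sketch, assuming BED-style 0-based half-open coordinates; the function names and the example intervals are made up for illustration.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def remove_blacklisted(peaks, blacklist):
    """Drop peaks overlapping any blacklisted interval.

    peaks, blacklist: lists of (chrom, start, end) tuples.
    """
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))
    return [
        (chrom, start, end)
        for chrom, start, end in peaks
        if not any(overlaps(start, end, b_start, b_end)
                   for b_start, b_end in by_chrom.get(chrom, []))
    ]


# Hypothetical peaks and one blacklisted region on chr1.
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
blacklist = [("chr1", 150, 300)]
clean = remove_blacklisted(peaks, blacklist)
print(clean)
```

This linear scan is fine for a few hundred blacklist entries. For large interval sets a sorted or indexed lookup would be the better choice.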
Now you can...
• Assess ChIP quality/IP strength
• Understand the difference between BAM files and bigWig files
• Understand why, when and how one needs to normalize ChIP-seq data
• Know the basics of peak calling (why and how is it done?)
• Work with the output from MACS2 (e.g., filtering of peaks, visualization in a Genome Browser)
• Generate coverage plots, e.g., heatmaps
Where to go for help
• deepTools:
  – https://groups.google.com/forum/#!forum/deeptools
  – http://deeptools.readthedocs.org
• MACS google group
  – https://groups.google.com/forum/#!forum/macs-announcement
  – https://github.com/taoliu/MACS/wiki
• Biostars: www.biostars.org
• Galaxy support: http://biostar.usegalaxy.org
Slides will be posted to GCC2016 website!
Slides/datasets/pages will be @ http://deeptools.ie-freiburg.mpg.de