ChIPseq analysis using deepTools and MACS
Devon Ryan
Twitter: @dpryan79
Biostars: dpryan79

Who are we?
Deeptools.ie-freiburg.mpg.de

Topics to be covered
• Assessing ChIP quality/IP strength
• Understand the difference between BAM files and bigWig files
• Understand why, when and how one needs to normalize ChIP-seq data
• Know the basics of peak calling (why and how is it done?)
• Be able to work with the output from MACS2 (e.g., filtering of peaks, visualization in a Genome Browser)
• Know how to generate coverage plots, e.g., heatmaps

ChIPseq
[Figure from Szalkowski & Schmid, Brief Bioinf, 2011]

Outline
• ChIP efficiency
• Coverage Files
  – Depth normalization
  – Input normalization
• Peak calling
  – Why? How?
  – Types of peaks: Sharp, broad, mixed
• Downstream processing
  – This is the actually interesting part

BAM fingerprints

Calculating the fingerprint
[Figure: read counts per genomic bin (2 3 4 2 3 3 4 4 5 7 8 …) are sorted (2 2 3 3 3 4 4 4 …) and then scaled (0.25 0.25 0.37 0.37 0.37 0.5 0.5 0.5 0.5 0.87 1)]

BAM fingerprints

IP Strength
[Figure: fingerprint plots, each with an Input and a ChIP curve]
• High IP enrichment; similar genome coverage (ca. 90%)
• Input deviates from straight line (uniformity); insufficient genome coverage

IP Strength
[Figure: fingerprint plots, each with an Input and a ChIP curve]
• Weak IP enrichment over input

Practical I: plotFingerprint

Outline
• ChIP efficiency
• Coverage Files
  – Depth normalization
  – Input normalization
• Peak calling
  – Why? How?
  – Types of peaks: Sharp, broad, mixed
• Downstream processing
  – This is the actually interesting part

NGS file formats ~ analysis steps

FASTQ
sequenced DNA fragments (reads): DNA sequence only
  @Read1
  GATTTGGGGTTCAAAGCAGTAT
  +
  @Read2
  CGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

read alignment

SAM / BAM
reads: DNA sequence + genomic localization
  r1 163 chr1 7 30 8M2I4M1D3M = 37 39 TTAGATGATTG *
  r2 0 chr1 9 30 3S6M1P1I4M * 0 0 AAAAGGATA *

counting reads, normalizing for sequencing depth etc.
bedGraph / bigWig
coverage files: read numbers per genomic region
  chr1 10 20 1.5
  chr1 20 30 1.7
  chr1 30 40 2.0
  chr1 40 50 1.8

DOWNSTREAM ANALYSES

Computing coverage
aim: reduce the vast amount of information in the BAM file to one simple question: How many reads do I have (per bp/genomic bin/…)?

BAM record:
  39V34V1:38:C0RLHACXX:4:1216:16137:31969 163 chr1 3000307 42 51M = 3000408 152 CTGTAGTTACTGTTTGCTTACCTAGATTCTTCTTTTCCAGAATTCTCTTAG CCCFFFFFHHHGHIIJIJJJJIIGHFGIGIJIIJJJHIHEHIGIIIIJJGF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:0 YT:Z:CP
bedGraph records:
  chr2 100100 100120 5
  chr2 100121 100141 3.2
  chr2 100142 100163 13.8

size reduction leads to many advantages of bigWig files over BAMs:
• data storage
• data sharing
• intuitive visualization via genome browsers
• more efficient for downstream analyses
• …

Computing coverage
DNA → Sonicated to ~200bp frags. → 50-100 base reads
[Figure: reads are extended to fragments along the genome; coverage per bin: 1 2 2 5 5 6 6 6 6 4 4; bins (e.g. 50 bp)]

Depth normalization
[Figure: reads are extended to fragments along the genome; coverage per bin: 1 2 2 5 5 6 6 6 6 4 4; bins (e.g. 50 bp)]

Practical II: bamCoverage

Outline
• Coverage Files
  – Depth normalization (bamCoverage)
  – Input normalization
• Peak calling
  – Why? How?
  – Types of peaks: Sharp, broad, mixed
• Downstream processing
  – This is the actually interesting part

Input samples – They’re important!!!
Input controls should be treated exactly the same* as ChIP samples except for the antibody treatment!
*same cell type, same shearing, same PCRs, same experimenter, …

Input samples – They’re important!!!
(not only) gene-rich regions = bias-rich regions (especially applicable to old sequencing data)
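The coverage computation and depth normalization from the preceding slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how bamCoverage is implemented: the read representation `(start, end, strand)`, the function names `binned_coverage` and `depth_normalize`, and the counts-per-million style scaling are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Hypothetical read representation: (leftmost start, end, strand),
# 0-based, end-exclusive coordinates on a single chromosome.
Read = Tuple[int, int, str]


def binned_coverage(reads: List[Read], genome_length: int,
                    fragment_length: int = 200, bin_size: int = 50) -> List[float]:
    """Extend each read to the assumed fragment length and return the
    mean per-base fragment coverage for consecutive bins."""
    per_base = [0] * genome_length
    for start, end, strand in reads:
        if strand == "+":
            # Forward read: the fragment extends to the right of the read start.
            frag_start, frag_end = start, start + fragment_length
        else:
            # Reverse read: the fragment extends to the left of the read end.
            frag_start, frag_end = end - fragment_length, end
        for pos in range(max(frag_start, 0), min(frag_end, genome_length)):
            per_base[pos] += 1

    bins = []
    for bin_start in range(0, genome_length, bin_size):
        chunk = per_base[bin_start:bin_start + bin_size]
        bins.append(sum(chunk) / len(chunk))
    return bins


def depth_normalize(bins: List[float], n_reads: int,
                    per: int = 1_000_000) -> List[float]:
    """Scale coverage so that samples sequenced to different depths
    become comparable (assumed scaling: coverage per `per` reads)."""
    scale = per / n_reads
    return [value * scale for value in bins]


if __name__ == "__main__":
    shallow = [(0, 50, "+"), (150, 200, "-")]
    deep = shallow * 2  # same signal, twice the sequencing depth
    for name, reads in (("shallow", shallow), ("deep", deep)):
        cov = binned_coverage(reads, genome_length=200, fragment_length=100)
        print(name, cov, depth_normalize(cov, n_reads=len(reads)))
```

The raw coverage of the deeper sample is twice as high in every bin, but after scaling by the number of reads both samples give the same values, which is the point of depth normalization.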
Comparative coverage of BAM files
typical application: input normalization for a ChIP sample
aim: reduce the background signal in the ChIP signal based on the input
main caveat: the same genomic region is never covered exactly the same way, which is neither the input’s nor the ChIP’s fault
• Normalization by total read count: very straightforward, perhaps too simple?
• Normalization by SES (Diaz et al., 2012): more sophisticated, based on bamFingerprint (greatest distance between input and ChIP), not recommended for broad marks due to weaker enrichment

Which normalization method?
• ratio
• log2(ratio)
• difference
• sum
• reciprocal_ratio
• SES
Base your decision on the kind of questions you would like to answer!

Practical III: bamCompare

Outline
• ChIP efficiency
• Coverage Files
  – Depth normalization
  – Input normalization
• Peak calling
  – Why? How?
  – Types of peaks: Sharp, broad, mixed
• Downstream processing
  – This is the actually interesting part

Peak calling
From DNA reads to protein binding sites
3 main tasks of peak calling programs:
1. identify original fragments
2. identify enriched regions
3. assign significance
different solutions for each step
> 30 different programs
[Figure: sequenced reads (CTACGGT…, ATCGCTG…, CATCGA…, GCATTG…) around a protein binding site]

Which peak caller to choose?
Again: base your decision on the kind of questions you would like to answer!
[Table from Wilbanks & Facciotti, 2010]

Why do we focus on MACS2?
• One of the most widely used peak callers, also used by big consortia, e.g., (mod)ENCODE, Blueprint, NIH Roadmap (reproducibility! comparability!)
• Under active development
• Can be used for sharp and mixed signals
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137

Peak Calling with MACS
1. Identify regions with enrichment, i.e., large no. of mapped reads (modeling by read shifting)
2. Determine peaks based on enrichments passing p-value* threshold
3.
Estimate false discovery rate (FDR*) for each detected peak
*p-value: probability of observing an enrichment at least this strong under the null hypothesis (small p-value = PEAK!); null hypothesis: reads are randomly distributed throughout the genome following a Poisson distribution (input is used to parameterize the statistical model)
*q-value/FDR: expected fraction of false positives among all peaks at least as significant as this one, given that testing is done genome-wide
[Figure from Park, Nature Reviews Genetics, 2009]

How does MACS2 work?
• Fragment length modeling
  – Sliding window (2x bandwidth)
  – >M-fold enrichment over background (effective genome size!)
  – Filter “peaks” (>5x, <50x enrichment)
• Peak calling
  – Reads extended
  – Samples scaled (need input)
  – Sliding window (2x frag. len.)
  – Enrichment according to Poisson variance
    • Use local noise for filtering
    • Use input variance for filtering

Properties influencing the peak calling
Know thy data!
• Library complexity (How many duplicates? Overrepresented regions?)
• Enrichment strength (IP success)
• Width and nature of the enriched regions (narrow vs. broad vs. mixed)
• No. of occupied sites
• Range of the ChIP signal intensities
ALWAYS inspect your data visually and manually! It’s not the peak caller that’s making sense of your data, it’s you!
Tools: FastQC, plotFingerprint, computeGCBias, Genome Browser

Important MACS parameters
• Specify effective/mappable genome size (-g)
• Fragment size: might be set manually, especially for paired-end data (for which fragment size can be determined separately, e.g., by Picard CollectInsertSizeMetrics)!
• Broad peak calling (--broad) should be turned on for basically all histone marks except H3K4me3 and perhaps H3K27ac
• Think about stringent filtering criteria on the peak lists computed by MACS!
• If you’re not satisfied, play with the parameters!
• You could make a workflow for the QC of peaks

Practical IV: Peak Calling

Outline
• ChIP efficiency
• Coverage Files
  – Depth normalization
  – Input normalization
• Peak calling
  – Why? How?
  – Types of peaks: Sharp, broad, mixed
• Downstream processing
  – This is the actually interesting part

Powerful visualization: heatmaps
use deepTools:
• computeMatrix
• plotHeatmap
• plotProfile
Requirements:
• bigWig file
• bed file
• question in mind!

Powerful visualization: heatmaps
Possible questions:
• What kinds of signal distributions do I see in my peaks?
• How does my signal look around the TSS/TES/my favorite region?
• How does my signal look when I assume the same size for all genes?
• …
Tools: computeMatrix, plotHeatmap, plotProfile

Powerful visualization: PCA/etc.

Practical V: Visualization with deepTools

Advanced topics: GC Bias
use deepTools:
• computeGCBias
• correctGCBias

Advanced Topics: Blacklisted Regions
• Peaks in the same place
  – Regardless of ChIP type
  – Regardless of cell type
  – Regardless of experiment
• These are false positives – known sites to ignore
  https://sites.google.com/site/anshulkundaje/projects/blacklists
• These regions can screw up scaling!
Tools:
• DeepTools: can ignore blacklisted sites
• NGS utils: “BAM filter”
• Bedtools intersect (with peaks to remove overlaps)

Now you can...
• Assess ChIP quality/IP strength
• Understand the difference between BAM files and bigWig files
• Understand why, when and how one needs to normalize ChIP-seq data
• Know the basics of peak calling (why and how is it done?)
• Be able to work with the output from MACS2 (e.g., filtering of peaks, visualization in a Genome Browser)
• Know how to generate coverage plots, e.g., heatmaps

Where to go for help
• DeepTools:
  https://groups.google.com/forum/#!forum/deeptools
  http://deeptools.readthedocs.org
• MACS google group:
  https://groups.google.com/forum/#!forum/macs-announcement
  https://github.com/taoliu/MACS/wiki
• Biostars: www.biostars.org
• Galaxy support: http://biostar.usegalaxy.org

Slides will be posted to GCC2016 website!
Slides/datasets/pages will be @ http://deeptools.ie-freiburg.mpg.de