Accurate promoter and enhancer identification in 127

bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Accurate promoter and enhancer identification in 127 ENCODE and
Roadmap Epigenomics cell types and tissues by GenoSTAN
Benedikt Zacher1,∗ , Margaux Michel4 , Björn Schwalb4 , Patrick Cramer4 ,
Achim Tresch2,3,∗∗ ,Julien Gagneur1,5,∗∗∗
February 23, 2016
1. Feodor-Lynen-Str. 25, Munich, Germany, Gene Center and Department of Biochemistry, Center for Integrated
Protein Science CIPSM, Ludwig-Maximilians-Universität München.
2. Zülpicher Str. 47, Cologne, Germany, Department of Biology, University of Cologne.
3. Carl-von-Linne-Weg 10, 24105 Cologne, Germany, Max Planck Institute for Plant Breeding Research.
4. Am Fassberg 11, Göttingen, Germany, Department of Molecular Biology, Max Planck Institute for Biophysical
Chemistry
5. Present address: Technische Universität München, Department of Informatics, Boltzmannstr. 3, 85748
Garching, Germany
*Corresponding Author: Benedikt Zacher, Tel.: +49-89-2180-76740; Fax: +49-89-2180-76797; E-mail:
[email protected]
**Corresponding Author: Achim Tresch, Tel.: +49-221-470-8292; Fax: +49-221-5062-163; E-mail:
[email protected]
*** Corresponding Author: Julien Gagneur, Tel.: +49-89-2891-19411; Fax: +49-89-2891-19414; E-mail:
[email protected]
1
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Abstract
Accurate maps of promoters and enhancers are required for understanding transcriptional regulation. Promoters
and enhancers are usually mapped by integration of chromatin assays charting histone modifications, DNA accessibility, and transcription factor binding. However, current algorithms are limited by unrealistic data distribution
assumptions. Here we propose GenoSTAN (Genomic STate ANnotation), a hidden Markov model overcoming these
limitations. We map promoters and enhancers for 127 cell types and tissues from the ENCODE and Roadmap
Epigenomics projects, today’s largest compendium of chromatin assays. Extensive benchmarks demonstrate that
GenoSTAN consistently identifies promoters and enhancers with significantly higher accuracy than previous methods. Moreover, GenoSTAN-derived promoters and enhancers showed significantly higher enrichment of complex
trait-associated genetic variants than current annotations. Altogether, GenoSTAN provides an easy-to-use tool to
define promoters and enhancers in any system, and our annotation of human transcriptional cis-regulatory elements
constitutes a rich resource for future research in biology and medicine.
2
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Introduction
Transcription is tightly regulated by cis-regulatory DNA elements known as promoters and enhancers. These
elements control development, cell fate and may lead to disease if impaired. A promoter is functionally defined
as a region that regulates transcription of a gene, located upstream and in close proximity to the transcription
start sites (TSSs) [3]. In contrast, an enhancer was originally functionally defined as a DNA element that can
increase expression of a gene over a long distance in an orientation-independent fashion relative to the gene [8]. The
functional definition of enhancers and promoters leads to practical difficulties for their genome-wide identification
because the direct measurement of the regulatory activity of genomic regions is hard, with current approaches
leading to contradicting results [35, 30, 7].
Since the direct measurement of cis-regulatory activity is challenging, a biochemical characterization of the
chromatin at these elements based on histone modifications, DNA accessibility, and transcription factor binding
has been proposed [53, 6, 17, 26, 33]. This approach leverages extensive genome-wide datasets of chromatinimmunoprecipitation followed by sequencing (ChIP-Seq) of transcription factors (TFs), histone modifications, or
Cap analysis gene expression (CAGE) that have been generated by collaborative projects such as ENCODE [13, 61],
NIH Roadmap Epigenomics [12], BLUEPRINT [1] and FANTOM [5, 14].
In this context, the computational approaches employed to classify genomic regions as enhancers or promoters
play a decisive role [53, 33]. As the experimental data are heterogeneous, we generally refer to them as tracks. Several studies used supervised learning techniques to predict enhancers based on tracks such as histone modifications
or P300 binding (e.g. [32, 39, 51, 60]). However, a training set of validated enhancers is needed in this case, which is
hard to define since only few enhancers have been validated experimentally so far and these might be biased towards
specific enhancer subclasses. Alternatively, unsupervised learning algorithms were developed to identify promoters
and enhancers from combinations of histone marks and protein-DNA interactions alone [17, 20, 27, 26, 64, 29, 12, 13].
These unsupervised methods perform genome segmentation, i.e. they model the genome as a succession of segments
in different chromatin states defined by characteristic combinations of histone marks and protein-DNA interactions
found recurrently throughout the genome. All popular genome segmentations are based on hidden Markov models
[49]. However, these methods differ in the way the distribution of ChIP-seq signals for each chromatin state is
modeled. ChromHMM [17, 18, 20], one of the two methods applied by the ENCODE consortium, requires binarized
ChIP-seq signals that are then modeled with independent Bernoulli distributions. Consequently, the performance
of ChromHMM highly depends on the non-trivial choice of a proper binarization cutoff. Moreover, quantitative
information is lost with this approach, which is especially important for distinguishing promoters from enhancers
since these elements are both marked with H3K4me1 and H3K4me3, but at different ratios [2]. Segway [26, 27], the
other method applied by the ENCODE consortium, uses independent Gaussian distributions of log-transformed and
smoothed ChIP-seq signal. Although Segway preserves some quantitative information, the transformation of the
original count data leads to variance estimation difficulties for very low counts. Therefore, Segway further makes
the strong assumption that all tracks have the same variance. Recently, EpicSeg [43] used a negative multinomial
distribution to directly model the read counts without the need for data transformations. However, similar to the
variance model of Segway, the EpicSeg model leads to a common dispersion (the parameter adjusting the variance of
the negative multinomial) for all tracks. Moreover, EpicSeg does not provide a way to correct for sequencing depth,
which makes the application to data sets with multiple cell types with varying library sizes difficult. Also, EpicSeg
has been applied only to three cell types so far [43]. These methods not only differ in their modeling assumptions
but also lead to very different results. In the K562 cell line for instance, ChromHMM identified 22,323 enhancers
3
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
[13], Segway 38,922 enhancers [13], and EpicSeg 53,982 enhancers [43]. Altogether, improved methods and detailed
benchmarkings are required for a reliable annotation of transcriptional cis-regulatory elements.
Here we propose a new unsupervised genome segmentation algorithm, GenoSTAN (Genomic State Annotation
from sequencing experiments), which overcomes limitations of current state-of-the-art models. GenoSTAN learns
chromatin states directly from sequencing data without the need of data transformation, while still having trackspecific variance models. We applied GenoSTAN to a total of 127 cell types and tissues covering 16 datasets of
ENCODE and all 111 datasets of the Roadmap Epigenomics project as well as another ENCODE ChIP-seq dataset
for the K562 cell line. GenoSTAN consistently performed better when benchmarked against Segway, ChromHMM
and EpicSeg segmentations using independent evidence for activity of promoter and enhancer regions. Co-binding
analysis of TFs reveals that promoters and enhancers both shared the Polymerase II core transcription machinery
and general TFs, but they are bound by distinct TF regulatory modules and differ in many biophysical properties.
Moreover, GenoSTAN enhancer and promoter annotations had a higher enrichment for complex trait-associated
genetic variants than previous annotations, demonstrating the advantage of GenoSTAN and our chromatin state
map to understand genotype-phenotype relationships and genetic disease.
Results
Modeling of sequencing data with Poisson-lognormal and negative binomial distributions
We developed a new genomic segmentation algorithm, GenoSTAN, which implements hidden Markov models with
more flexible multivariate count distributions than previously proposed. Specifically, GenoSTAN supports two
multivariate discrete emission functions, the Poisson-lognormal distribution and the negative binomial distribution.
For the sake of reducing running time, the components of these multivariate distributions are assumed to be
independent. However, the variance is modeled separately for each state and each track, which provides a more
realistic variance model than current approaches. To be applicable to data sets with replicate experiments or
multiple cell types, GenoSTAN corrects for different library sizes of experiments (Methods). All parameters are
learnt directly from the data, leaving the number of chromatin states as the only parameter to be manually set.
We provide an efficient implementation of the Baum-Welch algorithm for inference of model parameters, which
can be run in a parallelized fashion using multiple cores. The method is implemented as part of our prevsiouy
published R/Bioconductor package STAN [62], which is freely available from http://bioconductor.org/. Altogether,
GenoSTAN uniquely combines flexible count distributions, short running times, and minimal number of manually
entered parameters (Figure 1A).
We performed an extensive benchmarking of GenoSTAN against alternative methods (Figure 1B). Benchmark I
compares GenoSTAN and the three alternative methods for the K562 cell line. K562 is a major model system to
study human transcription and the ENCODE cell line with the largest number of experiments [13]. Moreover, all
three other methods (ChromHMM, Segway, and EpicSeg) had been run by their own authors for K562, so that the
algorithm parameters for these annotations can be assumed to have been set at best expert knowledge. Benchmark
II compares methods for all 127 cell types and tissues provided by ENCODE and Roadmap Epigenomics. This is
the largest benchmark dataset of all three we considered but also the one with the least number of common tracks
(5 chromatin marks: H3K4me1, H3K4me3, H3K36me3, H3K27me3, and H3K9me3, Figure 1B). Benchmark III
4
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
compares results for a subset of 20 ENCODE and Roadmap Epigenomics cell types and tissues which had moreover
H3K27ac, H3K9ac ChIP-seq and DNAse I hypersensitivity data (DNAse-Seq). This distinction allowed to provide
more accurate annotations for the better characterized cell types and tissues. For benchmark II and benchmark III
only annotations for the method ChomHMM were available to compare against GenoSTAN annotation. Figure 1B
lists all model names and studies that were considered.
Benchmark I: Improved chromatin state annotation in human K562 cells
We first fitted two GenoSTAN models, one with Poisson-lognormal emissions (henceforth referred to as GenoSTANPoilog-K562 model) and one with negative binomial emissions (GenoSTAN-nb-K562 model) to a dataset of ChIP-seq
data of 9 histone modifications, of the histone acetyltransferase P300, and DNA accessibility (by DNase-Seq) data
for the K562 cell line at 200 bp binning resolution (Methods). As pointed out by others [17, 26], there is no
purely statistical criterion for choosing the number of states from the data of practical usage in such a setting.
In practice, the number of states is manually defined by trading off goodness of fit against interpretability of the
model [17, 26, 62]. For GenoSTAN-Poilog-K562, we used 18 chromatin states. For GenoSTAN-nb-K562, we used 23
states, since lower state numbers did not provide enough resolution to give a fine-grained map of chromatin states
on this data set (Methods). Figure 2A compares the two GenoSTAN segmentations to segmentations from other
studies using ChromHMM, Segway and EpicSeg on a region containing the TAL1 gene together with three known
enhancers [43, 27, 13, 25].
Chromatin states recover biologically meaningful features
In order to assign biologically meaningful labels to each state of the Poilog-K562 and of the nb-K562 GenoSTAN
models, we investigated their read coverage distributions and overlapped the occurrence of a state in the genome with
known genomic features. In line with previous studies, this led to the definition of promoter, enhancer, repressed,
actively transcribed and low coverage states [43, 20, 27]. The median read coverage in state segments and genomic
distributions were very similar for both the Poilog-K562 and the nb-K562 models (Figure 2B, Supplementary Figure
1). Promoter states were characterized by a low (< 1) H3K4me1/H3K4me3 ratio, in contrast to enhancer states
which showed a high ratio (> 1). Further, P300 levels were roughly two-fold higher in the enhancer state, which
is in accordance with previous observations [2, 45, 55]. Promoter (Prom) states were located close to annotated
GENCODE TSSs [24], with a median distance of 220 bp for GenoSTAN-Poilog-K562 and 400 bp for GenoSTAN-nbK562 model. Enhancer states (Enh) on the other hand were located further away from TSSs, with a median distance
of 3.6 kb (Poilog-K562) (respectively 5.8 kb for nb-K562, Figure 2B, Supplementary Figure 1). Promoter and
enhancer states also differed in their DNA sequence features. 45% of CpG islands were located within promoter states
(strong, weak and promoter flanking states) in both models, but only 3% in enhancer states (strong, weak, enhancer
flanking states, Figure 2B, Supplementary Figure 1). While promoter states mostly recovered stable TSSs, enhancer
states were located at unstable TSSs (GRO-cap TSSs that are not recovered by the GENCODE annotation), which
supports previous findings [16]. Furthermore, both models (GenoSTAN-nb-K562 and GenoSTAN-Poilog-K562)
contained 3 states, that we classified as “actively transcribed states”, which were characterized by high values of
H3K36me3 and overlap with UTRs, introns and exons. Two out of three “transcribed” states were also enriched in
promoter associated marks (H3K4me1-3, H3K27ac, H3K9ac) and H4K20me1 and thus represented 5’ transitions
in transcription. Moreover, both models fitted four repressed states showing high read coverage of H3K27me3.
Two of these states also exhibited high DNase-Seq and promoter/enhancer associated histone modification signals,
5
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
suggesting that these states might reflect repressed regulatory regions (ReprEnh, ReprD). These elements were
distal to annotated GENCODE TSSs (median distance: 5.2-11.8 kb). ReprEnh states were also enriched in P300
and recovered 0.2% of CpG islands, while ReprD states had lower P300 levels and recovered 8-9% of CpG islands
in the genome (Figure 2B, Supplementary Figure 1). The remaining states exhibited low coverage in chromatin
marks and therefore were labeled as “low” states. Altogether, GenoSTAN accurately recovered many features of
known chromatin states and provided a high resolution map of these in K562.
High variation of enhancer predictions between chromatin state annotations of different studies
To assess the consistency of promoter and enhancer predictions across studies, we compared the GenoSTAN
segmentations to other published segmentations in K562 by ChromHMM (’ChromHMM-ENCODE’ [13, 27] and
’ChromHMM-Nature’ [20]), Segway (’Segway-ENCODE’ [13, 27], ’Segway-nmeth’ [26] and ’Segway-Reg.Build’ [64])
and EpicSeg [43]. We computed pairwise Jaccard indices (the ratio of the number of common elements over all
elements predicted by two methods) of promoter and enhancer states to quantify the agreeement between the predictions of the different studies (Supplementary Figure 2). Promoter state annotations generally agreed well (median
Jaccard-Index: 0.78). However, enhancer prediction varied more (median Jaccard Index: 0.48), suggesting that
enhancers are more difficult to annotate. This variation of enhancer calls was also reflected in the different numbers
of annotated enhancer segments, which had been shown to vary greatly between different prediction methods [32].
The number of enhancer segments ranged from 10,932 segments in GenoSTAN-Poilog-K562 to 80,043 segments in
one Segway annotation [26] (Supplementary Table 1). Therefore, a thorough assessment of these predictions was
necessary to provide a robust and accurate prediction of these elements.
Comparison of GenoSTAN with published chromatin state annotations
In order to benchmark the different segmentations, we used independent data including evidence of transcriptional
activity (GRO-cap TSSs [16]), of transcription factor binding (ENCODE high occupancy target, or HOT regions
[61], and ENCODE TF binding sites [13]), and of cis-regulatory activity (enhancer activity assessed by reporter
assays [30]), which are all expected to be characteristics of promoters and enhancers. Transcription initiation activity is not only the hallmark of promoters, but also of enhancers [31, 5, 14, 16]. To benchmark the predictions
using evidence for transcription, we used published data from a protocol called GRO-cap [16], a nuclear run-on
protocol, which very sensitively maps transcription start sites genome-wide. To this end, we sorted for each method
chromatin states by their overlap with GRO-cap TSSs by decreasing precision. Starting with the most precise
state (i.e. highest overlap with TSSs) we calculated cumulative recall and false discovery rate (FDR) by subsequently adding states with decreasing precision (Figure 3A). GenoSTAN-Poilog-K562 had the highest recall and the
lowest FDR (Methods, Figure 3A). GenoSTAN-nb-K562 performed similar to other segmentations (Segway-Reg.
Build, ChromHMM-ENCODE). In particular, 94% of GenoSTAN-Poilog-K562 promoters (Prom.11) and 81% of
its enhancer regions (Enh.15) overlapped with GRO-cap TSSs. This compares to 85% (Prom.16) and 65% (Enh.6)
of GenoSTAN-nb-K562 and 89% (Tss) and 52% (Enh) of ChromHMM-ENCODE promoter and enhancer regions.
Interestingly, the two ChromHMM segmentations (ChromHMM-ENCODE [27, 13], ChromHMM-Nature [20]) had
very different accuracies for TSSs, which might be due to different data sets or cutoffs used for ChIP-seq binarization
in the two studies. In contrast, the overall accuracy of the Segway annotations was comparable across studies. This
comparison shows that GenoSTAN chromatin state annotation identifies putative promoters and enhancers which
show transcriptional activity more frequently than previous annotations.
6
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
GRO-cap is a very sensitive method that captures also a large amount of TSSs of unstable transcripts. However
it is limited to capped RNA species, misses RNAs below the detection threshold and cannot be used to validate repressed (i.e. transcriptionally inactive) regulatory elements. To address these shortcomings we used two additional
independent features, TF binding and HOT regions. The binding of TFs to a region of DNA is a pre-requisite for
potential regulatory function and transcriptional activity. High Occupancy of Target (HOT) regions are genomic regions which are bound by a large number of different transcription-related factors [61], which were shown to function
as enhancers [34] and are enriched in disease- and trait-associated genetic variants [40]. As for the benchmark with
TSSs, we sorted chromatin states by overlap with HOT regions by decreasing precision and calculated cumulative
recall and FDR (Figure 3B). The best performing segmentations for HOT regions were GenoSTAN-Poilog-K562 and
GenoSTAN-nb-K562, followed by ChromHMM-ENCODE. The ordering of states with HOT regions was indeed different from the GRO-cap TSSs benchmark. Additionally to GenoSTAN promoter and enhancer states, the repressed
enhancer state frequently overlapped with HOT regions with an overall precision of 81% (GenoSTAN-Poilog-K562)
and 77% (GenoSTAN-nb-K562). In comparison, the top three ChromHMM-ENCODE states had together a precision of 67%. All other segmentation methods showed a lower precision and recall for HOT regions. This was
also reflected in the frequency of individual TF binding sites at enhancer regions, which were generally higher in
GenoSTAN enhancer states than in other segmentations (Figure 3C). In particular, only a very small fraction of
EpicSeg and Segway-nmeth enhancers were found to be bound by TFs. EpicSeg and Segway-nmeth segmentations
were also those with the highest number of predicted enhancers, suggesting that many of these predictions are
spurious.
Next, we calculated the recall of FANTOM5 promoters [14] and enhancers [5] to assess how well the models distinguish promoters from enhancers, as it was evident from inspection of specific examples that this distinction was
difficult to be established by current methods (Figure 2A). The FANTOM5 consortium have performed extensive
mapping of capped transcripts 5’ ends using CAGE and defined enhancers and promoters based on transcriptional
activity pattern. FANTOM5 enhancers were defined as regions showing balanced bidirectional capped transcripts,
a hallmark of enhancer RNAs [5], whereas FANTOM5 promoters were defined as regions where transcription was
biased towards one direction. The FANTOM5 annotation of enhancers and promoters could not entirely replace a
chromatin state based approach because (i) the use of expression data in FANTOM5 limits the identified regulatory
regions to transcriptionally active elements and (ii) CAGE was shown to be not as sensitive to rapidly degraded
transcripts as GRO-cap and therefore might miss regulatory enhancers with unstable transcripts [16]. Nonetheless, FANTOM5 provides an annotation of enhancers and promoters based on independent data that is well suited
to assess how well the models distinguish promoters from enhancers. We filtered the FANTOM5 annotation to
promoters and enhancers for activity in K562 by overlapping them with DHS [13] and GRO-cap TSSs [16]. We
considered that a promoter state performed well, when the recall of FANTOM5 promoters was high and the recall of
FANTOM5 enhancers was low and vice versa for enhancer states. GenoSTAN-Poilog-K562 and ChromHMM-nature
enhancer states recall most FANTOM5 enhancers (60%, Figure 3D, Supplementary Table 2). For enhancer states,
the recall of FANTOM5 promoters was around 10% except for those of Segway-ENCODE, which recalls almost 35%
of FANTOM5 promoters and EpicSeg which 21% FANTOM5 promoters. In accordance with this, many promoter
regions were erroneously classified as enhancer regions in this segmentation (e.g. TAL1 promoter in Figure 2A). The
recall of FANTOM5 enhancers by promoter states was generally higher (17% - 37%). GenoSTAN-Poilog-K562 and
7
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
-nb-K562 recalled more than 90% of FANTOM5 promoters and around 20% of FANTOM5 enhancers which is comparable to other studies (Segway-nmeth, ChromHMM-nature, EpicSeg). ChromHMM-ENCODE promoter states
had a comparable recall of FANTOM5 promoters (92%), but higher recall of FANTOM5 enhancers (37%) (Figure
3D). This strong overlap of ChromHMM-ENCODE promoters with FANTOM5-labeled enhancers is in accordance
with our observation that some enhancer regions were errenuously classified as promoters in ChromHMM-ENCODE
(Figure 2A). These results show that GenoSTAN segmentations distinguish promoters from enhancers at similar or
better accuracy than other segmentations.
So far we only used indirect evidence (TSSs, HOT regions, TF binding, FANTOM5 enhancer) to draw conclusions
about the cis-regulatory activity of a candidate enhancer. As additional and direct evidence for the cis-regulatory
activity of enhancer regions inferred by GenoSTAN, we overlapped our enhancers to genomic sequences that were
previously tested for cis-regulatory activity in a reporter assay, where candidate elements had been cloned into
a plasmid upstream of the promoter of a reporter gene [30]. Enhancers from GenoSTAN segmentations showed
significantly higher activity than repressed or low coverage regions (GenoSTAN-Poilog-K562 & GenoSTAN-nbK562: p-value < 0.001 wilcoxon-test, Figure 3E). Interestingly, repressed regions (marked by H3K27me3) showed
lower activity than low coverage regions. Moreover, GenoSTAN-Poilog-K562 enhancers showed significantly higher
enhancer activity than those of three other studies (Figure 3F), including the original study (p-value < 0.01,
ChromHMM-nature enhancers) by Kheradpour et al. [30]. This analyis shows that GenoSTAN has higher succes
rate in predicting in vivo enhancer activity than previous methods.
Comparison of the GenoSTAN, ChromHMM, Segway and EpicSeg algorithms on a common dataset
The K562 genome segmentations of ChromHMM, Segway and EpicSeg used so far were derived from different
combinations of data tracks. To verify that the favorable performance of GenoSTAN is mainly due to an improved modeling and not due to different data, we also ran ChromHMM, Segway and EpicSeg on the same data
as GenoSTAN-Poilog-K562 and GenoSTAN-nb-K562. Both GenoSTAN-Poilog-K562 and GenoSTAN-nb-K562 had
a lower FDR at a similar or higher recall than all three other methods (Supplementary Figure 3). Moreover, we
found that changing the binarization for ChromHMM dramatically affected its outcome. Without further manual
processing of the data, ChromHMM fitted only one transcriptionally active state, which modeled both promoters
and enhancers, regardless of state number (Supplementary Figure 3). We suspected that the high read coverage
in the H3K4me1 and H3K4me3 signal tracks made promoters and enhancers indistinguishable after binarization
(H3K4me1 and H3K4me3 were called present at both, promoters and enhancers, and they were both called absent
elsewhere). When all data tracks were subsampled to the same (and lower) library size, this problem was solved
and ChromHMM fitted multiple transcriptionally active states thereby distinguishing promoters from enhancers
and, at the same time, increased in accuracy (Supplementary Figure 3). The same problem occurred for Segway,
but changing Segway’s parameters did not help distinguish different transcriptionally active chromatin states.
To make sure that these results did not depend on the arbitrary choice of the number of states, we ran each method
using 10 to 30 states (Methods) and calculated the precision of each state S for recalling HOT (respectively TSS)
regions as the fraction of all segments annotated with S that overlapped with a HOT (respectively TSS) region.
For each number of states and each segmentation algorithm, we determined the state with highest precision (Supplementary Fig. 4A, B). Independently of the number of states, GenoSTAN-Poilog and GenoSTAN-nb consistently
performed best. Even at low state numbers, precision remained constantly high, while it decreased considerably for
8
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
other methods. We also derived an area under curve (AUC) score for each model, to assess the spatial accuracy
in calling TSSs or HOT regions (Supplementary Fig. 4C, D and Methods). Again, AUC scores were consistently
highest for the GenoSTAN segmentations.
Altogether this extensive benchmark in the K562 cell line demonstrates that GenoSTAN-Poilog and to a slightly
lesser extent GenoSTAN-nb, outperforms current chromatin state annotation algorithms for identifying enhancers
and promoters.
Benchmarks II and III: Chromatin state annotation for ENCODE and Roadmap
Epigenomics cell types and tissues
We next applied GenoSTAN to 127 cell types and tissues from ENCODE and Roadmap Epigenomics, the largest
compendium of chromatin-related data, using genomic input and the five chromatin marks H3K4me1, H3K4me3,
H3K36me3, H3K27me3, and H3K9me3 that have been profiled across the whole compendium [12] (Supplementary
Figure 5, Benchmark II, Figure 1B for data tracks and model names). Moreover, we performed a dedicated analysis
to 20 of these cell types and tissues which had three further important data tracks: H3K27ac, H3K9ac and DNaseSeq (Supplementary Figure 6, Benchmark III, Figure 1B for data tracks and model names). These further three
tracks are important features of active promoters and enhancers, which can lead to more precisely mapped enhancer
boundaries [13]. We performed similar comparisons as decribed above to the three available segmentations from
the Roadmap Epigenomics project with 15, 18 and 25 states (ChromHMM-15, -18, and -25) [12, 19]. All methods
were less performant than in Benchmark I, possibly due to lower read coverage or to less rich data. Nonetheless,
the GenoSTAN annotations consistently outperformed the existing ones. Specifically, this held when assessing the
recovery of FANTOM5 CAGE tags (Figure 4A, assessed for all 127 cell types and tissues), of GRO-cap TSSs (Figure
4B assessed for the cell types with available GRO-Cap TSSs), and of HOT regions (Figure 4C, assessed for the
cell types with available HOT regions). Moreover, both GenoSTAN models distinguished better promoters from
enhancers than previous annotations (Figure 4D, Supplementary Table 2). The low accuracy of ChromHMM-15
and ChromHMM-18 promoters might be caused by frequent state switching between the promoter and promoter
flanking state (Supplementary Figure 7). Consequently, the number of promoter regions in K562 was up to 30%
higher in the ChromHMM-15 and -18 segmentations than in the GenoSTAN or ChromHMM-25 segmentations
(Supplementary Table 1). The number of predicted enhancers (in K562) also differed greatly. The ChromHMM15 and -18 state models predict 92,824 (7_Enh) and 22,678 (9_EnhA1) enhancers, while ChromHMM-25 predicts
12,706 and GenoSTAN-Poilog-20 and -127 predict 15,655 and 45,955 enhancers (Supplementary Table 1). Although
GenoSTAN predicted more enhancers than the ChromHMM-25 model, the fraction of putative enhancers bound
by individual TFs was greater (Figure 4E). For instance 46% (25%) of enhancers were bound by Pol II in the
GenoSTAN-Poilog-20 (-127) model, compared to 8%, 18% and 36% in the ChromHMM 15, 18 and 25 state models.
Also, the lineage-specific enhancer-binding transcription factor TAL1 binds at 37% (GenoSTAN-Poilog-20) and 27%
(GenoSTAN-Poilog-127) of predicted enhancers. Conversely, 13%, 16% and 27% of putative enhancers were bound
by TAL1 in the respective 15, 18 and 25 state ChromHMM models (Figure 4E). Collectively, these results show
that the improved performance of GenoSTAN is not restricted to the K562 dataset.
9
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Cell-type specific enrichment of disease- and other complex trait-associated genetic
variants at promoters and enhancers
Previous studies showed that disease-associated genetic variants are enriched in potential regulatory regions [12,
20, 58, 57, 54, 42] demonstrating the need for accurate maps of these elements to understand genotype-phenotype
relationships and genetic disease. To study the potential impact of variants in regulatory regions on various traits
and diseases, we overlapped our enhancer and promoter annotations from 127 cell types and tissues (Benchmark
II, Figure 1B) with phenotype-associated genetic variants from the NHGRI genome-wide assocation studies catalog
(NHGRI GWAS Catalog [59]). First, we intersected trait-associated variants with enhancer and promoter states
(GenoSTAN-Poilog-127). Overall, 37% of all trait-associated SNPs were located in potential enhancers and 7%
in potential promoters. The number of traits significantly enriched (at FDR <0.05) with enhancers or promoters
in at least one cell type or tissue was larger for GenoSTAN-Poilog-127 (69 traits for GenoSTAN-Poilog-127 for
enhancers and 20 traits for promoters) than for the best performing ChromHMM-model (ChromHMM-15, 64 traits
for enhancers and 18 traits for promoters). The better performance of GenoSTAN-Poilog-127 was found at all FDR
cutoffs (Supplementary Figure 8). To control for the fact that methods can differ among each other regarding the
length of the promoters and enhancers they predict, we furthermore computed the recalls of GWAS variants for a
fixed genomic coverage. Restricting to a total genomic coverage of 2% (random subsetting, also allowing confidence
interval computation, Methods), enhancers of all GenoSTAN models overlaped a higher fraction of GWAS variants
at a similar to better per base pair density compared to the current ChromHMM annotations (Figure 5A). The
same trend was observed for promoters when restricting to 1% of genomic coverage (Figure 5B). The improved
overlap with trait-associated variants indicates that GenoSTAN annotation has a higher enrichment for functional
elements than the current annotation.
In accordance with previous studies [12, 20] we found that individual variants were strongly enriched in enhancer
or promoter states specifically active in the relevant cell types or tissues (Figure 5C, Supplementary Figure 8C).
Variants associated with height were significantly associated with osteoblasts (at FDR <0.001 here and after,
performed on Benchmark II for consistency across cell types and tissues). Variants associated with immune response
or autoimmune disorders were enriched in B- and T-cell enhancers (Figure 5C) and promoters (Supplementary
Figure 8C). These incude for instance HIV-1 control, autoimmune disease associated SNPs for systemic lupus
erythematosus, inflammatory bowel disease, Ulcerative colitis, Rheumatoid arthritis, Primary biliary cirrhosis and
Multiple sclerosis. Variants associated with electrocardiographic traits and QT interval were enriched in fetal heart
enhancers. SNPs associated with colorectal cancer were enriched in enhancers specific to the digestive system.
These results illustrate that the annotation of potential promoters and enhancers generated in this study can be of
great use for interpreting genetic variants associated, and underscore the importance of cell-type or tissue specific
annotations.
A novel annotation of enhancers and promoters in human cell types and
tissues
We then compiled the results from the best performing annotations for each cell type and tissue into a single
annotation file. The combined annotation file is available as Supplementary Data. All individual chromatin state
annotations are available at http://i12g-gagneurweb.in.tum.de/public/paper/GenoSTAN. For the combined anno-
10
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
tation file, we chose GenoSTAN with Poisson-lognormal in every instance, as it performed best in almost every
comparison we conducted. We used the results from benchmark I for K562, from benchmark III for the 20 cell
types and tissues, and from benchmark II for all the remaining Roadmap Epigenomics cell types and tissues. Overall, our annotation reports typically between 8,945 and 16,750 (10% and 90% quantiles of number of promoters
across all 127 cell types and tissues) active promoters per cell type or tissue. This number is consistent with the
typical number of expressed genes per tissue (in 11,953 to 16,869 range, [52]). However, the median width of these
elements depends on the data on which the annation was based. For the benchmark III dataset, promoters are much
narrower (800bp median) than for the K562 annotations (1.4 kb, Benchmark I data sst), suggesting that promoter
regions in the 20 cell types more accurately recover DNase hypersensitivity sites (DHS) of the core promoter (Figure
2, Supplementary Figure 6). The number of enhancers per cell type or tissue varied more greatly (between 8,208
and 33,596 for the 10% and 90% quantiles). The large variation of the number of enhancers might be partly due to
differences of sensitivity in complex biological samples. Consistent with this hypothesis, much fewer enhancers were
identified in tissues than in primary cells and cell lines (Supplementary Figure 9) likely because enhancers that are
active only in a small subsets of all cell types present of a tissue may be not detected. As more cell-type specific
data will be available, improved maps can be generated. The GenoSTAN software, which is publicly available, will
be instrumental to update these genomic annotations.
Promoters and enhancers have a distinct TF regulatory landscape
The biochemical distinction between enhancers and promoters is a topic of debate [53, 6]. We explored to which
extent enhancers and promoters are differentially bound by TFs using the K562 cell line dataset because i) we
obtained the most accurate annotation for this cell line (GenoSTAN-Poilog-K562, Benchmark I) and ii) ChIP-seq
data was available for as many as 101 TFs in this cell line [13]. Nine TF modules were defined by clustering based on
binding pattern similarity across enhancers and promoters (Methods, Figure 6). These 9 TF modules were further
characterized by the propensity of their TFs to bind promoters, enhancers or both (Figure 6). In accordance with
previous studies [4, 22], this recovered many complexes and promoter-associated and enhancer-associated proteins,
including the CTCF/cohesin complex (CTCF, Rad21, SMC3, Znf143), the AP-1 complex (Jun, JunB, FOSL1,
FOS), Pol3, promoter and enhancer associated modules, and factors associated with chromatin repression (EZH2,
HDAC6).
Moreover, the modules identified provided insights into the distinction of promoters and enhancers. On the one
hand, some TFs are common to both enhancers and promoters, which supports previous reports [5, 6]. In accordance with the recent finding of widespread transcription at enhancers [16], Pol II and multifunctional TFs Myc,
Max, and MAZ [56] are part of a TF module - which we called the Promoter-Enhancer-Module (PEM) - which had
approximately equal binding preferences for promoter and enhancer states, but also co-localized with other TFs
specifically binding enhancers or promoters (Figure 6).
On the other hand enhancers and promoters were also bound by distinct TFs, which is consistent with previously reported TF co-occurrence patterns at gene-proximal and gene-distal sites [22, 4]. Among the promoter and
enhancer-associated proteins we defined Promoter module 1 and 2 (PM1, PM2), Enhancer module 1 and 2 (EM1,
EM2), which had a strong preference for binding either a promoter or an enhancer, but exhibited different cobinding rates (Figure 6). Promoter module 1 contained TFs which were specifically enriched in promoter states and
associated with basic promoter functions, such as chromatin remodeling (CHD1, CHD2), transcription initiation or
11
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
elongation (TBP, TAF1, CCNT2, SP1) and other TFs involved in the regulation of specific gene classes (e.g. cell
cycle: E2F4) [56]. However, it also included TFs known as transcriptional repressors (e.g. Mxi1, a potential tumor
suppressor, which negatively regulates Myc). While TFs in PM1 showed a high co-binding rate, PM2 factors exhibited low co-binding. This might be partially explained by lower efficiency of the ChIP, since PM2 also contained
general TFs such as TFIIB, TFIIF or the Serine 2 phospho-isoform of Pol II, which are expected to co-localize with
other general TFs from PM1.
EM1 contained TFs with high co-binding rate, which included TAL1, an important lineage-specific regulator for erythroid development (K562 are erythroleukemia cells) and which had been shown to interact with CEBPB, GATA1
and GATA2 at gene-distal loci [22, 47]. It also contained the enhancer-specific transcription factor P300 [55] and
transcriptional activators (e.g. ATF1) and repressors (e.g. HDAC2, REST) [56]. Analogously to PM2, EM2 contained enhancer-specific transcriptional activators and repressors with a low co-binding rate.
Altogether this analysis highlights the common and distinctive TF binding properties of enhancers and promoters.
Discussion
We introduced GenoSTAN, a method for de novo and unbiased inference of chromatin states from genome-wide
profiling data. In contrast to previously described methods for chromatin state annotation, GenoSTAN directly
models read counts, thus avoiding data transformation and the manual tuning of thresholds (as in ChromHMM
and Segway), and variance is not shared between data tracks or states (as in EpicSeg and Segway) [43, 17, 26].
GenoSTAN is released as part of the open-source R/Bioconductor package STAN [62, 28, 21], which provides a
fast, multiprocessing implementation that can process data from 127 human cell types in less 3-6 days (GenoSTANPoilog-127: 6 days, -nb: 3 days).
Application of GenoSTAN significantly improved chromatin state maps of 127 cell types and tissues from the ENCODE and Roadmap Epigenomics projects [13, 12]. Binding of enhancer-associated co-activator CBP and histone
acetyltransferase P300 was used by several studies for the genome-wide prediction of enhancers [55, 45, 2]. From
these predictions a distinctive chromatin signature for promoters and enhancers was derived based on H3K4me1
and H3K4me3 [2]. In particular, the ratio H3K4me1/H3K4me3 was found to be low at promoters, in comparison
to enhancers. Active and poised enhancers could also be distinguished by presence or absence of H3K27me3 and
H3K9me3 [63]. All these features could be confirmed by GenoSTAN, making it a promising tool for the biochemical
characterization of enhancers and promoters. Moreover, extensive benchmarks based on independent data including
transcriptional activity, TF binding, cis-regulatory activity, and enrichment for complex trait-associated variants
showed the highest accuracy of GenoSTAN annotations over former genome segmentation methods.
The GenoSTAN annotation sheds light on the common and distinctive features of promoters and enhancers, which
currently are an intense subject of debate [53, 6]. Among other characteristics, a shared architecture of promoters
and enhancers was proposed based on the recent discovery of widespread bidirectional transcription at enhancers
[31, 6, 16]. This was supported by the observation that enhancers, which are depleted in CpG islands have similar
transcription factor (TF) motif enrichments as CpG poor promoters [5]. However, another study showed that TF
co-occurrence differed between gene-proximal and gene-distal sites [22, 4]. GenoSTAN chromatin states revealed a
very distinct TF regulatory landscape of these elements and therefore suggest that promoters and enhancers are
fundamentally different regulatory elements, both sharing the binding of the core transcriptional machinery. Our
annotation of enhancers and promoters will be a valuable resource to help characterizing the genomic context of
12
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
the binding of further TFs.
Indirectly, our analysis showed that chromatin state annotations are better predictors of enhancers than the
transcription-based definition provided by the FANTOM5 consortium [5]. While FANTOM5 enhancers are an
accurate predictor for transcriptionally active enhancers, the sensitivity remains poor (only 4,263 enhancers were
called by overlap with GRO-cap TSSs and DHS, which is less than the estimated number of transcribed genes,
for K562 cells compared to about 20,000-30,000 for ChromHMM and 10,000-20,000 for GenoSTAN). Although, the
sensitivity of the transcription-based approach can increase with transient transcriptome profiling [48, 46] or nascent
transcriptome profiling [11], the chromatin state data undoubtedly add valuable information for the indentification
of promoters and enhancers. Because it models count data, GenoSTAN analysis can in principle also integrate
RNA-seq profiles, for instance using it in a strand-specific fashion [62].
Systematic identification of cis-regulatory active elements by direct activity assays is notoriously difficult. STARRSeq for instance is a high-throughput reporter assay for the de novo identification of enhancers [7]. It was previously
used to identify thousands of cell-type specific enhancers in Drosophila, but has not been applied to human yet.
Moreover, STARR-Seq makes rigid assumptions about the location of the enhancer element with respect to the
promoter, and it does not account for the native chromatin structure. This might identify regions that are inactive
in situ [7]. Other experimental assays for the validation of predicted ENCODE enhancers lead to different results
[35, 30]. Complementary to these approaches, the systematic evaluation of cis-regulatory activity based on candidate regions in human cells have made progress with the advent of high-throughput CRISPR perturbation assays
[50]. Because it requires candidate cis-regulatory regions in a first place, such approach will benefit from improved
annotation maps as the one we are providing.
Thus, we foresee GenoSTAN to be instrumental in future efforts to generate robust, genome-wide maps of functional
genomic regions like promoters and enhancers
13
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
References
[1] The blueprint project. http://www.blueprint-epigenome.eu/.
[2] Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome.
Nature genetics, 39(3):311–8, 2007.
[3] Metazoan promoters: emerging characteristics and insights into transcriptional regulation. Nat Rev Genet,
13(4):233–245, 2012.
[4] Interplay between chromatin state, regulator binding, and regulatory motifs in six human cell types. Genome
Research, 23:1142–1154, 2013.
[5] R. Andersson, C. Gebhard, and I. et al. Miguel-Escalada. An atlas of active enhancers across human cell types
and tissues. Nature, 507(7493):455–461, Mar 2014.
[6] Robin Andersson. Promoter or enhancer, what’s the difference? Deconstruction of established distinctions and
presentation of a unifying model. BioEssays, 37(3):314–323, 2015.
[7] Cosmas D Arnold, Daniel Gerlach, Christoph Stelzer, Lukasz M Boryn, Martina Rath, and Alexander Stark.
Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science (New York, N.Y.),
339(6123):1074–1077, 2013.
[8] J Banerji, S Rusconi, and W Schaffner. Expression of a beta-globin gene is enhanced by remote SV40 DNA
sequences. Cell, 27(2 Pt 1):299–308, 1981.
[9] E. Birney, J. A. Stamatoyannopoulos, A. Dutta, R. Guigo, T. R. Gingeras, E. H. Margulies, Z. Weng, M. Snyder, E. T. Dermitzakis, R. E. Thurman, M. S. Kuehn, C. M. Taylor, S. Neph, C. M. Koch, S. Asthana,
A. Malhotra, I. Adzhubei, J. A. Greenbaum, R. M. Andrews, P. Flicek, P. J. Boyle, H. Cao, N. P. Carter,
G. K. Clelland, S. Davis, N. Day, P. Dhami, S. C. Dillon, M. O. Dorschner, H. Fiegler, P. G. Giresi, J. Goldy,
M. Hawrylycz, A. Haydock, R. Humbert, K. D. James, B. E. Johnson, E. M. Johnson, T. T. Frum, E. R.
Rosenzweig, N. Karnani, K. Lee, G. C. Lefebvre, P. A. Navas, F. Neri, S. C. Parker, P. J. Sabo, R. Sandstrom, A. Shafer, D. Vetrie, M. Weaver, S. Wilcox, M. Yu, F. S. Collins, J. Dekker, J. D. Lieb, T. D. Tullius,
G. E. Crawford, S. Sunyaev, W. S. Noble, I. Dunham, F. Denoeud, A. Reymond, P. Kapranov, J. Rozowsky,
D. Zheng, R. Castelo, A. Frankish, J. Harrow, S. Ghosh, A. Sandelin, I. L. Hofacker, R. Baertsch, D. Keefe,
S. Dike, J. Cheng, H. A. Hirsch, E. A. Sekinger, J. Lagarde, J. F. Abril, A. Shahab, C. Flamm, C. Fried,
J. Hackermuller, J. Hertel, M. Lindemeyer, K. Missal, A. Tanzer, S. Washietl, J. Korbel, O. Emanuelsson, J. S.
Pedersen, N. Holroyd, R. Taylor, D. Swarbreck, N. Matthews, M. C. Dickson, D. J. Thomas, M. T. Weirauch,
J. Gilbert, J. Drenkow, I. Bell, X. Zhao, K. G. Srinivasan, W. K. Sung, H. S. Ooi, K. P. Chiu, S. Foissac,
T. Alioto, M. Brent, L. Pachter, M. L. Tress, A. Valencia, S. W. Choo, C. Y. Choo, C. Ucla, C. Manzano,
C. Wyss, E. Cheung, T. G. Clark, J. B. Brown, M. Ganesh, S. Patel, H. Tammana, J. Chrast, C. N. Henrichsen, C. Kai, J. Kawai, U. Nagalakshmi, J. Wu, Z. Lian, J. Lian, P. Newburger, X. Zhang, P. Bickel, J. S.
Mattick, P. Carninci, Y. Hayashizaki, S. Weissman, T. Hubbard, R. M. Myers, J. Rogers, P. F. Stadler, T. M.
Lowe, C. L. Wei, Y. Ruan, K. Struhl, M. Gerstein, S. E. Antonarakis, Y. Fu, E. D. Green, U. Karaoz, A. Siepel, J. Taylor, L. A. Liefer, K. A. Wetterstrand, P. J. Good, E. A. Feingold, M. S. Guyer, G. M. Cooper,
G. Asimenos, C. N. Dewey, M. Hou, S. Nikolaev, J. I. Montoya-Burgos, A. Loytynoja, S. Whelan, F. Pardi,
14
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
T. Massingham, H. Huang, N. R. Zhang, I. Holmes, J. C. Mullikin, A. Ureta-Vidal, B. Paten, M. Seringhaus,
D. Church, K. Rosenbloom, W. J. Kent, E. A. Stone, S. Batzoglou, N. Goldman, R. C. Hardison, D. Haussler,
W. Miller, A. Sidow, N. D. Trinklein, Z. D. Zhang, L. Barrera, R. Stuart, D. C. King, A. Ameur, S. Enroth,
M. C. Bieda, J. Kim, A. A. Bhinge, N. Jiang, J. Liu, F. Yao, V. B. Vega, C. W. Lee, P. Ng, A. Shahab, A. Yang,
Z. Moqtaderi, Z. Zhu, X. Xu, S. Squazzo, M. J. Oberley, D. Inman, M. A. Singer, T. A. Richmond, K. J. Munn,
A. Rada-Iglesias, O. Wallerman, J. Komorowski, J. C. Fowler, P. Couttet, A. W. Bruce, O. M. Dovey, P. D.
Ellis, C. F. Langford, D. A. Nix, G. Euskirchen, S. Hartman, A. E. Urban, P. Kraus, S. Van Calcar, N. Heintzman, T. H. Kim, K. Wang, C. Qu, G. Hon, R. Luna, C. K. Glass, M. G. Rosenfeld, S. F. Aldred, S. J. Cooper,
A. Halees, J. M. Lin, H. P. Shulha, X. Zhang, M. Xu, J. N. Haidar, Y. Yu, Y. Ruan, V. R. Iyer, R. D. Green,
C. Wadelius, P. J. Farnham, B. Ren, R. A. Harte, A. S. Hinrichs, H. Trumbower, H. Clawson, J. HillmanJackson, A. S. Zweig, K. Smith, A. Thakkapallayil, G. Barber, R. M. Kuhn, D. Karolchik, L. Armengol, C. P.
Bird, P. I. de Bakker, A. D. Kern, N. Lopez-Bigas, J. D. Martin, B. E. Stranger, A. Woodroffe, E. Davydov,
A. Dimas, E. Eyras, I. B. Hallgrimsdottir, J. Huppert, M. C. Zody, G. R. Abecasis, X. Estivill, G. G. Bouffard,
X. Guan, N. F. Hansen, J. R. Idol, V. V. Maduro, B. Maskeri, J. C. McDowell, M. Park, P. J. Thomas, A. C.
Young, R. W. Blakesley, D. M. Muzny, E. Sodergren, D. A. Wheeler, K. C. Worley, H. Jiang, G. M. Weinstock,
R. A. Gibbs, T. Graves, R. Fulton, E. R. Mardis, R. K. Wilson, M. Clamp, J. Cuff, S. Gnerre, D. B. Jaffe,
J. L. Chang, K. Lindblad-Toh, E. S. Lander, M. Koriabine, M. Nefedov, K. Osoegawa, Y. Yoshinaga, B. Zhu,
and P. J. de Jong. Identification and analysis of functional elements in 1human genome by the ENCODE pilot
project. Nature, 447(7146):799–816, Jun 2007.
[10] M. G. Bulmer. On Fitting the Poisson Lognormal Distribution to Species-Abundance Data. Biometrics,
30(1):101–110, 2011.
[11] L. S. Churchman and J. S. Weissman. Nascent transcript sequencing visualizes transcription at nucleotide
resolution. Nature, 469(7330):368–373, Jan 2011.
[12] Roadmap Epigenomics Consortium.
518(7539):317–330, 2015.
Integrative analysis of 111 reference human epigenomes.
Nature,
[13] The ENCODE Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature,
489:57–74, 2012.
[14] The Fantom Consortium. A promoter-level mammalian expression atlas. Nature, 507(7493):462–470, 2014.
[15] John D Cook. Notes on the Negative Binomial Distribution, 2009.
[16] Leighton J. Core, André L. Martins, Charles G. Danko, Colin T Waters, Adam Siepel, and John T. Lis. Analysis
of transcription start sites from nascent RNA supports a unified architecture of mammalian promoters and
enhancers. Submitted, 46(12):1311–1320, 2014.
[17] Jason Ernst and Manolis Kellis. Discovery and characterization of chromatin states for systematic annotation
of the human genome. Nature biotechnology, 28(8):817–825, 2010.
[18] Jason Ernst and Manolis Kellis. ChromHMM: automating chromatin-state discovery and characterization.
Nature Methods, 9(3):215–216, 2012.
15
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
[19] Jason Ernst and Manolis Kellis. Large-scale imputation of epigenomic datasets for systematic annotation of
diverse human tissues. Nature Biotechnology, 33(4):364–376, 2015.
[20] Jason Ernst, Pouya Kheradpour, Tarjei S Mikkelsen, Noam Shoresh, Lucas D Ward, Charles B Epstein,
Xiaolan Zhang, Li Wang, Robbyn Issner, Michael Coyne, Manching Ku, Timothy Durham, Manolis Kellis, and
Bradley E Bernstein. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature,
473(7345):43–49, 2011.
[21] R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge,
J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. J.
Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Y. Yang, and J. Zhang. Bioconductor: open software
development for computational biology and bioinformatics. Genome Biol., 5(10):R80, 2004.
[22] Mark B. Gerstein, Anshul Kundaje, Manoj Hariharan, Sherman M. Weissman, and Michael Snyder. Architecture of the human regulatory network derived from ENCODE data. Nature, 489(7414):91–100, 2012.
[23] Vidar GrÞtan and Steinar Engen. poilog: Poisson lognormal and bivariate Poisson lognormal distribution,
2008. R package version 0.4.
[24] J. Harrow, A. Frankish, J. M. Gonzalez, E. Tapanari, M. Diekhans, F. Kokocinski, B. L. Aken, D. Barrell, A. Zadissa, S. Searle, I. Barnes, A. Bignell, V. Boychenko, T. Hunt, M. Kay, G. Mukherjee, J. Rajan,
G. Despacio-Reyes, G. Saunders, C. Steward, R. Harte, M. Lin, C. Howald, A. Tanzer, T. Derrien, J. Chrast,
N. Walters, S. Balasubramanian, B. Pei, M. Tress, J. M. Rodriguez, I. Ezkurdia, J. van Baren, M. Brent,
D. Haussler, M. Kellis, A. Valencia, A. Reymond, M. Gerstein, R. Guigo, and T. J. Hubbard. GENCODE: the
reference human genome annotation for The ENCODE Project. Genome Res., 22(9):1760–1774, Sep 2012.
[25] Sven Heinz, Casey E Romanoski, Christopher Benner, and Christopher K Glass. The selection and function
of cell type-specific enhancers. Nature Reviews Molecular Cell Biology, 16(3):144–154, 2015.
[26] Michael M Hoffman, Orion J Buske, Jie Wang, Zhiping Weng, Jeff a Bilmes, and William Stafford Noble.
Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods,
9(5):473–476, 2012.
[27] Michael M. Hoffman, Jason Ernst, Steven P. Wilder, Anshul Kundaje, Robert S. Harris, Max Libbrecht, Belinda
Giardine, Paul M. Ellenbogen, Jeffrey a. Bilmes, Ewan Birney, Ross C. Hardison, Ian Dunham, Manolis Kellis,
and William Stafford Noble. Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids
Research, 41(2):827–841, 2013.
[28] Ross Ihaka and Robert Gentleman. R: A Language for Data Analysis and Graphics. J. Comp. Graph. Stat.,
5(3):299–314, 1996.
[29] Peter V Kharchenko, Artyom a Alekseyenko, Yuri B Schwartz, Aki Minoda, Nicole C Riddle, Jason Ernst,
Peter J Sabo, Erica Larschan, Andrey a Gorchakov, Tingting Gu, Daniela Linder-Basso, Annette Plachetka,
Gregory Shanower, Michael Y Tolstorukov, Lovelace J Luquette, Ruibin Xi, Youngsook L Jung, Richard W
Park, Eric P Bishop, Theresa K Canfield, Richard Sandstrom, Robert E Thurman, David M MacAlpine, John a
Stamatoyannopoulos, Manolis Kellis, Sarah C R Elgin, Mitzi I Kuroda, Vincenzo Pirrotta, Gary H Karpen,
16
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
and Peter J Park. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature,
471(7339):480–485, 2011.
[30] Pouya Kheradpour, Jason Ernst, Alexandre Melnikov, Peter Rogov, Li Wang, Xiaolan Zhang, Jessica Alston,
Tarjei S. Mikkelsen, and Manolis Kellis. Systematic dissection of regulatory motifs in 2000 predicted human
enhancers using a massively parallel reporter assay. Genome Research, 23:800–811, 2013.
[31] Tae-Kyung Kim, Martin Hemberg, Jesse M. Gray, Allen M. Costa, Daniel M. Bear, Jing Wu, David A.
Harmin, Mike Laptewicz, Kellie Barbara-Haley, Scott Kuersten, Eirene Markenscoff-Papadimitriou, Dietmar
Kuhl, Haruhiko Bito, Paul F. Worley, Gabriel Kreiman, and Michael E. Greenberg. Widespread transcription
at neuronal activity-regulated enhancers. Nature, 465(7295):182–187, 2010.
[32] D. Kleftogiannis, P. Kalnis, and V. B. Bajic. DEEP: a general computational framework for predicting enhancers. Nucleic Acids Research, 43(1):e6–e6, 2015.
[33] Dimitrios Kleftogiannis, Panos Kalnis, and Vladimir B Bajic. Progress and challenges in bioinformatics approaches for enhancer identification. Briefings in Bioinformatics, (August):1–13, 2015.
[34] E. Z. Kvon, G. Stampfel, J. O. Yanez-Cuna, B. J. Dickson, and A. Stark. HOT regions function as patterned
developmental enhancers and have a distinct cis-regulatory signature. Genes Dev., 26(9):908–913, May 2012.
[35] J. C. Kwasnieski, C. Fiore, H. G. Chaudhari, and B. a. Cohen. High-throughput functional testing of ENCODE
segmentation predictions. Genome Research, pages gr.173518.114–, 2014.
[36] B. Langmead and S. L. Salzberg. Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9(4):357–359, Apr
2012.
[37] Michael Lawrence, Robert Gentleman, and Vincent Carey. rtracklayer: an r package for interfacing with
genome browsers. Bioinformatics, 25:1841–1842, 2009.
[38] Michael Lawrence, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin
Morgan, and Vincent Carey. Software for computing and annotating genomic ranges. PLoS Computational
Biology, 9, 2013.
[39] Dongwon Lee, Rachel Karchin, and Michael A Beer. Discriminative prediction of mammalian enhancers from
DNA sequence. Genome research, 21(12):2167–80, 2011.
[40] H. Li, H. Chen, F. Liu, C. Ren, S. Wang, X. Bo, and W. Shu. Functional annotation of HOT regions in the
human genome: implications for human disease and cancer. Sci Rep, 5:11633, 2015.
[41] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin. The
Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–2079, Aug 2009.
[42] Kerstin Lindblad-Toh, Manuel Garber, Or Zuk, Michael F. Lin, Brian J. Parker, Stefan Washietl, Pouya
Kheradpour, Jason Ernst, Gregory Jordan, Evan Mauceli, Lucas D. Ward, Craig B. Lowe, Alisha K. Holloway,
Michele Clamp, Sante Gnerre, Jessica Alfoldi, Kathryn Beal, Jean Chang, Hiram Clawson, James Cuff, Federica
Di Palma, Stephen Fitzgerald, Paul Flicek, Mitchell Guttman, Melissa J. Hubisz, David B. Jaffe, Irwin Jungreis,
W. James Kent, Dennis Kostka, Marcia Lara, Andre L. Martins, Tim Massingham, Ida Moltke, Brian J. Raney,
17
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Matthew D. Rasmussen, Jim Robinson, Alexander Stark, Albert J. Vilella, Jiayu Wen, Xiaohui Xie, Michael C.
Zody, Jen Baldwin, Toby Bloom, Chee Whye Chin, Dave Heiman, Robert Nicol, Chad Nusbaum, Sarah
Young, Jane Wilkinson, Kim C. Worley, Christie L. Kovar, Donna M. Muzny, Richard A. Gibbs, Andrew Cree,
Huyen H. Dihn, Gerald Fowler, Shalili Jhangiani, Vandita Joshi, Sandra Lee, Lora R. Lewis, Lynne V. Nazareth,
Geoffrey Okwuonu, Jireh Santibanez, Wesley C. Warren, Elaine R. Mardis, George M. Weinstock, Richard K.
Wilson, Kim Delehaunty, David Dooling, Catrina Fronik, Lucinda Fulton, Bob Fulton, Tina Graves, Patrick
Minx, Erica Sodergren, Ewan Birney, Elliott H. Margulies, Javier Herrero, Eric D. Green, David Haussler,
Adam Siepel, Nick Goldman, Katherine S. Pollard, Jakob S. Pedersen, Eric S. Lander, and Manolis Kellis. A
high-resolution map of human evolutionary constraint using 29 mammals. Nature, 478(7370):476–482, 2011.
[43] Alessandro Mammana and Ho-Ryun Chung. Chromatin segmentation based on a probabilistic model for read
counts explains a large portion of the epigenome. Genome Biology, 16(1):151, 2015.
[44] Guillemette Marot, David Castel, Jordi Estelle, Gregory Guernec, Bernd Jagla, Nicolas Servant, Luc Jouneau,
Denis Laloe, Caroline Le Gall, and Brigitte Schae. A comprehensive evaluation of normalization methods for
Illumina high-throughput RNA sequencing data analysis. 14(6), 2012.
[45] Dalit May, Matthew J Blow, Tommy Kaplan, David J McCulley, Brian C Jensen, Jennifer A Akiyama, Amy
Holt, Ingrid Plajzer-Frick, Malak Shoukry, Crystal Wright, Veena Afzal, Paul C Simpson, Edward M Rubin,
Brian L Black, James Bristow, Len A Pennacchio, and Axel Visel. Large-scale discovery of enhancers from
human heart tissue. Nature genetics, 44(1):89–93, 2012.
[46] C. Miller, B. Schwalb, K. Maier, D. Schulz, S. Dumcke, B. Zacher, A. Mayer, J. Sydow, L. Marcinowski,
L. Dolken, D. E. Martin, A. Tresch, and P. Cramer. Dynamic transcriptome analysis measures rates of mRNA
synthesis and decay in yeast. Mol. Syst. Biol., 7:458, Jan 2011.
[47] Tõnis Org, Dan Duan, Roberto Ferrari, Amelie Montel-hagen, Ben Van Handel, A Marc, Rajkumar Sasidharan,
Liudmilla Rubbi, Yuko Fujiwara, Matteo Pellegrini, Stuart H Orkin, Siavash K Kurdistani, and Hanna K A
Mikkola. Scl binds to primed enhancers in mesoderm to regulate hematopoietic and cardiac fate divergence.
(January):1–20, 2015.
[48] M. Rabani, R. Raychowdhury, M. Jovanovic, M. Rooney, D. J. Stumpo, A. Pauli, N. Hacohen, A. F. Schier,
P. J. Blackshear, N. Friedman, I. Amit, and A. Regev. High-resolution sequencing and modeling identifies
distinct dynamic RNA regulatory strategies. Cell, 159(7):1698–1710, Dec 2014.
[49] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc.
IEEE, 77(2):257–286, February 1989.
[50] N. Rajagopal, S. Srinivasan, K. Kooshesh, Y. Guo, M. D. Edwards, B. Banerjee, T. Syed, B. J. Emons, D. K.
Gifford, and R. I. Sherwood. High-throughput mapping of regulatory DNA. Nat. Biotechnol., 34(2):167–174,
Feb 2016.
[51] Nisha Rajagopal, Wei Xie, Yan Li, Uli Wagner, Wei Wang, John Stamatoyannopoulos, Jason Ernst, Manolis
Kellis, and Bing Ren. RFECS: a random-forest based algorithm for enhancer identification from chromatin
state. PLoS computational biology, 9(3):e1002968, 2013.
18
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
[52] D. Ramskold, E. T. Wang, C. B. Burge, and R. Sandberg. An abundance of ubiquitously expressed genes
revealed by tissue transcriptome sequence data. PLoS Comput. Biol., 5(12):e1000598, Dec 2009.
[53] Daria Shlyueva, Gerald Stampfel, and Alexander Stark. Transcriptional enhancers: from properties to genomewide predictions. Nature reviews. Genetics, 15(4):272–86, 2014.
[54] Gosia Trynka, Cynthia Sandor, Buhm Han, Han Xu, Barbara E Stranger, X Shirley Liu, and Soumya Raychaudhuri. Chromatin marks identify critical cell types for fine mapping complex trait variants. Nature Genetics,
45(2):124–130, 2012.
[55] Axel Visel, Matthew J. Blow, Zirong Li, Tao Zhang, Jennifer A. Akiyama, Amy Holt, Ingrid Plajzer-Frick,
Malak Shoukry, Crystal Wright, Feng Chen, Veena Afzal, Bing Ren, Edward M. Rubin, and Len A. Pennacchio.
ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature, 457(7231):854–858, 2009.
[56] J. Wang, J. Zhuang, S. Iyer, X. Lin, T. W. Whitfield, M. C. Greven, B. G. Pierce, X. Dong, A. Kundaje,
Y. Cheng, O. J. Rando, E. Birney, R. M. Myers, W. S. Noble, M. Snyder, and Z. Weng. Sequence features
and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res.,
22(9):1798–1812, Sep 2012.
[57] Lucas D. Ward and Manolis Kellis. HaploReg v4: systematic mining of putative causal variants, cell types,
regulators and target genes for human complex traits and disease. Nucleic Acids Research, page gkv1340, 2015.
[58] Nils Weinhold, Anders Jacobsen, Nikolaus Schultz, Chris Sander, and William Lee. Genome-wide analysis of
noncoding regulatory mutations in cancer. Nature Genetics, 46(11):1160–5, 2014.
[59] D. Welter, J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins, A. Klemm, P. Flicek, T. Manolio,
L. Hindorff, and H. Parkinson. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations.
Nucleic Acids Research, 42(D1):D1001–D1006, 2014.
[60] Kyoung-Jae Won, Xian Zhang, Tao Wang, Bo Ding, Debasish Raha, Michael Snyder, Bing Ren, and Wei
Wang. Comparative annotation of functional regions in the human genome using epigenomic data. Nucleic
acids research, 41(8):4423–32, 2013.
[61] Kevin Y Yip, Chao Cheng, Nitin Bhardwaj, James B Brown, Jing Leng, Anshul Kundaje, Joel Rozowsky,
Ewan Birney, Peter Bickel, Michael Snyder, and Mark Gerstein. Classification of human genomic regions based
on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biology,
13(9):R48, 2012.
[62] B. Zacher, M. Lidschreiber, P. Cramer, J. Gagneur, and A. Tresch. Annotation of genomics data using
bidirectional hidden Markov models unveils variations in Pol II transcription cycle. Molecular Systems Biology,
10(12):768–768, 2014.
[63] Gabriel E Zentner, Paul J Tesar, and Peter C Scacheri. Epigenetic signatures distinguish multiple classes of
enhancers with distinct cellular functions. Genome research, 21(8):1273–83, 2011.
[64] Daniel R Zerbino, Steven P Wilder, Nathan Johnson, Thomas Juettemann, and Paul R Flicek. The Ensembl
Regulatory Build. Genome Biology, 16(1):1–8, 2015.
19
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Methods
Availability of GenoSTAN and chromatin state annotations
GenoSTAN is freely available from http://bioconductor.org/ as part of our prevsiouy published R/Bioconductor
package STAN [62]. Chromatin state annotations for benchmark I, II and III can be downloaded from http://i12ggagneurweb.in.tum.de/public/paper/GenoSTAN. The combined promoter and enhancer annotation is available as
Supplementary Data.
Motivation of Poisson-lognormal and negative binomial emissions
The Poisson-lognormal and the negative binomial distribution can be thought of as extensions of the Poisson
distribution that allow for greater variance. We will now motivate both distributions from a Poisson distribution
with a prior on the mean of the Poisson.
Suppose that X ∼ P oisson(x|Λ) is a Poisson random variable and Λ ∼ Gamma (λ|α, β). From this we can derive
the negative binomial with success rate p and size r [15]:
ˆ∞
Pr (X = x|α, β) =
0
ˆ∞
=
0
=
p
P oisson (x|λ) Gamma λ|α = r, β =
1−p
dλ
1−p
λx −λ r−1 e−λ p
r
dλ
e λ
p
x!
Γ (r)
1−p
Γ (r + x) x
r
p (1 − p)
x!Γ (r)
where r > 0, p ∈ [0, 1]
In order to increase interpretability in the context of read counts, we re-parameterize this with mean µ =
Pr (X = x|µ, r) =
Γ (r + x)
x!Γ (r)
r
r+µ
x 1−
r
r+µ
r(1−p)
:
p
r
where µ > 0
The Poisson-lognormal distribution can be motivated likewise. Assume that X ∼ P oisson(x|Λ) is a Poisson
random variable and Λ ∼ N (log (λ) |µ, σ). Then the Poisson-lognormal is given by [10]:
ˆ∞
P oisson (x|λ) N (log (λ) |µ, σ) dλ
Pr (X = x|µ, σ) =
0
=
√
ˆ∞
(log(λ)−µ)2
2πσ 2
dλ
λx−1 e−λ e− 2σ2
x!
0
A closed form solution for this distribution does not exist. Thus numerical integration is needed to calculate
probabilities, which is done in GenoSTAN by using the R package poilog [23, 28].
20
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Optimization of Poisson-lognormal and negative binomial emissions
Let O = (o0 , ..., oT ), ot = (ot,d )d∈D ∈ ND
0 be an obsevational sequence of |D|-dimensional count vectors ot . An
HMM assumes that each observation ot is emitted by a corresponding hidden (unobserved) variable st , t = 0, ..., T .
A hidden variable can assume values from a finite set of states K. Each state k ∈ K is associated to an emission
distribution ψk , which defines the probability of making a certain observation, ψk (ot ). GenoSTAN assumes that
Q
the components ot,d , d ∈ D, of a single observation ot are independent, and hence ψk (ot ) = d∈D ψk,d (ot,d ). The
value of st determines the probability of observing ot by Pr(ot | st ) = ψst (ot ). HMM learning is carried out using
the Baum-Welch algorithm [49]. The optimization problem for the paramters of a single emission distribution ψi,d
can be written as
arg max
ψi,d
T
X
Pr (st = i|O) log ψi,d (ot,d )
,
t=0
where Pr(st = i | O) is calculated efficiently by the Forward-Backward algorithm, and ψi,d is maximized within
the class of negative binomial or Poisson-lognormal distributions. An analytical solution for this problem does
not exist. Thus, we resort to numerical optimzation. As indicated by [43], above formula can be very costly to
compute, since the function needs to evaluate a sum over the complete observation sequence (i.e. the complete
binned genome) in each iteration. However, computations are greatly simplified by grouping together observations
ot,d with the same count number. Let Cd be the set of unique counts in dimension d. Then the following terms can
be precomputed for all c ∈ Cd before optimization:
X
f (c) =
Pr (st = i|O)
t; ot,d =c
The objective function becomes
arg max
X
ψi,d
c∈Cd
f (c) log ψi,d (c)
which avoids redundant calculations of ψi,d (ot ), t = 0, ..., T , and greatly reduces complexity since |Cd | T .
Correction for library size
The sequencing depth can be very different between experiments. GenoSTAN addresses this problem by using
pre-computed scaling factors to correct for varying sequencing depths for a data track between cell types. In this
work, the ’total count’ method is used [44]. Let L be the set of cell types and rd,l the number of reads of data track
d ∈ D in cell line l ∈ L. The scaling factor is then computed as
sd,l
rd,l
·
k∈L rd,k
=P
P
rd,k
|L|
k∈C
µ
The probability of an observation ot,l is calculated as Pr ot,l | sd,l
, r in the case of negative binomial and
µ
Pr ot,l | log sd,l
, σ in the case of Poisson-lognormal emissions.
21
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Model initialization
Initialization of model parameters is crucial for HMMs since the EM algorithm is a gradient method which converges
to a local maximum. K-means is a widely used approach to derive an initial clustering to estimate model parameters
[49]. In order to make this approach applicable to sequencing data, we added a pseudocount and log-transformed the
data before k-means clustering. However, without further processing k-means rarely converged and the procedure
was slow on the complete data set. To address these issues, we further processed and filtered the data. First,
a threshold for signal enrichment for each data track is calculated using the default binarization approach of
ChromHMM [17]. The threshold is the smallest
discrete number nd > 0 such that Pr (X > nd ) < 10−4 where X
PT
ot,d
is a Poisson random variable with mean λd = t=0
. All ot,d < nd were set to 0, which improved convergence
T +1
of k-means. To improve the speed, all genomic bins ot,d where ∀d ∈ D : ot,d = 0 were removed and defined as
a ’background cluster’. K-means was then run on the rest of the data with |K| − 1 clusters. This clustering (the
’background’ and k-means clusters) was then used to derive an initial estimate of emission function parameters.
Initial state and transition probabilities were initialized uniform.
Data preprocessing
Benchmark I (K562 ENCODE) sequencing data was mapped to the hg20/hg38 (GRCh38) genome assembly (Human
Genome Reference Consortium) using Bowtie 2.1.0 [36]. Samtools [41] was used to quality filter SAM files, whereby
alignments with MAPQ smaller than 7 (-q 7) were skipped. To obtain midpoint positions of the ChIP-Seq fragments,
the (single end) reads were shifted in the appropriate direction by half the average fragment length as estimated
by strand coverage cross-correlation using the R/Bioconductor package chipseq [21]. Next, ChIP-Seq tracks were
summarized by the number of fragment midpoints in consecutive bins of 200 bp width. The data for the 127
ENCODE and Roadmap Epigenomics cell types (benchmark II and III) was downloaded as preprocessed tagAlign
files from the Roadmap Epigenomics supplementary website [12]. Fragment length was again estimated using the
R/Bioconductor package chipseq and reads were shifted by the fragment half size to the average fragment midpoint
[21]. The genome was partitioned into 200bp bins and reads were counted within each bin.
Model fitting of GenoSTAN
GenoSTAN was fitted on the complete data of benchmark data set I. The signal used for GenoSTAN model training
on Benchmark data set II and III was extracted from ENCODE pilot regions (1% of the human genome analyzed in
the ENCODE pilot phase [9]) for each cell type, which together covered 20% and 127% of the human genome. The
GenoSTAN-nb-20 model was learned in one day, the GenoSTAN-Poilog-20 model in two days using 10 cores. Model
learning on Benchmark set II using 10 cores took three (GenoSTAN-nb-127) and six days (GenoSTAN-Poilog-127).
Precomputed library size factors were used to correct for variation in read coverage.
Model fitting of ChromHMM, Segway and EpicSeg
The data was binarized as described in [17] and ChromHMM was fitted with default parameters. Before applying
Segway, the data was transformed using the hyperbolic sine function [26] and a runing mean over a 1kb sliding
window was computed to smooth the data. Segway was fitted on ENCODE pilot regions using a 200bp resolution.
EpicSeg was fitted on the untransformed count data with default parameters.
22
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Processing of chromatin state annotations and external data
All state annotations and external data were lifted to the hg20/hg38 (GRCh38) genome assembly using the liftOver
function from the R/Bioconductor package rtracklayer [37]. Overlap of state annotations with external data was
calculated with GenomicRanges [38].
Computation of area under curve
AUC values were calculated on Benchmark set I for GenoSTAN, ChromHMM, Segway and EpicSeg. To this end,
a segmentation was transformed into a binary classifier and evaluated as follows. Each 200bp bin in the genome
overlapping with HOT (TSSs) regions was considered as ’true condition’, the rest as ’false’. For each state S the
precision for recalling HOT (TSS) regions was calculated as the fraction of all segments annotated with S that
overlapped with a HOT (TSS) region. States were then sorted by decreasing precision. The rank of each state was
used as score in the prediction of HOT (TSS) regions on each 200bp bin in the genome, which was then used to
calculate AUC values.
Analysis of transcription factor (co-)binding
Enrichment of TFs in chromatin states was calculated as described earlier [4]. The TF co-binding rate was calculated
T F ∩Si |
as the Jaccard-Index: |S
|ST F ∪Si | , where ST F is the set of TF binding sites and Si is the set genomic regions that are
annotated with state i.
Tissue-specific enrichment of disease- and complex trait-associated variants in regulatory regions
The GWAS catalog was obtained from the gwascat package from Bioconductor [21, 59]. Statistical testing was
carried out in a similar manner as described in [12]. The enrichment of SNPs from individual genome-wide association studies was calculated for traits with at least 20 variants. SNPs for each trait were overlapped with promoter
and enhancer regions and tested against the rest of the GWAS catalogue as background using Fisher’s exact test.
P-values were adjusted for multiple testing using the Benjamini-Hochberg correction. In order to calculate the
recall and frequency of SNPs, promoter and enhancer states were randomly sampled until a genomic coverage of
2% for enhancers and 1% of promoters was reached. This was done to control for the fact that methods can differ
among each other regarding the length of the promoters and enhancers they predict. This procedure was repeated
100 times enabling the calculation of 95% confidence intervals.
Acknowledgements
We thank Lars Steinmetz, Judith Zaugg, Aino Jarvelin, Wu Wei and Philip Brennecke for fruitful discussions and
hosting BZ during the early phase of the project. BZ was supported by a DAAD short term research grant.
23
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Author contributions
BZ, JG and AT developed the statistical methods and computational workflow of the study. BZ developed and
implemented all software and scripts and carried out all computational analyses. BS helped with preprocessing of
the K562 data set. MM and PC helped with interpretation of the biological results. BZ, JG and AT wrote the
manuscript with input from all authors. All authors read and approved the final version of the manuscript.
Conflict of interest
The authors declare that they have no conflict of interest.
24
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Figures
A
B
GenoSTAN
ChromHMM
Segway
EpicSeg
Count
distribution
yes
no
no
yes
Library size
correction
yes
no
no
no
Track-specific
variance
yes
no
no
no
Running time
(one cell line)
minutes-hours
minutes-hours
hours-days
minutes-hours
Manually
entered
parameters
Number of states
Number of states,
binarization cutoff
Number of states,
data transformation
and smoothing
Number of states
Benchmark
Cell/tissue types
GenoSTAN
model (ID)
Chromatin marks
Benchmarked
models (ID)
I
K562
(ENCODE)
Poilog-K562
nb-K562
H3K4me1, H3K4me2,
H3K4me3,
H3K36me3, H3K9ac,
H3K27ac, H3K27me3,
H4K20me1, P300,
DNase-Seq
ChromHMM-ENCODE
ChromHMM-Nature
Segway-ENCODE
Segway-nmeth
Segway-Reg.Build
EpicSeg
II
127 cell/tissue types
(ENCODE, Roadmap
Epigenomics)
Poilog-127
nb-127
H3K4me1, H3K4me3,
H3K36me3,
H3K27me3, H3K9me3
ChromHMM-15
ChromHMM-25
III
20 cell/tissue types
(ENCODE, Roadmap
Epigenomics)
Poilog-20
nb-20
H3K4me1, H3K4me3,
H3K36me3, H3K9ac,
H3K27ac, H3K27me3,
H3K9me3,
DNase-Seq
ChromHMM-15
ChromHMM-18
ChromHMM-25
Figure 1: Overview of chromatin state annotation methods and study design. (A) Comparison of features of
GenoSTAN against three previous chromatin state annotation algorithms. (B) Description of the three benchmark sets used in this study. GenoSTAN is benchmarked against published chromatin state annotations using
ChromHMM (’ChromHMM-ENCODE’ [13, 27], ’ChromHMM-Nature’ [20], ’ChromHMM-15’, ’-18’ and ’-25’ [12]),
Segway (’Segway-ENCODE’ [13, 27], ’Segway-nmeth’ [26] and ’Segway-Reg.Build’ [64]) and EpicSeg [43].
25
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
A
47.17 mb
Genomic position
47.19 mb
47.21 mb
47.18 mb
47.23 mb
47.2 mb
47.25 mb
47.22 mb
47.24 mb
H3K4me1
H3K4me3
DNase-Seq
TAL1
UCSC genes
GRO-cap TSS
+
-
GenoSTAN-PoiLog-K562
GenoSTAN-NB-K562
ChromHMM-ENCODE
Segway-ENCODE
EpicSeg
Promoter & downstream
TAL1 enhancer
Promoter
Enhancer
Downstream
TAL1 enhancer
TAL1 promoter &
upstream enhancer
B
Prom.11
58
31
5
9
14
35
297
452
203
194
11.4
1.4
0.2
0.01
28.97
32.82
55.98
52.1
1.6
0.72
PromWF.5
30
14
3
3
4
33
89
93
29
49
13.1
0.6
0.5
0.1
15.05
9.11
22.1
13.83
0.89
0.38
0.16
PromF.3
16
16
5
21
22
51
119
66
45
56
12.3
0.6
1
0.01
0.99
2.57
2.1
2.94
1.71
1.28
0.32
76
10.9
0.6
3.6
0.04
1.41
23.7
6.35
5.1
0.27
0.43
0.08
58
33
34.5
0.6
6.2
0.3
0.78
18.39
3.32
4.25
0.86
1.46
0.67
EnhF.10
9
11
3
8
4
29
15
7
16
14
58.3
0.4
5.6
0.82
1.85
3.1
1.87
2.38
1.22
2.43
0.87
ReprEnh.4
38
30
39
51
62
41
21
21
29
44
1.5
1.2
6.2
0.07
0.2
0.29
0.04
0.63
0.41
0.27
0.3
Gen5'.13
17
15
7
19
18
46
24
9
21
23
46.7
0.6
4.6
0.35
1.14
3.73
0.91
3.92
4.18
4.45
3.13
Elon'.14
10
9
5
29
16
14
6
5
11
14
59.9
1
4.9
0.14
3.48
0.82
0.37
4.16
26.91
14.83
21.41
Elon.6
5
7
3
16
8
5
3
3
5
6
70.4
1.8
6.2
0.78
2.06
0.43
0.52
2.63
43.23
35.92
39.9
ReprD.17
20
10
27
8
18
31
18
9
8
18
20.8
0.8
5.2
0.73
8.51
1.98
0.59
1.33
0.51
0.34
0.45
Repr.7
10
7
28
8
16
8
6
5
5
9
88.6
1
9.7
5.44
9.66
0.34
0.22
0.68
1.63
1.94
2.85
ReprW.9
5
6
12
4
8
3
3
3
3
5
200.7 1.4
13.1
17.06
6.39
0.19
0.34
0.8
1.82
5.38
3.58
Low.12
9
10
8
9
8
15
9
5
8
11
96.9
0.8
8.2
2.75
2.96
1.91
1.04
2.3
3.01
5.36
4.88
Low.8
4
6
3
4
4
9
3
3
3
6
132.5 0.8
8.2
3.91
6.1
0.31
2
1.78
2.43
7.77
4.04
Low.1
3
5
4
4
4
2
0
2
3
2
458.3 1.2
22.7
31.68
2.67
0.13
0.66
0.78
7.64
12.67
14.08
Low.16
2
3
1
1
0
1
0
1
0
1
440
0.6
24.7
23.1
3.44
0.05
1.23
0.35
1.56
4.03
3.07
Low.18
0
0
0
0
0
0
0
0
0
0
88.2
0.6
21.9
12.72
4.34
0.12
0.37
0.05
0.12
0.34
0.12
e1
0 3
20
3'
tro
in
an
ex
c
00
9a
10
x
ts
en
m
ed
i
m
#s
eg
27
H
3K
e3
4m
3K
H
3K
H
0m
H
3K
H
4K
2
m
e3
e3
H
3K
36
7m
p3
H
3K
2
N
D
median read count
U
TR
190
16
n
42
56
on
149
65
w
id
th
to me
(k
G dia
b)
EN n
C di
O st
D a
E nc
TS e
S
in
te
rg
en
ic
C
pG
is
la
in
nd
st
ab
le
TS
S
st
ab
le
TS
S
TS
S/
5'
U
TR
92
10
ac
12
12
4m
14
5
H
3K
7
27
e1
55
33
00
81
as
e
Enh.15
EnhWF.2
4m
e2
0.08
150
Figure 2: Chromatin states fitted on Benchmark I using GenoSTAN. (A) GenoSTAN segmentations are shown
with published segmentations using ChromHMM-ENCODE [13], Segway-ENCODE [13] and EpicSeg [43] at the
TAL1 gene and three known enhancers. GenoSTAN-Poilog-K562 correctly recalls all known promoter and enhancer
regions. GenoSTAN-nb-K562 misses the upstream enhancer. ChromHMM-ENCODE misclassifies most of the
downstream enhancer region as promoter. (B) Median read coverage of GenoSTAN-Poilog-K562 chromatin states
(left), their number of annotated segments in the genome, their median width and distance to the closest GENCODE
TSS (middle). The right panel shows recall of genomic regions by chromatin states.
26
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
A
B
●
●
●
3
●
●●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●
●●
●
●●
●
●
●
●●●
●●
●●●●
●●
●
●
●
●
●
● ●●
●
●●●
●●
●
●
●●
● ●●
●
●●
●●●●
●●
●●
● ● ● ●● ●
●
● ●
●●●
● ●●
●
●● ●
●
●●
● ●●
● ●●●
●
●
●
●
●
11_proximal
●
2_Weak_Promoter
●
0.8
●
PromP
●
0
●
Enh
4_Strong_Enhancer
●
●●
PromF
PromW.5
0.8
●
●
Enh.15
GenoSTAN-PoiLog-K562
GenoSTAN-NB-K562
ChromHMM-Nature
ChromHMM-ENCODE
Segway-ENCODE
Segway-nmeth
Segway-Reg.Build
EpicSeg
0.0
●
0.0
●●
●
●●
0.4
0.6
0.8
0.6
0.2
0.4
0.6
YY1
E2F6
ELF1
Egr1
ZBTB7A
CCNT2
HMGN3
IRF1
KDM5B
PHF8
TAF1
TBP
RBBP5
UBF
ETS1
Sin3A
E2F4
HDAC1
SAP30
CHD1
SP1
CHD2
Mxi1
THAP1
SIX5
Nrf1
NFYB
NFYA
SP2
USF2
GTF2F1
GTF2B
TAF7
Pol2(phosphoS2)
SRF
ELK1
BCLAF1
ZBTB33
SETDB1
Znf143
CTCF
Rad21
SMC3
CTCFL
ZNF263
Myc
Max
Pol2
MAZ
BHLHE40
JunD
RCOR1
CBX3
PML
GABPA
ATF3
SPI1
USF1
Jun
JunB
FOSL1
FOS
EP300
TEAD4
GATA2
TAL1
NR2F2
TRIM28
STAT5A
TBL1XR1
REST
ATF1
CEBPB
ARID3A
HDAC2
COREST
GATA1
STAT2
STAT1
MEF2A
MafF
MafK
Bach1
NFE2
HDAC8
BCL3
ZNF274
SIRT6
SMARCA4
SMARCB1
Pol3
BRF1
BDP1
POLR3A
GTF3C2
RFX5
HDAC6
RD
EZH2
BRF2
KAP1
NR2C2
●
●●
●●
●
● ●
●
●
●
●
●
●
●
11_proximal
●
Enh
●
●
14_Repetitive/CNV
●
0
●
●
3
●
●
1_Active_Promoter
●
1_proximal
ReprEnh.13
●
ReprEnh.4
Prom.11
●
PromF
●
●
Enh.6
●
Enh1
●
4_Strong_Enhancer
●
●
Tss
0.2
●
1
●
8
●
7_tss
●
Enh.15
Prom.22
●
Tss
●
●
2
●
Art
●
0.2
●
15
●
0.4
Tss
●●
Tss.22
●
●
● ●
0.0
0.6
0.4
81_Active_Promoter
● ●
7_tss
●
0.2
Cumulative recall (TSS)
●
1
●
2
Prom.11
●
Cumulative recall (HOT)
●
●
Enh1
●
●
●
Fraction of segments bound
●
●
●
●
●
1_proximal
● Prom.16
Tss
●
●
●
●
●●
●
●●
●
●●●
●
●
●
●
●
●
●
●
●●
●
●●
●●
● ●
● ●●
●●●●
●
●●
●●●● ●
● ●●●●
● ●●
● ●●● ●●●
● ● ●●
● ● ●●●
●
●●●
●● ●
●●
●
●
●
●
● ●
●
●
●●
●
●
●
●●
●
13
Enh.6
●
C
HOT regions
1.0
1.0
GRO−cap TSS
●
0.0
1.0
●
●
●● ●
0.2
● ●
0.4
0.6
0.8
1.0
Cumulative false discovery rate (HOT)
Cumulative false discovery rate (TSS)
D
0.7
E
F
***
***
***
**
0.6
2.0
*
*
*
**
*
*
*
2
Pr
om
En ot
ha er
nc sta
er te
st s
at
es
●
●
●
●
0.5
1
0
0.0
−0.5
−1
0.6
1.0
G
en
Recall of FANTOM5 promoter
0.8
TA
N
0.4
oS
0.2
−P
G (n oiL
en =2 o
oS 10 g
TA )
C
hr
om (n N−
H =28 NB
C
M 1
hr
M )
om
−
H (n Na
M
t
M =33 ure
−E 7
Se
N )
gw (n CO
ay =2 DE
−E 71
N )
Se (n CO
gw =2 D
ay 36 E
Se
)
gw ( −nm
ay n=2 et
−R 48 h
eg )
.
(n B
=1 uil
37 d
Ep )
(n icS
=3 e
06 g
)
−1.0
0.0
S
om eg Ep
H wa icS
C MM y-n eg
hr
om -EN me
Se H C th
gw MM OD
E
Se ay- -Na
gw Re tur
ay g.B e
G -E u
G en NC ild
en o O
oS ST D
TA AN E
N -N
-P B
oi
Lo
g
●
●
●
***
E
( n nh
=2 .1
10 5
)
(n En
=2 h.
2
R 03)
ep
r
(n ess
=1 e
93 d
)
(n L
=2 ow
62
)
0.1
●
●
●
●
***
C
hr
●
●
0.0
●
GenoSTAN-PoiLog
GenoSTAN-NB
ChromHMM−ENCODE
ChromHMM−Nature
Segway−ENCODE
Segway−nmeth
Segway−Reg.Build
EpicSeg
●
1.0
Enhancer activity
Enhancer activity (GenoSTAN-PoiLog)
0.5
0.4
0.3
●
0.2
Recall of FANTOM5 enhancer
1.5
Figure 3: Comparison of GenoSTAN to other published segmentations on benchmark set I. (A) Performance of
chromatin states in recovering GRO-cap transcription start sites. Cumulative FDR and recall are calculated by
subsequently adding states (in order of increasing FDR). (B) The same as in (A) for ENCODE HOT regions. (C)
The fraction of predicted enhancer segments bound by individual TFs is shown for different studies. GenoSTAN
enhancers are more frequently bound by TFs than those from other studies. (D) Recall of FANTOM5 promoters
and enhancers which are active in K562 (i.e. overlapping with a GRO-cap TSS and an ENCODE DNase hypersnesitivity site) by predicted promoters and enhancers is plotted to assess how well models distinguish promoters
from enhancers. (E) Predicted enhancers show significantly higher activity than repressed and low coverage regions
as measured by a reporter assay (’*’, ’**’ and ’***’ indicate p-values <0.05, 0,01 and 0.001). (F) Comparison of
experimetal measures of enhancer activity between different studies.
27
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
B
● ●●
● ●●● ● ● ●●
●
● ●
●●●●●
●
●
●●
●●●
●
●
●●
● ● ●● ●●● ●
●
Enh.6
●
●
0.8
0.8
●
●●
Prom.5
●10_TssBiv
●
Prom.14
● 1_TssA
Prom.6
●
●
1_TssA
Prom.5
●
Prom.21●
●
●
3_PromD1
Prom.19
GenoSTAN−PoiLog-20
GenoSTAN−NB-20
GenoSTAN−PoiLog-127
GenoSTAN−NB-127
ChromHMM−15
ChromHMM−18
ChromHMM−25
3_PromD1
0.2
●
● ●
0.0
1_TssA
●
GenoSTAN−PoiLog-127
GenoSTAN−NB-127
ChromHMM−15
ChromHMM−25
●
0.1
0.2
0.3
0.4
0.5
0.6
●
0.0
0.0
●
● ● ●●
0.0
0.7
●
●
●
●
●
●●
●
●
0.8
●
●
●
●
0.6
●
●
ReprEnh.2
●
●
●●
●
●
●
●
PromP.11 ●
Enh.12
●
●
Enh.6●
●
PromF.4
0.4
●
●
●
Enh.9
●
●
●●
● ●
●
●●
●
●
●
● 2_TssAFlnk
8_EnhG2●
● 3_TssFlnkU
9_TxReg ●
Prom.21●
2_PromU
Prom.5
●
●
0.2
Prom.19 ●
Enh.9
●
● 1_TssA
●
●
1_TssA
3_PromD1
●
●
● 10_TssBiv
●Prom.6
●
20 cell lines
●
●
●
●
Enh.12
●
0.0
0.2
●
●
0.4
0.0
0.0
●
●●
1.0
om
En ote
ha r s
nc ta
er tes
st
at
es
●
0.8
Pr
● ●
●
Recall of FANTOM5 enhancer
●
●
0.6
●
● ●
●
0.4
0.6
● ●●
●
●
●
0.4
●
●
D
●●
●
●●
●●
●
●●
●
●●
●
●●
●
●
●●●
●●
●●
●●
●●
●● ● ●● ●● ●
● ●●●
●●
●●●
●
●
●
●●●
●
●
●
●● ●
●
●
0.2
1.0
0.8
GenoSTAN−PoiLog-20
GenoSTAN−NB-20
GenoSTAN−PoiLog-127
GenoSTAN−NB-127
ChromHMM−15
ChromHMM−18
ChromHMM−25
●
Cumulative false discovery rate (TSS)
HOT regions
(K562, Gm12878, H1hesc, HeLa−S3, HepG2)
C
1_TssA
0.2
Cumulative false discovery rate (CAGE tags)
Cumulative recall (HOT)
Prom.1
Prom.15 ●
0.6
0.8
1.0
●
0.0
GenoSTAN−PoiLog−20
GenoSTAN−NB−20
GenoSTAN−PoiLog−127
GenoSTAN−NB−127
ChromHMM−15
ChromHMM−18
ChromHMM−25
0.2
●
0.4
Recall of FANTOM5 promoter
Cumulative false discovery rate (HOT)
0.6
●●
●
0.8
C
hr
om
C
H
oS hro MM
T A mH - 1
5
G N-P MM
en
oS oilo -18
TA g-1
27
N
G
en Ch -nb
oS rom -1
TA H 27
M
N
M
G
en -Po -2
oS ilo 5
TA g-2
0
N
-n
b20
0.4
●
2_PromU ●
●
●
●
●
●
0.6
0.6
3_TxnFlnk●●
1_TssA
Prom.19
YY1
E2F6
ELF1
Egr1
ZBTB7A
CCNT2
HMGN3
IRF1
KDM5B
PHF8
TAF1
TBP
RBBP5
UBF
ETS1
Sin3A
E2F4
HDAC1
SAP30
CHD1
SP1
CHD2
Mxi1
THAP1
SIX5
Nrf1
NFYB
NFYA
SP2
USF2
GTF2F1
GTF2B
TAF7
Pol2(phosphoS2)
SRF
ELK1
BCLAF1
ZBTB33
SETDB1
Znf143
CTCF
Rad21
SMC3
CTCFL
ZNF263
Myc
Max
Pol2
MAZ
BHLHE40
JunD
RCOR1
CBX3
PML
GABPA
ATF3
SPI1
USF1
Jun
JunB
FOSL1
FOS
EP300
TEAD4
GATA2
TAL1
NR2F2
TRIM28
STAT5A
TBL1XR1
REST
ATF1
CEBPB
ARID3A
HDAC2
COREST
GATA1
STAT2
STAT1
MEF2A
MafF
MafK
Bach1
NFE2
HDAC8
BCL3
ZNF274
SIRT6
SMARCA4
SMARCB1
Pol3
BRF1
BDP1
POLR3A
GTF3C2
RFX5
HDAC6
RD
EZH2
BRF2
KAP1
NR2C2
2_PromU Prom.19
0.4
●
Cumulative recall (TSS)
● ●
●
●
0.5
Elon.13
●
Elon.7
ReprEnh.2
0.3
3_TssFlnkU
●
●
0.1
2_TssAFlnk
●
●
0.2
Cumulative recall (CAGE tags)
●
2_TssFlnk
●
●
●●
●
●
●● ●3_TxnFlnk
●
●
Enh.9
Fraction of segments bound
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●●
●●
●
●●
●
●●
●
●●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
● ●
●●
●● ●
●●●
●
● ●●● ●
●● ●
●
●
●
●
ReprEnh.11
●
●
●
●
Enh.9
●
●
● ●●
● ●
● ●
●
● ●
●
●
●
Enh.12●
●●
●
●
●●
●
E
GRO−cap TSSs (K562, Gm12878)
G
en
1.0
FANTOM5 CAGE tags (127 cell lines)
Fitted on all 127 cell types with the
5 Roadmap Epigenomics core marks
(H3K4me1, H3K4me3, H3K36me3,
H3K27me3, H3K9me3)
1.0
A
Figure 4: Comparison of GenoSTAN to other published segmentations on benchmark II and III. (A) Performance of
chromatin states in recovering FANTOM5 CAGE tags in 127 cell types. CAGE tags were verlapped with chromatin
states wihout the use of cell type information. Cumulative FDR and recall are calculated by subsequently adding
states (in order of increasing FDR). (B) Performance of chromatin states in recovering GRO-cap transcription start
sites in two cell types where GRO-cap data was available. (C) The same as in (B) for ENCODE HOT regions for
five cell types where annotation of HOT regions was available. (D) Recall of FANTOM5 promoters and enhancers
by predicted promoters and enhancersis plotted to assess how well models distinguish promoters from enhancers.
(E) The fraction of predicted enhancer segments bound by individual TFs is shown for different studies. GenoSTAN
enhancers are more frequently bound by TFs than those from other studies.
28
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
A
Method
(genomic coverage)
SNPs recalled by enhancer
state (sensitivity)
SNP frequency per bp in
enhancer state (precision)
GenoSTAN−PoiLog−20
(2%)
GenoSTAN−NB−20
(2%)
*
*
*
*
*
*
ChromHMM−18
(2%)
0
*
*
*
ChromHMM−25
(2%)
Digestive
Other
ENCODE 2012
Mesenchyme
Neurosphere
Brain
Adipose
Heart
ES cell
iPSC
ES−derived
Blood & T cell
HSC & B cell
C
*
*
*
*
*
*
*
* *
*
* *
*
* *
* *
* *
*
* *
*
*
* * * *
*
* *
*
* *
* * *
* * * *
*
* *
*
* *
*
ChromHMM−15
(2%)
GenoSTAN−PoiLog−127
(2%)
GenoSTAN−NB−127
(2%)
0.04
B
Method
(genomic coverage)
0.02
0
5e−06
1e−05
* *
* *
* *
1.5e−05
* *
SNPs recalled by promoter
state (sensitivity)
SNP frequency per bp in
promoter state (precision)
*
GenoSTAN−PoiLog−20
(1%)
GenoSTAN−NB−20
(1%)
*
*
*
* * * * *
* * * *
*
*
*
*
*
*
*
*
*
ChromHMM−15
(1%)
*
*
*
*
*
GenoSTAN−PoiLog−127
(1%)
*
GenoSTAN−NB−127
(1%)
*
* *
* *
*
*
*
*
*
*
0
5e−06
1e−05
*
1.5e−05
*
* * *
* *
*
* * *
ES−WA7 cells (E002)
H9 cells (E008)
ES−UCSF4 cells (E024)
HUES6 cells (E015)
iPS−20b cells (E020)
iPS−15b cells (E018)
iPS DF 6.9 cells (E021)
H1 BMP4 derived trophoblast (E005)
HUES64 derived CD56+ mesoderm (E013)
H1 derived neuronal progenitor cultured cells (E007)
Primary mononuclear cells (from PB) (E062)
Primary T cells from primary blood (from PB) (E034)
Primary T helper cells (from PB) (E043)
Primary T helper naive cells (from PB) (E039)
Primary T helper naive cells (from PB) (E038)
Primary T CD8+ naive cells (from PB) (E047)
Primary T regulatory cells (from PB) (E044)
Primary T helper 17 cells PMA−I stimulated (E042)
Primary T cells effector/memory enriched (PB) (E045)
Primary T helper cells PMA−I stimulated (E041)
Primary T cells from cord blood (E033)
Primary T CD8+ memory cells (from PB) (E048)
Primary T helper memory cells (from PB) (E040)
Primary T helper memory cells (from PB) (E037)
Primary monocytes (from PB) (E029)
Primary HSCs G−CSF−mobilized male (E051)
Primary HSCs G−CSF−mobilized female (E050)
Primary haematopoietic stem cells (HSCs) (E035)
Primary HSCs short term culture (E036)
Primary natural killer cells (from PB) (E046)
Primary B cells from cord blood (E031)
Primary B cells (from PB) (E032)
Mesenchymal stem cell deriv. chondrocyte (E049)
Adipose−derived mesenchymal stem cells (E025)
Ganglion eminence derived neurospheres (E054)
Thymus (E112)
Fetal thymus (E093)
Brain hippocampus middle (E071)
Brain inferior temporal lobe (E072)
Brain cingulate gyrus (E069)
Fetal brain male (E081)
Brain germinal matrix (E070)
Adipose nuclei (E063)
Fetal heart (E083)
Fetal intestine small (E085)
Fetal intestine large (E084)
Rectal mucosa donor 31 (E102)
Sigmoid colon (E106)
Stomach mucosa (E110)
Rectal mucosa donor 29 (E101)
Small intestine (E109)
Liver (E066)
Pancreatic islets (E087)
Pancreas (E098)
Placenta (E091)
Lung (E096)
Placenta amnion (E099)
Fetal kidney (E086)
GM12878 lymphoblastoid (E116)
Monocytes−CD14+ RO01746 (E124)
Dnd41 T cell leukaemia (E115)
Osteoblast (E129)
HepG2 hepatocellular carcinoma (E118)
K562 leukaemia (E123)
Ag
e−
re
la
Li ted
pi m
d a
m c
et ula
ab r
o d
Al lism ege
zh p n
ei h era
m en ti
El
O er's oty on
ec
va d pe
tro
ria ise s
ca
G
n as
rd
ly
ca e
ca
io
n
gr
te
ap He cer
d
he
h ig
m Q ic t ht
og T ra
M
Ty lo int its
ea
pe bin erv
n
co E 2 le al
rp nd dia vel
us o be s
C cu me te
ho la tri s
l r o
LD est vol sis
L ero um
ch l, e
o to
H Ur les tal
Iro C DL ate ter
n olo ch le ol
s
M ta rec ole vel
ea tu ta st s
n s b l c ero
Sy
pl io a l
st
at m nc
em
e
e
ic Sy Pla let arke r
l s t v
Pr upu temele olu rs
im s ic t c me
e
ar ry s ou
y th cle nt
b e
s
Ig iliar ma rosi
G y t s
F− gly cir osu
ce co rho s
In
fla
U ll d syl sis
m
l
a
m C cer istri tio
at r at bu n
or oh ive ti
y n' c on
bo s o
Ty we dis litis
M p l d ea
R ul e 1 ise se
he tip d a
um le iab se
at sc ete
oi ler s
d o
ar sis
th
rit
is
0.01
*
*
*
* *
0.02
* *
*
* *
*
* *
ChromHMM−18
(1%)
*
*
*
*
*
*
*
*
*
*
*
*
*
ChromHMM−25
(1%)
* *
10
20
30
-log10(p-value)
Figure 5: Enrichments of genetic variants associated with diverse traits in enhancers and promoters are specific
to the relevant cell types or tissues. (A) Median SNP recall and frequency was calculated for enhancer states in
different segmentations by restricting it to a total genomic coverage of 2% (100 samples of random subsetting) to
control for different number of enhancer calls between the segmentations. Error bars show the 95% confidence
interval. (B) The same as in (A) but for promoters. (C) The heatmap shows the -log10(p-value) of significantly
enriched traits in enhancer states (GenoSTAN-Poilog-127, p-value < 0.001, marked by ’*’). Only cell types and
tissues where at least one trait was significantly enriched are shown. P-values were adjusted for multiple testing
using the Benjamin-Hochberg correction.
29
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
TF co-binding
TF enrichment
YY1, E2F2, ELF1,
Egr1, ZBTB7A,
CCNT2, HMGNR3,
IRF1, KDM5B, PRF8,
TAF1, TBP,
RBBP5, UBF, ETS1,
Sin3A, E2F4, HDAC1,
SAP30, CHD1, SP1,
CHD2, Mxi1
THAP1, SIX5, Nrf1,
NFYB, NFYA, SP2,
USF2, GTF2F1,GTF2FB,
TAF7, Pol2-S2P,
SRF, ELK1, BCLAF1,
ZBTB33, SETDB1
Znf143, CTCF,
Rad21, SMC3, CTCFL
Prom.11
PromW.5
Enh.15
PromF.3
YY1, E2F2, ELF1,
Egr1, ZBTB7A,
CCNT2, HMGNR3,
IRF1, KDM5B, PRF8,
TAF1, TBP,
RBBP5, UBF, ETS1,
Sin3A, E2F4, HDAC1,
SAP30, CHD1, SP1,
CHD2, Mxi1
THAP1, SIX5, Nrf1,
NFYB, NFYA, SP2,
USF2, GTF2F1,GTF2FB,
TAF7, Pol2-S2P,
SRF, ELK1, BCLAF1,
ZBTB33, SETDB1
Znf143, CTCF,
Rad21, SMC3, CTCFL
ZNF263, Myc, Max,
Pol2, MAZ, BHLHE40,
JunD, RCOR1, CBX3,
PML, GABPA, ATF3,
SPI1, USF1
Jun, JunB, FOSL1, FOS
EP300, TEAD4,
GATA2, TAL1, NR2F2,
TRIM28, STAT5A,
TBL1XR1, REST, ATF1,
CEBPB, ARID3A, HDAC2
COREST, GATA1,
STAT2, STAT1, MEF2A,
MafF, MafK, Bach1,
NFE2, HDAC8, BCL3,
ZNF274, SIRT6,
SMARCA4, SMARCB1
Pol3, BRF1, BDP1,
POLR3A, GTF3C2, RFX5
HDAC6, RD, EZH2,
BRF2, KAP1, NR2C2
EnhW.2
CTCF/cohesin complex
Shared PromoterEnhancer module
Jun, JunB, FOSL1, FOS
AP-1 complex
HDAC6, RD, EZH2,
BRF2, KAP1, NR2C2
EnhF.10
Promoter module 2
(weak co-binding)
ZNF263, Myc, Max,
Pol2, MAZ, BHLHE40,
JunD, RCOR1, CBX3,
PML, GABPA, ATF3,
SPI1, USF1
EP300, TEAD4,
GATA2, TAL1, NR2F2,
TRIM28, STAT5A,
TBL1XR1, REST, ATF1,
CEBPB, ARID3A, HDAC2
COREST, GATA1,
STAT2, STAT1, MEF2A,
MafF, MafK, Bach1,
NFE2, HDAC8, BCL3,
ZNF274, SIRT6,
SMARCA4, SMARCB1
Pol3, BRF1, BDP1,
POLR3A, GTF3C2, RFX5
ReprEnh.4
Promoter module 1
(strong co-binding)
Enhancer module 1
(strong co-binding)
Enhancer module 2
(weak co-binding)
Pol3 module
Chromatin repression
Enrichment (fraction)
of TF (co-)binding
0 0.2
0.6
1
Figure 6: Promoters and enhancers have a distinctive TF regulatory landscape. Co-binding (left) and enrichment
of transcription factor binding sites (right) in chromatin states for 101 transcription factor in K562 reveals TF
regulatory modules with distinct binding preferences for promoters, enhancers and repressed regions.
30
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Supplementary Figures
Prom.22
57
33
7
11
16
40
351
529
274
230
12.8
0.8
0.4
0
18.13
26.5
32.09
34.8
1.17
0.53
0.07
Prom.16
107
27
3
2
4
29
140
261
87
109
12.1
0.6
0.1
0.02
19.5
13.72
39.89
27.12
0.71
0.18
0.03
PromWF.12
21
18
5
13
14
51
149
95
58
69
24.7
0.4
0.8
0.05
5.46
8.92
7.75
7.68
1.75
1.29
0.29
PromF.17
7
9
3
5
4
23
36
19
13
18
22.1
0.4
1.4
0.19
2.38
1.58
3.36
1.71
0.57
0.74
0.18
Enh.6
82
56
6
12
12
81
107
28
124
53
18.6
0.6
5.8
0.1
0.98
27.82
5.07
4.31
0.45
0.67
0.25
EnhTxn19
19
16
7
20
16
72
39
14
37
36
38.4
0.4
3.6
0.2
1.11
4.22
1.48
4.01
2.52
2.3
1.4
EnhWF.20
26
24
5
10
10
43
33
9
26
18
42.7
0.4
7.9
0.44
0.23
9.19
1.27
1.7
0.72
1.63
0.73
5
13
97.4
0.6
5.7
1.55
6.85
0.3
1.23
1.95
4.39
4.83
5.38
20
26
38
2.6
0.8
6.2
0.09
0.23
0.36
0.1
0.77
0.62
0.34
0.36
Gen5'.7
13
11
7
22
16
29
12
7
13
19
86.5
0.6
5.1
0.52
2.17
1.14
0.93
4.21
10.41
8.07
8.79
Elon.15
8
8
5
34
14
7
3
5
8
11
73.1
0.8
4.9
0.14
1.62
0.51
0.17
2.33
31.47
13.63
22.22
Elon.21
ReprD.4
4
7
3
13
6
3
0
3
3
5
108.2
1
7
1.19
0.58
0.18
0.39
1.26
28.97
25.85
30.94
24
10
24
7
16
45
24
11
8
20
17.9
0.6
4.3
0.45
9.4
1.99
1.28
1.6
0.46
0.2
0.41
ReprD.9
13
7
35
8
18
11
6
6
5
12
76.4
0.6
8.3
2.67
8.43
0.24
0.19
0.53
1.19
1.02
1.65
ReprW.3
7
5
24
7
12
4
3
4
5
7
155.3 0.8
11.8
6.22
3.21
0.01
0.09
0.22
0.71
1.38
1.08
Low.11
13
16
7
8
8
17
12
6
8
8
72.5
0.4
10.2
1.18
0.94
2.39
0.51
1.02
0.72
1.73
1.07
Low.5
6
10
4
11
8
10
3
4
5
8
118.1 0.6
8.6
1.25
0.6
0.27
0.7
1.91
5
12.41
7.18
Low.14
6
8
12
7
8
7
6
4
5
6
162.1 0.6
12.1
4.66
1.79
0.27
0.21
0.6
1.06
2.68
2.43
Low.8
4
5
9
4
8
2
0
2
3
4
305.9 0.8
14.9
16.59
1.68
0.03
0.26
0.41
1.03
3.74
2.25
Low.10
2
5
3
3
4
7
3
2
3
5
209.3 0.6
9.9
4.43
7.27
0.13
1.35
1.17
1.81
7.53
4.12
Low.2
3
5
3
3
4
1
0
2
0
2
346.6 1.2
26.5
27.3
0.51
0.05
0.32
0.4
3.14
5.87
7.1
Low.1
2
2
1
1
0
1
0
1
0
1
352
0.6
25.1
17.64
2.33
0.03
0.87
0.23
1.02
2.97
1.93
Low.23
0
0
0
0
0
0
0
0
0
0
67.6
0.7
20.4
13.11
4.58
0.14
0.49
0.05
0.12
0.4
0.15
0 3
20
3'
U
ro
in
t
ia
n
ed
m
#s
eg
m
en
ts
x
10
0
9a
c
H
3K
4m
3K
27
a
H
e2
4m
3K
H
3K
4m
H
H
3K
H
p3
0
as
D
N
median read count
TR
5
21
n
6
35
ex
on
20
0
w
id
th
to me
(k
G dia
b)
EN n
C di
O st
D a
E nc
TS e
S
in
te
rg
en
ic
C
pG
is
la
nd
in
st
ab
le
TS
S
st
ab
le
TS
S
TS
S/
5'
U
TR
8
56
c
7
49
e3
5
35
e1
6
29
0
3K
27
m
e3
H
3K
36
m
e3
H
4K
20
m
e1
7
34
e
EnhF.18
ReprEnh.13
150
Supplementary Figure 1: Median read coverage of GenoSTAN-nb-K562 chromatin states (left), their number of
annotated segments in the genome, their median width and distance to the closest GENCODE TSS (middle). The
right panel shows recall of genomic regions by chromatin states.
31
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
A
B
1.0
Jaccard Index
73
72
73
24
16
12
10
2
0
45
10
GenoSTAN-NB-K562
70
68
68
65
77
73
0
21
30
27
17
26
61
32
Segway-nmeth
59
70
100
87
90
78
81
79
37
4
18
10
2
1
0
6
Segway-ENCODE
74
68
87
100
88
83
81
82
42
13
31
0
4
4
37
14
Segway-Reg.Build
71
68
90
88
100
79
80
79
43
15
31
27
3
0
38
0
73
65
78
83
79
100
79
79
42
18
0
17
4
6
42
12
ChromHMM-Nature
72
77
81
81
80
79
100
79
53
25
36
35
0
19
60
33
ChromHMM-ENCODE
73
73
79
82
79
79
79
100
44
0
28
27
7
12
51
24
EpicSeg
24
0
37
42
43
42
53
44
100
43
54
50
40
41
43
33
Segway-nmeth
16
21
4
13
15
18
25
0
43
100
42
42
40
39
40
30
EpicSeg
12
30
18
31
31
0
36
28
54
42
100
76
61
57
49
48
ChromHMM-Nature
10
27
10
0
27
17
35
27
50
42
76
100
62
63
53
57
Segway-Reg.Build
2
17
2
4
3
4
0
7
40
40
61
62
100
68
43
45
ChromHMM-ENCODE
0
26
1
4
0
6
19
12
41
39
57
63
68
100
55
60
GenoSTAN-NB-K562
45
61
0
37
38
42
60
51
43
40
49
53
43
55
100
50
Segway-ENCODE
10
32
6
14
0
12
33
24
33
30
48
57
45
60
50
100
GenoSTAN-Poilog-K562
0.6
71
100
0.4
74
56
0.8
0.8
59
0.2
0.4
56
Overlap between state annotations (Jaccard Index)
0
100
2
E
56
62
D
-K
O
K5
og
C
B-
N
-P
oil
EN
-N
N
ay
gw
Se
oS
en
H
oS
om
en
hr
Strong Promoter
Strong Enhancer
G
TA
M
M
ay
gw
TA
d
E
D
uil
O
-E
-R
N
eg
C
.B
at
-N
Eic
M
M
H
om
hr
C
Se
C
G
g
e
ur
Se
h
g
et
Se
m
ic
Ep
ay
-n
gw
Se
e
E
ur
D
at
O
C
-N
N
M
C
hr
om
hr
H
om
M
H
M
M
-E
d
2
uil
56
.B
-K
eg
og
-R
oil
-P
N
TA
oS
en
G
C
h
E
et
D
O
m
C
EN
ay
-
ay
Se
gw
B-
ay
-n
-N
gw
AN
Se
ST
en
o
G
Se
gw
K5
62
0.0
GenoSTAN-Poilog-K562
Supplementary Figure 2: (A) Heatmap of pairwise overlap (Jaccard index) of promoter (red) and enhancer (orange)
state annotations from different studies on benchmark I. Rows and columns were ordered by separate clustering of
promoter and enhancer overlaps. (B) Distribution of pairwise Jaccard indices for strong promoters and enhancers
(off-diagonal elements of promoter and enhancer sub-matrices from (A)).
32
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
A
B
●
●
●
●●
●●
●
●
●
Enh.6
0.8
●
●
●●
●●●
●
●
●
●
●
●
●
PromWF.5
0.8
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●
●
●●
●
●●
●
●
●●
●●
●●
●●
●
●
●
●●●
●●
●
●● ●
● ●● ●●
● ●●●
● ● ●
●
●
●
TSS (23 state models)
1.0
1.0
TSS (18 state models)
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●●
●
●
●
●●
●
●●●
●●●
●
●●
●●●
●●
●
●
●●
●●
●
●●
●
●●
●● ●
●●
●●●● ●●
● ●● ● ●
●
●
●
● ●
●
●● ● ●
●
● ●● ●
●
●
●
●
●
●
●
Enh.15
●
●
●
●
●
●
●
●●●
●
●
0.0
●
●
●
●
0.0
●
●
0.2
0.4
0.6
0.8
0.6
●
Prom.22
●
GenoSTAN−PoiLog
GenoSTAN−NB
ChromHMM
ChromHMM−eq
Segway
EpicSeg
●
●
●
●
0.2
0.4
●
●
●
●
0.4
●
Prom.11
●
Prom.16
●
●
0.0
0.6
Cumulative recall (TSS)
●
●
●
0.2
Cumulative recall (TSS)
●
1.0
●
●●●●
0.0
●
0.2
Cumulative false discovery rate (TSS)
C
D
●
●
●
● ●
●●
●
●
●
●
●●●
●
●●●●
●●
●
●
●●●
●●●
● ● ●
●
●
●
● ● ●
●
● ●●
●● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
● ●
0.8
0.8
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
ReprEnh.4
●
Prom.11
●
●
●
●
●
0.6
0.6
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●●
●
●●
●●●
●●
●●●
●●
●
●● ●
●
●● ● ● ●
● ●
●
●
● ● ●●● ●
●
●
● ● ●
●● ●
●●
●
●
●●
●
●
●
●
●
● ●
●
●
●
ReprEnh.13
●
●
● ●
0.4
Cumulative recall (HOT)
●
0.4
Cumulative recall (HOT)
●
●
●
●
● ●
●
●
1.0
HOT regions (23 state models)
●
●
●
●
●
0.8
1.0
1.0
●
●
0.6
Cumulative false discovery rate (TSS)
HOT regions (18 state models)
● ●
0.4
Enh.6
●
●
●
●
●
●
●
●
●
●
●
Enh.15
●
Prom.22
0.2
0.2
●
●
●
●
●
●
●
●
●
0.0
●
0.2
●
●
●
0.4
●
0.0
0.0
●
● ●
0.6
0.8
1.0
●
0.0
Cumulative false discovery rate (HOT)
●
0.2
●●
●
●
0.4
0.6
0.8
1.0
Cumulative false discovery rate (HOT)
Supplementary Figure 3: Comparison of GenoSTAN to other methods using 18 and 23 states on the same data set
in K562 (Benchmark I). (A-B) Performance of chromatin states in recovering GRO-cap transcription start sites for
the 18 and 23 state models. Cumulative FDR and recall are calculated by subsequently adding states (sorted by
decreasing precision). (C-D) The same as in (A-B) for ENCODE HOT regions.
33
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
B
1.0
1.0
A
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0.8
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0.7
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
GenoSTAN−PoiLog
GenoSTAN−NB
ChromHMM
10
15
ChromHMM−eq
Segway
EpicSeg
20
25
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
30
10
15
●
●
●
●
●
20
Number of states
25
30
Number of states
D
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0.96
0.985
●
0.980
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
AUC (HOT)
●
●
0.970
●
●
●
●
●
●
●
●
0.965
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0.90
0.975
●
●
●
●
●
●
●
●
●
0.94
●
●
●
0.98
●
●
0.92
0.990
C
AUC (TSS)
●
●
●
0.9
●
0.8
●
●
●
●
●
0.7
●
0.6
0.9
●
●
●
●
●
●
●
●
●
●
●
0.5
●
●
●
Fraction of "best" state overlapping with HOT region
●
●
0.6
Fraction of promoter state overlapping with TSSs
●
●
●
●
●
●
●
●
●
●
●
●
●
0.88
0.960
●
10
15
20
25
10
30
15
20
25
30
Number of states
Number of states
Supplementary Figure 4: Comparison of chromatin segmentation algorithms with respect to their ability to call
GRO-cap transcription start sites (left panels) and ENCODE HOT regions (right panels), as a function of the state
number used in the respective algorithm (x-axes). All models were learned on the data set of benchmark I. (A-B)
For each model, the state with highest precision in recalling HOT (respectively TSS) regions is shown. (C-D) For
each model, an area under curve (AUC) score (see Methods) is plotted to asses the spatial accuracy of a genome
segmentation.
34
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
3
12
1
1
1
1
17.8
0.6
0.7
0.19
19.08
12.71
28.34
19.55
1.12
0.59
Prom.5
5
42
2
1
1
3
15.2
1.2
0.4
0.06
26.36
36.62
50.66
48.03
2.11
1.32
0.2
Enh.12
15
6
4
1
2
3
46
1
3.9
0.72
1.84
36.29
8.03
11.98
4.76
6.33
2.68
EnhW.9
8
2
1
1
1
3
61.4
0.6
7.2
1.6
1.4
6.78
2.15
2.88
0.95
2.27
1.2
Elon.4
3
2
10
1
2
3
29.6
1
4.4
0.06
0.43
0.7
0.27
2.95
17.76
7.64
11.12
0.18
Gen5'.16
4
1
3
1
1
1
36.4
1
4.9
0.14
0.78
0.81
0.54
2.6
5.04
8.24
5.3
ElonW.15
1
1
3
1
1
1
95
1
6.5
0.64
2.15
0.35
0.45
2.35
25.7
24.56
30.51
Elon.13
1
1
6
0
1
1
16.2
2
3.5
0.05
0.87
0.16
0.08
0.71
18.83
6.76
10.52
ElonW.3
1
0
2
0
1
1
84.8
0.6
6.1
0.52
3.61
0.08
0.65
0.69
11.87
13.68
15.63
ReprEnh.2
4
3
1
9
2
3
7
0.8
5.7
0.38
2.7
0.71
0.58
0.7
0.19
0.1
0.23
Repr.14
1
1
1
5
1
1
68.7
1.6
10.1
6.89
7.6
0.32
0.23
0.59
1.09
1.61
2.03
Low.6
3
2
2
2
3
3
40.1
0.8
10.2
1.49
0.92
1.62
0.7
0.98
0.97
2.2
1.53
Low.18
3
1
1
1
1
1
105.6 0.8
8.7
3.51
1.77
1.32
1.24
1.9
1.68
4.43
2.96
Low.1
1
1
1
1
1
1
231.4 1.2
19.8
17.13
2.57
0.18
0.74
0.96
2.51
6.73
6
Low.10
1
1
1
2
2
3
99
1
14.2
5.18
1.02
0.34
0.33
0.7
1.26
3.2
2.6
Low.7
0
1
1
0
2
1
72.4
3.6
33.2
21.07
2.03
0.12
0.31
0.19
0.87
1.65
1.51
Low.17
2
0
1
0
1
1
74.3
0.6
7.2
2.05
2.84
0.68
1.87
1.42
1.31
3.32
2.54
Low.8
1
0
0
2
1
1
99.9
2
11.7
15.63
10.28
0.11
0.26
0.45
1.05
2.45
1.8
Low.11
0
0
0
0
1
0
280.7 0.8
23.9
16.95
6.76
0.05
0.76
0.3
0.64
2.01
1.23
Low.20
0
0
0
0
0
0
63.8
0.8
13.4
5.73
4.98
0.06
1.79
0.07
0.3
0.92
0.23
Prom.1
3
12
1
1
1
1
19.4
0.6
0.8
0.22
20.42
14.62
30.54
21.03
1.29
0.75
0.18
Prom.19
5
46
2
1
1
3
13.5
1.2
0.4
0.04
24.76
32.7
47.86
45.61
1.87
1.13
0.13
Enh.6
19
9
4
1
2
4
32.3
0.8
3.2
0.39
1.6
32.86
7.46
10.54
3.58
4.09
1.54
EnhW.8
10
3
2
1
1
3
75.3
0.6
6.8
1.36
1.15
11.23
2.61
4.28
1.86
4.42
2.1
EnhWF.9
4
1
1
1
1
1
117.5 0.6
7.2
2.3
1.86
2.68
1.93
3.19
2.52
6.18
3.72
3
39.5
1
4.6
0.1
0.63
1.07
0.58
4.27
17.33
10.02
11.61
1
1
105.3
1
6
0.55
2.4
0.37
0.57
2.75
27.88
24.82
31.28
Elon.7
1
1
8
1
1
1
12.6
2.6
3.5
0.05
0.85
0.18
0.07
0.84
19.92
6.76
10.82
ElonW.15
1
0
2
0
1
1
85.6
0.6
5.9
0.48
3.47
0.09
0.54
0.7
11.55
12.99
15.65
ReprEnh.11
5
4
2
9
3
4
7.2
0.8
6
0.36
2.47
0.9
0.63
0.72
0.2
0.1
0.2
Repr.2
1
1
1
1
4
3
35.5
1
24.7
1.96
0.74
0.07
0.13
0.17
0.85
1.18
1.33
1
1
1
6
1
1
30.9
2.2
8.7
4.34
7.76
0.19
0.23
0.54
0.71
0.83
1.26
1
1
1
4
1
1
119.2 1.4
12.7
10.51
3
0.31
0.25
0.59
1.12
2.55
2.32
Low.3
3
2
2
2
2
3
73.1
0.8
10.5
2.6
1.26
1.58
0.82
1.05
1.35
3.11
2.08
Low.17
2
1
1
1
1
1
91.5
0.6
7.9
2.7
2.55
0.53
1.45
1.56
1.84
4.66
3.26
Low.12
1
1
1
1
1
1
161.9
1
14.9
9.33
1.99
0.34
0.84
0.84
3.44
8.82
7.51
Low.18
0
1
0
0
1
1
77.3
4.2
31
29.88
2.76
0.08
0.25
0.32
0.55
1.48
1.17
Low.20
0
0
0
0
0
0
80.2
0.6
14.7
6.51
5.96
0.07
1.99
0.13
0.39
1.07
0.34
Low.5
0
0
0
0
1
0
212.7 0.6
17.5
11.53
4.8
0.09
1.02
0.5
0.99
3.19
2.14
Low.4
0
0
0
1
1
1
133.4 1.2
12.7
14.79
9.58
0.07
0.26
0.37
0.77
1.84
1.36
3'
n
tro
in
TR
ex
o
U
5'
TS
S/
ta
bl
e
TS
S
to
TS
ed
S
(x
(s
pu
m
#s
eg
m
en
ts
3K
H
27
3K
In
9m
e3
m
e3
H
36
m
4m
3K
H
H
3K
4m
e3
e1
Repr.14
ReprW.10
U
TR
2
1
n
1
4
)
9
1
10
ia
G
n
EN m
w 00)
i
e
d
C di
O an th (
D
E di kb)
TS sta
S nc
(k e
b)
in
te
rg
en
ic
C
pG
is
la
nd
2
1
t
3
e3
Gen5'.16
ElonW.13
3K
0
3
7
20
Median #reads per segment
B
Prom.19
H
0
3
7
20
Median #reads per segment
A
Supplementary Figure 5: GenoSTAN models for benchmark II. (A) Median read coverage of GenoSTAN-Poilog127 (Benchmark set II, fitted on the 127 ENCODE and Roadmap Epigenomics cell types and tissues) chromatin
states (left), their number of annotated segments in the genome, their median width and distance to the closest
GENCODE TSSs of segments (middle). The right panel shows recall of genomic regions by chromatin states. (B)
The same as (A) for GenoSTAN-nb-127.
35
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
0.2
0.4
1
3.8
6.1
6.6
9
4.7
5.6
2.9
5.6
6.7
8.4
12.7
10.9
11.5
8.9
6.5
16.6
33.8
10
35.5
13.7
12.8
15
0.66
1.64
1.97
1
2.96
2.84
0.3
16.46
32.62
11.97
13.86
0.33
0.39
1.46
0.26
0.54
1.7
2.69
2.42
0.68
1.1
0.98
0.47
0.4
0.32
0.21
0.95
1.53
1.4
4.45
6.86
0.51
9.09
24.76
3.33
14.09
0.3
0.45
3.08
0.41
1.08
4.34
6.78
6.02
0.97
2.93
2.79
1.33
1.45
0.89
0.07
0.12
0.38
0.51
2.4
3.89
0.27
10.88
33.05
5.84
16.24
0.35
0.56
3.22
0.3
0.76
3.61
4.95
4.81
1.06
2.51
2.23
0.96
0.84
0.19
0.05
0.02
0.21
0.11
0.51
1.35
0.46
0.08
0.46
0.01
0.44
0.51
2.65
13.48
0.77
2.95
2.62
1.3
12.85
7.05
3.63
24.56
4.88
13.13
5.91
17.73
21.41
7.19
1.14
0.59
1.2
1.59
0.46
2.15
0.27
3.35
2.87
5.28
4.48
0.53
0.82
1.8
1.58
2.08
0.93
2.54
2.42
3.71
9.14
4.74
10.95
28.33
8.72
24.02
14.85
5.65
1.8
0.69
0.34
0.08
0.09
0.5
0.05
0.24
0.47
0.31
1.52
0.84
0.1
0.05
0.28
0.01
0.05
0.02
0.01
33.7
38.8
7.31
5.52
2.7
1.89
0.97
0.32
0.34
0.02
0.48
0.45
0.08
0.24
0.33
0.57
0.87
1.27
0.5
0.14
0.94
0.09
1.12
0.19
1.18
22.62
38.37
7.4
5.98
5.22
3.97
0.89
3
2.35
0.24
0.7
0.49
0.26
0.71
0.46
0.73
1.65
2.4
0.6
0.11
1.13
0.18
0.35
0.2
0.01
0.69
1.44
1.79
0.69
2.64
2.74
0.31
16.15
35.09
8.27
13.92
0.23
0.46
1.4
0.62
1.94
1.01
3.37
3.14
0.8
1.4
0.29
0.67
0.63
0.31
0.2
0.81
1.38
0.93
3.35
5.83
0.55
8.58
24.56
2.16
14.22
0.14
0.48
3.09
1.25
4.8
2.32
7.33
7.92
1.21
3.68
0.72
2.13
1.54
0.83
0.06
0.09
0.31
0.33
1.77
2.83
0.58
10.41
33.99
3.77
15.83
0.22
0.82
2.75
0.87
4.2
1.85
4.75
7.14
1.25
3.28
0.41
1.39
0.94
0.18
en
m
eg
m
n
TR
U
3'
n
tro
in
TR
ex
o
)
U
bl
e
5'
ta
S/
to
G
#s
TS
TS
S
e3
4m
3K
H
H
3K
4m
e1
Low.25
0.4
0.8
0.4
0.6
0.4
0.4
0.4
0.6
0.8
1.9
0.6
0.6
1.6
1
0.4
0.4
0.6
0.6
2
0.8
0.4
1.8
0.4
0.8
0.6
18.38
43.64
7.22
6.5
4.98
4.06
1.23
3.2
2.42
0.48
0.73
0.78
0.18
0.84
0.08
0.42
0.7
2.1
0.36
0.08
0.53
0.62
0.27
0.17
0.02
(s
Low.24
14.2
16.7
25.6
15.7
53.7
108.9
28.3
54.9
132.3
6.5
106.1
12.7
27
195.6
38.1
129.9
95.1
89.4
115
124.5
161.6
140.4
151.8
185.6
76.4
28.08
45.14
6.67
5.52
2.35
1.97
2.17
0.38
0.45
0.02
0.53
0.58
0.06
0.32
0.08
0.23
0.6
1.13
0.44
0.12
0.57
0.22
0.8
0.29
1.3
S
Low.1
Low.17
1
4
3
4
3
3
1
3
3
1
1
3
1
1
4
3
3
1
1
1
1
1
0
1
0
9.17
33.34
8.75
26.2
12.09
4.65
1.38
0.74
0.33
0.11
0.08
0.92
0.03
0.38
0.02
0.42
0.31
0.75
0.09
0.03
0.08
0.07
0.04
0.01
0.01
TS
Low.7
Low.11
13
12
4
17
8
3
11
2
1
1
0
3
1
1
2
1
2
1
1
1
1
1
0
0
0
15.08
24.11
6.51
0.97
0.68
1.75
2.61
0.53
2.18
0.39
3.39
3.82
4.59
4.08
7.28
0.45
0.69
1.92
1.56
1.22
1.23
3.69
1.73
4.6
4.93
EN m
w
)
C edi idth
O an
(k
D
b
E di
TS sta )
S nc
(k e
in b)
te
rg
en
C
ic
pG
is
la
nd
Low.8
Low.10
Low.22
13
58
11
20
6
4
2
3
2
1
1
3
1
1
3
2
2
2
1
1
1
0
0
0
0
0.05
0.02
0.25
0.18
0.77
2.42
0.29
0.08
0.49
0.02
0.46
1.02
2.56
11.96
3.63
0.75
2.52
2.66
12.51
11.89
1.89
17.23
1.16
19.27
5.92
t
Low.2
Low.13
10
52
6
34
9
4
1
3
1
1
0
3
1
1
3
1
1
1
0
0
0
0
0
0
0
0.2
0.4
1.1
4.5
6.4
7.5
5.8
4.6
5.7
3.2
5.7
7.5
8.9
12.7
10
11.1
10.8
7.9
24.5
35.6
9.6
14.6
8.4
28.9
14.2
00
ReprD.6
ReprW.3
0
2
2
2
2
2
2
2
2
2
1
3
2
2
5
2
2
1
2
3
1
2
0
1
0
10
Elon.19
ElonW.18
ReprD.23
0
1
1
1
1
1
1
1
1
0
0
10
7
4
2
1
2
0
1
1
1
0
0
1
0
0.4
0.8
0.6
0.8
0.6
0.4
0.4
0.8
0.8
1.8
0.6
0.6
1.2
1
0.8
0.4
0.4
0.6
2.2
1.6
0.4
1.2
0.4
0.8
0.6
n
Gen5'.4
ElonW.16
0
2
3
5
5
2
1
10
5
9
2
1
1
1
3
2
1
1
1
1
0
0
0
0
0
ia
EnhWF.5
DNase.20
16
59
16
16
6
3
2
2
1
1
0
3
1
1
3
1
1
1
1
1
0
0
0
0
0
pu
Enh.9
EnhF.12
1
5
10
27
19
10
4
4
1
1
0
5
1
1
3
2
5
4
1
1
1
0
0
0
0
(x
0
3
7 20 55
Median #reads per segment
Prom.21
12.7
16.7
26.4
19.7
63.8
142.9
20.8
55
133.4
9.8
104.3
24.3
36.3
180.5
52.9
33.7
113.7
126.6
109.3
103.2
103.4
172.1
63.5
254.1
78.5
ed
Prom.14
PromWF.15
In
B
ts
Low.25
e3
3K
27
ac
H
3K
9a
c
D
N
as
e
Low.5
Low.23
H
Low.3
e3
Low.19
9m
Low.24
1
4
3
4
3
3
1
3
1
1
1
3
1
1
1
4
3
1
1
1
1
1
0
0
0
m
Low.1
Low.17
9
12
4
13
5
2
2
2
1
1
0
2
1
1
0
2
1
1
1
1
1
1
0
0
0
3K
Low.21
Low.20
12
53
10
15
5
3
1
3
2
2
1
2
1
1
0
3
2
1
1
1
1
1
0
0
0
H
Repr.2
ReprW.16
9
48
6
25
6
3
1
3
1
1
0
3
1
1
0
3
1
1
0
0
0
0
0
0
0
e3
ReprD.8
ReprW.11
0
2
2
2
2
2
1
2
2
2
1
2
2
2
1
5
2
1
2
3
1
1
0
1
0
m
Elon.18
Elon.14
ElonW.22
0
1
1
1
1
1
0
1
1
1
0
7
7
4
2
2
2
1
1
1
1
1
0
0
0
27
Gen5'.7
0
2
3
5
3
2
0
10
5
9
2
1
1
1
0
3
2
1
1
1
1
0
0
0
0
3K
EnhWF.10
DNaseF.12
15
55
14
13
5
2
1
2
1
1
0
2
1
1
0
3
1
1
1
1
1
0
0
0
0
H
Enh.9
EnhF.13
1
5
11
26
15
6
3
4
1
1
0
4
1
1
0
3
2
3
1
0
1
0
0
0
0
36
0
3
7 20 55
Median #reads per segment
Prom.6
3K
Prom.15
PromWF.4
H
A
Supplementary Figure 6: GenoSTAN models for benchmark III. (A) Median read coverage of GenoSTAN-Poilog20 (Benchmark set III, fitted on the 20 ENCODE and Roadmap Epigenomics cell types and tissues) chromatin
states (left), their number of annotated segments in the genome, their median width and distance to the closest
GENCODE TSSs of segments (middle). The right panel shows recall of genomic regions by chromatin states. (B)
The same as (A) for GenoSTAN-nb-20.
36
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Genomic position
47.17 mb
47.19 mb
47.18 mb
47.21 mb
47.2 mb
47.23 mb
47.22 mb
47.24 mb
H3K4me1
H3K4me3
DNase-Seq
TAL1
UCSC genes
GenoSTAN
GRO-cap TSSs
+
-
Poilog-127
nb-127
Poilog-20
nb-20
ChromHMM-25
ChromHMM-18
ChromHMM-15
Promoter
Enhancer
Promoter & downstream
TAL1 enhancer
Downstream
TAL1 enhancer
TAL1 promoter &
upstream enhancer
Supplementary Figure 7: Chromatin states annotations for benchmark II and III GenoSTAN segmentations
are shown with the three Roadmap Epigenomics ChromHMM segmentations with 15 (ChromHMM-15), 18
(ChromHMM-18) and 25 (ChromHMM-25) states. Shown is the TAL1 locus with segmentations and data from the
K562 cell line.
37
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
A
C
60
*
40
*
*
*
* *
* *
* *
*
* *
* *
*
* *
*
*
*
*
20
#traits enriched in at least
one cell/tissue type
*
*
GenoSTAN−Poilog−127 (Enh.12)
GenoSTAN−nb−127 (Enh.6)
ChromHMM−15 (7_Enh)
ChromHMM−25 (13_EnhA1)
0
*
* *
0
0.01
0.02
0.03
0.04
*
*
0.05
*
p-value
B
25
20
*
*
*
* *
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
5
*
p-value
8 10
*
*
*
*
*
*
*
*
*
* *
*
*
*
*
*
*
* *
*
*
*
* * *
*
*
IMR90 fetal lung fibroblasts (E017)
HUES48 cells (E014)
HUES64 cells (E016)
H9 cells (E008)
ES−WA7 cells (E002)
ES−I3 cells (E001)
iPS−18 cells (E019)
H1 BMP4 derived mesendoderm (E004)
H9 derived neuronal progenitor cultured cells (E009)
HUES64 derived CD56+ mesoderm (E013)
HUES64 derived CD184+ endoderm (E011)
H9 derived neuron cultured cells (E010)
HUES64 derived CD56+ ectoderm (E012)
Primary T CD8+ memory cells (from PB) (E048)
Primary T helper naive cells (from PB) (E038)
Primary T cells effector/memory enriched (PB) (E045)
Primary T CD8+ naive cells (from PB) (E047)
Primary mononuclear cells (from PB) (E062)
Primary T helper memory cells (from PB) (E040)
Primary T helper memory cells (from PB) (E037)
Primary T helper cells PMA−I stimulated (E041)
Primary T regulatory cells (from PB) (E044)
Primary T helper 17 cells PMA−I stimulated (E042)
Primary T cells from primary blood (from PB) (E034)
Primary HSCs short term culture (E036)
Primary neutrophils (from PB) (E030)
Primary monocytes (from PB) (E029)
Primary B cells from cord blood (E031)
Primary B cells (from PB) (E032)
Primary natural killer cells (from PB) (E046)
Adipose−derived mesenchymal stem cells (E025)
Mesenchymal stem cell derived adipocyte (E023)
Mesenchymal stem cell deriv. chondrocyte (E049)
Foreskin fibroblast (E055)
Foreskin fibroblast (E056)
Breast myoepithelial (E027)
Foreskin keratinocyte (E057)
Foreskin keratinocyte (E058)
Breast vHMEC mammary epithelial (E028)
Cortex derived neurospheres (E053)
Fetal brain female (E082)
Skeletal muscle female (E108)
Rectal smooth muscle (E103)
Oesophagus (E079)
Fetal stomach (E092)
Fetal intestine large (E084)
Rectal mucosa donor 29 (E101)
Colonic mucosa (E075)
Duodenum mucosa (E077)
Liver (E066)
Spleen (E113)
Fetal adrenal gland (E080)
Pancreatic islets (E087)
GM12878 lymphoblastoid (E116)
Monocytes−CD14+ RO01746 (E124)
Osteoblast (E129)
NHDF−ad adult dermal fibroblast (E126)
NH−A astrocyte (E125)
NHLF lung fibroblast (E128)
HMEC mammary epithelial (E119)
HeLa−S3 cervical carcinoma (E117)
A549 EtOH 0.02pct lung carcinoma (E114)
K562 leukaemia (E123)
HepG2 hepatocellular carcinoma (E118)
Dnd41 T cell leukaemia (E115)
HSMM skeletal muscle myoblasts (E120)
NHEK−epidermal keratinocyte (E127)
ic
m
0.05
6
fla
0.04
em
0.03
4
In
0.02
Sy
st
0.01
2
lu IgG
pu g
s lyc
er o
m Mu yth syl
at lt em at
C
or ip a ion
hr
y le to
on
bo sc s
ic
w l us
ly
m Pl el ero
ph at di sis
oc ele sea
yt t c se
ic o
H leu un
Ty IV− ke ts
A p 1 m
O dip e 2 co ia
be o d nt
si ne iab rol
Id
ty c e
io
−r tin te
Bl
pa
el le s
oo
th
at v
d
ic
ed el
m
Al m
tr s
et
zh em
ab H aits
ei b
ol ei
m ra
Ag
ite gh
er n
e−
's ou
t
l
e
di s
re
ve
s n
la
te A eas eph Aut ls
d lz e r is
m he b op m
ac im io a
Li
ul e ma thy
pi
ar r's rk
d
m
et C deg dis ers
ab ho e ea
ol les ne se
is te ra
m r tio
o
LD phe l, t n
L no ota
ch ty l
ol pe
es s
te
ro
l
*
0
*
*
*
*
15
10
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
*
*
*
*
*
*
*
*
*
*
*
0
#traits enriched in at least
one cell/tissue type
*
*
promoter state
GenoSTAN−Poilog−127 (Prom.5)
GenoSTAN−nb−127 (Prom.19)
ChromHMM−15 (1_TssA)
ChromHMM−25 (1_TssA)
* * *
* *
*
*
*
* *
*
*
*
*
*
* *
* *
*
0
-log10(p-value)
*
*
*
*
*
Adipose
Digestive
Other
ENCODE 2012
HSC & B cell
Mesenchyme
Neurosphere
Brain
Heart
IMR90
ES cell
iPSC
ES−derived
Blood & T cell
80
enhancer state
Supplementary Figure 8: Enrichments of genetic variants associated with diverse traits in enhancers and promoters
are specific to the relevant cell types. (A) The number of traits which are enriched in enhancer states in at least
one cell type or tissue is plotted for p-values < 0.05. (B) The same as in (A) but for promoters. (C) The heatmap
shows the -log10(p-value) of significantly enriched traits in promoter states (GenoSTAN-Poilog-127, p-value < 0.05,
marked by ’*’). P-values were adjusted for multiple testing using the Benjamin-Hochberg correction.
38
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
A
B
18000
Number of promoters
Number of enhancers
50000
40000
30000
20000
16000
14000
12000
10000
8000
10000
Heart
Adipose
Epithelial
IMR90
Thymus
Mesenchyme
Other
Muscle
ES cell
iPSC
ENCODE 2012
Digestive
Brain
Tissue group
HSC & B cell
ES−derived
Blood & T cell
Myostatin
Smooth Muscle
Neurosphere
Thymus
Epithelial
Neurosphere
Adipose
Blood & T cell
ES cell
ENCODE 2012
Other
HSC & B cell
Heart
Smooth Muscle
IMR90
Muscle
Myostatin
iPSC
Digestive
Brain
Mesenchyme
ES−derived
6000
Tissue group
C
D
50000
Number of promoters
Number of enhancers
18000
40000
30000
20000
16000
14000
12000
10000
8000
10000
6000
Sample type
Cell line
Primary cultures
Primary Tissue
Primary cells
ES cell derived
Primary cells
Primary cultures
Cell line
Primary Tissue
ES cell derived
0
Sample type
Supplementary Figure 9: Dependency of number of predicted promoters and enhancers on tissue group and sample
type. (A) Number of enhancer states per Roadmap Epigenomics cell/tissue group. (B) The same as in (A) for
promoters. (C) Number of enhancer states per Roadmap Epigenomics sample type. (D) The same as in (C) for
promoters.
39
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Supplementary Tables
Method/segmentation
GenoSTAN-Poilog-K562
GenoSTAN-nb-K562
ChromHMM-Nature
ChromHMM-ENCODE
Segway-ENCODE
Segway-nmeth
Segway-Reg.Build
EpicSeg
Benchmark I - K562 (one cell type)
#promoters
#enhancers
11,358 (Prom.11)
10,932 (Enh.15)
12,829 (Prom.22)
18,551 (Enh.6)
16,118 (1_Active_Promoter) 30,492 (4_Strong_Enhancer)
16,452 (Tss)
22,323 (Enh)
19,894 (Tss)
33,518 (Enh1)
25,812 (8)
80,043 (0)
13,668 (7_tss)
38,992 (11_proximal)
16,192 (2)
53,982 (3)
Method/segmentation
GenoSTAN-Poilog-127
GenoSTAN-nb-127
GenoSTAN-Poilog-20
GenoSTAN-nb-20
ChromHMM-15
ChromHMM-18
ChromHMM-25
Benchmark II & III - 127 cell types and tissues
#promoters
#enhancers
15,229 (Prom.5)
45,955 (Enh.12)
13,547 (Prom.19)
32,280 (Enh.6)
12,710 (Prom.15)
19,730 (Enh.9)
14,168 (Prom.14)
15,655 (Enh.9)
21,002 (1_TssA)
92,824 (7_Enh)
20,049 (1_TssA)
22,678 (9_EnhA1)
12,525 (1_TssA)
12,706 (13_EnhA1)
Supplementary Table 1: Number of promoter and enhancer states for the chromatin state annotations analyzed in
this study. The original state name is given in brackets.
40
bioRxiv preprint first posted online Feb. 23, 2016; doi: http://dx.doi.org/10.1101/041020. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Method/segmentation
GenoSTAN-Poilog-K562
GenoSTAN-nb-K562
ChromHMM-Nature
ChromHMM-ENCODE
Segway-ENCODE
Segway-nmeth
Segway-Reg.Build
EpicSeg
Benchmark I - K562 (one cell type)
promoter states
enhancer states
Prom.11, PromW.5
Enh.15, Enh.2
Prom.16, Prom.22
Enh.6, Enh.19
Tss, TssF
Enh, EnhW
1_Active_Promoter, 2_Weak_Promoter 4_Strong_Enhancer, 5_Strong_Enhancer
Tss, PromF
Enh1, Enh2
8,6
0, 13
7_tss, 0_proximal
1_proximal, 11_proximal
2
3
Method/segmentation
GenoSTAN-Poilog-127
GenoSTAN-nb-127
GenoSTAN-Poilog-20
GenoSTAN-nb-20
ChromHMM-15
ChromHMM-18
ChromHMM-25
Benchmark II & III - 127 cell types and tissues
promoter states
enhancer states
Prom.19, Prom.5
Enh.12, EnhW.9
Prom.1, Prom.19
Enh.6, EnhW.8
Prom.15, Prom.6
Enh.9, EnhF.13
Prom.14, Prom.21
Enh.9, EnhF.12
1_TssA, 2_PromU
13_EnhA1, 14_EnhA2
1_TssA, 2_TssFlnk
9_EnhA1, 10_EnhA2
1_TssA, 2_TssAFlnk
7_Enh, 6_EnhG
Supplementary Table 2: Table showing promoter and enhancers states used to calculate recall of FANTOM5
promoters and enhancers. Two promoter and enhancer states were used for each segmentation, except for the
EpicSeg segmentation, which only fitted one enhancer state.
41