Comparative annotation of functional regions in the human
genome using epigenomic data
Kyoung-Jae Won1,2, Xian Zhang1, Wang Tao1, Bo Ding1, Debasish Raha3, Michael
Snyder3, Bing Ren4, Wei Wang1,*
1 Department of Chemistry & Biochemistry, University of California, San Diego, 9500
Gilman Drive, La Jolla CA 92093-0359, USA
2 Department of Genetics, The Institute for Diabetes, Obesity, and Metabolism, Perelman
School of Medicine at the University of Pennsylvania, 3400 Civic Center Blvd,
Philadelphia, PA 19104, USA
3 Department of Genetics, Stanford University, Stanford, CA 94305
4 Ludwig Institute for Cancer Research and Department of Cellular and Molecular
Medicine, UCSD School of Medicine, 9500 Gilman Drive, La Jolla, CA 92093, USA
*Corresponding author
Contents
1. Data
2. Method: The ChroModule model
2.1 Model configuration and training
2.2 ChroModule Annotations
2.3 Assessing ChroModule Annotations
2.4 Robustness of the ChroModule model
3. Analysis
3.1 Epigenomic variation scores (EVSs)
3.2 Enhancers are widely distributed in the genome and dictate cell-type
specificity
3.3 Cell-type specific enhancers regulate genes critical for the cell-type specific
functions
4. References
1. Data
Table S1 lists the datasets used in this study. The sequencing data for
ChroModule were processed as described in (1). Briefly, sequencing reads were binned
into 100-bp bins, and the binned read counts were normalized by dividing by the top
0.01% value of the corresponding mark. The normalized data were then used to train and
test ChroModule. We excluded unmappable reads in this study.
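The binning and normalization described above can be sketched as follows. This is a minimal illustration, not the pipeline of (1): the function names are hypothetical, reads are assumed to be given as 0-based start positions on one chromosome, and the "top 0.01% value" is interpreted here as the smallest value within the top 0.01% of bins. How this scale relates to the "normalized read counts" thresholds used for training is not specified in this document.

```python
BIN_SIZE = 100  # bp, as stated in the text


def bin_reads(read_starts, chrom_length, bin_size=BIN_SIZE):
    """Count reads per fixed-width bin from 0-based read start positions."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < chrom_length:
            counts[pos // bin_size] += 1
    return counts


def top_fraction_value(values, fraction=0.0001):
    """Value at the boundary of the top `fraction` of bins (0.0001 = top 0.01%).

    Assumption: for small inputs this falls back to the maximum.
    """
    ordered = sorted(values, reverse=True)
    k = max(int(len(ordered) * fraction) - 1, 0)
    return ordered[k]


def normalize(counts, fraction=0.0001):
    """Divide every bin count by the top-fraction value of the same mark."""
    scale = top_fraction_value(counts, fraction)
    if scale == 0:
        return [0.0] * len(counts)
    return [c / scale for c in counts]
```

Each histone mark would be normalized separately, since the scale is taken from "the corresponding mark".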
Table S1. Experimental data used in this study. Histone marks were H3K4me1/2/3,
H3K9ac, H3K27ac, H3K27me3, H3K36me3 and H4K20me1.
Cell line   Data type      Source
GM12878     histone        ENCODE
            DNaseI         ENCODE
            31 TFs (*)     ENCODE
            RNA-seq        ENCODE
            p300           ENCODE
H1          histone        ENCODE
            DNaseI         ENCODE
            Oct4           (2)
            RNA-seq        ENCODE
            p300           ENCODE
Hmec        histone        ENCODE
            DNaseI         ENCODE
Hsmm        histone        ENCODE
            FAIRE          ENCODE
Huvec       histone        ENCODE
            DNaseI         ENCODE
            RNA-seq        ENCODE
K562        histone        ENCODE
            DNaseI         ENCODE
            57 TFs (**)    ENCODE
            RNA-seq        ENCODE
            p300           ENCODE
Nhek        histone        ENCODE
            DNaseI         ENCODE
            RNA-seq        ENCODE
Nhlf        histone        ENCODE
            DNaseI         ENCODE
(*) Batf, Bcl11a, Bcl3Pcr1xBcl3, Cfos, CfosV2, Ebf, Egr1, Irf4, Jund, JundV2, Max, MaxV2, NfkbIggrab,
NfkbTnfa, Oct2, P300, Pax5c20, Pax5n19, Pbx3, Pu1, Rad21Iggrab, Sin3ak20, Sp1Pcr1x, Taf1, Tafii,
Tcf12, Tr4, Usf1, Yy1, Zbtb33, and Zzz3.
(**) Atf3, Bdp1, Brf1, Brf2, Brg1Musigg, Cfos, CfosV2, Cjun, CjunV2, Cmyc, CmycV2, Egr1, Gtf2b,
Hey1, Ifna30Cmyc, Ifna30Stat1, Ifna30Stat2, Ifna6hCjun, Ifna6hCmyc, Ifna6hStat1, Ifna6hStat2,
Ifng30Cjun, Ifng6hCjun, Ifng6hCmyc, Ini1Musigg, Jund, JundV2, Max, MaxV2, Nelfe, Nfe2, Nfe2V2,
Nfya, Nfyb, Pu1, Rad21, Rpc155, Sin3ak20, Sirt6, Six5, Stat1Ifng30, Stat1Ifng6h, Taf1, Tfiiic, Usf1,
Xrcc4, Znf263, bE2f4, E2f6, Gata1, bGata1V2, bGata2, bSetdb1, Tr4, Yy1, Znf263V2, and Znf274.
2. The ChroModule model
2.1 Model configuration and training
Table S2. The training dataset for ChroModule in Huvec.
Module               Number of     Training set
                     HMM states
Forward promoter     5             ±1kb regions around the top 400 TSSs of the highest
                                   expressed genes (RNA-seq data in Huvec) with a high
                                   H3K4me3 (>8 normalized read counts)
Backward promoter    5             Not trained. Obtained by flipping the forward promoter
                                   HMM module.
Enhancer             5             ±1kb regions around the 355 distal (>2.5kb from annotated
                                   RefSeq TSS) DHSs with a high (>2 normalized read counts)
                                   H3K4me1/2 and a low (<2 normalized read counts) H3K4me3
Transcribed region   1             Top 1,000 exon regions in chromosome 1 with a high (>2
                                   normalized read counts) H3K36me3
Repressed region     1             Top 1,000 exon regions in chromosome 1 with a high (>2
                                   normalized read counts) H3K27me3
Background           1             Entire chromosome 1
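The selection rule for the enhancer training set in Table S2 can be illustrated with a short sketch. The function names and the input layout (one dictionary per DHS with its position and normalized mark signals, on a single chromosome) are hypothetical. "High H3K4me1/2" is read here as requiring both marks to exceed the threshold, which is an assumption; the table does not say whether one mark suffices.

```python
from bisect import bisect_left


def distance_to_nearest_tss(pos, sorted_tss):
    """Distance in bp from `pos` to the closest TSS in a sorted list."""
    i = bisect_left(sorted_tss, pos)
    neighbours = sorted_tss[max(i - 1, 0): i + 1]
    return min((abs(pos - t) for t in neighbours), default=float("inf"))


def select_enhancer_training_windows(dhs_sites, tss_positions,
                                     min_distance=2500, high=2.0, low=2.0,
                                     flank=1000):
    """Return +/-1kb windows around distal DHSs with high H3K4me1/2, low H3K4me3.

    dhs_sites: list of dicts with keys "pos", "H3K4me1", "H3K4me2", "H3K4me3"
    (hypothetical layout; signals are normalized read counts).
    """
    sorted_tss = sorted(tss_positions)
    windows = []
    for site in dhs_sites:
        distal = distance_to_nearest_tss(site["pos"], sorted_tss) > min_distance
        # Assumption: both H3K4me1 and H3K4me2 must be high.
        high_me12 = site["H3K4me1"] > high and site["H3K4me2"] > high
        low_me3 = site["H3K4me3"] < low
        if distal and high_me12 and low_me3:
            windows.append((site["pos"] - flank, site["pos"] + flank))
    return windows
```

The promoter, transcribed and repressed training sets follow the same pattern with their own marks and thresholds.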
Figure S1. The promoter and enhancer regions in Huvec. (a) the profile of the top 500
promoters selected based on their gene expression; (b) The emission probability of the
HMMs for the histone marks trained for the promoter module; (c) the profile of the top
500 enhancers selected based on their DNaseI read counts; (d) The emission probability
of the HMMs for the histone marks trained for the enhancer module.
Figure S2. Epigenomic shapes identified by ChroModule. (a) A unimodal pattern of an
enhancer in GM12878. The unimodal pattern in H3K4me1 was mapped to the HMM
states ‘111145555’. Unimodal enhancers are believed to result from the dynamic
positioning of nucleosomes. Consistent with the observations that unimodal enhancers
occur at binding loci of androgen receptors (3,4), we found moderate binding signals of
transcription factors at this locus, as indicated by the track “Txn Factor ChIP”. In addition,
chromatin is open in CD4+ T cells, as shown in the bottom track, which suggests a
more active (bimodal) enhancer in that cell type. (b) Expression levels of the nearest
genes associated with unimodal and bimodal enhancers. We defined enhancers that
visited the 2nd, 3rd and 4th HMM states as bimodal and enhancers that did not visit
the 3rd state as unimodal. Although the median expression level of the genes close to
the bimodal enhancers is slightly higher than that of the genes close to the unimodal
enhancers, the difference is not statistically significant (p-value = 0.4).
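The unimodal/bimodal rule stated in the caption can be written down directly. The sketch below assumes the Viterbi state path of an enhancer is given as a string of state digits, as in the ‘111145555’ example; the function name and the "unclassified" label for paths that satisfy neither rule are hypothetical.

```python
def classify_enhancer_shape(state_path):
    """Classify an enhancer from its HMM state path (string of digits 1-5)."""
    visited = set(state_path)
    if {"2", "3", "4"} <= visited:
        return "bimodal"        # visited the 2nd, 3rd and 4th states
    if "3" not in visited:
        return "unimodal"       # never visited the 3rd state
    return "unclassified"       # visited state 3 but not both 2 and 4


print(classify_enhancer_shape("111145555"))  # path from Figure S2(a)
```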
Figure S3. Diverse combinations of histone marks in the annotated enhancers in Hmec.
Clusters were generated by K-means, and each cluster was aligned at the DHS peaks
found in the predicted enhancers.
Figure S4. The predicted promoters and enhancers overlapping with open chromatin
regions. We used the DHSs (FAIRE in Hsmm) in the same cell type as well as in all 8 cell
types in this study. A significant portion of the predicted promoters/enhancers overlap
with open chromatin regions in other cell types, which suggests possible priming for
activation.
2.2 ChroModule annotations
We used the Viterbi algorithm(5) to assign HMM states to each 100bp bin. The
HMM states correspond to the 5 epigenetic states. A block is a series of consecutive bins
with the same annotation. We counted the number of blocks annotated to each epigenetic
state and the base pair percentage covered by each epigenetic state in the genome (Table
S3).
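The decoding and block-counting steps above can be sketched as follows. This is a generic log-space Viterbi decoder plus a run-length merge, not the ChroModule implementation: the function names are hypothetical, the emission log-probabilities are assumed to be precomputed per bin and state, all probabilities are assumed non-zero, and the two-state toy model at the end uses made-up numbers.

```python
import math
from itertools import groupby


def viterbi(log_start, log_trans, log_emit):
    """Return the most likely state path.

    log_start[s]    : log P(first state = s)
    log_trans[r][s] : log P(next state = s | current state = r)
    log_emit[t][s]  : log P(observation in bin t | state = s)
    """
    n_states = len(log_start)
    score = [log_start[s] + log_emit[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][s] + log_emit[t][s])
        score = new_score
        backpointers.append(pointers)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def to_blocks(labels, bin_size=100):
    """Merge consecutive bins with the same label into (label, start, end)."""
    blocks, start = [], 0
    for label, run in groupby(labels):
        n_bins = sum(1 for _ in run)
        blocks.append((label, start * bin_size, (start + n_bins) * bin_size))
        start += n_bins
    return blocks


def summarize(blocks):
    """Per label: (number of blocks, fraction of base pairs covered)."""
    total = sum(end - start for _, start, end in blocks)
    summary = {}
    for label, start, end in blocks:
        count, covered = summary.get(label, (0, 0))
        summary[label] = (count + 1, covered + (end - start))
    return {label: (c, bp / total) for label, (c, bp) in summary.items()}


# Toy two-state example; all numbers are invented for illustration.
log = math.log
path = viterbi(
    [log(0.5), log(0.5)],
    [[log(0.9), log(0.1)], [log(0.1), log(0.9)]],
    [[log(0.9), log(0.1)]] * 2 + [[log(0.1), log(0.9)]] * 2,
)
labels = [("background", "enhancer")[s] for s in path]
print(summarize(to_blocks(labels)))
```

In the full model, several HMM states belong to one module (for example the five enhancer states), so the state-to-label mapping would collapse them before blocks are formed.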
Figure S5. The composition of the ChroModule annotations.
Table S3. The number and coverage of annotated promoter and enhancer blocks in each
cell type. n(DHS)=number of DHSs within the blocks. Multiple DHSs may be found in
one enhancer block.
            Promoter                            Enhancer
Cell type   Covered   Number of   n(DHS)      Covered   Number of   n(DHS)      (a)+(b) /       Transcribed   Repressed
            range     the         (a)         range     the         (b)         (total number   n(DHS)        n(DHS)
            (Mbp)     predicted               (Mbp)     predicted               of DHSs)
                      blocks                            blocks
Gm12878     60        27,237      36,988      69        39,662      44,423      0.67            9,615         9,084
H1          35        23,320      33,134      37        40,486      32,966      0.48            14,997        9,654
Hmec        39        23,389      26,793      101       62,320      61,488      0.66            7,739         4,053
Hsmm        33        22,133      21,796      50        46,206      46,647      0.31            40,605        16,650
Huvec       31        21,237      28,379      64        42,206      53,700      0.68            10,250        2,780
K562        36        21,140      32,024      49        37,328      44,574      0.59            13,296        2,509
Nhek        41        24,868      26,852      88        58,942      66,086      0.69            11,814        4,165
Nhlf        39        23,244      25,586      77        66,419      76,864      0.52            9,335         1,966
Total       89        38,214      n/a         260       199,200     n/a
Promoter or enhancer: 324M
2.3 Assessing ChroModule annotations
To evaluate the performance of ChroModule annotation, we counted the numbers
of RefSeq TSSs and p300 binding sites that overlap with the predicted promoters and
enhancers, respectively. To define the negative sets, we chose 2kb-binned genomic
regions that are located >2.5kb away from the positive sets. For comparison, we also
show the performance of another HMM model, ChromHMM (6), assessed using the same
criteria (Figure S3-S5, Table S4). We found that ChroModule consistently outperformed
ChromHMM, especially on enhancer prediction (Figure S4-S5).
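The evaluation described above, with positives given by known sites and negatives by 2kb bins more than 2.5kb from any positive, can be sketched as follows. The function names are hypothetical, intervals are assumed to be half-open (start, end) pairs on one chromosome, and the quadratic overlap scan is chosen for brevity rather than speed.

```python
def gap(a, b):
    """Distance in bp between two half-open intervals (0 if they overlap)."""
    return max(b[0] - a[1], a[0] - b[1], 0)


def negative_bins(chrom_length, positives, bin_size=2000, min_gap=2500):
    """2kb bins whose distance to every positive interval exceeds min_gap."""
    bins = []
    for start in range(0, chrom_length - bin_size + 1, bin_size):
        candidate = (start, start + bin_size)
        if all(gap(candidate, p) > min_gap for p in positives):
            bins.append(candidate)
    return bins


def fraction_hit(targets, predictions):
    """Fraction of target intervals overlapped by at least one prediction."""
    if not targets:
        return 0.0
    hits = sum(
        1 for t in targets
        if any(t[0] < p[1] and p[0] < t[1] for p in predictions)
    )
    return hits / len(targets)


# Toy data: one positive site and two predicted blocks (invented numbers).
positives = [(9000, 9001)]
predictions = [(8500, 9500), (13000, 13500)]
negatives = negative_bins(20000, positives)
true_positive_rate = fraction_hit(positives, predictions)
false_positive_rate = fraction_hit(negatives, predictions)
```

Applying `fraction_hit` to the positive and the negative set gives one point on an ROC-style comparison between methods.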
Figure S6. Assessment of the enhancers predicted by ChroModule and ChromHMM
using p300 binding sites that are distal (>2.5kb) from Refseq TSSs. ChroModule
outperformed ChromHMM in all the cell types. ChromHMM results (enhancer, strong
enhancer) were downloaded from (6).
Table S4. Overlap between predicted enhancers and the TF binding sites (TFBSs). We
pooled together all the available TF ChIP-seq data in GM12878 and K562.
                       GM12878                         K562
                       Overlap   Total     Ratio       Overlap   Total     Ratio
ChromHMM ("strong")    26,903    39,195    0.68        31,791    46,495    0.68
ChromHMM (all)         49,045    116,449   0.42        57,545    130,600   0.44
ChroModule             30,103    39,662    0.76        30,818    37,320    0.83
2.4 Robustness of the ChroModule model
To examine the robustness of ChroModule, we trained the model using the
epigenomic data in GM12878 instead of Huvec and tested it on two other cell lines,
K562 and H1. The similar performance of the independently trained ChroModule models
illustrates its robustness.
Figure S7. Comparison of ChroModule models independently trained in Huvec and
GM12878 (V2). ROC curves were generated using TF ChIP-seq binding sites in K562. The
comparable performance of the independent ChroModule models illustrates its robustness.
3. Analysis
3.1 Epigenomic variation scores (EVSs)
Table S5. The average EVS in each cell type.
Cell type   Promoter   Enhancer   Transcribed   Repressed
                                  region        region
Gm12878     0.57       0.64       0.60          0.32
H1          0.41       0.71       0.65          0.70
Hmec        0.46       0.77       0.65          0.65
Hsmm        0.36       0.71       0.63          0.32
Huvec       0.34       0.77       0.67          0.73
K562        0.40       0.71       0.65          0.59
Nhek        0.48       0.76       0.66          0.68
Nhlf        0.46       0.79       0.66          0.72
3.2 Enhancers are widely distributed in the genome and dictate cell-type specificity
Figure S8. The epigenomic distance between cells calculated using ChroModule
annotated regions. The clustering was done with the Pvclust R package (7). At each branch
are the approximately unbiased (AU) p-value (left) and the bootstrap probability (BP)
(right), in percent. Using the default parameters of Pvclust, we performed 10 bootstrap
simulations with different data sizes, and the hierarchical clustering was repeated
10,000 times for each data size.
3.3 Cell-type specific enhancers regulate genes critical for cell-type specific functions
To investigate the functional role of enhancers, we conducted GO term analysis.
We assigned the predicted enhancers to the closest genes.
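Assigning each enhancer to its closest gene can be sketched with a sorted list of TSS positions and a binary search. The function name and input layout are hypothetical, and measuring distance from the enhancer midpoint to the TSS on a single chromosome is an assumption; the text only states that the closest gene was used.

```python
from bisect import bisect_left


def assign_to_closest_gene(enhancer_midpoints, genes):
    """Map each enhancer midpoint to the name of the gene with the nearest TSS.

    genes: non-empty list of (tss_position, gene_name) on one chromosome.
    """
    ordered = sorted(genes)
    positions = [pos for pos, _ in ordered]
    assignment = {}
    for mid in enhancer_midpoints:
        i = bisect_left(positions, mid)
        # Only the TSS just before and just after the midpoint can be closest.
        candidates = ordered[max(i - 1, 0): i + 1]
        assignment[mid] = min(candidates, key=lambda g: abs(g[0] - mid))[1]
    return assignment


genes = [(1000, "A"), (5000, "B"), (9000, "C")]
print(assign_to_closest_gene([1200, 4000, 8000, 20000], genes))
```

The resulting gene lists per cell type are what a GO analysis tool would take as input.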
Table S6. Cell-type specific promoters and the function of the genes. DAVID (8) was
used to perform GO analysis. In parentheses are the number of genes associated
with each term and the Benjamini-Hochberg adjusted p-value.
Type                Number of RefSeq     Cell type and GO terms (number of genes; p-value)
                    genes within the
                    cell-type specific
                    promoter
Common promoters    9,474                RNA processing (457; 3.2e-71)*
                                         Cellular macromolecule catabolic process (558; 3.1e-63)
                                         Translation (297; 7.3e-60)
H1 specific         430                  Human embryonic stem cell
                                         Transmission of nerve impulse (34; 1.1e-9)*
                                         Neuron differentiation (37; 1.8e-9)
GM12878 specific    244                  Lymphoblastoid
                                         Immune response (56; 5.7e-26)
                                         Leukocyte activation (19; 1.1e-6)
Hmec specific       32                   Human mammary epithelial cells
                                         No significant terms
Hsmm specific       83                   Normal human skeletal muscle myoblasts
                                         Muscle organ development (10; 4.7e-5)*
                                         Muscle tissue development (6; 2.6e-2)
Huvec specific      41                   Human umbilical vein endothelial cells
                                         Blood vessel development (6; 3.9e-2)*
                                         Vasculature development (6; 2.2e-2)
K562 specific       166                  Leukemia
                                         Gas transport (5; 1.9e-3)
Nhek specific       56                   Normal human epidermal keratinocytes
                                         No significant terms
Nhlf specific       71                   Normal human lung fibroblasts
                                         No significant terms
* The most significant biological process.
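For readers unfamiliar with the adjustment named in the caption of Table S6, the Benjamini-Hochberg procedure can be sketched as below. This is the standard step-up adjustment written from its textbook definition, not DAVID's code; the function name is hypothetical, and the set of terms over which DAVID applies the correction is not described in this document.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values non-decreasing in the raw p-value.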
4. References
1. Won, K.J., Ren, B. and Wang, W. (2010) Genome-wide prediction of
transcription factor binding sites using an integrated model. Genome Biol, 11,
R7.
2. Kunarso, G., Chia, N.Y., Jeyakani, J., Hwang, C., Lu, X., Chan, Y.S., Ng, H.H. and
Bourque, G. (2010) Transposable elements have rewired the core regulatory
network of human embryonic stem cells. Nat Genet, 42, 631-634.
3. He, H.H., Meyer, C.A., Shin, H., Bailey, S.T., Wei, G., Wang, Q., Zhang, Y., Xu, K.,
Ni, M., Lupien, M. et al. (2010) Nucleosome dynamics define transcriptional
enhancers. Nat Genet, 42, 343-347.
4. Wang, D., Garcia-Bassets, I., Benner, C., Li, W., Su, X., Zhou, Y., Qiu, J., Liu, W.,
Kaikkonen, M.U., Ohgi, K.A. et al. (2011) Reprogramming transcription by
distinct classes of enhancers functionally defined by eRNA. Nature.
5. Juang, B.H. and Rabiner, L.R. (1991) Hidden Markov-Models for Speech
Recognition. Technometrics, 33, 251-272.
6. Ernst, J., Kheradpour, P., Mikkelsen, T.S., Shoresh, N., Ward, L.D., Epstein, C.B.,
Zhang, X., Wang, L., Issner, R., Coyne, M. et al. (2011) Mapping and analysis of
chromatin state dynamics in nine human cell types. Nature, 473, 43-49.
7. Suzuki, R. and Shimodaira, H. (2006) Pvclust: an R package for assessing the
uncertainty in hierarchical clustering. Bioinformatics, 22, 1540-1542.
8. Huang da, W., Sherman, B.T. and Lempicki, R.A. (2009) Systematic and
integrative analysis of large gene lists using DAVID bioinformatics resources.
Nat Protoc, 4, 44-57.