Distinguishing between Genomic Regions
Bound by Paralogous Transcription Factors
Alina Munteanu1 and Raluca Gordân2
1 Faculty of Computer Science, Alexandru I. Cuza University, Iasi, Romania
[email protected]
2 Institute for Genome Sciences and Policy, Departments of Biostatistics & Bioinformatics, Computer Science, and Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, USA
[email protected]
Abstract. Transcription factors (TFs) regulate gene expression by binding to specific DNA sites in cis regulatory regions of genes. Most eukaryotic TFs are members of protein families that share a common DNA
binding domain and often recognize highly similar DNA sequences. Currently, it is not well understood why closely related TFs are able to bind
different genomic regions in vivo, despite having the potential to interact
with the same DNA sites. Here, we use the Myc/Max/Mad family as a
model system to investigate whether interactions with additional proteins (co-factors) can explain why paralogous TFs with highly similar
DNA binding preferences interact with different genomic sites in vivo.
We use a classification approach to distinguish between targets of c-Myc
versus Mad2, using features that reflect the DNA binding specificities of
putative co-factors. When applied to c-Myc/Mad2 DNA binding data,
our algorithm can distinguish between genomic regions bound uniquely
by c-Myc versus Mad2 with 87% accuracy.
Keywords: Transcription factors, protein binding microarray, ChIP-seq, co-factors, support vector machine, random forest.
1 Introduction
Transcription factors (TFs) regulate gene expression by binding to specific, short
DNA sites in the promoters or enhancers of the regulated genes. Determining
the DNA sequences recognized by TFs is essential for understanding how these
proteins achieve their DNA binding specificities and exert their specific regulatory roles in the cell. The DNA binding site motifs of hundreds of eukaryotic TFs
have been determined thus far using high-throughput in vivo techniques such as
ChIP-chip [1] or ChIP-seq [2], as well as in vitro assays such as protein binding
microarrays (PBMs [3]). A close examination of the available TF-DNA binding
motifs from databases such as UniPROBE [4], Transfac [5], and Jaspar [6] reveals
that many eukaryotic TFs have highly similar DNA binding properties. This is
M. Deng et al. (Eds.): RECOMB 2013, LNBI 7821, pp. 145–157, 2013.
© Springer-Verlag Berlin Heidelberg 2013
not surprising given that most TFs are members of protein families that share
a common DNA binding domain and thus have very similar sequence preferences [7]. However, it is surprising that, despite having the potential to bind the
same genomic sites, individual members of TF families (i.e., paralogous TFs)
often function in a non-redundant manner by binding different sets of target
genes and controlling different regulatory programs. For example, among TFs
in the E2F family, E2F1 has specific target genes [8] and it is the only factor
equipped with an ability to induce apoptosis [9], despite the fact that all E2F
family members have the same DNA binding specificity [10]. Similarly, ETS1
and ELK1, members of the ETS family of TFs, each have unique target genes
not bound by other ETS factors [11], despite the fact that their DNA binding motifs are virtually identical [12]. Currently, it is not well understood how
closely related TFs achieve their differential DNA binding specificity in vivo. In
some cases, intrinsic differences in DNA binding preferences contribute to the
observed functional differences between paralogous TFs [13]. However, in other
cases, the core DNA motifs are virtually identical [10], and still the proteins
interact differently with putative genomic binding sites in vivo, as revealed by
genome-wide ChIP-chip and ChIP-seq data [14, 15]. In such cases, it has been
hypothesized that interactions with specific protein partners (henceforth referred
to as co-factors) may contribute to the differential DNA binding in vivo [16].
Here, we use the Myc/Max/Mad family of TFs as a model system to investigate whether interactions with putative co-factors can explain why paralogous
TFs with seemingly identical DNA binding preferences interact with different
genomic sites in vivo. Myc, Max, and Mad proteins are members of the basic
helix-loop-helix leucine zipper (bHLH/Zip) family and they play essential roles in
cell proliferation, differentiation, and death. Myc proteins are transcriptional activators that promote cell growth and proliferation, and are often overexpressed
in cancer cells [17]. Proteins of the Mad family act as transcriptional repressors;
they inhibit cell proliferation and are typically expressed at lower levels in human cancers [17]. In order to bind DNA, both Myc and Mad must heterodimerize
with Max, a bHLH/Zip TF with little transcriptional activity [17]. Mad factors
compete with Myc for dimerization with Max and for binding to genomic regions
containing the E-box motif (CAnnTG), with both Myc and Mad having a strong
preference for the E-box site CACGTG. Thus, it is not surprising that there is a
high degree of overlap between the sets of targets bound by Myc and Mad factors
in vivo, as illustrated by ChIP-seq data available from ENCODE [15]. However,
despite a significant overlap in their sets of ChIP-bound regions, Myc and Mad
also have unique targets, as illustrated in Fig. 1A for c-Myc and Mad2 (Mxi1),
representative members of the Myc and Mad subfamilies, respectively. Here, we
focus on c-Myc and Mad2 because high-quality in vivo TF-DNA binding data is
available for both these factors as part of the ENCODE project [15].
We show that the intrinsic DNA binding preferences of c-Myc and Mad2
cannot explain why the two factors bind distinct sets of targets in vivo. High-quality DNA binding site motifs have been previously reported for c-Myc [5],
but not for Mad2 (or other Mad factors). Therefore, we use PBM assays [3] to
thoroughly characterize the sequence preferences of c-Myc and Mad2. Then, we
use Support Vector Machines (SVM) [18] and Random Forests (RF) [19] to
identify sets of putative co-factors that can successfully distinguish between the
genomic regions bound uniquely by c-Myc or Mad2, with an accuracy of ∼87%.
Our classification-based approach is not restricted to c-Myc and Mad2. Instead, we implemented this approach in a general framework named COUGER
(co-factors associated with uniquely-bound genomic regions). Our framework
can be applied to any two sets of genomic regions bound by paralogous TFs to
identify the uniquely-bound targets and to determine the sets of co-TFs that best
distinguish between the two sets of unique targets. Compared to related tools for
analyzing ChIP-seq data, COUGER has several advantages, as detailed in the
Discussion section: it uses state-of-the-art classification algorithms (SVM and
RF) that are robust even when the feature set is large and some of the features
are highly correlated; it makes use of high-quality TF-DNA binding data (from
PBM experiments) to generate the features used in the classification; it takes
into account the fact that TF binding sites may occur in clusters (while other
tools only consider the highest affinity TF binding sites). Furthermore, given the
large amount of ChIP-seq data available from ENCODE, we have implemented
COUGER to accept as input ChIP-seq files in the narrowPeak format; such
files can be downloaded directly from the ENCODE website. We anticipate that
our framework will be extremely useful in analyzing ChIP-seq data to understand how interactions with specific co-factors contribute to differences in the in
vivo DNA binding specificities of paralogous TFs.
COUGER is available at: www.genome.duke.edu/labs/gordan/COUGER.
The PBM data for c-Myc and Mad2 is available at www.genome.duke.edu/labs/gordan/DATA.
2 Intrinsic DNA Binding Preferences of c-Myc and Mad2 Cannot Explain Their Differential in vivo DNA Binding
We combined in vitro and in vivo TF-DNA binding data for c-Myc and Mad2 to
determine whether subtle differences in their intrinsic sequence preferences can
explain, at least in part, the unique genomic targets bound by only one of the two
factors in vivo. As evidence of in vivo binding we used ChIP-seq data from the
ENCODE project [15]. We focused on the HeLa S3 and K562 cell lines because
ChIP-seq data is available for both c-Myc and Mad2, from the same laboratory.
For both c-Myc and Mad2 we downloaded the ChIP-seq data in narrowPeak
format from the UCSC Genome Browser [20]. For the HeLa S3 cell line, 7,440
binding regions (i.e., ChIP-seq peaks) were reported for c-Myc, and 32,138 for
Mad2. Because the number of bound genomic sequences varied greatly between
the two TFs, it would be difficult to perform a comparative analysis directly.
The fact that different types of controls were used in the c-Myc and Mad2 ChIP
experiments (standard versus no primary antibody) probably contributes to the
larger number of peaks reported for Mad2. However, a close examination of the
ChIP-seq data also revealed that the p-value cutoffs used for reporting the peaks
148
A. Munteanu and R. Gordân
(A)
c-Myc
(B)2.0
bits
2786
0.0
GAGTACT
c-Myc Logo
bits
6339
1.0
CACGTG
G
0.0
CA
C
G
A
T
A
5′
TA
G
T
Mad2 Logo
3′
C
G
T
C
A
AUC enrichment
0.0
CT
C
A
G
T
C
G
A
5′
2.0
Mad2
HeLa S3 ChIP-seq
Overlaps
CACGTGG
C
A
T
3′
c−Myc ChIP Mad2 ChIP
3419
1.0
(C)
0.2
0.4
0.6
0.8
c−Myc PWM
0.665
Mad2 PWM
0.663
c−Myc PWM
0.585
Mad2 PWM
0.582
Fig. 1. (A) Overlap between the sets of genomic regions bound by c-Myc and Mad2 in
a ChIP-seq experiment [15]. (B) c-Myc and Mad2 DNA binding motifs derived from in
vitro PBM data. The logos were generated using enoLOGOS [21]. (C) AUC enrichment
for the c-Myc and Mad2 DNA binding motifs in the ChIP-seq data for the two TFs in
HeLa S3 cells. The dotted line shows the expected AUC for a random motif.
were different: 10−8.8 for c-Myc and 10−2.4 for Mad2. To make the two data sets
more comparable, we applied a cutoff of 10 for the − log10 of the ChIP-seq pvalue. This resulted in more balanced sets of in vivo targets for c-Myc and Mad2,
with 6205 and 9758 bound regions, respectively. We used these sets of targets
for all the analyses described henceforth. As shown in Fig. 1A, although there is
a significant overlap between the two sets of targets, c-Myc and Mad2 also bind
unique genomic targets in HeLa S3 cells.
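The p-value filtering described above can be sketched as follows. This is a minimal illustration rather than the COUGER implementation: it assumes the standard ENCODE narrowPeak layout, in which column 8 holds the −log10 p-value and column 10 the summit offset from the start coordinate. The function name `filter_narrowpeak`, the example records, and the choice of an inclusive cutoff (≥ 10) are assumptions made for illustration.

```python
def filter_narrowpeak(lines, min_neg_log10_p=10.0):
    """Keep narrowPeak records whose -log10(p-value) meets the cutoff.

    Assumption: the cutoff is inclusive; the paper does not say which.
    """
    kept = []
    for line in lines:
        # Skip blank lines and optional track/browser/comment headers.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.rstrip("\n").split("\t")
        # narrowPeak columns: chrom, start, end, name, score, strand,
        # signalValue, pValue (-log10), qValue (-log10), peak (summit offset)
        if float(fields[7]) >= min_neg_log10_p:
            kept.append(fields)
    return kept


# Two made-up records: only the first passes the cutoff of 10.
example = [
    "chr1\t1000\t1400\tpeak1\t900\t.\t35.2\t12.7\t9.1\t180",
    "chr1\t5000\t5300\tpeak2\t400\t.\t8.4\t3.1\t1.2\t120",
]
strong = filter_narrowpeak(example)
print([fields[3] for fields in strong])
```

Applying the same cutoff to both factors is what makes the two peak sets comparable, since the original files were thresholded at very different significance levels.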
Currently, the molecular mechanisms that allow paralogous TFs, such as c-Myc and Mad2, to interact with different sets of DNA sites in vivo are not well
understood. One hypothesis is that the two TFs exhibit slightly different DNA
binding preferences, and this may contribute to their differential in vivo binding.
To test this hypothesis, it is essential to have high-quality DNA binding site
motifs or other types of data that reflect the intrinsic DNA binding preferences
of these TFs. Although such data is available for c-Myc [5, 6], none of the Mad
factors have been thoroughly characterized, and the only DNA motif available for
Mad2 is a general E-box motif of low quality [5]. For this reason, we performed
PBM experiments [3] to thoroughly characterize the DNA binding preferences
of c-Myc and Mad2. We tested the two TFs either alone or in combination with
TF Max. As expected, the c-Myc:c-Myc and Mad2:Mad2 homodimers bound
DNA very weakly even when tested at high concentrations, while c-Myc:Max
and Mad2:Max bound DNA with high affinity. His-tagged versions of c-Myc,
Mad2, and Max were used in the PBM experiments, and they were a kind gift
from Richard Young and Peter Rahl (Whitehead Institute). To ensure that the
DNA binding signal detected on PBMs corresponds to heterodimers and not the
Max:Max homodimer, we used concentrations of c-Myc/Mad2 10 times higher
than the concentration of Max. We will henceforth refer to the c-Myc:Max PBM
data as c-Myc PBM data, and to the Mad2:Max PBM data as Mad2 PBM data.
From the universal PBM data for c-Myc and Mad2, we computed several
measures of the DNA binding specificity of the two factors: 1) we used the
Seed-and-Wobble algorithm [3] to derive DNA binding site motifs, or position
weight matrices (PWMs) [22]; 2) we computed the median fluorescence intensity
for each possible 8-mer, as described previously [3], with high median intensities
corresponding to 8-mers strongly preferred by the TF; and 3) we computed enrichment scores (E-scores) for each possible 8-mer, as described previously [3].
E-scores range from -0.5 to +0.5, with higher values corresponding to higher sequence preference. Compared to 8-mer median intensities, the E-scores are more
robust to changes in experimental conditions (e.g., binding buffers) and protein
concentrations. However, 8-mer median intensities can be used to approximate
the median intensities for longer k-mers, intensities that are not directly measured on the PBMs (see Supplementary Material online).
2.1 DNA Motifs Cannot Explain Differential in vivo DNA Binding by c-Myc versus Mad2
The DNA motifs of c-Myc and Mad2 are very similar, but not identical (Fig. 1B).
For example, c-Myc appears to have a slightly higher preference for a C nucleotide immediately upstream of the CACGTG core. To assess whether such
differences are significant in vivo and potentially explain the differences in in
vivo DNA binding between the two proteins, we first compared the enrichment
of the c-Myc and Mad2 motifs in the ChIP-seq data, using a method based on
the area under the receiver operating characteristic curve (AUC) (see [23] and
Supplementary Material online).
Fig. 1C shows AUC enrichments for the c-Myc and Mad2 motifs in the ChIP-seq data. If these motifs could explain, even to a small extent, why c-Myc and
Mad2 bind different sets of targets in vivo, then we would expect the c-Myc motif
to be significantly more enriched than the Mad2 motif in the c-Myc ChIP-seq
data, and the Mad2 motif to be significantly more enriched than the c-Myc motif
in the Mad2 ChIP-seq data. However, the AUC enrichments of these motifs are
almost identical: 0.665 and 0.663 in c-Myc ChIP-seq data, and 0.585 and 0.582
in Mad2 ChIP-seq for the HeLa S3 cell line. In conclusion, we cannot use DNA
motifs to differentiate between the c-Myc and Mad2 ChIP-seq data sets.
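To make the AUC comparison concrete, the sketch below computes a rank-based AUC: the probability that a randomly chosen bound sequence receives a higher motif score than a randomly chosen background sequence. The exact enrichment procedure is the one described in [23] and the Supplementary Material; here we simply assume that each sequence has already been summarized by a single motif score, and the scores shown are invented for illustration.

```python
def auc(bound_scores, background_scores):
    """Probability that a bound sequence outscores a background sequence.

    Ties count as one half, so a motif with no discriminative power
    gives an AUC of 0.5.
    """
    wins = 0.0
    for b in bound_scores:
        for g in background_scores:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5
    return wins / (len(bound_scores) * len(background_scores))


# Hypothetical per-sequence motif scores.
bound = [0.9, 0.8, 0.4]
background = [0.7, 0.3, 0.2]
print(auc(bound, background))  # 8 of the 9 pairs favour the bound set
```

Under this definition, two motifs with nearly equal AUC values on the same data set (as in Fig. 1C) rank bound and background sequences almost equally well, which is why they cannot separate the two ChIP-seq data sets.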
2.2 In vitro Universal PBM Data Cannot Explain Differential in vivo Binding by c-Myc versus Mad2
Another way of assessing the enrichment of a DNA motif (PWM) in a ChIP-seq
data set is by looking at how many of the ChIP-bound sequences contain a PWM
match above a certain cutoff (and possibly using this count to compute a hypergeometric
p-value). The shortcoming of this method is that it depends greatly on the chosen
cutoff, and there is no systematic way of choosing the “best” cutoff for a given
PWM [23]. To overcome this problem, we considered a range of cutoffs and, for
each cutoff, we computed the fraction of ChIP-bound sequences that contain at
least one DNA site with a score above the cutoff. Furthermore, since cutoffs based
on PWM scores would not be readily comparable between the c-Myc and Mad2
PWMs, we chose cutoffs based on the number of possible k-mers with scores
above that cutoff, where k is the width of the PWM. The results presented here
are for PWMs of size k = 10. We obtained similar results for other values of k.

[Figure 2: four panels showing the fraction of peaks (y-axis, 0.0 to 0.8) versus the number of top-scoring k-mers (x-axis, 10 to 5000 for 10-mers and 10 to 1000 for 8-mers). (A) 10-mer PWM scores and (B) 8-mer E-scores, each for c-Myc ChIP-seq and Mad2 ChIP-seq data. Curves: c-Myc fraction in ChIP-seq, Mad2 fraction in ChIP-seq, c-Myc fraction in DNase-seq, Mad2 fraction in DNase-seq.]
Fig. 2. Fractions of ChIP-seq/DNase-seq peaks that contain DNA sites with (A) PWM
scores or (B) 8-mer E-scores above certain cutoffs. The cutoffs represent the number
of top-scoring k-mers, ranked by either PWM scores or PBM 8-mer E-scores. The full
lines correspond to ChIP-seq data. The dotted lines correspond to DNase-seq data.
The results of this analysis are illustrated in Fig. 2A. For each ChIP-seq data
set and each TF, we counted the number of ChIP-seq peaks that contained at
least one 10-mer in the set of 10, 100, 250, 500, 1000, 5000, and 10000 top-scoring 10-mers. We compared the fractions of peaks corresponding to c-Myc
versus Mad2 ChIP-seq data. Also, in order to compare those results with the
background distribution, we computed similar fractions for DNase-seq [24] peaks.
As shown in Fig. 2A, for both ChIP-seq data sets the values corresponding to
the c-Myc and Mad2 PWMs are very similar, and we observe the same pattern
of slightly higher fractions for Mad2. Thus, we cannot use the PWM scores in
this manner to differentiate between c-Myc and Mad2 ChIP-seq targets. We also
notice that, as expected, a larger fraction of ChIP-seq peaks contain high-scoring
PWM matches compared to DNase-seq peaks (compare the full and dotted lines
in Fig. 2, which correspond to ChIP- and DNase-seq data, respectively).
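The fraction-of-peaks analysis above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the code used for Fig. 2: sequences are assumed to be upper-case, a k-mer is counted as present if it occurs on either strand, and the ranked list of k-mers (here toy 6-mers rather than the 10-mers and 8-mers of the paper) is hypothetical.

```python
def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def fraction_with_top_kmers(peaks, ranked_kmers, cutoffs):
    """For each cutoff n, return the fraction of peak sequences that contain
    at least one of the n top-ranked k-mers on either strand."""
    k = len(ranked_kmers[0])
    fractions = {}
    for n in cutoffs:
        top = set(ranked_kmers[:n])
        top |= {revcomp(kmer) for kmer in top}  # assumption: both strands count
        hits = 0
        for seq in peaks:
            if any(seq[i:i + k] in top for i in range(len(seq) - k + 1)):
                hits += 1
        fractions[n] = hits / len(peaks)
    return fractions


peaks = ["TTCACGTGAA", "GGGGGGGGGG", "AACATGTGTT"]
ranked = ["CACGTG", "CACATG", "CATGTG"]  # hypothetical ranking, best first
result = fraction_with_top_kmers(peaks, ranked, cutoffs=[1, 2])
print(result)
```

Because the cutoff is a rank rather than a raw score, the same cutoffs can be applied to the c-Myc and Mad2 rankings without having to reconcile their score scales.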
We note that PWMs are in fact summaries of the comprehensive data that
we obtain from PBM experiments. Thus, it is possible that differences between
the DNA binding specificities of c-Myc and Mad2 do exist, but are not captured
by PWMs. To test this hypothesis, we performed an analysis similar to the
one described above, but instead of using PWM scores we used 8-mer E-scores
derived directly from PBM data. Fig. 2B shows the fractions of peaks containing
top-scoring 8-mers. As in the case of PWM scores, the difference between ChIP-seq data and DNase-seq data is significant for both TFs, but the curves for c-Myc
and Mad2 are almost identical. Thus, the 8-mer PBM data is still not sufficient
to differentiate between the in vivo targets of c-Myc compared to Mad2.
3 Binding of Putative Co-factors Can Explain Differences in in vivo DNA Binding between c-Myc and Mad2
Given that the intrinsic DNA binding preferences of c-Myc and Mad2 cannot be
used to differentiate between the in vivo targets of the two TFs, we next focused
on the hypothesis that DNA binding of co-factors in the neighborhood of c-Myc
or Mad2 binding sites might contribute to the differences we observe between
their sets of in vivo targets. To test this hypothesis, we built classifiers that can
accurately distinguish between sequences bound uniquely by c-Myc versus Mad2
according to the ChIP-seq data, using features derived from either PBM data or
PWMs of putative co-factors. We implemented our approach in the COUGER
(co-factors associated with uniquely-bound genomic regions) framework. The
steps of the framework are summarized in Algorithm 1.
3.1 Classes and Features
Classes. We used the ChIP-seq data to define two classes of sequences: c-Myc-specific sequences (i.e., c-Myc ChIP-seq peaks that do not overlap any of
the Mad2 peaks), and Mad2-specific sequences (i.e., Mad2 ChIP-seq peaks that
do not overlap any of the c-Myc peaks). In HeLa S3 cells we identified 2786
c-Myc-specific sequences and 6308 Mad2-specific sequences, which account for
approximately 45% and 65% of the total ChIP-seq peaks of c-Myc and Mad2,
respectively. These percentages are surprisingly high given the similarity between
the DNA binding preferences of the two TFs.
After identifying c-Myc- and Mad2-specific sequences, we filtered out some of
the Mad2-specific peaks and kept only the top 2786, sorted according to the
Mad2 ChIP-seq p-value. Thus, we obtained two sets containing the same number
of DNA sequences, which eliminates a potential classification bias toward one
of the two classes. Finally, before computing the features for the selected DNA
sequences, we trimmed each sequence to 100 bp on each side of the ChIP-seq
peak summit. This was necessary because many peaks are a few hundred to a few
thousand bases long. Given that we are interested in finding co-factors that bind
close to c-Myc and Mad2, we should look for DNA sites of these putative co-factors only in close proximity of the c-Myc and Mad2 ChIP-seq peak summits.
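The class definition, balancing, and trimming steps can be sketched as below. This is a simplified illustration, not the COUGER code: peaks are represented as hypothetical tuples `(chrom, start, end, summit_offset, neg_log10_p)` with half-open coordinates, overlap is tested by brute force, and the trimmed window is taken to be 100 bp on each side of the summit position.

```python
def overlaps(a, b):
    """True if two peaks on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def specific_peaks(peaks, other_peaks):
    """Peaks that overlap none of the peaks reported for the other factor."""
    return [p for p in peaks if not any(overlaps(p, q) for q in other_peaks)]


def trim_to_summit(peak, flank=100):
    """Return (chrom, start, end) covering `flank` bp on each side of the summit."""
    chrom, start, _end, summit_offset, _p = peak
    summit = start + summit_offset
    return (chrom, max(0, summit - flank), summit + flank)


# Made-up peaks: (chrom, start, end, summit_offset, -log10 p-value)
myc = [("chr1", 1000, 1400, 180, 15.0), ("chr1", 9000, 9500, 250, 22.0)]
mad = [("chr1", 1300, 1700, 150, 30.0), ("chr2", 400, 900, 300, 12.0),
       ("chr2", 5000, 5600, 200, 40.0)]

myc_only = specific_peaks(myc, mad)
mad_only = specific_peaks(mad, myc)
# Balance the classes: keep the most significant peaks of the larger set.
mad_only = sorted(mad_only, key=lambda p: p[4], reverse=True)[:len(myc_only)]
windows = [trim_to_summit(p) for p in myc_only + mad_only]
print(windows)
```

For genome-scale peak sets, the quadratic overlap test would normally be replaced by a sorted sweep or an interval tree; the brute-force version is kept here only for clarity.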
Features. We computed features using two types of information on the DNA
binding specificities of putative co-factors: PBM 8-mer E-scores and PWM scores.
We used 3 different types of features: 1) “PBM features” derived from 8-mer
E-scores for the mammalian TFs in UniPROBE [4] (420 PBM data sets), plus 9
PBM data sets from our laboratory; 2) “PWM features” derived from PWMs
computed from the PBM data sets; and 3) “T (Transfac) features” derived
from the PWMs in Transfac [5] (1226 PWMs). For a given PBM data set and
a DNA sequence, we generated: an “M” feature that represents the maximum
E-score over all the 8-mers in that sequence, and an “A” feature that represents
the average E-score over the top 3 highest-scoring 8-mers in that sequence (lines
4 and 5 of Algorithm 1). Similarly, we generated M and A features from PWMs.
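As a worked illustration of the M and A features, the sketch below scores every k-mer of a sequence with a lookup table of E-scores and returns the maximum and the mean of the top three values. The E-score table, the toy k = 3, and the function names are hypothetical; the sketch also assumes that the table already lists each k-mer in the orientation in which it appears in the sequence.

```python
def kmers(seq, k=8):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def m_and_a_features(seq, escores, k=8, top=3):
    """Return (M, A): the maximum E-score over all k-mers in seq, and the
    average E-score over the `top` highest-scoring k-mers."""
    scores = sorted(
        (escores[kmer] for kmer in kmers(seq, k) if kmer in escores),
        reverse=True,
    )
    if not scores:
        raise ValueError("no scorable k-mers in sequence")
    best = scores[:top]
    return best[0], sum(best) / len(best)


# Toy example with 3-mers instead of 8-mers; E-scores lie in [-0.5, 0.5].
toy_escores = {"ACG": 0.45, "CGT": 0.30, "GTA": -0.15}
m_feature, a_feature = m_and_a_features("ACGTA", toy_escores, k=3)
print(m_feature, a_feature)
```

The M feature reflects the single best site, whereas the A feature rewards sequences with several good sites, which is how clusters of binding sites can influence the classification.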
152
A. Munteanu and R. Gordân
Algorithm 1. Classification
Input: D – data set with classification sequences for Myc and Mad; P BM (E-scores
from PBM data); P W M (PWMs from PBM data); T (PWMs from TRANSFAC).
Output: Lists of selected features SF , and accuracies A.
1: Dtrain ← random.sample(D, 2/3 · |D|) such that |{X ∈ Dtrain , class(X) = Myc}| =
|{X ∈ Dtrain , class(X) = Mad}|; Dtest ← D − Dtrain
2: for F ∈ {P BM, P W M, T } do
3:
for X ∈ D do
4:
FM (X) = {maxx∈X f (x)|∀f ∈ F }
5:
FA (X) = {avg(maxx∈X f (x), maxy∈X−{x} f (y), maxz∈X−{x,y} f (z))|∀f ∈ F }
6:
FMA (train) ← {FM (X), FA (X)|∀X ∈ Dtrain }
7:
SFt ← feature.selection(Dtrain , Ft (train))
8:
for C ∈ {SV Mlin , SV Mrbf , RFgi , RFpi } do
9:
bestp(C, SFt ) ← arg maxp∈params(C) accuracy(train(C, Dtrain , SFt , p))
10:
M odel(C, SFt ) ← train(C, Dtrain , SFt , bestp(C, SFt ))
11:
Atest (C, SFt ) ← accuracy(predict(C, Dtest , M odel(C, SFt )))
12: SF ← {SP BMMA , SP W MMA , STMA }
13: A ← {Atest (C, F )|∀C ∈ {SV Mlin , SV Mrbf , RFgi , RFpi }, ∀F ∈ SF }
14: return SF, A
3.2 Classification Algorithms
We used two state-of-the-art supervised classification algorithms: support vector
machine (SVM) and random forest (RF), both available as free software packages (LIBSVM [25], Random Jungle [26]). The SVM is widely used due to its
high accuracy on linear and nonlinear classification problems. In addition, the
SVM can successfully handle high-dimensional data, which makes it ideal for
our classification task. We trained SVMs using linear and radial basis function
kernels (SVMlin and SVMrbf, respectively). The RF classifier is essentially an
ensemble of classification trees. RF is comparable in performance with SVM, but
one of its distinguishing characteristics is that it explicitly computes a measure of
the importance of each variable for the classification task. Random Jungle (RJ)
implements 2 variable importance scores: Gini importance (the sum of impurity
decreases over all nodes in the forest in which the corresponding variable was
selected for splitting), and permutation importance (the average decrease in accuracy when the values of a variable are randomly permuted). We ran RJ with
both the Gini importance (RFgi) and the permutation importance (RFpi).
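The idea behind permutation importance can be illustrated independently of any particular forest implementation. The sketch below applies it to an arbitrary prediction function and a held-out data set; it is not how Random Jungle computes the score internally, and the toy classifier, data, and function names are all assumptions made for the example.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, feature, n_repeats=20, seed=0):
    """Average decrease in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with the shuffled value in place of the original.
        shuffled = [r[:feature] + [v] + r[feature + 1:]
                    for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, shuffled, labels))
    return sum(drops) / n_repeats


# Toy data: feature 0 determines the label, feature 1 is noise.
rows = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.3], [0.1, 0.9]]
labels = [1, 1, 0, 0]
predict = lambda r: int(r[0] > 0.5)  # hypothetical stand-in for a trained model

imp_informative = permutation_importance(predict, rows, labels, feature=0)
imp_noise = permutation_importance(predict, rows, labels, feature=1)
print(imp_informative, imp_noise)
```

A feature the model never uses has an importance of exactly zero under this definition, while shuffling an informative feature lowers the accuracy on average.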
We split the c-Myc- and Mad2-specific sequences into two sets: 1) a training
set containing 2/3 of the sequences (i.e., 3714), randomly chosen from the original
set; and 2) a test set containing the remaining 1/3 of the sequences. For each
algorithm, we first searched for optimal parameter values using only the training
data, and then, using the best model obtained on the training set, we predicted
the class for each sequence in the test set. We measured the performance of each
algorithm using its accuracy on the test set.
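A class-balanced 2/3 versus 1/3 split of this kind can be sketched as follows. The function name, the fixed random seed, and the placeholder sequence identifiers are assumptions for illustration; only the class sizes (2786 sequences each) come from the text.

```python
import random


def balanced_split(myc_seqs, mad_seqs, train_fraction=2 / 3, seed=0):
    """Randomly split two classes into training and test sets, drawing the
    same number of training examples from each class."""
    rng = random.Random(seed)
    n_train = int(min(len(myc_seqs), len(mad_seqs)) * train_fraction)
    train, test = [], []
    for label, seqs in (("Myc", myc_seqs), ("Mad", mad_seqs)):
        shuffled = list(seqs)
        rng.shuffle(shuffled)
        train += [(s, label) for s in shuffled[:n_train]]
        test += [(s, label) for s in shuffled[n_train:]]
    return train, test


# Placeholder identifiers standing in for the 2786 sequences of each class.
myc_seqs = [f"myc_{i}" for i in range(2786)]
mad_seqs = [f"mad_{i}" for i in range(2786)]
train, test = balanced_split(myc_seqs, mad_seqs)
print(len(train), len(test))
```

With 2786 sequences per class, rounding down gives 1857 training sequences per class, i.e., the 3714 training sequences mentioned above.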
To optimize the parameters we performed grid searches over the parameter space. For SVMlin we optimized C, the cost of misclassifying examples.
For SVMrbf we optimized C and the RBF kernel parameter γ. For RFgi and
RFpi we optimized ntree, the number of trees in the forest, and mtry, the number of input variables tried in each split (see Supplementary Material online).
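A generic grid search over such parameters can be sketched as below. The scoring function is supplied by the caller and would, in practice, train a classifier on the training data and return an accuracy estimate (for example, a cross-validated one; the actual procedure is described in the Supplementary Material). The grid values and the toy scoring function here are hypothetical.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Evaluate score_fn on every combination of parameter values in `grid`
    and return (best_params, best_score)."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(C, gamma):
    """Hypothetical stand-in for 'train with (C, gamma), return accuracy'."""
    return 1.0 - abs(C - 1.0) - abs(gamma - 0.01)


grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.001, 0.01, 0.1]}  # illustrative values
best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

The same function covers the RF case by passing a grid over `ntree` and `mtry` and a scoring function that trains a forest with those values.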
Feature Selection. We performed feature selection on each of the three feature
types (PBM, PWM, and T (Transfac)) using both the maximum score of any
DNA site and the average over the top three highest scores. We used RF with a
backward elimination technique [26], an iterative process in which a RF is grown
at each step and a subset of variables is discarded. The eliminated features are
those with the smallest importance. In this instance we used only the unscaled
permutation importance, which is recommended for feature selection [27]. We
stopped the algorithm when the number of features fell below 100. We performed
two variants of selection: FS1, with 50% of features dropped at each iteration,
and FS2, with 33% of features dropped at each iteration.
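The backward elimination loop can be sketched as follows. The importance function is a caller-supplied placeholder for "grow an RF on the current features and return the unscaled permutation importance of each". Two details are assumptions on our part: the number of features kept at each step is rounded down, and "33%" is interpreted as dropping one third of the features.

```python
from fractions import Fraction


def backward_elimination(features, importance_fn, keep, stop_below=100):
    """Repeatedly rank the current features by importance and keep the top
    `keep` fraction, until fewer than `stop_below` features remain."""
    selected = list(features)
    while len(selected) >= stop_below:
        scores = importance_fn(selected)  # dict: feature -> importance
        ranked = sorted(selected, key=lambda f: scores[f], reverse=True)
        n_keep = int(len(ranked) * keep)  # exact arithmetic, rounded down
        selected = ranked[:n_keep]
    return selected


def toy_importance(feats):
    """Hypothetical importance: lower-numbered features matter more."""
    return {f: -f for f in feats}


all_features = list(range(858))  # e.g., the 858 PBM-derived features
fs1 = backward_elimination(all_features, toy_importance, keep=Fraction(1, 2))
fs2 = backward_elimination(all_features, toy_importance, keep=Fraction(2, 3))
print(len(fs1), len(fs2))
```

Under these assumptions, starting from 858 features the loop ends with 53 features when half are dropped per iteration and 74 features when one third is dropped, which agrees with the PBM feature counts reported in Table 1.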
3.3 Classification Accuracy on the Test Sets
We ran SVM and RF on the HeLa S3 ChIP-seq data using the feature types
described above (PBM, PWM, and Transfac). The results for SVMlin and
RFpi are presented in Table 1 as classification accuracies, and vary between
85.52% and 88.05% depending on the algorithm and the set of features. Results
for the other two classifiers (SVMrbf and RFgi), as well as results on the K562
ChIP-seq data, are available in the Supplementary Material online.
Table 1 shows that our SVM and RF classifiers can accurately distinguish
between c-Myc-specific and Mad2-specific genomic targets. This suggests that
a potential mechanism by which these TFs achieve differential DNA binding
in vivo is by interacting with co-factors that bind DNA in the neighborhood
of c-Myc or Mad2 DNA binding sites. We will perform follow-up analyses to
study the spacing between c-Myc/Mad2 sites and DNA sites of their putative
co-factors, to assess the likelihood of direct TF-TF interactions.
We note that SVMlin with Transfac features achieved the best classification
accuracy on the HeLa S3 ChIP-seq data: 88.05% when using all 2452 features.
However, the accuracy decreased after feature selection and became comparable
to the accuracy for PBM and PWM features.
Table 1. Classification accuracy on the test sets. Table shows the results of SVM and
RF on HeLa S3 ChIP-seq data using 3 feature types: PWM, Transfac, and PBM. Bold:
best classification accuracy obtained by SVMlin and RFpi for a particular feature type.

Features            |         PBM         |         PWM         |      Transfac
Feature set         |  ALL    FS1    FS2  |  ALL    FS1    FS2  |  ALL    FS1    FS2
Number of features  |  858     53     74  |  840     52     73  | 2452     76     94
SVMlin              | 86.87  87.03  87.08 | 86.60  85.74  86.33 | 88.05  86.60  86.65
RFpi                | 86.65  86.92  86.71 | 85.52  86.01  85.95 | 86.60  86.44  86.71
3.4 Selected Features. Putative Co-factors
SVMlin with PBM features obtained the best classification accuracy with a
limited number of features: 87.08% with a total of 74 features (FS2) and 87.03%
with a total of 53 features (FS1). Importantly, 52 of the 53 features in FS1 are
among the 74 features in FS2. We note that COUGER can be used to reduce
the number of features even further, although this might lead to a decrease in
classification accuracy. For this particular data set, for example, reducing the
number of selected features to 10 resulted in an accuracy of 85.47%.
We analyzed the top 53 selected features to identify putative co-factors that
might contribute to differential in vivo DNA binding by c-Myc versus Mad2.
The top 4 putative co-factors (according to RF variable importance score) are:
E2F, Sp100, Zfp161, and Sp4, all associated with Mad2-specific sequences. A
literature search revealed that at least 3 of these TFs are indeed good candidate
co-factors for Mad2: E2F binding site elements are important for autorepression
of the c-myc gene [28], Sp100 is a transcriptional repressor (similarly to Mad2)
and plays an important role as a tumor suppressor [29], and Zfp161 is a putative
c-myc repressor [30]. The fourth TF, Sp4, is not known to act as a repressor but it
has been shown to be aberrantly expressed in many cancers [5], which supports a
connection with Myc/Mad. For the highest confidence candidate co-factor, E2F,
we performed an enrichment analysis similar to the one in Fig. 2. We note that
E2F is a family of TFs with highly similar DNA binding specificities. All E2Fs
for which PBM data is currently available have been selected by our classifiers,
and they are similarly enriched in Mad2-specific targets. In Fig. 3 we show the
enrichment of two representative E2F family members: E2F3 and E2F4.
[Figure 3: two panels, c-Myc specific targets (left) and Mad2 specific targets (right), showing the fraction of peaks (y-axis, 0.0 to 1.0) versus the number of top-scoring 8-mers (x-axis: 10, 100, 250, 500, 1000, 5000) for c-Myc, Mad2, E2F3, and E2F4, in ChIP-seq and DNase-seq peaks.]
Fig. 3. Enrichment analysis for c-Myc, Mad2, and E2F factors in the DNA regions bound
uniquely by either c-Myc or Mad2. Right plot shows that E2Fs are more enriched than
Mad2 in the Mad2-specific targets, although their enrichment in the DNase-seq peaks
is much lower. By contrast, in the c-Myc-specific targets (left plot), E2F sites are
depleted, as their enrichment is generally lower than in the DNase-seq peaks. These
results are in agreement with our classification analyses, which found that E2F sites
are strongly associated with Mad2-specific and not c-Myc-specific targets.
4 Discussion
Identifying the molecular mechanisms that allow paralogous TFs to bind different sets of in vivo targets is essential for understanding eukaryotic transcription.
Due to recent advances in high-throughput technologies for measuring TF-DNA
binding both in vivo and in vitro (such as ChIP-seq and PBM), it is now possible
to quantify the contributions of both intrinsic TF-DNA binding specificity and
interacting co-factors to differential in vivo DNA binding by related TFs. Here,
we focus on paralogous TFs c-Myc and Mad2, and show that differences in their
intrinsic sequence preferences cannot account for the large number of targets
bound uniquely by each TF. Instead, interactions with putative co-TFs are a
likely mechanism used by c-Myc and Mad2 to select their specific genomic sites.
To identify c-Myc and Mad2 co-factors, we designed COUGER, a novel
framework that uses in vitro DNA binding specificity data for putative co-factors
to distinguish between the genomic targets of paralogous TFs (here, c-Myc and
Mad2). We are not aware of other tools that aim to identify co-factors that interact specifically with paralogous TFs. However, similar classification approaches
have been previously used to distinguish between sets of genomic regions. Chen
and Zhou [31], for example, use Naïve Bayes to identify co-factors that can distinguish between the regulatory regions of genes upregulated versus downregulated
in mouse ES cells. We note that our choice of classification algorithms is very
important. When using features derived from TF-DNA binding specificity data
(either PWMs or PBM data), one is likely to obtain features that are highly correlated. While Naïve Bayes, which assumes that features are conditionally independent given the class, is not appropriate in this case, both SVM and RF
classifiers can be used. Furthermore, as the number of features increases (in our
case, as more and more PBM data is being generated), a Naïve Bayes approach
may start to overfit the training data, while SVMs and RFs are more robust.
Finally, the advantage of using RF for feature selection (as opposed to Wilcoxon
rank-sum test) is that RF can easily handle interactions among features, which
would not be captured by a statistical test on individual features.
De novo motif discovery tools or methods that search for DNA motifs enriched in particular sets of sequences could also be used, in theory, to identify
co-factors [32, 33, 34]. However, these approaches would only search for one
co-factor at a time, and would be able to find only DNA motifs that appear in a
significant fraction of the DNA sequences of interest (here, the c-Myc- or Mad2-specific sequences). Recent evidence from the ENCODE project shows that the
co-association of TFs is highly context specific, i.e., distinct combinations of TFs
bind at specific genomic locations [35]. Thus, classification approaches such as
COUGER, which search for sets of putative co-factors in TF-specific genomic
targets, are more likely to reveal important molecular mechanisms through which
paralogous TFs achieve their regulatory specificity in the cell. Future work will
include additional computational analyses to select the best candidate co-factors
for c-Myc and Mad2, as well as using COUGER to identify co-factors for paralogous TFs from other protein families.
References
[1] Ren, B., Robert, F., Wyrick, J.J., et al.: Genome-wide location and function of
DNA binding proteins. Science 290, 2306–2309 (2000)
[2] Johnson, D.S., Mortazavi, A., Myers, R.M., Wold, B.: Genome-wide mapping of
in vivo protein-DNA interactions. Science 316, 1497–1502 (2007)
[3] Berger, M.F., Philippakis, A.A., Qureshi, A.M., et al.: Compact, universal DNA
microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotech. 24, 1429–1435 (2006)
[4] Robasky, K., Bulyk, M.L.: UniPROBE, update 2011: expanded content and search
tools in the online database of protein-binding microarray data on protein-DNA
interactions. Nucleic Acids Research 39, D124–D128 (2011)
[5] Matys, V., Kel-Margoulis, O.V., Fricke, E., et al.: TRANSFAC and its module
TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34, D108–D110 (2006)
[6] Portales-Casamar, E., Thongjuea, S., Kwon, A.T., et al.: JASPAR 2010: the
greatly expanded open-access database of transcription factor binding profiles.
Nucleic Acids Research 38, D105–D110 (2010)
[7] Badis, G., Berger, M.F., Philippakis, A.A., et al.: Diversity and complexity in
DNA recognition by transcription factors. Science 324, 1720–1723 (2009)
[8] Wells, J., Graveel, C.R., Bartley, S.M., et al.: The identification of E2F1-specific
target genes. Proc. Natl. Acad. Sci. U S A 99, 3890–3895 (2002)
[9] Wu, Z., Zheng, S., Yu, Q.: The E2F family and the role of E2F1 in apoptosis. Int.
J. Biochem. Cell Biol. 41, 2389–2397 (2009)
[10] Tao, Y., Kassatly, R., Cress, W., Horowitz, J.: Subunit composition determines
E2F DNA-binding site specificity. Mol. Cell Biol. 17, 6994–7007 (1997)
[11] Hollenhorst, P.C., Shah, A.A., Hopkins, C., Graves, B.J.: Genome-wide analyses
reveal properties of redundant and specific promoter occupancy within the ETS
gene family. Genes Dev. 21, 1882–1894 (2007)
[12] Wei, G.H., Badis, G., Berger, M.F., et al.: Genome-wide analysis of ETS-family
DNA-binding in vitro and in vivo. EMBO J. 29, 2147–2160 (2010)
[13] Soleimani, V.D., Punch, V.G., Kawabe, Y.I., et al.: Transcriptional dominance
of Pax7 in adult myogenesis is due to high-affinity recognition of homeodomain
motifs. Dev. Cell 22, 1208–1220 (2012)
[14] Xu, X., Bieda, M., Jin, V.X., et al.: A comprehensive ChIP-chip analysis of E2F1,
E2F4, and E2F6 in normal and tumor cells reveals interchangeable roles of E2F
family members. Genome Research 17, 1550–1561 (2007)
[15] ENCODE Project Consortium, Bernstein, B., Birney, E., Dunham, I., Green, E.,
Gunter, C., Snyder, M.: An integrated encyclopedia of DNA elements in the human
genome. Nature 489, 57–74 (2012)
[16] Farnham, P.J.: Insights from genomic profiling of transcription factors. Nat. Rev.
Genet. 10, 605–616 (2009)
[17] Grandori, C., Cowley, S.M., James, L.P., Eisenman, R.N.: The Myc/Max/Mad
network and the transcriptional control of cell behavior. Annu. Rev. Cell Dev.
Biol. 16, 653–699 (2000)
[18] Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20, 273–297
(1995)
[19] Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001)
[20] Rosenbloom, K.R., Dreszer, T.R., Long, J.C., et al.: ENCODE whole-genome data
in the UCSC Genome Browser: update 2012. Nucleic Acids Research 40, D912–D917
(2012)
[21] Workman, C.T., Yin, Y., Corcoran, D., et al.: enoLOGOS: a versatile web tool
for energy normalized sequence logos. Nucleic Acids Research 33, W389 (2005)
[22] Stormo, G.D.: DNA binding sites: representation and discovery. Bioinformatics 16,
16–23 (2000)
[23] Gordân, R., Hartemink, A., Bulyk, M.: Distinguishing direct versus indirect transcription factor-DNA interactions. Genome Research 19, 2090–2100 (2009)
[24] Song, L., Crawford, G.E.: DNase-seq: A high-resolution technique for mapping
active gene regulatory elements across the genome from mammalian cells. Cold
Spring Harbor Protocols 2010, pdb.prot5384 (2010)
[25] Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2, 1–27 (2011)
[26] Schwarz, D.F., König, I.R., Ziegler, A.: On safari to random jungle: a fast implementation of random forests for high-dimensional data. Bioinformatics 26, 1752–1758 (2010)
[27] Díaz-Uriarte, R., Alvarez de Andrés, S.: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7, 3 (2006)
[28] Luo, Q., Li, J., Cenkci, B., Kretzner, L.: Autorepression of c-myc requires both initiator and E2F-binding site elements and cooperation with the p107 gene product.
Oncogene 23, 1088–1097 (2004)
[29] Negorev, D.G., Vladimirova, O.V., Kossenkov, A.V., et al.: Sp100 as a potent
tumor suppressor: accelerated senescence and rapid malignant transformation of
human fibroblasts through modulation of an embryonic stem cell program. Cancer
Research 70, 9991–10001 (2010)
[30] Sobek-Klocke, I., Disque-Kochem, C., Ronsiek, M., Klocke, R., et al.: The human
gene ZFP161 on 18p11.21-pter encodes a putative c-myc repressor and is homologous to murine Zfp161 (Chr 17) and Zfp161-rs1 (X Chr). Genomics 43, 156–164
(1997)
[31] Chen, G., Zhou, Q.: Searching ChIP-seq genomic islands for combinatorial regulatory codes in mouse ES cells. BMC Genomics 12, 515 (2011)
[32] Machanick, P., Bailey, T.L.: MEME-ChIP: motif analysis of large DNA datasets.
Bioinformatics 27, 1696–1697 (2011)
[33] Thomas-Chollier, M., Herrmann, C., Defrance, M., et al.: RSAT peak-motifs: motif
analysis in full-size ChIP-seq datasets. Nucleic Acids Research 40, e31 (2012)
[34] Whitington, T., Frith, M.C., Johnson, J., Bailey, T.L.: Inferring transcription
factor complexes from ChIP-seq data. Nucleic Acids Research 39, e98 (2011)
[35] Gerstein, M.B., Kundaje, A., Hariharan, M., et al.: Architecture of the human
regulatory network derived from ENCODE data. Nature 489, 91–100 (2012)