Distinguishing between Genomic Regions Bound by Paralogous Transcription Factors

Alina Munteanu1 and Raluca Gordân2

1 Faculty of Computer Science, Alexandru I. Cuza University, Iasi, Romania
2 Institute for Genome Sciences and Policy, Departments of Biostatistics & Bioinformatics, Computer Science, and Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, USA

Abstract. Transcription factors (TFs) regulate gene expression by binding to specific DNA sites in cis-regulatory regions of genes. Most eukaryotic TFs are members of protein families that share a common DNA binding domain and often recognize highly similar DNA sequences. Currently, it is not well understood why closely related TFs are able to bind different genomic regions in vivo, despite having the potential to interact with the same DNA sites. Here, we use the Myc/Max/Mad family as a model system to investigate whether interactions with additional proteins (co-factors) can explain why paralogous TFs with highly similar DNA binding preferences interact with different genomic sites in vivo. We use a classification approach to distinguish between targets of c-Myc versus Mad2, using features that reflect the DNA binding specificities of putative co-factors. When applied to c-Myc/Mad2 DNA binding data, our algorithm can distinguish between genomic regions bound uniquely by c-Myc versus Mad2 with 87% accuracy.

Keywords: Transcription factors, protein binding microarray, ChIP-seq, co-factors, support vector machine, random forest.

1 Introduction

Transcription factors (TFs) regulate gene expression by binding to specific, short DNA sites in the promoters or enhancers of the regulated genes. Determining the DNA sequences recognized by TFs is essential for understanding how these proteins achieve their DNA binding specificities and exert their specific regulatory roles in the cell.
M. Deng et al. (Eds.): RECOMB 2013, LNBI 7821, pp. 145–157, 2013. © Springer-Verlag Berlin Heidelberg 2013

The DNA binding site motifs of hundreds of eukaryotic TFs have been determined thus far using high-throughput in vivo techniques such as ChIP-chip [1] or ChIP-seq [2], as well as in vitro assays such as protein binding microarrays (PBMs [3]). A close examination of the available TF-DNA binding motifs from databases such as UniPROBE [4], Transfac [5], and Jaspar [6] reveals that many eukaryotic TFs have highly similar DNA binding properties. This is not surprising given that most TFs are members of protein families that share a common DNA binding domain and thus have very similar sequence preferences [7]. However, it is surprising that, despite having the potential to bind the same genomic sites, individual members of TF families (i.e., paralogous TFs) often function in a non-redundant manner by binding different sets of target genes and controlling different regulatory programs. For example, among TFs in the E2F family, E2F1 has specific target genes [8] and it is the only factor able to induce apoptosis [9], despite the fact that all E2F family members have the same DNA binding specificity [10]. Similarly, ETS1 and ELK1, members of the ETS family of TFs, each have unique target genes not bound by other ETS factors [11], despite the fact that their DNA binding motifs are virtually identical [12]. Currently, it is not well understood how closely related TFs achieve their differential DNA binding specificity in vivo. In some cases, intrinsic differences in DNA binding preferences contribute to the observed functional differences between paralogous TFs [13]. However, in other cases, the core DNA motifs are virtually identical [10], and still the proteins interact differently with putative genomic binding sites in vivo, as revealed by genome-wide ChIP-chip and ChIP-seq data [14, 15].
In such cases, it has been hypothesized that interactions with specific protein partners (henceforth referred to as co-factors) may contribute to the differential DNA binding in vivo [16].

Here, we use the Myc/Max/Mad family of TFs as a model system to investigate whether interactions with putative co-factors can explain why paralogous TFs with seemingly identical DNA binding preferences interact with different genomic sites in vivo. Myc, Max, and Mad proteins are members of the basic helix-loop-helix leucine zipper (bHLH/Zip) family and they play essential roles in cell proliferation, differentiation, and death. Myc proteins are transcriptional activators that promote cell growth and proliferation, and are often overexpressed in cancer cells [17]. Proteins of the Mad family act as transcriptional repressors: they inhibit cell proliferation and are typically expressed at lower levels in human cancers [17]. In order to bind DNA, both Myc and Mad must heterodimerize with Max, a bHLH/Zip TF with little transcriptional activity [17]. Mad factors compete with Myc for dimerization with Max and for binding to genomic regions containing the E-box motif (CAnnTG), with both Myc and Mad having a strong preference for the E-box site CACGTG. Thus, it is not surprising that there is a high degree of overlap between the sets of targets bound by Myc and Mad factors in vivo, as illustrated by ChIP-seq data available from ENCODE [15]. However, despite a significant overlap in their sets of ChIP-bound regions, Myc and Mad also have unique targets, as illustrated in Fig. 1A for c-Myc and Mad2 (Mxi1), representative members of the Myc and Mad subfamilies, respectively. Here, we focus on c-Myc and Mad2 because high-quality in vivo TF-DNA binding data is available for both these factors as part of the ENCODE project [15]. We show that the intrinsic DNA binding preferences of c-Myc and Mad2 cannot explain why the two factors bind distinct sets of targets in vivo.
High-quality DNA binding site motifs have been previously reported for c-Myc [5], but not for Mad2 (nor for other Mad factors). Therefore, we use PBM assays [3] to thoroughly characterize the sequence preferences of c-Myc and Mad2. Then, we use Support Vector Machines (SVM) [18] and Random Forests (RF) [19] to identify sets of putative co-factors that can successfully distinguish between the genomic regions bound uniquely by c-Myc or Mad2, with an accuracy of ∼87%.

Our classification-based approach is not restricted to c-Myc and Mad2. Instead, we implemented this approach in a general framework named COUGER (co-factors associated with uniquely-bound genomic regions). Our framework can be applied to any two sets of genomic regions bound by paralogous TFs to identify the uniquely-bound targets and to determine the sets of co-TFs that best distinguish between the two sets of unique targets. Compared to related tools for analyzing ChIP-seq data, COUGER has several advantages, as detailed in the Discussion section: it uses state-of-the-art classification algorithms (SVM and RF) that are robust even when the feature set is large and some of the features are highly correlated; it makes use of high-quality TF-DNA binding data (from PBM experiments) to generate the features used in the classification; and it takes into account the fact that TF binding sites may occur in clusters (while other tools only consider the highest affinity TF binding sites). Furthermore, given the large amount of ChIP-seq data available from ENCODE, we have implemented COUGER to accept as input ChIP-seq files in the narrowPeak format; such files can be downloaded directly from the ENCODE website. We anticipate that our framework will be extremely useful in analyzing ChIP-seq data to understand how interactions with specific co-factors contribute to differences in the in vivo DNA binding specificities of paralogous TFs.
COUGER is available at: www.genome.duke.edu/labs/gordan/COUGER. The PBM data for c-Myc and Mad2 is available at www.genome.duke.edu/labs/gordan/DATA.

2 Intrinsic DNA Binding Preferences of c-Myc and Mad2 Cannot Explain Their Differential in vivo DNA Binding

We combined in vitro and in vivo TF-DNA binding data for c-Myc and Mad2 to determine whether subtle differences in their intrinsic sequence preferences can explain, at least in part, the unique genomic targets bound by only one of the two factors in vivo. As evidence of in vivo binding we used ChIP-seq data from the ENCODE project [15]. We focused on the HeLa S3 and K562 cell lines because ChIP-seq data is available for both c-Myc and Mad2, from the same laboratory. For both c-Myc and Mad2 we downloaded the ChIP-seq data in narrowPeak format from the UCSC Genome Browser [20]. For the HeLa S3 cell line, 7,440 binding regions (i.e., ChIP-seq peaks) were reported for c-Myc, and 32,138 for Mad2. Because the number of bound genomic sequences varied greatly between the two TFs, it would be difficult to perform a comparative analysis directly. The fact that different types of controls were used in the c-Myc and Mad2 ChIP experiments (standard versus no primary antibody) probably contributes to the larger number of peaks reported for Mad2. However, a close examination of the ChIP-seq data also revealed that the p-value cutoffs used for reporting the peaks were different: $10^{-8.8}$ for c-Myc and $10^{-2.4}$ for Mad2. To make the two data sets more comparable, we applied a cutoff of 10 for the $-\log_{10}$ of the ChIP-seq p-value. This resulted in more balanced sets of in vivo targets for c-Myc and Mad2, with 6205 and 9758 bound regions, respectively. We used these sets of targets for all the analyses described henceforth. As shown in Fig. 1A, although there is a significant overlap between the two sets of targets, c-Myc and Mad2 also bind unique genomic targets in HeLa S3 cells.

[Figure 1. Panel A: Venn diagram of HeLa S3 ChIP-seq overlaps (c-Myc only: 2786; Mad2 only: 6339; shared: 3419). Panel B: c-Myc and Mad2 sequence logos (information content in bits, 5′ to 3′). Panel C: AUC enrichment in c-Myc ChIP data (c-Myc PWM 0.665, Mad2 PWM 0.663) and in Mad2 ChIP data (c-Myc PWM 0.585, Mad2 PWM 0.582).]

Fig. 1. (A) Overlap between the sets of genomic regions bound by c-Myc and Mad2 in a ChIP-seq experiment [15]. (B) c-Myc and Mad2 DNA binding motifs derived from in vitro PBM data. The logos were generated using enoLOGOS [21]. (C) AUC enrichment for the c-Myc and Mad2 DNA binding motifs in the ChIP-seq data for the two TFs in HeLa S3 cells. The dotted line shows the expected AUC for a random motif.

Currently, the molecular mechanisms that allow paralogous TFs, such as c-Myc and Mad2, to interact with different sets of DNA sites in vivo are not well understood. One hypothesis is that the two TFs exhibit slightly different DNA binding preferences, and this may contribute to their differential in vivo binding. To test this hypothesis, it is essential to have high-quality DNA binding site motifs or other types of data that reflect the intrinsic DNA binding preferences of these TFs. Although such data is available for c-Myc [5, 6], none of the Mad factors have been thoroughly characterized, and the only DNA motif available for Mad2 is a general E-box motif of low quality [5]. For this reason, we performed PBM experiments [3] to thoroughly characterize the DNA binding preferences of c-Myc and Mad2. We tested the two TFs either alone or in combination with TF Max. As expected, the c-Myc:c-Myc and Mad2:Mad2 homodimers bound DNA very weakly even when tested at high concentrations, while c-Myc:Max and Mad2:Max bound DNA with high affinity. His-tagged versions of c-Myc, Mad2, and Max were used in the PBM experiments, and they were a kind gift from Richard Young and Peter Rahl (Whitehead Institute).
To ensure that the DNA binding signal detected on PBMs corresponds to heterodimers and not the Max:Max homodimer, we used concentrations of c-Myc/Mad2 10 times higher than the concentration of Max. We will henceforth refer to the c-Myc:Max PBM data as c-Myc PBM data, and to the Mad2:Max PBM data as Mad2 PBM data.

From the universal PBM data for c-Myc and Mad2, we computed several measures of the DNA binding specificity of the two factors: 1) we used the Seed-and-Wobble algorithm [3] to derive DNA binding site motifs, or position weight matrices (PWMs) [22]; 2) we computed the median fluorescence intensity for each possible 8-mer, as described previously [3], with high median intensities corresponding to 8-mers strongly preferred by the TF; and 3) we computed enrichment scores (E-scores) for each possible 8-mer, as described previously [3]. E-scores range from -0.5 to +0.5, with higher values corresponding to higher sequence preference. Compared to 8-mer median intensities, the E-scores are more robust to changes in experimental conditions (e.g., binding buffers) and protein concentrations. However, 8-mer median intensities can be used to approximate the median intensities for longer k-mers, intensities that are not directly measured on the PBMs (see Supplementary Material online).

2.1 DNA Motifs Cannot Explain Differential in vivo DNA Binding by c-Myc versus Mad2

The DNA motifs of c-Myc and Mad2 are very similar, but not identical (Fig. 1B). For example, c-Myc appears to have a slightly higher preference for a C nucleotide immediately upstream of the CACGTG core.
To assess whether such differences are significant in vivo and potentially explain the differences in in vivo DNA binding between the two proteins, we first compared the enrichment of the c-Myc and Mad2 motifs in the ChIP-seq data, using a method based on the area under the receiver operating characteristic curve (AUC) (see [23] and Supplementary Material online).

Fig. 1C shows AUC enrichments for the c-Myc and Mad2 motifs in the ChIP-seq data. If these motifs could explain, even to a small extent, why c-Myc and Mad2 bind different sets of targets in vivo, then we would expect the c-Myc motif to be significantly more enriched than the Mad2 motif in the c-Myc ChIP-seq data, and the Mad2 motif to be significantly more enriched than the c-Myc motif in the Mad2 ChIP-seq data. However, the AUC enrichments of these motifs are almost identical: 0.665 and 0.663 in c-Myc ChIP-seq data, and 0.585 and 0.582 in Mad2 ChIP-seq data for the HeLa S3 cell line. In conclusion, we cannot use DNA motifs to differentiate between the c-Myc and Mad2 ChIP-seq data sets.

2.2 In vitro Universal PBM Data Cannot Explain Differential in vivo Binding by c-Myc versus Mad2

Another way of assessing the enrichment of a DNA motif (PWM) in a ChIP-seq data set is by looking at how many of the ChIP-bound sequences contain a PWM match above a certain cutoff (and possibly using this count to compute a hypergeometric p-value). The shortcoming of this method is that it depends greatly on the chosen cutoff, and there is no systematic way of choosing the “best” cutoff for a given PWM [23]. To overcome this problem, we considered a range of cutoffs and, for each cutoff, we computed the fraction of ChIP-bound sequences that contain at least one DNA site with a score above the cutoff. Furthermore, since cutoffs based on PWM scores would not be readily comparable between the c-Myc and Mad2 PWMs, we chose cutoffs based on the number of possible k-mers with scores above that cutoff, where k is the width of the PWM. The results presented here are for PWMs of size k = 10. We obtained similar results for other values of k.

[Figure 2. Panel A: 10-mer PWM scores; panel B: 8-mer E-scores. Each panel has one plot for c-Myc ChIP-seq and one for Mad2 ChIP-seq. x-axis: number of top-scoring 10-mers (A) or 8-mers (B); y-axis: fraction of peaks. Curves: c-Myc and Mad2 fractions in ChIP-seq and in DNase-seq.]

Fig. 2. Fractions of ChIP-seq/DNase-seq peaks that contain DNA sites with (A) PWM scores or (B) 8-mer E-scores above certain cutoffs. The cutoffs represent the number of top-scoring k-mers, ranked by either PWM scores or PBM 8-mer E-scores. The full lines correspond to ChIP-seq data. The dotted lines correspond to DNase-seq data.

The results of this analysis are illustrated in Fig. 2A. For each ChIP-seq data set and each TF, we counted the number of ChIP-seq peaks that contained at least one 10-mer in the set of 10, 100, 250, 500, 1000, 5000, and 10000 top-scoring 10-mers. We compared the fractions of peaks corresponding to c-Myc versus Mad2 ChIP-seq data. Also, in order to compare those results with the background distribution, we computed similar fractions for DNase-seq [24] peaks. As shown in Fig. 2A, for both ChIP-seq data sets the values corresponding to the c-Myc and Mad2 PWMs are very similar, and we observe the same pattern of slightly higher fractions for Mad2. Thus, we cannot use the PWM scores in this manner to differentiate between c-Myc and Mad2 ChIP-seq targets. We also notice that, as expected, a larger fraction of ChIP-seq peaks contain high-scoring PWM matches compared to DNase-seq peaks (compare the full and dotted lines in Fig. 2, which correspond to ChIP- and DNase-seq data, respectively).
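The counting step behind this analysis can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `fraction_of_peaks_with_top_kmers` and its arguments are hypothetical, the ranked k-mer list is assumed to be precomputed (by PWM score or by PBM E-score), and only the forward strand of each peak is scanned.

```python
def fraction_of_peaks_with_top_kmers(peaks, ranked_kmers, cutoffs):
    """Return {cutoff: fraction of peaks containing >= 1 of the top-`cutoff` k-mers}.

    peaks        -- list of DNA sequences (strings), one per peak
    ranked_kmers -- k-mers of equal length, best-scoring first (assumed precomputed)
    cutoffs      -- numbers of top-scoring k-mers to consider, e.g. [10, 100, 250]
    """
    if not peaks or not ranked_kmers:
        raise ValueError("need at least one peak and one k-mer")
    k = len(ranked_kmers[0])
    fractions = {}
    for n in cutoffs:
        top = set(ranked_kmers[:n])  # the n best-scoring k-mers
        hits = 0
        for seq in peaks:
            # Slide a window of width k over the peak (forward strand only).
            if any(seq[i:i + k] in top for i in range(len(seq) - k + 1)):
                hits += 1
        fractions[n] = hits / len(peaks)
    return fractions


# Toy data, invented for illustration only.
toy_peaks = ["ACACGTGT", "TTTTTTTT", "GGCACGTG"]
toy_ranking = ["CACGTG", "CACGCG", "TTTTTT"]
toy_fractions = fraction_of_peaks_with_top_kmers(toy_peaks, toy_ranking, [1, 3])
```

Expressing the cutoff as a rank (top 10, top 100, and so on) rather than as a raw score is what makes the c-Myc and Mad2 curves comparable, since the score scales of the two PWMs differ. If both strands matter, one option is to add the reverse complement of each k-mer to the ranked list before calling the function.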
We note that PWMs are in fact summaries of the comprehensive data that we obtain from PBM experiments. Thus, it is possible that differences between the DNA binding specificities of c-Myc and Mad2 do exist, but are not captured by PWMs. To test this hypothesis, we performed an analysis similar to the one described above, but instead of using PWM scores we used 8-mer E-scores derived directly from PBM data. Fig. 2B shows the fractions of peaks containing top-scoring 8-mers. As in the case of PWM scores, the difference between ChIP-seq data and DNase-seq data is significant for both TFs, but the curves for c-Myc and Mad2 are almost identical. Thus, the 8-mer PBM data is still not sufficient to differentiate between the in vivo targets of c-Myc and those of Mad2.

3 Binding of Putative Co-factors Can Explain Differences in in vivo DNA Binding between c-Myc and Mad2

Given that the intrinsic DNA binding preferences of c-Myc and Mad2 cannot be used to differentiate between the in vivo targets of the two TFs, we next focused on the hypothesis that DNA binding of co-factors in the neighborhood of c-Myc or Mad2 binding sites might contribute to the differences we observe between their sets of in vivo targets. To test this hypothesis, we built classifiers that can accurately distinguish between sequences bound uniquely by c-Myc versus Mad2 according to the ChIP-seq data, using features derived from either PBM data or PWMs of putative co-factors. We implemented our approach in the COUGER (co-factors associated with uniquely-bound genomic regions) framework. The steps of the framework are summarized in Algorithm 1.

3.1 Classes and Features

Classes. We used the ChIP-seq data to define two classes of sequences: c-Myc-specific sequences (i.e., c-Myc ChIP-seq peaks that do not overlap any of the Mad2 peaks), and Mad2-specific sequences (i.e., Mad2 ChIP-seq peaks that do not overlap any of the c-Myc peaks).
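The class definition above amounts to an interval-overlap filter, which can be sketched as follows. This is an illustrative sketch rather than the COUGER implementation: the function name `unique_peaks` is hypothetical, peaks are assumed to be `(chrom, start, end)` tuples with 0-based half-open coordinates (as in BED-style files), and two peaks are assumed to overlap when they share at least one base.

```python
from bisect import bisect_left
from collections import defaultdict


def unique_peaks(peaks_a, peaks_b):
    """Return the peaks in `peaks_a` that overlap no peak in `peaks_b`.

    Peaks are (chrom, start, end) tuples, 0-based and half-open (an assumption).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))

    # Per chromosome: sorted starts plus the running maximum of the ends.
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [s for s, _ in intervals]
        max_end, running = [], 0
        for _, e in intervals:
            running = max(running, e)
            max_end.append(running)
        index[chrom] = (starts, max_end)

    unique = []
    for chrom, start, end in peaks_a:
        starts, max_end = index.get(chrom, ([], []))
        n = bisect_left(starts, end)  # number of b-peaks that start before `end`
        # An overlap exists iff one of those b-peaks ends after `start`.
        if n == 0 or max_end[n - 1] <= start:
            unique.append((chrom, start, end))
    return unique


# Toy coordinates, invented for illustration only.
toy_a = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 60)]
toy_b = [("chr1", 150, 250), ("chr1", 400, 500)]
a_specific = unique_peaks(toy_a, toy_b)
b_specific = unique_peaks(toy_b, toy_a)
```

Calling the function twice with the arguments swapped yields the two classes. Sorting with a running maximum keeps each lookup logarithmic, which matters little for a few thousand peaks but avoids a quadratic scan on larger peak sets.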
In HeLa S3 cells we identified 2786 c-Myc-specific sequences and 6308 Mad2-specific sequences, which account for approximately 45% and 65% of the total ChIP-seq peaks of c-Myc and Mad2, respectively. These percentages are surprisingly high given the similarity between the DNA binding preferences of the two TFs. After identifying c-Myc- and Mad2-specific sequences, we filtered out some of the Mad2-specific peaks and kept only the top 2786, sorted according to the Mad2 ChIP-seq p-value. Thus, we obtained two sets containing the same number of DNA sequences, which eliminates a potential classification bias toward one of the two classes. Finally, before computing the features for the selected DNA sequences, we trimmed each sequence to ±100 bp on each side of the ChIP-seq peak summit. This was necessary because many peaks are a few hundred to a few thousand bases long. Given that we are interested in finding co-factors that bind close to c-Myc and Mad2, we should look for DNA sites of these putative co-factors only in close proximity of the c-Myc and Mad2 ChIP-seq peak summits.

Features. We computed features using two types of information on the DNA binding specificities of putative co-factors: PBM 8-mer E-scores and PWM scores. We used 3 different types of features: 1) “PBM features” derived from 8-mer E-scores for the mammalian TFs in UniPROBE [4] (420 PBM data sets), plus 9 PBM data sets from our laboratory; 2) “PWM features” derived from PWMs computed from the PBM data sets; and 3) “T (Transfac) features” derived from the PWMs in Transfac [5] (1226 PWMs). For a given PBM data set and a DNA sequence, we generated: an “M” feature that represents the maximum E-score over all the 8-mers in that sequence, and an “A” feature that represents the average E-score over the top 3 highest-scoring 8-mers in that sequence (lines 4 and 5 of Algorithm 1). Similarly, we generated M and A features from PWMs.

Algorithm 1. Classification
Input: D: data set with classification sequences for Myc and Mad; PBM (E-scores from PBM data); PWM (PWMs from PBM data); T (PWMs from TRANSFAC).
Output: Lists of selected features SF, and accuracies A.
 1: D_train ← random.sample(D, 2/3 · |D|) such that |{X ∈ D_train, class(X) = Myc}| = |{X ∈ D_train, class(X) = Mad}|; D_test ← D − D_train
 2: for F ∈ {PBM, PWM, T} do
 3:     for X ∈ D do
 4:         F_M(X) = {max_{x∈X} f(x) | ∀f ∈ F}
 5:         F_A(X) = {avg(max_{x∈X} f(x), max_{y∈X−{x}} f(y), max_{z∈X−{x,y}} f(z)) | ∀f ∈ F}
 6:     F_MA(train) ← {F_M(X), F_A(X) | ∀X ∈ D_train}
 7:     SF_t ← feature.selection(D_train, F_t(train))
 8:     for C ∈ {SVM_lin, SVM_rbf, RF_gi, RF_pi} do
 9:         bestp(C, SF_t) ← arg max_{p∈params(C)} accuracy(train(C, D_train, SF_t, p))
10:         Model(C, SF_t) ← train(C, D_train, SF_t, bestp(C, SF_t))
11:         A_test(C, SF_t) ← accuracy(predict(C, D_test, Model(C, SF_t)))
12: SF ← {S_PBM_MA, S_PWM_MA, S_T_MA}
13: A ← {A_test(C, F) | ∀C ∈ {SVM_lin, SVM_rbf, RF_gi, RF_pi}, ∀F ∈ SF}
14: return SF, A

3.2 Classification Algorithms

We used two state-of-the-art supervised classification algorithms: support vector machine (SVM) and random forest (RF), both available as free software packages (LIBSVM [25], Random Jungle [26]). The SVM is widely used due to its high accuracy on linear and nonlinear classification problems. In addition, the SVM can successfully handle high-dimensional data, which makes it ideal for our classification task. We trained SVMs using linear and radial basis function kernels (SVMlin and SVMrbf, respectively). The RF classifier is essentially an ensemble of classification trees. RF is comparable in performance with SVM, but one of its distinguishing characteristics is that it explicitly computes a measure of the importance of each variable for the classification task.
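As an aside on lines 4 and 5 of Algorithm 1, the M and A features for one PBM data set and one sequence can be sketched as below. This is a hedged illustration, not the authors' code: the function name `m_and_a_features` and the `escore` dictionary (mapping each k-mer to its E-score, assumed to already contain both orientations of every k-mer) are hypothetical, and windows missing from the dictionary, for example those containing an N, are simply skipped.

```python
import heapq


def m_and_a_features(sequence, escore, k=8, top=3):
    """Return (M, A) for one sequence and one E-score table.

    M -- maximum score over all k-mers in the sequence
    A -- average of the `top` highest-scoring k-mer positions
    `escore` is a hypothetical dict {k-mer: E-score}.
    """
    scores = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in escore:  # skip windows without a score (assumption)
            scores.append(escore[window])
    if not scores:
        raise ValueError("no scorable k-mer in sequence")
    best = heapq.nlargest(top, scores)  # sorted, highest first
    return best[0], sum(best) / len(best)


# Toy 3-mer table, invented for illustration only (real E-scores are per 8-mer).
toy_escore = {"ACG": 0.4, "CGT": 0.2, "GTA": -0.1, "TAC": 0.3}
toy_m, toy_a_feature = m_and_a_features("ACGTAC", toy_escore, k=3)
```

The A feature is what lets the classifier register clusters of sites: a sequence with three good matches scores higher on A than a sequence with one good match and nothing else, even when their M features are equal.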
Random Jungle (RJ) implements two variable importance scores: Gini importance (the sum of impurity decreases over all nodes in the forest in which the corresponding variable was selected for splitting), and permutation importance (the average decrease in accuracy when the values of a variable are randomly permuted). We ran RJ with both the Gini importance (RFgi) and the permutation importance (RFpi).

We split the c-Myc- and Mad2-specific sequences into two sets: 1) a training set containing 2/3 of the sequences (i.e., 3714), randomly chosen from the original set; and 2) a test set containing the remaining 1/3 of the sequences. For each algorithm, we first searched for optimal parameter values using only the training data, and then, using the best model obtained on the training set, we predicted the class for each sequence in the test set. We measured the performance of each algorithm using its accuracy on the test set.

To optimize the parameters we performed grid searches over the parameter space. For SVMlin we optimized C, the cost of misclassifying examples. For SVMrbf we optimized C and the RBF kernel parameter γ. For RFgi and RFpi we optimized ntree, the number of trees in the forest, and mtry, the number of input variables tried in each split (see Supplementary Material online).

Feature Selection. We performed feature selection on each of the three feature types (PBM, PWM, and T (Transfac)) using both the maximum score of any DNA site and the average over the top three highest scores. We used RF with a backward elimination technique [26], an iterative process in which an RF is grown at each step and a subset of variables is discarded. The eliminated features are those with the smallest importance. In this instance we used only the unscaled permutation importance, which is recommended for feature selection [27]. We stopped the algorithm when the number of features fell below 100.
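The elimination loop can be sketched generically as follows. This is a minimal sketch under stated assumptions and not the Random Jungle implementation: `importance_fn` is a hypothetical callback standing in for "grow an RF on the current features and return their unscaled permutation importances", and rounding the number of dropped features up to the next integer is an assumption of this sketch.

```python
import math
from fractions import Fraction


def backward_elimination(features, importance_fn, drop_fraction, stop_below=100):
    """Iteratively discard the least important features.

    features      -- iterable of feature identifiers
    importance_fn -- hypothetical callback: list of features -> {feature: importance};
                     in the paper's setting this would regrow an RF each round
    drop_fraction -- exact fraction dropped per round, e.g. Fraction(1, 2)
    stop_below    -- stop once fewer than this many features remain
    """
    selected = list(features)
    while len(selected) >= stop_below:
        scores = importance_fn(selected)  # recomputed on the current subset
        n_drop = math.ceil(len(selected) * Fraction(drop_fraction))  # assumed rounding
        ranked = sorted(selected, key=lambda f: scores[f])  # least important first
        dropped = set(ranked[:n_drop])
        selected = [f for f in selected if f not in dropped]
    return selected


def toy_importance(feats):
    # Stand-in importance for illustration only: the feature's own index.
    return {f: f for f in feats}


kept_half = backward_elimination(range(858), toy_importance, Fraction(1, 2))
kept_third = backward_elimination(range(858), toy_importance, Fraction(1, 3))
```

With this rounding convention, dropping one half or one third per round from 858, 840, and 2452 starting features ends at 53 and 74, 52 and 73, and 76 and 94 features respectively, which agrees with the feature counts reported in Table 1. `Fraction` is used instead of a float so that the ceiling is computed exactly.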
We performed two variants of selection: FS1, with 50% of features dropped at each iteration, and FS2, with 33% of features dropped at each iteration.

3.3 Classification Accuracy on the Test Sets

We ran SVM and RF on the HeLa S3 ChIP-seq data using the feature types described above (PBM, PWM, and Transfac). The results for SVMlin and RFpi are presented in Table 1 as classification accuracies, and vary between 85.52% and 88.05% depending on the algorithm and the set of features. Results for the other two classifiers (SVMrbf and RFgi), as well as results on the K562 ChIP-seq data, are available in the Supplementary Material online.

Table 1 shows that our SVM and RF classifiers can accurately distinguish between c-Myc-specific and Mad2-specific genomic targets. This suggests that a potential mechanism by which these TFs achieve differential DNA binding in vivo is by interacting with co-factors that bind DNA in the neighborhood of c-Myc or Mad2 DNA binding sites. We will perform follow-up analyses to study the spacing between c-Myc/Mad2 sites and DNA sites of their putative co-factors, to assess the likelihood of direct TF-TF interactions. We note that SVMlin with Transfac features achieved the best classification accuracy on the HeLa S3 ChIP-seq data: 88.05% when using all 2452 features. However, the accuracy decreased after feature selection and became comparable to the accuracy for PBM and PWM features.

Table 1. Classification accuracy (%) on the test sets. Table shows the results of SVM and RF on HeLa S3 ChIP-seq data using 3 feature types: PWM, Transfac, and PBM. Bold: best classification accuracy obtained by SVMlin and RFpi for a particular feature type.

Features              PBM                     PWM                     Transfac
Feature set           ALL     FS1     FS2     ALL     FS1     FS2     ALL     FS1     FS2
Number of features    858     53      74      840     52      73      2452    76      94
SVMlin                86.87   87.03   87.08   86.60   85.74   86.33   88.05   86.60   86.65
RFpi                  86.65   86.92   86.71   85.52   86.01   85.95   86.60   86.44   86.71

3.4 Selected Features. Putative Co-factors

SVMlin with PBM features obtained the best classification accuracy with a limited number of features: 87.08% with a total of 74 features (FS2) and 87.03% with a total of 53 features (FS1). Importantly, 52 of the 53 features in FS1 are among the 74 features in FS2. We note that COUGER can be used to reduce the number of features even further, although this might lead to a decrease in classification accuracy. For this particular data set, for example, reducing the number of selected features to 10 resulted in an accuracy of 85.47%.

We analyzed the top 53 selected features to identify putative co-factors that might contribute to differential in vivo DNA binding by c-Myc versus Mad2. The top 4 putative co-factors (according to RF variable importance score) are: E2F, Sp100, Zfp161, and Sp4, all associated with Mad2-specific sequences. A literature search revealed that at least 3 of these TFs are indeed good candidate co-factors for Mad2: E2F binding site elements are important for autorepression of the c-myc gene [28], Sp100 is a transcriptional repressor (similarly to Mad2) and plays an important role as a tumor suppressor [29], and Zfp161 is a putative c-myc repressor [30]. The fourth TF, Sp4, is not known to act as a repressor but it has been shown to be aberrantly expressed in many cancers [5], which supports a connection with Myc/Mad.

For the highest confidence candidate co-factor, E2F, we performed an enrichment analysis similar to the one in Fig. 2. We note that E2F is a family of TFs with highly similar DNA binding specificities. All E2Fs for which PBM data is currently available have been selected by our classifiers, and they are similarly enriched in Mad2-specific targets. In Fig. 3 we show the enrichment of two representative E2F family members: E2F3 and E2F4.
[Figure 3. Left plot: c-Myc specific targets; right plot: Mad2 specific targets. x-axis: number of top-scoring 8-mers (10 to 5000); y-axis: fraction of peaks (0.0 to 1.0). Curves: c-Myc, Mad2, E2F3, E2F4, each for ChIP-seq and DNase-seq peaks.]

Fig. 3. Enrichment analysis for c-Myc, Mad2, and E2F factors in the DNA regions bound uniquely by either c-Myc or Mad2. Right plot shows that E2Fs are more enriched than Mad2 in the Mad2-specific targets, although their enrichment in the DNase-seq peaks is much lower. By contrast, in the c-Myc-specific targets (left plot), E2F sites are depleted, as their enrichment is generally lower than in the DNase-seq peaks. These results are in agreement with our classification analyses, which found that E2F sites are strongly associated with Mad2-specific and not c-Myc-specific targets.

4 Discussion

Identifying the molecular mechanisms that allow paralogous TFs to bind different sets of in vivo targets is essential for understanding eukaryotic transcription. Due to recent advances in high-throughput technologies for measuring TF-DNA binding both in vivo and in vitro (such as ChIP-seq and PBM), it is now possible to quantify the contributions of both intrinsic TF-DNA binding specificity and interacting co-factors to differential in vivo DNA binding by related TFs. Here, we focus on paralogous TFs c-Myc and Mad2, and show that differences in their intrinsic sequence preferences cannot account for the large number of targets bound uniquely by each TF. Instead, interactions with putative co-TFs are a likely mechanism used by c-Myc and Mad2 to select their specific genomic sites.

To identify c-Myc and Mad2 co-factors, we designed COUGER, a novel framework that uses in vitro DNA binding specificity data for putative co-factors to distinguish between the genomic targets of paralogous TFs (here, c-Myc and Mad2).
We are not aware of other tools that aim to identify co-factors that interact specifically with paralogous TFs. However, similar classification approaches have been previously used to distinguish between sets of genomic regions. Chen and Zhou [31], for example, use Naïve Bayes to identify co-factors that can distinguish between the regulatory regions of genes upregulated versus downregulated in mouse ES cells. We note that our choice of classification algorithms is very important. When using features derived from TF-DNA binding specificity data (either PWMs or PBM data), one is likely to obtain features that are highly correlated. While Naïve Bayes is not appropriate in this case, both SVM and RF classifiers can be used. Furthermore, as the number of features increases (in our case, as more and more PBM data is being generated), a Naïve Bayes approach may start to overfit the training data, while SVMs and RFs are more robust. Finally, the advantage of using RF for feature selection (as opposed to a Wilcoxon rank-sum test) is that RF can easily handle interactions among features, which would not be captured by a statistical test on individual features.

De novo motif discovery tools or methods that search for DNA motifs enriched in particular sets of sequences could also be used, in theory, to identify co-factors [32, 33, 34]. However, these approaches would only search for one co-factor at a time, and would be able to find only DNA motifs that appear in a significant fraction of the DNA sequences of interest (here, the c-Myc- or Mad2-specific sequences). Recent evidence from the ENCODE project shows that the co-association of TFs is highly context specific, i.e., distinct combinations of TFs bind at specific genomic locations [35].
Thus, classification approaches such as COUGER, which search for sets of putative co-factors in TF-specific genomic targets, are more likely to reveal important molecular mechanisms through which paralogous TFs achieve their regulatory specificity in the cell. Future work will include additional computational analyses to select the best candidate co-factors for c-Myc and Mad2, as well as using COUGER to identify co-factors for paralogous TFs from other protein families.

References

[1] Ren, B., Robert, F., Wyrick, J.J., et al.: Genome-wide location and function of DNA binding proteins. Science 290, 2306–2309 (2000)
[2] Johnson, D.S., Mortazavi, A., Myers, R.M., Wold, B.: Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502 (2007)
[3] Berger, M.F., Philippakis, A.A., Qureshi, A.M., et al.: Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotech. 24, 1429–1435 (2006)
[4] Robasky, K., Bulyk, M.L.: UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Research 39, D124–D128 (2011)
[5] Matys, V., Kel-Margoulis, O.V., Fricke, E., et al.: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34, D108–D110 (2006)
[6] Portales-Casamar, E., Thongjuea, S., Kwon, A.T., et al.: JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Research 38, D105–D110 (2010)
[7] Badis, G., Berger, M.F., Philippakis, A.A., et al.: Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720–1723 (2009)
[8] Wells, J., Graveel, C.R., Bartley, S.M., et al.: The identification of E2F1-specific target genes. Proc. Natl. Acad. Sci. USA 99, 3890–3895 (2002)
[9] Wu, Z., Zheng, S., Yu, Q.: The E2F family and the role of E2F1 in apoptosis. Int. J. Biochem. Cell Biol. 41, 2389–2397 (2009)
[10] Tao, Y., Kassatly, R., Cress, W., Horowitz, J.: Subunit composition determines E2F DNA-binding site specificity. Mol. Cell Biol. 17, 6994–7007 (1997)
[11] Hollenhorst, P.C., Shah, A.A., Hopkins, C., Graves, B.J.: Genome-wide analyses reveal properties of redundant and specific promoter occupancy within the ETS gene family. Genes Dev. 21, 1882–1894 (2007)
[12] Wei, G.H., Badis, G., Berger, M.F., et al.: Genome-wide analysis of ETS-family DNA-binding in vitro and in vivo. EMBO J. 29, 2147–2160 (2010)
[13] Soleimani, V.D., Punch, V.G., Kawabe, Y.I., et al.: Transcriptional dominance of Pax7 in adult myogenesis is due to high-affinity recognition of homeodomain motifs. Dev. Cell 22, 1208–1220 (2012)
[14] Xu, X., Bieda, M., Jin, V.X., et al.: A comprehensive ChIP-chip analysis of E2F1, E2F4, and E2F6 in normal and tumor cells reveals interchangeable roles of E2F family members. Genome Research 17, 1550–1561 (2007)
[15] ENCODE Project Consortium, Bernstein, B., Birney, E., Dunham, I., Green, E., Gunter, C., Snyder, M.: An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012)
[16] Farnham, P.J.: Insights from genomic profiling of transcription factors. Nat. Rev. Genet. 10, 605–616 (2009)
[17] Grandori, C., Cowley, S.M., James, L.P., Eisenman, R.N.: The Myc/Max/Mad network and the transcriptional control of cell behavior. Annu. Rev. Cell Dev. Biol. 16, 653–699 (2000)
[18] Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20, 273–297 (1995)
[19] Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001)
[20] Rosenbloom, K.R., Dreszer, T.R., Long, J.C., et al.: ENCODE whole-genome data in the UCSC Genome Browser: update. Nucleic Acids Research 40, D912–D917 (2012)
[21] Workman, C.T., Yin, Y., Corcoran, D., et al.: enoLOGOS: a versatile web tool for energy normalized sequence logos. Nucleic Acids Research 33, W389 (2005)
[22] Stormo, G.D.: DNA binding sites: representation and discovery. Bioinformatics 16, 16–23 (2000)
[23] Gordân, R., Hartemink, A., Bulyk, M.: Distinguishing direct versus indirect transcription factor-DNA interactions. Genome Res. 19, 2090–2100 (2009)
[24] Song, L., Crawford, G.E.: DNase-seq: A high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harbor Protocols 2010, pdb.prot5384 (2010)
[25] Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 1–27 (2011)
[26] Schwarz, D.F., König, I.R., Ziegler, A.: On safari to random jungle: a fast implementation of random forests for high-dimensional data. Bioinformatics 26, 1752–1758 (2010)
[27] Díaz-Uriarte, R., Alvarez de Andrés, S.: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7, 3 (2006)
[28] Luo, Q., Li, J., Cenkci, B., Kretzner, L.: Autorepression of c-myc requires both initiator and E2F-binding site elements and cooperation with the p107 gene product. Oncogene 23, 1088–1097 (2004)
[29] Negorev, D.G., Vladimirova, O.V., Kossenkov, A.V., et al.: Sp100 as a potent tumor suppressor: accelerated senescence and rapid malignant transformation of human fibroblasts through modulation of an embryonic stem cell program. Cancer Research 70, 9991–10001 (2010)
[30] Sobek-Klocke, I., Disque-Kochem, C., Ronsiek, M., Klocke, R., et al.: The human gene ZFP161 on 18p11.21-pter encodes a putative c-myc repressor and is homologous to murine Zfp161 (Chr 17) and Zfp161-rs1 (X Chr). Genomics 43, 156–164 (1997)
[31] Chen, G., Zhou, Q.: Searching ChIP-seq genomic islands for combinatorial regulatory codes in mouse ES cells. BMC Genomics 12, 515 (2011)
[32] Machanick, P., Bailey, T.L.: MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27, 1696–1697 (2011)
[33] Thomas-Chollier, M., Herrmann, C., Defrance, M., et al.: RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Research 40, e31 (2012)
[34] Whitington, T., Frith, M.C., Johnson, J., Bailey, T.L.: Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Research 39, e98 (2011)
[35] Gerstein, M.B., Kundaje, A., Hariharan, M., et al.: Architecture of the human regulatory network derived from ENCODE data. Nature 489, 91–100 (2012)