BIOINFORMATICS ORIGINAL PAPER
Vol. 22 no. 18 2006, pages 2210–2216
doi:10.1093/bioinformatics/btl329

Sequence analysis

A mixture model-based discriminate analysis for identifying ordered transcription factor binding site pairs in gene promoters directly regulated by estrogen receptor-α

Lang Li1,*, Alfred S. L. Cheng2, Victor X. Jin2, Henry H. Paik3, Meiyun Fan3, Xiaoman Li1, Wei Zhang1, Jason Robarge1, Curtis Balch3, Ramana V. Davuluri2, Sun Kim3, Tim H.-M. Huang2 and Kenneth P. Nephew3

1Division of Biostatistics, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 47405, USA, 2Division of Human Cancer Genetics, Department of Molecular Virology, Immunology, and Medical Genetics, Comprehensive Cancer Center, Ohio State University, Columbus, OH 43210, USA and 3Medical Sciences, Indiana University School of Medicine, Bloomington, IN 47405, USA

Received on April 13, 2006; revised on May 24, 2006; accepted on June 9, 2006
Advance Access publication June 29, 2006
Associate Editor: Martin Bishop

ABSTRACT
Motivation: To detect and select patterns of transcription factor binding sites (TFBSs) which distinguish genes directly regulated by estrogen receptor-α (ERα), we developed an innovative mixture model-based discriminate analysis for identifying ordered TFBS pairs.
Results: Biologically, our proposed new algorithm clearly suggests that TFBSs are not randomly distributed within ERα target promoters (P-value < 0.001). The up-regulated targets significantly (P-value < 0.01) possess the TFBS pairs (DBP, MYC), (DBP, MYC/MAX heterodimer), (DBP, USF2) and (DBP, MYOGENIN); and down-regulated ERα target genes significantly (P-value < 0.01) possess TFBS pairs such as (DBP, c-ETS1-68), (DBP, USF2) and (DBP, MYOGENIN). Statistically, our proposed mixture model-based discriminate analysis can simultaneously perform TFBS pattern recognition, TFBS pattern selection, and target class prediction; such integrative power cannot be achieved by current methods.
Availability: The software is available on request from the authors.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
*To whom correspondence should be addressed.

1 INTRODUCTION
1.1 Biological background
Identifying transcription factors (TFs) that regulate gene expression is an important step in understanding disease-related biological processes. Transcriptional regulation by steroid receptors and ligand-inducible transcription factors plays crucial biological roles in normal physiological and pathological processes (McDonnell and Norris, 2002). Steroid receptor activity has recently been shown to depend on combinatorial interactions with multiple nuclear proteins (Geserick et al., 2005). One of the best studied steroid receptors is that of the female sex hormone estrogen, designated estrogen receptor-α (ERα). Bioinformatics studies, including ours (Jin et al., 2004), have revealed DNA-binding sites for many transcription factors to be adjacent to estrogen response elements (EREs), DNA sequence motifs recognized by ERα. ERα may form complexes with nearby transcription factors, and such interactions may strengthen or attenuate transcriptional activity, thus defining target gene specificity (Geserick et al., 2005). To our knowledge, no study has been conducted to examine the patterns of transcription factor binding sites (TFBSs) in promoters that are directly activated or repressed by ERα.

In the present study, we focused on 10 known TFs, which may play roles in regulating ERα-target genes. AP1, GATA3 and SMAD3 are known to be involved in estrogen signaling in reproductive tissues and breast carcinogenesis (Lacroix and Leclercq, 2004; Lincoln, 2005; Rushton et al., 2003), while DBP, MYOGENIN and USF2 are associated with multiple ERα regulated processes (Jin, 2004).
The c-MYC proteins (MYC and its binding partner MAX) are important regulators of many cellular processes, including proliferation and apoptosis (Pelengaris et al., 2002). In addition, MYC and MAX are over-expressed in human breast carcinomas (Bland et al., 1995), rapidly up-regulated in response to estrogen, and are known mediators of the stimulatory effects of this hormone (Rodrik et al., 2005). The ETS family is one of the largest families of transcriptional regulators that activate or repress the expression of genes involved in various biological processes, including human cancer (Seth and Watson, 2005). c-ETS1-68, a splice variant of c-ETS1, is overexpressed in breast cancer (Myers et al., 2005; Lincoln and Bove, 2005; Seth and Watson, 2005) and has recently been shown to play a role in ERα-mediated tumor invasion and angiogenesis (Lincoln et al., 2003).

Key insight into the gene networks underlying the role of ERα in breast cancer has come from microarray studies profiling up- and down-regulation of ERα target gene expression in breast cancer cells (Lin et al., 2004). These and other studies further suggest that breast cancer growth is regulated by the coordinated action of ERα and multiple signaling pathways (Osborne et al., 2005). In this regard, the combination of ERα with transcription factors, such as those identified above, may provide an important mechanism to coordinate hormone-elicited signals for proper regulation of target genes.

The overall focus of the present study was up- and down-regulated ERα target genes. In particular, we examined TFBS patterns in target promoters that are either activated or repressed by ERα. We also investigated whether pairs of TFBSs were ordered and whether distinct TFBS patterns could distinguish up- and down-regulated ERα targets.

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
1.2 Computational algorithm background
Many computational algorithms (i.e. de novo TFBS discovery methods) have been developed for large-scale TFBS discovery. Importantly, these approaches can identify candidate TFBSs for experimental validation. Three major challenges face de novo discovery methods and need to be addressed: (1) the identification of TFBSs from a set of known co-regulated gene promoters, without knowing their locations or position specific weight matrices (PSWMs) (Bailey, 1994; Bussemaker et al., 2001; Gupta and Liu, 2003; Liu et al., 1995, 2001; Roth et al., 1998); (2) inference of TFBSs and their modules along a promoter sequence, based on a set of known TF PSWMs (Bailey and Nobel, 2003; Berman et al., 2002; Crowley et al., 1997; Firth et al., 2001, 2002; Frech et al., 1997; Kondrakhin et al., 1995; Prestridge, 1995; Sinha et al., 2003) and (3) discovery of patterns within the detected TFBSs that not only characterize a regulatory mechanism but also distinguish between mechanisms. For example, the different TFBS patterns of ERα up- and down-regulated promoters have yet to be identified and characterized.

In this paper, we focused on challenge three: finding patterns within the detected TFBSs that not only characterize a regulatory mechanism but can also define different mechanisms. Many methods have previously been proposed for this objective, including logistic regression analysis (Krivan and Wasserman, 2001), logic regression (Keles et al., 2004) and classification trees (Jin et al., 2004). However, these methods merely select TFBSs that distinguish different regulation mechanisms, and cannot identify specific TFBS patterns within each regulation mechanism. To address both parts of challenge three, we propose a mixture model-based discriminate analysis. This approach uses a random background TFBS distribution component and a position specific weight matrix for any TFBS pair (adjacent or non-adjacent).
Here, a mixture model was established for both ERα up- and down-regulated targets, forming a discriminate model for prediction of up/down classes. If the corresponding parameters in the two mixture models had small differences and did not improve prediction, they were then ‘annealed’ to yield our optimal parsimonious model.

2 METHODS
2.1 A mixture model for ordered TFBS pairs
Before investigating the order of TFBSs, the actual presence of specific TFBSs was predicted from a supervised TFBS identification method (Jin et al., 2004), as described in detail in Section 3. The distribution of TFBSs on promoters is assumed to be either random or ordered. If all K TFBSs are randomly distributed, their relative frequencies are described by a parameter vector $V_0 = (V_{0,1}, \ldots, V_{0,k}, \ldots, V_{0,K})^T$, where $V_{0,k}$ represents the probability of the presence of TFBS k at any position on any promoter. Therefore, under this random model $V_0$, the probability to observe a TFBS pair $(k_1, k_2)$ is $V_{0,k_1} \cdot V_{0,k_2}$. Conversely, if TFBSs are not random, all ordered TFBS pairs (adjacent or non-adjacent) are modeled by a TFBS position weight matrix (TFPWM). It is a $K \times 2$ matrix, $V_1 = (V_{1,1}^T, V_{1,2}^T)$, where $V_{1,1}$ is the frequency vector for all K TFBSs at position 1, and $V_{1,2}$ is the frequency vector at position 2. If a TFBS pair $(k_1, k_2)$ follows the TFBS pair model $V_1$, it has a probability of $V_{1,1k_1} \cdot V_{1,2k_2}$, where $1 \le k_1, k_2 \le K$. This TFBS pair model strategy can identify non-adjacent TFBS pairs, which can be either true non-adjacent, or true adjacent but disrupted by falsely predicted TFBSs.

The distribution of TFBS pairs is modeled by a mixture of random ($V_0$) and ordered ($V_1$) models. Given that sequence i has $l_i$ TFBSs, it possesses $m_i = l_i (l_i - 1)/2$ TFBS pairs. Denote $X_{ij} = (X_{ij1}, X_{ij2})$ as the j-th TFBS pair from sequence i. If $X_{ij}$ is generated through the ordered (TFPWM) model $V_1$, the indicator random variable $Z_{ij} = 1$; however, if $X_{ij}$ is generated through $V_0$, then $Z_{ij} = 0$.
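As a minimal sketch of the pair construction just described (an illustration only, not the authors' software), the snippet below enumerates the $l_i (l_i - 1)/2$ ordered pairs from one promoter's predicted TFBSs, listed in their order along the sequence, and evaluates one pair under the random model $V_0$ and under the ordered model $V_1$. The function names, the four-label alphabet and every frequency value are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations


def ordered_pairs(sites):
    """All (earlier, later) TFBS pairs, adjacent or not, from one promoter.

    `sites` lists the predicted TFBS labels in their order along the promoter.
    """
    return list(combinations(sites, 2))


def pair_prob_random(pair, v0):
    """Pr(pair | V0): both members drawn from the same background frequencies."""
    k1, k2 = pair
    return v0[k1] * v0[k2]


def pair_prob_ordered(pair, v1_pos1, v1_pos2):
    """Pr(pair | V1): one frequency vector per position of the pair."""
    k1, k2 = pair
    return v1_pos1[k1] * v1_pos2[k2]


# Hypothetical promoter with l_i = 4 predicted sites, hence 4 * 3 / 2 = 6 pairs.
sites = ["DBP", "ER", "MYC", "USF2"]
pairs = ordered_pairs(sites)

# Hypothetical frequencies (illustrative values, not estimates from the paper).
v0 = {"DBP": 0.25, "ER": 0.25, "MYC": 0.25, "USF2": 0.25}
v1_pos1 = {"DBP": 0.7, "ER": 0.1, "MYC": 0.1, "USF2": 0.1}
v1_pos2 = {"DBP": 0.1, "ER": 0.1, "MYC": 0.4, "USF2": 0.4}

p_random = pair_prob_random(("DBP", "MYC"), v0)                  # 0.25 * 0.25
p_ordered = pair_prob_ordered(("DBP", "MYC"), v1_pos1, v1_pos2)  # 0.7 * 0.4
```

With these made-up numbers the pair (DBP, MYC) is more probable under the ordered model than under the background model, which is the kind of contrast the mixture is designed to pick up.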
With $\lambda$ denoted as the proportion of TFBS pairs that follow $V_1$, the log-likelihood function is

$$\log \Pr(X, Z \mid \lambda, V_0, V_1) = \sum_{i=1}^{n} \sum_{j=1}^{m_i} \Big\{ Z_{ij} \big[ \log \Pr(X_{ij} \mid V_1) + \log \lambda \big] + (1 - Z_{ij}) \big[ \log \Pr(X_{ij} \mid V_0) + \log (1 - \lambda) \big] \Big\} \tag{1}$$

We can then implement an E-M algorithm by treating $Z_{ij}$ as missing data. The E-step of the algorithm is specified as

$$Z_{ij}^{(t)} = \frac{f_{ij1}}{f_{ij0} + f_{ij1}}, \quad f_{ij0} = \Pr\big(X_{ij} \mid V_0^{(t)}\big)\big(1 - \lambda^{(t)}\big), \quad f_{ij1} = \Pr\big(X_{ij} \mid V_1^{(t)}\big)\lambda^{(t)}, \tag{2}$$

$$\log \Pr(X_{ij} \mid V_0) = \sum_{u=1}^{2} I^T(X_{iju}) \log V_0, \quad \log \Pr(X_{ij} \mid V_1) = \sum_{u=1}^{2} I^T(X_{iju}) \log V_{1u}, \tag{3}$$

where $I(X_{iju})$ is a $K \times 1$ indicator vector that specifies the TFBS $X_{iju}$, $X_{ij} = (X_{ij1}, X_{ij2})$, and $(\lambda^{(t)}, V_1^{(t)}, V_0^{(t)})$ are current parameter estimates. In the M-step, the parameters in the matrices $V_1$ and $V_0$ are estimated by

$$V_0^{(t+1)} = \frac{c_0}{|c_0|}, \quad V_{1u}^{(t+1)} = \frac{c_u}{|c_u|}, \quad c_u = \sum_{i=1}^{n} \sum_{j=1}^{m_i} Z_{ij}^{(t)} I^T(X_{iju}), \quad u = 1, 2, \tag{4}$$

$$c_0 = \sum_{u=1}^{2} \sum_{i=1}^{n} \sum_{j=1}^{m_i} \big(1 - Z_{ij}^{(t)}\big) I^T(X_{iju}) = \sum_{u=1}^{2} \sum_{i=1}^{n} \sum_{j=1}^{m_i} I^T(X_{iju}) - \sum_{u=1}^{2} c_u,$$

$$\lambda^{(t+1)} = \sum_{ij} Z_{ij}^{(t)} \Big/ \sum_{i} m_i .$$

Our ordered TFBS pair model [Equation (1)], therefore, is similar to a two-compartment mixture (TCM) model (Bailey, 1994). In our model, we assume that any ordered TFBS pair can be present multiple times within any promoter sequence. The unique feature of our model is that our TFBS pair uses any pair of TFBSs, even if they are not adjacent, while the Bailey TCM model merely describes a DNA sequence segment, with the nucleotides completely adjacent to one another. Computationally, the parameter estimations for $V_1$ and $V_0$ are different from other approaches proposed previously (Bailey, 1994).

2.2 Test and Select Significant Ordered TFBS Pairs
One underlying hypothesis was to test whether or not the predicted TFBSs contain ordered TFBS pairs.
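Such a test needs the fitted mixture, so a minimal sketch of the E-M iteration of Section 2.1 [Equations (2)–(4)] may be useful at this point. It is not the authors' implementation. The paper does not state the starting values or the stopping rule, so the choices below are assumptions: position-specific frequencies as the start for $V_1$, pooled frequencies as the start for $V_0$, a fixed number of iterations, and a small pseudo-count `eps` that keeps every frequency positive. The function names and the toy data are hypothetical, and $|c|$ is taken to be the sum of the elements of $c$.

```python
import math


def normalise(counts):
    """Scale a dict of positive counts so that its values sum to one."""
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def fit_pair_mixture(pairs, labels, n_iter=100, lam=0.5, eps=1e-6):
    """E-M sketch for the random (V0) / ordered (V1) TFBS-pair mixture.

    pairs  : non-empty list of (k1, k2) label tuples pooled over all sequences
    labels : the K TFBS labels
    Returns (lam, v0, v11, v12, loglik); loglik is the observed-data
    log-likelihood evaluated in the last E-step.
    """
    # Assumed starting values: per-position frequencies for V1, pooled for V0.
    c1 = {k: eps for k in labels}
    c2 = {k: eps for k in labels}
    for k1, k2 in pairs:
        c1[k1] += 1.0
        c2[k2] += 1.0
    v11, v12 = normalise(c1), normalise(c2)
    v0 = normalise({k: c1[k] + c2[k] for k in labels})

    loglik = float("-inf")
    for _ in range(n_iter):
        # E-step, Equation (2): posterior probability that a pair is ordered.
        z = []
        loglik = 0.0
        for k1, k2 in pairs:
            f0 = v0[k1] * v0[k2] * (1.0 - lam)
            f1 = v11[k1] * v12[k2] * lam
            z.append(f1 / (f0 + f1))
            loglik += math.log(f0 + f1)

        # M-step, Equation (4): weighted label counts, then normalisation.
        c0 = {k: eps for k in labels}
        c1 = {k: eps for k in labels}
        c2 = {k: eps for k in labels}
        for (k1, k2), z_ij in zip(pairs, z):
            c1[k1] += z_ij
            c2[k2] += z_ij
            c0[k1] += 1.0 - z_ij
            c0[k2] += 1.0 - z_ij
        v0, v11, v12 = normalise(c0), normalise(c1), normalise(c2)
        lam = sum(z) / len(pairs)
    return lam, v0, v11, v12, loglik


# Hypothetical toy data: DBP tends to precede MYC and USF2.
labels = ["DBP", "MYC", "USF2", "ER"]
pairs = ([("DBP", "MYC")] * 20 + [("DBP", "USF2")] * 20 + [("ER", "ER")] * 6
         + [("MYC", "ER")] * 4 + [("ER", "DBP")] * 4 + [("USF2", "MYC")] * 2)
lam, v0, v11, v12, loglik = fit_pair_mixture(pairs, labels)
```

The returned log-likelihood is the quantity that enters the numerator of the likelihood ratio statistic; the denominator would come from a fit of the background model alone.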
A likelihood ratio test is implemented by

$$y = 2 \log \frac{\Pr(X \mid V_1, V_0, \lambda)}{\Pr(X \mid V_0)}, \tag{5}$$

$$\Pr(X \mid \lambda, V_0, V_1) = \prod_{i=1}^{n} \prod_{j=1}^{m_i} \big[ \lambda \Pr(X_{ij} \mid V_1) + (1 - \lambda) \Pr(X_{ij} \mid V_0) \big], \quad \Pr(X \mid V_0) = \prod_{i=1}^{n} \prod_{j=1}^{m_i} \Pr(X_{ij} \mid V_0).$$

The P-value for this type of likelihood ratio test was previously discussed by Bailey and Gribskov (1998), who proposed an approximated test. Although this approximation test worked well in a TCM model, it was unsuccessful in our ordered TFBS pair model, in which TFBS pairs were not constrained to be adjacent to one another. The difficulty of the asymptotic likelihood ratio test for mixture models is well documented by McLachlan and Peel (2000), who described that theoretical difficulties lie in the parameter identification problem, when data follow the model $V_0$. Before these theoretical issues can be solved, the most reliable likelihood ratio test is through a bootstrap analysis. To establish an empirical distribution of the test statistic, multiple simulated datasets are generated under the background model $V_0$, in which TFBSs are randomly distributed along the sequences. Using that simulation, 10 000 replication datasets are sufficient to generate the empirical distribution under the null model $V_0$.

To predict an ordered TFBS pair as significant, its score and P-value must necessarily be calculated. We proposed a score based on log-odds between its TFBS-pair model and background model [Equation (6)], with a score significance calculated from the tail percentile from an empirical distribution, based on numerous TFBS pairs generated using the background model $V_0$. The significance of this score is shown below [Equation (7)]. A total of 1000 TFBS pairs were generated under the null model $V_0$, and denote x as a TFBS pair.

$$s(x) = \log \frac{\Pr(x \mid V_1)}{\Pr(x \mid V_0)}, \tag{6}$$

$$P[s(x)] = \Pr[s(x) \ge s \mid V_0], \quad \tilde{P}[s(x)] = \frac{1}{n} \sum_{i=1,\ldots,n} 1_{s(x_i) \ge s}, \quad x_i \sim \Pr(x \mid V_0). \tag{7}$$

2.3 Discriminate analysis
To perform discriminate analyses, we denote $w_1$ as a $(3K + 1) \times 1$ parameter vector that contains all parameters in $(\lambda, V_1, V_0)$ from the TFBS pair model for the ERα up-regulated promoter sequences. Denoting $w_2$ as the set of parameters from ERα down-regulated promoter sequences, Equation (8) describes a discriminate function able to predict the ERα regulation class based on a set of TFBS pairs X from a new sequence.

$$d(X \mid w_1, w_2) = \mathrm{sign}\left\{ \log \frac{\Pr(X \mid w_1)}{\Pr(X \mid w_2)} \right\}, \quad \Pr(X \mid w) = \prod_{j=1}^{m} \big[ \lambda \Pr(X_j \mid V_1) + (1 - \lambda) \Pr(X_j \mid V_0) \big], \tag{8}$$

where m denotes the number of TFBS pairs, and $X_j$ denotes the j-th pair. If $d(X \mid w_1, w_2) = 1$, the discriminate function classifies the new sequence to be an up-regulated target; if $d(X \mid w_1, w_2) = -1$, the discriminate function classifies it as a down-regulated target. The prediction accuracy is thus evaluated by the agreement of the predicted value by $d(\cdot)$ and the true class of X.

In Equation (8), some corresponding parameters between the two models $w_1$ and $w_2$ may be very close to each other, while others may be very different. However, the discriminate function [Equation (8)] does not provide much information about which TFBSs or TFBS pairs are more important for predicting up/down classes. Therefore, we developed an ‘annealing’ procedure, which anneals two corresponding parameters to be the same, provided that they are numerically close and do not improve the prediction of the discriminate function. Consequently, the TFBSs and TFBS pairs that show different frequencies in either the background TFBS model, $V_0$, or the TFBS pair model, $V_1$, would be responsible for the best regulation class prediction.
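The discriminate function of Equation (8) and the soft-threshold rule that Equation (9) formalises can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the dictionary layout of a model `w` (keys `lam`, `v0`, `v11`, `v12`), the function names and all numerical values are hypothetical, and `anneal` takes the scale `sd` as an argument rather than computing it, so that the sketch does not commit to a particular formula for that quantity.

```python
import math


def log_lik(pairs, w):
    """log Pr(X | w) for one sequence's TFBS pairs under one mixture model."""
    total = 0.0
    for k1, k2 in pairs:
        ordered = w["lam"] * w["v11"][k1] * w["v12"][k2]
        background = (1.0 - w["lam"]) * w["v0"][k1] * w["v0"][k2]
        total += math.log(ordered + background)
    return total


def discriminate(pairs, w_up, w_down):
    """Sign of the log-likelihood ratio: +1 up-regulated, -1 down-regulated.

    Returns 0 on an exact tie (for example, when `pairs` is empty).
    """
    diff = log_lik(pairs, w_up) - log_lik(pairs, w_down)
    return (diff > 0) - (diff < 0)


def anneal(w1, w2, sd, delta):
    """Soft-threshold two corresponding parameters toward their mean.

    Standardised differences no larger than `delta` are set to zero, so the
    two parameters become identical; larger ones are shrunk by `delta`.
    `sd` must be positive.
    """
    mean = (w1 + w2) / 2.0
    annealed = []
    for w in (w1, w2):
        d = (w - mean) / sd
        shrunk = math.copysign(max(abs(d) - delta, 0.0), d)
        annealed.append(mean + shrunk * sd)
    return tuple(annealed)


# Hypothetical three-label models: MYC follows DBP in the 'up' model,
# ETS follows DBP in the 'down' model.
uniform = {"DBP": 1 / 3, "MYC": 1 / 3, "ETS": 1 / 3}
first = {"DBP": 0.8, "MYC": 0.1, "ETS": 0.1}
w_up = {"lam": 0.5, "v0": uniform, "v11": first,
        "v12": {"DBP": 0.1, "MYC": 0.8, "ETS": 0.1}}
w_down = {"lam": 0.5, "v0": uniform, "v11": first,
          "v12": {"DBP": 0.1, "MYC": 0.1, "ETS": 0.8}}

call_up = discriminate([("DBP", "MYC")], w_up, w_down)
call_down = discriminate([("DBP", "ETS")], w_up, w_down)
close_pair = anneal(0.30, 0.32, sd=0.1, delta=1.0)   # merged to a common value
distant_pair = anneal(0.10, 0.70, sd=0.1, delta=1.0)  # shrunk but kept apart
```

In this toy setting the two close parameters are annealed to one value and stop contributing to the log-likelihood ratio, while the distant pair still separates the classes.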
The annealing is controlled by a shrinkage parameter, D, as follows:

$$d(X \mid w'_1, w'_2) = \mathrm{sign}\left\{ \log \frac{\Pr(X \mid w'_1)}{\Pr(X \mid w'_2)} \right\},$$

$$w'_{hg} = \bar{w}_{hg} + d'_{hg} \cdot sd_{hg}, \quad d'_{hg} = \mathrm{sign}(d_{hg}) \big( |d_{hg}| - D \big)_+, \tag{9}$$

$$d_{hg} = \frac{\hat{w}_{hg} - \bar{w}_{hg}}{sd_{hg}}, \quad sd_{hg}^2 = \frac{\sum_{h=1,2} \hat{w}_{hg} \cdot (1 - \hat{w}_{hg})}{2}, \quad \bar{w}_{hg} = \frac{\hat{w}_{1g} + \hat{w}_{2g}}{2},$$

$$\big( |d_{hg}| - D \big)_+ = \begin{cases} |d_{hg}| - D, & |d_{hg}| > D, \\ 0, & |d_{hg}| \le D, \end{cases} \quad h = 1, 2; \; g = 1, 2, \ldots, q.$$

To select an optimal D which gives the best prediction accuracy, a 3-fold cross-validation is implemented. Parameters $(w_1, w_2)$ are estimated from the training sample, annealed by D in the testing sample, and prediction accuracies then calculated for both training and testing elements. In practice, 20 equally spaced numbers between 0 and 2.0 are tested for D in the cross-validation. The largest D providing the best prediction in the validation sample is then used to anneal parameters from the two mixture models. The major advantage of this method is its flexibility in selecting highly correlated predictors. While these correlated predictors are not all selected in regression model-based variable selection procedures, all of these adequately predict outcome (or class). This point is further addressed in our simulation studies (Section 3).

3 IMPLEMENTATION
3.1 Data analysis
3.1.1 ERα target sequences
ERα targets were previously identified by ‘ChIP-on-chip’, a microarray technique in which DNA sequences bound to specific proteins are immunoprecipitated and hybridized to a microarray (Cheng et al., 2006). Following estrogen treatment of MCF7 cells, up- or down-regulated ERα targets were identified using antibodies against acetylated or methylated histone H3 lysine 9 (H3-K9) (Roh et al., 2005). Loci identified as ERα targets were classified as activated or repressed, based on their ratios of acetylation/methylation (Ac/Me) at H3-K9. Using those criteria, an Ac/Me > 1.5, defined as up-regulation, was observed for 42 target genes, while 35 loci were classified as down-regulated, using Ac/Me < 0.67.
Every target probe sequence was extended by 2000 bp upstream and 1000 bp downstream.

3.1.2 TFBS module identifications
In the supervised TFBS identification, we used MATCH and PSSMs in the TRANSFAC database (http://www.gene-regulation.com/cgi-bin/pub/databases/transfac/search.cgi) to scan orthologous pairs of human and mouse promoters for the following TFs: AP1, c-ETS1-68, DBP, ERα, GATA3, MYC, MYC/MAX, MYOGENIN, SMAD3 and USF2, using the ‘minFN_good71.prf’ profile (profile of cut-off values with minimum number of false-negative predictions) of MATCH. If the resulting sequence similarity of the scanned TFBS was ≥60%, using ClustalW sequence alignment (Thompson et al., 1994), a predicted binding site was considered as conserved. Among the 42 up-regulated targets, 39 (Supplementary Table S1) possessed predicted ERα-binding sites, while the other three were filtered out of the data analysis. Among the 35 down-regulated targets, 27 (Supplementary Table S1) possessed predicted ERα-binding sites, and were then subjected to the data analysis described below. All the other 11 targets (3 up- and 8 down-regulated) were dropped from the analysis. These predicted TFBSs, and their relative positions, in the 39 up- and 27 down-regulated promoter sequences are plotted in Figure 1 and Supplementary information (file 2).

Fig. 1. TFBS distributions in up- (a) and down- (b) regulated promoter targets. c-ETS1-68 is denoted by blue dots; DBP is labeled with gold dots; ERα is labeled with red dots; MYC and MYC/MAX are labeled with black dots; MYOGENIN is labeled with green dots; USF2 is labeled with brown dots and GATA3 is labeled with gray dots. The other TFBSs for AP1 and SMAD3 are labeled with open circles. The x-axis represents the relative positions (not physical positions) among TFBSs, and they are not equally spaced.
3.1.3 TFBS pattern identification for up-regulated ERα targets
Figure 2a is the mixture model for TFBS pairs among up-regulated promoter targets. $V_0$ denotes the background model, and $V_1$ denotes the ordered pair model for a TFBS pair $X_{ij} = (X_{ij1}, X_{ij2})$. The color intensity represents the relative frequencies of the TFBSs; the darker the color, the higher the frequencies. The relative frequencies of c-ETS1-68 and ERα for a TFBS pair $X_{ij}$ are very comparable between $V_0$ and $V_1$. This suggests that the TFBS distribution follows the random model $V_0$. In contrast, the relative frequencies of the other eight TFs in $V_1$ are different to various degrees between position 1 and 2 of a TFBS pair, and these all differ from their frequencies in the random model $V_0$.

Following determination of the TFBS frequencies, a log-likelihood ratio test [Equation (5)] was conducted, which significantly (P-value = 0.0003) rejected the null hypothesis, demonstrating that the order of the 10 TFBSs was not purely random. Based on our score and P-value calculation schemes [Equations (6) and (7)], the following TFBS combinations had P-values < 0.01: (DBP, MYC), (DBP, MYC/MAX), (DBP, USF2) and (DBP, MYOGENIN). Based on Figure 2a, it is obvious that DBP has a higher frequency in position 1 than position 2 of a TFBS pair, and the TFBSs MYC, MYC/MAX, USF2 and MYOGENIN all have lower frequencies in position 1 than position 2. From Figure 1a, in 33 out of 39 up-regulated sequences, a common pattern found was TFBS combinations starting with DBP and ending with a combination of USF2, MYC/MAX, MYOGENIN and (or) MYC. ERα and other TFBSs were found to occur before, between or after these specifically ordered TFBS pairs.

Fig. 2. Heatmap (a) is the mixture model for TFBS pairs among up-regulated promoter targets. $V_0$ denotes the background model and $V_1$ denotes the ordered pair model for a TFBS pair $(X_{ij1}, X_{ij2})$. The higher the probability, the darker the color; the smaller the probability, the lighter the color. Heatmap (b) is the mixture model for TFBS pairs among down-regulated promoter targets. Heatmaps (c) and (d) display the mixture models for up- and down-regulated targets respectively, after annealing by the discriminate analysis. It appears in heatmaps (c) and (d) that both target sets have comparable heatmaps, except that up-regulated targets have more MYC and MYC/MAX, while down-regulated targets possess more c-ETS1-68 sites.

3.1.4 TFBS pattern identification for down-regulated ERα targets
The $V_1$ and $V_0$ models for down-regulated ERα targets are shown in Figure 2b. The relative frequencies of TFBSs for ERα, MYC, MYC/MAX and SMAD3 in $V_0$ and $V_1$ are very comparable for a TFBS pair. This suggests that the distribution for these TFBSs follows the random model $V_0$. Conversely, the other six TFBSs’ relative frequencies of a TFBS pair are different to various degrees, and these all differ from their frequencies in the random model $V_0$. Similar to the result for up-regulated targets, the log-likelihood ratio test [Equation (5)] strongly (P-value = 0.0007) rejected the null hypothesis, demonstrating that the arrangements of the 10 TFBSs were not purely random. Based on our score and P-value calculation schemes [Equations (6) and (7)], the following TFBS combinations had a P-value < 0.01: (DBP, c-ETS1-68), (DBP, USF2) and (DBP, MYOGENIN). In Figure 2b, it is obvious that DBP has a higher frequency in position 1 than position 2 of a TFBS pair; and all (USF2, c-ETS1-68, MYOGENIN) have lower frequencies in position 1 than position 2. From Figure 1b, in 26/27 down-regulated sequences, common patterns found were combinations of TFBSs starting with DBP and ending with a combination of USF2, MYOGENIN and c-ETS1-68. Similar to up-regulated targets, the ERα-binding sites (and the other TFBSs) could occur before, between or after these specifically ordered TFBS pairs.
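The pair scores and P-values reported in Sections 3.1.3 and 3.1.4 follow Equations (6) and (7). A minimal sketch of that calculation is given here for orientation. It is not the authors' code, and the frequencies are hypothetical values, not the fitted models of Figure 2. The assumptions are that null pairs are drawn by sampling both members independently from $V_0$, that the upper tail (score at least as large as the observed one) is counted, and that all frequencies are strictly positive; the function names and the `seed` argument are illustrative.

```python
import math
import random


def score(pair, v0, v11, v12):
    """Log-odds of the ordered-pair model against the background, Equation (6)."""
    k1, k2 = pair
    return math.log((v11[k1] * v12[k2]) / (v0[k1] * v0[k2]))


def empirical_p_value(pair, v0, v11, v12, n=1000, seed=0):
    """Upper-tail fraction of n null pairs scoring at least as high, Equation (7)."""
    rng = random.Random(seed)
    labels = list(v0)
    weights = [v0[k] for k in labels]
    s_obs = score(pair, v0, v11, v12)
    hits = 0
    for _ in range(n):
        # Both members of a null pair are drawn independently from V0.
        null_pair = tuple(rng.choices(labels, weights=weights, k=2))
        if score(null_pair, v0, v11, v12) >= s_obs:
            hits += 1
    return hits / n


# Hypothetical frequencies (illustrative only).
v0 = {"DBP": 0.25, "ER": 0.25, "MYC": 0.25, "USF2": 0.25}
v11 = {"DBP": 0.7, "ER": 0.1, "MYC": 0.1, "USF2": 0.1}
v12 = {"DBP": 0.1, "ER": 0.1, "MYC": 0.4, "USF2": 0.4}

s_dbp_myc = score(("DBP", "MYC"), v0, v11, v12)
p_dbp_myc = empirical_p_value(("DBP", "MYC"), v0, v11, v12)
p_er_dbp = empirical_p_value(("ER", "DBP"), v0, v11, v12)
```

Because both calls use the same seed, they are compared against the same 1000 null pairs, so the higher-scoring pair (DBP, MYC) cannot receive a larger empirical P-value than (ER, DBP).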
3.1.5 Mixture model-based discriminate analysis
Comparing TFBS patterns within the ERα up- and down-regulated targets shown in Figure 2a and 2b, the frequencies of MYC, MYC/MAX and c-ETS1-68 were highly different in both $V_0$ and $V_1$, while the other TFBS frequencies exhibited much smaller differences. We therefore expected those three TFBSs to possess much higher predictive power than the others. Using a 3-fold cross-validation, the mixture model-based discriminate analysis achieved 87% predictive power in the validation dataset. When the shrinkage parameter D was set to 1.0, three TFBSs (MYC, MYC/MAX and c-ETS1-68), when paired with DBP, demonstrated different distribution probabilities between the two target sets. Specifically, MYC and MYC/MAX were up-regulation specific (neither was predicted in the down-regulated targets), while c-ETS1-68 was specific for down-regulation (not predicted in the up-regulated targets). In Figure 2c and 2d, based on the optimal D selected from cross-validation, our final discriminate analysis model was constructed by annealing the two mixture models in the full dataset. It is noticeable that only those three TFBSs (MYC, MYC/MAX and c-ETS1-68) showed differential frequencies between the two target classes in both $V_0$ and $V_1$; all the other TFBSs had the same distributions.

3.1.6 Alternative discriminate analysis
Using TFBS presence/absence as a predictor, logistic regression only selected MYC and MYC/MAX to predict the up/down target classes with 87% prediction accuracy. This analysis was conducted with the R function glm(), using stepwise forward variable selection with a 3-fold cross-validation. Using the same binary TFBS predictors, we constructed a classification tree model. Here, the ‘Gini’ algorithm was selected as the splitting method for growing the tree, with 3-fold cross-validation then used to obtain the minimal tree achieving optimal prediction (Breiman et al., 1984).
Classification tree analysis was performed with the R function tree(), which selected a tree that contained MYC, MYC/MAX and c-ETS1-68, with a prediction accuracy of 87%. Logic regression (Keles et al., 2004; Ruczinski et al., 2003) was then implemented to select binary predictors and predict up/down-regulation classes. This analysis was conducted in the R package LogicReg. A 3-fold cross-validation, used to select the optimal predictor subset that achieved the best prediction, selected a logic model that contained MYC, MYC/MAX and c-ETS1-68, achieving a prediction accuracy of 87%.

3.2 Statistical simulation and validation studies
3.2.1 Promoter sequence simulation
In our analysis, TFBS predictions were made by comparing binding sites between human and mouse sequences. Among the promoter sequences that we investigated in this paper, 70% were conserved, a conservation rate very close to the rate reported for comparative analysis of the mouse genome (Consortium, 2002). In our simulated studies, a 70% overlap of human and mouse sequences was simulated. Consensus sequences for 10 TFBSs were generated, using TRANSFAC position weight matrices, and inserted into both the human and the mouse simulated sequences. Additional simulation detail can be found in our Supplementary information (S3.doc). In the following simulation studies, the TFBSs MYC, c-ETS1-68, DBP, MYOGENIN, USF2, MYC/MAX, GATA3, AP1, SMAD3 and ERα are denoted by TF1–TF10, respectively.

Fig. 3. Probabilities in selecting ordered motif pairs in simulation studies. The x-axis represents the number of false positive motif predictions with respect to the sequence length; the y-axis represents the probability of selecting pre-specified ordered motif pairs. Different lines represent various proportions of sequences that contain the ordered motif pairs: line ‘a’ denotes 100% of sequences containing the motif pairs, line ‘b’, 80%; line ‘c’, 50%; line ‘d’, 30% and line ‘e’, 0%.
3.2.2 Ordered TFBS pair detection performance in mixture model
To investigate whether our mixture model could select ordered TFBS pairs, we performed a simulation study. The sensitivity of the model to the initial TFBS selection is a very interesting issue. Based on our analysis of ERα up-regulated promoter targets (Fig. 1), TF3 binding sites usually occur before TF5, TF4, or TF1. In our simulated sequences, these three ordered pairs were inserted into the sequence using a pre-specified order, while the other six TFBSs were randomly inserted. The matrix similarity score (mSS) proposed by MATCH (2003) was then employed to predict TFBSs. The sequence similarity threshold of the predicted TFBSs between simulated human and mouse promoters was defined as 60%. The cut-offs for the predicted TFBS were selected based on the number of false positives in a random sequence with n bps, and the false positive thresholds 0.25, 0.5, 1, 2 and 3 were evaluated. In addition, the proportion of simulated sequences that contain those three ordered TFBS pairs was varied among (0, 30, 50, 80 and 100%). All the other sequences, however, possessed only randomly distributed TFBSs. In our mixture model-based TFBS pair selection, the P-value threshold for the TFBS pair was set at 5%. The P-value calculation is illustrated in our Supplementary information (S3.doc).

Figure 3 displays the average probability (power) of selecting three ordered TFBS pairs. First, the shorter the sequences, the higher the probability the model can detect sequentially ordered TFBSs. Second, a set of sequences possessing a smaller proportion of ordered TFBS pairs has a lower probability for detecting ordered TFBS pairs, as compared with a sequence set containing a higher proportion (of ordered TFBS pairs).
Third, in our simulated sequence length range (500–4000 bp), if 0.5 to 1 false positive values are allowed in predicting TFBSs, our probability to detect three ordered TFBS pairs appears to be the highest. Although a more stringent false positive control, such as 0.25, does not greatly reduce the probability, a less stringent false positive control, such as 2, will lower the likelihood. Finally, in our ERα promoter target TFBS analysis, false positives in TFBS prediction are confined within the range of 0.5–1.0.

Fig. 4. Mixture model discriminate analysis simulation 1. [Plot not reproduced; panels: Prediction Accuracy, TF1 Selection Probability, TF2 Selection Probability and TF1&2 Selection Probability, each against False Positives.] In this simulation study, sequence set one’s ordered motif pairs possess TF1, and sequence set two’s ordered motif pairs have TF2. Both sets possess 80% sequences with ordered motif pairs. The x-axis represents the number of false positive motif predictions with respect to 3000 bp; in the upper left plot, the y-axis denotes the prediction accuracy for all four methods; in the other three plots, the y-axis represents the TF selection probabilities for either TF1 or TF2. Each line represents a different TFBS discovery method: line ‘a’, mixture model; line ‘b’, logistic regression; line ‘c’, logic regression and line ‘d’, classification tree.
Consequently, our method should retain the highest power to detect ordered TFBS pairs.

3.2.3 Ordered TFBS pair mixture model-based discriminate analysis performance
One simulation study was conducted to compare our mixture model-based discriminate analysis with logistic, logic and classification tree analyses. In this study, two sets of sequences were simulated. Among the first sequence set, a subset contained the ordered TFBS pairs (TF3, TF1) and (TF3, TF4). The number of sequences containing these pairs followed a binomial distribution with parameter 0.80. TF2 was excluded and the other six TFBSs did not display ordered appearances. The second sequence set was nearly the same as the first, except that its sequences possessed the order (TF3, TF2), rather than (TF3, TF1). Here, TF1 was excluded and the number of sequences possessing these pairs independently followed a binomial distribution with parameter 0.80. We further expected that either (TF3, TF1) or (TF3, TF2) could accurately distinguish the two sequence classes. Here, every sequence was 3000 bp in length, with 30 sequences simulated for each set, and a total of 1000 sets analyzed. The prediction accuracy was based on 3-fold cross-validation.

In our simulation study, our analyses from the four TFBS discovery methods started from the ideal situation that every TFBS prediction was 100% accurate (i.e. 100% sensitivity and specificity, with no false positive prediction). We then proceeded to analyze simulated data with falsely predicted TFBSs. From Figure 4, first, with every TFBS accurately predicted, and no false positives, all four methods have comparable prediction accuracies. However, if false positive TFBS prediction exists, the mixture model-based discriminate analysis has the highest prediction accuracy, remaining stable as the number of false positives increases.
Second, both TF1 and TF2 are selected for predicting the two classes in the mixture model with a probability >75%; this probability was not sensitive to the controlled number of false positives. Third, logistic regression usually selects either TF1 or TF2, but not both. Finally, logic regression and classification tree methods selected both TF1 and TF2 with probability >75% when the false positive number was controlled at a low level (0.5). However, the ability of these methods to select TF1 and TF2 was reduced to 24% with the number of false positives set at two.

Logistic regression, logic regression and classification tree all use TFBS presence or absence to predict class. A large false positive threshold can lead to TFBS presence in 100% of sequences, so presence loses its power to predict the class. In contrast, the mixture model uses the random or ordered frequencies of a TFBS to predict the class, and it is not sensitive to the false positive threshold. The same reasoning applies to TF selection: when the false positive threshold goes up and the presence of a true TFBS loses its power to predict the class, all three of these methods fail. Only the mixture model-based approach is insensitive to the false positive threshold and consistently selects the true TFBS.

CONCLUSIONS

In this study, we demonstrate, for the first time, a method for predicting the presence and order of TFBSs in the proximity of estrogen response elements, within ERa-responsive loci. Two significant aspects emerge from this analysis. The first is the biological significance of our findings: our proposed TFBS pattern identification algorithm clearly suggests that TFBSs are not randomly distributed within ERa target promoters (P-value < 0.001), but form distinctive patterns within target promoters both up- and down-regulated by ERa. The up-regulated targets contain ordered TFBS pairs such as (DBP, MYC), (DBP, MYC/MAX), (DBP, USF2) and (DBP, MYOGENIN).
The down-regulated targets contain TFBS pairs such as (DBP, c-ETS1-68), (DBP, USF2) and (DBP, MYOGENIN). Specifically, the TFBS combinations (DBP, USF2) and (DBP, MYOGENIN) appear in both up- and down-regulated ERa targets. These two transcription factors may act in concert with ERa to mediate target gene regulation, and their presence is not specific for either up- or down-regulation. MYOGENIN also appears in the work by Jin et al. (2004), which identified direct ERa targets. For promoters up-regulated by ERa, the TFBS pairs (DBP, MYC) and (DBP, MYC/MAX) were found to be specific; further work by our group also demonstrated a previously unknown co-regulatory role of c-MYC within a subset of ERa-responsive genes (Cheng et al., 2006). Analogously, ETS factors are known to act as either positive or negative regulators of gene expression (Seth and Watson, 2005), and we found the (DBP, c-ETS1-68) TFBS pair to be down-regulation-specific. In support of our findings, c-ETS1-68 was previously shown to function as either a transcriptional repressor or activator, depending on the promoter context (Goldberg et al., 1994). In hormone-responsive breast cancer cells, the activity of the ERa signaling pathway is inversely correlated with the activities of growth factor signaling pathways (Schiff and Osborne, 2005). Our finding that the c-ETS1-68 binding site is a negative regulatory element in ERa-target genes suggests that ETS proteins might serve as points of molecular 'crosstalk' between estrogen and growth factor signaling pathways in hormone-dependent breast cancer cells. Overall, our mixture model-based discriminate analysis demonstrates prediction accuracy as high as 87% for the above-described patterns.

Second, our proposed mixture model-based discriminate analysis can perform both TFBS pattern recognition and target class prediction.
No currently available method offers this combination, as logistic regression, logic regression and classification tree algorithms are limited to class prediction analysis. Based on extensive statistical simulation studies, the performance of our method is not sensitive to false positives in the initial TFBS prediction, while the other methods do not perform adequately when the number of false positive TFBS predictions is high. Furthermore, our mixture model-based discriminate analysis can select, with relatively high probability, all TFBSs that are differentially distributed between two classes, even when false positive TFBS predictions are present at a high level. In contrast, logistic regression can detect only a small subset of TFBSs. While classification tree and logic regression methods can select, with high probability, the full set of TFBSs that are differentially distributed between two classes, this probability quickly diminishes with increasing false positive TFBS prediction.

In summary, our mixture model-based discriminate method holds great potential for TFBS pattern identification and classification in the study of estrogen-responsive promoter elements. Elucidation of regulatory proteins within such elements, and their association, will allow greater insight into biological processes regulated by estrogen, including sexual development, menopause and hormone-dependent cancers. This approach can be readily extended to the targets of any TF, to seek the TFBS patterns of that TF and its many co-TFs.

ACKNOWLEDGEMENTS

This research was supported by National Cancer Institute grants CA085289 (K.P.N.) and CA113001 (T.H-M.H.).

Conflict of Interest: none declared.

REFERENCES

Bailey,T.L. and Gribskov,M. (1998) Combining evidence using P-value: application to sequence homology search. Bioinformatics, 14, 48–54.
Bailey,T.L. and Noble,W.S. (2003) Searching for statistically significant regulatory modules. Bioinformatics, 19, ii16–ii25.
Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymer. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, Menlo Park, CA, pp. 28–36.
Berman,B.P. et al. (2002) Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc. Natl Acad. Sci. USA, 99, 757–762.
Bland,K.I. et al. (1995) Oncogene protein co-expression. Value of Ha-ras, c-myc, c-fos, and p53 as prognostic discriminants for breast carcinoma. Ann. Surg., 221, 708–718.
Bussemaker,H.J. et al. (2001) Regulatory element detection using correlation with expression. Nat. Genet., 27, 167–174.
Breiman,L. et al. (1984) Classification and Regression Trees. New York, NY, Chapman & Hall.
Cheng,A.S.L. et al. (2006) Combinatorial analysis of transcription factor partners reveals recruitment of c-MYC to estrogen receptor a-responsive promoters. Mol. Cell, 21, 393–404.
Consortium,M.G.S. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 521–562.
Crowley,E.M. et al. (1997) A statistical model for locating regulatory regions in genomic DNA. J. Mol. Biol., 268, 8–14.
Firth,M.C. et al. (2001) Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics, 17, 878–889.
Firth,M.C. et al. (2002) Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res., 30, 3214–3224.
Frech,K. et al. (1997) A novel method to develop highly specific models for regulatory units detects a new LTR in GenBank which contains a functional promoter. J. Mol. Biol., 270, 674–687.
Geserick,C. et al. (2005) The role of DNA response elements as allosteric modulators of steroid receptor function. Mol. Cell Endocrinol., 236, 1–7.
Goldberg,Y. et al. (1994) Repression of AP-1-stimulated transcription by c-Ets-1. J. Biol. Chem., 269, 16566–16573.
Gupta,M. and Liu,J.S. (2003) Discovery of conserved sequence patterns using a stochastic dictionary model. J. Am. Stat. Assoc., 98, 55–66.
Jin,V.X. et al. (2004) Identifying estrogen receptor alpha target genes using integrated computational genomics and chromatin immunoprecipitation microarray. Nucleic Acids Res., 32, 6627–6635.
Keles,S. et al. (2004) Regulatory motif finding by logic regression. Bioinformatics, 20, 2799–2811.
Kondrakhin,Y.V. et al. (1995) Eukaryotic promoter recognition by binding sites for transcription factors. Comp. Appl. Biosci., 11, 477–488.
Krivan,W. and Wasserman,W.W. (2001) A predictive model for regulatory sequences directing liver-specific transcription. Genome Res., 11, 1559–1566.
Lacroix,M. and Leclercq,G. (2004) About GATA3, HNF3A, and XBP1, three genes co-expressed with the oestrogen receptor-alpha gene (ESR1) in breast cancer. Mol. Cell Endocrinol., 219, 1–7.
Lin,C.Y. et al. (2004) Discovery of estrogen receptor a target genes and response elements in breast tumor cells. Genome Biol., 5, R66.
Lincoln,D.W., 2nd and Bove,K. (2005) The transcription factor Ets-1 in breast cancer. Front Biosci., 10, 506–511.
Lincoln,D.W., 2nd et al. (2003) Estrogen-induced Ets-1 promotes capillary formation in an in vitro tumor angiogenesis model. Breast Cancer Res. Treat., 78, 167–178.
Liu,J.S. et al. (1995) J. Am. Stat. Assoc., 90, 1156–1170.
Liu,X. et al. (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput., 127–138.
McDonnell,D.P. and Norris,J.D. (2002) Connections and regulation of the human estrogen receptor. Science, 296, 1642–1644.
McLachlan,G. and Peel,D. (2000) Finite Mixture Model. New York, NY, John Wiley & Sons.
Myers,E. et al. (2005) Associations and interactions between Ets-1 and Ets-2 and coregulatory proteins, SRC-1, AIB1, and NCoR in breast cancer. Clin. Cancer Res., 11, 2111–2122.
Osborne,C.K. et al. (2005) Crosstalk between estrogen receptor and growth factor receptor pathways as a cause for endocrine therapy resistance in breast cancer. Clin. Cancer Res., 11, 865s–870s.
Pelengaris,S. et al. (2002) c-MYC: more than just a matter of life and death. Nat. Rev. Cancer, 2, 764–776.
Prestridge,D. (1995) Predicting Pol II promoter sequences using transcription factor binding sites. J. Mol. Biol., 249, 923–932.
Ripley,B.D. (1996) Pattern recognition and neural network. Cambridge, UK, Cambridge University Press.
Rodrik,V. et al. (2005) Survival signals generated by estrogen and phospholipase D in MCF-7 breast cancer cells are dependent on Myc. Mol. Cell Biol., 25, 7917–7925.
Roh,T.Y. et al. (2005) Active chromatin domains are defined by acetylation islands revealed by genome-wide mapping. Genes Dev., 19, 542–552.
Roth,F.R. et al. (1998) Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol., 16, 939–945.
Ruczinski,I. et al. (2003) Logic Regression. J. Comput. Graphical Stat., 12, 475–511.
Rushton,J.J. et al. (2003) Distinct changes in gene expression induced by A-Myb, B-Myb and c-Myb proteins. Oncogene, 22, 308–313.
Schiff,R. and Osborne,C.K. (2005) Endocrinology and hormone therapy in breast cancer: new insight into estrogen receptor-alpha function and its implication for endocrine therapy resistance in breast cancer. Breast Cancer Res., 7, 205–211.
Seth,A. and Watson,D.K. (2005) ETS transcription factors and their emerging roles in human cancer. Eur. J. Cancer, 41, 2462–2478.
Sinha,S. et al. (2003) A probabilistic method to detect regulatory modules. Bioinformatics, 19, 1–10.
Thompson,J.D. et al. (1994) Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.