Master Thesis

Discovery of co-regulated gene clusters by combining known transcription factor binding motifs and gene expression profiles

Author: Maarten Clements
Email: [email protected]
Student number: 1006398
Thesis supervisors: Prof.dr.ir. M.J.T. Reinders, Dr.ir. E.P. van Someren, Ir. T.A. Knijnenburg
Thesis Committee: Ir. T.A. Knijnenburg, Prof.dr.ir. M.J.T. Reinders, Dr.ir. D. de Ridder, Dr.ir. E.P. van Someren, Ir. M.H. van Vliet, Dr. C. Witteveen
Date: April 7, 2006
Information and Communication Theory Group

Incorporating motifs in gene clustering

Discovery of co-regulated gene clusters by combining known transcription factor binding motifs and gene expression profiles

Maarten Clements
Information and Communication Theory Group, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, 2600 GA Delft, The Netherlands.

ABSTRACT
Motivation: The standard approach for finding genes that are involved in the same biological process is to investigate the change in gene expression under certain conditions and cluster the genes according to this information. However, due to the limited number of measured conditions, genes clustered together are co-expressed, but not necessarily co-regulated. In this paper, we provide a method that combines known transcription factor binding site information with gene expression into one clustering scheme. In this way we aim to find gene clusters that are co-regulated under certain growth limitation conditions, through common motifs.
Results: We have shown that the integration of known motifs into the gene clustering scheme can improve the GO enrichment within the clusters. Moreover, we have created a framework that makes it easier to understand the regulation of gene clusters and we have made a step toward finding truly co-regulated clusters. We show the usefulness of this approach by applying it to analyse data of yeast under different cultivation conditions.
Using our method, we have detected a very compact cluster that contains all genes from the allantoin catabolism pathway, regulated by the motifs DAL82, GAT1, GLN3 and GZF3. Furthermore, we found a cluster with very high enrichment of the GO category sulfur metabolism and we proposed a few new genetic regulation mechanisms.
Contact: [email protected]
Gene Clustering | Gene Regulation | Transcription Factors | Binding Motifs
© ICT - TU Delft 2006

1 Introduction
The central dogma of molecular biology states that when a gene on the DNA is transcribed, RNA is formed which, in turn, is translated into proteins. The rate at which transcription takes place is called the activity of a gene, or gene expression. Gene expression is tightly controlled at multiple stages, but mainly at the initiation of transcription. Here, the chromatin structure is uncoiled around the gene to be expressed and the proteins that form the transcription machinery are recruited. These processes are regulated by a specific class of DNA-binding proteins called transcription factors (TFs) (17). TFs are proteins that generally bind to the upstream regions of a gene and in that way induce or repress the activation of that gene.

Transcription factors bind the double stranded DNA helix in sequence specific binding sites or regulatory motifs. Regulatory motifs are short nucleotide sequences (6-20 bp) that show some degree of sequence variation and follow few known rules. This makes direct identification of functional motifs a challenging task. The majority of motifs have been found by biological experimentation, such as systematic mutation of individual promoter regions; however, this process is laborious and unsuited for genome-scale analysis (17).

In the last few years many new computational methods have been developed to automatically detect regulatory motifs. These tools can be divided into two main categories: scanning methods and de novo methods.
In a scanning method, one uses a motif representation resulting from experimentally determined binding sites to scan the genome sequence to find more matches (16). In de novo methods, one attempts to find novel motifs that are enriched in a set of upstream sequences (10; 11; 14; 15; 28).

It is desirable to understand which genes are similarly regulated during a certain biological process because this drastically decreases the search space when we are looking for genetic pathways. The usual approach to this problem is to cluster genes that demonstrate a similar expression profile under different conditions or over time. In order to identify the regulation program, de novo motif detection methods can be applied to the upstream coding regions of these gene clusters to detect frequently occurring sequence patterns, which may be related to certain transcription factors (23; 30). In these methods, the regulation program found for a gene cluster is considered the final result. They do not check whether this regulatory program sufficiently explains the observed expression of all members of the gene cluster.

Segal et al. (27) used a more advanced method that tries to construct complex regulatory mechanisms from the expression profiles of supposed regulating genes. However, Segal et al. assume that the expression level of these TF-producing genes is directly related to the expression of the genes that are regulated by them, although sufficient biological evidence against this simple model exists (18). Beer et al. (3) circumvent the need to know the TF abundance by using sequence data as input to their method to derive complex rules utilizing AND, OR, and NOT logic, with significant constraints on motif strength, orientation, and relative position. In this way, a large number of gene regulation hypotheses is generated, although these hypotheses need to be validated before biological conclusions can be drawn, since they encompass de novo motifs.

[Figure 1 panels: Scenario 1, Scenario 2, Scenario 3; each panel shows the motif and expression profiles of genes G1 to Gk.]
Figure 1: The goal of the proposed method is to find gene clusters that are co-expressed due to the same motifs. The reasoning behind the proposition that the integration of motif enrichment can accomplish this is threefold. Scenario 1: A cluster that is actually regulated by two different motifs is split up into separate clusters. Scenario 2: A cluster showing homogeneous expression is shrunk to a smaller cluster in which all genes contain the same motif. Scenario 3: Genes that show weak co-expression are integrated in the cluster because they share the same motif.

In this work we propose to integrate the presence of known regulating elements in the upstream genetic region of genes together with their expression levels as a combined input to the clustering system. Like all biological processes, gene expression and upstream motif enrichment demonstrate a high rate of random variation, which can lead to the detection of spurious relationships. As these random variations are independent, an integration of both concepts in the clustering method may improve the discovery of transcriptional modules that are composed of genes that are co-regulated through a common motif or combination of motifs. The ways in which the clustering can profit from the additional motif information are summarized in Figure 1.

Related methods have been proposed that also used the regulation program to adapt the grouping of genes. Segal et al. (26) use an EM-algorithm that iteratively partitions the gene set and uses this gene partition to detect new motif candidates.
In this way transcriptional modules are built that are both coherent in expression profiles and significantly enriched for common binding sites. Middendorf et al. (19) use both gene regulators and putative binding sites to build a decision tree that tries to explain the gene expression profiles in terms of regulators and motifs. A similar method from Ruan et al. (24) applies a multivariate regression tree to discover a model for gene expression patterns.

[Figure 2 flow diagram. Box labels: Yeast Genome; 6368 x 8 Gene Exp. data; SAM Selection; Find Upstream; 2497 x 8 Gene Exp. data; 2497 Upstream Regions; 107 PWMs; Motif Scanning; Compute Motif Profile; Compute Expression Distance (d^E_{ij}); Compute Motif Distance (d^M_{ij}); Initial Clustering (C1); Select Best Motifs; Combined Clustering (d^C_{ij}, weights 1-α and α); Gene Ontology Database; Cluster Validation (best α, C2).]
Figure 2: Flow diagram: The integration of enriched transcription factor binding sites into the clustering process of gene expression data. After preselection of the data, we compute gene distances on both expression and motif profiles. The motif distance is computed on a subset of the motifs, selected by the initial clustering. The second clustering step combines both information sources with weighting parameter α, which is optimized by finding the clustering with highest GO enrichment. Finally, C1 and C2 represent the initial and combined consensus clustering that are compared to show that our method generates more biologically relevant clusters.

These methods generally aim to find new motifs together with a cluster of genes that appears to be regulated by these motifs. In this way these methods might produce gene clusters that are not biologically interpretable because both the gene cluster and the regulation program are free parameters. The fact that our method only takes known motifs as input puts the focus on the grouping of genes and ensures that the resulting regulation is biologically relevant.

2 List of contributions
Here we give a short list of the main contributions of this paper.

• Using the binding data of Harbison et al. (9) as ground truth, we have made an objective comparison between different methods to compute a binding probability of a given transcription factor motif and upstream region. [Not in main paper]
• We established a framework to compute combined gene distances based on both upstream motif enrichment and microarray data.
• Using a motif selection method on an expression data clustering, we directly select the motifs that are active under the tested conditions. Different feature selection methods (such as decision trees and the method described in this paper) were compared and evaluated based on curated motif knowledge. [Comparison not in main paper]
• We used the Gene Ontology database (2) to optimize the weight of the motif information, and showed that it can be used as an independent cluster validation method.
• Because only validated motifs are used as an input to our system, we derive clusters that are biologically relevant in terms of regulation and expression. Our method directly shows the link between motifs and the genes they regulate.
• We have proposed several new biological regulation mechanisms in Saccharomyces cerevisiae.

3 Combining gene expression and gene regulation
Our proposed methodology is depicted in Figure 2. To evaluate our method, we have employed a dataset that comprises 6383 Saccharomyces cerevisiae genes with expression values measured over eight well-defined conditions (Fig. 2:1; see Methods section, M7.1). After preselection of the genes that show the most significant response under one or more conditions using the significance analysis of microarrays (SAM) algorithm (31) (Fig. 2:2), we retain 2497 genes for our analysis (M7.2). From these genes we extracted the 1000 bp upstream region using gene location data from the Saccharomyces Genome Database (SGD) (4) and the S288C Saccharomyces cerevisiae strain from Ensembl V35 (13) (Fig. 2:3). Using a compendium of 107 position weight matrices (PWMs), we scanned the upstream regions of the genes for instances of known transcription factor binding motifs (M7.3) (Fig. 2:4). To obtain a single score for each gene-motif pair, we have adopted the score function from Segal et al. (26), which combines all scores from the upstream region into a single value (M7.4). We threshold these continuous values to obtain a true-false relationship for each gene-motif combination. The set of 107 thresholded motif scores will be called the binary motif profile of a gene (M7.5) (Fig. 2:5).

We make use of the Pearson correlation coefficient to compute the distance between genes based on their expression profiles (Fig. 2:6). The Pearson correlation is generally accepted to provide a useful distance measure for grouping co-regulated genes because it is insensitive to differences in offset and scaling of the profiles (1; 12).

However, it is not trivial to define a distance measure between genes based on their motif enrichment profile. The main difficulty is that the combinatorial effect of two factors may differ from the individual effect of the factors (3; 17). After comparison of several measures (see Supplement D.2), we selected the normalized Hamming distance on the binary motif profiles to compute this distance, because this measure has a large selective ability between possible combinations of profiles (M7.6) (Fig. 2:7).

To be able to tune the influence of the regulation information, we combine both expression distance and motif enrichment into a single distance measure between genes i and j as follows:

d^C_{ij} = (1 - \alpha) d^E_{ij} + \alpha d^M_{ij}    (1)

where α (0 ≤ α ≤ 1) is the weighting parameter that sets the balance between the expression distance d^E_{ij} and the motif distance d^M_{ij}. On this combined distance measure d^C_{ij}, we employ hierarchical clustering, using complete linkage, to divide the genes into 50 distinct groups.
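As an illustration of equation (1), the following minimal sketch computes the combined distance for one pair of genes. It rests on assumptions not stated in the text: the expression distance is taken as 1 minus the Pearson correlation (the exact transformation from correlation to distance is not given here), and the function names and example profiles are hypothetical.

```python
import math

def pearson_distance(x, y):
    """Expression distance, assumed here to be 1 - Pearson correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                     * sum((b - mean_y) ** 2 for b in y))
    if norm == 0:
        raise ValueError("constant profile: correlation undefined")
    return 1.0 - cov / norm

def hamming_distance(p, q):
    """Normalized Hamming distance between two binary motif profiles."""
    return sum(a != b for a, b in zip(p, q)) / len(p)

def combined_distance(expr_i, expr_j, motif_i, motif_j, alpha):
    """Equation (1): (1 - alpha) * d_E + alpha * d_M."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    d_e = pearson_distance(expr_i, expr_j)
    d_m = hamming_distance(motif_i, motif_j)
    return (1.0 - alpha) * d_e + alpha * d_m

# Hypothetical example: two genes measured over eight conditions,
# with binary profiles over four selected motifs.
expr_a = [1, 2, 3, 4, 5, 6, 7, 8]
expr_b = [2, 4, 6, 8, 10, 12, 14, 16]   # same shape, different scale
motif_a = [1, 0, 1, 0]
motif_b = [1, 1, 1, 0]
print(combined_distance(expr_a, expr_b, motif_a, motif_b, alpha=0.25))
```

With α = 0 the function reduces to the pure expression distance and with α = 1 to the pure motif distance, which is the behaviour the weighting parameter is meant to provide. Note that under the assumed definition the two component distances live on different scales (0 to 2 versus 0 to 1).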
We expect that this number is slightly above the true number of clusters in the dataset, so that there is enough possibility to obtain compact clusters without over-segmenting the data. In order to improve the robustness of the clusters, we iteratively cluster 500 times on samplings of 80% of the data and combine the resulting clusterings using consensus clustering, which has proven to provide more reliable data groupings (6; 20) (M7.7).

We expect that not all motifs in our database are functionally active in the investigated experimental setup. In order to select the motifs that play a role in the conditions under investigation, we initially cluster the data purely on the expression distance (Fig. 2:8). From the resulting clusters we select the active motifs by computing the significance of motif enrichment using the hypergeometric distribution (M7.8) (Fig. 2:9). In the combined clustering, we use only the selected motifs to compute the motif distance between genes (Fig. 2:10). We now combine both expression and motif information using equation (1), varying α between 0 and 1. We optimize α by finding the clustering that obtains the highest enrichment of functional categories using the annotation database GO (2) (M7.9) (Fig. 2:11).

Finally, we compare the consensus clustering of the initial clustering (α = 0) with the consensus clustering at the selected ideal value of α. We closely identify the differences between the clusters. The importance of the combined clustering is shown by relating these differences to the scenarios in Figure 1. Furthermore, we show that the improved clusters have an increased biological relevance.

4 Results
4.1 Initial Clustering and Motif Selection
From the initial consensus clustering, which is purely based upon expression data, we compute p-values of the motif enrichment for each cluster. All motifs are ranked according to the lowest p-value they attain in any of the clusters. We consider a motif to be significantly enriched if it attains a p-value < 0.005; using a Bonferroni correction for 50 clusters and 100 motifs, this results in a threshold of 10^-6. Figure 3 shows the logos and p-values of the motifs that pass this threshold. The motifs that are displayed in this figure are selected as features for the combined clustering. Other feature selection methods have been evaluated and are discussed in Supplement E.

Motif         p-value
PHO4          4.04 x 10^-11
GLN3          6.83 x 10^-11
GZF3          7.65 x 10^-10
CBF1          2.72 x 10^-9
DAL82         2.76 x 10^-9
GAT1          2.01 x 10^-7
HAP2/3/4      3.52 x 10^-7
Rep of CAR    5.43 x 10^-7
CIN5          8.84 x 10^-7
[Sequence logos not reproduced.]
Figure 3: Motifs that are enriched in the initial clustering. We show only the motifs that attained a p-value smaller than 10^-6 in any of the 50 clusters. For each motif we show its common name, its enrichment p-value and a logo that indicates the information content for each base in the motif. See Supplement E.5 for the complete list.

4.2 Combined clustering and α estimation
The combined clustering uses both the expression data and the enrichment of the motifs selected by the initial clustering to compute the gene distance. To be less dependent on the initially chosen number of 50 clusters, we evaluate the GO enrichment of the final clustering for a range of x best clusters and different settings of the combining weight α (M7.9). Figure 4 shows the clustering scores for 0 ≤ α < 1; here x is set to 25. We immediately notice that the consensus clustering scores much better than the sampled clusterings. This difference is mainly due to the fact that the consensus clustering is computed on the entire dataset of 2497 genes, while the individual clusterings use only about 2000 genes (80% of the data). If we compare two clusters with the same motif enrichment ratio (#motifs/#genes), the larger cluster will obtain a better p-value. Therefore, extreme p-values are more likely to occur in large clusters (thus, in the consensus clustering).

[Figure 4 plot: log10(GO p-values) against motif weight (α).]
Figure 4: This figure shows (1) the mean and standard deviation of the 500 GO enrichment scores of the sampled clusterings (blue) and (2) the scores of the consensus clustering (green), computed for x = 25. The log10 of the GO enrichment p-values is plotted against 100 values of α. We use the consensus clustering at α = 0.25 as final combined clustering.

Furthermore, we see from Figure 4 that the consensus clustering does not show a smooth behaviour over variations of α. We therefore determine the optimal parameter values for x and α on the mean of the individual clusterings and use the consensus clustering at the selected parameters as final result.

In order to find the best value of α and x we have computed the gain in GO enrichment for each combination of x and α. To derive the significance of the difference between the initial clustering and a combined clustering we have used a one-tailed two sample t-test (M7.9). Figure 5 shows that the optimal value of α does not vary strongly for different values of x. Since for any x > 5 the optimal value for α lies within the region 0.24 ≤ α ≤ 0.28, we have taken the consensus clustering at α = 0.25 as the final combined clustering. Figure 5 shows that an optimum in x is reached around x = 25. In this way we have used the GO enrichment as an independent cluster validation method. One might suggest using this result to set the number of clusters to 25 and recompute the initial clustering. We have decided to preserve the extra clusters, in order to allow the relevant clusters to be shrunk as depicted in Scenario 2 in Figure 1. The remaining genes from the initial cluster might jump to one of the dummy clusters and are not obliged to be included in another informative group.

[Figure 5 surface plot: log10(p-value) against number of clusters (x) and motif weight (α).]
Figure 5: In order to find the optimal combination of α and x, where the gain in GO enrichment is maximal, we use a t-test to compute the distribution difference of the initial clustering (α = 0) and the combined clustering for each x ∈ N : 1 ≤ x ≤ 50 and each 0 ≤ α < 1 (100 steps). This graph shows the log10(p-values) of the t-statistic on the vertical axis, indicating the difference of the combined clustering with respect to the initial clustering. The minimal values of the plot are shown in red. As the irregular structure of the surface is due to the limited number of sampling iterations, we can conclude that the optimal improvement is obtained in the region 0.24 ≤ α ≤ 0.28. We omitted the part where α > 0.5 since the one-tailed t-test causes all p-values to be zero in this area.

4.3 Cluster comparison
To compare the initial and combined clustering, we now compute the consensus clustering for α = 0 and α = 0.25. Because we can only expect to improve clusters for which there is a related motif in our database, we discuss the value of our method by looking at the clusters with the highest motif enrichment. Figure 6 shows the clusters of the initial clustering in which at least one of the motifs attained a p-value smaller than 10^-6. For these five clusters (A-E), we show both motif enrichment and expression profiles. For the combined clustering with α = 0.25 we obtain the results shown in Figure 7. As expected, we can see from these figures that the combined clustering contains more clusters with highly enriched motifs. The changes with regard to the initial clustering can be summarized as follows:

Nitrogen Cluster. The cluster that shows differential expression under nitrogen limitation (Cluster a, Figure 7) has become the cluster with the highest motif enrichment. The reason for this is that this cluster has been shrunk to about one fifth (62 → 12 genes) of the original cluster size (Cluster B, Figure 6). Only the well explainable genes, in terms of motif and expression, have been conserved in this cluster. Figure 8 shows the genes that were found in the initial clustering and in the combined clustering. This figure depicts the expression profiles of the genes, together with the enrichment of the motifs that have been related to this condition by Tai et al. (29) and were found in our combined clustering (DAL82, GAT1, GLN3, GZF3). It is clear that many genes in the initial clustering display the expected expression profile, but lack the presence of known regulating motifs. The newly found cluster only contains genes that demonstrate an expression profile that is clearly related to the regulation program. Also, we notice that the genes that contain the related motifs show a higher expression in the aerobic condition, while the initial cluster was overall more strongly expressed in the anaerobic environment. This indicates that the presence of the transcription factors that bind to these motifs is controlled by the oxygen supply.
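The effect of shrinking a cluster on its motif enrichment can be illustrated with the hypergeometric test used for motif selection (M7.8) and the Bonferroni threshold from Section 4.1. The sketch below is illustrative only: the cluster sizes 62 and 12 and the population of 2497 genes come from the text, but the number of genes carrying the motif (100) and the number of motif-bearing genes in each cluster (12) are hypothetical values chosen for the example, not results from this study.

```python
from math import comb

def hypergeom_pvalue(population, with_motif, cluster_size, hits):
    """Upper-tail hypergeometric probability P(X >= hits) of finding at
    least `hits` motif-bearing genes in a cluster drawn without replacement."""
    upper = min(with_motif, cluster_size)
    tail = sum(
        comb(with_motif, k) * comb(population - with_motif, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, cluster_size)

POPULATION = 2497   # genes retained after preselection (from the text)
WITH_MOTIF = 100    # hypothetical number of genes carrying the motif

# Bonferroni threshold: 0.005 / (50 clusters * 100 motifs)
threshold = 0.005 / (50 * 100)

# Hypothetical scenario: the same 12 motif-bearing genes, first inside a
# 62-gene cluster, then as a 12-gene cluster on their own.
p_initial = hypergeom_pvalue(POPULATION, WITH_MOTIF, 62, 12)
p_combined = hypergeom_pvalue(POPULATION, WITH_MOTIF, 12, 12)

print(f"threshold  = {threshold:.1e}")
print(f"p_initial  = {p_initial:.2e}")
print(f"p_combined = {p_combined:.2e}")
```

Under these assumed counts, removing the genes that lack the motif lowers the enrichment p-value by many orders of magnitude, which is the mechanism behind Scenario 2 in Figure 1.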
Furthermore we observe that the combined cluster obtains a p-value of 5.9 x 10^-12 on the GO category Allantoin Catabolism. In a nitrogen limited environment, the allantoin degradation pathway, which converts allantoin (C4H6N4O3) to ammonia and carbon dioxide, allows S. cerevisiae to use allantoin as a sole nitrogen source (4). All genes that are part of this pathway according to the Saccharomyces Genome Database (4) are included in this cluster. The initial cluster showed its highest enrichment (p-value: 9.6 x 10^-10) on the more general GO category Catabolism. Thus, in this example the addition of motif information led to a cluster that can be related to a more specific condition and in this way has a higher biological relevance. Since all genes that lack the regulating motif have been removed, this cluster change is a clear example of Scenario 2 in Figure 1.

Sulfur Cluster. Looking again at Figure 7, we notice that the cluster expressed under sulfur limitation (Cluster b) now clearly shows highly enriched motifs CBF1, MET31/32 and TYE7, whereas the initial cluster (Cluster C) only corresponded to CBF1. Indeed, in Tai et al. (29) both motifs CBF1 and MET31/32 were related to this condition, together with MET4. The TYE7 motif was not related to sulfur by Tai et al. but is very likely to play a role under this condition given the fact that its motif predominantly contains the strand TCACGTG (which is highly similar to the motif of CBF1). The combined sulfur cluster has become larger than the initial cluster (94 → 151 genes), and still shows the correct expression profile with an improved motif enrichment, which indicates that the initial cluster missed some sulfur related genes. Also, the p-value of the GO category Sulfur Metabolism has improved from 7.5 x 10^-16 to 1.3 x 10^-19, because six more genes with this annotation were included in the combined cluster. Figure 9 depicts which genes of this category were found by the initial and combined clustering. To illustrate the intended behaviour of our method we have indicated which genes we expected to be clustered differently in the combined clustering. This cluster is an example of Scenario 3 in Figure 1, because the cluster has been increased to include more genes with the same motifs.

Second Aerobic Cluster. Further inspection of Figure 7 shows a second aerobic cluster that is also controlled by the HAP motifs but shows a slightly higher expression under phosphorus limitation (Cluster e), while the initial aerobic cluster (Cluster D) shows higher expression under carbon limitation. We know that the HAP2/3/4 motif has been related to both carbon limitation and aerobic conditions (7; 29). Our finding suggests that the HAP motif also plays a role under phosphorus limitation.

Anaerobic Phosphorus Cluster. The last cluster from Figure 6, Cluster E, is not present in Figure 7. This cluster, however, has remained almost unaltered, except that the motif p-values just exceed 10^-6 in the combined clustering.

Carbon Cluster. Apart from increasing specificity in the initial clusters, the combined clustering also discovered a few additional clusters with significant motif enrichment. We have found a carbon cluster with clearly enriched motif MIG1 (Cluster g). This motif has been related to carbon limitation by Tai et al. (29) but our initial clustering did not clearly show this relation. Our secondary clustering indicates that the genes regulated by MIG1 are more strongly expressed in an anaerobic environment.

Second Sulfur Cluster. An additional sulfur cluster was discovered that lacks the well known sulfur related motifs, but does contain the AFT2 motif (Cluster h). The set of genes activated by AFT1 and AFT2 is designated as the iron regulon, and its activation was suggested to depend on a product of the mitochondrial iron-sulfur cluster biogenesis pathway (25). These genes are thus part of a different pathway than the genes in the sulfur metabolism cluster (Cluster b) and may indicate a novel mechanism working under sulfur limitation.

[Figure 6 panels: (A) motif enrichment per cluster, colour scale 0 to 14, for the motifs GAT1, HAP2/3/4, RepCAR1, CIN5, CBF1, DAL82, GLN3, GZF3, PHO4; (B) expression profiles, scale -1 to 1, over the conditions aerobic C, N, P, S and anaerobic C, N, P, S. Cluster sizes: A = 61, B = 62, C = 94, D = 143, E = 67 genes.]
Figure 6: Appearance of the most enriched motifs found in the initial clustering, ranked from left to right in order of significance (A). Only the clusters in which at least one motif showed significant enrichment (p-value < 10^-6) are shown. The color of the squares indicates the -log10(p-value) of the enrichment for each motif and cluster. For each of these clusters, we show the normalized expression profiles of all genes and the total number of genes in the cluster (B). All values that exceed a standard deviation of -1/1 are truncated to -1/1.

[Figure 7 panels: (A) motif enrichment per cluster, colour scale 0 to 14, for the motifs GZF3, DAL82, GLN3, GAT1, MIG1, AFT2, CBF1, PHO4, HAP2/3/4, MET31/32, TYE7; (B) expression profiles, scale -1 to 1, over the conditions aerobic C, N, P, S and anaerobic C, N, P, S. Cluster sizes: a = 12, b = 151, c = 71, d = 178, e = 35, f = 7, g = 84, h = 15 genes.]
Figure 7: Appearance of the most enriched motifs (p-value < 10^-6) found in the combined clustering.

5 Discussion
One of the principal differences between our method and comparable work is the fact that our method does not attempt to find new motifs. Related methods have used de novo motif finding methods in an iterative fashion with the clustering step in order to find genes that are co-regulated by a certain transcription factor. These methods can result in new clusters that are unexplainable in terms of regulation, because de novo motif detection methods are known to provide many false positives. The newly found motifs need to be biologically validated (i.e. coupled to a TF) before the matching gene cluster will be trustworthy. In contrast, our method produces gene clusters that can be explained by known TFs and in this way directly produces biologically relevant gene groups. Clearly, a limitation of our method is that it assumes that motifs related to the tested conditions are present in our database.

The fact that we use a fixed number of clusters and make use of an exhaustive clustering method forces every gene to belong to a distinct group. Therefore, if we expect Scenario 2 from Figure 1 to occur, we require that the genes that were initially clustered together find a new cluster that better matches their motif profile. We have used the GO enrichment within the clusters as an independent cluster validation method, which showed us that only the best 25 clusters contribute to improvements in enrichment of functional categories. We deliberately held on to the number of 50 clusters, in order to allow genes to jump to one of the non-relevant clusters.

As is visible in Figure 3, our database contains motifs that only differ by a few bases (like GLN3, GZF3, DAL82 and GAT1). One might suggest that we should use a motif grouping method that combines these motifs. The reason we chose not to do this is twofold: firstly, we notice that a single base difference can still result in a different enrichment for these motifs (see also Figure 8); secondly, the motifs CBF1 and PHO4 show great resemblance although they are known to play a role under different conditions (sulfur and phosphorus limitation respectively). Other studies have shown that the T just before the core sequence CACGTG in CBF1 inhibits the binding of PHO4p but not CBF1p (5; 22). We therefore do not want to group all motifs that differ by a single nucleotide. However, as a post-processing step it is possible to use a motif grouping method to combine similar motifs per cluster.

[Figure 8 panels: expression profiles (aerobic C, N, P, S; anaerobic C, N, P, S), motif presence (DAL82, GAT1, GLN3, GZF3), GO annotation (GO1, GO2) and the column Both, for the initial clustering C1 (62 genes) and the combined clustering C2 (12 genes).]
Figure 8: This figure shows the initial (C1) and combined (C2) cluster that demonstrate higher expression under nitrogen limitation (A). The normalized expression profiles are shown in red (high) and green (low). For each gene it is indicated whether any of the four known nitrogen related motifs (DAL82, GAT1, GLN3, GZF3) was present in its upstream region (B). The GO categories that showed the highest enrichment in C1 and C2 were Catabolism (GO1) and Allantoin Catabolism (GO2). (C) denotes which genes are annotated with these categories. The column Both indicates genes that were found in both C1 and C2 (D). In the second cluster, only the genes with a clear regulation remain. In this way, we have gained more confidence in the co-regulation of this cluster.

[Figure 9 panels: expression profiles (aerobic C, N, P, S; anaerobic C, N, P, S) and motif presence (CBF1, MET31/32, MET4, TYE7) for the gene groups A to D, with the bars High Pot. and Not found.]
Figure 9: This figure shows all genes annotated with Sulfur Metabolism in GO. The normalized expression profiles are shown in red (high) and green (low) and for each gene it is indicated whether any of the three known sulfur related motifs (CBF1, MET31/32, MET4) was present in its upstream region. The two groups of genes on the left (A) show which sulfur genes are found by the initial clustering and which genes were additionally included in the combined clustering (purple arrows). The block on the right (C) shows the annotated genes that were initially not found. The bar indicated by High Pot. shows which genes we considered to have a high potential to be found by the combined clustering, because they contain at least one of the known motifs and do not deviate greatly from the desired expression profile (a minimal correlation of 0.5 between the expression profile and the perfect profile (00010001)). The Not found bar shows that only one of these genes was eventually not included in the combined cluster.
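The "high potential" criterion described in the Figure 9 caption (at least one known sulfur motif present, and a correlation of at least 0.5 with the perfect profile 00010001) can be sketched as follows. This is a minimal illustration: the function names and example genes are hypothetical, and Pearson correlation is assumed as the correlation measure because it is the one used for the expression distance.

```python
import math

# "Perfect profile (00010001)": expression only under sulfur limitation, in the
# condition order aerobic C, N, P, S followed by anaerobic C, N, P, S.
IDEAL_PROFILE = [0, 0, 0, 1, 0, 0, 0, 1]

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                     * sum((b - mean_y) ** 2 for b in y))
    return cov / norm if norm > 0 else 0.0

def has_high_potential(expression, motif_hits, min_corr=0.5):
    """True if the gene carries at least one known motif and its expression
    profile correlates with the ideal profile by at least min_corr."""
    return any(motif_hits) and pearson(expression, IDEAL_PROFILE) >= min_corr

# Hypothetical genes: motif_hits lists presence of CBF1, MET31/32, MET4.
sulfur_like = [-0.5, -0.4, -0.6, 1.0, -0.3, -0.5, -0.4, 0.9]
print(has_high_potential(sulfur_like, [True, False, False]))   # motif + profile
print(has_high_potential(sulfur_like, [False, False, False]))  # profile only
```

Requiring both conditions mirrors the intent of the combined distance: a gene is expected to join the sulfur cluster only when expression and motif evidence agree.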
When we regard the enriched motifs in the combined clustering (Figure 7), we notice that we have obtained more highly enriched motifs than in our initial clustering. We can now select these motifs to compute a new motif distance and in this way cluster iteratively, until the set of enriched motifs converges to a constant group. This was done for ten clustering iterations. The consensus clustering, however, does not become stable, which may be caused by the limited number of iterations. When we compute the mean motif enrichment of the 500 sampled clusterings, on the other hand, the set of enriched motifs clearly converges to a fixed set. This set is already formed after three iterations and consists of (in ranked order): CBF1, GZF3, GLN3, DAL82, HAP2/3/4, PHO4, TYE7, MET31/32 and GAT1, which all obtain a p-value smaller than $10^{-6}$. The clustering that results when these motifs are used as input is discussed in Supplement F.

In our computation of the gene-motif score we have not taken the distance of the putative binding site to the transcription start site into account, although there are numerous indications that this relationship exists (34). Therefore, the use of a motif distance analysis such as done by Harbison et al. (9) is likely to improve the score function.

6 Conclusion

We have shown that the integration of known motifs into the gene clustering scheme can improve the GO enrichment within the resulting clusters. Moreover, we have created a framework that makes it easier to understand the regulation of gene clusters because we find clusters with higher motif-condition agreement. In this way, we have made a step towards finding truly co-regulated clusters. Our method is especially effective on well-defined small scale microarray experiments for which a selection of regulators is known. We dedicate the known motifs to discover gene clusters that are regulated through these motifs and in this way provide more reliable information than gene clusters that depend purely on expression data.

We applied our method to analyse yeast grown under different cultivation conditions. In this data, our method has detected a compact gene cluster that shows clear expression under nitrogen limitation and is regulated by TFs binding to the motifs DAL82, GAT1, GLN3 and GZF3. In spite of the compactness of this cluster, it still contains all genes known to take part in the allantoin catabolism pathway, which is known to provide ammonia if it is not supplied by the environment. We enlarged the cluster of genes that are expressed under sulfur limitation with respect to a clustering that is based solely on expression data. In this way we found more genes that were annotated to the sulfur metabolism category in GO and we increased the enrichment of the motifs CBF1, MET31/32 and TYE7. Furthermore, we found that the HAP2/3/4 motif is present in genes that show expression under phosphorus limitation in an aerobic environment. Initially, this motif was only related to carbon limitation and aerobic conditions. In addition, we detected a cluster that is regulated by the AFT2 motif in a sulfur limited condition and a cluster regulated by MIG1 under carbon limitation in an anaerobic environment. Neither of these regulatory mechanisms was clearly visible in the initial clustering on expression data.

7 Materials and Methods

7.1 Expression dataset

The proposed combined clustering method was developed and applied on the expression data of Saccharomyces cerevisiae from the Kluyver Laboratory for Biotechnology in Delft (prototrophic haploid reference strain CEN.PK113-7D (MATa)). This dataset comprises 6383 genes and 24 arrays. The 24 arrays are made up of 3 replicated measurements of eight conditions. In these eight conditions the response of aerobic as well as anaerobic chemostat cultures of Saccharomyces cerevisiae to growth limitation by four different macronutrients (carbon, nitrogen, phosphorus and sulfur) is compared (29).

7.2 Gene preselection

The significance analysis of microarrays (SAM) method (31) is used to select the genes that demonstrate the most significant response under one or more nutrient limited growth conditions. Using SAM, the significance of change in at least one of the conditions is computed and all genes are ranked according to this score. Then the top 2500 genes were selected according to this rank for further analysis (obtaining a False Discovery Rate (FDR) of 0.01%). Three of the selected genes have an upstream region shorter than 1000 bp. These genes are disregarded, so 2497 genes are retained for further evaluation.

7.3 Motif scanning

Motifs are represented by a motif matrix that contains the base frequencies at the different positions. The alignment matrix $n_{b,j}$ records the occurrence of base $b$ at position $j$ of all the aligned sites for this motif. For example, the distribution of the bases for the motif of PHO4 with length 11 looks like:

\[
\begin{matrix} A \\ C \\ G \\ T \end{matrix}
=
\begin{pmatrix}
1 & 3 & 2 & 0 & 8 & 0 & 0 & 0 & 0 & 0 & 1 \\
2 & 2 & 3 & 8 & 0 & 8 & 0 & 0 & 0 & 2 & 0 \\
1 & 2 & 3 & 0 & 0 & 0 & 8 & 0 & 5 & 4 & 5 \\
4 & 1 & 0 & 0 & 0 & 0 & 0 & 8 & 3 & 2 & 2
\end{pmatrix}.
\]

This motif matrix gives the numbers of base occurrences in 8 known binding sites for this motif (e.g. found in biological experiments). The position weight matrix (PWM) is the matrix that is most frequently used to score a test sequence with a given motif consensus. The PWM is computed from the alignment matrix by (11; 15):

\[
W_{b,j} = \ln \frac{(n_{b,j} + p_b)/(N + 1)}{p_b} \approx \ln \frac{f_{b,j}}{p_b}, \qquad (2)
\]

where $N$ is the number of known motif sites, $p_b$ the background frequency of base $b$ in the entire genome and $f_{b,j}$ is the frequency matrix computed by $f_{b,j} = n_{b,j}/N$. A test sequence may be aligned along the weight matrix, and its score is the sum of the weights for the letters aligned at each position (see Figure 10):

\[
Sc_i = \sum_{j=1}^{p} W_j[S_{i+j-1}], \qquad (3)
\]

where $S_i$ is the base at position $i$ in the upstream region to be scanned, $p$ is the size of the motif and $W$ is the PWM.

Figure 10: Scanning a sequence with a PWM. The score is simply the sum of the values sampled from the matrix (the depicted window of the PHO4 PWM scores $Sc_i = 3.81$).

To scan the upstream regions of the genes for instances of known transcription factor binding motifs, a compendium of 107 position weight matrices (PWMs) was built, collected from three different online databases (18 from Transfac (32), 13 from SCPD (34) and 76 from Harbison et al. (9)). All duplicates were removed and when multiple TFs bind to the exact same motif their labels were combined in our compendium.

7.4 Computation of Gene-Motif agreement score

Because regulatory motifs can occur on both strands of the DNA, a scan over a region of 1000 bp will result in $2(1000 - p + 1) \approx 2000$ scores per gene for a PWM of length $p$. To obtain a single score for each gene-motif combination, several methods were compared (see Supplement C) and the method used by Segal et al. (26) has been adopted, which computes:

\[
P(M_g = \mathrm{true} \mid S_1, \ldots, S_n) = \varsigma\!\left( \log\!\left( \frac{1}{n - p + 1} \sum_{i=1}^{n-p+1} \exp\{Sc_i\} \right) \right), \qquad (4)
\]

where $\varsigma$ is the sigmoid function ($\varsigma(x) = \frac{1}{1 + e^{-x}}$) and $n$ is the length of the upstream region.
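Equations (2) to (4) can be chained into a short sketch. The PHO4 counts are the alignment matrix given in Section 7.3; the background frequencies are assumed values (0.31 for A and T, 0.19 for C and G) chosen because they approximately reproduce the PWM entries shown in Figure 10. Pooling the window scores of both strands before averaging is also an assumption, and all function names are hypothetical.

```python
import math

BASES = "ACGT"
# Assumed background frequencies; they approximately reproduce Figure 10.
BACKGROUND = {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}

# PHO4 alignment matrix n[b][j]: base counts over N = 8 known binding sites.
PHO4_COUNTS = {
    "A": [1, 3, 2, 0, 8, 0, 0, 0, 0, 0, 1],
    "C": [2, 2, 3, 8, 0, 8, 0, 0, 0, 2, 0],
    "G": [1, 2, 3, 0, 0, 0, 8, 0, 5, 4, 5],
    "T": [4, 1, 0, 0, 0, 0, 0, 8, 3, 2, 2],
}


def pwm_from_counts(counts, background, n_sites):
    """Eq. (2): W[b][j] = ln(((n[b][j] + p_b) / (N + 1)) / p_b)."""
    return {
        b: [math.log((n + background[b]) / (n_sites + 1) / background[b])
            for n in counts[b]]
        for b in BASES
    }


def window_scores(seq, pwm):
    """Eq. (3): PWM score of every window of motif length p in seq."""
    p = len(pwm["A"])
    return [sum(pwm[seq[i + j]][j] for j in range(p))
            for i in range(len(seq) - p + 1)]


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gene_motif_score(seq, pwm):
    """Eq. (4) over the pooled scores of both strands (assumption).
    sigmoid(log(m)) simplifies to m / (1 + m) for the mean m of exp(score)."""
    scores = window_scores(seq, pwm) + window_scores(reverse_complement(seq), pwm)
    mean_exp = sum(math.exp(s) for s in scores) / len(scores)
    return mean_exp / (1.0 + mean_exp)


pwm = pwm_from_counts(PHO4_COUNTS, BACKGROUND, n_sites=8)
with_site = "A" * 20 + "TCCCACGTGGG" + "A" * 20   # contains the best-scoring site
without_site = "A" * 51
print(gene_motif_score(with_site, pwm), gene_motif_score(without_site, pwm))
```

Because the exponent is taken before averaging, a single strong site dominates the mean, so the first score lies close to 1 while the second stays close to 0.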
Equation (4) takes the mean of the exponent of all alignment scores $Sc_i$ along the upstream region and in this way gives a higher weight to large scores and neglects very low scores. The sigmoid function scales the resulting score values between 0 and 1.

7.5 Threshold on the Gene-Motif agreement score

Equation (4) returns a continuous value that can be seen as a probability that a certain motif is present in an upstream region of a gene. For both computational simplicity and comprehensibility it is desirable to threshold these gene-motif agreement scores and obtain a true/false relationship between gene upstream region and motif. Figure 11 shows the resulting median number of motifs per gene for thresholds ranging between 0.65 and 1. In addition, the total number of genes without any motif is depicted.

Figure 11: Median number of motifs per gene (blue line, left y-axis) and the number of genes without motif (green line, right y-axis) as a function of the threshold on the scoring function (Eq. 4). The chosen threshold of 0.82 (red line) results in a median of 5 motifs per gene and a total of 31 genes without motifs.

To be able to distinguish between gene regulation programs, a reasonable number of motifs per gene is needed and the number of genes without a motif needs to be reduced as much as possible. Therefore, the threshold on the score value is set such that the median number of motifs for an upstream region equals five (threshold: 0.82). This number was also observed by Zhang et al. (33), who used a database of known and experimentally verified motifs to scan the upstream regions of yeast genes. For vertebrates, Prakash and Tompa found a similar number of six (21), based on over-representation in an orthologous human, chimp, mouse and rat dataset.
Note that if a more stringent threshold had been chosen, the number of genes without any motif annotation would have increased dramatically, as is visible in Figure 11. The set of 107 thresholded motif scores will be called the binary motif profile of a gene. Figure 12 shows the binary motif profiles of twelve genes that show higher expression levels in a nitrogen limited environment, independent of the oxygen supply. The vertical lines in this figure indicate that all genes in this group have this binding site in their upstream region. If a group of genes shows a similar expression profile and their upstream regions contain one or more similar motifs, we can say that the gene cluster is co-regulated.

Figure 12: Experimental setup. The block on the right shows the normalized expression profiles over eight experimental conditions (aerobic and anaerobic C, N, P and S limitation). Red indicates high expression with regard to the other conditions (green). The left block shows the binary motif profiles that indicate whether each of the 107 motifs is present in the upstream region of a gene; the motifs related to nitrogen are marked.

7.6 Motif profile distance

To obtain a motif distance between each gene pair, the normalized Hamming distance between the binary motif profiles is computed as follows:

\[
d_H = \frac{\sum_{i=1}^{N} |P_1(i) - P_2(i)|}{N}, \qquad (5)
\]

where $N$ is the total length of the motif profiles and the numerator is the number of differences between profiles $P_1$ and $P_2$. The drawback of this method is that it takes all the motifs in the motif profile into account, which causes a lot of noise, because not all motifs are active in our experimental setup. To compensate for this, a feature selection method is used so that only motifs that play a significant role under the tested conditions will contribute to the distance measure.
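A minimal sketch of the binary motif profile and the normalized Hamming distance of Eq. (5) follows. The threshold of 0.82 is taken from Section 7.5; treating a score equal to the threshold as a hit, and passing the selected motifs as a list of indices, are assumptions made for illustration. The function names are hypothetical.

```python
def binary_profile(scores, threshold=0.82):
    """Threshold continuous gene-motif scores into a 0/1 motif profile.
    A score equal to the threshold counts as present (assumption)."""
    return [1 if s >= threshold else 0 for s in scores]


def hamming_distance(p1, p2, selected=None):
    """Eq. (5): fraction of positions at which two binary profiles differ.
    If `selected` is given, only those motif indices contribute, which
    mimics restricting the distance to a set of selected motifs."""
    idx = list(range(len(p1))) if selected is None else list(selected)
    return sum(abs(p1[i] - p2[i]) for i in idx) / len(idx)


gene_a = binary_profile([0.95, 0.40, 0.90, 0.10])   # -> [1, 0, 1, 0]
gene_b = binary_profile([0.88, 0.91, 0.30, 0.20])   # -> [1, 1, 0, 0]
print(hamming_distance(gene_a, gene_b))             # all four motifs
print(hamming_distance(gene_a, gene_b, [0, 3]))     # only motifs 0 and 3
```

With all four motifs the two genes differ in half of the positions, while on the selected subset they are identical, which illustrates how irrelevant motifs add noise to the distance.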
This feature selection consists of selecting the motifs that are highly enriched in the initial clustering, which is based solely on expression data. Other selection methods have been assessed, but did not give improvements (see Supplement E).

7.7 Data clustering

In both clustering steps we use hierarchical clustering to divide the data into 50 distinct groups. Complete linkage is used, which has been shown to provide the most reliable clusters on genetic data (8) (see Supplement D.1). Because we chose to compute more clusters than we expected in the dataset, we assume that not all resulting clusters will be relevant. Therefore, only a select number of clusters will be regarded in order to assess the value of our method in Chapter 4.3. To improve the robustness of the putative clusters to variations in data sampling, we cluster 500 times on 80% of the data and employ consensus clustering (6; 20). This methodology first computes a consensus matrix which contains, for each pair of items, the proportion of clusterings in which the two items are clustered together:

\[
M(i, j) = \frac{\sum_h M^{(h)}(i, j)}{\sum_h I^{(h)}(i, j)}, \qquad (6)
\]

where $I^{(h)}$ indicates whether items $i$ and $j$ were both selected by the data sampling, and $M^{(h)}$ is the co-occurrence matrix that stores whether items $i$ and $j$ are clustered together in clustering $h$:

\[
M^{(h)}(i, j) =
\begin{cases}
1 & \text{if } i \text{ and } j \text{ belong to the same cluster,} \\
0 & \text{otherwise.}
\end{cases}
\]

From the consensus matrix we compute a new distance matrix $D = 1 - M$, which is used to derive a new clustering, again using hierarchical clustering with complete linkage.

7.8 Enrichment computation

For both the computation of the significance of the number of motifs in a cluster and the number of GO category annotations in a cluster, we compute p-values as enrichment scores. The hypergeometric distribution was employed to compute the probability of detecting the observed number of motifs/annotations or more in a gene cluster. An enrichment p-value is computed as follows:

\[
p = P(i \geq b) = \sum_{i=b}^{\min(B, g)} \frac{\binom{B}{i} \binom{G - B}{g - i}}{\binom{G}{g}}, \qquad (7)
\]

where $G$ is the total number of genes, $B$ is the number of genes within this cluster, $g$ is the total number of genes that have this motif/annotation and $b$ is the number of genes from the cluster that have this motif/annotation.

7.9 Cluster evaluation

In order to evaluate the different clusterings, the Gene Ontology database (2) is used to find the enrichment of functional categories in the individual clusters. First, all GO categories with fewer than 5 annotations are removed, resulting in 576 categories. Then, p-values of the detected number of annotations for each cluster-category combination are computed using the hypergeometric distribution and the lowest p-value over all categories is assigned as a score for a cluster.

Figure 13: The steps we take to estimate the optimal value for α. Here, R is the number of cluster iterations (500), x is the number of clusters we use to compute a single score for a clustering (between 1 and X) and α is the motif weight that we vary between 0 and 1 in A steps. The total number of clusters X is set to 50 and A is set to 100.

The combined clustering step iterates 500 times over 80% of the data and varies α between 0 and 1 in 100 steps. Figure 13 shows that this results in $R \cdot A \cdot X$ cluster scores. In step A, the score for a clustering is computed by taking the average over the x best clusters, varying x between 1 and X. Step B computes the mean and standard deviation over the 500 iterations. Also, a consensus clustering for each α is determined. The score for the consensus clustering is computed and plotted together with the mean and standard deviation of the individual scores in Figure 4.
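The consensus matrix of Eq. (6), the enrichment p-value of Eq. (7) and the cluster scores of Section 7.9 can be sketched with the standard library only. Representing each sampled clustering as a dictionary from item index to cluster label is an assumption, as is averaging the raw cluster scores in `clustering_score`; all function names are hypothetical.

```python
from math import comb


def consensus_matrix(labelings, n_items):
    """Eq. (6): for each item pair, the fraction of clusterings that put the
    pair in one cluster, among the clusterings in which both were sampled.
    Each labeling maps a sampled item index to its cluster label."""
    together = [[0] * n_items for _ in range(n_items)]
    sampled = [[0] * n_items for _ in range(n_items)]
    for labels in labelings:
        for i in labels:
            for j in labels:
                sampled[i][j] += 1
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n_items)] for i in range(n_items)]


def enrichment_pvalue(G, B, g, b):
    """Eq. (7): hypergeometric probability of finding b or more annotated
    genes in a cluster of B genes, with g annotated genes among G in total."""
    hits = sum(comb(B, i) * comb(G - B, g - i) for i in range(b, min(B, g) + 1))
    return hits / comb(G, g)


def cluster_score(pvalues_per_category):
    """Score of one cluster: the lowest p-value over all categories."""
    return min(pvalues_per_category)


def clustering_score(cluster_scores, x):
    """Score of a clustering: average over its x best (lowest) cluster scores."""
    best = sorted(cluster_scores)[:x]
    return sum(best) / len(best)


# Two sampled clusterings of three items; item 2 is missing from the second.
M = consensus_matrix([{0: 0, 1: 0, 2: 1}, {0: 0, 1: 1}], n_items=3)
D = [[1.0 - m for m in row] for row in M]   # distance matrix D = 1 - M
print(M[0][1], D[0][1])
print(enrichment_pvalue(G=10, B=4, g=3, b=3))
```

The distance matrix `D` would then be passed to a hierarchical clustering routine with complete linkage, which is outside the scope of this sketch.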
Finally, in step C the gain of the combined clustering with respect to the initial clustering is computed, using a two sample t-test with respect to the clustering on expression data (α = 0) as follows:

\[
T = \frac{\bar{X}_{init} - \bar{X}_{comb}}{\sqrt{\dfrac{S^2_{init} + S^2_{comb}}{R}}}, \qquad (8)
\]

where $\bar{X}_{init}$ and $\bar{X}_{comb}$ are the sample means of the initial and combined clustering, $R$ is the number of cluster iterations and $S^2_{init}$ and $S^2_{comb}$ are the sample variances. Since we are only interested in clusterings that have a mean score lower than the initial clustering, we compute a one-tailed t-test. The p-values of the t-statistic for each α and x are shown in Figure 5.

References

[1] Alon, U. et al. "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays". PNAS, Vol. 96, pp. 6745-6750, 1999.
[2] Ashburner, M. et al. "Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium". Nature Genetics, Vol. 25, pp. 25-29, 2000.
[3] Beer, M.A. and Tavazoie, S. "Predicting Gene Expression from Sequence". Cell, Vol. 117, pp. 185-198, 2004.
[4] Cherry, J.M. et al. "Genetic and physical maps of Saccharomyces cerevisiae". Nature, Vol. 387 (6632 Suppl.), pp. 67-73, 1997.
[5] Fisher, F. and Goding, C.R. "Single amino acid substitutions alter helix-loop-helix protein specificity for bases flanking the core CANNTG motif". The EMBO Journal, Vol. 11, No. 11, pp. 4103-4109, 1992.
[6] Fred, A.L.N. and Jain, A.K. "Data Clustering Using Evidence Accumulation". 16th International Conference on Pattern Recognition (ICPR'02), Vol. 4, p. 40276, 2002.
[7] Gancedo, J.M. "Yeast Carbon Catabolite Repression". Microbiology and Molecular Biology Reviews, Vol. 62, No. 2, pp. 334-361, 1998.
[8] Gibbons, F.D. and Roth, F.P. "Judging the Quality of Gene Expression Based Clustering Methods Using Gene Annotation". Genome Research, Vol. 12, pp. 1574-1581, 2002.
[9] Harbison, C.T. et al. "Transcriptional regulatory code of a eukaryotic genome". Nature, Vol. 431, pp. 99-104, 2004.
[10] van Helden, J. et al. "Extracting Regulatory Sites from the Upstream Region of Yeast Genes by Computational Analysis of Oligonucleotide Frequencies". J. Mol. Biol., Vol. 281, pp. 827-842, 1998.
[11] Hertz, G.Z. and Stormo, G.D. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences". Bioinformatics, Vol. 15, Nos. 7/8, pp. 563-577, 1999.
[12] Heyer, L.J. et al. "Exploring Expression Data: Identification and Analysis of Coexpressed Genes". Genome Research, Vol. 9, pp. 1106-1115, 1999.
[13] Hubbard, T. et al. "Ensembl 2005". Nucleic Acids Research, Vol. 33, Database issue, pp. D447-D453, 2005.
[14] Hughes, J.D. et al. "Computational identification of cis-regulatory elements associated with functionally coherent groups of genes in Saccharomyces cerevisiae". J. Mol. Biol., Vol. 296, pp. 1205-1214, 2000.
[15] Jensen, S.T. et al. "Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective". Statistical Science, Vol. 19, No. 1, pp. 188-204, 2004.
[16] Johansson, Ö. et al. "Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN algorithm". Bioinformatics, Vol. 19, Suppl. 1, pp. i169-i176, 2003.
[17] Kellis, M. "Computational comparative genomics: genes, regulation, evolution". Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2003.
[18] Latchman, D.S. "Transcription Factors as Potential Targets for Therapeutic Drugs". Current Pharmaceutical Biotechnology, Vol. 1, No. 1, pp. 57-61, 2000.
[19] Middendorf, M. et al. "Motif Discovery Through Predictive Modeling of Gene Regulation". RECOMB 2005, LNBI 3500, pp. 538-552, 2005.
[20] Monti, S. et al. "Consensus Clustering: A resampling-based method for class discovery and visualization of gene expression microarray data". Machine Learning, Vol. 52, pp. 91-118, 2003.
[21] Prakash, A. and Tompa, M. "Discovery of regulatory elements in vertebrates through comparative genomics". Nature Biotechnology, Vol. 23, No. 10, pp. 1249-1256, 2005.
[22] Robinson, K.A. and Lopes, J.M. "Saccharomyces cerevisiae basic helix-loop-helix proteins regulate diverse biological processes". Nucleic Acids Research, Vol. 28, No. 7, pp. 1499-1505, 2000.
[23] Roth, F. et al. "Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation". Nature Biotechnology, Vol. 16, pp. 939-945, 1998.
[24] Ruan, J. and Zhang, W. "A bi-dimensional regression tree approach to the modeling of gene expression regulation". Bioinformatics, Vol. 22, No. 3, pp. 332-340, 2006.
[25] Rutherford, J.C. et al. "Activation of the Iron Regulon by the Yeast Aft1/Aft2 Transcription Factors Depends on Mitochondrial but Not Cytosolic Iron-Sulfur Protein Biogenesis". The Journal of Biological Chemistry, Vol. 280, No. 11, pp. 10135-10140, 2005.
[26] Segal, E. et al. "Genome-wide discovery of transcriptional modules from DNA sequence and gene expression". Bioinformatics, Vol. 19, Suppl. 1, pp. i273-i282, 2003.
[27] Segal, E. et al. "Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data". Nature Genetics, Vol. 34, No. 2, pp. 166-176, 2003.
[28] Sinha, S. et al. "PhyME: A probabilistic algorithm for finding motifs in sets of orthologous sequences". BMC Bioinformatics, Vol. 5, 170, 2004.
[29] Tai, S.L. et al. "Two-dimensional Transcriptome Analysis in Chemostat Cultures". The Journal of Biological Chemistry, Vol. 280, No. 1, pp. 437-447, 2005.
[30] Tavazoie, S. et al. "Systematic determination of genetic network architecture". Nature Genetics, Vol. 22, pp. 281-285, 1999.
[31] Tusher, V.G. et al. "Significance analysis of microarrays applied to the ionizing radiation response". PNAS, Vol. 98, No. 9, pp. 5116-5121, 2001.
[32] Wingender, E. et al. "TRANSFAC: an integrated system for gene expression regulation". Nucleic Acids Research, Vol. 28, No. 1, pp. 316-319, 2000.
[33] Zhang, Z. et al. "How much expression divergence after yeast gene duplication could be explained by regulatory motif evolution?". TRENDS in Genetics, Vol. 20, No. 9, pp. 403-407, 2004.
[34] Zhu, J. and Zhang, M.Q. "SCPD: a promoter database of the yeast Saccharomyces cerevisiae". Bioinformatics, Vol. 15, Nos. 7/8, pp. 607-611, 1999.

Supplement

Discovery of co-regulated gene clusters by combining known transcription factor binding motifs and gene expression profiles

Maarten Clements
April 7, 2006

Contents

A Genes and their regulation
  A.1 DNA
  A.2 Genes
  A.3 Transcription factors
  A.4 Microarrays
  A.5 Yeast
B Motif representation
C Computing the motif profile
  C.1 Computing a Gene-Motif score
  C.2 Comparison of scoring methods
  C.3 Score thresholding
D Clustering
  D.1 Hierarchical clustering
  D.2 Distance measures
E Feature selection
  E.1 Cluster best
  E.2 P-value threshold
  E.3 Decision tree
  E.4 Method comparison
  E.5 Motif ranking
F Gene Ontology
G Iterative clustering

A Genes and their regulation

A.1 DNA

The human body is constructed of trillions of cells which together form tissues like skin, muscle and bone. Each of these cells contains the same building information, called DNA. This DNA is the codebook of how we are made and how our body works. It is the part of this codebook that is read that determines the function of a certain cell. For example, a cell in muscle tissue uses the DNA that codes for the muscular function, while a bone cell only references that part of the DNA that describes the construction of bones. It is as if each cell reads only that part of a book of instructions that it needs.

The DNA is made up of four basic nucleotides: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T) (see Figure 1). These four nucleotides always bind each other in specific pairs, A-T and C-G; these pairs are called base pairs (bp). The concatenation of millions of base pairs forms chromosomes. Human DNA normally contains 46 chromosomes with a total of almost $3.3 \times 10^9$ base pairs.

Figure 1: The genetic structure. Cells contain chromosomes that are made up of long strands of base pairs.

A.2 Genes

The parts of the DNA that actually code for the organic functions are called genes. The central dogma of molecular biology states that when a gene on the DNA is transcribed, RNA is formed which in turn is translated into proteins (see Figure 2). In biology, proteins are known as the basic building blocks of life. They perform a wide variety of biological functions, ranging from catalysis of chemical reactions to actual formation of mechanical structures. Proteins are essentially polymers made up of a specific sequence of amino acids. The details of this sequence are stored in the code of a gene.

Figure 2: In the process of transcription the gene is copied into a strand of RNA.
This RNA may be translated into proteins, the building blocks of life.

A.3 Transcription factors

The rate at which transcription takes place is called the activity of a gene, or gene expression. Gene expression is tightly controlled at multiple stages, but mainly at the initiation of transcription. Here, the chromatin structure is uncoiled around the gene to be expressed and the proteins that form the transcription machinery are recruited. These processes are regulated by a specific class of DNA-binding proteins called transcription factors (TFs) [8]. TFs are proteins that bind to the upstream regions of a gene and in that way induce or repress the activation of that gene (see Figure 3).

Figure 3: Transcription factors bind close to the transcription start site of a gene. In this way, they enhance or repress the activation of transcription.

A.4 Microarrays

The development of the microarray technique has had paramount implications for the knowledge of genetic mechanisms. This method allows us to measure the abundance of RNA at a certain moment in time (see Figure 4). From this information we can derive which genes are active under certain conditions or during biological processes. However, microarray measurements are known to be noisy because of biological variation and measurement inconsistencies. Replication of experiments can compensate for a large part of these effects, but uncertainty still remains.

Figure 4: Microarrays can measure the expression of genes under certain conditions or over time. This figure shows a part of a microarray in which 17 genes (rows) are measured under 8 different conditions (columns); colours range from high to low expression.

A.5 Yeast

In bioinformatics, baker's yeast (Saccharomyces cerevisiae) is often used as a model organism to develop new methods.
Yeast is very easy to study because it is easily modified and cultured, yet it maintains the complex regulation mechanisms of other eukaryotes like plants and animals. Compared to the human genome, the yeast genome is very small: it consists of about $1.3 \times 10^7$ base pairs and about 6,000 genes. This is another attribute that makes baker's yeast a perfect organism for the development of computational methods.

B Motif representation

In the regulation process, transcription factors bind the DNA in the vicinity of the transcription start site in order to activate or repress the activation of the gene. Transcription factors bind the DNA in specific nucleotide sequences (generally 6-20 bp); these short sequences are called transcription factor binding sites or motifs. Motifs are usually represented by a motif matrix which contains the base frequencies at the different positions. There are three basic types of motif matrices. The simplest is an alignment matrix $n_{b,j}$, which records the occurrence of base $b$ at position $j$ of all the aligned sites for this motif. For example, the distribution of the bases for the motif of PHO4 with length 11 looks like:

\[
\begin{matrix} A \\ C \\ G \\ T \end{matrix}
=
\begin{pmatrix}
1 & 3 & 2 & 0 & 8 & 0 & 0 & 0 & 0 & 0 & 1 \\
2 & 2 & 3 & 8 & 0 & 8 & 0 & 0 & 0 & 2 & 0 \\
1 & 2 & 3 & 0 & 0 & 0 & 8 & 0 & 5 & 4 & 5 \\
4 & 1 & 0 & 0 & 0 & 0 & 0 & 8 & 3 & 2 & 2
\end{pmatrix}.
\]

This motif matrix gives the numbers of base occurrences in 8 known binding sites for this motif (found in biological experiments). From the alignment matrix it is possible to compute the frequency matrix by $f_{b,j} = n_{b,j}/N$, where $N$ is the number of known motif sites. In the graphical representation, depicted in Figure 5, the information of a base at a certain position is computed in bits, derived with:

\[
I_b(j) = f_{b,j} \log_2 \frac{f_{b,j}}{p_b}, \qquad (1)
\]

where $j$ is the position in the motif, $b$ refers to each of the possible bases, $f_{b,j}$ is the observed frequency of the concerned base at position $j$ and $p_b$ is the background frequency of base $b$ in the entire genome [11].
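As an illustration of Eq. (1), the sketch below computes the information in bits for every base and position of the PHO4 alignment matrix. The background frequencies (0.31 for A and T, 0.19 for C and G) are assumed values, and a frequency of zero is taken to contribute zero bits, which is the usual convention but is not stated in the text. The function name is hypothetical.

```python
import math

# Assumed background frequencies of the four bases (illustrative values).
BACKGROUND = {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}

# PHO4 alignment matrix: base counts over N = 8 known binding sites.
PHO4_COUNTS = {
    "A": [1, 3, 2, 0, 8, 0, 0, 0, 0, 0, 1],
    "C": [2, 2, 3, 8, 0, 8, 0, 0, 0, 2, 0],
    "G": [1, 2, 3, 0, 0, 0, 8, 0, 5, 4, 5],
    "T": [4, 1, 0, 0, 0, 0, 0, 8, 3, 2, 2],
}


def information_content(counts, background, n_sites):
    """Eq. (1): I_b(j) = f_bj * log2(f_bj / p_b), with 0 bits when f_bj = 0."""
    info = {}
    for base, row in counts.items():
        info[base] = []
        for n in row:
            f = n / n_sites
            bits = f * math.log2(f / background[base]) if f > 0 else 0.0
            info[base].append(bits)
    return info


info = information_content(PHO4_COUNTS, BACKGROUND, n_sites=8)
# Total information per position: the conserved core (positions 4 to 8)
# carries far more information than the variable flanks.
totals = [sum(info[b][j] for b in "ACGT") for j in range(11)]
print([round(t, 2) for t in totals])
```

The per-position totals correspond to the stack heights of a sequence logo such as the one in Figure 5, up to the small sample correction discussed in the text.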
Figure 5: A graphical representation of the PHO4 motif (information in bits for positions 1 to 11).

When there are only a few sample sequences, a straightforward calculation will tend to overestimate the entropy. To compensate, we substitute the frequency by the adapted frequency obtained by adding a pseudo count:

\[
\tilde{f}_{b,j} = \frac{n_{b,j} + p_b}{N + 1}. \qquad (2)
\]

This small sample correction depends only on the a priori probability and the total amount of data in each column, which may differ from one column to another, since the matrix can be made up of different short sequences. The error bars in the figure indicate the score plus and minus the small sample correction [3].

The position weight matrix (PWM) is the matrix that is most frequently used to score a test sequence with a given motif consensus. The PWM is computed from the alignment matrix by [6][7]:

\[
W_{b,j} = \ln \frac{(n_{b,j} + p_b)/(N + 1)}{p_b} \approx \ln \frac{f_{b,j}}{p_b}. \qquad (3)
\]

A test sequence may be aligned along the weight matrix, and its score is the sum of the weights for the letters aligned at each position (see Figure 6):

\[
Sc_i = \sum_{j=1}^{p} W_j[S_{i+j-1}], \qquad (4)
\]

where $S_i$ is the base at position $i$ in the upstream region to be scanned, $p$ is the size of the motif and $W$ is the PWM.

Figure 6: Scanning a sequence with a PWM. The score is simply the sum of the values sampled from the matrix (the depicted window of the PHO4 PWM scores $Sc_i = 3.81$).

In order to scan the 1000 bp upstream region of a gene, we concatenate the positive and negative strand and compute the scores for the concatenated DNA strand. For each gene we end up with a binding vector of length 2000 that represents the likelihood that this TF can bind at the given location $i$.
Figure 7 shows this binding vector for a scan with PHO4.

Figure 7: Scanning an upstream region of 1000 bp with a PWM gives us 2000 scores, because both the positive and negative strand need to be scanned. This graph shows the binding scores $Sc_i$ of PHO4 as a function of the position $i$ in the upstream region of a gene.

C Computing the motif profile

C.1 Computing a Gene-Motif score

Given all scores of a particular upstream region and PWM, we need to determine the probability that this TF can bind to this gene. We do this by determining whether a gene has more high scores than can be expected randomly. If so, we call this gene enriched. To compute the enrichment of the 107 known motifs in the upstream regions of the 2497 Kluyver genes we have first determined the binding vectors by scanning all regions with all PWMs. A scan with a single PWM of length $p$ on an upstream region of 1000 bp results in $2(1000 - p + 1) \approx 2000$ scores, because both the positive and the negative strand need to be scanned. Figure 8 depicts a normally distributed approximation of the distribution of the scores for a single PWM for all upstream regions in the yeast genome. The panel on the right shows the distribution for a single gene in which this motif has a higher enrichment compared to the background. We have considered three methods for computing a single gene-motif score.

Background threshold. The first method determines the p-value of the detected enrichment by setting a threshold α where x% of the background lies above α ($x\% = B_{>\alpha}/B$) and computes:

\[
E_x = p\text{-value} = \sum_{i=0}^{G_{<\alpha}} \frac{\binom{B_{<\alpha}}{G_{<\alpha} - i} \binom{B_{>\alpha}}{G_{>\alpha} + i}}{\binom{B}{G}}, \qquad (5)
\]

where $B$ is the background score distribution and $G$ represents the distribution of the scores for the gene under investigation; the subscripts $<\alpha$ and $>\alpha$ denote the numbers of scores below and above the threshold. The value of x determines the amount of deviation we allow from our PWM to consider a location to be a binding location.
If we set $x$ very small, only exact matches to the consensus motif will be counted as hits, whereas a higher value will allow some deviations. We have studied different values for $x$, i.e. 0.01, 0.02, 0.05, 0.1, 0.25 and 0.5.

Combined threshold. The second method computes a combined score which stresses the most stringent thresholds as follows:

$$E_{mix} = \frac{50E_{0.01} + 25E_{0.02} + 10E_{0.05} + 5E_{0.1} + 2E_{0.25} + E_{0.5}}{93}. \tag{6}$$

The rationale behind this combined score is that very large scores should be more important, but that many intermediate scores may have the same binding potential as a few high scores.

Figure 8: Approximation of the background distribution and the distribution of the scores from a single gene. (Panels: entire genome, 5 million scores; single gene, 2000 scores. Each distribution is split at the threshold $\alpha$ into $B_{<\alpha}$, $B_{>\alpha}$ and $G_{<\alpha}$, $G_{>\alpha}$, with $x\%$ of the background above $\alpha$.)

Exponent mean. The third method to compute the probability that a gene $g$ has motif $M$ in its upstream region is the method used by Segal et al. [10], which computes:

$$P(g.M = \mathrm{true} \mid S_1, \ldots, S_n) = \varsigma\left(\log\left(\frac{1}{n - p + 1}\sum_{i=1}^{n-p+1}\exp\left\{\sum_{j=1}^{p} W_j[S_{i+j-1}]\right\}\right)\right), \tag{7}$$

where $\varsigma$ is the sigmoid function ($\varsigma(x) = \frac{1}{1 + e^{-x}}$) and $n$ is the length of the upstream region. This function takes the mean of the exponent of all scores and in this way gives a higher weight to large scores and neglects very low scores. The sigmoid function scales the resulting score values between 0 and 1.

C.2 Comparison of scoring methods

To evaluate the quality of these different scores we employ the binding data from Harbison as a ground truth [5]. Harbison et al. elaborately investigated the confidence level at which transcription factors bind to certain motifs in different yeast species. We computed the match between our methods, at different cutoff thresholds between 0 and 1, and the data from Harbison [5] with no conservation criteria and binding probability smaller than 0.001.
From this match, ROC curves were computed, showing the sensitivity ($\frac{\mathrm{TruePositives}}{\mathrm{TruePositives} + \mathrm{FalseNegatives}}$) versus 1-specificity ($\frac{\mathrm{FalsePositives}}{\mathrm{FalsePositives} + \mathrm{TrueNegatives}}$). The ROC curves are depicted in Figure 9(a).

Figure 9: Evaluation of the proposed scoring methods and setting a threshold to obtain a binary motif profile. (a) ROC curves of the proposed Gene-Motif scoring methods, using Harbison as ground truth (sensitivity against 1-specificity for Segal2003, E05, E025, E01, E005, E002, E001 and Emix). (b) Median number of motifs per gene and total genes without any motifs, against the threshold on the Segal2003 score.

From Figure 9(a), it is clear that Segal's method outperforms the other measures. Consequently, we have chosen to use this method to assign a motif score to each gene. Furthermore, we have observed that the proposed mix method tends towards the method of Segal if the number of combined thresholds increases.

C.3 Score thresholding

The scoring algorithm returns a continuous score that can be seen as a probability that a certain motif is functionally present in an upstream region of a gene. We will refer to the 107 motif scores as the continuous motif profile of a gene. For both computational simplicity and comprehensibility it might be desirable to threshold the enrichment values and obtain a true/false relationship between gene upstream region and motif. Figure 9(b) shows the resulting median number of motifs per gene for thresholds ranging between 0.65 and 1. Furthermore, the total number of genes without any motif is shown, because we want to restrict this number as much as possible.
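The scoring and thresholding steps can be sketched as follows. This is a minimal sketch, assuming that `exponent_mean_score` receives the window scores $Sc_i$ of one gene and one motif and follows the form of Eq. 7; the function names, the `>=` comparison against the threshold and the toy profiles are assumptions made for illustration only.

```python
import math
from statistics import median


def exponent_mean_score(window_scores):
    """Sigmoid of the log of the mean exponentiated window score (form of Eq. 7)."""
    mean_exp = sum(math.exp(s) for s in window_scores) / len(window_scores)
    x = math.log(mean_exp)
    return 1.0 / (1.0 + math.exp(-x))


def binarize(profile, threshold):
    """Turn a continuous motif profile into a binary motif profile."""
    return [1 if score >= threshold else 0 for score in profile]


def threshold_summary(profiles, threshold):
    """Return (median number of motifs per gene, number of genes without any motif)."""
    counts = [sum(binarize(profile, threshold)) for profile in profiles]
    return median(counts), sum(1 for c in counts if c == 0)


# Hypothetical continuous motif profiles: three genes, four motifs.
profiles = [
    [0.90, 0.70, 0.85, 0.30],
    [0.60, 0.95, 0.40, 0.20],
    [0.50, 0.66, 0.10, 0.70],
]
for threshold in (0.65, 0.82, 0.96):
    print(threshold, threshold_summary(profiles, threshold))
```

Sweeping the threshold in this way yields the two quantities that are traded off against each other: a stricter threshold lowers the median number of motifs per gene but raises the number of genes left without any motif.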
To obtain a reasonable number of motifs that can aid us in distinguishing between gene regulation programs, we set the threshold on the score value so that the median number of motifs for an upstream region equals five (threshold: 0.82). This number was also observed by Zhang et al. [14], who used a database of known and experimentally verified motifs to scan the upstream regions of yeast genes. For vertebrates, Prakash and Tompa found a similar number of 6 [9], based on over-representation in an orthologous human, chimp, mouse and rat dataset. Had we chosen a more stringent threshold, the number of genes without any motif annotation would have increased dramatically, as is visible in Figure 9(b). The set of 107 thresholded motif scores will be called the binary motif profile of a gene. In Figure 10 we show the continuous motif profile with the threshold of 0.82.

Figure 10: A motif profile of a gene with the proposed threshold of 0.82 (score against motif index). These motif profiles appear to be extremely noisy (with mean 0.32 and standard deviation 0.26).

D Clustering

D.1 Hierarchical clustering

To separate the genes into distinct groups we apply an agglomerative hierarchical clustering algorithm. This method begins with each element as a separate cluster and merges them into successively larger clusters. The resulting hierarchical structure is generally displayed in a dendrogram. Cutting this tree at a given height gives a clustering with the desired number of clusters. A pleasant property of hierarchical clustering is that it (as opposed to other methods, like k-means) requires only a distance measure and no actual data values. There are three basic methods to decide which points are grouped together at each step in the clustering algorithm.
If at a certain stage the distance between group A and group B needs to be calculated, we can use either:

• Single linkage: $\min\{d(x, y) : x \in A, y \in B\}$
• Complete linkage: $\max\{d(x, y) : x \in A, y \in B\}$
• Average linkage: $\frac{1}{N_A N_B}\sum_{x \in A}\sum_{y \in B} d(x, y)$

We will use the complete linkage method because this method tends to group the smallest clusters together and in this way provides the most evenly sized clusters. Figure 11 shows that complete linkage provides the most balanced dendrogram.

D.2 Distance measures

Every clustering method needs a measure to compute the distance between (groups of) data points. In straightforward expression data clustering, the correlation between two expression profiles is usually used as distance, because this measure is insensitive to differences in offset and scaling of the profiles. To make use of both expression data and motif enrichment in a clustering scheme, we may combine both distance measures using a weighted sum. To this end, we need a weighting parameter to tune the importance of both information sources. Therefore we propose to compute a separate distance measure for motif enrichment and expression profiles and combine them as follows:

$$J_{ij} = (1 - \alpha)\, d_{E_{ij}} + \alpha\, d_{M_{ij}}, \tag{8}$$

where $\alpha$ is the weighting parameter that sets the balance between the expression distance $d_{E_{ij}}$ and the motif distance $d_{M_{ij}}$. To compute the expression distance we use the correlation between the expression profiles. This is the most common strategy and has proven to provide adequate results.

In contrast to the expression profiles, it is far from trivial to determine a distance measure between motif profiles. A gene is not influenced by a single motif. In fact, the combination of two motifs in the upstream region of a gene can have an effect that is contrary to the effect of the individual motifs. For example, if motif M1 or M2 resides in the upstream region of a gene, this gene can become up-regulated, while a gene with both these motifs is down-regulated.
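The three linkage criteria and the weighted combination of Eq. 8 can be sketched as follows. This is a sketch under stated assumptions rather than the implementation used here: the expression distance is taken to be one minus the Pearson correlation (the text only states that the correlation is used), the motif distance is passed in as a precomputed number, and all function names and example values are hypothetical.

```python
import math


def correlation_distance(x, y):
    """Assumed expression distance: 1 - Pearson correlation of two profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


def combined_distance(d_expr, d_motif, alpha):
    """Weighted sum of expression and motif distance (Eq. 8)."""
    return (1.0 - alpha) * d_expr + alpha * d_motif


def linkage_distance(group_a, group_b, dist, method="complete"):
    """Distance between two groups under single, complete or average linkage."""
    pairs = [dist(x, y) for x in group_a for y in group_b]
    if method == "single":
        return min(pairs)
    if method == "complete":
        return max(pairs)
    if method == "average":
        return sum(pairs) / len(pairs)
    raise ValueError("unknown linkage method: " + method)


# Two hypothetical expression profiles that differ only in offset and scale.
d_expr = correlation_distance([1.0, 2.0, 4.0], [12.0, 14.0, 18.0])
print(d_expr)
print(combined_distance(d_expr, d_motif=0.6, alpha=0.25))
```

Setting `alpha` to 0 reduces the combined distance to the expression distance alone, and setting it to 1 uses only the motif distance.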
We have considered four different methods that compute the motif distance based on binary profiles:

• Hamming distance: $d_H = \frac{s}{N}$
• Jaccard distance: $d_J = 1 - \frac{r}{r + s}$
• MaxiMin distance: $d_M = \max\{\min\{P_1, P_2\}\}$
• Shared motif distance: $d_S = 1 - \frac{r}{N}$

In these distances, $N$ is the total length of the motif profiles, $s$ is the number of differences between the two profiles ($\sum_{i=1}^{N} |P_1(i) - P_2(i)|$) and $r$ is the number of motifs present in both profiles ($\sum_{i=1}^{N} \min\{P_1(i), P_2(i)\}$).

Figure 11: Dendrograms of clusterings with $\alpha \in \{0, 0.5, 1\}$ using both average and complete linkage. When $\alpha$ approaches 1, the symmetrical structure is maintained for complete linkage. (Panels, each for C = 50 clusters: (a) average linkage, $\alpha = 0$; (b) complete linkage, $\alpha = 0$; (c) average linkage, $\alpha = 0.5$; (d) complete linkage, $\alpha = 0.5$; (e) average linkage, $\alpha = 1$; (f) complete linkage, $\alpha = 1$.)

Figure 12: Three artificial motif profiles (A, B and C) where seven motifs (M1 to M7) can either be present or not present. In Table 1 the different distance measures are computed for combinations of these profiles.

To illustrate the properties of each distance measure we have drawn three artificial motif profiles in Figure 12. Table 1 shows the distances for combinations of these profiles. Each method demonstrates a different selective ability between profile combinations. The first three methods always give identical profiles a distance of zero, while the Shared criterion takes the number of present motifs into account and in this way distinguishes between A-A and C-C. The MaxiMin distance gives all profiles that differ by one or more motifs the maximal value of one. The Hamming distance is the only method that separates A-B from A-C. We reason that the extra motif in gene C makes this gene more likely to be part of a different biological process than gene B. Therefore, we prefer to be able to distinguish between these gene pairs. Consequently, we will take the Hamming distance to compute a distance measure between motif profiles.

Table 1: Distance measures for different combinations of motif profiles from Figure 12.

Distance   Profiles
           A-A   A-B   A-C   B-C   C-C
Hamming    0     2/7   3/7   1/7   0
Jaccard    0     1     1     1/2   0
MaxiMin    0     1     1     1     0
Shared     6/7   1     1     6/7   5/7

The downside of the Hamming distance is that the score depends on the length of the motif profile. Moreover, in the motif profiles only a small part of the 107 motifs can be coupled to one of the conditions under investigation.
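The Hamming, Jaccard and Shared distances of Table 1 can be reproduced with a few lines of Python. This is a minimal sketch: the profiles below are hypothetical stand-ins for Figure 12, chosen only so that A and B each contain one motif, C contains the motif of B plus one extra, and A shares no motif with B or C; which of the seven positions are set is an assumption. The function names and the handling of two empty profiles in `jaccard` are assumptions as well.

```python
from fractions import Fraction


def differences(p1, p2):
    """s: number of positions in which the two binary profiles differ."""
    return sum(abs(a - b) for a, b in zip(p1, p2))


def shared_count(p1, p2):
    """r: number of motifs present in both binary profiles."""
    return sum(min(a, b) for a, b in zip(p1, p2))


def hamming(p1, p2):
    return Fraction(differences(p1, p2), len(p1))


def jaccard(p1, p2):
    r, s = shared_count(p1, p2), differences(p1, p2)
    if r + s == 0:  # assumption: two empty profiles are treated as identical
        return Fraction(0)
    return 1 - Fraction(r, r + s)


def shared(p1, p2):
    return 1 - Fraction(shared_count(p1, p2), len(p1))


# Hypothetical profiles over seven motifs, consistent with the values in Table 1.
A = [1, 0, 0, 0, 0, 0, 0]
B = [0, 0, 1, 0, 0, 0, 0]
C = [0, 0, 1, 0, 1, 0, 0]

for name, (p1, p2) in {"A-A": (A, A), "A-B": (A, B), "A-C": (A, C),
                       "B-C": (B, C), "C-C": (C, C)}.items():
    print(name, hamming(p1, p2), jaccard(p1, p2), shared(p1, p2))
```

Exact fractions are used so that the results can be compared directly with the entries of Table 1.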
The non-significant values would introduce a lot of noise in our computation if we used the entire enrichment profile. We are therefore looking for a method that, in the computation of a distance measure, combines all motifs that play a role in the regulatory mechanism that is active under the specific microarray conditions. A smart feature selection method should be able to select the motifs that are truly informative and thereby reduce the number of redundant features.

E Feature selection

We use the initial clustering to select the motifs that are most important in the conditions under investigation. We propose three different methods to rank the motifs in order of importance.

E.1 Cluster best

The most straightforward way of selecting the most informative motifs for our clustering task is to first cluster the data on the expression profile only, then select the motif with the highest enrichment from each cluster and use only these motifs together with the expression data in the next clustering iteration. We can compute the motif enrichment either by counting the number of genes that have the motif in their upstream region or by computing the p-value of the number of genes that have this motif compared to the entire dataset.

E.2 P-value threshold

Instead of selecting the best motif from each cluster, we can also compute the p-value for all motif-cluster combinations and select the motifs that score below a certain threshold in one or more clusters. The advantage of this method is that we allow a combination of motifs to regulate one cluster. Also, the non-informative clusters will not determine the selected motifs. As we do not thoroughly investigate the ideal number of clusters, this is a very important feature.

E.3 Decision tree

In classification problems the decision tree is well known and widely used. This approach tries to divide the search space into rectangular regions, so that all classes are separated.
The input for a decision tree is a database of samples which are all given a class label, and a compendium of attributes that together contain all information we have about our samples. There are many different methods to grow decision trees, but all rely on the same principle: repeatedly split the data into smaller and smaller groups in such a way that each new generation of nodes has greater purity than its ancestors with respect to the target variables [2]. We propose to consider motif presences as features and learn a decision tree to predict cluster labels. In this case the decision tree creates rules of the form:

$$M_{ik} \wedge \neg M_{il} \Rightarrow g_i \in C_1$$
$$\neg M_{ik} \wedge M_{il} \Rightarrow g_i \in C_2$$
$$M_{ik} \wedge M_{il} \Rightarrow g_i \in C_3$$

This allows the occurrence of two motifs in a gene's upstream region to result in a different gene expression profile than the occurrence of each motif separately. This method thus predicts a clustering and creates a general understanding of the regulation mechanism at the same time. Because the basic strategy in the choice of splits is to take the splitting attributes with the highest information gain first [4], the motifs at the top of the tree can be considered to be most informative to distinguish the clusters. By using the top-n motifs from the tree as a feature selection mechanism we can create an iterative method that uses the resulting tree in the next clustering step, so that we use the motifs that are most important in our new clustering. With this method we can choose to set a fixed depth level and select all important motifs for the next cluster iteration.

Figure 13: An example of a decision tree using motif presence to make splits in a gene database

E.4 Method comparison

Because we did not use any data knowledge to set the number of clusters to 50, we do not expect to find only informative clusters. Therefore, we strive to make our methodology as independent of the number of clusters as possible. Feature selection method 1 (Cluster Best) picks the best motif from each cluster, and therefore conflicts with our goals. To compare method 2 (P-value threshold) and method 3 (Decision tree) we compare the motif rankings both methods produce. To this end, we have used both methods in an iterative clustering scheme that uses the selected motifs from the previous iteration to compute the distance measure. Figure 14 shows which motifs were selected in each iteration. We have computed the results over 100 samplings of 80% of the data; the darkness of the lines therefore indicates the proportion of the data samplings in which that motif was selected.

Figure 14: Evaluation of the proposed motif selection methods. (a) Selection method 2: Selecting the motifs with p-value < $10^{-4}$. (b) Selection method 3: Selecting the top motifs from the decision tree (depth ≤ 9). (Both panels show the selected motifs over 10 iterations, motif index against selection iteration.)

Method 2 seems to be less dependent on the data sampling (there is a smaller number of light blue lines). Moreover, this method more often picks the motifs that we know play a role in this system. If we regard the motifs that were initially associated with the conditions under investigation by Tai et al. [12], we see that method 2 in general gives the known motifs a much higher rank than method 3 (see Table 2).
We shall therefore use the p-value threshold method to select the most informative motifs.

Table 2: The motifs mentioned by Tai et al. [12] and the rank they get according to our feature selection methods. Db indicates the index of the motif in our database.

                 Rank
Db   Motif       Method 2   Method 3
8    MIG1        11         55
48   GLN3        4          6
49   GZF3        3          7
41   DAL82       7          33
47   GAT1        8          28
9    PHO4        2          4
38   CBF1        1          32
58   MET31/32    10         1
59   MET4        88         35

Figure 14 also shows that the selected group of motifs hardly changes over the 10 iterations; we will therefore use the motif selection only once and use the second clustering as the final result.

E.5 Motif ranking

In the initial clustering we set $\alpha = 0$ and therefore cluster purely on expression data. We compute a consensus clustering using 500 samplings of 80% of the data. From this initial clustering we compute p-values of the motif occurrences for each cluster using the hypergeometric distribution. Then, we rank all motifs according to the lowest p-value they attain in any of the clusters. In Table 3 the p-values and consensus sequences of our motif database are shown. Since we collected these motifs from three different databases (18 from TransFac [13], 13 from SCPD [15] and 76 from Harbison et al. [5]), we also show the source in Table 3.
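The ranking step described above can be sketched as follows. This is a minimal sketch, assuming a one-sided (upper-tail) hypergeometric test for over-representation; the function names, the data layout (sets of gene identifiers) and the toy example are hypothetical and not taken from the thesis.

```python
from math import comb


def hypergeom_pvalue(k, cluster_size, n_with_motif, n_genes):
    """P(X >= k): chance of seeing at least k motif-carrying genes in a cluster."""
    upper = min(cluster_size, n_with_motif)
    favourable = sum(
        comb(n_with_motif, i) * comb(n_genes - n_with_motif, cluster_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_genes, cluster_size)


def rank_motifs(motif_genes, clusters, n_genes):
    """Rank motifs by the lowest p-value they attain in any of the clusters."""
    best = {}
    for motif, genes in motif_genes.items():
        best[motif] = min(
            hypergeom_pvalue(len(genes & cluster), len(cluster), len(genes), n_genes)
            for cluster in clusters
        )
    return sorted(best.items(), key=lambda item: item[1])


# Hypothetical toy data: ten genes in two clusters, two motifs.
clusters = [{0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}]
motif_genes = {"M1": {0, 1, 2, 3}, "M2": {0, 5}}
ranking = rank_motifs(motif_genes, clusters, n_genes=10)
print(ranking)
```

In this toy example M1 is concentrated in the first cluster and therefore ranks above M2, whose two occurrences are spread over both clusters.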
Table 3: The motif database and enrichment p-values in the initial clustering

Db   Motif              Source    P-value   Consensus sequence
9    PHO4               TransFac  4.04E-11  TCCCACGTGGGC
48   GLN3               Harbison  6.83E-11  GATAAGATA
49   GZF3               Harbison  7.65E-10  GATAAG
38   CBF1               Harbison  2.72E-09  TCACGTG
41   DAL82              Harbison  2.76E-09  GATAAGA
47   GAT1               Harbison  2.01E-07  AGATAAG
17   HAP2/3/4           TransFac  3.52E-07  ACCCGCCAATCACCGG
1    repressor of CAR1  TransFac  5.43E-07  CACTAGCCGCCGAGGGC
39   CIN5               Harbison  8.84E-07  TTACGTAA
98   TYE7               Harbison  3.37E-06  TCACGTGAT
58   MET31/32           Harbison  4.38E-06  AAACTGTGG
34   YAP3/5/6 ARR1      Harbison  6.29E-06  TTACTAA
104  YAP7               Harbison  7.74E-06  CTTAGTAAT
33   AFT2               Harbison  1.70E-05  GGGTGC
77   RPN4               Harbison  3.75E-05  TTTGCCACC
27   SWI5               SCPD      5.51E-05  ACAGCATGCTGG
51   HAP1               Harbison  6.15E-05  GGAAATATCGG
8    MIG1               TransFac  7.80E-05  GAAAAAAATCCGGGGTA
101  UME6               Harbison  1.06E-04  TAGCCGCCGA
80   SIG1               Harbison  1.59E-04  AGGAAACAAAAA
69   RCS1               Harbison  3.41E-04  GGGTGCAGT
63   NRG1               Harbison  5.00E-04  GGACCCT
37   CAD1               Harbison  5.23E-04  ATTAGTAAGC
25   REB1               SCPD      5.69E-04  TTACCCG
92   SUM1               Harbison  8.10E-04  GCGTCAGAAAA
94   SWI4               Harbison  1.13E-03  GACGCGAAA
22   SCB                SCPD      1.40E-03  CACGAAA
46   GAL80              Harbison  1.56E-03  CGGCCCCCCCCCCCCCG
16   RAP1               TransFac  1.89E-03  AACACCCATACACC
99   UGA3               Harbison  2.13E-03  CCGCCCCCGG
74   RLM1               Harbison  2.22E-03  CTAAAAATAG
19   RLM1               SCPD      2.28E-03  CGTTCTATAAATAGACCC
79   SFP1               Harbison  2.37E-03  ACCCGTACAT
84   SNT2               Harbison  2.57E-03  CGGCGCTACCA
57   MBP1               Harbison  2.65E-03  GACGCGT
21   CSRE               SCPD      3.14E-03  TCCGGATGAATGG
60   MSN2               Harbison  3.42E-03  AAGGGGCGG
44   FKH1               Harbison  3.73E-03  TTGTTTAC
91   STP1               Harbison  3.95E-03  GCGGCCCCGCGGC
59   MET4               Harbison  3.96E-03  AAAAACTGTGGCGCC
62   NDD1               Harbison  4.14E-03  TTTCCCAATTGGG
61   MSN4               Harbison  4.15E-03  CAGGGGC
35   AZF1               Harbison  4.53E-03  TTTTTCTTTTCCTGTTTC
89   STB4               Harbison  4.90E-03  TCGGAACGA
24   PDR1/PDR3          SCPD      4.99E-03  TCCGCGGA
107  ZAP1               Harbison  5.01E-03  ACCCTCAAGGTTGT
67   PHO2               Harbison  5.20E-03  CGTGCGGTGCG
78   RTG3               Harbison  5.33E-03  GGTCAC
102  XBP1               Harbison  5.39E-03  CTTCGAG
81   SIP4               Harbison  6.74E-03  CGGCTGAATGGAA
70   RDS1               Harbison  7.32E-03  TCGGCCGA
93   SUT1               Harbison  7.69E-03  GCGGGGCCGG
85   SOK2               Harbison  8.93E-03  TGCAGGAA
42   DIG1               Harbison  9.36E-03  TGAAACA
3    MATa1              TransFac  9.36E-03  TGATGTAGCT
32   ACE2               Harbison  1.03E-02  TGCTGGT
55   LEU3               Harbison  1.10E-02  CCGGTACCGG
106  YOX1               Harbison  1.18E-02  ACAATACTGACG
103  YAP1               Harbison  1.20E-02  TTAGTCAGC
23   MCB                SCPD      1.26E-02  ACGCGT
97   THI2               Harbison  1.28E-02  GAAACCCTAAGA
53   INO2               Harbison  1.31E-02  CACATGC
43   FHL1               Harbison  1.36E-02  ATGTACGGGTG
65   PDR1               Harbison  1.36E-02  CCGCCGAATAA
26   ROX1               SCPD      1.43E-02  CCCATTGTTCTC
73   RIM101             Harbison  1.49E-02  TGCCAAG
87   SPT23              Harbison  1.53E-02  GAAATCAA
75   RLR1               Harbison  1.68E-02  ATTTTCTTCTTT
90   STB5               Harbison  1.70E-02  CGGTGTTATA
2    ABF1               TransFac  1.71E-02  CCATATCACTATACACGAAACT
50   HAC1               Harbison  1.78E-02  GGCCAGCGTGTC
5    GCR1               TransFac  1.78E-02  GGCTTCCAC
83   SKO1               Harbison  2.11E-02  ACGTCA
20   SMP1               SCPD      2.17E-02  GTGCTGCTATTTATAGCAGC
18   STRE               TransFac  2.20E-02  TCAGGGGG
86   SPT2               Harbison  2.23E-02  CCTGTATCTAA
95   SWI6               Harbison  2.24E-02  TTTCGCGT
64   OPI1               Harbison  2.29E-02  TCGAACC
10   MCM1               TransFac  2.32E-02  TTTCCCAATCGGGTAA
68   PUT3               Harbison  2.35E-02  CCCGGCCCCCCCCCCCCG
36   BAS1               Harbison  2.40E-02  TGACTC
15   AP-1               TransFac  2.47E-02  GTGAGTCAG
82   SKN7               Harbison  2.48E-02  GCCCGGGCC
11   HSF2               TransFac  2.61E-02  AGAACAGAACAGAAC
7    GAL4               TransFac  2.69E-02  CATCGGCGCACTGTCCTCCGAAC
13   HSF4               TransFac  2.71E-02  AGAACGTTCTAGAAC
88   STB1               Harbison  2.75E-02  AAACGCGAAA
45   FKH2               Harbison  2.82E-02  AAAGGTAAACAA
30   UASPHR             SCPD      2.85E-02  TTTTCTTCCTCG
76   RPH1               Harbison  3.30E-02  CCCCTTAAGG
72   RGT1               Harbison  3.64E-02  CGGACCA
105  YDR026c            Harbison  3.64E-02  TTTACCCGGC
28   STE12              SCPD      3.77E-02  ATGAAACC
66   PHD1               Harbison  4.04E-02  GCCGCAGG
14   HSF5               TransFac  4.51E-02  GTTCTAGAACAGAAC
40   DAL81              Harbison  4.91E-02  AAAAGCCGCGGGCGGGATT
4    GCN4               TransFac  5.07E-02  CCGCGCCGGTGACTCATCCCGCGCC
100  UME1               Harbison  5.12E-02  AAGGAAAAGTA
96   TEC1               Harbison  5.30E-02  AGGAATG
71   RFX1               Harbison  5.49E-02  TTGCCATGGCAAC
56   MAC1               Harbison  5.82E-02  GAGCAAA
12   HSF3               TransFac  5.92E-02  AGAACAGAACGTTCT
29   TBP                SCPD      5.97E-02  TATAAAA
6    ADR1               TransFac  8.29E-02  CGGGGG
54   INO4               Harbison  1.08E-01  CATGTGAAAA
31   XBP1               SCPD      1.08E-01  GCCTCGAGGCGG
52   HSF1               Harbison  1.22E-01  TTCTAGAACCTTC

F Gene Ontology

The Gene Ontology (GO) Consortium [1] has gathered information about gene functions from multiple databases and combined this information into a single annotation tree. They have made a great effort to describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. This project has become a global center for gene function information. In order to evaluate the performance of our method we have used the GO database to compute the enrichment of GO annotations within clusters. If a cluster obtains a higher concentration of a certain GO annotation, we expect to have found genes that are more related in terms of biological function.

Figure 15: The Gene Ontology database collects biological gene information in a hierarchical tree structure.

G Iterative clustering

We have iteratively clustered the data, using the best motifs from the consensus clustering of the previous iteration to compute the distance measure. In these clusterings $\alpha$ was fixed to 0.25. Table 4 shows the ranking of the motifs with p-values < $10^{-6}$ for ten clustering iterations. The set of motifs does not seem to converge to a select group.

Table 4: Motif ranking for 10 iterations, using the consensus clustering to select the motifs that attain a p-value < $10^{-6}$ in any of the clusters. Iteration 0 shows the motif ranking of the motifs selected by the initial clustering.
The numbers in this table indicate the index in our motif database, which can be found in Table 3.

Iteration  Motif Database Index (p < $10^{-6}$)
0          9 48 49 38 41 47 17 1 39
1          49 41 48 38 9 17 58 98 47 8 33
2          38 98 48 41 49 9 17 58
3          17 49 38 41 98 48 58 9 39 8
4          38 9 58 17 49 41 98 39 48
5          17 38 9 58 41 49 8 48 39 98
6          38 9 41 49 58 17 48 98 39 1
7          38 17 58 49 9 41 48 98
8          38 9 17 58 49 41 98 48 8 39
9          17 38 41 49 48 9 58 98 39 1
10         38 9 49 17 41 58 48 39 98 8 1

This shows that the consensus clustering is not stable enough to reach convergence. When we instead compute the mean motif enrichment of the 500 sampled clusterings, the iterative clustering does converge to a clear set of motifs. After three iterations the motif ranking remains constant (see Table 5). The motifs that obtain a p-value smaller than $10^{-6}$ are: CBF1, GZF3, GLN3, DAL82, HAP2/3/4, PHO4, TYE7, MET31/32, GAT1.

Table 5: Motif ranking for 10 iterations, using the mean p-values of the 500 sampled clusterings to select the significantly enriched motifs. This table also shows the motifs with p < $10^{-4}$ and p < $10^{-5}$. (Columns: Iteration, 0 to 10; Motif Database Index for p < $10^{-6}$, p < $10^{-5}$ and p < $10^{-4}$. The assignment of the entries to rows could not be fully recovered from the source; they are listed in source order.)

First rank, iterations 0 to 10: 38 38 38 38 38 38 38 38 38 38 38
Following ranks, in groups of four: 9 48 98 49 | 17 41 49 48 | 49 48 17 41 | 49 48 17 41 | 49 48 41 17 | 49 48 17 41 | 49 48 41 17 | 49 17 48 41 | 49 48 17 41 | 49 48 41 17 | 49 48 17 41
Subsequent columns: 9 (ten entries), 98 (ten entries), 58 (ten entries), 47 (nine entries)
Entries under p < $10^{-5}$ and p < $10^{-4}$: 17 41 58 47 47 39 8 8 39 8 39 8 8 39 8 8 8 8 39 8 39 8

Finally, we computed a consensus clustering based on 5000 cluster iterations, using the previously selected set of motifs. The resulting clusters are shown in Figure 16. Although the order of the motifs has changed, the set of motifs with a p-value < $10^{-6}$ remains almost the same. Only MIG1 has been added to this set. In Table 5 we can see that this motif was already the next in rank.

Figure 16: Clusters that show the highest motif enrichment in the consensus clustering computed over 5000 iterations. For this clustering we used the motifs: CBF1, GZF3, GLN3, DAL82, HAP2/3/4, PHO4, TYE7, MET31/32 and GAT1. The motif weight $\alpha$ was set to 0.25. (Panels: clusters A to G with 45, 129, 90, 6, 55, 71 and 22 genes; expression under aerobic and anaerobic C, N, P and S limitation; enrichment of the motifs 8-MIG1, 9-PHO4, 47-GAT1, 49-GZF3, 38-CBF1, 17-HAP2/3/4, 58-MET31/32, 98-TYE7, 41-DAL82 and 48-GLN3.)

Figure 16 shows all the previously identified regulation mechanisms (Clusters A, B, C, E and F). Cluster C shows both carbon and phosphorus limitation response in the aerobic environment. This cluster was previously identified as two separate groups, but since both mechanisms are regulated by the HAP motifs they are easily clustered together. The phosphorus cluster has been split into two different groups (Clusters E and G). The first cluster shows high expression under phosphorus limitation in both aerobic and anaerobic environments, while the second cluster shows little response in the anaerobic case. Furthermore, we identify a very small cluster responding to carbon limitation in an aerobic environment (Cluster D). These genes are regulated by TYE7 and CBF1, which are known to act under sulfur limitation. We have found no further evidence to support this finding.

References

[1] Ashburner, M., et al., "Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium". Nature Genetics Vol. 25, pp. 25-29, 2000.
[2] Berry, M.J.A. and Linoff, G.S. "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management". Wiley Publishing, Inc., Indianapolis, 2004.
[3] Crooks, G.E., et al., "WebLogo: A Sequence Logo Generator". Genome Research Vol. 14, pp. 1188-1190, 2004.
[4] Dunham, M.H. "Data Mining, Introductory and Advanced Topics". Pearson Education, Inc., 2003.
[5] Harbison, C.T., et al., "Transcriptional regulatory code of a eukaryotic genome". Nature Vol. 431, pp. 99-104, 2004.
[6] Hertz, G.Z. and Stormo, G.D. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences". Bioinformatics Vol. 15, Nos. 7/8, pp. 563-577, 1999.
[7] Jensen, S.T., et al., "Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective". Statistical Science Vol. 19, No. 1, pp. 188-204, 2004.
[8] Kellis, M. "Computational comparative genomics: genes, regulation, evolution". Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2003.
[9] Prakash, A. and Tompa, M. "Discovery of regulatory elements in vertebrates through comparative genomics". Nature Biotechnology Vol. 23, No. 10, pp. 1249-1256, 2005.
[10] Segal, E., Yelensky, R. and Koller, D. "Genome-wide discovery of transcriptional modules from DNA sequence and gene expression". Bioinformatics Vol. 19, Suppl. 1, pp. i273-i282, 2003.
[11] Stormo, G.D. "DNA binding sites: representation and discovery". Bioinformatics Vol. 16, No. 1, pp. 16-23, 2000.
[12] Tai, S.L., et al., "Two-dimensional Transcriptome Analysis in Chemostat Cultures". The Journal of Biological Chemistry Vol. 280, No. 1, pp. 437-447, 2005.
[13] Wingender, E., et al., "TRANSFAC: an integrated system for gene expression regulation". Nucleic Acids Research Vol. 28, No. 1, pp. 316-319, 2000.
[14] Zhang, Z., Gu, J. and Gu, X. "How much expression divergence after yeast gene duplication could be explained by regulatory motif evolution?". TRENDS in Genetics Vol. 20, No. 9, pp. 403-407, 2004.
[15] Zhu, J. and Zhang, M.Q. "SCPD: a promoter database of the yeast Saccharomyces cerevisiae". Bioinformatics Vol. 15, Nos. 7/8, pp. 607-611, 1999.