Master Thesis
Discovery of co-regulated gene clusters
by combining known transcription factor
binding motifs and gene expression
profiles
Thesis Committee:
Ir. T.A. Knijnenburg
Prof.dr.ir. M.J.T. Reinders
Dr.ir. D. de Ridder
Dr.ir. E.P. van Someren
Ir. M.H. van Vliet
Dr. C. Witteveen
Author: Maarten Clements
Email: [email protected]
Student number: 1006398
Thesis supervisors: Prof.dr.ir. M.J.T. Reinders, Dr.ir. E.P. van Someren, Ir. T.A. Knijnenburg
Date: April 7, 2006
Information and Communication Theory Group
Incorporating motifs in gene clustering
Discovery of co-regulated gene clusters by combining known transcription
factor binding motifs and gene expression profiles
Maarten Clements
Information and Communication Theory Group, Faculty of Electrical Engineering, Mathematics and Computer Science,
Delft University of Technology, 2600 GA Delft, The Netherlands.
ABSTRACT
Motivation: The standard approach for finding genes that
are involved in the same biological process is to investigate
the change in gene expression under certain conditions and
cluster the genes according to this information. However,
due to the limited number of measured conditions, genes
clustered together are co-expressed, but not necessarily co-regulated. In this paper, we provide a method that combines
known transcription factor binding site information with gene
expression into one clustering scheme. In this way we aim to
find gene clusters that are co-regulated under certain growth
limitation conditions, through common motifs.
Results: We have shown that the integration of known motifs into the gene clustering scheme can improve the GO enrichment within the clusters. Moreover, we have created a
framework that makes it easier to understand the regulation
of gene clusters and we have made a step toward finding
truly co-regulated clusters. We show the usefulness of this
approach by applying it to analyse data of yeast under different cultivation conditions. Using our method, we have detected a very compact cluster that contains all genes from
the allantoin catabolism pathway, regulated by the motifs
DAL82, GAT1, GLN3 and GZF3. Furthermore, we found a
cluster with very high enrichment of the GO category sulfur
metabolism and we proposed a few new genetic regulation
mechanisms.
Contact: [email protected]
Gene Clustering | Gene Regulation | Transcription Factors | Binding Motifs
1 Introduction
The central dogma of molecular biology states that when a
gene on the DNA is transcribed, RNA is formed which, in
turn, is translated into proteins. The rate at which transcription takes place is called the activity of a gene, or gene expression. Gene expression is tightly controlled at multiple stages,
but mainly at the initiation of transcription. Here, the chromatin structure is uncoiled around the gene to be expressed
and the proteins that form the transcription machinery are
recruited. These processes are regulated by a specific class
of DNA-binding proteins called transcription factors (TFs)
(17). TFs are proteins that generally bind to the upstream regions
of a gene and in that way induce or repress the activation
of that gene.
© ICT - TU Delft 2006
Transcription factors bind the double stranded DNA helix in sequence specific binding sites or regulatory motifs.
Regulatory motifs are short nucleotide sequences (6-20 bp),
that show some degree of sequence variation and follow few
known rules. This makes direct identification of functional
motifs a challenging task. The majority of motifs have been
found by biological experimentation, such as systematic mutation of individual promoter regions; however, this process
is laborious and unsuited for genome-scale analysis (17).
In the last few years many new computational methods
have been developed to automatically detect regulatory motifs. These tools can be divided into two main categories:
scanning methods and de novo methods. In a scanning
method, one uses a motif representation resulting from experimentally determined binding sites to scan the genome sequence to find more matches (16). In de novo methods, one
attempts to find novel motifs that are enriched in a set of upstream sequences (10; 11; 14; 15; 28).
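As a rough illustration of the scanning approach described above, the sketch below slides a small position weight matrix over a sequence and reports the windows whose log-odds score reaches a cutoff. The toy matrix, the uniform background of 0.25 and the cutoff are assumptions chosen for illustration; they are not the PWMs or the score function used in this work.

```python
import math

BASES = "ACGT"

def log_odds_pwm(prob_rows, background=0.25):
    """Convert per-position base probabilities into log2-odds scores."""
    return [{b: math.log2(row[b] / background) for b in BASES} for row in prob_rows]

def scan(sequence, pwm, cutoff):
    """Return (start, score) for every window scoring at least `cutoff`.

    Assumes the sequence contains only the upper-case letters A, C, G, T.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= cutoff:
            hits.append((start, score))
    return hits

def toy_column(preferred):
    """Hypothetical column: 0.85 for the preferred base, 0.05 for the others."""
    return {b: (0.85 if b == preferred else 0.05) for b in BASES}

# Hypothetical 6-bp motif that strongly prefers CACGTG.
toy_pwm = log_odds_pwm([toy_column(b) for b in "CACGTG"])
hits = scan("TTTCACGTGAAACACGTTTT", toy_pwm, cutoff=8.0)
```

With this toy matrix only the exact CACGTG window passes the cutoff; the CACGTT window later in the sequence has one mismatch and falls below it.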
It is desirable to understand which genes are similarly regulated during a certain biological process because this drastically decreases the search space when we are looking for
genetic pathways. The usual approach to this problem is to
cluster genes that demonstrate a similar expression profile under different conditions or over time. In order to identify the
regulation program, de novo motif detection methods can be
applied to the upstream coding regions of these gene clusters
to detect frequently occurring sequence patterns, which may
be related to certain transcription factors (23; 30). In these
methods, the found regulation program of a gene cluster is
considered as the final result. They do not check whether the
found regulatory program sufficiently explains the observed
expression of all members of the gene cluster.
Segal et al. (27) used a more advanced method that tries to construct complex regulatory mechanisms from the expression profiles of supposed regulating genes. However, Segal et al. assume that the expression level of these TF-producing genes is directly related to the expression of the genes that are regulated by them, although enough biological evidence against this simple model exists (18). Beer et al. (3) circumvent the need to know the TF abundance by using sequence data as input to their method to derive complex rules utilizing AND, OR, and NOT logic, with significant constraints on motif strength, orientation, and relative position. In this way, a large number of gene regulation hypotheses is generated, although these hypotheses need to be validated before biological conclusions can be drawn, since they encompass de novo motifs.
Figure 1: The goal of the proposed method is to find gene clusters that are co-expressed due to the same motifs. The reasoning behind the proposition that the integration of motif enrichment can accomplish this is threefold. Scenario 1: A cluster that is actually regulated by two different motifs is split up into separate clusters. Scenario 2: A cluster showing homogeneous expression is shrunk to a smaller cluster in which all genes contain the same motif. Scenario 3: Genes that show weak co-expression are integrated in the cluster because they share the same motif.
In this work we propose to integrate the presence of known
regulating elements in the upstream genetic region of genes
together with their expression levels as a combined input to
the clustering system. Like all biological processes, gene expression and upstream motif enrichment demonstrate a high
rate of random variation which can lead to the detection of
spurious relationships. As these random variations are independent, an integration of both concepts in the clustering
method may improve the discovery of transcriptional modules that are composed of genes that are co-regulated through
a common motif or combination of motifs. The ways in which
the clustering can profit from the additional motif information
are summarized in Figure 1.
Related methods have been proposed that also use the regulation program to adapt the grouping of genes. Segal et al.
(26) use an EM-algorithm that iteratively partitions the gene
set and uses this gene partition to detect new motif candidates. In this way transcriptional modules are built that are
both coherent in expression profiles and significantly enriched
for common binding sites.
Middendorf et al. (19) use both gene regulators and putative binding sites to build a decision tree that tries to explain
the gene expression profiles in terms of regulators and motifs.
A similar method from Ruan et al. (24) applies a multivariate regression tree to discover a model for gene expression
patterns.
Figure 2: Flow diagram: The integration of enriched transcription factor binding sites into the clustering process of gene expression data. After preselection of the data, we compute gene distances on both expression and motif profiles. The motif distance is computed on a subset of the motifs, selected by the initial clustering. The second clustering step combines both information sources with weighting parameter α, which is optimized by finding the clustering with highest GO enrichment. Finally, C1 and C2 represent the initial and combined consensus clustering that are compared to show that our method generates more biologically relevant clusters.
[Diagram steps: (1) gene expression data, (2) SAM selection, (3) find upstream regions, (4) motif scanning with 107 PWMs, (5) compute motif profile, (6) compute expression distance dE, (7) compute motif distance dM, (8) initial clustering (C1), (9) select best motifs, (10) combined clustering on dC with weights 1 − α and α, (11) cluster validation with the Gene Ontology database (C2).]
These methods generally aim to find new motifs together with a cluster of genes that appears to be regulated by these motifs. In this way these methods might produce gene clusters that are not biologically interpretable because both the gene cluster and the regulation program are free parameters. The fact that our method only inputs known motifs puts the focus on the grouping of genes and ensures that the resulting regulation is biologically relevant.

2 List of contributions
Here we give a short list of the main contributions of this paper.
• Using the binding data of Harbison et al. (9) as ground truth, we have made an objective comparison between different methods to compute a binding probability of a given transcription factor motif and upstream region. [Not in main paper]
• We established a framework to compute combined gene distances based on both upstream motif enrichment and microarray data.
• Using a motif selection method on an expression data clustering, we directly select the motifs that are active under the tested conditions. Different feature selection methods (such as decision trees and the method described in this paper) were compared and evaluated based on curated motif knowledge. [Comparison not in main paper]
• We used the Gene Ontology database (2) to optimize the weight of the motif information, and showed that it can be used as an independent cluster validation method.
• Because only validated motifs are used as an input to our system, we derive clusters that are biologically relevant in terms of regulation and expression. Our method directly shows the link between motifs and the genes they regulate.
• We have proposed several new biological regulation mechanisms in Saccharomyces cerevisiae.

3 Combining gene expression and gene regulation
Our proposed methodology is depicted in Figure 2. To evaluate our method, we have employed a dataset that comprises 6383 Saccharomyces cerevisiae genes with expression values measured over eight well-defined conditions (Fig. 2:1; see Methods section, M7.1). After preselection of the genes that show the most significant response under one or more conditions using the significance analysis of microarrays (SAM) algorithm (31) (Fig. 2:2), we retain 2497 genes for our analysis (M7.2). From these genes we extracted the 1000 bp upstream region using gene location data from the Saccharomyces Genome Database (SGD) (4) and the S288C Saccharomyces cerevisiae strain from Ensembl V35 (13) (Fig. 2:3).

Using a compendium of 107 position weight matrices (PWMs), we scanned the upstream regions of the genes for instances of known transcription factor binding motifs (M7.3; Fig. 2:4). To obtain a single score for each gene-motif pair, we have adopted the score function from Segal et al. (26), which combines all scores from the upstream region into a single value (M7.4). We threshold these continuous values to obtain a true-false relationship for each gene-motif combination. The set of 107 thresholded motif scores will be called the binary motif profile of a gene (M7.5; Fig. 2:5).

We make use of the Pearson correlation coefficient to compute the distance between genes based on their expression profiles (Fig. 2:6). The Pearson correlation is generally accepted to provide a useful distance measure for grouping co-regulated genes because it is insensitive to differences in offset and scaling of the profiles (1; 12).

However, it is not trivial to define a distance measure between genes based on their motif enrichment profile. The main difficulty is that the combinatorial effect of two factors may differ from the individual effect of the factors (3; 17). After comparison of several measures (see Supplement D.2), we selected the normalized Hamming distance on the binary motif profiles to compute this distance, because this measure has a large selective ability between possible combinations of profiles (M7.6; Fig. 2:7).

To be able to tune the influence of the regulation information, we combine both expression distance and motif enrichment into a single distance measure between genes i and j as follows:

$$d^{C}_{ij} = (1 - \alpha)\, d^{E}_{ij} + \alpha\, d^{M}_{ij}, \qquad (1)$$
where α (0 ≤ α ≤ 1) is the weighting parameter that sets the
balance between the expression distance $d^{E}_{ij}$ and the motif
distance $d^{M}_{ij}$. On this combined distance measure $d^{C}_{ij}$, we
employ hierarchical clustering, using complete linkage, to divide the genes into 50 distinct groups. We expect that this
number is slightly above the true number of clusters in the
dataset, so that there is enough possibility to obtain compact
clusters without over-segmenting the data. In order to improve the robustness of the clusters, we iteratively cluster 500
times on samplings of 80% of the data and combine the resulting clusterings using consensus clustering, which has proven
to provide more reliable data groupings (6; 20) (M7.7).
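The combined distance of equation (1) can be sketched in a few lines. The example below is only an illustration under stated assumptions: the expression distance is taken as 1 minus the Pearson correlation (one common choice; the exact scaling used in M7.6 is not reproduced here), the motif distance is the normalized Hamming distance on binary profiles, and all function names and the toy profiles are hypothetical.

```python
import math

def pearson_distance(x, y):
    """1 - Pearson correlation (assumed form of the expression distance).

    Ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated).
    Raises ZeroDivisionError for a constant profile.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)

def hamming_distance(u, v):
    """Fraction of positions at which two binary motif profiles differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)

def combined_distance(expr_i, expr_j, motif_i, motif_j, alpha):
    """Equation (1): (1 - alpha) * expression distance + alpha * motif distance."""
    d_e = pearson_distance(expr_i, expr_j)
    d_m = hamming_distance(motif_i, motif_j)
    return (1.0 - alpha) * d_e + alpha * d_m

# Toy data: two genes over eight conditions and four selected motifs.
expr_a = [1, 2, 3, 4, 5, 6, 7, 8]
expr_b = [2, 4, 6, 8, 10, 12, 14, 16]   # same shape, different scale
motifs_a = [1, 0, 1, 0]
motifs_b = [1, 1, 0, 0]
d = combined_distance(expr_a, expr_b, motifs_a, motifs_b, alpha=0.25)
```

In this toy case the two expression profiles are perfectly correlated, so the combined distance comes entirely from the motif term: 0.25 times a Hamming distance of 0.5.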
We expect that not all motifs in our database are functionally active in the investigated experimental setup. In order to
select the motifs that play a role in the conditions under investigation, we initially cluster the data purely on the expression
distance (Fig. 2:8). From the resulting clusters we select the active motifs by computing the significance of motif enrichment
using the hypergeometric distribution (M7.8; Fig. 2:9).
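A minimal sketch of such an enrichment test is given below, assuming the usual upper-tail form of the hypergeometric test: the probability of seeing at least the observed number of motif-containing genes in a cluster when the cluster is drawn at random from all genes. The function name and the toy numbers are hypothetical; the threshold constant follows the Bonferroni correction described in Section 4.1 (0.005 divided by 50 clusters times 100 motifs).

```python
from math import comb

def enrichment_p_value(total_genes, genes_with_motif, cluster_size, hits_in_cluster):
    """Upper-tail hypergeometric probability P(X >= hits_in_cluster).

    X is the number of motif-containing genes in a random sample of
    `cluster_size` genes drawn without replacement from `total_genes`.
    """
    upper = min(genes_with_motif, cluster_size)
    tail = sum(
        comb(genes_with_motif, k)
        * comb(total_genes - genes_with_motif, cluster_size - k)
        for k in range(hits_in_cluster, upper + 1)
    )
    return tail / comb(total_genes, cluster_size)

# Threshold from Section 4.1: 0.005 corrected for 50 clusters and 100 motifs.
BONFERRONI_THRESHOLD = 0.005 / (50 * 100)

# Toy example: 10 genes, 4 carry the motif, a cluster of 3 contains 2 of them.
p_toy = enrichment_p_value(10, 4, 3, 2)
```

Exact integer binomials keep the sketch free of external dependencies; for the 2497 genes retained here the same function works unchanged, since Python integers do not overflow.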
In the combined clustering, we use only the selected motifs to compute the motif distance between genes (Fig. 2:10). We now combine both expression and motif information using equation (1), varying α between 0 and 1. We optimize α by finding the clustering that obtains the highest enrichment of functional categories using the annotation database GO (2) (M7.9; Fig. 2:11).

Finally, we compare the consensus clustering of the initial clustering (α = 0) with the consensus clustering at the selected ideal value of α. We examine the differences between the clusters closely. The importance of the combined clustering is shown by relating these differences to the scenarios in Figure 1. Furthermore, we show that the improved clusters have an increased biological relevance.

Motif        p-value
PHO4         $4.04 \times 10^{-11}$
GLN3         $6.83 \times 10^{-11}$
GZF3         $7.65 \times 10^{-10}$
CBF1         $2.72 \times 10^{-9}$
DAL82        $2.76 \times 10^{-9}$
GAT1         $2.01 \times 10^{-7}$
HAP2/3/4     $3.52 \times 10^{-7}$
Rep of CAR   $5.43 \times 10^{-7}$
CIN5         $8.84 \times 10^{-7}$
Figure 3: Motifs that are enriched in the initial clustering. We show only the motifs that attained a p-value smaller than $10^{-6}$ in any of the 50 clusters. For each motif we show its common name, its enrichment p-value and a logo that indicates the information content for each base in the motif. See Supplement E.5 for the complete list.

4 Results
4.1 Initial Clustering and Motif Selection
From the initial consensus clustering, which is purely based upon expression data, we compute p-values of the motif enrichment for each cluster. All motifs are ranked according to the lowest p-value they attain in any of the clusters. We consider a motif to be significantly enriched if it attains a p-value < 0.005; using a Bonferroni correction for 50 clusters and 100 motifs, this results in a threshold of $10^{-6}$. Figure 3 shows the logos and p-values of the motifs that pass this threshold. The motifs that are displayed in this figure are selected as features for the combined clustering. Other feature selection methods have been evaluated and are discussed in Supplement E.

Figure 4: This figure shows (1) the mean and standard deviation of the 500 GO enrichment scores of the sampled clusterings (blue) and (2) the scores of the consensus clustering (green), computed for x = 25. The $\log_{10}$ of the GO enrichment p-values is plotted against 100 values of α. We use the consensus clustering at α = 0.25 as final combined clustering.

4.2 Combined clustering and α estimation
The combined clustering uses both the expression data and the enrichment of the motifs selected by the initial clustering to compute the gene distance. To be less dependent on the initially chosen number of 50 clusters, we evaluate the GO enrichment of the final clustering for a range of values of x, the number of best clusters considered, and for different settings of the combining weight α (M7.9). Figure 4 shows the clustering scores for 0 ≤ α < 1, with x set to 25.

We immediately notice that the consensus clustering scores much better than the sampled clusterings. This difference is mainly due to the fact that the consensus clustering is computed on the entire dataset of 2497 genes, while the individual clusterings use only about 2000 genes (80% of the data). If we compare two clusters with the same motif enrichment ratio (#motifs/#genes), the larger cluster will obtain a better p-value. Therefore, extreme p-values are more likely to occur in large clusters (thus, in the consensus clustering).

Furthermore, we see from Figure 4 that the consensus clustering does not show a smooth behaviour over variations of α. We therefore determine the optimal parameter values for x and α on the mean of the individual clusterings and use the consensus clustering at the selected parameters as final result.

In order to find the best value of α and x we have computed the gain in GO enrichment for each combination of x and α. To derive the significance of the difference between the initial clustering and a combined clustering we have used a one-tailed two sample t-test (M7.9). Figure 5 shows that the optimal value of α does not vary strongly for different values
of x. For any x > 5 the optimal value for α lies within the region 0.24 ≤ α ≤ 0.28; we have therefore taken the consensus clustering at α = 0.25 as the final combined clustering.

Figure 5 shows that an optimum in x is reached around x = 25. In this way we have used the GO enrichment as an independent cluster validation method. One might suggest using this result to set the number of clusters to 25 and recomputing the initial clustering. We have decided to preserve the extra clusters, in order to allow the relevant clusters to be shrunk as depicted in Scenario 2 in Figure 1. The remaining genes from the initial cluster can then move to one of the dummy clusters and are not obliged to be included in another informative group.

Figure 5: In order to find the optimal combination of α and x, where the gain in GO enrichment is maximal, we use a t-test to compute the distribution difference of the initial clustering (α = 0) and the combined clustering for each x ∈ N : 1 ≤ x ≤ 50 and each 0 ≤ α < 1 (100 steps). This graph shows the $\log_{10}$(p-values) of the t-statistic on the vertical axis, indicating the difference of the combined clustering with respect to the initial clustering; the horizontal axes show the motif weight (α) and the number of clusters (x). The minimal values of the plot are shown in red. As the irregular structure of the surface is due to the limited number of sampling iterations, we can conclude that the optimal improvement is obtained in the region 0.24 ≤ α ≤ 0.28. We omitted the part where α > 0.5 since the one-tailed t-test causes all p-values to be zero in this area.

4.3 Cluster comparison
To compare the initial and combined clustering, we now compute the consensus clustering for α = 0 and α = 0.25. Because we can only expect to improve clusters for which there is a related motif in our database, we discuss the value of our method by looking at the clusters with the highest motif enrichment. Figure 6 shows the clusters of the initial clustering in which at least one of the motifs attained a p-value smaller than $10^{-6}$. For these five clusters (A-E), we show both motif enrichment and expression profiles.

For the combined clustering with α = 0.25 we obtain the results shown in Figure 7. As expected, we can see from these figures that the combined clustering contains more clusters with highly enriched motifs. The changes with regard to the initial clustering can be summarized as follows:
Nitrogen Cluster The cluster that shows differential expression under nitrogen limitation (Cluster a, Figure 7) has
become the cluster with the highest motif enrichment. The
reason for this is that this cluster has been shrunk to about
one fifth (62 → 12 genes) of the original cluster size (Cluster B, Figure 6). Only the well explainable genes, in terms
of motif and expression, have been conserved in this cluster.
Figure 8 shows the genes that were found in the initial clustering and in the combined clustering. This figure depicts the
expression profiles of the genes, together with the enrichment
of the motifs that have been related to this condition by Tai et
al. (29) and were found in our combined clustering (DAL82,
GAT1, GLN3, GZF3). It is clear that many genes in the initial
clustering display the expected expression profile, but lack the
presence of known regulating motifs. The newly found cluster only contains genes that demonstrate an expression profile
that is clearly related to the regulation program. Also, we
notice that the genes that contain the related motifs show a higher expression in the aerobic condition, while the initial cluster was overall more strongly expressed in the anaerobic environment. This suggests that the presence of the transcription factors that bind to these motifs is controlled by the oxygen supply.
Furthermore we observe that the combined cluster obtains a p-value of $5.9 \times 10^{-12}$ on the GO category Allantoin Catabolism. In a nitrogen limited environment, the allantoin degradation pathway, which converts allantoin (C4H6N4O3) to ammonia and carbon dioxide, allows S. cerevisiae to use allantoin as a sole nitrogen source (4). All genes that are part of this pathway according to the Saccharomyces Genome Database (4) are included in this cluster. The initial cluster showed its highest enrichment (p-value: $9.6 \times 10^{-10}$) on the more general GO category Catabolism. Thus, in this example the addition of motif information led to a cluster that can be related to a more specific condition and in this way has a higher biological relevance. Since all genes that lack the regulating motif have been removed, this cluster change is a clear example of Scenario 2 in Figure 1.
Sulfur Cluster Looking again at Figure 7, we notice that the cluster expressed under sulfur limitation (Cluster b) now clearly shows highly enriched motifs CBF1, MET31/32 and TYE7, whereas the initial cluster (Cluster C) only corresponded to CBF1. Indeed, in Tai et al. (29) both motifs CBF1 and MET31/32 were related to this condition, together with MET4. The TYE7 motif was not related to sulfur by Tai et al. but is very likely to play a role under this condition given the fact that its motif predominantly contains the sequence TCACGTG (which is highly similar to the motif of CBF1). The combined sulfur cluster has become larger than the initial cluster (94 → 151 genes), and still shows the correct expression profile with an improved motif enrichment, which indicates that the initial cluster missed some sulfur related genes. Also, the p-value of the GO category Sulfur Metabolism has improved from $7.5 \times 10^{-16}$ to $1.3 \times 10^{-19}$, because six more genes with this annotation were included in the combined cluster. Figure 9 depicts which genes of this category were found by the initial and combined clustering. To illustrate the intended behaviour of our method we have indicated which genes we expected to be clustered differently in the combined clustering. This cluster is an example of Scenario 3 in Figure 1, because the cluster has been enlarged to include more genes with the same motifs.

Figure 6: Appearance of the most enriched motifs found in the initial clustering, ranked from left to right in order of significance (A). Only the clusters in which at least one motif showed significant enrichment (p-value < $10^{-6}$) are shown. The color of the squares indicates the $-\log_{10}$(p-value) of the enrichment for each motif and cluster. For each of these clusters, we show the normalized expression profiles of all genes and the total number of genes in the cluster (B). All values that exceed a standard deviation of -1/1 are truncated to -1/1.
[Cluster sizes A-E: 61, 62, 94, 143 and 67 genes. Expression columns: C, N, P and S limitation under aerobic and anaerobic conditions.]
Second Aerobic Cluster Further inspection of Figure 7
shows a second aerobic cluster that is also controlled by the
HAP motifs but shows slightly higher expression under phosphorus limitation (Cluster e), while the initial aerobic cluster
(Cluster D) shows higher expression under carbon limitation.
We know that the HAP2/3/4 motif has been related to both
carbon limitation and aerobic conditions (7; 29). Our finding
suggests that the HAP motif also plays a role under phosphorus limitation.
Figure 7: Appearance of the most enriched motifs (p-value < $10^{-6}$) found in the combined clustering.
[Cluster sizes a-h: 12, 151, 71, 178, 35, 7, 84 and 15 genes. Expression columns: C, N, P and S limitation under aerobic and anaerobic conditions.]

Anaerobic Phosphorus Cluster The last cluster from Figure 6, Cluster E, is not present in Figure 7. This cluster, however, has remained almost unaltered, except that the motif p-values just exceed $10^{-6}$ in the combined clustering.

Carbon Cluster Apart from increasing specificity in the initial clusters, the combined clustering also discovered a few additional clusters with significant motif enrichment. We have found a carbon cluster with clearly enriched motif MIG1 (Cluster g). This motif has been related to carbon limitation by Tai et al. (29) but our initial clustering did not clearly show this relation. Our combined clustering indicates that the genes regulated by MIG1 are more strongly expressed in an anaerobic environment.

Second Sulfur Cluster An additional sulfur cluster was discovered that lacks the well known sulfur related motifs, but does contain the AFT2 motif (Cluster h). The set of genes activated by AFT1 and AFT2 is designated as the iron regulon, and its activation was suggested to depend on a product of the mitochondrial iron-sulfur cluster biogenesis pathway (25). These genes are thus part of a different pathway than the genes in the sulfur metabolism cluster (Cluster b) and may indicate a novel mechanism working under sulfur limitation.

5 Discussion
One of the principal differences between our method and comparable work is the fact that our method does not attempt to find new motifs. Related methods have used de novo motif finding methods in an iterative fashion with the clustering step in order to find genes that are co-regulated by a certain transcription factor. These methods can result in new clusters that cannot be explained in terms of regulation, because de novo motif detection methods are known to produce many false positives. The newly found motifs need to be biologically validated (i.e. coupled to a TF) before the matching gene cluster will be trustworthy. In contrast, our method produces gene clusters that can be explained by known TFs and in this way directly produces biologically relevant gene groups. Clearly, a limitation of our method is that it assumes that motifs related to the tested conditions are present in our database.

The fact that we use a fixed number of clusters and make
use of an exhaustive clustering method forces every gene to
belong to a distinct group. Therefore, if we expect Scenario 2
from Figure 1 to occur, we require that the genes that were initially clustered together find a new cluster that better matches
their motif profile. We have used the GO enrichment within
the clusters as an independent cluster validation method, which
showed us that only the best 25 clusters contribute to improvements in enrichment of functional categories. We deliberately
held on to the number of 50 clusters, in order to allow genes
to move to one of the non-relevant clusters.
As is visible in Figure 3, our database contains motifs that
only differ by a few bases (like GLN3, GZF3, DAL82 and
GAT1). One might suggest that we should use a motif grouping method that combines these motifs. The reason we chose
not to do this is twofold: firstly, we notice that a single base
difference can still result in a different enrichment for these
motifs (see also Figure 8); secondly, the motifs CBF1 and
PHO4 show great resemblance although they are known to
play a role under different conditions (sulfur and phosphorus
limitation respectively). Other studies have shown that the T
just before the core sequence CACGTG in CBF1 inhibits the
binding of PHO4p but not CBF1p (5; 22). We therefore do not
want to group all motifs that differ by a single nucleotide.
Figure 8: This figure shows the initial (C1, 62 genes) and combined (C2, 12 genes) clusters that demonstrate higher expression under nitrogen limitation (A). The normalized expression profiles are shown in red (high) and green (low). For each gene it is indicated whether any of the four known nitrogen related motifs (DAL82, GAT1, GLN3, GZF3) was present in its upstream region (B). The GO categories that showed the highest enrichment in C1 and C2 were Catabolism (GO1) and Allantoin Catabolism (GO2); (C) denotes which genes are annotated with these categories. The column Both indicates genes that were found in both C1 and C2 (D). In the second cluster, only the genes with a clear regulation remain. In this way, we have gained more confidence in the co-regulation of this cluster.
Figure 9: This figure shows all genes annotated with Sulfur Metabolism in GO. The normalized expression profiles are shown in red (high) and green (low), and for each gene it is indicated whether any of the three known sulfur related motifs (CBF1, MET31/32, MET4) was present in its upstream region. The two groups of genes on the left (A) show which sulfur genes are found by the initial clustering and which genes were additionally included in the combined clustering (purple arrows). The block on the right (C) shows the annotated genes that were initially not found. The bar indicated by High Pot. shows which genes we considered to have a high potential to be found by the combined clustering, because they contain at least one of the known motifs and do not deviate greatly from the desired expression profile (a minimal correlation of 0.5 between the expression profile and the perfect profile (00010001)). The Not found bar shows that only one of these genes was eventually not included in the combined cluster.
However, as a post-processing step it is possible to use a motif grouping method to combine similar motifs per cluster.
When we regard the enriched motifs in the combined clustering (Figure 7), we notice that we have obtained more highly enriched motifs than in our initial clustering. We can now select these motifs to compute a new motif distance and in this way cluster in an iterative fashion, until the set of enriched motifs converges to a constant group. This was done for ten clustering iterations. The consensus clustering, however, does not become stable, which may be caused by the limited number of iterations. Nevertheless, when we compute the mean motif enrichment of the 500 sampled clusterings, the set of enriched motifs clearly converges to a fixed set. This set is already formed after three iterations and consists of (in ranked order): CBF1, GZF3, GLN3, DAL82, HAP2/3/4, PHO4, TYE7, MET31/32 and GAT1, which all obtain a p-value smaller than $10^{-6}$. The clustering that results when these motifs are used as input is discussed in Supplement F.
In our computation of the gene-motif score we have not taken the distance of the putative binding site to the transcription start site into account, although there are numerous indications that this relationship exists (34). Therefore, the use of a motif distance analysis such as that of Harbison et al. (9) is likely to improve the score function.
6 Conclusion
We have shown that the integration of known motifs into the gene clustering scheme can improve the GO enrichment within the resulting clusters. Moreover, we have created a framework that makes it easier to understand the regulation of gene clusters, because we find clusters with higher motif-condition agreement. In this way, we have made a step towards finding truly co-regulated clusters. Our method is especially effective on well-defined small-scale microarray experiments for which a selection of regulators is known. We use the known motifs to discover gene clusters that are regulated through these motifs, and in this way provide more reliable information than gene clusters that depend purely on expression data.
We applied our method to analyse yeast grown under different cultivation conditions. In this data, our method has detected a compact gene cluster that shows clear expression under nitrogen limitation and is regulated by TFs binding to the motifs DAL82, GAT1, GLN3 and GZF3. In spite of the compactness of this cluster, it still contains all genes known to take part in the allantoin catabolism pathway, which is known to provide ammonia if it is not supplied by the environment. We enlarged the cluster of genes that are expressed under sulfur limitation with respect to a clustering that is based solely on expression data. In this way we found more genes that were annotated to the sulfur metabolism category in GO and we increased the enrichment of the motifs CBF1, MET31/32 and TYE7. Furthermore, we found that the HAP2/3/4 motif is present in genes that show expression under phosphorus limitation in an aerobic environment. Initially, this motif was only related to carbon limitation and aerobic conditions. In addition, we detected a cluster that is regulated by the AFT2 motif in a sulfur-limited condition and a cluster regulated by MIG1 under carbon limitation in an anaerobic environment. Neither of these regulatory mechanisms was clearly visible in the initial clustering on expression data.
7 Materials and Methods
7.1 Expression dataset
The proposed combined clustering method was developed and applied on the expression data of Saccharomyces cerevisiae from the Kluyver Laboratory for Biotechnology in Delft (prototrophic haploid reference strain CEN.PK113-7D (MATa)). This dataset comprises 6383 genes and 24 arrays. The 24 arrays are made up of 3 replicated measurements of eight conditions. In these eight conditions the response of aerobic as well as anaerobic chemostat cultures of Saccharomyces cerevisiae to growth limitation by four different macronutrients (carbon, nitrogen, phosphorus and sulfur) is compared (29).
7.2 Gene preselection
The significance analysis of microarrays (SAM) method (31) is used to select the genes that demonstrate the most significant response under one or more nutrient-limited growth conditions. Using SAM, the significance of change in at least one of the conditions is computed and all genes are ranked according to this score. The top 2500 genes were then selected according to this rank for further analysis (obtaining a False Discovery Rate (FDR) of 0.01%).
Three of the selected genes have an upstream region shorter than 1000 bp. These genes are disregarded, so 2497 genes are retained for further evaluation.
7.3 Motif scanning
Motifs are represented by a motif matrix that contains the base frequencies at the different positions. The alignment matrix $n_{b,j}$ records the occurrence of base b at position j of all the aligned sites for this motif. For example, the base counts for the PHO4 motif, whose matrix has 11 positions, are:
$$
n_{b,j} =
\begin{array}{c} A \\ C \\ G \\ T \end{array}
\begin{pmatrix}
1 & 3 & 2 & 0 & 8 & 0 & 0 & 0 & 0 & 0 & 1 \\
2 & 2 & 3 & 8 & 0 & 8 & 0 & 0 & 0 & 2 & 0 \\
1 & 2 & 3 & 0 & 0 & 0 & 8 & 0 & 5 & 4 & 5 \\
4 & 1 & 0 & 0 & 0 & 0 & 0 & 8 & 3 & 2 & 2
\end{pmatrix}.
$$
This motif matrix gives the numbers of base occurrences in 8 known binding sites for this motif (e.g. found in biological experiments).
The position weight matrix (PWM) is the matrix that is most frequently used to score a test sequence with a given motif consensus. The PWM is computed from the alignment matrix by (11; 15):
$$
W_{b,j} = \ln \frac{(n_{b,j} + p_b)/(N + 1)}{p_b} \approx \ln \frac{f_{b,j}}{p_b}, \qquad (2)
$$
where N is the number of known motif sites, $p_b$ the background frequency of base b in the entire genome and $f_{b,j} = n_{b,j}/N$ the frequency matrix.
A test sequence may be aligned along the weight matrix, and its score is the sum of the weights for the letters aligned at each position (see Figure 10):
$$
Sc_i = \sum_{j=1}^{p} W_j[S_{i+j-1}], \qquad (3)
$$
where $S_i$ is the base at position i in the upstream region to be scanned, p is the size of the motif and W is the PWM.
[Figure 10 image: the PHO4 PWM (rows A, C, G, T) aligned at position i of a test sequence; the summed score of the example window is Sc_i = 3.81.]
Figure 10: Scanning a sequence with a PWM. The score is simply the sum of the values sampled from the matrix.
To scan the upstream regions of the genes for instances of known transcription factor binding motifs, a compendium of 107 position weight matrices (PWMs) was built, collected from three different online databases (18 from Transfac (32), 13 from SCPD (34) and 76 from Harbison et al. (9)). All duplicates were removed, and when multiple TFs bind to the exact same motif their labels were combined in our compendium.
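As an illustration of Equations (2) and (3), the following minimal sketch builds a PWM from the PHO4 count matrix and scans a short sequence. The background frequencies (0.31 for A and T, 0.19 for C and G) are an assumption chosen to be consistent with the values printed in Figure 10; they are not stated in the text. The function names and the test sequence are hypothetical.

```python
import math

# PHO4 alignment (count) matrix from the text: one list per base, 11 positions.
COUNTS = {
    "A": [1, 3, 2, 0, 8, 0, 0, 0, 0, 0, 1],
    "C": [2, 2, 3, 8, 0, 8, 0, 0, 0, 2, 0],
    "G": [1, 2, 3, 0, 0, 0, 8, 0, 5, 4, 5],
    "T": [4, 1, 0, 0, 0, 0, 0, 8, 3, 2, 2],
}
N_SITES = 8  # number of known binding sites

# ASSUMPTION: genome background frequencies, not given in the text.
BACKGROUND = {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}


def build_pwm(counts, n_sites, background):
    """Equation (2): W[b][j] = ln(((n + p_b) / (N + 1)) / p_b)."""
    return {
        base: [
            math.log(((n + background[base]) / (n_sites + 1)) / background[base])
            for n in row
        ]
        for base, row in counts.items()
    }


def scan(pwm, sequence):
    """Equation (3): score of every window of motif length p in the sequence."""
    p = len(next(iter(pwm.values())))
    return [
        sum(pwm[sequence[i + j]][j] for j in range(p))
        for i in range(len(sequence) - p + 1)
    ]


PWM = build_pwm(COUNTS, N_SITES, BACKGROUND)
scores = scan(PWM, "TTTCACGTGGGAAA")  # hypothetical test sequence
```

With these assumed background frequencies the weights agree with Figure 10 up to the two printed decimals, and the window that starts at the first base, which contains the CACGTG core at motif positions 4 to 9, receives the highest score.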
7.4 Computation of Gene-Motif agreement score
Because regulatory motifs can occur on both strands of the DNA, a scan over a region of 1000 bp will result in 2(1000 − p + 1) ≈ 2000 scores per gene for a PWM of length p. To obtain a single score for each Gene-Motif combination, several methods were compared (see Supplement C) and the method used by Segal et al. (26) has been adopted, which computes:
$$
P(M_g = \mathrm{true} \mid S_1, \ldots, S_n) = \varsigma\left( \log\left( \frac{1}{n - p + 1} \sum_{i=1}^{n-p+1} \exp\{Sc_i\} \right) \right), \qquad (4)
$$
where $\varsigma$ is the sigmoid function ($\varsigma(x) = \frac{1}{1 + e^{-x}}$) and n is the length of the upstream region. This function takes the mean of the exponent of all alignment scores $Sc_i$ along the upstream region and in this way gives a higher weight to large scores and neglects very low scores. The sigmoid function scales the resulting score values between 0 and 1.
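Equation (4) can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function names are hypothetical, the input is simply the list of alignment scores for one gene and one motif, and the log-mean-exp is computed in a shifted form to avoid overflow for large scores.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gene_motif_agreement(scores):
    """Equation (4): sigmoid of the log of the mean exponentiated score."""
    m = max(scores)  # shift by the maximum so exp() cannot overflow
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scores) / len(scores))
    return sigmoid(log_mean_exp)


# A single strong site dominates many weak positions:
with_site = gene_motif_agreement([-5.0] * 100 + [8.0])
without_site = gene_motif_agreement([-5.0] * 101)
```

Because the exponent is taken before averaging, one high-scoring window raises the agreement far more than many mediocre windows, which is the behaviour described above.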
7.5 Threshold on the Gene-Motif agreement score
Equation (4) returns a continuous value that can be seen as a probability that a certain motif is present in the upstream region of a gene. For both computational simplicity and comprehensibility it is desirable to threshold these gene-motif agreement scores and obtain a true/false relationship between gene upstream region and motif. Figure 11 shows the resulting median number of motifs per gene for thresholds ranging between 0.65 and 1. In addition, the total number of genes without any motif is depicted.
To be able to distinguish between gene regulation programs, a reasonable number of motifs per gene is needed and the number of genes without a motif needs to be kept as small as possible. Therefore, the threshold on the score value is set such that the median number of motifs for an upstream region equals five (threshold: 0.82). This number was also observed by Zhang et al. (33), who used a database of known and experimentally verified motifs to scan the upstream regions of yeast genes. For vertebrates, Prakash and Tompa (21) found a similar number of six, based on over-representation in an orthologous human, chimp, mouse and rat dataset. Note that if a more stringent threshold had been chosen, the number of genes without any motif annotation would have increased dramatically, as is visible in Figure 11. The set of 107 thresholded motif scores will be called the binary motif profile of a gene.
[Figure 11 image. x-axis: threshold on gene-motif agreement (0.65 to 1); left y-axis: median number of motifs per gene; right y-axis: number of genes without motifs.]
Figure 11: Median number of motifs per gene (blue line, left y-axis) and the number of genes without motif (green line, right y-axis) as a function of the threshold on the scoring function (Eq. 4). The chosen threshold of 0.82 (red line) results in a median of 5 motifs per gene and a total of 31 genes without motifs.
Figure 12 shows the binary motif profiles of twelve genes that show higher expression levels in a nitrogen-limited environment, independent of the oxygen supply. The vertical lines in this figure indicate that all genes in this group have this binding site in their upstream region. If a group of genes shows a similar expression profile and their upstream regions contain one or more similar motifs, we can say that the gene cluster is co-regulated.
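The thresholding step and the two quantities plotted in Figure 11 can be sketched as below. The toy scores and gene names are invented for illustration, and treating a score equal to the threshold as a hit is an assumption; the text does not say whether the comparison is strict.

```python
from statistics import median


def binary_profiles(agreement, threshold=0.82):
    """Turn gene-motif agreement scores into binary motif profiles."""
    # ASSUMPTION: a score equal to the threshold counts as "motif present".
    return {
        gene: [score >= threshold for score in scores]
        for gene, scores in agreement.items()
    }


def summarize(profiles):
    """Median number of motifs per gene and number of genes without a motif."""
    counts = [sum(profile) for profile in profiles.values()]
    return median(counts), sum(1 for c in counts if c == 0)


# Hypothetical agreement scores for three genes and three motifs.
toy_agreement = {
    "gene1": [0.90, 0.50, 0.83],
    "gene2": [0.10, 0.20, 0.30],
    "gene3": [0.82, 0.99, 0.95],
}
toy_profiles = binary_profiles(toy_agreement)
median_motifs, genes_without_motif = summarize(toy_profiles)
```

Sweeping `threshold` over a grid and recording the two summary values would reproduce the kind of curves shown in Figure 11.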
7.6 Motif profile distance
To obtain a motif distance between each gene pair, the normalized Hamming distance between the binary motif profiles is computed as follows:
$$
d_H = \frac{\sum_{i=1}^{N} |P_1(i) - P_2(i)|}{N}, \qquad (5)
$$
where N is the total length of the motif profiles and the numerator is the number of differences between profiles $P_1$ and $P_2$.
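A minimal sketch of Equation (5) for two binary motif profiles of equal length; the function name and the example profiles are hypothetical.

```python
def normalized_hamming(profile_1, profile_2):
    """Equation (5): fraction of positions at which two binary profiles differ."""
    if len(profile_1) != len(profile_2):
        raise ValueError("motif profiles must have the same length")
    differences = sum(a != b for a, b in zip(profile_1, profile_2))
    return differences / len(profile_1)


# Two toy profiles over four motifs that differ at two positions.
distance = normalized_hamming([1, 0, 1, 1], [1, 1, 0, 1])
```

The distance is 0 for identical profiles and 1 for profiles that disagree on every motif.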
The drawback of this method is that it takes all the motifs in the motif profile into account, which causes a lot of noise, because not all motifs are active in our experimental setup. To compensate for this, a feature selection method is used so that only motifs that play a significant role under the tested conditions will contribute to the distance measure. This feature selection consists of selecting the motifs that are highly enriched in the initial clustering, which is based solely on expression data. Other selection methods have been assessed, but did not give improvements (see Supplement E).
[Figure 12 image. Left block: motif profiles (107 motifs), with the motifs related to nitrogen marked; right block: expression profiles for the C, N, P and S limitations under aerobic and anaerobic conditions; rows are genes.]
Figure 12: Experimental setup. The block on the right shows the normalized expression profiles over eight experimental conditions. Red indicates high expression with regard to the other conditions (green). The left block shows the binary motif profiles that indicate whether any of the 107 motifs is present in the upstream region of a gene.
[Figure 13 image. Recovered labels: "For α = 0 → 1", "For x = 1 → 50", steps A, B and C, and references to Fig. 4 and Fig. 5.]
Figure 13: The steps we take to estimate the optimal value for α. Here, R is the number of cluster iterations (500), x is the number of clusters we use to compute a single score for a clustering (between 1 and X) and α is the motif weight that we vary between 0 and 1 in A steps. The total number of clusters X is set to 50 and A is set to 100.
7.7 Data clustering
In both clustering steps we use hierarchical clustering to divide the data into 50 distinct groups. Complete linkage is used, which has been shown to provide the most reliable clusters on genetic data (8) (see Supplement D.1). Because we chose to compute more clusters than we expected in the dataset, we assume that not all resulting clusters will be relevant. Therefore, only a select number of clusters will be regarded in order to assess the value of our method in Chapter 4.3.
To improve the robustness of the putative clusters to variations in data sampling, we cluster 500 times on 80% of the data and employ consensus clustering (6; 20). This methodology first computes a consensus matrix which contains, for each pair of items, the proportion of clusterings in which the two items are clustered together:
$$
M(i,j) = \frac{\sum_h M^{(h)}(i,j)}{\sum_h I^{(h)}(i,j)}, \qquad (6)
$$
where $I^{(h)}$ indicates whether items i and j were both selected by the data sampling in clustering h, and $M^{(h)}$ is the co-occurrence matrix that records whether items i and j are clustered together in clustering h:
$$
M^{(h)}(i,j) =
\begin{cases}
1 & \text{if } i \text{ and } j \text{ belong to the same cluster,} \\
0 & \text{otherwise.}
\end{cases}
$$
From the consensus matrix we compute a new distance matrix D = 1 − M, which is used to derive a new clustering, again using hierarchical clustering with complete linkage.
7.8 Enrichment computation
For both the significance of the number of motifs in a cluster and the number of GO category annotations in a cluster, we compute p-values as enrichment scores. The hypergeometric distribution was employed to compute the probability of detecting the observed number of motifs/annotations or more in a gene cluster. An enrichment p-value is computed as follows:
$$
p = P(i \ge b) = \sum_{i=b}^{\min(B,g)} \frac{\binom{B}{i}\binom{G-B}{g-i}}{\binom{G}{g}}, \qquad (7)
$$
where G is the total number of genes, B is the number of genes within this cluster, g is the total number of genes that have this motif/annotation and b is the number of genes from the cluster that have this motif/annotation.
7.9 Cluster evaluation
In order to evaluate the different clusterings, the Gene Ontology database (2) is used to find the enrichment of functional categories in the individual clusters. First, all GO categories with fewer than 5 annotations are removed, resulting in 576 categories. Then, p-values of the detected number of annotations for each cluster-category combination are computed using the hypergeometric distribution, and the lowest p-value over all categories is assigned as a score for a cluster.
The combined clustering step iterates 500 times over 80% of the data and varies α between 0 and 1 in 100 steps. Figure 13 shows that this results in R ∗ A ∗ X cluster scores. In step A, the score for a clustering is computed by taking the average over the x best clusters, varying x between 1 and X. Step B computes the mean and standard deviation over the 500 iterations. Also, a consensus clustering for each α is determined. The score for the consensus clustering is computed and plotted together with the mean and standard deviation of the individual scores in Figure 4. Finally, in step C the gain of the combined clustering with respect to the initial clustering is computed, using a two-sample t-test with respect to the clustering on expression data (α = 0) as follows:
$$
T = \frac{\bar{X}_{init} - \bar{X}_{comb}}{\sqrt{\dfrac{S^2_{init} + S^2_{comb}}{R}}}, \qquad (8)
$$
where $\bar{X}_{init}$ and $\bar{X}_{comb}$ are the sample means of the initial and combined clustering, R is the number of cluster iterations and $S^2_{init}$ and $S^2_{comb}$ are the sample variances. Since we are only interested in clusterings that have a mean score lower than the initial clustering, we compute a one-tailed t-test. The p-values of the t-statistic for each α and x are shown in Figure 5.
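The consensus matrix (Eq. 6), the enrichment p-value (Eq. 7) and the t-statistic (Eq. 8) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the code used for the results: the function names are hypothetical, each clustering run is assumed to be given as a dictionary from the sampled item indices to cluster labels, and both score samples in the t-statistic are assumed to have the same size R.

```python
import math
from statistics import mean, variance


def consensus_matrix(n_items, runs):
    """Eq. (6): fraction of runs in which two jointly sampled items co-cluster.

    runs: list of dicts {item_index: cluster_label}, one per sampled clustering.
    """
    together = [[0] * n_items for _ in range(n_items)]
    sampled = [[0] * n_items for _ in range(n_items)]
    for labels in runs:
        for i in labels:
            for j in labels:
                sampled[i][j] += 1
                if labels[i] == labels[j]:
                    together[i][j] += 1
    # Pairs that were never sampled together get consensus 0 (an assumption).
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
         for j in range(n_items)]
        for i in range(n_items)
    ]


def enrichment_pvalue(G, B, g, b):
    """Eq. (7): hypergeometric probability of b or more annotated genes."""
    tail = sum(math.comb(B, i) * math.comb(G - B, g - i)
               for i in range(b, min(B, g) + 1))
    return tail / math.comb(G, g)


def t_statistic(init_scores, comb_scores):
    """Eq. (8): two-sample t-statistic for samples of equal size R."""
    R = len(init_scores)
    pooled = (variance(init_scores) + variance(comb_scores)) / R
    return (mean(init_scores) - mean(comb_scores)) / math.sqrt(pooled)


M = consensus_matrix(3, [{0: "a", 1: "a", 2: "b"}, {0: "a", 1: "b"}, {1: "x", 2: "x"}])
D = [[1.0 - value for value in row] for row in M]  # distance for re-clustering
```

The distance matrix `D` would then be passed to a hierarchical clustering routine with complete linkage, as described in Section 7.7.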
References
[1] Alon, U. et al. “Broad patterns of gene expression revealed by
clustering analysis of tumor and normal colon tissues probed by
oligonucleotide arrays”. PNAS, Vol. 96, pp. 6745-6750, 1999.
[2] Ashburner, M. et al. “Gene Ontology: tool for the unification of
biology. The Gene Ontology Consortium”. Nature Genetics, Vol. 25,
pp. 25-29, 2000.
[3] Beer, M.A. and Tavazoie, S. “Predicting Gene Expression from Sequence”. Cell, Vol. 117, pp. 185-198, 2004.
[4] Cherry, J.M. et al. “Genetic and physical maps of Saccharomyces
cerevisiae”. Nature, 387(6632 Suppl), pp. 67-73, 1997.
[5] Fisher, F. and Goding, C.R. “Single amino acid substitutions alter helix-loop-helix protein specificity for bases flanking the core
CANNTG motif”. The EMBO Journal, Vol. 1, No. 1, pp. 4103-4109,
1992.
[6] Fred, A.L.N. and Jain, A.K. “Data Clustering Using Evidence Accumulation”. 16th International Conference on Pattern Recognition
(ICPR’02), Vol. 4, p. 40276, 2002.
[7] Gancedo, J.M. “Yeast Carbon Catabolite Repression”. Microbiology and Molecular Biology Reviews Vol. 62, No. 2, pp. 334-361, 1998.
[8] Gibbons, F.D. and Roth, F.P. “Judging the Quality of Gene Expression Based Clustering Methods Using Gene Annotation”. Genome
Research, Vol. 12, pp. 1574-1581, 2002.
[9] Harbison, C.T. et al. “Transcriptional regulatory code of a eukaryotic genome”. Nature, Vol. 431, pp. 99-104, 2004.
[10] van Helden, J. et al. “Extracting Regulatory Sites from the Upstream Region of Yeast Genes by Computational Analysis of
Oligonucleotide Frequencies”. J. Mol. Biol., 281, pp. 827-842, 1998.
[11] Hertz, G.Z. and Stormo, G.D. “Identifying DNA and protein
patterns with statistically significant alignments of multiple sequences”. Bioinformatics, Vol. 15, nos 7/8, pp. 563-577, 1999.
[12] Heyer, L.J. et al. “Exploring Expression Data: Identification and
Analysis of Coexpressed Genes”. Genome Research, Vol. 9, pp.
1106-1115, 1999.
[13] Hubbard, T. et al. “Ensembl 2005” Nucleic Acids Research, Vol. 33,
Database issue: D447-D453, 2005.
[14] Hughes, J.D. et al. “Computational identification of cis-regulatory
elements associated with functionally coherent groups of genes in
Saccharomyces cerevisiae”. J. Mol. Biol., 296, pp. 1205-1214, 2000.
[15] Jensen, S.T. et al. “Computational Discovery of Gene Regulatory
Binding Motifs: A Bayesian Perspective”. Statistical Science, Vol.
19, No. 1, pp. 188-204, 2004.
[16] Johansson, Ö. et al. “Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN
algorithm”. Bioinformatics, Vol. 19, Suppl. 1, pp. i169-i176, 2003.
[17] Kellis, M. “Computational comparative genomics: genes, regulation, evolution”. Department of Electrical Engineering and Computer
Science, Massachusetts Institute of Technology, 2003.
[18] Latchman, D.S. “Transcription Factors as Potential Targets for
Therapeutic Drugs”. Current Pharmaceutical Biotechnology, Vol. 1,
No. 1, pp. 57-61, 2000.
[19] Middendorf, M. et al. “Motif Discovery Through Predictive Modeling of Gene Regulation”. RECOMB 2005, LNBI 3500, pp. 538-552,
2005.
[20] Monti, S. et al. “Consensus Clustering: A resampling-based
method for class discovery and visualization of gene expression
microarray data”. Machine Learning, Vol. 52, pp. 91-118, 2003.
[21] Prakash, A. and Tompa, M. “Discovery of regulatory elements in
vertebrates through comparative genomics”. Nature Biotechnology,
Vol. 23, No. 10, pp. 1249-1256, 2005.
[22] Robinson, K.A. and Lopes, J.M. “Saccharomyces cerevisiae basic
helix-loop-helix proteins regulate diverse biological processes”.
Nucleic Acids Research, Vol. 28, No. 7, pp. 1499-1505, 2000.
[23] Roth, F. et al. “Finding DNA regulatory motifs within unaligned
noncoding sequences clustered by whole-genome mRNA quantitation”. Nature Biotechnology, Vol. 16, pp. 939-945, 1998.
[24] Ruan, J. and Zhang, W. “A bi-dimensional regression tree approach
to the modeling of gene expression regulation”. Bioinformatics, Vol.
22, No. 3, pp. 332-340, 2006.
[25] Rutherford, J.C. et al. “Activation of the Iron Regulon by the Yeast
Aft1/Aft2 Transcription Factors Depends on Mitochondrial but
Not Cytosolic Iron-Sulfur Protein Biogenesis”. The journal of biological chemistry Vol. 280, No. 11, pp. 10135-10140, 2005.
[26] Segal, E. et al. “Genome-wide discovery of transcriptional modules
from DNA sequence and gene expression”. Bioinformatics, Vol. 19,
Suppl 1: i273-i282, 2003.
[27] Segal, E. et al. “Module networks: identifying regulatory modules and their condition-specific regulators from gene expression
data”. Nature Genetics, Vol. 34, No. 2, pp. 166-176, 2003.
[28] Sinha, S. et al. “PhyME: A probabilistic algorithm for finding motifs in sets of orthologous sequences”. BMC Bioinformatics, Vol. 5,
170, 2004.
[29] Tai, S.L. et al. “Two-dimensional Transcriptome Analysis in
Chemostat Cultures”. The Journal of Biological Chemistry, Vol. 280,
No. 1, pp. 437-447, 2005.
[30] Tavazoie, S. et al. “Systematic determination of genetic network
architecture”. Nature Genetics, Vol. 22, pp. 281-285, 1999.
[31] Tusher, V.G. et al. “Significance analysis of microarrays applied to
the ionizing radiation response”. PNAS, Vol. 98, No. 9, pp. 5116-5121, 2001.
[32] Wingender, E. et al. “TRANSFAC: an integrated system for gene
expression regulation”. Nucleic Acids Research, Vol. 28, No. 1, pp.
316-319, 2000.
[33] Zhang, Z. et al. “How much expression divergence after yeast gene
duplication could be explained by regulatory motif evolution?”.
TRENDS in Genetics, Vol. 20, No. 9, pp. 403-407, 2004.
[34] Zhu, J. and Zhang, M.Q. “SCPD: a promoter database of the yeast
Saccharomyces cerevisiae”. Bioinformatics, Vol. 15, nos 7/8, pp.
607-611, 1999.
Supplement
Discovery of co-regulated gene clusters by combining known transcription factor binding motifs and gene expression profiles
Maarten Clements
April 7, 2006
Contents
A Genes and Their regulation
    A.1 DNA
    A.2 Genes
    A.3 Transcription factors
    A.4 Microarrays
    A.5 Yeast
B Motif representation
C Computing the motif profile
    C.1 Computing a Gene-Motif score
    C.2 Comparison of scoring methods
    C.3 Score thresholding
D Clustering
    D.1 Hierarchical clustering
    D.2 Distance measures
E Feature selection
    E.1 Cluster best
    E.2 P-value threshold
    E.3 Decision tree
    E.4 Method comparison
    E.5 Motif ranking
F Gene Ontology
G Iterative clustering
© ICT - TU Delft 2005
A Genes and Their regulation
A.1 DNA
The human body is constructed of trillions of cells which together form tissues like skin, muscle and bone. Each of these cells contains the same building information, called DNA. This DNA is the codebook of how we are made and how our body works. Only the part of this codebook that is actually read determines the function of a certain cell. For example, a cell in muscle tissue uses the DNA that codes for the muscular function, while a bone cell only references the part of the DNA that describes the construction of bones. It is as if each cell reads only the part of a book of instructions that it needs.
The DNA is made up of four basic nucleotides: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T) (see Figure 1). These four nucleotides always bind each other in specific pairs, A-T and C-G; these pairs are called base pairs (bp). The concatenation of millions of base pairs forms chromosomes. Human DNA normally contains 46 chromosomes with a total of almost $3.3 \times 10^9$ base pairs.
Figure 1: The genetic structure. Cells contain chromosomes that are made up of long
strands of base pairs.
A.2 Genes
The parts of the DNA that actually code for the organic functions are called genes. The central dogma of molecular biology states that when a gene on the DNA is transcribed, RNA is formed, which in turn is translated into proteins (see Figure 2). In biology, proteins are known as the basic building blocks of life. They perform a wide variety of biological functions, ranging from catalysis of chemical reactions to the actual formation of mechanical structures. Proteins are essentially polymers made up of a specific sequence of amino acids. The details of this sequence are stored in the code of a gene.
Figure 2: In the process of transcription the gene is copied into a strand of RNA. This RNA
may be translated into proteins; the building blocks of life.
A.3 Transcription factors
The rate at which transcription takes place is called the activity of a gene, or gene expression. Gene expression is tightly controlled at multiple stages, but mainly at the initiation of transcription. Here, the chromatin structure is uncoiled around the gene to be expressed and the proteins that form the transcription machinery are recruited. These processes are regulated by a specific class of DNA-binding proteins called transcription factors (TFs) [8]. TFs are proteins that bind to the upstream region of a gene and in that way induce or repress the activation of that gene (see Figure 3).
Figure 3: Transcription factors bind close to the transcription start site of a gene. In this
way, they enhance or repress the activation of transcription.
A.4 Microarrays
The development of the microarray technique has had major implications for the knowledge of genetic mechanisms. This method allows us to measure the abundance of RNA at a certain moment in time (see Figure 4). From this information we can derive which genes are active under certain conditions or during biological processes. However, microarray measurements are known to be noisy because of biological variation and measurement inconsistencies. Replication of experiments can compensate for a large part of these effects, but some uncertainty remains.
[Figure 4 image. Rows: genes; columns: conditions; colour scale from low expression to high expression.]
Figure 4: Microarrays can measure the expression of genes under certain conditions or over
time. This figure shows a part of a microarray in which 17 genes are measured under 8
different conditions.
A.5 Yeast
In bioinformatics, baker's yeast (Saccharomyces cerevisiae) is often used as a model organism to develop new methods. Yeast is very easy to study because it is easily modified and cultured, yet it maintains the complex regulation mechanisms of other eukaryotes like plants and animals. Compared to the human genome, the yeast genome is very small: it consists of about $1.3 \times 10^7$ base pairs and about 6,000 genes. This is another attribute that makes baker's yeast a perfect organism for the development of computational methods.
B Motif representation
In the regulation process, transcription factors bind the DNA in the vicinity of the transcription start site in order to activate or repress the activation of the gene. Transcription factors bind the DNA at specific nucleotide sequences (generally 6-20 bp); these short sequences are called transcription factor binding sites or motifs.
Motifs are usually represented by a motif matrix which contains the base frequencies at the different positions. There are three basic types of motif matrices; the simplest is the alignment matrix $n_{b,j}$, which records the occurrence of base b at position j of all the aligned sites for this motif. For example, the base counts for the PHO4 motif, whose matrix has 11 positions, are:
$$
n_{b,j} =
\begin{array}{c} A \\ C \\ G \\ T \end{array}
\begin{pmatrix}
1 & 3 & 2 & 0 & 8 & 0 & 0 & 0 & 0 & 0 & 1 \\
2 & 2 & 3 & 8 & 0 & 8 & 0 & 0 & 0 & 2 & 0 \\
1 & 2 & 3 & 0 & 0 & 0 & 8 & 0 & 5 & 4 & 5 \\
4 & 1 & 0 & 0 & 0 & 0 & 0 & 8 & 3 & 2 & 2
\end{pmatrix}.
$$
This motif matrix gives the numbers of base occurrences in 8 known binding sites for this motif (found in biological experiments). From the alignment matrix it is possible to compute the frequency matrix by $f_{b,j} = n_{b,j}/N$, where N is the number of known motif sites.
In the graphical representation, depicted in Figure 5, the information of a base at a certain position is computed in bits, derived with:
$$
I_b(j) = f_{b,j} \log_2 \frac{f_{b,j}}{p_b}, \qquad (1)
$$
where j is the position in the motif, b refers to each of the possible bases, $f_{b,j}$ is the observed frequency of the concerned base on position j and $p_b$ is the background frequency of base b in the entire genome [11].
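Equation (1) can be illustrated with a short sketch that computes the information of each base in one motif column. The function names are hypothetical, the uniform background of 0.25 is an assumption used only to keep the example easy to check, and a base with zero frequency is taken to contribute zero bits.

```python
import math


def information_bits(frequency, background):
    """Equation (1): f * log2(f / p) for one base at one position."""
    if frequency == 0:
        return 0.0  # convention: a base that never occurs contributes nothing
    return frequency * math.log2(frequency / background)


def column_information(counts, background):
    """Information per base for one column of the alignment matrix."""
    n_sites = sum(counts.values())
    return {
        base: information_bits(count / n_sites, background[base])
        for base, count in counts.items()
    }


# ASSUMPTION: uniform background frequencies, for illustration only.
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Position 4 of the PHO4 matrix: all 8 sites carry a C.
conserved = column_information({"A": 0, "C": 8, "G": 0, "T": 0}, UNIFORM)
```

Under a uniform background a fully conserved position carries 2 bits, which is the maximum of the y-axis in a sequence logo such as Figure 5.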
[Figure 5 image: sequence logo. x-axis: motif positions 1 to 11; y-axis: information in bits (ticks at 1 and 2).]
Figure 5: A graphical representation of the PHO4 motif
When there are only a few sample sequences, a straightforward calculation will tend to overestimate the entropy. To compensate, we substitute the frequency by the adapted frequency by adding a pseudo count:
$$
\tilde{f}_{b,j} = \frac{n_{b,j} + p_b}{N + 1}. \qquad (2)
$$
This small sample correction depends only on the a priori probability and the total amount of data in each column, which may differ from one column to another, since the matrix can be made up out of different short sequences. The error bars in the figure indicate the score plus and minus the small sample correction [3].
The position weight matrix (PWM) is the matrix that is most frequently used to score a test sequence with a given motif consensus. The PWM is computed from the alignment matrix by [6][7]:
$$
W_{b,j} = \ln \frac{(n_{b,j} + p_b)/(N + 1)}{p_b} \approx \ln \frac{f_{b,j}}{p_b}. \qquad (3)
$$
A test sequence may be aligned along the weight matrix, and its score is the sum of the weights for the letters aligned at each position (see Figure 6):
$$
Sc_i = \sum_{j=1}^{p} W_j[S_{i+j-1}], \qquad (4)
$$
where $S_i$ is the base at position i in the upstream region to be scanned, p is the size of the motif and W is the PWM.
[Figure 6 image: the PHO4 PWM (rows A, C, G, T) aligned at position i of a test sequence; the summed score of the example window is Sc_i = 3.81.]
Figure 6: Scanning a sequence with a PWM. The score is simply the sum of the values sampled from the matrix.
In order to scan the 1000 bp upstream region of a gene, we concatenate the positive and negative strand and compute the scores for the concatenated DNA strand. For each gene we end up with a binding vector of length 2000 that represents the likelihood that this TF can bind at the given location i. Figure 7 shows this binding vector for a scan with PHO4.
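A minimal sketch of scanning both strands is given below. It differs slightly from the description above: instead of concatenating the two strands before scanning, it scans the forward strand and its reverse complement separately and concatenates the two score lists, which avoids windows that span the junction and yields 2(n − p + 1) scores. The two-position PWM and the sequence are toy values invented for the example.

```python
# Translation table for complementing a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the negative strand read in 5' to 3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]


def scan_strand(pwm, sequence):
    """PWM score of every window of motif length p on one strand."""
    p = len(pwm["A"])
    return [
        sum(pwm[sequence[i + j]][j] for j in range(p))
        for i in range(len(sequence) - p + 1)
    ]


def binding_vector(pwm, upstream):
    """Scores on the positive strand followed by scores on the negative strand."""
    return scan_strand(pwm, upstream) + scan_strand(pwm, reverse_complement(upstream))


# Hypothetical PWM of length 2 that favours the dinucleotide "AC".
TOY_PWM = {
    "A": [1.0, -1.0],
    "C": [-1.0, 1.0],
    "G": [-1.0, -1.0],
    "T": [-1.0, -1.0],
}
vector = binding_vector(TOY_PWM, "AACG")
```

For a 1000 bp upstream region and a motif of length p this produces the roughly 2000 scores per gene and motif that are shown in Figure 7.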
[Figure 7 image. x-axis: position i (0 to 2000); y-axis: score Sc_i (−35 to 10).]
Figure 7: Scanning an upstream region of 1000 bp with a PWM gives us 2000 scores, because both the positive and negative strand need to be scanned. This graph shows the binding scores of PHO4 at the upstream region of a gene.
C Computing the motif profile
C.1 Computing a Gene-Motif score
Given all scores of a particular upstream region and PWM, we need to determine the probability that this TF can bind to this gene. We will do this by determining if a gene has more high scores than can be expected randomly. If so, we call this gene enriched. To compute the enrichment of the 107 known motifs in the upstream regions of the 2497 Kluyver genes, we first determined the binding vectors by scanning all regions with all PWMs. A scan with a single PWM of length p on an upstream region of 1000 bp results in 2(1000 − p + 1) ≈ 2000 scores, because both the positive and the negative strand need to be scanned.
In Figure 8 a normally distributed approximation of the distribution of the scores for a single PWM for all upstream regions in the yeast genome is depicted. The plot on the right shows the distribution for a single gene in which this motif has a higher enrichment compared to the background. We have considered three methods for computing a single gene-motif score.
Background threshold The first method we used determines the p-value of the
detected enrichment by setting a threshold α such that x% of the background lies above
α (x% = B_{>\alpha}/B) and computes:
E_x = \text{p-value} = \sum_{i=0}^{G_{<\alpha}} \frac{\binom{B_{>\alpha}}{G_{>\alpha}+i} \binom{B_{<\alpha}}{G_{<\alpha}-i}}{\binom{B}{G}},    (5)
where B is the background score distribution and G represents the distribution of
the scores for the gene under investigation; the subscripts >α and <α select the scores
above and below the threshold. The value of x determines the amount of
deviation we allow from our PWM to consider a location to be a binding location. If
we set x very small, only exact matches to the consensus motif will be counted as hits,
whereas a higher value will allow some deviations. We have studied different values
for x, i.e. 0.01, 0.02, 0.05, 0.1, 0.25 and 0.5.
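Equation (5) is the upper tail of a hypergeometric distribution, so it can be sketched with exact integer binomials. The function name `enrichment_pvalue` and the counts in the example are illustrative assumptions; the arguments are the numbers of background scores above and below α and the numbers of gene scores above and below α.

```python
from math import comb

def enrichment_pvalue(b_above, b_below, g_above, g_below):
    """Probability of seeing at least g_above high scores among the gene's
    scores when they are drawn at random from the background, cf. equation (5)."""
    b_total = b_above + b_below
    g_total = g_above + g_below
    tail = sum(
        comb(b_above, g_above + i) * comb(b_below, g_below - i)  # comb() is 0 when k > n
        for i in range(g_below + 1)
    )
    return tail / comb(b_total, g_total)

# Hypothetical counts: 100 background scores of which 10 lie above the threshold.
p_three_hits = enrichment_pvalue(10, 90, 3, 7)  # gene with 3 of 10 scores above
p_one_hit = enrichment_pvalue(10, 90, 1, 9)     # gene with 1 of 10 scores above
```

More high scores in the gene give a smaller p-value, so `p_three_hits` is below `p_one_hit`.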
Combined threshold The second method computes a combined score which
stresses the most stringent thresholds as follows:
E_{mix} = \frac{50 E_{0.01} + 25 E_{0.02} + 10 E_{0.05} + 5 E_{0.1} + 2 E_{0.25} + E_{0.5}}{93}    (6)
The rationale behind this combined score is that very large scores should be more
important, but that many intermediate scores may have the same binding potential
as a few high scores.
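The weighted mean of equation (6) can be written out as a short sketch. The dictionary layout and the name `combined_score` are assumptions for illustration, and the example scores are made up.

```python
# Weights of equation (6), keyed by the background fraction x.
WEIGHTS = {0.01: 50, 0.02: 25, 0.05: 10, 0.1: 5, 0.25: 2, 0.5: 1}

def combined_score(scores_by_x):
    """Weighted mean of the per-threshold scores E_x; the weights sum to 93."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[x] * scores_by_x[x] for x in WEIGHTS) / total_weight

# Hypothetical per-threshold scores for one gene-motif pair.
example = {0.01: 0.001, 0.02: 0.004, 0.05: 0.02, 0.1: 0.05, 0.25: 0.2, 0.5: 0.4}
e_mix = combined_score(example)
```

Because the weight of the most stringent threshold is 50 of 93, the result is dominated by the scores at small x.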
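[Figure 8 panels: left, the entire genome (5 million scores), where the threshold α separates B<α (100 − x%) from B>α (x%); right, a single gene (2000 scores), where the same threshold separates G<α from G>α.]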
Figure 8: Approximation of the background distribution and the distribution of the scores
from a single gene
Exponent mean The third method to compute the probability that a gene g
has motif M in its upstream region is the method used by Segal et al. [10], which
computes:
P(g.M = \text{true} \mid S_1, \ldots, S_n) = \varsigma\left( \log\left( \frac{1}{n-p+1} \sum_{i=1}^{n-p+1} \exp\left\{ \sum_{j=1}^{p} W_j[S_{i+j-1}] \right\} \right) \right),    (7)
where ς is the sigmoid function (ς(x) = 1/(1 + e^{−x})) and n is the length of the upstream
region. This function takes the mean of the exponent of all scores and in this way gives
a higher weight to large scores and neglects very low scores. The sigmoid function scales
the resulting score values between 0 and 1.
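Equation (7) can be sketched directly from a list of window scores. This follows the formula as written here, not necessarily every detail of the implementation of Segal et al.; the name `exponent_mean_score` and the example scores are illustrative.

```python
import math

def exponent_mean_score(window_scores):
    """Sigmoid of the log of the mean exponentiated window score, cf. equation (7)."""
    mean_exp = sum(math.exp(s) for s in window_scores) / len(window_scores)
    log_mean = math.log(mean_exp)
    return 1.0 / (1.0 + math.exp(-log_mean))

# Hypothetical score lists: one strong hit among weak scores versus weak scores only.
with_hit = exponent_mean_score([5.0, -5.0, -5.0])
without_hit = exponent_mean_score([-5.0, -5.0, -5.0])
```

Since ς(log m) = m / (1 + m), a single large score raises the result substantially while very low scores contribute almost nothing.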
C.2 Comparison of scoring methods
To evaluate the quality of these different scores we employ the binding data from
Harbison as a ground truth [5]. Harbison et al. elaborately investigated the confidence
level at which transcription factors bind to certain motifs in different yeast species. We
computed the match between our methods at different cutoff thresholds between 0 and 1
and the data from Harbison [5] with no conservation criteria and a binding p-value
smaller than 0.001. From this, ROC curves were computed, showing the sensitivity
(TruePositives / (TruePositives + FalseNegatives)) versus 1-specificity
(FalsePositives / (FalsePositives + TrueNegatives)). The
ROC curves are depicted in Figure 9(a).
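The ROC points can be sketched as follows, assuming that a gene-motif pair is predicted as bound when its score is at or above the cutoff. The function name `roc_points`, the scores and the ground-truth labels are hypothetical.

```python
def roc_points(scores, truth, thresholds):
    """Return (1 - specificity, sensitivity) for each cutoff threshold."""
    points = []
    for t in thresholds:
        predicted = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(predicted, truth))
        fn = sum((not p) and y for p, y in zip(predicted, truth))
        fp = sum(p and (not y) for p, y in zip(predicted, truth))
        tn = sum((not p) and (not y) for p, y in zip(predicted, truth))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Hypothetical gene-motif scores and ground-truth binding labels.
scores = [0.9, 0.8, 0.4, 0.2]
truth = [True, True, False, False]
curve = roc_points(scores, truth, thresholds=[0.0, 0.5, 1.0])
```

The sketch assumes the ground truth contains at least one bound and one unbound pair; otherwise one of the ratios is undefined.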
[Figure 9 panels: (a) ROC curves (sensitivity versus 1-specificity) of the proposed Gene-Motif scoring methods Segal2003, E05, E025, E01, E005, E002, E001 and Emix, using Harbison as ground truth. (b) Median number of motifs per gene and total number of genes without any motifs, plotted against the threshold on the Segal2003 score (0.65 to 1).]
Figure 9: Evaluation of the proposed scoring methods and setting a threshold to obtain a
binary motif profile.
From Figure 9(a), it is clear that Segal's method outperforms the other measures.
Consequently, we have chosen to use this method to assign a motif score to each gene.
Furthermore, we have observed that the proposed mix method tends towards the
method of Segal as the number of combined thresholds increases.
C.3 Score thresholding
The scoring algorithm returns a continuous score that can be seen as a probability that
a certain motif is functionally present in an upstream region of a gene. We will refer to
the 107 motif scores as the continuous motif profile of a gene. For both computational
simplicity and comprehensibility it might be desirable to threshold the enrichment
values and obtain a true/false relationship between gene upstream region and motif.
Figure 9(b) shows the resulting median number of motifs per gene for thresholds
ranging between 0.65 and 1. Furthermore, the total number of genes without any motif
is represented, because we want to restrict this number as much as possible. To obtain
a reasonable number of motifs that can aid us in distinguishing between gene regulation
programs, we set the threshold on the score value so that the median number of motifs
for an upstream region equals five (threshold: 0.82). This number was also observed by
Zhang et al. [14], who used a database of known and experimentally verified motifs to
scan the upstream regions of yeast genes. For vertebrates, Prakash and Tompa found a
similar number, six [9], based on overrepresentation in an orthologous human, chimp,
mouse and rat dataset. Had we chosen a more stringent threshold, the number
of genes without any motif annotation would have increased dramatically, as is visible
in Figure 9(b). The set of 107 thresholded motif scores will be called the binary motif
profile of a gene. In Figure 10 we show the continuous motif profile with the threshold
of 0.82.
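The thresholding step and the two quantities plotted in Figure 9(b) can be sketched as below. Whether a score equal to the threshold counts as present is not stated in the text, so the `>=` comparison is an assumption, as are the function names and the toy profiles (three genes, four motifs).

```python
from statistics import median

def binarize(profiles, threshold):
    """Turn continuous motif profiles into binary motif profiles."""
    return [[score >= threshold for score in profile] for profile in profiles]

def summarize(binary_profiles):
    """Median number of motifs per gene and number of genes without any motif."""
    counts = [sum(profile) for profile in binary_profiles]
    return median(counts), counts.count(0)

# Hypothetical continuous motif profiles.
profiles = [
    [0.90, 0.85, 0.10, 0.30],
    [0.50, 0.83, 0.20, 0.81],
    [0.10, 0.20, 0.30, 0.40],
]
median_motifs, genes_without = summarize(binarize(profiles, threshold=0.82))
```

Repeating this for a range of thresholds gives the two curves from which a threshold such as 0.82 can be read off.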
[Figure 10 plot: motif profile with threshold; score (0 to 1, vertical axis) against motif index (0 to 120, horizontal axis).]
Figure 10: A motif profile of a gene with the proposed threshold of 0.82. These motif profiles
appear to be extremely noisy (with mean 0.32 and standard deviation 0.26).
D Clustering
D.1 Hierarchical clustering
To separate the genes into distinct groups we apply an agglomerative hierarchical
clustering algorithm. This method begins with each element as a separate cluster
and merges them into successively larger clusters. The resulting hierarchical structure
is generally displayed in a dendrogram. Cutting this tree at a given height gives
a clustering with the desired number of clusters. A pleasant property of hierarchical
clustering is that, unlike other methods such as k-means, it requires only a distance
measure and no actual data values.
There are three basic methods to decide which points are grouped together at each
step in the clustering algorithm. If at a certain stage the distance between group A
and group B needs to be calculated we can either use:
• Single linkage: min{d(x, y) : x ∈ A, y ∈ B}
• Complete linkage: max{d(x, y) : x ∈ A, y ∈ B}
• Average linkage: \frac{1}{N_A N_B} \sum_{x \in A} \sum_{y \in B} d(x, y)
We will use the complete linkage method because this method tends to group the
smallest clusters together and in this way provides the most evenly sized clusters.
Figure 11 shows that complete linkage provides the most balanced dendrogram.
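The three linkage rules can be written down directly from their definitions. In this sketch the groups are plain lists and `d` is any distance function; the one-dimensional points and the absolute difference are chosen only to keep the example small.

```python
def single_linkage(group_a, group_b, d):
    """Smallest pairwise distance between the two groups."""
    return min(d(x, y) for x in group_a for y in group_b)

def complete_linkage(group_a, group_b, d):
    """Largest pairwise distance between the two groups."""
    return max(d(x, y) for x in group_a for y in group_b)

def average_linkage(group_a, group_b, d):
    """Mean of all pairwise distances between the two groups."""
    total = sum(d(x, y) for x in group_a for y in group_b)
    return total / (len(group_a) * len(group_b))

# Hypothetical one-dimensional groups.
group_a, group_b = [0.0, 1.0], [3.0, 5.0]
distance = lambda x, y: abs(x - y)
linkages = (
    single_linkage(group_a, group_b, distance),
    complete_linkage(group_a, group_b, distance),
    average_linkage(group_a, group_b, distance),
)
```

Only pairwise distances are needed, which is why the clustering can be driven by a precomputed distance matrix.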
D.2 Distance measures
Every clustering method needs a measure to compute the distance between (groups
of) data points. In straightforward expression data clustering, the correlation
between two expression profiles is usually used as distance because this measure is
insensitive to differences in offset and scaling of the profiles. To make use of both
expression data and motif enrichment in a clustering scheme we may combine both
distance measures using a weighted sum. To this end, we need a weighting parameter
to tune the importance of both information sources. Therefore we propose to compute
a separate distance measure for motif enrichment and expression profiles and combine
them as follows:
J_{ij} = (1 - \alpha) d^{E}_{ij} + \alpha d^{M}_{ij},    (8)
where α is the weighting parameter that sets the balance between the expression
distance d^{E}_{ij} and the motif distance d^{M}_{ij}. To compute the expression distance we
use the correlation between the expression profiles; this is the most common strategy
and has proven to provide adequate results.
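Equation (8) can be sketched as follows. The text does not spell out how the correlation is turned into a distance, so the convention 1 − r (Pearson correlation) used here is an assumption, as are the function names and the example values.

```python
import math

def correlation_distance(x, y):
    """1 - Pearson correlation; an assumed convention for the expression distance."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)

def combined_distance(d_expression, d_motif, alpha):
    """Weighted sum J = (1 - alpha) * dE + alpha * dM, cf. equation (8)."""
    return (1.0 - alpha) * d_expression + alpha * d_motif

# Two hypothetical expression profiles that differ only in scale.
d_expr = correlation_distance([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
j = combined_distance(d_expr, d_motif=0.4, alpha=0.25)
```

With α = 0 the clustering uses expression only, and with α = 1 it uses the motif profiles only.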
In contrast to the expression profiles, it is far from trivial to determine a distance
measure between motif profiles. A gene is not influenced by a single motif. In fact, the
combination of two motifs in the upstream region of a gene can have an effect that is
contrary to the effect of the individual motifs. For example, if motif M1 or M2 resides
in the upstream region of a gene, this gene can become up-regulated, while a gene with
both these motifs is down-regulated. We have considered four different methods
that compute the motif distance based on binary profiles:
• Hamming distance: d_H = \frac{s}{N}
• Jaccard distance: d_J = 1 - \frac{r}{r+s}
• MaxiMin distance: d_M = \max\{\min\{P_1, P_2\}\}
• Shared motif distance: d_S = 1 - \frac{r}{N}
In these distances, N is the total length of the motif profiles, s is the number of
differences between the two profiles (\sum_{i=1}^{N} |P_1(i) - P_2(i)|) and r is the number of
motifs present in both profiles (\sum_{i=1}^{N} \min\{P_1(i), P_2(i)\}).
[Figure 11 panels, each a dendrogram with C = 50 clusters: (a) Average linkage; α = 0. (b) Complete linkage; α = 0. (c) Average linkage; α = 0.5. (d) Complete linkage; α = 0.5. (e) Average linkage; α = 1. (f) Complete linkage; α = 1.]
Figure 11: Dendrograms of clusterings with α ∈ {0, 0.5, 1} using both average and complete
linkage. When α approaches 1, the symmetrical structure is maintained for complete linkage.
[Figure 12 illustration: three rows (profiles A, B and C) and seven columns (motifs M1 to M7).]
Figure 12: Three artificial motif profiles where seven motifs can either be present or not
present. In Table 1 the different distance measures are computed for combinations of these
profiles.
To illustrate the properties of each distance measure we have drawn three artificial
motif profiles in Figure 12. Table 1 shows the distances for combinations of these
profiles; each method demonstrates a different selective ability between profile combinations. The first three methods always give identical profiles a distance of zero,
while the Shared criterion takes the number of present motifs into account and in this way distinguishes between A-A and C-C. The MaxiMin distance gives all profiles that differ
by one or more motifs the maximal value of one. The Hamming distance is the only
method that distinguishes between A-B and A-C. We reason that the extra motif in gene
C makes this gene more likely to be part of a different biological process than gene B.
Therefore, we prefer to be able to distinguish between these gene pairs. Consequently,
we will use the Hamming distance to compute a distance measure between motif
profiles.
Distance   A-A   A-B   A-C   B-C   C-C
Hamming    0     2/7   3/7   1/7   0
Jaccard    0     1     1     1/2   0
MaxiMin    0     1     1     1     0
Shared     6/7   1     1     6/7   5/7
Table 1: Distance measures for different combinations of motif profiles from Figure 12.
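The Hamming, Jaccard and Shared entries of Table 1 can be checked with a short sketch. The three binary profiles below are hypothetical: Figure 12 is not reproduced here, so they were chosen only because they are consistent with the values in Table 1 (A carries one motif, B carries a different one, and C carries the motif of B plus one extra). The MaxiMin distance is left out.

```python
from fractions import Fraction

def differences_and_shared(p1, p2):
    """s = number of differing positions, r = number of motifs present in both."""
    s = sum(abs(a - b) for a, b in zip(p1, p2))
    r = sum(min(a, b) for a, b in zip(p1, p2))
    return s, r

def hamming(p1, p2):
    s, _ = differences_and_shared(p1, p2)
    return Fraction(s, len(p1))

def jaccard(p1, p2):
    # Undefined when neither profile contains a motif (r + s == 0).
    s, r = differences_and_shared(p1, p2)
    return 1 - Fraction(r, r + s)

def shared(p1, p2):
    _, r = differences_and_shared(p1, p2)
    return 1 - Fraction(r, len(p1))

# Hypothetical profiles over seven motifs, consistent with Table 1.
A = [1, 0, 0, 0, 0, 0, 0]
B = [0, 1, 0, 0, 0, 0, 0]
C = [0, 1, 1, 0, 0, 0, 0]
table_row_hamming = [hamming(A, A), hamming(A, B), hamming(A, C), hamming(B, C), hamming(C, C)]
```

Exact fractions are used so that the results can be compared with the table without rounding.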
The downside of the Hamming distance is that the score depends on the length
of the motif profile. Moreover, in the motif profiles only a small part of the 107 motifs
can be coupled to one of the conditions under investigation. The non-significant values
would introduce a lot of noise in our computation if we used the entire enrichment
profile.
We are therefore looking for a method to combine all motifs that play a role in
the regulatory mechanism that is active under the specific microarray conditions in
the computation of a distance measure. A smart feature selection method should be
able to select the motifs that are truly informative and thereby reduce the number of
redundant features.
E Feature selection
We use the initial clustering to select the motifs that are most important in the conditions under investigation. We propose three different methods to rank the motifs in
order of importance.
E.1 Cluster best
The most straightforward way of selecting the most informative motifs for our clustering task is to first cluster the data on the expression profile only, then select the
motif with the highest enrichment from each cluster and use only these motifs together
with the expression data in the next clustering iteration. We can compute the
motif enrichment either by counting the number of genes that have the motif in their upstream region or by computing the p-value of the number of genes that have this motif
compared to the entire dataset.
E.2 P-value threshold
Instead of selecting the best motif from each cluster, we can also compute the p-value
for all motif-cluster combinations and select the motifs that score below a certain
threshold in one or more clusters. The advantage of this method is that we allow
a combination of motifs to regulate one cluster. Also, the non-informative clusters
will not determine the selected motifs. As we do not thoroughly investigate the ideal
number of clusters, this is a very important feature.
E.3 Decision tree
In classification problems the decision tree is well known and widely used. This approach tries to divide the search space into rectangular regions, so that all classes are
separated. The input for a decision tree is a database of samples which are all given
a class label, and a compendium of attributes that together contain all information
we have about our samples. There are many different methods to grow decision trees,
but all rely on the same principle: repeatedly split the data into smaller and smaller
groups in such a way that each new generation of nodes has greater purity than its
ancestors with respect to the target variables. [2]
We propose to consider motif presences as features and learn a decision tree to
predict cluster labels. In this case the decision tree creates rules in the form:
M_{ik} ∧ ¬M_{il}  ⇒  g_i ∈ C_1
¬M_{ik} ∧ M_{il}  ⇒  g_i ∈ C_2
M_{ik} ∧ M_{il}   ⇒  g_i ∈ C_3
This allows the occurrence of two motifs in a gene's upstream region to result in a
different gene expression profile than the occurrence of each motif separately. This
method thus predicts a clustering and creates a general understanding of the regulation
mechanism at the same time.
Because the basic strategy in the choice of splits is to take the splitting attributes
with the highest information gain first [4], the motifs at the top of the tree can be
considered to be most informative to distinguish the clusters. By using the top-n
motifs from the tree as a feature selection mechanism we can create an iterative method
that uses the resulting tree in the next clustering step, so that we use the motifs that
are most important in our new clustering.
With this method we can choose to set a fixed depth level and select all important
motifs for the next clustering iteration.
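The splitting criterion mentioned above can be illustrated with a small information-gain computation. This is a sketch of the criterion for a single binary motif feature, not of a full tree-growing algorithm; the function names, cluster labels and motif presence vectors are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of cluster labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(has_motif, labels):
    """Entropy reduction obtained by splitting the genes on motif presence."""
    present = [label for m, label in zip(has_motif, labels) if m]
    absent = [label for m, label in zip(has_motif, labels) if not m]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - remainder

# Four hypothetical genes in two clusters.
labels = ["C1", "C1", "C2", "C2"]
gain_informative = information_gain([True, True, False, False], labels)
gain_uninformative = information_gain([True, False, True, False], labels)
```

A motif that separates the clusters perfectly has the highest gain and would be placed at the top of the tree, whereas a motif spread evenly over the clusters has a gain of zero.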
[Figure 13 illustration: a binary decision tree whose root asks "Is M1 present?"; the internal nodes ask about the presence of M2, M3, M4 and M5, and the leaves assign genes to the clusters C1 to C4.]
Figure 13: An example of a decision tree using motif presence to make splits in a gene
database
E.4 Method comparison
Because we did not use any knowledge of the data to set the number of clusters to 50, we do not
expect to find only informative clusters. Therefore, we strive to make our methodology
as independent of the number of clusters as possible. Feature selection method 1
(Cluster Best) picks the best motif from each cluster, and therefore conflicts with
our goals. To compare method 2 (P-value threshold) and method
3 (Decision tree) we compare the motif rankings both methods produce. To this end,
we have used both methods in an iterative clustering scheme that uses the selected
motifs from the previous iteration to compute the distance measure. Figure 14 shows
which motifs were selected in each iteration. We have computed the results over
100 samplings of 80% of the data; the darkness of the lines therefore indicates the
proportion of the data samplings in which that motif was selected.
[Figure 14 panels, each showing the selected motifs (horizontal axis: motifs) over 10 selection iterations (vertical axis: selection iteration): (a) Selection method 2: selecting the motifs with p-value < 10^-4. (b) Selection method 3: selecting the top motifs from the decision tree (depth ≤ 9).]
Figure 14: Evaluation of the proposed motif selection methods.
Method 2 seems to be less dependent on the data sampling (there is a smaller
number of light blue lines). Moreover, this method more often picks the motifs that
we know play a role in this system. If we regard the motifs that were
initially associated with the conditions under investigation by Tai et al. [12] we see
that method 2 in general gives the known motifs a much higher rank than method 3
(see Table 2). We shall therefore use the p-value threshold method to select the most
informative motifs.
Table 2: The motifs mentioned by Tai et al. [12] and the rank they get according to our
feature selection methods. Db indicates the index of the motif in our database.
Db   Motif      Rank (Method 2)   Rank (Method 3)
8    MIG1       11                55
48   GLN3       4                 6
49   GZF3       3                 7
41   DAL82      7                 33
47   GAT1       8                 28
9    PHO4       2                 4
38   CBF1       1                 32
58   MET31/32   10                1
59   MET4       88                35
Figure 14 also shows us that the selected group of motifs almost does not change
over the 10 iterations; we will therefore only use the motif selection once and use the
second clustering as final result.
E.5 Motif ranking
In the initial clustering we set α = 0 and therefore cluster purely on expression data.
We compute a consensus clustering using 500 samplings of 80% of the data. From this
initial clustering we compute p-values of the motif occurrences for each cluster using
the hypergeometric distribution. Then, we rank all motifs according to the lowest p-value they attain in any of the clusters. In Table 3 the p-values and consensus sequences
of our motif database are shown. Since we collected these motifs from three different
databases (18 from Transfac [13], 13 from SCPD [15] and 76 from Harbison et al. [5]),
we also show the source in Table 3.
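The ranking step can be sketched as follows. The one-sided test (probability of at least the observed number of motif-carrying genes in a cluster) is an assumption about the exact form of the hypergeometric test, and the function names, gene identifiers, clusters and motif annotations are hypothetical.

```python
from math import comb

def hypergeom_pvalue(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` genes from `population` genes,
    of which `successes` carry the motif."""
    upper = min(successes, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / comb(population, draws)

def rank_motifs(clusters, genes_with_motif, n_genes):
    """Rank motifs by the lowest p-value they attain in any cluster."""
    best = {
        motif: min(
            hypergeom_pvalue(n_genes, len(carriers), len(cluster), len(carriers & cluster))
            for cluster in clusters
        )
        for motif, carriers in genes_with_motif.items()
    }
    return sorted(best, key=best.get), best

# Ten hypothetical genes in two clusters, annotated with two hypothetical motifs.
clusters = [{0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}]
genes_with_motif = {"M1": {0, 1, 2, 3}, "M2": {0, 1, 5, 6}}
ranking, best_pvalues = rank_motifs(clusters, genes_with_motif, n_genes=10)
selected = [m for m in ranking if best_pvalues[m] < 0.05]  # threshold chosen for the toy data
```

The same p-values support the p-value threshold selection of Section E.2: a motif is kept when its lowest p-value over all clusters falls below the chosen threshold.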
Table 3: The motif database and enrichment p-values in the initial clustering
Db   Motif              Source    P-value   Consensus sequence
9    PHO4               TransFac  4.04E-11  TCCCACGTGGGC
48   GLN3               Harbison  6.83E-11  GATAAGATA
49   GZF3               Harbison  7.65E-10  GATAAG
38   CBF1               Harbison  2.72E-09  TCACGTG
41   DAL82              Harbison  2.76E-09  GATAAGA
47   GAT1               Harbison  2.01E-07  AGATAAG
17   HAP2/3/4           TransFac  3.52E-07  ACCCGCCAATCACCGG
1    repressor of CAR1  TransFac  5.43E-07  CACTAGCCGCCGAGGGC
39   CIN5               Harbison  8.84E-07  TTACGTAA
98   TYE7               Harbison  3.37E-06  TCACGTGAT
58   MET31/32           Harbison  4.38E-06  AAACTGTGG
34   YAP3/5/6 ARR1      Harbison  6.29E-06  TTACTAA
104  YAP7               Harbison  7.74E-06  CTTAGTAAT
33   AFT2               Harbison  1.70E-05  GGGTGC
77   RPN4               Harbison  3.75E-05  TTTGCCACC
27   SWI5               SCPD      5.51E-05  ACAGCATGCTGG
51   HAP1               Harbison  6.15E-05  GGAAATATCGG
8    MIG1               TransFac  7.80E-05  GAAAAAAATCCGGGGTA
101  UME6               Harbison  1.06E-04  TAGCCGCCGA
80   SIG1               Harbison  1.59E-04  AGGAAACAAAAA
69   RCS1               Harbison  3.41E-04  GGGTGCAGT
63   NRG1               Harbison  5.00E-04  GGACCCT
37   CAD1               Harbison  5.23E-04  ATTAGTAAGC
25   REB1               SCPD      5.69E-04  TTACCCG
92   SUM1               Harbison  8.10E-04  GCGTCAGAAAA
94   SWI4               Harbison  1.13E-03  GACGCGAAA
22   SCB                SCPD      1.40E-03  CACGAAA
46   GAL80              Harbison  1.56E-03  CGGCCCCCCCCCCCCCG
16   RAP1               TransFac  1.89E-03  AACACCCATACACC
99   UGA3               Harbison  2.13E-03  CCGCCCCCGG
74   RLM1               Harbison  2.22E-03  CTAAAAATAG
19   RLM1               SCPD      2.28E-03  CGTTCTATAAATAGACCC
79   SFP1               Harbison  2.37E-03  ACCCGTACAT
84   SNT2               Harbison  2.57E-03  CGGCGCTACCA
57   MBP1               Harbison  2.65E-03  GACGCGT
21   CSRE               SCPD      3.14E-03  TCCGGATGAATGG
60   MSN2               Harbison  3.42E-03  AAGGGGCGG
44   FKH1               Harbison  3.73E-03  TTGTTTAC
91   STP1               Harbison  3.95E-03  GCGGCCCCGCGGC
59   MET4               Harbison  3.96E-03  AAAAACTGTGGCGCC
62   NDD1               Harbison  4.14E-03  TTTCCCAATTGGG
61   MSN4               Harbison  4.15E-03  CAGGGGC
35   AZF1               Harbison  4.53E-03  TTTTTCTTTTCCTGTTTC
89   STB4               Harbison  4.90E-03  TCGGAACGA
24   PDR1/PDR3          SCPD      4.99E-03  TCCGCGGA
107  ZAP1               Harbison  5.01E-03  ACCCTCAAGGTTGT
67   PHO2               Harbison  5.20E-03  CGTGCGGTGCG
78   RTG3               Harbison  5.33E-03  GGTCAC
102  XBP1               Harbison  5.39E-03  CTTCGAG
81   SIP4               Harbison  6.74E-03  CGGCTGAATGGAA
70   RDS1               Harbison  7.32E-03  TCGGCCGA
93   SUT1               Harbison  7.69E-03  GCGGGGCCGG
85   SOK2               Harbison  8.93E-03  TGCAGGAA
42   DIG1               Harbison  9.36E-03  TGAAACA
3    MATa1              TransFac  9.36E-03  TGATGTAGCT
32   ACE2               Harbison  1.03E-02  TGCTGGT
55   LEU3               Harbison  1.10E-02  CCGGTACCGG
106  YOX1               Harbison  1.18E-02  ACAATACTGACG
103  YAP1               Harbison  1.20E-02  TTAGTCAGC
23   MCB                SCPD      1.26E-02  ACGCGT
97   THI2               Harbison  1.28E-02  GAAACCCTAAGA
53   INO2               Harbison  1.31E-02  CACATGC
43   FHL1               Harbison  1.36E-02  ATGTACGGGTG
65   PDR1               Harbison  1.36E-02  CCGCCGAATAA
26   ROX1               SCPD      1.43E-02  CCCATTGTTCTC
73   RIM101             Harbison  1.49E-02  TGCCAAG
87   SPT23              Harbison  1.53E-02  GAAATCAA
75   RLR1               Harbison  1.68E-02  ATTTTCTTCTTT
90   STB5               Harbison  1.70E-02  CGGTGTTATA
2    ABF1               TransFac  1.71E-02  CCATATCACTATACACGAAACT
50   HAC1               Harbison  1.78E-02  GGCCAGCGTGTC
5    GCR1               TransFac  1.78E-02  GGCTTCCAC
83   SKO1               Harbison  2.11E-02  ACGTCA
20   SMP1               SCPD      2.17E-02  GTGCTGCTATTTATAGCAGC
18   STRE               TransFac  2.20E-02  TCAGGGGG
86   SPT2               Harbison  2.23E-02  CCTGTATCTAA
95   SWI6               Harbison  2.24E-02  TTTCGCGT
64   OPI1               Harbison  2.29E-02  TCGAACC
10   MCM1               TransFac  2.32E-02  TTTCCCAATCGGGTAA
68   PUT3               Harbison  2.35E-02  CCCGGCCCCCCCCCCCCG
36   BAS1               Harbison  2.40E-02  TGACTC
15   AP-1               TransFac  2.47E-02  GTGAGTCAG
82   SKN7               Harbison  2.48E-02  GCCCGGGCC
11   HSF2               TransFac  2.61E-02  AGAACAGAACAGAAC
7    GAL4               TransFac  2.69E-02  CATCGGCGCACTGTCCTCCGAAC
13   HSF4               TransFac  2.71E-02  AGAACGTTCTAGAAC
88   STB1               Harbison  2.75E-02  AAACGCGAAA
45   FKH2               Harbison  2.82E-02  AAAGGTAAACAA
30   UASPHR             SCPD      2.85E-02  TTTTCTTCCTCG
76   RPH1               Harbison  3.30E-02  CCCCTTAAGG
72   RGT1               Harbison  3.64E-02  CGGACCA
105  YDR026c            Harbison  3.64E-02  TTTACCCGGC
28   STE12              SCPD      3.77E-02  ATGAAACC
66   PHD1               Harbison  4.04E-02  GCCGCAGG
14   HSF5               TransFac  4.51E-02  GTTCTAGAACAGAAC
40   DAL81              Harbison  4.91E-02  AAAAGCCGCGGGCGGGATT
4    GCN4               TransFac  5.07E-02  CCGCGCCGGTGACTCATCCCGCGCC
100  UME1               Harbison  5.12E-02  AAGGAAAAGTA
96   TEC1               Harbison  5.30E-02  AGGAATG
71   RFX1               Harbison  5.49E-02  TTGCCATGGCAAC
56   MAC1               Harbison  5.82E-02  GAGCAAA
12   HSF3               TransFac  5.92E-02  AGAACAGAACGTTCT
29   TBP                SCPD      5.97E-02  TATAAAA
6    ADR1               TransFac  8.29E-02  CGGGGG
54   INO4               Harbison  1.08E-01  CATGTGAAAA
31   XBP1               SCPD      1.08E-01  GCCTCGAGGCGG
52   HSF1               Harbison  1.22E-01  TTCTAGAACCTTC
F Gene Ontology
The Gene Ontology (GO) Consortium [1] has gathered information about gene functions from multiple databases and combined this information into a single annotation
tree. They have made a great effort to describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. This project has become a global center for gene function information.
In order to evaluate the performance of our method we have used the GO database
to compute the enrichment of GO annotations within clusters. If a cluster obtains a
higher concentration of a certain GO annotation, we expect to have found genes that
are more related in terms of biological function.
Figure 15: The Gene Ontology database collects biological gene information in a hierarchical
tree structure.
G Iterative clustering
We have iteratively clustered the data, using the best motifs from the consensus clustering of the previous iteration to compute the distance measure. In these clusterings
α was fixed to 0.25. Table 4 shows the ranking of the motifs with p-values < 10^-6
for ten clustering iterations. The set of motifs does not seem to converge to a select
group.
Table 4: Motif ranking for 10 iterations, using the consensus clustering to select the motifs
that attain a p-value < 10^-6 in any of the clusters. Iteration 0 shows the motif ranking of
the motifs selected by the initial clustering. The numbers in this table indicate the index in
our motif database, which can be found in Table 3.
Iteration   Motif database index (p < 10^-6)
0           9 48 49 38 41 47 17 1 39
1           49 41 48 38 9 17 58 98 47 8 33
2           38 98 48 41 49 9 17 58
3           17 49 38 41 98 48 58 9 39 8
4           38 9 58 17 49 41 98 39 48
5           17 38 9 58 41 49 8 48 39 98
6           38 9 41 49 58 17 48 98 39 1
7           38 17 58 49 9 41 48 98
8           38 9 17 58 49 41 98 48 8 39
9           17 38 41 49 48 9 58 98 39 1
10          38 9 49 17 41 58 48 39 98 8 1
This shows that the consensus clustering is not stable enough to
reach convergence. However, when we compute the mean motif enrichment of the 500
sampled clusterings, the iterative clustering does converge to a clear set of motifs.
After three iterations the motif ranking remains constant (see Table 5). The motifs
that obtain a p-value smaller than 10^-6 are: CBF1, GZF3, GLN3, DAL82, HAP2/3/4,
PHO4, TYE7, MET31/32, GAT1.
Table 5: Motif ranking for 10 iterations, using the mean p-values of the 500 sampled
clusterings to select the significantly enriched motifs. This table also shows the motifs with
p < 10^-4 and p < 10^-5. The numbers indicate the motif database index.
Iteration   p < 10^-6                    p < 10^-5   p < 10^-4
0           38 9 48 98 49                17 41 58    47 39 8
1           38 17 41 49 48 9 98 58       47          8 39
2           38 49 48 17 41 9 98 58 47    -           8 39
3           38 49 48 17 41 9 98 58 47    -           8
4           38 49 48 41 17 9 98 58 47    -           8 39
5           38 49 48 17 41 9 98 58 47    -           8
6           38 49 48 41 17 9 98 58 47    -           8
7           38 49 17 48 41 9 98 58 47    -           8
8           38 49 48 17 41 9 98 58 47    -           8 39
9           38 49 48 41 17 9 98 58 47    -           8 39
10          38 49 48 17 41 9 98 58 47    -           8
Finally, we computed a consensus clustering based on 5000 cluster iterations, using
the previously selected set of motifs. The resulting clusters are shown in Figure 16.
Although the order of the motifs has changed, the set of motifs with a p-value < 10^-6
remains almost the same. Only MIG1 has been included in this set. In Table 5 we can
see that this motif was already the next in rank.
[Figure 16 panels: expression of clusters A to G under C, N, P and S limitation in aerobic and anaerobic cultures; the number of genes per cluster (listed in the figure as 45, 129, 90, 6, 55, 71 and 22); and the enrichment of the motifs 8-MIG1, 9-PHO4, 47-GAT1, 49-GZF3, 38-CBF1, 17-HAP2/3/4, 58-MET31/32, 98-TYE7, 41-DAL82 and 48-GLN3.]
Figure 16: Clusters that show the highest motif enrichment in the consensus clustering
computed over 5000 iterations. For this clustering we used the motifs: CBF1, GZF3, GLN3,
DAL82, HAP2/3/4, PHO4, TYE7, MET31/32 and GAT1. The motif weight α was set to
0.25.
Figure 16 shows all the previously identified regulation mechanisms (Clusters A, B,
C, E and F). Cluster C shows both carbon and phosphorus limitation response in the
aerobic environment. This cluster was previously identified as two separate groups,
but since both mechanisms are regulated by the HAP motifs they are easily clustered
together.
The phosphorus cluster has been split into two different groups (Clusters E and G).
The first cluster shows high expression under phosphorus limitation in both aerobic and
anaerobic environments, while the second cluster shows little response in the anaerobic
case.
Furthermore, we identify a very small cluster responding to carbon limitation in
an aerobic environment (Cluster D). These genes are regulated by TYE7 and CBF1,
which are known to act under sulfur limitation. We have found no further evidence to
support this finding.
References
[1] Ashburner, M., et al., “Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium”. Nature Genetics Vol. 25, pp. 25-29, 2000.
[2] Berry, M.J.A. and Linoff, G.S. “Data Mining Techniques, for Marketing, Sales, and Customer Relationship Management”. Wiley Publishing, Inc., Indianapolis, 2004.
[3] Crooks, G.E., et al., “WebLogo: A Sequence Logo Generator”. Genome Research Vol. 14, pp. 1188-1190, 2004.
[4] Dunham, M.H. “Data Mining, Introductory and Advanced Topics”. Pearson Education, Inc., 2003.
[5] Harbison, C.T., et al., “Transcriptional regulatory code of a eukaryotic genome”. Nature Vol. 431, pp. 99-104, 2004.
[6] Hertz, G.Z. and Stormo, G.D. “Identifying DNA and protein patterns with statistically significant alignments of multiple sequences”. Bioinformatics Vol. 15, nos 7/8, pp. 563-577, 1999.
[7] Jensen, S.T., et al., “Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective”. Statistical Science Vol. 19, No. 1, pp. 188-204, 2004.
[8] Kellis, M. “Computational comparative genomics: genes, regulation, evolution”. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2003.
[9] Prakash, A. and Tompa, M. “Discovery of regulatory elements in vertebrates through comparative genomics”. Nature Biotechnology Vol. 23, No. 10, pp. 1249-1256, 2005.
[10] Segal, E., Yelensky, R. and Koller, D. “Genome-wide discovery of transcriptional modules from DNA sequence and gene expression”. Bioinformatics Vol. 19, Suppl. 1, pp. i273-i282, 2003.
[11] Stormo, G.D. “DNA binding sites: representation and discovery”. Bioinformatics Vol. 16, No. 1, pp. 16-23, 2000.
[12] Tai, S.L., et al., “Two-dimensional Transcriptome Analysis in Chemostat Cultures”. The Journal of Biological Chemistry Vol. 280, No. 1, pp. 437-447, 2005.
[13] Wingender, E., et al., “TRANSFAC: an integrated system for gene expression regulation”. Nucleic Acids Research Vol. 28, No. 1, pp. 316-319, 2000.
[14] Zhang, Z., Gu, J. and Gu, X. “How much expression divergence after yeast gene duplication could be explained by regulatory motif evolution?”. TRENDS in Genetics Vol. 20, No. 9, pp. 403-407, 2004.
[15] Zhu, J. and Zhang, M.Q. “SCPD: a promoter database of the yeast Saccharomyces cerevisiae”. Bioinformatics Vol. 15, nos 7/8, pp. 607-611, 1999.