36663 - Radboud Repository

PDF hosted at the Radboud Repository of the Radboud University
Nijmegen
The following full text is a preprint version which may differ from the publisher's version.
For additional information about this publication click this link.
http://hdl.handle.net/2066/36663
Please be advised that this information was generated on 2017-06-15 and may be subject to
change.
U sin g S ym m etric C ausal In d ep en d en ce M odels
to P red ict G ene E xpression from Sequence D a ta
Rasa Jurgelenaite1, Tom Heskes1, and Tjeerd D ijkstra2
1 I n s titu te for C o m p u tin g a n d In fo rm a tio n Sciences, R a d b o u d U n iv ersity N ijm egen,
P O B ox 9010, 6500 G L N ijm egen, T h e N e th e rla n d s
{rasa, to m h } @ cs.ru .n l
2 D e p a rtm e n t o f P arasito lo g y , L eid en U n iv ersity M edical C en ter,
P O B ox 9600, 2300 R C L eiden, T h e N e th e rla n d s
t.d ijk stra @ lu m c .n l
A b s t r a c t . W e p re se n t a n a p p ro a c h for in ferrin g tra n s c rip tio n a l reg u ­
la to ry m o d u les from genom e sequence an d gene ex p ressio n d a ta . O u r
m e th o d , w hich is b ased o n sy m m etric c a u sal in d ep en d e n ce m odels, is
b o th able to m o d el th e logic b e h in d tra n s c rip tio n a l re g u la tio n a n d to
in c o rp o ra te u n c e rta in ty a b o u t th e fu n c tio n a lity o f p u ta tiv e tra n s c rip ­
tio n fa c to r b in d in g sites. A p p ly in g o u r a p p ro a c h to th e d e ad lie st species
of h u m a n m a la ria p a ra s ite , P la s m o d iu m falcip aru m , we o b ta in several
strik in g re su lts t h a t d eserve fu rth e r (biological) in v estig atio n .
K e y w o r d s: T ra n sc rip tio n a l re g u la to ry n etw o rk s, sy m m etric ca u sal in ­
d ep en d en c e m o d els, P la s m o d iu m fa lc ip a ru m
1
In tro d u ctio n
One of the m ajor challenges facing biologists is to understand the transcriptional
regulation of genes, which is critical for the development, complexity and home­
ostasis of all living organisms. The introduction of DNA m icroarray technology
[26], which enables researchers to simultaneously measure the concentration of
RNA transcripts from a single sample of cells or tissues, has offered the possibil­
ity to infer large-scale transcriptional regulatory networks for various organisms.
The algorithms developed for this purpose can be grouped into two general
strategies: an influence strategy, which seeks to identify regulatory influences
between RNA transcripts, and a physical strategy, which seeks to identify the
proteins th a t regulate transcription and the DNA motifs to which the proteins
bind [11]. In this paper, we propose a m ethod following the latter strategy, which
has the advantage of being able to combine genome sequence d ata and RNA ex­
pression d ata to enhance the specificity and sensitivity of predicted interactions.
The physical strategy m ethods th a t make use of both RNA expression data
and genome sequence d ata rely on the assum ption th a t genes with similar ex­
pression profiles share common regulatory mechanisms. Based on the way in
which the two sources of d ata are related, we can distinguish three groups of
these methods. The first group includes the m ethods th a t first cluster genes on
the basis of their expression patterns and then search for putative motifs in the
upstream regions of the genes in each cluster. The early m ethods following this
approach searched for individual transcription factor binding site patterns in
upstream regions of the coexpressed genes (see e.g. [5,8,28]), while the more
recent algorithms search for DNA target sites for cooperatively binding tra n ­
scription factors [12,18]. The m ethods in the second group work in the opposite
direction, first identifying a set of candidate motifs and then trying to explain
RNA expression using these motifs [7,15, 22]. Finally, the algorithms in the last
group use both sources of d ata together. These m ethods use one or more itera­
tions of the following procedure: first, genes are clustered or grouped according
to their expression data, then the search for motifs in the upstream regions of
the coexpressed genes is performed, and, finally, the motifs identified are used
to build models th a t predict the expression pattern of the gene (see e.g. [2, 27]).
A key feature of transcriptional regulation of gene expression in eukaryotes
is th a t genes are often regulated by more than one transcription factor [30]. A
num ber of approaches have been proposed to address the combinatorial nature
of transcriptional regulation. One approach is based on the assumption th a t the
influence of different transcription factors on gene expression is additive. The
studies based on this approach use a simple linear regression to relate transcrip­
tion factor binding sites to gene expression values [7,17]. A probabilistic model
by Segal et al. [27] assumes th a t genes are partitioned into modules, which de­
term ine the gene expression profile. The strength of the association of a gene
with a module is the sum of its weighted motifs, where each weight specifies
the extent to which the m otif plays a regulatory role in the module. These ap­
proaches, however, cannot identify synergistic m otif combinations th a t control
gene expression patterns. Algorithms have been developed to model the synergy
between two transcription factors th a t bind to sites located anywhere in the
upstream region [22] or sites th a t are spatially close to each other [7,12]. Beer
and Tavazoie [2] present an approach which utilizes AND, OR and NOT logic to
capture combinatorial effects of transcription factors in the regulation of gene ex­
pression. This m ethod is not only able to infer combinatorial rules th a t involve
more than two transcription factors, but it also includes constraints on m otif
strength, orientation and relative position. A similar m ethod has been reported
by Hvidsten et al. [15]. To link transcription factor binding site combinations to
genes with particular expression profiles, the m ethod extracts IF-THEN rules
which correspond to AND logic.
Although the m ethods th a t model combinatorial effects of the motifs have
appealing properties, their drawback is their inability to cope with uncertainty
in the transcription factor binding sites th a t are identified. The robustness of the
m ethod in the face of uncertainty is im portant, as non-functional transcription
factor binding sites can be readily found throughout the genome, including pro­
moters [31]. We present an approach which is both able to model the logic behind
transcriptional regulation and to incorporate uncertainty about the functionality
of putative transcription factor binding sites. Our probabilistic method, which
is based on symmetric causal independence models, extends the earlier methods
th at infer combinatorial rules in two im portant directions. First, we use a broad
class of Boolean functions, symmetric Boolean functions, to capture combinato­
rial effects of transcription factors in the regulation of gene expression. Second,
the motifs contribute to the regulation of a gene through hidden variables; thus,
the m ethod is able to cope with non-functional transcription factor binding sites.
In this paper, we apply our m ethod to Plasm odium falciparum, which is
the deadliest species of the parasite th a t causes m alaria in humans. Human
m alaria infects between 300 and 500 million people and causes up to 2.7 million
deaths annually, m ostly among young children in Sub-Saharan Africa [6]. In
m any endemic countries, m alaria is also responsible for economic stagnation [23].
A good understanding of transcriptional regulation in this organism is im portant
for devising new ways to disrupt the parasite’s life cycle.
2
M e th o d o lo g y
In this section, we present our approach based on symmetric causal independence
models for inferring transcriptional regulatory modules from genome sequence
and gene expression data. The underlying assum ption in this approach is th at
genes in the different clusters share common regulatory mechanisms. W hen try ­
ing to separate the genes in one cluster from all others, we aim to find motifs
and their interactions th a t are specific to specific regulatory mechanisms. We
start our m ethod (Figure 1) with a ‘d ata pre-processing’ step, where we use a
motif-finding algorithm to identify putative transcription factor binding motifs
and we cluster genes according to their expression profiles. Then, for each cluster
of genes th a t exhibited significant changes, we learn a symmetric causal inde­
pendence model, which, given the binding sites of a gene, classifies the gene as
belonging to the cluster or not. Finally, we analyze the results of experiments
and identify potential transcription factors binding to the motifs th a t play a
regulatory role in gene expression. All these steps are described in detail further
in this section.
F ig . 1. O verview o f th e p ro p o sed ap p ro ach .
2.1
F in d in g tra n scrip tio n factor b in d in g m otifs
We extracted the DNA sequence 1000 bp upstream from the initiation codon
of each of 5404 Plasm odium falciparum genes using PlasmoDB release 5.2. In
instances where the upstream regulatory region overlapped with another open
reading frame, we extracted only the sequence between the open reading frames.
To find over-represented motifs, the extracted sequences were analyzed using
the AlignACE program [13]. We set the GC background param eter to 0.13 (the
fractional GC background for these regions), the num ber of columns to align to
10 and the num ber of expected sites to 5.
2.2
C lu sterin g o f th e R N A ex p ressio n d ata
We used a Plasm odium falciparum 3D7 strain RNA expression d ata set [4]. We
downloaded d ata th a t were normalized and median-centered and we only used
d ata for those oligonucleotides th a t have a corresponding open reading frame
assigned from PlasmoDB. We discarded the genes for which more than 20% of
m easurements were missing. A num ber of open reading frames had more than one
oligonucleotide measured; we averaged the measurements of these open reading
frames. After the d ata had been log2 transformed, we im puted missing values
using the weighted K-nearest neighbours method. We chose to use this data
im putation m ethod as it has been shown to provide a more robust and sensitive
missing value estim ation in m icroarray d ata than a singular value decomposition
based m ethod or the commonly used row average m ethod [29]. The weighted Knearest neighbours m ethod uses a weighted average of values from the K genes
closest to the gene of interest as an estim ate for the missing value. Based on the
results reported in [29], we chose the value of K to be 15 and Euclidean distance
as a metrics for gene similarity.
We used the K-means algorithm [19] with random initializations to cluster
the genes according to their RNA expression data. Since the K-means algorithm
is known to sometimes get stuck in a local optimum, we ran the algorithm 10
times for each num ber of clusters. To select the optimal num ber of clusters we
used the so-called C-index [14], which has been shown to outperform 13 other
indices for determ ining the number of clusters in binary d ata sets when the data
are clustered using the K-means algorithm [10].
2.3
L earning sy m m etric cau sal in d ep en d en ce m od els
The global structure of a symm etric causal independence model is shown in Fig­
ure 2; it expresses the idea th a t causes C i, . .. ,C n influence a given common effect
E through hidden variables H i , . . . , H n and a symmetric Boolean function f . All
variables in this model are binary; the hidden variable Hi is considered to be
a contribution of the cause variable Ci to the common effect E . The function
f represents in which way the hidden effects H i , and indirectly also the causes
Ci , interact to yield the final effect E . To learn more about symmetric causal
independence models and learning them , see [16].
F ig . 2 . S y m m etric ca u sal in d e p en d e n c e m odel
In this paper, we use symmetric causal independence models as a technique
to model combinatorial effects of transcription factor binding motifs in the reg­
ulation of gene expression. Transcription factor binding sites are causes in this
model, where the positive state of this variable is presence or absence of the
motif, depending on the m otif’s effect on expression of genes in the cluster. The
positive state of the effect variable represents gene belonging to the cluster, and
the negative state represents gene belonging to any other cluster.
We used a greedy approach to select the motifs whose absence or presence
contributes to the difference between the expression of genes belonging to a given
cluster and the expression of the other genes. First, we ranked all motifs based
on their m utual information scores, where the m utual inform ation measures the
m utual dependence of the variable M th a t represents a m otif and the class
variable C and is defined as:
' (m;c )
= £ E
meM ceo
v ’
w
.
Then, we built a model from the h highest ranked motifs. We started from a
model containing only a leaky cause, then iteratively added the next highest
ranked m otif and evaluated the model thus obtained. If the new model did not
have a higher score than the previous model, the m otif was removed from the
model. Since there are 2”+1 symmetric Boolean functions for a model with n
variables th a t represent motifs, evaluating all the resulting models is too expen­
sive computationally. Therefore, we restricted the interaction function space to
the Boolean threshold functions. This restriction means th a t for every added
m otif we only had to evaluate two models, a gene model with the interaction
function Tk and a gene model with the interaction function r k+1, where Tk is the
interaction function from the model with the highest score. We evaluated each
model using the classification accuracy on the validation set.
To solve the problem of unbalanced d ata (different class size, see Table 1),
we added as m any copies of every gene from the smaller class as was needed
for this class to am ount for at least half of the genes. To learn the param eters
of the gene model, we ran 25 iterations of the EM algorithm, com puted the
classification accuracy on the validation set after each iteration and chose those
param eters th a t provided the highest score.
2 .4
E v a lu a tio n o f t h e r e su lts
We used two error estim ation methods, cross-validation and bootstrap, to eval­
uate the models learned. The cross-validation scheme was used to examine the
predictive performance of the models, whereas the bootstrap approach was used
to evaluate the reliability of the model param eters. For both methods, we per­
formed 100 runs, and the d ata was split into training, validation and test sets.
The validation set was used to choose the num ber of iterations of the EM algo­
rithm and the threshold function; the results reported were obtained using an
independent test set.
We used the results of the bootstrap approach to test for potential syner­
gistic m otif pairs. From the results of the bootstrap approach, we estim ated
9 = (0\, 62, ...), where 0i is the probability th a t m otif M i will be chosen as a
feature in the model. We introduce a variable X j k th a t specifies four possible
combinations of occurrence of the motifs M j and M k. Our null hypothesis was
th at X j k follows multinomial distribution, with each trial resulting in one of 4
possible outcomes with probabilities p i = (1 —0j)(1 —0k),p 2 = 0 j(1 —0k),p 3 =
(1 — 0j)0k,P 4 = 0j0k, and the num ber of trials n being equal 100. To measure
the discrepancy between the observed and expected counts, we used Pearson’s
chi-square statistic:
X2
(Oi — E i )2
X = ^I
E,
where i is a possible outcome and the expected count E i = npi .
To compare our classifier to a classifier which assigns all genes to the biggest
class, we used a binomial test described in [24]. The test uses the num ber of
cases n for which the two classifiers produce a different output, and the number
of cases s where the output of the examined classifier was correct, while the
output of the reference classifier was wrong. Under the null hypothesis th a t the
two classifiers perform equally well, we compute:
p = 2 E i!(n — i)! o.5n
i =s
2.5
Id en tify in g p o te n tia l tra n scrip tio n factors b in d in g to th e m otifs
To identify potential transcription factors binding to the motifs, we used com­
parative genome analysis, which is based on the fact th a t sequence sim ilarity
might reflect functional similarity. Identification, which was done separately for
each motif, involves three steps. Firstly, we used STAMP [20], a web tool for ex­
ploring DNA-binding m otif similarities, to find a number of the closest matches
for a given m otif in 13 supported databases. Secondly, for each m atch found,
we checked whether the database where the m otif is stored reports a transcrip­
tion factor binding to it. Finally, if the transcription factor is known, we used
BLAST [1, 25] to find the most similar protein sequences from the Plasm odium
falciparum protein database.
3
3.1
E x p e rim en ta l R e su lts
T ran scrip tion factor b in d in g m otifs found and clu sters ob ta in ed
AlignACE found 100 transcription factor binding motifs in the given upstream
sequences. The motifs th a t were found to be the most im portant features for
classifying the genes will be discussed later in this section.
We chose the num ber of clusters to be 5, as the C-index curve had an ‘elbow’
at this value. Figure 3 presents the clusters obtained, which are comparable to
the four characteristic stages of intraerythrocytic parasite morphology discussed
by Bozdech et al. [4], as the vast m ajority of genes induced in every one of
the stages belong to one of four clusters. Cluster 5 is a cluster of genes whose
expression did not show a significant change. The correspondence among the
characteristic stages and the clusters and the cluster sizes are given in Table 1.
3.2
M o d els learn ed
We learned the models for the first four clusters, i.e. the clusters of genes whose
expression changed throughout the intraerythrocytic stage.
The classification accuracy of the gene models learned using the cross-validation
procedure explained in 2.4 is reported in Table 1. The p-values for the null hy­
pothesis th a t the gene models perform equally well as a classifier which assigns
all genes to a bigger class are less than 10~10.
Table 2 lists the motifs th a t were most often selected as features of the gene
model. Due to space limitations, we report only those motifs th a t were selected as
T a b le 1. A b rie f d e sc rip tio n of th e clu sters, th e n u m b e r of th e genes assigned, th e
c o rresp o n d in g c h a ra c te ristic sta g e of in tra e ry th ro c y tic p a ra s ite m orphology; a n d clas­
sificatio n a cc u racy o b ta in e d using th e c ro ss-v alid atio n p ro ced u re.
C lu ste r
N u m b e r of
genes
C o rresp o n d in g
stag e
A ccu racy
o b ta in e d (%)
B aselin e
acc u racy (%)
1
2
3
4
5
329
1033
985
144
1344
schizont
rin g /e a rly tro p h o z o ite
tr o p h o z o ite /e a rly schizont
early ring
-
60.48
61.52
59.16
63.21
-
50.79
52.52
50.90
51.30
-
features of the model in more than 50 bootstrap runs. Some of the motifs appear
in more than one cluster; however, their weighting is different (not shown) and
they can be either ‘present’ or ‘absent’ (the presence or absence is a positive
state of the corresponding variable in the model). Sequence logos of the motifs,
which were generated using the WebLogo program [9], are shown in Figure 4. A
study of the positive states of the variables representing the motifs selected as
features of the model in more than 20 bootstrap runs reveals a distinct pattern.
The variables in models for cluster 2 and cluster 4 represent the absence of the
motifs, while the variables in models for cluster 1 and cluster 3 m ainly represent
the presence of the motifs. Even though there are 6 motifs th a t break this pattern
in clusters 1 and 3, these motifs are found in a very small num ber of genes (from
1 to 5 % of genes); the other motifs selected are much more common in genes.
The sum m ary of these results is presented in Table 3.
T a b le 2 . M otifs t h a t w ere selected as fe a tu re s of th e m o d el in m o re th a n h a lf of th e
b o o ts tra p ru n s; th e n u m b e r o f ru n s th e m o tif w as selected is given in p aren th eses.
‘P re s e n t’ m o tifs are w ritte n in ro m a n , ‘a b s e n t’ m o tifs are w ritte n in ro m an .
C lu ste r
M otifs selected m o re th a n 50 tim es
1
M o tif 38 (100), M o tif 37 (95), M o tif 6 (65), M o tif 59 (65), M o tif 11 (63),
M o tif 21 (55)
M o t i f 6 (98), M o t i f 3 5 (93), M o t i f 3 7 (89), M o t i f 3 8 (89), M o t i f 11
(75), M o t i f 21 (68), M o t i f 5 9 (68), M o t i f 7 (64), M o t i f 8 0 (59)
M o tif 6 (100), M o t i f 8 8 (82), M o tif 38 (67), M o tif 59 (60), M o tif 21 (51)
M o t i f 6 (99), M o t i f 11 ( 93 ), M o t i f 4 ( 55 )
2
3
4
Interpretation of the probabilities of the hidden variables is somewhat tricky
as they highly depend on the num ber of input variables and the interaction
function in the model, which currently vary a lot from one bootstrap run to
another. Nevertheless, there is a pattern which suggests th a t probabilities of
A A .G G
aTaT
a
A
a
A
o
A
¡ E ^ C a C A ^ A j ^Ajx A
s
A
‘. I s â s â s â s L â s â s â ï â i A s ê i â
T a T T Ja i GL a QAi Ai a
at
Ca ia Ca
t
A
L *Lj T ¡taCAiA
>
3A A A A iIoC A C
slQiIas c2l'
jJLJLc
F ig . 4 . Sequence logos of th e m o tifs t h a t w ere selected as fe a tu re s of th e m o d el in m ore
th a n h a lf o f th e b o o ts tra p ru n s.
T a b le 3 . P o sitiv e s ta te s of v a riab les re p re se n tin g th e m o tifs t h a t w ere selected as
fe a tu re s of th e m o d el in a t le ast 20 b o o ts tra p run s.
C lu ste r M o tifs selected P o sitiv e sta te : ab sen ce P o sitiv e sta te : presence
1
2
3
4
14
14
10
15
2
12
13
4
15
6
1
0
hidden variables contain information about functionality of putative transcrip­
tion factor binding sites. The pattern emerges when we compare probabilities of
hidden variables with the average of probabilities of the other hidden variables
in the model. The probabilities in clusters of ‘absent’ motifs were almost the
same, while the probabilities in clusters of ‘present’ motifs differed much more
and the m ajority of motifs had the tendency to have corresponding probabilities
below or above the average.
To find statistically significant occurrences of m otif pairs, we tested all pos­
sible pairs of the motifs selected as causes in the model in at least 20 bootstrap
runs (see 2.4 for the description of the m ethod). We rejected the null hypothesis
at the significance level of 0.05 (corrected for multiple testing) for two m otif
pairs from cluster 4: for the pair of motifs 70 and 72 (with p-value of 0.0055),
and the pair of motifs 4 and 5 (with p-value of 0.0174). These motifs were se­
lected together to be features in the model more often than it would be expected.
Sequence logos for potential synergistic m otif pairs are shown in Figure 5.
3.3
P o te n tia l tra n scrip tio n factors b in d in g to th e m otifs
We present the most significant findings for the motifs reported in Figure 4.
Motifs 6,11 and 35 have the same closest m atch - the binding site of fruit fly’s
transcription factor Topoisomerase 2, reported in FlyReg database [3]. The most
Motif 4
Motif 5
F ig . 5 . Sequence logos o f p o te n tia l sy n erg istic m o tif pairs.
significant alignment in Plasm odium falciparum is PF14_0316, putative DNA
topoisomerase 2, whose protein sequence is nearly identical (E value of 0.0).
Another gene of Plasm odium falciparum th a t is a potential transcription fac­
tor binding to at least two of the motifs discussed is PF14_0175, which is anno­
tated as a hypothetical protein in PlasmoDB. One of the closest matches for mo­
tif 7 is MCM1+SFF_M01051 reported in TRANSFAC database [21]. The most
significant alignment for MCM1, which is yeast transcription factor involved in
cell-type-specific transcription and pheromone response and plays a central role
in the formation of both repressor and activator complexes, is PF14_0175 (E
value of 10~5). Another m otif to which this transcription factor could bind is
m otif 80; this possible connection was found through a different transcription
factor in a different organism. M otif F0XP1_M00987 reported in TRANSFAC is
a close m atch to m otif 80. Mouse transcription factor F 0X P 1 which binds to this
m otif is thought to repress expression of epithelial genes in the lung and reduce
expression from prom oters of mouse CC10 gene G002818. The most significant
alignment for variants T04812 and T04813 of F 0X P 1 in Plasm odium falciparum
is PF14_0175 (E value of 10~8).
A gene which is found as potential transcription factor for one third of the
motifs analyzed is PFL0465c, zinc finger transcription finger (krox1). For m otif
4, the connection was found through m otif Helios_M01004 reported in TRANS­
FAC and mouse transcription factor IKAROS family zinc finger 2, Helios, whose
functions include zinc ion binding, DNA binding and nucleic acid binding (E
value of 7 • 10~6). For m otif 21, the connection was found through m otif CF2ILM00012 reported in TRANSFAC and fruit fly transcription factor CF2-II, a
late activator in follicle cells during chorion formation (E value of 10~6).
4
D iscu ssio n and F u tu re W ork
We have presented an approach which is both able to model the logic behind
transcriptional regulation and to incorporate uncertainty about the functionality
of putative transcription factor binding sites. Another advantage of our technique
is th at it does not require other biological knowledge than genome sequence data
and RNA expression d ata to validate the results. Since we do not use expression
d ata while searching for putative regulatory motifs, the accuracy of the models
in predicting gene expression p attern is an unbiased measure of the soundness
of the models learned.
Experim ental results revealed the lack of consistency in the properties of the
models learned. This inconsistency could be caused by the lack of additional
constraints on the motifs, such as position relative to the translation start, ori­
entation and functional depth. Therefore, the next step in our research is to
implement normal and binomial approxim ations to Poisson binomial distribu­
tion, which will help to reduce com putational complexity of the EM algorithm.
Reduced com putational complexity will enable us to test more interaction func­
tions and to examine the additional constraints on the motifs.
We will also continue our discussions with biologists to find the explanation
to the experimental results, especially, the pattern of clusters of ‘present’ and
‘absent’ motifs, and potential transcription factors binding to the motifs.
A ck n ow led gm en ts. We would like to thank Michael A. Beer, Saeed Tavazoie,
Zbynek Bozdech and Mahony Shaun for helpful discussions.
R eferen ces
1. A ltsch u l, S .F ., M a d d en , T .L ., Schaffer, A .A ., Z hang, J., Z h ang , Z., M iller, W ., L ip ­
m a n , D .J.: G a p p e d B L A S T a n d P S I-B L A S T : a new g e n e ra tio n of p ro te in d a ta b a se
se arch p ro g ram s. N ucleic A cids R ese a rc h 2 5 (1997) 3389-3402
2. B eer, M .A ., Tavazoie, S.: P re d ic tin g gene ex p ressio n fro m sequence. C ell 1 1 7 (2004)
185-198
3. B erg m a n , C .M ., J .W C arlso n , S.E. C elniker: D ro so p h ila D N ase I fo o tp rin t d a ta b a se :
A sy ste m a tic genom e a n n o ta tio n of tra n s c rip tio n fa c to r b in d in g sites in th e fruitfly,
D. m elan o g a ste r. B io in fo rm atics 21 (2005) 1747-1749
4. B ozdech, Z., L linas, M ., P u llia m , B ., W ong, E .D ., Z hu, J., D eR isi, J.: T h e tra n sc rip to m e o f th e in tra e ry th ro c y tic d ev elo p m e n ta l cycle o f P la s m o d iu m falcip aru m .
P L oS B iology 1 (2003) 85-100
5. B ra z m a , A ., Jo n a sse n , I., V ilo, J., U k k o n en E.: P re d ic tin g gene re g u la to ry elem en ts
in silico o n a genom ic scale. G en o m e R ese a rc h 8 (1998) 1202-1215
6. B rem en , J.: T h e ears o f th e h ip p o p o ta m u s: m a n ife sta tio n s, d e te rm in a n ts, a n d e sti­
m a te s of th e m a la ria b u rd e n . A m e ric a n J o u rn a l o f T ro p ica l M edicine a n d H ygiene
6 4 (2001) 1-11
7. B u ssem ak er, H .J., Li, H. an d Siggia, E .D . R e g u la to ry elem en t d e te c tio n using co r­
re la tio n w ith expression. N a tu re G en etics 2 7 (2001) 167-171
8. C hu, S., D eR isi, J ., E isen, M ., M u lh o llan d , J., B o tste in , D ., B row n, P .O ., H erskow itz,
I.: T h e tra n s c rip tio n a l p ro g ra m of s p o ru la tio n in b u d d in g y east. Science 2 8 2 (1998)
699-705
9. C ro o k s G .E ., H o n G ., C h a n d o n ia J.M ., B re n n e r S.E.: W ebLogo: A sequence logo
g e n era to r. G enom e R ese arch 1 4 (2004) 1188-1190
10. D im itria d o u , E ., D o ln icar, S. W eingessel, A.: A n e x a m in a tio n of in d ex es for d e te r­
m in in g th e n u m b e r o f c lu ste rs in b in a ry d a ta sets. P sy c h o m e trik a 6 7 (2002) 137-160
11. G a rd n e r, T .S ., F a ith , J.: R ev erse-en g in eerin g tra n s c rip tio n c o n tro l netw orks.
P h y sic s of Life R eview s 2 (2005) 65 -8 8
12. G u h a T h a k u rta , D ., S to rm o , G.: Id en tify in g ta r g e t site s for co o p e ra tiv ely b in d in g
facto rs. B io in fo rm atics 1 7 (2001) 608-621
13. H ughes, J.D ., E ste p , P .W ., T avazoie S., C h u rch , G .M .: C o m p u ta tio n a l id en tifica­
tio n o f c is-re g u la to ry elem en ts asso c ia te d w ith g ro u p s o f fu n c tio n a lly re la te d genes
in S accharom yces cerevisiae. J o u rn a l of M o lecu lar B iology 296 (2000) 1205-1214
14. H u b e rt, L .J., L evin, J.R .: A g e n eral s ta tis tic a l fram ew o rk for accessing ca te g o rical
c lu ste rin g in free recall. P sy ch o lo g ical B u lle tin 83 (1976) 1072-1082
15. H v id ste n , T .R ., W ilczynski, B ., K ry sh tafo v y ch , A ., T iu ry n , J ., K om orow ski, J.,
F id elis K .: D iscovering re g u la to ry b in d in g -site m o d u les using ru le -b ase d learning.
G en o m e R esearch 15 (2005) 856-866
16. Ju rg e le n a ite , R ., H eskes, T .: E M a lg o rith m for sy m m e tric c a u sa l in d ep en d e n ce
m odels. P ro c e ed in g s o f th e S ev e n te e n th E u ro p e a n C onference o n M ach in e L earn in g
(2006) 234-245
17. K eles, S., v a n d e r L a a n , M ., E isen, M .B .: Id e n tific a tio n of re g u la to ry elem en ts using
a fe a tu re selectio n m e th o d . B io in fo rm atics 18 (2002) 1167-1175
18. L iu, X ., B ru tla g , D ., L iu, J.S .: B io P ro sp e cto r: discovering co nserved D N A m o tifs
in u p s tre a m re g u la to ry regions o f co ex p ressed genes. Pacific S y m p o siu m o n B io­
c o m p u tin g (2001) 127-138
19. M acQ u een , J.B .: Som e m e th o d s for classificatio n a n d a n aly sis o f m u ltiv a ria te o b ­
serv atio n s. P ro c e e d in g s o f th e F if th B erkeley S y m p o siu m o n M a th e m a tic a l S ta tis tic s
a n d P ro b a b ility 1 (1967) 281-297
20. M ahony, S., B enos, P.V .: ST A M P : a w eb to o l for e x p lo rin g D N A -b in d in g m o tif
sim ilarities. N ucleic A cids R esearch (2007) in press
21. M a ty s, V ., K el-M argoulis, O .V ., Fricke, E ., L iebich, I., L a n d , S., B arre -D irrie,
A ., R e u te r, I., C h ek m en ev , D ., K ru ll, M ., H o rn isch er, K ., Voss, N ., S teg m aier, P.,
L ew ick i-P o tap o v , B ., Saxel, H ., K el, A .E ., W in g en d e r, E .: T R A N S F A C a n d its m o d ­
ule T R A N S C o m p el: tra n s c rip tio n a l gene re g u la tio n in eu k ary o tes. N ucleic A cids
R ese a rch 3 4 (2006) D 1 0 8 -1 1 0
22. P ilp el, Y ., S u d a rsa n a m , P ., C h u rch , G .M .: Id en tify in g re g u la to ry n etw o rk s b y com ­
b in a to ria l an aly sis o f p ro m o te r elem ents. N a tu re G en etic s 29 (2001) 153-159
23. Sachs, J .D ., M alaney, P.: T h e econom ic a n d so cial b u rd e n o f m a laria . N a tu re 415
(2002) 680-685
24. S. Salzberg: O n co m p a rin g classifiers: p itfa lls to avoid a n d a reco m m en d ed a p ­
p ro ach , D a ta M ining a n d K now ledge D iscovery 1 (1997) 317-327
25. Schaffer, A .A ., A rav in d , L., M a d d en , T .L ., S hav irin , S., Spouge, J.L ., Wolf,
Y .I., K o o n in , E .V ., A ltsch u l, S.F.: Im p ro v in g th e a cc u racy of P S I-B L A S T p ro te in
d a ta b a s e search es w ith co m p o sitio n -b a se d s ta tis tic s a n d o th e r refin em en ts. N ucleic
A cids R ese arch 29 (2001) 2994-3005
26. S chena, M ., S halon, D ., D avis, R .W ., B row n, P .O .: Q u a n tita tiv e m o n ito rin g o f gene
ex p ressio n p a tte rn s w ith a c o m p lem en tary D N A m icroarray . Science 270 (1995)
4 67-470
27. Segal, E ., Yelensky, R ., K oller, D.: G enom e-w ide discovery of tra n s c rip tio n a l m o d ­
ules fro m D N A seq u en ce a n d gene expression. B io in fo rm atics 19 (2003) 1273-1282
28. Tavazoie, S., H ughes, J.D ., C am p b ell, M .J ., C ho, R .J ., C h u rch , G .M .: S y ste m a tic
d e te rm in a tio n of g en etic n e tw o rk a rc h ite c tu re . N a tu re G en etic s 22 (1999) 281-285
29. T ro y an sk ay a O ., C a n to r M ., S herlock G ., B ro w n P ., H a stie T ., T ib sh ira n i R ., B o tste in D ., A ltm a n R .B .: M issing value e stim a tio n m e th o d s for D N A m icro array s.
B io in fo rm atics 17 (2001) 520-525
30. W ag n er, A.: G enes re g u la te d co o p e ra tiv ely by one o r m o re tra n s c rip tio n fa cto rs a n d
th e ir id en tific a tio n in w hole eu k ary o tic genom es. B io in fo rm atics 15 (1999) 776-784
31. W ern er, T .: M odels for p re d ic tio n a n d re c o g n itio n o f eu k ary o tic p ro m o te rs. M a m ­
m a lia n G en o m e 10 (1999) 168-175