Recognition of regulatory sites by genomic comparison

Res. Microbiol. 150 (1999) 755−771
© 1999 Éditions scientifiques et médicales Elsevier SAS. All rights reserved
Mikhail S. Gelfand*
State Scientific Center for Biotechnology ‘NIIGenetika’, 1-j Dorozhny pr. 1, Moscow 113545, Russia
Abstract — Availability of complete bacterial genomes opens the way to the comparative approach to the
recognition of transcription regulatory sites. Assumption of regulon conservation in conjunction with profile
analysis provides two lines of independent evidence making it possible to make highly specific predictions.
Recently this approach was used to analyze several regulons in eubacteria and archaebacteria. The present
review covers recent advances in the comparative analysis of transcriptional regulation in prokaryotes and
phylogenetic footprinting techniques in eukaryotes, and describes the emerging patterns of the evolution of
regulatory systems. © 1999 Éditions scientifiques et médicales Elsevier SAS
transcription / regulation / operator / repressor / activator / computer analysis / complete genome
1. Statistical methods for recognition of protein-binding sites in DNA sequences
Recognition of functional sites in DNA sequences is an old problem in computational molecular biology. Recent advances in large-scale
sequencing, and especially sequencing of complete bacterial genomes, make it particularly
important. The availability of several related
genomes opens up new possibilities for prediction of regulatory sites based on comparison of
candidate site distribution in these genomes. In
this review I describe the basic ideas behind the
comparative approach and provide some examples demonstrating its strong and weak
points. But let us start with a brief description of
the traditional techniques.
Generally, the search for regulatory patterns
is organized as follows. Coregulated genes from
one genome are collected from the literature,
databases, or in large-scale gene expression
experiments [122, 137]. The upstream regions of
the collected genes are aligned in order to find
conserved sequence patterns. There exist various techniques for multiple local alignment that
can be used in such situations. The most popular approaches are the semi-empirical algorithms based on positional [43, 92, 161] or
position-independent oligonucleotide analysis [148, 163], and minimization techniques [62,
140], the most prominent of which is the Gibbs
sampler [16, 81, 82, 122]; for a review see [32,
41]. Alternatively, the initial determination of
the signal is done experimentally, e.g., by footprinting or mutational analysis of one or several
regulatory regions, and then new sites are
aligned to these signals.
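As a minimal illustration of position-independent oligonucleotide analysis, the sketch below counts in how many upstream regions each word of length k occurs and reports the words shared by a chosen fraction of the regions. It is not the algorithm of any of the cited programs; the function names, the `min_fraction` parameter and the example sequences are assumptions made for this example.

```python
from collections import Counter


def kmer_support(upstream_regions, k):
    """For every k-mer, count the number of regions that contain it at least once."""
    support = Counter()
    for seq in upstream_regions:
        seq = seq.upper()
        # A set makes each region contribute at most once per k-mer.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support


def shared_kmers(upstream_regions, k, min_fraction=1.0):
    """Return the k-mers present in at least min_fraction of the regions."""
    support = kmer_support(upstream_regions, k)
    needed = min_fraction * len(upstream_regions)
    return sorted(word for word, n in support.items() if n >= needed)


# Invented upstream regions sharing the word GTGAAC.
regions = ["TTACGTGAACAT", "GGTGAACCCTTA", "CATGTGAACGGA"]
print(shared_kmers(regions, 5))  # -> ['GTGAA', 'TGAAC']
```

Real signals are degenerate, so exact words shared by all regions are the exception; this is why the cited methods allow mismatches or work with alignments.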
After determination of the signal, it is represented as a recognition rule. The simplest variant of the recognition rule is a consensus word,
maybe with degenerate positions [25, 103, 113].
In this case the measure of ‘site-likeness’ of an
oligonucleotide, or the score, is defined as the
number of mismatches between the oligonucleotide and the consensus. A slightly more sophisticated procedure involves the use of positional
nucleotide weight matrices, or profiles [7, 15, 17,
50, 99, 113, 138]. Indeed, a consensus can be
represented by a profile whose elements are
ones and zeroes. Profiles are more sensitive
than consensi. There exists a developed theory
for profiles linking them to site specificity [9,
126], providing the physical base for the choice
of positional nucleotide weights [7, 127, 141],
and allowing one to compute the statistical
significance of observed hits [19, 48] and to
estimate the pseudosite competition [8]. The
profile-based scores are positively correlated with experimentally determined site
strength [7, 99], although the definitions of both
values are not as trivial as they may seem: for
example, in many cases the positional nucleotide frequencies in natural sites are different
from those obtained in SELEX experiments [120,
128]. This means that results of benchmarking
of recognition algorithms using site strengths
estimated by in vitro binding should be interpreted with some caution [123].
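The two kinds of recognition rule described above can be sketched as follows. The consensus score counts mismatches against a word that may contain IUPAC degenerate symbols; the profile score sums positional weights. The weights here are log-odds values computed with a pseudocount against a uniform background, which is one common choice and an assumption of this example, not necessarily the weighting used in the cited papers. The training sites are invented.

```python
import math

# IUPAC nucleotide codes and the bases each one admits.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "M": "AC", "K": "GT", "S": "CG", "W": "AT", "B": "CGT",
         "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def mismatches(word, consensus):
    """Number of positions where word is not admitted by the consensus."""
    if len(word) != len(consensus):
        raise ValueError("word and consensus must have equal length")
    return sum(base not in IUPAC[c]
               for base, c in zip(word.upper(), consensus.upper()))


def build_profile(sites, pseudocount=1.0, background=0.25):
    """Positional log-odds weights from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    profile = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        profile.append({b: math.log((column.count(b) + pseudocount)
                                    / total / background)
                        for b in "ACGT"})
    return profile


def profile_score(word, profile):
    """Sum of positional weights; higher means more site-like."""
    return sum(column[base] for base, column in zip(word.upper(), profile))


sites = ["TTGACA", "TTGACT", "TTTACA", "CTGACA"]  # invented aligned sites
profile = build_profile(sites)
print(mismatches("TTGTCA", "TTGACA"))    # -> 1
print(profile_score("TTGACA", profile) > profile_score("GGCCGG", profile))  # -> True
```

Unlike the mismatch count, the profile distinguishes a substitution at a strongly conserved position from one at a variable position.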
Sophisticated methods such as neural networks [63] or hidden Markov models [34] are
rarely used for analysis of transcriptional operators, since these methods require large training
sets. Rather, these techniques are used for analysis of ‘mass’ sites such as promoters of transcription [28, 65, 104, 108, 164] and ribosomal
binding sites in prokaryotes and polymerase II
promoters and splicing sites in eukaryotes. For
reviews and benchmarking of recognition methods see [37, 38, 44, 123]; bacterial promoters and
terminators are specifically discussed in the
paper by M.-F. Sagot in this issue.
Finally, since the information content of most
transcription regulatory signals is rather low,
the specificity of the recognition rules is not
sufficiently high to provide reliable predictions.
Thus it is natural to try to predict regulatory
sites in context taking into account the fact that
promoters and operators do not occur independently, but rather form recognition complexes.
This is especially important for analysis of eukaryotic genes, whose regulation may involve
tens of different factors and their recognition
sites [165]. Several programs for recognition of
eukaryotic promoters [18, 114, 154] or tissue-specific regulation [42, 155, 160] use this approach. It was also applied to analysis of Escherichia coli regulatory sites [109, 121].
There exist several collections of prokaryotic
promoters from different genomes [12, 61,
87], operators from E. coli (RegulonDB by [146] at http://www.cifn.unam.mx/Computational_Biology/regulondb; DPInteract by [120] at http://arep.med.harvard.edu/dpinteract), and eukaryotic regulatory sites [47,
60], some of the latter specifically dedicated to
fungi [30, 142].
The derived recognition rules are used to scan
databases in order to find new representatives
of the regulon. The candidate sites are then
verified experimentally. In particular, simple
consensus search was successfully applied to
find PurR-regulated genes in E. coli [56], and to
characterize the competence regulon of Streptococcus pneumoniae and the σX and σW promoters
of Bacillus subtilis [66, 67]. In the cases of the
competence and σX signals, the consensus was
derived not from sequence comparisons alone,
but from mutation and deletion analyses of
known sites. Similarly, after experimental determination of the recognition rules for FruR [102]
and SoxS [106] regulators of E. coli, computer
scanning of the databases revealed new candidate binding sites whose significance was assessed taking into consideration the function of
the downstream genes. Large-scale analysis of
complete genomes was done for E. coli [120,
146] and Saccharomyces cerevisiae [36, 151].
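Scanning a sequence with a recognition rule can be sketched as a sliding window evaluated on both strands. The scoring function is passed in as a parameter, so either a consensus or a profile rule could be used; here a plain match count against an invented hexamer consensus serves as a stand-in. All names, the sequence and the threshold are assumptions made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def scan(sequence, score_fn, width, threshold):
    """Return (position, strand, score) for every window scoring >= threshold.

    Positions are 0-based starts on the given (plus) strand.
    """
    sequence = sequence.upper()
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        for strand, word in (("+", window), ("-", reverse_complement(window))):
            score = score_fn(word)
            if score >= threshold:
                hits.append((i, strand, score))
    return hits


def matches_to_consensus(word, consensus="TTGACA"):
    """Stand-in scoring rule: number of positions identical to the consensus."""
    return sum(a == b for a, b in zip(word, consensus))


genome_fragment = "AATTGACAGGTGTCAACC"  # invented sequence
print(scan(genome_fragment, matches_to_consensus, width=6, threshold=6))
# -> [(2, '+', 6), (10, '-', 6)]
```

Lowering the threshold quickly increases the number of hits, which illustrates why candidate sites found by scanning alone need independent support.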
2. Phylogenetic footprinting: comparative analysis of eukaryotic regulatory sites
Until now we considered sets of coregulated
genes from one genome and assumed that upstream regions of these genes contain a common
signal. A transversal approach involves using
upstream regions of one gene (more exactly,
orthologous genes) or functionally related genes
from different genomes. This approach, termed
‘phylogenetic footprinting’ [143], was used
to analyze regulation of many different eukaryotic systems: globins [52, 143], milk protein
genes [90], genes expressed in liver [147],
gut [73] and muscle [78, 94, 160], transcription
factors [2, 78, 112, 116, 131, 144], sex-determining genes of nematodes [79], and prespore genes of Dictyostelium [93].
There are two main variants of this analysis.
One possibility is to look for a conserved arrangement of known signals upstream of homologous genes [90, 91, 160]. This allows for
filtering out false-positives. Indeed, the number
of different profiles in the regulation databases
is so high, and their specificity so low, that
scanning of any DNA sequence fragment by a
set of profiles produces an enormous number of
hits. Conversely, scanning of sequence databases by one profile also gives many spurious
matches. However, false-positive hits are scattered at random, whereas true sites (at least
some of them, see below) are conserved. Thus,
occurrence of a candidate site upstream of homologous genes in different species is a strong
indicator that this site is truly functional.
The other approach is to search for regulatory
sites without any prior idea of the recognition
rule. Conserved segments identified by local
alignment serve as the basis for experimental
analysis ([116, 131, 144, 136, 145]; see also [54]
and references therein). The scope of comparisons is quite wide. The most popular, of course,
are the human-mouse comparisons [1, 54, 116,
160]. Other examples include comparisons between different mammals [90, 91, 136] and vertebrates [94, 131], between species of fruit flies
Drosophila melanogaster and D. hydei [144] and
between nematodes Caenorhabditis elegans and
C. briggsae [73, 79, 145, 168].
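A rough impression of the second variant, the search for conserved segments without a prior recognition rule, can be obtained with the sketch below. It uses exact matching blocks reported by Python's standard `difflib.SequenceMatcher` as a crude stand-in for local alignment: mismatches and gaps inside a conserved segment are not tolerated, so this is an assumption-laden simplification rather than the procedure of the cited studies. The sequences and the `min_length` cutoff are invented.

```python
from difflib import SequenceMatcher


def conserved_blocks(region_a, region_b, min_length=6):
    """Return (start_a, start_b, segment) for exact shared blocks >= min_length bp."""
    a, b = region_a.upper(), region_b.upper()
    # autojunk=False switches off difflib's heuristic that ignores
    # very frequent characters in long sequences.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(m.a, m.b, a[m.a:m.a + m.size])
            for m in matcher.get_matching_blocks()
            if m.size >= min_length]


# Invented upstream regions of two orthologous genes.
species_1 = "GGGGTTGACATCCCC"
species_2 = "AAAATTGACATAAAA"
print(conserved_blocks(species_1, species_2))  # -> [(4, 4, 'TTGACAT')]
```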
The depth of comparisons can be quite diverse. In some cases even comparisons between
primates produce meaningful results [143],
although in general the degree of sequence
conservation in primates is too high. Human-mouse comparisons are often very informative [1, 54], although sometimes they are confounded by conservation of regions unrelated to
transcriptional regulation [31, 86] or even any
regulation at all [76, 77]. Besides, the conserved
segments are often fairly long and include not
only the binding core(s), but marginal regions
as well [32]. More distant comparisons, e.g.,
between mammalian and fish genes, are more
selective [2], but produce results in a smaller
number of cases [32]. It has been suggested that
the optimal evolutionary distance for phylogenetic footprinting is that between mammals
and birds [32]. However, it strongly depends on
the gene family and the regulatory system under analysis; to the examples above we can add
the rapidly evolving gene SRY whose regulatory sequences are conserved within groups of
mammals (primates, rodents, bovids), but are
not conserved between these groups [91].
A somewhat different approach involves
looking at differences in regulatory regions of
closely related genes. It was used to determine
novel expression patterns of Drosophila genes
Ddc and amd involved in production of neurotransmitters and cuticle formation [158], and
hunchback encoding a zinc finger transcription
factor [53], as well as to study expression of the
primate β-globin gene cluster [52].
All analyses described above are based on the
assumption that regulatory sites of the same
kind are similar in related genomes. The biological basis for this assumption is provided by
complementation studies, where the protein
from one organism correctly regulates genes
from another organism. Examples of such cross-species regulation can sometimes involve very
distant genomes. For instance, mammalian homologues of the apterous (Ap) transcription factor from Drosophila completely substitute for the
protein, and moreover the murine gene (Lhx2)
has the same tissue-specific pattern of expression as the Drosophila gene [118]. Human heat-shock transcription factor HSF2 activates target
genes of its yeast homologue HSF in response to
thermal stress [89]. However, a detailed
study [32] found very little sequence conservation in regulatory regions of vertebrates and
insects. Thus it seems that these examples cannot serve as a basis for large-scale genomic
analysis, however interesting they are from the
biological point of view.
Other examples of such very distant regulatory homologies are the enhancer element of
plant histone promoters present also in small
plant viruses such as coconut foliar decay virus [97], and the BARBIE box occurring upstream of barbiturate-inducible genes of mammals, insects and the bacterium Bacillus
megaterium [57, 72].
The situation with the latter signal is somewhat controversial. In B. megaterium it is bound
by repressor BM3R1 [57, 83, 84]. It is possible
that the repressor acts by interfering with positive transcription factors BM1P1 and
BM1P2 [58, 84]. However, it has been also
claimed that deletion of the BARBIE box does
not influence the effect of pentobarbital on the
P450BM-1 promoter [132].
In rodents the situation also is not clear [72].
Mutations in the BARBIE box in the promoter of
the rat alpha-1-acid glycoprotein gene abolish
induction by phenobarbital [40]. On the other
hand, the putative BARBIE box is not essential
for phenobarbital induction of the rat CYP2B1
gene encoding cytochrome P450 2B1 [106].
Mouse phenobarbital-inducible promoter of
CYP2B10 gene contains an insertion in the
middle of the BARBIE box, although the
1400-bp promoter region is 83% identical to the
rat CYP2B2 gene promoter [64].
Finally, in insects, candidate BARBIE boxes
were found upstream of a number of
xenobiotic-regulated genes of the house fly
Musca domestica [20], butterflies from the genus
Papilio [68], and the mosquito Culex quinquefasciatus [152]. The link between all these observations is established by the fact that some unknown
B. megaterium protein recognizes both bacterial
and rodent sites [57].
3. Transcription regulatory sites in bacterial genomes
Techniques similar to phylogenetic footprinting were used in a number of bacterial
studies as well, e.g., for analysis of iron uptake
(Fur) and peroxide (PerR) regulons of B. subtilis [13], and OmpR and SoxS binding sites upstream of the micF gene of E. coli coding for an
anti-sense regulatory RNA [26]. However, all of
these papers report comparative results simply
to confirm or extend biological observations
made for one species, and not as a starting point
for analysis. The problem of recognition of
regulatory sites in completely sequenced bacterial genomes has been recently stated in a
number of reviews [11, 105]. Still, until now,
large scale analysis of transcription regulation
was restricted to scanning of genomes by signal
profiles [120, 146] and comparative analysis was
not applied in a systematic way.
One of the reasons for that could be the fact
that, although individual binding sites of
prokaryotic transcription factors are usually
longer than eukaryotic sites, there is no conservation of long regulatory regions and thus the
noise in pairwise comparisons is too strong to
discern weakly conserved regulatory signals.
However, this can be overcome by simultaneous analysis of regulons, and not single
genes, in several species [96].
Similarly to the eukaryotic case considered in
the previous section, the main assumption is
that regulatory patterns are conserved in related
species. However, since the number of regulatory interactions per gene in prokaryotes is
generally much smaller than that in eukaryotes,
it is possible to base analysis on conservation of
entire regulons, that is, sets of genes regulated
by a particular factor.
In the simplest variant of this approach, information from a well studied genome is transferred to a newly sequenced genome. It is
assumed that not only the composition of regulons, but the signal itself is conserved. The
known sites from the ‘old’ genome are used to
make a recognition rule to scan the ‘new’ genome. Candidate sites upstream of genes
orthologous to the known regulon members are
assumed to be true. This technique was used to
predict PurR (purine), ArgR (arginine), TrpR
and TyrR (aromatic amino acids), and LexA
(SOS-repair) regulons of Haemophilus influenzae [45, 95, 96]. An obvious prerequisite for this
analysis is conservation of the transcription
factor. Sometimes the results of complementation experiments are available that provide additional bases for the assumption of the signal
conservation (e.g., [98, 124, 135, 166]).
A minor modification of this technique allows
one to find new members of regulons. Indeed, if
candidate signals appear for orthologous genes
with unknown function, these genes can be
added to the regulons. This allowed us to find a
family of transporters that are subject to purine
regulation in E. coli and H. influenzae [95, 96].
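The transfer of a regulon from a well-studied genome to a new one, and the extension of the regulon by genes of unknown function, can be written down as two set operations over an orthologue table. The sketch below assumes that candidate sites have already been found by scanning both genomes with the same recognition rule; the gene identifiers, the hit sets and the function name are all invented for illustration.

```python
def transfer_regulon(known_regulon, orthologs, old_hits, new_hits):
    """Classify orthologous pairs by the presence of candidate upstream sites.

    known_regulon: genes of the well-studied genome known to be regulated.
    orthologs:     dict mapping a gene of the old genome to its orthologue.
    old_hits, new_hits: genes with a candidate site upstream, per genome.
    Returns (confirmed, novel): orthologues of known members that retain a
    site, and pairs with sites in both genomes that are not yet in the regulon.
    """
    confirmed = sorted(new for old, new in orthologs.items()
                       if old in known_regulon and new in new_hits)
    novel = sorted((old, new) for old, new in orthologs.items()
                   if old not in known_regulon
                   and old in old_hits and new in new_hits)
    return confirmed, novel


# Invented example data.
known = {"purE", "purH"}
orthologs = {"purE": "new_purE", "purH": "new_purH",
             "orfX": "new_orfX", "orfY": "new_orfY"}
old_hits = {"purE", "purH", "orfX"}
new_hits = {"new_purE", "new_orfX", "new_orfY"}

confirmed, novel = transfer_regulon(known, orthologs, old_hits, new_hits)
print(confirmed)  # -> ['new_purE']
print(novel)      # -> [('orfX', 'new_orfX')]
```

In this toy data `new_purH` has no candidate site, which corresponds to a possible loss of regulation, and `orfY` has a site in one genome only and is therefore not reported.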
A different approach can be used in the
absence of prior information. As described
above, there exist numerous algorithms that
find signals in a sample of regulatory regions.
However, in many cases the algorithms produce
several candidate signals, only one of which is
biologically relevant [41, 122]. Thus the problem
is to distinguish the true signal among several
possibilities. Moreover, if the analyzed sample
does not in fact contain coregulated genes, most
algorithms still produce some output. There is
no theory allowing one to estimate the significance of the derived signal. However, the reliability of the signal can be assessed using
genomic comparisons. Indeed, assuming conservation of the signal in related genomes, one
can consider several sets of orthologous genes,
independently find candidate signals for each
set, and then select the most similar signals or
determine that there are no common signals.
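The selection step can be sketched as a comparison of positional frequency matrices built from the candidate signals of two genomes. The distance used below (the sum of squared frequency differences, without shifts or reverse complements) and the `max_distance` cutoff are assumptions chosen for brevity; the candidate signals are invented.

```python
from itertools import product


def frequency_matrix(sites):
    """Positional nucleotide frequencies of equal-length aligned sites."""
    n = len(sites)
    return [[sum(site[i] == base for site in sites) / n for base in "ACGT"]
            for i in range(len(sites[0]))]


def distance(m1, m2):
    """Sum of squared frequency differences; assumes equal width, no shift."""
    return sum((x - y) ** 2
               for col1, col2 in zip(m1, m2)
               for x, y in zip(col1, col2))


def most_similar_pair(candidates_a, candidates_b, max_distance=2.0):
    """Return (name_a, name_b, distance) of the closest pair of signals,
    or None if even the closest pair is farther apart than max_distance."""
    scored = [(distance(frequency_matrix(candidates_a[a]),
                        frequency_matrix(candidates_b[b])), a, b)
              for a, b in product(candidates_a, candidates_b)]
    best, name_a, name_b = min(scored)
    return (name_a, name_b, best) if best <= max_distance else None


# Invented candidate signals reported for two sets of orthologous genes.
genome_a = {"a1": ["TTGACA", "TTGACT"], "a2": ["GGCCGG", "GGCCGC"]}
genome_b = {"b1": ["AAAAAA", "AAAATA"], "b2": ["TTGACA", "TTGTCA"]}
print(most_similar_pair(genome_a, genome_b))  # -> ('a1', 'b2', 1.0)
```

Returning `None` corresponds to the outcome that the sets share no common signal.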
This technique was used to describe purine
regulons in archaebacteria from the genus Pyrococcus (Gelfand et al., in preparation). Genes
encoding enzymes from the purine pathway
were found in the genomes of P. horikoshii [70],
P. abyssi (Utah Genome Center, University of
Utah, USA, http://www.genome.utah.edu) and
P. furiosus (Genoscope, National Centre for
Sequencing, France, http://www.genoscope.cns.fr) using standard protein similarity analysis. The signals identified in each individual set
were very weak and practically indistinguishable from random noise (that is, pseudo-signals
found in random sets of upstream regions from
the same species). However, the fact that very
similar signals have been constructed for all
three sets is a strong indication that the signal is
correct. In fact, it turned out that upstream of all
relevant genes there are two candidate sites. The
recognition rule requiring two sites with a fixed
length spacer is highly specific: it selects only
genes from the purine pathway. The regulons
include also transporters (one gene in each
genome) that are orthologous to the purine
transporters found in an independent analysis
of E. coli and H. influenzae purine regulons, see
above. Thus the results of several unrelated
studies corroborate each other.
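A recognition rule requiring two sites separated by a spacer of fixed length can be applied on top of any single-site scan. The sketch below takes the start positions of individual candidate sites and keeps only those that form such pairs. The site width and spacer length used in the example are hypothetical, since the actual values for the Pyrococcus signal are not given here.

```python
def paired_sites(hit_positions, width, spacer):
    """Return (first_start, second_start) for hits separated by a fixed spacer.

    hit_positions: start coordinates of single candidate sites of equal width.
    spacer:        number of bases between the end of the first site and
                   the start of the second.
    """
    positions = set(hit_positions)
    offset = width + spacer
    return sorted((p, p + offset) for p in positions if p + offset in positions)


# Hypothetical hits of a 6-bp site; pairs must be separated by exactly 1 bp.
print(paired_sites([3, 10, 25, 40, 47], width=6, spacer=1))
# -> [(3, 10), (40, 47)]
```

The isolated hit at position 25 is discarded, which illustrates how the pairing requirement raises specificity even when each individual site is weak.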
4. Evolution of regulons, regulators, and signals
Application of the comparative approach allowed for description of some modes of regulon
evolution. Although these observations are very
fragmented and preliminary, they are important
for creating more powerful algorithms for regulation analysis, as they show limitations of the
existing methods and possible additional considerations that can extend the present capabilities of the comparative approach. Unless stated
otherwise, the examples presented below are
taken from papers [45, 95, 96], from Gelfand et al.
(unpublished), and from my unpublished observations.
Firstly, the main assumption, that is, conservation of regulons, is only partially valid. In
many cases conservation of the regulon core is
accompanied by loss (or gain) of regulation by
some genes. For example, the main genes from
the purine pathways are regulated by PurR both
in E. coli and H. influenzae, but several other
genes are members of the purine regulon in the
former, but not the latter. Notably, a large number of such genes were initially identified in E.
coli by computational analysis and then checked
experimentally [56] and the strength of the repressor sites is rather low. Another noteworthy
case is the predicted loss of autoregulation by
PurR and IlvY in H. influenzae.
A related, but more straightforward situation
is the case when one genome has no orthologue
for a regulon member from the other genome.
Since the genome composition is indeed very
labile, this is a very frequent phenomenon occurring in almost all regulons we have studied
so far.
Secondly, analysis of regulatory interactions
is obscured by changes in operon structure.
Analysis of the operon structure, and, more
generally, consideration of gene proximity is a
powerful tool of functional analysis, since in
many cases genes that are close to each other in
a number of sufficiently distant genomes are
functionally related [21, 105].
There are two modes of operon structure
evolution that pose difficulties for the comparative analysis. One is insertion of genes between
the operator and the (former) first gene of the
operon. For instance, compare trpBA operons in
H. influenzae and Pasteurella multocida (figure 1a).
In H. influenzae this operon contains an additional gene ydfG coding for a predicted oxidoreductase. Thus the TrpR binding site (TRP
box) is immediately upstream of ydfG, but not
trpB.
The other mode involves breaking of an operon into two parts. The upstream part then
inherits the regulatory site from the original
operon, whereas the downstream part acquires
a new site. This happened with the tryptophan
operon (trpEDCBA in enterobacteria and Vibrionaceae) that broke into two operons in Pasteurellaceae, trpBA and trpEDC (figure 1a) and with
one of the operons from the purine regulon
(purHDglyA in H. influenzae, but two operons
purHD and glyA in E. coli, figure 1b).
A very nice example is present in the nitrogen
fixation regulon of methanogenic archaea
Methanobacterium thermoautotrophicum and
Methanococcus jannaschii (figure 1c). Both these
genomes contain two pairs of glutamine synthetase regulator genes glnB and ammonium
transporter genes amtB. The relative arrangement of these genes is quite diverse: they form
tandemly repeated operons amtBglnB in M. thermoautotrophicum, whereas in M. jannaschii they
are arranged as the operon glnBamtB and a pair
of divergently transcribed genes. However, all
these operons have upstream nitrogen fixation
(NIF) boxes. In the latter case there is a single
regulatory site between the divergent genes
amtB and glnB that most likely regulates both
genes.
Figure 1. Examples of changed operon structure with retained regulation. a. trp operons in gamma-proteobacteria. TRP, binding site of TrpR. b. purHD and glyA operons in gamma-proteobacteria. PUR, binding site of PurR. c. glnBamtB loci in methanogenic archaebacteria. NIF, nitrogen fixation operator.
To account for possible changes in the operon
structure, one should select for pairwise comparisons not only the genes with candidate
regulatory sites in upstream regions, but also
genes that can be coregulated with these genes,
that is, downstream genes transcribed in the
same direction, with some limit on the spacer
length.
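Collecting the genes that may be coregulated with a gene carrying a candidate site can be sketched as a walk along the chromosome in the direction of transcription. The walk stops at the first gene on the other strand or at the first intergenic spacer longer than a chosen limit. The gene order ydfG, trpB, trpA follows the H. influenzae example above, but all coordinates, the remaining gene names and the 100-bp limit are invented.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # "+" or "-"


def possibly_coregulated(genes, first, max_gap=100):
    """Names of genes that may form an operon starting with gene `first`."""
    ordered = sorted(genes, key=lambda g: g.start)
    i = next(k for k, g in enumerate(ordered) if g.name == first)
    # Downstream means rightwards on the plus strand, leftwards on the minus.
    step = 1 if ordered[i].strand == "+" else -1
    chain = [ordered[i].name]
    j = i + step
    while 0 <= j < len(ordered):
        prev, cur = ordered[j - step], ordered[j]
        gap = cur.start - prev.end if step == 1 else prev.start - cur.end
        if cur.strand != prev.strand or gap > max_gap:
            break
        chain.append(cur.name)
        j += step
    return chain


genes = [Gene("ydfG", 100, 850, "+"), Gene("trpB", 870, 2060, "+"),
         Gene("trpA", 2060, 2870, "+"), Gene("geneX", 3500, 4000, "+"),
         Gene("geneM2", 4400, 4950, "-"), Gene("geneM1", 5000, 5500, "-")]

print(possibly_coregulated(genes, "ydfG"))    # -> ['ydfG', 'trpB', 'trpA']
print(possibly_coregulated(genes, "geneM1"))  # -> ['geneM1', 'geneM2']
```

With such chains, a site upstream of ydfG in one genome can still be compared with a site directly upstream of trpB in another.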
Thirdly, many regulons involve multiple
regulatory mechanisms. To deal with such
cases, the sets regulated by functionally linked
factors should be merged at the step of pairwise
comparisons into a single coregulated set.
For instance, the aromatic amino acid metabolism operons in enterobacteria and Pasteurellaceae are regulated on the level of transcription by two factors, TrpR and TyrR. Several
genes are under dual regulation. In addition to
changes in operon structure described above,
there are several genes that apparently have
different regulation in E. coli and H. influenzae
(figure 2). A relatively simple case is that of mtr,
that is under regulation by both TyrR and TrpR
in E. coli, but has no candidate TyrR binding
sites in H. influenzae (figure 2a). A more interesting case is that of DAHP synthases (figure 2b).
DAHP synthase is the first enzyme of the pathway leading to aromatic amino acids. In E. coli
there are three DAHP synthases encoded by
genes aroH, aroG, and aroF. They are feedback
inhibited by tryptophan, phenylalanine and tyrosine, respectively [110]. On the level of transcription, aroH is repressed by TrpR with tryptophan acting as corepressor, whereas aroF and
aroG are repressed by TyrR with corepressors
tyrosine and phenylalanine. H. influenzae
has only one DAHP synthase, which is orthologous to aroG. However, it has no candidate
TyrR-binding sites, but a candidate TrpR-binding site instead.
The situation with global regulons seems to
be even more complicated. For example, the
heat shock response is regulated in various
bacteria by at least twelve systems [100], in
particular:
– heat-shock-specific σ-factor RpoH (σ32) in
gamma proteobacteria [51];
– factor CtsR in low-G + C Gram-positive
bacteria [29];
– unknown factor binding to ROSE element
in Bradyrhizobium japonicum [101];
– HspR binding to HAIR element in Streptomyces spp. [14];
– HrcA binding to CIRCE element [129].
The last system is of particular interest. The
CIRCE element (controlling inverted repeat of
chaperone expression) [167] has been found upstream of chaperone operons in numerous
Gram-positive and Gram-negative bacteria, cyanobacteria, Chlamydiae, and spirochetes [59,
130]. These operons often, but not always,
include the regulator itself (in Caulobacter crescentus the hrcA gene is under an RpoH promoter [119]). However, the composition of the
chaperone operons in bacterial genomes is very
diverse (figure 2c) and not all of them have
upstream CIRCE elements. On the other hand,
in Mycoplasma genitalium, where HrcA is the
only transcription factor (E.V. Koonin, personal
communication), it regulates not only chaperones, but heat-shock proteases lon and clpB as
well. There is no obvious phylogenetic pattern
in the distribution of CIRCE: for instance, it is
absent in many gamma-proteobacteria, including E. coli, but not in all of them, since there is a
CIRCE element upstream of the groESL operon
of Chromatium vinosum [36].
The fourth complication is the fact that orthology of factors does not necessarily imply
similarity of the signals. For example, SOS repair genes in E. coli (and other Gram-negative
bacteria) and B. subtilis (and other Gram-positive bacteria) are regulated by orthologous
proteins LexA and DinR respectively. However,
the DNA binding domains of these proteins
have diverged and the recognized signals are
quite different: CTGTatatatatMCAG for LexA [156] and cGAACrnryGTTYg for
DinR [162]. The composition of regulons also is
different, although there are some common
genes that can be revealed by the comparative
analysis. This example illustrates another feature that can be used for comparative analysis:
even if the regulators have diverged and the
signals are different, the overall symmetry of
the pattern is conserved. In the LexA-DinR case
the signal is a palindrome with strongly conserved half-sites and weakly conserved AT-rich
spacer. However, since most transcription regulatory signals are palindromes with degenerate
positions, this example is not particularly revealing. A more interesting case is that of phosphate regulons of enterobacteria and B. subtilis
regulated by homologous two-component systems PhoB-PhoR and PhoP-PhoR respectively.
The enterobacterial signal consists of several
heptanucleotide PHO boxes with consensus CTGTCAT with 4-bp spacers [157], whereas the B.
subtilis signal is several hexanucleotide boxes
TTWACA with 4–5 bp spacers [88]. In both
cases the total length of one unit consisting of a
box and a spacer is 10–11 bp, which means that
the boxes are located on the same side of the
DNA helix [115]. The position of the signals
relative to the start of transcription also is
conserved: they serve as substitute (-35) boxes.
Another example is the group of quorum sensing two-component systems regulating bacteriocin production in lactic bacteria and the
related agr system of Staphylococcus species [69,
74]. In all cases the signal is a pair of identical
nonanucleotide boxes with either a 12- or 13-bp
spacer (figure 3). Thus, boxes again are located
on the same side of the double helix, with
corresponding positions separated by two helical turns.
Figure 2. Changes in regulation. a. mtr gene in E. coli and H. influenzae. TYR, TRP, binding sites of TyrR and TrpR respectively. b. DAHP synthases of E. coli and H. influenzae. 1st column: schematic representation of the evolutionary tree; 2nd column: inhibitor; 3rd column: genome (E.c., E. coli; H.i., H. influenzae); 4th column: schematic representation of the relevant operons (TYR, TRP, binding sites of TyrR and TrpR respectively). c. CIRCE elements and chaperone operons in complete bacterial genomes. CIRCE, binding site of HrcA. In mycoplasmas, operon groESL has CIRCE element in M. pneumoniae, but not in M. genitalium.
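The same-side argument made above for the PHO boxes and for the bacteriocin signals can be checked numerically. The sketch below expresses the distance between corresponding positions of two boxes (box length plus spacer) in helical turns and reports how far it lies from a whole number of turns. The helical repeat of about 10.5 bp per turn for B-DNA and the tolerance of 0.15 turn are assumptions of this example.

```python
HELICAL_REPEAT = 10.5  # approximate bp per turn of B-DNA (assumed value)


def helical_offset(distance_bp, repeat=HELICAL_REPEAT):
    """Deviation from the nearest whole number of turns, in turns."""
    turns = distance_bp / repeat
    return turns - round(turns)


def same_face(distance_bp, tolerance=0.15, repeat=HELICAL_REPEAT):
    """True if two positions this far apart lie on roughly the same side."""
    return abs(helical_offset(distance_bp, repeat)) <= tolerance


# Box length + spacer for the signals discussed in the text.
print(same_face(7 + 4))    # enterobacterial PHO box, 4-bp spacer -> True
print(same_face(6 + 5))    # B. subtilis box, 5-bp spacer -> True
print(same_face(9 + 12))   # nonanucleotide boxes, 12-bp spacer -> True
print(same_face(9 + 13))   # nonanucleotide boxes, 13-bp spacer -> True
print(same_face(16))       # about 1.5 turns apart, for contrast -> False
```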
On the other hand, sometimes the similarity
between signals of paralogous factors may be a
problem. Indeed, if related regulators from the
same genome recognize similar signals, it is
difficult to sort out the regulons. The interplay
between the DNA sites and the recognizer domains in proteins, well studied in several phage
systems, can be quite subtle. Three base pairs
determine specificity of T3 and T7 promoters to
the respective RNA polymerases [75]. In the
Salmonella typhimurium phage P22, the C1 operator for the c2 gene promoter partially overlaps with the gene encoding the C1 activator
itself. A single-base-pair mutation within this
overlap changes the specificity of both the operator and the activator: the recognition is retained, but there is no cross-recognition of the
wild-type operator [117]. In E. coli there are
several groups of related factors and signals.
One of the best studied examples is the pair of
activators CRP and FNR that recognize very
similar palindromic signals with consensi
TGTGAN6TCACA and TTGATN4ATCAA respectively [6]. The physiological roles of the two
factors are quite different, with CRP regulating
catabolite repression and FNR relevant to
anaerobiosis. Although it is unlikely that there
is competition between these factors for wild-type sites [125], rough profiles may not be able
to distinguish between the two signals.
A somewhat more complicated case is that of
vegetative and stationary phase promoters of
E. coli recognized by σ70 and σS sigma-factors
respectively. The problem is that a number of
promoters are recognized by both sigma factors
and thus the similarity of the signals is not an
evolutionary relic, but has functional significance. However, most promoters are recognized
by only one sigma factor. It has been suggested
that the distinguishing feature of σS promoters
is curved DNA structure in the (-35) region,
instead of the usual TTGACA box, and slightly
different consensus of the (-10) box: CTATACT
instead of TATAAT [35]. A similar situation exists in B. subtilis where several pairs of sigma-factors share some promoters. One example
is provided by the recently characterized σX
and σW promoters with rather well conserved consensi TGTAACN17CGAC and
TGAAACN16CGTA respectively [67].
An interesting case is that of two-component
systems NarL-NarX and NarP-NarQ mediating
nitrate and nitrite regulation of anaerobic gene
expression in E. coli [23]. They bind to heptamer
half-sites TACYYMT organized as inverted repeats. It turns out that (phosphorylated) NarP
regulates only the genes where the spacer between the half-sites is exactly 2 bp (that is,
TACYYMTN2AKRRGTA), whereas NarL can also
bind to half-sites in other arrangements, although the 2 bp still is the preferred spacer
length [24].
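The spacer rule can be made concrete by turning the degenerate half-site into a regular expression and searching for an inverted repeat with a given spacer length. The sketch below matches the consensus exactly, which is stricter than the binding specificity of the real proteins, and it searches one strand only; the function names and the test sequence are invented.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "M": "AC", "K": "GT", "S": "CG", "W": "AT", "N": "ACGT"}
IUPAC_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "R": "Y",
                    "Y": "R", "M": "K", "K": "M", "S": "S", "W": "W", "N": "N"}


def iupac_to_regex(pattern):
    """Translate an IUPAC word into a regular expression over A, C, G, T."""
    return "".join(IUPAC[c] if len(IUPAC[c]) == 1 else "[" + IUPAC[c] + "]"
                   for c in pattern.upper())


def reverse_complement_iupac(pattern):
    """Reverse complement of a word that may contain degenerate symbols."""
    return "".join(IUPAC_COMPLEMENT[c] for c in reversed(pattern.upper()))


def inverted_repeat_regex(half_site, spacer):
    """Compiled pattern: half-site, `spacer` arbitrary bases, inverted half-site."""
    return re.compile(iupac_to_regex(half_site)
                      + "[ACGT]{%d}" % spacer
                      + iupac_to_regex(reverse_complement_iupac(half_site)))


print(reverse_complement_iupac("TACYYMT"))  # -> AKRRGTA

sequence = "CCTACCCATGGATGGGTATT"  # invented; contains a 7-2-7 arrangement
match = inverted_repeat_regex("TACYYMT", 2).search(sequence)
print(match.start(), match.group())  # -> 2 TACCCATGGATGGGTA
print(inverted_repeat_regex("TACYYMT", 3).search(sequence))  # -> None
```

Under the rule described above, a scan with spacer 2 would model the NarP requirement, whereas candidate NarL sites would need several spacer lengths to be tried.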
There are many other cases where position of
an operator relative to the transcription binding
site or other operators is important for the
action mechanism. There exist three positional
classes for CRP-binding sites [33]: i) CRP and
RNA polymerase binding on the same side of
DNA, ii) CRP binding site overlapping the
promoter, and iii) arbitrary position when CRP
interacts with another factor, e.g., CytR [150] or
FIS [49]. This is reflected in the periodicity of
distribution of CRP operators in regulatory regions aligned at transcription start sites [109].
Similar periodicity can also be observed in
distribution of binding sites for other factors [134]. Many activators act as repressors if
their binding site overlaps with the promoter;
this is the standard mechanism of negative
transcriptional feedback.
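Periodicity of this kind can be quantified in several ways. As one hedged illustration, the Python sketch below maps the distance of each site from the transcription start onto a circle whose circumference equals the helical repeat and reports the mean resultant length of the resulting angles. The default period of 10.5 bp is an assumption (a commonly quoted value for B-DNA), not a figure taken from [109] or [134], and the function name is hypothetical.

```python
import math

def helical_phase_strength(distances, period=10.5):
    """Mean resultant length of site distances wrapped onto the helical period.

    Values near 1 mean that the sites share one face of the DNA helix;
    values near 0 mean that no phase is preferred.
    """
    angles = [2 * math.pi * (d % period) / period for d in distances]
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    return math.hypot(c, s)
```

With made-up distances spaced exactly one period apart (for example 41.5, 52.0, 62.5 and 73.0 bp) the statistic equals 1, whereas distances spread evenly over a single period give a value close to 0. A significance estimate, for instance by shuffling the distances, would be needed before drawing conclusions from real data.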
Thus simultaneous prediction of promoters
and operators may be more powerful than two
separate procedures, and furthermore, can provide additional information. However, it is not
clear how stable the positional effects are. For
example, transcription of the E. coli gene purB is
repressed by PurR via a roadblock mechanism:
the repressor binds to the operator located
within the coding region around codon 60 [55].
However, in H. influenzae the candidate operator is located upstream of the gene, just as in
other genes.
Figure 3. Signals upstream of bacteriocin operons of lactic bacteria and quorum sensing operons of Staphylococcus spp.
Another aspect of positional analysis is the
use of colocalization of regulators and regulated
operons. This is especially important for systems with small regulons that are likely candidates for horizontal transfer, such as two-component systems regulating only one operon
and restriction-modification systems regulated
by the control gene. In the latter case, the
analysis of upstream regions based on the assumption of conserved orientation of an asymmetric operator relative to the methylase gene
and the control element-restriction endonuclease operon [5] led to the discovery of candidate signals. Similarly, in the case of bacteriocin
systems (figure 3), the regulatory two-component systems and the regulated operons
are usually adjacent to each other on the chromosomes of lactic bacilli.
However, such an analysis also requires some
caution, as demonstrated by the following example. In E. coli the gene ilvC and its activator
ilvY are transcribed divergently, with operators
in the common regulatory zone acting as activating sites for ilvC and repressing sites for
ilvY [149]. This arrangement is conserved in H.
influenzae, but since the distance between the
only candidate operator and the ilvY gene is
several hundred base pairs, it is unlikely that
this operator is a repressor site for ilvY.
5. Regulatory sites and RNA structure
The comparative approach can be used to
analyze RNA regulatory sites. The best known
prokaryotic signal which includes an RNA secondary structure element is the terminator of
transcription. However, as in the case of promoters, the comparative approach is not applicable to prediction of terminators, since they are
not characteristic of any particular regulon. Besides, in a recent study, it was shown that the
conventional notion of a transcriptional terminator as a GC-rich hairpin followed by a run of T’s
may not be universal: in many prokaryotic
genomes there is no prevalence of hairpins
between convergently transcribed genes, where
one should expect at least one terminator [159].
Other well studied examples of RNA regulatory structures are attenuators of transcription [80] and elements involved in feedback
regulation of translation initiation in ribosomal
protein operons [71]. Both these systems are
well studied in E. coli, and it is not difficult to
extend these models to other bacteria where
they exist ([11, 21, 85, 107, 133, 153]; Vitreschak
and Gelfand, unpublished). There exist systems
searching for RNA regulatory elements that
have been applied, in particular, to finding IREs
(iron-responsive elements) occurring at 3′-untranslated mRNA regions of genes encoding
proteins involved in iron utilization [10, 22].
However, the present experience in comparative analysis of prokaryotic RNA regulation is
too limited to describe general properties of
such systems. In any case, it is clear that the
degree of conservation is quite diverse. For
instance, only about half of the ribosomal protein operons of E. coli retain the regulatory
elements in H. influenzae [153]. On the other
hand, the autoregulation mechanism of L1 is
conserved in archaebacteria, although in a
slightly different form [133]. The RFN element
implicated in regulation of riboflavin metabolism genes can be found in such diverse genomes as Bacillus subtilis and other low G+C-content
Gram-positives, Deinococcus radiodurans, Thermotoga spp. and several Gram-negatives including E. coli [45, 46].
6. Limitations
The comparative approach is intended for
analysis of particular regulatory systems and
obviously cannot be applied to prediction of
generic functional sites such as promoters and
terminators of transcription. On the other hand,
analysis of promoters and terminators can be a
useful ingredient for the comparative prediction
of the operon structure.
As discussed above, an important prerequisite for regulon conservation is conservation of
the regulatory mechanism. This can be ascertained when the regulator is known, but can
only be assumed when it is not known. Pure
phylogenetic considerations are not sufficient,
since the rate of the regulon evolution can be
very diverse. Comparisons of closely related
genomes can be fruitless because of residual
conservation of noncoding regions (although
we were able to find purine regulatory signals
in the three strains of Pyrococcus). More distant
comparisons are more informative, but the
regulon may not be conserved. Recall, for example, that the arginine repressor is conserved
in E. coli and B. subtilis and binds a similar
signal there [135], whereas the phosphate repressor is conserved, but binds different signals [88, 157]. On the other hand, the tryptophan regulon is regulated by the repressor
TrpR in enterobacteria [110] and Chlamydia trachomatis [139], the activator TrpI in fluorescent
pseudomonads [3], and by the RNA-binding
protein TRAP in Bacillus subtilis [4]. Some regulons, such as biodegradation regulons of
Pseudomonas spp. [27], or regulatory regions,
such as the dnaA promoter in E. coli [111], are subject to fast evolution. Application of comparative analysis in such cases is not straightforward, if possible at all. Other types of
situations where comparisons are difficult are
small regulons (making it difficult to determine
the signal) and regulons subject to horizontal
transfer (complicating resolution of orthology).
However, despite all the problems mentioned, the comparative method is a powerful
technique. One may expect that its usefulness
will increase as more genomes are sequenced
and they cover more evenly the phylogenetic
space, making it possible to select the correct
evolutionary distance in each particular case.
Acknowledgments
I am grateful to Ross Overbeek for a conversation several years ago that started my studies
in the area of bacterial transcription regulation.
Many results reported here were obtained in
collaboration with Andrey Mironov who wrote
the software GENOME for comparative analysis of regulation. I am grateful to Jim Fickett,
Eugene Koonin, and Mike Roytberg for numerous
discussions, references, and encouragement. This work was partially supported by
grants from the Russian State Scientific Program
‘Human Genome’ and the Russian Fund of
Basic Research (99-04-48347).
References
[1] Ansari-Lari M.A., Oeltjen J.C., Schwartz S., Zhang Z., Muzny D.M.,
Lu J., Gorrell J.H., Chinault A.C., Belmont J.W., Miller W., Gibbs
R.A., Comparative sequence analysis of a gene-rich cluster at
human chromosome 12p13 and its syntenic region in mouse
chromosome 6, Genome Res. 8 (1998) 29–40.
[2] Aparicio S., Morrison A., Gould A., Gilthorpe J., Chaudhuri C.,
Rigby P., Krumlauf R., Brenner S., Detecting conserved regulatory
elements with the model genome of the Japanese puffer fish, Fugu
rubripes, Proc. Natl. Acad. Sci. USA 92 (1995) 1684–1688.
[3] Auerbach S., Gao J., Gussin G.N., Nucleotide sequences of the
trpI, trpB, and trpA genes of Pseudomonas syringae: positive control unique to fluorescent pseudomonads, Gene 123 (1993)
25–32.
[4] Babitzke P., Regulation of tryptophan biosynthesis: Trp-ing the
TRAP or how Bacillus subtilis reinvented the wheel, Mol. Microbiol.
26 (1997) 1–9.
[5] Bart A., Dankert J., van der Ende A., Operator sequences for the
regulatory proteins of restriction-modification systems, Mol. Microbiol. 31 (1999) 1277–1278.
[6] Bell A.I., Gaston K.L., Cole J., Busby S.J.W., Cloning of binding
sequences for the Escherichia coli transcription activators FNR and
CRP, location of bases involved in discrimination between FNR
and CRP, Nucleic Acids Res. 17 (1989) 3865–3874.
[7] Berg O.G., von Hippel P.H., Selection of DNA binding sites by
regulatory proteins. Statistical-mechanical theory and application
to operators and promoters, J. Mol. Biol. 193 (1987) 723–750.
[8] Berg O.G., Selection of DNA binding sites by regulatory proteins.
Functional specificity and pseudosite competition, J. Biomol.
Struct. Dyn. 6 (1988) 275–297.
[9] Berg O.G., Selection of DNA binding sites by regulatory proteins:
the LexA and the arginine repressor use different strategies for
functional specificity, Nucleic Acids Res. 16 (1988) 5089–5105.
[10] Billoud B., Kontic M., Viari A., Palingol: a declarative programming
language to describe nucleic acids’ secondary structures and to
scan sequence database, Nucleic Acids Res. 24 (1996) 1395–1403.
[11] Bork P., Dandekar T., Diaz-Lazcoz Y., Eisenhaber F., Huynen M.,
Yuan Y., Predicting function: From genes to genomes and back, J.
Mol. Biol. 283 (1998) 707–725.
[12] Bourn W.R., Babb B., Computer assisted identification and classification of streptomycete promoters, Nucleic Acids Res. 23
(1995) 3696–3703.
[13] Bsat N., Herbig A., Casillas-Martinez L., Setlow P., Helmann J.D.,
Bacillus subtilis contains multiple Fur homologues: identification of
the iron uptake (Fur) and peroxide regulon (PerR) repressors,
Mol. Microbiol. 29 (1998) 189–198.
[14] Bucca G., Ferina G., Puglia A.M., Smith C.P., The dnaK operon of
Streptomyces coelicolor encodes a novel heat-shock protein which
binds to the promoter region of the operon, Mol. Microbiol. 17
(1995) 663–674.
[15] Bucher P., Weight matrix descriptions of four eukaryotic RNA
polymerase II promoter elements derived from 502 unrelated
sequences, J. Mol. Biol. 212 (1990) 563–578.
[16] Cardon L.R., Stormo G.D., Expectation maximization algorithm
for identifying protein-binding sites with variable length from
unaligned DNA fragments, J. Mol. Biol. 223 (1992) 159–170.
[17] Chen Q.K., Hertz G.Z., Stormo G.D., MATRIX SEARCH 1.0: A
computer program that scans DNA sequences for transcriptional
elements using a database of weight matrices, Comput. Appl.
Biosci. 11 (1995) 563–566.
[18] Chen Q.K., Hertz G.Z., Stormo G.D., PromFD 1.0: A computer
program that predicts eukaryotic polII promoters using strings
and IMD matrices, Comput. Appl. Biosci. 13 (1997) 29–35.
[19] Claverie J.M., Audic S., The statistical significance of nucleotide
position-weight matrix matches, Comput. Appl. Biosci. 12 (1996)
431–439.
[20] Cohen M.B., Koener J.F., Feyereisen R., Structure and chromosomal localization of CYP6A1, a cytochrome P450-encoding gene
from the house fly, Gene 146 (1994) 267–272.
[21] Dandekar T., Snel B., Huynen M.A., Bork P., Conservation of gene
order: a fingerprint of physically interacting proteins, Trends
Biochem. Sci. 23 (1998) 324–328.
[22] Dandekar T., Beyer K., Bork P., Kenealy M.R., Pantopoulos K.,
Hentze M., Sonntag-Buck V., Flouriot G., Cannon F., Keller W.,
Schreiber S., Systematic genomic screening and analysis of mRNA
in untranslated regions and mRNA precursors: combining experimental and computational approaches, Bioinformatics 14 (1999)
271–278.
[23] Darwin A.J., Stewart V., The NAR modulon systems: nitrate and
nitrite regulation of anaerobic gene expression, in: Lin
E.C.C., Lynch A.S. (Eds.), Regulation of Gene Expression in Escherichia coli, R.G. Landes Company, Austin TX, 1996, pp. 343–359.
[24] Darwin A.J., Tyson K.L., Busby S.J.W., Stewart V., Differential
regulation by the homologous NarL and NarP of Escherichia coli
K-12 depends on DNA binding site arrangement, Mol. Microbiol.
25 (1997) 583–595.
[25] Day W.H.E., McMorris F.R., Critical comparison of consensus
methods for molecular sequences, Nucleic Acids Res. 20 (1992)
1093–1099.
[26] Delihas N., Regulation of gene expression by trans-encoded antisense RNAs, Mol. Microbiol. 15 (1995) 411–414.
[27] De Lorenzo V., Perez-Martin J., Regulatory noise in prokaryotic
promoters: how bacteria learn to respond to novel environmental
signals, Mol. Microbiol. 19 (1996) 1177–1184.
[28] Demeler B., Zhou G., Neural network optimization for E. coli
promoter prediction, Nucleic Acids Res. 19 (1991) 1593–1599.
[29] Derre I., Rapoport G., Msadek T., CtsR, a novel regulator of stress
and heat shock response, controls clp and molecular chaperone
gene expression in Gram-positive bacteria, Mol. Microbiol. 31
(1999) 117–131.
[30] Dhawale S.S., Lane A.C., Compilation of sequence-specific DNA-binding proteins implicated in transcriptional control in fungi,
Nucleic Acids Res. 21 (1993) 5537–5546.
[31] Duret L., Dorkeld F., Gautier C., Strong conservation of noncoding sequences during vertebrates evolution - potential involvement in post-transcriptional regulation of gene expression, Nucleic
Acids Res. 21 (1993) 2315–2322.
[32] Duret L., Bucher P., Searching for regulatory elements in human
noncoding sequences, Curr. Opin. Struct. Biol. 7 (1997) 399–406.
[33] Ebright R.H., Transcription activation at class II CAP-dependent
promoters, Mol. Microbiol. 8 (1993) 797–802.
[34] Eddy S.R., Hidden Markov models, Curr. Opin. Struct. Biol. 6
(1996) 361–365.
[35] Espinosa-Urgel M., Chamizo C., Tormo A., A consensus structure
for σS-dependent promoters, Mol. Microbiol. 21 (1996) 657–659.
[36] Ferreyra R., Soncini F., Viale A.M., Cloning, characterization, and
functional expression in Escherichia coli of chaperonin (groESL)
genes from the sulfur phototrophic bacterium Chromatium vinosum, J. Bacteriol. 175 (1993) 1514–1523.
[37] Fickett J.W., Hatzigeorgiou A.G., Eukaryotic promoter recognition, Genome Res. 7 (1997) 861–878.
[38] Fickett J.W., Coordinate positioning of MEF-2 and myogenin
binding sites, Gene 172 (1996) GC19–GC32.
[39] Fondrat C., Kalogeropoulos A., Approaching the function of new
genes by detection of their potential upstream activation sequences in Saccharomyces cerevisiae: application to chromosome
III, Comput. Appl. Biosci. 12 (1996) 363–374.
[40] Fournier T., Mejdoubi N., Lapoumeroulie C., Hamelin J., Durand
G., Porquet D., Transcriptional regulation of rat alpha 1-acid
glycoprotein gene by phenobarbital, J. Biol. Chem. 269 (1994)
27175–27178.
[41] Frech K., Quandt K., Werner T., Software for the analysis of DNA
sequence elements of transcription, Comput. Appl. Biosci. 13
(1997) 89–97.
[42] Frech K., Quandt K., Werner T., Muscle actin genes: A first step
towards computational classification of tissue specific promoters,
In Silico Biol. (1998) http://www.bioinfo.de/isb/1998/01/0005.
[43] Galas D.J., Eggert M., Waterman M.S., Rigorous pattern-recognition methods for DNA sequences: analysis of promoter
sequences from Escherichia coli, J. Mol. Biol. 186 (1985) 117–128.
[44] Gelfand M.S., Prediction of function in DNA sequence analysis, J.
Comput. Biol. 2 (1995) 87–115.
[45] Gelfand M.S., Mironov A.A., Computer analysis of regulatory
patterns in complete bacterial genomes. LexA and DinR binding
sites, (In Russian, Engl. transl.), Mol. Biol. 33 (1999) 439–442.
[46] Gelfand M.S., Mironov A.A., Jomantas J., Kozlov Yu.I., Perumov
D.A., A conserved RNA structure element involved in regulation
of bacterial riboflavin genes, Trends Genet. 15 (1999) in press.
[47] Ghosh D., OOTFD (Object-Oriented Transcription Factors Database): an object-oriented successor to TFD, Nucleic Acids Res.
26 (1998) 360–361.
[48] Goldstein L., Waterman M.S., Approximations to profile score
distributions, J. Comput. Biol. 1 (1994) 93–104.
[49] Gonzales-Gil G., Bringmann P., Kahmann R., FIS is a regulator of
metabolism in Escherichia coli, Mol. Microbiol. 22 (1996) 21–29.
[50] Goodrich J.A., Schwartz M.L., McClure W.R., Searching for and
predicting the activity of sites for DNA binding proteins: compilation and analysis of the binding sites for Escherichia coli integration host factor (IHF), Nucleic Acids Res. 18 (1990) 4993–5000.
[51] Gross C.A., Function and regulation of the heat shock proteins,
in: Neidhardt F.C. (Ed.), Escherichia coli and Salmonella: Cellular
and Molecular Biology, ASM Press, Washington DC, 1996, pp.
1382–1399.
[52] Gumucio D.L., Shelton D.A., Zhu W., Millinoff D., Gray T., Bock
J.H., Slightom J.L., Goodman M., Evolutionary strategies for the
elucidation of cis and trans factors that regulate the developmental
switching programs of the β-like globin genes, Mol. Phylogenet.
Evol. 5 (1996) 18–32.
[53] Hancock J.M., Shaw P.J., Bonneton F., Dover G.A., High sequence
turnover in the regulatory regions of the developmental gene
hunchback in insects, Mol. Biol. Evol. 16 (1999) 253–265.
[54] Hardison R.C., Oeltjen J., Miller W., Long human-mouse sequence
alignments reveal novel regulatory elements: A reason to sequence the mouse genome, Genome Res. 7 (1997) 759–766.
[55] He B., Zalkin H., Repression of Escherichia coli purB is by a
transcriptional roadblock mechanism, J. Bacteriol. 174 (1992)
7121–7127.
[56] He B., Choi K.Y., Zalkin H., Regulation of Escherichia coli glnB,
prsA, and speA by the purine repressor, J. Bacteriol. 175 (1993)
3598–3606.
[57] He J.S., Fulco A.J., A barbiturate-regulated protein binding to a
common sequence in the cytochrome P450 genes of rodents and
bacteria, J. Biol. Chem. 266 (1991) 7864–7869.
[58] He J.S., Liang Q., Fulco A.J., The molecular cloning and characterization of BM1P1 and BM1P2 proteins, putative positive transcription factors involved in barbiturate-mediated induction of the
genes encoding cytochrome P450BM-1 of Bacillus megaterium, J.
Biol. Chem. 270 (1995) 18615–18625.
[59] Hecker M., Schumann W., Volker U., Heat-shock and general
stress response in Bacillus subtilis, Mol. Microbiol. 19 (1996)
417–428.
[60] Heinemeyer T., Wingender E., Reuter I., Hermjakob H., Kel A.E.,
Kel O.V., Ignatieva E.V., Ananko E.A., Podkolodnaya O.A., Kolpakov
F.A., Podkolodny N.L., Kolchanov N.A., Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL, Nucleic Acids Res. 26 (1998) 362–367.
[61] Helmann J.D., Compilation and analysis of Bacillus subtilis σA
promoter sequences: evidence for extended contact between
RNA polymerase and upstream promoter DNA, Nucleic Acids
Res. 23 (1995) 2351–2360.
[62] Hertz G.Z., Hartzell III G.W., Stormo G.D., Identification of
consensus patterns in unaligned DNA sequences known to be
functionally related, Comput. Appl. Biosci. 6 (1990) 81–92.
[63] Hirst J.D., Sternberg M.J., Prediction of structural and functional
features of protein and nucleic acid sequences by artificial neural
networks, Biochemistry 31 (1992) 7211–7218.
[64] Honkakoski P., Moore R., Gynther J., Negishi M., Characterization
of phenobarbital-inducible mouse CYP2B10 gene transcription in
primary hepatocytes, J. Biol. Chem. 271 (1996) 9746–9753.
[65] Horton P.B., Kanehisa M., An assessment of neural network and
statistical approaches for prediction of E. coli promoter sites,
Nucleic Acids Res. 20 (1992) 4331–4338.
[66] Huang X., Helmann J.D., Identification of target promoters for the
Bacillus subtilis σX factor using a consensus-directed search, J. Mol.
Biol. 279 (1998) 165–173.
[67] Huang X., Gaballa A., Cao M., Helmann J.D., Identification of
target promoters for the Bacillus subtilis extracytoplasmic function
σ factor, σW, Mol. Microbiol. 31 (1999) 361–371.
[68] Hung C.F., Holzmacher R., Connolly E., Berenbaum M.R., Schuler
M.A., Conserved promoter elements in the CYP6B gene family
suggest common ancestry for cytochrome P450 monooxygenases
mediating furanocoumarin detoxification, Proc. Natl. Acad. Sci.
USA 93 (1996) 12200–12205.
[69] Jack R.W., Tagg J.R., Ray B., Bacteriocins of Gram-positive bacteria,
Microbiol. Rev. 59 (1995) 171–200.
[70] Kawarabayasi Y., Sawada M., Horikawa H., Haikawa Y., Hino Y.,
Yamamoto S., Sekine M., Baba S., Kosugi H., Hosoyama A., Nagai
Y., Sakai M., Ogura K., Otuka R., Nakazawa H., Takamiya M.,
Ohfuku Y., Funahashi T., Tanaka T., Kudoh Y., Yamazaki J., Kushida
N., Oguchi A., Aoki K., Nakamura Y., Robb T.F., Horikoshi K.,
Masuchi Y., Shizuya H., Kikuchi H., Complete sequence and gene
organization of the genome of a hyper-thermophilic archaebacterium, Pyrococcus horikoshii OT3, DNA Res. 5 (1998) 55–76.
[71] Keener J., Nomura M., Regulation of ribosome synthesis, in: Neidhardt F.C. (Ed.), Escherichia coli and Salmonella: Cellular and
Molecular Biology, ASM Press, Washington DC, 1996, pp.
1417–1431.
[72] Kemper B., Regulation of cytochrome P450 gene transcription by
phenobarbital, Prog. Nucleic Acid Res. Mol. Biol. 61 (1998) 23–64.
[73] Kennedy B.P., Aamodt E.J., Allen F.L., Chung M.A., Heschl M.F.P.,
McGhee J.D., The gut esterase gene (ges-1) from the nematodes
Caenorhabditis elegans and Caenorhabditis briggsae, J. Mol. Biol. 229
(1993) 890–908.
[74] Kleerebezem M., Quadri L.E.N., Kuipers O.P., de Vos W.M.,
Quorum sensing by peptide pheromones and two-component
signal-transduction systems in Gram-positive bacteria, Mol. Microbiol. 24 (1997) 895–904.
[75] Klement J.F., Moorefeld M.B., Jorgensen E., Brown J.E., Risman S.,
McAllister W.T., Discrimination between bacteriophage T3 and
T7 promoters by the T3 and T7 RNA polymerases depends
primarily upon three base-pair region located 10 to 12 base-pairs
upstream from the start site, J. Mol. Biol. 215 (1990) 21–29.
[76] Koop B.F., Hood L., Striking sequence similarity over almost 100
kilobases of human and mouse T-cell receptor DNA, Nature
Genet. 7 (1994) 48–53.
[77] Koop B.F., Human and rodent sequence comparisons: A mosaic
model of genomic evolution, Trends Genet. 11 (1995) 367–371.
[78] Krause M., Harrison S.W., Xu S.Q., Chen L., Fire A., Elements
regulating cell- and stage-specific expression of the C. elegans
MyoD family homolog hlh-1, Dev. Biol. 166 (1994) 133–148.
[79] Kuwabara P.E., Interspecies comparison reveals evolution of control regions in the nematode sex-determining gene tra-2, Genetics
144 (1996) 597–607.
[80] Landick R., Turnbough Jr. C.L., Yanofsky C., Transcription attenuation, in: Neidhardt F.C. (Ed.), Escherichia coli and Salmonella:
Cellular and Molecular Biology, ASM Press, Washington DC, 1996,
pp. 1263–1286.
[81] Lawrence C.E., Altschul S.F., Boguski M.S., Liu J.S., Neuwald A.F.,
Wootton J.C., Detecting subtle sequence signals: a Gibbs sampling
strategy for multiple alignment, Science 262 (1993) 208–214.
[82] Lawrence C.E., Reilly A.A., An expectation maximisation (EM)
algorithm for the identification and characterization of common
sites in unaligned biopolymer sequences, Proteins: Struct. Func.
Genet. 7 (1990) 41–51.
[83] Liang Q., He J.S., Fulco A.J., The role of Barbie box sequences as
cis-acting elements involved in the barbiturate-mediated induction of cytochromes P450BM-1 and P450BM-3 in Bacillus megaterium, J. Biol. Chem. 270 (1995) 4438–4450.
[84] Liang Q., Chen L., Fulco A.J., In vivo roles of BM3R1 repressor in the
barbiturate-mediated induction of the cytochrome P450 genes
(P450BM-3 and P450BM-1) of Bacillus megaterium, Biochim. Biophys. Acta 1380 (1998) 183–197.
[85] Liao D., Dennis P.P., The organization and expression of essential
transcription translation component genes in the extremely thermophilic eubacterium Thermotoga maritima, J. Biol. Chem. 267
(1992) 22787–22797.
[86] Lipman D., Making (anti) sense of non-coding sequence conservation, Nucleic Acids Res. 25 (1997) 3580–3583.
[87] Lisser S., Margalit H., Compilation of E. coli mRNA promoter
sequences, Nucleic Acids Res. 21 (1993) 1507–1516.
[88] Liu W., Hulett F.M., Comparison of PhoP binding to tuaA promoter
with PhoP binding to other Pho-regulon promoters establishes a
Bacillus subtilis Pho core binding site, Microbiology 144 (1998)
1443–1450.
[89] Liu X.D., Liu P.C., Santoro N., Thiele D.J., Conservation of a stress
response: human heat shock transcription factors substitute for
yeast HSF, EMBO J. 16 (1997) 6466–6477.
[90] Malewski T., Computer analysis of distribution of putative cis- and
trans-regulatory elements in milk protein gene promoters,
BioSystems 45 (1998) 29–44.
[91] Margarit E., Guillen A., Rebordosa C., Vidal-Taboada J., Sanchez
M., Ballesta F., Oliva R., Identification of conserved potentially
regulatory sequences of the SRY gene from 10 different species of
mammals, Biochem. Biophys. Res. Commun. 245 (1998) 370–377.
[92] Mengeritsky G., Smith T.F., Recognition of characteristic patterns
in sets of functionally equivalent DNA sequences, Comput. Appl.
Biosci. 3 (1987) 223–227.
[93] Miller C., McDonald J., Francis D., Evolution of promoter sequences: elements of a canonical promoter for prespore genes of
Dictyostelium, J. Mol. Evol. 43 (1996) 185–193.
[94] Minty A., Kedes L., Upstream regions of the human cardiac actin
gene that modulate its transcription in muscle cells: presence of an
evolutionarily conserved repeated motif, Mol. Cell. Biol. 6 (1986)
2125–2136.
[95] Mironov A.A., Gelfand M.S., Computer analysis of regulatory
patterns in complete bacterial genomes. PurR binding sites, Mol.
Biol. 33 (1999) 109–114.
[96] Mironov A.A., Koonin E.V., Roytberg M.A., Gelfand M.S., Computer analysis of transcription regulatory patterns in completely
sequenced bacterial genomes, Nucleic Acids Res. 27 (1999)
2981–2989.
[97] Morozov S.Y.A., Merits A., Chernov B.K., Computer search of
transcription control sequences in small plant virus DNA reveals
a sequence highly homologous to the enhancer element of histone
promoters, DNA Seq. 4 (1994) 395–397.
[98] Muday G.K., Herrmann K.M., Regulation of the Salmonella typhimurium aroF gene in Escherichia coli, J. Bacteriol. 172 (1990)
2259–2266.
[99] Mulligan M.E., Hawley D.K., Entriken R., McClure W.R., Escherichia
coli promoter sequences predict in vitro RNA polymerase selectivity, Nucleic Acids Res. 12 (1984) 789–800.
[100] Narberhaus F., Negative regulation of bacterial heat shock genes,
Mol. Microbiol. 31 (1999) 1–8.
[101] Narberhaus F., Kaser R., Nocker A., Hennecke H., A novel DNA
element that controls bacterial heat shock gene expression, Mol.
Microbiol. 28 (1998) 315–323.
[102] Negre D., Bonod-Bidaud C., Geourjon G., Deleage G., Cozzone
A.J., Cortay J.C., Definition of a consensus DNA-binding site for
the Escherichia coli pleiotropic regulatory protein, FruR, Mol.
Microbiol. 21 (1996) 257–266.
[103] O’Neill M.C., Consensus methods for finding and ranking DNA
binding sites: application to E. coli promoters, J. Mol. Biol. 207
(1989) 301–310.
[104] O’Neill M.C., Training back-propagation neural networks to define and detect DNA binding sites, Nucleic Acids Res. 19 (1991)
313–318.
[105] Overbeek R., Fonstein M., D’Souza M., Pusch G.D., Maltsev N.,
The use of gene clusters to infer functional coupling, Proc. Natl.
Acad. Sci. USA 96 (1999) 2896–2901.
[106] Park Y., Li H., Kemper B., Phenobarbital induction mediated by a
distal CYP2B2 sequence in rat liver transiently transfected in situ,
J. Biol. Chem. 271 (1996) 23725–23728.
[107] Paton E.B., Zhyvoloup A.N., Binding site of the ribosomal protein
L10 in the untranslated leader of rplJ gene in Thermotoga maritima
suggests that this gene is under autogenous control, (in Russian,
Engl. transl.), Genetika 32 (1996) 140–145.
[108] Pedersen A.G., Engelbrecht J., Investigations of Escherichia coli
promoter sequences with artificial neural networks: new signals
discovered upstream of the transcriptional startpoint, Intelligent
Systems Mol. Biol. 3 (1995) 292–299.
[109] Perez-Rueda E., Gralla J.D., Collado-Vides J., Genomic position
analyses and the transcription machinery, J. Mol. Biol. 275 (1998)
165–170.
[110] Pittard A.J., Biosynthesis of aromatic amino acids, in: Neidhardt
F.C. (Ed.), Escherichia coli and Salmonella: Cellular and Molecular
Biology, ASM Press, Washington DC, 1996, pp. 458–484.
[111] Polaczek P., Is the dnaA promoter region in Escherichia coli an
evolutionary junkyard of physiologically insignificant regulatory
elements? Mol. Microbiol. 27 (1998) 1089–1090.
[112] Popperl H., Bienz M., Studer M., Chan S.K., Aparicio S., Brenner S.,
Mann R.S., Krumlauf R., Segmental expression of Hoxb-1 is controlled by a highly conserved autoregulatory loop dependent
upon exd/pbx, Cell 81 (1995) 1031–1042.
[113] Prestridge D.S., SIGNAL SCAN 4.0 - additional databases and
sequence formats, Comput. Appl. Biosci. 12 (1996) 157–160.
[114] Prestridge D.S., Burks C., The density of transcriptional elements
in promoter and non-promoter sequences, Hum. Mol. Genet. 2
(1993) 1449–1453.
[115] Qi Y., Kobayashi Y., Hulett F.M., The pst operon of Bacillus subtilis
has a phosphate-regulated promoter and is involved in phosphate
transport but not in regulation of the pho regulon, J. Bacteriol. 179
(1997) 2534–2539.
[116] Renucci A., Zappavigna V., Zakany J., Izpisua-Belmonte J.C., Burki
K., Duboule D., Comparison of mouse and human HOX-4 complexes defines conserved sequences involved in the regulation of
Hox-4, EMBO J. 11 (1992) 1459–1468.
[117] Retallack D.M., Johnson L.L., Ziegler S.F., Strauch M.A., Friedman
D.I., A single-base-pair mutation changes the specificities of both a
transcription regulation protein and its binding site, Proc. Natl.
Acad. Sci. USA 90 (1993) 9562–9565.
[118] Rincon-Limas D.E., Lu C.H., Canal I., Calleja M., Rodriguez-Esteban C., Izpisua-Belmonte J.C., Botas J., Conservation of the
expression and function of apterous orthologs in Drosophila and
mammals, Proc. Natl. Acad. Sci. USA 96 (1999) 2165–2170.
[119] Roberts R.C., Toochinda C., Avedissian M., Baldini R.L., Gomes
S.L., Shapiro L., Identification of a Caulobacter crescentus operon
encoding hrcA, involved in negatively regulating heat-inducible
transcription, and the chaperone gene grpE, J. Bacteriol. 178
(1996) 1829–1841.
[120] Robison K., McGuire A.M., Church G.M., A comprehensive library
of DNA-binding site matrices for 55 proteins applied to the
complete Escherichia coli K-12 genome, J. Mol. Biol. 284 (1998)
241–254.
[121] Rosenblueth D.A., Thieffry D., Huerta A.M., Salgado H., Collado-Vides J., Syntactic recognition of regulatory regions in Escherichia
coli, Comput. Appl. Biosci. 12 (1996) 415–422.
[122] Roth F.P., Hughes J.D., Estep P.W., Church G.M., Revealing regulons
by whole-genome expression monitoring and upstream sequence
alignment, Nature Biotechnol. 16 (1998) 239–245.
[123] Roulet E., Fisch I., Junier T., Bucher P., Mermod N., Evaluation of
computer tools for the prediction of transcription factor binding
sites on genomic DNA, In Silico Biol. (1998) http://www.bioinfo.de/isb/1998/01/0004.
[124] Savchenko A., Charlier D., Dion M., Weigel P., Hallet J.N., Holtham
C., Baumberg S., Glansdorff N., Sakayan V., The arginine operon of
Bacillus stearothermophilus: characterization of the control region
and its interaction with the heterologous B. subtilis arginine repressor, Mol. Gen. Genet. 252 (1996) 69–78.
[125] Sawers G., Kaiser M., Sirko A., Freundlich M., Transcriptional
activation by FNR and CRP: reciprocity of binding-site recognition, Mol. Microbiol. 23 (1997) 835–845.
[126] Schneider T.D., Stormo G.D., Gold L., Ehrenfeucht A., Information
content of binding sites on nucleotide sequences, J. Mol. Biol. 188
(1986) 415–431.
[127] Schneider T.D., Information content of individual genetic sequences, J. Theor. Biol. 189 (1997) 427–442.
[128] Schultzaberger R.K., Schneider T.D., Using sequence logos and
information analysis of Lrp DNA binding sites to investigate
discrepancies between natural selection and SELEX, Nucleic Acids Res. 27 (1999) 882–887.
[129] Schulz A., Schumann W., hrcA, the first gene of Bacillus subtilis
dnaK operon encodes a negative regulator of class I heat shock
genes, J. Bacteriol. 178 (1996) 1088–1093.
[130] Segal G., Ron E.Z., Regulation and organization of the groE and
dnaK operons in Eubacteria, FEMS Microbiol. Lett. 138 (1996)
1–10.
[131] Shain D.H., Zuber M.X., Norris J., Yoo J., Neuman T., Selective
conservation of an E-protein gene promoter during vertebrate
evolution, FEBS Lett. 440 (1998) 332–336.
[132] Shaw G.C., Sung C.C., Liu C.H., Lin C.H., Evidence against the
Bm1P1 protein as a positive transcription factor for barbiturate-mediated induction of cytochrome P450BM-1 in Bacillus megaterium, J. Biol. Chem. 273 (1998) 7996–8002.
[133] Shimmin L.C., Dennis P., Characterization of the L11, L1, L10 and
L12 equivalent ribosomal protein gene cluster of the halophilic
archaebacterium Halobacterium cutirubrum, EMBO J. 8 (1989)
1225–1235.
[134] Shumilov V.Y., Mutual positioning of promoters and operators in
DNA of Escherichia coli (in Russian), Mol. Biol. 32 (1998) 384.
[135] Smith M.C., Czaplewski L., North A.K., Baumberg S., Stockley
P.G., Sequences required for regulation of arginine biosynthesis
promoters are conserved between Bacillus subtilis and Escherichia
coli, Mol. Microbiol. 3 (1989) 23–28.
[136] Spek C.A., Bertina R.M., Reitsma P.H., Identification of evolutionarily invariant sequences in the protein C gene promoter, J. Mol.
Evol. 47 (1998) 663–669.
[137] Spellman P.T., Sherlock G., Zhang M.Q., Iyer V.R., Anders K., Eisen
M.B., Brown P.O., Botstein D., Futcher B., Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces
cerevisiae by microarray hybridization, Mol. Biol. Cell 9 (1998)
3273–3297.
[138] Staden R., Computer methods to locate signals in nucleic acid
sequences, Nucleic Acids Res. 12 (1984) 515–519.
[139] Stephens R.S., Kalman S., Lammel C., Fan J., Marathe R., Aravind
L., Mitchell W., Olinger L., Tatusov R.L., Zhao Q., Koonin E.V.,
Davis R.W., Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis, Science 282 (1998)
754–759.
[140] Stormo G.D., Hartzell III G.W., Identifying protein-binding sites
from unaligned DNA fragments, Proc. Natl. Acad. Sci. USA 86
(1989) 1183–1187.
[141] Stormo G.D., Information content and free energy in DNA-protein interactions, J. Theor. Biol. 195 (1998) 135–137.
[142] Svetlov V.V., Cooper T.G., Compilation and characteristics of
dedicated transcription factors in Saccharomyces cerevisiae,
Yeast 11 (1995) 1439–1484.
[143] Tagle D.A., Koop B.F., Goodman M., Slightom J.L., Hess D.L., Jones
R.T., Embryonic ε and γ globin genes of a prosimian primate (Galago
crassicaudatus): nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints, J. Mol. Biol. 203
(1988) 439–455.
[144] Taylor H.S., A regulatory element of the empty spiracles homeobox gene is composed of three distinct conserved regions
that bind regulatory proteins, Mol. Reprod. Dev. 49 (1998)
246–253.
[145] Thacker C., Marra M.A., Jones A., Baillie D.L., Rose A.M., Functional genomics in Caenorhabditis elegans: An approach involving
comparisons from related nematodes, Genome Res. 9 (1999)
348–359.
[146] Thieffry D., Salgado H., Huerta A.M., Collado-Vides J., Prediction
of transcriptional regulatory sites in the complete genome sequence of Escherichia coli K-12, Bioinformatics 14 (1998) 391–400.
[147] Tronche F., Ringeisen F., Blumenfeld M., Yaniv M., Pontoglio M.,
Analysis of the distribution of binding sites for a tissue-specific
transcription factor in the vertebrate genome, J. Mol. Biol. 266
(1997) 231–245.
[148] Ulyanov A.V., Stormo G.D., Multi-alphabet consensus algorithm
for identification of low specificity protein-DNA interactions,
Nucleic Acids Res. 23 (1995) 1434–1440.
[149] Umbarger H.E., Biosynthesis of branched-chain amino acids, in:
Neidhardt F.C. (Ed.), Escherichia coli and Salmonella: Cellular and
Molecular Biology, ASM Press, Washington DC, 1996, pp. 442–457.
[150] Valentin-Hansen P., Sogaard-Andersen L., Pedersen H., A flexible
partnership: the CytR anti-activator and the cAMP-CRP activator
protein, comrades in transcription control, Mol. Microbiol. 20
(1996) 461–466.
[151] Van Helden J., Andre B., Collado-Vides J., Extracting regulatory
sites from the upstream region of yeast genes by computational
analysis of oligonucleotide frequencies, J. Mol. Biol. 281 (1998)
827–842.
[152] Vaughan A., Hawkes N., Hemingway J., Co-amplification explains
linkage disequilibrium of two mosquito esterase genes in
insecticide-resistant Culex quinquefasciatus, Biochem. J. 325 (1997)
359–365.
[153] Vitreschak A., Bansal A.K., Titov I.I., Gelfand M.S., Computer
analysis of regulatory patterns in complete bacterial genomes.
Translation initiation of the ribosomal protein operons (in Russian, Engl. transl.), Biophysics 44 (1999) in press.
[154] Wagner A., A computational ‘genome walk’ technique to identify
regulatory interactions in gene networks, Pac. Symp. Biocomput.
(1998) 264–278.
[155] Wagner A., A computational genomics approach to the identification of gene networks, Nucleic Acids Res. 25 (1997)
3594–3604.
[156] Walker G.C., The SOS response of Escherichia coli, in: Neidhardt
F.C. (Ed.), Escherichia coli and Salmonella: Cellular and Molecular
Biology, ASM Press, Washington DC, 1996, pp. 1400–1416.
[157] Wanner B.L., Phosphorus assimilation and control of the phosphate regulon, in: Neidhardt F.C. (Ed.), Escherichia coli and Salmonella: Cellular and Molecular Biology, ASM Press, Washington DC,
1996, pp. 1357–1381.
[158] Wang D., Marsh J.L., Ayala F.J., Evolutionary changes in the expression pattern of a developmentally essential gene in three Drosophila species, Proc. Natl. Acad. Sci. USA 93 (1996) 7103–7107.
[159] Washio T., Sasayama J., Tomita M., Analysis of complete genomes
suggests that many prokaryotes do not rely on hairpin formation
in transcription termination, Nucleic Acids Res. 26 (1998)
5456–5463.
[160] Wasserman W.W., Fickett J.W., Identification of regulatory regions which confer muscle-specific gene expression, J. Mol. Biol.
278 (1998) 167–181.
[161] Waterman M.S., Arratia R., Galas D.J., Pattern recognition in
several sequences: consensus and alignment, Bull. Math. Biol. 45
(1984) 515–527.
[162] Winterling K.W., Chafin D., Hayes J.J., Sun J., Levine A.S., Yasbin
R.E., Woodgate R., The Bacillus subtilis DinR binding site: redefinition of the consensus sequence, J. Bacteriol. 180 (1998)
2201–2211.
[163] Wolfertstetter F., Frech K., Herrmann G., Werner T., Identification of functional elements in unaligned nucleic sequences by a
novel tuple search algorithm, Comput. Appl. Biosci. 12 (1996)
71–80.
[164] Yada T., Totoki Y., Ishii T., Nakai K., Functional prediction of B.
subtilis genes from their regulatory sequences, Intelligent Systems
Mol. Biol. 5 (1997) 354–357.
[165] Yuh C.H., Bolouri H., Davidson E.H., Genomic cis-regulatory
logic: experimental and computational analysis of a sea urchin
gene, Science 279 (1998) 1896–1902.
[166] Zhu Q., Zhao S., Somerville R.L., Expression, purification, and
functional analysis of the TyrR protein of Haemophilus influenzae,
Protein Expr. Purif. 10 (1997) 237–246.
[167] Zuber U., Schumann W., CIRCE, a novel heat shock element
involved in regulation of heat shock operon dnaK of Bacillus subtilis,
J. Bacteriol. 176 (1994) 1359–1363.
[168] Zucker-Aprison E., Blumenthal T., Potential regulatory elements
of nematode vitellogenin genes revealed by interspecies sequence
comparison, J. Mol. Evol. 28 (1989) 487–496.