Res. Microbiol. 150 (1999) 755−771
© 1999 Éditions scientifiques et médicales Elsevier SAS. All rights reserved

Recognition of regulatory sites by genomic comparison

Mikhail S. Gelfand*
State Scientific Center for Biotechnology ‘NIIGenetika’, 1-j Dorozhny pr. 1, Moscow 113545, Russia

Abstract: Availability of complete bacterial genomes opens the way to the comparative approach to the recognition of transcription regulatory sites. The assumption of regulon conservation, in conjunction with profile analysis, provides two independent lines of evidence that make highly specific predictions possible. Recently this approach was used to analyze several regulons in eubacteria and archaebacteria. The present review covers recent advances in the comparative analysis of transcriptional regulation in prokaryotes and phylogenetic footprinting techniques in eukaryotes, and describes the emerging patterns of the evolution of regulatory systems.

transcription / regulation / operator / repressor / activator / computer analysis / complete genome

1. Statistical methods for recognition of protein-binding sites in DNA sequences

Recognition of functional sites in DNA sequences is an old problem in computational molecular biology. Recent advances in large-scale sequencing, and especially the sequencing of complete bacterial genomes, make it particularly important. The availability of several related genomes opens up new possibilities for prediction of regulatory sites based on comparison of candidate site distribution in these genomes. In this review I describe the basic ideas behind the comparative approach and provide some examples demonstrating its strong and weak points. But let us start with a brief description of the traditional techniques.

Generally, the search for regulatory patterns is organized as follows.
Coregulated genes from one genome are collected from the literature, from databases, or from large-scale gene expression experiments [122, 137]. The upstream regions of the collected genes are aligned in order to find conserved sequence patterns. There exist various techniques for multiple local alignment that can be used in such situations. The most popular approaches are the semi-empirical algorithms based on positional [43, 92, 161] or position-independent oligonucleotide analysis [148, 163], and minimization techniques [62, 140], the most prominent of which is the Gibbs sampler [16, 81, 82, 122]; for a review see [32, 41]. Alternatively, the initial determination of the signal is done experimentally, e.g., by footprinting or mutational analysis of one or several regulatory regions, and then new sites are aligned to these signals.

* Correspondence and reprints. Tel.: +70 95 31 50 156; fax: +70 95 31 50 501; [email protected]

After determination of the signal, it is represented as a recognition rule. The simplest variant of the recognition rule is a consensus word, possibly with degenerate positions [25, 103, 113]. In this case the measure of ‘site-likeness’ of an oligonucleotide, or the score, is defined as the number of mismatches between the oligonucleotide and the consensus. A slightly more sophisticated procedure involves the use of positional nucleotide weight matrices, or profiles [7, 15, 17, 50, 99, 113, 138]. Indeed, a consensus can be represented by a profile whose elements are ones and zeroes. Profiles are more sensitive than consensi. There exists a developed theory for profiles linking them to site specificity [9, 126], providing the physical basis for the choice of positional nucleotide weights [7, 127, 141], and allowing one to compute the statistical significance of observed hits [19, 48] and to estimate the pseudosite competition [8].
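The two recognition rules just described can be illustrated with a minimal sketch. It assumes standard IUPAC codes for degenerate consensus positions and a simple log-odds weighting with a uniform background and a pseudocount; these choices, the function names, and the toy sites are illustrative assumptions, not the weighting schemes of the cited papers.

```python
import math

# IUPAC degenerate nucleotide codes (only the symbols needed for DNA consensi).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_mismatches(word, consensus):
    """Consensus rule: the score is the number of mismatches to the consensus."""
    if len(word) != len(consensus):
        raise ValueError("word and consensus must have the same length")
    return sum(
        1
        for base, symbol in zip(word.upper(), consensus.upper())
        if base not in IUPAC[symbol]
    )


def build_profile(sites, pseudocount=0.5, background=0.25):
    """Profile rule: one dict of log-odds weights per position (assumed scheme)."""
    total = len(sites) + 4 * pseudocount
    profile = []
    for pos in range(len(sites[0])):
        column = [site[pos].upper() for site in sites]
        profile.append({
            base: math.log(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return profile


def profile_score(word, profile):
    """Sum of positional weights for an oligonucleotide of profile width."""
    return sum(column[base] for base, column in zip(word.upper(), profile))


def best_hit(sequence, profile):
    """Return (score, position) of the best-scoring window in a sequence."""
    width = len(profile)
    return max(
        (profile_score(sequence[i:i + width], profile), i)
        for i in range(len(sequence) - width + 1)
    )


# Toy training sites (hypothetical, for illustration only).
toy_sites = ["TTGACA", "TTGACA", "TTTACA", "TTGATA"]
toy_profile = build_profile(toy_sites)
```

Unlike the mismatch count, the profile score distinguishes a frequent substitution from a rare one at the same position, which is one way to read the statement that profiles are more sensitive than consensi.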
The profile-based scores are positively correlated with experimentally determined site strength [7, 99], although the definitions of both values are not as trivial as they may seem: for example, in many cases the positional nucleotide frequencies in natural sites are different from those obtained in SELEX experiments [120, 128]. This means that results of benchmarking of recognition algorithms using site strengths estimated by in vitro binding should be interpreted with some caution [123].

Sophisticated methods such as neural networks [63] or hidden Markov models [34] are rarely used for analysis of transcriptional operators, since these methods require large training sets. Rather, these techniques are used for analysis of ‘mass’ sites such as promoters of transcription [28, 65, 104, 108, 164] and ribosomal binding sites in prokaryotes and polymerase II promoters and splicing sites in eukaryotes. For reviews and benchmarking of recognition methods see [37, 38, 44, 123]; bacterial promoters and terminators are specifically discussed in the paper by M.-F. Sagot in this issue.

Finally, since the information content of most transcription regulatory signals is rather low, the specificity of the recognition rules is not sufficiently high to provide reliable predictions. Thus it is natural to try to predict regulatory sites in context, taking into account the fact that promoters and operators do not occur independently, but rather form recognition complexes. This is especially important for analysis of eukaryotic genes, whose regulation may involve tens of different factors and their recognition sites [165]. Several programs for recognition of eukaryotic promoters [18, 114, 154] or tissue-specific regulation [42, 155, 160] use this approach. It was also applied to analysis of Escherichia coli regulatory sites [109, 121].

There exist several collections of prokaryotic promoters from different genomes [12, 61, 87], operators from E.
coli (RegulonDB by [146] at http://www.cifn.unam.mx/Computational_Biology/regulondb; DPInteract by [120] at http://arep.med.harvard.edu/dpinteract), and eukaryotic regulatory sites [47, 60], some of the latter specifically dedicated to fungi [30, 142].

The derived recognition rules are used to scan databases in order to find new representatives of the regulon. The candidate sites are then verified experimentally. In particular, simple consensus search was successfully applied to find PurR-regulated genes in E. coli [56], and to characterize the competence regulon of Streptococcus pneumoniae and the σX and σW promoters of Bacillus subtilis [66, 67]. In the cases of the competence and σX signals, the consensus was derived not from sequence comparisons alone, but from mutation and deletion analyses of known sites. Similarly, after experimental determination of the recognition rules for FruR [102] and SoxS [106] regulators of E. coli, computer scanning of the databases revealed new candidate binding sites whose significance was assessed taking into consideration the function of the downstream genes. Large-scale analysis of complete genomes was done for E. coli [120, 146] and Saccharomyces cerevisiae [36, 151].

2. Phylogenetic footprinting: comparative analysis of eukaryotic regulatory sites

Until now we considered sets of coregulated genes from one genome and assumed that upstream regions of these genes contain a common signal.
A transversal approach involves using upstream regions of one gene (more exactly, orthologous genes) or functionally related genes from different genomes. This approach, termed ‘phylogenetic footprinting’ [143], was used to analyze regulation of many different eukaryotic systems: globins [52, 143], milk protein genes [90], genes expressed in liver [147], gut [73] and muscle [78, 94, 160], transcription factors [2, 78, 112, 116, 131, 144], sex-determining genes of nematodes [79], and prespore genes of Dictyostelium [93].

There are two main variants of this analysis. One possibility is to look for a conserved arrangement of known signals upstream of homologous genes [90, 91, 160]. This allows for filtering out false positives. Indeed, the number of different profiles in the regulation databases is so high, and their specificity so low, that scanning of any DNA sequence fragment by a set of profiles produces an enormous number of hits. Conversely, scanning of sequence databases by one profile also gives many spurious matches. However, false-positive hits are scattered at random, whereas true sites (at least some of them, see below) are conserved. Thus, occurrence of a candidate site upstream of homologous genes in different species is a strong indicator that this site is truly functional.

The other approach is to search for regulatory sites without any prior idea of the recognition rule. Conserved segments identified by local alignment serve as the basis for experimental analysis ([116, 131, 144, 136, 145]; see also [54] and references therein).

The scope of comparisons is quite wide. The most popular, of course, are the human-mouse comparisons [1, 54, 116, 160]. Other examples include comparisons between different mammals [90, 91, 136] and vertebrates [94, 131], between the fruit flies Drosophila melanogaster and D. hydei [144] and between the nematodes Caenorhabditis elegans and C.
briggsae [73, 79, 145, 168].

The depth of comparisons can be quite diverse. In some cases even comparisons between primates produce meaningful results [143], although in general the degree of sequence conservation in primates is too high. Human-mouse comparisons are often very informative [1, 54], although sometimes they are confounded by conservation of regions unrelated to transcriptional regulation [31, 86] or even to any regulation at all [76, 77]. Besides, the conserved segments are often fairly long and include not only the binding core(s), but marginal regions as well [32]. More distant comparisons, e.g., between mammalian and fish genes, are more selective [2], but produce results in a smaller number of cases [32]. It has been suggested that the optimal evolutionary distance for phylogenetic footprinting is that between mammals and birds [32]. However, it strongly depends on the gene family and the regulatory system under analysis; to the examples above we can add the rapidly evolving gene SRY, whose regulatory sequences are conserved within groups of mammals (primates, rodents, bovids), but are not conserved between these groups [91].

A somewhat different approach involves looking at differences in regulatory regions of closely related genes. It was used to determine novel expression patterns of the Drosophila genes Ddc and amd, involved in production of neurotransmitters and cuticle formation [158], and hunchback, encoding a zinc finger transcription factor [53], as well as to study expression of the primate β-globin gene cluster [52].

All analyses described above are based on the assumption that regulatory sites of the same kind are similar in related genomes. The biological basis for this assumption is provided by complementation studies, where the protein from one organism correctly regulates genes from another organism. Examples of such cross-species regulation can sometimes involve very distant genomes.
For instance, mammalian homologues of the apterous (Ap) transcription factor from Drosophila completely substitute for the protein, and moreover the murine gene (Lhx2) has the same tissue-specific pattern of expression as the Drosophila gene [118]. Human heat-shock transcription factor HSF2 activates target genes of its yeast homologue HSF in response to thermal stress [89]. However, a detailed study [32] found very little sequence conservation in regulatory regions of vertebrates and insects. Thus it seems that these examples cannot serve as a basis for large-scale genomic analysis, however interesting they are from the biological point of view.

Other examples of such very distant regulatory homologies are the enhancer element of plant histone promoters, present also in small plant viruses such as coconut foliar decay virus [97], and the BARBIE box occurring upstream of barbiturate-inducible genes of mammals, insects and the bacterium Bacillus megaterium [57, 72]. The situation with the latter signal is somewhat controversial. In B. megaterium it is bound by repressor BM3R1 [57, 83, 84]. It is possible that the repressor acts by interfering with positive transcription factors BM1P1 and BM1P2 [58, 84]. However, it has also been claimed that deletion of the BARBIE box does not influence the effect of pentobarbital on the P450BM-1 promoter [132]. In rodents the situation also is not clear [72]. Mutations in the BARBIE box in the promoter of the rat alpha-1-acid glycoprotein gene abolish induction by phenobarbital [40]. On the other hand, the putative BARBIE box is not essential for phenobarbital induction of the rat CYP2B1 gene encoding cytochrome P450 2B1 [106]. The mouse phenobarbital-inducible promoter of the CYP2B10 gene contains an insertion in the middle of the BARBIE box, although the 1400-bp promoter region is 83% identical to the rat CYP2B2 gene promoter [64].
Finally, in insects, candidate BARBIE boxes were found upstream of a number of xenobiotic-regulated genes of the house fly Musca domestica [20], butterflies from the genus Papilio [68], and the mosquito Culex quinquefasciatus [152]. The link between all these observations is established by the fact that some unknown B. megaterium protein recognizes both bacterial and rodent sites [57].

3. Transcription regulatory sites in bacterial genomes

Techniques similar to phylogenetic footprinting were used in a number of bacterial studies as well, e.g., for analysis of iron uptake (Fur) and peroxide (PerR) regulons of B. subtilis [13], and OmpR and SoxS binding sites upstream of the micF gene of E. coli coding for an anti-sense regulatory RNA [26]. However, all of these papers report comparative results simply to confirm or extend biological observations made for one species, and not as a starting point for analysis.

The problem of recognition of regulatory sites in completely sequenced bacterial genomes has been recently stated in a number of reviews [11, 105]. Still, until now, large-scale analysis of transcription regulation was restricted to scanning of genomes by signal profiles [120, 146], and comparative analysis was not applied in a systematic way. One of the reasons for that could be the fact that, although individual binding sites of prokaryotic transcription factors are usually longer than eukaryotic sites, there is no conservation of long regulatory regions, and thus the noise in pairwise comparisons is too strong to discern weakly conserved regulatory signals. However, this can be overcome by simultaneous analysis of regulons, and not single genes, in several species [96].

Similarly to the eukaryotic case considered in the previous section, the main assumption is that regulatory patterns are conserved in related species.
However, since the number of regulatory interactions per gene in prokaryotes is generally much smaller than that in eukaryotes, it is possible to base analysis on conservation of entire regulons, that is, sets of genes regulated by a particular factor.

In the simplest variant of this approach, information from a well studied genome is transferred to a newly sequenced genome. It is assumed that not only the composition of regulons, but the signal itself is conserved. The known sites from the ‘old’ genome are used to make a recognition rule to scan the ‘new’ genome. Candidate sites upstream of genes orthologous to the known regulon members are assumed to be true. This technique was used to predict PurR (purine), ArgR (arginine), TrpR and TyrR (aromatic amino acids), and LexA (SOS-repair) regulons of Haemophilus influenzae [45, 95, 96]. An obvious prerequisite for this analysis is conservation of the transcription factor. Sometimes the results of complementation experiments are available that provide an additional basis for the assumption of signal conservation (e.g., [98, 124, 135, 166]).

A minor modification of this technique allows one to find new members of regulons. Indeed, if candidate signals appear for orthologous genes with unknown function, these genes can be added to the regulons. This allowed us to find a family of transporters that are subject to purine regulation in E. coli and H. influenzae [95, 96].

A different approach can be used in the absence of prior information. As described above, there exist numerous algorithms that find signals in a sample of regulatory regions. However, in many cases the algorithms produce several candidate signals, only one of which is biologically relevant [41, 122]. Thus the problem is to distinguish the true signal among several possibilities. Moreover, if the analyzed sample does not in fact contain coregulated genes, most algorithms still produce some output.
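The transfer of a known regulon to a new genome, and its extension to new members, can be sketched as a simple filtering step. The sketch below is only an illustration under stated assumptions: the recognition rule is a plain mismatch count against a made-up consensus, orthology is given as a precomputed table, and all gene names and sequences are hypothetical.

```python
def count_mismatches(word, consensus):
    """Number of positions where word differs from consensus ('N' matches any base)."""
    return sum(1 for b, c in zip(word, consensus) if c != "N" and b != c)


def has_candidate_site(region, consensus, max_mismatches=1):
    """True if any window of the upstream region is close enough to the consensus."""
    width = len(consensus)
    return any(
        count_mismatches(region[i:i + width], consensus) <= max_mismatches
        for i in range(len(region) - width + 1)
    )


def genes_with_sites(upstream_regions, consensus, max_mismatches=1):
    """Set of genes whose upstream region contains at least one candidate site."""
    return {
        gene
        for gene, region in upstream_regions.items()
        if has_candidate_site(region, consensus, max_mismatches)
    }


def classify_candidates(new_hits, old_hits, ortholog_of, known_members):
    """Sort new-genome hits by the comparative evidence supporting them.

    conserved:   ortholog is a known regulon member in the old genome
    novel:       ortholog is not a known member but also carries a candidate site
    unsupported: no ortholog, or the ortholog has no candidate site
    """
    result = {"conserved": [], "novel": [], "unsupported": []}
    for gene in sorted(new_hits):
        partner = ortholog_of.get(gene)
        if partner in known_members:
            result["conserved"].append(gene)
        elif partner is not None and partner in old_hits:
            result["novel"].append(gene)
        else:
            result["unsupported"].append(gene)
    return result


# Hypothetical toy data: consensus, gene names and sequences are invented.
CONSENSUS = "ACGTTAACGT"
old_upstream = {
    "oldA": "GGGACGTTAACGTCC",
    "oldB": "TTTTACGTTTACGTGG",
    "oldC": "GGGGGGGGGGGGGGGG",
}
new_upstream = {
    "newA": "CCACGTTAACGTAA",
    "newB": "AAACGTTAACCTGG",
    "newC": "TTACGTTAACGTTT",
    "newD": "ACGTTAACGTGGGG",
}
ortholog_of = {"newA": "oldA", "newB": "oldB", "newC": "oldC"}
known_members = {"oldA"}

classification = classify_candidates(
    genes_with_sites(new_upstream, CONSENSUS),
    genes_with_sites(old_upstream, CONSENSUS),
    ortholog_of,
    known_members,
)
```

In this toy run a hit without comparative support is not discarded as false; it is simply not confirmed by the second, independent line of evidence.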
There is no theory allowing one to estimate the significance of the derived signal. However, the reliability of the signal can be assessed using genomic comparisons. Indeed, assuming conservation of the signal in related genomes, one can consider several sets of orthologous genes, independently find candidate signals for each set, and then select the most similar signals or determine that there are no common signals.

This technique was used to describe purine regulons in archaebacteria from the genus Pyrococcus (Gelfand et al., in preparation). Genes encoding enzymes from the purine pathway were found in the genomes of P. horikoshii [70], P. abyssi (Utah Genome Center, University of Utah, USA, http://www.genome.utah.edu) and P. furiosus (Genoscope, National Centre for Sequencing, France, http://www.genoscope.cns.fr) using standard protein similarity analysis. The signals identified in each individual set were very weak and practically indistinguishable from random noise (that is, pseudo-signals found in random sets of upstream regions from the same species). However, the fact that very similar signals have been constructed for all three sets is a strong indication that the signal is correct. In fact, it turned out that upstream of all relevant genes there are two candidate sites. The recognition rule requiring two sites with a fixed-length spacer is highly specific: it selects only genes from the purine pathway. The regulons also include transporters (one gene in each genome) that are orthologous to the purine transporters found in an independent analysis of E. coli and H. influenzae purine regulons, see above. Thus the results of several unrelated studies corroborate each other.

4. Evolution of regulons, regulators, and signals

Application of the comparative approach allowed for description of some modes of regulon evolution.
Although these observations are very fragmented and preliminary, they are important for creating more powerful algorithms for regulation analysis, as they show limitations of the existing methods and possible additional considerations that can extend the present capabilities of the comparative approach. Unless stated otherwise, the examples presented below are taken from papers ([45, 95, 96]; Gelfand et al., unpublished) and from my unpublished observations.

Firstly, the main assumption, that is, conservation of regulons, is only partially valid. In many cases conservation of the regulon core is accompanied by loss (or gain) of regulation of some genes. For example, the main genes from the purine pathways are regulated by PurR both in E. coli and H. influenzae, but several other genes are members of the purine regulon in the former, but not the latter. Notably, a large number of such genes were initially identified in E. coli by computational analysis and then checked experimentally [56], and the strength of the repressor sites is rather low. Another noteworthy case is the predicted loss of autoregulation by PurR and IlvY in H. influenzae.

A related, but more straightforward situation is the case when one genome has no orthologue for a regulon member from the other genome. Since the genome composition is indeed very labile, this is a very frequent phenomenon occurring in almost all regulons we have studied so far.

Secondly, analysis of regulatory interactions is obscured by changes in operon structure. Analysis of the operon structure, and, more generally, consideration of gene proximity is a powerful tool of functional analysis, since in many cases genes that are close to each other in a number of sufficiently distant genomes are functionally related [21, 105]. There are two modes of operon structure evolution that pose difficulties for the comparative analysis. One is insertion of genes between the operator and the (former) first gene of the operon.
For instance, compare trpBA operons in H. influenzae and Pasteurella multocida (figure 1a). In H. influenzae this operon contains an additional gene ydfG coding for a predicted oxidoreductase. Thus the TrpR binding site (TRP box) is immediately upstream of ydfG, but not trpB.

The other mode involves breaking of an operon into two parts. The upstream part then inherits the regulatory site from the original operon, whereas the downstream part acquires a new site. This happened with the tryptophan operon (trpEDCBA in enterobacteria and Vibrionaceae) that broke into two operons in Pasteurellaceae, trpBA and trpEDC (figure 1a), and with one of the operons from the purine regulon (purHDglyA in H. influenzae, but two operons purHD and glyA in E. coli, figure 1b).

A very nice example is present in the nitrogen fixation regulon of methanogenic archaea Methanobacterium thermoautotrophicum and Methanococcus jannaschii (figure 1c). Both these genomes contain two pairs of glutamine synthase regulator genes glnB and ammonium transporter genes amtB. The relative arrangement of these genes is quite diverse: they form tandemly repeated operons amtBglnB in M. thermoautotrophicum, whereas in M. jannaschii they are arranged as the operon glnBamtB and a pair of divergently transcribed genes. However, all these operons have upstream nitrogen fixation (NIF) boxes. In the latter case there is a single regulatory site between the divergent genes amtB and glnB that most likely regulates both genes.

Figure 1. Examples of changed operon structure with retained regulation. a. trp operons in gamma-proteobacteria. TRP, binding site of TrpR. b. purHD and glyA operons in gamma-proteobacteria. PUR, binding site of PurR. c. glnBamtB loci in methanogenic archaebacteria. NIF, nitrogen fixation operator.

To account for possible changes in the operon structure, one should select for pairwise comparisons not only the genes with candidate regulatory sites in upstream regions, but also
genes that can be coregulated with these genes, that is, downstream genes transcribed in the same direction, with some limit on the spacer length.

Thirdly, many regulons involve multiple regulatory mechanisms. To deal with such cases, the sets regulated by functionally linked factors should be merged at the step of pairwise comparisons into a single coregulated set. For instance, the aromatic amino acid metabolism operons in enterobacteria and Pasteurellaceae are regulated on the level of transcription by two factors, TrpR and TyrR. Several genes are under dual regulation. In addition to the changes in operon structure described above, there are several genes that apparently have different regulation in E. coli and H. influenzae (figure 2). A relatively simple case is that of mtr, which is under regulation by both TyrR and TrpR in E. coli, but has no candidate TyrR binding sites in H. influenzae (figure 2a). A more interesting case is that of DAHP synthases (figure 2b). DAHP synthase is the first enzyme of the pathway leading to aromatic amino acids. In E. coli there are three DAHP synthases encoded by genes aroH, aroG, and aroF. They are feedback inhibited by tryptophan, phenylalanine and tyrosine, respectively [110]. On the level of transcription, aroH is repressed by TrpR with tryptophan acting as corepressor, whereas aroF and aroG are repressed by TyrR with corepressors tyrosine and phenylalanine. H. influenzae has only one DAHP synthase, which is orthologous to aroG. However, it has no candidate TyrR-binding sites, but a candidate TrpR-binding site instead.

The situation with global regulons seems to be even more complicated.
For example, the heat shock response is regulated in various bacteria by at least twelve systems [100], in particular:
– heat-shock-specific σ-factor RpoH [32] in gamma proteobacteria [51];
– factor CtsR in low-G + C Gram-positive bacteria [29];
– unknown factor binding to ROSE element in Bradyrhizobium japonicum [101];
– HspR binding to HAIR element in Streptomyces spp. [14];
– HrcA binding to CIRCE element [129].

The last system is of particular interest. The CIRCE element (controlling inverted repeat of chaperone expression) [167] has been found upstream of chaperone operons in numerous Gram-positive and Gram-negative bacteria, cyanobacteria, Chlamydiae, and spirochetes [59, 130]. These operons often, but not always, include the regulator itself (in Caulobacter crescentus the hrcA gene is under an RpoH promoter [119]). However, the composition of the chaperone operons in bacterial genomes is very diverse (figure 2c) and not all of them have upstream CIRCE elements. On the other hand, in Mycoplasma genitalium, where HrcA is the only transcription factor (E.V. Koonin, personal communication), it regulates not only chaperones, but heat-shock proteases lon and clpB as well. There is no obvious phylogenetic pattern in the distribution of CIRCE: for instance, it is absent in many gamma-proteobacteria, including E. coli, but not in all of them, since there is a CIRCE element upstream of the groESL operon of Chromatium vinosum [36].

The fourth complication is the fact that orthology of factors does not necessarily imply similarity of the signals. For example, SOS repair genes in E. coli (and other Gram-negative bacteria) and B. subtilis (and other Gram-positive bacteria) are regulated by orthologous proteins LexA and DinR respectively. However, the DNA binding domains of these proteins have diverged and the recognized signals are quite different: CTGTatatatatMCAG for LexA [156] and cGAACrnryGTTYg for DinR [162].
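The two consensi above can be compared mechanically. The following sketch assumes that the symbols are standard IUPAC codes and that lower-case letters mark weakly conserved positions, which are treated here exactly like upper-case ones; both are assumptions of this illustration, as are the function names and test sites.

```python
import re

# IUPAC degenerate nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def consensus_to_regex(consensus):
    """Compile a degenerate consensus into a regular expression (case ignored)."""
    parts = []
    for symbol in consensus.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def is_palindromic(consensus):
    """True if every position is compatible with the complement of its mirror position."""
    symbols = consensus.upper()
    n = len(symbols)
    for i in range(n):
        allowed = set(IUPAC[symbols[i]])
        mirrored = {COMPLEMENT[base] for base in IUPAC[symbols[n - 1 - i]]}
        if not allowed & mirrored:
            return False
    return True


LEXA = "CTGTatatatatMCAG"   # consensus quoted in the text for LexA
DINR = "cGAACrnryGTTYg"     # consensus quoted in the text for DinR

lexa_re = consensus_to_regex(LEXA)
dinr_re = consensus_to_regex(DINR)
```

Both consensi pass the symmetry check although neither pattern matches sites of the other, which is a compact restatement of the point made in the text: the signals have diverged, but the palindromic layout has not.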
The composition of regulons also is different, although there are some common genes that can be revealed by the comparative analysis.

This example illustrates another feature that can be used for comparative analysis: even if the regulators have diverged and the signals are different, the overall symmetry of the pattern is conserved. In the LexA-DinR case the signal is a palindrome with strongly conserved half-sites and a weakly conserved AT-rich spacer. However, since most transcription regulatory signals are palindromes with degenerate positions, this example is not particularly revealing. A more interesting case is that of phosphate regulons of enterobacteria and B. subtilis regulated by homologous two-component systems PhoB-PhoR and PhoP-PhoR respectively. The enterobacterial signal consists of several heptanucleotide PHO boxes with consensus CTGTCAT with 4-bp spacers [157], whereas the B. subtilis signal is several hexanucleotide boxes TTWACA with 4–5 bp spacers [88]. In both cases the total length of one unit consisting of a box and a spacer is 10–11 bp, which means that the boxes are located on the same side of the DNA helix [115]. The position of the signals relative to the start of transcription also is conserved: they serve as substitute (-35) boxes.

Another example is the group of quorum sensing two-component systems regulating bacteriocin production in lactic bacteria and the related agr system of Staphylococcus species [69, 74]. In all cases the signal is a pair of identical nonanucleotide boxes with either a 12- or 13-bp spacer (figure 3). Thus, boxes again are located on the same side of the double helix, with corresponding positions separated by two helical turns.

Figure 2. Changes in regulation. a. mtr gene in E. coli and H. influenzae. TYR, TRP, binding sites of TyrR and TrpR respectively. b. DAHP synthases of E. coli and H. influenzae. 1st column: schematic representation of the evolutionary tree; 2nd column: inhibitor; 3rd column: genome (E.c., E. coli; H.i., H. influenzae); 4th column: schematic representation of the relevant operons (TYR, TRP, binding sites of TyrR and TrpR respectively). c. CIRCE elements and chaperone operons in complete bacterial genomes. CIRCE, binding site of HrcA. In mycoplasmas, operon groESL has a CIRCE element in M. pneumoniae, but not in M. genitalium.

On the other hand, sometimes the similarity between signals of paralogous factors may be a problem. Indeed, if related regulators from the same genome recognize similar signals, it is difficult to sort out the regulons.

The interplay between the DNA sites and the recognizer domains in proteins, well studied in several phage systems, can be quite subtle. Three base pairs determine specificity of T3 and T7 promoters to the respective RNA polymerases [75]. In the Salmonella typhimurium phage P22, the C1 operator for the c2 gene promoter partially overlaps with the gene encoding the C1 activator itself. A single-base-pair mutation within this overlap changes the specificity of both the operator and the activator: the recognition is retained, but there is no cross-recognition of the wild-type operator [117].

In E. coli there are several groups of related factors and signals. One of the best studied examples is the pair of activators CRP and FNR that recognize very similar palindromic signals with consensi TGTGAN6TCACA and TTGATN4ATCAA respectively [6]. The physiological roles of the two factors are quite different, with CRP regulating catabolite repression and FNR relevant to anaerobiosis. Although it is unlikely that there is competition between these factors for wild-type sites [125], rough profiles may not be able to distinguish between the two signals.

A somewhat more complicated case is that of vegetative and stationary phase promoters of E.
coli recognized by σ70 and σS sigma-factors respectively. The problem is that a number of promoters are recognized by both sigma factors and thus the similarity of the signals is not an evolutionary relic, but has functional significance. However, most promoters are recognized by only one sigma factor. It has been suggested that the distinguishing feature of σS promoters is curved DNA structure in the (-35) region, instead of the usual TTGACA box, and a slightly different consensus of the (-10) box: CTATACT instead of TATAAT [35]. A similar situation exists in B. subtilis, where several pairs of sigma-factors share some promoters. One example is provided by the recently characterized σX and σW promoters with rather well conserved consensi TGTAACN17CGAC and TGAAACN16CGTA respectively [67].

An interesting case is that of two-component systems NarL-NarX and NarP-NarQ mediating nitrate and nitrite regulation of anaerobic gene expression in E. coli [23]. They bind to heptamer half-sites TACYYMT organized as inverted repeats. It turns out that (phosphorylated) NarP regulates only the genes where the spacer between the half-sites is exactly 2 bp (that is, TACYYMTN2AKRRGTA), whereas NarL can also bind to half-sites in other arrangements, although 2 bp still is the preferred spacer length [24].

There are many other cases where position of an operator relative to the transcription binding site or other operators is important for the action mechanism. There exist three positional classes for CRP-binding sites [33]: i) CRP and RNA polymerase binding on the same side of DNA, ii) CRP binding site overlapping the promoter, and iii) arbitrary position when CRP interacts with another factor, e.g., CytR [150] or FIS [49]. This is reflected in the periodicity of distribution of CRP operators in regulatory regions aligned at transcription start sites [109]. Similar periodicity can also be observed in distribution of binding sites for other factors [134].
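The dependence on spacer length in the NarL/NarP case described above lends itself to a short sketch. It searches a sequence for the half-site TACYYMT followed, after a spacer of a given length, by its reverse complement. The search strategy, the spacer range, and the one-line assignment of regulators are simplifying assumptions made for illustration; in particular, the sketch covers only inverted repeats, whereas NarL is said in the text to bind other arrangements as well.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "M": "K", "K": "M",
    "S": "S", "W": "W", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def reverse_complement(consensus):
    """Reverse complement of a (possibly degenerate) consensus word."""
    return "".join(IUPAC_COMPLEMENT[s] for s in reversed(consensus.upper()))


def to_regex_text(consensus):
    """Regular-expression source text for a degenerate consensus."""
    return "".join(
        IUPAC[s] if len(IUPAC[s]) == 1 else "[" + IUPAC[s] + "]"
        for s in consensus.upper()
    )


def find_inverted_repeats(sequence, half_site="TACYYMT", spacers=range(0, 11)):
    """Return sorted (position, spacer_length, matched_text) for each inverted repeat."""
    left = to_regex_text(half_site)
    right = to_regex_text(reverse_complement(half_site))
    hits = []
    for spacer in spacers:
        # Lookahead so that overlapping candidates are all reported.
        pattern = re.compile(f"(?=({left}[ACGT]{{{spacer}}}{right}))")
        for match in pattern.finditer(sequence.upper()):
            hits.append((match.start(), spacer, match.group(1)))
    return sorted(hits)


def candidate_regulators(spacer):
    """Simplified reading of the text: NarP requires a 2-bp spacer, NarL does not."""
    return ["NarL", "NarP"] if spacer == 2 else ["NarL"]


# Hypothetical test sequences built around the half-site consensus.
seq_2bp = "GGTACTCCTAAAGGAGTACC"
seq_5bp = "CTACCTATGGGGGATAGGTAC"
```

Keeping the spacer length in the reported hit, rather than accepting any spacer, is what allows the same half-site rule to separate the two regulons.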
Many activators act as repressors if their binding site overlaps with the promoter; this is the standard mechanism of negative transcriptional feedback. Thus simultaneous prediction of promoters and operators may be more powerful than two separate procedures and, furthermore, can provide additional information. However, it is not clear how stable the positional effects are. For example, transcription of the E. coli gene purB is repressed by PurR via a roadblock mechanism: the repressor binds to an operator located within the coding region around codon 60 [55]. However, in H. influenzae the candidate operator is located upstream of the gene, just as in other genes.

Figure 3. Signals upstream of bacteriocin operons of lactic bacteria and quorum sensing operons of Staphylococcus spp.

Another aspect of positional analysis is the use of colocalization of regulators and regulated operons. This is especially important for systems with small regulons that are likely candidates for horizontal transfer, such as two-component systems regulating only one operon and restriction-modification systems regulated by the control gene. In the latter case, the analysis of upstream regions based on the assumption of conserved orientation of an asymmetric operator relative to the methylase gene and the control element–restriction endonuclease operon [5] led to the discovery of candidate signals. Similarly, in the case of bacteriocin systems (figure 3), the regulatory two-component systems and the regulated operons are usually adjacent to each other on the chromosomes of lactic bacilli. However, such an analysis also requires some caution, as demonstrated by the following example. In E. coli the gene ilvC and its activator ilvY are transcribed divergently, with operators in the common regulatory zone acting as activating sites for ilvC and repressing sites for ilvY [149]. This arrangement is conserved in H.
influenzae, but since the distance between the only candidate operator and the ilvY gene is several hundred base pairs, it is unlikely that this operator is a repressor site for ilvY.

5. Regulatory sites and RNA structure

The comparative approach can be used to analyze RNA regulatory sites. The best known prokaryotic signal that includes an RNA secondary structure element is the terminator of transcription. However, as in the case of promoters, the comparative approach is not applicable to the prediction of terminators, since they are not characteristic of any particular regulon. Besides, a recent study showed that the conventional notion of the transcriptional terminator as a GC-rich hairpin followed by a run of T’s may not be universal: in many prokaryotic genomes there is no prevalence of hairpins between convergently transcribed genes, where one should expect at least one terminator [159].

Other well studied examples of RNA regulatory structures are attenuators of transcription [80] and elements involved in feedback regulation of translation initiation in ribosomal protein operons [71]. Both systems are well studied in E. coli, and it is not difficult to extend these models to other bacteria where they exist ([11, 21, 85, 107, 133, 153]; Vitreschak and Gelfand, unpublished). There exist systems that search for RNA regulatory elements; they have been applied, in particular, to finding IREs (iron-responsive elements) occurring in the 3'-untranslated regions of mRNAs of genes encoding proteins involved in iron utilization [10, 22]. However, the present experience in comparative analysis of prokaryotic RNA regulation is too limited to describe general properties of such systems. In any case, it is clear that the degree of conservation is quite diverse. For instance, the regulatory elements are retained in H. influenzae for only about half of the ribosomal protein operons of E. coli [153]. On the other hand, the autoregulation mechanism of L1 is conserved in archaebacteria, although in a slightly different form [133]. The RFN element implicated in regulation of riboflavin metabolism genes can be found in such diverse genomes as Bacillus subtilis and other high G + C-content Gram-positives, Deinococcus radiodurans, Thermotoga spp. and several Gram-negatives including E. coli [45, 46].

6. Limitations

The comparative approach is intended for the analysis of particular regulatory systems and obviously cannot be applied to the prediction of generic functional sites such as promoters and terminators of transcription. On the other hand, analysis of promoters and terminators can be a useful ingredient for the comparative prediction of operon structure.

As discussed above, an important prerequisite for regulon conservation is conservation of the regulatory mechanism. This can be ascertained when the regulator is known, but can only be assumed when it is not known. Purely phylogenetic considerations are not sufficient, since the rate of regulon evolution can be very diverse. Comparisons of closely related genomes can be fruitless because of residual conservation of noncoding regions (although we were able to find purine regulatory signals in the three strains of Pyrococcus). More distant comparisons are more informative, but the regulon may not be conserved. Recall, for example, that the arginine repressor is conserved in E. coli and B. subtilis and binds a similar signal there [135], whereas the phosphate repressor is conserved, but binds different signals [88, 157]. On the other hand, the tryptophan regulon is regulated by the repressor TrpR in enterobacteria [110] and Chlamydia trachomatis [139], by the activator TrpI in fluorescent pseudomonads [3], and by the RNA-binding protein TRAP in Bacillus subtilis [4]. Some regulons, such as biodegradation regulons of Pseudomonas spp.
[27], or regulatory regions, such as the dnaA promoter in E. coli [111], are subject to fast evolution. Application of comparative analysis in such cases is not straightforward, if possible at all. Other situations where comparisons are difficult are small regulons (which make it difficult to determine the signal) and regulons subject to horizontal transfer (which complicates the resolution of orthology).

However, despite all the problems mentioned, the comparative method is a powerful technique. One may expect that its usefulness will increase as more genomes are sequenced and cover the phylogenetic space more evenly, making it possible to select the correct evolutionary distance in each particular case.

Acknowledgments

I am grateful to Ross Overbeek for a conversation several years ago that started my studies in the area of bacterial transcription regulation. Many results reported here were obtained in collaboration with Andrey Mironov, who wrote the software GENOME for comparative analysis of regulation. I am grateful to Jim Fickett, Eugene Koonin, and Mike Roytberg for numerous discussions, references, and encouragement. This work was partially supported by grants from the Russian State Scientific Program ‘Human Genome’ and the Russian Fund of Basic Research (99, 04, 48347).

References

[1] Ansari-Lari M.A., Oeltjen J.C., Schwartz S., Zhang Z., Muzny D.M., Lu J., Gorrell J.H., Chinault A.C., Belmont J.W., Miller W., Gibbs R.A., Comparative sequence analysis of a gene-rich cluster at human chromosome 12p13 and its syntenic region in mouse chromosome 6, Genome Res. 8 (1998) 29–40.
[2] Aparicio S., Morrison A., Gould A., Gilthorpe J., Chaudhuri C., Rigby P., Krumlauf R., Brenner S., Detecting conserved regulatory elements with the model genome of the Japanese puffer fish, Fugu rubripes, Proc. Natl. Acad. Sci. USA 92 (1995) 1684–1688.
[3] Auerbach S., Gao J., Gussin G.N., Nucleotide sequences of the trpI, trpB, and trpA genes of Pseudomonas syringae: positive control unique to fluorescent pseudomonads, Gene 123 (1993) 25–32.
[4] Babitzke P., Regulation of tryptophan biosynthesis: Trp-ing the TRAP or how Bacillus subtilis reinvented the wheel, Mol. Microbiol. 26 (1997) 1–9.
[5] Bart A., Dankert J., van der Ende A., Operator sequences for the regulatory proteins of restriction-modification systems, Mol. Microbiol. 31 (1999) 1277–1278.
[6] Bell A.I., Gaston K.L., Cole J., Busby S.J.W., Cloning of binding sequences for the Escherichia coli transcription activators FNR and CRP, location of bases involved in discrimination between FNR and CRP, Nucleic Acids Res. 17 (1989) 3865–3874.
[7] Berg O.G., von Hippel P.H., Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters, J. Mol. Biol. 193 (1987) 723–750.
[8] Berg O.G., Selection of DNA binding sites by regulatory proteins. Functional specificity and pseudosite competition, J. Biomol. Struct. Dyn. 6 (1988) 275–297.
[9] Berg O.G., Selection of DNA binding sites by regulatory proteins: the LexA and the arginine repressor use different strategies for functional specificity, Nucleic Acids Res. 16 (1988) 5089–5105.
[10] Billoud B., Kontic M., Viari A., Palingol: a declarative programming language to describe nucleic acids’ secondary structures and to scan sequence database, Nucleic Acids Res. 24 (1996) 1395–1403.
[11] Bork P., Dandekar T., Diaz-Lazcoz Y., Eisenhaber F., Huynen M., Yuan Y., Predicting function: From genes to genomes and back, J. Mol. Biol. 283 (1998) 707–725.
[12] Bourn W.R., Babb B., Computer assisted identification and classification of streptomycete promoters, Nucleic Acids Res. 23 (1995) 3696–3703.
[13] Bsat N., Herbig A., Casillas-Martinez L., Setlow P., Helmann J.D., Bacillus subtilis contains multiple Fur homologues: identification of the iron uptake (Fur) and peroxide regulon (PerR) repressors, Mol. Microbiol. 29 (1998) 189–198.
[14] Bucca G., Ferina G., Puglia A.M., Smith C.P., The dnaK operon of Streptomyces coelicolor encodes a novel heat-shock protein which binds to the promoter region of the operon, Mol. Microbiol. 17 (1995) 663–674.
[15] Bucher P., Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated sequences, J. Mol. Biol. 212 (1990) 563–578.
[16] Cardon L.R., Stormo G.D., Expectation maximization algorithm for identifying protein-binding sites with variable length from unaligned DNA fragments, J. Mol. Biol. 223 (1992) 159–170.
[17] Chen Q.K., Hertz G.Z., Stormo G.D., MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices, Comput. Appl. Biosci. 11 (1995) 563–566.
[18] Chen Q.K., Hertz G.Z., Stormo G.D., PromFD 1.0: A computer program that predicts eukaryotic polII promoters using strings and IMD matrices, Comput. Appl. Biosci. 13 (1997) 29–35.
[19] Claverie J.M., Audic S., The statistical significance of nucleotide position-weight matrix matches, Comput. Appl. Biosci. 12 (1996) 431–439.
[20] Cohen M.B., Koener J.F., Feyereisen R., Structure and chromosomal localization of CYP6A1, a cytochrome P450-encoding gene from the house fly, Gene 146 (1994) 267–272.
[21] Dandekar T., Snel B., Huynen M.A., Bork P., Conservation of gene order: a fingerprint of physically interacting proteins, Trends Biochem. Sci. 23 (1998) 324–328.
[22] Dandekar T., Beyer K., Bork P., Kenealy M.R., Pantopoulos K., Hentze M., Sonntag-Buck V., Flouriot G., Cannon F., Keller W., Schreiber S., Systematic genomic screening and analysis of mRNA in untranslated regions and mRNA precursors: combining experimental and computational approaches, Bioinformatics 14 (1999) 271–278.
[23] Darwin A.J., Stewart V., The nar modulon systems: nitrate and nitrite regulation of anaerobic gene expression, in: Lin E.C.C., Lynch A.S. (Eds.), Regulation of Gene Expression in Escherichia coli, R.G. Landes Company, Austin TX, 1996, pp. 343–359.
[24] Darwin A.J., Tyson K.L., Busby S.J.W., Stewart V., Differential regulation by the homologous NarL and NarP of Escherichia coli K-12 depends on DNA binding site arrangement, Mol. Microbiol. 25 (1997) 583–595.
[25] Day W.H.E., McMorris F.R., Critical comparison of consensus methods for molecular sequences, Nucleic Acids Res. 20 (1992) 1093–1099.
[26] Delihas N., Regulation of gene expression by trans-encoded antisense RNAs, Mol. Microbiol. 15 (1995) 411–414.
[27] De Lorenzo V., Perez-Martin J., Regulatory noise in prokaryotic promoters: how bacteria learn to respond to novel environmental signals, Mol. Microbiol. 19 (1996) 1177–1184.
[28] Demeler B., Zhou G., Neural network optimization for E. coli promoter prediction, Nucleic Acids Res. 19 (1991) 1593–1599.
[29] Derre I., Rapoport G., Msadek T., CtsR, a novel regulator of stress and heat shock response, controls clp and molecular chaperone gene expression in Gram-positive bacteria, Mol. Microbiol. 31 (1999) 117–131.
[30] Dhawale S.S., Lane A.C., Compilation of sequence-specific DNA-binding proteins implicated in transcriptional control in fungi, Nucleic Acids Res. 21 (1993) 5537–5546.
[31] Duret L., Dorkeld F., Gautier C., Strong conservation of noncoding sequences during vertebrates evolution - potential involvement in post-translational regulation of gene expression, Nucleic Acids Res. 21 (1993) 2315–2322.
[32] Duret L., Bucher P., Searching for regulatory elements in human noncoding sequences, Curr. Opin. Struct. Biol. 7 (1997) 399–406.
[33] Ebright R.H., Transcription activation at class II CAP-dependent promoters, Mol. Microbiol. 8 (1993) 797–802.
[34] Eddy S.R., Hidden Markov models, Curr. Opin. Struct. Biol. 6 (1996) 361–365.
[35] Espinosa-Urgel M., Chamizo C., Tormo A., A consensus structure for σS-dependent promoters, Mol. Microbiol. 21 (1996) 657–659.
[36] Ferreyra R., Soncini F., Viale A.M., Cloning, characterization, and functional expression in Escherichia coli of chaperonin (groESL) genes from the sulfur phototrophic bacterium Chromatium vinosum, J. Bacteriol. 175 (1993) 1514–1523.
[37] Fickett J.W., Hatzigeorgiou A.G., Eukaryotic promoter recognition, Genome Res. 7 (1997) 861–878.
[38] Fickett J.W., Coordinate positioning of MEF-2 and myogenin binding sites, Gene 172 (1996) GC19–GC32.
[39] Fondrat C., Kalogeropoulos A., Approaching the function of new genes by detection of their potential upstream activation sequences in Saccharomyces cerevisiae: application to chromosome III, Comput. Appl. Biosci. 12 (1996) 363–374.
[40] Fournier T., Mejdoubi N., Lapoumeroulie C., Hamelin J., Durand G., Porquet D., Transcriptional regulation of rat alpha 1-acid glycoprotein gene by phenobarbital, J. Biol. Chem. 269 (1994) 27175–27178.
[41] Frech K., Quandt K., Werner T., Software for the analysis of DNA sequence elements of transcription, Comput. Appl. Biosci. 13 (1997) 89–97.
[42] Frech K., Quandt K., Werner T., Muscle actin genes: A first step towards computational classification of tissue specific promoters, In Silico Biol. (1998) http://www.bioinfo.de/isb/(1998)/01/0005.
[43] Galas D.J., Eggert M., Waterman M.S., Rigorous pattern-recognition methods for DNA sequences: analysis of promoter sequences from Escherichia coli, J. Mol. Biol. 186 (1985) 117–128.
[44] Gelfand M.S., Prediction of function in DNA sequence analysis, J. Comput. Biol. 2 (1995) 87–115.
[45] Gelfand M.S., Mironov A.A., Computer analysis of regulatory patterns in complete bacterial genomes. LexA and DinR binding sites (in Russian, Engl. transl.), Mol. Biol. 33 (1999) 439–442.
[46] Gelfand M.S., Mironov A.A., Jomantas J., Kozlov Yu.I., Perumov D.A., A conserved RNA structure element involved in regulation of bacterial riboflavin genes, Trends Genet. 15 (1999) in press.
[47] Ghosh D., OOTFD (Object-Oriented Transcription Factors Database): an object-oriented successor to TFD, Nucleic Acids Res. 26 (1998) 360–361.
[48] Goldstein L., Waterman M.S., Approximations to profile score distributions, J. Comput. Biol. 1 (1994) 93–104.
[49] Gonzales-Gil G., Bringmann P., Kahmann R., FIS is a regulator of metabolism in Escherichia coli, Mol. Microbiol. 22 (1996) 21–29.
[50] Goodrich J.A., Schwartz M.L., McClure W.R., Searching for and predicting the activity of sites for DNA binding proteins: compilation and analysis of the binding sites for Escherichia coli integration host factor (IHF), Nucleic Acids Res. 18 (1990) 4993–5000.
[51] Gross C.A., Function and regulation of the heat shock proteins, in: Neidhardt F.C. (Ed.), Escherichia coli and Salmonella: Cellular and Molecular Biology, ASM Press, Washington DC, 1996, pp. 1382–1399.
[52] Gumucio D.L., Shelton D.A., Zhu W., Millinoff D., Gray T., Bock J.H., Slightom J.L., Goodman M., Evolutionary strategies for the elucidation of cis and trans factors that require the developmental switching programs of the β-like globin genes, Mol. Phylogenet. Evol. 5 (1996) 18–32.
[53] Hancock J.M., Shaw P.J., Bonneton F., Dover G.A., High sequence turnover in the regulatory regions of the developmental gene hunchback in insects, Mol. Biol. Evol. 16 (1999) 253–265.
[54] Hardison R.C., Oeltjen J., Miller W., Long human-mouse sequence alignments reveal novel regulatory elements: A reason to sequence the mouse genome, Genome Res. 7 (1997) 759–766.
[55] He B., Zalkin H., Repression of Escherichia coli purB is by a transcriptional roadblock mechanism, J. Bacteriol. 174 (1992) 7121–7127.
[56] He B., Choi K.Y., Zalkin H., Regulation of Escherichia coli glnB, prsA, and speA by the purine repressor, J. Bacteriol. 175 (1993) 3598–3606.
[57] He J.S., Fulco A.J., A barbiturate-regulated protein binding to a common sequence in the cytochrome P450 genes of rodents and bacteria, J. Biol. Chem. 266 (1991) 7864–7869.
[58] He J.S., Liang Q., Fulco A.J., The molecular cloning and characterization of BM1P1 and BM1P2 proteins, putative positive transcription factors involved in barbiturate-mediated induction of the genes encoding cytochrome P450BM-1 of Bacillus megaterium, J. Biol. Chem. 270 (1995) 18615–18625.
[59] Hecker M., Schumann W., Volker U., Heat-shock and general stress response in Bacillus subtilis, Mol. Microbiol. 19 (1996) 417–428.
[60] Heinemeyer T., Wingender E., Ruter I., Hermjacob H., Kel A.E., Kel O.V., Ignatieva E.V., Ananko E.A., Podkolodnaya O.A., Kolpakov F.A., Podkolodny N.L., Kolchanov N.A., Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL, Nucleic Acids Res. 26 (1998) 362–367.
[61] Helmann J.D., Compilation and analysis of Bacillus subtilis σA promoter sequences: evidence for extended contact between RNA polymerase and upstream promoter DNA, Nucleic Acids Res. 23 (1995) 2351–2360.
[62] Hertz G.Z., Hartzell III G.W., Stormo G.D., Identification of consensus patterns in unaligned DNA sequences known to be functionally related, Comput. Appl. Biosci. 6 (1990) 81–92.
[63] Hirst J.D., Sternberg M.J., Prediction of structural and functional features of protein and nucleic acid sequences by artificial neural networks, Biochemistry 31 (1992) 7211–7218.
[64] Honkakoski P., Moore R., Gynther J., Negishi M., Characterization of phenobarbital-inducible mouse CYP2B10 gene transcription in primary hepatocytes, J. Biol. Chem.
271 (1996) 9746–9753.
[65] Horton P.B., Kanehisa M., An assessment of neural network and statistical approaches for prediction of E. coli promoter sites, Nucleic Acids Res. 20 (1992) 4331–4338.
[66] Huang X., Helmann J.D., Identification of target promoters for the Bacillus subtilis σX factor using a consensus-directed search, J. Mol. Biol. 279 (1998) 165–173.
[67] Huang X., Gaballa A., Cao M., Helmann J.D., Identification of target promoters for the Bacillus subtilis extracytoplasmic function σ factor, σW, Mol. Microbiol. 31 (1999) 361–371.
[68] Hung C.F., Holzmacher R., Connolly E., Berenbaum M.R., Schuler M.A., Conserved promoter elements in the CYP6B gene family suggest common ancestry for cytochrome P450 monooxygenases mediating furanocoumarin detoxification, Proc. Natl. Acad. Sci. USA 93 (1996) 12200–12205.
[69] Jack R.W., Tagg J.R., Ray B., Bacteriocins of Gram-positive bacteria, Microbiol. Rev. 59 (1995) 171–200.
[70] Kawarabayasi Y., Sawada M., Horikawa H., Haikawa Y., Hino Y., Yamamoto S., Sekine M., Baba S., Kosugi H., Hosoyama A., Nagai Y., Sakai M., Ogura K., Otuka R., Nakazawa H., Takamiya M., Ohfuku Y., Funahashi T., Tanaka T., Kudoh Y., Yamazaki J., Kushida N., Oguchi A., Aoki K., Nakamura Y., Robb T.F., Horikoshi K., Masuchi Y., Shizuya H., Kikuchi H., Complete sequence and gene organization of the genome of a hyper-thermophilic archaebacterium, Pyrococcus horikoshii OT3, DNA Res. 5 (1998) 55–76.
[71] Keener J., Nomura M., Regulation of ribosome synthesis, in: Neidhardt F.C. (Ed.), Escherichia coli and Salmonella: Cellular and Molecular Biology, ASM Press, Washington DC, 1996, pp. 1417–1431.
[72] Kemper B., Regulation of cytochrome P450 gene transcription by phenobarbital, Prog. Nucleic Acid Res. Mol. Biol. 61 (1998) 23–64.
[73] Kennedy B.P., Aamodt E.J., Allen F.L., Chung M.A., Heschl M.F.P., McGhee J.D., The gut esterase gene (ges-1) from the nematodes Caenorhabditis elegans and Caenorhabditis briggsae, J. Mol. Biol. 229 (1993) 890–908.
[74] Kleerebezem M., Quadri L.E.N., Kuippers O.P., DeVos W.M., Quorum sensing by peptide pheromones and two-component signal-transduction systems in Gram-positive bacteria, Mol. Microbiol. 24 (1997) 895–904.
[75] Klement J.F., Moorefeld M.B., Jorgensen E., Brown J.E., Risman S., McAllister W.T., Discrimination between bacteriophage T3 and T7 promoters by the T3 and T7 RNA polymerases depends primarily upon three base-pair region located 10 to 12 base-pairs upstream from the start site, J. Mol. Biol. 215 (1990) 21–29.
[76] Koop B.F., Hood L., Striking sequence similarity over almost 100 kilobases of human and mouse T-cell receptor DNA, Nature Genet. 7 (1994) 48–53.
[77] Koop B.F., Human and rodent sequence comparisons: A mosaic model of genomic evolution, Trends Genet. 11 (1995) 367–371.
[78] Krause M., Harrison S.W., Xu S.Q., Chen L., Fire A., Elements regulating cell- and stage-specific expression of the C. elegans MyoD family homolog hlh-1, Dev. Biol. 166 (1994) 133–148.
[79] Kuwabara P.E., Interspecies comparison reveals evolution of control regions in the nematode sex-determining gene tra-2, Genetics 144 (1996) 597–607.
[80] Landick R., Turnbough Jr. C.L., Yanofsky C., Transcription attenuation, in: Neidhardt F.C. (Ed.), Escherichia coli and Salmonella: Cellular and Molecular Biology, ASM Press, Washington DC, 1996, pp. 1263–1286.
[81] Lawrence C.E., Altschul S.F., Boguski M.S., Liu J.S., Neuwald A.F., Wootton J.C., Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment, Science 262 (1993) 208–214.
[82] Lawrence C.E., Reilly A.A., An expectation maximisation (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, Proteins: Struct. Func. Genet. 7 (1990) 41–51.
[83] Liang Q., He J.S., Fulco A.J., The role of Barbie box sequences as cis-acting elements involved in the barbiturate-mediated induction of cytochromes P450BM-1 and P450BM-3 in Bacillus megaterium, J. Biol. Chem.
270 (1995) 4438–4450.
[84] Liang Q., Chen L., Fulco A.J., In vivo roles of BM3R1 repressor in the barbiturate-mediated induction of the cytochrome P450 genes (P450BM-3 and P450BM-1) of Bacillus megaterium, Biochim. Biophys. Acta 1380 (1998) 183–197.
[85] Liao D., Dennis P.P., The organization and expression of essential transcription translation component genes in the extremely thermophilic eubacterium Thermotoga maritima, J. Biol. Chem. 267 (1992) 22787–22797.
[86] Lipman D., Making (anti) sense of non-coding sequence conservation, Nucleic Acids Res. 25 (1997) 3580–3583.
[87] Lisser S., Margalit H., Compilation of E. coli mRNA promoter sequences, Nucleic Acids Res. 21 (1993) 1507–1516.
[88] Liu W., Hulett F.M., Comparison of PhoP binding to tuaA promoter with PhoP binding to other Pho-regulon promoters establishes a Bacillus subtilis Pho core binding site, Microbiology 144 (1998) 1443–1450.
[89] Liu X.D., Liu P.C., Santoro N., Thiele D.J., Conservation of a stress response: human heat shock transcription factors substitute for yeast HSF, EMBO J. 16 (1997) 6466–6477.
[90] Malewski T., Computer analysis of distribution of putative cis- and trans-regulatory elements in milk protein gene promoters, BioSystems 45 (1998) 29–44.
[91] Margarit E., Guillen A., Rebordosa C., Vidal-Taboada J., Sanchez M., Ballesta F., Oliva R., Identification of conserved potentially regulatory sequences of the SRY gene from 10 different species of mammals, Biochem. Biophys. Res. Commun. 245 (1998) 370–377.
[92] Mengeritsky G., Smith T.F., Recognition of characteristic patterns in sets of functionally equivalent DNA sequences, Comput. Appl. Biosci. 3 (1987) 223–227.
[93] Miller C., McDonald J., Francis D., Evolution of promoter sequences: elements of a canonical promoter for prespore genes of Dictyostelium, J. Mol. Evol. 43 (1996) 185–193.
[94] Minty A., Kedes L., Upstream regions of the human cardiac actin gene that modulate its transcription in muscle cells: presence of an evolutionarily conserved repeated motif, Mol. Cell. Biol. 6 (1986) 2125–2136.
[95] Mironov A.A., Gelfand M.S., Computer analysis of regulatory patterns in complete bacterial genomes. PurR binding sites, Mol. Biol. 33 (1999) 109–114.
[96] Mironov A.A., Koonin E.V., Roytberg M.A., Gelfand M.S., Computer analysis of transcription regulatory patterns in completely sequenced bacterial genomes, Nucleic Acids Res. 27 (1999) 2981–2989.
[97] Morozov S.Y.A., Merits A., Chernov B.K., Computer search of transcription control sequences in small plant virus DNA reveals a sequence highly homologous to the enhancer element of histone promoters, DNA Seq. 4 (1994) 395–397.
[98] Muday G.K., Herrmann K.M., Regulation of the Salmonella typhimurium aroF gene in Escherichia coli, J. Bacteriol. 172 (1990) 2259–2266.
[99] Mulligan M.E., Hawley D.K., Entriken R., McClure W.R., Escherichia coli promoter sequences predict in vitro RNA polymerase selectivity, Nucleic Acids Res. 12 (1984) 789–800.
[100] Narberhaus F., Negative regulation of bacterial heat shock genes, Mol. Microbiol. 31 (1999) 1–8.
[101] Narberhaus F., Kaser R., Nocker A., Hennecke H., A novel DNA element that controls bacterial heat shock gene expression, Mol. Microbiol. 28 (1998) 315–323.
[102] Negre D., Bonod-Bidaud C., Geourjon G., Deleage G., Cozzone A.J., Cortay J.C., Definition of a consensus DNA-binding site for the Escherichia coli pleiotropic regulatory protein, FruR, Mol. Microbiol. 21 (1996) 257–266.
[103] O’Neill M.C., Consensus methods for finding and ranking DNA binding sites: application to E. coli promoters, J. Mol. Biol. 207 (1989) 301–310.
[104] O’Neill M.C., Training back-propagation neural networks to define and detect DNA binding sites, Nucleic Acids Res. 19 (1991) 313–318.
[105] Overbeek R., Fonstein M., D’Souza M., Pusch G.D., Maltsev N., The use of gene clusters to infer functional coupling, Proc. Natl. Acad. Sci. USA 96 (1999) 2896–2901.
[106] Park Y., Li H., Kemper B., Phenobarbital induction mediated by a distal CYP2B2 sequence in rat liver transiently transfected in situ, J. Biol. Chem. 271 (1996) 23725–23728.
[107] Paton E.B., Zhyvoloup A.N., Binding site of the ribosomal protein L10 in the untranslated leader of rplJ gene in Thermotoga maritima suggests that this gene is under autogenous control (in Russian, Engl. transl.), Genetika 32 (1996) 140–145.
[108] Pedersen A.G., Engelbrecht J., Investigations of Escherichia coli promoter sequences with artificial neural networks: new signals discovered upstream of the transcriptional startpoint, Intelligent Systems Mol. Biol. 3 (1995) 292–299.
[109] Perez-Rueda E., Gralla J.D., Collado-Vides J., Genomic position analyses and the transcription machinery, J. Mol. Biol. 275 (1998) 165–170.
[110] Pittard A.J., Biosynthesis of aromatic amino acids, in: Neidhardt F.C. (Ed.), Escherichia coli and Salmonella: Cellular and Molecular Biology, ASM Press, Washington DC, 1996, pp. 458–484.
[111] Polaczek P., Is the dnaA promoter region in Escherichia coli an evolutionary junkyard of physiologically insignificant regulatory elements? Mol. Microbiol. 27 (1998) 1089–1090.
[112] Popperl H., Bienz M., Studer M., Chan S.K., Aparicio S., Brenner S., Mann R.S., Krumlauf R., Segmental expression of Hoxb-1 is controlled by a highly conserved autoregulatory loop dependent upon exd/pbx, Cell 81 (1995) 1031–1042.
[113] Prestridge D.S., SIGNAL SCAN 4.0 - additional databases and sequence formats, Comput. Appl. Biosci. 12 (1996) 157–160.
[114] Prestridge D.S., Burks C., The density of transcriptional elements in promoter and non-promoter sequences, Hum. Mol. Genet. 2 (1993) 1449–1453.
[115] Qi Y., Kobayashi Y., Hulett F.M., The pst operon of Bacillus subtilis has a phosphate-regulated promoter and is involved in phosphate transport but not in regulation of the pho regulon, J. Bacteriol. 179 (1997) 2534–2539.
[116] Renucci A., Zappavigna V., Zakany J., Izpisua-Belmonte J.C., Burki K., Duboule D., Comparison of mouse and human HOX-4 complexes defines conserved sequences involved in the regulation of Hox-4, EMBO J. 11 (1992) 1459–1468.
[117] Retallack D.M., Johnson L.L., Ziegler S.F., Strauch M.A., Friedman D.I., A single-base-pair mutation changes the specificities of both a transcription regulation protein and its binding site, Proc. Natl. Acad. Sci. USA 90 (1993) 9562–9565.
[118] Rincon-Limas D.E., Lu C.H., Canal I., Calleja M., Rodriguez-Esteban C., Izpisua-Belmonte J.C., Botas J., Conservation of the expression and function of apterous orthologs in Drosophila and mammals, Proc. Natl. Acad. Sci. USA 96 (1999) 2165–2170.
[119] Roberts R.C., Toochinda C., Avedissian M., Baldini R.L., Gomes S.L., Shapiro L., Identification of a Caulobacter crescentus operon encoding hrcA, involved in negatively regulating heat-inducible transcription, and the chaperone gene grpE, J. Bacteriol. 178 (1996) 1829–1841.
[120] Robison K., McGuire A.M., Church G.M., A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome, J. Mol. Biol. 284 (1998) 241–254.
[121] Rosenblueth D.A., Thieffry D., Huerta A.M., Salgado H., Collado-Vides J., Syntactic recognition of regulatory regions in Escherichia coli, Comput. Appl. Biosci. 12 (1996) 415–422.
[122] Roth F.P., Hughes J.D., Estep P.W., Church G.M., Revealing regulons by whole-genome expression monitoring and upstream sequence alignment, Nature Biotechnol. 16 (1998) 239–245.
[123] Roulet E., Fisch I., Junier T., Bucher P., Mermod N., Evaluation of computer tools for the prediction of transcription factor binding sites on genomic DNA, In Silico Biol. http://www.
bioinfo.de/isb/(1998)/01/0004 (1998)
[124] Savchenko A., Charlier D., Dion M., Weigel P., Hallet J.N., Holtham C., Baumberg S., Glansdorff N., Sakanyan V., The arginine operon of Bacillus stearothermophilus: characterization of the control region and its interaction with the heterologous B. subtilis arginine repressor, Mol. Gen. Genet. 252 (1996) 69–78.
[125] Sawers G., Kaiser M., Sirko A., Freundlich M., Transcriptional activation by FNR and CRP: reciprocity of binding-site recognition, Mol. Microbiol. 23 (1997) 835–845.
[126] Schneider T.D., Stormo G.D., Gold L., Ehrenfeucht A., Information content of binding sites on nucleotide sequences, J. Mol. Biol. 188 (1986) 415–431.
[127] Schneider T.D., Information content of individual genetic sequences, J. Theor. Biol. 189 (1997) 427–442.
[128] Schultzaberger R.K., Schneider T.D., Using sequence logos and information analysis of Lrp DNA binding sites to investigate discrepancies between natural selection and SELEX, Nucleic Acids Res. 27 (1999) 882–887.
[129] Schulz A., Schumann W., hrcA, the first gene of Bacillus subtilis dnaK operon encodes a negative regulator of class I heat shock genes, J. Bacteriol. 178 (1996) 1088–1093.
[130] Segal G., Ron E.Z., Regulation and organization of the groE and dnaK operons in Eubacteria, FEMS Microbiol. Lett. 138 (1996) 1–10.
[131] Shain D.H., Zuber M.X., Norris J., Yoo J., Neuman T., Selective conservation of an E-protein gene promoter during vertebrate evolution, FEBS Lett. 440 (1998) 332–336.
[132] Shaw G.C., Sung C.C., Liu C.H., Lin C.H., Evidence against the Bm1P1 protein as a positive transcription factor for barbiturate-mediated induction of cytochrome P450BM-1 in Bacillus megaterium, J. Biol. Chem. 273 (1998) 7996–8002.
[133] Shimmin L.C., Dennis P., Characterization of the L11, L1, L10 and L12 equivalent ribosomal protein gene cluster of the halophilic archaebacterium Halobacterium cutirubrum, EMBO J. 8 (1989) 1225–1235.
[134] Shumilov V.Y., Mutual positioning of promoters and operators in DNA of Escherichia coli (in Russian), Mol. Biol. 32 (1998) 384.
[135] Smith M.C., Czaplewski L., North A.K., Baumberg S., Stockley P.G., Sequences required for regulation of arginine biosynthesis promoters are conserved between Bacillus subtilis and Escherichia coli, Mol. Microbiol. 3 (1989) 23–28.
[136] Spek C.A., Bertina R.M., Reitsma P.H., Identification of evolutionarily invariant sequences in the protein C gene promoter, J. Mol. Evol. 47 (1998) 663–669.
[137] Spellman P.T., Sherlock G., Zhang M.Q., Iyer V.R., Anders K., Eisen M.B., Brown P.O., Botstein D., Futcher B., Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, Mol. Biol. Cell 9 (1998) 3273–3297.
[138] Staden R., Computer methods to locate signals in nucleic acid sequences, Nucleic Acids Res. 12 (1984) 515–519.
[139] Stephens R.S., Kalman S., Lammel C., Fan J., Marathe R., Aravind L., Mitchell W., Olinger L., Tatusov R.L., Zhao Q., Koonin E.V., Davis R.W., Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis, Science 282 (1998) 754–759.
[140] Stormo G.D., Hartzell III G.W., Identifying protein-binding sites from unaligned DNA fragments, Proc. Natl. Acad. Sci. USA 86 (1989) 1183–1187.
[141] Stormo G.D., Information content and free energy in DNA-protein interactions, J. Theor. Biol. 195 (1998) 135–137.
[142] Svetlov V.V., Cooper T.G., Compilation and characteristics of dedicated transcription factors in Saccharomyces cerevisiae, Yeast 11 (1995) 1439–1484.
[143] Tagle D.A., Koop B.F., Goodman M., Slightom J.L., Hess D.L., Jones R.T., Embryonic ε and γ globin genes of a prosimian primate (Galago crassicaudatus): nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints, J. Mol. Biol. 203 (1988) 439–455.
[144] Taylor H.S., A regulatory element of the empty spiracles homeobox gene is composed of three distinct conserved regions that bind regulatory proteins, Mol. Reprod. Dev. 49 (1998) 246–253.
[145] Thacker C., Marra M.A., Jones A., Baillie D.L., Rose A.M., Functional genomics in Caenorhabditis elegans: An approach involving comparisons from related nematodes, Genome Res. 9 (1999) 348–359.
[146] Thieffry D., Salgado H., Huerta A.M., Collado-Vides J., Prediction of transcriptional regulatory sites in the complete genome sequence of Escherichia coli K-12, Bioinformatics 14 (1998) 391–400.
[147] Tronche F., Ringeisen F., Blumenfeld M., Yaniv M., Pontoglio M., Analysis of the distribution of binding sites for a tissue-specific transcription factor in the vertebrate genome, J. Mol. Biol. 266 (1997) 231–245.
[148] Ulyanov A.V., Stormo G.D., Multi-alphabet consensus algorithm for identification of low specificity protein-DNA interactions, Nucleic Acids Res. 23 (1995) 1434–1440.
[149] Umbarger H.E., Biosynthesis of branched-chain amino acids, in: Neidhardt F.C. (Ed.), Escherichia coli and Salmonella: Cellular and Molecular Biology, ASM Press, Washington DC, 1996, pp. 442–457.
[150] Valentin-Hansen P., Sogaard-Andersen L., Pedersen H., A flexible partnership: the CytR anti-activator and the cAMP-CRP activator protein, comrades in transcription control, Mol. Microbiol. 20 (1996) 461–466.
[151] Van Helden J., Andre B., Collado-Vides J., Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies, J. Mol. Biol. 281 (1998) 827–842.
[152] Vaughan A., Hawkes N., Hemingway J., Co-amplification explains linkage disequilibrium of two mosquito esterase genes in insecticide-resistant Culex quinquefasciatus, Biochem. J. 325 (1997) 359–365.
[153] Vitreschak A., Bansal A.K., Titov I.I., Gelfand M.S., Computer analysis of regulatory patterns in complete bacterial genomes. Translation initiation of the ribosomal protein operons (in Russian, Engl. transl.), Biophysics 44 (1999) in press.
[154] Wagner A., A computational ‘genome walk’ technique to identify regulatory interactions in gene networks, Pac. Symp. Biocomput. (1998) 264–278.
[155] Wagner A., A computational genomics approach to the identification of gene networks, Nucleic Acids Res. 25 (1997) 3594–3604.
[156] Walker G.C., The SOS response of Escherichia coli, in: Neidhardt F.C. (Ed.), Escherichia coli and Salmonella: Cellular and Molecular Biology, ASM Press, Washington DC, 1996, pp. 1400–1416.
[157] Wanner B.L., Phosphorus assimilation and control of the phosphate regulon, in: Neidhardt F.C. (Ed.), Escherichia coli and Salmonella: Cellular and Molecular Biology, ASM Press, Washington DC, 1996, pp. 1357–1381.
[158] Wang D., Marsh J.L., Ayala F.J., Evolutionary changes in the expression pattern of a developmentally essential gene in three Drosophila species, Proc. Natl. Acad. Sci. USA 93 (1996) 7103–7107.
[159] Washio T., Sasayama J., Tomita M., Analysis of complete genomes suggests that many prokaryotes do not rely on hairpin formation in transcription termination, Nucleic Acids Res. 26 (1998) 5456–5463.
[160] Wasserman W.W., Fickett J.W., Identification of regulatory regions which confer muscle-specific gene expression, J. Mol. Biol. 278 (1998) 167–181.
[161] Waterman M.S., Arratia R., Galas D.J., Pattern recognition in several sequences: consensus and alignment, Bull. Math. Biol. 45 (1984) 515–527.
[162] Winterling K.W., Chafin D., Hayes J.J., Sun J., Levine A.S., Yasbin R.E., Woodgate R., The Bacillus subtilis DinR binding site: redefinition of the consensus sequence, J. Bacteriol. 180 (1998) 2201–2211.
[163] Wolfertstetter F., Frech K., Herrmann G., Werner T., Identification of functional elements in unaligned nucleic sequences by a novel tuple search algorithm, Comput. Appl. Biosci. 12 (1996) 71–80.
[164] Yada T., Totoki Y., Ishii T., Nakai K., Functional prediction of B. subtilis genes from their regulatory sequences, Intelligent Systems Mol. Biol. 5 (1997) 354–357.
[165] Yuh C.H., Bolouri H., Davidson E.H., Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene, Science 279 (1998) 1896–1902.
[166] Zhu Q., Zhao S., Somerville R.L., Expression, purification, and functional analysis of the TyrR protein of Haemophilus influenzae, Protein Expr. Purif. 10 (1997) 237–246.
[167] Zuber U., Schumann W., CIRCE, a novel heat shock element involved in regulation of heat shock operon dnaK of Bacillus subtilis, J. Bacteriol. 176 (1994) 1359–1363.
[168] Zucker-Aprison E., Blumenthal T., Potential regulatory elements of nematode vitellogenin genes revealed by interspecies sequence comparison, J. Mol. Evol. 28 (1989) 487–496.