Syntactic recognition of regulatory regions in Escherichia coli

CABIOS
Vol. 12 no. 5 1996
Pages 415-422
David A. Rosenblueth1, Denis Thieffry2, Araceli M. Huerta2,
Heladia Salgado2 and Julio Collado-Vides2,3
Abstract
Introduction
The implementation of computational methods to predict
bacterial σ70 promoters started more than 10 years ago when
the first collection of sequences was available (Hawley and
McClure, 1983). σ70 promoters are formed by two domains,
1Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas,
Universidad Nacional Autónoma de México, Ciudad Universitaria, México
D.F., 04510 and 2Centro de Investigación sobre Fijación de Nitrógeno,
Universidad Nacional Autónoma de México, Cuernavaca A.P. 565-A,
Morelos 62100, México
3To whom reprint requests should be addressed
© Oxford University Press
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on July 20, 2015
Motivation: One of the most common methodologies to
identify cis-regulatory sites in regulatory regions in the DNA
is that of weight matrices, as testified by several articles in
this issue. An alternative to strengthen the computational
predictions in regulatory regions is to develop methods that
incorporate more biological properties present in such DNA
regions. The grammatical implementation presented in this
paper provides a concrete example in this direction.
Results: On the basis of the analysis of an exhaustive
collection of regulatory regions in Escherichia coli, a
grammatical model for the regulatory regions of σ70
promoters has been developed. The terminal symbols of the
grammar represent individual sites for the binding of
activator and repressor proteins, and include the precise
position of sites in relation to transcription initiation.
Combining these symbols, the grammar generates a large
number of different sentences, each of which can be searched
for matching against a collection of regulatory regions by
means of weight matrices specific for each set of sites for
individual proteins. On the basis of this grammatical model, a
Prolog syntactic recognizer is presented here. Specific sub-grammars for ArgR, LexA and TyrR were implemented. When
parsing a collection of 128 σ70 promoter regions, the
syntactic recognizer produces a much lower number of
false-positive sites than the standard search using weight
matrices.
Availability: A WWW interface is under development and will
be freely accessible at the URL: http://www.cifn.unam.mx/Computational_Biology/index.html.
Contact: E-mail: [email protected]
the -35 box and the -10 or Pribnow box, separated by a non-conserved sequence 15 to 21 bp long. Weight
matrices with specific penalties associated with the distance
between the two boxes have been refined through the years (Staden, 1984; Mulligan
and McClure, 1986; O'Neill, 1989; Hertz et al., 1990). A
similar methodology has also been used in the recognition of
protein-binding domains in the DNA (Schneider et al., 1986;
Goodrich et al., 1990). Several articles in this issue illustrate
the use of weight matrices in the analysis of eukaryotic
regulatory signals.
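The weight-matrix approach summarized above can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy site lists (built around the familiar TTGACA and TATAAT consensus hexamers), the pseudocount, the spacer penalties and all function names are hypothetical and are not taken from the cited studies.

```python
import math

BASES = "ACGT"

def log_odds_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from equal-length aligned sites."""
    matrix = []
    for column in zip(*sites):
        counts = {b: pseudocount for b in BASES}
        for base in column:
            counts[base] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return matrix

def box_score(matrix, segment):
    """Sum the per-position weights of a segment as long as the matrix."""
    return sum(column[base] for column, base in zip(matrix, segment))

def best_promoter(seq, m35, m10, spacer_penalty):
    """Return (score, start of -35 box, spacer length) of the best match."""
    best = None
    for start in range(len(seq) - len(m35) + 1):
        for spacer, penalty in spacer_penalty.items():
            start10 = start + len(m35) + spacer
            if start10 + len(m10) > len(seq):
                continue
            total = (box_score(m35, seq[start:start + len(m35)])
                     + box_score(m10, seq[start10:start10 + len(m10)])
                     - penalty)
            if best is None or total > best[0]:
                best = (total, start, spacer)
    return best

# Toy data: made-up aligned hexamers and made-up spacer penalties (15 to 21 bp).
m35 = log_odds_matrix(["TTGACA", "TTGACA", "TTGACT"])
m10 = log_odds_matrix(["TATAAT", "TATAAT", "TACAAT"])
penalties = {15: 2.0, 16: 1.0, 17: 0.0, 18: 1.0, 19: 2.0, 20: 3.0, 21: 4.0}
sequence = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
score, start, spacer = best_promoter(sequence, m35, m10, penalties)
```

In this toy case the best match places the -35 box at index 2 with a 17 bp spacer, since both boxes match the majority base at every column and that spacer carries no penalty.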
Based on an exhaustive collection of experimental
information on the anatomy of regulatory domains of both
σ70 and σ54 promoters in Escherichia coli (Collado-Vides
et al., 1991), a grammatical model for the regulatory regions
of σ70 promoters has been developed (Collado-Vides,
1992, 1993a,b). In this paper, we present a Prolog syntactic
recognizer based on this model. The terminal symbols of the
grammar represent individual sites for the binding of proteins,
such as the promoter (Pr), activator sites (I) and operator
negative sites (Op). More precisely, terminal symbols include
information on: (i) the category of the site (Pr, I or Op); (ii)
the position of the central nucleotide of the site in relation to
the initiation of transcription; and (iii) the protein that binds to
the site. At this level of description, the grammatical model
generates the collection of 128 σ70 promoters as well as
>5000 different combinations or 'regulatory sentences' which
are consistent with the biological principles encoded in the
grammar.
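As a minimal sketch of the three pieces of information carried by each terminal symbol, one could represent a regulatory sentence as a tuple of small records. The class name, the field names and the choice to leave the promoter without a position are assumptions made for illustration; the example sentence follows one of the TyrR examples listed in Table II.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Terminal:
    category: str                      # "Pr", "I" or "Op"
    position: Optional[float] = None   # centre of the site relative to +1
    protein: Optional[str] = None      # protein that binds the site

Sentence = Tuple[Terminal, ...]

def proteins_in(sentence):
    """Return the set of regulatory proteins named in a sentence."""
    return {t.protein for t in sentence if t.protein is not None}

# One activator site upstream of the promoter and one operator site.
example = (
    Terminal("I", -77.0, "TyrR"),
    Terminal("Pr"),
    Terminal("Op", -51.0, "TyrR"),
)
```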
These principles are, in brief, the following. (i) Any
promoter must have a proximal site, close enough to the
promoter to enable direct contact between the bound
polymerase and the bound regulatory protein. Other sites
are defined as remote. (ii) Regulatory sites are grouped into
clusters or phrases containing one obligatory (proximal)
site plus additional optional sites. (iii) The positions and
number of duplicated sites that occur in a given promoter for a
given protein can be considered to be particular or
idiosyncratic properties of each protein. For instance, some
proteins bind preferentially to single sites, whereas other
proteins tend to bind to multiple closely located sites (see
Collado-Vides, 1996).
The positions contained in the dictionary for a given
protein were obtained from those contained in the set of 131
promoters of the collection. When detailed experimental
studies defining functional positions of particular proteins
were available, these were also taken into consideration, such
as in the case of CRP (Gaston et al., 1990).
To match the set of regulatory 'sentences' generated by the
grammar with DNA sequences of regulatory domains, a
'sensor' is required that is capable of evaluating whether a
short sequence of DNA centered at particular positions
defined by the grammar is or is not a binding site for a given
protein. As described below, the syntactic parser presented
here uses weight matrices as 'sensors'. The idea of using the
grammar as a higher-order structure on top of particular
'sensors' was first suggested to one of us by David Searls,
who has applied this idea in a eukaryotic gene parser (Dong
and Searls, 1994).
(i) A set of 128 sequences covering from around -200 to
+40, with respect to the annotated +1 of each
promoter. Each sequence corresponds to a line in
Figure 1 of Gralla and Collado-Vides (1996), where
the positions of known functional sites for the binding
of regulatory proteins are indicated by boxes. We call
these sequences 'oriented promoter regions' or
'oriented promoter sequences'. (The three promoters
excluded, fhuA, metH and pifC, do not have an
identified initiation of transcription.)
(ii) A set of 109 non-oriented promoter sequences,
irrespective of the number of closely located
promoters. For instance, the promoter region for
argCBH-E is present only once, even if it contains the
argE and the argCBH promoters.
(iii) A set of functional sites. Sequences for binding sites
that regulate one or more parallel promoters are
included only once, whereas sequences for sites that
regulate two divergent promoters are included twice,
one sequence being the inverted complement of the
other. One example of the first case is the gal system,
where one CRP site is located at -40.5 in relation
to the galp1 promoter and at -36 in relation to the
galp2 promoter. The dictionary in the grammar
contains -40.5 and -36 as acceptable positions for
CRP. An example of the second case is the pair
of ArgR sites between the argE and argCBH
promoters.
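The inverted complement used for sites shared by two divergent promoters can be computed as in the short Python sketch below. The function name is hypothetical; the original scripts were written in Perl and are not reproduced here.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def inverted_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

forward_site = "ATGC"
reverse_site = inverted_complement(forward_site)
```

Applying the function twice returns the original sequence, which gives a quick consistency check when both orientations of a site are stored.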
[Figure 1: flow chart; legible box labels include 'Grammar', 'Syntactic parser' and 'Set of sentences'.]
Fig. 1. Flow chart of sequence manipulation and analysis. Square boxes
represent processes, and the other boxes describe input and output data. The
data in Table II contain the number of matches per threshold intervals—two
shaded boxes in this diagram. The third shaded box represents the list of
recognized sentences for each promoter.
The matching of weight matrices and DNA sequences
discussed below was always done with the set of 128 oriented
promoter sequences. The input for generating the weight
matrices of the different regulatory proteins was a collection
of files, each with the set of all 'extended' functional sites for
a given protein. These extended sites were extracted from the
set of functional sites mentioned above, using the central
positions obtained from the literature. Their size was that
reported in the literature plus 6 bp on each side (see Table I in
Collado-Vides, 1993b).
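The extraction of an 'extended' site (the reported size plus 6 bp on each side, around a central position) can be sketched as follows. The coordinate mapping is an assumption: the sketch supposes that the first base of the sequence lies at position -200 and that a position maps to a 0-based index as index = position + 200. How the original scripts treated the absence of a position 0 is not stated, and the function name is hypothetical.

```python
import math

def extended_site(seq, center, size, flank=6, upstream=200):
    """Return the site of `size` bp centred at `center`, plus `flank` bp per side.

    Assumes seq[0] lies at position -upstream and that positions map to
    0-based indices as index = position + upstream (a simplification).
    Half-integer centres denote a centre lying between two bases.
    """
    center_idx = center + upstream
    first = math.floor(center_idx - (size - 1) / 2 + 0.5)
    start, end = first - flank, first + size + flank
    if start < 0 or end > len(seq):
        raise ValueError("extended site falls outside the sequence")
    return seq[start:end]

# Toy sequence: a 4 bp site (CCCC) centred between indices 11 and 12.
toy = "A" * 10 + "CCCC" + "A" * 10
site = extended_site(toy, 1.5, 4, flank=2, upstream=10)
```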
A multiple alignment method that selects the alignment
matrix optimizing information content, Wconsensus, was
used to generate a set of best matrices. Wconsensus
determines ungapped multiple alignments of unknown prior
width (Hertz et al., 1990; Hertz and Stormo, 1995). The
alignment matrix selected for each protein was the one with
the lowest expected frequency that includes all the sites. Once
the matrix and the aligned sequences for each protein had
been obtained, we re-calculated the new central positions for
each sequence in relation to the +1 of the respective
promoters, and used them to modify the dictionary accordingly. The schema in Figure 1 describes the complete process
of manipulation and sequence analysis.
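To make the notion of information content concrete, the following Python sketch computes the plain (unadjusted) information content of a set of aligned sites against a uniform background. It is not Wconsensus: the small-sample adjustment of Hertz et al. (1990) and the expected-frequency criterion used to select the matrix are not reproduced, and the function name and uniform background are assumptions.

```python
import math
from collections import Counter

def information_content(sites, background=None):
    """Sum over columns of sum_b f_b * log2(f_b / p_b), in bits."""
    background = background or {b: 0.25 for b in "ACGT"}
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for base, count in Counter(column).items():
            freq = count / n
            total += freq * math.log2(freq / background[base])
    return total

conserved = information_content(["ACGT", "ACGT"])   # 2 bits per column
random_like = information_content(["A", "C", "G", "T"])
```

A fully conserved column contributes 2 bits under a uniform background, whereas a column with all four bases equally represented contributes none.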
Methods
DNA sequences for the regulatory domains of σ70 promoters
were obtained from GenBank. The precise location of the
initiation of transcription was assigned after comparison of
(i) the GenBank file, (ii) the position indicated in the literature
and (iii) the position indicated in a recent review paper (Lisser
and Margalit, 1993).
Scripts were written in Perl (Wall and Schwartz, 1991) to
generate the following set of sequences.
Algorithm
$$A \rightarrow B_1 B_2 \ldots B_n, \qquad n > 0$$
such that $A \in N$, each $B_i \in (\Sigma \cup N)$, and $S \in N$ is the start
symbol.
A context-free grammar, $G = (\Sigma, N, P, S)$, determines a
formal language $L(G)$ of strings over $\Sigma$ as follows. A string
$\alpha \in L(G)$ if, and only if, there exists a parse tree such that $S$
is the root, $\alpha$ occurs at the frontier, and each node $A$ in the
tree has as children $B_1, B_2, \ldots, B_n$ if $P$ has the production
$A \rightarrow B_1 B_2 \ldots B_n$.
Fig. 2. Parse tree of a regulatory region formed by a negative phrase of TyrR
sites as they occur in the aroF promoter. The obligatory site is the one with
the name of the protein, and optional sites are co-indexed (i).
We chose the programming language Prolog (Sterling and
Shapiro, 1994) to describe the σ70 grammatical model. The
reason is that Prolog was designed as a language for
programming natural-language applications, and is therefore
especially suited for describing grammars. On the one hand,
the grammar is not context free (Collado-Vides, 1991) so that
we needed a formalism allowing us to incorporate context-sensitive information. We thus had to discard context-free
parsers.
On the other hand, a more conventional programming
language, such as Pascal or C, would have enabled us to
incorporate such information, but at the cost of blurring the
connection between the grammar and the computer program.
Using the Prolog language, we obtained the best of both
worlds. We were able to handle context-sensitive information
and yet maintain a clear connection between our grammar
and the Prolog program. As explained below, we did have to
circumvent one difficulty, however.
A context-free grammar (Hopcroft and Ullman, 1979) is a
4-tuple $G = (\Sigma, N, P, S)$, where $\Sigma$ is a finite alphabet of
terminals, $N$ is a finite alphabet of non-terminals disjoint from
$\Sigma$, and $P$ is a finite set of productions, each of which is of the
form:
Our grammar, however, is strictly context sensitive,
because there is additional information flowing in between
the leaves of the 'valid' trees, thus limiting the possibilities of
trees that can be built with the grammar. This additional
context-sensitive information is contained in a table which we
call the dictionary. Each entry of the dictionary has a special
leaf representing an obligatory site, together with context-sensitive information which may affect leaves representing
optional sites either to the left or to the right of the obligatory
site. For example, Figure 2 shows a parse tree where the leaf
containing [Op,TyrR,i,-50] represents an obligatory site with
information about the leaves [Op,i,-105] and [Op,i,-30],
indicated by the arrows.
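The role of a dictionary entry can be illustrated with a small Python sketch in which an obligatory site carries the list of optional sites it licenses on either side. The data layout and function name are hypothetical, and the sketch assumes that every subset of the optional sites is allowed, which the actual grammar may not permit. The positions are those of the Figure 2 example.

```python
from itertools import combinations

# Hypothetical entry modelled on the Figure 2 example.
entry = {
    "obligatory": ("Op", "TyrR", -50),
    "optional": [("Op", "i", -105), ("Op", "i", -30)],
}

def phrases(entry):
    """List every phrase: the obligatory site plus any subset of optional sites."""
    result = []
    optional = entry["optional"]
    for k in range(len(optional) + 1):
        for chosen in combinations(optional, k):
            # Order the sites by position, from upstream to downstream.
            sites = sorted((entry["obligatory"], *chosen), key=lambda s: s[2])
            result.append(tuple(sites))
    return result

all_phrases = phrases(entry)
```

Every generated phrase contains the obligatory site, so the optional leaves never appear without the dictionary entry that licenses them.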
The obstacle we had to overcome when implementing our
Prolog program arises from the fact that an obligatory site
may affect optional sites to its left. The reason is that Prolog
normally builds parse trees using a left-to-right discipline, so
that Prolog would have to start building the left fragment of
the tree before accessing the dictionary. (In principle, Prolog
is able to do so, but the search space would then be so big
as to render such an implementation intolerably inefficient.)
Consequently, we had to treat some productions of our
grammar in a special way, overriding the default left-to-right
regime.
The production Pr3 → D-Op Pr2, for instance, is not used to
build a parse tree from left to right, since Prolog would have
to add the leftmost leaf [Op,i,-105] without having accessed
the dictionary. Thus, we explicitly indicated to the Prolog
system that the rightmost subtree rooted in Pr2 must be built
before the leftmost subtree rooted in D-Op. Hence, the first
leaves to be built are the ones corresponding to dictionary
entries (such as [Op,TyrR,i,-50]).
We implemented the transfer of information between the
leaves as follows. Once the dictionary is accessed, the
context-sensitive information is passed through the nodes of
the tree, first bottom up, to the closest common ancestor of the
obligatory and the optional sites, and then top down. In the
case of the information represented by the arrow pointing to
the leftmost leaf in Figure 2, for example, such information is
transferred to the root Pr3 before it is transferred to the
leftmost leaf. We readily achieved this with 'definite-clause
grammars' (Sterling and Shapiro, 1994).
Note that our formal language (the set of strings
determined by our grammar) is finite, and therefore regular
(Hopcroft and Ullman, 1979). Thus, it is possible to describe
such a language not only with a context-free grammar, but
also with an even more restricted regular grammar. However,
such grammars would not reflect the biological reasons that
support the selection of each production as it has been done
(Collado-Vides, 1991, 1992, 1993a,b).
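Because the language is finite, it can in principle be enumerated exhaustively. The Python sketch below does this for a toy grammar. Only the production Pr3 → D-Op Pr2 is taken from the text; the other productions and the terminal names are invented for illustration, and the context-sensitive dictionary is left out entirely.

```python
from itertools import product

def language(grammar, symbol):
    """Enumerate all terminal strings derivable from `symbol`.

    `grammar` maps each non-terminal to a list of right-hand sides (tuples);
    any symbol that is not a key is treated as a terminal. The recursion
    terminates only because the toy grammar has no recursive productions.
    """
    if symbol not in grammar:
        return {(symbol,)}
    strings = set()
    for rhs in grammar[symbol]:
        for parts in product(*(language(grammar, s) for s in rhs)):
            strings.add(tuple(t for part in parts for t in part))
    return strings

toy = {
    "Pr3": [("D-Op", "Pr2"), ("Pr2",)],
    "Pr2": [("pr", "op_proximal")],
    "D-Op": [("op_remote",)],
}
sentences = language(toy, "Pr3")
```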
As often happens with Prolog programs, our Prolog
description of the grammar is invertible, so that we can use
it either as a parser (to determine whether or not a given string
Table I. Properties of functional regulatory sites. The experimental size was
defined by inspection of the collection of sites obtained from the literature.
The aligned size is automatically assigned by the Wconsensus program.
Information content (in bits) is the adjusted information content as defined in
Hertz et al. (1990). The average distance correction is the sum of absolute
differences of central position defined from the literature minus the central
position of the aligned sequence for the same site, all divided by the number
of sequences for each protein
Protein   Experimental size   Alignment size   Information content   Average position correction
ArgR      16                  24               14.71                 0.83
LexA      20                  21               18.27                 2.7
TyrR      22                  19               13.38                 1.47
Table II. Building of the dictionaries. For each of the three proteins, a list of promoters with the positions of the corresponding regulatory sites is given in the first
column. All ArgR and LexA sites are repressor sites. In the case of TyrR, positions are followed by a (-) for repressor sites and by a (+) for activator sites. The
second column contains the dictionary entries, where optional sites for the same protein are co-indexed (i) with the obligatory site which is the only one with the
name of the protein. Each entry contains the category (I or Op) of the phrase, and slots separated by square brackets. Starting from the right end, the first slot is
that for obligatory sites, the second one for proximal duplicated sites, and in the case of Op entries, the third one is a slot for remote upstream duplicated sites. The
third column gives the number of regulatory arrays or 'sentences' generated by each sub-grammar and a few examples
Columns: Functional sites; Dictionaries; Regulatable 'sentences'

ArgR
Functional sites:
  argCBH: -3.5, 18.5
  argE: -13.5, 4.5
  argF: -23.5, -2.5
  argI: -23.5, -2.5
  argRp1: -28.5, 8.5
  carABp2: -9.5, 12.5
Regulatable 'sentences' (total number of sentences: 10). Examples:
  ([Pr,x],[Op,c,i,ArgR,-3])
  ([Pr,x],[Op,c,i,ArgR,-3],[Op,d,i,18.5])

TyrR
Functional sites:
  aroL: -48.0 (-), 6.0 (-), 29.0 (-)
  mtr: -77.0 (+)
  aroF: -105.0 (-), -53.0 (-), -30.0 (-)
  aroG: -39.0 (-)
  aroP: 41.0 (-), 64.0 (-)
  tyrB: 15.0 (-), 41.0 (-)
  tyrP: -63.0 (+), -40.0 (-)
  tyrR: -50.0 (-)
Regulatable 'sentences' (total number of sentences: 41). Examples:
  ([Pr,x],[Op,c,i,TyrR,-51])
  ([Pr,x],[Op,c,i,TyrR,-51],[Op,d,i,-30])
  ([I,c,i,TyrR,-77],[Pr,x])
  ([I,c,j,TyrR,-77],[Pr,x],[Op,c,i,TyrR,-51])
  ([I,c,j,TyrR,-77],[Pr,x],[Op,c,i,TyrR,-51],[Op,d,i,-30])

LexA
Functional sites:
  cloDF13: 8.0
  colE1p1: 8.0, 23.0
  lexA: -8.0, 14.0
  recA: -19.0
  ssb: -45.0
  sulA: -5.0
  uvrA: -30.0
  uvrBp2: -19.0
  uvrD: 12.0
Regulatable 'sentences' (total number of sentences: 9). Examples:
  ([Pr,x],[Op,c,i,LexA,+8])
  ([Pr,x],[Op,c,i,LexA,-19])
  ([Pr,x],[Op,c,i,LexA,+8],[Op,d,i,23])
is in the language of interest) or as a generator (to produce the
language). At first sight, it might seem that we wish to use it
as a parser, since the main goal is to recognize regulatory
signals in strings of DNA. However, such a use involves an
exhaustive search for a signal across the complete DNA
sequences. By contrast, as a generator, the Prolog program
behaves as a filter that constrains the search to a limited set of
positions and combinations of sites.
The output of our Prolog program (as a generator) was
coupled with a subroutine that invokes the weight matrix of
the associated protein, and evaluates the score of the strings of
DNA centered at the positions indicated in the dictionary for
each terminal symbol of a sentence. This was done using the
patser (for pattern recognition) program (Hertz et al., 1990;
Hertz and Stormo, 1995). A window parameter permits the
screening of a few more sequences around that central
position on both sides, and the sequence with the best score is
saved for the output in each word (or site) search. The search
for a sentence is stopped once a site is not found with a score
above threshold, with obligatory sites being searched first. If
an obligatory site is found, its score is saved and its search is
not repeated in subsequent sentences that include it together
with additional optional sites. Therefore, a sentence is
accepted if, and only if, for every single site, a match has
been found with a score higher than the threshold. The
threshold for each protein is set equal to the score of the
functional site that gives the lowest score evaluated by the
matrix. The final output can be ordered by listing all oriented
promoter sequences that match a given sentence, or by listing
all sentences that match a given oriented promoter sequence.
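The acceptance procedure just described can be sketched in Python as follows. The scoring callable stands in for patser and is purely hypothetical, as are the function names and the representation of a site as a (protein, centre, obligatory) tuple. For simplicity the sketch caches the best score of every site, not only of obligatory ones. A sentence is accepted when every site reaches the threshold, which for each protein is the lowest score among its functional sites.

```python
def threshold_from_functional(scores):
    """Threshold for a protein: the lowest score among its functional sites."""
    return min(scores)

def recognize(sentence, score_fn, thresholds, window=2, cache=None):
    """Return [(site, best position, best score), ...] or None if any site fails."""
    cache = {} if cache is None else cache
    hits = []
    # Obligatory sites are searched first (False sorts before True).
    for site in sorted(sentence, key=lambda s: not s[2]):
        protein, center, _obligatory = site
        if (protein, center) not in cache:
            # Best-scoring position within +/- window bp of the dictionary centre.
            cache[(protein, center)] = max(
                (score_fn(protein, center + d), center + d)
                for d in range(-window, window + 1)
            )
        best_score, best_pos = cache[(protein, center)]
        if best_score < thresholds[protein]:
            return None          # stop at the first site below threshold
        hits.append((site, best_pos, best_score))
    return hits

# Toy scores standing in for weight-matrix output on one promoter region.
toy_scores = {("TyrR", -51): 12.0, ("TyrR", -29): 10.0}

def toy_score(protein, position):
    return toy_scores.get((protein, position), 0.0)

sentence = [("TyrR", -30, False), ("TyrR", -51, True)]
result = recognize(sentence, toy_score, {"TyrR": 9.35})
```

Passing the same `cache` dictionary across sentences for one promoter region avoids rescoring a site that recurs in several sentences.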
Table III. Recognition with and without syntactic tool. Numbers of functional sites, of predicted sites with the syntactic recognizer, and of predicted sites with
patser, distributed in intervals depending on their scores (in bits)
ArgR (threshold: 10.82)
Score interval (bits)          10.82-15   15-20   20-25   25-30   Total   Ratio(a)
Functional sites               1          0       5       6       12      1
Sites with grammar + patser    68         17      6       6       97      8.08
Sites with patser              635        112     12      6       765     63.75

TyrR (threshold: 9.35)
Score interval (bits)          9.35-10    10-15   15-20   20-25   Total   Ratio(a)
Functional sites               1          0       7       7       15      1
Sites with grammar + patser    9          21      12      8       50      3.33
Sites with patser              53         155     21      8       237     15.8

LexA (threshold: 17.01)
Score interval (bits)          17.01-20   20-25   25-30   30-35   Total   Ratio(a)
Functional sites               1          1       8       1       11      1
Sites with grammar + patser    1          1       8       1       11      1
Sites with patser              5          2       8       1       16      1.46

(a) Predicted/functional sites.
Results
To illustrate the results that can be obtained by performing a
syntactic analysis on DNA regulatory sequences, we selected
three proteins: ArgR, LexA and TyrR. These proteins were
chosen because there is a good number of functional sites for
all of them, permitting a reasonable weight matrix. In
addition, they differ in the distribution and organization of their sites. For
instance, TyrR involves transformational rules as depicted in
Figure 2. LexA, on the other hand, represents a well-studied
example (Schneider et al., 1990), as well as a protein with a
high threshold compared to the other two.
Table I shows a comparison of size and positions between
the values obtained from the literature and those obtained as a
consequence of the multiple alignment of functional sites for
the three proteins. The size of TyrR and LexA sequences
remains practically unchanged. The larger size selected by
the multiple alignment program for ArgR may be due to the
data. Certainly, the 12 ArgR functional sites occur in pairs of
closely located sites, thus the extended sites we extracted will
always have a piece of a neighbor site. The correction of
central distances is also quite small, indicating that the
footprint evidence and the computational definition of sites
match quite well.
A subset of rules and of dictionary entries able to generate
all the ordered regulatory promoters of the collection
containing any of the three proteins was used. Thus, three
'sub-grammars' were built, one for each protein. The
dictionaries were built using the center positions after
alignment available for all functional sites of a given protein,
and searching to minimize the number of entries in each case
(see Table II). These ArgR, TyrR and LexA sub-grammars
generate 10, 41 and 9 sentences, respectively.
To have an idea of the benefits provided by the grammar,
we computed the number of matches higher than or equal to
the thresholds for each protein using both the syntactic
parsers specific for each protein and the patser program (not
to be confused with parser!). Given an alignment matrix and a
threshold, patser searches for all possible matches within a
sequence. The syntactic search was done with a fixed window
of two, which means that the best match for the matrix was
searched within ± 2 bp around each position in the
Table IV. Analysis of TyrR predicted sites. 'False positives' for TyrR sites
occurring either alone or in pairs. Sites at tetA have an organization similar to
that found in tyrP, with an activator and a repressor site, and sites at fadL are
located similarly to two of the three sites in aroL. Duplicated sites were found
at argCBH with one site higher than 15, and at cirp2, but not within a given
sentence; at cirp2 they overlap, something not present within the small set of
functional TyrR sites
Promoters with multiple sites for TyrR found within a sentence
Promoter   Position 1   Position 2   Score 1   Score 2
fadL       -50          28           9.45      11.76
tetA       -64          -38          13.55     11.47

High-score predicted sites for TyrR
Promoter   Position   Score
papB       -40        21.30
meUpl      -39        17.22
cirp2      -53        16.26
argCBH     -39        15.54
argCBH     -40        15.30
Notes: Overlaps a Metl site; Overlaps a Mel) site
LexA (threshold: 17.01)
The parser performs 5-8 times better than the exhaustive
search method in the case of the two proteins with a low
threshold. Certainly, it is reasonable to expect that the
contribution of the syntactic recognition should be more
visible in cases of weight matrices with a low threshold. This
is why the contribution of the parser is less marked in the case
of LexA, which has a higher threshold.
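The improvement factors quoted above follow directly from the totals in Table III, as the short Python calculation below illustrates. The dictionary layout and function name are illustrative only; the counts are the table totals.

```python
# Totals from Table III: functional sites, sites found with grammar + patser,
# and sites found with patser alone.
table_iii = {
    "ArgR": {"functional": 12, "grammar": 97, "patser": 765},
    "TyrR": {"functional": 15, "grammar": 50, "patser": 237},
    "LexA": {"functional": 11, "grammar": 11, "patser": 16},
}

def summarize(row):
    """Predicted/functional ratios and the reduction achieved by the grammar."""
    return {
        "ratio_grammar": row["grammar"] / row["functional"],
        "ratio_patser": row["patser"] / row["functional"],
        "reduction": row["patser"] / row["grammar"],
    }

summary = {protein: summarize(row) for protein, row in table_iii.items()}
```

The reduction is about 7.9-fold for ArgR and 4.7-fold for TyrR, against roughly 1.5-fold for LexA, in line with the smaller contribution of the parser at a high threshold.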
Note, however, that the performance is not only related to
the threshold. ArgR has a slightly higher threshold than TyrR,
but has the highest number of false positives. Given the four-fold larger number of sentences generated by the TyrR sub-grammar compared to that of ArgR, this seems an unexpected
result (see Table II).
It is interesting that the sites with the lowest scores for
TyrR and ArgR are followed by sites with much larger scores
(around 15 and 20, respectively). One could decide to drop
these extremely low-valued sites and get much higher
thresholds. However, in both cases, there is mutational
evidence for their function (Cunin et al., 1983; Yang and
Pittard, 1987).
The TyrR site at the tyrB promoter that sets the threshold
for TyrR occurs with a duplicated site of score 16.3, and the
second weakest site has a score slightly higher than 15. Thus,
we limited our analysis to predictions that involve individual
sites with scores higher than 15, and to promoters with more
than one site generated by the same sentence in the grammar.
The latter are sites that occur at distances similar to those
found within native regulatory regions. These two sets are
Table V. Analysis of ArgR predicted sites. 'False positives' for ArgR sites occurring either alone or in pairs. Not all pairs overlap with the site for another
protein. When one of these ArgR putative sites overlaps a known site for another protein, that protein is indicated in the last column
Promoters with multiple sites for ArgR found within a sentence
Promoter   Position 1   Position 2   Score 1   Score 2
argE       -7.5         14.5         21.94     15.24
argE       -3.5         18.5         11.94     18.43
sodA       -29.5        -9.5         13.82     17.94
aceBAK     -26.5        -9.5         16.76     12.98
uvrBp2     -11.5        5.5          15.74     11.33
iuaC       -14.5        3.5          15.33     10.86
iuaC       -24.5        -3.5         11.90     12.19
cloDF13    -1.5         16.5         11.49     10.92
cysB       -14.5        5.5          11.13     11.15
Overlapping site (values in the order listed in the source): ArgR, ArgR, IclR, LexA, LexA, CysB, Fur, Fur, Fur

High-score predicted sites for ArgR
Promoter   Position   Score   Functional overlapping site
uvrD       -15        18.59   LexA
recA       -21.5      17.71   LexA
sodA       -24.5      17.59   ArcA
uvrBp2     -21.5      16.47   LexA
ompC       -26.5      16.39   None
argI       -13.5      16.29   ArgR
metK       -29.5      16.23   None
pflp6      -24.5      16.15   Close to FNR
iuaC       -14.5      15.33   Fur
ilvY       -30.5      15.19   IlvY
glpTQ      -22.5      15.11   GlpR
dictionary. The matches obtained against the 128 oriented
promoter sequences are shown in Table III. As expected, the
number of false positives is lower when using the syntactic
parser than when using the weight matrix alone. In fact, this
could not be otherwise, since the syntactic recognizer uses the
same weight matrices with the same thresholds, but searches
within limited positions. Nonetheless, several interesting
observations emerge from this comparison.
The parser permits one to focus on a subset of sites located
at relevant positions and combinations of multiple sites.
Comparing the numbers obtained using grammar and patser
with the numbers obtained with patser alone, it can be seen
that the parser eliminates predominantly false positives with low
scores. Interestingly, the number of matches with the highest
scores is usually the same using patser and the syntactic
recognizer (intervals 25-30 for ArgR, 20-25 for TyrR, and 25-35
for LexA). This implies that sites with the highest scores are
generally located at positions where at least one functional
site is known in the regulation of transcription initiation.
Therefore, the prediction of functionality for those few
uncharacterized sites becomes stronger. In other words, if this
observation is confirmed with a larger set of proteins, it would
suggest that evolution prevents the existence of strong non-functional sites within regulatory regions.
Discussion
The grammatical recognizer represents a layer added on top
of a set of weight matrices for individual regulatory proteins.
This layer imposes additional restrictions on the otherwise
exhaustive search for any site with a score higher than a
given threshold in any position in a given sequence. Adding
this syntactic layer results in a considerable decrease in
the number of false positives. As mentioned before, at the
highest scores, almost no site is found that has not been
characterized as a functional site, which suggests that non-functional strong sites are avoided in the evolution of
regulatory regions.
The syntactic search strategy is restricted to the specific
positions indicated in the dictionary for each protein, and to
the specific combinations of sites within a given sentence. In
the case of forbidden zones for specific regulators, these
restrictions are strongly justified. For repressors, whose
positions in principle are apparently not so restricted [but
see Collado-Vides (1993a,b) for a systematic discussion of
these issues], this is a conservative approach with the
consequence that some true unknown sites may fail to be
recognized. We shall get a better idea of these questions when
we apply this syntactic method to analyze unannotated DNA
putative regulatory sequences coming out of genome
projects.
A standard methodology in computational biology is that
of separating testing and training sets to validate the proposed
model. Such a test has not been performed for the following
reasons. First, the number of sequences available for each
protein is usually smaller than 12. These are very small sets to
permit interesting training and testing sets. A hold-one-out
strategy could be performed with the sequences to better
support the weight matrices built. However, weight matrices
are used here as a standard methodology for individual
'sensors'. The main point of this paper is to present a novel
recognition method that organizes different 'sensors' based
on the organizing principles of regulatory regions in E.coli.
This organization is based on precise positions and on the
occurrence of combinations of functional sites for a collection
of proteins. Given the reduced size of the data set, we are not
convinced that a hold-one-out strategy could better validate
the grammatical approach.
We do not know in general how many new positions and
combinations of sites for a given protein are still to be
identified. Therefore, a given sub-grammar, which in fact has
been constructed with all currently available experimental
data, has to be considered as a testing hypothesis. The number
of adequately positioned strong sites found within the
complete E.coli chromosome that the grammar fails to
recognize will give an indication of how incomplete the
dictionary is. An incomplete grammar will tend to miss strong
sites present in new positions. In this sense, the preliminary
results presented here are quite encouraging.
Certainly, the ArgR, TyrR and LexA sub-grammars were
built with 12, 15 and 11 sites present in a total of 6, 8 and 8
promoters, respectively. Thus, the results presented in Table
III include >100 new promoters for each sub-grammar where
in principle new strong sites might have been missed. The
small number of strong 'false positives' found when using the
grammar indicates that the sub-grammars used here are quite
robust. This tendency has recently been confirmed by the
analysis of nine additional σ70 protein sub-grammars (for
shown in Table IV. What we consider the strongest prediction
is a single site at the papB promoter, in a position similar to
the one at aroG and with a higher score.
In the case of ArgR, we found 18 sites occurring in pairs
within a given promoter recognized by a single sentence. Of
these, 11 sites have a score between 10 and 15, and 7 sites
have scores higher than 15, with one site higher than 20.
These are shown in Table V, ordered by decreasing score of
the highest site in the pair. It is interesting that the two pairs
with the highest scores occur within the argE promoter, the
one with the functional site with the lowest score. However,
none of these three pairs of ArgR boxes prevent the argE
promoter from being the one with the lowest scores of
promoters in the ArgR regulon. This is consistent with the lowest
repression ratio found for argE within the regulon (Cunin
et al., 1983), but suggests that alternative pairs of sites at
different positions are available for ArgR.
Six out of nine pairs have a site with a score higher than 15.
However, out of these 18 sites, 14 are found at positions
overlapping with known sites for other repressors. Eleven
single sites were found with scores higher than 15, more than
double those found in this same interval with TyrR. Nine out
of these 11 putative sites are found overlapping with
functional sites for other proteins. In fact, we found that
there is a good number of false positives with lower scores
that occur in between pairs of functional ArgR sites. Taken
together, these results point to the conclusion that a good
number of false positives for ArgR are found within
functional sites for a different protein. This may be due to
the fact that the ArgR dictionary defines a syntactic search
(with a window of two) covering in an almost continuous
manner from -30 to +6 and from +10 to +20, the regions
where the concentration for functional sites for different
proteins is higher (see Figure 2A in Gralla and Collado-Vides,
1996). Given that this region also includes the promoter, it is
reasonable to deduce that the ArgR consensus compromises
with these other requirements, making its alignment matrix
particularly prone to recognizing sites for other repressors (see
Table V). The TyrR dictionary, on the other hand, defines a
search for more upstream positions, with a gap between -30
and +6. This different distribution correlates with the lower
number of false positives found for TyrR.
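The overlap check behind these counts can be sketched as follows. This is a minimal illustration in Python, not the program used in this work: the `Site` type, the `overlaps` and `cross_recognized` helpers, and all coordinates are hypothetical and chosen only to show the idea of flagging a predicted site that falls on a known site for a different protein.

```python
from typing import List, NamedTuple


class Site(NamedTuple):
    protein: str
    start: int  # position relative to transcription initiation
    end: int    # inclusive


def overlaps(a: Site, b: Site) -> bool:
    """True if the two closed intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def cross_recognized(predicted: List[Site], known: List[Site]) -> List[Site]:
    """Predicted sites that overlap a known site for a different protein."""
    return [
        p for p in predicted
        if any(k.protein != p.protein and overlaps(p, k) for k in known)
    ]


# Hypothetical coordinates, for illustration only (not taken from Table V).
predicted = [Site("ArgR", -28, -11), Site("ArgR", 40, 57)]
known = [Site("LexA", -20, -1)]
hits = cross_recognized(predicted, known)
```

Under these assumed coordinates only the first predicted site is flagged, since it shares positions -20 to -11 with the known site.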
Acknowledgements
This work was supported by a grant from DGAPA-UNAM to J.C.-V. and
D.A.R., and by a grant from Conacyt to J.C.-V. We acknowledge Gerald
Hertz for the use of his programs, for fruitful discussions, and for sharing
regulatory sequences that helped us to correct inconsistencies in the location
of the initiation of transcription. Collaboration with G. Hertz was initiated at
the 1995 Aspen Workshop on Patterns in Biological Sequences. We
acknowledge Ernesto Leyva for writing the program to re-compute positions
of sites. Finally, we would also like to thank anonymous referees for their
interesting remarks.
References
Collado-Vides,J. (1991) The search of a grammatical model of gene
regulation is formally justified by showing the inadequacy of context-free grammars. Comput. Applic. Biosci., 7, 321-326.
Collado-Vides,J. (1992) Grammatical model of the regulation of gene
expression. Proc. Natl Acad. Sci. USA, 89, 9405-9409.
Collado-Vides,J. (1993a) A linguistic representation of the range of
transcription initiation of σ70 promoters: I. An ordered array of complex
symbols with distinctive features. Biosystems, 29, 87-104.
Collado-Vides,J. (1993b) A linguistic representation of the range of
transcription initiation of σ70 promoters: II. Distinctive features of
promoters and their regulatory binding sites. Biosystems, 29, 105-128.
Collado-Vides,J. (1996) Integrative representations of the regulation of gene
expression. In Collado-Vides,J., Magasanik,B. and Smith,T.F. (eds),
Integrative Approaches to Molecular Biology. MIT Press, Cambridge,
MA, pp. 179-203.
Collado-Vides,J., Magasanik,B. and Gralla,J.D. (1991) Control site location
and transcriptional regulation in Escherichia coli. Microbiol. Rev., 55,
371-394.
Cunin,R., Eckhardt,T., Piette,J., Boyen,A., Pierard,A. and Glansdorff,N.
(1983) Molecular basis for modulated regulation of gene expression in
the arginine regulon of Escherichia coli K-12. Nucleic Acids Res., 11,
5007-5019.
Dong,S. and Searls,D.B. (1994) Gene structure prediction by linguistic
methods. Genomics, 23, 540-551.
Gaston,K., Bell,A., Kolb,A., Buc,H. and Busby,S. (1990) Stringent spacing
requirement for transcription activation by CRP. Cell, 62, 733-743.
Goodrich,J.A., Schwartz,M.L. and McClure,W.R. (1990) Searching for and
predicting the activity of sites for DNA binding proteins: compilation and
analysis of the binding sites for Escherichia coli integration host factor
(IHF). Nucleic Acids Res., 18, 4993-5000.
Gralla,J.D. and Collado-Vides,J. (1996) Organization and function of
transcription regulatory elements. In Neidhardt,F.C., Curtiss III,R.,
Ingraham,J., Lin,E.C.C., Low,K.B., Magasanik,B., Reznikoff,W.,
Schaechter,M., Umbarger,H.E. and Riley,M. (eds), Cellular and Molecular Biology: Escherichia coli and Salmonella, 2nd edn. American Society
for Microbiology, Washington, DC, pp. 1232-1245.
Hawley,D.K. and McClure,W.R. (1983) Compilation and analysis of
Escherichia coli promoter DNA sequences. Nucleic Acids Res., 11,
2237-2255.
Hertz,G.H. and Stormo,G.D. (1995) Identification of consensus patterns in
unaligned DNA and protein sequences: a large-deviation statistical basis
for penalizing gaps. In Lim,H.A. and Cantor,C.R. (eds), Proceedings of the
3rd International Conference on Bioinformatics and Genome Research.
World Scientific Publ., Singapore, pp. 199-214.
Hertz,G.H., Hartzell III,G.W. and Stormo,G.D. (1990) Identification of
consensus patterns in unaligned DNA sequences known to be functionally
related. Comput. Applic. Biosci., 6, 81-92.
Hopcroft,J.E. and Ullman,J.D. (1979) Introduction to Automata Theory,
Languages, and Computation. Addison-Wesley, Reading, MA.
Lisser,S. and Margalit,H. (1993) Compilation of E.coli mRNA promoter
sequences. Nucleic Acids Res., 21, 1507-1516.
Mulligan,M.E. and McClure,W.R. (1986) Analysis of the occurrence of
promoter-sites in DNA. Nucleic Acids Res., 14, 109-126.
O'Neill,M.C. (1989) Consensus methods for finding and ranking DNA
binding sites. Application to Escherichia coli promoters. J. Mol. Biol., 207,
301-310.
Schneider,T.D., Stormo,G.D., Gold,L. and Ehrenfeucht,A. (1986) Information content of binding sites on nucleotide sequences. J. Mol. Biol., 188,
415-431.
Staden,R. (1984) Computer methods to locate signals in nucleic acid
sequences. Nucleic Acids Res., 12, 505-519.
Sterling,L. and Shapiro,E. (1994) The Art of Prolog, 2nd edn. MIT Press,
Cambridge, MA.
Thieffry,D., Rosenblueth,D.A., Huerta,A.M., Salgado,H. and Collado-Vides,J. (1996) Definite-clause grammars for the analysis of cis-regulatory regions in E.coli. In Proceedings of the Pacific Symposium on
Biocomputing'97. Hawaii, January 1997, in press.
Wall,L. and Schwartz,R.L. (1991) Programming Perl. O'Reilly and
Associates, Inc., Sebastopol, CA.
Yang,J. and Pittard,J. (1987) Molecular analysis of the regulatory region of
the Escherichia coli K-12 tyrB gene. J. Bacteriol., 169, 4710-4715.
AraC, CRP, FNR, GlpR, MalT, MeU, PhoB, PurR and PutA
regulatory proteins; see Thieffry et al., 1997).
There are many questions that can still be addressed
concerning the analysis of our collection of 128 ordered
promoter sequences. For instance, a more exhaustive analysis
of sub-grammars with other regulators should help to clarify
whether the strong tendency to find sites for
other proteins when searching with the ArgR sub-grammar
compared to TyrR (with similar thresholds) is due to the
positions of ArgR sites around the promoter, or to a peculiar
property of ArgR sites. It will also be interesting to evaluate
more extensively the frequency of cross-recognition among
matrices and sites for different proteins. Then, the convenience of building filters to prevent this cross-recognition
could be addressed.
Much more work is required to exploit fully the potential
of our syntactic recognizer. In the present implementation,
the whole language of the grammar is produced before the
matching of sensors and DNA strings is initiated. A more
efficient search strategy can be implemented by interleaving
an incremental generation process with the recognition
process. Furthermore, we will need to update information
concerning known binding sites for regulators. For some of
them we have very few sequences and, if available, weight
matrices developed by other investigators can be added to the
grammar. In fact, as mentioned before, any type of 'sensors'
for the recognition of protein DNA-binding sites can be
incorporated into the syntactic analysis. Moreover, in order to
make predictions in unannotated DNA sequences, we will
need to include a specific sensor for the promoter itself.
Finally, it will be very interesting to apply this methodology
to eukaryotic regulatory regions. We are currently working in
these directions.
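The difference between producing the whole language first and interleaving generation with recognition can be sketched as follows. This is an illustrative Python sketch under strong simplifying assumptions, not the definite-clause grammar implementation: the grammar is flattened into a fixed sequence of "slots", each offering alternative terminal symbols, and `matches` stands in for a sensor reporting whether a symbol is found in a given regulatory region. All names and positions are hypothetical.

```python
from itertools import product
from typing import Callable, Iterator, List, Sequence, Tuple

Symbol = Tuple[str, int]        # (protein, position relative to +1)
Sentence = Tuple[Symbol, ...]


def eager(slots: Sequence[Sequence[Symbol]],
          matches: Callable[[Symbol], bool]) -> List[Sentence]:
    """Generate every sentence first, then test each one against the region."""
    return [s for s in product(*slots) if all(matches(sym) for sym in s)]


def interleaved(slots: Sequence[Sequence[Symbol]],
                matches: Callable[[Symbol], bool],
                prefix: Sentence = ()) -> Iterator[Sentence]:
    """Extend a sentence one symbol at a time and abandon a prefix
    as soon as one of its symbols is not recognized."""
    if len(prefix) == len(slots):
        yield prefix
        return
    for sym in slots[len(prefix)]:
        if matches(sym):
            yield from interleaved(slots, matches, prefix + (sym,))


# Hypothetical two-slot grammar and sensor output, for illustration only.
slots = [[("ArgR", -30), ("ArgR", -10)], [("ArgR", -9), ("ArgR", 11)]]
detected = {("ArgR", -30), ("ArgR", -9)}
eager_result = eager(slots, detected.__contains__)
lazy_result = list(interleaved(slots, detected.__contains__))
```

Both strategies return the same sentences. The interleaved version never expands sentences that begin with an unrecognized symbol, which is where the saving would come from when the language is large.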