Gibbs Recursive Sampler: finding transcription

3580–3585
Nucleic Acids Research, 2003, Vol. 31, No. 13
DOI: 10.1093/nar/gkg608
Gibbs Recursive Sampler: finding transcription
factor binding sites
William Thompson1,*, Eric C. Rouchka2 and Charles E. Lawrence1,3
1
The Wadsworth Center, New York State Department of Health, Albany, NY 12201-0509, USA,
Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY 40292,
USA and 3Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
2
Received February 14, 2003; Revised and Accepted April 9, 2003
ABSTRACT
The Gibbs Motif Sampler is a software package for
locating common elements in collections of biopolymer sequences. In this paper we describe a new
variation of the Gibbs Motif Sampler, the Gibbs
Recursive Sampler, which has been developed
specifically for locating multiple transcription factor
binding sites for multiple transcription factors
simultaneously in unaligned DNA sequences that
may be heterogeneous in DNA composition. Here we
describe the basic operation of the web-based
version of this sampler. The sampler may be accessed at http://bayesweb.wadsworth.org/gibbs/gibbs.
html and at http://www.bioinfo.rpi.edu/applications/
bayesian/gibbs/gibbs.html. An online user guide is
available at http://bayesweb.wadsworth.org/gibbs/
bernoulli.html and at http://www.bioinfo.rpi.edu/
applications/bayesian/gibbs/manual/bernoulli.html.
Solaris, Solaris.x86 and Linux versions of the
sampler are available as stand-alone programs for
academic and not-for-profit users. Commercial
licenses are also available. The Gibbs Recursive
Sampler is distributed in accordance with the ISCB
level 0 guidelines and a requirement for citation of
use in scientific publications.
INTRODUCTION
Transcription regulation is arguably the most important
foundation of cellular function, since it exerts the most
fundamental control over the abundance of virtually all of a
cell’s functional macromolecules. A predominant feature of
transcription regulation is the binding of regulatory proteins,
transcription factors (TFs), to cognate DNA binding sites
known as transcription factor binding sites (TFBS) in the
genome. The computational identification of TFBS through
the analysis of DNA sequence data has emerged in the last
decade as a major new technology for the elucidation of
transcription regulatory networks. The Gibbs Motif Sampler
is a software package used to locate common elements in
collections of biopolymer sequences. It has been applied to
the analysis of protein sequences (1,2). Gibbs sampling has
also been used extensively in the identification of TFBS
(3,4) and an earlier version of this software has been
available at this web location for some time. In this paper we
describe a new variation, the Gibbs Recursive Sampler,
designed to search for multiple TFBS simultaneously. It
includes several features that are designed specifically for
locating TFBS in unaligned DNA sequences. These features
are based on characteristics of TF/DNA complexes or their
components.
THE GIBBS RECURSIVE SAMPLER
Gibbs sampling is a Markov Chain Monte Carlo procedure that
has seen wide application in the statistical community. It was
first applied in bioinformatics as a tool for multiple sequence
alignment in 1993 (1). Gibbs sampling techniques have
subsequently seen numerous enhancements and applications
(2,5). A key feature of sequence-based Gibbs sampling algorithms and related expectation maximization algorithms (6,7)
is the use of motif models in the form of product multinomial
models to capture sequence patterns common to the binding
sites of each TF.
The recursive sampler, described here, was specifically
developed for the identification of TFBS in unaligned DNA
sequences. It includes several features that are unique to this
software: a rigorous Bayesian method for inferring the number
and the locations of the TFBS for multiple TF motifs
simultaneously; a background model of the heterogeneity in
the composition of non-coding nucleotide sequence and the
ability to use prior information of binding motifs. In addition,
it includes features to allow the use of palindromic, direct
repeat and concentrated alphabet models, preferred binding
site locations, and a rigorous test of the statistical significance
of the results, the Wilcoxon signed-rank test.
In the following, we briefly describe how the algorithm
incorporates these features and we provide instructions on
the use of the algorithm and on the interpretation of its
results.
*To whom correspondence should be addressed. Tel: þ1 5184867882; Fax: þ1 518 473 2900; Email: [email protected]
Nucleic Acids Research, Vol. 31, No. 13 # Oxford University Press 2003; all rights reserved
Nucleic Acids Research, 2003, Vol. 31, No. 13
3581
process continues until a maximal number of iterations has
been executed or until a certain number of iterations, the
plateau period, has occurred without an increase in the
estimated MAP. As a default we report the MAP solution.
Alternatively, we provide a Bayesian inference of site
frequencies based on continued sampling after convergence.
These Bayesian inferences take the form of estimated
probabilities for each predicted site. Both types of solution
adjust inferences for the lengths of the input sequences and of
sites and motifs in a Bayesian manner analogous to Webb (8).
HETEROGENEOUS BACKGROUND
Figure 1. The compositional variation of a 500 bp region upstream of the translation start site of the YDR226W/ADK1 gene from Saccharomyces cerevisiae.
The probability of each base at each position is calculated by the Bayesian segmentation algorithm (10). This algorithm returns the probabilities of observing
each of the four bases at each position in the sequence.
RECURSIVE DISCOVERY OF SITES
AND SITE COUNTS
Multiple TFs often bind in a combinatorial fashion to regulate
transcription. These collections of factors may contain multiple
TFBS for any or all of the factors. Furthermore, some of the
DNA sequences in a data set may contain no sites at all, or no
sites for one or more of the factors involved in regulating the
rest. While the total number of sites in a given input sequence
often spans a specifiable and relatively short range, the exact
number of sites and the number of sites corresponding to each
motif in any input sequence are unknown and often vary among
the input sequences. To address these unknowns, the sampler
uses recursive sums over all possible alignments of 0 k Kmax
sites in a sequence, to obtain Bayesian inferences on the number
of sites for each motif and the total number of sites in each
sequence. The recursion examines the placements of sites for k
TFs, as represented by p different motifs, in each sequence. The
algorithm infers for each sequence the total number of sites, the
number of each of the p motifs and the alignments and orderings
of these sites in the sequence. In its back sampling step it
simultaneously samples sites in each sequence according to
these inferences. As in previous Gibbs sampling algorithms, the
widths of sites are inferred using a fragmentation algorithm (5).
The sampling process iterates over the sequences one at a time,
using currently sampled values in all other sequences, to guide
the sampling process toward a converged result.
After all of the sequences have been examined and a set of
multiple motif positions has been determined, the log of the
posterior alignment probability is calculated (5). The maximum value of this probability, the MAP (maximum a
posteriori probability), provides the optimal solution. The
MAP value is measured relative to an empty or ‘null’
alignment, by taking the difference between the log of the
probability of the alignment and the log of the probability of an
empty alignment. A value greater than zero indicates that the
alignment is more likely than unaligned background. The
As in previous Gibbs sampling models, the probability that a
particular position in the sequence is sampled as a site is
calculated as the ratio of the probability of the site under motif
models to the probability under a background model that
describes the sequence in the absence of TFBS. In previous
implementations, background models assumed homogeneity in
the composition of each sequence. However, variations in
sequence composition in non-coding DNA are often complex.
The most common approach to this complexity has been to
model the background sequence using Markov chains (9,10).
Non-coding sequence is also often heterogeneous in
composition, particularly in eukaryotes (Fig. 1). This variation
in local base composition can adversely affect sequence
alignment (11,12). In addition, since TFBS are often A-T or
G-C rich, masking algorithms are often not useful for reducing
the effect of background variation (13). To address this heterogeneity, the recursive sampler uses the Bayesian segmentation
algorithm (14) to produce a position-specific background
model. A two-step process is employed. First, the individual
input sequences are analyzed for heterogeneity in composition
using the Bayesian segmentation algorithm. This algorithm
returns the probabilities of observing each of the four bases at
each position in a sequence, based only on that specific
sequence’s compositional heterogeneity and on the uncertainty
in this heterogeneity. These probabilities are then used as
position-specific background models in the above ratio.
PALINDROMES AND DIRECT REPEATS
Several TFs that bind as symmetric homodimers or homomultimeric protein complexes have palindromic DNA binding
motifs (6). Other regulatory TFs such as Escherichia coli PhoB
and certain zinc finger proteins bind in directly repeating
multimers and therefore have directly repeating binding
patterns (15). Often, the spacing between the palindromic
half-sites is unknown. The algorithm makes inferences in these
cases, using a modified version of the fragmentation algorithm
(5), by restricting fragmentation to the center positions. In a
number of cases, AT/TA and GC/CG base pairs may be
indistinguishable to TF binding (16). For example, bases in the
minor groove of DNA are less accessible to the protein, which
limits the ability to distinguish the bases. In addition, AT/TA
pairs may allow the DNA greater flexibility to facilitate the
DNA bending often associated with TF binding (6). To address
all of these cases the recursive sampler includes motif models
that use reduced alphabets.
3582
Nucleic Acids Research, 2003, Vol. 31, No. 13
Figure 2. The histogram shows the distribution of the distances of 182 reported
TFBS from start codons in E.coli regulatory sequences. The solid line is a mixture of Gaussian functions fitted to the histogram data and represents the
default prokaryotic probability distribution for site locations.
PREFERRED SITE LOCATIONS
While at least in principle a TFBS may be located over the
entire range of the input DNA sequences, there are often
preferred locations for binding. For example, in prokaryotes,
while TFBS may occur anywhere in an intergenic sequence as
well as in the coding region, they are much more common in
and around the RNA polymerase binding site. In some cases,
an empirical distribution of the locations of the sites within the
sequences may be known. For example, Figure 2 shows the
distribution of 182 experimentally reported intergenic TFBS
relative to the start codons of the regulated genes in E.coli. The
use of such a distribution will allow the algorithm to focus on
the most probable locations for sites (4). By default, the
algorithm uses a uniform distribution of site positions.
Figure 3. The basic Gibbs Recursive Sampler data entry form, showing the
minimum amount of data required by the program.
Liu and coworkers (5). Conceptually, this procedure formalizes
comparisons with negative controls to assess statistical
significance. Under the null hypothesis, sites will be no more
likely to occur in study sequences than in negative controls. In
this case, application of the Gibbs sampler or other site-finding
algorithm to an extended set of sequences, the first half of
which are study sequences and the second half of which are
negative control sequences, will be equally likely to find a site
in either half. The Wilcoxon signed-rank test, included in the
recursive sampler, implements this concept in a manner that
uses not only the differences in the number of sites predicted in
study sequences versus the negative control sequences, but
also the ranks of the scores of each predicted site. Two
characteristics of this test contribute to its robustness: it is nonparametric and there are no restrictions on the characteristics of
the negative controls that can be used with the procedure.
PROGRAM INTERFACE
PRIOR INFORMATION ON CANDIDATE MOTIFS
Position weight matrices are now available for many TFs. When
a sufficient number of sites from prior studies are available the
corresponding multiple alignments can be converted to position
weight matrices that can be used to scan sequences for
additional sites (2,17). However, it is often the case that limited
data are available for the determination of a position weight
matrix. Fortunately, Bayesian statistics provides the means to
use such limited prior information through the specification of
prior motif models. These informed prior models provide clues
to the expected patterns in DNA binding motifs that influence
but do not control posterior inference of sites and motifs. The
recursive sampler permits incorporation of informed motif
priors and gives the user control over the strength of the clue.
By default, uninformed prior motif models are assumed.
An effort was made to keep the web-based program interface
straightforward. At a minimum, to use the recursive sampler
the user must supply a set of sequences in FASTA format, the
maximum number of sites per sequence and an initial width
estimate for each motif. The sequence data may be pasted into
the entry window or loaded from a file. Sets of prokaryotic or
eukaryotic default values for all parameters can be chosen.
Alternatively, the user can specify values for these parameters.
Figure 3 illustrates the initial interface screen. Required entries
are marked with asterisks. Each entry field has an associated
hyperlink that gives details about the required data format.
The sequence data is typically the non-coding sequences of
co-regulated genes from a single species (18) or the noncoding sequences of orthologous genes from several species
(4,19). McCue and colleagues (19) describe a number of
factors influencing the choice of species for use in the
identification of TFBS in prokaryotes.
WILCOXON TEST OF MOTIF SIGNIFICANCE
The assessment of statistical significance of any multiple
sequence alignment is difficult. A general, robust and nonparametric procedure for this assessment has been described by
ADVANCED OPTIONS
Clicking on the Show Advanced Options link brings up a new
screen with more options (Fig. 4). While defaults have been set
Nucleic Acids Research, 2003, Vol. 31, No. 13
3583
Figure 4. The Gibbs Recursive Sampler advanced entry page, showing
modifiable parameters.
for all advanced features, specification of alternative values on
this page allows the user greater control over how the sampler
searches for sites, and also allows adjustment of the program
runtime parameters. For example, in a particular prokaryotic
application, the user may decide to replace the default choice
of palindromic sites with a non-palindromic model. A number
of other parameters influence the operation of the program.
Several items have been mentioned above and further
information on formats and acceptable values can be obtained
by clicking on the associated hyperlinks.
A FREQUENCY-BASED SOLUTION
By default, the algorithm returns the single best solution that it
has found as measured by the MAP value. However, no
prediction method is perfect, and not all of the predictions of
this or other algorithms are equally compelling. For examination of the creditability of each prediction, a Bayesian sampling
solution is also available as an option. For this solution, after
the MAP solution has been determined, the algorithm
continues to sample, in order to explore variations in the
models. The frequencies with which sites are sampled are
recorded and reported to the user. Variations in models may
alter estimates of site frequencies from the values obtained
with the single MAP solution. Our experience shows that some
sites are strong and are sampled 100% of the time, while
others are weaker and are sampled with a lower frequency.
These sampling frequencies are an estimate of the probability
that the cognate TF binds at each predicted site. By default,
only sites that have a frequency of at least 50% are reported.
PROGRAM OUTPUT
The web-based version of Gibbs Recursive Sampler produces
the same output as does the stand-alone version. Normally, the
output is returned via email. The user also has the option of
Figure 5. Sample output from Gibbs Recursive Sampler.
receiving the output online. Figure 5 illustrates typical program
output. These results are for a set of 18 sequences extracted
from E.coli regulatory regions. The sequences are known to
contain binding sites for the protein CRP (6). The results were
generated using the recursive sampler with the following
parameters: a width of 16, a palindromic model, a maximum
number of sites per sequence of two and heterogeneous
background composition. Since we used a palindromic model,
the search for sites in the reverse complement direction was
turned off. The Wilcoxon signed-rank test was applied to test
the statistical significance of the resulting alignment.
INTERPRETATION OF RESULTS
The first portion of the results in Figure 5 is a list of the options
used for the current run. This is followed by a list of the FASTA
headings for the input sequences. Next are the results from the
MAP solution for each motif. The listing for each motif begins
with a table having a column for each of the bases and a row for
each of the positions in the motif. The numbers in the table
indicate the estimated probability of occurrence of each base in
each position within the motif, throughout the alignment. The
3584
Nucleic Acids Research, 2003, Vol. 31, No. 13
last column gives the information contribution to the model of
each position, with a maximum of two bits. Thus, this table
gives the estimated binding pattern of the motif and a measure
of the degree of conservation of each position in bits.
The next portion of the results is the alignment of the motif
sites, along with the probability that each site belongs to the
current motif model. Immediately below the listing of motif
sites is a row containing asterisks. The asterisks indicate the
conserved columns of the site. In the example in Figure 5, even
though the initial width of the site was entered as 16, the
program fragmented the sites to a total width of 22, 16
conserved columns and six non-conserved columns. Only the
conserved columns are used in the alignments. The fragmentation algorithm dynamically learns which columns are
conserved. The p-value of 0.000396 from the Wilcoxon
signed-rank test indicates that the solution is highly significant.
In addition, the predicted sites match well with the reported crp
binding sites for these sequences (6).
The MAP calculation includes terms for each component of
the model: the motif model, the fragmentation, the background
model and the alignment. The individual contributions of the two
motif-specific terms, the motif model term and the fragmentation
term are listed for each individual motif. The final full log MAP
minus the log of the null MAP (MAP of the sequence with no
Figure 5. Continued.
Nucleic Acids Research, 2003, Vol. 31, No. 13
sites) is returned at the bottom of the output, as are the
background and null MAP terms. A full description of the
output, along with other examples can be found at http://
bayesweb.wadsworth.org/gibbs/content.html#results.
ADDITIONAL FEATURES
The Gibbs Motif Sampler web page also allows for the analysis
of amino acid sequences. With the exception of nucleotidespecific features described above, all of the sampling options
are applicable to proteins. Low complexity regions of amino
acid sequences can be removed with an XNU process (20)
using a BLOSSUM62 matrix.
OTHER RESOURCES
A number of other web sites exist for the discovery of motifs in
nucleotide sequences. Several of these are based on Gibbs
sampling methods (9,21–23). Others use expectation maximization (7, http://www.cse.ucsc.edu/~kent/improbizer/index.
html), neural nets (24) or suffix trees or similar enumerative
algorithms (25–27) to search for motifs in unaligned sequence
data. Workman and Stormo (24) provide a comparison of an
earlier version of this software with neural network and
expectation maximization algorithms.
FOR FURTHER INFORMATION
The Bayesian Bioinformatics web page at http://www.
wadsworth.org/resnres/bioinfo/ provides a number of other
web based applications for sequence analysis, a database of
predicted E.coli TFBS, a tutorial on Bayesian bioinformatics
and references.
ACKNOWLEDGEMENTS
We thank our colleagues at the Wadsworth Center for their
assistance, especially C. Steven Carmack and Clarence Chan
for assistance with the web programming, and Lee Ann McCue
and Michael Palumbo for critical readings of the manuscript.
This work was supported by NIH grant R01HG01257 and
DOE grant DE-FG02-01ER63204.
REFERENCES
1. Lawrence,C.E., Altschul,S., Boguski,M., Liu,J., Neuwald,A. and
Wootton,J. (1993) Detecting subtle sequence signals: a Gibbs sampling
strategy for multiple alignment. Science, 262, 208–214.
2. Neuwald,A., Liu,J. and Lawrence,C.E. (1996) Gibbs Motif Sampling:
detection of bacterial outer membrane protein repeats. Protein Sci., 4,
1618–1632.
3. Robison,K., McGuire,A.M. and Church,G.M. (1998) A comprehensive
library of DNA-binding site matrices for 55 proteins applied to the
complete Escherichia coli K-12 genome. J. Mol. Biol., 248, 241–254.
3585
4. McCue,L.A., Thompson,W., Carmack,C.S., Ryan,M., Liu,J., Derbyshire,V.
and Lawrence,C.E. (2001) Phylogenetic footprinting of transcription factor
binding sites in proteobacterial genomes, Nucleic Acids Res., 39, 774–782.
5. Liu,J., Neuwald,A. and Lawrence,C.E. (1995) Bayesian models for
multiple local sequence alignment and Gibbs sampling strategies. J. Amer.
Stat. Assoc., 90, 1156–1170.
6. Lawrence,C.E. and Reilly,A. (1990) An expectation maximization (EM)
algorithm for the identification and characterization of common sites in
unaligned biopolymer sequences. Proteins, 7, 41–51.
7. Bailey,T.L. and Elkin,C. (1994) Fitting a mixture model by expectation
maximization to discover motifs in biopolymers. In Altman,R., Brutlag,D.,
Karp,P., Lathrop,R. and Searls,D. (eds), Proceedings of the Second
International Conference on Intelligent Systems for Molecular Biology,
AAAI Press, Menlo Park, CA, pp. 28–36.
8. Webb,B.J., Liu,J.S. and Lawrence,C.E. (2002) BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res., 30, 1268–1277.
9. Liu,X., Brutlag,DL. and Liu,J.S. (2001) BioProspector: discovering
conserved DNA motifs in upstream regulatory regions of co-expressed
genes. Proc. Pac. Symp. Biocomp., 6, 127–138.
10. Thijs,G., Lescot,M., Marchal,K., Rombauts,S., De Moor,B., Rouze,P. and
Moreau,Y. (2001). A higher order background model improves the detection
of regulatory elements by Gibbs Sampling. Bioinformatics, 17, 1113–1122.
11. Marchal,K., Thijs,G., De Keersmaecker,S., Monsieurs,P., De Moor,B. and
Vanderleyden,J. (2003) Genome-specific higher-order background models
to improve motif detection, Trends Microbiol., 11, 61–66.
12. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990)
Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
13. Wootton,J.C. and Federhen,S. (1996) Analysis of compositional biased
regions in sequence databases. Methods Enzymol., 266, 554–571.
14. Liu,J. and Lawrence,C.E. (1999) Bayesian inference on biopolymer
models. Bioinformatics, 15, 38–52.
15. Warner,B.L. (1996) Phosphorous assimilation and control of the phosphate
regulon. In Neihdhardt, F.C. (ed.), Escherichia coli and Salmonella: Cellular
and Molecular Biology. ASM Press, Washington, DC, pp. 1357–1381.
16. Twyman,R.M. (1998) Advanced Molecular Biology, A Concise Reference.
Springer-Verlag, New York, NY, pp. 246–247.
17. Staden,R. (1989) Methods for calculating the probabilities of finding
patterns in sequences. Comput. Appl. Biosci., 5, 89–96.
18. Wasserman,W., Palumbo,M., Thompson,W., Fickett,J. and Lawrence,C.E.
(2000) Human-mouse genome comparisons to locate regulatory sites.
Nature Genet., 26, 225–228.
19. McCue,L.A., Thompson,W., Carmack,C.S. and Lawrence,C.E. (2002)
Factors influencing the identification of transcription factor binding sites by
cross-species comparison. Genome Res., 12, 1523–1532.
20. Claverie,J.M. and States,D. (1993) Information enhancement methods for
large scale sequence analysis, Comp. Chem., 17, 191–201.
21. Hughes,J.D., Estep,P.W., Tavazoie,S. and Church,G.M. (2000) Computational
identification of cis-regulatory elements associated with functionally coherent
groups of genes in Saccharomyces cerevisiae. J. Mol. Biol., 296, 1205–1214.
22. Thijs,G., Marchal,K., Lescot,M., Rombauts,S., De Moor,B., Rouze,P. and
Moreau,Y. (2002) A Gibbs sampling method to detect over-represented
motifs in upstream regions of coexpressed genes. J. Comp. Biol., 9,
447–464.
23. van Helden,J., André,B. and Collado-Vides,J. (2000) A web site for the
computational analysis of yeast regulatory sequences. Yeast, 16, 177–187.
24. Workman,C. and Stormo,G.D. (2000) ANN-Spec: a method for
discovering transcription factor binding sites with improved specificity.
Proc. Pac. Symp. Biocomp., 5, 464–475.
25. Sinha,S. and Tompa,M. (2000) A statistical method for finding
transcription factor binding sites. In Bourne,P., Gribskov,M., Altman,R.,
Jensen,N., Hope,D., Lengauer,T., Mitchell,J., Scheeff,E., Smith,C.,
Strande,S. and Weissig,H. (eds), Proceedings of the Eighth International
Conference on Intelligent Systems for Molecular Biology, AAAI Press,
Menlo Park, CA, pp. 344–354.
26. Eskin,E. and Pevzner,P.A. (2002) Finding composite regulatory patterns in
DNA sequences. Bioinformatics, 18, 354–363.
27. Marsan,L. and Sagot,M.F. (2001) Algorithms for extracting structured
motifs using a suffix-tree with application to promoter and regulatory site
consensus identification. J. Comput. Biol., 7, 345–360.