NOTE / NOTE
Genome-wide analysis of rice (Oryza sativa L.
subsp. japonica) TATA box and Y Patch promoter
elements
P. Civáň and M. Švec
Abstract: The TATA box is one of the best characterized transcription factor binding sites. However, it is not a ubiquitous
element of core promoters, and other sequence motifs such as Y Patches seem to play a major role in plants. Here, we
present a first genome-wide computational analysis of the TATA box and Y Patch distribution in rice (Oryza sativa L.
subsp. japonica) promoter sequences. Utilizing a probabilistic sequence model, we ascertain that only ~19% of rice genes
possess the TATA box, but ~50% contain one or more Y Patches in their core promoters. By computational processing of
identified elements, we generated extended TATA box and Y Patch nucleotide frequency matrices capable of predicting
these motifs in plants with a high degree of confidence.
Key words: TATA box, Y Patch, nucleotide frequency matrix, core promoter, Oryza sativa L.
Résumé : La boîte TATA est l’un des sites de liaison de facteurs de transcription les mieux caractérisés. Elle ne constitue
cependant pas un élément ubiquiste chez les promoteurs de base, et d’autres motifs comme les motifs Y semblent jouer un
rôle important chez les plantes. Ici, les auteurs présentent une première analyse bio-informatique à l’échelle du génome entier de la distribution des boîtes TATA et des motifs Y au sein des promoteurs chez le riz (Oryza sativa L. subsp. japonica). Au moyen d’un modèle probabiliste, les auteurs concluent que seuls ~19 % des gènes chez le riz présentent une
boîte TATA alors qu’environ 50 % présentent un ou plus d’un motif Y au sein de leur promoteur de base. Par analyse bio-informatique des éléments identifiés, les auteurs ont produit des matrices de fréquence nucléotidique des boîtes TATA et
des motifs Y qui permettent de prédire ces motifs chez les plantes avec un haut degré de confiance.
Mots-clés : boîte TATA, motif Y, matrice de la fréquence nucléotidique, promoteur de base, Oryza sativa L.
[Traduit par la Rédaction]
Initiation of transcription by RNA polymerase II requires
the assembly of the basal transcription apparatus at the core
promoter, a minimal stretch of contiguous DNA sequence
that is sufficient to direct this process accurately, typically
encompassing the transcription start site (TSS) and extending either upstream or downstream for an additional ~35 nt
(Butler and Kadonaga 2002). The most commonly recognized sequence motif of core promoters is the TATA box, a
T/A-rich sequence clustered around position –32 with respect to the TSS in Arabidopsis (Molina and Grotewold 2005).
The TATA box is recognized by TATA-binding protein
(TBP), a component of the TFIID complex, which recruits
RNA polymerase II and directs the assembly of the preinitiation complex (PIC).
Received 7 July 2008. Accepted 24 December 2008. Published
on the NRC Research Press Web site at genome.nrc.ca on
18 February 2009.
Corresponding Editor: M. Francki.
P. Civáň1 and M. Švec. Department of Genetics, Faculty of
Natural Sciences, Comenius University, Mlynská dolina B-1,
842 15 Bratislava, Slovakia.
1Corresponding author (e-mail: [email protected]).
Genome 52: 294–297 (2009)
Although the TATA box was considered to be universal
in the past, recent surveys demonstrated that the TATA box
occurs only in ~30% of all promoters in humans (Suzuki et
al. 2001), Drosophila melanogaster (Ohler et al. 2002), and
Arabidopsis thaliana (Molina and Grotewold 2005). It has
been shown that in humans and yeast, TATA-less genes are
frequently involved in basic ‘‘housekeeping’’ processes,
while TATA-containing genes are more often highly regulated, for example by biotic or stress stimuli (Yang et al. 2007).
In crop plants, stress-response-associated genes are frequently of the highest agronomic importance, which
underlines the need for thorough characterization of promoter
elements. However, the proportion and characteristics of
TATA-containing genes in plants, as well as the incidence
and function of other regulatory motifs of plant core promoters, remain poorly examined.
Molina and Grotewold (2005) have identified T/C-rich
decamer motifs abundant in the [–50, +50] region of Arabidopsis promoters utilizing both MEME and AlignACE algorithms. Since these motifs resembled microsatellites
commonly found in Arabidopsis, they did not capture much
attention. However, Yamamoto et al. (2007a), applying the
assumption that localized distribution is a signature of a
functional element of the promoter, have identified a group
of similar T/C-rich motifs in Arabidopsis and rice and designated this putative element of core promoters as the ‘‘Y
Patch’’ (or pyrimidine patch). The Y Patch was shown to be
a higher plant-specific element (Yamamoto et al. 2007b)
with strict direction sensitivity (Yamamoto et al. 2007a),
although its biochemical role is not known.
doi:10.1139/G09-001
Published by NRC Research Press
In this study, rice promoter sequences were downloaded
from the publicly available Eukaryotic Promoter Database
(EPD) (www.epd.isb-sib.ch), which encompasses 13 046
rice sequences spanning the [–499, +100] region relative
to the TSS. The 13 046 ‘‘preliminary’’ rice promoter entries
in EPD have been derived from a reference collection of
~30 500 mRNA sequences of Oryza sativa L. subsp. japonica ‘Nipponbare’ published by the Rice Full-Length cDNA
Consortium (Kikuchi et al. 2003). A detailed description of
the preliminary EPD rice promoter data set is provided in
Schmid et al. (2006).
We divided the promoter data set into 12 bins, each comprising 13 046 sequences of one 50 nt interval (Fig. 1), and
these subsets were directly retrieved from EPD. To search
for TATA boxes and Y Patches in the subset sequences, we
used MotifScanner, the higher-order probabilistic model from
MotifSampler version 3.1 (Thijs et al. 2001). MotifScanner
is an algorithm for identification of predefined motifs in
DNA sequences based on a probabilistic sequence model.
The model utilizes the Gibbs sampling method and assumes
that motif instances are hidden in a noisy background sequence (Thijs et al. 2001). To create an appropriate variant,
we generated and tested several background models of order
1 accounting for dinucleotide distributions. These included
background models specifically derived from corresponding
bin sequences as well as a pseudo-random background
model designed from a large set of sequences that had been
generated artificially according to the real nucleotide composition of our data set. Finally, the background model derived from the [–499, –50] region of the complete promoter
set appeared to be the most appropriate, yielding the highest ratio of TATA boxes found within the [–49, +1]
region to the total number of TATA elements detected.
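As an illustration of what a background model of order 1 captures, the following minimal Python sketch estimates the probability of each nucleotide given the preceding one from dinucleotide counts. It is not the MotifSampler implementation; the function name, the pseudocount, and the toy sequences are assumptions made only for this example.

```python
ALPHABET = "ACGT"

def first_order_background(sequences, pseudocount=1.0):
    """Estimate P(next | previous) from dinucleotide counts.

    Returns a nested dict: model[prev][nxt] -> probability.
    The pseudocount is an assumption that avoids zero probabilities.
    """
    counts = {p: {n: pseudocount for n in ALPHABET} for p in ALPHABET}
    for seq in sequences:
        seq = seq.upper()
        # Walk over every adjacent pair of nucleotides in the sequence.
        for prev, nxt in zip(seq, seq[1:]):
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1
    model = {}
    for prev, row in counts.items():
        total = sum(row.values())
        model[prev] = {n: c / total for n, c in row.items()}
    return model

# Toy sequences standing in for the [-499, -50] upstream region (invented data).
upstream = ["ATATATGCAT", "TTTATAAACG", "CATATTTAGC"]
bg = first_order_background(upstream)
```

Each row of `bg` sums to 1, so the model can be read as a transition table between consecutive nucleotides.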
The MotifScanner queries were specified with 8 and 10 nt
long nucleotide frequency matrices (NFMs) for TATA box
and Y Patch elements, respectively. The TATA box NFM
had been calculated from the weighted average of the EPD
Plant TATA box (www.epd.isb-sib.ch/promoter_elements/)
and PlantProm TATA box NFMs (Shahmuradov et al.
2003); the Y Patch NFM was based on the Motif 1 sequence
logo of Molina and Grotewold (2005) and our preliminary
analysis of the [–49, +50] promoter region. Only the ‘‘+’’
strands of DNA sequences were analysed; all other MotifScanner parameters
were left at their default values.
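The two ideas in this paragraph, combining two NFMs by a weighted average and scanning only the ‘‘+’’ strand with the result, can be sketched in Python as follows. This is a generic log-odds scan against a uniform background, not MotifScanner's algorithm; the weight, the 4-position toy matrices, and the function names are all invented for illustration and are not the EPD or PlantProm values.

```python
import math

BASES = "ACGT"

def weighted_average(nfm_a, nfm_b, weight_a=0.5):
    """Position-wise weighted mean of two equally long frequency matrices."""
    if len(nfm_a) != len(nfm_b):
        raise ValueError("matrices must have the same length")
    return [
        {b: weight_a * pa[b] + (1.0 - weight_a) * pb[b] for b in BASES}
        for pa, pb in zip(nfm_a, nfm_b)
    ]

def log_odds_scores(seq, nfm, background):
    """Score each '+' strand window as the sum of log2(p_motif / p_background)."""
    width = len(nfm)
    scores = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(math.log2(col[base] / background[base])
                    for col, base in zip(nfm, window))
        scores.append((start, score))
    return scores

# Toy matrices with invented numbers (every entry is non-zero).
t_rich_a = {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}
a_rich_a = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
t_rich_b = {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4}
a_rich_b = {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2}
toy_a = [t_rich_a, a_rich_a, t_rich_a, a_rich_a]
toy_b = [t_rich_b, a_rich_b, t_rich_b, a_rich_b]

combined = weighted_average(toy_a, toy_b, weight_a=0.5)
uniform = {b: 0.25 for b in BASES}
scores = log_odds_scores("GCTATAGC", combined, uniform)
best_start, best_score = max(scores, key=lambda item: item[1])
```

In this toy case the highest-scoring window is the one that spells TATA; a real search would replace the uniform background with a model estimated from the data.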
Fig. 1. Computational analysis of 13 046 rice promoter regions. Sequences were divided into 12 bins according to the distance from
the TSS. Bins: 1, [–499, –450]; 2, [–449, –400]; 3, [–399, –350]; 4,
[–349, –300]; 5, [–299, –250]; 6, [–249, –200]; 7, [–199, –150]; 8,
[–149, –100]; 9, [–99, –50]; 10, [–49, +1]; 11, [+1, +50]; 12, [+51,
+100]. (a) Frequency distribution of single nucleotides (A, C, G, T)
in the whole data set. (b) The curve indicates the number of TATA
boxes detected within particular bins of 13 046 sequences using
MotifScanner and the combined EPD–PlantProm NFM. (c) Parallel representation of the Y Patch distribution.
As expected for TATA motifs, we detected the highest
number of significant hits (2221) within the 10th bin ([–49,
+1] promoter region; Fig. 1b), which is consistent with the
current knowledge of TATA box position in higher eukaryotes. The abundance of TATA-like elements downstream
from the TSS (the 11th and 12th bins) is relatively low. However, a continuous increase in significant hits can be seen in
the upstream direction, starting from the 8th bin, and the
frequency of hits appears to stabilize in the [–499, –400] region. Although these distant TATA-like elements are generally less conserved, there is no unambiguous feature on the
basis of which such elements could be distinguished from
the known TATA box consensus sequence. However, our
searches for unspecified W stretches revealed large numbers
of A/T octamers, increasing in the upstream direction from
the TSS, which together with the substantial growth of AT content in the 3′ → 5′ direction (from 38% in the 12th bin up to
60% in the first and second bins; Fig. 1a) provides at least a
partial explanation for the abundance of TATA box hits in
distant promoter regions.
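The two quantities discussed above, AT content per bin and the number of A/T-only octamers (W stretches), are straightforward to compute. The Python sketch below shows one possible way, using two invented 100 nt toy sequences split into two 50 nt bins rather than the 12 bins of the real data set; the helper names are hypothetical.

```python
def split_into_bins(seq, bin_size=50):
    """Cut a sequence into consecutive, non-overlapping bins."""
    return [seq[i:i + bin_size] for i in range(0, len(seq), bin_size)]

def at_content(seqs):
    """Fraction of A and T among all nucleotides in a list of sequences."""
    total = sum(len(s) for s in seqs)
    at = sum(s.count("A") + s.count("T") for s in seqs)
    return at / total if total else 0.0

def count_w_octamers(seq, width=8):
    """Count overlapping windows of `width` made up only of A or T."""
    return sum(
        1 for i in range(len(seq) - width + 1)
        if set(seq[i:i + width]) <= {"A", "T"}
    )

# Invented toy promoters, 100 nt each (AT-rich first half, mixed second half).
promoters = ["AT" * 25 + "GC" * 25, "A" * 50 + "ACGT" * 12 + "AC"]

# Regroup so that each entry holds the same bin from every promoter.
bins = list(zip(*(split_into_bins(p) for p in promoters)))
at_by_bin = [at_content(b) for b in bins]
octamers_by_bin = [sum(count_w_octamers(s) for s in b) for b in bins]
```

Windows are counted with overlap here, so a long A/T run contributes many octamers; whether the original searches counted overlapping stretches is not stated in the text.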
Whether these TATA-like elements are nonfunctional or
somehow participate in the regulation of transcription initiation remains unclear. Theoretically, a large number of irrelevant TATA-like elements in the genome may recruit and
deplete the TBP pool as well as result in incorrect PIC formation. We assume that, to avoid this, chromatin might keep
the distant parts of the promoter inaccessible to interactions
with the TBP complex, or an additional sequence motif cooperating with the TATA box might be required for accurate PIC formation. For the latter assumption, the Y Patch may be a
possible candidate.
The lists of sequence elements detected with the MotifScanner search within the [–49, +1] region (TATA box) and
[–49, +50] region (Y Patch) were converted from GFF format to FASTA format and further analysed using Geneious
Basic 3.5.6 software (www.geneious.com). Based on the nucleotide frequencies of the 2221 TATA boxes, we
created a retrained TATA box NFM. This NFM was extended to a length of 14 by adding three positions on both
sides, filled with the average nucleotide frequencies
of the 10th bin. When the MotifScanner searches were repeated to identify TATA boxes consistent with the new NFM,
2491 TATA elements were detected. These hits were analysed with Geneious Basic 3.5.6 software as described
above, leading to the creation of an extended nucleotide frequency distribution of rice TATA boxes (Table S1).2 Figure
2a shows a graphical representation of the TATA box consensus
sequence created with WebLogo (Crooks et al. 2004; http://weblogo.berkeley.edu/).
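The retraining and extension steps can be sketched in Python as below: column-wise frequencies are computed from aligned hits, and the resulting 8-position matrix is padded with three positions on each side to reach a length of 14. The hit sequences and the flanking frequencies are invented placeholders, and the function names are hypothetical; the original work used Geneious for this step.

```python
BASES = "ACGT"

def nfm_from_hits(hits):
    """Column-wise nucleotide frequencies of equally long, aligned hits."""
    width = len(hits[0])
    if any(len(h) != width for h in hits):
        raise ValueError("all hits must have the same length")
    nfm = []
    for pos in range(width):
        column = [h[pos] for h in hits]
        nfm.append({b: column.count(b) / len(hits) for b in BASES})
    return nfm

def extend_nfm(nfm, flank_freqs, flank=3):
    """Pad the matrix on both sides with `flank` copies of average frequencies."""
    left = [dict(flank_freqs) for _ in range(flank)]
    right = [dict(flank_freqs) for _ in range(flank)]
    return left + nfm + right

# Invented 8 nt hits standing in for the detected TATA boxes.
hits = ["TATAAATA", "TATATATA", "TATAAAAA", "CATAAATA"]
core = nfm_from_hits(hits)

# Placeholder for the average nucleotide frequencies of the 10th bin.
bin_average = {"A": 0.25, "C": 0.30, "G": 0.15, "T": 0.30}
extended = extend_nfm(core, bin_average)
```

Because the flanking columns carry only background frequencies, they do not favour any nucleotide until the matrix is retrained on a new round of hits.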
As seen from our analysis, only 2491 out of 13 046 rice
genes (19%) encompass the TATA box in their core promoters. This proportion is even lower than the one recently
revealed for Arabidopsis genes (29%) using a similar approach (Molina and Grotewold 2005). The number of MotifScanner hits may be slightly increased by changing the
parameters of the model (up to 26.6% with p = 0.5); however, such an increase often leads to the detection of less
conserved motifs that hardly resemble TATA elements (results not shown).
Our attempt to expand the TATA box consensus sequence
turned out to be unwarranted, since TBP was shown to bind
to 8 bp of DNA in Saccharomyces (Kim et al. 1993). A
slight prevalence of C was observed at the flanking positions
(Table S1),2 but this may well be explained by the nucleotide composition of the [–49, +1] region (Fig. 1a). Consequently, the consensus sequence of the rice TATA box may
be specified as CTATAWAWA, located in a C-rich region.
Fig. 2. Graphical representation of (a) TATA box and (b) Y Patch
nucleotide frequency distributions generated with WebLogo. The
overall height of each stack indicates the sequence conservation at
that position (measured in bits), whereas the height of the symbols
within the stack reflects the relative frequency of the corresponding
nucleotide at that position (the numerals on the x-axis are irrespective of the TSS position).
The extended TATA box NFM produced in this study is
broadly consistent with previously released matrices of
plant TATA boxes (Table S3).2
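The bit scale used in sequence logos such as Fig. 2 can be reproduced from any NFM column. The Python sketch below computes the information content of a DNA position as 2 minus its Shannon entropy and scales each symbol by its relative frequency. It omits the small-sample correction that logo software may apply, and the function names are assumptions for this example.

```python
import math

def information_content(column):
    """Information content in bits of one DNA position: 2 - Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy

def symbol_heights(column):
    """Height of each symbol: its frequency times the stack height."""
    ic = information_content(column)
    return {base: p * ic for base, p in column.items()}

invariant = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
w_position = {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}
uninformative = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

stack_heights = [information_content(c)
                 for c in (invariant, w_position, uninformative)]
```

A fully conserved position reaches the maximum of 2 bits, an equal A/T mixture (W) gives 1 bit, and a position matching a uniform background carries no information.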
The Y Patch seems to be a more abundant, although less
conserved, sequence motif of rice core promoters. We detected 4997 and 5230 Y Patch elements in the 10th and
11th bins, respectively (Fig. 1c). However, since some core
promoters encompassed more than one Y Patch, these values
do not represent the actual number of Y Patch-positive
genes. When redundant hits are disregarded, approximately
half of all tested core promoters exhibit at least one Y Patch
motif. Based on the nucleotide frequency distribution of all
10 227 Y Patches detected in this study within the core promoters of rice (Table S2),2 we designate the Y Patch consensus sequence as CYTCYYCCYC (shown in Fig. 2b).
To the best of our knowledge, we present here the first genome-wide NFM and frequency estimates of Y Patch
promoter elements.
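The distinction between the number of hits (4997 + 5230 = 10 227) and the number of Y Patch-positive promoters comes down to counting each promoter once. A minimal Python sketch of that step is given below, assuming tab-delimited GFF-style hit lines whose first column is the sequence identifier; the identifiers, the source and feature labels, and the scores are invented.

```python
def promoters_with_hits(gff_lines):
    """Return the set of sequence identifiers that have at least one hit."""
    ids = set()
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        ids.add(line.split("\t")[0])  # first GFF column is the sequence name
    return ids

# Invented hit list: promoter_1 carries two Y Patch hits.
toy_gff = [
    "# toy Y Patch hits",
    "promoter_1\tscan\tmisc_feature\t12\t21\t0.91\t+\t.\tid \"YPatch\"",
    "promoter_1\tscan\tmisc_feature\t40\t49\t0.88\t+\t.\tid \"YPatch\"",
    "promoter_3\tscan\tmisc_feature\t5\t14\t0.95\t+\t.\tid \"YPatch\"",
    "promoter_4\tscan\tmisc_feature\t60\t69\t0.90\t+\t.\tid \"YPatch\"",
]

total_promoters = 5  # size of the toy data set
positive = promoters_with_hits(toy_gff)
fraction_positive = len(positive) / total_promoters
```

In the toy data, four hits collapse to three positive promoters out of five, which mirrors how redundant hits inflate the raw count.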
2 Supplementary data for this article are available on the journal Web site (http://genome.nrc.ca) or may be purchased from the Depository
of Unpublished Data, Document Delivery, CISTI, National Research Council Canada, Building M-55, 1200 Montreal Road, Ottawa, ON
K1A 0R6, Canada. DUD 3887. For more information on obtaining material, refer to http://cisti-icist.nrc-cnrc.gc.ca/cms/unpub_e.html.
A strong skew in CG content near the TSS of Arabidopsis
genes is an intriguing feature observed by Tatarinova et al.
(2003). As in Arabidopsis, our results show that
C is considerably more frequent than G in the ‘‘+’’ strand of core
promoter regions (Fig. 1a). It has been hypothesized that the CG compositional strand bias might be caused
by distinctive mutation rates of C in the replication fork (Tatarinova et al. 2003). According to this hypothesis, C → T
mutations occur more frequently on the ‘‘–’’ strand, making
the ‘‘+’’ strand G-poor. Following this reasoning, the ‘‘+’’
strand should be simultaneously enriched with A. However,
no such enrichment is evident in our data set (Fig. 1a),
where the frequency of A is continually decreasing in the
downstream direction, irrespective of the TSS position. We
therefore suggest that the CG skew phenomenon is better
explained by the frequent occurrence of clearly directional
Y Patches around TSSs.
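The strand bias discussed above can be quantified per bin. The Python sketch below uses one common definition of CG skew, (C − G)/(C + G), computed on the ‘‘+’’ strand; whether this matches the exact measure used by Tatarinova et al. (2003) is an assumption, and the toy sequences are invented.

```python
def cg_skew(seqs):
    """CG skew of a set of '+' strand sequences: (C - G) / (C + G)."""
    c = sum(s.count("C") for s in seqs)
    g = sum(s.count("G") for s in seqs)
    return (c - g) / (c + g) if (c + g) else 0.0

# Invented toy bins: balanced upstream sequences and pyrimidine-rich core ones.
upstream_bin = ["ACGTACGT", "GCGCATAT"]
core_bin = ["CCTCTTCCCC", "CTTCGCCTCA"]

skew_upstream = cg_skew(upstream_bin)
skew_core = cg_skew(core_bin)
```

A value near zero indicates balanced C and G, whereas directional C-rich motifs such as Y Patches push the skew of the ‘‘+’’ strand towards +1 without requiring any accompanying enrichment in A.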
Acknowledgements
We thank Ľubomír Tomáška and Matúš Valach for helpful comments on the manuscript. This work was supported
by the Science and Technology Assistance Agency under
contract No. APVT-27-028704 and the Slovak Research and
Development Agency under contract No. APVV-0770-07.
References
Butler, J.E.F., and Kadonaga, J.T. 2002. The RNA polymerase II
core promoter: a key component in the regulation of gene expression. Genes Dev. 16: 2583–2592. doi:10.1101/gad.1026202.
PMID:12381658.
Crooks, G.E., Hon, G., Chandonia, J.-M., and Brenner, S.E. 2004.
WebLogo: a sequence logo generator. Genome Res. 14: 1188–
1190. doi:10.1101/gr.849004. PMID:15173120.
Kikuchi, S., Satoh, K., Nagata, T., Kawagashira, N., Doi, K., Kishimoto, N., et al. 2003. Collection, mapping, and annotation of
over 28,000 cDNA clones from japonica rice. Science (Washington, D.C.), 301: 376–379. doi:10.1126/science.1081288.
PMID:12869764.
Kim, Y., Geiger, J.H., Hahn, S., and Sigler, P.B. 1993. Crystal
structure of a yeast TBP/TATA-box complex. Nature (Lond.),
365: 512–520. doi:10.1038/365512a0. PMID:8413604.
Molina, C., and Grotewold, E. 2005. Genome wide analysis of Arabidopsis core promoters. BMC Genomics, 6: 25. doi:10.1186/1471-2164-6-25. PMID:15733318.
Ohler, U., Liao, G.C., Nieman, H., and Rubin, G.M. 2002. Computational analysis of core promoters in the Drosophila genome.
Genome Biol. 3: research0087. doi:10.1186/gb-2002-3-12-research0087. PMID:12537576.
Schmid, C.D., Perier, R., Praz, V., and Bucher, P. 2006. EPD in its
twentieth year: towards complete promoter coverage of selected
model organisms. Nucleic Acids Res. 34: D82–D85. doi:10.1093/nar/gkj146. PMID:16381980.
Shahmuradov, I.A., Gammerman, A.J., Hancock, J.M., Bramley,
P.M., and Solovyev, V.V. 2003. PlantProm: a database of plant
promoter sequences. Nucleic Acids Res. 31: 114–117. doi:10.1093/nar/gkg041. PMID:12519961.
Suzuki, Y., Tsunoda, T., Sese, J., Taira, H., Mizushima-Sugano, J.,
Hata, H., et al. 2001. Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome
Res. 11: 677–684. doi:10.1101/gr.GR-1640R. PMID:11337467.
Tatarinova, T., Brover, V., Troukhan, M., and Alexandrov, N.
2003. Skew in CG content near the transcription start site in
Arabidopsis thaliana. Bioinformatics, 19(Suppl. 1): i313–i314.
doi:10.1093/bioinformatics/btg1043. PMID:12855475.
Thijs, G., Lescot, M., Marchal, K., Rombauts, S., De Moor, B.,
Rouzé, P., and Moreau, Y. 2001. A higher-order background
model improves the detection of promoter regulatory elements
by Gibbs sampling. Bioinformatics, 17: 1113–1122. doi:10.1093/bioinformatics/17.12.1113. PMID:11751219.
Yamamoto, Y.Y., Ichida, H., Matsui, M., Obokata, J., Sakurai, T.,
Satou, M., et al. 2007a. Identification of plant promoter constituents by analysis of local distribution of short sequences. BMC
Genomics, 8: 67. doi:10.1186/1471-2164-8-67. PMID:17346352.
Yamamoto, Y.Y., Ichida, H., Abe, T., Suzuki, Y., Sugano, S., and
Obokata, J. 2007b. Differentiation of core promoter architecture
between plants and mammals revealed by LDSS analysis. Nucleic Acids Res. 35: 6219–6226. doi:10.1093/nar/gkm685.
PMID:17855401.
Yang, C., Bolotin, E., Jiang, T., Sladek, F.M., and Martinez, M.
2007. Prevalence of the initiator over the TATA box in human
and yeast genes and identification of DNA motifs enriched in
human TATA-less core promoters. Gene, 389: 52–65. doi:10.1016/j.gene.2006.09.029. PMID:17123746.