NOTE

Genome-wide analysis of rice (Oryza sativa L. subsp. japonica) TATA box and Y Patch promoter elements

P. Civáň and M. Švec

Abstract: The TATA box is one of the best characterized transcription factor binding sites. However, it is not a ubiquitous element of core promoters, and other sequence motifs such as Y Patches seem to play a major role in plants. Here, we present a first genome-wide computational analysis of the TATA box and Y Patch distribution in rice (Oryza sativa L. subsp. japonica) promoter sequences. Utilizing a probabilistic sequence model, we ascertain that only ~19% of rice genes possess the TATA box, but ~50% contain one or more Y Patches in their core promoters. By computational processing of identified elements, we generated extended TATA box and Y Patch nucleotide frequency matrices capable of predicting these motifs in plants with a high degree of confidence.

Key words: TATA box, Y Patch, nucleotide frequency matrix, core promoter, Oryza sativa L.
Received 7 July 2008. Accepted 24 December 2008. Published on the NRC Research Press Web site at genome.nrc.ca on 18 February 2009. Genome 52: 294–297 (2009). doi:10.1139/G09-001. Published by NRC Research Press.
Corresponding Editor: M. Francki.
P. Civáň¹ and M. Švec. Department of Genetics, Faculty of Natural Sciences, Comenius University, Mlynská dolina B-1, 842 15 Bratislava, Slovakia.
¹Corresponding author (e-mail: [email protected]).

Initiation of transcription by RNA polymerase II requires the assembly of the basal transcription apparatus at the core promoter, a minimal stretch of contiguous DNA sequence that is sufficient to direct this process accurately, typically encompassing the transcription start site (TSS) and extending either upstream or downstream for an additional ~35 nt (Butler and Kadonaga 2002). The most commonly recognized sequence motif of core promoters is the TATA box, a T/A-rich sequence clustered around position –32 with respect to the TSS in Arabidopsis (Molina and Grotewold 2005). The TATA box is recognized by TATA-binding protein (TBP), a component of the TFIID complex, which recruits RNA polymerase II and directs the assembly of the preinitiation complex (PIC). Although the TATA box was considered to be universal in the past, recent surveys demonstrated that it occurs in only ~30% of all promoters in humans (Suzuki et al. 2001), Drosophila melanogaster (Ohler et al. 2002), and Arabidopsis thaliana (Molina and Grotewold 2005). It has been shown that in humans and yeast, TATA-less genes are frequently involved in basic "housekeeping" processes, while TATA-containing genes are more often highly regulated, for example by biotic or stress stimuli (Yang et al. 2007). In crop plants, stress-response associated genes are frequently of the highest agronomical importance, which underlines the need for thorough characterization of promoter elements. However, the proportion and characterization of TATA-containing genes in plants, as well as the incidence and function of other regulatory motifs of plant core promoters, have been poorly examined.

Molina and Grotewold (2005) identified T/C-rich decamer motifs abundant in the [–50, +50] region of Arabidopsis promoters using both the MEME and AlignACE algorithms. Since these motifs resembled microsatellites commonly found in Arabidopsis, they did not attract much attention. However, Yamamoto et al. (2007a), applying the assumption that localized distribution is a signature of a functional promoter element, identified a group of similar T/C-rich motifs in Arabidopsis and rice and designated this putative element of core promoters as the "Y Patch" (or pyrimidine patch). The Y Patch was shown to be an element specific to higher plants (Yamamoto et al. 2007b) with strict direction sensitivity (Yamamoto et al. 2007a), although its biochemical role is not known.

In this study, rice promoter sequences were downloaded from the publicly available Eukaryotic Promoter Database (EPD) (www.epd.isb-sib.ch), which encompasses 13 046 rice sequences spanning the [–499, +100] region relative to the TSS. The 13 046 "preliminary" rice promoter entries in EPD have been derived from a reference collection of ~30 500 mRNA sequences of Oryza sativa L. subsp. japonica 'Nipponbare' published by the Rice Full-Length cDNA Consortium (Kikuchi et al. 2003). A detailed description of the preliminary EPD rice promoter data set is provided in Schmid et al. (2006). We divided the promoter data set into 12 bins, each comprising 13 046 sequences of one 50 nt interval (Fig. 1), and these subsets were retrieved directly from EPD. To search for TATA boxes and Y Patches in the subset sequences, we used MotifScanner, a higher-order probabilistic model from MotifSampler version 3.1 (Thijs et al. 2001).
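The division into twelve 50 nt bins can be illustrated with a short sketch. The authors retrieved each interval directly from EPD; the code below instead assumes, purely for illustration, that every promoter is available locally as one 600 nt string (12 × 50) ordered 5' to 3'. The function names `split_into_bins` and `at_content_per_bin` and the toy sequences are hypothetical and are not part of the original analysis.

```python
# Illustrative sketch only; not the authors' pipeline.
BIN_WIDTH = 50   # nt per bin, as stated in the text
N_BINS = 12      # bins per promoter, as stated in the text


def split_into_bins(seq, width=BIN_WIDTH, n_bins=N_BINS):
    """Cut one promoter sequence into consecutive fixed-width bins.

    Assumes the sequence is exactly width * n_bins nt long and ordered
    5' to 3', so that bins[0] is the most upstream interval.
    """
    if len(seq) != width * n_bins:
        raise ValueError(f"expected {width * n_bins} nt, got {len(seq)}")
    return [seq[i * width:(i + 1) * width] for i in range(n_bins)]


def at_content_per_bin(promoters):
    """Return the pooled A+T fraction of each bin across all promoters."""
    at = [0] * N_BINS
    total = [0] * N_BINS
    for seq in promoters:
        for i, chunk in enumerate(split_into_bins(seq.upper())):
            at[i] += chunk.count("A") + chunk.count("T")
            total[i] += len(chunk)
    return [a / t for a, t in zip(at, total)]


# Toy input (made up): AT-only upstream bins, no A or T in the last bin.
toy = ["AT" * 275 + "GC" * 25, "A" * 550 + "C" * 50]
print(at_content_per_bin(toy))
```

Pooling counts per bin, rather than averaging per-sequence fractions, matches the idea of a single composition value per interval as plotted in Fig. 1a, although the exact computation used for the figure is not described in the text.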
MotifScanner is an algorithm for identification of predefined motifs in DNA sequences based on a probabilistic sequence model. The model utilizes the Gibbs sampling method and assumes that motif instances are hidden in a noisy background sequence (Thijs et al. 2001). To find an appropriate variant, we generated and tested several background models of order 1, which account for dinucleotide distributions. These included background models derived specifically from the corresponding bin sequences as well as a pseudo-random background model built from a large set of sequences that had been generated artificially according to the real nucleotide composition of our data set. Finally, the background model derived from the [–499, –50] region of the complete promoter set appeared to be the most appropriate, yielding the highest ratio of TATA boxes found within the [–49, +1] region to the total number of TATA elements detected. The MotifScanner queries were specified with 8 and 10 nt long nucleotide frequency matrices (NFMs) for TATA box and Y Patch elements, respectively. The TATA box NFM had been calculated from the weighted average of the EPD Plant TATA box (www.epd.isb-sib.ch/promoter_elements/) and PlantProm TATA box NFMs (Shahmuradov et al. 2003); the Y Patch NFM was based on the Motif 1 sequence logo of Molina and Grotewold (2005) and our preliminary analysis of the [–49, +50] promoter region. Only the "+" strands of DNA sequences were analysed; other parameters required for MotifScanner searching were left at their default values.

Fig. 1. Computational analysis of 13 046 rice promoter regions. Sequences were divided into 12 bins according to the distance from the TSS. Bins: 1, [–499, –450]; 2, [–449, –400]; 3, [–399, –350]; 4, [–349, –300]; 5, [–299, –250]; 6, [–249, –200]; 7, [–199, –150]; 8, [–149, –100]; 9, [–99, –50]; 10, [–49, +1]; 11, [+1, +50]; 12, [+51, +100]. (a) Frequency distribution of single nucleotides (A, C, G, T) in the whole data set. (b) The curve indicates the number of TATA boxes detected within particular bins of 13 046 sequences using MotifScanner and the combined EPD–PlantProm NFM. (c) Parallel representation of the Y Patch distribution.

As expected for TATA motifs, we detected the highest number of significant hits (2221) within the 10th bin ([–49, +1] promoter region; Fig. 1b), which is consistent with the current knowledge of TATA box position in higher eukaryotes. The abundance of TATA-like elements downstream from the TSS (the 11th and 12th bins) is relatively low. However, a continuous growth of significant hits can be seen in the upstream direction, starting from the 8th bin, and the frequency of hits appears to stabilize in the [–499, –400] region. Although these distant TATA-like elements are generally less conserved, there is no unambiguous feature by which such elements could be distinguished from the known TATA box consensus sequence. However, our searches for unspecified W stretches revealed large numbers of A/T octamers, increasing in the upstream direction from the TSS. Together with the substantial growth of AT content in the 3′ → 5′ direction (from 38% in the 12th bin up to 60% in the first and second bins; Fig. 1a), this provides at least a partial explanation for the abundance of TATA box hits in distant promoter regions. Whether these TATA-like elements are nonfunctional or somehow participate in the regulation of transcription initiation remains unclear. Theoretically, a large number of irrelevant TATA-like elements in the genome may recruit and deplete the TBP pool as well as result in incorrect PIC formation. We assume that, to avoid this, either chromatin keeps the distant parts of the promoter inaccessible to interactions with the TBP complex, or an additional sequence motif cooperating with the TATA box is required for accurate PIC formation. For the latter assumption, the Y Patch may be a possible candidate.
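The general idea of scoring candidate sites with an NFM against an order-1 background can be sketched as a plain log-odds scan. This is a simplified illustration and not MotifScanner's actual implementation: the threshold is a stand-in for the model's own parameters, the 4 nt matrix `TOY_NFM` is a made-up example rather than the matrix used in this study, and all function names are hypothetical.

```python
import math

ALPHABET = "ACGT"


def order1_background(seqs, pseudocount=1.0):
    """Estimate P(next nt | previous nt) from dinucleotide counts."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
            for a in ALPHABET}


def log_odds(window, prev, nfm, background):
    """Log-likelihood ratio of a window: NFM versus order-1 background.

    `prev` is the nucleotide immediately upstream of the window.
    """
    score = 0.0
    for pos, nt in enumerate(window):
        score += math.log(nfm[pos][nt] / background[prev][nt])
        prev = nt
    return score


def scan(seq, nfm, background, threshold=0.0):
    """Return (start, window, score) for windows scoring above threshold.

    Scanning starts at index 1 because the order-1 background needs a
    preceding nucleotide; only the given ("+") strand is scanned.
    """
    w = len(nfm)
    hits = []
    for start in range(1, len(seq) - w + 1):
        context = seq[start - 1:start + w]
        if any(nt not in ALPHABET for nt in context):
            continue  # skip ambiguous bases such as N
        score = log_odds(context[1:], context[0], nfm, background)
        if score > threshold:
            hits.append((start, context[1:], score))
    return hits


def column(consensus):
    """Toy NFM column: 0.85 for the consensus base, 0.05 for the others."""
    return {nt: 0.85 if nt == consensus else 0.05 for nt in ALPHABET}


TOY_NFM = [column(c) for c in "TATA"]  # made-up matrix, illustration only

background = order1_background(["GCGCTATAGCGC"])
print(scan("GCGCTATAGCGC", TOY_NFM, background))
```

Because the background probabilities appear in the denominator, an AT-rich background lowers the score of A/T-rich windows. This is one way to see why the choice of background model affects how many TATA-like hits are reported in the AT-rich upstream bins.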
The lists of sequence elements detected with the MotifScanner search within the [–49, +1] region (TATA box) and [–49, +50] region (Y Patch) were converted from GFF format to FASTA format and further analysed using Geneious Basic 3.5.6 software (www.geneious.com). Based on nucleotide frequencies derived from the 2221 TATA boxes, we created a retrained TATA box NFM. This NFM was extended to a length of 14 nt by adding three positions on each side, filled with the average nucleotide frequencies of the 10th bin. When the MotifScanner queries were repeated to identify TATA boxes consistent with the new NFM, 2491 TATA elements were detected. These hits were analysed with Geneious Basic 3.5.6 software as described above, leading to the creation of an extended nucleotide frequency distribution of rice TATA boxes (Table S1).² Figure 2a shows a graphical representation of the TATA box consensus sequence created with WebLogo (Crooks et al. 2004; http://weblogo.berkeley.edu/).

As seen from our analysis, only 2491 out of 13 046 rice genes (19%) encompass the TATA box in their core promoters. This proportion is even lower than the one recently reported for Arabidopsis genes (29%) using a similar approach (Molina and Grotewold 2005). The number of MotifScanner hits may be slightly increased by changing the parameters of the model (up to 26.6% with p = 0.5); however, such an increase often leads to the detection of less conserved motifs that hardly resemble TATA elements (results not shown). Our attempt to expand the TATA box consensus sequence proved unwarranted, since TBP was shown to bind to 8 bp of DNA in Saccharomyces (Kim et al. 1993). A slight prevalence of C was observed at the extending positions (Table S1),² but this may well be explained by the nucleotide composition of the [–49, +1] region (Fig. 1a). Consequently, the consensus sequence of the rice TATA box may be specified as CTATAWAWA, located in a C-rich region. The extended TATA box NFM produced in this study is broadly consistent with previously released matrices of plant TATA boxes (Table S3).²

Fig. 2. Graphical representation of (a) TATA box and (b) Y Patch nucleotide frequency distributions generated with WebLogo. The overall height of each stack indicates the sequence conservation at that position (measured in bits), whereas the height of the symbols within the stack reflects the relative frequency of the corresponding nucleotide at that position (the numerals on the x-axis are irrespective of the TSS position).

The Y Patch seems to be a more abundant although less conserved sequence motif of rice core promoters. We detected 4997 and 5230 Y Patch elements in the 10th and 11th bins, respectively (Fig. 1c). However, since some core promoters encompassed more than one Y Patch, these values do not represent the actual number of Y Patch positive genes. When redundant hits are disregarded, approximately half of all tested core promoters exhibit at least one Y Patch motif. Based on the nucleotide frequency distribution of all 10 227 Y Patches detected in this study within the core promoters of rice (Table S2),² we designate the Y Patch consensus sequence as CYTCYYCCYC (shown in Fig. 2b). To the best of our knowledge, we present here the first genome-wide NFM and frequency estimates of Y Patch promoter elements.

A strong skew in CG content near the TSS of Arabidopsis genes is an intriguing feature observed by Tatarinova et al. (2003). As in Arabidopsis, our results show that C is considerably more frequent than G in the "+" strand of core promoter regions (Fig. 1a). It has been hypothesized that the CG compositional strand bias might be caused by distinct mutation rates of C in the replication fork (Tatarinova et al. 2003). According to this hypothesis, C → T mutations occur more frequently on the "–" strand, making the "+" strand G-poor. Following this reasoning, the "+" strand should be simultaneously enriched with A. However, no such enrichment is evident in our data set (Fig. 1a), where the frequency of A decreases continually in the downstream direction, irrespective of the TSS position. We therefore suggest that the CG skew phenomenon is better explained by the frequent occurrence of clearly directional Y Patches around TSSs.

² Supplementary data for this article are available on the journal Web site (http://genome.nrc.ca) or may be purchased from the Depository of Unpublished Data, Document Delivery, CISTI, National Research Council Canada, Building M-55, 1200 Montreal Road, Ottawa, ON K1A 0R6, Canada. DUD 3887. For more information on obtaining material, refer to http://cisti-icist.nrc-cnrc.gc.ca/cms/unpub_e.html.

Acknowledgements

We thank Ľubomír Tomáška and Matúš Valach for helpful comments on the manuscript. This work was supported by the Science and Technology Assistance Agency under contract No. APVT-27-028704 and the Slovak Research and Development Agency under contract No. APVV-0770-07.

References

Butler, J.E.F., and Kadonaga, J.T. 2002. The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 16: 2583–2592. doi:10.1101/gad.1026202. PMID:12381658.
Crooks, G.E., Hon, G., Chandonia, J.-M., and Brenner, S.E. 2004. WebLogo: a sequence logo generator. Genome Res. 14: 1188–1190. doi:10.1101/gr.849004. PMID:15173120.
Kikuchi, S., Satoh, K., Nagata, T., Kawagashira, N., Doi, K., Kishimoto, N., et al. 2003. Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science (Washington, D.C.), 301: 376–379. doi:10.1126/science.1081288. PMID:12869764.
Kim, Y., Geiger, J.H., Hahn, S., and Sigler, P.B. 1993. Crystal structure of a yeast TBP/TATA-box complex. Nature (Lond.), 365: 512–520. doi:10.1038/365512a0. PMID:8413604.
Molina, C., and Grotewold, E. 2005. Genome wide analysis of Arabidopsis core promoters. BMC Genomics, 6: 25. doi:10.1186/1471-2164-6-25. PMID:15733318.
Ohler, U., Liao, G.C., Nieman, H., and Rubin, G.M. 2002. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 3: research0087. doi:10.1186/gb-2002-3-12-research0087. PMID:12537576.
Schmid, C.D., Perier, R., Praz, V., and Bucher, P. 2006. EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res. 34: D82–D85. doi:10.1093/nar/gkj146. PMID:16381980.
Shahmuradov, I.A., Gammerman, A.J., Hancock, J.M., Bramley, P.M., and Solovyev, V.V. 2003. PlantProm: a database of plant promoter sequences. Nucleic Acids Res. 31: 114–117. doi:10.1093/nar/gkg041. PMID:12519961.
Suzuki, Y., Tsunoda, T., Sese, J., Taira, H., Mizushima-Sugano, J., Hata, H., et al. 2001. Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res. 11: 677–684. doi:10.1101/gr.GR-1640R. PMID:11337467.
Tatarinova, T., Brover, V., Troukhan, M., and Alexandrov, N. 2003. Skew in CG content near the transcription start site in Arabidopsis thaliana. Bioinformatics, 19(Suppl. 1): i313–i314. doi:10.1093/bioinformatics/btg1043. PMID:12855475.
Thijs, G., Lescot, M., Marchal, K., Rombauts, S., De Moor, B., Rouzé, P., and Moreau, Y. 2001. A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics, 17: 1113–1122. doi:10.1093/bioinformatics/17.12.1113. PMID:11751219.
Yamamoto, Y.Y., Ichida, H., Matsui, M., Obokata, J., Sakurai, T., Satou, M., et al. 2007a. Identification of plant promoter constituents by analysis of local distribution of short sequences. BMC Genomics, 8: 67. doi:10.1186/1471-2164-8-67. PMID:17346352.
Yamamoto, Y.Y., Ichida, H., Abe, T., Suzuki, Y., Sugano, S., and Obokata, J. 2007b. Differentiation of core promoter architecture between plants and mammals revealed by LDSS analysis. Nucleic Acids Res. 35: 6219–6226. doi:10.1093/nar/gkm685. PMID:17855401.
Yang, C., Bolotin, E., Jiang, T., Sladek, F.M., and Martinez, M. 2007. Prevalence of the initiator over the TATA box in human and yeast genes and identification of DNA motifs enriched in human TATA-less core promoters. Gene, 389: 52–65. doi:10.1016/j.gene.2006.09.029. PMID:17123746.