SHOWCASE ON RESEARCH

The Underworld of Regulatory RNA in Complex Organisms

John Mattick
Institute for Molecular Bioscience, University of Queensland, QLD 4072

The general presumption in molecular biology is that most genes are synonymous with proteins, and that RNA primarily acts as a temporary intermediate (messenger RNA) between gene and protein, assisted by infrastructural RNAs (ribosomal RNAs, transfer RNAs, small nucleolar RNAs and spliceosomal RNAs) that are required for RNA processing and mRNA translation. This presumption stems from foundation studies in bacteria almost 50 years ago, and broadly holds true in the prokaryotes, whose genomes, of which well over 100 have now been completely sequenced, overwhelmingly consist of closely-spaced protein-coding sequences separated by 5'- and 3'-flanking sequences that control gene expression at the transcriptional and translational levels. Although a small number of very interesting regulatory RNAs have been recently identified in bacteria (1, 2), these RNAs occupy only a small proportion of their genomes. Thus, in prokaryotes at least, it is clear that proteins (and their products) not only comprise the main structural and functional components of the cell, but are also the main agents by which the system is regulated, in conjunction with environmental signals.

Do Most Genes also Encode Proteins in Complex Organisms?

It has been assumed that the same applies in the multicellular organisms, despite the fact that protein-coding sequences occupy a diminishing percentage of their genomes as these organisms increase in complexity, falling to just over 1% in humans. There is also no strong link between the numbers of protein-coding genes and developmental complexity: the nematode worm with only 1,000 cells has about 40% more protein-coding genes (~19,000) than insects (~13,500), and about the same as vertebrates, including humans (20-25,000), which have fewer than rice (40-50,000).
Part of this anomaly can be explained by expansion of the numbers of protein isoforms by alternative splicing, which does increase as a function of developmental complexity, but it should also be noted that this in turn requires another layer of regulation.

The presumption that genes are generally synonymous with proteins has led logically to many subsidiary assumptions, notably that the vast bulk of sequences that do not code for proteins or their adjacent regulatory sequences (including introns and most intergenic sequences) are evolutionary junk. These assumptions have become articles of faith, but they are not necessarily correct. In fact, we may have seriously misunderstood the nature of genetic programming and information transaction in the higher organisms. A range of recent evidence suggests that the higher organisms have increasingly co-opted non-protein-coding RNAs as a means of transmitting regulatory information, and that this information may be essential to the evolution and ontogeny of complex organisms.

The Amazing Complexity of the Mammalian Transcriptome

It is now evident that the vast majority of the genomes of complex organisms are transcribed, largely into non-protein-coding RNA. Summing the genomic regions spanned by known genes, mRNAs and spliced ESTs in the UCSC databases, it is clear that almost 60% of the human genome is transcribed from at least one strand, and at least 24% from both strands (3). This is also very likely an underestimate. Analyses of large-scale full-length cDNA libraries and genome- or chromosome-wide polling of the transcriptome using oligonucleotide tiling arrays in humans and mouse show that the numbers of detectable coding and noncoding RNAs (of which there are tens of thousands, most of which are not annotated in genomic databases) are roughly comparable and regulated by common transcription factors (4-8). Similar conclusions have been reached in Drosophila (9).
Moreover, for technical reasons these analyses are likely to overlook many important small regulatory RNAs such as microRNAs (see below), which are often present in low amounts and are difficult to identify by reverse transcription. It is also clear that antisense transcription is common, not just in imprinted loci, but across the whole genome (10). Moreover, all well characterised loci, including β-globin in mammals and the bithorax-abdominal AB complex in Drosophila, express a majority of noncoding transcripts, many of which are developmentally regulated (11). In addition, it has recently been shown that over half of the transcripts detected in global tiling arrays are unique to the largely unstudied poly(A)- and the nuclear poly(A)+ fractions of the transcriptome, opening up yet another unexplored world of RNA biology in higher organisms (7).

Cloning and sequencing of these transcripts has also revealed that, rather than neatly separated genes, the human genome expresses an interlaced network of nested and overlapping transcripts on both strands, where introns of one transcript harbour exons of another (8, 10, 12). Transcript overlap and exonic interlacing occur not only on opposite strands, but also on the same strand, so that there is often no clear distinction between splice variants and overlapping and neighbouring genes. Even loci that encode well-known proteins, such as sonic hedgehog, have been shown to have previously unknown exons and novel isoforms that are likely to have important functions. It is also not uncommon for a single base pair to be part of an intricate network of multiple isoforms of overlapping sense and antisense transcripts, the majority of which are unannotated (3, 7, 12).

The big and as yet largely unanswered question is whether these noncoding RNAs (ncRNAs) are meaningful or simply represent 'transcriptional noise'.
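The strand-coverage figures quoted above (the fraction of the genome transcribed from at least one strand, and from both) amount to unions and intersections of transcript intervals. The following is only a minimal sketch of that bookkeeping, not the method used in reference 3: the function names, the half-open coordinate convention and the toy coordinates are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), in base pairs (assumed convention)


def merge(intervals: Iterable[Interval]) -> List[Interval]:
    """Return the union of the intervals as a sorted list of disjoint intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlapping or touching: extend the last merged interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: Iterable[Interval]) -> int:
    """Sum of lengths of disjoint intervals."""
    return sum(end - start for start, end in intervals)


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersection of two sorted lists of disjoint intervals."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def strand_coverage(plus: Iterable[Interval],
                    minus: Iterable[Interval],
                    genome_length: int) -> Tuple[float, float]:
    """Fractions of the genome covered by (at least one strand, both strands)."""
    p, m = merge(plus), merge(minus)
    both = total_length(intersect(p, m))
    either = total_length(p) + total_length(m) - both
    return either / genome_length, both / genome_length


# Toy coordinates (hypothetical): two overlapping plus-strand transcripts
# and one minus-strand transcript on a 1,000 bp "genome".
either_frac, both_frac = strand_coverage([(0, 300), (200, 500)], [(400, 700)], 1000)
print(either_frac, both_frac)
```

Because overlapping transcripts on the same strand are merged before counting, nested and interlaced transcripts of the kind described above do not inflate the totals.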
The problem is how to find function for these tens of thousands of ncRNA transcripts (13). One approach is to examine the patterns of expression and the subcellular location of the ncRNAs under analysis, as is already routinely done for proteins whose functions are unknown. Another approach, pioneered recently by Willingham et al. (14), is to undertake high-throughput small interfering RNA (siRNA) screens using in vitro assays (cell-based screens), which have identified ncRNA regulators of transcription factor activity, hedgehog signalling and cell viability. Given the limited scope of such assays, there is clearly a long road ahead in the functional genomics of ncRNA.

Small Regulatory RNAs Control Development

It is now clear that many aspects of multicellular differentiation and development are controlled by short regulatory RNAs called microRNAs (miRNAs). miRNAs were first discovered almost literally by putting two and two together: small 22 nucleotide RNAs called lin-4 and let-7 that regulate development had been identified in Caenorhabditis elegans, and then later correlated with the discovery of RNA interference (RNAi), whereby double-stranded RNAs are processed by an enzyme called Dicer to produce siRNAs of 21-25 nucleotides in length that target cognate mRNA sequences (and quite possibly others) for destruction. Subsequent analyses showed that lin-4 and let-7 were produced by this same pathway from natural double-stranded precursors that are processed from longer primary transcripts by an enzyme complex called Drosha.
Targeted searches for the existence of such natural miRNAs revealed that many exist in vivo, and appear to act either by translation inhibition, after binding imperfectly to the target (usually the 3'-UTR of an mRNA), or by target destruction, if matched perfectly, using pathways that are still being dissected but which involve a number of important proteins including those of the Argonaute family of developmental regulators (for recent reviews see references 15-17). Note that these RNA signals are digital rather than analog in nature: they convey no function in themselves, but rather are strings of bases that recognise a cognate target and in so doing recruit generic protein complexes to carry out appropriate actions.

miRNAs control many aspects of animal and plant development, including developmental timing, cell proliferation, left-right patterning, floral development, neuronal cell fate, apoptosis, hematopoietic differentiation, adipocyte differentiation and insulin secretion, and have been shown to be perturbed in cancer. In mammals, most miRNAs exhibit developmentally regulated expression patterns in a variety of cells and tissues, including brain, lung, liver, spleen, heart, skeletal muscle and embryonal stem cells. Knockout of the miRNA-producing enzyme Dicer1 in mice leads to lethality early in development, with Dicer1-null embryos depleted of stem cells. Dicer-defective embryonic stem cells also exhibit severe defects in differentiation in vitro as well as in centromeric silencing. Inactivation of Dicer also causes developmental arrest in zebrafish embryos (see 16).

Small Regulatory RNAs are Frequently Derived from Introns

miRNAs are processed by the RNAi machinery from pre-mRNA-like precursors (RNA polymerase II-synthesised transcripts that are polyadenylated and capped), at least some of which are polycistronic.
Many miRNAs are sourced from the introns of protein-coding genes and the remainder from the introns and the exons of mRNA-like ncRNA genes (16). In addition, most small nucleolar RNAs (snoRNAs) in animals and plants (which target other RNAs for editing) are also derived from introns, again from both protein-coding and noncoding transcripts (16), indicating that introns are capable of transmitting genetic information as RNA, a feature that may be much more widespread than expected, especially given the interesting patterns of conservation observed in these sequences. Thus there may be a highly parallel output from genes in the higher organisms, with introns evolving to produce regulatory RNAs in parallel with protein-coding sequences, and some, if not many, transcripts evolving only to produce regulatory RNAs (Fig. 1) (18, 19).

Well over 1,500 miRNAs have already been identified in animals, and the number is rising rapidly (16, 17, 20). Discovering them by cloning is difficult and biased towards those that are most abundant. Some miRNAs, such as that encoded by the lsy-6 locus, which controls the asymmetry of chemosensory neurons in C. elegans, are only expressed at low levels and were only discovered by sensitive genetic screens, which are difficult if not impossible to carry out in mammals (21). Consequently, most mammalian miRNAs have been identified bioinformatically on the basis of a double-stranded precursor, a match to a known mRNA, and evolutionary conservation, with significant subsets being subsequently validated experimentally.

Many miRNAs are more highly conserved than many protein-coding sequences, which raises the question why, since the only structure-function relationship that needs to be preserved is primary sequence recognition. The most likely explanation is that these miRNAs are involved in multi-lateral interactions, which would severely restrict their opportunity for co-variation (16).
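The distinction drawn earlier between imperfect pairing (translational inhibition) and perfect pairing (target destruction), and the bioinformatic criterion of "a match to a known mRNA", can be illustrated with a deliberately simplified sketch. It assumes ungapped Watson-Crick pairing only, ignoring G:U wobble pairs, bulges and any weighting of particular positions; the mismatch threshold and the toy sequences are hypothetical, and real target-prediction methods (22-25) are considerably more elaborate.

```python
from typing import Optional, Tuple

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Sequence that would pair perfectly with `rna`, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def best_site(mirna: str, target: str) -> Optional[Tuple[int, int]]:
    """Slide the miRNA's reverse complement along the target (no gaps).

    Returns (position, mismatches) for the first window with the fewest
    mismatches, or None if the target is shorter than the miRNA.
    """
    site = reverse_complement(mirna)
    best: Optional[Tuple[int, int]] = None
    for pos in range(len(target) - len(site) + 1):
        window = target[pos:pos + len(site)]
        mismatches = sum(1 for x, y in zip(site, window) if x != y)
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    return best


def predicted_outcome(mirna: str, target: str, max_mismatches: int = 4) -> str:
    """Map the best site onto the two modes of action described in the text.

    `max_mismatches` is an arbitrary illustrative threshold, not a measured value.
    """
    hit = best_site(mirna, target)
    if hit is None or hit[1] > max_mismatches:
        return "no site"
    if hit[1] == 0:
        return "target destruction"
    return "translational inhibition"


# Toy sequences (hypothetical, far shorter than a real ~22 nt miRNA).
mirna = "UGAGGUAGUA"
print(predicted_outcome(mirna, "GGGUACUACCUCAAAA"))  # perfectly matched site
print(predicted_outcome(mirna, "GGGUACUAGCUCAAAA"))  # one mismatch in the site
```

The sketch also makes the co-variation argument concrete: a miRNA that must keep matching many such sites at once has little freedom to change, whereas a miRNA with a single target can drift together with it.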
Indeed most known miRNAs are predicted to have multiple targets in the genome (22-25).

Fig. 1. The complexity of transcription of protein-coding (blue) and noncoding RNA (red) sequences. Transcripts may be derived from either or both strands, and be overlapping and interlaced (see 3, 7, 8, 10, 12). Many transcripts (including many noncoding transcripts) are alternatively spliced (not shown). Both exons and introns may transmit information. Many miRNAs and all snoRNAs in animals are sourced from introns (see 16). The range of types and functions of noncoding RNAs is unknown. This figure is reproduced from ref 3, with permission from Science.

Lack of Conservation does not Necessarily Mean Lack of Function

The known miRNAs may be just the tip of the iceberg, and there may be many others that have more limited numbers of targets, a suggestion supported by a recent analysis that did not require inter-species conservation but simply intragenomic sequence matching, which identified many more human miRNAs, a significant number of which are primate-specific (20). That is, co-variation of miRNA and target can be rapid if only one interaction is involved, especially if partial mismatches can be tolerated (up to some threshold of signal recognition). It would also be expected that many of these regulatory RNAs may be under positive selection for adaptive radiation, and indeed given the relatively stable proteome among mammals, re-wiring the regulatory circuitry (both in cis and trans, including the influence of transposons) may be the major route to phenotypic variation, accompanied by alterations to the exonic repertoire of the proteome. Indeed if the transcribed noncoding RNA is largely functional, much if not most of the genome must be under some degree of evolutionary selection, although not necessarily highly conserved.
In this context it should be noted that many functional long ncRNAs, such as XIST, H19 and Air, do not show strong conservation between species, and indeed on average show no more conservation than introns, which are widely regarded to be non-functional, although both contain patches of highly conserved sequences within them (26). Of course, this may not be the case at all. Although conservation of sequence is a strong signature of evolutionary constraint (and therefore of functional importance), the converse does not necessarily hold, and it may well be that many loci which are functional as sequence-specific regulatory signals are evolving quickly (like language) by drift under milder negative and positive selection.

Significant Fractions of Noncoding Sequences Show Evidence of Functional Constraint

Analysis of the conservation patterns between humans and mouse suggests that at least 5-10% of these genomes (an order of magnitude greater than the protein-coding sequences) is under selective constraint (27). Indeed multiple alignments of genomes show that from yeasts to vertebrates, in order of increasing genome size and general biological complexity, increasing fractions of conserved bases are found to lie outside of the exons of known protein-coding genes. Studies on the CFTR and SIM2 loci in many different mammalian species have shown that noncoding regions of vertebrate genomes are conserved with interesting patterns that are not obvious from pairwise comparisons alone, i.e., that sequences that are not conserved between some species are conserved between others, indicating that more of these sequences convey information than is at first apparent, and that those that have diverged (in some lineages) have more likely done so because of adaptive radiation rather than neutral drift.
A new class of noncoding elements in animals, called ultraconserved elements, is found in intronic and intergenic regions, and is strongly associated with genes encoding developmental regulators, RNA binding sequences and some alternative splice sites. These sequences (like rRNA sequences) are far more conserved than protein-coding sequences, showing almost perfect conservation among mammals, as well as in chicken, and in many cases extending back to fishes (28). Our recent results also suggest that extended regions of the mammalian genome (many of which do not show strong evidence of primary sequence conservation) are refractory to transposon insertion, and that these regions are conserved among mammals and even marsupials, despite the fact that most recognisable transposon-derived sequences have entered since their divergence (C. Simons, M. Pheasant, I.V. Makunin and J.S. Mattick, submitted for publication). None of these observations are easily reconcilable with orthodox cis-acting protein-based models of gene regulation.

Many Complex Molecular Genetic Phenomena are Directed by RNA

A wide range of complex genetic phenomena in higher eukaryotes, including co-suppression, gene silencing, RNAi, DNA methylation, imprinting, transvection, position effect variegation and transinduction (all of which are or may be related), are known to be linked directly or indirectly to RNA signalling. These pathways are intimately linked to the control of gene expression at the chromatin level, and it appears likely that the epigenetic trajectories that underpin differentiation and development are directed by RNA, albeit tuned by regulatory proteins that provide positional information to guide and correct stochastic errors in the process.
The RNAi pathway and noncoding RNAs have been shown to be central to the formation of silenced chromatin and chromosomal dynamics in animals, plants, fungi and protozoa. It has also been shown that synthetic siRNAs targeted to a range of promoters can alter gene expression and the local patterns of DNA and histone methylation (see 16, 29), providing strong support for the notion that endogenous small RNAs may perform similar functions in vivo.

It is well established that chromatin modification occurs at many different loci in different cell lineages and that this process is central to developmental ontogeny. There must either be an army of sequence-specific DNA binding proteins that specify these modifications, which is not the case (there are only a limited number of DNA and histone modifying enzymes: methylases, acetylases, deacetylases etc.), or these enzymes must be directed to their sites of action by some other signal, most logically sequence-specific RNAs. Consistent with this idea, there is now evidence that chromodomains and SET domains found in many chromatin-modifying proteins are modules that bind RNA or structures that include RNA (see 16, 30). Such signals would also potentially solve the conundrum of how to select from the huge number of transcription factor binding sites that exist in the genome. In this context it is interesting to note that triplexes, which may contain RNA, are very common in human chromosomes (31) and at least some transcription factors, including the zinc finger protein Sp1 and Y-box proteins, have also been shown to have high affinity for RNA or structures that include RNA (see 11). Indeed, it is quite possible that the large numbers of nucleic acid- and chromatin-binding proteins whose specificity is unknown are in fact recognising various forms of RNA:DNA and RNA:RNA complexes.
While not yet demonstrated, trans-acting guide RNAs may also regulate alternative splicing (which is currently not at all well explained by combinatoric effects of protein splicing factors). It has been shown by many studies that splicing patterns may be easily altered in cultured cells and in whole animals by introducing small antisense RNAs (see 11, 32).

The Relationship of Regulation to Organised Complexity

Finally, it appears that the critical role of regulation in organised complex systems has been underestimated, although it would seem intuitively to be obvious. That is, the proportion of the information in any system that must be devoted to regulation or control systems (management) increases with the complexity of that system, if the system is to function in an integrated way (33). Consistent with this, it has been shown that the number of regulatory protein genes in bacteria increases roughly as a quadratic function of gene number. Moreover, extrapolation of the empirical curve suggests that the point where the number of new regulatory genes will exceed the number of new functional genes is just above the observed upper limit of known bacterial genome sizes, implying (although not proving) that the complexity of prokaryotic organisms has been limited, probably throughout most of evolution, by a primitive regulatory system based on proteins alone (33, 34). If this is correct it implies that regulatory combinatorics alone is not sufficient to overcome this limitation (as there is no a priori reason why prokaryotes could not have evolved more complex promoters) and that the more complex eukaryotes must have solved the problem another way, presumably by the adoption of RNA as a more efficient means of encoding and transmitting regulatory information.
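The scaling argument can be made explicit under the simplest reading of "roughly quadratic". The notation below is an assumption introduced for illustration and does not come from the source: N is the total number of genes, R(N) the number of regulatory genes, and a an unspecified positive constant that would have to be fitted to the bacterial data in references 33 and 34.

```latex
% Assumption: an exactly quadratic relation with an unspecified constant a > 0.
R(N) = a N^{2}
\qquad\Longrightarrow\qquad
\frac{dR}{dN} = 2 a N .

% Of each newly added gene, a share dR/dN is regulatory and 1 - dR/dN is not.
% New regulatory genes outnumber new non-regulatory genes once
\frac{dR}{dN} > 1 - \frac{dR}{dN}
\;\Longleftrightarrow\;
2 a N > \tfrac{1}{2}
\;\Longleftrightarrow\;
N > N^{*} = \frac{1}{4a},
\qquad
\frac{R(N^{*})}{N^{*}} = a N^{*} = \frac{1}{4}.
```

On this idealised reading the crossover occurs when a quarter of all genes are regulatory, and beyond it each further gene costs more in regulation than it adds in function, which is the sense in which a purely protein-based regulatory system would cap genome size.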
This system now appears to dominate our genomic programming, and there is good reason to think that multicellular ontogeny is primarily driven by an endogenous program that unfolds during embryogenesis and is driven by RNA regulatory networks. Indeed, it seems more and more likely that biology has undergone its own analog-to-digital transition, beginning over one billion years ago, and ironically that which was dismissed as junk may well be the critical adaptation that led to complex and ultimately cognitive organisms like us.

References

1. Storz, G., Opdyke, J.A., and Zhang, A. (2004) Curr. Opin. Microbiol. 7, 140-144
2. Wilderman, P.J., Sowa, N.A., FitzGerald, D.J., FitzGerald, P.C., Gottesman, S., Ochsner, U.A., and Vasil, M.L. (2004) Proc. Natl. Acad. Sci. USA 101, 9792-9797
3. Frith, M.C., Pheasant, M., and Mattick, J.S. (2005) Eur. J. Hum. Genet. 13, 894-897
4. Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono, H., Kondo, S., Nikaido, I., Osato, N., et al. (2002) Nature 420, 563-573
5. Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S., Urban, A.E., Zhu, X., Rinn, J.L., Tongprasit, W., et al. (2004) Science 306, 2242-2246
6. Cawley, S., Bekiranov, S., Ng, H.H., Kapranov, P., Sekinger, E.A., Kampa, D., Piccolboni, A., Sementchenko, V., et al. (2004) Cell 116, 499-509
7. Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., et al. (2005) Science 308, 1149-1154
8. Carninci, P., Kasukawa, T., Katayama, S., Gough, J., Frith, M.C., Maeda, N., Oyama, R., Ravasi, T., et al. (2005) Science 309, 1559-1563
9. Stolc, V., Gauhar, Z., Mason, C., Halasz, G., van Batenburg, M.F., Rifkin, S.A., Hua, S., Herreman, T., et al. (2004) Science 306, 655-660
10. Katayama, S., Tomaru, Y., Kasukawa, T., Waki, K., Nakanishi, M., Nakamura, M., Nishida, H., Yap, C.C., et al. (2005) Science 309, 1564-1566
11. Mattick, J.S. (2003) Bioessays 25, 930-939
12. Kapranov, P., Drenkow, J., Cheng, J., Long, J., Helt, G., Dike, S., and Gingeras, T.R. (2005) Genome Res. 15, 987-997
13. Mattick, J.S. (2005) Science 309, 1527-1528
14. Willingham, A.T., Orth, A.P., Batalov, S., Peters, E.C., Wen, B.G., Aza-Blanc, P., Hogenesch, J.B., and Schultz, P.G. (2005) Science 309, 1570-1573
15. Bartel, D.P. (2004) Cell 116, 281-297
16. Mattick, J.S., and Makunin, I.V. (2005) Hum. Mol. Genet. 14, R121-R132
17. Zamore, P.D., and Haley, B. (2005) Science 309, 1519-1524
18. Mattick, J.S., and Gagen, M.J. (2001) Mol. Biol. Evol. 18, 1611-1630
19. Mattick, J.S. (2001) EMBO Rep. 2, 986-991
20. Bentwich, I., Avniel, A., Karov, Y., Aharonov, R., Gilad, S., Barad, O., Barzilai, A., Einat, P., et al. (2005) Nat. Genet. 37, 766-770
21. Johnston, R.J., and Hobert, O. (2003) Nature 426, 845-849
22. John, B., Enright, A.J., Aravin, A., Tuschl, T., Sander, C., and Marks, D.S. (2004) PLoS Biol. 2, e363
23. Kiriakidou, M., Nelson, P.T., Kouranov, A., Fitziev, P., Bouyioukos, C., Mourelatos, Z., and Hatzigeorgiou, A. (2004) Genes Dev. 18, 1165-1178
24. Lim, L.P., Lau, N.C., Garrett-Engele, P., Grimson, A., Schelter, J.M., Castle, J., Bartel, D.P., Linsley, P.S., and Johnson, J.M. (2005) Nature 433, 769-773
25. Xie, X., Lu, J., Kulbokas, E.J., Golub, T.R., Mootha, V., Lindblad-Toh, K., Lander, E.S., and Kellis, M. (2005) Nature 434, 338-345
26. Pang, K.C., Frith, M.C., and Mattick, J.S. (2006) Trends Genet., in press
27. Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., et al. (2002) Nature 420, 520-562
28. Bejerano, G., Pheasant, M., Makunin, I., Stephen, S., Kent, W.J., Mattick, J.S., and Haussler, D. (2004) Science 304, 1321-1325
29. Ting, A.H., Schuebel, K.E., Herman, J.G., and Baylin, S.B. (2005) Nat. Genet. 37, 906-910
30. Krajewski, W.A., Nakamura, T., Mazo, A., and Canaani, E. (2005) Mol. Cell Biol. 25, 1891-1899
31. Ohno, M., Fukagawa, T., Lee, J.S., and Ikemura, T. (2002) Chromosoma 111, 201-213
32. Kole, R., Vacek, M., and Williams, T. (2004) Oligonucleotides 14, 65-74
33. Mattick, J.S., and Gagen, M.J. (2005) Science 307, 856-858
34. Mattick, J.S. (2004) Nat. Rev. Genet. 5, 316-323

AUSTRALIAN BIOCHEMIST Vol 36 No 3 December 2005