The Underworld of Regulatory RNA in Complex Organisms

SHOWCASE ON RESEARCH
John Mattick
Institute for Molecular Bioscience, University of Queensland, QLD 4072
The general presumption in molecular biology is that
most genes are synonymous with proteins, and that
RNA primarily acts as a temporary intermediate
(messenger RNA) between gene and protein, assisted
by infrastructural RNAs (ribosomal RNAs, transfer
RNAs, small nucleolar RNAs and spliceosomal RNAs)
that are required for RNA processing and mRNA
translation. This presumption stems from foundation
studies in bacteria almost 50 years ago, and broadly
holds true in the prokaryotes, whose genomes, of which
well over 100 have now been completely sequenced,
overwhelmingly consist of closely-spaced protein-coding sequences separated by 5'- and 3'-flanking
sequences that control gene expression at the
transcriptional and translational levels. Although a
small number of very interesting regulatory RNAs have
been recently identified in bacteria (1, 2), these RNAs
occupy only a small proportion of their genomes.
Thus, in prokaryotes at least, it is clear that proteins
(and their products) not only comprise the main
structural and functional components of the cell, but are
also the main agents by which the system is regulated,
in conjunction with environmental signals.
Do Most Genes also Encode Proteins
in Complex Organisms?
It has been assumed that the same applies in the
multicellular organisms, despite the fact that protein-coding sequences occupy a diminishing percentage of
their genomes as these organisms increase in
complexity, falling to just over 1% in humans. There is
also no strong link between the numbers of protein-coding genes and developmental complexity: the
nematode worm with only 1,000 cells has 50% more
protein-coding genes (~19,000) than insects (~13,500),
and about the same as vertebrates, including humans
(20-25,000), which have fewer than rice (40-50,000). Part of
this anomaly can be explained by expansion of the
numbers of protein isoforms by alternative splicing,
which does increase as a function of developmental
complexity, but it should also be noted that this in turn
requires another layer of regulation.
The presumption that genes are generally synonymous
with proteins has led logically to many subsidiary
assumptions, notably that the vast bulk of sequences
that do not code for proteins or their adjacent
regulatory sequences (including introns and most
intergenic sequences) are evolutionary junk. These
assumptions have become articles of faith, but they are
not necessarily correct.
In fact, we may have seriously misunderstood the
nature of genetic programming and information
transaction in the higher organisms. A range of recent
evidence suggests that the higher organisms have
increasingly co-opted non-protein-coding RNAs as a
means of transmitting regulatory information, and
that this information may be essential to the evolution
and ontogeny of complex organisms.
The Amazing Complexity of the
Mammalian Transcriptome
It is now evident that the vast majority of the genomes
of complex organisms are transcribed, largely into non-protein-coding RNA. Summing the genomic regions
spanned by known genes, mRNAs and spliced ESTs in
the UCSC databases, it is clear that almost 60% of the
human genome is transcribed from at least one strand,
and at least 24% from both strands (3). This is also very
likely an underestimate. Analyses of large-scale full-length
cDNA libraries and genome- or chromosome-wide polling of the transcriptome using oligonucleotide
tiling arrays in humans and mouse show that the
numbers of detectable coding and noncoding RNAs (of
which there are tens of thousands, most of which are
not annotated in genomic databases) are roughly
comparable and regulated by common transcription
factors (4-8). Similar conclusions have been reached in
Drosophila (9). Moreover, for technical reasons these
analyses are likely to overlook many important small
regulatory RNAs such as microRNAs (see below) which
are often present in low amounts and are difficult to
identify by reverse transcription. It is also clear that
antisense transcription is common, not just in imprinted
loci, but across the whole genome (10). Moreover, all
well-characterised loci, including β-globin in mammals
and the bithorax-abdominal AB complex in Drosophila,
express a majority of noncoding transcripts, many of
which are developmentally regulated (11).
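The coverage figures quoted above (the fraction of the genome transcribed from at least one strand, and from both) come from summing the genomic regions spanned by transcripts. A minimal sketch of that kind of calculation follows, assuming transcripts are given as half-open coordinate intervals on each strand. The function names and the toy coordinates are illustrative assumptions, not the pipeline used in ref 3.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in genome coordinates


def merge(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: Iterable[Interval]) -> int:
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge(intervals))


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Regions covered by both interval sets (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def transcribed_fractions(plus: List[Interval], minus: List[Interval],
                          genome_length: int) -> Tuple[float, float]:
    """Return (fraction on at least one strand, fraction on both strands)."""
    either = total_length(plus + minus)
    both = total_length(intersect(plus, minus))
    return either / genome_length, both / genome_length


# Toy example (hypothetical coordinates on a 100-base "genome").
plus_strand = [(0, 30), (20, 50)]
minus_strand = [(40, 70)]
fractions = transcribed_fractions(plus_strand, minus_strand, 100)
```

Because the spans include introns, a calculation of this kind measures the extent of primary transcripts rather than of mature RNAs.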
In addition, it has recently been shown that over half
of the transcripts detected in global tiling arrays are
unique to the largely unstudied poly(A)- and the
nuclear poly(A)+ fractions of the transcriptome, opening
up yet another unexplored world of RNA biology in
higher organisms (7). Cloning and sequencing of these
transcripts has also revealed that, rather than neatly
separated genes, the human genome expresses an
interlaced network of nested and overlapping
transcripts on both strands, where introns of one
transcript harbour exons of another (8, 10, 12).
Transcript overlap and exonic interlacing occur not
only on opposite strands, but also on the same strand,
so that there is often no clear distinction between splice
variants and overlapping and neighbouring genes. Even
loci that encode well-known proteins, such as sonic
hedgehog, have been shown to have previously
unknown exons and novel isoforms that are likely to
have important functions. It is also not uncommon for
a single base pair to be part of an intricate network of
multiple isoforms of overlapping sense and antisense
transcripts, the majority of which are unannotated (3,
7, 12).
The big and as yet largely unanswered question is
whether these noncoding RNAs (ncRNAs) are
meaningful or simply represent 'transcriptional noise'.
The problem is how to find function for these tens of
thousands of ncRNA transcripts (13). One approach is
to examine the patterns of expression and the
subcellular location of the ncRNAs under analysis, as
is already routinely done for proteins whose functions
are unknown. Another approach, pioneered recently
by Willingham et al. (14), is to undertake high-throughput small interfering RNA (siRNA) screens
using in vitro assays (cell-based screens), which have
identified ncRNA regulators of transcription factor
activity, hedgehog signalling and cell viability. Given
the limited scope of such assays, there is clearly a long
road ahead in the functional genomics of ncRNA.
Small Regulatory RNAs Control Development
It is now clear that many aspects of multicellular
differentiation and development are controlled by short
regulatory RNAs called microRNAs (miRNAs).
miRNAs were first discovered almost literally by
putting two and two together − small 22 nucleotide
RNAs called lin-4 and let-7 that regulate development
had been identified in Caenorhabditis elegans, and then
later correlated with the discovery of RNA interference
(RNAi), whereby double stranded RNAs are processed
by an enzyme called Dicer to produce siRNAs of 21-25
nucleotides in length that target cognate mRNA
sequences (and quite possibly others) for destruction.
Subsequent analyses showed that lin-4 and let-7 were
produced by this same pathway from natural double-stranded precursors that are processed from longer
primary transcripts by an enzyme complex called
Drosha. Targeted searches for the existence of such
natural miRNAs revealed that many exist in vivo, and
appear to act either by translation inhibition, after
binding imperfectly to the target (usually the 3'-UTR of
an mRNA) or by target destruction, if matched perfectly,
using pathways that are still being dissected but which
involve a number of important proteins including those
of the Argonaute family of developmental regulators
(for recent reviews see references 15-17). Note that these
RNA signals are digital rather than analog in nature −
they convey no function in themselves, but rather are
strings of bases that recognise a cognate target and in so
doing recruit generic protein complexes to carry out
appropriate actions.
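The "digital" character of these signals can be illustrated with a short sketch: the small RNA is treated purely as a string of bases, and the outcome depends only on how well it pairs with a site in the target. The sketch below assumes strict Watson-Crick pairing (no G:U wobble), scores every window of the mRNA, and uses a hypothetical `min_fraction` threshold for "imperfect" binding. The sequences are invented toy examples, and real target prediction is considerably more involved.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Sequence that would pair perfectly (antiparallel) with `rna`."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def best_site(mirna: str, mrna: str):
    """Return (position, paired_bases) for the best-matching window in mrna.

    Returns (-1, -1) if the mRNA is shorter than the small RNA.
    """
    probe = reverse_complement(mirna)  # what a perfect target site looks like
    n = len(probe)
    best_pos, best_matches = -1, -1
    for pos in range(len(mrna) - n + 1):
        window = mrna[pos:pos + n]
        matches = sum(p == w for p, w in zip(probe, window))
        if matches > best_matches:
            best_pos, best_matches = pos, matches
    return best_pos, best_matches


def predicted_outcome(mirna: str, mrna: str, min_fraction: float = 0.7) -> str:
    """Classify the interaction; min_fraction is an illustrative assumption."""
    pos, matches = best_site(mirna, mrna)
    if pos < 0:
        return "no site"
    if matches == len(mirna):
        return "target destruction"      # perfect match
    if matches / len(mirna) >= min_fraction:
        return "translation inhibition"  # imperfect but recognisable match
    return "no site"


# Toy sequences (hypothetical, far shorter than a real miRNA or 3'-UTR).
small_rna = "UGAGGUAG"
perfect_target = "AAACUACCUCAAAA"
imperfect_target = "AAACUACGUCAAAA"
```

In this toy model the same string yields different outcomes with different targets, which mirrors the point that the RNA carries address information while generic protein complexes carry out the action.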
miRNAs control many aspects of animal and plant
development, including developmental timing, cell
proliferation, left-right patterning, floral development,
neuronal cell fate, apoptosis, hematopoietic
differentiation, adipocyte differentiation and insulin
secretion, and have been shown to be perturbed in
cancer. In mammals, most miRNAs exhibit
developmentally regulated expression patterns in a
variety of cells and tissues, including brain, lung, liver,
spleen, heart, skeletal muscle and embryonal stem
cells. Knockout of the miRNA-producing enzyme
Dicer1 in mice leads to lethality early in development,
with Dicer1-null embryos depleted of stem cells. Dicer-defective embryonic stem cells also exhibit severe
defects in differentiation in vitro as well as in
centromeric silencing. Inactivation of Dicer also causes
developmental arrest in zebrafish embryos (see 16).
Small Regulatory RNAs are
Frequently Derived from Introns
miRNAs are processed by the RNAi machinery from
pre-mRNA-like precursors (RNA polymerase II-synthesised transcripts that are polyadenylated and
capped), at least some of which are polycistronic.
Many miRNAs are sourced from the introns of
protein-coding genes and the remainder from the
introns and the exons of mRNA-like ncRNA genes
(16). In addition, most small nucleolar RNAs
(snoRNAs) in animals and plants (which target other
RNAs for editing) are also derived from introns, again
from both protein-coding and noncoding transcripts
(16), indicating that introns are capable of transmitting
genetic information as RNA, a feature that may be
much more widespread than expected, especially
given the interesting patterns of conservation observed
in these sequences. Thus there may be a highly parallel
output from genes in the higher organisms, with
introns evolving to produce regulatory RNAs in
parallel with protein-coding sequences, and some, if
not many, transcripts evolving only to produce
regulatory RNAs (Fig. 1) (18, 19).
Well over 1,500 miRNAs have already been identified
in animals, and the number is rising rapidly (16, 17,
20). Discovering them by cloning is difficult and biased
towards those that are most abundant. Some miRNAs,
such as that encoded by the lsy-6 locus, which controls
the asymmetry of chemosensory neurons in C. elegans,
are only expressed at low levels and were only
discovered by sensitive genetic screens, which are
difficult if not impossible to carry out in mammals
(21). Consequently, most mammalian miRNAs have
been identified bioinformatically on the basis of a
double-stranded precursor, a match to a known
mRNA, and evolutionary conservation, with
significant subsets being subsequently validated
experimentally. Many miRNAs are more highly
conserved than many protein-coding sequences, which
raises the question why, since the only structure-function relationship that needs to be preserved is
primary sequence recognition. The most likely
explanation is that these miRNAs are involved in
multi-lateral interactions, which would severely restrict
their opportunity for co-variation (16). Indeed most
known miRNAs are predicted to have multiple targets
in the genome (22-25).
AUSTRALIAN BIOCHEMIST
Vol 36 No 3 December 2005
Fig. 1.
The complexity of transcription of
protein-coding (blue) and noncoding
RNA (red) sequences. Transcripts may
be derived from either or both strands,
and be overlapping and interlaced
(see 3,7,8,10,12). Many transcripts
(including many noncoding
transcripts) are alternatively spliced
(not shown). Both exons and introns
may transmit information. Many
miRNAs and all snoRNAs in animals
are sourced from introns (see 16). The
range of types and functions of
noncoding RNAs is unknown.
This figure is reproduced from ref 3,
with permission from Science.
Lack of Conservation does not Necessarily
Mean Lack of Function
The known miRNAs may be
just the tip of the iceberg, and
there may be many others that
have more limited numbers of
targets, a suggestion supported
by a recent analysis that did
not require inter-species
conservation but simply intragenomic sequence matching,
which identified many more
human miRNAs, a significant number of which are
primate-specific (20). That is, co-variation of miRNA
and target can be rapid if only one interaction is
involved, especially if partial mismatches can be
tolerated (up to some threshold of signal recognition).
It would also be expected that many of these
regulatory RNAs may be under positive selection for
adaptive radiation, and indeed given the relatively
stable proteome among mammals, re-wiring the
regulatory circuitry (both in cis and trans, including
the influence of transposons) may be the major route
to phenotypic variation, accompanied by alterations to
the exonic repertoire of the proteome. Indeed if the
transcribed noncoding RNA is largely functional,
much if not most of the genome must be under some
degree of evolutionary selection, although not
necessarily highly conserved.
In this context it should be noted that many functional
long ncRNAs, such as XIST, H19 and Air, do not show
strong conservation between species, and indeed on
average show no more conservation than introns,
which are widely regarded to be non-functional,
although both contain patches of highly conserved
sequences within them (26). Of course, introns may not be
non-functional at all. Although conservation of sequence is a
strong signature of evolutionary constraint (and
therefore of functional importance), the converse does
not necessarily hold, and it may well be that many loci
which are functional as sequence-specific regulatory
signals are evolving quickly (like language) by drift
under milder negative and positive selection.
Significant Fractions of Noncoding Sequences
Show Evidence of Functional Constraint
Analysis of the conservation patterns between
humans and mouse suggests that at least 5-10% of
these genomes (an order of magnitude greater than
the protein-coding sequences) is under selective
constraint (27). Indeed multiple alignments of
genomes show that from yeasts to vertebrates, in
order of increasing genome size and general biological
complexity, increasing fractions of conserved bases are
found to lie outside of the exons of known protein-coding genes. Studies on the CFTR and SIM2 loci in
many different mammalian species have shown that
noncoding regions of vertebrate genomes are
conserved with interesting patterns that are not
obvious from pairwise comparisons alone, i.e., that
sequences that are not conserved between some
species are conserved between others, indicating that
more of these sequences convey information than is at
first apparent, and that those that have diverged (in
some lineages) have more likely done so because of
adaptive radiation rather than neutral drift.
A new class of noncoding elements in animals, called
ultraconserved elements, is found in intronic and
intergenic regions, and are strongly associated with
genes encoding developmental regulators, RNA
binding sequences and some alternative splice sites.
These sequences (like rRNA sequences) are far more
conserved than protein-coding sequences, showing
almost perfect conservation among mammals, as well
as in chicken, and in many cases extending back to
fishes (28). Our recent results also suggest that
extended regions of the mammalian genome (many of
which do not show strong evidence of primary
sequence conservation) are refractory to transposon
insertion, and that these regions are conserved among
placental mammals and even marsupials, despite the fact that
most recognisable transposon-derived sequences have
entered since their divergence (C. Simons, M.
Pheasant, I.V. Makunin and J.S. Mattick, submitted for
publication). None of these observations are easily
reconcilable with orthodox cis-acting protein-based
models of gene regulation.
Many Complex Molecular Genetic
Phenomena are Directed by RNA
A wide range of complex genetic phenomena in
higher eukaryotes, including co-suppression, gene
silencing, RNAi, DNA methylation, imprinting,
transvection, position effect variegation and
transinduction (all of which are or may be related) are
known to be linked directly or indirectly to RNA
signalling. These pathways are intimately linked to the
control of gene expression at the chromatin level, and
it appears likely that the epigenetic trajectories that
underpin differentiation and development are directed
by RNA, albeit tuned by regulatory proteins that
provide positional information to guide and correct
stochastic errors in the process. The RNAi pathway
and noncoding RNAs have been shown to be central
to the formation of silenced chromatin and
chromosomal dynamics in animals, plants, fungi and
protozoa. It has also been shown that synthetic siRNAs
targeted to a range of promoters can alter gene
expression and the local patterns of DNA and histone
methylation (see 16, 29), providing strong support for
the notion that endogenous small RNAs may perform
similar functions in vivo.
It is well established that chromatin modification
occurs at many different loci in different cell lineages
and that this process is central to developmental
ontogeny. There must either be an army of sequence-specific DNA binding proteins that specify these
modifications, which is not the case − there are only a
limited number of DNA and histone modifying
enzymes (methylases, acetylases and deacetylases etc.)
− or these enzymes must be directed to their sites of
action by some other signal, most logically sequence-specific RNAs. Consistent with this idea, there is now
evidence that chromodomains and SET domains found
in many chromatin-modifying proteins are modules
that bind RNA or structures that include RNA (see
16,30). Such signals would also potentially solve the
conundrum of how to select from the huge number of
transcription factor binding sites that exist in the
genome. In this context it is interesting to note that
triplexes, which may contain RNA, are very common
in human chromosomes (31) and at least some
transcription factors, including the zinc finger protein
Sp1 and Y-box proteins, have also been shown to have
high affinity for RNA or structures that include RNA
(see 11). Indeed, it is quite possible that the large
numbers of nucleic acid- and chromatin-binding
proteins whose specificity is unknown are in fact
recognising various forms of RNA:DNA and
RNA:RNA complexes.
While not yet demonstrated, trans-acting guide RNAs
may also regulate alternative splicing (which is currently
not at all well explained by combinatoric effects of
protein splicing factors). It has been shown by many
studies that splicing patterns may be easily altered in
cultured cells and in whole animals by introducing
small antisense RNAs (see 11, 32).
The Relationship of Regulation to
Organised Complexity
Finally, it appears that the critical role of regulation in
organised complex systems has been underestimated,
although it would seem intuitively to be obvious. That
is, the proportion of the information in any system that
must be devoted to regulation or control systems
(management) increases with the complexity of that
system, if the system is to function in an integrated way
(33). Consistent with this, it has been shown that the
number of regulatory protein genes in bacteria increases
roughly as a quadratic function of gene number.
Moreover extrapolation of the empirical curve suggests
that the point where the numbers of new regulatory
genes will exceed the number of new functional genes is
just above the observed upper limit of known bacterial
genome sizes, implying (although not proving) that the
complexity of prokaryotic organisms has been limited
probably throughout most of evolution by a primitive
regulatory system based on proteins alone (33, 34).
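The extrapolation argument can be made concrete with a small worked sketch. It assumes, purely for illustration, that the number of regulatory genes follows R(N) = a·N² with no linear term, where N is the total gene number; the coefficient `a` below is a hypothetical value, not one fitted to bacterial genomes. Under that assumption each additional gene brings about dR/dN = 2aN regulators, so new regulators outnumber new non-regulatory genes once 2aN exceeds 1/2, that is, beyond N = 1/(4a).

```python
def regulatory_genes(n_genes: float, a: float) -> float:
    """Assumed quadratic scaling of regulators with total gene number."""
    return a * n_genes ** 2


def marginal_regulatory_fraction(n_genes: float, a: float) -> float:
    """dR/dN = 2aN: the share of each newly added gene that is regulatory."""
    return 2 * a * n_genes


def crossover_gene_number(a: float) -> float:
    """Gene number at which new regulators equal new non-regulatory genes.

    Solves 2aN = 1/2, giving N = 1/(4a).
    """
    return 1 / (4 * a)


# Hypothetical coefficient chosen only to give round numbers.
a = 2.5e-5
n_star = crossover_gene_number(a)
regulators_at_crossover = regulatory_genes(n_star, a)
```

With this illustrative coefficient the crossover falls at 10,000 genes, at which point a quarter of all genes in the model are regulatory. The actual position of the limit depends on the empirical curve discussed in refs 33 and 34.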
If this is correct it implies that regulatory
combinatorics alone is not sufficient to overcome this
limitation (as there is no a priori reason why
prokaryotes could not have evolved more complex
promoters) and that the more complex eukaryotes
must have solved the problem another way,
presumably by the adoption of RNA as a more
efficient means of encoding and transmitting
regulatory information. This system now appears to
dominate our genomic programming, and there is
good reason to think that multicellular ontogeny is
primarily driven by an endogenous program that
unfolds during embryogenesis and is driven by RNA
regulatory networks. Indeed, it seems more and more
likely that biology has undergone its own
analog-to-digital transition, beginning over one billion years ago,
and ironically that which was dismissed as junk may
well be the critical adaptation that led to complex and
ultimately cognitive organisms like us.
References
1. Storz, G., Opdyke, J.A., and Zhang, A. (2004) Curr.
Opin. Microbiol. 7, 140-144
2. Wilderman, P.J., Sowa, N.A., FitzGerald, D.J.,
FitzGerald, P.C., Gottesman, S., Ochsner, U.A., and
Vasil, M.L. (2004) Proc. Natl. Acad. Sci. USA 101,
9792-9797
3. Frith, M.C., Pheasant, M., and Mattick, J.S. (2005)
Eur. J. Hum. Genet. 13, 894-897
4. Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J.,
Bono, H., Kondo, S., Nikaido, I., Osato, N., et al.
(2002) Nature 420, 563-573
5. Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S.,
Urban, A.E., Zhu, X., Rinn, J.L., Tongprasit, W., et
al. (2004) Science 306, 2242-2246
6. Cawley, S., Bekiranov, S., Ng, H.H., Kapranov, P.,
Sekinger, E.A., Kampa, D., Piccolboni, A.,
Sementchenko, V., et al. (2004) Cell 116, 499-509
7. Cheng, J., Kapranov, P., Drenkow, J., Dike, S.,
Brubaker, S., Patel, S., Long, J., Stern, D., et al.
(2005) Science 308, 1149-1154
8. Carninci, P., Kasukawa, T., Katayama, S., Gough,
J., Frith, M.C., Maeda, N., Oyama, R., Ravasi, T., et
al. (2005) Science 309, 1559-1563
9. Stolc, V., Gauhar, Z., Mason, C., Halasz, G., van
Batenburg, M.F., Rifkin, S.A., Hua, S., Herreman,
T., et al. (2004) Science 306, 655-660
10. Katayama, S., Tomaru, Y., Kasukawa, T., Waki, K.,
Nakanishi, M., Nakamura, M., Nishida, H., Yap,
C.C., et al. (2005) Science 309, 1564-1566
11. Mattick, J.S. (2003) Bioessays 25, 930-939
12. Kapranov, P., Drenkow, J., Cheng, J., Long, J., Helt,
G., Dike, S., and Gingeras, T.R. (2005) Genome Res.
15, 987-997
13. Mattick, J.S. (2005) Science 309, 1527-1528
14. Willingham, A.T., Orth, A.P., Batalov, S., Peters,
E.C., Wen, B.G., Aza-Blanc, P., Hogenesch, J.B.,
and Schultz, P.G. (2005) Science 309, 1570-1573
15. Bartel, D.P. (2004) Cell 116, 281-297
16. Mattick, J.S., and Makunin, I.V. (2005) Hum. Mol.
Genet. 14, R121-R132
17. Zamore, P.D., and Haley, B. (2005) Science 309,
1519-1524
18. Mattick, J.S., and Gagen, M.J. (2001) Mol. Biol. Evol.
18, 1611-1630
19. Mattick, J.S. (2001) EMBO Rep. 2, 986-991
20. Bentwich, I., Avniel, A., Karov, Y., Aharonov, R.,
Gilad, S., Barad, O., Barzilai, A., Einat, P., et al.
(2005) Nat. Genet. 37, 766-770
21. Johnston, R.J., and Hobert, O. (2003) Nature 426,
845-849
22. John, B., Enright, A.J., Aravin, A., Tuschl, T.,
Sander, C., and Marks, D.S. (2004) PLoS Biol. 2,
e363
23. Kiriakidou, M., Nelson, P.T., Kouranov, A., Fitziev,
P., Bouyioukos, C., Mourelatos, Z., and
Hatzigeorgiou, A. (2004) Genes Dev. 18, 1165-1178
24. Lim, L.P., Lau, N.C., Garrett-Engele, P., Grimson,
A., Schelter, J.M., Castle, J., Bartel, D.P., Linsley,
P.S., and Johnson, J.M. (2005) Nature 433, 769-773
25. Xie, X., Lu, J., Kulbokas, E.J., Golub, T.R., Mootha,
V., Lindblad-Toh, K., Lander, E.S., and Kellis, M.
(2005) Nature 434, 338-345
26. Pang, K.C., Frith, M.C., and Mattick, J.S. (2006)
Trends Genet., in press
27. Waterston, R.H., Lindblad-Toh, K., Birney, E.,
Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R.,
Ainscough, R., et al. (2002) Nature 420, 520-562
28. Bejerano, G., Pheasant, M., Makunin, I., Stephen,
S., Kent, W.J., Mattick, J.S., and Haussler, D. (2004)
Science 304, 1321-1325
29. Ting, A.H., Schuebel, K.E., Herman, J.G., and
Baylin, S.B. (2005) Nat. Genet. 37, 906-910
30. Krajewski, W.A., Nakamura, T., Mazo, A., and
Canaani, E. (2005) Mol. Cell Biol. 25, 1891-1899
31. Ohno, M., Fukagawa, T., Lee, J.S., and Ikemura, T.
(2002) Chromosoma 111, 201-213
32. Kole, R., Vacek, M., and Williams, T. (2004)
Oligonucleotides 14, 65-74
33. Mattick, J.S., and Gagen, M.J. (2005) Science 307,
856-858
34. Mattick, J.S. (2004) Nat. Rev. Genet. 5, 316-323