COMMENTARY | special feature

Is sequencing enlightenment ending the dark age of the transcriptome?

Piero Carninci

© 2009 Nature America, Inc. All rights reserved.

Sequencing-based technologies for RNA discovery are playing a key role in deciphering the transcriptome and hold the potential to provide us with a census of RNAs and their functions.

A decade ago, before the appearance of the first draft of the human and other genomes, genes were identified via expressed sequence tags. In the pregenome era, the main goal was to identify mRNAs produced by protein-coding genes. The computational tools used to predict these genes were far from accurate and required experimental data to support gene predictions. Contrary to expectations, just at the turn of the new millennium, initial genome studies suggested that the total number of genes was much lower than estimated earlier by analyses of expressed sequence tag data1, giving an upper bound of approximately 25,000 protein-coding mammalian genes. The first doubts about this estimate came from the analysis of full-length cDNAs, which showed that in addition to protein-coding genes, there is an even larger number of RNA transcripts that do not encode proteins and are instead defined as noncoding RNAs (ncRNAs)2. Additionally, the analysis of 1% of the human genome by the ENCODE (Encyclopedia of DNA Elements) Consortium not only suggested that 93% of the genome is transcribed but also revealed an unprecedented number of splice variants. These data put the number of predicted proteome variants at least fourfold above the number of genes3. Indeed, an average cell contains at least 300,000 mRNA molecules and perhaps even more if we consider rare ncRNAs or RNAs restricted to specific cell compartments. Different cells express different sets of mRNAs and ncRNAs, and their expression levels span 4–5 orders of magnitude from the highly to the rarely expressed RNA.
Piero Carninci is at the Omics Science Center, RIKEN Yokohama Institute, Yokohama, Kanagawa, Japan. e-mail: [email protected]

Classical Sanger (dideoxy) sequencing of full-length cDNA, even if implemented with strategies to subtract the known genes to more efficiently search for new RNAs4, still requires the cloning of individual cDNAs before analysis of the clones; the largest full-length sequencing efforts so far have been limited to about 100,000 cDNAs2. The main limitation of this approach is the need for very extensive handling and sequencing of individual cDNA clones. The advantage is that once the full-length cDNAs are sequenced, they not only serve to identify mRNAs, ncRNAs and their isoforms, but they can also be used in functional assays.

Tiling DNA microarrays, which probe the whole nonrepetitive part of the genome at high resolution5, are faster alternatives for transcript detection in that they instantaneously allow the comprehensive assessment of mRNA and ncRNA expression. Together with tagging technologies6 that identify novel RNA start and termination sites, tiling arrays have been instrumental in revealing a seemingly overwhelming degree of transcriptome complexity. However, they suffer from a lack of sensitivity in detecting rare transcripts and from background owing to cross-hybridization of highly related sequences, making the analysis of repeat regions impossible. Additionally, tiling arrays cannot be used to identify new splice junctions: they will identify exons but do not provide their connections. Even arrays dedicated to the detection of predicted or known splice variants cannot detect novel splice sites that have not yet been experimentally identified or predicted in silico.

The arrival of second-generation sequencing has advanced unbiased transcriptome studies.
The technology, known as RNA-sequencing (or RNA-seq), is procedurally quite simple: RNA is converted to a cDNA library without a lengthy cloning procedure, and a single cDNA preparation can yield hundreds of millions of short reads that are then computationally aligned to the genome. Sequences mapping to a single exon are generally unambiguously assigned to the corresponding gene, often at a coverage of hundreds of sequences for each RNA. Chances are good that even rare RNAs can be identified in the library, although full assembly from the short reads is still unlikely for transcripts present in one copy per cell at the current coverage. With increased sequencing depth this may change, and it may become possible that RNA-seq will sensitively detect the whole transcriptome in a biological sample. Indeed, a growing number of studies published in this journal and elsewhere7–11 have used RNA-seq to comprehensively investigate transcription.

One of the obvious technical advantages of RNA-seq is the avoidance of spurious, false positive detection of RNA, an inherent problem of microarrays owing to cross-hybridization. RNAs that are produced by highly similar members of paralogous genes present an alignment challenge, but recently developed strategies also allow for the mapping of reads to splice sites or to paralogous sequences9. With read length increasing, it will become even easier to identify the exact location of most sequence reads.

nature methods | VOL.6 NO.10 | OCTOBER 2009 | 711

[Figure 1 panel labels: Genome (exons 1–5); RNA structure; Start site; RNA-seq]

Figure 1 | Challenges in assembling RNA-seq data. A hypothetical locus often produces multiple RNAs (top), which may affect assembly of RNA-seq into an unambiguous gene model. A given annotated protein-coding gene (the known annotated region is represented by black boxes, and known exons are numbered 1–5) produces protein-coding mRNAs (gray boxes and arrows) and other noncoding RNAs (large red arrows). RNA-seq produces multiple overlapping signals (bottom), represented by black and red tiled arrows. Although RNA-seq allows reconstruction of exon structure and neighboring exon-exon junction definition, the connectivity of exons as well as all the splice combinations of individual transcripts may not be fully reconstructed. By looking at RNA-seq data, the reconstruction of connectivity in the various transcripts between distant exons is challenging. Additionally, RNAs have multiple transcription start sites (middle, small red arrows), identified, for instance, by cap analysis gene expression (CAGE) tags, which cannot be detected efficiently by RNA-seq.

In one of the first uses of RNA-seq, Mortazavi and colleagues developed an RNA-seq method in which oligo-dT–selected mouse mRNAs are fragmented before synthesis of random-primed, first-strand cDNA9. This RNA fragmentation proved essential for uniform coverage of all exons. Random priming performed on full-length mRNA would show a strong preference for a few particular positions on the mRNAs and thus bias coverage. Although the method was limited to the short 25-nucleotide reads available in the first version of the Illumina GA sequencer, the authors impressively provided a clear pattern of novel splice sites. These findings explore the world of protein-coding mRNAs, as noncoding RNAs often lack poly(A) tails, and their data suggest that a good fraction of protein-coding exons has already been identified because only a few percent of the tags mapped to intergenic regions.

In parallel, Sean Grimmond and colleagues went for even larger transcriptome coverage in human embryonic stem cells and embryoid bodies using Applied Biosystems’ sequencing by ligation (SOLiD) technology10.
They reasoned that RNAs are often transcribed bidirectionally and often overlap to form sense-antisense pairs. To capture these sense-antisense RNA pairs, they devised a template switch strategy that maintains the 5′-3′ directionality of the original RNAs in the shotgun library: the reverse transcriptase adds (albeit at low efficiency) some extra bases at the end of the first-strand cDNA, so the second-strand cDNA can be primed from the overhang. Another important procedural step was that they subtracted ribosomal RNAs from the total RNA preparation before cDNA library preparation; otherwise ribosomal RNAs would comprise a large part of the library, affecting sequencing yield. The authors showed that sense-antisense pairs are more concentrated at the 3′ end of genes.

A remaining challenge is the detection of rare RNAs expressed only in a few subtypes of cells. This is a key issue in analyzing complex tissues, such as the brain and those formed during embryonic development. A first step toward meeting this challenge has been made by Surani and colleagues, who prepared seven libraries from the limited amount of RNA in single mouse oocytes (cells in an early developmental stage)11. Owing to the protocol, which is based on oligo-dT priming followed by A-tailing of cDNA and subsequent cDNA fragmentation before sequencing, the direction of the RNA is lost. There are no protocols yet for the removal of ribosomal RNA when only a few nanograms of RNA are available. Also, oocytes, though technically single cells, are much larger than an average cell, which leaves room for the development of technologies to miniaturize single-cell sequencing–based assays.

It is certainly reassuring that different sequencing and library preparation technologies have pointed unambiguously to similar levels of complexity and all have provided solutions for tackling this complexity.
Indeed, RNA-seq is contributing to the comprehensive analysis of splice events and novel splicing combinations. Even with relatively ‘shallow’ coverage, such as less than 10 million 27-nucleotide short sequences7, researchers found more than 2,000 new exons for each cell line derived from kidney and B cells. Deeper sequencing of longer fragments allowed investigators to detect millions of splice sites and ~90,000–145,000 novel splice site candidates per tissue8–11. This clearly supports the notion that multiexon genes are usually alternatively spliced, but these studies have also demonstrated that alternative splicing is restricted to only a fraction of exons. RNA-seq also deals much better with RNA derived from repeat elements10, which had previously been excluded from arrays because of cross-hybridization. Notably, even 20-nucleotide sequence tags have been suitable for detection of widespread expression of retrotransposons12.

Have we addressed all the challenges yet? A final unambiguous assembly of different splice isoforms is still outstanding because the connectivity between distant exons that do not share sequencing reads cannot be unequivocally predicted (Fig. 1). The ENCODE analysis working group is currently developing computational methods to assemble RNA-seq reads according to RNA models with accurate splice sites and to predict the most likely splice variants based on the expression of each exon. At the same time, sequencing technology is constantly improving, reads are longer (and thus the assembly of full mRNA sequences is becoming simpler), and greater support is being provided with the second- and third-generation sequencing technologies. The Roche 454 Life Sciences sequencer can produce over a million sequences around 500 nucleotides long; this will be helpful for assembling splice sites.
The ABI SOLiD and the Illumina GAIIX have not only increased the sequencing length to 50 and 75 bases, respectively, but have also developed methods for sequencing from both ends of the cDNA fragments to help in connecting more distant exons.

Another challenge for RNA-seq is how to distinguish the various start and end sites of RNAs. It is becoming evident that there are often multiple overlapping RNAs encoded from the same genome region, and intron-derived RNAs are recycled to produce functional ncRNAs such as microRNAs. Another source of complexity comes from the secondary processing of mRNAs, which produces shorter, likely functional, RNAs. Thus, protein-coding genes are associated with a plethora of short ncRNAs, including short RNAs associated with promoters13, transcripts arising around termination sites and even exons. A fraction of these RNAs are produced by a novel cleavage and recapping mechanism, resulting in capped RNAs that start in the middle of coding exons or in untranslated regions. These naturally truncated RNAs are likely to be ncRNAs that overlap larger mRNAs13. Another complication arises from the broad nature of many promoters14, which produce various capped RNAs from multiple transcription start sites. Technologies that identify the cap structure in such mixtures are needed to distinguish the RNA fragments obtained by RNA-seq. At present RNA-seq does not perform well at unambiguously identifying transcription start sites, and RNA-seq protocols need improvement to simultaneously decipher the long, short and capped RNAs so the RNAs’ function can be assessed.
Some of the third-generation sequencers such as those from Pacific Biosciences and Oxford Nanopore (which will be able to read thousands of nucleotides15 of single cDNAs) may ultimately meet these challenges: their long sequences will quantitatively represent complete RNAs, and the use of tags and linkers that mark cap sites and other modifications will allow an all-in-one determination of transcriptome structure, including start and termination sites and the mapping of regulatory elements such as promoters. The accurate sequence of coding sequences will also help directed cloning of open reading frames in experimental verification of alternative splice isoforms16,17.

Although many challenges are ahead, the direction is becoming clearer, and I am beginning to wonder if the dark age of the transcriptome is giving way to rays of light.

1. Liang, F. et al. Nat. Genet. 25, 239–240 (2000).
2. Carninci, P. et al. Science 309, 1559–1563 (2005).
3. ENCODE Project Consortium et al. Nature 447, 799–816 (2007).
4. Carninci, P. et al. Genome Res. 13, 1273–1289 (2003).
5. Kapranov, P. et al. Science 316, 1484–1488 (2007).
6. Harbers, M. & Carninci, P. Nat. Methods 2, 495–502 (2005).
7. Sultan, M. et al. Science 321, 956–960 (2008).
8. Pan, Q., Shai, O., Lee, L.J., Frey, B.J. & Blencowe, B.J. Nat. Genet. 40, 1413–1415 (2008).
9. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621–628 (2008).
10. Cloonan, N. et al. Nat. Methods 5, 613–619 (2008).
11. Tang, F. et al. Nat. Methods 6, 377–382 (2009).
12. Faulkner, G.J. et al. Nat. Genet. 41, 563–571 (2009).
13. Fejes-Toth, K. et al. Nature 457, 1028–1032 (2009).
14. Carninci, P. et al. Nat. Genet. 38, 626–635 (2006).
15. Turner, D.J., Keane, T.M., Sudbery, I. & Adams, D.J. Mamm. Genome 20, 327–338 (2009).
16. Djebali, S. et al. Nat. Methods 5, 629–635 (2008).
17. Salehi-Ashtiani, K. et al. Nat. Methods 5, 597–600 (2008).
Engineered fluorescent proteins: innovations and applications

Michael W Davidson & Robert E Campbell

Despite expansion of the fluorescent protein and optical highlighter palette into the orange to far-red range of the visible spectrum, achieving performance equivalent to that of EGFP has continued to elude protein engineers.

Evolving proteins, evolving tools

During the past decade and a half, intrinsically fluorescent proteins have been under intense evolutionary pressure for ‘fitness’, not in the wild, but rather for utility in live-cell imaging experiments. This unnatural course of evolution has occurred on the benches of protein engineers around the world who have helped to drive progress in the ever-expanding repertoire of fluorescence imaging technologies. An underlying theme that has guided advancements in fluorescent protein engineering is that, all other factors being equal, redder is better. It is generally accepted that excitation with longer-wavelength light entails less phototoxicity for the cells or tissue being examined and decreased autofluorescence and scattering. These desirable factors mean that red-shifted fluorophores generally provide improved contrast (owing to decreased background fluorescence) and superior performance in whole-organism imaging (owing to higher tissue ‘transparency’).

Early efforts to engineer red-shifted Aequorea victoria GFP (avGFP) variants led to the development of enhanced GFP (EGFP) and yellow fluorescent proteins with emission maxima at approximately 507 nm and 529 nm, respectively (versus 508 nm for wild type)1. For a time, however, it appeared that fluorescent protein engineering had hit a ‘yellow’ wall in efforts to red-shift fluorescence emission. Fortuitously, this barrier had already been surmounted by natural evolution, as was revealed in October 1999 with a report that the Discosoma sp. mushroom anemone harbored a fluorescent protein homolog, commonly known as DsRed, emitting in the orange-red region (583 nm)2.
Counterbalancing this favorable shift to the red were several undesirable properties, including oligomerization, ‘contamination’ by a green component and sluggish chromophore development, which dampened some of the initial enthusiasm. The discovery of DsRed (and other Anthozoa fluorescent proteins of various hues) had a twofold impact on the

Michael W. Davidson is at the National High Magnetic Field Laboratory and Department of Biological Science, Florida State University, Tallahassee, Florida, USA. Robert E. Campbell is at the University of Alberta, Department of Chemistry, Edmonton, Alberta, Canada. e-mail: [email protected]