© 2008 Nature Publishing Group http://www.nature.com/naturegenetics

BRIEF COMMUNICATIONS

Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing

Qun Pan1, Ofer Shai1,2, Leo J Lee1,2, Brendan J Frey1,2 & Benjamin J Blencowe1,3

We carried out the first analysis of alternative splicing complexity in human tissues using mRNA-Seq data. New splice junctions were detected in ~20% of multiexon genes, many of which are tissue specific. By combining mRNA-Seq and EST-cDNA sequence data, we estimate that transcripts from ~95% of multiexon genes undergo alternative splicing and that there are ~100,000 intermediate- to high-abundance alternative splicing events in major human tissues. From a comparison with quantitative alternative splicing microarray profiling data, we also show that mRNA-Seq data provide reliable measurements for exon inclusion levels.

Alternative splicing is considered to be a key factor underlying increased cellular and functional complexity in higher eukaryotes [1–3]. From analyses of microarray profiling and EST-cDNA sequence data, it has been estimated that two-thirds of human genes contain one or more alternatively spliced exon [4]. However, because of the limited depth of coverage and sensitivity afforded by conventional sequencing and microarray profiling methods, the extent of human alternative splicing is not known [5]. High-throughput or ‘next generation’ sequencing technologies offer the potential to address this question [6], and several very recent studies have applied analyses of short cDNA read (mRNA-Seq) data from these technologies to survey alternative splicing in mouse tissues and in human and mouse cell lines [7–10]. In this study, we used the Genome Analyzer system of Illumina to survey splicing complexity in diverse, normal human tissues using mRNA-Seq datasets consisting of 17–32 million 32-nucleotide-long reads.
We also assessed the potential of these datasets to provide quantitative measurements for alternative splicing levels.

To assess human tissue alternative splicing complexity, we used mRNA-Seq datasets from whole brain, cerebral cortex, heart, skeletal muscle, lung and liver to search libraries of splice junction sequences that represent ‘known’ splicing events and candidate ‘new’ splicing events. Junction sequences designated as ‘known’ below are those supported by the analysis of aligned EST and cDNA sequences, and candidate ‘new’ splice junction sequences are those corresponding to all hypothetical, additional 5′ to 3′ pairings of splice sites in the same set of genes (Fig. 1a and Supplementary Methods online). Mining of a dataset of 15,702 multiexon UniGene clusters, each containing one or more locus-specific RefSeq cDNA, resulted in the compilation of 257,257 known splice junctions and 2,459,306 candidate new junctions. These junction libraries were searched using reads from the six tissues, and estimates for true-positive junctions were derived as ranges (see below) that exclude or include ‘repeat’ junction reads, that is, those reads that map to more than one splice junction in transcripts from the 15,702 surveyed genes.

In order to assess which reads represent true splice junctions, we trained a logistic regression classifier to discriminate between known junction sequences and a set of control ‘reverse’ junction sequences, in which the 3′ half of each detected known junction sequence is located upstream of the 5′ half. These reverse junction sequences were used as controls to maintain inherent codon, dinucleotide and possible other compositional biases when discriminating between true and false junctions. The classifier was trained using five features that reflect important parameters when discriminating true- and false-positive alignments between mRNA-Seq reads and splice junction regions (Supplementary Methods).
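The junction libraries and the reverse-junction controls described above can be illustrated with a short sketch. This is not the authors' pipeline: the function names, the toy exon sequences and the 4-nucleotide flank used in the example are assumptions made here for illustration, and pairings are restricted to exons of a single gene. The actual junction sequence lengths and the five classifier features are given only in the Supplementary Methods and are not reproduced.

```python
from itertools import combinations


def split_junctions(n_exons, known_pairs):
    """Split all upstream->downstream exon pairings of one gene.

    known_pairs: set of (i, j) exon index pairs, i < j, supported by
    EST/cDNA evidence. Every other pairing is a candidate 'new' junction.
    """
    all_pairs = set(combinations(range(n_exons), 2))
    known = sorted(all_pairs & set(known_pairs))
    new = sorted(all_pairs - set(known_pairs))
    return known, new


def junction_halves(exons, i, j, flank=16):
    """Return (5' half, 3' half): end of exon i and start of exon j."""
    return exons[i][-flank:], exons[j][:flank]


def reverse_control(up_half, down_half):
    """Control sequence with the 3' half placed upstream of the 5' half."""
    return down_half + up_half


# Toy gene with four exons (hypothetical sequences, not real data).
exons = ["AAAACCCC", "GGGGTTTT", "ACGTACGT", "TTTTAAAA"]
known, new = split_junctions(len(exons), {(0, 1), (1, 2), (2, 3)})
up, down = junction_halves(exons, 0, 2, flank=4)
junction = up + down                  # candidate junction joining exon 0 to exon 2
control = reverse_control(up, down)   # same two halves, order swapped
print(new, junction, control)
```

Because the control reuses the same two halves in swapped order, it has the same base composition as the junction it is derived from, which is the property the reverse junctions are meant to preserve.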
The classifier achieves 94% sensitivity at a specificity of 95%, as determined by tenfold cross-validation (Supplementary Fig. 1 online). The parameters learned by the classifier were applied to the known and new junctions on the basis of statistics obtained from sequence reads from each tissue, and the numbers of true versus false junctions were determined in the known and new junction datasets for all six tissues.

When combining the mRNA-Seq data from the six tissues we detected between 128,395 (49.9%) and 130,854 (50.9%) of the 257,257 known junctions (Fig. 1b), whereas only 121 (0.047%) to 135 (0.052%) of the corresponding control junctions were detected, respectively. Each tissue dataset contributed between 18% and 31% of the detected known junctions. Thus, from profiling only six human tissues, we were able to detect approximately half of the splice junctions represented in EST-cDNA databases. This observation could reflect previous results from EST-cDNA analyses and microarray profiling studies indicating that most tissues, including those analyzed in the present study, express ~6,000 to 10,000 mRNA genes, and that brain and liver show relatively high frequencies of alternative splicing compared to other tissues [11–13]. From the mRNA-Seq data, we also detected between 4,294 and 11,099 new splice junctions (Fig. 1b), which corresponds to a detection rate of one or more new splice junctions in 2,948 (18.8%) to 3,788 (24.1%) of the surveyed genes. When combining EST, cDNA and mRNA-Seq data, we observed that more than 85% of the multiexon genes analyzed contain at least one alternative splicing event.

1 Banting and Best Department of Medical Research, University of Toronto, Toronto M5S 3E1, Canada. 2 Department of Electrical and Computer Engineering, University of Toronto, Toronto M5S 3G4, Canada. 3 Department of Molecular Genetics, University of Toronto, Toronto M5S 3E1, Canada. Correspondence should be addressed to B.J.B.
Received 17 July; accepted 19 August; published online 2 November 2008; addendum published after print 28 April 2009; doi:10.1038/ng.259

NATURE GENETICS | VOLUME 40 | NUMBER 12 | DECEMBER 2008

[Figure 1 graphic not reproduced. Panel a: legend New, Known. Panel b: legend Known (repeats), Known (non repeats), New (repeats), New (non repeats); x axis, number of tissues. Panel c: y axis, AS junctions (%); x axis, number of tissues; series, known AS junctions and new AS junctions. Panel d: pairwise grid of Br, Ce, He, Li, Lu, Sk.]

Figure 1 Assessing human alternative splicing complexity using mRNA-Seq data. (a) Diagram showing a gene with ‘known’ splice junctions (blue lines) supported by cDNA-EST evidence. Dashed pink lines represent all hypothetically possible ‘new’ splice junctions, and the solid pink lines indicate a new junction detected using Illumina mRNA-Seq data. Alternative exons are indicated in red. (b) Numbers of known and new splice junctions detected using mRNA-Seq data from human tissues. Each point in the four plots indicates the mean number of junctions detected when comparing data from all possible combinations of the specified numbers of tissues. The light blue and dark blue plots show the numbers of detected known junctions when junction sequences that are repeated elsewhere in the surveyed genes are either included or excluded, respectively. The pink and purple plots show the numbers of new junctions detected when including or excluding repeated sequences. (c) Histograms of the tissue distribution of known and new alternatively spliced junctions. We detected 7,917 known and 2,368 new splice junctions representing evidence for skipping of one or more alternative cassette exons in mRNA-Seq read alignments. The tissue distribution of these junction reads was plotted as the percentage of junctions that appear in one to all six tissues. (d) Tissue distributions of new splice junctions detected in pairs of tissues.
The size of each blue box indicates the number of junctions shared between a given pair of tissues, with the highest number of shared junctions corresponding to the largest box. Br, whole brain; Ce, cerebral cortex; He, heart; Li, liver; Lu, lung; Sk, skeletal muscle.

However, at increased levels of read coverage (that is, 16 to >500 reads per 100 nucleotides), alternative splicing events can be detected in 92–97% of multiexon genes (Supplementary Methods). This represents a substantial increase over a previous estimate (74%) [4] for the proportion of multiexon genes that contain one or more alternative splicing event. Given that our analysis of mRNA-Seq data detects approximately half of known junctions, and that there is an almost linear increase in the detection rate of new junctions as data from each tissue is added (Fig. 1b), we predict that with full coverage the numbers of new junctions would be at least twice those detected in the present data.

To assess the degree to which known and new junctions detected in the mRNA-Seq data may represent tissue-dependent splicing events, we investigated the frequencies at which detected splice junctions formed by skipping of one or more exons are unique to individual tissues, or are common to two or more tissues (Fig. 1c). In each case, there were significantly more junctions that were detected in only one tissue, and these could represent tissue-specific or tissue-restricted splice variants, as well as possible lower-abundance splice variants that are widely expressed but only detected in a single tissue. We also examined the tissue specificity of all (n = 439) new splice junctions that were detected in two of the six tissues. In the plot shown in Figure 1d, the sizes of the blue squares indicate the proportions of the overlaps between new splice junctions when comparing pairs of tissues. This shows that a greater proportion of new splice junctions are commonly detected in whole brain and cerebral cortex, or in skeletal muscle and heart, than the proportion of new junctions commonly detected in any of the other pairs of tissues. This indicates that many of the new splice junctions reflect the physiological origin of the tissues analyzed and therefore likely represent examples of new tissue-specific alternative splicing events, in addition to new alternative splicing events in transcripts with tissue-restricted expression patterns. Supplementary Table 1 online lists genes with more than five new splice junctions. Many of these genes encode ‘giant’ and other muscle-specific proteins, thus revealing a previously unappreciated degree of alternative splicing complexity in transcripts from muscle-specific genes. These findings are consistent with previous proposals that alternative splicing of transcripts encoding some of these proteins has an important role in controlling fundamental mechanical properties of muscle, such as tension and contractility [14].

[Figure 2 graphic not reproduced. Panel a: number of AS events per exon (upper) and number of reads per 100 bases (lower) versus number of exons per gene. Panel b: number of AS events per exon versus number of reads per 100 bases.]

Figure 2 Assessing alternative splicing frequency using mRNA-Seq data. (a) Box plots showing the number of alternative splicing events per exon (AS frequency, upper panel) and mRNA-Seq read coverage (lower panel) for genes with different numbers of exons. The alternative splicing frequency is calculated on the basis of mRNA-Seq data only. Each box shows the lower and upper quartile values, and the white line indicates the median value. The error bars indicate the variation for the rest of the data, and outliers are indicated as black pluses. (b) Alternative splicing frequency (number of alternative splicing events per exon) for genes with different mRNA-Seq read coverage. cDNA, EST and mRNA-Seq data were combined to calculate the number of alternative splicing events in each gene. The median alternative splicing frequency was determined for each gene group and a scale factor was applied to new junctions detected in mRNA-Seq data to account for missing new junctions expected when surveying additional tissues.

As mRNA-Seq data affords the detection of alternative splicing events in transcripts irrespective of their length and associated splicing complexity, the relationship between alternative splicing frequency relative to exon number in genes can be accurately assessed. Accordingly, we determined the median number of alternative splicing events per exon for genes with different numbers of total exons and with similar overall levels of Illumina read coverage (Fig. 2a). Despite the theoretical possibility of a quadratic (n²) increase in the number of alternative splicing possibilities as the number of exons per gene increases, our results indicate that the number of alternative splicing events per gene increases in a near linear fashion (Fig. 2a). Thus, notably, the frequency of alternative splicing detection per exon does not rise in genes with increasing numbers of exons, and this observation suggests that selection pressure may act to generally limit splicing complexity in large genes. This observation facilitates assessment of the total number of alternative splicing events in human tissues.
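The quadratic possibility mentioned above can be made concrete with a simple count. Under the simplifying assumption that any upstream exon may be joined to any downstream exon of the same gene, a gene with n exons has n(n−1)/2 possible junctions, of which n−1 join adjacent exons. The sketch below tabulates the remainder, that is, junctions that skip at least one exon. The function name is hypothetical, and the model ignores alternative 5′ and 3′ splice sites within exons.

```python
def possible_skipping_junctions(n_exons):
    """Count exon-exon pairings that skip at least one exon.

    All upstream->downstream pairings: n(n-1)/2; adjacent pairings: n-1.
    """
    total = n_exons * (n_exons - 1) // 2
    adjacent = n_exons - 1
    return total - adjacent


# Possible skipping junctions per gene and per exon for a few gene sizes.
for n in (5, 10, 20, 40):
    skips = possible_skipping_junctions(n)
    print(n, skips, skips / n)
```

In this model the number of possible skipping junctions per exon keeps rising with exon number, which is the behavior the observed per-exon frequency in Fig. 2a does not show.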
Assuming that the frequency of new alternative splicing events detected in the six different tissues can be extrapolated to other human tissues, it is possible to derive an estimate for the total number of alternative splicing events that can be detected by comparable methods. By combining the rates of detection of new and known alternative splicing events afforded by mRNA-Seq and EST-cDNA data, respectively, we observed that the median number of alternative splicing events per exon is between 0.5 and 0.75 for genes with intermediate to high levels of Illumina sequence coverage (32–256 reads per 100 bases; see Fig. 2b). Given that 175,944 exons were mined from 15,702 multiexon human genes, we predict that on the order of 88,000–132,000 alternative splicing events (0.5–0.75 events per exon × 175,944 exons) of comparable abundance to those detected in the present study are expressed in major human tissues. This estimate further suggests that, on average, there are at least seven alternative splicing events per multiexon human gene.

An important question concerning high-throughput sequencing technologies is their capacity to generate reliable quantitative measurements for alternative splicing levels. To address this question, we compared estimates for percent exon inclusion from the mRNA-Seq data described above with percent inclusion estimates generated from profiling ~5,000 cassette-type alternative exons in the same six human tissues using our previously validated [15], quantitative alternative splicing microarray system (unpublished data, Supplementary Fig. 2a and Supplementary Methods online). When applying a threshold of 20 or more reads per tissue that match any one of the three splice junction sequences representing inclusion and skipping of a cassette exon, there is a high correlation (r = 0.80, n = 1,548) between the alternative splicing microarray- and mRNA-Seq–derived predictions for percent inclusion.
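The comparison above rests on converting junction read counts into a percent inclusion value for each cassette exon. The sketch below shows one plausible way to do this. The exact formula used in the study is described in the Supplementary Methods and is not reproduced here, so the averaging of the two inclusion junctions, the reading of the threshold as a total over the three junctions, and all function and parameter names are assumptions.

```python
import math


def percent_inclusion(c1_a, a_c2, c1_c2, min_reads=20):
    """Estimate percent inclusion of a cassette exon from junction reads.

    c1_a, a_c2: reads matching the two inclusion junctions.
    c1_c2: reads matching the skipping junction.
    Returns None when fewer than min_reads reads match the three
    junctions in total (assumed interpretation of the threshold).
    """
    total = c1_a + a_c2 + c1_c2
    if total == 0 or total < min_reads:
        return None
    inclusion = (c1_a + a_c2) / 2.0  # average the two inclusion junctions (assumption)
    return 100.0 * inclusion / (inclusion + c1_c2)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy read counts (invented for illustration, not taken from the study).
print(percent_inclusion(30, 30, 10))   # enough reads: a value is returned
print(percent_inclusion(5, 5, 5))      # below threshold: None
```

Exons that pass the threshold in a tissue would then be paired with the microarray-derived value for the same exon and tissue, and `pearson_r` applied to the two lists.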
The correlation increases (r = 0.85, n = 546) when a threshold of 50 or more junction-matching reads is used. Predictions for percent inclusion levels from the two systems also agree well for tissue-regulated alternative exons (Supplementary Fig. 2b and Supplementary Information).

Together, the results described above show that mRNA-Seq data can be used to reliably measure alternative splicing levels, in addition to revealing important new insights into alternative splicing complexity in the human transcriptome.

Note: Supplementary information is available on the Nature Genetics website.

ACKNOWLEDGMENTS
We thank S. Luo, I. Khrebtukova and G. Schroth of Illumina Inc. for providing some of the mRNA-Seq datasets used in this analysis. We also thank M. Brudno, Y. Barash, J. Calarco and S. Ahmad for helpful suggestions and comments on the manuscript. B.J.B. and B.J.F. acknowledge support from the Canadian Institutes of Health Research and from Genome Canada through the Ontario Genomics Institute.

AUTHOR CONTRIBUTIONS
Q.P. created the exon and splice junction libraries and performed analyses of the mRNA-Seq, cDNA-EST and microarray data. O.S., L.J.L. and B.J.F. designed and implemented the logistic regression classifier and contributed to the analyses of tissue-specific alternative splicing events. The study was coordinated by B.J.B. The manuscript was prepared by B.J.B. and Q.P., with the participation of O.S., L.J.L. and B.J.F.

Published online at http://www.nature.com/naturegenetics/
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/

1. Matlin, A.J., Clark, F. & Smith, C.W. Nat. Rev. Mol. Cell Biol. 6, 386–398 (2005).
2. Blencowe, B.J. Cell 126, 37–47 (2006).
3. Ben-Dov, C., Hartmann, B., Lundgren, J. & Valcarcel, J. J. Biol. Chem. 283, 1229–1233 (2008).
4. Johnson, J.M. et al. Science 302, 2141–2144 (2003).
5. Sorek, R., Dror, G. & Shamir, R. BMC Genomics 7, 273 (2006).
6. Calarco, J.A. et al. Adv. Exp. Med. Biol. 623, 64–84 (2007).
7. Bainbridge, M.N. et al. BMC Genomics 7, 246 (2006).
8. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621–628 (2008).
9. Cloonan, N. et al. Nat. Methods 5, 613–619 (2008).
10. Sultan, M. et al. Science 321, 956–960 (2008).
11. Su, A.I. et al. Proc. Natl. Acad. Sci. USA 101, 6062–6067 (2004).
12. Zhang, W. et al. J. Biol. 3, 21 (2004).
13. Yeo, G., Holste, D., Kreiman, G. & Burge, C.B. Genome Biol. 5, R74 (2004).
14. Schiaffino, S. & Reggiani, C. Physiol. Rev. 76, 371–423 (1996).
15. Pan, Q. et al. Mol. Cell 16, 929–941 (2004).

Addendum: Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing

Qun Pan, Ofer Shai, Leo J Lee, Brendan J Frey & Benjamin J Blencowe

Nat. Genet. 40, 1413–1415 (2008), published online 2 November 2008; addendum published after print 28 April 2009

© 2009 Nature America, Inc. All rights reserved.

The GEO accession number for the mRNA-Seq datasets is GSE13652.