Pan et al. 2008 - University of Toronto

© 2008 Nature Publishing Group http://www.nature.com/naturegenetics
BRIEF COMMUNICATIONS
Deep surveying of alternative
splicing complexity in the
human transcriptome by
high-throughput sequencing
Qun Pan1, Ofer Shai1,2, Leo J Lee1,2, Brendan J Frey1,2 &
Benjamin J Blencowe1,3
We carried out the first analysis of alternative splicing
complexity in human tissues using mRNA-Seq data. New splice
junctions were detected in ~20% of multiexon genes, many
of which are tissue specific. By combining mRNA-Seq and
EST-cDNA sequence data, we estimate that transcripts from
~95% of multiexon genes undergo alternative splicing and
that there are ~100,000 intermediate- to high-abundance
alternative splicing events in major human tissues. From a
comparison with quantitative alternative splicing microarray
profiling data, we also show that mRNA-Seq data provide
reliable measurements for exon inclusion levels.
Alternative splicing is considered to be a key factor underlying
increased cellular and functional complexity in higher eukaryotes1–3.
From analyses of microarray profiling and EST-cDNA sequence data,
it has been estimated that two-thirds of human genes contain one or
more alternatively spliced exon4. However, because of the limited
depth of coverage and sensitivity afforded by conventional sequencing
and microarray profiling methods, the extent of human alternative
splicing is not known5. High-throughput or ‘next generation’ sequencing technologies offer the potential to address this question6, and
several very recent studies have applied analyses of short cDNA read
(mRNA-Seq) data from these technologies to survey alternative
splicing in mouse tissues and in human and mouse cell lines7–10. In
this study, we used the Genome Analyzer system of Illumina to survey
splicing complexity in diverse, normal human tissues using mRNA-Seq datasets consisting of 17–32 million 32-nucleotide-long reads. We
also assessed the potential of these datasets to provide quantitative
measurements for alternative splicing levels.
To assess human tissue alternative splicing complexity, we used
mRNA-Seq datasets from whole brain, cerebral cortex, heart, skeletal
muscle, lung and liver to search libraries of splice junction sequences
that represent ‘known’ splicing events and candidate ‘new’ splicing
events. Junction sequences designated as ‘known’ below are those
supported by the analysis of aligned EST and cDNA sequences, and
candidate ‘new’ splice junction sequences are those corresponding to
all hypothetical, additional 5′ to 3′ pairings of splice sites in the same
set of genes (Fig. 1a; and Supplementary Methods online). Mining of
a dataset of 15,702 multiexon UniGene clusters, each containing one
or more locus-specific RefSeq cDNA, resulted in the compilation of
257,257 known splice junctions and 2,459,306 candidate new junctions. These junction libraries were searched using reads from the six
tissues, and estimates for true-positive junctions were derived as
ranges (see below) that exclude or include ‘repeat’ junction reads,
that is, those reads that map to more than one splice junction in
transcripts from the 15,702 surveyed genes.
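The construction of the two junction libraries can be illustrated with a minimal sketch. It assumes a simplified gene model with one donor and one acceptor site per exon, so every ordered exon pair stands for one possible 5′ to 3′ splice-site pairing; the exon names and the `junction_libraries` helper are hypothetical and are not taken from the Supplementary Methods.

```python
from itertools import combinations


def junction_libraries(exons, known):
    """Split every 5'-to-3' exon pairing of one gene into known and candidate-new sets.

    exons: exon identifiers ordered 5' to 3' along the gene.
    known: (upstream, downstream) pairs supported by EST/cDNA alignments.
    """
    # combinations() preserves input order, so each pair is (upstream, downstream).
    all_pairs = set(combinations(exons, 2))
    known_pairs = all_pairs & set(known)
    new_pairs = all_pairs - known_pairs
    return known_pairs, new_pairs


# Hypothetical four-exon gene in which exon E2 is a known cassette exon.
exons = ["E1", "E2", "E3", "E4"]
known = {("E1", "E2"), ("E2", "E3"), ("E3", "E4"), ("E1", "E3")}
known_pairs, new_pairs = junction_libraries(exons, known)
print(sorted(new_pairs))  # [('E1', 'E4'), ('E2', 'E4')]
```

Because the candidate set grows with the number of exon pairs in each gene, it is much larger than the known set, in line with the roughly tenfold difference between the two libraries reported above.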
In order to assess which reads represent true splice junctions, we
trained a logistic regression classifier to discriminate between known
junction sequences and a set of control ‘reverse’ junction sequences, in
which the 3′ half of each detected known junction sequence is located
upstream of the 5′ half. These reverse junction sequences were used as
controls to maintain inherent codon, dinucleotide and possible other
compositional biases when discriminating between true and false
junctions. The classifier was trained using five features that reflect
important parameters when discriminating true- and false-positive
alignments between mRNA-Seq reads and splice junction regions
(Supplementary Methods). The classifier achieves 94% sensitivity at
a specificity of 95%, as determined by tenfold cross-validation
(Supplementary Fig. 1 online). The parameters learned by the
classifier were applied to the known and new junctions on the basis
of statistics obtained from sequence reads from each tissue, and the
numbers of true versus false junctions were determined in the known
and new junction datasets for all six tissues.
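A minimal sketch of the control construction and of the evaluation metric follows. The five classifier features are described only in the Supplementary Methods and are not reproduced here, so the scores below are hypothetical classifier outputs; the function names and the example sequence are likewise illustrative assumptions.

```python
def reverse_junction(junction_seq):
    """Build a control by placing the 3' half of a junction sequence upstream of its 5' half."""
    mid = len(junction_seq) // 2
    return junction_seq[mid:] + junction_seq[:mid]


def sensitivity_specificity(true_scores, control_scores, threshold):
    """Sensitivity on true junctions and specificity on controls at one score threshold."""
    true_positives = sum(score >= threshold for score in true_scores)
    true_negatives = sum(score < threshold for score in control_scores)
    return true_positives / len(true_scores), true_negatives / len(control_scores)


# Hypothetical 16-nt junction: 8 nt from the upstream exon, 8 nt from the downstream exon.
junction = "ACGTACGG" + "TTCAGGCA"
control = reverse_junction(junction)  # same composition, halves swapped

# Hypothetical classifier scores for known junctions and for their reverse controls.
known_scores = [0.90, 0.80, 0.95, 0.40]
control_scores = [0.10, 0.20, 0.60, 0.05]
sens, spec = sensitivity_specificity(known_scores, control_scores, threshold=0.5)
```

Swapping the halves keeps the nucleotide content of each control identical to that of its source junction, which is the property the reverse controls are meant to preserve.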
When combining the mRNA-Seq data from the six tissues we
detected between 128,395 (49.9%) and 130,854 (50.9%) of the
257,257 known junctions (Fig. 1b), whereas only 121 (0.04%) to
135 (0.052%) of the corresponding control junctions were detected,
respectively. Each tissue dataset contributed between 18% and 31% of
the detected known junctions. Thus, from profiling only six human
tissues, we were able to detect approximately half of the splice
junctions represented in EST-cDNA databases. This observation
could reflect previous results from EST-cDNA analyses and microarray
profiling studies indicating that most tissues, including those analyzed
in the present study, express ~6,000 to 10,000 mRNA genes, and that
brain and liver show relatively high frequencies of alternative splicing
compared to other tissues11–13.
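The detection fractions quoted for known junctions follow directly from the counts given above, as this short check shows. The helper name `percent` is illustrative.

```python
KNOWN_JUNCTIONS = 257_257  # size of the known-junction library


def percent(count, total):
    """Express count as a percentage of total."""
    return 100.0 * count / total


low = percent(128_395, KNOWN_JUNCTIONS)   # repeat junction reads excluded
high = percent(130_854, KNOWN_JUNCTIONS)  # repeat junction reads included
print(f"{low:.1f}% to {high:.1f}%")  # 49.9% to 50.9%
```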
From the mRNA-Seq data, we also detected between 4,294 and
11,099 new splice junctions (Fig. 1b), which corresponds to a
detection rate of one or more new splice junction in 2,948 (18.8%)
to 3,788 (24.1%) of the surveyed genes. When combining EST, cDNA
and mRNA-Seq data, we observed that more than 85% of the multiexon genes analyzed contain at least one alternative splicing event.
1Banting and Best Department of Medical Research, University of Toronto, Toronto M5S 3E1, Canada. 2Department of Electrical and Computer Engineering, University
of Toronto, Toronto M5S 3G4, Canada. 3Department of Molecular Genetics, University of Toronto, Toronto M5S 3E1, Canada. Correspondence should be addressed to
B.J.B. ([email protected]).
Received 17 July; accepted 19 August; published online 2 November 2008; addendum published after print 28 April 2009; doi:10.1038/ng.259
NATURE GENETICS | VOLUME 40 | NUMBER 12 | DECEMBER 2008
[Figure 1 plots not reproduced. Panel a: gene diagram with new and known
junctions. Panel b: number of known junctions and number of new junctions
versus number of tissues (1–6), for known (repeats), known (non repeats),
new (repeats) and new (non repeats). Panel c: AS junctions (%) versus
number of tissues, for known AS junctions and new AS junctions. Panel d:
pairwise grid of tissues (Br, Ce, He, Li, Lu, Sk).]
Figure 1 Assessing human alternative splicing complexity using mRNA-Seq
data. (a) Diagram showing a gene with ‘known’ splice junctions (blue lines)
supported by cDNA-EST evidence. Dashed pink lines represent all
hypothetically possible ‘new’ splice junctions, and the solid pink lines
indicate a new junction detected using Illumina mRNA-Seq data. Alternative
exons are indicated in red. (b) Numbers of known and new splice junctions
detected using mRNA-Seq data from human tissues. Each point in the four
plots indicates the mean number of junctions detected when comparing data
from all possible combinations of the specified numbers of tissues. The light
blue and dark blue plots show the numbers of detected known junctions
when junction sequences that are repeated elsewhere in the surveyed genes
are either included or excluded, respectively. The pink and purple plots show
the numbers of new junctions detected when including or excluding repeated
sequences. (c) Histograms of the tissue distribution of known and new
alternatively spliced junctions. We detected 7,917 known and 2,368 new
splice junctions representing evidence for skipping of one or more alternative
cassette exons in mRNA-Seq read alignments. The tissue distribution of
these junction reads was plotted as the percentage of junctions that appear
in one to all six tissues. (d) Tissue distributions of new splice junctions
detected in pairs of tissues. The size of each blue box indicates the number
of junctions shared between a given pair of tissues, with the highest number
of shared junctions corresponding to the largest box. Br, whole brain; Ce,
cerebral cortex; He, heart; Li, liver; Lu, lung; Sk, skeletal muscle.
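The summaries plotted in panels b and c can be sketched as follows, assuming each tissue's detected junctions are available as a set of identifiers. The data and function names are hypothetical, and only three tissues are used to keep the example small.

```python
from collections import Counter
from itertools import combinations


def mean_detected(junctions_by_tissue, k):
    """Mean number of distinct junctions over all combinations of k tissues (panel b)."""
    sizes = [
        len(set().union(*(junctions_by_tissue[t] for t in combo)))
        for combo in combinations(junctions_by_tissue, k)
    ]
    return sum(sizes) / len(sizes)


def tissue_count_distribution(junctions_by_tissue):
    """Percentage of junctions detected in exactly 1, 2, ... tissues (panel c)."""
    tissues_per_junction = Counter()
    for junctions in junctions_by_tissue.values():
        tissues_per_junction.update(junctions)  # each set adds 1 per junction
    histogram = Counter(tissues_per_junction.values())
    total = len(tissues_per_junction)
    return {n: 100.0 * histogram[n] / total for n in sorted(histogram)}


# Hypothetical detections in three tissues.
detected = {
    "Br": {"j1", "j2", "j3"},
    "Ce": {"j1", "j2"},
    "He": {"j4"},
}
curve = [mean_detected(detected, k) for k in (1, 2, 3)]
distribution = tissue_count_distribution(detected)  # {1: 50.0, 2: 50.0}
```

Averaging over all combinations of a given size removes the dependence of the curve on the order in which tissues are added.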
[Figure 2 plots not reproduced. Panel a: number of AS events per exon
(upper) and number of reads per 100 bases (lower), by number of exons per
gene (bins 1–4 through >50). Panel b: number of AS events per exon, by
number of reads per 100 bases (bins up to ≥2,048).]
Figure 2 Assessing alternative splicing frequency using mRNA-Seq data.
(a) Box plots showing the number of alternative splicing events per exon
(AS frequency, upper panel) and mRNA-Seq read coverage (lower panel) for
genes with different numbers of exons. The alternative splicing frequency is
calculated on the basis of mRNA-Seq data only. Each box shows the lower and
upper quartile values, and the white line indicates the median value. The
error bars indicate the variation for the rest of the data, and outliers are
indicated as black pluses. (b) Alternative splicing frequency (number of
alternative splicing events per exon) for genes with different mRNA-Seq read
coverage. cDNA, EST and mRNA-Seq data were combined to calculate the number
of alternative splicing events in each gene. The median alternative splicing
frequency was determined for each gene group and a scale factor was applied
to new junctions detected in mRNA-Seq data to account for missing new
junctions expected when surveying additional tissues.
However, at increased levels of read coverage (that is, 16 to >500 reads
per 100 nucleotides), alternative splicing events can be detected in
92–97% of multiexon genes (Supplementary Methods). This represents a
substantial increase over a previous estimate (74%)4 for the proportion
of multiexon genes that contain one or more alternative splicing event.
Given that our analysis of mRNA-Seq data detects approximately half
of known junctions, and that there is an almost linear increase in the
detection rate of new junctions as data from each tissue is added
(Fig. 1b), we predict that with full coverage the numbers of new
junctions would be at least twice those detected in the present data.
To assess the degree to which known and new junctions detected in
the mRNA-Seq data may represent tissue-dependent splicing events,
we investigated the frequencies at which detected splice junctions
formed by skipping of one or more exons are unique to individual
tissues, or are common to two or more tissues (Fig. 1c). In each case,
there were significantly more junctions that were detected in only one
tissue, and these could represent tissue-specific or tissue-restricted
splice variants, as well as possible lower-abundance splice variants
that are widely expressed but only detected in a single tissue. We also
examined the tissue specificity of all (n = 439) new splice junctions
that were detected in two of the six tissues. In the plot shown in
Figure 1d, the sizes of the blue squares indicate the proportions of the
overlaps between new splice junctions when comparing pairs of tissues.
This shows that a greater proportion of new splice junctions are
commonly detected in whole brain and cerebral cortex, or in skeletal
muscle and heart, than the proportion of new junctions commonly
detected in any of the other pairs of tissues. This indicates that many
of the new splice junctions reflect the physiological origin of the
tissues analyzed and therefore likely represent examples of new
tissue-specific alternative splicing events, in addition to new
alternative splicing events in transcripts with tissue-restricted
expression patterns. Supplementary Table 1 online lists genes with more
than five new splice junctions. Many of these genes encode ‘giant’ and
other muscle-specific proteins, thus revealing a previously
unappreciated degree of alternative splicing complexity in transcripts
from muscle-specific genes. These findings are consistent with previous
proposals that alternative splicing of transcripts encoding some of
these proteins has an important role in controlling fundamental
mechanical properties of muscle, such as tension and contractility14.
As mRNA-Seq data affords the detection of alternative splicing
events in transcripts irrespective of their length and associated splicing
complexity, the relationship between alternative splicing frequency
relative to exon number in genes can be accurately assessed. Accordingly, we determined the median number of alternative splicing events
per exon for genes with different numbers of total exons and with
similar overall levels of Illumina read coverage (Fig. 2a). Despite the
theoretical possibility of a quadratic (n²) increase in the number of
alternative splicing possibilities as the number of exons per gene
increases, our results indicate that the number of alternative splicing
events per gene increases in a near linear fashion (Fig. 2a). Thus,
notably, the frequency of alternative splicing detection per exon does
not rise in genes with increasing numbers of exons, and this observation suggests that selection pressure may act to generally limit splicing
complexity in large genes. This observation facilitates assessment of
the total number of alternative splicing events in human tissues.
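The contrast between the quadratic number of possible pairings and a flat per-exon frequency can be illustrated with a small sketch. It assumes each gene is summarized by its exon count and its number of detected alternative splicing events, uses uniform bins of four exons, and ignores the matching of read coverage between groups that the analysis above applies; all data and names are hypothetical.

```python
from statistics import median


def possible_pairings(n_exons):
    """Number of distinct upstream-downstream exon pairs, which grows as n squared."""
    return n_exons * (n_exons - 1) // 2


def as_frequency_by_exon_bin(genes, bin_width=4):
    """Median AS events per exon for genes grouped into exon-count bins (1-4, 5-8, ...).

    genes: iterable of (n_exons, n_as_events) tuples.
    """
    bins = {}
    for n_exons, n_events in genes:
        start = ((n_exons - 1) // bin_width) * bin_width + 1
        label = f"{start}-{start + bin_width - 1}"
        bins.setdefault(label, []).append(n_events / n_exons)
    return {label: median(values) for label, values in bins.items()}


# Hypothetical genes: (number of exons, number of detected AS events).
genes = [(4, 2), (2, 1), (8, 4), (6, 3), (5, 1)]
frequency = as_frequency_by_exon_bin(genes)  # {'1-4': 0.5, '5-8': 0.5}
```

In this toy data the number of possible pairings rises from 6 for a four-exon gene to 28 for an eight-exon gene, while the median events per exon stays the same in both bins, which is the pattern of near-linear growth per gene described above.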
When considering that the frequency of new alternative splicing
events detected in the six different tissues can be extrapolated to other
human tissues, it is possible to derive an estimate for the total number
of alternative splicing events that can be detected by comparable
methods. By combining the rates of detection of new and known
alternative splicing events afforded by mRNA-Seq and EST-cDNA
data, respectively, we observed that the median number of alternative
splicing events per exon is between 0.5 and 0.75 for genes with
intermediate to high levels of Illumina sequence coverage (32–256
reads per 100 bases; see Fig. 2b). Given that 175,944 exons were mined
from 15,702 multiexon human genes, we predict that on the order of
88,000–132,000 alternative splicing events of comparable abundance
as those detected in the present study are expressed in major human
tissues. This estimate further suggests that, on average, there are at
least seven alternative splicing events per multiexon human gene.
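The extrapolation is a simple product of the per-exon frequency and the number of exons surveyed, as the following sketch shows. The constant and function names are illustrative.

```python
EXONS = 175_944  # exons mined from the surveyed genes
GENES = 15_702   # multiexon genes surveyed


def projected_events(events_per_exon, n_exons=EXONS):
    """Total AS events implied by a median per-exon frequency."""
    return events_per_exon * n_exons


low = projected_events(0.5)    # 87,972, rounded to 88,000 in the text
high = projected_events(0.75)  # 131,958, rounded to 132,000 in the text
per_gene = (low / GENES, high / GENES)
print(f"{low:,.0f} to {high:,.0f} events")
print(f"{per_gene[0]:.1f} to {per_gene[1]:.1f} events per gene")
```

Dividing by the number of genes gives roughly 5.6 to 8.4 events per gene, with the midpoint of that range close to seven.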
An important question concerning high-throughput sequencing
technologies is their capacity to generate reliable quantitative measurements for alternative splicing levels. To address this question, we
compared estimates for percent exon inclusion from the mRNA-Seq
data described above with percent inclusion estimates generated from
profiling ~5,000 cassette-type alternative exons in the same six human
tissues using our previously validated15, quantitative alternative
splicing microarray system (unpublished data, Supplementary
Fig. 2a and Supplementary Methods online). When applying a
threshold of 20 or more reads per tissue that match any one of the
three splice junction sequences representing inclusion and skipping of
a cassette exon, there is a high correlation (r = 0.80, n = 1,548)
between the alternative splicing microarray- and mRNA-Seq–derived
predictions for percent inclusion. The correlation increases (r = 0.85,
n = 546) when a threshold of 50 or more junction matching reads is
used. Predictions for percent inclusion levels from the two
systems also agree well for tissue-regulated alternative exons (Supplementary Fig. 2b and Supplementary Information). Together, the
results described above show that mRNA-Seq data can be used to
reliably measure alternative splicing levels, in addition to revealing
important new insights into alternative splicing complexity in the
human transcriptome.
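A minimal sketch of how percent inclusion might be derived from junction reads and compared between platforms is given below. The exact formula is described in the Supplementary Methods and is not reproduced here; this version assumes that reads on the two inclusion junctions (C1-A and A-C2) are averaged and compared with reads on the skipping junction (C1-C2), and that the read threshold applies to the sum over all three junctions. All names and numbers are hypothetical.

```python
from math import sqrt


def percent_inclusion(c1_a, a_c2, c1_c2, min_reads=20):
    """Percent inclusion of cassette exon A from junction read counts.

    Returns None when fewer than min_reads reads match the three junctions.
    """
    if c1_a + a_c2 + c1_c2 < min_reads:
        return None
    inclusion = (c1_a + a_c2) / 2.0  # average of the two inclusion junctions
    return 100.0 * inclusion / (inclusion + c1_c2)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


well_covered = percent_inclusion(30, 30, 10)  # 75.0
too_sparse = percent_inclusion(5, 5, 2)       # None: only 12 junction reads

# Hypothetical percent-inclusion estimates for the same exons on both platforms.
seq_estimates = [75.0, 20.0, 90.0, 50.0]
array_estimates = [70.0, 25.0, 95.0, 55.0]
r = pearson_r(seq_estimates, array_estimates)
```

Raising `min_reads` keeps fewer exons but makes each estimate less sensitive to sampling noise, which is consistent with the higher correlation reported at the 50-read threshold.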
Note: Supplementary information is available on the Nature Genetics website.
ACKNOWLEDGMENTS
We thank S. Luo, I. Khrebtukova and G. Schroth of Illumina Inc. for providing
some of the mRNA-Seq datasets used in this analysis. We also thank M. Brudno,
Y. Barash, J. Calarco and S. Ahmad for helpful suggestions and comments on the
manuscript. B.J.B. and B.J.F. acknowledge support from the Canadian Institutes
of Health Research and from Genome Canada through the Ontario
Genomics Institute.
AUTHOR CONTRIBUTIONS
Q.P. created the exon and splice junction libraries and performed analyses of the
mRNA-Seq, cDNA-EST and microarray data. O.S., L.J.L. and B.J.F. designed and
implemented the logistic regression classifier and contributed to the analyses of
tissue-specific alternative splicing events. The study was coordinated by B.J.B.
The manuscript was prepared by B.J.B. and Q.P., with the participation of O.S.,
L.J.L. and B.J.F.
Published online at http://www.nature.com/naturegenetics/
Reprints and permissions information is available online at http://npg.nature.com/
reprintsandpermissions/
1. Matlin, A.J., Clark, F. & Smith, C.W. Nat. Rev. Mol. Cell Biol. 6, 386–398
(2005).
2. Blencowe, B.J. Cell 126, 37–47 (2006).
3. Ben-Dov, C., Hartmann, B., Lundgren, J. & Valcarcel, J. J. Biol. Chem. 283,
1229–1233 (2008).
4. Johnson, J.M. et al. Science 302, 2141–2144 (2003).
5. Sorek, R., Dror, G. & Shamir, R. BMC Genomics 7, 273 (2006).
6. Calarco, J.A. et al. Adv. Exp. Med. Biol. 623, 64–84 (2007).
7. Bainbridge, M.N. et al. BMC Genomics 7, 246 (2006).
8. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5,
621–628 (2008).
9. Cloonan, N. et al. Nat. Methods 5, 613–619 (2008).
10. Sultan, M. et al. Science 321, 956–960 (2008).
11. Su, A.I. et al. Proc. Natl. Acad. Sci. USA 101, 6062–6067 (2004).
12. Zhang, W. et al. J. Biol. 3, 21 (2004).
13. Yeo, G., Holste, D., Kreiman, G. & Burge, C.B. Genome Biol. 5, R74 (2004).
14. Schiaffino, S. & Reggiani, C. Physiol. Rev. 76, 371–423 (1996).
15. Pan, Q. et al. Mol. Cell 16, 929–941 (2004).
addendum
Addendum: Deep surveying of alternative splicing complexity in the human
transcriptome by high-throughput sequencing
Qun Pan, Ofer Shai, Leo J Lee, Brendan J Frey & Benjamin J Blencowe
Nat. Genet. 40, 1413–1415 (2008), published online 2 November 2008; addendum published after print 28 April 2009
© 2009 Nature America, Inc. All rights reserved.
The GEO accession number for the mRNA-Seq datasets is GSE13652.