www.sciencemag.org/cgi/content/full/1160342/DC1
Supporting Online Material for
A Global View of Gene Activity and Alternative Splicing by Deep
Sequencing of the Human Transcriptome
Marc Sultan, Marcel H. Schulz, Hugues Richard, Alon Magen, Andreas Klingenhoff,
Matthias Scherf, Martin Seifert, Tatjana Borodina, Aleksey Soldatov,
Dmitri Parkhomchuk, Dominic Schmidt, Sean O’Keeffe, Stefan Haas, Martin Vingron,
Hans Lehrach, Marie-Laure Yaspo*
*To whom correspondence should be addressed. E-mail: [email protected]
Published 3 July 2008 on Science Express
DOI: 10.1126/science.1160342
This PDF file includes:
Materials and Methods
Figs. S1 to S3
Table S1
References
Other Supporting Online Material for this manuscript includes the following: (available at
www.sciencemag.org/cgi/content/full/1160342/DC1)
Tables S2 to S9 as zipped archives: Gene lists, tags, and other data related to genes studied in
paper in Microsoft Excel and tab-delimited text (tsv) format. Legends appear in main SOM PDF
file.
Supplementary Methods
Cell culture
HEK 293T cells were cultured in parallel in 2*150 cm3 flasks with DMEM (penicillin-streptomycin) (Gibco) supplemented with 10% Fetal Calf Serum (FCS) (Biochrom) and
Glutamax (1x) (Gibco). At passage 20 (confluent state), cells were trypsinized, washed off in
PBS, pelleted, and resuspended in 2 ml lysis solution RLT (Qiagen). B cells were cultured
in parallel in 2*150 cm3 flasks with RPMI 1640 (penicillin-streptomycin) (Gibco)
supplemented with 10% FCS (Biochrom) and Glutamax (1x) (Gibco). At passage 20
(confluent state), cells were trypsinized, washed off in PBS, pelleted, and resuspended in 2
ml lysis solution RLT (Qiagen).
RNA preparation and double stranded cDNA generation
Total RNA was then extracted from ~20 × 10^6 cells per sample (2*HEK and 2*B cell
samples) using the RNeasy Midi extraction kit (Qiagen) by following the manufacturer’s
instructions. DNA was removed using the “on column digestion” protocol of the RNeasy
Midi extraction kit (Qiagen). Total RNA quality was assessed by spectrophotometry
(Nanodrop) and gel electrophoresis (1% agarose). Then, mRNA was extracted from 120 µg of
each total RNA sample, using Dynabeads mRNA purification kit (Invitrogen) and following
the manufacturer’s instructions. The mRNA was eluted in 10.5 µl 10 mM Tris-HCl. First
strand cDNA was directly generated from the eluted mRNA using random hexamers
(Invitrogen) and the SuperScript RT kit (Invitrogen), following the manufacturer’s instructions.
The final volume was 20 µl. Second strand cDNA synthesis was performed immediately
after the first strand synthesis. Briefly, 1x Second strand buffer (Invitrogen), 200 nM final
dNTPs (Invitrogen), 20 Units of E. coli DNA ligase (Invitrogen), 40 Units of E. coli
polymerase (Invitrogen), and 4 Units of E. coli RNase H (Invitrogen) were added to the first
strand cDNA and incubated for 2 hours at 16°C. Double stranded cDNA was then purified
using the QIAquick PCR purification kit, following the manufacturer’s instructions. The
generated cDNA was quantified by UV spectrophotometry (Nanodrop). For the digital
expression libraries, about 250 ng of the double-stranded cDNA of each sample was
fragmented by sonication on the UTR200 (Hielscher Ultrasonics GmbH, Germany) under
the following conditions: 1 hour, 50% pulse, 100% power, and continuous cooling by 0°C water
flow-through.
Chip hybridizations
Biotin-labelled cRNA was generated using a linear amplification kit (Ambion #IL1791)
starting with 500 ng of DNA-free, quality-checked total RNA of each sample as input (see
RNA preparation). Chip hybridisations, washing, Cy3-streptavidin (Amersham Biosciences)
staining, and scanning were performed on the Illumina BeadStation 500 platform employing
reagents and following protocols supplied by the manufacturer. cRNA samples were
hybridized as biological and technical duplicates on Illumina HumanRef8 V2.0 BeadChips.
RNA polymerase II chromatin Immunoprecipitation
The protocol was followed as described previously with some modifications(1). In short, 5-10 ×
10^7 293T/17 cells were cross-linked with 1% formaldehyde for 10 min and the reaction was
stopped by adding 1/20 volume of 2.5 M glycine. The cross-linked material was washed with
PBS, lysed, and sonicated to an average size of 300 bp. The sheared chromatin was incubated
with 10 µg specific antibody to RNA polymerase II (ab5408, recognizing the
hypophosphorylated form of RNA pol II) coupled to protein G magnetic beads (Invitrogen)
for 16 h at 4°C. An aliquot of the input DNA was saved prior to immunoprecipitation as
reference sample. The DNA was finally recovered, after washing (wash buffer: 50 mM
Hepes-KOH, pH 7.6; 500 mM LiCl; 1 mM EDTA; 1% NP-40; 0.7% Na-Deoxycholate) and
elution, by reversing the cross-linking overnight at 65°C in the elution buffer (50 mM Tris-HCl, pH 8.0; 10 mM EDTA; 1% SDS).
Enrichment of polymerase II-bound regions was estimated on 1/50 of the DNA by real-time
PCR following the protocol described previously(2). For this, DNA regions encompassing the
transcription start site of three genes known to be active in HEK cells (SOD1, GABPA and
CCT8) and of one control gene that is not enriched (HMOX1-140) were amplified using
specific primers at 375 nM final concentration (primers are given in Table 1). The enrichment
compared to the control input DNA was estimated to be between 50- and 150-fold.
Gene     Forward primer            Reverse primer
SOD1     CGCGGAGGTCTGGCCTATAA      CGTCGCCATAACTCGCTAGG
CCT8     CCTGCAGGAAGCAGTTCACG      TCTGGCCACAGAGGTTGCTC
GABPA    TGTACGCATGCGCTCTTTGA      ACGAACAGGACGCTGACTCG
HMOX1    GAAGGCGGATTTTGCTAGATTT    CTCCTGCCTACCATTAAAGCTG
Table 1: Primer sequences used for real-time PCR.
Preparation of libraries for RNA sequencing
In brief, the library preparation consisted of the following steps: end-repair, A-tailing, ligation
of adapter primers, size selection and pre-amplification. For all samples, the libraries were
prepared using the DNA sample kit (#FC-102-1002, Illumina), following the manufacturer’s
instructions, but with the following modifications: the amount of all enzymes was reduced five-fold,
and adapters were ligated to the DNA fragments using 2 µl of ‘Adapter oligo mix’
in a total reaction volume of 25 µl.
For the digital expression library, 250 ng of the sheared double stranded cDNA (see above)
were used and 170-220bp fragments were selected at the gel size fractionation step.
For Polymerase II ChIP DNA, 80 ng of input DNA and 10 ng of ChIP DNA (see above) were
taken for library preparation. The amplification was performed prior to gel size fractionation.
18 cycles of PCR with Illumina PCR primers were performed using an elongation time of 45
seconds. PCR-products were size fractionated and 150-300bp fragments were excised.
Sequencing
Amplified material was loaded onto channels of the flowcell at 2-4 pM concentration.
Sequencing was carried out using the 1G Illumina Genome Analyser (Solexa) according to
the manufacturer’s instructions. Sequencing was carried out by running 27 cycles. Image
deconvolution and quality value calculation were performed using the Goat module
(Firecrest v.1.8.28 and Bustard v.1.8.28 programs) of the Illumina pipeline v.0.2.2.3. For
digital gene expression, two biological replicates were sequenced for each of the cell lines,
producing 3.5-4.4 million reads per sample. For the ChIP experiment, two independent lanes
were sequenced from the same ChIP DNA and three lanes from one input DNA.
Aligning reads to the genome using ELAND
Reads were aligned to the human genome (hg18, NCBI build 36.1) using Eland software
(Gerald module v.1.27, Illumina). The mapping criteria imposed by Eland are as follows:
matches should be collinear to the genome allowing up to two mismatches, but no indels. In
these conditions, 50% of the reads obtained here matched to unique locations of the human
genome, whereas 16-18% of the reads mapped to more than one genomic position.
Mapping reads to annotated transcripts and viral sequences
Reads that were found in unique positions in the human genome were further mapped to
exons of protein-coding genes retrieved from two independent databases: ENSEMBL v.46
and Eldorado (release 05/2007). The number of reads that were fully included in exons
(which we refer to as “number of hits”) was counted. For genes encoding multiple transcripts,
the number of hits per gene was determined as the sum of all hits in all possible exons.
Reads from both cell lines were further mapped to 2,896 viral genomic sequences retrieved
from the viral section of the RefSeq database (release 25)(3).
Mapping reads to artificial splice junctions
All reads that could not be matched on the genomic sequence, to a known transcript or an
EST sequence (EST version) were mapped to a second dataset consisting of synthetic splice
junction sequences. We generated a dataset of artificial splice junction sequences by pairwise
connection of exon sequences from every annotated locus retrieved from UCSC (hg18,
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/) and from Eldorado (release
05/2007) and obtained 2,334,049 and 2,828,506 synthetic splice junctions, respectively. The
exons of every locus were sorted and indexed by their positions within the locus. Then the
sequence of exon i was connected to the sequence of exon i+1, exon i+2, etc. The mapping
was based on the UCSC and Eldorado databases, where all possible junctions of length 50-52
bp were retrieved covering the half of each connected exon sequence. Mappings allowing 0-2 mismatches with an overlap of at least 1 bp with each of the two exon sequences were
considered.
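The pairwise connection of exons described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the `flank` parameter are hypothetical, and taking up to 26 bases from each exon is an assumption chosen to give junctions of at most 52 bp.

```python
from itertools import combinations

def synthetic_junctions(exons, flank=26):
    """Join the end of each exon to the start of every downstream exon.

    `exons` is a list of exon sequences already sorted by position within
    the locus. `flank` (hypothetical parameter) is the maximum number of
    bases taken from each side of the junction.
    """
    junctions = {}
    for i, j in combinations(range(len(exons)), 2):  # all pairs with i < j
        donor = exons[i][-flank:]      # last bases of the upstream exon
        acceptor = exons[j][:flank]    # first bases of the downstream exon
        junctions[(i, j)] = donor + acceptor
    return junctions
```

A locus with n exons yields n(n-1)/2 candidate junctions under this scheme, which is why the synthetic sets are much larger than the number of annotated exons.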
78,457 and 62,110 splice junctions were identified on the UCSC junction set for HEK and B
cells, respectively, whereas 69,952 and 56,000 splice junctions had reads matching on the
Eldorado junction set. The two resulting datasets from Eldorado and UCSC were merged to
obtain one reference data set of splice junctions. For merging, the genomic start and end
positions of the splice junctions were compared with a tolerance of +3/-3 base pairs.
Redundant junctions were removed, leading to a total set of 83,239 and 66,330 junctions for
HEK and B cells, respectively (see Files S9a and S9b).
Random Model for junction hits
We considered a model to study the probability of random hits for a read of length R on splice
junctions of length J. Depending on the matching strategy, we allow up to ME errors, where
ME is the number of maximum substitution errors allowed to occur between the read and
junction sequence. Assuming a uniform i.i.d. random model for DNA sequences the
probability P that a R-base pair (bp) read matches a splice junction of size J bps with not more
than ME substitution errors is
$$P(\text{hit on a splice junction with up to } ME \text{ errors}) = \frac{\left(\sum_{k=0}^{ME} \binom{R}{k} 3^{k}\right) (J - R + 1)}{4^{R}} \qquad (1)$$
The expected number of reads that hit all considered splice junctions is calculated by
multiplying the number of considered junctions and number of considered reads with formula
(1).
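As a minimal sketch, formula (1) and the expected number of chance hits can be evaluated as below. The function and parameter names are illustrative and do not come from the original analysis.

```python
from math import comb

def random_hit_probability(read_len, junction_len, max_errors):
    """Formula (1): chance that a random read matches a random junction."""
    # Number of read sequences within `max_errors` substitutions of a target.
    neighbours = sum(comb(read_len, k) * 3**k for k in range(max_errors + 1))
    # Number of start positions of the read inside the junction.
    offsets = junction_len - read_len + 1
    return neighbours * offsets / 4**read_len

def expected_random_hits(n_junctions, n_reads, read_len, junction_len, max_errors):
    """Expected number of chance hits over all junctions and all reads."""
    p = random_hit_probability(read_len, junction_len, max_errors)
    return n_junctions * n_reads * p
```

For 27-bp reads, 52-bp junctions and up to two mismatches, `random_hit_probability(27, 52, 2)` is about 4.7 × 10^-12 per read and junction.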
Expected number of splice junctions
The expected number of reads hitting splice junctions of a gene by chance was given by:
$$X \cdot \frac{p_r}{1 - p_r},$$
where X is the number of unique reads that fall inside exons of the gene and where
the probability $p_r$ that a read hits a junction corresponds to:
$$p_r = \frac{(n - 1) \cdot 27}{L + (n - 1) \cdot 27},$$
where n is the number of exons in a given gene of length L.
We considered one isoform per gene including all annotated exons, but ignored the exons for
which less than half of the positions corresponded to unique hits.
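The expectation above is simple enough to sketch directly. The names below are illustrative, and the read length of 27 is taken from the formula.

```python
def expected_junction_reads(exonic_reads, n_exons, gene_length, read_len=27):
    """Expected number of junction reads for a gene, X * p_r / (1 - p_r)."""
    junction_positions = (n_exons - 1) * read_len
    p_r = junction_positions / (gene_length + junction_positions)
    return exonic_reads * p_r / (1 - p_r)
```

Note that p_r / (1 - p_r) simplifies to (n - 1) · 27 / L, so the expectation scales linearly with the number of junctions and inversely with gene length. For example, a gene of length 2,000 with 10 exons and 100 exonic reads gives 100 × 243 / 2,000 = 12.15 expected junction reads.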
Theoretical read coverage and normalized expression
The theoretical read coverage takes into account constraints due to the mapping procedure
that eliminates non unique reads. For each gene, all exons were extracted from the annotated
transcript database (ENSEMBL v.46) and concatenated, considering the longest exonic form.
For the resulting virtual cDNAs the number of all possible 27 bp oligomers (gene length - 26)
was determined, using sliding window of one base pair. We subtracted from the number of all
possible theoretical hits the non-unique oligomers (mapping in non-unique position in the
human genome) and oligomers that were shared with other genes (e.g. exon overlapping or
duplications). This provided us with the theoretical total number of unique 27-mers or virtual
length (L) representing a given gene. For most of the loci, 80-100% of the exonic sequence
could be covered by unique 27-mers, allowing the scanning of most of the coding regions of
the human genome. Only 517 genes could not be scored since they did not contain any
informative reads, whereas ca. 1,000 genes were poorly scored (less than 40% theoretical
coverage). Then, for each sample, we counted the total number of RNA-seq reads that
matched the virtual length of each gene, which was called “number of hits” per gene (X). We
further normalized the number of hits per gene according to the sample size of the biological
replicates by scaling them up to the sample with the largest number of reads and by dividing it
by the virtual length of the gene. In summary, if:
$N_i$ = Sample size in replicate i.
$X_i^j$ = Number of unique reads (number of hits) for gene j in replicate i.
r = Number of replicates.
$L^j$ = Theoretical number of unique 27-mers in gene j.
$NE_i^j$ = Normalized expression of gene j in replicate i.
Then,
$$NE_i^j = \frac{X_i^j}{L^j} \times \frac{\max_{1 \le k \le r}(N_k)}{N_i}.$$
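A minimal sketch of this normalization for a single gene follows. The function name and argument layout are assumptions made for illustration.

```python
def normalized_expression(hits, sample_sizes, virtual_length):
    """Return NE_i^j for one gene j across all replicates i.

    hits[i]         -- X_i^j, unique reads for the gene in replicate i
    sample_sizes[i] -- N_i, sample size of replicate i
    virtual_length  -- L^j, number of unique 27-mers in the gene
    """
    largest = max(sample_sizes)  # max over k of N_k
    return [x / virtual_length * largest / n
            for x, n in zip(hits, sample_sizes)]
```

With hits of 50 and 40 reads in replicates of 2,000,000 and 1,000,000 reads and a virtual length of 1,000, the normalized values are 0.05 and 0.08: the smaller replicate is scaled up by a factor of two before dividing by the virtual length.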
Assessing Transcript Diversity
We assessed the transcript diversity by estimating the real total number of genes expressed in
a cell line and the expected number of additional genes that could be identified by sequencing
additional replicates. Mapped reads are assumed to be sampled independently and with
replacement from the whole transcript population. Under this assumption, the observed
number of counts for a gene follows a Poisson distribution of parameter λ :
$$P(X_i^j = x_i^j) = e^{-\lambda} \frac{\lambda^{x_i^j}}{x_i^j!},$$
where $X_i^j$ is the number of unique reads (number of hits) for gene j
in replicate i.
We call $\omega_i^c$ the number of genes with count c in replicate i:
$$\omega_i^c = \sum_j \delta_{X_i^j, c},$$
where $\delta_{\cdot,\cdot}$ is the Kronecker symbol ($\delta_{a,b} = 1$ if a = b and 0 otherwise). $S_i$ is the
number of different genes observed in experiment i ($S_i = \sum_{c>0} \omega_i^c$). For instance, 14,107 genes
are expressed in the HEK-1 lane.
$\omega_i^0$ represents the theoretical number of expressed genes that are not observed in experiment
i. Estimation of gene diversity then reduces to the estimation of $\omega_i^0$ conditionally on $S_i$. Let
us denote as f the marginal distribution over read counts, which can be written as:
$$f(c, Q) = P(X_\cdot^j = c) = \int e^{-\lambda} \frac{\lambda^{c}}{c!} \, dQ(\lambda),$$
where Q denotes the distribution of Poisson rates
amongst the genes.
The distribution over read counts Q̂ is estimated with a Non Parametric Maximum
Likelihood (NPMLE) approach as presented by Wang and Lindsay(4, 5).
An estimator of $\omega_i^0$ can be written as
$$\hat{\omega}_i^0 = S_i \, \frac{f_i(0, \hat{Q})}{1 - f_i(0, \hat{Q})}.$$
Bootstrap estimates are obtained by resampling $\omega_i^c$ with Q̂, conditionally on $S_i$. For every
bootstrap sample an NPMLE estimation was conducted, and a confidence interval was deduced
from 200 samples(6).
The estimates for Q were obtained using a modified version of the Java code developed by the
authors (courtesy of J.P. Wang).
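The sketch below evaluates the marginal distribution f and the estimator of the number of unobserved genes once an estimate of Q is available. It does not perform the NPMLE fit itself. As an assumption for illustration, Q is represented as a discrete distribution given by hypothetical `rates` (support points) and `weights` (their probabilities).

```python
from math import exp, factorial

def marginal_count_probability(c, rates, weights):
    """f(c, Q) for a discrete mixing distribution Q over Poisson rates."""
    return sum(w * exp(-lam) * lam**c / factorial(c)
               for lam, w in zip(rates, weights))

def unseen_genes(observed_genes, rates, weights):
    """Estimate of omega_i^0 = S_i * f(0, Q) / (1 - f(0, Q))."""
    f0 = marginal_count_probability(0, rates, weights)
    return observed_genes * f0 / (1 - f0)
```

The ratio f(0) / (1 - f(0)) converts the number of observed genes into the number expected to have zero reads, so a mixing distribution with more weight on small rates yields a larger estimate of unobserved genes.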
Detection level of RNA-Seq
With two lanes, we obtained a sufficient sampling depth for capturing rare transcripts. Based
on the estimate of ca. 300,000 transcripts per cell, we detected as few as 0.3 RNA copies per
cell. Because we interrogated the whole transcript length, we derived this estimate by
rescaling normalized expressions as a function of the number of mappable reads in each exon.
Random noise model for read mapping
Using the reads located outside of annotated genes, we estimated the background at one read
per 25,000 bp, except for one of the B cell replicates (B cell-1), where the background is higher
(1 read per 9,174 bp), possibly due to a slight contamination with DNA or hnRNA. To do so,
we derived a model of read mapping at random, based on the observed reads within regions
where no transcription has been annotated. The null probability to observe a hit on one base
pair is then calculated as:
$$p_0 = \frac{\#\{\text{hits in regions without annotated transcription}\}}{\text{length of regions without annotated transcription}}$$
For each gene of length L with n reads, we then compute the probability of observing those
reads from the random mapping alone. As X ~ Poisson(L · p0) under this model (section above), the probability
of observing at least n reads can be derived.
For each lane, a probability $p_0^i$ was estimated, and only the genes with an FDR of less than 5%
were selected (9). This resulted in the elimination of 7 and 10 genes for HEK cell lanes 1 and
2, and 522 and 16 genes for the B cell lanes 1 and 2, respectively. Most of the discarded genes
(all but 31) had one read (only one gene with 5 reads was rejected in B cell lane 1).
Accordingly, we set an arbitrary threshold of 5 reads per coding sequence on the merged
lanes as expression hallmark. Genes with 1-4 exonic reads could either be weakly expressed
or background.
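The tail probability used by the noise model can be sketched as follows, assuming that the number of background reads in a gene is Poisson distributed with mean equal to the gene length times the per-base null probability. The function name is illustrative.

```python
from math import exp

def poisson_tail(n_reads, gene_length, p0):
    """P(X >= n_reads) for X ~ Poisson(gene_length * p0)."""
    mean = gene_length * p0
    term = exp(-mean)           # P(X = 0)
    below = 0.0                 # accumulates P(X < n_reads)
    for k in range(n_reads):
        below += term
        term *= mean / (k + 1)  # P(X = k + 1) derived from P(X = k)
    return max(0.0, 1.0 - below)
```

At a background of one read per 25,000 bp, a 25,000-bp gene has a probability of about 0.63 of receiving at least one background read, but only about 0.004 of receiving five or more, which illustrates why single-read genes are hard to separate from noise.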
RNA-Seq reproducibility
Data from the two biological replicates were highly correlated both at the qualitative (detected
exons) and quantitative (read density) levels (Pearson's correlation coefficient (PCC) 0.99 for
HEK and 0.98 for the B cell samples). Merging reads from the two samples increased both the
number of detected transcripts (Table S1) and the transcript coverage (see figure below).
Gene-read coverage. Histogram showing the fraction of the transcript (ENSEMBL v46) length
covered by reads for all genes expressed under relaxed parameters (at least one read). Individual lanes
are shown with grey bars and merged lanes with a green bar. (ENSEMBL-based).
Micro-array analysis and differential gene expression
Raw expression data (BeadSummary gene profile files), without normalization, were
extracted from the manufacturer’s Illumina BeadStudio application software 1.0. Further
processing of the data was done within the R / Bioconductor statistical analysis software
package. We calculated the Pearson product-moment correlation coefficient on the non-normalized data sets to remove outliers and to check for integrity of the experiments
(r>=0.98), followed by the computation of intra-chip pair-wise scatter-plots for each probe.
The raw data sets were then normalized using quantile normalization(10), as it is robust
against non-linearity of data and also performs well on large data sets. The normalization was
computed by combining the treatment and the control group for which the differential gene
expression was analyzed. We calculated a detection score for each probe on the array that
evaluates whether the signal intensity is significantly above the background intensity of the array.
The detection score (DS) is defined as: DS = R / N, where R is the rank of the probe bead
signal relative to the negative control bead signals and N is the number of negative controls
on the chip field.
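A minimal sketch of the detection score is given below. The handling of ties is an assumption: here the rank R counts only the negative controls with a strictly lower signal than the probe.

```python
from bisect import bisect_left

def detection_score(probe_signal, negative_controls):
    """DS = R / N, with R the rank of the probe among the negative controls."""
    ordered = sorted(negative_controls)
    # Assumption: R counts the negative controls with a strictly lower signal.
    rank = bisect_left(ordered, probe_signal)
    return rank / len(ordered)
```

A probe whose signal exceeds every negative control scores 1.0, and a probe that exceeds three of four controls scores 0.75.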
To compare the overlap of expressed genes between the platforms, we first mapped all microarray (MA) probes to the ENSEMBL gene catalogue (ENSEMBL v.46) using Biomart
software (www.biomart.org). We considered for analysis only genes with addressing probes
on the MA (13,118 genes). Second, we set thresholds in each platform to define whether a
gene was expressed or not:
• A gene was considered as detected in the RNA-Seq platform if it had at least a single
hit (relaxed analysis) or at least five hits (stringent analysis) on the merged replicates.
• For the microarray, the criterion for gene expression was a detection score
(DS) greater than or equal to 0.95 (1 - p-value) in all experiments (see Materials and
Methods).
To score for differential expression, we considered the genes possessing, in the MA experiment, a
detection score greater than or equal to 0.95 for both sample types. The criterion for RNA-Seq
values was a number of counts greater than or equal to 5 (on the merged replicates) in both
sample types.
The significance of differential expression for the microarray platform was assessed using a t-test assuming inequality of variances between the 2 conditions. We then applied a Benjamini-Hochberg correction on the p-values, and a gene was considered as significantly differentially
expressed between the 2 conditions if its false discovery rate (FDR) was lower than or equal to 1%
(3,421 genes). For RNA-Seq, the differential expression was assessed with a test proposed by
Audic and Claverie (A&C) on the pooled lanes(11). The cutoff on the corrected FDR was
fixed at 1% (4,376 genes)(9).
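The Benjamini-Hochberg correction applied to the p-values can be sketched as below. This is a generic implementation of the standard step-up procedure, not the code used for the original analysis.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted value is at most 0.01 would pass a 1% FDR cutoff of the kind described above.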
Cluster detection
The reads were sorted by their position on the chromosome and the start positions were used
for the analysis. The RNA-Seq reads were analyzed for clusters of x reads in an initial
window of l base pairs. We denote as pos(k) the position of read k, and $d_k^x$ describes the
distance between x consecutive reads starting with read k:
$$d_k^x = pos(k + x - 1) - pos(k).$$
All reads k for which $d_k^x \le l$ holds define a cluster $C_k$, where $C_k = [k, \ldots, k+x-1]$.
Overlapping clusters were merged if they overlapped by at least one read, as follows. Given two
clusters $C_a$ and $C_b$ with pos(a) < pos(b), the new cluster $C_{ab}$ becomes:
$$C_{ab} = C_a \cup C_b.$$
In summary, $C_{ab}$ contains all reads that were contained in $C_a$ and $C_b$.
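The cluster detection and merging steps can be sketched as a single pass over the sorted read positions. The function name and the representation of a cluster as a pair of read indices are assumptions made for illustration.

```python
def find_clusters(positions, x, l):
    """Return merged clusters as (first_read, last_read) index pairs.

    `positions` are read start positions on one chromosome; a window of x
    consecutive reads qualifies when it spans at most l base pairs.
    """
    pos = sorted(positions)
    clusters = []
    for k in range(len(pos) - x + 1):
        if pos[k + x - 1] - pos[k] > l:
            continue                      # d_k^x exceeds the window size
        first, last = k, k + x - 1
        if clusters and first <= clusters[-1][1]:
            # Shares at least one read with the previous cluster: take the union.
            clusters[-1] = (clusters[-1][0], last)
        else:
            clusters.append((first, last))
    return clusters
```

Because the reads are sorted, each new qualifying window can only overlap the most recently created cluster, so merging never needs to look further back.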
Pol-II ChIP-Seq data analysis and correlation to gene expression
After mapping of the reads to the genome, the replicates were pooled and resulted in
3,198,755 unique reads for two lanes of the RNA Pol II experiment and in 8,017,459 unique
reads for three lanes of the control input DNA. For the analysis of the Pol-II ChIP-Seq data, the
above algorithm was applied with x = 10 and l = 100. From the 3,198,755 reads, 9,908 clusters
could be identified. The algorithm was also applied to the input control. Due to the larger
number of reads in the data set (8,017,459) the threshold x was adapted and was set to 25.
This resulted in 416 clusters from the input control. 198 of the Pol-II clusters were
overlapping with one of the input control clusters and thus were discarded. The remaining
9,710 clusters were used for further analysis. These RNA pol-II blocks were mapped to
70,968 human promoters associated with 32,595 gene loci in Eldorado (Release 05/2007).
To perform the correlation of promoter activity data to the gene expression level, we
distributed the genes into five categories: the 12,567 expressed genes were distributed into
three groups of equal size (4,189 genes in each group) referring to high (0.0889<NE<47.8),
medium (0.0263<NE<0.0889) and low expression (0.0003<NE<0.0263); Uncertain (2,396
genes with 1-4 reads) and silent (7,333 genes with no read) genes represented the last
categories.
New transcriptional regions
For the detection of enriched clusters of RNA-Seq reads in intergenic and intronic regions the
threshold x was set to 5 and the window size l was set to 100. Intergenic and intronic regions
were defined using the NCBI build 36 of the human genome and the 113,458 transcripts
annotated in ENSEMBL (v.46; 42,622 transcripts) and Eldorado (Release 05/2007; 70,836
transcripts).
Intronic regions
We analyzed genomic regions solely annotated as introns. They did not overlap with any of
the exons annotated either in ENSEMBL v.46 or Eldorado (Release 05/2007). The exons
flanking the intronic region do not need to belong to the same annotated transcript. Of the
215,549 intronic regions identified 203,004 from HEK and from B cell were discarded
because they were shorter than 100 bp or had fewer than 5 reads. The remaining intronic
regions (10,024 in HEK and 13,256 in B cells) were analyzed for regions enriched in RNA-Seq
reads applying the cluster algorithm described above.
Clusters were tested for overlap with the annotated ESTs (EMBL Nucleotide Sequence
Database, Release 89), and for whether the respective ESTs overlap with any of the exons annotated in
ENSEMBL or Eldorado and thus connect the cluster to the annotated transcript (one base pair
of overlap).
Intergenic regions
For the prediction of new transcriptional units in intergenic regions, clusters were calculated
individually for each of the HEK and B cell lines. Intergenic regions were defined as parts
of the genome not annotated by transcripts in either ENSEMBL or Eldorado. The minimum
required length for an intergenic region was 10,000 bp. The 5,000 bp immediately upstream
and downstream of the genes flanking the intergenic region were excluded from the analysis.
We identified 1,584 clusters in HEK and B cells collectively. Clusters connected directly or
indirectly by an EST sequence (by at least one base pair) were grouped into transcriptional
units (TUs). Thus, 594 clusters with EST data were collapsed into 173 TUs. EST sequences
were taken from EMBL Nucleotide Sequence Database (Release 89) and mapped to the
human genome (NCBI build 36) resulting in 6,998,739 EST annotations. In addition we
determined the presence of Pol-IIa clusters from the ChIP-Seq data, 1 kb upstream or
downstream of the new transcriptional unit and the number of CAGE tags (12). The
transcriptional units were tested in both directions due to the missing strand specificity of the
RNA-Seq reads.
Filtering
Orphan reads initially located within clusters but which corresponded to either true splice
junctions artefactually mapped to processed pseudogenes, translocated gene fragments, or
repeated elements were filtered out.
All intergenic TUs and intronic clusters were subjected to repeat masking (RepeatMasker
software 3.1.9) and TUs or clusters consisting of >50% of repetitive sequences were removed.
Additionally, all reads related to intergenic TUs and intronic clusters were screened against
the set of artificial exon junction sequences described above and using Eland software with
the same parameter settings and conditions as for the original mapping to the genome. Reads
artificially mapping to pseudogenes, or retrotransposed gene fragments were thus rejected.
Whenever more than 20% of the reads of a putative intergenic TU or intronic cluster were
rejected, or the number of reads dropped below 5, the TU or cluster was not considered.
By doing so we rejected 141 intergenic TUs because of repeated elements and 670 TUs
mapping to pseudogenes or retro-transposed gene fragments.
For the intronic clusters, 341 and 374 were rejected due to repeated elements in HEK and B
cells, respectively. 795 and 622 clusters were rejected due to the mapping to the exon junction
set.
Blast analysis
The genomic sequences of the RNA-Seq read clusters related to the same transcriptional unit
(TU) were concatenated into artificial TU sequences. All TUs were screened for similarities to
known or predicted protein sequences from the SwissProt/Trembl database (07.02.08) using
BLASTX (e-value cutoff = 0.001, minimum score = 40, percent identity >= 50%) and
requiring a minimum match of 30 amino acids.
In addition, we performed a BLASTN search against the non-redundant database (nr, NCBI,
options: -v 3 -b 3) annotating the top three matches, but rejecting hits to genomic DNA based
on the sequence description (keywords: genomic DNA; DNA sequence from clone; sapiens
chromosome; troglodytes chromosome; chromosome [A-Z0-9]*clone; clone RP[0-9]; BAC;
PAC). Only sequence matches >=30 bp were considered for further analysis.
Transcript extension
Transcripts annotated in ENSEMBL or Eldorado were analyzed to determine whether RNA-Seq reads located
immediately upstream or downstream of the transcript indicate that the annotation of the
respective 5' or 3' UTR is incomplete. To this end, we determined the density of RNA-Seq reads
and the maximum distance between two neighboring reads in the first and last exon of each
transcript. For the density the number of reads was normalized by the length of the respective
exon (number of reads / length of exon).
The annotation of the exon was then extended to the next upstream/downstream read if (a) the
distance to the next read was below the maximum distance observed in the exon annotated
originally and (b) the density of reads calculated for the extended exon did not fall below the
density for the original exon. Exons with a density of RNA-Seq reads below 0.01 were not
analyzed.
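The extension rule can be sketched for the downstream direction as below; the upstream case is symmetric. Several details are assumptions made for illustration: the distance to the next read is measured from the last read already included, exons with fewer than two reads are skipped because no maximum distance can be computed, and the function and parameter names are hypothetical.

```python
def extend_exon_downstream(exon_start, exon_end, read_positions, min_density=0.01):
    """Extend `exon_end` read by read and return the new end coordinate."""
    reads = sorted(read_positions)
    inside = [p for p in reads if exon_start <= p <= exon_end]
    length = exon_end - exon_start + 1
    density = len(inside) / length
    if len(inside) < 2 or density < min_density:
        return exon_end                   # too sparse to analyze
    # Maximum distance between two neighboring reads in the annotated exon.
    max_gap = max(b - a for a, b in zip(inside, inside[1:]))
    count, last, end = len(inside), inside[-1], exon_end
    for p in (p for p in reads if p > exon_end):
        new_density = (count + 1) / (p - exon_start + 1)
        # Conditions (a) and (b) from the text; stop at the first failure.
        if p - last >= max_gap or new_density < density:
            break
        count, last, end = count + 1, p, p
    return end
```

For an exon spanning positions 100 to 199 with reads every 20 bp, downstream reads at 205 and 215 extend the annotation to 215, whereas a read at 240 is too far from the previous read and stops the extension.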
Supplementary Table:
Table S1:
Tag mapping                    HEK-1       HEK-2       HEK pooled   B cell-1    B cell-2    B cells pooled
Unique matches                 2,362,046   2,278,066   4,640,112    2,086,216   1,809,427   3,895,643
Map to RNA ENSEMBL             1,721,149   1,677,026   3,398,175    1,378,594   1,263,220   2,641,814
Map to RNAs Eldorado           1,844,642   1,798,421   3,643,063    1,488,798   1,358,628   2,847,426
ENSEMBL + Eldorado combined    1,879,447   1,833,029   3,712,476    1,517,184   1,385,203   2,902,387
Partially mapping RNAs         72,951      71,774      144,725      56,186      54,541      110,727
Orphan reads (all)             341,844     316,113     657,957      465,722     305,758     771,480
Orphan reads (intronic)        200,395     189,778     390,173      249,600     190,242     439,842
Orphan reads (intergenic)      141,449     126,335     267,784      216,122     115,516     331,638
Multiple matches               788,526     757,835     1,546,361    752,635     572,135     1,324,770
No match to genome             1,190,768   1,027,518   2,218,286    1,173,196   1,093,622   2,266,818
Low quality reads              118,974     115,186     234,160      106,340     88,659      194,999
Total reads                    4,460,314   4,178,605   8,638,919    4,118,387   3,563,843   7,682,230
Read mapping overview. Summary of the mapping of the reads to the human genome (hg18, NCBI
build 36.1) and to gene loci (ENSEMBL v.46 and Eldorado release 05/07).
Supplementary Figures
Figure S1 | Dynamic range of RNA-Seq. Rank abundance curves (RACs) showing the total
number of mapped reads (x-axis) versus the total number of identified genes (y-axis). Data
points show the observed values on individual (crosses) and merged (dots) lanes. Curves were
extrapolated as follows: for values smaller or equal to the sample size, the number of expected
genes was obtained by sub-sampling. For values greater than the sample size, the number of
expected discoveries was computed from the statistical analysis by Poisson mixtures.
Figure S2 | Example of new TUs. Snapshot of TU 33 (yellow) and TU 34 (pink) spanning over 190
MB on chromosome 2. The units can be merged based on recent EST data. The sequencing reads
show that these TUs are predominantly expressed in B cells. The vast majority of reads are in
agreement with exons deduced from ESTs, and coincide frequently with highly conserved regions
(PhastCons Track, UCSC). At the 5' end of TU99 the putative transcriptional start is supported
by CAGE tags as well as by a PolIIa bound region.
Figure S3 | Read density in the EIF4G1 gene on chromosome 3. For HEK (top) and B cells
(bottom), the green bars show the 33 known exons, red bars represent the number of reads at a given
position, and blue lines identify the reads on splice junctions (width is proportional to the number of
reads on a given junction).
References:
1. T. I. Lee, S. E. Johnstone, R. A. Young, Nat Protoc 1, 729 (2006).
2. P. Kahlem et al., Genome Res 14, 1258 (Jul, 2004).
3. K. D. Pruitt, T. Tatusova, D. R. Maglott, Nucleic Acids Res 35, D61 (Jan, 2007).
4. J. P. Wang et al., BMC Bioinformatics 6, 300 (2005).
5. J. P. Z. Wang, B. G. Lindsay. (American Statistical Association, 2005), vol. 100, pp. 942-960.
6. B. Efron, R. Tibshirani, An Introduction to the Bootstrap (Chapman & Hall/CRC, 1993), pp.
7. N. D. Hastie, J. O. Bishop, Cell 9, 761 (Dec, 1976).
8. J. B. Kim et al., Science 316, 1481 (Jun 8, 2007).
9. Y. Benjamini, Y. Hochberg. (JSTOR, 1995), vol. 57, pp. 289-300.
10. B. M. Bolstad, R. A. Irizarry, M. Astrand, T. P. Speed. (Oxford Univ Press, 2003), vol. 19, pp. 185-193.
11. S. Audic, J. M. Claverie, Genome Res 7, 986 (Oct, 1997).
12. P. Carninci et al., Nat Genet 38, 626 (Jun, 2006).
Supplementary Information Guide
SOM.doc: contains supplementary methods, table S1, figures S1 to S3 and references.
Table S2: Contains the combined list of genes from the Ensembl and Eldorado annotation
with the RNA-SEQ results for HEK and B cells (Ensembl + Eldorado annotation). This table
provides for each gene in its respective annotation system, information about the gene
structure, the number of tags and their distribution in exons and in introns, the expression of
the gene, the Polymerase II correlation and the normalized expression (Zip file; 7.1 MB)
Table S3: Table listing the coordinates and number of tags of the 9,710 identified RNA
PolIIa bound regions and the associated gene loci with overlapping known promoters
(Eldorado, release 05/2007) (Excel file; 806 KB).
Table S4: This table lists the 13,118 genes that could uniquely be mapped between RNA-SEQ and microarray. It contains information about number of hits per gene, normalized
expression (NE), ratio and significance test for RNA-SEQ and the average intensities,
Detection scores, ratio and significance tests for Illumina (Excel file; 5.7 MB).
Table S5: Excel table listing all genes obtained by RNA-SEQ (16,392 based on ENSEMBL
v.46) with the ratios (B cell:HEK:cell) and q values (Excel file; 4.6 MB).
Table S6: This table provides lists of all transcript extensions for HEK and B cells in Sheet 1
and 2, respectively (Excel file; 2.2 MB).
Table S7: This table lists the novel internal exons identified in B and HEK cells, respectively,
and additional related information (gene info, distances to next exons, EST support, EST
linked gene symbol, and overlap of clusters between HEK and B cells) (Excel file; 1.3 MB).
Table S8: This table lists the new transcriptional cluster sites (TS) (Sheet 1), units (TUs)
(Sheet 1) and clusters (Sheet 2) with additional related information (CAGE, PolII, EST
supports; Repeat Masker, BlastX and BlastN results, Up and Downstream genes, etc…)
(Excel file; 345 KB).
Table S9: These tables contain all splice junctions with their position mapping on genes
(ENSEMBL v.46 and Eldorado release 05/07) and ESTs (EMBL NSD, release 89) that were
found in B cells (9a) and Hek cells (9b). Columns “Novel” and “Alternative” indicate
whether the identified junction is new and if the junction directly identifies an alternative
form of the gene, respectively (Zip file; 4 MB).