www.sciencemag.org/cgi/content/full/1160342/DC1

Supporting Online Material for

A Global View of Gene Activity and Alternative Splicing by Deep Sequencing of the Human Transcriptome

Marc Sultan, Marcel H. Schulz, Hugues Richard, Alon Magen, Andreas Klingenhoff, Matthias Scherf, Martin Seifert, Tatjana Borodina, Aleksey Soldatov, Dmitri Parkhomchuk, Dominic Schmidt, Sean O’Keeffe, Stefan Haas, Martin Vingron, Hans Lehrach, Marie-Laure Yaspo*

*To whom correspondence should be addressed. E-mail: [email protected]

Published 3 July 2008 on Science Express
DOI: 10.1126/science.1160342

This PDF file includes:
Materials and Methods
Figs. S1 to S3
Table S1
References

Other Supporting Online Material for this manuscript includes the following (available at www.sciencemag.org/cgi/content/full/1160342/DC1):
Tables S2 to S9 as zipped archives: Gene lists, tags, and other data related to genes studied in the paper, in Microsoft Excel and tab-delimited text (tsv) format. Legends appear in the main SOM PDF file.

Supplementary Methods

Cell culture
HEK 293T cells were cultured in parallel in 2 × 150 cm² flasks with DMEM (penicillin-streptomycin) (Gibco) supplemented with 10% Fetal Calf Serum (FCS) (Biochrom) and Glutamax (1x) (Gibco). At passage 20 (confluent state), cells were trypsinized, washed in PBS, pelleted, and resuspended in 2 ml lysis solution RLT (Qiagen). B cells were cultured in parallel in 2 × 150 cm² flasks with RPMI 1640 (penicillin-streptomycin) (Gibco) supplemented with 10% FCS (Biochrom) and Glutamax (1x) (Gibco). At passage 20 (confluent state), cells were trypsinized, washed in PBS, pelleted, and resuspended in 2 ml lysis solution RLT (Qiagen).

RNA preparation and double stranded cDNA generation
Total RNA was then extracted from ~20 × 10^6 cells per sample (2 HEK and 2 B cell samples) using the RNeasy Midi extraction kit (Qiagen), following the manufacturer’s instructions. DNA was removed using the “on column digestion” protocol of the RNeasy Midi extraction kit (Qiagen).
Total RNA quality was assessed by spectrophotometry (Nanodrop) and gel electrophoresis (1% agarose). Then, mRNA was extracted from 120 µg of each total RNA sample using the Dynabeads mRNA purification kit (Invitrogen), following the manufacturer’s instructions. The mRNA was eluted in 10.5 µl of 10 mM Tris-HCl. First-strand cDNA was generated directly from the eluted mRNA using random hexamers (Invitrogen) and the SuperScript RT kit (Invitrogen), following the manufacturer’s instructions. The final volume was 20 µl. Second-strand cDNA was synthesized immediately after the first-strand synthesis. Briefly, 1x Second strand buffer (Invitrogen), 200 nM final dNTPs (Invitrogen), 20 Units of E. coli DNA ligase (Invitrogen), 40 Units of E. coli polymerase (Invitrogen), and 4 Units of E. coli RNase H (Invitrogen) were added to the first-strand cDNA and incubated for 2 hours at 16°C. Double-stranded cDNA was then purified using the QIAquick PCR purification kit, following the manufacturer’s instructions. The generated cDNA was quantified by UV spectrophotometry (Nanodrop). For the digital expression libraries, about 250 ng of the double-stranded cDNA of each sample was fragmented by sonication on the UTR200 (Hielscher Ultrasonics GmbH, Germany) under the following conditions: 1 hour, 50% pulse, 100% power, and continuous cooling by 0°C water flow-through.

Chip hybridizations
Biotin-labelled cRNA was generated using a linear amplification kit (Ambion #IL1791), starting with 500 ng of DNA-free, quality-checked total RNA of each sample as input (see RNA preparation). Chip hybridizations, washing, Cy3-streptavidin (Amersham Biosciences) staining, and scanning were performed on the Illumina BeadStation 500 platform, employing reagents and following protocols supplied by the manufacturer. cRNA samples were hybridized as biological and technical duplicates on Illumina HumanRef8 V2.0 BeadChips.

RNA polymerase II chromatin immunoprecipitation
The protocol was followed as described previously, with some modifications (1). In short, 5-10 × 10^7 293T/17 cells were cross-linked with 1% formaldehyde for 10 min and the reaction was stopped by adding 1/20 volume of 2.5 M glycine. The cross-linked material was washed with PBS, lysed, and sonicated to an average size of 300 bp. The sheared chromatin was incubated with 10 µg of specific antibody to RNA polymerase II (ab5408, recognizing the hypophosphorylated form of RNA pol II) coupled to protein G magnetic beads (Invitrogen) for 16 h at 4°C. An aliquot of the input DNA was saved prior to immunoprecipitation as a reference sample. After washing (wash buffer: 50 mM Hepes-KOH, pH 7.6; 500 mM LiCl; 1 mM EDTA; 1% NP-40; 0.7% Na-deoxycholate) and elution, the DNA was finally recovered by reversing the cross-linking overnight at 65°C in the elution buffer (50 mM Tris-HCl, pH 8.0; 10 mM EDTA; 1% SDS). Enrichment of polymerase II-bound regions was estimated on 1/50 of the DNA by real-time PCR, following the protocol described previously (2). For this, DNA regions encompassing the transcription start site of three genes known to be active in HEK cells (SOD1, GABPA and CCT8) and of one control gene that is not enriched (HMOX1-140) were amplified using specific primers at 375 nM final concentration (primers are given in Table 1). The enrichment compared to the control input DNA was estimated at 50- to 150-fold.

Gene     Forward primer            Reverse primer
SOD1     CGCGGAGGTCTGGCCTATAA      CGTCGCCATAACTCGCTAGG
CCT8     CCTGCAGGAAGCAGTTCACG      TCTGGCCACAGAGGTTGCTC
GABPA    TGTACGCATGCGCTCTTTGA      ACGAACAGGACGCTGACTCG
HMOX1    GAAGGCGGATTTTGCTAGATTT    CTCCTGCCTACCATTAAAGCTG

Table 1: Primer sequences used for real-time PCR.

Preparation of libraries for RNA sequencing
In brief, library preparation consisted of the following steps: end-repair, A-tailing, ligation of adapter primers, size selection, and pre-amplification. For all samples, the libraries were prepared using the DNA sample kit (#FC-102-1002, Illumina), following the manufacturer’s instructions, but with the following modifications: a five-fold reduced amount of all enzymes was used, and adapters were ligated to the DNA fragments using 2 µl of ‘Adapter oligo mix’ in a total reaction volume of 25 µl. For the digital expression library, 250 ng of the sheared double-stranded cDNA (see above) was used and 170-220 bp fragments were selected at the gel size fractionation step. For Polymerase II ChIP DNA, 80 ng of input DNA and 10 ng of ChIP DNA (see above) were taken for library preparation. The amplification was performed prior to gel size fractionation. 18 cycles of PCR with Illumina PCR primers were performed using an elongation time of 45 seconds. PCR products were size fractionated and 150-300 bp fragments were excised.

Sequencing
Amplified material was loaded onto channels of the flowcell at 2-4 pM concentration. Sequencing was carried out on the 1G Illumina Genome Analyzer (Solexa) according to the manufacturer’s instructions, by running 27 cycles. Image deconvolution and quality value calculation were performed using the Goat module (Firecrest v.1.8.28 and Bustard v.1.8.28 programs) of the Illumina pipeline v.0.2.2.3. For digital gene expression, two biological replicates were sequenced for each of the cell lines, producing 3.5-4.4 million reads per sample. For the ChIP experiment, two independent lanes were sequenced from the same ChIP DNA and three lanes from one input DNA.

Aligning reads to the genome using ELAND
Reads were aligned to the human genome (hg18, NCBI build 36.1) using Eland software (Gerald module v.1.27, Illumina).
The mapping criteria imposed by Eland are as follows: matches should be collinear to the genome, allowing up to two mismatches but no indels. Under these conditions, 50% of the reads obtained here matched unique locations of the human genome, whereas 16-18% of the reads mapped to more than one genomic position.

Mapping reads to annotated transcripts and viral sequences
Reads that were found in unique positions in the human genome were further mapped to exons of protein-coding genes retrieved from two independent databases: ENSEMBL v.46 and Eldorado (release 05/2007). The number of reads that were fully included in exons (which we refer to as “number of hits”) was counted. For genes encoding multiple transcripts, the number of hits per gene was determined as the sum of all hits in all possible exons. Reads from both cell lines were further mapped to 2,896 viral genomic sequences retrieved from the viral section of the RefSeq database (release 25) (3).

Mapping reads to artificial splice junctions
All reads that could not be matched on the genomic sequence, to a known transcript or an EST sequence (EST version) were mapped to a second dataset consisting of synthetic splice junction sequences. We generated a dataset of artificial splice junction sequences by pairwise connection of exon sequences from every annotated locus retrieved from UCSC (hg18, http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/) and from Eldorado (release 05/2007), and obtained 2,334,049 and 2,828,506 synthetic splice junctions, respectively. The exons of every locus were sorted and indexed by their positions within the locus. Then the sequence of exon i was connected to the sequence of exon i+1, exon i+2, etc. The mapping was based on the UCSC and Eldorado databases, where all possible junctions of length 50-52 bp were retrieved covering the half of each connected exon sequence. Mappings allowing 0-2 mismatches with an overlap of at least 1 bp with each of the two exon sequences were considered.
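
The pairwise connection of exons described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function name build_junctions, the (start, sequence) input layout, and the fixed flank of 25 bp taken from each exon (giving 50 bp junctions) are assumptions made for illustration only.

```python
from itertools import combinations


def build_junctions(exons, flank=25):
    """Build artificial splice junction sequences for one locus.

    exons: list of (start_position, sequence) tuples (hypothetical layout).
    flank: number of bases taken from each exon (assumed value; two flanks
           of 25 bp give a 50 bp junction).
    Returns a list of (i, j, junction_sequence) with i < j, where i and j
    are exon indices after sorting by position within the locus.
    """
    ordered = sorted(exons, key=lambda exon: exon[0])
    junctions = []
    # Connect exon i to exon i+1, i+2, ... (every ordered pair i < j).
    for i, j in combinations(range(len(ordered)), 2):
        upstream_end = ordered[i][1][-flank:]     # 3' end of exon i
        downstream_start = ordered[j][1][:flank]  # 5' start of exon j
        junctions.append((i, j, upstream_end + downstream_start))
    return junctions
```

A locus with n exons yields n(n-1)/2 candidate junctions under this scheme, which is why the synthetic junction sets are much larger than the number of junctions that actually receive reads.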
78,457 and 62,110 splice junctions were identified on the UCSC junction set for HEK and B cells, respectively, whereas 69,952 and 56,000 splice junctions had reads matching on the Eldorado junction set. The two resulting datasets from Eldorado and UCSC were merged to obtain one reference dataset of splice junctions. For merging, the genomic start and end positions of the splice junctions were compared with a tolerance of +3/-3 base pairs. Redundant junctions were removed, leading to a total set of 83,239 and 66,330 junctions for HEK and B cells (see Files S9a and S9b).

Random model for junction hits
We considered a model to study the probability of random hits for a read of length R on splice junctions of length J. Depending on the matching strategy, we allow up to ME errors, where ME is the maximum number of substitution errors allowed between the read and the junction sequence. Assuming a uniform i.i.d. random model for DNA sequences, the probability P that an R-base-pair (bp) read matches a splice junction of size J bp with no more than ME substitution errors is

$$P(\text{hit on a splice junction with up to } ME \text{ errors}) = \frac{(J - R + 1) \sum_{k=0}^{ME} \binom{R}{k} 3^{k}}{4^{R}}. \qquad (1)$$

The expected number of reads that hit all considered splice junctions is calculated by multiplying the number of considered junctions and the number of considered reads by formula (1).

Expected number of splice junctions
The expected number of reads hitting splice junctions of a gene by chance was given by

$$X \cdot \frac{p_r}{1 - p_r},$$

where X is the number of unique reads that fall inside exons of the gene and where the probability $p_r$ that a read hits a junction corresponds to

$$p_r = \frac{(n - 1) \cdot 27}{L + (n - 1) \cdot 27},$$

where n is the number of exons in a given gene of length L. We considered one isoform per gene including all annotated exons, but ignored the exons for which less than half of the positions corresponded to unique hits.
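
As a worked illustration of formula (1) and of the expected number of junction reads per gene, the following Python sketch evaluates both expressions. The function names and argument names are hypothetical and chosen only for this example; the read length of 27 is taken from the sequencing description above.

```python
from math import comb


def p_random_hit(read_len, junction_len, max_errors):
    """Formula (1): chance that a random read matches one junction.

    (J - R + 1) possible placements, times the number of sequences within
    max_errors substitutions of a given R-mer, divided by 4^R.
    """
    placements = junction_len - read_len + 1
    neighbours = sum(comb(read_len, k) * 3 ** k for k in range(max_errors + 1))
    return placements * neighbours / 4 ** read_len


def expected_random_hits(n_junctions, n_reads, read_len, junction_len, max_errors):
    """Expected number of random read hits over all junctions and reads."""
    return n_junctions * n_reads * p_random_hit(read_len, junction_len, max_errors)


def expected_junction_reads(exonic_reads, n_exons, gene_len, read_len=27):
    """Expected junction reads for a gene: X * p_r / (1 - p_r)."""
    junction_positions = (n_exons - 1) * read_len
    p_r = junction_positions / (gene_len + junction_positions)
    return exonic_reads * p_r / (1 - p_r)
```

Note that X * p_r / (1 - p_r) simplifies to X * (n - 1) * 27 / L, so a gene with 11 exons, a length of 2,700 bp and 100 exonic reads would be expected to carry about 10 junction reads under this model.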

Theoretical read coverage and normalized expression
The theoretical read coverage takes into account constraints due to the mapping procedure, which eliminates non-unique reads. For each gene, all exons were extracted from the annotated transcript database (ENSEMBL v.46) and concatenated, considering the longest exonic form. For the resulting virtual cDNAs, the number of all possible 27 bp oligomers (gene length - 26) was determined using a sliding window of one base pair. From the number of all possible theoretical hits, we subtracted the non-unique oligomers (mapping to non-unique positions in the human genome) and the oligomers that were shared with other genes (e.g. exon overlaps or duplications). This provided us with the theoretical total number of unique 27-mers, or virtual length (L), representing a given gene. For most of the loci, 80-100% of the exonic sequence could be covered by unique 27-mers, allowing the scanning of most of the coding regions of the human genome. Only 517 genes could not be scored since they did not contain any informative reads, whereas ca. 1,000 genes were poorly scored (less than 40% theoretical coverage). Then, for each sample, we counted the total number of RNA-Seq reads that matched the virtual length of each gene, which was called the “number of hits” per gene (X). We further normalized the number of hits per gene according to the sample size of the biological replicates, by scaling them up to the sample with the largest number of reads and by dividing by the virtual length of the gene. In summary, if:

$N_i$ = sample size in replicate i.
$X_i^j$ = number of unique reads (number of hits) for gene j in replicate i.
$r$ = number of replicates.
$L^j$ = theoretical number of unique 27-mers in gene j.
$NE_i^j$ = normalized expression of gene j in replicate i.

Then,

$$NE_i^j = \frac{X_i^j}{L^j} \times \frac{\max_{1 \le k \le r} N_k}{N_i}.$$
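
The normalization above can be sketched in a few lines of Python. This is an illustration of the formula only; the function name and the dictionary-based input layout are assumptions, and genes with a virtual length of zero (no unique 27-mers) are skipped because they cannot be scored.

```python
def normalized_expression(hits, sample_sizes, virtual_lengths):
    """Compute NE_i^j = (X_i^j / L^j) * (max_k N_k / N_i).

    hits: list with one dict per replicate, mapping gene -> number of hits.
    sample_sizes: list of N_i, one per replicate.
    virtual_lengths: dict mapping gene -> number of unique 27-mers (L^j).
    Returns one dict per replicate, mapping gene -> normalized expression.
    """
    largest = max(sample_sizes)
    normalized = []
    for counts, size in zip(hits, sample_sizes):
        scale = largest / size  # scale up to the largest replicate
        normalized.append({
            gene: count / virtual_lengths[gene] * scale
            for gene, count in counts.items()
            if virtual_lengths.get(gene, 0) > 0
        })
    return normalized
```

For example, a gene with 10 hits in a replicate of 100 reads and 5 hits in a replicate of 50 reads receives the same normalized expression in both replicates, since the smaller replicate is scaled up by a factor of two.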

Assessing Transcript Diversity
We assessed the transcript diversity by estimating the real total number of genes expressed in a cell line and the expected number of additional genes that could be identified by sequencing additional replicates. Mapped reads are assumed to be sampled independently and with replacement from the whole transcript population. Under this assumption, the observed number of counts for a gene follows a Poisson distribution of parameter $\lambda$:

$$P(X_i^j = x_i^j) = e^{-\lambda} \frac{\lambda^{x_i^j}}{x_i^j!},$$

where $X_i^j$ is the number of unique reads (number of hits) for gene j in replicate i. We call $\omega_i^c$ the number of genes with count c in replicate i:

$$\omega_i^c = \sum_j \delta_{X_i^j, c},$$

where $\delta_{\cdot,\cdot}$ is the Kronecker symbol ($\delta_{a,b} = 1$ if a = b and 0 otherwise). $S_i$ is the number of different genes observed in experiment i ($S_i = \sum_{c>0} \omega_i^c$). For instance, 14,107 genes are expressed in the HEK-1 lane. $\omega_i^0$ represents the theoretical number of expressed genes that are not observed in experiment i. Estimation of gene diversity then reduces to the estimation of $\omega_i^0$ conditionally on $S_i$. Let us denote as f the marginal distribution over read counts, which can be written as:

$$f(c, Q) = P(X_{\cdot}^{j} = c) = \int e^{-\lambda} \frac{\lambda^{c}}{c!} \, dQ(\lambda),$$

where Q denotes the distribution of Poisson rates amongst the genes. The estimate $\hat{Q}$ is obtained with a Non Parametric Maximum Likelihood (NPMLE) approach as presented by Wang and Lindsay (4, 5). An estimator of $\omega_i^0$ can be written as

$$\hat{\omega}_i^0 = S_i \frac{f_i(0, \hat{Q})}{1 - f_i(0, \hat{Q})}.$$

Bootstrap estimates are obtained by resampling $\omega_i^c$ with $\hat{Q}$, conditionally on $S_i$. For every bootstrap sample an NPMLE estimation was conducted, and the confidence interval was deduced from 200 samples (6). The estimates for Q were obtained using a modified version of the Java code developed by the authors (courtesy of J.P. Wang).

Detection level of RNA-Seq
With two lanes, we obtained a sufficient sampling depth for capturing rare transcripts. Based on the estimate of ca.
300,000 transcripts per cell, we detected as little as 0.3 RNA copy per cell. Because we interrogated the whole transcript length, we derived this estimate by rescaling normalized expressions as a function of the number of mappable reads in each exon.

Random noise model for read mapping
Using the reads located outside of annotated genes, we estimated the background at one read per 25,000 bp, except for one of the B cell replicates (B cell-1) where the background is higher (1 read per 9,174 bp), possibly due to a slight contamination with DNA or hnRNA. To do so, we derived a model of read mapping at random, based on the observed reads within regions where no transcription has been annotated. The null probability of observing a hit on one base pair is then calculated as:

$$p_0 = \frac{\#\{\text{hits in regions without annotated transcription}\}}{\text{length of regions without annotated transcription}}.$$

For each gene of length L with n reads, we then compute the probability of observing those reads from random mapping alone. As $X \sim \mathrm{Poisson}(L \cdot p_0)$ (section above), the probability of observing at least n reads can be derived. For each lane i, a probability $p_0^i$ was estimated, and only the genes with an FDR of less than 5% were selected (9). This resulted in the elimination of 7 and 10 genes for HEK cell lanes 1 and 2, and 522 and 16 genes for B cell lanes 1 and 2, respectively. Most of the discarded genes (all but 31) had one read (only one gene with 5 reads was rejected in B cell lane 1). Accordingly, we set an arbitrary threshold of 5 reads per coding sequence on the merged lanes as the expression hallmark; genes with 1-4 exonic reads could be either weakly expressed or background.

RNA-Seq reproducibility
Data from the two biological replicates were highly correlated at both the qualitative (detected exons) and quantitative (read density) levels (Pearson's correlation coefficient (PCC) 0.99 for HEK and 0.98 for the B cell samples).
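
The random noise model for read mapping described above can be sketched as follows: a Poisson tail probability per gene under the background rate, followed by a Benjamini-Hochberg correction. This is a simplified illustration rather than the analysis code of this study; the function names, the (length, reads) input layout, and the use of the gene length in bp as L are assumptions.

```python
from math import exp


def poisson_sf(n, mu):
    """P(X >= n) for X ~ Poisson(mu), computed as 1 - P(X <= n - 1)."""
    if n <= 0:
        return 1.0
    term = exp(-mu)  # P(X = 0)
    cdf = term
    for k in range(1, n):
        term *= mu / k  # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDR), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def genes_above_background(genes, p0, fdr=0.05):
    """Keep genes whose read count is unlikely under random mapping.

    genes: dict mapping gene -> (length_bp, n_reads) (hypothetical layout).
    p0: background probability of a hit per base pair.
    """
    names = list(genes)
    pvalues = [poisson_sf(genes[g][1], genes[g][0] * p0) for g in names]
    adjusted = benjamini_hochberg(pvalues)
    return [g for g, q in zip(names, adjusted) if q < fdr]
```

With a background of one read per 25,000 bp, a single read in a 25,000 bp gene has a tail probability of about 0.63 and is indistinguishable from noise, which is consistent with most of the discarded genes having exactly one read.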
Merging reads from the two samples increased both the number of detected transcripts (Table S1) and the transcript coverage (see figure below).

Gene-read coverage. Histogram showing the fraction of the transcript (ENSEMBL v46) length covered by reads for all genes expressed under relaxed parameters (at least one read). Individual lanes are shown with grey bars and merged lanes with a green bar. (ENSEMBL-based).

Micro-array analysis and differential gene expression
Raw expression data (BeadSummary gene profile files), without normalization, were extracted from the manufacturer’s Illumina BeadStudio application software 1.0. Further processing of the data was done within the R / Bioconductor statistical analysis software package. We calculated the Pearson product-moment correlation coefficient on the non-normalized data sets to remove outliers and to check the integrity of the experiments (r >= 0.98), followed by the computation of intra-chip pair-wise scatter plots for each probe. The raw data sets were then normalized using quantile normalization (10), as it is robust against non-linearity of data and also performs well on large data sets. The normalization was computed by combining the treatment and the control groups for which differential gene expression was analyzed. We calculated a detection score for each probe on the array that evaluates whether the signal intensity is significantly above the background intensity of the array. The detection score (DS) is defined as DS = R / N, where R is the rank of the probe bead signal relative to the negative control bead signals and N is the number of negative controls on the chip field. To compare the overlap of expressed genes between the platforms, we first mapped all microarray (MA) probes to the ENSEMBL gene catalogue (ENSEMBL v.46) using Biomart software (www.biomart.org). We considered for analysis only genes with addressing probes on the MA (13,118 genes).
Second, we set thresholds in each platform to define whether a gene was expressed or not:
• A gene was considered as detected in the RNA-Seq platform if it had at least a single hit (relaxed analysis) or at least five hits (stringent analysis) on the merged replicates.
• For the microarray, the criterion for gene expression was a detection score (DS) greater than or equal to 0.95 (1 - p-value) in all experiments (see Material and Methods).
To score for differential expression, we considered the genes possessing, in the MA experiment, a detection score greater than or equal to 0.95 for both sample types. The criterion for RNA-Seq values was a number of counts greater than or equal to 5 (on the merged replicates) in both sample types. The significance of differential expression for the microarray platform was assessed using a t-test assuming inequality of variances between the 2 conditions. We then applied a Benjamini-Hochberg correction to the p-values, and a gene was considered as significantly differentially expressed between the 2 conditions if its false discovery rate (FDR) was lower than or equal to 1% (3,421 genes). For RNA-Seq, the differential expression was assessed with a test proposed by Audic and Claverie (A&C) on the pooled lanes (11). The cutoff on the corrected FDR was fixed at 1% (4,376 genes) (9).

Cluster detection
The reads were sorted by their position on the chromosome and the start positions were used for the analysis. The RNA-Seq reads were analyzed for clusters of x reads in an initial window of l base pairs. We denote as pos(k) the position of read k, and $d_k^x$ describes the distance spanned by x consecutive reads starting with read k: $d_k^x = pos(k + x - 1) - pos(k)$. All reads k for which $d_k^x \le l$ holds define a cluster $C_k^x$, where $C_k^x = [k, \ldots, k+x-1]$. Overlapping clusters were merged if they overlapped by at least one read, as follows. Given two clusters $C_a^x$ and $C_b^x$ with pos(a) < pos(b), the new cluster $C_{ab}$ becomes: $C_{ab} = C_a^x \cup C_b^x$.
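
The cluster detection procedure can be sketched in Python as below. This is a minimal illustration under the definitions given above (x reads within a window of l base pairs, merging of clusters that share at least one read); the function name and the returned (start, end, number of reads) tuples are assumptions of this sketch.

```python
def find_clusters(positions, x, l):
    """Detect clusters of at least x reads spanning at most l base pairs.

    positions: read start positions on one chromosome.
    Returns a list of (start_pos, end_pos, n_reads) tuples, one per
    merged cluster, in genomic order.
    """
    pos = sorted(positions)
    spans = []  # [first_read_index, last_read_index] of each merged cluster
    for k in range(len(pos) - x + 1):
        # d_k^x = pos(k + x - 1) - pos(k)
        if pos[k + x - 1] - pos[k] <= l:
            first, last = k, k + x - 1
            if spans and first <= spans[-1][1]:
                # Shares at least one read with the previous cluster: merge.
                spans[-1][1] = last
            else:
                spans.append([first, last])
    return [(pos[a], pos[b], b - a + 1) for a, b in spans]
```

Because the reads are sorted, each window only needs to be compared with the most recently opened cluster, so a single pass over the reads is sufficient.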
In summary, $C_{ab}$ contains all reads that were contained in $C_a$ and $C_b$.

Pol-II ChIP-Seq data analysis and correlation to gene expression
After mapping of the reads to the genome, the replicates were pooled, resulting in 3,198,755 unique reads for two lanes of the RNA Pol II experiment and 8,017,459 unique reads for three lanes of the control input DNA. For the analysis of the Pol-II ChIP-Seq data, the above algorithm was applied with x = 10 and l = 100. From the 3,198,755 reads, 9,908 clusters could be identified. The algorithm was also applied to the input control. Due to the larger number of reads in this data set (8,017,459), the threshold x was adapted and set to 25. This resulted in 416 clusters from the input control. 198 of the Pol-II clusters overlapped with one of the input control clusters and were thus discarded. The remaining 9,710 clusters were used for further analysis. These RNA pol-II blocks were mapped to 70,968 human promoters associated with 32,595 gene loci in Eldorado (Release 05/2007). To correlate promoter activity data with the gene expression level, we distributed the genes into five categories: the 12,567 expressed genes were distributed into three groups of equal size (4,189 genes in each group) referring to high (0.0889<NE<47.8), medium (0.0263<NE<0.0889) and low expression (0.0003<NE<0.0263); uncertain (2,396 genes with 1-4 reads) and silent (7,333 genes with no read) genes represented the last two categories.

New transcriptional regions
For the detection of enriched clusters of RNA-Seq reads in intergenic and intronic regions, the threshold x was set to 5 and the window size l was set to 100. Intergenic and intronic regions were defined using the NCBI build 36 of the human genome and the 113,458 transcripts annotated in ENSEMBL (v.46; 42,622 transcripts) and Eldorado (Release 05/2007; 70,836 transcripts).

Intronic regions
We analyzed genomic regions solely annotated as introns.
They did not overlap with any of the exons annotated in either ENSEMBL v.46 or Eldorado (Release 05/2007). The exons flanking the intronic region do not need to belong to the same annotated transcript. Of the 215,549 intronic regions identified, 203,004 from HEK and from B cell were discarded because they were shorter than 100 bp or had fewer than 5 reads. The remaining intronic regions (10,024 in HEK and 13,256 in B cell) were analyzed for regions enriched in RNA-Seq reads by applying the cluster algorithm described above. Clusters were tested for overlap with the annotated ESTs (EMBL Nucleotide Sequence Database, Release 89), and for whether the respective ESTs overlap with any of the exons annotated in ENSEMBL or Eldorado and thus connect the cluster to the annotated transcript (one base pair of overlap).

Intergenic regions
For the prediction of new transcriptional units in intergenic regions, clusters were calculated individually for each of the HEK and B cell lines. Intergenic regions were defined as parts of the genome not annotated by transcripts in either ENSEMBL or Eldorado. The minimum required length for an intergenic region was 10,000 bp. The 5,000 bp immediately upstream and downstream of the genes flanking the intergenic region were excluded from the analysis. We identified 1,584 clusters in HEK and B cells collectively. Clusters connected directly or indirectly by an EST sequence (by at least one base pair) were grouped into transcriptional units (TUs). Thus, 594 clusters with EST data were collapsed into 173 TUs. EST sequences were taken from the EMBL Nucleotide Sequence Database (Release 89) and mapped to the human genome (NCBI build 36), resulting in 6,998,739 EST annotations. In addition, we determined the presence of Pol-IIa clusters from the ChIP-Seq data 1 kb upstream or downstream of the new transcriptional unit, and the number of CAGE tags (12).
The transcriptional units were tested in both directions because the RNA-Seq reads lack strand specificity.

Filtering
Orphan reads initially located within clusters but which corresponded to either true splice junctions artefactually mapped to processed pseudogenes, translocated gene fragments, or repeated elements were filtered out. All intergenic TUs and intronic clusters were subjected to repeat masking (RepeatMasker software 3.1.9), and TUs or clusters consisting of >50% repetitive sequences were removed. Additionally, all reads related to intergenic TUs and intronic clusters were screened against the set of artificial exon junction sequences described above, using Eland software with the same parameter settings and conditions as for the original mapping to the genome. Reads artificially mapping to pseudogenes or retrotransposed gene fragments were thus rejected. Whenever more than 20% of the reads of a putative intergenic TU or intronic cluster were rejected, or the number of reads dropped below 5, the cluster was not considered. By doing so, we rejected 141 intergenic TUs because of repeated elements and 670 TUs mapping to pseudogenes or retrotransposed gene fragments. For the intronic clusters, 341 and 374 were rejected due to repeated elements in HEK and B cell, respectively, and 795 and 622 clusters were rejected due to mapping to the exon junction set.

Blast analysis
The genomic sequences of the RNA-Seq read clusters related to the same transcriptional unit (TU) were concatenated into artificial TU sequences. All TUs were screened for similarities to known or predicted protein sequences from the SwissProt/Trembl database (07.02.08) using BLASTX (e-value cutoff = 0.001, minimum score = 40, percent identity >= 50%) and requiring a minimum match of 30 amino acids.
In addition, we performed a BLASTN search against the non-redundant database (nr, NCBI, options: -v 3 -b 3), annotating the top three matches but rejecting hits to genomic DNA based on the sequence description (keywords: genomic DNA; DNA sequence from clone; sapiens chromosome; troglodytes chromosome; chromosome [A-Z0-9]*clone; clone RP[0-9]; BAC; PAC). Only sequence matches >= 30 bp were considered for further analysis.

Transcript extension
Transcripts annotated in ENSEMBL or Eldorado were analyzed to determine whether RNA-Seq reads located immediately upstream or downstream of the transcript indicate that the annotation of the respective 5' or 3' UTR is incomplete. To this end, we determined the density of RNA-Seq reads and the maximum distance between two neighboring reads in the first and last exon of each transcript. For the density, the number of reads was normalized by the length of the respective exon (number of reads / length of exon). The annotation of the exon was then extended to the next upstream/downstream read if (a) the distance to the next read was below the maximum distance observed in the originally annotated exon and (b) the density of reads calculated for the extended exon did not fall below the density for the original exon. Exons with a density of RNA-Seq reads below 0.01 were not analyzed.
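
The extension rule for a terminal exon can be illustrated with the Python sketch below, written for the downstream (3') direction only. Several details are assumptions of this sketch rather than statements of the method: the function name, the use of read start positions, measuring the distance to the next read from the current exon boundary, and repeating the extension read by read until one of the two conditions fails.

```python
def extend_exon_downstream(exon_start, exon_end, read_starts, min_density=0.01):
    """Extend the 3' boundary of a terminal exon using neighboring reads.

    exon_start, exon_end: inclusive coordinates of the annotated exon.
    read_starts: read start positions on the same chromosome.
    Returns the (possibly extended) exon end coordinate.
    """
    inside = sorted(p for p in read_starts if exon_start <= p <= exon_end)
    density = len(inside) / (exon_end - exon_start + 1)
    # Exons with low read density are not analyzed; at least two reads are
    # needed to define a maximum distance between neighboring reads.
    if density < min_density or len(inside) < 2:
        return exon_end
    max_gap = max(b - a for a, b in zip(inside, inside[1:]))

    new_end, n_reads = exon_end, len(inside)
    for p in sorted(q for q in read_starts if q > exon_end):
        if p - new_end >= max_gap:
            break  # condition (a) fails: next read is too far away
        if (n_reads + 1) / (p - exon_start + 1) < density:
            break  # condition (b) fails: extended exon would be too sparse
        new_end, n_reads = p, n_reads + 1
    return new_end
```

The upstream (5') case is symmetric, walking from the exon start towards lower coordinates.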

Supplementary Table

Table S1: Tag mapping

                             HEK-1      HEK-2      HEK pooled  B cell-1   B cell-2   B cells pooled
Unique matches               2,362,046  2,278,066  4,640,112   2,086,216  1,809,427  3,895,643
Map to RNA ENSEMBL           1,721,149  1,677,026  3,398,175   1,378,594  1,263,220  2,641,814
Map to RNAs Eldorado         1,844,642  1,798,421  3,643,063   1,488,798  1,358,628  2,847,426
ENSEMBL + Eldorado combined  1,879,447  1,833,029  3,712,476   1,517,184  1,385,203  2,902,387
Partially mapping RNAs       72,951     71,774     144,725     56,186     54,541     110,727
Orphan reads (all)           341,844    316,113    657,957     465,722    305,758    771,480
Orphan reads (intronic)      200,395    189,778    390,173     249,600    190,242    439,842
Orphan reads (intergenic)    141,449    126,335    267,784     216,122    115,516    331,638
Multiple matches             788,526    757,835    1,546,361   752,635    572,135    1,324,770
No match to genome           1,190,768  1,027,518  2,218,286   1,173,196  1,093,622  2,266,818
Low quality reads            118,974    115,186    234,160     106,340    88,659     194,999
Total reads                  4,460,314  4,178,605  8,638,919   4,118,387  3,563,843  7,682,230

Read mapping overview. Summary of the mapping of the reads to the human genome (hg18, NCBI build 36.1) and to gene loci (ENSEMBL v.46 and Eldorado release 05/07).

Supplementary Figures

Figure S1 | Dynamic range of RNA-Seq. Rank abundance curves (RACs) showing the total number of mapped reads (x-axis) versus the total number of identified genes (y-axis). Data points show the observed values on individual (crosses) and merged (dots) lanes. Curves were extrapolated as follows: for values smaller than or equal to the sample size, the number of expected genes was obtained by sub-sampling. For values greater than the sample size, the number of expected discoveries was computed from the statistical analysis by Poisson mixtures.

Figure S2 | Example of new TUs. Snapshot of TU 33 (yellow) and TU 34 (pink) spanning over 190 MB on chromosome 2. The units can be merged based on recent EST data. The sequencing reads show that these TUs are predominantly expressed in B cells.
The vast majority of reads are in agreement with exons deduced from ESTs, and coincide frequently with highly conserved regions (PhastCons Track, UCSC). At the 5' end of TU99, the putative transcriptional start is supported by CAGE tags as well as by a PolIIa-bound region.

Figure S3 | Read density in the EIF4G1 gene on chromosome 3. For HEK (top) and B cells (bottom), the green bars show the 33 known exons, red bars represent the number of reads at a given position, and blue lines identify the reads on splice junctions (width is proportional to the number of reads on a given junction).

References:
1. T. I. Lee, S. E. Johnstone, R. A. Young, Nat Protoc 1, 729 (2006).
2. P. Kahlem et al., Genome Res 14, 1258 (Jul, 2004).
3. K. D. Pruitt, T. Tatusova, D. R. Maglott, Nucleic Acids Res 35, D61 (Jan, 2007).
4. J. P. Wang et al., BMC Bioinformatics 6, 300 (2005).
5. J. P. Z. Wang, B. G. Lindsay, J Am Stat Assoc 100, 942 (2005).
6. B. Efron, R. Tibshirani, An Introduction to the Bootstrap (Chapman & Hall/CRC, 1993), pp.
7. N. D. Hastie, J. O. Bishop, Cell 9, 761 (Dec, 1976).
8. J. B. Kim et al., Science 316, 1481 (Jun 8, 2007).
9. Y. Benjamini, Y. Hochberg, J R Stat Soc B 57, 289 (1995).
10. B. M. Bolstad, R. A. Irizarry, M. Astrand, T. P. Speed, Bioinformatics 19, 185 (2003).
11. S. Audic, J. M. Claverie, Genome Res 7, 986 (Oct, 1997).
12. P. Carninci et al., Nat Genet 38, 626 (Jun, 2006).

Supplementary Information Guide
SOM.doc: contains supplementary methods, Table S1, Figures S1 to S3 and references.
Table S2: Contains the combined list of genes from the Ensembl and Eldorado annotation with the RNA-Seq results for HEK and B cells (Ensembl + Eldorado annotation).
This table provides, for each gene in its respective annotation system, information about the gene structure, the number of tags and their distribution in exons and in introns, the expression of the gene, the Polymerase II correlation and the normalized expression (Zip file; 7.1 MB).
Table S3: Table listing the coordinates and number of tags of the 9,710 identified RNA PolIIa-bound regions and the associated gene loci with overlapping known promoters (Eldorado, release 05/2007) (Excel file; 806 KB).
Table S4: This table lists the 13,118 genes that could be uniquely mapped between RNA-Seq and microarray. It contains information about the number of hits per gene, normalized expression (NE), ratio and significance test for RNA-Seq, and the average intensities, detection scores, ratio and significance tests for Illumina (Excel file; 5.7 MB).
Table S5: Excel table listing all genes obtained by RNA-Seq (16,392 based on ENSEMBL v.46) with the ratios (B cell:HEK cell) and q values (Excel file; 4.6 MB).
Table S6: This table provides lists of all transcript extensions for HEK and B cells in Sheets 1 and 2, respectively (Excel file; 2.2 MB).
Table S7: This table lists the novel internal exons identified in B and HEK cells, respectively, and additional related information (gene info, distances to next exons, EST support, EST-linked gene symbol, and overlap of clusters between HEK and B cells) (Excel file; 1.3 MB).
Table S8: This table lists the new transcriptional cluster sites (TS) (Sheet 1), units (TUs) (Sheet 1) and clusters (Sheet 2) with additional related information (CAGE, PolII and EST support; RepeatMasker, BlastX and BlastN results; upstream and downstream genes, etc.) (Excel file; 345 KB).
Table S9: These tables contain all splice junctions with their position mapping on genes (ENSEMBL v.46 and Eldorado release 05/07) and ESTs (EMBL NSD, release 89) that were found in B cells (9a) and HEK cells (9b).
Columns “Novel” and “Alternative” indicate whether the identified junction is new and if the junction directly identifies an alternative form of the gene, respectively (Zip file; 4 MB).