BIOINFORMATICS ORIGINAL PAPER
Vol. 26 no. 1 2010, pages 84–90
doi:10.1093/bioinformatics/btp626

Gene expression

ARH: predicting splice variants from genome-wide data with modified entropy

Axel Rasche∗ and Ralf Herwig
Department of Vertebrate Genomics, Max-Planck-Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin, Germany
Received on June 15, 2009; revised on August 21, 2009; accepted on October 7, 2009
Advance Access publication November 4, 2009
Associate Editor: Trey Ideker

ABSTRACT
Motivation: Exon arrays allow the quantitative study of alternative splicing (AS) on a genome-wide scale. A variety of splicing prediction methods has been proposed for Affymetrix exon arrays mainly focusing on geometric correlation measures or analysis of variance. In this article, we introduce an information theoretic concept that is based on modification of the well-known entropy function.
Results: We have developed an AS robust prediction method based on entropy (ARH). We can show that this measure copes with bias inherent in the analysis of AS such as the dependency of prediction performance on the number of exons or variable exon expression. In order to judge the performance of ARH, we have compared it with eight existing splicing prediction methods using experimental benchmark data and demonstrate that ARH is a well-performing new method for the prediction of splice variants.
Availability and Implementation: ARH is implemented in R and provided in the Supplementary Material.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION
Alternative splicing (AS) has gained increasing interest since it has been identified as a key mechanism for generating the complex proteome of multicellular organisms by the generation of multiple protein isoforms from single gene templates (Ben-Dov et al., 2008).
AS has been observed within a variety of biological conditions, for example, in tissue expression (Gupta et al., 2005; Wang et al., 2008), with respect to human diseases (Novoyatleva et al., 2006; Stoilov et al., 2002) and in protein modification (Stamm et al., 2005). Currently, AS is estimated to occur in up to 95% of human multi-exon genes (Pan et al., 2008; Wang et al., 2008). While standard microarray platforms have been useful to study gene expression on a large scale, this technique was, until recently, not suitable for the detection of splice variants due to the design of the probes. As a consequence, existing AS in the samples under study introduced an undetectable bias to those experiments. With the release of the Affymetrix exon arrays, a first commercial platform for studying AS on a genome-wide scale has been made available, using more than five million oligonucleotide probes distributed across the exome. The analysis of exon microarrays has posed new challenges to the computational analysis, including data normalization, quality control measures and methods for AS prediction. A number of methods have been proposed to study AS with Affymetrix exon arrays (Exon Array Whitepaper Collection, 2005; Purdom et al., 2008; Xing et al., 2006). Most of these methods are based on geometric measures, such as linear correlation, or analysis of variance (see Section 4.2 for more details).
Entropy is a fundamental measure of information and uncertainty that has been introduced in the context of information theory by Shannon (1948). Mathematical properties of entropy have been described, for example, in the textbook of Cover and Thomas (2006). In our approach, we take advantage of fundamental characteristics of entropy as a measure for the uniformity of a given probability distribution defined by the exon gene expression ratios within a given transcript (see Appendix 1 in the Supplementary Material).
∗To whom correspondence should be addressed.
Information theoretic approaches based on entropy have been used in many fields of biosciences: for example, in sequence analysis (Schug et al., 2005); protein structure analysis (Crooks and Brenner, 2004); in the context of clustering gene expression data (Herwig et al., 1999; Steuer et al., 2002); feature selection (Herwig et al., 2000); and reverse engineering of gene regulatory networks (Margolin et al., 2006).
In our article, we present a new method for predicting AS using a modified entropy measure. We show that this method overcomes important inherent biases such as the dependency of the splicing prediction on the number of exons and the variability of exon expression. Furthermore, we present the first comprehensive evaluation of AS prediction methods for Affymetrix exon arrays using experimental benchmark datasets generated from different human tissues and transcript spike-in samples. A splicing prediction method needs to be sensitive to the deviation of a proportion of exons, symmetric in up- and down-splicing, independent of the position of the spliced exons within the gene sequence and independent of the number of exons of the gene. To assess the performance of the different methods, we analysed them with respect to these requirements. For validation of the computational predictions, we translated the information on known splice variants available in public databases into sets of true positive splicing events. We show that ARH is a well-performing new splicing prediction method. An R implementation of ARH and the different methods is available in the Supplementary Material.

© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

2 METHODS
2.1 Probe–exon–gene assignments
The probe to exon and probe to gene assignments were dependent on the respective benchmark dataset.
For the spike-in dataset, the assignment was done using the chip description files from Affymetrix. For the human tissue dataset, probe–exon assignments were drawn from the latest genome annotations of Dai et al. (2005) in version 11 for Ensembl exons. Exon to gene assignment was retrieved via BioMart from Ensembl 49 (Birney et al., 2006; Kasprzyk et al., 2004) and resulted in 232 376 exons that correspond to 26 538 genes.

2.2 Preprocessing
A model-based analysis for tiling arrays from Johnson et al. (2006) was applied for intra-chip normalization, with the adjustment for exon arrays described in Kapur et al. (2007). Quantile normalization was then applied between arrays with the affy package (Gautier et al., 2004) in R/Bioconductor in version 2.8.0/2.3.0, respectively (Gentleman et al., 2004; R Development Core Team, 2005). Exon and gene expression were defined as the median of the respective probe intensities using all experimental replicates.

2.3 Computation of ROC, AUC
For each method, the predictions were sorted by decreasing splicing indication. Then the position of the confirmed events (the true positive set) was identified. The receiver operating characteristic (ROC) was visualized with the ROCR package in R (Sing et al., 2005). Likewise, the area under the curve (AUC) of the ROC was computed with ROCR for quantification of the performance.

2.4 Selection of AS events from the AEdb
AEdb contains confirmed splicing events extracted from the literature (Koscielny et al., 2009; Stamm et al., 2000). The AEdb sequence flat file was downloaded (http://www.ebi.ac.uk/asd/aedb/) and the splicing events were filtered by splicing mechanism (cassette exon events), species (human, mouse, rat) as well as the availability of a sequence for the events. Eight tissues were overlapping between AEdb and the tissue benchmark dataset. Events attributed to these eight tissues were selected and the corresponding sequences of the alternative exons were aligned to exon sequences from Ensembl 49 for exact matches. An event may be attributed to more than one tissue; for the tissue specificity analysis such events were skipped. For pairwise comparison, only the events specific to either of the two tissues were used as a true positive set. A complete list of the true positive events is available in the Supplementary Material along with the R implementation.

3 ALGORITHM AND IMPLEMENTATION
Splicing assessment is done in two steps. In the first step analysis is performed on the gene level, i.e. ARH identifies spliced genes [see Equation (5)]. In the second step analysis is performed on the exon level, i.e. splicing deviation [see Equation (1)] ranks the exons within a gene and identifies the skipped/included exons. Below, we give a formal description of the method; further material on the characteristics of entropy and an illustration of the different steps are given in Supplementary Material Appendix 1 and Supplementary Figure 7.
For a gene $g$ with $m$ exons, two biological conditions $c$ and $t$ with corresponding exon expressions $\phi_{g,e,t}$ and $\phi_{g,e,c}$, $e = 1,\ldots,m$, we compute the following quantities:

(1) The exon splicing deviation, $\zeta_{g,e}$, measures the deviation of the fold change in each individual exon from the median transcript fold change. Here, we compute log-ratios of exon fold changes to account for symmetric measurement of up- or down-splicing. From these log-ratios the median is subtracted to correct for global gene expression changes:

\[ \zeta_{g,e} = \log_2\frac{\phi_{g,e,t}}{\phi_{g,e,c}} - \operatorname*{median}_{e=1,\ldots,m}\, \log_2\frac{\phi_{g,e,t}}{\phi_{g,e,c}}. \qquad (1) \]

(2) The exon splicing probability is computed from the absolute value of the splicing deviation $\zeta_{g,e}$ by

\[ p_{g,e} = \frac{2^{|\zeta_{g,e}|}}{\sum_{e=1,\ldots,m} 2^{|\zeta_{g,e}|}}. \qquad (2) \]

Note that for each gene $\sum_{e} p_{g,e} = 1$.

(3) To measure whether the exon splicing probabilities are equally distributed or whether a single or a few exons dominate the probability distribution, we compute the entropy for each gene:

\[ H_g\left(p_{g,1},\ldots,p_{g,m}\right) = -\sum_{e=1}^{m} p_{g,e}\cdot\log_2 p_{g,e}. \qquad (3) \]

(4) Entropy defined in Equation (3) is dependent on the number of exons and cannot be directly used for the comparison of different genes. Thus, in order to make the measure independent of the number of exons for a given gene, we subtract entropy from its theoretical maximum:

\[ \max(H_g) - H_g = \log_2(m) - H_g\left(p_{g,1},\ldots,p_{g,m}\right). \qquad (4) \]

(5) Another necessary modification accounts for the strength of deviation within the gene. This is robustly estimated with the interquartile range of exon expression ratios, i.e. the 25% quantile, $Q_{.25,g} = Q_{.25,\,e=1,\ldots,m}\left(\phi_{g,e,t}/\phi_{g,e,c}\right)$, and the 75% quantile, $Q_{.75,g} = Q_{.75,\,e=1,\ldots,m}\left(\phi_{g,e,t}/\phi_{g,e,c}\right)$. An index for the amplitude is the interquartile ratio $Q_{.75,g}/Q_{.25,g}$. This ratio is close to 1 for low splicing probability and increases with deviations of a number of exons in the gene. The interquartile ratio is multiplied with the entropy index and constitutes the ARH splicing prediction:

\[ \mathrm{ARH}_g = \frac{Q_{.75,g}}{Q_{.25,g}}\cdot\left(\max(H_g) - H_g\right). \qquad (5) \]

Thus, ARH is suitable to compare the predictions across different genes. Large ARH values (>0.03) indicate splicing.
The ARH prediction is implemented in R with three functions (R Development Core Team, 2005): ARH, ARH_p and ARH_sd. ARH returns gene level predictions, the outcome of Equation (5), and ARH_sd returns the splicing deviations from Equation (1). ARH_p returns the P-values of ARH derived from a fitted generalized extreme value distribution (Section 4.1) using the package evd and the parameters described below (Stephenson, 2002). The three functions take two input vectors (x and y in the implementation) for the exon (or probe set) expressions and one vector for the exon–gene grouping (f). To avoid division by 0, the second vector is set to a minimum of 0.0001. Genes with only one exon or non-finite exon expressions are set to NA.
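As a rough illustration of Equations (1)–(5) and of the P-value step, the following Python sketch computes the quantities for a single gene. It is not the authors' R code: the function names `splicing_deviation`, `arh` and `gev_p_value` are hypothetical, the quartiles use linear interpolation (an assumption about the quantile convention), and the default distribution parameters are the fitted values reported in Section 4.1.1.

```python
import math
import statistics


def splicing_deviation(treatment, control, floor=1e-4):
    """Equation (1): log2 exon ratios centred on their median (illustrative)."""
    # The control vector is floored to avoid division by zero, as in the text.
    log_ratios = [math.log2(t / max(c, floor)) for t, c in zip(treatment, control)]
    centre = statistics.median(log_ratios)
    return [lr - centre for lr in log_ratios]


def arh(treatment, control, floor=1e-4):
    """Equations (1)-(5) for one gene; None stands in for NA."""
    if len(treatment) != len(control):
        raise ValueError("treatment and control must have the same length")
    m = len(treatment)
    # Single-exon genes and non-positive treatment values give no usable score.
    if m < 2 or any(t <= 0 for t in treatment):
        return None
    control = [max(c, floor) for c in control]
    zeta = splicing_deviation(treatment, control, floor)
    weights = [2.0 ** abs(z) for z in zeta]
    total = sum(weights)
    p = [w / total for w in weights]                      # Equation (2)
    entropy = -sum(pe * math.log2(pe) for pe in p)        # Equation (3)
    ratios = [t / c for t, c in zip(treatment, control)]
    # Assumption: linearly interpolated quartiles of the expression ratios.
    q25, _, q75 = statistics.quantiles(ratios, n=4, method="inclusive")
    return (q75 / q25) * (math.log2(m) - entropy)         # Equations (4), (5)


def gev_p_value(x, loc=0.006338, scale=0.005507, shape=0.3329):
    """Upper-tail probability of a generalized extreme value law (shape > 0)."""
    s = 1.0 + shape * (x - loc) / scale
    if s <= 0:
        return 1.0  # x lies below the support, so the whole mass is above it
    # 1 - exp(-s^(-1/shape)), written with expm1 for accuracy in the far tail.
    return -math.expm1(-(s ** (-1.0 / shape)))


# Toy gene with 13 exons; only the last exon changes, by a factor of 3.
print(arh([100.0] * 12 + [300.0], [100.0] * 13))
print(gev_p_value(0.37))
```

In this toy case the single deviating exon gives an ARH value of about 0.11, above the 0.03 threshold, and swapping the two conditions leaves the value unchanged. With the default parameters, an ARH value of 0.37 yields a P-value of roughly 8 × 10^−5, which is in line with the value reported for HNF4A in Figure 3.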
If a single exon deviates from the remaining exon expression ratios, it dominates the splicing probability distribution [Equation (3)], resulting in a low entropy and a high ARH value. If a larger number of exons is spliced, this measure is upweighted with an increased interquartile ratio >1. On the other hand, if all exons have similar expression changes, this leads to a high entropy with a small interquartile ratio and consequently to a small ARH value.

4 RESULTS
4.1 Characteristics of ARH
4.1.1 ARH background distribution
For a given experiment, the ARH values show a rapid decline from many near-zero values to few high ARH values. The ARH distribution shows little variation even between tissues (Supplementary Fig. 1). To derive a biologically motivated background distribution, we compared samples of the same biological conditions. The human tissue dataset from Clark et al. (2007) comprises data from 11 human tissues with 3 replicates each. In each tissue, this allows three pairwise comparisons, summing to 33 pairwise comparisons that were used for defining a background sample of ARH values. The distribution of these 33 comparisons provides thresholds for significant ARH values. The 95% quantile of the distribution is QARH,.95 = 0.031. The 95% quantiles of the 33 individual comparisons also cluster around that value. For the 90, 99 and 99.9% quantiles of the background distribution, the thresholds are QARH,.9 = 0.023, QARH,.99 = 0.057 and QARH,.999 = 0.13, respectively. The background distribution is also adequate to calculate P-values. We found the generalized extreme value distribution to fit best to the ARH background distribution due to a long heavy tail of large ARH values. Distribution parameters were fit with Matlab resulting in location = 0.006338, scale = 0.005507 and shape = 0.3329 (Fig. 1A).

4.1.2 Exon-level analysis
In a spliced gene, the splicing deviation ranks the exons in order to identify the most altered exons. With this ranking, exons can be selected, for example, for validation. Assessing the absolute splicing deviation as above, a global number of spliced exons is determined with the following thresholds: QARH_sd,.9 = 0.43, QARH_sd,.95 = 0.53, QARH_sd,.99 = 0.75 and QARH_sd,.999 = 1.07.
The exon-level splicing indication has to be symmetric in terms of up- or down-splicing. The swapping of treatment and control samples changes an up- to a down-spliced exon and vice versa. The absolute value of the log2 splicing deviation accounts for this symmetry. The dependency of the splicing probability on the fold changes was simulated for a gene with 13 exons, where log2 ratios were drawn from a normal distribution with mean 0 and SD 0.68 (Supplementary Fig. 2).
Spliced exons are not necessarily adjacent. In the liver versus pancreas tissue comparison, the transcription factor HNF4A is an illustrative example depicted in Figure 3. Three exons were predicted to be spliced on positions 1, 4 and 5 with one confirmed event in position 4 in pancreas. The sum in the entropy formula is commutative and reflects the position independence of the exons.

4.1.3 Gene-level analysis
A gene-level splicing prediction method needs to be sensitive to the deviation of a proportion of exons, which ARH measures with the entropy, using the interquartile ratio as a weighting factor. We performed a simulation with a varying number of spliced exons, where the linear ratio of the spliced exons is multiplied with a fold change of 3 (Fig. 1B). ARH values reflect the number of spliced exons with a unimodal, symmetric distribution.
A strength of ARH is its low dependency on the total number of exons of a gene. In ARH, the genes are sorted in bins by exon number and gene predictions are compared with the gene-bin maximal prediction. Comparing the entropy to the maximal entropy makes ARH independent of the number of exons (Fig. 1C). See also Section 5.3.

4.1.4 Performance with low number of experimental replicates
Since the costs of experimental replicates are often a limiting factor, methods should preferably require a low number of replicates for computing robust predictions. Purdom et al. (2008) were the first to address this aspect for their method FIRMA. We compared ARH and other methods using only a single chip per condition (Fig. 2C). The results highlight the good performance of ARH. Using the median over the probes and replicates, the method is relatively robust in the number of experimental replicates, although a clear loss in performance is seen with all methods when using one replicate instead of three (compare Fig. 2B and C).

[Figure 1. Panel titles: (A) Histogram of ARH values (x-axis: ARH values; y-axis: Frequency; red dashed: generalised extreme value distribution); (B) ARH, varying No of spliced exons (x-axis: No of spliced exons; gene with 13 exons, exons spliced with fc 3); (C) Prediction distribution in Exon Bin (x-axis: No exons in a gene; y-axis: ARH distributions).]
Fig. 1. (A) Histogram of ARH background distribution derived from splicing predictions between the same biological conditions (ARH values equal to zero were skipped) and the fitted generalized extreme value distribution (dashed line). (B) Simulation of ARH values (y-axis) with respect to number of spliced exons (x-axis). A gene with 13 exons is used for simulation corresponding to the average exon number in Ensembl for known protein coding genes. Log2 ratios were drawn from a normal distribution with mean 0 and SD 0.68 corresponding to the liver versus pancreas comparison. Respective number of exons were upregulated with a factor of 3 indicating splicing. (C) Boxplots of ARH values (y-axis) with respect to genes with the same number of exons (x-axis).

Fig. 2. ROC curves for different aspects of methods performance. (A) Overall performance across the 28 pairwise tissue comparisons with respect to AEdb confirmed splicing events (performances vertically averaged). (B) Liver versus pancreas individual splicing predictions. (C) Performance of methods with only one of the three replicates. (D) Comparison of muscle versus non-muscle tissue data invoking additional experimental noise with AEdb confirmed splicing events. (E) Muscle versus non-muscle tissue data with RT-PCR validated true positive set. (F) HeLa cell line data with spiked transcripts as true positive set.

4.2 Method comparison
We have compared ARH with eight existing splicing prediction methods:
• Splicing Index (SI) (Srinivasan et al., 2005)
• SPLICE (Hu et al., 2001)
• PAC (Exon Array Whitepaper Collection, 2005)
• ANOSVA (Cline et al., 2005)
• MiDAS (Exon Array Whitepaper Collection, 2005)
• MADS (Xing et al., 2008)
• FIRMA (Purdom et al., 2008)
• Correlation (Shah and Pallas, 2009)
Due to the character of the predictions, the methods can be categorized into scores (Splicing Index, SPLICE, PAC, FIRMA and Correlation) or tests (ANOSVA, MiDAS and MADS). The methods also differ in whether they provide exon-level prediction (Splicing Index, SPLICE, PAC, MiDAS, MADS and FIRMA) or gene-level prediction (ANOSVA and Correlation). All methods were applied to the data with the same preprocessing as described in Section 2.2 except FIRMA, which requires RMA. MiDAS values were calculated on the standard preprocessing with the Affymetrix Power Tools in version 1.8.0.

4.2.1 Test dataset 1: tissue data with literature confirmed events
As a true positive set for judging the methods performances, we chose an independent dataset from the manually curated AEdb.
Experimental data were available for 11 human tissues with 3 experimental replicates per tissue. Confirmed events in AEdb were available for eight of these tissues which allows for 28 pairwise tissue comparisons. Ordering the predictions by decreasing splicing indication constitutes a classifier that allows visualizing the performance of the predictions with the ROC (Fig. 2). Using the ROC, the performance was quantified with the AUC (Table 1).

Table 1. AUC for different test settings and methods
Method | Tissue pairwise average | Liver versus pancreas | Liver versus pancreas, 1-to-1 | Tissue specificity average | Muscle versus rest, AEdb | Muscle versus rest, RT-PCR | Spike-in
ARH | 0.83 | 0.86 | 0.84 | 0.86 | 0.86 | 0.97 | 0.99
Splicing Index | 0.7 | 0.74 | 0.73 | 0.75 | 0.71 | 0.95 | 0.96
SPLICE | 0.69 | 0.78 | 0.73 | 0.75 | 0.62 | 0.88 | 0.96
PAC | 0.63 | 0.75 | 0.74 | 0.72 | 0.64 | 0.96 | 0.96
ANOSVA | 0.76 | 0.78 | 0.72 | 0.7 | 0.6 | 0.84 | 0.98
MiDAS | 0.68 | 0.71 | – | 0.62 | 0.48 | 0.85 | 0.95
FIRMA | 0.69 | 0.73 | 0.69 | 0.75 | 0.74 | 0.92 | 0.75
MADS | 0.68 | 0.69 | – | 0.71 | 0.49 | 0.67 | 0.98
Correlation | 0.74 | 0.69 | 0.65 | 0.78 | 0.75 | 0.73 | 0.75
Gene length | 0.83 | 0.79 | 0.79 | 0.84 | 0.92 | 0.93 | –
Gene length predictor (last row) refers to the number of exons per gene. The best performing method in each column is shown in bold.

The benchmark test set was analysed with respect to different aspects. The pairwise tissue comparisons correspond to highly diverse biological conditions leading to a lot of variation in the benchmarks. In the analysis, we provide the average performance across the 28 pairwise comparisons (Fig. 2A). For an in-depth discussion of AS attributes, we chose the liver versus pancreas test case because it is representative for the average performance (Fig. 2B). For this test case AEdb returns 27 exon events in 18 genes.
The methods not only differ by performance but also by the predicted splicing events. The commonality of the predictions is assessed by looking at the overlaps between methods.
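The ranking-based evaluation used for the AUC values above can be sketched as follows. This is a minimal Python illustration rather than the ROCR computation used in the study: the function name `auc_from_scores` and the toy scores are made up, and the AUC is obtained from the rank-sum identity, i.e. the fraction of (confirmed, unconfirmed) pairs in which the confirmed event receives the higher splicing indication, with ties counted as one half.

```python
def auc_from_scores(scores, is_true_positive):
    """Area under the ROC curve for predictions ranked by decreasing score."""
    pos = [s for s, y in zip(scores, is_true_positive) if y]
    neg = [s for s, y in zip(scores, is_true_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one confirmed and one unconfirmed event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # confirmed event ranked above the unconfirmed one
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical splicing indications for four genes, two of them confirmed.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_confirmed = [True, False, True, False]
print(auc_from_scores(toy_scores, toy_confirmed))
```

An AUC of 0.5 corresponds to a random ordering and 1.0 to a ranking that places every confirmed event ahead of all others. The pairwise loop is quadratic in the number of predictions, which is acceptable for a sketch but would be replaced by a sort-based computation for array-scale data.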
The top 250 predictions constitute ∼1% of all genes on the array. The commonality table in Supplementary Table 1 reflects a limited overlap between the methods. For the 18 confirmed genes, the ARH values and the corresponding quantiles in the ARH background distribution, the P-values of the generalized extreme value fit and Q-values for false discovery rate correction following Dabney et al. (2006) are listed in Supplementary Table 2.
Furthermore, we analysed tissue specificity by comparing a selected tissue with the 10 remaining tissues (Supplementary Fig. 3). This leads to considerable variance in the intensities for the control group. For example, when comparing muscle with non-muscle tissues, this variance challenges the robustness of the methods to noise in the measurements and results in a strong spread of performances (Fig. 2D). For muscle the AEdb contains 19 confirmed exon events in 10 genes.
In Das et al. (2007), the authors use the same human tissue dataset to establish a list of muscle-enriched exons whereof 17 events have been validated with RT-PCR. Since the study was performed on an older genome build, the probe set region of the 17 events was updated with the UCSC Genome Annotations Lift Tool to the current genome build (Assembly Mar 2006). The original regions intersect with 13 Ensembl exons in 11 genes constituting the list of validated events used for analysis. Since the RT-PCR assays are generated specifically on the samples under study, the ROC curves are more specific than with AEdb confirmed events (Fig. 2E).
It is a major advantage of ARH that it is robust to noise within the samples. The effect is also illustrated in Supplementary Figure 4 with two case studies with different prediction quality.

4.2.2 Test dataset 2: spike-in transcripts
In Abdueva et al. (2007), another benchmark dataset was presented with spike-in hybridizations of 25 transcripts.
In HeLa cells, where these transcripts are not expressed, the mRNA was added at concentrations of 0, 2, 32, 128, 512 pM in a Latin square design by five groups. The dataset has the advantage that expression strength is exactly known in every sample. The samples have a very homogeneous background such that noise can be neglected. All true positives are known due to the closed collection of spiked genes. Following the original handling of the data, we used the Affymetrix probe–probe set–transcript cluster assignment.
We followed an idea of Beffa et al. (2008) to generate splicing events. For a transcript in the 0 pM group, one exon was assigned to a transcript of the 2 pM group and replaced by this exon. Exons with 0, 2, 32, 128, 512 pM were assigned to transcripts with 2, 0, 512, 32, 128 pM expression, respectively. The true positives in this dataset are characterized by differentially expressed genes with generically spliced exons at extreme fold changes. The environment excluding the 25 transcripts has no expression change at low variability. Our results show a general increase in methods performance compared with the tissue data, with ARH being the best performing method (Fig. 2F).

5 DISCUSSION
5.1 General performance of methods and study design
The prediction of AS remains a challenge. In general, performance of all methods could be improved, in particular, in the light of the tissue dataset. This is due to the fact that splicing prediction poses particular problems to the statistical analysis. A gene consists of several products on the one hand and of different exons on the other hand. For each transcript or each exon, a separate analysis is performed to test potential splicing using the same measurements in several tests. Approaches for the comparison of methods can be found in Purdom et al. (2008) with a simulation model and in Beffa et al. (2008) with the re-ordering of spike-in data.
The advantage of the human tissue dataset is the challenge to identify splicing events in a non-artificial setting. The confirmed events used for our study are fully independent of the computations. However, this affects the performance of the methods with respect to two aspects. On the one hand, confirmed events were not determined for exactly those tissue samples that were used in hybridizations. Thus, some of the AEdb splicing events may be weak or absent in the tissues under study. For example, if an isoform is confirmed in one tissue but not tested specifically in the second, then this result can turn into a false positive in the light of the experimental data even when the isoform is present in both tissues. On the other hand, between tissues strong splicing differences can be expected. The number of confirmed splicing events is low compared with recent predictions of up to 95% of spliced human multi-exon genes (Pan et al., 2008). With the AEdb, we are only aware of a few events of unknown strength. The methods may successfully predict many real, existing events before marking our confirmed events. These aspects may in part explain the low sensitivity of the results.

5.2 Exon expression variability
Exon expressions are variable across the gene. Figure 3 and Supplementary Figure 4 elucidate the complex nature of exon expression with the example of HNF4A. The assumption that all exons in a gene have the same expression does not hold in general. Thus, we cannot assume a uniform distribution. Similar observations led Shah and Pallas (2009) to the identification of the correlation as an indicator for splicing. In previous gene-level expression experiments, this variability was interpreted as noise of probe expression. The exon arrays point to a deeper transcription pattern in terms of splicing. Similar expression variability is found in RNA-Seq data (data not shown).
ARH has been shown to cope with variable exon expression. Taking the ratio of the exon expressions between the biological conditions levels out the expression changes. The logarithm to the base 2 of the ratios saliently reflects splicing peculiarities in the exon expressions. With the entropy, ARH weights the expression ratios against each other, identifying genes with deviating ratios.
We have further explored the HNF4A gene and two others, CSDE1 and DYSF, with respect to their detectability by the different methods (Supplementary Table 4C). The data reflect the relative robustness of the ARH predictor. It can be seen that ARH is not the most sensitive method, i.e. the examples are found at a lower prediction ranking by other methods, but it is probably the most stable method, since it performs well in all three cases whereas other methods exhibit a much higher variation of performance.

[Figure 3. Title: expression and splicing probability: ENSG00000101076. x-axis: exons over gene; left y-axis (lines): exon expression; right y-axis (barplot): splicing probability; legend: pancreas, liver.]
Fig. 3. HNF4A is a gene with a confirmed splicing event between liver and pancreas (exon 4, green dot-dashed line). The lines (y-axis, left scale) show the exon expressions ordered by genomic position (x-axis). The bars (y-axis, right scale) correspond to the splicing probability values of the respective exons. The ARH value for HNF4A is 0.37, corresponding to a P-value of 8.01 × 10^−5.

5.3 Predictors versus number of exons in the gene
In differential expression settings, the number of probes is mostly constant across the genes on the array. This is no longer true with exon arrays. Predictions are calculated for genes with strongly differing numbers of exons. Ideally, a method is independent of the number of exons of a gene. We investigated the performance of the different methods with respect to this feature.
The genes were partitioned in bins referring to exon numbers. Boxplots for the distribution of the predictions were calculated per bin and are shown in the Supplementary Figure 5. Here, genes with the same number of exons were assigned to the same bin. With increasing number of exons, the probability of a false positive prediction increases. Focussing on the exon level does not avoid the problem. Sorting the predictions by decreasing splicing indication, genes with high number of exons are still preferred. A majority of the methods shows a dependency on the number of exons. Especially statistical tests are susceptible to increasing splicing indication with increasing exon number. Statistical tests become sensitive with increasing exon number and detect decreasing splicing differences. In the AEdb test setting this misleadingly improves performance. To make the ARH gene-level prediction independent of the number of exons per gene, the entropy values were compared with their possible maximum. This maximum is only dependent on the number of exons and thus constant over the exon bin. Thus, ARH is corrected for the number of exons per gene. Interestingly, Figure 2 and Table 1 demonstrate that the number of exons per gene is per se already a well-performing splicing prediction exceeding several of the computational methods. This may be a consequence of gene length bias in the AEdb compared with genome-wide data from Ensembl annotation (Supplementary Figure 6). In the Ensembl database with increasing number of exons in the gene, the number of genes is decreasing with a mean of 13 exons per gene. The AEdb, in contrast, shows a fairly differing distribution of number of exons with a peak between 7 and 18 exons per gene and a mean of 25 reflecting selection bias of the manual curation. 
Relative to Ensembl, the AEdb therefore over-represents genes with a high exon number, as visualized by the ratio of the AEdb exon number bins to the Ensembl exon number bins in Supplementary Figure 6.

6 CONCLUSION

We developed ARH, an entropy-based measure for splicing prediction. ARH rests on a simple, robust model of the exon expression ratios between two experimental conditions. A deviation in exons has a dominating effect on the entropy and yields a high ARH value. ARH takes into account probe affinities and variable exon mRNA abundances by computing exon expression ratios between samples. We tested ARH with existing benchmark data and could show that it outperforms existing methods. Furthermore, this is the first study that comprehensively compared splicing prediction methods on the same experimental platform. We analysed method performance with respect to different aspects, such as a low number of experimental replicates and the dependency on the number of exons.

ACKNOWLEDGEMENTS

The authors jointly developed ARH. A.R. collected the different prediction methods and datasets, performed the evaluation of methods and wrote the manuscript. R.H. designed the study and contributed to the manuscript. Stefan Haas shared his experience with AS and splicing prediction with the authors.

Funding: European Union under its 6th Framework Programme with the grants SysProt (LSHG-CT-2006-037457) and SysCo (LSHG-CT-2006-037231); Max Planck Society.

Conflict of Interest: none declared.

REFERENCES

Abdueva,D. et al. (2007) Experimental comparison and evaluation of the Affymetrix exon and U133Plus2 GeneChip arrays. PLoS ONE, 2, e913.
Beffa,C.D. et al. (2008) Dissecting an alternative splicing analysis workflow for GeneChip Exon 1.0 ST Affymetrix arrays. BMC Genomics, 9, 571.
Ben-Dov,C. et al. (2008) Genome-wide analysis of alternative pre-mRNA splicing. J. Biol. Chem., 283, 1229–1233.
Birney,E. et al. (2006) Ensembl 2006. Nucleic Acids Res., 34, D556–D561.
Clark,T. et al. (2007) Discovery of tissue-specific exons using comprehensive human exon microarrays. Genome Biol., 8, R64.
Cline,M.S. et al. (2005) ANOSVA: a statistical method for detecting splice variation from expression data. Bioinformatics, 21 (Suppl. 1), i107–i115.
Cover,T.M. and Thomas,J.A. (2006) Elements of Information Theory. 2nd edn. Hoboken, Wiley-Interscience.
Crooks,G.E. and Brenner,S.E. (2004) Protein secondary structure: entropy, correlations and prediction. Bioinformatics, 20, 1603–1611.
Dabney,A. et al. (2006) q-value: Q-value estimation for false discovery rate control. R package version 1.1.
Dai,M. et al. (2005) Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res., 33, e175.
Das,D. et al. (2007) A correlation with exon expression approach to identify cis-regulatory elements for tissue-specific alternative splicing. Nucleic Acids Res., 35, 4845–4857.
Exon Array Whitepaper Collection (2005) Alternative transcript analysis methods for exon arrays. Technical Report 1.1, Affymetrix, Inc.
Gautier,L. et al. (2004) affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20, 307–315.
Gentleman,R.C. et al. (2004) Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol., 5, R80.
Gupta,S. et al. (2005) T-stag: resource and web-interface for tissue-specific transcripts and genes. Nucleic Acids Res., 33, W654–W658.
Herwig,R. et al. (1999) Large-scale clustering of cDNA-fingerprinting data. Genome Res., 9, 1093–1105.
Herwig,R. et al. (2000) Information theoretical probe selection for hybridisation experiments. Bioinformatics, 16, 890–898.
Hu,G.K. et al. (2001) Predicting splice variant from DNA chip expression data. Genome Res., 11, 1237–1245.
Johnson,W.E. et al. (2006) Model-based analysis of tiling-arrays for ChIP-chip. Proc. Natl. Acad. Sci. USA, 103, 12457–12462.
Kapur,K. et al. (2007) Exon arrays provide accurate assessments of gene expression. Genome Biol., 8, R82.
Kasprzyk,A. et al. (2004) Ensmart: a generic system for fast and flexible access to biological data. Genome Res., 14, 160–169.
Koscielny,G. et al. (2009) ASTD: the alternative splicing and transcript diversity database. Genomics, 93, 213–220.
Margolin,A.A. et al. (2006) Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7 (Suppl. 1), S7.
Novoyatleva,T. et al. (2006) Pre-mRNA missplicing as a cause of human disease. In P.Jeanteur (ed.) Alternative Splicing and Disease, vol. 44, Springer, Berlin, pp. 27–46.
Pan,Q. et al. (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet., 40, 1413–1415.
Purdom,E. et al. (2008) FIRMA: a method for detection of alternative splicing from exon array data. Bioinformatics, 24, 1707–1714.
R Development Core Team (2005) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Schug,J. et al. (2005) Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biol., 6, R33.
Shah,S.H. and Pallas,J.A. (2009) Identifying differential exon splicing using linear models and correlation coefficients. BMC Bioinformatics, 10, 26.
Shannon,C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423.
Sing,T. et al. (2005) ROCR: visualizing classifier performance in R. Bioinformatics, 21, 3940–3941.
Srinivasan,K. et al. (2005) Detection and measurement of alternative splicing using splicing-sensitive microarrays. Methods, 37, 345–359.
Stamm,S. et al. (2000) An alternative-exon database and its statistical analysis. DNA Cell Biol., 19, 739–756.
Stamm,S. et al. (2005) Function of alternative splicing. Gene, 344, 1–20.
Stephenson,A.G. (2002) evd: extreme value distributions. R News, 2, 31–32.
Steuer,R. et al. (2002) The mutual information: detecting and evaluating dependencies between variables. Bioinformatics, 18 (Suppl. 2), S231–S240.
Stoilov,P. et al. (2002) Defects in pre-mRNA processing as causes of and predisposition to diseases. DNA Cell Biol., 21, 803–818.
Wang,E.T. et al. (2008) Alternative isoform regulation in human tissue transcriptomes. Nature, 456, 470–476.
Xing,Y. et al. (2006) Probe selection and expression index computation of Affymetrix Exon Arrays. PLoS ONE, 1, e88.
Xing,Y. et al. (2008) MADS: a new and improved method for analysis of differential alternative splicing by exon-tiling microarrays. RNA, 14, 1470–1479.