BIOINFORMATICS
ORIGINAL PAPER
Vol. 26 no. 1 2010, pages 84–90
doi:10.1093/bioinformatics/btp626
Gene expression
ARH: predicting splice variants from genome-wide data with
modified entropy
Axel Rasche∗ and Ralf Herwig
Department of Vertebrate Genomics, Max-Planck-Institute for Molecular Genetics, Ihnestr. 63-73, D-14195 Berlin,
Germany
Received on June 15, 2009; revised on August 21, 2009; accepted on October 7, 2009
Advance Access publication November 4, 2009
Associate Editor: Trey Ideker
ABSTRACT
Motivation: Exon arrays allow the quantitative study of alternative
splicing (AS) on a genome-wide scale. A variety of splicing prediction
methods has been proposed for Affymetrix exon arrays mainly
focusing on geometric correlation measures or analysis of variance.
In this article, we introduce an information theoretic concept that is
based on modification of the well-known entropy function.
Results: We have developed an AS robust prediction method based
on entropy (ARH). We can show that this measure copes with bias
inherent in the analysis of AS such as the dependency of prediction
performance on the number of exons or variable exon expression.
In order to judge the performance of ARH, we have compared it
with eight existing splicing prediction methods using experimental
benchmark data and demonstrate that ARH is a well-performing new
method for the prediction of splice variants.
Availability and Implementation: ARH is implemented in R and
provided in the Supplementary Material.
Contact: [email protected]
Supplementary information: Supplementary data are available at
Bioinformatics online.
1 INTRODUCTION
Alternative splicing (AS) has gained increasing interest since it has
been identified as a key mechanism for generating the complex
proteome of multicellular organisms by the generation of multiple
protein isoforms from single gene templates (Ben-Dov et al., 2008).
AS has been observed within a variety of biological conditions,
for example, in tissue expression (Gupta et al., 2005; Wang et al.,
2008), with respect to human diseases (Novoyatleva et al., 2006;
Stoilov et al., 2002) and in protein modification (Stamm et al.,
2005). Currently, AS is estimated to occur in up to 95% of human
multi-exon genes (Pan et al., 2008; Wang et al., 2008).
While standard microarray platforms have been useful to study
gene expression on a large scale, this technique was, until recently,
not suitable for the detection of splice variants due to the design
of the probes. As a consequence, existing AS in the samples under
study introduced an undetectable bias to those experiments. With the
release of the Affymetrix exon arrays a first commercial platform for
studying AS on a genome-wide scale has been made available using
more than five million oligonucleotide probes distributed across the
exome. The analysis of exon microarrays has posed new challenges
to the computational analysis, including data normalization, quality
control measures and methods for AS prediction. A number of
methods have been proposed to study AS with Affymetrix exon arrays
(Exon Array Whitepaper Collection, 2005; Purdom et al., 2008;
Xing et al., 2006). Most of these methods are based on geometric
measures, such as linear correlation, or analysis of variance (see
Section 4.2 for more details).
∗ To whom correspondence should be addressed.
Entropy is a fundamental measure of information and uncertainty
that has been introduced in the context of information theory by
Shannon (1948). Mathematical properties of entropy have been
described, for example, in the textbook of Cover and Thomas
(2006). In our approach, we take the advantage of fundamental
characteristics of entropy as a measure for the uniformity of a given
probability distribution defined by the exon gene expression ratios
within a given transcript (see Appendix 1 in the Supplementary
Material). Information theoretic approaches based on entropy have
been used in many fields of biosciences: for example, in sequence
analysis (Schug et al., 2005); protein structure analysis (Crooks and
Brenner, 2004); in the context of clustering gene expression data
(Herwig et al., 1999; Steuer et al., 2002); feature selection (Herwig
et al., 2000); and reverse engineering of gene regulatory networks
(Margolin et al., 2006).
In our article, we present a new method for predicting AS
using a modified entropy measure. We show that this method
overcomes important inherent bias such as the dependency of
the splicing prediction on the number of the exons and the
variability of exon expression. Furthermore, we present the first
comprehensive evaluation of AS prediction methods for Affymetrix
exon arrays using experimental benchmark datasets generated from
different human tissues and transcript spike-in samples. A splicing
prediction method should be sensitive to the deviation of a proportion
of exons, symmetric in up- and down-splicing, independent of
the position of the spliced exons within the gene sequence and
independent of the number of exons of the gene. To assess the
performance of the different methods, we analysed them with
respect to these requirements. For validation of the computational
predictions, we translated the information on known splice variants
available in public databases into sets of true positive splicing events.
We show that ARH is a well-performing new splicing prediction
method. An R implementation of ARH and the different methods is
available in the Supplementary Material.
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
2 METHODS
2.1 Probe–exon–gene assignments
The probe to exon and probe to gene assignments were dependent on the
respective benchmark dataset. For the spike-in dataset, it was done using
the chip description files from Affymetrix. For the human tissue dataset,
probe–exon assignments were drawn from the latest genome annotations of
Dai et al. (2005), version 11, for Ensembl exons. Exon to gene assignment
was retrieved via BioMart from Ensembl 49 (Birney et al., 2006; Kasprzyk
et al., 2004) and resulted in 232 376 exons that correspond to 26 538 genes.
2.2 Preprocessing
A model-based analysis for tiling arrays from Johnson et al. (2006) was
applied for intra-chip normalization, with the adjustment for exon arrays
described in Kapur et al. (2007). Quantile normalization was then applied
between arrays with the affy package (Gautier et al., 2004) in R/Bioconductor,
versions 2.8.0 and 2.3.0, respectively (Gentleman et al., 2004; R Development
Core Team, 2005). Exon and gene expression were defined as the median of
the respective probe intensities using all experimental replicates.
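As a rough illustration of the between-array step and the median summary, the following Python sketch quantile-normalizes a small matrix of probe intensities and then summarizes an exon as the median over its probes and all replicates. It is not the affy implementation: the function names `quantile_normalize` and `exon_median` are hypothetical, ties are broken by position rather than averaged, and the intensities are made up.

```python
from statistics import mean, median

def quantile_normalize(arrays):
    """Give every array the same empirical distribution.

    `arrays` is a list of equally long lists (one list per chip).
    Simplified sketch: ties are broken by position, not averaged.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    # Reference distribution: mean of the k-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [mean(col) for col in zip(*sorted_arrays)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

def exon_median(normalized, probe_indices):
    """Exon expression: median over the exon's probes and all replicates."""
    return median(a[i] for a in normalized for i in probe_indices)

# Three made-up replicate chips with four probes each.
chips = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
norm = quantile_normalize(chips)
exon_value = exon_median(norm, [0, 1])  # exon covered by probes 0 and 1
```

After normalization every chip contains the same set of values, only assigned to different probes according to the original ranks.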
2.3 Computation of ROC, AUC
For each method, the predictions were sorted by decreasing splicing
indication. Then the positions of the confirmed events (the true positive set)
were identified. The receiver operating characteristic (ROC) was visualized with
the ROCR package in R (Sing et al., 2005). Likewise, the area under the
curve (AUC) of the ROC was computed with ROCR to quantify
the performance.
2.4 Selection of AS events from the AEdb
AEdb contains confirmed splicing events extracted from the literature
(Koscielny et al., 2009; Stamm et al., 2000). The AEdb sequence flat file
was downloaded (http://www.ebi.ac.uk/asd/aedb/) and the splicing events
were filtered by splicing mechanism (cassette exon events), species (human,
mouse, rat) and the availability of a sequence for the events. Eight
tissues overlapped between AEdb and the tissue benchmark dataset.
Events attributed to these eight tissues were selected and the corresponding
sequences of the alternative exons were aligned to exon sequences from
Ensembl 49 for exact matches. Events may be attributed to more than one
tissue; for the tissue specificity analysis such events were skipped. For
pairwise comparison, only the events specific to either of the two tissues
were used as a true positive set. A complete list of the true positive
events is available in the Supplementary Material along with the R
implementation.
3 ALGORITHM AND IMPLEMENTATION
Splicing assessment is done in two steps. In the first step analysis
is performed on the gene level, i.e. ARH identifies spliced genes
[see Equation (5)]. In the second step analysis is performed on the
exon level, i.e. splicing deviation [see Equation (1)] ranks the exons
within a gene and identifies the skipped/included exons. Below, we
give a formal description of the method; further material on the
characteristics of entropy and an illustration of the different steps
are given in Supplementary Material Appendix 1 and Supplementary
Figure 7. For a gene $g$ with $m$ exons, two biological conditions $c$ and
$t$ with corresponding exon expressions $\varphi_{g,e,t}$ and $\varphi_{g,e,c}$, $e = 1,\dots,m$,
we compute the following quantities:
(1) The exon splicing deviation, $\zeta_{g,e}$, measures the deviation of
the fold change in each individual exon from the median
transcript fold change. Here, we compute log-ratios of exon
fold changes to account for symmetric measurement of up- or
down-splicing. From these log-ratios the median is subtracted
to correct for global gene expression changes:
$$\zeta_{g,e} = \log_2\frac{\varphi_{g,e,t}}{\varphi_{g,e,c}} - \operatorname*{median}_{e=1,\dots,m}\left(\log_2\frac{\varphi_{g,e,t}}{\varphi_{g,e,c}}\right). \tag{1}$$
(2) The exon splicing probability is computed from the absolute
value of the splicing deviation $\zeta_{g,e}$ by
$$p_{g,e} = \frac{2^{|\zeta_{g,e}|}}{\sum_{e=1,\dots,m} 2^{|\zeta_{g,e}|}}. \tag{2}$$
Note that for each gene $\sum_{e} p_{g,e} = 1$.
(3) To measure whether the exon splicing probabilities are
equally distributed or whether a single or a few exons
dominate the probability distribution, we compute the entropy
for each gene:
$$H_g(p_{g,1},\dots,p_{g,m}) = -\sum_{e=1}^{m} p_{g,e}\cdot\log_2 p_{g,e}. \tag{3}$$
(4) Entropy defined in Equation (3) is dependent on the number
of exons and cannot be directly used for the comparison
of different genes. Thus, in order to make the measure
independent of the number of exons for a given gene, we
subtract entropy from its theoretical maximum:
$$\max(H_g) - H_g = \log_2(m) - H_g(p_{g,1},\dots,p_{g,m}). \tag{4}$$
(5) Another necessary modification accounts for the strength of
deviation within the gene. This is robustly estimated with
the interquartile range of exon expression ratios, the
25%, $Q_{.25,g,e=1,\dots,m}\left(\frac{\varphi_{g,e,t}}{\varphi_{g,e,c}}\right)$, and 75%, $Q_{.75,g,e=1,\dots,m}\left(\frac{\varphi_{g,e,t}}{\varphi_{g,e,c}}\right)$,
quantiles. An index for the amplitude is the interquartile
ratio $Q_{.75,g}/Q_{.25,g}$. This ratio is close to 1 for low splicing
probability and increases with deviations of a number of exons
in the gene. The interquartile ratio is multiplied with the
entropy index and constitutes the ARH splicing prediction:
$$\mathrm{ARH}_g = \frac{Q_{.75,g}}{Q_{.25,g}} \cdot \bigl(\max(H_g) - H_g\bigr). \tag{5}$$
Thus, ARH is suitable to compare the predictions across
different genes. Large ARH values (>0.03) indicate splicing.
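The five quantities can be chained in a few lines. The following Python sketch is an illustration under stated assumptions, not the authors' R code: the function names are hypothetical, the quartiles use the `inclusive` method of `statistics.quantiles` (assumed here to correspond to the default quantile definition in R), and the handling of zero or non-finite expressions is left out.

```python
import math
from statistics import median, quantiles

def splicing_deviation(treat, ctrl):
    """Equation (1): log2 exon ratios centred on their median."""
    log_ratios = [math.log2(t / c) for t, c in zip(treat, ctrl)]
    med = median(log_ratios)
    return [lr - med for lr in log_ratios]

def splicing_probability(zeta):
    """Equation (2): 2^|zeta| normalized to sum to 1 within the gene."""
    weights = [2.0 ** abs(z) for z in zeta]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    """Equation (3): Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def arh(treat, ctrl):
    """Equations (4) and (5) for one gene; None for single-exon genes."""
    m = len(treat)
    if m < 2:
        return None
    zeta = splicing_deviation(treat, ctrl)
    h = entropy(splicing_probability(zeta))
    ratios = [t / c for t, c in zip(treat, ctrl)]
    # Quartile definition is an assumption of this sketch.
    q25, _, q75 = quantiles(ratios, n=4, method="inclusive")
    return (q75 / q25) * (math.log2(m) - h)

# Made-up expressions: uniform two-fold change versus one deviating exon.
uniform = arh([200.0, 400.0, 100.0, 300.0], [100.0, 200.0, 50.0, 150.0])
spliced = arh([100.0, 100.0, 300.0, 100.0, 100.0], [100.0] * 5)
swapped = arh([100.0] * 5, [100.0, 100.0, 300.0, 100.0, 100.0])
```

In this toy input a gene whose exons all change by the same factor scores zero, a single exon with a three-fold deviation pushes the score above the 0.03 level quoted above, and swapping the two conditions leaves the score unchanged.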
The ARH prediction is implemented in R with three functions
(R Development Core Team, 2005): ARH, ARH_p and ARH_sd.
ARH returns gene-level predictions, i.e. the outcome of Equation (5),
and ARH_sd returns the splicing deviations from Equation (1).
ARH_p returns the P-values of ARH derived from a fitted generalized
extreme value distribution (Section 4.1) using the package evd
(Stephenson, 2002) and the parameters described below. The three
functions take two input vectors (x and y in the implementation) for
the exon (or probe set) expressions and one vector for the exon–gene
grouping (f ). To avoid the division by 0, the second vector is set to
a minimum of 0.0001. Genes with only one exon or non-finite exon
expressions are set to NA.
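The P-value step can be sketched with the standard distribution function of the generalized extreme value distribution and the parameters reported in Section 4.1.1 (location 0.006338, scale 0.005507, shape 0.3329). This Python sketch does not call the evd package; the function name `gev_p_value` is hypothetical, and the P-value is taken to be the upper-tail probability, which is an assumption of this example.

```python
import math

# Parameters of the fitted background distribution (Section 4.1.1).
LOCATION, SCALE, SHAPE = 0.006338, 0.005507, 0.3329

def gev_p_value(x, loc=LOCATION, scale=SCALE, shape=SHAPE):
    """Upper-tail probability P(X > x) of a GEV with non-zero shape."""
    if shape == 0.0:
        raise ValueError("this sketch does not cover the Gumbel case")
    t = 1.0 + shape * (x - loc) / scale
    if t <= 0.0:
        # Outside the support: below it for shape > 0, above it for shape < 0.
        return 1.0 if shape > 0.0 else 0.0
    # 1 - exp(-t^(-1/shape)), written with expm1 for small probabilities.
    return -math.expm1(-t ** (-1.0 / shape))

p_hnf4a = gev_p_value(0.37)
```

For an ARH value of 0.37 this yields a probability of roughly 8 × 10^−5, which is of the same order as the P-value quoted for HNF4A in the caption of Figure 3.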
If a single exon deviates from the remaining exon expression
ratios, it dominates the splicing probability distribution
[Equation (3)] resulting in a low entropy and a high ARH value. If a
larger number of exons is spliced, this measure is upweighted with
an increased interquartile ratio >1. On the other hand, if all exons
have similar expression changes this leads to a high entropy with
small interquartile ratio and consequently to a small ARH value.
4 RESULTS
4.1 Characteristics of ARH
4.1.1 ARH background distribution For a given experiment, the
ARH values show a rapid decline from many near-zero values
to few high ARH values. The ARH distribution shows little
variation even between tissues (Supplementary Fig. 1). To derive
a biologically motivated background distribution, we compared
samples of the same biological conditions. The human tissue dataset
from Clark et al. (2007) comprises data from 11 human tissues
with 3 replicates each. In each tissue, this allows three pairwise
comparisons summing to 33 pairwise comparisons that were used
for defining a background sample of ARH values. The distribution
of these 33 comparisons provides thresholds for significant ARH
values. The 95% quantile of the distribution is $Q_{\mathrm{ARH},.95} = 0.031$.
The 95% quantiles of the 33 individual comparisons also
cluster around that value. For the 90, 99 and 99.9% quantiles
of the background distribution, the thresholds are $Q_{\mathrm{ARH},.9} = 0.023$,
$Q_{\mathrm{ARH},.99} = 0.057$ and $Q_{\mathrm{ARH},.999} = 0.13$, respectively. The
background distribution is also adequate to calculate P-values. We
found the generalized extreme value distribution to fit best to the
ARH background distribution due to a long heavy tail of large ARH
values. Distribution parameters were fit with Matlab resulting in
location = 0.006338, scale = 0.005507 and shape = 0.3329 (Fig. 1A).
4.1.2 Exon-level analysis In a spliced gene, the splicing deviation
ranks the exons in order to identify the most altered exons.
With this ranking exons can be selected, for example, for
validation. Assessing the absolute splicing deviation as above,
a global number of spliced exons is determined with the
following thresholds: $Q_{\mathrm{ARH\_sd},.9} = 0.43$, $Q_{\mathrm{ARH\_sd},.95} = 0.53$,
$Q_{\mathrm{ARH\_sd},.99} = 0.75$ and $Q_{\mathrm{ARH\_sd},.999} = 1.07$.
The exon-level splicing indication has to be symmetric in terms
of up- or down-splicing. The swapping of treatment and control
samples changes an up- to a down-spliced exon and vice versa.
The absolute value of the log2 splicing deviation accounts for
this symmetry. The dependency of the splicing probability on the
fold changes was simulated for a gene with 13 exons, where
log2 ratios were drawn from a normal distribution with mean 0 and
SD 0.68 (Supplementary Fig. 2).
Spliced exons are not necessarily adjacent. In the liver versus
pancreas tissue comparison, the transcription factor HNF4A is an
illustrative example depicted in Figure 3. Three exons were predicted
to be spliced on positions 1, 4 and 5 with one confirmed event
in position 4 in pancreas. The sum in the entropy formula is
commutative and reflects the position independence of the exons.
4.1.3 Gene-level analysis A gene-level splicing prediction
method needs to be sensitive to the deviation of a proportion of exons,
which ARH measures with the entropy and with the interquartile
ratio as weighting factor. We performed a simulation with varying
number of spliced exons, where the linear ratio of the spliced exons
is multiplied with a fold change of 3 (Fig. 1B). ARH values reflect the
number of spliced exons with a unimodal, symmetric distribution.
A strength of ARH is its low dependency on the total number
of exons of a gene. In ARH, the genes are sorted in bins by
exon number and gene predictions are compared with the gene-bin
maximal prediction. Comparing the entropy to the maximal entropy
makes ARH independent of the number of exons (Fig. 1C). See also
Section 5.3.
4.1.4 Performance with low number of experimental replicates
Since the costs of experimental replicates are often a limiting factor,
methods should preferably require only a low number of replicates for
computing robust predictions. Purdom et al. (2008) were the first
to address this aspect for their method FIRMA. We compared ARH
and other methods using only a single chip per condition (Fig. 2C).
The results highlight the good performance of ARH. Using the
median over the probes and replicates, the method is relatively
robust in the number of experimental replicates, although a clear loss
in performance is seen with all methods when using one replicate
instead of three (compare Fig. 2B and C).
[Figure 1. Panel titles: (A) Histogram of ARH values; (B) ARH, varying No of spliced exons; (C) Prediction distribution in Exon Bin.]
Fig. 1. (A) Histogram of ARH background distribution derived from splicing predictions between the same biological conditions (ARH values equal to zero
were skipped) and the fitted generalized extreme value distribution (dashed line). (B) Simulation of ARH values (y-axis) with respect to number of spliced
exons (x-axis). A gene with 13 exons is used for simulation corresponding to the average exon number in Ensembl for known protein coding genes. Log2
ratios were drawn from a normal distribution with mean 0 and SD 0.68 corresponding to the liver versus pancreas comparison. Respective number of exons
were upregulated with a factor of 3 indicating splicing. (C) Boxplots of ARH values (y-axis) with respect to genes with the same number of exons (x-axis).
Fig. 2. ROC curves for different aspects of method performance. (A) Overall performance across the 28 pairwise tissue comparisons with respect to AEdb
confirmed splicing events (performances vertically averaged). (B) Liver versus pancreas individual splicing predictions. (C) Performance of methods with
only one of the three replicates. (D) Comparison of muscle versus non-muscle tissue data invoking additional experimental noise with AEdb confirmed splicing
events. (E) Muscle versus non-muscle tissue data with RT-PCR validated true positive set. (F) HeLa cell line data with spiked transcripts as true positive set.
4.2 Method comparison
We have compared ARH with eight existing splicing prediction
methods:
• Splicing Index (SI) (Srinivasan et al., 2005)
• SPLICE (Hu et al., 2001)
• PAC (Exon Array Whitepaper Collection, 2005)
• ANOSVA (Cline et al., 2005)
• MiDAS (Exon Array Whitepaper Collection, 2005)
• MADS (Xing et al., 2008)
• FIRMA (Purdom et al., 2008)
• Correlation (Shah and Pallas, 2009)
Due to the character of the predictions, the methods can be
categorized into scores (Splicing Index, SPLICE, PAC, FIRMA and
Correlation) or tests (ANOSVA, MiDAS and MADS). In addition, the
methods provide either exon-level prediction (Splicing Index, SPLICE,
PAC, MiDAS, MADS and FIRMA) or gene-level prediction
(ANOSVA and Correlation). All methods were applied to the data
with the same preprocessing as described in Section 2.2 except
FIRMA, which requires RMA. MiDAS values were calculated on
the standard preprocessing with the Affymetrix Power Tools in
version 1.8.0.
4.2.1 Test dataset 1: tissue data with literature confirmed events
As a true positive set for judging the performance of the methods, we
chose an independent dataset from the manually curated AEdb.
Experimental data were available for 11 human tissues with 3
experimental replicates per tissue. Confirmed events in AEdb were
available for eight of these tissues which allows for 28 pairwise
tissue comparisons.
Ordering the predictions by decreasing splicing indication
constitutes a classifier that allows visualizing the performance of the
predictions with the ROC (Fig. 2). Using the ROC, the performance
was quantified with the AUC (Table 1).
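The ranking-based evaluation can be illustrated without ROCR. The Python sketch below computes the AUC through the pairwise-comparison (Mann-Whitney) identity, which equals the area under the empirical ROC curve; the function name `roc_auc` and the toy scores are assumptions of this example and are not taken from the benchmark data.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up splicing indications for five genes; 1 marks a confirmed event.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)  # 5 of 6 pairs are ordered correctly
```

The quadratic loop is adequate for a handful of confirmed events; for genome-wide rankings a rank-sum formulation would be the more economical choice.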
Table 1. AUC for different test settings and methods

| Method | Tissue pairwise average | Liver versus pancreas | Liver versus pancreas, 1-to-1 | Tissue specificity average | Muscle versus rest, AEdb | Muscle versus rest, RT-PCR | Spike-in |
|---|---|---|---|---|---|---|---|
| ARH | 0.83 | 0.86 | 0.84 | 0.86 | 0.86 | 0.97 | 0.99 |
| Splicing Index | 0.7 | 0.74 | 0.73 | 0.75 | 0.71 | 0.95 | 0.96 |
| SPLICE | 0.69 | 0.78 | 0.73 | 0.75 | 0.62 | 0.88 | 0.96 |
| PAC | 0.63 | 0.75 | 0.74 | 0.72 | 0.64 | 0.96 | 0.96 |
| ANOSVA | 0.76 | 0.78 | 0.72 | 0.7 | 0.6 | 0.84 | 0.98 |
| MiDAS | 0.68 | 0.71 | – | 0.62 | 0.48 | 0.85 | 0.95 |
| FIRMA | 0.69 | 0.73 | 0.69 | 0.75 | 0.74 | 0.92 | 0.75 |
| MADS | 0.68 | 0.69 | – | 0.71 | 0.49 | 0.67 | 0.98 |
| Correlation | 0.74 | 0.69 | 0.65 | 0.78 | 0.75 | 0.73 | 0.75 |
| Gene length | 0.83 | 0.79 | 0.79 | 0.84 | 0.92 | 0.93 | – |

Gene length predictor (last row) refers to the number of exons per gene. The best performing method in each column is shown in bold.
The benchmark test set was analysed with respect to different
aspects. The pairwise tissue comparisons correspond to highly
diverse biological conditions leading to a lot of variation in the
benchmarks. In the analysis, we provide the average performance
across the 28 pairwise comparisons (Fig. 2A). For an in-depth
discussion of AS attributes, we chose the liver versus pancreas
test case because it is representative for the average performance
(Fig. 2B). For this test case AEdb returns 27 exon events in 18
genes. The methods not only differ by performance but also by
the predicted splicing events. The commonality of the predictions
is assessed by looking at the overlaps between methods. The top
250 predictions constitute ∼ 1% of all genes on the array. The
commonality table in Supplementary Table 1 reflects a limited
overlap between the methods. For the 18 confirmed genes, the ARH
values and the corresponding quantiles in the ARH background
distribution, the P-values of the generalized extreme value fit and
Q-values for false discovery rate correction following Dabney et al.
(2006) are listed in the Supplementary Table 2.
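The overlap assessment amounts to intersecting the top-ranked gene sets of each pair of methods. The Python sketch below shows the idea on made-up scores; the function names and the toy values are hypothetical, and it assumes that larger values mean stronger splicing indication (for test-based methods the ranking would run on ascending P-values instead).

```python
def top_k(predictions, k):
    """Names of the k genes with the strongest splicing indication."""
    ranked = sorted(predictions, key=predictions.get, reverse=True)
    return set(ranked[:k])

def overlap_table(methods, k):
    """Size of the top-k intersection for every pair of methods."""
    names = sorted(methods)
    tops = {name: top_k(methods[name], k) for name in names}
    return {(a, b): len(tops[a] & tops[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

# Made-up gene-level scores for two methods; the paper uses k = 250.
methods = {
    "ARH": {"g1": 0.4, "g2": 0.1, "g3": 0.3, "g4": 0.0},
    "SI": {"g1": 2.0, "g2": 1.5, "g3": 0.2, "g4": 0.1},
}
table = overlap_table(methods, k=2)
```

With the toy scores the two methods share one gene among their top two predictions.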
Furthermore, we analysed tissue specificity comparing a selected
tissue with the 10 remaining tissues (Supplementary Fig. 3). This
leads to considerable variance in the intensities for the control
group. For example, when comparing muscle to non-muscle tissues, this
variance challenges the robustness of the methods to noise in
the measurements and results in a strong spread of performances
(Fig. 2D). For muscle the AEdb contains 19 confirmed exon events
in 10 genes. In Das et al. (2007), the authors use the same human
tissue dataset to establish a list of muscle-enriched exons whereof
17 events have been validated with RT-PCR. Since the study was
performed on an older genome build, the probe set region of the
17 events was updated with the UCSC Genome Annotations Lift
Tool to the current genome build (Assembly Mar 2006). The original
regions intersect with 13 Ensembl exons in 11 genes constituting the
list of validated events used for analysis. Since the RT-PCR assays
are generated specifically on the samples under study, the ROC curves are
more specific than those for AEdb confirmed events (Fig. 2E). It is a major
advantage of ARH that it is robust to noise within the samples. The
effect is also illustrated in the Supplementary Figure 4 with two case
studies with different prediction quality.
4.2.2 Test dataset 2: spike-in transcripts In Abdueva et al.
(2007), another benchmark dataset was presented with spike-in
hybridizations of 25 transcripts. In HeLa cells, where these
transcripts are not expressed, the mRNA was added at concentrations
of 0, 2, 32, 128 and 512 pM in a Latin square design with five groups.
The dataset has the advantage that expression strength is exactly
known in every sample. The samples have a very homogeneous
background such that noise can be neglected. All true positives are
known due to the closed collection of spiked genes. Following the
original handling of the data, we used the Affymetrix probe–probe
set-transcript cluster assignment.
We followed an idea of Beffa et al. (2008) to generate splicing
events. For a transcript in the 0 pM group, one exon was assigned to
a transcript of the 2 pM group and replaced by this exon. Exons with
0, 2, 32, 128, 512 pM were assigned to transcripts with 2, 0, 512, 32,
128 pM expression, respectively. The true positives in this dataset
are characterized by differentially expressed genes with generically
spliced exons at extreme fold changes. The environment excluding
the 25 transcripts has no expression change at low variability. Our
results show a general increase in methods performance compared
with the tissue data with ARH being the best performing method
(Fig. 2F).
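The size of the generated events can be read off the assignment scheme. The Python sketch below is one reading of that scheme, in which an exon spiked at one concentration sits inside a transcript spiked at another; the names `EXON_TO_HOST` and `nominal_log2_ratio` are hypothetical and the ratios are nominal concentrations, not measured intensities.

```python
import math

# Exon concentration (pM) -> concentration (pM) of the transcript that
# receives the exon, as listed in the text.
EXON_TO_HOST = {0: 2, 2: 0, 32: 512, 128: 32, 512: 128}

def nominal_log2_ratio(exon_pm, host_pm):
    """log2(exon / host); None when either concentration is 0 pM."""
    if exon_pm == 0 or host_pm == 0:
        return None  # presence/absence event, undefined on the log scale
    return math.log2(exon_pm / host_pm)

ratios = {exon: nominal_log2_ratio(exon, host)
          for exon, host in EXON_TO_HOST.items()}
```

Under this reading the generated exons differ from their host transcripts by a factor of 4 or 16, or are present against an absent background, which fits the description of extreme fold changes.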
5 DISCUSSION
5.1 General performance of methods and study design
The prediction of AS remains a challenge. In general, performance
of all methods could be improved, in particular, in the light of the
tissue dataset. This is due to the fact that splicing prediction poses
particular problems to the statistical analysis. A gene consists of
several products on the one hand and of different exons on the
other hand. For each transcript or each exon, a separate analysis
is performed to test potential splicing using the same measurements
in several tests. Approaches for the comparison of methods can be
found in Purdom et al. (2008) with a simulation model and in Beffa
et al. (2008) with the re-ordering of spike-in data. The advantage of
the human tissue dataset is the challenge to identify splicing events
in a non-artificial setting.
The confirmed events used for our study are in every sense
independent of the computations. However, this affects the
performance of the methods in two respects. On the
one hand, confirmed events were not determined for exactly those
tissue samples that were used in hybridizations. Thus, some of the
AEdb splicing events may be weak or nonexistent in the tissues under
study. For example, if an isoform is confirmed in one tissue but
not tested specifically in the second, then this result can turn into
a false positive in the light of the experimental data even when
the isoform is present in both tissues. On the other hand, between
tissues strong splicing differences can be expected. The number of
confirmed splicing events is low compared with recent predictions of up
to 95% of spliced human multi-exon genes (Pan et al., 2008). With
the AEdb, we are only aware of few events of unknown strength. The
methods may successfully predict many real, existing events before
reaching our confirmed events. These aspects may in part explain
the low sensitivity of the results.
5.2 Exon expression variability
Exon expressions are variable across the gene. Figure 3 and
Supplementary Figure 4 elucidate the complex nature of exon
expression with the example of HNF4A. The assumption that
all exons in a gene have the same expression does not hold in
general. Thus, we cannot assume a uniform distribution. Similar
observations led Shah and Pallas (2009) to the identification of
the correlation as an indicator for splicing. In previous gene-level
expression experiments, this variability was interpreted as noise of
probe expression. The exon arrays point to a deeper transcription
pattern in terms of splicing. Similar expression variability is found
in RNA-Seq data (data not shown).
ARH has been shown to cope with variable exon expression.
Taking the ratio of the exon expressions between the biological
conditions levels out the expression changes. The logarithm to the
base 2 of the ratios saliently reflects splicing peculiarities in the
exon expressions. With the entropy, ARH weights the expression
ratios against each other, identifying genes with deviating ratios. We
have further explored the HNF4A gene and two others, CSDE1 and
DYSF, with respect to their detectability by the different methods
(Supplementary Table 4C). The data reflect the relative robustness
of the ARH predictor. It can be seen that ARH is not the most
sensitive method, i.e. the examples are found at a lower prediction
ranking by other methods, but it is probably the most stable method,
since it performs well in all three cases whereas other methods exhibit
a much higher variation of performance.
[Figure 3. Plot title: expression and splicing probability: ENSG00000101076. Lines: exon expression (pancreas, liver); bars: splicing probability; x-axis: exons over gene.]
Fig. 3. HNF4A is a gene with a confirmed splicing event between liver and
pancreas (exon 4, green dot-dashed line). The lines (y-axis, left scale) show
the exon expressions ordered by genomic position (x-axis). The bars (y-axis,
right scale) correspond to the splicing probability values of the respective
exons. The ARH value for HNF4A is 0.37, corresponding to a P-value of
8.01 × 10^−5.
5.3 Predictors versus number of exons in the gene
In differential expression settings, the number of probes is mostly
constant across the genes on the array. This is no longer true
with exon arrays. Predictions are calculated for genes with strongly
differing numbers of exons. Ideally, a method is independent of
the number of exons of a gene. We investigated the performance
of the different methods with respect to this feature. The genes
were partitioned in bins referring to exon numbers. Boxplots for the
distribution of the predictions were calculated per bin and are shown
in the Supplementary Figure 5. Here, genes with the same number
of exons were assigned to the same bin. With increasing number
of exons, the probability of a false positive prediction increases.
Focussing on the exon level does not avoid the problem. Sorting
the predictions by decreasing splicing indication, genes with high
number of exons are still preferred.
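The binning check can be reproduced in outline by grouping gene-level predictions by exon count and summarizing each bin. The Python sketch below reports only the bin medians instead of full boxplots; the function name `median_by_exon_count` and the toy values are assumptions of this example.

```python
from collections import defaultdict
from statistics import median

def median_by_exon_count(exon_counts, predictions):
    """Median prediction per bin, with one bin per number of exons."""
    bins = defaultdict(list)
    for gene, m in exon_counts.items():
        if gene in predictions:
            bins[m].append(predictions[gene])
    return {m: median(values) for m, values in sorted(bins.items())}

# Made-up genes: two with 2 exons, three with 13 exons.
exon_counts = {"a": 2, "b": 2, "c": 13, "d": 13, "e": 13}
predictions = {"a": 0.01, "b": 0.02, "c": 0.02, "d": 0.01, "e": 0.03}
per_bin = median_by_exon_count(exon_counts, predictions)
```

For a method that does not depend on the number of exons, the per-bin medians should stay roughly flat across bins; a trend with increasing exon count indicates the bias discussed here.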
A majority of the methods shows a dependency on the number
of exons. Especially statistical tests are susceptible to increasing
splicing indication with increasing exon number. Statistical tests
become more sensitive with increasing exon number and detect ever smaller
splicing differences. In the AEdb test setting this misleadingly
improves performance. To make the ARH gene-level prediction
independent of the number of exons per gene, the entropy values
were compared with their possible maximum. This maximum is
only dependent on the number of exons and thus constant over the
exon bin. Thus, ARH is corrected for the number of exons per gene.
Interestingly, Figure 2 and Table 1 demonstrate that the number
of exons per gene is per se already a well-performing splicing
predictor, exceeding several of the computational methods. This
may be a consequence of gene length bias in the AEdb compared
with genome-wide data from Ensembl annotation (Supplementary
Figure 6). In the Ensembl database, the number of genes decreases
with increasing number of exons per gene, with a mean of 13
exons per gene. The AEdb, in contrast, shows a markedly different
distribution of exon numbers, with a peak between 7 and 18 exons
per gene and a mean of 25, reflecting selection bias of the manual
curation. Therefore, the AEdb genes show an over-representation of
genes with high exon number as visualized with the ratio of AEdb
exon number bins divided by the Ensembl exon number bins in the
Supplementary Figure 6.
6 CONCLUSION
We developed ARH, an entropy-based measure for splicing
prediction. ARH rests on a simple, robust model of the
exon expression ratios with respect to two experimental conditions.
A deviation in exons leads to a dominating effect on the entropy
and a high ARH value. ARH takes into account probe affinities and
variable exon mRNA abundances by computing exon expression
ratios between samples. We tested ARH with existing benchmark
data and could show that it outperforms existing methods.
Furthermore, this is the first study that comprehensively compared
splicing prediction methods with the same experimental platform.
We analysed method performance with respect to different aspects
such as low number of experimental replicates and the dependency
on the numbers of exons.
ACKNOWLEDGEMENTS
The authors jointly developed ARH. A.R. collected the different
prediction methods and datasets, performed the evaluation of
methods and wrote the manuscript. R.H. designed the study and
contributed to the manuscript. Stefan Haas contributed expertise
on AS and splicing prediction.
Funding: European Union under its 6th Framework Programme with
the grants SysProt (LSHG-CT-2006-037457); SysCo (LSHG-CT-2006-037231); Max Planck Society.
Conflict of Interest: none declared.
REFERENCES
Abdueva,D. et al. (2007) Experimental comparison and evaluation of the Affymetrix
exon and u133plus2 genechip arrays. PLoS ONE, 2, e913.
Beffa,C.D. et al. (2008) Dissecting an alternative splicing analysis workflow for
genechip exon 1.0 st Affymetrix arrays. BMC Genomics, 9, 571.
Ben-Dov,C. et al. (2008) Genome-wide analysis of alternative pre-mRNA splicing.
J. Biol. Chem., 283, 1229–1233.
Birney,E. et al. (2006) Ensembl 2006. Nucleic Acids Res., 34, D556–D561.
Clark,T. et al. (2007) Discovery of tissue-specific exons using comprehensive human
exon microarrays. Genome Biol., 8, R64.
Cline,M.S. et al. (2005) ANOSVA: a statistical method for detecting splice variation
from expression data. Bioinformatics, 21 (Suppl. 1), i107–i115.
Cover,T.M. and Thomas,J.A. (2006) Elements of Information Theory. 2nd edn.
Hoboken, Wiley-Interscience.
Crooks,G.E. and Brenner,S.E. (2004) Protein secondary structure: entropy, correlations
and prediction. Bioinformatics, 20, 1603–1611.
Dabney,A. et al. (2006) q-value: Q-value estimation for false discovery rate control.
R package version 1.1.
Dai,M. et al. (2005) Evolving gene/transcript definitions significantly alter the
interpretation of GeneChip data. Nucleic Acids Res., 33, e175.
Das,D. et al. (2007) A correlation with exon expression approach to identify
cis-regulatory elements for tissue-specific alternative splicing. Nucleic Acids Res.,
35, 4845–4857.
Exon Array Whitepaper Collection (2005) Alternative transcript analysis methods for
exon arrays. Technical Report 1.1, Affymetrix, Inc.
Gautier,L. et al. (2004) affy—analysis of affymetrix genechip data at the probe level.
Bioinformatics, 20, 307–315.
Gentleman,R.C. et al. (2004) Bioconductor: Open software development for
computational biology and bioinformatics. Genome Biol., 5, R80.
Gupta,S. et al. (2005) T-stag: resource and web-interface for tissue-specific transcripts
and genes. Nucleic Acids Res., 33, W654–W658.
Herwig,R. et al. (1999) Large-scale clustering of cDNA-fingerprinting data. Genome Res.,
9, 1093–1105.
Herwig,R. et al. (2000) Information theoretical probe selection for hybridisation
experiments. Bioinformatics, 16, 890–898.
Hu,G.K. et al. (2001) Predicting splice variant from DNA chip expression data. Genome
Res., 11, 1237–1245.
Johnson,W.E. et al. (2006) Model-based analysis of tiling-arrays for ChIP-chip. Proc.
Natl. Acad. Sci. USA, 103, 12457–12462.
Kapur,K. et al. (2007) Exon arrays provide accurate assessments of gene expression.
Genome Biol., 8, R82.
Kasprzyk,A. et al. (2004) Ensmart: a generic system for fast and flexible access to
biological data. Genome Res., 14, 160–169.
Koscielny,G. et al. (2009) ASTD: the alternative splicing and transcript diversity
database. Genomics, 93, 213–220.
Margolin,A.A. et al. (2006) Aracne: an algorithm for the reconstruction of gene
regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7
(Suppl. 1), S7.
Novoyatleva,T. et al. (2006) Pre-mRNA missplicing as a cause of human disease. In
P.Jeanteur, (ed.) Alternative Splicing and Disease, vol. 44, Springer, Berlin, pp. 27–46.
Pan,Q. et al. (2008) Deep surveying of alternative splicing complexity in the human
transcriptome by high-throughput sequencing. Nat. Genet., 40, 1413–1415.
Purdom,E. et al. (2008) FIRMA: a method for detection of alternative splicing from
exon array data. Bioinformatics, 24, 1707–1714.
R Development Core Team (2005) R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria.
Schug,J. et al. (2005) Promoter features related to tissue specificity as measured by
Shannon entropy. Genome Biol., 6, R33.
Shah,S.H. and Pallas,J.A. (2009) Identifying differential exon splicing using linear
models and correlation coefficients. BMC Bioinformatics, 10, 26.
Shannon,C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. J.,
27, 379–423.
Sing,T. et al. (2005) ROCR: visualizing classifier performance in R. Bioinformatics,
21, 3940–3941.
Srinivasan,K. et al. (2005) Detection and measurement of alternative splicing using
splicing-sensitive microarrays. Methods, 37, 345–359.
Stamm,S. et al. (2000) An alternative-exon database and its statistical analysis. DNA
Cell Biol., 19, 739–756.
Stamm,S. et al. (2005) Function of alternative splicing. Gene, 344, 1–20.
Stephenson,A.G. (2002) evd: extreme value distributions. R News, 2, 31–32.
Steuer,R. et al. (2002) The mutual information: detecting and evaluating dependencies
between variables. Bioinformatics, 18 (Suppl. 2), S231–S240.
Stoilov,P. et al. (2002) Defects in pre-mRNA processing as causes of and predisposition
to diseases. DNA Cell Biol., 21, 803–818.
Wang,E.T. et al. (2008) Alternative isoform regulation in human tissue transcriptomes.
Nature, 456, 470–476.
Xing,Y. et al. (2006) Probe selection and expression index computation of Affymetrix
Exon Arrays. PLoS ONE, 1, e88.
Xing,Y. et al. (2008) MADS: a new and improved method for analysis of differential
alternative splicing by exon-tiling microarrays. RNA, 14, 1470–1479.