Supplemental Material for Annotation Enrichment Analysis: An Alternative Method
for Evaluating the Functional Properties of Gene Sets
This file contains supplemental material meant to complement the findings in “Annotation Enrichment Analysis: An Alternative Method for Evaluating the Functional Properties of Gene Sets.” Section 1 includes additional information about the structural features of functional annotation databases as well as the results of using standard statistical approaches for evaluating functional enrichment. Section 2 includes additional comparisons between Fisher’s Exact Test and our method, Annotation Enrichment Analysis, which is described in the main text.
1. FEATURES OF ANNOTATION DATABASES
AND STANDARD FUNCTIONAL ENRICHMENT
APPROACHES
1.1. Annotation Properties of Functional
Annotation Databases other than the Gene Ontology
In “Annotation Enrichment Analysis: An Alternative
Method for Evaluating the Functional Properties of Gene
Sets” we demonstrate that there are biases in functional
enrichment analysis when using a standard set-overlap
statistic approach such as Fisher’s Exact Test. Specifically, using the Gene Ontology (GO), we show that
these biases appear to be associated with the annotation level of a gene in GO. Although we chose to use
GO to perform the analysis in the main text, we suggest
that these biases might manifest in functional enrichment
analysis performed using other annotation databases.
The general shape of the distributions shown in Figure 1 of the main text (for the number of annotations
made to/from terms/genes in GO) has been observed
both in GO and other annotation databases for several
other lower-level organisms, including E. coli and yeast
(see Supplemental Material in [1]). Furthermore, other
annotation databases, such as PIR Keywords, are organized in a hierarchical structure similar to GO. Hence,
we expected that distributions of the annotation properties using other human functional annotation databases would be similar to those obtained for GO. We explicitly tested this hypothesis for two other functional annotation databases: PIR Keywords and Reactome. We downloaded PIR Keyword information from the SwissProt resources ftp webpage (ftp://ftp.uniprot.org/pub/databases/uniprot/). We also downloaded Reactome annotations from files available from MSigDB [2]. The shape of the distributions for the number of annotations made to categories and from genes in these two databases, shown in Figure S1, closely mimics that observed for the Gene Ontology (Figure 1 in the main text). Note that MSigDB does not include information for categories with fewer than ten or more than one thousand annotated members, which influences the far left of the distribution; however, for categories with more than ten members the distribution still appears to be heavy-tailed.
FIG. S2: (a) Min Value of FET: the minimum p-value estimated by FET across all GO branches for each of the 200 random gene sets. (b) FDR q-value: q-values associated with the FDR-corrected significance of GO terms in 200 randomly generated gene sets. The terms are ordered based on how many genes are annotated to the term ($k_t$) and the gene sets are ordered based on the total number of annotations ($M_g$) made by the 200 genes in that set. Note that although we tested all terms, only the 200 with the highest number of annotations are shown.
1.2. Standard Multiple-Hypothesis Corrections
Cannot Overcome Extreme Annotation Bias
FIG. S1: The cumulative degree distributions of (a) categories (Category Degree Distribution) and (b) genes (Gene Degree Distribution) in several functional databases other than the Gene Ontology.
In the main text, for each of two hundred random gene sets we calculated the FET p-value for all 10192 “Biological Process” terms. Since we did 10192 comparisons, we expect that, for each of these gene sets, the minimum p-value across all terms should be approximately $1/10192 \approx 10^{-4}$. Figure S2(a) plots the actual minimum p-value for each random gene set. We observe that instead of the expected value, random gene sets with the fewest annotations have a minimum p-value around $10^{-3}$, while random gene sets with the highest annotation levels have a minimum p-value close to $10^{-6}$. These results demonstrate that when using Fisher’s Exact Test (FET) to evaluate the enrichment of a set of genes, annotation bias leads both to anti-conservative estimates of functional enrichment in gene sets that have many annotations and to overly conservative estimates of enrichment in gene sets that have few annotations.
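The expected minimum p-value quoted in this section can be checked with a short simulation. This is only a sketch under an idealized assumption that is not part of the original analysis: the 10192 p-values for a gene set are treated as independent draws from a uniform distribution, whereas real FET p-values are discrete and GO terms share genes. Under that assumption the minimum of N p-values has mean 1/(N+1), which is roughly $10^{-4}$ for N = 10192. The function names are illustrative.

```python
import random

N_TERMS = 10192  # number of "Biological Process" terms tested (from the text)


def min_uniform_pvalue(n_tests, rng):
    """Smallest of n_tests independent Uniform(0, 1) draws (idealized null p-values)."""
    return min(rng.random() for _ in range(n_tests))


def mean_min_pvalue(n_tests, n_sets, seed=0):
    """Average minimum p-value over n_sets simulated gene sets."""
    rng = random.Random(seed)
    return sum(min_uniform_pvalue(n_tests, rng) for _ in range(n_sets)) / n_sets


# Exact mean of the minimum of N independent uniform variables.
expected = 1.0 / (N_TERMS + 1)
# 200 simulated gene sets, matching the number of random gene sets in the text.
simulated = mean_min_pvalue(N_TERMS, n_sets=200)
print(f"theory: {expected:.2e}, simulated: {simulated:.2e}")
```

Minimum p-values that sit an order of magnitude above or below this baseline, as reported for the least and most annotated gene sets, are therefore not what an unbiased test would be expected to produce.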
Most commonly used functional enrichment tools (e.g.
[2–6]) do not report only the FET p-values, but also provide scientists with a number of other measures, including statistics designed to help mitigate issues associated
with multiple hypothesis testing. We wanted to see if
such measures might sufficiently modify the p-value so
that the anti-conservative estimates from FET might
not be an issue. Therefore, we determined the q-values
associated with the FDR-significance for all GO terms
in each of our 200 randomly generated gene sets. The
results are shown in Figure S2(b). We find that although
the majority of q-values clearly indicate insignificant associations, there is still an obvious bias toward significant
enrichment between high degree gene-set/term pairs. A
number of these high degree gene-set/term pairs still
have a q-value low enough to be considered significant
by some. Specifically, 99 comparisons have a q-value
less than 0.01, and, most importantly, all of these incorrectly called comparisons occur in random gene sets that
have an average annotation level greater than 54. Since approximately half of the experimental gene signatures we analyze in the main text have an average annotation level greater than 54, we conclude that standard multiple-hypothesis corrections, although important, cannot overcome the extreme annotation bias present in the evaluation of experimental gene signatures.
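The text does not state which FDR procedure was used to obtain the q-values, nor whether the correction was applied per gene set or across all comparisons. As an illustration only, the sketch below assumes the widely used Benjamini-Hochberg step-up adjustment and shows how one might count the comparisons that fall below a threshold such as 0.01. The function name and the toy p-values are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Toy p-values (not data from the paper).
toy_p = [0.0001, 0.004, 0.03, 0.2, 0.5, 0.9]
toy_q = bh_qvalues(toy_p)
n_called = sum(q < 0.01 for q in toy_q)
print(toy_q, n_called)
```

Because the adjustment only rescales p-values by rank, any systematic shift in the raw p-values, such as the annotation bias described here, carries through to the q-values.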
2. ADDITIONAL COMPARISONS BETWEEN
FISHER’S EXACT TEST (FET) AND
ANNOTATION ENRICHMENT ANALYSIS (AEA)
2.1. Additional Comparisons of FET and AEA
Significance Predictions
In the main text (Figures 2(b) and (d)) we selected
five of our random gene sets and directly compared the
distributions of the p-values predicted by FET and AEA
to the expected distribution (evenly distributed values
from zero to one); however, in these distributions we excluded p-value estimates for terms (and their corresponding branches) that did not have at least one annotation
from a member of the gene set in question. These excluded p-values are, by definition, all equal to exactly
one. In Figure S3(a)-(b) we plot, in rank order, the p-values calculated for these same five selected random gene sets according to either FET or AEA; however, as opposed to Figure 2(b) and (d) in the main text, these plots include all terms and demonstrate the discrete nature of both the FET and AEA methods. Namely, fewer terms have an annotation overlap with gene sets composed of poorly-annotated genes. Since terms with no annotation overlap from a given gene set all have a p-value equal to exactly one using either FET or AEA, the result for these plots is that the “percent rank” for terms with a p-value equal to one is lower for poorly-annotated gene sets. These properties are in addition to an overall different shape in the p-value distribution for the remaining terms, as shown in the main text.
FIG. S3: Plots of the p-values predicted by (a) FET and (b) AEA as a function of rank for five of the random gene sets. The “percent rank” values for branches with p-values equal to one demonstrate that both methods are discrete and that gene sets with fewer associated annotations have a non-zero overlap with fewer branches than gene sets with a higher number of associated annotations. (c) FET vs. AEA in signatures: the mean of the log-ratio of FET to AEA significance predictions in experimentally derived gene signatures as a function of the average number of annotations made by genes in the signature. A bias for FET to predict more significant associations than AEA is evident for the more highly annotated signatures.
Finally, in the main text we also predict the enrichment
of all “Biological Process” GO branches in experimental
gene signatures using both traditional set-overlap statistics
(FET) and AEA ($10^6$ randomizations). We
compare these results based on the ranks of the branches
by each measure (Figure 4(b) in the main text). However, we also wanted to more directly compare the enrichment values predicted by FET and AEA for these
experimental signatures. Therefore, for each signature,
we also determined the log-ratio of the FET and AEA
p-values for each term. We took the mean of these log-ratios and plotted, for each signature, this average value as a
function of the annotation level of the experimental gene
signature (Figure S3(c)). We observe that for high-degree
gene signatures, FET routinely predicts more significant
p-values than AEA; however, for the lowest degree signatures (which generally have an annotation level close
to chance), FET and AEA give much more similar estimates of the significance.
FIG. S4: The average significance of enrichment, estimated by FET and AEA, for 200 randomly generated gene sets in terms or branches, plotted as a function of the number of progeny in the branch. (a) Significance by branch size: the value using all gene-set/term comparisons. (b) Significance by branch size (filtered): the average value when only considering comparisons where the term/branch of interest has at least one annotation from a member of the gene set of interest.
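The per-signature summary plotted in Figure S3(c) can be sketched as follows. The base of the logarithm and the handling of p-values equal to zero are not specified in the text, so this sketch assumes base 10 and clips p-values at a small floor; both choices, along with the function name and toy values, are assumptions made for illustration.

```python
import math


def mean_log_ratio(fet_pvalues, aea_pvalues, floor=1e-12):
    """Mean over terms of log10(p_FET / p_AEA) for a single gene signature.

    Negative values mean FET reports more significant p-values than AEA.
    The floor (an assumption) avoids log(0) for randomization-based p-values.
    """
    if len(fet_pvalues) != len(aea_pvalues) or not fet_pvalues:
        raise ValueError("need two non-empty p-value lists of equal length")
    total = 0.0
    for p_fet, p_aea in zip(fet_pvalues, aea_pvalues):
        total += math.log10(max(p_fet, floor)) - math.log10(max(p_aea, floor))
    return total / len(fet_pvalues)


# Toy p-values for three terms in one signature (not data from the paper).
toy_fet = [1e-6, 1e-3, 1.0]
toy_aea = [1e-4, 1e-3, 1.0]
print(mean_log_ratio(toy_fet, toy_aea))
```

Terms where both methods return a p-value of one contribute zero to the mean, so signatures that overlap few terms are pulled toward a log-ratio of zero.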
2.2. Evaluating Enrichment in GO Branches
The approach of Fisher’s Exact Test (FET) investigates the overlap of a given gene set of interest, with a
collection of genes defined based on their annotations to
a single term. In contrast, when using Annotation Enrichment Analysis (AEA) we instead evaluate the overlap
in the annotations made by a given gene set of interest,
with the collection of annotations made to a GO Branch,
defined by a term and all of its progeny. In the most
extreme case, where a term has no progeny, its corresponding branch will be composed of a single term, and
its corresponding annotations will be composed of the
annotations from the same set of genes used for the FET
analysis; however the background distribution of the expected overlap in genes (for FET) or annotations (for
AEA) may be fundamentally different. In light of these
differing frameworks, we tested to see if the number of
terms that make up a GO branch affects the significance
of the predictions made by AEA (in which case we test
the overlap in annotations made to the head node and
all its progeny) compared to FET (in which case we test
the overlap with the genes annotated to the head node).
To do this, we binned GO terms based on the number of
progeny in their corresponding branch and determined
the average FET and AEA p-values of those terms (or
their corresponding branch) in any of our 200 randomly
generated gene sets (Figure S4(a)). We observe that FET
and AEA, on average across the gene sets, give approximately the same level of significance to GO terms; however, GO terms with the fewest progeny have much lower
average significance than terms with more progeny. This is due to the discrete nature of the methods: terms with fewer progeny generally have many fewer genes annotated to them; therefore, in the majority of cases they have no overlap with the random gene sets and are given a p-value of one by both AEA and FET. To better highlight potential differences between AEA and FET, we therefore also plotted the average p-values for terms when only considering comparisons in which at least one member of the considered gene set is annotated to the term of interest (Figure S4(b)). We find that for FET there is a bias towards giving more significant p-values to terms with fewer progeny, largely because these terms, by definition, often contain annotations from the most highly annotating genes. On the other hand, AEA, which considers the distribution of annotations, has a much more even level of prediction across the term branch sizes.
FIG. S5: The significance of the enrichment of 200 randomly generated gene sets in random “branches” according to (a) FET and (b) AEA. Similarly, the significance of the enrichment of 200 randomly generated gene sets in (c) GO branches and (d) random “branches” according to AEA-A. In all cases the branches (GO branches or random “branches”) are ordered based on how many genes are annotated to the parent term ($k_t$) and the gene sets are ordered based on the total number of annotations ($M_g$) made by the 200 genes in that set.
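To make the gene-overlap calculation behind FET concrete, the following is a minimal sketch of a one-sided (enrichment) test computed from the hypergeometric distribution. The choice of the upper tail and the parameter names are assumptions made for illustration; the text does not spell out the exact form used. The sketch also shows why a term with no overlap receives a p-value of exactly one.

```python
from math import comb


def fet_enrichment_pvalue(n_total, n_term, n_set, n_overlap):
    """One-sided Fisher's Exact Test p-value, P(X >= n_overlap).

    n_total   -- genes in the background
    n_term    -- genes annotated to the term
    n_set     -- genes in the gene set of interest
    n_overlap -- genes in both the term and the gene set
    """
    upper = min(n_term, n_set)
    # Hypergeometric upper tail, summed with exact integer arithmetic.
    tail = sum(
        comb(n_term, x) * comb(n_total - n_term, n_set - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_set)


# A sparsely annotated term with no overlap: the tail covers every outcome.
print(fet_enrichment_pvalue(20000, 15, 200, 0))
# The same term with three overlapping genes.
print(fet_enrichment_pvalue(20000, 15, 200, 3))
```

Since the possible overlaps for a small term are limited to a handful of integers, the attainable p-values form a coarse, discrete set, which is the behavior discussed in this section.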
2.3. Evaluating Enrichment in Random “Branches”
We also constructed sets of random GO terms that represent random “branches” that should be largely devoid
of biological coherency. Specifically, to mimic a branch
in the GO DAG that has a parent term with $k_t$ gene annotations, we randomly ordered all the terms in GO and selected the top $N_t$ terms such that the number of unique genes annotated to those $N_t$ random terms ($k_t'$) is within a small percentage of $k_t$ ($|k_t - k_t'|/k_t < 0.01$). In cases where selecting either $N_t$ or $N_t + 1$ terms fell within this limit, we chose $N_t$ to minimize the absolute difference between $k_t$ and $k_t'$. If selecting the top $N_t$ terms did not lead to a situation within this limit, we reshuffled the terms and selected the top $N_t$ terms in this new list, repeating until a suitable random collection of terms could be chosen. In this way we created 10192 random “branches” with approximately the same number of unique genes annotated to each as to real GO branches, but in which the genes annotated to the faux parent term arise from an accumulation of annotations made to a random set of progeny terms.
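The construction of a random “branch” can be sketched as below. This is an illustrative reading of the procedure, not the authors' code: the dictionary layout, the function name, and the cap on the number of reshuffles are assumptions (the text simply repeats until a suitable collection is found). Among all prefixes of the shuffled list that fall within the 1% tolerance, the sketch keeps the one with the smallest absolute difference.

```python
import random


def random_branch(term_to_genes, k_t, tol=0.01, max_shuffles=1000, rng=None):
    """Pick random terms whose union of annotated genes has size close to k_t.

    term_to_genes -- dict mapping each term to the set of genes annotated to it
    Returns (selected_terms, k_prime), where k_prime is the number of unique genes.
    """
    if k_t <= 0:
        raise ValueError("k_t must be positive")
    rng = rng or random.Random()
    terms = list(term_to_genes)
    for _ in range(max_shuffles):
        rng.shuffle(terms)
        genes = set()
        best = None  # (absolute difference, number of terms, unique gene count)
        for n, term in enumerate(terms, start=1):
            genes |= term_to_genes[term]
            diff = abs(k_t - len(genes))
            if diff / k_t < tol and (best is None or diff < best[0]):
                best = (diff, n, len(genes))
            if len(genes) > k_t * (1 + tol):
                break  # the union only grows, so no longer prefix can qualify
        if best is not None:
            return terms[:best[1]], best[2]
    raise RuntimeError("no suitable random branch found")


# Toy annotation: 50 terms with one distinct gene each (not real GO data).
toy = {f"term{i}": {f"gene{i}"} for i in range(50)}
branch_terms, k_prime = random_branch(toy, k_t=20, rng=random.Random(0))
print(len(branch_terms), k_prime)
```

With real annotations, large terms can overshoot the target in a single step, which is why a shuffle may fail and the list has to be reordered.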
We then used these random “branches” to investigate
what effect, if any, overlapping annotations between GO
terms might have on functional enrichment analysis when
using a measure like FET. Interestingly, the trend we observed with real GO branches – that those with more gene
annotations to the parent term also tend to be enriched
in “high-degree” gene sets – appears to be even more
pronounced when using random “branches” (comparing
Figure S5(a) to Figure 2(a) in the main text). These
results suggest that part of the bias from FET might result from overlapping annotations to GO terms. We also
determined the significance of all random “branches” in
our randomly generated gene sets with AEA ($10^4$ randomizations), and created a heat map of these values
(Figure S5(b)). The significance predicted by AEA is
much lower for the randomly generated “branches” compared to the real GO branches (compare Figure S5(b)
with Figure 2(c) in the main text). This can be explained
as follows. We constructed our random gene sets using
information from GO annotations (in order to bias the
sets toward higher or lower annotation levels); therefore
genes within these sets are indirectly associated with the
GO branch structure; however, this association is largely
lost in our randomly generated “branches”.
Finally, we used AEA-A (Annotation Enrichment
Analysis Approximation) to determine the significance
of GO branches and random “branches” in our 200
randomly generated gene sets (Figure S5(c)-(d)). As
with AEA, AEA-A successfully eliminates the annotation bias associated with high degree gene-sets and predicts relatively fewer significant associations with random
“branches”.
2.4. Evaluating Enrichment in “Truly Enriched” Term Sets
Finally, we also compared the performance of FET and AEA on “truly enriched” term sets. More specifically, for each simulated branch, we randomly ordered all terms. We then evaluated whether the first term in the randomized list shares at least one quarter of its annotated genes with a signature of interest. If so, the term was kept and considered a member of a “truly enriched” branch; if not, we proceeded to the next term. This continued until the simulated branch contained ten terms. In this way we created ten “truly enriched” term sets for each of the experimentally-derived gene signatures in our analysis. We then evaluated how well AEA and FET are able to associate a given experimentally-derived gene signature with its set of “truly enriched” branches. Figure S6 shows the significance predicted by FET and AEA for twenty experimentally derived gene signatures in the simulated “truly enriched” branches. “Truly enriched” branches are ordered along the x-axis such that the first ten correspond to the first signature, the second ten correspond to the second signature, and so on. Both FET and AEA are able to consistently recover these “known” associations, as indicated by the blue (more significant) blocks along the diagonal. Interestingly, FET contains a number of significant off-diagonal associations, which we hypothesize may be indicative of false positives due to annotation bias. Results look similar for all signatures and enriched branches.
FIG. S6: The significance predicted by (a) FET (Fisher’s Exact Test) and (b) AEA (Annotation Enrichment Analysis) for twenty experimentally derived gene signatures and their corresponding simulated “truly” enriched branches.
[1] K. Glass, E. Ott, W. Losert, and M. Girvan, Journal of the Royal Society, Interface / the Royal Society (2012), ISSN 1742-5662, URL http://dx.doi.org/10.1098/rsif.2011.0585.
[2] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, et al., Proceedings of the National Academy of Sciences of the United States of America 102, 15545 (2005), ISSN 0027-8424, URL http://dx.doi.org/10.1073/pnas.0506580102.
[3] D. W. Huang, B. T. Sherman, Q. Tan, J. Kir, D. Liu, D. Bryant, Y. Guo, R. Stephens, M. W. Baseler, H. C. Lane, et al., Nucl. Acids Res. 35, gkm415+ (2007), ISSN 1362-4962, URL http://dx.doi.org/10.1093/nar/gkm415.
[4] T. Beissbarth and T. P. Speed, Bioinformatics (Oxford, England) 20, 1464 (2004), ISSN 1367-4803, URL http://dx.doi.org/10.1093/bioinformatics/bth088.
[5] D. Martin, C. Brun, E. Remy, P. Mouren, D. Thieffry, and B. Jacq, Genome Biol 5 (2004), ISSN 1465-6914, URL http://dx.doi.org/10.1186/gb-2004-5-12-r101.
[6] A. Alexa, J. Rahnenführer, and T. Lengauer, Bioinformatics 22, 1600 (2006), ISSN 1367-4803, URL http://dx.doi.org/10.1093/bioinformatics/btl140.