Supplemental Material for Annotation Enrichment Analysis: An Alternative Method for Evaluating the Functional Properties of Gene Sets

This file contains supplemental material meant to complement the findings in “Annotation Enrichment Analysis: An Alternative Method for Evaluating the Functional Properties of Gene Sets.” Section 1 includes additional information about the structural features of functional annotation databases as well as the results of using standard statistical approaches for evaluating functional enrichment. Section 2 includes additional comparisons between Fisher’s Exact Test and our method, Annotation Enrichment Analysis, which is described in the main text.

1. FEATURES OF ANNOTATION DATABASES AND STANDARD FUNCTIONAL ENRICHMENT APPROACHES

1.1. Annotation Properties of Functional Annotation Databases other than the Gene Ontology

In “Annotation Enrichment Analysis: An Alternative Method for Evaluating the Functional Properties of Gene Sets” we demonstrate that there are biases in functional enrichment analysis when using a standard set-overlap statistic such as Fisher’s Exact Test. Specifically, using the Gene Ontology (GO), we show that these biases appear to be associated with the annotation level of a gene in GO. Although we chose to use GO to perform the analysis in the main text, we suggest that these biases might also manifest in functional enrichment analysis performed using other annotation databases.

The general shape of the distributions shown in Figure 1 of the main text (for the number of annotations made to terms and from genes in GO) has been observed both in GO and in other annotation databases for several lower-level organisms, including E. coli and yeast (see Supplemental Material in [1]). Furthermore, other annotation databases, such as PIR Keywords, are organized in a hierarchical structure similar to GO. Hence, we expected that the distributions of the annotation properties obtained using other human functional annotation databases would be similar to those obtained for GO.

We explicitly tested this hypothesis for two other functional annotation databases: PIR Keywords and Reactome. We downloaded PIR Keyword information from the SwissProt resources ftp webpage (ftp://ftp.uniprot.org/pub/databases/uniprot/). We also downloaded Reactome annotations from files available from MSigDB [2]. The shape of the distributions for the number of annotations made to categories and from genes in these two databases, shown in Figure S1, closely mimics the shape observed for the Gene Ontology (Figure 1 in the main text). Note that MSigDB does not include information for categories with fewer than ten or more than one thousand annotated members, which influences the far left of the distribution; however, for categories with more than ten members the distribution still appears to be heavy-tailed.

(a) Category Degree Distribution (b) Gene Degree Distribution
FIG. S1: The cumulative degree distributions of (a) categories and (b) genes in several functional databases other than the Gene Ontology.

1.2. Standard Multiple-Hypothesis Corrections Cannot Overcome Extreme Annotation Bias

(a) Min Value of FET (b) FDR q-value
FIG. S2: (a) The minimum p-value estimated by FET across all GO branches for each of the 200 random gene sets. (b) Q-values associated with the FDR-corrected significance of GO terms in 200 randomly generated gene sets. The terms are ordered based on how many genes are annotated to the term (k_t) and the gene sets are ordered based on the total number of annotations (M_g) made by the 200 genes in that set. Note that although we tested all terms, only the 200 with the highest number of annotations are shown.

In the main text, for each of two hundred random gene sets we calculated the FET p-value for all 10192 “Biological Process” terms.
Since we performed 10192 comparisons, we expect that, for each of these gene sets, the minimum p-value across all terms should be approximately 1/10192 ≈ 10^-4 (if the N p-values were independent and uniformly distributed, their expected minimum would be 1/(N+1)). Figure S2(a) plots the actual minimum p-value for each random gene set. We observe that, instead of the expected value, random gene sets with the fewest annotations have a minimum p-value around 10^-3, while random gene sets with the highest annotation levels have a minimum p-value close to 10^-6. These results demonstrate that when using Fisher’s Exact Test (FET) to evaluate the enrichment of a set of genes, annotation bias leads both to anti-conservative estimates of functional enrichment in gene sets that have many annotations and to overly conservative estimates of enrichment in gene sets that have few annotations.

Most commonly used functional enrichment tools (e.g. [2–6]) do not report only the FET p-values, but also provide scientists with a number of other measures, including statistics designed to help mitigate issues associated with multiple-hypothesis testing. We wanted to see if such measures might sufficiently modify the p-value so that the anti-conservative estimates from FET would no longer be an issue. Therefore, we determined the q-values associated with the FDR-significance for all GO terms in each of our 200 randomly generated gene sets. The results are shown in Figure S2(b). We find that although the majority of q-values clearly indicate insignificant associations, there is still an obvious bias toward significant enrichment between high-degree gene-set/term pairs. A number of these high-degree gene-set/term pairs still have a q-value low enough to be considered significant by some. Specifically, 99 comparisons have a q-value less than 0.01 and, most importantly, all of these incorrectly called comparisons occur in random gene sets that have an average annotation level greater than 54.
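The quantities discussed in this subsection can be illustrated with a minimal sketch. It is not the code used to produce Figure S2. The function names, argument names and toy numbers are hypothetical, and the Benjamini-Hochberg procedure is assumed here only as one common way to obtain FDR q-values, since the text does not state which FDR procedure was applied.

```python
from math import comb


def fet_upper_tail(overlap, set_size, term_size, n_genes):
    """One-sided FET p-value: P(X >= overlap) for a hypergeometric X.

    Argument names are illustrative: `set_size` genes are drawn from a
    background of `n_genes`, of which `term_size` are annotated to the term.
    """
    total = comb(n_genes, set_size)
    tail = sum(
        comb(term_size, x) * comb(n_genes - term_size, set_size - x)
        for x in range(overlap, min(set_size, term_size) + 1)
    )
    return tail / total


def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values (an assumed FDR procedure), in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


# Expected minimum of N independent, uniform p-values is 1 / (N + 1).
n_terms = 10192
expected_min_p = 1 / (n_terms + 1)

# Toy example with made-up counts: 2 of 4 selected genes hit a 3-gene term
# in a 10-gene background.
p_toy = fet_upper_tail(overlap=2, set_size=4, term_size=3, n_genes=10)
q_toy = bh_qvalues([0.01, 0.04, 0.03, 0.005])
```

Exact integer arithmetic from the standard library is used so that the sketch has no external dependencies. For real annotation data a dedicated statistics library would normally be preferable.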
Since approximately half of the experimental gene signatures we analyze in the main text have an average annotation level greater than 54, we conclude that standard multiple-hypothesis corrections, although important, cannot overcome the extreme annotation bias present in the evaluation of experimental gene signatures.

2. ADDITIONAL COMPARISONS BETWEEN FISHER’S EXACT TEST (FET) AND ANNOTATION ENRICHMENT ANALYSIS (AEA)

2.1. Additional Comparisons of FET and AEA Significance Predictions

In the main text (Figures 2(b) and (d)) we selected five of our random gene sets and directly compared the distributions of the p-values predicted by FET and AEA to the expected distribution (evenly distributed values from zero to one); however, in these distributions we excluded p-value estimates for terms (and their corresponding branches) that did not have at least one annotation from a member of the gene set in question. These excluded p-values are, by definition, all equal to exactly one. In Figure S3(a)-(b) we plot, in rank order, the p-values calculated for these same five selected random gene sets according to either FET or AEA; however, as opposed to Figures 2(b) and (d) in the main text, these plots include all terms and demonstrate the discrete nature of both the FET and AEA methods. Namely, fewer terms have an annotation overlap with gene sets composed of poorly-annotated genes. Since terms with no annotation overlap from a given gene set all have a p-value equal to exactly one using either FET or AEA, the result for these plots is that the “percent rank” for terms with a p-value equal to one is lower for poorly-annotated gene sets. These properties are in addition to an overall different shape in the p-value distribution for the remaining terms, as shown in the main text.

(a) FET (b) AEA (c) FET vs. AEA in Signatures
FIG. S3: Plots of the p-values predicted by (a) FET and (b) AEA as a function of rank for five of the random gene sets. The “percent rank” values for branches with p-values equal to one demonstrate that both methods are discrete and that gene sets with fewer associated annotations have a non-zero overlap with fewer branches than gene sets with a higher number of associated annotations. (c) The mean of the log-ratio of FET to AEA significance predictions in experimentally derived gene signatures as a function of the average number of annotations made by genes in the signature. A bias for FET to predict more significant associations than AEA is evident for the more highly annotated signatures.

Finally, in the main text we also predict the enrichment of all “Biological Process” GO branches in experimental gene signatures both by traditional set-overlap statistics (FET) and with AEA (10^6 randomizations). We compare these results based on the ranks of the branches by each measure (Figure 4(b) in the main text). However, we also wanted to compare more directly the enrichment values predicted by FET and AEA for these experimental signatures. Therefore, for each signature, we also determined the log-ratio of the FET and AEA p-values for each term. We took the mean of these log-ratios and plot, for each signature, this average value as a function of the annotation level of the experimental gene signature (Figure S3(c)). We observe that for high-degree gene signatures, FET routinely predicts more significant p-values than AEA; however, for the lowest-degree signatures (which generally have an annotation level close to chance), FET and AEA give much more similar estimations for the significance.

2.2. Evaluating Enrichment in GO Branches

Fisher’s Exact Test (FET) investigates the overlap of a given gene set of interest with a collection of genes defined based on their annotations to a single term. In contrast, when using Annotation Enrichment Analysis (AEA) we instead evaluate the overlap in the annotations made by a given gene set of interest with the collection of annotations made to a GO branch, defined by a term and all of its progeny. In the most extreme case, where a term has no progeny, its corresponding branch will be composed of a single term, and its corresponding annotations will be composed of the annotations from the same set of genes used for the FET analysis; however, the background distribution of the expected overlap in genes (for FET) or annotations (for AEA) may be fundamentally different.

(a) Significance By Branch Size (b) Significance By Branch Size (Filtered)
FIG. S4: The average significance of enrichment, estimated by FET and AEA, for 200 randomly generated gene sets in terms or branches, plotted as a function of the number of progeny in the branch. (a) shows the value using all gene-set/term comparisons; (b) plots the average value when only considering comparisons where the term/branch of interest has at least one annotation from a member of the gene set of interest.

In light of these differing frameworks, we tested whether the number of terms that make up a GO branch affects the significance of the predictions made by AEA (in which case we test the overlap in annotations made to the head node and all its progeny) compared to FET (in which case we test the overlap with the genes annotated to the head node). To do this, we binned GO terms based on the number of progeny in their corresponding branch and determined the average FET and AEA p-values of those terms (or their corresponding branch) in any of our 200 randomly generated gene sets (Figure S4(a)). We observe that FET and AEA, on average across the gene sets, give approximately the same level of significance to GO terms; however, GO terms with the fewest progeny have much lower average significance than terms with more progeny.
This is due to the discrete nature of the methods: terms with fewer progeny generally have far fewer genes annotated to them and therefore, in the majority of cases, have no overlap with the random gene sets and are given a p-value of one by both AEA and FET. To better highlight potential differences between AEA and FET, we therefore also plotted the average p-values for terms when only considering comparisons in which at least one member of the considered gene set is annotated to the term of interest (Figure S4(b)). We find that for FET there is a bias towards giving more significant p-values to terms with fewer progeny, largely because these terms, by definition, often contain annotations from the most highly annotating genes. On the other hand, AEA, which considers the distribution of annotations, has a much more even level of prediction across the term branch sizes.

2.3. Evaluating Enrichment in Random “Branches”

(a) FET (random “Branches”) (b) AEA (random “Branches”) (c) AEA-A (GO Branches) (d) AEA-A (random “Branches”)
FIG. S5: The significance of the enrichment of 200 randomly generated gene sets in random “branches” according to (a) FET and (b) AEA. Similarly, the significance of the enrichment of 200 randomly generated gene sets in (c) GO branches and (d) random “branches” according to AEA-A. In all cases the branches (GO branches or random “branches”) are ordered based on how many genes are annotated to the parent term (k_t) and the gene sets are ordered based on the total number of annotations (M_g) made by the 200 genes in that set.

We also constructed sets of random GO terms that represent random “branches” that should be largely devoid of biological coherency. Specifically, to mimic a branch in the GO DAG that has a parent term with k_t gene annotations, we randomly ordered all the terms in GO and selected the top N_t terms, increasing N_t until the number of unique genes annotated to those N_t random terms (k_t′) was within a small percentage of k_t (|k_t − k_t′|/k_t < 0.01).
In cases where selecting either N_t or N_t + 1 terms fell within this limit, we chose the number of terms that minimized the absolute difference between k_t and k_t′. If selecting the top N_t terms did not lead to a situation within this limit, we reshuffled the terms and selected the top N_t terms in this new list, repeating until a suitable random collection of terms could be chosen. In this way we created 10192 random “branches” with approximately the same number of unique genes annotated to each as to real GO branches, but in which the genes annotated to the faux parent term are determined by an accumulation of annotations made to a random set of progeny terms.

We then used these random “branches” to investigate what effect, if any, overlapping annotations between GO terms might have on functional enrichment analysis when using a measure like FET. Interestingly, the trend we observed with real GO branches (that those with more gene annotations to the parent term also tend to be enriched in “high-degree” gene sets) appears to be even more pronounced when using random “branches” (compare Figure S5(a) with Figure 2(a) in the main text). These results suggest that part of the bias from FET might result from overlapping annotations to GO terms.

We also determined the significance of all random “branches” in our randomly generated gene sets with AEA (10^4 randomizations), and created a heat map of these values (Figure S5(b)). The significance predicted by AEA is much lower for the randomly generated “branches” compared to the real GO branches (compare Figure S5(b) with Figure 2(c) in the main text). This can be explained as follows. We constructed our random gene sets using information from GO annotations (in order to bias the sets toward higher or lower annotation levels); therefore, genes within these sets are indirectly associated with the GO branch structure. This association, however, is largely lost in our randomly generated “branches”.
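The random “branch” construction described earlier in this section can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: the function name `random_branch`, the `term_to_genes` mapping, the `max_shuffles` cap and the tie-breaking rule (the earliest prefix wins when several prefixes are equally close) are all hypothetical choices made for the example.

```python
import random


def random_branch(term_to_genes, k_target, rng, tol=0.01, max_shuffles=1000):
    """Pick a random prefix of shuffled terms whose gene union is near k_target.

    term_to_genes: dict mapping a term ID to the set of genes annotated to it.
    k_target:      number of genes annotated to the real parent term (k_t).
    Returns a list of term IDs, or None if no shuffle succeeds
    (the max_shuffles cap is an assumption added to guarantee termination).
    """
    terms = list(term_to_genes)
    for _ in range(max_shuffles):
        rng.shuffle(terms)
        genes = set()
        best = None  # (absolute difference, number of terms) of the best prefix
        for n_terms, term in enumerate(terms, start=1):
            genes |= term_to_genes[term]
            diff = abs(k_target - len(genes))
            if diff / k_target < tol:
                # Inside the tolerance window: keep the closest prefix seen.
                if best is None or diff < best[0]:
                    best = (diff, n_terms)
            elif len(genes) > k_target:
                # Overshot the window; adding terms can only grow the union.
                break
        if best is not None:
            return terms[:best[1]]
        # Otherwise this ordering skipped the window: reshuffle and retry.
    return None


# Toy usage with made-up data: 50 terms, each annotated by one unique gene.
toy_terms = {f"T{i}": {f"g{i}"} for i in range(50)}
toy_branch = random_branch(toy_terms, k_target=10, rng=random.Random(0))
```

Because the union of annotated genes can only grow as terms are added, the inner loop can stop as soon as the union exceeds the tolerance window, which is what triggers the reshuffle described in the text.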
Finally, we used AEA-A (Annotation Enrichment Analysis Approximation) to determine the significance of GO branches and random “branches” in our 200 randomly generated gene sets (Figure S5(c)-(d)). As with AEA, AEA-A successfully eliminates the annotation bias associated with high-degree gene sets and predicts relatively fewer significant associations with random “branches”.

2.4. Evaluating Enrichment in “Truly Enriched” Term Sets

Finally, we also compared the performance of FET and AEA on “truly enriched” term sets. More specifically, for each simulated branch, we randomly ordered all terms. We then evaluated whether the first term in the randomized list shares at least one quarter of its annotated genes with a signature of interest. If so, the term was kept and considered a member of a “truly enriched” branch; if not, we proceeded to the next term. This continued until the simulated branch contained ten terms. In this way we created ten “truly enriched” term sets for each of the experimentally derived gene signatures in our analysis.

(a) Fisher’s Exact Test (b) Annotation Enrichment Analysis
FIG. S6: The significance predicted by (a) FET and (b) AEA for twenty experimentally derived gene signatures and their corresponding simulated “truly enriched” branches.

We then evaluated how well AEA and FET are able to associate a given experimentally derived gene signature with its set of “truly enriched” branches. Figure S6 shows the significance predicted by FET and AEA for twenty experimentally derived gene signatures in the simulated “truly enriched” branches. “Truly enriched” branches are ordered along the x-axis such that the first ten correspond to the first signature, the second ten correspond to the second signature, and so on. Both FET and AEA are able to consistently recover these “known” associations, as indicated by the blue (i.e., more significant) blocks along the diagonal. Interestingly, the FET results contain a number of significant off-diagonal associations, which we hypothesize may be indicative of false positives due to annotation bias. Results look similar for all signatures and enriched branches.

[1] K. Glass, E. Ott, W. Losert, and M. Girvan, Journal of the Royal Society, Interface / the Royal Society (2012), ISSN 1742-5662, URL http://dx.doi.org/10.1098/rsif.2011.0585.
[2] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, et al., Proceedings of the National Academy of Sciences of the United States of America 102, 15545 (2005), ISSN 0027-8424, URL http://dx.doi.org/10.1073/pnas.0506580102.
[3] D. W. Huang, B. T. Sherman, Q. Tan, J. Kir, D. Liu, D. Bryant, Y. Guo, R. Stephens, M. W. Baseler, H. C. Lane, et al., Nucl. Acids Res. 35, gkm415+ (2007), ISSN 1362-4962, URL http://dx.doi.org/10.1093/nar/gkm415.
[4] T. Beissbarth and T. P. Speed, Bioinformatics (Oxford, England) 20, 1464 (2004), ISSN 1367-4803, URL http://dx.doi.org/10.1093/bioinformatics/bth088.
[5] D. Martin, C. Brun, E. Remy, P. Mouren, D. Thieffry, and B. Jacq, Genome Biol 5 (2004), ISSN 1465-6914, URL http://dx.doi.org/10.1186/gb-2004-5-12-r101.
[6] A. Alexa, J. Rahnenführer, and T. Lengauer, Bioinformatics 22, 1600 (2006), ISSN 1367-4803, URL http://dx.doi.org/10.1093/bioinformatics/btl140.