BIOINFORMATICS ORIGINAL PAPER
Vol. 23 no. 20 2007, pages 2716–2724
doi:10.1093/bioinformatics/btm423
Gene expression

Construction of a reference gene association network from multiple profiling data: application to data analysis

Duygu Ucar1, Isaac Neuhaus2, Petra Ross-MacDonald2, Charles Tilford2, Srinivasan Parthasarathy1, Nathan Siemers2,* and Rui-Ru Ji2,*
1 Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 and 2 Applied Genomics, Bristol-Myers Squibb Research and Development, Pennington, NJ 08534, USA

Received on January 31, 2007; revised on July 10, 2007; accepted on August 6, 2007
Advance Access publication September 10, 2007
Associate Editor: John Quackenbush

ABSTRACT
Motivation: Gene expression profiling is an important tool for gaining insight into biology. Novel strategies are required to analyze the growing archives of microarray data and extract useful information from them. One area of interest is the construction of gene association networks from collections of profiling data. Various approaches have been proposed to construct gene networks using profiling data, and these networks have been used in functional inference as well as in data visualization. Here, we investigated a non-parametric approach to translate profiling data into a gene network. We explored the characteristics and utility of the resulting network and investigated the use of network information in analysis of variance models and hypothesis testing.
Results: Our work is composed of two parts: gene network construction and partitioning, and hypothesis testing using sub-networks as groups. In the first part, multiple independently collected microarray datasets from the Gene Expression Omnibus data repository were analyzed to identify probe pairs that are positively co-regulated across the samples. A co-expression network was constructed based on a reciprocal ranking criterion and a false discovery rate analysis.
We named this network the Reference Gene Association (RGA) network. Then, the network was partitioned into densely connected sub-networks of probes using a multilevel graph partitioning algorithm. In the second part, we proposed a new, MANOVA-based approach that can take individual probe expression values as input and perform hypothesis testing at the sub-network level. We applied this MANOVA methodology to two published studies, and our analysis indicated that the methodology is both effective and sensitive for identifying transcriptional sub-networks or pathways that are perturbed across treatments.
Contact: [email protected] or [email protected]
*To whom correspondence should be addressed.

1 INTRODUCTION
The technique of transcriptional profiling, as well as related proteomic and metabonomic methodologies, is producing biological state information in a breadth and depth that remains challenging to interpret. A diverse arsenal of analytical tools is typically applied to analyze such data (Quackenbush, 2001). A variety of global, unsupervised learning methods such as principal component analysis (PCA) and various clustering algorithms are often used to initially characterize a dataset (Eisen et al., 1998; Tamayo et al., 1999). The discovery of significantly regulated transcripts through hypothesis testing, including the use of t-tests or more complex ANOVA models, is often an important goal. Linear or non-linear models can also be built to predict biological response (Golub et al., 1999; Quackenbush, 2006). Classification schemes such as the Gene Ontology (Ashburner and Lewis, 2002) are often utilized to macroscopically classify and conceptualize the microscopic behavior of probes, and in some cases provide superior sensitivity and interpretability. Recently, there has been a growing interest in producing gene networks from profiling data (Bergman et al., 2004; van Noort et al., 2004; Lee et al., 2004).
In these representations, genes constitute the nodes of the network and the edges between nodes indicate gene associations (e.g. co-expression). As the application of graph methods and network inference algorithms to these data becomes more advanced, there may be some advantages to these approaches (Mao and Resat, 2004; Mehra et al., 2004; Rice et al., 2005; Pena et al., 2005).

In this study, we explored the use of a non-parametric, rank-based methodology to transform profiling data into a gene network. Our methodology exploits the principle of reciprocal best match pairs, often used in gene sequence analysis and here applied to transcriptional profiling data. In addition, we applied a False Discovery Rate (FDR) analysis step to characterize and control the signal-to-noise ratio in the network. Graph partitioning and clustering algorithms were then utilized to extract functionally and topologically coherent sub-networks. The effectiveness of these algorithms was investigated based on domain knowledge from curated biological pathway databases.

© 2007 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

[Figure 1 diagram. RGA network construction: GEO data, correlation and rank analysis, sorted probe pairs, FDR analysis against pathways, RGA network. RGA network partitioning: clustering algorithms, sub-networks, annotation analysis, GO-labeled sub-networks (e.g. ion binding, apoptosis, angiogenesis). MANOVA-based hypothesis testing: MANOVA tests on a dataset D, asking for example whether genes in the ion binding sub-network are differentially expressed among groups of D (i.e. different treatments).]
Fig. 1. Schematic view of the proposed methodologies. In the first step, RGA network construction, transcriptional profiling projects downloaded from the Gene Expression Omnibus (GEO) were pooled and Pearson correlation coefficients were calculated for every probe pair across all samples. Probe pairs were then sorted based on a reliability score calculated from reciprocal co-expression ranks. FDR analysis based on known pathway annotations was employed to find a suitable cutoff for inclusion of probe pairs in the resulting network. In the second step, the RGA network was partitioned to identify functional modules within the network. The sub-networks were annotated with their significantly enriched GO terms. In the last step, a new, MANOVA-based model was proposed for hypothesis testing at the sub-network level.

We also investigated the utility of a gene network (built via analysis of a compendium of human data available in the NCBI Gene Expression Omnibus) in enhancing the analysis and interpretation of individual profiling experiments. Specifically, we proposed to use the sub-networks extracted from the association network, presumably representing biological functional units, as groups in set-based enrichment analyses. The goal of gene set analysis is to determine whether genes of a defined set are differentially affected by a treatment. The major advantage of gene set analysis over single gene analysis is that the former provides a unifying theme for data interpretation, often shedding light on the underlying biological mechanisms. To date, several gene set analysis approaches have been proposed. Some of them, such as GO Term Finder (Boyle et al., 2004) and Gene Set Enrichment Analysis (Subramanian et al., 2005), require the prior generation of a rank-ordered gene list.
Another test, parametric analysis of gene set enrichment (PAGE), does not require a sorted gene list but needs genes’ fold changes or P-values from hypothesis testing as inputs (Kim and Volsky, 2005). Recently, Hotelling’s T-square test has been proposed as a multivariate approach for network/pathway significance analysis (Lu et al., 2005). This method takes multiple dependent variables (i.e. genes from a particular set) and tests whether there is a significant difference between two groups (e.g. treated versus control). Several studies (Lu et al., 2005; Vinciotti et al., 2006; Xiong, 2006) have successfully utilized this approach to detect differentially regulated genes or gene groups. There are three potential advantages to this type of approach. First, no ordered gene list needs to be generated. Second, it is more robust to type I error (false positives). Third, the set significance test may be more sensitive, as it is computed directly from expression data rather than from an enrichment score. Nevertheless, the application of Hotelling’s T-square test is limited as it can only deal with two treatment groups. Here, we proposed and investigated the utility of a new, MANOVA-based statistical model that allows hypothesis testing across sub-networks extracted from the reference network. This methodology overcomes the limitation of Hotelling’s T-square test and can be applied to a wider range of experimental designs.

2 METHODS
The overview of the proposed methodology is illustrated in Figure 1. In the first step, a reference gene association (RGA) network was constructed from publicly available microarray profiling data. The network was partitioned in step two to extract co-expression sub-networks. These sub-networks were subsequently fed into a MANOVA-based procedure for hypothesis testing in individual profiling experiments. Throughout this article, gene and probe are used interchangeably.
2.1 Data collection and pre-processing
We collected experiments within NCBI Gene Expression Omnibus (GEO) that were studied with the Affymetrix HG-U133A chip (Affymetrix Inc., Santa Clara, CA, USA). The starting data consisted of 525 human tissue and cell line samples from 26 independently conducted projects. We excluded projects that include primary tumor samples as we wanted to minimize the possible influence of gross cytogenetic abnormalities, amplifications and deletions on network construction. The resulting final dataset consisted of 393 human tissue and cell line samples from 21 different projects. The Affymetrix CEL files were analyzed with the MAS 5.0 algorithm from Affymetrix and standardized using quantile normalization (Bolstad et al., 2003), making use of the publicly available Bioconductor tools (www.bioconductor.org).

2.2 RGA network construction
In order to identify probes that are co-expressed across multiple projects, we first calculated Pearson correlation coefficients between every probe pair across all samples from the selected microarray projects. Formally, the Pearson correlation coefficient between probes pA and pB is defined as:

\[ r_{AB} = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^2 \, \sum_{i=1}^{n} (B_i - \bar{B})^2}} \qquad (1) \]

where $A_i$ and $B_i$ are the expression levels of pA and pB in the i-th sample, n is the number of samples, and $\bar{A}$ and $\bar{B}$ are the average expression levels of pA and pB, respectively. In this work, we based our analysis only on positive correlations, as this ensures that the strongest correlations among gene pairs are included. In prior work, Goldenberg and Moore (2004) demonstrated that positive correlations are much more significant than negative correlations in sparsely connected networks.
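As an illustration of this step, the following is a minimal sketch of computing all pairwise Pearson correlations and keeping only the positive ones. It assumes NumPy and a small in-memory matrix; the function name `positive_correlations` and the toy data are hypothetical and are not taken from the original analysis, which was run on a far larger probe set.

```python
import numpy as np

def positive_correlations(expr):
    """Pearson correlation between every pair of probes.

    expr: array of shape (n_probes, n_samples), one row per probe.
    Returns an (n_probes, n_probes) matrix in which negative
    correlations and the diagonal are set to zero.
    """
    r = np.corrcoef(expr)        # rows are treated as variables (probes)
    np.fill_diagonal(r, 0.0)     # ignore each probe's correlation with itself
    return np.where(r > 0, r, 0.0)

# Toy data (hypothetical): 3 probes measured in 5 samples.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.1, 6.2, 7.9, 10.0],  # closely tracks probe 0
    [5.0, 4.0, 3.0, 2.0, 1.0],   # anti-correlated with probe 0
])
r_pos = positive_correlations(expr)
```

For a full array of tens of thousands of probes the dense matrix becomes large, so a real run would likely need to process probes in blocks; that detail is left out of this sketch.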
The importance of positive correlations over negative ones has also been suggested in previous microarray studies (Allocco et al., 2004). In our context, the goal is to derive a network that is relatively sparse, so by considering only the positive correlations we ensure that the strongest or most significant correlations among gene pairs are captured by our model. A rank-ordered list of co-expressed neighbors was enumerated for each probe, with rank 1 being the most correlated. Note that, although the Pearson correlation scores are symmetric, the rank order of probe A with respect to probe B (say, $RO_{AB}$) does not necessarily equal the rank of probe B with respect to probe A (say, $RO_{BA}$). Thus, two reciprocal rank values are associated with each probe pair (i.e. $RO_{AB}$ and $RO_{BA}$). Accordingly, for any probe pair, pA and pB, we define its reliability score as:

\[ RS_{AB} = \frac{1}{RO_{AB} \times RO_{BA}} \qquad (2) \]

Thus, probe pairs with smaller reciprocal rank orders have higher reliability scores. We systematically computed reliability scores for every probe pair based on the reciprocal ranks. Pairs were sorted with respect to their scores, starting from the most reliable one. Essentially, the network construction step can be viewed as the gradual addition of increasingly unreliable pairs. As the number of probe pairs in a network increases, the noise in the network also increases. To find a proper cutoff point, we adopted an FDR analysis step to quantify the signal-to-noise ratio in a given network. Two probes linked in the network can either be functionally relevant (true positive, TP) or irrelevant (false positive, FP). To quantify the relevance of probe pairs, we made use of curated pathway annotations from the KEGG, BioCarta and GenMAPP databases. Accordingly, a link between two probes that share a pathway annotation in any of the databases is labeled as TP.
Conversely, a link between two probes that are annotated with two non-overlapping sets of pathways is labeled as FP. Probe pairs where at least one probe is not annotated are defined as non-discriminatory (ND), as current knowledge is not sufficient to conclude whether they are functionally relevant. Such pairs are excluded from the FDR calculation. The effect of the FDR cutoff on the resulting network and the criteria considered while setting this parameter are discussed in Section 3.1. In order to include pairs labeled as ND, a reliability score threshold corresponding to the chosen FDR cutoff was obtained, and all probe pairs that have higher reliability scores than this threshold were included in the final reference network.

To check the significance of the network, we generated random datasets using the random number generator tool from the Partek software package (www.partek.com). Random datasets of normal, gamma and uniform distributions, and of overall dimensions comparable to the experimental data, were generated. These datasets were processed the same way as the real dataset, and the sorted probe pair lists were analyzed and compared to that of RGA.

2.3 RGA network partitioning
Many published graph partitioning algorithms can be used to partition the co-expression network into densely connected modules. Clustering algorithms can also be employed, utilizing the reliability scores of probe pairs as the similarity values. To find an optimal algorithm, two graph partitioning algorithms, Graclus (Dhillon et al., 2005) and Metis (see footnote 1), as well as a clustering algorithm, agglomerative hierarchical clustering, were employed. We utilized a fast implementation of agglomerative clustering from the CLUTO clustering toolkit (see footnote 2). We utilized the GO Term Finder tool (Boyle et al., 2004) to evaluate whether the extracted sub-networks were enriched in genes of similar functions and/or involved in similar biological processes. This analysis served two purposes.
First, each sub-network was annotated with the best GO terms that are significantly enriched. Second, the optimal algorithm was selected based on the number of annotated sub-networks and the significance of the annotations (i.e. P-values).

2.4 MANOVA-based sub-network significance testing
Sub-networks extracted from the RGA network usually consist of co-regulated probes involved in a common biological function or process. We adopted a MANOVA-based approach to assess transcriptional response at the sub-network level. MANOVA is a statistical method that can be used to measure treatment effects across multiple dependent variables simultaneously. In our case, we treated each probe of a sub-network as a response variable. By combining all probes from the same sub-network, we can assess the response of that sub-network as a whole to some experimental factor. For experiments where there is only one independent variable (e.g. treatment), our MANOVA model is formulated as follows:

\[ p_1 + p_2 + \cdots + p_i + \cdots = t \qquad (3) \]

where $p_i$ is the i-th probe in the sub-network and t is the treatment factor. Note that each probe is treated as a dependent variable. Using this model, we assessed the following hypotheses for each of the sub-networks.
H_null: Sub-network is not affected by different treatments.
H_a: Sub-network is significantly affected by different treatments.
Note that the model may be expanded to accommodate more complex experimental designs and interactions between factors. To validate the effectiveness of the approach, we applied it to two published studies on HIV protease inhibitors and cigarette smoking. Significantly affected sub-networks were analyzed and compared to known biological effects of the drugs.

1 http://glaros.dtc.umn.edu/gkhome/views/metis
2 http://www-users.cs.umn.edu/karypis/cluto/index.html

[Figure 2 plot: false positive rate (y-axis) versus number of edges (x-axis) for the Real, Gamma, Normal and Uniform datasets.]
Fig. 2.
Probe pairs calculated from the real expression data have significantly lower FDRs than those of random datasets. As described in Section 2, a sorted probe pair list was generated from each dataset and every probe pair was then given a label of TP, FP or ND based on the pathway annotations. False positive rates were calculated for the total probe pair numbers indicated on the graph. The x-axis indicates the total number of probe pairs considered, starting from the top of the sorted probe pair list, and the y-axis indicates the corresponding FDR for that set of probe pairs.

3 RESULTS AND DISCUSSION
3.1 RGA network construction and significance verification
Microarray datasets generated from the Affymetrix HG-U133A platform were analyzed and a reference gene association network was constructed as described in Section 2. Primary tumor samples were excluded from the original data to minimize possible influence from cytogenetic abnormalities. For example, when primary tumor samples were included in the sample set, we often observed correlations between probes mapped to the same chromosomal regions that are commonly amplified or deleted in cancer (data not shown). We employed a rank-based approach that avoids any assumption about the data distribution, since correlation coefficients calculated from profiling data do not follow the well-known normal or t-distribution (Allocco et al., 2004). We used mutual rank orders to quantify the strength of co-expression between probe pairs. Every probe pair was assigned a reliability score calculated from the two reciprocal correlation ranks between the pair. These scores were used to sort the probe pairs starting from the most reliable ones. Networks were constructed by incrementally adding probe pairs to the networks starting from the top of the sorted list. To estimate the signal-to-noise ratio in a network, FDR analysis was adopted. As described in Section 2, each probe pair was given a label of TP, FP or ND (non-discriminatory) based on the pathway annotations from the KEGG, BioCarta and GenMAPP databases.
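The ranking and labeling steps described above can be sketched as follows. This is an illustrative sketch only: it assumes the reliability score is the reciprocal of the product of the two ranks, that ties in correlation are broken by probe index, and that the FDR at a given list length is FP / (TP + FP) with ND pairs ignored. The function names, the toy correlation matrix and the pathway labels are all hypothetical.

```python
def rank_neighbors(corr):
    """ranks[i][j] = rank of probe j among probe i's neighbors (1 = most correlated)."""
    n = len(corr)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: -corr[i][j])
        for rank, j in enumerate(order, start=1):
            ranks[i][j] = rank
    return ranks

def sorted_pairs(corr):
    """All probe pairs as (score, i, j), most reliable first."""
    ranks = rank_neighbors(corr)
    n = len(corr)
    pairs = [(1.0 / (ranks[i][j] * ranks[j][i]), i, j)
             for i in range(n) for j in range(i + 1, n)]
    pairs.sort(key=lambda t: -t[0])  # stable sort keeps index order on ties
    return pairs

def label_pair(paths_a, paths_b):
    """TP if the probes share a pathway, FP if both are annotated but disjoint, else ND."""
    if not paths_a or not paths_b:
        return "ND"
    return "TP" if paths_a & paths_b else "FP"

def cumulative_fdr(pairs, pathways):
    """FDR after each successive pair is added; ND pairs do not change the counts."""
    tp = fp = 0
    fdr = []
    for _, i, j in pairs:
        label = label_pair(pathways.get(i, set()), pathways.get(j, set()))
        tp += label == "TP"
        fp += label == "FP"
        fdr.append(fp / (tp + fp) if tp + fp else 0.0)
    return fdr

# Toy example (hypothetical): 4 probes; probe 3 has no pathway annotation.
corr = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.4],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.4, 0.8, 1.0],
]
pathways = {0: {"glycolysis"}, 1: {"glycolysis", "TCA cycle"}, 2: {"apoptosis"}}
pairs = sorted_pairs(corr)
fdr = cumulative_fdr(pairs, pathways)
```

Under these assumptions, a cutoff would be chosen by walking down `pairs` until the running FDR reaches the chosen level and then taking the reliability score at that position as the inclusion threshold, which also admits the ND pairs above it.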
As depicted in Figure 2, the FDR of the network increases with an increasing number of probe pairs, indicating that reliability scores correlate well with the biological relevance of probe pairs. We chose a relatively moderate 20% FDR cutoff for two reasons. First, it yielded a reasonable coverage of probes in the network. Second, we must take into account the fact that the current pathway annotation is not complete due to our limited understanding of biology, and the probe pairs falling into the false discovery class are in fact a mixture of false positives and true positives missing in the databases. We expect that the FDR threshold could be made stricter as pathway annotation becomes more complete. Using the 20% FDR cutoff, a network composed of 10 314 probes and 9675 edges was generated. To further assess the significance of probe–probe association in the RGA network, random datasets of normal, uniform and gamma distributions were simulated as described in Section 2. For each of the datasets, Pearson correlations were computed for probe pairs and FDR analysis was performed. As shown in Figure 2, networks constructed from random datasets have much higher FDRs than RGA at any given number of probe pairs. Another metric, which quantifies the functional relevance of probe pairs as their GO term overlap, was also utilized to compare the networks (Lee et al., 2004). In this analysis, the number of shared GO annotations, including parent terms, was calculated for each probe pair. For better comparison, we used the same number of probe pairs from each of the random networks as in the real network (i.e. 9675 pairs). As depicted in Figure 3, the real network showed greater functional relevance than any of the random networks. In summary, these results demonstrated that our network construction procedure is effective in selecting biologically related probe pairs.

[Figure 3 plot: cumulative probability (y-axis) versus number of overlapping GO terms (x-axis) for the Real, Gamma, Normal and Uniform datasets.]
Fig. 3. Probe pairs calculated from the real expression data show greater functional relevance. The functional relevance of the top 9675 probe pairs from each of the sorted probe pair lists, corresponding to 20% FDR in the real network, was evaluated by examining the overlap of the probe pairs’ Gene Ontology (GO) annotations. The x-axis indicates the number of overlapping GO terms and the y-axis is the cumulative probability. The x-axis is truncated at 220.

Fig. 4. Histogram of the best P-values of GO match for sub-networks. Sub-networks were annotated using the GO Term Finder tool and were assigned the P-value of the best GO term match. The distributions of the P-values were analyzed to select a partitioning algorithm that best captures the functional modules in the RGA network.

3.2 Partitioning of RGA network
To further exploit the co-regulatory networks, it is desirable to partition the network into smaller pieces, with the overall intent of identifying functional biological modules within the larger network. Many published clustering algorithms can identify densely connected sub-networks in a network (Bader and Hoque, 2003; Brun et al., 2004; King et al., 2004). Given that each probe pair can be assigned a weight based on the reliability score, we selected three algorithms that can make use of the weights. The algorithms include two graph partitioning algorithms (Graclus and Metis) and one clustering algorithm (hierarchical clustering with average linkage). We considered two criteria when picking the number (k) of partitions. First, we would like to obtain sub-networks that are comparable in size to actual biological pathways. Second, the sub-network size should be feasible for use in the next section, MANOVA-based data analysis.
After a few trials, we picked a partition number of 500, which would result in an average of about 20 nodes per partition if nodes were evenly distributed among partitions. After partitioning, we identified connected sub-networks to be evaluated further. If the partitioning algorithms could identify functional modules in the network, we would expect that probes of similar functions (thus likely to be co-expressed) would be partitioned into the same sub-network. The GO Term Finder tool was utilized to determine whether sub-networks are enriched in probes with similar GO term annotations and to compare the suitability of the partitioning and clustering algorithms. A plot of the best GO P-values for each of the sub-networks is shown in Figure 4. As can be seen from this figure, the Graclus algorithm outperformed the other two algorithms in terms of both quality and quantity of significantly annotated sub-networks. The hierarchical clustering method yielded the worst performance, likely due to the unbalanced sub-network sizes. The box plots of the connected sub-network sizes when k was set to 500 are shown in Figure 5. It can be observed that the graph partitioning algorithms yielded sub-networks that are more balanced in size. By contrast, the hierarchical clustering algorithm tends to produce a few big sub-networks and many singletons.

Fig. 5. Box-plots of sub-network sizes generated by selected network partitioning algorithms with k = 500. The algorithms (Graclus, Metis and Hierarchical) are shown on the x-axis, and the log(sub-network size) values are indicated on the y-axis.

Accordingly, the final partitioning was obtained using the Graclus algorithm and a k value of 500. Among these 500 partitions, we identified 2112 connected sub-networks. The average size of sub-networks is 5 probes per sub-network and the largest two sub-networks have 38 probes each. Inspection of these sub-networks revealed many well-known areas of biological regulation.
These areas include steroid biosynthesis, neuronal development, antigen processing and presentation, cell polarity maintenance, cytoskeletal components, the cell cycle, RNA splicing machinery and many others. However, it is also worthwhile to recognize the limitations of the underlying experimental data and their effects on network structure when interpreting the network and its partitions. One critical feature of the RNA profiling data generated from the U-133A array is that although (however many thousand) probes are independently measured within a single sample, significantly fewer independent signals can realistically be resolved from these data. Probe design has led to the intentional (and sometimes unintentional) construction of many redundant probes for RNA transcripts. We estimate that 2358 transcripts represented on the U-133A array have more than one probe set measuring them (data not shown). Although not technically an artifact, networks and network partitions containing these redundant probes have fewer independent genetic components than would be implied by the network connectivity. Lack of specificity of probe hybridization characteristics (cross-hybridization) is another limitation of Affymetrix technology. However, the impact of cross-hybridization on network structure seems to be small. The U-133A array design assists in the identification and interpretation of cross-hybridization artifacts by attaching ‘_s’ and ‘_x’ suffixes in the nomenclature of probes to indicate a higher risk of cross-hybridization across isoforms and gene families for these probes. We found that the RGA network is only modestly enriched in probes capable of cross-hybridization (53% versus 47% on the U-133A array). This result suggests that, for the majority of the probes, the cross-hybridization impact at the probe set level is limited.

Fig. 6. Example sub-networks. (a) The neuronal development and signaling cluster: 213841_at, 213484_at, 213197_at (ASTN1), 209726_at (CA11), 204995_at (CDK5R1), 205143_at (CSPG3), 39966_at (CSPG5), 205373_at (CTNNA2), 219619_at (DIRAS2), 215116_s_at (DNM1), 207010_at (GABRB1), 209469_at (GPM6A), 209470_s_at (GPM6A), 205358_at (GRIA2), 218623_at (HMP19), 219671_at (HPCAL4), 214680_at (NTRK2), 207093_s_at (OMG), 213849_s_at (PPP2R2B), 204974_at (RAB3A), 206196_s_at (RPIP8), 203485_at (RTN1), 203724_s_at (RUFY3), 219196_at (SCG3), 205152_at (SLC6A1), 202508_s_at (SNAP25), 205104_at (SNPH), 205551_at (SV2B), 212664_at (TUBB4). (b) The ribosomal protein cluster: 211942_x_at (RPL13A), 200022_at (RPL18), 200869_at (RPL18A), 200013_at (RPL24), 200003_s_at (RPL28), 200823_x_at (RPL29), 213969_x_at (RPL29), 200963_x_at (RPL31), 200002_at (RPL35), 219762_s_at (RPL36), 201406_at (RPL36A), 200936_at (RPL8), 214167_s_at (RPLP0), 200018_at (RPS13), 208645_s_at (RPS14), 211487_x_at (RPS17), 212433_x_at (RPS2), 200024_at (RPS5), 200082_s_at (RPS7), 217747_s_at (RPS9), 200651_at (GNB2L1), 217831_s_at (NSFL1C), 201600_at (PHB2), 201052_s_at (PSMF1).

The above probe design considerations may affect the network construction and our assessment of network robustness in various ways. In the cases where probes within a given network partition are non-redundant, we can somewhat safely ascribe the network to the biology we are trying to discover. The highly co-regulated elements of neuronal development and signaling fall into this category.
Probes in this sub-network represent genes including ASTN1, CA11, CDK5R1, CSPG3, CSPG5, CTNNA2, DIRAS2, DNM1, GABRB1, GPM6A, GRIA2, HMP19, HPCAL4, NTRK2, OMG, PPP2R2B, RAB3A, RPIP8, RTN1, RUFY3, SCG3, SLC6A1, SNAP25, SNPH, SV2B and TUBB4 (Fig. 6a). These genes share only minor sequence homology. Two unannotated probes are also present in the sub-network: 213484_at and 213841_at. Interestingly, both probes represent transcripts that are highly expressed in brain tissues (Unigene, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene). In other cases, probes in a sub-network share both RNA sequence similarity and functional biological conservation. Such a sub-network is fundamentally confounded: even when we can give the sub-network a high Gene Ontology overlap significance, we lack the resolution to identify whether the score is due to our capture of biology or to the fact that we are measuring many fewer independent signals within the sub-network than the number of probes would imply. The high degree of both sequence and functional similarity in the ribosomal proteins (RPL13A, RPL18, RPL18A, RPL24, RPL28, RPL29, RPL31, RPL35, RPL36, RPL36A, RPL8, RPLP0, RPS13, RPS14, RPS17, RPS2, RPS5, RPS7, RPS9) is one example of this behavior, and is well represented within a sub-graph of the network (Fig. 6b). Finally, although details of biological specimen preparation not intended in the study design (in vitro cell confluency, culture temperature variation, etc.) are still valid biological effects, other technical artifacts of sample processing, hybridization and data acquisition can interfere with the robustness of the network and network partitions. Some of these technical effects can be readily identified in the networks. For instance, a collection of control probes, including the BioB, BioC, BioD and Cre control oligonucleotides that are spiked into the cRNA mixtures, appears in the resultant network, connected only to each other.
Similarly, 18S and 28S ribosomal RNA contaminants carried through the probe preparation process also appear as an isolated sub-network of the total network.

3.3 MANOVA-based sub-network significance testing
Gene networks have been used in the area of gene function inference and gene–gene interaction/regulation studies (Jordan et al., 2004; Stuart et al., 2003). Allocco et al. (2004) have demonstrated that two genes are more likely to share a common transcriptional regulator if they are strongly co-expressed. Since our sub-networks are composed of co-expressed genes, these sub-networks may be treated as a special transcriptional ontology where each sub-network is specifically under the control of one or a combination of transcription factors.

Table 1. Top 15 scoring sub-networks for the HIV data

Cluster ID   MANOVA P-value   Significant GO annotation
1186         1.67E-09         Regulation of glutamate-cysteine ligase activity
495          4.08E-09         Aminosugar metabolic process, N-acetylglucosamine 6-O-sulfotransferase activity
1197         5.26E-08         Cellular respiration, TCA cycle, mitochondrion
812          5.72E-08         No significant annotation
164          7.92E-08         Aromatic amino acid transport, antiporter activity
1520         8.10E-08         6-phosphofructokinase complex
547          8.95E-08         Glycolysis, energy derivation by oxidation of organic compounds
1302         2.82E-07         Glucose transport, integral to membrane
1272         3.51E-07         Amino acid biosynthetic/metabolic process
1862         5.67E-07         Amine metabolic process
568          5.96E-07         Steroid metabolic process, cholesterol homeostasis, xenobiotic metabolic process
1109         1.14E-06         Kinase activity
1707         2.61E-06         Protein targeting to ER, endoplasmic reticulum
1832         2.98E-06         Cell growth, phosphatase activity
291          3.64E-06         Response to chemical stimulus, response to drug

Sub-networks are annotated with the top GO term hits in the ‘Molecular Function’, ‘Biological Process’ and ‘Cellular Component’ ontologies.
For any individual microarray study, if the significantly affected sub-networks can be identified analytically, it will greatly improve our understanding of the underlying biological mechanism. Naturally, sub-networks can be treated as gene sets that can be fed into gene set analysis tools. To date, several gene set significance testing approaches have been developed, including GO Term Finder, GSEA, PAGE and Hotelling’s T-square test (Boyle et al., 2004; Kim and Volsky, 2005; Lu et al., 2005; Subramanian et al., 2005). Here, we explored the use of a new, MANOVA (multivariate analysis of variance) based approach for network/pathway analysis. MANOVA is an extension of ANOVA (analysis of variance) to cases where there is more than one dependent variable. It is often used when a set of correlated dependent variables is present and a single, overall statistical test is desired. Hotelling’s T-square test is the special case of MANOVA for data with two groups. MANOVA enjoys the same advantages as Hotelling’s T-square test (see Introduction section), but it can be applied to a wider range of experimental designs because it can handle any number of groups. Naturally, we can treat each probe as a dependent variable, and then derive a single MANOVA-based statistic for the sub-network as a unit. This approach is well suited to our purposes because the sub-networks consist of co-expressed probes and MANOVA is specifically designed to deal with correlated dependent variables. We tested our MANOVA model using two published datasets. The first is a study of HIV protease inhibitors including atazanavir, nelfinavir and ritonavir (Parker et al., 2005). It is well known that patients taking protease inhibitor drugs to treat HIV-AIDS often develop a lipodystrophy-like syndrome characterized by hyperlipidemia, peripheral lipoatrophy and central fat accumulation (Calza et al., 2004). Parker et al.
(2005) used Affymetrix chip technology to survey gene expression changes after drug treatment. Their analysis indicated that protease inhibitors could induce gene expression changes indicative of dysregulation of lipid metabolism, endoplasmic reticulum (ER) stress and metabolic disturbance. This result is consistent with the clinical observations and provides evidence for a molecular mechanism for the pathophysiology of protease inhibitor-induced lipodystrophy. The top 15 scoring sub-networks with respect to treatment effect are listed in Table 1. The model P-values give a relative measurement of the significance of the treatment. One can immediately note that this list includes all the major targets of the HIV protease inhibitors, including lipid metabolism, amino acid metabolism, gluconeogenesis, cellular respiration and endoplasmic reticulum. The second dataset is from a study of the effects of cigarette smoking on the human airway epithelial cell transcriptome (Spira et al., 2004). It is well established that cigarette smoking is the leading cause of lung cancer and other serious pulmonary diseases such as chronic obstructive pulmonary disease (COPD). Spira and colleagues analyzed three groups of samples: current smokers, previous smokers and non-smokers. Their results indicated that cigarette smoking could lead to expression changes in genes involved in xenobiotic metabolism, antioxidation, inflammation, cell adhesion function and oncogenesis, which is consistent with previous studies using different model systems or experimental designs (Gebel et al., 2004; Hackett et al., 2003). The top 15 sub-networks identified by the MANOVA approach are listed in Table 2. Xenobiotic metabolism, cytoskeleton elements, amino acid and lipid metabolism, and protein kinase and phosphatase activities are found to differ significantly among study groups.
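To make the sub-network-level test concrete, the following is a minimal sketch of a one-way MANOVA applied to a single sub-network, with each probe treated as a dependent variable and the treatment group as the factor. It is an illustration only: the paper does not state which MANOVA statistic was used, so the choice of Wilks' lambda with Rao's F approximation is an assumption, as are the function name `wilks_lambda_manova`, the sample-size check and the simulated data.

```python
import numpy as np
from scipy import stats


def wilks_lambda_manova(X, groups):
    """One-way MANOVA for one sub-network (hypothetical helper).

    X      : (n_samples, n_probes) expression matrix for the sub-network.
    groups : length n_samples array of treatment labels.
    Returns (wilks_lambda, approx_F, p_value), using Rao's F approximation.
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    n, p = X.shape
    labels = np.unique(groups)
    g = len(labels)
    # Assumed feasibility check: the within-group scatter matrix is
    # singular unless the error degrees of freedom cover all probes.
    if n - g < p:
        raise ValueError("too few samples for a sub-network of this size")

    grand_mean = X.mean(axis=0)
    W = np.zeros((p, p))  # within-group sums of squares and cross-products
    B = np.zeros((p, p))  # between-group sums of squares and cross-products
    for lab in labels:
        Xg = X[groups == lab]
        mean_g = Xg.mean(axis=0)
        centered = Xg - mean_g
        W += centered.T @ centered
        diff = (mean_g - grand_mean)[:, None]
        B += Xg.shape[0] * (diff @ diff.T)

    # Wilks' lambda = det(W) / det(W + B), computed on the log scale.
    sign_w, logdet_w = np.linalg.slogdet(W)
    _, logdet_t = np.linalg.slogdet(W + B)
    if sign_w <= 0:
        raise ValueError("within-group scatter matrix is singular")
    lam = float(np.exp(logdet_w - logdet_t))

    # Rao's F approximation (exact when p <= 2 or g <= 3).
    q = g - 1
    denom = p ** 2 + q ** 2 - 5
    t = np.sqrt((p ** 2 * q ** 2 - 4) / denom) if denom > 0 else 1.0
    w = n - 1 - (p + q + 1) / 2
    df1 = p * q
    df2 = w * t - (p * q - 2) / 2
    root = lam ** (1.0 / t)
    F = (1 - root) / root * df2 / df1
    return lam, float(F), float(stats.f.sf(F, df1, df2))


# Simulated example: 30 samples, 3 groups, a 4-probe sub-network in which
# the third group is shifted upward in every probe.
rng = np.random.default_rng(0)
expr = rng.normal(size=(30, 4))
labels = np.repeat(["current", "former", "never"], 10)
expr[labels == "never"] += 3.0
lam, F, pval = wilks_lambda_manova(expr, labels)
print(f"Wilks lambda={lam:.3f}, F={F:.2f}, p={pval:.2e}")
```

Ranking sub-networks by the resulting P-value would give a list of the same form as the tables of top-scoring sub-networks. With a single probe the statistic reduces to the ordinary one-way ANOVA F test, which is a convenient sanity check.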
In addition, several immune system-related responses are also present in the list, presumably due to the inflammatory response caused by smoking. Again, the MANOVA approach was able to identify all the major effects caused by cigarette smoking.

Table 2. Top 15 scoring sub-networks for the cigarette smoking data

Cluster ID   MANOVA P-value   Significant GO annotation
1770         1.69E-12         Response to extracellular stimulus, cell ion homeostasis
93           4.17E-10         Epithelial cell differentiation, phospholipid binding
261          8.39E-10         Microtubule associated complex, GKAP/Homer scaffold activity
495          4.08E-09         Amino sugar metabolic process, N-acetylglucosamine 6-O-sulfotransferase activity
2062         1.42E-08         Protein phosphatase binding, protein kinase CK2 activity
1123         4.84E-08         Acute-phase response
1369         7.04E-08         Fibril organization and biogenesis, extracellular matrix structural constituent
1272         3.37E-07         Amino acid biosynthetic/metabolic process
198          3.91E-07         Nitrogen compound catabolic process, organic acid metabolic process
77           3.93E-07         Xenobiotic metabolic process
697          4.50E-07         Leukocyte activation, immune response
1967         5.54E-07         Phosphatidic acid metabolic process
568          5.96E-07         Steroid metabolic process, cholesterol homeostasis, xenobiotic metabolic process
434          1.13E-06         Myosin complex, actin cytoskeleton
182          2.31E-06         Ciliary or flagellar motility, microtubule-based process

Sub-networks are annotated with the top GO term hits in ‘Molecular Function’, ‘Biological Process’ and ‘Cellular Component’ ontologies.

To compare this approach to a typical analysis flow, i.e. single-gene analysis followed by gene set analysis, we analyzed the two datasets using the Ingenuity IPA tool (www.ingenuity.com), which uses the Fisher exact test to calculate gene set enrichment scores. For the HIV dataset, a list of 579 affected probes (10% FDR) identified through single-gene ANOVA was analyzed by IPA.
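For readers unfamiliar with enrichment scoring, the sketch below shows how a Fisher exact test can be applied to the overlap between a list of affected probes and a gene set. It is a generic illustration, not IPA's implementation: the function name `enrichment_p_value`, the one-sided alternative and the toy identifiers are all assumptions.

```python
from scipy.stats import fisher_exact


def enrichment_p_value(affected, gene_set, universe):
    """One-sided Fisher exact test for over-representation (hypothetical helper).

    affected : probes called significant by the single-gene analysis.
    gene_set : probes belonging to the pathway or sub-network of interest.
    universe : all probes that were tested.
    Returns (overlap_size, p_value).
    """
    universe = set(universe)
    affected = set(affected) & universe
    gene_set = set(gene_set) & universe

    in_both = len(affected & gene_set)
    affected_only = len(affected - gene_set)
    set_only = len(gene_set - affected)
    neither = len(universe) - in_both - affected_only - set_only

    # 2x2 contingency table: rows = affected / not affected,
    # columns = in gene set / not in gene set.
    table = [[in_both, affected_only], [set_only, neither]]
    _, p_value = fisher_exact(table, alternative="greater")
    return in_both, float(p_value)


# Toy example with made-up probe identifiers.
all_probes = [f"probe_{i}" for i in range(100)]
hits = all_probes[:10]      # 10 "affected" probes
pathway = all_probes[:8]    # an 8-probe gene set, fully contained in the hits
overlap, p = enrichment_p_value(hits, pathway, all_probes)
print(overlap, p)
```

Note that the test sees only membership in the thresholded list, not the expression values themselves, which is the limitation the comparison in this section turns on.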
While lipid and amino acid metabolism were picked up as the main affected pathways, ER stress response and gluconeogenesis were not identified (data not shown). For the cigarette smoking dataset, a list of 1190 affected probes (10% FDR) was analyzed. IPA identified xenobiotic metabolism, cytoskeleton structures and electron transport as the main affected targets but missed the inflammation response completely. The gain in sensitivity of our approach may be due to two factors. First, no ranked gene list with an artificial cutoff was used. Therefore, genes with modest changes can be considered together to reveal a significant group effect. Second, instead of an enrichment score that does not reflect the actual expression data, our approach attempts to find the best separation between treatments. Given that parametric methods are generally considered more sensitive than their non-parametric counterparts, we expected our method to generate results that better fit the biological hypothesis.

However, caution should be taken when applying MANOVA and the related Hotelling’s T-square models. MANOVA makes three assumptions. First, the dependent variables should be normally distributed. In general, the more samples in a project, the more likely this assumption is to be met. Second, the relationships among dependent variables should be linear. Since our calculation of co-expression is based on linear correlation, it is likely that genes of the same sub-network follow a linear relationship. However, deviations from a linear relationship will compromise the power of the test. Third, the variances and covariances of the dependent variables should be homogeneous across the ranges of the independent variables. Several tests may be applied to check the validity of this assumption (Tabachnick and Fidell, 1996). In addition to the aforementioned assumptions, MANOVA is also sensitive to outliers, which may lead to type I or type II errors.
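As one illustration of screening a sub-network for multivariate outliers before running MANOVA, the sketch below flags samples whose squared Mahalanobis distance from the sub-network mean exceeds a chi-square quantile. This is a common textbook screen rather than a procedure described in this paper; the function name `mahalanobis_outliers`, the cutoff `alpha=0.001` and the simulated data are assumptions.

```python
import numpy as np
from scipy.stats import chi2


def mahalanobis_outliers(X, alpha=0.001):
    """Flag multivariate outliers among samples (hypothetical helper).

    X     : (n_samples, n_probes) expression matrix for one sub-network.
    alpha : upper-tail probability used for the chi-square cutoff (assumed).
    Returns (squared_distances, boolean_flags).
    """
    X = np.asarray(X, dtype=float)
    n_probes = X.shape[1]
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # The pseudo-inverse tolerates near-collinear probes, which are
    # expected in a co-expression sub-network.
    cov_inv = np.linalg.pinv(cov)
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    cutoff = chi2.ppf(1 - alpha, df=n_probes)
    return d2, d2 > cutoff


# Simulated example: 50 samples, 3 probes, with one aberrant sample.
rng = np.random.default_rng(1)
expr = rng.normal(size=(50, 3))
expr[-1] = 10.0
distances, flags = mahalanobis_outliers(expr)
print(np.flatnonzero(flags))
```

Because the mean and covariance are estimated from data that include the outliers themselves, extreme samples partly mask their own distance; robust estimators are an alternative when this matters.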
There are tests available to check for univariate and multivariate outliers. Finally, MANOVA can only be employed when the total number of samples is larger than the size of the gene set being tested. This is because one degree of freedom is lost for each dependent variable added (Bray and Maxwell, 1985). In other words, large gene sets cannot be tested by MANOVA or Hotelling’s T-square for projects involving small numbers of samples. In fact, this is one of the criteria we used to pick the partition number k, so that the sizes of the resulting sub-networks are feasible for MANOVA-based hypothesis testing for most projects, which typically have 10 samples or more.

4 CONCLUSION

In this article, we showed how reciprocal rankings and FDR analysis can be applied to integrate signal from independently conducted microarray experiments into a meta-network, which we named the RGA network. We used three different algorithms to identify densely connected sub-networks from the network. Examination of the sub-networks indicated that they may represent various units of biological processes. The homogeneity (gene function) of these sub-networks fits well with the assumptions of MANOVA. We developed a MANOVA-based approach that allows expression profiling data to be interpreted at the gene set level. We evaluated our approach by applying it to two published studies. Although in this article we limited our analysis to co-expression gene sub-networks, we strongly believe that our MANOVA model can be extended to other groups of related genes (e.g. a gene list containing genes mapped to the same chromosomal location). We expect this approach to serve as a novel method for analyzing transcriptional and proteomic profiling data at the gene set level.

ACKNOWLEDGEMENTS

The authors of this paper would like to thank Curtis Huttenhower and Hilary Coller for valuable discussion and review of the manuscript. Authors D.U. and S.P. were
supported by the DOE Early Career Principal Investigator Award No. DE-FG02-04ER25611 and NSF CAREER Grant IIS-0347662.

Conflict of Interest: none declared.

REFERENCES

Allocco,D.J. et al. (2004) Quantifying the relationship between co-expression, co-regulation and gene function. BMC Bioinformatics, 5, 18.
Ashburner,M. and Lewis,S.E. (2002) On ontologies for biologists: the Gene Ontology – uncoupling the web. Novartis Found. Symp., 247, 66–80.
Bader,G.D. and Hogue,C.W. (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4, 2.
Bergman,S. et al. (2004) Similarities and difference in genome-wide expression data of six organisms. PLoS Biol., 2, 85–93.
Bolstad,B.M. et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on bias and variance. Bioinformatics, 19, 185–193.
Boyle,E.I. et al. (2004) GO::TermFinder–open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics, 20, 3710–3715.
Bray,J.H. and Maxwell,S.E. (1985) Multivariate analysis of variance. Quantitative Applications in the Social Sciences Series. No. 54. Sage Publications, Thousand Oaks, CA.
Brun,C. et al. (2004) Clustering proteins from interaction networks for the prediction of cellular functions. BMC Bioinformatics, 5, 95.
Calza,L. et al. (2004) Dyslipidaemia associated with antiretroviral therapy in HIV-infected patients. J. Antimicrob. Chemother., 53, 10–14.
Dhillon,I. et al. (2005) A fast kernel-based multilevel algorithm for graph clustering. In Proceedings of the 11th ACM SIGKDD, Chicago, IL, August 21–24, pp. 629–634.
Eisen,M.B. et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868.
Gebel,S. et al. (2004) Gene expression profiling in respiratory tissues from rats exposed to mainstream cigarette smoke. Carcinogenesis, 25, 169–178.
Goldenberg,A. and Moore,A. (2004) Tractable learning of large Bayes net structures from sparse data. In Proceedings of the 21st ICML.
Golub,T.R. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Hackett,N.R. et al. (2003) Variability of antioxidant-related gene expression in the airway epithelium of cigarette smokers. Am. J. Respir. Cell Mol. Biol., 29, 331–343.
Jordan,I.K. et al. (2004) Conservation and coevolution in the scale-free human gene coexpression network. Mol. Biol. Evol., 21, 2058–2070.
Kim,S.Y. and Volsky,D.J. (2005) PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics, 6, 144.
King,A.D. et al. (2004) Protein complex prediction via cost-based clustering. Bioinformatics, 20, 3013–3020.
Lee,H.K. et al. (2004) Coexpression analysis of human genes across many microarray data sets. Genome Res., 14, 1085–1094.
Lu,Y. et al. (2005) Hotelling’s T2 multivariate profiling for detecting differential expression in microarrays. Bioinformatics, 21, 3105–3113.
Mao,L. and Resat,H. (2004) Probabilistic representation of gene regulatory networks. Bioinformatics, 20, 2258–2269.
Mehra,S. et al. (2004) A Boolean algorithm for reconstructing the structure of regulatory networks. Metabolic Eng., 6, 326–339.
Quackenbush,J. (2001) Computational analysis of microarray data. Nat. Rev. Genet., 2, 418–427.
Quackenbush,J. (2006) Microarray analysis and tumor classification. N. Engl. J. Med., 354, 2463–2472.
Parker,R.A. et al. (2005) Endoplasmic reticulum stress links dyslipidemia to inhibition of proteasome activity and glucose transport by HIV protease inhibitors. Mol. Pharmacol., 67, 1909–1919.
Pena,J.M. et al. (2005) Growing Bayesian network models of gene networks from seed genes. Bioinformatics, 21, 224–229.
Rice,J.J. et al. (2005) Reconstructing biological networks using conditional correlation analysis. Bioinformatics, 21, 765–773.
Spira,A. et al. (2004) Effects of cigarette smoke on the human airway epithelial cell transcriptome. Proc. Natl Acad. Sci. USA, 101, 10143–10148.
Subramanian,A. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA, 102, 15545–15550.
Stuart,J. et al. (2003) A gene coexpression network for global discovery of conserved genetic modules. Sci. Express, 302, 249–255.
Tabachnick,B.G. and Fidell,L.S. (1996) Using Multivariate Statistics, Harper Collins College Publishers, New York.
Tamayo,P. et al. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 2907–2912.
van Noort,V. et al. (2004) The yeast co-expression network has a small-world, scale-free architecture and can be explained by a simple model. EMBO Rep., 5, 280–284.
Vinciotti,V. et al. (2006) Exploiting the full power of temporal gene expression profiling through a new statistical test: application to the analysis of muscular dystrophy data. BMC Bioinformatics, 7, 183–194.
Xiong,H. (2006) Non-linear tests for identifying differentially expressed genes or genetic networks. Bioinformatics, 22, 919–923.