Construction of a reference gene association network from multiple

BIOINFORMATICS ORIGINAL PAPER
Vol. 23 no. 20 2007, pages 2716–2724
doi:10.1093/bioinformatics/btm423
Gene expression
Construction of a reference gene association network from
multiple profiling data: application to data analysis
Duygu Ucar1, Isaac Neuhaus2, Petra Ross-MacDonald2, Charles Tilford2,
Srinivasan Parthasarathy1, Nathan Siemers2,* and Rui-Ru Ji2,*
1
Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 and 2Applied Genomics,
Bristol-Myers Squibb Research and Development, Pennington, NJ 08534, USA
Received on January 31, 2007; revised on July 10, 2007; accepted on August 6, 2007
Advance Access publication September 10, 2007
Associate Editor: John Quackenbush
ABSTRACT
Motivation: Gene expression profiling is an important tool for
gaining insight into biology. Novel strategies are required to analyze
the growing archives of microarray data and extract useful
information from them. One area of interest is in the construction
of gene association networks from collections of profiling data.
Various approaches have been proposed to construct gene networks using profiling data, and these networks have been used in
functional inference as well as in data visualization. Here, we
investigated a non-parametric approach to translate profiling data
into a gene network. We explored the characteristics and utility of
the resulting network and investigated the use of network information in analysis of variance models and hypothesis testing.
Results: Our work is composed of two parts: gene network
construction and partitioning and hypothesis testing using subnetworks as groups. In the first part, multiple independently
collected microarray datasets from the Gene Expression Omnibus
data repository were analyzed to identify probe pairs that are
positively co-regulated across the samples. A co-expression network was constructed based on a reciprocal ranking criteria and
a false discovery rate analysis. We named this network Reference
Gene Association (RGA) network. Then, the network was partitioned
into densely connected sub-networks of probes using a multilevel
graph partitioning algorithm. In the second part, we proposed a new,
MANOVA-based approach that can take individual probe expression
values as input and perform hypothesis testing at the sub-network
level. We applied this MANOVA methodology to two published
studies and our analysis indicated that the methodology is both
effective and sensitive for identifying transcriptional sub-networks or
pathways that are perturbed across treatments.
Contact: [email protected] or [email protected]
1
INTRODUCTION
The technique of transcriptional profiling, as well as related
proteomic and metabonomic methodologies, is producing
biological state information in a breadth and depth that
*To whom correspondence should be addressed.
remains challenging to interpret. A diverse arsenal of analytical
tools are typically applied to analyze such data (Quackenbush,
2001). A variety of global, unsupervised learning methods such
as principle component analysis (PCA) and various clustering
algorithms are often used to initially characterize a dataset
(Eisen et al., 1998; Tamayo et al., 1999). The discovery of
significantly regulated transcripts through hypothesis testing,
including the use of t-tests or more complex ANOVA models, is
often an important goal. Linear or non-linear models can also
be built to predict biological response (Golub et al., 1999;
Quackenbush, 2006). Classification schemes such as the Gene
Ontology (Ashburner and Lewis, 2002) are often utilized to
macroscopically classify and conceptualize the microscopic
behavior of probes, and in some cases provide superior
sensitivity and interpretability
Recently, there has been a growing interest in producing gene
networks from profiling data (Bergman et al., 2004; van Noort
et al., 2004; Lee et al., 2004). In these representations, genes
constitute the nodes of the network and the edges between
nodes indicate gene associations (e.g. co-expression). As the
application of graph methods and network inference algorithms
to these data becomes more advanced, there may be some
advantages to these approaches (Mao and Resat, 2004; Mehra
et al., 2004; Rice et al., 2005; Pena et al., 2005).
In this study, we explored the use of a non-parametric, rankbased methodology to transform profiling data into a gene
network. Our methodology exploits principles of reciprocal
best match pairs, often used in the area of gene sequence
analysis, now applied to transcriptional profiling data.
In addition, we applied a False Discovery Rate (FDR) analysis
step to characterize and control the signal to noise ratio in the
network. Graph partitioning and clustering algorithms were
then utilized to extract functionally and topologically coherent
sub-networks. The effectiveness of these algorithms was
investigated based on domain knowledge from curated
biological pathway databases. We also investigated the utility
of a gene network (built via analysis of a compendium of
human data available in the NCBI Gene Expression Omnibus)
in enhancing the analysis and interpretation of individual
profiling experiments. Specifically, we proposed to use the subnetworks extracted from the association network, presumably
ß 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Construction of a reference gene association network
GEO
RGA network
Correlation and
Rank Analysis
FDR Analysis
Sorted Probe Pairs
Pathways
RGA Network Construction
Clustering
Algorithms
Sub-networks
…
…
GO
Labeled
Sub-networks
Ion binding
Apoptosis
…
Angiogenesis
Annotation
Analysis
…
RGA Network Partitioning
MANOVA Tests
Dataset D
Do genes in Ion Binding subnetwork are differentially
expressed among groups of D
(i.e., different treatments) ?
…
…
MANOVA based Hypothesis Testing
Fig. 1. Schematic view of the proposed methodologies. In the first step, RGA network construction, transcriptional profiling projects downloaded
from the Gene Expression Omnibus (GEO) were pooled and Pearson correlation coefficients were calculated for every probe pairs across all samples.
Probe pairs were then sorted based on a reliability score calculated from reciprocal co-expression ranks. FDR analysis based on known pathway
annotations was employed to find a suitable cutoff for inclusion of probe pairs in the resulting network. In the second step, RGA Network was
partitioned to identify functional modules within the network. The sub-networks were annotated with their significantly enriched GO terms. In the
last step, a new, MANOVA-based model was proposed for hypothesis testing at the sub-network level.
representing biological functional units, as groups in set-based
enrichment analyses.
The goal of gene set analysis is to determine whether genes
of a defined set are differentially affected by a treatment.
The major advantage of gene set analysis over single gene
analysis is that the formal provides a unifying theme for data
interpretation, often shedding light on the underlying biological
mechanisms. To this date several gene set analysis approaches
have been proposed. Some of them, such as GO Term Finder
(Boyle et al., 2004) and Gene Set Enrichment Analysis
(Subramanian et al., 2005), require a prior generation of
a rank-ordered gene list. Another test, parametric analysis
of gene set enrichment (PAGE), does not require a sorted
gene list but need genes’ fold changes or P-values from
hypothesis testing as inputs (Kim and Volsky, 2005).
Recently, Hotelling’s T-square test has been proposed as
a multivariate approach for network/pathway significance
analysis (Lu et al., 2005). This method takes multiple dependent
variables (i.e. genes from a particular set) and tests if there is
a significant difference between two groups (e.g. treated versus
control). Several studies (Lu et al., 2005; Vinciotti et al., 2006;
Xiong, 2006) have successfully utilized this approach to
detect differentially regulated genes or gene groups. There are
three potential advantages to this type of approaches. First,
no ordered gene list needs to be generated. Second, it is
more robust to type I error (false positives). Third, the
set significance test may be more sensitive as it is computed
directly from expression data instead of using some kind of
enrichment score. Nevertheless, the application of Hotelling’s
T-square test is limited as it can only deal with two treatment
groups.
Here, we proposed and investigated the utility of a new,
MANOVA-based statistical model that allows hypothesis
testing across sub-networks extracted from the reference
network. This methodology overcomes the limitation of
Hotelling’s T-square test and can be applied to a wider range
of experimental designs.
2
METHODS
The overview of the proposed methodology is illustrated in Figure 1.
In the first step, a reference gene association (RGA) network was constructed from publicly available microarray profiling data. The network
was partitioned in step two to extract co-expression sub-networks.
These sub-networks were subsequently fed into a MANOVA-based
procedure for hypothesis testing in individual profiling experiments.
Throughout this article, gene and probe are used interchangeably.
2.1
Data collection and pre-processing
We collected experiments within NCBI Gene Expression Omnibus
(GEO) that were studied with the Affymetrix HG-U133A chip
(Affymetrix Inc., Santa Clara, CA, USA). The starting data consisted
of 525 human tissue and cell line samples from 26 independently
conducted projects. We excluded projects that include primary
tumor samples as we wanted to minimize the possible influence of
gross cytogenetic abnormalities, amplifications and deletions on
network construction. The resulting final dataset consisted of 393
human tissue and cell line samples from 21 different projects.
The Affymetrix CEL files were analyzed with the MAS 5.0 algorithm
2717
D.Ucar et al.
from Affymetrix and standardized using quantile normalization
(Bolstad et al., 2003), making use of the publicly available
Bioconductor tools (www.bioconductor.org).
2.2
RGA network construction
In order to identify probes that are co-expressed across multiple
projects, we first calculated Pearson correlation coefficients between
every probe pair across all samples from the selected microarray
projects. Formally, the Pearson correlation coefficient between probes
pA and pB is defined as:
Pn
i
i
i ¼ 1 ðA AÞðB BÞ
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ffi
rAB ¼ qP
ð1Þ
P
n
n
2
2
i
i
i ¼ 1 ðA AÞ
i ¼ 1 ðB BÞ
where Ai and Bi are the expression levels of pA and pB in the i th sample
and A and B are the average expression levels of pA and pB, respectively.
In this work, we based our analysis only on positive correlations as it
guarantees to include the strongest correlations among gene pairs.
In prior work, Goldenberg et al. (Goldenberg and Moore, 2004),
demonstrated that positive correlations are much more significant than
negative correlations, in sparsely connected networks. The implicit
importance of positive correlation over the negative ones is also
suggested in previous research in the context of microarray studies
(Allocco et al., 2004). In our context, the goal is to derive a network that
is relatively sparse, so in this sense by considering only the positive
correlations we ensure that the strongest or most significant correlations among gene pairs are captured by our model.
A rank ordered list of co-expressed neighbors was enumerated for
each probe with rank 1 being the most correlated. Note that, although
the Pearson correlation scores are symmetric, rank order of probe A with
respect to probe B (say, ROAB), does not necessarily equal to rank of probe
B with respect to probe A (say, ROBA). Thus, two reciprocal rank values
are associated with each probe pair (i.e. ROAB and ROBA). Accordingly,
for any probe pair, pA and pB, we define its reliability score as:
RSAB ¼
1
ROAB ROBA
ð2Þ
Thus, probe pairs with smaller reciprocal rank orders have higher
reliability scores. We systematically computed reliability scores for
every probe pair based on the reciprocal ranks. Pairs were sorted with
respect to their scores starting from the most reliable one.
Essentially the network construction step can be viewed as gradual
additions of increasingly unreliable pairs. One can imagine that, with
the increasing number of probe pairs in a network, the noise in the
network is also increasing. To find a proper cutoff point, we adopted an
FDR analysis step to quantify the signal to noise ratio in a given
network. Two probes linked in the network can either be functionally
relevant true positive (TP) or irrelevant false positive (FP). To quantify
the relevance of probe pairs, we made use of curated pathway
annotations from the KEGG, BioCarta and GenMAPP databases.
Accordingly, a link between two probes that share pathway annotation
in any of the databases is labeled as TP. Conversely, a link between two
probes that are annotated with two non-overlapping sets of pathways is
labeled as FP. Probe pairs where at least one probe is not annotated are
defined as non-discriminatory (ND) as current knowledge is not
sufficient to conclude whether they are functionally relevant. Such pairs
are excluded from the FDR calculation.
Effect of the FDR cutoff on the resulting network and criteria taken
into consideration while setting this parameter are discussed in Section
3.1. In order to include pairs labeled as ND, a reliability score threshold
corresponding to the chosen FDR cutoff was obtained and all probe
pairs that have higher reliability scores than this threshold were
included in the final reference network.
2718
To check the significance of the network, we generated random
datasets using the random number generator tool from the Partek
software package (www.partek.com). Random datasets of normal,
gamma and uniform distributions, and of overall dimensions comparable to the experimental data were generated. These datasets were
processed the same way as the real dataset and the sorted probe pair
lists were analyzed and compared to that of RGA.
2.3
RGA network partitioning
Many published graph partitioning algorithms can be used to partition
the co-expression network into densely- connected modules. Clustering
algorithms can also be employed utilizing the reliability scores of probe
pairs as the similarity values. To find an optimal algorithm, two graph
partitioning algorithms, Graclus (Dhillon et al., 2005) and Metis,1 as
well as a clustering algorithm, Agglomerative Hierarchical Clustering
were employed. We utilized a fast implementation of agglomerative
clustering from CLUTO, clustering toolkit.2
We utilized the GO Term Finder tool (Boyle et al., 2004) to
evaluate whether the extracted sub-networks were enriched in genes of
similar functions and/or involved in similar biological processes.
This analysis served two purposes. First, each sub-network was
annotated with the best GO terms that are significantly enriched.
Second, the optimal algorithm was selected based on the number
of annotated sub-networks and the significance of the annotations
(i.e. P-values).
2.4
MANOVA-based sub-network significance testing
Sub-networks extracted from the RGA network usually consist of
co-regulated probes involved in a common biological function or
process. We adopted a MANOVA-based approach to assess transcriptional response at the sub-network level.
MANOVA is a statistical method that can be used to measure
treatment effects across multiple dependent variables simultaneously.
In our case, we treated each probe of a sub-network as a response
variable. By combining all probes from the same sub-network, we can
access the response of that sub-network as a whole to some experiment
factor.
For experiments where there is only one independent variable
(e.g. treatment), our MANOVA model is formulated as follows:
p1 þ p2 þ þ pi þ ¼ t
ð3Þ
where pi is the i th probe in the sub-network and t is the treatment
factor. Note that each probe is treated as a dependent variable. Using
this model, we assessed the following hypotheses for each of the subnetworks.
Hnull ¼ Sub-network is not affected by different treatments.
Ha ¼ Sub-network is significantly affected by different treatments.
Note that the model may be expanded to accommodate more
complex experimental designs and interactions between factors.
To validate the effectiveness of the approach, we applied it to two
published studies on HIV protease inhibitors and cigarette smoking.
Significantly affected sub-networks were analyzed and compared to
known biological effects of the drugs.
1
http://glaros.dtc.umn.edu/gkhome/views/metis
http://www-users.cs.umn.edu/karypis/cluto/index.html
2
Construction of a reference gene association network
1.0
0.6
0.4
False Positive Rate
0.2
0.0
Real
Gamma
Normal
Uniform
5000
10000
15000
Number of Edges
20000
0.6
0.8
1.0
Fig. 2. Probe pairs calculated from the real expression data have
significantly lower FDRs than those of random datasets. As described
in Section 2, a sorted probe pair list was generated from each dataset
and every probe pair was then given a label of TP, FP or ND based on
the pathway annotations. False positive rates were calculated for the
total probe pair numbers indicated on the graph. The x-axis indicates
the total number of probe pairs considered starting from the top of the
sorted probe pair list and the y-axis indicates the corresponding FDR
for that set of probe pairs.
0.4
Microarray datasets generated from the Affymetrix
HG-U133A platform were analyzed and a reference gene association network was constructed as described in Section 2.
Primary tumor samples were excluded from the original data
to minimize possible influence from cytogenetic abnormalities.
For example, we often observed probe–probe correlations between probes mapped to same chromosomal
regions commonly amplified or deleted in cancer if primary
tumor samples were included in the sample set (data not
shown).
We employed a rank-based approach that avoids any
assumption of data distribution since correlation coefficients
calculated from profiling data are not of the well known normal
or t-distribution (Allocco et al., 2004). We used mutual rank
orders to quantify the strength of co-expression between probe
pairs. Every probe pair was assigned a reliability score
calculated from the two reciprocal correlation ranks between
the pair. These scores were used to sort the probes pairs starting
from the most reliable ones. Networks were constructed by
incrementally adding probe pair to the networks starting from
the top of the sorted list.
To estimate the signal-to-noise ratio in a network, FDR
analysis was adopted. As described in Section 2, each probe
pair was given a label of TP, FP, or ND (non-discriminatory)
based on the pathway annotations from the KEGG, BioCarta
and GenMAPP databases. As depicted in Figure 2, FDR of the
network increases with increasing number of probe pairs,
indicating that reliability scores correlate well with the
biological relevance of probe pairs.
We chose a relatively moderate 20% FDR cutoff for a couple
of reasons. First, it yielded a reasonable coverage of probes in
the network. Second, we must take into account the fact that
the current pathway annotation is not complete due to our
limited understanding of biology, and the probe pairs falling
into the false discovery class are in fact a mixture of false
positives and true positives missing in the databases. We expect
that the FDR threshold could be made stricter as pathway
annotation becomes more complete. Using the 20% FDR
cutoff, a network composed of 10 314 probes and 9675 edges
was generated.
To further assess the significance of probe–probe association
in the RGA network, random datasets of normal, uniform and
gamma distributions, were simulated as described in Section 2.
For each of the datasets, Pearson correlations were computed
for probe pairs and FDR analysis was performed. As shown in
Figure 2, networks constructed from random datasets have
much higher FDR than RGA at any given numbers of
probe pairs.
Another metric that quantifies functional relevance of probe
pairs as their GO term overlap was also utilized to compare the
networks (Lee et al., 2004). In this analysis, the number of
shared GO annotation including parent terms was calculated
for each probe pair. For better comparison, we used the same
number of probe pairs from each of the random networks as
in the real network (i.e. 9675 pairs). As depicted in Figure 3,
0.8
RGA network construction and significance
verification
Cumulative Probability
3.1
Network Size vs. FPR
RESULTS AND DISCUSSION
Real
Gamma
Normal
Uniform
0.2
3
0
50
100
150
200
Number of overlapping GO Terms
Fig. 3. Probe pairs calculated from the real expression data show
greater functional relevance. The functional relevance of the top 9675
probe pairs, which correspond to 20% FDR in the real network, from
each of sorted probe pair lists, were evaluated by examining the overlap
of probe pairs’ Gene Ontology (GO) annotations. The x-axis indicates
the number of overlapping GO terms and the y-axis is the cumulative
probability. The x-axis is truncated at 220.
the real network showed greater functional relevance than any
of the random networks. In summary, these results demonstrated that our network construction procedure is effective in
selecting biologically related probe pairs.
2719
D.Ucar et al.
Boxplots of log (Cluster Size)
Hierarchical
Graclus
6
4
2
Log (cluster size)
8
Metis
Fig. 4. Histogram of the best P-values of GO match for sub-networks.
Sub-networks were annotated using the GO Term Finder tool and were
assigned the P-value of the best GO term match. The distributions of
the P-values were analyzed to select a partitioning algorithm that best
captures the functional modules in the RGA network.
3.2
Partitioning of RGA network
To further exploit the co-regulatory networks, it is desirable to
partition the network into smaller pieces, with the overall intent
of identifying functional biological modules within the larger
network. Many published clustering algorithms can identify
densely connected sub-networks in a network (Bader and
Hoque, 2003; Brun et al., 2004; King et al., 2004). Given that
each probe pair can be assigned a weight based on the reliability
score, we selected three algorithms that can make use of the
weights. The algorithms include two graph partitioning
algorithms (Graclus and Metis) and one clustering algorithm
(hierarchical clustering with average linkage).
We considered two criteria when picking the number (k) of
partitions. First, we would like to obtain sub-networks that are
comparable in size to actual biological pathways. Second, the
sub-network size should be feasible for use in the next section,
MANOVA-based data analysis. After a few trials, we picked
a partition number of 500, which would result in an average
number of 20 nodes per partition if nodes are evenly distributed
among partitions. After partitioning, we identified connected
sub-networks to be evaluated further.
If the partitioning algorithms could identify functional
modules in the network, we would expect that probes of similar
functions (thus likely to be co-expressed) would be partitioned
into the same sub-network. The GO Term Finder tool was
utilized to determine whether sub-networks are enriched in
probes with similar GO term annotations and to compare the
suitability of the partitioning and clustering algorithms.
A plot of the best GO P-values for each of the sub-networks is
shown in Figure 4. As can be seen from this figure, the Graclus
algorithm outperformed the other two algorithms, in terms of
both quality and quantity of significantly annotated subnetworks. The hierarchical clustering method yielded the worst
performance, likely due to the unbalanced sub-network sizes.
The box plots of the connected sub-network sizes when k was
set to 500 are shown in Figure 5. It can be observed that the
graph partitioning algorithms yielded sub-networks that are
more balanced in size. By contrast, the hierarchical clustering
2720
Graclus
Metis
Algorithms
Hierarchical
Fig. 5. Box-plots of sub-network sizes generated by selected network
partitioning algorithms with k ¼ 500. The algorithms are shown on the
x-axis, and the log(sub-network size) values are indicated on the y-axis.
algorithm tends to produce a few big sub-networks and many
singletons.
Accordingly, the final partitioning was obtained using the
Graclus algorithm and a k value of 500. Among these 500
partitions, we identified 2112 connected sub-networks. The
average size of sub-networks is 5 probes per sub-network and
the largest two sub-networks have 38 probes each.
Inspection of these sub-networks revealed many well-known
areas of biological regulation. These areas include steroid
biosynthesis, neuronal development, antigen processing and
presentation, cell polarity maintenance, cytoskeletal components, the cell cycle, RNA splicing machinery and many others.
However, it is also worthwhile to recognize the limitations of
the underlying experimental data and their effects on network
structure when interpreting the network and its partitions. One
critical feature of the RNA profiling data generated from the
U-133A array is that although (however many thousand)
probes are independently measured within a single sample,
significantly fewer independent signals can realistically be
resolved from these data. Probe design has led to the
intentional (and sometimes unintentional) construction of
many redundant probes for RNA transcripts. We estimate
that 2358 transcripts represented on the U-133A array have
more than one probe set measuring them (data not shown).
Although not technically an artifact, networks and network
partitions containing these redundant probes have fewer
independent genetic components than would be implied by
the network connectivity.
Lack of specificity of probe hybridization characteristics
(cross-hybridization) is another limitation of Affymetrix
technology. However, the impact of cross-hybridization on
network structure seems to be small. The U-133A array design
assists in the identification and interpretation of crosshybridization artifacts by attaching ‘_s’ and ‘_x’ suffixes in
Construction of a reference gene association network
Fig. 6. Example sub-networks. (a) The neuronal development and signaling cluster: 213841_at, 213484_at, 213197_at (ASTN1), 209726_at (CA11),
204995_at (CDK5R1), 205143_at (CSPG3), 39966_at (CSPG5), 205373_at (CTNNA2), 219619_at (DIRAS2), 215116_s_at (DNM1), 207010_at
(GABRB1), 209469_at (GPM6A), 209470_s_at (GPM6A), 205358_at (GRIA2), 218623_at (HMP19), 219671_at (HPCAL4),
214680_at (NTRK2), 207093_s_at (OMG), 213849_s_at (PPP2R2B), 204974_at (RAB3A), 206196_s_at (RPIP8), 203485_at (RTN1),
203724_s_at (RUFY3), 219196_at (SCG3), 205152_at (SLC6A1), 202508_s_at (SNAP25), 205104_at (SNPH), 205551_at (SV2B), 212664_at
(TUBB4). (b) The ribosomal protein cluster: 211942_x_at (RPL13A), 200022_at (RPL18), 200869_at (RPL18A), 200013_at (RPL24), 200003_s_at
(RPL28), 200823_x_at (RPL29), 213969_x_at (RPL29), 200963_x_at (RPL31), 200002_at (RPL35), 219762_s_at (RPL36), 201406_at (RPL36A),
200936_at (RPL8), 214167_s_at (RPLP0), 200018_at (RPS13), 208645_s_at (RPS14), 211487_x_at (RPS17), 212433_x_at (RPS2), 200024_at (RPS5),
200082_s_at (RPS7), 217747_s_at (RPS9), 200651_at (GNB2L1), 217831_s_at (NSFL1C), 201600_at (PHB2), 201052_s_at (PSMF1).
the nomenclature of probes to indicate higher risk of crosshybridization across isoforms and gene families for these
probes. We found that the RGA network is only modestly
enriched in probes capable of cross-hybridization—53% versus
47% on the U-133A array. This result suggests that, for the
majority of the probes, the cross-hybridization impact at the
probe set level is limited.
The impact of the above probe design considerations may
affect the network construction and our assessment of network
robustness in various ways. In the cases where probes within
a given network partition are non-redundant, we can somewhat
safely ascribe the network to the biology we are trying
to discover. The highly co-regulated elements of neuronal
development and signaling fall into this category. Probes in
this sub-network represent genes include ASTN1, CA11,
CDK5R1, CSPG3, CSPG5, CTNNA2, DIRAS2, DNM1,
GABRB1, GPM6A, GRIA2, HMP19, HPCAL4, NTRK2,
OMG, PPP2R2B, RAB3A, RPIP8, RTN1, RUFY3, SCG3,
SLC6A1, SNAP25, SNPH, SV2B and TUBB4 (Fig. 6a). These
genes share only minor sequence homology. Two unannotated
probes are also present in the sub-network: 213484_at and
213841_at. Interestingly, both probes represent transcripts that
are highly expressed in brain tissues (Unigene, http://
www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene).
In other cases, both RNA sequence and functional biological
conservation are contained within probes in a sub-network.
Such a sub-network is fundamentally confounded, and even
when we can give the sub-network a high Gene Ontology
overlap significance, we lack the resolution to identify whether
or not the score is due to our capture of biology or the fact
that we are measuring many fewer independent signals within
the sub-network than the number of probes would imply.
The high degree of both sequence and functional similarity
in the ribosome proteins (RPL13A, RPL18, RPL18A, RPL24,
RPL28, RPL29, RPL31, RPL35, RPL36, RPL36A, RPL8,
RPLP0, RPS13, RPS14, RPS17, RPS2, RPS5, RPS7, RPS9) is
one example of this behavior, and is well represented within
a sub-graph of the network (Fig. 6b).
Finally, although details of biological specimen preparation
not intended in study design (in vitro cell confluency, culture
temperature variation, etc.) are still valid biological effects,
other technical artifacts of sample processing, hybridization
and data acquisition can interfere with the robustness of
the network and network partitions. Some of these technical
effects can be readily identified in the networks. For instance,
a collection of control probes including BioB, BioC, BioD
and Cre control oligonucleotides are spiked into the
cRNA mixtures appear in the resultant network, connected
only to each other. Similarly, 18S and 28S ribosomal
RNA contaminants carried through the probe preparation
process also appear as an isolated sub-network of the total
network.
3.3
MANOVA-based sub-network significance testing
Gene networks have been used in the area of gene function
inference and gene–gene interaction/regulation studies (Jordan
et al., 2004; Stuart et al., 2003). Allocco et al. have demonstrated that two genes are more likely to share a common
transcriptional regulator if they are strongly co-expressed
(Allocco et al., 2004). Since our sub-networks are composed
of co-expressed genes, these sub-networks may be treated as
a special transcriptional ontology where each sub-network is
specifically under the control of one or a combination of
2721
D.Ucar et al.
Table 1. Top 15 scoring sub-networks for the HIV data
Cluster ID
MANOVA P-value
Significant GO annotation
1186
495
1197
812
164
1520
547
1302
1272
1862
568
1109
1707
1832
291
1.67E09
4.08E09
5.26E08
5.72E08
7.92E08
8.10E08
8.95E08
2.82E07
3.51E07
5.67E07
5.96E07
1.14E06
2.61E06
2.98E06
3.64E06
Regulation of glutamate-cysteine ligase activity
Aminosugar metabolic process, N-acetylglucosamine 6-O-sulfotransferase activity
Cellular respiration, TCA cycle, mitochondrion
No significant annotation
Aromatic amino acid transport, antiporter activity
6-phosphofructokinase complex
Glycolysis, energy derivation by oxidation of organic compounds
Glucose transport, integral to membrane
Amino acid biosynthetic/metabolic process
Amine metabolic process
Steroid metabolic process, cholesterol homeostasis, xenobitic metabolic process
Kinase activity
Protein targeting to ER, endoplasmic reticulum
Cell growth, phosphatase activity
Response to chemical stimulus, response to drug
Sub-networks are annotated with the top GO term hits in ‘Molecular Function’, ‘Biological Process’ and ‘Cellular Component’ ontologies.
transcription factors. For any individual microarray study, if
the significantly affected sub-networks can be identified
analytically, it will greatly improve our understanding of the
underlying biological mechanism.
Naturally, sub-networks can be treated as gene sets that can
be fed into gene set analysis tools. To this date several gene set
significance testing approaches have been developed including
GO Term Finder, GSEA, PAGE and Hotelling’s T-square test
(Boyle et al., 2004; Kim and Volsky, 2005; Lu et al., 2005;
Subramanian et al., 2005). Here, we explored the use of a new,
MANOVA (multivariate analysis of variance) based approach
for network/pathway analysis.
MANOVA is an extension of ANOVA (analysis of variance)
to cover cases where there is more than one dependent
variables. It is often used when a set of correlated dependent
variables are present whereas a single, overall statistical test is
desired. Hotelling’s T-square test is a special case of MANOVA
to deal with data of two groups. MANOVA enjoys the same
advantages as Hotelling’s T-square test (see Introduction
section), but it can be applied to a wider range of experimental
designs since MANOVA can deal with data of any group
number. Naturally, we can treat each probe as a dependent
variable, and then derive a single statistic based on MANOVA
for the sub-network as a unit. This approach is well-suited for
our purposes because the sub-networks are consisted of coexpressed probes and MANOVA are especially designed to deal
with correlated dependent variables.
We tested our MANOVA model using two published
datasets. The first one is the study of HIV protease inhibitors
including antazanavir, nelfinavir and ritonavir (Parker et al.,
2005). It is well known that patients taking protease inhibitor
drugs to treat HIV-AIDS often develop a lipodystrophy-like
syndrome such as hyperlipidermia, peripheral lipoatrophy
and central fat accumulation (Calza et al., 2004). Parker
et al. (Parker et al., 2005), used Affymetrix chip technology
to survey gene expression changes after drug treatment.
Their analysis indicated that protease inhibitors could induce
2722
gene expression changes indicative of dysregulation of lipid
metabolism, endoplasmic reticulum (ER) stress and metabolic
disturbance. This result is consistent with the clinical observations and provides evidence for a molecular mechanism for the
pathophysiology of protease inhibitor-induced lipodystrophy.
The top 15 scoring sub-networks with respect to treatment
effect are listed in Table 1. The model P-values give a relative
measurement of the significance of the treatment. One can
immediately note that this list includes all the major targets of
the HIV protease inhibitors including lipid metabolism, amino
acid metabolism, gluconeogenesis, cellular respiration and
endoplasmic reticulum.
The second dataset is from the study on the effects of
cigarette smoking on human airway epithelial cell transcriptome (Spira et al., 2004). It is well established that cigarette
smoking is the leading cause of lung cancer and other serious
pulmonary diseases such as chronic obstructive pulmonary
disease (COPD). Spira and colleagues analyzed three groups
of samples from current smokers, previous smokers and
non-smokers. Their results indicated that cigarette smoking
could lead to expression changes in genes involved in
xenobiotic metabolism, antioxidation, inflammation, cell
adhesion function and oncogenesis, which are consistent
with previous studies using different model system or
experimental design (Gebel et al., 2004; Hackett et al.,
2003). The top 15 sub-networks identified by the MANOVA
approach are listed in Table 2. Xenobiotic metabolism,
cytoskeleton elements, amino acid and lipid metabolisms,
and protein kinase and phosphatase activities are found to be
significantly differentiated among study groups. In addition,
several immune system-related responses are also present in
the list, which are presumably due to the inflammation
response caused by smoking. Again, the MANOVA approach
was able to identify all the major effects caused by cigarette
smoking.
To compare this approach to a typical analysis flow, i.e. single
gene analysis followed by a gene set analysis, we analyzed the
Construction of a reference gene association network
Table 2. Top 15 scoring sub-networks for the cigarette smoking data
Cluster ID
MANOVA P-value
Significant GO annotation
1770
93
261
495
2062
1123
1369
1272
198
77
697
1967
568
434
182
1.69E12
4.17E10
8.39E10
4.08E09
1.42E08
4.84E08
7.04E08
3.37E07
3.91E07
3.93E07
4.50E07
5.54E07
5.96E07
1.13E06
2.31E06
Response to extracellular stimulus, cell ion homeostasis
Epithelial cell differentiation, phospholipid binding
Microtubule associated complex, GKAP/Homer scaffold activity
Amino sugar metabolic process, N-acetylglucosamine 6-O-sulfotransferase activity
Protein phosphatase binding, protein kinase CK2 activity
Acute-phase response
Fibril organization and biogenesis, extracellular matrix structural constituent
Amino acid biosynthetic/metabolic process
Nitrogen compound catabolic process, organic acid metabolic process
Xenobitic metabolic process
Leukocyte activation, immune response
Phosphaditic acid metabolic process
Steroid metabolic process, cholesterol homeostasis, xenobiotic metabolic process
Myosin complex, actin cytoskeleton
Ciliary or flagellar motility, microtubule-based process
Sub-networks are annotated with the top GO term hits in ‘Molecular Function’, ‘Biological Process’ and ‘Cellular Component’ ontologies.
two datasets using Ingenuity IPA tool (www.ingenuity.com),
which uses the Fisher exact test to calculate gene set enrichment
scores. For the HIV dataset, a list of 579 affected probes (10%
FDR) identified through single gene ANOVA were analyzed by
IPA. While lipid and amino acid metabolism were picked up as
the main affected pathways, ER stress response and gluoconeogenesis were not identified (data not shown). For the cigarette
smoking dataset, a list of 1190 affected probes (10% FDR) was
analyzed. IPA identified xenobiotic metabolism, cytoskeleton
structures and electron transport as the main affected targets but
missed inflammation response completely.
The gained sensitivity of our approach may be due to two
factors. First, no ranked gene list with an artificial cutoff
was used. Therefore, genes having modest changes may be
considered together to reveal significant group effect. Second,
instead of an enrichment score that does not reflect the actual
expression data, our approach attempts to find the best
separation between treatments. Given that parametric method
is generally considered more sensitive than its non-parametric
counterpart, it was expected for our method to generate results
that better fit the biological hypothesis.
However, cautions should also be taken when applying
MANOVA and related Hotelling’s T-square models. There are
three assumptions when applying MANOVA. First, the
dependent variables should be normally distributed. In general,
the more samples in a project, the more likely the assumption
is met. Second, the relationships among dependent variables are
linear. Since our calculation of co-expression is based on linear
correlation, it is likely that genes of the same sub-network
follow a linear relationship. However, deviations from linear
relationship will compromise the power of the test. Third,
variances and co-variances of dependent variables should be
homogeneous across the ranges of independent variables.
Several tests may be applied to check the validity of this
assumption (Tabachnick and Fidell, 1996).
In addition to the aforementioned assumptions, MANOVA
is also sensitive to outliers, which may lead to type I or type II
errors. There are tests available to check for univariate and
multivariate outliers. Finally, MANOVA analysis can only be
employed when the total number of samples is larger than the
size of the gene set being tested. This is because one degree of
freedom is lost for each dependent variable added (Bray and
Maxwell, 1985). In another word, large gene sets cannot be
tested by MANOVA or Hotelling’s T-square for projects
involving small numbers of samples. In fact, this is one of the
criteria we used to pick the partition number k so that the sizes
of the resulting sub-networks are feasible for MANOVA based
hypothesis testing for most projects, which typically have 10
samples or more.
4
CONCLUSION
In this article, we showed how reciprocal rankings and FDR
analysis can be applied to integrate signal from independently
conducted microarray experiments into a meta-network, which
we named RGA network. We used three different algorithms to
identify densely connected sub-networks from the network.
Examination of the sub-networks indicated that they may
represent various units of biological processes. The homogeneity (gene function) of these sub-networks fits well with the
assumption of MANOVA. We developed a MANOVA-based
approach which allows interpreting expression profiling data at
the gene set level. We conducted a thorough evaluation of our
approach by applying it to two published studies. Although in
this article, we limited our analysis to co-expression gene subnetworks, we strongly believe that our MANOVA model is
expandable to other groups of related genes (e.g. a gene list
containing genes mapped to the same chromosomal location).
We expect this approach would serve as a novel method for
analyzing transcriptional and proteomic profiling data at the
gene set level.
ACKNOWLEDGEMENTS
The authors of this paper would like to thank Curtis
Huttenhower and Hilary Coller for valuable discussion and
review of the manuscript. Authors D.U. and S.P. were
2723
D.Ucar et al.
supported by the DOE Early Career Principal Investigator
Award No. DE-FG02-04ER25611 and NSF CAREER Grant
IIS-0347662.
Conflict of Interest: none declared.
REFERENCES
Allocco,D.J. et al. (2004) Quantifying the relationship between co-expression,
co-regulation and gene function. BMC Bioinformatics, 5, 18.
Ashburner,M. and Lewis,S.E. (2002) On ontologies for biologists: the Gene
Ontology – uncoupling the web. Novartis Found. Symp., 247, 66–80.
Bader,G.D. and Hogue,C.W. (2003) An automated method for finding
molecular complexes in large protein interaction networks. BMC
Bioinformatics, 4, 2.
Bergman,S. et al. (2004) Similarities and difference in genome-wide expression
data of six organisms. PLoS Biol., 2, 85–93.
Bolstad,B.M. et al. (2003) A comparison of normalization methods for high
density oligonucleotide array data based on bias and variance. Bioinformatics,
19, 185–193.
Boyle,E.I. et al. (2004) GO::TermFinder–open source software for accessing Gene
Ontology information and finding significantly enriched Gene Ontology
terms associated with a list of genes. Bioinformatics, 20, 3710–3715.
Bray,J.H. and Maxwell,S.E. (1985) Multivariate analysis of variance.
Quantitative Applications in the Social Sciences Series. No. 54. Sage
Publications, Thousand Oaks, CA.
Brun,C. et al. (2004) Clustering proteins from interaction networks for the
prediction of cellular functions. BMC Bioinformatics, 5, 95.
Calza,L. et al. (2004) Dyslipidaemia associated with antiretroviral therapy in
HIV-infected patients. J. Antimicrob. Chemother., 53, 10–14.
Dhillon,I. et al. (2005) A fast kernel-based multilevel algorithm for graph
clustering. In Proceedings of the 11th ACM SIGKDD, Chicago, IL, August
21 - 24, pp. 629–634.
Eisen,M.B. et al. (1998) Cluster analysis and display of genome-wide expression
patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868.
Gebel,S. et al. (2004) Gene expression profiling in respiratory tissues from rats
exposed to mainstream cigarette smoke. Carcinogenesis, 25, 169–178.
Goldenberg,A. and Moore,A. (2004) Tractable learning of large Bayes net
structures from sparse data. In Proceedings of the 21st ICML.
Golub,T.R. et al. (1999) Molecular classification of cancer: class discovery and
class prediction by gene expression monitoring. Science, 286, 531–537.
Hackett,N.R. et al. (2003) Variability of antioxidant-related gene expression in
the airway epithelium of cigarette smokers. Am. J. Respir. Cell Mol. Biol., 29,
331–43.
2724
Jordan,I.K. et al. (2004) Conservation and coevolution in the scale-free human
gene coexpression network. Mol. Biol. Evol., 21, 2058–2070.
Kim,S.Y. and Volsky,D.J. (2005) PAGE: parametric analysis of gene set
enrichment. BMC Bioinformatics, 6, 144.
King,A.D. et al. (2004) Protein complex prediction via cost-based clustering.
Bioinformatics, 20, 3013–3020.
Lee,H.K. et al. (2004) Coexpression analysis of human genes across many
microarray data sets. Genome Res., 14, 1085–1094.
Lu,Y. et al. (2005) Hotelling’s T2 multivariate profiling for detecting differential
expression in microarrays. Bioinformatics, 21, 3105–3113.
Mao,L. and Resat,H. (2004) Probabilistic representation of gene regulatory
networks. Bioinformatics, 20, 2258–2269.
Mehra,S. et al. (2004) A Boolean algorithm for reconstructing the structure of
regulatory networks. Metabolic Eng., 6, 326–339.
Quackenbush,J. (2001) Computational analysis of microarray data. Nat. Rev.
Genet., 2, 418–427.
Quackenbush,J. (2006) Microarray analysis and tumor classification. N. Engl. J.
Med., 354, 2463–2472.
Parker,R.A. et al. (2005) Endoplasmic reticulum stress links dyslipidemia to
inhibition of proteasome activity and glucose transport by HIV protease
inhibitors. Mol. Pharmacol., 67, 1909–1919.
Pena,J.M. et al. (2005) Growing Bayesian network models of gene networks from
seed genes. Bioinformatics, 21, 224–229.
Rice,J.J. et al. (2005) Reconstructing biological networks using conditional
correlation analysis. Bioinformatics, 21, 765–773.
Spira,A. et al. (2004) Effects of cigarette smoke on the human airway epithelial
cell transcriptome. Proc. Natl Acad. Sci. USA, 101, 10143–10148.
Subramanian,A. et al. (2005) Gene set enrichment analysis: a knowledge-based
approach for interpreting genome-wide expression profiles. Proc. Natl Acad.
Sci. USA, 102, 15545–15550.
Stuart,J. et al. (2003) A gene coexpression network for global discovery of
conserved genetic modules. Sci. Express, 302, 249–255.
Tabachnick,B.G. and Fidell,L.S. (1996) Using Multivariate Statistics, Harper
Collins College Publishers, New York.
Tamayo,P. et al. (1999) Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic differentiation.
Proc. Natl Acad. Sci. USA, 96, 2907–2912.
van Noort,V. et al. (2004) The yeast co-expression network has a small-world,
scale-free architecture and can be explained by a simple model. EMBO Re., 5,
280–284.
Vinciotti,V. et al. (2006) Exploiting the full power of temporal gene expression
profiling through a new statistical test: application to the analysis of muscular
dystrophy data. BMC Bioinformatics, 7, 183–194.
Xiong,H. (2006) Non-linear tests for identifying differentially expressed genes or
genetic networks. Bioinformatics, 22, 919–923.