BRIEFINGS IN FUNCTIONAL GENOMICS. VOL 13. NO 1. 66–78
doi:10.1093/bfgp/elt030
Leveraging gene co-expression
networks to pinpoint the regulation of
complex traits and disease, with a focus
on cardiovascular traits
Maxime Rotival and Enrico Petretto
Advance Access publication date 19 August 2013
Abstract
Over the past decade, the number of genome-scale transcriptional datasets in publicly available databases has
climbed to nearly one million, providing an unprecedented opportunity for extensive analyses of gene co-expression
networks. In systems-genetic studies of complex diseases researchers increasingly focus on groups of highly interconnected genes within complex transcriptional networks (referred to as clusters, modules or subnetworks) to
uncover specific molecular processes that can inform functional disease mechanisms and pathological pathways.
Here, we outline the basic paradigms underlying gene co-expression network analysis and critically review the
most commonly used computational methods. Finally, we discuss specific applications of network-based approaches
to the study of cardiovascular traits, which highlight the power of integrated analyses of networks, genetic and
gene-regulation data to elucidate the complex mechanisms underlying cardiovascular disease.
Keywords: co-expression networks; transcriptional regulation; functional pathways; master genetic regulators
INTRODUCTION
In the eukaryotic cell, the use of genomic DNA is
tightly regulated to maintain optimal availability of
proteins needed by the cell to survive and perform its
function. Although dynamic regulation of gene expression is largely dependent on translation rate and
posttranslational modifications [1], transcriptional
regulation plays a major role in cell homeostasis, as
shown by the strong consistency of transcriptomic
profiles across biological replicates [2] and by the
relatively good correspondence between transcriptional levels and protein abundance [3]. As a
consequence, perturbations of the transcriptional
regulation of complex tissues have been shown to
have a sizeable impact on complex traits [4].
The gene balance principle states that the stoichiometry of interacting proteins affects the efficiency of
their interactions and function [5]. This principle has
led to the hypothesis that regulation of gene expression is evolutionarily selected toward a modular
structure that favors co-expression of genes encoding
interacting protein partners or genes that share similar functions [6]. These co-expression patterns reflect
the underlying activity of transcriptional networks,
which define the complex gene regulatory mechanisms operating in the cell. The structure of transcriptional networks underlying biological processes is
largely accepted to follow a scale-free topology
[7–9], with a few regulatory genes or hubs controlling
a large part of the network and many genes having
few connections. However, typical transcriptional
networks can also exhibit specific substructures,
including feedback and feed forward loops, single
input modules or dense regulatory modules [10].
Corresponding author. Enrico Petretto, MRC-Clinical Sciences Centre, Hammersmith Hospital Campus, Imperial College Centre for
Translational and Experimental Medicine (ICTEM Building), Du Cane Road, London, W12 0NN UK. Tel.: +44-020-8383-1468;
Fax: +44-208-383-8577; E-mail: [email protected]
Maxime Rotival is an associate researcher in the Integrative Genomics and Medicine group at the MRC Clinical Sciences Centre of
Imperial College, and specializes in integrative study of biological networks underlying the etiology of common diseases.
Enrico Petretto leads the Integrative Genomics and Medicine group at the MRC Clinical Sciences Centre of Imperial College, and
employs systems-genetics approaches to identify functional networks and elucidate the regulation of complex traits and disease.
© The Author 2013. Published by Oxford University Press. All rights reserved. For permissions, please email: [email protected]
Gene co-expression networks for analysis of complex trait regulation
Over the past decade, the number of genome-scale
transcriptional datasets in publicly available databases
has climbed exponentially, providing the scientific
community with an unprecedented opportunity for
analysis of co-expression networks across multiple organisms, tissues and experimental conditions. In network medicine [11], co-expression network analysis
aims to identify regulatory modules responsible for
transcriptional networks that underlie pathological mechanisms and disease susceptibility. This approach can also be leveraged to identify major-effect
genes involved in the etiology of common diseases, for
instance by identification of transcription factors (TFs)
underlying gene co-expression or master genetic regulators
of the networks (i.e. genes containing sequence variation that regulate the expression of the transcriptional
network). Complementary to genome-wide association studies (GWAS) of complex disease that aim to
pinpoint individual genetic susceptibility variants, gene
co-expression analysis focuses on sets of co-regulated
genes informative of disease processes. Indeed, many
common genetic variants of small phenotypic effect
cannot be detected by typical GWAS (‘missing heritability’) [12] due to the large number of subjects
required and limited statistical power. This ‘missing
heritability’ in GWAS of complex disease might also
reflect the genetic architecture underlying complex
traits and disease, where a large number of common
gene variations with individual small effect determine
the phenotype. On the other hand, it is likely that
beyond the role exerted by individual gene variants,
the combined action of multiple genes of small
effect—that are highly coordinated (i.e. gene network)
[13]—can lead to (or modulate) disease susceptibility.
Therefore, gene co-expression network approaches
can be used to leverage the largely under-exploited
information contained in the ‘gray zone’ of
GWAS data (i.e. small effect gene variants) [14] and
provide new insights into the biological pathways
involved in disease susceptibility [15].
The practical utility of current approaches for network analysis to infer global regulatory networks in
the system is necessarily hampered by the available
experimental data, which are usually limited to a few
tissues, developmental time-points or experimental
conditions. However, several computational
approaches have been proposed to infer gene co-expression networks, which provide ‘partial’, yet informative, insights into the regulation of gene expression under specific experimental conditions. In this
review, we use the term gene co-expression analysis
to refer to any method that aims at identifying regulatory modules (i.e. set of co-expressed genes) within
the larger transcriptional network. First, we will present the different paradigms underlying co-expression network analysis in complex disease. Next, we
will introduce and critically review the most widely
used computational tools for the identification of co-expression modules, discussing their properties
before showing examples of applications to human
disease (with a focus on cardiovascular traits). Finally,
we will discuss the main challenges that co-expression analysis will have to face in the future.
CO-EXPRESSION NETWORK
ANALYSIS: THE MAIN PARADIGMS
The successful application of co-expression network
analysis to study the etiology of complex disease
relies on the assumption that specific disease processes are determined by perturbations (e.g. genetic,
epigenetic, and environmental) of the underlying
gene regulatory networks. These network perturbations can result, for instance, in different functional
activities of a given pathway in a relevant tissue (i.e.
specific functional context), which in turn can lead
to disease susceptibility or changes in the disease trait.
Two main paradigms exist to link gene co-expression to disease:
The differential expression paradigm assumes that the
activity of a pathological pathway is affected
through up- or down-regulation of the module’s
genes (Figure 1). The gene co-expression network
topology and structure are the same in both pathological (disease) and normal states, but the average
gene expression level of the module genes is either
decreased or increased in the disease state.
The differential co-expression paradigm assumes that the
disease state is linked to perturbations of the structure and topology of the regulatory network itself
(Figure 1). Co-expression network approaches
based on this paradigm aim to identify biological
pathways that have co-expression patterns (e.g.
network topology, structure, interconnectivity)
that are different between disease and normal
states. Condition-specific co-expression can be revealed either in the disease or in the normal state,
and reflects the differential activity of the underlying dysregulated transcriptional network in a
particular tissue or system.
Rotival and Petretto
[Figure 1 panels: differential expression paradigm (left) and differential co-expression paradigm (right). Each panel shows the normal network and the disease network, the gene expression profiles (normal versus disease) and the gene co-expression (normal versus disease). Legend: transcription factor; regulated target gene; transcriptional regulation; impaired regulation; down-regulation.]
Figure 1: Paradigms underlying gene co-expression network analysis. The differential expression paradigm (left)
searches for networks of genes that are conserved in disease and healthy subjects but whose expression is up or
down-regulated in cases compared with controls. In contrast, the differential co-expression paradigm (right)
searches for networks which have similar expression levels in cases and controls but whose topological structure
is perturbed in disease. For each of the two paradigms, the upper part shows the network in healthy (‘normal network’) and diseased subjects (‘disease network’), whereas the bottom part shows the differences in expression
levels for the genes in the network (‘gene expression profiles’) and different co-expression patterns between cases
and controls (‘gene co-expression’, darker color indicates stronger co-expression).
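The contrast between the two paradigms can be made concrete with a small simulation. The sketch below is purely illustrative and is not taken from any of the reviewed methods: the function `simulate_pair`, its parameters and the noise level are assumptions introduced here. Two genes follow one hypothetical regulator; in the first disease scenario only their mean level shifts, whereas in the second only their coupling to the regulator is lost.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate_pair(n, shared_weight, shift, rng):
    """Two genes driven by one hypothetical regulator (assumption).
    `shared_weight` sets how strongly both follow the regulator,
    `shift` moves their mean expression level."""
    g1, g2 = [], []
    for _ in range(n):
        tf = rng.gauss(0, 1)
        g1.append(shift + shared_weight * tf + rng.gauss(0, 0.3))
        g2.append(shift + shared_weight * tf + rng.gauss(0, 0.3))
    return g1, g2

rng = random.Random(0)
normal = simulate_pair(200, 1.0, 0.0, rng)
# Differential expression: same coupling, lower mean level in disease.
disease_de = simulate_pair(200, 1.0, -2.0, rng)
# Differential co-expression: same mean level, regulation impaired.
disease_dc = simulate_pair(200, 0.0, 0.0, rng)

def summary(pair):
    """Mean level of the first gene and correlation between the two genes."""
    return mean(pair[0]), pearson(pair[0], pair[1])

mean_n, r_n = summary(normal)
mean_de, r_de = summary(disease_de)
mean_dc, r_dc = summary(disease_dc)
```

Under these assumptions the first scenario leaves the correlation between the two genes essentially unchanged while lowering their mean, and the second leaves the mean unchanged while removing the correlation, which is the distinction drawn in Figure 1.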
In both cases, transcriptional modules and co-expression networks can either exert a ‘causal role’
in disease etiology, or can be reactive to the disease
process:
Transcriptional modules that are causally associated with disease can be identified by testing
the co-expressed gene sets for enrichment in disease susceptibility gene variants using GWAS-based pathway enrichment analyses [16, 17].
Such techniques compare the distribution of
GWAS P values among variants located in the
vicinity of the network genes with that observed for
random sets of genes of the same size to detect
enrichment of disease-associated variants. To gain
more insights into the genetic drivers of the transcriptional module, detailed regulatory mechanisms can also be investigated by TF or miRNA
binding site enrichment analysis in the promoters of the module’s
genes. Furthermore, analysis of sequence variation at enriched TF-binding sites
(TFBS) in disease and control samples can identify
differential binding of TFs to the promoters of the module’s
genes. For example, in differentially co-expressed networks, master TF regulators can be discovered by testing the association of sequence
variants at the TF-binding sites between disease
and control samples.
Transcriptional modules that are reactive to disease
reflect transcriptional changes that are secondary to
the disease processes; for instance, inflammatory
disease mechanisms activate the Toll-like receptor-signaling pathway and the downstream transcriptional response in several target genes. Reactive
modules can reflect specific disease sub-phenotypes (e.g. tissue fibrosis or mitochondrial dysfunction), and their identification can help to pinpoint
abnormal cellular and molecular phenotypes that
arise as a consequence of the disease. Moreover,
characterization of the specific biological function
of reactive transcriptional modules associated with
disease can facilitate a better classification of disease
at the cellular level (e.g. gene expression signatures
in breast cancer [18]).
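The comparison of a module against random gene sets of the same size, described above for causal modules, can be sketched as a permutation test. This is a minimal illustration and not the procedure of references [16, 17]: the per-gene score (assumed here to be the -log10 of the best GWAS P value near each gene), the function name `enrichment_p_value` and the toy data are all assumptions.

```python
import random
from statistics import mean

def enrichment_p_value(gene_scores, module, n_perm=2000, seed=0):
    """Empirical P value that the module's mean score is at least as high
    as that of random gene sets of the same size.

    gene_scores: dict gene -> association score (hypothetical input,
    e.g. -log10 of the best GWAS P value near the gene).
    """
    rng = random.Random(seed)
    genes = list(gene_scores)
    observed = mean(gene_scores[g] for g in module)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, len(module))
        if mean(gene_scores[g] for g in sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Toy data: 500 background genes, 20 module genes with inflated scores.
rng = random.Random(1)
scores = {f"gene{i}": rng.expovariate(1.0) for i in range(500)}
module = [f"gene{i}" for i in range(20)]
for g in module:
    scores[g] += 1.5
p_module = enrichment_p_value(scores, module)
p_random = enrichment_p_value(scores, [f"gene{i}" for i in range(100, 120)])
```

A real analysis would additionally need to account for gene length, local linkage disequilibrium and the number of variants per gene, none of which is modeled in this sketch.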
Although co-expression networks in theory can
be derived from any type of transcriptomic data, the
nature of the derived co-expression networks is
closely linked to the source of transcriptional variation that is captured by the data. Therefore, while
studies aiming for an unbiased description of the
regulatory networks require quantification of gene
expression across a large number of conditions or
cell types, typical co-expression analyses focus on
transcriptional variation occurring in single tissues
(or cell types) under various perturbations (environmental stimuli, genetic variability, disease states,
etc.).
COMPUTATIONAL TOOLS FOR
CO-EXPRESSION NETWORK
ANALYSIS
A large variety of methods have been proposed to
explore the structure of transcriptional regulation and
infer co-expression networks within and across experimental conditions and tissues. Although we
do not aim to provide an exhaustive listing of all
these methods, most gene co-expression analysis
approaches can be classified in two major categories.
More precisely, we distinguish latent factor-based
methods, that first identify hidden variables influencing gene expression to derive gene sets undergoing
regulatory control by these latent factors [19–21]
from reverse engineering approaches that, on the
other hand, infer the co-expression network and
cluster highly interconnected nodes (genes) to form
coherent modules [22, 23] (Figure 2A).
Latent factors methods
Latent factor analysis assumes that gene expression
profiles result from the influence of a limited
number of unobserved covariates (i.e. latent factors)
and tries to estimate these covariates from the data, as
well as their respective contribution to each gene
expression profile (i.e. each gene expression profile
is modeled as a linear combination of the latent factors). Latent factor analysis is widely used in genomics and comprises methods such as principal
component analysis (PCA) [19, 24], independent
component analysis [20, 25] or non-negative
matrix factorization (NMF) [21, 26] and their derivatives [27–29]. To estimate the latent factors without
using prior knowledge, it is necessary to make some
assumptions on the structure of the latent factors and their
relative influence on the gene expression levels.
Existing latent factor-based methods differ mostly
by the set of assumptions that they make on the
latent factors and their contributions to gene expression. These differences can strongly affect the outcome of each method and must be carefully
considered. For instance, PCA assumes the orthogonality of the latent factors and searches for factors
explaining the maximal amount of variance in the
gene expression dataset, which reflect the main
source of variation of gene expression at the expense
of interpretability of more subtle causes of variation
[21]. On the other hand, NMF requires only non-negativity of both the latent factors and their contribution
to gene expression, providing a decomposition of the
gene expression data as a mixture of basic gene expression profiles, ultimately revealing different gene
co-expression modules. For example, NMF analysis
of gene expression in complex tissues (with varying
cell type composition) can provide a decomposition
reflecting the mixture of different cell types, whereas
[Figure 2 panels: (a) co-expression network; latent factor analysis; factor-gene associations; pruning; regulatory network; module extraction; clustering. (b) TFBS enrichment; GWAS enrichment; pathway enrichment. (c) network QTL mapping; master genetic regulators; network-phenotype associations; disease-related phenotype.]
Figure 2: Computational tools used in co-expression network analysis. Gene co-expression analysis is performed
either by studying the network correlation structure (A, top-left) or by identifying latent factors underlying the network (A, top-right). The co-expression network is then pruned from its indirect edges to find the underlying regulatory network (A, bottom left). The main regulatory modules can then be extracted (A, bottom right) either by
clustering of the regulatory network or from the contributions of the latent factors to the gene expression.
Modules can then be annotated using external information by enrichment for TFBSs in the promoters of the network genes
(B, left), enrichment for genes found by GWAS (B, middle), or in known pathways (B, right). ‘Master regulatory
genes’ controlling the network expression can be identified using mapping strategies to find Quantitative Trait Loci
(QTLs) underlying variability in network expression (C, left). Finally, the gene co-expression network can be associated with variation in intermediate disease phenotypes (C, right), revealing a link between underlying pathways
and disease.
PCA analysis will identify the main source of variation of tissue composition. Once latent factors
underlying gene expression variation have been
identified, discrete gene sets (modules) can be identified by regressing gene expressions on the latent
factors and selecting genes that are the most strongly
affected by each latent factor (i.e. genes that have the
highest absolute effect size in the regression).
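As an illustration of the last step, the sketch below estimates a single latent factor and then selects the genes most strongly affected by it. It is a simplified stand-in and not any published implementation: the factor is the first principal component obtained by power iteration, and the function name `first_latent_factor`, the toy data and the module size of 10 are assumptions. Because the factor has unit norm and zero mean, each gene's projection on it equals its regression coefficient.

```python
import random

def first_latent_factor(expr, n_iter=100, seed=0):
    """expr: list of genes, each a list of expression values per sample.
    Returns (factor, loadings): the first principal component across
    samples (unit norm) and each gene's regression coefficient on it."""
    rng = random.Random(seed)
    n_genes, n_samples = len(expr), len(expr[0])
    # Centre each gene so the factor captures variation, not mean level.
    centred = []
    for row in expr:
        m = sum(row) / n_samples
        centred.append([v - m for v in row])
    # Random start: a constant vector would be orthogonal to centred data.
    factor = [rng.gauss(0, 1) for _ in range(n_samples)]
    for _ in range(n_iter):
        loadings = [sum(r[j] * factor[j] for j in range(n_samples))
                    for r in centred]
        factor = [sum(centred[i][j] * loadings[i] for i in range(n_genes))
                  for j in range(n_samples)]
        norm = sum(f * f for f in factor) ** 0.5
        factor = [f / norm for f in factor]
    loadings = [sum(r[j] * factor[j] for j in range(n_samples))
                for r in centred]
    return factor, loadings

# Toy data: genes 0-9 follow a hidden covariate, genes 10-29 are noise.
rng = random.Random(42)
hidden = [rng.gauss(0, 1) for _ in range(40)]
expr = []
for g in range(30):
    weight = 2.0 if g < 10 else 0.0
    expr.append([weight * h + rng.gauss(0, 0.5) for h in hidden])

factor, loadings = first_latent_factor(expr)
# Module = genes with the highest absolute effect size on the factor.
ranked = sorted(range(30), key=lambda g: abs(loadings[g]), reverse=True)
module = sorted(ranked[:10])
```

The sign of a principal component is arbitrary, which is why genes are ranked on the absolute value of their loading.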
Reverse engineering approaches
Reverse engineering approaches are used to model
the structure of gene co-expression networks
through graphs, where the genes are represented as
nodes and edges represent co-regulation relationships
between genes. These co-regulation relationships are
usually quantified by adjacency values that reflect the
strength of the relationship (weighted network). A
null adjacency value then reflects the absence of a relationship. It is important to emphasize that the selection of the optimal adjacency measure is far from
being neutral; it often depends on the underlying
data structure and therefore must be chosen with
care:
Pearson correlation and its derivatives (absolute/
squared correlation) or partial-correlations constitute the simplest choice of adjacency measure to
assess co-expression between gene pairs. These
however suffer from both lack of robustness to
outlier measurements and the intrinsic inability
to capture non-linear relationships between genes.
To limit the sensitivity of the adjacency measure
to outliers, more robust measures of correlation
can be used such as Spearman, Kendall or biweight
correlation [30]. These adjacency measures are
more appropriate when small samples and noisy
data are used to reconstruct the co-expression
networks.
One can also focus on capturing non-linear
dependencies between genes using mutual information (MI)-based adjacency measures, such as the
k-nearest-neighbors MI computed by MInet [31].
This kind of adjacency measure usually provides a
more realistic snapshot of the complex interactions
and connections present in transcriptional networks but, being non-linear, is more difficult to
interpret.
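The sensitivity of Pearson correlation to outliers, and the relative robustness of a rank-based measure, can be seen on a toy example. The following sketch uses made-up values for two hypothetical genes that are perfectly co-expressed except for one aberrant measurement; the helper names are ours.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Ten samples; gene_b tracks gene_a except for one outlier (sample 5).
gene_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
gene_b = [1, 2, 3, 4, 100, 6, 7, 8, 9, 10]

r_pearson = pearson(gene_a, gene_b)    # about 0.04
r_spearman = spearman(gene_a, gene_b)  # about 0.82
```

A single aberrant value is enough to nearly erase the Pearson correlation, whereas the rank-based estimate still indicates strong co-expression.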
Once the most suitable adjacency measures are
calculated for all gene pairs present in the expression
dataset, the underlying regulatory network can be
derived by pruning non-significant edges or spurious
gene–gene connections (this also reduces the amount
of noise in the gene network data). Several
approaches can be used to identify significant
gene–gene interactions (edges) and extract relevant
gene co-expression modules from the adjacency
matrix:
Weighted correlation network analysis
(WGCNA) [22, 32] or DiffCoEx [23, 33] algorithms employ clustering-based methods to identify gene co-expression clusters using a soft
thresholding step to assign a connection weight to
each gene pair and extract densely connected clusters. These methods apply either a power or a
sigmoid transform to the initial adjacency measures
to reduce the influence of low-adjacency edges.
Although these transformations reduce (but do not
remove) the noise created by non-significant edges
and minimize the influence of indirect associations, they require the choice of a suitable soft
thresholding parameter.
Another commonly used approach, Algorithm
for the Reconstruction of Accurate Cellular
Networks (ARACNe) [34], proceeds by considering each possible triplet of genes and systematically
removes the weakest of the three edges to identify
gene clusters. The Context Likelihood of
Relatedness (CLR) method further extends on
ARACNe by re-weighting the edges between
two genes according to the strength of their connection relative to the strength of the connections
with the surrounding genes [35]. Despite their good
efficacy in the identification of ‘true’ regulatory
connections in relatively simple networks,
ARACNe and CLR may struggle in more complicated situations where the same node is under
regulatory control by multiple input nodes.
Gaussian Graphical Modeling (GGM) approaches
(also known as ‘covariance selection’) assume a
multivariate Gaussian distribution of the covariance between any pair of gene expression profiles,
and estimate the partial-correlation matrix between each gene pair using regularization
approaches [36, 37]. However, without specific
effort on the choice of the regularization parameters, these partial correlation-based approaches
(like GeneNet [37]) tend to be biased toward
identification of edges connecting nodes with relatively low degree of inter-connectivity [38].
Despite the good ability to uncover complex
regulatory structures [39], this degree-bias makes
GGM less powerful for the identification of networks with densely inter-connected hubs.
To improve on the limitations of GGM-based
approaches, Mahdi et al. [40] have suggested a
heuristic procedure that increases the power to
detect edges connecting hubs of the network.
The procedure works by estimating conditional
dependence between any gene pair in a bidirectional manner (i.e. gene A is regulated by gene B,
or gene B is regulated by gene A) thus limiting the
reduction of power due to multiple adjustments
for hubs, and allowing a better detection of edges
in dense areas of the network [40].
Complementary to the earlier-mentioned
approaches, other methods such as GENIE3 [41],
or ANOVA-based prediction [42] can combine
prior knowledge of TFs with non-linear dependency measures to carry out direct inference of the
relationships between the TF and its target genes,
and ultimately derive TF-driven co-regulatory networks. For instance, in GENIE3, the expression
pattern of one of the genes is predicted from the
expression patterns of all the TFs, using tree-based
methods (Random Forest and Extra Trees) [41].
When a reliable list of TFs is available, this class
of methods has been shown to achieve good performance in compelling network-inference benchmark studies (DREAM5), outperformed only by
the ‘wisdom of crowds’ approach, which combines
results from multiple algorithms to form a robust
prediction of the network [43].
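Two of the pruning ideas above lend themselves to a compact illustration. The sketch below is deliberately simplified and should not be read as the WGCNA or ARACNe implementation: `soft_threshold` applies a power transform to absolute correlations (the exponent 6 is an arbitrary choice), and `prune_triplets` removes the weakest edge of every fully connected triplet, without the mutual-information estimation and tolerance parameter that ARACNe uses.

```python
def soft_threshold(corr, beta=6):
    """Power transform |r| ** beta: weak correlations shrink toward zero,
    strong ones are comparatively preserved. The diagonal is set to 0."""
    n = len(corr)
    return [[0.0 if i == j else abs(corr[i][j]) ** beta for j in range(n)]
            for i in range(n)]

def prune_triplets(adj):
    """For every triplet in which all three edges are present, mark the
    weakest edge; marked edges are removed once all triplets are seen."""
    n = len(adj)
    drop = set()
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                edges = [(adj[i][j], (i, j)),
                         (adj[i][k], (i, k)),
                         (adj[j][k], (j, k))]
                if all(w > 0 for w, _ in edges):
                    drop.add(min(edges)[1])
    pruned = [row[:] for row in adj]
    for a, b in drop:
        pruned[a][b] = pruned[b][a] = 0.0
    return pruned

# Toy regulatory chain A -> B -> C: the A-C correlation (0.72) is
# indirect, being the product of the two direct ones (0.9 * 0.8).
corr = [[1.0, 0.9, 0.72],
        [0.9, 1.0, 0.8],
        [0.72, 0.8, 1.0]]

weighted = soft_threshold(corr)  # A-B: 0.9 ** 6, about 0.53
pruned = prune_triplets([[abs(v) if i != j else 0.0
                          for j, v in enumerate(row)]
                         for i, row in enumerate(corr)])
```

In this example the soft threshold only down-weights the indirect A-C edge, whereas triplet pruning removes it altogether, which mirrors the difference between the two families of methods described above.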
EXTRACTING REGULATORY
COMPONENTS FROM THE
CO-EXPRESSION NETWORK
Once gene–gene co-regulation is inferred from the
gene expression data, the main regulatory components of the network can be identified using clustering approaches. The choice of the appropriate
clustering method should be guided by several
considerations:
First, the clustering method influences the characteristics of the clusters that are obtained. For
instance, methods such as clique percolation (i.e.
finding a group of nodes that are more densely
connected to each other than to other nodes in
the network) [44] or complete hierarchical clustering [45] search for densely connected clusters,
typically resulting in the identification of small
clusters. In contrast, other methods such as affinity
propagation [46] or dynamic tree cut [47] are
more powerful to identify larger clusters of
genes. The latter class of approaches, however,
suffers from the need to specify several ad hoc parameters that need to be finely tuned empirically,
and which depend on the size of the considered
datasets or, in the WGCNA framework, on the
soft thresholding power used to assign a connection weight to each gene pair in the previous step.
Moreover, the influence of these parameters on
the clustering results is not always clear and no
good (or universal) criteria are provided to evaluate the significance and stability of the identified
clusters.
Second, the choice of the optimal clustering
method should also consider the type of pruning
performed to identify relevant gene–gene relationships. Indeed, pruning the network from indirect connections will lead to reconstructing
sparse networks (i.e. indicating that gene-gene
interactions are rare), which will affect the results
of most clustering algorithms. Although some algorithms such as affinity propagation or SCAN
[48] are suitable to deal with sparse networks,
most approaches require the computation of a
topological overlap measure [49] from the adjacency matrix to group nodes that share a similar
set of neighbors prior to the clustering to allow the
identification of gene clusters.
Third, the choice of the clustering will also
depend on the type of ‘feature’ that is searched
for (e.g. clusters including hub genes, genes connecting different clusters). For instance, affinity
propagation looks for ‘exemplar nodes’ of the network and tries to assign each node to its closest
‘exemplar’ while minimizing the number of ‘exemplar nodes’. Usually, such a method generates
highly centralized gene clusters and facilitates the
identification of hub genes. On the other hand,
approaches such as SCAN allow the identification
of both high-density clusters and genes acting as
connectors between clusters [48].
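The role of the topological overlap measure in sparse networks can be illustrated as follows. The sketch assumes the commonly used weighted form of the measure, TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where l_ij sums a_iu * a_uj over all other nodes u and k_i is the connectivity of node i; the function name and the toy network are ours.

```python
def topological_overlap(adj):
    """Topological overlap matrix from a symmetric adjacency matrix with
    values in [0, 1] and a zero diagonal."""
    n = len(adj)
    # Connectivity of each node: sum of its adjacencies.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Weight of the neighbors shared by nodes i and j.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u not in (i, j))
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
            tom[i][j] = tom[j][i] = value
    return tom

# Sparse toy network: genes 0 and 1 are not directly connected but
# share both of their neighbors (genes 2 and 3).
adj = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 0],
       [1, 1, 0, 0]]
tom = topological_overlap(adj)
```

Although genes 0 and 1 have no direct edge after pruning, their overlap is 2/3, higher than the 1/2 obtained for the directly connected pair (0, 2), so a subsequent clustering step can still group genes that share a similar set of neighbors.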
ANALYZING CO-EXPRESSION
NETWORKS ACROSS MULTIPLE
CONDITIONS
Alternative approaches and computational methods
have been developed to infer and characterize
co-expression networks across multiple experimental
conditions (e.g. cell-types, disease states, time
course). A straightforward approach consists of building a network separately within each condition and
then employing a network comparison tool to detect
specific sub-networks and modules that significantly
differ between the various conditions [50]. Another
approach consists of building adjacencies between
genes using differential correlation measures [23,
33] and then building a network based on these adjacencies. Regardless of the method chosen, the significance of differential expression and the changes in
sub-network structures under different biological
conditions can be tested formally using the framework suggested in [51]. Finally, co-factorization
methods, such as Higher-Order Singular Value
Decomposition [29] or variants of the NMF
approach [28, 52], aim to decompose multiple
gene expression datasets using common latent factor
analysis, which uncovers sets of genes undergoing
common regulation (i.e. shared across datasets) or
differential co-regulation across multiple conditions.
Aside from finding clusters that are specific to each
dataset, co-expression analysis across multiple experimental conditions can inform on the ‘robustness’ of
the inferred network, as it is able to provide useful
information about the network-conservation across
tissues, time-points or species.
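For a single gene pair, a change in correlation between two conditions can be assessed with the classical Fisher z-transformation. This is a standard textbook test offered here as a minimal sketch; it is neither the DiffCoEx adjacency [23, 33] nor the testing framework of [51], and it assumes independent samples and approximately bivariate-normal expression values.

```python
import math

def fisher_z_difference(r1, n1, r2, n2):
    """Compare two Pearson correlations estimated in independent groups.

    r1, r2: correlations of the same gene pair in conditions 1 and 2.
    n1, n2: number of samples in each condition (each must exceed 3).
    Returns (z, p): the z statistic and its two-sided normal P value.
    """
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / se
    # Two-sided tail probability of a standard normal variable.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical gene pair: strongly co-expressed in controls (r = 0.8),
# weakly co-expressed in cases (r = 0.2), 50 samples per group.
z_changed, p_changed = fisher_z_difference(0.8, 50, 0.2, 50)
# The same correlation in both groups gives no evidence of a change.
z_same, p_same = fisher_z_difference(0.5, 50, 0.5, 50)
```

Applied to every gene pair, such a test produces a very large number of dependent comparisons, so a genome-wide analysis would require multiple-testing correction or a permutation scheme on top of this per-pair statistic.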
USING CO-EXPRESSION
NETWORKS TO GAIN NEW
BIOLOGICAL INSIGHTS INTO
COMPLEX TRAITS AND DISEASE
The ability to leverage gene co-expression network
data to elucidate pathways and processes underlying
disease is highly dependent on understanding the
specific regulatory control, functional context and
pathological activity of the network.
One approach to characterize the regulatory control of the network is to search for interaction partners of the network genes by analysis of the
underlying genes’ sequence. For instance, one can
look for enrichment of TFBS using tools such as
PASTAA [53], HOMER [54], FIRE [55], or exploit
data from miRNA targets databases such as miRBase
or targetScan [56, 57] to search for specific miRNAs
or TFs regulating the network.
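At its simplest, enrichment of a TFBS in a network reduces to asking whether more network genes carry the motif than expected by chance. The sketch below uses a one-sided hypergeometric test as a generic illustration; it does not reproduce the scoring schemes of PASTAA, HOMER or FIRE, and the gene counts are invented.

```python
from math import comb

def hypergeom_enrichment(n_genome, n_genome_hits, n_module, n_module_hits):
    """One-sided hypergeometric P value: probability of observing at
    least `n_module_hits` motif-bearing genes when `n_module` genes are
    drawn without replacement from `n_genome` genes, of which
    `n_genome_hits` carry the motif in their promoter."""
    total = comb(n_genome, n_module)
    upper = min(n_module, n_genome_hits)
    tail = sum(comb(n_genome_hits, k)
               * comb(n_genome - n_genome_hits, n_module - k)
               for k in range(n_module_hits, upper + 1))
    return tail / total

# Hypothetical counts: 1000 of 20000 genes carry the motif (5%), and
# 12 of the 50 network genes do (24%).
p_enriched = hypergeom_enrichment(20000, 1000, 50, 12)

# Small case that can be checked by hand: both genes drawn from a pool
# of 4 (2 carrying the motif) carry it with probability 1/6.
p_small = hypergeom_enrichment(4, 2, 2, 2)
```

The choice of background matters in practice: restricting `n_genome` to genes expressed in the tissue under study, rather than all annotated genes, avoids inflating the enrichment.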
The functional context where the network operates can be explored by enrichment-analysis of
tissue-specific genes [58] or cell-type specificity of
the network genes using publicly available databases
of expression profiles [59] or gene expression atlases
(e.g. Gene Expression Omnibus repository: http://www.ncbi.nlm.nih.gov/geo/). These analyses can
elucidate the relative contribution of specific cell
types in determining the observed co-expression patterns, which represents a crucial step to understand
(and test experimentally) the molecular mechanisms
underlying the regulatory network.
To link co-expression networks with disease it is
often necessary to integrate multiple layers of information, including details on the regulatory control
(e.g. TFBS analysis), functional context (e.g. cell-type specificity) and genetic susceptibility data
(for instance using GWAS, as shown earlier)
(Figure 2B). In addition, genetic (sequence variation)
data can be integrated with transcriptional data (co-expression network) to identify genetic control
points of the regulatory network. In other words,
the network expression can be treated as a complex
multivariate trait that can be linked to underlying
sequence variation, for instance using standard
quantitative trait locus (QTL) mapping methods
(Figure 2C). To this aim, several multivariate mapping strategies such as MANOVA [60] or Bayesian variable selection algorithms [61, 62] have been used to
identify genetic control points of co-expression networks, also referred to as ‘master genetic regulators’
of the networks.
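The idea of treating network expression as a trait can be sketched in a much reduced form. The example below is a univariate simplification and not the MANOVA or Bayesian variable selection approaches cited above: the module is summarized by the mean of its standardized gene expression values, and this summary is tested against a single marker by permutation. All names, the effect size and the simulated genotypes are assumptions.

```python
import random
from statistics import mean, pstdev

def module_score(expr_rows):
    """Average z-scored expression across module genes, per sample."""
    standardized = []
    for row in expr_rows:
        m, s = mean(row), pstdev(row)
        standardized.append([(v - m) / s for v in row])
    n_samples = len(expr_rows[0])
    return [mean(z[j] for z in standardized) for j in range(n_samples)]

def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def permutation_p(genotype, trait, n_perm=1000, seed=0):
    """Empirical P value for the genotype effect on the trait, obtained
    by shuffling the trait values across samples."""
    rng = random.Random(seed)
    observed = abs(slope(genotype, trait))
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(slope(genotype, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: 60 samples, one marker coded 0/1/2, and a module of 8 genes
# whose expression increases with the number of alternative alleles.
rng = random.Random(7)
genotype = [rng.choice([0, 1, 2]) for _ in range(60)]
module_expr = [[0.8 * g + rng.gauss(0, 1) for g in genotype]
               for _ in range(8)]
trait = module_score(module_expr)
effect = slope(genotype, trait)
p_value = permutation_p(genotype, trait)
```

Collapsing the module to one score discards information on how individual genes respond to the locus, which is what the multivariate strategies mentioned above are designed to retain.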
Because it gathers information across a large
number of genes, the study of co-expression networks
can be informative even with very small sample size
(n 20) [63]. However, the robustness of the analysis
will depend on the kind of information that is being
extracted from the network. For instance, enrichment analyses (e.g. TFBS and GWAS) will be
robust to mild misclassification of genes and therefore will be able to provide insights even from networks derived from very small number of samples. In
contrast, accurate reverse engineering of the network
is particularly sensitive to the sample size and will
likely require hundreds of samples to ensure a
robust inference of the network [64]. Finding the
hubs of the networks, however, can be performed
even with networks inferred from relatively small
sample sizes because the accuracy of the estimated
gene connectivity will increase with the number of
genes in the network. In all cases, bootstrapping
methods may be used to assess the robustness of predicted edges in the network and other topological
properties.
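A bootstrap assessment of a single edge can be sketched as follows. This is one possible scheme under stated assumptions rather than a prescribed procedure: samples are resampled with replacement, the adjacency is the absolute Pearson correlation, and an edge is counted as recovered when it exceeds an arbitrary threshold of 0.5.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def bootstrap_edge_support(x, y, threshold=0.5, n_boot=500, seed=0):
    """Fraction of bootstrap resamples of the samples in which the edge
    between the two genes passes the adjacency threshold."""
    rng = random.Random(seed)
    n = len(x)
    kept = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if abs(pearson(xs, ys)) >= threshold:
            kept += 1
    return kept / n_boot

# Toy data with 30 samples: gene_b is co-expressed with gene_a,
# gene_c is unrelated to it.
rng = random.Random(3)
gene_a = [rng.gauss(0, 1) for _ in range(30)]
gene_b = [a + rng.gauss(0, 0.3) for a in gene_a]
gene_c = [rng.gauss(0, 1) for _ in range(30)]

support_ab = bootstrap_edge_support(gene_a, gene_b)
support_ac = bootstrap_edge_support(gene_a, gene_c)
```

The same resampling loop can be wrapped around any network statistic, for example the connectivity of a candidate hub, to obtain a comparable measure of stability.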
One example of successful integration of co-expression networks with biological information that
led to identification of new genes underlying cardiovascular disease is the discovery of endonuclease G
(EndoG) [65]. Although EndoG was initially implicated
in apoptosis (but not hypertrophy), McDermott-Roe
et al. [65] demonstrated a role for EndoG in regulating
left ventricular mass and mitochondrial function using
co-expression network analysis in the human heart
(Figure 3). The co-expression network analysis identified a mitochondrial function-enriched gene network around EndoG (and where EndoG is one of the
most highly interconnected gene hubs, Figure 3),
which was crucial to define the EndoG role in fundamental mitochondrial processes that are unrelated to
apoptosis. This new functional role inferred by the
network-analysis was confirmed by experiments in
the EndoG knockout mouse showing decreased mitochondrial activity leading to reduced cardiac mass.
Another application of gene co-expression networks and integration with analysis of regulatory
control by TFs has led to the discovery of Ebi2 as a
‘master genetic regulator’ of an anti-viral expression
network and type 1 diabetes (T1D) risk [66]. In their
study, Heinig et al. hypothesized that loci that perturb
gene co-expression networks can be important for
disease risk. Under this hypothesis, they carried out
genome-wide co-expression analysis in seven rat tissues and identified a co-expression network that
contained several targets of the IRF7 TF, a master
regulator of the type 1 interferon response [67], and
that was under genetic control by the Ebi2 gene in
the rat (Figure 4). Using comparative genomics
approaches, Heinig et al. showed that the human
orthologue of rat Ebi2 was located at a T1D-susceptibility locus and the network itself was
highly enriched for T1D susceptibility variants identified by
GWAS, supporting a role of the IRF7-driven
inflammatory network (IDIN) and Ebi2 in the etiology of T1D.
Cell-type enrichment analysis showed that the IDIN
network was highly and specifically expressed in
immune cells, which was supported by independent
replication of the network in human monocytes.
This further suggested that Ebi2 might affect T1D
susceptibility (and gene co-expression in complex
tissues like the heart, where the network was initially
discovered) by controlling migration of immune cells
from the blood to the tissues, as shown in other
studies [68]. The discovery of Ebi2 as a ‘master genetic regulator’ of inflammatory response and macrophage infiltration leading to increased risk of T1D in
humans also provides an example of successful
application of network-based approaches to the
identification of novel disease-susceptibility genes
(Figure 4).
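To make the enrichment step more concrete, the sketch below computes a one-sided hypergeometric (over-representation) P-value for the overlap between a co-expression module and a set of disease-associated genes. This is a generic test chosen for illustration; it is an assumption of this example and not necessarily the procedure used in the studies discussed above, and the function name and argument names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_module, n_overlap):
    """P(X >= n_overlap) under the hypergeometric null.

    n_background: number of genes tested (the gene universe)
    n_annotated:  genes in the universe carrying the annotation
                  (e.g. mapped to GWAS susceptibility loci)
    n_module:     genes in the co-expression module
    n_overlap:    module genes that carry the annotation
    """
    upper = min(n_annotated, n_module)
    total = comb(n_background, n_module)
    # Sum the probabilities of observing n_overlap or more annotated genes.
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only.
p_value = hypergeom_enrichment_p(
    n_background=15000, n_annotated=300, n_module=250, n_overlap=15
)
```

In practice the gene universe should be restricted to the genes that could have entered the network (for example, those expressed in the tissue), since an overly large background inflates the apparent enrichment.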
FUTURE DIRECTIONS FOR THE STUDY OF GENE CO-EXPRESSION NETWORKS
An important caveat to the inference of regulatory
mechanisms from co-expression networks arises from
the fact that, in many cases, TF activity is regulated via post-translational modifications of the
TF. Since such modifications are not captured at the
transcriptional level, the regulatory connections between the TFs and their targets may be missed when
looking at the transcriptome only. To circumvent this
limitation, the use of existing biological knowledge
and/or external data to inform the search for co-expression modules will undoubtedly be one of the
major areas of development in co-expression analysis
in the coming years. The fact that the best-performing
methods are those that take prior knowledge of TF–target
regulation into account [43] should encourage further work in this direction. In this respect,
the Bayesian network framework has been shown to
be an efficient way to include prior knowledge, for
instance using protein–protein interaction network
data or gene ontologies [69]. However, Bayesian network approaches tend to suffer from long
running times that often make their application to
large-scale genomic data impractical. Multiple types of genomic data
(miRNA, proteomic data) can also be integrated for
more efficient network analysis, as shown by Zhang
et al. [70], who used an NMF-based framework and
combined predicted gene–gene and gene–miRNA
interactions with gene and miRNA expression data
to identify gene networks under regulatory control by
miRNAs.
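For readers unfamiliar with NMF, the following sketch shows the basic factorization on which such frameworks build: a non-negative genes x samples matrix V is approximated by the product of two non-negative matrices W (gene loadings) and H (sample profiles), here fitted with the standard multiplicative update rules for the squared-error objective. This is plain NMF only; the integrative framework of Zhang et al. [70] adds further data types and constraints that are not reproduced here, and all function names and default values are assumptions of this example.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix v (genes x samples) as w @ h."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialization.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h
```

Genes with large loadings in the same column of W can be read as a candidate module, and the corresponding row of H gives the activity of that module across samples. The pure-Python matrix helpers keep the example self-contained; a numerical library would be preferable for real expression matrices.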
Another current challenge for co-expression network approaches is the need to adapt to the
growing volume of high-throughput mRNA sequencing (RNA-seq) data for measuring the transcriptome.
Indeed, first attempts have already suggested that, owing
to the better dynamic range provided by RNA-seq
data (as compared with traditional microarray data),
it is possible to identify networks with high interconnectivity (i.e. more gene–gene connections in each
module), heterogeneity (fewer hubs in each module)
and centrality (stronger hubs) [71]. Moreover,
[Figure 3 image labels: gene co-expression network in the human heart; relative degree of interconnectivity within the co-expression network (scale labels 1 and 14); ENDOG; mitochondrial gene function; unique co-expression of ENDOG with oxidative metabolism genes; mitochondrial processes (unrelated to apoptosis); cardiac hypertrophy]
Figure 3: Gene co-expression network analysis informs new gene function. The EndoG gene was first found to regulate cardiac hypertrophy in the rat heart. Next, the co-expression network reconstructed in the human heart
identified the EndoG gene as one of the major hubs in the network (i.e. a gene with a high degree of interconnectivity
within the network). Furthermore, the co-expression network was highly enriched for genes annotated for mitochondrial function (highlighted in darker color), suggesting a similar biological role for EndoG, which was previously
unappreciated. This new inference on EndoG function in fundamental (non-apoptotic) processes uncovered the link
between mitochondrial dysfunction and heart disease.
RNA-seq allows the study of different splicing events
and strand-specific expression and will therefore require ad hoc methods and statistical modeling to
take this additional layer of information into account
for network inference. The development of
approaches aiming to include splicing regulation in
network analysis is still at an early stage. Current attempts include the identification of modules of co-spliced genes, by correlating quantitative measures of
exon splicing (measured as the frequency of inclusion
of the exon in the final transcript) and extracting modules of exons whose splicing patterns are correlated by
NMF approaches [72], and the use of canonical correlation analysis to define exon-level measures of gene–gene
dependence [73].
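The first of these strategies can be sketched as follows: the inclusion frequency of each exon is computed per sample from the number of reads supporting inclusion and exclusion, and exon pairs with strongly correlated inclusion profiles are flagged as candidate co-spliced pairs. This is a simplified illustration that stops at pairwise correlation; it does not implement the NMF-based module extraction of [72] or the canonical correlation analysis of [73]. The function names, the read-count inputs and the threshold of 0.8 are assumptions of this example.

```python
import math


def inclusion_frequency(inclusion_reads, exclusion_reads):
    """Per-sample exon inclusion frequency; None where the exon has no coverage."""
    psi = []
    for inc, exc in zip(inclusion_reads, exclusion_reads):
        total = inc + exc
        psi.append(inc / total if total > 0 else None)
    return psi


def pearson_complete(x, y):
    """Pearson correlation over samples where both values are defined.

    Returns None with fewer than three usable samples or a constant profile.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        return None
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def cospliced_pairs(psi_by_exon, threshold=0.8):
    """Exon pairs whose inclusion profiles have |r| >= threshold."""
    names = sorted(psi_by_exon)
    result = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_complete(psi_by_exon[a], psi_by_exon[b])
            if r is not None and abs(r) >= threshold:
                result.append((a, b, r))
    return result
```

Samples in which an exon has no supporting reads are excluded from the correlation rather than treated as zero inclusion, because a lack of coverage carries no information about the splicing decision.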
It is likely that in the next few years the increasing
availability of genomic data from multiple sources
(e.g. high-throughput sequencing of mRNA and
[Figure 4 image labels: rat; humans; genome-wide co-expression network analysis; genome-wide TF binding site analysis; GWAS data for type 1 diabetes; IRF7-driven inflammatory gene network; T1D associated gene; IRF7; genetic control; rat Ebi2 (rat chromosome 15q25); synteny; human EBI2 (human chromosome 13q32)]
Figure 4: Master genetic regulators of co-expression networks across rats and humans. The identification of an
IRF7-regulated co-expression network in the rat was driven by integrated genome-wide co-expression analyses
and TFBS predictions in seven rat tissues. The network was found to be under master genetic control by the Ebi2
gene in the rat, suggesting a role for this gene in regulation of the inflammatory response. Comparative genomics approaches confirmed that the syntenic human locus encoding the orthologue of the rat Ebi2 gene controlled a similar inflammatory gene co-expression network in human monocytes. Finally, enrichment analysis using GWAS data showed
the involvement of the interferon-regulated network in type 1 diabetes, and further suggested a role for the IRF7-driven (IDIN) network and its ‘master genetic regulator’ (EBI2) in the etiology of type 1 diabetes.
DNA, quantitative protein abundance and protein–
protein interaction data) combined with refined gene
ontology and TF annotations will offer new opportunities to infer reliable gene co-expression networks and help make sense of them. Our
ability to integrate these heterogeneous and high-dimensional data via network modeling might then
represent a decisive step toward a better understanding of the regulation of cardiovascular traits and
disease.
Key Points
Analysis of gene co-expression networks can provide useful insights into the complex regulatory networks and mechanisms underlying disease.
Gene co-expression networks can be causal or reactive to the disease process, often reflecting context-specific functions and regulation.
There is a plethora of tools for network analysis, yet each approach has unique limitations and advantages, which makes its application highly dependent on the specific study design and data structure.
Integration of co-expression networks with genetic and gene-regulation data is a powerful means to identify master genetic regulators for pathological pathways and molecular processes underlying disease.
Acknowledgements
The authors acknowledge the funding from the European
Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. HEALTH-F4-2010-241504
(EURATRANS) and from the Medical Research Council.
References
1. Schwanhäusser B, Busse D, Li N, et al. Global quantification of mammalian gene expression control. Nature 2011;473:337–42.
2. Lukk M, Kapushesky M, Nikkilä J, et al. A global map of human gene expression. Nat Biotechnol 2010;28:322–4.
3. Lu P, Vogel C, Wang R, et al. Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat Biotechnol 2007;25:117–24.
4. Nicolae DL, Gamazon E, Zhang W, et al. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet 2010;6:e1000888.
5. Birchler JA, Veitia RA. The gene balance hypothesis: implications for gene regulation, quantitative traits and evolution. New Phytol 2010;186:54–62.
6. Carlson MRJ, Zhang B, Fang Z, et al. Gene connectivity, function, and sequence conservation: predictions from modular yeast co-expression networks. BMC Genomics 2006;7:40.
7. Babu MM, Luscombe NM, Aravind L, et al. Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol 2004;14:283–91.
8. Barabási A-L, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nat Rev Genet 2011;12:56–68.
9. Liu Y-Y, Slotine J-J, Barabási A-L. Controllability of complex networks. Nature 2011;473:167–73.
10. Shen-Orr SS, Milo R, Mangan S, et al. Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 2002;31:64–8.
11. Schadt EE, Friend SH, Shaywitz DA. A network view of disease and compound screening. Nat Rev Drug Discov 2009;8:286–95.
12. Hemani G, Knott S, Haley C. An evolutionary perspective on epistasis and the missing heritability. PLoS Genet 2013;9:e1003295.
13. McKinney BA, Pajewski NM. Six degrees of epistasis: statistical network models for GWAS. Front Genet 2011;2:109.
14. Naukkarinen J, Surakka I, Pietiläinen KH, et al. Use of genome-wide expression data to mine the ‘Gray Zone’ of GWA studies leads to novel candidate obesity genes. PLoS Genet 2010;6:e1000976.
15. Langfelder P, Mischel PS, Horvath S. When is hub gene selection better than standard meta-analysis? PLoS One 2013;8:e61505.
16. Weng L, Macciardi F, Subramanian A, et al. SNP-based pathway enrichment analysis for genome-wide association studies. BMC Bioinformatics 2011;12:99.
17. Zhang K, Chang S, Cui S, et al. ICSNPathway: identify candidate causal SNPs and pathways from genome-wide association study by one analytical framework. Nucleic Acids Res 2011;39:W437–43.
18. Van’t Veer LJ, Dai H, van de Vijver MJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002;415:530–6.
19. Wall ME, Rechtsteiner A, Rocha LM. Singular value decomposition and principal component analysis. In: Berrar DP, Dubitzky W, Granzow M (eds). A Practical Approach to Microarray Data Analysis. Norwell, MA: Kluwer, 2003:91–109.
20. Liebermeister W. Linear modes of gene expression determined by independent component analysis. Bioinformatics 2002;18:51–60.
21. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401:788–91.
22. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.
23. Tesson B, Breitling R, Jansen R. DiffCoEx: a simple and sensitive method to find differentially coexpressed gene modules. BMC Bioinformatics 2010;11:497.
24. Alter O, Brown PO, Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci U S A 2000;97:10101–6.
25. Rotival M, Zeller T, Wild PS, et al. Integrating genome-wide genetic variations and monocyte expression data reveals trans-regulated gene modules in humans. PLoS Genet 2011;7:e1002367.
26. Brunet J-P, Tamayo P, Golub TR, et al. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A 2004;101:4164–9.
27. Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. J Comput Graph Stat 2006;15:265–86.
28. Lee CM, Mudaliar MAV, Haggart DR, et al. Simultaneous non-negative matrix factorization for multiple large scale gene expression datasets in toxicology. PLoS One 2012;7:e48238.
29. Ponnapalli SP, Saunders MA, Van Loan CF, et al. A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS One 2011;6:e28072.
30. Hardin J, Mitani A, Hicks L, et al. A robust measure of correlation between two genes on a microarray. BMC Bioinformatics 2007;8:220.
31. Meyer PE, Lafitte F, Bontempi G. minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics 2008;9:461.
32. Fuller TF, Ghazalpour A, Aten JE, et al. Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome 2007;18:463–72.
33. Amar D, Safer H, Shamir R. Dissection of regulatory networks that are altered in disease via differential co-expression. PLoS Comput Biol 2013;9:e1002955.
34. Margolin AA, Nemenman I, Basso K, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 2006;7(Suppl 1):S7.
35. Faith JJ, Hayete B, Thaden JT, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 2007;5:e8.
36. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008;9:432–41.
37. Schäfer J, Opgen-Rhein R, Strimmer K. Reverse engineering genetic networks using the GeneNet package. J Am Stat Assoc 2001;96:1151–60.
38. Allen JD, Xie Y, Chen M, et al. Comparing statistical methods for constructing large scale gene networks. PLoS One 2012;7:e29348.
39. Werhli AV, Grzegorczyk M, Husmeier D. Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical gaussian models and bayesian networks. Bioinformatics 2006;22:2523–31.
40. Mahdi R, Madduri AS, Wang G, et al. Empirical Bayes conditional independence graphs for regulatory network recovery. Bioinformatics 2012;28:2029–36.
41. Huynh-Thu VA, Irrthum A, Wehenkel L, et al. Inferring regulatory networks from expression data using tree-based methods. PLoS One 2010;5:e12776.
42. Küffner R, Petri T, Tavakkolkhah P, et al. Inferring gene regulatory networks by ANOVA. Bioinformatics 2012;28:1376–82.
43. Marbach D, Costello JC, Küffner R, et al. Wisdom of crowds for robust gene network inference. Nat Methods 2012;9:796–804.
44. Zhang S, Ning X, Zhang X-S. Identification of functional modules in a PPI network by clique percolation clustering. Comput Biol Chem 2006;30:445–51.
45. Hansen P, Delattre M. Complete-link cluster analysis by graph coloring. J Am Stat Assoc 1978;73:397–403.
46. Frey BJ, Dueck D. Clustering by passing messages between data points. Science 2007;315:972–6.
47. Langfelder P, Zhang B, Horvath S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 2008;24:719–20.
48. Xu X, Yuruk N, Feng Z, et al. SCAN: a structural clustering algorithm for networks. In: Proceedings of the 13th ACM SIGKDD International Conference Knowledge Discovery and Data Mining. New York: ACM, 2007:824–33.
49. Yip AM, Horvath S. Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics 2007;8:22.
50. Zhang B, Li H, Riggins RB, et al. Differential dependency network analysis to identify condition-specific topological changes in biological networks. Bioinformatics 2009;25:526–32.
51. Gill R, Datta S, Datta S. A statistical framework for differential network analysis from microarray data. BMC Bioinformatics 2010;11:95.
52. Li W, Liu C-C, Zhang T, et al. Integrative analysis of many weighted co-expression networks using tensor computation. PLoS Comput Biol 2011;7:e1001106.
53. Roider HG, Manke T, O’Keeffe S, et al. PASTAA: identifying transcription factors associated with sets of co-regulated genes. Bioinformatics 2009;25:435–42.
54. Heinz S, Benner C, Spann N, et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell 2010;38:576–89.
55. Elemento O, Slonim N, Tavazoie S. A universal framework for regulatory element discovery across all genomes and data types. Mol Cell 2007;28:337–50.
56. Griffiths-Jones S, Grocock RJ, van Dongen S, et al. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 2006;34:D140–44.
57. Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 2005;120:15–20.
58. Kouadjo KE, Nishida Y, Cadrin-Girard JF, et al. Housekeeping and tissue-specific genes in mouse tissues. BMC Genomics 2007;8:127.
59. Shoemaker JE, Lopes TJS, Ghosh S, et al. CTen: a web-based platform for identifying enriched cell types from heterogeneous microarray data. BMC Genomics 2012;13:460.
60. Shen Y, Lin Z, Zhu J. Shrinkage-based regularization tests for high-dimensional data with application to gene set analysis. Comput Stat Data Anal 2011;55:2221–33.
61. Bottolo L, Petretto E, Blankenberg S, et al. Bayesian detection of expression quantitative trait loci hot spots. Genetics 2011;189:1449–59.
62. Scott-Boyer MP, Imholte GC, Tayeb A, et al. An integrated hierarchical Bayesian model for multivariate eQTL mapping. Stat Appl Genet Mol Biol 2012;11.
63. Zhao W, Langfelder P, Fuller T, et al. Weighted gene coexpression network analysis: state of the art. J Biopharm Stat 2010;20:281–300.
64. Margolin AA, Wang K, Lim WK, et al. Reverse engineering cellular networks. Nat Protoc 2006;1:662–71.
65. McDermott-Roe C, Ye J, Ahmed R, et al. Endonuclease G is a novel determinant of cardiac hypertrophy and mitochondrial function. Nature 2011;478:114–8.
66. Heinig M, Petretto E, Wallace C, et al. A trans-acting locus regulates an anti-viral expression network and type 1 diabetes risk. Nature 2010;467:460–4.
67. Honda K, Yanai H, Negishi H, et al. IRF-7 is the master regulator of type-I interferon-dependent immune responses. Nature 2005;434:772–7.
68. Hannedouche S, Zhang J, Yi T, et al. Oxysterols direct immune cell migration via EBI2. Nature 2011;475:524–7.
69. Qin T, Tsoi LC, Sims KJ, et al. Signaling network prediction by the Ontology Fingerprint enhanced Bayesian network. BMC Syst Biol 2012;6(Suppl 3):S3.
70. Zhang S, Li Q, Liu J, et al. A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. Bioinformatics 2011;27:i401–9.
71. Iancu OD, Kawane S, Bottomly D, et al. Utilizing RNA-Seq data for de novo coexpression network inference. Bioinformatics 2012;28:1592–7.
72. Dai C, Li W, Liu J, et al. Integrating many co-splicing networks to reconstruct splicing regulatory modules. BMC Syst Biol 2012;6(Suppl 1):S17.
73. Hong S, Chen X, Jin L, et al. Canonical correlation analysis for RNA-seq co-expression networks. Nucleic Acids Res 2013;41:e95.