B RIEFINGS IN FUNC TIONAL GENOMICS . VOL 13. NO 1. 66 ^78 doi:10.1093/bfgp/elt030 Leveraging gene co-expression networks to pinpoint the regulation of complex traits and disease, with a focus on cardiovascular traits Maxime Rotival and Enrico Petretto Advance Access publication date 19 August 2013 Abstract Over the past decade, the number of genome-scale transcriptional datasets in publicly available databases has climbed to nearly one million, providing an unprecedented opportunity for extensive analyses of gene co-expression networks. In systems-genetic studies of complex diseases researchers increasingly focus on groups of highly interconnected genes within complex transcriptional networks (referred to as clusters, modules or subnetworks) to uncover specific molecular processes that can inform functional disease mechanisms and pathological pathways. Here, we outline the basic paradigms underlying gene co-expression network analysis and critically review the most commonly used computational methods. Finally, we discuss specific applications of network-based approaches to the study of cardiovascular traits, which highlight the power of integrated analyses of networks, genetic and gene-regulation data to elucidate the complex mechanisms underlying cardiovascular disease. Keywords: co-expression networks; transcriptional regulation; functional pathways; master genetic regulators INTRODUCTION In the eukaryote cell, the use of genomic DNA is tightly regulated to maintain optimal availability of proteins needed by the cell to survive and perform its function. Although dynamic regulation of gene expression is largely dependent on translation rate and posttranslational modifications [1], the transcriptional regulation plays a major role in cell homeostasis as shown by the strong consistency of transcriptomic profiles across biological replicates [2] and by the relatively good correspondence between transcriptional levels and proteins abundance [3]. As a consequence, perturbations of the transcriptional regulation of complex tissues have been shown to have a sizeable impact on complex traits [4]. The gene balance principle states that the stoichiometry of interacting proteins affects the efficiency of their interactions and function [5]. This principle has lead to the hypothesis that regulation of gene expression is evolutionarily selected toward a modular structure that favor co-expression of genes encoding interacting protein partners or genes that share similar functions [6]. These co-expression patterns reflect the underlying activity of transcriptional networks, which define the complex gene regulatory mechanisms operating in the cell. The structure of transcriptional networks underlying biological processes is largely accepted to follow a scale free topology [7–9], with a few regulatory genes or hubs controlling large part of the network and several genes having few connections. However, typical transcriptional networks can also exhibit specific substructures, including feedback and feed forward loops, single input modules or dense regulatory modules [10]. Corresponding author. Enrico Petretto, MRC-Clinical Sciences Centre, Hammersmith Hospital Campus, Imperial College Centre for Translational and Experimental Medicine (ICTEM Building), Du Cane Road, London, W12 0NN UK. Tel.: þ 44-020-8383-1468; Fax: þ44-208-383-8577; E-mail: [email protected] Maxime Rotival is an associate researcher in the Integrative Genomics and Medicine group at the MRC Clinical Sciences Centre of Imperial College, and specializes in integrative study of biological networks underlying the etiology of common diseases. Enrico Petretto leads the Integrative Genomics and Medicine group at the MRC Clinical Sciences Centre of Imperial College, and employs systems-genetics approaches to identify functional networks and elucidate the regulation of complex traits and disease. ß The Author 2013. Published by Oxford University Press. All rights reserved. For permissions, please email: [email protected] Gene co-expression networks for analysis of complex trait regulation Over the past decade, the number of genome-scale transcriptional datasets in publicly available databases has climbed exponentially, providing the scientific community with an unprecedented opportunity for analysis of co-expression networks across multiple organisms, tissues and experimental conditions. In network-medicine [11], co-expression network analysis aims to identify regulatory modules responsible for transcriptional networks that are underlying pathological mechanisms and disease susceptibility. This approach can be also leveraged to identify major-effect genes involved in the etiology of common diseases, for instance by identification of transcription factors (TFs) underlying gene co-expression or master geneticregulators of the networks (i.e. genes containing sequence variation that regulate the expression of the transcriptional network). Complementary to genome wide association studies (GWAS) of complex disease that aim to pinpoint individual genetic susceptibility variants, gene co-expression analysis focuses on sets of co-regulated genes informative of disease processes. Indeed, many common genetic variants of small phenotypic effect cannot be detected by typical GWAS (‘missing heritability’) [12] due to the large number of subjects required and limited statistical power. This ‘missing heritability’ in GWAS of complex disease might also reflect the genetic architecture underlying complex traits and disease, where a large number of common gene variations with individual small effect determine the phenotype. On the other hand, it is likely that beyond the role exerted by individual gene variants, the combined action of multiple genes of small effect—that are highly coordinated (i.e. gene network) [13]—can lead to (or modulate) disease susceptibility. Therefore, gene co-expression network approaches can be used to leverage the largely under-exploited information contained into the ‘gray zone’ of GWAS data (i.e. small effect gene variants) [14] and provide new insights into the biological pathways involved in disease susceptibility [15]. The practical utility of current approaches for network analysis to infer global regulatory networks in the system is necessarily hampered by the available experimental data, which are usually limited to a few tissues, developmental time-points or experimental conditions. However, several computational approaches have been proposed to infer gene coexpression networks, which provide ‘partial’, yet informative, insights in the regulation of gene expression under specific experimental conditions. In this review, we use the term gene co-expression analysis 67 to refer to any method that aims at identifying regulatory modules (i.e. set of co-expressed genes) within the larger transcriptional network. First, we will present the different paradigms underlying co-expression network analysis in complex disease. Next, we will introduce and critically review the most widely used computational tools for the identification of coexpression modules, discussing their properties before showing examples of applications to human disease (with a focus on cardiovascular traits). Finally, we will discuss the main challenges that co-expression analysis will have to face in the future. CO-EXPRESSION NETWORK ANALYSIS:THE MAIN PARADIGMS The successful application of co-expression networks analysis to study the etiology of complex disease relies on the assumption that specific disease processes are determined by perturbations (e.g. genetic, epigenetic, and environmental) of the underlying gene regulatory networks. These network perturbations can result, for instance, in different functional activities of a given pathway in a relevant tissue (i.e. specific functional context), which in turn can lead to disease susceptibility or changes in the disease trait. Two main paradigms exist to link gene co-expression to disease: The differential expression paradigm assumes that the activity of a pathological pathway is affected through up- or down-regulation of the module’s genes (Figure 1). The gene co-expression network topology and structure is the same in both pathological (disease) and normal state, but the average gene expression level of the module genes is either decreased or increased in the disease state. The differential co-expression paradigm assumes that the disease state is linked to perturbations of the structure and topology of regulatory network itself (Figure 1). Co-expression network approaches based on this paradigm aim to identify biological pathways that have co-expression patterns (e.g. networks topology, structure, interconnectivity) that are different between disease and normal states. Condition specific co-expression can be revealed either in the disease or in the normal states, and reflect the differential activity of the underlying dysregulated transcriptional network in a particular tissue or system. 68 Rotival and Petretto Differential expression paradigm Differential co-expression paradigm Normal network Disease network Gene expression profiles Normal Disease Normal Disease Normal Disease Gene co-expression Normal Disease Transcripon factor Regulated target gene Transcriponal regulaon Impaired regulaon Down regulaon Figure 1: Paradigms underlying gene co-expression network analysis. The differential expression paradigm (left) searches for networks of genes that are conserved in disease and healthy subjects but whose expression is up or down-regulated in cases compared with controls. In contrast, the differential co-expression paradigm (right) searches for networks which have similar expression levels in cases and controls but whose topological structure is perturbed in disease. For each of the two paradigms, the upper part shows the network in healthy (‘normal network’) and diseased subjects (‘disease network’), whereas the bottom part shows the differences in expression levels for the genes in the network (‘gene expression profiles’) and different co-expression patterns between cases and controls (‘gene co-expression’, darker color indicates stronger co-expression). In both cases, transcriptional modules and coexpression networks can either exert a ‘causal role’ in disease etiology, or can be reactive to the disease process: Transcriptional modules that are causally associated with disease can be identified by testing the co-expressed gene sets for enrichment in disease susceptibility gene variants using GWASbased pathway enrichment analyses [16, 17]. Such techniques compare the distribution of GWAS P values among variants located in the vicinity of the network genes compared with random sets of genes of the same size to detect Gene co-expression networks for analysis of complex trait regulation enrichment of disease-associated variants. To gain more insights into the genetic drivers of the transcriptional module, detailed regulatory mechanisms can also be investigated by TF or miRNA binding site enrichment analysis in the module’s genes promoters. Furthermore, analysis of sequence variation at enriched TF-binding sites (TFBS) in disease and control samples can identify differential binding of TFs to the module’s gene promoters. For example, in the differentially coexpressed networks masterTF-regulators can be discovered by testing the association of sequence variants at the TF-bindings site between disease and control samples. Transcriptional modules that are reactive to disease reflect transcriptional changes that are secondary to the disease processes, for instance inflammatory disease mechanisms activate Toll-like receptorsignaling pathway and the downstream transcriptional response in several targets genes. Reactive modules can reflect specific disease sub-phenotypes (e.g. tissue fibrosis or mitochondrial dysfunction), and their identification can help to pinpoint abnormal cellular and molecular phenotypes that arise as a consequence of the disease. Moreover, characterization of the specific biological function of reactive transcriptional modules associated with disease can facilitate a better classification of disease at the cellular level (e.g. gene expression signatures in breast cancer [18]). Although co-expression networks in theory can be derived from any type of transcriptomic data, the nature of the derived co-expression networks is closely linked to the source of transcriptional variation that is captured by the data. Therefore, while studies aiming for an unbiased description of the regulatory networks require quantification of gene expression across a large number of conditions or cell types, typical co-expression analyses focus on transcriptional variation occurring in single tissues (or cell types) under various perturbations (environmental stimuli, genetic variability, disease states, etc.). COMPUTATIONAL TOOLS FOR CO-EXPRESSION NETWORK ANALYSIS A large variety of methods have been proposed to explore the structure of transcriptional regulation and 69 infer co-expression networks within and across experimental conditions and tissues. Although we do not aim to provide an exhaustive listing of all these methods, most gene co-expression analysis approaches can be classified in two major categories. More precisely, we distinguish latent factor-based methods, that first identify hidden variables influencing gene expression to derive gene sets undergoing regulatory control by these latent factors [19–21] from reverse engineering approaches that, on the other hand, infer the co-expression network and cluster highly interconnected nodes (genes) to form coherent modules [22, 23] (Figure 2A). Latent factors methods Latent factor analysis assumes that gene expression profiles result from the influence of a limited number of unobserved covariates (i.e. latent factors) and try to estimate these covariates from the data, as well as their respective contribution to each gene expression profile (i.e. each gene expression profile is modeled as a linear combination of the latent factors). Latent factor analysis is widely used in genomics and comprises methods such as principal component analysis (PCA) [19, 24], independent component analysis [20, 25] or non-negative matrix factorization (NMF) [21, 26] and their derivatives [27–29]. To estimate the latent factors without using prior knowledge, it is necessary to make some assumptions on the latent factors structure and their relative influence on the gene expression levels. Existing latent factor-based methods differ mostly by the set the assumptions that they make on the latent factors and their contributions to gene expression. These differences can strongly affect the outcome of each method and must be carefully considered. For instance, PCA assumes the orthogonality of the latent factors and searches for factors explaining the maximal amount of variance in the gene expression dataset, which reflect the main source of variation of gene expression at the expense of interpretability of more subtle causes of variation [21]. On the other hand, NMF requires only positivity of both the latent factors and their contribution to gene expression, providing a decomposition of the gene expression data as a mixture of basic gene expression profiles, ultimately revealing different gene co-expression modules. For example, NMF analysis of gene expression in complex tissues (with varying cell type composition) can provide a decomposition reflecting the mixture of different cell types, whereas 70 Rotival and Petretto (a) Co-expression network Latent factor analysis factor-genes associations pruning Regulatory network Modules extraction clustering (b) (c) TFBS enrichment GWAS enrichment pp g Network QTL mapping Master genec regulators gula Pathway enrichment Network-Phenotype yp associaons as Disease related Phenotype Figure 2: Computational tools used in co-expression network analysis. Gene co-expression analysis is performed either by studying the network correlation structure (A, top-left) or by identifying latent factors underlying the network (A, top-right). The co-expression network is then pruned from its indirect edges to find the underlying regulatory network (A, bottom left). The main regulatory modules can then be extracted (A, bottom right) either by clustering of the regulatory network or from the contributions of the latent factors to the gene expression. Modules can then be annotated using external information by enrichment forTFBSs in the network-genes promoters (B, left), enrichment for genes found by GWAS (B, middle), or in known pathways (B, right). ‘Master regulatory genes’ controlling the network expression can be identified using mapping strategies to find Quantitative Trait Loci (QTLs) underlying variability in network expression (C, left). Finally the gene co-expression network can be associated with variation in intermediate disease phenotypes (C, right), revealing a link between underlying pathways and disease. Gene co-expression networks for analysis of complex trait regulation PCA analysis will identify the main source of variation of tissue composition. Once latent factors underlying gene expression variation have been identified, discrete gene sets (modules) can be identified by regressing gene expressions on the latent factors and selecting genes that are the most strongly affected by each latent factor (i.e. genes that have the highest absolute effect size in the regression). Reverse engineering approaches Reverse engineering approaches are used to model the structure of gene co-expression networks through graphs, where the genes are represented as nodes and edges represent co-regulation relationships between genes. These co-regulation relationships are usually quantified by adjacency values that reflect the strength of the relationship (weighted network). A null adjacency value then reflects the absence of relationship. It is important to emphasize that the selection of the optimal adjacency measure is far from being neutral; it often depends on the underlying data structure and therefore must be chosen with care: Pearson correlation and its derivatives (absolute/ squared correlation) or partial-correlations constitute the simplest choice of adjacency measure to assess co-expression between gene pairs. These however suffer from both lack of robustness to outlier measurements and the intrinsic inability to capture non-linear relationships between genes. To limit the sensitivity of the adjacency measure to outliers, more robust measures of correlation can be used such as Spearman, Kendall or biweight correlation [30]. These adjacency measures are more appropriate when small samples and noisy data are used to reconstruct the co-expression networks. One can also focus on capturing non-linear dependencies between genes using mutual information (MI)-based adjacency measures, such as the k-nearest-neighbors MI computed by MInet [31]. This kind of adjacency measures usually provides a more realistic snapshot of the complex interactions and connections present in transcriptional networks, but being non-linear are of more difficult interpretation. Once the most suitable adjacency measures are calculated for all gene pairs present in the expression dataset, the underlying regulatory network can be 71 derived by pruning non-significant edges or spurious gene–gene connections (this also reduces the amount of noise in the gene network data). Several approaches can be used to identify significant gene–gene interactions (edges) and extract relevant gene co-expression modules from the adjacency matrix: Weighted correlation network analysis (WGCNA) [22, 32] or DiffCoEx [23, 33] algorithms employ clustering-based methods to identify gene co-expression clusters using a soft thresholding step to assign a connection weight to each gene pair and extract densely connected clusters. These methods apply either a power or a sigmoid transform to the initial adjacency measures to reduce the influence of low-adjacency edges. Although these transformations reduce (but do not remove) the noise created by non-significant edges and minimize the influence of indirect associations, they require the choice of a suitable soft thresholding parameter. Another commonly used approach, Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNe) [34], proceeds by considering each possible triplet of genes and systematically removes the weakest of the three edges to identify gene clusters. The Context Likelihood of Relatedness (CLR) method further extends on ARACNe by re-weighting the edges between two genes according to the strength of their connection relative to the strength of the connections with the surrounding genes [35]. Despite its good efficacy in the identification of ‘true’ regulatory connections in relatively simple networks, ARACNe and CLR may struggle in more complicated situations where the same node is under regulatory control by multiple input nodes. Gaussian Graphical Modeling (GGM) approaches (also known as ‘covariance selection’) assume a multivariate Gaussian distribution of the covariance between any pair of gene expression profiles, and estimate the partial-correlations matrix between each gene pairs using regularization approaches [36, 37]. However, without specific effort on the choice of the regularization parameters, these partial correlation-based approaches (like GeneNet [37]) tend to be biased toward identification of edges connecting nodes with relatively low degree of inter-connectivity [38]. Despite the good ability to uncover complex 72 Rotival and Petretto regulatory structures [39], this degree-bias makes GGM less powerful for the identification of networks with densely inter-connected hubs. To improve on the limitations of GGM-based approaches, Mahdi et al. [40] have suggested a heuristic procedure that increases the power to detect edges connecting hubs of the network. The procedure works by estimating conditional dependence between any gene pair in a bidirectional manner (i.e. gene A is regulated by gene B, or gene B is regulated by gene A) thus limiting the reduction of power due to multiple adjustments for hubs, and allowing a better detection of edges in dense areas of the network [40]. Complementary to the earlier-mentioned approaches, other methods such as GENIE3 [41], or ANOVA-based prediction [42] can combine prior knowledge of TFs with non-linear dependency measures to carry out direct inference of the relationships between the TF and its target genes, and ultimately derive TF-driven co-regulatory networks. For instance in GENIE3, the expression pattern of one of the genes is predicted from the expression patterns of all the TFs, using tree-based Methods (Random Forest and Extra Trees) [41]. When a reliable list of TFs is available, this class of methods has been shown to achieve good performance in compelling network-inference benchmark studies (DREAM5), outperformed only by the ‘wisdom of crowds’ approach, which combine results from multiple algorithms to form a robust prediction of the network [43]. EXTRACTING REGULATORY COMPONENTS FROM THE CO-EXPRESSION NETWORK Once gene–gene co-regulation is inferred from the gene expression data, the main regulatory components of the network can be identified using clustering approaches. The choice of the appropriate clustering method should be guided by several considerations: First, the clustering method influences the characteristics of the clusters that are obtained. For instance, methods such as clique percolation (i.e. finding a group of nodes that are more densely connected to each other than to other nodes in the network) [44] or complete hierarchical clustering [45] search for densely connected clusters, typically resulting in the identification of small clusters. In contrast, other methods such as affinity propagation [46] or dynamic tree cut [47] are more powerful to identify larger clusters of genes. The latter class of approaches, however, suffers from the need to specify several ad hoc parameters that need to be finely tuned empirically, and which depend on the size of the considered datasets or, in the WGCNA framework, on the soft thresholding power used to assign a connection weight to each gene pair in the previous step. Moreover, the influence of these parameters on the clustering results is not always clear and no good (or universal) criteria are provided to evaluate the significance and stability of the identified clusters. Second, the choice of the optimal clustering method should also consider the type of pruning performed to identify relevant gene–gene relationships. Indeed, pruning the network from indirect connections will lead to reconstructing sparse networks (i.e. indicating that gene-gene interactions are rare), which will affect the results of most clustering algorithms. Although some algorithms such as affinity propagation or SCAN [48] are suitable to deal with sparse networks, most approaches require the computation of a topological overlap measure [49] from the adjacency matrix to group nodes that share a similar set of neighbors prior to the clustering to allow the identification of gene clusters. Third, the choice of the clustering will also depend of the type of ‘feature’ that is searched for (e.g. clusters including hub genes, genes connecting different clusters). For instance, affinity propagation looks for ‘exemplar nodes’ of the network and tries to assign each node to its closest ‘exemplar’ while minimizing the number of ‘exemplar nodes’. Usually, such method generates highly centralized gene clusters and facilitates the identification of hub genes. On the other hand, approaches such as SCAN allow the identification of both high-density clusters and genes acting as connectors between clusters [48]. ANALYZING CO-EXPRESSION NETWORKS ACROSS MULTIPLE CONDITIONS Alternative approaches and computational methods have been developed to infer and characterize Gene co-expression networks for analysis of complex trait regulation co-expression networks across multiple experimental conditions (e.g. cell-types, disease states, time course). A straightforward approach consists in building a network separately within each condition and then to employ a network comparison tool to detect specific sub-networks and modules that significantly differ between the various conditions [50]. Another approach consists in building adjacencies between genes using differential correlation measures [23, 33] and then build a network based on these adjacencies. Regardless of the method chosen, the significance of differential expression and the changes in sub-network structures under different biological conditions can be tested formally using the framework suggested in [51]. Finally, co-factorization methods, such as Higher-Order Singular Value Decomposition [29] or variants of the non-NMF approach [28, 52], aim to decompose multiple gene expression datasets using common latent factors analysis, which uncover sets of genes undergoing common regulation (i.e. shared across datasets) or differential co-regulation across multiple conditions. Aside from finding clusters that are specific to each dataset, co-expression analysis across multiple experimental conditions can inform on the ‘robustness’ of the inferred network, as it is able to provide useful information about the network-conservation across tissues, time-points or species. USING CO-EXPRESSION NETWORKS TO GAIN NEW BIOLOGICAL INSIGHTS INTO COMPLEX TRAITS AND DISEASE The ability to leverage gene co-expression networks data to elucidate pathways and processes underlying disease is highly dependent on understanding the specific regulatory control, functional context and pathological activity of the network. One approach to characterize the regulatory control of the network is to search for interaction partners of the network genes by analysis of the underlying genes’ sequence. For instance, one can look for enrichment of TFBS using tools such as PASTAA [53], HOMER [54], FIRE [55], or exploit data from miRNA targets databases such as miRBase or targetScan [56, 57] to search of specific miRNAs or TFs regulating the network. The functional context where the network operates can be explored by enrichment-analysis of tissue-specific genes [58] or cell-type specificity of 73 the network genes using publicly available databases of expression profiles [59] or gene expression atlas (e.g. Gene Expression Omnibus repository: http:// www.ncbi.nlm.nih.gov/geo/). These analyses can elucidate the relative contribution of specific cell types in determining the observed co-expression patterns, which represents a crucial step to understand (and test experimentally) the molecular mechanisms underling the regulatory network. To link co-expression networks with disease it is often necessary to integrate multiple layers of information, including details on the regulatory control (e.g. TFBS analysis), functional context (e.g. celltype specificity) and genetic susceptibility data (for instance using GWAS, as shown earlier) (Figure 2B). In addition, genetic (sequence variation) data can be integrated with transcriptional data (coexpression network) to identify genetic control points of the regulatory network. In other words, the network expression can be treated as a complex multivariate trait that can be linked to underlying sequence variation, for instance using standard quantitative trait locus (QTL) mapping methods (Figure 2C). To this aim, several multivariate mapping strategies such as Manova [60] or Bayesian variable selection algorithms [61, 62] have been used to identify genetic control points of co-expression networks, also referred to as ‘master genetic regulators’ of the networks. Because it gathers information across a large number of genes, the study co-expression networks can be informative even with very small sample size (n 20) [63]. However, the robustness of the analysis will depend on the kind of information that is being extracted from the network. For instance, enrichment analyses (e.g. TFBS and GWAS) will be robust to mild misclassification of genes and therefore will be able to provide insights even from networks derived from very small number of samples. In contrast, accurate reverse engineering of the network is particularly sensitive to the sample size and will likely require hundreds of samples to ensure a robust inference of the network [64]. Finding the hubs of the networks, however, can be performed even with networks inferred from relatively small sample sizes because the accuracy of the estimated genes connectivity will increase will the number of genes in the network. In all cases, bootstrapping methods may be used to assess the robustness of predicted edges in the network and other topological properties. 74 Rotival and Petretto One example of successful integration of coexpression networks with biological information that led to identification of new genes underlying cardiovascular disease is the discovery of endonuclease G (EndoG) [65]. Although EndoG was initially implicated in apoptosis (but not hypertrophy), McDermott-Roe et al. [65] demonstrated a role for EndoG in regulating left ventricular mass and mitochondrial function using co-expression network analysis in the human heart (Figure 3). The co-expression network analysis identified a mitochondrial function-enriched gene network around EndoG (and where EndoG is one the most-highly interconnected gene hubs, Figure 3), which was crucial to define the EndoG role in fundamental mitochondrial processes that are unrelated to apoptosis. This new functional role inferred by the network-analysis was confirmed by experiments in the EndoG knockout mouse showing decreased mitochondrial activity leading to reduced cardiac mass. Another application of gene co-expression networks and integration with analysis of regulatory control by TFs has led to the discovery of Ebi2 as a ‘master genetic regulator’ of an anti-viral expression network and type 1 diabetes (T1D) risk [66]. In their study, Heinig etal. hypothesized that loci that perturb gene co-expression networks can be important for disease risk. Under this hypothesis, they carried out genome-wide co-expression analysis in seven rat tissues and identified a co-expression network that contained several targets of the IRF7 TF, a master regulator of the type 1 interferon response [67], and that was under genetic control by the Ebi2 gene in the rat (Figure 4). Using comparative genomics approaches, Heinig et al. showed that the human orthologue of rat Ebi2 was located at a T1Dsusceptibility locus and the network itself was highly enriched for T1D susceptibility variants by GWAS, supporting a role of the IRF7-driven (IDIN) network and Ebi2 in the etiology of T1D. Cell-type enrichment analysis showed that the IDIN network was highly and specifically expressed in immune cells, which was supported by independent replication of the network in human monocytes. This further suggested that Ebi2 might affect T1D susceptibility (and gene co-expression in complex tissues like the heart, where the network was initially discovered) by controlling migration of immune cells from the blood to the tissues, as shown in other studies [68]. The discovery of Ebi2 as a ‘master genetic regulator’ of inflammatory response and macrophage infiltration leading to increased risk of T1D in humans also provides an example of successful application of network-based approaches to the identification of novel disease-susceptibility genes (Figure 4). FUTURE DIRECTIONS FOR THE STUDY OF GENE CO-EXPRESSION NETWORKS An important caveat to the inference of regulatory mechanisms from co-expression networks arise from the fact that in many cases, the regulation of TF activity occurs via post-translational modifications of the TF. Since such modifications are not captured at the transcriptional level, the regulatory connections between the TFs and their targets may be missed when looking at the transcriptome only. To circumvent this limitation, the use of existing biological knowledge and/or external data to inform the search of coexpression modules will undoubtedly be one of the major area of development in co-expression analysis in the next years. The fact that the best performing methods are those taking prior knowledge on TFs target regulation into account [43] will indeed encourage to go further in this direction. In this respect, the Bayesian network framework have been shown to be an efficient way to include prior knowledge, for instance using protein–protein interaction network data or gene ontologies [69]. However, Bayesian network approaches tend to suffer from a fairly high running time that often makes their application to large-scale genomics impractical for analysis of complex networks. Multiple types of genomic data (miRNA, proteomic data) can be also integrated for more efficient network analysis, as shown by Zhang et al. [70], which used a NMF-based framework and combined predicted gene–gene and gene–miRNA interactions with gene and miRNA expression data to identify gene networks under regulatory control by miRNAs. Another current challenge for co-expression network approaches is in the necessity to adapt to the growing number of high-throughput mRNA sequencing (RNA-seq) data for measuring the transcriptome. Indeed, first attempts have already suggested that due to the better dynamic range provided by RNA-seq data (as compared with traditional microarray data), it was possible to identify networks with high interconnectivity (i.e. more gene–gene connections in each module), heterogeneity (fewer hubs in each module) and centrality (stronger hubs) [71]. Moreover, Gene co-expression networks for analysis of complex trait regulation 75 Gene co-expression network in the human heart Relative degree of interconnectivity within the co-expression network 1 ENDOG 14 mitochondrial gene function Unique co-expression of ENDOG with oxidative metabolism genes Mitochondrial processes (unrelated to apoptosis) Cardiac hypertrophy Figure 3: Gene co-expression network analysis informs new gene function. EndoG gene was first found to regulate cardiac hypertrophy in the rat heart. Next, the co-expression network reconstructed in the human heart identified the EndoG gene as one the major hubs in the network (i.e. gene with high degree of interconnectivity within the network). Furthermore, the co-expression network was highly enriched for genes annotated for mitochondrial function (highlighted in darker color), suggesting a similar biological role for EndoG, which was previously unappreciated. This new inference on EndoG function in fundamental (non-apoptotic) processes uncovered the link between mitochondrial dysfunction and heart disease. RNA-seq allows the study of different splicing events and strand-specific expression and will therefore require ad hoc methods and statistical modeling to take this additional layer of information into account for network inference. The development of approaches aiming to include splicing regulation in network analysis is still at an early stage. Current attempts include the identification of modules of cospliced genes, by correlating quantitative measures of exon splicing (measured as the frequency of inclusion of the exon in the final transcript) and extracting modules of exons whose splicing patterns are correlated by NMF approaches [72] and the use of canonical correlation analysis to define exon level measures of gene– gene dependence [73]. It is likely that in the next few years the increasing availability of genomic data from multiple sources (e.g. high-throughput sequencing of mRNA and 76 Rotival and Petretto Rat Humans Genome-wide co-expression network analysis Genome-wide TF binding site analysis GWAS data for type 1 diabetes IRF7-driven inflammatory gene network T1D associated gene IRF7 genetic control rat Ebi2 rat chromosome 15q25 human EBI2 synteny human chromosome 13q32 Figure 4: Master genetic regulators of co-expression networks across rats and humans. The identification of an IRF7-regulated co-expression network in the rat was driven by integrated genome-wide co-expression analyses and TFBS predictions in seven rat tissues. The network was found to be under master genetic control by Ebi2 gene in the rat, and thus suggesting a role of this gene in regulation of the inflammatory response. Comparative genomics approaches confirmed that the syntenic human locus that encoded for rat Ebi2 gene controlled a similar inflammatory gene co-expression network in human monocytes. Finally, enrichment analysis by GWAS data showed the involvement of the interferon-regulated network in type 1 diabetes, and further suggested a role of the IRF7driven (IDIN) network and its ‘master genetic regulator’ (EBI2) in the etiology of type 1 diabetes. DNA, quantitative protein abundance and protein– protein interaction data) combined with refined gene ontology and TFs annotations will offer new opportunities to infer reliable gene co-expression networks, and help make sense out of them. Our ability to integrate these heterogeneous and highdimensional data via network modeling might then represent a decisive step toward a better understanding of the regulation of cardiovascular traits and disease. Key Points Analysis of gene co-expression networks can provide useful insights into the complex regulatory networks and mechanisms underlying disease. Gene co-expression networks can be causal or reactive to the disease process, often reflecting context-specific functions and regulation. There is a plethora of tools for network analysis, yet each approach has unique limitations and advantages, which make their application highly dependent on the specific study design and data structure. Integration of co-expression networks with genetic and generegulation data is a powerful mean to identify master genetic regulators for pathological pathways and molecular processes underlying disease. Acknowledgements The authors acknowledge the funding from the European Community’s Seventh Framework Programme (FP7/20072013) under grant agreement no. HEALTH-F4-2010-241504 (EURATRANS) and from the Medical Research Council. References 1. Schwanhäusser B, Busse D, Li N, et al. Global quantification of mammalian gene expression control. Nature 2011;473: 337–42. Gene co-expression networks for analysis of complex trait regulation 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. Lukk M, Kapushesky M, Nikkilä J, et al. A global map of human gene expression. Nat Biotechnol 2010;28:322–4. Lu P, Vogel C, Wang R, et al. Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat Biotechnol 2007;25: 117–24. Nicolae DL, Gamazon E, Zhang W, et al. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet 2010;6: e1000888. Birchler JA, Veitia RA. The gene balance hypothesis: implications for gene regulation, quantitative traits and evolution. New Phytol 2010;186:54–62. Carlson MRJ, Zhang B, Fang Z, et al. Gene connectivity, function, and sequence conservation: predictions from modular yeast co-expression networks. BMC Genomics 2006;7:40. Babu MM, Luscombe NM, Aravind L, et al. Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol 2004;14:283–91. Barabási A-L, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nat Rev Genet 2011;12:56–68. Liu Y-Y, Slotine J-J, Barabási A-L. Controllability of complex networks. Nature 2011;473:167–73. Shen-Orr SS, Milo R, Mangan S, et al. Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 2002;31:64–8. Schadt EE, Friend SH, Shaywitz DA. A network view of disease and compound screening. Nat Rev Drug Discov 2009; 8:286–95. Hemani G, Knott S, Haley C. An evolutionary perspective on epistasis and the missing heritability. PLoS Genet 2013;9: e1003295. McKinney BA, Pajewski NM. Six degrees of epistasis: statistical network models for GWAS. Front Genet 2011;2:109. Naukkarinen J, Surakka I, Pietiläinen KH, et al. Use of genome-wide expression data to mine the ‘Gray Zone’ of GWA studies leads to novel candidate obesity genes. PLoS Genet 2010;6:e1000976. Langfelder P, Mischel PS, Horvath S. When is hub gene selection better than standard meta-analysis? PLoS One 2013;8:e61505. Weng L, Macciardi F, Subramanian A, et al. SNP-based pathway enrichment analysis for genome-wide association studies. BMC Bioinformatics 2011;12:99. Zhang K, Chang S, Cui S, et al. ICSNPathway: identify candidate causal SNPs and pathways from genome-wide association study by one analytical framework. Nucleic Acids Res 2011;39:W437–43. Van’t Veer LJ, Dai H, van de Vijver MJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002;415:530–6. Wall ME, Rechtsteiner A, Rocha LM. Singular value decomposition and principal component analysis. In: Berrar DP, Dubitzky W, Granzow M (eds). A Practical Approach to Microarray Data Analysis. Norwell, MA: Kluwer, 2003:91–109. Liebermeister W. Linear modes of gene expression determined by independent component analysis. Bioinformatics 2002;18:51–60. 77 21. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401: 788–91. 22. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559. 23. Tesson B, Breitling R, Jansen R. DiffCoEx: a simple and sensitive method to find differentially coexpressed gene modules. BMC Bioinformatics 2010;11:497. 24. Alter O, Brown PO, Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci U S A 2000;97:10101–6. 25. Rotival M, Zeller T, Wild PS, et al. Integrating genomewide genetic variations and monocyte expression data reveals trans-regulated gene modules in humans. PLoS Genet 2011;7:e1002367. 26. Brunet J-P, Tamayo P, Golub TR, et al. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A 2004;101:4164–9. 27. Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. J Comput Graph Stat 2006;15:265–86. 28. Lee CM, Mudaliar MAV, Haggart DR, et al. Simultaneous non-negative matrix factorization for multiple large scale gene expression datasets in toxicology. PLoS One 2012;7: e48238. 29. Ponnapalli SP, Saunders MA, Van Loan CF, et al. A higherorder generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS One 2011;6:e28072. 30. Hardin J, Mitani A, Hicks L, et al. A robust measure of correlation between two genes on a microarray. BMC Bioinformatics 2007;8:220. 31. Meyer PE, Lafitte F, Bontempi G. minet: A R/ Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics 2008; 9:461. 32. Fuller TF, Ghazalpour A, Aten JE, et al. Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome 2007;18:463–72. 33. Amar D, Safer H, Shamir R. Dissection of regulatory networks that are altered in disease via differential co-expression. PLoS Comput Biol 2013;9:e1002955. 34. Margolin AA, Nemenman I, Basso K, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 2006;7(Suppl 1):S7. 35. Faith JJ, Hayete B, Thaden JT, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 2007;5:e8. 36. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008;9: 432–41. 37. Schäfer J, Opgen-Rhein R, Strimmer K. Reverse engineering genetic networks using the GeneNet package. J Am Stat Assoc 2001;96:1151–60. 38. Allen JD, Xie Y, Chen M, et al. Comparing statistical methods for constructing large scale gene networks. PLoS One 2012;7:e29348. 39. Werhli AV, Grzegorczyk M, Husmeier D. Comparative evaluation of reverse engineering gene regulatory networks 78 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. Rotival and Petretto with relevance networks, graphical gaussian models and bayesian networks. Bioinformatics 2006;22:2523–31. Mahdi R, Madduri AS, Wang G, et al. Empirical Bayes conditional independence graphs for regulatory network recovery. Bioinformatics 2012;28:2029–36. Huynh-Thu VA, Irrthum A, Wehenkel L, et al. Inferring regulatory networks from expression data using tree-based methods. PLoS One 2010;5:e12776. Küffner R, Petri T, Tavakkolkhah P, et al. Inferring gene regulatory networks by ANOVA. Bioinformatics 2012;28: 1376–82. Marbach D, Costello JC, Küffner R, et al. Wisdom of crowds for robust gene network inference. Nat Methods 2012;9:796–804. Zhang S, Ning X, Zhang X-S. Identification of functional modules in a PPI network by clique percolation clustering. Comput Biol Chem 2006;30:445–51. Hansen P, Delattre M. Complete-link cluster analysis by graph coloring. J Am Stat Assoc 1978;73:397–403. Frey BJ, Dueck D. Clustering by passing messages between data points. Science 2007;315:972–6. Langfelder P, Zhang B, Horvath S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 2008;24:719–20. Xu X, Yuruk N, Feng Z, etal. SCAN: a structural clustering algorithm for networks. In: Proceedings of the 13th ACM SIGKDD International Conference Knowledge Discovery and Data Mining. New York: ACM, 2007:824–33. Yip AM, Horvath S. Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics 2007;8:22. Zhang B, Li H, Riggins RB, et al. Differential dependency network analysis to identify condition-specific topological changes in biological networks. Bioinformatics 2009;25: 526–32. Gill R, Datta S, Datta S. A statistical framework for differential network analysis from microarray data. BMC Bioinformatics 2010;11:95. Li W, Liu C-C, Zhang T, et al. Integrative analysis of many weighted co-expression networks using tensor computation. PLoS Comput Biol 2011;7:e1001106. Roider HG, Manke T, O’Keeffe S, et al. PASTAA: identifying transcription factors associated with sets of co-regulated genes. Bioinformatics 2009;25:435–42. Heinz S, Benner C, Spann N, et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell 2010;38:576–89. Elemento O, Slonim N, Tavazoie S. A universal framework for regulatory element discovery across all genomes and data types. Mol Cell 2007;28:337–50. Griffiths-Jones S, Grocock RJ, van Dongen S, et al. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 2006;34:D140–44. 57. Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 2005;120: 15–20. 58. Kouadjo KE, Nishida Y, Cadrin-Girard JF, et al. Housekeeping and tissue-specific genes in mouse tissues. BMC Genomics 2007;8:127. 59. Shoemaker JE, Lopes TJS, Ghosh S, et al. CTen: a web-based platform for identifying enriched cell types from heterogeneous microarray data. BMC Genomics 2012; 13:460. 60. Shen Y, Lin Z, Zhu J. Shrinkage-based regularization tests for high-dimensional data with application to gene set analysis. Comput Stat Data Anal 2011;55:2221–33. 61. Bottolo L, Petretto E, Blankenberg S, et al. Bayesian detection of expression quantitative trait loci hot spots. Genetics 2011;189:1449–59. 62. Scott-Boyer MP, Imholte GC, Tayeb A, etal. An integrated hierarchical Bayesian model for multivariate eQTL mapping. Stat Appl Genet Mol Biol 2012;11. 63. Zhao W, Langfelder P, Fuller T, et al. Weighted gene coexpression network analysis: state of the art. J Biopharm Stat 2010;20:281–300. 64. Margolin AA, Wang K, Lim WK, etal. Reverse engineering cellular networks. Nat Protoc 2006;1:662–71. 65. McDermott-Roe C, Ye J, Ahmed R, et al. Endonuclease G is a novel determinant of cardiac hypertrophy and mitochondrial function. Nature 2011;478:114–8. 66. Heinig M, Petretto E, Wallace C, et al. A trans-acting locus regulates an anti-viral expression network and type 1 diabetes risk. Nature 2010;467:460–4. 67. Honda K, Yanai H, Negishi H, et al. IRF-7 is the master regulator of type-I interferon-dependent immune responses. Nature 2005;434:772–7. 68. Hannedouche S, Zhang J, Yi T, et al. Oxysterols direct immune cell migration via EBI2. Nature 2011;475: 524–7. 69. Qin T, Tsoi LC, Sims KJ, etal. Signaling network prediction by the Ontology Fingerprint enhanced Bayesian network. BMC Syst Biol 2012;6(Suppl 3):S3. 70. Zhang S, Li Q, Liu J, et al. A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. Bioinformatics 2011;27:i401–9. 71. Iancu OD, Kawane S, Bottomly D, et al. Utilizing RNASeq data for de novo coexpression network inference. Bioinformatics 2012;28:1592–7. 72. Dai C, Li W, Liu J, et al. Integrating many co-splicing networks to reconstruct splicing regulatory modules. BMC Syst Biol 2012;6(Suppl 1):S17. 73. Hong S, Chen X, Jin L, et al. Canonical correlation analysis for RNA-seq co-expression networks. Nucleic Acids Res 2013;41:e95.
© Copyright 2026 Paperzz