Causal modelling of relationships between genotype and phenotype using multi-omics data
Heather J. Cordell
Institute of Genetic Medicine
Newcastle University, UK

Genome-wide association studies (GWAS)
  Highly successful approach over the past 10 years
  Enabled by advances in microarray-based genotyping technologies
    Allowing us to measure between 500,000 and 4 million genetic variants in each individual
  Scan through the genome looking for variants that correlate with phenotype (e.g. disease status)

Heather Cordell (IGM) Causal modelling of multi-omics data 2 / 27

Disappointment with GWAS
  Critics point out that GWAS findings:
    Account for little of the ‘known’ heritability
    Confer small increases in disease risk (ORs ≈ 1.2–1.3)
    In most cases do not (yet) identify the underlying functional variant(s)
  We need to move towards understanding why a particular allele (or set of alleles) increases disease risk

Causal modelling
  It has become popular to take ‘interesting’ SNPs from GWAS and cross-reference them with publicly available data that provide supporting evidence for the loci identified
    For example, does a GWAS-associated SNP associate with gene expression or DNA methylation in a relevant tissue?
    Does it lie in a genomic region with particular features, such as histone modifications, open chromatin or TF binding?
    Making use of resources such as ENCODE, GTEx, MuTHER and the Roadmap Epigenomics Project
  Ideally, we would have such data (e.g. gene expression, DNA methylation, proteomics, metabolites) measured in the same set of individuals for whom we have GWAS data
    Or, at least, in a subset of these individuals
  This allows us to employ causal inference techniques to investigate causal pathways leading towards disease progression

Previous approaches: pairwise filtering
  Filtering approach
    Filter (to reduce the number of variables considered) based on pairwise correlations
    Then perform a causal inference test on all possible trios of variables
  E.g. Liu et al. (2012) performed an EWAS to select differentially methylated probes (DMPs) associated with rheumatoid arthritis (RA)
    Followed by GWAS to identify significant SNP-DMP pairs
    Then used the causal inference test (CIT) on the resulting 4016 SNP-DMP-RA trios
  Shin et al. (2014) performed GWAS to identify SNPs associated with lipid traits
    Then used correlation tests to identify significant metabolite-lipid associations
    Followed by Mendelian randomization (MR) and structural equation modelling (SEM) of the 38 resulting SNP-MET-LIP trios

Previous approaches: multi-variable
  Bayesian networks (Zhu, Schadt et al. 2004; 2012)
    RIMBANet software
    Applied to gene-expression and metabolite data
    With a causality test on the basis of previous eQTL experiments used as a prior for directing edges
    Carried out on trios of one genetic variable and two expression measures

Comparison of causal inference methods
  For elucidating the causal relationships between 3 variables
  We assume (for now) that we have measurements of the following:
    a genetic variant, e.g. a SNP (S)
    a phenotype of interest (P)
    an intermediate trait, e.g. gene expression (G)
  We also consider an unmeasured common environmental effect E
  We assume filtering has been performed to reduce the number of trios of variables (S, G, P) to consider
  We can encode hypothesised causal scenarios via path diagrams

Potential causal scenarios
  [Figure: path diagrams of twelve potential causal scenarios, panels (a) to (l), relating S, G and P, some including the unmeasured effect E; diagrams not recoverable from the extracted text]

Simulation study
  Assume an underlying causal scenario
  Simulate many datasets containing 1000 observations of S, G, P
    Plus the unmeasured common environmental effect E
  Assume we do not know the simulation model, and assess how various methods perform for different simulated scenarios

Causal inference methods (1)
  Mendelian Randomization (MR) (Davey Smith and Ebrahim 2003)
    Used to estimate a causal effect of an intermediate variable G on variable P in the presence of confounders (E)
    S is associated with G
    S is assumed independent of P (except through G)
    S is used as an ‘instrument’ to anchor the direction of causality
  The Causal Inference Test (CIT) (Millstein et al. 2009)
    Statistical test to infer whether G acts as a mediator (and is the only causal link) between a genetic locus (S) and a quantitative trait (P)
    A chain of four mathematical conditions that must be satisfied to conclude causality
    Implemented in R package cit

Causal inference methods (2)
  Structural Equation Modelling (SEM)
    Long history based on confirmatory factor analysis and path analysis
    Uses hypothesised path diagram to construct a system of linear equations
    Parameters of the model estimated using maximum likelihood
    Compare the sample and predicted covariance matrices
    Implemented in R packages such as sem or lavaan
  Bayesian networks
    Graphical representations of probabilistic relationships between variables
    R packages deal and bnlearn
    Score-based algorithms for learning the structure of the network given the observed data
  Bayesian Unified Framework (BUF) (Stephens 2013)
    Developed as a framework for analysing multivariate phenotypes, e.g. in GWAS
    Uses Bayesian multivariate regression
    Partitions outcome variables (P, G) into sets γ = (U, D, I) with respect to a predictor variable (S)
      U = unassociated
      D = directly associated
      I = indirectly associated
    Software package mvBIMBAM

Results: MR and CIT
  MR and CIT are specifically designed for testing the causal relationship G → P (MR) or S → G → P (CIT)
  [Figure: MR and CIT results for scenarios (a) to (l); not recoverable from the extracted text]

Results: SEM and BUF, simple scenarios
  [Figure: SEM and BUF results for simulated scenarios (a), (b), (c), (f) and (g); horizontal axes labelled a to l (SEM) and a to m (BUF), vertical axes from 0.0 to 0.9; plots not recoverable from the extracted text]

Results: SEM and BUF, more complex scenarios
  [Figure: SEM and BUF results for simulated scenarios (b) and (i); vertical axes from 0.0 to 1.0; plots not recoverable from the extracted text]

Results: more complex scenarios (all methods)
  [Figure: results for all methods across scenarios (a) to (l); not recoverable from the extracted text]

Effect of changing model
  Changing the relative size of the common environmental effect [figure not recoverable from the extracted text]
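
As a rough illustration of why the size of the common environmental effect matters, the sketch below simulates one scenario (S → G → P, with the unmeasured E acting on both G and P) and compares a naive regression of P on G with a Wald-ratio MR estimate that uses S as the instrument. The effect sizes, allele frequency, number of replicates and all function names are illustrative assumptions rather than the settings used in the study, and the Wald ratio is simply one standard MR estimator chosen for brevity.

```python
import numpy as np


def simulate_trio(n=1000, beta_sg=0.5, beta_gp=0.3, beta_e=0.5, maf=0.3, seed=0):
    """Simulate S -> G -> P with an unmeasured common effect E on G and P.

    All parameter values are hypothetical choices for illustration.
    """
    rng = np.random.default_rng(seed)
    s = rng.binomial(2, maf, size=n).astype(float)  # SNP coded as 0/1/2 allele count
    e = rng.normal(size=n)                          # unmeasured common environment
    g = beta_sg * s + beta_e * e + rng.normal(size=n)
    p = beta_gp * g + beta_e * e + rng.normal(size=n)
    return s, g, p                                  # E is deliberately not returned


def slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def naive_and_mr(s, g, p):
    """Return (naive slope of P on G, Wald-ratio MR estimate of G -> P)."""
    naive = slope(g, p)
    wald = slope(s, p) / slope(s, g)  # ratio of SNP-P to SNP-G associations
    return naive, wald


def average_estimates(beta_e, n_datasets=200, **kwargs):
    """Average both estimators over many simulated datasets."""
    estimates = np.array([
        naive_and_mr(*simulate_trio(beta_e=beta_e, seed=i, **kwargs))
        for i in range(n_datasets)
    ])
    return estimates.mean(axis=0)


if __name__ == "__main__":
    # The true G -> P effect is 0.3 in every setting; only the size of E changes.
    for beta_e in (0.0, 0.5, 1.0):
        naive, wald = average_estimates(beta_e)
        print(f"beta_e={beta_e:.1f}  naive={naive:.3f}  MR (Wald ratio)={wald:.3f}")
```

Under these assumptions the naive slope drifts away from the simulated G → P effect as the effect of E grows, while the Wald ratio stays close to it because S is independent of E. This sketch does not cover scenarios in which the MR assumptions themselves are violated.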
  Changing the relative size of the S → G effect [figure not recoverable from the extracted text]

Lessons learned from simulation study
  MR and the CIT are both successful at identifying causal relationships between G and P when their assumptions are not violated
    But not really designed for distinguishing between different causal structures
  Other methods such as BUF/SEM/BNs can test for a wider set of causal relationships with various degrees of automation
    For simple scenarios, these methods successfully identify the correct causal model the majority of the time
  The presence of an unknown/unmeasured common environmental effect can lead to incorrect inferences
  Scalability to more than 3 variables?

Bi-directional Mendelian Randomization study
  For determining causal (?) relationship between metabolites and BMI
  Metabolites selected if both associated with BMI and had an associated genetic ‘instrument’ available
  Allele score from known BMI-associated SNPs used as a genetic instrument for BMI

Inference from Bayesian Network analysis
  Fitted Bayesian Networks modelling all 5 variables
    On sub-samples of the data consisting of unrelated twins
    Original MR had used random effects to model twin relatedness
  Results supported a causal effect of metabolites on BMI with probabilities 0.86 (EPA) and 0.87 (DGLA)

Removing genetic anchors
  [Slide content not recoverable from the extracted text]

Simulation study
  Table 1. Simulation scenarios and parameter values considered
  Columns: Simulation scenario; Equations describing simulated data; Parameter values considered
  Simulation scenarios:
    1. No confounding
    2. Non-genetic confounding
    3. Genetic confounding (pleiotropy)
    4. Both genetic and non-genetic confounding
  [Table body garbled in extraction: the equations and parameter symbols are not recoverable. The surviving values show one parameter varied over 0 (null), 0.2, -0.2, 0.4 and -0.4 in each scenario, other parameters with values of magnitude 0.1 or 0.2, and terms written as "~ (0, 1)" whose distribution symbol is lost.]

Behaviour of MR and BNs
  [Slide content not recoverable from the extracted text]

Performance of MR and BNs
  [Slide content not recoverable from the extracted text]

Conclusions
  Formal approaches to causal inference such as structural equation modelling and Bayesian Networks can offer some advantages compared to Mendelian Randomization and the CIT
    Or at least offer a complementary approach
  Seem to work well in simple scenarios
    Potentially extendable to more complex scenarios and larger numbers of variables
    Further work needed to evaluate their performance
  Filtering or statistical approaches to induce sparsity probably required in application to ‘omics’-scale data sets
    Penalization/regularization approaches or Bayesian analysis with sparsity-inducing priors?
  Bayesian Networks combined with sampling approaches (e.g. MCMC) arguably the most promising avenue for exploration
    Particularly for dealing with data sets where only a proportion of individuals are measured on all variables

Acknowledgements
  Holly Ainsworth
  So-Youn Shin (University of Bristol)
  The Wellcome Trust (Grants 087436/Z/08/Z and 102858/Z/13/Z)
  The Oak Foundation