Causal modelling of relationships between genotype and
phenotype using multi-omics data
Heather J. Cordell
Institute of Genetic Medicine
Newcastle University, UK
Genome-wide association studies (GWAS)
Highly successful approach over the past 10 years
Enabled by advances in microarray-based genotyping technologies
Allowing us to measure between 500,000 and 4 million genetic variants
in each individual
Scan through genome looking for variants that correlate with
phenotype (e.g. disease status)
Heather Cordell (IGM)
Causal modelling of multi-omics data
2 / 27
Disappointment with GWAS
Critics point out that GWAS findings:
Account for little of the ‘known’ heritability
Confer small increases in disease risk (ORs ≈ 1.2–1.3)
In most cases do not (yet) identify the underlying functional variant(s)
We need to move towards understanding why a particular allele
(or set of alleles) increases disease risk
Causal modelling
It has become popular to take ‘interesting’ SNPs from GWAS and
cross reference them with publicly-available data that provide
supporting evidence for loci identified
For example, does a GWAS-associated SNP associate with gene
expression or DNA methylation in a relevant tissue?
Does it lie in a genomic region with particular features, such as histone
modifications, open chromatin, TF binding?
Making use of resources such as ENCODE, GTEx, MuTHER,
Roadmap Epigenomics Project
Ideally, such data (e.g. gene expression, DNA methylation,
proteomics, metabolites) would be measured in the same set
of individuals for whom we have GWAS data
Or, at least, in a subset of these individuals
Allows us to employ causal inference techniques to investigate causal
pathways leading towards disease progression
Previous approaches: pairwise filtering
Filtering approach
Filter (to reduce the number of variables considered) based on pairwise
correlations
Then perform causal inference test on all possible trios of variables
E.g. Liu et al. (2012) performed EWAS to select differentially
methylated probes (DMPs) associated with rheumatoid arthritis (RA)
Followed by GWAS to identify significant SNP-DMP pairs
Then used causal inference test (CIT) on resulting 4016
SNP-DMP-RA trios
Shin et al. (2014) performed GWAS to identify SNPs associated with
lipid traits
Then used correlation tests to identify significant metabolite-lipid
associations
Followed by Mendelian randomization (MR) and structural equation
modelling (SEM) of 38 resulting SNP-MET-LIP trios
Previous approaches: multi-variable
Bayesian networks (Zhu, Schadt et al. 2004; 2012)
RIMBANet software
Applied to gene-expression and metabolite data
With a causality test on the basis of previous eQTL experiments used
as a prior for directing edges
Carried out on trios of one genetic variable and two expression measures
Comparison of causal inference methods
For elucidating the causal relationships between 3 variables
We assume (for now) that we have measurements of the
following:
a genetic variant e.g. SNP (S)
a phenotype of interest (P)
an intermediate trait e.g. gene expression (G)
We also consider an unmeasured common environmental effect E.
We assume filtering has been performed to reduce the number of trios
of variables (S, G, P) to consider
We can encode hypothesised causal scenarios via path diagrams e.g.
Potential causal scenarios
[Figure: path diagrams of twelve candidate causal scenarios, panels (a)–(l), relating S, G and P; some panels also include the unmeasured environmental effect E]
Simulation Study
Assume an underlying causal scenario
Simulate many datasets containing 1000 observations of S, G, P
Plus the unmeasured common environmental effect E
Assume we do not know the simulation model, and assess how various
methods perform for different simulated scenarios.
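The simulation step above can be sketched as follows. This is a minimal illustration of a single replicate under one scenario (S → G → P with E acting on both G and P), not the study's actual simulation code. The function names, the allele frequency and all effect sizes are assumptions chosen for illustration only.

```python
import random

def simulate_trio(n=1000, maf=0.3, b_sg=0.5, b_gp=0.5, b_e=0.5, seed=1):
    """Simulate S -> G -> P with an unmeasured E acting on G and P.

    The allele frequency and effect sizes are illustrative assumptions.
    """
    rng = random.Random(seed)
    data = {"S": [], "G": [], "P": []}
    for _ in range(n):
        # Genotype: count of minor alleles (0, 1 or 2), Hardy-Weinberg assumed
        s = int(rng.random() < maf) + int(rng.random() < maf)
        e = rng.gauss(0, 1)                       # unmeasured environment
        g = b_sg * s + b_e * e + rng.gauss(0, 1)  # intermediate trait
        p = b_gp * g + b_e * e + rng.gauss(0, 1)  # phenotype
        data["S"].append(s)
        data["G"].append(g)
        data["P"].append(p)
    return data  # E is not returned: the analyst never observes it

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# One replicate; a full study would loop over many seeds and scenarios
replicate = simulate_trio()
print(round(correlation(replicate["S"], replicate["G"]), 2))
```

Other scenarios would be obtained by changing which terms enter the equations for G and P, for example dropping the S term from G or adding a direct S term to P.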
Causal inference methods (1)
Mendelian Randomization (MR) (Davey Smith and Ebrahim 2003)
Used to estimate a causal effect of an intermediate variable G on
variable P in the presence of confounders (E)
S is associated with G
S is assumed independent of P (except through G)
S is used as an ‘instrument’ to anchor the direction of causality
The Causal Inference Test (CIT) (Millstein et al. 2009)
Statistical test to infer whether G acts as a mediator (and is the only
causal link) between a genetic locus (S) and a quantitative trait (P)
A chain of four mathematical conditions that must be satisfied to
conclude causality
Implemented in R package cit
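As a rough illustration of the MR idea, the sketch below uses the simple ratio (Wald) estimator, cov(S, P) / cov(S, G), on simulated data and compares it with a naive regression of P on G. The ratio estimator is one common way to carry out MR with a single instrument; it is used here as an assumption and is not necessarily the estimator used in the work described. All effect sizes and function names are hypothetical.

```python
import random

def cov(x, y):
    """Covariance of two equal-length sequences (divisor n)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def simulate(n, b_sg, b_gp, b_e, maf=0.3, seed=2):
    """S -> G -> P with an unmeasured confounder E of G and P."""
    rng = random.Random(seed)
    s, g, p = [], [], []
    for _ in range(n):
        si = int(rng.random() < maf) + int(rng.random() < maf)
        e = rng.gauss(0, 1)
        gi = b_sg * si + b_e * e + rng.gauss(0, 1)
        pi = b_gp * gi + b_e * e + rng.gauss(0, 1)
        s.append(si)
        g.append(gi)
        p.append(pi)
    return s, g, p

# Assumed true causal effect of G on P is 0.3; E confounds G and P
s, g, p = simulate(n=20000, b_sg=0.5, b_gp=0.3, b_e=0.7)

naive = cov(g, p) / cov(g, g)  # regression slope of P on G, biased by E
mr = cov(s, p) / cov(s, g)     # ratio estimate, using S as the instrument

print("naive slope:", round(naive, 2))
print("MR estimate:", round(mr, 2))
```

Because S is independent of E in this simulation, the ratio estimate should sit close to the assumed effect of 0.3, while the naive slope is inflated by the shared environmental term. If S also affected P directly, the MR assumption listed above would be violated and the ratio would no longer estimate the G → P effect.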
Causal inference methods (2)
Structural Equation Modelling (SEM)
Long history based on confirmatory factor analysis and path analysis
Uses hypothesised path diagram to construct system of linear
equations
Parameters of model estimated using maximum likelihood
Compare the sample and predicted covariance matrices
Implemented in R packages such as sem or lavaan
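The comparison of sample and predicted covariances can be illustrated for the chain model S → G → P. That model has no direct S → P path, so it predicts cov(S, P) = cov(S, G) · cov(G, P) / var(G); every other entry of the covariance matrix is reproduced exactly by the fitted path coefficients. The sketch below computes the gap between the sample and predicted value. It is a simplified stand-in for what packages such as sem or lavaan do (they minimise a likelihood-based discrepancy over the whole matrix), and the simulated effect sizes and function names are assumptions.

```python
import random

def cov(x, y):
    """Covariance of two equal-length sequences (divisor n)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def simulate(n, b_sp, maf=0.3, seed=3):
    """S -> G -> P, plus an optional direct S -> P effect of size b_sp."""
    rng = random.Random(seed)
    s, g, p = [], [], []
    for _ in range(n):
        si = int(rng.random() < maf) + int(rng.random() < maf)
        gi = 0.5 * si + rng.gauss(0, 1)
        pi = 0.5 * gi + b_sp * si + rng.gauss(0, 1)
        s.append(si)
        g.append(gi)
        p.append(pi)
    return s, g, p

def chain_misfit(s, g, p):
    """Sample cov(S, P) minus the value implied by the chain S -> G -> P."""
    b_sg = cov(s, g) / cov(s, s)          # fitted path coefficient S -> G
    b_gp = cov(g, p) / cov(g, g)          # fitted path coefficient G -> P
    implied_sp = b_sg * b_gp * cov(s, s)  # covariance predicted by the chain
    return cov(s, p) - implied_sp

misfit_chain = chain_misfit(*simulate(5000, b_sp=0.0))   # chain is correct
misfit_direct = chain_misfit(*simulate(5000, b_sp=0.5))  # chain is wrong

print("misfit when the chain holds:", round(misfit_chain, 3))
print("misfit with a direct S -> P effect:", round(misfit_direct, 3))
```

A misfit near zero is consistent with the hypothesised path diagram; a clearly non-zero misfit suggests that a path is missing from it.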
Bayesian networks
Graphical representations of probabilistic relationships between
variables
R packages deal and bnlearn
Score-based algorithms for learning the structure of the network given
the observed data
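A score-based comparison of candidate networks can be sketched with a BIC score under linear Gaussian local models. This is a simplified illustration, not the algorithm used by deal or bnlearn: the candidate list is fixed by hand rather than searched, S is treated as Gaussian for convenience, and no edge is allowed to point into S. The candidate names, simulated effect sizes and helper functions are all assumptions.

```python
import math
import random

def cov(x, y):
    """Covariance of two equal-length sequences (divisor n)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def residual_variance(data, child, parents):
    """Residual variance of child after linear regression on 0-2 parents."""
    y = data[child]
    vy = cov(y, y)
    if len(parents) == 0:
        return vy
    if len(parents) == 1:
        x = data[parents[0]]
        return vy - cov(x, y) ** 2 / cov(x, x)
    x1, x2 = data[parents[0]], data[parents[1]]
    a, b, d = cov(x1, x1), cov(x1, x2), cov(x2, x2)
    c1, c2 = cov(x1, y), cov(x2, y)
    explained = (d * c1 * c1 - 2 * b * c1 * c2 + a * c2 * c2) / (a * d - b * b)
    return vy - explained

def bic_score(data, dag):
    """Sum over nodes of maximised Gaussian log-likelihood minus BIC penalty."""
    n = len(next(iter(data.values())))
    total = 0.0
    for child, parents in dag.items():
        rv = residual_variance(data, child, parents)
        loglik = -0.5 * n * (math.log(2 * math.pi * rv) + 1)
        n_params = len(parents) + 2  # slopes, intercept, residual variance
        total += loglik - 0.5 * n_params * math.log(n)
    return total

# Simulated data in which the true structure is S -> G -> P (assumed effects)
rng = random.Random(5)
data = {"S": [], "G": [], "P": []}
for _ in range(5000):
    s = int(rng.random() < 0.3) + int(rng.random() < 0.3)
    g = 0.5 * s + rng.gauss(0, 1)
    p = 0.5 * g + rng.gauss(0, 1)
    data["S"].append(s)
    data["G"].append(g)
    data["P"].append(p)

candidates = {
    "S->G->P": {"S": [], "G": ["S"], "P": ["G"]},
    "S->P->G": {"S": [], "P": ["S"], "G": ["P"]},
    "S->G, S->P": {"S": [], "G": ["S"], "P": ["S"]},
    "S->G->P, S->P": {"S": [], "G": ["S"], "P": ["S", "G"]},
}
scores = {name: bic_score(data, dag) for name, dag in candidates.items()}
for name, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(value, 1))
```

Higher scores indicate better-supported structures. Fixing S as a root plays the role of the genetic anchor: without it, the chain and its reversal would fit the data equally well under this kind of score.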
Causal inference methods (2)
Bayesian Unified Framework (BUF) (Stephens 2013)
Developed as a framework for analysing multivariate phenotypes
e.g. in GWAS
Uses Bayesian multivariate regression
Partitions outcome variables (P, G) into sets γ = (U, D, I) with
respect to a predictor variable (S)
U = unassociated
D = directly associated
I = indirectly associated
Software package mvBIMBAM
Results: MR and CIT
MR and CIT are specifically designed for testing causal relationship
G → P (MR) or S → G → P (CIT)
[Figure: path diagrams of scenarios (a)–(l), annotated with MR and CIT results]
Results: SEM and BUF, simple scenarios
[Figure: SEM and BUF results for simulated scenarios (a), (b), (c), (f) and (g). Horizontal axes are labelled a–l (SEM) and a–m (BUF); vertical axes run from 0.0 to 0.9]
Results: SEM and BUF, more complex scenarios
[Figure: SEM and BUF results for simulated scenarios (b) and (i). Horizontal axes are labelled with letters from a to n; vertical axes run from 0.0 to 1.0]
Results: more complex scenarios (all methods)
[Figure: path diagrams of scenarios (a)–(l), annotated with results for all methods]
Effect of changing model
Changing relative size of common environmental effect:
Effect of changing model
Changing relative size of S → G effect:
Lessons learned from simulation study
MR and the CIT are both successful at identifying causal relationships
between G and P when their assumptions are not violated.
But not really designed for distinguishing between different causal
structures.
Other methods such as BUF/SEM/BNs can test for a wider set of
causal relationships with various degrees of automation.
For simple scenarios, these methods successfully identify the correct
causal model the majority of the time.
The presence of an unknown/unmeasured common environmental effect
can lead to incorrect inferences.
Scalability to more than 3 variables?
Bi-directional Mendelian Randomization study
For determining causal (?) relationship between metabolites and BMI
Metabolites selected if both associated with BMI and had an
associated genetic ‘instrument’ available
Allele score from known BMI-associated SNPs used as a genetic
instrument for BMI
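The bi-directional design can be sketched on simulated data: one instrument for the metabolite, and an allele score for BMI, each used to estimate the effect in one direction. This is a hedged illustration only. The unweighted allele score, the ratio estimator, the number of SNPs, the effect sizes and the assumed true direction (metabolite → BMI) are all assumptions made for the example, not details of the study.

```python
import random

def cov(x, y):
    """Covariance of two equal-length sequences (divisor n)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def genotype(rng, maf=0.3):
    """Minor-allele count (0, 1 or 2) under Hardy-Weinberg proportions."""
    return int(rng.random() < maf) + int(rng.random() < maf)

def wald_ratio(instrument, exposure, outcome):
    """Ratio estimate of the causal effect of exposure on outcome."""
    return cov(instrument, outcome) / cov(instrument, exposure)

rng = random.Random(4)
met_snp, bmi_score, met, bmi = [], [], [], []
for _ in range(20000):
    sm = genotype(rng)                            # metabolite instrument
    sc = sum(genotype(rng) for _ in range(10))    # unweighted BMI allele score
    u = rng.gauss(0, 1)                           # unmeasured confounder
    m = 0.5 * sm + 0.5 * u + rng.gauss(0, 1)      # metabolite level
    b = 0.3 * m + 0.1 * sc + 0.5 * u + rng.gauss(0, 1)  # BMI, caused by m
    met_snp.append(sm)
    bmi_score.append(sc)
    met.append(m)
    bmi.append(b)

forward = wald_ratio(met_snp, met, bmi)    # metabolite -> BMI
reverse = wald_ratio(bmi_score, bmi, met)  # BMI -> metabolite

print("metabolite -> BMI:", round(forward, 2))
print("BMI -> metabolite:", round(reverse, 2))
```

In this simulated setting the forward estimate should be close to the assumed effect of 0.3 and the reverse estimate close to zero, which is the asymmetry a bi-directional analysis looks for.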
Inference from Bayesian Network analysis
Fitted Bayesian Networks modelling all 5 variables
On sub-samples of the data consisting of unrelated twins
Original MR had used random effects to model twin relatedness
Results supported a causal effect of metabolites on BMI with
probabilities 0.86 (EPA) and 0.87 (DGLA)
Removing genetic anchors
Simulation study
Table 1. Simulation scenarios and parameter values considered
Simulation scenarios:
1. No confounding
2. Non-genetic confounding
3. Genetic confounding (pleiotropy)
4. Both genetic and non-genetic confounding
Parameter values considered: 0 (null), 0.2, -0.2, 0.4, -0.4; 0.1; 0.2; (0.1, 0.1, -), (0.1, -0.1, -), (0.1, 0.1, 0.1), (0.1, 0.1, -0.1), (0.1, -0.1, 0.1), (0.1, -0.1, -0.1)
[Equations describing simulated data: symbols lost in extraction; random terms are specified as ~ (0, 1)]
Behaviour of MR and BNs
Performance of MR and BNs
Conclusions
Formal approaches to causal inference such as structural equation
modelling and Bayesian Networks can offer some advantages
compared to Mendelian Randomization and the CIT
Or at least offer a complementary approach
Seem to work well in simple scenarios
Potentially extendable to more complex scenarios and larger numbers
of variables
Further work needed to evaluate their performance
Filtering or statistical approaches to induce sparsity probably required
in application to ‘omics’ scale data sets
Penalization/regularization approaches or Bayesian analysis with
sparsity inducing priors?
Bayesian Networks combined with sampling approaches (e.g. MCMC)
arguably the most promising avenue for exploration
Particularly for dealing with data sets where only a proportion of
individuals are measured on all variables
Acknowledgements
Holly Ainsworth
So-Youn Shin (University of Bristol)
The Wellcome Trust (Grants 087436/Z/08/Z and 102858/Z/13/Z)
The Oak Foundation