Understanding the genetic basis of complex polygenic traits through Bayesian model selection of multiple genetic models and network modeling of family-based genetic data

Bae, Harold Taehyun
Boston University Theses & Dissertations, 2014
OpenBU: http://open.bu.edu
http://hdl.handle.net/2144/15337
BOSTON UNIVERSITY
GRADUATE SCHOOL OF ARTS AND SCIENCES
Dissertation
UNDERSTANDING THE GENETIC BASIS OF COMPLEX
POLYGENIC TRAITS THROUGH BAYESIAN MODEL SELECTION
OF MULTIPLE GENETIC MODELS AND NETWORK MODELLING
OF FAMILY-BASED GENETIC DATA
by
HAROLD T. BAE
B.A., Dartmouth College, 2005
M.S., Dartmouth College, 2008
Submitted in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
2014
Approved by
First Reader
Paola Sebastiani, Ph.D.
Professor of Biostatistics
Second Reader
Josee Dupuis, Ph.D.
Professor of Biostatistics
Third Reader
Stefano Monti, Ph.D.
Associate Professor of Medicine and Biostatistics
Acknowledgments
I would like to take the time to thank my dissertation committee for all their help through
this process. I would like to express my sincerest gratitude to my primary advisor, Dr.
Paola Sebastiani, for all her support and guidance throughout my time at Boston University. She has always helped me push myself in the right direction, learn how to approach
difficult problems from novel perspectives, and recognize my strengths and weaknesses
as a biostatistician. Working under the supervision of Dr. Sebastiani, I learned a lot both
professionally and personally. I would also like to thank the rest of my committee: Dr.
Josee Dupuis, Dr. Stefano Monti, Dr. Yorghos Tripodis and Dr. Thomas Perls. Thank
you for all your advice and comments on my dissertation.
I would also like to thank Dr. Martin Steinberg, with whom I started my research career,
for his guidance throughout my research. His passion towards research has always been a
motivating force for me. I would also like to thank Dr. Thomas Perls, who supported my
research in aging and longevity. Working with him, my passion to pursue future research
in aging solidified. I am also grateful to Dr. Adrienne Cupples, who brought me into
Boston University with the training grant support. The training grant meant a lot to me
and helped me finish my coursework swiftly.
A special thank you to my colleagues Nadia Solovieff, Jacqui Milton, and Steve Hartley:
thank you so much for helping me whenever I asked you random questions. Additionally,
I would like to thank all of my friends at Boston University: Carlee, Paula, Danielle, Sean,
Meredith, Han, Heather, Liz, Seung-hoan, Jae, and Avery.
I would like to thank my friends and family who helped me from afar. I would also like
to thank my parents and my sister. I would never have made it this far without their
unconditional love and support. Lastly, I would like to thank my wife Jaewon and my son
Minoh, to whom I dedicate my dissertation.
UNDERSTANDING THE GENETIC BASIS OF COMPLEX
POLYGENIC TRAITS THROUGH BAYESIAN MODEL SELECTION
OF MULTIPLE GENETIC MODELS AND NETWORK MODELLING
OF FAMILY-BASED GENETIC DATA
(Order No. )
HAROLD T. BAE
Boston University Graduate School of Arts and Sciences, 2014
Major Professor: Paola Sebastiani, Professor of Biostatistics
ABSTRACT
The global aim of this dissertation is to develop advanced statistical modeling to understand the genetic basis of complex polygenic traits. In order to achieve this goal, this
dissertation focuses on the development of (i) a novel methodology to detect genetic variants with different inheritance patterns formulated as a Bayesian model selection problem,
(ii) integration of genetic data and non-genetic data to dissect the genotype-phenotype associations using Bayesian networks with family-based data, and (iii) an efficient technique
to model the family-based data in the Bayesian framework.
In the first part of my dissertation, I present a coherent Bayesian framework for selection of the most likely model from the five genetic models (genotypic, additive, dominant,
codominant, and recessive) used in genetic association studies. The approach uses a polynomial parameterization of genetic data to simultaneously fit the five models and save
computations. I provide a closed-form expression of the marginal likelihood for normally
distributed data, and evaluate the performance of the proposed method and existing methods through simulated and real genome-wide data sets.
The second part of this dissertation presents an integrative analytic approach that
utilizes Bayesian networks to represent the complex probabilistic dependency structure
among many variables from family-based data. I propose a parameterization that extends
mixed effects regression models to Bayesian networks by using random effects as additional
nodes of the networks to model the between-subjects correlations. I also present results of
simulation studies to compare different model selection metrics for mixed models that can
be used for learning BNs from correlated data and application of this methodology to real
data from a large family-based study.
In the third part of this dissertation, I describe an efficient way to account for family
structure in Bayesian inference Using Gibbs Sampling (BUGS). In linear mixed models, a
random effects vector has a variance-covariance matrix whose dimension is as large as the
sample size. However, direct handling of this multivariate normal distribution is not computationally feasible in BUGS. Therefore, I propose a decomposition of this multivariate
normal distribution into univariate normal distributions using singular value decomposition, and I present its implementation in BUGS.
Contents

1 Introduction

2 Bayesian Polynomial Regression Models to Fit Multiple Genetic Models for Quantitative Traits
  2.1 Introduction
  2.2 Relationship Between the Polynomial Model and Other Genetic Models
    2.2.1 Genotypic Association Model
    2.2.2 Additive Model
    2.2.3 Dominant Model
    2.2.4 Recessive Model
    2.2.5 Co-dominant Model
  2.3 Model Selection via Marginal Likelihood and Parameter Estimation
  2.4 Simulation Studies
    2.4.1 Simulation Study (1)
    2.4.2 Simulation Study (2)
    2.4.3 Simulation Study (3)
  2.5 Application to Real Data
  2.6 Conclusion
  2.7 Supplementary Materials
    2.7.1 Derivation of Marginal Likelihood for the General Model
    2.7.2 Derivation of Marginal Likelihood for the Dominant Model
    2.7.3 Supplementary Figures
    2.7.4 Supplementary Tables
  2.8 Acknowledgement

3 Network Modeling of Family-Based Data
  3.1 Introduction
  3.2 Learning Bayesian Networks from Independent and Identically Distributed Observations
  3.3 Mixed-Effects Regression Models
  3.4 Mixed-Effects Bayesian Networks
  3.5 Simulation Studies
    3.5.1 Continuous Data
    3.5.2 Time-to-event Data
  3.6 Application
  3.7 Conclusions
  3.8 Supplementary Materials
  3.9 Acknowledgements

4 Efficient Technique to Model Extended Family Structure in Bayesian inference Using Gibbs Sampling (BUGS) software
  4.1 Introduction
  4.2 Proposed Parameterization
  4.3 A Real Data Example
  4.4 Conclusion
  4.5 Supplementary Materials
  4.6 Acknowledgments

5 Conclusions

Bibliography

Curriculum Vitae
List of Tables

2.1 Expected value of the quantitative trait for 3 genotypes in each model
2.2 Specification of ω, prior hyperparameters, updated hyperparameters, and marginal likelihood for each model
2.3 Median false positive rates in the NECS and CSSCD data in 10000 permutations (Simulation Study 1)
2.4 Mean number of additive, dominant, and recessive SNPs correctly identified (out of 34, 33, and 33, respectively) in the two approaches when heritability is 0.4
2.5 Significant results using two approaches in the a) NECS and b) CSSCD data
2.6 Mean number of additive, dominant, and recessive SNPs correctly identified in the two approaches
3.1 False Positive Rates and Family-wise Error Rates of Different Model Selection Metrics For Continuous Data When h2 = 0.50
3.2 Power Comparisons of Four Variants of BIC vs. Corresponding LRTM For Continuous Data h2 = 0.50
3.3 False Positive Rates and Family-wise Error Rates of Different Model Selection Metrics For Time-to-event Data When h2 = 0.50
3.4 Power Comparisons of BICM vs. Corresponding LRTM For Time-to-event Data
3.5 23 Genes in the IIS Pathway
3.6 Markov Blanket of Each Node in the Top 3 BNs
3.7 False Positive Rates and Family-wise Error Rates of Different Model Selection Metrics For Continuous Data When h2 = 0.25
3.8 False Positive Rates and Family-wise Error Rates of Different Model Selection Metrics For Continuous Data When h2 = 0.75
3.9 Power Comparisons of Four Variants of BIC vs. Corresponding LRTM For Continuous Data When h2 = 0.25
3.10 Power Comparisons of Four Variants of BIC vs. Corresponding LRTM For Continuous Data When h2 = 0.75
3.11 False Positive Rates and Family-wise Error Rates of Different Model Selection Metrics For Time-to-event Data When h2 = 0.25
3.12 False Positive Rates and Family-wise Error Rates of Different Model Selection Metrics For Time-to-event Data When h2 = 0.75
4.1 Comparison of Point Estimates (PE), standard errors (SE), and 95% Credible Intervals (95% CI)
List of Figures

2.1 Box plot of cubic root of empirical false positive rates for different p-value thresholds, increasing sample sizes, and h2 = 0.4
2.2 Box plot of log-transformed BF for different p-value thresholds, increasing sample sizes, and h2 = 0.4
2.3 Box plot of power of the two approaches at different significance thresholds when h2 = 0.4
2.4 Box plot of cubic root of empirical false positive rates for different p-value thresholds, increasing sample sizes, and h2 = 0.2
2.5 Box plot of log-transformed BF for different p-value thresholds, increasing sample sizes, and h2 = 0.2
2.6 Box plot of power of the two approaches at different significance thresholds when h2 = 0.2
2.7 Box plot of cubic root of empirical false positive rates for different p-value thresholds, increasing sample sizes, and h2 = 0.6
2.8 Box plot of log-transformed BF for different p-value thresholds, increasing sample sizes, and h2 = 0.6
2.9 Box plot of power of the two approaches at different significance thresholds when h2 = 0.6
3.1 An Example Pedigree and Corresponding Additive Genetic Relationship Matrix
3.2 Example of BN with 3 observable variables X1, X2, X3 and parameter vectors θ1, θ2, θ3
3.3 Example of Proposed Parameterization For Correlated Observations
3.4 Top 3 BNs that dissect the associations of SNPs in genes of the IIS pathway through effects on blood biomarkers
4.1 Plots of Posterior Distributions of Heritability, Residual Variance, and Genetic Variance
List of Abbreviations

AIC      Akaike information criterion
BF       Bayes Factor
BIC      Bayesian information criterion
BN       Bayesian Network
BUGS     Bayesian inference Using Gibbs Sampling
CSSCD    Cooperative Study of Sickle Cell Disease
DAG      Directed Acyclic Graph
DIC      Deviance Information Criterion
GWAS     Genome-wide Association Study
HWE      Hardy-Weinberg Equilibrium
IIS      Insulin and insulin-like growth factor 1 signaling
LLFS     Long Life Family Study
LRT      Likelihood Ratio Test
MAF      Minor Allele Frequency
MB       Markov Blanket
MCMC     Markov Chain Monte Carlo
NECS     New England Centenarian Study
NEO-FFI  NEO Five-Factor Inventory
SNP      Single Nucleotide Polymorphism
Chapter 1
Introduction
The overall theme of my dissertation is to develop advanced statistical modeling to understand the genetic basis of complex polygenic traits. In order to achieve this goal, my
dissertation focuses on: 1) detection of genetic variants with different inheritance patterns
formulated as a Bayesian model selection problem; 2) integration of genetic data and sub-phenotypes of the trait of interest to dissect the genotype-phenotype associations using
network modeling; and 3) an efficient technique to model the correlated (family) data in the
Bayesian framework.
In the first part of my dissertation, I present a Bayesian polynomial model of genetic
data that can simultaneously parameterize the five different genetic models (2-df general,
additive, dominant, co-dominant, and recessive) commonly used in genetic association
studies. I show that the five genetic models are special cases of the proposed polynomial
parameterization. There is a convenient transformation between the polynomial model
(2 degrees of freedom model) and other four specific genetic models (1 degree of freedom
model) by utilizing the fact that the parameters in the four specific models are constrained
by a linear contrast of the parameters in the polynomial model. Assuming that all genetic
models are equally likely a priori, I propose to use the marginal likelihood to select the
most likely genetic model given the observed data. Computationally, the advantage of this
parameterization is that fitting a single polynomial model is sufficient to estimate the effect
of each genetic variant under each genetic model, instead of fitting five models, and the
marginal likelihood of different genetic models can be derived from a single polynomial
model. The theoretical advantage of this approach is to provide a coherent Bayesian
framework for the selection of the most likely genetic model by scoring each model by the
marginal likelihood. This work is forthcoming in Bayesian Analysis ([5]).
The second part of my dissertation presents an integrative analytic approach that utilizes the formalism of Bayesian networks (directed acyclic graphical models) to represent
the complex probabilistic dependency structure among many variables in family-based correlated data. By building a joint network model which includes genetic and non-genetic
factors as well as sub-phenotypes (or trait-related intermediate phenotypes) of the target
trait, this method provides an integrative approach that allows for better understanding
of the biological mechanism that translates genotype to phenotype, which simple genetic
association analyses are not able to capture. However, many well-known approaches for
structural learning of directed graphical models assume that data are a random sample of
independent and identically distributed (IID) observations. Therefore, data from family-based study designs in which individuals within the same pedigree are correlated provide a
layer of complexity that affects decomposition of model likelihood, model selection metric,
and model search procedures. I propose a parameterization in which mixed effects regression models are used to model each node given the parent nodes to account for the family
structure. However, an additional challenge is that there is a lack of consensus on which
model selection metrics should be used when mixed models are used. Thus, I evaluate
various model selection metrics using simulation studies for continuous data and time-to-event data. As a by-product of simulation studies for the time-to-event trait, I also derive
a heritability-like estimate for the time-to-event trait on the log-hazard scale so that the
genetic component of the trait can be controlled in the simulation design. Then, a specific
example of a network which incorporates genetic data, blood biomarkers, and a survival
trait from a large-scale family-based study is presented.
In the third part of my dissertation, I describe an efficient way to account for family
structure in Bayesian inference Using Gibbs Sampling (BUGS) [46]. In linear mixed models,
the family structure is typically accounted for by introducing a random effects vector that
follows a multivariate normal distribution with zero mean vector and variance-covariance
matrix that is proportional to the additive relationship matrix (or twice the kinship matrix).
However, a direct handling of a multivariate normal distribution is not computationally
feasible in BUGS. Therefore, decomposition of this multivariate normal distribution into univariate normal distributions is crucial. I show how a high-dimensional covariance matrix can be decomposed using singular value decomposition and how the decomposition can be implemented in BUGS code.
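The idea behind this decomposition can be sketched in a few lines of Python (an illustration only, not the dissertation's BUGS code; the toy pedigree and variance value are assumptions): if $A = U D U^T$ is the eigendecomposition of the relationship matrix, then $u = U D^{1/2} z$ with independent univariate $z_i \sim N(0, \sigma^2_g)$ has covariance exactly $\sigma^2_g A$, so only univariate normal draws are needed.

```python
import numpy as np

# Toy additive genetic relationship matrix: two unrelated parents and
# one offspring (diagonal 1, parent-offspring entries 0.5). Assumed values.
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.5, 0.5, 1.0]])

# Eigendecomposition A = U diag(d) U^T (the SVD of a symmetric PSD matrix).
d, U = np.linalg.eigh(A)
d = np.clip(d, 0.0, None)        # guard against tiny negative round-off
L = U @ np.diag(np.sqrt(d))      # so that A = L @ L.T

sigma2_g = 2.0                   # additive genetic variance (assumed)

# Independent univariate normals mapped to correlated random effects:
rng = np.random.default_rng(0)
z = rng.normal(0.0, np.sqrt(sigma2_g), size=(A.shape[0], 100_000))
u = L @ z                        # each column is a draw from N(0, sigma2_g * A)

# The empirical covariance of u approximately recovers sigma2_g * A.
print(np.round(np.cov(u), 2))
```

In BUGS, the same trick amounts to placing independent univariate normal priors on the entries of $z$ and defining the random effects deterministically as $u = Lz$.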
Chapter 2
Bayesian Polynomial Regression Models to Fit Multiple Genetic Models
for Quantitative Traits
2.1 Introduction
Genome-wide association studies have been a popular approach to discover genetic variants
that are associated with increased risk for rare and common diseases ([66]). The most
common variants in the human genome are single nucleotide polymorphisms (SNPs): DNA
bases that can vary across individuals. Typically SNPs have two alleles, say A and B, and
based on the combination of SNP alleles in each chromosome pair (the genotype), an
individual can be homozygous for the A allele if both chromosomes carry the allele A,
homozygous for the B allele if both chromosomes carry the B allele, and heterozygous
when the two chromosomes carry the A and B alleles. Genotyping DNA was a slow and expensive process until the mid-2000s, when high-throughput technologies produced microarrays that can generate the genetic profile of an individual at hundreds of thousands to millions of SNPs; this technology triggered an explosion of genome-wide association studies (GWAS) to discover the genetic basis of common diseases.
Typically in a GWAS the association between each SNP and a quantitative trait is
tested using linear regression under a specific genetic model that can assume a genotypic
(2 degrees of freedom), dominant, recessive, co-dominant, or additive mode of inheritance
of each tested SNP. In a genotypic model the 3 genotypes AA, AB and BB are treated as a
factor with 3 levels. The other 4 genetic models compress the 3 genotypes into a numerical
variable by either counting the number of minor alleles (additive model), or by recoding
the genotypes as AA=0 versus AB, BB=1 (dominant model for the B allele), AA, AB=0
versus BB=1 (recessive model for the B allele), AA, BB=0 versus AB=1 (co-dominant
model). However, the inheritance pattern is rarely known, and using a suboptimal model can lead to a loss of power ([40]).
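The genotype codings described above can be written out as a minimal Python sketch (assuming genotypes are stored as minor-allele counts 0/1/2; the function and variable names are illustrative, not from the dissertation):

```python
# Recode a vector of genotypes (minor-allele counts 0/1/2 for AA/AB/BB)
# under the four 1-df genetic models described in the text.
def recode(genotypes, model):
    codes = {
        "additive":   {0: 0, 1: 1, 2: 2},  # count of B alleles
        "dominant":   {0: 0, 1: 1, 2: 1},  # AA vs (AB, BB)
        "recessive":  {0: 0, 1: 0, 2: 1},  # (AA, AB) vs BB
        "codominant": {0: 0, 1: 1, 2: 0},  # (AA, BB) vs AB
    }
    return [codes[model][g] for g in genotypes]

g = [0, 1, 2, 1, 0]
print(recode(g, "dominant"))   # [0, 1, 1, 1, 0]
```

The genotypic (2-df) model keeps the three genotypes as a factor instead of collapsing them to a single number.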
Selecting the correct genetic model for each SNP is often accomplished by fitting the
five models and choosing the model that describes the data best. This approach has several
drawbacks. It increases computational burden with genome-wide data as 5 GWASs need
to be conducted. Furthermore, testing five models for each SNP increases the burden of
multiple testing in addition to the existing issue of multiple comparisons with millions of
SNPs. More importantly, the optimal method for choosing the best model is not clear ([40]).
The common practice is to simply use the additive genetic model. It has been shown that
additive models perform reasonably well to detect variants that have additive or dominant
inheritance pattern, but they are underpowered when the true mode of inheritance is
recessive ([10]). Others ([21, 27, 41, 70]) have proposed to study the maximum of the three
test statistics derived under additive, dominant, and recessive models.
We propose a polynomial parameterization of the genetic data that includes the five
genetic models as special cases, and we describe a coherent Bayesian framework to select the
most likely genetic model given the data. This polynomial parameterization is equivalent
to the genetic model described in [68] that adds a dominance effect to the additive model
to describe non-additive genetic effects. The advantage of either parameterization is that,
in a Bayesian framework, fitting a single model becomes sufficient to test the genotype-phenotype associations without specifying a particular genetic model; this point has been described in detail in [68]. Here, we focus on the specific task of selecting the best
genetic model when the specific mode of inheritance is of interest in addition to whether a
SNP is associated with the trait.
Section 2.2 describes this parameterization and shows that there is a mathematical
relationship between the parameters of the polynomial model and each of the five possible
genetic models. Section 2.3 describes the model selection approach that is based on the
computation of the marginal likelihood of the five models so that the model with maximum
posterior probability can be identified. Section 2.3 also provides closed form solutions for
the marginal likelihood and for the estimates of the parameters of the model with the
highest marginal likelihood or Bayes Factor (BF), assuming exchangeable observations
that follow normal distributions. The proposed method is evaluated through simulation
studies in Section 2.4, and is applied to two GWAS data sets in Section 2.5. Conclusions
and suggestions for further work are provided in Section 2.6.
2.2 Relationship Between the Polynomial Model and Other Genetic Models
Here we show that the five genetic models are specific cases of a general polynomial model,
with parameters that satisfy some linear constraints. Let y denote the response variable
in the genetic association study, and consider the polynomial regression model
\[
E(y \mid \beta) = \beta_0 + \beta_1 x_{\text{add}} + \beta_2 x_{\text{add}}^2
\]
where β denotes the vector of regression parameters and xadd is the variable that codes for
the genotype data as follows:
\[
x_{\text{add}} =
\begin{cases}
0 & \text{if genotype is } AA \\
1 & \text{if genotype is } AB \\
2 & \text{if genotype is } BB
\end{cases}
\]
Note that the proposed model is equivalent to the additive model with dominance effects
described in [68].
\[
E(y \mid \theta) = \theta_0 + \theta_1 x_{\text{add}} + \theta_2 x_{\text{het}}
\]
where $x_{\text{het}} = 1$ for the heterozygous genotype and 0 otherwise, and $\theta_1 = \beta_1 + 2\beta_2$, $\theta_2 = -\beta_2$.
Mathematically, we found the polynomial parameterization more appealing as it allows
interpretation of the regression coefficients in terms of the SNP dosage.
2.2.1 Genotypic Association Model
The genotypic association model is typically parameterized using two indicator variables
to describe the effect of the genotypes AB and BB relative to AA:
\[
E(y \mid \gamma) = \gamma_0 + \gamma_1 x_{AB} + \gamma_2 x_{BB}
\]
where $x_{AB} = 1$ if genotype is AB (and 0 otherwise) and $x_{BB} = 1$ if genotype is BB (and 0 otherwise).
This parameterization specifies the expected value of y, for each of the 3 genotypes AA,
AB, BB as summarized in Table 2.1. Equating the expected values of y from the 2 different
parameterizations produces a system of linear equations:
\[
\begin{aligned}
\gamma_0 &= \beta_0 \\
\gamma_0 + \gamma_1 &= \beta_0 + \beta_1 + \beta_2 \\
\gamma_0 + \gamma_2 &= \beta_0 + 2\beta_1 + 4\beta_2
\end{aligned}
\]
that can be solved as
\[
\begin{pmatrix} \gamma_0 \\ \gamma_1 \\ \gamma_2 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 2 & 4 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}
\]
Therefore, the parameters in the polynomial model and the genotypic association model
have a one-to-one relationship. For the other genetic models, some constraints on parameters of the polynomial model are necessary.
2.2.2 Additive Model
The parameterization of the additive genetic model is $E(y \mid \alpha_A) = \alpha_{A0} + \alpha_A x_{\text{add}}$, and equating the expected values of $y$ in Table 2.1 leads to the system of linear equations:
\[
\begin{aligned}
\beta_0 &= \alpha_{A0} \\
\beta_1 + \beta_2 &= \alpha_A \\
2\beta_1 + 4\beta_2 &= 2\alpha_A
\end{aligned}
\]
which can be solved if $\beta_2 = 0$, so that $\beta_0 = \alpha_{A0}$ and $\beta_1 = \alpha_A$. Therefore the relationship between the parameters in the polynomial model and the additive model requires a linear constraint on the vector $\beta$.
2.2.3 Dominant Model
Now, consider the dominant model for the B allele: $E(y \mid \alpha_D) = \alpha_{D0} + \alpha_D x_{\text{dom}}$, where $x_{\text{dom}} = 1$ if genotype is AB or BB (and 0 otherwise). Proceeding as in the previous cases leads to the equations:
\[
\begin{aligned}
\alpha_{D0} &= \beta_0 \\
\alpha_D &= \beta_1 + \beta_2 = 2\beta_1 + 4\beta_2
\end{aligned}
\]
The system has the solution $\alpha_{D0} = \beta_0$ and $\alpha_D = \tfrac{2}{3}\beta_1$ if the parameters of the polynomial model satisfy the constraint $\beta_1 + 3\beta_2 = 0$.
2.2.4 Recessive Model
Similarly, consider the recessive model for the B allele: $E(y \mid \alpha_R) = \alpha_{R0} + \alpha_R x_{\text{rec}}$, where $x_{\text{rec}} = 1$ if genotype is BB (and 0 otherwise). In this case, the relations between parameters are:
\[
\begin{aligned}
\alpha_{R0} &= \beta_0 = \beta_0 + \beta_1 + \beta_2 \\
\alpha_R &= 2\beta_1 + 4\beta_2
\end{aligned}
\]
with linear constraint $\beta_1 + \beta_2 = 0$, so that $\alpha_{R0} = \beta_0$ and $\alpha_R = 2\beta_2 = -2\beta_1$.
2.2.5 Co-dominant Model
Lastly, consider the co-dominant genetic model: $E(y \mid \alpha_C) = \alpha_{C0} + \alpha_C x_{\text{cod}}$, where $x_{\text{cod}} = 1$ if genotype is AB (and 0 otherwise). In this case:
\[
\begin{aligned}
\alpha_{C0} &= \beta_0 = \beta_0 + 2\beta_1 + 4\beta_2 \\
\alpha_C &= \beta_1 + \beta_2
\end{aligned}
\]
The linear constraint is $\beta_1 + 2\beta_2 = 0$, so that $\alpha_{C0} = \beta_0$ and $\alpha_C = \tfrac{1}{2}\beta_1$.
Model          E(y|AA)   E(y|AB)          E(y|BB)
Polynomial     β0        β0 + β1 + β2     β0 + 2β1 + 4β2
2-df General   γ0        γ0 + γ1          γ0 + γ2
Additive       αA0       αA0 + αA         αA0 + 2αA
Dominant       αD0       αD0 + αD         αD0 + αD
Recessive      αR0       αR0              αR0 + αR
Co-dominant    αC0       αC0 + αC         αC0

Table 2.1: Expected value of the quantitative trait for 3 genotypes in each model
In summary, there is a one-to-one transformation between the parameters of the polynomial and general model, while the transformation between the polynomial and other
four models (additive, dominant, recessive, and co-dominant) is constrained by a linear
contrast of the parameters in the polynomial model.
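The constraint algebra derived above can be checked numerically. The following Python sketch (with illustrative parameter values, not from the dissertation) confirms that each linear contrast collapses the three polynomial genotype means into the pattern of the corresponding 1-df model; note that the recessive effect works out to $2\beta_2 = -2\beta_1$ under its constraint.

```python
import numpy as np

# Genotype means implied by the polynomial model for x_add = 0, 1, 2.
def genotype_means(b0, b1, b2):
    return np.array([b0, b0 + b1 + b2, b0 + 2 * b1 + 4 * b2])

b0, b1 = 1.0, 3.0   # illustrative values

# Dominant: constraint b1 + 3*b2 = 0 forces E(y|AB) = E(y|BB),
# with effect alpha_D = (2/3)*b1.
m = genotype_means(b0, b1, -b1 / 3)
assert np.isclose(m[1], m[2])
assert np.isclose(m[1] - m[0], (2 / 3) * b1)

# Recessive: constraint b1 + b2 = 0 forces E(y|AA) = E(y|AB),
# with effect alpha_R = 2*b2 = -2*b1.
m = genotype_means(b0, b1, -b1)
assert np.isclose(m[0], m[1])
assert np.isclose(m[2] - m[0], -2 * b1)

# Co-dominant: constraint b1 + 2*b2 = 0 forces E(y|AA) = E(y|BB),
# with effect alpha_C = (1/2)*b1.
m = genotype_means(b0, b1, -b1 / 2)
assert np.isclose(m[0], m[2])
assert np.isclose(m[1] - m[0], 0.5 * b1)

print("all constraints verified")
```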
2.3 Model Selection via Marginal Likelihood and Parameter Estimation
The polynomial parameterization provides a framework to simultaneously fit different genetic models. Given a sample of genotype data, the question is how to select the most
appropriate genetic model. We propose a Bayesian model selection approach in which genetic models are compared based on their marginal likelihood and the model with largest
marginal likelihood is selected, assuming that a priori the 5 genetic models are equally
likely.
In the polynomial model ($y = X\beta + \epsilon$ in matrix form), the data are assumed to be exchangeable and to follow a normal distribution:
\[
y \mid X, \beta, \tau \sim N\!\left(X\beta, \tfrac{1}{\tau} I\right)
\]
where $I$ is the identity matrix. A standard normal-gamma prior for the vector of parameters $\beta$ and the precision $\tau$ is assumed, such that $p(\beta, \tau) = p(\beta \mid \tau)\,p(\tau)$, where
\[
\tau \sim \text{Gamma}(a_1, a_2), \qquad
\beta \mid \tau \sim N\!\left(\beta_0, (\tau R_0)^{-1}\right)
\]
with β0 , R0 , a1 , and a2 as prior hyperparameters. Specification of these prior hyperparameters can be subjective and represents the prior probability of alternative genetic
models. With genome-wide data, most of the tested SNPs are likely to be null SNPs
and it is both reasonable and convenient to assume non-informative priors. Therefore the
following values for the prior hyper-parameters: β0 = 0, R0 = I, a1 = 1, and a2 = 1 will
be assumed. If there is strong prior belief about certain genetic models, more informative
prior distributions can be chosen and this problem is described at length in [68]. The
marginal likelihood given this polynomial model Mp can be computed analytically in the
equation below:
\[
p(y \mid M_p) = \int p(y \mid X, \beta, \tau)\, p(\beta, \tau)\, d\beta\, d\tau
= \frac{1}{(2\pi)^{n/2}} \, \frac{a_{2n}^{a_{1n}}\,\Gamma(a_{1n})}{a_2^{a_1}\,\Gamma(a_1)} \, \frac{|R_0|^{1/2}}{|R_n|^{1/2}}
\]
with the following updated hyper-parameters:
\[
\begin{aligned}
R_n &= R_0 + X^T X \\
\beta_n &= R_n^{-1}(R_0\beta_0 + X^T y) \\
a_{1n} &= a_1 + \frac{n}{2} \\
a_{2n} &= \left[\frac{-\beta_n^T R_n \beta_n + y^T y + \beta_0^T R_0 \beta_0}{2} + \frac{1}{a_2}\right]^{-1}
\end{aligned}
\]
Details can be found, for example, in [54]. In the general genetic model, the vector of parameters $\gamma$ is a linear transformation of $\beta$, $\gamma = \omega\beta$, where the matrix $\omega$ is:
\[
\omega = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 2 & 4 \end{pmatrix}
\]
Since γ is a linear transformation of β, once a prior distribution for β is elicited, the prior
distribution of γ is derived as:
\[
\gamma \mid \tau = \omega\beta \mid \tau \sim N\!\left(\omega\beta_0, \; \omega(\tau R_0)^{-1}\omega^T\right)
\]
while the prior for τ does not change with the re-parameterization. If the prior distributions
of the parameter vectors are so defined, then it can be shown that $p(y \mid M_P) = p(y \mid M_G)$
(see section 2.7.1 for details). In other words, the marginal likelihood is invariant under
linear transformations of the regression coefficients.
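The closed-form marginal likelihood above can be illustrated with a short Python sketch under the non-informative defaults stated earlier ($\beta_0 = 0$, $R_0 = I$, $a_1 = a_2 = 1$); the function name and the simulated recessive SNP are made up for the example, and this is a sketch of the computation rather than the dissertation's code.

```python
import numpy as np
from math import lgamma

def log_marginal_likelihood(X, y, a1=1.0, a2=1.0):
    """Closed-form log marginal likelihood of a normal linear model
    under the normal-gamma prior with beta0 = 0 and R0 = I."""
    n, p = X.shape
    Rn = np.eye(p) + X.T @ X
    beta_n = np.linalg.solve(Rn, X.T @ y)          # R0 @ beta0 = 0 here
    a1n = a1 + n / 2
    a2n = 1.0 / ((y @ y - beta_n @ Rn @ beta_n) / 2 + 1.0 / a2)
    _, logdet_Rn = np.linalg.slogdet(Rn)           # log|R0| = 0 for R0 = I
    return (-n / 2 * np.log(2 * np.pi)
            + a1n * np.log(a2n) - a1 * np.log(a2)
            + lgamma(a1n) - lgamma(a1)
            - 0.5 * logdet_Rn)

# Toy comparison: additive vs. recessive design for a truly recessive SNP.
rng = np.random.default_rng(1)
g = rng.integers(0, 3, size=500)                   # genotypes 0/1/2
y = 1.0 * (g == 2) + rng.normal(0.0, 1.0, size=500)
X_add = np.column_stack([np.ones(500), g.astype(float)])
X_rec = np.column_stack([np.ones(500), (g == 2).astype(float)])
print(log_marginal_likelihood(X_add, y), log_marginal_likelihood(X_rec, y))
```

With equal prior probabilities on the models, the design whose mean structure matches the data-generating mechanism should attain the larger marginal likelihood, which is the comparison the proposed method performs across the five genetic models.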
Derivation of marginal likelihoods for additive, dominant, recessive, and co-dominant
models is different, as these models are defined by a linear transformation of the parameters of the polynomial model and an additional constraint. Formally, let
\[
\alpha = \begin{pmatrix} \alpha_0 \\ \alpha_1 \end{pmatrix}
\]
denote the vector of parameters in any of these models. Then we can define
\[
\alpha = \begin{pmatrix} \alpha_0 \\ \alpha_1 \end{pmatrix}
= \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} \Big|\, \theta_2 = 0,
\qquad \text{where} \qquad
\theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{pmatrix} = \omega\beta
\]
and the matrix $\omega$ depends on the specific genetic model (see Table 2.2). If the vector $\beta$ follows a multivariate normal distribution, then $\theta$ also follows a multivariate normal distribution, and so do the marginal distribution of $\theta_2$ and the distribution of $\alpha$, which is a conditional distribution. Starting from the proper prior distributions for the vector of parameters $\beta$ and the precision $\tau$, proper prior distributions for $\alpha$ and $\tau$ are found to be:
\[
\tau \sim \text{Gamma}(a_1, a_2)
\]
\[
\alpha = \begin{pmatrix} \alpha_0 \\ \alpha_1 \end{pmatrix}
= \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} \Big|\, \theta_2 = 0
\sim N\!\left(\mu_0, \tau^{-1}\Sigma_0^{-1}\right)
\]
$\mu_0$ and $\Sigma_0^{-1}$ can be obtained by using properties of the conditional multivariate normal distribution ([18]) and are summarized in Table 2.2. Using these derived priors, the marginal
likelihood for the additive, dominant, recessive, and co-dominant models (MA , MD , MR ,
and MC , respectively) can be computed in closed form. The derivation of the marginal
likelihood for the dominant model is detailed in section 2.7.2. Derivation of the marginal
likelihood for the additive, recessive, and co-dominant model is similar. Note that the
derivation relies on the use of proper prior distributions for the parameters of the polynomial model.
Assuming that the five genetic models are a priori equally likely, the Bayes rule for model selection is equivalent to choosing the genetic model with the highest marginal likelihood, or equivalently the highest BF relative to the null model (i.e., the ratio of the marginal likelihood of one of the five models to that of the null model) ([36]). Once the most likely model is selected, the parameter estimates of
any of the five genetic models are the means of the posterior distributions. The regression
parameters in the polynomial model are estimated by $\beta_n = R_n^{-1}(R_0\beta_0 + X^T y)$ and, using
the one-to-one relationship, the parameters in the general model can be estimated by
$\gamma_n = \omega\beta_n$. The relation between the parameters of the polynomial model and the dominant, recessive, co-dominant, and additive models can be used to derive their posterior estimates.

Table 2.2: Specification of $\omega$, prior hyper-parameters, updated hyper-parameters, and marginal likelihood for each model.

Null model ($M_{null}$, intercept-only, design matrix $X = \mathbf{1}$):
- Prior hyper-parameters: $\beta_0 = [\beta_{00}]$, $R_0 = (r_{11})$, $a_1$, $a_2$
- Posterior hyper-parameters: $R_n = R_0 + \mathbf{1}^T\mathbf{1}$, $\beta_n = R_n^{-1}(R_0\beta_0 + \mathbf{1}^T y)$, $a_{1n} = a_1 + n/2$, $a_{2n} = \left[\frac{-\beta_n^T R_n \beta_n + y^T y + \beta_0^T R_0 \beta_0}{2} + \frac{1}{a_2}\right]^{-1}$
- Marginal likelihood: $p(y|M_{null}) = \frac{1}{(2\pi)^{n/2}} \frac{|R_0|^{1/2}}{|R_n|^{1/2}} \frac{a_{2n}^{a_{1n}}\Gamma(a_{1n})}{a_2^{a_1}\Gamma(a_1)}$

Polynomial model ($M_p$):
- Prior hyper-parameters: $\beta_0 = [\beta_{00}\; \beta_{01}\; \beta_{02}]^T$, $R_0 = (r_{ij})_{3\times 3}$, $a_1$, $a_2$
- Posterior hyper-parameters: $R_n = R_0 + X^T X$, $\beta_n = R_n^{-1}(R_0\beta_0 + X^T y)$, $a_{1n}$ and $a_{2n}$ as above
- Marginal likelihood: $p(y|M_p) = \frac{1}{(2\pi)^{n/2}} \frac{|R_0|^{1/2}}{|R_n|^{1/2}} \frac{a_{2n}^{a_{1n}}\Gamma(a_{1n})}{a_2^{a_1}\Gamma(a_1)}$

Genotypic model ($M_G$, no constraint): $\omega = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 1\\ 0 & 2 & 4\end{pmatrix}$
- Prior hyper-parameters: $\gamma_0 = \omega\beta_0$, $(R_0')^{-1} = \omega R_0^{-1}\omega^T$, $a_1$, $a_2$
- Posterior hyper-parameters: $\gamma_n = \omega\beta_n$, $(R_n')^{-1} = \omega R_n^{-1}\omega^T$, $a_{1n}$ and $a_{2n}$ as above
- Marginal likelihood: $p(y|M_G) = \frac{1}{(2\pi)^{n/2}} \frac{|(\omega^{-1})^T R_0\,\omega^{-1}|^{1/2}}{|(\omega^{-1})^T R_n\,\omega^{-1}|^{1/2}} \frac{a_{2n}^{a_{1n}}\Gamma(a_{1n})}{a_2^{a_1}\Gamma(a_1)} = p(y|M_p)$

Additive, dominant, recessive, and co-dominant models ($M_i$, $i$ = additive, dominant, recessive, or co-dominant): each is defined by $\theta = \omega\beta$ with the constraint $\theta_2 = 0$, using
\[ \omega_A = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\end{pmatrix}, \quad \omega_D = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 1\\ 0 & 1 & 3\end{pmatrix}, \quad \omega_R = \begin{pmatrix} 1 & 0 & 0\\ 0 & 2 & 4\\ 0 & 1 & 1\end{pmatrix}, \quad \omega_C = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 1\\ 0 & 2 & 4\end{pmatrix}. \]
- Prior hyper-parameters: with $\omega(\tau R_0)^{-1}\omega^T = \tau^{-1} S_0 = \tau^{-1}(s_0^{ij})$ and $\theta_0 = \omega\beta_0$,
\[ \mu_0 = \begin{pmatrix}\theta_{00}\\ \theta_{01}\end{pmatrix} - \begin{pmatrix} s_0^{13}\\ s_0^{23}\end{pmatrix}(s_0^{33})^{-1}\theta_{02}, \qquad \Sigma_0^{-1} = \begin{pmatrix} s_0^{11} & s_0^{12}\\ s_0^{21} & s_0^{22}\end{pmatrix} - \begin{pmatrix} s_0^{13}\\ s_0^{23}\end{pmatrix}(s_0^{33})^{-1}\begin{pmatrix} s_0^{31} & s_0^{32}\end{pmatrix}; \quad a_1,\; a_2 \]
- Posterior hyper-parameters: with $\omega(\tau R_n)^{-1}\omega^T = \tau^{-1} S_n = \tau^{-1}(s_n^{ij})$ and $\theta_n = \omega\beta_n$, $\mu_n$ and $\Sigma_n^{-1}$ are defined analogously; $a_{1n} = a_1 + n/2$, $a_{2n} = \left[\frac{-\mu_n^T\Sigma_n\mu_n + y^T y + \mu_0^T\Sigma_0\mu_0}{2} + \frac{1}{a_2}\right]^{-1}$
- Marginal likelihood: $p(y|M_i) = \frac{1}{(2\pi)^{n/2}} \frac{|\Sigma_0|^{1/2}}{|\Sigma_n|^{1/2}} \frac{a_{2n}^{a_{1n}}\Gamma(a_{1n})}{a_2^{a_1}\Gamma(a_1)}$
Specifically, from the set of relations:
\[ \beta|\tau \sim N(\beta_n, (\tau R_n)^{-1}) \]
\[ \theta = \omega\beta|\tau \sim N(\theta_n = \omega\beta_n,\; \omega(\tau R_n)^{-1}\omega^T) \]
\[ \alpha = \begin{pmatrix}\theta_0\\ \theta_1\end{pmatrix}\Big|\,\theta_2 = 0 \sim N(\mu_n, \tau^{-1}\Sigma_n^{-1}) \]
and using the properties of the conditional multivariate normal distribution, the point estimates $\mu_n$ for any model are found to be:
\[ \mu_n = \begin{pmatrix}\theta_{n0}\\ \theta_{n1}\end{pmatrix} + [S_{12}][S_{22}]^{-1}[0 - \theta_{n2}] \]
where $\omega(\tau R_n)^{-1}\omega^T = \tau^{-1}\begin{pmatrix} S_{11} & S_{12}\\ S_{21} & S_{22}\end{pmatrix}$, with $\dim(S_{11}) = 2\times 2$, $\dim(S_{12}) = 2\times 1$, $\dim(S_{21}) = 1\times 2$, and $\dim(S_{22}) = 1\times 1$. Table 2.2 summarizes the specification of $\omega$ and the formulas for computing prior and updated hyper-parameters and the marginal likelihood for the different genetic models discussed in this section and the null model.
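The conditioning step that produces $\mu_n$ and $\Sigma_n$ can be sketched numerically. The values of $\beta_n$ and $R_n$ below are invented for illustration; the dominant-model $\omega$ is the one derived in section 2.7.2.

```python
import numpy as np

# Illustrative posterior quantities (beta_n and R_n are made-up numbers)
omega = np.array([[1., 0., 0.], [0., 1., 1.], [0., 1., 3.]])  # dominant model
beta_n = np.array([0.2, 0.5, -0.1])
R_n = np.array([[4., 1., 0.], [1., 5., 1.], [0., 1., 6.]])    # posterior precision (up to tau)

theta_n = omega @ beta_n
S = omega @ np.linalg.inv(R_n) @ omega.T       # tau * Cov(theta)
S11, S12 = S[:2, :2], S[:2, 2:]
S21, S22 = S[2:, :2], S[2:, 2:]

# Point estimate mu_n = theta_n[:2] + S12 S22^{-1} (0 - theta_n[2]);
# conditional precision Sigma_n via the Schur complement of S22
mu_n = theta_n[:2] + S12 @ np.linalg.solve(S22, np.array([0.0 - theta_n[2]]))
Sigma_n = np.linalg.inv(S11 - S12 @ np.linalg.solve(S22, S21))
```

A standard block-matrix identity says the conditional precision equals the corresponding block of the joint precision matrix, which gives an independent check of the computation.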
2.4 Simulation Studies
Three simulation studies were conducted to assess false and true positive rates of the
Bayesian procedure with polynomial models and compared to the frequentist approach in
which the association with minimum p-value is selected. Simulation study (1) was designed
to evaluate the false positive rates of the polynomial model approach for various selection
criteria. Real genotype data from two GWASs of different sample sizes were used and
the quantitative trait in each set was randomly permuted to create data sets with no
true positive associations. Simulation study (2) was designed to compare sensitivity and
specificity of our proposed method and the standard approach by simulating genetic data
that mimic the GWAS setting with causal SNPs (i.e. SNPs truly associated with the trait)
having different modes of inheritance, each SNP explaining the same proportion of the
trait variability. Simulation (3) modified the design of simulation (2) and let SNPs explain
varying proportions of the trait variability.
2.4.1 Simulation Study (1)
Data: Two real datasets were used. The first data set consisted of genotype data of 201
unrelated offspring of centenarians from the New England Centenarian Study (NECS)
(http://www.bumc.bu.edu/centenarian/) ([63]). The genotype data were described in
[65]. The quantitative trait in this analysis was a neuroticism score measured in the
NEO-Five Factor Inventory (NEO-FFI), which is a 60-item (12 items per domain) measure of five personality dimensions (neuroticism, extraversion, openness, agreeableness, and
conscientiousness) ([16]). Previous studies have shown that the estimated heritability of
neuroticism is approximately 25% ([4, 59]). The second data set consisted of 843 unrelated
African-American subjects with sickle cell anemia enrolled in Cooperative Study of Sickle
Cell Disease (CSSCD) (https://biolincc.nhlbi.nih.gov/studies/csscd/) ([24]). In
this cohort, the trait is the percent of fetal hemoglobin in the total hemoglobin. The percent fetal hemoglobin is a major modulator of hematologic and clinical complications of
sickle cell anemia ([2]). Studies have shown that there is a strong genetic basis of fetal
hemoglobin, with heritability estimates in the range of 61 to 89% ([23, 59]), and a well-established gene that affects this trait is BCL11A ([3]). Both studies were approved by the
institutional review board of each participating institution, and standard quality control
procedures were performed on both genotype data ([3, 65]).
Methods: Initially 254,612 and 486,331 autosomal SNPs were available for analysis in
the two cohorts (NECS and CSSCD), respectively. It is well known that SNPs in close
proximity tend to be correlated with each other ([69]), and this non-random correlation
can bias the assessment of false positive rates. In order to avoid this problem, SNPs whose
pairwise correlation was r2 > 0.2 were removed using the PLINK software ([61]). After
this pruning, 50,894 and 140,864 independent SNPs were left for analysis in the NECS
and CSSCD sample, respectively. In both sets, 50,000 SNPs were randomly chosen from
each set and 10,000 simulations were performed by permuting the trait values at random.
Two approaches were evaluated in this simulation study: 1) the proposed method, in
which the best genetic model was selected based on the maximum BF for each SNP; and
2) the frequentist approach, in which five genetic models were fitted and the best model
was selected based on the minimum p-value for each SNP. For the genotypic model (2
degrees of freedom) in the frequentist approach, the minimum of the two p-values was
chosen. Various threshold criteria for the two approaches were explored and the number of
significant associations detected for varying thresholds was recorded. False positive rates
were computed as the rates of significant associations.
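The permutation scheme can be sketched as follows. This is a minimal, scaled-down stand-in: it applies a single additive regression test per SNP rather than the full five-model comparison, and the sample sizes, seed, threshold, and all names are invented for the example.

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(2)
n, n_snp, n_perm = 200, 500, 50            # scaled-down stand-ins
geno = rng.binomial(2, 0.3, size=(n, n_snp)).astype(float)
trait = rng.normal(size=n)

def slope_pvalue(y, g):
    """Two-sided p-value for the slope of a simple regression of y on g."""
    gc, yc = g - g.mean(), y - y.mean()
    r = (gc @ yc) / np.sqrt((gc @ gc) * (yc @ yc))
    t = r * np.sqrt((len(y) - 2) / (1 - r * r))
    return 2 * t_dist.sf(abs(t), df=len(y) - 2)

rates = []
for _ in range(n_perm):
    y = rng.permutation(trait)             # permuting destroys any true association
    pvals = np.array([slope_pvalue(y, geno[:, j]) for j in range(n_snp)])
    rates.append(np.mean(pvals < 1e-3))    # rate of "significant" associations
fpr = np.median(rates)                     # median empirical false positive rate
```

Because the trait is permuted, every association detected is a false positive, so the recorded rate estimates the false positive rate of the chosen threshold directly.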
2.4.2 Simulation Study (2)
Data: In order to assess the true positive rates of our proposed method and the standard
approach, genetic data were simulated with known causal SNPs, each explaining the same
proportion of the trait variance. A modification of the simulation procedure described in
[88] was used to simulate the data, but an additional source of variability was introduced, as well as SNPs with dominant and recessive modes of inheritance in addition to additive effects.
Several scenarios were considered by using different sample sizes (N=1,000; 10,000; 20,000;
50,000; and 100,000) with varying heritability ([53]) of the quantitative traits (h2 =0.2,
0.4, and 0.6). Heritability is defined as the proportion of the total variance of the trait
that is explained by the genetic effect and the higher the heritability the larger the genetic
contribution to the trait. A total 500,000 SNPs were simulated in each run and included
100 causal SNPs: 34 with additive effects, 33 with dominant effects, and 33 with recessive
effects. We assumed that each causal SNP explained 1/100 of the total genetic variance
so that, for example, when the total heritability was 0.2 and 20% of the total phenotypic
variance was due to the genetic variance, each causal SNP explained 1/100 of the total
genetic variance and hence the SNP-specific heritability was 0.002.
Methods: The following steps describe the simulation scheme.
Step 1. Generate minor allele frequency for each SNP
The minor allele frequency (MAF: frequency of B allele) for each SNP was randomly
drawn from a Beta distribution Beta(2, 8), which represents the distribution of commercially available chips ([88]). We also used a standard quality control procedure by excluding
any SNPs with MAF less than 0.01.
Step 2. Generate the genotype
Genotypes for each SNP (AA, AB, BB) were generated assuming Hardy-Weinberg
equilibrium. Essentially, if p is the prevalence of the A allele in the population, HardyWeinberg equilibrium (HWE) law states that the prevalence of the three genotypes will
be p2 , 2p(1 − p), (1 − p)2 ([85, 31]). These expected genotype frequencies were used to
simulate the genotype data, given p.
Step 3. Select the causal SNPs
100 causal SNPs from the total SNPs were randomly chosen and assigned the mode of
inheritance randomly to the selected SNPs.
Step 4. Determine the effect size for each causal SNP
The effect size $a_j$ for the $j$th causal SNP ($j = 1, 2, \dots, 100$) was computed from the formula:
\[ h_j^2 = \frac{\sigma^2_{Add,j} + \sigma^2_{Dom,j}}{\sigma^2_{Total}} = \frac{2p_j(1-p_j)[a_j + d_j(1-2p_j)]^2 + [2p_j(1-p_j)d_j]^2}{\sigma^2_{Total}} \]
where $\sigma^2_{Add,j}$ is the additive genetic variance of the $j$th causal SNP, $\sigma^2_{Dom,j}$ is the dominance genetic variance of the $j$th causal SNP, $\sigma^2_{Total}$ is the total phenotypic variance, $p_j$ is the MAF for the $j$th causal SNP, $a_j$ is the additive genetic effect at the $j$th causal SNP, and $d_j$ is the dominance genetic effect at the $j$th causal SNP. The parameter $h_j^2$ is the locus-specific heritability, which was assumed to be $h^2/100$.
This is the amount of heritability contributed by the $j$th causal SNP, and hence all causal SNPs contribute to the total heritability by an
equal amount. In the above formula, note that the genetic variance is decomposed into
the additive and dominance variance component. The additive genetic variance implies
that each additional copy of an allele contributes a fixed amount of effect aj to the trait.
Under this assumption, the trait value of the heterozygote (AB) would be the midpoint
between the two homozygotes (AA and BB). On the other hand, when there exists dominance genetic variance, the trait value of the heterozygote will deviate from the midpoint
between the two homozygote, and the degree of deviation is expressed by the quantity dj .
Therefore, it follows that dj = 0 for any SNP with additive effect, and dj = aj for any SNP
with dominant effect, and dj = −aj for any SNP with recessive effect. We also assumed
σT2 otal = 1. Note that only the three genetic models (additive, dominant, and recessive)
were considered.
Step 5. Generate the phenotypic value based on the causal SNPs
Let yi denote the phenotypic value for ith individual. For each causal SNP, the SNP
contribution to the trait was randomly generated as
\[ X_{ij} \sim N\!\left(a_j G_{ij},\; \frac{\sigma^2_{Total}}{100}\right) \]
where $a_j$ is the effect of the $j$th causal SNP (computed in the previous step) and $G_{ij}$ is the genotype coding for the $i$th individual at the $j$th causal SNP that was generated in Step 2. For an additive causal SNP, $G_{ij}$ = number of minor alleles (0, 1, 2). For a dominant causal SNP, $G_{ij} = 1$ if the genotype is AB or BB (0 otherwise). For a recessive causal SNP, $G_{ij} = 1$ if the genotype is BB (0 otherwise). Then, the phenotypic value is:
\[ y_i = \sum_j X_{ij} \]
with $E(y_i) = \sum_j a_j G_{ij}$ and $Var(y_i|G_{ij}) = \sigma^2_{Total} = 1$.
Step 6. Perform association tests using our method and the standard approach for each
of the 100 SNPs separately and select a model for each SNP.
Step 7. Repeat 100 times
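Steps 1 through 5 can be sketched as follows. This is an illustrative, scaled-down sketch (far fewer individuals and SNPs than the actual study, and all parameter values and names are invented for the example); the Beta(2, 8) MAF draw, the HWE genotypes, the effect-size formula with $d_j \in \{0, a_j, -a_j\}$, and the per-SNP noise variance $\sigma^2_{Total}$ divided by the number of causal SNPs follow the steps above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ind, n_snp, n_causal, h2 = 1000, 500, 30, 0.4  # scaled-down stand-ins

# Step 1: MAFs from Beta(2, 8); QC drops SNPs with MAF < 0.01
maf = rng.beta(2, 8, size=n_snp)
maf = maf[maf >= 0.01]
n_snp = len(maf)

# Step 2: genotypes under Hardy-Weinberg equilibrium;
# Binomial(2, maf) yields genotype frequencies p^2, 2p(1-p), (1-p)^2
geno = rng.binomial(2, maf[None, :], size=(n_ind, n_snp))

# Step 3: choose causal SNPs and assign modes of inheritance at random
causal = rng.choice(n_snp, size=n_causal, replace=False)
modes = rng.choice(["additive", "dominant", "recessive"], size=n_causal)

# Step 4: effect size a from the locus-specific heritability
# h2_j = 2p(1-p)[a + d(1-2p)]^2 + [2p(1-p)d]^2 (sigma^2_total = 1),
# with d = 0 (additive), d = a (dominant), d = -a (recessive)
def effect_size(h2j, p, mode):
    if mode == "additive":
        c = 2 * p * (1 - p)
    elif mode == "dominant":                  # d = a
        c = 2 * p * (1 - p) * (2 - 2 * p) ** 2 + (2 * p * (1 - p)) ** 2
    else:                                     # recessive, d = -a
        c = 2 * p * (1 - p) * (2 * p) ** 2 + (2 * p * (1 - p)) ** 2
    return np.sqrt(h2j / c)

# Step 5: phenotype = sum of per-SNP contributions X_ij ~ N(a G_ij, 1/n_causal)
h2j = h2 / n_causal
y = np.zeros(n_ind)
for j, mode in zip(causal, modes):
    a = effect_size(h2j, maf[j], mode)
    if mode == "additive":
        g = geno[:, j].astype(float)          # 0, 1, 2 copies of the minor allele
    elif mode == "dominant":
        g = (geno[:, j] >= 1).astype(float)   # AB or BB
    else:
        g = (geno[:, j] == 2).astype(float)   # BB only
    y += rng.normal(a * g, np.sqrt(1.0 / n_causal))
```

Summing per-SNP contributions with noise variance $1/n_{causal}$ keeps the conditional phenotypic variance at 1, mirroring the design in which 100 causal SNPs each carry 1/100 of the noise.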
In each simulated data set, the empirical false positive rates in the two approaches was
evaluated to determine thresholds for p-values and BF with the same false positive rates.
Specifically, in each simulated set the number of false positive associations (significant
associations of null SNPs) in the frequentist results with p-values p < 1 × 10−7 , 5 × 10−7 ,
1×10−6 , and 5×10−6 were counted and the BF thresholds that produced the same number
of false positive associations in the Bayesian approach were identified. Using these p-values
and BF thresholds that produced the same empirical false positive rates, the power of the
two approaches was evaluated. Two types of power were considered: (1) the number of
causal SNPs detected as associated regardless of whether the correct genetic model was
identified and (2) the number of causal SNPs detected with the true genetic model.
2.4.3 Simulation Study (3)
Data and Methods: The limitation of Simulation Study (2) is the assumption that each
SNP accounts for the same proportion of the trait variability. Therefore, the scheme of the
Simulation Study (2) was modified to let causal SNPs explain varying proportions of the
trait variability. In this modified design, the genetic variances of dominant and recessive
SNPs were increased, while decreasing the genetic variance of additive SNPs. This tradeoff was necessary to maintain the same total heritability used in Simulation Study (2) for
proper comparison later. The following two cases were considered: 1) when h2j was halved
for additive SNPs and 2) when h2j was quartered for additive SNPs. In case 1), this resulted
in increasing h2j by 25% for both dominant and recessive SNPs. In case 2), this resulted
in increasing h2j by 37.5% for both dominant and recessive SNPs. Under these changes,
the effect sizes were generated based on Step 4 in the previous section and the rest of the
simulation design remained the same.
Results: Table 2.3 shows the median false positive rates at varying significance thresholds
in the two sets included in Simulation Study (1). Setting the thresholds to maximum BF
> 1500 in our approach and minimum p-value < 10−5 in the standard approach yields the
same median false positive rate of 4 × 10−5 in both data sets. This translates into 2 false
positive associations among 50,000 SNPs.

Table 2.3: Median false positive rates in the NECS and CSSCD data in 10,000 permutations (Simulation Study 1). BF = Bayes factor; p = p-value.

a) Bayesian Polynomial Model Approach

             BF>100    BF>500    BF>1000   BF>1500   BF>3000   BF>5000
NECS data    9.0e-4    1.4e-4    8.0e-5    4.0e-5    2.0e-5    0.0
CSSCD data   6.8e-4    1.2e-4    6.0e-5    4.0e-5    2.0e-5    0.0

b) Frequentist Approach

             p<1e-3    p<1e-4    p<1e-5    p<1e-6    p<1e-7    p<1e-8
NECS data    3.4e-3    4.0e-4    4.0e-5    0.0       0.0       0.0
CSSCD data   3.3e-3    3.8e-4    4.0e-5    0.0       0.0       0.0
Figure 2.1: Box plot of cubic root of empirical false positive rates (y-axis) for different
p-value thresholds (x-axis), and increasing sample sizes. The results are based on the
simulation scenario when the heritability was 0.4 and the sample sizes were 10,000, 20,000,
and 50,000 (panel a, b, and c, respectively).
Figures 2.1 to 2.3 summarize the results of Simulation Study (2) for the scenario in
which the total heritability is 0.4 and the sample sizes are 1) 10,000; 2) 20,000; and 3)
50,000. The full set of results when the total heritability is 0.2 and 0.6 can be found
in Supplementary Materials. With a sample of 1000, neither approach detects any causal
Figure 2.2: Box plot of log-transformed BF (y-axis) for different p-value thresholds (x-axis) and increasing sample sizes. The results are based on the simulation scenario when the
heritability estimate was 0.4 and the sample sizes were 10,000, 20,000, and 50,000 (panel a,
b, and c respectively).
Figure 2.3: Box plot of power of the two approaches at different significance thresholds
(Red: Bayesian approach; Blue: Frequentist approach). The results are based on the
simulation scenario when the heritability estimate was 0.4 and the sample sizes were 10,000,
20,000, and 50,000 (panel a, b, and c, respectively).
SNPs, while almost all causal SNPs are detected when the sample size is 100,000, regardless
of the heritability. Figure 2.1 shows the distribution of the empirical false positive rate
for different p-value thresholds and figure 2.2 shows the distribution of the BF that would
produce the same empirical false positive rates of the frequentist procedure. Figure 2.3
shows the box plot of the true positive rate (proportion of detected causal SNPs) of the
two approaches at varying significance thresholds. Finally, Table 2.4 shows the mean
number of additive, dominant, and recessive SNPs that are correctly identified (out of 34,
33, and 33, respectively) in the two approaches at successive thresholds.
The mean and standard deviation of the quantitative traits were 2.55 and 1.10 when
the total heritability was 0.4. Several points are noteworthy. The first point is that, as
expected, lower heritability of the trait results in a smaller number of detections. Even
when the total heritability was relatively high (h2 =0.6), both approaches detected about
half of the causal SNPs with N=10,000. At the most stringent significance threshold of
p < 1 × 10−7 , the Bayesian approach correctly identified 53.75 causal SNPs on average and
the frequentist approach correctly identified 46.52 causal SNPs on average when heritability
is 0.6 and N is 20,000 (see Supplementary Materials for detail). This result is consistent
with findings from other authors that large sample sizes are needed to detect many causal
variants that explain a small proportion of variability. For example, in [56] authors have
shown that they need approximately N=25,000 to detect 25 loci out of 201 causal variants
with 80% power for a highly heritable trait.
Secondly, we observed that the false positive rate of decision rules based on BF decreases
as the sample size increases, given a fixed BF threshold. This property has also been noted
in [49, 83] and implies that we can relax the BF threshold as the sample size increases, thereby leveraging the increased sample size better than frequentist procedures do. Figure 2.2
illustrates this property graphically. For example, at a fixed p-value threshold 1×10−7 , the
median BF threshold needed to obtain the same false positive rate decreases from 11122 to
9866 to 7489 as the sample size increases from 10,000 to 20,000 to 50,000. A similar pattern
is observed at all levels of false positive rates. In contrast, no such pattern is observed in
the frequentist approach, and the false positive rates are invariant to sample sizes given a
fixed p-value threshold in the standard approach.
The third important point is that the Bayesian method we propose has a slightly greater
power for more stringent (lower) thresholds (see Figure 2.3) than the frequentist approach.
This result holds for all sample sizes and all levels of heritability considered in the simulations (see Supplement Figures 2.6 and 2.9). Also, at this stringent threshold, the Bayesian
approach recovered more correct genetic models when the sample sizes were 20,000 and
50,000 (Table 2.4). Although our method recovers SNPs with an additive genetic effect less often than the frequentist approach, it identifies SNPs with dominant and recessive effects more often. When the sample size was 10,000, the Bayesian approach identified the
correct genetic models slightly less often. In addition, both approaches identified nearly
0 models that had either dominant or recessive inheritance pattern when N=10,000 in
simulation study (2). We speculated that this may be due to the lack of power to detect
rare variants. For example, if we assume MAF=0.01, under HWE the expected count of
the homozygote group for the minor allele is only 1. As the sample size increased, there
was a substantial increase in identification of dominant and recessive variants. Results
from simulation study (3) also support this conjecture. Increased effect sizes for SNPs
with dominant and recessive effects resulted in more detection of these variants at the cost
of loss of power for additive SNPs. However, loss of power for additive SNPs was much
greater than increased power for dominant and recessive SNPs when the sample size was
10,000. This result suggests that we need much higher sample sizes to detect dominant
and recessive variants, compared to the additive SNPs.
2.5 Application to Real Data
Using the thresholds that yielded the same false positive rate in the two methods (maximum
BF>1500 and minimum p < 1 × 10−5 ) in simulation study (1), we compared the results
obtained with the two methods in the cohorts described in the earlier section, using the
SNP sets generated after pruning the dependent SNPs. In the NECS data, out of 50,894
tested SNPs, nine SNPs were found associated with neuroticism using the polynomial
model approach, whereas only five SNPs were significant using the standard approach
(Table 2.5). Four SNPs were common to both analyses, and for these four SNPs the genetic models selected by the two approaches agreed. This result suggests that the Bayesian model selection procedure works well in the case of small sample sizes and can potentially discover more variants.

Table 2.4: Mean number of additive, dominant, and recessive SNPs correctly identified (out of 34, 33, and 33, respectively) in the two approaches when heritability is 0.4. A = additive; D = dominant; R = recessive.

Simulation Study (2) - Uniform Contribution
                         Bayesian                 Frequentist
  N      Threshold   A     D     R    Total    A     D     R    Total
  10000  1e-7      15.6   0.7   0.9   17.2   17.3   0.3   0.3   17.9
         5e-7      16.0   0.8   1.0   17.7   19.8   0.6   0.6   21.0
         1e-6      16.3   0.9   1.2   18.4   20.8   0.7   0.8   22.3
         5e-6      17.5   1.3   1.9   20.7   22.4   1.1   1.6   25.2
  20000  1e-7      25.0   4.9   6.4   36.3   28.2   2.4   3.5   34.2
         5e-7      25.0   5.6   7.5   38.0   28.2   4.1   5.4   37.7
         1e-6      25.0   6.2   8.1   39.3   28.2   4.9   6.6   39.7
         5e-6      25.0   8.8  11.0   44.8   28.3   7.5  10.1   45.9
  50000  1e-7      30.0  27.1  30.5   87.6   31.2  23.2  28.6   82.9
         5e-7      30.0  27.6  31.1   88.8   31.2  24.4  30.3   85.8
         1e-6      30.0  27.8  31.4   89.3   31.2  24.6  31.0   86.7
         5e-6      30.0  28.5  32.0   90.5   31.2  25.4  31.9   88.5

Simulation Study (3) - Case 1)
  10000  1e-7       3.5   1.4   1.9    6.8    3.2   0.7   0.8    4.7
         5e-7       3.8   1.5   2.1    7.4    4.9   1.0   1.4    7.3
         1e-6       4.1   1.8   2.4    8.3    5.7   1.3   1.8    8.9
         5e-6       5.5   2.9   3.7   12.0    8.5   2.5   3.2   14.2
  20000  1e-7      16.5  10.4  13.1   40.0   18.9   6.3   8.1   33.2
         5e-7      17.0  11.5  14.5   43.1   21.2   9.0  11.5   41.7
         1e-6      17.4  12.3  15.4   45.1   21.9  10.2  13.3   45.4
         5e-6      18.5  14.7  18.7   51.9   23.6  12.9  17.7   54.2
  50000  1e-7      26.3  29.4  32.7   88.4   29.0  26.6  32.2   87.8
         5e-7      26.3  29.4  32.8   88.5   29.0  26.8  32.6   88.4
         1e-6      26.3  29.5  32.8   88.7   29.0  26.9  32.7   88.5
         5e-6      26.3  29.5  32.9   88.8   29.0  26.9  32.9   88.8

Simulation Study (3) - Case 2)
  10000  1e-7       0.3   2.0   2.8    5.1    0.2   0.9   1.1    2.2
         5e-7       0.3   2.3   3.1    5.7    0.5   1.7   2.1    4.2
         1e-6       0.4   2.6   3.4    6.3    0.6   2.1   2.7    5.5
         5e-6       0.7   3.9   5.0    9.6    1.2   3.5   4.5    9.2
  20000  1e-7       3.9  13.5  16.8   34.2    3.4   8.8  11.2   23.5
         5e-7       4.4  14.5  18.0   36.9    5.3  11.7  15.2   32.2
         1e-6       4.8  15.2  19.0   39.0    6.4  12.7  16.9   35.9
         5e-6       6.3  17.8  22.3   46.4    9.3  15.6  21.4   46.3
  50000  1e-7      20.9  29.7  32.9   83.5   24.2  27.4  32.7   84.3
         5e-7      21.3  29.7  32.9   84.0   25.3  27.5  32.9   85.6
         1e-6      21.5  29.7  33.0   84.2   25.6  27.5  33.0   86.0
         5e-6      21.8  29.7  33.0   84.5   26.1  27.5  33.0   86.5
In the CSSCD data, out of 140,864 tested SNPs, ten SNPs were associated with fetal
hemoglobin in each of the two approaches, and eight SNPs were common to both (Table 2.5). For these
eight SNPs in common, five of them agreed in the genetic model selection between the two
approaches, but three SNPs (rs2239580, rs12469604, and rs2034614) had discrepant results.
For rs2239580, the Bayesian procedures selected the dominant model, while the standard
approach identified the genotypic model (2 degrees of freedom model). For rs12469604,
the Bayesian procedure selected the dominant model, while the additive model had the
minimum p-value. For rs2034614, the co-dominant model had the maximum BF and the
genotypic model had the minimum p-value.
Using our Bayesian polynomial model approach in the NECS data, 4 SNPs had dominant models, 3 SNPs had additive models, 1 SNP had a co-dominant model and 1 SNP
had a recessive model. In the CSSCD data, 3 SNPs had dominant models, 3 SNPs had
co-dominant models, 3 SNPs had recessive models, and 1 SNP had an additive model.
These results suggest that different variants may influence the trait through different genetic models. Some of these associations would not have been captured if an additive model
alone was used, and this highlights the need to examine all possible genetic models in a
computationally efficient manner to ensure that we do not miss any interesting findings.
Table 2.5: Significant results using the two approaches in the a) NECS and b) CSSCD data. BF = Bayes factor; P = p-value; G = genotypic; A = additive; D = dominant; R = recessive; C = co-dominant; Het = heterozygote genotype factor in the genotypic model; Hom = homozygote genotype factor in the genotypic model. SNPs that were significant in both approaches are highlighted in gray. For each SNP the table lists its chromosome and gene, the Bayes factors $BF_G$, $BF_A$, $BF_D$, $BF_R$, $BF_C$, and $BF_{max}$, and the p-values $P_{Het}$, $P_{Hom}$, $P_A$, $P_D$, $P_R$, $P_C$, and $P_{min}$. a) NECS data: rs850610, rs7666974, rs2333166, rs2801185, rs1869676, rs8064944, rs3746314, rs9555139, rs12770017, rs1530239. b) CSSCD data: rs6709302, rs7631659, rs13043968, rs2239580, rs6932510, rs1890911, rs12469604, rs2034614, rs2301819, rs9642124, rs11794652, rs7975463.
2.6 Conclusion
We propose a Bayesian approach to simultaneously detect the SNPs associated with a
continuous trait and the mode of inheritance. Our Bayesian approach uses a polynomial
parameterization of the SNP dosage that can simultaneously represent different genetic
models and a coherent framework for model selection based on comparing different models
by their posterior probability ([73, 48, 68, 29, 81, 51, 72, 14, 45, 47, 86]). Crucial to our
approach is the use of proper prior distributions on the parameters of the polynomial model,
from which the prior distributions of specific genetic models can be derived. In contrast to
this coherent Bayesian approach, it is important to emphasize that the frequentist approach
does not have a clear way to compare the genotypic model (2 degrees of freedom) to the
other four specific genetic models (1 degree of freedom). The evaluation of the method in
simulated data shows that the Bayesian method we propose has a slightly higher power
when the false positive rate is limited to very small values, and this is a particularly attractive property in genome-wide association studies, in which the large number of SNPs analyzed requires setting the false positive rate to extremely small values. An additional attractive feature of this method is the gain in computational efficiency: the proposed method encodes five genetic models simultaneously using a single polynomial parameterization instead of fitting
five different genetic models for each SNP. This is in contrast to the standard approach in
which recoding of the SNP genotype and conducting 5 analyses is necessary to evaluate all
five models concurrently.
An important theoretical implication of this particular parameterization is that it shows
that different genetic models are functionally related. We have shown that there is a
mathematical relationship between the parameters of the polynomial model and each of
the five genetic models. This relation also suggests that when all five genetic models
are evaluated the effective number of tests per SNP is less than 5. In practice, GWASs
suffer from severe correction for multiple testing, and evaluation of several genetic models
for each SNP aggravates this issue. However, our work suggests that the correction for
multiple testing should be less severe as the effective number of tests is less than the number
of models fitted, when evaluating all five genetic models, although it is not immediately
obvious how Bonferroni type corrections should benefit from this result.
The proposed work can be particularly useful for genome-wide data consisting of millions of SNPs. This work, in its current state, is limited to the case where the trait is
quantitative, as we can obtain closed form solutions for the marginal likelihood and BF.
More work is needed to evaluate a similar approach when the trait of interest is binary
or time-to-event. In particular, when performing logistic regression in the GWAS context, alternative measures of association such as the approximate Bayes factor or the Bayesian
false-discovery probability ([80, 81, 82]) can be considered.
2.7 Supplementary Materials
2.7.1 Derivation of Marginal Likelihood for the General Model
Consider a general model (M_G):

\[ y = X'\gamma + \epsilon' \]

\[ p(y|X', \gamma, \tau) = \frac{\tau^{n/2}}{(2\pi)^{n/2}}\, e^{-\tau/2\, (y - X'\gamma)^T (y - X'\gamma)} \]

In this model, \gamma is a linear combination of \beta, such that \gamma = \omega\beta, where

\[ \omega = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 2 & 4 \end{pmatrix}. \]

We have the following proper priors for \gamma and \tau, such that p(\gamma, \tau) = p(\gamma|\tau)\, p(\tau):

\[ \tau \sim \mathrm{Gamma}(\alpha_1, \alpha_2) \]

\[ \gamma|\tau = \omega\beta|\tau \sim N(\omega\beta_0, \omega(\tau R_0)^{-1}\omega^T) = N(\gamma'_0, \tau^{-1}(R'_0)^{-1}), \]

where

\[ (R'_0)^{-1} = \omega R_0^{-1} \omega^T, \qquad \gamma'_0 = \omega\beta_0, \]

and

\[ p(\gamma|\tau) = \frac{\tau^{(p+1)/2}\, |R'_0|^{1/2}}{(2\pi)^{(p+1)/2}}\, e^{-\tau/2\, (\gamma - \omega\beta_0)^T R'_0 (\gamma - \omega\beta_0)}. \]
Then, the updated hyper-parameters are:

\[ R'_n = (\omega^{-1})^T R_n\, \omega^{-1} \]

\[ \gamma'_n = \omega\beta_n \]

\[ \alpha'_{1n} = \alpha_1 + n/2 = \alpha_{1n} \]

\[ \frac{1}{\alpha'_{2n}} = \frac{1}{\alpha_2} + \frac{1}{2}\left( -\beta_n^T R_n \beta_n + \beta_0^T R_0 \beta_0 + y^T y \right) = \frac{1}{\alpha_{2n}} \]
The marginal likelihood given M_G is:

\[
p(y|M_G) = \frac{|R'_0|^{1/2}\, (\alpha'_{2n})^{\alpha'_{1n}}\, \Gamma(\alpha'_{1n})}{(2\pi)^{n/2}\, |R'_n|^{1/2}\, \alpha_2^{\alpha_1}\, \Gamma(\alpha_1)}
= \frac{|R'_0|^{1/2}\, \alpha_{2n}^{\alpha_{1n}}\, \Gamma(\alpha_{1n})}{(2\pi)^{n/2}\, |R'_n|^{1/2}\, \alpha_2^{\alpha_1}\, \Gamma(\alpha_1)}
= \frac{|(\omega^{-1})^T R_0\, \omega^{-1}|^{1/2}\, \alpha_{2n}^{\alpha_{1n}}\, \Gamma(\alpha_{1n})}{(2\pi)^{n/2}\, |(\omega^{-1})^T R_n\, \omega^{-1}|^{1/2}\, \alpha_2^{\alpha_1}\, \Gamma(\alpha_1)}
= \frac{|R_0|^{1/2}\, \alpha_{2n}^{\alpha_{1n}}\, \Gamma(\alpha_{1n})}{(2\pi)^{n/2}\, |R_n|^{1/2}\, \alpha_2^{\alpha_1}\, \Gamma(\alpha_1)}
= p(y|M_p).
\]
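The invariance p(y|M_G) = p(y|M_p) can be checked numerically. The sketch below is an illustration, not the code used in this dissertation: it computes the closed-form log marginal likelihood of the conjugate normal-gamma linear model, using the standard conjugate updates R_n = R_0 + X^T X and beta_n = R_n^{-1}(R_0 beta_0 + X^T y) (assumed here; they are stated earlier in the chapter), in both the beta-parameterization and the transformed gamma = omega beta parameterization.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)

def log_marginal(y, X, beta0, R0, a1, a2):
    """Closed-form log marginal likelihood of the conjugate
    normal-gamma linear model (a2 is the Gamma scale, as in the text)."""
    n = len(y)
    Rn = R0 + X.T @ X
    beta_n = np.linalg.solve(Rn, R0 @ beta0 + X.T @ y)
    a1n = a1 + n / 2
    a2n = 1.0 / (1.0 / a2 + 0.5 * (y @ y + beta0 @ R0 @ beta0
                                   - beta_n @ Rn @ beta_n))
    _, ld0 = np.linalg.slogdet(R0)
    _, ldn = np.linalg.slogdet(Rn)
    return (0.5 * (ld0 - ldn) - n / 2 * np.log(2 * np.pi)
            + a1n * np.log(a2n) - a1 * np.log(a2)
            + gammaln(a1n) - gammaln(a1))

# Polynomial design in the genotype g: columns (1, g, g^2).
g = rng.integers(0, 3, size=50)
X = np.column_stack([np.ones_like(g, dtype=float), g, g**2])
y = rng.normal(size=50)
beta0, R0, a1, a2 = np.zeros(3), np.eye(3), 2.0, 1.0

# Reparameterize with gamma = omega @ beta  (so X' = X @ inv(omega)).
omega = np.array([[1., 0., 0.], [0., 1., 1.], [0., 2., 4.]])
om_inv = np.linalg.inv(omega)
lm_beta = log_marginal(y, X, beta0, R0, a1, a2)
lm_gamma = log_marginal(y, X @ om_inv, omega @ beta0,
                        om_inv.T @ R0 @ om_inv, a1, a2)
assert np.isclose(lm_beta, lm_gamma)  # p(y|M_G) = p(y|M_p)
```

The determinant ratio and the quadratic forms are invariant under the linear reparameterization, which is why the two values coincide.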
2.7.2 Derivation of Marginal Likelihood for the Dominant Model
Now, consider the dominant model for the B allele:

\[ E(y|\alpha_D) = \alpha_{D0} + \alpha_{D1} x_{dom}, \]

where x_{dom} = 1 if the genotype is AB or BB (and 0 otherwise). In this model, \alpha = (\alpha_{D0}, \alpha_{D1})^T is a linear combination of \beta subject to a linear constraint. If we let

\[ \theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{pmatrix} = \omega\beta, \qquad \omega = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 3 \end{pmatrix}, \]

then

\[ \alpha = \begin{pmatrix} \alpha_{D0} \\ \alpha_{D1} \end{pmatrix} = \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} \Big|\, \theta_2 = 0 \]

(a conditional multivariate normal). This corresponds to

\[ \alpha_{D0} = \beta_0, \qquad \alpha_{D1} = \beta_1 + \beta_2, \]

with the linear constraint \beta_1 + 3\beta_2 = 0.
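A quick numerical check of this correspondence (a sketch, not part of the original derivation): under the constraint beta_1 + 3 beta_2 = 0, the polynomial means at the AB and BB genotypes coincide, which is exactly the dominant model, and the constraint is theta_2 = 0 in the omega-transformed coordinates.

```python
import numpy as np

omega = np.array([[1., 0., 0.],
                  [0., 1., 1.],
                  [0., 1., 3.]])

# Any beta satisfying the constraint beta1 + 3*beta2 = 0:
beta = np.array([0.5, 3.0, -1.0])            # beta1 + 3*beta2 = 0
theta = omega @ beta                          # (theta0, theta1, theta2)
assert np.isclose(theta[2], 0.0)              # the constraint is theta2 = 0

# Polynomial genotype means E(y|g) = beta0 + beta1*g + beta2*g^2:
mean = lambda gg: beta[0] + beta[1] * gg + beta[2] * gg**2
# Under the constraint the AB and BB means agree: a dominant model,
assert np.isclose(mean(1), mean(2))
# with intercept alpha_D0 = theta0 and effect alpha_D1 = theta1.
assert np.isclose(mean(1) - mean(0), theta[1])
```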
We have the following proper priors for \alpha and \tau, such that p(\alpha, \tau) = p(\alpha|\tau)\, p(\tau). For \tau, we have:

\[ \tau \sim \mathrm{Gamma}(\alpha_1, \alpha_2), \qquad p(\tau) = \frac{1}{\alpha_2^{\alpha_1}\, \Gamma(\alpha_1)}\, \tau^{\alpha_1 - 1}\, e^{-\tau/\alpha_2}, \]

where

\[ \alpha_1 = \frac{v_0}{2}, \qquad \alpha_2 = \frac{2}{v_0 \sigma_0^2}. \]

For the prior hyper-parameters of \alpha, we first determine the distribution of \theta and then determine the conditional distribution.
\[ \beta|\tau \sim N(\beta_0, (\tau R_0)^{-1}), \]

where

\[ \beta_0 = \begin{pmatrix} \beta_{00} \\ \beta_{01} \\ \beta_{02} \end{pmatrix} \quad \text{and} \quad R_0^{-1} = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix}. \]
Then,

\[ \theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{pmatrix} = \omega\beta \sim N(\omega\beta_0, \omega(\tau R_0)^{-1}\omega^T), \]

where

\[ \omega\beta_0 = \begin{pmatrix} \beta_{00} \\ \beta_{01} + \beta_{02} \\ \beta_{01} + 3\beta_{02} \end{pmatrix} \]

and

\[ \omega(\tau R_0)^{-1}\omega^T = \tau^{-1} \begin{pmatrix} r_{11} & r_{12} + r_{13} & r_{12} + 3r_{13} \\ r_{21} + r_{31} & r_{22} + r_{23} + r_{32} + r_{33} & r_{22} + 3r_{23} + r_{32} + 3r_{33} \\ r_{21} + 3r_{31} & r_{22} + r_{23} + 3r_{32} + 3r_{33} & r_{22} + 3r_{23} + 3r_{32} + 9r_{33} \end{pmatrix}. \]
Then,

\[ \alpha = \begin{pmatrix} \alpha_{D0} \\ \alpha_{D1} \end{pmatrix} = \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} \Big|\, \theta_2 = 0 \sim N(\mu_0, \tau^{-1}\Sigma_0^{-1}), \]

where

\[ \mu_0 = \begin{pmatrix} \beta_{00} \\ \beta_{01} + \beta_{02} \end{pmatrix} + \begin{pmatrix} r_{12} + 3r_{13} \\ r_{22} + 3r_{23} + r_{32} + 3r_{33} \end{pmatrix} \left[ r_{22} + 3r_{23} + 3r_{32} + 9r_{33} \right]^{-1} \left[ -(\beta_{01} + 3\beta_{02}) \right] \]

and

\[ \Sigma_0^{-1} = \begin{pmatrix} r_{11} & r_{12} + r_{13} \\ r_{21} + r_{31} & r_{22} + r_{23} + r_{32} + r_{33} \end{pmatrix} - \begin{pmatrix} r_{12} + 3r_{13} \\ r_{22} + 3r_{23} + r_{32} + 3r_{33} \end{pmatrix} \left[ r_{22} + 3r_{23} + 3r_{32} + 9r_{33} \right]^{-1} \begin{pmatrix} r_{12} + 3r_{13} & r_{22} + 3r_{23} + r_{32} + 3r_{33} \end{pmatrix}. \]
The updated parameters in the dominant model are obtained in a similar fashion. Let

\[ \beta_n = \begin{pmatrix} \beta_{n0} \\ \beta_{n1} \\ \beta_{n2} \end{pmatrix} \quad \text{and} \quad R_n^{-1} = \begin{pmatrix} r'_{11} & r'_{12} & r'_{13} \\ r'_{21} & r'_{22} & r'_{23} \\ r'_{31} & r'_{32} & r'_{33} \end{pmatrix}. \]
Then,

\[ \theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{pmatrix} = \omega\beta \sim N(\omega\beta_n, \omega(\tau R_n)^{-1}\omega^T), \]

where

\[ \omega\beta_n = \begin{pmatrix} \beta_{n0} \\ \beta_{n1} + \beta_{n2} \\ \beta_{n1} + 3\beta_{n2} \end{pmatrix} \]

and

\[ \omega(\tau R_n)^{-1}\omega^T = \tau^{-1} \begin{pmatrix} r'_{11} & r'_{12} + r'_{13} & r'_{12} + 3r'_{13} \\ r'_{21} + r'_{31} & r'_{22} + r'_{23} + r'_{32} + r'_{33} & r'_{22} + 3r'_{23} + r'_{32} + 3r'_{33} \\ r'_{21} + 3r'_{31} & r'_{22} + r'_{23} + 3r'_{32} + 3r'_{33} & r'_{22} + 3r'_{23} + 3r'_{32} + 9r'_{33} \end{pmatrix}. \]
Then,

\[ \alpha = \begin{pmatrix} \alpha_{D0} \\ \alpha_{D1} \end{pmatrix} = \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} \Big|\, \theta_2 = 0 \sim N(\mu_n, \tau^{-1}\Sigma_n^{-1}), \]

where

\[ \mu_n = \begin{pmatrix} \beta_{n0} \\ \beta_{n1} + \beta_{n2} \end{pmatrix} + \begin{pmatrix} r'_{12} + 3r'_{13} \\ r'_{22} + 3r'_{23} + r'_{32} + 3r'_{33} \end{pmatrix} \left[ r'_{22} + 3r'_{23} + 3r'_{32} + 9r'_{33} \right]^{-1} \left[ -(\beta_{n1} + 3\beta_{n2}) \right] \]

and

\[ \Sigma_n^{-1} = \begin{pmatrix} r'_{11} & r'_{12} + r'_{13} \\ r'_{21} + r'_{31} & r'_{22} + r'_{23} + r'_{32} + r'_{33} \end{pmatrix} - \begin{pmatrix} r'_{12} + 3r'_{13} \\ r'_{22} + 3r'_{23} + r'_{32} + 3r'_{33} \end{pmatrix} \left[ r'_{22} + 3r'_{23} + 3r'_{32} + 9r'_{33} \right]^{-1} \begin{pmatrix} r'_{12} + 3r'_{13} & r'_{22} + 3r'_{23} + r'_{32} + 3r'_{33} \end{pmatrix}. \]

The updated parameters of the Gamma distribution are

\[ \alpha_{1n} = \alpha_1 + n/2, \qquad \alpha_{2n} = \left[ \frac{-\mu_n^T \Sigma_n \mu_n + y^T y + \mu_0^T \Sigma_0 \mu_0}{2} + \frac{1}{\alpha_2} \right]^{-1}. \]
Finally,

\[ p(y|M_D) = \frac{|\Sigma_0|^{1/2}\, \alpha_{2n}^{\alpha_{1n}}\, \Gamma(\alpha_{1n})}{(2\pi)^{n/2}\, |\Sigma_n|^{1/2}\, \alpha_2^{\alpha_1}\, \Gamma(\alpha_1)}. \]
2.7.3 Supplementary Figures
Figure 2.4: Box plot of the cubic root of the empirical false positive rates (y-axis) for different p-value thresholds (x-axis) and increasing sample sizes. The results are based on the simulation scenario in which the heritability was 0.2 and the sample sizes were 10,000, 20,000, and 50,000 (panels a, b, and c, respectively).

Figure 2.5: Box plot of the log-transformed BF (y-axis) for different p-value thresholds (x-axis) and increasing sample sizes. The results are based on the simulation scenario in which the heritability estimate was 0.2 and the sample sizes were 10,000, 20,000, and 50,000 (panels a, b, and c, respectively).

Figure 2.6: Box plot of the power of the two approaches at different significance thresholds (red: Bayesian approach; blue: frequentist approach). The results are based on the simulation scenario in which the heritability estimate was 0.2 and the sample sizes were 10,000, 20,000, and 50,000 (panels a, b, and c, respectively).

Figure 2.7: Box plot of the cubic root of the empirical false positive rates (y-axis) for different p-value thresholds (x-axis) and increasing sample sizes. The results are based on the simulation scenario in which the heritability was 0.6 and the sample sizes were 10,000, 20,000, and 50,000 (panels a, b, and c, respectively).

Figure 2.8: Box plot of the log-transformed BF (y-axis) for different p-value thresholds (x-axis) and increasing sample sizes. The results are based on the simulation scenario in which the heritability estimate was 0.6 and the sample sizes were 10,000, 20,000, and 50,000 (panels a, b, and c, respectively).

Figure 2.9: Box plot of the power of the two approaches at different significance thresholds (red: Bayesian approach; blue: frequentist approach). The results are based on the simulation scenario in which the heritability estimate was 0.6 and the sample sizes were 10,000, 20,000, and 50,000 (panels a, b, and c, respectively).
2.7.4 Supplementary Tables
                              Bayesian                      Frequentist
h^2   N       Threshold       A     D     R     Total       A      D      R      Total
0.2   10,000  1 × 10^-7       4.0   0.1   0.1   4.3         3.6    0      0      3.7
              5 × 10^-7       4.3   0.1   0.1   4.6         5.4    0.1    0.1    5.6
              1 × 10^-6       4.7   0.2   0.2   5.0         6.4    0.1    0.1    6.6
              5 × 10^-6       6.1   0.3   0.5   6.8         9.3    0.3    0.3    9.8
      20,000  1 × 10^-7       17.4  0.7   1.0   19.1        19.8   0.3    0.3    20.4
              5 × 10^-7       17.9  0.8   1.2   19.9        21.8   0.6    0.7    23.1
              1 × 10^-6       18.2  1.0   1.4   20.6        22.7   0.8    0.9    24.3
              5 × 10^-6       19.2  1.5   2.4   23.2        24.2   1.4    2.0    27.5
      50,000  1 × 10^-7       26.7  11.3  13.9  51.9        29.0   7.3    9.4    45.7
              5 × 10^-7       26.7  13.0  15.7  55.3        29.0   10.2   13.0   52.2
              1 × 10^-6       26.7  13.9  16.5  57.0        29.0   11.7   14.6   55.3
              5 × 10^-6       26.7  16.5  19.6  62.3        29.0   14.5   18.7   62.2
0.6   10,000  1 × 10^-7       21.3  1.6   2.3   25.2        24.4   0.8    1.0    26.1
              5 × 10^-7       21.4  1.8   2.5   25.8        25.2   1.2    1.8    28.3
              1 × 10^-6       21.6  2.2   2.8   26.5        25.5   1.6    2.2    29.2
              5 × 10^-6       21.8  3.3   4.4   29.5        25.8   2.9    3.8    32.5
      20,000  1 × 10^-7       27.2  11.8  14.7  53.8        29.4   7.5    9.6    46.5
              5 × 10^-7       27.2  13.0  16.0  56.3        29.4   10.1   13.5   52.9
              1 × 10^-6       27.2  13.7  17.2  58.1        29.4   11.3   15.1   55.8
              5 × 10^-6       27.2  16.1  20.3  63.7        29.4   14.2   19.6   63.16
      50,000  1 × 10^-7       31.4  29.5  32.9  93.8        32.1   26.94  32.59  91.63
              5 × 10^-7       31.4  29.6  32.9  93.9        32.1   27.1   32.8   91.9
              1 × 10^-6       31.4  29.6  32.9  93.9        32.1   27.1   32.9   92.1
              5 × 10^-6       31.4  29.6  33.0  94.0        32.1   27.1   33.0   92.2

Table 2.6: Mean number of additive, dominant, and recessive SNPs correctly identified (out of 34, 33, and 33, respectively) by the two approaches when the heritability is 0.2 and 0.6. A = additive; D = dominant; R = recessive.
2.8 Acknowledgments

This work was funded by the National Institute on Aging (NIA U19-AG023122 to T.P.), the National Heart, Lung, and Blood Institute (R21HL114237 to P.S.), and the National Institutes of Health (R01HL87681 to M.H.S.).
Chapter 3
Network Modeling of Family-Based Data
3.1 Introduction
A Bayesian network (BN) is a vector of random variables with a joint probability distribution that factorizes according to Markov properties represented by the associated directed
acyclic graph (DAG) [57]. BNs can represent complex probabilistic dependencies among
many variables in a modular way and have become increasingly popular in many fields,
including genetics [39, 64, 77] and genomics [22, 62]. There are many well established
approaches to structural learning of BNs that are particularly efficient when the variables
are all categorical or all Gaussian [32], and more computationally demanding when the
variables are a mixture of different types [37]. When learning the dependency structure
among many variables in a BN, a common assumption underlying virtually all of the proposed approaches is that data are from a random sample of independent and identically
distributed (IID) observations. However, many observational studies collect samples that
have correlations among observations (repeated measures and/or family-based samples),
in which observations within a cluster are correlated. In this case, the IID assumption
is violated in structural learning of BNs. For example, family-based studies select families based on some criterion, such as enrichment of family members for a particular trait of interest, and consenting relatives within families are enrolled. In this study design, because blood-related subjects in the same family share genetic background, observations cannot be treated as independent. The correlation between relatives differs according to the degree of relatedness, which is often represented by the additive genetic relationship matrix (see Figure 3.1). It is well known that ignoring the within-cluster correlation can inflate the false positive rate of standard statistical approaches [11], and this problem can be particularly serious in statistical genetics [7].
Figure 3.1: An example pedigree and corresponding additive genetic relationship matrix. The pedigree in the top panel displays the relations among family members. The additive genetic relationship matrix is the kinship matrix multiplied by 2; the kinship matrix contains the kinship coefficients between any pair of family members, and these coefficients represent the probability that two individuals share an allele identical by descent. The covariance between two family members i and j with kinship coefficient k_ij is 2 k_ij σ_g², where σ_g² represents the genetic variance.
Among the different approaches to modeling correlated data in the traditional regression framework, the use of mixed effect models is emerging as one of the most popular methods [50]. In this chapter, we propose a parameterization that applies mixed effects regression models to BNs such that they can be used for both structure and parameter learning from correlated data, and can work with a mixture of variable types including categorical, continuous, and time-to-event data. In the next section we give a brief review of learning BNs from independent and identically distributed observations. In Section 3.3 we describe the parameterization and use of mixed-effects regression models with correlated data. In Section 3.4 we extend mixed-effects regression models to BNs by introducing a matrix of decomposable random effects. Section 3.5 presents the results of simulation studies to compare different model selection metrics that can be used for learning mixed-effects BNs, and Section 3.6 provides a real data example. Conclusions and suggestions for further work are in Section 3.7.
3.2 Learning Bayesian Networks from Independent and Identically Distributed Observations
The coherent Bayesian approach to model selection selects the model with maximum posterior probability, assuming a 0-1 loss function [6]. When all models are a priori equally likely, this rule is equivalent to choosing the model with maximum marginal likelihood:

\[ p(D|M) = \int p(D|\theta, M)\, p(\theta|M)\, d\theta \]

where D denotes the sample data set, M denotes the BN structure, θ denotes the model parameters, and p(D|θ, M) and p(θ|M) denote the likelihood function and the prior distribution of the parameters. Efficient algorithms for learning the structure of a BN from data leverage the decomposability of the likelihood function to break down the model search into a modular search of the dependency of each node on its parent nodes [33]. The decomposability of the likelihood is based on the factorization of the probability distribution
of the variables X_1, ..., X_v according to the local Markov property of a given DAG:

\[ p(D|\theta, M) = \prod_{k=1}^{n} p(x_{1k}, x_{2k}, ..., x_{vk}|\theta, M) = \prod_{k=1}^{n} \prod_{i=1}^{v} p(x_{ik}|pa(x_i)_k, \theta_i, M_i) \qquad (3.1) \]
In Equation (3.1), n is the sample size. Each index i represents the node and each index
k represents the sample unit. pa(xi ) denotes the observable parents of the variable Xi
in the BN with structure M , while xik and pa(xi )k denote the observed value of Xi and
its parent nodes in the k-th sample unit. Each sub-model Mi specifies the set of parents
of the variable Xi (see Figure 3.2), so that M = (M1 , M2 , ..., Mv ). We also denote by
xk = (x1k , x2k , ..., xvk ) the vector of values of the variables measured in the k-th sample
unit. θ is the vector of parameters θ = (θ1 , θ2 , ..., θv ), where each θi can itself be a vector of
parameters indexing the conditional distribution of the variable Xi given its parents pa(xi ).
Note that the parameters θi are additional (unobserved) parent nodes of the variable Xi
and are included in the application of the local Markov property. In Equation (3.1), the
likelihood function is decomposed using factorization at two levels. First, the factorization
of the likelihood into the product over the sample units (index k) uses the property of
independence of the observations given the parameters. Second, the factorization into the
product over the variables (index i) uses the local Markov property. By inverting the two
products in Equation (3.1),

\[ p(D|\theta, M) = \prod_{k=1}^{n} \prod_{i=1}^{v} p(x_{ik}|pa(x_i)_k, \theta_i, M_i) = \prod_{i=1}^{v} \left\{ \prod_{k=1}^{n} p(x_{ik}|pa(x_i)_k, \theta_i, M_i) \right\} \]
the likelihood can be re-written as the product of local likelihood functions for each variable,
and this factorization is one of the critical features for local computations in BN [15,
33]. In addition to the local Markov property for the variables in the graphical model,
efficient Bayesian computations rely on a factorization of the prior distribution for the
vector of parameters θ. Dawid and Lauritzen [17] described general Hyper-Markov laws
that assume certain marginal and conditional independences of the parameters to produce
this factorization:
\[ p(D|\theta, M)\, p(\theta|M) = p(\theta|M) \prod_{i,k} p(x_{ik}|pa(x_i)_k, \theta_i, M_i) = \prod_{i,k} p(x_{ik}|pa(x_i)_k, \theta_i, M_i)\, p(\theta_i|M_i) \]
Thus, the decomposability of the overall likelihood function into a product of local likelihood functions, together with Hyper-Markov laws for the prior distributions of θ, allows the marginal likelihood to be computed as the product:

\[ p(D|M) = \prod_{i} \int \prod_{k} p(x_{ik}|pa(x_i)_k, \theta_i, M_i)\, p(\theta_i|M_i)\, d\theta_i \]
Figure 3.2: Example of BN with 3 observable variables X1 , X2 , X3 and parameter vectors
θ1 , θ2 , θ3 .
Figure 3.2 shows a simple example of a BN with three variables. For each variable,
sub-models M1 , M2 , and M3 specify the set of parents of the variable X1 , X2 , and X3 ,
respectively, so that M1 = {X1 , pa(X1 )}, M2 = {X2 , pa(X2 )}, and M3 = {X3 , pa(X3 )}. If
there are no missing data, the observations are independent, and the prior distributions of the parameters follow a Hyper-Markov law, then the marginal likelihood p(D|M) factorizes into a product of three local marginal likelihood functions and we can write

\[ p(D|M) = p(D|M_1)\, p(D|M_2)\, p(D|M_3), \]

where p(D|M_i) = \int p(D|\theta_i, M_i)\, p(\theta_i|M_i)\, d\theta_i. In this quantity, the local likelihood function can be written as

\[ p(D|\theta_i, M_i) = \prod_{k=1}^{n} p(x_{ik}|pa(x_i)_k, \theta_i, M_i). \]
Two well characterized situations are (1) structural learning of directed graphical models with categorical variables and hyper-Dirichlet prior distributions for the parameters that index the conditional distribution of each node given its parents, and (2) directed Gaussian networks, with hyper-Wishart distributions that model the variance-covariance matrix. Bayesian estimates of the parameters can be computed in closed form in both situations. Computations are more demanding when the variables are mixed [37].
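The two-level factorization above can be verified numerically for a small Gaussian network. In the sketch below (illustrative; the coefficients are arbitrary), the joint log density of a 3-node Gaussian BN equals the sum of the node-by-parent conditional log densities.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# A 3-node Gaussian BN: X1 -> X2, (X1, X2) -> X3, unit error variances.
b21, b31, b32 = 0.5, 0.3, 0.7
x1, x2, x3 = 0.2, -1.1, 0.4   # one observed sample unit

# Local (node-by-parent) log likelihoods:
local = (norm.logpdf(x1, 0.0, 1.0)
         + norm.logpdf(x2, b21 * x1, 1.0)
         + norm.logpdf(x3, b31 * x1 + b32 * x2, 1.0))

# The implied joint distribution: X = L @ eps with eps ~ N(0, I).
L = np.array([[1.0, 0.0, 0.0],
              [b21, 1.0, 0.0],
              [b31 + b32 * b21, b32, 1.0]])
joint = multivariate_normal(mean=np.zeros(3), cov=L @ L.T)

# The joint log density equals the sum of the local terms.
assert np.isclose(joint.logpdf([x1, x2, x3]), local)
```

This is exactly the product-to-sum (on the log scale) decomposition that makes the modular, node-by-node search possible.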
In the more realistic situations in which the variables in the model are a mix of categorical and continuous variables, exact Bayesian calculations are not feasible. Markov Chain
Monte Carlo (MCMC) methods are possible, but estimation of the marginal likelihood via
stochastic computations is a notoriously difficult problem and proposed solutions based
on harmonic means of MCMC samples may be numerically unstable [36]. An alternative
solution is to use the information criteria AIC or BIC:
\[ AIC = -2\log p(D|\hat\theta) + 2p \]
\[ BIC = -2\log p(D|\hat\theta) + \log(n)\, p \]
BIC provides an approximation of marginal likelihood that works well in large samples
and is considered the appropriate criterion for model selection while AIC works best when
the goal is prediction [36]. When the sample size is greater than e² ≈ 7.39, BIC places a bigger penalty on each parameter and leads to a more parsimonious model. A better penalty term for survival analysis with censored observations has been proposed in [79]; it uses the number of uncensored events rather than the overall sample size.
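For concreteness, a minimal sketch of the two criteria defined above (an illustration; the log-likelihood value and parameter count are arbitrary):

```python
import numpy as np

def aic(loglik, p):
    # AIC = -2 log p(D|theta_hat) + 2p
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    # BIC = -2 log p(D|theta_hat) + log(n) * p
    return -2.0 * loglik + np.log(n) * p

# BIC penalizes each parameter by log(n) instead of 2, so it is the
# more conservative criterion as soon as n exceeds e^2 (about 7.39).
assert np.log(8) > 2 > np.log(7)
ll, p = -123.4, 5
assert bic(ll, p, n=1000) > aic(ll, p)
```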
3.3 Mixed-Effects Regression Models
Mixed effect models provide a general approach to analyze correlated data by including
random effects in the regression model to introduce correlation between observations within
the same clusters. These random effects are regarded as nuisance parameters in the sense
that, in many situations, the interest is in the statistical inference on fixed effects parameters adjusting for the random effects. Much more work and evaluations have been
conducted when the data are continuous [50]. Suppose Y denotes the vector of observations of n subjects from m clusters, and suppose that Y follows a multivariate normal
distribution. A linear mixed effect model that relates the effects of covariates X1 , ..., Xp to
Y uses random effects u to model the correlation between observations as:
\[ Y = X\beta + Z\Gamma u + \Delta\epsilon, \qquad u \sim N(0, I_s), \quad \epsilon \sim N(0, I_n), \quad u \perp \epsilon \qquad (3.2) \]
This is the parameterization initially suggested by [13], where X is an n × p matrix of measured covariates for the fixed effects β, Z is an n × s matrix of known coefficients, and Γ and ∆ are s × s and n × n matrices of parameters that contain the variance parameters for the random effects and the error correlations. The model specifies that
\[ E(Y|X, \beta) = X\beta \]
\[ V(Y|X, \beta) = Z\Gamma\Gamma^T Z^T + \Delta\Delta^T = Z\Psi Z^T + \Sigma \]
and the correlation of the observations is captured by the matrices Ψ and Σ. If both Ψ and Σ are block diagonal matrices,

\[ \Psi = \mathrm{diag}(\Psi_1, \Psi_2, ..., \Psi_m), \qquad \Sigma = \mathrm{diag}(\Sigma_1, \Sigma_2, ..., \Sigma_m), \]

the parameterization is the 'independent cluster model', in which subjects from the m different clusters are independent but subjects within the same cluster are correlated.
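A minimal numpy sketch of the independent cluster model with a random family intercept (the cluster sizes and variances are arbitrary): the covariance V(Y|X, β) = ZΨZ^T + Σ is block diagonal, with constant covariance within a cluster and zero covariance across clusters.

```python
import numpy as np

# Independent-cluster (random intercept) model: m = 3 families of sizes
# 2, 3, 2; Z maps each subject to its family's random effect.
sizes = [2, 3, 2]
n, m = sum(sizes), len(sizes)
Z = np.zeros((n, m))
row = 0
for f, s in enumerate(sizes):
    Z[row:row + s, f] = 1.0
    row += s

sigma_u2, sigma_e2 = 0.5, 1.0
Psi = sigma_u2 * np.eye(m)            # random-effect covariance
Sigma = sigma_e2 * np.eye(n)          # error covariance
V = Z @ Psi @ Z.T + Sigma             # V(Y | X, beta)

# Within a family the covariance is sigma_u2; across families it is 0.
assert np.isclose(V[0, 1], sigma_u2)   # subjects 0, 1 are in family 0
assert np.isclose(V[0, 2], 0.0)        # subject 2 is in family 1
assert np.isclose(V[0, 0], sigma_u2 + sigma_e2)
```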
For the analysis of time-to-event data with proportional hazards models, the random effects are usually modeled in the parameterization of the log-transformed hazard function or using a frailty term with a gamma or multivariate normal distribution [76, 28]. For categorical data
modeled within the framework of generalized linear models, the random effects are modeled
on the scale of the linear predictors (Generalized linear mixed effect models, [8]). These
models make the additional assumption that observations are independent, conditional on
the random effects.
The inclusion of random effects in a Bayesian model specification does not add any level of 'conceptual complexity' to the Bayesian analysis: random effects are additional 'nuisance parameters' in the model, and Bayesian inference can be carried out by computing the marginal posterior distribution of the parameters of interest [20]. The challenge is the computation of the posterior distribution of the parameters when random effects are in the model, and typically one has to rely on MCMC methods [25]. Several efficient approaches have been proposed in the context of correlated family data [42, 44, 89]. In the frequentist approach there is no 'natural' way to deal with random effects: statistical inference uses either the marginal or the conditional approach. In the marginal approach, the integrated likelihood function is derived from the likelihood function of the fixed and random effects by integrating out the random effects:

\[ p(D|\theta, M) = \int p(y, u|x, \beta, \Gamma, \Delta)\, du = \int p(u|\Gamma)\, p(y|u, x, \beta, \Gamma, \Delta)\, du, \]
and θ represents the vector of fixed effects and variance parameters, and D represents
the data set y. The integrated likelihood function can be computed in closed form for
linear models, and using numerical approximations for non-linear/non-normal models [60].
The integrated likelihood is used to find maximum-likelihood estimates of the fixed effects
and variance components. Restricted maximum likelihood can also be used to estimate
the variance/covariance parameters of the random effects and reduces the bias of the
maximum-likelihood estimates [50].
Common parameterizations of the random effects include exchangeable correlation, in which the correlation between observations within the cluster is assumed constant, and auto-regressive correlation. When the clusters are families, the correlation between family members depends on the degree of relatedness, and kinship coefficients can be used to model the different correlations between relatives. Figure 3.1 shows the parameterization for correlated family data using kinship coefficients. In this situation, Ψ is a block diagonal matrix (assuming that observations in different families are not related), and the blocks are proportional to the additive genetic relationship matrix (twice the matrix of kinship coefficients), whose entries represent the probability that alleles (e.g., alternative variants of a gene) are transmitted identically by descent between pairs of family relatives [38]. Using the additive genetic relationship matrix implies that family members with closer relatedness are likely to have more correlated phenotype values. The correlation of family members is modeled proportionally to the kinship coefficients as suggested by [1], with a coefficient σ_g² that represents the 'genetic variance'.
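As a small worked example (a sketch with an assumed nuclear-family pedigree, not the study data), the kinship matrix of a father, mother, and two full siblings yields the additive genetic relationship matrix A = 2K and the phenotypic covariance σ_g² A + σ_e² I:

```python
import numpy as np

# Kinship coefficients for a nuclear family: father, mother, two children.
# self = 1/2, parent-offspring = 1/4, full siblings = 1/4, spouses = 0.
K = np.array([[0.50, 0.00, 0.25, 0.25],   # father
              [0.00, 0.50, 0.25, 0.25],   # mother
              [0.25, 0.25, 0.50, 0.25],   # child 1
              [0.25, 0.25, 0.25, 0.50]])  # child 2

A = 2 * K                                  # additive genetic relationship matrix
sigma_g2, sigma_e2 = 1.0, 1.0
V = sigma_g2 * A + sigma_e2 * np.eye(4)    # phenotypic covariance

# Parent-offspring and full-sib phenotypic covariance: 2 * (1/4) * sigma_g2.
assert np.isclose(V[0, 2], 0.5 * sigma_g2)
assert np.isclose(V[2, 3], 0.5 * sigma_g2)
# Each subject's variance is sigma_g2 + sigma_e2 (A has 1's on the diagonal).
assert np.isclose(V[0, 0], sigma_g2 + sigma_e2)
```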
The review in [50] examines several model selection criteria that have been proposed for the selection of fixed effects in linear mixed effect models with normally distributed errors and random effects. Commonly used criteria are modifications of the AIC and BIC, but a lack of consensus arises from three complications in using these criteria with mixed models: (1) the information criteria can use either the integrated likelihood or the conditional likelihood, (2) the effective sample size is not clear when observations are correlated, and (3) the computation of the effective number of parameters can also be unclear.
A popular solution, known as the marginal AIC, uses the integrated likelihood to compute the deviance and counts the fixed effects and variance parameters as the number of parameters [78]. The review by [50] provides a detailed description of additional variants of
the AIC for linear mixed models that use different approximations of the effective number
of parameters. A similarly popular version of the BIC for mixed models also uses the
integrated likelihood to compute the deviance, and a penalty term based on the actual
sample size and the number of fixed effects and variance parameters. As noted before,
when observations are correlated, the overall sample size cannot correctly represent the
true quantity of the sample size. In addition, BIC that uses the full sample size when
observations are correlated provides a too stringent correction. Therefore, a modified
version of BIC that accounts for this fact has been proposed:
\[ BIC = -2\log p(D|\hat\theta) + \log(n_e)\, p, \]
where ne is an estimate of the effective sample size, which is a reduction of the full sample
size to account for the correlation between observations [50]. Three proposed corrections
of the sample size are particularly appealing.
Jones' correction: n_e = n_J = 1^T C^{-1} 1, where 1 is the unit vector and C is the correlation matrix that can be estimated from the covariance matrix V = V(Y|X, β) = ZΨZ^T + Σ [35]. The rationale of this correction is that the sample size n is the coefficient of the intercept term in the least squares equations, and this coefficient is 1^T V^{-1} 1 in the mixed effects linear regression model. The correlation matrix C is then used so that the effective sample size is invariant to linear transformations of the data. Note that, with Jones' correction, n_e changes with the phenotype that is analyzed.
Yang's correction: n_e = n_Y = \left( \sum_f n_f^2 / (1_{n_f}^T K_f 1_{n_f}) \right) / 2, which applies to family-based data. It assumes that the data are from families of size n_f, where K_f denotes the kinship matrix of the f-th family [87]. Yang's correction is based solely on the kinship matrix; therefore, given a particular family structure, the quantity n_e remains the same regardless of the phenotype that is analyzed, in contrast to Jones' correction.
Liberal correction: n_e = n_c, where n_c is the number of clusters. This is the most liberal correction, in which each cluster represents a single sample unit.
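The three corrections can be compared on a toy data set (a sketch with two identical nuclear families; this is illustrative, not one of the simulation scenarios of Section 3.5):

```python
import numpy as np

# One nuclear family (father, mother, two children); kinship matrix:
K = np.array([[0.50, 0.00, 0.25, 0.25],
              [0.00, 0.50, 0.25, 0.25],
              [0.25, 0.25, 0.50, 0.25],
              [0.25, 0.25, 0.25, 0.50]])
m, nf = 2, 4                            # two identical families of size 4
sigma_g2, sigma_e2 = 1.0, 1.0
Vf = sigma_g2 * 2 * K + sigma_e2 * np.eye(nf)
V = np.kron(np.eye(m), Vf)              # block-diagonal covariance
n = V.shape[0]

# Jones: n_e = 1' C^{-1} 1, with C the correlation matrix of V.
d = np.sqrt(np.diag(V))
C = V / np.outer(d, d)
one = np.ones(n)
ne_jones = one @ np.linalg.solve(C, one)

# Yang: n_e = (sum_f n_f^2 / (1' K_f 1)) / 2, based on kinship only.
ne_yang = sum(nf**2 / (np.ones(nf) @ K @ np.ones(nf)) for _ in range(m)) / 2

# Liberal: n_e = number of clusters.
ne_liberal = m

# In this example all three reduce the nominal sample size n = 8:
assert ne_liberal < ne_yang < ne_jones < n
```

Note that Jones' correction depends on the phenotypic covariance (here through σ_g² and σ_e²), while Yang's depends only on the pedigree, consistent with the contrast drawn above.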
Some of these criteria for model selection have been extensively evaluated through theoretical arguments and simulation studies, but many simulations used simple scenarios. The determination of the best model selection criteria for linear mixed effect models is still open, and this is a very active research area in statistics [50].
Very little evaluation has been done for model selection of mixed models for time-to-event data, or for generalized linear mixed effect models. The deviance information criterion (DIC) is an alternative Bayesian criterion for model selection introduced by Spiegelhalter et al. [71]. In DIC, the deviance is penalized by the effective number of parameters p_D:

\[ DIC = -2\log p(D|\hat\theta) + 2 p_D \]
\[ p_D = E\left(-2\log p(D|\theta)\right) + 2\log p(D|\hat\theta), \]

where E(−2 log p(D|θ)) is the posterior expectation of the deviance. This criterion is supposed to work well also in small samples, but estimates of the criterion from MCMC output are often unstable [26].
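A minimal sketch of the DIC computation from posterior draws (a toy normal-mean model standing in for an actual MCMC sampler; the draws are simulated directly from the known posterior):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Toy model: y_i ~ N(mu, 1); draws of mu stand in for MCMC output.
y = rng.normal(0.3, 1.0, size=40)
mu_draws = rng.normal(y.mean(), 1 / np.sqrt(len(y)), size=2000)

def deviance(mu):
    return -2.0 * norm.logpdf(y, mu, 1.0).sum()

d_bar = np.mean([deviance(mu) for mu in mu_draws])  # posterior mean deviance
d_hat = deviance(mu_draws.mean())                   # deviance at posterior mean
p_d = d_bar - d_hat                                 # effective no. of parameters
dic = d_hat + 2.0 * p_d

# For a single mean parameter, p_D should be close to 1.
assert 0.5 < p_d < 1.5
```

The instability mentioned above shows up here as Monte Carlo noise in p_D: it is a small difference of two large deviance estimates.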
3.4 Mixed-Effects Bayesian Networks
Our approach to learning and quantifying BNs from correlated data applies the mixed effects modeling approach to BNs and maintains the decomposability of the likelihood, integrated likelihood, and marginal likelihood functions, which allows for node-by-parent regression and efficient computation.

As random effects are nuisance parameters, they can be treated as additional parameters in the BN. Therefore, we introduce a matrix of random effects u = (u_1, ..., u_v) that includes
Figure 3.3: Example of the proposed parameterization for correlated observations, when both the dependency structure and the conditional probability distributions need to be estimated from correlated data. The random effects u (blue nodes) have probability distributions that depend on parameters τ (green nodes). This parameterization differs from the common parameterization for independent observations in that the random effects u_1, u_2, and u_3 are added as parents of the observable variables X_1, X_2, and X_3.
one vector of random effects for each of the observable variables X1 , ..., Xv in the BN. These
random effects are included in the parent sets of the variables X_1, ..., X_v; specifically, the parent set of each variable X_i is now (pa(X_i), θ_i, u_i), with the addition of the vector of random effects u_i. We also assume that u_i is independent of all the other fixed and random
effects parameters, conditionally on the family structure and correlation parameters (see
Figure 3.3 for an example). These assumptions are crucial as they allow to maintain the
factorization of the global likelihood function as the product of the local likelihood and the
modularity of model search but account for the correlation in the data.
To see this, let γ = (θ, u, τ ) denote the augmented set of parameters for the joint
probability distribution of the observable variables X1 , X2 , ..., Xv that factorizes according
to a DAG M . The vector θ in γ is the set of fixed effects parameters of the BN that
describe the parent-children conditional distributions, while u denotes the random effects
and τ is the additional variance parameters associated with u. For a given BN structure
M , we can write the product of the global likelihood function and prior distribution as:
p(D|γ, M )p(γ|M ) = p(D|θ, u, τ, M )p(θ, u, τ |M )
where D is again the data observed in a sample of n observations for the variables X1 , X2 , ..., Xv .
γ = (γ1 , ..., γv ) denotes the parameter vectors associated with the variables X1 , X2 , ..., Xv
in the model. We assume that γ_i and γ_j are marginally independent for i ≠ j and that, for each index i, θ_i and u_i are independent, conditionally on the additional correlation structure and parameter τ_i.
Figure 3.3 shows an example for a BN with 3 observable variables that follow normal distributions. In this example, we have

\[ X_1 | \gamma \sim N(\beta_{10} + \delta_1 u_1, \sigma_1) \]
\[ X_2 | X_1 = x_1, \gamma \sim N(\beta_{20} + \beta_{21} x_1 + \delta_2 u_2, \sigma_2) \]
\[ X_3 | X_1 = x_1, X_2 = x_2, \gamma \sim N(\beta_{30} + \beta_{31} x_1 + \beta_{32} x_2 + \delta_3 u_3, \sigma_3) \]

Then, the joint distribution of the observable variables X_1, X_2, and X_3 is

\[ \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \Bigg|\, \gamma \sim MVN\left( \begin{pmatrix} \mu_1 = \beta_{10} \\ \mu_2 = \beta_{20} + \beta_{21}\mu_1 \\ \mu_3 = \beta_{30} + \beta_{31}\mu_1 + \beta_{32}\mu_2 \end{pmatrix}, \Sigma \right). \]
Under these assumptions of independence of the parameters, the product of the global likelihood and the prior distribution of the parameters factorizes as:

\[ p(D|\gamma, M)\, p(\gamma|M) = p(\gamma|M) \prod_{i=1}^{v} p(x_i|pa(x_i), \gamma_i, M_i) \]

As the prior distribution expands, the above expression further simplifies into

\[ p(D|\gamma, M)\, p(\gamma|M) = \prod_{i=1}^{v} p(x_i|pa(x_i), \theta_i, \tau_i, u_i, M_i)\, p(\theta_i|M_i)\, p(u_i|\tau_i, M_i)\, p(\tau_i|M_i). \]
In the formula,

\[ x_i = \begin{pmatrix} x_{i1} \\ \vdots \\ x_{in} \end{pmatrix}, \qquad \theta_i = \begin{pmatrix} \theta_{i1} \\ \vdots \\ \theta_{ip_i} \end{pmatrix}, \qquad u_i = \begin{pmatrix} u_{i1} \\ \vdots \\ u_{in} \end{pmatrix} \sim N(0, \Psi_i), \quad \Psi_i = \Psi_i(\tau_i). \]
The variable X_i can assume different types of distributions. For continuous data, it can follow a normal distribution. For categorical data, it can follow a Poisson distribution or a multinomial distribution. In the case of time-to-event data, survival distributions such as the Weibull distribution may be used. In the non-normal case, the random effects are included on the scale of the linear predictors or, for time-to-event data, in the log-hazard function. The correlation between observations of the variable X_i is modelled through the variance-covariance matrix Ψ_i and the variance parameter τ_i of the random effect vector u_i. It is important to note that this parameterization maintains the decomposability of the integrated likelihood, which can be used for local model search using information-based criteria or other Bayesian criteria.
For example, conditionally on the fixed effects parameter vector θ = (θ_1, ..., θ_v) and the variance parameter vector τ = (τ_1, ..., τ_v) associated with the random effects vector u = (u_1, ..., u_v), the integrated likelihood can be computed as:

\[ p(D|\theta, \tau, M) = \prod_{i=1}^{v} \int p(u_i|\tau_i, M_i)\, p(x_i|pa(x_i), \theta_i, \tau_i, u_i, M_i)\, du_i = \prod_{i=1}^{v} p(D|\theta_i, \tau_i, M_i), \]
where p(D|θi , τi , Mi ) can be computed exactly for normally distributed variables (see [50])
or using numerical approximations in other cases as shown in [60]. The product-form of
the integrated likelihood is crucial since it implies that the search of the best dependency
structure can be performed in a modular way, by choosing the set of parents of each variable
that optimizes either the marginal BIC or AIC rule.
Once the best dependency structure is selected, the conditional distributions of each local model can be estimated using MCMC methods. The MCMC samples can also be used to estimate functions of the parameters that summarize associations between variables in the model that are not directly linked, such as interactions, and possibly mediation, by estimating the decomposition of total effects into direct and indirect effects as suggested in [58]. However, this chapter of the dissertation focuses on the structural learning of the BN.
3.5 Simulation Studies
We conducted simulation studies to provide answers to the following questions:

• What is the effect of ignoring the correlation between the observations in structure learning of a BN from correlated data?

• What is the trade-off between the false and true positive rates of using modifications of BIC and AIC for learning a BN from correlated data?

For simplicity, we focused on deterministic algorithms with a forward search procedure, such as the K2 algorithm [15], so that the problem can be reduced to a search for covariates in a regression model using a forward search. We explored the cases when the data are continuous and time-to-event.
3.5.1 Continuous Data
We borrowed the family structure from the Long Life Family Study (LLFS), which is a
study of healthy aging that enrolled approximately 5,000 individuals between 2006 and 2009
from 583 families demonstrating clustering for longevity and healthy aging in the United
States and Denmark [52, 67]. A typical family structure in the LLFS has a proband, the proband's siblings and offspring, and their spouses. For this simulation study, the total sample size
was 4656 and the number of families was 582. Subjects whose genotype data were not
available were excluded from the analysis. With a kinship matrix Kf for each family, the variance-covariance matrix of the observations is the 4656 × 4656 matrix:

\[
V = \sigma_e^2 I_{4656} + 2\sigma_g^2\, \mathrm{diag}(K_1, K_2, \ldots, K_{582})
\]
We chose the error variance σe2 = 1 and varied the genetic variance to be σg2 = 1/3, 1,
and 3 to simulate genetic traits with heritability σg2 /(σg2 + σe2 ) = 0.25, 0.50, and 0.75. For
instance, when σe2 = 1 and σg2 = 1, 50% of the trait variability is due to genetics and the rest to other factors. To generate correlated data, in each simulation a vector Z of
independent and normally distributed observations was generated and transformed into
Y = U D^{1/2} Z, where U and D are the matrices of eigenvectors and eigenvalues from the spectral decomposition of the variance-covariance matrix V. Since V(Z) = I, this transformation guarantees that V(Y) = U D^{1/2} V(Z) D^{1/2} U^T = U D U^T = V, so that the simulated data have the desired
correlation. In each run, we also included 10 null common SNPs (minor allele frequency >
5%), which were randomly selected from the real GWAS data from the LLFS.
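The data-generation step above can be sketched in Python (a minimal illustration using a hypothetical two-family sibling structure in place of the 582 LLFS families):

```python
import numpy as np

# Hypothetical example: two families of three full siblings each
# (kinship 1/2 with oneself, 1/4 between siblings), standing in for
# the 582 LLFS families.
K = np.full((3, 3), 0.25)
np.fill_diagonal(K, 0.5)
sigma_e2, sigma_g2 = 1.0, 1.0                  # heritability 0.50
V = sigma_e2 * np.eye(6)
V[:3, :3] += 2 * sigma_g2 * K                  # block-diagonal 2*sg2*K
V[3:, 3:] += 2 * sigma_g2 * K

# Spectral decomposition V = U diag(d) U^T, then Y = U D^{1/2} Z.
d, U = np.linalg.eigh(V)
M = U @ np.diag(np.sqrt(d))                    # the matrix U D^{1/2}

rng = np.random.default_rng(42)
Z = rng.standard_normal(6)                     # independent N(0, 1) draws
Y = M @ Z                                      # Var(Y) = M M^T = V exactly
```

The construction guarantees M M^T = V, so the covariance of the simulated trait matches the target matrix by design rather than only in expectation.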
Each simulated data set was analyzed using a forward search with BIC and AIC (a variable was added to the model when it decreased the BIC or AIC by the largest amount) and the standard likelihood ratio test LRT = −2 log p(D | θ̂) (a variable was added when it changed the LRT by the largest amount exceeding 3.84, for a nominal false positive rate of 5%), ignoring the correlation in the data. The data were also analyzed using BIC, AIC, and the likelihood ratio test based on the integrated likelihood, to account
for the correlation in the data. Four variants of BIC were used based on different effective
sample sizes: ne = 4656 (full sample size); ne = 2796 (Jones’ correction); ne = 1768
(Yang’s correction); and ne = 582 (most conservative sample size). The simulation was
repeated 1,000 times.
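The forward search used here can be sketched as follows (a simplified Python version for ordinary least squares; the dissertation's version scores mixed models via the integrated likelihood, with n_eff playing the role of the effective sample size in the BIC penalty):

```python
import numpy as np

def forward_search_bic(y, X, n_eff):
    """Greedy forward selection: at each step add the covariate that
    lowers BIC the most; stop when no addition lowers BIC."""
    n, p = X.shape

    def bic(cols):
        # Gaussian BIC for an OLS fit with an intercept term.
        M = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta = np.linalg.lstsq(M, y, rcond=None)[0]
        rss = float(np.sum((y - M @ beta) ** 2))
        return n * np.log(rss / n) + (len(cols) + 1) * np.log(n_eff)

    selected, remaining = [], list(range(p))
    best = bic(selected)
    while remaining:
        scores = {j: bic(selected + [j]) for j in remaining}
        j_star = min(scores, key=scores.get)
        if scores[j_star] >= best:
            break                      # no covariate improves the score
        best = scores[j_star]
        selected.append(j_star)
        remaining.remove(j_star)
    return selected
```

Shrinking n_eff below n shrinks the log(n_eff) penalty, which is exactly why the corrected BIC variants compared below become progressively more liberal.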
Table 3.1 shows the number of false positive covariates that were selected with the
forward search using the 9 criteria, the overall number of tests conducted during the forward
search, and both the false positive rate (probability of Type 1 error in one test) and the
family wise error rate (probability of one or more errors in the overall search) when the
heritability is 0.5. The full set of results for different heritability estimates can be found
in the Supplementary Materials. The results show an inflation of both error rates when
the correlation in the data is ignored, with a 55% increase of the family wise error rate
for the LRT , and a 267% increase for the BIC. The false positive rate of the LRT based
on the integrated likelihood that accounts for the correlation in the data is slightly below
the nominal level (0.0432), while the traditional LRT exhibits inflated Type 1 error rate
of 0.0748. The various corrections of the BIC result in small false positive and family
wise error rates. Using the full sample size as the effective sample size in the BIC is an
over-correction that results in a very conservative scoring metric. Decreasing the effective
sample size makes the BIC score more liberal with a modest increase of both false positive
and family wise error rates. Although these small error rates seem desirable, the question
is their effect on the true positive rates of the different scoring metrics.
Score   ne     Errors At Each Level             Tot      Error Rates
               1     2     3    4    ≥5         Tests    FPR      FWER
BICM    4656   46    0     0    0    0          10405    0.0044   0.045
BICJ    2796   59    1     0    0    0          10530    0.0057   0.058
BICY    1768   74    2     0    0    0          10647    0.0071   0.071
BICC    582    128   7     1    0    0          11135    0.0122   0.120
AICM    4656   1616  779   270  71   17         23310    0.1181   0.836
LRTM    4656   513   100   13   2    0          14535    0.0432   0.415
BICF    4656   128   7     0    0    0          11136    0.0121   0.120
LRTF    4656   965   311   79   8    0          18218    0.0748   0.642
AICF    4656   2316  1381  685  277  110        28479    0.1674   0.931

Table 3.1: BICM: BIC based on integrated likelihood and full sample size; BICJ, BICY, BICC: BIC with Jones', Yang's, and conservative effective sample size; AICM: AIC based on integrated likelihood and full sample size; LRTM: likelihood ratio test based on integrated likelihood to account for correlated data; BICF, LRTF, and AICF: traditional BIC, likelihood ratio test, and AIC ignoring correlated data. FPR is the false positive rate, defined as the number of errors over the total number of tests; FWER is the family wise error rate, i.e., the probability of one or more errors.
Based on the results of the false positive rates and family-wise error rates of different
model selection metrics, we compared the power of different variants of the BIC (BICM ,
BICJ , BICY , and BICC ) based on integrated likelihood to the power obtained from the
LRTM using the significance threshold determined from the false positive rates in Table 3.1.
For example, BICM has an observed false positive rate of 0.0044, so we compared the power
of BICM to the power of LRTM with significance threshold of 0.0044. To do so, we ran 3
additional simulations in which the variable Y was generated from a multivariate normal
distribution with variance-covariance structure as described above. In these scenarios, we
modelled the expected value of the variable Y as a linear function of 3 true covariates
that were also generated from a multivariate normal distribution with different amounts of
correlations. Three sets of regression parameters were chosen to represent the situations
of weak, moderate and strong covariate effects such that the first scenario included 3
weak effect covariates, the second scenario included 3 moderate effect covariates, and the
third scenario included 3 strong effect covariates. Power was defined as the probability of
detecting all three true covariates in each run. The results are summarized in Table 3.2.
The results show that the corresponding LRTM has higher power in all cases. For instance,
in the presence of covariates with moderate effects, BICM detects the 3 covariates 29.5%
of the time, whereas the corresponding LRTM detects the covariates 31.4% of the time,
an increase of 1.9 percentage points.
              Power
              Strong Effect  Moderate Effect  Weak Effect
BICM          0.572          0.295            0.139
LRT_BICM      0.593          0.314            0.151
BICJ          0.608          0.322            0.162
LRT_BICJ      0.623          0.340            0.172
BICY          0.635          0.346            0.178
LRT_BICY      0.648          0.362            0.186
BICC          0.708          0.426            0.245
LRT_BICC      0.710          0.429            0.247

Table 3.2: Power Comparisons of Four Variants of BIC vs. Corresponding LRTM. Results are based on 1,000 simulated datasets with 3 situations of strong, moderate, and weak covariate effects. BICM: BIC based on integrated likelihood and full sample size; BICJ, BICY, BICC: BIC with Jones', Yang's, and conservative effective sample size; LRT_BICM, LRT_BICJ, LRT_BICY, and LRT_BICC: likelihood ratio test based on integrated likelihood using the significance threshold obtained from the corresponding empirical false positive rates.
3.5.2 Time-to-event Data

3.5.2.1 Heritability of Time-to-event Data
For the time-to-event data, there is no clear definition of heritability, as the genetic variance
σg2 is modelled on the log-hazard scale. Therefore, as a by-product of this simulation study,
we derived a heritability-like estimate for the time-to-event trait on the log-hazard scale so
that the genetic component of the trait can be controlled in the simulation design. The
rest of this sub-section shows the derivation of this mathematical quantity.
Preliminaries Let the survival time T ∼ Weibull(λ, p), with density and hazard functions (denoted f(t) and h(t)):

\[
f(t) = \frac{p\, t^{p-1}}{\lambda^p} e^{-(t/\lambda)^p} \qquad \text{and} \qquad h(t) = \frac{p\, t^{p-1}}{\lambda^p}
\]

The logarithmic transformation of T follows a log-Weibull or Gumbel distribution with location parameter a = log λ and scale parameter b = 1/p. Then, the expected value and variance of log T are:

\[
E(\log T) = a + b\gamma \qquad V(\log T) = \frac{\pi^2}{6} b^2
\]

where γ ≈ 0.5772 is Euler's constant.
The Hazard Function In a proportional hazards model, the hazard function for the ith individual is defined as:

\[
h_i(t \mid X_i) = h_0(t)\, e^{X_i \beta}
\]

where h_0(t) is the baseline hazard, X_i is the vector of measured covariates, and β is the corresponding vector of regression coefficients. If we add a vector of random effects R such that R ∼ N(0, σg2 A), where A represents the additive genetic relationship matrix, then the model for the ith individual becomes:

\[
h_i(t, r_i \mid X_i) = h_0(t)\, e^{X_i \beta + r_i}
\]

where r_i is the per-subject random effect.
Proposed Idea The idea behind obtaining a heritability-like estimate (denoted h̃²) for a survival trait is to estimate the proportion of the total variance of the hazard function that is due to genetics. In order to obtain this quantity, we first compute the covariance between two individuals' hazards on the log scale.
\begin{align*}
cov(\log h_i(t), \log h_j(t)) &= cov(X_i\beta + r_i + \log h_0(t),\; X_j\beta + r_j + \log h_0(t)) \\
&= cov(r_i + \log h_0(t),\; r_j + \log h_0(t)) \\
&= cov(r_i, r_j) + cov(\log h_0(t), \log h_0(t)) \\
&\quad + cov(r_i, \log h_0(t)) + cov(r_j, \log h_0(t))
\end{align*}

Under the assumption that the random effects and the baseline hazard are independent, cov(r_i, log h_0(t)) = 0 and cov(r_j, log h_0(t)) = 0. Then,

\begin{align*}
cov(\log h_i(t), \log h_j(t)) &= cov(r_i, r_j) + var(\log h_0(t)) \\
&= 2\sigma_g^2 \Phi_{ij} + var\!\left(\log \frac{p}{\lambda^p} + (p-1)\log t\right) \\
&= 2\sigma_g^2 \Phi_{ij} + (p-1)^2\, var(\log t) \\
&= \underbrace{\left(\frac{p-1}{p}\right)^2 \frac{\pi^2}{6}}_{\text{variance of baseline hazard}} + \underbrace{2\sigma_g^2 \Phi_{ij}}_{\text{additional variance due to genetics}}
\end{align*}
where Φ_{ij} is the kinship coefficient between the two individuals i and j. Hence, the covariance between the two hazards on the log scale is the sum of the variance of the baseline hazard and the additional variance due to genetics. The closer the genetic relationship between the two individuals, the larger the variability added due to genetics. Then, we can define h̃² on the log scale as:

\[
\tilde{h}^2 = \frac{\text{genetic variance}}{\text{total variance of the hazard}} = \frac{\sigma_g^2}{\left(\frac{p-1}{p}\right)^2 \frac{\pi^2}{6} + \sigma_g^2}
\]
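The Gumbel variance formula underlying this derivation is easy to check by simulation; the following sketch (hypothetical parameter values) verifies that Var(log T) = π²/(6p²) for a Weibull(λ, p) survival time:

```python
import numpy as np

lam, p = 2.0, 2.0                          # Weibull scale and shape
rng = np.random.default_rng(0)
# Inverse-CDF draw: S(t) = exp(-(t/lam)^p), so T = lam * (-log U)^(1/p).
T = lam * (-np.log(rng.uniform(size=500_000))) ** (1.0 / p)

# log T is Gumbel with scale b = 1/p, hence Var(log T) = pi^2 / (6 p^2).
theory = np.pi ** 2 / (6 * p ** 2)
assert abs(np.var(np.log(T)) - theory) < 0.01
```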
For the simulation study on time-to-event data, we again borrowed the family structure
from the LLFS. To generate correlated time-to-event data, we modified the simulation
scheme from [12] by inducing correlation with log-normal frailty (random effects). The
baseline survival time was simulated from Weibull(2, 2). We simulated the correlated trait such that:

\[
T(X, R) = \sqrt{\frac{-4 \log U}{\exp(X\beta + R)}}
\]

where

\begin{align*}
U &\sim Unif(0, 1) \\
R &\sim MVN(0,\; 2\sigma_g^2\, \mathrm{diag}(K_1, K_2, \ldots, K_{582})) \\
C &\sim Unif(0, 2)
\end{align*}

so that the event time is defined as t = min(T, C) and the censoring indicator is δ = I(T ≤ C). The correlation among observations is induced by the inclusion of the random effects term R on the log-hazard scale.
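The generation scheme above can be sketched in Python (function and variable names are illustrative):

```python
import numpy as np

def simulate_event_time(lin_pred, rng):
    """Draw one survival time with a Weibull(2, 2) baseline under a
    proportional-hazards model with linear predictor Xb + R, plus an
    independent Unif(0, 2) censoring time."""
    U = rng.uniform()
    T = np.sqrt(-4.0 * np.log(U) / np.exp(lin_pred))
    C = rng.uniform(0.0, 2.0)
    return min(T, C), float(T <= C)        # event time t, indicator delta

rng = np.random.default_rng(7)
t, delta = simulate_event_time(lin_pred=0.5, rng=rng)
```

With lin_pred = 0 the draw reduces to the inverse CDF of Weibull(2, 2), since P(T > t) = exp(-(t/2)²); a positive linear predictor scales the hazard up and shortens the survival time.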
In each run, we again included 10 null common SNPs (minor allele frequency > 5%), which were randomly selected from the real GWAS data from the LLFS. Each simulated data set was analyzed using a forward search with BIC and AIC (a variable was added to the model when it decreased the BIC or AIC by the largest amount) and the standard likelihood ratio test LRT = −2 log p(D | θ̂) (a variable was added when it changed the LRT by the largest amount exceeding 3.84, for a nominal false positive rate of 5%), ignoring the correlation in the data. The data were also analyzed using BIC, AIC, and the likelihood ratio test based on the integrated likelihood, to account for the correlation in the data. For BIC, the effective sample size was the number of events, as suggested by [34]. The simulation was repeated 1,000 times.
Table 3.3 shows the number of false positive covariates that were selected with the
forward search using the 6 criteria, the overall number of tests conducted during the forward
search, and both the false positive rate (probability of Type 1 error in one test) and the
family wise error rate (probability of one or more errors in the overall search) when the
heritability is 50% on the log-hazard scale. The full set of results for different heritability
estimates can be found in the Supplementary Materials. The results show an inflation of
both error rates when the correlation in the data is ignored, with a 25% increase of the
family wise error rate for the LRT , and a 56% increase for the BIC. The false positive
rate of the LRT based on the integrated likelihood that accounts for the correlation in
the data is slightly below the nominal level (0.0470), while the traditional LRT exhibits
inflated Type 1 error rate of 0.0629. Consistent with the results from the continuous data,
AIC is the most liberal metric.
Score   Errors At Each Level             Tot      Error Rates
        1     2     3    4    ≥5         Tests    FPR      FWER
BICM    71    1     0    0    0          10638    0.0068   0.070
AICM    1654  831   327  81   21         23553    0.1237   0.822
LRTM    561   120   14   3    0          14850    0.0470   0.436
BICF    121   11    0    0    0          11069    0.0119   0.109
AICF    2057  1180  530  188  58         26572    0.1510   0.884
LRTF    767   226   46   5    0          16604    0.0629   0.543

Table 3.3: BICM: BIC based on integrated likelihood and number of events as the sample size; AICM: AIC based on integrated likelihood and full sample size; LRTM: likelihood ratio test based on integrated likelihood to account for correlated data; BICF, LRTF, and AICF: traditional BIC, likelihood ratio test, and AIC ignoring correlated data. FPR is the false positive rate, defined as the number of errors over the total number of tests; FWER is the family wise error rate, i.e., the probability of one or more errors.
Based on the results of the false positive rates and family-wise error rates of different
model selection metrics, we compared the power of BICM based on integrated likelihood
to the power obtained from LRTM using the significance threshold determined from the
false positive rates in Table 3.3. Again, we ran 3 additional simulations in which the
correlated survival trait was generated as described above. In these scenarios, we modelled
the log hazard as a linear function of 3 true covariates that were also generated from
a multivariate normal distribution with different amounts of correlations. Three sets of
regression parameters were chosen to represent the situations of weak, moderate and strong
covariate effects. Power was defined as the probability of detecting all three true covariates
in each run. The results are summarized in Table 3.4. The results show that the corresponding LRTM has higher power in all cases
regardless of the heritability estimates, which is consistent with results from the continuous
data. For instance, in the presence of covariates with moderate effects when h2 = 0.50,
BICM detects the 3 covariates 25.5% of the time, whereas the corresponding LRTM detects
the covariates 28.5% of the time, an increase of 3.0 percentage points.
                        Power
                        Strong Effect  Moderate Effect  Weak Effect
h² = 0.25  BICM         0.961          0.726            0.490
           LRT_BICM     0.964          0.741            0.502
h² = 0.50  BICM         0.830          0.516            0.315
           LRT_BICM     0.841          0.522            0.323
h² = 0.75  BICM         0.513          0.255            0.144
           LRT_BICM     0.540          0.285            0.161

Table 3.4: Power Comparisons of BICM vs. Corresponding LRTM. Results are based on 1,000 simulated datasets with 3 situations of strong, moderate, and weak covariate effects. BICM: BIC based on integrated likelihood and number of events as the sample size; LRT_BICM: likelihood ratio test based on integrated likelihood using the significance threshold obtained from empirical false positive rates.
In summary, these results emphasize the need to account for correlation in the data
(continuous and time-to-event) to avoid an unnecessary inflation of the false positive error
rates. Moreover, LRTM has slightly higher power than BIC based on the integrated likelihood in both the continuous and time-to-event cases. This suggests that, in practice, it may be more desirable to use LRTM with a reduced significance threshold, so that type 1 error rates are maintained at a chosen level while achieving higher power than BIC.
3.6 Application
We applied the proposed approach to build a BN to link genetic data, blood biomarkers,
socio-demographic factors, and life span in the LLFS. The genetic variants were independent single nucleotide polymorphisms (SNPs) in 23 genes in the insulin and insulin-like
growth factor 1 signaling (IIS) pathway that were found associated with age at death using
single SNP analysis with a significance threshold of 0.005. Table 3.5 summarizes the gene,
chromosome, and the number of tested SNPs. There was a total of 13 common SNPs (minor allele frequency > 5%) that were individually associated with age at death, adjusting
for sex. In a joint model that included these 13 SNPs as covariates, 6 of them were still
associated with age at death at the p-value threshold of 0.005. Given the large number of
tested SNPs, the p-value threshold of 0.005 may appear too liberal. However, this step was
primarily to obtain a candidate list of genetic variants to be considered in building the BN
later. For this purpose, a more inclusive criterion was employed in the single SNP analysis,
but a much more stringent rule was applied when building the BN as described below.
The question we were trying to answer with the BN was whether some of these direct
associations between SNPs and lifespan could be explained through associations with blood
biomarkers such as serum levels of DHEA (a steroid hormone linked to muscle loss in
aging), insulin growth factor 1 (IGF-1), transferrin receptors (Tr), and hemoglobin (Hgb).
Additional variables in the network were age at enrollment (Age.E) and follow-up survival
time (FUS) censored at last contact for living subjects, an indicator variable (Birth Year
Cohort: BYC) that accounted for possible secular trend, and sex. To build the BN, we used
the search procedure of the K2 algorithm, and we considered all possible orderings of the
other variables with the exception of the SNPs that for biological reasons were considered
as root nodes in the BN and the follow-up survival time that was the child of all other
nodes. For each possible ordering of the variables, a BN was built by fitting appropriate
Gene     Chromosome  Number of Tested SNPs
AKT1     14          78
AKT2     19          142
AKT3     1           793
FOXO1    13          275
FOXO3    6           110
FOXO6    1           674
GHR      5           506
IGF1     12          994
IGF1R    15          854
IKBKB    8           76
INS      11          75
INSR     19          859
IRS1     2           2105
IRS2     13          1637
PDPK1    16          9
PIK3CA   3           318
PIK3CB   3           391
PIK3CD   1           172
PIK3CG   7           883
PIK3R1   5           4179
PIK3R2   19          32
PIK3R3   1           207
PIK3R5   17          295

Table 3.5: Summary of 23 Genes in the IIS Pathway
mixed effects regression models of follow-up survival time (using mixed Cox regression),
age at enrollment, the four biomarkers (using linear mixed model), and by identifying
statistically significant predictors through a forward search. Based on the simulation study,
the likelihood ratio test from the mixed effects model was used as the model selection criterion, applying a stringent Bonferroni correction at each node.
The three BNs that ranked top in the global likelihood over all possible orderings
are depicted in Figure 3.4; BN M 1, BN M 2, and BN M 3 are shown in panels a), b), and c),
respectively. The top three BNs have very similar overall structures, with directions of few
edges switched. Table 3.6 shows the Markov Blanket (MB) for each node in the top 3 BNs.
The MB for a node consists of its parents, its children, and its children’s other parents.
Node    MB in M1, M2, and M3 (identical in all three networks)
FUS     TR, Age.E, Hgb, Sex, rs1009375
Age.E   BYC, Sex, rs6974881, FUS, TR, Hgb, rs1009375
DHEA    Hgb, IGF1, BYC, TR
TR      Hgb, BYC, DHEA, IGF1, FUS, Age.E, Sex, rs1009375
IGF-1   Hgb, TR, BYC, DHEA
Hgb     BYC, IGF1, FUS, TR, DHEA, Age.E, Sex, rs1009375

Table 3.6: Markov Blanket of Each Node in the Top 3 BNs; the Markov Blankets are the same in all three networks. FUS: Follow-up Survival; Age.E: Age at enrollment; DHEA: Dehydroepiandrosterone; TR: Transferrin Receptors; IGF-1: Insulin-like growth factor 1; INS: Insulin; Hgb: Hemoglobin.
Conditioned on the MB, a node is conditionally independent of all other nodes in the BN.
It is interesting to note that the MBs of the top three BNs are equivalent, suggesting
that the statistical inference on parameters is robust. The results from the final network suggest that genetic variants in the IIS pathway do not affect age at death through the blood biomarkers; there is one SNP (rs1009375) directly associated with follow-up survival and another SNP (rs6974881) directly associated with age at enrollment. A literature review of these two SNPs shows that rs1009375 on chromosome 1 is in a locus that has been linked to glycemic control and severe diabetic retinopathy, while rs6974881 on chromosome 7 is situated in a locus that has been linked to age at natural menopause, pulse pressure and mean arterial pressure, and carotid intima-media thickness and plaque.
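Given the parent sets of a fitted BN, the Markov Blanket of each node can be computed mechanically; a small Python sketch (the graph encoding is hypothetical):

```python
def markov_blanket(node, parents):
    """parents maps each node to the tuple of its parents.
    MB(x) = parents of x, children of x, and the children's other
    parents, as defined in the text."""
    children = [c for c, pa in parents.items() if node in pa]
    mb = set(parents[node]) | set(children)
    for c in children:
        mb |= set(parents[c])
    mb.discard(node)
    return mb
```

Conditioning on this set renders the node independent of the rest of the network, which is why equal blankets across the top-ranked structures indicate robust inference.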
Figure 3.4: Top 3 BNs that dissect the associations of SNPs in genes of the IIS pathway through effects on blood biomarkers.

3.7 Conclusions

We presented an approach to learn BNs from correlated data that generalizes mixed effects regression modelling. The simulation study showed the importance of accounting for correlated data to avoid inflation of the false positive error rates, but also showed the need for
better scoring metrics for model selection in the context of correlated data. Learning BNs
from correlated data is an emerging research field with many research opportunities and
many areas of applications.
3.8 Supplementary Materials
Score   ne     Errors At Each Level             Tot      Error Rates
               1     2     3    4    ≥5         Tests    FPR      FWER
BICM    4656   44    0     0    0    0          10396    0.0042   0.044
BICJ    3346   49    0     0    0    0          10441    0.0047   0.049
BICY    1768   64    0     0    0    0          10576    0.0061   0.064
BICC    582    119   2     0    0    0          11042    0.0110   0.114
AICM    4656   1598  790   294  77   16         23295    0.1191   0.821
LRTM    4656   496   96    12   1    0          14336    0.0422   0.397
BICF    4656   83    2     0    0    0          10745    0.0079   0.081
LRTF    4656   727   212   40   3    0          16291    0.0603   0.517
AICF    4656   1949  1077  455  152  47         25803    0.1426   0.879

Table 3.7: False Positive Rates and Family-wise Error Rates of Different Model Selection Metrics For Continuous Data When h² = 0.25. BICM: BIC based on integrated likelihood and full sample size; BICJ, BICY, BICC: BIC with Jones', Yang's, and conservative effective sample size; AICM: AIC based on integrated likelihood and full sample size; LRTM: likelihood ratio test based on integrated likelihood to account for correlated data; BICF, LRTF, and AICF: traditional BIC, likelihood ratio test, and AIC ignoring correlated data. FPR is the false positive rate, defined as the number of errors over the total number of tests; FWER is the family wise error rate, i.e., the probability of one or more errors.
Score   ne     Errors At Each Level             Tot      Error Rates
               1     2     3    4    ≥5         Tests    FPR      FWER
BICM    4656   55    2     0    0    0          10475    0.0054   0.051
BICJ    2483   64    3     0    0    0          10555    0.0064   0.059
BICY    1768   75    3     0    0    0          10654    0.0073   0.070
BICC    582    131   10    0    0    0          11169    0.0126   0.121
AICM    4656   1597  758   279  86   18         23135    0.1183   0.822
LRTM    4656   528   111   16   0    0          14591    0.0450   0.415
BICF    4656   193   16    0    0    0          11739    0.0178   0.179
LRTF    4656   1146  447   140  23   1          19781    0.0888   0.702
AICF    4656   2540  1582  842  359  159        29816    0.1839   0.936

Table 3.8: False Positive Rates and Family-wise Error Rates of Different Model Selection Metrics For Continuous Data When h² = 0.75. BICM: BIC based on integrated likelihood and full sample size; BICJ, BICY, BICC: BIC with Jones', Yang's, and conservative effective sample size; AICM: AIC based on integrated likelihood and full sample size; LRTM: likelihood ratio test based on integrated likelihood to account for correlated data; BICF, LRTF, and AICF: traditional BIC, likelihood ratio test, and AIC ignoring correlated data. FPR is the false positive rate, defined as the number of errors over the total number of tests; FWER is the family wise error rate, i.e., the probability of one or more errors.
              Power
              Strong Effect  Moderate Effect  Weak Effect
BICM          0.797          0.444            0.273
LRT_BICM      0.810          0.461            0.287
BICJ          0.815          0.464            0.290
LRT_BICJ      0.824          0.473            0.297
BICY          0.838          0.510            0.332
LRT_BICY      0.835          0.506            0.329
BICC          0.877          0.593            0.417
LRT_BICC      0.875          0.584            0.405

Table 3.9: Power Comparisons of Four Variants of BIC vs. Corresponding LRTM For Continuous Data When h² = 0.25. Results are based on 1,000 simulated datasets with 3 situations of strong, moderate, and weak covariate effects. BICM: BIC based on integrated likelihood and full sample size; BICJ, BICY, BICC: BIC with Jones', Yang's, and conservative effective sample size; LRT_BICM, LRT_BICJ, LRT_BICY, and LRT_BICC: likelihood ratio test based on integrated likelihood using the significance threshold obtained from empirical false positive rates.
              Power
              Strong Effect  Moderate Effect  Weak Effect
BICM          0.233          0.102            0.049
LRT_BICM      0.270          0.124            0.067
BICJ          0.268          0.122            0.067
LRT_BICJ      0.286          0.138            0.072
BICY          0.284          0.138            0.071
LRT_BICY      0.304          0.150            0.075
BICC          0.356          0.189            0.102
LRT_BICC      0.374          0.192            0.106

Table 3.10: Power Comparisons of Four Variants of BIC vs. Corresponding LRTM For Continuous Data When h² = 0.75. Results are based on 1,000 simulated datasets with 3 situations of strong, moderate, and weak covariate effects. BICM: BIC based on integrated likelihood and full sample size; BICJ, BICY, BICC: BIC with Jones', Yang's, and conservative effective sample size; LRT_BICM, LRT_BICJ, LRT_BICY, and LRT_BICC: likelihood ratio test based on integrated likelihood using the significance threshold obtained from empirical false positive rates.
Score   Errors At Each Level             Tot      Error Rates
        1     2    3    4    ≥5          Tests    FPR      FWER
BICM    77    1    0    0    0           10683    0.0073   0.075
AICM    1661  821  326  91   24          23769    0.1230   0.841
LRTM    522   114  21   2    0           14646    0.0450   0.413
BICF    107   2    0    0    0           10970    0.0099   0.106
AICF    1837  956  406  132  39          24851    0.1356   0.864
LRTF    625   161  25   2    0           15537    0.0523   0.476

Table 3.11: False Positive Rates and Family-wise Error Rates of Different Model Selection Metrics For Time-to-event Data When h² = 0.25. BICM: BIC based on integrated likelihood and number of events as the sample size; AICM: AIC based on integrated likelihood and full sample size; LRTM: likelihood ratio test based on integrated likelihood to account for correlated data; BICF, LRTF, and AICF: traditional BIC, likelihood ratio test, and AIC ignoring correlated data. FPR is the false positive rate, defined as the number of errors over the total number of tests; FWER is the family wise error rate, i.e., the probability of one or more errors.
Score   Errors At Each Level             Tot      Error Rates
        1     2     3    4    ≥5         Tests    FPR      FWER
BICM    79    4     0    0    0          10707    0.0078   0.075
AICM    1712  864   344  106  24         23858    0.1278   0.833
LRTM    553   124   16   3    0          14910    0.0466   0.434
BICF    190   20    1    0    0          11698    0.0180   0.171
AICF    2193  1272  597  218  59         27519    0.1577   0.909
LRTF    906   282   65   14   2          17727    0.0716   0.614

Table 3.12: False Positive Rates and Family-wise Error Rates of Different Model Selection Metrics For Time-to-event Data When h² = 0.75. BICM: BIC based on integrated likelihood and number of events as the sample size; AICM: AIC based on integrated likelihood and full sample size; LRTM: likelihood ratio test based on integrated likelihood to account for correlated data; BICF, LRTF, and AICF: traditional BIC, likelihood ratio test, and AIC ignoring correlated data. FPR is the false positive rate, defined as the number of errors over the total number of tests; FWER is the family wise error rate, i.e., the probability of one or more errors.
3.9 Acknowledgements

This work was supported by the NIH National Institute on Aging (U01-AG023755 and U19-AG023122), the National Heart, Lung, and Blood Institute (R21HL114237), and the National Institute of General Medical Sciences (T32 GM74905).
Chapter 4

Efficient Technique to Model Extended Family Structure in Bayesian inference Using Gibbs Sampling (BUGS) Software
4.1 Introduction
Many observational studies are designed using some form of clustered sampling that introduces correlation between observations within the same cluster. A popular study design
that produces correlated observations is the family-based study [55], in which families (or
clusters) may be selected because family members are enriched for particular traits of
interest. In this design, multiple relatives within the same family are enrolled in the study,
and subjects from the same family cannot be assumed independent because they share
genetic background and may have more similar phenotypes than members from different
families. In this context, standard statistical methods that assume independent and identically distributed observations are not appropriate because ignoring the correlation between
observations may inflate the false positive rates of statistical methods [11].
When the trait of interest is continuous, a linear mixed-effects model can be used to
account for the family structure by using random effects with a variance-covariance matrix
that describes the within and between family covariances. However, the implementation of
such models in the BUGS software [46] becomes challenging due to the high dimensionality
of the random effects vector, which is as large as the sample size in the study. The fact
that the high-dimensional covariance matrix can only be updated as a composite whole in
BUGS [9] increases the computational burden of the Markov Chain Monte Carlo (MCMC)
estimation and often results in a failure to compile the model. To tackle this implementation
issue, Waldmann et al [84] have proposed an approach based on a decomposition of the
multivariate normal distribution of the random effects into univariate normal distributions
using conditional distributions [30]. Our experience with this approach is that it fails to
produce accurate results with large multigenerational families.
This paper describes an alternative parameterization that uses the singular value decomposition of the large covariance matrix of the random effects to fit a mixed model with
independent random effects and avoids the use of large variance matrices. This approach is
not novel; for example, it was used in factored spectrally transformed linear mixed models
(FaST-LMM) [43] for fast computations with family data. The novelty of our contribution
is to use this approach for an efficient implementation in the BUGS software. The BUGS
code is also provided and a specific example is presented with some discussions.
4.2 Proposed Parameterization
A common parameterization of linear mixed models for correlated family data is:
y|β, σg2 , σe2 = Xβ + ρ + ,
(4.1)
where y is the phenotype vector of size n × 1, X is the n × (p + 1) design matrix that
contains values of measured covariates, β is the fixed effect vector of size p × 1, ρ is the
random effect vector of size n × 1, which accounts for the additive polygenic effect, and is a vector of random errors of size n × 1. The vectors ρ and are marginally independent
and follow distributions:
ρ ∼ N (0, σg2 A)
∼ N (0, σe2 In ),
where A represents the known additive genetic relationship matrix (see Figure 3.1), σg2 is
the genetic variance, and σe2 is the error variance [19]. Under this parameterization the
variance-covariance matrix of the observation y is the matrix:
Var(y | β, σg2, σe2) = σg2 A + σe2 In.    (4.2)
One can estimate the narrow-sense heritability of a trait as the ratio of the genetic variance σg2 to the total phenotypic variance σg2 + σe2, that is, h² = σg2 / (σg2 + σe2).
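As a concrete illustration of this ratio, here is a minimal sketch in Python (the dissertation's own code is in R and BUGS; the function name and the variance values below are hypothetical):

```python
def heritability(sigma_g2, sigma_e2):
    """Narrow-sense heritability: genetic variance over total phenotypic variance."""
    return sigma_g2 / (sigma_g2 + sigma_e2)

# Hypothetical variance components: genetic 0.29, residual 0.49
h2 = heritability(0.29, 0.49)
```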
In a recent review article, Muller et al. [50] described the following parameterization
for general linear mixed-effects models:
y | β, σg2, σe2 = Xβ + Gu + ε    (4.3)
where u ∼ N(0, σg2 Is), ε ∼ N(0, σe2 In), and G is a matrix of known coefficients. The advantage of the parameterization in Equation (4.3) is that the random effects ui, i = 1, ..., n, are marginally independent, rather than following the high-dimensional multivariate normal distribution of ρ in the initial parameterization.
The two parameterizations are equivalent once the matrix G is derived from the singular value decomposition of the matrix A. Specifically, by setting A = U S^(1/2) S^(1/2) U^T and letting G = U S^(1/2), the variance-covariance matrix of y from the model in Equation (4.3) is
Var(y | β, σg2, σe2) = σg2 GG^T + σe2 In = σg2 U S^(1/2) S^(1/2) U^T + σe2 In = σg2 A + σe2 In.    (4.4)
Note that the matrix U S^(1/2) needs to be computed only once. We provide an example R script that computes U S^(1/2) given a family-based data set in the supplementary material.
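The equivalence in Equation (4.4) can be checked numerically. The sketch below, in Python rather than the R used in this chapter, takes a hand-picked 2 × 2 additive relationship matrix for a parent-offspring pair, writes out its eigendecomposition explicitly, and verifies that G = U S^(1/2) reproduces A as GG^T:

```python
import math

# Additive relationship matrix for a parent-offspring pair (illustrative example)
A = [[1.0, 0.5],
     [0.5, 1.0]]

# Eigendecomposition of this symmetric A, written out by hand:
# eigenvalues 1.5 and 0.5, orthonormal eigenvectors (1,1)/sqrt(2) and (1,-1)/sqrt(2)
r = 1 / math.sqrt(2)
U = [[r, r],
     [r, -r]]
S = [1.5, 0.5]

# G = U * S^(1/2): scale column j of U by sqrt(S[j])
G = [[U[i][j] * math.sqrt(S[j]) for j in range(2)] for i in range(2)]

# G G^T should equal A, so sigma_g^2 * G G^T + sigma_e^2 * I matches
# the original covariance sigma_g^2 * A + sigma_e^2 * I
GGt = [[sum(G[i][k] * G[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
```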
4.3 A Real Data Example
As an example, we consider the task of estimating the heritability of transferrin receptor
levels from a large family-based study. The data are from the Long Life Family Study
(LLFS) that between 2006 and 2009 enrolled approximately 5,000 individuals from 583
families demonstrating clustering for longevity and healthy aging in the United States and
Denmark [52, 67]. A typical family structure in the LLFS has a proband, the siblings,
their spouses, offspring of probands and siblings, and their spouses. The family size varies between 3 and 77 individuals. In this example, transferrin receptor levels were adjusted for age at enrollment and insulin levels. The kinship matrix A was
calculated with the R package kinship2 [75] and the R code to generate the matrix G is
provided in the supplementary material.
The entire BUGS code is shown with some comments below. There are a few points
that are noteworthy. First, the matrix G, which is computed within R, is part of the
BUGS data input. The calculation of the matrix G is required only once and performed
outside the model compilation in the BUGS software. Second, the variable offset is used
to indicate the individuals who belong to each specific family. For example, in the code
below, we use the index i to represent families and the index j to represent individuals. The first few values of the variable offset in this particular example are c(1, 8, 16, ...). When
i = 1 (the first family), j will span from 1 to 7, indicating that individuals 1 through 7
belong to family 1. When i = 2 (the second family), j will span from 8 to 15, indicating
that individuals 8 through 15 belong to family 2. Then, based on the values of j and offset,
the inner product between an appropriate row of the matrix G and vector u is computed.
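The offset bookkeeping described above can be sketched as follows (a Python illustration with 0-based indices, unlike the 1-based BUGS code; the family sizes and offsets are hypothetical):

```python
# offset[i] marks where family i starts; family i occupies rows
# offset[i] .. offset[i+1]-1 of the stacked data (0-based here).
offset = [0, 7, 15, 18]   # hypothetical families of sizes 7, 8, and 3

def family_members(i):
    """Row indices of the individuals belonging to family i."""
    return list(range(offset[i], offset[i + 1]))

def inprod(g_row, u, i):
    """Inner product of one row of G with u, restricted to family i's block,
    mirroring the inprod(...) call in the BUGS model."""
    return sum(g_row[j] * u[j] for j in family_members(i))
```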
A total of 11,000 iterations, with the first 1,000 as burn-in, was sufficient to reach convergence of the estimates. On average, each iteration took 0.0892 seconds on an
Intel(R) Core i5 processor (2.53 GHz) with 4 GB of RAM. The data were also analyzed
using the method proposed in [30] and by fitting classical linear mixed models with the function lmekin() in the coxme package [74] in R.
The BUGS code
model svd {
    ## loop over families
    for( i in 1:n.fam ) {
        ## loop over individuals within each family
        for( j in offset[i]:(offset[i+1] - 1) ) {
            y[j] ~ dnorm( mu[j], tau.e )
            ## X %*% beta plus the family block of G %*% u
            mu[j] <- b0 + b.age*age[j] + b.insulin*insulin[j] +
                     inprod( G[j, offset[i]:(offset[i+1]-1)],
                             u[offset[i]:(offset[i+1]-1)] )
        }
    }
    ## model the random effects as univariate normals
    for( t in 1:N ) { u[t] ~ dnorm( 0, tau.g ) }
    ## priors for fixed effects
    b0 ~ dnorm(0, 0.001)
    b.age ~ dnorm(0, 0.001)
    b.insulin ~ dnorm(0, 0.001)
    ## variance components
    tau.e ~ dgamma(1, 1)
    tau.g ~ dgamma(1, 1)
    sigma.g2 <- 1/tau.g
    sigma.e2 <- 1/tau.e
    ## narrow-sense heritability
    herit <- sigma.g2/(sigma.g2 + sigma.e2)
}
Table 4.1 compares the point estimates and standard errors (SE) of the regression parameters, the variance parameters, and the heritability estimates from the linear mixed models computed using the lmekin() function in R, the proposed method (SVD Model), and the method in [30] (Conditional Model). Figure 4.1 displays the posterior distributions of heritability, residual variance, and genetic variance. The point estimates and SE from the R output and the SVD model are nearly identical. The heritability estimates from the two methods are 0.3677 and 0.3707, respectively, with a difference of only 0.0031. However, compared to these two methods, the conditional model produces discrepant results: the residual variance is over-estimated and the genetic variance is under-estimated, which leads to an inconsistent estimate of the heritability, and the 95% credible intervals of the two Bayesian methods do not overlap. The inconsistent results between the SVD model and the conditional model are surprising since, in theory, both approaches rely on decomposition methods that should lead to exactly the same covariance matrix. To further investigate this discrepancy, simulations of several scenarios of extended pedigree data structures were performed; however, we were not able to pinpoint the reason for the apparent discrepancy. The advantage of the Bayesian approach here, compared to the classical approach, is that it provides measures of the uncertainty of the heritability estimate in the form of the posterior distribution (Figure 4.1) and the 95% credible interval.
                   R (lmekin)       SVD Model                         Conditional Model
                   PE      SE       PE      SE      95% CI            PE      SE       95% CI
Intercept          2.1494  0.0652   2.143   0.0697  (2.014, 2.283)    2.153   0.0730   (2.018, 2.3)
Age                0.0101  0.0008   0.0102  0.0009  (0.0084, 0.0119)  0.0101  0.0009   (0.0082, 0.0119)
Insulin            0.0021  0.0004   0.0022  0.0004  (0.0014, 0.0030)  0.0022  0.0004   (0.0014, 0.0030)
Heritability       0.3677  N/A      0.3707  0.0345  (0.3015, 0.4402)  0.1325  0.0257   (0.0957, 0.195)
Residual Variance  0.4877  N/A      0.4866  0.0263  (0.436, 0.5402)   0.6624  0.02439  (0.6114, 0.7074)
Genetic Variance   0.2837  N/A      0.2862  0.0289  (0.2303, 0.3466)  0.1013  0.0198   (0.0733, 0.1494)
Table 4.1: Comparison of point estimates (PE), standard errors (SE), and 95% credible intervals (95% CI). R (lmekin): results obtained using the lmekin function in R; SVD Model: results obtained using the proposed method based on the singular value decomposition of the additive genetic relationship matrix; Conditional Model: results obtained using the method in [30]. The total sample size was 4229 with 558 unique families.
4.4 Conclusion
The proposed BUGS code provides an easy and efficient way to account for extended family structure in linear mixed models. The results show that this implementation produces estimates consistent with the classical linear mixed models in R. The advantage of this approach is that it allows for linear mixed modeling from a Bayesian perspective that correctly adjusts for correlated data while also providing an accurate estimate of heritability. A similar type of approach could be used when the trait of interest is binary or time-to-event.

Figure 4.1: Plots of Posterior Distributions of Heritability, Residual Variance, and Genetic Variance
4.5 Supplementary Materials
Sample R script for computing the singular value decomposition of the additive genetic
relationship matrix.
## Read the phenotype data
pheno.file <- read.csv("pheno.data.csv")

## Read the pedigree data
ped.file <- read.csv("ped.data.csv")

## Variable description:
##   subject = unique individual ID
##   gpedid  = family ID
##   dadsubj = ID of the father
##   momsubj = ID of the mother
##   sex     = sex of individual

## The pedigree file needs to be ordered by family
ped.file <- ped.file[order(ped.file$gpedid), ]

## Use package kinship2 to construct the pedigree objects and
## the kinship matrix
library(kinship2)

## This creates the pedigree objects
mped.full <- pedigree(id = ped.file$subject, dadid = ped.file$dadsubj,
                      momid = ped.file$momsubj, sex = ped.file$sex,
                      famid = ped.file$gpedid, missid = 0)

## This creates the kinship matrix
kmat.full <- kinship(mped.full)

## Singular value decomposition by family
family.list <- intersect(unique(ped.file$gpedid), unique(pheno.file$gpedid))
my.U <- vector("list")
my.S <- vector("list")
my.kmat <- vector("list")
my.G <- vector("list")
for (i in 1:length(family.list)) {
  p <- ped.file[which(ped.file$gpedid == family.list[i]), ]
  mped <- pedigree(id = p$subject, dadid = p$dadsubj,
                   momid = p$momsubj, sex = p$sex,
                   famid = p$gpedid, missid = 0)
  ## additive genetic relationship matrix = 2 x kinship matrix
  kmat <- 2 * kinship(mped)
  ## restrict the matrix to the subjects with phenotype data
  ind.subj <- match(pheno.file$subject[which(pheno.file$gpedid == family.list[i])],
                    row.names(kmat))
  ind.subj <- ind.subj[!is.na(ind.subj)]
  test <- svd(kmat[ind.subj, ind.subj])
  my.U[[i]] <- test$u
  my.S[[i]] <- test$d
  my.G[[i]] <- test$u %*% diag(sqrt(test$d))
  my.kmat[[i]] <- kmat[ind.subj, ind.subj]
}

## Assemble the block-diagonal matrix G to be loaded into BUGS
G.mat <- my.G[[1]]
for (i in 2:length(my.G)) {
  ith.block.1 <- matrix(0, nrow = nrow(G.mat), ncol = ncol(my.G[[i]]))
  ith.block.2 <- matrix(0, nrow = nrow(my.G[[i]]), ncol = ncol(G.mat))
  G.mat <- rbind(cbind(G.mat, ith.block.1),
                 cbind(ith.block.2, my.G[[i]]))
}
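The rbind/cbind loop above assembles a block-diagonal matrix from the per-family G blocks. A minimal Python sketch of the same assembly (pure lists, no external packages; block_diag is a hypothetical helper name, not part of the dissertation's code):

```python
def block_diag(blocks):
    """Stack matrices (given as lists of rows) into one block-diagonal matrix,
    mirroring the rbind/cbind loop in the R script above."""
    n_rows = sum(len(b) for b in blocks)
    n_cols = sum(len(b[0]) for b in blocks)
    out = [[0.0] * n_cols for _ in range(n_rows)]
    r0 = c0 = 0
    for b in blocks:
        # copy block b into the diagonal position (r0, c0)
        for i, row in enumerate(b):
            for j, val in enumerate(row):
                out[r0 + i][c0 + j] = val
        r0 += len(b)
        c0 += len(b[0])
    return out

# Two hypothetical per-family blocks: a 2x2 and a 1x1
G_mat = block_diag([[[1.0, 2.0], [3.0, 4.0]], [[5.0]]])
```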
4.6 Acknowledgments
This work was funded by the National Institute on Aging (NIA U19-AG023122 to T.P.), the National Heart, Lung, and Blood Institute (R21HL114237 to P.S.), and the National Institute of General Medical Sciences (T32GM074905).
Chapter 5
Conclusions
Statistical genetics is a rapidly changing field that requires specialized statistical methods
for proper analysis of massive data on the genome-wide level to understand the genetic
contributions to complex traits. Triggered by high throughput technologies, GWASs have
identified numerous genetic variants associated with diverse phenotypes in the last decade.
However, those variants with unusually low p-values often did not reveal important biological mechanisms and explained only a fraction of the genetic variance of the phenotypes studied. My work described in the last three chapters attempts to aid the comprehension of the genetic architecture of complex traits by 1) discovering potentially more informative variants in GWAS and 2) integrating genetic and non-genetic data through the use of network models in the presence of correlated data. In this concluding chapter, I outline both short-term and long-term future directions for my work
presented in this dissertation as they relate to uncovering the genetic basis of such traits.
An immediate extension to the work in Chapter 2 is to consider the same type of model
selection in linear mixed models. Under certain distributional assumptions on the random
effects, a closed-form solution to the marginal likelihood can be derived by integrating
out the random effects. Then, a very similar approach can be employed for the model
selection among different genetic models. In addition, this approach should be explored
in the context of logistic regression. As there is no closed-form solution to the marginal
likelihood in this case, it is not immediately clear which scoring metric should be used for
model selection when the trait of interest is binary. Extensive evaluations on approximate
BFs need to be conducted.
Network modelling provides an approach that allows for integration of the accumulated genetic information and other non-genetic factors to uncover the complex paths from
genotype to phenotype. However, Chapter 3 mainly focuses on procedures to learn the
dependency structure among these variables and evaluations on model selection criteria
for correlated data. Reasoning with the network in the presence of correlated data is another key area that needs to be explored in depth. The ability to reason with the network may provide an answer to the following question: what is the probability distribution of some variable (e.g., lifespan) given a set of other variables (genetic and non-genetic exposures)? Parameter learning
and predictive inference, in the presence of family structure, need to be evaluated using
Markov chain Monte Carlo estimation. I believe that my work described in Chapter 4 can
contribute to this effort.
As a by-product of the simulation studies in Chapter 3, I show a mathematical derivation
of the heritability of time-to-event data. This work is meaningful as it provides a way
to control the genetic parameter in simulation studies involving time-to-event data. In
addition, it also provides a clear way to explain the contribution of the genetic component
to the phenotype, which has a similar interpretation as in continuous data. I plan to
extend this idea of defining heritability of time-to-event trait to binary traits, for which
the liability threshold model is often used.
The focus of my work in Chapter 4 is limited to continuous data. Implementation of this method in the context of non-continuous data needs to be explored carefully. I am also interested in dimension reduction of the random effects vector in mixed models. In the commonly used parameterization, there are as many random-effect parameters as there are observations. However, this excess of nuisance parameters may over-parameterize the model, which may result in a loss of power in the analysis. Trade-offs between gain in power and
accuracy of parameter estimates should be examined thoroughly.
In a broader context, applications of the methodologies I developed in my dissertation
may help reveal the underlying biological mechanisms of complex polygenic traits such as
autism and phenotypes associated with autism. I would pursue this aim through several
steps. First, it would be very important to estimate the genetic contribution of the trait
of interest, regardless of the type of the underlying probability distribution. Second, the
method developed in Chapter 2 can be used to obtain a solid list of candidate genetic
variants associated with the trait through multiple genetic models. Once this list is established, a Bayesian network (BN) could be built to represent the complex dependency structure among the trait of interest, its sub-phenotypes, non-genetic factors, and genetic variants. As a last step, predictive inference based on this BN should be carried out, which may provide a better understanding of the complex biological mechanisms.
Bibliography
[1] G. R. Abecasis, L. R. Cardon, and W. O. Cookson. A general test of association
for quantitative traits in nuclear families. American Journal of Human Genetics,
66(1):279–92, 2000.
[2] I. Akinsheye, A. Alsultan, N. Solovieff, D. Ngo, C. T. Baldwin, P. Sebastiani, D. H.
Chui, and M. H. Steinberg. Fetal hemoglobin in sickle cell anemia. Blood, 118(1):19–27,
2011.
[3] H. T. Bae, C. T. Baldwin, P. Sebastiani, M. J. Telen, A. Ashley-Koch, M. Garrett, W. C. Hooper, C. J. Bean, M. R. Debaun, D. E. Arking, P. Bhatnagar, J. F.
Casella, J. R. Keefer, E. Barron-Casella, V. Gordeuk, G. J. Kato, C. Minniti, J. Taylor, A. Campbell, L. Luchtman-Jones, C. Hoppe, M. T. Gladwin, Y. Zhang, and M. H.
Steinberg. Meta-analysis of 2040 sickle cell anemia patients: Bcl11a and hbs1l-myb
are the major modifiers of hbf in african americans. Blood, 120(9):1961–2, 2012.
[4] H. T. Bae, P. Sebastiani, J. X. Sun, S. L. Andersen, E. W. Daw, A. Terracciano,
L. Ferrucci, and T. T. Perls. Genome-wide association study of personality traits in
the long life family study. Frontiers in Genetics, 4:65, 2013.
[5] H.T. Bae, T.T. Perls, M.H. Steinberg, and P. Sebastiani. Bayesian polynomial regression models to fit multiple genetic models for quantitative traits. Bayesian Analysis,
2014. Forthcoming.
[6] J. M. Bernardo and A. F. M. Smith. Bayesian Theory. John Wiley and Sons, New
York, NY, 1994.
[7] I. B. Borecki and M. A. Province. Genetic and genomic discovery using family studies.
Circulation, 118(10):1057–63, 2008.
[8] N. E. Breslow and D. G. Clayton. Approximate inference in generalized linear mixed
models. Journal of the American Statistical Association, 88(421):9–25, 1993.
[9] P. R. Burton, K. J. Tiller, L. C. Gurrin, W. O. Cookson, A. W. Musk, and L. J. Palmer.
Genetic variance components analysis for binary phenotypes using generalized linear
mixed models (glmms) and gibbs sampling. Genetic Epidemiology, 17(2):118–40, 1999.
[10] W. S. Bush and J. H. Moore. Chapter 11: Genome-wide association studies. PLOS
Computational Biology, 8(12):e1002822, 2012.
[11] M. J. Cannon, L. Warner, J. A. Taddei, and D. G. Kleinbaum. What can go wrong
when you assume that correlated data are independent: an illustration from the evaluation of a childhood health intervention in brazil. Statistics in Medicine, 20(9-10):1461–
7, 2001.
[12] H. Chen, T. Lumley, J. Brody, N. L. Heard-Costa, C. S. Fox, L. A. Cupples, and
J. Dupuis. Sequence kernel association test for survival traits. Genetic Epidemiology,
38(3):191–7, 2014.
[13] Z. Chen and D. B. Dunson. Random effects selection in linear mixed models. Biometrics, 59(4):762–9, 2003.
[14] T. G. Clark, S. G. Campino, E. Anastasi, S. Auburn, Y. Y. Teo, K. Small, K. A.
Rockett, D. P. Kwiatkowski, and C. C. Holmes. A bayesian approach using covariance
of single nucleotide polymorphism data to detect differences in linkage disequilibrium
patterns between groups of individuals. Bioinformatics, 26(16):1999–2003, 2010.
[15] G. F. Cooper and G. F. Herskovitz. A bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
[16] P. T. Costa and R. R. McCrae. Professional Manual: Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor-Inventory (NEO-FFI). Psychological Assessment Resources, Odessa, FL, 1992.
[17] A. P. Dawid and S. L. Lauritzen. Hyper markov laws in the statistical-analysis of
decomposable graphical models. Annals of Statistics, 21(3):1272–1317, 1993.
[18] M. L. Eaton. Multivariate statistics: a vector space approach. Institute of Mathematical Statistics, 1983.
[19] J. Eu-ahsunthornwattana, E. N. Miller, M. Fakiola, Wellcome Trust Case Control
Consortium., S. M. B. Jeronimo, J. M. Blackwell, and H. J. Cordell. Comparison of
methods to account for relatedness in genome-wide association studies with family-based data. PLOS Genetics, 10(7):e1004445, 2014.
[20] Y. Fong, H. Rue, and J. Wakefield. Bayesian inference for generalized linear mixed
models. Biostatistics, 11(3):397–412, 2009.
[21] B. Freidlin, G. Zheng, Z. Li, and J. L. Gastwirth. Trend tests for case-control studies
of genetic markers: power, sample size and robustness. Human Heredity, 53(3):146–52,
2002.
[22] N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using bayesian networks to analyze
expression data. Journal of Computational Biology, 7(3-4):601–20, 2000.
[23] C. Garner, T. Tatu, J. E. Reittie, T. Littlewood, J. Darley, S. Cervino, M. Farrall,
P. Kelly, T. D. Spector, and S. L. Thein. Genetic influences on f cells and other
hematologic variables: a twin heritability study. Blood, 95(1):342–6, 2000.
[24] M. Gaston and W. F. Rosse. The cooperative study of sickle cell disease: review of
study design and objectives. American Journal of Pediatric Hematology/Oncology,
4(2):197–201, 1982.
[25] Andrew Gelman, John Carlin, Hal Stern, and Donald Rubin. Bayesian Data Analysis,
Second Edition. Chapman and Hall/CRC, 2003.
[26] Andrew Gelman and Jennifer Hill. Data analysis using regression and multilevel/hierarchical models. Analytical methods for social research. Cambridge University Press,
Cambridge; New York, 2007.
[27] J. R. Gonzalez, J. L. Carrasco, F. Dudbridge, L. Armengol, X. Estivill, and
V. Moreno. Maximizing association statistics over genetic models. Genetic Epidemiology, 32(3):246–54, 2008.
[28] U. S. Govindarajulu, H. Lin, K. L. Lunetta, and R. B. D'Agostino, Sr. Frailty models: applications to biomedical and genetic studies. Statistics in Medicine, 30(22):2754–64, 2011.
[29] Y. Guan and M. Stephens. Practical issues in imputation-based association mapping.
PLOS Genetics, 4(12):e1000279, 2008.
[30] J. Hallander, P. Waldmann, C. Wang, and M. J. Sillanpaa. Bayesian inference of
genetic parameters based on conditional decompositions of multivariate normal distributions. Genetics, 185(2):645–54, 2010.
[31] G. H. Hardy. Mendelian proportions in a mixed population. Science, 28(706):49–50,
1908.
[32] D Heckerman, D Geiger, and D. M Chickering. Learning bayesian networks: The
combinations of knowledge and statistical data. Machine Learning, 20:197–243, 1995.
[33] David Heckerman. A tutorial on learning with Bayesian networks, pages 301–354.
MIT Press, 1999.
[34] F. Y. Hsieh and P. W. Lavori. Sample-size calculations for the cox proportional hazards
regression model with nonbinary covariates. Controlled Clinical Trials, 21(6):552–60,
2000.
[35] R. H. Jones. Bayesian information criterion for longitudinal and clustered data. Statistics in Medicine, 30(25):3050–6, 2011.
[36] R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.
[37] Daphne Koller and Nir Friedman. Probabilistic graphical models : principles and
techniques. Adaptive computation and machine learning. MIT Press, Cambridge,
MA, 2009.
[38] K. Lange. Mathematical and Statistical Methods for Genetic Analysis. Springer, 2002.
[39] S. L. Lauritzen and N. A. Sheehan. Graphical models for genetic analysis. Statistical
Science, 18(4):489–514, 2004.
[40] G. Lettre, C. Lange, and J. N. Hirschhorn. Genetic model testing and statistical power
in population-based association studies of quantitative traits. Genetic Epidemiology,
31(4):358–62, 2007.
[41] Q. Li, G. Zheng, Z. Li, and K. Yu. Efficient approximation of p-value of the maximum
of correlated tests, with applications to genome-wide association studies. Annals of
Human Genetics, 72(Pt 3):397–406, 2008.
[42] C. Lippert, J. Listgarten, Y. Liu, C. M. Kadie, R. I. Davidson, and D. Heckerman. Fast
linear mixed models for genome-wide association studies. Nature Methods, 8(10):833–
5, 2011.
[43] C. Lippert, J. Listgarten, Y. Liu, C. M. Kadie, R. I. Davidson, and D. Heckerman. Fast
linear mixed models for genome-wide association studies. Nature Methods, 8(10):833–
5, 2011.
[44] J. Listgarten, C. Lippert, and D. Heckerman. Fast-lmm-select for addressing confounding from spatial structure and rare variants. Nature Genetics, 45(5):470–1, 2013.
[45] J. Lorenzo Bermejo, A. Garcia Perez, A. Brandt, K. Hemminki, and A. G. Matthews.
Comparison of six statistics of genetic association regarding their ability to discriminate between causal variants and genetically linked markers. Human Heredity,
72(2):142–52, 2011.
[46] D. J. Lunn, A. Thomas, N. Best, and D. Spiegelhalter. Winbugs - a bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing,
10(4):325–337, 2000.
[47] J. B. Maller, G. McVean, J. Byrnes, D. Vukcevic, K. Palin, Z. Su, J. M. Howson,
A. Auton, S. Myers, A. Morris, M. Pirinen, M. A. Brown, P. R. Burton, M. J. Caulfield,
A. Compston, M. Farrall, A. S. Hall, A. T. Hattersley, A. V. Hill, C. G. Mathew,
M. Pembrey, J. Satsangi, M. R. Stratton, J. Worthington, N. Craddock, M. Hurles,
W. Ouwehand, M. Parkes, N. Rahman, A. Duncanson, J. A. Todd, D. P. Kwiatkowski,
N. J. Samani, S. C. Gough, M. I. McCarthy, P. Deloukas, and P. Donnelly. Bayesian
refinement of association signals for 14 loci in 3 common diseases. Nature Genetics,
44(12):1294–301, 2012.
[48] J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly. A new multipoint
method for genome-wide association studies by imputation of genotypes. Nature Genetics, 39(7):906–13, 2007.
[49] W. J. Matthews. What might judgment and decision making research be like if we
took a bayesian approach to hypothesis testing? Judgment and Decision Making,
6(8):843–856, 2011.
[50] S. Muller, J. L. Scealy, and A. H. Welsh. Model selection in linear mixed models.
Statistical Science, 28(2):135–167, 2013.
[51] P. J. Newcombe, C. Verzilli, J. P. Casas, A. D. Hingorani, L. Smeeth, and J. C.
Whittaker. Multilocus bayesian meta-analysis of gene-disease associations. American
Journal of Human Genetics, 84(5):567–80, 2009.
[52] A. B. Newman, N. W. Glynn, C. A. Taylor, P. Sebastiani, T. T. Perls, R. Mayeux,
K. Christensen, J. M. Zmuda, S. Barral, J. H. Lee, E. M. Simonsick, J. D. Walston,
A. I. Yashin, and E. Hadley. Health and function of participants in the long life family
study: A comparison with other cohorts. Aging (Albany NY), 3(1):63–76, 2011.
[53] C. Ober, M. Abney, and M. S. McPeek. The genetic dissection of complex traits in a
founder population. American Journal of Human Genetics, 69(5):1068–79, 2001.
[54] A. O’Hagan and M. G. Kendall. Kendall’s Advanced Theory of Statistics: Bayesian
inference. vol. 2B, Volume 2, Part 2, volume 2. Edward Arnold, 1994.
[55] J. Ott, Y. Kamatani, and M. Lathrop. Family-based designs for genome-wide association studies. Nature Reviews Genetics, 12(7):465–74, 2011.
[56] J. H. Park, S. Wacholder, M. H. Gail, U. Peters, K. B. Jacobs, S. J. Chanock, and
N. Chatterjee. Estimation of effect size distribution from genome-wide association
studies and implications for future discoveries. Nature Genetics, 42(7):570–5, 2010.
[57] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of plausible inference. Morgan Kaufmann, San Francisco, CA, 1988.
[58] J. Pearl. The causal mediation formula–a guide to the assessment of pathways and
mechanisms. Prevention Science, 13(4):426–36, 2012.
[59] G. Pilia, W. M. Chen, A. Scuteri, M. Orru, G. Albai, M. Dei, S. Lai, G. Usala, M. Lai,
P. Loi, C. Mameli, L. Vacca, M. Deiana, N. Olla, M. Masala, A. Cao, S. S. Najjar,
A. Terracciano, T. Nedorezov, A. Sharov, A. B. Zonderman, G. R. Abecasis, P. Costa,
E. Lakatta, and D. Schlessinger. Heritability of cardiovascular and personality traits
in 6,148 sardinians. PLOS Genetics, 2(8):e132, 2006.
[60] José C. Pinheiro and Douglas M. Bates. Approximations to the log-likelihood function in the nonlinear mixed-effects model. Journal of Computational and Graphical Statistics, 4(1):12–35, 1995.
[61] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. A. Ferreira, D. Bender, J. Maller,
P. Sklar, P. I. de Bakker, M. J. Daly, and P. C. Sham. Plink: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81(3):559–75, 2007.
[62] E. E. Schadt, J. Lamb, X. Yang, J. Zhu, S. Edwards, D. Guhathakurta, S. K. Sieberts,
S. Monks, M. Reitman, C. Zhang, P. Y. Lum, A. Leonardson, R. Thieringer, J. M.
Metzger, L. Yang, J. Castle, H. Zhu, S. F. Kash, T. A. Drake, A. Sachs, and A. J.
Lusis. An integrative genomics approach to infer causal associations between gene
expression and disease. Nature Genetics, 37(7):710–7, 2005.
[63] P. Sebastiani and T. T. Perls. The genetics of extreme longevity: lessons from the
new england centenarian study. Frontiers in Genetics, 3:277, 2012.
[64] P. Sebastiani, M.F. Ramoni, V.G. Nolan, C.T. Baldwin, and M.H. Steinberg. Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia. Nature
Genetics, 37(4):435–440, 2005.
[65] P. Sebastiani, N. Solovieff, A. T. Dewan, K. M. Walsh, A. Puca, S. W. Hartley,
E. Melista, S. Andersen, D. A. Dworkis, J. B. Wilk, R. H. Myers, M. H. Steinberg,
M. Montano, C. T. Baldwin, J. Hoh, and T. T. Perls. Genetic signatures of exceptional
longevity in humans. PLOS ONE, 7(1):e29848, 2012.
[66] P. Sebastiani, N. Timofeev, D. A. Dworkis, T. T. Perls, and M. H. Steinberg. Genome-wide association studies and the genetic dissection of complex traits. American Journal
of Hematology, 84(8):504–15, 2009.
[67] Paola Sebastiani, Evan C. Hadley, Michael Province, Kaare Christensen, Winifred
Rossi, Thomas T. Perls, and Arlene S. Ash. A family longevity selection score: ranking sibships by their longevity, size, and availability for study. American Journal of
Epidemiology, 170(12):1555–1562, Dec 2009.
[68] B. Servin and M. Stephens. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLOS Genetics, 3(7):e114, 2007.
[69] M. Slatkin. Linkage disequilibrium–understanding the evolutionary past and mapping
the medical future. Nature Reviews Genetics, 9(6):477–85, 2008.
[70] H. C. So and P. C. Sham. Robust association tests under different genetic models,
allowing for binary or quantitative traits and covariates. Behavior Genetics, 41(5):768–
75, 2011.
[71] D.J. Spiegelhalter, N.G. Best, B.P. Carlin, and A. Van der Linde. Bayesian measures
of model complexity and fit. Journal of the Royal Statistical Society Series B, 64(Part
4):583–639, 2002.
[72] M. Stephens and D. J. Balding. Bayesian statistical methods for genetic association
studies. Nature Reviews Genetics, 10(10):681–90, 2009.
[73] The Wellcome Trust Case Control Consortium. Genome-wide association study
of 14,000 cases of seven common diseases and 3,000 shared controls. Nature,
447(7145):661–78, 2007.
[74] T. Therneau. coxme: Mixed Effects Cox Models, 2009. Available online at: http://cran.r-project.org/web/packages/coxme/coxme.pdf.
[75] T. Therneau, E. Atkinson, J. Sinnwell, M. Matsumoto, D. Schaid, and S. McDonnell. kinship2: Pedigree functions, 2012. Available online at: http://cran.r-project.org/web/packages/kinship2/kinship2.pdf.
[76] T. M. Therneau, P. M. Grambsch, and V. S. Pankratz. Penalized survival models and
frailty. Journal of Computational and Graphical Statistics, 12:156–175, 2003.
[77] Duncan Thomas. Gene-environment-wide association studies: emerging approaches.
Nature Reviews Genetics, 11(4):259–272, 2010. 10.1038/nrg2764.
[78] Florin Vaida and Suzette Blanchard. Conditional akaike information for mixed-effects
models. Biometrika, 92(2):351–370, 2005.
[79] C. T. Volinsky and A. E. Raftery. Bayesian information criterion for censored survival
models. Biometrics, 56(1):256–62, 2000.
[80] J. Wakefield. A bayesian measure of the probability of false discovery in genetic
epidemiology studies. American Journal of Human Genetics, 81(2):208–27, 2007.
[81] J. Wakefield. Reporting and interpretation in genome-wide association studies. International Journal of Epidemiology, 37(3):641–53, 2008.
[82] J. Wakefield. Bayes factors for genome-wide association studies: comparison with
p-values. Genetic Epidemiology, 33(1):79–86, 2009.
[83] J. Wakefield. Commentary: Genome-wide significance thresholds via bayes factors.
International Journal of Epidemiology, 41(1):286–91, 2012.
[84] P. Waldmann. Easy and flexible bayesian inference of quantitative genetic parameters.
Evolution, 63(6):1640–3, 2009.
[85] Wilhelm Weinberg. Über den Nachweis der Vererbung beim Menschen. 1908.
[86] J. Xu, A. Yuan, and G. Zheng. Bayes factor based on the trend test incorporating
hardy-weinberg disequilibrium: more power to detect genetic association. Annals of
Human Genetics, 76(4):301–11, 2012.
[87] Y. Yang, E. F. Remmers, C. B. Ogunwole, D. L. Kastner, P. K. Gregersen, and W. Li.
Effective sample size: Quick estimation of the effect of related samples in genetic
case-control association analyses. Computational Biology and Chemistry, 35(1):40–9,
2011.
[88] W. K. Yip and C. Lange. Quantitative trait prediction based on genetic marker-array
data, a simulation study. Bioinformatics, 27(6):745–8, 2011.
[89] X. Zhou and M. Stephens. Genome-wide efficient mixed-model analysis for association
studies. Nature Genetics, 44(7):821–4, 2012.
Curriculum Vitae
Contact
Harold Taehyun Bae
Year of Birth: 1982
20 Second Street H223, Cambridge, MA 02141
Education
Boston University, PhD candidate in Biostatistics, 2009 – present. Thesis advisor: Paola Sebastiani. Supported by Interdisciplinary Training
Grant in Biostatistics (T-32)
The Dartmouth Institute, MS in Healthcare Services Research, 2008.
Dartmouth College, BA in Mathematics, 2005. Cum Laude
Publications
1. Bae HT, Steinberg MH, Perls TT, Sebastiani P. Bayesian Polynomial Regression Models to Fit Multiple Genetic Models for Quantitative Traits.
[Accepted for publication in Bayesian Analysis]
2. Sun F, Sebastiani P, Schupf N, Bae H, Andersen SL, McIntosh A, Abel
H, Elo IT, Perls TT. Extended maternal age at birth of last child and
women’s longevity in the Long Life Family Study. Menopause. 2014 Jun
23.
3. Lee JH, Cheng R, Honig LS, Feitosa M, Kammerer C, Kang MS, Schupf N,
Lin J, Sanders JL, Bae HT, Druley T, Perls TT, Christensen K, Province
M, Mayeux R. Genome wide association and linkage analyses identified
three loci – 4q25, 17q23.2 and 10q11.21 – associated with variation in
leukocyte telomere length: The Long Life Family Study. Frontiers in
Genetics. 2014 January 17.
4. Sebastiani P, Bae H, McIntosh A, Monti S. Bayesian Graphical Models for
Gene X Environment Interaction. Proceedings of Joint Statistical Meeting.
2013.
5. Sebastiani P, Bae H, Sun FX, Andersen SL, Daw EW, Malovini A, Kojima
T, Hirose N, Schupf N, Puca A, Perls TT. Meta-analysis of genetic variants
associated with human exceptional longevity. Aging (Albany NY). 2013
Aug 31.
6. Bachman E, Travison TG, Basaria S, Davda MN, Guo W, Li M, Connor
Westfall J, Bae H, Gordeuk V, Bhasin S. Testosterone Induces Erythrocytosis via Increased Erythropoietin and Suppressed Hepcidin: Evidence
for a New Erythropoietin/Hemoglobin Set Point. The Journals of Gerontology: Series A. 2013 Oct 24.
7. Bae HT, Sebastiani P, Sun JX, Andersen SL, Daw EW, Terracciano A,
Ferrucci L, Perls TT. Genome-wide association study of personality traits
in the long life family study. Frontiers in Genetics. 2013 May 8;4:65.
8. Alsultan A, Ngo D, Bae H, Sebastiani P, Baldwin CT, Melista E, Suliman AM, Albuali WH, Nasserullah Z, Luo HY, Chui DH, Steinberg MH, Al-Ali AK. Genetic studies of fetal hemoglobin in the Arab-Indian haplotype sickle cell-β(0) thalassemia. American Journal of Hematology. 2013 Jun;88(6):531-2.
9. Ngo D, Bae H, Steinberg MH, Sebastiani P, Solovieff N, Baldwin CT,
Melista E, Safaya S, Farrer LA, Al-Suliman AM, Albuali WH, Al Bagshi
MH, Naserullah Z, Akinsheye I, Gallagher P, Luo HY, Chui DH, Farrell
JJ, Al-Ali AK, Alsultan A. Fetal hemoglobin in sickle cell anemia: Genetic studies of the Arab-Indian haplotype. Blood Cells, Molecules and
Diseases. 2013 Jun;51(1):22-6.
10. Andersen SL, Sun JX, Sebastiani P, Huntly J, Gass JD, Feldman L, Bae
H, Christiansen L, Perls TT. Personality Factors in the Long Life Family
Study. The Journals of Gerontology: Series B. 2012 Dec 28.
11. Bae HT, Baldwin CT, Sebastiani P, Telen MJ, Ashley-Koch A, Garrett
M, Hooper WC, Bean CJ, Debaun MR, Arking DE, Bhatnagar P, Casella
JF, Keefer JR, Barron-Casella E, Gordeuk V, Kato GJ, Minniti C, Taylor
J, Campbell A, Luchtman-Jones L, Hoppe C, Gladwin MT, Zhang Y,
Steinberg MH. Meta-analysis of 2040 sickle cell anemia patients: BCL11A
and HBS1L-MYB are the major modifiers of HbF in African Americans.
Blood. 2012 Aug 30;120(9):1961-2.
12. Cordoba G, Schwartz L, Woloshin S, Bae H, Gøtzsche PC. Definition, reporting, and interpretation of composite outcomes in clinical trials: systematic review. British Medical Journal. 2010 Aug 18;341:c3920.
13. Chatterjee A, Chen L, Goldenberg EA, Bae HT, Finlayson SR. Opportunity cost in the evaluation of surgical innovations: a case study of laparoscopic versus open colectomy. Surgical Endoscopy. 2010 May;24(5):1075-9.