THE PROBABILITY OF SNPS ASSOCIATED WITH A DISEASE
By
LU ZHANG
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
Thesis Advisor: Dr. Robert C. Elston
Department of Epidemiology and Biostatistics
CASE WESTERN RESERVE UNIVERSITY
January, 2015
CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES
We hereby approve the thesis/dissertation of
Lu Zhang
candidate for the degree of PhD
Committee Chair
Dr. Robert C. Elston
Committee Member
Dr. Sanford Markowitz
Committee Member
Dr. Xiaofeng Zhu
Committee Member
Dr. Nora Nock
Date of Defense
August 26, 2014
*We also certify that written approval has been obtained for any proprietary material
contained therein
ACKNOWLEDGMENTS
Special thanks go to my advisor, Dr. Robert C. Elston, whose invaluable support,
encouragement and consideration made this thesis possible. This thesis is much better
than it would otherwise have been, thanks to his patience with my language and his help
in making it more readable.
I would like to thank my thesis committee members, Dr. Sanford Markowitz, Dr.
Xiaofeng Zhu and Dr. Nora Nock, for their suggestions and for their contributions to the
timely completion of this dissertation.
Thanks also go to the other faculty and staff in the Department of Epidemiology
and Biostatistics, Case Western Reserve University. In particular, I thank Dr. Omar de la Cruz
C. and Dr. Tao Feng for their assistance.
I would also like to thank my parents, Bo Zhang and Ping Lu, for all their
emotional support through the years while I was working toward this degree.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER I. LITERATURE REVIEW AND AIMS OF THE DISSERTATION
1.1 Background
1.2 Notation
1.3 Generalized Linear Models
1.4 Bonferroni Correction
1.5 Permutation Testing
1.6 Principal Component Analysis
1.7 Linkage Disequilibrium
1.8 False Discovery Rate
1.9 False Discovery Rate Controlling Procedure
1.10 Point Estimate of the FDR by Permutations
1.11 Weighted FDR Control
1.12 Prioritized Subset Analysis
1.13 Aim of Dissertation
CHAPTER II. THE METHOD
2.1 Introduction
2.2 Posterior Probabilities of Association
2.3 Normal Transformation
2.3.1 Inverse CDF Transform
2.3.2 Commingling analysis
2.4 Permutation-Based Point Estimator
2.5 Histogram-Based Estimator of the Number of the True Null Hypothesis
2.6 Linear Regression-Based Estimator of the Number of the True Null Hypothesis
CHAPTER III. ANALYSIS OF THE WELLCOME TRUST CASE CONTROL CONSORTIUM CROHN'S DISEASE DATA
3.1 Introduction
3.2 Normal Transformation and Commingling Analysis
3.3 Permutation-Based Point Estimator
3.4 Histogram-Based Estimator of the Number of Null Hypotheses
3.5 Fitting the Alternative Distribution
CHAPTER IV. DATA CLEANING AND BASIC ASSOCIATION ANALYSIS OF THE COLON CANCER DATA
4.1 Introduction to the data
4.2 Gender Check
4.3 Sample Relatedness Check
4.4 The Quality of SNPs
4.5 Principal Component Analysis
4.6 Permutation Tests
4.6.1 Minor Allele Frequencies
4.7 The Result of Association Analysis
CHAPTER V. POSTERIOR PROBABILITIES FOR THE COLON CANCER DATA
CHAPTER VI. AREA FOR FUTURE RESEARCH
6.1 Estimate of the Alternative Distribution
6.2 Fitting the Alternative Distribution
6.3 Validation Dataset
6.4 Consider all the SNPs together within a gene
Appendix
References
LIST OF TABLES
Table 3.2.1  Inputs and outputs of each step
Table 3.5.1  Estimates for mixture of different number of distributions
Table 3.5.2  The posterior probabilities for the 71 LD intervals
Table 4.2.1  Gender check
Table 4.7.1  Summary of Top 9 SNPs in Association 1
Table 5.1  Estimate for mixtures of normal distributions
Table 5.2  Estimate for mixtures of normal distributions if the top 2068 SNPs are excluded
Table 5.3  Genes that include the SNPs with posterior probability >0.09495
Table 5.4  LD between pairs of SNPs in the discovery dataset
Table 5.5  Results of association analysis in the two datasets
LIST OF FIGURES
Figure 3.1.1  The Q-Q plot of CD based on WTCCC data
Figure 3.2.1  Q-Q plots for fitted normal transformed p values and original normal transformed p values for each step
Figure 3.3.1  Average proportion of alternative distribution
Figure 3.3.2  The relation between the proportion of p-values corresponding to the alternative distribution and the significance level cut-off
Figure 3.4.1  The estimated number of null hypotheses as a function of the number of bins for the WTCCC dataset
Figure 3.4.2  The estimated proportion of alternative distribution p-values in a bin
Figure 3.5.1  Empirical cumulative distributions of negative normal transformed p values for all SNPs
Figure 3.5.2  Empirical cumulative distributions of negative normal transformed p values for the top 43,564 SNPs
Figure 3.5.3  Empirical cumulative distributions for negative normal transformed p values for the top 43,564 SNPs
Figure 3.5.4  Empirical cumulative distribution and estimated density function of negative normal transformed alternative hypothesis p-values for the top 43,564 SNPs
Figure 3.5.5  Q-Q plot between negative fitted normal transformed and negative original normal transformed p-values
Figure 4.3.1  Z0 vs. Z1 for all pairs of 568 individuals
Figure 4.3.2  Histogram of kinship coefficients for pairs of 568 individuals
Figure 4.3.3  Z0 vs. Z1 for pairs of 540 individuals
Figure 4.3.4  Histogram of kinship coefficients for pairs of 540 individuals
Figure 4.3.5  Z0 vs. Z1 for pairs of 530 individuals
Figure 4.3.6  Histogram of kinship coefficients for pairs of 530 individuals
Figure 4.5.1  PC1 vs. PC2
Figure 4.6.1  Q-Q plot for all the SNPs
Figure 4.6.2  Empirical distribution obtained from SNPs with MAF between 0.0 and 0.1
Figure 4.6.3  Empirical distribution obtained from 10,000 random SNPs with MAF between 0.0 and 0.1
Figure 4.7.1  -log(p-values) in blocks 60-62
Figure 4.7.2  -log(p-values) in block 63
Figure 5.1  Empirical cumulative distributions of negative normal transformed p-values for the top 577,700 SNPs in the colon cancer dataset
Figure 5.2  Smoothed ecdf and density of the alternative distribution for negative normal transformed p-values in the colon cancer dataset
Figure 5.3  Q-Q plot between fitted negative normal transformed p-values and negative normal transformed original p-values
Figure 5.4  Posterior Probabilities for top 577,700 SNPs
Figure 5.5  The Q-Q plot of the negative normal transformed alternative distribution of p-values for the SNPs with the highest posterior probabilities
Figure 5.6  The Q-Q plot of the negative normal transformed alternative distribution of p-values and the negative normal transformed permutation p-values for the SNPs with the highest posterior
The Probability of SNPs Associated with a Disease
Abstract
By
LU ZHANG
Multiple inference is widely used in genome-wide association studies, but
control of the familywise error rate (FWER) has several problems. Benjamini &
Hochberg (1995) introduced a new point of view on error control, the false discovery
rate (FDR), which is the expected ratio of the number of false rejections to the total
number of rejections. FDR control is conservative when the unknown number of true null
hypotheses is less than the total number of hypotheses. To reduce this bias, many
methods have been proposed to estimate the number of true null hypotheses.
In this dissertation, we develop a simple nonparametric empirical model to help
select SNPs that are associated with a disease. This model is connected with the
theory of the FDR. The distribution of null hypothesis p-values is estimated by the
permutation p-values. The distribution of alternative hypothesis p-values is
estimated by the distribution of the original p-values after subtracting the
proportion of the null hypothesis p-values. Thus the model can produce a useful
posterior probability of effect for each individual SNP with a minimum of prior
assumptions. For the Wellcome Trust Case Control Consortium (WTCCC) Crohn's
disease dataset, the highest posterior probabilities are consistent with the most
significant p-values. For the colon cancer data, the SNPs with the highest posterior
probabilities are quite different from the SNPs with the most significant p-values.
CHAPTER I
LITERATURE REVIEW AND AIMS OF THE DISSERTATION
1.1 Background
Over the last two decades, genome-wide association studies (GWAS), which typically
test the association between a disease and hundreds of thousands, or even millions, of
genetic markers across the human genome, have rapidly become a standard method for
discovering complex disease genes. The advent of high-throughput technologies for
single nucleotide polymorphism (SNP) genotyping has made it possible to carry out
GWAS analyses rapidly. When pursuing multiple inferences, researchers have to find a way to
select the statistically significant results for emphasis and to support their conclusions,
but the multitude of tests leads to both false positive (Type I error) and false negative
(Type II error) results.
Type I and Type II errors may result from small sample size, an
incorrect genetic model for the disease, and linkage disequilibrium between SNPs; the
last of these means that the tests are correlated. An unguarded use of single-inference
procedures greatly increases the Type I error. To control this selection effect, classical
methods try to restrict the Type I error rate in families of comparisons under
simultaneous consideration. These methods assume that all the tests are independent. Also,
most of the methodology for controlling the familywise error rate (FWER) assumes that the
test statistics are distributed as multivariate normal (or t). In fact, the statistics often
follow mixtures of distributions. The classical multiple-comparison procedure leads to three
main problems, as follows:
1. It controls the familywise error rate (FWER) in the strong sense, so the results tend to be
too conservative: the overall power is lower than that of the per-comparison procedure
at the same significance level.
2. Most of the FWER controlling methods assume the test statistics are multivariate
normally distributed under the null hypothesis. But in practice, many of the test
statistics are not multivariate normal. In fact, the distributions of test statistics are
often combined mixtures of distributions of unknown types.
3. Control of the FWER may not be needed. The FWER is the probability that at least one
SNP that is actually not associated with disease is tested and found to be significant,
but researchers may be more interested in another quantity, the proportion
of false discoveries in the set of significant tests. If this proportion is large, many of
the declared discoveries are probably false.
Some researchers have tried to control the FWER more accurately by allowing
for the correlation among statistical tests of markers in genome-wide studies. These studies
fall into three categories: permutation testing (Han et al., 2009; Browning, 2008),
principal component analysis (PCA) (Cheverud, 2001; Galwey, 2009; Nyholt, 2004), and
analysis of the underlying linkage disequilibrium (LD) structure (Spencer et al., 2009;
Perneger, 1998; Gao et al., 2008). Also, Benjamini and Hochberg (1995) suggested a
new point of view to control errors, the false discovery rate (FDR).
Although the FDR is less conservative than the FWER, the Q-Q plots of p-values from
GWAS still show that there are many SNPs that depart from the expected null
distribution but cannot be declared significant. Therefore, this dissertation attempts
to estimate the non-null distribution empirically, in order to estimate the
probability that SNPs are truly associated with disease.
1.2 Notation
In this dissertation, random variables will be denoted by uppercase letters and
observed data will be denoted by lowercase letters. Matrices and vectors will be in
boldface.
1.3 Generalized Linear Models (GLM)
Logistic regression models have been widely used in studies to predict a binary
dependent response from several explanatory predictors (McCullagh et al., 1989;
Agresti, 1992). The statistical methods of GLM have been well developed for
independent samples where the variability associated with each observation is
determined by a function of the explanatory predictors. However, it is hard to be
confident that the samples are independent in applications. In GWAS, even when
individuals are independent, SNPs that are located proximal to each other on the same
chromosome tend to be inherited together, so they are highly correlated.
Standard methods that ignore the correlation structures may greatly inflate type I
error and tend to underestimate the true standard error of an estimated effect
(Haseman et al., 1979; Rao et al., 1984). To solve this problem, generalized linear mixed
models (GLMM) have been considered (Schall, 1991). These models take into
account the fact that, in many studies, responses are both discrete and correlated.
The main difficulty with the GLMM is its intensive computational requirements, because
the likelihood usually involves a high-dimensional integral.
Generalized linear models are made up of three components:
i. The random component: a dependent response $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)$ comes from
the exponential family of distributions, including the Normal, Bernoulli, Binomial,
Poisson, Gamma, etc., where n is the number of observations. The exponential
family is a set of probability distributions whose probability density function can
be expressed as:
$$f_Y(y \mid \theta) = h(y)\exp\{\eta(\theta)\,T(y) - A(\theta)\}, \tag{1.1}$$
where $\theta$ is the parameter of the family, $T(y)$ is the vector of sufficient statistics and
$A(\theta)$ is the cumulant generating function. If $\eta(\theta) = \theta$, the exponential family is said to
be in canonical form.
ii. The systematic component: explanatory variables $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p$, where
$\mathbf{x}_j^T = (x_{j1}, x_{j2}, \ldots, x_{jn})$ and p is the number of explanatory variables, produce a linear
predictor:
$$\sum_{j=1}^{p} \mathbf{x}_j \beta_j .$$
iii. The link function: the link between the mean response and the systematic
component:
$$g(\boldsymbol{\mu}) = \sum_{j=1}^{p} \mathbf{x}_j \beta_j ,$$
where $g(\boldsymbol{\mu})^T = (g(\mu_1), g(\mu_2), \ldots, g(\mu_n))$. Thus g(.) is called the link function.
There are three common link functions:
(1) Identity link (usually used in normal and gamma regression models):
$$g(\mu) = \mu .$$
(2) Log link (can be used when µ is positive, as when the data are Poisson counts):
$$g(\mu) = \log(\mu) .$$
(3) Logit link (can be used when µ is bounded between 0 and 1, as when the data are binary):
$$g(\mu) = \log\left(\frac{\mu}{1-\mu}\right) .$$
The parameters in $\boldsymbol{\beta}$ can be estimated by maximizing the likelihood function.
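The three link functions and the linear predictor can be illustrated with a minimal Python sketch. The function names, the coefficient values and the single-SNP coding below are assumptions made for illustration only; they are not taken from the analyses in this dissertation.

```python
import math

def identity_link(mu):
    """Identity link: g(mu) = mu."""
    return mu

def log_link(mu):
    """Log link: g(mu) = log(mu), defined for mu > 0."""
    return math.log(mu)

def logit_link(mu):
    """Logit link: g(mu) = log(mu / (1 - mu)), defined for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))

def inverse_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(x_row, beta):
    """Linear predictor for one observation: sum over j of x_j * beta_j."""
    return sum(x * b for x, b in zip(x_row, beta))

# Hypothetical example: an intercept plus one SNP coded as the number of minor alleles.
beta = [-1.0, 0.5]                        # assumed coefficients, for illustration only
eta = linear_predictor([1.0, 2.0], beta)  # -1.0 + 0.5 * 2
mu = inverse_logit(eta)                   # fitted probability of being a case
```

Applying `logit_link` to `mu` recovers `eta`, which is the sense in which the link connects the mean response to the systematic component.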
Incorporating additional random effects into the systematic component gives the
generalized linear mixed model. The general form of the model is:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\varepsilon},$$
where $\mathbf{X}$ is a matrix of p predictor variables; $\boldsymbol{\beta}$ is a column vector of the fixed-effect
regression coefficients; $\mathbf{Z}$ is the design matrix for q random effects; $\boldsymbol{\gamma}$ is a vector of
random effects; and $\boldsymbol{\varepsilon}$ is a column vector of residuals, that part of $\mathbf{y}$ that is not explained
by the model.
In this model, the response $y_i$ depends on the explanatory variables $\mathbf{x}_i$ and, through
$\mathbf{z}_i$, on the random effects. The model assumes that, given a q-dimensional vector $\boldsymbol{\gamma}$ of random effects,
the $Y_i$ are conditionally independent with means $E(Y_i \mid \boldsymbol{\gamma}) = \mu_i^{(\boldsymbol{\gamma})}$. The model also makes
the assumption that $\boldsymbol{\gamma}$ has a multivariate normal distribution: $\boldsymbol{\gamma} \sim N(\mathbf{0}, \mathbf{G})$, where $\mathbf{G}$ is
the variance-covariance matrix of the random effects.
Because the exact likelihood function involves a high-dimensional integration, some
approximations to the likelihood function have been proposed. The most popular
method among them is the penalized quasi-likelihood (PQL) method (Breslow et al.,
1993). It approximates the marginal quasi-likelihood using Laplace's
method, which leads eventually to estimating equations based on a PQL for the mean
parameters and a pseudo-likelihood for the variances. Implementation involves iterative
use of normal theory procedures for restricted maximum likelihood (REML) estimation.
The PQL tends to underestimate somewhat the variance components and fixed effects
when applied to clustered binary data.
1.4 Bonferroni Correction
In practice, sequencing data contain hundreds of thousands, or even millions, of SNPs.
Fitting generalized linear mixed models to a dataset of this size is at the limit of what is
computationally feasible. Typically, the correlations between SNPs are therefore ignored in
GWAS, and a Bonferroni correction (Holm, 1979), which controls the family-wise error rate
(FWER), is applied for an assumed number of independent tests, possibly in the millions, to
decide on the adjusted type I error.
When n hypotheses $H_1, \ldots, H_n$ with associated test statistics $T_1, \ldots, T_n$ are to be tested,
with the corresponding p-values $P_1, \ldots, P_n$, the overall hypothesis is
$H_0 = \bigcap\{H_i : i = 1, \ldots, n\}$. Thus if $P_{(1)} \le \alpha/n$, where $P_{(1)}$ is the smallest of the p-values and $\alpha$ is the
significance level, $H_0$ should be rejected.
The proof of the appropriateness of the Bonferroni correction follows from Boole's
inequality. The overall FWER is:
$$\mathrm{FWER} = \Pr\left\{\bigcup_{i=1}^{n}\left(P_i \le \frac{\alpha}{n}\right)\right\} \le \sum_{i=1}^{n}\Pr\left(P_i \le \frac{\alpha}{n}\right) \le n \cdot \frac{\alpha}{n} = \alpha .$$
Here we have used the fact that $\sum_{i=1}^{n} \alpha/n = \alpha$, where all the tests are defined as having equal
weights.
Although the derivation of the Bonferroni correction does not require the tests to be
independent, when the correlations among the test statistics are small, the Bonferroni
correction is more accurate than when they are large. The disadvantage of this
procedure is that it may be very conservative, in particular if the test statistics
are highly correlated; moreover, it is often inappropriate to use only the smallest
p-value. Avoiding type I errors may inflate the probability of type II errors. This is
especially true for analyses with low power, such as for rare diseases, low frequency
alleles, or genetic factors with small effect sizes.
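The Bonferroni rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the p-values in the example are made up, and the simulation assumes independent tests with all null hypotheses true, so that the p-values are Uniform(0, 1).

```python
import random

def bonferroni_reject(p_values, alpha=0.05):
    """Return (reject_overall, per_test) under the Bonferroni rule.

    The overall null H0 is rejected when the smallest p-value is <= alpha / n;
    an individual H_i is rejected when its own p-value is <= alpha / n.
    """
    n = len(p_values)
    threshold = alpha / n
    per_test = [p <= threshold for p in p_values]
    return any(per_test), per_test

def simulated_fwer(n_tests, alpha=0.05, n_rep=2000, seed=0):
    """Monte Carlo estimate of the FWER when all n_tests nulls are true
    and the p-values are independent Uniform(0, 1)."""
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(n_rep):
        p_values = [rng.random() for _ in range(n_tests)]
        if bonferroni_reject(p_values, alpha)[0]:
            false_alarms += 1
    return false_alarms / n_rep

# Made-up p-values: with n = 4 the per-test threshold is 0.05 / 4 = 0.0125.
reject, flags = bonferroni_reject([0.001, 0.02, 0.04, 0.30], alpha=0.05)
fwer_hat = simulated_fwer(n_tests=100)
```

Under independence the estimated FWER should be close to, and slightly below, the nominal level; with positively correlated tests it would typically be lower still, which is the conservativeness discussed above.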
Several methods are commonly used to control the type I error rate in GWAS data, in
particular p-value adjustments for multiple comparisons (Miller, 1981) and principal
component analysis (PCA) to estimate the equivalent number of independent statistical
comparisons (Gao et al., 2008).
If a GWAS is considered to be complete and agnostic, no prior knowledge or
hypotheses can be provided; but the universal application of a one-size-fits-all
significance level to GWAS is not appropriate (Risch et al., 1996; McCarthy, 2008). Thus
there have been several studies attempting to provide a more accurate method of
allowing for how statistical tests of SNPs are correlated in genome-wide studies. Three
methods are commonly used: permutation testing, principal component analysis and
analysis of the underlying LD structure in the genome.
1.5 Permutation Testing
Permutation testing is often described as the gold standard for determining
statistical significance when performing multiple correlated tests for genetic
association (Galwey, 2009). It can be performed for both case-control and
parent-offspring trio studies. The case/control statuses or the transmitted/untransmitted
statuses of the parental chromosomes are randomly permuted. This allows hypotheses
to be tested when the properties of the distribution of the test statistic under the null
hypothesis are not known or complex, or the data do not meet the assumptions of the
standard tests. Permutation testing is well suited to relatively small datasets
(Edgington, 1995). There are two kinds of permutation tests. One is the
complete test, which compares the observed value of the test statistic with the
distribution of all other possible values. The other is the sample test, which compares
the observed value of the test statistic with the distribution of a random sample of
values of the test statistics (Manly, 1997).
The theory of permutation testing evolved from the work of Fisher (1935) and
Pitman (1937). The basic steps of a permutation test are:
(1) Assume a dataset is composed of two sets of variables, X1 and X2, and a statistic S is to be
tested. Under the null hypothesis H0, the rows of X1 are considered
"exchangeable" with one another if the rows of X2 are considered fixed, or
conversely. Thus, the pairing of X1 and X2 is due only to chance. In other words, any
value of X1 can be paired with any value of X2.
(2) The values of X1 are randomly shuffled while the values of X2 are fixed. Recompute
the statistic S for the randomly paired X1 and X2, obtaining a value s*.
(3) Repeat step (2) a large number of times. The permutations produce a set of values
s* which are obtained under H0.
(4) The values of s* are reference values for the statistic S, computed from the
unpermuted vectors. Together, the unpermuted and permuted values of S form an
estimate of the sampling distribution of S under H0.
(5) As in any other statistical test, the decision is made by comparing the observed value
of the statistic to the reference distribution.
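Steps (1) to (5) can be sketched as a sample (random) permutation test in Python. This is a minimal sketch under stated assumptions: X1 is taken to be case/control status, X2 a genotype coded as the number of minor alleles, and S the absolute difference in mean genotype between cases and controls. The function names and the toy data are hypothetical.

```python
import random

def mean_difference(labels, values):
    """Statistic S: absolute difference in mean value between label 1 and label 0."""
    cases = [v for l, v in zip(labels, values) if l == 1]
    controls = [v for l, v in zip(labels, values) if l == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def permutation_test(labels, values, statistic=mean_difference, n_perm=999, seed=0):
    """Sample permutation test: shuffle the labels (X1) while the values (X2) stay fixed."""
    rng = random.Random(seed)
    observed = statistic(labels, values)        # S computed on the unpermuted data
    shuffled = list(labels)
    n_extreme = 0
    for _ in range(n_perm):                     # step (3): repeat many times
        rng.shuffle(shuffled)                   # step (2): random re-pairing of X1 and X2
        if statistic(shuffled, values) >= observed:
            n_extreme += 1
    # Step (4): the observed value is counted as part of the reference distribution.
    p_value = (n_extreme + 1) / (n_perm + 1)
    return observed, p_value

# Hypothetical toy data: 6 cases followed by 6 controls.
status = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
genotype = [2, 2, 1, 2, 1, 2, 0, 1, 0, 0, 1, 0]
s_obs, p_val = permutation_test(status, genotype)
```

Counting the observed statistic in both the numerator and the denominator keeps the estimated p-value strictly positive, which matches the description in step (4) of the unpermuted and permuted values jointly forming the reference distribution.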
In permutation tests, the reference distribution is obtained by randomly
permuting the data only, without reference to any statistical population. Thus, the data
do not have to be a random sample from some larger statistical population. The only
information provided by the tests is whether the pattern observed in the data is likely,
or not, to have arisen by chance. Permutation test correction is very robust and has the
advantage of drawing the threshold directly from the experimental data (Cheverud,
2001). However, permutation tests are often computationally intensive for large data
sets (Churchill et al., 1994; Conneely et al., 2007). Running 1 million permutations for
500,000 SNPs over 5,000 samples takes up to 4 CPU years (Han et al., 2009). Thus it is
hard to use them for the whole genome, and permutation tests are typically used for small
regions of the size considered in candidate gene studies.
1.6 Principal Component Analysis
Principal Component Analysis (PCA) uses an orthogonal transformation to
convert a set of variables that are probably correlated into a set of orthogonal variables
that are linear combinations of the original variables (Pearson K., 1901). The new
uncorrelated variables are called principal components. The first principal component
has the largest possible variance, and each succeeding component in turn has the next
highest variance under the orthogonality constraint. The number of principal
components or the dimension must be smaller than or equal to the number of original
variables, so PCA can be used to reduce the dimension of the data when the number of
variables is high: it retains the components that capture most of the variability in the
data and discards those that capture little.
There are several methods to calculate the principal components in different
situations. The Karhunen-Loève theorem (Karhunen, 1947) gives a representation as an
infinite linear combination of orthogonal functions. Proper Orthogonal Decomposition
(POD) addresses the case in which the variables are almost linearly dependent. Singular
value decomposition (SVD) (Golub & Van Loan, 1983) is widely used when the number of
dimensions is finite; it decomposes the data matrix directly. Eigenvalue decomposition
(EVD) decomposes a matrix into a canonical form; it is applied to the data covariance or
correlation matrix.
Given any $n \times p$ data matrix $\mathbf{X}$, where n is the number of
observations and p is the number of explanatory variables, we need to find matrices $\mathbf{U}$,
$\mathbf{V}$, and $\mathbf{W}$ such that
$$\mathbf{X} = \mathbf{U}\mathbf{W}\mathbf{V}^T ,$$
where $\mathbf{U}$ is an $n \times n$ unitary matrix; $\mathbf{W}$ is an $n \times p$ rectangular diagonal matrix with
nonnegative real numbers on the diagonal; and $\mathbf{V}$ is a $p \times p$ unitary matrix. The diagonal
entries $w_i$ of $\mathbf{W}$ are the singular values of $\mathbf{X}$. If $\mathbf{X}$ is singular, some of the $w_i$ will be 0. In
general, the rank of $\mathbf{X}$ is equal to the number of non-zero $w_i$. The n columns of $\mathbf{U}$ and
the p columns of $\mathbf{V}$ are the left-singular vectors and right-singular vectors of $\mathbf{X}$,
respectively. Also, the left-singular vectors of $\mathbf{X}$ are the eigenvectors of $\mathbf{X}\mathbf{X}^T$; the
right-singular vectors of $\mathbf{X}$ are the eigenvectors of $\mathbf{X}^T\mathbf{X}$; the non-zero $w_i$ are the square roots
of the non-zero eigenvalues of both $\mathbf{X}\mathbf{X}^T$ and $\mathbf{X}^T\mathbf{X}$.
After decomposing $\mathbf{X}$, the columns of $\mathbf{V}$ are the principal components and the value
of $w_i$ gives the importance of each component. This means that the biggest $w_i$ indicates
the first principal component.
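The decomposition and the stated relations between singular values and eigenvalues can be checked numerically. The following is a minimal sketch assuming NumPy is available; the matrix is random toy data, and the sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is reproducible
n, p = 6, 3                             # hypothetical sizes: 6 observations, 3 variables
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                  # center each column, as is usual before PCA

# full_matrices=True returns U as n x n and Vt (V transposed) as p x p.
U, w, Vt = np.linalg.svd(X, full_matrices=True)

# Rebuild the n x p rectangular diagonal matrix W from the singular values.
W = np.zeros((n, p))
W[:len(w), :len(w)] = np.diag(w)

reconstruction_ok = np.allclose(U @ W @ Vt, X)

# The squared singular values equal the eigenvalues of X^T X.
eigenvalues = np.linalg.eigvalsh(X.T @ X)           # returned in ascending order
eigen_ok = np.allclose(np.sort(w ** 2), eigenvalues)

rank = int(np.sum(w > 1e-10))           # rank = number of non-zero singular values
scores = X @ Vt.T                       # data projected onto the columns of V
explained = w ** 2 / np.sum(w ** 2)     # share of total variation per component
```

`np.linalg.svd` returns the singular values in decreasing order, so the first column of V (the first row of `Vt`) corresponds to the largest $w_i$, that is, to the first principal component.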
In Gao et al. (2008), the whole genome was broken into regions of
approximately 5000 SNPs each, the choice of region size being determined by
computational constraints. They used the LD blocks identified by the Haploview
software (Barrett et al., 2005), based on the definition of haplotype blocks from Gabriel
(2002), so that adjacent regions have low LD with each other. They used PCA
to measure the number of informative/independent SNPs. This number was then used
in a Bonferroni adjustment to estimate the significance threshold for each region. They
found their method to be comparable to permutation-based corrections in both real and
simulated data.
1.7 Linkage Disequilibrium
At the individual level, genes on non-homologous chromosomes assort
independently, but if they are on the same chromosome, they may tend to be inherited
together. Genetic linkage is the tendency for alleles that are located proximal to each
other on the same chromosome to be inherited together during meiosis. Genetic linkage
was first discovered by William Bateson (Bateson, 1900). In a population, linkage
disequilibrium (LD) describes the non-random association of alleles at two or
more loci. The term is used even when the loci are not on the same chromosome (Robbins, 1918), but
if there is no linkage the term gametic phase disequilibrium (GPD) is more appropriate.
Linkage is one cause of LD. There are several other factors causing LD and GPD,
including natural selection, the rate of mutation, genetic drift, non-random mating, and
population structure.
Suppose the haplotype frequencies for two loci A and B with two alleles each are
as given below:
Haplotype    A1B1   A1B2   A2B1   A2B2
Frequency    x11    x12    x21    x22
The allele frequencies can then be derived from the haplotype frequencies:
Allele       A1            A2            B1            B2
Frequency    p1=x11+x12    p2=x21+x22    q1=x11+x21    q2=x12+x22
The expected haplotype frequency for A1B1 should be p1q1 if the two alleles are
independent. The deviation of the observed haplotype frequency from the expected
frequency is the measure of LD or GPD and commonly denoted by a capital D (Lewontin
et al, 1960):
D = x11 - p1q1
There are some other measures for LD and GPD, such as:
(1) Because the range of D depends on the allele frequencies, it is hard to compare the
values of D between markers. Lewontin suggested D' as the measure of LD:
$$D' = \frac{D}{D_{max}}, \quad \text{where } D_{max} = \begin{cases} \min(p_1 q_1, p_2 q_2) & \text{when } D < 0 \\ \min(p_1 q_2, p_2 q_1) & \text{when } D > 0 \end{cases}$$
which theoretically is the maximum value of D (Lewontin, 1964). The range of D' is between -1 and 1. If allele
frequencies are similar, high |D'| indicates the markers are good surrogates for each
other.
(2) The correlation between two loci:
$$r = \frac{|D|}{\sqrt{p_1 p_2 q_1 q_2}}.$$
The range of r is between 0 and 1; 1
means the two markers are identical, 0 means they are in perfect equilibrium.
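The three measures can be computed directly from the four haplotype frequencies. The following is a minimal sketch using the definitions above; the function name and the example frequencies are hypothetical, and it assumes both loci are polymorphic so that the denominator of r is non-zero.

```python
import math

def ld_measures(x11, x12, x21, x22):
    """Return (D, D', r) for two biallelic loci from haplotype frequencies."""
    p1, p2 = x11 + x12, x21 + x22      # allele frequencies at locus A
    q1, q2 = x11 + x21, x12 + x22      # allele frequencies at locus B
    d = x11 - p1 * q1
    if d < 0:
        d_max = min(p1 * q1, p2 * q2)
    else:
        d_max = min(p1 * q2, p2 * q1)
    d_prime = d / d_max if d_max > 0 else 0.0
    r = abs(d) / math.sqrt(p1 * p2 * q1 * q2)
    return d, d_prime, r

# Hypothetical haplotype frequencies
d, d_prime, r = ld_measures(0.5, 0.0, 0.0, 0.5)          # complete association
d0, d_prime0, r0 = ld_measures(0.25, 0.25, 0.25, 0.25)   # equilibrium
```

The first call gives D = 0.25 with D' = r = 1; the second gives D = D' = r = 0.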
1.8 False Discovery Rate
There is another measure commonly used to quantify the error rate when making
simultaneous inference for the many SNPs, rather than the FWER. It is the false
discovery rate (FDR). The FDR is defined as the expected ratio of the number of false
rejections to the number of total rejections (Benjamini & Hochberg, 1995). Consider a
multiple testing situation in which m independent tests are being performed. Suppose
m0 of the null hypotheses are true, m1=m-m0 null hypotheses are false, S is the number
of tests that are rejected and F is the number of tests that are falsely rejected. The
following table shows the types of possible test outcomes:
                 Null Hypotheses   Null Hypotheses   Total
                 Rejected          Not Rejected
Null True        F                 m0-F              m0
Null False       T                 m1-T              m1
Total            S                 m-S               m
Thus the FDR is defined as:
$$\mathrm{FDR} = E\left[\frac{F}{S}\right].$$
When the number of true null hypotheses m0 is smaller than the total number of
hypotheses m, the FDR is smaller than the FWER. Therefore, given the same nominal
control level, controlling the FDR is less stringent than controlling the FWER. Controlling
the FDR can provide a more practical balance between the number of true positives and
the number of false positives.
In some GWAS, researchers can use informative prior knowledge to improve the
estimate of the FDR. Thus, weighting tests according to the prior information can
substantially improve the power of a study. There are two approaches to weighting tests.
One is the weighted FDR control, which is called weighted exceedance control (WEC)
(Genovese et al., 2006). The other is the prioritized subset analysis (PSA) (Li et al., 2008).
1.9 False Discovery Rate Controlling Procedure
In Benjamini and Hochberg's paper (1995), the authors gave a very simple way to
control the FDR. Consider testing $H_1, H_2, \ldots, H_m$ with the corresponding p-values $P_1, P_2, \ldots, P_m$.
Let $P_{(1)} \le P_{(2)} \le \cdots \le P_{(m)}$ be the ordered p-values and denote
by $H_{(i)}$ the null hypothesis corresponding to $P_{(i)}$. Let $k$ be the largest $i$ satisfying
$P_{(i)} \le \frac{i}{m} q^*$; then reject all $H_{(i)}$ ($i = 1, 2, \ldots, k$).
Thus the expected FDR satisfies the inequality:
$$E(Q \mid P_{m_0+1} = p_1, \ldots, P_m = p_{m_1}) \le \frac{m_0}{m} q^* \le q^*,$$
where $Q$ is the proportion of rejected hypotheses that are falsely rejected, $0 \le m_0 \le m$ and $m_1 = m - m_0$; $m_0$ denotes the number of true null
hypotheses and $m_1$ denotes the number of false null hypotheses. The FDR is then
controlled at $q^*$.
This method is conservative when $m_0$, the unknown number of true null hypotheses,
is less than $m$. To correct this bias, we
should estimate the unknown number $m_0$. Many methods exist to estimate $m_0$,
including those given by Benjamini and Hochberg (2000), Storey (2002), and Storey and
Tibshirani (2003).
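The step-up rule described above is short enough to write out. This is a minimal sketch, not a reference implementation; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q_star=0.05):
    """Return the indices of the hypotheses rejected by the step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q_star:
            k = rank           # keep the largest rank that satisfies the bound
    return sorted(order[:k])

# Hypothetical p-values from ten tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, q_star=0.05)
```

With these values only the two smallest p-values fall under their thresholds (0.005 and 0.010), so the first two hypotheses are rejected.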
1.10 Point Estimate of the FDR by Permutations
The FDR is defined as the expected proportion of rejected null
hypotheses that are in fact true null hypotheses:
$$\mathrm{FDR} = E\left[\frac{F}{S}\right] = E\left[\frac{S-T}{S}\right],$$
where S is the total number of tests that reject the null hypothesis, F is the number of false
discoveries in S, and T is the number of true discoveries in S. It is obvious that T, F and S
depend on a fixed significance threshold. The null distribution of a statistic can usually
be approximated using a permutation procedure. Millstein et al. (2013)
used a permutation procedure to obtain a point estimate of the FDR and the
corresponding confidence interval. The statistic can be generated for each replicate
permuted dataset, so a set of statistics can be generated from a large number of
replicate permuted datasets. Let a subscript $i$ and a superscript $*$ denote a result from the $i$th permuted dataset.
For example, $S_i^*$ denotes the number of tests that reject the null hypothesis for the $i$th
permuted dataset. In the permuted dataset, there are no false null hypotheses, i.e.
$m_1^* = 0$ and $m = m_0^*$; consequently,
$$\frac{E[F_i^*]}{m} = \frac{E[F_i^*]}{m_0^*}.$$
The principal assumption of the permutations is the exchangeability of observations
under the null hypothesis. This implies that the expected proportion of false discoveries
among true null hypotheses should be the same in the observed and permuted
datasets:
$$\frac{E[F]}{m_0} = \frac{E[F_i^*]}{m_0^*}.$$
In most GWAS, $m_0/m$ is very close to 1. Thus we can use the approximation
$$m_1 - E[T] \approx 0$$
when it is compared with $m - E[T]$. In this situation,
$$\frac{E[F]}{m_0} = \frac{E[S] - E[T]}{m - E[T] - (m_1 - E[T])} \approx \frac{E[S] - E[T]}{m - E[T]}.$$
So it is quite easy to derive the expression for $E[T]$:
$$E[T] = \frac{m E[F]/m_0 - E[S]}{E[F]/m_0 - 1}.$$
In permutation procedures, as discussed before, $m_1^* = 0$ and $m = m_0^*$, so that $T_i^* = 0$ and
$F_i^* = S_i^*$. By exchangeability, $E[F]/m_0 = E[F_i^*]/m_0^* = E[S_i^*]/m$. Thus we can express $E[T]$ as
$$E[T] = \frac{E[S] - E[S_i^*]}{1 - E[S_i^*]/m}.$$
According to Storey et al. (2003), if $m$ is large, $E[F/S] \approx E[F]/E[S]$. Based on all these
equations, we can derive the estimate of the FDR:
$$\widehat{\mathrm{FDR}} = \frac{\bar{S}^*}{S} \cdot \frac{1 - S/m}{1 - \bar{S}^*/m},$$
where $\bar{S}^*$ is the mean of $S_i^*$ over the permuted datasets.
Millstein et al. also compared the results of 10 permutations with those of 100 permutations. The
point estimates and variances changed only a little. This implies that only a few
permutations are needed to obtain an accurate estimate.
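The point estimate can be computed from counts of p-values below a fixed threshold. The sketch below is a hedged illustration of the final formula only (no confidence interval); the function name, the toy data and the choice to return zero when nothing is rejected are assumptions of this example.

```python
def permutation_fdr(observed_p, permuted_p_sets, threshold):
    """Point estimate (S*bar / S) * (1 - S/m) / (1 - S*bar/m) at a fixed threshold."""
    m = len(observed_p)
    s = sum(p <= threshold for p in observed_p)
    s_star = [sum(p <= threshold for p in perm) for perm in permuted_p_sets]
    s_star_bar = sum(s_star) / len(s_star)
    if s == 0:
        return 0.0   # no discoveries; returning zero is an assumption of this sketch
    return (s_star_bar / s) * (1 - s / m) / (1 - s_star_bar / m)

# Hypothetical data: 100 tests, 10 observed rejections, two permuted datasets
observed = [0.001] * 10 + [0.5] * 90
perm_a = [0.001] * 2 + [0.5] * 98
perm_b = [0.001] * 4 + [0.5] * 96
fdr_hat = permutation_fdr(observed, [perm_a, perm_b], threshold=0.01)
```

Here the mean number of rejections under permutation is 3, so the estimate is (3/10)(0.90/0.97), roughly 0.28.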
1.11 Weighted FDR Control (WEC)
There are two weighting schemes for the WEC: binary weighting and general
weighting. The general weighting requires a researcher to assign a weight specifically for
each SNP. In binary weighting, SNPs thought to be more likely true positives (U=1) are all
assigned the same weight (w1), and SNPs thought to be more likely true negatives (U=0)
are all assigned another weight (w0).
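A minimal sketch of the binary scheme follows. It assumes that each p-value is divided by its weight and that the weights are scaled to average one, which is the constraint the weighted procedure imposes; the function name, the indicator vector and the value of w1 are hypothetical.

```python
def binary_weights(prioritized, w1):
    """Weights w1 for prioritized SNPs (U=1) and w0 for the rest (U=0),
    with w0 chosen so that the weights average to one."""
    m = len(prioritized)
    n1 = sum(prioritized)
    w0 = (m - n1 * w1) / (m - n1)
    if w0 <= 0:
        raise ValueError("w1 is too large for the weights to average to one")
    return [w1 if u else w0 for u in prioritized]

prioritized = [1, 0, 0, 0]            # hypothetical prior indicator U
p_values = [0.02, 0.02, 0.5, 0.9]
weights = binary_weights(prioritized, w1=2.0)
weighted_p = [p / w for p, w in zip(p_values, weights)]
```

The prioritized SNP ends up with a smaller weighted p-value than an unprioritized SNP with the same raw p-value, which is how the prior information enters the FDR procedure.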
Let Wi be the weight assigned to the ith SNP, i=1, ..., m. In the binary weighting
scheme, Wi is either w1 or w0. The p-values are weighted according to Pi*=Pi/Wi, where
Pi* is the weighted p-value of the ith SNP. The Benjamini and Hochberg (1995) FDR
control is then applied to the set of weighted p-values {P1*, P2*, ..., Pm*}. To maintain
the FDR at the desired level, the set of weights {W1, W2, ..., Wm} must meet the
requirement:
$$\bar{W} = \sum_{i=1}^{m} \frac{W_i}{m} = 1.$$
1.12 Prioritized Subset Analysis (PSA)
Suppose that prior information comes from our biological knowledge, or from
findings of data other than that in the current study. All SNPs are allocated into two
subsets: a prioritized subset comprises SNPs likely to be the true positives, and a
non-prioritized subset comprises the remaining SNPs.
1.13 Aim of Dissertation
In conclusion, many methods have been proposed to estimate the number of
independent tests in order to obtain an accurate Bonferroni correction, either for the whole
genome or for specific regions. But when the sample size is small, a Bonferroni
correction for the whole genome is still overly conservative. In this dissertation we shall
study methods that do not rely on an accurate Bonferroni correction for deciding
whether any SNP is associated with a disease. We want to find a method to fit the
mixture of null and non-null distributions, in order to estimate the probability that a
specific SNP is truly associated with the disease.
In Chapter II, we introduce the methods that we want to use to fit the mixture of
distributions. We transform all the p-values to z-scores by using an inverse normal CDF
(cumulative distribution function) transformation. If the null distribution of p-values is a
uniform distribution U(0,1), the transformed null distribution should be the standard
normal distribution N(0,1). Assume the non-null distribution is also a normal distribution.
A commingling analysis is able to fit a mixture of two normal distributions to these data:
$$\alpha N(0, 1) + (1 - \alpha) N(\mu, \sigma^2),$$
where $\alpha$ is the proportion of tests belonging to the null distribution. However, there are
two problems for this commingling analysis:
i. The proportion of the non-null distribution is very small, especially in GWAS, and
the commingling analysis cannot give a good estimate of $\alpha$.
ii. The assumption that both the null and non-null distributions are normal is
doubtful, especially when the sample size is small.
In order to solve these two problems, we use permutations to obtain the null
distribution and we do not restrict the alternative distribution for p-values to being a
single normal distribution. The basic idea of an iterative algorithm to estimate $\alpha$ was
mentioned by Mosig et al. (2001).
In Chapter III, we try our method on a Crohn's disease dataset that comes from
the Wellcome Trust Case Control Consortium (WTCCC). The WTCCC Crohn's disease
(CD) cohort comes from 1748 patients with CD and 2938 controls. We chose this
dataset because of the paper by Elding et al. (2013). In this paper, using the WTCCC and
another dataset provided by the National Institute of Diabetes and Digestive and Kidney
Diseases IBD Genetics Consortium (NIDDK), the authors identified a total of 66 out of
the 71 known loci found from a larger sample (more details are provided in Chapter III).
This indicates that it is not necessary to have such a large sample to explore association.
This chapter includes the information on the dataset, the results from association
analysis, the results for Crohn's disease from other studies, and the results from our
method. We compare the results from our method to those obtained in other studies.
In Chapter IV, we introduce details about the data cleaning of a colon cancer
dataset. In this dataset, 530 individuals come from a colon cancer study with 1,491,783
cleaned SNPs. Samples were genotyped on the Illumina Omni 2.5-8 platform. 139 of the
subjects are controls. The samples also include 234 individuals with stage I or stage II
colon cancers, 68 individuals with stage III and 89 with stage IV colon cancers. The
sample size is quite small compared to the standard GWAS with 1000 cases and 2000
controls. The most significant p-value is $1.2 \times 10^{-6}$, which is far from reaching the
Bonferroni cut-off of about $10^{-8}$. This chapter includes the basic information on the dataset, the
criteria for data cleaning and the results of the initial association analysis. The p-values from
permutation tests for the top SNPs are smaller than the nominal p-values. This indicates
some evidence of association.
In Chapter V, we try our method on the colon cancer dataset. The posterior
probabilities do not show strong evidence of association. We list the top genes based on
their posterior probabilities. The results indicate that SNPs with the highest posterior
probabilities are quite different from the SNPs with the most significant p-values.
Finally, in Chapter VI, we discuss the results and indicate further research areas
that need to be investigated.
CHAPTER II
THE METHOD
2.1 Introduction
Complex traits are generally influenced by many loci with small effects (Glazier,
2002). This is called a "polygenic" architecture. Although GWAS have successfully
identified thousands of trait-associated SNPs (Hindorff, 2009), these SNPs only explain
small portions of the trait heritability (Manolio, 2009). Larger samples are needed
for more SNPs to pass the Bonferroni-corrected significance threshold. Recent results indicate
that GWAS have the potential to explain much of the heritability of common complex
phenotypes (Yang, 2011). The issue is that for most complex traits a large number of
SNPs have effects too small to pass the Bonferroni correction, given current sample
sizes. Unfortunately, few methods try to identify more SNPs other than by
increasing the sample size.
In this chapter, we describe a method to estimate the empirical posterior
probabilities that there is association of a disease with the individual SNP of interest,
with a minimum of prior assumptions. It also connects well with the theory of the FDR
(Benjamini & Hochberg, 1995).
2.2 Posterior Probabilities of Association
The basic idea of posterior probabilities was mentioned by Efron (2001).
Let
$m_0$ = the number of tests for which the null hypothesis is true,
$m_1 = m - m_0$ = the number of tests for which the alternative hypothesis is true,
where $m$ is the total number of tests,
$\pi_0 = m_0/m$ = the probability that the null hypothesis is true,
$\pi_1 = 1 - \pi_0$ = the probability that the alternative hypothesis is true,
and
$f_0(p)$ = the density function of p-values for SNPs that are not associated with the disease
(i.e. the density function for the p-values when the null hypothesis is true),
$f_1(p)$ = the density function of p-values for SNPs that are associated with the disease (i.e.
the density function for p-values when the alternative hypothesis is true).
Thus the mixture density function of the two hypotheses is
$$f(p) = \pi_0 f_0(p) + \pi_1 f_1(p). \qquad (2.1)$$
An application of Bayes' rule gives the posterior probabilities $p_0(p)$ and $p_1(p)$ that a
SNP with a specific p-value $p$ is not associated or associated with the disease:
$$p_0(p) = \pi_0 f_0(p)/f(p),$$
$$p_1(p) = 1 - \frac{\pi_0 f_0(p)}{f(p)} = \pi_1 f_1(p)/f(p). \qquad (2.2)$$
In a typical GWAS, we can estimate $f(p)$ directly from the association analysis. Also,
the null density $f_0(p)$ can be estimated by permutation. In other words, if we can
estimate $\pi_0$ or $m_0$, we can estimate the posterior probabilities directly.
In Efron's paper (2001), only simple bounds for $\pi_0$ and $\pi_1$ were given:
$$\pi_1 \ge 1 - \min_p \{f(p)/f_0(p)\},$$
$$\pi_0 \le \min_p \{f(p)/f_0(p)\}.$$
We will introduce some methods to estimate $\pi_0$.
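Equation (2.2) and the bound on the null proportion can be illustrated with binned density estimates. The sketch below is illustrative only: the function names and the four-bin densities are hypothetical, and clipping the posterior to [0, 1] is an assumption added here to guard against sampling noise.

```python
def posterior_association(f_obs, f_null, pi0):
    """Posterior probability 1 - pi0 * f0 / f for each bin, clipped to [0, 1]."""
    posteriors = []
    for f, f0 in zip(f_obs, f_null):
        p1 = 1.0 - pi0 * f0 / f if f > 0 else 0.0
        posteriors.append(min(1.0, max(0.0, p1)))
    return posteriors

def pi0_upper_bound(f_obs, f_null):
    """Upper bound on the null proportion: the minimum over bins of f / f0."""
    return min(f / f0 for f, f0 in zip(f_obs, f_null) if f0 > 0)

# Hypothetical densities on four equal-width p-value bins of width 0.25
f_null = [1.0, 1.0, 1.0, 1.0]          # uniform null density
f_obs = [1.3, 0.9, 0.9, 0.9]           # observed mixture density
pi0_max = pi0_upper_bound(f_obs, f_null)
post = posterior_association(f_obs, f_null, pi0_max)
```

With the bound 0.9 used as the null proportion, only the first bin (the smallest p-values) receives a non-zero posterior probability of association, about 0.31.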
2.3 Normal Transformation
In GWAS, the null hypothesis is that there is no association between the
phenotype and genotype classes, and all the tests are independent. Under the null
hypothesis, the distribution of p-values is uniform (Murdoch, 2008). But in GWAS the p-values
we are interested in are extremely small (less than $10^{-7}$) and exceed the
numerical accuracy of some programs in S.A.G.E. or R. We therefore have to transform the p-values to
another distribution. We can think of three ways to transform them: transforming to a
normal distribution, logarithm transformation and the ratio between observed p-values
and permutation p-values.
It is quite easy to transform a uniform to a normal distribution. The inverse CDF
method is commonly used.
2.3.1 Inverse CDF Transform
The cumulative distribution function for the standard normal distribution is:
$$\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\, dx.$$
This function is monotone increasing from R (the set of real numbers) onto (0, 1)
and its value at t gives the probability that a standard normal random variable has value less than
or equal to t. Let g: (0, 1) → R be its inverse, i.e. g(x) is the unique solution t to $\Phi(t) = x$.
That is,
$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-u^2/2}\, du = x.$$
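The transformation can be tried with the Python standard library, whose `statistics.NormalDist` class provides `inv_cdf` and `cdf`. This is only a sketch; the helper name `p_to_z` is hypothetical, and `inv_cdf` raises an error for inputs outside the open interval (0, 1).

```python
from statistics import NormalDist

def p_to_z(p):
    """Map a p-value in (0, 1) to a z-score with the inverse standard normal CDF."""
    return NormalDist().inv_cdf(p)

z_small = p_to_z(1e-7)                       # a very small p-value
z_half = p_to_z(0.5)                         # the median maps to zero
p_back = NormalDist().cdf(p_to_z(0.029))     # round trip back to the p-value
```

Small p-values map to large negative z-scores, so "negative normal transformed" values are large and positive for the most significant SNPs.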
2.3.2 Commingling analysis
Commingling analysis is a statistical method for distinguishing between one
distribution and a mixture of two or more distributions. It was commonly used to
provide preliminary evidence for a single genetic locus with a major effect on a
quantitative trait of interest. Maximum likelihood methods suggested by Day (1969) can
be used to evaluate whether one or two normal distributions provide the better
explanation for the data. Maximum likelihood estimates have been obtained by using an
EM algorithm (McLachlan, 1988). The commingling analysis is performed here by using a
S.A.G.E. program. Here, we do not want to test whether there is only one normal
distribution or not. We want to use the program to estimate the parameters of two or
more normal distributions, such as the means, standard deviations and proportions.
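To make the estimation step concrete, the following is a small EM sketch for a mixture of two normal distributions. It is not the S.A.G.E. implementation; the function names, the simulated data and the starting values are all assumptions of this example, and, as with any such search, the result depends on the initial values supplied.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def em_two_normals(data, mean1, sd1, mean2, sd2, prop1=0.5, iterations=100):
    """EM estimates for prop1*N(mean1, sd1^2) + (1-prop1)*N(mean2, sd2^2)."""
    for _ in range(iterations):
        # E-step: probability that each observation belongs to component 1
        resp = []
        for x in data:
            a = prop1 * normal_pdf(x, mean1, sd1)
            b = (1.0 - prop1) * normal_pdf(x, mean2, sd2)
            resp.append(a / (a + b))
        # M-step: weighted means, standard deviations and mixing proportion
        n1 = sum(resp)
        n2 = len(data) - n1
        mean1 = sum(r * x for r, x in zip(resp, data)) / n1
        mean2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        sd1 = math.sqrt(sum(r * (x - mean1) ** 2 for r, x in zip(resp, data)) / n1)
        sd2 = math.sqrt(sum((1.0 - r) * (x - mean2) ** 2 for r, x in zip(resp, data)) / n2)
        prop1 = n1 / len(data)
    return mean1, sd1, mean2, sd2, prop1

# Simulated data: 80% from N(0, 1) and 20% from N(4, 1)
random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(1000)] + \
       [random.gauss(4.0, 1.0) for _ in range(250)]
mean1, sd1, mean2, sd2, prop1 = em_two_normals(data, -1.0, 1.0, 3.0, 1.0)
```

With well-separated components the estimates settle near the simulated values; when one component is a very small fraction of the data, as in GWAS, the fit is far less stable.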
2.4 Permutation-Based Point Estimator
This method was suggested by Millstein et al. (2013). The goal is to estimate the
FDR for a fixed significance threshold. The possible outcomes of an association study are:
                 Null Hypotheses   Null Hypotheses   Total
                 Rejected          Not Rejected
Null True        F                 m0-F              m0
Null False       T                 m1-T              m1
Total            S                 m-S               m
$$\mathrm{FDR} = E\left[\frac{F}{S}\right] = E\left[\frac{S-T}{S}\right],$$
where S is the total number of tests that reject the null hypothesis, F is the number of
false discoveries in S, and T is the number of true discoveries in S. The details of this method were
introduced in Chapter I.
2.5 Histogram-Based Estimator of the Number of True Null Hypotheses
Mosig et al. (2001) proposed an iterative histogram-based algorithm that
estimates $m_0$ when each p value has a continuous uniform (0,1) null distribution. Given
a histogram of p values $p_{(1)}, \ldots, p_{(m)}$, start by choosing a bin size and assume that the
null hypothesis is true for all $m$ tests. The expected number of p values in each bin is
known if we assume that the p value has a continuous uniform (0,1) null distribution.
Then, find the leftmost bin where the number of p values fails to exceed expectation.
The effect of the alternative distribution is relatively small in this bin. We can compute
the observed number of p values minus the expected number in each bin to the left of this bin. The sum of
these differences is a rough estimate of $m_1$. Therefore, $m - m_1$ is an estimate of $m_0$.
Recalculate the expected number of null p values in each bin using the new estimate of
$m_0$, and repeat until the estimate of $m_0$ converges.
This method can be easily improved by using permutations as the null distribution.
Just as the assumption of a uniform (0,1) null distribution is used to estimate the
expected number of p values in each bin, we also can easily estimate the empirical
number of p values in each bin by using permutations.
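The iteration can be sketched in a few lines. This follows one reading of the description above (excess counts are summed over the bins to the left of the first bin that does not exceed expectation) and is not the published R code; the function name, its parameters and the toy data are hypothetical. The optional `null_fractions` argument stands in for an empirical null obtained from permutations.

```python
def histogram_m0(p_values, n_bins=20, null_fractions=None, tol=1e-6, max_iter=1000):
    """Iterative histogram-based estimate of the number of true null hypotheses.

    null_fractions[b] is the expected fraction of null p-values in bin b;
    it defaults to 1/n_bins (uniform null) and may instead be taken from
    permutation p-values.
    """
    m = len(p_values)
    counts = [0] * n_bins
    for p in p_values:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    if null_fractions is None:
        null_fractions = [1.0 / n_bins] * n_bins
    m0 = float(m)
    for _ in range(max_iter):
        expected = [m0 * frac for frac in null_fractions]
        # leftmost bin whose count does not exceed its expected null count
        stop = next((b for b in range(n_bins) if counts[b] <= expected[b]), n_bins)
        m1 = sum(counts[b] - expected[b] for b in range(stop))
        new_m0 = m - m1
        if abs(new_m0 - m0) < tol:
            return new_m0
        m0 = new_m0
    return m0

# Toy data: 900 evenly spread null p-values plus 100 very small ones
null_p = [(k + 0.5) / 900 for k in range(900)]
alt_p = [0.001] * 100
m0_hat = histogram_m0(null_p + alt_p, n_bins=20)
```

On this toy input the estimate moves from 1000 to 905, then 900.25, and converges to 900, the number of null p-values that were generated.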
2.6 Linear Regression-Based Estimator of the Number of True Null Hypotheses
From (2.1), we can easily derive that
$$\pi_0 = \frac{f(p) - f_1(p)}{f_0(p) - f_1(p)}. \qquad (2.3)$$
In most GWAS, the Q-Q plot is a straight line in the range of most p values. This
indicates that in most of the range of p values, the effect of the alternative density $f_1(p)$
is very weak. Thus,
$$\pi_0 \approx \frac{f(p)}{f_0(p)}. \qquad (2.4)$$
The ratio $f(p)/f_0(p)$ can be estimated directly from the empirical distribution by linear
regression.
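One way to read (2.4) is as the slope of a regression of observed bin counts on null bin counts over the range where the alternative is weak. The sketch below is an assumption-laden illustration: regression through the origin, the exclusion of the first bin and the counts themselves are choices made for this example, not details given in the text.

```python
def pi0_by_regression(observed_counts, null_counts, first_bin=1):
    """Slope of a least-squares line through the origin of observed bin
    counts on null bin counts, using bins from first_bin onward."""
    xs = null_counts[first_bin:]
    ys = observed_counts[first_bin:]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical counts in five equal-width p-value bins
null_counts = [200.0, 200.0, 200.0, 200.0, 200.0]       # e.g. averaged over permutations
observed_counts = [240.0, 190.0, 191.0, 189.0, 190.0]   # excess of small p-values in bin 0
pi0_hat = pi0_by_regression(observed_counts, null_counts)
```

Here the slope is 0.95, so about 5% of the tests would be attributed to the alternative distribution.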
CHAPTER III
ANALYSIS OF THE WELLCOME TRUST CASE CONTROL CONSORTIUM
CROHN'S DISEASE DATA
3.1 Introduction
Crohn's disease (CD) belongs to a group of conditions known as Inflammatory
Bowel Diseases (IBD). Like ulcerative colitis (UC), it is a major sub-type of IBD. CD is a
chronic inflammatory condition of the gastrointestinal tract. Although CD and UC are
distinct, the symptoms of these two illnesses are quite similar, but the areas
affected in the gastrointestinal (GI) tract are different. CD principally involves the ileum,
but there is variation in disease phenotype. This variability results from the interaction
of environmental factors. CD is a polygenic disorder with some high-penetrance genes
and a high individual sibling recurrence risk ratio ($\lambda_s$) ranging from 15 to 30 (Schreiber,
2005). Thus, CD has been a widely studied common multifactorial disease.
We chose the CD data from the Wellcome Trust Case Control Consortium
(WTCCC). The WTCCC was formed with a view to exploring the utility, design and analysis
of GWAS for 7 complex human diseases; CD is one of them. The samples were
genotyped on the Affymetrix 500K Array, with 1698 CD cases and 2948 ethnically matched
controls. The CD cases included patients with any sub-type of CD. We chose this dataset
because of the paper by Elding et al. (2013). Using the WTCCC data, these authors
identified a total of 66 out of the 71 loci found from a much larger dataset, and
validated the results in another dataset provided by the National Institute of Diabetes
and Digestive and Kidney Diseases IBD Genetics Consortium (NIDDK). The 71 loci were
found by a meta-analysis of the 6 GWAS of CD existing to date, using more than
6,000 cases and 15,000 controls. Elding et al. used a gene-mapping approach based on a
high-resolution LD map with distances in LD units (LDU). The construction of the LDU
maps was based on the Malécot model. To identify additional genes besides those 71 loci,
they used two criteria: (1) evidence of statistical significance for each data set and (2)
the estimates of location for both samples are within 150 kb of one another. Their
analyses identified a total of 66 loci in one or both data sets using an uncorrected
significance threshold of p value < 0.05. Twenty-eight loci passed the genome-wide significance
threshold.
The second reason that we chose these data is the shape of their Q-Q plot
between $-\log_{10}$(permutation p values) and $-\log_{10}$(original p values) (Figure 3.1.1),
which suggests that many SNPs are associated. CD is a polygenic disorder: the 71 loci explain only
23.2% of the reported heritability (Franke et al., 2010).
Figure 3.1.1 The Q-Q plot between $-\log_{10}$(permutation p values) and
$-\log_{10}$(original p values) of CD based on WTCCC data
3.2 Normal Transformation and Commingling Analysis
The method has been described in Section 2.3. We used S.A.G.E. (S.A.G.E., 2012)
to perform a commingling analysis (a SEGREG function) of normal transformed p values.
As the commingling analysis depends on the initial values provided for a maximum
likelihood search, we use three steps to give more accurate initial values and estimates.
i. We assume there are two independent means and two independent variances.
There is no assumption of Hardy-Weinberg equilibrium and no Box-Cox
transformation.
ii. The initial means of the two distributions are set to 0 and the initial variances are
set to 1.
iii. We use the estimated means and variances from step ii as initial values, then
redo the commingling analysis.
iv. We use the estimated means and variances from step iii as initial values, then
redo the commingling analysis.
The inputs and outputs of each step are shown in Table 3.2.1.
Table 3.2.1 Inputs and outputs of each step
        Inputs                                  Outputs
Step    Mean1   Variance1  Mean2  Variance2     Mean1   Variance1  Mean2  Variance2  $\pi_1$   -2 ln(Likelihood)  AIC
i       0       1          0      1             -0.063  1.03       -3.92  1.86       0.00088   1107270            1107280
ii      -0.063  1.03       -3.92  1.86          -0.059  1.03       -1.81  3.07       0.0041    1107230            1107240
iii     -0.059  1.03       -1.81  3.07          -0.059  1.03       -1.92  3.07       0.0037    1107230            1107240
As the likelihood surface is flat, the standard errors for each estimate cannot be
computed. Both -2 ln(Likelihood) and AIC indicate that steps ii and iii fit the data a
little better than step i, although the improvement in AIC is small. The likelihood
and AIC for steps ii and iii are quite similar, so after three steps the
estimates are quite stable. The large differences between steps i and ii indicate that the
initial values are very important for the commingling analysis as implemented in SEGREG.
Thus, the mean of the alternative distribution is around -1.9, corresponding to a p value
of 0.029. The proportion of the alternative distribution is around 0.0037. If we use these
mixtures of normal distributions as the expected values to give the Q-Q plot between
the fitted normal transformed p values and the original normal transformed p values,
the results of each step are shown in Figure 3.2.1.
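The correspondence between the fitted mean and a p value can be checked by evaluating the standard normal CDF at the mean, here with the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

mean_alt = -1.9                           # fitted mean of the alternative component
p_at_mean = NormalDist().cdf(mean_alt)    # p-value whose z-score equals that mean
```

The value is about 0.0287, which rounds to the 0.029 quoted above.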
Figure 3.2.1 Q-Q plots for fitted normal transformed p values and original normal
transformed p values for each step (x-axis: negative fitted normal transformed p-values)
The straight line is the line with slope=1 and intercept=0. In a Q-Q plot, if the
data points fit the straight line well, the raw data fit the expected data well.
From Figure 3.2.1, the commingling analysis clearly improves the fit of the data in
steps i-iii, but it is hard to say which step is the best. This conclusion is consistent with
the likelihood for each step. Also, one normal distribution cannot fit the alternative
distribution very well. Thus, we try other methods to fit the alternative distribution.
3.3 Permutation-Based Point Estimator
From Section 3.2, we can observe that the null distribution is approximately a
uniform distribution. However, in the colon cancer data, which we consider in Chapters
IV and V, the sample size is relatively small, and we are not sure whether the null
distribution is uniform in this small-sample situation. We therefore use
permutations to estimate the empirical null distribution instead of assuming the uniform
distribution.
Theoretically, we should enumerate all the possible permutations of the phenotypes
to obtain the empirical null distribution. But in GWAS, we have a large number of tests,
so we may not need so many permutations. This has been verified in some papers
(see Chapter I), suggesting that fewer than 100 permutations are sufficient. We still want
to check how many permutations are needed when using this method. Figure 3.3.1
shows the average estimated proportion of alternative distributions for the WTCCC data
for different cut-off values used to determine that a p-value indicates the alternative is
true.
Figure 3.3.1 Average proportion of alternative distribution
This figure indicates that 30 permutations can provide a stable estimate of the
proportion. We note that the estimated proportion depends on the significance cut-off
$\alpha$, so we show the relationship between the cut-off $\alpha$ and the estimated proportion
based on 50 permutations in Figure 3.3.2.
Figure 3.3.2 The relation between the proportion of p-values corresponding to the
alternative distribution and the significance level cut-off $\alpha$
Figure 3.3.2 clearly indicates that the estimated proportion of p-values
corresponding to the alternative distribution is not stable. It depends strongly on the
significance level cut-off $\alpha$. We therefore do not consider the estimates based on this method to be
helpful.
3.4 Histogram-Based Estimator of the Number of Null Hypotheses
The original R code for this method can be downloaded from the Web site at
http://www.public.iastate.edu/~dnett/m0estimation.shtml. The code is used to
compute the estimate of $m_0$ from a vector of observed p values. The null distribution of
p values is assumed to be the uniform distribution. This method compares the
number of observed p values with the number of expected p
values in each bin to estimate $m_0$. Mosig et al. (2001) used the uniform(0,1) distribution
as the expected null distribution of p values. The (0,1) interval is cut into equal-range
bins. We can also use the permutation p values as the expected null distribution of p
values. We modified the code to use the empirical distribution from permutations
instead of the uniform distribution. Although Nettleton et al. (2006) suggested that 20
bins are sufficient, we still want to see whether this is sufficient in our dataset. Figure 3.4.1
shows how the estimate of $m_0$ changes with the number of bins.
Figure 3.4.1 The estimated number of null hypotheses as a function of the number of
bins for the WTCCC dataset
This figure suggests more bins are needed in our situation. Beyond 50 permutations,
the estimate of $m_0$ is reasonably stable, i.e. within the range (384,500, 385,900). In
Nettleton's paper, they only had about 20,000 p values but, in our sample, we have
389,539 p values. Also, most results suggested $m_0$ is around 384,800, so $\hat{\pi}_1 \approx 0.0122$
(about 4,752 SNPs are associated with the disease).
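For reference, the arithmetic behind these two figures can be written out as follows, using only the values quoted above (the rounded proportion 0.0122 is what gives the count of about 4,752):

```latex
\hat{\pi}_1 = 1 - \frac{\hat{m}_0}{m} \approx 1 - \frac{384{,}800}{389{,}539} \approx 0.0122,
\qquad
\hat{m}_1 = \hat{\pi}_1 \, m \approx 0.0122 \times 389{,}539 \approx 4{,}752 .
```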
Figure 3.4.2 The estimated proportion of alternative distribution p-values in a bin
Figure 3.4.2 assumes the estimated proportion of the alternative distribution $\hat{\pi}_1 \approx 0.0122$.
The (0,14) interval is cut into 100 equal-range bins. Each dot in the plot is the
estimated proportion of alternative distribution p-values in the corresponding bin. It is
obvious that there is a peak around $-\log(p) \approx 1.7$ ($p \approx 0.02$). The leftmost part of the
curve shows some negative values. This is probably because of two reasons:
i. The estimate of $\pi_1$ is conservative.
ii. Random effects.
Because only the leftmost side shows an obvious negative effect, the most probable
reason is that the estimate of $\pi_1$ is conservative. This method considers that we are
only interested in the small p values, so it does not consider the larger p values.
3.5 Fitting the Alternative Distribution
Recalling Equations (2.1) and (2.2), the mixture density function of the two
hypotheses is
$$f(p) = \pi_0 f_0(p) + \pi_1 f_1(p),$$
and a SNP with a specific p value is not associated or associated with the disease with
probabilities:
$$p_0(p) = \pi_0 f_0(p)/f(p),$$
$$p_1(p) = 1 - \frac{\pi_0 f_0(p)}{f(p)} = \pi_1 f_1(p)/f(p).$$
Now, in this dataset, we know $\hat{\pi}_1 \approx 0.0122$. If we want to estimate $p_1(p)$, we still
need to find an analytical expression for $f_1(p)$. Obviously the proportion $\pi_1$ of $f_1(p)$ is
small, so it is hard to estimate the alternative density function $f_1(p)$. Also, as we only
know the empirical density functions $f_0(p)$ and $f(p)$, we have to consider how to
choose the number of bins and bin size to obtain a good estimate of $f_1(p)$.
To avoid the problem of choosing the number of bins, we consider estimating the
empirical cumulative distribution function (ecdf) for the alternative hypothesis.
As
$$f(p) = \pi_0 f_0(p) + \pi_1 f_1(p),$$
the cumulative distribution function is
$$F(p) = \pi_0 F_0(p) + \pi_1 F_1(p),$$
i.e.
$$F_1(p) = \frac{1}{\pi_1}\left(F(p) - \pi_0 F_0(p)\right), \qquad (3.1)$$
where $F(p)$, $F_0(p)$ and $F_1(p)$ are the cumulative distribution functions corresponding to $f(p)$, $f_0(p)$ and
$f_1(p)$, respectively.
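Equation (3.1) can be applied directly to empirical distribution functions. The sketch below is an illustration on raw p-values rather than transformed values; the function names, the toy data and the evaluation grid are hypothetical, and the permutation p-values are simply taken as a given list.

```python
from bisect import bisect_right

def ecdf(sorted_values, x):
    """Empirical cumulative distribution function evaluated at x."""
    return bisect_right(sorted_values, x) / len(sorted_values)

def alternative_cdf(observed, null, pi1, grid):
    """Estimate F1 = (F - pi0 * F0) / pi1 on a grid of points, as in (3.1)."""
    obs, nul = sorted(observed), sorted(null)
    pi0 = 1.0 - pi1
    return [(ecdf(obs, x) - pi0 * ecdf(nul, x)) / pi1 for x in grid]

# Toy data: 90 evenly spread null p-values plus 10 small ones
null_p = [(k + 0.5) / 90 for k in range(90)]
observed_p = null_p + [0.001] * 10
f1_hat = alternative_cdf(observed_p, null_p, pi1=0.1, grid=[0.0005, 0.01, 0.5, 1.0])
```

In this toy case the estimated alternative distribution function jumps from 0 to 1 between 0.0005 and 0.01, where the ten non-null p-values were placed. With real data the estimate is noisy and may dip below zero or fail to be monotone.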
Figure 3.5.1 shows the ecdf of both the negative normal transformed original p
values and the negative normal transformed permutation p values. The two
distributions are quite similar, so we also show the part of the ecdf for only the top
~40,000 SNPs in Figure 3.5.2.
Figure 3.5.1 Empirical cumulative distributions of negative normal transformed p values
for all SNPs (x-axis: negative normal transformed p-value)
Figure 3.5.2 Empirical cumulative distributions of negative normal transformed p values
for the top 43,564 SNPs (x-axis: negative normal transformed p-values)
Figure 3.5.3 shows the estimated ecdfs for the negative normal transformed original p
values, the negative normal transformed permutation p values and the negative normal
transformed alternative distribution of p values specifically for the top 43,564 SNPs.
Figure 3.5.3 Empirical cumulative distributions for negative normal transformed p values
for the top 43,564 SNPs (x-axis: negative normal transformed p-value)
From Figure 3.5.3, we can see that the proportion of p values belonging to the
alternative distribution decreases very quickly as the p value increases. This indicates
that the alternative distribution has a very low density when the p values are large. The
reason we chose the top 43,564 SNPs is that beyond 43,564 the density of the
alternative distribution decreases to 0. As we are only interested in the small p values,
and random effects increase for larger p values, we consider that the
alternative distribution exists only for those top 43,564 SNPs.
Figure 3.5.4 Empirical cumulative distribution and estimated density function of
negative normal transformed alternative hypothesis p-values for the top 43,564 SNPs
(x-axis of both panels: negative normal transformed p-value)
We use two steps to smooth these plots. First, we make the ecdf value (quantile) at each
p-value greater than or equal to the values at all points with larger p-values, and
require it to be greater than 0. Second, we use a Gaussian kernel smoother with
bandwidth=0.25 to smooth the density function (Scott, 1992). After smoothing, the
cumulative distribution and density of the alternative hypothesis p-values are as shown
in Figure 3.5.4. We can see that the density function is a mixture of several
approximately normal distributions.
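The two smoothing steps can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the thesis: the raw ecdf values are assumed to be ordered by increasing negative normal transformed p-value, the positivity constraint is approximated by clipping at 0, and the Gaussian kernel smoother is written in a kernel-weighted-average (Nadaraya-Watson) form. The function names are hypothetical.

```python
import numpy as np

def enforce_monotone(raw_cdf):
    """Step 1: make a raw ecdf estimate non-negative and non-decreasing.

    raw_cdf is assumed to be ordered by increasing negative normal
    transformed p-value, i.e. from the largest p-value to the smallest.
    """
    clipped = np.clip(np.asarray(raw_cdf, dtype=float), 0.0, None)
    # Running maximum: each point is at least as large as every point
    # before it (every point with a larger p-value).
    return np.maximum.accumulate(clipped)

def gaussian_kernel_smooth(x, y, bandwidth=0.25):
    """Step 2: Gaussian-kernel weighted average of y, evaluated at each x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = (x[:, None] - x[None, :]) / bandwidth
    weights = np.exp(-0.5 * z ** 2)
    return weights @ y / weights.sum(axis=1)
```

A smoothed density could then be obtained by applying `gaussian_kernel_smooth` to a numerical derivative of the monotone ecdf; whether the thesis smooths the density directly or the derivative is not stated here, so that step is left open.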
We use commingling analysis to fit a mixture of two or three normal distributions
using SEGREG in S.A.G.E. We want to compare how many distributions are sufficient.
Table 3.5.1 shows the estimates, corresponding -2ln(likelihood) and AIC for each
model.
Table 3.5.1 Estimates for mixtures of different numbers of distributions
Number of       Estimate                                                      -2ln(Likelihood)  AIC
distributions   Mean1  Variance1  Mean2  Variance2  Mean3  Variance3
2               1.92   0.24       4.28   0.24       --     --         686594            686608
3               1.87   0.17       3.48   0.17       5.96   0.17       625191            625207
As in the previous commingling analysis, the likelihood surface is flat and
standard errors cannot be computed. Both the likelihood and AIC indicate that the mixture
of three distributions is better than the mixture of two distributions, even though we
cannot perform a formal test of significance, because the likelihood assumes the SNPs are
independent, which is not the case.
After trying several different initial values, the program indicates the alternative
distribution is:
\[
\hat{f}_1(x) = 0.920\,N(1.874,\ 0.414) + 0.0698\,N(3.480,\ 0.414) + 0.00996\,N(5.957,\ 0.414)
\]
Here we assume the variances of the three normal distributions are the same, because
the standard error for a variance is large and, when the program estimates three
independent variances, the estimates of the variances and likelihoods are quite similar.
In order to fit the distribution for the most significant p-values better, the mean of
the third distribution (mean=5.957) was fixed.
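As an illustration, the fitted three-component mixture can be evaluated numerically. The sketch below is not the SEGREG implementation. It assumes that the second argument of each N(., .) is a standard deviation, which is an inference from the numbers (0.414 squared is about 0.17, the common variance in Table 3.5.1) rather than something stated in the text. The function names are hypothetical.

```python
import math

# Mixture weights and component means as given in the fitted distribution.
WEIGHTS = (0.920, 0.0698, 0.00996)
MEANS = (1.874, 3.480, 5.957)
SD = 0.414  # assumption: 0.414 is the common standard deviation

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def alternative_density(x):
    """Density of the fitted mixture at a negative normal transformed p-value x."""
    return sum(w * normal_pdf(x, m, SD) for w, m in zip(WEIGHTS, MEANS))
```

Evaluating `alternative_density` at the three component means shows how little mass the third component carries compared with the first.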
The Q-Q plot of the negative normal transformed original p-values against the fitted
negative normal transformed p-values is shown in Figure 3.5.5.
Figure 3.5.5 Q-Q plot between negative fitted normal transformed and negative original
normal transformed p-values (axis label: Negative Fitted Normal Transformed P-values)
Figure 3.5.5 indicates that this method fits the negative normal transformed original
p-values better than the results from the commingling analysis that assumes that the
alternative distribution of transformed p-values contains only one normal distribution
(see section 3.2, Figure 3.2.1).
Table 3.5.2 The posterior probabilities for the 71 LD intervals
Chr   Start Mb  End Mb   Most Significant   Posterior     p-value From
                         p-value (a)        Probability   Elding et al.
1     67.36     67.77    4.33E-13           1.000         2.10E-11
1     113.95    114.62   8.77E-04           0.188         1.40E-03
1     7.74      7.97     1.50E-03           0.186         --
1     160.69    162.47   5.99E-04           0.263         2.60E-03
1     172.66    172.95   --                 --            3.60E-13
1     200.85    201.06   9.52E-03           0.185         3.00E-14
1     154.97    156.13   2.48E-03           0.184         --
1     197.32    197.95   1.58E-05           0.756         1.20E-04
1     206.8     207.03   --                 --            --
2     25.45     25.6     9.56E-03           0.184         3.50E-02
2     27.39     27.86    2.35E-03           0.186         5.40E-04
2     43.45     43.95    7.94E-04           0.205         2.00E-05
2     60.92     61.89    1.07E-03           0.186         1.90E-05
2     234.15    234.57   1.17E-02           0.186         5.60E-25
2     102.8     103.3    9.44E-02           0.025         1.80E-08
2     198.14    198.96   2.76E-02           0.147         1.60E-02
2     231.05    231.23   3.25E-02           0.131         2.80E-05
3     48.18     51.75    7.29E-07           0.756         1.70E-31
3     18.6      18.88    1.51E-03           0.177         5.30E-03
5     39.84     40.96    4.78E-14           1.000         6.00E-51
5     129.38    132.02   5.44E-07           0.756         2.70E-22
5     150.03    150.4    1.49E-08           0.999         3.60E-09
5     158.5     158.95   3.05E-05           0.740         5.60E-06
5     72.45     72.58    4.17E-02           0.104         5.80E-03
5     96.08     96.42    4.83E-03           0.186         1.30E-08
5     141.41    141.64   3.38E-03           0.182         1.30E-02
5     173.22    173.54   2.24E-02           0.163         --
6     20.49     21.14    4.98E-06           0.735         4.30E-03
6     31.38     32.87    7.61E-07           0.756         1.20E-27
6     3.41      3.47     5.87E-02           0.066         1.50E-07
6     106.39    106.56   1.25E-02           0.186         --
6     167.34    167.55   2.35E-03           0.186         3.30E-09
6     90.8      91.08    1.50E-02           0.183         --
6     159.34    159.54   1.47E-02           0.184         1.20E-03
7     50.25     50.33    --                 --            9.40E-04
8     126.47    126.58   2.19E-02           0.165         4.50E-05
8     129.49    129.6    4.96E-02           0.084         1.30E-04
9     4.94      5.3      8.15E-03           0.180         1.40E-04
9     117.43    117.7    1.12E-02           0.186         3.90E-04
9 (b) 139.13    139.42   --                 --            2.30E-05
10    35.18     35.9     2.03E-02           0.170         3.40E-29
10    6.03      6.17     2.90E-02           0.142         3.60E-03
10    64.3      64.76    2.16E-03           0.120         1.20E-18
10    101.27    101.34   3.59E-02           0.120         1.20E-32
10    59.83     60.14    6.09E-02           0.062         9.30E-09
10    81        81.1     --                 --            --
11    76.02     76.36    1.73E-02           0.178         4.40E-10
11    61.52     61.68    --                 --            3.40E-02
11    63.82     64.29    2.18E-02           0.165         5.80E-03
12    40.13     41.02    3.43E-03           0.186         3.00E-02
13    44.23     44.64    1.81E-02           0.176         6.60E-12
13    42.82     43.1     4.83E-02           0.087         1.30E-07
14    69.16     69.32    --                 --            4.10E-02
14    88.21     88.64    1.91E-02           0.173         6.20E-05
15    67.41     67.48    --                 --            1.50E-02
16    28.29     29.03    1.04E-02           0.185         --
16    50.46     50.85    6.02E-05           0.696         2.60E-15
17    37.37     38.26    2.28E-02           0.162         1.10E-03
17    40.32     41       1.11E-02           0.186         2.00E-18
17    32.49     32.68    5.85E-02           0.067         --
18    12.74     12.93    2.50E-02           0.155         2.90E-12
19    1.09      1.18     --                 --            2.00E-02
19    10.4      10.64    8.44E-02           0.033         1.30E-02
19    33.73     33.78    3.69E-02           0.117         --
19    49.09     49.28    1.31E-02           0.186         --
20    62.18     62.48    3.98E-02           0.109         2.30E-04
21    16.7      16.85    1.02E-02           0.186         4.40E-07
21    45.59     45.7     5.70E-02           0.069         5.50E-06
21    21.81     22.06    8.50E-04           0.193         --
22    29.9      30.67    3.45E-02           0.124         8.00E-07
22 (b) 39.67    39.81    --                 --            --
a. This column shows the p-value of the most significant SNP in this region
b. These two loci are not covered by the WTCCC dataset
In Table 3.5.2, the 71 LD intervals are those reported from other Crohn disease studies
(Franke, et al., 2010). The "Start" and "End" columns give the positions (in Mb) of each
locus. Two LD intervals are not covered by the WTCCC dataset (see Table 3.5.2). Elding
et al. could find one of them, because they divided each chromosome arm into
non-overlapping analytical windows based on LD units (Maniatis, et al., 2002), which can
cover more of the genome than single SNPs. This means that we can find at most 69 loci
based only on the WTCCC dataset. Because we consider that only the top 43,564 SNPs have
positive posterior probabilities, we can see in this table that 61 of the 69 LD intervals
have positive posterior probabilities, and 51 of these 61 LD intervals are consistent
with Elding et al.'s signals. 10 loci have a posterior probability of more than 0.5 of
being associated with the disease; we consider this highly significant. Notably, the
first published result by the WTCCC (WTCCC, 2007) reported just 9 loci. Elding et al.
also identified 134 additional signals (loci). 15 signals are intergenic and their paper
did not provide details about their locations. One further signal is on the X
chromosome; as we only consider the autosomal chromosomes here, this signal is ignored.
Two genes (HSP90AB2 and STL) cannot be found in NCBI 37. Thus they provided 116
additional signals (loci). Because Elding et al. gave the closest gene (up to 300kb), we
also use this to identify genes. 114 of the 116 loci have positive posterior
probabilities (see Appendix 6).
In conclusion, we can verify 175 out of 185 gene loci (94.6%) as having at least one
SNP with a positive posterior probability. As discussed before (see section 3.5), we
estimated that 4,752 SNPs are associated with the disease, and 3,253 (68.5%) of these
SNPs are included in the 175 loci. Most of the posterior probabilities are between 0.1
and 0.2, which would explain why it has been hard to find the genes. Overall, our
analysis suggests that any SNP with a positive posterior probability should be considered
as possibly implicating a gene locus. However, note that the Q-Q plot in Figure 3.5.5
does not fit a straight line very well. As the lower tail comprises the p-values that are
very close to 1, we are not interested in this part. But for the top tail, we should make
sure, in any future analysis, that our model is a good fit for the data.
CHAPTER IV
DATA CLEANING AND BASIC ASSOCIATION ANALYSIS OF THE COLON
CANCER DATASET
4.1 Introduction to the data
Colon cancer is cancer of the large intestine. Most colorectal cancer occurs due to
lifestyle and increasing age. It is the second leading cause of cancer death in the United
States. Our dataset includes 568 unrelated individuals with five different stages of
colon cancer. Stage 0 represents controls and Stages I-IV represent different severity
levels of the disease; a higher number indicates greater severity. All
individuals were of self-reported Ashkenazi Jewish ancestry. Genotyping of 2,379,855
SNPs was performed using the Illumina Omni 2.8-8 platform.
We checked the individuals' information in the following steps (all the steps were
implemented in PLINK v1.07 (Purcell, 2007)):
4.2 Gender Check
Gender check was performed in PLINK V1.07 using the "--check-sex" option. The
summary table of self-reported gender and the gender determined by PLINK V1.07 is given
in Table 4.2.1. PLINK V1.07 suggests that nine females are misclassified as males, three
males are misclassified as females and six individuals have no clearly determined sex,
based on their heterozygosity on the X chromosome.
Table 4.2.1 Gender Check
                   Gender Checked by PLINK V1.07
Reported Gender    Male    Female    Unknown
Male               282     9         4
Female             3       268       2
Unknown            0       0         0
4.3 Sample Relatedness Check
Sample relatedness check was performed in PLINK V1.07 using the "--genome"
option. Figure 4.3.1 shows the pairs of individuals by their degree of relatedness: Z1
(one allele shared IBD) vs. Z0 (0 alleles IBD). The numbers in the figure are the IDs for
pairs of individuals. Figure 4.3.2 is a histogram of the pairwise kinship coefficients.
Although everyone in this sample is supposed to be unrelated, a few pairs still have
very high kinship coefficients.
Figure 4.3.1 Z0 vs. Z1 for all pairs of 568 individuals
Figure 4.3.2 Histogram of kinship coefficients for pairs of 568 individuals
After removing 28 or 38 individuals who are suspected of being related to other
individuals, because of possible twins, duplicates or sample mix-ups, the plots of their
relatedness and kinship coefficients are as shown in Figures 4.3.3-4.3.6. We can see that
all the related individuals are excluded when we remove 36 individuals. The IDs of the 28
individuals and 38 individuals are given in Appendix 1 and Appendix 2, respectively. The
numbers of SNPs with MAF<0.01 on the various chromosomes are given in Appendix 3.
Figure 4.3.3 Z0 vs. Z1 for pairs of 540 individuals
Figure 4.3.4 Histogram of kinship coefficients for pairs of 540 individuals
Figure 4.3.5 Z0 vs. Z1 for all pairs of 530 individuals
Figure 4.3.6 Histogram of kinship coefficients for pairs of 530 individuals
4.4 The Quality of SNPs
After evaluating the quality of the individuals, we evaluate the quality of the
markers in the following three steps:
1. MAF: we removed SNPs for which the minor allele frequency is <1% overall in the
three groups combined (Stage I/II, Stage III and Stage IV). There are 2,379,855 SNPs
before filtering. 2,964 SNPs have missing genotype data and 813,689 SNPs have MAF
< 1%, so 1,563,220 SNPs are left.
2. SNP call rate: to remove SNPs for which genotyping was consistently problematic,
SNPs with missing call rates >2% overall in the three groups combined are excluded.
20,279 additional SNPs are excluded in this step.
3. GenTrain score (Illumina): SNPs with a GenTrain score < 0.6 are removed. 51,581
additional SNPs are excluded in this step.
Finally, we have 1,491,783 SNPs of good quality. Although deviation
from HWE may be due to genotyping errors, inbreeding, or population stratification
(Setakis et al., 2006), here we do not remove any SNPs because of deviation from HWE.
Such deviation may be due to a real difference between two groups, and we do
not want to lose this information in the quality control procedures.
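The three marker filters can be sketched as a sequential pass over per-SNP summary statistics. This is a minimal illustration and not the pipeline actually run for the thesis: the function name and the array inputs are hypothetical, and only the three thresholds (MAF of 1%, missing call rate of 2%, GenTrain score of 0.6) come from the text.

```python
import numpy as np

def sequential_snp_qc(maf, missing_rate, gentrain):
    """Apply the MAF, call-rate and GenTrain filters in order.

    Returns a boolean keep-mask and the number of SNPs newly removed
    at each step (SNPs already removed are not counted again).
    """
    maf = np.asarray(maf, dtype=float)
    missing_rate = np.asarray(missing_rate, dtype=float)
    gentrain = np.asarray(gentrain, dtype=float)

    steps = (
        ("maf", maf >= 0.01),                 # remove MAF < 1%
        ("call_rate", missing_rate <= 0.02),  # remove missing rate > 2%
        ("gentrain", gentrain >= 0.6),        # remove GenTrain score < 0.6
    )
    keep = np.ones(len(maf), dtype=bool)
    removed = {}
    for name, passes in steps:
        removed[name] = int(np.sum(keep & ~passes))
        keep &= passes
    return keep, removed
```

Counting only newly removed SNPs at each step mirrors the way the text reports "additional" exclusions for steps 2 and 3.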
4.5 Principal Component Analysis
Because calculating the principal components from all the SNPs would take a long time,
only SNPs that are not highly correlated with each other are used. Pruning is based on
the variance inflation factor (VIF). The VIF is 1/(1-R²), where R² is the squared
multiple correlation coefficient for a SNP regressed simultaneously on all other SNPs in
a window. The window size is set to 50 SNPs, the window is shifted by 5 SNPs at each
step, and the VIF threshold is 2. We have 530 principal components for 530 individuals.
Figure 4.5.1 shows a plot of PC1 vs. PC2 for the 530 individuals. There is clearly only
one cluster, but a few individuals are not in this cluster.
Figure 4.5.1 PC1 vs. PC2
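The VIF used for pruning, 1/(1-R²), can be computed for one SNP in a window by ordinary least squares. The sketch below is illustrative only: the function name is hypothetical, genotypes are assumed to be coded numerically (for example 0, 1, 2) in a matrix with one column per SNP in the window, and the sliding of the window across the genome is not shown.

```python
import numpy as np

def variance_inflation_factor(genotypes, target):
    """VIF of column `target` regressed on all other columns of the window."""
    genotypes = np.asarray(genotypes, dtype=float)
    y = genotypes[:, target]
    others = np.delete(genotypes, target, axis=1)
    design = np.column_stack([np.ones(len(y)), others])  # add intercept

    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ beta
    ss_res = float(residual @ residual)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    if ss_tot == 0.0:
        return 1.0  # monomorphic SNP: nothing to explain
    r_squared = 1.0 - ss_res / ss_tot
    if r_squared >= 1.0 - 1e-12:
        return float("inf")  # perfectly predicted by the other SNPs
    return 1.0 / (1.0 - r_squared)
```

With the threshold of 2 used here, a SNP would be pruned once more than half of its variance is explained by the other SNPs in the window (R² > 0.5).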
As we are only interested in the SNPs with odds ratio>1 (one-sided test), but
PLINK V1.07 only provides the two-sided test, we recalculated both the nominal
p-values and the theoretical p-values. Let p1 be the p-value from the PLINK V1.07
output, and p2 the one-sided p-value. If the odds ratio>1, then p2=p1/2. If the odds
ratio<1, then p2=1-p1/2. In the following analyses, only one-sided p-values are discussed.
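The conversion from two-sided to one-sided p-values can be written as a small helper. This is a sketch with a hypothetical function name; the case of an odds ratio exactly equal to 1 is not covered by the rule in the text and is treated here like the odds ratio<1 branch, which is an assumption.

```python
def one_sided_p(p_two_sided, odds_ratio):
    """One-sided p-value for the alternative 'odds ratio > 1'."""
    if not 0.0 <= p_two_sided <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if odds_ratio > 1.0:
        return p_two_sided / 2.0
    # Odds ratio < 1 (and, by assumption, exactly 1): effect is in the
    # direction opposite to the one-sided alternative.
    return 1.0 - p_two_sided / 2.0
```

For example, a two-sided p-value of 2.20E-06 with an odds ratio of 2.606 becomes 1.10E-06, whereas the same p-value with an odds ratio below 1 becomes a one-sided p-value close to 1.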
First, we include gender and the top 10 PCs as covariates, then rerun the association
(Stage IV vs. Stages I/II) analysis. The top 50 SNPs in both association tests are given
in Appendix 6. The results seem similar to those of the crude model. Finally, we
choose gender, age, and 2 PCs as our covariates.
4.6 Permutation Tests
After doing association analysis using logistic regression to obtain p-values, we
also perform permutation tests to see if the top SNPs are significant. We randomize the
case-control labels together with the individuals' covariate values once, and again do an
association analysis of all SNPs in PLINK v1.07. A Q-Q plot is used here to compare
the distributions of permutation p-values and nominal p-values. Figure 4.6.1 shows the
Q-Q plot for all SNPs.
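The single permutation described above keeps each individual's case-control label attached to that individual's covariates and shuffles the pair relative to the genotypes. A minimal sketch, with a hypothetical function name and array layout (one row per individual), might look like this; the association analysis itself was run in PLINK and is not reproduced here.

```python
import numpy as np

def permute_labels_with_covariates(phenotype, covariates, seed=None):
    """Shuffle phenotype and covariate rows jointly; genotypes stay in place."""
    phenotype = np.asarray(phenotype)
    covariates = np.asarray(covariates)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(phenotype))
    # The same ordering is applied to both arrays, so every label keeps
    # its own covariate values.
    return phenotype[order], covariates[order]
```

Permuting the two together preserves any relationship between disease stage and the covariates, while breaking the link between stage and genotype.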
Figure 4.6.1 (axis label: Negative Normal Transformed Permutation P-value)
To show that a single permutation is enough to obtain the null distribution, we want
to demonstrate the exchangeability of the 1.5 million tests, or at least determine how
they should be binned so that we have exchangeability within each bin. There are three
possible reasons for non-exchangeability: MAF, HWE and LD structure.
4.6.1 Minor Allele Frequencies
The MAF bins are (0.0, 0.1), (0.1, 0.2), (0.2, 0.3), (0.3, 0.4) and (0.4, 0.5). Here
the individuals used to calculate the MAF are from the two subgroups involved in the
association analysis of interest, i.e. the Stages I/II and Stage IV
individuals, instead of the whole sample. The Q-Q plot compares the distributions of
-log10(nominal p-values) and -log10(permutation p-values) obtained within each bin, but
all the plots contain the 20 top SNPs of interest. Figure 4.6.2 shows the Q-Q plot for
SNPs in the bin 0.0<MAF<0.1. Red dots are the top 20 SNPs in this bin and blue dots are
the top 20 SNPs with MAF not in this bin.
Figure 4.6.2
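The within-bin comparison can be sketched by assigning each SNP to one of the five MAF bins and pairing the sorted -log10 p-values from the two analyses. This is an illustration with hypothetical names; how SNPs exactly on a bin boundary are handled is an assumption (they go to the upper bin here), and the highlighting of the top 20 SNPs is not shown.

```python
import numpy as np

# Interior edges separating (0.0,0.1), (0.1,0.2), (0.2,0.3), (0.3,0.4), (0.4,0.5)
INTERIOR_EDGES = np.array([0.1, 0.2, 0.3, 0.4])

def qq_points_by_maf_bin(maf, nominal_p, permutation_p):
    """Return, per bin index 0..4, sorted -log10 permutation and nominal p-values."""
    maf = np.asarray(maf, dtype=float)
    nominal_p = np.asarray(nominal_p, dtype=float)
    permutation_p = np.asarray(permutation_p, dtype=float)

    bin_index = np.digitize(maf, INTERIOR_EDGES)  # integers 0..4
    points = {}
    for b in range(5):
        in_bin = bin_index == b
        expected = np.sort(-np.log10(permutation_p[in_bin]))
        observed = np.sort(-np.log10(nominal_p[in_bin]))
        points[b] = (expected, observed)
    return points
```

Plotting `observed` against `expected` for one bin gives the Q-Q plot for that bin; points near the diagonal indicate that the permutation p-values describe the null distribution well within the bin.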
We also want to know how many SNPs are sufficient to obtain the null
distribution. 10,000 SNPs with 0.0<MAF<0.1 are randomly chosen, and the Q-Q plot is
given in Figure 4.6.3. Clearly, 10,000 SNPs are not sufficient, so we still use all the
SNPs to obtain the null distribution of p-values.
Figure 4.6.3
4.7 The Results of Association Analysis
As we are interested in finding genes that cause progression of the disease, we
perform a logistic regression association test between Stage IV and Stages I/II combined.
The basic information on the nine most significant SNPs is shown in Table 4.7.1. None of
them meets the Bonferroni cutoff (3.35 × 10^-8). Also, the p-values from a permutation
test are slightly smaller than the nominal p-values. This might indicate that the nominal
p-values are a little conservative. More information about the top 50 SNPs, including
p-values of the HWE test, is given in Appendix 4.
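The Bonferroni cutoff quoted above is consistent with dividing a family-wise level of 0.05 by the 1,491,783 SNPs that passed quality control. The level of 0.05 is not stated in this section, so it is an assumption in the short check below.

```python
ALPHA = 0.05           # assumed family-wise significance level
N_TESTS = 1_491_783    # SNPs remaining after quality control

bonferroni_cutoff = ALPHA / N_TESTS

# Smallest nominal p-value reported in Table 4.7.1
top_nominal_p = 2.20e-06
meets_cutoff = top_nominal_p < bonferroni_cutoff
```

Under this assumption the cutoff is about 3.35 × 10^-8, roughly two orders of magnitude below the most significant nominal p-value, so `meets_cutoff` is false.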
Table 4.7.1 Summary of Top 9 SNPs in Association
CHR  SNP           Gene Symbol             Minor Allele  OR      Nominal    10^7 permutation
                                                                 P-value    P-value
6    rs2024846     TBX18 | LOC100289423    A             2.606   2.20E-06   1.20E-06
4    kgp12071521   NR3C2 | LOC100287246    G             2.78    6.45E-06   4.20E-06
4    kgp7795337    NR3C2 | LOC100287246    G             2.78    6.45E-06   4.20E-06
10   rs11198164    EMX2 | RAB11FIP2        A             0.4175  8.21E-06   3.80E-06
10   rs1343418     EMX2 | RAB11FIP2        A             0.4262  1.18E-05   6.20E-06
10   kgp11193494   EMX2 | RAB11FIP2        A             0.4265  1.22E-05   5.90E-06
8    kgp12211973   CSMD3 | TRPS1           A             5.362   1.69E-05   1.18E-05
8    kgp451648     CSMD3 | TRPS1           G             8.329   2.02E-05   6.70E-06
There are three SNPs among the 20 top SNPs located in the gene CSMD3 on
chromosome 8. As CSMD3 is a cancer gene, we are very interested in this gene. The LD
structure of CSMD3 is investigated in Haploview. The three SNPs in CSMD3 are in blocks
60-63. The -log10(p-values) for these four blocks (blocks 60-63) are shown in Figures
4.7.1 and 4.7.2.
Figure 4.7.1 -log(p-values) in blocks 60-62
Figure 4.7.2 -log(p-values) in block 63
CHAPTER V
POSTERIOR PROBABILITIES FOR THE COLON CANCER DATA
In Chapter IV, we saw that neither logistic regression association analysis nor
permutation tests can provide significant evidence of association. In this chapter, we try
to estimate the posterior probability of association to see whether we can provide
better evidence of association.
Figure 5.1 shows the raw estimate of the negative normal transformed
alternative distribution of p-values, Stages I/II versus Stage IV, in the colon cancer
dataset.
Figure 5.1 Empirical cumulative distributions of negative normal transformed p-values
for the top 577,700 SNPs in the colon cancer dataset
(axis label: Negative Normal Transformed P-value)
Compared with Figure 3.5.3, the alternative distribution of p-values in the
colon cancer data shows much more randomness. This is reasonable, because we have a
much smaller sample size. The red dots are the ecdf of the negative normal transformed
permutation p-values. The black dots are the ecdf of the negative normal
transformed estimated alternative distribution of p-values.
Figure 5.2 shows the two curves after smoothing the empirical cumulative and
density functions.
Figure 5.2 Smoothed ecdf and density of the alternative distribution for negative
normal transformed p-values in the colon cancer dataset
(axis label on both panels: Negative Normal Transformed P-value)
The estimated proportion of p-values from the alternative distribution is approximately
0.00394 (about 5,878 SNPs are associated with the disease). We want to model a mixture of
normal distributions and determine how many are sufficient to model this alternative
distribution. We show the estimates, -2ln(likelihood) and AIC for three models in Table
5.1.
Table 5.1 Estimates for mixtures of normal distributions
Number of       Estimate                                                      -2ln(Likelihood)  AIC
distributions   Mean1  Variance1  Mean2  Variance2  Mean3  Variance3
1               1.80   0.181      --     --         --     --         4675              4680
2               1.70   0.073      2.61   0.073      --     --         3019              3031
3               1.68   0.053      2.60   0.053      3.30   0.053      2416              2430
As in the previous commingling analysis, except when only one distribution is
fitted, the likelihood surface is flat and standard errors cannot be computed. Both the
likelihood and AIC indicate that the mixture of three distributions is better than a
single distribution or a mixture of two distributions. Also, note that if the top
distribution (mean=3.30) is not included in the model, the estimated means of the
remaining two distributions are similar in the two models.
Furthermore, if the p-values of the top 2,068 SNPs (those with negative normal
transformed p-values ≥3) are dropped from the data, the corresponding means and
variances of the mixtures of normal distributions are as shown in Table 5.2.
Table 5.2 Estimates for mixtures of normal distributions if the top 2068 SNPs are
excluded
Number of       Estimate                              -2ln(Likelihood)  AIC
distributions   Mean1  Variance1  Mean2  Variance2
1               1.79   0.173      --     --         4795              4801
2               1.69   0.063      2.60   0.063      3419              3431
From Table 5.2, we see that if the top 2,068 SNPs are dropped, the estimated
mixture of two distributions is still quite similar to that in Table 5.1, and is a much
better fit than a single distribution.
The estimated alternative distribution of normal transformed p-values from the
3-distribution fit shown in Table 5.1 is:
\[
\hat{f}_1(x) = 0.824\,N(1.677,\ 0.229) + 0.170\,N(2.602,\ 0.229) + 0.00604\,N(3.302,\ 0.229)
\]
The Q-Q plot between the negative fitted normal transformed p-values and the
negative original normal transformed p-values is shown in Figure 5.3. Again, the mean of
the third normal distribution (mean=3.302) is fixed.
Figure 5.3 Q-Q plot between fitted negative normal transformed p-values and negative
normal transformed original p-values
(axis label: Negative Fitted Normal Transformed P-values)
We can see the data fit the model very well except for the lower tail. We are not
interested in the lower tail, so we do not try to fit those data any better. Also, this
indicates that in this dataset the standard normal distribution cannot fit the null
distribution of p-values very well. This agrees with the earlier finding in Chapter IV
that the permutation p-values are all smaller than those expected asymptotically. At the
very upper tail, we can see the model overestimates the data a little for the top 98
SNPs. This indicates that the posterior probabilities for those SNPs are overestimated.
Also, the top 3 SNPs are underestimated, so their posterior probabilities are
underestimated. In this model, these SNPs with the most significant p-values have
extremely small posterior probabilities (<10^-10). The SNP with the largest posterior
probability is indicated by the red line in the plot. The following figure shows the
posterior probabilities for the top 577,700 SNPs.
Figure 5.4 Posterior Probabilities for top 577,700 SNPs
(axis label: Negative Normal Transformed P-value)
Figure 5.4 shows that the SNPs with the most significant p-values do not have the
highest posterior probabilities. The maximum posterior probability is 0.0950. The
evidence of association is not strong.
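To see why the most extreme p-values need not have the highest posterior probabilities, the fitted mixture can be combined with a null density by Bayes' rule. The sketch below rests on several assumptions that are not confirmed in this chapter: the null density of the transformed p-values is taken to be standard normal (the thesis also works with an empirical permutation null), the second argument 0.229 of each N(., .) is read as a standard deviation, and the posterior is taken to be the alternative share of the two-component density. All names are hypothetical.

```python
import math

PI1 = 0.00394                      # estimated proportion from the alternative
WEIGHTS = (0.824, 0.170, 0.00604)  # fitted mixture weights
MEANS = (1.677, 2.602, 3.302)      # fitted component means
SD = 0.229                         # assumption: common standard deviation

def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_alternative(x):
    """Posterior probability of the alternative at transformed p-value x."""
    f1 = sum(w * normal_pdf(x, m, SD) for w, m in zip(WEIGHTS, MEANS))
    f0 = normal_pdf(x)  # assumption: standard normal null
    return PI1 * f1 / ((1.0 - PI1) * f0 + PI1 * f1)
```

Under these assumptions the posterior peaks at roughly 0.095 for transformed values near 2.75 and falls to almost zero for values above about 4.5, because the narrow fitted components place essentially no mass that far out. This matches the qualitative pattern described for Figure 5.4.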
Because we believe that the SNPs with higher posterior probabilities are more likely
to be associated with the disease, we are interested in whether the model fits well for
the SNPs with the highest posterior probabilities. We show the Q-Q plot for this
area in Figure 5.5.
Figure 5.5 The Q-Q plot of the negative normal transformed alternative distribution
of p-values for the SNPs with the highest posterior probabilities
(axis label: Negative Fitted Normal Transformed P-values)
Figure 5.5 indicates that the model still fits the data well in this area.
Figure 5.6 The Q-Q plot of the negative normal transformed alternative distribution
of p-values and the negative normal transformed permutation p-values for the SNPs
with the highest posterior probabilities
Figure 5.6 is the Q-Q plot between the negative normal transformed original
p-values and permutation p-values in the highest posterior probability area. This figure
expands part of Figure 4.6.1, where no deviation from the straight line can be seen.
Here, however, it can be clearly seen that the plot deviates from the straight line. Table
5.3 lists the genes that include the SNPs with posterior probability > 0.09495.
Table 5.3 Genes that include the SNPs with posterior probability > 0.09495
CHR   Posterior Probability   Gene
2     0.0950                  DYSF
2     0.0950                  USP34
9     0.0950                  SLC31A1
1     0.0950                  COLGALT2
9     0.0950                  GLIS3
2     0.0950                  TMEM163
7     0.0950                  CCL26
3     0.0950                  ITIH4
1     0.0950                  LCE1E
3     0.0950                  PPARG
1     0.0950                  NGF
4     0.0950                  KCNIP4
2     0.0950                  VWC2L-IT1
2     0.0950                  MYO3B
10    0.0950                  PCDH15
12    0.0950                  LIMA1
10    0.0950                  H2AFY2
14    0.0950                  TTC7B
2     0.0950                  SGOL2
6     0.0950                  LY86
4     0.0950                  GALNT7
10    0.0950                  PRKG1
3     0.0950                  ROBO1
14    0.0950                  MDGA2
We should validate our method in another dataset, ideally one in which we might
expect similar results, but we do not have the complete data for another such dataset.
However, we have independent data for the 975 most significant SNPs, based on their
nominal p-values, derived from the same population but selected with less restrictive
criteria. The samples included individuals with either colon or rectal cancer: 89 with
stage IV colorectal cancer and 373 with stage I or II colorectal cancer. The number of
samples in this validation dataset is quite similar to that in the dataset we analyzed.
We searched the top 975 SNPs in this dataset, ordered by their nominal p-values.
Only the gene ROBO1 is validated. rs114143915 is the SNP in ROBO1 that has the
highest posterior probability in the MECC dataset. In the validation dataset there are
three SNPs in ROBO1 (rs7616854, rs77121699 and rs35456279), and they are all
among the top 975 SNPs. The relations between these three SNPs and rs114143915
in the MECC dataset are shown in Table 5.4, for Stage I/II and Stage IV separately.
Table 5.4 LD between pairs of SNPs in the discovery dataset
Stage IV
SNP pair                   Haplotype   Observed Frequency   Expected Frequency Under LE
rs7616854, rs114143915     GT          0.000                0.001
                           AT          0.023                0.022
                           GA          0.040                0.038
                           AA          0.937                0.939
                           R² = 0.001, D' = 1.000
rs114143915, rs77121699    GA          0.000                0.001
                           AA          0.022                0.022
                           GC          0.039                0.038
                           AC          0.938                0.939
                           R² = 0.001, D' = 1.000
rs114143915, rs35456279    GA          0.000                0.001
                           AA          0.028                0.027
                           GG          0.039                0.038
                           AG          0.933                0.934
                           R² = 0.001, D' = 1.000
Stage I/II
SNP pair                   Haplotype   Observed Frequency   Expected Frequency Under LE
rs7616854, rs114143915     GT          0.000                0.000
                           AT          0.011                0.011
                           GA          0.006                0.006
                           AA          0.983                0.983
                           R² = 0.000, D' = 1.000
rs114143915, rs77121699    GA          0.000                0.001
                           AA          0.011                0.011
                           GC          0.006                0.006
                           AC          0.983                0.983
                           R² = 0.000, D' = 1.000
rs114143915, rs35456279    GA          0.000                0.000
                           AA          0.015                0.015
                           GG          0.006                0.006
                           AG          0.979                0.979
                           R² = 0.000, D' = 1.000
In Table 5.4, D'=1 indicates that the LD between the pairs of SNPs is strong. R² is
low because R² takes account of allele frequencies and the minor allele frequencies for
those SNPs are quite small (around 1%). We summarize the association analysis results
in Table 5.5.
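The contrast between D' and R² noted for Table 5.4 can be reproduced from the four haplotype frequencies of a SNP pair. The sketch below uses the standard definitions of D, D' and r² for two biallelic loci; the function and argument names are hypothetical, and the example values are the observed Stage IV frequencies of the first SNP pair in Table 5.4.

```python
def ld_from_haplotypes(p_ab, p_aB, p_Ab, p_AB):
    """Return (D', r^2) from the four haplotype frequencies.

    a/A are the two alleles at the first SNP and b/B the two alleles at
    the second SNP; the four frequencies should sum to 1.
    """
    p_a = p_ab + p_aB          # frequency of allele a at the first SNP
    p_b = p_ab + p_Ab          # frequency of allele b at the second SNP
    d = p_ab - p_a * p_b       # raw disequilibrium coefficient

    if d < 0:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    else:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d_prime, r_squared

# Stage IV, first pair in Table 5.4: GT, GA, AT, AA
d_prime, r_squared = ld_from_haplotypes(0.000, 0.040, 0.023, 0.937)
```

Because one of the four haplotypes is never observed, D' reaches its maximum of 1, while r² stays near 0.001 since the two rare alleles almost never occur together and have unequal, small frequencies.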
Table 5.5 Results of association analysis in the two datasets
               Discovery                      Validation
SNP            Odds Ratio   One-sided         Odds Ratio   One-sided
                            P-value                        P-value
rs114143915    6.941        0.00302           --           --
rs7616854      2.147        0.134             1.460        0.239
rs77121699     2.025        0.153             1.363        0.249
rs35456279     1.865        0.151             1.411        0.227
CHAPTER VI
Areas for Future Research
In this dissertation, we have proposed an empirical method to estimate the
posterior probability that a SNP is associated with a disease. We found that the
Bonferroni cutoff is too conservative. We also provide results that make the
statistical analysis more meaningful to interpret.
6.1 Estimate of the Alternative Distribution
In most GWAS, the proportion of p-values corresponding to the alternative
distribution is likely extremely small, so it is very hard to obtain an accurate and stable
estimate of that proportion in the presence of a lot of random noise. Better estimates of
the alternative distribution need to be obtained.
6.2 Fitting the Alternative Distribution
In Figure 3.6.5, it is obvious that the mixture of three normal distributions still
cannot fit the data perfectly. We can consider using a polynomial to fit the cumulative
distributions. Monotonic polynomial splines may be a good choice.
6.3 Validation datasets
Although we know the number of SNPs that have positive probabilities of being associated
with the disease, these probabilities still cannot identify the loci that are associated
with the disease. If we had a complete validation dataset, we could verify our results by
comparing them with the results obtained from the validation dataset, also analyzed by
our method. Using FDR on increasingly large numbers of SNPs in the validation dataset
could determine a cut-off for the posterior probability that would control the FDR at a
specified level.
6.4 Consider all the SNPs together within a gene
The SNPs located within a gene are highly correlated, so we cannot simply multiply
their posterior probabilities to get a score for the gene. We could try to use the linkage
between them as a correlation matrix to obtain an overall posterior probability for a
particular gene.
Appendix 1
28 Removed Individuals Based on Their Relatedness
ID
10877
10669
12092
20933
13761
10197
22367
13500
11575
12244
10789
22048
11947
22801
21663
12913
21462
22359
11443
22684
21368
20327
10218
10450
10774
11648
12319
13201
Appendix 2
38 Removed Individuals Based on Their Relatedness
ID
10877
10669
12092
20933
13761
10197
22367
13500
11575
12244
10789
22048
11947
22801
21663
12913
21462
22359
11443
22684
21368
20327
10218
10450
10774
11648
12319
13201
11597 (possible twin, dup, sample mix up)
22395 (possible twin, dup, sample mix up)
12711
21418
23297
22717
11866
11583
Appendix 3
Number of SNPs with MAF<1% by Chromosome
Chromosome   MAF<0.01                MAF<0.01
             (28 samples removed)    (38 samples removed)
1            65871                   66035
2            68549                   68723
3            57213                   57370
4            52819                   52946
5            51293                   51440
6            51317                   51415
7            44674                   44788
8            42795                   42908
9            33591                   33671
10           40392                   40538
11           41230                   41310
12           38765                   38854
13           27726                   27809
14           25613                   25694
15           24804                   24869
16           25233                   25295
17           22251                   22313
18           22138                   22205
19           15396                   15440
20           18357                   18408
21           10009                   10032
22           9430                    9457
Appendix 4
Stage IV vs Stage I/II Top 50 SNPs
CHR
SNP
GeneSymbol
BP
A1
OR
SE
L95
U95
STAT
Pvalue
6
rs2024846
TBX18 | LOC100289423
85575780
A
2.606
0.2024
1.753
3.875
4.734
2.20E-06
4
kgp12071521
NR3C2 | LOC100287246
149743890
G
2.78
0.2267
1.783
4.335
4.511
6.45E-06
4
kgp7795337
NR3C2 | LOC100287246
149748994
G
2.78
0.2267
1.783
4.335
4.511
6.45E-06
10
rs11198164
EMX2 | RAB11FIP2
119632192
A
0.4175
0.1959
0.2844
0.6129
-4.46
8.21E-06
10
rs1343418
EMX2 | RAB11FIP2
119635950
A
0.4262
0.1946
0.2911
0.6242
4.382
1.18E-05
10
kgp11193494
EMX2 | RAB11FIP2
119632422
A
0.4265
0.1948
0.2912
0.6248
4.375
1.22E-05
8
kgp12211973
CSMD3 | TRPS1
114783818
A
5.362
0.3903
2.495
11.52
4.303
1.69E-05
8
kgp451648
CSMD3 | TRPS1
114825951
G
8.329
0.4972
3.143
22.07
4.263
2.02E-05
8
kgp8557890
CSMD3 | TRPS1
114823103
G
8.329
0.4972
3.143
22.07
4.263
2.02E-05
2
rs6705378
HDAC4
240193889
G
2.243
0.1918
1.54
3.267
4.211
2.54E-05
5
kgp7123738
CSNK1G3 | ZNF608
123132486
A
3.706
0.3132
2.006
6.847
4.182
2.89E-05
22
rs13058496
LOC440786 | LOC100128009
17398812
G
2.157
0.1844
1.503
3.096
4.168
3.08E-05
10
rs1936298
EMX2 | RAB11FIP2
119641986
G
0.4477
0.1929
0.3067
0.6535
4.165
3.11E-05
4
kgp431927
NR3C2 | LOC100287246
149768284
A
2.612
0.2314
1.66
4.111
4.149
3.34E-05
12
rs11068687
KSR2
118256497
A
2.317
0.2034
1.555
3.451
4.131
3.61E-05
4
rs7669737
FSTL5
162361217
G
2.122
0.1825
1.484
3.035
4.123
3.74E-05
6
rs1711970
LOC100288267 | FOXF2
1386737
A
0.4245
0.2082
0.2823
0.6384
4.115
3.86E-05
6
kgp7355317
TBX18 | LOC100289423
85489616
G
2.1
0.181
1.473
2.994
4.099
4.15E-05
6
rs215922
TBX18 | LOC100289423
85521035
A
2.1
0.181
1.473
2.994
4.099
4.15E-05
11
kgp1795795
SLC37A2
124941849
A
3.485
0.305
1.917
6.335
4.093
4.25E-05
18
kgp4082225
RBBP8 | CABLES1
20667776
G
0.4257
0.2094
0.2824
0.6418
4.078
4.55E-05
5
kgp11300235
CSNK1G3 | ZNF608
123074679
C
3.527
0.3093
1.923
6.466
4.074
4.61E-05
5
kgp5983199
ZFYVE16
79740970
A
0.3244
0.2763
0.1887
0.5575
4.074
4.62E-05
5
rs13188088
LOC644037 | ZFYVE16
79701839
G
0.3244
0.2763
0.1887
0.5575
4.074
4.62E-05
19
kgp4364718
LOC148145 | UQCRFS1
29578972
A
3.297
0.2936
1.854
5.862
4.063
4.85E-05
7
kgp2311215
MAD1L1
2126098
A
5.029
0.3999
2.296
11.01
4.039
5.37E-05
12
rs12367527
LRIG3 | SLC16A7
59398014
A
2.572
0.2341
1.626
4.069
4.036
5.44E-05
10
kgp12367322
ST8SIA6 | PTPLA
17523142
G
2.344
0.2112
1.549
3.545
4.033
5.51E-05
2
rs2203879
LOC100129278 | KLHL29
22186722
A
6.192
0.4535
2.546
15.06
4.021
5.80E-05
3
rs9832895
SCN5A
38661533
A
0.4673
0.1892
0.3225
0.6771
4.021
5.80E-05
17
kgp11676109
RPH3AL
193595
A
0.4066
0.225
0.2616
0.632
3.999
6.36E-05
10
rs1290099
ST8SIA6 | PTPLA
17521991
G
2.239
0.2036
1.502
3.338
3.959
7.52E-05
5
kgp906905
LOC644037 | ZFYVE16
79678639
A
0.3638
0.2555
0.2205
0.6004
3.957
7.59E-05
4
kgp1847853
NR3C2 | LOC100287246
149792668
A
2.484
0.23
1.583
3.899
3.956
7.62E-05
17
rs17728437
WSCD1 | AIPL1
6146504
A
3.94
0.3472
1.995
7.782
3.949
7.84E-05
4
rs308420
FGF2
123767943
A
3.94
0.3472
1.995
7.782
3.949
7.84E-05
4
kgp886364
SPATA18 | USP46
53108847
A
2.098
0.189
1.449
3.039
3.922
8.77E-05
97
1    kgp2494374   FAM5C | RGS18                191422804  A  4.232   0.3684  2.056   8.712   3.916  9.01E-05
1    rs6679533    FOXJ3 | RIMKLA               42835396   A  2.152   0.1957  1.466   3.158   3.915  9.05E-05
10   kgp10672875  CACNB2                       18440444   A  2.115   0.1914  1.453   3.077   3.912  9.14E-05
5    rs10063441   COL23A1                      177854235  G  3.403   0.3133  1.842   6.289   3.91   9.25E-05
6    kgp22803378  TBX18 | LOC100289423         85559189   G  2.695   0.2538  1.638   4.432   3.905  9.42E-05
6    kgp6346424   TBX18                        85473758   G  2.022   0.1805  1.419   2.88    3.901  9.59E-05
12   kgp10441124  ETNK1 | SOX5                 22952020   G  9.836   0.5869  3.113   31.07   3.895  9.81E-05
16   kgp9704179   LOC100288121 | LOC401859     74141762   G  2.761   0.261   1.655   4.604   3.892  9.96E-05
11   rs11236006   C2CD3                        73831032   A  2.307   0.2152  1.513   3.517   3.884  0.0001028
4    kgp757770    LOC100288304 | LOC100288373  182150118  G  2.044   0.1841  1.425   2.932   3.882  0.0001035
22   kgp2328053   LOC440786 | LOC100128009     17403936   C  1.983   0.1765  1.403   2.803   3.879  0.0001047
3    kgp8782223   CLSTN2                       139842896  A  3.137   0.2948  1.76    5.59    3.878  0.0001055
3    kgp985504    CLSTN2                       139841261  G  3.137   0.2948  1.76    5.59    3.878  0.0001055
Appendix 5
Posterior Probabilities for additional reported genes
CHR  Gene          Most Significant p-value  Posterior Probability
1    ACTN2         -                         -
3    AGTR1         3.92E-05                  0.728
12   ANO4          8.27E-03                  0.186
10   ARID5B        1.06E-03                  0.186
15   ATP8B4        6.86E-03                  0.185
11   BDNF          3.88E-03                  0.186
6    BEND3         5.43E-04                  0.286
16   BRD7          6.02E-05                  0.696
3    BTLA          5.26E-03                  0.186
12   C12orf74      9.39E-04                  0.183
9    C9orf85       1.35E-03                  0.186
3    CACNA2D3      3.51E-05                  0.733
12   CAND1         4.52E-03                  0.186
12   CCDC91        1.74E-03                  0.186
20   CD40          7.92E-04                  0.206
5    CDH12         1.56E-05                  0.756
16   CDH8          5.52E-03                  0.171
16   CNTNAP4       1.12E-03                  0.186
4    COL25A1       1.31E-03                  0.186
3    COL8A1        1.22E-02                  0.186
22   CSF2RB        4.02E-02                  0.108
1    DAB1          3.00E-03                  0.186
8    DCAF13        5.05E-03                  0.186
2    DCTN1         1.12E-04                  0.624
1    DNAH14        3.43E-02                  0.125
2    DNAJC27-AS1   9.55E-04                  0.186
9    DOCK8         5.34E-04                  0.289
21   DSCAM         3.43E-03                  0.185
3    EPHA6         3.03E-03                  0.186
2    ERBB4         2.48E-04                  0.475
2    ETAA1         5.88E-04                  0.267
6    EYS           3.86E-03                  0.185
10   FAM13C        9.52E-05                  0.647
2    FAM49A        1.49E-03                  0.186
5    FAM71B        3.05E-05                  0.740
10   FAS           6.32E-03                  0.186
6    FNDC1         6.52E-03                  0.186
6    GABBR1        2.47E-05                  0.747
1    GADD45A       1.50E-03                  0.186
18   GATA6         6.12E-04                  0.258
9    GNA14         3.74E-03                  0.186
6    GPLD1         5.93E-04                  0.265
13   GPR12         6.07E-04                  0.260
10   GPR158        4.78E-04                  0.316
17   GRB2          4.74E-04                  0.318
1    H3F3A         1.99E-04                  0.523
20   HAO1          2.31E-03                  0.186
16   HSD11B2       1.35E-02                  0.185
18   IER3IP1       4.27E-03                  0.185
6    IFNGR1        4.52E-06                  0.755
2    IL1R2         7.90E-04                  0.206
7    INHBA-AS1     1.21E-03                  0.183
7    IQUB          2.19E-06                  0.687
16   IRF8          1.98E-02                  0.171
22   ISX           3.52E-03                  0.182
3    ITGA9         3.73E-03                  0.184
1    KCNK1         9.43E-04                  0.186
10   KCNMA1        1.56E-03                  0.182
7    KEL           3.48E-03                  0.183
11   KIRREL3       2.69E-03                  0.186
12   KITLG         1.79E-03                  0.186
1    LGALS8        2.04E-04                  0.517
4    LNX1          1.22E-02                  0.186
12   LOC100128554  1.22E-03                  0.186
8    LOC100507632  7.84E-04                  0.208
20   LOC339568     1.77E-03                  0.183
1    LOC400752     6.17E-04                  0.256
2    LOC400940     2.55E-04                  0.468
2    LRP2          4.03E-03                  0.186
7    MACC1         -                         -
7    MACC1         5.01E-04                  0.304
20   MACROD2       1.13E-03                  0.186
6    MARCKS        8.33E-04                  0.197
5    MCC           3.45E-03                  0.184
5    MEGF10        6.97E-03                  0.185
11   MICALCL       2.93E-03                  0.186
13   MIR4703       1.86E-03                  0.184
11   MOB2          3.19E-03                  0.186
8    MTUS1         1.24E-03                  0.186
13   NBEA          1.00E-04                  0.640
1    NFIA          2.47E-03                  0.186
7    NPSR1         3.11E-04                  0.422
7    NPY           2.30E-03                  0.186
10   NRAP          3.49E-04                  0.393
14   NRXN3         1.15E-03                  0.186
2    NYAP2         1.56E-02                  0.182
11   OR52N4        3.24E-03                  0.186
11   OR9G4         2.39E-04                  0.483
15   OTUD7A        4.99E-03                  0.186
4    PDGFRA        8.93E-05                  0.655
15   PGBD4         4.37E-05                  0.721
1    PGM1          2.23E-05                  0.750
10   PLAU          5.09E-03                  0.186
14   PNN           6.46E-04                  0.246
7    PODXL         4.41E-05                  0.720
4    PRDM5         2.03E-04                  0.518
16   RBFOX1        2.09E-04                  0.512
1    REN           3.28E-03                  0.186
18   RIT2          1.67E-03                  0.185
3    ROBO2         5.82E-03                  0.186
1    RORC          3.82E-04                  0.371
7    SEMA3C        7.56E-03                  0.186
8    SLA           5.79E-04                  0.270
4    SNX25         8.29E-04                  0.197
6    SNX9          6.17E-03                  0.186
6    SOX4          9.71E-05                  0.644
3    STAC          7.45E-03                  0.183
2    STK39         1.18E-03                  0.181
8    STMN2         1.90E-04                  0.531
6    T             2.35E-03                  0.186
4    TEC           9.18E-03                  0.186
12   TMEM117       2.19E-03                  0.186
7    TSPAN12       9.41E-03                  0.184
16   WWOX          8.89E-05                  0.655
10   XPNPEP1       5.40E-04                  0.287
2    ZNF804A       4.26E-05                  0.723
References
Agresti, A. (1992). A Survey of Exact Inference for Contingency Tables. Statistical Science,
Vol. 7, No. 1. , pp. 131-153.
Barrett JC, F. B. (2005). Haploview: analysis and visualization of LD and haplotype maps.
Bioinformatics, 21(2):263-5.
Barrett, J., Fry, B., Maller, J., & Daly, M. (2005). Haploview: analysis and visualization of
LD and haplotype maps. Bioinformatics, 21(2):263-5.
Bateson, W. (1900). Problems of heredity as a subject for horticultural investigation.
Journal of Royal Horticultural Society, vol 25, 54-61.
Benjamini Y, H. Y. (1995). Controlling the false discovery rate: a practical and powerful
approach to multiple testing. J Roy Statist Soc Ser B (Methodological), 57:289300.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and
powerful approach to multiple testing. J Roy Statist Soc Ser B (Methodological),
57:289-300.
Benjamini, Y., & Hochberg, Y. (2000). On the Adaptive Control of the False Discovery
Rate in Multiple Testing with Independent Statistics. Journal of Educational and
Behavioral Statistics, Vol. 25, No. 1. (Spring, 2000), pp. 60-83.
102
Benjamini, Y., & Hochberg, Y. (2000). On the Adaptive Control of the False Discovery
Rate in Multiple Testing with Independent Statistics. Journal of Educational and
Behavioral Statistics, Vol. 25, No. 1. (Spring, 2000), pp. 60-83.
Bradley Efron, R. T. (2001). Empirical Bayes Analysis of a Microarray Experiment. the
American Statistical Association, Vol. 96, No. 456 (Dec., 2001), pp. 1151-1160.
Browning, B. (2008). PRESTO: rapid calculation of order statistic distributions and
multiple-testing adjusted P-values via permutation for one and two-stage genetic
association studies. BMC Bioinformatics, 9:309.
Cheverud, J. (2001). A simple correction for multiple comparisons in interval mapping
genome scans. Heredity, 87(1):52โ58.
Christopher R. Genovese, K. R. (2006). False discovery control with p-value weighting.
Biometrika, 93 (3): 509-524.
Churchil, l. G., & Doerge, R. (1994). Empirical Threshold Values for Quantitative Triat
Mapping. Genetics, 138(3):963โ971.
Churchill GA, D. R. (1994). Empirical Threshold Values for Quantitative Triat Mapping.
Genetics, 138(3):963โ971.
Conneely KN, B. M. (2007). So Many Correlated Tests, So Little Time! Rapid Adjustment
of P Values for Multiple Correlated Tests. American journal of human genetics,
81(6):1158โ1168.
103
Conneely, K., & Boehnke, M. (2007). So Many Correlated Tests, So Little Time! Rapid
Adjustment of P Values for Multiple Correlated Tests. American journal of human
genetics, 81(6):1158โ1168.
Consortium, W. T. (2007). Genomewide association study of 14,000 cases of seven
common diseases and 3,000 shared controls. Nature, 447, 661โ678.
Day, N. (1969). Estimating the components of a mixture of normal distributions.
Biometrika, 56, 463-474.
Edgington, E. S. (1995). Randomization Tests, 3rd ed. New York: Marcel Dekker.
Efron, B. (2001). Empirical Bayes Analysis of a Microarray Experiment. the American
Statistical Association, Vol. 96, No. 456 (Dec., 2001), pp. 1151-1160.
Elding, H. (2013). Refinement in Localization and Identification of Gene Regions
Associated with Crohn Disease. Am J Hum Genet, 92(1): 107โ113.
Fisher, R. A. (1935). Design of Experiments. Edinburgh: Oliver and Boyd.
Franke, A., McGovern, D., Barrett, J., Wang, K., Radford Smith, G., Ahmad, T., et al.
(2010). Genome-wide meta-analysis increases to 71 the number of confirmed
Crohnโs disease susceptibility loci. Nat. Genet, 42, 1118โ1125.
Gabriel, S. (2002). The structure of haplotype blocks in the human genome. Science,
296(5576):2225-9.
104
Galwey, N. (2009). A new measure of the effective number of tests, a practical tool for
comparing families of non-independent significance tests. Genetic Epidemiology,
33(7):559โ568.
Gao X, S. J. (2008). A multiple testing correction method for genetic association studies
using correlated single nucleotide polymorphisms. . Genetic Epidemiology,
32(4):361โ369.
Gao, X., Starmer, J., & Martin, E. (2008). A multiple testing correction method for
genetic association studies using correlated single nucleotide polymorphisms.
Genetic Epidemiology, 32(4):361โ369.
Genovese, C., Roede, K., & Wasserman, L. (2006). False discovery control with p-value
weighting. Biometrika, 93 (3): 509-524.
Glazier, A. (2002). Finding Genes That Underlie Complex Traits. Science, Vol. 298 no.
5602 pp. 2345-2349.
Golub, G. H., & Loan, C. F. (1983). Matrix Computations. Johns Hopkins Studies in
Mathematical Sciences.
Groene, J. D. (2006). Robo1/Robo4: differential expression of angiogenic markers in
colorectal cancer. Oncol Rep., 15(6):1437-43.
Han B, K. H. (2009). Rapid and accurate multiple testing correction and power
estimation for millions of correlated markers. PLoS Genetics, 5(4):e1000456.
105
Han, B. K. (2009). Rapid and accurate multiple testing correction and power estimation
for millions of correlated markers. PLoS Genet, 5(4), e1000456.
Han, B., Kang, H., & Eskin, E. (2009). Rapid and accurate multiple testing correction and
power estimation for millions of correlated markers. PLoS Genetics,
5(4):e1000456.
Haseman, J. D. and Kupper, L. L. (1979). Analysis of dichotomous response data from
certain toxicologic experiments. Biometrics, 35, 281-293.
Haseman, J., & Kupper, L. (1979). Analysis of dichotomous response data from certain
toxicologic experiments. Biometrics, 35, 281-293.
Hindorff, L. (2009). Potential etiologic and functional implications of genome-wide
association loci for human diseases and traits. Proc Natl Acad Sci U S A,
106(23):9362-7.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian
Journal of Statistics, 6 (2): 65โ70.
JM, C. (2001). A simple correction for multiple comparisons in interval mapping genome
scans. Heredity, 87(1):52โ58.
Joshua Millstein, D. V. (2013). Computationally efficient permutation-based confidence
interval estimation for tail-area FDR. Front Genet, 4: 179.
Jr, R. G. (1981). Simultaneous Statistical Inference. Springer Series in Statistics.
106
Karhunen, K. (1947). Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann.
Acad. Sci. Fennicae. Ser. A. I. Math.-Phys., 37: 1โ79.
Lewontin, R. (1964). The interaction of selection and linkage. I.General considerations;
heterotic models. Genetics, 49, 49โ67.
LEWONTIN, R. C. (1960). The evolutionary dynamics of complex polymorphism.
Evolution, 14: 458-472.
Lewontin, R. C., & Kojim, K. (1960). The evolutionary dynamics of complex
polymorphism. Evolution, 14: 458-472.
Li C, L. M. (2008). Prioritized subset analysis: improving power in genome-wide
association studies. Hum Hered, 65: 129โ141.
Li, C., Li, M., Lange, E., & Watanabe, R. (2008). Prioritized subset analysis: improving
power in genome-wide association studies. Hum Hered, 65: 129โ141.
Maniatis, N., Collins, A., Xu, C., McCarthy, L., Hewett, D., Tapper, W., et al. (2002). The
first linkage disequilibrium (LD) maps: delineation of hot and cold blocks by
diplotype analysis. Proc. Natl. Acad.Sci. USA, 99, 2228โ2233.
Manly, B. F. (1997). Randomization, Bootstrap and Monte Carlo Methods in Biology, 2nd
ed. London: Chapman and Hall.
Manolio, T. (2009). Finding the missing heritability of complex diseases. Nature, 461,
747-753.
107
McCarthy, M. I. (2008). Genome-wide association studies for complex traits: consensus,
uncertainty and challenges. Nature Reviews Genetics, 9, 356-369.
McCarthy, M. I. (2008). Genome-wide association studies for complex traits: consensus,
uncertainty and challenges. Nature Reviews Genetics, 9, 356-369.
McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models. Boca Raton: Chapman
and Hall/CRC.
McLachlan, G. J. (1988). Fitting mixture models to grouped and truncated data via the
EM algorithm. Biometrics, 44, 571-578.
Miller, R. G. (1981). Simultaneous Statistical Inference. Springer Series in Statistics.
Millstein, J., & Volfson, D. (2013). Computationally efficient permutation-based
confidence interval estimation for tail-area FDR. Front Genet, 4: 179.
Mosig, M. O. (2001). A whole genome scan for quantitative trait loci affecting milk
protein percentage in Israeli-Holstein Cattle, by means of selective milk DNA
pooling in a daughter design, using an adjusted false discovery rate criterion.
Genetics Society of America, 157: 1683โ1698.
Murdoch, D. (2008). P-Values are Random Variables. The American Statistician, 62, 242245.
108
N. E. Breslow and D. G. Clayton. (1993). Approximate Inference in Generalized Linear
Mixed Models. American Statistical Association, Vol. 88, No. 421 (Mar., 1993), pp.
9-25.
Nettleton, D., Hwang, J., Caldo, R., & Wise, R. (2006). Estimating the Number of True
Null Hypotheses From a Histogram of p-values. Journal of Agricultural, Biological,
and Environmental Statistics, Vol. 11, No. 3, 337-356.
NW, G. (2009). A new measure of the effective number of tests, a practical tool for
comparing families of non-independent significance tests. Genetic Epidemiology,
33(7):559โ568.
Nyholt, D. (2004). A simple correction for multiple testing for single-nucleotide
polymorphisms in linkage disequilibrium with each other. American Journal of
Human Genetics., 74(4):765โ769.
P. McCullagh, J. A. (1989). Generalized Linear Models. Boca Raton: Chapman and
Hall/CRC.
Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space.
Philosophical Magazine, 2 (11): 559โ572.
Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space.
Philosophical Magazine, 2 (11): 559โ572.
Perneger, T. (1998). What's wrong with Bonferroni adjustments. British Medical Journal,
316(7139):1236โ1238.
109
Pitman, E. J. (1937). Signi๏ฌcance tests which may be applied to samples from any
populations. III. The analysis of. Biometrika, 29, 322โ335.
Purcell, S. (2007). PLINK: a toolset for whole-genome association and population-based
linkage analysis. American Journal of Human Genetics, 81.
R, S. J. (2003). Statistical significance for genome-wide studies. . Proceedings of the
National Academy of Sciences, 100: 9440-9445.
Rao, J., & Scott, A. (1984). On chi-squared tests for multi-way tables with cell
proportions estimated from survey data. Annals of Statistics, 12, 46-60.
Rao, J.N.K., and Scott, A.J. (1984). On chi-squared tests for multi-way tables with cell
proportions estimated from survey data. Annals of Statistics, 12, 46-60.
Risch and Merikangas. (1996). The Future of Genetic Studies of Complex Human
Diseases. Science, 273:1516-17.
Robbins, R. B. (1918). Some applications of mathematics to breeding problems. Genetics,
3: 375โ389.
S.A.G.E. (2012). Statistical Analysis for Genetic Epidemiology. Retrieved from Release 6.3:
http://darwin.cwru.edu
SB, G. (2002). The structure of haplotype blocks in the human genome. Science,
296(5576):2225-9.
110
Schall, R. (1991). Estimation in generalized linear models with random effects.
Biometrika, 78 (4): 719-727.
Schreiber, S. R. (2005). Genetics of Crohn disease, an archetypal inflammatory barrier
disease. Nat Rev Genet, 6(5):376-88.
Scott, D. W. (1992). Multivariate Density Estimation. Theory, Practice and Visualization.
New York: Wiley.
Setakis, E., Stirnadel, H., & Balding, D. (2006). Logistic regression protects against
population structure in genetic association studies. Genome Res, 16, 290โ296.
Spencer, C., Su, Z., Donnelly, P., & Marchini, J. (2009). Designing genome-wide
association studies: sample size, power, imputation, and the choice of
genotyping chip. PLoS Genetics, 5(5):e1000477.
Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Statist. Soc. B, 64,
Part 3, pp. 479โ498.
TV, P. (1998). What's wrong with Bonferroni adjustments. British Medical Journal,
316(7139):1236โ1238.
Yang, J. (2011). GCTA: A Tool for Genome-wide Complex Trait Analysis. Am J Hum Genet,
88(1): 76โ82.
111