Nonlinear Tests for Genomewide Association Studies

Copyright © 2006 by the Genetics Society of America
DOI: 10.1534/genetics.106.060491
Jinying Zhao,* Li Jin†,‡ and Momiao Xiong*,†,1
*Human Genetics Center, University of Texas Health Science Center, Houston, Texas 77030, †Laboratory of Theoretical Systems
Biology, School of Life Science, Fudan University, Shanghai 200433, China and ‡CAS-MPG Partner Institute of
Computational Biology, SIBS, CAS, Shanghai 200031, China
Manuscript received May 8, 2006
Accepted for publication June 19, 2006
ABSTRACT
As millions of single-nucleotide polymorphisms (SNPs) have been identified and high-throughput
genotyping technologies have been rapidly developed, large-scale genomewide association studies are
soon within reach. However, because a genomewide association study involves a large number of SNPs, it is
nearly impossible to ensure a genomewide significance level of 0.05 using the available statistics;
the use of tagging SNPs alleviates the multiple-test problem, but not sufficiently. One
strategy to circumvent the multiple-test problem associated with genomewide association tests is to
develop novel test statistics with high power. In this report, we introduce several nonlinear tests, which are
based on nonlinear transformation of allele or haplotype frequencies. We investigate the power of the
nonlinear test statistics and demonstrate that under certain conditions, some nonlinear test statistics have
much higher power than the standard $\chi^2$-test statistic. Type I error rates of the nonlinear tests are
validated using simulation studies. We also show that a class of similarity measure-based test statistics is
based on the quadratic function of allele or haplotype frequencies, and thus they belong to nonlinear
tests. To evaluate their performance, the nonlinear test statistics are also applied to three real data sets.
Our study shows that nonlinear test statistics have great potential in association studies of complex
diseases.
WITH the imminent completion of the HapMap
Project providing a comprehensive catalog of common genetic variations in human populations (Altshuler
and Clark 2005) and rapid development of technologies enabling efficient and economical genotyping of
a large number of variants (Borsting et al. 2005), genomewide association studies will become practically
feasible in the near future. However, one obstacle that may
keep genomewide association studies from being realized
is statistical in nature. Considering
the adjustment for millions of statistical tests, a stringent P-value of $10^{-6}$–$10^{-7}$ has been suggested to ensure
a genomewide significance level of 0.05 (Freimer and
Sabatti 2004; Neale and Sham 2004; Wang et al.
2005). Although this problem can be alleviated by selecting and typing tag SNPs (Halldorsson et al. 2004;
Ahmadi et al. 2005), the effect of such a strategy on the
significance level is still limited. Therefore, developing
novel test statistics with high power requires immediate
consideration.
¹Corresponding author: Human Genetics Center, School of Public Health, University of Texas Health Science Center, 1200 Herman Pressler, Houston, TX 77030. E-mail: [email protected]
Genetics 174: 1529–1538 (November 2006)
The primary assumption for association studies is that a mutation (a disease allele) increases disease susceptibility. Under this assumption, one expects that the disease allele will occur more frequently in the affected individuals (cases) than in the unaffected ones (controls) (Pritchard and Donnelly 2001). The standard $\chi^2$-test for association studies is to identify the disease locus by comparing the differences in allele or haplotype frequencies between the affected and unaffected individuals. More precisely, the $\chi^2$-statistic is a quadratic form of the difference of allele or haplotype frequencies between the affected and unaffected individuals (Chapman and Wijsman 1998; Akey et al. 2001). A natural way to amplify differences in frequency is to conduct linear transformation of allele or haplotype frequencies in the currently used statistics for association studies. However, any statistics arising from linear transformation will not change the values of pretransformation statistics. We propose to use nonlinear transformations of allele or haplotype frequencies in cases ($P^A$) and in controls ($P$), i.e., $f(P^A)$ and $f(P)$, with the expectation that statistics based on the difference $|f(P^A) - f(P)|$ will be more powerful than those based on the difference $|P^A - P|$.
For example, the case–control differential may be enhanced with some nonlinear transformations of allele
or haplotype frequencies. Association tests with such
nonlinear transformation are referred to as nonlinear
association tests hereafter.
The main purpose of this report is to develop a
general statistical framework of nonlinear association
tests and to present several nonlinear test statistics for
association studies. To accomplish this, we first study the
properties of nonlinear transformations of allele or
haplotype frequencies in cases and controls. We then
study how to construct test statistics on the basis of the
nonlinear transformations of allele or haplotype frequencies and to derive asymptotic distributions of the
nonlinear test statistics under null and alternative hypotheses. Alternative to comparing differences in allele
or haplotype frequencies, a recently developed class
of association tests compares similarities of a genome
region between affected and unaffected individuals
(Tzeng et al. 2003; Zhang et al. 2003). Under the general statistical framework for nonlinear association tests,
we show that many similarity measure-based test statistics are nonlinear association tests with quadratic transformation of allele or haplotype frequencies. Thus, we
can unify the allele or haplotype frequency-based association tests and similarity measure-based association
tests. Since different nonlinear tests may have different
power, selection of nonlinear statistics is critical to a
successful application of nonlinear tests to association
studies. We compare the power of several nonlinear test
statistics and uncover the relationship between the
power of the nonlinear test statistics and the strength
of nonlinearity used in the test statistics (Bates and
Watts 1980). To demonstrate that amplification of the
differences in allele or haplotype frequencies by nonlinear test statistics will not cause false positive problems, we study the type I error rates of the nonlinear test
statistics by simulations. Finally, to evaluate the performance of the nonlinear test statistics for association
studies, the presented nonlinear test statistics are applied to three real data examples.
METHODS
Nonlinear transformations of allele or haplotype
frequencies: The principle behind the standard $\chi^2$-test
in case–control studies is to compare the difference
in allele or haplotype frequencies between cases and
controls. We expect that amplifying such a difference
may improve the power to detect disease susceptibility
genes. One strategy to amplify the difference is to nonlinearly transform the frequencies. The difference in
the values of nonlinear function of allele or haplotype
frequencies between cases and controls should be larger
than the difference in original allele or haplotype frequencies. Therefore, our goal is to search for nonlinear
transformations that meet this requirement. To achieve
this goal, we first investigate the factors that would affect
the difference in values of nonlinear function of allele
or haplotype frequencies between the two populations.
For convenience of presentation, we study only haplotypes. The results can be adapted easily for the alleles.
Consider two alleles $D$ and $d$ at the disease locus. Let $D$ denote the disease allele and $f_{11}$, $f_{12}$, and $f_{22}$ be the penetrance of genotypes $DD$, $Dd$, and $dd$, respectively. Let $P(A) = P_D^2 f_{11} + 2 P_D P_d f_{12} + P_d^2 f_{22}$ be the prevalence of disease. Define

$$a_1 = \frac{P_D f_{11} + P_d f_{12}}{P(A)} \quad \text{and} \quad a_2 = \frac{P_D f_{12} + P_d f_{22}}{P(A)},$$

where $P_D$ and $P_d$ are the frequencies of alleles $D$ and $d$, respectively. Suppose that $K$ marker loci span $m$ haplotypes $H_i$ $(i = 1, \ldots, m)$. Let $\delta_{H_i D}$ and $\delta_{H_i d}$ be the overall measures of linkage disequilibrium (LD) between haplotype $H_i$ and disease allele $D$ and allele $d$, respectively, and define

$$\delta_{H_i D} = P_{H_i D} - P_{H_i} P_D \quad \text{and} \quad \delta_{H_i d} = P_{H_i d} - P_{H_i} P_d,$$

where $P_{H_i D}$ and $P_{H_i}$ are the frequencies of the haplotypes $H_i D$ and $H_i$, respectively (Xiong et al. 2003). It is known that $P^A_{H_i} = P_{H_i} + (a_1 - a_2)\delta_{H_i D} = P_{H_i} + \varepsilon\,\delta_{H_i D}$, where $P^A_{H_i}$ and $P_{H_i}$ are the frequencies of the haplotype $H_i$ in the cases and controls, respectively, and $\varepsilon = a_1 - a_2$ (Zhao et al. 2005). Let $f(P_{H_i})$ be a nonlinear function of the haplotype frequency $P_{H_i}$. We now calculate the difference between the nonlinear transformation of the haplotype frequency in the affected individuals $P^A_{H_i}$ and the nonlinear transformation of the haplotype frequency in the general population $P_{H_i}$. By Taylor's expansion, we can obtain

$$d_0 = f(P^A_{H_i}) - f(P_{H_i}) \approx \varepsilon f'(P_{H_i})\,\delta_{H_i D} + \frac{\varepsilon^2}{2} f''(P_{H_i})\,\delta^2_{H_i D},$$

where $f'(P_{H_i})$ and $f''(P_{H_i})$ are the first and second derivatives of the function $f(P_{H_i})$ with respect to $P_{H_i}$. This equation still holds if the haplotype frequencies are replaced by allele frequencies.
From the above equation, the difference between the nonlinear functions of the frequencies in cases and controls depends on the first and second derivatives of the function $f(P_{H_i})$ with respect to $P_{H_i}$ as well as the overall measure of the LD between the haplotype $H_i$ and the disease allele $D$. If $|f'(P_{H_i}) + \varepsilon f''(P_{H_i})\,\delta_{H_i D}/2| > 1$, then we have $|d_0| > |P^A_{H_i} - P_{H_i}|$, which implies that the absolute value of the difference in nonlinear functions of the haplotype frequencies between cases and controls is larger than that of the original frequency difference under this condition.
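The amplification condition above can be checked numerically. The following is a minimal sketch, not the authors' code: it evaluates the exact difference, its second-order Taylor approximation, and the amplification factor for the four transformations of Table 1. The function name `amplification`, the example values of the frequency, scalar, and LD measure, and the sign convention for the entropy function (taken here as -x log x) are assumptions made for illustration.

```python
import math

# Transformations as (f, f', f'') triples.
# Assumption: the entropy function is taken as -x log x.
TRANSFORMS = {
    "entropy": (lambda x: -x * math.log(x),
                lambda x: -1.0 - math.log(x),
                lambda x: -1.0 / x),
    "exponential": (math.exp, math.exp, math.exp),
    "quadratic": (lambda x: x * x + x + 1.0,
                  lambda x: 2.0 * x + 1.0,
                  lambda x: 2.0),
    "reciprocal": (lambda x: 1.0 / x,
                   lambda x: -1.0 / x**2,
                   lambda x: 2.0 / x**3),
}

def amplification(name, p, eps, delta):
    """Return (exact d0, Taylor approximation of d0, amplification factor)."""
    f, d1, d2 = TRANSFORMS[name]
    p_case = p + eps * delta                     # case frequency: P + eps * delta
    exact = f(p_case) - f(p)                     # d0 computed directly
    taylor = eps * d1(p) * delta + 0.5 * eps**2 * d2(p) * delta**2
    factor = abs(d1(p) + 0.5 * eps * d2(p) * delta)   # condition: factor > 1
    return exact, taylor, factor

# Hypothetical inputs: control frequency 0.3, eps = 1, LD measure 0.05.
for name in TRANSFORMS:
    exact, taylor, factor = amplification(name, p=0.3, eps=1.0, delta=0.05)
    print(f"{name:12s} d0={exact:+.5f} taylor={taylor:+.5f} amplifies={factor > 1}")
```

A factor below 1 only says that the raw difference shrinks; it does not by itself settle the power of the resulting test, because the test statistic also rescales the variance through the derivative of the transformation.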
Test statistics: Assume that $n_A$ affected individuals and $n_G$ unaffected individuals are sampled. Let $\hat{P}^A_{H_i}$ and $\hat{P}_{H_i}$ be the estimators of frequencies of haplotype $H_i$ in cases and controls, respectively. The allele or haplotype frequencies are asymptotically distributed as multivariate normal distributions $N(P^A, (1/2n_A)\Sigma^A)$ and $N(P, (1/2n_G)\Sigma)$, respectively, where $P^A = [P^A_{H_1}, \ldots, P^A_{H_m}]^T$, $P = [P_{H_1}, \ldots, P_{H_m}]^T$, $\Sigma^A = \mathrm{diag}(P^A_{H_1}, \ldots, P^A_{H_m}) - P^A (P^A)^T$, and $\Sigma = \mathrm{diag}(P_{H_1}, \ldots, P_{H_m}) - P P^T$.
Let $f(x)$ be a continuously differentiable nonlinear function with a nonzero differential at $x$. Let $X_j = f(\hat{P}^A_{H_j})$ for $j = 1, \ldots, m$, $X = [X_1, \ldots, X_m]^T$, $Y_j = f(\hat{P}_{H_j})$, and $Y = [Y_1, \ldots, Y_m]^T$. Then, the random vectors $X$ and $Y$ are asymptotically distributed as multivariate normal distributions $N(f(P^A), (1/2n_A) B \Sigma^A B^T)$ and $N(f(P), (1/2n_G) C \Sigma C^T)$, respectively (Serfling 1980), where $b_{ii} = \partial f(P^A_{H_i})/\partial P^A_{H_i}$, $b_{ij} = 0$ $(i \neq j)$, $c_{ii} = \partial f(P_{H_i})/\partial P_{H_i}$, $c_{ij} = 0$ $(i \neq j)$, $B = (b_{ij})_{m \times m}$, and $C = (c_{ij})_{m \times m}$.
Define the matrix

$$\Lambda = \frac{1}{2n_A} B \Sigma^A B^T + \frac{1}{2n_G} C \Sigma C^T.$$

Let $\hat{\Lambda}$ be an estimator of the matrix $\Lambda$. We propose the test statistic $T_N$ to test the association of the alleles or haplotypes with disease,

$$T_N = (X - Y)^T \hat{\Lambda}^{-} (X - Y),$$

where $\hat{\Lambda}^{-}$ is the generalized inverse of matrix $\hat{\Lambda}$. The null hypothesis is that there is no association of alleles or haplotypes with the disease; i.e., $H_0: P^A = P$. Let $r = \mathrm{rank}(\hat{\Lambda})$. Under the null hypothesis, $T_N$ is asymptotically distributed as a central $\chi^2$ with $r$ degrees of freedom (Greenwood and Nikulin 1996; Serfling 1980). The test statistic $T_N$ defines a class of nonlinear tests. Various nonlinear functions with some regularity can be used to construct the test statistic. Table 1 lists some of the nonlinear functions used in this study and their corresponding derivatives.

TABLE 1
Some of the nonlinear transformations for allele or haplotype frequencies

Transformation   Function            Derivative
Entropy          $-x \log x$         $-1 - \log x$
Exponential      $e^x$               $e^x$
Quadratic        $x^2 + x + 1$       $2x + 1$
Reciprocal       $1/x$               $-1/x^2$

Similarity measure-based statistics are special cases of the nonlinear tests: We often observe that affected individuals share common haplotypes in the region surrounding disease mutations more often than unaffected individuals (Fan and Lange 1998; Jorde 2000). There are two ways to quantify the excessive sharing of common haplotypes among affected individuals. One way is to measure differences in allele or haplotype frequencies between affected and unaffected individuals (Akey et al. 2001). Another way is to measure differences in similarity of the genome region between affected and unaffected individuals (Bourgain et al. 2001; Tzeng et al. 2003). In appendix b, we show that the similarity measure of the genome region is a quadratic function of allele or haplotype frequencies. Therefore, similarity measure-based statistics are nonlinear test statistics.
Analytic formulas for power calculation of the nonlinear tests: To evaluate the performance of the nonlinear test for association studies, we need to calculate its power. The alternative hypothesis is that there is at least one allele or haplotype associated with the disease; i.e., $H_a: P^A \neq P$. Under the alternative hypothesis, the test statistic $T_N$ is asymptotically distributed as a noncentral $\chi^2_{(r)}$ with noncentrality parameter $\lambda_N$, where $\lambda_N = [f(P^A) - f(P)]^T \Lambda^{-1} [f(P^A) - f(P)]$, $r = \mathrm{rank}(\Lambda)$, $f(P^A) = [f(P^A_{H_1}), \ldots, f(P^A_{H_m})]^T$, $f(P) = [f(P_{H_1}), \ldots, f(P_{H_m})]^T$, and $\Lambda$, $\Sigma^A$, $\Sigma$, $P^A$, $P$, $B$, and $C$ are as defined above.
The noncentrality parameter $\lambda_N$ can be approximated by

$$\lambda_N \approx \varepsilon^2\, \delta_{HD}^T \left(I + \tfrac{1}{2} S\right)^T \left[\frac{1}{2n_A}(I + S)\,\Sigma^A (I + S) + \frac{1}{2n_G}\Sigma\right]^{-1} \left(I + \tfrac{1}{2} S\right) \delta_{HD}$$

(appendix c), where

$$\varepsilon = \frac{P_D (f_{11} - f_{12}) + P_d (f_{12} - f_{22})}{P(A)}, \qquad \delta_{HD} = [\delta_{H_1 D}, \ldots, \delta_{H_m D}]^T,$$

and

$$S = C^{-1} H_{(P^A - P)} = \mathrm{diag}\left(\frac{\varepsilon f''(P_{H_1})\,\delta_{H_1 D}}{f'(P_{H_1})}, \ldots, \frac{\varepsilon f''(P_{H_m})\,\delta_{H_m D}}{f'(P_{H_m})}\right).$$

The matrix $S$ measures the strength of the nonlinearity of the nonlinear transformation $f(P)$ (appendix c). Note that under the same alternative hypothesis, the traditional $\chi^2$-test statistic, which is defined as

$$T = (\hat{P}^A - \hat{P})^T \hat{\Lambda}_0^{-1} (\hat{P}^A - \hat{P}), \qquad \hat{\Lambda}_0 = \frac{1}{2n_A}\hat{\Sigma}^A + \frac{1}{2n_G}\hat{\Sigma},$$

follows a noncentral $\chi^2_{(r)}$-distribution with the noncentrality parameter

$$\lambda \approx \varepsilon^2\, \delta_{HD}^T \left[\frac{1}{2n_A}\Sigma^A + \frac{1}{2n_G}\Sigma\right]^{-1} \delta_{HD}.$$

Comparing the noncentrality parameters $\lambda_N$ and $\lambda$, we can see that the noncentrality parameter $\lambda_N$ involves one more term $S$ than the noncentrality parameter $\lambda$. The matrix $S$ characterizes the nonlinearity of the nonlinear function. The power of the nonlinear test statistics depends on the strength of the nonlinearity of the
nonlinear function through the matrix $S$. The matrix $S$ is referred to as the strength matrix of the nonlinearity of the nonlinear function.

TABLE 2
Estimated type I error rates for the nonlinear test statistics (10,000 simulations)

Sample size   Entropy   Exponential   Quadratic   Reciprocal
Two-SNP haplotypes ($\alpha = 0.05$)
100           0.0460    0.0514        0.0548      0.0544
200           0.0510    0.0508        0.0546      0.0544
300           0.0560    0.0486        0.0532      0.0490
400           0.0570    0.0476        0.0508      0.0538
500           0.0540    0.0500        0.0496      0.0524
Six-SNP haplotypes ($\alpha = 0.05$)
100           0.0450    0.0530        0.0544      0.0522
200           0.0490    0.0508        0.0476      0.0490
300           0.0502    0.0478        0.0518      0.0508
400           0.0508    0.0488        0.0476      0.0508
500           0.0500    0.0498        0.0462      0.0512

If the product terms of the haplotype frequencies in the variance–covariance matrices $\Sigma^A$ and $\Sigma$ are ignored, the matrices $\Sigma^A$ and $\Sigma$ can be approximated by $\Sigma^A = \mathrm{diag}(P^A_{H_1}, \ldots, P^A_{H_m})$ and $\Sigma = \mathrm{diag}(P_{H_1}, \ldots, P_{H_m})$. Then the noncentrality parameters $\lambda_N$ and $\lambda$ will be further reduced to

$$\lambda_N \approx \varepsilon^2 \sum_{i=1}^{m} \frac{\delta^2_{H_i D}\,\bigl(1 + (\varepsilon \pi_i/2)\,\delta_{H_i D}\bigr)^2}{(1/2n_A)\bigl(1 + (\varepsilon \pi_i \delta_{H_i D}/2)\bigr)^2 P^A_{H_i} + (1/2n_G) P_{H_i}},$$

$$\lambda \approx \varepsilon^2 \sum_{i=1}^{m} \frac{\delta^2_{H_i D}}{(1/2n_A) P^A_{H_i} + (1/2n_G) P_{H_i}},$$

where $\pi_i = f''(P_{H_i})/f'(P_{H_i})$. The parameter $\pi_i$ is proportional to the curvature of a nonlinear function (Bates and Watts 1980) and influences the noncentrality parameter $\lambda_N$.
From the above formulas, we can see that both noncentrality parameters $\lambda$ and $\lambda_N$ depend on the frequencies of the alleles or haplotypes, the penetrance, the measure of the LD between the marker alleles or haplotypes and the disease allele, and the sample size. In addition, the noncentrality parameter of the nonlinear test, $\lambda_N$, also depends on the curvature, which measures the degree of nonlinearity of the nonlinear function.
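The two reduced formulas can be transcribed directly into code. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the function name `noncentrality_approx`, its argument names, and the example inputs are hypothetical, and the case frequency is computed as the control frequency plus the scalar times the LD measure, following the relation given earlier in METHODS.

```python
def noncentrality_approx(p, delta, eps, curvature, n_cases, n_controls):
    """Return (lambda_N, lambda) from the diagonal approximations in the text.

    p          -- control haplotype frequencies, one per haplotype
    delta      -- LD measure between each haplotype and the disease allele
    eps        -- the scalar a1 - a2
    curvature  -- f''(P) / f'(P) evaluated at each control frequency
    """
    lam_n = 0.0
    lam = 0.0
    for p_i, d_i, pi_i in zip(p, delta, curvature):
        p_case = p_i + eps * d_i                      # case frequency of haplotype i
        scale = (1.0 + 0.5 * eps * pi_i * d_i) ** 2   # nonlinearity factor, as printed
        lam_n += d_i**2 * scale / (scale * p_case / (2 * n_cases)
                                   + p_i / (2 * n_controls))
        lam += d_i**2 / (p_case / (2 * n_cases) + p_i / (2 * n_controls))
    return eps**2 * lam_n, eps**2 * lam

# Hypothetical example: two haplotypes, quadratic transformation x^2 + x + 1,
# for which f''/f' = 2 / (2x + 1).
p = [0.3, 0.7]
delta = [0.05, -0.05]
curv = [2.0 / (2.0 * x + 1.0) for x in p]
lam_n, lam = noncentrality_approx(p, delta, 0.5, curv, 200, 200)
print(lam_n, lam)
```

When every curvature term is zero the two values coincide, which matches the observation that a linear transformation leaves the statistic unchanged.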
RESULTS
Distribution of the nonlinear test statistics: In the
previous sections, we have shown that when the sample
size is large enough to apply large sample theory, the
nonlinear test statistics under the null hypothesis of
no association are asymptotically distributed as a central
$\chi^2$-distribution. To examine the validity of this statement, we performed a series of simulation studies. The
computer program SNaP (Nothnagel 2002) was used
to generate haplotypes of the sample individuals. Two
data sets with a single haplotype block each were simulated. The first data set has two marker loci that generated four haplotypes with frequencies 0.2952, 0.2562,
0.1957, and 0.2529. The second data set has six marker
loci that generated eight haplotypes with frequencies
0.1820, 0.1461, 0.1406, 0.1291, 0.1211, 0.1107, 0.0817,
and 0.0887. For each data set, 20,000 individuals who
were equally divided into cases and controls were generated in the general population.
To examine whether the asymptotic results of the
nonlinear test statistics still hold for small sample size
under the null hypothesis of no association, 100–500
individuals were randomly sampled from each of the
cases and controls. Ten thousand simulations were repeated for each of the nonlinear test statistics. In each
simulation, the nonlinear test statistics were calculated.
Table 2 shows that the estimated type I error rates (at the
significance level 0.05) of the nonlinear test statistics
were not appreciably different from the nominal level
$\alpha = 0.05$.
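The logic of such a type I error check can be sketched in a few lines. This is a simplified illustration, not the procedure used in the study: it does not use SNaP, it draws independent binomial allele counts for cases and controls at a single biallelic marker, and it uses a one-dimensional version of the statistic that transforms only one allele's frequency, so the variance matrix is a scalar and no generalized inverse is needed. The names `scalar_tn` and `type1_error`, the seed, and the example settings are assumptions.

```python
import random

CHI2_1DF_CRIT_05 = 3.841458820694124   # upper 5% point of chi-square, 1 d.f.

def scalar_tn(count_case, count_ctrl, n_alleles, f, fprime):
    """One-dimensional nonlinear statistic with a delta-method variance."""
    pa = count_case / n_alleles
    p = count_ctrl / n_alleles
    var = (fprime(pa) ** 2 * pa * (1 - pa)
           + fprime(p) ** 2 * p * (1 - p)) / n_alleles
    if var == 0.0:
        return 0.0                      # degenerate sample: treat as no evidence
    return (f(pa) - f(p)) ** 2 / var

def type1_error(freq, n_individuals, n_reps, f, fprime, seed=1):
    """Fraction of null replicates rejected at the nominal 0.05 level."""
    rng = random.Random(seed)
    n_alleles = 2 * n_individuals       # two alleles per person, equal group sizes
    rejections = 0
    for _ in range(n_reps):
        case = sum(rng.random() < freq for _ in range(n_alleles))
        ctrl = sum(rng.random() < freq for _ in range(n_alleles))
        if scalar_tn(case, ctrl, n_alleles, f, fprime) > CHI2_1DF_CRIT_05:
            rejections += 1
    return rejections / n_reps

# Quadratic transformation, allele frequency 0.3, 200 cases and 200 controls.
rate = type1_error(0.3, 200, 1000, lambda x: x * x + x + 1, lambda x: 2 * x + 1)
print(rate)
```

With only 1000 replicates the estimate carries a standard error of roughly 0.007, so it should be read as a rough check rather than a reproduction of the 10,000-replicate results.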
Power of nonlinear test statistics and standard $\chi^2$-test
statistic: Power of a test statistic for association studies
depends on the allele or haplotype frequencies at the
marker loci and the frequency of the disease allele,
measure of LD between the alleles or haplotypes at the
marker loci and the disease allele, sample size, the disease model, and the measure of nonlinearity of the
nonlinear function. To evaluate the performance of
nonlinear tests, we compare the power of several nonlinear test statistics with that of the standard $\chi^2$-test
statistic by both analytical method and simulation. The
results are very similar. In this report, we present only
the power calculation by analytical method.
We first investigate the expected noncentrality parameters of nonlinear test statistics at the disease locus.
We assume that frequencies of two alleles at the disease
locus in controls are both equal to 0.5. Figure 1 plots the expected noncentrality parameters of the nonlinear test statistics and the standard $\chi^2$-test statistic as a function of the frequency of the disease allele in cases. From Figure 1 we can see three remarkable features. First, the expected noncentrality parameters of all test statistics increase as the difference in frequency of the disease allele between cases and controls increases. Second, except for the reciprocal-based statistic, which uses the reciprocal function as the nonlinear transformation of allele/haplotype frequencies, expected noncentrality parameters for all the other nonlinear test statistics are larger than that of the standard $\chi^2$-test statistic. Third, except for the reciprocal-based statistic, expected noncentrality parameters for all the other nonlinear test statistics are almost indistinguishable.

Figure 1.—Expected noncentrality parameters of the nonlinear test statistics and the standard $\chi^2$-test statistic as a function of the frequency of the disease allele in cases, assuming that the frequencies of two alleles at the disease locus in the controls are both equal to 0.5.
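For a biallelic locus the noncentrality parameters defined in METHODS can be evaluated in closed form, which gives a way to explore the comparison described above. The sketch below is a minimal illustration under assumptions: the frequency vector is taken as the two allele frequencies, the sample size of 100 per group is chosen arbitrarily, and the function names `lambda_nonlinear` and `lambda_standard` are hypothetical. The 2 x 2 matrix is inverted explicitly, which requires the case and control frequencies to differ.

```python
def lambda_nonlinear(p_case, p_ctrl, n_cases, n_controls, f, fprime):
    """Noncentrality of the nonlinear test for alleles D and d (2 x 2 case)."""
    qa, q = 1.0 - p_case, 1.0 - p_ctrl
    v = (f(p_case) - f(p_ctrl), f(qa) - f(q))      # f(P^A) - f(P)
    # The transformed multinomial covariance equals pA*qA * b b', with
    # b = (f'(pA), -f'(qA)); the control term has the same form with c.
    b = (fprime(p_case), -fprime(qa))
    c = (fprime(p_ctrl), -fprime(q))
    sa = p_case * qa / (2 * n_cases)
    sg = p_ctrl * q / (2 * n_controls)
    l11 = sa * b[0] * b[0] + sg * c[0] * c[0]
    l12 = sa * b[0] * b[1] + sg * c[0] * c[1]
    l22 = sa * b[1] * b[1] + sg * c[1] * c[1]
    det = l11 * l22 - l12 * l12
    if abs(det) < 1e-18:
        raise ValueError("variance matrix is singular; frequencies too close")
    # Quadratic form v' L^{-1} v using the explicit 2 x 2 inverse.
    return (l22 * v[0] ** 2 - 2 * l12 * v[0] * v[1] + l11 * v[1] ** 2) / det

def lambda_standard(p_case, p_ctrl, n_cases, n_controls):
    """Noncentrality of the standard test (its variance matrix has rank 1)."""
    var = (p_case * (1 - p_case) / (2 * n_cases)
           + p_ctrl * (1 - p_ctrl) / (2 * n_controls))
    return (p_case - p_ctrl) ** 2 / var

quad = (lambda x: x * x + x + 1.0, lambda x: 2.0 * x + 1.0)
for p_case in (0.6, 0.7, 0.8):
    print(p_case,
          lambda_standard(p_case, 0.5, 100, 100),
          lambda_nonlinear(p_case, 0.5, 100, 100, *quad))
```

Other transformations can be compared by passing a different function and derivative pair in place of `quad`.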
We then investigate the power of nonlinear test statistics at the disease locus. Figure 2 plots the power of the nonlinear test statistics and the standard $\chi^2$-test statistic as a function of disease allele frequency under three different disease models: (i) disease model with penetrance $f_{11} = 1$, $f_{12} = 0.2$, and $f_{22} = 0.1$; (ii) disease model with penetrance $f_{11} = 1$, $f_{12} = 1$, and $f_{22} = 0.1$; and (iii) genotype relative risk model for $r = 4$, in which the genotype relative risks for genotypes $Dd$ and $DD$ are $r$ and $r^2$ times greater than that for the genotype $dd$ (Risch and Merikangas 1996). Several features emerge from Figure 2. First, power for most of the nonlinear test statistics is higher than that of the standard $\chi^2$-test statistic, but power of the reciprocal-based test statistic is lower than that of the standard $\chi^2$-test statistic. The power curves of the exponential and quadratic functions are similar. Second, power of the nonlinear test statistics is influenced by disease models. Shapes of the power curves in disease model ii are different from those in disease models i and iii. Third, power of the test statistics depends on disease allele frequency. Shapes of the power curves in disease models i and iii are roughly bell shaped; however, shapes of the power curves in disease model ii are skewed to the left.
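The inputs to such power curves follow directly from the penetrances. The sketch below, offered as an illustration under assumptions, computes the disease allele frequency in cases for the three disease models and the power of the standard single-locus test with 1 degree of freedom. The control frequency is approximated by the population frequency, model iii is expressed relative to the baseline penetrance (which cancels in the case frequency), and the names `MODELS`, `case_allele_frequency`, and `power_standard` are hypothetical. The nonlinear tests are not covered here.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Penetrances (f11, f12, f22). Model iii is given relative to f22 with r = 4.
MODELS = {"i": (1.0, 0.2, 0.1), "ii": (1.0, 1.0, 0.1), "iii": (16.0, 4.0, 1.0)}

def case_allele_frequency(p_d, f11, f12, f22):
    """Frequency of disease allele D among cases, equal to P_D * a1."""
    q = 1.0 - p_d
    prevalence = p_d**2 * f11 + 2 * p_d * q * f12 + q**2 * f22   # P(A)
    a1 = (p_d * f11 + q * f12) / prevalence
    return p_d * a1

def power_standard(p_d, model, n=100, alpha=0.001):
    """Power of the standard 1-d.f. test with n cases and n controls."""
    p_case = case_allele_frequency(p_d, *MODELS[model])
    var = (p_case * (1 - p_case) + p_d * (1 - p_d)) / (2 * n)
    lam = (p_case - p_d) ** 2 / var                  # noncentrality parameter
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)            # square root of critical value
    root = lam ** 0.5
    # A noncentral chi-square with 1 d.f. is the square of N(sqrt(lam), 1).
    return STD_NORMAL.cdf(root - z) + STD_NORMAL.cdf(-root - z)

for model in MODELS:
    curve = [round(power_standard(p / 10, model), 3) for p in range(1, 10)]
    print(model, curve)
```

The sample size and significance level default to the values stated in the Figure 2 legend (n = 100, significance level 0.001).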
Real data examples: Nonlinear test statistics are
also applied to three real examples. The first example
is a test of association of COMT haplotypes with
schizophrenia. P-values of the nonlinear tests for testing
associations of two-SNP haplotypes (generated from two
SNP markers) and three-SNP haplotypes (generated
from three SNP markers) with schizophrenia are presented in Table 3. Table 3 also includes P-values of the
standard $\chi^2$-tests by Shifman et al. (2002). Improvement
of the nonlinear tests over the standard $\chi^2$-test varies
among nonlinear tests and among haplotypes. The
quadratic-based test has the largest improvement over
the standard $\chi^2$-test when it is applied to three-SNP
haplotypes. The P-value of the quadratic-based test is
$4.0 \times 10^{-14}$, which is much smaller than the $4.5 \times 10^{-4}$
obtained by the standard $\chi^2$-test.
The second example is a test of association of functional haplotypes in the promoter of the matrix metalloproteinase-2 (MMP-2) gene with esophageal cancer in
the Chinese Han population (Yu et al. 2004). Two SNPs
in the MMP-2 gene were typed in 527 esophageal cancer
patients and 777 controls. P-values of the nonlinear tests
are given in Table 4. We can see that P-values for most of
the nonlinear tests are 10–100 times smaller than that
of the standard $\chi^2$-test, whereas the P-value of the
reciprocal-based test is almost the same as that of the
standard $\chi^2$-test.
To examine whether nonlinear test statistics show
significant association when the standard $\chi^2$-test
shows no significance, the proposed nonlinear test statistics were also applied to test association of a functional SNP in ZDHHC8 with schizophrenia in a Japanese
case–control population (Saito et al. 2005). The results
are summarized in Table 5. The data demonstrate that
when the $\chi^2$-test shows no association of the functional
SNP in the ZDHHC8 gene with schizophrenia, nonlinear test statistics also show no evidence of association.
P-values of the nonlinear test statistics are nearly identical to
that of the standard $\chi^2$-test.
DISCUSSION
In the near future, genomewide association studies
performing millions of statistical tests will be conducted.
To ensure a genomewide significance level of 0.05, a
stringent P-value is required for the statistical test. There
is a crucial need for increased efforts in developing new
statistical methods that can achieve small P-values. As a
step in this direction, in this report we present
nonlinear tests for association studies.
TABLE 3
Association tests for COMT haplotypes with schizophrenia

                Two-SNP haplotypes (a)                Three-SNP haplotype (a)
P-values for    H1          H2          H3          H4
Entropy         1.9e-009    2.7e-006    2.9e-006    1.5e-012
Exponential     1.2e-013    9.5e-009    1.8e-005    8.0e-014
Quadratic       7.5e-013    1.5e-008    9.5e-006    4.0e-014
Reciprocal      1.4e-010    8.3e-008    2.4e-006    2.9e-013
$\chi^2$ (b)    1.4e-004    5.7e-003    1.1e-003    4.5e-004

All data (including males and females) are used in the analysis.
(a) H1, rs737865–rs165599; H2, rs737865–rs165688; H3, rs165599–rs165688; H4, rs165688–rs737865–rs165599.
(b) P-values reported by Shifman et al. (2002).

The traditional $\chi^2$-test statistic is a quadratic function of the difference ($P^A - P$) in allele or haplotype frequencies between the affected and unaffected individuals. Although the $\chi^2$-test statistic itself is a nonlinear function of allele or haplotype frequencies, its basic unit ($P^A - P$) is a linear transformation of allele or haplotype frequencies. If the difference in nonlinear transformation of allele or haplotype frequencies is larger
than the difference in allele or haplotype frequencies,
i.e., $\|f(P^A) - f(P)\| > \|P^A - P\|$, where $\|\cdot\|$ denotes a
norm of the vector, then the statistics based on $f(P^A) - f(P)$ may have higher power than the statistics based on
($P^A - P$). On the basis of this simple idea, we have
developed a general statistical framework for nonlinear
tests that provides basic procedures about how to construct test statistics using nonlinear transformations of
allele or haplotype frequencies. We have shown that, in
general, similarity measure-based statistics can be formulated as the differences in quadratic forms of allele
or haplotype frequencies. Therefore, using the proposed statistical framework for nonlinear tests, we can
derive many similarity measure-based statistics. As a byproduct, nonlinear test theory can unify two classes of
association tests: tests of the difference in allele or haplotype frequencies and tests based on a similarity measure of the genome region being tested.
The distributions of nonlinear test statistics are based on the asymptotic statistical theory of nonlinear transformations. We investigate the distributions of several nonlinear test statistics under the null hypothesis by simulation studies. Even with moderate sample size ($n = 100$), distributions of the proposed nonlinear statistics are still close to the central $\chi^2$-distribution (data not shown). To validate the test statistics, we calculate the type I error rates of the presented nonlinear statistics by simulations. These showed that the type I error rates of nonlinear statistics were close to the nominal significance levels, which implies that the nonlinear tests for association study are valid in a single homogeneous population.

Figure 2.—(A) Power of the nonlinear test statistics and the standard $\chi^2$-test statistic at the disease locus with a significance level of 0.001 for a disease model with penetrance $f_{11} = 1$, $f_{12} = 0.2$, and $f_{22} = 0.1$ as a function of the disease allele frequency, assuming equal sample size ($n = 100$) in both cases and controls. (B) Power of the nonlinear test statistics and the standard $\chi^2$-test statistic at the disease locus with a significance level of 0.001 for a disease model with penetrance $f_{11} = 1$, $f_{12} = 1$, and $f_{22} = 0.1$ as a function of the disease allele frequency, assuming equal sample size ($n = 100$) in both cases and controls. (C) Power of the nonlinear test statistics and the standard $\chi^2$-test statistic at the disease locus with a significance level of 0.001 for a genotype relative risk disease model ($r = 4$) as a function of the disease allele frequency, assuming equal sample size ($n = 100$) in both cases and controls.

TABLE 4
P-values of nonlinear tests for the MMP-2 gene with esophageal cancer

Nonlinear transformation    P-value
Entropy                     3.2e-008
Exponential                 2.3e-007
Quadratic                   1.9e-007
Reciprocal                  5.1e-006
$\chi^2$                    7.0e-006

To evaluate the performance of the nonlinear test statistics, we compare the power of the nonlinear test statistics with that of the standard $\chi^2$-test statistic. To reveal the relationships between the power of the nonlinear test statistics and the measure of nonlinearity of nonlinear transformations, we developed analytical tools for calculations of the power of the test statistics. Power of the nonlinear statistics depends on several parameters such as disease model, allele or haplotype frequencies, measure of LD between the allele or haplotype and disease allele, and the measure of nonlinearity of the nonlinear transformations of the allele or haplotype frequencies. We showed that, in many cases, most of the studied nonlinear test statistics have higher power than the standard $\chi^2$-test statistic, with the exception of the reciprocal transformation, whose power, in general, is lower than that of the standard $\chi^2$-test statistic. However, since the power of a statistic is a complex issue, no single statistic is uniformly most powerful. Forms of nonlinear transformation are crucial for developing nonlinear test statistics. Our preliminary results showed that the larger the measure of nonlinearity of the nonlinear transformation is, the higher the power of its corresponding nonlinear test statistic. Power of nonlinear test statistics is a complicated function of the measure of nonlinearity of the nonlinear transformation and other genetic and population parameters, particularly allele/haplotype frequencies. Our experience shows that when the frequencies of alleles/haplotypes are <0.05, nonlinear test statistics may not be a good choice for association analysis. We suggest using nonlinear test statistics when the frequencies of alleles/haplotypes are >0.05; i.e., we use nonlinear test statistics for association analysis of common diseases with common alleles. A clear and consistent pattern of how the power of the nonlinear test statistics depends on the measure of nonlinearity of the nonlinear transformation is difficult to obtain. More investigations are needed.

TABLE 5
Association tests of a functional SNP in the ZDHHC8 gene with schizophrenia

          Nonlinear transformation
Sample    Entropy   Exponential   Quadratic   Reciprocal   $\chi^2$
Male      0.2586    0.2586        0.2586      0.2604       0.2603
Female    0.7134    0.7134        0.7134      0.7135       0.7135
All       0.6177    0.6177        0.6177      0.6178       0.6178

To further evaluate the performance of the nonlinear
test statistics, the proposed nonlinear test statistics were
also applied to three real data examples. The results
showed that when the standard $\chi^2$-test detected association of the COMT gene with schizophrenia, all nonlinear test statistics demonstrated strong association
of the COMT gene with schizophrenia, and when the
standard $\chi^2$-test detected no association of the gene
ZDHHC8 with schizophrenia, all nonlinear test statistics,
with almost the same P-values as that of the standard
$\chi^2$-test, also showed no association.
The results in this report are very limited. Theoretical
and empirical studies should be conducted to compare
and investigate the relative strengths and weaknesses of
nonlinear tests and other existing association tests. The
properties of the nonlinear test statistics should be
further investigated both by theoretical studies and by
empirical simulations. In this report, we studied only
very limited nonlinear functions. It is worth developing
general theory for searching optimal nonlinear functions with the highest power. Nonlinear tests are a new
concept for developing test statistics, which will open
new ways for developing powerful statistics in genetic
studies of complex diseases. Theory for nonlinear tests
is in its infancy. Much theoretical work and empirical
evaluation is needed in the future.
We thank Sagiv Shifman and Ariel Darvasi for providing the detailed
data information for schizophrenic haplotype analyses. We thank two
anonymous reviewers for helpful comments on the manuscript, which
led to its improvement. We also thank Ranjan Deka for his constructive
comments. M. M. Xiong is supported by the National Institutes of
Health (NIH)–National Institute of Arthritis and Musculoskeletal and
Skin Diseases grants IP50AR44888 and HL74735 and by NIH grant
ES09912. J. Y. Zhao is supported by NIH grant ES09912.
LITERATURE CITED
Ahmadi, K. R., M. E. Weale, Z. Y. Xue, N. Soranzo, D. P. Yarnall
et al., 2005 A single-nucleotide polymorphism tagging set for
human drug metabolism and transport. Nat. Genet. 37: 84–89.
Akey, J., L. Jin and M. Xiong, 2001 Haplotypes vs single marker
linkage disequilibrium tests: What do we gain? Eur. J. Hum.
Genet. 9: 291–300.
Altshuler, D., and A. G. Clark, 2005 Genetics. Harvesting medical
information from the human family tree. Science 307: 1052–1053.
Bates, D. M., and D. G. Watts, 1980 Relative curvature measures of
nonlinearity. J. R. Stat. Soc. Ser. B 42: 1–25.
Borsting, C., J. J. Sanchez and N. Morling, 2005 SNP typing on
the NanoChip electronic microarray. Methods Mol. Biol. 297:
155–168.
Bourgain, C., E. Genin, P. Margaritte-Jeannin and F. Clerget-Darpoux, 2001 Maximum identity length contrast: a powerful
method for susceptibility gene detection in isolated populations.
Genet. Epidemiol. 21(Suppl. 1): S560–S564.
Chapman, N. H., and E. M. Wijsman, 1998 Genome screens using
linkage disequilibrium tests: optimal marker characteristics and
feasibility. Am. J. Hum. Genet. 63: 1872–1885.
Fan, R., and K. Lange, 1998 Models for haplotype evolution in
a nonstationary population. Theor. Popul. Biol. 53: 184–198.
Freimer, N., and C. Sabatti, 2004 The use of pedigree, sib-pair and
association studies of common diseases for genetic mapping and
epidemiology. Nat. Genet. 36: 1045–1051.
Greenwood, P. E., and M. S. Nikulin, 1996 A Guide to Chi-Square
Testing. John Wiley & Sons, New York.
Halldorsson, B. V., V. Bafna, R. Lippert, R. Schwartz, F. M. De La
Vega et al., 2004 Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies. Genome Res. 14:
1633–1640.
Jorde, L. B., 2000 Linkage disequilibrium and the search for complex disease genes. Genome Res. 10: 1435–1444.
Lehmann, E. L., 1983 Theory of Point Estimation. John Wiley & Sons,
New York.
Neale, B. M., and P. C. Sham, 2004 The future of association studies: gene-based analysis and replication. Am. J. Hum. Genet. 75:
353–362.
Nothnagel, M., 2002 Simulation of LD block-structured SNP haplotype data and its use for the analysis of case-control data by
supervised learning methods. Am. J. Hum. Genet. 71(Suppl.):
A2363.
Pritchard, J. K., and P. Donnelly, 2001 Case-control studies of association in structured or admixed populations. Theor. Popul.
Biol. 60: 227–237.
Risch, N., and K. Merikangas, 1996 The future of genetic studies
of complex human diseases. Science 273: 1516–1517.
Saito, S., M. Ikeda, N. Iwata, T. Suzuki, T. Kitajima et al., 2005 No
association was found between a functional SNP in ZDHHC8 and
schizophrenia in a Japanese case-control population. Neurosci.
Lett. 374: 21–24.
Serfling, R. J., 1980 Approximation Theorems of Mathematical Statistics.
John Wiley & Sons, New York.
Shifman, S., M. Bronstein, M. Sternfeld, A. Pisante-Shalom,
E. Lev-Lehman et al., 2002 A highly significant association between a COMT haplotype and schizophrenia. Am. J. Hum.
Genet. 71: 1296–1302.
Tzeng, J. Y., B. Devlin, L. Wasserman and K. Roeder, 2003 On
the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am. J. Hum. Genet. 72:
891–902.
Wang, W. Y., B. J. Barratt, D. G. Clayton and J. A. Todd,
2005 Genome-wide association studies: theoretical and practical concerns. Nat. Rev. Genet. 6: 109–118.
Xiong, M., J. Zhao and E. Boerwinkle, 2003 Haplotype block linkage disequilibrium mapping. Front. Biosci. 8: a85–a93.
Yu, C., Y. Zhou, X. Miao, P. Xiong, W. Tan et al., 2004 Functional
haplotypes in the promoter of matrix metalloproteinase-2 predict risk of the occurrence and metastasis of esophageal cancer.
Cancer Res. 64: 7622–7628.
Zhang, S., Q. Sha, H. S. Chen, J. Dong and R. Jiang, 2003 Transmission/disequilibrium test based on haplotype sharing for
tightly linked markers. Am. J. Hum. Genet. 73: 566–579.
Zhao, J., E. Boerwinkle and M. Xiong, 2005 An entropy-based statistic for genomewide association studies. Am. J. Hum. Genet. 77:
27–40.
Communicating editor: N. Takahata
APPENDIX A
In the following, we show that a test statistic computed from linearly transformed allele or haplotype frequencies takes the same value as the statistic computed before the transformation. Let $P^A$ and $P$ be the allele (haplotype) frequencies in cases and controls, respectively; let $\Delta P = P^A - P$ be the vector of differences in allele or haplotype frequencies between cases and controls; let $\Sigma$ be the variance–covariance matrix of the vector of differences $\Delta P$; and let $A$ be a linear transformation matrix, so that the linearly transformed allele or haplotype frequencies are $AP^A$ and $AP$, respectively. The popularly used χ²-test statistic can be derived from the statistic
\[
T = (P^A - P)^T \Sigma^{-} (P^A - P),
\]
where $\Sigma^{-}$ is a generalized inverse of the matrix $\Sigma$.
The difference in the linearly transformed allele or haplotype frequencies between cases and controls can be written as
\[
X = AP^A - AP = A\,\Delta P,
\]
where $A$ is assumed to be a nonsingular matrix. Then, the variance–covariance matrix of $X$ is given by
\[
\Lambda = \mathrm{Cov}(X, X) = A \Sigma A^T .
\]
Because $A$ is nonsingular, $(A^T)^{-1} \Sigma^{-} A^{-1}$ is a generalized inverse of $\Lambda$, and the new statistic resulting from the transformation is
\[
\tilde{T} = (A\,\Delta P)^T \Lambda^{-} A\,\Delta P = (\Delta P)^T \Sigma^{-} \Delta P = T .
\]
This shows that a linear transformation of allele or haplotype frequencies does not change the test statistic.
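The invariance can also be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original analysis. The frequencies, sample sizes, and the matrix `A` are made-up illustrative values. The covariance of the frequency difference is assumed to take the multinomial form used for the standard test, and the Moore-Penrose inverse stands in for the generalized inverse.

```python
import numpy as np

def multinomial_cov(p):
    """diag(p) - p p^T for a vector of allele or haplotype frequencies."""
    p = np.asarray(p, dtype=float)
    return np.diag(p) - np.outer(p, p)

def quadratic_statistic(delta, cov, rcond=1e-10):
    """delta^T cov^- delta, with the Moore-Penrose inverse as generalized inverse."""
    return float(delta @ np.linalg.pinv(cov, rcond=rcond) @ delta)

# Hypothetical frequencies in cases and controls, and hypothetical sample sizes.
p_case = np.array([0.30, 0.30, 0.25, 0.15])
p_ctrl = np.array([0.40, 0.30, 0.20, 0.10])
n_case, n_ctrl = 200, 200

delta_p = p_case - p_ctrl
# Assumed covariance of delta_p (2n chromosomes sampled in each group).
sigma = multinomial_cov(p_case) / (2 * n_case) + multinomial_cov(p_ctrl) / (2 * n_ctrl)

# An arbitrary nonsingular transformation (unit upper triangular, determinant 1).
A = np.eye(4) + 0.5 * np.triu(np.ones((4, 4)), k=1)

t_original = quadratic_statistic(delta_p, sigma)
t_transformed = quadratic_statistic(A @ delta_p, A @ sigma @ A.T)
print(t_original, t_transformed)
```

The two printed values should agree up to rounding error, whichever nonsingular `A` is chosen.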
APPENDIX B
Below we show that a similarity measure of the genome region is a quadratic function of allele or haplotype frequencies. Therefore, similarity measure-based statistics are nonlinear test statistics. For simplicity of presentation, we consider only haplotype similarity. However, the conclusions, in general, hold for other types of similarity of the genome region. Suppose that the numbers of $H_i$ haplotypes in the affected and unaffected individuals are $n_i^A$ and $n_i$, respectively. $n_A$, $n_G$, $P^A_{H_i}$, and $P_{H_i}$ are defined as before. Then, we have $P_{H_i} = n_i/2n_G$ and $P^A_{H_i} = n_i^A/2n_A$. Let $G_{H_i}$ and $G^A_{H_i}$ be the similarity measures of the haplotype $H_i$ in the unaffected and affected individuals, respectively. Let $S(H_i, H_j)$ be a measure of the similarity between the haplotype $H_i$ and the haplotype $H_j$. Then, the similarity measure of the haplotype $H_i$ in the unaffected individuals is given by
\[
G_{H_i} = \frac{n_i}{2n_G} \sum_{j=1}^{m} \frac{n_j}{2n_G}\, S(H_i, H_j).
\]
Let $P = [P_{H_1}, \ldots, P_{H_m}]^T$ and $S_i = [S(H_i, H_1), \ldots, S(H_i, H_m)]$. Then the above equation can be further reduced to
\[
G_{H_i} = P_{H_i} S_i P .
\]
The similarity measure of all the haplotypes in the unaffected individuals, which is referred to as the overall similarity measure and denoted by $G$, is defined as the summation of the similarity measures of the individual haplotypes, i.e., $G = \sum_{i=1}^{m} G_{H_i} = \sum_{i=1}^{m} P_{H_i} S_i P$. Let $S = (S(H_i, H_j))_{m \times m}$ be a similarity matrix. We have $S = [S_1^T, \ldots, S_m^T]^T$. Then, $G$ can be written as
\[
G = [P_{H_1}, \ldots, P_{H_m}]
\begin{bmatrix} S_1 \\ \vdots \\ S_m \end{bmatrix} P = P^T S P .
\]
Similarly, for the affected individuals, we have
\[
G^A = (P^A)^T S^A P^A ,
\]
where $P^A$, $S^A$, and $G^A$ are defined similarly to those for the unaffected individuals. Clearly, the similarity measures $G_{H_i}$ and $G$ are quadratic functions of the haplotype frequencies and hence are nonlinear transformations of the haplotype frequencies. Both the overall similarity measure $G$ and the similarity measure $G_{H_i}$ of the haplotype $H_i$ can be used to construct association tests.
We first consider the overall similarity measure-based test statistic. Let $\Sigma = \mathrm{diag}(P_{H_1}, \ldots, P_{H_m}) - P P^T$ and $\Sigma^A = \mathrm{diag}(P^A_{H_1}, \ldots, P^A_{H_m}) - P^A (P^A)^T$. The Jacobian matrix $B$ of the overall similarity measure $G$ with respect to $P$ is given by
\[
B = \frac{\partial G}{\partial P} = 2 S P .
\]
Similarly, we have $C = \partial G^A / \partial P^A = 2 S^A P^A$. Let $\hat{G}$, $\hat{G}^A$, $\hat{P}$, and $\hat{P}^A$ be the corresponding estimators of $G$, $G^A$, $P$, and $P^A$, respectively. Then the variances of $\hat{G}$ and $\hat{G}^A$ can be approximated by
\[
\mathrm{var}(\hat{G}) = \frac{1}{2n_G} B^T \Sigma B = \frac{2}{n_G} P^T S^T \Sigma S P ,
\]
\[
\mathrm{var}(\hat{G}^A) = \frac{1}{2n_A} C^T \Sigma^A C = \frac{2}{n_A} (P^A)^T (S^A)^T \Sigma^A S^A P^A
\]
(Lehmann 1983). We define the overall haplotype similarity measure-based statistic as
\[
T_{os} = \frac{(\hat{G}^A - \hat{G})^2}{\mathrm{var}(\hat{G}) + \mathrm{var}(\hat{G}^A)} .
\]
This is similar to the similarity measure-based test statistic $D$ in Tzeng et al. (2003), where the variances of $\hat{G}$ and $\hat{G}^A$ are accurately calculated.
Now consider the haplotype similarity measure-based test statistic. Let $G_H = [G_{H_1}, \ldots, G_{H_l}]^T$,
\[
b_{ii} = \frac{\partial G_{H_i}}{\partial P_{H_i}} = S(H_i, H_i) P_{H_i} + \sum_{j=1}^{m} S(H_i, H_j) P_{H_j}, \qquad
b_{ij} = \frac{\partial G_{H_i}}{\partial P_{H_j}} = P_{H_i} S(H_i, H_j),
\]
and $B = (b_{ij})_{l \times m}$ $(l \le m)$. $G^A_H$ and the Jacobian matrix $C$ for the affected individuals are similarly defined. Let $\Lambda = (1/2n_G) B \Sigma B^T + (1/2n_A) C \Sigma^A C^T$. We define the haplotype similarity measure-based test statistic as
\[
T_s = (\hat{G}^A_H - \hat{G}_H)^T \hat{\Lambda}^{-} (\hat{G}^A_H - \hat{G}_H),
\]
where $\hat{G}^A_H$, $\hat{G}_H$, and $\hat{\Lambda}$ are the estimators of $G^A_H$, $G_H$, and $\Lambda$, respectively, and $\hat{\Lambda}^{-}$ is the generalized inverse of the matrix $\hat{\Lambda}$. Let $r = \mathrm{rank}(\Lambda)$; then, under the null hypothesis of no association between the haplotypes and the disease, the test statistic $T_s$ is asymptotically distributed as a central $\chi^2_{(r)}$. It is clear that both test statistics $T_{os}$ and $T_s$ are nonlinear test statistics. Therefore, the similarity measure-based statistics are special cases of the nonlinear test statistics.
APPENDIX C
Let $f(P)$ be a vector-valued nonlinear function of the random vector $P$. Assume that the nonlinear function $f(P)$ satisfies regularity conditions that ensure that Theorem 3.3A in Serfling (1980) holds. Then, $f(\hat{P})$ is asymptotically distributed as a multivariate normal distribution $N(f(P), (1/2n_G) C \Sigma C^T)$, where
\[
c_{ii} = \frac{\partial f(P_{H_i})}{\partial P_{H_i}}, \qquad
c_{ij} = \frac{\partial f(P_{H_i})}{\partial P_{H_j}}, \qquad
C = (c_{ij})_{m \times m}, \qquad
\Sigma = \mathrm{diag}(P_{H_1}, \ldots, P_{H_m}) - P P^T .
\]
Similarly, $f(\hat{P}^A)$ is asymptotically distributed as $N(f(P^A), (1/2n_A) B \Sigma^A B^T)$, where
\[
b_{ii} = \frac{\partial f(P^A_{H_i})}{\partial P^A_{H_i}}, \qquad
b_{ij} = \frac{\partial f(P^A_{H_i})}{\partial P^A_{H_j}}, \qquad
B = (b_{ij})_{m \times m}, \qquad
\Sigma^A = \mathrm{diag}(P^A_{H_1}, \ldots, P^A_{H_m}) - P^A (P^A)^T .
\]
Therefore, under the null hypothesis $H_0$: $P^A = P$, which implies $f(P^A) = f(P)$, $f(\hat{P}^A) - f(\hat{P})$ is asymptotically distributed as $N(0, \Lambda)$, where
\[
\Lambda = \frac{1}{2n_A} B \Sigma^A B^T + \frac{1}{2n_G} C \Sigma C^T .
\]
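As a concrete illustration of this covariance matrix, the sketch below evaluates it for a component-wise transformation, for which the Jacobians B and C are diagonal. It is a minimal Python/NumPy sketch under stated assumptions: the frequencies and sample sizes are invented, and f(p) = log p is only a hypothetical choice of nonlinear function, not one taken from this appendix.

```python
import numpy as np

def multinomial_cov(p):
    """Sigma = diag(p) - p p^T for a vector of haplotype frequencies."""
    p = np.asarray(p, dtype=float)
    return np.diag(p) - np.outer(p, p)

def delta_method_cov(p_case, p_ctrl, n_case, n_ctrl, f_prime):
    """Lambda = B Sigma^A B^T / (2 n_A) + C Sigma C^T / (2 n_G).

    Assumes f acts on each frequency separately, so the Jacobians are
    diagonal with entries f'(P^A_Hi) (matrix B) and f'(P_Hi) (matrix C).
    """
    B = np.diag(f_prime(np.asarray(p_case, dtype=float)))
    C = np.diag(f_prime(np.asarray(p_ctrl, dtype=float)))
    return (B @ multinomial_cov(p_case) @ B.T / (2 * n_case)
            + C @ multinomial_cov(p_ctrl) @ C.T / (2 * n_ctrl))

# Hypothetical haplotype frequencies in cases and controls.
p_case = np.array([0.30, 0.30, 0.25, 0.15])
p_ctrl = np.array([0.40, 0.30, 0.20, 0.10])

# Hypothetical nonlinear transformation f(p) = log p, so f'(p) = 1/p.
lam_log = delta_method_cov(p_case, p_ctrl, 200, 200, lambda p: 1.0 / p)

# The identity transformation f(p) = p has f'(p) = 1 and gives back the
# covariance of the untransformed frequency difference.
lam_id = delta_method_cov(p_case, p_ctrl, 200, 200, np.ones_like)
```

Under the null hypothesis the two Jacobians coincide; the sketch keeps them separate so the same function can be evaluated at estimated case and control frequencies.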
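Let $Z = f(\hat{P}^A) - f(\hat{P})$ and $r = \mathrm{rank}(\Lambda)$. Then, under the null hypothesis, $T_N = Z^T \Lambda^{-} Z$ is asymptotically distributed as a central $\chi^2_{(r)}$-distribution (Greenwood and Nikulin 1996). The alternative hypothesis is $H_a$: $P^A \ne P$. Under the alternative hypothesis, $T_N$ is asymptotically distributed as a noncentral $\chi^2_{(r)}$-distribution with the following noncentrality parameter:
\[
\lambda_N = [f(P^A) - f(P)]^T \Lambda^{-} [f(P^A) - f(P)]. \tag{C1}
\]
By Taylor expansion, we have
\[
f(P^A) - f(P) \approx C (P^A - P) + \tfrac{1}{2} (P^A - P)^T H (P^A - P), \tag{C2}
\]
where $H_l = \mathrm{diag}(0, \ldots, f''(P_{H_l}), 0, \ldots, 0)$, $l = 1, \ldots, m$, and
\[
(P^A - P)^T H (P^A - P) =
\begin{bmatrix}
(P^A - P)^T H_1 (P^A - P) \\
\vdots \\
(P^A - P)^T H_m (P^A - P)
\end{bmatrix}.
\]
Equation C2 can be rewritten as
\[
\begin{aligned}
f(P^A) - f(P) &\approx C \left[ (P^A - P) + \tfrac{1}{2} C^{-} (P^A - P)^T H (P^A - P) \right] \\
&= C \left[ (P^A - P) + \tfrac{1}{2} C^{-} H (P^A - P)(P^A - P) \right].
\end{aligned} \tag{C3}
\]
Let $S = C^{-} H (P^A - P)$; then
\[
f(P^A) - f(P) \approx C \left( I + \tfrac{1}{2} S \right) (P^A - P). \tag{C4}
\]
Substituting $f(P^A) - f(P)$ in Equation C4 into Equation C1 yields
\[
\begin{aligned}
\lambda_N &= (P^A - P)^T \left( I + \tfrac{1}{2} S \right)^T (C^T \Lambda^{-} C) \left( I + \tfrac{1}{2} S \right) (P^A - P) \\
&= (P^A - P)^T \left( I + \tfrac{1}{2} S \right)^T \left[ C^{-} \Lambda (C^T)^{-} \right]^{-} \left( I + \tfrac{1}{2} S \right) (P^A - P).
\end{aligned} \tag{C5}
\]
Recall that
\[
P^A - P = \varepsilon\, \delta_{HD},
\]
where $\delta_{HD} = [\delta_{H_1 D}, \ldots, \delta_{H_m D}]^T$, and
\[
B \approx C + H (P^A - P). \tag{C6}
\]
Thus,
\[
\begin{aligned}
C^{-} \Lambda (C^T)^{-} &= C^{-} \left[ \frac{1}{2n_A} B \Sigma^A B^T + \frac{1}{2n_G} C \Sigma C^T \right] (C^T)^{-} \\
&\approx \frac{1}{2n_A} C^{-} B \Sigma^A (C^{-} B)^T + \frac{1}{2n_G} \Sigma \\
&= \frac{1}{2n_A} (I + S) \Sigma^A (I + S) + \frac{1}{2n_G} \Sigma .
\end{aligned} \tag{C7}
\]
Substituting Equations C6 and C7 into Equation C5, we obtain
\[
\lambda_N \approx \varepsilon^2 \delta_{HD}^T \left( I + \tfrac{1}{2} S \right)^T \left[ \frac{1}{2n_A} (I + S) \Sigma^A (I + S) + \frac{1}{2n_G} \Sigma \right]^{-} \left( I + \tfrac{1}{2} S \right) \delta_{HD}. \tag{C8}
\]
Next we study the geometric interpretation of the matrix $S$. Let $g(P) = [Z_1, \ldots, Z_m]^T$, where $Z_i = f(P_{H_i})$. Define the following parametric equation:
\[
P^A = P + t\, \Delta P .
\]
As $t$ varies, $g(P^A) = g(P + t\, \Delta P)$ defines a curve $\mathcal{C}$ in the space. The tangent vector of the curve $\mathcal{C}$ at the point $P$ is given by
\[
\frac{dg}{dt} = \frac{\partial g}{\partial P^T}\, \Delta P ,
\]
where
\[
\frac{\partial g}{\partial P^T} =
\begin{bmatrix}
f'(P_{H_1}) & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & f'(P_{H_m})
\end{bmatrix} = C .
\]
Taking $Z_i$ $(i = 1, \ldots, m)$ as a new coordinate system, we obtain the change rates of the tangent vector of the curve over the new coordinates,
\[
\frac{\partial (dg/dt)}{\partial Z^T} = C^{-} H\, \Delta P = S ,
\]
where
\[
H = \begin{bmatrix} H_1 \\ \vdots \\ H_m \end{bmatrix}
\quad \text{and} \quad
H_i = \mathrm{diag}(0, \ldots, f''(P_{H_i}), \ldots, 0).
\]
The change rate of the tangent vector of the curve characterizes the strength of the nonlinearity of the nonlinear function (Bates and Watts 1980). The matrix $S$ has the following form:
\[
S = \mathrm{diag}\left( \frac{f''(P_{H_1})}{f'(P_{H_1})} (P^A_{H_1} - P_{H_1}), \ldots, \frac{f''(P_{H_m})}{f'(P_{H_m})} (P^A_{H_m} - P_{H_m}) \right).
\]
If the product terms of the haplotype frequencies are ignored, we obtain $C^{-} \Lambda (C^T)^{-} = \mathrm{diag}(\Lambda_1, \ldots, \Lambda_m)$, where
\[
\Lambda_i = \frac{1}{2n_A} \left[ 1 + \pi_i (P^A_{H_i} - P_{H_i}) \right]^2 P^A_{H_i} + \frac{1}{2n_G} P_{H_i}, \qquad
\pi_i = \frac{f''(P_{H_i})}{f'(P_{H_i})},
\]
and $I + \tfrac{1}{2} S = \mathrm{diag}(1 + (\pi_1/2)(P^A_{H_1} - P_{H_1}), \ldots, 1 + (\pi_m/2)(P^A_{H_m} - P_{H_m}))$. Then, Equation C8 can be simplified to
\[
\begin{aligned}
\lambda_N &\approx \varepsilon^2 \sum_{i=1}^{m} \frac{\delta^2_{H_i D} \left[ 1 + (\pi_i/2)(P^A_{H_i} - P_{H_i}) \right]^2}{\Lambda_i} \\
&= \varepsilon^2 \sum_{i=1}^{m} \frac{\delta^2_{H_i D} \left[ 1 + (\varepsilon \pi_i/2)\, \delta_{H_i D} \right]^2}{(1/2n_A) \left( 1 + \varepsilon \pi_i \delta_{H_i D} \right)^2 P^A_{H_i} + (1/2n_G) P_{H_i}} .
\end{aligned}
\]
For the standard χ²-test statistic, we have $\pi_i = 0$. Thus, its noncentrality parameter is given by
\[
\lambda \approx \varepsilon^2 \delta_{HD}^T \left[ \frac{1}{2n_A} \Sigma^A + \frac{1}{2n_G} \Sigma \right]^{-} \delta_{HD}
\]
and, under the same approximation that ignores the product terms,
\[
\lambda \approx \varepsilon^2 \sum_{i=1}^{m} \frac{\delta^2_{H_i D}}{(1/2n_A) P^A_{H_i} + (1/2n_G) P_{H_i}} .
\]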
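The two approximate noncentrality parameters can be compared numerically. The following is a minimal pure-Python sketch, under the diagonal approximation that ignores the product terms of the haplotype frequencies. The frequencies and sample sizes are invented, f(p) = p² is a hypothetical choice of nonlinear function, and the function name `noncentrality` is introduced here only for illustration. The frequency difference P^A_Hi − P_Hi plays the role of ε δ_HiD.

```python
def noncentrality(p_case, p_ctrl, n_case, n_ctrl, f1=None, f2=None):
    """Approximate noncentrality parameter under the diagonal approximation.

    f1 and f2 are the first and second derivatives of a component-wise
    transformation f. With f1 = f2 = None the ratio pi_i = f''/f' is set
    to 0, which corresponds to the standard chi-square statistic.
    """
    total = 0.0
    for pa, p in zip(p_case, p_ctrl):
        diff = pa - p                                  # epsilon * delta_{H_i D}
        pi_i = 0.0 if f1 is None else f2(p) / f1(p)    # pi_i = f''(P_Hi) / f'(P_Hi)
        # Lambda_i = [1 + pi_i * diff]^2 * P^A_Hi / (2 n_A) + P_Hi / (2 n_G)
        lam_i = (1.0 + pi_i * diff) ** 2 * pa / (2 * n_case) + p / (2 * n_ctrl)
        total += diff ** 2 * (1.0 + 0.5 * pi_i * diff) ** 2 / lam_i
    return total

# Hypothetical haplotype frequencies and sample sizes.
p_case = [0.30, 0.30, 0.25, 0.15]
p_ctrl = [0.40, 0.30, 0.20, 0.10]

lam_standard = noncentrality(p_case, p_ctrl, 200, 200)
# Hypothetical transformation f(p) = p**2: f'(p) = 2p, f''(p) = 2.
lam_square = noncentrality(p_case, p_ctrl, 200, 200,
                           f1=lambda p: 2.0 * p, f2=lambda p: 2.0)
print(lam_standard, lam_square)
```

Whether the nonlinear value exceeds the standard one depends on the signs and sizes of π_i and of the frequency differences, so the sketch is meant for exploring particular settings rather than for establishing a general ordering.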