
BIOINFORMATICS
ORIGINAL PAPER
Vol. 24 no. 7 2008, pages 958–964
doi:10.1093/bioinformatics/btn064
Gene expression
Identifying trait clusters by linkage profiles: application
in genetical genomics
Joshua N. Sampson 1,* and Steven G. Self 1,2
1 Department of Biostatistics, University of Washington and 2 Statistical Center for HIV/AIDS Research and Prevention,
Seattle, WA, USA
Received on November 3, 2007; revised on January 11, 2008; accepted on February 17, 2008
Advance Access publication February 29, 2008
Associate Editor: John Quackenbush
ABSTRACT
Motivation: Genes often regulate multiple traits. Identifying clusters
of traits influenced by a common group of genes helps elucidate
regulatory networks and can improve linkage mapping.
Methods: We show that the Pearson correlation coefficient, $\hat{\rho}_L$,
between two LOD score profiles can, with high specificity and
sensitivity, identify pairs of genes that have their transcription
regulated by shared quantitative trait loci (QTL). Furthermore, using
theoretical and/or empirical methods, we can approximate the
distribution of $\hat{\rho}_L$ under the null hypothesis of no common QTL.
Therefore, it is possible to calculate P-values and false discovery
rates for testing whether two traits share common QTL. We then
examine the properties of $\hat{\rho}_L$ through simulation and use $\hat{\rho}_L$ to cluster
genes in a genetical genomics experiment examining Saccharomyces cerevisiae.
Results: Simulations show that $\hat{\rho}_L$ can have more power than the
clustering methods currently used in genetical genomics. Combining
experimental results with Gene Ontology (GO) annotations shows that
genes within a purported cluster often share similar function.
Software: R-code included in online Supplementary Material.
Contact: [email protected]
Supplementary information: Supplementary materials are available
at Bioinformatics online.
1 INTRODUCTION
Genetic linkage analysis, or linkage mapping, is a powerful tool
for locating genes influencing quantitative, or continuously
varying, traits. For linkage mapping, the trait of interest is
measured on a group of related individuals and then the
genotypes at a set of genetic markers (i.e. single nucleotide
polymorphisms) are recorded for that same group. Markers
that are strongly correlated with the trait are reported as
quantitative trait loci (QTL).
When a single gene regulates two or more traits, an
occurrence known as pleiotropy, the power to detect that
gene and the precision of its estimated location are often
improved by mapping all regulated traits simultaneously (Jiang
and Zeng, 1995). Given a set of genetically correlated traits,
several methods are available for joint linkage analysis.
Maximum likelihood approaches can be applied to multivariate
*To whom correspondence should be addressed.
distributions (Chen, 2005; Korol et al., 1996). Haley–Knott
regression is easily adapted to multiple traits by using multivariate regression and ANOVA (Knott and Haley, 2000).
Principal components analysis can transform the traits into
a set of orthogonal, canonical variables and each component
can then be analyzed by standard, single-trait methods (Mangin
et al., 1998; Weller et al., 1996). These methods have been used
extensively in recent years, uncovering genes influencing milk
production in cows, grain yield in wheat, and multi-symptom
illnesses in a variety of organisms (Kraft et al., 2003; Mangin
et al., 1998; Martin et al., 2003).
Before benefiting from such methods, we must first identify
a set of genetically coregulated traits. We quantify genetic
coregulation by averaging the percentage of influential genes
common to both traits, $C(\cdot,\cdot)$. Usually traits have been clustered
because of biological relationships or prior experiments.
However, our knowledge may be limited to the data collected
for the linkage studies. Therefore, using only the recorded trait
values and marker genotypes, we want a method to determine if
all, or a subgroup, of those traits are genetically coregulated.
If the traits could share only a single gene, such a method would
analyse each marker separately and mimic the previously listed
joint mapping methods. However, the heritable traits still being
studied are influenced by multiple genes, and two traits sharing
one gene would likely share a set of genes.
In this article, we formalize a novel method for identifying
groups, or clusters, of traits likely to share common QTL
(Schadt et al., 2005). The need for such a method has
dramatically increased with the emergence of genetical genomics. Genetics and genomics can be combined by measuring the
expression levels of thousands of genes simultaneously in a
group of related individuals and then treating each expression
level as a quantitative trait for linkage mapping (Brem et al.,
2002; Li and Burmeister, 2005; Segal et al., 2003). QTL
controlling expression levels, eQTL, have been successfully
identified in mice, yeast, wheat, and humans (Li and
Burmeister, 2005; MacLaren and Sikela, 2005; Yvert et al.,
2003). Here, we will offer a way to cluster genes regulated by
common eQTL, thereby improving our estimates of QTL
locations and identifying collections of genes participating in
the same biological pathway.
Our clustering method is based on the correlation, $\hat{\rho}_L$,
between the LOD score profiles of two traits. Given a large
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
Trait clusters
group of traits, we form clusters so the value of $\hat{\rho}_L$ between any
two members exceeds some threshold. We will introduce this
method in the context of a single genetical genomics experiment. Until now, these analyses have generally formed clusters
so all pairs within a cluster have highly correlated expression
levels. We can therefore compare our method to this established
standard. In other cases, such as clustering genes where the
expression levels were measured on different populations, there
is no alternative to clustering by $\hat{\rho}_L$.
The remainder of the article is divided as follows. First, we
introduce notation and briefly review LOD scores. Second, we
introduce $\hat{\rho}_L$ as an estimate for $C(j_1, j_2)$. Third, we discuss $\hat{\rho}_L$ as
a similarity measure, and show how combining this measure
and an appropriate algorithm can identify clusters of genetically coregulated traits (Hastie et al., 2001). Fourth, we apply
our clustering method to simulations and the experimental
results from Yvert et al.’s study of Saccharomyces cerevisiae
(Yvert et al., 2003).
2 METHODS
2.1 Notation
Assume a cross between two strains of yeast, ST$_{1}$ and ST$_{-1}$, producing
$n$ individuals. For subject $i$, $i \in \{1, \ldots, n\}$, let $Y_{ji}$ be the expression level
for gene $j$ and let $\vec{Y}_j$ be called the expression, or phenotypic, profile of
gene $j$. Let subject $i$ be genotyped at $N$ markers, and let $G_{ti}$ be the
genotype at position $t$, where $G_{ti} = 1$ ($G_{ti} = -1$) if the allele is from
ST$_{1}$ (ST$_{-1}$). $\{G_{ti_1}, Y_{ji_1}\}$ and $\{G_{ti_2}, Y_{ji_2}\}$ are independent if $i_1 \neq i_2$. Recall,
yeast are haploid and have only two possible genotypes with
$P(G_{ti} = 1) = P(G_{ti} = -1) = \frac{1}{2}\ \forall\, t, i$.
To simplify the discussion, assume that all genes are located
at markers. We say that gene $t$, marker $t$, or QTL $t$ influences
the expression of gene $j$ if $f(Y_{ji} \mid G_{ti} = 1) \neq f(Y_{ji} \mid G_{ti} = -1)$, ignoring
the possibility of epistasis. Furthermore, let $R_j \equiv \{t_{j1}, \ldots, t_{jN_j}\}$ be the
positions of the $N_j$ QTL influencing the expression of gene $j$. The LOD
score, $X^n_{jt}$, is the $\log_{10}$ of the likelihood ratio statistic testing whether
$t \in R_j$ (see Supplementary Material for definition) and can be
approximated by Haley–Knott regression: $X^n_{jt} \approx (n/4.61) \ln(SST/SSE)$.
Here, SST and SSE are the total and residual sum of squares from
solving equation 1 (Haley et al., 1994). The vector of LOD scores, $\vec{X}_j$,
will be called the LOD score profile of gene $j$.

$$Y_{ji} = \beta_{j0} + \beta_{jt} P(G_{ti} = 1) + \epsilon_{ji} \qquad (1)$$

$$SST = \sum_{i=1}^{n} (Y_{ji} - \bar{Y}_j)^2$$

$$SSE = \sum_{i=1}^{n} \left(Y_{ji} - (\hat{\beta}_{j0} + \hat{\beta}_{jt} G_{ti})\right)^2$$

Here $\bar{Y}_j = \frac{1}{n} \sum_{i=1}^{n} Y_{ji}$ and $\hat{\beta}$ is the least-squares estimate of $\beta$.
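The approximation above can be illustrated with a minimal Python sketch. It assumes genotypes are fully observed and coded as +1/-1, so the regression uses $G_{ti}$ directly; the function names `lod_score` and `lod_profile` are hypothetical and are not taken from the authors' R code.

```python
import math

def lod_score(genotypes, phenotypes):
    """Approximate LOD score at one marker: (n/2) * log10(SST/SSE).

    This equals (n/4.61) * ln(SST/SSE) up to rounding of 2*ln(10).
    genotypes: list of +1/-1 codes; phenotypes: list of floats.
    """
    n = len(phenotypes)
    y_bar = sum(phenotypes) / n
    g_bar = sum(genotypes) / n
    sxx = sum((g - g_bar) ** 2 for g in genotypes)
    sxy = sum((g - g_bar) * (y - y_bar) for g, y in zip(genotypes, phenotypes))
    sst = sum((y - y_bar) ** 2 for y in phenotypes)
    if sxx == 0 or sst == 0:
        # Assumption: a monomorphic marker or constant trait gives no evidence.
        return 0.0
    slope = sxy / sxx                      # least-squares estimate of beta_jt
    intercept = y_bar - slope * g_bar      # least-squares estimate of beta_j0
    sse = sum((y - (intercept + slope * g)) ** 2
              for g, y in zip(genotypes, phenotypes))
    if sse <= 0:
        return float("inf")                # perfect fit
    return (n / 2.0) * math.log10(sst / sse)

def lod_profile(marker_columns, phenotypes):
    """LOD score profile: one score per marker (each column is one marker)."""
    return [lod_score(column, phenotypes) for column in marker_columns]
```

A marker that tracks the trait yields a larger score than one that does not, which is the behaviour the LOD score profile relies on.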
Let $1_{t \in R_j} = 1$ if $t \in R_j$, 0 otherwise. For trait $j$, $j \in \{j_1, j_2\}$,
$\sum_{t=1}^{N} 1_{t \in R_{j_1}} 1_{t \in R_{j_2}} / N_j$ is the proportion of QTL that are common to
both traits. We measure the genetic coregulation by the geometric
mean, $C(j_1, j_2)$, of the two proportions.

$$C(j_1, j_2) = \left[ \frac{\sum_{t=1}^{N} 1_{t \in R_{j_1}} 1_{t \in R_{j_2}}}{N_{j_1}} \cdot \frac{\sum_{t=1}^{N} 1_{t \in R_{j_1}} 1_{t \in R_{j_2}}}{N_{j_2}} \right]^{1/2} = \frac{\sum_{t=1}^{N} 1_{t \in R_{j_1}} 1_{t \in R_{j_2}}}{\sqrt{N_{j_1} N_{j_2}}} \qquad (2)$$

Our goal is to find clusters, $T_1, T_2, \ldots, T_g$, of traits such that
$\sum_{i=1}^{g} \sum_{j_1, j_2 \in T_i} C(j_1, j_2)$ is large.
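As a small worked illustration of equation 2, the following Python sketch computes $C(j_1, j_2)$ from two known sets of QTL positions. The function name `coregulation` and the convention of returning 0 when either trait has no QTL are assumptions made for this example.

```python
import math

def coregulation(qtl_a, qtl_b):
    """Geometric mean of the two proportions of shared QTL (equation 2)."""
    a, b = set(qtl_a), set(qtl_b)
    if not a or not b:
        # Assumption: C is taken to be 0 when a trait has no QTL.
        return 0.0
    shared = len(a & b)
    return shared / math.sqrt(len(a) * len(b))

# Two QTL for the first trait, four for the second, one in common.
example = coregulation({10, 40}, {40, 55, 70, 90})   # 1 / sqrt(2 * 4)
```

With identical QTL sets the measure equals 1, and with disjoint sets it equals 0.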
2.2 Estimating $C(j_1, j_2)$
Define the LOD score correlation coefficient, $\hat{\rho}_L$, for traits $j_1$ and $j_2$ by

$$\hat{\rho}_L \equiv \frac{\sum_{t=1}^{N} (X_{j_1 t} - \bar{X}_{j_1})(X_{j_2 t} - \bar{X}_{j_2})}{\sqrt{\sum_{t=1}^{N} (X_{j_1 t} - \bar{X}_{j_1})^2 \sum_{t=1}^{N} (X_{j_2 t} - \bar{X}_{j_2})^2}} \qquad (3)$$

where $\bar{X}_j = \frac{1}{N} \sum_{t=1}^{N} X_{jt}$. Let trait $j_1$ be influenced by $N_1$ genes and trait $j_2$
be influenced by $N_2$ genes. Define $k_1$ and $k_2$ by $N_1 = k_1 N$ and $N_2 = k_2 N$,
where $0 < k_1, k_2 < 1$. Furthermore, let the two traits share $N_0 = k_0 N$
genes where $0 < k_0 < k_1, k_2$. For large $n$, we show (Appendix A) that
under three mild assumptions,

$$\hat{\rho}_L(j_1, j_2) \approx C(j_1, j_2) = \frac{k_0}{\sqrt{k_1 k_2}} \qquad (4)$$
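Equation 3 is an ordinary Pearson correlation taken across markers rather than across subjects. A minimal Python sketch follows; the function name `lod_correlation` and the choice to return 0 for a flat profile are assumptions made for illustration.

```python
import math

def lod_correlation(profile_a, profile_b):
    """Pearson correlation between two LOD score profiles (equation 3)."""
    n_markers = len(profile_a)
    if n_markers != len(profile_b) or n_markers < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_a = sum(profile_a) / n_markers
    mean_b = sum(profile_b) / n_markers
    sab = sum((a - mean_a) * (b - mean_b) for a, b in zip(profile_a, profile_b))
    saa = sum((a - mean_a) ** 2 for a in profile_a)
    sbb = sum((b - mean_b) ** 2 for b in profile_b)
    if saa == 0 or sbb == 0:
        # Assumption: a completely flat profile is treated as uncorrelated.
        return 0.0
    return sab / math.sqrt(saa * sbb)

# Toy profiles: a shared peak gives a high value, distinct peaks a negative one.
shared_peak = lod_correlation([0, 1, 5, 1, 0, 0], [0, 2, 10, 2, 0, 0])
distinct_peaks = lod_correlation([0, 0, 5, 0, 0, 0], [0, 0, 0, 0, 5, 0])
```

The toy inputs are invented; they only show that profiles peaking at the same marker correlate strongly while profiles peaking at different markers do not.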
A second approach for estimating $C(j_1, j_2)$ would be to first estimate
$1_{t \in R_{j_1}}$ and $1_{t \in R_{j_2}}$ for each potential QTL $t$. Let $\hat{1}_{t \in R_j} = 1$ if $X_{jt}$ is a local
maximum and exceeds some threshold >0.22, 0 otherwise. Then,
calculating LOD scores by composite interval mapping (CIM) promises

$$\lim_{n \to \infty} \frac{\sum_{t=1}^{N} \hat{1}_{t \in R_{j_1}} \hat{1}_{t \in R_{j_2}}}{\sqrt{\hat{N}_{j_1} \hat{N}_{j_2}}} = C(j_1, j_2) \qquad (5)$$

where $\hat{N}_j = \sum_{t=1}^{N} \hat{1}_{t \in R_j}$ for $j \in \{j_1, j_2\}$. We avoid this approximation
because it performs very poorly at realistic sample sizes. Let QTL $t_0$ affect traits $j_1$
and $j_2$. Even when $n$ is large, the estimated locations for QTL $t_0$ will rarely
coincide perfectly, resulting in $\hat{1}_{t \in R_{j_1}} \hat{1}_{t \in R_{j_2}} = 0$ for all $t$ close to $t_0$. This
approximation does not account for the distance between the two
estimated positions. The obvious correction would be using a continuous
measure estimating $P(t \in R_j)$ or evidence of linkage instead of $\hat{1}_{t \in R_j}$. As
such evidence is often quantified by $X_{jt}$, we have returned to our original
approach.
Returning to our original focus, we calculate $\hat{C}(j_1, j_2) \equiv \hat{\rho}_L(j_1, j_2)$
for all possible pairs of traits and input those values in a proximity
matrix, $D$. Let $D_{j_1, j_2} = \hat{C}(j_1, j_2)$, where $D_{j_1, j_2}$ is the entry in the $j_1$th
row and $j_2$th column of $D$. Given a proximity matrix, there are
numerous methods available for finding groups, $T_1, T_2, \ldots, T_g$, of traits
such that $\sum_{i=1}^{g} \sum_{j_1, j_2 \in T_i} \hat{C}(j_1, j_2) = \sum_{i=1}^{g} \sum_{j_1, j_2 \in T_i} D_{j_1, j_2}$ is large. If our
estimates $\hat{C}(j_1, j_2)$ are accurate, then the identified clusters will result
in a large value of $\sum_{i=1}^{g} \sum_{j_1, j_2 \in T_i} C(j_1, j_2)$.
2.3 Similarity measures
Finding clusters, $T_1, T_2, \ldots, T_g$, of traits resulting in a large value of
$\sum_{i=1}^{g} \sum_{j_1, j_2 \in T_i} C(j_1, j_2)$ does not require estimating $C(j_1, j_2)$. We can
circumvent this step by identifying a statistic, or measure, $D$, that is
highly correlated with $C$. Given such a measure, we can construct a
proximity matrix, $D$, where $D_{j_1, j_2} = D(j_1, j_2)$. Applying an appropriate
clustering method to $D$ would result in groups of traits with a large
value of $\sum_{i=1}^{g} \sum_{j_1, j_2 \in T_i} C(j_1, j_2)$. Three candidate measures are the
correlation between expression levels, the correlation between vectors
$\hat{\beta}_{j_1}$ and $\hat{\beta}_{j_2}$, and the correlation between LOD score profiles. We must
avoid methods based on variance components because the genetic
component is not identifiable for Yvert et al.’s yeast experiment or any
experiment where the population is the progeny of a single cross.
The most prevalent method for finding genes linked to common
eQTL is clustering by expression profile (Brem and Kruglyak, 2005;
Eisen et al., 1998; Yvert et al., 2003). This method is equivalent to
defining $D_{j_1 j_2} = \hat{\rho}_{Pn}(j_1, j_2)$, where $\hat{\rho}_{Pn}(j_1, j_2)$ estimates the Pearson
correlation coefficient, $\rho_P$, between $\vec{Y}_{j_1}$ and $\vec{Y}_{j_2}$.

$$\hat{\rho}_{Pn} = \frac{\sum_{i=1}^{n} (Y_{j_1 i} - \bar{Y}_{j_1})(Y_{j_2 i} - \bar{Y}_{j_2})}{\sqrt{\sum_{i=1}^{n} (Y_{j_1 i} - \bar{Y}_{j_1})^2 \sum_{i=1}^{n} (Y_{j_2 i} - \bar{Y}_{j_2})^2}} \qquad (6)$$

In Yvert et al.’s experiment, genes were clustered into groups, $T_1, T_2, \ldots,
T_g$, such that if $j_1, j_2 \in T_k$, then $\hat{\rho}_{Pn}(j_1, j_2) > 0.725$. Here, genes could be
members of multiple groups. In many cases, $\rho_P$, and therefore $\hat{\rho}_{Pn}$,
J.N.Sampson and S.G.Self
should be strongly correlated with $C(\cdot,\cdot)$. Consider a linear model
describing the expression levels of two genes, each controlled by a
single QTL.

$$Y_{j_1 i} = \beta_{j_1 0} + \beta_{j_1 t} G_{ti} + \epsilon_{j_1 i}$$
$$Y_{j_2 i} = \beta_{j_2 0} + \beta_{j_2 t'} G_{t' i} + \epsilon_{j_2 i} \qquad (7)$$

where $\epsilon_{j_1 i}$ and $\epsilon_{j_2 i}$ are independent and normally distributed with
mean 0 and variances $\sigma^2_{e j_1}$, $\sigma^2_{e j_2}$, respectively. When both genes are
influenced by the same eQTL, $t = t'$,

$$\rho_P = \frac{\beta_{j_1 t} \beta_{j_2 t}}{\sqrt{(\sigma^2_{e j_1} + \beta^2_{j_1 t})(\sigma^2_{e j_2} + \beta^2_{j_2 t})}} \qquad (8)$$

When there is only one QTL, the following, desirable, statement
is true: $|\rho_P| > 0$ if and only if the two traits are genetically coregulated.
A discrete measure, $C(j_1, j_2) \in \{0, 1\}$, will never perfectly correlate with
a continuous measure, $D(j_1, j_2)$. However, if $\sigma^2_{ej}$ is constant for all genes,
$\rho_P$ is an increasing function of the genetic effect sizes, $\beta_{j_1}$ and $\beta_{j_2}$.
Therefore, as the influence of the shared QTL increases, $E[\hat{\rho}_P]$ will
also increase, another highly desirable characteristic. Unfortunately,
problems can arise when multiple genes influence each trait. Then, even
the simple statement from above fails, as $\rho_P = 0$ no longer implies the
absence of genetic correlation. We give an example later where $\rho_P(j_1, j_2)
= 0$ and $C(j_1, j_2) = 1$.
A second possible measure is $\hat{\rho}_\beta(j_1, j_2)$, the Pearson correlation
coefficient between $\hat{\beta}_{j_1}$ and $\hat{\beta}_{j_2}$, where $\hat{\beta}_j$ is the least squares estimate of
$\beta_j$ in Equation (10). If we believe that two genes share a common
function only when they share the same QTL and when those QTL have
identical influences, $\hat{\rho}_\beta$ would be the preferred statistic. However, this
second condition is superfluous if our ultimate aim is to group genes for
QTL mapping, so we still prefer a measure correlated with $C(j_1, j_2)$.
Therefore, $\hat{\rho}_\beta$ has two drawbacks. The signs of $\hat{\beta}_{j_1}$ and $\hat{\beta}_{j_2}$ affect $\hat{\rho}_\beta$, and
the size of $\hat{\beta}$, not the evidence for linkage, affects $\hat{\rho}_\beta$. The logical
replacement for $\hat{\beta}_{jt}$, which addresses those flaws, is the F-statistic
$\hat{\beta}^2_{jt} / \mathrm{var}(\hat{\beta}_{jt})$. However, $F = (n - 2)(10^{2 X_{jt}/n} - 1)$, so we are led back to a
function based on the LOD scores for our correlate to $C(j_1, j_2)$.
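The link between the F-statistic and the LOD score follows from the regression form of the LOD score, $X_{jt} \approx (n/2) \log_{10}(SST/SSE)$, together with $F = (n-2)(SST/SSE - 1)$ for a single-marker regression. A short Python sketch of the two conversions, with hypothetical function names, is:

```python
import math

def f_from_lod(lod, n_subjects):
    """F-statistic implied by a regression-based LOD score."""
    return (n_subjects - 2) * (10 ** (2.0 * lod / n_subjects) - 1.0)

def lod_from_f(f_stat, n_subjects):
    """Inverse conversion: LOD score implied by a single-marker F-statistic."""
    return (n_subjects / 2.0) * math.log10(1.0 + f_stat / (n_subjects - 2))

# With 100 subjects, a LOD score of 3 corresponds to this F-statistic.
f_at_lod_3 = f_from_lod(3.0, 100)
```

Because the mapping is monotone, ranking markers by F and ranking them by LOD score give the same order.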
The third possible measure, and our focus for the remainder of the
article, is $\hat{\rho}_L$. Unlike $\hat{\rho}_P$, $\hat{\rho}_L$ incorporates both expression and genetic
data. Furthermore, $\hat{\rho}_L$ can compare traits measured on distinct
populations. Both traits could be expression levels or one trait could
be a more general characteristic, such as size or life expectancy. Finally,
there is a robust correlation between $\hat{\rho}_L$ and $C$ even in small samples,
which will be illustrated by later examples.
2.4 Asymptotic properties of $\hat{\rho}_{LNn}$
Let traits $j_1$ and $j_2$ have the distribution described by equation 10. To
simplify calculations, we make the following three assumptions: (1) The
entire genome is on a single chromosome. (2) The markers are evenly
spaced at intervals of length $d$ cM. (3) Haldane’s mapping rule describes
the recombination rate between loci. Under the absolute null hypothesis,
$H_{00}$: No Genes Influence Either Trait ($H_{00}: \vec{\beta}_{j_1} = \vec{\beta}_{j_2} = \vec{0}$), we
proved that $\hat{\rho}_{LNn}$ converges to a normal distribution as $N, n \to \infty$
(Supplementary Material).

$$\sqrt{N} \hat{\rho}_{LNn} \equiv \sqrt{N} \left( \frac{\sum_{t=1}^{N} (X_{j_1 t} - \hat{\mu}_{j_1})(X_{j_2 t} - \hat{\mu}_{j_2})}{\sqrt{\sum_{t=1}^{N} (X_{j_1 t} - \hat{\mu}_{j_1})^2 \sum_{t=1}^{N} (X_{j_2 t} - \hat{\mu}_{j_2})^2}} \right) \to_d N\left(0, \frac{\sigma^2_{j_1 j_2}}{\mathrm{var}(X_{j_1 t}) \mathrm{var}(X_{j_2 t})}\right) \qquad (9)$$

where $\sigma^2_{j_1 j_2} = 4.61^{-4} \left(4 + 2 \sum_{t=1}^{\infty} 4 e^{-0.08 d t}\right)$ is the asymptotic variance
of $\frac{1}{\sqrt{N}} \sum_{t=1}^{N} (X_{j_1 t} - \hat{\mu}_{j_1})(X_{j_2 t} - \hat{\mu}_{j_2})$ and $\mathrm{var}(X_{j_1 t}) = \mathrm{var}(X_{j_2 t}) = 2/4.61^2$.
In the Supplementary Material, we show that as long as there are at
least 200 subjects, the null distribution of $\hat{\rho}_{LNn}$ should be nearly normal
and can be well approximated by the asymptotic null for a chromosome
of sufficient length. For some genomes or when the number of available
markers is small, the null distribution of $\hat{\rho}_L$ can be heavily skewed
and can be better approximated by modifying the asymptotic
distribution (see Supplementary Material). The appropriately calculated null distribution can provide P-values, and other measures of
statistical significance, for experimental results. Violating any of the
three assumptions had little effect on the null distribution (see
Supplementary Material).
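To make the asymptotic null concrete, the Python sketch below evaluates the variance in equation 9 and uses it for a normal-approximation P-value. The geometric series is summed in closed form, so the scaled variance reduces to $1 + 2q/(1-q)$ with $q = e^{-0.08d}$. The function names and the one-sided (upper-tail) test are assumptions for illustration, and the sketch ignores the skewness correction mentioned above for genomes with few markers.

```python
import math

def scaled_null_variance(d_cm):
    """Asymptotic variance of sqrt(N) * rho_hat_L under the absolute null.

    d_cm: marker spacing in cM (must be positive).
    """
    q = math.exp(-0.08 * d_cm)
    series = q / (1.0 - q)                       # sum over t >= 1 of exp(-0.08 d t)
    sigma2 = (4.0 + 2.0 * 4.0 * series) / 4.61 ** 4
    var_lod = 2.0 / 4.61 ** 2                    # var(X_jt) under the null
    return sigma2 / (var_lod * var_lod)

def upper_tail_p_value(rho_hat, n_markers, d_cm):
    """One-sided normal-approximation P-value for an observed rho_hat_L."""
    sd = math.sqrt(scaled_null_variance(d_cm) / n_markers)
    z = rho_hat / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Example: 500 evenly spaced markers, 10 cM apart.
p_example = upper_tail_p_value(0.4, 500, 10.0)
```

Closer markers are more strongly correlated, so the variance of $\hat{\rho}_L$ for a fixed number of markers grows as the spacing $d$ shrinks.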
2.5 Clustering method
We propose a three-step method for identifying clusters of traits that
share common QTL. (1) Calculate the truncated LOD score profile,
$\vec{X}^{trunc}_j$, for each of the $n$ traits, where $X^{trunc}_{jt} = \min(X_{jt}, 6)$. Without
truncation, traits with an extreme LOD score at position $t$ will be
correlated to any trait $j$ where $X_{jt}$ is even modestly larger than $E[X_{jt}]$.
Truncation also ensures that $\hat{\rho}_L$ is only large when traits share multiple
QTL. Simulations suggested the threshold of six greatly reduced the
type I error rate without noticeably lowering power. (2) Form an $n \times n$
similarity matrix where the $(j_1, j_2)$th entry is $\hat{\rho}_L(j_1, j_2)$, calculated using the
truncated LOD scores at all markers. (3) Use a hierarchical clustering
method (such as hclust, method = ‘complete’ in R) to order the traits.
Then, select groups of traits where $\hat{\rho}_L(j_1, j_2) > c$ for all included trait
pairs, $j_1$ and $j_2$, where $c$ is a predefined threshold. These groups can be
subsequently ranked by their size and/or mean value of $\hat{\rho}_L(\cdot,\cdot)$.
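The three steps can be sketched in plain Python as follows. This is an illustrative stand-in, not the authors' R code: it replaces `hclust` with a simple greedy complete-linkage merge that stops once the weakest cross-cluster similarity no longer exceeds the threshold, so it returns a partition in which every within-cluster pair exceeds `c`. All function names are hypothetical.

```python
import math

LOD_CAP = 6.0  # truncation level from step (1)

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a flat profile (an assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def similarity_matrix(profiles, cap=LOD_CAP):
    """Steps (1) and (2): truncate each profile, then correlate all pairs."""
    truncated = [[min(v, cap) for v in p] for p in profiles]
    k = len(truncated)
    return [[1.0 if a == b else pearson(truncated[a], truncated[b])
             for b in range(k)] for a in range(k)]

def complete_linkage_clusters(sim, c):
    """Step (3): merge clusters while the weakest cross pair exceeds c."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, best_pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link > best:
                    best, best_pair = link, (a, b)
        if best <= c:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy profiles: traits 0 and 1 share two peaks, trait 2 peaks elsewhere.
toy = [[0, 1, 8, 1, 0, 0, 3, 0], [0, 1, 9, 1, 0, 0, 3, 0], [0, 0, 0, 0, 4, 0, 0, 0]]
toy_sim = similarity_matrix(toy)
toy_clusters = complete_linkage_clusters(toy_sim, 0.4)
```

The pairwise search makes this sketch cubic or worse in the number of traits, so it is only meant for small illustrations; a library implementation of complete-linkage clustering would be the practical choice for thousands of genes.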
2.6 Simulations
We could simulate small groups of coregulated traits, and then examine
whether the above method can correctly cluster those subgroups when
applied to the union of all traits. However, these simulations would
introduce multiple variables simultaneously. Instead, we focus on groups
of two coregulated traits, $j_1$ and $j_2$, and calculate the probability of
correctly clustering those two traits together, or equivalently, calculating $P(\hat{\rho}_L(j_1, j_2) > c_\alpha)$. We refer to $P(\hat{\rho}_L(j_1, j_2) > c_\alpha)$ as the ‘power’ of
our clustering method, because in this two trait example, our clustering
method is equivalent to a test that rejects the null hypothesis $H_0: \{R_1\} \cap
\{R_2\} = \emptyset$ when $\hat{\rho}_L > c_\alpha$. We define $c_\alpha$ so the probability of clustering two
unrelated traits together, the ‘type I error rate’, is $\alpha$. In each of the
scenarios described below, we generate 10 000 values of $\hat{\rho}_L(j_1, j_2)$ under
the null set-up and define the 95th percentile as $c_{0.05}$. We then generate
values under the alternative, and define the proportion exceeding $c_{0.05}$
as the power. Full, as opposed to truncated, LOD scores could be used
to calculate $\hat{\rho}_L(j_1, j_2)$ because we avoided genes with extreme effects.
Using identical methodology, we also calculate the power for the test
rejecting $H_0$ when $\hat{\rho}_P(j_1, j_2)$ is large.
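The threshold and power calculation described above can be sketched in a few lines of Python. The inputs are lists of simulated statistics; the function names and the particular percentile convention (the smallest value with at least a fraction $1 - \alpha$ of the null values at or below it) are assumptions made for this example.

```python
import math

def empirical_threshold(null_values, alpha=0.05):
    """Upper (1 - alpha) empirical percentile of the simulated null statistics."""
    ordered = sorted(null_values)
    rank = int(math.ceil((1.0 - alpha) * len(ordered)))
    index = min(len(ordered) - 1, max(0, rank - 1))
    return ordered[index]

def empirical_power(alt_values, threshold):
    """Proportion of statistics simulated under the alternative that exceed c."""
    return sum(1 for v in alt_values if v > threshold) / len(alt_values)

# Toy inputs standing in for simulated correlation values.
toy_null = [i / 1000.0 - 0.05 for i in range(100)]
toy_alt = [0.02, 0.04, 0.06, 0.30, 0.45]
c_05 = empirical_threshold(toy_null, alpha=0.05)
toy_power = empirical_power(toy_alt, c_05)
```

The same two functions apply unchanged to either correlation measure, which is what allows the two tests to be compared at a common type I error rate.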
For each individual in a group of offspring, we simulated two
phenotypes and a set of SNPs, spaced every 10 cM, along a single
chromosome. The first marker was randomly assigned to be 1 or $-1$,
indicating parental origin, and the remaining markers were generated
according to Haldane’s mapping function. In the first set of
simulations, the phenotypes, $Y_{j_1}$ and $Y_{j_2}$, were generated according to
equation 10 with $N_{j_1} = N_{j_2} = 1$. Data were simulated for 100 subjects
when the trait heritability, $H$, was 0.05, 0.10 and 0.20 and for 1000
subjects when $H$ was 0.010, 0.015 and 0.020. In this simple model,
$H = \beta^2/(\beta^2 + \sigma^2)$. Simulations were repeated for genome lengths of
1000, 5000, and 10 000 cM. Under the null hypothesis, the genes
influencing traits $j_1$ and $j_2$ were located at the $0.3N$th and $0.7N$th
marker, respectively, and under the alternative, both genes were located
at the $0.5N$th marker.
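A minimal Python sketch of this simulation design is given below. It assumes Haldane's map function in its usual form, $r = \frac{1}{2}(1 - e^{-2d/100})$ for a distance of $d$ cM, a unit residual variance, and hypothetical function names; it is not the authors' simulation code.

```python
import math
import random

def haldane_recombination(d_cm):
    """Recombination fraction between loci d_cm centimorgans apart."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def simulate_genotypes(n_markers, d_cm, rng):
    """One haploid segregant: +1/-1 codes along a single chromosome."""
    r = haldane_recombination(d_cm)
    genotypes = [rng.choice((1, -1))]            # first marker: random parent
    for _ in range(n_markers - 1):
        previous = genotypes[-1]
        genotypes.append(-previous if rng.random() < r else previous)
    return genotypes

def effect_size(heritability):
    """beta such that beta^2 / (beta^2 + 1) equals the heritability."""
    return math.sqrt(heritability / (1.0 - heritability))

def simulate_phenotype(genotypes, qtl_index, heritability, rng):
    """Single-QTL trait value with unit residual variance (an assumption)."""
    return effect_size(heritability) * genotypes[qtl_index] + rng.gauss(0.0, 1.0)

# Example: 100 subjects, a 1000 cM chromosome with markers every 10 cM,
# and a QTL at the middle marker with heritability 0.20.
rng = random.Random(2008)
subjects = [simulate_genotypes(101, 10.0, rng) for _ in range(100)]
trait = [simulate_phenotype(g, 50, 0.20, rng) for g in subjects]
```

Under this map function the correlation between adjacent marker genotypes is $1 - 2r = e^{-0.02d}$, which is consistent with the $e^{-0.08dt}$ term in equation 9 (the fourth power of that correlation at lag $t$).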
Table 1. The $E[\hat{\rho}_L]$ is calculated under the null ($R_0 \cap R_1 = \emptyset$) and alternative ($R_0 \cap R_1 \neq \emptyset$) hypotheses under multiple combinations of subject
number, heritability, and chromosome length. The power of the test, reject $H_0: R_0 \cap R_1 = \emptyset$ when $\hat{\rho}_L > c_{0.05}$, is derived through simulation. For
comparison, $E[\hat{\rho}_P]$ and the power from a $\hat{\rho}_P$-based test are also listed. Under each chromosome length, Null and Alt give $E[\hat{\rho}_L]$; under Phenotype, they give $E[\hat{\rho}_P]$.

                    1000 cM               5000 cM               10000 cM              Phenotype
Subjects  H         Null   Alt    Power   Null   Alt    Power   Null   Alt    Power   Null   Alt    Power
100       0.05      0.02   0.16   0.27    0.00   0.05   0.18    0.00   0.03   0.14    0.00   0.05   0.08
100       0.10      0.04   0.41   0.72    0.00   0.18   0.54    0.00   0.11   0.47    0.00   0.10   0.17
100       0.20      0.07   0.73   0.99    0.01   0.48   0.98    0.00   0.35   0.96    0.00   0.20   0.53
1000      0.010     0.04   0.39   0.70    0.00   0.17   0.55    0.00   0.10   0.47    0.00   0.010  0.06
1000      0.020     0.06   0.56   0.92    0.01   0.29   0.81    0.00   0.19   0.77    0.00   0.020  0.07
1000      0.025     0.07   0.68   0.99    0.01   0.42   0.96    0.00   0.29   0.93    0.00   0.025  0.10

For the second set of simulations, we fixed the number of subjects
(100), heritability of each QTL (i.e. $\beta_t^2/(\beta_t^2 + \sigma^2) = 0.025$), and genome
length (5000 cM). The two phenotypes $Y_{j_1}$ and $Y_{j_2}$ were generated
according to the following equation.
$$Y_{j_1 i} = \beta_0 + \sum_{t=1}^{N} \beta_{j_1 t} G_{ti} + \epsilon_{j_1 i}$$
$$Y_{j_2 i} = \beta_0 + \sum_{t=1}^{N} \beta_{j_2 t} G_{ti} + \epsilon_{j_2 i} \qquad (10)$$

where there exists a single $\beta$ such that $|\beta_{jt}| = \beta$ when $|\beta_{jt}| > 0$ for $j \in
\{j_1, j_2\}$ and $t \in \{1, \ldots, N\}$. Also, let $\mathrm{var}(\epsilon_{j_1 i}) = \mathrm{var}(\epsilon_{j_2 i}) = \sigma^2_\epsilon$. Each
phenotype had two, four or six influential QTL
($N_j \equiv \sum_{t=1}^{N} 1_{\beta_{jt} \neq 0} \in \{2, 4, 6\}$). The total heritability, $H_T$, was defined
by $H_T = \sum_t \beta^2_{jt} / (\sum_t \beta^2_{jt} + \sigma^2_\epsilon)$. The set of $N_j$ QTL regulating phenotype
$j$ can be abbreviated by $R_j \subset \{1, \ldots, N\}$. The influential genes for trait
$j_1$ were evenly spaced over the first half of the chromosome, $R_1 =
\{N/(2N_1), \ldots, N_1 N/(2N_1)\}$. Under the null hypothesis, the influential
genes for trait $j_2$ were evenly spaced along the second half of the
chromosome, $R_2 = N/2 + \{N/(2N_1), \ldots, N_1 N/(2N_1)\}$, whereas under the
alternative, they were shifted by a constant so that a specified percentage
of QTL overlapped, $R_2 = \text{constant} + \{N/(2N_1), \ldots, N_1 N/(2N_1)\}$. With
multiple QTL, the direction of effect can influence our results. In
addition to letting all elements of $\vec{\beta}_{j_1}$ and $\vec{\beta}_{j_2}$ be positive, we also
examine the scenario where the signs of $\vec{\beta}_{j_2}$ alternate between positive
and negative (i.e. $\beta_{j_2 t_{21}} = -\beta_{j_2 t_{22}} = \beta_{j_2 t_{23}} = -\beta_{j_2 t_{24}}$). To define the concept of
alternating more generally, let $s_{jt} = \mathrm{sign}(\beta_{jt})$ and $S = N_0^{-1} \sum_t s_{1t} s_{2t}$. Then
the alternating case is defined by $S = 0$.
2.7 Experiment
Yvert and colleagues (2003) measured the expression of 6818 genes
in laboratory (BY) and wild (RM) strains of S.cerevisiae and in 112
segregants from a cross between them. Including multiple replicates for
each parent, expression was measured on 130 samples. In addition to
this genomics data, their lab genotyped each member of the two
generations at 3114 genetic markers. Because genotypes at adjacent
markers were often nearly identical, only a subset of 1063 marker
locations were chosen for calculating the LOD score correlation
coefficient. A complete description of the experiment has been
previously published (Yvert et al., 2003).
2.8 Composite interval mapping
In the analysis of both simulations and experimental results, we
calculated the LOD score profile using CIM. In addition to the locus
of interest, the nearest markers, on each side, at least 15 cM away were
also included in the Haley–Knott regression. For the experimental
results, the ‘at least 15 cM’ requirement was replaced by ‘differing
genotypes for at least 15 subjects’. All significant loci outside the
surrounding interval were also included in the regression. The group of
significant loci, $\Omega$, was initially defined as $t'$, where $X_{jt'} = \max_t(X_{jt})$.
Given a set $\Omega$, the LOD score, $X^{\Omega}_{jt}$, for each position was then
recalculated with all significant loci included in the regression. Until
$\max_t(X^{\Omega}_{jt}) < 1.25$ or $\Omega$ contained eight loci, $t'$ was added to $\Omega$, where
$X^{\Omega}_{jt'} = \max_t(X^{\Omega}_{jt})$. If $t'' \in \Omega$, $X^{\Omega}_{jt''} = \min_{t \in \Omega}(X^{\Omega}_{jt})$, and $X^{\Omega}_{jt''} < 1.25$, $t''$ was
dropped from the set $\Omega$. This abbreviated version of CIM was chosen to
accommodate the large number of traits.
3 RESULTS
3.1 Simulations
Fundamental characteristics of $\hat{\rho}_L$ are revealed by simulating
two traits, $j_1$ and $j_2$, each controlled by a single, and possibly
common, gene (Table 1). As previously discussed, $E[\hat{\rho}_L]$ usually
exceeds 0 if and only if $C(j_1, j_2) > 0$. When each trait is
controlled by a distinct gene, the maximum of $E[X_{j_1 t}]$ ($E[X_{j_2 t}]$)
occurs at a position $t$ where $E[X_{j_2 t}]$ ($E[X_{j_1 t}]$) is small, resulting in
negative correlation. As predicted, $E[\hat{\rho}_L]$ increases dramatically
with heritability, for example, ranging from 0.16 ($H = 0.05$) to
0.73 ($H = 0.20$) when there are 100 subjects and a 1000 cM
chromosome. In general, $E[\hat{\rho}_L]$ will also increase with the
number of subjects and decrease with genome size. The
variance of $\hat{\rho}_L$ decreases with chromosome length, because
for any given sample size, $\hat{\rho}_L \to_p 0$. The variance shrinks as the sample
grows because $\hat{\rho}_L \to_p 1$.
In all scenarios, the power for rejecting $H_0$ was far greater
when using tests based on $\hat{\rho}_L$, compared to tests based on $\hat{\rho}_P$
(Table 1). The relative advantage of the former appears to
increase with sample size. With 100 subjects, the power is larger
by about a factor of 2, whereas with 1000 subjects, the power is
larger by about a factor of 10. The relative power is increasing,
in part, because $E[\hat{\rho}_L]$ quickly approaches 1 as $n$ increases,
whereas $E[\hat{\rho}_P]$ remains constant.
Figure 1 shows that $E[\hat{\rho}_L]$ increases with $C(j_1, j_2)$ and $H_T$. With
only 100 subjects, the population size of Yvert’s experiment,
$H_T$ still affects $E[\hat{\rho}_L]$ and $E[\hat{\rho}_L]$ is a poor approximation
of $C(j_1, j_2)$. However, we see that even for sample sizes where
$\hat{\rho}_L \approx C(j_1, j_2)$ is not true, tests based on $\hat{\rho}_L$ still have power to
identify correlated traits. Table 2 shows that these tests can be far
more powerful than those based on $\hat{\rho}_P$. Table 2 also shows the
power tends to be slightly smaller when the signs of the elements
of $\vec{\beta}_{j_2}$ alternate, even though the loci are, for practical purposes,
independent of each other. In our simulations, whenever $C(j_1, j_2)
> 0$, $E[\hat{\rho}_L \mid S = 0] < E[\hat{\rho}_L \mid S \neq 0]$, implying that contrary to the
marginal distributions, the joint distribution of $\vec{X}_{j_1}$ and $\vec{X}_{j_2}$
depends on the signs of $\vec{\beta}_{j_1}$ and $\vec{\beta}_{j_2}$. To understand this
phenomenon, we focus on the 2 QTL example. Let $t_1$ and $t_2$ be
the locations of the two influential genes. Although
$E[\sum_{i=1}^{n} G_{t_1 i} G_{t_2 i}] = 0$, $\Pr(\sum_{i=1}^{n} G_{t_1 i} G_{t_2 i} > 0) > 0$. When this
event occurs and $\beta_{j_1 t_2} = \beta_{j_2 t_2}$, both $X_{j_1 t_1}$ and $X_{j_2 t_1}$ tend to increase.
When $\sum_{i=1}^{n} G_{t_1 i} G_{t_2 i} < 0$, both $X_{j_1 t_1}$ and $X_{j_2 t_1}$ tend to decrease.
The values of $X_{j_1 t_1}$ and $X_{j_2 t_1}$ change together, or in unison.
The same is true for the LOD scores at $t_2$. In contrast, when
$\beta_{j_1 t_2} = -\beta_{j_2 t_2}$, $X_{j_1 t_1}$ tends to be higher than $E[X_{j_1 t_1}]$ when $X_{j_2 t_1}$
is lower than $E[X_{j_2 t_1}]$. Here, the extra error is negatively correlated. In the Supplementary Materials, we provide the mathematical details and show that
$E[(X^{+}_{j_1 t_1} - E[X^{+}_{j_1 t_1}])(X^{+}_{j_2 t_1} - E[X^{+}_{j_2 t_1}])] - E[(X^{-}_{j_1 t_1} - E[X^{-}_{j_1 t_1}])(X^{-}_{j_2 t_1} - E[X^{-}_{j_2 t_1}])]$
will be proportional to $(\beta_2^2 + \sigma^2)^{-2}\, 8 n \beta_1^2 \beta_2^2$. The superscript indicates whether $S = 0$ ($-$)
or $S = 1$ ($+$) when comparing $\vec{\beta}_{j_1}$ and $\vec{\beta}_{j_2}$. As with the single
QTL examples, we again found that power for rejecting $H_0$ was
greater when using tests based on $\hat{\rho}_L$, compared to tests based on
$\hat{\rho}_P$. Now, we see that when $S = 0$ or the signs of $\vec{\beta}_2$ alternate, the
advantage of the former is even greater. As promised in section
2.3, in the examples with $S = 0$, the power to detect traits sharing
common QTL using $\hat{\rho}_P$ is only equal to the alpha level, 0.05, even
when $C(j_1, j_2) = 1$.

[Figure 1: E[Corr. Coef] (y-axis, 0.00 to 0.20) plotted against C (x-axis, 0.0 to 1.0), with separate curves for 2 QTL, 4 QTL and 6 QTL.]
Fig. 1. The $E[\hat{\rho}_L]$ (y-axis) are compared for multiple values of $C(j_1, j_2)$
(x-axis) and multiple numbers of QTL, $N_1 = N_2 \in \{2, 4, 6\}$ (equivalently
$H_T \in \{0.05, 0.09, 0.13\}$). $\hat{\rho}_L$ was simulated using single point mapping, a
5000 cM chromosome, 100 subjects, and $S = 1$.

Table 2. The power for each of the following three tests: Reject $H_0: R_0
\cap R_1 = \emptyset$ when $\hat{\rho}_L > c_{0.05}$ ($X_j$ calculated by either single point, s.p.,
mapping or CIM), and reject when $\hat{\rho}_P > c''_{0.05}$, are derived for multiple
combinations of QTL number, % overlap, and $S$. Simulations assume a
5000 cM genome and 100 subjects

                                 Test Power
Number of QTL  % Overlap  S      rho_L (s.p.)  rho_L (CIM)  rho_P
2              50         0      0.164         0.554        0.079
2              50         1      0.162         0.536        0.073
2              100        0      0.277         0.632        0.052
2              100        1      0.333         0.624        0.155
4              25         0      0.119         0.546        0.071
4              25         1      0.126         0.548        0.071
4              50         0      0.195         0.566        0.047
4              50         1      0.239         0.612        0.120
4              75         0      0.303         0.586        0.068
4              75         1      0.390         0.666        0.255
4              100        0      0.406         0.610        0.049
4              100        1      0.564         0.772        0.419

3.2 Experiment
We calculated the LOD score profiles for the 6818 genes
measured by Yvert and colleagues (2003). Following the steps
outlined in section 2.5, we formed clusters of genes where the
values for $\hat{\rho}_L$ exceeded 0.4 for all member pairs. The threshold
was decided after approximating the absolute null distribution
for $\hat{\rho}_L$ by permuting the subject labels for each trait, recalculating the LOD score profiles with CIM, and then using those
new profiles to calculate values of $\hat{\rho}_L$. Applying our clustering
method to these permuted traits, we found 643, 236, 50 and
3 clusters of two, three, four and five genes, respectively, where
all member pairs had $\hat{\rho}_L$ exceeding 0.4. No clusters contained
more than five genes. Since we focused our interest on larger
clusters with at least 10 genes, we did not repeat this
computationally expensive permutation step to estimate
P-values for the smaller clusters. This permutation method
leads to a distribution of $\hat{\rho}_L$ under the absolute null, $H_{00}$: There
are no QTL. In practice, we often found little difference between
this distribution and one assuming $H_0$ (data not shown).
From the actual, experimental values, over 34 854 pairs had
a $\hat{\rho}_L > 0.4$. We then ranked all clusters by their average value
of $\hat{\rho}_L$ and focused on the top 20 clusters with at least 10 genes
(see Supplementary Material for the genes within each cluster).
Genes within these clusters tended to have the same molecular
functions and biological processes, as determined by Gene
Ontology (GO) annotations (see Supplementary Material).
The GO project creates a common vocabulary to describe genes
and their products (www.geneontology.org). For example, the
biological process of all annotated genes in the highest ranking
cluster included ‘DNA metabolic process’ and ‘Organelle
organization and biogenesis’. The molecular function of all of
these genes included ‘Helicase activity.’ Therefore, we now have
potential functions for the six genes in this group that
previously had no annotation. Figure 2 illustrates that the 16
genes in that cluster share multiple QTL. At least 67% of all
annotated genes in eight additional clusters shared a common
molecular function or biological process. Additionally, 77% of
the genes in cluster 16 participate in a metabolic process, but
the annotations discriminate between RNA, DNA, and amino
acid metabolic processes. The mean $\hat{\rho}_L$ was 0.6 for the four
genes labelled as part of the amino acid metabolic process,
suggesting that those share nearly identical QTL with each
other, but only a portion of their QTL with genes involved in
other metabolic processes.
Fig. 2. Each row represents the LOD score profile for a gene in cluster
1. LOD scores are color-coded: 0–1 (white), 1–2 (yellow), 2–3 (orange),
3–4 (red), 4–5 (purple) and 5–6 (black). Numbers/blue lines on the x-axis indicate chromosomes.
The overall correlation between $\hat{\rho}_L$ and $|\hat{\rho}_P|$ was 0.37. Among
the 215 661 pairs (1%) which were ranked highest by $\hat{\rho}_L$, 64 232
were also within the top 1% of all pairs as ranked by $|\hat{\rho}_P|$.
Moreover, among the 20 clusters ($\geq 10$ genes) which ranked
highest by $\hat{\rho}_L$, most had large mean values of $|\hat{\rho}_P|$ (see
Supplementary Material). Therefore, in this case, where both
clustering methods are applicable, the clusters with the largest
values of $\hat{\rho}_L$ would also be identifiable when clustering by
phenotype, suggesting that few coregulated pairs can be
described by the alternating scenario (S = 0).
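The comparison of the two rankings can be illustrated with a small, self-contained Python sketch. It assumes that $\hat{\rho}_L$ is the Pearson correlation between two LOD score profiles and that $|\hat{\rho}_P|$ is the absolute Pearson correlation between two expression vectors; the trait names, the toy LOD and expression values, and the helper names are all hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pair_scores(profiles, absolute=False):
    """Correlation for every unordered pair of traits in `profiles`."""
    scores = {}
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        scores[(a, b)] = abs(r) if absolute else r
    return scores


def top_fraction(scores, fraction):
    """Set of pairs ranked in the top `fraction` by score (at least one pair)."""
    k = max(1, round(len(scores) * fraction))
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


# Toy LOD score profiles over four markers (hypothetical values).
lod = {
    "g1": [0.2, 3.0, 0.1, 0.3],
    "g2": [0.1, 2.5, 0.2, 0.2],
    "g3": [2.8, 0.1, 0.3, 0.2],
    "g4": [0.3, 0.2, 0.1, 2.9],
}
# Toy expression values over five individuals (hypothetical values).
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [1.1, 1.9, 3.2, 3.9, 5.1],
    "g3": [2.0, 1.0, 2.5, 1.5, 2.2],
    "g4": [3.0, 1.0, 2.0, 3.5, 1.5],
}

rho_l = pair_scores(lod)                  # LOD profile correlation
rho_p = pair_scores(expr, absolute=True)  # absolute phenotype correlation
shared = top_fraction(rho_l, 0.2) & top_fraction(rho_p, 0.2)
print(shared)
```

In the analysis above the fraction is 1%; the toy example uses 20% only so that the top set is non-empty with six pairs.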
4 DISCUSSION
The LOD score correlation coefficient, $\hat{\rho}_L$, quantifies the
evidence that two traits share common QTL. Ideally, $\hat{\rho}_L$ can
be interpreted as $C(j_1, j_2)$, the average percentage of genes which
are common to both traits. However, even in the absence of this
ideal, we demonstrated that $\hat{\rho}_L$ will still be strongly correlated
with $C(j_1, j_2)$ and that the statistic can be used to identify clusters
of coregulated traits. In this article, both traits are expression
profiles and are measured on a single population. Fortunately,
$\hat{\rho}_L$ is equally valid when assessing the genetic coregulation for
any two quantitative traits that are measured on any two
populations. Therefore, $\hat{\rho}_L$ offers the most widely applicable
and statistically rigorous means for identifying genetically
coregulated traits.
As more laboratories are focusing on genetical genomics,
we need new methods to synthesize results. Investigators often
focus on a specific subset of genes. The subset may be
determined by their expression platform or by their experimental goals. Each manufacturer includes a different set of
probes (i.e. different genes) on their microarrays (Verdugo and
Medrano, 2006) and labs often limit their measurements to a
specific type of tissue or a specific set of genes thought to be
associated with a disease (MacLaren and Sikela, 2005;
Yamashita et al., 2005). As an example of how $\hat{\rho}_L$ will
immediately impact the field of genetical genomics, we offer
WebQTL (http://www.genenetwork.org), a database that
includes LOD score profiles for an expansive list of traits in
mouse, rat and barley. This website was designed to find
coregulated genes. Currently it assesses coregulation by $\hat{\rho}_P$ and
is limited to searching traits measured on the same cross. Our
introduction of $\hat{\rho}_L$ will now allow for previously impossible
comparisons.
Clearly, $\hat{\rho}_L$ has limitations because it is a function of LOD
scores. The quality of $\hat{\rho}_L$ is limited by the quality of the LOD
score profiles, which are often very noisy. False positives will
occur when two traits are influenced by linked, but not
identical, genes. Moreover, there are other statistics which
may better compare two LOD score profiles. In the future,
we should explore statistics that account for the number of
overlapping QTL and preprocess the LOD scores by smoothing
their profiles before comparison. With enough smoothing,
correlation will be based only on the largest peaks. Also, we
might search for a method to better estimate P-values and
FDR. Currently, our null distributions, from theory or
permutation, both assume the absence of any genetic influence,
and therefore are only close approximations to the desired null
distributions.
At this stage, we propose the obvious two-step procedure for
improved QTL mapping. First, search for genetically coregulated traits. Then, perform joint linkage mapping on those
traits. The P-values from standard joint linkage methods are no
longer valid, as we have purposely grouped traits that appear to
have similar LOD score profiles. In future research, we hope to
combine clustering and linkage so we can assign a single,
meaningful, significance level for each purported gene-trait
pair. As we improve our methods and genetical genomics
continues to gain popularity, $\hat{\rho}_L$ will become increasingly
important in identifying genetic coregulation.
ACKNOWLEDGEMENTS
This work was funded by the National Institute of Dental and
Craniofacial Research (T32DE07132). We wish to thank John
Storey and Elizabeth Thompson for their valuable comments,
and also thank the anonymous reviewers for their extremely
insightful suggestions.
Conflict of Interest: none declared.
REFERENCES
Brem,R. and Kruglyak,L. (2005) The landscape of genetic complexity across
5700 gene expression traits in yeast. Proc. Natl Acad. Sci., 102, 1572–1577.
Brem,R.B. et al. (2002) Genetic dissection of transcriptional regulation in
budding yeast. Science, 296, 752–755.
Chen,Z. (2005) The full EM algorithm for the MLEs of QTL effects and positions and
their estimated variances in multiple-interval mapping. Biometrics, 61,
474–480.
Eisen,M.B. et al. (1998) Cluster analysis and display of genome-wide expression
patterns. Proc. Natl Acad. Sci., 95, 14863–14868.
Haley,C.S. et al. (1994) Mapping quantitative trait loci in crosses between
outbred lines using least squares. Genetics, 136, 1195–1207.
Hastie,T. et al. (2001) The Elements of Statistical Learning. 1st edn. Springer,
Canada.
Jiang,C. and Zeng,Z.B. (1995) Multiple trait analysis of genetic mapping for
quantitative trait loci. Genetics, 140, 1111–1127.
Knott,S.A. and Haley,C.S. (2000) Multitrait least squares for quantitative trait
loci detection. Genetics, 156, 899–911.
Korol,A.B. et al. (1996) Linkage between quantitative trait loci and marker loci:
resolution power of three statistical approaches in single marker analysis.
Biometrics, 52, 426–441.
Kraft,P. et al. (2003) Multivariate variance components analysis of longitudinal
blood pressure measurements from the Framingham Heart Study. BMC
Genetics, 4 (suppl. 1), S55–S61.
Lander,E.S. and Botstein,D. (1989) Mapping Mendelian factors underlying
quantitative traits using RFLP linkage maps. Genetics, 121, 185–199.
Lehmann,E.L. and Casella,G. (1998) Theory of Point Estimation. 2nd edn.
Springer, New York, Chapter 6. pp. 429–475.
Li,J. and Burmeister,M. (2005) Genetical genomics: combining genetics with gene
expression analysis. Hum. Mol. Genet., 14 (suppl. 2), R163–169.
MacLaren,E.J. and Sikela,J.M. (2005) Cerebellar gene expression profiling and
eQTL analysis in inbred mouse strains selected for ethanol sensitivity. Alcohol
Clin. Exp. Res., 29, 1568–1579.
Mangin,B. et al. (1998) Pleiotropic QTL analysis. Biometrics, 54, 88–99.
Martin,L.J. et al. (2003) Phenotypic, genetic, and genome-wide structure in the
metabolic syndrome. BMC Genetics, 4 (suppl. 1), S95–S99.
Schadt,E. et al. (2005) An integrative genomics approach to infer causal
associations between gene expression and disease. Nat. Genet., 37, 710–717.
Segal,E. et al. (2003) Genome-wide discovery of transcriptional modules from
DNA sequence and gene expression. Bioinformatics, 19 (suppl. 1), i273–282.
Verdugo,R.A. and Medrano,J.F. (2006) Comparison of gene coverage of mouse
oligonucleotide microarray platforms. BMC Genomics, 7, 58.
Weller,J. et al. (1996) Application of a canonical transformation to detection of
quantitative trait loci with the aid of genetic markers in multitrait
experiments. Theor. App. Genet., 92, 998–1002.
Yamashita,S. et al. (2005) Expression quantitative trait loci analysis of 13 genes in
the rat prostate. Genetics, 171, 1231–1238.
Yvert,G. et al. (2003) Trans-acting regulatory variation in Saccharomyces
cerevisiae and the role of transcription factors. Nat. Genet., 35, 57–63.
APPENDIX A
When the following three assumptions hold and the numbers of
subjects and markers are large, $\hat{\rho}_L(j_1, j_2) \approx C(j_1, j_2)$.
ASSUMPTION 1. $k_1, k_2 \ll 1$.
ASSUMPTION 2. The genes are independent of each other. For
any two positions, $t_1$ and $t_2$, $P(G_{t_1} = 1 \mid G_{t_2}) = 0.5$.
ASSUMPTION 3. Genetic effects are described by the linear
model in Equation (10), with its accompanying restrictions
on $\vec{\beta}$ and $\varepsilon$.
We show in the Supplementary Material that violating these assumptions
has minimal effect. Without loss of generality, let genes $1, \ldots, N_1$ be the influential QTL for trait $j_1$. As
Lander and Botstein (1989) point out, the expected score per
progeny, or ELOD, can be described by

$$\mathrm{ELOD}(j_1, t_0) = -0.5 \log_{10} \frac{\sum_{t \neq t_0} \beta_{j_1 t}^2 + \sigma^2}{\sum_{t} \beta_{j_1 t}^2 + \sigma^2} \qquad (11)$$

when $t_0 \le N_1$. Moreover, we know that $E[X_{j_1 t_0}] \approx 0.22$ when
$t_0 > N_1$. By Assumption 3, $\mathrm{ELOD}(j_1, t_0)$ is the same for all $t_0 \le N_1$.
Let $E_{LOD} \equiv \mathrm{ELOD}(j_1, t_0)$ when $t_0 \le N_1$. Therefore,

$$\bar{X}_{j_1} \approx 0.22\,[k_1 n E_{LOD} + (1 - k_1)] \qquad (12)$$
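The constant 0.22 is stated above without derivation. One way to recover it, assuming that $X_{j_1 t_0}$ is a LOD score and that the likelihood ratio statistic at a marker unlinked to any QTL is approximately $\chi^2_1$ in large samples (both are assumptions of this note, with $\Lambda$ introduced here to denote the likelihood ratio), is:

```latex
\begin{aligned}
X_{j_1 t_0} &= \log_{10} \Lambda = \frac{2 \ln \Lambda}{2 \ln 10},
  \qquad 2 \ln \Lambda \;\dot\sim\; \chi^2_1 \\
E[X_{j_1 t_0}] &\approx \frac{E[\chi^2_1]}{2 \ln 10}
  = \frac{1}{2 \ln 10} \approx 0.217 \approx 0.22
\end{aligned}
```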
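Furthermore, when $t_0 > N_1$,

$$X_{j_1 t_0} - \bar{X}_{j_1} \approx 0.22\,[1 - (k_1 n E_{LOD} + (1 - k_1))] \qquad (13)$$

$$(X_{j_1 t_0} - \bar{X}_{j_1})^2 \approx 0.22^2\,[k_1^2 - 2 k_1^2 n E_{LOD} + k_1^2 n^2 E_{LOD}^2] \approx 0.22^2\,[k_1^2 n^2 E_{LOD}^2] \qquad (14)$$

When $t_0 \le N_1$,

$$X_{j_1 t_0} - \bar{X}_{j_1} \approx 0.22\,[n E_{LOD} - (k_1 n E_{LOD} + (1 - k_1))] \qquad (15)$$

$$(X_{j_1 t_0} - \bar{X}_{j_1})^2 \approx 0.22^2 (1 - k_1)^2\,[n^2 E_{LOD}^2 - 2 n E_{LOD} + 1] \approx 0.22^2 (1 - k_1)^2\,[n^2 E_{LOD}^2] \qquad (16)$$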
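Here $F \approx G$ if $F/G \to 1$ as $n \to \infty$. Using these results and their counterparts for $j_2$, and averaging over all markers $t$,

$$\overline{(X_{j_1 t} - \bar{X}_{j_1})^2} \approx 0.22^2 n^2 E_{LOD}^2\,[(1 - k_1) k_1^2 + k_1 (1 - k_1)^2] = 0.22^2 n^2 E_{LOD}^2\,(k_1 - k_1^2) \qquad (17)$$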
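Next, we approximate $\overline{(X_{j_1 t} - \bar{X}_{j_1})(X_{j_2 t} - \bar{X}_{j_2})}$ using approximations
13 and 15.

$$\begin{aligned}
\overline{(X_{j_1 t} - \bar{X}_{j_1})(X_{j_2 t} - \bar{X}_{j_2})} \approx 0.22^2\,[\,&(1 - k_1 - k_2 + k_0)(k_1 - k_1 n E_{LOD})(k_2 - k_2 n E_{LOD}) \\
&+ (k_1 - k_0)(1 - k_1)(n E_{LOD} - 1)(k_2 - k_2 n E_{LOD}) \\
&+ (k_2 - k_0)(1 - k_2)(n E_{LOD} - 1)(k_1 - k_1 n E_{LOD}) \\
&+ k_0 (1 - k_1)(n E_{LOD} - 1)(1 - k_2)(n E_{LOD} - 1)\,] \\
\approx 0.22^2 n^2 E_{LOD}^2\,&(k_0 - k_1 k_2)
\end{aligned} \qquad (18)$$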
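The algebra behind approximation 18 can be checked with exact rational arithmetic. The sketch below is an illustration, not the authors' code: it takes the per-marker deviations from approximations 13 and 15, reads $k_0$ as the proportion of markers that are QTL for both traits (the reading implied by approximation 19), writes `m` for $n E_{LOD}$, and uses hypothetical values for $k_0$, $k_1$, $k_2$ and `m`.

```python
from fractions import Fraction


def cross_term(k0, k1, k2, m):
    """Bracketed sum in approximation (18); m stands for n * E_LOD."""
    # Deviation factors when the marker is NOT a QTL for the trait (approximation 13).
    off1, off2 = k1 * (1 - m), k2 * (1 - m)
    # Deviation factors when the marker IS a QTL for the trait (approximation 15).
    on1, on2 = (1 - k1) * (m - 1), (1 - k2) * (m - 1)
    return ((1 - k1 - k2 + k0) * off1 * off2   # QTL for neither trait
            + (k1 - k0) * on1 * off2           # QTL for j1 only
            + (k2 - k0) * off1 * on2           # QTL for j2 only
            + k0 * on1 * on2)                  # QTL for both traits


# Hypothetical proportions and a large value of n * E_LOD.
k0, k1, k2 = Fraction(1, 200), Fraction(1, 100), Fraction(1, 50)
m = Fraction(1000)

exact = cross_term(k0, k1, k2, m)
leading = (k0 - k1 * k2) * m ** 2   # right-hand side of (18) without the 0.22^2 factor
ratio = exact / leading
print(float(ratio))
```

Under these assumptions the bracketed sum equals $(k_0 - k_1 k_2)(n E_{LOD} - 1)^2$ exactly, so its ratio to the leading term tends to 1 as $n E_{LOD}$ grows.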
Approximations 17, its counterpart for $j_2$, and 18 show that

$$\hat{\rho}_L \approx \frac{k_0 - k_1 k_2}{\sqrt{(k_2 - k_2^2)(k_1 - k_1^2)}} \approx \frac{k_0}{\sqrt{k_1 k_2}} = C(j_1, j_2) \qquad (19)$$
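The role of Assumption 1 in the last step of approximation 19 can be seen numerically. The following Python sketch uses hypothetical values with $k_1 = k_2$ and $k_0 = k_1 / 2$, so that $C(j_1, j_2) = 0.5$ throughout, and compares the two expressions in approximation 19 as the proportions shrink.

```python
import math


def rho_from_proportions(k0, k1, k2):
    """Middle expression in approximation (19)."""
    return (k0 - k1 * k2) / math.sqrt((k1 - k1 ** 2) * (k2 - k2 ** 2))


def shared_fraction(k0, k1, k2):
    """C(j1, j2) = k0 / sqrt(k1 * k2), the limiting value when k1, k2 << 1."""
    return k0 / math.sqrt(k1 * k2)


gaps = []
for k in (0.2, 0.05, 0.01):
    k1 = k2 = k      # proportion of markers that are QTL for each trait
    k0 = k / 2       # proportion that are QTL for both traits
    gap = shared_fraction(k0, k1, k2) - rho_from_proportions(k0, k1, k2)
    gaps.append(gap)
    print(k, round(gap, 4))
```

The gap between the two expressions shrinks as $k_1$ and $k_2$ become small, which is why the identification of $\hat{\rho}_L$ with $C(j_1, j_2)$ relies on Assumption 1.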