BIOINFORMATICS ORIGINAL PAPER
Vol. 24 no. 7 2008, pages 958–964
doi:10.1093/bioinformatics/btn064

Gene expression

Identifying trait clusters by linkage profiles: application in genetical genomics

Joshua N. Sampson1,* and Steven G. Self1,2
1Department of Biostatistics, University of Washington and 2Statistical Center for HIV/AIDS Research and Prevention, Seattle, WA, USA

Received on November 3, 2007; revised on January 11, 2008; accepted on February 17, 2008
Advance Access publication February 29, 2008
Associate Editor: John Quackenbush

ABSTRACT
Motivation: Genes often regulate multiple traits. Identifying clusters of traits influenced by a common group of genes helps elucidate regulatory networks and can improve linkage mapping.
Methods: We show that the Pearson correlation coefficient, $\hat{\rho}_L$, between two LOD score profiles can, with high specificity and sensitivity, identify pairs of genes that have their transcription regulated by shared quantitative trait loci (QTL). Furthermore, using theoretical and/or empirical methods, we can approximate the distribution of $\hat{\rho}_L$ under the null hypothesis of no common QTL. Therefore, it is possible to calculate P-values and false discovery rates for testing whether two traits share common QTL. We then examine the properties of $\hat{\rho}_L$ through simulation and use $\hat{\rho}_L$ to cluster genes in a genetical genomics experiment examining Saccharomyces cerevisiae.
Results: Simulations show that $\hat{\rho}_L$ can have more power than the clustering methods currently used in genetical genomics. Combining experimental results with Gene Ontology (GO) annotations shows that genes within a purported cluster often share similar function.
Software: R-code included in online Supplementary Material.
Contact: [email protected]
Supplementary information: Supplementary materials are available at Bioinformatics online.
*To whom correspondence should be addressed.

1 INTRODUCTION
Genetic linkage analysis, or linkage mapping, is a powerful tool for locating genes influencing quantitative, or continuously varying, traits. For linkage mapping, the trait of interest is measured on a group of related individuals and then the genotypes at a set of genetic markers (e.g. single nucleotide polymorphisms) are recorded for that same group. Markers that are strongly correlated with the trait are reported as quantitative trait loci (QTL). When a single gene regulates two or more traits, an occurrence known as pleiotropy, the power to detect that gene and the precision of its estimated location are often improved by mapping all regulated traits simultaneously (Jiang and Zeng, 1995).

Given a set of genetically correlated traits, several methods are available for joint linkage analysis. Maximum likelihood approaches can be applied to multivariate distributions (Chen, 2005; Korol et al., 1996). Haley–Knott regression is easily adapted to multiple traits by using multivariate regression and ANOVA (Knott and Haley, 2000). Principal components analysis can transform the traits into a set of orthogonal, canonical variables, and each component can then be analyzed by standard, single-trait methods (Mangin et al., 1998; Weller et al., 1996). These methods have been used extensively in recent years, uncovering genes influencing milk production in cows, grain yield in wheat, and multi-symptom illnesses in a variety of organisms (Kraft et al., 2003; Mangin et al., 1998; Martin et al., 2003).

Before benefiting from such methods, we must first identify a set of genetically coregulated traits. We quantify genetic coregulation by averaging the percentage of influential genes common to both traits, $C(\cdot,\cdot)$. Usually traits have been clustered because of biological relationships or prior experiments. However, our knowledge may be limited to the data collected for the linkage studies.
Therefore, using only the recorded trait values and marker genotypes, we want a method to determine if all, or a subgroup, of those traits are genetically coregulated. If the traits could share only a single gene, such a method would analyse each marker separately and mimic the previously listed joint mapping methods. However, the heritable traits still being studied are influenced by multiple genes, and two traits sharing one gene would likely share a set of genes. In this article, we formalize a novel method for identifying groups, or clusters, of traits likely to share common QTL (Schadt et al., 2005).

The need for such a method has dramatically increased with the emergence of genetical genomics. Genetics and genomics can be combined by measuring the expression levels of thousands of genes simultaneously in a group of related individuals and then treating each expression level as a quantitative trait for linkage mapping (Brem et al., 2002; Li and Burmeister, 2005; Segal et al., 2003). QTL controlling expression levels, eQTL, have been successfully identified in mice, yeast, wheat, and humans (Li and Burmeister, 2005; MacLaren and Sikela, 2005; Yvert et al., 2003). Here, we will offer a way to cluster genes regulated by common eQTL, thereby improving our estimates of QTL locations and identifying collections of genes participating in the same biological pathway.

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Our clustering method is based on the correlation, $\hat{\rho}_L$, between the LOD score profiles of two traits. Given a large group of traits, we form clusters so the value of $\hat{\rho}_L$ between any two members exceeds some threshold. We will introduce this method in the context of a single genetical genomics experiment. Until now, these analyses have generally formed clusters so all pairs within a cluster have highly correlated expression levels.
We can therefore compare our method to this established standard. In other cases, such as clustering genes where the expression levels were measured on different populations, there is no alternative to clustering by $\hat{\rho}_L$.

The remainder of the article is divided as follows. First, we introduce notation and briefly review LOD scores. Second, we introduce $\hat{\rho}_L$ as an estimate for $C(j_1, j_2)$. Third, we discuss $\hat{\rho}_L$ as a similarity measure, and show how combining this measure and an appropriate algorithm can identify clusters of genetically coregulated traits (Hastie et al., 2001). Fourth, we apply our clustering method to simulations and the experimental results from Yvert et al.'s study of Saccharomyces cerevisiae (Yvert et al., 2003).

2 METHODS
2.1 Notation
Assume a cross between two strains of yeast, ST1 and ST2, producing n individuals. For subject i, $i \in \{1, \ldots, n\}$, let $Y_{ji}$ be the expression level for gene j and let $\vec{Y}_j$ be called the expression, or phenotypic, profile of gene j. Let subject i be genotyped at N markers, and let $G_{ti}$ be the genotype at position t, where $G_{ti} = 1$ ($G_{ti} = -1$) if the allele is from ST1 (ST2). $\{G_{ti_1}, Y_{ji_1}\}$ and $\{G_{ti_2}, Y_{ji_2}\}$ are independent if $i_1 \neq i_2$. Recall that yeast are haploid and have only two possible genotypes, with $P(G_{ti} = 1) = P(G_{ti} = -1) = \tfrac{1}{2}$ $\forall\, t, i$. To simplify the discussion, assume that all genes are located at markers. We say that gene t, marker t, or QTL t influences the expression of gene j if $f(Y_{ji} \mid G_{ti} = 1) \neq f(Y_{ji} \mid G_{ti} = -1)$, ignoring the possibility of epistasis. Furthermore, let $R_j \equiv \{t_{j1}, \ldots, t_{jN_j}\}$ be the positions of the $N_j$ QTL influencing the expression of gene j. The LOD score, $X^n_{jt}$, is the $\log_{10}$ of the likelihood ratio statistic testing whether $t \in R_j$ (see Supplementary Material for definition) and can be approximated by Haley–Knott regression: $X^n_{jt} \approx (n/4.61)\ln(SST/SSE)$. Here, SST and SSE are the total and residual sums of squares from solving equation 1 (Haley et al., 1994).
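As an illustration of the regression approximation to the LOD score, the following Python sketch computes $(n/4.61)\ln(SST/SSE)$ for one trait and one marker, using $4.61 \approx 2\ln 10$. It assumes complete genotype data coded as +1/-1 and a plain single-marker regression; the function name `lod_score` and the toy data are hypothetical, and this is not the authors' R code.

```python
import math


def lod_score(y, g):
    """Approximate LOD score for one trait at one marker.

    y: expression values, one per subject.
    g: genotypes coded +1 / -1, one per subject (no missing data assumed).
    """
    n = len(y)
    y_bar = sum(y) / n
    g_bar = sum(g) / n
    sxx = sum((gi - g_bar) ** 2 for gi in g)
    sxy = sum((gi - g_bar) * (yi - y_bar) for gi, yi in zip(g, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)  # total sum of squares
    if sxx == 0 or sst == 0:
        return 0.0  # monomorphic marker or constant trait: no evidence
    sse = sst - sxy ** 2 / sxx  # residual sum of squares of the fitted line
    if sse <= 0:
        return math.inf  # perfect fit
    # n / (2 ln 10) is the n / 4.61 factor in the text
    return n / (2.0 * math.log(10.0)) * math.log(sst / sse)


# Toy data: four subjects, one linked and one unlinked marker
y = [1.0, 2.1, 0.9, 2.0]
linked = [-1, 1, -1, 1]
unlinked = [-1, -1, 1, 1]
print(lod_score(y, linked), lod_score(y, unlinked))
```

Applying `lod_score` at every marker for a given trait yields that trait's LOD score profile.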
The vector of LOD scores, $\vec{X}^n_j$, will be called the LOD score profile of gene j.

\[ Y_{ji} = \beta_{j0} + \beta_{jt} P(G_{ti} = 1) + \epsilon_{ji} \tag{1} \]
\[ SST = \sum_{i=1}^{n} (Y_{ji} - \bar{Y}_j)^2, \qquad SSE = \sum_{i=1}^{n} \bigl(Y_{ji} - (\hat{\beta}_{j0} + \hat{\beta}_{jt} G_{ti})\bigr)^2 \]

Here $\bar{Y}_j = \frac{1}{n}\sum_{i=1}^{n} Y_{ji}$ and $\hat{\beta}$ is the least-squares estimate of $\beta$.

Let $1_{t \in R_j} = 1$ if $t \in R_j$, 0 otherwise. For trait j, $j \in \{j_1, j_2\}$, $\sum_{t=1}^{N} 1_{t \in R_{j_1}} 1_{t \in R_{j_2}} / N_j$ is the proportion of QTL that are common to both traits. We measure the genetic coregulation by the geometric mean, $C(j_1, j_2)$, of the two proportions.

\[ C(j_1, j_2) = \left[ \frac{\sum_{t=1}^{N} 1_{t \in R_{j_1}} 1_{t \in R_{j_2}}}{N_{j_1}} \cdot \frac{\sum_{t=1}^{N} 1_{t \in R_{j_1}} 1_{t \in R_{j_2}}}{N_{j_2}} \right]^{1/2} = \frac{\sum_{t=1}^{N} 1_{t \in R_{j_1}} 1_{t \in R_{j_2}}}{\sqrt{N_{j_1} N_{j_2}}} \tag{2} \]

Our goal is to find clusters, $T_1, T_2, \ldots, T_g$, of traits such that $\sum_{i=1}^{g} \sum_{j_1, j_2 \in T_i} C(j_1, j_2)$ is large.

2.2 Estimating $C(j_1, j_2)$
Define the LOD score correlation coefficient, $\hat{\rho}_L$, for traits $j_1$ and $j_2$ by

\[ \hat{\rho}_L \equiv \frac{\sum_{t=1}^{N} (X_{j_1 t} - \bar{X}_{j_1})(X_{j_2 t} - \bar{X}_{j_2})}{\sqrt{\sum_{t=1}^{N} (X_{j_1 t} - \bar{X}_{j_1})^2 \sum_{t=1}^{N} (X_{j_2 t} - \bar{X}_{j_2})^2}} \tag{3} \]

where $\bar{X}_j = \frac{1}{N}\sum_{t=1}^{N} X_{jt}$. Let trait $j_1$ be influenced by $N_1$ genes and trait $j_2$ be influenced by $N_2$ genes. Define $k_1$ and $k_2$ by $N_1 = k_1 N$ and $N_2 = k_2 N$, where $0 < k_1, k_2 < 1$. Furthermore, let the two traits share $N_0 = k_0 N$ genes, where $0 < k_0 < k_1, k_2$. For large n, we show (Appendix A) that under three mild assumptions,

\[ \hat{\rho}_L(j_1, j_2) \approx C(j_1, j_2) = \frac{k_0}{\sqrt{k_1 k_2}} \tag{4} \]

A second approach for estimating $C(j_1, j_2)$ would be to first estimate $1_{t \in R_{j_1}}$ and $1_{t \in R_{j_2}}$ for each potential QTL t. Let $\hat{1}_{t \in R_j} = 1$ if $X_{jt}$ is a local maximum and exceeds some threshold > 0.22, 0 otherwise. Then, calculating LOD scores by composite interval mapping (CIM) promises

\[ \lim_{n \to \infty} \frac{\sum_{t=1}^{N} \hat{1}_{t \in R_{j_1}} \hat{1}_{t \in R_{j_2}}}{\sqrt{\hat{N}_{j_1} \hat{N}_{j_2}}} = C(j_1, j_2) \tag{5} \]

where $\hat{N}_j = \sum_{t=1}^{N} \hat{1}_{t \in R_j}$ for $j \in \{j_1, j_2\}$.
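To make Equations (2) and (3) concrete, here is a small Python sketch. It assumes each QTL set $R_j$ is given as a set of marker indices and each LOD score profile as a list of equal length; the function names `coregulation` and `lod_correlation`, and the toy profiles, are hypothetical illustrations rather than the authors' implementation.

```python
import math


def coregulation(r1, r2):
    """C(j1, j2) of Equation (2): shared QTL over sqrt(N_j1 * N_j2)."""
    r1, r2 = set(r1), set(r2)
    if not r1 or not r2:
        return 0.0  # convention chosen here for traits without any QTL
    return len(r1 & r2) / math.sqrt(len(r1) * len(r2))


def lod_correlation(x1, x2):
    """Pearson correlation between two LOD score profiles, Equation (3)."""
    n_markers = len(x1)
    m1 = sum(x1) / n_markers
    m2 = sum(x2) / n_markers
    num = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    den = math.sqrt(sum((a - m1) ** 2 for a in x1)
                    * sum((b - m2) ** 2 for b in x2))
    return num / den if den > 0 else 0.0


# Trait 1 has QTL at markers {10, 40, 70, 90}; trait 2 at {10, 40}
print(coregulation({10, 40, 70, 90}, {10, 40}))  # 2 / sqrt(4 * 2)

# Profiles peaking at the same markers correlate positively;
# profiles with peaks at different markers correlate negatively
shared = lod_correlation([0.1, 3.0, 0.2, 0.1, 2.5], [0.2, 2.0, 0.1, 0.3, 3.5])
distinct = lod_correlation([3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 3.0])
print(shared, distinct)
```

The second toy pair also illustrates why two traits controlled by distinct genes tend to produce a negative $\hat{\rho}_L$: each profile is high exactly where the other is low.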
We avoid this approximation because it performs poorly at realistic sample sizes. Let QTL $t_0$ affect traits $j_1$ and $j_2$. Even when n is large, the estimated locations for QTL $t_0$ will rarely coincide perfectly, resulting in $\hat{1}_{t \in R_{j_1}} \hat{1}_{t \in R_{j_2}} = 0$ for all t close to $t_0$. This approximation does not account for the distance between the two estimated positions. The obvious correction would be to use a continuous measure estimating $P(t \in R_j)$, or evidence of linkage, instead of $\hat{1}_{t \in R_j}$. As such evidence is often quantified by $X_{jt}$, we have returned to our original approach.

Returning to our original focus, we calculate $\hat{C}(j_1, j_2) \equiv \hat{\rho}_L(j_1, j_2)$ for all possible pairs of traits and input those values in a proximity matrix, $\mathbf{D}$. Let $\mathbf{D}_{j_1, j_2} = \hat{C}(j_1, j_2)$, where $\mathbf{D}_{j_1, j_2}$ is the entry in the $j_1$th row and $j_2$th column of $\mathbf{D}$. Given a proximity matrix, there are numerous methods available for finding groups, $T_1, T_2, \ldots, T_g$, of traits such that $\sum_{i=1}^{g} \sum_{j_1, j_2 \in T_i} \hat{C}(j_1, j_2) = \sum_{i=1}^{g} \sum_{j_1, j_2 \in T_i} \mathbf{D}_{j_1, j_2}$ is large. If our estimates $\hat{C}(j_1, j_2)$ are accurate, then the identified clusters will result in a large value of $\sum_{i=1}^{g} \sum_{j_1, j_2 \in T_i} C(j_1, j_2)$.

2.3 Similarity measures
Finding clusters, $T_1, T_2, \ldots, T_g$, of traits resulting in a large value of $\sum_{i=1}^{g} \sum_{j_1, j_2 \in T_i} C(j_1, j_2)$ does not require estimating $C(j_1, j_2)$. We can circumvent this step by identifying a statistic, or measure, $D$, that is highly correlated with $C$. Given such a measure, we can construct a proximity matrix, $\mathbf{D}$, where $\mathbf{D}_{j_1, j_2} = D(j_1, j_2)$. Applying an appropriate clustering method to $\mathbf{D}$ would result in groups of traits with a large value of $\sum_{i=1}^{g} \sum_{j_1, j_2 \in T_i} C(j_1, j_2)$. Three candidate measures are the correlation between expression levels, the correlation between the vectors $\hat{\beta}_{j_1}$ and $\hat{\beta}_{j_2}$, and the correlation between LOD score profiles.
We must avoid methods based on variance components because the genetic component is not identifiable for Yvert et al.'s yeast experiment or any experiment where the population is the progeny of a single cross.

The most prevalent method for finding genes linked to common eQTL is clustering by expression profile (Brem and Kruglyak, 2005; Eisen et al., 1998; Yvert et al., 2003). This method is equivalent to defining $\mathbf{D}_{j_1 j_2} = \hat{\rho}_{Pn}(j_1, j_2)$, where $\hat{\rho}_{Pn}(j_1, j_2)$ estimates the Pearson correlation coefficient, $\rho_P$, between $\vec{Y}_{j_1}$ and $\vec{Y}_{j_2}$.

\[ \hat{\rho}_{Pn} = \frac{\sum_{i=1}^{n} (Y_{j_1 i} - \bar{Y}_{j_1})(Y_{j_2 i} - \bar{Y}_{j_2})}{\sqrt{\sum_{i=1}^{n} (Y_{j_1 i} - \bar{Y}_{j_1})^2} \sqrt{\sum_{i=1}^{n} (Y_{j_2 i} - \bar{Y}_{j_2})^2}} \tag{6} \]

In Yvert et al.'s experiment, genes were clustered into groups, $T_1, T_2, \ldots, T_g$, such that if $j_1, j_2 \in T_k$, then $\hat{\rho}_{Pn}(j_1, j_2) > 0.725$. Here, genes could be members of multiple groups. In many cases, $\rho_P$, and therefore $\hat{\rho}_{Pn}$, should be strongly correlated with $C(\cdot,\cdot)$. Consider a linear model describing the expression levels of two genes, each controlled by a single QTL.

\[ Y_{j_1 i} = \beta_{j_1 0} + \beta_{j_1 t} G_{ti} + \epsilon_{j_1 i}, \qquad Y_{j_2 i} = \beta_{j_2 0} + \beta_{j_2 t'} G_{t' i} + \epsilon_{j_2 i} \tag{7} \]

where $\epsilon_{j_1 i}$ and $\epsilon_{j_2 i}$ are independent and normally distributed with mean = 0 and variances = $\sigma^2_{e j_1}$, $\sigma^2_{e j_2}$, respectively. When both genes are influenced by the same eQTL, $t = t'$,

\[ \rho_P = \frac{\beta_{j_1 t} \beta_{j_2 t}}{\sqrt{(\sigma^2_{e j_1} + \beta^2_{j_1 t})(\sigma^2_{e j_2} + \beta^2_{j_2 t})}} \tag{8} \]

When there is only one QTL, the following, desirable, statement is true: $|\rho_P| > 0$ if and only if the two traits are genetically coregulated. A discrete measure, $C(j_1, j_2) \in \{0, 1\}$, will never perfectly correlate with a continuous measure, $D(j_1, j_2)$.
However, if $\sigma^2_{ej}$ is constant for all genes, $\rho_P$ is an increasing function of the genetic effect sizes, $\beta_{j_1}$ and $\beta_{j_2}$. Therefore, as the influence of the shared QTL increases, $E[\hat{\rho}_P]$ will also increase, another highly desirable characteristic. Unfortunately, problems can arise when multiple genes influence each trait. Then, even the simple statement from above fails, as $\rho_P = 0$ no longer implies the absence of genetic correlation. We give an example later where $\rho_P(j_1, j_2) = 0$ and $C(j_1, j_2) = 1$.

A second possible measure is $\hat{\rho}_\beta(j_1, j_2)$, the Pearson correlation coefficient between $\hat{\beta}_{j_1}$ and $\hat{\beta}_{j_2}$, where $\hat{\beta}_j$ is the least squares estimate of $\beta_j$ in Equation (10). If we believe that two genes share a common function only when they share the same QTL and when those QTL have identical influences, $\hat{\rho}_\beta$ would be the preferred statistic. However, this second condition is superfluous if our ultimate aim is to group genes for QTL mapping, so we still prefer a measure correlated with $C(j_1, j_2)$. Therefore, $\hat{\rho}_\beta$ has two drawbacks. The signs of $\hat{\beta}_{j_1}$ and $\hat{\beta}_{j_2}$ affect $\hat{\rho}_\beta$, and the size of $\hat{\beta}$, not the evidence for linkage, affects $\hat{\rho}_\beta$. The logical replacement for $\hat{\beta}_{jt}$, which addresses those flaws, is the F-statistic $\hat{\beta}^2_{jt} / \mathrm{var}(\hat{\beta}_{jt})$. However, $F = (n - 2)(10^{2X_{jt}/n} - 1)$, so we are led back to a function based on the LOD scores for our correlate to $C(j_1, j_2)$.

The third possible measure, and our focus for the remainder of the article, is $\hat{\rho}_L$. Unlike $\hat{\rho}_P$, $\hat{\rho}_L$ incorporates both expression and genetic data. Furthermore, $\hat{\rho}_L$ can compare traits measured on distinct populations. Both traits could be expression levels, or one trait could be a more general characteristic, such as size or life expectancy. Finally, there is a robust correlation between $\hat{\rho}_L$ and $C$ even in small samples, which will be illustrated by later examples.

2.4 Asymptotic properties of $\hat{\rho}_{LNn}$
Let traits $j_1$ and $j_2$ have the distribution described by equation 10.
To simplify calculations, we make the following three assumptions: (1) The entire genome is on a single chromosome. (2) The markers are evenly spaced at intervals of length d cM. (3) Haldane's mapping rule describes the recombination rate between loci. Under the absolute null hypothesis, $H_{00}$: no genes influence either trait ($H_{00}: \vec{\beta}_{j_1} = \vec{\beta}_{j_2} = \vec{0}$), we proved that $\hat{\rho}_{LNn}$ converges to a normal distribution as $N, n \to \infty$ (Supplementary Material).

\[ \sqrt{N}\,\hat{\rho}_{LNn} \equiv \sqrt{N} \left( \frac{\sum_{t=1}^{N} (X_{j_1 t} - \hat{\mu}_{j_1})(X_{j_2 t} - \hat{\mu}_{j_2})}{\sqrt{\sum_{t=1}^{N} (X_{j_1 t} - \hat{\mu}_{j_1})^2 \sum_{t=1}^{N} (X_{j_2 t} - \hat{\mu}_{j_2})^2}} \right) \to_d N\!\left(0, \frac{\sigma_{j_1 j_2}}{\mathrm{var}(X_{j_1 t})\,\mathrm{var}(X_{j_2 t})}\right) \tag{9} \]

where $\sigma_{j_1 j_2} = 4.61^{-4}\bigl(4 + 2\sum_{t=1}^{\infty} 4e^{-0.08dt}\bigr)$ is the asymptotic variance of $\frac{1}{\sqrt{N}} \sum_{t=1}^{N} (X_{j_1 t} - \mu_{j_1})(X_{j_2 t} - \mu_{j_2})$ and $\mathrm{var}(X_{j_1 t}) = \mathrm{var}(X_{j_2 t}) = 2/4.61^2$. In the Supplementary Material, we show that as long as there are at least 200 subjects, the null distribution of $\hat{\rho}_{LNn}$ should be nearly normal and can be well approximated by the asymptotic null for a chromosome of sufficient length. For some genomes, or when the number of available markers is small, the null distribution of $\hat{\rho}_L$ can be heavily skewed and can be better approximated by modifying the asymptotic distribution (see Supplementary Material). The appropriately calculated null distribution can provide P-values, and other measures of statistical significance, for experimental results. Violating any of the three assumptions had little effect on the null distribution (see Supplementary Material).

2.5 Clustering method
We propose a three-step method for identifying clusters of traits that share common QTL.
(1) Calculate the truncated LOD score profile, $\vec{X}^{trunc}_j$, for each of the n traits, where $X^{trunc}_{jt} = \min(X_{jt}, 6)$.
Without truncation, traits with an extreme LOD score at position t will be correlated to any trait j where $X_{jt}$ is even modestly larger than $E[X_{jt}]$. Truncation also ensures that $\hat{\rho}_L$ is only large when traits share multiple QTL. Simulations suggested the threshold of six greatly reduced the type I error rate without noticeably lowering power.
(2) Form an n × n similarity matrix where the $(j_1, j_2)$th entry is $\hat{\rho}_L(j_1, j_2)$, calculated using the truncated LOD scores at all markers.
(3) Use a hierarchical clustering method (such as hclust with method = 'complete' in R) to order the traits. Then, select groups of traits where $\hat{\rho}_L(j_1, j_2) > c$ for all included trait pairs, $j_1$ and $j_2$, where c is a predefined threshold. These groups can be subsequently ranked by their size and/or mean value of $\hat{\rho}_L(\cdot,\cdot)$.

2.6 Simulations
We could simulate small groups of coregulated traits, and then examine whether the above method can correctly cluster those subgroups when applied to the union of all traits. However, these simulations would introduce multiple variables simultaneously. Instead, we focus on groups of two coregulated traits, $j_1$ and $j_2$, and calculate the probability of correctly clustering those two traits together, or equivalently, $P(\hat{\rho}_L(j_1, j_2) > c_\alpha)$. We refer to $P(\hat{\rho}_L(j_1, j_2) > c_\alpha)$ as the 'power' of our clustering method, because in this two-trait example, our clustering method is equivalent to a test that rejects the null hypothesis $H_0: R_1 \cap R_2 = \emptyset$ when $\hat{\rho}_L > c_\alpha$. We define $c_\alpha$ so the probability of clustering two unrelated traits together, the 'type I error rate', is $\alpha$. In each of the scenarios described below, we generate 10 000 values of $\hat{\rho}_L(j_1, j_2)$ under the null set-up and define the 95th percentile as $c_{0.05}$. We then generate values under the alternative, and define the proportion exceeding $c_{0.05}$ as the power. Full, as opposed to truncated, LOD scores could be used to calculate $\hat{\rho}_L(j_1, j_2)$ because we avoided genes with extreme effects.
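The three-step procedure of Section 2.5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R code: the function names are hypothetical, LOD score profiles are assumed to be plain lists of equal length, and step (3) is implemented as a direct complete-linkage agglomeration that stops merging once no two clusters have all cross pairs above the threshold c (comparable to cutting a complete-linkage dendrogram built on the distance $1 - \hat{\rho}_L$ at height $1 - c$).

```python
import math


def pearson(x1, x2):
    """Pearson correlation between two equal-length lists."""
    m1 = sum(x1) / len(x1)
    m2 = sum(x2) / len(x2)
    num = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    den = math.sqrt(sum((a - m1) ** 2 for a in x1)
                    * sum((b - m2) ** 2 for b in x2))
    return num / den if den > 0 else 0.0


def cluster_by_lod_profile(profiles, c, cap=6.0):
    """Group traits whose truncated LOD profiles all correlate above c."""
    # Step 1: truncate every LOD score at the cap (6 in the text)
    trunc = [[min(x, cap) for x in p] for p in profiles]
    k = len(trunc)
    # Step 2: similarity matrix of LOD score correlation coefficients
    sim = [[pearson(trunc[a], trunc[b]) for b in range(k)] for a in range(k)]
    # Step 3: complete linkage; the similarity of two clusters is the
    # smallest pairwise similarity, so merged clusters keep every pair > c
    clusters = [[j] for j in range(k)]
    while len(clusters) > 1:
        best, best_pair = c, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(sim[p][q] for p in clusters[a] for q in clusters[b])
                if link > best:
                    best, best_pair = link, (a, b)
        if best_pair is None:
            break  # no remaining pair of clusters exceeds the threshold
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(cl) for cl in clusters if len(cl) > 1]


# Toy profiles over six markers: traits 0 and 1 share peaks at markers 1 and 4
profiles = [
    [0.1, 4.0, 0.2, 0.1, 3.0, 0.1],
    [0.2, 8.0, 0.1, 0.2, 2.5, 0.0],  # the 8.0 is truncated to 6.0
    [3.5, 0.1, 0.2, 2.8, 0.1, 0.2],
    [0.1, 0.2, 0.1, 0.2, 0.1, 0.3],
]
print(cluster_by_lod_profile(profiles, c=0.4))
```

The pairwise search is cubic in the number of traits, so for thousands of expression traits a library routine such as the hclust call named in step (3) would be the practical choice.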
Using identical methodology, we also calculate the power for the test rejecting $H_0$ when $\hat{\rho}_P(j_1, j_2)$ is large.

For each individual in a group of offspring, we simulated two phenotypes and a set of SNPs, spaced every 10 cM, along a single chromosome. The first marker was randomly assigned to be 1 or −1, indicating parental origin, and the remaining markers were generated according to Haldane's mapping function. In the first set of simulations, the phenotypes, $Y_{j_1}$ and $Y_{j_2}$, were generated according to equation 10 with $N_{j_1} = N_{j_2} = 1$. Data was simulated for 100 subjects when the trait heritability, H, was 0.05, 0.10 and 0.20 and for 1000 subjects when H was 0.010, 0.015 and 0.020. In this simple model, $H = \beta^2/(\beta^2 + \sigma^2)$. Simulations were repeated for genome lengths of 1000, 5000, and 10 000 cM. Under the null hypothesis, the genes influencing traits $j_1$ and $j_2$ were located at the 0.3Nth and 0.7Nth marker, respectively, and under the alternative, both genes were located at the 0.5Nth marker.

For the second set of simulations, we fixed the number of subjects (100), heritability of each QTL (i.e. $\beta_t^2/(\beta_t^2 + \sigma^2) = 0.025$), and genome length (5000 cM). The two phenotypes $Y_{j_1}$ and $Y_{j_2}$ were generated according to the following equation.

\[ Y_{j_1 i} = \beta_0 + \sum_{t=1}^{N} \beta_{j_1 t} G_{ti} + \epsilon_{j_1 i}, \qquad Y_{j_2 i} = \beta_0 + \sum_{t=1}^{N} \beta_{j_2 t} G_{ti} + \epsilon_{j_2 i} \tag{10} \]

where there exists a single $\beta$ such that $|\beta_{jt}| = \beta$ when $|\beta_{jt}| > 0$ for $j \in \{j_1, j_2\}$ and $t \in \{1, \ldots, N\}$. Also, let $\mathrm{var}(\epsilon_{j_1 i}) = \mathrm{var}(\epsilon_{j_2 i}) = \sigma^2$. Each phenotype had two, four or six influential QTL ($N_j \equiv \sum_{t=1}^{N} 1_{\beta_{jt} \neq 0} \in \{2, 4, 6\}$). The total heritability, $H_T$, was defined by $H_T = \sum_t \beta_{jt}^2 / (\sum_t \beta_{jt}^2 + \sigma^2)$. The set of $N_j$ QTL regulating phenotype j can be abbreviated by $R_j \subset \{1, \ldots, N\}$. The influential genes for trait $j_1$ were evenly spaced over the first half of the chromosome, $R_1 = \{N/(2N_1), \ldots, N_1 N/(2N_1)\}$. Under the null hypothesis, the influential genes for trait $j_2$ were evenly spaced along the second half of the chromosome, $R_2 = N/2 + \{N/(2N_1), \ldots, N_1 N/(2N_1)\}$, whereas under the alternative, they were shifted by a constant so that a specified percentage of QTL overlapped, $R_2 = \text{constant} + \{N/(2N_1), \ldots, N_1 N/(2N_1)\}$.

With multiple QTL, the direction of effect can influence our results. In addition to letting all elements of $\vec{\beta}_{j_1}$ and $\vec{\beta}_{j_2}$ be positive, we also examine the scenario where the signs of $\vec{\beta}_{j_2}$ alternate between positive and negative (i.e. $\beta_{j_2 t_{21}} = -\beta_{j_2 t_{22}} = \beta_{j_2 t_{23}} = -\beta_{j_2 t_{24}}$).

Table 1. The $E[\hat{\rho}_L]$ is calculated under the null ($R_0 \cap R_1 = \emptyset$) and alternative ($R_0 \cap R_1 \neq \emptyset$) hypotheses under multiple combinations of subject number, heritability, and chromosome length. The power of the test, reject $H_0: R_0 \cap R_1 = \emptyset$ when $\hat{\rho}_L > c_{0.05}$, is derived through simulation. For comparison, $E[\hat{\rho}_P]$ and the power from a $\hat{\rho}_P$-based test are also listed. (Null and Alt columns give $E[\hat{\rho}_L]$ for the three chromosome lengths and $E[\hat{\rho}_P]$ for Phenotype.)

                    1000 cM               5000 cM               10 000 cM             Phenotype
Subjects  H         Null   Alt    Power   Null   Alt    Power   Null   Alt    Power   Null   Alt    Power
100       0.05      0.02   0.16   0.27    0.00   0.05   0.18    0.00   0.03   0.14    0.00   0.05   0.08
100       0.10      0.04   0.41   0.72    0.00   0.18   0.54    0.00   0.11   0.47    0.00   0.10   0.17
100       0.20      0.07   0.73   0.99    0.01   0.48   0.98    0.00   0.35   0.96    0.00   0.20   0.53
1000      0.010     0.04   0.39   0.70    0.00   0.17   0.55    0.00   0.10   0.47    0.00   0.010  0.06
1000      0.020     0.06   0.56   0.92    0.01   0.29   0.81    0.00   0.19   0.77    0.00   0.020  0.07
1000      0.025     0.07   0.68   0.99    0.01   0.42   0.96    0.00   0.29   0.93    0.00   0.025  0.10
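The data-generating steps described for these simulations can be sketched as below. This is a hedged illustration rather than the authors' simulation code: the function names, the dictionary of effect sizes keyed by 0-based marker index, and the seeded generator are assumptions made for the example. Genotypes follow a two-state chain in which adjacent markers differ with the Haldane recombination fraction $r = \tfrac{1}{2}(1 - e^{-2d/100})$ for a spacing of d cM, and phenotypes follow Equation (10).

```python
import math
import random


def haldane_recombination(d_cm):
    """Recombination fraction between loci d_cm centimorgans apart."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def simulate_genotypes(n_markers, d_cm, rng):
    """One haploid segregant: +1 / -1 at evenly spaced markers."""
    r = haldane_recombination(d_cm)
    g = [rng.choice((-1, 1))]  # first marker: either parent, probability 1/2
    for _ in range(n_markers - 1):
        # switch parental origin with probability r, otherwise copy
        g.append(-g[-1] if rng.random() < r else g[-1])
    return g


def simulate_phenotype(genotypes, effects, sigma, rng, beta0=0.0):
    """Equation (10): beta0 + sum_t beta_t * G_t + normal error.

    effects maps a 0-based marker index to its effect size beta_t.
    """
    genetic = sum(beta * genotypes[t] for t, beta in effects.items())
    return beta0 + genetic + rng.gauss(0.0, sigma)


rng = random.Random(2008)  # fixed seed so the example is repeatable
n_markers = 500            # 5000 cM chromosome, markers every 10 cM
# Per-QTL heritability 0.025 with sigma = 1 gives beta^2 = 0.025 / 0.975
beta = math.sqrt(0.025 / 0.975)
effects_j1 = {125: beta, 250: beta}   # two QTL for trait j1
effects_j2 = {125: beta, 250: -beta}  # same QTL, alternating signs
subjects = [simulate_genotypes(n_markers, 10.0, rng) for _ in range(100)]
y_j1 = [simulate_phenotype(g, effects_j1, 1.0, rng) for g in subjects]
y_j2 = [simulate_phenotype(g, effects_j2, 1.0, rng) for g in subjects]
```

Feeding `y_j1`, `y_j2` and the marker columns of `subjects` into a LOD score routine would then give the two profiles whose correlation is studied in the simulations.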
To define the concept of alternating more generally, let $s_{jt} = \mathrm{sign}(\beta_{jt})$ and $S = N_0^{-1} \sum_t s_{j_1 t} s_{j_2 t}$. Then the alternating case is defined by S = 0.

2.7 Experiment
Yvert and colleagues (2003) measured the expression of 6818 genes in laboratory (BY) and wild (RM) strains of S.cerevisiae and in 112 segregants from a cross between them. Including multiple replicates for each parent, expression was measured on 130 samples. In addition to this genomics data, their lab genotyped each member of the two generations at 3114 genetic markers. Because genotypes at adjacent markers were often nearly identical, only a subset of 1063 marker locations were chosen for calculating the LOD score correlation coefficient. A complete description of the experiment has been previously published (Yvert et al., 2003).

2.8 Composite interval mapping
In the analysis of both simulations and experimental results, we calculated the LOD score profile using CIM. In addition to the locus of interest, the nearest markers, on each side, at least 15 cM away were also included in the Haley–Knott regression. For the experimental results, the 'at least 15 cM' requirement was replaced by 'differing genotypes for at least 15 subjects'. All significant loci outside the surrounding interval were also included in the regression. The group of significant loci, $\Lambda$, was initially defined as $\{t'\}$, where $X_{jt'} = \max_t(X_{jt})$. Given a set $\Lambda$, the LOD score, $X^{\Lambda}_{jt}$, for each position was then recalculated with all significant loci included in the regression. Until $\max_t(X^{\Lambda}_{jt}) < 1.25$ or $\Lambda$ contained eight loci, $t'$ was added to $\Lambda$, where $X^{\Lambda}_{jt'} = \max_t(X^{\Lambda}_{jt})$. If $t'' \in \Lambda$, $X^{\Lambda}_{jt''} = \min_{t \in \Lambda}(X^{\Lambda}_{jt})$, and $X^{\Lambda}_{jt''} < 1.25$, $t''$ was dropped from the set $\Lambda$. This abbreviated version of CIM was chosen to accommodate the large number of traits.

3 RESULTS
3.1 Simulations
Fundamental characteristics of $\hat{\rho}_L$ are revealed by simulating two traits, $j_1$ and $j_2$, each controlled by a single, and possibly common, gene (Table 1).
As previously discussed, $E[\hat{\rho}_L]$ usually exceeds 0 if and only if $C(j_1, j_2) > 0$. When each trait is controlled by a distinct gene, the maximum of $E[X_{j_1 t}]$ ($E[X_{j_2 t}]$) occurs at a position t where $E[X_{j_2 t}]$ ($E[X_{j_1 t}]$) is small, resulting in negative correlation. As predicted, $E[\hat{\rho}_L]$ increases dramatically with heritability, for example, ranging from 0.16 (H = 0.05) to 0.73 (H = 0.20) when there are 100 subjects and a 1000 cM chromosome. In general, $E[\hat{\rho}_L]$ will also increase with the number of subjects and decrease with genome size. The variance of $\hat{\rho}_L$ decreases with chromosome length, because for any given $\beta$, $\hat{\rho}_L \to_p 0$. The variance shrinks as the sample grows because $\hat{\rho}_L \to_p 1$.

In all scenarios, the power for rejecting $H_0$ was far greater when using tests based on $\hat{\rho}_L$, compared to tests based on $\hat{\rho}_P$ (Table 1). The relative advantage of the former appears to increase with sample size. With 100 subjects, the power is larger by about a factor of 2, whereas with 1000 subjects, the power is larger by about a factor of 10. The relative power is increasing, in part, because $E[\hat{\rho}_L]$ quickly approaches 1 as n increases, whereas $E[\hat{\rho}_P]$ remains constant.

Figure 1 shows that $E[\hat{\rho}_L]$ increases with $C(j_1, j_2)$ and $H_T$. With only 100 subjects, the population size of Yvert's experiment, $H_T$ still affects $E[\hat{\rho}_L]$ and $E[\hat{\rho}_L]$ is a poor approximation of $C(j_1, j_2)$. However, we see that even for sample sizes where $\hat{\rho}_L \approx C(j_1, j_2)$ is not true, tests based on $\hat{\rho}_L$ still have power to identify correlated traits. Table 2 shows that these tests can be far more powerful than those based on $\hat{\rho}_P$. Table 2 also shows the power tends to be slightly smaller when the signs of the elements of $\vec{\beta}_{j_2}$ alternate, even though the loci are, for practical purposes, independent of each other. In our simulations, whenever $C(j_1, j_2) > 0$, $E[\hat{\rho}_L \mid S = 0] < E[\hat{\rho}_L \mid S \neq 0]$, implying that, contrary to the marginal distributions, the joint distribution of $\vec{X}_{j_1}$ and $\vec{X}_{j_2}$ depends on the signs of $\vec{\beta}_{j_1}$ and $\vec{\beta}_{j_2}$.

To understand this phenomenon, we focus on the 2 QTL example. Let $t_1$ and $t_2$ be the locations of the two influential genes. Although $E[\sum_{i=1}^{N} G_{t_1 i} G_{t_2 i}] = 0$, $\Pr(\sum_{i=1}^{N} G_{t_1 i} G_{t_2 i} > 0) > 0$. When this event occurs and $\beta_{j_1 t_2} = \beta_{j_2 t_2}$, both $X_{j_1 t_1}$ and $X_{j_2 t_1}$ tend to increase. When $\sum_{i=1}^{N} G_{t_1 i} G_{t_2 i} < 0$, both $X_{j_1 t_1}$ and $X_{j_2 t_1}$ tend to decrease. The values of $X_{j_1 t_1}$ and $X_{j_2 t_1}$ change together, or in unison. The same is true for the LOD scores at $t_2$. In contrast, when $\beta_{j_1 t_2} = -\beta_{j_2 t_2}$, $X_{j_1 t_1}$ tends to be higher than $E[X_{j_1 t_1}]$ when $X_{j_2 t_1}$ is lower than $E[X_{j_2 t_1}]$. Here, the extra error is negatively correlated. In the Supplementary Materials, we provide the mathematical details and show that $E[(X^{+}_{j_1 t_1} - E[X_{j_1 t_1}])(X^{+}_{j_2 t_1} - E[X_{j_2 t_1}])] - E[(X^{-}_{j_1 t_1} - E[X_{j_1 t_1}])(X^{-}_{j_2 t_1} - E[X_{j_2 t_1}])]$ will be proportional to $(2\beta^2 + \sigma^2)^{-2}\, 8n\beta_1^2 \beta_2^2$. The superscript indicates whether S = 0 (−) or S = 1 (+) when comparing $\vec{\beta}_{j_1}$ and $\vec{\beta}_{j_2}$.

As with the single QTL examples, we again found that power for rejecting $H_0$ was greater when using tests based on $\hat{\rho}_L$, compared to tests based on $\hat{\rho}_P$. Now, we see that when S = 0, or the signs of $\vec{\beta}_{j_2}$ alternate, the advantage of the former is even greater. As promised in section 2.3, in the examples with S = 0, the power to detect traits sharing common QTL using $\hat{\rho}_P$ is only equal to the alpha level, 0.05, even when $C(j_1, j_2) = 1$.

[Figure 1: plot of E[Corr. Coef] (y-axis) against C (x-axis), with separate curves for 2 QTL, 4 QTL and 6 QTL.]
Fig. 1. The $E[\hat{\rho}_L]$ (y-axis) are compared for multiple values of $C(j_1, j_2)$ (x-axis) and multiple numbers of QTL, $N_1 = N_2 \in \{2, 4, 6\}$ (equivalently $H_T \in \{0.05, 0.09, 0.13\}$). $\hat{\rho}_L$ was simulated using single point mapping, a 5000 cM chromosome, 100 subjects, and S = 1.

Table 2. The power for each of the following three tests: reject $H_0: R_0 \cap R_1 = \emptyset$ when $\hat{\rho}_L > c_{0.05}$ ($X_j$ calculated by either single point, s.p., mapping or CIM), and reject when $\hat{\rho}_P > c''_{0.05}$, are derived for multiple combinations of QTL number, % overlap, and S. Simulations assume a 5000 cM genome and 100 subjects.

                                  Test power
Number of QTL  % Overlap  S      rho_L (s.p.)  rho_L (CIM)  rho_P
2              50         0      0.164         0.554        0.079
2              50         1      0.162         0.536        0.073
2              100        0      0.277         0.632        0.052
2              100        1      0.333         0.624        0.155
4              25         0      0.119         0.546        0.071
4              25         1      0.126         0.548        0.071
4              50         0      0.195         0.566        0.047
4              50         1      0.239         0.612        0.120
4              75         0      0.303         0.586        0.068
4              75         1      0.390         0.666        0.255
4              100        0      0.406         0.610        0.049
4              100        1      0.564         0.772        0.419

3.2 Experiment
We calculated the LOD score profiles for the 6818 genes measured by Yvert and colleagues (2003). Following the steps outlined in section 2.5, we formed clusters of genes where the values for $\hat{\rho}_L$ exceeded 0.4 for all member pairs. The threshold was decided after approximating the absolute null distribution for $\hat{\rho}_L$ by permuting the subject labels for each trait, recalculating the LOD score profiles with CIM, and then using those new profiles to calculate values of $\hat{\rho}_L$. Applying our clustering method to these permuted traits, we found 643, 236, 50 and 3 clusters of two, three, four and five genes, respectively, where all member pairs had $\hat{\rho}_L$ exceeding 0.4. No clusters contained more than five genes. Since we focused our interest on larger clusters with at least 10 genes, we did not repeat this computationally expensive permutation step to estimate P-values for the smaller clusters. This permutation method leads to a distribution of $\hat{\rho}_L$ under the absolute null, $H_{00}$: there are no QTL. In practice, we often found little difference between this distribution and one assuming $H_0$ (data not shown).

From the actual, experimental values, over 34 854 pairs had a $\hat{\rho}_L > 0.4$. We then ranked all clusters by their average value of $\hat{\rho}_L$ and focused on the top 20 clusters with at least 10 genes (see Supplementary Material for the genes within each cluster).
Genes within these clusters tended to have the same molecular functions and biological processes, as determined by Gene Ontology (GO) annotations (see Supplementary Material). The GO project creates a common vocabulary to describe genes and their products (www.geneontology.org). For example, the biological process of all annotated genes in the highest ranking cluster included 'DNA metabolic process' and 'Organelle organization and biogenesis'. The molecular function of all of these genes included 'Helicase activity'. Therefore, we now have potential functions for the six genes in this group that previously had no annotation. Figure 2 illustrates that the 16 genes in that cluster share multiple QTL. At least 67% of all annotated genes in eight additional clusters shared a common molecular function or biological process. Additionally, 77% of the genes in cluster 16 participate in a metabolic process, but the annotations discriminate between RNA, DNA, and amino acid metabolic processes. The mean $\hat{\rho}_L$ was 0.6 for the four genes labelled as part of the amino acid metabolic process, suggesting that those share nearly identical QTL with each other, but only a portion of their QTL with genes involved in other metabolic processes.

Fig. 2. Each row represents the LOD score profile for a gene in cluster 1. LOD scores are color-coded: 0–1 (white), 1–2 (yellow), 2–3 (orange), 3–4 (red), 4–5 (purple) and 5–6 (black). Numbers/blue lines on the x-axis indicate chromosomes.

The overall correlation between $\hat{\rho}_L$ and $|\hat{\rho}_P|$ was 0.37. Among the 215 661 pairs (1%) which were ranked highest by $\hat{\rho}_L$, 64 232 were also within the top 1% of all pairs as ranked by $|\hat{\rho}_P|$. Moreover, among the 20 clusters (≥ 10 genes) which ranked highest by $\hat{\rho}_L$, most had large mean values of $|\hat{\rho}_P|$ (see Supplementary Material).
Therefore, in this case, where both clustering methods are applicable, the clusters with the largest values of $\hat{\rho}_L$ would also be identifiable when clustering by phenotype, suggesting that few coregulated pairs can be described by the alternating scenario ($S = 0$).

4 DISCUSSION

The LOD score correlation coefficient, $\hat{\rho}_L$, quantifies the evidence that two traits share common QTL. Ideally, $\hat{\rho}_L$ can be interpreted as $C(\cdot,\cdot)$, the average percentage of genes which are common to both traits. However, even in the absence of this ideal, we demonstrated that $\hat{\rho}_L$ will still be strongly correlated with $C(\cdot,\cdot)$ and that the statistic can be used to identify clusters of coregulated traits. In this article, both traits are expression profiles and are measured on a single population. Fortunately, $\hat{\rho}_L$ is equally valid when assessing the genetic coregulation for any two quantitative traits that are measured on any two populations. Therefore, $\hat{\rho}_L$ offers the most widely applicable and statistically rigorous means for identifying genetically coregulated traits.

As more laboratories focus on genetical genomics, we need new methods to synthesize results. Investigators often focus on a specific subset of genes. The subset may be determined by their expression platform or by their experimental goals. Each manufacturer includes a different set of probes (i.e. different genes) on its microarrays (Verdugo and Medrano, 2006), and labs often limit their measurements to a specific type of tissue or a specific set of genes thought to be associated with a disease (MacLaren and Sikela, 2005; Yamashita et al., 2005). As an example of how $\hat{\rho}_L$ will immediately impact the field of genetical genomics, we offer WebQTL (http://www.genenetwork.org), a database that includes LOD score profiles for an expansive list of traits in mouse, rat and barley. This website was designed to find coregulated genes. Currently it assesses coregulation by $\hat{\rho}_P$ and is limited to searching traits measured on the same cross. Our introduction of $\hat{\rho}_L$ will now allow for previously impossible comparisons.

Clearly, $\hat{\rho}_L$ has limitations because it is a function of LOD scores. The quality of $\hat{\rho}_L$ is limited by the quality of the LOD score profiles, which are often very noisy. False positives will occur when two traits are influenced by linked, but not identical, genes. Moreover, there are other statistics which may better compare two LOD score profiles. In the future, we should explore statistics that account for the number of overlapping QTL and preprocess the LOD scores by smoothing their profiles before comparison. With enough smoothing, correlation will be based only on the largest peaks. Also, we might search for a method to better estimate P-values and FDR. Currently, our null distributions, whether from theory or permutation, assume the absence of any genetic influence, and are therefore only close approximations to the desired null distributions.

At this stage, we propose the obvious two-step procedure for improved QTL mapping. First, search for genetically coregulated traits. Then, perform joint linkage mapping on those traits. The P-values from standard joint linkage methods are then no longer valid, because we have purposely grouped traits that appear to have similar LOD score profiles. In future research, we hope to combine clustering and linkage so we can assign a single, meaningful significance level for each purported gene-trait pair. As we improve our methods and genetical genomics continues to gain popularity, $\hat{\rho}_L$ will become increasingly important in identifying genetic coregulation.

ACKNOWLEDGEMENTS

This work was funded by National Institute of Dental and Craniofacial Research (T32DE07132). We wish to thank John Storey and Elizabeth Thompson for their valuable comments, and also thank the anonymous reviewers for their extremely insightful suggestions.

Conflict of Interest: none declared.

REFERENCES

Brem,R. and Kruglyak,L.
(2005) The landscape of genetic complexity across 5,700 gene expression traits in yeast. PNAS, 102, 1572–1577.
Brem,R.B. et al. (2002) Genetic dissection of transcriptional regulation in budding yeast. Science, 296, 752–755.
Chen,Z. (2005) The full EM algorithm for the MLEs of QTL effects and positions and their estimated variances in multiple-interval mapping. Biometrics, 61, 474–480.
Eisen,M.B. et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci., 95, 14863–14868.
Haley,C.S. et al. (1994) Mapping quantitative trait loci in crosses between outbred lines using least squares. Genetics, 136, 1195–1207.
Hastie,T. et al. (2001) The Elements of Statistical Learning. 1st edn. Springer, Canada.
Jiang,C. and Zeng,Z.B. (1995) Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics, 140, 1111–1127.
Knott,S.A. and Haley,C.S. (2000) Multitrait least squares for quantitative trait loci detection. Genetics, 156, 899–911.
Korol,A.B. et al. (1996) Linkage between quantitative trait loci and marker loci: resolution power of three statistical approaches in single marker analysis. Biometrics, 52, 426–441.
Kraft,P. et al. (2003) Multivariate variance components analysis of longitudinal blood pressure measurements from the Framingham Heart Study. BMC Genetics, 4 (suppl. 1), S55–S61.
Lander,E.S. and Botstein,D. (1989) Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics, 121, 185–199.
Lehmann,E.L. and Casella,G. (1998) Theory of Point Estimation. 2nd edn. Springer, New York, Chapter 6, pp. 429–475.
Li,J. and Burmeister,M. (2005) Genetical genomics: combining genetics with gene expression analysis. Hum. Mol. Genet., 14 (suppl. 2), R163–169.
MacLaren,E.J. and Sikela,J.M. (2005) Cerebellar gene expression profiling and eQTL analysis in inbred mouse strains selected for ethanol sensitivity. Alcohol Clin. Exp. Res., 29, 1568–1579.
Mangin,B. et al.
(1998) Pleiotropic QTL analysis. Biometrics, 54, 88–99.
Martin,L.J. et al. (2003) Phenotypic, genetic, and genome-wide structure in the metabolic syndrome. BMC Genetics, 4 (suppl. 1), S95–S99.
Schadt,E. et al. (2005) An integrative genomics approach to infer causal associations between gene expression and disease. Nat. Genet., 37, 710–717.
Segal,E. et al. (2003) Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics, 19 (suppl. 1), i273–282.
Verdugo,R.A. and Medrano,J.F. (2006) Comparison of gene coverage of mouse oligonucleotide microarray platforms. BMC Genomics, 7, 58.
Weller,J. et al. (1996) Application of a canonical transformation to detection of quantitative trait loci with the aid of genetic markers in multitrait experiments. Theor. Appl. Genet., 92, 998–1002.
Yamashita,S. et al. (2005) Expression quantitative trait loci analysis of 13 genes in the rat prostate. Genetics, 171, 1231–1238.
Yvert,G. et al. (2003) Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nat. Genet., 35, 57–63.

APPENDIX A

When the following three assumptions hold and the number of subjects and markers are large, $\hat{\rho}_L(j_1, j_2) \approx C(j_1, j_2)$.

ASSUMPTION 1. $k_1, k_2 \ll 1$.

ASSUMPTION 2. The genes are independent of each other. For any two positions, $t_1$ and $t_2$, $P(G_{t_1} = 1 \mid G_{t_2}) = 0.5$.

ASSUMPTION 3. Genetic effects are described by the linear model in Equation (10), with its accompanying restrictions on $\beta$ and $\varepsilon$.

We show in the Supplementary Material that violating these assumptions has minimal effect. Without loss of generality, let genes $1, \ldots, N_1$ be the influential QTL for trait $j_1$. As Lander and Botstein (1989) point out, the expected score per progeny, or ELOD, can be described by

$$\mathrm{ELOD}(j_1, t') = -0.5 \log_{10} \frac{\sum_{t \neq t'} \beta_{j_1 t}^2 + \sigma^2}{\sum_{t} \beta_{j_1 t}^2 + \sigma^2} \qquad (11)$$

when $t' \leq N_1$. Moreover, we know that $E[X_{j_1 t'}] \approx 0.22$ when $t' > N_1$. By Assumption 3, $\mathrm{ELOD}(j_1, t')$ is the same for all $t' \leq N_1$.
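Let $E_{LOD} \equiv \mathrm{ELOD}(j_1, t')$ when $t' \leq N_1$. Therefore,

$$\bar{X}_{j_1} \approx 0.22\,[k_1 n E_{LOD} + (1 - k_1)] \qquad (12)$$

Furthermore, when $t' > N_1$,

$$(X_{j_1 t'} - \bar{X}_{j_1}) \approx 0.22\,[1 - (k_1 n E_{LOD} + (1 - k_1))] \qquad (13)$$

$$(X_{j_1 t'} - \bar{X}_{j_1})^2 \approx 0.22^2\,[k_1^2 - 2 k_1^2 n E_{LOD} + k_1^2 n^2 E_{LOD}^2] \approx 0.22^2\,[k_1^2 n^2 E_{LOD}^2] \qquad (14)$$

When $t' \leq N_1$,

$$(X_{j_1 t'} - \bar{X}_{j_1}) \approx 0.22\,[n E_{LOD} - (k_1 n E_{LOD} + (1 - k_1))] \qquad (15)$$

$$(X_{j_1 t'} - \bar{X}_{j_1})^2 \approx 0.22^2 (1 - k_1)^2 [n^2 E_{LOD}^2 - 2 n E_{LOD} + 1] \approx 0.22^2 (1 - k_1)^2 [n^2 E_{LOD}^2] \qquad (16)$$

Here $F \approx G$ if $F/G \to 1$ as $n \to \infty$. Using these results and their counterparts for $j_2$, and writing an overline for the average over all markers $t$,

$$\overline{(X_{j_1 t} - \bar{X}_{j_1})^2} \approx 0.22^2 n^2 E_{LOD}^2\,[(1 - k_1) k_1^2 + k_1 (1 - k_1)^2] = 0.22^2 n^2 E_{LOD}^2\,(k_1 - k_1^2) \qquad (17)$$

Next, we approximate the average of $(X_{j_1 t} - \bar{X}_{j_1})(X_{j_2 t} - \bar{X}_{j_2})$ using equations 13 and 15.

$$\begin{aligned}
\overline{(X_{j_1 t'} - \bar{X}_{j_1})(X_{j_2 t'} - \bar{X}_{j_2})} \approx 0.22^2\,[ & (1 - k_1 - k_2 + k_0)(k_1 - k_1 n E_{LOD})(k_2 - k_2 n E_{LOD}) \\
& + (k_1 - k_0)(1 - k_1)(n E_{LOD} - 1)(k_2 - k_2 n E_{LOD}) \\
& + (k_2 - k_0)(1 - k_2)(n E_{LOD} - 1)(k_1 - k_1 n E_{LOD}) \\
& + k_0 (1 - k_1)(n E_{LOD} - 1)(1 - k_2)(n E_{LOD} - 1)\,] \\
\approx 0.22^2 n^2 E_{LOD}^2\,( & k_0 - k_1 k_2)
\end{aligned} \qquad (18)$$

Approximations 17, its counterpart for $j_2$, and 18 show that

$$\hat{\rho}_L \approx \frac{k_0 - k_1 k_2}{\sqrt{(k_2 - k_2^2)(k_1 - k_1^2)}} \approx \frac{k_0}{\sqrt{k_1 k_2}} = C(j_1, j_2) \qquad (19)$$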
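The algebra behind Equations 18 and 19 can be checked numerically. The sketch below is an illustration only, with hypothetical function names. It writes the deviation at a QTL marker as $(1 - k)(nE_{LOD} - 1)$, following Equation 15, and uses `x` for $nE_{LOD}$. It also assumes that a null LOD score behaves like a $\chi^2_1$ variable divided by $2\ln 10$, which would explain the constant 0.22.

```python
import math


def cross_product_terms(k0, k1, k2, x):
    """Bracketed sum of Equation 18, with x standing for n * E_LOD."""
    null_1 = k1 - k1 * x        # trait j1 deviation at a non-QTL marker (Eq. 13)
    null_2 = k2 - k2 * x        # same for trait j2
    qtl_1 = (1 - k1) * (x - 1)  # trait j1 deviation at one of its QTL (Eq. 15)
    qtl_2 = (1 - k2) * (x - 1)  # same for trait j2
    return (
        (1 - k1 - k2 + k0) * null_1 * null_2  # QTL for neither trait
        + (k1 - k0) * qtl_1 * null_2          # QTL for j1 only
        + (k2 - k0) * qtl_2 * null_1          # QTL for j2 only
        + k0 * qtl_1 * qtl_2                  # QTL shared by both traits
    )


def rho_L_approx(k0, k1, k2):
    """Middle expression of Equation 19."""
    return (k0 - k1 * k2) / math.sqrt((k2 - k2 ** 2) * (k1 - k1 ** 2))


def shared_fraction(k0, k1, k2):
    """C(j1, j2) = k0 / sqrt(k1 * k2), the right-hand side of Equation 19."""
    return k0 / math.sqrt(k1 * k2)


# Expected null LOD if LOD = chi-square(1) / (2 ln 10): E = 1 / (2 ln 10).
null_lod_mean = 1.0 / (2.0 * math.log(10.0))
print(round(null_lod_mean, 2))

k0, k1, k2, x = 0.01, 0.02, 0.03, 50.0
# The four terms collapse to (x - 1)^2 * (k0 - k1 * k2), which is close to
# x^2 * (k0 - k1 * k2) when x = n * E_LOD is large.
print(cross_product_terms(k0, k1, k2, x), (x - 1) ** 2 * (k0 - k1 * k2))
print(rho_L_approx(k0, k1, k2), shared_fraction(k0, k1, k2))
```

For small $k_1$ and $k_2$ (Assumption 1) the last two printed values are close, which is the content of the second approximation in Equation 19.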