Current Research Areas in Complex Trait Genetics
Richard Mott
December 2014

Overview
- Heritability and Liability
- Polygenic Scores
- Estimating SNP effects jointly
- Genomic Prediction

Genetic Relationship Matrices (Reminder)
- At SNP $q$, the genotype of individual $i$ is $g_{iq} \in \{0, \tfrac{1}{2}, 1\}$; this coding assumes effects are additive at $q$.
- Genetic correlation across all loci:
  $$k_{ij} = \frac{\sum_q (g_{iq}-\bar{g}_q)(g_{jq}-\bar{g}_q)}{\sqrt{\sum_q (g_{iq}-\bar{g}_q)^2}\,\sqrt{\sum_q (g_{jq}-\bar{g}_q)^2}}$$
- $K = (k_{ij}) = ZZ'$, where $Z$ has one row per individual and one column per SNP, and
  $$Z_{iq} = \frac{g_{iq}-\bar{g}_q}{\sqrt{\sum_q (g_{iq}-\bar{g}_q)^2}}$$
  is the standardised genotype.

Heritability and Liability
- We know how to compute the heritability of a quantitative trait as $h^2 = \sigma_g^2 / (\sigma_g^2 + \sigma_e^2)$.
- How do we compute the heritability of a dichotomous trait?
- We could simply treat the disease status 0/1 as if it were a quantitative trait, using the mixed model machinery to estimate heritability. But this causes two problems:
  - Scale. For quantitative traits the scale of measurement is the same as the scale on which heritability is expressed. For dichotomous traits, phenotypes are measured on the 0/1 scale, but heritability is most interpretable on the liability scale.
  - Ascertainment. In case-control studies the proportion $P$ of cases is usually (much) larger than the prevalence $K$ in the population, yet estimates of genetic variation are most interpretable if they are not biased by this ascertainment.

Liability
Source shown on this slide: Lee SH, Wray NR, Goddard ME, Visscher PM. Estimating Missing Heritability for Disease from Genome-wide Association Studies. The American Journal of Human Genetics 88, 294–305, March 11, 2011. DOI 10.1016/j.ajhg.2011.02.002.

Abstract of the paper: Genome-wide association studies are designed to discover SNPs that are associated with a complex trait. Employing strict significance thresholds when testing individual SNPs avoids false positives at the expense of increasing false negatives. Recently, we developed a method for quantitative traits that estimates the variation accounted for when fitting all SNPs simultaneously. Here we develop this method further for case-control studies. We use a linear mixed model for analysis of binary traits and transform the estimates to a liability scale by adjusting both for scale and for ascertainment of the case samples. We show by theory and simulation that the method is unbiased. We apply the method to data from the Wellcome Trust Case Control Consortium and show that a substantial proportion of variation in liability for Crohn disease, bipolar disorder, and type I diabetes is tagged by common SNPs.

[Figure 1. The Liability Threshold Model for a Disease Prevalence of K. An underlying continuous random variable determines disease status. If liability exceeds the threshold t, then individuals are affected.]

Steps recoverable from the excerpted derivation (following Dempster and Lerner):
- Liability model: $l = \mu 1_N + g + e$.
- Genetic value on the observed 0–1 scale: $u = c + bg = c + zg$, where $c$ is a constant, $b = \mathrm{cov}(y, g)/\sigma_g^2 = z$, and $z$ is the height of the standard normal density at the truncation threshold $t$.
- Genetic variance on the observed scale: $\sigma_u^2 = \mathrm{var}(zg) = z^2 \sigma_g^2$.
- Dividing by the Bernoulli variance $K(1-K)$ of the 0–1 observations: $h_o^2 = \sigma_u^2 / [K(1-K)] = h_l^2 z^2 / [K(1-K)]$.
- Rearranged: $h_l^2 = h_o^2 K(1-K)/z^2$. This linear transformation was derived by Alan Robertson.

Summary:
- We distinguish between heritability on the 0–1 observed scale, $h_o^2$, and on the Normally distributed liability scale, $h_l^2$.
- We relate these two heritabilities by $h_l^2 = h_o^2 K(1-K)/z^2$.
- $K$ is the disease prevalence in the population.
- An individual is a case if their unobserved liability score exceeds $t(K)$.
- $z(K)$ is the height of the $N(0, 1)$ density at the threshold $t(K)$.
Ascertainment Bias
[Figure 2. The Distribution of Liability When Cases Are Oversampled as in a Case-Control Study.]

Relations recoverable from the excerpt of Lee et al. (2011) shown on the slide:
- $\mathrm{var}(y_{cc}) = P(1-P)$ is the phenotypic variance on the observed scale in the case-control sample.
- The genetic value on the observed scale in a case-control study is $u_{cc} = c + b_{cc} g_{cc}$ with $b_{cc} = z \, \frac{P(1-P)}{K(1-K)} \, \frac{\sigma_g^2}{\sigma_{g_{cc}}^2}$.
- The term $\frac{P(1-P)}{K(1-K)} \frac{\sigma_g^2}{\sigma_{g_{cc}}^2}$ quantifies the change of the regression coefficient due to ascertainment. In the absence of ascertainment ($P = K$) it equals 1.

Summary:
- Let $P \approx 0.5$ be the fraction of cases in the case-control sample.
- Usually $P > K$, so the sample is enriched for cases.
- Let $h_{oc}^2$ be the heritability on the observed scale in a case-control study, estimated in the usual way, e.g. with GCTA.
- The adjustment for ascertainment bias is
  $$h_l^2 \approx h_{oc}^2 \, \frac{K(1-K)}{z^2} \, \frac{K(1-K)}{P(1-P)}$$
- If $P = K$ this reduces to the previous relationship.
- Care must be taken to remove SNPs whose missing-data rates differ between cases and controls, since these may inflate the apparent heritability.

Example: Crohn's Disease
Excerpt from Lee et al. (2011): "... we conclude that there is no need to make the missing threshold more stringent than 20. On the liability scale, the heritability estimate (i.e., the variance in liability explained by the SNPs) is 0.22 (SE = 0.04), which is much higher than that explained by genome-wide significant SNPs. Similar results are obtained if the SNPs with MAF > 0.05 are used (Table 3). This indicates that common SNPs (MAF > 0.05) are in substantial LD with causal variants for Crohn disease."

Table 3. Estimated Genetic Variance on the Observed and Liability Scale Explained by All SNPs for Crohn Disease in WTCCC Data

MAF filter   Threshold(a)  No. SNP(b)  Estimate(c) (SE)  LR     Adjusted(d) (SE)  Transformed(e) (SE)
MAF > 0.01   200           322,142     0.56 (0.07)       63.16  0.64 (0.08)       0.24 (0.03)
MAF > 0.01   20            294,850     0.53 (0.07)       57.48  0.61 (0.08)       0.22 (0.03)
MAF > 0.01   7             248,791     0.52 (0.07)       57.30  0.61 (0.08)       0.22 (0.03)
MAF > 0.01   4             195,977     0.50 (0.07)       54.94  0.60 (0.08)       0.22 (0.03)
MAF > 0.05   200           293,269     0.56 (0.07)       69.00  0.63 (0.08)       0.23 (0.03)
MAF > 0.05   20            266,843     0.53 (0.07)       63.27  0.60 (0.08)       0.22 (0.03)
MAF > 0.05   7             225,043     0.52 (0.07)       63.94  0.60 (0.08)       0.22 (0.03)
MAF > 0.05   4             177,615     0.50 (0.07)       62.14  0.60 (0.08)       0.22 (0.03)

(a) Excluding SNPs with more than the listed number of missing genotypes.
(b) After filtering on the basis of SNP missing rate.
(c) Estimate of genetic variance proportional to the total phenotypic variance on the observed scale.
(d) Estimate adjusted for reduced number of SNPs.
(e) Transformed genetic variance proportional to the total phenotypic variance on the liability scale under the assumption that the population prevalence is 0.1%, i.e. the heritability on the liability scale explained by the SNPs.
Source: The American Journal of Human Genetics 88, 294–305, March 11, 2011.

Estimating Liability to improve GWAS power
- See LEAP (Weissbrod et al., http://arxiv.org/pdf/1409.2448.pdf).
- Basic idea: estimate the unobserved liability in each individual and use this in a GWAS:
  - Estimate heritability on the liability scale, taking into account ascertainment.
  - Estimate the effect of each SNP.
  - Using a probit model, compute a liability estimate for every individual.
  - Test SNPs for association with the estimated liability via a standard LMM.

Polygenic Scores
- What can one do when a GWAS finds few or no genome-wide significant associations?
- There may still be some genuine signal among those SNPs with p-values in the range (say) $< 10^{-1}$, but also many false positives.
- One solution: compute the genome-wide heritability.
- Polygenic scores are another way forward.
- Let $X(t)$ be the set of SNPs with p-values $< t$, thinned to remove redundant SNPs in strong LD with each other.
- Let $g_{ix}$ be the genotype dosage of SNP $x$ in individual $i$.
- Let $\beta_x$ be the estimated coefficient of SNP $x$ in the GWAS.
- The polygenic score is $S_i(t) = \sum_{x \in X(t)} g_{ix} \beta_x$.
- Expect $S_i(t)$ to be large when individual $i$ is affected and low when $i$ is not.
- A polygenic score can predict the phenotype in individuals not used to train it (genomic prediction).
- A polygenic score can be used to assess genetic correspondence between different diseases measured in different cohorts.
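The polygenic score $S_i(t) = \sum_{x \in X(t)} g_{ix} \beta_x$ defined above can be sketched as follows. This is only an illustration: the function name `polygenic_scores` and the toy data are hypothetical, dosages are coded 0/1/2 here, and the SNPs are assumed to have been LD-thinned already.

```python
def polygenic_scores(dosages, betas, pvalues, t):
    """Return S_i(t) = sum over SNPs x with p-value < t of g_ix * beta_x.

    dosages : list of per-individual lists, dosages[i][x] = g_ix
    betas   : GWAS effect estimates from the training sample, one per SNP
    pvalues : GWAS p-values from the training sample, one per SNP
    t       : p-value threshold defining the SNP set X(t)
    """
    selected = [x for x, p in enumerate(pvalues) if p < t]   # the set X(t)
    return [sum(row[x] * betas[x] for x in selected) for row in dosages]


if __name__ == "__main__":
    # Toy data (hypothetical): 3 target individuals, 4 LD-thinned SNPs.
    dosages = [[0, 1, 2, 0],
               [2, 0, 1, 1],
               [1, 1, 0, 2]]
    betas = [0.10, -0.20, 0.05, 0.30]
    pvalues = [0.03, 0.40, 0.09, 0.70]
    for t in (0.1, 0.5):
        print(t, polygenic_scores(dosages, betas, pvalues, t))
```

The effect estimates and p-values come from the training sample, while the dosages belong to the target individuals being scored; keeping the two samples separate is what makes the score a genuine out-of-sample prediction.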
Common polygenic variation contributes to risk of schizophrenia and bipolar disorder
The International Schizophrenia Consortium. Nature, Vol 460, 6 August 2009, doi:10.1038/nature08185.

Abstract of the paper: Schizophrenia is a severe mental disorder with a lifetime risk of about 1%, characterized by hallucinations, delusions and cognitive deficits, with heritability estimated at up to 80%. We performed a genome-wide association study of 3,322 European individuals with schizophrenia and 3,587 controls. Here we show, using two analytic approaches, the extent to which common genetic variation underlies the risk of schizophrenia. First, we implicate the major histocompatibility complex. Second, we provide molecular genetic evidence for a substantial polygenic component to the risk of schizophrenia involving thousands of common alleles of very small effect. We show that this component also contributes to the risk of bipolar disorder, but not to several non-psychiatric diseases.

Further excerpts from the paper: We genotyped the International Schizophrenia Consortium (ISC) case-control sample for up to ~1 million single nucleotide polymorphisms (SNPs), augmented by imputed common HapMap SNPs. In the genome-wide association study (GWAS; genomic control $\lambda_{GC} = 1.09$), the most associated genotyped SNP ($P = 3.4 \times 10^{-7}$) was located in the first intron of myosin XVIIIB (MYO18B) on chromosome 22. The second strongest association comprised more than 450 SNPs on chromosome 6p spanning the major histocompatibility complex (MHC). [...] The best imputed SNP, which reached genome-wide significance (rs3130297, $P = 4.79 \times 10^{-8}$, T allele odds ratio = 0.747, minor allele frequency (MAF) = 0.114, 32.3 megabases (Mb)), was also in the MHC, 7 kilobases (kb) from NOTCH4, a gene with previously reported associations with schizophrenia. We imputed classical human leukocyte antigen (HLA) alleles; six were significant at $P < 10^{-3}$, found on the ancestral European haplotype. However, it was not possible to ascribe the association to a specific HLA allele, haplotype or region. We exchanged GWAS summary results with the Molecular Genetics of Schizophrenia (MGS) and SGENE consortia for genotyped SNPs with $P < 10^{-3}$. There were 8,008 cases and 19,077 controls of European descent in the combined sample. Our top genotyped MHC SNP (rs3130375) had $P = 0.086$ and $P = 0.14$ in MGS and SGENE, respectively. [...]

Summary:
- 3,322 cases, 3,587 controls.
- Most significant SNP $P = 3.4 \times 10^{-7}$, only weakly associated.
- Cluster of associated SNPs in the MHC, $P < 6 \times 10^{-7}$.
- 74,062 SNPs in linkage equilibrium were used as the basis from which to select sets $X(t)$ for polygenic score analysis.
- Males (2,176 cases, 1,642 controls) were used to train the polygenic score, females (1,146 cases, 1,945 controls) to test it.

[Figure 2 of the paper: Replication of the ISC-derived polygenic component in independent schizophrenia and bipolar disorder samples. Variance explained in the target samples on the basis of scores derived in the entire ISC for five significance thresholds ($P_T < 0.1$, 0.2, 0.3, 0.4 and 0.5, plotted left to right in each study). The y axis indicates Nagelkerke's pseudo $R^2$; the number above each set of bars is the P value for the $P_T < 0.5$ target sample analysis. Panels: schizophrenia, bipolar disorder, non-psychiatric (WTCCC). CAD, coronary artery disease; CD, Crohn's disease; HT, hypertension; RA, rheumatoid arthritis; T1D, type I diabetes; T2D, type II diabetes. Numbers for cases/controls: MGS-EA 2,687/2,656; MGS-AA 1,287/973; O'Donovan 479/2,938; STEP-BD 955/1,498; WTCCC 1,829/2,935; CD 1,748/2,935; CAD 1,926/2,935; HT 1,952/2,935; RA 1,860/2,935; T1D 1,963/2,935; and T2D 1,924/2,935.]

- The score based on all SNPs with male discovery $t < 0.5$ (37,655 SNPs) was highly correlated with schizophrenia in target females, $P < 9 \times 10^{-19}$.
- Logistic regression on the polygenic score: $P(\text{case}) = e^{\mu + \gamma S} / (1 + e^{\mu + \gamma S})$.
- The score explains 3% of the variance using Nagelkerke's pseudo $R^2$.

Nagelkerke's $R^2$
- Liability is one way to define heritability for a dichotomous trait.
- How do we define heritability on the observed scale?
- How do we define variance explained in a logistic model?
- The concept of variance explained does not exist in a logistic model.
- But the deviance (twice the log likelihood ratio) is a natural generalisation.
- $\rho^2 = 1 - \left( L(\theta_0) / L(\hat{\theta}) \right)^{2/n}$
- $\rho^2_{max} = 1 - L(\theta_0)^{2/n}$
- Nagelkerke's $R^2 = \rho^2 / \rho^2_{max}$
- This is the same as the squared Pearson correlation $R^2$ when the data are Normally distributed.

Estimating SNP effects Jointly
- Conventional GWAS estimate each SNP independently.
- There are some reasons for considering all SNPs jointly:
  - Variability explained jointly is handled better
  - Genomic prediction
  - Computational reasons
- But we need to impose shrinkage constraints on the SNP estimates.
- We describe four methods: BLUP, ridge regression, LASSO, prior SNP distributions.

BLUP: Best Linear Unbiased Predictions
- BLUPs are the predictions of the random (SNP) effects in a mixed model.
- They are linear functions of the observations, are unbiased and have minimum variance.
- All random effects are estimated jointly by BLUP.
- There are usually many more SNPs than individuals, so BLUPs are another way of getting SNP estimates.

BLUP for random SNP effects
- NOTE: we are using non-standard notation where random SNP effects are $\beta$ and fixed covariates are $\alpha$ (in most descriptions, the fixed effects are $\beta$ and the random effects are $u$).
- Mixed model: $y = X\alpha + Z\beta + e$
- $\alpha$ are fixed effects (e.g. age, sex) with design matrix $X$.
- $\beta$ are the random SNP effects with scaled genotype matrix $Z$.
- $\beta \sim MVN(0, G)$, $e \sim MVN(0, R)$, $\mathrm{cov}(\beta, e) = 0$
- $V = ZGZ' + R$
- BLUP: $\hat{\beta} = GZ'V^{-1}(y - X\hat{\alpha})$
- $\hat{y} = X\hat{\alpha} + Z\hat{\beta} = Cy$ for a matrix $C$

BLUP
- BLUP is good for genomic prediction, but:
- BLUP SNP estimates tend to underestimate large-effect SNPs and to over-estimate small effects.
- BLUP SNP estimates resemble a sample from a Gaussian distribution.
- They can still be useful as a first approximation to identify candidate SNPs, e.g. Verbyla et al (2007) Theor Appl Genet 116:95–111.
- More complex versions of BLUP treat different genome regions differently, e.g. Speed and Balding (2014) Genome Research 24:1550–1557.

Ridge Regression and The Lasso
- Least squares regression: minimise $\Omega = (y - X\beta)'(y - X\beta)$
- Ridge regression: minimise $\Omega$ subject to $\beta'\beta < k$
- Lasso regression: minimise $\Omega$ subject to $\sum_j |\beta_j| < k$
- Both ridge and lasso are shrinkage methods but have different properties: often many of the lasso estimates of $\beta_j$ are exactly 0, while ridge estimates are non-zero but small.
- Lasso is potentially useful for fitting all SNPs in a model simultaneously, forcing most to be zero.
- Many extensions to the lasso for QTL mapping have been published, e.g. Yi and Xu 2008, Genetics 179:1045-1055.

The Lasso
[Figure]

Prior Distributions on SNP Effects
Source shown on this slide: Verbyla KL, Bowman PJ, Hayes BJ, Goddard ME. Sensitivity of genomic selection to using different prior distributions. BMC Proceedings 2010, 4(Suppl 1):S5. http://www.biomedcentral.com/1753-6561/4/S1/S5. From the 13th European workshop on QTL mapping and marker assisted selection, Wageningen, The Netherlands, 20-21 April 2009.

Abstract of the paper: Genomic selection describes a selection strategy based on genomic estimated breeding values (GEBV) predicted from dense genetic markers such as single nucleotide polymorphism (SNP) data. Different Bayesian models have been suggested to derive the prediction equation, with the main difference centred around the specification of the prior distributions. Methods: The simulated dataset of the 13th QTL-MAS workshop was analysed using four Bayesian approaches to predict GEBV for animals without phenotypic information. Different prior distributions were assumed to assess their effect on the accuracy of the predicted GEBV. Conclusion: All methods produced GEBV that were highly correlated with the true breeding values. The models appear relatively insensitive to the choice of prior distributions for QTL-MAS data set and this is consistent with uniformity of performance of different methods found in real data.

Prior Distributions on SNP Effects
- Mixed model: $y = X\alpha + Z\beta + g + e$ (NOTE different notation)
- $\mathrm{var}(g) = K\sigma_g^2$, $\mathrm{var}(e) = I\sigma_e^2$
- SNP effects $\beta$, fixed effects $\alpha$, polygenic effects $g$

Table 1. Prior Distribution Specifications

Method               Prior distribution
Bayes BLUP           $\beta_i \mid \sigma^2 \sim N(0, \sigma^2)$; $\sigma^2 \sim \chi^{-2}(r, s)$
Bayes A              $\beta_i \mid \sigma_i^2 \sim N(0, \sigma_i^2)$; $\sigma_i^2 \sim \chi^{-2}(r, s)$
Bayes A/B (Hybrid)   $\beta_i \mid \sigma_i^2 \sim N(0, \sigma_i^2)$; $\sigma_i^2 = 0$ with probability $1 - \pi$; $\sigma_i^2 \sim \chi^{-2}(r, s)$ with probability $\pi$
Bayes C              $\beta_i \mid \gamma_i, \sigma_i^2 \sim \gamma_i N(0, \sigma_i^2) + (1 - \gamma_i) N(0, \sigma_i^2/100)$; $\sigma_i^2 \sim \chi^{-2}(r, s)$; $\gamma_i \sim \mathrm{Bernoulli}(\pi)$, $1 - p(\gamma_i = 0) = p(\gamma_i = 1) = \pi$

$\beta_i$ is the effect for the ith SNP and $\gamma_i$ is the indicator variable for the ith SNP.

GEBV: Genomic Estimated Breeding Values

Table 2. Correlations Between Estimated GEBV for unphenotyped animals at t=600

             Bayes C   Bayes A/B   Bayes BLUP
Bayes A      0.999     0.991       0.860
Bayes C      1         0.993       0.863
Bayes A/B              1           0.893

Table 3. Comparison of True and Estimated GEBV

Method       Correlation   MSE     Rank    Regression
Bayes.BLUP   0.885         5.479   0.691   0.979
BayesA       0.857         7.092   0.696   1.162
BayesA/B     0.889         5.435   0.73    1.081
BayesC       0.861         6.561   0.71    1.024

Correlation coefficient between the true and predicted GEBV, Mean Square Error (MSE), Rank (accuracy of predicting the best 100 animals) and the Regression Coefficient of the true breeding value on the estimated GEBV.
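The BLUP formula $\hat{\beta} = GZ'V^{-1}(y - X\hat{\alpha})$ and ridge regression from the earlier slides are closely related, which the following sketch makes concrete. It rests on several assumptions not stated in the slides: $G = \sigma_\beta^2 I$, $R = \sigma_e^2 I$, the phenotype has already been adjusted for fixed effects, and ridge is written in its penalised form with $\lambda = \sigma_e^2 / \sigma_\beta^2$. Under those assumptions the two estimators coincide. All function names and the toy data are hypothetical, and plain Python is used only to keep the example dependency-free.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]    # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def blup_snp_effects(Z, y, var_beta, var_e):
    """beta_hat = G Z' V^-1 y with G = var_beta*I, R = var_e*I, V = Z G Z' + R."""
    n = len(Z)
    ZZt = matmul(Z, transpose(Z))                       # n x n
    V = [[var_beta * ZZt[i][j] + (var_e if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    Vinv_y = solve(V, y)
    return [var_beta * sum(Z[i][q] * Vinv_y[i] for i in range(n))
            for q in range(len(Z[0]))]


def ridge_snp_effects(Z, y, lam):
    """beta_hat = (Z'Z + lam*I)^-1 Z'y, the penalised form of ridge regression."""
    Zt = transpose(Z)
    p = len(Zt)
    ZtZ = matmul(Zt, Z)                                 # p x p
    A = [[ZtZ[i][j] + (lam if i == j else 0.0) for j in range(p)]
         for i in range(p)]
    Zty = [sum(z * yi for z, yi in zip(col, y)) for col in Zt]
    return solve(A, Zty)


if __name__ == "__main__":
    # Toy data (hypothetical): 3 individuals, 4 SNPs, phenotype already
    # adjusted for fixed effects.
    Z = [[0.5, -0.5, 0.0, 1.0],
         [-0.5, 0.0, 0.5, -1.0],
         [0.0, 0.5, -0.5, 0.0]]
    y = [1.0, -0.4, -0.6]
    print(blup_snp_effects(Z, y, var_beta=0.5, var_e=1.0))
    print(ridge_snp_effects(Z, y, lam=1.0 / 0.5))
```

The BLUP form solves an n-by-n system (individuals) while the ridge form solves a p-by-p system (SNPs), so with many more SNPs than individuals the BLUP form is the cheaper route to the same estimates.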
Conclusions
- Research in Complex Trait Analysis leads in several profitable directions:
  - Identification of individual SNP effects
  - Estimates of heritability in the entire population
  - Genomic prediction
- Progress in human complex trait research is accelerated by considering work in animals and plants.