15/09/2014

Detecting “polygenes” using signals of polygenic selection. Tools for increasing the power of GWAS.

Piffer, Davide
Gilfoyle, Bertram
Contact: [email protected]

Abstract

Comparison of allele frequency patterns across many loci has recently been used to identify selective pressure on polygenic traits such as height and IQ. In this paper, we used one such approach based on principal components analysis, and analyzed GWAS hits from recent studies whose effect on intelligence has been replicated. The component scores had very strong correlations with estimates of country/population IQ (r around 0.8-0.9). To further validate this approach, we tested the prediction that the alleles positively correlated to the principal component (PC) are overrepresented among highly intelligent individuals. James Watson and Craig Venter both meet this requirement and have their genomes freely available online. We found an overrepresentation (compared to the 1000 Genomes CEU sample) of alleles correlated with our PC in Watson and Venter’s genomes, suggesting that our PCs could represent a genuine signal of selection pressure on intelligence across many genes. We found that among the alleles correlated with our PCs, there is a higher ratio of derived:ancestral alleles compared to SNPs uncorrelated to the PC. This is in accord with the increase of human intelligence during hominin evolution.

Introduction

Over the last few years, researchers have started moving away from the study of genetic evolution using a single-gene, Mendelian approach towards models that examine many genes together (polygenic). The more genes are involved in a given phenotype, the more the signal of natural selection will be “diluted” across different genomic regions (because each gene accounts for a tiny effect) making it difficult to detect it using approaches focused on a single gene (Pritchard et al., 2010; Piffer, 2014).
A first attempt at empirically identifying polygenic selection was made by Turchin et al. (2012) on two populations (Northern and Southern Europeans), providing evidence for a higher frequency of height-increasing alleles (obtained from GWAS studies) among Northern Europeans. A drawback of that paper was its reliance on populations from a single continent, and the use of crude pairwise comparisons (e.g. French vs. Italian) without correlating frequency differences to average population height. Moreover, the strength of selection was not determined.

Another highly polygenic phenotype is intelligence or IQ. The meta-analysis of Rietveld et al. (2013) found ten SNPs that increased educational attainment, comprising three with nominal genome-wide significance and seven with suggestive significance. A recent study has replicated the positive effect of the top three SNPs (rs9320913, rs11584700 and rs4851266) on mathematics and reading performance in an independent sample of school children (Ward et al., 2014).

Two different approaches to identifying selection based on the correlation of allele frequencies across different populations have recently been developed by Piffer (2013) and Berg & Coop (2014). Piffer (2013) applied principal components analysis to the frequencies of the ten alleles reported in Rietveld et al. (2013) with a positive effect on educational attainment and intelligence, using two samples comprising 14 and 50 populations (1000 Genomes and ALFRED databases, respectively), and found that they loaded highly and in the expected direction (positively) on a single factor accounting for most of the variance. The factor scores were correlated to indexes of country educational attainment (i.e. PISA) and IQ, producing high correlations (r around 0.9). This factor was interpreted as indicating the strength of polygenic selection.
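The principal components step described above can be illustrated with a small sketch. It is not the authors' code or software: the allele frequencies and SNP names are invented, and the first component is obtained by power iteration on the correlation matrix purely to keep the example dependency-free (a statistics package would normally be used). It shows how the loadings, the share of variance explained, and the per-population component scores relate to one another.

```python
import math

# Hypothetical allele frequencies (NOT from the paper):
# one list per SNP, one entry per population.
freqs = {
    "hyp_snp_a": [0.20, 0.35, 0.50, 0.65, 0.80],
    "hyp_snp_b": [0.25, 0.30, 0.55, 0.60, 0.85],
    "hyp_snp_c": [0.10, 0.40, 0.45, 0.70, 0.75],
}

def zscore(col):
    """Standardize one SNP's frequencies across populations."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
    return [(x - mean) / sd for x in col]

def correlation_matrix(z):
    """Pearson correlations between standardized SNP columns."""
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def leading_eigenpair(m, iters=200):
    """Leading eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    k = len(m)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(v[i] * m[i][j] * v[j] for i in range(k) for j in range(k))
    return v, eigval

z = [zscore(col) for col in freqs.values()]
corr = correlation_matrix(z)
vec, eigval = leading_eigenpair(corr)

# Share of total variance carried by the first component.
explained = eigval / len(z)
# Loadings: correlation of each SNP with the component.
loadings = [x * math.sqrt(eigval) for x in vec]
# Standardized component score for each population.
n_pops = len(z[0])
scores = [
    sum(vec[i] * z[i][p] for i in range(len(z))) / math.sqrt(eigval)
    for p in range(n_pops)
]

for name, loading in zip(freqs, loadings):
    print(f"{name}: loading {loading:.3f}")
print(f"variance explained: {explained:.1%}")
print("scores:", [round(s, 3) for s in scores])
```

With a single retained component, the mean of the squared loadings equals the proportion of variance explained, which gives a quick consistency check between a reported component matrix and its reported variance share.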
This was the first time that genetic frequencies from a cross-racial sample were used and an estimate of selection strength was provided, allowing it to be correlated with measured average phenotypic scores. The principal component found by Piffer (2013) indicated that alleles located on different chromosomes (unlinked) are not randomly distributed across populations. In fact, they follow a similar spatial distribution, so that countries with a higher frequency of one intelligence-increasing allele will on average have higher frequencies of other intelligence-increasing alleles. The principal component extracted from only a few alleles thus predicts the frequency of other alleles whose association with intelligence has not yet been determined. Further support for this contention was recently provided by Piffer (2014), with the finding that three racial groups had significantly different frequencies of 12 intelligence-increasing alleles.

Recently, Piffer used 4 SNPs whose effect on intelligence has been replicated (Piffer, 2014b). These are the three top SNPs reported in Rietveld et al. (2013) and one reported in Benyamin et al. (2013) and Davies et al. (2011). Piffer (2014b) found that their effect on country IQ was not mediated by ancestral migrations and that it also persisted within continents. A strong negative correlation between genotypic IQ and stature (height) was also found and interpreted as opposite selective pressure on these two phenotypes.

Intelligence is a highly polygenic trait, probably influenced by many thousands of genetic variants. If the principal component truly represents a genuine signal of polygenic selection, then as more variants with positive effects on intelligence are discovered, they should be found to be positively correlated to it.
A consequence of these assumptions is that if there are currently hundreds or thousands of alleles with a positive effect on intelligence (that studies have yet to identify), these will be positively correlated to the principal component found by Piffer. Thus, if we scan the entire human genome and select the alleles with a strong positive correlation to the principal component, we will isolate a sample of alleles whose total (polygenic) effect on intelligence will be positive. This selected sample of alleles should have on average higher frequencies among people with high intelligence, compared to the normal populations (the samples in the 1000 Genomes database).

The genomes of two highly intelligent individuals (Nobel Laureate James Watson and Craig Venter) are both freely available on the internet. James Watson’s intelligence level is likely high, given that he obtained a PhD at 21 and shared a Nobel prize for work published when he was 25. Venter’s achievements are outstanding and, as further evidence of his intelligence, he scored 143 on the IQ test administered by the Navy (Venter, 2008). We used them as a sample of highly intelligent individuals to test the prediction that alleles positively correlated to the principal component have significantly higher frequency in their genomes than in the genomes of the normal (reference) population to which they belong (CEU).

This test has a double purpose: 1) to validate the signal of polygenic selection obtained from the two sets of 4 and 3 SNPs with principal component analysis; 2) to identify a set of SNPs whose probability of being associated with intelligence is higher than that of a random set of SNPs. This will be used to increase the power of GWAS, by reducing multiple comparisons.

Methods and results

James Watson and Craig Venter’s genomes were downloaded from Ensembl (ftp://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens/), hg38. Frequencies of the four intelligence-increasing alleles (3 from Rietveld et al.
+ rs236330) were obtained from 1000 Genomes, phase 3, for 26 populations. These SNPs are the only ones whose positive effect has been replicated (Piffer, 2014b). An additional analysis was run with 3 SNPs that reached significance in a very recent study (Rietveld et al., 2014). Finally, an analysis with all 7 SNPs together was carried out.

Analysis 1. Top 4 SNPs

A principal component analysis was run and a single factor was extracted that explained 75.42% of the variance. The other components, each explaining less than 15% of the variance (eigenvalues lower than 0.6), were thus excluded from the analysis. Component loadings are reported in table 1. These are all high (>0.8) and in the expected direction (positive loadings). Component scores were highly correlated to country IQ (r=0.886; N=23; p<0.001).

Table 1. Component matrix (a)

Allele          Component 1
rs9320913_A     .804
rs11584700_G    .877
rs4851266_T     .938
rs236330_C      .850

Extraction Method: Principal Component Analysis.
a. 1 components extracted.

Component scores were interpreted as a signal of polygenic selection and are reported in table 2.

Table 2.

Population            PC
Afr.Car.Barbados      -1.37606
US Blacks             -1.2195
Bengali Banglade      0.05106
Chinese Dai           1.11266
UtahWhites            0.7512
HanChineseBeijing     1.26035
HanChineseSouth       1.09477
Colombian             -0.025
Esan Nigeria          -1.48221
Finns                 0.84245
British               0.79862
Gujarati Ind. Tx      0.57772
Gambian               -1.56704
Spanish               0.65138
Indian Telegu UK      0.11431
Japanese              1.05752
Vietnam               1.36983
Luhya Kenya           -1.62949
Mende Sierra Leo      -1.31723
Mexican LA            -0.08204
Peruvian              -0.26814
Punjabi Pakistan      0.17343
Puerto Rican          0.03404
SriLankanUK           0.02587
TuscanItaly           0.60048
Yoruba                -1.54897

The entire 1000 Genomes, phase 3, allele frequency database was searched for alleles with a strong positive association with the principal component (r>0.8). This returned a set of N= 238,198 SNPs, consisting of 183508 and 54690 “beneficial” and “detrimental” alleles, respectively.
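The genome-wide search just described amounts to correlating each allele's frequency vector across populations with the component scores and keeping those above the threshold. The sketch below illustrates that filter under stated assumptions: the five component scores are taken from Table 2 (Afr.Car.Barbados, US Blacks, Bengali Banglade, Chinese Dai, UtahWhites), while the allele names and frequencies are invented for illustration and the function names are hypothetical. It is not the pipeline used in the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Component scores for five populations, copied from Table 2.
pc_scores = [-1.37606, -1.2195, 0.05106, 1.11266, 0.7512]

# Invented allele frequencies in the same five populations (illustration only).
allele_freqs = {
    "hyp_allele_1": [0.10, 0.15, 0.40, 0.70, 0.60],  # tracks the component
    "hyp_allele_2": [0.50, 0.48, 0.52, 0.49, 0.51],  # roughly flat
    "hyp_allele_3": [0.80, 0.75, 0.50, 0.20, 0.35],  # runs against the component
}

THRESHOLD = 0.8  # the r > 0.8 cut-off used in the text

correlations = {name: pearson(f, pc_scores) for name, f in allele_freqs.items()}
selected = [name for name, r in correlations.items() if r > THRESHOLD]

for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
print("selected:", selected)
```

At a biallelic site the other allele has frequency 1 - f, so its correlation with the component is exactly the negative of the value computed here; an allele that runs strongly against the component therefore implies that its counterpart passes the filter.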
A Fisher’s exact test was used to compare the control (1000 Genomes CEU) to Watson and Venter’s genomes. A 2x2 contingency table was filled out, with allele count data for the two groups (Venter’s or Watson’s genome and CEU) and two outcomes (“beneficial” and other allele). This is a test against the null expectation that “beneficial” alleles (those with a positive correlation to the principal component) are randomly distributed across individuals, independent of their cognitive phenotype. Genomic regions in linkage with the 4 SNPs (distance <500Kb) were excluded as these could have inflated the significance of the results.

The test for Venter’s genome produced significant results (p<2.2e-16) for the alternative hypothesis that the true odds ratio is greater than 1. Odds ratio= 1.152 (lower bound 95% confidence interval: 1.142). Allele counts are reported in table 3a.

Table 3a. Venter’s 2x2 contingency table. Strongly associated SNPs (r>0.8).

                      Outcome 1 (Beneficial alleles)   Outcome 2 (Other allele)
Group 1 (Venter)      189332                           48956
Group 2 (CEU mean)    183508                           54690

The test for Watson’s genome produced significant results (p<2.2e-16) for the alternative hypothesis that the true odds ratio is greater than 1. Odds ratio= 1.188 (lower bound 95% confidence interval: 1.178). Allele counts are reported in table 3b.

Table 3b. Watson’s 2x2 contingency table. Strongly associated SNPs (r>0.8).

                      Outcome 1 (Beneficial alleles)   Outcome 2 (Other allele)
Group 1 (Watson)      190498                           47790
Group 2 (CEU mean)    183508                           54690

The same analysis was run on SNPs with a correlation around 0 (p<0.005) with the principal component in order to determine whether the signal (odds ratio) would be weaker. For Venter’s genome, the p value was significant (p=0.00043) and the odds ratio slightly higher than unity (1.019). 95% confidence interval: 1.008-1.029. Allele counts are reported in table 4a.

Table 4a. Venter’s 2x2 contingency table. Non-associated SNPs (p<0.005).
                      Outcome 1 (Neutral allele)   Outcome 2 (Other allele)
Group 1 (Venter)      82330                        64262
Group 2 (CEU mean)    81623                        64908

For Watson’s genome, results were similar and also significant (p=0.009), with an odds ratio slightly higher than unity (1.013). 95% confidence interval: 1.003-1.024.

Table 4b. Watson’s 2x2 contingency table. Non-associated SNPs (p<0.005).

                      Outcome 1 (Neutral allele)   Outcome 2 (Other allele)
Group 1 (Watson)      82153                        64437
Group 2 (CEU mean)    81623                        64908

Ancestral vs derived alleles

Table 5 reports the number of alleles that are derived and ancestral for two groups of alleles: one with alleles very strongly (r>0.8) correlated to the PC and the other with alleles not correlated to it (p<0.005). There is an overrepresentation of derived alleles in the r>0.8 group compared to the other group, with an odds ratio of 1.075, p=4.549e-12 (95% C.I. 1.053-1.097).

Table 5.

             SNPs r>0.8   SNPs p<0.005
Derived      56987        34495
Ancestral    40512        26362

ANALYSIS II. 3 SNPs from Rietveld et al., 2014

Recently a new study has identified three SNPs with a positive relationship to intelligence (rs1487441; rs7923609; rs2721173). We ran a principal components analysis on them. A single component was extracted that explained 59.69% of the variance. Two of the three trait-increasing alleles (rs1487441_A and rs2721173_C) loaded positively on it (r= 0.725 and 0.739, respectively) but one (rs7923609_G) had a negative loading (r= -0.848). The component matrix is reported in table 6.

Table 6. Component matrix (a)

Allele          Component 1
rs1487441_A     .725
rs7923609_G     -.848
rs2721173_C     .739

Extraction Method: Principal Component Analysis.
a. 1 components extracted.

Component scores were positively correlated to country IQ (r=0.716, N=23; p<0.001) and to the top 4 SNPs PC (r= 0.854, N=26; p<0.001). Since the two PCs extracted from two independent sets of SNPs were correlated to each other, we ran a PCA with the 7 SNPs altogether.
This produced 2 components with eigenvalues higher than one, explaining 64.66 and 16.79% of the variance, respectively. The second component was not clearly interpretable. 6 out of 7 alleles loaded positively on the first component (r= 0.518 to 0.877) and one (rs7923609_G) loaded in the opposite direction (r=-0.822). The first PC was also positively correlated to country IQ (r= 0.855) and to the other two PCs from the two separate sets of SNPs (r= 0.984 and 0.931, N=26; p<0.001).

Table 7. Component matrix (a)

Allele          Component 1   Component 2
rs1487441_A     .810          .522
rs7923609_G     -.822         .381
rs2721173_C     .518          -.571
rs9320913_A     .824          .497
rs11584700_G    .854          -.167
rs4851266_T     .877          .228
rs236330_C      .864          -.324

Extraction Method: Principal Component Analysis.
a. 2 components extracted.

The entire 1000 Genomes, phase 3, allele frequency database was searched for alleles with a strong positive association with the 3 alleles-PC principal component (r>0.9).

Fisher’s exact test with the 3 SNPs PC.

A Fisher’s exact test was used to compare the control (1000 Genomes CEU) to Watson and Venter’s genomes. A 2x2 contingency table was filled out, with allele count data for the two groups (Venter’s or Watson’s genome and CEU) and two outcomes (“beneficial” and other allele). Genomic regions in linkage with the 3 SNPs (distance <500Kb) were excluded as these could have inflated the significance of the results.

The test for Venter’s genome produced significant results (p<2.2e-16) for the alternative hypothesis that the true odds ratio is greater than 1. Odds ratio= 1.219 (lower bound 95% confidence interval: 1.205). Allele counts are reported in table 8a.

Table 8a. Venter’s 2x2 contingency table. Strongly associated SNPs (r>0.8).

                      Outcome 1 (Beneficial allele)   Outcome 2 (Other allele)
Group 1 (Venter)      101552                          25746
Group 2 (CEU mean)    97202                           30041

The test for Watson’s genome produced significant results (p<2.2e-16) for the alternative hypothesis that the true odds ratio is greater than 1.
Odds ratio= 1.225 (lower bound 95% confidence interval: 1.211). Allele counts are reported in table 8b.

Table 8b. Watson’s 2x2 contingency table. Strongly associated SNPs (r>0.8).

                      Outcome 1 (Beneficial allele)   Outcome 2 (Other allele)
Group 1 (Watson)      101655                          25643
Group 2 (CEU mean)    97202                           30041

The same analysis was run on SNPs with a correlation around 0 (p<0.005) with the principal component in order to determine whether the signal (odds ratio) would be weaker. For Venter’s genome, the p value was not significant (p=0.375) and the odds ratio was near 1 (0.995). 95% confidence interval: 0.985-1.005. Allele counts are reported in table 9a.

Table 9a. Venter’s 2x2 contingency table. Non-associated SNPs (p<0.005).

                      Outcome 1 (Neutral allele)   Outcome 2 (Other allele)
Group 1 (Venter)      80870                        61422
Group 2 (CEU mean)    81007                        61232

For Watson’s genome, the p value was not significant (p=0.665) and the odds ratio was near 1 (1.002). 95% confidence interval: 0.991-1.013. Allele counts are reported in table 9b.

Table 9b. Watson’s 2x2 contingency table. Non-associated SNPs (p<0.005).

                      Outcome 1 (Neutral allele)   Outcome 2 (Other allele)
Group 1 (Watson)      81121                        61175
Group 2 (CEU mean)    81007                        61232

Ancestral vs derived alleles

Table 10 reports the number of alleles that are derived and ancestral for two groups of alleles: one with alleles very strongly (r>0.8) correlated to the PC and the other with alleles not correlated to it (p<0.005). There is an overrepresentation of derived alleles in the r>0.8 group compared to the other group, with an odds ratio of 1.295, p < 2.2e-16 (95% C.I. 1.263-1.327).

Table 10.
             SNPs r>0.8   SNPs p<0.005
Derived      33418        33886
Ancestral    19012        24966

Discussion

The frequencies of 4 SNPs were retrieved from 1000 Genomes, phase 3 (26 populations) and produced a principal component explaining the majority of the variance, with all four alleles loading highly (r>0.8) and in the expected direction (positive loadings), suggesting that it represents (polygenic) selection pressure on intelligence. This picture was supported by the component scores’ high correlation (r=0.886) with the average IQ scores of countries. Another set of 3 SNPs with a significant association to intelligence reported in a recent study (Rietveld et al., 2014) produced a PC that was strongly associated with the former PC (r= 0.854) and with country IQ (r= 0.716). A PCA was run on the entire set of SNPs (N=7), and the resulting PC was strongly correlated to country IQ (r= 0.855) and to the other two PCs (r> 0.8). Since the two sets of SNPs are not in linkage and were reported in two independent studies, their producing very similar results lends credibility to the hypothesis that the results are not due to a statistical fluke but represent a genuine signal of natural selection acting on many loci.

The alleles with a positive correlation (r>0.8) to the top 4 SNPs PC had higher frequencies in Venter and Watson’s genomes than in the control 1000 Genomes population (CEU). Odds ratios were 1.188 and 1.152 for Watson and Venter, respectively. The alleles with no correlation to the 4SNPs_PC (p<0.005) were still overrepresented in Venter and Watson’s genomes but less so. Odds ratios were lower (1.013 and 1.019 for Watson and Venter, respectively). This result is difficult to interpret and could be due to chance.
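As an arithmetic check on the odds ratios quoted above, the cross-product ratio can be recomputed directly from the allele counts in Tables 3a, 3b, 4a and 4b. The sketch below uses the simple sample odds ratio (a·d)/(b·c); Fisher's exact test implementations such as R's `fisher.test` report a conditional maximum-likelihood estimate instead, so small differences in the third decimal place are to be expected. The function and variable names are chosen here for illustration.

```python
def odds_ratio(a, b, c, d):
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

# (genome outcome 1, genome outcome 2, CEU outcome 1, CEU outcome 2)
tables = {
    "Venter, r>0.8 (Table 3a)": (189332, 48956, 183508, 54690),
    "Watson, r>0.8 (Table 3b)": (190498, 47790, 183508, 54690),
    "Venter, uncorrelated (Table 4a)": (82330, 64262, 81623, 64908),
    "Watson, uncorrelated (Table 4b)": (82153, 64437, 81623, 64908),
}

# Odds ratios as reported in the text, for comparison.
reported = {
    "Venter, r>0.8 (Table 3a)": 1.152,
    "Watson, r>0.8 (Table 3b)": 1.188,
    "Venter, uncorrelated (Table 4a)": 1.019,
    "Watson, uncorrelated (Table 4b)": 1.013,
}

computed = {label: odds_ratio(*counts) for label, counts in tables.items()}

for label, value in computed.items():
    print(f"{label}: computed {value:.4f}, reported {reported[label]}")
```

The same function applies unchanged to the derived versus ancestral counts in Tables 5 and 10, since those comparisons are also 2x2 tables.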
The PC extracted from the set of 3 SNPs (taken from Rietveld et al., 2014) produced even better results, with higher odds ratios for the alleles correlated to it (r>0.8) (odds ratios of 1.225 and 1.219 for Watson and Venter, respectively) and lower odds ratios (not significantly different from 1) for the alleles not associated with the PC (1.002 and 0.995 for Watson and Venter, respectively).

The analysis of ancestral alleles reveals that the alleles positively associated with the IQ PC have an overrepresentation of derived alleles and a lower proportion of ancestral alleles than uncorrelated SNPs. Again the results were slightly better for the PC extracted from the 3 SNPs (odds ratio= 1.295) than for the PC extracted from the top 4 SNPs (odds ratio= 1.075). This confirms our hypothesis that intelligence-increasing alleles tend to be derived because human intelligence has greatly increased since the last common ancestor of humans and non-human primates.

An important note of caution applies when reading these results, because the significance levels are likely to be highly inflated. We treated each allele as independent from all the others, but linkage disequilibrium makes this an unrealistic assumption. In the next analysis, we will control for possible confounding effects due to linkage by dividing the genome into regions of 500kb and regarding each 500kb region as a single case (the average frequency within that region will be used).

The employment of the PC allows the researcher to select a set of candidate genes (see supplementary files), thus greatly reducing the issue of multiple comparisons and the reduction in significance following statistical corrections (e.g. Bonferroni). This increases the power of GWAS and reduces the sample size required to perform it.

References:

Benyamin, B., Pourcain, B.St., Davis, O.S., Davies, G., Hansell, N.K., Brion, M.-J.A. et al. (2013).
Childhood intelligence is heritable, highly polygenic and associated with FNBP1L. Molecular Psychiatry, 19: 253-258.

Berg, J.J. & Coop, G. (2014). A Population Genetic Signal of Polygenic Adaptation. PLOS Genetics, doi: 10.1371/journal.pgen.1004412

Davies, G., Tenesa, A., Payton, A., Yang, J., Harris, S.E., Liewald, D., Xiayi, K., Le Hellard, S. et al. (2011). Genome-wide association studies establish that human intelligence is highly heritable and polygenic. Molecular Psychiatry, 16: 996-1005.

Piffer, D. (2013). Factor Analysis of Population Allele Frequencies as a Simple, Novel Method of Detecting Signals of Recent Polygenic Selection: The Example of Educational Attainment and IQ. Mankind Quarterly, 54: 168-200.

Piffer, D. (2014a). Simple statistical tools to detect signals of recent polygenic selection. IBC, 6(1): 1-6. doi: 10.4051/ibc.2014.6.1.0001

Piffer, D. (2014b). Estimating strength of polygenic selection with principal component analysis of spatial genetic variation. http://dx.doi.org/10.1101/008011

Pritchard, J. K., Pickrell, J. K., and Coop, G. (2010). The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Current Biology 20, R208-215.

Rietveld, C. A., Medland, S. E., Derringer, J., Yang, J., Esko, T., Martin, N. W., Westra, H. J., Shakhbazov, K., Abdellaoui, A., Agrawal, A., et al. (2013). GWAS of 126,559 individuals identifies genetic variants associated with educational attainment. Science 340, 1467-1471.

Rietveld, C., Esko, T., Davies, G., et al. (2014). Common genetic variants associated with cognitive performance identified using the proxy-phenotype method. doi: 10.1073/pnas.1404623111

Turchin, M. C., Chiang, C. W., Palmer, C. D., Sankararaman, S., Reich, D., Genetic Investigation of ANthropometric Traits (GIANT) Consortium, and Hirschhorn, J. N. (2012). Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nat Genet 44, 1015-1019.

Venter, C. (2008). A Life Decoded. Penguin Books.

Ward, M.E., McMahon, G., St Pourcain, B., Evans, D.M., Rietveld, C.A., et al. (2014). Genetic Variation Associated with Differential Educational Attainment in Adults Has Anticipated Associations with School Performance in Children. PLoS ONE 9(7): e100248. doi:10.1371/journal.pone.0100248