15/09/2014
Detecting “polygenes” using signals of polygenic selection. Tools for increasing the power of GWAS.
Piffer, Davide
Gilfoyle, Bertram
Contact: [email protected]
Abstract
Comparison of allele frequency patterns across many loci has recently been used to identify
selective pressure on polygenic traits such as height and IQ. In this paper, we used one such
approach based on principal components analysis, and analyzed GWAS hits from recent studies
whose effect on intelligence has been replicated. The component scores had very strong
correlations with estimates of country/population IQ (r around 0.8-0.9). To further validate this
approach, we tested the prediction that the alleles positively correlated to the principal component
(PC) are overrepresented among highly intelligent individuals. James Watson and Craig Venter
both meet this requirement and have their genomes freely available online. We found an
overrepresentation (compared to the 1000 Genomes CEU sample) of alleles correlated with our
PC in Watson and Venter’s genomes, suggesting that our PCs could represent a genuine signal of
selection pressure on intelligence across many genes. We found that among the alleles correlated
with our PCs, there is a higher ratio of derived:ancestral alleles compared to SNPs uncorrelated to
the PC. This is in accord with the increase of human intelligence during hominin evolution.
Introduction
Over the last few years, researchers have started moving away from the study of genetic evolution
using a single-gene, Mendelian approach towards models that examine many genes together
(polygenic). The more genes are involved in a given phenotype, the more the signal of natural
selection will be “diluted” across different genomic regions (because each gene accounts for a tiny
effect) making it difficult to detect it using approaches focused on a single gene (Pritchard et al.,
2010; Piffer, 2014). A first attempt at empirically identifying polygenic selection was made by
Turchin et al (2012) on two populations (Northern and Southern Europeans) and evidence for
higher frequency of height increasing alleles (obtained from GWAS studies) among Northern
Europeans was provided. A drawback of that paper was the reliance on populations from a single
continent and that crude pairwise comparisons (e.g. French vs. Italian) were used without
correlating frequency differences to average population height. Moreover, the strength of selection
was not determined.
Another highly polygenic phenotype is intelligence or IQ. The meta-analysis by Rietveld et al. (2013)
found ten SNPs associated with higher educational attainment, comprising three with nominal
genome-wide significance and seven with suggestive significance. A recent study has replicated the
positive effect of these top three SNPs rs9320913, rs11584700 and rs4851266 on mathematics
and reading performance in an independent sample of school children (Ward et al., 2014).
Two different approaches to identifying selection based on the correlation of allele frequencies across
different populations have recently been developed by Piffer (2013) and Berg & Coop (2014).
Piffer (2013) applied principal components analysis to the frequencies of the ten alleles reported in
Rietveld et al. (2013) with positive effect on educational attainment and intelligence from two
samples comprising 14 and 50 populations (1000 Genomes and ALFRED databases, respectively)
and found that they loaded highly and in the expected direction (positively) on a single factor
accounting for most of the variance. The factor scores were correlated to indexes of country
educational attainment (i.e. PISA) and IQ, producing high correlations (r around 0.9). This factor
was interpreted as indicating the strength of polygenic selection. This was the first time that genetic
frequencies were used from a cross-racial sample and an estimate of selection strength was
provided, thus correlating it with measured average phenotypic scores.
The principal component found by Piffer (2013) indicated that alleles located on different
chromosomes (unlinked) are not randomly distributed across populations. In fact, they follow a
similar spatial distribution, so that countries with a higher frequency of an intelligence-increasing
allele will on average have higher frequencies of other intelligence-increasing alleles. The principal
component extracted from only a few alleles thus predicts the frequency of other alleles whose
association with intelligence hasn’t yet been determined.
Further support for this contention was recently provided by Piffer (2014), with the finding that three
racial groups had significantly different frequencies of 12 intelligence increasing alleles.
Recently, Piffer used 4 SNPs whose effect on intelligence has been replicated (Piffer, 2014b).
These are three top SNPs reported in Rietveld et al. (2013) and one reported in Benyamin et al.
(2013) and Davies et al. (2011).
Piffer (2014b) found that their effect on country IQ was not mediated by ancestral migrations and
that it persisted also within continents. A strong negative correlation between genotypic IQ and
stature (height) was also found and interpreted as opposite selective pressure on these two
phenotypes.
Intelligence is a highly polygenic trait, probably influenced by many thousands of genetic variants.
If the principal component truly represents a genuine signal of polygenic selection, then as more
variants with positive effects on intelligence are discovered, they should be found to be positively
correlated with it.
A consequence of these assumptions is that if there are currently hundreds or thousands of alleles
with a positive effect on intelligence (that studies have yet to identify), these will be positively
correlated to the principal component found by Piffer.
Thus, if we scan the entire human genome and select the alleles with a strong positive correlation
to the principal component, we will isolate a sample of alleles whose total (polygenic) effect on
intelligence will be positive. This selected sample of alleles should have on average higher
frequencies among people with high intelligence, compared to the normal populations (the samples
in the 1000 Genomes database).
The genomes of two highly intelligent individuals (Nobel Laureate James Watson and Craig
Venter) are both freely available on the internet. James Watson’s intelligence level is likely high,
given that he obtained a PhD in his early twenties and shared a Nobel prize at 34. Venter’s
achievements are outstanding and, as further evidence of his intelligence, he scored 143 on the IQ
test administered by the Navy (Venter, 2008).
We used them as a sample of highly intelligent individuals to test the prediction that alleles
positively correlated to the principal component have significantly higher frequency in their
genomes than in the genomes of the normal (reference) population to which they belong (CEU).
This test has a double purpose: 1) to validate the signal of polygenic selection obtained from the two
sets of 4 and 3 SNPs with principal component analysis; 2) to identify a set of SNPs whose
probability of being associated with intelligence is higher than that of a random set of SNPs. This
set can be used to increase the power of GWAS by reducing multiple comparisons.
Methods and results
James Watson and Craig Venter’s genomes were downloaded from Ensembl
(ftp://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens/), hg38.
Frequencies of the four intelligence increasing alleles (3 from Rietveld et al. + rs236330) were
obtained from 1000 Genomes, phase 3 for 26 populations. These SNPs are the only ones whose
positive effect has been replicated (Piffer, 2014b).
An additional analysis was run with 3 SNPs that reached significance in a very recent study
(Rietveld et al, 2014). Finally, an analysis with all 7 SNPs together was carried out.
Analysis 1. Top 4 SNPs
A principal component analysis was run and a single factor was extracted that explained 75.42%
of the variance. The other components, each explaining less than 15% of the variance (eigenvalues
lower than 0.6), were thus excluded from the analysis. Component loadings are reported in table 1.
These are all high (>0.8) and in the expected direction (positive loadings). Component scores were
highly correlated to country IQ (r=0.886; N=23; p<0.001).
Table 1. Component matrix (a).
Allele          Component 1
rs9320913_A     .804
rs11584700_G    .877
rs4851266_T     .938
rs236330_C      .850
Extraction Method: Principal Component Analysis.
a. 1 components extracted.
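As a rough illustration of the extraction step, the sketch below computes the first principal component of a populations-by-SNPs frequency matrix from its correlation matrix, using power iteration. The frequencies in `toy_freqs` are invented for illustration and are not the 1000 Genomes values; the function names are ours, and the software used for the original analysis is not stated, so this should be read as one possible implementation rather than the procedure actually followed.

```python
import math

def standardize(column):
    """Centre a column and scale it to unit (sample) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]

def first_component(freqs, iterations=500):
    """First PC of a rows=populations x columns=SNPs matrix.

    Returns (share of variance explained, loadings, standardized scores).
    """
    cols = [standardize(list(c)) for c in zip(*freqs)]
    n, p = len(freqs), len(cols)
    # Correlation matrix of the SNP columns.
    corr = [[sum(a * b for a, b in zip(cols[i], cols[j])) / (n - 1)
             for j in range(p)] for i in range(p)]
    # Power iteration for the leading eigenvector.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p))
                     for i in range(p))
    # Loadings are correlations between each SNP and the component.
    loadings = [math.sqrt(eigenvalue) * x for x in v]
    scores = [sum(cols[j][i] * v[j] for j in range(p)) / math.sqrt(eigenvalue)
              for i in range(n)]
    return eigenvalue / p, loadings, scores

# Hypothetical allele frequencies: 6 populations (rows) x 4 SNPs (columns).
toy_freqs = [
    [0.20, 0.15, 0.30, 0.25],
    [0.25, 0.22, 0.33, 0.31],
    [0.40, 0.35, 0.52, 0.45],
    [0.45, 0.44, 0.55, 0.47],
    [0.60, 0.52, 0.70, 0.66],
    [0.65, 0.61, 0.74, 0.62],
]

explained, loadings, scores = first_component(toy_freqs)
print(f"variance explained: {explained:.1%}")
print("loadings:", [round(x, 3) for x in loadings])
print("scores:", [round(x, 3) for x in scores])
```

The scores are scaled to unit variance, which is the scale on which the component scores in table 2 appear to be reported.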
Component scores were interpreted as a signal of polygenic selection and are reported in table 2.
Table 2. Component scores by population.
Population          PC
Afr.Car.Barbados    -1.37606
US Blacks           -1.2195
Bengali Banglade    0.05106
Chinese Dai         1.11266
UtahWhites          0.7512
HanChineseBejing    1.26035
HanChineseSouth     1.09477
Colombian           -0.025
Esan Nigeria        -1.48221
Finns               0.84245
British             0.79862
Gujarati Ind. Tx    0.57772
Gambian             -1.56704
Spanish             0.65138
Indian Telegu UK    0.11431
Japanese            1.05752
Vietnam             1.36983
Luhya Kenya         -1.62949
Mende Sierra Leo    -1.31723
Mexican LA          -0.08204
Peruvian            -0.26814
Punjabi Pakistan    0.17343
Puerto Rican        0.03404
SriLankanUK         0.02587
TuscanItaly         0.60048
Yoruba              -1.54897
The entire 1000 Genomes, phase 3, allele frequency database was searched for alleles with a
strong positive association with the principal component (r>0.8).
This returned a set of N= 238,198 SNPs, consisting of 183508 and 54690 “beneficial” and
“detrimental” alleles, respectively.
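The genome scan described above amounts to computing, for each allele, the Pearson correlation between its population frequencies and the component scores, and keeping those above the threshold. A minimal sketch follows; the five component scores are taken from table 2, while the SNP identifiers and allele frequencies are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_correlated(allele_freqs, pc_scores, threshold=0.8):
    """Return {snp_id: r} for alleles whose frequency correlates with the PC above threshold."""
    selected = {}
    for snp_id, freqs in allele_freqs.items():
        r = pearson_r(freqs, pc_scores)
        if r > threshold:
            selected[snp_id] = r
    return selected

# Component scores for five populations in table 2
# (Yoruba, Luhya Kenya, British, Finns, Japanese).
pc_scores = [-1.54897, -1.62949, 0.79862, 0.84245, 1.05752]

# Hypothetical allele frequencies in the same five populations.
toy_alleles = {
    "snp_A": [0.10, 0.12, 0.55, 0.60, 0.70],  # rises with the PC
    "snp_B": [0.40, 0.38, 0.41, 0.39, 0.40],  # roughly flat
    "snp_C": [0.80, 0.85, 0.30, 0.25, 0.20],  # falls as the PC rises
}

hits = select_correlated(toy_alleles, pc_scores)
print(hits)
```

For a biallelic SNP the other allele has frequency 1 - f, so an allele with a strongly negative r (such as the hypothetical `snp_C`) implies that its alternative allele is strongly positively correlated with the component.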
A Fisher’s exact test was used to compare the control (1000 Genomes CEU) to Watson and
Venter’s genomes. A 2x2 contingency table is filled out, with allele count data for the two groups
(Venter’s or Watson’s genome and CEU) and two outcomes (“beneficial” and other allele). This is a
test against the null expectation that “beneficial” alleles (those with a positive correlation to the
principal component) are randomly distributed across individuals, independent of their cognitive
phenotype. Genomic regions in linkage with the 4 SNPs (distance <500Kb) were excluded as
these could have inflated the significance of the results.
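The one-sided test can be sketched as follows, using the Venter and CEU counts from table 3a. This is a standard-library implementation that sums the hypergeometric tail in log space; it is not necessarily the routine used for the reported results. It returns the sample odds ratio (ad/bc), which can differ slightly from the conditional estimate some statistical packages report, and the p-value as a base-10 logarithm because the value itself is too small to print usefully.

```python
import math

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test (odds ratio > 1) for the table [[a, b], [c, d]].

    Returns (sample odds ratio, log10 of the p-value).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    log_denom = log_choose(n, row1)
    k_max = min(row1, col1)
    # Log-probabilities of tables at least as extreme as the observed one.
    log_terms = [
        log_choose(col1, k) + log_choose(n - col1, row1 - k) - log_denom
        for k in range(a, k_max + 1)
    ]
    # Log-sum-exp to avoid underflow.
    m = max(log_terms)
    log_p = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return (a * d) / (b * c), log_p / math.log(10)

# Rows: Venter, CEU mean. Columns: "beneficial" allele, other allele (table 3a).
odds_ratio, log10_p = fisher_exact_greater(189332, 48956, 183508, 54690)
print(f"odds ratio = {odds_ratio:.3f}, log10(p) = {log10_p:.1f}")
```

With these counts the sample odds ratio comes out close to the reported 1.152, and the p-value falls well below the 2.2e-16 bound quoted in the text.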
The test for Venter’s genome produced significant results (p<2.2e-16) for the alternative
hypothesis that true odds ratio is greater than 1. Odds ratio= 1.152 (lower bound 95% confidence
interval: 1.142). Allele counts are reported in table 3a.
Table 3a. Venter’s 2x2 contingency table. Strongly associated SNPs (r>0.8).
                      Outcome 1 (Beneficial alleles)   Outcome 2 (Other allele)
Group 1 (Venter)      189332                           48956
Group 2 (CEU mean)    183508                           54690
The test for Watson’s genome produced significant results (p<2.2e-16) for the alternative
hypothesis that true odds ratio is greater than 1. Odds ratio= 1.188 (lower bound 95% confidence
interval: 1.178). Allele counts are reported in table 3b.
Table 3b. Watson’s 2x2 contingency table. Strongly associated SNPs (r>0.8).
                      Outcome 1 (Beneficial alleles)   Outcome 2 (Other allele)
Group 1 (Watson)      190498                           47790
Group 2 (CEU mean)    183508                           54690
The same analysis was run on SNPs with a correlation around 0 (p<0.005) with the principal
component in order to determine whether the signal (odds ratio) would be weaker.
For Venter’s genome, the p value was significant (p=0.00043) and odds ratio slightly higher than
unity (1.019). 95% confidence interval: 1.008-1.029. Allele counts are reported in table 4a.
Table 4a. Venter’s 2x2 contingency table. Non-associated SNPs (p<0.005).
                      Outcome 1 (Neutral allele)   Outcome 2 (Other allele)
Group 1 (Venter)      82330                        64262
Group 2 (CEU mean)    81623                        64908
For Watson’s genome, results were similar and also significant (p=0.009), with an odds ratio slightly
higher than unity (1.013). 95% confidence interval: 1.003-1.024. Allele counts are reported in table 4b.
Table 4b. Watson’s 2x2 contingency table. Non-associated SNPs (p<0.005).
                      Outcome 1 (Neutral allele)   Outcome 2 (Other allele)
Group 1 (Watson)      82153                        64437
Group 2 (CEU mean)    81623                        64908
Ancestral vs derived alleles
Table 5 reports the number of alleles that are derived and ancestral for two groups of alleles: one
with alleles very strongly (r>0.8) correlated to the PC and the other with alleles not correlated to it
(p<0.005). There is an overrepresentation of derived alleles in the r>0.8 group compared to the
other group, with odds ratio of 1.075, p=4.549e-12 (95% C.I. 1.053 – 1.097).
Table 5. Derived and ancestral allele counts (4 SNPs PC).
             SNPs r>0.8   SNPs p<0.005
Derived      56987        34495
Ancestral    40512        26362
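The odds ratio for table 5 can be checked by hand. The sketch below reads the table as derived counts of 56987 (r>0.8) and 34495 (uncorrelated) and ancestral counts of 40512 and 26362, a reading that reproduces the reported odds ratio. It uses a Wald interval on the log odds ratio, which is an assumption on our part: the text does not say how the confidence interval was obtained.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.96):
    """Sample odds ratio of [[a, b], [c, d]] with a Wald 95% interval on the log scale."""
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Rows: derived, ancestral. Columns: SNPs with r>0.8, uncorrelated SNPs.
derived_assoc, derived_null = 56987, 34495
ancestral_assoc, ancestral_null = 40512, 26362

odds_ratio, lower, upper = odds_ratio_wald(
    derived_assoc, derived_null, ancestral_assoc, ancestral_null
)
print(f"OR = {odds_ratio:.3f} (95% CI {lower:.3f} - {upper:.3f})")
```

Under these assumptions the result agrees with the reported 1.075 (1.053 – 1.097) to three decimal places.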
ANALYSIS II. 3 SNPs from Rietveld et al., 2014
Recently a new study has identified three SNPs with a positive relationship to intelligence
(rs1487441; rs7923609; rs2721173). We ran a principal components analysis on them. A single
component was extracted that explained 59.69% of the variance. Two of the three trait-increasing
alleles (rs1487441_A and rs2721173_C) loaded positively on it (r= 0.725 and 0.739, respectively)
but one (rs7923609_G) had a negative loading (r=-0.848). The component matrix is reported in table 6.
Table 6. Component matrix (a).
Allele          Component 1
rs1487441_A     .725
rs7923609_G     -.848
rs2721173_C     .739
Extraction Method: Principal Component Analysis.
a. 1 components extracted.
Component scores were positively correlated to country IQ (r=0.716, N=23; p<0.001) and to the
top 4 SNPs PC (r= 0.854, N=26; p<0.001).
Since the two PCs extracted from two independent sets of SNPs were correlated to each other, we
ran a PCA with all 7 SNPs together. This produced 2 components with eigenvalues higher than
one, explaining 64.66 and 16.79% of the variance, respectively. The second component was not
clearly interpretable. 6 out of 7 alleles loaded positively on the first component (r= 0.518 to 0.877)
and one (rs7923609_G) loaded in the opposite direction (r=-0.822). The first PC was also
positively correlated to country IQ (r= 0.855) and to the other two PCs from the two separate sets
of SNPs (r= 0.984 and 0.931, N=26; p<0.001).
Table 7. Component matrix (a).
Allele          Component 1   Component 2
rs1487441_A     .810          .522
rs7923609_G     -.822         .381
rs2721173_C     .518          -.571
rs9320913_A     .824          .497
rs11584700_G    .854          -.167
rs4851266_T     .877          .228
rs236330_C      .864          -.324
Extraction Method: Principal Component Analysis.
a. 2 components extracted.
The entire 1000 Genomes, phase 3, allele frequency database was searched for alleles with a
strong positive association with the 3 alleles-PC principal component (r>0.9).
Fisher’s exact test with the 3 SNPs PC.
A Fisher’s exact test was used to compare the control (1000 Genomes CEU) to Watson and
Venter’s genomes. A 2x2 contingency table is filled out, with allele count data for the two groups
(Venter’s or Watson’s genome and CEU) and two outcomes (“beneficial” and other allele).
Genomic regions in linkage with the 3 SNPs (distance <500Kb) were excluded as these could have
inflated the significance of the results.
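The exclusion of regions near the index SNPs can be expressed as a simple distance filter. In the sketch below, the chromosome labels, base-pair coordinates and candidate identifiers are hypothetical placeholders, not the real positions of the SNPs used in this study; only the 500 kb window comes from the text.

```python
WINDOW = 500_000  # 500 kb, the exclusion distance stated in the text

def exclude_linked(candidates, index_snps, window=WINDOW):
    """Drop candidates lying within `window` bp of any index SNP on the same chromosome."""
    kept = []
    for chrom, pos, snp_id in candidates:
        near_index = any(
            chrom == index_chrom and abs(pos - index_pos) < window
            for index_chrom, index_pos in index_snps
        )
        if not near_index:
            kept.append((chrom, pos, snp_id))
    return kept

# Hypothetical index SNP positions (chromosome, base-pair coordinate).
index_snps = [("6", 98_000_000), ("1", 204_000_000)]

# Hypothetical candidate SNPs (chromosome, position, identifier).
candidates = [
    ("6", 98_300_000, "cand_1"),   # 300 kb from an index SNP: excluded
    ("6", 99_000_000, "cand_2"),   # 1 Mb away: kept
    ("1", 98_100_000, "cand_3"),   # different chromosome: kept
    ("1", 203_600_000, "cand_4"),  # 400 kb from an index SNP: excluded
]

kept = exclude_linked(candidates, index_snps)
print([snp_id for _, _, snp_id in kept])
```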
The test for Venter’s genome produced significant results (p<2.2e-16) for the alternative
hypothesis that true odds ratio is greater than 1. Odds ratio= 1.219 (lower bound 95% confidence
interval: 1.205). Allele counts are reported in table 8a.
Table 8a. Venter’s 2x2 contingency table. Strongly associated SNPs (r>0.8).
                      Outcome 1 (Beneficial allele)   Outcome 2 (Other allele)
Group 1 (Venter)      101552                          25746
Group 2 (CEU mean)    97202                           30041
The test for Watson’s genome produced significant results (p<2.2e-16) for the alternative
hypothesis that true odds ratio is greater than 1. Odds ratio= 1.225 (lower bound 95% confidence
interval: 1.211). Allele counts are reported in table 8b.
Table 8b. Watson’s 2x2 contingency table. Strongly associated SNPs (r>0.8).
                      Outcome 1 (Beneficial allele)   Outcome 2 (Other allele)
Group 1 (Watson)      101655                          25643
Group 2 (CEU mean)    97202                           30041
The same analysis was run on SNPs with a correlation around 0 (p<0.005) with the principal
component in order to determine whether the signal (odds ratio) would be weaker.
For Venter’s genome, the p value was not significant (p=0.375) and odds ratio near 1 (0.995). 95%
confidence interval: 0.985-1.005. Allele counts are reported in table 9a.
Table 9a. Venter’s 2x2 contingency table. Non-associated SNPs (p<0.005).
                      Outcome 1 (Neutral allele)   Outcome 2 (Other allele)
Group 1 (Venter)      80870                        61422
Group 2 (CEU mean)    81007                        61232
For Watson’s genome, the p value was not significant (p=0.665) and odds ratio near 1 (1.002).
95% confidence interval: 0.991-1.013. Allele counts are reported in table 9b.
Table 9b. Watson’s 2x2 contingency table. Non-associated SNPs (p<0.005).
                      Outcome 1 (Neutral allele)   Outcome 2 (Other allele)
Group 1 (Watson)      81121                        61175
Group 2 (CEU mean)    81007                        61232
Ancestral vs derived alleles
Table 10 reports the number of alleles that are derived and ancestral for two groups of alleles: one
with alleles very strongly (r>0.8) correlated to the PC and the other with alleles not correlated to it
(p<0.005). There is an overrepresentation of derived alleles in the r>0.8 group compared to the
other group, with odds ratio of 1.295, p < 2.2e-16 (95% C.I. 1.263 - 1.327).
Table 10. Derived and ancestral allele counts (3 SNPs PC).
             SNPs r>0.8   SNPs p<0.005
Derived      33418        33886
Ancestral    19012        24966
Discussion
The frequencies of 4 SNPs were searched on 1000 Genomes, phase 3 (26 populations) and
produced a principal component explaining the majority of the variance, with all four alleles loading
highly (r>0.8) and in the expected direction (positive loadings), suggesting that it represents
(polygenic) selection pressure on intelligence. This picture was supported by the component
scores’ high correlation (r=0.886) with the average IQ scores of countries. Another set of 3 SNPs
with a significant association to intelligence reported in a recent study (Rietveld et al., 2014)
produced a PC that was strongly associated with the former PC (r= 0.854) and to country IQ (r=
0.716). A PCA was run on the entire set of SNPs (N=7), and the resulting PC was strongly correlated to
country IQ (r= 0.855) and to the other two PCs (r> 0.8).
Since the two sets of SNPs are not in linkage and they were reported in two independent studies,
their producing very similar results lends credibility to the hypothesis that the results are not due to
a statistical fluke but they represent a genuine signal of natural selection acting on many loci.
The alleles with a positive correlation (r>0.8) to the top 4 SNPs PC had higher frequencies in
Venter and Watson’s genomes than in the control 1000 Genomes population (CEU). Odds ratios
were 1.188 and 1.152 for Watson and Venter. The alleles with no correlation to the 4SNPs_PC
(p<0.005) were still overrepresented in Venter and Watson’s genomes but less so. Odds ratios
were lower (1.013 and 1.019 for Watson and Venter’s respectively). This result is difficult to
interpret and could be due to chance.
The PC extracted from the set of 3 SNPs (taken from Rietveld et al., 2014) produced even better
results with higher odds ratios for the alleles correlated to it (r>0.8) (Odds ratios of 1.225 and 1.219
for Watson and Venter, respectively) and lower odds ratios (not significantly different from 1) for
the alleles not associated with the PC (1.002 and 0.995 for Watson and Venter, respectively).
The analysis of ancestral alleles reveals that the alleles positively associated with the IQ PC have
an overrepresentation of derived alleles and a lower proportion of ancestral alleles than
uncorrelated SNPs. Again the results were slightly better for the PC extracted from the 3 SNPs
(odds ratio= 1.295) than the PC extracted from the top 4 SNPs (odds ratio= 1.075).
This is consistent with our hypothesis that intelligence-increasing alleles tend to be derived because
human intelligence has greatly increased since the last common ancestor of humans and non-human
primates.
The results should be read with considerable caution, because the significance levels are likely to
be highly inflated. We treated each allele as independent of all the others, but linkage
disequilibrium makes this an unrealistic assumption.
In the next analysis, we will control for possible confounding effects due to linkage by dividing the
genome in regions of 500kb and regarding each 500kb region as a single case (the average
frequency within that region will be used).
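The planned region-level analysis can be sketched as a binning step: each SNP is assigned to a fixed 500 kb window by integer division of its position, and frequencies are averaged within each window. The input records below are hypothetical, and the fixed, non-overlapping window boundaries are one possible reading of the description.

```python
from collections import defaultdict

REGION = 500_000  # 500 kb regions, as proposed in the text

def region_means(snps, region=REGION):
    """Average allele frequency per (chromosome, region index).

    `snps` is an iterable of (chromosome, position, frequency) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for chrom, pos, freq in snps:
        key = (chrom, pos // region)
        sums[key][0] += freq
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Hypothetical SNP records (chromosome, position in bp, allele frequency).
toy_snps = [
    ("1", 100_000, 0.2),
    ("1", 400_000, 0.4),  # same 500 kb region as the SNP above
    ("1", 600_000, 0.9),  # next region on chromosome 1
    ("2", 100_000, 0.5),  # first region on chromosome 2
]

means = region_means(toy_snps)
for key in sorted(means):
    print(key, round(means[key], 3))
```

Each region then contributes a single observation to the contingency table, which reduces the dependence between cases caused by linkage.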
The use of the PC allows the researcher to select a set of candidate genes (see
supplementary files), thus greatly reducing the burden of multiple comparisons and the loss of
significance following statistical corrections (e.g. Bonferroni). This increases the power of a
GWAS and reduces the sample size required to perform it.
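To make the multiple-comparisons argument concrete, the sketch below compares Bonferroni thresholds for two numbers of tests. The figure of one million tests for a genome-wide scan is a conventional assumption, not a number from this study; 238,198 is the size of the candidate set from Analysis 1. The comparison also assumes that the candidate set is chosen without reference to the association data being tested.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

ALPHA = 0.05
genome_wide = bonferroni_threshold(ALPHA, 1_000_000)  # assumed number of tests
candidate_set = bonferroni_threshold(ALPHA, 238_198)  # candidate SNPs, Analysis 1

print(f"genome-wide threshold:   {genome_wide:.2e}")
print(f"candidate-set threshold: {candidate_set:.2e}")
print(f"relaxation factor:       {candidate_set / genome_wide:.1f}")
```

A stricter correlation threshold would shrink the candidate set further and relax the per-test threshold accordingly.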
References:
Benyamin, B., Pourcain, B.St., Davis, O.S., Davies, G., Hansell, N.K., Brion, M.-J.A. et al
(2013). Childhood intelligence is heritable, highly polygenic and associated with FNBP1L.
Molecular Psychiatry, 19: 253-258.
Berg, J.J. & Coop, G. (2014). A Population Genetic Signal of Polygenic Adaptation. PLOS
Genetics, doi: 10.1371/journal.pgen.1004412
Davies, G., Tenesa, A., Payton, A., Yang, J., Harris, S.E., Liewald, D., Xiayi, K., Le Hellard, S.
et al. (2011). Genome-wide association studies establish that human intelligence is highly
heritable and polygenic. Molecular Psychiatry, 16: 996-1005.
Piffer, D. (2013). Factor Analysis of Population Allele Frequencies as a Simple, Novel Method of
Detecting Signals of Recent Polygenic Selection: The Example of Educational Attainment and IQ.
Mankind Quarterly, 54: 168-200.
Piffer, D. (2014a). Simple statistical tools to detect signals of recent polygenic selection.
IBC 2014;6:1, 1-6 • DOI: 10.4051 / ibc.2014.6.1.0001
Piffer, D. (2014b). Estimating strength of polygenic selection with principal component analysis of
spatial genetic variation. http://dx.doi.org/10.1101/008011
Pritchard, J. K., Pickrell, J. K., and Coop, G. (2010). The genetics of human adaptation: hard
sweeps, soft sweeps, and polygenic adaptation. Current Biology 20, R208-215.
Rietveld, C. A., Medland, S. E., Derringer, J., Yang, J., Esko, T., Martin, N. W., Westra, H. J.,
Shakhbazov, K., Abdellaoui, A., Agrawal, A., et al. (2013). GWAS of 126,559 individuals identifies
genetic variants associated with educational attainment. Science 340, 1467-1471.
Rietveld, C., Esko, T., Davies, G., et al. (2014). Common genetic variants associated with
cognitive performance identified using the proxy-phenotype method. doi:
10.1073/pnas.1404623111
Turchin, M. C., Chiang, C. W., Palmer, C. D., Sankararaman, S., Reich, D., Genetic Investigation
of ANthropometric Traits (GIANT) Consortium, and Hirschhorn, J. N. (2012). Evidence of widespread
selection on standing variation in Europe at height-associated SNPs. Nat Genet 44, 1015-1019.
Venter, C. (2008). A Life Decoded. Penguin Books.
Ward, M.E., McMahon, G., St Pourcain, B., Evans, D.M., Rietveld, C.A., et al. (2014) Genetic
Variation Associated with Differential Educational Attainment in Adults Has Anticipated
Associations with School Performance in Children. PLoS ONE 9(7): e100248.
doi:10.1371/journal.pone.0100248