Supplemental Data
Seunggeun Lee, Sehee Kim, and Christian Fuchsberger
Web Appendix
Web Appendix A.
Suppose $n_0 = n_{E0} + n_{I0}$. Since $\hat\sigma_\theta^2$ converges to zero at the rate of $O(n_0^{-1})$, which is faster
than $O(n_0^{-1/2})$, the convergence rate of $\hat\theta$, the variation in $\hat\sigma_\theta^2$ can be ignored. Therefore,
we consider $\hat\sigma_\theta^2$ as a fixed sequence of order $O(n_0^{-1})$. From the fact that $\hat\beta_{Com} = \hat\beta_{Int} + \hat\theta$, $\hat\beta$
can be expressed as a function of $(\hat\beta_{Int}, \hat\theta, \hat\sigma_\theta^2)$. Let
$$f(x_1, x_2 \mid \hat\sigma_\theta^2, \gamma) = x_1 + \frac{\hat\sigma_\theta^2 x_2}{\gamma x_2^2 + \hat\sigma_\theta^2}.$$
For given $\hat\sigma_\theta^2$ and $\gamma$, $\hat\beta = f(\hat\beta_{Int}, \hat\theta \mid \hat\sigma_\theta^2, \gamma)$. Since $\hat\beta_{Int}$ is the MLE of $\beta_0$ and $\hat\beta_{Com}$ is the
MLE of $\beta_{0,Com}$, we can show that the bivariate random variable $(\hat\beta_{Int}, \hat\theta)$ asymptotically
follows a multivariate normal distribution with mean $(\beta_0, \theta)$ and variance $D$, where $D$ is
estimated by
$$\hat D = \begin{pmatrix} \hat\sigma_{Int}^2 & \hat\rho\,\hat\sigma_{Int}\hat\sigma_\theta \\ \hat\rho\,\hat\sigma_{Int}\hat\sigma_\theta & \hat\sigma_\theta^2 \end{pmatrix},$$
where $\hat\sigma_{Int}^2$ is an estimate of the variance of $\hat\beta_{Int}$, and $\hat\rho$ is an estimate of the correlation
coefficient between $\hat\beta_{Int}$ and $\hat\theta$. When $\theta = 0$, the asymptotic mean of $\hat\beta$ is $\beta_0$. When $\theta \neq 0$, the
asymptotic mean of $\hat\beta$ is $f(\beta_0, \theta \mid \hat\sigma_\theta^2, \gamma) \approx \beta_0 + \frac{1}{\gamma\theta} O\!\left(\frac{1}{n_0}\right)$. Therefore we can conclude that
$\hat\beta$ is an asymptotically unbiased estimator of $\beta_0$.
To obtain the asymptotic variance of $\hat\beta$, we need to obtain $\hat D$. Suppose that
$\varphi_1 = \log\{r_{I1}/(2n_{I1} - r_{I1})\}$, $\varphi_2 = \log\{r_{I0}/(2n_{I0} - r_{I0})\}$, and
$\varphi_3 = \log\{(r_{I0} + r_{E0})/(2n_{I0} + 2n_{E0} - r_{I0} - r_{E0})\}$.
Then $\hat\beta_{Int} = \varphi_1 - \varphi_2$ and $\hat\theta = \varphi_2 - \varphi_3$. Asymptotic variance estimators for $\varphi_1$, $\varphi_2$, and $\varphi_3$
are $\widehat{Var}(\varphi_1) = 1/r_{I1} + 1/(2n_{I1} - r_{I1})$, $\widehat{Var}(\varphi_2) = 1/r_{I0} + 1/(2n_{I0} - r_{I0})$, and
$\widehat{Var}(\varphi_3) = 1/(r_{I0} + r_{E0}) + 1/(2n_{I0} + 2n_{E0} - r_{I0} - r_{E0})$. Since $\varphi_1$ and $\varphi_2$ are independent random variables,
$\hat\sigma_{Int}^2 = \widehat{Var}(\varphi_1) + \widehat{Var}(\varphi_2)$. We estimate $\hat\sigma_\theta^2$ using the delta method based on the fact that $\hat\theta$ is a
function of two independent random variables, $r_{I0}$ and $r_{E0}$. Using a similar approach, we
can obtain $\hat\rho$. Now we estimate the variance of $\hat\beta$. The partial derivatives of $f$ are
$$\frac{\partial f}{\partial x_1} = 1; \qquad \frac{\partial f}{\partial x_2} = \frac{\hat\sigma_\theta^4 - \gamma\hat\sigma_\theta^2 x_2^2}{(\gamma x_2^2 + \hat\sigma_\theta^2)^2}.$$
By the delta method, a variance estimator for $\hat\beta$ is approximately
$$\widehat{Var}(\hat\beta) \approx \nabla f(\hat\beta_{Int}, \hat\theta \mid \hat\sigma_\theta^2, \gamma)^T \, \hat D \, \nabla f(\hat\beta_{Int}, \hat\theta \mid \hat\sigma_\theta^2, \gamma), \tag{A.2}$$
where $\nabla f$ is the vector of partial derivatives of $f$.
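The shrinkage estimator and its delta-method variance in (A.2) can be illustrated numerically. The following is a minimal Python sketch, assuming that the variance of the internal-only estimate, the variance of the estimated difference between internal and external controls, and their correlation have already been estimated. The function names and all input values are hypothetical and serve only as an illustration.

```python
import math

def shrinkage_estimate(beta_int, theta, var_theta, gamma):
    """f(x1, x2 | var_theta, gamma) = x1 + var_theta * x2 / (gamma * x2^2 + var_theta)."""
    return beta_int + var_theta * theta / (gamma * theta ** 2 + var_theta)

def gradient(theta, var_theta, gamma):
    """Partial derivatives of f with respect to x1 and x2, evaluated at x2 = theta."""
    denom = (gamma * theta ** 2 + var_theta) ** 2
    return 1.0, (var_theta ** 2 - gamma * var_theta * theta ** 2) / denom

def delta_method_variance(theta, var_int, var_theta, rho, gamma):
    """Quadratic form grad' D grad, with D built from var_int, var_theta and rho."""
    g1, g2 = gradient(theta, var_theta, gamma)
    cov = rho * math.sqrt(var_int * var_theta)
    return g1 * g1 * var_int + 2.0 * g1 * g2 * cov + g2 * g2 * var_theta

# Hypothetical inputs, for illustration only.
beta_int, theta = 0.40, 0.05
var_int, var_theta, rho, gamma = 0.04, 0.01, -0.5, 1.0
beta_hat = shrinkage_estimate(beta_int, theta, var_theta, gamma)
var_hat = delta_method_variance(theta, var_int, var_theta, rho, gamma)
print(beta_hat, var_hat)
```

When the estimated difference is exactly zero, the gradient equals (1, 1), so the quadratic form reduces to the variance of the sum of the internal-only estimate and the estimated difference, that is, the variance of the combined-control estimate.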
Web Appendix B.
Suppose that we observed $n$ samples, and hence $2n$ copies of chromosomes. We consider
the following logistic regression that corresponds to the allelic contingency table (Table
1),
$$\mathrm{logit}\{P(y_i = 1)\} = X_i \beta, \tag{B.1}$$
where $y_i$ is the phenotype value of chromosome $i$ ($i = 1, \ldots, 2n$), $X_i = (1, g_i)$ with $g_i$ being the number of minor
alleles ($g_i = 0, 1$), and $\beta = (\beta_0, \beta_1)'$ is a vector of regression coefficients for the intercept and the
allelic log odds ratio. Under (B.1), the MLE of $\beta_1$ is the same as the allelic log
odds ratio estimate from the 2×2 contingency table. The score statistic for $H_0: \beta_1 = 0$ can
be written as $S = \sum_{i=1}^{2n} g_i (y_i - \bar y)$, where $\bar y = \sum y_i/(2n)$.
Next we consider estimating $\hat\beta_1$. Suppose that $Y = (y_1, \ldots, y_{2n})'$ is a 2n×1 vector
of phenotypes, $\hat\mu = (\hat\mu_1, \ldots, \hat\mu_{2n})'$ is a 2n×1 vector of estimated means of $Y$, and $\hat\beta$ is the MLE
of $\beta$. By the Iterative Re-Weighted Least Squares (IRWLS) procedure, $\hat\beta$ is a
solution of
$$X'(Y - \hat\mu) = 0. \tag{B.2}$$
Under the null hypothesis, $\hat\beta$ should be close to $\beta^0 = [\mathrm{logit}(\bar y), 0]'$. Applying the Taylor
expansion to $\hat\mu_i$ around $\beta^0$,
$$\hat\mu_i \approx \frac{\exp(X_i\beta^0)}{1 + \exp(X_i\beta^0)} + \frac{\exp(X_i\beta^0)}{\{1 + \exp(X_i\beta^0)\}^2} X_i (\hat\beta - \beta^0). \tag{B.3}$$
By plugging (B.3) into (B.2),
$$\hat\beta_1 \approx \frac{\sum_{i=1}^{2n} g_i (y_i - \bar y)}{\bar y (1 - \bar y) \sum_{i=1}^{2n} (g_i - q)^2},$$
where $q = \sum g_i/(2n)$ is an MAF estimate. Since $\sum (g_i - q)^2 \approx 2n\,q(1 - q)$, we can
conclude that the score statistic $S$ can be closely approximated by $2\,n_{eff}\,q(1 - q)\hat\beta_1$,
where $n_{eff} = \bar y (1 - \bar y) n$ is the effective number of samples.
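The approximation of the score statistic by the scaled allelic log odds ratio can be checked on a small allelic contingency table. The following Python sketch is an illustration only: the function names and the allele counts are hypothetical, and each "total" argument counts chromosomes (twice the number of samples).

```python
import math

def score_statistic(case_minor, case_total, ctrl_minor, ctrl_total):
    """S = sum_i g_i (y_i - ybar) over all chromosomes (case chromosomes have y = 1)."""
    ybar = case_total / (case_total + ctrl_total)
    return case_minor * (1.0 - ybar) - ctrl_minor * ybar

def approx_score(case_minor, case_total, ctrl_minor, ctrl_total):
    """2 * n_eff * q * (1 - q) * beta1_hat, with beta1_hat the allelic log odds ratio."""
    two_n = case_total + ctrl_total
    ybar = case_total / two_n
    q = (case_minor + ctrl_minor) / two_n
    n_eff = ybar * (1.0 - ybar) * (two_n / 2.0)
    beta1 = math.log(case_minor * (ctrl_total - ctrl_minor)
                     / ((case_total - case_minor) * ctrl_minor))
    return 2.0 * n_eff * q * (1.0 - q) * beta1

# Hypothetical allelic table: 1000 case and 1000 control chromosomes,
# with 55 and 45 minor alleles, respectively.
counts = (55, 1000, 45, 1000)
print(score_statistic(*counts), approx_score(*counts))
```

For these counts the exact score statistic is 5 and the approximation is about 5.02, which is consistent with the approximation being accurate when the estimated log odds ratio is small.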
Web Appendix C.
We propose a simple heuristic approach for estimating the effective sample size ($n_{eff}$), in
which $n_{eff}$ is a function of $\tau$. When $\tau = 0$, $\hat\beta$ is estimated using combined control samples.
When $\tau = 1$, $\hat\beta$ is estimated with internal control samples only. From these facts, we
propose to use $n_{eff} = (1 - \tau)\,n_{eff\text{-}com} + \tau\,n_{eff\text{-}Int}$, where $n_{eff\text{-}com} = n_{I1}(n_{I0} + n_{E0})/(n_{I1} + n_{I0} + n_{E0})$
and $n_{eff\text{-}Int} = n_{I1} n_{I0}/(n_{I1} + n_{I0})$ are the effective sample sizes of using combined control
samples and internal control samples only, respectively. We note that $n_{eff\text{-}com}$ and $n_{eff\text{-}Int}$
reflect case-control ratios. In rare variant tests, it is common to up-weight rarer variants
or functional variants [17, 19, 25]. In simulation studies and real data analysis, we up-weight
rarer variants using $w_j = \mathrm{beta}(q_j, 1, 25)$, where $q_j$ is the MAF of variant $j$, and beta(1, 25)
is a beta density function with parameters (1, 25). This is the default weight in the
SKAT package [19].
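The heuristic effective sample size and the beta-density weights can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names and the study design (2000 cases, 2000 internal controls, 4000 external controls) are hypothetical, and the beta density is evaluated directly from its definition rather than through any particular package.

```python
import math

def effective_sample_size(tau, n_case, n_int_ctrl, n_ext_ctrl):
    """n_eff = (1 - tau) * n_eff_com + tau * n_eff_int."""
    n_ctrl = n_int_ctrl + n_ext_ctrl
    n_eff_com = n_case * n_ctrl / (n_case + n_ctrl)
    n_eff_int = n_case * n_int_ctrl / (n_case + n_int_ctrl)
    return (1.0 - tau) * n_eff_com + tau * n_eff_int

def beta_weight(maf, a=1.0, b=25.0):
    """Beta(a, b) density evaluated at the minor allele frequency."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * maf ** (a - 1.0) * (1.0 - maf) ** (b - 1.0)

# Hypothetical design: 2000 cases, 2000 internal and 4000 external controls.
for tau in (0.0, 0.5, 1.0):
    print(tau, effective_sample_size(tau, 2000, 2000, 4000))
for maf in (0.001, 0.01, 0.05):
    print(maf, beta_weight(maf))
```

In this design the effective sample size moves linearly from 1500 (combined controls) to 1000 (internal controls only) as the shrinkage weight goes from 0 to 1, and the weight decreases with MAF, so rarer variants receive larger weights.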
To calculate p-values, we first need to estimate an additional parameter $\Sigma$, the
covariance matrix of $\hat\beta_j$. With an estimate of the correlation matrix using internal study
samples, a sandwich-type estimate of $\Sigma$ can be obtained as follows. Suppose that $C$ is a
p×p correlation matrix of $p$ variants. The estimator of the $(j, k)$ element of $\Sigma$ is
$\Sigma_{jk} = w_j w_k \widehat{SD}(S_j) \widehat{SD}(S_k) C_{jk}$, where $C_{jk}$ is the $(j, k)$ element of $C$,
$\widehat{SD}(S_j) = n_{eff}\,q_j (1 - q_j)\,\hat\sigma_{w,j}$ is the standard error estimate of $S_j$, and $w_j$ and $w_k$ are weights for variants $j$ and $k$,
respectively. We further define a p×p compound symmetry matrix $R_\rho = (1 - \rho) I + \rho 11'$
and its Cholesky decomposition matrix $L_\rho$, i.e., $L_\rho L_\rho' = R_\rho$. The statistic $Q_{iECAT}(\rho)$ asymptotically
follows a mixture of chi-square distributions, $\sum_{j=1}^{p} \lambda_j \chi_j^2$, where $(\lambda_1, \ldots, \lambda_p)$ are the
eigenvalues of $L_\rho \Sigma L_\rho'$, and the $\chi_j^2$ are i.i.d. chi-square random variables with d.f. = 1 [21].
Finally, p-values of $Q_{iECAT}(\rho)$ can be obtained by the Davies method, which inverts the
characteristic function [26].
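The construction of the covariance estimate and of the compound symmetry matrix with its Cholesky factor can be sketched as follows. This Python example is an illustration under stated assumptions only: the function names, weights, standard errors, and correlation matrix are hypothetical, and the sketch stops before the eigenvalue computation and the Davies method, which would normally be delegated to a numerical library.

```python
import math

def sigma_matrix(weights, score_sds, corr):
    """Sigma_jk = w_j * w_k * SD(S_j) * SD(S_k) * C_jk."""
    p = len(weights)
    return [[weights[j] * weights[k] * score_sds[j] * score_sds[k] * corr[j][k]
             for k in range(p)] for j in range(p)]

def compound_symmetry(p, rho):
    """R_rho = (1 - rho) * I + rho * 1 1'."""
    return [[1.0 if j == k else rho for k in range(p)] for j in range(p)]

def cholesky_lower(a, eps=1e-12):
    """Lower-triangular L with L L' = a; a zero pivot leaves that column at zero."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if pivot <= eps:
            continue  # singular direction, e.g. every column after the first when rho = 1
        low[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            partial = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = (a[i][j] - partial) / low[j][j]
    return low

def times_transpose(low):
    """Return L L' so that the factorization can be checked against R_rho."""
    n = len(low)
    return [[sum(low[i][k] * low[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical weights, score standard errors, and correlation matrix.
sigma = sigma_matrix([2.0, 1.0], [0.5, 2.0], [[1.0, 0.2], [0.2, 1.0]])
r_half = compound_symmetry(3, 0.5)
l_half = cholesky_lower(r_half)
print(sigma)
print(times_transpose(l_half))
```

The zero-pivot branch is a design choice for this sketch: at the burden end (correlation parameter equal to 1) the compound symmetry matrix has rank one, so a textbook Cholesky routine would divide by zero, whereas leaving the remaining columns at zero still reproduces the matrix.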
Web Appendix D.
When only summary-level information for external control samples is available, samples
with different genetic background cannot be excluded. The presence of these samples can
cause a substantial number of variants to have different MAFs between internal and
external control samples. To mimic this scenario, we used the simulated European data
set, but included 5% and 10% of external control samples from the simulated African-American-like
haplotypes (Web Table 2). Since 36% of the variants (excluding
singletons) had substantially different MAFs (relative difference > 1.5) between
African-American-like and European-like haplotypes (Web Figure 1), these scenarios represented
very strong population stratification. When 5% of the external control samples were
simulated from African-American-like haplotypes, iECAT-O had well controlled or
slightly inflated type I error rates depending on sample sizes and case-control ratios. This
is because iECAT-O exclusively used internal control samples when a variant had
substantially different MAFs between internal and external controls. With the increased
percentage of samples from African-American-like haplotypes, type I error rates were
only slightly increased. These results indicate that iECAT-O has robust type I error
control even in the presence of population stratification.
Web Figures and Tables
Web Figure 1. MAF spectra of European-like and African-American-like
haplotypes simulated from COSI
A, B) MAF histograms of European-like and African-American-like haplotypes. C)
Scatter plot of MAFs between European-like and African-American-like haplotypes. D)
Histogram of the relative difference (RD) of MAFs between European-like and
African-American-like haplotypes. RD is defined as
$RD = 2\,|MAF_{EUR} - MAF_{AA}| / (MAF_{EUR} + MAF_{AA})$. Since singletons always have RD = 2, they are excluded.
[Figure panels: rows Case:Control = 1:1 and 2:1; columns Causal = 5%, 10%, 30%. X-axis: Internal Study Sample Size (4k, 10k, 20k). Y-axis: Power. Legend: iECAT-O Known, iECAT-O, Internal only (SKAT-O).]
Web Figure 2. Power comparisons when 20% of causal variants were risk
decreasing variants
Each line represents empirical power at $\alpha = 2.5 \times 10^{-6}$. From left to right, the plots consider
that 5%, 10%, and 30% of variants were causal variants, respectively. From top to bottom,
the plots consider that the internal study case:control ratio was 1:1 and 2:1, respectively. The
external control sample sizes were the same as the internal study sample sizes. For causal
variants, we assumed that $\beta = c\,|\log_{10}(\mathrm{MAF})|$. Different values of $c$ were used for the three
different percentages of causal variants (see Methods).
[Figure panels: rows Case:Control = 1:1 and 2:1; columns Causal = 5%, 10%, 30%. X-axis: Internal Study Sample Size (4k, 10k). Y-axis: Power. Legend: 0% AA, 5% AA, 10% AA.]
Web Figure 3. Empirical power of iECAT-O in the presence of population
stratification between internal and external control samples
Empirical power was estimated while controlling empirical $\alpha = 2.5 \times 10^{-6}$. From left to right,
the plots consider that 5%, 10%, and 30% of variants were causal variants, respectively.
From top to bottom, the plots consider that the internal study case:control ratio was 1:1 and
2:1, respectively. In each plot, each bar represents empirical power with 0%, 5%, and 10%
of the external control samples being generated from African-American-like haplotypes.
Remaining external control samples and all internal study samples were generated from
European-like haplotypes. The external control sample sizes were the same as the internal
study sample sizes. All causal variants were risk-increasing variants. For causal variants,
we assumed that $\beta = c\,|\log_{10}(\mathrm{MAF})|$. Different values of $c$ were used for the three different
percentages of causal variants (see Methods).
[Figure panels: rows Case:Control = 1:1 and 2:1; columns Causal = 5%, 10%, 30%. X-axis: Internal Study Sample Size (4k, 10k, 20k). Y-axis: Power. Legend: iECAT(r=0) Known, iECAT(r=0), Internal only (SKAT).]
Web Figure 4. Power comparisons for the SKAT-type tests (ρ=0) when all causal
variants were risk increasing variants
Each line represents empirical power at $\alpha = 2.5 \times 10^{-6}$. From left to right, the plots consider
that 5%, 10%, and 30% of variants were causal variants, respectively. From top to bottom,
the plots consider that the internal study case:control ratio was 1:1 and 2:1, respectively. The
external control sample sizes were the same as the internal study sample sizes. For causal
variants, we assumed that $\beta = c\,|\log_{10}(\mathrm{MAF})|$. Different values of $c$ were used for the three
different percentages of causal variants (see Methods).
[Figure panels: rows Case:Control = 1:1 and 2:1; columns Causal = 5%, 10%, 30%. X-axis: Internal Study Sample Size (4k, 10k, 20k). Y-axis: Power. Legend: iECAT(r=1) Known, iECAT(r=1), Internal only (Burden).]
Web Figure 5. Power comparisons for the Burden-type tests (ρ=1) when all causal
variants were risk increasing variants
Each line represents empirical power at $\alpha = 2.5 \times 10^{-6}$. From left to right, the plots consider
that 5%, 10%, and 30% of variants were causal variants, respectively. From top to bottom,
the plots consider that the internal study case:control ratio was 1:1 and 2:1, respectively. The
external control sample sizes were the same as the internal study sample sizes. For causal
variants, we assumed that $\beta = c\,|\log_{10}(\mathrm{MAF})|$. Different values of $c$ were used for the three
different percentages of causal variants (see Methods).
[Figure: scatter plot. X-axis: Power at empirical $\alpha = 2.5 \times 10^{-6}$. Y-axis: Power at nominal $\alpha = 2.5 \times 10^{-6}$. Both axes range from 0.0 to 1.0.]
Web Figure 6. Comparison of iECAT-O power obtained at nominal and empirical $\alpha$
levels
The X-axis represents the empirical power of iECAT-O obtained while controlling empirical
$\alpha = 2.5 \times 10^{-6}$, and the Y-axis represents the empirical power of iECAT-O calculated at nominal
$\alpha = 2.5 \times 10^{-6}$. The 24 dots represent 24 different simulation setups.
[Figure panels: rows Case:Control = 1:1 and 2:1; columns MAF = 1% and MAF = 0.5%. X-axis: Internal Sample Size (5000, 10000, 15000, 20000). Y-axis: Power. Legend: iECAT Known, iECAT, Wald (Internal Only).]
Web Figure 7. Power comparisons for single variant tests
Each line represents empirical power at $\alpha = 10^{-6}$ when OR = 2. From left to right, the plots
consider that the causal variant MAF was 1% and 0.5%, respectively. From top to bottom, the
plots consider that the internal study case:control ratio was 1:1 and 2:1, respectively. The
external control sample sizes were the same as the internal study sample sizes.
Web Figure 8. LASER analysis for AMD and GoT2D data
A) First two PCs of AMD (red) and ESP (black) samples. Grey dots represent reference
HGDP samples. B) First two PCs of GoT2D and ESP samples. Among GoT2D samples,
red represents non-Finnish cohort samples and blue represents Finnish cohort samples. C-D)
First two PCs of GoT2D non-Finnish cohort and ESP genotype-matched control samples.
Web Figure 9. Comparisons of iECAT-O p-values with and without genotype-based
matching
Each dot represents an iECAT-O p-value in the GoT2D data analysis. The X-axis represents
$-\log_{10}$ iECAT-O p-values calculated with all 4300 European ESP external control
samples, and the Y-axis represents $-\log_{10}$ iECAT-O p-values calculated with 2568
genotype-matched ESP external control samples.
Web Figure 10. Single variant analysis results for the GoT2D exome data with ESP
as external control samples
QQ-plots of -log10 p-values of single variant tests. A total of 102K rare and low
frequency variants (MAF < 0.05) observed in both GoT2D and ESP datasets were used
for the association tests. Top panel considers that all 4300 European ESP samples were
used as external controls and the bottom panel considers that 2568 genotype matched
ESP samples were used as external controls. The dashed line represents a 95%
confidence band.
Web Figure 11. Estimated shrinkage weight ($\tau$) in the GoT2D exome data analysis
with ESP as external control samples
Each boxplot represents estimated shrinkage weights, $\tau$, in four different MAC bins: 3 ≤
MAC < 10, 10 ≤ MAC < 20, 20 ≤ MAC < 50, and 50 ≤ MAC. Top panel considers that
all 4300 European ESP samples were used as external controls and the bottom panel
considers that 2568 genotype-matched ESP samples were used as external controls.
Web Table 1. Empirical type I error rates for single variant test. Each cell has an
empirical type I error rate estimated from 5×10^7 simulated datasets. The external control
sample sizes were the same as the internal sample sizes. All internal and external control
samples were simulated from European-like haplotypes.
Internal      Internal
Sample Size   Case:Control   Level α   iECAT      Wald*      iECAT NoAdj
4000          1:1            10^-4     1.60E-05   2.40E-06   3.10E-03
4000          1:1            10^-6     1.00E-07   2.00E-08   2.10E-03
4000          2:1            10^-4     9.80E-06   2.80E-06   3.70E-03
4000          2:1            10^-6     8.00E-08   2.00E-08   2.60E-03
10000         1:1            10^-4     2.00E-05   3.40E-06   4.00E-03
10000         1:1            10^-6     1.40E-07   4.00E-08   2.90E-03
10000         2:1            10^-4     1.50E-05   4.40E-06   4.90E-03
10000         2:1            10^-6     1.80E-07   2.00E-08   3.60E-03
* Wald test used internal control samples only.
Web Table 2. Empirical type I error rates for iECAT-O when 5% and 10% of external
controls include samples generated from African-American-like haplotypes. Each cell has
an empirical type I error rate estimated from 10^7 simulated datasets. The external control
sample sizes were the same as the internal sample sizes.
Internal      Internal
Sample Size   Case:Control   Level α      5% AA      10% AA
4000          1:1            10^-4        8.00E-05   9.90E-05
4000          1:1            2.5×10^-6    1.10E-06   2.60E-06
4000          2:1            10^-4        9.20E-05   1.60E-04
4000          2:1            2.5×10^-6    2.70E-06   5.00E-06
10000         1:1            10^-4        1.20E-04   1.50E-04
10000         1:1            2.5×10^-6    3.20E-06   5.40E-06
10000         2:1            10^-4        1.60E-04   2.80E-04
10000         2:1            2.5×10^-6    4.80E-06   1.60E-05
Web Table 3. Top five rare SNVs (MAF < 0.01) by iECAT p-values from the AMD data analysis. The 4300 European ESP samples
were used as external controls.
                                                         Allele Count                        P-values
Gene      CHR   Location    RS            Allele   AMD-Case   AMD-Control   ESP        iECAT      Wald*
C3        19    6718146     rs147859257   G/T      48/4586    5/1577        37/8563    1.23E-05   1.12E-02
CFH       1     196716375   rs121913059   T/C      22/4612    0/1582        2/8598     1.24E-05   5.57E-02
DOM3Z     6     31937762    rs114879139   C/G      32/4602    26/1556       130/8470   3.58E-05   9.66E-04
SLC44A4   6     31846798    rs115526171   A/G      35/4599    24/1558       129/8471   2.20E-04   8.18E-03
CFHR4     1     196871717   rs145744152   C/T      37/4597    23/1559       131/8449   5.02E-04   2.33E-02
* Wald test used internal control samples only.
Web Table 4. Top five genes by iECAT-O p-values from the GoT2D data analysis. The
European ESP samples were used as external controls. Rare variants (MAF < 0.01) were
exclusively used for this analysis.
Gene
Without matching
SECISBP2
SHC3
KLHL10
WDFY1
GCNT3
With matching
WDFY1
SECISBP2
SHC3
VAV3
C9orf173
Chr
# of variant*
iECAT-O
P-values
SKAT-O
P-values
20
20
5
8
4
6
6
3
9
6
5.60E-05
1.10E-04
1.60E-04
1.70E-04
1.80E-04
4.70E-03
2.00E-04
5.00E-02
9.40E-04
9.60E-04
7
20
20
4
20
9
6
6
8
2
5.80E-05
1.00E-04
1.30E-04
2.30E-04
3.10E-04
5.10E-04
6.10E-03
2.40E-04
1.30E-02
5.90E-03
* Number of rare variants used for the test. Variants with internal study MAC ≤ 3 were
excluded from the analysis.