Detecting Rare Haplotype-Environment Interaction without the Assumption of Gene-Environment
Independence
Yuan Zhang (1), Swati Biswas (1), Shili Lin (2)
(1) Department of Mathematical Sciences, University of Texas at Dallas
(2) Department of Statistics, The Ohio State University
Introduction
▶ Rare variants and gene (G)-environment (E) interactions (GXE) have been suggested as potential causes of "missing heritability" in complex diseases.
▶ We consider detecting GXE where G is a rare haplotype variant (rHTV).
▶ Recently, Biswas et al. [1] proposed a method based on Logistic Bayesian Lasso (LBL) [2] for detecting GXE.
▶ A key assumption of this method is that G and E are independent in controls; we refer to the method as LBL-GXEindep.
▶ We propose a novel approach to deal with the scenarios when this assumption may not hold by modeling
haplotype frequencies as functions of environmental covariates (LBL-GXEdep).
▶ We also explore combining LBL-GXEindep and LBL-GXEdep to account for uncertainty about the G-E independence assumption (LBL-GXEcomb). It uses the reversible-jump Markov chain Monte Carlo method [3].
▶ We compare LBL-GXEindep, LBL-GXEdep, and LBL-GXEcomb using simulations.
LBL-GXEdep
▶ In contrast to LBL-GXEdep, LBL-GXEindep [1] does not model f as dependent on E, so it has no b parameters.
▶ Thus, equation (2) is not used to model f.
▶ The prior for f is Dirichlet(1, ..., 1).
▶ The rest of the model and the inference procedures are the same as in LBL-GXEdep.
⋆ Ei = vector of environmental covariates of individual i.
⋆ S(Gi) = set of haplotype pairs consistent with the ith individual’s genotype Gi.
Haplotype Names (columns of Tables 2 and 3): h01100, h10100 (R1), h11011, h11100, h11111, h01100 X E, h10100 X E, h11011 X E (R2XE), h11100 X E, h11111 X E, E
LBL-GXEcomb
▶ Let M be the model index: M = I corresponds to LBL-GXEindep (Model(I)) and M = D corresponds to LBL-GXEdep (Model(D)).
▶ Denote the parameters in Model(I) by θ(I) = (β, λ, f, d) and the parameters in Model(D) by θ(D) = (β, λ, b, d).
▶ Let p1 = P(M = I | M = I) be the probability of staying at Model(I) given that the current state is Model(I). Similarly, let p2 = P(M = D | M = D). Hence P(M = D | M = I) = 1 − p1 and P(M = I | M = D) = 1 − p2.
- Acceptance probability of the move from I to D, denoted by A(I, D), is
  A(I, D) = \min\left\{ 1, \; \frac{\pi(M = D, \theta^{(D)} \mid Y, G, E) \, P(M = I \mid M = D)}{\pi(M = I, \theta^{(I)} \mid Y, G, E) \, \pi(u) \, P(M = D \mid M = I)} \times \left| \frac{\partial \theta^{(D)}}{\partial (\theta^{(I)}, u)} \right| \right\}.
⋆ D → I:
- Discard b_{k1} and use b_{k0} to get f:
  f_k = \frac{\exp(b_{k0})}{1 + \sum_{j=1}^{m-1} \exp(b_{j0})}, \quad k = 1, ..., m − 1.
- Acceptance probability of the move from D to I is A^{-1}(I, D).
⋆ Complete data likelihood based on retrospective model [1, 2, 4]:
  L_c(\Psi) = \prod_{i=1}^{n_1} \sum_{Z_{ir} \in S(G_i)} P(Z_{ir} \mid E_i, Y_i = 1, \Psi) \, P(E_i \mid Y_i = 1, \Psi) \times \prod_{i=n_1+1}^{n} \sum_{Z_{ir} \in S(G_i)} P(Z_{ir} \mid E_i, Y_i = 0, \Psi) \, P(E_i \mid Y_i = 0, \Psi).
- Ψ = (β, b, d) consists of regression coefficients (β) and parameters associated with haplotype frequencies (b, d).
- Let a_{Z|E} = P(Z | E, Y = 0), b_{Z|E} = P(Z | E, Y = 1), θ_{Z,E} = P(Y = 1 | Z, E) / P(Y = 0 | Z, E) = odds of disease.
⋆ b_{Z|E} = P(Z \mid E, Y = 1) = \frac{P(Y = 1 \mid Z, E) \, P(Z, E)}{\sum_{H} P(Y = 1 \mid H, E) \, P(H, E)} = \frac{\theta_{Z,E} \, a_{Z|E}}{\sum_{H} \theta_{H,E} \, a_{H|E}}.
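The relation between the control distribution a_{Z|E} and the case distribution b_{Z|E} can be illustrated with a short Python sketch for one fixed value of E. The function name `case_pair_probs` and the numbers are hypothetical and chosen only for illustration; they are not taken from the poster.

```python
def case_pair_probs(a, theta):
    """Return b_{Z|E} = theta_{Z,E} * a_{Z|E} / sum_H theta_{H,E} * a_{H|E}.

    a     : haplotype-pair probabilities in controls, for one fixed E
    theta : disease odds for the same haplotype pairs and the same E
    """
    weights = [t * p for t, p in zip(theta, a)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical example with three haplotype pairs (illustrative values only).
a = [0.90, 0.08, 0.02]
theta = [0.5, 0.5, 1.5]   # the third pair carries a higher odds of disease
b_case = case_pair_probs(a, theta)
print([round(x, 4) for x in b_case])
```

Pairs with higher disease odds are over-represented among cases, and when all odds are equal the case distribution equals the control distribution.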
Suppose there are m haplotypes and let f = (f_1, ..., f_m) denote their frequencies in the control population. Then,
  P(Z = z_k/z_{k'} \mid Y = 0, d, f) = \delta_{kk'} \, d \, f_k + (2 − \delta_{kk'})(1 − d) \, f_k f_{k'},   (1)
where δ_{kk'} = 1 if z_k = z_{k'} and 0 otherwise, and d is the within-population inbreeding coefficient.
Let the mth haplotype be the baseline and consider C environmental covariates, E = {E_1, E_2, ..., E_C}. Then, similar to [5], we let
  \log\left( \frac{f_k}{f_m} \right) = b_{k0} + b_{k1} E_1 + b_{k2} E_2 + ... + b_{kC} E_C = g_k(E), \quad k = 1, 2, ..., m − 1.   (2)
Solving (2) for f_k and plugging into (1) gives P(Z = z_k/z_{k'} | Y = 0, E, b, d) = a_{Z|E}.
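Equations (1) and (2) can be checked numerically with a minimal Python sketch. The function names `haplotype_freqs` and `pair_prob` and the coefficient values are hypothetical; the sketch only assumes the multinomial-logit form of (2) with the mth haplotype as baseline.

```python
import math

def haplotype_freqs(b, env):
    """Invert equation (2).

    b   : list of m-1 coefficient lists [b_k0, b_k1, ..., b_kC]
    env : list of C environmental covariate values
    Returns [f_1, ..., f_m], with the baseline haplotype last.
    """
    g = [bk[0] + sum(c * e for c, e in zip(bk[1:], env)) for bk in b]
    denom = 1.0 + sum(math.exp(x) for x in g)
    freqs = [math.exp(x) / denom for x in g]
    freqs.append(1.0 / denom)  # baseline haplotype m
    return freqs

def pair_prob(f, d, k, kp):
    """Equation (1) for the unordered haplotype pair z_k/z_kp (0-based indices)."""
    delta = 1.0 if k == kp else 0.0
    return delta * d * f[k] + (2.0 - delta) * (1.0 - d) * f[k] * f[kp]

# Hypothetical coefficients for m = 3 haplotypes and C = 1 covariate.
b = [[-1.0, 0.5], [-3.0, 1.0]]
f = haplotype_freqs(b, env=[1.0])
total = sum(pair_prob(f, 0.05, k, kp) for k in range(3) for kp in range(k, 3))
print(round(sum(f), 6), round(total, 6))
```

The frequencies sum to one by construction, and so do the pair probabilities over all unordered pairs, since the sum equals d + (1 − d)(f_1 + ... + f_m)^2.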
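⋆ Model for θ_{Z,E}: θ_{Z,E} = \exp(X_{Z,E} \beta).
⋆ P(E \mid Y = 1) \propto P(Y = 1 \mid E) \, P(E) = \frac{\sum_{H} \theta_{H,E} \, a_{H|E}}{1 + \sum_{H} \theta_{H,E} \, a_{H|E}} \, P(E).
⋆ P(E \mid Y = 0) \propto P(Y = 0 \mid E) \, P(E) = \frac{P(E)}{1 + \sum_{H} \theta_{H,E} \, a_{H|E}}.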
Prior
⋆ Prior for β: \pi(\beta_j \mid \lambda) = \frac{\lambda}{2} \exp(−\lambda |\beta_j|), −∞ < β_j < ∞; j = 0, ..., (mC − 1).
⋆ Prior for λ: \pi(\lambda) = \frac{b^a}{\Gamma(a)} \lambda^{a−1} \exp(−\lambda b), 0 < λ < ∞.
Table 2 (Scenario 1, null), rows for Settings 2 and 3: type I error rates, %(BF>2).

Setting    Method   h01100  h10100(R1)  h11011  h11100  h11111  h01100XE  h10100XE  h11011XE(R2XE)  h11100XE  h11111XE  E
Setting 2  dep      0.00    0.02        0.02    0.01    0.01    0.01      0.01      0.01            0.01      0.02      0.00
Setting 2  2-stage  0.04    0.02        0.05    0.01    0.14    0.13      0.01      0.10            0.01      0.18      0.00
Setting 2  comb     0.00    0.02        0.01    0.01    0.01    0.01      0.01      0.02            0.01      0.02      0.00
Setting 3  indep    0.00    0.03        0.03    0.00    0.01    0.00      0.02      0.02            0.01      0.01      0.00
Setting 3  dep      0.01    0.03        0.02    0.00    0.01    0.01      0.01      0.01            0.01      0.02      0.00
Setting 3  2-stage  0.00    0.03        0.02    0.00    0.01    0.00      0.02      0.02            0.01      0.02      0.00
Setting 3  comb     0.00    0.03        0.03    0.00    0.00    0.00      0.02      0.01            0.01      0.01      0.00

Table 3 (Scenario 2): power for h10100 (R1) and h11011 X E (R2XE), type I error rates for the other effects, %(BF>2).

Setting    Method   h01100  h10100(R1)  h11011  h11100  h11111  h01100XE  h10100XE  h11011XE(R2XE)  h11100XE  h11111XE  E
Setting 1  indep    0.23    0.78        0.79    1.00    1.00    0.99      0.08      1.00            1.00      1.00      1.00
Setting 1  dep      0.00    0.68        0.01    0.01    0.01    0.01      0.01      0.47            0.01      0.01      0.01
Setting 1  2-stage  0.00    0.68        0.01    0.02    0.01    0.01      0.01      0.47            0.01      0.01      0.01
Setting 1  comb     0.00    0.67        0.01    0.02    0.01    0.01      0.01      0.47            0.01      0.01      0.01
Setting 2  indep    0.23    0.61        0.06    0.00    0.82    0.76      0.05      1.00            0.00      1.00      0.02
Setting 2  dep      0.00    0.57        0.05    0.00    0.00    0.00      0.02      0.84            0.01      0.02      0.01
Setting 2  2-stage  0.02    0.58        0.06    0.00    0.08    0.07      0.02      0.84            0.01      0.11      0.01
Setting 2  comb     0.00    0.57        0.05    0.01    0.01    0.00      0.03      0.84            0.01      0.02      0.01
Setting 3  indep    0.00    0.54        0.02    0.01    0.01    0.01      0.04      0.69            0.01      0.01      0.01
Setting 3  dep      0.01    0.52        0.03    0.01    0.01    0.00      0.04      0.50            0.02      0.01      0.00
Setting 3  2-stage  0.00    0.53        0.02    0.01    0.01    0.00      0.04      0.65            0.01      0.01      0.00
Setting 3  comb     0.00    0.55        0.02    0.01    0.00    0.00      0.03      0.64            0.00      0.01      0.00
▶ LBL-GXEindep leads to inflated type I error rates in the presence of G-E dependence (Settings 1 and 2).
▶ LBL-GXEdep has reasonably good power for detecting interactions with rHTV while keeping the type I error
rates well-controlled under all settings.
▶ However, when G-E independence holds (Setting 3), LBL-GXEindep has higher power than LBL-GXEdep.
▶ LBL-GXEcomb provides a reasonable balance between the two by having powers similar to the power of
LBL-GXEindep in case of G-E independence but retaining the advantage of controlled type I errors in case of
G-E dependence.
▶ The two-stage approach gives inflated type I error rates under G-E dependence (as seen in Setting 2 results).
▶ There are six haplotypes in total (see Table 1).
▶ There are two rare haplotypes of frequencies 0.005 (R1) and 0.01 (R2).
▶ There is a binary environmental covariate, E with 50% prevalence.
▶ R2 interacts with E (R2XE) in increasing disease risk.
▶ Simulation settings are listed in Table 1.
⋆ In Setting 1, all haplotype frequencies depend on E.
⋆ In Setting 2, only 3 haplotype frequencies depend on E.
⋆ In Setting 3, haplotype frequencies do not depend on E.
▶ We consider two scenarios under each of the three settings:
⋆ Scenario 1: All ORs are set to be 1 (null). Results are shown in Table 2.
⋆ Scenario 2: OR of R1 (OR.R1) = 3, OR of R2XE (OR.R2XE) = 3.
ORs for other main and interaction effects are set to be 1 (no effect).
Results are shown in Table 3.
▶ n = 2000 (equal numbers of cases and controls).
▶ We report results for LBL-GXEindep, LBL-GXEdep, LBL-GXEcomb, and a two-stage approach.
▶ The two-stage approach first tests G-E independence at the 5% level and, depending on the result, applies LBL-GXEindep or LBL-GXEdep in the second stage.
Table 2, Setting 2, indep: 0.21 0.02 0.15 0.01 0.84 0.76 0.02 0.47 0.01 1.00 0.01
In our simulation study, π(M = I) = π(M = D) = 0.5, p1 = 0.3, p2 = 0.7.
Simulation Study
⋆ Model for aZ|E
Table 2, Setting 1, comb: 0.01 0.03 0.01 0.01 0.02 0.00 0.01 0.01 0.01 0.01 0.01
Simulation Results — Comments
P(Z = zk/zk′ | Y = 0, d, f) = δkk′ d fk + (2 − δkk′)(1 − d) fk fk′   (1)
Table 2, Setting 1, 2-stage: 0.01 0.03 0.01 0.02 0.02 0.01 0.02 0.02 0.02 0.02 0.01
bk1 is generated by setting bk1 = uk, where uk ∼ Double exponential(0, 2/0.5^2).
Table 2, Setting 1, dep: 0.01 0.03 0.00 0.01 0.02 0.01 0.02 0.01 0.02 0.02 0.01
Table 3: Power (gray shaded rows)/Type I error rates (non-shaded rows). %(BF>2) for Settings 1, 2, and 3 under Scenario 2 for LBL-GXEindep, LBL-GXEdep, LBL-GXEcomb, and the two-stage approach.
- Create bk0 and bk1, k = 1, ..., m − 1:
bk0 is generated using the current f: b_{k0} = \log(f_k / f_m).
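The mappings between the two parameterizations (creating b from f in the I → D move and recovering f from b_{k0} in the D → I move) can be sketched in Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, C = 1 is assumed, and the double exponential is read as having mean 0 and variance 2/ν^2 with ν = 0.5 (that is, scale 1/ν), which is an assumption about the parameterization.

```python
import math
import random

def indep_to_dep(f, rng, nu=0.5):
    """I -> D: create (b_k0, b_k1), k = 1..m-1, from the current f (baseline last)."""
    fm = f[-1]
    b0 = [math.log(fk / fm) for fk in f[:-1]]
    # Assumed Laplace(0, scale 1/nu) draw: exponential magnitude with a random sign.
    b1 = [rng.choice((-1.0, 1.0)) * rng.expovariate(nu) for _ in b0]
    return b0, b1

def dep_to_indep(b0):
    """D -> I: discard b_k1 and recover f from b_k0 alone."""
    denom = 1.0 + sum(math.exp(x) for x in b0)
    return [math.exp(x) / denom for x in b0] + [1.0 / denom]

rng = random.Random(0)
f = [0.3, 0.005, 0.01, 0.155, 0.11, 0.42]   # Setting 3 frequencies from Table 1
b0, b1 = indep_to_dep(f, rng)
f_back = dep_to_indep(b0)
print([round(x, 4) for x in f_back])
```

Because b_{k0} = log(f_k / f_m) is inverted exactly by the D → I formula, a move I → D followed by D → I returns the original frequencies.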
Model
Table 2, Setting 1, indep: 0.20 0.06 0.88 1.00 1.00 1.00 0.10 1.00 1.00 1.00 1.00
⋆ I → D:
⋆ H = set of all possible haplotype pairs.
Conclusions and Discussion
▶ LBL-GXEdep appears to model G-E dependence satisfactorily.
▶ LBL-GXEcomb, which allows moves between LBL-GXEindep and LBL-GXEdep and accounts for model uncertainty, is a promising approach to optimize power without sacrificing type I error control in both G-E independence and dependence scenarios.
▶ We conducted more simulations under scenarios with different ORs and different numbers of haplotypes, and similar results were obtained.
▶ In future work, we plan to apply the LBL-GXEdep and LBL-GXEcomb approaches to real data.
⋆ Prior for b: \pi(b_{kc}) = \frac{\nu}{2} \exp(−\nu |b_{kc}|), −∞ < b_{kc} < ∞, k = 1, ..., m − 1; c = 0, ..., C. We set ν = 0.5.
⋆ Uniform prior for d subject to the constraint \max_k \{ −f_k / (1 − f_k) \} < d < 1.
▶ At iteration t, randomly choose whether a move to the other model is proposed:
▶ If it is proposed to stay at the current model, we do ordinary MCMC to update the parameters.
▶ If it is proposed to go to the other model, we apply reversible-jump MCMC as follows (below we fix C = 1 for simplicity of notation):
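The random choice between staying and proposing a move can be sketched as follows. The function name `propose_model` is hypothetical; p1 and p2 are the staying probabilities P(M = I | M = I) and P(M = D | M = D), and the values 0.3 and 0.7 are the ones reported for the simulation study.

```python
import random

def propose_model(current, p1, p2, rng):
    """Return the model proposed at this iteration: stay, or switch to the other one."""
    stay_prob = p1 if current == "I" else p2
    if rng.random() < stay_prob:
        return current                     # ordinary MCMC update within the model
    return "D" if current == "I" else "I"  # reversible-jump move is proposed

rng = random.Random(1)
n_draws = 20000
stay_I = sum(propose_model("I", 0.3, 0.7, rng) == "I" for _ in range(n_draws)) / n_draws
stay_D = sum(propose_model("D", 0.3, 0.7, rng) == "D" for _ in range(n_draws)) / n_draws
print(round(stay_I, 2), round(stay_D, 2))
```

A proposed switch is then accepted or rejected with the acceptance probability of the corresponding move; that step is not shown here.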
⋆ Zi = missing haplotype pair of individual i.
Table 2: Type I error rates. %(BF>2) for Settings 1, 2, and 3 under Scenario 1 (null) for LBL-GXEindep, LBL-GXEdep, LBL-GXEcomb, and the two-stage approach.
⋆ Data: n1 cases, n2 controls, n1 + n2 = n.
⋆ Yi =1/0 = case/control status of individual i (i = 1, 2, . . . , n).
Simulation Results
Inference
⋆ Markov chain Monte Carlo (MCMC) methods are used to estimate posterior distributions.
⋆ Hypotheses to test association: H0: |β| < ϵ vs. Ha: |β| > ϵ (ϵ set to 0.1).
⋆ Calculate Bayes Factor (BF) = ratio of posterior odds to prior odds of Ha to H0.
⋆ If BF exceeds a certain threshold, the corresponding β (main or interaction effect) is declared significant
(associated). BF>2 is used in analysis.
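The Bayes factor rule can be illustrated with a small Monte Carlo sketch. It assumes that draws of a coefficient β are available from both the posterior and the prior; the function name `bayes_factor` and the toy draws are hypothetical, and the sketch does not handle the case where the prior odds are zero.

```python
def bayes_factor(post_draws, prior_draws, eps=0.1):
    """BF = (posterior odds of Ha: |beta| > eps) / (prior odds of Ha)."""
    def odds(draws):
        p = sum(abs(x) > eps for x in draws) / len(draws)
        return p / (1.0 - p) if p < 1.0 else float("inf")
    return odds(post_draws) / odds(prior_draws)

# Toy draws (illustrative only): half of the prior draws and three quarters
# of the posterior draws fall outside [-eps, eps].
prior_draws = [0.05, -0.05, 0.2, -0.3]
post_draws = [0.5, 0.4, 0.05, 0.6]
bf = bayes_factor(post_draws, prior_draws)
significant = bf > 2
print(bf, significant)
```

With real MCMC output the same calculation would be applied to each main and interaction effect separately.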
Table 1: Three simulation settings with G-E dependence (Settings 1 and 2) or G-E independence (Setting 3). ORs are as noted above.

Setting    Frequency  h01100  h10100  h11011  h11100  h11111  h10011
Setting 1  freq(E=0)  0.35    0.0075  0.0025  0.03    0.01    0.6
Setting 1  freq(E=1)  0.25    0.0025  0.0175  0.28    0.21    0.24
Setting 2  freq(E=0)  0.245   0.005   0.005   0.155   0.17    0.42
Setting 2  freq(E=1)  0.355   0.005   0.015   0.155   0.05    0.42
Setting 3  freq       0.3     0.005   0.01    0.155   0.11    0.42
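As a quick arithmetic check of Table 1, the following sketch verifies that each frequency vector sums to one and computes the marginal frequencies of the two rare haplotypes. It assumes that the 50% prevalence of E can be used as the mixing weight between the E = 0 and E = 1 frequencies; the variable and function names are illustrative.

```python
# Table 1 frequencies, ordered h01100, h10100 (R1), h11011 (R2), h11100, h11111, h10011.
table1 = {
    "Setting 1": ([0.35, 0.0075, 0.0025, 0.03, 0.01, 0.6],
                  [0.25, 0.0025, 0.0175, 0.28, 0.21, 0.24]),
    "Setting 2": ([0.245, 0.005, 0.005, 0.155, 0.17, 0.42],
                  [0.355, 0.005, 0.015, 0.155, 0.05, 0.42]),
}
P_E1 = 0.5  # assumed mixing weight: 50% prevalence of E

def marginal(freq_e0, freq_e1, p_e1=P_E1):
    """Average the E = 0 and E = 1 frequencies over the distribution of E."""
    return [(1 - p_e1) * f0 + p_e1 * f1 for f0, f1 in zip(freq_e0, freq_e1)]

marginals = {name: marginal(f0, f1) for name, (f0, f1) in table1.items()}
for name, m in marginals.items():
    print(name, "R1 =", round(m[1], 4), "R2 =", round(m[2], 4))
```

Under this assumption both dependence settings give marginal frequencies of 0.005 for R1 and 0.01 for R2, matching the Setting 3 values.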
References
[1] Biswas, S., Xia, S., Lin, S. (2013). Detecting Rare Haplotype-Environment Interaction with Logistic Bayesian LASSO. Genetic Epidemiology 38 31-41.
[2] Biswas, S., Lin, S. (2012). Logistic Bayesian LASSO for Identifying Association with Rare Haplotypes and Application to Age-Related Macular
Degeneration. Biometrics 68 587-597.
[3] Green, P.J. (1995). Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination. Biometrika 82 711-732.
[4] Kwee LC, Epstein MP, Manatunga AK, Duncan R, Allen AS, Satten GA. (2007). Simple Methods for Assessing Haplotype-Environment Interactions in
Case-Only and Case-Control Studies. Genetic Epidemiology 31 75-90.
[5] Mukherjee, B., Chatterjee, N. (2008). Exploiting Gene-Environment Independence for Analysis of Case-Control Studies: An Empirical Bayes-Type Shrinkage Estimator to Trade-Off between Bias and Efficiency. Biometrics 64 685-694.
Acknowledgment: This work was partially supported by grant 1R03CA171011-01 from the National Cancer Institute, NIH.