Linkage Disequilibrium

Linkage Disequilibrium
Linkage Disequilibrium
Linkage Equilibrium
Consider two linked loci
Locus 1 has alleles A1 , A2 , . . . , Am occurring at frequencies
p1 , p2 , . . . , pm
locus 2 has alleles B1 , B2 , . . . , Bn occurring at frequencies
q1 , q2 , . . . , qn in the population.
How many possible haplotypes are there for the two loci?
Linkage Disequilibrium
Linkage Equilibrium
The possible haplotypes can be denote as
A1 B1 , A1 B2 , . . . , Am Bn with frequencies h11 , h12 , . . . , hmn
The two linked loci are said to be in linkage equilibrium (LE),
if the occurrence of allele Ai and the occurrence of allele Bj in
a haplotype are independent events. That is, hij = pi qj for
1 6 i 6 m and 1 6 j 6 n.
Remember that Hardy Weinberg Equilibrium (HWE) requires
independent assortment of alleles at a single locus. Under
HWE, we can obtain genotype frequencies at a locus based on
the allele frequencies
Linkage equilibrium requires independent assortment of the
alleles at two linked loci. We can obtain haplotype frequencies
for two loci based on the allele frequencies at the two loci
Linkage Disequilibrium
Linkage Disequilibrium
Two loci are said to be in linkage (or gametic) disequilibrium
(LD) if their respective alleles do not associate independently
Consider two bi-allelic loci.
There are four possible haplotypes: A1 B1 , A1 B2 , A2 B1 , and
A2 B2 .
Suppose that the frequencies of these four haplotypes in the
population are 0.4, 0.1, 0.2, and 0.3, respectively.
Are the loci in linkage equilibrium?
Which alleles on the two loci occur together on haplotypes
than what would be expected under linkage equilibrium?
Linkage Disequilibrium
Measures of Linkage Disequilibrium
The Linkage Disequilibrium Coefficient D is one measure of
LD.
For ease of notation, we define D for two biallelic loci with
alleles A and a at locus 1; B and b at locus 2:
DAB = P(AB) − P(A)P(B)
What about DaB ?
Linkage Disequilibrium
Linkage Disequilibrium Coefficient
Can similarly show that DAb = −DAB and Dab = DAB
LD is a property of two loci, not their alleles.
Thus, the magnitude of the coefficient is important, not the
sign.
The magnitude of D does not depend on the choice of alleles.
The range of values the linkage disequilibrium coefficient can
take on varies with allele frequencies.
Linkage Disequilibrium
Linkage Disequilibrium Coefficient
By using the fact that pAB = P(AB) must be less than both
pA = P(A) and pB = P(B), and that allele frequencies cannot
be negative, the following relations can be obtained:
0 6 pAB = pA pB + DAB 6 pA , pB
0 6 paB = pa pB − DAB 6 pa , pB
0 6 pAb = pA pb − DAB 6 pA , pb
0 6 pab = pa pb + DAB 6 pa , pb
These inequalities lead to bounds for DAB :
−pA pB , −pa pb 6 DAB 6 pa pB , pA pb
Linkage Disequilibrium
Linkage Disequilibrium Coefficient
bounds for DAB :
−pA pB , −pa pb 6 DAB 6 pa pB , pA pb
What is the theoretical range of the linkage disequilibrium
coefficient DAB and its absolute value |DAB | under the follow
scenario: P(A) = 12 , P(B) = 13
Linkage Disequilibrium
Normalized Linkage Disequilibrium Coefficient
The possible values of D depend on allele frequencies. This
makes D difficult to interpret. For reporting purposes, the
normalized linkage disequilibrium coefficient D 0 is often used.
(
DAB
max(−pA pB ,−pa pb ) if DAB < 0
0
DAB
=
(1)
DAB
if DAB > 0
min(pa pB ,pA pb )
Linkage Disequilibrium
Estimating D
Suppose we have the N haplotypes for two loci on a
chromosomes that have been sampled from a population of
interest. The data might be arranged in a table such as:
A
a
B
nAB
naB
nB
b
nAb
nab
nb
Total
nA
na
N
We would like to estimate DAB from the data. The maximum
likelihood estimate of DAB is
D̂AB = p̂AB − p̂A p̂B
where p̂AB =
nAB
N ,
p̂A =
nA
N,
and p̂B =
nB
N
So the population frequencies are estimated by the sample
frequencies
Linkage Disequilibrium
Estimating D
The MLE turns out to be slightly biased. If N gametes have
been sampled, then
N −1
E D̂AB =
DAB
N
The variance of this estimate depends on both the true allele
frequencies and the true level of linkage disequilibrium:
Var D̂AB =
1
2
N pA (1 − pA )pB (1 − pB ) + (1 − 2pA )(1 − 2pB )DAB − DAB
Linkage Disequilibrium
Testing for LD with D
Since DAB = 0 corresponds to the status of no linkage
disequilibrium, it is often of interest to test the null hypothesis
H0 : DAB = 0 vs. Ha : DAB 6= 0 .
One way to do this is to use a chi-square statistic. It is
constructed by squaring the asymptotically normal statistic z:
2
D̂AB − E0 D̂AB

Z2 = 
Var0 D̂AB

where E0 and Var0 are expectation and variance calculated
under the assumption of no LD, i.e., DAB = 0
Under the null, the test statistic will follow a Chi-Squared
(χ2 ) distribution with one degree of freedom.
Linkage Disequilibrium
Measuring LD with r 2
Define a random variable XA to be 1 if the allele at the first
locus is A and 0 if the allele is a.
Define a random variable XB to be 1 if the allele at the
second locus is B and 0 if the allele is b.
Then the correlation between these random variables is:
DAB
COV (XA , XB )
rAB = p
=p
Var (XA )Var (XB )
pA (1 − pA )pB (1 − pB )
It is usually more common to consider the rAB value squared:
2
rAB
=
2
DAB
pA (1 − pA )pB (1 − pB )
Linkage Disequilibrium
Measuring LD with r 2
R 2 has the same value however the alleles are labeled
Tests for LD: A natural test statistic to consider is the
contingency table test. Compute a test statistic using the
Observed haplotype frequencies and the Expected frequency if
there were no LD:
X2 =
X
possible haplotypes
(Observed cell − Expected cell)2
Expected cell
Under H0 , the X 2 test statistic has an approximate χ2
distribution with 1 degree of freedom
It turns out that X 2 = Nr̂ 2
Linkage Disequilibrium
D 0 and r 2
The case when D 0 = 1 is referred to as Complete LD
In this case, there are at most 3 of the 4 possible haplotypes
present in the populations. The intuition behind complete LD
is that the two loci are not being separated by a recombination
in this population since at least one of the haplotypes does not
occur in the population.
The case when r 2 = 1 is referred to as Perfect LD
The case of perfect LD occurs when there are exactly 2 of the
4 possible haplotypes present in the population, and as a
result, the two loci also have the same allele frequencies.
Loci that are in perfect LD are necessarily in complete LD
Linkage Disequilibrium
D 0 and r 2
If the two loci both have very rare alleles and the rare alleles
do not occur together on a haplotype, for example, it is
possible for D 0 to be 1 (since 1 of the haplotypes does not
occur in the populations) and for r 2 to be small (when the
alleles at the two loci for the 3 remaining haplotypes are not
correlated).
For this and other reasons, it is often useful to report both r 2
and D 0
Linkage Disequilibrium