Hardy-Weinberg Model: Six Genotypes
Silvelyn Zwanzig
Hardy-Weinberg with six genotypes. In a large population of plants (Mimulus guttatus) there are three possible alleles S, I, F at one locus, resulting in six
genotypes labeled SS, II, FF, SI, SF, IF. Let θ1, θ2, θ3 denote the probabilities of
S, I, F with θ1 + θ2 + θ3 = 1. The Hardy-Weinberg model specifies that the six
genotypes have the probabilities:
Genotype      SS      II      FF      SI       SF       IF
Label x       1       2       3       4        5        6
Probability   θ1^2    θ2^2    θ3^2    2θ1θ2    2θ1θ3    2θ2θ3      (1)
Consider an i.i.d. sample (X1, ..., Xn) according to the distribution (1).

1. Does the distribution (1) belong to a strictly(!) k-parametric exponential family?

Theorem 3.2 implies that it is enough to consider one observation X from (1).
It holds

P(X = x) = θ1^(2I1(x)) θ2^(2I2(x)) θ3^(2I3(x)) (2θ1θ2)^(I4(x)) (2θ1θ3)^(I5(x)) (2θ2θ3)^(I6(x))
         = θ1^(2I1(x)+I4(x)+I5(x)) θ2^(2I2(x)+I4(x)+I6(x)) θ3^(2I3(x)+I5(x)+I6(x)) 2^(I4(x)+I5(x)+I6(x))

with

Ik(x) = 1 for x = k, 0 for x ≠ k,   k = 1, ..., 6.   (2)
Introduce

IS(x) = 2I1(x) + I4(x) + I5(x) = { 2 for x = 1;  1 for x = 4, 5;  0 else }
II(x) = 2I2(x) + I4(x) + I6(x) = { 2 for x = 2;  1 for x = 4, 6;  0 else }
IF(x) = 2I3(x) + I5(x) + I6(x) = { 2 for x = 3;  1 for x = 5, 6;  0 else }
and reformulate the probability function
P(X = x) = θ1^(IS(x)) θ2^(II(x)) θ3^(IF(x)) 2^(I4(x)+I5(x)+I6(x)).   (3)

Note

Σ_{k=1}^{6} Ik(x) = 1 for all x,

thus

IS(x) + II(x) + IF(x)
= 2I1(x) + I4(x) + I5(x) + 2I2(x) + I4(x) + I6(x) + 2I3(x) + I5(x) + I6(x)
= 2 Σ_{k=1}^{6} Ik(x) = 2.
Furthermore

θ1^2 + θ2^2 + θ3^2 + 2(θ1θ2 + θ1θ3 + θ2θ3) = 1.

It holds

θ1^2 + θ2^2 + θ3^2 + 2(θ1θ2 + θ1θ3 + θ2θ3) = (θ1 + θ2 + θ3)^2,

consistent with θ1 + θ2 + θ3 = 1.
Rewrite the probability function

P(X = x) = θ1^(IS(x)) θ2^(II(x)) (1 − θ1 − θ2)^(2−IS(x)−II(x)) 2^(I4(x)+I5(x)+I6(x))

as a 2-parametric exponential family

P(X = x) = h(x) A(θ) exp(ς1(θ)T1(x) + ς2(θ)T2(x)),   θ = (θ1, θ2)   (4)

with

h(x) = 2^(I4(x)+I5(x)+I6(x)),   A(θ) = exp(2 ln(1 − θ1 − θ2))   (5)

and

T1(x) = IS(x),   ς1(θ) = ln(θ1/(1 − θ1 − θ2))   (6)
T2(x) = II(x),   ς2(θ) = ln(θ2/(1 − θ1 − θ2)).   (7)
Properties, using (2):

T1(x) = 2I1(x) + I4(x) + I5(x)
T2(x) = 2I2(x) + I4(x) + I6(x)

We know with pk = P(X = k) that

Ik1(X)Ik2(X) = Ik(X) for k1 = k2 = k, 0 else,
EIk(X) = pk,   EIk1(X)Ik2(X) = pk for k1 = k2 = k, 0 else.   (8)
Hence

ET1 = 2θ1^2 + 2θ1θ2 + 2θ1θ3 = 2θ1(θ1 + θ2 + θ3) = 2θ1   (9)
ET2 = 2θ2^2 + 2θ1θ2 + 2θ2θ3 = 2θ2(θ1 + θ2 + θ3) = 2θ2

and

ET1^2 = E(2I1(X) + I4(X) + I5(X))^2
      = 4EI1(X) + EI4(X) + EI5(X) = 2EI1(X) + ET1
      = 2θ1^2 + 2θ1 = 2θ1(1 + θ1).   (10)
Cov(T1, T2) = E(T1 − 2θ1)(T2 − 2θ2) = E(T1T2) − 4θ1θ2,
E(T1T2) = E(2I1(X) + I4(X) + I5(X))(2I2(X) + I4(X) + I6(X)) = EI4(X) = 2θ1θ2.   (11)

Thus

Cov(T1, T2) = −2θ1θ2.

Summarizing, the distribution of (T1, T2, T3) is multinomial(2, θ1, θ2, θ3), where T3(x) = 2I3(x) + I5(x) + I6(x).
Solution: The i.i.d. sample (X1, ..., Xn) according to the distribution (1) belongs to a strictly 2-parametric exponential family.
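The two parametrizations (1) and (4)-(7) can be checked against each other numerically. A minimal Python sketch, assuming the illustrative values θ1 = 0.5, θ2 = 0.3, θ3 = 0.2 (any probabilities summing to 1 would do):

```python
from math import log, exp, isclose

# Hypothetical allele probabilities (an assumption for this sketch).
t1, t2, t3 = 0.5, 0.3, 0.2

# Genotype probabilities under (1), for x = 1..6.
p = [t1**2, t2**2, t3**2, 2*t1*t2, 2*t1*t3, 2*t2*t3]
assert isclose(sum(p), 1.0)

# Exponential-family form (4)-(7): h(x) A(theta) exp(z1*T1(x) + z2*T2(x)).
IS = [2, 0, 0, 1, 1, 0]          # T1(x) = I_S(x)
II = [0, 2, 0, 1, 0, 1]          # T2(x) = I_I(x)
h  = [1, 1, 1, 2, 2, 2]          # h(x) = 2^(I4+I5+I6)
A  = exp(2*log(1 - t1 - t2))     # A(theta) = (1 - t1 - t2)^2
z1 = log(t1/(1 - t1 - t2))       # natural parameter sigma_1
z2 = log(t2/(1 - t1 - t2))       # natural parameter sigma_2
p_ef = [h[x]*A*exp(z1*IS[x] + z2*II[x]) for x in range(6)]

# Both forms give the same probability function.
assert all(isclose(a, b) for a, b in zip(p, p_ef))
```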
2. Derive k and the minimal sufficient statistics.
Compare (4)-(7). Apply Theorem 3.2 and Theorem 4.10.

Solution: k = 2 and

T(X) = (T1(X), T2(X)),
T1(X) = Σ_{i=1}^{n} T1(Xi) = 2n1 + n4 + n5,
T2(X) = Σ_{i=1}^{n} T2(Xi) = 2n2 + n4 + n6,

where nk = #{xi = k, i = 1, ..., n}.
3. Are the minimal sufficient statistics complete?
Solution: Yes, because we can apply Theorem 5.5 with the natural parametrization (6) and (7).
The natural parameter space is defined by

Z* = { (ς1, ς2) : ∫ exp(ς1T1(x) + ς2T2(x)) h(x) dν < ∞ }.

Here

∫ exp(ς1T1(x) + ς2T2(x)) h(x) dν = Σ_{x=1}^{6} exp(ς1T1(x) + ς2T2(x)) h(x) < ∞,

thus Z* = R^2.
4. Derive the MLE for all unknown parameters: θ̂1,MLE, θ̂2,MLE, θ̂3,MLE.

The log-likelihood function is

l(θ) = T1(x) ln(θ1) + T2(x) ln(θ2) + T3(x) ln(1 − θ1 − θ2) + const

with T3(x) = 2n3 + n5 + n6.

∂l/∂θ1 = (1/θ1) T1(x) − (1/(1 − θ1 − θ2)) T3(x) = 0,
∂l/∂θ2 = (1/θ2) T2(x) − (1/(1 − θ1 − θ2)) T3(x) = 0.   (12)

Using θ1 + θ2 + θ3 = 1:

θ3 T1(x) − θ1 T3(x) = 0
θ3 T2(x) − θ2 T3(x) = 0   (13)

and adding both equations gives θ3(T1 + T2) − (1 − θ3)T3 = 0. Because T1(x) + T2(x) + T3(x) = 2n we obtain

θ̂3,MLE = T3/(2n) = (2n3 + n5 + n6)/(2n).

Analogously

θ̂1,MLE = T1/(2n) = (2n1 + n4 + n5)/(2n),   θ̂2,MLE = T2/(2n) = (2n2 + n4 + n6)/(2n).
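The MLE are simply the relative allele frequencies in the sample. A small Python sketch, with invented genotype counts n1, ..., n6 for illustration:

```python
# Hypothetical genotype counts n1..n6 for a sample of n = 100 plants.
counts = [30, 8, 5, 32, 17, 8]            # n1, ..., n6 (an assumption)
n = sum(counts)
n1, n2, n3, n4, n5, n6 = counts

T1 = 2*n1 + n4 + n5                        # number of S alleles in the sample
T2 = 2*n2 + n4 + n6                        # number of I alleles
T3 = 2*n3 + n5 + n6                        # number of F alleles
assert T1 + T2 + T3 == 2*n                 # each plant carries two alleles

# MLE = relative allele frequencies.
theta1_hat, theta2_hat, theta3_hat = T1/(2*n), T2/(2*n), T3/(2*n)
assert abs(theta1_hat + theta2_hat + theta3_hat - 1) < 1e-12
```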
5. Give the MLE for the probability of SI.

We have

pSI = 2θ1θ2.

Thus (Def. 5.2)

p̂SI,MLE = 2 θ̂1,MLE θ̂2,MLE = T1 T2 / (2n^2).
6. Are the MLE θ̂1,MLE, θ̂2,MLE, θ̂3,MLE optimal? In which sense? (Give the definition!)

The MLE are functions of the sufficient and complete statistic. We have to check the unbiasedness. Using (9),

E θ̂1,MLE = E(T1/(2n)) = ET1/(2n) = 2nθ1/(2n) = θ1.

By symmetry θ̂2,MLE and θ̂3,MLE are also unbiased. Thus by the theorem of Lehmann-Scheffé the maximum likelihood estimators are best unbiased estimators (Def. 5.6): these estimators have minimal variance in the class of all unbiased estimators.
7. Derive an alternative moment estimator for the probability of SI, based on the number of plants belonging to the genotype SI only.

Number of plants belonging to the genotype SI: Σ_{i=1}^{n} I4(xi) = n4.

Note the probability of SI equals p4. It holds by (8)

En4 = Σ_{i=1}^{n} EI4(xi) = n p4.

Thus

p̂SI = n4/n

is a reasonable unbiased estimator.
8. Compare the estimators p̂SI,MLE and p̂SI. Which one is better? Why?

First we have to study the unbiasedness of p̂SI,MLE:

E p̂SI,MLE = E( T1(x) T2(x) / (2n^2) ).

Because of (8) and (11) it holds

E(T1(x) T2(x)) = E (Σ_{i=1}^{n} T1(xi)) (Σ_{j=1}^{n} T2(xj))
= E Σ_{i=j} T1(xi) T2(xi) + E Σ_{i≠j} T1(xi) T2(xj)
= n E(T1 T2) + (n^2 − n) ET1 ET2
= n 2θ1θ2 + (n^2 − n) 2θ1 2θ2
= 2θ1θ2 (2n^2 − n).

Thus

E p̂SI,MLE = E(T1(x) T2(x)) / (2n^2) = 2θ1θ2 (1 − 1/(2n)).

The maximum likelihood estimator is not unbiased! Propose a bias-corrected estimator:

p̃SI = T1(x) T2(x) / (2n^2 − n).

This estimator p̃SI is BUE, because of the theorem of Lehmann-Scheffé.
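Since (T1, T2, T3) ~ multinomial(2n, θ1, θ2, θ3), the identity E(T1T2) = 2θ1θ2(2n^2 − n) can be verified exactly by enumerating all multinomial outcomes. A Python sketch with small n and illustrative θ values:

```python
from math import comb, isclose

# Illustrative values (assumptions for this sketch).
t1, t2 = 0.5, 0.3
t3 = 1 - t1 - t2
n = 5
m = 2*n   # (T1, T2, T3) ~ multinomial(m, t1, t2, t3)

# Exact E[T1*T2] by summing over all multinomial outcomes (a, b, m-a-b).
E = 0.0
for a in range(m + 1):
    for b in range(m - a + 1):
        pmf = comb(m, a)*comb(m - a, b) * t1**a * t2**b * t3**(m - a - b)
        E += a*b*pmf

assert isclose(E, 2*t1*t2*(2*n**2 - n))       # = 2n(2n-1) t1 t2
# Hence E p_SI_MLE = E/(2 n^2) = 2 t1 t2 (1 - 1/(2n)): biased,
# while E/(2 n^2 - n) = 2 t1 t2: the corrected estimator is unbiased.
assert isclose(E/(2*n**2 - n), 2*t1*t2)
```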
9. Calculate the Fisher information matrix for (1).

We have an i.i.d. sample, so consider only one observation. The score function is given by (compare (12))

V(θ) = ( ∂l/∂θ1 )  =  ( (1/θ1) T1 − (1/θ3) T3 )
       ( ∂l/∂θ2 )     ( (1/θ2) T2 − (1/θ3) T3 ).

Check: EV(θ) = 0. We have E((1/θ1) T1 − (1/θ3) T3) = 2 − 2 = 0.

Fisher information matrix: IX(θ) = Cov(V(θ)).
Using (11) we calculate

E((1/θ1) T1 − (1/θ3) T3)^2
= (1/θ1^2) ET1^2 − (2/(θ1θ3)) ET1T3 + (1/θ3^2) ET3^2
= (1/θ1^2) 2θ1(1 + θ1) − (2/(θ1θ3)) 2θ1θ3 + (1/θ3^2) 2θ3(1 + θ3)
= 2 (1/θ1 + 1/θ3)

and

E((1/θ1) T1 − (1/θ3) T3)((1/θ2) T2 − (1/θ3) T3)
= (1/(θ1θ2)) ET1T2 − (1/(θ1θ3)) ET1T3 − (1/(θ2θ3)) ET2T3 + (1/θ3^2) ET3^2
= (1/(θ1θ2)) 2θ1θ2 − (1/(θ1θ3)) 2θ1θ3 − (1/(θ2θ3)) 2θ2θ3 + (1/θ3^2) 2θ3(1 + θ3)
= 2/θ3.

Because of Corollary 4.1 the information of the whole sample is I_{X(n)}(θ) = n IX(θ), hence

I_{X(n)}(θ) = 2n [ 1/θ1 + 1/θ3    1/θ3          ]
                 [ 1/θ3           1/θ2 + 1/θ3   ].
10. Derive the Cramér-Rao bound for unbiased estimators of pSI = γ.

Apply formula (5.2.29): for all unbiased estimators

Var(γ̂) ≥ (Dθ g)^T IX(θ)^(−1) Dθ g.

Here g(θ) = 2θ1θ2, thus

Dθ g = (2θ2, 2θ1)^T

and

IX(θ)^(−1) = (1/(2n)) [ 1/θ1 + 1/θ3    1/θ3         ]^(−1)
                      [ 1/θ3           1/θ2 + 1/θ3  ]
           = (1/(2n)) [ (θ3 + θ2)θ1    −θ1θ2        ]
                      [ −θ1θ2          (θ3 + θ1)θ2  ]
           = (1/(2n)) [ (1 − θ1)θ1     −θ1θ2        ]
                      [ −θ1θ2          (1 − θ2)θ2   ].   (14)

Recall: θ1 + θ2 + θ3 = 1.

Then the CRB is

(1/(2n)) (2θ2, 2θ1) [ (1 − θ1)θ1   −θ1θ2 ; −θ1θ2   (1 − θ2)θ2 ] (2θ2, 2θ1)^T
= 2θ1θ2 (θ1 + θ2 − 4θ1θ2) / n.
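The inversion in (14) and the resulting bound can be checked numerically. A Python sketch with illustrative θ and n (both assumptions):

```python
from math import isclose

# Illustrative values (assumptions for this sketch).
t1, t2 = 0.5, 0.3
t3 = 1 - t1 - t2
n = 50

# Sample Fisher information 2n * [[1/t1+1/t3, 1/t3], [1/t3, 1/t2+1/t3]].
a, b, d = 2*n*(1/t1 + 1/t3), 2*n/t3, 2*n*(1/t2 + 1/t3)
det = a*d - b*b
inv = [[d/det, -b/det], [-b/det, a/det]]   # 2x2 matrix inverse

# Check the closed form (14): inverse = (1/2n) [[(1-t1)t1, -t1 t2], ...].
assert isclose(inv[0][0], (1 - t1)*t1/(2*n))
assert isclose(inv[0][1], -t1*t2/(2*n))
assert isclose(inv[1][1], (1 - t2)*t2/(2*n))

# CRB for g(theta) = 2 t1 t2 with gradient Dg = (2 t2, 2 t1).
g1, g2 = 2*t2, 2*t1
crb = g1*(inv[0][0]*g1 + inv[0][1]*g2) + g2*(inv[1][0]*g1 + inv[1][1]*g2)
assert isclose(crb, 2*t1*t2*(t1 + t2 - 4*t1*t2)/n)
```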
11. Calculate the efficiencies. Consider first the moment estimator p̂SI = n4/n.

Var(p̂SI) = (1/n^2) Var(Σ_{i=1}^{n} I4(xi)) = (1/n) Var(I4(X)) = (1/n) 2θ1θ2 (1 − 2θ1θ2).

Efficiency:

e(p̂SI, pSI) = CRB / Var(p̂SI) = [2θ1θ2 (θ1 + θ2 − 4θ1θ2)/n] / [2θ1θ2 (1 − 2θ1θ2)/n]
            = (θ1 + θ2 − 4θ1θ2) / (1 − 2θ1θ2).
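A quick numerical sanity check of this efficiency, with illustrative values (assumptions for the sketch):

```python
from math import isclose

# Illustrative values (assumptions for this sketch).
t1, t2 = 0.5, 0.3
n = 50

p4 = 2*t1*t2                          # probability of genotype SI
var_mom = p4*(1 - p4)/n               # Var(n4/n): binomial variance
crb = 2*t1*t2*(t1 + t2 - 4*t1*t2)/n   # Cramer-Rao bound from above
eff = crb/var_mom
assert isclose(eff, (t1 + t2 - 4*t1*t2)/(1 - 2*t1*t2))
assert eff < 1                        # the moment estimator is not efficient
```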
Consider now the bias-corrected maximum likelihood estimator:

p̃SI = T1(x) T2(x) / (2n^2 − n).

It holds

Var(p̃SI) = (1/(2n^2 − n)^2) Var(T1(x) T2(x)).

Observe

(T1(x), T2(x), T3(x)) ~ multinomial(2n, θ1, θ2, θ3).

Representation as sums of Bernoulli variables:

Tj(x) = Σ_{i=1}^{2n} Zi,j,   Zi,j ~ Ber(θj) i.i.d. over i,  j = 1, 2, 3,   Zi,1 + Zi,2 + Zi,3 = 1.
We need

Var(T1(x) T2(x)) = E(T1(x) T2(x))^2 − (E T1(x) T2(x))^2,   E T1(x) T2(x) = (2n^2 − n) 2θ1θ2.

Furthermore

T1^2(x) = (Σ_{i1=1}^{2n} Zi1,1)(Σ_{i2=1}^{2n} Zi2,1) = Σ_{i=1}^{2n} Zi,1 + Σ_{i1≠i2} Zi1,1 Zi2,1,

and analogously T2^2(x) = Σ_{i=1}^{2n} Zi,2 + Σ_{i3≠i4} Zi3,2 Zi4,2. Because Zi,1 Zi,2 = 0, the cross terms with a shared index vanish and

T1^2(x) T2^2(x)
= Σ_{i1≠i2} Zi1,1 Zi2,2
+ Σ_{i1≠i3≠i4} Zi1,1 Zi3,2 Zi4,2
+ Σ_{i2≠i3≠i4} Zi2,1 Zi3,1 Zi4,2
+ Σ_{i1≠i2≠i3≠i4} Zi1,1 Zi2,1 Zi3,2 Zi4,2.

Taking expectations term by term (all remaining summands have pairwise distinct indices):

E T1^2(x) T2^2(x) = 2n(2n − 1) θ1θ2
+ 2n(2n − 1)(2n − 2) θ1θ2^2
+ 2n(2n − 1)(2n − 2) θ1^2θ2
+ 2n(2n − 1)(2n − 2)(2n − 3) θ1^2θ2^2.   (15)

Hence

Var(T1(x) T2(x)) = E T1^2(x) T2^2(x) − (2n(2n − 1) θ1θ2)^2

and

Var(p̃SI) = (1/(2n^2 − n)^2) Var(T1(x) T2(x))
= (2θ1θ2 / (2n^2 − n)) (1 + (2n − 2)(θ1 + θ2) + (6 − 8n) θ1θ2).
Efficiency:

e(p̃SI, pSI) = CRB / Var(p̃SI)
= [2θ1θ2 (θ1 + θ2 − 4θ1θ2)/n] · (2n^2 − n) / (2θ1θ2 (1 + (2n − 2)(θ1 + θ2) + (6 − 8n) θ1θ2))
= (2n − 1)(θ1 + θ2 − 4θ1θ2) / (1 + (2n − 2)(θ1 + θ2) + (6 − 8n) θ1θ2).

The denominator exceeds the numerator by 1 − (θ1 + θ2) + 2θ1θ2 = θ3 + 2θ1θ2 > 0, so the efficiency is strictly less than 1.

Conclusion: the bias-corrected maximum likelihood estimator is also not efficient.
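Both the variance formula and the conclusion e < 1 can be verified exactly by enumerating the multinomial distribution of (T1, T2, T3). A Python sketch for small n and illustrative θ (assumptions):

```python
from math import comb, isclose

# Illustrative values (assumptions for this sketch).
t1, t2 = 0.5, 0.3
t3 = 1 - t1 - t2
n = 5
m = 2*n

# Exact first and second moments of T1*T2 under multinomial(2n, t1, t2, t3).
E, E2 = 0.0, 0.0
for a in range(m + 1):
    for b in range(m - a + 1):
        pmf = comb(m, a)*comb(m - a, b) * t1**a * t2**b * t3**(m - a - b)
        E  += a*b*pmf
        E2 += (a*b)**2*pmf
var_tilde = (E2 - E**2)/(2*n**2 - n)**2   # Var of the corrected estimator

# Closed form derived above.
closed = 2*t1*t2*(1 + (2*n - 2)*(t1 + t2) + (6 - 8*n)*t1*t2)/(2*n**2 - n)
assert isclose(var_tilde, closed)

crb = 2*t1*t2*(t1 + t2 - 4*t1*t2)/n
assert crb/closed < 1                     # efficiency below one
```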
12. Calculate the efficiencies of θ̂1,MLE, θ̂2,MLE.

We have θ̂1,MLE = T1/(2n), θ̂2,MLE = T2/(2n). Furthermore T1, T2 are multinomially distributed:

(T1(x), T2(x), T3(x)) ~ multinomial(2n, θ1, θ2, θ3).

Thus

Cov(T1, T2) = 2n [ θ1(1 − θ1)   −θ1θ2       ]
                 [ −θ1θ2        θ2(1 − θ2)  ]

and

Cov(θ̂1,MLE, θ̂2,MLE) = (1/(4n^2)) Cov(T1, T2)
                      = (1/(2n)) [ θ1(1 − θ1)   −θ1θ2       ]
                                 [ −θ1θ2        θ2(1 − θ2)  ].

Comparing with CRB = IX(θ)^(−1), see (14), we get that θ̂1,MLE and θ̂2,MLE are efficient.
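That the covariance matrix of (θ̂1,MLE, θ̂2,MLE) coincides with the bound (14) can also be confirmed by exact enumeration. A sketch with illustrative values (assumptions):

```python
from math import comb, isclose

# Illustrative values (assumptions for this sketch).
t1, t2 = 0.5, 0.3
t3 = 1 - t1 - t2
n = 5
m = 2*n

# Exact moments of (T1, T2) ~ multinomial(2n, t1, t2, t3).
E1 = E2 = E11 = E22 = E12 = 0.0
for a in range(m + 1):
    for b in range(m - a + 1):
        pmf = comb(m, a)*comb(m - a, b) * t1**a * t2**b * t3**(m - a - b)
        E1 += a*pmf; E2 += b*pmf
        E11 += a*a*pmf; E22 += b*b*pmf; E12 += a*b*pmf

# Covariance matrix of (theta1_hat, theta2_hat) = (T1, T2)/(2n).
c11 = (E11 - E1**2)/(2*n)**2
c22 = (E22 - E2**2)/(2*n)**2
c12 = (E12 - E1*E2)/(2*n)**2

# It matches the Cramer-Rao bound matrix (14): the MLEs are efficient.
assert isclose(c11, t1*(1 - t1)/(2*n))
assert isclose(c22, t2*(1 - t2)/(2*n))
assert isclose(c12, -t1*t2/(2*n))
```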