Reynolds, John.Genetic Distance and Coancestry."

GENETIC DISTANCE AND COANCESTRY
John Reynolds
#1341
iv
TABLE OF CONTENTS
Page
LIST OF TABLES •
vi
LIST OF FIGURES
viii
1.
INTRODUCTION
1
2.
GENETIC DISTANCE AND THE COANCESTRY COEFFICIENT
3
2.1
2.2
2.3
3.
Background
The Model •
Notation
ESTIMATION OF THE COANCESTRY COEFFICIENT USING MULTILOCUS
GENOTYPIC DATA • • • • • • • • • • • • • • • • •
3.1
15
The Variance-Covariance Matrix
of the Variance Components • •
The Asymptotic Distribution of
The Estimation of y • • • • •
Large Sample Variances of Some
24
34
35
42
3.1.2
3.1.3
3.1.4
3.2
3.3
5.
of the Estimators
• • . • • • • .
s as a ~ ~
• • • • • •
Estimators of
Many Loci Each with Two Codominant Alleles
A Special Case--Random Mating
3.3.1
3.3.2
One Locus
Two Loci.
A SIMULATION STUDY
4.1
4.2
Discussion of Programming •
Results . • • • • . • • .
DOMINANCE, MULTIPLE ALLELES AND OTHER EXTENSIONS .
5.1
5.2
5.3
15
Two Loci Each with Two Codominant Alleles •
3.1.1
4.
3
8
9
Dominance. . . • . • • •
Multiple Alleles . . • .
Mutation and Migration
e .
44
48
62
64
71
71
75
82
82
85
89
6.
DISCUSSION .
92
7.
SUMMARY
98
8.
LIST OF REFERENCES .
100
v
TABLE OF CONTENTS (Continued)
Page
9.
APPENDICES • • • •
9.1
9.2
Derivation of the Variance-Covariance Matrix of the
Vector s
• • • • • • • • • •
Asymptotic Multivariate Normality of s as a + 00
103
104
116
vi
L1ST OF TABLES
Page
2.1
Descent measures for one locus •
2.2
Nonidentity descent measures for two loci
12
2.3
Frequencies of gene combinations at two loci •
13
3.1
A bivariate analysis of variance of the random vector
~ijk
....
7
.........·....
. th
3.2
Gametic counts for the
3.3
The elements of the submatrices of A
3.4
The elements of the submatrices of
3.5
The submatrices of A and
mating
3.6
.....
~
-
sampled population
for the
18
··..
..··
·
-·
special case of random
.·.......··..
Values of contrasts in two locus nonidentity measures for a
monoecious population size N = 1000, with linkage parameter
A, at various times (t)
. • • •
•• . • • •
20
26
30
51
61
~
3.7
3.8
3.9
4.1
4.2
4.3
Asymptotic standard deviations (sd) of 8 and 8W for subdivided monoecious populations of size N = 100 at various
generation times,(t) • • . • • • • • .
v
-
'"
A
Asymptotic standard deviations (sd) of 8, 8, 8 and 8W for 30
subdivided monoecious populations of size N = 100 with
linkage parameter A = 0.0 • • . • •
• • • •.
"
-
A
65
66
....
Asymptotic standard deviations (sd) of 8, 8, 8 and 8W for 30
subdivided monoecious populations of size N = 100 with
linkage parameter A = 1.0 • • . • •
67
Realized means, standard deviations, mean square errors and
variances for four estimators of 8 for the case of two
monoecious populations diverging from an initial reference
population with parameters al = a2 = .25 and A = 0 . . . .
76
Realized means, standard deviations, mean square errors and
variances for four estimators of 8 for the case of two
monoecious populations diverging from an initial reference
population with parameters al = a2 = .25 and A = 0.9 . ••
77
Realized means, standard deviations, mean square errors and
variances for four estimators of 8 for the case of two
monoecious populations diverging from an initial reference
population with parameters al = .24, a2 = .09 and A = 0
78
vii
LIST OF TABLES (Continued)
Page
4.4
9.1
Realized means, standard deviations, mean square errors and
variances for four estimators of 8 for the case of two
monoecious populations diverging from an initial reference
population with parameters al = .24, a2 = .09 and A = 0.9.
79
Translations of redundant two-locus notation to one-locus
notation
. . . . . . . . . . . . . . . . . . . . .
108
viii
LIST OF FIGURES
Page
2.1
The model • • • • • • •
8
'ttl
3.1
_
A
A
Asymptotic coefficients of variation (CV) of 8 t 8 t 8 and 8W
for samples of size 50 from 20 monoecious populations of
size N = 1000 with al = 0.0196 t a2 = 0.2496 and A = 0.9 ••
70
1.
INTRODUCTION
There is some interest among geneticists in measurements, based on
electrophoretic data, of the genetic diversity among populations and in
the relationship of such measurements to particular models for the
evolution of populations.
The study of such measures of genetic
diversity or distance has been described by Kirk (1977) as a "major
growth industry" and that author has added that "new measures of
distance come off the assembly line almost as frequently as new car
models."
This work is concerned with a preexisting measure of genetic
distance and addresses the problem of estimating this distance when
multilocus data are available.
The need for a genetic distance, from which time since divergence
can be recovered, for instances of short-term evolution is described in
Chapter 2.
The model upon which such a genetic distance is based is
explicated and the notation utilized in the rest of the thesis is also
described in that chapter.
In Chapter 3 an analysis of variance with a structure reflecting the
model presented in Chapter 2 is described.
Such an analysis of variance
was presented by Cockerham (1969) for the case of one locus and Cockerham
and Weir (1977) for the case of two loci.
Variances and covariances of
the estimated variance components in this analysis of variance are
presented.
The expressions for these variances and covariances are
completely general and apply to any regular system of mating.
They are
consequently rather tedious to derive and complete details of their
derivations are put in an appendix.
These expressions may be trans-
lated into functions of descent measures and initial gene frequencies
2
and this translation is simplified if linkage equilibria are assumed to
exist in the initial reference population at all pairs of scored loci.
Four estimators of the distance measure are introduced and some limited
properties of these estimators are presented, the most notable being
that for large numbers of replicate populations the estimators are
asymptotically normal.
The asymptotic variances of these estimators
are compared for the special case of monoecious random mating populations with random selfing.
The results of a small simulation study
designed to illuminate the small sample properties of these estimators
for the case of monoecy and two-scored loci are presented in Chapter 4.
The effects of some failures of assumptions on the methods presented
in Chapters 3 and 4 are briefly described in Chapter 5 and a discussion of
the relationship of this study to previous studies and recommendations
for future research are presented in Chapter 6.
completes the thesis.
A summary, Chapter 7,
3
2.
GENETIC DISTANCE AND THE COANCESTRY COEFFICIENT
A genetic distance is a scalar quantity which aims to measure how
different a pair of populations, or possibly several populations, are in
their genetic constitution.
The literature concerning genetic distances
is vast but has been thoroughly reviewed by, for example, Goodman (1972),
Smith (1977) and Nei (1978a).
In this chapter, a brief discussion of
previous studies is presented, the use of gene identity measures as
genetic distances for isolated populations which have evolved over a
short time is described, and the model and notational tools upon which
this thesis is founded are presented.
2.1
Background
Prior to 1970 most genetic distances (see, for example, Smith, 1977)
were constructed to serve solely as scalar-valued discriminants between
currently constituted populations.
As such, they were closely related
to geometric distances with populations being represented by points in
Euclidean space.
If two populations had similar "genetic constitutions,"
that is, similar gene frequencies, their genetic distance was by construction small.
Often small genetic distance was construed to mean
that the pair of populations were recently descended from a common
ancestral population (the underlying model of the evolution of the
populations being that of a tree rather than a web).
The recovery of
time since divergence from these distances was next to impossible as
the distances were based on analyses which were local in time.
The need for a genetic distance from which divergence time could be
recovered was initially met by Nei (1972, 1973), Latter (1973) and
Morton (1973) although such distances were inherent in the work of
4
Malecot (1948, 1969), Wright (1965) and Cockerham (1967,1969).
The
"standard genetic distance" proposed by Nei (1972) for the study of the
long-term evolution of two populations X and Y is given by
D = -log
e
I
where
(2.1.1)
J
XY
is the probability that a gene chosen at random from population X
is identical in state to a gene chosen at random from population Y, and
J
X
is the probability that two randomly chosen genes from population X
are identical in state, with a similar definition for J y •
If more than
one locus is studied, the JI S are arithmetically averaged over loci
before inclusion in (2.1.1).
The model of evolution which Nei used to
arrive at his distance measure appears to have the following features:
(i) Initially, at time zero, a random mating population in HardyWeinberg equilibrium, in linkage equilibrium at all pairs of loci, and
in drift-mutation equilibrium "splits" into two identical populations
which remain isolated.
(ii) At the unknown time t at which the genetic distance between
the two populations is required, the populations are still in the three
types of equilibrium mentioned in (i).
(iii) All new mutations are unique (an infinite-alleles model).
(iv) No selection.
(v) Equal population sizes or equal effective population sizes.
The effect of assumption (i) is to eliminate a source of sampling
variation due to the fission process and to ensure that I at time zero
is equal to one.
The infinite-alleles model and this splitting process
5
also ensure that "identity in state" is equivalent to "identity by
descent" so that Nei is able to use the recursion for identity given
by Kimura and Crow (1964) to obtain at time t,
I
~
'\J
e
-2vt
where v is the mutation rate.
Thus,
D ~ 2vt
and this standard genetic distance is approximately a linear function of
divergence time.
In Nei's treatment of the variance of estimators of I
(Nei and Roychoudhury, 1974; Nei, 1978b) the total variance of an
estimator of I can be partitioned into components for loci and genes
within loci, because loci are assumed to be independent and mating is
assumed to be random.
Cockerham, Weir and Reynolds (1981), however,
show that when these assumptions do not hold, the partition of variation
of identity measures generates labels for populations, loci, loci by
populations, individuals within populations, and loci by individuals
within populations.
Moreover, in the partition of the total variance of
actual inbreeding, the component of variance for loci is zero as each
locus has the same expected inbreeding, but how these components relate
to the components of variance for estimators of I is not clear.
Latter (1973) suggested that the mean coefficient of kinship within
populations or the average coancestry coefficient, where the averaging
is essentially over all pairs of individuals within populations and all
replicate populations, may be used as a measure of genetic distance
between two populations for the case of drift alone and drift plus
mutation to novel alleles.
He presented an estimator of this distance
which, apart from a small correction factor, was a weighted average over
6
all scored alleles and all scored loci of the estimator of the coancestry coefficient proposed by Cockerham (1969) for the case of two
alleles at one locus.
Cockerham (1979) proposed that the coancestry coefficient, denoted
by 8
or S in the sequel and defined as the average of the N(N-l)/2
1
coancestries SXy of pairs of individuals X and Y in a population of
size N could be used as a genetic distance between a pair or among
several populations.
The coancestry Sxy of a pair of individuals X
and Y within a population is defined in Table 2.1 which is adapted from
Cockerham (1971).
One feature of the coancestry coefficient is that
for many mating systems involving finite populations it is a monotonically increasing function of generation time t so that given
information about the mating system and making sOme assumptions about
the fission process, time since fission or divergence can be recovered
from knowledge of 8 •
1
For example, for a monoecious population of
size N mating at random including a random amount of selfing, 8
1
time t is equal to
1 - (1 -
2~) t
so that
t =
10ge(1-8 )
l
log
~
e
(1 - --l:.)
2N
-2N 10ge(1-8 )
l
while for a monoecious population of size N mating at random but
excluding selfing 8
1
at time t is equal to
at
7
Table 2.1
Measure
F
X
6
XY
a
Descent measures for one locus.
Identity relation
x - x
b
..
x - y
y..
x - x .. - y
YXYZ
x - y
0....
x - x" - y
°XYZ
x - x"
°XYZW
x - y - z
6.. ..
x
XY
XY
X+Y
!:".. ..
X.Y
-
-
z
-
-
y
..
y - z
-
w
y -
..
..
y
.. -
z
y, x" - y
x - x
.,
6 X+YZ
x - y, x
!:"X.YZ
x - x
!:"XY .ZW
x - y, z
.. ,
y - z
-
w
aDifferent upper case subscripts denote different random
individuals.
b
Lower case letters denote random genes from corresponding
upper case individuals. Primes denote different genes. Nothing
is implied about genes not shown to be identical by descent.
8
where r -
N
IN 2+1
and t may be recovered by tabular or graphical means.
Recognizing that the assumptions concerning equilibria in the
sampled populations, which are made in the derivation of Nei's measure
of genetic distance are inappropriate for populations which have only
recently split from a common ancestral population, the remainder of
this chapter, and the thesis, is devoted to the use of the coancestry
coefficient and other functions of gene identity measures as genetic
distances.
2.2
The Model
The model for the fission process and other sampling processes is
presented diagrammatically in Figure 2.1.
iNiTIAL
An initial reference population,
REFERENCE
POPULATION
GENERATION
o
t-i
S,:i\·lFLES
Figure 2.1
PCPUUT10NS
The model. Straight arrows represent sampling of
offspring from progeny arrays. Jagged arrows
represent the formation of the isolated populations.
9
noninbred, essentially infinite, in Hardy-Weinberg equilibrium at each
locus·, but not necessarily in linkage equilibrium at every pair of loci,
is imagined.
Replicate populations begin as independent random
samples of size N from the initial reference population at time zero.
Generations are discrete, there is no selection, and replicate populations are assumed to remain isolated and constant in size.
Each
replicate population at a certain generation time is a random sample of
size N from the conceptually infinite offspring array generated according
to the rules of the mating system by that replicate population in the
previous generation.
The samples at generation t are from the concep-
tually infinite offspring arrays generated by the replicate populations
in generation t-l so that the sample size, n, may be greater than the
size, N, of each replicate population.
This model, which has a long history in population genetics and
essentially dates back to the work of Wahlund (see, for example,
Cockerham, 1973) differs in several key respects to the model utilized
by Nei (1972, 1978a).
Firstly, the founder populations are not
necessarily identical and as independent random samples of finite size
N may not be in Hardy-Weinberg equilibrium or linkage equilibrium.
Secondly, this model is not restricted to binary fission, and thirdly,
mating systems other than monoecy can be subsumed under the model.
2.3
Notation
In the sequel, several notations are used to denote various
probabilities of states of identity between and among genes.
While all
these notations have appeared in the literature before, they are
gathered here for ease of reference.
The necessary one-locus descent
10
measures have been given in Table 2.1.
These one-locus measures are
expectations over all replicate populations, and over all pairs, triples
or quadruples of individuals (whichever the case may be) in each
replicate population.
Initially, in Chapter 3, the digenic descent measures
Fl , IF
and 18
are used to express equivalence relationships between genes at different
loci.
Definitions of these descent measures can be found in Cockerham
and Weir (1977).
Later in Chapter 3 and the rest of the thesis,
particularly in the derivation of variances and covariances of estimated
variance components, linkage disequilibria between pairs of loci in the
initial reference population are assumed to be zero sO that these
descent measures and their precursors in Cockerham and Weir (1973) are
not required in the translation of frequencies of gene combinations to
functions of descent measures and gene frequencies.
The notation for the frequencies of various gene combinations is
given in Weir and Hill (1980) and Weir, Cockerham and Reynolds (1981).
For combinations of genes at one locus, horizontal and vertical bars
separate genes from different individuals, so that, for example,
#A
is the probability that a random gene from a locus in a randomly chosen
individual is allele A and that a randomly chosen gene from the same
locus in another randomly chosen individual is also allele A.
probability is given by
This
11
where PA is the frequency of allele A in the initial reference population.
As another example, consider the probability that three genes
chosen at random from the same locus in three different randomly chosen
individuals are all allele A.
This probability, denoted by
is given by
For combinations of genes at two loci, the necessary notation is given in
Tables 2.2 and 2.3 which are both adapted from Weir and Hill (1980).
In
Table 2.3, horizontal and vertical bars separate genes in different
individuals while diagonal bars separate genes on different gametes
within the same individual.
For example, assuming that linkage dis-
equilibrium in the initial population is zero, the frequency of the
double homozygote AABB is given by
(2.3.1)
(2.3.2)
Equation (2.3.1) can be deduced from Table IV of Cockerham and Weir
(1973) and involves probabilities of identity by descent.
The
equivalent expression in equation (2.3.2) involves the probabilities of
nonidentity given in Tables 22 and 2.3 and reduces to the appropriate
expression in Table 2.3.
As another example consider the probability that in two randomly
chosen individuals. one is homozygous for allele A at the first locus
and a randomly chosen gene from the second locus of this individual is
12
Table 2.2
Measure
e1
f
Gene arrangements b
(abl a'b')
(ab)(a'b ')
82
f
a
Nonidentity descent measures for two loci.
(ab)(a' Ib ')
1
(abla') (b') or (ab Ib ')(a')
2
T"
(ab) (a') (b')
~1 *
(a IbHa'1 b ')
~2*
(ala') (bib')
~3 *
(alb)(a')(b')
·3
..
~4*
(al a') (0) (0') or (bj b') (a) (a')
~
(a) (a') (0) (b')
J
aThe probability that genes a, a~ at one locus are not
identical by descent, and genes b, b~ at a second locus are
also not identical by descent.
b
Bars separate genes on separate gametes and parentheses
separate genes in separate individuals.
13
Table 2.3
Frequencies of gene combinations at two loci.
Coefficient of
Frequency
PA P B(2 - P A - PB) a
pAS
••
pAl!
1
pAl!
L
·.
·.
+
pAB
1
p..\B
A'
'5
~IB
pAlS
A. •
+
p•.\B
A'
+ "oo\!
• B
•B
,/,./3 + pA/3
° 3·
A. •
r#!+~
Alo
-oiS
A 3
FA/a
2
-p
2
-p
2
2
-a
2
-!I
1
-p
1
-IT
1
-u
tr\a+~3
2
- (F
~
L
-
pA/3
1
ArB
PA (l-PA)PB(l-P B)
-AlB
- Ai 3
AlB
+
•.1.
AI B
L
-p
pA/3
1
- il
?A~ + ~B
2
~
Ala
1
AlB
AlB
- (?
IT)
-
pAl3
AlB
r1
+
..
-~
llZ"
IT)
14
allele B, while a randomly chosen gene from the second locus of the
second individual is allele B.
This probability is given by
The advantage of working with nonidentity rather than identity
when studying gene combinations involving two loci is simply that the
transition equations for the two-locus nonidentity measures are homogeneous
(Weir, Avery and Hill, 1980).
15
3.
ESTIMATION OF THE COANCESTRY COEFFICIENT USING
MULTILOCUS GENOTYPIC DATA
Estimation of the coancestry coefficient in the one-locus case,
when the unit of observation is the gene, has been treated extensively
by Cockerham (1969, 1973).
As a paradigm for further work, the estima-
tion of the coancestry coefficient when two loci
each with two co-
dominant alleles are scored, is discussed in some detail.
3.1
Two Loci Each with Two Codominant Alleles
For t h e k th gamete
~n
0
t h e j th
~n
0
dO~v~°d ua1
~n
0
t he
~
0
th popu1 at~on,
0
consider the random vector of indicator random variables
X
o ok
-~J
where
X
=
o 0kl
~J
[Xo Okl]
X:: k2
(3.1.1)
= 1 if gamete k in individual j in population i carries
allele A at the first locus,
= o otherwise;
and where,
x
ijk2
= l i f gamete k in individual j in population i carries
allele B at the second locus,
= o otherwise.
Such a random vector is defined in Cockerham and Weir (1977).
The
expectation of this random vector over the three stages of sampling (a
population from an infinite array of replicate populations, n individuals
from an infinite progeny array in each replication population and a
complete sample or census of the two haplotypes or gametes within each
16
sampled individual) is simply the vector of frequencies of allele A
and allele B in the initial population,
Expectations of various products of these random vectors are as follows:
-,
PAPB+IFtlAB
I
2 .
PB+F1PB(1-PB)J
=
= C. a
T
x
E(x.
-1.J"k -1., , J. 'k')
+
~~
T
rp~+elPA(l-PA)
l
PAPB +
[ii
6AB
,
i
f:. i '
.,
P P + 8 tl
A B 1
AB
2
-
i
P B + elPB(l-PB)~
17
In Table 3.1 a bivariate analysis of variance for the random vector
x"k is presented.
-~J
Two decompositions of the bivariate analogs of
expected mean squares are given.
The first decomposition is standard
(see, for example, Bock, 1975) and is in
component matrices.
ter~s
of variance-covariance
The second decomposition is a bivariate analog of
the covariance representation given by Cockerham (1969).
It should be
noted that the first decomposition differs a little from that given by
Cockerham (1969, 1973) and Cockerham and Weir (1977).
Those authors
write expected mean squares as linear combinations of components of
variation which are equivalent to the "polykays" of Tukey (1956) and
the "cap sigmas" of Zyskind (1962).
The decomposition, in Table 3.1,
which takes into account the complete sampling at the third stage
(gametes within individuals) may be slightly more customary.
Certainly
the condition for a parametrically negative component of variation for
individuals within populations is more stringent with the present
The condition is F < 8 with the previous authors'
l
1
1
decomposition but Z(l+F ) < 8 with the present decomposition.
l
1
decomposition.
Quadratic unbiased estimators of the variance-covariance component
matrices L
C
S
c
'
1
an
=-
L and La are, respectively:
b
[~
~
L L
j k
T
x, 'kx , 'k
-~J
1
~. +
2an • 2.
= --
-~J
N2 •
l.
1
L
2 ,
- -
~
L
j
(~ ~ijk)(~ ~~jk)J
;1 +
.. 2
N2 .
11
Table 3.1
--_
e
e
e
..
~ijk.
A bivariate analysis of variance of the random vector
_-----------_.
Hatrlcea of Sua,s of Squares and Cr"ss l'roducts (SSP)·
dt
SOlin:£'
"(SSP/df)
-------------------------------------------------------
f(1: 1:
~I Jk)(r. E E ~~'Jk )
an I J k
I J k
'
Pupu lalluus
a-I
:n ; (~ ~ ~IJk)(~ ~ ~;Jk) - 2~n (~ ~ ~ ~IJk)(f' ~ ~ ~~jk)
.. (n-I)
t I J (1:k !!ljkXtk ~~'jk) - :f- I(tj tk '.!Ijk)(EJ k !!;jk)
lndividualg
1: E
Gdmetes
an
within
l: E I:
IJk
IlIJI~tdual6
1° /lAO
10/lAD
°1 Po (I-PO)
-
t IJk
E (E ~iJk)(E ~:Jk)
k
r.c
t
I ,.
E E E ~ljk~IJk
i
[O~PA(I_PA)
~IJk~'~jk
]
,.
2 "c + '1>
j k
",.t[
(~I+FI_- 201~PA(I-pA)
(F "IF-210)IIAII
(F-I .. IF- - 2 -(1)II
1
AB
(I ..
F1-
2° )1'0(1-1'11)
1
• ~r +C
ab
+ 2(n-I)C
a
+ 2anlili
r
2Eb • ~ + Cab - 2 C a
E
T
2an
Tnt:))
=
E
r
2':,," 2nEa • ~r" Cab" 2(n-l)C
a
n
wHhln Populations
Ea
2l:b + 2nEa " 2a,,00
1:
Mean
]
••
.. "a
+
Ec • [
r
• l~r - Cab
1I1i.
.. +
"r
I~l!
r
(I-FI)PA(I-PA)
-I
-
(F -IF)t.AO
(F
-I -IF)An
-
]
(1-;1)1'8(1-1'8)
....
(Xl
19
=
1
4a(n-1)
N1 • +
4
N2 • +
1 . . 1.
N1 •
• 2.
NIl +
• ..
N·· + 2 N· 1
NIl +
• 11
4 N· 1 +
• 1.
..1
N·· + 2 N1 •
• 11
..1
N· 1 +
..2
N· 2
..1
and,
1
1
sa = 2n Jla:1[2nL:(L:
..
1.J
=
1
4(a-1)n 2
L:~
.. k)(L: L:~:.k)
k 1.J
. k 1.J
J
-
~21
L:
an .
2
L:(2 N1 • + N1 . + N2 .)
iiI.
i 2.
i 1.
2
_ 1:.(2
a
1
- -
n
N1.+ N1.+ N2.)
. 1. • 2 • . 1.
S
b '
L:~
L:~:.k)'~
L:
.. k)(L: L:
. k 1.J
. . k 1.J
1.J
1.J
20
where the genotype counts (the N's) are defined in Table 3.2.
The
estimators of the variances are in fact minimum-variance-quadraticunbiased as the third and fourth moments of the components of
~ijk
are
finite (Graybill and Hultquist, 1961).
Table 3.2
Gametic counts for the i
th
sampled population.
Gamete 2
AB
AB'
A/B
A/B'
AB
11
iNU
i
11
N
12
11
N
i 21
N11
i 22
~
AB'
12
N
i 11
12
N
i 12
12
N
i 21
12
N
i 22
~
A'B
21
N
i 11
21
N
i 12
21
N
i 21
21
N
i 22
~
A/B'
22
iNn
22
N
i 12
22
N
i 21
22
N
i 22
~
iNii
iNii
i
N· .
21
iNii
~
.N
U
·.
12
.N
Gamete 1
·.
21
N
·.
•l
22
.N
·.
·. = n
. N··
It should be noted that while the off-diagonal elements of these
observed variance-covariance component matrices require the detection
of coupling and repulsion double heterozygotes, the diagonal elements
do not share this requirement and merely require single locus genotypic
counts.
Several approaches which may lead to the recovery of the coancestry
coefficient from these observed variance-covariance matrices are possible.
One approach, as 8 in the one-locus case is an intraclass correlation,
1
is to search for bivariate analogs of the intraclass correlation and take
21
appropriate scalar functions of them.
Two possibilities for bivariate
analogs of the intraclass correlation are (where T denotes elementwise
division) :
and
R = 2: 2:-1
a T
2
=
Three scalar functions which either extricate 8 or at least come close
1
to extricating 8 are as follows:
1
(3.1.2)
(3.1.3)
and
A
max
(3.1.4)
(R ) =
Z
where tr denotes the trace of a matrix and A
max
eigenvalue.
population
denotes the largest
When there is no linkage disequilibrium in the initial
(~AB=O),
all three scalar functions are equal to 8 .
1
Both
equation (3.1.3) and equation (3.1.4) along with eight other scalar
functions have been suggested by Ahrens (1976) as possible generalizations of the intraclass correlation coefficient.
The procedure
22
suggested by Ahrens for estimating such quantities is simply to replace
~a
and
~T
where ST
in equations (3.1.3) and (3.1.4) by Sa and ST' respectively,
= 21
Sc + Sb + Sa·
(3.1.2) and subsequently
61
Thus, for example, the quantity in equation
is estimated by an unweighted average,
namely,
...s
(1,1)
1
8=-1
a
2 LS (l,l)
T
S (2,2)l
+_a;;;.....~~
(3.1.5)
ST(2,2)~
Another suggestion, due to Cockerham (1979), is to use the
weighted average,
S (1,1)+8 (2,2)
a
a
as an estimator of the coancestry coefficient.
The rationale behind
this suggestion is that when the observed variance components in
equation (3.1.6) are replaced by their expectations, the result is
tr(~a)/tr(~T)
= 61 •
As estimators of the coancestry coefficient, these
quantities in equations (3.1.5) and (3.1.6) do not readily appear to
have their genesis in some optimization problem.
Recognizing the nonlinear structure of
~a'
~b
and LC ' another
approach is to adapt the methodology of Joreskog (1970) and Browne (1974),
reviewed in Joreskog (1978), for estimating the parameters in covariance
matrices with nonlinear structure, to the problem of estimating the
coancestry coefficient.
In order to simplify the discussion some changes in notation are
made.
Let,
::J
23
Sb =
S =
c
[:1
[:1
:J
:J
"
=
[(1+F_al.
Zl
rc
=
G:-Fl'
]
l+F
Z
( - - a) 0.
Db
' and
Db
2
:~-F)J
l
where 0.1 = PA(l-PA), 0. 2 = PB(l-PB), Da =
Dc
= (F-1 -IF)~AB
and where the subscripts and bars have been deleted from
the inbreeding coefficient (PI) and the coancestry coefficient
(61 ).
Noting that the expectations of y, v and u do not depend on F,
a,
0.1 or
0. , only the diagonal elements of the variance-covariance component
2
matrices need to be considered.
Let,
Then,
T
= ~ (y) =
r Sal' Sa ' (l+F)
(l+F
L
z \-2- - e 0'1' -Z- -
I
e ) O'z, (1-F)O'l' (l-F ) O'zJ
where,
0.1.8)
The estimation problem can now be viewed as the fitting of the
vector
~(r),
which depends on four parameters, to the observed vector s
containing six observations and bears a close resemblance to the type of
problem considered by Browne (1974).
Before proceeding further, the variance-covariance matrix of the
observed vector
~
is presented in the next section.
Knowledge of this
matrix allows asymptotic variances of estimators such as
v
e
and
e
to be
derived and, of course, influences the choice of objective function to
be minimized in the fitting of
£(r)
to s.
24
3.1.1
The Variance-Covariance Matrix of the Estimators of the Variance
Components
The variance-covariance matrix of the vector s of the estimated
variance components can be written in the form:
Var (s) = .!.A + -..!.~ +
1
T
a
anan(n-1)
1 <p+
1
1fI+
1
n
a(a-1)
a(a-1)n
a(a-1)n 2
+
The component matrices
<P =
2
A
(PA -
<P,
1fI and
2 2
PA)
n are as follows:
(pAl B _
(pAlB - P P )
.•
AB
2
(3.1.9)
2
PAP B)
2 2
2°4
(~
B PB)
4°4
4°2
1fI
=2
(pA
+
A
PA
A
(FA -
x
-
A
2FA)
2
PA)
..
AlB
x (P • . -PAP B)
n
=
1
2"
A
.. ..
(P
x
A 2
(pA+PA-2PA)
(pAB+ pA/B_ 2pAIB) 2
The component matrix T is as follows:
T
= -12
2A2
-2A2
-2 A2
A
2 2
2°4
..
x (PAlB
-P P )
• • A B
(pAB+pA/B_ 2pAIB)
.. ..
(pAB+pA/B_ 2pA/B)
4°2
2°2
B
+ pB _ 214)
B
B
B
2
(PP; - Pa)
25
where the submatrix A is given by,
A
pAB + 2pA/B + pA/B
=
AB
AB
AlB
-4(~
+ ~ - ~)
AlB
AlB
.1CAfB
~B + 2pBIB
+ pBIB
B.
BB
pAB + 2#/B + ~
AB
AB
AlB
-4 (n.!l.!!. + pB ~ _ p~)
-4(~+
~-~)
AlB
AlB
AlB
Lal·
L
The component matrices A and
= may
BIB
BIB
be partitioned into 2 x 2 submatrices
in the manner,
A=
Q
- =
R
B
s
E
M
'G
E
T
and 3.4, respectively.
H
T
The elements of the submatrices of A and
G
D
= are
presented in Tables 3.3
Complete details of the derivation of these
variances and covariances of the elements of the vector s are relegated
to Appendix 9.1.
It is clear that the variance-covariance matrix of s depends not
only on the parameter vector 1
T
=
[a ,a ,e,F] but also on the threel 2
gene, four-gene and two-gene-pair probability functions for one locus
introduced by Cockerham (1971) and on the two locus descent measures
discussed in Weir, Avery and Hill (1980) and Weir and Hill (1980).
Note, however, that the assumption of linkage equilibrium in the initial
population has simplified the expression, in terms of descent measures,
of those elements of the matrix which involve two loci.
If this
assumption is false, the more general formulation of Cockerham and
Weir (1973) is required.
e
e
e
Table 3.3
Th~
e1ement6 of the suLmatrices of h.
Expression in terms of descent measures t
Expression in terms of frequencies of gene combinations
Elements
AlA
(A ,2
ill
(A)r
B) -2P ( P;Til-PAPi
ill
B) -2P ( PAl'
ill -PBPAA) +4PAP B( p AlB
PA ,Pi
•• -PAPB )
A
B
I
AIA
A
3)
K(I,I)
Pi\1A- Pi;) - 4 PA\P;;:-r- 2P AP;+PA
K(l,2) = K(2,1)
PAIB -
K(2,2)
P B III
L(l, I)
14
till. - (B\2
( ill
Pi) - 4 P ',P . - 2p
ll
B1
B
6
P (l-P ) + (3l1xy.zw - 66
XYZW A
XYZW
A
62)p~ (l-PA)2
2
[~-(1-8) ]P (l-P )P B(l-P )
A
A
B
2 2
2
6XYZWPB(1-Pa) + (3l1XY'ZW - 60XYZW - 6 )Pa(I-P B)
3\
P
B i+ PB)
I!-
L-P~+2pA
IA+pAIA _ 4 (p~+pA
p~)
- (
+l'A - 2P~)2 J
A'
A A
A ,.
A A
AlA
PA
A
A
A
[
4'1 (8+ 2Yj(y + 6j(f)
r
'
- YXYZ - 0K'LZ + 6XYZW~ PA(l-PA)
3
- Le+ 2yXY + 2'
I
°X'y -
4'
(Ax·y+ 2 Ax+y) +
'X. YZ + 2Lx+YZ
1
~.., ~
:!
- 4yXYZ - 6°j(yZ - 3l1xy.zw + 66XYZW +4" (F-2e) _ PA (I-PA)
L(I,2) = L(2,1)
1
4
[pA IB + pA IB + pAIB +pAIB _ 2
••
A·
'Il
AB
(pill+pill+pAI!+p~IB _ 2pill)
AI,
·IB
AB
AB
AlB
1
[ 4' ~
I
- liZ + ~ - 4'
(1+F-28)
2
~
JPA(l-PA)PB(I-P B)
_ (p + p A _ 2pl)(p + pB - 2P!)J
A
A
A
B
B
B
L(2,2)
lr!
BIB
BIB_
4 L.P B+ 2P a • +P a a 4
(!U!
BIL !U!)_(
B. !)2
P'\B+Pa B PBIB
Pa+P B 2P B
J
1
[ 4" (8+ 2Yj(y + 6ie'¥')
-
1
Yxyz - °1(yz + °XYZ\~ ~ P B(l-P a )
3
1
- [ 8+ 2Yj{y +2 6XY - 7; <tx·,y+ 2!lx"+y) +
tx·, YZ + 24;+\"2
I
2- 2
2
- 4YXYZ-66j{YZ-3l1x¥.zw+6oxyzw+4" <F-20 _ PB(l-P S )
N
0\
:~"la
3.3 (Continu2d)
Expression in terms of frequencies of gene combinations
Eleme:r.ts
M(l,l)
p~ _ 2pA IA + pAIA _(
A
A'
;.. A
,PA
_ pA)
Expression in terms of descent measures
T
2
(8 - 2yXY + ~y)PA (l-PA) + (-48 + 8yXY + tJx.y
A
2?
2
+ 2tJx+y - 60j(y - F
(l-PA)
)PA
M(l,2) • M(2,l)
pAIB_(pAIB+pAIB)+pAIB_(
_pA)(
_pB)
••
A·
• B
A B
PA
A PB
B
M(2,2)
p!!._?pB!B+pB\B_( _pB\
B - • B
BiB IPB
B)
[~-
(l-F)2)PA(l-PA)PB(l-P )
B
2
(8 - 2yXY + OX\i)P B(l-P B) + (-48 + 8yXY + tx.y + 2tx+y - 6cj(y
_ F2)p2(1_p )2
B
Q(1,1)
t [(YXYZ + °XYl - 2oXYZW )PA(l-PA)
lli
AIA 2P A iA ( 2 PA)(
A 2P A)
2"1 rLPAI'+P
A ,A- y + 2P A - A PA +PA- A
_ 2
x,
(p~+ pA IA - 2plli)J
PA A
A'
B
+ (~'YZ + 2~+yZ - 4yXYl - 6o yZ - 66XY'ZW
AI'
2
2
21
+ 126xyZW - F8+ 28 )PA (l-PA) _
Q(l,2)
~ ili AlBB 2"1 LPA"j-+PA
$
(
2 A)(PB +PBB- 2PilB)
2PAlB + ,2PA - PA
t
[(1 +F - 28) (1 - 8) +
LIZ - 2l1;J
t
[(1 +F - 28)(1 - 8) +
lI~ -
PA(l-PA)P B(l-P B)
I
_ 2 (A IB+ pA B - 2pill)J
PA\P..
. B
'IB
Q(2,1)
2"1
[P~IB+PA
i l l A\Bil- 2Pill + (2P
2 - Pii
B)( P +PAA 2PA
A)
B
A
A1B
_ 2
2l1;J PA(l-PA)P B(l-P B)
I
(p A B + pA I' B - 2pill)J
PB ••
A·
AI'
N
'!
e
-
e
e
e
e
Table 3.3 (Continued)
Elerrlents
Q (2 .2)
Expression
BIB BIBil2"1 rLPilf+P
B
- 2
Expression in terms of descent measures t
in terms of frequencies of gene combinations
B
BIB (2P
2 P BX
B)
2P BfB+
B - il PB +P B -2Pjj
!!.l.!!)]
B
(p!!+p IB_ 2P
PB\ B
B'
B I·
~ [(Yxyz + 6j(yZ - 26Xyzw)PB(I-PB)
+
(tlj(.YZ + 2tlj(+yZ - 4yXYz - 66j(yZ - 6llxY'zW
+
R(I,I)
pili_ pA\~+ (2p 2 _ p:i)(p _ pA) _ 2p (p~_ ~IA)
AI·
A A
A
A
A
A
A A A'
2 2
2~
l26xYZW-F6+26 )PB(l-P B) J
(Yxyz - 6j(YZ)PA(I-PA)
+ [ -4YXyz +66j(yz·
R(I,2)
p A 1B _ p~\B + (2 2 _ p~)(
_ p B) _ 2 (pAIB _ pAIB)
A B
PA
A PB
B
PA "
. B
[(I-F) (1-6) • llZ]P (I-PA)PB(l-P )
A
B
R(2.I)
pill_ pA I!!+
(2lp!!)'(p _ pA) _ 2p (pA IB _ pAIB)
B
B \ A
A
B"
A'
[(I-FHI-e) - ll:]P (I-PA)PB(I-P )
B
A
R(2,2)
B!B BIB r 2 B\(
B)
(B
Pilf- P Blil+\2 P B - Pil) PB-PB -2P B PB-PB'
AT'"
·IB
AlB
BIB)
22
(tlj('YZ+2tlj(+yZ) + Fe]PA(I-P A)
(YXYZ - 6j(YZ)PB(I-P B)
,
s
,
+ [-4yXYZ + 6~Z - (6j(.yZ + 2llj(+yZ) +F6]p (l-P B)-
N
00
T..ble 3.3 (Cor,cinued)
Elements
5{1,1)
Expression in t~rms of descent measures
Expression in terms of frequencies of gene combinations
1 I" A AI'A
(~ AlA) - ( PA-PA)(P +PA
2PA)J
2LPA"-PAA-2,PAI,-PAA"
A A A- A"
1 {[8 - 6'""- 2{'Y
2
XY
XYZ
- 6"
XYZ
t
»)p (lop )
A
A
+ [-48- (t>x.y+2t>x+y)+66j(f+8'YXyz+2{~.yz+2~·!+yz)
2
2
- 126j(YZ+F - 2F8)PA{1-PA)
5 (l, 2)
1 !pA IB +p AIB _ (pAIB +pA IB) _2 (pill_ p~IB)
2;..."
A·
·B
- (PB - P:)(PA+p~ 5 (2, 1)
AB
AI·
1 {[2lI* - At:
2
q-Z
AB
• B
A'
{1-F)(l+F-28»)p {lop )p (lop ) '.
A
AB
B,
2P~)J
1 !pAIB +pAIB _ (P"-IB +pA IB) _ 2 (pill. pAl.!!)
2 '- . ,
-
2)
J
A B
.\B
1\'[211* - A"! - {1-F)(l+F-28»)p {lop )p (l-p ) 1.
2
A B
4
A
-Z
A
B
BJ
( - PAY
A'(PB +P B - 2PilB)J
- \P
B
A
5{2,2)
t {[8- 6X¥- 2 ('YXYZ - 6j(yz»)P (l-P
Bt.!) - ( PB-PBB)(PB+PB--PB'
B? B)J
2"1 [B
Pil-PBBIBB- 2 (.!!l!
PBI' -P BIB
B
+
[-48 -
B)
(Aj(.y+ 2Aj(+y) + 66XY + 8yXYZ +2(Aj(.YZ + 2.:.x·+yZ)
_ 126"
XYZ
+F 2 .2F8)l(l-p )21
B
B)
t No initial linkage disequilibrium bas been assumed.
N
\0
e
e
e
e
e
Table 3.4
The elements of the submatrices of :.
Expression in terms of frequencies of gene combinations
Elements
B(l,l)
e
2
rp¥+pAliL2p~_2
(p!+pA IA_ 2p!l!L2( +pA- 2p!)1
_ A I'
A A
AlA
PA\ A
A·
A \. J'I'A PA
A
A..J
Expression in terms of descent measures
2 [(YXYZ + 6XYZ - 26xYlw)PA(l-PA)
+ (e-4YXYZ +IIj(.yz+ 2 Ax+yZ - 66XYZ -
B(1,2) = B(2,l)
2
r_AfB
pA B+ pA/B _ 2P~lL
(pAB + pA/B _ 2pili)
AlB
AlB PA ' B
• B
'\B
_
B(2,2)
t
6llxy.zw+126xyzw)P~(l-PA)2J
* *
2(r3 - 2"'5 + llj)PA (l-PA)PB(l-P B)
I
(pAB + pAl B _ 2pili) +
(pAB + pA/B _ 2pA B)J
PB A'
A'
AI·
PAP B ••
••
••
2 lplli+pBl.!!._ 2pill- 2 (p!!+pBIB _ 2pill) + 2( +pB - 2pJ!)]
,. BI·
BIB
BIB
PB, B
B'
BI'
PB·P B
B
B
r
2L (YXYZ + 6XYZ - 26XYlW)PB(l-PB)
...
..,
+ (e-4YXYZ +!lX"YZ+ 2 l1j(+yZ -66XYZ - 6llxy-zW+ 126XYZ,,:)pii(l-P3)'
C (1,1)
18 :.I PA + 7pAA _ 2
1'*;'1
(5p!+ 14pA IA +pA IA)+ 32 (p!l!+ pA ~ A
A'
A A
. A,.
A IA
A IA ~
~
{[l + 7F - 2 (50 + 14yX'y + 6 ' ) + 32 (YXYl + 6 'yZ - 6XYZ ;'-) Jp (l-r ~)
XV
X
A
+ [-2 -28F+8(50+14YXy +6XY)
- 2(6X-.y+2"X'+y'
25 XY )
+ 32 (-4yXYl + tx.yZ + 2'X+yZ - 60XYZ
2
2)
3llxy-zw+ 66XYlw )]PA (l-P A) _~
C(1,2) = C(2,l)
l f pAB + pA/B + 2
8 L ••
••
(pAB + pAB) + 2pAB _ 2 (pA IB +pA IB +pA IB + pA IB)
,A·B
AB
-.
A'
. B A B
* - 16(f - t:.* ) + 16(f - 2t:.* + t:.j)
*] P (l-PA)PB(l-P )
il [2(@1 - Ai)
2
4
A
3
5
B
_ 4 [pAB+ p!!!+pA/B+ pA/B+ 2 (pA J!+p! B)~I
A·
'B
A'
• B
AlB
AlB.
+ 8(pti.!!+pill+ pA l.!!.+p!J B)+ 16 (pA B+pA / B + 2pili)}
AI·
'IB
AIS
AlB AlB
Ali Ali
w
o
~
Tabl" 3.4 (Continu"d)
C(2,2)
Expression in ter~ of descent measures
Expression in terms of frequencies of gene combinations
E1e:nents
!.I·
8 LP B
t
i {[1 + 7F - 2 (58 + 14yXY + 0Xy) + 32 (YXYZ + 0XYZ - 0XYZW) JPB (l-PS)
+ 7pB_ 2 (5P~+14pBIB+pBIB)+32 (p1ill.+ p B l.!.p!ll)J
B·
B BBl·
BIB
BIB
B
\ B
+ [-2 - 28F + 8 (58 + 14yiY + 0Xy) - 2 ('1c. yo + 2c.x+¥ - 2 "KY)
+ 32 (-4yXYZ +
tx. YZ + 2llx+YZ - 60XYZ
2
- 3txY'ZW+ 6 oXYZW) ] PB(1-P B)
0(1,1)
2"1
AIA. + PA
AIA)J
2"1 ii.. PA - FAA - 2 ( PAA - 2PA
A
2,
J
(1 - F - 28+4yiY - 2 Oxy) PA(1-P A)
2
+ (-1 + 2F +48 - 8y.. - ",... - 2/1,... + 66 ....)p (l-p )2
XY JI.y
JI+Y
XY A
A
J}
0\1,2) = 0(2,1)
!. fr-AB + I: A/ B _ 2 (pAB +pAB)+2pAB _ 2 [pAIB _ (pAIB+ pA\B)+0I B
0(2,2)
2"1
2
I
.,
.,
A·' B
AB
"
A'
. B
A B.
(61 - ~)PA (l-P )P (l-p
A
B
s)
1
-2 (1 - F .20+4y.. - 26....)p (l-p )
XY
XY B
B
[ PB-PB-2Pi-2PB
B
(3 BIB.+PBBIB)]
B
?
+ (-1+2F+48-8yXY - 6y.'.y- 2 "'x+y+60x.y)p;(1-PB)
E(l,l)
1. fl'~+ 3pA IA
2 ,A
A'
- PA[P A +
_ 2 [3
t {(8+ 3Yj[y - 6yXYZ - 66j[yZ + 8
(pAAf"IA + pA~)
- 4pill]
AIA
A IA
3P~ - 2 (3(pi+p~I~) - 4~)J}
t>XYlW)P A(l-P
2
A)
+ 6[-8 - 2YiY +4\yZ - ("K.yZ + 2"K+yz)
2
21
+ 6~yz +4fxy.ZW· 86XYZWJPA (l-PA) ;
w
.....
e
e
e
e
e
e
Table 3.4 (Continued)
E (I, 2)
!
2
{pAB
A'+
2P~
B+ pA/B _ 2
AlB
A·
(pill+p~IB) _ 4
AI·
_PAL.fpAB+
p A/B+ 2p AB _ 2
..
••
.B
E (2,1)
AB
(pA B+ pA/B _ 2Pll!!)
AlB
ATB ATB
f pAIB+ p AIB)_4
\..
• B
(p~+pA/B _ 2Pll!!\]}
•B
• B
[f2 - 114 - 2(f) +
ATB AlB
rpAB+pA/B+2pAB_2 (pA\B+ pAIB)_4
PBL ••
••
A·
..
A·
! I p!!.+ 3pBIB
2 LB
B·
(p~+pA/B_2pill)1}
A·
A·
t;) +411;]PA(l-PA)PB(l-P B)
~I B)
fpAB+2pA ~+pA/B _ 2 (pill+pAI!!.) -4 (pA B+pA/B _ 2Pll!!)
2 l·B
AlB
• B
'\B
A B
AlB
!
_
E(2,2)
Expression in terms of descent measures t
Expression in terms or frequencies of gene combinations
Elements
[f2 -
liZ - 2(f3 + ll;) +411;]PA(l-PA)PB(l-P B)
AI'-
_ 2 [3 (p!U!!.+pBI!!.) - 4P!U!!.]
B I'
B B
BIB
.>- 4Pill)])
BI·
J
r
B
(
B
BIB
- Pa... P B+ 3P B- 2 3(PB+ P B
t {(6+ 3yiY - 6yXYZ - 66j(YZ +S6XYZ\~)PB(l-PB)
+ 6[-6 - 2Yj(y +4yXYZ - (llj(.yZ + 2 "X'+YZ)
2
+ 66j(YZ +4,\Y.ZW - SbXYZW]PB(l-PB)
G(l,l)
p~_pAIA_2
(pill-pAI~)[
_pA_ 2 (p~_pAIA)J
A
A'
A ,.
AA
PA PA
A
A
A'
21
f
[6 - YiiY - 2 (YxyZ - 0iYZ)]PA (l-PA)
2
.}
+ [-66+4Yj(y+8yxyZ + 2 (llj(.YZ+2llx+YZ) -12oiYZ ]PAIl-p)-
G(l,2)
Pt.L2P~ B+~_2 (pll!!_p~IB)_
G (2,1)
pAR _ 2pA !!.+ AlB _ 2(.pAIB _ pAl!!.) _ fpAB+pA/B _ 2p-AB _ 2(pAIB _pAIB)]
.B
A B p. B
'.
A B
PBL.· .
••
A'
••
A'
G(2,2)
B
BIB -2\P
(~B
p--p
-p BIB)
- -P [ P -p B -2 (B
p--p BIB)]
B
B'
B'
BB
B B
B
B
B·
A'
AlB
I
A'
,A['
A B
[pAB+pA/B_2pAB_2(pAIB_pA!B)]
PA .•
••
·B
••
. B
18
-2(f2 - llZ)PA (l-PA)PB(l-P B)
-Z(fZ - llZ)PA(l-PA)PB(l-P B)
[6 - Y~ - Z(YxyZ - O~Z»)PB(l-PB)
,
2
+ [-66 + 4yiY + SyXyz + Z(llj(.yZ + Zllj(+yZ) - l2~X-'yZJPB(l-PB)
W
!'oJ
Table 3.4 (Continued)
Elements
H(l,l)
Expression in terms of frequencies of gene combinations
A
A
AlA.+ 2PAA+
AlA 8
41 rLPA-PA-6PA"+4PA
(lli
AlA)]
PA!.-PAA"
Expression in terms of descent measures
~{ (l-F - 68 + 4yj(y + 2oXY + 8yxyZ -
t
80t'!Z)PA(l-P A)
+ [-6+4F+3 2 8- 16yi(y + 2 (llj(.y+ 2"x+y) -126XY
- 8 [4yxyZ + (tx'YZ+2tx+yZ) -66XYZ ]]
H(1.2)
!.
4
{p AB + pA/B + 2pAB _ 2(pAB +pAB) _ 2 CpA IB + pA IB _
•.
_ 4
H(2. I)
!.
4
••
A'
.B
(pg+pA/B_2P~
A'
A·
AB
B)+8
AlB
"
A·
(~IB +pA IB)J
• B
A B
(pill_p~IB)}
AI·
P~(l-PA)2}
-t [81 - t; -4(f2 - (")]PA(l-PA)PB(l-PB)
AB
IB
{pAB + pA/B + ZpAB _ Z(pAB + pAB) _ 2 CpA IB +pA IB _ (pA + pA IB)J
..
.•
'B
A'
AB
.., B
A'
A B
-t
[81 -
t; -4(f2 - "Z)]PA(l-PA)P B(l-P B)
_ 4(pAR + pAl?> _ ZpA ~) + 8 (pill_ pA I~)}
\·B
H(Z, Z)
!.
r
4 l.PB
.:3
AlB
·IB
A B
_pB_6P~+·pBIB+ZpBIB+8
(p!!.ill.-pBI~)J
B
B" B'
B BBl'
B B
~
{(l- F - 68+4YXy+2oXy+8yXYZ - 86j(YZ)P B(1-P B)
+ [-6+4F+328-16yoo
XY +Z("','
-X,yoo+Z1\.··
-X+y00) -lZ6'-'
XY
- 8 [4yxyZ + (llj{.yz + Ztx+YZ) - 60icYzJ
J
pi(l-PB) Z }
t No initial linkage disequilibrium has been assumed.
w
w
e
e
e
34
As an aside, it can be noted that an estimator of average
heterozygosity for the two loci is
This estimator has variance,
Var(H)
= var(r l ) +
= ;
Var(r ) + 2 Cov(r ,r )
2
l 2
[M(l,l) + M(2,2) +
+ ; [M(1,2) +
~
~
(D(l,l) + D(2,2»)]
D(1,2)]
This expression for the variance of H can be recovered from the general
formula given in Weir, Cockerham and Reynolds (1981).
3.1.2
The Asymptotic Distribution of
§
as a
-+
00
The vector s has expectation £(1) and a variance-covariance matrix
given by equation (3.1.9).
As the distribution of x.ok does not appear
-1.J
to have a closed form for the simplest of mating structures, namely
drift, an exact closed form for the distribution of s for any mating
system seems to be an unreachable goal.
The discovery of an asymptotic
distribution for s for a fixed number (a) of sampled populations but for
a large number of sampled individuals within each population (n
-+
00)
would be a useful contribution as most analyses of population structure
involve only a small number of populations but large samples within each
population.
Unfortunately, only gametes sampled from different popula-
tions are assumed to be independent.
population are related.
Gametes sampled from the same
This results in a finite number (a) of sequences
of (n-l)-dependent random variables, where n
-+
00, but central limit
35
theorems for such random variables do not appear to be available
(Hoeffding and Robbins, 1948, page 775).
The only remaining case is that of an asymptotic distribution for
as a
~
+
00
when n is fixed.
It is standard to show that as a
+
00
~
the vector
has an asymptotic distribution which is multivariate normal with mean
vector
~(Y)
and variance-covariance matrix,
v = lA + 1: +
a
an
1
T
an(n-l)
(3.1.10)
where A, _ and T are the matrices defined in section 3.1.1.
All the
ingredients for the proof of this result are contained in Cramer (1946,
Chapter 28) or Serfling (1980, section 2.2).
Nevertheless a proof
outline is included, for the sake of completeness, in Appendix 9.2.
3.1.3
The Estimation of y
The fitting of Q(Y) to
(Joreskog, 1978).
~
may be accomplished in several ways
Two methods are considered here.
Ordinary least
squares consists in minimizing with respect to y the square of the
Euclidean norm
Q = "
s
cr(y) 11
2
=
[~
cr(y)]
T
[~
cr(y)].
(3.1.11)
Weighted least squares consists in minimizing with respect to y the
norm,
~(r)]
T
W[~
~(r)],
(3.1.12)
where W is a symmetric matrix of weights that is in some sense close to
the inverse of the variance-covariance matrix of s.
inverse of the variance-covariance matrix of
~,
If W does equal the
then the minimization
of QW is more correctly described as generalized least squares.
36
Some extra notation and assumptions are introduced by way of the
verification of some regularity conditions:
(i) The elements of
~(r)
and all partial derivatives of the first
three orders with respect to the elements of
r
bounded in a neighborhood of the true value of
o~(y)
oyT
=
are continuous and
r
designated rOo
In fact,
loa i )
~OYj
ij
oa
=
o·l
=
e
0
at
0
0
e
0'2
0
0
-0'1
2" 0'1
-0'2
2"
l+F _
2
o2 a.
~
e
e
1
1
0
k!f.
2
1-F
0
0
-0'1
0
1-F
0
-0'2
0'2
1
is equal to a constant (-1, 0, 2"
or 1) ,
and
for all i and any j, k and
(ii) The matrix acr/ay
G
= (o'!Vo~ )
. oY/'
- oyTI
T
~
is of full column rank
.
sO
that the matrix,
37
92 + (l;F _ 9/+ (1_F)2
(26 - l;F) a1
0
2
92 + (17 - 9) + (1_1)2
0
( 2e- 1+P') a
2
1
(29 -
1;') a2
(5-F--9-1
3) C1'z
4
Z
41
3)
(5-F--6-1
4
2
4 a1
(29 - 1+P) a
2
2
2
2
2(a +a )
l
2
- 2'1
is positive definite and, of course, nonsingular.
Z 2
(al +aZ)
(~F-16-1)a
4
241
CS-
1
3\
4 , - -2 9 - 4-) a2
2
2
- 2'1 (al+a
Z)
5
'4
Z
2
(al +ClZ)
Also if W is positive
definite then
0<:7
0<:7
oy
oyT
~ W ----
is positive definite.
(iii) The parameter vector
r
is uniquely determined by
~(r).
This
can be verified by noting that the equations,
~(y) =
0'1
=
8 0'1
<:7 2
8 0'2
0'3
C;F -e) 0'1
0'4
(l+F
-2- -
(75
(I-F) 0'1
0'6
(I-F) 0'2
e)
0'2
with the overidentifying conditions 01/02
1
°3 +-0
2 5
al
= °1 +
a2
= °2 + °4 +12 0 6
= 03/04 = 05/06
have solutions
38
F
=1
-
Now,
~
ely
T
= -Z[s
-
and this equals the null vector if and only if
Z1 6 +
W1(~
-6) + r 1 (I-F)
(3.1.13)
6Z + (l+F _ 6) Z +(l_F)Z
Z
l+F
zZ6 + wZ(--Z-- -
6)
+ r Z (l-F)
(3.1.14)
6Z + (l+F _ 6)Z + (l_F)Z
Z
6 - (l+F - '6) =
(3.1.15)
Z
(l+F _ 8) _ 2(1-F)
Z
=
(3.1.16)
Substituting equations (3.1.13) and (3.1.14) into equations (3.1.15)
and (3.1.16) gives two simultaneous cubic equations in 6 and (1+F)/2 - 6:
~T K :: = 0
1
~T K :: = 0
Z
where
39
~T = [e3 ,
3
(l;F _ e) ,
e
r
T
=
(17 - e),
r~zl+z2'
2
2
1=
K
e2 C7 -e),
W
e
,
l+F - e
2
'
2
2
eC;F - e) , e , C;F _ e)2,
II
.J
l
2
222
r
r
w
w r +w 2r 2 J
z w
1 +W2 ' r 1 +r 2 , zl l+ 2 2' Zl l +z2 2' 1 1
-4
0
4
5
6
-10
0
4
-4
-5
10
-6
-9
5
4
5
26
-30
-5
9
-4
-5
30
-26
8
0
-8
-8 -22
26
0
-8
8
8 -26
22
8
-8
0
0 -52
52
-4
0
4
4
24
-24
0
4
-4
-4
24
-24
0
0
0
0
-8
8
and
K
2
=
4
0
-4
-5
-6
10
0
0
0
0
0
0
5
-5
0
0
-20
20
0
-4
4
5
-10
6
-4
0
4
8
16
-26
0
4
-4
0
0
-6
0
8
-8
0
20
-32
II
0
0
0
-4
-8
24
I
0
-4
4
0
0
16
Lo
0
0
0
0
-8
I
i
i
II
II
I
..l
e
40
These equations may be solved for 8 and (l+F) /Z -8 by iteration using as
starting values
and
!±E. e
Z
respectively.
These solutions may then be substituted into equations
(3.1.13) and (3.1.14) to obtain solutions for 01 and oZ.
The numerical
solutions so obtained give rise to the ordinary 'least squares estimator
....
of y denoted by. y.
2
~.2
oyayr
Now the matrix,
2 e;F - e)2
0
e +
+ (1_F)2
- (2"1 w1 -r 1 )
- (%1-"'1)
+ (1_11')2
- (%2-"'2)
3\
11
(2.F-e\2
2/
(-F-6--)0'
,2
2
1
1
- (2"w
1 -r 1 )
1
- (2"
1
- (2" "'2-r 2)
- (%2-"'2)
2
2
2(0'1+0'2)
(4e- 1-F)0'2
- (%1-"'1)
3\ a
- F-e--)
222
(46-1-F)0'2
e + (17 - 6)
(4e- 1 - F )0'1
e
2
2
0
'S
(S- F - 6 -3- )' a
,2
2
1
(46-1-F)0'1
O'
2
1
2
1
- 2"
2
- Z (0'1 +0'2)
2
2
(0'1 +0'2)
522
;; (0'1 + 0'2)
"'2- r 2)
is required to be positive definite at the stationary point y for a
~
minimum to be guaranteed, but because a closed form for
r
cannot be
obtained, this condition cannot be checked explicitly here.
noted, however, that since
~
It can be
converges in probability to cr(Y) as a
~ ~ that aZQ/ararT converges in probability to 2G and this matrix is
41
positive definite.
Of course in any application the definiteness of
a2Q/ararT evaluated at the stationary point
r
can be checked by calcu-
lation of the eigenvalues.
If the matrix of weights W is not a function of y then the equation,
-2[~
may be solved numerically using the following as starting values for
aI' a ,
2
e
and F, respectively:
and
The obtained weighted least squares estimator will be denoted by I
Following Browne (1974) several asymptotic properties (a
A
+ ~)
w.
of
A
the estimators rand Xw are
A
A
(i) y and Xw are consistent estimators of y.
(ii) y has an asymptotic distribution which is multivariate normal
with mean Y and variance-covariance matrix,
-1 Cl£(X)
G
ClY
V
-1
G
•
(iii) Xw where W is a fixed positive definite matrix has an
asymptotic distribution which is multivariate normal with mean
variance-covariance matrix,
y
and
42
ocr
oe -1 ocr
Uw= (# -T)
-
oY
ocr
ocr
#Vw
~if
oY -
ocr -1
-T)
oY
-1
(iv) !W' where W is a consistent estimator of V
,has an
asymptotic distribution which is multivariate normal with mean rand
variance-covariance matrix,
U
Moreover this estimator is the best weighted least squares estimator in
the sense that
uW - U
is positive semidefinite.
-1
(v) If W is a consistent estimator of V
then the asymptotic
distribution of
is the central chi-square with 2 degrees of freedom.
Proofs of these properties, which rely on the regularity conditions mentioned earlier, mimic those in Browne (1974) and need not be
presented here.
3.1.4
Large Sample Variances of Some Estimators of
Consider the estimator
e =
e
1
(zl+ z 2)/[zl+ z 2+w1+w 2+ I(r +r 2 )],
1
presented in equation (3.1.6), which is also used as a starting value for
both ordinary and weighted least squares estimation.
Now
e
is a scalar
function of the vector random variable s which is asymptotically normal
with mean
~(r)
and variance-covariance matrix V, so by an application
43
of the theorem concerning functions of asymptotically normal vectors
[see, for example, Serfling (1980, page 122)J,
-
e is
asymptotically
normal with mean 8 and variance
(0'1 + 0'2)
-2
[1-8,1-8, -8, -8, -
2"1 8,- -21 8] V - 1-8
(3.1.17)
1-8
-8
-8
1
--8
2
1
--8
z
Similarly the estimator
presented in equation (3.1.5) is asymptotically normal with mean
variance
ETVE
e
where,
-8, - -8 , --8J
, 1-8
-a - ,-8
-,a
a
2a
2a
2
2
l
l
The ordinary least squares estimator
Z
e=
T ...
~3
Y has asymptotic
variance,
T
-1 3~
~3 G
3~
~
dY 3y
--rc-1
e"
-.J
A
while the weighted least squares estimator 8W =
-1
consistent estimator of V
30
dO
rW'
A
, has asymptotic variance,
= ~3T (2,-1
- )
;,v
--r
31
T
~3
-1
~3
when W is a
and
44
Of course
8W'
when W is a consistent estimator of v-I, has smaller
asymptotic variance than any other weighted least squares estimator and
e.
A direct comparison of
T
~3U~3
and the asymptotic variance of
e
does
not appear to be possible as the derivation of a closed form for V-I for
any mating system would appear to be intractable.
3.2
Many Loci Each with Two Codominant Alleles
The ex ten si on of this approach to many loci each with two codominant alleles does not require any new theory.
loci.
For the k
th
Suppose there are m
haplotype of the jth individual in the i th popula-
tion, the basic random vector of indicator variables analogous to
equation (3.1.1) is now m-dimensional:
~ijk
(3.2.l)
=
where
x .. k
~J!l.,
=1
if haplotype k in individual j
in population i carries
allele A at the lh locus,
2,
= o otherwise
.
The multivariate analysis of variance for x ..
-~J
as before (see Table 3.1).
k
has the same structure
The variance-covariance component matrices
are now m x m instead of 2 x 2 and have the form:
r:C ""
D
clm
D
c2m
(l-F)a
.'-
m
where
a
~
= P 2,(l-P i
Da'L~' = 1 6 6.'A
DbU ' =
-1
(F
1
2"
l2'
-
-
+ 1F - 2 1 e) 6.A A
2, ).'
and
,
D
C
u
-1
= (F
- IF)6.A ,\
():
~'
46
and where p t is the frequency .of allele At in the initial population and
t!.A
!
t" is the linkage disequilibrium between allele A t and allele A t .. in
the initial population.
The analog of equation (3.1.7) is now
w ,
(3.2.2)
m
so that
= [ 6Ct. l ,6Ct.
(l+F)
(l+F 6~
6
2 ,···,6Ct.m, -2-- Ct. l ,···, -2-- ,Ct.m,
(3.2.3)
where
The estimation problem now consists in fitting the vector cr(y)
which depends on m+2 parameters to the observed vector s which is
comprised of 3m observations.
The variance-covariance matrix of s has
elements which are obvious analogs of those elements of the variancecovariance matrix given in section 3.1.1 and all other results concerning
the asymptotic distribution of
~,
the estimation of 1 and large sample
variances of estimators of 6 are readily adapted from sections 3.1.2,
3.1.3 and 3.1.4, respectively.
Note, however, that since the two-locus descent measures depend on
the amount of linkage between the two specified loci, the analog of a
term such as K(1,2) in the variance-covariance matrix of s is
where A .. is the linkage parameter for loci t and t".
tt
This means that
the variance-covariance matrix of s depends on a large number of
parameters.
For example, if all m(m-l)/2 pairs of loci have different
47
linkage parameters then the 3m x 3m variance-covariance matrix is
composed of nonlinear functions of 5m 2 - 4m + 12 parameters, that is,
12 one-locus descent measures
m a. Q,' s, and
lOm(m-l)/2 two-locus descent measures.
One way of reducing the number of parameters is to consider the m loci
to be a random sample from the genome in the sense that the given m
loci, each with a specified
0Q,'
have m(m-l)/2 linkage parameters which
are a random sample of size m(m-l)/2 from some distribution with mean ~
and a range from 0 to 1.
Taking expectations over all possible samples
of size m(m-l)/2 of the linkage parameters might allow the replacement
of the lOm(m-l)/2 two-locus descent measures by just 10 two-locus
measures so that, for example,
Such a device is presented in Weir, Avery and Hill (1980) and is
applicable to univariate analyses where loci enter as a classification
variable or label in the analysis of variance and one is arguing to
change the status of loci from that of a fixed effect to a random
effect (see also Cockerham, Weir, Reynolds, 1981).
In the present
situation, each locus is a measurement variable and is accorded a dimension in the measurement space.
Thus the present analysis is localized
to a fixed set of loci, and their pairwise linkage parameters, so that
in the sequel the dependence of the two-locus descent measures on the
amount of linkage is not suppressed in the notation.
48
3.3
A Special Case--Random Mating
Consider isolated finite monoecious populations, each of size N,
which mate completely at random, including self-fertilization.
In this
case the number of one-locus gene identity measures reduces to four
(Cockerham, 1971) in the manner:
e =F ,
Y
=
Yj{y
=
YXYZ '
~ = ~X.y = ~X+y = ~X·yZ
o = 0··..
Xy
= 0"
XYZ
= ~X+yZ
= ~XY·ZW '
= 0XYZW
The two-locus nonidentity measures reduce to three for each pair of
loci (Weir, Avery and Hill, 1980) in the manner:
s
= 8
r
= f
~*
1
= 8
2
= f
f
l "" 2
3
= 1:::.* = ~* =
2
1
~*
3
=
!J.*4 = ~*5
.
These three two-locus measures will hereafter be subscripted in the
manner S £rf u: and
~*u,~'
to indicate the pair of loci being referred to.
Suppose m loci each with two codominant alleles are scored so that
the expected value of the observed vector of variance components is
=
z
m
49
=
1
- (1-8)0'
2
-
(1-8)~
1
-2 (1-8)0'1
1
-2 (1-8)0'm
(1-8)0'm
where
....
r T = [a 1 ,·.·,am,6] =
T
(~,6].
The component matrices of the
variance-covariance matrix of s given in equation (3.1.9) have the form
o
m 2m
2mo2m
o
2m m
o
2m m
o=
rt
(1-e)2[diag(~* ~)J
I
i
T
1 -
= 21
o
2m m
A
mm
- mm
A
- mAm
A
m m
o
m 2m
o
2m m
o
mm
50
where
=
mAm
(e-2y+o)diag(~)
+
(l-6e+8y+3A-6o)diag(~*~)
+ [8· U .--2f U .-+A1.q,.. J u " * [~~
T
- diag(~*~)J
and where the operation * denotes the Hadamard product of
diag(~)
two
matrices,
denotes the diagonal matrix with diagonal elements equal to
those of the vector
~,
and [ J.q,.q,.. denotes an m x m matrix with 1,.q,"
element given by the term enclosed in the brackets.
The matrices r and
may be partitioned into m x m submatrices in
~
the manner
!I. =
K
1R
2
-
=
1R
2
R
1M 1M
4
2
R
1M
M
B
1G
2
G
lD
H
H
D
1G
2
G
2
4
The submatrices are presented in Table 3.5.
As before, s has an asymptotic distribution (a
multivariate normal with mean
v
=
l1\
a
+
1: +
1
T
an
an(n-1)
~(r)
~
00)
which is
and variance-covariance matrix
51
Table 3.5
The submatrices of A and _ for the special case
of random mating.
Sublllatrix
c1iag(~)
Coefficients of
T
~
di.g(~~)
*
-d1ag(~~)
K
6
2
(3A-66-e )
A
M
e-2-.,+6
2
( -4e+8-.,+3A-66-e )
AU' - (l-e) 2
R
'1'-6
.
2
(-4y-3lr+66+El )
-
B
2(y-6)
2 (9-'+y-3lr+66)
2(fu.' - ll~,)
D
2"1
(-1+6e-8y-3lr+66)
Su' - llU'
G
e-3y+26
2 (-3~-.,+311-66)
-2 (fu.' -
II
1
;; (1-7E1f-12y-66)
2"3
- 2"1
(1-3e+4y-26)
(-1+69-8y-3lr+66)
1 The coefficients in this column are the
whose Iladamard product with
T
~
u.
-
1
, - (l-e) 2
*
[A~'
- (1_9)2]
llU')
* )
(8u.' - 4fu.,+311U '
~.t' elements of • coefficient matrix
'iiag(~~)
is the component of a sublllatrix.
For exalllp 1e.
2
K • 6 (i1ag(~) + (3A-66-e ) c1iag(~~) +
so tbat the
[IlU' - (l-e) 2 J U' *
t. t' element of K for t" t' is simply
Now
amm
I
1
2
(1-8) I
m m
-Ct
Ct
T
[~ - diag(~?)J
52
o~ o~
--a
(2.4 8 -
or 0<'/
9
5'
(J;
8 - ;;)
T
9
T
- 0:
0:
4 -
2.)
0:
4-
0:
So that
(~CZ1 :yCZT)-l =
o
1
982 -108+5
0
(4
+ (98-5)2 r::xo:T\
I
50:T 0:
mm
- (98-5)
T
--)
50:
0:
0:
- (98-5) T
0:
T
50: 0:
Ordinary least squares estimators
1
2-
T
(z - -=w - r) a
a
g and 6 must
= (~
4
5 T
- -)a a
4 - -
=
4
- 106 + 5
96
2
satisfy the two equations:
0.3.1)
[1~
]
+ E + 6 (~ - 1
~-r)
.
0.3.2)
Substituting equation (3.3.2) into equation (3.3.1) gives a quadratic
equation in 6:
A
e is given by
So that
A
e
=
lOb - (5a-4c) ± 18ob 2 + (Sa-4c)2
2b - 2(Sa-4c)
TTl
where a = z z, b = ~ (~+::) and c =
1
(~::)
(3.3.3)
T 1
(~::).
To check which
solution in equation (3.3.3) provides the absolute minimum, the
definiteness of
53
oQ
T
al y
=
1
2
-2 (98 - 108+5) I
mm
1
(98-5)0' - 2 (z - - w - r)
-
TIT
(98-5)0' - 2(z - - w - r)
9
-
T
- ex
ex
should be checked and the residual sum of squares,
[~
-
-
2 -
-
2 -
2 -
-
A
~(1)]
-
T
[~
A
- 2(r)]
should be calculated for each solution.
For a symmetric matrix of weights W that does not depend on 1 the
rW
A
weighted least squares estimator
is of course a solution to the
equation,
-2 [~
2
T
for which the Hessian 3 Qw/3131
is positive definite.
into m x m submatrices in the manner
W
=
the Hessian can be written in the form
02 Q
W
T
o.{"y
~
=
C
mm
d
T
1- m
where
d
m-l
1f 1
--
Partitioning W
54
and
Utilizing this expression for the Hessian a weighted least squares
estimator can be computed by the Newton-Raphson method.
Possible
starting values for this iterative procedure include
and
or
~
The asymptotic variance of
'T'
e"
-m+1
rOq oq"
-1
1--;
oy oy
~ TJ
e
as a
+
~
is
-1
oq oq (oq oq \
-v --' e
oy_
~ T 'oy ~ T)
-m+1
oy
- QY
(1-6)9'
2
-5 6 9'
4
-5e~
55
=
T
(~ ~) -
2
at (1-8) 2 K - 28(1-8) R + 8~~LlJ
T {I r
ex
+
a~
2
1
[(l-8)2 B - 28(1-8)G + ~5 (17D+16H) J
1
1
( 3 \2
+ 2an(n-1) 1 - 5" 8) A) ~
(3.3.4)
For large sized samples within each population, that is n about the same
order of magnitude as a, equation (3.3.4) becomes,
56
and this expression is equal to
L!I~, - :1_e>2]}
m
= a,
when all the at's are equal (at
Vt).
For large population sizes,
N, the contrasts in one-locus measures may be replaced by polynomials
in
~
= e -t/2N ,
where t indexes the number of generations, in the manner
= ±-4>
5
- 4>3 + 4>4 _ ±-4>6
= _~2 +
5
4~3 _ 44>4 + 4>6 •
The arguments which permit the replacement of one-locus measures by
polynomials in 4> can be found in Chevalet, Gillois and Nassar (1977)
and Cockerham, Weir and Reynolds (1981).
For the weighted least squares estimator rW' with W a consistent
-1
A
estimator of V , the asymptotic variance of6 as a
w
(o~ -1 o~
-1
~
-e
-m+l ay
T)
-m+l
e
T
-
or
+
~
is of course
(3.3.6)
but again a closed form for this variance in terms of descent measures
and gene frequencies would appear to be intractable.
However this
asymptotic variance is certainly smaller than the asymptotic variance of
6.
An expression for the asymptotic variance of 6 using the formula
in Kendall and Stuart (1977, page 247) for the approximate variance of
a ratio of random variables, or the equivalent expression in equation
(3.1.18), is
57
Var(8)
=
T -2
q~)
T
{Var(!~)
TTl
- 2e Cov[! ~,! (~~+2!)]
2
T
1
-1
+ e Var[! (~~+ 2~)]} + o(a )
(lTa )-2 IT{l[(1_e)2 K - 2e(1-e)R + e~]
=
-
+
-
-!.
an
.
-
a
[(1_e)2 B
-
2e(1-e)G + e 2 (H+k )]
2
1
+ 2an ( n-l)A} 1- + o(a
-1
)
=
+ 2
+
~#~~ (rR,R,~-1I1R,~)aR,aR,J
2an(~-1)
m
[(e-2y+o) R,:laR, + (1-6e+8y+311-6o)
l
+ r r (enR,~-2rnR,~+1I~R,~)]}+ o(a- )
R,# R, ~ N
N
(3.3.7)
For large sized samples within each population, t hat is n about the same
order of magnitude as a, the variance of
a -l( r aR, )-2
R,
[
3 )r
(o-2ey+e
e is
approximately,
aR, + (311-6o+8ey-e 2-46 3)r aR,2
Q,
Q,
(3.3.8)
and this expression is equal to
3
2
3
l {<O-26 Y+8 ) + (311-6o+8ey-e -46 ) + r L
2
rna
m
Q,#Q, ~
[ 1I ~ n. ~
N~
2
-
m2
(
1-6) ]}
.
58
when at
= a-
for all t.
For equal initial gene frequency functions at at each locus and
large sized samples within each population the asymptotic variances of
A
e
and
e
are clearly the same, but for different at the relationship
between expressions (3.3.5) and (3.3.8) is unclear.
loci,
3
I: a.(,
I: a.(,
.(,
a~)2
(I:
.(,
.(,
>
(~ a.(,)
2
and
4
I: a.(,
2
I: a.(,
.(,
.(,
>
(~ c{/
2
(I: a.(,)
.(,
but for two loci
a
2
l
a
Z
Z
<
Z Z
2
l + a z)
(a
at a2
(a +(
l
2
)2
For more than two loci, it is possible for
to be either less than or greater than
* [ /).,U./
z
(1-8) ] a.(, a-t'
Certainly, for m
59
However, it is the first two terms (the terms with coefficients of
o-Z6y+e 3 and 36-6o+86y-e Z-4e 3 ) in expressions (3.3.5) and (3.3.8) that
are likely to dominate these asymptotic variances, in which case the
A
asymptotic variance of
e.
e
will be greater than the asymptotic variance of
That this is so, should not be too surprising as the ordinary least
squares estimator ignores the variance-covariance structure of s in the
fitting of
~(r) to~.
Whether the biases and variances of these
estimators differ substantially in small samples
(i.~.,
small numbers
of populations sampled) is another matter.
An approximate expression for the variance of
v
e
is given by
Z
+
3
1.. r(Z (y-o)+ZSy - 2f-f) (1..)
an
m
L
3 ) + 2 ~ '=" (fU'
2
+ (,2 (8-4y-3L\+6o) - 88y+68 +28
" "
!:
t
m-
cr
.(,
ll~')J
t~t'
+
I
1
<8-2;r+o) !:
Zan(n-1)..
m
t
+!: !:
.f.,~.f.,'
(1..)
+
cr.f.,
(eu,-2fU'+Ll~")l'
VI;
m
J
J
(1-6e+8y+311-6o)
+
0
(a
-1
0.3.9)
)
v
If n is of the same order of magnitude as a, the variance of 8 is
approximately,
(3.3.10)
~o
Comparing expression (3.3.10) with the corresponding expression (3.3.8)
-
for a it can be seen that
but that,
- 2 can only be
while the relationship between m-2 and atat~(
~ at)
determined when
at,at~
and
~
at are known.
However each of the
expressions (3.3.8) and (3.3.10) are likely to be dominated by their
first terms, as time increases, in which case the asymptotic variance of
-
v
a will be smaller than the asymptotic variance of a.
.,
variances of a, a,
"
e
Asymptotic
A
and aware compared for the special cases of a
single locus and two loci in the next two sections.
As an aside, an interesting special case alluded to earlier is
that of large population size, N, and long time, t.
In this case the
one-locus measures may be replaced by polynomials in
~ = e- t / 2N .
More-
over since the contrasts in the two-locus nonidentity measures which
appear in the variance-covariance matrix of
~
are likely to be close
to zero for all values of the linkage parameter A (examples are given in
Table 3.6), the variance-covariance matrix of s may be approximated by
a matrix which is a function of the parameter vector,
1.*
61
Table 3.6
Values of contrasts in two locus nonidentity measures for a
monoecious population size N = 1000,with linkage parameter
A, at various times (t) •
t
6*-(1-8) 2
r-6*
8-2r+6*
8-4r+36*
<10- 5
<10- 5
.00033
.00033
50
<10- 5
<10- 5
.00032
.00032
100
<10- 5
<10- 5
.00030
.00030
500
<10- 5
<10- 5
.00020
.00020
1000
<10- 5
<10- 5
.00012
.00012
10
<10- 5
.0029
.0029
50
<10- 5
<10- 5
.00003
.0044
.0043
100
<10- 5
.00004
.0042
.0041
500
.00003
.00003
.0028
.0028
1000
.00003
.00002
.0017
.0017
10
0
0.9
This means that the estimation problem reduces to that of estimating
and consequently 8 = l-<P by fitting
~
to
~(r*)
~
where s has an approxi-
mate asymptotic variance-covariance matrix given by V(r*), the variancecovariance matrix V with all one-locus descent measures replaced by
polynomials in
~
and all off-diagonal elements in the component sub-
matrices of V, given in Table 3.5, set equal to zero.
A consistent
estimator of V(r*) is easily obtained by replacing r* by, say,
in V(y*) and an iterative weighted least squares procedure based on
equation (3.1.12) but with reestimation of V(r*) at each step may be
used to estimate r*.
62
3.3.1
One Locus
For the case of one locus,
s
T
=
[z,w,r]
and
T
1
[aa, I(l-e)a, (l-e)a]
~ (r) •
where
rT
[a,a]
=
e
Estimators of
(i)
~
a
(ii)
=
a~
v
a
include
1
= z/ (z+w +r-), presented in Cockerham (1969),
given by equation (3.3.3) with a
1
= z2,
b
=
1
z(zw+r) and
2
c = (rr) ,
~
(iii)
aW'
a weighted least squares estimator where the matrix of
weights W is a consistent estimator of V-I.
Now the asymptotic variance of
Var(S)
a
as a
+
~
is simply
=
+
1
rCS - 2;CH3) + (1-68+8y+36-60) Jl
2an(n-l):...
Q'
+ o(a -1 )
(3.3.11)
~
The asymptotic variance of
e
as a
+ ~
is
63
A
Var (8)
3
r<o-ZM8 ) + (3A-6o+88y-8z-483 )l
l..
0'
_
Z
r 2 (y-o) +Z8 _ 38 + 8 2 (-7§-36y+180)'l
1 (t.
y
2
50
I
+-~
an l
0'
= 1a
+ [Z(8-4y-3A+60) -88y+68Z +82 (9-48+9(8y+3A-60)\l}
25
J_
+
+
f el r<8-2y±o)
+
-- 0'
(1 2an(n-1)
o(a
-1
(1-68+8y+3A-6o)
L
l
-J
)
(3.3.lZ)
A
The difference between Var(6) and Var(S) is likely to be miniscule as
Var (8) - Var (8)
=
38
[ 3 (n 25an(n-1)
+
,;,
t)
8 - 5 I8-~Y+O + (1-68+8y+3A-6o) ]
0(a- 1 )
3rp(1-sp) [3
25an(n-1)
(n _1)(1
_rn 1- 5 l
2
'!" /
..J
p5
[(1_ ) + cp5
50'
. J
l
where the approximate expression in '"~ = e -t/ZN is for large sized
populations and long time, t.
A
The asymptotic variance of 6
w as
a
~
= is
given by expression
(3.3.6) with m set equal to 1, namely,
..-
Var (Sw)
=
Z
(S1 - 2S 2+ S3)8 +Z(Sz -S3)8+S 3
(S1 S3 - S;) O'Z
where
0.3.13)
64
and where det(V) denotes the determinant of V and v .. a typical element
~J
of V.
Some numerical comparisons of asymptotic standard deviations of
-e and
6W'
using equations (3.3.11) and (3.3.13), are presented in Table
3.7 for subdivided monoecious populations of size 100.
It is apparent
from Table 3.7 that there is little difference in the asymptotic
standard deviations of the two estimators except in the cases where the
sample size within each population (n) is much less than the number of
populations sampled (a) and this case is rarely found in practice.
3.3.2
Two Loci
v
For the two-locus case, asymptotic variances of the estimators e,
-e, e and
6W may
be calculated by setting m = 2 in equations (3.3.9),
(3.3.7), (3.3.4) and (3.3.6), respectively.
In Tables 3.8 and 3.9
asymptotic standard deviations of the estimators have been calculated
using these equations for samples of 30 replicate monoecious populations
of size 100.
The computation of these asymptotic standard deviations
for various generations, t, utilized the closed form solutions for the
one-locus descent measures given in Cockerham, Weir and Reynolds (1981)
and the transition equations for the two-locus nonidentity descent
measures given in Weir, Avery and Hill (1980).
As expected, for equal initial gene frequencies at both loci, the
asymptotic standard deviation of
deviation of
e.
e.
is equal to the asymptotic standard
For equal initial gene frequencies and n
asymptotic standard deviation of
course
v
e
e
~
is very close to that of
In all cases the asymptotic standard deviation of
a, the
-e and
8w is
than or equal to the asymptotic standard deviations of the other
of
less
65
Table 3.7
-
A
AsymptGtie standard deviations (sd) of 6 and 6W
for subdivided monoecjous populations of size
N = 100 at various generation times (t) •
or- .1
or - .24
t
e
a
n
sd (Ei)
20
.09539
30
10
.03926
.03910
.03379
.03367
30
30
.02972
.02971
.02610
.02609
30
50
.02789
.02789
.02463
.02463
50
:0
.03041
.03029
.02617
.02608
50
30
.02302
.02301
.02022
.02021
50
50
.02160
.02160
.01908
.01908
30
10
.07111
.07102
.05454
.05448
30
30
.06222
.06221
.04853
.04852
30
50
.06051
.06051
.04738
.04738
50
10
.05508
.05501
.04225
.04220
50
30
.04819
.04819
.03759
.03759
50
50
.04687
.04687
.03670
.03670
30
10
.1051
.1051
.07252
.07249
30
30
.09804
.09804
.06827
.06827
30
50
.09668
.09668
.06746
.06746
50
10
.08144
.08140
.05618
.05615
50
30
.07594
.07594
.05288
.05288
50
50
.07489
.07489
.05226
.05226
30
10
.1231
.1231
.07823
.07821
30
30
.1188
.1188
.07569
.07569
30
50
.1180
.1180
.07522
.07522
50
10
.09534
.09531
.06059
.06058
50
30
.09203
.09203
.05863
.05863
50
50
.09140
.09140
.05826
.05826
50
. 100
ZOO
.22169
.39423
.63304
sd (6..,)
sd (6)
sd(6..,)
66
Table 3.8
V
A
standard deviations (sd) of e, e, e and
eW for 30 subdivided monoecious populations of size
N = 100 with linkage parameter A = 0.0.
~symptotic
t
e
n
0'1
0'2
sd(S)
sd(6)
sd (8)
sd (6w)
20
.09539
10
.09
.09
.02844
.02844
.02833
.02833
10
.09
.24
.02627
.02692
.02993
.02578
10
.24
.24
.02390
.02390
.02382
.02382
30
.09
.09
.02147
.02147
.02149
.02146
30
.09
.24
.02002
.02071
.02321
.01979
30
.24
.24
.01846
.01846
.01848
.01846
50
.09
.09
.02013
.02013
.02015
.02013
SO
.09
.24
.01882
.01953
.02190
.01863
SO
.24
.24
.01742
.01742
.01744
.01742
10
.09
.09
.05222
.05222
.05231
.05216
10
.09
.24
.04591
.04449
.04875
.04383
10
.24
.24
.03857
.03857
.03863
.03853
30
.09
.09
.04561
.04561
.04573
.04561
30
.09
.24
.04036
.03944
.04338
.03878
30
.24
.24
.03432
.03432
.03439
.03432
SO
.09
.09
.04434
.04434
.04442
.04434
50
.09
.24
.03930
.03847
.04232
.03781
SO
.24
.24
.03351
.03351
.03356
.03351
10
.09
.09
.07797
.07797
.07842
.07793
10
.09
.24
.06599
.06072
.06537
.06057
10
.24
.24
.05129
•OS 129
.05156
.05126
30
.09
.09
.07265
.07265
.07289
.07265
30
.09
.24
.06168
.05701
.06136
.05686
30
.24
.24
.04828
.04828
.04842
.04827
50
.09
.09
.07163
.07163
.07179
.07163
SO
.09
.24
.06086
.05631
.06057
.05615
SO
.24
.24
.04771
.04771
.04780
.04771
SO
100
.22169
.39423
67
Table 3.9
-
standard deviations (sd) of ~, 6, 6 and
6W for 30 subdivided monoecious populations of size
N = 100 with linkage parameter A= 1.0.
4s ymptotic
~
t
e
n
0'1
O'z
sd (8)
sd(~)
sd(6)
sd(Sw)
ZO
.09539
10
.09
.09
.OZ890
.OZ890
.OZ878
.OZ878
10
.09
.Z4
.OZ677
.OZ730
.0301Z
.02626
10
.Z4
.24
.02445
.02445
.02435
.02435
30
.09
.09
.02175
.OZ175
.OZ178
.OZ175
30
.09
.Z4
.0203Z
.OZ094
.OZ33Z
.02009
30
.Z4
.24
.01879
.01879
.01881
.01878
50
.09
.09
.02038
.OZ038
.02041
.OZ038
50
.09
.24
.01909
.01973
.02200
.01889
50
.24
.24
.01771
.01771
.01773
.01771
10
.09
.09
.05356
.05356
.05366
.05349
10
.09
.Z4
.0474Z
.04574
.04938
.04526
10
.Z4
.24
.04036
.04036
.04043
.04031
30
.09
.09
.04668
.04668
.04681
.04668
30
.09
.Z4
.04157
.04042
.04387
.03993
30
.24
.24
.03573
.03573
.03582
.03573
50
.09
.09
.04537
.04537
.04546
.04537
50
.09
.24
.04045
.03941
.04279
.03891
50
.24
.24
.03485
.03485
.03491
.03485
10
.09
.09
.08074
.08074
.081Z3
.08070
10
.09
.24
.06924
.06352
.06683
.06348
10
.24
.24
.05540
.05540
.05574
.05537
30
.09
.09
.07513
.07513
.07539
.07513
30
.09
.24
.06458
.05951
.06265
.05949
30
.24
.24
.05193
.05193
.05211
.05193
50
.09
.09
.07406
.07406
.07423
.07406
50
.09
.24
.06369
.05875
.06183
.05872
50
.24
.24
.05128
.05128
.05139
.05128
50
100
.22169
.39423
68
estimators.
In the preparation of Tables 3.8 and 3.9, calculations
were performed for values of t ranging from 10 to 200 in steps of 10
generations.
For the one case of unequal initial gene frequencies at
the two loci that was studied, the behavior of the asymptotic standard
..,
-
A
deviations of a, a and a exhibited the following pattern.
that is in the construction of Table 3.8, and for t
V
sd(a)
~
_
sd(a) ,
<
90 ,
..,
A
sd(a)
<
~
while for 100
_
sd(a)
30,
A
sd(a)
<
but for 40 < t
sd(a)
~
For A = 0.0,
t
<
~
A
sd(6)
<
sd(a) ,
200,
..,
<
sd(6) •
A similar pattern was observed for A = 1.0 in the construction of Table
3.9.
This behavior may be deduced from the relevant equations for the
asymptotic variances and can be attributed to the relationships among
the coefficients of the functions of the one-locus descent measures,
namely,
and
2
0'1
( 0'1
2
+ 0'2
+ Q'2Y
and the changes over time in the relative magnitude of the functions of
the one-locus descent measures, for example, for N
= 100 and for
changes in t from 20 to 50 to 100, the corresponding changes in
69
o-28y+8 3
2
3
311-6o+88y-8 -48
are from 0.054 to 0.186 to 0.695.
Another example of this change in the ranking of the asymptotic
standard deviations of the estimators over time is provided in Figure
3.1 where asymptotic coefficients of variation of each estimator are
graphed on a logarithmic time scale for samples of 50 individuals from
20 replicate monoecious populations of size 1000 with quite different
initial gene frequencies at two tightly linked loci.
In this situation
the weighted least squares estimator, with weight matrix W, a consistent
-1
estimator of V , is uniformly best among the four estimators.
However
for long times, say t > 500, there is little difference between the
asymptotic coefficients of variation and asymptotic standard deviations
A
_
A
of 8, 8 and 8 so that the preferred estimator on the grounds of
W
-
computational ease might be 8.
..,
The estimator 8 which is an unweighted
average across loci performs poorly when t is large.
..,
is small, 8 performs almost as well as
8W and
However, when t
certainly better than 8
A
and 8.
Returning to Tables 3.8 and 3.9 it is apparent that linkage
increases the asymptotic standard deviations of the estimators but does
not appear to change the ranking of the estimators.
In summary it might appear from these limited numerical examples
that when 8 is to be estimated from data gathered on two loci from a
large number of isolated monoecious populations, and it cannot be
assumed that initial gene frequencies are the same at both loci, that
(i) 8 is the "appropriate" estimator when it is suspected that the
populations have been isolated for a long time;
70
1.6
1.4
1.2
1.0
cv
.8
.0
CV(~)~
.2
5
100
50
iO
soc
lOCO
t
Figure 3.1
~s~ptoti~
coefficients of variation (CV) of 6,
6, 6 and 6W for samples of size 50 from 20
monoecious populations of size N = 1000 with
al = 0.0196, a2 = 0.2496 and A = 0.9.
v
(ii) 6 is the "appropriate" estimator when it is suspected that
the populations have been isolated for a short time;
(iii)
eW
is the "appropriate" estimator 'Nhen no auxiliary informa-
tion regarding time is available and when a consistent and positive
definite estimator of
v- 1
is able to be constructed.
71
4.
A SIMULATION STUDY
In the previous chapter some large sample properties (a
+
four estimators of the coancestry coefficient were presented.
00) of
In this
chapter an attempt to elucidate the small sample properties of the
estimators via a small simulation study is described.
The case of two
replicate monoecious populations (a = 2) practicing random mating,
including selfing, with two loci scored, was simulated.
4.1
Discussion of Programming
The program for the simulation study was written in the Fortran IV
language and was run on computers at the Triangle Universities Computation Center (TUCC).
Pairs of monoecious random mating populations of
100 individuals each were constructed by sampling without replacement
from an initial reference population of infinite size.
Four kinds of
initial reference populations, all in Hardy-Weinberg equilibrium at
each of two loci and all in linkage equilibrium, were utilized.
These
reference populations had the following parameters:
and
0.
1
0.
1
0.
1
0.
1
=
.25
0.
2
=
.25
0.
2
=
.24
0.
2
=
.24
0.
2
a
=
.25
A =
=
.25
A = 0.9
=
.09
A= 0
=
.09
A = 0.9
A replication of the experiment consisted of a pair of populations
"monitored" over 100 generations of random mating and 100 replicates
were generated for each of the four kinds of initial reference populations.
Sampling of 100 individuals from the reference population to
form one of a pair of founder populations and sampling of 200 gametes
72
from an infinite pool of gametes to form a next generation of a population was achieved by the usual method of obtaining random numbers from
a particular discrete distribution utilizing pseudo-random uniform
(0,1) deviates.
The IMSL (1979) subroutine GGUBS (a multiplicative
generator) was used for this purpose.
Y
_
A
A
The estimators 8, 8, 8 and
eW were
computed for each generation for
each replicate pair of populations with the consequence that sample size
n equaled population size N.
If a replicate pair of populations
became fixed for the same alleles at both loci, that replicate was
ignored in subsequent generations.
This happened to one of the
replicates in the case
A= 0
at generation 91 and to one of the replicates in the case
A
= 0.9
at generation 81.
If a pair of populations became fixed for the same
v
allele at a single locus, the estimation of e for that replicate was
overlooked in the incidental generation and subsequent generations.
other three estimators were calculated.
The
By generation 100, fixation of
the same allele at a single locus had occurred in four replicates for
the case a l
case a
.24, a
l
Z
= a 2 = .25, A = 0 and A = 0.9, in 41 replicates for the
= .24, a 2 = .09, A = 0, and in 44 replicates for the case a l =
.09, A = 0.9.
Computation of the estimators
.e
V
e, e
and
A
e
posed no problems as
closed forms exist for these estimators (see equations 3.1.5, 3.1.6 and
3.3.3).
The computation of
8w was
accomplished by way of damped
73
Newton-Raphson iteration.
The main features of the algorithm to compute
A
Sw
are sketched below.
1.
An initial consistent estimate of generation time twas
recovered from
eby
the application of the formula
If the application of this formula yielded a negative value for t, t
was set equal to zero.
This value of t was used to generate, via the
appropriate transition equations, consistent estimates of the higher
order one-locus descent measures (y,
(8, r and
~*).
~
and 0) and the two-locus measures
The calculation of the two-locus measures presupposed
that the linkage parameter A was known exactly.
2.
1
The estimated total variances for locus 1 (zl+w + r l ) and locus
l
2 (z 2+w + tr2) were used as initial estimates of a and a , respectively.
l
2
2
If either of these was greater than 0.25 it was set equal to 0.25.
If
one of these was zero the locus was dropped from the analysis.
3.
A consistent estimate of the variance-covariance matrix of the
observed variance components was formed from the results of (1) and (2).
Since a
= 2,
terms with coefficients l/a(a-l) were not ignored in the
construction of this matrix.
4.
The weight matrix was formed by inversion of the estimated
variance-covariance matrix.
Since the estimated variance-covariance
matrix is positive definite and symmetric, the IMSL (1979) subroutine
LINV3P was used for the inversion.
5.
1
+ Zr
and
2
1
Damped Newton-Raphson iteration using zl + WI + Zr , z2 + w
l
2
e as
carried out.
initial estimates of aI' a
2
and 8, respectively, was
The IMSL (1979) subroutine LINV3P was used to invert the
74
Hessian.
If this subroutine returned an error message indicating that
the Hessian was algorithmically not positive definite, iteration was
stopped and the current value of the iterate was reported as the
estimate.
This type of error occurred in less than 2% of the minimiza-
tions and seemed to occur when the estimated value of t was far from
the actual value of t, for example an estimated t of 921 and an actual
t of 97, or an estimated t equal to zero and an actual t of 52.
The damping of the Newton step consisted in reducing the length of
the Newton step by a factor of one-half until the current value of the
objective function Qw(k-l) was bettered.
If a reduction in the objec-
tive function had not occurred for reduction in the Newton step by a
factor of 2
-32
,then a series of damped steps in the direction of the
gradient vector were taken until the current value of the objective
function was reduced.
step of 2
-32
If such a reduction was not achieved for a
A
times the gradient vector, the current iterate y(k-l) was
reported as the estimate.
This type of occurrence happened in less
than 0.5% of the minimizations and generally after four or five
iterations.
6.
Convergence was deemed to occur when both
and
II rw(k-1)
- lw(k)
II
<
5
10- [ II rw(k-1)
II
3
+ 10- ]
were satisfied or when the number of iterations (k) exceeded 50,
although this last criterion did not have to be invoked in any of the
minimizations.
75
Realized mean square errors and variances of the estimators for
each of the 100 generations were computed by averaging over the number
of replicates (r) in the manner,
1.
r
r
1:: (6 - 6)2 '"'
i
r i-I
r 1::
!:.:l L
r
i-I
(6. - 8)2
1
r-1
l + (8 _6)2
...J
where 8 denotes an estimate of the true value 8 for the i
i
and
e is
th
replicate
the average of these estimates over replicates.
4.2
Results
Selected results for generations 5, 10, 20, 50 and 100 for the
four types of initial reference population are presented in Tables 4.1,
4.2, 4.3 and 4.4.
In each of these tables the four rows nested within
each time, t, refer to the four estimators,
..
a
~
a
A
8
A
and
8
W
respectively, while the column labeled Avar contains numerical quantities obtained by evaluating the asymptotic variance formulae, presented
in section 3.3, for a
=
2.
In all four tables there is a trend for realized coefficients of
variation for each estimator to decrease as t increases.
When there
is no linkage (Tables 4.1 and 4.3), there is a tendency for e to be the
..,
least biased and a to be the most biased of the four estimators in later
generations.
This can be perceived by comparing the mean square errors
76
Table 4.1
e
t
5
10
20
50
100
.0248
.0489
.0954
.222
.394
Realized means, standard deviations, mean square errors and
variances for four estimators of e for the case of two
monoecious populations diverging from an initial reference
population with parameters al = a2 = .25 and A = O.
Mean
Std. Dev.
MSE x 10
.0257
.0257
.0066
.0066
.0042
.0261
.0263
.0069
.0069
.0042
.0259
.0257
.0066
.0066
.0042
.0259
.0261
.0067
.0068
.0041
.0441
.0441
.0195
.0194
.0131
.0451
.0457
.0208
.0209
.0131
.0443
.0445
.0198
.0198
.0131
.0446
.0450
.0203
.0203
.0145
.0899
.0852
.0722
.0726
.0413
.0945
.0938
.0871
.0880
.0413
.0937
.0969
.0930
.0939
.0413
.0922
.0905
.0811
.0818
.0406
.189*
.148
.228
.219
.160
.200
.160
.257
.255
.160
•200
.168
.283
.282
.160
.193
.153
.240
.234
.135
.307*
.201
.474
.403
.325
.343*
.225
.530
.508
.325
.363
.261
.682
.679
.326
.331*
.224
.538
.503
.348
.
Var x 10
Avar x 10
77
Table 4.2
e
t
5
10
20
50
100
.0248
.0489
.0954
.222
.394
Realized means, standard deviations, mean square errors and
variances for four estimators of e for the case of two
monoecious populations diverging from an initial reference
population with parameters al = a2 = .25 and A • 0.9.
Mean
Std. Dev.
MSE x 10
Var x 10
Avar x 10
.0173*
.0196
.0044
.0039
.0042
.0174*
.0198
.0044
.0039
.0042
.0174*
.0197
.0044
.0039
.0042
.0174*
.0198
.0044
.0039
.0041
.0467
.0450
.0201
.0203
.0132
.0477
.0467
.0216
.0218
.0132
.0465
.0451
.0202
.0204
.0132
.0474
.0464
.0214
.0216
.0142
.0780*
.0791
.0649
.0625
.0419
.0807
.0831
.0706
.0691
.0419
.0786*
.0808
.0675
.0654
.0419
.0795
.0819
.0690
.0671
.0429
.164*
.127
.192
.160
.162
.174*
.137
.209
.188
.162
.175*
.145
.230
.210
.163
.167*
.131
.198
.170
.112
.281*
.188
.478
.354
.330
.310*
.210
.509
.443
.330
.326*
.239
.612
.570
.331
.298*
.210
.530
.441
.327
78
Table 4.3
a
t
5
10
20
50
100
e
.0248
.0489
.0954
.222
.394
Realized means, standard deviations, mean square errors and
variances for four estimators of a for the case of two
monoecious populations diverging from an initial reference
population with parameters al = .24, a2 = .09 and A = O.
Mean
Std. Dev.
MSE x 10
Var x 10
Avar x 10
.0188*
.0202
.0044
.0041
.0031
.0194
.0281
.0081
.0079
.0036
.0193
.0327
.0109
.0107
.0047
.0193*
.0218
.0050
.0048
.0031
.0374*
.0356
.0139
.0127
.0142
.0390*
.0428
.0191
.0184
.0162
.0385*
.0480
.0238
.0230
.0207
.0378*
.0374
.0151
.0140
.0141
.0798*
.0668
.0466
.0446
.0483
.0921
.0910
.0820
.0827
.0522
.0950
.102
.102
.103
.0657
.0848
.0770
.0598
.0593
.0473
.184*
.131
.184
.172
.223
.214
.184
.335
.337
.214
.221
.201
.400
.404
.259
.204
.175
.307
.307
.206
.280*
.167
.405
.280
.545
.332*
.255
.680
.648
.467
.336*
.264
.724
.697
.540
.326*
.257
.699
.660
.464
79
Table 4.4
e
t
5
10
20
50
100
.0248
.0489
.0954
.222
.394
Realized means, standard deviations, mean square errors and
variances for four estimators of e for the case of two
monoecious populations diverging from an initial reference
population with parameters al = .24, a2 = .09 and A • 0.9.
Mean
Std. Dev.
MSE x 10
Var x 10
Avar x 10
.0207
.0233
.0056
.0055
.0044
.0201
.0248
.0063
.0062
.0051
.0193*
.0272
.0076
.0074
.0066
.0206
.0231
.0055
.0053
.0044
.0415
.0444
.0201
.0197
.0144
.0410
.0482
.0236
.0232
.0163
.0396
.0517
.0273
.0267
.0208
.0416
.0447
.0203
.0200
.0143
.0732*
.0634
.0447
.0402
.0489
.0787*
.0780
.0631
.0609
.0526
.0777*
.0847
.0742
.0718
.0659
.0760*
.0696
.0517
.0484
.0479
.151*
.116
.183
.135
.225
.180*
.153
.251
.235
.216
.186*
.170
.298
.288
.260
.168*
.142
.227
.200
.209
.250*
.158
.453
.250
.549
.287*
.218
.587
.477
.471
.296*
.233
.635
.543
.542
.277*
.212
.581
.449
.468
80
and variances of the estimators for the later generations and by noting
the asterisks in the column labeled ''Mean.''
For a particular time t a
one-sample t-test can be carried out to test whether the realized mean
of an estimator differs from the true value of 8.
The asterisk indicates
rejection of the null hypothesis that the realized mean of an estimator
is equal to the true value at the 5% significance level.
Note, however,
that since the sample of replicates at time t is related to the sample
of replicates at time t + k the outcomes of the t-tests for each
generation will be related so that comparisons of performance of an
estimator for different times should be resisted.
For tight linkage
(Tables 4.2 and 4.4) all four estimators appear to be badly biased but
A
again a is the least biased in later generations.
The extent and direc-
tion of some of these biases can be predicted from the following fQrmulae
..,a and -a:
for the means of
8 +
E(S)
=
1
m
I:
m t=1
(..1. at
+0(a- 1 )
m
I: at
8 + (~-t~,.;;;1,---_ ,m
2
4)
4J~
2
1.( 1 (8 - y) + ..1.
a
an
Jl 1a (82 - y)
i 12 8 (1-8) -
(8 - y)
L
:J"1
-'
(4.2.1)
+ _1 'L- 1. 8(1-8) - (8-y) ~J"1
an
2
.J
I: at
t=1
(4.2.2)
Notice that if at
should be zero.
=
.25 for all t that the approximate bias in
y
a
and
-
a
This appears to agree with the situation in early
generations in Table 4.1.
However terms involving the linkage parameter
A do not enter the approximate formulae in equations (4.2.1) and (4.2.2)
81
so that the large biases in Tables 4.2 and 4.4 are not predicted. For
v
Table 4.3 realized biases in 8 for times 10 and 50 are -.0115 and -.038,
while the approximate biases calculated from equation (4.2.1) are -.0023
and -.035.
Similarly realized biases in 8 for times 10 and 50 are
-.0099 and -.008 while the approximate biases calculated from equation
(4.2.2) are -.0006 and -.010.
v
In all four tables the estimator 8 generally has the smallest
realized mean square error and variance but its bias might prevent an
unequivocal recommendation for its routine use in the case of two
sampled populations (a
= 2).
There is some agreement between the
realized variances and asymptotic variances calculated for a
= 2
particularly for the later generations in Table 4.4 (the case where
gene frequencies at the two loci in the initial reference population
are quite different and linkage is tight).
Routine calculation of
these asymptotic variance formulae even for the case of a
=
provide some insight into the variance of an estimate of 8.
2 may
82
5.
DOMINANCE, MULTIPLE ALLELES AND OTHER EXTENSIONS
In this chapter, the effects of failures of some assumptions such
as codominance, two alleles per locus, no migration and no mutation,
on the methodology described in Chapters 3 and 4 are briefly described.
This discussion is certainly not complete and should be regarded as
merely an indication of areas worthy of future research.
5.1
Dominance
In previous chapters, alleles at each locus have been assumed to be
codominant; that is, genotypes at each locus have been assumed to be
identifiable.
A and
A at
Consider, now, two loci each with two allelic classes,
the first locus with A completely recessive to
at the second locus with B completely recessive to
B.
A and
Band
B
Let the indicator
random variable z .. be such that
-1.J
z ..
-1.J
= [Zii l ] =
Zij2
[~ J,
if individual j in population i is AABB
[ ~ J,
i f individual j in population i is AABB
or AABB
[~ J,
i f individual j in population i is AXBB
or M:BB
[~ J,
otherwise.
The values of this random variable, for one locus, correspond to the
assigned phenotypic values of Robertson (1952) and in this brief discussion an exact treatment of the effect of inbreeding on the variation due
to recessive genes is prOVided.
The expectations of various products of
the elements of these random vectors over the two sampling processes (a
83
populations from an infinite number of replicate populations and n
individuals from an infinite progeny array in each population) are as
follows:
T
E(~ij~ij)
"\ pA
-
T
E(~ij~ij')
T
E(~ij~i/j/)
~~
AB
~B]
pB
B
. [ p~l~
pAIB
A B
pA1B
A B
pB/B
B B
=
(p~)2
pApB
A B
pApB
A B
(p:f
]
A bivariate analysis of variance with labels for among and within
populations can be constructed.
The among-populations variance-
covariance component matrix E can be written in the form
a
~a
=
p~l~ _ (p~)2
pAIB _ pApB
pA/B _ pA pB
p:\: -(p:/
A B
A B
A B
A B
and the within-populations variance-covariance component matrix L can
b
be written in the form
L:.-
o
=-
pA _ pA IA
A
AA
pAB _ pA lB
AB
AB
84
The translations of the elements of these matrices to functions of
descent measures and gene frequencies (assuming linkage equilibrium in
the initial reference population) are
P~ - P~ I~ =
(F -
0Xy) PA (l-PA)
+ [1 - 4yxy -
(~.y+ 2tx+y) + 60XyJ P~ (l-p~)
(5.1.2)
- [2F + 48 - 12y•. - 2 (tL.__ .. + 2fu__ ..) + 120 ....J p3 (l-p )
XY
"'"'X. Y
"'"'X+Y
XY
A
A
It is clear that 6 cannot be recovered from the expected values of
the among and within variance components for each locus.
If, however,
the recessive alleles, A at the first locus and B at the second locus,
are rare in the initial reference population, that is PA and P are
B
close to zero, then
2:
e
a
~ [ o~
A
P
0':'XY
PB ]
and
2:
b
~[
(F - 0Xf)PA
0
(F _OO ••••)P
XY
l
B--
85
The parametric function 0iY/F, which increases from 0 to 1 as time
increases from generation 1, is identifiable and can be estimated by the
methods discussed in Chapters 3 and 4.
As an aside, exact expressions for the among and within variances
for the cases of full sib mating and random mating monoecious populations with selfing, considered by Robertson (1952), can be obtained by
substituting the exact solutions for the descent measures presented in
Cockerham, Weir and Reynolds (1981) into equations (5.1.1) and (5.1.2).
5.2 Multiple Alleles
Consider two loci, each with a series of codominant allelic
classes.
Let the first locus have r+l alleles A , A , •.. ,Ar + and the
l
l
2
second locus have s+l alleles B , B , .•• ,B + •
l
2
s l
The vector of indicator
random variables analogous to equation (3.1.1) and based on a scoring
system proposed by Stanton (1960) is
~ijk
=
(5.2.1)
~~~~­
Xijk21
where x"k!
~
u
=1
if haplotype k in individual j in population i carries
allele A , and zero otherwise (the index u runs from 1
u
to r),
86
and where,
Xijk2v
=1
if haplotype k in individual j in population i carries
allele B , and zero otherwise (the index v runs from 1
v
to s).
A multivariate analysis of variance for x"k with the same structure
-1.J
as that in Table 3.1 can be derived.
The (r+s) x (r+s) variance-
covariance component matrix for among populations, E , can be written in
a
the form
=
E
a
where All' which is r x r, has typical diagonal element,
A
p-!!.. - 2
PAu
A
u
= 8p
~
(l-p
~
.
) ,
u
= 1, ... , r
and typical off-diagonal element,
A
~- P P
A ..
Au Au"
u
= -ePA
u
PA
u :f u"
u"
The matrix A
is the s x s B-locus analog of All and Al2 , which is
Z2
r x s, has typical element,
A
p
u
IBv
u
where
~A
B
u v
u = l, ... ,r; v = l, ... ,s
- PA PB
v
is the disequilibrium for alleles Au and Bv in the initial
reference population.
The individuals-within-populations variance-covariance component
matrix has the form
87
where B , which is r x r, has typical diagonal element,
II
u=l, ••• ,r
and typical off-diagonal elements,
1.Z
A
A
(pA..
u -Z~) = A ..
u
I+F
,\
( - Z - 6J PA PA ..
u u
u
u 1= u" •
The matrix B
is the s x s B-Iocus analog of B and BIZ is the r x s
ZZ
II
matrix with typical element,
l(p
A B
Z
A /B
u v+P u
A IB
v_ ZP u
V)
=
1. (FI + F-Z e)
Z
I
I
6.
AB
u V
The variance-covariance component matrix for haplotypes or gametes
within individuals is,
E
=
c
where the r x r matrix CII has typical diagonal element,
A
u=l, •.. ,r
PAu = (I-F) PA (l-PA )
u
u
u
and typical off-diagonal element,
A
_p u
A ..
u
= -(l-F)pA
u
P
A
u
~
u 1= u ..
88
The matrix C
is the s x s B-locus analog of Cll and C which is r x s
l2
22
has typical element
A B
P u v
The problem of estimating
a
can now be viewed as the fitting of a vector
3
s which is Z[r(r+l) + s(8+1)] x 1 to a vector
vech(A
vech(A
vech(B
s
:=
vech(B
vech(C
vech(C
11
22
~(y)
where s is given by
)
)
n)
22
)
n)
22
)
and where carets denote unbiased estimators, from the multivariate
analysis of variance, of the submatrices of the variance-covariance
component matrices and the vech operator, to be found in Henderson and
Searle (1979), for example, stacks only those parts of columns of
matrices on or below the diagonal.
The vector
~(y)
is the expectation
of s over the three sampling processes, and
Various estimators of y and hence of a are available.
Chapters 3 and 4 various norms between s and
respect to
~(r)
As in
may be minimized with
y. Ratio estimators of a similar to "e and e- may be
constructed in various ways with the analogs of
..,
e and e being
89
y
e=
(5.2.2)
and
....
e=
(5.2.3)
where f is some r(r+l) x 1 vector and
~
is some s(s+l) x 1 vector.
Whether asymptotic variances of these estimators differ much for
different choices of f and
~
and whether an optimal choice of f and
can be routinely chosen is another matter.
~
Elucidation of the
asymptotic variances of these estimators requires the derivation, in
the manner of Appendix 9.1 of the variances of the off-diagonal terms
A
A
A
A
A
A
in All' A , B , B , C and C , and the covariances of these off22
ll
ll
22
22
diagonal terms with all other terms.
If the results in Chapters 3 and
4 are any indication there may well be little penalty to be paid (in
terms of asymptotic variance) in using the computationally simple
estimators
..,e
and
e- with f
=1
and
~
=!.
However, the small sample
properties of these estimators may well be quite different.
5.3
Mutation and Migration
In the development of the genetic distances for short-term evolution in this thesis a fixed allele model has been utilized; that is, the
expectations of gene frequencies over the relevant sampling processes have
been assumed to be constant and undisturbed by mutation or any other
90
force.
As an introduction to the problem of accounting for both
finite population size and mutation, consider an infinite-alleles
mutation model.
Any allele which is present in at least two of the
sampled replicate populations must have been present in the initial
reference population.
An allele which is present in only one of the
sampled replicate populations is either a mutant which has arisen since
divergence or is an allele that was actually present in the initial
population but was lost from the other sampled replicate populations
or missed in the sampling of individuals from these other replicate
populations.
Confining attention to alleles that are present in more
than one sampled replicate population allows score vectors similar to
that in equation (5.2.1) to be set up.
A multivariate analysis of
variance produces similar variance-covariance component matrices to
those in section 5.2, the only differences being that the gene frequencies, F and
e
are functions of the mutation rate.
For example, if
allele u at locus A appears in samples from more than one replicate
line, then the among populations variance-covariance matrix will have a
diagonal term equal to
where v is the mutation rate at the A-locus.
selfing,
e
is given by
t
2
(l-A )(l-v)
2N(1-A)
(1_v)2
where A
mate of
e
(2N-l)
2N
For monoecy with random
The recovery of divergence time from an esti-
requires a separate estimate of v, and of course, knowledge of
N, as v is not identifiable.
The expected gene frequencies
91
p*
A
u
= PA (I-v) t
u
are identifiable.
The effect of the incorporation of various migration models into the
fixed allele model and the joint effect of mutation and migration on
measures of genetic distance or more generally on descent measures
raises several problems.
The general approach that one must follow is
to derive the transition equations for the various one-locus and twolocus descent measures under the models.
The next step is to find
suitable estimators of the one-locus descent measures such as
may be candidate genetic distances.
e
that
If divergence time is to be
recovered (note that it may not be possible to speak of divergence if
migration among the a replicate lines is substantial) from an estimate
of a parameter such as
e
care must be taken to ensure that the parameter
is a monotonic function of time.
Estimates of the parameters in the
concurrent mutation and migration models, which are required if t is to
be recovered, may require samples of individuals to be taken at more
than one time.
92
6.
The use of a measure of
DISCUSSION
ge~etic
relationship, 6, as a genetic
distance has been proposed and methods for estimating such a measure
have been developed.
As a starting point the hierarchical analysis of
variance structure discussed by Cockerham (1969, 1973) has been used to
elucidate the various sampling processes inherent in the model of
evolutionary divergence.
The genetic distance 6 arises from this
structure in a natural way as a parameter and the task is to estimate
this parameter.
In contrast, the treatments by previous authors of
such distances have blurred the distinction between statistics and
parameters (Latter, 1973; Nei and Chakravarti, 1977).
The distance measures suggested by Nei (1972, 1973) are all
functions of the quantity I presented in equation (2.1.1).
The suggested
estimator for such a quantity, for a pair of populations X, Y and for
single-locus data is
....
I
(6.1)
=
2
.J 1:. pX..
1:. p~.
,
'V
where
PXi
•
~
~
...
~
is the frequency of allele i in the sample from population X.
This estimator can be approximated by
....
1*
1:.
=
i
Px.PY.
~
~
(6.2)
1:. ",2
(i
1:. p2
PX. +
i
~
2
Yi )
93
If the statistic in equation (6.2) is computed for the case of drift and
no mutation, its approximate expectation is (Cockerham, 1979)
where Pi is the frequency of allele i in the initial reference population.
Clearly, divergence time cannot be recovered from any function of
this statistic without knowledge of
2
~p.,
i J.
and of course population size N.
While Nei (1972, 1973, 1978a) does not claim that his measures are
designed to account for drift alone, the routine calculation of
y
y
statistics such as I and I* as "kernels" for measures of genetic
distance for short-term evolution is clearly inappropriate.
In order to investigate how to combine information over several
loci in the estimation of the genetic distance 8, a multivariate analysis
of variance framework was set up with the loci as separate "response"
variables.
It was shown that only the diagonal elements of the
estimated variance-covariance component matrices provide information
about the parameter 8.
Variances and covariances of these estimated
variance components which comprise the statistics from which estimators
of the genetic distance are constructed were derived in section 3.1.1
and Appendix 9.1.
These variances and covariances may be expressed in
terms of frequencies of gene combinations.
The effect of different
mating systems on these variances and covariances is illuminated by
translation of these frequencies of gene combinations to functions of
initial gene frequencies and descent measures.
The resulting expres-
sions are rather complicated (see Tables 3.3 and 3.4) but simplify for
the case of monoecy with random sampling (Table 3.5).
An important
94
feature is that there is covariation between the statistics for separate
loci.
This covariation is generated by the sampling of gametes in the
population to form successive generations and it depends on the mating
system, linkage and population size.
In the case of large monoecious
populations with random selfing that have been in existence for a long
time such covariation may be negligible (Table 3.6).
The other
important feature is that these variances and covariances are at most of
order a
-1
, see equation (3.1.9).
Since gametes or haplotypes within
individuals, and individuals within replicate populations are related
by virtue of the mating system and the finite population size, the
only independent samples are the replicate lines themselves so that a
central limit theorem for the estimated variance components is obtained
only when the number of replicate lines (a) tends to infinity.
This
means that asymptotic variances of estimators of distances are for
large a (a situation rarely met in practice).
In Chapters 3 and 4, four types of estimators were introduced:
an
unweighted average over loci of ratio estimators, designated by "v", a
weighted average over loci of ratio estimators designated by a tilde,
an ordinary least squares estimator designated by a caret, and a
weighted least squares estimator designated by a caret and subscripted
with a
"w" to indicate its dependence on a matrix of weights, W.
The
choice of the weight matrix for the fourth estimator has been overlooked somewhat in the previous chapters.
In order to obtain a "best"
weighted least squares estimator, W must be a consistent estimator of
the inverse of the variance-covariance matrix of the estimated variance
components.
One approach is to recover a consistent estimate of time
from an initial consistent estimate of
e, say ~ or
9,
then use this
95
esttmate of ttme to construct, via the respective transition equations
for the system of mating, the necessary higher-order one-locus descent
measures and two-locus nonidentity descent measures.
Elements of a
consistent estimator of the variance-covariance matrix may be obtained
by combining these calculated and consistent estimates of the descent
measures with initial consistent estimates of Pi(l-Pi).
Such an
approach which suggests iterative updating of the estimate of the
variance-covariance matrix merits further investigation.
A similar
method would be used to estimate the "asymptotic standard errors" of
estimates of 8.
In the special case of large monoecious populations
isolated for a long time, the construction of a consistent estimator of
the variance-covariance matrix was seen to involve less computation.
Only one type of mating system has been studied in this thesis,
namely monoecy with random selfing, although the necessary machinery
to study asymptotic variances of the estimators of 8 for any regular
system of mating has been established.
Transition equations for the
one-locus identity measures and two-locus nonidentity descent measures
for other mating systems, namely monoecious populations excluding
selfing and dioecious populations with random or hierarchical pairing
are available (Weir, Avery and Hill, 1980).
There is some interest in
establishing the transition equations for a double-first-cousin mating
system as a first step towards approximating the mating pattern of some
so-called primitive tribes (Neel, 1978).
This would allow the recovery
of divergence time for these tribes from some common ancestral population.
Some authors (Smouse and Neel, 1977; Chakraborty, 1980) essentially
average the score vectors x. 'k defined in equations (3.1.1) and (3.2.1)
-1.J
over gametes in the manner,
96
Previous discussions of score vectors similar to y .. have been given by
-~J
Wright (1922, 1965), Stanton (1960) and Cockerham (1969).
case of random mating monoecious populations, when F
= e,
Except in the
it can be
shown that the coancestry coefficient e is no longer identifiable when
a multivariate analysis of variance for l .. is constructed but that the
~J
parametric function P
= 2e/(1+F)
which is related to Wright's
coefficient of relationship is identifiable.
The parameter P, however,
is not a particularly well behaved distance measure as it asymptotes
earlier than e.
This has been pointed out elsewhere (Cockerham, 1978).
Perhaps the greatest weakness in the present study is the lack of
an extensive investigation of the small sample properties of the various
estimators of e.
One system of mating, monoecy with selfing, has been
the object of a small simulation study.
It was found that for a
=2
(the smallest number of replicate populations that allows the estimation
of e) and if two tightly linked loci are scored, the suggested estimators
of e are apt to show substantial bias.
This is particularly distressing
A
in the case of the estimator e
W
which was designed to take into account
covariation between the scored loci by utilizing a stochastic weight
matrix.
Future research may need to be directed toward the construction
of a simpler, and possibly deterministic, weight matrix.
Other research
might proceed in the direction of accepting the existing estimators and
finding suitable corrections for bias.
Equations (3.1.3) and (3.1.4)
may provide bias corrections and since they arise as scalar functions of
bivariate analogs of the intraclass correlation further research in
this area may be fruitful.
97
e
As they stand, the four estimators of
may be negative.
considered in this thesis
This occurs most often when divergence time is small
and one or more of the observed among variance components,
negative.
are
One suggestion (Cockerham, 1979) to obtain non-negative
estimates of
z~,
z~,
e
is to replace the observed among variance components,
in the numerators of the formulae for
v
mean squares for each locus divided by 2n.
e
~
and
e
by the observed among
This will also give the
numerators of these modified ratio estimators a positive bias [at most
of order 1/(2n-l)] and this may help correct the overall negative bias
of the ratio estimators.
98
7.
S~y
The use of the coancestry coefficient as a measure of genetic
distance has been espoused.
This distance has its genesis in ratios of
among population variation to total variation, hence its propriety as a
measure of genetic diversity among populations.
Also, this distance
when viewed as a function of measures of identity by descent is a
monotonic function of divergence time so that estimates of divergence
time can be obtained in either closed form or by graphical or tabular
means from estimates of the distance.
The estimation of this distance using multilocus data was studied.
Variances and covariances of the statistics with which these estimators
are constructed were derived and expressed in terms of higher order
one-locus measures of identity by descent and a set of measures of nonidentity for pairs of loci, under the assumption of linkage equilibrium
at all pairs of scored loci in the initial reference population.
General expressions for any regular mating system for the asymptotic
variances (as the number of replicate populations tends to infinity)
of four types of estimator of this distance were obtained.
Asymptotic
normality of these estimators was established, again for the case of a
large number of replicate populations so that hypothesis tests concerning
the values of the genetic distance and the construction of confidence
intervals for the genetic distance and divergence time are routine.
The special case of monoecious random mating populations with
selfing was discussed in some detail and it was found that when
e
is to
be estimated from data gathered at two loci, with two alleles per locus,
from a large number of such isolated populations, that
99
(i) If initial gene frequencies are the same at both loci, the
y
computationally simple unweighted average of the ratio estimators
e
is
as good as any of the others, and
(ii) If initial gene frequencies are different at both loci, the
weighted least squares estimator with weight matrix a consistent
estimator of the inverse of the variance-covariance matrix of the
constituent statistics appears to have the uniformly smallest asymptotic
variance of the four estimators.
average of the ratio estimators
The computationally simple unweighted
v
e
is as good when the populations have
only recently diverged while the weighted average of the ratio estimators
e
is as good when the populations have been isolated for some
time.
The mean square error properties of the four types of estimator
for situations in which only a small number of replicate populations are
available must await an extensive simulation study.
100
8.
LIST OF REFERENCES
Ahrens, H. 1976. Multivariate variance-covariance components (MVCC)
and generalized intraclass correlation coefficient (GICC). Biom.
Zeit. 18:527-533.
Bock, R. Darrell. 1975. Multivariate Statistical Methods in Behavioral
Research. McGraw-Hill, Inc., New York.
Browne, M. W. 1974. Generalized least squares estimators in the
analysis of covariance structures. S. Afr. Statist. J. 8:1-24.
Chakraborty, R. 1980. Relationship between single- and multi-locus
measures of gene diversity in a subdivided population. Ann. Hum.
Genet. 43:423-428.
Chevalet, C., M. Gillois and R. F. Nassar. 1977. Identity coefficients
in finite populations. I. Evolution of identity coefficients in a
random mating diploid dioecious population. Genetics 86:697-713.
Cockerham, C. Clark.
56:89-104.
1967.
Group inbreeding and coancestry.
Cockerham, C. Clark.
72-84.
1969.
Variance of gene frequencies.
Genetics
Evolution 23:
Cockerham, C. Clark. 1971. Higher order probability functions of
identity of alleles by descent. Genetics 69:235-246.
Cockerham, C. Clark.
679-700.
1973.
Analyses of gene frequencies.
Genetics 74:
Cockerham, C. Clark. 1978. Statistical concepts in genetics. Unpublished mimeographed notes, Department of Statistics, North Carolina
State University, Raleigh.
Cockerham, C. Clark. 1979. Distance measures. Unpublished mimeographed
notes, Department of Statistics, North Carolina State University,
Raleigh.
Cockerham, C. Clark and B. S. Weir. 1973. Descent measures for two loci
with some applications. Theor. Pop. BioI. 4:300-330.
Cockerham, C. Clark and B. S. Weir. 1977. Digenic descent measures for
finite populations. Genet. Res., Camb. 30:121-147.
Cockerham, C. Clark, B. S. Weir and John Reynolds. 1981. Variance of
inbreeding. Unpublished manuscript, Department of Statistics,
North Carolina State University, Raleigh.
Cramer, H. 1946. Mathematical Methods of Statistics.
University Press, Princeton, New Jersey.
Princeton
101
Goodman, M. M.
174-186.
1972.
Distance analysis in biology.
Syst. Zool. 21:
Graybill, F. A. and R. A. Hultquist. 1961. Theorems concerning
Eisenhart's Model II. Ann. Math. Stat. 32:261-269.
Henderson, H. V. and S. R. Searle. 1979. Vec and vech operators for
matrices, with some uses in Jacobians and multivariate statistics.
Canad. J. Stat. 7:65-81.
Hoeffding, W. and H. Robbins. 1948. The central limit theorem for
dependent random variables. Duke Math. Jour. 15:773-780.
IMSL Library Reference Manual. 1979. Edition 7. International
Mathematical and Statistical Libraries, Inc., Houston, Texas.
Joreskog, K. G. 1970. A general method for analysis of covariance
structures. Biometrika 57:239-251.
Joreskog, K. G. 1978. Structural analysis of covariance and correlation matrices. Psychometrika 43:443-477.
Kendall, M. G. and A. Stuart. 1977. The Advanced Theory of Statistics.
Vol. 1. 4th ed. Charles Griffin and Company, London, England.
Kimura, M. and J. F. Crow. 1964. The number of alleles that can be
maintained in a finite population. Genetics 49:725-738.
Kirk, R. L. 1977. Genetic distance--A pragmatic approach, pp. 244-245.
In S. Armendares and R. Lisker (eds.), Human Genetics. Excerpta
Medica Congress Series, Amsterdam, The Netherlands.
Latter, B. D. H. 1973. Measures of genetic distance between individuals
and populations, pp. 27- 37. In N. E. Morton (ed.), Genetic
Structure of Populations. University Press of Hawaii, Honolulu,
Hawaii.
Malecot, G. 1948. Les Mathematiques de l'Heredite.
Paris, France.
Malecot, G. 1969. The Mathematics of Heredity.
Company, San Francisco, California.
Masson et Cie,
W. H. Freeman and
Morton, N. E. 1973. Kinship and population structure, pp. 66-70.
In N. E. Morton (ed.), Genetic Structure of Populations. University
Press of Hawaii, Honolulu, Hawaii.
Neel, J. V. 1978. The population structure of an Amerindian tribe, the
Yanomama. Ann. Rev. Genet. 12:365-413.
Nei, M. 1972.
283-292.
Genetic distance between populations.
Amer. Nat. 106:
102
Nei, M. 1973. The theory and estimation of genetic distance, pp. 45-51.
In N. E. Morton (ed.), Genetic Structure of Populations. University
Press of Hawaii, Honolulu, Hawaii.
Nei, M. 1978a. The theory of genetic distance and evolution of human
races. Jap. J. Human Genet. 23:341-369.
Nei, M. 1978b. Estimation of average heterozygosity and genetic
distance from a small number of individuals. Genetics 89:583-590.
Nei, M. and A. Chakravarti •. 1977. Drift variances of FST and GST
statistics obtained from a finite number of isolated populations.
Theor. Pop. BioI. 11:307-325.
Nei, M. and A. K. Roychoudhury. 1974. Sampling variances of
heterozygosity and genetic distances. Genetics 76:379-390.
Robertson, A. 1952. The effect of inbreeding on the variation due to
recessive genes. Genetics 37:189-207.
Serfling, R. J. 1980. Approximation Theorems of Mathematical
Statistics. John Wiley and Sons, New York.
Smith, C. A. B. 1977.
Lond. 40:463-479.
A note on genetic distance.
Smouse, P. E. and J. V. Neel. 1977.
disequilibrium in the Yanomama.
Ann. Hum. Genet.,
Multivariate analysis of gametic
Genetics 85:733-752.
Stanton, R. G. 1960. Genetic correlations with multiple alleles.
Biometrics 16:235-244.
Tukey, J. W. 1956. Variances of variance components:
designs. Ann. Math. Stat. 27:722-736.
I.
Balanced
Weir, B. S., P. J. Avery and W. G. Hill. 1980. Effect of mating
structure on variation in inbreeding. Theor. Pop. BioI. 18:396-429.
Weir, B. S., C. Clark Cockerham and John Reynolds. 1981. Effect of
mating structure on variation in heterozygosity. Unpublished
Manuscript, North Carolina State University, Raleigh.
Weir, B. S. and W. G. Hill. 1980. Effect of mating structure on
variation in linkage disequilibrium. Genetics 95:477-488.
Wright, S. 1922. Coefficients of inbreeding and relationship.
Nat. 56:330-338.
Amer.
Wright, S. 1965. The interpretation of population structure by
F-statistics with special regard to systems of mating. Evolution
19:395-420.
Zyskind, G. 1962. On structure, relation, E, and expectation of mean
squares. Sankhya, Series A 24:115-148.
103
9.
APPENDICES
104
Derivation of the Variance-Covariance Matrix of the Vector
9.1
5
The variance-covariance matrix of !. is,
= 8(~~ T)
Var (!.)
-
T
~(y)c: (y),
where 8(~~T ) is the symmetric matrix,
8(~~
T
)
=a
z
zlw I
zlwZ
zlr l
zlr Z
Zz
zZw I
zzwz
zZr l
zZr Z
Zlw~
zZw I
wI
wIwZ
wIr l
wIr Z
zlwZ
zzwz
wIw Z
W
wZr l
wZr Z
zlr I
zZr I
wIr l
wZr l
r
z r
I 2
zZr Z
wIr Z
wZr Z
rIr Z
zl
zlzZ
zlzZ
Z
Z
(9.1.1)
Z
z
Z
rIr Z
l
r
Z
Z
Now,
+ r: r: r: r: (r: x. 'kl)\(r: x.
if, i' j
j'
A)
= an (ZpA+ 2PA
A
k
1J
k
1
'j
'kl\)l
-'
A\ + a (a-I)nZ (2pA) Z
+ an(n-I) (4P"A)
Al
2 2
= 2an [ PA +PA +Z(n-I)P"AJ + 4a(a-l)n P
A '
105
=8
[~ ~ (~ Xijk1 )4 + ~ ~~j~ (~ Xijk1)Z (~ Xij/k1)Z
+
= an
~"iE, ~
(ZPA +
14P~)
+ a(a-l)n
= Zan [PA
r~
+
Z
x ijk1 )
(~xi
Z
I
j 'kl)
an(n-l)(4P~ + 8P:I~
Z (
+
]
4P11~)
A \Z
,ZPA +ZPA)
+7P~+Z(n-l)(p~+2P:I~+p~I~) +
Z(a-l)(P A
+~)ZJ
'
so that,
2
Of course 8(r ) is just the analog of this with A's replaced by B's.
Z
9.1.Z
8(r r )
l Z
=8 {
1
1
2 2 (E E E x. 'kl)(E E E X. jkZ ) - - 2
2 (E E E x.jkl)IE E (E x. 'kZ)2
a n i j k 1J
'i j k 1
2a n 'i j k 1
-i j 'k 1J
- -h[E E (E x. jkl)\ 2 l (E E E x. 'k2)
Za n
k
k
_1_ r
(
)2 1J rL~ ~ (~ x ijk2 )2 l }
+ 4a2n2 ~~ ~ ~ x ijk1
i
j
1
.J
i
j
1
J
.J
l
-I
106
Now,
a{ (~ ~ ~ Xijkl)[~ ~ (~ XijkJ 2J}= a [~ ~ (~ Xijkl)(~ Xijk2 )2
,
2
+ I: I: I: (I: x·jk1)(I: x ij 'k2)
i j"j' k 1
'k
2
+I: I: I:
i"i' j
.. 2an
I: (I: x·jk1)(I:x""k2)
j'
'k
[-I• B• + pA/B
+
.,
k
1
J
1
]
2pAB + 2(n_ 1 lpAIB+ pA IB\
•B
\ ••
• jB)
+ 2(a-l)npA(P B +P:) ] '
a {[I:
I: (I: x
)2 ] (I: I: I: x ijk2 )}
ijk1
ij'k
ijk
= 2an
[p~~ + p~/~ + 2P~~
+
2(n-l)(p~I~+p~I~)
+ 2 (a-l)nPB(PA +~)J '
a fl
(9.1.2)
r' I: I: (I: x. .k 1)2 l ; Ir I: I: (\ I: x. j k2 )2 1. ")r -_ a rI: I: (\ I: x .. k 1)\2(I: x .. k2 )2
'-i j
k
1J
- ~i j
Ok
1
-' )
'-i j
0
'k
1Jk
1J
2
+ I: I: E (I: x. j k 1)2 (I: x. j 'k2)\
i
j" j '
'k
1
(
1<
+ I: I: I: I: ,I: x. jkl
iIi' j j' k 1
1
)2(,I: x. 'j 'k2 )2-,J
'k
1
107
+
2(n-l)(p~I~+p~I~+p~I~+p~I:)
+ 2 (a-l)n(PA
(9.1.3)
+P~)(PB+P:)J '
so that,
&(r l r 2 )
=
(PA -
P~)(PB - P:) + ~ [p~ I~ - (p~l~ +p~I:) + p~l:
+ ...!..
an
{l2 (pAB•• +pA/B)
••
- (PA -
P~)(PB - p:)J
_ (pAB +pAB) + pAB _ rpA IB _ (pA IB +pA IB\+pA IBJ}
A··B
AB
L··
A·
• B)
A B
2
As a check, it can be seen that &(r r ) reduces to 8(r ) when B's are replaced
l
l 2
by A's and when redundancies in the notation are eliminated in the manner:
p~l~
corresponds to
p!
pAA
corresponds to
P
A
A/A
AA
AA
AA
p •• ' PA' , P.A and PAA
A
correspond to
Utilizing this fact, that expectations of cross products involving two loci
reduce to expectations of squares or cross products involving one locus, only
the expectations of the off-diagonal elements in the submatrices in equation
(9.1.1) are presented in the following sections.
One locus expectations are
recovered by substituting say B's for A's if a result for the first locus is
required and by using Table (9.1)
to eliminate redundancies in the notation.
108
Table 9.1
Translations of redundant two-locus
notation to one-locus notation
One-locus notation
Redundant notation
PA
pAlA pAA pAA pAA
•• ' A·' 'A' AA
pA
pA IA pM pAA pAA
•• ' A·' 'A' AA
AlA AlA A A A A A A
PA:"' P:---A' p AlA' PAIX"' PX"IA
p~
A
A
pAIA
A•
pA/A
AlA
pAIA
A A
PAlA
pW-
pA/A
pAI~
A A
A A
AlA
where
8 ( IE E (E x
1 "i
j 'k
ijkl
\)2; rE E (E x.
-; "i j
k
l.jk2
)2 Jll
f
was given in equation (9.1.3),
109
+ 4(n_l)(n_?)(pill+ p A
- ·rB
/!\
A BI
(9.1.4)
8 {
[~ (~ ~ Xijk1fJ [~ ~ (~ Xijk2)2J}
is the analog of equation (9.1.4)
with A and B interchanged,
&
{ rL~ (~
~ x ijk1
)2
AlB
(AB
AB)
_AB
.. loP . • +2 PA. +P. B + 2P-AB
JrL~ (~ ~ x ijk2 )2 J} -_2an JlPAB.
IB
A
A
A
A
+ 2(n-l/p IB +p IB +p
+ p IB)
\ ••
A·
·B
AB
-f4 (n_l)[pAB + pAB + pA/B + pAiB
A·
'B
A.
• B
( AB AB)J
+ 2 PA!ii+PA"/B
+ 4( _1)(pM.+ 2pA/B+ pA/B)
n
AB
A B
AlB
+
4(n-l)(n-2)(p$+p~+p~l~ +p~I~)
( A B
AlB'
+ 16(n-l)(n-2)\P:Afi3+ P
ATB)
ill.
+ 8(n-l)(n-2)(n-3) PAIB
A
AJ
+ 2(a-l)n,r P +PA +2(n-l)PA"
A
(9.1.5)
110
so that,
--::2~1_-8 {eI: I:
4a n(n-l)
i j
-
-
(I: Xijk1 )2] (I: I: kI: Xijk2 )
Ie
i j
~ [~ (~ ~ Xijkl)2](~ ~ ~ Xijk2 )
~ [~ ~ ~
1
+ Z;
Xijk1lJ
L;r (~ ~
. \2 1
xijklJ ~
[~ ~ (~
Xijk2
lJ
[I
;
~ \~
,-.z-l
x ijk2 ) j
I
J '
where
8 {
[~~ (~XijkJ2] (~~ ~ XijkZ )}
was given in equation
(9.1.2),
111
)2 !
)} =2anP
{AB
AB 2 (n-l)P
(AlB
AlB)
r ( EEx
d { lE
•• +P AlB
•• + 2P A.+
•• +PAo
ijk1 J ,EEEx ijk2
i
j k
i
j k
AB
AlB) +4(n-l)
+ 4(n-l) ( PA:"+PA7
A
ill
(n-2)p~
A1
I
+ 2(a-l)n [ P +P +2(n-l)PA"JP I ' (9.1.6)
A A
B
d {
[~ ~ (~ Xijkl)2J [~ ~ (~ Xijk2)2J }
was given in equation (9.1.3),
d {
[~ (~ ~ Xijkl)2J [~ ~ ~ Xijk2 )2J }
is the analog of equation (9.1.4)
with A and B interchanged, so that,
(pill_p~I.B)
A I·
A B
+l[l(pAIB+pAIB_pAIB_pAIB) _
a 2
••
A °
• B
A B
_ l ( +pA_2P~.v _pB\l
A
2 PA
A),PB B).J
+ .l.1l(pAB+ pA/B+ 2pAB _ 2pAB _ 2pAB\ _ 1 (pAIB+pAIB _ pAIB _ pAIB)
an [t;
••
••
A'
•B
AB)
2
••
A °
• B
A B
_ (pM+pA/B _
A'
9.1.5
z
8(zl 2)
A
°
2P~ B _ 2pill+2P~IB)l
,AlB
AI·
A BJ
d(z z2)
l
1
tJ{rE(EEx.
,2 [r:(r:r:x.
)2
16(a_l)2 n4
~ Li j k 1jklJ.Jl i J k 1jk2 J
l
- 1
rE (E I: x. 0kl)2 (E E E X.
)2
a "i j k 1J
- i j k 1 jk2
-
~ (~ ~ ~ Xijkl)2[~ (~ ~ Xijk2)2J
12
+ a
+
1 3 8
4(a-l)n
f l
(~ ~ ~ x ijk1 )2 (~ ~ ~ Xijk2 l }
l
l
rr: (r: E X.
)2 w - [r: (r: r: X.
)2 wI
jk1 - 2
~i j k 1
i j k 1 jk2 -
1 (
)2 w +-,r:r:r:x.
1 (
,2
}
+ ; r:EEx.
o ) w
2
i j k 1 jk1
a 'i j k lJk2
1
1
+ 2" tJ [w1w2]
n
(9.1.7)
112
where
&{ [~
(~ ~ Xijkl)2J (~ ~ ~ Xijk2 )2} = {eqUation
(9.1.5) +
8a(a-1)n2PB[p~~ + p~/~ + 2P~~ + 2 (n-1)
X
(p~I~+p~I~)
+ 4(n_1)(pAB+ p AlB\
A.J
\ A'
+
4(n-1)(n-2)p~J
32[
A+2(n-1)PAAJ jI
P +PA
B A
+ 8a(a-1)(a-2)n P
,
(9.1.8)
2
( 1: 1:
>: X.
ijk
~ jk2
)
l-
= requation
(9.1.8) +
l..
8a(a-1)n2 rP [pAB + pAlB + 2pAB + 2 (n-1)
1.
A
X
"
••
•B
(p~I~+p~I:)
+ 4(n_1)(pAB +pA/B)\
·B
+ 4(n-1) (n-2)
. B
p~J
+
[p~~+P~/~+2(n-1)p~I~J2
+
(a-2)np~ [PB+p:+2(n-1)p~J
+ 4 (a-2)nP P
A B
[~~+p~/~
+
2(n-1)p~I~J
2 2} _:1
+ 2 (a-2) (a-3)n 2 PAPB
113
rI:
2
l1 = 2anfpAB+pA/B+2(pAB+pAB\
I: (I: x
)
l.i j \
ijk2 _1f
1..··
+
A'
oB)
+ 2pAB
AB
4(n-l)(pM.+2P~ B+pA/B)
Ao
AlB
A
0
+ 2(n_l)(pAIB+ pAIB+P'"\IB+ pAIB)
••
A·
·B
AB
+ 4(n-l)
(n-2)(pW-+p~I:)
+ 2 (a-l){pA+ P~ + 2 (n-l)p~J (PB + P:)
+ 4(a-l)nPA
[p~~+P~/~+2P~:
+
+
The other terms in equation
Thus,
2(n-l)(p~I~+p~I:)J
4(a-l)(a-2)n2p~(PB +P:)}
•
(9.1.7) have appeared in previous sections.
114
~(zlrZ) • 4a(a-l)n
---~1---~3 ~ { [E (E EXijk1)ZJ (Ei EE
XijkZ )
k
i
j k
j
-t [~ (~ ~
- 1
a
+
Xjjk1)Z]
[~ ~ (~ XijkZ ) Z ]
(E'i ~j kEX.jk1)Z
(Ei EE
x.
)
~
j k ~ jkZ
z~ (~ ~ ~ Xijk1)Z [~ ~ (~ XijkZ)z]}
where,
~ [(~ ~ ~ Xijk1)Z (~ ~ ~ x ijkZ )] = {eqUation
(9.1.6) +
+ 8a(a-l)n Zp !pAB+pA/B+Z(n_l)pAIB
AL ••
••
• •
and where other terms have appeared in previous sections, so that,
+
ill
AlB ( 2P Z
A\(
B\
(AlB
AlB)'
a1 rLPAr,-PA
B+\ A .PA) P B .PBrZPA\P, •• p. B J
~ ..L
an
{pAB •
A'
2P~
B+ pA/B •
A IB
A·
z(pill. p:iIB\1 •
\ A I-
A BJ
I
p [pAB + p A/B • 2pA B
A... - ••
••
+ z(pA IB _ pAB\"; 1
I, • B
'Br~J
115
9.1. 7 8(z w )
l 2
8(
z l w2
) =
8 {
1
4(a-l)n 2
_
..
r!: (!: !: x
Li
j k
ijkl
)2..,J w2
1
(!: !: !: x.
4a(a-l)n 2 i j k ~jkl
B)
A- A
(,P-A
p 2) -1(p + PB- 2P2\BB
B
)2
w
2
116
902
The vector
Asymptotic Multivariate Normality of
in equation (30l07) is, as a
~
multivariate normal
with mean vector
v .. l.A+ls+
a
an
1
an(n-l}
Y
~(y)
~m,
~
as a
~symptotically
and variance-covariance matrix
0
Proof Outline:
The vector
~
=
a
1 \'
-a L
i=l
1
n
W
1
n
W
- -Col) 2
- -
%2
a
2
a-I (Ci2 - Co2 )
- -
wI
wn
'01
2
wi2
l
Pn
2
Pi2
%1
r
r
where
may be written in the form,
~
~ ~
....L (C
a-I
n
n
12
117
Now the term
is not an average of independent random variables, but,
where use has been made of the lemma given by Serfling (1980, page 72) and given
earlier by Cramer (1946, page 365).
So by Slutsky's Theorem [see, for example,
Cramer (1946, page 255)] if the sequence of random variables
converges in distribution to some random variable then the sequence of random
variables
a
~ L C~l)(Cil - '.1)2
i=l
converges in distribution to the same random variable.
A similar result holds
for the term
1
a
\-
;- L
i=l
So the asymptotic distribution of
~
is the same as the asymptotic distribution of
a
s*
1
a
L
(C il - PA)
i=l
(C i2 - PB)
w
il
wi2
Pil
Pi2
2
2
1
n Wil
- -
1
n Wi2
- -
118
Now the vectors
(Ci l -PA)
(Ci2 - PB)
2
2
1
- -n wi l
1
- -n w12
are i.i.d. with mean vector
~(y)
and variance-covariance matrix given by
So by the multivariate version of the Lindeberg-uevy version of the central
limit theorem [see, for example, Cramer (1946, page 316) or Serfling (1980,
page 28)J, ~* is asymptotically multivariate normal with mean vector 2(Y) and
variance-covariance matrix V•
•