Dodds, K.GResampling Methods in Genetics and the Effect of Family Structure in Genetic Data"

•
Resampling methods in genetics and the effect of
family structure in genetic data.
K. G. Dodds
Institute of Statistics Mimeograph Series No. 1684T
•
NORTH CAROLINA STATE UNIVERSITY
Raleigh, North Carolina
iv
Table of Contents
Page
1. RES.AMPL ING lVIETHODS ...•........••..............................•..•. 1
1.1 Introduction
1
1.2 The Jackknife
1
1.3 The Bootstrap
7
1.4 The Delta Method and the Infinitesimal Jackknife
17
1.5 Other Resamp1ing Methods ..........•........••...•.••..•••••.•. 20
1.6 Relations Between Methods ................•...............•.... 22
1.7 U-Statistics
1.8 TransfoI1llat ions
24
25
2. RESAMPLING METHODS IN GENETICS ...•..•...........•................. 27
2.1 The Jackknife
2.2 The Bootstrap
2.3 The Delta Method
27
31
33
3. RES.AMPLING IN LINEAR MODELS ....•..........•.......•...•.•••.....•. 37
3. lInt roduct ion
3.2 Mean Model
3.3 One Way Model
37
38
39
3.4 One Way Model - Unbalanced Data ......•.....•.....••....•...... 48
3.5 Two Way Model
53
3.6 Three Factor Model With Crossed and Nested Factors ...•.....•.. 61
4. RESAMPLING METHODS FOR ESTIMATING THE COANCESTRY COEFFICIENT
70
4.1 Estimating the Coancestry Coefficient ...............•.......•. 70
4.2 Resampling Methods for the Distribution of 8w ••••••••••••••••• 71
4.3 Simulations Study ....••.......•............•....•.•........... 74
4.4 Discussion of Simulation Results ...........•..•..•...•........ 78
4.5 Example: Monterey Pine Data ..........................•........ 82
5. THE USE OF RESAMPLING METHODS IN GENETICS: CONCLUSION....•......... 87
6. THE EFFECT OF FAMILY STRUCTURE IN GENETIC DATA .:
:~
90
6.1 Family Structure in Genetic Data ..•.....•....•....•........... 90
6. 2 Notat ion
91
6.3 Relationships Between Gene Combination Frequencies .........•.. 95
6. 4 Gene Frequency
III
6.5 Hardy-Weinberg Disequilibrium .................•........••.•.. 113
6.6 Linkage Disequilibrium
119
6.7 The Correlation of Gene Frequencies .........•.....•.......... 132
6.8 Estimation of Effective Population Size ...........•.......... 142
6.9 The Effect of Family Structure: Conclusion .••..•..••.......... 146
7. REFERENCES
150
1.
RESAMPLING METHODS
1.1 Introduction
Often in statistics we want to find an estimate that requires using
a complicated formula.
Once we have the estimate we want to know how
good it is: Is its expectation equal to the parameter of interest?
If
not how far away is it? (i.e. What is the bias?) What is the standard
error of the estimate? What is a confidence interval for the parameter?
Usually the answers to these questions require another order of
complexity and it is often too difficult to find explicit formulas.
These and some other problems can sometimes be solved using what are
known as resampling methods.
Resampling methods are methods which look at the properties of a
new sample drawn from the data that we already have.
The properties of
this new sample are used to make inferences about the properties of the
original sample.
The way in which this sample is drawn depends on which
resampling method is being used.
In general these are completely
nonparametric methods although sometimes modifications are possible to
give a parametric method.
1.2 The Jackknife
The jackknife originated from a paper by Quenouille (1949) in which
he suggests a method for eliminating bias that is proportional to lin.
Later (Quenouille, 1956) it was developed into a general method for
reducing bias.
It has also been extended to a method for estimating the
sampling variance of an estimate.
e-
2
Let Xl, X2, .•. , Xn be a sample of independently and
Method.
identically distributed (i.i.d.) random variables with corresponding
observed values Xl,X2, ••• ,Xn.
parameter 8.
Let 9(Xl,X2, .•• ,Xn) be an estimator of a
The jackknife resamples n-l distinct sample values.
If
the i-th sample value is omitted we obtain 8(i). Let
8 ( .)
the mean of all n 8(i)'S.
= ~n=1 8 (
i
i )
/n
Then the jackknife estimator of the bias of 9
is defined to be
BiasJ = (n-l) (8(.)- 9).
(1. 2.1)
Subtracting BiasJ from 9 leads to the jackknife estimator of 8, a
weighted average of 9 and 8( . ), namely
8J= n9 - (n-l) 8(.)
The justification for this estimator is as follows.
(1.2.2)
Often the bias can
be written in ascending orders of l/n Le. as al/n + 82/n2 + 83/n3 + •••
If this is the case then the bias of 8J is O(1/n2 ) so that jackknifing
removes the first order bias.
If ai=O, i>l then 8J will be unbiased.
One consequence of this is that if the estimator is unbiased, then so is
the jackknife estimator, although it is not necessarily the same
estimator.
If the estimator is consistent and does remain unchanged
under jackknifing, then i t must be unbiased (Gray and Schucany, 1972,
Theorem 5.1).
Jackknife Variance EstiJl1Bte.
Tukey (1958) extended the jackknife
technique to give an estimator for the variance of 9 using the formula:
(1.2.3)
If we define the i-th "pseudo-value" as 9+(n-l)(9-8<i»), then 8J is
3
the mean of the pseudo-values, and VarJ is the sample variance of the
mean of the pseudo-values.
Following from this it was proposed that
tJ = (SJ-S)/f(VarJ)
(1.2.4)
is approximately distributed as a t with n-l degrees of freedom.
It has
since been shown that this is asymptotically true (i.e. tJ converges in
distribution to a
N(O,l»
in many situations as will be discussed
below.
We can also use VarJ as an estimate of Var(SJ) and in fact in the
situations that tJ is asymptotically normal, Var(9) and Var(SJ) are
asymptotically equal.
It is found (Efron, 1982) that VarJ usually
overestimates Var(9) in expectation, but no such theory exists for
Var(SJ).
Hinkley (1978) has done a Monte Carlo simulation and found
that VarJ can severely underestimate Var(SJ) for contaminated data.
ExBlBples.
Example 1.2.1.
The Expectation.
Suppose S=E(X) and 9=x.
e-
Then we
find S(i)=(nx-xi)/(n-l), S(.)=x and so SJ=9 - jackknifing has no
effect on the estimate.
VarJ=!(xi-x)2/(n(n-l», the usual estimate for
the variance of X.
Example 1.2.2.
statistic.
Let X-U(O,S), and let X[iJ denote the i-th order
Suppose we estimate S by 9=X[nJ.
We find
SJ=X[nJ + (n-l)(x[nJ-X[n-1J)/n.
Now E(9) = nS/(n+l) = (1-1/n+l/n2-1/n3 +•.. )S, while
E(SJ)=(1-1/n2+1/n3 -
•••
)S, and we see that the jackknife has removed the
O(l/n) term from the bias.
We also get
VarJ=(n-l)2(x[nJ-X[n-1J)2/n2
Suppose we now use the unbiased estimate 9=(n+l)x[nJ/n. Then the
~
4
jackknife estimate is
8J=2x[ n] -X[ n-l]
which is also unbiased (as expected).
Now
VarJ=(x[n]-X[n-l])2,
E(VarJ)=(2n/(n+1»
Var(9)
=Var(8J)
so that in this case VarJ is very conservative for Var(9) but is
actually unbiased for Var(8J).
However Miller (1964) has shown that
this is a case where the limiting distribution of tJ is non-normal
(notice that the asymptotic variance of 9 1 the asymptotic variance of
8,1).
Also we expected 8,1 to have larger variance than 9 since 9 is the
UMVU estimator of 8.
Example 1.2.3.
Ratio Estimation.
proved useful is in ratio estimation.
One area where the jackknife has
Let (Xi ,Vi), i=l, ••• ,n be a
random sample from a bivariate distribution.
estimate 8=E(Y)/E(X).
Suppose we wish to
The "obvious", but biased, estimator is 9=Y/I.
It has been found (Miller, 1974) that the jackknife estimator based on 9
has smaller bias and smaller variance than 9 for a normal or gamma
distribution.
A Taylor series expansion can be used to show that E(9) ,...
8+O(1/n), while E(8J) = 8+O(1/n2 ).
Use of tJ as s t-s tstis tic.
We have seen that the jackknife does
not always give reasonable estimates (example 1.2.2), so when can we use
tJ as a t statistic?
Miller (1974) reviews several broad classes of
statistics where this can be done. These are when 9 is:
1. A function of a maximum likelihood estimate.
This case is not so
important as we are assuming a distribution to get the estimate
5
and can presumably obtain variances and confidence intervals
using the properties of the distribution.
2. A function of a mean - 9=f(X).
tJ is N(O,l) providing
The limiting distribution of
Var(x)<~,
and f has a bounded second
derivative near E(X).
3. A function of a U statistic.
This generalizes 2.
See section
1.7 for more details.
4. A function of a regression estimate - 9=f(t).
The limiting
distribution of tJ is N(O,l) providing the expectation of the 4th
moments of the i.i.d. errors is finite, and X'X/n
definite matrix as n
~ ~,
~
a positive
where X is the design matrix.
Jackknifing
is to be achieved by deleting an (X,Y) pair to obtain 8(i).
(1977) suggests a weighted jackknife in this case.
Hinkley
To estimate 8 he
uses the weighted pseudo-values t+n(1-wi)(t-8(i», where Wi is
equal to the "distance" of the design point from the center of the
design.
5.
An
estimate of a parameter of a stochastic process with stationary,
independent increments.
Outliers.
One situation in which the jackknife does not appear to
be useful is when the data is contaminated and contains outliers.
Miller (1974) noted that it did not correct for outliers.
Hinkley
(1978) found that it could have much worse bias than the original
estimator for these cases.
He found that using the jackknife
pseudo-values with a robust estimation method can be better than either
the original or jackknife estimator.
Grouped Jackknife.
In general the jackknife need not be
e-
6
implemented by omitting just one observation at a time, but. can be used
omitting a group of size h, for each of g (distinct) groups, where the
total sample size is n=gh.
Quenouille's original method for eliminating
bias was actually the jackknife with g=2.
The properties of this method
of jackknifing are similar to those using n groups of size one.
In some
cases it has been shown that using g=n is optimal, and most authors
agree that this is the best method as it avoids the non-uniqueness
involved in assigning observations to groups.
The advantage with using
a smaller g is that less computation is required, but the gain from this
is unimportant when using computers.
In what follows the "jackknife"
will refer to the jackknife with n groups of size h=l.
Generalized Jackknife.
Gray and Schucany (1972) have given another
generalization of the jackknife which they call the "generalized
jackknife".
This estimator can be used to eliminate any highest order
bias (not just O(l/n», if it is known.
estimators of 8.
Suppose A1 and A2 are two
Then for any real number R "1 1 the generalized
jackknife is
G(A1,92) = (A1-R92)/(1-R)
The ordinary jackknife uses 91 = A, A2
bias was O(l/n2 ) and we used R
highest order bias.
If H(A)
generalized jackknife with
= 8(.)
= (n-l)2/n2,
and R
= (n-l)/n.
If the
this would remove the
= 8+81 /n+82/n2 then applying
A1 = 8J, 92 = the mean of the
the
n jackknife
estimates (i.e. the n 8J's) obtained by omitting an observation and then
jackknifing (based on the n-l samples of size n-2 to find each of the
8J's), and R
= (n-l)2/n2 ,
gives an estimate with bias of order O(1/n3 ) .
(Note that it is not unbiased.)
This is known as the second order
7
jackknife.
Gray and Schucany (1972) give a slightly different
definition of the second and higher order generalized jackknife which
would lead to an unbiased estimate in the above case.
Camputation.
When computing the jackknife it is usually easier to
have a subroutine for calculating 9 for a sample of size n with weights
applied to each sample value.
When calculating 8( i) use equal weights
on each observation except the i-th which will have weight zero.
This
is easier than putting all but the i-th observation into a new vector to
be passed to a subroutine.
Review Articles.
The jackknife has been reviewed by Gray and
Schucany (1972), Miller (1974) and Efron (1982).
A bibliography of
articles relating to the jackknife has been given by Parr and Schucany
(1980).
Recently more attention has been given to the bootstrap, which
will be discussed next.
1.3 The Bootstrap
The bootstrap was introduced by Efron (1979) as an alternative to
the jackknife.
The bootstrap has wider applicability than the jackknife
as it allows estimating the sampling distribution of a random variable,rather than just the bias and variance.
Method.
Let 1=(X1, X2, ... , Xn ) be a sample of i. i. d. random
variables with unknown distribution function F, and let x be an observed
value of I.
Suppose we are interested in a random variable R=R(I,F).
The bootstrap estimates the distribution of R as follows:
1. Find the empirical distribution function Fn .
2. Let
1*=(X1*,X2*, .•. ,Xn*) be a sample of i.i.d. random variables with
e-
8
distribution function Fn.
The resampling for the bootstrap consists
of sampling n observations from Fn.
3. Estimate the sampling distribution of R(X,F) by the distribution
of R*=R(I*,Fn).
Efron calls this the bootstrap distribution.
The intuitive justification for using this method is that the
distribution of R* equals the distribution of R if Fn=F, and we hope
that if Fn is close to F then the distribution of R* will be close to
that of R.
There are three methods of calculating the bootstrap distribution:
(i) Direct theoretical calculation, (ii) Monte Carlo approximation and
(iii) calculation of the mean and variance using a Taylor series
expansion.
The first method is the most desirable but in practice is
not often used because the bootstrap is used when a closed form for R
doesn't exist, or if it does it is too complicated to calculate its
bootstrap distribution.
The third method is the same as the delta
method (see sections 1.4 and 1.6).
The Monte Carlo method is the usual
way of finding the bootstrap distribution.
The Monte Carlo method consists of generating a large number, B, of
observed values x* of 1*.
The empirical distribution of R*=R(x*,Fn) is
then used to approximate the bootstrap distribution.
B->~
If we could let
then this would give exactly the bootstrap distribution.
However
as we are using the bootstrap distribution as an approximation to the
sampling distribution of R it is not worthwhile taking B extremely
large.
In practice it has been found that B=lOO is usually adequate
except when R is heavily dependent on the tails of the distribution,
such as the case of constructing nonparametric confidence intervals.
9
In the following it will be assumed that the Monte Carlo method is
used to estimate the bootstrap distribution, unless stated otherwise.
Also E* will be used to refer to expectation with respect to the
bootstrap distribution (conditional on x), and EF to refer to
expectation when sampling from distribution function F.
Bias and Variance Esti1ll8tes.
The most commonly used choice for R
is R=t(X)-8(F), where t(X) is an estimator of a(F), a parameter of
interest (often t(X)=a(Fn».
The aspects of the distribution of R of
most interest are the expectation (the bias of t) and the variance (the
sampling variance of t).
If we wish to use the bootstrap to correct for
the bias in t, our new estimate of a is t-E*(R*). If we have the case
where t(X)=a(Fn) this gives 2t(X)-E*(t(X*»
as the new estimate of 8.
The jackknife can also be used to estimate bias and variance.
Constructing Confidence Intervals.
Another use of the bootstrap is
in constructing a confidence interval for a parameter 8.
R=t(X)=9, an estimator of 8.
the R*.
Here we take
Let F* be the distribution function for
For a 1-20: central confidence interval we use the 1000: and
100(1-0:) percentiles to give the confidence interval
[F*-l (o:),F*-l (1-0:)].
Efron (198la) calls this the "percentile method".
Since this
depends on the tails of Fn, a larger value of B (say 10 times bigger
than the value needed for bias or variance estimation) is recommended to
give a reasonable Monte Carlo approximation to the tails of the
bootstrap distribution.
Efron (1982, sect.lO.7) presents a modification of this called the
"bias-corrected" percentile method, which corrects the bootstrap
10
distribution if the median of the R*'s doesn't coincide with the
observed value of R.
The justification for this assumes that there is
some transformation which makes R pivotal.
It is not necessary that the
transformation be known, only that it exists.
The method usually
assumes a normal pivotal quantity but another distribution could be
used.
Using the normal distribution, the method is performed as follows.
Let i be the standard normal cumulative distribution function.
bias-corrected percentile method gives the approximate
l-2~
The
central
confidence interval
[F*-l (i(2zo-Z1», F*-l (w(2Z0+Z1 »
J,
where
Zo
= w- 1 (F* (9»,
Zl
=
w- 1 (l-~).
The percentile method is based on a distribution with mean E*(a*), but
as noted above the bootstrap bias corrected estimate of a is 29-E*(a*),
so a confidence interval using a distribution with this mean would be
more consistent.
The bias-corrected percentile method makes a
correction in this direction, although it uses the median as a basis for
the correction.
Efron (1985) shows that if the bias-corrected percentile method is
used when the parameter is a function of the mean of a multivariate
normal with covariance matrix a 2 I, then the confidence interval is
correct in the 0(1) and 0(n- 1 /
2)
terms.
The same result holds if there
is a transformation to such a distribution.
This compares with the
normal approximation which is correct only to the 0(1) term.
Buckland
11
(1984) found that the bias-corrected method tends to give non-central
intervals for the parameter
~f
an exponential distribution.
~
He also
gives an example of using both the corrected and uncorrected methods in
a mark-recapture experiment.
Schenker (1985) gives an example where
there is a transformation to approximate normality, but a pivotal
quantity does not exist.
The coverage probabilities are found to be
below the nominal level.
ExB1l/Ples.
Example 1.3.1.
The Expectation.
e(Fn)=x, and R*=X*-x.
Let e(F)=EF(X) and R=X-e.
Then
Theoretical calculations using results from
the multinomial distribution (see section 3.2 for more details) show
that the bootstrap estimate of bias is E*R*=O, and of the variance
(Var(X» is
e-
Var* (R*)=E* (X*-X)2
=1: (Xi -X)2 /n 2
= (n-l)
Note that E(Var*(R*»
Var(X)/n, so that it is biased downward,
whereas the definition of the jackknife estimate of variance adjusts for
the bias in this case.
Efron (1982) states that there seems to be no
advantage to adjusting the bootstrap in this manner.
This seems to be
due to the fact that the adjustment for the bootstrap would also
increase the variance (whereas for the jackknife the adjustment
decreased the variance), and this increases the mean square error more
than it is reduced by any bias reduction.
Example 1.3.2.
R
= X[nl-e,
= (k/n)n
As in example 1.2.2 let X-U(O,e), and let
so that R*=xfnl-X[nl.
- «k-l)/n)n.
Let
Then we find
Pk
= P*(Xfnl=X[kl)
12
E*(R*) =
Var* (R*)
l:kPkX[k]
= l:kPkXh]
-
(l:kPkX[ k]
)2
This is a situation for which the bootstrap cannot be relied on to give
correct answers as n(Xfn]-X[n])/X[n] doesn't converge to an asymptotic
distribution.
This is because Fn does not converge uniformly to F for
this situation (see condition (2) of Bickel and Freedman (1981) stated
in the section on asymptotics below).
Example 1.3.3.
Regression.
Suppose Yi=gi(6)+ei, i=l, ••• ,n with
observed values Yi, where the gi are known functions (usually of vectors
of covariates) and 6 is a p-vector of unknown parameters.
The estimate
of 6 is found by minimizing some distance D between the vector of
observations Yi and the vector of the gi(6)'s (usually the sum of
squared differences).
There are two ways one could bootstrap in this
situation.
The first is when the set of gi'S is not random.
calls this the "regression model".
Freedman (1981)
In this case the ei ' s are i. i. d.
with some distribution function F, and we resample the residuals.
The
bootstrap algori tbm is as follows:
(1) Calculate 8 from the data by minimizing D, and obtain
~i=Yi
gi(8), i=l, ••• ,n.
(2) Construct Fn, the empirical distribution function of the ei's
centered in some way (e.g. mean or median is zero).
(3) Resample a set of n ei*'s from Fn and construct a bootstrap data
set Yi*=gi(8)+ei*.
(4) Calculate the bootstrap estimate 6* of 8 by minimizing D for
the bootstrap data set.
13
(5) Find the bootstrap distribution of 8*, usually by Monte Carlo
repetition of steps (3) and (4).
If this is carried out for linear regression it results in the usual
covariance matrix estimate except that the estimate of
instead of having the usual divisor n-p.
a2
is
I~i2/n
This is similar to the
situation found in Example 1.3.1.
The second method is where the (Yi,gi)'S taken together are
considered as the random sample.
"correlation model".
Freedman (1981) calls this the
In this case there is usually some dependence
between ei and gi, leading to non i.i.d. errors.
We now bootstrap by
resampling from a distribution function Fn with mass lIn on each
(Yi,gi(8».
This does not lead to the usual regression estimates for
finite samples.
The two methods are asymptotically equivalent.
Freedman (1981)
e-
shows that the methods are asymptotically valid for the case of least
squares, and gives some error bounds.
Efron and Gong (1983) suggest
that the correlation model should be used in model selection situations,
or when the model is in doubt, as it will still give trustworthy
estimates of the variability
of~.
Freedman and Peters (1984) apply
the bootstrap to generalized least squares, and study a particular case
which includes lags and constraints on coefficients.
Using the
regression model approach they evaluate the bootstrap estimates by
bootstrapping each bootstrap sample and conclude that bootstrapping
gives better estimates than the conventional generalized least squares
estimates.
Asymptotic T.beory.
Asymptotic results have been derived showing
~
14
the convergence of the bootstrap distribution to the sampling
distribution, and also the asymptotic normality of bootstrap estimates.
These results depend upon certain regularity conditions being met.
Efron (1979, comment G) looks at the asymptotic distributions when
the sample space is finite (discrete distributions).
With certain
regularity conditions on R it is shown that the bootstrap and sampling
distributions have the same asymptotic normal distribution.
Singh (1981) has presented the theory for bootstrapping the mean.
Let 8=EF(X),
~2=VarF(X),
S~=!(Xl-X)2/n.
1. If
E(X2)<~
X* be the mean of the bootstrap sample and
Singh's main results are:
then
SUPxER IP{[n(X-8)<x} - P*{[n(X*-X)<x} I
2. If
EIX3I<~
[n SUPxER
~
0 a.s.
and F is not a lattice distribution function then
IP{[n(X-8)<~X}
- P*{[n(X*-X)<snX}I
~
0 a.s.
He gives results on the rate of convergence, showing that the bootstrap
approximation is better than the normal approximation.
The rate of
convergence in result 2 is found to be not valid for a lattice
distribution.
The rate of convergence to zero for [n{Fn- 1 (t)-F-l (t)} is
also given.
Bickel and Freedman (1981) found convergence results for the mean
independently of Singh.
U-statistics.
They also show asymptotic normality for
They give loose guidelines for when the bootstrap can be
expected to work for R(X,Fn):
1.
R(Y,G) converges in distribution, to 0, say, for all G in a
neighborhood of F into which Fn falls almost always.
2.
The convergence in (1) is uniform on the neighborhood, and
15
3.
The function
G~O
is continuous.
Beran (1982) has found that the bootstrap and 1st order Edgeworth
expansion estimates are both asymptotically minimax, in contrast to the
normal approximation which isn't.
If a functional of the distribution
is being estimated, a 2nd order Edgeworth expansion may be necessary.
Again these results require certain assumptions to be met.
Bahu and Singh (1983) have looked at the asymptotics for a linear
combination of means of k populations, where results similar to those
for a single mean are obtained provided the sample sizes from each of
the populations tend to infinity at the same rate.
Bickel and Freedman
(1984) also investigate this. situation with special emphasis on
stratified sampling, and let the total sample size tend to infinity in
any manner.
They find that there can be a scaling problem with the
bootstrap in some situations (e.g. if the sample sizes from each
population remain bounded and the number of populations tends to
infinity).
Parametric Bootstrap.
There are several modifications of the
bootstrap to accommodate certain distributional assumptions.
These
modifications result in either the parametric or the smoothed
bootstra~.
For the parametric bootstrap the data are assumed to come from some
particular parametric family, and instead of using Fn, a nonparametric
estimate of F, we use an estimate of F that is a member of the family,
to give Fp, say.
This is usually done by using the maximum likelihood
estimates of the parameters.
R*=R(X*,Fp).
We then find the distribution of
In general this leads to the usual parametric estimates,
but also provides a method for getting these estimates when the theory
e-
16
is too difficult •
We may wish to use a smoother distribution
.Smoothed Bootstrap.
function than Fn, but not be willing to assume a parametric family.
Instead we can use a smoother estimate of F than Fn resulting in the
smoothed bootstrap.
One such method is the kernel density estimate:
Let Y be a random variable with density k(y) with mean zero.
The
density estimate is
fn (x)
=
1
nbn
This density has mean i and variance bn 2 Var(Y) + I(x-i)2/n, and is the
density of XFn+bnY, where XFn has distribution Fn.
the variance of the distribution.
Note the increase in
To obtain reasonable estimates when
they depend on the variance would require some further adjustment (e.g.
rescaling) to the density.
The performance of such an estimator has not
been examined.
Efron (1982) does give examples for the variance estimation of the
sample correlation coefficient, which is independent of scale so that
the above problem does not occur.
For k he uses either a normal or
uniform density with covariance matrix 1/4 of the sample covariance
matrix, and bn=l in both cases.
He finds that this can give some
improvement over the standard bootstrap.
The smoothed bootstrap has also been used by Silverman (1983) to
test for the number of modes in a distribution.
Other modifications.
One area where the bootstrap has been found
to be quite useful is for estimating the error rate of a prediction
rule.
Efron (1983) investigates this and also suggests several
17
variations on the bootstrap which seem to work well for this case.
~
Bickel and Freedman (1981) suggest that the bootstrap can be done
with a bootstrap sample size m1n, the actual sample size, and present
their asymptotic results in this framework.
They claim that this will
allow using a small (size n) sample to judge the likely performance of a
larger (size m) sample.
The intuitive appeal for this doesn't seem as
strong as that for the usual bootstrap, and it seems to have received
little attention.
Rand01llizstioD Test.
The bootstrap has some similarities to the
randomization test, which also estimates the sampling distribution from
the data.
For example, suppose we wish to test the equality of two
treatments.
The randomization test finds the distribution of the test
statistic under the null hypothesis, to see if the observed value of the
statistic is in the tail of that distribution.
In contrast if we
bootstrap in this situation we can estimate the sampling distribution
for the statistic under the true distribution, and find if the null
value of the statistic is in one of the tails of this estimated
distribution.
1.4 The Delta Method and the Infinitesimal Jackknife
The Del ts Method.
The delta method is the name given to a method
of calculating an approximation to the moments of a statistic from the
first few terms of a Taylor series expansion.
method of statistical differentials.
It is also called the
The delta method is not a
resampling method as such but is closely related to them.
Let t(X) be a statistic of interest, and suppose E(Xt)
= ~t
for
~-
18
i=l, ••• ,n.
Let u be the vector with i-th element at(X)/aXi evaluated at
I!=(I!l, ••• ,I!n), and Y be the matrix with (i,j)-th element Vij =
a2 t(X)/(aXiaXj) evaluated at (I!l, ••. ,I!n).
The Taylor series expansion
for t(X), centered on I!, is
t(X) = t(I!) + u'(X-I!) + (X-I!)'Y(X-I!)/2 + R,
where R is a remainder term, which tends to 0 as n-.co under appropriate
conditions.
Taking the expectation, truncating after the third term,
and the variance, truncating after the second term, we find
EOH{t(X)} = t(I!) +
~i~jVijCOV(Xi,Xj)/2
and
VarOH{t(X)} =
where
~
(1.4.1)
u'~u,
is the covariance matrix of the Xi.
Further approximation, by
substituting sample moments for the population moments, is usually
necessary to evaluate these expressions.
nonparametric delta method.
Efron (1982) calls this the
The Xi'S may be the sample observations
themselves, or some function of them, as long as the form of their
expectation and variance is known.
It is required that u be non-zero to
use (1.4.1), and if this is not the case further terms from the Taylor
series could be used to obtain an approximation.
Some simplification can occur for the special case of a multinomial
distribution (see Bailey, 1961, Appendix 2).
Let ni be the number of
observations in the i-th class, i=l, ••• ,k, and n=nl+ ... +nk.
Then we can
write (1.4.1) as
VaroH{t(nl, ..• ,nk,n)} =
n~iPi(at/ani)o2
-
n{~iPi(at/ani)o}2,
(1.4.2)
where the subscript 0 indicates that the derivatives are to be evaluated
at the Pi=E(ni).
If t(nl, ••. ,nk,n) can be expressed as a function of
19
the frequencies ni/n, this becomes
~inpi
VarDM{t(nl, ••. ,nk,n)} =
(1.4.3)
The infinitesimal jackknife was
The Infini tesiJDsl Jsckknife.
introduced by Jaeckel (1972).
(ot/oni)o2 - n(ot/on)o2,
This method is similar to the jackknife,
but replaces the differences 9-8i in the pseudo-values by what are
essentially the directional derivatives.
Let 9(P) be an estimate of 8 found by weighting the observations
by P=(Pl, ••• ,Pn).
Let PO=(l, ••. ,l)/n (so that 9=9(PO»
i-th co-ordinate vector.
and Si be the
Define
UiE = { 9(PO+£(Si-PO»
- 9(PO) }/£,
UiO = limE->o UiE
(1.4.4)
Then the infinitesimal jackknife estimate of the variance of 9 is
VarI J =
~
(Ui o/n)2.
e-
Other variance estimates may be obtained by using
VarIJE
for a particular choice of £.
= ~(UiE/n)2
Instead of calculating the limit in
(1.4.4) we can approximate VarIJ by choosing a small value of
£=0.001) and using (1.4.5).
(1.4.5)
£
(say
Another choice of £ is -l/(n-l) which gives
a variance estimate which is similar to VarJ: it replaces 8(.) by 9 in .
(1.2.3) and
mult~plies
it by (n-l)/n.
The UiO are actually estimates of the "influence function"
evaluated at
Xi.
The estimates are obtained by evaluating the influence
function at Fn instead of F.
The influence function, IF(x), measures
the sensitivity of an estimate to a change in the mass at a point x.
Under reasonable conditions it can be shown that
9 = 8(F) +
~IF(xi)/n
+ Op(l/n)
20
and this can be used to calculate the asymptotic variance of 9.
Relationship.
Efron (198lb) has proven a result showing that the
non-parametric delta method and infinitesimal jackknife estimates of
variance are identical for a particular class of statistics.
Suppose
t(X) is of the form t(C:h, ••• ,OA), where each Qa is an average
Qa
= IQa (Xi) In.
Then the two methods are identical.
This is a wide class of statistics
and includes sample variances, correlations and standardized means.
One
important consequence of this result is that the delta method may easily
be approximated by the numerical calculation of the infinitesimal
jackknife.
Often the delta method can involve tedious calculations.
If t(X) is of the above form then the error for the variance
estimate in these methods is o(l/n).
Efron (198lb) found that this
estimate could have severe downward bias, so that the ordinary
jackknife, which is found to be moderately biased upward (Efron, 1982,
Chapter 4), would be preferred.
1.5 Other Resampling Methods
This is a resampling method introduced by
RandQ1l/ Subs81llPling.
Hartigan (1969).
The plan is to randomly draw Drs subsets from the data
without replacement, Drs<2n -l, the number of subsets.
i-th ordered estimate of
8
from each subset.
Let
The set of
8[i]
8[i]
be the
's may be
thought of as estimating the sampling distribution of 9.
The motivation for this method comes from the following result.
Suppose F is symmetric about 8, and 9 is an M-estimate of 8.
(M-estimates are statistics which are obtained as solutions of
21
equations, such as maximum likelihood or least squares estimates.
Serfling (1980), chapter 7.)
See
~
Then 8 has probability l/(nrs+l) of being
in the interval (8[1] ,8[1+1]), where we define
8[o]=~
and
8[nrs+1]=~.
This gives a convenient method of constructing exact confidence
intervals.
Efron (198lb) used this method to find the standard error of the
correlation coefficient (not an M-estimate) and found that it gave very
biased results, even though asymptotically it gives the same results as
the bootstrap.
Half Sampling.
This is a method originally proposed for use in a
stratified sampling situation (see e.g. Kish and Frankel, 1974).
Suppose we have a sample with nJ observations from the 1-th stratum,
l=l, ••. ,L.
Randomly sample (with replacement) nJ-1 of the observations
from each of the strata and compute a new estimate.
As in the bootstrap
this is done many times to obtain an estimate of the sampling
distribution.
The reason for using
nJ-1
instead of
nJ
resampled
observations is to eliminate the bias in the variance estimate for the
linear case (c.f. E(Var*(R*»
in example 1.3.1).
Originally the method was proposed for use when nJ=2, l=l, .•• ,L.
When this is the case the resampling consists of taking half of the
observations each time (hence the name of the method).
The efficiency
can be improved by taking both half samples created (i.e. include the
complementary half sample).
There is another method of reducing the
number of calculations which involves choosing the samples in a certain
way to achieve what is known as balanced repeated replications.
method will give the same estimate of standard error for a linear
This
~-
22
statistic.
The half sampling method can be used in situations other than
stratified sampling by artificially creating the strata, as done by
Efron (198lb).
This is almost the same as bootstrapping with sample
size n/2 instead of n, and can be shown in the linear case to give a
badly inflated estimate of variance (it is probably a better estimate of
Var(O) for a sample of size n/2 than for one of size n).
Efron (198lb)
also found this to be the case in simulations estimating the variance of
the sample correlation coefficient.
If one wishes to eliminate the bias in the variance estimate, using
just one stratum and bootstrap sample size n-l would seem to be a better
method than artificial stratification.
Efron (198lb) does not consider
this method although it is likely to give estimates with similar
properties to the bootstrap variance estimates corrected for bias (see
example 1.3.1).
1.6 Relations Between Methods
Quadratic FUnctioDals.
Define 0 to be a quadratic functional if it
is a function of Fn, and can be written in the form
(1.6.1)
If all the 8's are zero then 0 is a linear functional.
In the rest of
this section we use the following definitions:
OLIN is the unique linear functional agreeing with 0 on each of the
reduced (size n-l) jackknife samples.
OQUAD is a quadratic functional agreeing with 0 on each of the
23
reduced
jac~knife
samples.
9Land 9Q are the linear and quadratic Taylor series approximations
to the bootstrap estimate of S.
BiasJ, Bias* and BiasIJ denote the bias estimates using the
jackknife, bootstrap and infinitesimal jackknife ( = the delta
method) methods respectively.
Results.
1. VarJ(9) = n Var*(SLIN)/(n-l) (Efron, 1982, Theorem 6.1).
If
9=SLIN then the jackknife estimate of variance is n/(n-l) times
the bootstrap estimate of variance.
2. If 9=SQUAD and n is sufficiently large
EF(VarJ(9»
> Var(9)
(Efron, 1982, section 4.5).
3. BiasJ(9) = n Bias*(SQuAD)/(n-l)
(Efron, 1982, Theorem 6.3).
If
9=SQUAD then the jackknife estimate of bias is n/(n-1) times the
bootstrap estimate of bias.
4. If 9=SQUAD then EF(BiasJ(9»
is the true bias (so that SJ is
unbiased for S) (Efron, 1982, Theorem 2.1).
5. BiasIJ(9) = Bias*(SQ)
6. VarIJ(9) = Var*(SL)
(Efron, 1982, section 6.6).
(Efron, 1982, section 6.3).
Parr (1983) has found that if 9=g(Xn), where g is a smooth
function, then the first order (O(n- 2 » terms for the delta method,
bootstrap and jackknife are the same.
If F has 3rd central moment zero
then the second order (O(n- S / 2 » terms for the delta method and
jackknife are the same.
For these results the bootstrap bias and
variance estimates were modified by multiplying by a factor n/(n-l).
e·
24
1.7 U-Statistics
This is a wide class of statistics for which resampling methods
have been found to work well.
A U-statistic is a statistic which can be
written in the form
U(Xl, ... ,Xn) = Icombs h(Xil, .•. ,Xim)/(i),
(1.7.1)
where the summation is over all combinations of m of the n Xi'S, h is a
symmetric function of the X's with expectation 8 and is called the
kernel, and m is the smallest integer such that such an h exists and is
called the degree.
Examples include sample moments, the Wilcoxon one-
sample statistic and Gini's mean difference.
See Serfling (1980,
chapter 5) for more examples and some properties.
Note that a
U-statistic is unbiased for 8.
Results.
Asymptotic results for jackknifing a function of one or
more U-statistics are given by Arvesen (1969).
Suppose
E(h)=~, E(h2)<~
and f is a real function with bounded second derivative near
8=f(~)
~.
Let
be the parameter of interest which is estimated by 9=f(U), h and
U as defined above.
Then
1. [n(8J-8) converges in distribution to a normal with mean zero and
variance (12 =
2. nVarJ(9)
~
m2[f/(~»)2Var{E(h(Xl,
... ,Xm)IXl)}. ,
(12 in probability, so that tJ converges in
distribution to a N(O,l).
3. If the grouped jackknife is used and the number of groups, g,
remains fixed as
n~,
then tJ converges in distribution to tg-l.
Arvesen (1969) also extends results A and B to the case where 9 is the
function of several U-statistics, showing the asymptotic normality in
this case (which includes ratios, t-statistics and the sample
25
correlation coefficient) also.
Parr and Tolley (1982) discuss Arvensen's results for the special
case of estimating a function of a multinomial parameter vector.
They
also find a condition for asymptotic normality of the second order
jackknife.
Lee (1985) has found expressions for the variance and bias of
VarJ(9) and Var*(9) when 9 is a U-statistic.
For the jackknife the mean
square error is the same as the variance to 0(n- 3 ),
The variance of the
jackknife and bootstrap variance estimates are found to be the same to
0(n- 3 ), and so are equal asymptotically.
numerical study where 8=Kenda11's tau.
Lee (1985) also does a
He finds that the bootstrap
variance estimate generally has smaller MSE than either the jackknife or
a bias-corrected version of the bootstrap.
e-
Efron and Stein (1981) show that if 9 is a U-statistic, then
VarJ(9»Var(9).
Lenth (1983) shows that the jackknife can be used to determine if a
given statistic is a U-statistic.
He proves that 9 is aU-statistic
iff, for n sufficiently large, it is invariant under the jackknife.
If
9 is aU-statistic n>m is sufficiently large.
1.8 Transformations
Transformations are often used in statistics to stabilize the
variance and sometimes to achieve approximate normality resulting in
better tests or confidence intervals.
For the jackknife, confidence
intervals are constructed by assuming that tJ is either N(O,l) or tn-l.
This always gives symmetric confidence intervals and so can be expected
e
26
to give poor intervals if the distribution of 9 is known to be skewed.
It is generally thought that transformations can help give better
results for these cases.
Arvesen and Schmitz (1970) have made a Monte Carlo study of this
where
e is the F-ratio from a one way analysis of variance. They find
that using the log transformation before jackknifing greatly improves
the results in terms of both significance level and power.
It is also
found to be better than the usual F-test for non-normal distributions,
and similar to the F-test for normal distributions.
In constructing confidence intervals using the bootstrap, the biascorrected percentile method is equivalent to making a transformation to
a (normal) pivotal quantity, if such a transformation exists.
discussed in section 1.3.
This was
27
2.
RESAMPLING METHODS IN GENETICS
2.1 The Jackknife
Several authors have used or advocated the use of the jackknife with
genetic data.
This is particularly so with the estimation of ratios. as
the jackknife has been shown to work well. in particular to reduce bias
(see example 1.2.3). in this situation.
Heritability.
Ayres and Arnold (1983) have used the jackknife to
estimate heritability of slug-eating tendency in garter snakes using
full-sib family data.
Let Yij be a measure of the slug-eating tendency
of the j-th snake in the i-th full sib family. and write
Yij=~+ai+eij.
i=l •••• ,I, j=1, ••• ,J1, where Var(a1):qa 2 , and Var(e1j):q2.
heritability was estimated by
2aa2/(~a2+&2),
The
i.e. twice the
iutraclass correlation.
Arvesen's (1969) results for jackknifing in an unbalanced one-way
layout were used.
This amounts to using weighting systems to obtain the
between and within sums of squares, and jackknifing by omitting one
family at a time from the complete population sample.
Each family
receives a weight one for the between sums of squares giving
~(Yi .-~Yi
giving
./1)2, and a weight (J1-l)-1 for the within sums of squares
~i (J1-l)-1~j(Yij-Yi.)2.
(I-l){~a2+~(1/Ji)~2/I}, and I~2
These sums of squares have expectations
respectively, ,from which the variance
components estimates and hence the heritability estimate are obtained.
This weighted estimate was also compared with one advocated by Bulmer
(1980,p.84) and was found to give similar results.
The jackknife was
also used to place confidence intervals on the heritability estimates.
e-
28
Genetic Correlations.
Toelle and Robison (1985) have used the
jackknife to find standard errors for genetic correlations between one
of four testicular measurements and one of nine female reproductive
traits in cattle.
sires.
Measurements were made on the half-sib progeny of the
Estimates of the variance components for the sire effect for the
reproductive trait measurement in the daughters, and the testicular
measurement in the sons, as well as the estimate of the covariance
component between these traits were used to obtain the phenotypic
correlation, rp.
From this the genotypic correlation, rG, was estimated
using the formula:
rG
= {4rp /hThF}{ [l+(KT-l)hT2 /4] [l+(KF-l)hF2 /4] /KTKF }l/2
where h~ and ~ are the heritabilities for the testicular measurement
and female reproductive trait respectively, and KT and KF are the
harmonic means of the number of male and female progeny per sire
respectively.
Jackknifing was performed by omitting the data for one year (from a
total of seven years) at a time.
They claim that the correlation
estimates are approximately normal based on a histogram of the jackknife
deviations
8(i)-8(.).
However the histogram is constructed from the
seven deviations for each of 16 different correlation estimates.
There
is no justification given for assuming that these deviations come from
the same population and so the histogram has doubtful use.
Genotypic Correlation and Regression.
Crozier et al (1984) have
used the jackknife to obtain the variance of a genotypic regression
estimate.
The data consist of individuals living in different social
29
groups, called colonies.
Jackknifing was carried out by omitting one of
e
these colonies at a time.
The regression estimate was used as a measure of relatedness
between two groups, X and Y, within a co1ony,and uses a mu1tia11e1ic
weighting system giving specific weights depending on the genotypes of
the interactants.
Let
A
= ~i
~ VijNij;
J.
B
=t
UiZi - ~ijlfJ~ UiZj/(k-1),
C
= Ho+(k-2)lIE:/2(k-1) ,
D
=t
Zi 2
-
where k is the number of alleles,
tjlf!
ZiZj/(k-1),
Ho and lIE: are the sums of proportions
of each colony that are homozygotes and heterozygotes, respectively, and
ViJ are the interaction weights and NiJ are the proportion of
e·
interactants that are of the i -th and j-th genotypes from groups X and Y
respectively.
The regression coefficient used is
bvx
= (A-B/c)/(C-D/c)
where c is the number of colonies sampled.
Simulations were performed
to check the validity of the jackknife, and it was found to perform
well, although it did have large variation when the allele frequencies
were very unequal.
Pamilo (1984) looked at using resampling methods (the jackknife and
bootstrap) to get variance estimates for genotypic correlation and
regression.
Jackknifing was performed by omitting a colony at a time,
as in Crozier et a1 (1984).
The weighting scheme used for the
regression and correlation was also the same as in Crozier et al (1984),
e
30
leading to the correlation estimate
r={He-~hem/c-~(hem-hom/2) /c(Nm-l)} /
{He-Ho /2}
where hem ,hom ,He and Ho are the observed and expected heterozygosities
within each colony m, and in the whole population, respectively, Nm is
the number of individuals in the m-th colony, and c is the number of
colonies sampled.
Simulations were carried out to assess how good the variance
estimates were.
Let the subscript S denote estimates based on the
simulation trials, the subscript R denote estimates based on the
resampling method, e be the parameter of interest, and mean, Sd and CV
refer to the sample mean, standard deviation and coefficient of
variation respectively.
The jackknife was found to give better results
than the bootstrap (see section 2.2) and was used in the subsequent data
analysis.
This was based on the following observations: (1) the ratio
of means (SER) to Sds(e) was closer to one for the jackknife, and (2) the
CVS(SER) was smaller for the jackknife.
F-Statistics and Genetic Distance.
Reynolds et al (1983) have used
the jackknife to estimate the variance of several different estimators
of genetic distance, and proceed by omitting one locus at a time.
Using
simulations (with 100 loci) this was found to give satisfactory results,
even for the case of linked loci.
The recommended estimate of distance
was D=-]n(l-e), where e is estimated by the method given in the next
paragraph.
Weir and Cockerham (1984) also advocate using this method for
estimating the variance of estimates of the F-statistics F (the
inbreeding coefficient), e (the coancestry) and f (the correlation of
31
genes within individuals within populations).
These parameters are
estimated by
A
F=(a+b)/(a+b+c),
9=a/(a+b+c) ,
f=b/(b+c) ,
where a,b and c are the ·observed components of variance for the between
populations, between individuals within populations and between gametes
within individuals effects from an analysis of variance on the frequency
of the allele under consideration.
If a multiple allele system is being
studied, it is recommended that the F-statistics be estimated by using
the sum over alleles of the variance components to replace the
individual components in the above expressions.
e·
2.2 The Bootstrap
Genotypic Correlation and Regression.
Pamilo (1984) also
considered the bootstrap for finding the sampling variance of genetic
correlation and regression estimates discussed in section 2.1.
Simulations for the bootstrap were performed by sampling from the c
observed colonies with bootstrap sample sizes from c to 4c.
Bickel and
Freedman (1981) have suggested using a bootstrap sample size different
from the actual sample size to investigate the properties of a larger
sample.
This does not seem to be the intent here, and since the actual
results of the simulation are not presented it is not clear how it has
influenced the decision to use the jackknife in favor of the bootstrap
for the actual data analysis.
Selection COJDponents.
Clark and Bundgaard (1984) have used the
32
bootstrap to find the sampling distribution of estimated selection
components.
Let gi and gi' be the frequencies of the i-th genotype
(i=1, •. ,3) before and after selection due to one of the components, say
fecundity, while holding the effects of the other components at their
null values (no selection).
Then the new genotype frequencies are given
by
gi
,=
Ij Nj Fj Ki j 1 (I i Ij Nj Fj Ki j ) ,
where Nj (j=1, ••• ,9) is the frequency of the j-th mating type, Fj is
the fecundity of the j-th mating type, and Kij is the fraction of
progeny of the j-th mating type that are genotype i.
frequency is given by Llp=gl I
+~
I
12-g1 -ga 12.
The change in gene
Similar expressions are
given for the components due to zygotic and sexual selection.
Confidence intervals for the change in gene frequency due to a
particular selection component were found using the percentile method.
The bootstrap distribution was found by using the observed distribution
of the effects of a particular component.
(multivariate) observation was obtained.
For sexual selection only one
It was assumed that this was
multinomially distributed, and a parametric bootstrap was performed.
Phylogenies.
Felsenstein (1985) has given a novel use for the
bootstrap in placing confidence intervals on phylogenies (evolutionary
trees).
The data consists of a table of character values for each of
several species.
The characters could be DNA sequence sites.
It was
argued that each character could be considered to have evolved
independently from the others, so that bootstrapping is to be performed
by sampling characters from the original set of characters. It was
suggested that a smaller bootstrap sample size may give more accurate
33
confidence intervals if correlations between the characters are present,
since these correlations would mean that there are really fewer
(independent) characters.
This approach was not pursued.
The interesting aspect of this work is in trying to construct a
confidence interval for a tree structure.
The key is to look at the
occurrences of a particular branch, or monophyletic group, in the
bootstrap samples.
A group of species are said to be in the same
monophyletic group if i t occurs in at least, say 95%, of the bootstrap
samples, i.e. if the alternatives occur less than 5% of the time.
The
set of such groups would then constitute the "confidence interval".
This selects the branches for which we are 95% sure are correct, rather
than give a set of branches in which we are 95% sure the true branch
lies.
The latter would be possible only if we were able to order the
possible branches.
2.3 The Delta Method
Linkage.
Bailey (1961) uses the delta method to find the variance
of estimated linkage for various types of data.
The formulas for
linkage estimation involve the relative frequencies of offspring
phenotypes, and the data is multinomially distributed, so the simpler
formula (1.4.3) can be used.
For example, consider the offspring resulting from the mating of
double heterozygotes AD/ab, both parents being in the coupling phase.
Let DAB be the number of offspring of phenotype AD (the double dominant
phenotype), with similar definitions for DAb, DaB and Dab, and let
n=DAB+DAb+naB+nab.
Then DAB, DAb, DaB and Dab have expectations
e·
34
n(2+8)/4, n(1-8)/4, n(1-8)/4 and n8/4 respectively, where 8=(1-r)2,
and r is the recombination fraction (assuming it is the same for males
and females).
Consider the quantity T=log{DAsnab/(DAbDas)}.
(1.4.3) gives VarDM(T)=8(1+28)/{n8(1-8)(2+8)}.
An
Applying
estimate of 8 or r
can be obtained by equating the quantity DAbDab/(DAbDaS) to the
expression obtained by replacing observed values by their expectations,
and solving (possibly iteratively).
A variance estimate of the
resulting estimate can be obtained by inverting the univariate Taylor
series formula VaroM(S)=VarDM(T)/(dT/d8)2, and similarly for r.
Heterozygosity.
We can use the delta method (formula (1.4.3»
find the variance of heterozygosity for multiallelic data.
to
Let nij be
the observed number of Ai AJ individuals (Ai AJ and AJ Ai, ijl!j, being
considered distinct), !!nij=n, with k alleles present.
h=~
Then
! nij/n,
llij
-
where h is the observed heterozygosity,
-
Pi is the observed frequency
of allele i, and H is the expected heterozygosity (under Hardy-Weinberg
equilibrium) given the observed gene frequencies.
find the variance of the quantity f=l-h/H.
Suppose we wish to
To use (1.4.3) we need the
partial derivatives of f with respect to the nij and n.
-
We find
(af/anii)o = 2hpi/nH2
(af/anij)o = {H+h(Pi+pj)}/nH2,
-
(af/an)o = -h(2-H)/nH2
i~j
-
where Pi, h and H are the population quantities corresponding to Pi, h
35
....
and H respectively; and so
where the PiJ are the population genotypic frequencies.
This can be used to construct a test to see if h=H.
Under the null
hypothesis the variance is
which can be estimated by replacing H and the Pi by their observed
....
values to obtain Va.
The test statistic f2/Va
is then compared with
a chi-square distribution with one degree of freedom.
Genetic Distance.
The delta method has been used by Nei and
Roychoudhury (1974) to find the variance for genetic distance.
They
estimate distance by
e·
A
D=-ln{Jxv /(JxJv p/2},
where the J's are the means over all loci of the j's, and jX=IPi 2
of the i-th allele in populations X and Y respectively.
,
To apply the
delta method, D was treated as a function of the J's, to give
A
2 2 2
VarDM(D)=Var(Jx)/4Jx+Var(Jv)/4Jv +Var(Jxv)/Jxv+Cov(Jx,Jv)/2JxJv
-Cov(Jx,Jxv)/JxJxv-Cov(Jv,Jxv)/JvJxv
The variances and covariances were estimated by their sample variances
and covariances across the loci.
Reynolds et al (1983) investigated the
use of this formula by a Monte Carlo study, and found that it gave
similar results to the jackknife (see section 2.1) for the two allele
case, but gave poor estimates for the 200 allele case.
Heritability.
The variance of an estimate of heritability can be
36
estimated by using the delta method for a ratio:
VarDM(X/Y) = {Var(X)/E2(X)+Var(Y)/E2(Y)-2Cov(X,Y)/E(X)E(Y)}E2(X)/E2(Y)
The heritability estimate involves a ratio of variance components.
example if an analysis of half-sibs leads to
~s2
and
~2
For
being the
estimated components of variance for the between sire and error effects
respectively, heritability is estimated as
h2=4as2/(~s2~2).
Variances and covariances of the variance components are estimated by
assuming appropriate chi-square distributions for the mean squares from
which the variance components are obtained.
Funkhouser and Grossman (1982) compared the use of this formula
with a first approximation formula which estimates Var(X/Y) by
Var(X)/E2(Y).
They found that the delta method variance can correct
some large biases in the first order approximations.
37
3.
RESAMPLING IN LINEAR MODELS
3.1 Introduction
We wish to find out how the jackknife and bootstrap estimates
behave in more complicated sampling situations than a random sample from
one population.
To do this we look at various analysis-of-variance type
linear models (with up to three factors), and investigate the effect of
the resampling methods when the parameter of interest is 8, the overall
mean effect.
These are situations for which we can find estimates using
conventional methods, but we hope that looking at the resampling methods
for these cases will give some insight as to their properties in more
complicated situations.
In each case it is assumed that each of the factors is a random
effect (unless stated otherwise).
Because there are several types of
sampling being done to obtain the sample, there are several ways to
perform the resampling.
Each of these ways will be investigated.
The
bootstrap is not approximated by Monte Carlo simulations, but the exact
mean and variance are calculated theoretically.
Throughout this chapter the "dot notation" will be used: A variable
with a dot in place of one or more subscripts indicates summation over
those subscripts, and a variable with an overbar and a dot in place of
one or more subscripts indicates averaging over those subscripts.
Let the notation Yi
~
M(n,k;Pl, ... ,Pk) mean that the vector of the
k random variables Yi (i=l, ... ,k)
follows a multinomial distribution
with sample size n, and probabilities pi.
e-
38
3.2 Mean Model
.J~(odel.
To set up notation and further investigate the resamp1ing
estimates for the mean, as discussed in examples 1.2.1 and 1.3.1, the
model with just a mean and error term will be considered.
yi = 8 + ei,
The model is
i=l, •.. ,n,
(3.2.1)
where E(ei) = 0,
Var(ei) = (1'2,
E(yt yt' )
1
=
i=i'
i1i' •
The mean 8 is estimated by
9 = y. = IYi/n.
Jackknifing.
As in example 1.2.1 jackknifing gives
8(i) = (nY.-Yi)/(n-1),
= y.,
8(.)
and so
8.1 = y.,
Var.1 = I(Yi-y.)2/n (n-1),
E(Var.1) = IE(ei-e.)2/n (n-l) = (1'2/n
Bootstrapping.
Suppose we wish to find the distribution 'of R=9.
Let ri be indicator variables counting the number of times each Yi
occurs in the bootstrap sample.
Then ri - M(n,n;l/n, .•• ,l/n).
E(ri) = 1,
Var(ri) = (n-1)/n,
Cov(ri ,ri' ) = -lin, i1i'.
The bootstrap sample estimate of 8 is
8* = IriYi/n,
We have
39
and
As noted in example 1.3.1 Var*(8*) is a biased but consistent
estimator of Var(9).
The estimator could be made unbiased by using
An alternative is to use a different sample
nVar*(8*)/(n-l) instead.
size. m. for the bootstrap sample. In this case ri
~
M(m.njl/n ••••• l/n).
We find
E* (8*) =
y ••
e-
E{Var*(8*)} = (n-l)a 2 /mn,
so if m=n-l. Var*(8*) will be unbiased.
This is similar to what is done
in half sampling (see section 1.5).
3.3 One Way Model
Consider the model
Model.
Yij = 8 +
~i
i=l, ... ,a.
+ eij.
j=l ••..• n.
where
E(~i)
= E(eij) = O.
Var(~i)
= aa 2
,
Var(eij) = a 2
•
+ aa 2 + a 2
82
E(YijYi' jl ) =
1
82 + aa 2
82
•
•
,
i=i', j=j'
i=i'. Jfj'
ifi' •
(3.3.1)
40
The estimate of 8 is
S
= y •• = IYij/an,
with
In this and the following cases there are several ways in which
resampling can be done.
jackknifing sections.
These methods will be described in the
In this section 8J and E*(8*) are always the same
as the original estimate S.
Jackknifing. We can jackknife using the following methods:
(1) OBS: Resample the observations themselves.
this means omitting one observation at time.
samples that are unbalanced.
For jackknifing
This results in jackknife
When this is the case there are two ways
of estimating 8, and hence of jackknifing.
n.
Let
= Ini,
where n1 refers to the number of observations in the i-th treatment.
First consider the unweighted case where
S
a
=1=1
I
n1
I Y1j In.
j=l
Then
8(ij)
= (anY•• -Y1j)/(an-l),
VarJ = II(Yij-y•• )2/an (an-l),
= (a-l)~a2/a(an-l)
+
~2/an
Now suppose we use the weighted mean
a
ni
S =1=1
I
I Y1j/ani.
j=l
where each observation is weighted by the reciprocal of the number of
41
observations in the same treatment, so that each treatment receives an
equal weight.
For this case the jackknife estimates will be subscripted
with JW to show that it is the weighted version that is being used.
We
find
= Y•• +(Yi.-Yij)/a(n-1),
VarJW = (an-1)II(Yij-Yi .)2/a n(n-1)2,
E(VarJw) = (an-1)q2/a2n(n-l)
8(ij)
3
(2) AT: Resample the levels of the A treatment.
this means omitting one of the treatments at a time.
For jackknifing
We find
8( 1) = (ay.. -Yi. )/(a-1),
VarJ
= I(Yi ._y•• )2 /a(a-1),
E(VarJ)
= qa 2 /a
+ q2/an
(3) OWA: Resample observations within the A treatments.
jackknife there are two ways we could proceed.
For the
First consider omitting
a observations at a time, one from each treatment.
e·
We will assume that
the j-th pseudo-value is the result of omitting the j-th observation
from each of the treatments, to obtain
8(j)
= (anY.. -liYij)/a(n-1),
VarJ = Ij (Ii Yi j -ayi. )2 /a2 n(n-l),
E(VarJ)
= q2/an
This method is not very intuitive as there is no reason why the j-th
observation in one treatment should be omitted along with the j-th, and
not any other, observations in the other treatments.
This results in
the presence of the term IiYij, in the expressions for 8(j) and VarJ,
which is difficult to interpret.
One way to avoid this would be to take
all combinations of one observation from each treatment, and omit one
~
42
combination at a time to obtain the "jackknife" estimate.
This would
require considerable computation.
Another way to jackknife within treatments is to consider 8 as the
mean of each of the treatment means, and then jackknife the samples for
each treatment.
The jackknife estimate would then be the mean of the
jackknife estimates for each treatment.
This is similar to the idea of
a separate ratio estimate, used in sampling theory, where a different
ratio estimate 1sused to estimate each stratum total, and these are
then summed to
esti~ate
the population total.
These jackknife estimates
will be subscripted by JS to indicate that the "separate jackknife
estimator" has been used.
We ·find
= (nYi.-Yij)/(n-l),
VarJs = n:(Yij-Yi .)2/a2n (n-l),
E(VarJs) = ~2/an •
8i(j)
We see that VarJS has the same expectation as VarJ, but now all the
means involved in the calculations have a clear meaning.
(4) Stratified sampling:
Several authors have investigated the use
of the jackknife in stratified sampling.
This is similar to the one-way
layout, with strata in place of treatments.
In this case it is usually.
a function of stratum means that is being estimated, and so it makes
sense to keep the same weight on each stratum when resampling as when
making the original estimate (often this weight will not be the same for
each stratum but will depend on the stratum population size).
Jones (1974) jackknifes by omitting one observation at a time, to
obtain the 8(ij), but then uses the stratum means of these (as was done
in the OWA separate jackknife) to obtain the jackknife estimate and
43
variance estimate of 8.
Lemeshow and Epp (1977) jackknife in a similar
e
fashion, omitting a single observation to obtain 8(ij), and then
defining the pseudo-values using the stratum sample sizes instead of the
complete sample size.
If we use these methods (with equal weight for
each stratum and ignoring the finite population corrections) for the
situation in this chapter then we find that both methods give
= 0,
8J
VarJ = I:E(Yi j -Yi .)2 /a2n(n-1).
This is the same as the OWA separate jackknife estimator.
However for
non-linear functions of the observations these methods will not be the
same, as the mean of the 8(ij)'S will not necessarily coincide with the"
mean of the stratum means of the 8i(j)'S.
We can bootstrap using the following methods:
Bootstrapping.
(1) OBS: As in the case of the jackknife we have the option of
using either a weighted or unweighted resamp1ing method.
Let the
subscript W indicate that a weighted bootstrap has been used.
Let rij
be the number of times that the observation YiJ is included in the
bootstrap sample.
Then rij - M(an,an;l/an, .•• ,l/an),
= 1,
E(rij)
Var(riJ)
= (an-1)/an,
Cov(rij ,n'
= -l/an,
j' )
(i,j)t!(i' ,j'),
and we find
8*
= IIri
Var* (8*)
j
Yi j lan,
= II (Yi j -y.. )2 /a2n2 ,
E{Var*(8*)}
8w*
= (a-l)~a2/a2n
= Iri j Yi j /ri.
.
+
(an-1)~2/a2n2,
e·
44
However the bootstrap variance of 8w* is not finite.
This is due
to the fact that there is a positive probability that there will be a
treatment without any observations in the bootstrap sample, and the
weighted bootstrap method gives equal weight to each treatment.
(2) AT: Let ri be the number of times the i-th treatment is
included in the bootstrap sample. Then ri - M(a,ajl/a, ..• ,l/a).
We have
E(ri) = 1,
Var(ri)
= (a-l)/a,
Cov(ri,ri') = -l/a, i1i',
and find
8* = !riYi ./a,
Var*(8*) = !(Yi.-y•• )2/a2 ,
E{Var*(8*)} =
(a-l)~a2/a2
+
(a-l)~2/a2n.
(3) OWA: Let riJ be the number of times that the observation YiJ is
included in the i-th treatment of the bootstrap sample.
riJ - M(n,njl/n, ••• ,l/n) independently for each i.
Then
We have
E(riJ) = 1,
Var(riJ) = (n-l)/n,
Cov(riJ,riJ'-) = -l/an, j1j', i=l, .•. ,a,
and find
8* = !!ri J Yi J lan,
Var* (8*) = !! (Yi J-Yi .)2 /a2n2 ,
E{Var*(8*)} =
(n-l)~2/an2.
We could also define the "separate bootstrap" analogous to the
"separate jackknife", but it turns out that this is identical to the
ordinary OWA bootstrap.
45
(4) Regression Model:
If we consider the Qi'S to be fixed
we can
consider this as a regression problem, and bootstrap using the
"regression model" algorithm of Freedman (1981) (see example 1.3.3.).
To obtain a full rank model we restrict
so that
= Y.. ,
9
= Yi.-Y•• ,
Qi
A
-
-
The bootstrap sample consists of the values
yi j *
= Yi. +ei j * ,
where the eiJ* are resampled randomly with replacement from the set of
~ij'S.
Let riJ be the number of times that
bootstrap sample.
Then riJ
8*
~
~ij
is used in the
M(an,anjl/an, ... ,l/an), so that
= Y•• +IIrij~ij/an,
The regression model bootstrap mean and variance estimates are the same
as those of the OWA bootstrap.
COI/1PsrisoD of Methods.
Consider the bootstrap and jackknife
variance estimators of this section.
The only estimator that is
unbiased for Var(9) is the AT jackknife estimator.
may still be of some use.
However the others
The AT bootstrap estimator is asymptotically
unbiased, when a, the number of treatments, tends to infinity.
If a is
not very large then using the bootstrap could result in badly biased
estimates, although in this case we probably could not expect reliable
46
results from either method.
This can often be the case in genetics
where, for example, the treatments may refer to different populations,
which may be few.
Resampling observations gives an unbiased estimate of
= ~a2/an
Var(Yij)/an
+
~2/an,
if both a and n are large (using either the bootstrap or jackknife).
The OWA jackknife (ordinary and separate estimators) are unbiased for
Var(g) if the treatments are considered to be fixed, i.e. the quantities
~i
in the model (3.3.1) are fixed and so have zero variance, and are
equal to their expected values.
The OWA bootstrap, regression model
bootstrap and ODS weighted jackknife also estimate this quantity if n is
large.
If n remains bounded and the number of treatments tends to
inf~nity
results.
then these last estimators will not give the correct asymptotic
This was noted by Bickel and Freedman (1984) for OWA
bootstrapping (in a stratified sampling situation), and they suggested
that the bootstrap could be rescaled if this was a problem.
Example.
To illustrate the above methods we will consider some
data discussed by Swofford and Selander (1981).
The data set shows
whether or not an individual is heterozygous at each of 15 loci, with up
to five alleles per locus.
Individuals are from six populations with
four to eleven individuals sampled per population.
We will analyze the
indicator variable which has the value one for heterozygotes, and is
zero otherwise.
Consider the data for the first sample where there are 11
individuals and the number of heterozygotes at each of the 15 loci are
1,0,4,0,0,0,0,4,0,0,0,0,0,7 and 0.
We consider loci as the A factor,
47
with variance component ~a2.
The heterozygosity estimate (using any
4It
method) is the average heterozygosity
-h = 16/165 = 0.097.
The variance estimates for each of the methods are given in table 3.1.
along with their expected values.
Since these expected values are in
terms of the variance components of the linear model, one may be
interested in what is obtained if the variance components in these
expressions are replaced by their estimates
Table 3.1
Resampling estimates of Variances of average
heterozygosity for population 1 of Swofford and Selander.
OBS
AT
OWA
Method
Estimate
J
0.00053
JW
0.00038
B
0.00053
0.0056~a2
+
0.00602~2
J
0.00256
0.06667~a2
+
0.0060~2
B
0.00239
0.0622~a2
+ 0.0056&2
JS
0.00035
0.0060~2
B
0.00031
0.00551~2
Expectation
0.0056~a2
+
0.0060~2
0.00663~2
Key: J=jackknife, B=bootstrap, W=weighted, S="separate" estimator.
e·
48
= 0.05697,
ff2
ff a 2 ='0.03315.
It turns out that this gives the resampling variance estimates. As will
be seen in the next section this is not the case for unbalanced data.
3.4 One Way Model - Unbalanced Data
~odel.
Consider the model
Yi j
=S
i=l, ... , a,
+ (u + ei j ,
(3.4.1)
j=l, •.• ,ni,
with the terms having the same expectations and covariance matrix as
those in (3.3.1).
Let
n. =
~ni,
The estimate of S is
9
a
= y •• = ~i=l
ni
~
Yij/n.,
j=l
with
Unless otherwise stated SJ and E*(S*) are both the same as the original
estimate 9 in all cases.
Jackknifing.
We can jackknife using the following methods:
(1) OBS: Resample the observations.
Unlike section 3.3 we consider
only the unweighted jackknife here, since this is the estimator we are
using for 9.
We find
S(ij)
= (n.Y•. -Yij)/(n.-l),
VarJ = n:(Yij-y•• )2/n.(n.-l),
49
(2) AT: Resample the A treatments.
Omitting one treatment at a time
we find
8( i)
8(.)
8J
= (n.Y•. -ni Yi • )/(n.-Di),
= n.~(Y •• -Yi .)/a(n.-ni) +
= as
VarJ
~Yi
./a
- (a-l)8(.)
2
2
= (a-l)(~8(i)-a8(.
»/a,
E(VarJ)
2
2
2 2
= (a-l){[(a-l)?:(N2-Di
)/(n.-ni)
->:?:(N2-ni-ni'
1
iill'
+
[(a-l)~l/(n.-ni )-~~(n.-Di-ni'
i
i ;Ii i'
2
)/(n.-Di )(n.-ni' »)cra
)/(n.-Di )(n.-ni' »)cr 2 }/a2
Weighting each 8(i) equally is not very satisfactory, as it leads
to a jackknife estimate, 8J, which isn't equal to
a,
and it also gives a
This
complicated variance estimate (with a complicated expectation).
can be avoided by using a weighted jackknife, weighting each 8(i)
proportional to the number of observations used to obtain it.
weights are (n.-ni)/(a-l)n •.
8w ( .)
These
This leads to
= y ••
(3) OWA: Resample observations within the A treatment.
We consider
only the separate jackknife estimator here, since there are a different
number of observations for each of the treatments.
It is difficult to
imagine a sensible way in which to omit one observation from each of the
treatments at a time unless all possible combinations are taken.
Letting
s = ~ni Si In. ,
e·
50
the separate
jackk~ife
gives
= (niYi.-Yij)/(ni-l),
8i(j)
= ~~ni(Yij-Yi.)2/n~(ni-1),
E(VarJs) = a 2 /n ••
Vans
We can bootstrap using the following methods:
Bootstrapping.
(1) OBS: Let rij be the number of times that the observation Yij
is included in the bootstrap sample.
Then rij - M(n.,n.jl/n., ... ,l/n.),
and
8*
= ~~rijYij/n.,
Va~(8*)
=
~~(Yij-Y .. )2/n~,
= (n~-N2)aa2/n~
E{Var*(8*)}
(2) AT:
+
(n.-l)a2/n~
•
As for the jackknife, weighted and unweighted estimators
will be considered.
Let ri be the number of times the i-th treatment is
included in the bootstrap sample.
ri - M(a,aj1/a, ••• ,1/a).
Then for the unweighted bootstrap
We have
= !ri Yi . la,
E*(8*) = ~Yi./a,
8*
Var*(8*) =
{~Yi~-(~Yi.)2/a}/a2,
E{Var*(8*)}
= (a-l)aa 2/a2
+ (a-l)(~1/ni)a2/a3.
Now suppose we sample treatments with probability proportional to
the number of observations in the treatment.
ri - M(a,ajnl/n., ... ,na/n.).
E(ri)
Then
We have
= ani In. ,
= ani (n.- ni)/n~,
Cov(ri , ri
= -ani ni /n~,
Var(ri)
I
and find
)
I
ir!i',
51
8w*
= :iri Yi •la,
Var*w(8w*)
= :ini(Yi.-Y•• )2/an.,
E{Var*w(8w*)}
= (n~-N2)qa2/an~
+ (a-l)q2/an •.
(3) OWA: Let rij be the number of times that the observation Yij is
included in the i-th treatment of the bootstrap sample.
rij
~
M(ni ,nijl/ni , ••. ,l/ni) independently for each i.
Then
We find
8* = :i:irt j Yi j In. ,
= :i:i(Yij-Yi.)2/n~,
E{Var*(8*)} = (n.-a)q2/n~ •
Var*(8*)
COJ1/psrison of Methods.
Unlike the balanced case, none of these
methods give a variance estimate which is unbiased for Var(Y•• ).
Based
on results for the balanced case we would expect one of the AT
jackknives to be close, especially if there is not severe unbalance.
would also expect the AT bootstraps to be close for large a.
weighted jackknife has the correct coefficient for the
q2
We
The AT
e-
term, but
apart from that any comparisons would require numerical calculation of
the coefficients.
Resampling observations will give an estimate with expected value
close to n.Var(Yij) for large a and ni and if the unbalance is not
severe.
The OWA jackknife variance is unbiased for Var(Y•. ) if
treatments are fixed, and the OWA bootstrap variance is consistent for
this if the number of treatments remains fixed as the number of
observations tends to infinity.
Example.
To illustrate the methods for unbalanced data, population
five of Swofford and Selander (1981) will be used.
This data is given
in table 3.2, where Yi. is the number of heterozygotes at a particular
~
52
Table 3.2
Data for population 5 of Swofford and Selander.
locus (i)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Di
3
3
4
4
4
4
4
4
2
2
1
4
3
4
4
Yi.
0
0
0
0
0
1
0
0
0
0
0
0
0
2
0
Table 3.3
Resampling estimates of variances of average
heterozygosity for population 5 of Swofford and Selander.
Method
OBS
AT
OWA
Estimate
Expectation
EE
J
0.00115
0.01894a'a 2 + 0.0200&2
0.00115
B
0.00113
0.0185&a 2 + 0.019600- 2
0.00113
J
0.00185
0.072400-a 2 + 0.020120"2
0.00158
JW
0.00183
0.0717&a 2 + 0.020000- 2
0.00157
B
0.00122
0.06222O"a 2 + 0.0217&2
0.00159
BW
0.00143
0.061870-a 2 + 0.018670- 2
0.00143
JS
0.00093
0.020000- 2
0.00100
B
0.00070
0.014000- 2
'0.00070
Key: J=jackknife, B=bootstrap, W=weighted, S="separate" estimator.
EE=estimated expectation - see text.
53
locus.
The variance estimates are given in table 3.3 along with their
expectation.
Once again we look at what happens if we replace the
variance components in the expectations by their estimates.
so obtained are in the column headed EE.
The values
Henderson's method 3 was used
to obtain the variance component estimates.
This method equates
reductions in sums of squares from a sequential analysis-of-variance
(quadratic forms in the vector of observations) to their expectations.
The estimates obtained are
aa 2
a2
= 0.00797,
= 0.05000.
In some of the cases the EE column is different from the estimate
column.
The variance component estimation procedure equates two
quadratic forms to their expectation and solves the resulting system.
The resampling variance estimate is also a quadratic form, but not
necessarily the same as one of those used for estimating the variance
components.
When this is the case, equating the resampling variance to
its expectation gives an equation that is inconsistent with those used
for variance component estimation (Searle, 1971, p.455).
In some cases
the same quadratic forms are used so that the estimated expectation is
the same as the resampling variance estimate.
The expectation of the AT
weighted jackknife is quite close to
Var(Y.. )
= 0.072aa 2
3.5 Two Way Model
~odel.
Consider the model
+ 0.020a2.
e·
54
Yijk = 8 + O:i + 8j + Yij + eijk,
i=l, ••• ,a,
(3.5.1)
j=l, ... ,b,
k=l, .•• , n,
where E(O:i) = E(8j) = E(Yij) = E(eijk) = 0,
2,
Var(O:i) =
eTa
Var(8j) =
eTb 2 ,
Var(Yij) =
eTc 2 ,
Var( ei j k) =
eT 2 ,
E(YijkYi' j' k' ) =
82 +
eTa 2
+
eTb 2
+
eTc 2
82 +
eTa 2
+
eTb 2
+
eTC 2 ,
82 +
eTa 2,
82 +
82
+
eT 2 ,
i=i' , j=j' , k=k'
i=i' , j=j' ,
i=i' ,
eTb 2 ,
,
k~k'
j~j'
i~i'
, j=j'
i~i'
,
j~j'.
The estimate of 8 is
S = y •.. = !Yijk/abn,
with
Var(S) =
eTa
2 /a +
eTb 2
/b +
eTc 2
/ab +
eT 2
/abn.
In this section 8J and E*(8*) are always the same as the original
estimate S.
The expected values of the variances are given in table
3.4.
Jackknifing.
We can jackknife using the following methods:
(1) OBS: Resample the observations.
observation at time.
For jackknifing omit one
First consider the unweighted case where we find
8(ijk) = (abnY ... -Yijk)/(abn-l),
VarJ = III(Yijk-y••. )2/abn(abn-l).
Now consider the weighted case.
Let nij refer to the number of
55
observations in the (i,j)-th treatment combination.
Suppose we estimate
8 for unbalanced data by using the weighted mean
a
a =i=l
~
b
~
mj
~
j=l k=l
Yijk/anij.
where each observation is weighted by the reciprocal of the number of
observations in the same treatment combination, so that each treatment
combination receives an equal weight.
8(ijk)
We find
= y .. +(Yij.-Yijk)/ab(n-I),
(2) ABT: Resample the levels of the AB treatment combinations.
For
jackknifing this means omitting one of the treatment combinations at a
time.
We find
8( i j) = (aby... -Yi j . )/ (ab-l) ,
VarJ
= ~~(Yij.-Y ... )2/ab(ab-I).
(3) A+BT: Resample both A and B treatments.
For jackknifing this
means omitting one A treatment and one B treatment at a time.
8(ij)
We find
= (abY... -bYi .. -aY.j.+Yij.)/(a-I)(b-I),
= (ab-I)~~(Yij.-bYi .. -aY.j .+(a+b-l)y... )2/ab(a-I)2(b-I)2.
E(VarJ) = (ab-l)qa 2 /a(a-l) + (ab-l)qb 2 jb(b-l)
VarJ
+ (ab-I)(a+b-l)qc 2 /ab(a-I)(b-l) + (ab-I)(a+b-l)q2/abn(a-l)(b-l)
(4) AT: Resample the levels of the A treatment.
For jackknifing we
find
8( i)
VarJ
= (aY... -Yi .. ) / (a-I) ,
= ~(Yi .. -Y... )2/a(a-I).
(5) BT: Resample the levels of the B treatment.
Results for this
case can be obtained from those for case (4) by symmetry.
56
(6) OWAB: Resamp1e observations within each treatment combination.
~
We will only consider the "separate jackknife" estimator as it is more
".
intuitive than omitting one observation from each treatment combination
at a time.
The samples for each treatment combination are jackknifed
and the jackknife estimate is the mean of the jackknife estimates for
each treatment combination.
8ij(k)
VarJs
We find
= (nYij.-Yijk)/(n-1),
= ~n:(Yijk-Yij.)2/a2b2n(n-1).
(7) OWA: Resamp1e observations within the A treatment.
"separate jackknife" will be used.
Again the
The samples for each A treatment are
jackknifed by omitting one observation and the jackknife estimate is the
mean of the jackknife estimates for each treatment.
8i(jk)
VarJs
We find
= (bnYi •• -Yijk)/(bn-1),
= ~~~(Yijk-Yi .. )2/a2bn(bn-1).
e-
(8) BWA: Resamp1e B treatments within the A treatments.
This time
the "separate jackknife" is implemented by jackknifing the A treatments
and omitting one B treatment at a time.
The jackknife estimate is the
mean of the jackknife estimates for each treatment. We find
8i(j)
VarJs
= (bYi •. -Yij.)/(b-1),
= ~~(Yij.-Yi .• )2/a2b(b-1).
(9) OWB: Resamp1e observations within the B treatment.
Results for
this case can be obtained from those for case (7) by symmetry.
(10) AWB: Resamp1e A treatments within the B treatments.
Results
for this case can be obtained from those for case (8) by symmetry.
Bootstrapping.
We can bootstrap using the following methods:
(1) OB8: Only the unweighted bootstrap will be considered, as the
57
weighted method does not give.a finite variance estimate.
Let rijk be
the number of times that the observation Yijk is included in the
bootstrap sample.
Then rijk - M(abn,abnjl/abn, •.. ,l/abn), and
= IIIrijkYijk/abn,
8*
(2) ABT: Let rij be the number of times the (i,j)-th treatment
combination is included in the bootstrap sample. Then
rij - M(ab,abjl/ab, ..• ,l/ab).
8*
We find
= IIrijYij./ab,
(3) A+BT: For the bootstrap we resample both A and B treatments.
Let ri be the number of times the i-th A treatment is included in the
bootstrap sample, and Sj be the number of times the j-th B treatment is
included in the bootstrap sample.
Then ri - M(a,ajl/a, .•• ,l/a),
independently of Sj - M(b,bjl/b, .•. ,l/b).
We find
(4) AT: Let ri be the number of times the i-th A treatment is
included in the bootstrap sample. Then ri - M(a,ajl/a, ... ,l/a).
8*
We find
= IriYi .. /a,
Var* ( 8*)
)2/ a.
2
= I (-Yi •• -Y...
(5) BT: Results for this case can be obtained from those for case
(4) by symmetry.
(6) OWAB: Let rijk be the number of times that the observation Yijk
is included in the (i,j)-th treatment combination of the bootstrap
58
sample.
Then rijk - M(n,njl/n, ••. ,l/n) independently for each i and j.
We find
8*
= IIIrt j kYi j k/abn,
Var* (8*)
= III (Yi j k-Yi j • )2 /a2b2n2 •
(7) OWA: Let rijk be the number of times that the observation Yijk
is included in the i-th A treatment of the bootstrap sample.
rijk - M(bn,bnjl/bn, ..• ,l/bn) independently for each i.
8*
Then
We find
= IIIrijkYijk/abn,
Var* (8*)
= III (Yi j k-Yi .. )2 /a2b2n2 •
(8) BWA: Let rij be the number of times that the mean of the
(i,j)-th treatment combination is included in the i-th A treatment of
the bootstrap sample.
each i.
Then rij - M(b,bjl/b, •.. ,l/b) independently for
We find
= IIrijYij./ab,
Var*(8*) = II(Yij.-Yi •• )2/a2b2 •
8*
(9) OWB: Results for this case can be obtained from those for case
(7) by symmetry.
(10) AWB: Results for this case can be obtained from those for case
(8) by symmetry.
Comparison of 1I1ethods.
None of the variance estimators discussed
is unbiased, or even consistent, for Var(S).
If we resample A
treatments the variance estimate does not include the
resampling the B treatments does not include the
~a2
~b2
term.
term, and
Resampling
AB treatment combinations, and resampling both A treatments and B
treatments together also fail to give a consistent estimate of Var(S),
as is now discussed.
59
Table 3.4 Coefficients of the variance components in the expected
value of the resampling variance estimates of section 3.5.
Method
OBS
J
JW
B
ABT
J
B
AT
OWA
BWA
lTb 2
lTc 2
a-I
aOlbn I)
b-l
b(OOn I)
0
0
0
a-I
ab-l
- ab(abn I)
1
85ii
abn-l
a2 b2 n(n I)
"86ZJi
b-l
ab-l
a 2 b2 n
abn-l
a2 b2 n2
a-I
a(ab I)
b-l
b(ab I)
aD
1
1
abn
a-I
b-l
azoz
ab-l
ab-l
a 2b2n
aD
1
1
abn
a-I
a-I
a2 bn
a5Z
1
0
a-I
0
JS
0
0
0
B
0
0
0
JS
0
B
0
JS
0
B
0
J
lT 2
azDn
azo
B
OWAB
lTa 2
a
az-
b-l
ab(bn I)
b-l
azo
b-l
ab(bn I)
n-l
aJ5iiZ
1
abn ..
b-l
bn-l
OO2 n2
1
1
abn
"86ZJi
abZii
1
aD
aD
b-l
b-l
a5Z
1
85ii
aoz
bn-l
"86ZJi
For explanation of methods, and A+BT coefficients, see the text.
60
The AT jackknife is unbiased except that it does not include the
term in Ub 2 , and so would be unbiased if Ub 2 =0 (and by symmetry the BT
jackknife is unbiased if Ua 2 =0).
The AT jackknife is also unbiased for
Var(9) if we consider the B treatments to be fixed.
The assumptions for
the Yij for this case are either that
Yij
=0
for all i and j,
or that we consider the Yij to be random and independent for all i and
j.
We cannot use the usual assumption in such a mixed model, which is
that
!iYij
(Hocking, 1973).
=0
The AT bootstrap estimates Var(9) if a is large and
Ub 2 =0.
The ABT methods give an estimate of
Var(Yij.)/ab
for large a and b.
e-
= (Ua 2 +Ub 2 +Uc 2 +u 2 /n)/ab
The BWA methods also estimate this quantity,
providing A is fixed, and b is large if the bootstrap is used.
Resampling observations gives an estimate of
= (ua 2 +Ub 2 +Uc 2 +u 2 )/abn
Var(Yijk)/abn
if a and b are large.
The OWAB methods and the OBS weighted jackknife give an estimate of
Var(9) if both A and B are considered to be fixed and n is large.
The
OWA methods give an estimate of
Var(Y.j.)/b
= (Ub 2 +uc 2 /a+u 2 /an)/b
if b is large.
The A+BT methods do not give estimates of any variance of interest.
61
3.6 Three Factor Model With Crossed and Nested Factors
Model.
Consider the model
Yijk = 8 + a:i + /3j + Yik + Oij + eijk,
i=l, ••• ,a,
(3.6.1)
j=l, ••• ,b,
k=l, ••. , c,
where E(a:i) = E(/3j) = E(Yik) = E(Oij) = E(eijk) = 0,
Var(a:i) =
Q'a 2 ,
Var(/3j) =
Q'b 2 ,
Var(Yik) =
Q'c 2 ,
Var(oij) =
Q'd 2 ,
Var(eijk) =
Q'2,
E(Yi j kYi' j' k' ) =
82 +
Q'a 2
+
82 +
Q'a 2
+
82 +
Q'a 2 ,
,
Q'd 2 ,
+
Q'c 2,
Q'b 2 ,
82 +
82
Q'b 2
i=i~ j=j~
k]!!k'
i=i~ j]!!j~
k=k'
i=i~ j]!!j~
k]!!k'
i]!!i~
j=j'
i]!!i~
j]!!j'.
Factor C (corresponding to the Yik term) is nested in factor A.
The
estimate of 8 is
9 = y... = !Yijk/abn,
with
Var(9) =
Q'a 2 /a
+
Q'b 2
jb + Q'c 2 /ac + Q'd 2 /ab + Q'2/ abc.
In this section 8J and E*(8*) are always the same as the original
estimate 9.
Table 3.5 contains the expected values of the variances.
Jackknifing.
We can jackknife using the following methods:
(1) OBS: Resamp1e the observations (i.e. the levels of the BC(A)
62
treatment combinations).
8(ijk)
VarJ
~
For jackknifing we find
= (abcY••• -Yijk)/(abc-l),
= !!!(Yijk-y••• )2/abc(abc-l).
(2) ABT: Resample the levels of the AB treatment combinations.
For
jackknifing we find
8(ij)
VarJ
= (abY .•. -Yij.)/(ab-l),
= !!(Yij.-y••• )2/ab(ab-l).
(3) C(A)T: Resample the levels of the C(A) treatment.
For
jackknifing we find
8(ik)
VarJ
= (acY••• -Yi .k)/(ac-l),
= !!(Yi .k-y••• )2/ac(ac-l).
(4) AT: Resample the levels of the A treatment.
For jackknifing we
find
8(i)
VarJ
~-
= (aY... -Yi .. )/(a-l),
= !(Yi •• -y... )2/a (a-l).
(5) BT: Resample the levels of the B treatment.
For jackknifing we
find
8(j)
VarJ
= (by••. -y.j.)/(b-l),
= !(y.j.-y••• )2/b(b-l).
(6) OWAB: Resample observations within each AB treatment
combination.
As in the previous section only the "separate jackknife"
estimator will be considered.
8i j ( k)
We find
= (CYi j •-Yi j k ) / (c-l) ,
VarJs = !!!(Yijk-Yij. )2/a2b 2c(c-l).
(7) OWC(A): Resample observations within each C(A) treatment.
samples for each C(A) treatment are jackknifed by omitting one
The
63
observation- (Le. one of the BC(A) treatment combinations) and the
jackknife estimate is the mean of the jackknife estimates for each
treatment.
We find
8ik(j)
VarJS
= (bYi .k-Yijk)/(b-l),
= ~~~(Yijk-Yi.k)2/a2b(b-l)c2.
(8) OWA: Resample observations within each A treatment.
The
samples for each A treatment are jackknifed by omitting one observation.
We find
8i(jk)
VarJs
= (bcYi •• -Yijk)/(bc-l),
= ~~~(Yijk-Yi .. )2/a2bc(bc-l}.
(9) C(A)WA: Resample C(A) treatments within each A treatment.
The
samples for each A treatment are jackknifed by omitting one of the C(A)
treatments.
We find
8i (k)
VarJs
= (CYi .. -Yi .k )/(c-l),
= ~~(Yi.k-Yi .. )2/a2c(c-l).
(10) BWA: Resample B treatments within each A treatment.
The
samples for each A treatment are jackknifed by omitting one B treatment
at a time.
We find
8i(j) - (bYi •• -Yij.)/(b-l),
VarJs
= ~~(Yij.-Yi .• )2/a2b(b-l).
(11) OWB: Resample observations within each B treatment.
8j(ik)
Vans
We find
= (acy.j.-Yijk)/(ac-l),
= ~~~(Yijk-Y.j. )2/ab2c(ac-l).
(12) AWB: Resample A treatments within each B treatment.
8U)i
= (aYd.-Yij.)/(a-l),
VarJs
= ~~(Yij.-Y.j.)2/a(a-l)b2.
We find
64
We can bootstrap using the following methods:
Bootstrapping.
.(1) OBS: Let rijk be the number of times Yijk appears in the
bootstrap sample.
~
Then rijk
M(abc,abcjl/abc, ..• ,l/abc).
= IIIri j kYi j •/abc,
8*
Var* (8*)
= III (Yi j k-Y ... )2 /a2b2 c2.
(2) ABT: Let rij be the number of times the (i,j)-th
combination is included in the bootstrap sample.
rtj
~
We find
M(ab,abjl/ab, ... ,l/ab).
8*
AB treatment
Then
We find
= IIrt jYi j . lab,
Var*(8*)
= II(Yij.-y••• )2/a2b2 .
(3) C(A)T: Let rik be the number of times the k-th C treatment in
the i-th A treatment is included in the bootstrap sample.
rik
~
M(ac,acjl/ac, ... ,l/ac).
8*
Then
We find
e-
= IIrikYi.k/ac,
Var*(a*)
= II(Yi .k-y.•. )2/a2c2.
(4) AT: Let ri be the number of times the i-th A treatment is
included in the bootstrap sample. Then ri
~
M(a,ajl/a, .•• ,l/a).
We find
= Iri Yi •• la,
Var*(8*) = I(Yi .. -y... )2/a2.
8*
(5) BT: Let rj be the number of times the j-th B treatment is
included in the bootstrap sample. Then rj
8*
~
M(b,bjl/b, ..• ,l/b).
We find
= IrjY.j ./b,
Var*(8*)
= I(Yd .. -y... )2/b2.
(6) OWAB: Let rijk be the number of times that the observation Yijk
is included in the (i,j)-th AB treatment combination of the bootstrap
sample.
Then rijk
~
M(c,cjl/c, ... ,l/c) independently for each i and j.
65
We find
8*
= ~~~rijkYijk/abc,
Var*(8*)
= ~~~(Yijk-Yij.)2/a2b2c2.
(7) BWC(A): Let rijk be the number of times that Yijk is included
in the (i,k)-th C(A) treatment of the bootstrap sample.
rijk - M(b,bjl/b, •.• ,l/b) independently for each i and k.
Then
We find
= ~~~rijkYijk/abc,
Var*(8*) = ~~~(Yijk-Yi .k)2/a2b2c2.
8*
(8) OWA: Let rijk be the number of times that the observation Yijk
is included in the i-th A treatment of the bootstrap sample.
rijk - M(bc,bc;l/bc, •.. ,l/bc) independently for each i.
Then
We find
= ~~~rijkYijk/abc,
Var*(8*) = ~~~(Yijk-Yi •• )2/a2b2c2.
8*
(9) C(A)WA: Let rik be the number of times that the mean of the
(i,k)-th C(A) treatment is included in the i-th A treatment of the
bootstrap sample.
i.
Then rik - M(c,cjl/c, •.• ,l/c) independently for each
We find
= ~~rikYik./ac,
Var*(8*) = ~~(Yi .k-Yi •• )2/a2c2.
8*
(10) BWA: Let rij be the number of times that the mean of the
(i,j)-th AB treatment combination is included in the i-th A treatment of
the bootstrap sample.
each i.
Then rij - M(b,bjl/b, ... ,l/b) independently for
We find
= ~~rijYij./ab,
Var*(8*) = ~~(Yij.-Yi .• )2/a2b2.
8*
(11) OWB: Let rijk be the number of times that the observation Yijk
66
is included in the j-th B treatment of the bootstrap sample.
rijk.~
M{ac,ac;l/ac, .•• ,l/ac) independently for each j.
Then
We find
= ~~~rijkYijk/abc,
Var*{8*) = ~~~{Yijk-Y.j .)2/a2b2c2.
8*
(12) AWB: Let rij be the number of times that the mean of the
(i,j)-th AB treatment combination is included in the j-th B treatment of
the bootstrap sample.
each j.
Then riJ
~
M{a,ajl/a, •.. ,l/a) independently for
We find
= ~~rijYij ./ab,
Var* (8*) = ~~(Yij ._y.j.)2 /a2b2 •
8*
COJllparison of Methods.
As in the last section we find that none of
the estimators is unbiased for Var(9) when we have two random effects
that are crossed.
However some methods do give an unbiased estimate of
Var(9) if certain factors are fixed.
As in the last section if a factor
is fixed then it is assumed that its interaction with a random effect is
random, and the corresponding terms from (3.6.l) are i.i.d. over all
their subscripts.
The interaction between two fixed effects is fixed.
The AT methods provide an estimate of Var(9) if B is fixed
(assuming a is large for the bootstrap).
The BT methods provide an
estimate of Var(9) if both A and C are fixed (assuming b is large for
the bootstrap).
The BWA methods provide an estimate of
Var{Yi .• )/a
= {aa2+Ub2/b+Uc2/c+Ud2/b+u2/bc)/a
if A and C are fixed (assuming b is large for the bootstrap).
The
C{A)WA methods also provide an estimate of Var(Yi .. )/a if A and B are
fixed (assuming c is large for the bootstrap).
~
67
Table 3.5 Coefficients of the variance components in the expected
value of the resampling variance estimates of section 3.6.
Method
OBS
O"a 2
J
AT
OWAB
0"2
b-l
b(abc-I)
ac-l
ac(abc-I)
ab-l
ab(abc-I)
abc
abc-l
a2 b2 c2
1
b-l
ac-l
a 2 bc2
ab-l
a2 b2 c
a-I
a(ab I)
b-l
b(ab I)
a-I
ac(ab I)
ao
abc
a-I
b-l
8bZ
a.zDc
ab-l
ab-l
a2 b2 c
J
a-I
a(ac-I)
0
_l_
ac
B
a-I
a2 c
0
J
_1_
a-I
J
B
BT
O"d 2
8bZC
B
C(A)T
O"c 2
a.zDc
B
ABT
a-I
a(abc-I)
O"b 2
a-I
82D
a
iT""
J
0
B
0
1
a-I
1
azJ)2'"
ac-l
a-I
ab(ac-I)
abc
a-I
ac-l
a2 bCZ
1
azcz
a.zDc
0
_I_
ac
ao
abc
0
a-I
a2 c
82D
a-I
a-I
a2 bc
1
0
b-l
0
[)
~
JS
0
0
B
0
0
I
1
,
abc
b-l
b-l
ab 2 c
8bZ
0
c-l
0
aDcZ
1
ao
_1_
ac
1
."
1
abc
c-l
abc2
68
Table 3.5
Method
OWC(A)
O'a 2
(continued)
O'b 2
O'c 2
O'd 2
0'2
1
0
1
abc
1
abc
b-1
0
JS
0
B
0
JS
0
B
0
C(A)WA JS
0
0
_1_
ac
0
1
abc
B
0
0
c-12
ac
0
c-1
abC2
JS
0
ao
1
0
ao
1
1
abc
B
0
b-1
0
b-1
b-1
ab 2 c
JS
a-I
ab(ac-I)
0
a-I
ab(ac I)
abC
a-I
0
ac-1
a2 bc2
a-I
ac-1
a2 bc2
1
ao
0
1
a-I
0
a-I
a.zo
OWA
BWA
OWB
B
azDc
abc
iibZC
b-1
ab(bc-I)
b-1
iibZC
8l)'2"
b-1
c-l
ac(bc I)
JS
c-1
B
azo
b-1
For explanation of methods see text.
bc-1
aI)ZC2
8l)'2"
1
abc
azDc
1
abc
aozc
aIi'CZ
abC
iibZC
b-1
ab(bc I)
1
azDc
,
AWB
b-1
iibZC
.,'
1
ao
1
abc
a-I
a-I
a2 bc
69
The AWB methods provide an estimate of
Var(Y.j.)/b
= (qa2!a+qb2~c2!ac~d2!a~2!ac)/b
if B is fixed (assuming a is large for the bootstrap).
The OWAB methods
also provide an estimate of Var(Y.j.)/b if A and B are fixed (assuming
c is large for the bootstrap).
The OWC(A) methods provide an estimate of
Var(Yi.k)!ac
= (qa2~b2/b+qC2~d2/b+q2/b)!ac
if A and C are fixed (assuming b is large for the bootstrap).
The C(A)T
methods also provide an estimate of Var(Yi.k)!ac if B is fixed
(assuming a is large).
The ABT methods provide an estimate of
Var(Yij.)!ab
= (qa2~b2~c2!~d2~2!c)!ac
(assuming a and b are large).
The OWB methods provide an estimate of
Var(Yijk)!abc
= (qa2~b2+qc2~d2~2)!abc
if B is fixed (assuming a is large).
The OWA methods provide an
estimate of Var(Yijk)!abc if A is fixed (assuming b and c are large).
The OBS methods also provide an estimate of Var(Yijk)!abc (assuming a
and b are large).
70
4.
RESAMPLING METHODS FOR ESTIMATING THE COANCBSTRY COEFFICIENT
-e
4.1 Estimating the Coancestry Coefficient.
This chapter contains an investigation of the performances of
different resampling methods when used to find the distributional
properties of an estimate of the coancestry coefficient, 8.
This
parameter has been used as the basis for a distance measure by Reynolds
et al (1983), and they give several different methods for estimating 8.
Using simulations of monoecious populations they suggest that the best
estimator of distance is the one based on a weighted estimate of 8.
Weir and Cockerham (1984) modified this formula to include heterozygote
frequency information as well as gene frequency information.
This
estimator, 8w, is the one that will be considered here.
FOT1/1ula for 8w.
The formula for 8w is based on an analysis-of-
e-
variance of gene frequencies for each allele and locus in the sample.
Let a/U, b/u, and C/U be the observed components of variance for the
between populations, between individuals within populations and between
gametes within individuals effects respectively, from the analysis for
the l-th locus and u-th allele.
Then
Ea/u
= P/u(l-P/u)8,
Eb/u
= P/u(l-P/u)(F-8),
EC/u
= p/u(l-p/u)(l-F),
where F is the inbreeding, and P/U is the frequency of the u-th allele
at the l-th locus.
An
estimator of 8 is
9 = a,u/(a/u+b1u+C/U),
(4.1.1)
which is unbiased to the extent that the expectation of a ratio is the
tit
71
ratio of expectations.
The weighted estimator combines information for multiple alleles
and several loci by summing the numerator and denominator of (4.1.1)
over alleles and loci to give
8w
= ~I~ualu/~l~u(alu+blu+clu)
In this study we shall only consider calculating
populations.
(4.1.2)
8w
for pairs of
For a pair of populations, i and j, the specific formulas
for the variance components used in (4.1.2) are as follows:
blu
= nl{s~u-[Plu(1-Plu)-s~u/2-nlu/4]/(nl-l)}/nel'
= nl[Plu(1-Plu)-s~u/2-(2nl-l)nlu/4nl]/(nl-l),
CIU
= nlu/2,
alU
where
= (nil+nJI)/2,
nel = nilnJI/nl,
PIU = (nilPi lU+nJlPJIU)/ 2nl,
nl
s~u
nlU
= [ni I (Pi IU-PlU )2+Dj I (PJ IU-PlU)2 ]lnl'
= (Di l hilu+nJlhJlu)/ 2nl,
nil is the sample size (of individuals) for population i and locus 1,
PilU is the sample frequency of allele u at locus 1 in population i, and
hilu is the sample frequency of heterozygotes with allele u (number of
heterozygous individuals with allele u divided by the total number of
individuals) at locus 1 in population i.
4.2 Resampling Methods for the Distribution of
The Delta Method.
8w
We can apply the delta method by considering
8w
as a function of the genotype frequencies and sample size for each locus
72
and population.
Note that 8w is not a function of the genotype
frequencies alone so we cannot use the simpler formula (1.4.3).
Let
nilUU' be the number of individuals in the i-th population with ordered
alleles u and u' at the l-th locus.
Then
nil = ~nilUU
+ uJl!u'
~ ~niIUU' ,
U
PilU
= ~u'
(nilUU' +niIU' U)/ 2ni l,
hi lU =u~ Jl!u(ni lUU' +ni lU' U)/ni l .
'
We obtain the following partial derivatives:
aUl/anilUU = 1/2,
OUI/ani IUU' = 1/2,
One l/om lUU' =(nj I/UI)2 /2,
so that
+ Ul[PIU(1-PlU)-sju/2-nlU/4]/2(u.I-l )2
+
[PiIU(2-PiJU)+PIU(1-PIU)-1-s~u/2-nIU/4]/2(UI-l)
}/nel
73
+
nl[P1U(1-P1U)-S~u/2-filU/4]/2(nl-1)2
+
[-p2ilU+P'U(1-P'U)-S~u/2-filU/4]/2(n,-1)
}/ne,
ob,u/ODi lUU' = -{ Pi lU (I-Pi lU )+P1U (l-PIU )-s~u/2
where u1u', U1u", but possibly u'=u".
Using formula (1.4.2) with independent populations and independent
loci we get
+
~~njluU'
(oSw/onjlUU' )2
Now
with, for example,
+
o(~a,u")/oDilUU'
u' ,
j1tQ
or u'
74
Jackknifing and Bootstrapping.
These methods can be used to
estimate the bias and variance of 8w by resampling the loci, assuming
they are independent.
To jackknife, one locus is omitted from the
sample, for both populations, at a time.
sampled with replacement.
To bootstrap, the loci are
Each time a particular locus is sampled the
data for that locus for both populations is included in the bootstrap
sample.
Another way to Use these methods is to resample individuals.
The bootstrap can also be used to give a confidence interval (using
either the percentile or bias-corrected percentile methods).
Once the
confidence interval has been obtained it can then be inverted and used
as a test for 8 equal to some hypothesized value 80:
The test rejects
the hypothesis if the bootstrap confidence interval does not contain 80.
The infinitesimal jackknife can also considered as a method to find
the variance of 8w.
Since 8w is a function of gene and heterozygote
sample frequencies (i.e. sample averages) the infinitesimal jackknife
will give the same results as the delta method (see section 1.4).
For
this equivalence to hold, the resampling for the infinitesimal jackknife
must be on the same units as the variances and covariances that are used
in the delta method.
An alternative to the infinitesimal jackknife is
,
to use (1.4.5), which gives a numerical approximation VarIJE to VarIJ
and does not require computation of the derivatives.
4.3 Simulation Study
Method.
The above methods for estimating the bias and variance of
8w have been investigated through Monte Carlo simulation studies.
The number of individuals of each genotype, for each of m loci,
75
with v alleles at each locus,
were stored for generations 5, 25 and 50
of a randomly mating idealized population consisting of 50 individuals
in each generation.
equilibrium.
The founder population was in Hardy-Weinberg
The Monte Carlo study consisted of 100 trials, i.e. 100
pairs of populations were obtained.
Resampling individuals was performed only with the jackknife and
done only for the two allele, one locus case.
When there was more than
one locus, the multi-locus genotypes of each individual were not stored,
and so resampling methods could not be used on individuals.
Results.
Table 4.1 gives the results of Monte Carlo studies for
two different situations: ten loci with three alleles at each locus, and
one locus with two alleles.
Both cases used unlinked loci, and has
founder populations with equi-frequent alleles.
Results are given for
generations 5, 25 and 50 for the fir&t case, and for generations 5 and
50 for the second case.
For each trial (pair of populations) 8w was calculated and the mean
and sample variance of these values were calculated (the "Me" line in
table 4.1).
These were used as the "true values" (E(8w) and Var(8w)) as
a basis for comparing the different resampling methods.
The theoretical
value of 8, 1-(1-1/2N)t, where t is the generation number and N is the
number of individuals in each generation, is also given.
The "8w mean" column also contains the Monte Carlo mean of the
jackknife estimates for the jackknife, and the Monte Carlo mean of the
bootstrap means of the 8w's calculated for each bootstrap sample.
From
this the values in the "Bias Estimate" column, which are actually the
Monte Carlo means of the bias estimates, may be obtained.
The jackknife
76
Table 4.1.
m v
Gen Method
Resamp1ing methods estimates for 8w
8w mean
Bias
Estimate
10
3
5
25
50
Theory
0.04901
Me
0.04922
Variance Estimates
Mean
CV
fMSE
0.000328
:1
. 0.04938
-0.00016
0.000302
78.9
0.00024
B
0.04906
-0.00016
0.000267
78.4
0.00022
1:1
0.000270
78.3
0.00022
DM
0.000100
26.4
Theory
0.22218
Me
0.21446
:1
0.21651
-0.00205
0.003354
57.5
0.00193
B
0.21249
-0.00197
0.002920
55.0
0.00164
1:1
0.002931
56.3
0.00169
DM
0.000224
14.0
0.003277
Theory
0.39499
Me
0.37341
:1
0.37801
-0.00460
0.006600
50.4
B
0.36909
-0.00432
0.005637
49.0 -'0.00322
1:1
0.005660
48.5
DM
0.000235
19.8
0.007287
0.00339
0.00319
77
Table 4.1 (continued)
m v
Gen Method
8w mean
Bias
Estimate
1
2
5
Me
0.05022
J(Ind)
0.05122
Me
0.26597
J(Ind)
0.26873
-0.00100
IJ
= Infinitesimal
= Monte
-0.00276
Jackknife on individuals, Theory
0.001826
107.3
0.001754
107.7
£
0.002865
65.3
0.002640
68.4
= Jackknife,
= 0.001, DM = Delta
Carlo trials, J
Jackknife with
CV
fMSE
0.066327
DM
Key to Methods: Me
Mean
0.006450
DM
50
Variance Estimates
= theoretical
B
= Bootstrap,
Method, J(Ind)
value of 8.
=
78
estimate of bias is 8w-SJ.
The bootstrap estimate is the bootstrap mean
when 8w-8 is bootstrapped.
If 8w is bootstrapped (as was done here) the
bootstrap estimate of bias is the bootstrap mean minus 8w.
-~
The
bootstrap mean and variance were calculated here using 1000 bootstrap
replicat ions.
Also given are the Monte Carlo mean, coefficient of variation (CV)
and root mean square error (fMSE) of the variance estimates.
As noted
earlier, the value in the variance estimate mean column for the "Me" row
is the Monte Carlo variance of the 8w's.
The Monte Carlo means of the bootstrap 95% confidence intervals are
given in table 4.2.
percentile method.
The confidence intervals were obtained using the
Also given are the Monte Carlo means of the
confidence intervals
{ E*(8w*)-1.96[Var*(8w*)]1/2, E*(8w*)+1.96[Var*(8w*)]1/2 }
e-
These assume a normal distribution, and use the mean and variance of the
bootstrap distribution.
These two methods may be compared with the
central 95% of the 8w's obtained from the 100 Monte Carlo trials.
4.4 Discussion of Simulation Results
Delta flfethod estimates.
For the three allele case all the methods
gave similar results except for the delta method.
The variance
estimates for the delta method are considerably smaller than those for
the other resampling methods and the variance of the Monte Carlo trials.
The reason for this was that the delta method was used with the
genotypes of individuals, which are multinomially distributed.
gave the variance for sampling individuals from the population.
This
This
e
79
can be seen in the two allele, one locus case where the variance
estimates using the jackknife on individuals were found to be similar to
those of the delta method.
Effect of unit resBJlJpled.
Resampling different units (loci or
individuals here) can result in quite different variance estimates, as
the variances being estimated are different.
Resampling individuals
imitates the statistical sampling process (i.e. sampling the current
generation).
Resampling unlinked loci imitates the genetic sampling of
each generation from the founder population: replicate pairs of
populations are generated in a similar way to the different pairs of
loci within a population.
If the value of 8 for two particular populations is of interest then
resampling methods should be used at the level of the individual to give
the variance of 8w for those two populations.
If 8w is used as a
distance measure between the two populations then the genetical sampling
from the founder population should be taken into account and the
variance of this process can be estimated by resampling methods applied
to unlinked loci.
Now let us consider this problem as an extension of the linear
model.
We have two crossed factors, individuals and loci, and are
considering a (non-linear) function of the gene and heterozygote
frequencies of each individual sampled.
If we were estimating the mean,
then, according to the results in section 3.5, resampling individuals
gives a variance estimate which does not include the variance component
from loci, and resampling loci gives a variance estimate which does not
include the variance component from individuals.
It would appear that
80
this is still approximately true for the quantity we are estimating,
-e
even though we do not have the exact situation from section 3.5, and so
the extent that each variance component is accounted for can not be
quantified by the results of section 3.5.
As resampling loci gives a
variance estimate which is very close to the Monte Carlo variance, we
can conclude that the component of variance due to individuals is not
very important.
Comparison of Locus Res8J11pling Methods.
The methods that resample
loci (in the three allele case) give results that are very close to the
Monte Carlo mean and variance.
estimates are similar.
The bootstrap and jackknife bias
They both indicate that 8w is too small,
although 8w is sometimes larger and sometimes smaller than the
theoretical value of 8.
However these bias estimates are negligible
compared to the estimate of standard error, so their sign is of little
importance.
The bootstrap estimates of variance are more biased for than those
of the jackknife for each generation.
The infinitesimal jackknife gives
variance estimates similar to those of the bootstrap.
Comparing the
methods on the basis of the root mean square error of the variance
estimates shows that the bootstrap and infinitesimal jackknife perform
similarly, and are better than the jackknife.
Confidence Intervals for 8 •
The confidence intervals from the
percentile method are very close to those obtained by assuming a normal
distribution.
The confidence intervals obtained from the Monte Carlo
trials have some skewness (the midpoint of the confidence interval is
greater than the mean of the estimates), especially for generation 5,
e·
81
Table 4.2. Percentile method and normal theory confidence intervals
for 8 based on 8w: ten locus, three allele case.
Generation
5
25
50
Method
Confidence Interval
Midpoint
Me
(0.01945,0.09770)
0.05857
Percentile
(0.02139,0.08053)
0.05096
Normal
(0.01702,0.08110)
0.04906
Me
(0.09644,0.33202)
0.21523
Percentile
(0.11465,0.31535)
0.21500
Normal
(0.10657,0.31841)
0.21249
Me
(0.21465,0.51281)
0.36373
Percentile
(0.22392,0.50779)
0.36585
Normal
(0.22193,0.51625)
0.36909
Key: Method: Percentile
= Bootstrap
interval using the percentile
method,
Normal
= confidence
using the normal distribution and the
mean and variance of the bootstrap distribution.
Me
= Central
95% of the Monte Carlo trials estimates.
Midpoint: the center of the confidence interval.
82
and the percentile method shows some of this skewness.
Both the
-e
percentile and the normal intervals are too small, this being a result
of the resampling variance estimates being smaller than the "true" (over
the Monte Carlo trials) variance.
Linkage and Unequal Allele Frequencies.
Two further simulation
studies were done to assess the effect of linkage, and of a founder
population with unequal allele frequencies.
ten loci and three alleles.
Both simulations were for
The first had equi-frequent alleles in the
founder population, but assumed that the loci were ordered with linkage
A
= 0.9
(recombination
= 0.05)
between adjacent loci.
This gave similar
results, in terms of the sizes of the variance estimates, to the case of
no linkage.
The second additional simulation had unlinked loci with
founder population allele frequencies of 0.6, 0.3, and 0.1.
The results
were qualitatively the same as for the case of equi-frequent alleles,
but the variance of the Monte Carlo trial values of 8w increased by 15%
to 60%, and the resampling methods variance estimates increased by 15%
to 40%.
4.5 Example: Monterey Pine Data
The Data.
In this section results are given for resampling
estimates of F-statistics for a set of Monterey Pine data obtained from
S.H. Strauss (personal communication).
The data set consists of three
populations, with 10 stands within each population.
individuals sampled from each stand.
There are 5 to 11
There are 28 loci sampled with up
to 5 alleles at a locus.
EstiJ1l8tion procedure.
The methods of Weir and Cockerham (1984)
e·
83
were used to give estimates of the following F-statistics:
F:
the correlation of genes within individuals,
81: the correlation of genes between individuals within stands,
82: the correlation of genes between stands within populations,
fl: the correlation of genes within individuals within stands,
f2: the correlation of genes of different individuals within stands
within populations,
fa: the correlation of genes within individuals relative to genes
between stands within populations.
The relationship between these parameters is given by the equations:
fl
= (F-81)/(1-81)
f2
= (81-82)/(1-82)
= (F-82)/(1-82).
fa
(4.5.1)
The jackknife and the bootstrap were performed both by resampling
stands and by resampling loci.
The bootstrap distribution was estimated
using 100 bootstrap replications.
The F-statistics estimates and their
resampling estimates and standard error estimates are given in table
4.3.
Bias Estimates.
There is little difference between the original
estimates and either of the jackknife estimates.
This is consistent
with the results of the previous section where none of the resampling
bias estimates were of importance.
However bootstrapping over stands
indicates that the estimate of 82 is biased upwards, although none of
the other resampling methods indicate this.
This suggests that one
should be cautious when using this estimate, although using the
bootstrap estimate and standard error estimate from resampling stands
84
would still reject the hypothesis that 82 is zero.
All other bootstrap
·e
estimates are similar to the original estimates with the exception of
the estimates of f2 and f3 obtained by bootstrapping stands.
Equations
(4.5.1) show that this is related to the effect bootstrapping stands has
on the estimate of 82.
Variance Estimates.
Comparing this situation with the linear model
in section 3.6 it would be expected that resampling loci ignores the
variance component due to stands and individuals within stands, while
resampling stands ignores the variance component due to loci.
We find
that the two ways of resampling (either loci or stands) give variance
estimates which are similar for fl,f2 and f3 but are quite different for
F,81 and 82.
The first three parameters are correlations relative to the stand
or population, whereas the latter three parameters are correlations
relative to the founder population.
Suppose we assume that the three
populations have arisen from tl generations of randomly mating
(including selfing) populations of size N from a founder population, and
that the stands have arisen from ta generations of randomly mating
groups of size N from the populations.
larger than ta.
We will assume that h
is much
Then F and 81 are equal, and depend on tl and ta j 8a
depends on tlj fl is zerOj and fa and f3 are equal, and depend on ta.
We can see from the data that this model does not hold (for example F 1
81), but the above dependencies on tl and ta still hold.
Resampling loci for F, 81, and 8a gives an estimate of the variance
of the genetic sampling process (which depends on tl).
As the three
parameters fl' fa, and f3 are correlations relative to a current
e-
85
Table 4.3 F-statistics calculations for Monterey Pine data
F
81
f1
82
f2
f3
Est
0.1689
0.0862
0.0624
0.0905
0.0254
0.1136
3(S)
0.1690
0.0863
0.0625
0.0904
0.0254
0.1136
B(S)
0.1696
0.0868
0.0552
0.0906
0.0336
0.1212
3(L)
0.1684
0.0859
0.0619
0.0900
0.0257
0.1133
B(L)
0.1684
0.0842
0.0603
0.0905
0.0256
0.1138
SEJ (S)
0.0218
0.0131
0.0105
0.0209
0.0073
0.0213
SEa (S)
0.0210
0.0128
0.0107
0.0184
0.0067
0.0189
SEJ CL)
0.0340
0.0235
0.0260
0.0200
0.0070
0.0203
SEa (L)
0.0361
0.0248
0.0270
0.0206
0.0074
0.0217
Key:
Est
3
= point
estimate,
denotes a jackknife estimate,
B denotes a bootstrap estimate,
(S) denotes an estimate obtained by resampling stands,
(L) denotes an estimate obtained by resampling loci,
SEJ denotes a standard error estimate (square root of the variance
estimate), obtained by jackknifing, and
SEa denotes a standard error estimate, obtained by bootstrapping.
86
(sub)population (and so depend on t2 but not tl), we could expect the
"4It
variance components due to genetic sampling, for these cases, to be
small.
This is consistent with the bootstrap and jackknife results
presented here, where resampling stands and resampling loci give similar
results.
e-
87
5.
THE USE OF RESAMPLING METHODS IN GENETICS: CONCLUSION
Resampling methods have a wide range of uses in the analysis of
genetic data.
They can give good estimates of the sampling distribution
(in particular the bias and variance) of statistics calculated from
genetic data, where no other theory exists which gives such estimates.
They are also useful in providing confidence intervals for a parameter
(possibly multivariate).
However they must be used with some caution,
as they do not always give satisfactory results.
This is most often
true when there are several conceivable ways of resampling, with each of
the ways giving different results.
Resampling methods have already been used by many authors to
analyze genetic data, as has been discussed in chapter two.
They have
mainly been used to find variance estimates of statistics such as
heritability, genetic correlations and F-statistics estimates.
A new
use of resampling methods has been given in the construction of
"confidence intervals" for a non-ordered estimate.
The case discussed
in chapter two is the construction of such a "confidence interval" for
an estimated phylogeny.
In many situations in genetics there is more than one way of
resampling the data.
For example we may have a sample of individuals
scored at each of several loci, and we could resample the different
individuals, or resample the different loci.
resampling will give different results.
Different methods of
The correct method of
resampling, if it exists, must be determined.
Often the data is collected for certain crossed or nested factors.
88
From these cases some general guidelines can be given:
(1) If we have one random factor then resampling should be on the levels
of this factor. To give reasonable results we should have more than
just a few levels of the factor.
(2) If one random factor is nested in another, the treatments of the
highest level factor should be resampled.
(3) If there are crossed random factors there is no method of resampling
which will give consistent variance estimates.
e-
(4) If a factor is fixed resampling should be done within that factor,
so that the variation due to the different levels of that factor are
not included in the estimates.
(5) If the data is unbalanced, then the levels of the factor being
resampled should be weighted proportional to the number of
observations in that level.
Although these guidelines are based on the results for estimating
the mean of a linear model, they should serve as a guide in more complex
(for example non-linear) situations, since resampling methods have
generally been found to work well in such situations.
In some situations where we have crossed random factors, we may
still be able to use resampling methods to give a variance estimate.
To
do this we need to know that one of the factors has a variance component
89
which is much larger than the variance component for the other factor.
Resampling the factor with the larger variance component will give a
variance estimate which may be adequate if no other method of estimating
the variance exists, although it will generally be biased downward.
This situation has been illustrated by the simulations in chapter
four, where for each simulated data set a parameter was estimated, and
then resampling methods were used to estimate the variance of this
estimate.
In this case the variance component due to loci was much
larger than the variance component due to individuals.
Resampling
methods applied to loci gave variance estimates which were only
moderately biased downward, while resampling individuals gave variance
estimates that were much smaller than the true variance (variation among
the simulated values).
These simulations also illustrate how the
guidelines given above generalize to situations other than the linear
cases from which they were developed.
90
6.
THE EFFECT OF FAMILY STRUCTURE IN GENETIC DATA
-e
6.1 Family Structure in Genetic Data
Introduction.
Often, in the analysis of genetic data, it is
assumed that the data being analysed results from a random sample from
the entire population, whereas it may be a sample from one or more small
parts of the total population.
One consequence of this is that the
sample may contain individuals that are closely related, say full-sibs
or half-sibs, more often than would be expected if the sample was taken
randomly from the entire population.
When a statistic is calculated it is often interpreted on the basis
of its expectation and variance, calculated under the assumption that
the data is a random sample from the whole population.
In this chapter
the effect of family structure in the data is investigated.
This is
done by finding the expectations and variances for four statistics - the
sample gene frequency, a measure of one-locus (Hardy-Weinberg)
disequilibrium, and two measures of two-locus (linkage) disequilibrium.
The expectations and variances are calculated assuming that the data
contains either full-sib or half-sib families, or no family structure at
all.
Genetic
~odel.
It is assumed that the sample population has
descended from an infinite founder population in Hardy-Weinberg and
linkage equilibrium.
Each successive generation consists of a randomly
mating idealized population of size N up to the (t-2)-th generation.
It is assumed that the sample consists of data on
with ni individuals from the i-th family, i=l, ... ,k.
~ni
individuals,
Each family is
e-
91
assumed to be of full-sibs or half-sibs (but not a mixture).·· The
parents for each family are assumed to be unrelated (this implies that
the (t-l)-th generation is infinite).
The case ni=l, i=l, ... ,k will be
referred to as the "no family structure" case, and will be compared with
results obtained by assuming a randomly mating idealized population of
size N for t generations, which will be referred to as the "classical"
case.
6.2 Notation
Frequencies of Gene Combinations.
Throughout this chapter it is
assumed that each locus has only two alleles, A and a for the first
locus, and B and b for the second locus.
For the one locus cases we
will need the population frequencies of combinations of up to four A
genes.
The frequency of the A allele will be denoted by p.
The
frequencies for the other gene combinations are denoted by a P followed
by two to four A'S, two A's for combinations of two genes, and so on.
If two genes are in different individuals, but in the same family they
are separated by a bar; if two genes are in different individuals in
different families they are separated by a double baf; otherwise the two
genes are distinct genes in the same individual.
For example
P:
is the frequency of individuals homozygous for A
pt
is the frequency that a gene from each of two individuals in the
same family are both A
~
is the frequency that a gene from each of two individuals in
different families are both A
~
is the frequency that for two individuals in different families, one
92
-e
is homozygous for A and a gene from the other is A
~
is the frequency that a gene from each of four individuals, two of
which are in the same family, are all A.
For two loci, to avoid ambiguity, the frequency of the A allele is
denoted by PA, and the frequency of the B allele is denoted by PB.
One
and two bars still denote that two genes are in different individuals,
in the same family and in different families respectively.
A diagonal
bar between the letters A and B denotes that two genes are in the same
individual but on different gametes.
If the letters A and B are on the
same line, and are not separated by any bars then the genes are in the
same gamete.
p"1
For example
is the frequency that a gamete from an individual carries A and B
pAy is the frequency that for two individuals from different families, a
gene from the first locus of one individual is A, and a gene from
the second locus of the other individual is B
~
is the frequency that for two individuals from the same family, a
gene from the first locus of one individual is A, and for the other
individual a gene from the first locus is A and the gene from the
second locus, on the opposite gamete to the gene from the first
locus, is B
one-locus Descent
~e8sures.
Cockerham (1971) are used.
The measures of gene identity given by
For the measures for generation t, different
subscript letters are used to indicate that genes are in different
individuals.
A double dot over a subscript indicates that the measure
is for both genes of that individual.
Individuals from different
families are distinguished as follows: X,Y,Z, and W denote individuals
e-
93
from one family; U and V denote individuals from a second family; T
denotes an individual from a third family; and S denotes an individual
from a fourth family.
Fx
For example
is the usual inbreeding coefficient for individual X: The
probability that both genes in X are identical by descent,
is the probability that a gene in X and a gene in U (in a
8xu
different family) are identical by descent,
is the probability that a gene in X, a gene in Y (in the same
YXYU
family) and a gene in U (in a different family) are all identical
by descent,
SXUT
is the probability that both genes in X, a gene in U (in a
second family) and a gene in T (in a third family) are all
identical by descent,
~Y.ZU
is the joint probability that a gene in X and a gene in Y (in the
same family) are identical by descent; and a gene in Z (in the
same family as X and Y) and a gene in U (in a different family)
are identical by descent, and
~Y+UT
is the average of two probabilities.
The first is the joint
probability that a gene in X and a gene in U ,(in a second family)
are identical by descent; and a gene in Y (in the same family as
X) and a gene in T (in a third family) are identical by descent.
The second is the joint probability that a gene in X and a gene
in T are identical by descent; and a gene in Y and a gene in U
are identical by descent.
The measures of gene identity for generation t-l will be left
without subscripts, as all the measures for a particular number of
94
alleles are the same.
This is because this generation is the result of
(t-l) randomly mating generations from the founder population, so that
..
any set of genes has the same probability of identity by descent, no
matter how they are arranged in individuals.
For example y denotes the
probability that any three genes in the (t-l)-th generation (two of
which mayor may not be from the same individual) are identical by
descent.
TWo-locus Descent ~easures.
For the (t-l)-th generation let IT be
the probability of nonidentity of two genes (at the same locus); @ be
the probability of nonidentity, at the two loci of interest, of two
gametes; r be the probability of nonidentity of one gamete at the two
loci, with one gene from each of the two loci; and
a be the probability
of nonidentity of two pairs of genes from each of the two loci.
SUJDS of Powers of S8Jllple Sizes.
Na
= IDi ,
= Ini 2,
= IDi a,
N4
= Ini 4 .
n.
N2
Letting I denote summation over families, and I denote
J
i
family members, define the constants
81
=
II 1
82
=
II I 1
8a
=
II I
S5
=
i
= n.,
J
i j
~J'
I 1
iJ~J' ~J"
= N2 -n. ,
e·
Let
summati~n
over
95
ni ni ni'
i;l.i' j;l.j' j" 1
= n.N2-n~-N3+N2,
S6
=
~ ~ ~ ~ ~
S7
=
~ ~ ~ ~ ~
ni ni ni' ni'
~ 1
i;l.i' j;l!j' j";I!j'"
= N2-N4-2n.N2+2N3+n.-N2,
Sa
=
~ ~ ~ ~
nini ni ni'
~ ~ 1
i;l!i' j;l!j' ;l!j"j'"
= n. N3-N4-3n.N2+3N3+2n~-2N2,
S9
=
~ ~
S10
=
~ ~
ni ni' ni"
~ 1
i ;I! i' ;I! i " j j' j ~'
~ ~ ~
ni ni ni' ni"
~ 1
i ;I. i' ;I! i " j ;I! j' j" j '"
~ ~ ~ ~
2
2
= n~-3n.N2+2N3,
2
2
3
= n.N2+2N4-2n.N3-N2+3n.N2-2N3-n.,
4
2
2
= n.-6n.N2+8n.N3+3N2-6N4
(6.2.1)
6.3 Relationships Between Gene Combination Frequencies
Gene Combination Frequencies in Te.rJl1s of Descent Measures.
The
relationship between the gene combination frequencies for the t-th
generation and the one-locus descent measures in the t-th
generation are given below:
~
= pFx + p2 (l-Fx)
pt = paxv + p2 (l-axv)
Pi- = paxu + p2(l-axu)
~r = pyx v + 2p2 (axv-yxv) + p2 (Fx-yxv) + p3(1-Fx-2axv+2yxv)
~ = pyxu + 2p2 (axu-yxu) + p2 (Fx-yxu) + p3 (1-Fx-2axu+2yxu)
~=
~
~
~1:
pyxvz + 3p2 (axv-yxvz) + p3(1-3axv+2yxvz)
= pyxvu + 2p2(axu-yxvu) + p2(axv-yxvu) + p3(1-axv-2axu+2yxvu)
= pyXUT + 3p2(axu-YXUT) + p3(1-3axu+2yxuT)
= pSXY + 2p2 (Llx+y-Sxy) + p2 (Llx. y-SxY) + 4p2 (YXV-SxY)
+ 4p3(axv-2Yxv-Lli(+Y+2SxY) + 2p3(Fx-2yxv-Llx.Y+2Sxy)
+ p4(1-2Fx-4axv+8yxv+Llx.Y+2Llx+Y-6SxY)
96
~
~t
= pSxu
2p2(~+u-Sxu)
+
p2(~x.u-Sxu)
+
4p3(axu-2yxu-~+u+2Sxu)
+
p4(1-2Fx-4axu+~.u+8yxu+2~x+ij-6Sxu)
= pSxvz
+
2p2(~+vz-Sxvz)
+ 2p2(yxv-Sxvz) +
~
+
+
+
2p3(Fx-2yxu-~x.u+2Sxu)
p2(~.vz-Sxvz)
+
p3(axv-2yxvz-~.vz+2Sxvz)
+
p4(1-Fx-58xv+4yxvz+4yxy+~.yz+2~+yz-6Sxyz)
= pSxvu
+
2p2(~+yu-SXYu)
+
+
p3(Fx-2yxv-~x.vz+2Sxvz)
p2(~.YU-SXYu)
+
p3(Fx-yxy-yxu-~.yu+2Sxyu)
+
p4(1-Fx-28xv-3axu+2yxy+2yxu+4yxyu+~.yu+2~+yu-6Sxyu)
= pSxuv
2p2(~x+uv-Sxuv)
+
~
+
+
2p3(aXY-YXY-YXYU-~+yu+2Sxvu)
p2(~.uv-Sxuv)
+ 2p2(yXYU-Sxuv)
4p3(8xu-yxu-YXYU-~+uv+2Sxuv)
+
p3(axy-2yxyu-~.uv+2Sxuv)
+
p4(1-Fx-8xy-48xu+4yxu+4yxyu+~x.uv+2~+uv-6Sxuv)
= pSXUT
+
2p2(~X+UT-SXUT)
+
+
p3(Fx-2yxu-~x.uv+2Sxuv)
p2(~.UT-SXUT)
+ 2p2(YXUT-SXUT)
4p3(axu-YXU-YXUT-~+UT+2SxUT)
+
p3(axu-2yxUT-~.UT+2SxUT)
+
p4(1-Fx-58xu+4yxu+4yxuT+~.uT+2~+uT-6SxuT)
= pSXYZW
+
~
2p3(8xu-yxu-YXYU-~+yu+2Sxyu)
p3(axu-2yxyu-~.vu+2Sxyu)
+ 2p2(YXU-SXUT) +
~
+ 2p2(yxyu-SXYu)
+
+ 2p2(yxu-Sxuv) +
~
+ 2p2(yxvz-Sxvz)
4p3(axv-yxv-yxvz-~+vz+2Sxvz)
+ p2(yxu-SXYu) + p2(YXy-Sxvu) +
~
+ 4p2(yxu-Sxu)
+
3p2(~xy.zw-SXYzw)
+
p3(Fx-2yxu-~X.UT+2SxUT)
+ 4p2(yxyz-SXYzw)
6p3(exy-2yxyz-~y.zw+2Sxyzw)
= pSXYUV
+
2p2(~y+uv-SXYuv)
+
+
p4(1-68xy+8yxyz+3~xy.zw-6Sxyzw)
p2(~y.uv-SXYuv)
+
4p3(axu-2yxyu-~xy+uv+2Sxyuv)
+
p4(1-2axy-4axu+8yxyu+~y.uv+2~xy+uv-6Sxyuv)
= pSXYZU
+
3p2(~y.zu-SXYzu)
+
+ 4p2(yXYU-SXYuv)
2p3(8xy-2yxyu-~xy.uv+2Sxyuv)
+ 3p2(yxyu-SXYzu) + p2(YXYZ-SXYzu)
e·
97
+ 3p3(8xu-2yxvu-axV.2U+2SxV2U) + 3p3(8xv-YXV2-YXVU-axV.2U+2SxV2U)
+
~
p4(1-38xv-38xu+2yxv2+6yxvu+3~xv.zu-6Sxvzu)
= pSXVUT
+ 2p2(axV+UT-SXVUT) +
+ 2p2(yxvU-SXVUT) +
p2(~V.UT-SXVUT)
4p3(8xu-yxVU-YXUT-~XV+UT+2SxVUT)
+ p3(8xu-2YxUT-axV.UT+2SxVUT) +
+
~
+ 2p2(YXUT-SXVUT)
p3(8xv-2yxvu-~XV.UT+2SxVUT)
p4(1-8xv-58xu+4yxvu+4yxuT+axv.uT+2~xv+uT-6SxvuT)
= pSXUTS
+ 3p2(axU.TS-SXUTS) + 4p2(YXUT-SXUTS)
+ 6p3(8xu-2YxUT-axU.TS+2SxUTS) +
p4(1-68xu+8yxuT+3~xu.Ts-6SxuTs)
(6.3.1)
The expressions for frequencies that differ in that one of them refers
to individuals in the same family, and the other refers to individuals
in different families are quite similar, the only difference being in
the subscripts.
P:~
For example the descent measures in the expression for
are subscripted by X,Y and Z (members of the same family), whereas
the expression for
~
is the same as that for
P:~
except that here the
descent measures are subscripted by X,U and T (members of different
families.
Descent Measures Transition Equations.
The one-locus descent
measures for the t-th generation, in terms of the measures for the
(t-l)-th generation, are found using the expansions given by Cockerham
(1971).
These expansions give the descent measures for a particular
group of (possibly related) individuals in terms of the descent measures
of their parents.
The transition equations for full-sibs and half-sibs
of unrelated parents are given below (recall that X,Y,Z and Wbelong to
one family, U and V belong to a second family, S belongs to a third
family and T belongs to a fourth family).
98
Full-sibs
Fx
48xv
2yxv
yxu
IGyxyz
4yxvu
4AX .v
I1x .u
8AX.vz
4AX .uv
2AX •vu
32!1xv.zw
16i1xv.uv
8I1xv.zu
SAX+v
AX+u
16l1x+vz
411X+uv
4AX+vu
16!1xv+uv
4SXV
Sxu
8Sxyz
4Sxuv
2Sxyu
= 8xu
=
=
= YXUT
=
=
=
= AX .UT
=
= 4l1xv.uT
=
=
=
=
=
= .IlX+UT
=
= 4l1xv+uT
=
=
=
= SXUT
=
= 4SXYUT
=
=
8
1 +
38
8 +
=
y
y
1 +
98 +
6y
8 +
3y
8 +
2y +
= !1xU.TS =
11
4y +
11
+
311
y +
11
2 + 158 + 12y +
311
1 +
+
911
28 +
3y +
311
38 +
2y +
211
38 +
=
8
1 +
68
=
Gy+
211
Y +
311
8 +
y +
211
8 +
6y +
911
8 +
2y +
78 +
=
=
(6.3.2)
S
S
28 +
=
e-
11
1 +
= SXUTS
11
5y +
S
Y +
3S
y +
S
e
99
64Sxvzw
16Sxvljv
16Sxvzu
=
=
=
1 + 218 + 36)' +
6S
8 +
6)'+
9S
8 +
9y+
6S
Half-sibs
Fx
88xv
4yxv
YXU
32yxvz
Byxvu
2Llx.v
Llii .i.i
8Llx.vz
8Llii .uv
4Llii •vu
64Llxv.zw
64Llxv.uv
16Llxv.zu
4Llii+v
Llii+i.i
4Llx+vz
8Llx+uv
8Llii+vu
64Llxv+uv
2Sxv
SiW
= 8xu
=
=
= YXUT
=
=
=
= Llii.UT
=
= 8Llxv.UT
=
=
=
=
=
= LlX+UT
=
= 8Llxv+UT
=
=
=
= SXUT
=
8
1 +
78
8 +
=
3y
Y
1 +
98 + 22y
8 +
7y
Y+
= LlxU.TS =
Ll
8 +
=
Ll
8
3y+
4Ll
+ 7Ll
Y+
3Ll
1 + 158 + 16y + 32Ll
1 + 148
28 +
8 +
+ 49Ll
3y + llLl
Y+
=
2Ll
(6.3.3}
~
Ll
8 +
=
8 +
y +
2Ll
Y+
7Ll
Y+
6Ll
8 + 14y + 49Ll
y +
= SXUTS
=
S
S
....-
100
8 + 7y +
8S
Y+
7S
=
y +
3S
128Sxvzw =
1 + 158 + 48y +
64S
16Sxv2
=
8Sxuv
= 8SXVUT
4Sxvu
=
64Sxvuv
=
8 + 14y +
49S
32SxV2U
=
8 + 9y+
22S
Gene COJI/bination Frequencies in TerJ/lS of Generation t-] Descent
Measures.
Using the results (6.3.2) and (6.3.3) we can obtain
expressions for the frequencies of gene combinations in terms of the
descent measures for generation t-l.
Let
T*= p2 + p(1-p)8,
G*= p3 + p(l-p) [3p8+(1-2p)y] and
D*= p4 +
p(1-p)[6p28+4p(1-2p)y+3p(1-p)~+(1-6p+6p2)S].
e-
For full-sib families we find:
~
= Pi-
=
T*
4~
= p + 3T*
2~r
=
~ =~
=
16~
= p + 9T* + 6G*
4P1'
=
T* + 3G*
4~~
=
T* + 2G* +
P% = P2- =
T* +
G*
G*
~ =
D*
D*
8~t
=
2T* + 5G* +
D*
2p»
=
G* +
D*
=
G* + 3D*
4p;jt
64~
= 4~
= p + 21T* + 36G* + 60*
(6.3.4)
101
l~
=
T* + 6G* + 9D*
16~
=
T* + 9G* + 6D*
For half-sib families we find:
8F1-
=
=p
4~r
=
T* + 3G*
~=~
=
G*
32~
=p
=
=
~
=11-
8~
2~1:
T*
+
+
7T*
3T* + 22G*
T* + 7G*
~
+
=
4~
=
=
8I2-
= 8~
(6.3.5)
D*
P%=P2-=PS-=
16~~
~
T* + 7G* + 8D*
G* + 3D*
G* + 7D*
12~
=p
64~
=
T* + 14G* + 49D*
32~
=
T* +
+ 15T* + 48G* + 64D*
90* + 22D*
For the case of no family structure we find:
~
=F1-
~r
=~
=
= ~~ = ~~ =
~1:
=
T*
(6.3.6)
G*
D*
Two-locus Frequencies in Terms of Those for the Previous
Generation. The following equations give the relationship between the
frequencies of gene combinations at two loci for the t-th generation and
those for the (t-l)-th generation.
The frequencies on the left of the
equations are for the t-th generation, and those on the right side are
102
for the (t-l)-th generation.
Since we are assuming that parents are
distinct there is no family structure in the (t-l)-th generation, and
so different individuals are also in different families.
This is also
true for the t-th generation frequencies for the case of no family
structure.
The results for this case can be obtained by replacing all
single bars in the generation t frequencies by double bars and using
either the full-sib or half-sib results.
The linkage parameter is
denoted by A, and the recombination by c=(1-A)/2.
Full-sibs
Generation t
Generation t-l
2p'·
(l+A)P'· + (l-A)P'.
p'. =FAy
~
4fl6a
p'.
2~·=2J¥
(l+A)~
+
(l-A)~
2P': =2pAf
(l+A)p6f +
(l-A)~
8~
(l+A)P'· + (l-A)P'. +
+
+ p'. +
2p.;.
2~·
+ 2(1+A)ptI
2(1-A)~
(I+A)P'· + (I-A)P'. + 2P': + 2(I+A)p6f
+
2(I-A)~
2~r
~+~
2P'C
p&f + ~
4~
~+~+p:r+~
4~
~+~+P'C+~
I7 =Pi'l =19-
~
P1:=~=11
p6f
16~
..
103
16P6f
P" B + p" B + 21'6a + 2P": + 4P"-1 + P"C +
4Pf! .
~
+ p:r + 2~-r-
4~
I'6a
+ P"C + 2p6f
4P11
pAf +
4I1r
pta. + ~ + 2plf
4P::=4PH
(1+A)2ptf +
16PM
(1+A)2P"B + (1-A)2P"B + 2(1-A2 )P: B
+ 2plf
2(1-A2)~
+
(1-A)2~
+ 2(1-A2 )P": +
2(1+A2)~:
4(1-A2)~
2(1-A)2~
+
8~
~
(l+A)~
4~
+
+ (l+A)pAf +
2(1+A)2~
+
(l-A)~
+
(l-A)~
+ 2~1 + 2~
(l+A)I¥ + (l-A)~ + (l+A)~ + 2~ + 2~
+
8~
(l-A)~
(l+A)pAf +
+
(l-A)~
+ (l+A)PM +
2~
+
2~1
(l-A)~
2pY=2I11: =2P"..i=2pY
(l+A)ptI + (l-A)FtI
32PH
(l+A)P"B + (l-A)P"B + 2p: B + 2P": + 2(1+A)I¥
+ 2(l+A)pAf +
2(1-A)~
2(1-A)~
+
+ 2P::
+ 2(1+A)ptl+ 4~ + 4~1, + 4~ + .2(1-A)~.
(l+A)PH +
+
8PH
(l-A)~
(l-A)~
+
2(1+A)~
+ 2P:-{ +
2(l+A)~
+
2~1
2(l-A)~
(l+A)~
+
(l-A)~
+ 2(l-A)PN
8~
+ 2(l+A)PH +
2(1-A)~
(l+A)pAf +
+
2~
104
4~=
PH
111 =P: =IW =12 =1'1: =111
PH
16Pft
P'7 + pta- +
+ 2ptf +
~
PA.f +
~r
+ P"= + ~ +
PAt
+ PH + 2~ + 2~f +2~+~+~=
8~1
pta +~ +PH + 2~ + 2~ +~
8~
PA.f +p6If +PH + 2PN +
4pt1f=4PM
PH
4~
~ +PH +PN +~
4~
p6f + PH +Pff + ~I
4~=4PS
p6f + ~f + 2~
4P2=4I1f
~ +~ + 2PH
2P2=2~
PH + PN
64PH
p". + p". + 2P'7 +
2~f +~
+pti + 2P9'
2~·
+ 2P": + 4ptt- + 4JY-i
+
2~r
+ 2P"= +
4~
+ 4PAtJ + 2~: + 4PH
+
8~
+ 8~1 +
8~
+
4~
+ 2~=
IGPH
PM
I 61'S
P'7 + ~r + P"C +2~+2~+~=+2~f
+
+ 2ptf + 4pt1 +
2~
+~
~
+
4~
+ 4Iff
+ 4P9+ 2p6f + 2~f + ~PH + 4~ + 2~1
16PH
PA.f
IGIV
pta +~ + 2~ + 2~ + 4ptf + 4Pft + 2~
(6.3.7)
Half-sibs
Generation t
Generation t-1
2p"·
(l+X)P"· + (I-X)P"a
p". =I>O,r
P'7
8P-;-
p". + p". + 6P'7
e·
105
2~B
=2I1-!
(l+A)~
(l-A)~
+
2pA:=2pAf
(l+A)pA.f + (l-A)f6t
16PP
(l+A)pAB + (l-A)pAB +
+
16pA.f
+ 6(1+A)ptA
2~B
6(l-A)~
(l+A)pAB + (l-A)pAB + 2pA: + 6(1+A)pA-I
+ 6(1-A)f6t
~
+ 2~
4~r
ptI. +
4pAC
fl6t
8~
~+~r+~+~+4~
8~
~
~=J1I&=19
~
J>AC=~=~
~
32~
pAB + pAB + 2?;- +
+ p6f + 2pAf
+ pAC + pA.f +
~
+ 4p6f
+
+
4~
pAB + pAB + 2I'6a + 2pA: + 4pA.f + 2pAC +
4~
2~B
4~
+
2~r
+l~
32p6f
+ 16p6f
817
P<\g + ~r + 6~
811-
I'6a + pAC +
8P41
pA.f +
~
+ 6plf
811r
Pt'- +
~
+ 6p6f
4~:=4pA:d
(1+A)2~
32~
(1+A)2pAB + (1-A)2pAB +
16ptl
+
6~
2(1-A2)~
+
2(1+A2)~:
+
6(1-A)2~
(l+A)ptl +
+
2~1
+
+
(1-A)2~
2(1-A2)~B
6(1+A)2~
(l+A)~
2~
+
+
+
(l-A)~
+ 4(1+A)ptf +
+ 2(1-A2 )pA:
12(1-A2)~
+ (l-A)J'&l
4(l-A)~
106
16~1
(l+A)fA-f + (l-A)Jl6Il + (l+A)F¥. +
+
(l-A)~
(l+A)~
16~
+
+
4(1+A)~~
+
2~
+ 4(l-A)!'N
+ (l-A)~ + (l+A)PH + 2~ + 2~1
(l-A)~
4(1+A)~~
+
+ 4(1-A)!'N
(l-A)~
2Ft1=2Fi:=2fA~=2Pif
(l+A)PH +
64PH
(1+A)fA 1 + (l-A)fA B +
+
2~
2(1+A)~
2~1
2(1-A)~
+
+ 2fA: + 2(1+A)ptL
2(1-A)~
+
+
2~:
+ 2(1+A)PH + 4~ + 4~' + 4~
+
16(1+A)~
(l+A)F¥. +
16PH
+
+
+
+
6(1+A)~
+
(l-A)~
+ (l-A)Jl6Il +
2~1
+
6(l+A)~
2~
+
6(1+A)~
+
(l-A)~
+
6(l-A)~
8~
ffl- +p6f + 2fIN +2~+~1+~
2~=
fIN +~
11!1=~
=11I=P2=PI: =PS
32~
e-
6(1-A)~
(l+A)~
16ptl
2~
16(1-A)~
+
6(1-A)~
(l+A)~
l~
2(1-A)~
+
PH
~ +~+~ +
+2~+4PH
P6lf +
2~ + 2ffl- + 2~1
+ 4~ + 2~1 + 2~ +
16~1
~ +~ + 2~ + 4~ + 4~ + 4PH
IGPK
pAf +~ + 2~1 + 4PH + 4~ + 4PH-
8P8=8fA,j
PH
8~
~ + fIN +PN +~ + 4~
8111
p6f +
8~=8PS
~ + ~I + 6~-f
+~ + 6PH
PH
+PN + ~I + 4~-f
8PH
107
8P2=8P11-
~+~+6PH
4~=4pA~
PH+~+2PH
128PH
p".
+ pAl + 2~· + 2pA: + 4~ + 4pdf + 4~
+ 4~ + 4I1fL + 4p6f + 2~: + 8~f + 8~
+ 16PH+ 16PN + 4~1 + 4~ + 40ptf
PH
+ 2~ + 12111 + ~ + 12I1i + 36I1t
FAa + ~r + pAC + 6~ + 6pAf + ~C + 6~1
+
32~
~-{
+
36~
pdf + ~ + 2p6f + 2~1 + 4PH + 4~ + 2~1
+ 8PH
32pY-
I¥
+
+ ~ + 2~--fl + 2~ + 4PH + 4~ + 2~-{
8~
(6.3.8)
No family structure
Generation t
Generation t-l
2pA·
(l+A)pA· + (l-A)pA.
p".
=FAa
FAa
2~·=2PP
(l+A)ptI- +
(l-A)~
2P": =2pdf
(l+A)p6f +
(l-A)~
~r=~=~
~
P"C =P6If=~
~
4~:=4~-1-
(1+A)2ptf +
2~ =2~ =2~f=2PH
(l+A)PH + (l-A)ptf
~=~C =PN=~I=~ =~
PH
2(1-A2)~/2
+
(1-A)2~
(6.3.9)
108
Two-locus Frequencies in TerJlls of Descent Measures of the Previous
e
Using the above equations, and those of table 1 of Weir and
Generation.
Hill (1980), the two-locus frequencies for the t-th generation can be
written in terms of the two-locus descent measures for the (t-l)-th
generation.
Since gametes in an individual in the (t-l)-th generation
came from the pool of gametes in the previous generation at random,
there is no need to distinguish how gametes and genes are arranged in
individuals (Weir and Cockerham, 1969).
This means that the descent
measures in table 1 of Weir and Hill (1980) may be used without
subscripts, and P and rr of that table are equivalent.
Now let
0*=PA (l-PA )ps (l-ps )O,
~*=PA
e·
(l-PA )ps (l-ps )~.
For full-sibs we have
nAB _nA
r·
_nA
-'DA...--
-rB-rB-'~-
B - nA B - 'DA.I. - t:lA.I. - nAr rnA
A - r B -. X--' ~-&A -
P"~ =P:t =P'1C =f1'l =J>Af =
P1f=FY =F9-=~=
PAPS+ IT* /2
pAps+3rr*/8
nAB -DA.l.-
&AB -.
n-
nAI-oA.B -
rAIiI-' AlB -
+
(1+A)20*/4 +(1-A2 )r*/2
PAPB+3IT* /4 +
(1+A+A2 )0* /4 +(l-A2 )r* /4
pAps+3rr* /4 +
r* /2
pAPs+7rr* /8 +
(l+A)0*/8 +
+(1-A)2~*/4
+(1-A)2~*/8
r*/2 +
(1-A)~*/8
109
-,..-
-
(1+A)f*/2 + (1-A)Ll*/2
Pti--~B
_~B -~~
pApa+ IJ*
PH=
PApa+3IJ* 14 +
PH=
pApa+ IJ*
+
(l+A)0* 18 + (2+A)f*/4 +3(1-A)Ll*/8
~=pY=
PAPB+7IJ*/8 +
(2+A)f*/4 + (1-A)Ll*/4
~=
PApa+3IJ*/4 +
0*/8 +
r*/4 +
Ll* 14
P:==
PAPS+ IJ*
+
0*/4 +
f*/2 +
Ll*/4
pApa+ IJ*
+
-
PilI=~ =111=P2=~
+
(2+A)0* 116+
3f*/8 + (1-A)Ll*/16
=
Pit =
Ll*
PN=
PAPs+3n*/4 +
0* 116+
3f*/8 +
Ll*·/8
P:f=~=
PA ps +7n* 18 +
0*/8 +
f*/2 +
Ll*/8
Pii=pY=
PAPS+ IJ*
+
f*/4 +
3Ll*/4
Prf=~=
pAps+7n* 18 +
r*/4 +
Ll* 12
PI: =P2=PS=~=
PApa+7IJ*/8 +
P:II =pA:JJC =
pApa+ IJ*
ptf=
PAPs+3IJ* 14 +
PH=
pApa+ IJ*
~=
pApB+3n* 14 +
~=PM=
pAps+7n* 18 +
3Ll*/4
r*/2 +
Ll*/2
30* 132+
3r* 18 +
3Ll* 132
0* 116+
3r*/8 +
9Ll*/16
+
+
9Ll*/16
3f*/8 +
3Ll* 18
(6.3.10)
For half-sibs we have
nAB _nA
r
_nA
-DA.--
- r B -rB-~""T-
_nAB -nAr _nAil -"041 rnAB
A -roB - rA -r Is -~A11 -
~ =f'iA=pA-f=~=~=
pApa+
IJ* 12
PAPS+ 7IJ* 116
110
rAB-~AT-
PAPB+
~~=
AB
PAPB+
~=
PAPS+ 7rr* /8 +
nAB
-'DA..I.-
II*
+
(1+>..)20* /4 +(1->..2 )r* /2
+(1->..)2~*/4
7rr*/8+(2+3>..+2>..2)0*/8+3(1->..2)r*/8+3(1->..)2~*/16
PAps+15II* /16+
(2+>..)r*/4 +
(1+>..)0*/16+ (2+>..)r*/4
+5(1->..)~*/16
(1+>..)r*/2 +
F1f=
(1->..)~*/4
(1->..)~*/2
PAPS+ 7rr* /8 +
(2+>..)0*/32+(7+4>..)r*/16+9(1->..)~*/32
PAPS+
+
(1+>..)0*/16+(4+3>..)r*/8
PAPS+15rr* /16+
(4+3>..)r*/8
II*
+7(1->..)~*/16
+3(1->..)~*/8
PAPS+ 7rr* /8 +
r*/4 +
PAPS+
rr*
+
r*/2 +
PAPB+
II*
+
~*/2
~=~ =PiI =P2 =PI: =
111=
~*
pAPB+15rr*/16+
rr*
3r*/8 +
~*/2
PiI=pY=
PAPB+
+
r*/8 +
7~*/8
P9f=PS-=
pAPB+15II* /16+
r*/8 +
3~*/4
PS =P2=PS=PS=
pAPs+15rr* /16+
~=I1JC=
PAPB+
F1f=
PAPS+ 7rr* /8 +
PH=
PAPS+
~=
PAPB+ 7rr* /8 +
rr*
rr*
7~*/8
+
+
pAps+15rr* /16+
r*/4 +
3~*/4
0*/64+
r*/4 +
~*/2
0*/64+
7r*/32+
49~*/64
49~*/64
3r*/16+
11~*
/16
(6.3.11)
For no family structure we have
e·-
111
nAB -DA.B _nAB
-'DA.I.-
rAii - .. AlB - r Ai" - .. AliI-
~ =P:C
PAPS + IT*
+
PAPS + 11*
+
pAPS + IT*
+
(1+A)2~/4+(1-A2)r*/2
+(1-A)2a*/4
(1+A)r*/2 + (1-A)a*/2
=PN =P:I =P:-C =
PH=
(6.3.12)
6.4 Gene Frequency
,
Classical Case.
Consider a sample of n. individuals for which we
know the genotypes at the locus of interest.
1,
Xj 1
=
Let
if the 1-th allele (1=1,2) of the j-th individual
(j=l, ... ,n.) is A
0,
otherwise.
The maximum likelihood estimate of p is
and we find
E(p)
= p,
= {p+p:+2(n.-l)~}/2n.,
Var(p) =
(6.4.1)
pt-p2+{p+P:-2~}/2n.
Notice that this variance consists of two components: pt-p2 which is the
variance due to the genetic sampling from the founder population, and
{p+P:-2~}/2n.
which is due to the multinomial (or statistical) sampling
of individuals from the t-th generation.
The first component does not
depend on the size (n.) of the sample, whereas the second component
decreases with increasing sample size, disappearing when n. is infinite.
112
Family Structure Case.
Suppose now that the sample consists of
full-sib or half-sib families, and that we know to which family the
individuals belong.
1,
Let
if the 1-th allele (1=1,2) of the j-th individual
Xijl =
(j=l, •.. ,ni) in the i-th family (i=l, .•. ,k) is A
0,
otherwise.
Then
E(p) = p,
E(J)2) =
Var(p) =
(6.4.2)
{2n.p+2n.p:+4S2pt+4S5~}/4n~,
pt-p2+N2{pt-~}/n~+{p+p:-2pt}/2n •.
Putting ni=l for each i gives (6.4.1) (note that for this case there is
no distinction between, for example, pt and
~).
Using (6.3.4) and (6.3.5) we find
Var(p) = p(l-p){a +
(N2+n.)(1-a)/4n~},
Var(p) = p(l-p){a +
(N2+3n.)(1-a)/8n~}
for full-sib and half-sib families respectively.
Putting ni=l for each
i (in either of these equations) gives the variance for the classical
case:
Var(p) = p(l-p){a + (1-8)/2n.}
It can be seen from these formulas that the variance increases as
the amount of family structure in the sample increases.
The variance
for full-sib families is greater than that for half-sib families which
is greater than the variance for the classical case, for all 8.
Note that Var(p) is the total variance due to two sampling
~
113
processes - genetical (the sampling of the genes in one generation for
the next) and statistical.
This variance cannot be estimated from the
data from one population since there is only one replicate of the
genetical sampling.
6.5 Hardy-Weinberg Disequilibrium
With the notation of the previous section, the
Classical Case.
Hardy-Weinberg disequilibrium, D, and its maximum likelihood estimate,
-D,
are given by
-=
D
IjIII~I' XjlXjl'
/2n. -
P2,
and we find
-
E(D) =
~-I1-{p+~-2pt}/2n.,
= {p+~+2(n.-l)pt}/8n~
E(D2)
-
(2n.-l){~+(n.-l)~r}/2n~
+
(n.-l){pt+~r+(n.-2)~}/n~
-
(2n.-l)(n.-l){2~r+(n.-2)~~}/n~
+
(n.-l){pt+2~r+~~+4(n.-2)~4(n.-2)~~2(n.-2)(n.-3)~}/2n~
-
Var(D) =
+
(2n.-l)2{~+(n.-l)~~}/4n~
~~-2~~~~2+2~11-pt2
+
{~-5~r+3~2~~+9~~~p~_ppt+~2_3~ptr2PP} In.
-
{8~-711-38~r+36~7~~+52~~44~p2+2p~-4ppt
(6.5.1)
Using (6.3.6) gives
= p(l-p)a
E(D)
= -p(1-p)(1-a)/2n.
D
-
114
= -{p(1-p)-D}/2n.,
...
Var(D) = p(l-p) {p(l-p)+(l-6p+6p2 )a-2(1-2p)2 y+3p(1-p)Li+(1-6p+6p2 )S} /n.·
-p(l-p) {2p(1-p)+(1-8p+8p2)a+p(1-p)a 2 -2(1-2p)2y
+3p(1-p)Li+(1-6p+6p2)S}/4n~
+O(l/n~
)
(6.5.2)
Since we have only one replicate of the genetical sampling we cannot
obtain estimates, from the data, for this expectation and variance.
The
expectation and variance due to the statistical sampling may be obtained
by setting a=y=Li=S=O .
...
We see that D, having no 0(1) term, is quite biased for D.
This
is because D is a parameter measuring the Hardy-Weinberg disequilibrium
among all populations with the same history, described in section 6.1,
as the population we are investigating.
However since we are dealing
...
with data from only one population, D is estimating the Hardy-Weinberg
disequilibrium within just that population.
This disequilibrium is
zero, since there was random mating in the previous generation.
As an
estimator of this disequilibrium, it is still biased, with a negative
bias, but this bias disappears as n. tends to infinity.
,
(Here we use
the expectation due to statistical sampling only, since we are now
interested in the current population.)
Family Structure Case.
For family sampling we have (using the S's
of (6.2.1»
...
D = ~i~j~~I;tJ'
XijlXijJ'
/2n.-I>2,
...
E(D) =
...
~-pt.--N2 (pt-I1-)/n~-{p+~-2~}/2n.,
E(D2) =
{n.p+n.~+2S2~2S5~}/8ni
e-
115
-
(2n.-l){n.F:+S2F:r+S5~}/2n1
+
{S2(~F:r)+S3~S6~}/n.4
+
{S5(pt+~)+2S6~+S9~}/n1
+
(2n.-l)2{n.F:+S2F:~+S5~}/4n1
-
(2n.-l){2S2F:r+S3F:~S6~}/n1
-
(2n.-l){2S5~+2S6~S9~}/n1
+
2{2S6(~+~)+2S8~SlO~}/n1
+
{S5(pt+2~+~)+4S6(~+~)+4S7~4S9(~~)+8S10~
+2S11ps-}/2n1
..
Var(D)
= p%-2P2-+I1t-F:2+2F:Pt-P¥
+ N2 {F:~-p%-2P2-4P»+6P2-+6P1r6Pii-+2F:pt-2F:Pt-2Ftpt+21¥}
/n~
-5~+3I1f--F:~-p%+2~4~3~6~pF:-ppt+F:2_2F:pt
+ {F:
-F:pt+2Ilt!1-} In.
+ N2 2 {3~6~3~f12+2ptPt-f1-2} /n1
- N2 {5F:r-5~+9I1f--9pt'+F:~-6P:~P%-P2-2P»+
- {8F:
+
12~~
-7~ -20F:r-l~+36~-4F:~+16F:~3p%+12~24~32~
N4{~4~3~12~6~}/n1
+ N3 {3~~9pt"+3F:~6~3P2-SP»-12~~~
,
12~}
/n1
+12~24~32~12~}/4n1
+ {p+7F: -14Ft
-36F:r+48~6F:~+48F:~48~}/8n~
(6.5.3)
If the family sizes, ni, are equal (to n, say) for each of the k
..
families, then the formula for Var(D) can be obtained by using the
following results:
116
N22/ni
= l/k,
= link,
= 1/k2 ,
= 1/k2 ,
N2/n~
= 1/nk2 ,
l/n~
= 1/n2k2 ,
= 1/k3 ,
= 1/nk3 ,
= 1/n2k3 ,
= 1/n3k3 .
N2/n~
lin.
N3/n~
N4/ni
N3/n'!
N2/n'!
l/n~
There is little simplification.
Putting ni=l for each i gives (6.5.1).
Using (6.3.4) and (6.3.5) we find
D
= p(l-p)e
...
B(D)
...
Var(D)
= -(N2+n.){p(1-p)-D}/4n~,
= N2P(1-p){p(1-p)+(1-6p+6p2)e-2(1-2p)2y+3p(1-p)~+(1-6p+6p2)S}/4n~
+3p(1-p){p(1-p)+(1-6p+6p2)e-2(1-2p)2y+3p(1-p)~+(1-6p+6p2)S}/4n.
-N3p(1-p){p(1-p)+(1-6p+6p2)e-2(1-2p)2y+3p(1-p)~+(1-6p+6p2)S}/4n~
+N2 2p(1-p) {2p(1-p)+(3-16p+16p2)e-p(1-p)e 2-6(1-2p)2 y
+9p(1-p)~+3(1-6p+6p2)S}/16n'!
-N2p(1-p) {2p(1-p)+(1-Sp+Sp2)e+p(1-p)e 2-2(1-2p)2 y
+3p(1-p)~+(1-6p+6p2)S}/Sn~
-p(l-p) {2p(1-p)+(1-Sp+Sp2)e+p(1-p)e 2-2(1-2p)2 y
+3p(1-p)~+(1-6p+6p2)S}/16n~
+O(1/k3 )
(6.5.4)
for full-sib families, and
117
D = P(l-p)8
-
E(D).
-
= -(N2+3n. ){p(l-p)-D} 18n~,
Var(D) =
p(l-p){p(l-p)+(1-6p+6p2)8-2(1-2p)2y+3p(l-p)~+(1-6p+6p2) S}/n.
+N22p(l-p) {2p(l-p)+(3-16p+16p2)8-p(l-p)8 2-6(1-2p)2y
+9p(1-p)~+3(1-6p+6p2)S}/64n1
-N2p(l-p) {IOp(l-p)+(7-48p+48p2)8+3p(l-p)8 2-14(1-2p)2 y
+2Ip(l-p)~+7(1-6p+6p2)S}/32n~
-p(l-p) {14p(1-p)+(5-48p+48p2)8+9p(l-p)8 2-10(1-2p)2 y
+15p(l-p)~+5(1-6p+6p2)S}/64n~
+O(l/k3 )
(6.5.5)
for half-sib families, where equal family sizes are assumed for the O()
term.
Putting ni=l for each i in either equation gives (6.5.2).
-
The
estimator D is biased downward, and the remarks for the classical case
also apply here.
The amount of bias increases with the amount of family structure in
the data, with the most bias occurring for full-sib family samples.
From
the O(l/k) terms of (6.5.2), (6.5.4) and (6.5.5) it can be seen that,
for equal family sizes, the variance for full-sib
f~ilies
is larger
than that for half-sib families, with the variance for half-sib families
and the classical case being the same (to O(l/k».
Numerical
calculations using the O(1/k2 ) terms show that the variance increases as
the amount of family structure increases:
The variance for full-sib
families is greater than that for half-sib families which is greater
than that for the classical case.
For the same number of families, the
variance is greater if the family sizes are unequal than if they are
118
equal.
-
Robertson and Hill (1984) also give results for Var(D) when there
is family sampling.
Putting
O(1/k2 ) terms, and defining
8=y=~=S=O
~n2,
in (6.5.4) and (6.5.5), ignoring
the variance of the family sizes, by
gives the results that they obtained.
Their derivation, using the delta
method approximation for the variance, is different from the one used
here, which is the reason why their formulas are O(l/k).
be simplified by using approximations for y,
(Weir and Cockerham, 1984).
by 82 , and S by 83
(38-1)/2,
•
-
The above expressions for the variance of D can
Approximate Forms.
~,
and S, in terms of 8
If 8 is small we can approximate y by 82 , a
If 8 is large (large t) we can approximate y by
a by (88-3)/5, and S by (98-4)/5. This gives the following
-
approximations for Var(D):
If 8 is small, then for the case of no
family structure we have
-
Var(D) - p(1-p){p(1-p)+(1-6p+6p2)8(1+8 2 )-(2-l1p+llp2)8 2 }/n.
-p(1-p){2p(1-p)+(1-8p+8p2)8+(1-6p+6p2)82(8-2)}/4n~
If 8 is small, then for full-sib families we have
-
Var(D) - N2P(1-p) {p(1-p)+(1-6p+6p2)8 (1+8 2 )-(2-11p+llp2 )8 2 }/4n~..
+3p(1-p){p(1-p)+(1-6p+6p2)8(1+8 2 )-(2-11p+llp2)8 2 }/4n.
-N3P(1-p){p(1-p)+(1-6p+6p2)8(1+82)-(2-11p+llp2)82}/4n~
+Nz 2p (1-p) {2p(1-p)+(3-16p+16p2)8 (1-28)+3(1-6p+6pZ)8 3 }/16n1
-Nzp(1-p){2p(1-p)+(1-8p+8p2)8+(1-6p+6p2)8Z(8-2)}/8n~
-p(1-p){2p(1-p)+(1-8p+8pZ)8+(1-6p+6p2)82(8-2)}/16n~
If 8 is small, then for half-sib families we have
-
Var(D)
119
~
p(1-p){p(1-p)+(1-6p+6p2)S(1+S2)-(2-llp+llp2)S2}/n.
+N22p(1-p) {2p(1-p)+(3-l6p+16p2)S(1-S)+3(1-6p+6p2)S3}/64n1
-N2p(1-p){lOp(1-p)+(7-48p+48p2)S-2(7-40p+40p2)S2
+7(1-6p+6p2)S3}/32n~
-p(1-p){14p(1-p)+(S-48p+48p2)S-2(S-32p+32p2)S2
+S(1-6p+6p2)e3}/64n~
If S is large, then for the case of no family structure we have
-
Var(D)
~
p(l-p)(l-e) {l/Sn. -
[1+Sp(1-p)(I-S)]/20n~}
If S is large, then for full-sib families we have
-
Var(D)
~
p(l-p)(l-S) {
N2/20n~
+ 3/20n.
+ N22[3-Sp(l-p)(1-s)]/80n1
-
-
N3/20n~
-
N2[l+Sp(1-p)(1-e)]/40n~
[1+Sp(1-p)(1-SS)]/80n~}
If S is large, then for half-sib families we have
-
Var(D)
~
p(l-p)(l-S) {l/Sn.
-
+ N22[3-Sp(1-p)(1-e)]/320n1
N2[7+lSp(1-p)(1-s)]/160n~
-
[l+9p(l-p)(l-S)]/64n~}
6.6 Linkage Disequilibrium
Classical Case.
The following measures of linkage disequilibrium
will be used (Cockerham and Weir, 1977):
DAB
= pAB
- PApB,
the "usual coefficient", and
the composite coefficient.
The first can be used only when coupling and
repulsion double heterozygotes can be distinguished (so that estimates
of each of the frequencies pAB and pAs can be obtained).
To obtain the
gene frequency correlations, in the next section, we also need estimates
120
of the two quantities
Q = PA ( l-pA )ps ( l-ps ) ,
Let XJl be as defined in section 6.4 and referring to the A locus;
and let YJl be as XJl defined in section 6.4 but referring to the B
locus.
Then the maximum likelihood estimates of the coefficients of
linkage disequilibrium, and of
-P:
~
-
~
and
P:
are
= IJ II1;ll'XJlxJl,/2n.,
= IJII1;ll' YJ1YJ1' /2n.
...
Estimates Q and R, of Q and R, are obtained by replacing the
frequencies in their expressions by their estimates.
-
We find
E(llAs)
= p"·_~-{P"·+P".-2~}/2n.,
= p". +P". -2~-{P"·+p". -2~}In. ,
E(Q) =
~-p6f-PN+PH
E(DAS)
...
-
+ {p". +p". -4~-2ptJ--2p6.f-~r-P"C -2~-2~+7~+7pAf+4pti+4PN
+~1+~-12~ }/2n.
+ l4p6f-2PH-4~-4~ -4~1+24~-2~-~= +24~+6~1+6~ -44~ } /4n~
...
E(R)
-4PH-8PM-8~ -8~1+32~-4~-2~= +32PN+8~f+~--= -48~ } /8n~
= ~+~r+P"C-2~-2p6f+~C-2~1-2~+4PH
,
121
+2P:: -8~ -8P:i+ 16~-6P:C+ 16~+ 16P:1+ 16~ -48PH }/2n.
-44ptf }/n~
-48P9-}/2n~,
-
E(DAS 2 )
-
= { (2n.-l)2[pAl+p::+2(n.-l)~]
2(2n.-l)[p:I+pA:+2(n.-l)~]
- 4(2n.-l) (n.-l)
[ptl+pAf+~+P:i+2(n.-2)PN]
+ [pAl +P:: +2(n. -1)I1f]
+ 4(n.-l)
[~+~+~+P:1+2(n.-2)~]
+ 2(n.-l)
[~+PM+p:r+pA=+2~+P:C+~+2(n.-2)
( ~+p6f+2~
+~f+~+2~)+4(n.-2)
-
E(AA s 2)
={ (n.-l)2[pAl+~:+2(n.-l)ptf]
- 4(n.-l)2
-
(n.-3)!'H] }/8n. 3
2(n.-l)2[p:I+pA:+2(n.-l)~]
[~+~+~+~f+2(n.-2)PN]
+~f+F1C+2PN)+4(n.-2)
Var(DAS) =
+
~-2~+~-pAl
(n.-3)Pfl-] }/2n. 3
2 +2pAl J'A.a-~2
122
:-48PH }/8n~
-
Var(LMB)
= ptf+2~-4ptf+~-4~+4~-P"· 2-2p"· P"B +4p"B Jli7-P"B 2
+4P". Jli7-4Jli7 2
-48PH }/2n~
(6.6.1)
Using these results along with table 1 of Weir and Hill (1980)
gives the following results, which may also be obtained from table 2 of
Weir and Hill (1980) by dropping the subscripts of the descent measures
used there.
E(LMB)
E(DAB) =
0,
=
0,
-=
E(Q)
-
E(R)
=
Var(LMB)
Var(DAB)
Q{a +(2r-3a)/n.
Q{a +(e-3a)/n.
= Q{e-2r+a
= Q{e-2r+a
+(2e-12r+lla)/4n~ -(e-4r+3a)/4n~},
-(4r-5a)/n~ -(e-4r+3a)/n~},
-(3e-IOr+6a)/2n.
+(4e-16r+lla)/4n~ -(e-4r+3a)/4n~},
-(20 -6r+3a)/n. +(20 -8r
+5a)/n~ -(e-4r+3a)/n~}
(6.6.2)
The parameters DAB and LMB are zero for the present assumptions, and so
e-
123
their estimators are unbiased.
Family structure Case.
For family sampling we have (using theS's
defined in (6.2.1»
'"
DAB
'"
= ~i~j~IXijlYijl/2n.
= ~i ~j ~~Il'! l' Xi j lXi j
P:
-P:
-
PAPB,
I' /2n. ,
= ~i ~j ~~Il'! l' Yi j IYi j I' /2n.
E(DAB) = pAB-p6y-N2{~-~}/n~-{pAB+pAB-2~}/2n.,
'"
E(~B) = pAB +pAB -2~-2N2 {~-pA,} /n~-{pAB +pAB -2~} In. ,
'"
E(Q) = ~-I9-I1+PS
+ N2 {J'Aa-PI;-ptl-I1-2P"11-2P9/ +3ptl+3pef+4PM+PS+I1I-GPS} /ni
+ {pAB +pA B-2J'Aa-2~-2PP-2Pdf-P:r -P't -2J'i1l-2~+2~+2~+4P1i+4Pt1'
+I9+~+4Pi1+411'1+P2+PS
-8pAd-2PS-2I'S }/2n.
- N3 {P:""'+pAf-IY-F1--2~-2Pif +2I9-+2J1.-2PS-2Pi1+8PM+2PS
+2PS-8l1I} /n~
+ N22 {2PH+PS-4pY-PS-PS+3PS} /ni
- N2 {2~-2~+2ptJ-+2p6.f-2pp-2POf+p:r +pAC +2~+2~-~
-P1: -2J'i1l
-2~-~-6pAf+~+I1+2P4f+2P9f +3P1l+3pef-4PH-4~-4pA~+
1 2pY
-4Pii-4~-4~-~-P2-2P2-2pA:JJC + 12Pii+3P2+3~ +8pAd+4~+ 121'5
+12Pi1-24PM-6PS-6Pi1}/2n~
-4~-P:r-P1: -2J'i1l-2~+8~-t'-+8pA-I+2I7-+2~+4P11+4pAAf
-2PK-4pAd
-4~ -4pAAIi+8pAri+8~+8pAAl-2Pii-~
+ 8Pii+8~+8~+2P1:+21'2+41'2
+4JliC -8I11-4~-16~-16pAd }/4ni
+
N4{~-2pAd-~-2~-2Pi1+8pA~+2~+2Pif-~}/ni
124
+4~+~f+~~ -4f1i-4~-4~-~ -P2-?~-2J>A~ +8~+2P2+2~
-12~+8J>Ad+4I1f+12~+12J>Ad--16pg-4P5-4~
}/2n1
+ N2 {~-P6y+2ptJ-+2pAf-2J¥-2PAf+~r +J>AC +2~+2~-~ -~ -2~-2~
-6~-6p6f+2f'1!+2I1-+4P'1f+4Pif +2PH-2PH+4~+4~1+4~ -4~-4~
-4J>AAIf-24PH+8J>Ad+8~+8IV+2~+~C
-2~-~ -24~-~f-6~
+8Pi1
+8Pii+8~+2~+2~+4~+4~+44~-8~-4P11-1~-16Pi1}/4n1
-
E(R)
+8~ -48~ } /8n~
= P6y+~+P1:-219-2f1+P:JC-2P2-2PC+4PS
+ N2 {~-P6y+~r +J>AC -~ -P1: -2~-2plf-4P1i-4Pif +6I9+6~+~C-P:jC
-2PS -2P2-4~-4PK+6P2+6P1: +16IV+4PS+4Pi1-24PS} /n~
+ {J>AI +J>A. -2~-4P6y+2~1 +2J>A: -4J¥-4p.1f-2~r -2J>AC -4~ -4P1: -4PP
-4~+4~+411+8P1i+8P9f+417-+4I1+2~:-8P9Jr: -8J>AAIi+ 1~-2~C
-4~
+4PS +4P2+8~+8~+16Pi1+4P:;1+4PS -32111-BPS-8111 } /2n.
- 2N3 {~+p6f-F7--~-2P"11-2FY +219+2~+~1+~
-PS -P2-2PZ1-2~
+2P2+2PC -4PS-4J>A:d-+ 16~+4PS+4Pi1-16PS } /n~
+ 4N22 {2~+PS-4J>A:d--PS-P9I+3PS} /n'!
- 2N2 {J'6a-P6y+ptJ-+pAf-pt&-~+~r +J>AC +~+~-~
-3p6f+3P1l+3~+2~1+2~
-P1: -J'iIl-p6IJ-3~
-21't1: -2J>AAIf-4PH-4~-4Pif+ 12JW+~C-P:JC
-3~1-3~ -4~-4~-4PS+ 12~+3P2+3P1: +8I11+4~+ 12~+ 12J>Ad-
-24Pi1-6PS-SPe } /n~
-~ -pAjC -2~-2~+4~~+4J>A-I+2~+2~+4P'1f+4F1r
+2~: -2J>Ari-4~1
-4~ -4~-4J>A~ -4J>AAli+8J>Ad+8~+8Pii-2~= -2~-~ +4~1+4~~ +8I1f
e-
125
+ 4N4 {P9;-2PH-PS-2}'$-2f>Ad-+8f>A,j+2P5+2I11-6P91 } In'!
+ 2N3 {~+f>A-f-~-~-2P4i-2f>A:J' +2P1L+2~+4~-f-4PH-4~-4f>A,l+8f>AAi
+4~+~1+~ -4PK-4I1i-4~-11C
-P2-2J'2-2f>AAJC +8Pi'i+2P2+2P1:
-12~-f+8PiI+4~+12}'$+12~-16pA,j-4~-4~}/n'!
+ N2{f6s-~+2~+2pAf-2pt&-2pAf+~r+f>A~+2~+2~-~-~-2~-2~
-~-6pAf+2~+2~+4P4i+4f>AJ' +2ptf-2PH+4PM+4~1+4~ -4I111--4~
-4f>AAfi-24~+8PH+8~+8~+2~+~C -2Pii1-~ -24PN-6~1-6~ +81'5
+8I1i+8I11+2P1: +2P2+4J'2+4f>AAJC +44P9;-8pAd-4PS-16~-16f>Ad} In'!
-
+8~
-48PH } 12n~
E(DAB 2 ) = { (2n.-1)2 [no (f>AB+~: )+2S2ptf+2SsPH]
+ [n·(f>AB+~:)+2S2~+2SsPiil]
+ 4[S2 (~+PNf+~ +~I ) +2S3l'N+2S6 1'5 ]
+ 2[S2 (f6s+ptf+~r+f>A=+2~+~C+~ )+2S3 (~-f+f>A-f+2~-f+~I+~-C+2PN)·
,
+4S4 PH+4S7 PH ]
+ 8[S6 (P4i+I1r +f>A;I+f>AAJi+~+f>A:&+I1i+~)+2S8 (}'$+f>AxI )+2S10f>A:zj]
+ 2 [Ss (~+PH+~ +1'1: +2f>AXli+~ +PiiI )+2S6 (J1Pl+plf+2PH+PS +P2+2PK)
+4S7 (f>Ad+~ )+2S9 (J9.+~+2f>AAi+P2+PI:
-
E(AAa 2 )
+4S10(~+~+2f>A,j)+4S11Pil]
}/8n1
= { (n.-1)2[n.(f>AB+~:)+2S2~:+2Ss~-i]
+2~)
126
- 4(n.-1) [85
(PP+~+PiI:
+P"Ali )+286 (~+P"~ )+28911li]
(n.-1)2[n·(P"B+~:)+282~+285~]
+
- 4(n.-1) [82
(~+~+I't:+~~ )+283~+286~]
- 4(n.-1) [85 (PP+POCI+I1-:+P"AII )+286 (~+~ )+289~]
+454 PH +487 PH ]
+ 8[86 (P'1I+P9/ +~+pY+~+P"A1C+I1t+~ )+288 (~+I1I )+2810P";j]
+ 2 [85 (~+PK+~ +P1: +2Pi1+~ +PH )+286 (~+F1+2P"d+~ +P:I+2Pii)
+487 (PH+~ )+289 (PIl+Pl'I+2pY+P2+PI: +2I1r1)
-
+4810(~+PiI+2Pi1)+4811~]
}/2ni
Var(DAB) = PH-2P"AJi+PS-P"· 2+2p"· ~_pAy2
+
N2{~-PH-2Pi1-2~-2Ftl+6P"AJi+4P"~+~+PiI-6~+2p"·~-2p"·pAy
-2F6a~+2~2}/n~
+ {p". -2I¥-2p.q+11f+~+~:-2~-2PH-2P"d-2~-2P"Ali+4PH+4~+4pY
+4PM+4I1r1+P2+F1IC -8Pil-2P5-2F11+2p"· 2 +2p"· P" B-4p"· F6a-2p"· ~
-2P".~+4~pAy}/2n.
- 2N3 {~-P"d-~-~+2Pir1-PS--Pil+4pY+PS+I1I-4PS
} /n~
+
N22{2P"d+PS--4PM-PS-F1I+3PS-~2 +2F6a~-~,2}/ni
-
N2{2ptA+2p&f-2I¥-2p6f-~-~-2P11-2Pir+3~+3pqf+2ptf-2PH+2ptf
+2~1+2~
-2Pi1-2PtJ: -2P"AIi-12PH+ 12P"~-4PH-4~-4~-PS-P2-2PZj
-2~ + 12~+3P2+3~
+8PH-+4PS+ 12~+12PU-24Pi1-6PS-~
+2p"·~-2p"·~+2P"BF6a-2P".pAy-4~2+4~~}/2n~
+4P'11+4I1r +2~: -4ptf-2PH-4ptf-4~~-4~-4PH-411r: -4P"..... +1~+8PH
+8~+8P"~-2~-~ +811i+~+8~+2P1: +2P2-+4P2+4~
-8P";d-4PS
127
+ Na {~+pq-Pfl-~-2P4i-211r' +2I9-+2F1-+4~-4P"d-4I1i-4~+8pAAJi
+4~+~1+~ -4Pi1-4~-4~-~ -P2-2~-2~
+8Pi'1+2P2+2PS
-12~+8Pi1+4~+12~+12Pi1-1~~-4~-4Pi1}/2n1
+ N2 {pAy-~+2~+2poq-2P"x-!-2PAf+~r +P"= +2~+2~-P1' -~ -2Pi'!-2po1f
-4P"AJi-24PH+8PH+8I1f+8Pii+2~+~=-2Pij1-~ -24~-6~1-~+8Pi1
+8I1i+~+2PS: +2P2+4~+4P"~ +44P9--8Pi1-4PS-16PS-16I11" } /4n1
+~ -48PH } /8n~
-
Var(AAB)
= PH+2P"d+PijI-4pY.-4P9J1+4PS-P"· 2-2p"· p". +4p"· I*,r-p". 2
+4P". ~-4~2
+
N2{~+2~+~-ptf-2Pii-4PH-4~-4Pif-Pii-4pt1-4Pii-4~+12Pi1
+12Fti+16Pi1+4~+4~-24PS+4p"·pAy-4p"·~+4P".pAy-4P".~
_8pAy~+8I*,r2}/n~
+
{P"·+P".+2~·+2P":-4Pi&-4P"-f-4~-4~+4~+4pqt+2~:-2~-4PH
-4~-8Pii-8~-8P"~+8PH+8~+8Pi1+16Fti-2~-4Pii+8Ft1+8PiI+8~.
,
- 4Na {PH -PH-pY -pY +2P"AJi+I1'I-Pii -Pif -~ +2P9J1-2~-2I11 +8P",j +2PS
+2I11-BPS } /n~
+ 4N22 {2P"d+PS-4pY-PS-PS+3PS-pAy2+2pAyI*,r-P".,.2} /n1
- 2N2 {~+P-\t-P"x-!-PAf+~+~-~-~-~-I1--2P4i-2Pi1"
+3I9-+3~+PH
128
-P9J-6Fff-2P5-2PSf-2~-11C
-P2-2P2-2P"~+12I11+3P2+3I1C +8ptf
-4~2+4~~}In~
- {p". +p". -~+2~· +2P": -21¥-2FA: -2I¥-2~-2~-2~-17-~ -2J11!
+2P2'+4P2+4~ -8PH-4~-1sp»-16P""+P"· 2+2p"·
p". -4p"· ~+p". 2
-4P".~+4F6s2}/n~
+ 4N4 {PH-2P"d-~-2I1i-2I11+8pY+2PS+2111-6P91} In'!
+ 2N3 {~+p6f-~-11-2P"1i-:21Y
+2PP+2~+4PN-4PK-4~-4pY-+8IW
+4~+~1+~ -4P5-4I1t-4~-PI: -P2-2P:Ji-2~ +8~+2P2+2P1:
-12PH+8PH+4~+12PS+12PU-16PM-4PS-4.
} In'!
+ N2 {~-~+2I¥+2pA.f-2ptl-2P6.f+~r +P"C +2~+2~-~
-~-6p6f+2I7+2~
-P1: -2~-2~
+4P"1&+4P9f +2PM-2P"ri +4~+4~1+4~-4PH-4I11:
-4P"AII-24PH-+8PH+8~+8I1l+2~+~=
-2P9J-~-24PN-6~f-6~+8Pi'f
+8I1I+8~+2~+2P2-+4P2+4~+44PH-8P"d-4PS-16~-16pY} In'!
+8~ -48PH } 12n~
(6.6.3)
Putting ni=l for each i gives (6.6.1).
-
Using (6.3.10) through (6.3.12) we find
E(DAB) =
0,
E(AAB)
-
= 0,
E(Q)
Q{ a
-=
+N2[2r-3a]/2n~
+ [2Ar-(1+2A)a]/2n. -
+ N22[20-4r+3a]/16ni +
N3[r-a]/2n~
N2[2A0-4(1+3A)r+5(1+2A)a]/8n~
e-
129
+
[2A2e-4A2r+(1+2A2)~]/16n~
-
N3(1+2A)[e-4r+3~]/16n1
~
= Q{
E(R)
a +
N2[0+2r-7~]/4n~
-
N3[e-~]/2n~
+
~/4n~
~
Var(DAB)
= Q{
-
+
-
N4[e-4r+3~]/32n1
+
N2[0+4A2r-(1+2A2)~]/32n1
+
[A(2+A)e-2A2r-(4+2A-A2)~]/4n.
-
N22[2e-4r+3~]/4n1
-
N4[e-4r+3~]/8n1
N2[2(1+2A)e-2(3+6A+A2)r+(3+8A+A2)~]/8n~
[(1+2A+3A2)e-4(2+2A+A2)r+2(1+A)(2+A)~]/8n.
+
N3[Ae-2(1+2A)r+(1+3A)~]/8n~
+
N2[(2+3A)e-2(4+6A+A2)r+(5+9A+A2)~]/8n~
-
[2(1-2A2)0+4A2r-(1+2A2)~]/16n~
-
N3(1+2A)[e-4r+3~]/16n1
-
= Q{
},
N2[2(1+2A)r-(3+4A)~]/2n~
-
Var(~B)
A2e/16n~
N3(1+2A)[e-4r+3~]/4n1
(1+A)2[e-2r+~]/4 -
+
-
+
N22[2e-4r+3~]/16n1
-
N4[e-4r+3~]/32n1
N2[0+4A2r-(1+2A2)~]/32ni
-
A2e/16n~
},
(1+A)2[e-2r+~]/4
-
N2[3(1+2A)e-2(4+8A+A2)r+(3+10A+A2)~]/8n~
-
[(1+2A+4A2)e-2(4+4A+3A2)r+(1+6A+3A2)~]/8n.
-
N3[(1-A)0+4Ar+(1-3A)~]/4n~
+
N22[2e-4r+3~]/4n1
(6.6.4)
for full-sib families;
~
E(~B)
-
E(Q)
= 0,
= Q{
a +
N2[2r-3~]/4n~
+ N22[2e-4r+3a]/64n1 +
+
+
[2(1+2A)r-(5+4A)~]/4n.
-
N3[r-~]/4n~
N2[2(1+2A)e-12(1+2A)r+(13+20A)a]/32n~
[2(1+2A)2e-4(3+2A)2r+(43+40A+8A2)a]/64n~
-
N4[e-4r+3~]/64ni
130
- N3(1+2A) [0-4r+3a]/32ni + N2[0+4A2r-(1+2A2 )a]/64ni
-
[(1+2A+2A2)0-2(1+A)(3+A)r+(1+A)(5+A)a]/32n~
- = Q{
H(R)
a +
N2[r-2a]/2n~
+ N22[20-4r+3a]/16n1 +
},
+ [(1+A2 )0-2A2r-(7+2A-A2 )a]/4n.
-N3[r-a]/2n~
N2[20+8Ar-(5+8A)a]/8n~
[20-4(5+4A)r+(27+16A)a]/16n~
- N4[0-4r+3a]/16n1
- N3(1+2A) [0-4r+3a]/8ni + N2[0+4A2 r-(1+2A2 )a]/16ni
-
-
[(1+2A+2A2)0-2(1+A)(3+A)r+(1+A)(5+A)a]/8n~
Var(DAB)
-
= Q{
},
(1+A)2[0-2r+a]/4
N2[2(1+2A)0-2(3+6A+A2)r+(3+8A+A2)a]/16n~
- [2(2+4A+3A2 )0-2(11+14A+5A2 )r+(11+20A+5A2 )a]/16n.
+ N2 2 [20-4r+3a]/64ni
+
N3[A0-2(1+2A)r+(1+3A)a]/16n~
+
N2[6(1+A)0-4(1+A)(5+A)r+(13+18A+2A2)a]/32n~
+
[2(1+8A+8A2)0-4(11+16A+6A2)r+(31+48A+12A2)a]/64n~
e-
- N4[0-4r+3a]/64ni - N3(1+2A) [0-4r+3a]/32ni
+ N2[0+4A2r-(1+2A2 )a]/64ni
-
-
[(1+2A+2A2)0-2(1+A)(3+A)r+(1+A)(5+A)a]/32n~
Var(~B)
= Q{(
},
1+A)2[0-2r+a]/4
-
N2[2(2+3A)0-2(5+8A+A2)r+(5+10A+A2)a]/16n~
-
[2(2+5A+4A2)0-2(11+16A+7A2)r+(3+22A+7A2)a]/16~.
+
N3[A0-2(1+2A)r+(1+3A)a]/8n~
+
N2[(2+3A)0-2(4+6A+A2)r+(3+9A+A2)a]/8n~
+
[2(1+2A)20-4(6+8A+3A2)r+(13+24A+6A2)a]/16n~
+ N2 2 [20-4r+3a]/16n1
- N4[0-4r+3a]/16ni
- N3(1+2A) [0-4r+3a]/8n1 + N2[0+4A2r-(1+2A 2 )a]/16ni
-
[(1+2A+2A2)0-2(1+A)(3+A)r+(I+A)(5+A)a]/8n~
}
(6.6.5)
for half-sib families; and
131
-
E(DAB) = 0,
E(Q)
E(LiAB) = 0,
-
= Q{
~
+
[(1+A)r-(2+A)~]/n.
+
[(1+A)2e-2(1+A)(1+5A)r+(11+10A+A2)~]/8n~
-
(1+A)[(1+A)e-2(3+A)r+(5+A)~]/16n~
E(R) = Q{
-
-
~
+
[(1+A)20+2(1+A)(1-A)r-(11+2A-A2)~]/4n.
[2(I+A)r-(3+2A)~]/n~
Var(DAB) = Q{
},
-
(1+A)[(1+A)e-2(3+A)r+(5+A)~]/4n~
(1+A)2[e-2r+~]/4
-
[3(I+A)2e-2(I+A)(7+3A)r+(7+14A+3A2)~]/8n.
+
[(I+A)2e-2(I+A)(3+A)r+(4+6A+A2)~]/4n~
-
(I+A)[(I+A)e-2(3+A)r+(5+A)~]/16n~
-
Var(LiAB) = Q{
},
(I+A)2[e-2r+~]/4
-
},
[(1+A)2e-2(I+A)(2+A)r+(1+4A+A2)~]/2n.
+ [(l+A)2e-2(l+A) (3+A)r+(3+6A+A2 )~] /2n~
-
(1+A)[(1+A)e-2(3+A)r+(5+A)~]/4n~
}
(6.6.6)
for no family structure.
These results (6.6.6) for the case of no
family structure are the same as the results (6.6.2) for the classical
case only if there is complete linkage (A=l).
This is because the case
of no family structure assumes distinct parents whereas the classical
case assumes that the parents have been chosen at random from a
population of size N (so that self-sibs, full-sibs and half-sibs all
occur with positive probability).
For the case of no family structure
with complete linkage, the gametes in the t-th generation have exactly
the same distribution as those in the (t-l)-th generation.
Each gamete
in the t-th generation is a distinct gamete from the (t-l)-th
132
generation.
The effect of this is to leave the nonidentity measures
~
unchanged, going from the t-th to the (t-l)-th generation, and the
(t-l)-th generation has the same history as the classical case.
This
explains why (6.6.6) with A=l is the same as (6.6.2).
-
-
Numerical calculations show that E(Q) and E(R) both decrease as the
amount of family structure in the sample increases, while Var(DAs) and
-
Var(AAa) may either increase or decrease, depending on the situation.
-
In most of the cases considered Var(DAs) increased with the amount of
family structure in the sample.
6.7 The Correlation of Gene Frequencies
Gene Frequency Correlations.
We now consider two gene frequency
correlations
e-
the first being appropriate when coupling and repulsion double
heterozygotes can be distinguished, and the second being appropriate
when this is not the case.
The expectation of these ratios are
approximated by the ratios of their expectations, denoted
by d2 and S2 .
,
for r 2 and r*2 respectively.
Descent ftfeasure Transition Equations.
To obtain simple expressions
for d2 and S2, we use the relationship between the (t-l)-th generation
measures 0, r,
and~,
when t is large.
these measures in terms of the other.
This allows expressing two of
Let
133
e
=
r
s=O, .•. ,t-l
s
be the vector of descent measures for the s-th generation.
For a
monoecious randomly mating population (including selfing) of size N, the
transition equation is
ys =
Q YS-l,
where Q is given by equation (27) of Weir and Cockerham (1969):
(1-c)2- 1-2c
~
Q
=
l-c
1-2c
- 4N'Z"
~
2N-l
4N3
where c
= (1-A)/2
(N-l)2c(l-c)
N
(N-l)c2
N
(N-l)[N+l+~N-2)(1-2c)1
2~
(N-l~~N-3)c
(N-l) (2N-l) (2N-3)
4NJ
(N-l)~N-l)
is the recombination fraction.
From the transition equations it can be seen that
!!t - 1
= Qt - 1 yo
(6.7.1)
Let Al>A2>A3 be the eigenvalues of Q, with corresponding eigenvectors
?il ,?i2, and ?i3.
Then we can choose al,
~
and
~
so -that
Then
The elements of ys, for large s, are just a multiple of the elements of
To find the eigenvalues of Q, suppose that
134
Ai = biO + bil/N + bi2/N2 + O(l/N3), i=1,2,3,
e,
and find three solutions of
In-Ai II = O(l/N3),
where I is the 3x3 identity matrix.
- liN
+ 1/4N2 +O(l/N3),
+ (5-4c)/2N
+ l/cN2 +O( 1/N3),
Al = 1
A2
= l-c
This gives
A3 = (1-c)2 + (3-4c)/2N - (l-c) (1-2c)/cN2 +O(l/N3) ,
if C10j and
Al = 1 - 1/2N,
A2 = 1 - 3/2N +
1/2N2 ,
A3 = 1 - 3/N + 1l/4N2 +O( 1/N3 ) ,
if c=o.
In the following it will be assumed that c10.
o
Write
= 00 + 01/N + 02 /N2 + O( 1/N3 ) ,
e·
!il = £0 + -cl/N + -c2/N2 + O(l/N3),
where 00, 01, 02, £0, £1, and £2 do not depend on N.
approximation to !il we look for a solution to
(O-AII)!i1 = O(l/N3).
Equating terms in N this gives
(1-00 )£0
= 0,
This leads to
=
+ k3/N2
,
To find an
135
where
kl~O,
ka, and k3 are arbitrary constants, and
Ql = (1-2c+2c2 )/2c(2-c),
Q2 = (3-2c+2c2)/2c(2-c).
We will take kl = 1, and ka=k3=0 to give
~l
as the equilibrium value of
This gives the equilibrium values of G> and r in terms of Ii.
Ys.
Gene Frequency Correlations for FBJDily SBJDpling.
Using
~l
and
equations (6.6.4) through (6.6.6), and ignoring terms in 1/N2 and 1/k3
(if equal family sizes are assumed) gives the following approximations
for d2 and S2:
d2
~ 1/2n.+1/4n~
S2
~
lin.
+l/n~
+
Ql(I-I/2n.+4n~)/Nt
+ Ql/N
(6.7.2)
for the classical case (coarser approximations are given by equation (4)
of Weir and Hill (1980»;
+ (1+2c-2c2 )/4n.
-
N3/8n~
+ N22 (3-4c+4c2 ) / 16n'! + N2 c ( I-c) /2n~
+
(3-4c+4c2)/16n~
+ Ql [ (1-c)2
- (1-2c)2/4n.
+ N3 (1-2c)/8n~
S2
(l-c)(1+7c)/16n~
- N22(I+c)(1-3c)/16n'! +
N2(2-3c2)/8n~
l/N,
~ N2(3-4c+4c2)/8n~
+ (5+4c-4c2)/8n.
-
N3/2n~
+ N2 2 (5-4c+4c2 ) /8n'!
+
N2(1+4c-4c2)/4n~
+
(5-4c+4c2)/8n~
+ Ql [(l-C)2
-
N2(1-4c+8c2)/8n~
+ (1+4c-8c2 )/8n.
-
N3c/2n~
+ N22(1+4c+20c2)/32n'!
+
N2(5+8c-34c2+24c3-8c4)/16n~
-
(11+4c-48c2+48c3-16c4)/32n~l/N
(6.7.3)
for full-sib family sampling;
d2
~
N2 ( l-2c+2c2 ) /8n~
+ (3+2c-2c2 )/8n.
-
N3/16n~
136
N2(3+4c-4c2)/32n~
+ N22(3-4c+4c2)/64n1 +
+
- N2 (1-2c2 ) /8n~
+ N3 (1-2c)/16n~
- N22 (l+c) (1-3c)/64n1 +
S2 •
(7-30c+19c2)/64n~
N2(3+2c-3c2)/32n~
]/N,
+ (7+2c-2c2 )/8n.
N2(1-2c+2c2)/8n~
-
N3/8n~
(7-c+c2)/8n~
+ N22 (l-c+c2) /8n1
+
N2(1+2c-2c2)/8n~
+
+
-
N2(1+2c-4c2)/8n~
+ (1+2c-4c2 )/Bn.
0:1
[(1-c)2
+ 3N2 2 c2/16n1
- N3 (l-2c)/8n~
-
-
- (3-8c+6c2 )/8n.
+ Qa [ (1-c)2
+
e,
(11-4c+4c2)/64n~
+ N2C( l-c)3 /4n~
(2-9c2+12c3-4c4)/16n~]/N
(6.7.4)
for half-sib family sampling; and
d2 •
1/2n.+l/4n~
S2 • lin.
+l/n~
+
O:l(l-c)2(1-1/2n.+4n~)/N,
+
0:1
(6.7.5)
(1-c)2/N
for the case of no family structure (putting ni=l in either (6.7.4) or
(6.7.5)).
To obtain these approximations, the denominator was expressed
as a power series in liN using a Taylor's series expansion.
The
difference between (6.7.2) and (6.7.5) is due to the case of no family
structure assuming distinct parents.
Comparison of Calculation Methods.
calculating d2 and S2 -
We now have three methods of .
(1) Iterate equation (6.7.1) until an
equilibrium value of y is obtained, and use this with the appropriate
formulas from section 6.6 to give the exact values. (2)
Use
~1
(to a
chosen order in liN) as an approximation to the equilibrium value of y,
along with the formulas of section 6.6.
(3) Use formulas (6.7.2)
through (6.7.5) (to a chosen order in l/k).
This is the same as (2)
except that a Taylor series expansion has been used to approximate the
~
137
denominator, and only terms up to O(l/N) are included .
.To compare these three methods they were used with each of the
following parameter combinations:
N = 100, 1000, 10000
n. = 10, 100, 1000
k = 1, 5, 10, 20, 100, 1000, but with k<n.
c = 0.5, 0.25, 0.05
(corresponding to A = 0, .5, .9)
Even or uneven family distribution, where "even" refers to ni = n./k for
each i, and "uneven" refers to nl = (k+l)/2k and ni = n./2k,
i~l.
Method (1) was used by iterating (6.7.1) for both 2N and 5N generations.
Method (2) was used with
~l
to order liN.
Method (3) was used with two different orders of approximation - first
to O(l/k), then to 0(1/k2
).
Some of the results are presented in table 6.1 (for full-sib family
sampling), and in table 6.2 (for half-sib family sampling).
The results from method (1) were the same after iterating for 2N
generations and for 5N generations in all cases (to 5 decimal places).
This means that the equilibrium values have effectively been reached
after 2N generations, and we can use these as the eX,act
value~.
Method
(2) provides results which differed from the exact results by at most
3.2%.
The worst case was when N=100, n.=k=1000, and c=0.05.
If N}1000
and/or c)0.25 the method (2) results differ from the exact results by
less than 1%.
Method (3) fails for k=l family, with errors often greater than
100%.
This is not surprising as for k=l the method has an error of
0(1).
In what follows, only the cases where k)5 families are
138
Table 6.1. Gene frequency correlations for N=lOOO, n.=lOO and for
full-sib family sampling using different calculation methods.
d2
S2
method
c= 0.50
c= 0.05
c= 0.50
c= 0.05
1
1
0.11637
0.29708
0.01010
0.41387
1
3 to O(l/k)
0.12879
0.23201
0.25754
0.36146
1
3 to o(l/k2 )
0.13007
0.28317
0.26259
0.46582
10
1
0.01646
0.02997
0.03309
0.04712
10
3 to O(l/k)
0.01633
0.02943
0.03258
0.04576
10
3 to O(1/k2 )
0.01647
0.02998
0.03313
0.04713
10
1
0.03536
0.07676
0.04300
0.09198
10
3 to O(l/k)
0.04444
0.08007
0.08882
0.12468
10
3 to O(1/k2
0.03713
0.07784
0.05955
0.10547
k
Even
Uneven
)
100
1
0.02724
0.05958
0.03037
0.06702
100
3 to O(l/k)
0.03601
0.06488
0.07195
0.10101
100
3 to O(1/k2 )
0.02853
0.06056
0.04203
0.07730
No family structure
100
1
0.00511
0.00918
0.01018
0.01427
100
3 to O(l/k)
0.00508
0.00917
0.01008
0.01419
100
3 to O(1/k2
0.00511
0.00919
0.01018
0.01429
1
0.00536
0.00962
0.01043
0.01472
3 to O(l/k)
0.00533
0.00962
0.01000
0.01005
3 to O(1/k2 )
0.00536
0.00964
0.01010
0.01015
)
Classical
e-
139
Table 6.2.
Gene frequency correlations for N=1000, n.=100 and for
" half-sib family sampling using different calculation methods.
d2
&2
method
c= 0.50
c= 0.05
c= 0.50
c= 0.05
1
1
0.02597
0.09504
0.01014
0.10318
1
3 to O(l/k)
0.06694
0.12059
0.07192
0.12556
1
3 to o(l/k2 )
0.03696
0.10321
0.04264
0.12160
1
0.01053
0.01921
0.01564
0.02446
10
3 to O(l/k)
0.01071
0.01930
0.01570
0.02431
10
3 "to O( 1/k2 )
0.01053
0.01923
0.01566
0.02448
10
1
0.01732
0.03854
0.01702
0.03982
10
3 to O(l/k)
0.02476
0.04462
0.02976
0.04962
10
3 to O(1/k2
0.01802
0.03917
0.01942
0.04188
k
Even
.10
Uneven
)
100
1
0.01445
0.03176
0.01457
0.03283
100
3 to O(l/k)
0.02055
0.03702
0.02554
0.04203
100
3 to O(1/k2 )
0.01491
0.03219
0.01623
0.03433
,
."
No family structure
100
1
0.00511
0.00918
0.01018
0.01427
100
3 to O(l/k)
0.00508
0.00917
0.01008
0.01419
100
3 to O(1/k2 )
0.00511
0.00919
0.01018
0.01429
1
0.00536
0.00962
0.01043
0.01472
3 to O(l/k)
0.00533
0.00962
0.01000
0.01005
3 to o(l/k2 )
0.00536
0.00964
0.01010
0.01015
Classical
140
considered.
When c=0.5 method (3) works quite well for the even family
o.
distribution cases.
cases.
Errors are less than 1% for calculating d2 in all
To achieve this the 0(1/k2 ) approximation must be used if n. is
small, and for the half-sib family cases if the number of families, k,
is small.
Errors are less than 1% for calculating S2 provided n.)lOO.
To achieve this the 0(1/k2 ) approximation must be used if k is small.
For n.=lO errors are up to 2%.
Method (3) works less well for the uneven family distribution cases
when c=0.5.
Using the 0(1/k2 ) approximation errors are less than 10% in
all cases for calculating d2
,
but only for the case of half-sib families
with the total sample size, n., equal to 10, for calculating S2.
In
some cases the error in calculating S2 exceeds 100%.
The adequacy of the method (3) approximations for smaller values of
c are similar to the case where c=0.5.
e-
Errors are generally less than
10% for d2 and for the even family distribution cases for S2.
are generally less than 1% for both d2 and
S2
for the even family
distribution cases.
To achieve this the 0(1/k2
used if k is small.
An
exception to these
Errors
)
approximation must be
statement~
is that if both N
and c are small (N=lOO and c=0.05) then errors are usually greater than
1% for both d2 and S2.
The approximations of method (3) are found to be quite poor for
uneven family sizes.
This is because these approximations are either
O(l/k) or 0(1/k2 ) if there is even family distribution, but for the
uneven family distribution case there is a family of size close to n./2,
and so the omitted terms are actually 0(1).
~
141
Effect of Family structure on d 2 and S 2.
The values of d2 and S2
are larger for full-sib families than for half-sib families.
The values
are larger if there is family structure than for the case of no family
structure, and increase with increasing family size (decreasing k),
except that the value of S2 decreases at k=l when c=0.50.
The values of
d2 are larger for uneven family distribution than for even family
distribution.
The values of S2 are larger for uneven family
distribution than for even family distribution, except for the cases
n.=lOO or n.=lOOO with k=5 families.
In general the values of d2 and S2 are larger if the individuals in
the sample are more related.
This also holds when comparing the no
family structure case with the classical case where the values are
larger for the latter case.
The individuals for this case are more
related as they result from random mating among the parents and so the
sample includes half-sibs and full-sibs with positive probability,
whereas for the case of no family structure the parents are distinct.
The magnitude of the effect of family structure can be seen by
comparing the d2 and S2 values for two different values of k.
For
example, for N=lOOO and sample size n.=lOO, d2 and S2, increase by a
factor of five to six for full-sibs, and a factor of two to three for
half-sibs, when the number of families, k, changes from 100 to 5.
Similar results hold for other combinations of n. and N.
The values' of d2 and S2 increase with increasing linkage
(decreasing c), and decrease with increasing N.
142
6.8. Estimation of Effective Population Size
Population Size EstiJ11ation Using Gene Frequency Correlations.
Noting that the right hand sides of equations (6.7.2) through (6.7.5)
are all linear in lIN suggests a method for estimating N from data on
linkage disequilibrium.
This involves calculating r 2 or r*2 and
equating this to d2 or S2 respectively, and solving for N (given that
the family structure and linkage are known).
This has been done for the
classical case (assuming no family structure) by Laurie-Ahlberg and Weir
(1979) for laboratory populations, and by Hill (1981) for both a
laboratory population and a natural population.
A problem with this method is that sampling effect (function of the
ni) swamps the effect due to N.
Estimates of N are often negative,
especially for natural populations, which implies that the best estimate
of population size is infinite (Hill, 1981).
This section gives an
example of data from a natural population that has been collected in
such a way that individuals in the sample are more likely to be related
than those in the whole population.
We wish to see what effect such
family structure would have on estimating N.
ftfouse Data.
The data, obtained from H.R. Nash ,(personal
communication), consists of allozyme data from two mouse populations.
One sample (A) is from a rick in Somerset, and the other (B) is from
several small ricks in Scotland.
Only loci for which all individuals in
a sample were scored are used, giving 15 loci for the 100 individuals
scored for sample A, and 16 loci for the 120 individuals scored for
sample B.
The quartiles and mean of the r*2 values calculated from each
pair of loci are given for each data set in table 6.3.
143
Table 6.3. Sample statistics for r*2 values for mouse data.
Sample A
Sample B
Mean
0.0144
0.0287
1st quartile
0.0016
0.0017
2nd quartile
0.0066
0.0079
3rd quartile
0.0192
0.0332
Statistic
Estimation of No
Each of these r*2 values is used to get an
estimate of N using the method described above, and assuming several
types of family structure in the data.
be unlinked (c=0.5).
All pairs of loci are assumed to
The medians and means of the resulting estimates
of N are given in table 6.4.
The medians tend to give more consistent
results than the means, and give negative estimates of N (implying an
infinite population size).
The fact that some means change dramatically between first and
second order approximations illustrates how small changes in the
parameters or the value of r*2 can give large changes in the estimate of
N.
Let 80 2 be the value of 82 when N is infinite.
For example, from
(6.7.3),
80 2
~
N2(3-4c+4c2)/8n~
+(5+4c-4c2 )/8n.
-N3/2n~
+N22(5-4c+4c2 )/8n1
+N2(1+4c-4c2)/4n~ +(5-4c+4c2)/8n~
for full-sib family sampling.
true when r*2 is close to 80 2 .
Then the above statement is especially
Slightly larger values of r*2 give large
estimates of N and slightly smaller values of r*2 give large negative
estimates of N.
This shows the problems with using this method to
144
Table 6.4.
e,
Estimates of N for the mouse data
'
",
families
data
median
number
size
A
10
10
f
-2.7
-2.7
-6.3
130.7
10
10
h
-5.5
-5.3
-0.5
68.1
20
5
f
-4.6
-4.5
-11.0
-3.3
20
5
h
-6.9
-6.7
-1.9
0.2
50
2
f
-7.0
-6.9
-1.9
-1.2
50
2
h
-8.2
-8.0
-3.0
1.5
100
1
nf
-8.7
-8.7
-18.2
-U.8
c
-34.9
-34.7
-73.0
-47.2
O(l/k)
O( 1/k2
)
O(l/k)
o(l/k2 )
set
B
type
mean
10
12
f
-2.7
-2.7
18.2
-2.3
10
12
h
-5.8
-5.7
-6.6
-2.6
20
6
f
-4.6
-4.5
-2.5
3.0
20
6
h
-7.5
-7.4
1.0
-5.9
60
2
f
-8.0
-7.8
,
1.3
.- 4.0
60
2
h
-9.3
-9.3
-18.5
-9.6
120
1
nf
-10.0
-9.9
-12.1
-9.4
c
-40.2
-39.5
-48.6
-37.8
Key: family type: f
= full-sib,
c
= classical
h
= half-sib,
nf
= no
family structure,
e-
145
estimate N.
Although the estimates resulting from a small number of families
tend to give smaller negative estimates of N, this is not an indication
that using the family structure formulas gives better results.
This is
because small negative estimates result from values of r*2 much smaller
than 80 2 , whereas large negative estimates result from values of r*2
only slightly smaller than 80 2 .
This is consistent with the results of
the previous section where it was found that relatedness of individuals
in a sample would lead to higher values of 82 •
Use of ResamplingMethodS.
As noted at the beginning of section
6.7, d2 and 82 are not the expectations of r 2 and r*2 but are
approximations, being the ratio of expectations.
This is a case where
the resampling estimates reduce the bias (see example 1.2.3).
It is
possible that such estimates may improve the estimation of N.
The
medians (over pairs of loci) of the jackknife and bootstrap estimates of
r*2 for the mouse data are given in table 6.5.
For data set A, only 14
loci were used as one of the loci gave many jackknife and bootstrap
samples which were monomorphic.
Unfortunately the resampling methods
indicate that r*2 is biased downward for 82 , and so we can expect no
improvement in the estimation of N.
Table 6.5. Resampling estimate medians for r*2 values for mouse data.
Sample A
Sample B
Original estimate
0.0077
0.0079
Jackknife
0.0007
0.0011
Bootstrap
0.0016
0.0001
Method
146
Another use of the bootstrap is in obtaining confidence intervals.
~
,
If we replace the negative estimates of N by an estimate equal to
positive infinity, we can construct a one-sided upper confidence
interval for N.
This was done with 200 bootstrap replications using the
bias-corrected percentile method, since both resampling methods indicate
a considerable bias in r*2.
The results are presented in table 6.6, for
several types of family sampling, with the lower end points actually
being the medians (over pairs of loci) of the lower end points of the
In each case the O(l/k2 ) approximations were
95% confidence intervals.
used.
The intervals tend to confirm that it is difficult to get good
estimates of N using this method.
Table 6.6. Bootstrap confidence intervals for N for mouse data.
data
families
half-sibs
number
A
10
10
(co,co)
(9.0,ao)
20
5
(17.4,ao)
(6.7,ao)
50
2
(6.7,ao)
(5.8,ao)
100
1
(5.6,ao)
10
12
(6.0,ao)
(2.6,ao)
20
6
(3.2,ao)
(2.4,ao)
60
2
(2.4,ao)
(2.3,ao)
120
1
(2.3,ao)
B
size
full-sibs
set
,
classical
(22.4,ao)
(9.l,ao)
6.9 The Effect of Family Structure: Conclusion
The amount of family structure (for example full-sibs or half-sibs)
~
147
in a samp.le can have a large effect on any statistics that are
calculated from that sample.
The interpretation of such statistics
should be based on their moments (in particular the mean and variance)
which are calculated by taking the family structure into account.
This chapter has investigated the effect of family structure on
several statistics:
(1) p, the sample gene frequency
(2) D, the sample Hardy-Weinberg disequilibrium
-- (4) Q=PA(l-pA)PB(l-PB),
(3) DAB and AAB, two estimates of linkage disequilibrium
-
and R =
- - - -
--
[PA(1-2pA)+~][PB(1-2pB)+P:]
(5) r2 and r*2, two estimates of gene frequency correlation.
The effect of increasing the amount of family structure in the data
was as follows:
(1) Var(p) increases
(2) Both the variance and the bias in D increase
-
(3) The variance of DAB and AAB may increase or decrease
-
-
(4) E(Q) and E(R) decrease
(5) E(r2) and E(r*2) increase.
In some cases the family structure can have a dramatic effect on the
magnitude of these moments.
For example if we estimate the
Hardy-Weinberg disequilibrium from a sample of size n.=IOOO consisting
of 10 families, and let 8=0, then the variance is 0.0000450 if the
families are half-sibs, and 0.0008604 if the families are full-sibs.
This is close to a 20 fold increase.
In general, suppose we have the sample mean of a quantity,
which can be expressed as
Yij,
148
Yij =
where
~i
~
+
+ eij,
~i
is the effect of the i-th family (i=l, ... ,k), and eij is the
effect of the j-th individual (j=l, ... ,ni) within the i-th family, with
E(~i)
= E(eij) = 0
Var(Yij) =
~a2
+
COV(Yij,Yijl)
~2
=~a2.
Then the variance of the sample mean, Y.. , will increase with
increasing family structure.
We can see this in two ways.
The first is
if we fix n., the total number of individuals in the sample, and
consider varying k, the number of families (small k referring to more
family structure than large k).
Since the variance is
Var(Y.• ) =
~a2/k
+
~2/n.
we can see that this quantity will increase with decreasing k.
Suppose that we now fix both k and n., but let the relationship of
the family members vary, using subscripts f and h to denote that the
variance ccmponents refer to full-sib and half-sib families
respectively.
Since Var(Yij) remains fixed (does not depend on any
other individual and so does not
d~pend
on the relationship) we have
We would expect the covariance for full-sibs to be larger than that for
half-sibs, i. e.
We now find
Varf(Y.. ) =
{~af2
+
~f2
+
(n./k-I)~af2}/n.
>
{~ah2
+
~h2
+
(n./k-I)~ah2}/n.
= Varh (y.. )
•
149
Again we find that increasing the family structure (in this case
increasing the relationship within families) increases the variance.
We see then that one must take the family structure in the data
into account when analyzing genetic data.
If the actual family
structure of the data is not known, but is of importance, the data
should be analyzed several times, using a likely range of types of
family structure.
Any conclusions should be made only if they are
consistent with each of the analyses.
150
7.
REFERENCES
Arvesen, J.N. (1969) Jackknifing U-statistics. Ann. flfath. Statist. 40,
2076-2100.
Arvesen, J.N. and T.H. Schmitz (1970) Robust procedures for variance
component problems using the jackknife. Biometrics 26, 677-686.
Ayres, F.A. and S.J. Arnold (1983) Behavioural variation in natural
populations. IV. Mendelian models and heritability of a feeding
response in the garter snake, ThOJlllJophis Elegans. Heredity 51,
405-413.
Babu, G.J. and K. Singh (1983) Inference on means using the bootstrap.
Ann. Statist. 11, 999-1003.
Bailey, N. T.J. (1961) The flfathematical Theory of Genetic Linkage. Oxford
Univ. Press, Oxford.
Beran, R. (1982) Estimated sampling distributions: the bootstrap and
competitors. Ann. Statist. 10, 212-225.
Bickel, P.J. and D.A. Freedman (1981) Some asymptotic theory for the
bootstrap. Ann. Statist. 9,1196-1217.
Bickel, P.J. and D.A. Freedman (1984) Asymptotic Normality and the
Bootstrap in Stratified Sampling. Ann. Statist. 12, 470-482.
Buckland, S.T. (1984) Monte Carlo Confidence Intervals. Biametrics 40,
811-817.
Bulmer, M.G. (1980) The flfathematical Theory of Quantitative Genetics.
Clarendon Press, Oxford.
Clark, A.G. and J. Bundgaard (1984) Selection components in background
replacement lines of drosophila. Genetics 108, 181-200
,
Cockerham, C.C. (1971) Higher order probability functions of identity of
alleles by descent. Genetics 69,235-246.
Cockerham, C.C. and B.S. Weir (1977) Digenic descent measures for finite
populations. Genetical Research 30, 121-147.
Crozier, R.H., P. Pamilo and Y.C. Crozier (1984) Relatedness and
microgeographic genetic variation in ~ytidoponera mayri, an
Australian arid-zone ant. Behav. Ecol. Sociobiol. 15, 143-150.
Efron, B. (1979) Bootstrap methods: Another look at the jackknife. Ann.
Statist. 7, 1-26
Efron, B. (1981a) Nonparametric standard errors and confidence intervals
(with discussion). Canadian Jour. of Stat~stics 9, 139-172.
4It
151
Efron, B. (198lb) Nonparametric estimates of standard error: The
jackknife, the bootstrap and other methods. Biometrika 68, 589-599.
Efron, B. (1982) The Jackknife, the Bootstrap, and Other Resampling
Plans, CBMS-NSF Regional Conference Series in Applied Mathematics,
Monongraph 38, Philadelphia: SIAM.
Efron, B. (1983) Estimating the error rate of a prediction rule:
Improvement on cross-validation. Jour. Amer. Statist. Assoc. 78,
316-331.
Efron, B. (1985) Bootstrap confidence intervals for a class of
parametric problems. Biometrika 72, 45-58.
Efron, B. and G. Gong (1983) A leisurely look at the bootstrap,
jackknife, and cross-validation. American Statistician 37, 36-48.
Efron, B. and C. Stein (1981) The jackknife estimate of variance. Ann.
Statist. 9, 1187-95.
Felsenstein, J. (1985) Confidence limits on phylogenies: An approach
using the bootstrap. Evolution 39, 783-791
Freedman, D. A. (1981) Bootstapping regression models. Ann. Stat. 9,
1218-1228.
Freedman, D.A. and S.C. Peters (1984) Bootstrapping a regression
equation: some empirical results. Jour. Amer. Statist. Assoc. 79,
97-106
Funkhouser, E.M. and M. Grossman (1982) Bias in estimation of the
sampling variance of an estimate of heritability. Jour. of Heredity
73, 139-141.
Gray S.L. and W.R. Schucany (1972) The Generalized Jackknife Statistic.
New York: Marcel Dekker.
Hartigan, J.A. (1969) Using subsample values as typical values. Jour.
Amer. Statist. Assoc. 64, 1303-1317.
Hill, W.G. (1981) Estimation of effective population size from data on
linkage disequilibrium. Genetical Research 38, 209-216.
Hinkley, D.V. (1977) Jackknifing in unbalanced situations. Technometrics
19, 285-292.
Hinkley, D.V. (1978) Improving the jackknife with special reference to
correlation estimation. Biometrika 65, 13-22.
Hocking, R.R. (1973) A discussion of the two-way mixed model.
Statistician 27, 148-152.
American
152
Jaeckel, L. (1972) The infinitesimal jackknife. Memorandum MM
72-1215-11, Bell Laboratories, Murray Hill, NJ.
•
Jones, H.L. (1974) Jackknife estimation of functions of stratum means.
Biometrika 61, 343-348.
Kish, L. and M.R. Frankel (1974) Inference from Complex Samples (with
discussion). J. Roy. Statist. Soc. ser. B. 36, 1-37.
Laurie-Ahlberg, C.C. and B.S. Weir (1979) Allozymic variation and
linkage disequilibrium in some laboratory populations of Drosophila
ft./elanogaster. Genetics 92, 1295-1314.
Lee, A.J. (1985) On estimating the variance of aU-statistic. Comm.
Statist.- Theor. ft./eth. 14, 289-301.
Lemeshow, S. and R. Epp (1977) Properties of the balanced half-sample
and jackknife variance estimation techniques in the linear case.
Commun. Statist.-Theor. ft./eth. 6, 1259-1274.
Lenth, R.V. (1983) Some properties of U-statistics. Amer. Statist. 37,
311-313.
Miller, R.G. (1964) A trustworthy jackknife. Ann. ft./ath. Statist. 35,
1594-1605.
~
~
Miller, R.G. (1974) The jackknife - a review. Biometrika 61, 1-15.
Nei, M. and A.K. Roychoudhury (1974) Sampling variances of
heterozygosity and genetic distance. Genetics 76, 379-90.
Pamilo, P. (1984) Genotypic correlation and regression in social groups:
multiple alleles, multiple loci and subdivided populations.
Genetics 107, 307-320
Parr, W.C. (1983) A note on the Jackknife, Bootstrap and Delta Method
estimators of bias and variance. Biometrika 70,' 719-722 ..
Parr, W.C. and W.R. Schucany (1980) The jackknife: a bibliography. Int.
Statist. Rev. 48, 73-78.
Parr, W.C., and H.D. Tolley (1982) Jackknifing in categorical data
analysis. Ann. Statist. 24, 67-79.
Quenouille, M. (1949) Approximate tests of correlation in time series.
J. Roy. Statist. Soc. ser. B. 11, 68-84.
Quenouille, M. (1956) Notes on bias in estimation. Biometrika 43, 253260.
Reynolds, J., B.S. Weir and C. Clark Cockerham (1983) Estimation of the
•
.
153
coancestry coefficient: basis for a short term genetic distance.
Genetics 105, 767-779.
Robertson, A. and W.G. Hill (1984) Deviations from Hardy-Weinberg
proportions: Sampling variances and use in estimation of inbreeding
coefficients. Genetics 107, 703-718.
Schenker, N. (1985) Qualms about Bootstrap Confidence Intervals. Jour.
Am. Statist. Assoc. 80, 360-361.
Searle, S.R. (1971) Linear Models. Wiley, New York
Serfling, R.J. (1980) Approximation TheoreJDs of MatheJDatical Statistics.
Wiley, New York.
Silverman, B.W. (1983) Some properties of a test for multimodality based
on kernel density estimates. in Probability, Statistics and
Analysis (ed. J.F.C.Kingman and G.E.H.Reuter) Cambridge University
Press, 248-259.
Singh, K. (1981) On the asymptotic accuracy of Efron's Bootstrap. Ann.
Statist. 9, 1187-1195.
Swofford, D.L. and R.B. Selander (1981) BIOSYS-l: A computer program
for the analysis of allelic variation in genetics. Users Manual,
Department of Genetics and Development, University of Illinois,
Urbana.
Toelle, V.D. and O.W. Robison (1985) Estimates of genetic correlations
between testicular measurements and female reproductive traits in
cattle. Jour. Animal Science 60, 89-100.
Tukey, J. (1958) Bias and confidence in not quite large samples
(abstract), Ann. Math. Statist. 29, 614.
Weir, B.S. and C.C. Cockerham (1969) Group inbreeding with two linked
loci. Genetics 63, 711-742.
Weir, B.S. and C.C. Cockerham (1984) Estimating F-statistics for the
analysis of population structure. Evolution 38, 1358-1370.
Weir, B.S. and W.G. Hill (1980) Effect of mating structure on variation
in linkage disequilibrium. Genetics 95, 477-488.

Download Report

Dodds, K.GResampling Methods in Genetics and the Effect of Family Structure in Genetic Data"

Paperzz.com

Your Paperzz