Estimation of Relatedness by DNA Fingerprinting 1

Michael Lynch
Department of Ecology, Ethology and Evolution, University of Illinois
The recent discovery of hypervariable VNTR (variable number of tandem repeat)
loci has led to much excitement among population biologists regarding the feasibility
of deriving individual estimates of relatedness in field populations by DNA fingerprinting. It is shown that unbiased estimates of relatedness cannot be obtained at
the individual level without knowledge of the allelic distributions in both the individuals of interest and the base population unless the proportion of shared marker
alleles between unrelated individuals is essentially zero. Since the latter is usually
on the order of 0.1-0.5 and since there are enormous practical difficulties in obtaining
the former, only an approximate estimator for the relatedness can be given. The
bias of this estimator is individual specific and inversely related to the number of
marker loci and frequencies of marker alleles. Substantial sampling variance in
estimates of relatedness arises from variation in identity by descent within and
between loci and, with finite numbers of alleles, from variation in identity in state
between genes that are not identical by descent. In the extreme case of 25 assayed
loci, each with an effectively infinite number of alleles, the standard error of a
relatedness estimate is no less than 14%, 20%, 35%, and 53% of the expectation for
full sibs and second-, third-, and fourth-order relationships, respectively. Attempts
to ascertain relatedness by means of DNA fingerprinting should proceed with
caution.
Introduction
Tests of a number of ideas in evolutionary biology, particularly in the areas of
social organization and kin selection, require that the genetic relationships of wild
individuals be resolved accurately (Hamilton 1964; Grafen 1985). It is also useful to
know the relatedness of potential mates in controlled breeding programs, for the advancement of economically important characters or for the preservation of endangered
species, in order to avoid the deleterious consequences of inbreeding. Even among
pair-bonding species, behavioral observations cannot be trusted as definitive indicators
of blood relationships because of the possibility of extrapair copulations; in species
lacking pair-bonds, this problem is exacerbated. It is necessary to rely on genetic
markers. In the past, almost all attempts to ascertain relationships have utilized polymorphic enzyme loci (Metcalf and Whitt 1977; McCracken and Bradbury 1981; Pamilo
and Crozier 1982; Crozier et al. 1984; Pamilo 1984; Wilkinson and McCracken 1986).
Since the analysis of each such locus requires a separate biochemical protocol and
since most loci are monomorphic (hence providing no information), this can be a
rather time-consuming procedure. Moreover, because individual relatedness estimates
(including paternity analyses) based on isozyme surveys are statistically unreliable
(Chakraborty et al. 1988), field studies have not advanced beyond the estimation of
average relatedness among group members.
1. Key words: none.
Address for correspondence and reprints: Michael Lynch, Department of Ecology, Ethology, and Evolution, University of Illinois, Shelford Vivarium, 606 East Healey Street, Champaign, Illinois 61820.
Mol. Biol. Evol. 5(5):584-599. 1988.
© 1988 by The University of Chicago. All rights reserved.
0737-4038/88/0505-0010$02.00
There is now much hope that DNA fingerprinting techniques (Gill et al. 1985;
Jeffreys et al. 1985a, 1985b, 1987; Burke and Bruford 1987; Jeffreys and Morton 1987;
Nakamura et al. 1987; Wetton et al. 1987; Vassart et al. 1987) may revolutionize the
study of social interactions by allowing investigators to estimate relatedness with an
unprecedented level of accuracy. DNA fingerprinting has two big advantages over
isozyme analysis. First, the markers for many loci are simultaneously visible on a
single gel. Second, the average number of alleles per locus is very much greater than
in the case of enzymatic loci. Thus, the yield of information per unit effort is expected
to be quite high.
DNA fingerprinting is now being advocated as a powerful method for paternity
exclusion and forensic analysis in humans. The technique can be applied to other
mammals, birds, and, very likely, to most other eukaryotes (Vassart et al. 1987).
Thus, many laboratories are now gearing up for the application of DNA fingerprinting
to wild populations and to captive populations of endangered species. The general
attitude seems to be that the success of DNA fingerprinting in the analysis of parent-offspring relationships will extend to more distant relationships. Because DNA technology is still quite expensive and because the fingerprinting technique has some technical and statistical problems, it seems useful at this point to outline the limitations
of the technique.
Background
DNA fingerprinting relies on the existence of families of minisatellites that are
dispersed throughout the genome in a large number of hypervariable loci. Individual
minisatellites consist of tandem arrays of short (10-60-bp) repeat units. While DNA
sequence variation may exist among repeat units, the most important variation from
the standpoint of the fingerprinting technique concerns the variation in the number
of tandem repeats per array. Substantial amounts of such allelic variation appear to
be generated by high rates of unequal crossing-over, possibly due, in part, to the existence of core sequences within each repeat unit that act as recombination hot spots
(Jeffreys et al. 1985a).
Standard DNA techniques can be used to discriminate the members of a minisatellite family on the basis of their size. After a genomic digest with a restriction
endonuclease that does not cleave the repeat unit, the fragments are separated by size
by agarose gel electrophoresis, transferred to an appropriate membrane via a Southern
blot, and hybridized to a radio-labeled probe for the repeat unit. The length of each
fragment is a function of the number of tandem repeats contained within it. Depending
on the stringency of the hybridization conditions, many loci are probed simultaneously
under this procedure, and a dozen or more bands typically appear in each lane of the
gel (fig. 1). The array of bands for an individual is its DNA fingerprint. In humans,
the length variation at each VNTR (variable-number-of-tandem-repeat) locus is so
great that the fingerprints of virtually all individuals (other than monozygotic twins)
are unique (Jeffreys et al. 1985b).
A number of problems arise in interpreting the genetic basis of an individual’s
DNA fingerprint. First, it is generally unknown which markers belong to which loci.
Allelism can be excluded whenever two unique markers of a single parent are found
to be inherited by the same offspring, but large breeding designs would have to be
implemented to identify alleles with certainty. Since dozens of alleles may exist at
some loci, we can regard such work as unfeasible for most studies. This raises a second
problem. If markers cannot be associated with specific loci, then it cannot be ascertained
whether individuals are homozygous or heterozygous. Consequently, if an appreciable
probability of homozygosity exists, then the only feasible measure of similarity between
individuals, i.e., the fraction of shared marker bands, will not necessarily be equivalent
to the fraction of shared genes.
FIG. 1. DNA fingerprints for two individuals, A and B. There are 26 distinct bands, a-z. Shared bands
are denoted with dashed lines.
If the DNA fingerprinting technique is to be of any use in estimating relatedness,
the fraction of shared marker bands must be a reliable surrogate. In figure 1, A exhibits
60% of the markers of B, while B exhibits 66.7% of the markers of A. Thus, as a first
approximation, one might be inclined to suggest that A and B are first-degree relatives.
However, if there is a significant chance that unrelated individuals share the same
markers, as will always be the case unless the allelic variation is enormous, the proportional similarity of two DNA fingerprints will exceed the relatedness. The bias is
directly related to the frequencies of markers in the population, but, as noted above,
these are generally unknown.
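The band-sharing arithmetic quoted for figure 1 can be sketched in a few lines of Python. The assignment of individual bands below is hypothetical; only the counts (12 shared bands, 18 bands in A, 20 in B) are inferred here from the two percentages and the 26 distinct bands, and the function name is an illustrative choice.

```python
def band_sharing(bands_a, bands_b):
    """Return (share of B's bands seen in A, share of A's bands seen in B)."""
    a, b = set(bands_a), set(bands_b)
    n_shared = len(a & b)
    return n_shared / len(b), n_shared / len(a)


# Hypothetical assignment of the 26 bands a-z: 12 shared, 6 only in A, 8 only in B.
letters = "abcdefghijklmnopqrstuvwxyz"
bands_a = letters[:12] + letters[12:18]
bands_b = letters[:12] + letters[18:]

b_in_a, a_in_b = band_sharing(bands_a, bands_b)
print(f"A exhibits {b_in_a:.1%} of B's markers; B exhibits {a_in_b:.1%} of A's")
```

The two proportions differ only because A and B carry different numbers of bands, which is why the similarity measure is directional.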
The following theory considers the connection between DNA fingerprint similarity
and relatedness. The expected similarity between various kinds of relatives will be
evaluated first, and an estimator for relatedness will be given. Then, the sampling
variance of similarity resulting from variation in identity by descent and from background variation will be discussed. The formulas presented are meant primarily to
serve as a heuristic guide to evaluating the practical utility of DNA fingerprinting for
ascertaining relationships.
Unfortunately, several technical limitations of the fingerprinting technique preclude the development and application of a rigorous statistical model. First, when
allelic diversity is high and a large fraction of the lane of a gel is occupied by radiolabeled markers, as is normally the case with DNA fingerprints, there is a very real
possibility of comigration of unrelated fragments. Comigration will inflate the variance
of similarity, but the practical difficulties noted above usually will prevent a quantitative
evaluation of the problem. Second, substantial numbers of markers with very low
molecular weights are usually run off the end of the gel, leaving one with an incomplete
description of the minisatellite family. If the variation among missing fragments differs
from that among observed fragments, as appears to be the case (Jeffreys et al. 1985a),
the connection between similarity and relatedness will be further obscured. Third,
some marker loci may be linked, in which case individual markers should not be
treated as independent estimators of relatedness. However, the ascertainment of linkage
relationships is clearly beyond the scope of most of the field studies to which the
fingerprinting technique is likely to be applied.
Returning to figure 1, we see that individual A has two fewer bands than does B.
We now see that there are at least three potential explanations for this: (1) A might
be homozygous for two more loci than B; (2) two pairs of unrelated markers in A
may have comigrated; or (3) two of A’s markers may have migrated off the gel. In
the following analyses, it is assumed that comigration, loss of markers, and linkage
are unimportant, so that variation in marker number per individual can only be a
result of variation in homozygosity. It is also assumed that the marker alleles are
neutral and unlinked with any selected loci and that the probability of mutation is
negligible. The resultant conclusions regarding the limits of the power of DNA fingerprinting for ascertaining relatedness are therefore conservative.
Expected Proportion of Shared Markers
Many different technical definitions of relatedness exist (Michod and Hamilton
1980; Grafen 1985). Here we will define the relatedness of individual B to individual A,
$r_{BA}$, as the expected fraction of B’s autosomal genes that are identical by descent (IBD)
with those in A (Crozier 1970). Identity by descent refers to situations in which two
genes are direct copies of an ancestral gene. In contrast, identity in state (IIS) implies
nothing about coancestry. Genes that are IIS need not be IBD, but genes that are IBD
must also be IIS. In the special case of no inbreeding, which will be our primary focus,
$r_{BA}$ is equivalent to all other definitions of relatedness, including Wright’s (1922)
correlation coefficient of relationship and Hamilton’s (1971) regression coefficient of
relatedness.
Focusing on a single locus, table 1 illustrates the ways in which IBD can arise
between the genes in A and B. Associated with each of the nine configurations is a
probability of occurrence (Jacquard’s [1974, pp. 106-107] condensed coefficients of
identity). The coefficients $\phi_1$, $\phi_2$, and $\phi_3$ are probabilities of situations in which
neither A nor B is inbred, $\phi_4$ and $\phi_5$ of situations in which only A is inbred, $\phi_6$ and
$\phi_7$ of situations in which only B is inbred, and $\phi_8$ and $\phi_9$ of situations in which both
A and B are inbred. For any pair of individuals, the nine coefficients, which sum to
one, are defined by the type of relationship. For example, if A and B are parent and
offspring and if neither is inbred, all coefficients are equal to zero, except that $\phi_2 = 1$;
an offspring always inherits one, and only one, gene from its parent at each locus. In
terms of the coefficients,
$$r_{BA} = \phi_3 + \phi_7 + \phi_9 + \frac{\phi_2 + \phi_5}{2}. \tag{1}$$
Ideally, estimates of relatedness based on DNA fingerprints should converge asymptotically on $r_{BA}$ as more markers are compared.
The operational measure of similarity between two DNA fingerprints is the proportion of shared bands. If we let $k_B$ be the number of bands exhibited by B, then the
similarity of B to A is
$$S_{BA} = \frac{1}{k_B} \sum_{j=1}^{k_B} P_{mj}, \tag{2}$$
where $P_{mj} = 1$ if A shares the jth marker of B and $P_{mj} = 0$ otherwise. The subscript
m simply denotes the (generally unknown) marker locus. To evaluate the connection
between the statistic $S_{BA}$ and the parameter $r_{BA}$, the behavior of the marker-specific
similarities needs to be defined. We start with $E(P_{mj})$, the conditional probability that
A will have the band given that B has it. This probability depends on the distribution
of identity by descent between the two individuals. If we let $\theta_{imj}$ be the probability
that A exhibits the jth band of B conditional on identity relationship i, then
$$E(P_{mj}) = \sum_{i=1}^{9} \phi_i \theta_{imj}. \tag{3}$$
Expressions for three of the $\theta_{imj}$ values will be derived to illustrate the basic
concepts. Under identity relationship 1, A and B are noninbred and not IBD. Thus,
the probability that A exhibits a particular band of B is simply the probability that A
is a homozygote or heterozygote for the allele: $\theta_{1mj} = p_{mj}^2 + 2p_{mj}(1 - p_{mj}) = p_{mj}(2 - p_{mj})$,
where $p_{mj}$ is the population frequency of genes at locus m that are of type j.
Under identity relationship 2, A and B have a single gene IBD and neither is inbred.
We know that individual B has at least one allele that yields the observed band. The
probability that the second allele in B is the same as the first (and that B is therefore
homozygous) is $p_{mj}$, in which case A must exhibit the marker. The probability that B
is a heterozygote is $(1 - p_{mj})$, in which case there is a 0.5 probability that the marker
gene in B is also present in A. If it is the alternative allele in B that is transmitted to
A, there is still a $p_{mj}$ probability that A’s other gene at the locus is identical in state
with the marker. To sum up, $\theta_{2mj} = p_{mj} + (1 - p_{mj})(0.5 + 0.5p_{mj}) = p_{mj} + 0.5(1 - p_{mj}^2)$.
Finally, for identity relationship 3, both alleles exhibit identity by descent
between individuals, so $\theta_{3mj} = 1$. These and the remaining coefficients are summarized
in table 1.
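The three conditional probabilities derived above translate directly into code. The following is a minimal sketch with illustrative function names; it only checks that the step-by-step expressions agree with the closed forms in the text, and that $\theta_{2mj}$ lies midway between $\theta_{1mj}$ and 1.

```python
def theta_unrelated(p):
    """theta_1mj: no genes IBD, so A shows the band if it is jj or j/other."""
    return p * p + 2 * p * (1 - p)


def theta_one_ibd(p):
    """theta_2mj: one gene IBD between A and B, neither inbred."""
    return p + (1 - p) * (0.5 + 0.5 * p)


def theta_two_ibd(p):
    """theta_3mj: both genes IBD, so A always shows the band."""
    return 1.0


for p in (0.01, 0.1, 0.5):
    t1, t2 = theta_unrelated(p), theta_one_ibd(p)
    assert abs(t1 - p * (2 - p)) < 1e-12              # closed form for theta_1mj
    assert abs(t2 - (p + 0.5 * (1 - p * p))) < 1e-12  # closed form for theta_2mj
    assert abs(t2 - 0.5 * (1 + t1)) < 1e-12           # midway between theta_1mj and 1
    print(p, round(t1, 4), round(t2, 4), theta_two_ibd(p))
```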
Table 1
The Nine Possible Identity-by-Descent Relationships

Identity-by-Descent Relationship                                   Probability   Probability that A has Marker Allele
1. No identities                                                   $\phi_1$      $\theta_{1mj} = p_{mj}(2 - p_{mj})$
2. One gene of A = one gene of B                                   $\phi_2$      $\theta_{2mj} = p_{mj} + 0.5(1 - p_{mj}^2)$
3. Both genes of A = both genes of B                               $\phi_3$      $\theta_{3mj} = 1$
4. $a_1 = a_2$; no identity with B                                 $\phi_4$      $\theta_{4mj} = p_{mj}$
5. $a_1 = a_2$ = one gene of B                                     $\phi_5$      $\theta_{5mj} = 0.5(1 + p_{mj})$
6. $b_1 = b_2$; no identity with A                                 $\phi_6$      $\theta_{6mj} = p_{mj}(2 - p_{mj})$
7. $b_1 = b_2$ = one gene of A                                     $\phi_7$      $\theta_{7mj} = 1$
8. $a_1 = a_2$ and $b_1 = b_2$; no identity between A and B        $\phi_8$      $\theta_{8mj} = p_{mj}$
9. $a_1 = a_2 = b_1 = b_2$                                         $\phi_9$      $\theta_{9mj} = 1$

NOTE. The relationships are between two alleles at the mth locus in individuals A ($a_1$ and $a_2$) and B ($b_1$ and $b_2$); the
associated probabilities, $\phi_i$, depend on the nature of the relationship. $\theta_{imj}$ is the probability that A shares a particular band
(j, with population allelic frequency $p_{mj}$) with B, given the identity relationship i at the locus. An equals
sign (=) denotes identity by descent.
The expected proportion of bands shared between unrelated individuals is obtained by weighting the conditional $\theta_{1mj}$ values by their expected frequencies $p_{mj}$ and
summing over loci and alleles:
$$\theta_1 = \frac{1}{L} \sum_{m=1}^{L} \sum_{j=1}^{n_m} p_{mj}^2 (2 - p_{mj}), \tag{4}$$
where L is the number of marker loci and $n_m$ is the total number of alleles at the mth
locus. When all alleles have equal frequencies, $n_m = 1/p$, and equation (4) reduces
to $\theta_1 = p(2 - p)$. Clearly, $\theta_1$ approaches zero as the number of alleles per locus
increases. However, substantial similarity between nonrelatives is still expected even
at fairly low allelic frequencies. With p = 0.01, 0.1, and 0.2, for example, $\theta_1$ = 0.02,
0.19, and 0.36. Diallelic loci with p = 0.5 yield $\theta_1 = 0.75$. Although the allelic frequency
distributions for DNA fingerprinting loci are not actually uniform, equation (4) shows
that the expected similarity, per locus, of unrelated individuals can be no less than
$q_m^2(2 - q_m)$, where $q_m$ is the frequency of the most abundant allele. Consequently,
even if most alleles are very rare, a few common ones will greatly magnify the similarity
of nonrelatives.
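Equation (4) is easy to evaluate numerically. The sketch below reproduces the uniform-frequency values quoted above and illustrates the lower bound set by the most abundant allele. The skewed locus is a hypothetical example, not data from any survey, and the function names are illustrative.

```python
def theta_1(loci):
    """Equation (4): average over loci of sum_j p_mj**2 * (2 - p_mj)."""
    per_locus = []
    for freqs in loci:
        if abs(sum(freqs) - 1.0) > 1e-9:
            raise ValueError("allele frequencies at a locus must sum to 1")
        per_locus.append(sum(p * p * (2 - p) for p in freqs))
    return sum(per_locus) / len(per_locus)


def uniform(n_alleles):
    """Allele frequencies for a locus with n equally frequent alleles."""
    return [1.0 / n_alleles] * n_alleles


for n in (100, 10, 5, 2):
    print(n, round(theta_1([uniform(n)]), 4))

# Hypothetical skewed locus: one common allele (q = 0.5) plus 50 rare ones.
skewed = [0.5] + [0.01] * 50
q = max(skewed)
print(round(theta_1([skewed]), 4), ">=", q * q * (2 - q))
```

With 51 alleles the skewed locus still yields a background similarity close to 0.4, whereas 51 equally frequent alleles would give about 0.04.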
We are now in a position to examine the relationship between expected similarity
and relatedness. The following illustrations focus on the special case of noninbred
individuals, but all of the tools necessary for more complex analyses (including different
definitions of relatedness) are in table 1. For our special case, the relatedness of B to
A is
$$r_{BA} = 0.5\phi_2 + \phi_3. \tag{5}$$
From equation (3), the expected similarity, conditional on B having band mj, is
$$E(P_{mj}) = \phi_1 \theta_{1mj} + \phi_2 \theta_{2mj} + \phi_3. \tag{6}$$
Noting that $1 = \phi_1 + \phi_2 + \phi_3$, we can also write this as
$$E(P_{mj}) = r_{BA} + (1 - r_{BA}) \theta_{1mj}. \tag{7}$$
Averaging over all bands exhibited by B, we find that the expected similarity between
A and B conditional on B’s set of bands is
$$E(S_{BA}) = r_{BA} + (1 - r_{BA}) \theta_{1B}, \tag{8}$$
where
$$\theta_{1B} = \frac{1}{k_B} \sum_{j=1}^{k_B} \theta_{1mj} \tag{9}$$
is the expected similarity of B to nonrelatives. Only if all alleles have equal frequencies
is $\theta_{1B}$ exactly equal to $\theta_1$, the average similarity between all pairs of nonrelatives in
the population.
Equation (8) states that the expected similarity is equal to the sum of (a) the
probability that genes in B are IBD with those in A and (b) the probability that they
are not IBD but still identical in state. The second term in equation (8) defines the
upward bias that results if one relies on similarity as an estimator of relatedness. Figure
2 illustrates the inflation $E(S_{BA})/r_{BA}$ for the special case in which all marker alleles
have equal frequency. This ratio is equal to $1/r_{BA}$ when loci are monomorphic and
declines to one with increasing allelic diversity. Thus, the more distant the relationship,
the more one is likely to overestimate the degree of relatedness when relying on similarity as an estimator, and
with distant relationships, a substantial amount of bias
remains even when the number of alleles is quite high. When the true relatedness is
0.25, 0.125, and 0.0625, fewer than 5, 13, and 29 detectable alleles per locus will cause
the expected similarity to be more than double the relatedness. The implications of
these results are clear. One should avoid using similarity as an estimator of relatedness
unless there is reason to believe that all marker alleles are very rare.
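The inflation ratio follows directly from equation (8) with equal allele frequencies. The sketch below uses illustrative function names and a brute-force search (an arbitrary choice made here for transparency). It finds the largest number of alleles per locus at which the expected similarity still exceeds twice the relatedness, which comes out at 5, 13, and 29 for the three relatedness values discussed above.

```python
def expected_similarity(r, theta):
    """Equation (8) with theta_1B replaced by a common value theta."""
    return r + (1 - r) * theta


def inflation(r, n_alleles):
    """E(S_BA) / r_BA for n equally frequent alleles per locus."""
    p = 1.0 / n_alleles
    return expected_similarity(r, p * (2 - p)) / r


def largest_n_with_doubling(r, n_max=1000):
    """Largest allele number at which expected similarity exceeds 2r."""
    return max(n for n in range(1, n_max + 1) if inflation(r, n) > 2)


for r in (0.25, 0.125, 0.0625):
    print(r, largest_n_with_doubling(r), round(inflation(r, 100), 3))
```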
Rearranging equation (8) and substituting observed for expected similarity, we
find an unbiased estimator for the relatedness
$$\hat{r}_{BA} = \frac{S_{BA} - \theta_{1B}}{1 - \theta_{1B}}. \tag{10}$$
FIG. 2. The ratio of the expected proportion of shared bands, $E(S_{BA})$, to the relatedness, $r_{BA}$, for the
special case in which allele frequencies are equal and inbreeding is absent. The dependence of this ratio on
the number of alleles per locus is shown for four levels of relatedness.
The main problem with this approach is that the parameter $\theta_{1B}$ will almost never be
known. There are two ways to obtain estimates of $\theta_{1B}$. One way is to substitute estimates
of the population frequencies for B’s alleles into equation (9) and solve for the statistic
$\hat{\theta}_{1B}$. Alternatively, one could estimate the mean similarity of B to many independent
relatives of the same type (fixed $r_{BA}$) and solve
$$\hat{\theta}_{1B} = \frac{\bar{S}_{BA} - r_{BA}}{1 - r_{BA}}. \tag{11}$$
Both of these approaches are impractical, if not impossible, for most field studies.
Thus, it is clear that an alternative protocol is required if DNA fingerprinting is to be
of any use in estimating relatedness.
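Equations (10) and (11) are inverse rearrangements of equation (8), as the following sketch illustrates. The background values passed in are hypothetical, chosen only to show how strongly the estimate depends on $\theta_{1B}$; the function names are illustrative.

```python
def relatedness_estimate(s_ba, theta_1b):
    """Equation (10): remove the background similarity theta_1B from S_BA."""
    if not 0 <= theta_1b < 1:
        raise ValueError("theta_1B must lie in [0, 1)")
    return (s_ba - theta_1b) / (1 - theta_1b)


def background_from_known_relatives(similarities, r):
    """Equation (11): solve for theta_1B from similarities to relatives of known r."""
    s_bar = sum(similarities) / len(similarities)
    return (s_bar - r) / (1 - r)


# Figure 1 similarity of B to A (0.60) under three hypothetical background levels.
for theta in (0.0, 0.2, 0.4):
    print(theta, round(relatedness_estimate(0.60, theta), 3))
```

The same observed similarity of 0.60 corresponds to a relatedness anywhere between one-third and 0.6 over this range of backgrounds.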
The most easily implemented procedure is to utilize an estimate of the average
similarity between all unrelated individuals, $\hat{\theta}_1$, in place of $\theta_{1B}$. It can be shown by
Taylor expansion that
$$\hat{r}_{BA} \approx \frac{S_{BA} - \hat{\theta}_1}{1 - \hat{\theta}_1} + \frac{(1 - S_{BA}) \mathrm{Var}(\hat{\theta}_1)}{(1 - \hat{\theta}_1)^3}, \tag{12}$$
where $\mathrm{Var}(\hat{\theta}_1)$ is an estimate of the variance of similarity among pairs of unrelated
individuals. From large sampling theory, $\theta_{1B}$ and $\hat{\theta}_1$ are expected to converge as the
number of assayed markers increases. However, the rate and direction of this convergence will depend on the distribution of marker alleles in the population and in
individual B. Thus, the cost of not knowing $\theta_{1B}$ is that equation (12) gives biased
estimates of relatedness, with a bias whose magnitude is unique and unknown for each
individual. The burden is therefore upon the investigator to demonstrate that equation
(12) is a reasonable surrogate for the less practical equation (10).
In principle, the estimates $\hat{\theta}_1$ and $\mathrm{Var}(\hat{\theta}_1)$ can be obtained by making measures
of $S_{BA}$ for N independent pairs of the same types of relatives. $\hat{\theta}_1$ is then obtained by
use of equation (11) with $\bar{S}$, the mean similarity between the designated class of relatives,
being substituted for $\bar{S}_{BA}$. If we let Var(S) be the variance among the independent
measures of S, $\mathrm{Var}(\hat{\theta}_1) = \mathrm{Var}(S)/(1 - r)^2$, and the standard error of $\hat{\theta}_1$ is
$[\mathrm{Var}(\hat{\theta}_1)/N]^{1/2}$. Ideally this type of analysis would be performed on pairs of unrelated individuals
(r = 0), since this minimizes the standard error of $\hat{\theta}_1$. However, since the main purpose of a fingerprinting study is to ascertain relatedness, this would involve a circularity
in many field studies. An alternative approach would be to focus on a class of individuals
for which the relatedness is known with confidence (e.g., mothers and offspring in
mammals), but even here one would have to be secure with an assumption of no
inbreeding.
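The baseline procedure and equation (12) can be sketched as follows. The five similarity values are invented for illustration, the function names are not from the text, and the use of the sample variance (divisor N - 1) for Var(S) is an assumption made here, since the text does not specify the divisor.

```python
from statistics import mean, variance


def background_stats(similarities, r):
    """theta_hat_1, Var(theta_hat_1), and SE from N independent pairs of known r."""
    n = len(similarities)
    theta_hat = (mean(similarities) - r) / (1 - r)       # equation (11)
    var_theta = variance(similarities) / (1 - r) ** 2    # Var(S) / (1 - r)^2
    standard_error = (var_theta / n) ** 0.5
    return theta_hat, var_theta, standard_error


def relatedness_approx(s_ba, theta_hat, var_theta):
    """Equation (12): plug-in estimate plus the Taylor-expansion term."""
    plug_in = (s_ba - theta_hat) / (1 - theta_hat)
    return plug_in + (1 - s_ba) * var_theta / (1 - theta_hat) ** 3


# Hypothetical similarities among five pairs assumed to be unrelated (r = 0).
baseline = [0.20, 0.25, 0.15, 0.30, 0.10]
theta_hat, var_theta, se = background_stats(baseline, r=0.0)
print(round(theta_hat, 3), round(var_theta, 5), round(se, 4))
print(round(relatedness_approx(0.60, theta_hat, var_theta), 4))
```

For an observed similarity of 0.60 the second term of equation (12) is small here, so most of the uncertainty lies in whether the population-wide baseline is a fair stand-in for the individual-specific $\theta_{1B}$.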
Sampling Variance
An essential consideration in any attempt to ascertain relatedness is the magnitude
of the sampling variance of similarity. Such variation arises as a consequence of variation in both identity by descent per locus and IIS between genes that are not IBD. If
the sampling variance of $S_{BA}$ is high, then that of $\hat{r}_{BA}$ will be too.
Here it becomes useful to focus on the properties of a locus rather than on those
of a single allele. If the two alleles of B at the mth locus are denoted as 1 and 2, the
realized similarity of B with A at this locus may be written as
$$S_{BA,m} = \frac{P_{m1} + P_{m2}}{2}, \tag{13}$$
where $S_{BA,m}$ is restricted to values of 0, 0.5, and 1. To compute the sampling variance
of $S_{BA,m}$, $\sigma^2(S_{BA,m})$, for a particular type of relationship, the mean squared value,
$E(S_{BA,m}^2)$, needs to be evaluated. This requires an averaging over all possible identity
relationships. If we let $H_{mi}$ be the probability that B is homozygous at the mth locus
conditional on identity relationship i and let $E(S_{BA,m}^2 \mid H, i)$ and $E(S_{BA,m}^2 \mid O, i)$ be the
expected squared similarities given homozygosity (H) or heterozygosity (O) in B and
identity relationship i, then
$$\sigma^2(S_{BA,m}) = \sum_{i=1}^{9} \phi_i \left[ H_{mi} E(S_{BA,m}^2 \mid H, i) + (1 - H_{mi}) E(S_{BA,m}^2 \mid O, i) \right] - E^2(S_{BA,m}). \tag{14}$$
Table 2
Expressions for $\sigma^2(S_{BA,m})$ for Cases in Which There Is No Inbreeding
and the Distribution of Allele Frequencies Is Uniform

Relation                                                             Sampling Variance of $S_{BA,m}$
Nonrelative                                                          $p(1 - p)(1 - 1.5p + p^2)$
Parent-offspring                                                     $0.25p(2 - 5p + 4p^2 - p^3)$
Full sibs                                                            $0.125(1 - 4p^2 + 5p^3 - 2p^4)$
Half-sibs, grandparent-grandchild, uncle(aunt)-niece(nephew)         $0.0625(1 + 8p - 24p^2 + 24p^3 - 9p^4)$
Third order                                                          $0.03125(1.5 + 22p - 61p^2 + 62p^3 - 24.5p^4)$
Fourth order                                                         $0.015625(1.75 + 53p - 139.5p^2 + 141p^3 - 56.25p^4)$

NOTE. To obtain the sampling variance of similarity based on L loci, the tabulated expressions should be divided by L.
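For the special case of no inbreeding, equation (14) needs only to be summed over i
= 1 to 3 with
$$E(S_{BA,m}^2 \mid H, 1) = \frac{\sum_{j=1}^{n_m} p_{mj}^2 \left[ p_{mj}^2 + 2p_{mj}(1 - p_{mj}) \right]}{\sum_{j=1}^{n_m} p_{mj}^2}, \tag{15a}$$
$$E(S_{BA,m}^2 \mid O, 1) = \frac{\sum_{j=1}^{n_m} p_{mj} \sum_{k \neq j} p_{mk} (2p_{mj}p_{mk}) + 0.25 \sum_{j=1}^{n_m} p_{mj} \sum_{k \neq j} p_{mk} \left[ p_{mj}^2 + p_{mk}^2 + 2(p_{mj} + p_{mk})(1 - p_{mj} - p_{mk}) \right]}{1 - \sum_{j=1}^{n_m} p_{mj}^2}, \tag{15b}$$
$$E(S_{BA,m}^2 \mid H, 2) = 1, \tag{15c}$$
$$E(S_{BA,m}^2 \mid O, 2) = \frac{\sum_{j=1}^{n_m} p_{mj} \sum_{k \neq j} p_{mk} \left[ 0.25(1 - p_{mk}) + p_{mk} \right]}{1 - \sum_{j=1}^{n_m} p_{mj}^2}, \tag{15d}$$
$$E(S_{BA,m}^2 \mid H, 3) = 1, \tag{15e}$$
and
$$E(S_{BA,m}^2 \mid O, 3) = 1. \tag{15f}$$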
Although three of these expressions depend on the gene frequency distribution in
rather complicated ways, substantial simplification is possible when p is assumed to
be constant within and between loci. In table 2, expressions for $\sigma^2(S_{BA,m})$ are given
for several types of relatives under these special conditions. The sampling variance
for the similarity of parents and offspring at a marker locus approaches zero asymptotically as $p \to 0$, being approximately p/2 for small p. For other types of relationships, however,
there is a real lower limit to the sampling variance per locus (0.125 for full sibs and
0.062, 0.047, and 0.027 for second-, third-, and fourth-order relationships, respectively),
caused by variation in the number of IBD genes per locus.
To gain some appreciation for the magnitude of the error in estimates of relatedness that is caused by the sampling variance of similarity, the square roots of the
expressions in table 2 are divided by $[E(S_{BA,m}) - p(2 - p)]$ to obtain coefficients of
variation (standard error/expectation) of $\hat{r}_{BA}$ (fig. 3). For parent-offspring relationships,
$CV(\hat{r}_{BA})$ declines with increasing allelic diversity, but even with 100 alleles/locus it
is still approximately 0.15. For other types of relationships, the coefficient of variation is substantially
higher, increasing with the distance of the relationship. As the number of alleles increases, $CV(\hat{r}_{BA})$ approaches lower limits of 0.7 for full sibs and of 1.0, 1.7, and 2.6
for second-, third-, and fourth-order relationships, respectively.
Ordinarily, similarities will be computed by summing over several loci, typically
10-20. If we assume that the loci are unlinked and in linkage equilibrium, the sampling
variance of such an integrated measure would then be equal to the mean sampling
variance of the individual loci divided by the number of loci. Thus, with 25 loci, the
coefficients of variation of $\hat{r}_{BA}$ would be one-fifth of those in figure 3. In this extreme
case, the standard error of $\hat{r}_{BA}$ for full sibs and for second-, third-, and fourth-order
relationships would be no less than 14%, 20%, 35%, and 53%, respectively, of the
expectation.
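The per-locus variances and the resulting coefficients of variation can be recomputed from equation (14) under uniform allele frequencies. In this sketch the identity coefficients assigned to third- and fourth-order relatives are an assumption (unilineal relatives with relatedness 0.125 and 0.0625), and the function names are illustrative. The conditional mean squares are the uniform-frequency forms of equations (15a-f).

```python
from math import sqrt

# phi_1, phi_2, phi_3 for noninbred relatives. The third- and fourth-order
# values are an assumption (unilineal relatives with r = 0.125 and 0.0625).
RELATIVES = {
    "parent-offspring": (0.0, 1.0, 0.0),
    "full sibs": (0.25, 0.5, 0.25),
    "second order": (0.5, 0.5, 0.0),
    "third order": (0.75, 0.25, 0.0),
    "fourth order": (0.875, 0.125, 0.0),
}


def variance_per_locus(phi, p):
    """sigma^2(S_BA,m) from equation (14) with uniform allele frequency p."""
    phi1, phi2, phi3 = phi
    theta1 = p * (2 - p)
    r = 0.5 * phi2 + phi3
    mean_s = r + (1 - r) * theta1
    # E(S^2 | relationship); B is homozygous with probability p.
    e2_unrelated = p * theta1 + (1 - p) * (p + 0.5 * p * p)
    e2_one_ibd = p * 1.0 + (1 - p) * (0.25 * (1 - p) + p)
    e2_two_ibd = 1.0
    mean_s2 = phi1 * e2_unrelated + phi2 * e2_one_ibd + phi3 * e2_two_ibd
    return mean_s2 - mean_s ** 2


def cv_relatedness(phi, p, n_loci=1):
    """Standard error of the relatedness estimate divided by its expectation."""
    r = 0.5 * phi[1] + phi[2]
    return sqrt(variance_per_locus(phi, p) / n_loci) / (r * (1 - p * (2 - p)))


# Near the limit of very many alleles (p close to 0) with 25 loci.
for name, phi in RELATIVES.items():
    print(name, round(cv_relatedness(phi, 1e-6, n_loci=25), 2))
```

The denominator $r(1 - \theta_1)$ is the same quantity as $E(S_{BA,m}) - p(2 - p)$ in the text, written in terms of relatedness.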
These analyses are sufficient to illustrate the uncertainties that can arise when
the similarity measure is used to estimate the degree of relationship between individuals.
The sampling variance of $\hat{r}_{BA}$ computed above is truly a lower limit because it does
not account for the (usually unobserved) variation of $\theta_{1B}$ around $\theta_1$ or for the sampling variance of $\hat{\theta}_1$ and $\mathrm{Var}(\hat{\theta}_1)$ around the population parameters. One can only
conclude that beyond (and often including) second-degree relationships, DNA fingerprinting does not provide a powerful means of assessing individual relationships.
If, however, one is simply interested in the average relatedness among independent
pairs of individuals in a group, then there is hope, since the standard error of the
average relatedness is that for independent estimates divided by the square root of the
number of estimates. For the extreme case of 25 loci and $p \to 0$, assays of two independent pairs of full sibs would reduce the coefficient of variation of average relatedness
to 0.1 (the additional problem of uncertainty about $\theta_{1B}$ here being ignored). For
groups composed of second-, third-, and fourth-order relatives, the same level of accuracy would require assays of 4, 12, and 28 independent pairs, respectively.
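The pair counts quoted above can be checked with exact rational arithmetic. This sketch takes the limiting per-locus variances as the constant terms of table 2. The use of `Fraction` (to avoid floating-point rounding at exact thresholds) and the function name are choices made here for illustration.

```python
from fractions import Fraction
from math import ceil

# Limiting per-locus variance as p -> 0 (constant terms of table 2) and relatedness.
LIMITS = {
    "full sibs": (Fraction(1, 8), Fraction(1, 2)),
    "second order": (Fraction(1, 16), Fraction(1, 4)),
    "third order": (Fraction(3, 64), Fraction(1, 8)),
    "fourth order": (Fraction(7, 256), Fraction(1, 16)),
}


def pairs_needed(variance, r, n_loci=25, target_cv=Fraction(1, 10)):
    """Smallest N for which the CV of the mean of N independent estimates is <= target_cv."""
    cv_squared = variance / (n_loci * r * r)  # squared CV of a single estimate
    return ceil(cv_squared / (target_cv * target_cv))


for name, (variance, r) in LIMITS.items():
    print(name, pairs_needed(variance, r))
```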
Discussion
The preceding analyses have identified three technical difficulties in using DNA
fingerprinting to obtain individual estimates of relatedness: (1) the upward bias of
fingerprint similarity compared with relatedness caused by finite numbers of alleles,
(2) the inability to completely correct for such bias because of its individual specificity,
and (3) the sampling variance caused by variation in identity by descent within and
between loci. If we take into account the additional problems of comigration of nonallelic markers, linkage and/or linkage disequilibrium between marker loci, the frequent
inability to observe markers with very low molecular weights, possible linkage of marker
loci with other loci under selection, and high and variable mutation rates (Jeffreys et
al. 1988), it is clear that considerable caution needs to be exercised in applications of
DNA fingerprinting to estimate individual relatedness.
FIG. 3. Coefficient of variation of the relatedness estimate as a function of the number of marker
alleles per locus and the degree of relationship. For L loci, the plotted values should be divided by $L^{1/2}$.
With very large numbers of alleles per locus (the critical number increasing with
the distance of the relationship), the bias between DNA fingerprint similarity and
relatedness can be ignored. However, while most VNTR loci do appear to be exceptionally variable, the number of alleles per locus is by no means high enough to permit
the use of similarity as a reasonable estimator of relatedness. Nakamura et al. (1987)
surveyed samples of 60-80 individuals with 372 VNTR probes, 77 of which revealed
length polymorphisms. Of the latter loci, 11 exhibited more than 10 marker alleles,
but the vast majority (68%) had five or fewer alleles, and 26% had only two to three
alleles. The heterozygosity per polymorphic locus has a mode at 0.6-0.7 (fig. 4), which
implies a rather high average similarity between nonrelatives. Expected values of $\theta_{1m}$
are given in figure 4 under the assumption of an even allele frequency distribution
(heterozygosity = 1 - p, and $\theta_{1m} = p[2 - p]$). An averaging of these locus-specific
similarities leads to an underestimate of $\theta_1$, the expected similarity between nonrelatives, because the observed allele frequency distributions are not even. Thus, for the
set of loci observed by Nakamura et al. (1987), $\theta_1$ is certainly >0.5.
FIG. 4. Distribution of heterozygosity over 392 loci probed for VNTR polymorphisms in humans
(data from Nakamura et al. 1987). The upper scale gives the expected fraction of shared bands per locus
between unrelated individuals as a function of the heterozygosity under the assumption of an even allele
frequency distribution.
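The conversion from heterozygosity to expected band sharing under an even allele frequency distribution is a one-line calculation. The sketch below uses an illustrative function name and shows that heterozygosities of 0.6-0.7 correspond to similarities of roughly 0.5-0.65 between nonrelatives.

```python
def similarity_from_heterozygosity(h):
    """Expected band sharing per locus for nonrelatives with equally frequent alleles."""
    p = 1.0 - h  # heterozygosity = 1 - p under the even-frequency assumption
    return p * (2.0 - p)


for h in (0.0, 0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"heterozygosity {h:.1f} -> similarity {similarity_from_heterozygosity(h):.2f}")
```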
It is important to recognize that Nakamura et al. (1987) were only able to reveal
locus-specific polymorphisms by probing for VNTR loci under high-stringency conditions. In DNA fingerprinting, one normally hybridizes under low-stringency conditions in order to examine a large number of markers simultaneously. This eliminates
the luxury of ignoring loci with low (or no) variability. Prescreening a population
with many different potential probes prior to analysis is a useful way to identify probes
that give low θ under low-stringency conditions. However, it should be noted that
probes for the most hypervariable VNTR loci in humans still give θ in the range of
0.1-0.3 (Jeffreys et al. 1985b). When the same probes are used in house sparrows,
θ = 0.14 (SE = 0.01) (Wetton et al. 1987). Pair-wise comparisons in six species of
birds yielded estimates of θ in the range of 0.1-0.5 (Burke and Bruford 1987).
Thus, unless probes can be located for loci that are much more variable than the
already “hypervariable” loci of Jeffreys et al. (1985a, 1985b), the upward bias of DNA
fingerprint similarity relative to relatedness is too substantial to ignore. Prior to embarking on a field study, it is then essential to obtain the baseline estimates θ and
Var(θ) in order to apply equation (12), and even then one must recognize that this
formula gives biased estimates of relatedness. Equation (12) will tend to overestimate
the relatedness when applied to individuals that happen to contain common alleles
and will tend to underestimate it for individuals carrying rare alleles.
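How strongly the baseline similarity between nonrelatives (written θ here) affects an estimate can be seen in a short calculation. The sketch below assumes that similarity is the usual band-sharing index, S = 2n_xy/(n_x + n_y), and that the correction in equation (12) has the form r = (S - θ)/(1 - θ). Both forms are assumptions, since neither is restated in this section, and the function names are hypothetical.

```python
def band_sharing(bands_x, bands_y):
    """Band-sharing similarity S = 2 * n_xy / (n_x + n_y) (assumed form)."""
    bands_x, bands_y = set(bands_x), set(bands_y)
    if not bands_x and not bands_y:
        raise ValueError("at least one individual must have scored bands")
    shared = len(bands_x & bands_y)
    return 2.0 * shared / (len(bands_x) + len(bands_y))


def relatedness_estimate(similarity, theta):
    """Approximate relatedness r = (S - theta) / (1 - theta) (assumed form),
    where theta is the mean similarity between nonrelatives."""
    if not 0.0 <= theta < 1.0:
        raise ValueError("theta must lie in [0, 1)")
    return (similarity - theta) / (1.0 - theta)


if __name__ == "__main__":
    # theta = 0.14 is the house sparrow value quoted from Wetton et al. (1987);
    # the similarity of 0.57 is a made-up observation.
    print(relatedness_estimate(0.57, 0.14))
```

Under these assumptions, an observed similarity of 0.57 corresponds to a relatedness near 0.5 when θ = 0.14, whereas taking the raw similarity as the estimate would overstate it. If a particular pair happens to carry common alleles, its own background similarity exceeds the population-wide θ, so the same formula overestimates relatedness for that pair; the reverse holds for pairs carrying rare alleles.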
Since the estimator for the sampling variance derived in the present paper did
not include some potentially important sources of variation, it should be interpreted
as nothing more than a lower limit to the sampling variance in hypothesis testing; but
it can serve as a guide to the minimum number of markers (and/or individuals) that
need to be evaluated to achieve a desired level of precision. An alternative, empirically
based but time-consuming, approach to hypothesis testing exists. This would involve
the development of frequency distributions of similarity from known types of relatives
drawn from the base population of interest. The probability that any individual estimate
of similarity is associated with a particular type of relationship can then be assayed
directly without knowledge of the allelic distributions at specific loci.
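The empirical approach might be sketched as follows. This is only an illustration, assuming similarities have already been measured for pairs of known relationship drawn from the base population. The reference values, bin edges, equal prior weights for the relationship classes, and function names are all hypothetical and not taken from any data set.

```python
from bisect import bisect_right


def bin_index(value, bin_edges):
    """Index of the histogram bin containing value (edges are ascending)."""
    i = bisect_right(bin_edges, value) - 1
    return min(max(i, 0), len(bin_edges) - 2)


def bin_probabilities(samples, bin_edges, pseudocount=0.5):
    """Smoothed relative frequency of each similarity bin for one class."""
    counts = [pseudocount] * (len(bin_edges) - 1)
    for s in samples:
        counts[bin_index(s, bin_edges)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def relationship_probabilities(similarity, reference, bin_edges):
    """Probability of each relationship class given one observed similarity,
    assuming equal prior weights for the classes (an assumption)."""
    k = bin_index(similarity, bin_edges)
    likelihoods = {name: bin_probabilities(samples, bin_edges)[k]
                   for name, samples in reference.items()}
    total = sum(likelihoods.values())
    return {name: like / total for name, like in likelihoods.items()}


if __name__ == "__main__":
    edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
    # Hypothetical similarities for pairs of known relationship.
    reference = {
        "unrelated": [0.10, 0.15, 0.22, 0.18, 0.30, 0.12],
        "half sibs": [0.30, 0.35, 0.42, 0.28, 0.45, 0.38],
        "full sibs": [0.55, 0.62, 0.48, 0.70, 0.58, 0.65],
    }
    for name, prob in relationship_probabilities(0.57, reference, edges).items():
        print(f"{name}: {prob:.2f}")
```

The pseudocount keeps a class from receiving zero probability merely because no reference pair happened to fall in a bin. With realistic sample sizes a smoother density estimate could be substituted without changing the logic.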
In closing, we must emphasize that the cautionary tone of the present paper
applies primarily to ascertainment of relatedness. There are numerous problems in
population biology for which DNA fingerprinting may provide a simple and powerful
analytical tool. Provided one knows the mother of an individual, paternity exclusion
is quite straightforward, since it simply involves the identification of markers in the
offspring that cannot be attributed to the mother or to the putative father. A similar
logic applies to the ascertainment of multiple paternity. The simplicity of both types
of analyses stems from the fact that they merely involve the rejection of a general
hypothesis rather than the acceptance of a specific relationship. In some applications,
such as the identification of optimal breeding pairs in genetic conservation programs,
it will often be less critical to establish absolute relatedness than rank-order relatedness.
Since expected similarity is a monotonic function of relatedness, DNA fingerprinting
can provide a useful guide to such decision making. One should, however, maintain
an awareness that, owing to the many sources of error described in the present paper,
rank order based on fingerprint similarities can sometimes be substantially different
from the true rank order of relatedness.
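The exclusion logic described above for paternity can be sketched in a few lines. This is a minimal illustration, assuming each fingerprint is scored as a set of band labels and that the mother is known. The band labels, the function names, and the `max_unexplained` tolerance are hypothetical; the tolerance is not from the original text and is included only because a user might want to allow for scoring error or new-length mutations.

```python
def unexplained_bands(offspring, mother, putative_father):
    """Offspring bands that neither the mother nor the putative father carries."""
    return set(offspring) - set(mother) - set(putative_father)


def is_excluded(offspring, mother, putative_father, max_unexplained=0):
    """True if the male leaves more offspring bands unexplained than allowed."""
    missing = unexplained_bands(offspring, mother, putative_father)
    return len(missing) > max_unexplained


def candidate_fathers(offspring, mother, males, max_unexplained=0):
    """Names of males that cannot be excluded as the father."""
    return [name for name, bands in males.items()
            if not is_excluded(offspring, mother, bands, max_unexplained)]


if __name__ == "__main__":
    offspring = {"a", "b", "c", "d"}
    mother = {"a", "b", "x"}
    males = {"m1": {"c", "d", "y"}, "m2": {"c", "z"}}
    print(candidate_fathers(offspring, mother, males))
```

The same test applied to each offspring in a brood indicates multiple paternity when no single male can account for all of the non-maternal bands. Note that a male who is not excluded is not thereby shown to be the father, which is the asymmetry between rejecting a hypothesis and accepting a specific relationship.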
Acknowledgments
Helpful comments were provided by N. Burley, T. Crease, B. Leathers, D. Queller,
S. Robinson, and R. Yokoyama. Financial support came from National Science
Foundation grant BSR 86-00487 and National Institutes of Health Biomedical Research support grant RR07030.
LITERATURE CITED
BURKE, T., and M. W. BRUFORD. 1987. DNA fingerprinting in birds. Nature 327:149-152.
CHAKRABORTY, R., T. R. MEAGHER, and P. E. SMOUSE. 1988. Parentage analysis with genetic
markers in natural populations. I. The expected proportion of offspring with unambiguous
paternity. Genetics 118:527-536.
CROZIER, R. H. 1970. Coefficients of relationship and the identity of genes by descent in the
Hymenoptera. Am. Nat. 104:216-217.
CROZIER, R. H., P. PAMILO, and Y. C. CROZIER. 1984. Relatedness and microgeographic genetic
variation in Rhytidoponera mayri, an Australian arid zone ant. Behav. Ecol. Sociobiol. 15:143-150.
GILL, P., A. J. JEFFREYS, and D. J. WERRETT. 1985. Forensic application of DNA ‘fingerprints.’
Nature 318:577-579.
GRAFEN, A. 1985. A geometric view of relatedness. Oxf. Surv. Evol. Biol. 2:28-89.
HAMILTON, W. D. 1964. The genetical evolution of social behaviour. I and II. J. Theor. Biol.
7:1-52.
HAMILTON, W. D. 1971. Selection of selfish and spiteful behavior in some extreme models. Pp. 55-91 in
J. F. EISENBERG and W. S. DILLON, eds. Man and beast: comparative social behaviour.
Smithsonian Institution Press, Washington, D.C.
JACQUARD, A. 1974. The genetic structure of populations. Springer, Berlin.
JEFFREYS, A. J., and D. B. MORTON. 1987. DNA fingerprints of dogs and cats. Anim. Genet.
18:1-15.
JEFFREYS, A. J., N. J. ROYLE, V. WILSON, and Z. WONG. 1988. Spontaneous mutation rates
to new length alleles at tandem-repetitive hypervariable loci in human DNA. Nature 332:278-281.
JEFFREYS, A. J., V. WILSON, R. KELLY, B. A. TAYLOR, and G. BULFIELD. 1987. Mouse DNA
‘fingerprints’: analysis of chromosome localization and germ line stability of hypervariable
loci in recombinant inbred strains. Nucleic Acids Res. 15:2823-2836.
JEFFREYS, A. J., V. WILSON, and S. L. THEIN. 1985a. Hypervariable ‘minisatellite’ regions in
human DNA. Nature 314:67-73.
JEFFREYS, A. J., V. WILSON, and S. L. THEIN. 1985b. Individual-specific ‘fingerprints’ of human DNA. Nature 316:76-79.
MCCRACKEN, G. F., and J. W. BRADBURY. 1981. Social organization and kinship in the polygynous bat, Phyllostomus hastatus. Behav. Ecol. Sociobiol. 15:287-291.
METCALF, R. A., and G. S. WHITT. 1977. Intra-nest relatedness in the social wasp Polistes
metricus. Behav. Ecol. Sociobiol. 2:339-351.
MICHOD, R. E., and W. D. HAMILTON. 1980. Coefficients of relatedness in sociobiology. Nature
288:694-697.
NAKAMURA, Y., M. LEPPERT, P. O’CONNELL, R. WOLFF, T. HOLM, M. CULVER, C. MARTIN,
E. FUJIMOTO, M. HOFF, E. KUMLIN, and R. WHITE. 1987. Variable number of tandem
repeat (VNTR) markers for human gene mapping. Science 235:1616-1622.
PAMILO, P. 1984. Genotypic correlation and regression in social groups: multiple alleles, multiple
loci and subdivided populations. Genetics 107:307-320.
PAMILO, P., and R. H. CROZIER. 1982. Measuring genetic relatedness in natural populations:
methodology. Theor. Popul. Biol. 21:171-193.
VASSART, G., M. GEORGES, R. MONSIEUR, H. BROCAS, A. S. LEQUARRE, and D. CHRISTOPHE.
1987. A sequence in M13 phage detects hypervariable minisatellites in human and animal
DNA. Science 235:683-684.
WETTON, J. H., R. E. CARTER, D. T. PARKIN, and D. WALTERS. 1987. Demographic study of
a wild house sparrow population by DNA fingerprinting. Nature 327:147-149.
WILKINSON, G. S., and G. F. MCCRACKEN. 1986. On estimating genetic relatedness using
genetic markers. Evolution 39:1169-1174.
WRIGHT, S. 1922. Coefficients of inbreeding and relationship. Am. Nat. 56:330-338.
MASATOSHI NEI, reviewing editor
Received February 4, 1988; revision received April 21, 1988