Forensic Science International: Genetics xxx (2011) xxx–xxx
Complex mixtures: A critical examination of a paper by Homer et al.
Thore Egeland a,1,*, A. Elida Fonneløp b,1, Paul R. Berg c, Matthew Kent c, Sigbjørn Lien c
a IKBM, Norwegian University of Life Sciences (UMB), Norway
b Institute of Forensic Medicine, University of Oslo, Norway
c Department of Animal- and Aquacultural Sciences and Centre for Integrative Genetics (CIGENE), Norwegian University of Life Sciences (UMB), Norway
ARTICLE INFO
Article history:
Received 25 July 2010
Received in revised form 7 February 2011
Accepted 17 February 2011
ABSTRACT
DNA evidence in criminal cases may be challenging to interpret if several individuals have contributed to
a DNA-mixture. The genetic markers conventionally used for forensic applications may be insufficient to
resolve cases where there is a small fraction of DNA (say less than 10%) from some contributors or where
there are several (say more than 4) contributors. Recently methods have been proposed that claim to
substantially improve on existing approaches [1]. The basic idea is to use high-density single nucleotide
polymorphism (SNP) genotyping arrays including as many as 500,000 markers or more and explicitly
exploit raw allele intensity measures. It is claimed that trace fractions of less than 0.1% can be reliably
detected in mixtures with a large number of contributors. Specific forensic issues pertaining to the
amount and quality of DNA are not discussed in the paper and will not be addressed here. Rather our
paper critically examines the statistical methods and the validity of the conclusions drawn in Homer
et al. (2008) [1].
We provide a mathematical argument showing that the suggested statistical approach will give
misleading results for important cases. For instance, for a two person mixture an individual contributing
less than 33% is expected to be declared a non-contributor. The quoted threshold 33% applies when all
relative allele frequencies are 0.5. Simulations confirmed the mathematical findings and also provide
results for more complex cases. We specified several scenarios for the number of contributors, the
mixing proportions and allele frequencies and simulated as many as 500,000 SNPs.
A controlled, blinded experiment was performed using the Illumina GoldenGate® 360 SNP test panel.
Twenty-five mixtures were created from 2 to 5 contributors with proportions ranging from 0.01 to 0.99.
The findings were consistent with the mathematical result and the simulations.
We conclude that it is not possible to reliably infer the presence of minor contributors to mixtures
following the approach suggested in Homer et al. (2008) [1]. The basic problem is that the method fails to
account for mixing proportions.
© 2011 Elsevier Ireland Ltd. All rights reserved.
Keywords:
Mixtures
SNP genotyping arrays
Mixing proportions
1. Introduction
A study in 2008 by Homer et al. [1] presents methods designed
to resolve whether individuals are in a genomic DNA mixture using
high-density single nucleotide polymorphism (SNP) genotyping
arrays. The implications of the paper for genome-wide association
studies have been profound. Several papers including [2–6] have
discussed the paper and suggested modifications and alternative
approaches. The paper has also led to the precautionary removal of
large amounts of summary data from public access. We are,
however, not aware of discussions related to important potential
forensic applications. In this paper we examine Homer et al. [1] in a
forensic context by means of mathematical analyses, simulation
and a controlled, blinded experiment.
* Corresponding author.
E-mail address: [email protected] (T. Egeland).
1 These authors contributed equally to this work.
2. Methods
Below we first review the basic statistical methods of Homer
et al. [1] in a way that is convenient for the subsequent discussion.
Appendix A presents detailed calculations for a specific dataset.
Initially it is sufficient to consider one SNP with alleles A and B.
Three different estimates of the relative frequency of the A allele,
pA, are fundamental:
1. The individual estimate Y: If one is forced to estimate pA based
only on the genotype of one individual this can be done by
letting Y be 0, 0.5 or 1 depending on whether the individual is
homozygous BB, heterozygous AB or homozygous AA.
2. The population estimate Pop: This is the conventional estimate
based on a reference database.
1872-4973/$ – see front matter © 2011 Elsevier Ireland Ltd. All rights reserved.
doi:10.1016/j.fsigen.2011.02.003
Please cite this article in press as: T. Egeland, et al., Complex mixtures: A critical examination of a paper by Homer et al., Forensic Sci. Int.
Genet. (2011), doi:10.1016/j.fsigen.2011.02.003
3. The mixture estimate M: Based on the fluorescence measurements H1 and H2 of alleles A and B, the following estimate based on the mixture is suggested:

\[
M = \frac{H_1}{H_1 + kH_2}. \tag{1.1}
\]
The constant k corrects for possible differential amplification of
A and B. This constant is assumed equal to 1 unless otherwise
specified.
To illustrate the notation, consider a simple case with one SNP,
one contributor and no measurement error. There are three
possibilities
1. The genotype is BB. Then H1 = 0 since there is no measurement
error and so according to Eq. (1.1), M = 0.
2. The genotype is AB. Then H1 = H2 and M = 1/2.
3. The genotype is AA. Then H2 = 0 and M = 1.
Observe that in this example the individual estimate Y coincides
with the mixture estimate M. This will normally not be true if there
are several contributors or if there is measurement error added to
H1 and H2.
Next, two distances are calculated:

\[
D_1 = \text{distance between the individual and the population} = |Y - \mathrm{Pop}|,
\]
\[
D_2 = \text{distance between the individual and the mixture} = |Y - M|,
\]

and the final distance is defined as

\[
D = D_1 - D_2 = |Y - \mathrm{Pop}| - |Y - M|.
\]

Generally, D could be negative, zero or positive. However, for
the simple illustration above, |Y - M| = 0 and so D = D1, which is
never negative by definition.
To indicate the specific SNP (j) and individual (i), the expression
may be rewritten to comply with the notation in [1]:

\[
D = D(Y_{ij}) = |Y_{ij} - \mathrm{Pop}_j| - |Y_{ij} - M_j|. \tag{1.2}
\]
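The three estimates and the distance can be written out in a few lines of code. The following is a minimal Python sketch, not code from the paper (the authors report using R); the function names are hypothetical, and only the formulas for Y, Eq. (1.1) and Eq. (1.2) are taken from the text.

```python
# Hypothetical helper names; only the formulas come from the text.
GENOTYPE_TO_Y = {"BB": 0.0, "AB": 0.5, "AA": 1.0}


def individual_estimate(genotype):
    """Individual estimate Y: 0, 0.5 or 1 for genotype BB, AB or AA."""
    return GENOTYPE_TO_Y["".join(sorted(genotype))]


def mixture_estimate(h1, h2, k=1.0):
    """Mixture estimate M = H1 / (H1 + k * H2), Eq. (1.1)."""
    return h1 / (h1 + k * h2)


def distance(y, pop, m):
    """Distance D = |Y - Pop| - |Y - M|, Eq. (1.2), for one SNP."""
    return abs(y - pop) - abs(y - m)


# SNP 1 of the Appendix A example: individual 1 has genotype AB,
# the mixture has H1 = 1.1 and H2 = 0.9, and Pop is 0.5.
y = individual_estimate("AB")
m = mixture_estimate(1.1, 0.9)
d = distance(y, 0.5, m)  # approximately -0.05
```

A negative D, as here, means that the individual is closer to the reference population than to the mixture at that SNP.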
The distances for all s SNPs are converted to the conventional
test statistic

\[
T(Y_i) = \frac{\operatorname{mean}(D(Y_i))}{\operatorname{sd}(D(Y_i))/\sqrt{s}}, \tag{1.3}
\]

where sd denotes standard deviation. Consider next the null
hypothesis
H0: "Individual i is not in the mixture".
A significantly large positive value of T(Yi) leads to rejection of the null hypothesis. The degenerate case where only one individual, i, has contributed and there is no measurement error is again useful for illustration purposes. The distance based on a SNP will then never be negative as explained previously. The numerator of (1.3) will be positive and also T(Yi) > 0. A sufficiently large number of SNPs will therefore guarantee statistical significance. The null hypothesis will be rejected and the correct conclusion, individual i is in the mixture, will be reached.
In the more general case when there are contributors in addition to individual i, the test will work according to [1] since "Mj is shifted away from the reference population by Yi's contribution to the mixture".
2.1. The controlled, blinded experiment
The statistical method presented in [1] was tested in a number of controlled experiments. The panel selected for the analysis was the Illumina GoldenGate® DNA Test Panel which includes 360 single nucleotide polymorphisms (SNPs) distributed across the genome. Twenty-five mixtures were created from 2 to 5 contributors with proportions ranging from 0.01 to 0.99, see Table 1. The allele frequencies for the reference population were estimated from HapMap data http://hapmap.ncbi.nlm.nih.gov/. The true mixture compositions were unknown during analyses.
Table 1
The design and some results of the controlled, blinded experiments.
Mixture  4f     4d     4b     4h     4c
G        0.100  0.300  0.300  0.300  0.000
H        0.100  0.225  0.225  0.225  0.225
F        0.100  0.450  0.450  0.000  0.000
E        0.100  0.900  0.000  0.000  0.000
D        0.200  0.200  0.200  0.200  0.200
B        0.330  0.330  0.330  0.000  0.000
C        0.250  0.250  0.250  0.250  0.000
A        0.500  0.500  0.000  0.000  0.000
R        0.450  0.250  0.300  0.000  0.000
S        0.500  0.100  0.150  0.250  0.000
T        0.700  0.100  0.010  0.090  0.100
U        0.800  0.200  0.000  0.000  0.000
V        0.800  0.050  0.150  0.000  0.000
W        0.300  0.100  0.250  0.350  0.000
X        0.200  0.250  0.300  0.150  0.100
Y        0.010  0.990  0.000  0.000  0.000
Q        0.400  0.600  0.000  0.000  0.000
P        0.150  0.200  0.250  0.300  0.100
O        0.100  0.200  0.300  0.400  0.000
N        0.200  0.300  0.500  0.000  0.000
M        0.300  0.700  0.000  0.000  0.000
L        0.050  0.238  0.238  0.238  0.238
K        0.050  0.317  0.317  0.317  0.000
J        0.050  0.475  0.475  0.000  0.000
I        0.050  0.950  0.000  0.000  0.000
The number of contributors in the experiments and the amount contributed (0.01–0.99) are shown. The individuals correctly identified are indicated in grey. For example, in mixture G individuals 4b and 4h are correctly identified whereas the method fails to identify 4f and 4d. Individual 4c is correctly identified as a non-contributor.
3. Results and discussion
3.1. Theoretical arguments
The statistical methods of Homer et al. [1] lack solid theoretical
foundation. This has been indicated previously in [2] in a non-forensic context: ‘‘This work appeared in a non-statistical journal
and the statistical problem was stated somewhat informally. In
particular, the precise role of the ‘reference population’ in the
inference is not very clear.’’ Some further comments are in order.
The relevant reference allele frequencies may not be available or
may differ depending on ethnic background as pointed out in [6].
This problem is much smaller for conventional forensic markers
since these are chosen to have reasonably uniform distributions
across populations.
We have some fundamental problems related to the formulation
of the null hypothesis. If a null hypothesis is rejected, the opposite
should be concluded. This is not the case here. Rather the conclusion
‘‘Individual i is in the mixture’’ can only be drawn for a sufficiently large
positive value of the test-statistic. A large negative value indicates
that individual i is not in the mixture but more ‘‘. . .ancestrally similar
to the reference population than to the mixture, and thus less likely to
be in the mixture’’ as pointed out in [1].
From a practical point of view, this is one example where it is
not sufficient to base the conclusion on the p-value. It is also
necessary to pay attention to the sign of the test statistic.
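The need to read the p-value together with the sign of the test statistic can be sketched as a three-way decision rule. The Python snippet below is an illustration only: the function names are hypothetical, the 0.05 level is the one used throughout this paper, and the two-sided normal p-value is an assumption that is consistent with the value 0.042 quoted for a statistic of magnitude 2.03 in Appendix A.

```python
import math


def two_sided_p_value(t):
    """Two-sided p-value of t under a standard normal reference distribution."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def classify(t, alpha=0.05):
    """Combine the p-value and the sign of the test statistic into a conclusion."""
    if two_sided_p_value(t) > alpha:
        return "inconclusive"
    # Significant: the sign decides which conclusion is drawn.
    return "in mixture" if t > 0 else "not in mixture"
```

With this rule a significant negative statistic leads to the conclusion "not in mixture", so a small p-value alone does not identify a contributor.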
We next present theoretical arguments explaining why the
method does not work as generally as claimed. It is sufficient to
consider a relevant special case to provide a counter example.
Assume individual 1 has contributed a fraction w to the mixture
and that the remaining 1 - w comes from individual 2. There is no
measurement error. In Appendix B it is shown that the expected
distance (1.2) is

\[
E(D) = \frac{3w - 1}{8}, \tag{1.4}
\]
when p = 0.5. This implies that there are three possible outcomes of
the statistical test depending on the fraction individual 1
contributes:
1. If the fraction w is larger than 1/3, E(D) > 0 from Eq. (1.4). In this
case the test statistic will be positive. For a large number of SNPs
significance will be reached and the correct conclusion will
emerge. Observe that this includes the important special case
w = 0.5, corresponding to balanced mixtures discussed in [2–6].
2. If w = 1/3 then E(D) = 0. In this case the test is likely to be
inconclusive.
3. If w < 1/3 the wrong conclusion is reached as then E(D) < 0.
The findings above based on the theoretical derivation are
consistent with simulations and experimental results reported later
in this paper. The main problem of the approach is that it does not
account for or estimate the fraction contributed. Methods for
estimating mixing proportions have been presented in another study
[7].
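The value of E(D) in Eq. (1.4) can be checked by exact enumeration over the nine genotype combinations of the two contributors. The Python sketch below is an illustration under the assumptions stated in the text (Hardy–Weinberg equilibrium, independent contributors, no measurement error, M = wY1 + (1 - w)Y2 as in Appendix B); the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def genotype_distribution(p):
    """Hardy-Weinberg probabilities of Y in {0, 1/2, 1} for A-allele frequency p."""
    return {
        Fraction(0): (1 - p) ** 2,
        Fraction(1, 2): 2 * p * (1 - p),
        Fraction(1): p ** 2,
    }


def expected_distance(w, p=Fraction(1, 2)):
    """Exact E(D) for individual 1 in a two-person mixture without measurement error."""
    dist = genotype_distribution(p)
    total = Fraction(0)
    for (y1, q1), (y2, q2) in product(dist.items(), repeat=2):
        m = w * y1 + (1 - w) * y2      # mixture estimate M
        d = abs(y1 - p) - abs(y1 - m)  # distance D for individual 1
        total += q1 * q2 * d
    return total


for w in (Fraction(1, 10), Fraction(1, 3), Fraction(1, 2)):
    # Exact agreement with Eq. (1.4) when p = 1/2.
    assert expected_distance(w) == (3 * w - 1) / 8
```

Exact rational arithmetic is used so that the sign change at w = 1/3 is not blurred by rounding.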
3.2. Simulation
We specified several scenarios for the number of contributors,
the mixing proportions and allele frequencies. Appendix A
provides further details on the assumptions and simulation
experiments.
Example 1. A two person mixture with no measurement error is considered once more. We simulated 500,000 SNPs. The relative frequency for the A allele was 0.5 for all SNPs. One example is provided in Fig. 1 where w ranges from 0.01 to 0.99. For w ≤ 0.32, T(Yi) ≤ -3.8 (p-value = 0.00007) and there is strong evidence in favor of the wrong conclusion. There is overwhelming evidence for the correct conclusion, that the individual belongs to the mixture, when w ≥ 0.34 since then T(Yi) ≥ 7.1 (p-value ≈ 0). In Table 2, results are shown for varying numbers of SNPs. When the mixture is balanced, the contributors are identified with as few as 100 markers. Similarly, the wrong conclusion is obtained for w = 0.1 for 100 markers and the results get worse as the amount of data increases. For w = 0.33 the test is inconclusive: the p-values exceed 0.05 regardless of the number of markers. The simulations accurately confirm the previous mathematical finding summarized in Eq. (1.4). Similar results are obtained for other specifications of the database and also when measurement error was added, as is shown in Fig. 2. The value of the test statistic and the point where it changes from negative to positive may depend on the minor allele frequency (MAF). The main finding however remains: the wrong conclusion is reached for the minor contributor when the fraction is small. As the MAF decreases towards 0, the threshold for negative values also decreases and this is consistent with the result derived in Appendix B.
Fig. 1. Homer et al.'s test statistic T(Y) plotted against w for two contributors, where individual 1 contributes the fraction w (500,000 SNPs, allele frequency 0.5; the point w = 1/3 is marked). The figure shows that the conclusion depends on the amount contributed from the individual. The test statistic is close to 0 for w = 1/3 and the method therefore fails to identify the individual as a contributor. This confirms the mathematical derivation in Appendix B.
Table 2
Results of simulation experiment. Two person mixture.
T(Y)      p-Value     No. SNPs   w
-3.90     9.69E-05    100        0.10
-7.15     8.84E-13    1000       0.10
-27.08    1.73E-161   10,000     0.10
-86.37    0.00E+00    100,000    0.10
-200.52   0.00E+00    500,000    0.10
0.53      5.98E-01    100        0.33
0.10      9.20E-01    1000       0.33
0.09      9.29E-01    10,000     0.33
1.18      2.40E-01    100,000    0.33
1.70      8.92E-02    500,000    0.33
3.16      1.60E-03    100        0.50
7.14      9.63E-13    1000       0.50
26.18     5.07E-151   10,000     0.50
80.85     0.00E+00    100,000    0.50
180.67    0.00E+00    500,000    0.50
The number of SNPs and fractions contributed vary. For the balanced case (w = 0.5), the p-values are close to 0 when there are more than 1000 SNPs. Significance is never reached when w = 0.33. The results are significantly wrong for w = 0.10.
Fig. 2. Homer et al.'s test statistic T(Y) plotted against w in four panels (MAF 0.2, 0.3, 0.4 and 0.5) for two contributors, where individual 1 contributes w and SD = 0.1. Simulation results for varying minor allele frequency (MAF). Measurement error is added with a standard deviation of 0.1. Again, for minor contributions (small w) the test statistic is negative and the wrong conclusion emerges.
Example 2. Table 3 displays the results of a three person mixture with 500,000 SNPs and no measurement error. Observe that for a balanced mixture the individuals are correctly identified as contributors. In other words, a person contributing a fraction of 0.33 is correctly identified in this case. However, the wrong conclusion is obtained for the minor contributor in the unbalanced scenario.
Table 3
Results of simulation experiment. Three person mixture.
Example  Person  T(Y)     p-Value  w
1        1       -197.9   0        0.1
1        2       97.0     0        0.4
1        3       223.0    0        0.5
2        1       74.8     0        0.33
2        2       72.3     0        0.33
2        3       73.7     0        0.33
The method again works when the mixture is balanced, corresponding to w = 0.33 for this three person mixture. Once more, the wrong conclusion is reached for the minor contributor corresponding to line 1.
3.3. The controlled, blinded experiment
Table 1 shows the individuals identified correctly as contributors. Generally, major contributors tend to be identified and all samples with proportions exceeding 0.33 were correctly recognized by the test. The results are summarized in Table 4. For 30 (24.0%) cases the test was inconclusive in the sense that the p-value was greater than 0.05. Contributors were incorrectly assigned as non-contributors in 14 (11.2%) cases. The initial analysis used k = 1 in Eq. (1.1). New analyses with a more appropriate estimate, k = 3, were performed and led to slight improvements; the results reported are from these analyses. The selection of SNPs for the Illumina GoldenGate® 360 SNP test panel includes SNPs on the X and Y chromosome. These SNPs may not be suitable and the analyses were repeated with these markers removed. The results remained essentially the same.
3.4. Discussion
The controlled, blinded experiment is small. The number of
SNPs in the analysis (360) is not comparable to the 500,000
analyzed in [1] and so there is a problem of insufficient power
particularly for the inconclusive findings. Experiments could be
performed for data sets comparable to those used in [1]. However,
this was not considered worthwhile in view of the theoretical
findings.
Throughout we use a significance level of 0.05. It could be
argued that corrections should be made for multiple testing. For
the simulations based on 500,000 SNPs such corrections would
have no impact for most cases as conclusions would remain
unchanged virtually regardless of the method used to correct. For
the experimental data, the situation differs, but we have chosen the
traditional level since we regard the tests as descriptive.
The forensic genetics community should be encouraged to
apply markers and methods of other areas of genetics. High-density single nucleotide polymorphism genotyping arrays may be
of great importance for forensic casework. Issues related to the
viability of the sub-optimal DNA sources frequently facing forensic
laboratories should not delay the presentation of new data sources
and statistical methods. The paper we have discussed should be
welcomed as an important first attempt. However, the validity of
findings needs to be critically examined. While claims like ‘‘trace
amounts of less than 0.1% can be reliably detected in mixtures with
a large number of contributors’’ [1] may seem outrageous and
counter intuitive and the evidence presented meager, careful
arguments are required for a firm rebuttal. Such arguments have
been presented and the theoretical arguments should be
particularly convincing.
We conclude, contrary to [1], that it is not possible to accurately
infer the presence of contributors to unbalanced mixtures
following the suggested approach.
Table 4
The results of the blinded experiment are summarized.
                 Conclusion
Contributed      Not in mixture   Inconclusive   In mixture
No               39               0              0
Yes              14               30             42
For individuals who do not contribute to the mixture, corresponding to the 'No' line, the method never leads to the wrong conclusion. For contributors, however, the method gives an inconclusive result for 30 cases and the wrong conclusion 14 times.
Appendix A. Details on the Homer et al. statistic and simulation
We explain the simulation and the statistical calculations below
based on a simple example. The data is provided in Table 5. There are
4 SNPs and three contributors (denoted 1, 2 and 3 in the ID column) to
the mixture. We would like to test the hypothesis
H0: ‘‘Individual 1 is not in the mixture’’.
Table 5
Data used to explain statistical calculations and simulation.
SNP  ID   Genotype  H1   H2   Pop  w    Y    M     D
1    1    A B       NA   NA   0.5  0.1  0.5  NA    -0.05
1    2    A B       NA   NA   0.5  0.4  0    NA    -0.05
1    3    A A       NA   NA   0.5  0.5  1    NA    0.05
1    mix  A B       1.1  0.9  0.5  1.0  NA   0.55  NA
2    1    B A       NA   NA   0.5  0.1  0.5  NA    0.00
2    2    B A       NA   NA   0.5  0.4  0.5  NA    0.00
2    3    B A       NA   NA   0.5  0.5  0.5  NA    0.00
2    mix  B A       1.0  1.0  0.5  1.0  NA   0.50  NA
3    1    A A       NA   NA   0.5  0.1  0    NA    -0.25
3    2    A B       NA   NA   0.5  0.4  0.5  NA    -0.25
3    3    A A       NA   NA   0.5  0.5  1    NA    0.25
3    mix  A A       1.5  0.5  0.5  1.0  NA   0.75  NA
4    1    B B       NA   NA   0.5  0.1  0    NA    -0.15
4    2    B A       NA   NA   0.5  0.4  1    NA    0.15
4    3    B A       NA   NA   0.5  0.5  0.5  NA    -0.15
4    mix  B B       1.3  0.7  0.5  1.0  NA   0.65  NA
There are four SNPs and three contributors to the mixture (denoted 1, 2 and 3) in the ID column. The individuals contribute in fractions 0.1, 0.4 and 0.5 as indicated in the w column. These fractions are not known and not used to calculate the distances D, but they need to be specified to simulate data.
Consider first SNP 1. The statistical test uses the genotype of individual 1. This genotype is AB. Since Y is 0, 0.5 or 1 depending on whether the individual is homozygous BB, heterozygous AB or homozygous AA, Y = 0.5 in the first line of Table 5. The fluorescence measurements are H1 = 1.1 and H2 = 0.9 and therefore M = 1.1/(1.1 + 0.9) = 0.55 as shown in the line 'mix'. For this example we assume the allele frequency of A, denoted Pop, to be 0.5. The distance is therefore

\[
D(Y) = |Y - \mathrm{Pop}| - |Y - M| = |0.5 - 0.5| - |0.5 - 0.55| = -0.05.
\]

The distances for the three other SNPs are calculated similarly, and the test is based only on the four distances -0.05, 0, -0.25 and -0.15. The mean value is -0.11250 and the standard deviation 0.11087 and so the test statistic is

\[
T(Y_1) = \frac{\operatorname{mean}(D(Y_1))}{\operatorname{sd}(D(Y_1))/\sqrt{s}} = \frac{-0.11250}{0.11087/\sqrt{4}} = -2.03.
\]
For a small number of SNPs, say less than 30, it would be reasonable to use a t-distribution to convert the value for the test statistic to a p-value. However, a normal distribution is used since the central limit theorem justifies this distribution for the large number of SNPs used in our examples, and -2.03 corresponds to a p-value of 0.042. (The normal distribution may not be appropriate for real data.) Recall that a negative value of the test statistic indicates that the person in question is more "...ancestrally similar to the reference population than to the mixture, and thus less likely to be in the mixture" [1]. There are far too few SNPs to draw a conclusion. However, the same wrong conclusion is obtained if this example is scaled up to 500,000 SNPs as shown in Table 2.
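The arithmetic of this worked example can be reproduced in a few lines. The following Python sketch is an illustration, not the authors' code (they report using R). The values of Y and M for individual 1 are those of Table 5, the function name is hypothetical, and the p-value is assumed to be the two-sided normal one.

```python
import math
from statistics import mean, stdev


def homer_statistic(distances):
    """T = mean(D) / (sd(D) / sqrt(s)), Eq. (1.3), using the sample standard deviation."""
    s = len(distances)
    return mean(distances) / (stdev(distances) / math.sqrt(s))


# Individual 1 in Table 5: individual estimates Y and mixture estimates M for the four SNPs.
y = [0.5, 0.5, 0.0, 0.0]
m = [0.55, 0.50, 0.75, 0.65]
pop = 0.5

d = [abs(yi - pop) - abs(yi - mi) for yi, mi in zip(y, m)]
t = homer_statistic(d)                            # about -2.03
p_value = math.erfc(abs(t) / math.sqrt(2.0))      # two-sided normal p-value, about 0.042
```

The negative sign of t is what points to the wrong conclusion for individual 1, who contributes the fraction 0.1.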
Simulation: The R software (http://www.R-project.org) was used to simulate and analyze data. We need to specify the number of contributors and their proportions. For this example, there are three individuals contributing in proportions 0.1, 0.4 and 0.5 as indicated by column w of Table 5. Hardy–Weinberg equilibrium and linkage equilibrium are assumed. Peak heights are simulated for each contributor from a normal distribution with specified expectation and standard deviation.
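A simulation along these lines can be sketched as follows. This is a minimal Python illustration rather than the R code used for the paper, and several details are assumptions: each contributor's expected peak height is taken to be the mixing proportion times the allele count, the same standard deviation is used for every peak, k = 1, all SNPs share one allele frequency, and the function name and defaults are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def simulate_statistic(weights, person=0, n_snps=20000, p=0.5, sd=0.0, seed=1):
    """Simulate independent SNPs and return T(Y) for the chosen contributor."""
    rng = random.Random(seed)
    distances = []
    for _ in range(n_snps):
        # Number of A alleles per contributor under Hardy-Weinberg equilibrium.
        a_counts = [(rng.random() < p) + (rng.random() < p) for _ in weights]
        # Peak heights: each contributor adds w * (allele count) plus normal noise.
        h1 = sum(rng.gauss(w * a, sd) for w, a in zip(weights, a_counts))
        h2 = sum(rng.gauss(w * (2 - a), sd) for w, a in zip(weights, a_counts))
        m = h1 / (h1 + h2)              # mixture estimate, Eq. (1.1) with k = 1
        y = a_counts[person] / 2        # individual estimate Y
        distances.append(abs(y - p) - abs(y - m))
    return mean(distances) / (stdev(distances) / math.sqrt(n_snps))


t_minor = simulate_statistic((0.1, 0.9))      # individual 1 contributes 10%
t_balanced = simulate_statistic((0.5, 0.5))   # balanced two person mixture
```

Under these assumptions t_minor comes out clearly negative and t_balanced clearly positive, in line with the pattern in Table 2. Calling the function with weights (0.1, 0.4, 0.5) corresponds to the three person setting of Table 5.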
Appendix B. Mathematical derivation of the expected distance
In general it is not possible to derive simple formulae for the test
statistic (1.3) used to determine whether an individual has
contributed to a mixture. However, relevant calculations can be
performed in some simple cases. Below we consider a two-person
mixture with no measurement error. Since the test is based on the
distances we consider the expected value

\[
E(D) = E(D_1) - E(D_2) = E\left(|Y_{11} - p| - |Y_{11} - M_1|\right),
\]

where for brevity Pop is denoted p. Since p is the same for all markers it is sufficient to consider one marker below. Without loss of generality we can and will assume p ≤ 1/2: the result for p > 1/2 is obtained by replacing p by 1 - p.
We first consider D1 and derive the expected value from the standard formula as

\[
E(D_1) = E(|Y_{11} - p|) = |0 - p|\,P(Y_{11} = 0) + \left|\tfrac{1}{2} - p\right| P(Y_{11} = 0.5) + |1 - p|\,P(Y_{11} = 1).
\]

We assume Hardy–Weinberg equilibrium and then

\[
P(Y_{11} = 0) = P(\text{``individual is BB''}) = (1 - p)^2.
\]

The remaining probabilities are found similarly as P(Y11 = 0.5) = 2p(1 - p) and P(Y11 = 1) = p^2, and therefore

\[
E(D_1) = p(1 - p)^2 + \left(\tfrac{1}{2} - p\right) 2p(1 - p) + (1 - p)p^2 = 2(1 - p)^2 p.
\]

Consider next D2 = |Y11 - M|. Individual 1 contributes a fraction w and individual 2 the remaining fraction (1 - w). The estimates of the A-allele frequency based on the genotype of individual 1 and 2 are respectively Y11 and Y21. By accounting for the fractions contributed, it follows that the mixture estimate of the A-allele frequency is

\[
M_1 = wY_{11} + (1 - w)Y_{21}.
\]

For instance, if the individuals contribute equally, corresponding to w = 0.5, and are heterozygous, the mixture based estimate is the intuitive M = 0.5 × 0.5 + (1 - 0.5) × 0.5 = 0.5.
Observe that we may write D2 = |Y11 - M| = (1 - w)|Y11 - Y21|. According to the general formula for expectation

\[
E(D_2) = (1 - w) \sum_{y_1, y_2} |y_1 - y_2|\,P(Y_{11} = y_1, Y_{21} = y_2).
\]

Letting q(y1, y2) = P(Y11 = y1, Y21 = y2) for brevity we find

\[
\begin{aligned}
E(D_2) &= (1 - w)\left[\tfrac{1}{2} q\left(0, \tfrac{1}{2}\right) + \tfrac{1}{2} q\left(\tfrac{1}{2}, 0\right) + \tfrac{1}{2} q\left(\tfrac{1}{2}, 1\right) + \tfrac{1}{2} q\left(1, \tfrac{1}{2}\right) + q(1, 0) + q(0, 1)\right] \\
&= (1 - w)\left[q\left(0, \tfrac{1}{2}\right) + q\left(\tfrac{1}{2}, 1\right) + 2q(1, 0)\right] \\
&= (1 - w)\left((1 - p)^2\, 2p(1 - p) + 2p(1 - p)p^2 + 2p^2(1 - p)^2\right) \\
&= (1 - w)\,2p(1 - p)(1 - p + p^2).
\end{aligned}
\]

Combining the expressions for E(D1) and E(D2) the final result follows

\[
\begin{aligned}
E(D) &= 2(1 - p)^2 p - (1 - w)\,2p(1 - p)(1 - p + p^2) \\
&= 2p(1 - p)\left(1 - p - (1 - w)(1 - p + p^2)\right).
\end{aligned}
\tag{1.5}
\]

For p = 0.5 the result E(D) = (3w - 1)/8 quoted in the main text appears. Based on (1.5) the expected value is negative if and only if

\[
w < \frac{p^2}{1 - p + p^2}. \tag{1.6}
\]

From this it follows that the expected distance is negative for p = 0.5 whenever w < 1/3. As p decreases towards 0, smaller values of w are needed to ensure negative values since the right hand side of (1.6) increases in p. Fig. 1, based on simulation of 500,000 SNPs, confirms the theoretical findings of this section. Based on this figure, the test statistic is 0 for a value close to 1/3, in agreement with the theoretical value E(D) = (3w - 1)/8 = (3(1/3) - 1)/8 = 0.
References
[1] N. Homer, et al., Resolving individuals contributing trace amounts of DNA to highly
complex mixtures using high-density SNP genotyping microarrays, PLoS Genet. 4
(8) (2008) 1–9.
[2] D. Clayton, On inferring presence of an individual in a mixture: a Bayesian
approach, Biostatistics 11 (4) (2010) 661–673.
[3] K.B. Jacobs, et al., A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies, Nat. Genet. 41 (11) (2009)
1253–1257.
[4] P.M. Visscher, W.G. Hill, The limits of individual identification from sample allele
frequencies: theory and statistical analysis, PLoS Genet. 5 (10) (2009) 1–6.
[5] R. Braun, et al., Needles in the haystack: identifying individuals present in pooled
genomic data, PLoS Genet. 5 (10) (2009) 1–8.
[6] J. Sampson, H. Zhao, Identifying individuals in a complex mixture of DNA with
unknown ancestry, Stat. Appl. Genet. Mol. Biol. 8 (1) (2009) 37.
[7] M.W. Perlin, B. Szabady, Linear mixture analysis: a mathematical approach to
resolving mixed DNA samples, J. Forensic Sci. 46 (6) (2001) 1372–1378.