G Model FSIGEN-708; No. of Pages 6
Forensic Science International: Genetics xxx (2011) xxx–xxx
journal homepage: www.elsevier.com/locate/fsig

Complex mixtures: A critical examination of a paper by Homer et al.

Thore Egeland a,1,*, A. Elida Fonneløp b,1, Paul R. Berg c, Matthew Kent c, Sigbjørn Lien c

a IKBM, Norwegian University of Life Sciences (UMB), Norway
b Institute of Forensic Medicine, University of Oslo, Norway
c Department of Animal- and Aquacultural Sciences and Centre for Integrative Genetics (CIGENE), Norwegian University of Life Sciences (UMB), Norway

ARTICLE INFO

Article history:
Received 25 July 2010
Received in revised form 7 February 2011
Accepted 17 February 2011

ABSTRACT

DNA evidence in criminal cases may be challenging to interpret if several individuals have contributed to a DNA-mixture. The genetic markers conventionally used for forensic applications may be insufficient to resolve cases where there is a small fraction of DNA (say less than 10%) from some contributors or where there are several (say more than 4) contributors. Recently, methods have been proposed that claim to substantially improve on existing approaches [1]. The basic idea is to use high-density single nucleotide polymorphism (SNP) genotyping arrays including as many as 500,000 markers or more and explicitly exploit raw allele intensity measures. It is claimed that trace fractions of less than 0.1% can be reliably detected in mixtures with a large number of contributors. Specific forensic issues pertaining to the amount and quality of DNA are not discussed in the paper and will not be addressed here. Rather, our paper critically examines the statistical methods and the validity of the conclusions drawn in Homer et al. (2008) [1]. We provide a mathematical argument showing that the suggested statistical approach will give misleading results for important cases.
For instance, for a two person mixture an individual contributing less than 33% is expected to be declared a non-contributor. The quoted threshold of 33% applies when all relative allele frequencies are 0.5. Simulations confirmed the mathematical findings and also provided results for more complex cases. We specified several scenarios for the number of contributors, the mixing proportions and allele frequencies and simulated as many as 500,000 SNPs. A controlled, blinded experiment was performed using the Illumina GoldenGate® 360 SNP test panel. Twenty-five mixtures were created from 2 to 5 contributors with proportions ranging from 0.01 to 0.99. The findings were consistent with the mathematical result and the simulations. We conclude that it is not possible to reliably infer the presence of minor contributors to mixtures following the approach suggested in Homer et al. (2008) [1]. The basic problem is that the method fails to account for mixing proportions.

© 2011 Elsevier Ireland Ltd. All rights reserved.

Keywords: Mixtures; SNP genotyping arrays; Mixing proportions

* Corresponding author. E-mail address: [email protected] (T. Egeland).
1 These authors contributed equally to this work.

1872-4973/$ – see front matter © 2011 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.fsigen.2011.02.003
Please cite this article in press as: T. Egeland, et al., Complex mixtures: A critical examination of a paper by Homer et al., Forensic Sci. Int. Genet. (2011), doi:10.1016/j.fsigen.2011.02.003

1. Introduction

A study in 2008 by Homer et al. [1] presents methods designed to resolve whether individuals are in a genomic DNA mixture using high-density single nucleotide polymorphism (SNP) genotyping arrays. The implications of the paper for genome-wide association studies have been profound. Several papers including [2–6] have discussed the paper and suggested modifications and alternative approaches. The paper has also led to the precautionary removal of large amounts of summary data from public access. We are, however, not aware of discussions related to important potential forensic applications. In this paper we examine Homer et al. [1] in a forensic context by means of mathematical analyses, simulation and a controlled, blinded experiment.

2. Methods

Below we first review the basic statistical methods of Homer et al. [1] in a way that is convenient for the subsequent discussion. Appendix A presents detailed calculations for a specific dataset. Initially it is sufficient to consider one SNP with alleles A and B. Three different estimates of the relative frequency of the A allele, $p_A$, are fundamental:

1. The individual estimate Y: If one is forced to estimate $p_A$ based only on the genotype of one individual, this can be done by letting Y be 0, 0.5 or 1 depending on whether the individual is homozygous BB, heterozygous AB or homozygous AA.
2. The population estimate Pop: This is the conventional estimate based on a reference database.
3. The mixture estimate M: Based on the fluorescence measurements $H_1$ and $H_2$ of alleles A and B, the following estimate is suggested based on the mixture:

$$M = \frac{H_1}{H_1 + kH_2}. \qquad (1.1)$$

The constant k corrects for possible differential amplification of A and B. This constant is assumed equal to 1 unless otherwise specified.

To illustrate the notation, consider a simple case with one SNP, one contributor and no measurement error. There are three possibilities:

1. The genotype is BB. Then $H_1 = 0$ since there is no measurement error and so, according to Eq. (1.1), M = 0.
2. The genotype is AB. Then $H_1 = H_2$ and M = 1/2.
3. The genotype is AA. Then $H_2 = 0$ and M = 1.

Observe that in this example the individual estimate Y coincides with the mixture estimate M. This will normally not be true if there are several contributors or if there is measurement error added to $H_1$ and $H_2$.
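The individual estimate Y and the mixture estimate M of Eq. (1.1) can be sketched in a few lines of Python. This is a minimal sketch and not code from the paper: the function names `individual_estimate` and `mixture_estimate` are hypothetical, and the error-free intensities are assumed to equal the allele counts of the single contributor.

```python
def individual_estimate(genotype: str) -> float:
    """Y: fraction of A alleles in one genotype (BB -> 0, AB -> 0.5, AA -> 1)."""
    if len(genotype) != 2 or any(allele not in "AB" for allele in genotype):
        raise ValueError(f"expected a two-letter genotype over A/B, got {genotype!r}")
    return genotype.count("A") / 2


def mixture_estimate(h1: float, h2: float, k: float = 1.0) -> float:
    """M = H1 / (H1 + k * H2) as in Eq. (1.1); k = 1 unless stated otherwise."""
    denominator = h1 + k * h2
    if denominator <= 0:
        raise ValueError("H1 + k * H2 must be positive")
    return h1 / denominator


# One SNP, one contributor, no measurement error.
# Assumption: the intensities equal the contributor's allele counts.
for genotype in ("BB", "AB", "AA"):
    h1 = float(genotype.count("A"))  # intensity of allele A
    h2 = float(genotype.count("B"))  # intensity of allele B
    print(genotype, individual_estimate(genotype), mixture_estimate(h1, h2))
```

In this degenerate setting the two functions return the same value for every genotype. With several contributors, noisy intensities or k different from 1 they generally differ.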
Next, two distances are calculated:

$$D_1 = \text{distance between the individual and the population} = |Y - \mathrm{Pop}|,$$
$$D_2 = \text{distance between the individual and the mixture} = |Y - M|,$$

and the final distance is defined as

$$D = D_1 - D_2 = |Y - \mathrm{Pop}| - |Y - M|.$$

Generally, D could be negative, zero or positive. However, for the simple illustration above, $|Y - M| = 0$ and so $D = D_1$, which is never negative by definition. To indicate the specific SNP (j) and individual (i), the expression may be rewritten to comply with the notation in [1]:

$$D = D(Y_{ij}) = |Y_{ij} - \mathrm{Pop}_j| - |Y_{ij} - M_j|. \qquad (1.2)$$

The distances for all s SNPs are converted to the conventional test statistic

$$T(Y_i) = \frac{\mathrm{mean}(D(Y_i))}{\mathrm{sd}(D(Y_i))/\sqrt{s}}, \qquad (1.3)$$

where sd denotes standard deviation. Consider next the null hypothesis H0: “Individual i is not in the mixture”. A significantly large positive value of $T(Y_i)$ leads to rejection of the null hypothesis. The degenerate case where only one individual, i, has contributed and there is no measurement error is again useful for illustration purposes. The distance based on a SNP will then never be negative, as explained previously. The numerator of (1.3) will be positive and also $T(Y_i) > 0$. A sufficiently large number of SNPs will therefore guarantee statistical significance. The null hypothesis will be rejected and the correct conclusion, individual i is in the mixture, will be reached. In the more general case when there are contributors in addition to individual i, the test will work according to [1] since “$M_j$ is shifted away from the reference population by $Y_i$’s contribution to the mixture”.

2.1. The controlled, blinded experiment

The statistical method presented in [1] was tested in a number of controlled experiments. The panel selected for the analysis was the Illumina GoldenGate® DNA Test Panel, which includes 360 single nucleotide polymorphisms (SNPs) distributed across the genome. Twenty-five mixtures were created from 2 to 5 contributors with proportions ranging from 0.01 to 0.99, see Table 1. The allele frequencies for the reference population were estimated from HapMap data http://hapmap.ncbi.nlm.nih.gov/. The true mixture compositions were unknown during analyses.

Table 1
The design and some results of the controlled, blinded experiments.

Mixture   4f      4d      4b      4h      4c
G         0.100   0.300   0.300   0.300   0.000
H         0.100   0.225   0.225   0.225   0.225
F         0.100   0.450   0.450   0.000   0.000
E         0.100   0.900   0.000   0.000   0.000
D         0.200   0.200   0.200   0.200   0.200
B         0.330   0.330   0.330   0.000   0.000
C         0.250   0.250   0.250   0.250   0.000
A         0.500   0.500   0.000   0.000   0.000
R         0.450   0.250   0.300   0.000   0.000
S         0.500   0.100   0.150   0.250   0.000
T         0.700   0.100   0.010   0.090   0.100
U         0.800   0.200   0.000   0.000   0.000
V         0.800   0.050   0.150   0.000   0.000
W         0.300   0.100   0.250   0.350   0.000
X         0.200   0.250   0.300   0.150   0.100
Y         0.010   0.990   0.000   0.000   0.000
Q         0.400   0.600   0.000   0.000   0.000
P         0.150   0.200   0.250   0.300   0.100
O         0.100   0.200   0.300   0.400   0.000
N         0.200   0.300   0.500   0.000   0.000
M         0.300   0.700   0.000   0.000   0.000
L         0.050   0.238   0.238   0.238   0.238
K         0.050   0.317   0.317   0.317   0.000
J         0.050   0.475   0.475   0.000   0.000
I         0.050   0.950   0.000   0.000   0.000

The number of contributors in the experiments and the amount contributed (0.01–0.99) are shown. The individuals correctly identified are indicated in grey. For example, in mixture G individuals 4b and 4h are correctly identified whereas the method fails to identify 4f and 4d. Individual 4c is correctly identified as a noncontributor.

3. Results and discussion

3.1. Theoretical arguments

The statistical methods of Homer et al. [1] lack solid theoretical foundation. This has been indicated previously in [2] in a nonforensic context: “This work appeared in a non-statistical journal and the statistical problem was stated somewhat informally. In particular, the precise role of the ‘reference population’ in the inference is not very clear.” Some further comments are in order.
The relevant reference allele frequencies may not be available or may differ depending on ethnic background as pointed out in [6]. This problem is much smaller for conventional forensic markers since these are chosen to have reasonably uniform distributions across populations.

We have some fundamental problems related to the formulation of the null hypothesis. If a null hypothesis is rejected, the opposite should be concluded. This is not the case here. Rather, the conclusion “Individual i is in the mixture” can only be drawn for a sufficiently large positive value of the test statistic. A large negative value indicates that individual i is not in the mixture but more “...ancestrally similar to the reference population than to the mixture, and thus less likely to be in the mixture” as pointed out in [1]. From a practical point of view, this is one example where it is not sufficient to base the conclusion on the p-value. It is also necessary to pay attention to the sign of the test statistic.

We next present theoretical arguments explaining why the method does not work as generally as claimed. It is sufficient to consider a relevant special case to provide a counter example. Assume individual 1 has contributed a fraction w to the mixture and that the remaining 1 − w comes from individual 2. There is no measurement error. In Appendix B it is shown that the expected distance (1.2) is

$$E(D) = \frac{3w - 1}{8}, \qquad (1.4)$$

when p = 0.5. This implies that there are three possible outcomes of the statistical test depending on the fraction individual 1 contributes:

1. If the fraction w is larger than 1/3, E(D) > 0 from Eq. (1.4). In this case the test statistic will be positive. For a large number of SNPs significance will be reached and the correct conclusion will emerge. Observe that this includes the important special case w = 0.5, corresponding to balanced mixtures discussed in [2–6].
2. If w = 1/3 then E(D) = 0. In this case the test is likely to be inconclusive.
3. If w < 1/3 the wrong conclusion is reached as then E(D) < 0.

The findings above based on the theoretical derivation are consistent with simulations and experimental results reported later in this paper. The main problem of the approach is that it does not account for or estimate the fraction contributed. Methods for estimating mixing proportions have been presented in another study [7].

3.2. Simulation

We specified several scenarios for the number of contributors, the mixing proportions and allele frequencies. Appendix A provides further details on the assumptions and simulation experiments.

Example 1. A two person mixture with no measurement error is considered once more. We simulated 500,000 SNPs. The relative frequency for the A allele was 0.5 for all SNPs. One example is provided in Fig. 1 where w ranges from 0.01 to 0.99. For w ≤ 0.32, $T(Y_i) \le -3.8$ (p-value = 0.00007) and there is strong evidence in favor of the wrong conclusion. There is overwhelming evidence for the correct conclusion, that the individual belongs to the mixture, when w ≥ 0.34 since then $T(Y_i) \ge 7.1$ (p-value ≈ 0). In Table 2, results are shown for varying number of SNPs. When the mixture is balanced, the contributors are identified with as few as 100 markers. Similarly, the wrong conclusion is obtained for w = 0.1 for 100 markers and the results get worse as the amount of data increases. For w = 0.33 the test is inconclusive: the p-values exceed 0.05 regardless of the number of markers. The simulations accurately confirm the previous mathematical finding summarized in Eq. (1.4). Similar results are obtained for other specifications of the database and also when measurement error was added, as is shown in Fig. 2. The value of the test statistic and the point where it changes from negative to positive may depend on the minor allele frequency (MAF). The main finding however remains: the wrong conclusion is reached for the minor contributor when the fraction is small. As the MAF decreases towards 0, the threshold for negative values also decreases and this is consistent with the result derived in Appendix B.

[Fig. 1 plot: Homer et al.'s test statistic T(Y) against w. Two contributors; individual 1 contributes the fraction w. 500,000 SNPs, allele frequency = 0.5. The point w = 1/3 is marked.]

Fig. 1. Figure shows that the conclusion depends on the amount contributed from the individual. The test statistic is close to 0 for w = 1/3 and the method therefore fails to identify the individual as a contributor. This confirms the mathematical derivation in Appendix B.

Table 2
Results of simulation experiment. Two person mixture.

T(Y)      p-Value      No. SNPs   w
−3.90     9.69E−05     100        0.10
−7.15     8.84E−13     1000       0.10
−27.08    1.73E−161    10,000     0.10
−86.37    0.00E+00     100,000    0.10
−200.52   0.00E+00     500,000    0.10
0.53      5.98E−01     100        0.33
0.10      9.20E−01     1000       0.33
0.09      9.29E−01     10,000     0.33
1.18      2.40E−01     100,000    0.33
1.70      8.92E−02     500,000    0.33
3.16      1.60E−03     100        0.50
7.14      9.63E−13     1000       0.50
26.18     5.07E−151    10,000     0.50
80.85     0.00E+00     100,000    0.50
180.67    0.00E+00     500,000    0.50

The number of SNPs and fractions contributed vary. The balanced case (w = 0.5) gives p-values close to 0 when there are more than 1000 SNPs. Significance is never reached when w = 0.33. The results are significantly wrong for w = 0.10.

[Fig. 2 plot: Homer et al.'s test statistic T(Y) against w in four panels (MAF 0.2, MAF 0.3, MAF 0.4 and MAF 0.5). Two contributors; individual 1 contributes w. SD = 0.1.]

Fig. 2. Simulation results for varying minor allele frequency (MAF). Measurement error is added with a standard deviation of 0.1. Again, for minor contributions (small w) the test statistic is negative and the wrong conclusion emerges.

Example 2. Table 3 displays the results of a three person mixture with 500,000 SNPs and no measurement error. Observe that for a balanced mixture the individuals are correctly identified as contributors. In other words, a person contributing a fraction of 0.33 is correctly identified in this case. However, the wrong conclusion is obtained for the minor contributor in the unbalanced scenario.

Table 3
Results of simulation experiment. Three person mixture.

Example   Person   T(Y)     p-Value   w
1         1        −197.9   0         0.1
1         2        97.0     0         0.4
1         3        223.0    0         0.5
2         1        74.8     0         0.33
2         2        72.3     0         0.33
2         3        73.7     0         0.33

The method again works when the mixture is balanced, corresponding to w = 0.33 for this three person mixture. Once more, the wrong conclusion is reached for the minor contributor, corresponding to line 1.

3.3. The controlled, blinded experiment

Table 1 shows the individuals identified correctly as contributors. Generally, major contributors tend to be identified and all samples with proportions exceeding 0.33 were correctly recognized by the test. The results are summarized in Table 4. For 30 (24.0%) cases the test was inconclusive in the sense that the p-value was greater than 0.05. Contributors were incorrectly assigned as non-contributors in 14 (11.2%) cases. The initial analysis used k = 1 in Eq. (1.1).
New analyses with a more appropriate estimate, k = 3, were performed; they led to slight improvements, and the results reported correspond to these analyses. The selection of SNPs for the Illumina GoldenGate® 360 SNP test panel includes SNPs on the X and Y chromosomes. These SNPs may not be suitable and the analyses were repeated with these markers removed. The results remained essentially similar.

3.4. Discussion

The controlled, blinded experiment is small. The number of SNPs in the analysis (360) is not comparable to the 500,000 analyzed in [1] and so there is a problem of insufficient power, particularly for the inconclusive findings. Experiments could be performed for data sets comparable to those used in [1]. However, this was not considered worthwhile in view of the theoretical findings.

Throughout we use a significance level of 0.05. It could be argued that corrections should be made for multiple testing. For the simulations based on 500,000 SNPs such corrections would have no impact for most cases, as conclusions would remain unchanged virtually regardless of the method used to correct. For the experimental data the situation differs, but we have chosen the traditional level since we regard the tests as descriptive.

The forensic genetics community should be encouraged to apply markers and methods of other areas of genetics. High-density single nucleotide polymorphism genotyping arrays may be of great importance for forensic casework. Issues related to the viability of the sub-optimal DNA sources frequently facing forensic laboratories should not delay the presentation of new data sources and statistical methods. The paper we have discussed should be welcomed as an important first attempt. However, the validity of findings needs to be critically examined.
While claims like “trace amounts of less than 0.1% can be reliably detected in mixtures with a large number of contributors” [1] may seem outrageous and counter intuitive and the evidence presented meager, careful arguments are required for a firm rebuttal. Such arguments have been presented and the theoretical arguments should be particularly convincing. We conclude, contrary to [1], that it is not possible to accurately infer the presence of contributors to unbalanced mixtures following the suggested approach.

Table 4
The results of the blinded experiment are summarized.

              Conclusion
Contributed   Not in mixture   Inconclusive   In mixture
No            39               0              0
Yes           14               30             42

For individuals who do not contribute to the mixture, corresponding to the ‘No’ line, the method never leads to the wrong conclusion. For contributors, however, the method gives an inconclusive result for 30 cases and the wrong conclusion 14 times.

Appendix A. Details on the Homer et al. statistic and simulation

We explain the simulation and the statistical calculations below based on a simple example. The data are provided in Table 5. There are 4 SNPs and three contributors (denoted 1, 2 and 3 in the ID column) to the mixture. We would like to test the hypothesis H0: “Individual 1 is not in the mixture”.

Table 5
Data used to explain statistical calculations and simulation.

SNP   ID    Genotype   H1    H2    Pop   w     Y     M      D
1     1     A B        NA    NA    0.5   0.1   0.5   NA     −0.05
1     2     A B        NA    NA    0.5   0.4   0     NA     −0.05
1     3     A A        NA    NA    0.5   0.5   1     NA     −0.05
1     mix   A B        1.1   0.9   0.5   1.0   NA    0.55   NA
2     1     B A        NA    NA    0.5   0.1   0.5   NA     0.00
2     2     B A        NA    NA    0.5   0.4   0.5   NA     0.00
2     3     B A        NA    NA    0.5   0.5   0.5   NA     0.00
2     mix   B A        1.0   1.0   0.5   1.0   NA    0.50   NA
3     1     A A        NA    NA    0.5   0.1   0     NA     −0.25
3     2     A B        NA    NA    0.5   0.4   0.5   NA     −0.25
3     3     A A        NA    NA    0.5   0.5   1     NA     −0.25
3     mix   A A        1.5   0.5   0.5   1.0   NA    0.75   NA
4     1     B B        NA    NA    0.5   0.1   0     NA     −0.15
4     2     B A        NA    NA    0.5   0.4   1     NA     −0.15
4     3     B A        NA    NA    0.5   0.5   0.5   NA     −0.15
4     mix   B B        1.3   0.7   0.5   1.0   NA    0.65   NA

There are four SNPs and three contributors to the mixture (denoted 1, 2 and 3) in the ID column. The individuals contribute in fractions 0.1, 0.4 and 0.5 as indicated in the w column. These fractions are not known and not used to calculate the distances D, but they need to be specified to simulate data.

Consider first SNP 1. The statistical test uses the genotype of individual 1. This genotype is AB. Since Y is 0, 0.5 or 1 depending on whether the individual is homozygous BB, heterozygous AB or homozygous AA, Y = 0.5 in the first line of Table 5. The fluorescence measurements are $H_1 = 1.1$ and $H_2 = 0.9$ and therefore M = 1.1/(1.1 + 0.9) = 0.55, as shown in the line ‘mix’. For this example we assume the allele frequency of A, denoted Pop, to be 0.5. The distance is therefore

$$D(Y) = |Y - \mathrm{Pop}| - |Y - M| = |0.5 - 0.5| - |0.5 - 0.55| = -0.05.$$

The distances for the 3 other SNPs are calculated similarly, and the final test is based only on these distances: −0.05, 0, −0.25, −0.15. The mean value is −0.11250 and the standard deviation 0.11087, and so the test statistic is

$$T(Y_1) = \frac{\mathrm{mean}(D(Y_1))}{\mathrm{sd}(D(Y_1))/\sqrt{s}} = \frac{-0.11250}{0.11087/\sqrt{4}} = -2.03.$$

For a small number of SNPs, say less than 30, it would be reasonable to use a t-distribution to convert the value of the test statistic to a p-value. However, a normal distribution is used since the central limit theorem justifies this distribution for the large number of SNPs used in our examples, and −2.03 corresponds to a p-value of 0.042. (The normal distribution may not be appropriate for real data.)
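The worked example can be retraced numerically. The following is a minimal sketch using only the Python standard library; the function names `homer_distances` and `homer_statistic` are hypothetical, and the two-sided normal p-value is an assumption about how the quoted p-value was obtained.

```python
import math
from statistics import NormalDist, mean, stdev


def homer_distances(y, pop, m):
    """D_j = |Y_j - Pop_j| - |Y_j - M_j| for each SNP j, as in Eq. (1.2)."""
    return [abs(yj - pj) - abs(yj - mj) for yj, pj, mj in zip(y, pop, m)]


def homer_statistic(distances):
    """T = mean(D) / (sd(D) / sqrt(s)), Eq. (1.3), with the sample standard deviation."""
    s = len(distances)
    return mean(distances) / (stdev(distances) / math.sqrt(s))


# Values read from Table 5 for individual 1 and the 'mix' rows.
y1 = [0.5, 0.5, 0.0, 0.0]    # individual estimate Y at SNPs 1-4
pop = [0.5, 0.5, 0.5, 0.5]   # population frequency of allele A
h1 = [1.1, 1.0, 1.5, 1.3]    # mixture intensity of allele A
h2 = [0.9, 1.0, 0.5, 0.7]    # mixture intensity of allele B
m = [a / (a + b) for a, b in zip(h1, h2)]  # Eq. (1.1) with k = 1

d = homer_distances(y1, pop, m)
t = homer_statistic(d)
# Two-sided p-value from the standard normal distribution (assumed convention).
p_value = 2 * (1 - NormalDist().cdf(abs(t)))

print([round(x, 2) for x in d])
print(round(t, 2), round(p_value, 3))
```

The sign of `t` matters as much as the p-value: a negative statistic points away from membership in the mixture, even though individual 1 is a true contributor here.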
Recall that a negative value of the test statistic indicates that the person in question is more “...ancestrally similar to the reference population than to the mixture, and thus less likely to be in the mixture” [1]. There are far too few SNPs to conclude. However, the same wrong conclusion is obtained if this example is scaled up to 500,000 SNPs as shown in Table 3.

Simulation: The R-software (http://www.R-project.org) was used to simulate and analyze data. We need to specify the number of contributors and their proportions. For this example, there are three individuals contributing in proportions 0.1, 0.4 and 0.5 as indicated by column w of Table 5. Hardy–Weinberg equilibrium and linkage equilibrium are assumed. Peak heights are simulated for each contributor from a normal distribution with specified expectation and standard deviation.

Appendix B. Mathematical derivation of the expected distance

In general it is not possible to derive simple formulae for the test statistic (1.3) used to determine whether an individual has contributed to a mixture. However, relevant calculations can be performed in some simple cases. Below we consider a two-person mixture with no measurement error. Since the test is based on the distances we consider the expected value

$$E(D) = E(D_1) - E(D_2) = E(|Y_{11} - p| - |Y_{11} - M_1|),$$

where for brevity Pop is denoted p. Since p is the same for all markers it is sufficient to consider one marker below. Without loss of generality we can and will assume $p \le 1/2$: the result for p > 1/2 is obtained by replacing p by 1 − p.
We first consider $D_1$ and derive the expected value from the standard formula as

$$E(D_1) = E(|Y_{11} - p|) = |0 - p| \, P(Y_{11} = 0) + \left|\tfrac{1}{2} - p\right| P(Y_{11} = 0.5) + |1 - p| \, P(Y_{11} = 1).$$

We assume Hardy–Weinberg equilibrium and then

$$P(Y_{11} = 0) = P(\text{“individual is BB”}) = (1 - p)^2.$$

The remaining probabilities are found similarly, $P(Y_{11} = 0.5) = 2p(1 - p)$ and $P(Y_{11} = 1) = p^2$, and therefore

$$E(D_1) = p(1 - p)^2 + \left(\tfrac{1}{2} - p\right) 2p(1 - p) + (1 - p)p^2 = 2(1 - p)^2 p.$$

Consider next $D_2 = |Y_{11} - M|$. Individual 1 contributes a fraction w and individual 2 the remaining fraction (1 − w). The estimates of the A-allele frequency based on the genotypes of individuals 1 and 2 are $Y_{11}$ and $Y_{21}$, respectively. By accounting for the fractions contributed, it follows that the mixture estimate of the A-allele frequency is

$$M_1 = wY_{11} + (1 - w)Y_{21}.$$

For instance, if the individuals contribute equally, corresponding to w = 0.5, and are heterozygous, the mixture based estimate is the intuitive M = 0.5 × 0.5 + (1 − 0.5) × 0.5 = 0.5. Observe that we may write

$$D_2 = |Y_{11} - M| = (1 - w)|Y_{11} - Y_{21}|.$$

According to the general formula for expectation

$$E(D_2) = (1 - w) \sum_{y_1, y_2} |y_1 - y_2| \, P(Y_{11} = y_1, Y_{21} = y_2).$$

Letting $q(y_1, y_2) = P(Y_{11} = y_1, Y_{21} = y_2)$ for brevity we find

$$
\begin{aligned}
E(D_2) &= (1 - w)\left[\tfrac{1}{2} q\left(0, \tfrac{1}{2}\right) + \tfrac{1}{2} q\left(\tfrac{1}{2}, 0\right) + \tfrac{1}{2} q\left(\tfrac{1}{2}, 1\right) + \tfrac{1}{2} q\left(1, \tfrac{1}{2}\right) + q(1, 0) + q(0, 1)\right] \\
&= (1 - w)\left[q\left(0, \tfrac{1}{2}\right) + q\left(\tfrac{1}{2}, 1\right) + 2q(1, 0)\right] \\
&= (1 - w)\left((1 - p)^2 \, 2p(1 - p) + 2p(1 - p) \, p^2 + 2p^2(1 - p)^2\right) \\
&= (1 - w) \, 2p(1 - p)(1 - p + p^2).
\end{aligned}
$$

Combining the expressions for $E(D_1)$ and $E(D_2)$ the final result follows:

$$E(D) = 2(1 - p)^2 p - (1 - w) \, 2p(1 - p)(1 - p + p^2) = 2p(1 - p)\left(1 - p - (1 - w)(1 - p + p^2)\right). \qquad (1.5)$$

For p = 0.5 the result $E(D) = (3w - 1)/8$ quoted in the main text appears.
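The closed form in Eq. (1.5) can be checked by brute-force enumeration of the nine genotype pairs. The sketch below is illustrative only: the function names are hypothetical, and the two contributors are assumed to be unrelated, so that their genotypes are independent under Hardy–Weinberg equilibrium.

```python
from itertools import product


def genotype_probabilities(p):
    """Hardy-Weinberg probabilities of Y = 0 (BB), 0.5 (AB) and 1 (AA)."""
    return {0.0: (1 - p) ** 2, 0.5: 2 * p * (1 - p), 1.0: p ** 2}


def expected_distance_enumerated(w, p):
    """E(D) summed over all genotype pairs of two independent contributors."""
    probs = genotype_probabilities(p)
    total = 0.0
    for (y1, q1), (y2, q2) in product(probs.items(), repeat=2):
        m = w * y1 + (1 - w) * y2  # error-free mixture estimate
        total += q1 * q2 * (abs(y1 - p) - abs(y1 - m))
    return total


def expected_distance_closed_form(w, p):
    """Eq. (1.5); the derivation assumes p <= 1/2."""
    return 2 * p * (1 - p) * ((1 - p) - (1 - w) * (1 - p + p ** 2))


for w in (0.1, 1 / 3, 0.5, 0.9):
    for p in (0.2, 0.5):
        print(w, p,
              expected_distance_enumerated(w, p),
              expected_distance_closed_form(w, p))
```

For p = 0.5 both functions reduce to (3w − 1)/8, which is negative at w = 0.1, zero at w = 1/3 and positive at w = 0.5.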
Based on (1.5) the expected value is negative if and only if

$$w < \frac{p^2}{1 - p + p^2}. \qquad (1.6)$$

From this follows that the expected distance is negative for p = 0.5 whenever w < 1/3. As p decreases towards 0, smaller values of w are needed to ensure negative values since the right hand side of (1.6) increases in p. Fig. 1, based on simulation of 500,000 SNPs, confirms the theoretical findings of this section. Based on this figure, the test statistic is 0 for a value close to 1/3, in agreement with the theoretical value

$$E(D) = (3w - 1)/8 = (3(1/3) - 1)/8 = 0.$$

References

[1] N. Homer, et al., Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays, PLoS Genet. 4 (8) (2008) 1–9.
[2] D. Clayton, On inferring presence of an individual in a mixture: a Bayesian approach, Biostatistics 11 (4) (2010) 661–673.
[3] K.B. Jacobs, et al., A new statistic and its power to infer membership in a genomewide association study using genotype frequencies, Nat. Genet. 41 (11) (2009) 1253–1257.
[4] P.M. Visscher, W.G. Hill, The limits of individual identification from sample allele frequencies: theory and statistical analysis, PLoS Genet. 5 (10) (2009) 1–6.
[5] R. Braun, et al., Needles in the haystack: identifying individuals present in pooled genomic data, PLoS Genet. 5 (10) (2009) 1–8.
[6] J. Sampson, H. Zhao, Identifying individuals in a complex mixture of DNA with unknown ancestry, Stat. Appl. Genet. Mol. Biol. 8 (1) (2009) 37.
[7] M.W. Perlin, B. Szabady, Linear mixture analysis: a mathematical approach to resolving mixed DNA samples, J. Forensic Sci. 46 (6) (2001) 1372–1378.