BIOINFORMATICS ORIGINAL PAPER Vol. 23 no. 14 2007, pages 1807–1814 doi:10.1093/bioinformatics/btm260 Genetics and population analysis Error detection in SNP data by considering the likelihood of recombinational history implied by three-site combinations Donna M. Toleno, Peter L. Morrell and Michael T. Clegg* Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA Received on December 13, 2006; revised on April 30, 2007; accepted on May 9, 2007 Advance Access publication May 17, 2007 Associate Editor: Martin Bishop ABSTRACT Motivation: Errors in nucleotide sequence and SNP genotyping data are problematic when inferring haplotypes. Previously published methods for error detection in haplotype data make use of pedigree information; however, for many samples, individuals are not related by pedigree. This article describes a method for detecting errors in haplotypes by considering the recombinational history implied by the patterns of variation, three SNPs at a time. Results: Coalescent simulations provide evidence that the method is robust to high levels of recombination as well as homologous gene conversion, indicating that patterns produced by both proximate and distant SNPs may be useful for detecting unlikely three-site haplotypes. Availability: The perl script implementing the described method is called EDUT (Error Detection Using Triplets) and is available on request from the authors. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. 1 INTRODUCTION Single nucleotide polymorphisms (SNPs) are the most common form of genetic variation. The most rapid method of collecting complete SNP data for a chromosomal segment is through direct nucleotide sequencing of PCR-amplified DNA fragments. SNP data can be used to calculate population-level diversity statistics even when the phase or arrangement of SNPs and insertion/deletion polymorphisms (indels) in the diploid individual remains unknown. The phase of mutations that make up parental haplotypes cannot be determined through direct sequencing of PCR products, but must be resolved through either computational or experimental means (Clark et al., 1998). Correctly inferred haplotypes increase power for inferences and predictions of disease penetrance, drug responses, diagnostic test variation and risk factors for complex diseases (Clark, 2004; Lee et al., 2005). Accurate inference of haplotypes is also important for population genetics hypothesis testing and parameter estimation. Haplotyping errors have been shown to affect estimation *To whom correspondence should be addressed. of any measure requiring knowledge of the arrangement of SNP states, such as the population recombination rate and the quantification of associations among polymorphism (Morrell et al., 2006; Ptak et al., 2004; Wall, 2004). Errors in resequencing data appear to be commonly encountered and seldom reported. For example, Morrell et al. (2005, 2006) report that standard sequencing practices failed to identify a heterozygous individual in 6 out of 11 cases. The impact of these errors on analyses varies, with the greatest impact on estimated rates of recombination and homologous gene conversion (Morrell et al., 2006). The possible means for error introduction are numerous and include problems related to both genotyping and phasing. Many research projects require the resequencing of a chromosomal region in a large number of diploid individuals. The potential for resequencing errors is high, in part because every nucleotide site in every individual must be screened for heterozygosity. Miscalled genotypes at heterozygous sites are a major source of error (Morrell et al., 2006). The rarest SNP variants are likely to be present in a heterozygous state, therefore, the investigator may not have a priori knowledge of the presence of the SNP during the genotyping process. Even with highly trained technicians and the best possible quality control measures implemented by sequence assembly software, genotypes of individual SNPs are often called incorrectly in directly sequenced diploid regions (Zhang et al., 2005). Haplotype inference often depends on computational phasing methods which require arranging SNPs into the most likely configurations. Such methods rely heavily on accurate interpretation of raw sequence trace data or SNP genotyping results (Halldórsson et al., 2004). Experimental phasing methods require either cloning of PCR products or the amplification of one haplotype at a time with allele specific PCR (AS-PCR). Inferring haplotypes with AS-PCR is an alternative to computational phasing or labor-intensive cloning methods (Clark et al., 1998). It is important to note that obtaining the correct genotype at every site does not guarantee accurate inference of haplotypes with any phasing method (Lajoie and El-Mabrouk, 2005). Screening for errors is desirable even when experimental means are used to recover haplotypes in individuals heterozygous for a locus. Both cloning and AS-PCR involve multiple PCR products and templates resulting in an error-prone assembly process. The cloning process is particularly ß The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] 1807 D.M.Toleno et al. error-prone. High rates of polymerase error make it difficult to recover accurate sequence from each clone, forcing the use of a majority rule approach to determine each haplotype. The clones obtained from any given PCR reaction may contain PCR chimeras (Cronn et al., 2002) and biases in the success rate for cloning the two haplotypes (Morrell et al., 2006). Also, AS-PCR can fail to prime specifically, resulting in genotyping errors. Cloning and AS-PCR are also vulnerable to human and software errors; errors may include sequence assembly problems, sample mislabeling, misinterpretation of sequence traces or sample contamination. While careful sequencing protocols are likely to minimize errors, a tool to identify a subset of unlikely base calls that require re-inspection will be extremely valuable when high data quality is essential. In the present article we present a method for error detection in which haplotypes are examined three SNPs at a time. We refer to the method as EDUT (error detection using triplets) and we present and test a Perl script implementing the method. EDUT is most sensitive to ‘switch errors’, SNPs occurring on an incorrect haplotype background. Detecting switch errors is difficult because the base calls that lead to incorrect inference of haplotypes frequently occur at known SNPs. In the absence of a method evaluating the likelihood of SNP arrangements, an error can appear to be a likely base call. Switch errors are often repeatable, and can result from machine scored data. There is currently no routine procedure for detecting this type of error. In a previous paper, the EDUT method was used to identify switch errors in haplotype data from a resequencing project (Morrell et al., 2006). Researchers have avoided the problems associated with direct sequencing of nuclear genes in diploid organisms by focusing on mitochondrial and chloroplast loci not subject to sequencing difficulties caused by heterozygosity, or on organisms where inbreeding or genetic manipulation result in homozygosity at nuclear genes (Wright and Gaut, 2005). As population studies are expanded to a wider range of diploid organisms (including humans), a method with the broad applicability demonstrated by EDUT will be extremely useful. Published methods for detecting errors in sequencing and phasing, particularly for data from humans, have made use of the relatedness of haplotypes sampled in parents, progeny and siblings (Abecasis et al., 2002; Becker et al., 2006; Douglas et al., 2000; Heath, 1998; Lange et al., 1988; O’Connell and Weeks, 1998; Sobel et al., 2002; Thomas, 2005). Because haplotypes are passed from parents to progeny with a very small probability of being altered by mutation or rearranged by recombination, nucleotide sequence information from related individuals with a known pedigree is useful for imputation of missing data and haplotype reconstruction (Li and Jiang, 2005), and for error detection (Abecasis et al., 2002; Becker et al., 2006). In samples from most organisms, family relationships are difficult or impossible to determine, thus creating the need for a non-pedigree-based method for error detection. Evaluation of haplotype likelihood has been used for both phase estimation and missing genotype imputation in samples of unrelated individuals (Scheet and Stephens, 2006; Stephens and Scheet, 2005). The detection of genotyping errors is a natural extension of haplotype reconstruction methods; 1808 both efforts rely on the SNP associations created by mutation and a finite amount of historical recombination. EDUT can be applied to resequencing or SNP genotyping data and any combination of computationally or experimentally phased haplotypes. The error detection procedure exploits an important property of haplotype data; some haplotypes are more likely than others. Assuming one mutation per site, S SNPs can yield 2S 1 unique haplotype configurations, only a subset of which are generally observed. EDUT flags a SNP within a specific haplotype whenever an unusual haplotype arrangement occurs for a SNP triplet. A flagged site will be referred to as a ‘flagged base’ to emphasize the role of the flagged site in a single haplotype. The phrase is not meant to limit the discussion to resequencing data. EDUT identifies a subset of base calls that create unlikely arrangements within the inferred haplotype; thus it finds base calls likely to have been inferred incorrectly. The subset consists of the base calls contributing most directly to evidence for a double recombination event. In principle, the method is analogous to the error detection approach of Lincoln and Lander (1992) routinely used in genetic mapping; in practice, it is analogous to the routine process of singleton SNP verification. 2 2.1 METHODS The algorithm EDUT processes population data by first identifying all of the SNPs in the data set. The algorithm initially includes all sites, including those in alignment gaps because such sites can include SNPs and may be informative about the evolutionary history of the genomic region. As illustrated in Figure 1, SNP data is converted to a binary format. By convention, the majority state is coded as 0 and the minority state is coded as 1. EDUT accepts data input in standard formats, including aligned fasta files. Under an infinite sites model, segregating sites will only be altered by mutation one time in the history of the population (Kimura, 1969). This assumption holds for the vast majority of SNPs within a sample from a randomly mating population of a single species. For example, 52% of all SNPs in wild barley resequencing data of Morrell et al. (2006) show evidence of multiple hits by mutation. When there are more than 2 nucleotide states at a site, EDUT codes the majority state as a 0 and the two or three minor states as 1. Homoplasy events (multiple origins of the same SNP state) are less likely than mutation to a unique 3rd state. When the data shows only minimal violations of the infinite sites assumption, homoplasies have a negligible effect on the effectiveness of this algorithm and are therefore not considered during processing. All (((S)*(S 1))/2) pairs of SNPs are evaluated with the four-gamete test. For any pair of nucleotide sites, only three configurations are possible on the basis of unique mutations. The presence of all four gametes indicates a historical recombination event between the pair of sites (Hudson and Kaplan, 1985). EDUT creates a table storing the results of the four-gamete test. In Figure 1, four gamete tests are true for the site pairs labeled AB and BC. The number of sets of three site combinations is denoted NT where NT ¼ (S*(S 1)*(S 2))/6. All NT triplets are evaluated using stored four-gamete test results so that ‘pattern a’ is identified when no recombination is inferred for the outer pair of SNPs with all four arrangements present at each of the SNP pairs involving the middle SNP (Fig. 1). For all pattern a triplets detected, the rare types contributing to the two diagnostic four-gamete tests are flagged and reported. Rare types are defined by default Error detection in SNP data Fig. 1. Three polymorphic sites are illustrated in a hypothetical population of 30 chromosomes. (a) Curved arrows below the sequence data represent evidence of recombination. The orthogonal lines above the sequence data indicate that there is no recombination evident between the outer sites. The sequence data is in pattern a as described in Padhukasahasram et al. (2004) and Morrell et al. (2006). (b) Coding of nucleotide states: polymorphic sites are coded into binary data. By convention, 0 is assigned to the common state. (c) Counts of each coded pair: The presence of all four two-site arrangements indicates a recombination event between the sites. There is evidence of a recombination event between the gray (B) and the black (A and C) labeled sites as shown in the table. The lowest frequency site responsible for defining pattern a is the most likely source of error in the data set as indicated by boxes in (a and c). as a single occurrence of a two-site arrangement. At times there may be a tie for the rare type. Two pairs of states may each occur only once causing two different individuals in the population to be critical to the inference of the four gamete test and therefore to the pattern a triplet. For each flagged base, a counter is maintained and incremented every time the site is involved in a rare pair of a pattern a triplet. The ultimate value of this counter for each base flagged is reported as the raw priority score (P) in the summary output. Any given SNP variant present in more than one of the samples is considered informative. Each one of S informative SNPs participates in a total of T triplets, where T ¼ ([S 1][S 2])/2. The quantity, a is defined as the proportion of triplets in pattern a relative to the total number of triplets. To arrive at an approximation of the number of pattern a triplets expected for each site, T is multiplied by a. An additional correction is made to account for SNP position; sites that are closer to the middle of the series of informative SNPs are more likely to be flagged twice in a single triplet. For the ith SNP this correction is designated M, where M ¼ (i 1) (S i). M ¼ 0 when the SNP is either the first or last in the region considered. The weighted priority score is defined as PW, where PW ¼ P/(a [T þ M]). 2.2 Applying the method The EDUT program was tested against resequencing data from wild barley and avocado where all sequence traces were available. A total of 22 data sets of inferred haplotypes from 25 individuals for wild barley and 41–53 individuals for avocado were used as input into EDUT. In the majority of the data sets no sites were flagged, but in one case the weighted scores drew attention to an assembly error. Inspection of EDUT-flagged base calls in the trace data and consideration of peak heights and quality scores in Consed (Gordon et al., 1998) aid in the assessment of base call accuracy. EDUT includes an option for processing data directly from the coalescent simulation package, ms (Hudson, 2002). Simulations permitted rigorous testing of EDUT (see Table 1). In each simulation replicate, an error was introduced by changing the 10th site in the first chromosome from one binary state to the other (1 changed to 0, 0 changed to 1). The HU simulations were designed to resemble human resequencing data. The GE (genotyping error) simulations are designed to represent data as it is frequently collected, i.e. from diploid individuals where the phase of each SNP is unknown. Paired haplotypes in the simulations served as the simulated genotypes. A single error in a single genotype was introduced at the 10th SNP in each replicate simulation; pairs of the simulated chromosomes were subsequently passed as input to Phase 2.1 (Stephens et al., 2001). The resulting inferred haplotypes were used as input for EDUT. For 100 replicate simulations, the process was repeated a total of three times so that the single error was introduced into each of the three possible genotypes (homozygous for the SNP state with major frequency, heterozygous or homozygous for minor SNP state). LS (large sample) simulations included an error at every 100th simulated chromosome in the sample. This represents a low frequency but systematic occurrence of haplotype error. PS simulations were designed to include population structure. Population structure is common in samples from multiple geographic regions and in comparative studies of populations. Four populations with 25 chromosomes per population were simulated with populations exchanging 1 migrant per generation. The SF (SNP frequency) simulations were designed to test the sensitivity of EDUT to various allele frequencies and states. For each simulation, allele frequencies at the SNP where the error was introduced were categorized as major or minor and either ancestral or derived. We evaluate performance of EDUT for all four combinations (Tables 1 and 2). Unlike the GE simulations above, the SNP frequencies were tabulated after the simulation analysis. For SF simulations, 10 000 replicates were used to increase the incidence of extreme allele frequencies. The GC simulations were designed to test for the effect of gene conversion on the sensitivity of error detection. Simulation parameters included the mutation parameter, ¼ 4Ne and the recombination parameter, ¼ 4Ner where Ne is effective population size, is per base rate of mutation and r is the per base pair rate of recombination. The gene conversion parameter, f, was defined as the ratio of the rate of gene conversion relative to the rate of crossover. The values of f used in simulations were 0, 2, 5 and 10; spanning the range of values estimated from empirical data (Padhukasahasram et al., 2006; Wall, 2004). All data analysis, including graphics, was performed with the R statistical environment and programming language (R Development Core Team, 2005). 3 3.1 RESULTS The EDUT method The inspiration for the EDUT methodology resulted from careful evaluation of resequencing data from wild barley (Morrell et al., 2005, 2006). All sequences reported were inferred from high quality sequence trace data with a minimum quality criterion of a Phred score 20 on both forward and reverse reads. As is the customary practice in population genetics resequencing projects, singleton mutations were verified with a new PCR amplification followed by sequencing with coverage of the SNP on the forward and reverse strand. While surveying the data for patterns indicating gene conversion, it was clear that typing errors at key SNP sites could be responsible for producing a triplet pattern resembling the effect of gene conversion. Figure 1 illustrates how EDUT detects 1809 D.M.Toleno et al. Table 1. Descriptions of simulations used to test EDUT under a variety of circumstances Simulation Number of chromosomes Number of SNPs Recombination (distance) in units of kb of human DNA Purpose Total number of simulated data sets HU 100 50 Testing the effectiveness of EDUT on simulated human data. 100 GE 100 Variable, mean ¼ 208 SNPs (expected for 100 chromosomes and 50 kb human DNA) Fixed, 50 SNPs 100 100 LS 200 400 600 800 Fixed, 50 SNPs 100 PS 4 groups of 25 Fixed, 50 SNPs 100 Testing the effects of various types of genotyping error: heterozygous SNPs, homozygous common, homozygous rare. See Methods section for details. Testing the effects of a very large sample. For one SNP an error was introduced every 100th haplotype for up to eight total errors at one SNP. Testing the effect of population subdivision. Simulated chromosomes are split evenly into four subpopulations with one migration per generation Testing the effects of allele frequency on error flagging and priority scores. Exactly the same as ‘GE’ simulations. Testing the effects of gene conversion (f ¼ 0, 2, 5, 10) and number of sampled haplotypes. (25, 50, 100). Error introduced into one haplotype at one SNP, and two SNPs. SF 100 Fixed, 50 SNPs 100 GC 25, 50, 100 Variable, mean ¼ 20.9 SNPs 10 errors using three SNPs and haplotype frequencies typical for a sample of 30 chromosomes. Pattern a, utilized in error detection as illustrated in Figure 1, occurs at a consistently low frequency under many biologically realistic circumstances (See Fig. 2 above and discussion of Fig. 2 below). 3.2 100 10 000 100 each sites. However, quality scores for these base calls were greater than Phred 20, the quality criterion used in the assembly of the data (Fig. S2). A partial sequence trace is shown in Figure S3. These results demonstrate that inspection of sequence quality can provide supporting evidence for a switch error in any inferred haplotype, but searching for a low quality score is not necessary or sufficient for detecting a sequencing error. Analyses performed on empirical data An example of the utility of EDUT on empirical data was presented in Morrell et al. (2006). For example, in the Cbf3 locus (Morrell et al., 2005, 2006), EDUT assisted in the detection and correction of four incorrect haplotypes. Homozygosity was incorrectly assumed for 13 SNPs in one sample and 5 SNPs in another. In each case, two switch errors lead to the inference that base calls from these two haplotypes were incorrectly combined. Table S1 (found online) includes the example EDUT output from the wild barley Cbf3 locus. Figure S1 demonstrates the resolution of haplotypes at Cbf3. Inspection of the trace files associated with possible errors flagged by EDUT lead to the identification of localized drops in Phred quality scores for incorrectly genotyped heterozygous 1810 100 for each sample size 3.3 Analyses performed on simulated data As discussed above, pattern a occurs at low and relatively consistent frequency under many biologically realistic circumstances (Wiuf and Hein, 2000). Figure 2 illustrates the proportion of pattern a for groups of 100 simulations across values ranging from 5 to 90. The total number of triplets considered in the proportion includes SNPs for which the minority state occurs at least twice in the sample. Singleton SNPs, occurring only once in the sample, can never participate in pattern a. Simulations represented in Figure 2 are based on / ¼ 1.5. At lower rates of recombination, holding / constant causes the simulated data to have few SNPs and thus the SD for Error detection in SNP data Table 2. Results of the simulations summarized in Table 1 Simulation Details Percent of errors flagged Mean number of sites in flagged list per simulation [SD] HU Equivalent to 50 kb human sequence 93 895 [238] GE Error at a heterozygous SNP Error at a homozygous common SNP Error at a homozygous rare SNP 44 41 10 85 [26.3] 87 [28] 81 [31] LS N ¼ 200 N ¼ 400 N ¼ 600 N ¼ 800 64 51 32 36 71 54 32 44 PS Four subpopulations SF All SF simulations Error at majority, ancestral base Error at majority, derived base Error at minority, ancestral base Error at minority, derived base GC N ¼ 25, f ¼ 0 N ¼ 25, f ¼ 2 N ¼ 25, f ¼ 5 N ¼ 25, f ¼ 10 N ¼ 50, f ¼ 0 N ¼ 50, f ¼ 2 N ¼ 50, f ¼ 5 N ¼ 50, f ¼ 10 N ¼ 100, f ¼ 0 N ¼ 100, f ¼ 2 N ¼ 100, f ¼ 5 N ¼ 100, f ¼ 10 74 68 55 67 58 64 65 69 58 72 73 65 Sensitivity of the priority score 3.3 1.3 1.6 0.14 [21] [18] [13] [13] 0.9 0.4 0.19 0.14 68 138 [28] 1.98 76 87 66 42 27 94 [26] - 1.3 0.57 1.4 0.17 0.21 9.6 12.0 14.8 14.5 11.1 12.8 14.8 15.7 8.6 15.3 18.1 10.2 [12.7] [12.7] [15.2] [12.3] [13.9] [12.4] [12.0] [12.4] [9.2] [12.6] [13.3] [9.8] 2.5 1.7 1.4 1.4 1.4 1.4 1.1 1.1 2.1 1.6 1.6 1.3 The sensitivity indicated in the last column is defined as follows: [(mean error score)—(mean score for all results)]/(SD of all scores). the proportion of triplets in pattern a is elevated. As the simulated recombination rate increases, the mean of a (proportion of all triplets in pattern a) remains relatively constant and small (510%). The results of EDUT with a variety of simulated data are summarized with the rightmost three columns of Table 2. The percentage of all replicate simulations flagging the known error is reported. For the simulated human data, the known error is flagged in 93% of simulations. In the SF simulations, the error was flagged in 76% of simulations. Detection is decreased when the errors are introduced into unphased genotypes as in the GE simulations, where the error was flagged in 40% of replicate simulations for the two genotype classes including at least one majority SNP and only 10% of the time for the homozygous minority SNP genotype. There is also a trend toward lower detection levels when gene conversion is included and when allowing more than one error at a SNP (LS simulations). However, there is no clear impact of population structure on error flagging (PS simulations). The second important summary of the EDUT results is the mean number of sites appearing in the flagged site lists. A shorter list is more desirable because it will be easier to verify Fig. 2. The results of coalescent simulations designed to test the impact of recombination rate on the proportion of pattern a (A) in a data set. A and 95% confidence intervals (defined by þ/ 1.96 standard deviations) are plotted against the per locus indicated on the x-axis. Simulations are based on / ¼ 1.5 and with a gene conversion rate of f ¼ 5. 1811 D.M.Toleno et al. the flagged genotypes in the raw data. The number of flagged bases increases along with the number of SNPs in the data used for input. There is a trade-off between the percentage of errors flagged and the total number of sites flagged. For the HU simulations, which represent 50 kb of human data, there were an average of 208 SNPs per simulation (Table 1). This resulted in 895 flagged bases or 4% of all the (208 100) possible sites. The trade-off is more favorable for the smaller data sets where the total sites flagged represents 51% of the possible sites. The third piece of information in Table 2 describes the ability of the priority score to distinguish between true errors and the background sites flagged. A sensitivity score is calculated in which the mean score for the known errors is compared to the priority scores for all sites. In all but one error category in Table 2 the score was positive, indicating the difference is in the expected direction, i.e. known errors have higher priority scores. The magnitude of the difference is highest for the two categories of simulation in which fixed and realistic ratios of / were considered (HU and GC). Considering the standard deviation (SD) of the overall score, the differences were 2 SDs for the GC simulation with f ¼ 0 and 3.1 SDs for the simulated human data. The GE, LS, PS and SF simulations are more conservative tests of the method because they use a high recombination rate with the number of SNPs fixed at 50 per simulation. 3.4 Processing time For computational efficiency, EDUT initially evaluates each SNP pair one time to create a table of inferred recombination events. The stored information for the SNP pairs is referenced as each triplet is considered. The processing time therefore increases exponentially with the number of SNPs and is much less dependent on the depth of the sample. Small data sets are processed quickly. For example, the Cbf3 locus discussed in the text, which includes 26 chromosomes and 25 informative SNPs, requires only 3 s of processing time on an AMD Opteron 275 (2.2 GHz) with 16 GB of RAM. A simulated human sequence representing 50 kb of human data (labeled HU in Table 1) was processed in 6 min on the same processor with processing time increasing 5-fold for a region including only twice as many SNPs. 4 DISCUSSION EDUT is useful under many biologically realistic conditions. Much of the utility of the method lies in the potential to detect switch errors in heterozygous individuals. Missed heterozygous SNPs represent a very significant source of error, particularly in highly heterozygous samples. Every site in every individual must be genotyped as either homozygous or heterozygous. We have used coalescent simulations to demonstrate that EDUT can be expected to flag a single base genotyping error with 450% probability, returning a higher priority score for known switch errors than for false positives. Because individual sequence reads tend to contribute switch errors at multiple adjacent SNPs, the probability of identifying individuals that are heterozygous at a locus is high. In combination with other 1812 available tools for genotyping and resolving phase of heterozygous individuals, EDUT can be used to dramatically improve the accuracy of haplotype data. 4.1 Evidence in support of the method’s utility Coalescent simulations (Hudson, 2002; Wiuf and Hein, 2000) demonstrate that a (the proportion of all triplets in pattern a) is small and fairly constant. The variance of a decreases as larger regions of sequence are considered. Small regions often do not include any sites in pattern a (Morrell et al., 2006). Figure S1 illustrates that for the Cbf3 locus, despite the lack of pattern a in the resolved haplotypes, the EDUT method assisted in resolving the phase of SNPs for two heterozygous genotypes because the original incorrect resequencing data included erroneous pattern a triplets. As described in the results, weighted priority scores tend to be higher for the known errors than for other flagged bases. Priority scores can inform user decisions about which SNP genotypes and/or haplotypes to double-check. Whether a user should check the complete list of flagged bases or a subset of the list depends on the likelihood of the presence of an error, the perceived cost of the effort of checking a site that is not an error, and the importance of discovering errors in the data set. 4.2 Errors EDUT is not likely to detect There are three general types of errors that EDUT is not likely to detect. These include errors occurring in data with no evidence of a recombination event, switch errors at a singleton SNP where the minority state occurs only once in the sample, and any error causing a haplotype to masquerade as one of the other observed haplotypes. When haplotypes result from computationally phased genotype data, an erroneous base call will rarely be detected if the correct genotype is homozygous rare. Table 2 indicates that detection occurred 10% of the time in a group of 100 simulated data sets and the mean priority score for an error is not greater than the priority scores for all flagged bases. Simulations also indicate that errors in computationally phased data are much less likely to be flagged than errors in directly obtained haplotypes, although computational phasing seems to have no effect on the sensitivity of the weighted priority scores (compare GE and SF simulation results in Table 2). An error at a minority allele is also less likely to be detected than an error at the majority allele. Also, very large samples present a challenge for error detection because even a 1% error rate results in eight errors for a sample size of 400 diploid individuals. The presence of numerous errors limits the utility of EDUT because the errors can erode the signal from the rare SNP arrangements that provide the basis for error detection. Thus the potential to detect an error is decreased with each additional error. 4.3 Correctly genotyping each site Accurate haplotype data is highly dependent on accurate genotype data. For resequencing data, there are a number of issues that complicate accurate genotyping. First, the discovery of all SNPs in a sample is difficult, because the rarest class of Error detection in SNP data SNPs may be found only in the heterozygous state. Very rare SNPs are difficult for SNP detection software to identify, because the only evidence of the SNP occurs as overlapping peaks in resequencing trace data. Segregating indel polymorphisms also limit the utility of SNP detection tools, such as Polyphred 5.03 (Stephens et al., 2006) and SNPdetector (Zhang et al., 2005) because heterozygous indels create PCR products from a single individual that differ in length, resulting in partially unreadable sequence traces. As an alternative to resequencing, haplotypes might be derived from a series of SNP genotypes. There are a variety of reasons why such genotyping assays may fail to yield accurate genotypes. Many SNP genotyping platforms rely on carrying out an initial multiplex PCR amplification of specific genomic regions, and use this as template for the subsequent genotyping assay. For individual assays, this amplification can fail, leading to missing data for an individual. More seriously, there might be allele specific amplification (due to a heterozygous site within the binding site for one of the PCR amplification oligos). This could result in a heterozygous SNP being incorrectly called homozygous. After PCR amplification, a potential problem can arise if the subsequent oligo binding step involves an undiscovered SNP. This can lead to a failure of the assay for one of the haplotypes such that a heterozygous query SNP will be called homozygous. The potential for undetected SNPs to disrupt an assay is likely to be dependent on sequence composition, and will be expected to vary with the genotyping platform and its implementation (Koboldt et al., 2006). Empirical data and simulation experiments indicate that EDUT is helpful for detecting occasional errors in SNP genotyping and less helpful for finding errors at SNPs with high rates of genotyping errors. Since missing data does not interfere with the method, it would be prudent to omit any data resulting from suspect assays. As with the methods presented by Lincoln and Lander (1992), EDUT lends itself to reexamination of unlikely results. Although the errors are detected through information about SNP arrangement within haplotypes, both resequencing and SNP genotype data verification require a review of the original SNP genotype call. 4.4 Determining the phase of the mutations Even with correctly genotyped data, computational methods may not always phase the genotype data with high confidence. Rare mutations are difficult to phase, because there is little prior information about the haplotypes with which they are associated. Also, current computational phasing methods generally assume that all recombination is due to crossover, and do not account for the potential contribution of homologous gene conversion to haplotype diversity (Lajoie and ElMabrouk, 2005). This is problematic, because the best available estimates suggest that gene conversion may contribute as much or more than crossover to total recombination (Morrell et al., 2006; Ptak et al., 2004). EDUT can be used in conjunction with computational and experimental phasing in the process of accurate haplotype inference and can be used with phasing methods as part of an iterative quality control procedure. 4.5 Importance of accurate haplotype inference Accurate haplotype data is important for many basic biological questions and practical applications. One fundamental biological question for which accurate haplotype data is essential is the estimation of the rate of homologous gene conversion from resequencing data (Ptak et al., 2004; Wall, 2004). Errors can result in large biases in the gene conversion rate estimate (Morrell et al., 2006; Ptak et al., 2004; Wall, 2004). Both whole genome scans and candidate gene association studies will benefit from accurate haplotype data. Haplotype associations with phenotypic variation of interest (association mapping) can facilitate the process of finding candidate genes and could lead to elucidating important structural and regulatory gene interactions. Incorrect haplotypes will not only decrease the effectiveness of whole genome scans but will also decrease the power of hypothesis testing in case-control studies. It has been demonstrated that genotyping errors can have a large impact on association studies. Knapp and Becker (2004) used a simulation approach to determine the impact of genotyping errors. Specifically, they measured the effect on Zhang’s haplotype-sharing transmission disequilibrium test (HS-TDT), a non-parametric test to determine if a haplotype transmission from parent to offspring is associated with a phenotype (Zhang et al., 2003). With only 1% of markers incorrectly genotyped, the error rate is increased from a nominal error rate of 0.05 to around 0.5 (Knapp and Becker, 2004). Clearly genotyping errors can have large effects on haplotype-based association tests. The method presented is a computationally tractable means of examining haplotype data. Through the use of three site haplotype arrangements, EDUT exhibits considerable utility for reducing the data to the portion most worthy of reexamination. Though the time and resources devoted to verifying haplotype base calls may be significant for large genotyping or resequencing projects, the improvement in data quality can be substantial. ACKNOWLEDGEMENTS The authors thank M. Brandt, C. Torres, L. DeRose, S. MacDonald and two anonymous reviewers for helpful comments on previous versions of the manuscript. This work was supported by National Science Foundation grant DEB-0129247. Conflict of Interest: none declared. REFERENCES Abecasis,G.R. et al. (2002) Merlin–rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet., 30, 97–101. Becker,T. et al. (2006) Identification of probable genotyping errors by consideration of haplotypes. Eur. J. Hum. Genet., 14, 450–458. Clark,A.G. (2004) The role of haplotypes in candidate gene studies. Genet. Epidemiol., 27, 321–333. Clark,A.G. et al. (1998) Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet., 63, 595–612. 1813 D.M.Toleno et al. Cronn,R. et al. (2002) PCR-mediated recombination in amplification products derived from polyploid cotton. Theor. Appl. Genet., 104, 482–489. Douglas,J.A. et al. (2000) A multipoint method for detecting genotyping errors and mutations in sibling-pair linkage data. Am. J. Hum. Genet., 66, 1287–1297. Gordon,D. et al. (1998) Consed: a graphical tool for sequence finishing. Genome Res., 8, 195–202. Halldórsson,B.V. (2004) A survey of computational methods for determining haplotypes. In: Istrail,S. et al. (eds) Computational Methods for SNPs and Haplotype Inference. Springer-Verlag, Berlin, New York. Heath,S.C. (1998) Generating consistent genotypic configurations for multiallelic loci and large complex pedigrees. Hum. Hered., 48, 1–11. Hudson,R.R. (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18, 337–338. Hudson,R.R. and Kaplan,N.L. (1985) Statistical properties of the number of recombination events in the history of a sample of DNA-sequences. Genetics, 111, 147–164. Kimura,M. (1969) The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61, 893–903. Knapp,M. and Becker,T. (2004) Impact of genotyping errors on type I error rate of the haplotype-sharing transmission/disequilibrium test (hs-tdt). Am. J. Hum. Genet., 74, 589–591. Koboldt,D.C. et al. (2006) Distribution of human SNPs and its effect on highthroughput genotyping. Hum. Mutat., 27, 249–254. Lajoie,M. and El-Mabrouk,N. (2005) Recovering haplotype structure through recombination and gene conversion. Bioinformatics, 21 (Suppl. 2), ii173–ii179. Lange,K. et al. (1988) Programs for pedigree analysis: MENDEL, FISHER, and dGENE. Genet. Epidemiol., 5, 471–472. Lee,J.E. et al. (2005) Gene SNPs and mutations in clinical genetic testing: haplotype-based testing and analysis. Mutat. Res. Fund. Mol. Mech. Mut., 573, 195–204. Li,J. and Jiang,T. (2005) Computing the minimum recombinant haplotype configuration from incomplete genotype data on a pedigree by integer linear programming. J. Comput. Biol., 12, 719–739. Lincoln,S.E. and Lander,E.S. (1992) Systematic detection of errors in geneticlinkage data. Genomics, 14, 604–610. Morrell,P.L. et al. (2005) Low levels of linkage disequilibrium in wild barley (Hordeum vulgare ssp spontaneum) despite high rates of self-fertilization. Proc. Natl Acad. Sci. USA, 102, 2442–2447. 1814 Morrell,P.L. et al. (2006) Estimating the contribution of mutation, recombination and gene conversion in the generation of haplotypic diversity. Genetics, 173, 1705–1723. O’Connell,J.R. and Weeks,D.E. (1998) PedCheck: a program for identification of genotype incompatibilities in linkage analysis. Am. J. Hum. Genet., 63, 259–266. Padhukasahasram,B. et al. (2004) Estimating the rate of gene conversion on human chromosome 21. Am. J. Hum. Genet., 75, 386–397. Padhukasahasram,B. et al. (2006) Estimating recombination rates from singlenucleotide polymorphisms using summary statistics. Genetics, 174, 1517–1528. Ptak,S.E. et al. (2004) Insights into recombination from patterns of linkage disequilibrium in humans. Genetics, 167, 387–397. Scheet,P. and Stephens,M. (2006) A fast and flexible statistical model for largescale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet., 78, 629–644. Sobel,E. et al. (2002) Detection and integration of genotyping errors in statistical genetics. Am. J. Hum. Genet., 70, 496–508. Stephens,M. and Scheet,P. (2005) Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet., 76, 449–462. Stephens,M. et al. (2001) A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet., 68, 978–989. Stephens,M. et al. (2006) Automating sequence-based detection and genotyping of SNPs from diploid samples. Nat. Genet., 38, 375–381. Thomas,A. (2005) GMCheck: Bayesian error checking for pedigree genotypes and phenotypes. Bioinformatics, 21, 3187–3188. Wall,J.D. (2004) Estimating recombination rates using three-site likelihoods. Genetics, 167, 1461–1473. Wall,J.D. et al. (2002) Testing models of selection and demography in Drosophila simulans. Genetics, 162, 203–216. Wiuf,C. and Hein,J. (2000) The coalescent with gene conversion. Genetics, 155, 451–462. Wright,S.I. and Gaut,B.S. (2005) Molecular population genetics and the search for adaptive evolution in plants. Mol. Biol. Evol., 22, 506–519. Zhang,J.H. et al. (2005) SNPdetector: a software tool for sensitive and accurate SNP detection. PLoS Comput. Biol., 1, 395–404. Zhang,S.L. et al. (2003) Transmission/disequilibrium test based on haplotype sharing for tightly linked markers. Am. J. Hum. Genet., 73, 566–579.
© Copyright 2025 Paperzz