Error detection in SNP data by considering the likelihood of

BIOINFORMATICS ORIGINAL PAPER
Vol. 23 no. 14 2007, pages 1807–1814
doi:10.1093/bioinformatics/btm260
Genetics and population analysis
Error detection in SNP data by considering the likelihood of
recombinational history implied by three-site combinations
Donna M. Toleno, Peter L. Morrell and Michael T. Clegg*
Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA
Received on December 13, 2006; revised on April 30, 2007; accepted on May 9, 2007
Advance Access publication May 17, 2007
Associate Editor: Martin Bishop
ABSTRACT
Motivation: Errors in nucleotide sequence and SNP genotyping data
are problematic when inferring haplotypes. Previously published
methods for error detection in haplotype data make use of pedigree
information; however, for many samples, individuals are not related
by pedigree. This article describes a method for detecting errors in
haplotypes by considering the recombinational history implied by the
patterns of variation, three SNPs at a time.
Results: Coalescent simulations provide evidence that the method
is robust to high levels of recombination as well as homologous gene
conversion, indicating that patterns produced by both proximate and
distant SNPs may be useful for detecting unlikely three-site
haplotypes.
Availability: The perl script implementing the described method is
called EDUT (Error Detection Using Triplets) and is available on
request from the authors.
Contact: [email protected]
Supplementary information: Supplementary data are available at
Bioinformatics online.
1
INTRODUCTION
Single nucleotide polymorphisms (SNPs) are the most
common form of genetic variation. The most rapid method of
collecting complete SNP data for a chromosomal segment
is through direct nucleotide sequencing of PCR-amplified
DNA fragments. SNP data can be used to calculate population-level diversity statistics even when the phase or arrangement of SNPs and insertion/deletion polymorphisms (indels) in
the diploid individual remains unknown. The phase of
mutations that make up parental haplotypes cannot be
determined through direct sequencing of PCR products, but
must be resolved through either computational or experimental
means (Clark et al., 1998). Correctly inferred haplotypes
increase power for inferences and predictions of disease
penetrance, drug responses, diagnostic test variation and risk
factors for complex diseases (Clark, 2004; Lee et al., 2005).
Accurate inference of haplotypes is also important for
population genetics hypothesis testing and parameter estimation. Haplotyping errors have been shown to affect estimation
*To whom correspondence should be addressed.
of any measure requiring knowledge of the arrangement of
SNP states, such as the population recombination rate and the
quantification of associations among polymorphism (Morrell
et al., 2006; Ptak et al., 2004; Wall, 2004).
Errors in resequencing data appear to be commonly
encountered and seldom reported. For example, Morrell et al.
(2005, 2006) report that standard sequencing practices failed
to identify a heterozygous individual in 6 out of 11 cases. The
impact of these errors on analyses varies, with the greatest
impact on estimated rates of recombination and homologous
gene conversion (Morrell et al., 2006).
The possible means for error introduction are numerous and
include problems related to both genotyping and phasing.
Many research projects require the resequencing of a chromosomal region in a large number of diploid individuals. The
potential for resequencing errors is high, in part because every
nucleotide site in every individual must be screened for
heterozygosity. Miscalled genotypes at heterozygous sites are
a major source of error (Morrell et al., 2006). The rarest SNP
variants are likely to be present in a heterozygous state,
therefore, the investigator may not have a priori knowledge of
the presence of the SNP during the genotyping process. Even
with highly trained technicians and the best possible quality
control measures implemented by sequence assembly software,
genotypes of individual SNPs are often called incorrectly in
directly sequenced diploid regions (Zhang et al., 2005).
Haplotype inference often depends on computational
phasing methods which require arranging SNPs into the most
likely configurations. Such methods rely heavily on accurate
interpretation of raw sequence trace data or SNP genotyping
results (Halldórsson et al., 2004). Experimental phasing
methods require either cloning of PCR products or the
amplification of one haplotype at a time with allele specific
PCR (AS-PCR). Inferring haplotypes with AS-PCR is an
alternative to computational phasing or labor-intensive cloning
methods (Clark et al., 1998). It is important to note that
obtaining the correct genotype at every site does not guarantee
accurate inference of haplotypes with any phasing method
(Lajoie and El-Mabrouk, 2005).
Screening for errors is desirable even when experimental
means are used to recover haplotypes in individuals heterozygous for a locus. Both cloning and AS-PCR involve multiple
PCR products and templates resulting in an error-prone
assembly process. The cloning process is particularly
ß The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
1807
D.M.Toleno et al.
error-prone. High rates of polymerase error make it difficult to
recover accurate sequence from each clone, forcing the use of
a majority rule approach to determine each haplotype. The
clones obtained from any given PCR reaction may contain
PCR chimeras (Cronn et al., 2002) and biases in the success rate
for cloning the two haplotypes (Morrell et al., 2006). Also,
AS-PCR can fail to prime specifically, resulting in genotyping
errors. Cloning and AS-PCR are also vulnerable to human
and software errors; errors may include sequence assembly
problems, sample mislabeling, misinterpretation of sequence
traces or sample contamination. While careful sequencing
protocols are likely to minimize errors, a tool to identify
a subset of unlikely base calls that require re-inspection will be
extremely valuable when high data quality is essential.
In the present article we present a method for error detection
in which haplotypes are examined three SNPs at a time.
We refer to the method as EDUT (error detection using
triplets) and we present and test a Perl script implementing the
method. EDUT is most sensitive to ‘switch errors’, SNPs
occurring on an incorrect haplotype background. Detecting
switch errors is difficult because the base calls that lead to
incorrect inference of haplotypes frequently occur at known
SNPs. In the absence of a method evaluating the likelihood
of SNP arrangements, an error can appear to be a likely base
call. Switch errors are often repeatable, and can result from
machine scored data. There is currently no routine procedure
for detecting this type of error. In a previous paper, the EDUT
method was used to identify switch errors in haplotype data
from a resequencing project (Morrell et al., 2006).
Researchers have avoided the problems associated with
direct sequencing of nuclear genes in diploid organisms by
focusing on mitochondrial and chloroplast loci not subject to
sequencing difficulties caused by heterozygosity, or on organisms where inbreeding or genetic manipulation result in
homozygosity at nuclear genes (Wright and Gaut, 2005).
As population studies are expanded to a wider range of diploid
organisms (including humans), a method with the broad
applicability demonstrated by EDUT will be extremely useful.
Published methods for detecting errors in sequencing and
phasing, particularly for data from humans, have made use of
the relatedness of haplotypes sampled in parents, progeny and
siblings (Abecasis et al., 2002; Becker et al., 2006; Douglas
et al., 2000; Heath, 1998; Lange et al., 1988; O’Connell and
Weeks, 1998; Sobel et al., 2002; Thomas, 2005). Because
haplotypes are passed from parents to progeny with a very
small probability of being altered by mutation or rearranged
by recombination, nucleotide sequence information from
related individuals with a known pedigree is useful for
imputation of missing data and haplotype reconstruction
(Li and Jiang, 2005), and for error detection (Abecasis et al.,
2002; Becker et al., 2006). In samples from most organisms,
family relationships are difficult or impossible to determine,
thus creating the need for a non-pedigree-based method for
error detection.
Evaluation of haplotype likelihood has been used for both
phase estimation and missing genotype imputation in samples
of unrelated individuals (Scheet and Stephens, 2006; Stephens
and Scheet, 2005). The detection of genotyping errors is
a natural extension of haplotype reconstruction methods;
1808
both efforts rely on the SNP associations created by mutation
and a finite amount of historical recombination.
EDUT can be applied to resequencing or SNP genotyping
data and any combination of computationally or experimentally phased haplotypes. The error detection procedure exploits
an important property of haplotype data; some haplotypes are
more likely than others. Assuming one mutation per site,
S SNPs can yield 2S 1 unique haplotype configurations, only a
subset of which are generally observed. EDUT flags a SNP
within a specific haplotype whenever an unusual haplotype
arrangement occurs for a SNP triplet. A flagged site will be
referred to as a ‘flagged base’ to emphasize the role of the
flagged site in a single haplotype. The phrase is not meant
to limit the discussion to resequencing data. EDUT identifies
a subset of base calls that create unlikely arrangements within
the inferred haplotype; thus it finds base calls likely to have
been inferred incorrectly. The subset consists of the base calls
contributing most directly to evidence for a double recombination event. In principle, the method is analogous to the error
detection approach of Lincoln and Lander (1992) routinely
used in genetic mapping; in practice, it is analogous to the
routine process of singleton SNP verification.
2
2.1
METHODS
The algorithm
EDUT processes population data by first identifying all of the SNPs in
the data set. The algorithm initially includes all sites, including those in
alignment gaps because such sites can include SNPs and may be
informative about the evolutionary history of the genomic region.
As illustrated in Figure 1, SNP data is converted to a binary format.
By convention, the majority state is coded as 0 and the minority state is
coded as 1. EDUT accepts data input in standard formats, including
aligned fasta files.
Under an infinite sites model, segregating sites will only be altered
by mutation one time in the history of the population (Kimura, 1969).
This assumption holds for the vast majority of SNPs within a sample
from a randomly mating population of a single species. For example,
52% of all SNPs in wild barley resequencing data of Morrell et al.
(2006) show evidence of multiple hits by mutation. When there are more
than 2 nucleotide states at a site, EDUT codes the majority state as a 0
and the two or three minor states as 1. Homoplasy events (multiple
origins of the same SNP state) are less likely than mutation to a unique
3rd state. When the data shows only minimal violations of the infinite
sites assumption, homoplasies have a negligible effect on the effectiveness of this algorithm and are therefore not considered during
processing.
All (((S)*(S 1))/2) pairs of SNPs are evaluated with the four-gamete
test. For any pair of nucleotide sites, only three configurations are
possible on the basis of unique mutations. The presence of all four
gametes indicates a historical recombination event between the pair of
sites (Hudson and Kaplan, 1985). EDUT creates a table storing the
results of the four-gamete test. In Figure 1, four gamete tests are
true for the site pairs labeled AB and BC. The number of sets of three
site combinations is denoted NT where NT ¼ (S*(S 1)*(S 2))/6.
All NT triplets are evaluated using stored four-gamete test results so
that ‘pattern a’ is identified when no recombination is inferred for the
outer pair of SNPs with all four arrangements present at each of the
SNP pairs involving the middle SNP (Fig. 1). For all pattern a triplets
detected, the rare types contributing to the two diagnostic four-gamete
tests are flagged and reported. Rare types are defined by default
Error detection in SNP data
Fig. 1. Three polymorphic sites are illustrated in a hypothetical
population of 30 chromosomes. (a) Curved arrows below the sequence
data represent evidence of recombination. The orthogonal lines above
the sequence data indicate that there is no recombination evident
between the outer sites. The sequence data is in pattern a as described in
Padhukasahasram et al. (2004) and Morrell et al. (2006). (b) Coding of
nucleotide states: polymorphic sites are coded into binary data.
By convention, 0 is assigned to the common state. (c) Counts of
each coded pair: The presence of all four two-site arrangements
indicates a recombination event between the sites. There is evidence of a
recombination event between the gray (B) and the black (A and C)
labeled sites as shown in the table. The lowest frequency site responsible
for defining pattern a is the most likely source of error in the data set as
indicated by boxes in (a and c).
as a single occurrence of a two-site arrangement. At times there may be
a tie for the rare type. Two pairs of states may each occur only once
causing two different individuals in the population to be critical to the
inference of the four gamete test and therefore to the pattern a triplet.
For each flagged base, a counter is maintained and incremented every
time the site is involved in a rare pair of a pattern a triplet. The ultimate
value of this counter for each base flagged is reported as the raw
priority score (P) in the summary output.
Any given SNP variant present in more than one of the samples is
considered informative. Each one of S informative SNPs participates in
a total of T triplets, where T ¼ ([S 1][S 2])/2. The quantity, a is
defined as the proportion of triplets in pattern a relative to the
total number of triplets. To arrive at an approximation of the number
of pattern a triplets expected for each site, T is multiplied by a.
An additional correction is made to account for SNP position; sites that
are closer to the middle of the series of informative SNPs are more
likely to be flagged twice in a single triplet. For the ith SNP this
correction is designated M, where M ¼ (i 1) (S i). M ¼ 0 when the
SNP is either the first or last in the region considered. The weighted
priority score is defined as PW, where PW ¼ P/(a [T þ M]).
2.2
Applying the method
The EDUT program was tested against resequencing data from wild
barley and avocado where all sequence traces were available. A total of
22 data sets of inferred haplotypes from 25 individuals for wild barley
and 41–53 individuals for avocado were used as input into EDUT. In
the majority of the data sets no sites were flagged, but in one case the
weighted scores drew attention to an assembly error. Inspection of
EDUT-flagged base calls in the trace data and consideration of peak
heights and quality scores in Consed (Gordon et al., 1998) aid in the
assessment of base call accuracy.
EDUT includes an option for processing data directly from the
coalescent simulation package, ms (Hudson, 2002). Simulations
permitted rigorous testing of EDUT (see Table 1). In each simulation
replicate, an error was introduced by changing the 10th site in the first
chromosome from one binary state to the other (1 changed to 0, 0
changed to 1).
The HU simulations were designed to resemble human resequencing
data. The GE (genotyping error) simulations are designed to represent
data as it is frequently collected, i.e. from diploid individuals where
the phase of each SNP is unknown. Paired haplotypes in the
simulations served as the simulated genotypes. A single error in a
single genotype was introduced at the 10th SNP in each replicate
simulation; pairs of the simulated chromosomes were subsequently
passed as input to Phase 2.1 (Stephens et al., 2001). The resulting
inferred haplotypes were used as input for EDUT. For 100 replicate
simulations, the process was repeated a total of three times so that the
single error was introduced into each of the three possible genotypes
(homozygous for the SNP state with major frequency, heterozygous or
homozygous for minor SNP state). LS (large sample) simulations
included an error at every 100th simulated chromosome in the
sample. This represents a low frequency but systematic occurrence of
haplotype error.
PS simulations were designed to include population structure.
Population structure is common in samples from multiple geographic
regions and in comparative studies of populations. Four populations
with 25 chromosomes per population were simulated with populations
exchanging 1 migrant per generation. The SF (SNP frequency)
simulations were designed to test the sensitivity of EDUT to various
allele frequencies and states. For each simulation, allele frequencies at
the SNP where the error was introduced were categorized as major or
minor and either ancestral or derived. We evaluate performance of
EDUT for all four combinations (Tables 1 and 2). Unlike the GE
simulations above, the SNP frequencies were tabulated after the
simulation analysis. For SF simulations, 10 000 replicates were used to
increase the incidence of extreme allele frequencies.
The GC simulations were designed to test for the effect of gene
conversion on the sensitivity of error detection. Simulation parameters
included the mutation parameter, ¼ 4Ne and the recombination
parameter, ¼ 4Ner where Ne is effective population size, is per base
rate of mutation and r is the per base pair rate of recombination.
The gene conversion parameter, f, was defined as the ratio of the rate of
gene conversion relative to the rate of crossover. The values of f used in
simulations were 0, 2, 5 and 10; spanning the range of values estimated
from empirical data (Padhukasahasram et al., 2006; Wall, 2004).
All data analysis, including graphics, was performed with the R
statistical environment and programming language (R Development
Core Team, 2005).
3
3.1
RESULTS
The EDUT method
The inspiration for the EDUT methodology resulted from
careful evaluation of resequencing data from wild barley
(Morrell et al., 2005, 2006). All sequences reported were
inferred from high quality sequence trace data with a minimum
quality criterion of a Phred score 20 on both forward and
reverse reads. As is the customary practice in population
genetics resequencing projects, singleton mutations were
verified with a new PCR amplification followed by sequencing
with coverage of the SNP on the forward and reverse strand.
While surveying the data for patterns indicating gene conversion, it was clear that typing errors at key SNP sites could be
responsible for producing a triplet pattern resembling the effect
of gene conversion. Figure 1 illustrates how EDUT detects
1809
D.M.Toleno et al.
Table 1. Descriptions of simulations used to test EDUT under a variety of circumstances
Simulation
Number of
chromosomes
Number of
SNPs
Recombination
(distance)
in units of kb
of human DNA
Purpose
Total number
of simulated
data sets
HU
100
50
Testing the effectiveness of EDUT
on simulated human data.
100
GE
100
Variable, mean ¼ 208 SNPs
(expected for 100 chromosomes
and 50 kb human DNA)
Fixed, 50 SNPs
100
100
LS
200
400
600
800
Fixed, 50 SNPs
100
PS
4 groups of 25
Fixed, 50 SNPs
100
Testing the effects of various types
of genotyping error: heterozygous
SNPs,
homozygous common,
homozygous rare. See Methods
section for details.
Testing the effects of a very large
sample. For one SNP an error was
introduced every 100th haplotype
for up to eight total errors at
one SNP.
Testing the effect of population
subdivision.
Simulated chromosomes are split
evenly into four subpopulations
with one migration per generation
Testing the effects of allele
frequency on error flagging and
priority scores. Exactly the same as
‘GE’ simulations.
Testing the effects of gene
conversion (f ¼ 0, 2, 5, 10) and
number of sampled haplotypes.
(25, 50, 100).
Error introduced into one
haplotype at one SNP,
and two SNPs.
SF
100
Fixed, 50 SNPs
100
GC
25, 50, 100
Variable, mean ¼ 20.9 SNPs
10
errors using three SNPs and haplotype frequencies typical for a
sample of 30 chromosomes. Pattern a, utilized in error
detection as illustrated in Figure 1, occurs at a consistently
low frequency under many biologically realistic circumstances
(See Fig. 2 above and discussion of Fig. 2 below).
3.2
100
10 000
100 each
sites. However, quality scores for these base calls were greater
than Phred 20, the quality criterion used in the assembly of the
data (Fig. S2). A partial sequence trace is shown in Figure S3.
These results demonstrate that inspection of sequence quality
can provide supporting evidence for a switch error in any
inferred haplotype, but searching for a low quality score is not
necessary or sufficient for detecting a sequencing error.
Analyses performed on empirical data
An example of the utility of EDUT on empirical data was
presented in Morrell et al. (2006). For example, in the Cbf3
locus (Morrell et al., 2005, 2006), EDUT assisted in the
detection and correction of four incorrect haplotypes.
Homozygosity was incorrectly assumed for 13 SNPs in one
sample and 5 SNPs in another. In each case, two switch errors
lead to the inference that base calls from these two haplotypes
were incorrectly combined. Table S1 (found online) includes the
example EDUT output from the wild barley Cbf3 locus.
Figure S1 demonstrates the resolution of haplotypes at Cbf3.
Inspection of the trace files associated with possible errors
flagged by EDUT lead to the identification of localized drops in
Phred quality scores for incorrectly genotyped heterozygous
1810
100 for each
sample size
3.3
Analyses performed on simulated data
As discussed above, pattern a occurs at low and relatively
consistent frequency under many biologically realistic circumstances (Wiuf and Hein, 2000). Figure 2 illustrates the
proportion of pattern a for groups of 100 simulations across
values ranging from 5 to 90. The total number of triplets
considered in the proportion includes SNPs for which the
minority state occurs at least twice in the sample. Singleton
SNPs, occurring only once in the sample, can never participate
in pattern a. Simulations represented in Figure 2 are based on /
¼ 1.5. At lower rates of recombination, holding / constant
causes the simulated data to have few SNPs and thus the SD for
Error detection in SNP data
Table 2. Results of the simulations summarized in Table 1
Simulation
Details
Percent of errors
flagged
Mean number
of sites in
flagged list
per simulation [SD]
HU
Equivalent to 50 kb human sequence
93
895 [238]
GE
Error at a heterozygous SNP
Error at a homozygous common SNP
Error at a homozygous rare SNP
44
41
10
85 [26.3]
87 [28]
81 [31]
LS
N ¼ 200
N ¼ 400
N ¼ 600
N ¼ 800
64
51
32
36
71
54
32
44
PS
Four subpopulations
SF
All SF simulations
Error at majority, ancestral base
Error at majority, derived base
Error at minority, ancestral base
Error at minority, derived base
GC
N ¼ 25, f ¼ 0
N ¼ 25, f ¼ 2
N ¼ 25, f ¼ 5
N ¼ 25, f ¼ 10
N ¼ 50, f ¼ 0
N ¼ 50, f ¼ 2
N ¼ 50, f ¼ 5
N ¼ 50, f ¼ 10
N ¼ 100, f ¼ 0
N ¼ 100, f ¼ 2
N ¼ 100, f ¼ 5
N ¼ 100, f ¼ 10
74
68
55
67
58
64
65
69
58
72
73
65
Sensitivity of the
priority score
3.3
1.3
1.6
0.14
[21]
[18]
[13]
[13]
0.9
0.4
0.19
0.14
68
138 [28]
1.98
76
87
66
42
27
94 [26]
-
1.3
0.57
1.4
0.17
0.21
9.6
12.0
14.8
14.5
11.1
12.8
14.8
15.7
8.6
15.3
18.1
10.2
[12.7]
[12.7]
[15.2]
[12.3]
[13.9]
[12.4]
[12.0]
[12.4]
[9.2]
[12.6]
[13.3]
[9.8]
2.5
1.7
1.4
1.4
1.4
1.4
1.1
1.1
2.1
1.6
1.6
1.3
The sensitivity indicated in the last column is defined as follows: [(mean error score)—(mean score for all results)]/(SD of all scores).
the proportion of triplets in pattern a is elevated. As the
simulated recombination rate increases, the mean of a
(proportion of all triplets in pattern a) remains relatively
constant and small (510%).
The results of EDUT with a variety of simulated data are
summarized with the rightmost three columns of Table 2. The
percentage of all replicate simulations flagging the known error
is reported. For the simulated human data, the known error is
flagged in 93% of simulations. In the SF simulations, the error
was flagged in 76% of simulations. Detection is decreased when
the errors are introduced into unphased genotypes as in the GE
simulations, where the error was flagged in 40% of replicate
simulations for the two genotype classes including at least one
majority SNP and only 10% of the time for the homozygous
minority SNP genotype. There is also a trend toward lower
detection levels when gene conversion is included and when
allowing more than one error at a SNP (LS simulations).
However, there is no clear impact of population structure on
error flagging (PS simulations).
The second important summary of the EDUT results is the
mean number of sites appearing in the flagged site lists.
A shorter list is more desirable because it will be easier to verify
Fig. 2. The results of coalescent simulations designed to test the impact
of recombination rate on the proportion of pattern a (A) in a data set.
A and 95% confidence intervals (defined by þ/ 1.96 standard
deviations) are plotted against the per locus indicated on the x-axis.
Simulations are based on / ¼ 1.5 and with a gene conversion
rate of f ¼ 5.
1811
D.M.Toleno et al.
the flagged genotypes in the raw data. The number of flagged
bases increases along with the number of SNPs in the data used
for input. There is a trade-off between the percentage of
errors flagged and the total number of sites flagged. For the
HU simulations, which represent 50 kb of human data, there
were an average of 208 SNPs per simulation (Table 1). This
resulted in 895 flagged bases or 4% of all the (208 100)
possible sites. The trade-off is more favorable for the smaller
data sets where the total sites flagged represents 51% of
the possible sites.
The third piece of information in Table 2 describes the ability
of the priority score to distinguish between true errors and the
background sites flagged. A sensitivity score is calculated in
which the mean score for the known errors is compared to the
priority scores for all sites. In all but one error category
in Table 2 the score was positive, indicating the difference is in
the expected direction, i.e. known errors have higher priority
scores. The magnitude of the difference is highest for the two
categories of simulation in which fixed and realistic ratios of /
were considered (HU and GC). Considering the standard
deviation (SD) of the overall score, the differences were 2 SDs
for the GC simulation with f ¼ 0 and 3.1 SDs for the simulated
human data. The GE, LS, PS and SF simulations are more
conservative tests of the method because they use a high
recombination rate with the number of SNPs fixed at 50 per
simulation.
3.4
Processing time
For computational efficiency, EDUT initially evaluates each
SNP pair one time to create a table of inferred recombination
events. The stored information for the SNP pairs is referenced
as each triplet is considered. The processing time therefore
increases exponentially with the number of SNPs and is much
less dependent on the depth of the sample. Small data sets are
processed quickly. For example, the Cbf3 locus discussed in the
text, which includes 26 chromosomes and 25 informative SNPs,
requires only 3 s of processing time on an AMD Opteron 275
(2.2 GHz) with 16 GB of RAM. A simulated human sequence
representing 50 kb of human data (labeled HU in Table 1)
was processed in 6 min on the same processor with processing
time increasing 5-fold for a region including only twice as
many SNPs.
4
DISCUSSION
EDUT is useful under many biologically realistic conditions.
Much of the utility of the method lies in the potential to detect
switch errors in heterozygous individuals. Missed heterozygous
SNPs represent a very significant source of error, particularly in
highly heterozygous samples. Every site in every individual
must be genotyped as either homozygous or heterozygous.
We have used coalescent simulations to demonstrate that
EDUT can be expected to flag a single base genotyping error
with 450% probability, returning a higher priority score for
known switch errors than for false positives. Because individual
sequence reads tend to contribute switch errors at multiple
adjacent SNPs, the probability of identifying individuals that
are heterozygous at a locus is high. In combination with other
1812
available tools for genotyping and resolving phase of heterozygous individuals, EDUT can be used to dramatically improve
the accuracy of haplotype data.
4.1
Evidence in support of the method’s utility
Coalescent simulations (Hudson, 2002; Wiuf and Hein, 2000)
demonstrate that a (the proportion of all triplets in pattern a) is
small and fairly constant. The variance of a decreases as larger
regions of sequence are considered. Small regions often do not
include any sites in pattern a (Morrell et al., 2006). Figure S1
illustrates that for the Cbf3 locus, despite the lack of pattern a
in the resolved haplotypes, the EDUT method assisted in
resolving the phase of SNPs for two heterozygous genotypes
because the original incorrect resequencing data included
erroneous pattern a triplets.
As described in the results, weighted priority scores tend to
be higher for the known errors than for other flagged bases.
Priority scores can inform user decisions about which SNP
genotypes and/or haplotypes to double-check. Whether a user
should check the complete list of flagged bases or a subset of
the list depends on the likelihood of the presence of an error,
the perceived cost of the effort of checking a site that is not an
error, and the importance of discovering errors in the data set.
4.2
Errors EDUT is not likely to detect
There are three general types of errors that EDUT is not likely
to detect. These include errors occurring in data with no
evidence of a recombination event, switch errors at a singleton
SNP where the minority state occurs only once in the sample,
and any error causing a haplotype to masquerade as one of the
other observed haplotypes.
When haplotypes result from computationally phased
genotype data, an erroneous base call will rarely be detected
if the correct genotype is homozygous rare. Table 2 indicates
that detection occurred 10% of the time in a group of 100
simulated data sets and the mean priority score for an error is
not greater than the priority scores for all flagged bases.
Simulations also indicate that errors in computationally
phased data are much less likely to be flagged than errors in
directly obtained haplotypes, although computational phasing
seems to have no effect on the sensitivity of the weighted
priority scores (compare GE and SF simulation results in
Table 2). An error at a minority allele is also less likely to be
detected than an error at the majority allele.
Also, very large samples present a challenge for error
detection because even a 1% error rate results in eight errors
for a sample size of 400 diploid individuals. The presence of
numerous errors limits the utility of EDUT because the errors
can erode the signal from the rare SNP arrangements that
provide the basis for error detection. Thus the potential to
detect an error is decreased with each additional error.
4.3
Correctly genotyping each site
Accurate haplotype data is highly dependent on accurate
genotype data. For resequencing data, there are a number of
issues that complicate accurate genotyping. First, the discovery
of all SNPs in a sample is difficult, because the rarest class of
Error detection in SNP data
SNPs may be found only in the heterozygous state. Very rare
SNPs are difficult for SNP detection software to identify,
because the only evidence of the SNP occurs as overlapping
peaks in resequencing trace data. Segregating indel polymorphisms also limit the utility of SNP detection tools, such as
Polyphred 5.03 (Stephens et al., 2006) and SNPdetector
(Zhang et al., 2005) because heterozygous indels create PCR
products from a single individual that differ in length, resulting
in partially unreadable sequence traces.
As an alternative to resequencing, haplotypes might be
derived from a series of SNP genotypes. There are a variety
of reasons why such genotyping assays may fail to yield
accurate genotypes. Many SNP genotyping platforms rely on
carrying out an initial multiplex PCR amplification of specific
genomic regions, and use this as template for the subsequent
genotyping assay. For individual assays, this amplification
can fail, leading to missing data for an individual. More
seriously, there might be allele specific amplification (due to
a heterozygous site within the binding site for one of the
PCR amplification oligos). This could result in a heterozygous
SNP being incorrectly called homozygous. After PCR amplification, a potential problem can arise if the subsequent oligo
binding step involves an undiscovered SNP. This can lead to a
failure of the assay for one of the haplotypes such that a
heterozygous query SNP will be called homozygous. The
potential for undetected SNPs to disrupt an assay is likely to be
dependent on sequence composition, and will be expected to
vary with the genotyping platform and its implementation
(Koboldt et al., 2006).
Empirical data and simulation experiments indicate that
EDUT is helpful for detecting occasional errors in SNP
genotyping and less helpful for finding errors at SNPs with
high rates of genotyping errors. Since missing data does not
interfere with the method, it would be prudent to omit any data
resulting from suspect assays.
As with the methods presented by Lincoln and Lander
(1992), EDUT lends itself to reexamination of unlikely results.
Although the errors are detected through information
about SNP arrangement within haplotypes, both resequencing
and SNP genotype data verification require a review of the
original SNP genotype call.
4.4
Determining the phase of the mutations
Even with correctly genotyped data, computational methods
may not always phase the genotype data with high confidence.
Rare mutations are difficult to phase, because there is little
prior information about the haplotypes with which they are
associated. Also, current computational phasing methods
generally assume that all recombination is due to crossover,
and do not account for the potential contribution of homologous gene conversion to haplotype diversity (Lajoie and ElMabrouk, 2005). This is problematic, because the best available
estimates suggest that gene conversion may contribute as much
or more than crossover to total recombination (Morrell et al.,
2006; Ptak et al., 2004). EDUT can be used in conjunction with
computational and experimental phasing in the process of
accurate haplotype inference and can be used with phasing
methods as part of an iterative quality control procedure.
4.5
Importance of accurate haplotype inference
Accurate haplotype data is important for many basic biological
questions and practical applications. One fundamental biological question for which accurate haplotype data is essential is
the estimation of the rate of homologous gene conversion from
resequencing data (Ptak et al., 2004; Wall, 2004). Errors can
result in large biases in the gene conversion rate estimate
(Morrell et al., 2006; Ptak et al., 2004; Wall, 2004).
Both whole genome scans and candidate gene association
studies will benefit from accurate haplotype data. Haplotype
associations with phenotypic variation of interest (association
mapping) can facilitate the process of finding candidate genes
and could lead to elucidating important structural and
regulatory gene interactions. Incorrect haplotypes will not
only decrease the effectiveness of whole genome scans but will
also decrease the power of hypothesis testing in case-control
studies.
It has been demonstrated that genotyping errors can have
a large impact on association studies. Knapp and Becker (2004)
used a simulation approach to determine the impact of genotyping errors. Specifically, they measured the effect on Zhang’s
haplotype-sharing transmission disequilibrium test (HS-TDT),
a non-parametric test to determine if a haplotype transmission
from parent to offspring is associated with a phenotype (Zhang
et al., 2003). With only 1% of markers incorrectly genotyped,
the error rate is increased from a nominal error rate of 0.05
to around 0.5 (Knapp and Becker, 2004). Clearly genotyping
errors can have large effects on haplotype-based association
tests.
The method presented is a computationally tractable means
of examining haplotype data. Through the use of three site
haplotype arrangements, EDUT exhibits considerable utility
for reducing the data to the portion most worthy of
reexamination. Though the time and resources devoted to
verifying haplotype base calls may be significant for large
genotyping or resequencing projects, the improvement in data
quality can be substantial.
ACKNOWLEDGEMENTS
The authors thank M. Brandt, C. Torres, L. DeRose,
S. MacDonald and two anonymous reviewers for helpful
comments on previous versions of the manuscript. This work
was supported by National Science Foundation grant
DEB-0129247.
Conflict of Interest: none declared.
REFERENCES
Abecasis,G.R. et al. (2002) Merlin–rapid analysis of dense genetic maps using
sparse gene flow trees. Nat. Genet., 30, 97–101.
Becker,T. et al. (2006) Identification of probable genotyping errors by
consideration of haplotypes. Eur. J. Hum. Genet., 14, 450–458.
Clark,A.G. (2004) The role of haplotypes in candidate gene studies. Genet.
Epidemiol., 27, 321–333.
Clark,A.G. et al. (1998) Haplotype structure and population genetic inferences
from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum.
Genet., 63, 595–612.
1813
D.M.Toleno et al.
Cronn,R. et al. (2002) PCR-mediated recombination in amplification products
derived from polyploid cotton. Theor. Appl. Genet., 104, 482–489.
Douglas,J.A. et al. (2000) A multipoint method for detecting genotyping errors
and mutations in sibling-pair linkage data. Am. J. Hum. Genet., 66,
1287–1297.
Gordon,D. et al. (1998) Consed: a graphical tool for sequence finishing. Genome
Res., 8, 195–202.
Halldórsson,B.V. (2004) A survey of computational methods for determining
haplotypes. In: Istrail,S. et al. (eds) Computational Methods for SNPs and
Haplotype Inference. Springer-Verlag, Berlin, New York.
Heath,S.C. (1998) Generating consistent genotypic configurations for multiallelic loci and large complex pedigrees. Hum. Hered., 48, 1–11.
Hudson,R.R. (2002) Generating samples under a Wright-Fisher neutral model of
genetic variation. Bioinformatics, 18, 337–338.
Hudson,R.R. and Kaplan,N.L. (1985) Statistical properties of the number of
recombination events in the history of a sample of DNA-sequences. Genetics,
111, 147–164.
Kimura,M. (1969) The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61,
893–903.
Knapp,M. and Becker,T. (2004) Impact of genotyping errors on type I error rate
of the haplotype-sharing transmission/disequilibrium test (hs-tdt).
Am. J. Hum. Genet., 74, 589–591.
Koboldt,D.C. et al. (2006) Distribution of human SNPs and its effect on highthroughput genotyping. Hum. Mutat., 27, 249–254.
Lajoie,M. and El-Mabrouk,N. (2005) Recovering haplotype structure through
recombination and gene conversion. Bioinformatics, 21 (Suppl. 2), ii173–ii179.
Lange,K. et al. (1988) Programs for pedigree analysis: MENDEL, FISHER, and
dGENE. Genet. Epidemiol., 5, 471–472.
Lee,J.E. et al. (2005) Gene SNPs and mutations in clinical genetic testing:
haplotype-based testing and analysis. Mutat. Res. Fund. Mol. Mech. Mut.,
573, 195–204.
Li,J. and Jiang,T. (2005) Computing the minimum recombinant haplotype
configuration from incomplete genotype data on a pedigree by integer linear
programming. J. Comput. Biol., 12, 719–739.
Lincoln,S.E. and Lander,E.S. (1992) Systematic detection of errors in geneticlinkage data. Genomics, 14, 604–610.
Morrell,P.L. et al. (2005) Low levels of linkage disequilibrium in wild barley
(Hordeum vulgare ssp spontaneum) despite high rates of self-fertilization.
Proc. Natl Acad. Sci. USA, 102, 2442–2447.
1814
Morrell,P.L. et al. (2006) Estimating the contribution of mutation, recombination
and gene conversion in the generation of haplotypic diversity. Genetics, 173,
1705–1723.
O’Connell,J.R. and Weeks,D.E. (1998) PedCheck: a program for identification of
genotype incompatibilities in linkage analysis. Am. J. Hum. Genet., 63,
259–266.
Padhukasahasram,B. et al. (2004) Estimating the rate of gene conversion on
human chromosome 21. Am. J. Hum. Genet., 75, 386–397.
Padhukasahasram,B. et al. (2006) Estimating recombination rates from singlenucleotide polymorphisms using summary statistics. Genetics, 174,
1517–1528.
Ptak,S.E. et al. (2004) Insights into recombination from patterns of linkage
disequilibrium in humans. Genetics, 167, 387–397.
Scheet,P. and Stephens,M. (2006) A fast and flexible statistical model for largescale population genotype data: applications to inferring missing genotypes
and haplotypic phase. Am. J. Hum. Genet., 78, 629–644.
Sobel,E. et al. (2002) Detection and integration of genotyping errors in statistical
genetics. Am. J. Hum. Genet., 70, 496–508.
Stephens,M. and Scheet,P. (2005) Accounting for decay of linkage disequilibrium
in haplotype inference and missing-data imputation. Am. J. Hum. Genet., 76,
449–462.
Stephens,M. et al. (2001) A new statistical method for haplotype reconstruction
from population data. Am. J. Hum. Genet., 68, 978–989.
Stephens,M. et al. (2006) Automating sequence-based detection and genotyping
of SNPs from diploid samples. Nat. Genet., 38, 375–381.
Thomas,A. (2005) GMCheck: Bayesian error checking for pedigree genotypes
and phenotypes. Bioinformatics, 21, 3187–3188.
Wall,J.D. (2004) Estimating recombination rates using three-site likelihoods.
Genetics, 167, 1461–1473.
Wall,J.D. et al. (2002) Testing models of selection and demography in Drosophila
simulans. Genetics, 162, 203–216.
Wiuf,C. and Hein,J. (2000) The coalescent with gene conversion. Genetics, 155,
451–462.
Wright,S.I. and Gaut,B.S. (2005) Molecular population genetics and the search
for adaptive evolution in plants. Mol. Biol. Evol., 22, 506–519.
Zhang,J.H. et al. (2005) SNPdetector: a software tool for sensitive and accurate
SNP detection. PLoS Comput. Biol., 1, 395–404.
Zhang,S.L. et al. (2003) Transmission/disequilibrium test based on haplotype
sharing for tightly linked markers. Am. J. Hum. Genet., 73, 566–579.