
Supporting information for “Error analysis of
idealized nanopore sequencing”
Christopher R. O’Donnell 1 Hongyun Wang 2
William B. Dunbar 1*
1 Identifiability of DNA sequences from ionic current amplitude
If an enzyme on the pore controls the ssDNA passing through the pore, the
ssDNA moves in 1 nt steps, with the dwell time of each ssDNA position being
exponentially distributed and with step transitions that are instantaneous compared
to the measurement bandwidth [1,2]. An appropriate idealization for the signal is
a pulse-train, defined by a set of 𝑀 amplitudes and a sequence of measured
dwell times. From the single-channel recording and analysis literature [3-5], there
is a set of techniques that can be applied to estimate the pulse-train idealization
from the noisy recorded data. For sequencing, the pulse train would be
compared to a library of amplitudes identified through control experiments with
known sequences. In this section, we consider challenges associated with having
a limited number of distinct amplitude levels in the idealization.
If 𝑛 nucleotides affect the ionic current, then 𝑀 = 4^𝑛 amplitude levels are sufficient to unambiguously identify the sequence. For the purpose of synthesizing
the idealization from noisy data, each amplitude level must have a signal-to-noise
ratio (S/N) of at least 2 for idealization by half-amplitude methods, or at least 1.5
by Markov-based methods [4]. For 𝑛 ≥ 3, as in the case of the MspA nanopore
[2], achieving 𝑀 = 4^𝑛 amplitude levels with sufficient S/N may not be possible.
We consider specific examples in which having fewer than 4^𝑛 amplitude levels
makes it impossible to unambiguously identify the sequence. We construct examples for 𝑛 = 2,3,4, assuming sequences are identified right-to-left as they pass
through the pore. Thus, 𝐴𝐺𝐶𝑇𝑇𝐴𝐺 with 𝑛 = 4 would be identified as 𝑇𝑇𝐴𝐺, then
𝐶𝑇𝑇𝐴, etc. We refer to 𝑀 “distinct” amplitude levels when each level has sufficient S/N for detection. The case of 𝑛 = 1 nucleotide affecting the current amplitude is the simplest and is considered first.
______________________________
1 Department of Computer Engineering, Baskin School of Engineering, University of California, Santa Cruz, CA 95064, USA
2 Department of Applied Math and Statistics, Baskin School of Engineering, University of California, Santa Cruz, CA 95064, USA
* Correspondence: Associate Professor William B. Dunbar, Department of Computer Engineering, Baskin School of Engineering, University of California, 1156 High Street, MS:SOE3, Santa Cruz, CA 95064, Fax: (831) 459-4829, email: [email protected]
Proposition 1. If 𝑛 = 1, then 𝑀 = 4^𝑛 = 4 distinct amplitude levels are necessary
to unambiguously identify the sequence.
Proof. This follows trivially. Suppose 𝑛 = 1 and 𝑀 = 3. Then two of the four nucleotides generate the same current amplitude, and there is no way to determine
which of these two nucleotides is present in the sensing region using current
amplitude alone. If 𝑀 = 2, then at most one nucleotide is identifiable; if 𝑀 = 1, none are.
☐
Next, for the case 𝑛 = 2, we construct a case in which having fewer than 4^𝑛 =
16 distinct amplitude levels makes it impossible to resolve all sequences.
Proposition 2. Suppose 𝑛 = 2 and there are 𝑀 < 4^𝑛 = 16 distinct amplitude levels. Let 𝑋, 𝑌 ∈ {𝐴, 𝑇, 𝐺, 𝐶} and 𝑋 ≠ 𝑌. If the three pairs in the set {𝑋𝑋, 𝑋𝑌, 𝑌𝑋}
generate the same amplitude 𝐼1 , then there are an infinite number of sequences
that cannot be identified from the pulse-train. In particular, no subsequence
𝑍1 ⋯ 𝑍𝑚 can be identified within the sequence 𝑋𝑍1 ⋯ 𝑍𝑚 𝑋, provided 𝑍𝑖 ∈ {𝑋, 𝑌} for
𝑖 = 1, . . . , 𝑚 (𝑚 ≥ 1) and each 𝑌 is separated by one or more 𝑋s.
Proof. Without loss of generality, let 𝑋 = 𝐶 and 𝑌 = 𝐴, and assume the pairs in
the set {𝐶𝐶, 𝐶𝐴, 𝐴𝐶} generate the same amplitude 𝐼1 . Then the nucleotide 𝑍1 ∈
{𝐴, 𝐶} within the triple 𝐶𝑍1 𝐶 cannot be identified. We can show this by considering
an example sub-sequence 𝑆1 = 𝑇𝐶𝐴𝐶𝐺 to be identified. The amplitude that registers 𝐶𝐺 (assumed to be identifiable) can be used to choose 𝑦𝐶𝐺 upon detecting
𝐼1 , with 𝑦 = 𝐴 or 𝐶. After the second 𝐼1 is detected (assuming a tracking counter
is enabled) we have 𝑥𝑦𝐶𝐺 with 𝑥𝑦 ∈ {𝐶𝐶, 𝐶𝐴, 𝐴𝐶}. Next, 𝑇𝐶 is detected (assuming
it is identifiable), which constrains the value of 𝑥 = 𝐶. The value for 𝑦, however,
cannot be resolved. Additionally, any subsequence constructed from 𝐴 and 𝐶 and
nested within 𝐶 ⋯ 𝐶 cannot be identified, provided each 𝐴 is flanked by 𝐶s. An
example is the interior subsequence 𝐶𝐶𝐶 within 𝐶𝐶𝐶𝐶𝐶, which is indistinguishable from
the interior subsequences 𝐶𝐴𝐶, 𝐴𝐶𝐴, 𝐴𝐶𝐶, and 𝐶𝐶𝐴 within 𝐶𝐶𝐴𝐶𝐶, 𝐶𝐴𝐶𝐴𝐶, 𝐶𝐴𝐶𝐶𝐶, and 𝐶𝐶𝐶𝐴𝐶, respectively. Also,
the longer the nested subsequence, the larger the set of subsequences that are
indistinguishable.
☐
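The construction in Proposition 2 can be checked with a short Python sketch. It assumes a noiseless sensor with 𝑛 = 2, ignores dwell times, and labels every pair outside the merged set with its own distinct level; the function name `pair_levels` and the 𝑇/𝐺 flanks are illustrative choices, not part of the original analysis.

```python
def pair_levels(seq, merged, merged_level="I1"):
    """Noiseless level sequence for a sensor that reads n = 2 nucleotides.

    The sequence is read right-to-left, as in the text. Pairs in `merged`
    share the level `merged_level`; every other pair is assumed (for this
    sketch only) to have its own distinct level, labelled by the pair.
    """
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [merged_level if p in merged else p for p in reversed(pairs)]


merged = {"CC", "CA", "AC"}  # the set {XX, XY, YX} with X = C, Y = A

# T and G flanks make the outermost pairs (TC and CG) identifiable.
candidates = ["TCCCCCG", "TCCACCG", "TCACACG", "TCACCCG", "TCCCACG"]
trains = {s: pair_levels(s, merged) for s in candidates}

for s, levels in trains.items():
    print(s, levels)

# A sequence containing the pair AA falls outside the merged set,
# so its level sequence differs from the five above.
print("TCAACCG", pair_levels("TCAACCG", merged))
```

All five candidates produce the same level sequence, so the three interior nucleotides cannot be recovered from amplitude alone.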
If one adds 𝑌𝑌 to the set in Proposition 2, the result is a larger number of unidentifiable subsequences.
Proposition 3. Suppose 𝑛 = 2 and there are 𝑀 < 4^𝑛 = 16 distinct amplitude levels. Let 𝑋, 𝑌 ∈ {𝐴, 𝑇, 𝐺, 𝐶} and 𝑋 ≠ 𝑌. If the four pairs in the set {𝑋𝑋, 𝑋𝑌, 𝑌𝑋, 𝑌𝑌}
generate the same amplitude 𝐼1 , then there are an infinite number of sequences
that cannot be identified from the pulse-train. In particular, no subsequence
𝑍1 ⋯ 𝑍𝑚 within the sequence 𝑍0 𝑍1 ⋯ 𝑍𝑚 𝑍𝑚+1 can be identified, with 𝑍𝑖 ∈ {𝑋, 𝑌} for
𝑖 = 0, . . . , 𝑚 + 1 (𝑚 ≥ 1).
Proof. The proof follows the same logic as in the proof of Proposition 2, with the
unidentifiable subsequence being nested within 𝑋 ⋯ 𝑋 , 𝑌 ⋯ 𝑋 , 𝑋 ⋯ 𝑌 or 𝑌 ⋯ 𝑌 .
☐
To see the increase in the number of sequences that cannot be resolved, let
𝑋 = 𝐴 and 𝑌 = 𝐶 and assume {𝐶𝐶, 𝐶𝐴, 𝐴𝐶, 𝐴𝐴} generate the same amplitude.
Then the nucleotide 𝑍1 ∈ {𝐴, 𝐶} cannot be identified within any of the triples:
𝐶𝑍1 𝐶 , 𝐶𝑍1 𝐴, 𝐴𝑍1 𝐶 , or 𝐴𝑍1 𝐴. There is also a greater number of subsequence
permutations that cannot be resolved for a given subsequence length 𝑚 > 1. As an
example, again with 𝑋 = 𝐴 and 𝑌 = 𝐶 and assuming {𝐶𝐶, 𝐶𝐴, 𝐴𝐶, 𝐴𝐴} generate
the same amplitude, the subsequence 𝐶𝐶𝐶𝐶 with 𝑚 = 4 is not identifiable
from within 𝐶𝐶𝐶𝐶𝐶𝐶, 𝐶𝐶𝐶𝐶𝐶𝐴, 𝐴𝐶𝐶𝐶𝐶𝐶 or 𝐴𝐶𝐶𝐶𝐶𝐴. Moreover, all 2^𝑚 = 16 four-letter combinations of 𝐴 and 𝐶 are indistinguishable from 𝐶𝐶𝐶𝐶.
We consider next 𝑛 = 3, which approaches the sensitivity of the biological
pore MspA [2] and matches the claimed sensitivity of the nanopores developed
by Oxford Nanopore Technologies. It is unlikely that all 𝑀 = 4^3 = 64 distinct amplitudes are available for idealization.
Proposition 4. Suppose 𝑛 = 3 and there are 𝑀 < 4^𝑛 = 64 distinct amplitude levels. Let 𝑋, 𝑌 ∈ {𝐴, 𝑇, 𝐺, 𝐶} and 𝑋 ≠ 𝑌. If the four triples in the set
{𝑋𝑋𝑋, 𝑋𝑋𝑌, 𝑋𝑌𝑋, 𝑌𝑋𝑋} generate the same amplitude 𝐼1 , then there are an infinite
number of sequences that cannot be identified from the pulse-train. In particular,
no subsequence 𝑍1 ⋯ 𝑍𝑚 can be identified within the sequence 𝑋𝑋𝑍1 ⋯ 𝑍𝑚 𝑋𝑋,
provided 𝑍𝑖 ∈ {𝑋, 𝑌} for 𝑖 = 1, . . . , 𝑚 (𝑚 ≥ 1) and each 𝑌 is separated by two or
more 𝑋s.
Proof. Without loss of generality, let 𝑋 = 𝐶 and 𝑌 = 𝐴, and assume the elements
in the set {𝐶𝐶𝐶, 𝐶𝐶𝐴, 𝐶𝐴𝐶, 𝐴𝐶𝐶} generate the same amplitude 𝐼1 . Then 𝐴 and 𝐶
are indistinguishable within the sequences 𝐶𝐶𝐴𝐶𝐶 and 𝐶𝐶𝐶𝐶𝐶, respectively. We
can show this by considering an example sub-sequence 𝑆2 = 𝑇𝐶𝐶𝐴𝐶𝐶𝐺 to be
identified. The amplitude that registers 𝐶𝐶𝐺 (assumed to be identifiable) can be
used to choose 𝑧𝐶𝐶𝐺 upon detecting 𝐼1 , with 𝑧 = 𝐴 or 𝐶. After the second 𝐼1 is
detected (assuming a tracking counter is enabled) we have 𝑦𝑧𝐶𝐶𝐺 with 𝑦𝑧 ∈
{𝐴𝐶, 𝐶𝐴, 𝐶𝐶} . After the third 𝐼1 is detected, we have 𝑥𝑦𝑧𝐶𝐶𝐺 with 𝑥𝑦𝑧 ∈
{𝐶𝐶𝐶, 𝐶𝐶𝐴, 𝐶𝐴𝐶, 𝐴𝐶𝐶}. Next, 𝑇𝐶𝐶 is detected (assuming it is identifiable), which
constrains the value of 𝑥𝑦 = 𝐶𝐶. The value for 𝑧 cannot be resolved. Following
the generalization for this example, it is straightforward to show that 𝐶𝐶, 𝐴𝐶 and
𝐶𝐴 are indistinguishable within 𝐶𝐶𝐶𝐶𝐶𝐶, 𝐶𝐶𝐴𝐶𝐶𝐶 and 𝐶𝐶𝐶𝐴𝐶𝐶, respectively. The
longer the nested subsequence, the larger the set of subsequences that are indistinguishable.
☐
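The separation condition in Propositions 2 and 4 (and its 𝑛 = 4 analogue) can be checked by brute force. The Python sketch below assumes the merged set consists of the all-𝑋 word plus every length-𝑛 word containing exactly one 𝑌, and that the interior is flanked by 𝑛 − 1 copies of 𝑋 on each side; the helper names are hypothetical.

```python
from itertools import product


def merged_kmers(n, x="C", y="A"):
    """All-X n-mer plus every n-mer with exactly one Y."""
    kmers = {x * n}
    for i in range(n):
        kmers.add(x * i + y + x * (n - i - 1))
    return kmers


def ambiguous_interiors(n, m, x="C", y="A"):
    """Interiors Z_1..Z_m whose every length-n window lies in the merged set,
    so the whole flanked stretch registers only the shared amplitude I1."""
    merged = merged_kmers(n, x, y)
    flank = x * (n - 1)
    found = []
    for letters in product(x + y, repeat=m):
        interior = "".join(letters)
        seq = flank + interior + flank
        windows = (seq[i:i + n] for i in range(len(seq) - n + 1))
        if all(w in merged for w in windows):
            found.append(interior)
    return found


def well_separated(interior, n, y="A"):
    """True when consecutive Ys have at least n - 1 Xs between them."""
    pos = [i for i, c in enumerate(interior) if c == y]
    return all(b - a >= n for a, b in zip(pos, pos[1:]))


print(ambiguous_interiors(3, 2))       # interiors of length 2 for n = 3
print(len(ambiguous_interiors(2, 3)))  # count of interiors of length 3 for n = 2
```

Under these assumptions, the enumerated interiors coincide with the interiors in which each 𝑌 is separated by 𝑛 − 1 or more 𝑋s, and their number grows with 𝑚.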
Proposition 5. Suppose 𝑛 = 4 and there are 𝑀 < 4^𝑛 = 256 distinct amplitude
levels. Let 𝑋, 𝑌 ∈ {𝐴, 𝑇, 𝐺, 𝐶} and 𝑋 ≠ 𝑌. If the five elements in the set
{𝑋𝑋𝑋𝑋, 𝑋𝑋𝑋𝑌, 𝑋𝑋𝑌𝑋, 𝑋𝑌𝑋𝑋, 𝑌𝑋𝑋𝑋}
generate the same amplitude 𝐼1 , then there are an infinite number of sequences
that cannot be identified from the pulse-train. In particular, no subsequence
𝑍1 ⋯ 𝑍𝑚 can be identified within the sequence 𝑋𝑋𝑋𝑍1 ⋯ 𝑍𝑚 𝑋𝑋𝑋, provided 𝑍𝑖 ∈
{𝑋, 𝑌} for 𝑖 = 1, . . . , 𝑚 (𝑚 ≥ 1) and each 𝑌 is separated by three or more 𝑋s.
Proof. The proof follows the same logic as the proofs for Propositions (2-4).
Without loss of generality, let 𝑋 = 𝐶 and 𝑌 = 𝐴, and assume the elements in the
set
{𝐶𝐶𝐶𝐶, 𝐶𝐶𝐶𝐴, 𝐶𝐶𝐴𝐶, 𝐶𝐴𝐶𝐶, 𝐴𝐶𝐶𝐶}
generate the same amplitude 𝐼1 . Then the central 𝐴 and 𝐶 are indistinguishable within the sequences 𝐶𝐶𝐶𝐴𝐶𝐶𝐶 and 𝐶𝐶𝐶𝐶𝐶𝐶𝐶, respectively. We can
show this by considering an example sub-sequence 𝑆3 = 𝑇𝐶𝐶𝐶𝐴𝐶𝐶𝐶𝐺 to be identified. The amplitude that registers 𝐶𝐶𝐶𝐺 (assumed to be identifiable) can be
used to choose 𝑧𝐶𝐶𝐶𝐺 upon detecting 𝐼1 , with 𝑧 = 𝐴 or 𝐶. After the second 𝐼1 is
detected (assuming a tracking counter is enabled) we have 𝑦𝑧𝐶𝐶𝐶𝐺 with 𝑦𝑧 ∈
{𝐴𝐶, 𝐶𝐴, 𝐶𝐶}. After the third 𝐼1 is detected, we have 𝑥𝑦𝑧𝐶𝐶𝐶𝐺 with
𝑥𝑦𝑧 ∈ {𝐶𝐶𝐶, 𝐶𝐶𝐴, 𝐶𝐴𝐶, 𝐴𝐶𝐶}. After the fourth 𝐼1 is detected, we have 𝑤𝑥𝑦𝑧𝐶𝐶𝐶𝐺 with
𝑤𝑥𝑦𝑧 ∈ {𝐶𝐶𝐶𝐶, 𝐶𝐶𝐶𝐴, 𝐶𝐶𝐴𝐶, 𝐶𝐴𝐶𝐶, 𝐴𝐶𝐶𝐶}.
Next, 𝑇𝐶𝐶𝐶 is detected (assuming it is identifiable), which constrains the value of
𝑤𝑥𝑦 = 𝐶𝐶𝐶. The value for 𝑧 cannot be resolved. Following the generalization for
this example, it is straightforward to show that 𝐶𝐶, 𝐴𝐶 and 𝐶𝐴 are indistinguishable within 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶, 𝐶𝐶𝐶𝐴𝐶𝐶𝐶𝐶 and 𝐶𝐶𝐶𝐶𝐴𝐶𝐶𝐶, respectively. The longer the
nested subsequence, the larger the set of subsequences that are indistinguishable.
☐
The results in Propositions (2-5) show that there may be sequences that cannot be identified by amplitude level classification. Moreover, the examples do not
cover all possible cases where identifiability is lost; they show only the existence
of cases where identifiability is lost. All cases should be enumerated as part of
efforts to sequence based on ionic current. Additionally, the cases shown are not
unreasonable, in the sense that such sequences might be expected to have a
common amplitude, particularly for 𝑛 = 3,4 . Until control experiments reveal
which sequences cannot be robustly separated by distinct amplitudes, and for
what 𝑛 value(s), it is not clear if the distinct amplitude levels that register in the
ionic current will be sufficient to identify intact ssDNA sequences.
2 Optimal binning scheme
This section defines how we choose the bins in assigning an estimated sequence
length for each set of amplitude level durations that would be collected during an
experiment. Let 𝑇 be an exponentially distributed random variable with mean 𝜏.
The probability density function (pdf) of an exponential distribution is
$$\rho_T(t) = \frac{1}{\tau} \exp\left(\frac{-t}{\tau}\right)$$
The sum of 𝑘 independent samples of 𝑇 has a Gamma distribution, with random
variable 𝑆 and pdf denoted
$$S = T_1 + T_2 + \cdots + T_k$$
$$\rho_S(s, k) = \frac{1}{\Gamma(k)\,\tau} \left(\frac{s}{\tau}\right)^{k-1} \exp\left(\frac{-s}{\tau}\right)$$
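As a numerical sanity check on the two densities above, the following Python sketch evaluates the Gamma pdf in log space and integrates it with a simple midpoint rule. The integration range, step count, and the values 𝜏 = 1 and 𝑘 = 3 are arbitrary choices made for illustration.

```python
import math


def exponential_pdf(t, tau):
    """pdf of a single dwell time T with mean tau (t >= 0)."""
    return math.exp(-t / tau) / tau


def gamma_pdf(s, k, tau):
    """pdf of S = T_1 + ... + T_k for s > 0, computed in log space."""
    log_pdf = (k - 1) * math.log(s / tau) - s / tau - math.lgamma(k) - math.log(tau)
    return math.exp(log_pdf)


def midpoint_integral(f, lower, upper, steps=200_000):
    """Midpoint-rule approximation of the integral of f over [lower, upper]."""
    h = (upper - lower) / steps
    return h * math.fsum(f(lower + (i + 0.5) * h) for i in range(steps))


tau, k = 1.0, 3
total = midpoint_integral(lambda s: gamma_pdf(s, k, tau), 0.0, 60.0 * tau)
mean = midpoint_integral(lambda s: s * gamma_pdf(s, k, tau), 0.0, 60.0 * tau)
print("integral of pdf:", total)
print("mean of S:", mean, "expected k * tau =", k * tau)
```

With 𝑘 = 1 the Gamma pdf reduces to the exponential pdf, and the mean of 𝑆 is 𝑘𝜏, which is why the normalized duration 𝑆/𝜏 serves as an estimate of 𝑘.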
In our problem, we assume 𝜏 is known but 𝑘 is unknown. We measure 𝑛 independent samples of 𝑆:
{𝑆𝑗 , 𝑗 = 1, 2, … , 𝑛}
We want to estimate 𝑘, denoted 𝑘est (𝑛), from the 𝑛 samples. To compute 𝑘est (𝑛)
we use
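$$X = \frac{1}{n} \sum_{j=1}^{n} \frac{S_j}{\tau}$$
and the equation
$$k_{\mathrm{est}}(n) = \begin{cases} 1, & \text{if } X \le b_1 \\ 2, & \text{if } b_1 < X \le b_2 \\ 3, & \text{if } b_2 < X \le b_3 \\ \vdots \end{cases} \tag{1}$$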
Note that 𝑋 ≥ 0 since each 𝑆𝑗 measures time. Random variable 𝑛𝑋 has the
gamma distribution with shape parameter 𝑘𝑛 and scale parameter 1. There is
more than one way to assign the bin values (𝑏1 , 𝑏2 , 𝑏3 , . . . ). A naive but simple
approach is to set 𝑏𝑗 = 𝑗 + 0.5. An alternative is to design the bins to minimize
the error rate by some metric. In general, the bins can depend on 𝑛, which we
denote 𝑏𝑗^(𝑛). For a given 𝑘 and 𝑛, the probability of the estimated 𝑘 being correct
is Pr(𝑘est (𝑛) = 𝑘). We optimize the choice for the bins by maximizing the quantity
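$$\sum_{j=1}^{\infty} \Pr\left(k_{\mathrm{est}}(n) = j\right)\Big|_{k=j} = \sum_{j=1}^{\infty} \Pr\left(n b_{j-1}^{(n)} < nX \le n b_j^{(n)}\right)\Big|_{k=j} = \sum_{j=1}^{\infty} \int_{n b_{j-1}^{(n)}}^{n b_j^{(n)}} \rho_S(s, jn)\, ds$$
with 𝑏0^(𝑛) = 0. The quantity we maximize is the overall probability of the estimated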
𝑘 being correct when the true 𝑘 is equally likely to be any positive integer. We
note that if we have some prior information about the distribution of 𝑘, we could
incorporate that into the formulation and find the corresponding optimal bins. The
solution to the optimization problem above is
$$b_j^{(n)} = \frac{1}{n} \left(\frac{\Gamma((j+1)n)}{\Gamma(jn)}\right)^{1/n}$$
In Matlab, this is calculated as
b(j,n) = exp((gammaln((j+1)*n) - gammaln(j*n))/n - log(n))
For large 𝑛, we have
$$\lim_{n \to \infty} b_j^{(n)} = \frac{(j+1)^{j+1}}{j^j\, e} \tag{2}$$
For 𝑛 = ∞, we have
$$b_1^{(\infty)} = 1.472,\quad b_2^{(\infty)} = 2.483,\quad b_3^{(\infty)} = 3.488,\quad b_4^{(\infty)} = 4.491,\quad b_5^{(\infty)} = 5.492,\ \text{etc.}$$
Interestingly, these choices are close to the naive choice 𝑏𝑗 = 𝑗 + 0.5. We use these
optimal bin choices in the limit of large 𝑛, dropping the superscript notation, setting (𝑏1 , 𝑏2 , 𝑏3 , . . . ) = (1.472, 2.483, 3.488, . . . ) in equation (1) (also equation (1) in
the main text).
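The bin formulas and the estimator in equation (1) can be sketched in Python using only the standard library. The function names are illustrative, and the limiting bins are evaluated in log space so that large 𝑗 does not overflow.

```python
import math
from bisect import bisect_left


def finite_bin(j, n):
    """b_j^(n) = (1/n) * (Gamma((j+1)n) / Gamma(jn))**(1/n), via log-gamma."""
    return math.exp((math.lgamma((j + 1) * n) - math.lgamma(j * n)) / n - math.log(n))


def limiting_bin(j):
    """Large-n limit from equation (2): (j+1)**(j+1) / (j**j * e)."""
    return math.exp((j + 1) * math.log(j + 1) - j * math.log(j) - 1.0)


def estimate_k(durations, tau, bins):
    """Equation (1): average the normalized durations, then pick the bin.

    bisect_left returns the number of bins strictly below X, so
    X <= b_1 gives 1, b_1 < X <= b_2 gives 2, and so on.
    """
    x = sum(d / tau for d in durations) / len(durations)
    return bisect_left(bins, x) + 1


bins = [limiting_bin(j) for j in range(1, 6)]
print([round(b, 3) for b in bins])

# Finite-n bins approach the limiting value as n grows.
for n in (1, 10, 100, 1000):
    print(n, finite_bin(1, n))

# Two reads of the same level, with durations given in units of tau.
print(estimate_k([2.0, 3.0], tau=1.0, bins=bins))
```

For 𝑛 = 1 the formula gives 𝑏𝑗 = Γ(𝑗 + 1)/Γ(𝑗) = 𝑗, so the optimal single-read bins sit at the integers and move up toward the limiting values as 𝑛 increases.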
3 Analytic error rate calculation
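The error rate Err(𝑘, 𝑛) for a given 𝑘 and 𝑛 is given by
$$\begin{aligned}
\mathrm{Err}(k, n) &= \frac{1}{k} \sum_{j=1}^{\infty} |j - k| \cdot \Pr\left(k_{\mathrm{est}}(n) = j\right) \\
&= \frac{1}{k} \sum_{j=1}^{k-1} (k - j) \Pr\left(k_{\mathrm{est}}(n) = j\right) + \frac{1}{k} \sum_{j=k+1}^{\infty} (j - k) \Pr\left(k_{\mathrm{est}}(n) = j\right) \\
&= \frac{1}{k} \sum_{j=1}^{k-1} (k - j) \left[\Pr\left(k_{\mathrm{est}}(n) < j + 1\right) - \Pr\left(k_{\mathrm{est}}(n) < j\right)\right] \\
&\quad + \frac{1}{k} \sum_{j=k+1}^{\infty} (j - k) \left[\Pr\left(k_{\mathrm{est}}(n) > j - 1\right) - \Pr\left(k_{\mathrm{est}}(n) > j\right)\right] \\
&= \frac{1}{k} \sum_{j=1}^{k} \Pr\left(k_{\mathrm{est}}(n) < j\right) + \frac{1}{k} \sum_{j=k}^{\infty} \Pr\left(k_{\mathrm{est}}(n) > j\right) \\
&= \frac{1}{k} \left[\sum_{j=1}^{k} \Pr\left(nX \le b_{j-1} n\right) + \sum_{j=k+1}^{\infty} \Pr\left(nX > b_{j-1} n\right)\right]
\end{aligned}$$
The last step uses equation (1): Pr(𝑘est(𝑛) < 𝑗) = Pr(𝑋 ≤ 𝑏𝑗−1) with 𝑏0 = 0, and Pr(𝑘est(𝑛) > 𝑗) = Pr(𝑋 > 𝑏𝑗).
In Matlab, Err(𝑘, 𝑛) is computed as
Err(k,n) = sum(gammainc(b(1:k)*n,k*n,'lower'))/k + ...
sum(gammainc(b(k+1:201)*n,k*n,'upper'))/k
with the sum to ∞ stopping at 201 (much longer than homopolymer region
lengths considered in our study). Because Matlab vectors are indexed from 1, this expression matches the derivation when the vector b stores 𝑏(𝑗) = 𝑏𝑗−1, that is, 𝑏(1) = 0 and 𝑏(𝑗 + 1) = (𝑗 + 1)^(𝑗+1) ⁄ (𝑗^𝑗 𝑒) from
equation (2). From this error rate equation, the per-nucleotide error rate 𝑔(𝑛)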
can be computed for any sequence and for any number of reads 𝑛 using the
equation
$$g(n) = \frac{1}{n_t} \sum_{i=1}^{m} q_i \cdot \mathrm{Err}(i, n) \tag{3}$$
where 𝑞𝑖 is the total number of nucleotides belonging to length 𝑖 repeats in the
sequence, and 𝑛𝑡 is the length of the sequence. The summation is over repeat
length, so 𝑚 is the longest repeat length present in the given sequence.
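A standard-library Python sketch of the error rate and of equation (3) follows. It indexes the bins as in equation (1), so that 𝑘est = 𝑗 when 𝑏𝑗−1 < 𝑋 ≤ 𝑏𝑗 with 𝑏0 = 0, and truncates the infinite sum at 𝑗 = 200. Because the shape parameter 𝑘𝑛 is an integer, the regularized upper incomplete gamma function is evaluated as a Poisson cumulative sum. The example sequence and all function names are hypothetical.

```python
import math
from itertools import groupby


def limiting_bin(j):
    """Bin b_j from equation (2), evaluated in log space."""
    return math.exp((j + 1) * math.log(j + 1) - j * math.log(j) - 1.0)


def upper_q(a, x):
    """Regularized upper incomplete gamma Q(a, x) for a positive integer a.

    Uses the identity Q(a, x) = sum_{i<a} exp(-x) * x**i / i!.
    """
    if x <= 0.0:
        return 1.0
    log_x = math.log(x)
    return math.fsum(math.exp(i * log_x - x - math.lgamma(i + 1)) for i in range(a))


def err(k, n, j_max=200):
    """Err(k, n): expected |k_est - k| / k, with nX ~ Gamma(shape=k*n, scale=1)."""
    a = k * n
    # Underestimates: sum over j = 1..k-1 of Pr(nX <= n * b_j).
    under = math.fsum(1.0 - upper_q(a, n * limiting_bin(j)) for j in range(1, k))
    # Overestimates: sum over j = k..j_max of Pr(nX > n * b_j).
    over = math.fsum(upper_q(a, n * limiting_bin(j)) for j in range(k, j_max + 1))
    return (under + over) / k


def per_nucleotide_error(sequence, n):
    """Equation (3): q_i counts the nucleotides lying in length-i repeats."""
    q = {}
    for _, run in groupby(sequence):
        length = len(list(run))
        q[length] = q.get(length, 0) + length
    return sum(q_i * err(i, n) for i, q_i in q.items()) / len(sequence)


example = "GATCACAGGTCTTT"  # hypothetical sequence, for illustration only
for n in (1, 5, 25):
    print(n, per_nucleotide_error(example, n))
```

For 𝑘 = 1 and 𝑛 = 1, 𝑋 is exponential with unit mean, so Err(1, 1) reduces to the sum of exp(−𝑏𝑗) over the bins.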
4 Breakdown of mean error rates
To gain some insight into the causes of errors in the multi-read consensus sequences, we can examine the percentage of the total mean error that is a result
of insertions, deletions, and substitutions. For the three simulated signal cases
(no noise, 1X noise, and 2X noise), Figure S1 shows the breakdown of the mean
error rate in terms of insertions, deletions, and substitutions. In all three cases,
mean error rates are shown as a function of the number of reads for the first 50
nucleotides of the Human Mitochondrial DNA sequence [6]. Data points are the
mean error per nucleotide from 900 independent multi-read consensus sequences, with each read being drawn from a set of 10,000 simulated signals. Error bars
reflect the standard error, which is computed as the standard deviation of each
data point divided by √900. For the 10,000 simulated signals, nucleotide dwell
times were randomly drawn from an exponential distribution with a mean of 1 ms.
In the no noise case, insertions account for 82.06% of the total mean error
rate on average while deletions account for the remaining 17.94%. Substitutions
do not contribute to the total mean error rate because they arise from additive
noise and filtering, which are absent in this case. The large disparity between insertions and deletions can be attributed to the DNA sequence used in
these simulations. Of the sequence's 50 nucleotides, 34 (68%) are single nucleotides that are not part of a homopolymer region. In the no noise case, the only
types of sequencing errors that can occur for these bases are insertions. Additionally, the length of the sequence limits the total number of deletions that can
occur, whereas there is no upper limit on the possible number of insertions.
Therefore, if errors do occur during the base-calling process, we would expect
the vast majority of them to be insertions.
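The asymmetry between insertions and deletions in the no-noise case can be illustrated with a small Monte Carlo sketch in Python. It considers a single read of a single level (not the multi-read consensus used for Figure S1), uses the limiting bins from Section 2, and measures durations in units of 𝜏; the seed and trial count are arbitrary.

```python
import math
import random
from bisect import bisect_left

# Limiting bins b_1, b_2, ... from equation (2), evaluated in log space.
BINS = [math.exp((j + 1) * math.log(j + 1) - j * math.log(j) - 1.0) for j in range(1, 100)]


def estimate_k(x):
    """Equation (1) applied to a normalized duration x = S / tau."""
    return bisect_left(BINS, x) + 1


def tally(k, trials, rng):
    """Count inserted and deleted nucleotides for a homopolymer of length k."""
    insertions = deletions = 0
    for _ in range(trials):
        x = sum(rng.expovariate(1.0) for _ in range(k))  # k unit-mean dwell times
        est = estimate_k(x)
        insertions += max(est - k, 0)
        deletions += max(k - est, 0)
    return insertions, deletions


rng = random.Random(0)
for k in (1, 3):
    print("k =", k, "(insertions, deletions) =", tally(k, 10_000, rng))
```

Because the estimate is never below 1, a detected level with 𝑘 = 1 can only be overcalled, whereas a longer homopolymer can be miscalled in either direction.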
Figure S1. Breakdown of mean error rates into insertions, deletions, and substitutions. (a) Simulated nanopore signal with no additive noise. Insertions account
for the majority of the total mean error rate and substitutions do not contribute at
all. (b) Simulated nanopore signal with 1X noise. Like the no noise case, insertions account for the majority of the total mean error rate. Substitutions play a
small role when the number of reads is few, but quickly decrease to zero. (c)
Simulated nanopore signal with 2X noise. The additional noise results in a nearly
equal contribution from insertions and deletions to the total mean error rate. Substitutions also play a larger role.
In the 1X noise case, insertions account for 86.35% of the total mean error
rate, deletions account for 13.50%, and substitutions account for just 0.15% on
average. These results are similar to the no noise case, as expected, with the
main difference being the contribution of substitutions to the total mean error
rate. Since substitutions only occur when an ionic current level is long enough to
be identified but the additive noise and filtering cause its amplitude to be misclassified, they are rare to begin with, and their influence is quickly drowned out
as the number of reads is increased.
In the 2X noise case, the contribution to the total mean error rate by insertions
and deletions is almost equal with each accounting for 48.97% and 47.81% of
the total mean error rate on average, respectively. The contribution of substitutions is increased compared to the 1X noise case, but it is still much less than
either insertions or deletions accounting for only 3.22% of the total mean error
rate on average. The change in the ratio of insertion to deletion errors results from
increasing the additive noise without increasing the filtering of the signal. The additional noise causes the base-calling algorithm to detect many spurious
ionic current levels while, at the same time, legitimate current levels go undetected. This affects the mean error rate in two ways. First, spurious current levels
increase the number of insertions, and missed current
levels increase the number of deletions. Second, adding
or removing an ionic current level skews the calculated durations of the adjacent current levels, increasing the already inherent chance
of assigning an incorrect number of nucleotides to those levels.
5 Minimum dwell time for ionic current level detection
The purpose of using an enzyme in conjunction with a nanopore for sequencing
is to reduce the speed of the molecule passing through the pore or, in other
words, to increase the dwell time of the molecule in the pore [2]. The longer each
nucleotide of a DNA sequence resides in the nanopore, the easier it is to sense
and identify. With this in mind, we ask the question: how long does the dwell time
of a nucleotide need to be in order for its ionic current level to be detected and
identified?
To begin to answer this question, we identified the two situations where the
dwell time of a nucleotide has the greatest impact on whether or not the ionic current level it produces is detectable with our step-detection algorithm. The first situation, illustrated in Figure S2a, is a short ionic current level in the form of a
pulse with an amplitude at one extreme of the scale situated between two longer
current levels with amplitudes at the other extreme of the scale. An example of a
sequence that would produce such an ionic current signal is 𝑇𝐴𝑇, where 𝑇 is
mapped to an amplitude of 0 pA and 𝐴 is mapped to an amplitude of 3 pA. If the
dwell time of the center nucleotide (𝐴) is too short, the filtered ionic current signal
will rise and fall so sharply that the gradient and amplitude will be outside of the
thresholds of the step-detection algorithm and the current level produced by the
nucleotide will go undetected.
Figure S2. Worst case scenarios that affect the minimum dwell time for detecting
ionic current levels. Simulation of a measured ionic current signal from nanopore
experiments (grey), additionally filtered signal for step detection (black), and the
noiseless ionic current levels (red). (a) A short ionic current level taking the form
of a pulse in the measured signal is difficult to detect if its gradient is too steep,
its peak too narrow, or its maximum amplitude occurs outside of the threshold.
(b) A short intermediate ionic current level between two longer levels is difficult to
detect if its gradient does not sufficiently flatten out.
The second situation, illustrated in Figure S2b, is a short intermediate ionic
current level that forms a staircase between the two current levels adjacent to it.
An example of a sequence that would produce such an ionic current signal is
𝑇𝐶𝐴, where 𝑇 is mapped to an amplitude of 0 pA, 𝐶 is mapped to an amplitude of
1 pA, and 𝐴 is mapped to an amplitude of 3 pA. If the dwell time of the center nucleotide (𝐶) is too brief, the filtered ionic current signal will be smoothed so much
that it passes through the current level produced by the nucleotide without flattening out, leaving the level undetected. For these two cases, we determined
from our simulations with 1X noise that the minimum dwell time for an ionic current level to be reliably detected is approximately 170 𝜇s. The 10-90% rise time
of the step response of the total filter (98 𝜇s) is consistent with the transition
times observed in Figure S2.
6 Mapping nucleobases to amplitudes
In this work, the mapping of nucleobases (𝐴, 𝐺, 𝐶, 𝑇) to ionic current amplitudes
(3,2,1,0) pA was done arbitrarily without any consideration for how this mapping
would affect the mean error rates. One could argue that choosing a particular
mapping for a given sequence may artificially reduce or increase the mean error
rate. Specifically, for a given sequence, one mapping will have a higher frequency of the largest (3 pA) amplitude transition than any other mapping, with the
3 pA transition easier to detect in noisy data than smaller (1,2 pA) transitions. In
such cases, error rates could be measurably affected by our arbitrary choice of
base-to-amplitude mapping. To test this, we examined the effects of different
base-amplitude mappings for simulated nanopore signals with 1X noise and a
mean dwell time of 1 ms. The labels for the three curves shown in Figure S3 reflect the base-amplitude mapping used with amplitudes decreasing from left to
right. As in previous simulations, mean error rates are shown as a function of the
number of reads for the first 50 nucleotides of the Human Mitochondrial DNA sequence [6]. Data points are the mean error per nucleotide from 900 independent
multi-read consensus sequences, with each read being drawn from a set of
10,000 simulated signals. Error bars reflect the standard error, which is computed as the standard deviation of each data point divided by √900. As can be seen
in Figure S3, rearranging the amplitude mappings has little effect on the mean
error rate for this sequence.
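The argument about transition sizes can be made concrete by counting, for a given sequence and mapping, how many level changes of each magnitude occur. In the Python sketch below, the example sequence is hypothetical and the mapping is written in the same left-to-right, decreasing-amplitude convention as the curve labels.

```python
from collections import Counter


def transition_counts(seq, order):
    """Count amplitude steps (in pA) between adjacent nucleotides.

    `order` lists the bases from highest to lowest amplitude, so "AGCT"
    maps A, G, C, T to 3, 2, 1, 0 pA. Repeated bases give no step and
    are skipped.
    """
    amplitude = {base: len(order) - 1 - i for i, base in enumerate(order)}
    steps = (abs(amplitude[a] - amplitude[b]) for a, b in zip(seq, seq[1:]))
    return Counter(s for s in steps if s > 0)


example = "GATCACAGGTCTATCACC"  # hypothetical sequence, for illustration only
for order in ("AGCT", "CATG"):
    counts = transition_counts(example, order)
    print(order, {size: counts.get(size, 0) for size in (1, 2, 3)})
```

Comparing the counts across mappings shows how strongly a particular choice favors the easily detected 3 pA transitions for a given sequence.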
7 Influence of incorrect mean dwell time estimate on error rate
For all of the previous simulations in this work, it is assumed that the mean of the
exponential distribution from which the dwell times for each nucleotide are drawn
is known. However, in practice, the mean dwell time needs to be estimated from
the data. Since the mean dwell time (𝜏) plays such a major role in the base-calling algorithm, it is important to understand the effect on the error rate of underestimating or overestimating 𝜏.
Figure S3. Effect of changing nucleobase amplitude mappings on mean error
rate. Simulated nanopore signals with 1X noise. Amplitudes (in pA) assigned to
bases decrease from left to right, i.e. for the curve 𝐶𝐴𝑇𝐺, base-amplitude mappings are 𝐶 → 3, 𝐴 → 2, 𝑇 → 1, and 𝐺 → 0. The curve 𝐴𝐺𝐶𝑇 reflects the base-amplitude mapping used in the paper. Rearranging the amplitude mappings has
virtually no effect on the mean error rate.
Figure S4 shows the mean error rates for underestimating 𝜏 by half
(𝜏̂ = 0.5𝜏), overestimating 𝜏 by a factor of two (𝜏̂ = 2𝜏), and estimating 𝜏 exactly
(𝜏̂ = 𝜏). As in previous simulations, mean error rates are shown as a function of
the number of reads for the first 50 nucleotides of the Human Mitochondrial DNA
sequence [6]. Data points are the mean error per nucleotide from 900 independent multi-read consensus sequences, with each read being drawn from a set of
10,000 simulated signals. Error bars reflect the standard error, which is computed as the standard deviation of each data point divided by √900.
Since the noise and filtering are the same for all three cases, there is no difference in the detection of ionic current levels. The differences in the curves
come from the assignment of nucleotides to the identified ionic current levels using the optimal binning scheme. When 𝜏 is underestimated, the optimal binning
scheme assigns too many nucleotides to each ionic current level resulting in a
large number of insertion errors.
Increasing the number of reads has little effect on the mean error rate because each new read also has a large number of insertions at every ionic current
level. When 𝜏 is overestimated, the optimal binning scheme does not assign
enough nucleotides to each ionic current level resulting in a large number of deletions. However, unlike the number of possible insertions, which can be infinite,
the number of nucleotides limits the total number of possible deletions in a given
sequence; there cannot be more deletions than nucleotides. This results in a
much lower mean error rate for overestimating 𝜏 (15.74% for 30-read consensus
sequences) than for underestimating 𝜏 (112.87% for 30-read consensus sequences). Just as in the underestimated case, increasing the number of reads
does little to improve the error rate because each new read also has a large
number of deletions at every ionic current level.
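The direction of the bias follows directly from equation (1), because the normalized duration scales as 𝜏/𝜏̂. The Python sketch below applies the limiting bins to a level whose duration equals its expected value 𝑘𝜏; the choice 𝑘 = 3 and the millisecond units are illustrative assumptions.

```python
import math
from bisect import bisect_left

# Limiting bins b_1, b_2, ... from equation (2), evaluated in log space.
BINS = [math.exp((j + 1) * math.log(j + 1) - j * math.log(j) - 1.0) for j in range(1, 100)]


def estimate_k(duration_ms, tau_hat_ms):
    """Equation (1) with the assumed mean dwell time tau_hat in place of tau."""
    return bisect_left(BINS, duration_ms / tau_hat_ms) + 1


tau_ms, k = 1.0, 3
duration_ms = k * tau_ms  # a level lasting exactly its expected duration

for label, tau_hat_ms in (("underestimate", 0.5), ("exact", 1.0), ("overestimate", 2.0)):
    print(label, "tau_hat =", tau_hat_ms, "ms -> k_est =", estimate_k(duration_ms, tau_hat_ms))
```

Halving 𝜏̂ doubles the normalized duration and yields overcalls at every level, while doubling 𝜏̂ halves it and yields undercalls; additional reads average the same biased quantity, so they do not remove the error.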
Figure S4. Effect of estimating the mean dwell time on the mean error rate. Simulated nanopore signals with 1X noise, a true mean dwell time of 1 ms, and varying estimates (𝜏̂ ) of the mean dwell time (𝜏) used for base-calling. Underestimating the mean dwell time results in more nucleotides being assigned to each ionic
current level, which increases the number of insertions along with the mean error
rate. Overestimating the mean dwell time results in fewer nucleotides being assigned to each ionic current level, which increases the number of deletions. This
does not increase the mean error rate for a small number of reads because while
the number of deletions is increased, the number of insertions is also decreased.
Since insertions are the main drivers of the mean error rate, this actually improves the mean error rate for a small number of reads. In both cases, increasing the number of reads does little to improve the mean error rate.
8 Influence of mean dwell time on error rate
Enzyme controlled translocation rates of DNA through a nanopore have been
experimentally measured and modeled as being exponentially distributed [2]. To
mimic these experimental conditions, the simulated dwell times of nucleotides in
the nanopore sensor are randomly drawn from an exponential distribution with a
known mean. The larger the mean dwell time, the more slowly DNA passes
through the nanopore and the easier it should be to detect and identify the ionic
current levels induced by each nucleotide. To examine how the mean dwell time
influences the sequencing error rate, we simulated nanopore signals with 1X
noise for several different mean dwell times.
Figure S5 shows the mean error rates for four different mean dwell times
compared to the analytic error rate described in Section 3. In all four cases,
mean error rates are shown as a function of the number of reads for the first 50
nucleotides of the Human Mitochondrial DNA sequence [6]. Data points are the
mean error per nucleotide from 900 independent multi-read consensus sequences, with each read being drawn from a set of 10,000 simulated signals. Error bars
reflect the standard error, which is computed as the standard deviation of each
data point divided by √900.
As expected, the mean error rate decreases as the mean dwell time increases. For a mean dwell time of 0.1 ms, the mean error rate for the 50-read consensus sequences is 90.88%. But as the mean dwell time is increased,
the mean error rate converges towards the minimum analytic error rate, with a
rate for the 50-read consensus sequences of just 0.47% for a mean dwell time of
10 ms.
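One way to see why short mean dwell times are so costly is to combine the exponential dwell-time model with the approximate 170 𝜇s detection limit from Section 5. The Python sketch below computes the fraction of dwell times that fall under that limit; treating 170 𝜇s as a hard cutoff is a simplifying assumption made here, not part of the simulations.

```python
import math

MIN_DWELL_MS = 0.170  # approximate detection limit from Section 5 (1X noise)


def fraction_below_minimum(mean_dwell_ms, minimum_ms=MIN_DWELL_MS):
    """Pr(T < minimum) for an exponential dwell time T with the given mean."""
    return 1.0 - math.exp(-minimum_ms / mean_dwell_ms)


for mean_dwell_ms in (0.1, 1.0, 10.0):
    print(mean_dwell_ms, "ms:", fraction_below_minimum(mean_dwell_ms))
```

Under this simplification, most levels are too short to detect at a mean dwell time of 0.1 ms, while only a small fraction are at 10 ms, which is qualitatively consistent with the trend in Figure S5.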
Figure S5. Effect of mean dwell time on the mean error rate. Simulated nanopore signals with 1X noise and varying mean dwell times. The mean error
rates decrease with an increase in the mean dwell time and eventually converge
to the analytic error rate.
9 Error rates of multi-nucleotide nanopore sensors with no systematic errors
In previous sections, it has been assumed that the nanopore sensor has single
nucleotide sensitivity resulting in the ionic current signal being a function of only
one nucleotide in the channel. However, in practice, nanopore sensors do not
have single nucleotide sensitivity and ionic current signals are functions of multiple nucleotides in the channel [2]. In Section 1, we considered cases in which the
ionic current is a function of 𝑛 ≥ 1 nucleotides, and circumstances under which
the number of distinct amplitude levels 𝑀 (with 𝑀 ≤ 4^𝑛) may not be sufficient to
unambiguously identify the DNA sequence. If, however, the number of distinct
amplitude levels is sufficient to unambiguously identify the sequence, then an 𝑛-nucleotide nanopore sensor would be capable of identifying all homopolymer regions of length 𝑛 and smaller, in a single read. In that case, a larger nanopore
sensor would require fewer reads to achieve a desired error threshold for any
biologically relevant sequence. In this section, we assume that an 𝑛-nucleotide
nanopore sensor (with 𝑛 = 2,3,4) has no systematic errors (i.e., can resolve any
length-𝑛 sequence in the sensor), and consider the consequent improvement in
error-rate performance when compared to the error-rate performance with 𝑛 = 1.
Figure S6 shows a comparison of the analytic error rates for the four different nanopore sensors with 𝑛 = 1,2,3,4. The error rates were computed for the
complete Human Mitochondrial DNA sequence [6] using the method described in
Section 3. For a single read, the analytic error rate drops from 40.5% to 1.24% as
the size of the sensor is increased from one to four nucleotides, a dramatic improvement. However, the analytic error rates eventually converge for all four
cases as the number of reads increases, resulting in less dramatic improvement
when higher accuracy is required. For example, the four-nucleotide sensor still
requires 130 reads to achieve Q40 accuracy, as opposed to 156 reads for the
single nucleotide sensor.
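Assuming that Q40 refers to the standard Phred quality scale, Q = −10 log10(𝑝), the accuracy target corresponds to a per-nucleotide error rate of 10^−4. A short Python sketch of the conversion, applied to the single-read error rates quoted above:

```python
import math


def phred_q(error_rate):
    """Phred quality score for a per-nucleotide error probability."""
    return -10.0 * math.log10(error_rate)


def error_from_q(q):
    """Per-nucleotide error probability for a Phred quality score."""
    return 10.0 ** (-q / 10.0)


print("Q40 error rate:", error_from_q(40))
for label, rate in (("n = 1, single read", 0.405), ("n = 4, single read", 0.0124)):
    print(label, "-> Q =", phred_q(rate))
```

On this scale, both single-read error rates are far from Q40, which is consistent with the large number of reads still required by the four-nucleotide sensor.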
Figure S6. Effect of nanopore sensor footprint on the analytic error rates in the
absence of systematic channel errors. The analytic error rates decrease with an
increase in the size of the sensor footprint, but approach the single nucleotide
sensor error rate as the number of reads increases.
10 References
[1] Cherf, G. M., Lieberman, K. R., Rashid, H., Lam, C. E., Karplus, K., Akeson,
M., Nat. Biotechnol. 2012, 30, 344–348.
[2] Manrao, E. A., Derrington, I. M., Laszlo, A. H., Langford, K. W., Hopper, M. K.,
Gillgren, N., Pavlenok, M., Niederweis, M., Gundlach, J. H., Nat. Biotechnol.
2012, 30, 349–353.
[3] Sakmann, B., Neher, E. (Eds.), Single-Channel Recording, Plenum Press,
New York 1995, 2nd edn.
[4] Venkataramanan, L., Sigworth, F. J., Biophys. J. 2002, 82, 1930–1942.
[5] Qin, F., Auerbach, A., Sachs, F., Biophys. J. 2000, 79, 1928–1944.
[6] Sanchez-Cespedes, M., Parrella, P., Nomoto, S., Cohen, D., Xiao, Y., Esteller, M., Jeronimo, C., Jordan, R. C. K., Nicol, T., Koch, W. M., Schoenberg, M., Mazzarelli, P., Fazio, V. M., Sidransky, D., Cancer Res. 2001, 61,
7015–7019.