Supporting information for “Error analysis of idealized nanopore sequencing”

Christopher R. O’Donnell 1, Hongyun Wang 2, William B. Dunbar 1*

1 Identifiability of DNA sequences from ionic current amplitude

If an enzyme on the pore controls the ssDNA passing through the pore, the ssDNA moves in 1 nt steps, the dwell time at each ssDNA position is exponentially distributed, and the step transitions are instantaneous compared to the measurement bandwidth [1,2]. An appropriate idealization of the signal is a pulse train, defined by a set of $M$ amplitudes and a sequence of measured dwell times. The single-channel recording and analysis literature [3-5] provides a set of techniques that can be applied to estimate the pulse-train idealization from the noisy recorded data. For sequencing, the pulse train would be compared to a library of amplitudes identified through control experiments with known sequences. In this section, we consider challenges associated with having a limited number of distinct amplitude levels in the idealization.

If $N$ nucleotides affect the ionic current, then $M = 4^N$ amplitude levels are sufficient to unambiguously identify the sequence. For the purpose of synthesizing the idealization from noisy data, each amplitude level must have a signal-to-noise ratio (S/N) of at least 2 for idealization by half-amplitude methods, or at least 1.5 by Markov-based methods [4]. For $N \geq 3$, as in the case of the MspA nanopore [2], achieving $M = 4^N$ amplitude levels with sufficient S/N may not be possible. We consider specific examples in which having fewer than $4^N$ amplitude levels makes it impossible to unambiguously identify the sequence. We construct examples for $N = 2, 3, 4$, assuming sequences are identified right-to-left as they pass through the pore. Thus, AGCTTAG with $N = 4$ would be identified as TTAG, then CTTA, etc. We refer to $M$ “distinct” amplitude levels when each level has sufficient S/N for detection.
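The right-to-left reading convention and the $4^N$ level count can be illustrated with a minimal Python sketch. The function names `windows_right_to_left` and `levels_needed` are hypothetical and are not part of the original analysis; the sketch only enumerates the length-$N$ windows in the order in which they would be identified.

```python
def windows_right_to_left(seq, n):
    """Return the length-n windows of seq in the assumed reading order,
    starting from the right-hand end of the sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n, -1, -1)]


def levels_needed(n):
    """Number of amplitude levels sufficient for unambiguous identification
    when n nucleotides affect the current (4 choices per position)."""
    return 4 ** n


# Illustrative use with the example sequence from the text.
first_two = windows_right_to_left("AGCTTAG", 4)[:2]   # TTAG, then CTTA
```

For $N = 3$ and $N = 4$ the same count gives 64 and 256 levels, which is why sufficient S/N for every level becomes hard to achieve as $N$ grows.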
The case of $N = 1$ nucleotide affecting the current amplitude is the simplest, and is considered first.

______________________________
1 Department of Computer Engineering, Baskin School of Engineering, University of California, Santa Cruz, CA 95064, USA
2 Department of Applied Math and Statistics, Baskin School of Engineering, University of California, Santa Cruz, CA 95064, USA
* Correspondence: Associate Professor William B. Dunbar, Department of Computer Engineering, Baskin School of Engineering, University of California, 1156 High Street, MS:SOE3, Santa Cruz, CA 95064, Fax: (831) 459-4829, email: [email protected]

Proposition 1. If $N = 1$, then $M = 4^N = 4$ distinct amplitude levels are necessary to unambiguously identify the sequence.

Proof. This follows trivially. Suppose $N = 1$ and $M = 3$. Then two of the four nucleotides generate the same current amplitude, and current amplitude alone cannot determine which of these two nucleotides is present in the sensing region. If $M = 2$ or $M = 1$, then at most one or none of the nucleotides is identifiable, respectively. ■

Next, for the case $N = 2$, we construct a case in which having fewer than $4^N = 16$ distinct amplitude levels makes it impossible to resolve all sequences.

Proposition 2. Suppose $N = 2$ and there are $M < 4^N = 16$ distinct amplitude levels. Let $a, b \in \{A, T, G, C\}$ and $a \neq b$. If the three pairs in the set $\{aa, ab, ba\}$ generate the same amplitude $I_1$, then there are an infinite number of sequences that cannot be identified from the pulse train. In particular, no subsequence $b_1 \cdots b_m$ can be identified within the sequence $a b_1 \cdots b_m a$, provided $b_i \in \{a, b\}$ for $i = 1, \ldots, m$ ($m \geq 1$) and each $b$ is separated by one or more $a$s.

Proof. Without loss of generality, let $a = C$ and $b = A$, and assume the pairs in the set $\{CC, CA, AC\}$ generate the same amplitude $I_1$. Then the nucleotide $b_1 \in \{A, C\}$ within the triple $C b_1 C$ cannot be identified.
We can show this by considering an example subsequence $s_1 = TCACG$ to be identified. The amplitude that registers $CG$ (assumed to be identifiable) can be used to choose $yCG$ upon detecting $I_1$, with $y = A$ or $C$. After the second $I_1$ is detected (assuming a tracking counter is enabled) we have $xyCG$ with $xy \in \{CC, CA, AC\}$. Next, $TC$ is detected (assuming it is identifiable), which constrains $x = C$. The value of $y$, however, cannot be resolved. Additionally, any subsequence constructed from $A$ and $C$ and nested within $C \cdots C$ cannot be identified, provided each $A$ is nested within $C$s. An example is the subsequence CCC in the middle of CCCCC, which is indistinguishable from the middle subsequences CAC, ACA, ACC, and CCA within CCACC, CACAC, CACCC, and CCCAC, respectively. Also, the longer the nested subsequence, the larger the set of subsequences that are indistinguishable. ■

If one adds $bb$ to the set in Proposition 2, the result is a larger number of unidentifiable subsequences.

Proposition 3. Suppose $N = 2$ and there are $M < 4^N = 16$ distinct amplitude levels. Let $a, b \in \{A, T, G, C\}$ and $a \neq b$. If the four pairs in the set $\{aa, ab, ba, bb\}$ generate the same amplitude $I_1$, then there are an infinite number of sequences that cannot be identified from the pulse train. In particular, no subsequence $b_1 \cdots b_m$ within the sequence $b_0 b_1 \cdots b_m b_{m+1}$ can be identified, with $b_i \in \{a, b\}$ for $i = 0, \ldots, m + 1$ ($m \geq 1$).

Proof. The proof follows the same logic as the proof of Proposition 2, with the unidentifiable subsequence being nested within $a \cdots a$, $a \cdots b$, $b \cdots a$ or $b \cdots b$. ■

To see the increase in the number of sequences that cannot be resolved, let $a = A$ and $b = C$ and assume $\{CC, CA, AC, AA\}$ generate the same amplitude. Then the nucleotide $b_1 \in \{A, C\}$ cannot be identified within any of the triples $C b_1 C$, $C b_1 A$, $A b_1 C$, or $A b_1 A$.
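The ambiguity in Propositions 2 and 3 can be checked mechanically. The following Python sketch is an illustration under stated assumptions, not the authors' procedure: the helper `pulse_train` is hypothetical, every pair in the shared set is mapped to the single label `"I1"`, and every other pair is assumed to have its own distinct, identifiable level.

```python
from itertools import product


def pulse_train(seq, shared, n=2):
    """Amplitude labels for the length-n windows of seq, read right to left.

    Windows in `shared` all map to the label "I1"; every other window is
    assumed (for illustration only) to produce its own distinct level.
    """
    windows = [seq[i:i + n] for i in range(len(seq) - n, -1, -1)]
    return tuple("I1" if w in shared else w for w in windows)


prop2_set = {"CC", "CA", "AC"}          # pairs sharing I1 in Proposition 2
prop3_set = prop2_set | {"AA"}          # Proposition 3 adds the pair AA

# Proposition 2 example: TCACG and TCCCG give the same pulse train.
same_under_prop2 = pulse_train("TCACG", prop2_set) == pulse_train("TCCCG", prop2_set)

# Proposition 3: b1 is hidden inside every triple x b1 y with x, y in {A, C}.
hidden_under_prop3 = all(
    pulse_train(x + "A" + y, prop3_set) == pulse_train(x + "C" + y, prop3_set)
    for x, y in product("AC", repeat=2)
)
```

Under the Proposition 2 set alone, a triple such as $A b_1 A$ remains resolvable because $AA$ still has its own level; adding $AA$ to the shared set removes that distinction.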
There is also a greater number of subsequence permutations that cannot be resolved for a given subsequence length $m > 1$. As an example, again with $a = A$ and $b = C$ and assuming $\{CC, CA, AC, AA\}$ generate the same amplitude, the subsequence $CCCC$ with $m = 4$ is not identifiable from within CCCCCC, CCCCCA, ACCCCC or ACCCCA. Moreover, all $2^m = 16$ four-letter combinations of $A$ and $C$ are indistinguishable from $CCCC$.

We consider next $N = 3$, which approaches the sensitivity of the biological pore MspA [2] and matches the claimed sensitivity of the nanopores developed by Oxford Nanopore Technologies. It is unlikely that all $M = 4^3 = 64$ distinct amplitudes are available for idealization.

Proposition 4. Suppose $N = 3$ and there are $M < 4^N = 64$ distinct amplitude levels. Let $a, b \in \{A, T, G, C\}$ and $a \neq b$. If the four triples in the set $\{aaa, aab, aba, baa\}$ generate the same amplitude $I_1$, then there are an infinite number of sequences that cannot be identified from the pulse train. In particular, no subsequence $b_1 \cdots b_m$ can be identified within the sequence $aa b_1 \cdots b_m aa$, provided $b_i \in \{a, b\}$ for $i = 1, \ldots, m$ ($m \geq 1$) and each $b$ is separated by two or more $a$s.

Proof. Without loss of generality, let $a = C$ and $b = A$, and assume the elements in the set $\{CCC, CCA, CAC, ACC\}$ generate the same amplitude $I_1$. Then the central $A$ and the central $C$ are indistinguishable within the sequences CCACC and CCCCC, respectively. We can show this by considering an example subsequence $s_2 = TCCACCG$ to be identified. The amplitude that registers $CCG$ (assumed to be identifiable) can be used to choose $zCCG$ upon detecting $I_1$, with $z = A$ or $C$. After the second $I_1$ is detected (assuming a tracking counter is enabled) we have $yzCCG$ with $yz \in \{AC, CA, CC\}$. After the third $I_1$ is detected, we have $xyzCCG$ with $xyz \in \{CCC, CCA, CAC, ACC\}$.
Next, $TCC$ is detected (assuming it is identifiable), which constrains $xy = CC$. The value of $z$ cannot be resolved. Generalizing this example, it is straightforward to show that the central pairs CC, AC and CA are indistinguishable within CCCCCC, CCACCC and CCCACC, respectively. The longer the nested subsequence, the larger the set of subsequences that are indistinguishable. ■

Proposition 5. Suppose $N = 4$ and there are $M < 4^N = 256$ distinct amplitude levels. Let $a, b \in \{A, T, G, C\}$ and $a \neq b$. If the five elements in the set $\{aaaa, aaab, aaba, abaa, baaa\}$ generate the same amplitude $I_1$, then there are an infinite number of sequences that cannot be identified from the pulse train. In particular, no subsequence $b_1 \cdots b_m$ can be identified within the sequence $aaa b_1 \cdots b_m aaa$, provided $b_i \in \{a, b\}$ for $i = 1, \ldots, m$ ($m \geq 1$) and each $b$ is separated by three or more $a$s.

Proof. The proof follows the same logic as the proofs of Propositions 2-4. Without loss of generality, let $a = C$ and $b = A$, and assume the elements in the set $\{CCCC, CCCA, CCAC, CACC, ACCC\}$ generate the same amplitude $I_1$. Then the central $A$ and the central $C$ are indistinguishable within the sequences CCCACCC and CCCCCCC, respectively. We can show this by considering an example subsequence $s_3 = TCCCACCCG$ to be identified. The amplitude that registers $CCCG$ (assumed to be identifiable) can be used to choose $zCCCG$ upon detecting $I_1$, with $z = A$ or $C$. After the second $I_1$ is detected (assuming a tracking counter is enabled) we have $yzCCCG$ with $yz \in \{AC, CA, CC\}$. After the third $I_1$ is detected, we have $xyzCCCG$ with $xyz \in \{CCC, CCA, CAC, ACC\}$.
After the fourth $I_1$ is detected, we have $wxyzCCCG$ with $wxyz \in \{CCCC, CCCA, CCAC, CACC, ACCC\}$. Next, $TCCC$ is detected (assuming it is identifiable), which constrains $wxy = CCC$. The value of $z$ cannot be resolved. Generalizing this example, it is straightforward to show that the central pairs CC, AC and CA are indistinguishable within CCCCCCCC, CCCACCCC and CCCCACCC, respectively. The longer the nested subsequence, the larger the set of subsequences that are indistinguishable. ■

The results in Propositions 2-5 show that there may be sequences that cannot be identified by amplitude level classification. Moreover, the examples do not cover all possible cases where identifiability is lost; they show only the existence of such cases. All cases should be enumerated as part of efforts to sequence based on ionic current. Additionally, the cases shown are not unreasonable, in the sense that such sequences might be expected to have a common amplitude, particularly for $N = 3, 4$. Until control experiments reveal which sequences cannot be robustly separated by distinct amplitudes, and for what value(s) of $N$, it is not clear whether the distinct amplitude levels that register in the ionic current will be sufficient to identify intact ssDNA sequences.

2 Optimal binning scheme

This section defines how we choose the bins used to assign an estimated sequence length to each set of amplitude level durations that would be collected during an experiment. Let $X$ be an exponentially distributed random variable with mean $\tau$. The probability density function (pdf) of an exponential distribution is

$$f_X(t) = \frac{1}{\tau} \exp\left(-\frac{t}{\tau}\right)$$

The sum of $k$ independent samples of $X$ has a Gamma distribution, with random variable $S$ and pdf denoted

$$S = X_1 + X_2 + \cdots + X_k, \qquad f_S(s, k) = \frac{1}{\Gamma(k)\,\tau} \left(\frac{s}{\tau}\right)^{k-1} \exp\left(-\frac{s}{\tau}\right)$$

In our problem, we assume $\tau$ is known but $k$ is unknown.
We measure $n$ independent samples of $S$: $\{S_i,\ i = 1, 2, \ldots, n\}$. We want to estimate $k$, denoted $k_{\text{est}}(n)$, from the $n$ samples. To compute $k_{\text{est}}(n)$ we use

$$s = \frac{1}{n} \sum_{i=1}^{n} \frac{S_i}{\tau}$$

and the equation

$$k_{\text{est}}(n) = \begin{cases} 1, & \text{if } s \leq b_1 \\ 2, & \text{if } b_1 < s \leq b_2 \\ 3, & \text{if } b_2 < s \leq b_3 \\ \vdots \end{cases} \qquad (1)$$

Note that $s \geq 0$ since each $S_i$ measures time. The random variable $ns$ has the Gamma distribution with shape parameter $kn$ and scale parameter 1.

There is more than one way to assign the bin values $(b_1, b_2, b_3, \ldots)$. A naive but simple approach is to set $b_j = j + 0.5$. An alternative is to design the bins to minimize the error rate by some metric. In general, the bins can depend on $n$, which we denote $b_j^{(n)}$. For a given $k$ and $n$, the probability of the estimated $k$ being correct is $\Pr(k_{\text{est}}(n) = k)$. We optimize the choice of bins by maximizing the quantity

$$\sum_{k=1}^{\infty} \Pr(k_{\text{est}}(n) = j)\Big|_{j=k} = \sum_{k=1}^{\infty} \Pr\left(n b_{k-1}^{(n)} < ns \leq n b_k^{(n)}\right)\Big|_{j=k} = \sum_{k=1}^{\infty} \int_{n b_{k-1}^{(n)}}^{n b_k^{(n)}} f_S(x, kn)\, dx$$

with $b_0^{(n)} = 0$, where $f_S(x, kn)$ is evaluated with unit scale ($\tau = 1$). The quantity we maximize is the overall probability of the estimated $k$ being correct when the true $k$ is equally likely to be any positive integer. We note that if we have some prior information about the distribution of $k$, we could incorporate that into the formulation and find the corresponding optimal bins. The solution to the optimization problem above is

$$b_j^{(n)} = \frac{1}{n} \left(\frac{\Gamma((j + 1)n)}{\Gamma(jn)}\right)^{1/n}$$

In Matlab, this is calculated as

    b(j,n) = exp((gammaln((j+1)*n) - gammaln(j*n))/n - log(n))

For large $n$, we have

$$\lim_{n \to \infty} b_j^{(n)} = \frac{(j + 1)^{j+1}}{j^j e} \qquad (2)$$

For $n = \infty$, we have $b_1^{(\infty)} = 1.472$, $b_2^{(\infty)} = 2.483$, $b_3^{(\infty)} = 3.488$, $b_4^{(\infty)} = 4.491$, $b_5^{(\infty)} = 5.492$, etc. Interestingly, these choices are not too different from $b_j = j + 0.5$. We use these optimal bin choices in the limit of large $n$, dropping the superscript notation and setting $(b_1, b_2, b_3, \ldots) = (1.472, 2.483, 3.488, \ldots)$ in equation (1) (also equation (1) in the main text).
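The bin formulas of this section can be reproduced with a short Python sketch that mirrors the Matlab expression using only the standard library. The function names `optimal_bin`, `limit_bin` and `estimate_length` are hypothetical, and `estimate_length` is only one possible reading of equation (1) for a finite list of bin edges.

```python
import math


def optimal_bin(j, n):
    """Finite-n bin edge (1/n) * (Gamma((j+1)n) / Gamma(jn))**(1/n),
    evaluated through log-gamma to avoid overflow."""
    return math.exp((math.lgamma((j + 1) * n) - math.lgamma(j * n)) / n - math.log(n))


def limit_bin(j):
    """Large-n limit (j+1)**(j+1) / (j**j * e), computed in log form."""
    return math.exp((j + 1) * math.log(j + 1) - j * math.log(j) - 1.0)


def estimate_length(s, bins):
    """Equation (1): return the smallest j with s <= b_j, where bins[0] is b_1.
    Values beyond the last supplied edge fall into the next bin."""
    for j, edge in enumerate(bins, start=1):
        if s <= edge:
            return j
    return len(bins) + 1


# First five large-n bin edges; the first is 4/e, approximately 1.472.
edges = [limit_bin(j) for j in range(1, 6)]
```

For $n = 1$ the finite-$n$ formula reduces to $b_j^{(1)} = j$, so the edges move from $j$ toward roughly $j + 0.5$ as the number of reads grows.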
3 Analytic error rate calculation

The error rate $\mathrm{Err}(k, n)$ for a given $k$ and $n$ is given by

$$\begin{aligned}
\mathrm{Err}(k, n) &= \frac{1}{k} \sum_{j=1}^{\infty} |j - k| \cdot \Pr(k_{\text{est}}(n) = j) \\
&= \frac{1}{k} \sum_{j=1}^{k-1} (k - j) \Pr(k_{\text{est}}(n) = j) + \frac{1}{k} \sum_{j=k+1}^{\infty} (j - k) \Pr(k_{\text{est}}(n) = j) \\
&= \frac{1}{k} \sum_{j=1}^{k-1} (k - j) \left[\Pr(k_{\text{est}}(n) < j + 1) - \Pr(k_{\text{est}}(n) < j)\right] + \frac{1}{k} \sum_{j=k+1}^{\infty} (j - k) \left[\Pr(k_{\text{est}}(n) > j - 1) - \Pr(k_{\text{est}}(n) > j)\right] \\
&= \frac{1}{k} \sum_{j=1}^{k} \Pr(k_{\text{est}}(n) < j) + \frac{1}{k} \sum_{j=k}^{\infty} \Pr(k_{\text{est}}(n) > j) \\
&= \frac{1}{k} \left[\sum_{j=1}^{k} \Pr(ns \leq b_{j-1} n) + \sum_{j=k+1}^{\infty} \Pr(ns > b_{j-1} n)\right]
\end{aligned}$$

where the last line uses $\Pr(k_{\text{est}}(n) < j) = \Pr(s \leq b_{j-1})$ and $\Pr(k_{\text{est}}(n) > j) = \Pr(s > b_j)$ from equation (1), with $b_0 = 0$. In Matlab, $\mathrm{Err}(k, n)$ is computed as

    Err(k,n) = sum(gammainc(b(1:k)*n,k*n,'lower'))/k + ...
               sum(gammainc(b(k+1:201)*n,k*n,'upper'))/k

with the sum to $\infty$ stopping at 201 (much longer than the homopolymer region lengths considered in our study). For this expression to agree with the derivation above, the Matlab vector b must hold the bin edges shifted by one index, that is b(1) $= b_0 = 0$ and b(j+1) $= b_j = (j + 1)^{j+1}/(j^j e)$ from equation (2).

From this error rate equation, the per-nucleotide error rate $e(n)$ can be computed for any sequence and for any number of reads $n$ using the equation

$$e(n) = \frac{1}{N_t} \sum_{k=1}^{k_{\max}} N_k \cdot \mathrm{Err}(k, n) \qquad (3)$$

where $N_k$ is the total number of nucleotides belonging to length-$k$ repeats in the sequence, and $N_t$ is the length of the sequence. The summation is over repeat length, so $k_{\max}$ is the longest repeat length present in the given sequence.

4 Breakdown of mean error rates

To gain some insight into the causes of errors in the multi-read consensus sequences, we can examine the percentage of the total mean error that results from insertions, deletions, and substitutions. For the three simulated signal cases (no noise, 1X noise, and 2X noise), Figure S1 shows the breakdown of the mean error rate in terms of insertions, deletions, and substitutions. In all three cases, mean error rates are shown as a function of the number of reads for the first 50 nucleotides of the Human Mitochondrial DNA sequence [6]. Data points are the mean error per nucleotide from 900 independent multi-read consensus sequences, with each read being drawn from a set of 10,000 simulated signals. Error bars reflect the standard error, which is computed as the standard deviation of each data point divided by $\sqrt{900}$.
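As a point of comparison for the simulated breakdown, the analytic error rate of Section 3 can itself be split into a deletion part ($k_{\text{est}} < k$) and an insertion part ($k_{\text{est}} > k$). The following Python sketch is an illustration under stated assumptions rather than the authors' implementation: it uses the large-$n$ bin edges of equation (2), it relies on $kn$ being an integer so that the upper regularized incomplete Gamma function reduces to a finite Poisson-type sum, and the names `bin_edge`, `gamma_upper`, `err_parts` and `per_nucleotide_error` are hypothetical.

```python
import math
from itertools import groupby


def bin_edge(j):
    """Large-n bin edge b_j from equation (2), in log form; b_0 = 0."""
    if j == 0:
        return 0.0
    return math.exp((j + 1) * math.log(j + 1) - j * math.log(j) - 1.0)


def gamma_upper(m, x):
    """Pr(G > x) for G ~ Gamma(shape m, scale 1) with integer m >= 1,
    using Q(m, x) = sum_{i=0}^{m-1} exp(-x) * x**i / i!."""
    if x <= 0.0:
        return 1.0
    terms = (math.exp(-x + i * math.log(x) - math.lgamma(i + 1)) for i in range(m))
    return min(1.0, math.fsum(terms))


def err_parts(k, n, jmax=200):
    """Return (deletion part, insertion part) of Err(k, n).

    Deletions: (1/k) * sum_{j=1}^{k-1} Pr(s <= b_j).
    Insertions: (1/k) * sum_{j=k}^{jmax} Pr(s > b_j), truncating the infinite sum.
    """
    m = k * n
    deletions = sum(1.0 - gamma_upper(m, n * bin_edge(j)) for j in range(1, k))
    insertions = sum(gamma_upper(m, n * bin_edge(j)) for j in range(k, jmax + 1))
    return deletions / k, insertions / k


def per_nucleotide_error(seq, n):
    """Equation (3): each homopolymer run of length k contributes
    k nucleotides weighted by Err(k, n)."""
    total = 0.0
    for _, run in groupby(seq):
        k = len(list(run))
        total += k * sum(err_parts(k, n))
    return total / len(seq)
```

For $k = 1$ the deletion sum is empty, so in this idealized setting an isolated nucleotide can only be miscounted upward.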
For the 10,000 simulated signals, nucleotide dwell times were randomly drawn from an exponential distribution with a mean of 1 ms.

In the no noise case, insertions account for 82.06% of the total mean error rate on average, while deletions account for the remaining 17.94%. Substitutions do not contribute to the total mean error rate because they result from additive noise and filtering, neither of which is present in this case. The large disparity between insertions and deletions can be attributed to the DNA sequence used in these simulations. Of the sequence's 50 nucleotides, 34 (68%) are single nucleotides that are not part of a homopolymer region. In the no noise case, the only sequencing errors that can occur for these bases are insertions. Additionally, the length of the sequence limits the total number of deletions that can occur, whereas there is no upper limit on the possible number of insertions. Therefore, if errors do occur during the base-calling process, we would expect the vast majority of them to be insertions.

Figure S1. Breakdown of mean error rates into insertions, deletions, and substitutions. (a) Simulated nanopore signal with no additive noise. Insertions account for the majority of the total mean error rate and substitutions do not contribute at all. (b) Simulated nanopore signal with 1X noise. As in the no noise case, insertions account for the majority of the total mean error rate. Substitutions play a small role when the number of reads is small, but quickly decrease to zero. (c) Simulated nanopore signal with 2X noise. The additional noise results in a nearly equal contribution from insertions and deletions to the total mean error rate. Substitutions also play a larger role.

In the 1X noise case, insertions account for 86.35% of the total mean error rate, deletions account for 13.50%, and substitutions account for just 0.15% on average.
These results are similar to the no noise case, as expected, with the main difference being the contribution of substitutions to the total mean error rate. Substitutions occur only when an ionic current level is long enough to be identified but the additive noise and filtering cause its amplitude to be misclassified. They are therefore rare to begin with, and their influence is quickly drowned out as the number of reads is increased.

In the 2X noise case, the contributions of insertions and deletions to the total mean error rate are almost equal, accounting for 48.97% and 47.81% on average, respectively. The contribution of substitutions is larger than in the 1X noise case, but at only 3.22% of the total mean error rate on average it is still much smaller than that of either insertions or deletions. The change in the ratio of insertion to deletion errors results from increasing the additive noise without increasing the filtering of the signal. The additional noise causes the base-calling algorithm to detect many spurious ionic current levels while, at the same time, legitimate current levels go undetected. This affects the mean error rate in two ways. First, spurious current levels increase the number of insertions, and missed current levels increase the number of deletions. Second, the addition or removal of an ionic current level skews the calculated durations of the adjacent current levels, increasing the already inherent chance of assigning an incorrect number of nucleotides to those levels.

5 Minimum dwell time for ionic current level detection

The purpose of using an enzyme in conjunction with a nanopore for sequencing is to reduce the speed of the molecule passing through the pore or, in other words, to increase the dwell time of the molecule in the pore [2].
The longer each nucleotide of a DNA sequence resides in the nanopore, the easier it is to sense and identify. With this in mind, we ask the question: how long does the dwell time of a nucleotide need to be in order for its ionic current level to be detected and identified? To begin to answer this question, we identified the two situations where the dwell time of a nucleotide has the greatest impact on whether or not the ionic current level it produces is detectable with our step-detection algorithm.

The first situation, illustrated in Figure S2a, is a short ionic current level in the form of a pulse, with an amplitude at one extreme of the scale, situated between two longer current levels with amplitudes at the other extreme of the scale. An example of a sequence that would produce such an ionic current signal is TAT, where T is mapped to an amplitude of 0 pA and A is mapped to an amplitude of 3 pA. If the dwell time of the center nucleotide (A) is too short, the filtered ionic current signal will rise and fall so sharply that the gradient and amplitude will be outside the thresholds of the step-detection algorithm, and the current level produced by the nucleotide will go undetected.

Figure S2. Worst case scenarios that affect the minimum dwell time for detecting ionic current levels. Simulation of a measured ionic current signal from nanopore experiments (grey), additionally filtered signal for step detection (black), and the noiseless ionic current levels (red). (a) A short ionic current level taking the form of a pulse in the measured signal is difficult to detect if its gradient is too steep, its peak too narrow, or its maximum amplitude occurs outside of the threshold. (b) A short intermediate ionic current level between two longer levels is difficult to detect if its gradient does not sufficiently flatten out.
The second situation, illustrated in Figure S2b, is a short intermediate ionic current level that forms a staircase between the two current levels adjacent to it. An example of a sequence that would produce such an ionic current signal is TCA, where T is mapped to an amplitude of 0 pA, C is mapped to an amplitude of 1 pA, and A is mapped to an amplitude of 3 pA. If the dwell time of the center nucleotide (C) is too brief, the filtered ionic current signal will be smoothed so much that it passes through the current level produced by the nucleotide without flattening out, leaving the level undetected.

For these two cases, we determined from our simulations with 1X noise that the minimum dwell time for an ionic current level to be reliably detected is approximately 170 μs. The 10-90% rise time of the step response of the total filter (98 μs) is consistent with the transition times observed in Figure S2.

6 Mapping nucleobases to amplitudes

In this work, the mapping of nucleobases (A, G, C, T) to ionic current amplitudes (3, 2, 1, 0) pA was chosen arbitrarily, without any consideration of how this mapping would affect the mean error rates. One could argue that choosing a particular mapping for a given sequence may artificially reduce or increase the mean error rate. Specifically, for a given sequence, one mapping will have a higher frequency of the largest (3 pA) amplitude transition than any other mapping, and the 3 pA transition is easier to detect in noisy data than the smaller (1 and 2 pA) transitions. In such cases, error rates could be measurably affected by our arbitrary choice of base-to-amplitude mapping. To test this, we examined the effects of different base-amplitude mappings for simulated nanopore signals with 1X noise and a mean dwell time of 1 ms. The labels for the three curves shown in Figure S3 reflect the base-amplitude mapping used, with amplitudes decreasing from left to right.
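How a mapping label translates into amplitude transitions can be sketched in Python as follows. This is a minimal illustration, not the simulation code used for Figure S3: the helper names `mapping_from_label` and `step_sizes` are hypothetical, and the eight-base test sequence is arbitrary rather than taken from the mitochondrial sequence.

```python
from collections import Counter


def mapping_from_label(label):
    """Turn a label such as 'AGCT' into {'A': 3, 'G': 2, 'C': 1, 'T': 0},
    i.e. amplitudes in pA decreasing from left to right."""
    return {base: len(label) - 1 - i for i, base in enumerate(label)}


def step_sizes(seq, label):
    """Count the absolute amplitude change (pA) at each boundary between
    differing neighbouring bases; homopolymer boundaries produce no step."""
    amp = mapping_from_label(label)
    return Counter(abs(amp[a] - amp[b]) for a, b in zip(seq, seq[1:]) if a != b)


# Arbitrary illustrative sequence, compared under two mapping labels.
example = "ACGTTAGC"
steps_agct = step_sizes(example, "AGCT")
steps_catg = step_sizes(example, "CATG")
```

Comparing the two counters shows how relabelling shifts transitions between the 1, 2 and 3 pA classes while leaving the total number of transitions unchanged.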
As in previous simulations, mean error rates are shown as a function of the number of reads for the first 50 nucleotides of the Human Mitochondrial DNA sequence [6]. Data points are the mean error per nucleotide from 900 independent multi-read consensus sequences, with each read being drawn from a set of 10,000 simulated signals. Error bars reflect the standard error, which is computed as the standard deviation of each data point divided by $\sqrt{900}$. As can be seen in Figure S3, rearranging the amplitude mappings has little effect on the mean error rate for this sequence.

Figure S3. Effect of changing nucleobase amplitude mappings on mean error rate. Simulated nanopore signals with 1X noise. Amplitudes (in pA) assigned to bases decrease from left to right, i.e. for the curve CATG, base-amplitude mappings are C → 3, A → 2, T → 1, and G → 0. The curve AGCT reflects the base-amplitude mapping used in the paper. Rearranging the amplitude mappings has virtually no effect on the mean error rate.

7 Influence of incorrect mean dwell time estimate on error rate

For all of the previous simulations in this work, it is assumed that the mean of the exponential distribution from which the dwell times for each nucleotide are drawn is known. In practice, however, the mean dwell time needs to be estimated from the data. Since the mean dwell time ($\tau$) plays such a major role in the base-calling algorithm, it is important to understand the effect on the error rate of underestimating or overestimating $\tau$.

Figure S4 shows the mean error rates for underestimating $\tau$ by half its value ($\hat{\tau} = 0.5\tau$), overestimating $\tau$ by twice its value ($\hat{\tau} = 2\tau$), and estimating $\tau$ exactly ($\hat{\tau} = \tau$). As in previous simulations, mean error rates are shown as a function of the number of reads for the first 50 nucleotides of the Human Mitochondrial DNA sequence [6].
Data points are the mean error per nucleotide from 900 independent multi-read consensus sequences, with each read being drawn from a set of 10,000 simulated signals. Error bars reflect the standard error, which is computed as the standard deviation of each data point divided by $\sqrt{900}$.

Since the noise and filtering are the same for all three cases, there is no difference in the detection of ionic current levels. The differences in the curves come from the assignment of nucleotides to the identified ionic current levels using the optimal binning scheme. When $\tau$ is underestimated, the optimal binning scheme assigns too many nucleotides to each ionic current level, resulting in a large number of insertion errors. Increasing the number of reads has little effect on the mean error rate because each new read also has a large number of insertions at every ionic current level. When $\tau$ is overestimated, the optimal binning scheme does not assign enough nucleotides to each ionic current level, resulting in a large number of deletions. However, unlike the number of possible insertions, which is unbounded, the total number of possible deletions in a given sequence is limited by the number of nucleotides; there cannot be more deletions than nucleotides. This results in a much lower mean error rate for overestimating $\tau$ (15.74% for 30-read consensus sequences) than for underestimating $\tau$ (112.87% for 30-read consensus sequences). Just as in the underestimated case, increasing the number of reads does little to improve the error rate because each new read also has a large number of deletions at every ionic current level.

Figure S4. Effect of estimating the mean dwell time on the mean error rate. Simulated nanopore signals with 1X noise, a true mean dwell time of 1 ms, and varying estimates ($\hat{\tau}$) of the mean dwell time ($\tau$) used for base-calling. Underestimating the mean dwell time results in more nucleotides being assigned to each ionic current level, which increases the number of insertions along with the mean error rate. Overestimating the mean dwell time results in fewer nucleotides being assigned to each ionic current level, which increases the number of deletions. This does not increase the mean error rate for a small number of reads because, while the number of deletions is increased, the number of insertions is decreased. Since insertions are the main drivers of the mean error rate, this actually improves the mean error rate for a small number of reads. In both cases, increasing the number of reads does little to improve the mean error rate.

8 Influence of mean dwell time on error rate

Enzyme-controlled translocation rates of DNA through a nanopore have been experimentally measured and modeled as being exponentially distributed [2]. To mimic these experimental conditions, the simulated dwell times of nucleotides in the nanopore sensor are randomly drawn from an exponential distribution with a known mean. The larger the mean dwell time, the more slowly DNA passes through the nanopore and the easier it should be to detect and identify the ionic current levels induced by each nucleotide. To examine how the mean dwell time influences the sequencing error rate, we simulated nanopore signals with 1X noise for several different mean dwell times.

Figure S5 shows the mean error rates for four different mean dwell times compared to the analytic error rate described in Section 3. In all four cases, mean error rates are shown as a function of the number of reads for the first 50 nucleotides of the Human Mitochondrial DNA sequence [6]. Data points are the mean error per nucleotide from 900 independent multi-read consensus sequences, with each read being drawn from a set of 10,000 simulated signals.
Error bars reflect the standard error, which is computed as the standard deviation of each data point divided by $\sqrt{900}$. As expected, the mean error rate decreases as the mean dwell time increases. For a mean dwell time of 0.1 ms, the mean error rate for the 50-read consensus sequences is 90.88%. As the mean dwell time is increased, the mean error rate converges towards the minimum analytic error rate, reaching just 0.47% for the 50-read consensus sequences at a mean dwell time of 10 ms.

Figure S5. Effect of mean dwell time on the mean error rate. Simulated nanopore signals with 1X noise and varying mean dwell times. The mean error rates decrease with an increase in the mean dwell time and eventually converge to the analytic error rate.

9 Error rates of multi-nucleotide nanopore sensors with no systematic errors

In previous sections, it has been assumed that the nanopore sensor has single nucleotide sensitivity, resulting in the ionic current signal being a function of only one nucleotide in the channel. In practice, however, nanopore sensors do not have single nucleotide sensitivity and ionic current signals are functions of multiple nucleotides in the channel [2]. In Section 1, we considered cases in which the ionic current is a function of $N \geq 1$ nucleotides, and circumstances under which the number of distinct amplitude levels $M$ (with $M \leq 4^N$) may not be sufficient to unambiguously identify the DNA sequence. If, however, the number of distinct amplitude levels is sufficient to unambiguously identify the sequence, then an $N$-nucleotide nanopore sensor would be capable of identifying all homopolymer regions of length $N$ and smaller in a single read. In that case, a larger nanopore sensor would require fewer reads to achieve a desired error threshold for any biologically relevant sequence.
In this section, we assume that an $N$-nucleotide nanopore sensor (with $N = 2, 3, 4$) has no systematic errors (i.e., can resolve any length-$N$ sequence in the sensor), and consider the consequent improvement in error-rate performance when compared to the error-rate performance with $N = 1$. Figure S6 shows a comparison of the analytic error rates for the four different nanopore sensors with $N = 1, 2, 3, 4$. The error rates were computed for the complete Human Mitochondrial DNA sequence [6] using the method described in Section 3. For a single read, the analytic error rate drops from 40.5% to 1.24% as the size of the sensor is increased from one to four nucleotides, a dramatic improvement. However, the analytic error rates eventually converge for all four cases as the number of reads increases, resulting in less dramatic improvement when higher accuracy is required. For example, the four-nucleotide sensor still requires 130 reads to achieve Q40 accuracy, as opposed to 156 reads for the single nucleotide sensor.

Figure S6. Effect of nanopore sensor footprint on the analytic error rates in the absence of systematic channel errors. The analytic error rates decrease with an increase in the size of the sensor footprint, but approach the single nucleotide sensor error rate as the number of reads increases.

10 References

[1] Cherf, G. M., Lieberman, K. R., Rashid, H., Lam, C. E., Karplus, K., Akeson, M., Nat. Biotechnol. 2012, 30, 344–348.
[2] Manrao, E. A., Derrington, I. M., Laszlo, A. H., Langford, K. W., Hopper, M. K., Gillgren, N., Pavlenok, M., Niederweis, M., Gundlach, J. H., Nat. Biotechnol. 2012, 30, 349–353.
[3] Sakmann, B., Neher, E. (Eds.), Single-Channel Recording, Plenum Press, New York 1995, 2nd edn.
[4] Venkataramanan, L., Sigworth, F. J., Biophys. J. 2002, 82, 1930–1942.
[5] Qin, F., Auerbach, A., Sachs, F., Biophys. J. 2000, 79, 1928–1944.
[6] Sanchez-Cespedes, M., Parrella, P., Nomoto, S., Cohen, D., Xiao, Y., Esteller, M., Jeronimo, C., Jordan, R. C. K., Nicol, T., Koch, W. M., Schoenberg, M., Mazzarelli, P., Fazio, V. M., Sidransky, D., Cancer Res. 2001, 61, 7015–7019.