High accuracy base space sequencing: results from error-correction ligation chemistry Asim Siddiqui, Marcin Sikora, Somalee Datta, Chengyong Yang, Heinz Breu, Dima Brinza, Cisylia Duncan, Ryan Hsu, Srikanth Jandhyala, Brijesh Krishnaswami, Matthew Muller, Vasihnavi Panchapakesa, Daryl Thomas, Vasisht Tadigotla, Sowmi Utiramerur, Arjun Vadapalli, Eric White, Tanya Sokolsky, Yuandan Lou, Amitabh Shukla, Clarence Lee, Alan Blanchard ,Kevin McKernan, Fiona Hyland and Ellen Beasley Results Figure 1. Example of color encoding for base sequence u = [TCGTGTGCTTCCGAAG]. Colors arranged by A) order of data acquisition, B) base position and probe set. Figure 4. Majority of bases from synthetic template are sequenced at > 99.9999% accuracy 50 P1 adapter T C G T G T G C T T C C G A A G u0 u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 u12 u13 u14 u15 xA = [ Base Sequence u = [ Probe Set 1 Probe Set 2 xB = [ ] T C G T G T G C T T C C G A A G ] ] 50bp Mapped Reads ... The encoding properties of the ECC chemistry are shown in Figure 1.The first five cycles are identical to the standard 2 base encoded probes used in the original colour space chemistry. ECC adds a 6th cycle which produces redundant information spanning a series of 4 bases and represents a 4 base code.. This redundant information provides the means for detecting and correcting errors. Although, we have chosen a combination of 2 base and 4 base encoded probes, other combinations are possible and SOLiD chemistry allows up to 5 base codes. Default Library RunMetadata File Root Non-indexed runs only Unclassified TagDetails Library 1 0001 High accuracy reads provide an advantage in experiments where the errors cannot be overcome by throughput through redundant sequencing. Such experiments include those where subpopulations are present and their elucidation is critical. For example, searching for circulating tumor cells or a search for a specific variant in a tissue biopsy are both examples from cancer research where subpopulations are important. Metagenome sequencing is another example – teasing apart closely clades is made simpler with higher accuracy reads especially if some populations are present at low frequency. We have developed ECC chemistry as a means of producing higher accuracy reads. The properties of these reads enable both error detection and correction of the sequenced template. This poster describes the chemistry and the bioinformatics analysis of ECC. Read length Perfect Reads F3 Base or Color ! Library 2 Indexed runs only INTRODUCTION Sample Synthetic template beads 0002 R3 ! Library 96 Indexing Needed for re-indexing 0660 Fragments ECC run only Figure 3. LifeScope Software 0002 New data requires a new format. We have b u i l t t h e X S Q (eXtensible SeQuence) file format upon the open HDF format [1]. The format supports base space data and colour space data. Hence, it supports both classic 2 base encoded data and the new ECC colour data. It can also store the base space sequence derived from the ECC codes. XSQ is a binary format and hence offers compact storage. Raw Accuracy 99.6% 99.7% 45 90% 40 35 Predicted vs observed QV Distribution of QV 30 1.0 45 99.9% 25 0.8 20 99.99% 40 For placement only (delete box) 0.6 10 99.999% 30 25 20 15 0.2 5 0 10 20 30 40 50 60 70 80 Reads ordered from best to worst (in % of total beads) 90 100 99.9999% < 99.9999% Synthetic template beads are preloaded with DNA of known sequence. As such they are not created through the standard sample preparation method, but they are run on the instrument as a standard run. The resulting FASTQ file was filtered and end trimmed using a base QVs prior to mapping. Instrument throughput is reported on filtered and trimmed reads. Figure 5. Including sample preparation, average accuracy over all bases is > 99.99% 50 Sample Description unmapped DH10b Read length 50bp 40 35 Mapped Reads 99.7% Perfect Reads 98.3% 99% 30 99.9% 25 20 99.99% 15 Raw Accuracy 99.9905% 10 0.0 <5 <10 <15 <20 <25 QV <30 <35 <40 40 and above 5 0 <5 <10 <15 <20 <25 <30 <35 <40 40 and above predicted QV With ECC and reference mediated base space translation, 90% of the bases have a reported accuracy > 99.99%. This predicted or reported accuracy is validated by mapping to the genome as shown in Figure 5. Figure 7 also attests to the accuracy of reported QVs. The reported QVs behave well when compared to measured QV rates over the entire range of QVs CONCLUSIONS 45 90% Sample For placement only (delete box) 35 0.4 10 This poster presented an analysis of base space data derived using ECC. Analysis of this data will be supported by the LifeScope Software. We have demonstrated the potential of achieving individual base accuracy of > 99.9999% with synthetic template beads sequenced on the platform. Using standard sample preparation methods, we achieve > 99.99% accuracy. This method provides a platform for further improvements to sample preparation improving accuracy beyond what is achievable today. Finally, accurate QVs are important for bioinformatics tools to work well. We show that the reported QVs behave as expected over the full range of values. 99.999% 5 0 SOLiDTM The ECC chemistry is available on the 5500 instrument. This poster also introduces LifeScopeTM Software, the new data analysis software package for the SOLiD 5500 instrument. 10 20 30 40 50 60 70 80 Reads ordered from best to worst (in % of total beads) 90 100 99.9999% < 99.9999% REFERENCES 1. HDF – www.hdfgroup.org Unlike the data shown in Figure 4, this sample is derived from DH10b strain of E. coli and processed through the standard sample preparation steps (sufficient DNA was used to remove the need for amplification cycles of PCR). Here the error rate is higher than in Figure 5. We see a higher error rate in the first few bases and this has been associated with the chemistry of attaching the adaptor sequence. The perfect matches in the last 3 bases are an artifact arising from the mapping parameters – in order for the alignment to extend the full length of the read, there must be no mismatches in the last three bases, otherwise the alignment is terminated before the end of the read. The last 3 bases are excluded from accuracy calculation. MATERIALS AND METHODS Figures 4 and 5 demonstrate the error introduced in the sample preparation step. While with present methods, this places a bound on the achievable accuracy, ECC provides a unique tool for understanding the source of sample preparation errors and hence modifying the protocol to lower the error beyond what is achievable today. A screenshot from the LifeScope Software. The SOLiD 5500 instrument and the LifeScope Workstation Figure 7. QV accuracy is an important factor in accuracy 99% 15 99.9983% Figure 6. 90% of bases have QV >= 40 observed true phred Primer n xA,2 = xA,7 = xA,12 = Primer n-1 xA,1 = xA,6 = xA,11 = bridge probe Primer n-2 xA,5 = xA,10 = xA,15 = bridge probe Primer n-3 xA,4 = xA,9 = xA,14 = bridge probe Primer n-4 xA,3 = xA,8 = xA,13 = bridge probe Primer n-4 xB,1 = xB,2 = xB,3 = unmapped Most bioinformatics tools make extensive use of reported QVs to make accurate SNP calls. Inaccurate QVs will result in higher false positives and false negatives and hence a lower validation rate for the reported results. We can further boost accuracy by using the reference as a prior. Figure 7 shows that the reported QV values have the expected behaviour. % of all bases Sample Description Figure 2. XSQ file format When we apply the ECC chemistry to real samples, we observe errors at the sub 1% level that are clearly induced in the sample preparation steps. ECC chemistry has enabled the optimization of sample preparation to reduce these errors to the sub 0.01% level. We will describe reference-free ECC sequencing of microbes and human germline and introduce reference-assisted ECC sequencing which reduces errors further The Importance of Accurate QVs Base position Sequencing read accuracy is critical for applications such as cancer or environmental sequencing where small subpopulations may play an important role or to improve variant calling in germline sequencing. Throughput cannot overcome lower accuracy sequencing when the variant sought is present at a frequency close to the base error rate of the sequencer. We report redundant sequencing of the DNA template using mathematical principles employed in the error correcting codes found in compact discs (CD’s) and telecommunications protocols. Standard ‘2-base encoding’ chemistry utilizes ligation probes that sequence in steps of 5 bases with 5 sequential primers. Probes are pooled such that only 2 of the 5 bases are interrogated. We then interrogate the DNA template with an additional primer in an orthogonal colour space to the 2-base colour determinations i.e. using ligation probes that are pooled and labeled to interrogate different bases. Since the DNA ligase provides specificity over 5 bases (enabling codes as long as 5 bases), we are able to interrogate multiple bases with this single additional primer round – we have chosen to use 4base encoded probes. The orthogonal coding is a more powerful than simple resequencing. Using synthetic templates, we demonstrate that the sequencing chemistry is accurate to greater than 99.99% accuracy with the majority bases achieving 99.9999% or higher. Higher accuracy data is more valuable mitigating the cost and time of running an additional primer Material and Methods Base position ABSTRACT ACKNOWLEDGEMENTS We thank members of the sample preparation team, Beverly sequencing team, software team and project management teams of SOLiD 5500 and LifeScope Software. TRADEMARKS/LICENSING © 2011 Life Technologies Corporation. All rights reserved. The trademarks mentioned herein are the property of Life Technologies Corporation or their respective owners. For Research Use Only. Not intended for animal or human therapeutic or diagnostic use. Life Technologies • 5791 Van Allen Way • Carlsbad, CA 92008 • www.lifetechnologies.com
© Copyright 2026 Paperzz