High accuracy base space sequencing

High accuracy base space sequencing: results from error-correction ligation
chemistry
Asim Siddiqui, Marcin Sikora, Somalee Datta, Chengyong Yang, Heinz Breu, Dima Brinza, Cisylia Duncan, Ryan Hsu, Srikanth Jandhyala, Brijesh Krishnaswami, Matthew Muller, Vasihnavi Panchapakesa, Daryl
Thomas, Vasisht Tadigotla, Sowmi Utiramerur, Arjun Vadapalli, Eric White, Tanya Sokolsky, Yuandan Lou, Amitabh Shukla, Clarence Lee, Alan Blanchard ,Kevin McKernan, Fiona Hyland and Ellen Beasley
Results
Figure 1. Example of color encoding for base sequence u = [TCGTGTGCTTCCGAAG].
Colors arranged by A) order of data acquisition, B) base position and probe set.
Figure 4. Majority of bases from synthetic template are sequenced at > 99.9999% accuracy
50
P1 adapter
T C G T G T G C T T C C G A A G
u0 u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 u12 u13 u14 u15
xA =
[
Base Sequence u =
[
Probe Set 1
Probe Set 2
xB =
[
]
T C G T G T G C T T C C G A A G
]
]
50bp
Mapped
Reads
...
The encoding properties of the ECC chemistry are shown in Figure 1.The first five cycles are identical to the
standard 2 base encoded probes used in the original colour space chemistry. ECC adds a 6th cycle which
produces redundant information spanning a series of 4 bases and represents a 4 base code.. This redundant
information provides the means for detecting and correcting errors. Although, we have chosen a combination of
2 base and 4 base encoded probes, other combinations are possible and SOLiD chemistry allows up to 5 base
codes.
Default
Library
RunMetadata
File Root
Non-indexed
runs only
Unclassified
TagDetails
Library 1
0001
High accuracy reads provide an advantage in experiments where the errors cannot be
overcome by throughput through redundant sequencing. Such experiments include those
where subpopulations are present and their elucidation is critical. For example, searching for
circulating tumor cells or a search for a specific variant in a tissue biopsy are both examples
from cancer research where subpopulations are important. Metagenome sequencing is
another example – teasing apart closely clades is made simpler with higher accuracy reads
especially if some populations are present at low frequency. We have developed ECC
chemistry as a means of producing higher accuracy reads. The properties of these reads
enable both error detection and correction of the sequenced template. This poster describes
the chemistry and the bioinformatics analysis of ECC.
Read length
Perfect Reads
F3
Base or Color
!
Library 2
Indexed
runs only
INTRODUCTION
Sample
Synthetic
template beads
0002
R3
!
Library 96
Indexing
Needed for
re-indexing
0660
Fragments
ECC run only
Figure 3. LifeScope Software
0002
New data requires a
new format. We have
b u i l t t h e X S Q
(eXtensible SeQuence)
file format upon the
open HDF format [1].
The format supports
base space data and
colour space
data.
Hence, it supports both
classic 2 base encoded
data and the new ECC
colour data. It can also
store the base space
sequence derived from
the ECC codes. XSQ is
a binary format and
hence offers compact
storage.
Raw Accuracy
99.6%
99.7%
45
90%
40
35
Predicted vs observed QV
Distribution of QV
30
1.0
45
99.9%
25
0.8
20
99.99%
40
For placement only
(delete box)
0.6
10
99.999%
30
25
20
15
0.2
5
0
10
20
30
40
50
60
70
80
Reads ordered from best to worst (in % of total beads)
90
100
99.9999%
< 99.9999%
Synthetic template beads are preloaded with DNA of known sequence. As such they are not created through the
standard sample preparation method, but they are run on the instrument as a standard run. The resulting FASTQ
file was filtered and end trimmed using a base QVs prior to mapping. Instrument throughput is reported on filtered
and trimmed reads.
Figure 5. Including sample preparation, average accuracy over all bases is > 99.99%
50
Sample Description
unmapped
DH10b
Read length
50bp
40
35
Mapped
Reads
99.7%
Perfect Reads
98.3%
99%
30
99.9%
25
20
99.99%
15
Raw Accuracy
99.9905%
10
0.0
<5
<10
<15
<20
<25
QV
<30
<35
<40 40 and
above
5
0
<5
<10
<15
<20
<25
<30
<35
<40 40 and
above
predicted QV
With ECC and reference mediated base space translation, 90% of the bases have a reported accuracy > 99.99%. This
predicted or reported accuracy is validated by mapping to the genome as shown in Figure 5. Figure 7 also attests to the
accuracy of reported QVs. The reported QVs behave well when compared to measured QV rates over the entire range of
QVs
CONCLUSIONS
45
90%
Sample
For placement only
(delete box)
35
0.4
10
This poster presented an analysis of base space data derived using ECC. Analysis of this data
will be supported by the LifeScope Software. We have demonstrated the potential of achieving
individual base accuracy of > 99.9999% with synthetic template beads sequenced on the
platform. Using standard sample preparation methods, we achieve > 99.99% accuracy. This
method provides a platform for further improvements to sample preparation improving accuracy
beyond what is achievable today. Finally, accurate QVs are important for bioinformatics tools to
work well. We show that the reported QVs behave as expected over the full range of values.
99.999%
5
0
SOLiDTM
The ECC chemistry is available on the
5500 instrument. This poster also introduces
LifeScopeTM Software, the new data analysis software package for the SOLiD 5500
instrument.
10
20
30
40
50
60
70
80
Reads ordered from best to worst (in % of total beads)
90
100
99.9999%
< 99.9999%
REFERENCES
1.  HDF – www.hdfgroup.org
Unlike the data shown in Figure 4, this sample is derived from DH10b strain of E. coli and processed through the
standard sample preparation steps (sufficient DNA was used to remove the need for amplification cycles of PCR).
Here the error rate is higher than in Figure 5. We see a higher error rate in the first few bases and this has been
associated with the chemistry of attaching the adaptor sequence. The perfect matches in the last 3 bases are an
artifact arising from the mapping parameters – in order for the alignment to extend the full length of the read, there
must be no mismatches in the last three bases, otherwise the alignment is terminated before the end of the read.
The last 3 bases are excluded from accuracy calculation.
MATERIALS AND METHODS
Figures 4 and 5 demonstrate the error introduced in the sample preparation step. While with present methods,
this places a bound on the achievable accuracy, ECC provides a unique tool for understanding the source of
sample preparation errors and hence modifying the protocol to lower the error beyond what is achievable today.
A screenshot from the LifeScope Software.
The SOLiD 5500 instrument and the LifeScope Workstation
Figure 7. QV accuracy is an important
factor in accuracy
99%
15
99.9983%
Figure 6. 90% of bases have QV >= 40
observed true phred
Primer n
xA,2 = xA,7 = xA,12 = Primer n-1
xA,1 = xA,6 = xA,11 = bridge probe
Primer n-2
xA,5 = xA,10 = xA,15 = bridge probe
Primer n-3
xA,4 = xA,9 = xA,14 = bridge probe
Primer n-4
xA,3 = xA,8 = xA,13 = bridge probe
Primer n-4
xB,1 = xB,2 = xB,3 = unmapped
Most bioinformatics tools make extensive use of reported QVs to make accurate SNP calls. Inaccurate QVs will
result in higher false positives and false negatives and hence a lower validation rate for the reported results. We
can further boost accuracy by using the reference as a prior. Figure 7 shows that the reported QV values have the
expected behaviour.
% of all bases
Sample Description
Figure 2. XSQ file format
When we apply the ECC chemistry to real samples, we observe errors at the sub 1% level
that are clearly induced in the sample preparation steps. ECC chemistry has enabled the
optimization of sample preparation to reduce these errors to the sub 0.01% level. We will
describe reference-free ECC sequencing of microbes and human germline and introduce
reference-assisted ECC sequencing which reduces errors further
The Importance of Accurate QVs
Base position
Sequencing read accuracy is critical for applications such as cancer or environmental
sequencing where small subpopulations may play an important role or to improve variant
calling in germline sequencing. Throughput cannot overcome lower accuracy sequencing
when the variant sought is present at a frequency close to the base error rate of the
sequencer. We report redundant sequencing of the DNA template using mathematical
principles employed in the error correcting codes found in compact discs (CD’s) and
telecommunications protocols. Standard ‘2-base encoding’ chemistry utilizes ligation probes
that sequence in steps of 5 bases with 5 sequential primers. Probes are pooled such that only
2 of the 5 bases are interrogated. We then interrogate the DNA template with an additional
primer in an orthogonal colour space to the 2-base colour determinations i.e. using ligation
probes that are pooled and labeled to interrogate different bases. Since the DNA ligase
provides specificity over 5 bases (enabling codes as long as 5 bases), we are able to
interrogate multiple bases with this single additional primer round – we have chosen to use 4base encoded probes. The orthogonal coding is a more powerful than simple resequencing.
Using synthetic templates, we demonstrate that the sequencing chemistry is accurate to
greater than 99.99% accuracy with the majority bases achieving 99.9999% or higher. Higher
accuracy data is more valuable mitigating the cost and time of running an additional primer
Material and Methods
Base position
ABSTRACT
ACKNOWLEDGEMENTS
We thank members of the sample preparation team, Beverly sequencing team, software team
and project management teams of SOLiD 5500 and LifeScope Software.
TRADEMARKS/LICENSING
© 2011 Life Technologies Corporation.
All rights reserved.
The trademarks mentioned herein are the property of Life Technologies Corporation or their
respective owners.
For Research Use Only. Not intended for animal or human therapeutic or diagnostic use.
Life Technologies • 5791 Van Allen Way • Carlsbad, CA 92008 • www.lifetechnologies.com