Finding Deletions with Exact Break Points from Noisy Low Coverage Paired-end Short Sequence Reads
Jin Zhang and Yufeng Wu
Department of Computer Science and Engineering
University of Connecticut
Introduction
•Structural variants
[Figure: reference vs. alternative sequence, showing a deletion and an insertion]
•Our problem: finding exact break points of deletions efficiently, using low coverage noisy data
•Data (from 1000 Genomes Project pilot 1)
Low coverage (2-6x); Illumina (2009_08);
45 individuals in CEU population (combined);
580 G in BAM format
Both ends mapped:
  In region without deletion
  Abnormal insertion size (mean insertion size + 3 sds)
Only one end mapped: 24G in fastq format, 250 million reads
  Real split-reads (also contain errors)
  Sequence errors & other errors (more than 99%)
[Figure: read pairs on the reference and alternative sequences, showing the deletion and the anchor of a split-read]
•Achieve: deletions with exact break points, found efficiently
Method
 Split-reads Mapping
[Figure: a spanning pair (abnormal insertion size) and a split-read with its anchor, shown on the reference and alternative sequences around the deletion]
We map split-reads using the Burrows-Wheeler Transform (BWT) for efficiency
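As an illustration of why a BWT index makes mapping efficient, here is a minimal sketch of textbook backward search over a toy FM-index. It is not the authors' implementation: the function names `build_fm_index` and `find_exact` are hypothetical, the suffix array is built naively, and only exact matching is shown.

```python
def build_fm_index(text):
    """Build a toy FM-index (naive suffix sorting; only suitable for short examples)."""
    text += "$"  # sentinel: assumed absent from the text and smaller than A/C/G/T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ


def find_exact(pattern, index):
    """Return all start positions of `pattern`, using backward search."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_fm_index("CACAATACCCTCTCACACCAACGTTACG")
print(find_exact("CAC", index))  # [0, 13, 15]
```

Each pattern character costs a constant number of table lookups, so the search time depends on the read length rather than on the reference length.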
Inexact mapping
reference:  CACAATACCCTCTCACACCAACGTTACG
split-read: CAAT CCCTC (indel) and ACGTAACG (mismatch)
SVs near SNPs and indels can be found; reads with errors can be used
Hits not unique (e.g., a hit at 2 positions on the reference)
Split not unique
Report the hits and splits with the best quality
Search locally (e.g., search near a region of 1Mb)
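The points above (non-unique splits, best-quality reporting, local search) can be sketched as a brute-force scan. This is only an illustration under stated assumptions: the function `best_splits`, its parameters, and the use of mismatch count as the "quality" measure are hypothetical, indels are not handled, and the real method uses the BWT index rather than a linear scan.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_splits(reference, read, left_start, window=1_000_000,
                max_mismatch=1, min_part=4):
    """Try every split of `read` whose left part is placed at `left_start`.

    The right part is searched only within `window` bases downstream
    (local search). Returns (mismatches, split, deletion_start, deletion_end)
    tuples that share the lowest mismatch count; the deleted reference
    interval is [deletion_start, deletion_end).
    """
    hits = []
    for split in range(min_part, len(read) - min_part + 1):
        left, right = read[:split], read[split:]
        del_start = left_start + split
        left_mm = hamming(left, reference[left_start:del_start])
        if left_mm > max_mismatch:
            continue
        last = min(len(reference) - len(right), del_start + window)
        for right_start in range(del_start + 1, last + 1):
            window_seq = reference[right_start:right_start + len(right)]
            mm = left_mm + hamming(right, window_seq)
            if mm <= max_mismatch:
                hits.append((mm, split, del_start, right_start))
    if not hits:
        return []
    best = min(h[0] for h in hits)
    return [h for h in hits if h[0] == best]
```

When several splits tie on quality, all of them are returned, which is the "split not unique" situation described above.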
Method
 Calling candidate deletions
Reference:  TACGTTTAACCATACGGCCAAAACGTAACGT
Split-read: TTAACCATACGTAACGT
Split as TTAACCAT | ACGTAACGT (leftmost) or TTAACCATACG | TAACGT (rightmost)
(1) Sort the split-reads by their leftmost break points
(2) Cluster the split-reads supporting the same candidate
(3) Call the candidates
Cutoff value: the minimum number of split-reads that must support a candidate
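The three steps above can be sketched as follows. This is a minimal illustration, not the authors' code: `shift_leftmost` and `call_candidates` are hypothetical names, a deletion is represented as a half-open reference interval `[start, end)`, and "supporting the same candidate" is assumed to mean identical break points after shifting to the leftmost equivalent position.

```python
from itertools import groupby


def shift_leftmost(reference, start, end):
    """Slide the deletion [start, end) left while the resulting sequence is unchanged."""
    while start > 0 and reference[start - 1] == reference[end - 1]:
        start -= 1
        end -= 1
    return start, end


def call_candidates(reference, split_hits, cutoff=2):
    """Return (start, end, support) for deletions supported by >= cutoff split-reads."""
    # Step 1: normalize to leftmost break points and sort.
    normalized = sorted(shift_leftmost(reference, s, e) for s, e in split_hits)
    candidates = []
    # Step 2: cluster split-reads with identical leftmost break points.
    for (start, end), group in groupby(normalized):
        support = len(list(group))
        # Step 3: call the candidate if enough split-reads support it.
        if support >= cutoff:
            candidates.append((start, end, support))
    return candidates


reference = "TACGTTTAACCATACGGCCAAAACGTAACGT"
# The rightmost split (16, 25) and the leftmost split (13, 22) describe the same deletion.
print(call_candidates(reference, [(16, 25), (13, 22), (14, 23)]))  # [(13, 22, 3)]
```

Normalizing before clustering matters because split-reads that report the same deletion at different equivalent positions would otherwise be counted as separate candidates and might each fall below the cutoff.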
Method
 Candidates validation (calling deletions)
Has spanning pair
[Figure: a spanning pair over the candidate, shown on the reference and alternative sequences; its abnormal insertion size may be caused by the deletion]
If there exist spanning pairs with abnormal insertion size, we validate the deletion.
Note that the candidate comes from split-read mapping.
No spanning pair (not enough information, so we cannot validate)
[Figure: the candidate on the reference with no spanning pair]
Two cases are possible:
Low coverage may leave no spanning pair even though there is a deletion.
The split-read is mapped incorrectly or contains errors, and there is no deletion.
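The validation rule can be sketched as below. This is an assumption-laden illustration: the function `validate_candidate`, the representation of a read pair as `(leftmost_pos, rightmost_pos)` on the reference, and the definition of insertion size as the distance between those two coordinates are all hypothetical. The threshold of mean insertion size + 3 sds is the one given in the data description.

```python
def validate_candidate(candidate, pairs, mean_insert, sd_insert, n_sd=3):
    """Return True if some read pair spans the candidate with abnormal insertion size.

    candidate: (start, end) of the deleted reference interval [start, end).
    pairs: iterable of (leftmost_pos, rightmost_pos) for pairs with both ends mapped.
    A False result means "cannot validate", not "no deletion".
    """
    start, end = candidate
    threshold = mean_insert + n_sd * sd_insert
    for left_pos, right_pos in pairs:
        spans = left_pos < start and right_pos >= end
        abnormal = (right_pos - left_pos) > threshold
        if spans and abnormal:
            return True
    return False
```

A `False` return deliberately stays neutral: at low coverage the absence of a spanning pair does not distinguish a real deletion from a wrongly mapped split-read.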
Results
 Benchmarks
Benchmark 1
The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing, Nature, 467, 1061-1073.
Benchmark 2
Mills, R.E., et al. (2011) Mapping copy number variation by population-scale genome sequencing, Nature, 470, 59-65.
We run our tests with the 1000 Genomes Project releases as benchmarks;
the deletions in these releases were found by multiple methods
 Data
Low coverage (can be as low as 2x)
Illumina (2009_08);
45 individuals in CEU population (combined);
580 G in BAM format;
 Comparison
Pindel v1: exact match; max event size 1Mb
Pindel v2: with mismatch; max event size 8092bp
Ye, K., et al. (2009) Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads, Bioinformatics, 25, 2865-2871.
Results
 Vs. Pindel v1 (Max Event 1Mb)
[Figure: total number of called deletions, by chromosome]
What is the percentage of true deletions? We use the 1000 Genomes Project releases as benchmarks.
[Figure: true positives with precise deletions in benchmark 1, by chromosome]
[Figure: true positives with precise deletions in benchmark 2, by chromosome]
With the benchmarks, our method with cutoff 2 has more true positives and fewer false positives.
There might be deletions not in the benchmarks but found by our method.
Results
 Vs. Pindel v2 (Max Event 8092bp)
[Figure: total number of deletions found, by chromosome]
The comparison with v2 shows the same trend as the comparison with v1.
[Figure: true positives with precise deletions in benchmark 1, by chromosome]
[Figure: true positives with precise deletions in benchmark 2, by chromosome]
Results
•Running time
Data: 45 individuals on chromosome 1
Running on our Xeon server with 24 CPUs
Finding SVs with maximum event size including 1Mb
Method       Inexact match              Threads   Hours
Pindel v1    Not allowed                1         About 10
Pindel v2    Allow mismatch             20        30, still running
Our method   Allow mismatch and indel   1         About 3.5
•An example of inexact mapping
(P1_M_061510_1 9_22)
Experiments were run on workstations supported by NSF grant IIS-0916948
Research partly supported by NSF grant IIS-0953563