Finding Deletions with Exact Break Points from Noisy Low Coverage Paired-end Short Sequence Reads
Jin Zhang and Yufeng Wu
Department of Computer Science and Engineering, University of Connecticut

Introduction
• Structural variants
  [Figure: reference vs. alternative sequence, showing a deletion and an insertion]
• Our problem: finding exact break points of deletions efficiently from low coverage, noisy data
• Data (from 1000 Genomes Project pilot 1)
  Low coverage (2-6x); Illumina (2009_08); 45 individuals in the CEU population (combined); 580 G in BAM format
  [Figure: read pairs on reference vs. alternative]
  Both ends mapped: pairs in regions without a deletion, and pairs with abnormal insertion size (mean insertion size + 3 SDs)
  Only one end mapped: 24 G in FASTQ format, 250 million reads. These include real split-reads across a deletion (which also contain errors), with the mapped end as anchor, and reads with sequence errors and other errors (more than 99%)
• Aim: find deletions with exact break points efficiently

Method: Split-read Mapping
[Figure: spanning pair (abnormal insertion size), anchor, and split-read on reference vs. alternative across a deletion]
• We map split-reads with the Burrows-Wheeler Transform (BWT) for efficiency.
• Inexact mapping: mismatches and indels are allowed, so SVs near SNPs and indels can be found and reads with errors can be used.
  [Figure: split-read CAAT CCCTC ACGTAACG against reference CACAATACCCTCTCACACCAACGTTACG, with a mismatch and an indel marked]
• Hits may not be unique (e.g., a hit at 2 positions on the reference), and the split may not be unique either. We report the hits and splits with the best quality.
• Search locally, e.g., search near a region of 1 Mb or ...

Method: Calling Candidate Deletions
Example of a split that is not unique:
  Reference:        TACGTTTAACCATACGGCCAAAACGTAACGT
  Split-read:       TTAACCATACGTAACGT
  Leftmost split:   TTAACCAT ACGTAACGT
  Rightmost split:  TTAACCATACG TAACGT
(1) Sort split-reads by leftmost break point
(2) Cluster the split-reads supporting the same candidate
(3) Call the candidate
Cutoff value: the minimum number of split-reads that must support a candidate.

Method: Candidate Validation (Calling Deletions)
[Figure: deletion candidate on reference vs. alternative, with and without a spanning pair]
• Has spanning pair: an abnormal insertion size may be caused by a deletion. If there exist spanning pairs with abnormal insertion size, we validate the deletion. Note that the candidate itself comes from split-read mapping.
• No spanning pair: not enough information, so we cannot validate. Either there is a deletion but low coverage left no spanning pair, or there is no deletion and the split-read was mapped incorrectly or contains errors.

Results: Benchmarks
• Benchmark 1: The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing, Nature, 467, 1061-1073.
• Benchmark 2: Mills, R.E., et al. (2011) Mapping copy number variation by population-scale genome sequencing, Nature, 470, 59-65.
We run our tests with the 1000 Genomes Project releases as benchmarks. The deletions in these releases were found by multiple methods.

Data
Low coverage (can be as low as 2x); Illumina (2009_08); 45 individuals in the CEU population (combined); 580 G in BAM format.

Comparison
• Pindel v1: exact match; max event size 1 Mb
• Pindel v2: with mismatch; max event size 8092 bp
Ye, K., et al. (2009) Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads, Bioinformatics, 25, 2865-2871.

Results vs. Pindel v1 (max event size 1 Mb)
[Figure: total number of called deletions, by chromosome]
What is the percentage of true deletions?
We use the 1000 Genomes Project releases as benchmarks.
[Figure: true positives with precise deletions in benchmark 1, by chromosome]
[Figure: true positives with precise deletions in benchmark 2, by chromosome]
• Against the benchmarks, our method with cutoff 2 has more true positives and fewer false positives.
• There might be deletions that are not in the benchmarks but are found by our method.

Results vs. Pindel v2 (max event size 8092 bp)
[Figure: total number of deletions found]
[Figure: true positives with precise deletions in benchmark 1, by chromosome]
[Figure: true positives with precise deletions in benchmark 2, by chromosome]
The comparison with v2 shows the same trend as the comparison with v1.

Results
• Running time
  Data: 45 individuals, chromosome 1. Run on our Xeon server with 24 CPUs. Task: finding SVs with a maximum event size of 1 Mb.

  Method       Inexact match                Threads   Hours
  Pindel v1    Not allowed                  1         About 10
  Pindel v2    Mismatch allowed             20        30, still running
  Our method   Mismatch and indel allowed   1         About 3.5

• An example of inexact mapping (P1_M_061510_1 9_22)

Experiments were run on workstations supported by NSF grant IIS-0916948.
Research partly supported by NSF grant IIS-0953563.
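
The candidate-calling and validation steps from the Method panels can be illustrated with a small Python sketch. It is only a sketch under stated assumptions, not the poster's implementation: exact string search stands in for the BWT-based inexact mapping, the function names (find_all, deletion_splits, call_candidates, is_validated) are hypothetical, a spanning pair is assumed to be given as (left mate end, right mate start, observed insertion size), and the mean and SD in the usage lines are made-up values. The only parts taken from the poster are the example reference and split-read, the cutoff on supporting split-reads, and the "mean insertion size + 3 SDs" rule.

```python
from itertools import groupby


def find_all(text, pattern):
    """Return every start position of pattern in text (exact match)."""
    hits = []
    i = text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def deletion_splits(ref, read, min_part=4):
    """List every deleted interval (start, end), 0-based and half-open,
    such that read == prefix + suffix with the prefix ending at start
    and the suffix beginning at end. Assumption: exact matching only."""
    splits = set()
    for k in range(min_part, len(read) - min_part + 1):
        prefix, suffix = read[:k], read[k:]
        for p in find_all(ref, prefix):
            for q in find_all(ref, suffix):
                if q > p + k:  # something must be deleted in between
                    splits.add((p + k, q))
    return sorted(splits)  # first entry has the leftmost break point


def call_candidates(leftmost_splits, cutoff=2):
    """Sort split-reads by leftmost break point, cluster identical
    intervals, and keep clusters with at least `cutoff` supporting reads."""
    called = []
    for interval, group in groupby(sorted(leftmost_splits)):
        support = len(list(group))
        if support >= cutoff:
            called.append((interval, support))
    return called


def is_validated(candidate, pairs, mean_insert, sd_insert):
    """A candidate is validated if some pair spans it with an insertion
    size above mean + 3 SDs. Each pair is a hypothetical tuple
    (left_mate_end, right_mate_start, insert_size)."""
    start, end = candidate
    threshold = mean_insert + 3 * sd_insert
    return any(
        left_end <= start and right_start >= end and size > threshold
        for left_end, right_start, size in pairs
    )


if __name__ == "__main__":
    REF = "TACGTTTAACCATACGGCCAAAACGTAACGT"   # reference from the poster example
    READ = "TTAACCATACGTAACGT"                # split-read from the poster example
    splits = deletion_splits(REF, READ)
    print(splits)  # [(13, 22), (14, 23), (15, 24), (16, 25)]
    # Three reads, two of which support the same leftmost interval.
    print(call_candidates([splits[0], splits[0], (40, 60)], cutoff=2))
    # Illustrative pair and insertion-size statistics (made-up numbers).
    print(is_validated(splits[0], [(10, 30, 500)], mean_insert=200, sd_insert=20))
```

On the poster's example, the four intervals all remove 9 bases and yield the same alternative sequence, which is why the split is not unique; the first interval corresponds to the leftmost split TTAACCAT ACGTAACGT and the last to the rightmost split TTAACCATACG TAACGT. Using the leftmost interval as the cluster key is one way to make reads that support the same deletion comparable.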