Opera: Reconstructing optimal genomic scaffolds with high-throughput paired-end sequences

Song Gao, Niranjan Nagarajan, Wing-Kin Sung
National University of Singapore
Genome Institute of Singapore

Outline
• Overview
• Methods
  - 1. Pre-Processing
  - 2. A Special Case
  - 3. Full Algorithm
  - 4. Graph Contraction
  - 5. Gap Estimation
• Results
• Ongoing Work

Overview

[Diagram: Biological Entity (Genome, Transcripts, Microbial Community) -> Sequencing Machine -> Reads (ACGTTTAACAGG…, TTACGATTCGATGA…, GCCATAATGCAAG…, CTTAGAATCGGATAGA, AGGCATAGACTAGAG) -> Analysis / Assembly -> Data Entity (Genomic Sequence, Transcript, Metagenome)]

Sequence Assembly
- Reads -> Contigs (I)
- Paired-end Reads -> Scaffolds (II)

Related Research Works

Contig Level
- OLC Framework: Celera Assembler [Myers et al, 2000], Edena [Hernandez et al, 2008], Arachne [Batzoglou et al, 2002], PE Assembler [Ariyaratne et al, 2011]
- De Bruijn Graph: EULER [Pevzner et al, 2001], Velvet [Zerbino et al, 2008], ALLPATHS [Butler et al, 2008], SOAPdenovo [Li et al, 2010]
- Comparative Assembly: AMOScmp [Pop, 2004], ABBA [Salzberg, 2008]

Scaffold Level
- Embedded Module: EULER [Pevzner et al, 2001], Arachne [Batzoglou et al, 2002], Celera Assembler [Myers et al, 2000], Velvet [Zerbino, 2008]
- Standalone Module: Bambus [Pop et al, 2004], SOPRA [Dayarian et al, 2010]

Scaffolding Problem [Huson et al, 2002]

[Diagram: contigs linked by paired-end reads (distance labels 1k, 3k, 2.5k) and ordered into a scaffold; one read is marked as discordant]

Value Addition
- Gap Filling: GapCloser Module of SOAPdenovo
- Repeat Resolution
- Long-Range Genomic Structure

* Huson, D. H., Reinert, K., Myers, E. W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)

Statistics of Assembled Genomes [Schatz et al, 2010]

Organism    Genome Size  # of Contigs  Contig N50  # of Scaffolds  Scaffold N50
Grapevine   500Mb        58,611        18.2kb      2,093           1.33Mb
Panda       2.4Gb        200,604       36.7kb      81,469          1.22Mb
Strawberry  220Mb        16,487        28.1kb      3,263           1.44Mb
Turkey      1.1Gb        128,271       12.6kb      26,917          1.5Mb

* N50: Given a set of sequences of varying lengths, the N50 length is the largest length N such that at least 50% of all bases are contained in sequences of length L >= N.

Data
- Sequencing Errors
- Read Length
- Coverage

Analysis
- Long Insert vs. Long Read [Chaisson, 2009; Zerbino, 2009]

* Schatz, M. C., Delcher, A. L., Salzberg, S. L.: Assembly of large genomes using second-generation sequencing. Genome Research 20(9), 1165–1173 (2010)
* Zerbino, D. R.: Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS ONE 4(12) (2009)
* Chaisson, M. J., Brinza, D., Pevzner, P. A.: De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Research 19, 336–346 (2009)

The scaffolding problem is NP-Complete [Huson et al, 2002]

Heuristic Methods
- Celera Assembler [Myers et al, 2000]
- Euler [Pevzner et al, 2001]
- Jazz [Chapman et al, 2002]
- Velvet [Zerbino et al, 2008]
- Arachne [Batzoglou et al, 2002]
- Bambus [Pop et al, 2004]

"True Complexity"
- Phase transition based on parameters [Hayes, 1996]: 3-SAT Problem
- Parametric Complexity [Downey et al, 1999]: Vertex Cover Problem

* Hayes, B.: Can't get no satisfaction. American Scientist 85, 108–112 (1996)
* Downey, R. G., et al.: Parameterized Complexity: A Framework for Systematically Confronting Computational Intractability. DIMACS, Vol. 49 (1999)

Methods

1. Pre-Processing
- Paired-end Reads -> Clusters [Huson et al, 2002]
- Chimeric Noise: filtered by simulation
- Upper Bound of Paired-end Reads

[Diagram: paired-end reads between two contigs bundled into one cluster (label: 3); a chimera is marked separately]

2. A Special Case
No discordant clusters in final scaffold

Naïve Solution
Extend a partial scaffold over contigs A, B, C, D one signed contig at a time:

+A
  +A+B
    +A+B+C
    +A+B-C
  +A-B
  +A+C
    …
  +A-C
    +A-C+B
    +A-C-B
  …

Exponential Time

Dynamic Programming
- Scaffold Tail is Sufficient
- Upper Bound: width (w)
- Analogous to Bandwidth Problem [Saxe, 1980]
- Additional considerations: Orientation of Nodes, Direction of Edges, Discordant Edges, …

* Saxe, J.: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods 1(4), 363–369 (1980)

Equivalence class of scaffolds
S1 and S2 have the same tail -> They are in the same class

Feature of equivalence class:
- Use of the same set of contigs;
- All or none of them can be extended to a solution

[Diagram: scaffolds +A-B+C+D+E and -A+C+D+E+F … with the shared tail +D+E highlighted]

3. Full Algorithm
Consider discordant clusters

[Diagram: sequences ACCAAAATTT ? CTAGAA, CAAGAA, ACCAAGAATTT]

Sources of discordant clusters:
- Chimeric Reads
- Sequencing Errors
- Mapping Errors

Equivalence Class: Number of Discordant Edges (p)

4. Graph Contraction

[Diagram sequence: scaffold graph contraction (label: 20k)]

5. Gap Estimation

Utility
- Genome finishing (Genome Size Estimation)
- Scaffold Correctness

Calculate Gap Sizes g1, g2, g3 from the insert-size distribution (μ, σ)
- Maximum Likelihood
- Quadratic Function
- Solved through quadratic programming [Goldfarb et al, 1983]
- Polynomial Time

* Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming 27 (1983)

Results

Runtime Comparison

         ◆E. coli  ★B. pseudomallei  ◆S. cerevisiae  ◆D. melanogaster
Bambus   50s       16m               2m              3m
SOPRA    49m       -                 2h              5h
Opera    4s        7m                11s             30s

◆ Simulated data set using MetaSim
★ In-house data
• Coverage of 300bp insert library: >20X
• Coverage of 10kbp insert library: 2X
• Contigs assembled using Velvet

Scaffold Contiguity

[Figure: two bar charts, Max Length and N50, comparing Velvet, Bambus, SOPRA and Opera on E. coli, B. pseudomallei, S. cerevisiae and D. melanogaster]

Scaffold Correctness

[Figure: bar chart of # of Breakpoints for Velvet, Bambus, SOPRA and Opera on E. coli, S. cerevisiae and D. melanogaster]

Scaffold Correctness

         E. coli  S. cerevisiae  D. melanogaster
Opera    1        3              4
Bambus   19       55             423

[Figure: bar chart of # of Discordant Edges for Velvet and Opera on E. coli, S. cerevisiae and D. melanogaster]

Ongoing Work

A Rodent Genome

         Genome Size  N50
Opera    ~2Gbp        765.5Kbp
SSpace   ~2Gbp        281.7Kbp

A Tree Genome

         Genome Size  N50       Max Length
Opera    ~300Mbp      209.9Kbp  921.8Kbp

Ongoing Work
- Repeats
- Lower bounds and better scaffold
- Multiple Libraries
- Other applications: Metagenomics, Cancer Genomics

Link: https://sourceforge.net/projects/operasf/

Acknowledgement
Niranjan Nagarajan
Wing-Kin Sung
Pramila N. Ariyaratne

Funding:
- NUS Graduate School for Integrative Sciences and Engineering (NGS)
- A*STAR of Singapore
- Ministry of Education, Singapore

Questions?
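The N50 definition on the "Statistics of Assembled Genomes" slide can be made concrete with a short sketch. The function name `n50` and the sample lengths are illustrative and do not come from the slides.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length N such that sequences of length >= N
    together contain at least half of all bases.
    """
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths, chosen only for illustration.
contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(contigs))
```

For these lengths the total is 32 bases, and the two longest contigs already cover 16 of them, so the function returns 8.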
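To see why the naïve solution on the "A Special Case" slide takes exponential time, one can enumerate every complete signed ordering of a contig set. This is only a counting sketch under the assumption that each contig appears exactly once with a + or - orientation; the generator name `signed_orderings` is made up here, and the sketch does not check any paired-end constraints.

```python
from itertools import permutations, product


def signed_orderings(contigs):
    """Yield every ordering of the contigs with every orientation choice."""
    for order in permutations(contigs):
        for signs in product("+-", repeat=len(order)):
            yield "".join(sign + name for sign, name in zip(signs, order))


candidates = list(signed_orderings("ABCD"))
# n contigs give n! orderings times 2^n orientations.
print(len(candidates))
```

With four contigs this already gives 4! x 2^4 = 384 candidates (a scaffold and its reverse complement are counted separately), which is why the slides move on to dynamic programming over scaffold tails.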
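The "Gap Estimation" slide states that maximum likelihood leads to a quadratic function. A minimal single-gap sketch illustrates this, assuming insert sizes are normally distributed with mean μ and standard deviation σ, and that each spanning read pair contributes a known distance `s` inside the two contigs so that its insert size is `s + g`. These assumptions, the function names, and the numbers are illustrative; the slides solve the multi-gap case with quadratic programming, which this sketch does not attempt.

```python
def neg_log_likelihood(gap, mu, sigma, spans):
    """Negative log-likelihood of a gap size, up to an additive constant.

    Each span s is the part of a read pair's insert that lies inside
    the two contigs, so the modeled insert size is s + gap.
    """
    return sum((s + gap - mu) ** 2 for s in spans) / (2 * sigma ** 2)


def estimate_gap(mu, spans):
    """Minimize the quadratic above in closed form: gap = mu - mean(spans)."""
    if not spans:
        raise ValueError("need at least one spanning read pair")
    return mu - sum(spans) / len(spans)


# Hypothetical 10 kbp library and three spanning read pairs.
mu, sigma = 10000, 1000
spans = [7000, 7200, 6800]
gap = estimate_gap(mu, spans)
print(gap)
```

Because the objective is a sum of squares in the gap size, it stays quadratic when several gaps g1, g2, g3 share read pairs, which is consistent with the slide's use of a quadratic programming solver.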