
Opera: Reconstructing optimal genomic scaffolds with high-throughput paired-end sequences
Song Gao, Niranjan Nagarajan,
Wing-Kin Sung
National University of Singapore
Genome Institute of Singapore
Outline
• Overview
• Methods
- 1. Pre-Processing
- 2. A Special Case
- 3. Full Algorithm
- 4. Graph Contraction
- 5. Gap Estimation
• Results
• Ongoing Work
[Figure: biological entities (Genome, Transcripts, Microbial Community) pass through a Sequencing Machine to give Reads; Analysis of the reads gives the data entities (Genomic Sequence, Transcript Assembly, Metagenome). Example reads: ACGTTTAACAGG…, TTACGATTCGATGA…, GCCATAATGCAAG…]
Sequence Assembly
(I) Reads -> Contigs
(II) Paired-end Reads -> Scaffolds
Related Research Works
Contig Level
- OLC Framework: Celera Assembler [Myers et al, 2000], Edena [Hernandez et al, 2008], Arachne [Batzoglou et al, 2002], PE Assembler [Ariyaratne et al, 2011]
- De Bruijn Graph: EULER [Pevzner et al, 2001], Velvet [Zerbino et al, 2008], ALLPATHS [Butler et al, 2008], SOAPdenovo [Li et al, 2010]
- Comparative Assembly: AMOScmp [Pop, 2004], ABBA [Salzberg, 2008]
Scaffold Level
- Embedded Module: EULER [Pevzner et al, 2001], Arachne [Batzoglou et al, 2002], Celera Assembler [Myers et al, 2000], Velvet [Zerbino, 2008]
- Standalone Module: Bambus [Pop et al, 2004], SOPRA [Dayarian et al, 2010]
• Scaffolding Problem [Huson et al, 2002]
[Figure: contigs joined by paired-end reads into a scaffold, with one discordant read marked; distance labels 1k, 3k, 2.5k]
• Value Addition
- Gap Filling: GapCloser Module of SOAPdenovo
- Repeat Resolution
- Long-Range Genomic Structure
* Huson, D. H., Reinert, K., Myers E.W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)
Statistics of Assembled Genomes [Schatz et al, 2010]
Organism    Genome Size  # of Contigs  Contig N50  # of Scaffolds  Scaffold N50
Grapevine   500Mb        58,611        18.2kb      2,093           1.33Mb
Panda       2.4Gb        200,604       36.7kb      81,469          1.22Mb
Strawberry  220Mb        16,487        28.1kb      3,263           1.44Mb
Turkey      1.1Gb        128,271       12.6kb      26,917          1.5Mb
* N50: given a set of sequences of varying lengths, the N50 is the largest length N such that at least 50% of all bases lie in sequences of length L >= N.
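The N50 definition above can be computed directly from a list of contig or scaffold lengths. The sketch below is a minimal illustration; the function name `n50` and the sample lengths are made up for the example and do not come from the slides.

```python
def n50(lengths):
    """Return the largest N such that sequences of length >= N
    contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths, for illustration only.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
example_n50 = n50(example_lengths)
```

Sorting once and accumulating from the longest sequence keeps the computation at O(n log n), which is negligible next to assembly itself.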
Data
- Sequencing Errors
- Read Length
- Coverage
Analysis
- Long Insert vs. Long Read [Chaisson, 2009; Zerbino, 2009]
* Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Research 20(9), 1165–1173 (2010)
* Zerbino, D.R., et al.: Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS ONE 4(12) (2009)
* Chaisson, M.J., Brinza, D., Pevzner, P.A.: De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Research 19, 336–346 (2009)
NP-Complete [Huson et al, 2002]
Heuristic Methods
- Celera Assembler[Myers et al,2000]
- Euler[Pevzner et al, 2001]
- Jazz[Chapman et al, 2002]
- Velvet[Zerbino et al,2008]
- Arachne[Batzoglou et al ,2002]
- Bambus[Pop, et al, 2004]
“True Complexity”
- Phase transition based on parameters [Hayes, 1996]: 3-SAT Problem
- Parameterized Complexity [Downey et al, 1999]: Vertex Cover Problem
* Hayes, B.: Can't get no satisfaction. American Scientist 85, 108–112 (1996)
* Downey, R.G., et al.: Parameterized Complexity: A Framework for Systematically Confronting Computational Intractability. DIMACS, vol. 49 (1999)
Methods
1. Pre-Processing
Paired-end Reads -> Clusters [Huson et al, 2002]
Chimeric Noise
- Filtered by simulation
[Figure: paired-end reads grouped into a cluster, with a chimera marked; label: 3]
* Upper Bound of Paired-end Reads
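The step "Paired-end Reads -> Clusters" can be sketched as grouping read pairs that link the same two contigs in the same relative orientation, then discarding weakly supported groups as likely chimeric noise. This is a simplified sketch, not the bundling procedure of Huson et al. The tuple layout, the name `bundle_links`, and the `min_support` cutoff are assumptions for illustration; the slides say the actual filter is set by simulation.

```python
from collections import defaultdict
from statistics import mean


def bundle_links(links, min_support=2):
    """Group read-pair links into clusters and drop weakly supported ones.

    links: iterable of (contig_a, contig_b, orientation, implied_distance).
    min_support: hypothetical cutoff on reads per cluster (an assumption,
    standing in for the simulation-derived filter on the slide).
    Returns {(contig_a, contig_b, orientation): (read_count, mean_distance)}.
    """
    groups = defaultdict(list)
    for contig_a, contig_b, orientation, distance in links:
        groups[(contig_a, contig_b, orientation)].append(distance)

    clusters = {}
    for key, distances in groups.items():
        # A lone link between two contigs is more likely a chimera
        # than a true adjacency, so it is filtered out here.
        if len(distances) >= min_support:
            clusters[key] = (len(distances), mean(distances))
    return clusters


# Made-up links: three agree on A-B, one stray link joins A-C.
example_links = [
    ("A", "B", "++", 1000),
    ("A", "B", "++", 1100),
    ("A", "B", "++", 900),
    ("A", "C", "+-", 5000),
]
example_clusters = bundle_links(example_links)
```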
2. A Special Case
No discordant clusters in final scaffold
Naïve Solution
Contigs: A, B, C, D
[Figure: search tree of partial scaffolds]
+A
+A+B, +A-B, +A+C, +A-C, …
+A+B+C, +A+B-C, …, +A-C+B, +A-C-B, …
Exponential Time
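To make the blow-up concrete: every complete scaffold is an ordering of the contigs plus an orientation for each, so n contigs give n! x 2^n candidates (counting a scaffold and its reverse separately). The generator below enumerates them in the slide's +A-B notation; the name `signed_orderings` is an illustrative choice, not something from the slides.

```python
from itertools import permutations, product


def signed_orderings(contigs):
    """Yield every ordering of the contigs with every orientation choice,
    written as strings such as '+A-B+C'."""
    for order in permutations(contigs):
        for signs in product("+-", repeat=len(order)):
            yield "".join(sign + contig for sign, contig in zip(signs, order))


# Four contigs already give 4! * 2**4 = 384 candidate scaffolds.
four_contig_count = sum(1 for _ in signed_orderings("ABCD"))
```

Checking each candidate against the paired-end clusters is cheap, but the number of candidates grows faster than exponentially, which is why the naive search is only usable for a handful of contigs.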
Dynamic Programming
Scaffold Tail is Sufficient
- Upper Bound: width (w)
Analogous to Bandwidth Problem [Saxe, 1980]
- Orientation of Nodes
- Direction of Edges
- Discordant Edges …
* Saxe, J.: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM Journal on Algebraic and Discrete Methods 1(4), 363–369 (1980)
Equivalence class of scaffolds
S1 and S2 have the same tail -> they are in the same class
Features of an equivalence class:
- All members use the same set of contigs
- Either all or none of them can be extended to a solution
[Figure: example scaffolds with the tail marked: +A-B+C +D+E and -A+C +D+E+F, …]
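The idea that a failed partial scaffold rules out its whole equivalence class can be illustrated on a toy version of the problem. The sketch below ignores orientation, edge direction, distances, and discordant edges, and simply asks for an ordering in which every edge spans at most w positions (the bandwidth-style setting). It is not Opera's algorithm; `find_ordering`, `span_ok`, and the explicit (used set, tail) key are assumptions made for the example.

```python
def find_ordering(nodes, edges, w):
    """Return an ordering in which every edge spans at most w positions,
    or None if no such ordering exists (toy model, w >= 1)."""
    if w < 1:
        raise ValueError("w must be at least 1")
    nodes = sorted(nodes)
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    dead = set()  # equivalence classes already shown to be dead ends

    def extend(order, used):
        if len(order) == len(nodes):
            return list(order)
        tail = order[-w:]
        key = (used, tail)  # same contigs used + same tail = same class
        if key in dead:
            return None
        for v in nodes:
            if v in used:
                continue
            # Every already-placed neighbour of v must still be in the
            # tail, otherwise that edge would span more than w positions.
            if (adj[v] & used) <= set(tail):
                result = extend(order + (v,), used | {v})
                if result is not None:
                    return result
        dead.add(key)  # no member of this class can be extended
        return None

    return extend((), frozenset())


def span_ok(order, edges, w):
    """Check that every edge spans at most w positions in the ordering."""
    pos = {n: i for i, n in enumerate(order)}
    return all(abs(pos[a] - pos[b]) <= w for a, b in edges)
```

Whether a partial ordering can be completed depends only on which contigs are placed and on the last w of them, so recording one failure per class avoids re-exploring every other member of that class.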
3. Full Algorithm
Consider discordant clusters
- Chimeric Reads
- Sequencing Errors
- Mapping Errors
[Figure: sequence example with ACCAAAATTT, CTAGAA, CAAGAA, ACCAAGAATTT and a "?" marker]
Equivalence Class
- Number of Discordant Edges (p)
4. Graph Contraction
[Figure: graph contraction illustrated over three slides; surviving label: 20k]
5. Gap Estimation
Utility
- Genome finishing (Genome Size Estimation)
- Scaffold Correctness
Calculate Gap Sizes
[Figure: scaffold with gaps g1, g2, g3; library parameters μ, σ]
- Maximum Likelihood
- Quadratic Function
- Solved through quadratic programming [Goldfarb et al, 1983]
- Polynomial Time
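One way to see why maximum likelihood leads to a quadratic function is sketched below. It assumes that each paired-end cluster gives an independent, normally distributed estimate of the distance it spans; this assumption and all notation beyond g, μ, and σ are introduced here for illustration and are not taken from the slides.

```latex
% Assumed notation: E is the set of paired-end clusters in the scaffold,
% \mu_e and \sigma_e are the mean and standard deviation of the distance
% implied by cluster e, G(e) is the set of gaps spanned by e, and \ell_e
% is the total contig length spanned by e.
\begin{align}
  \log L(g) &= -\sum_{e \in E}
      \frac{\bigl(\mu_e - \ell_e - \sum_{i \in G(e)} g_i\bigr)^2}{2\sigma_e^2}
      + \text{const} \\
  \hat{g} &= \arg\min_{g} \sum_{e \in E}
      \frac{\bigl(\mu_e - \ell_e - \sum_{i \in G(e)} g_i\bigr)^2}{\sigma_e^2}
\end{align}
```

Under these assumptions the objective is a convex quadratic in the gap sizes g1, g2, g3, …, so it fits the quadratic-programming setting named on the slide and can be minimized in polynomial time.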
* Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming 27, 1–33 (1983)
Results
Runtime Comparison
          E. coli ◆   B. pseudomallei ★   S. cerevisiae ◆   D. melanogaster ◆
Bambus    50s         16m                 2m                3m
SOPRA     49m         -                   2h                5h
Opera     4s          7m                  11s               30s
◆ Simulated data set using MetaSim   ★ In-house data
• Coverage of 300bp insert library: >20X
• Coverage of 10kbp insert library: 2X
• Contigs assembled using Velvet
Scaffold Contiguity
[Figure: two bar charts, Max Length and N50, comparing Velvet, Bambus, SOPRA, and Opera on E. coli, B. pseudomallei, S. cerevisiae, and D. melanogaster]
Scaffold Correctness
[Figure: bar chart of # of Breakpoints for Velvet, Bambus, SOPRA, and Opera on E. coli, S. cerevisiae, and D. melanogaster]
Scaffold Correctness
          E. coli   S. cerevisiae   D. melanogaster
Opera     1         3               4
Bambus    19        55              423
[Figure: bar chart of # of Discordant Edges for Velvet and Opera on E. coli, S. cerevisiae, and D. melanogaster]
Ongoing Work
A Rodent Genome (Genome Size: ~2Gbp)
          N50
Opera     765.5Kbp
SSpace    281.7Kbp
A Tree Genome
          Genome Size   N50        Max Length
Opera     ~300Mbp       209.9Kbp   921.8Kbp
Ongoing Work
- Repeats
- Lower bounds and better scaffolds
- Multiple Libraries
- Other applications: Metagenomics, Cancer Genomics
Link: https://sourceforge.net/projects/operasf/
Acknowledgement
Niranjan Nagarajan
Wing-Kin Sung
Pramila N. Ariyaratne
Funding:
NUS Graduate School for Integrative Sciences and Engineering (NGS)
A*STAR of Singapore
Ministry of Education, Singapore
Questions?