Abstract Algorithm Overview Incomplete Breakpoint Graph Analysis

Pseudo-chromosome assembly of large and complex genomes using
multiple references
1
2
2
3,4
5
5,6
2
1
1
Mikhail Kolmogorov , Brian Raney , Joel Armstrong , Duncan Odom , Paul Flicek , David Thybert , Benedict Paten and Son Pham
1
University of California San Diego, USA 2University of California Santa Cruz, USA, 3University of Cambridge, UK,
4
Wellcome Trust Sanger Institute, UK, 5European Bioinformatic Institute, UK, 6The Genome Analysis Center, UK
Abstract
Incomplete Breakpoint Graph Analysis
• Assembly of mammalian-scale genomes into complete
chromosomes is challenging
Ref. 1
a
d
c
• To address this, we developed Ragout, a referenceassisted assembly tool for large and complex genomes
Ref. 2
a
b
c
Target
a
• Taking as input an NGS assembly and multiple related references, Ragout infers
their evolutionary relationship and builds the final assembly of the target
genome using a genome rearrangement approach
• Using Ragout we assembled two mice genomes (M. Caroli and M. Pahari) with
comlicated chromosome-scale rearrangements into sets of high-quality pseudochromosomes
• Chromosome coloring comfirms most the rearrangements that Ragout has
detected
d
at
b
b
d
c
ct
dh
ch
bh
bt
Algorithm Overview
(a)
A A C GT G C TA G G C G AT G G
a
d
c
Ref. 2
A A C G G AT G G G G C T G C TA
a
b
c
C G ATAT G
Target
d
G C TA
b
b
(b) Nucleotide sequences
are decomposed into the
permutations of synteny
blocks
d
a
c
C G AT
(c)
(d)
at
dh
Ref. 2
Ref. 1
Target
ct
(e)
ah
(d) Incomplete breakpoint
graph reflects adjacencies
between synteny blocks
ch
bh
at
(c) Phylogenetic tree of
the input genomes is
reconstructed based on
the breakpoints data
ah
dt
bt
(e) Missing adjacencies
are reconstructed by
analysing rearrangements
(f)
dt
ct
C G AT
dh
C G ATAT G
G C TA
ch
bh
• Ragout recovers the missing
adjacencies so as to minimize the
weighted number of rearrangements
between the genomes
• These adjacencies are then used to
merge the target fragments into
scaffolds
(a) Raw references and
target genome sequences
(b)
Ref. 1
• If all genomes were complete, the
edges of each color will define a
perfect matching on the graph
• As the target genome is
fragmented, some adjacencies of red
color are missing
ah
dt
• Breakpoint graphs reflect
adjacencies between synteny blocks
in different genomes
bt
Results
• We assembled two genomes from Murinae family: Mus. Caroli and
Mus. Pahari using Mus. Musculus and Rattus Norvegicus as references.
The assemblies contains 20 and 23 pseudo-chromosomes, respectively
with at most 2% of unlocalized sequence.
• M. Caroli shows 5% sequence diversity from Mus. Musculus and has
the same karyotype (which was confirmed by Ragout). However, we
have detected a large inversion in chr17.
• M. Pahari has 10% sequence diversity and contain many interchromosomal rearrangements. Ragout has detected four of them,
which were also confirmed by chromosome coloring.
• Some of the rearrangements remain undetected, as the
corresponding breakpoints are missing from the NGS data. However,
they could be recovered using the aid of chromosomal maps.
(f) Target fragments are
joined into scaffolds with
respect to the inferred
adjacencies
Synteny Blocks
(a)
G1
(b)
a1
a2
a5
a3
a2
a1
G2
G3
a1
a2
a1
a3
a5
a4
a5
a4
a3
a3
a5
a4
(c)
(d)
a123
a4
a123
a5
a5
a4
a4
(e)
(f)
a1235
G1
a1235
a1
a2
a3
a5
Synteny blocks help to separate small sequence variations from largescale rearrangements. (a) An alignment between three genomes with
complicated sub-structure. (b) A-Bruijn graph representation of the
alignment. Small sequence variations correspond to bubbles, while
rearrangements form more complicated structures. (c) A bubble is
removed during the graph simplification, forming a larger synteny block
a123(d) Masking smaller block a4 allows to make the graph structure
even simplier. (e) After removing another bubble, the whole alignment is
represented as a large synteny block. (d) A hierarchical representation
of a synteny block.
Availability & Contacts
• Ragout is an easy to use package, written in Python/C+.
It is freely available at http://fenderglass.github.io/Ragout
• Email: [email protected]