Un-zipping diploid genomes – revealing all kinds of heterozygous

Un-zipping Diploid Genomes - Revealing All Kinds of
Heterozygous Variants from Comprehensive Haplotig Assembly
Jason Chin1, Paul Peluso1, David Rank1, Maria Nattestad2, Fritz J. Sedlazeck2, Michael Schatz2, Alicia Clum3, Kerrie Barry3, Alex Copeland3,
Ronan O’Malley4, Chongyuan Luo4, Joseph Ecker4
1PacBio, 2Cold
Spring Harbor Laboratory, 3Joint Genome Institute, 4 HHMI / The Salk Institute
ABSTRACT AND BASIC IDEAS
RESULTS
4 major haplotype phased blocks determined by hetSNPs
Example of the
FALCON-Unzip
Process
Un-phased region
“FALCON-Unzip Process”
Outside of the simplest cases (haploid, bacteria, or inbreds), genomic information is not carried in a single reference per individual, but rather has higher
ploidy (n ≥ 2) for almost all organisms. The existence of two or more highly related sequences within an individual makes it extremely difficult to build highquality, highly contiguous genome assemblies from short DNA fragments. Based on the earlier work on a polyploidy aware assembler, FALCON
(https://github.com/PacificBiosciences/FALCON), we developed new algorithms and software (FALCON-unzip) for de novo haplotype reconstructions from
SMRT Sequencing data.
~ 4.80 Mb
We apply the algorithms and the prototype software for (1) a highly repetitive diploid fungal genome (Clavicorona pyxidata) and (2) a F1 hybrid from two
inbred Arabidopsis strains: Cvi-0 and Col-0.
For the fungal genome, we achieve an N50 of 1.53 Mb (of the 1n assembly contigs) of the ~42 Mb 1n genome and an N50 of the haplotigs of 872 kb from a
95x read length N50 ~16 kb dataset. We find that ~45% of the genome is highly heterozygous and ~55% of the genome is highly homozygous. We
develop methods to assess the base-level accuracy and local haplotype phasing accuracy of the assembly with short-read data from an Illumina platform.
For the Arabidopsis F1 hybrid genome, we find that 80% of the genome can be separated into haplotigs. The long-range accuracy of phasing haplotigs is
evaluated by comparing them to the assemblies from the two inbred parental lines.
Add missing haplotype specific nodes & edges
Remove edges that connect different haplotypes
The more complete view of both haplotypes can help to generate better annotations, provide useful biological insights, including information about all kinds
of heterozygous variants, and be used for studying differential allele expression in organisms. Finally, we apply FALCON/FALCON-Unzip on three other
organisms (grape, zebra finch, human) to demonstrate this is a general approach to solve diploid assembly problem with PacBio long-read data.
Why Do We See Bubbles in Assembly Graphs?
Genome Sequences
SNPs
SVs
SNPs
SVs
FALCON Assembly & FALCON-Unzip Tool
Construct Arabidopsis thaliana Col-0
x CVI-0 Diploid F1 Line
Associate contig 1
(Alternative allele)
SNPs
Haplotype 1
The final graph comprises a primary contig (blue),
a major haplotig (red) and other smaller haplotigs.
Primary contig
FALCON
•
Haplotype 2
Associate contig 2
(Alternative allele)
Assembly Graph
SNPs
SVs
SNPs
SVs
SNPs
SVs
Col-0
FALCON-Unzip
Augmented with haplotype
information of each reads
SNPs
F1 Diploid Assembly vs.
Parental Line Assemblies
Cvi-0
•
SNPs
SVs
SNPs
In most OLC assembler designs, the overlapper does not catch differences
at SNP level but structural variations are naturally segregated.
Col-0 x Cvi-0
Image credits: Pajoro, et al, Trends in
Updated primary contig + “associate haplotigs”
Haploid-like contig in the inbred-line assemblies
Two inbred lines
sequenced in 2013
(P4-C2 chemistry),
assembled as haploid
genomes
Col-0 x Cvi-0 assembly
F1 line constructed
and sequenced in
2015 (P6-C4
chemistry), assembled
with FALCON and
Few or no variations
FALCON-Unzip
Many variations
Arabidopsis Thaliana F1 Diploid
Assembly Statistics
Assembler
Inbred Col-0
Inbred Cvi-0
CA/HGAP
CA/HGAP
Assembly Size (Mb)
# contigs
N50 size (Mb)
Max Contig size (Mb)
126
1325
6.210
10.25
119
194
4.79
11.25
140
0.146
F1 FALCON p-contigs
143
Inbred Cvi-0
7.92
119
Inbred Col-0
4.79
126
40
Few or no
variations
7.96
57
20
Many variations
6.92
F1 Unzip p-contigs
0
primary contig
N50 size (Mb)
105
F1 FALCON a-contigs
haplotigs
Col-0 x Cvi-0
F1
FALCON
FALCON-Unzip FALCON-Unzip
primary contigs primary contigs haplotigs
143
140
105
426
172
248
7.92
7.96
6.92
13.39
13.32
11.65
Assembly Size (Mb)
F1 Unzip haplotigs
Cvi-0 assembly
(18.5 Gbp reads, reads
N50 ~ 17.5 kb)
Plant Science 21.1 (2016): 6-8.
Strain
Col-0 assembly
60
80
100
120
6.21
140
160
0
2
4
6
8
10
Additional Diploid Genome Assembly
Examples
Clavicorona
Taeniopygia
pyxidata
Cabernet
guttata
(Coral fungus) Sauvignon+* (Zebra finch)$*
How to Resolve Structural Variations & Het-SNPs Phasing at Once?
Information Sources
3 kb – 100 kb
Structural
Variations
300 b – 10 kb
het-SNP
Pros & Cons
Assembly graph features
Overlap-layout
Nearby SVs may
process catches
SV haplotypes
✗Collapsed paths
when there is no
SV
be phased
automatically
✗Haplotype-fused
paths
Easy to group
SNPs/reads into
different
haplotypes
✗No phasing
information
associated with
SVs
Haploid Genome
~ 44 Mb
Size:
Sequencing
4.1 Gb / 95x
Coverage
Primary contig size 41.9 Mb
1.5 Mb
Primary contig N50
Haplotig size 25.5 Mb
872 kb
Haplotig N50
Haplotype-
+ Led
Remove edges connecting
different haplotypes
$
~ 500 Mb
73.7 Gb /
147x
591.0 Mb
2.2 Mb
372.2 Mb
767 kb
~1.2 Gb
Human*
~ 3 Gb
50 Gb / 42x 255 Gb / 85x
1.07 Gb
3.23 Mb
0.84 Gb
910 kb
2.76 Gb
22.9 Mb
2.0 Gb
330 kb
by Cantu lab, UC Davis and Cramer lab, UN Reno
Thanks to Erich Jarvis for permission to use preliminary data
* Preliminary results. Fast file system and efficient computational infrastructure are currently needed for
large genomes.
Tiling path of haplotype 0
SUMMARY
specific paths
✗More
fragmented
contigs
Tiling path of haplotype 1
• SMRT Sequencing provides a single data type for routine diploid assembly
• Large genomes are more computationally challenging, but mostly an engineering problem now
• Future Work:
• Haplotype phasing improvement, incorporate other 3rd party and better phasing code
• To develop a sequence aligner for “augmented alignment” for faster Quiver consensus process
• Want to attack the algorithm problem for polyploid assembly? Let us know. We can work together.
For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel
are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners.