Un-zipping Diploid Genomes - Revealing All Kinds of Heterozygous Variants from Comprehensive Haplotig Assembly Jason Chin1, Paul Peluso1, David Rank1, Maria Nattestad2, Fritz J. Sedlazeck2, Michael Schatz2, Alicia Clum3, Kerrie Barry3, Alex Copeland3, Ronan O’Malley4, Chongyuan Luo4, Joseph Ecker4 1PacBio, 2Cold Spring Harbor Laboratory, 3Joint Genome Institute, 4 HHMI / The Salk Institute ABSTRACT AND BASIC IDEAS RESULTS 4 major haplotype phased blocks determined by hetSNPs Example of the FALCON-Unzip Process Un-phased region “FALCON-Unzip Process” Outside of the simplest cases (haploid, bacteria, or inbreds), genomic information is not carried in a single reference per individual, but rather has higher ploidy (n ≥ 2) for almost all organisms. The existence of two or more highly related sequences within an individual makes it extremely difficult to build highquality, highly contiguous genome assemblies from short DNA fragments. Based on the earlier work on a polyploidy aware assembler, FALCON (https://github.com/PacificBiosciences/FALCON), we developed new algorithms and software (FALCON-unzip) for de novo haplotype reconstructions from SMRT Sequencing data. ~ 4.80 Mb We apply the algorithms and the prototype software for (1) a highly repetitive diploid fungal genome (Clavicorona pyxidata) and (2) a F1 hybrid from two inbred Arabidopsis strains: Cvi-0 and Col-0. For the fungal genome, we achieve an N50 of 1.53 Mb (of the 1n assembly contigs) of the ~42 Mb 1n genome and an N50 of the haplotigs of 872 kb from a 95x read length N50 ~16 kb dataset. We find that ~45% of the genome is highly heterozygous and ~55% of the genome is highly homozygous. We develop methods to assess the base-level accuracy and local haplotype phasing accuracy of the assembly with short-read data from an Illumina platform. For the Arabidopsis F1 hybrid genome, we find that 80% of the genome can be separated into haplotigs. The long-range accuracy of phasing haplotigs is evaluated by comparing them to the assemblies from the two inbred parental lines. Add missing haplotype specific nodes & edges Remove edges that connect different haplotypes The more complete view of both haplotypes can help to generate better annotations, provide useful biological insights, including information about all kinds of heterozygous variants, and be used for studying differential allele expression in organisms. Finally, we apply FALCON/FALCON-Unzip on three other organisms (grape, zebra finch, human) to demonstrate this is a general approach to solve diploid assembly problem with PacBio long-read data. Why Do We See Bubbles in Assembly Graphs? Genome Sequences SNPs SVs SNPs SVs FALCON Assembly & FALCON-Unzip Tool Construct Arabidopsis thaliana Col-0 x CVI-0 Diploid F1 Line Associate contig 1 (Alternative allele) SNPs Haplotype 1 The final graph comprises a primary contig (blue), a major haplotig (red) and other smaller haplotigs. Primary contig FALCON • Haplotype 2 Associate contig 2 (Alternative allele) Assembly Graph SNPs SVs SNPs SVs SNPs SVs Col-0 FALCON-Unzip Augmented with haplotype information of each reads SNPs F1 Diploid Assembly vs. Parental Line Assemblies Cvi-0 • SNPs SVs SNPs In most OLC assembler designs, the overlapper does not catch differences at SNP level but structural variations are naturally segregated. Col-0 x Cvi-0 Image credits: Pajoro, et al, Trends in Updated primary contig + “associate haplotigs” Haploid-like contig in the inbred-line assemblies Two inbred lines sequenced in 2013 (P4-C2 chemistry), assembled as haploid genomes Col-0 x Cvi-0 assembly F1 line constructed and sequenced in 2015 (P6-C4 chemistry), assembled with FALCON and Few or no variations FALCON-Unzip Many variations Arabidopsis Thaliana F1 Diploid Assembly Statistics Assembler Inbred Col-0 Inbred Cvi-0 CA/HGAP CA/HGAP Assembly Size (Mb) # contigs N50 size (Mb) Max Contig size (Mb) 126 1325 6.210 10.25 119 194 4.79 11.25 140 0.146 F1 FALCON p-contigs 143 Inbred Cvi-0 7.92 119 Inbred Col-0 4.79 126 40 Few or no variations 7.96 57 20 Many variations 6.92 F1 Unzip p-contigs 0 primary contig N50 size (Mb) 105 F1 FALCON a-contigs haplotigs Col-0 x Cvi-0 F1 FALCON FALCON-Unzip FALCON-Unzip primary contigs primary contigs haplotigs 143 140 105 426 172 248 7.92 7.96 6.92 13.39 13.32 11.65 Assembly Size (Mb) F1 Unzip haplotigs Cvi-0 assembly (18.5 Gbp reads, reads N50 ~ 17.5 kb) Plant Science 21.1 (2016): 6-8. Strain Col-0 assembly 60 80 100 120 6.21 140 160 0 2 4 6 8 10 Additional Diploid Genome Assembly Examples Clavicorona Taeniopygia pyxidata Cabernet guttata (Coral fungus) Sauvignon+* (Zebra finch)$* How to Resolve Structural Variations & Het-SNPs Phasing at Once? Information Sources 3 kb – 100 kb Structural Variations 300 b – 10 kb het-SNP Pros & Cons Assembly graph features Overlap-layout Nearby SVs may process catches SV haplotypes ✗Collapsed paths when there is no SV be phased automatically ✗Haplotype-fused paths Easy to group SNPs/reads into different haplotypes ✗No phasing information associated with SVs Haploid Genome ~ 44 Mb Size: Sequencing 4.1 Gb / 95x Coverage Primary contig size 41.9 Mb 1.5 Mb Primary contig N50 Haplotig size 25.5 Mb 872 kb Haplotig N50 Haplotype- + Led Remove edges connecting different haplotypes $ ~ 500 Mb 73.7 Gb / 147x 591.0 Mb 2.2 Mb 372.2 Mb 767 kb ~1.2 Gb Human* ~ 3 Gb 50 Gb / 42x 255 Gb / 85x 1.07 Gb 3.23 Mb 0.84 Gb 910 kb 2.76 Gb 22.9 Mb 2.0 Gb 330 kb by Cantu lab, UC Davis and Cramer lab, UN Reno Thanks to Erich Jarvis for permission to use preliminary data * Preliminary results. Fast file system and efficient computational infrastructure are currently needed for large genomes. Tiling path of haplotype 0 SUMMARY specific paths ✗More fragmented contigs Tiling path of haplotype 1 • SMRT Sequencing provides a single data type for routine diploid assembly • Large genomes are more computationally challenging, but mostly an engineering problem now • Future Work: • Haplotype phasing improvement, incorporate other 3rd party and better phasing code • To develop a sequence aligner for “augmented alignment” for faster Quiver consensus process • Want to attack the algorithm problem for polyploid assembly? Let us know. We can work together. For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners.
© Copyright 2026 Paperzz