HapMaker: Synthetic Haplotype Generator

Nozomu Okuda
Spring Research Conference 2013
DNA
Left - http://upload.wikimedia.org/wikipedia/commons/4/4c/DNA_Structure%2BKey%2BLabelled.pn_NoBB.png
Right - http://www.ncbi.nlm.nih.gov/nuccore/456367267?report=fasta
Chromosome and Genome
http://upload.wikimedia.org/wikipedia/commons/5/53/NHGRI_human_male_karyotype.png
Haplotype
http://upload.wikimedia.org/wikipedia/commons/6/6f/MajorEventsInMeiosis_variant_int.svg
Genome Assembly Problem
http://upload.wikimedia.org/wikipedia/commons/c/c6/USMC-17109.jpg
Overlap Consensus
Eugene W. Myers. Toward simplifying and accurately formulating fragment assembly. Journal of Computational Biology. Summer 1995, 2(2): 275-290. doi:10.1089/cmb.1995.2.275.
De Bruijn Graph
Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. PNAS 2001 98 (17) 9748-9753; doi:10.1073/pnas.171285098
Genome Assembly
http://upload.wikimedia.org/wikipedia/commons/6/6a/CelegansGoldsteinLabUNC.jpg
HapMaker
From http://upload.wikimedia.org/wikipedia/commons/0/01/Single_Chromosome_Mutations.png
HapMaker
array(#)
changes = 0
while (changes / # < %):
    repeat:
        randomly choose x from 1 to #, inclusive
    until x is not in a conserved region and array[x] is not already marked
    if x is to be a SNP:
        choose a base different from the reference at x
        changes += 1
    if x is to be an insertion:
        choose a number of bases y to insert
        changes += y
    if x is to be a deletion:
        choose a number of bases y to delete
        changes += y
read through array and print out as marked
tgu_ref_chr25.fa
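The pseudocode above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not HapMaker's actual source: the function name `make_haplotype`, its parameters (`target_fraction` for `%`, `conserved`, `kinds`, `max_indel`, `seed`), the uniform choice among mutation types, and the rule that a deletion stops early at a marked or conserved position are all hypothetical choices made to keep the example self-contained.

```python
import random

BASES = "ACGT"

def make_haplotype(reference, target_fraction, conserved=(),
                   kinds=("snp", "ins", "del"), max_indel=3, seed=None):
    """Return a mutated copy of `reference` (hypothetical helper, not HapMaker itself)."""
    rng = random.Random(seed)
    n = len(reference)
    marks = [None] * n               # one slot per reference position, as in array(#)
    blocked = set(conserved)         # conserved or already-used positions
    candidates = [i for i in range(n) if i not in blocked]
    rng.shuffle(candidates)          # popping from a shuffled list = random unmarked x
    changes = 0
    # Same test as "changes / # < %", written without a division.
    while changes < target_fraction * n and candidates:
        x = candidates.pop()
        if x in blocked:             # swallowed by an earlier deletion
            continue
        kind = rng.choice(kinds)     # assumption: mutation types are equally likely
        if kind == "snp":
            alt = rng.choice([b for b in BASES if b != reference[x].upper()])
            marks[x] = ("snp", alt)
            blocked.add(x)
            changes += 1
        elif kind == "ins":
            y = rng.randint(1, max_indel)
            marks[x] = ("ins", "".join(rng.choice(BASES) for _ in range(y)))
            blocked.add(x)
            changes += y
        else:
            # Assumption: a deletion is cut short if it would reach a blocked position.
            y = rng.randint(1, max_indel)
            length = 0
            while length < y and x + length < n and (x + length) not in blocked:
                length += 1
            marks[x] = ("del", length)
            blocked.update(range(x, x + length))
            changes += length
    # "Read through array and print out as marked."
    out, i = [], 0
    while i < n:
        mark = marks[i]
        if mark is None:
            out.append(reference[i]); i += 1
        elif mark[0] == "snp":
            out.append(mark[1]); i += 1
        elif mark[0] == "ins":
            out.append(reference[i] + mark[1]); i += 1
        else:
            i += mark[1]             # skip the deleted bases
    return "".join(out)
```

For example, `make_haplotype("ACGT" * 25, 0.05, conserved=range(0, 20), seed=42)` would return a variant of the 100-base input in which roughly 5% of the bases are changed and the first 20 reference positions are never marked.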
Preliminary Results
Assembly                        Longest Sequence    Total      N50
SOAPdenovo 33 contigs                       2557   221771      100
SOAPdenovo 33 scaffolds                   171156   220217      100
SOAPdenovo 35 contigs                       2082     3765      391
SOAPdenovo 35 scaffolds                    70710     2518     7657
Allpaths-LG contigs (no het)                   ?      215     7900
Allpaths-LG scaffolds (no het)                 ?       13   240000
Allpaths-LG contigs (het)                      ?      215     8000
Allpaths-LG scaffolds (het)                    ?       16   234000
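The slides do not define N50. Under the usual definition (the length L such that sequences of length at least L together cover at least half of the total assembled bases), it can be computed as in the sketch below. The function name `n50` is hypothetical, and the assemblers' own statistics scripts may treat ties or minimum-length cutoffs differently.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)   # longest sequences first
    half = sum(ordered) / 2                   # half of the total assembled bases
    running = 0
    for length in ordered:
        running += length
        if running >= half:                   # this sequence crosses the halfway mark
            return length
    return 0
```

With lengths `[80, 70, 50, 40, 30, 20]` the total is 290, and the two longest sequences (80 + 70 = 150) already cover half of it, so the N50 is 70.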
Preliminary Results
Assembly                           Equal  Hard to tell  No hits    Total   Missed
SOAPdenovo 33 contigs              57002          2071   162982   225377     3322
SOAPdenovo 33 scaffolds            52718          1544   162708   220217     3247
SOAPdenovo 35 contigs               4629           690      204     5667      144
SOAPdenovo 35 scaffolds             2022           368       47     2518       81
Allpaths-LG contigs (no het)          41           125        0      223       57
Allpaths-LG scaffolds (no het)         2            11        0       13        0
Allpaths-LG contigs (het)             40           120        0      221       61
Allpaths-LG scaffolds (het)            2            13        0       16        1
Possible Conclusions
• It is really hard to measure the accuracy of heterozygous assemblies.
• SOAPdenovo is not very confident about forming heterozygous contigs.
• Allpaths-LG seems to do very well.
Further Work
From http://upload.wikimedia.org/wikipedia/commons/2/26/Chromosomes_mutations-en.svg