Unravelling polyploidy

2/7/2013
NGS Approaches to Sugarcane Genomics
Unravelling polyploidy
Paul Berkman | OCE Postdoctoral Fellow – Sugarcane Genome Bioinformatics
Sunday 13 January 2013
CSIRO PLANT INDUSTRY

Overview
• Low coverage analysis of 6 Saccharum genotypes
• Our BAC sequencing work to date
• Future directions?
Unravelling Polyploidy | Paul Berkman | Page 2
Low-coverage WGS
• Six Saccharum varieties
  • S. officinarum: IJ76-514, LA-Purple
  • S. spontaneum: Mandalay
  • S. officinarum X S. spontaneum: Q165, R570, SP80-3280
• ~0.3-0.8x individual coverage
  • (~0.1-0.4x after filtering)
• ~1-3x combined monoploid coverage
• Non-comprehensive datasets
• Useful for limited analysis
  • Chloroplast phylogeny
  • Repeats: two approaches
  • Monoploid genome size estimation

Low-coverage - Chloroplast phylogeny
• Each dataset mapped to published chloroplast sequence
• Consensus sequence determined for each genotype
• Phylogeny determined using a neighbour-joining tree
• Sorghum as an out-group
[Figure: phylogenetic tree with tips S. bicolor, Mandalay, SP80-3280, NCo-310, Q165, R570, IJ76-514; scale bar 0.0050]
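A neighbour-joining tree is built from a matrix of pairwise distances between the consensus sequences. The sketch below shows one simple way such a matrix could be computed, assuming the consensus sequences are already aligned to the same length. The choice of the uncorrected p-distance, the function names, and the toy sequences are illustrative assumptions, not the method or data used in the talk.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') or an 'N' are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(consensus: dict) -> dict:
    """Pairwise p-distances keyed by (name_1, name_2), names in sorted order."""
    return {
        (x, y): p_distance(consensus[x], consensus[y])
        for x, y in combinations(sorted(consensus), 2)
    }


# Toy, made-up fragments (NOT real chloroplast data).
toy = {
    "genotype_A": "ACGTACGTAC",
    "genotype_B": "ACGTACGTAT",
    "outgroup": "ACGAACNTTT",
}
matrix = distance_matrix(toy)
for pair, dist in matrix.items():
    print(pair, round(dist, 3))
```

The resulting matrix is the kind of input a neighbour-joining implementation takes; the out-group then roots the tree.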
Low-coverage - Repeats analysis
Homology
• Filtered readsets against TIGR Gramineae repeats database
  • BLASTN (2.2.24+)
  • TIGR_Gramineae_Repeats.v3.3
  • E-value 1e-5
  • Categorisation http://plantrepeats.plantbiology.msu.edu/about.html

                  IJ76       LA-Purple  Mandalay   Q165       R570       SP80-3280
# repeats hit     2,535      2,908      2,520      2,515      2,551      2,622
% repeats hit     48.18%     55.27%     47.90%     47.80%     48.49%     49.84%
# repeat hits     5,687,659  4,998,194  3,854,094  4,209,774  3,692,524  5,091,104
# repeat reads    331,307    392,951    211,199    244,160    311,808    540,213
% repeat reads    3.81%      4.52%      2.43%      2.81%      3.58%      6.21%

[Figure: bar chart comparing IJ76, LA Purple, Mandalay, Q165, R570 and SP80; y-axis 0 to 180,000]
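The counts in the homology slide (repeats hit, repeat hits, repeat reads) can be derived from BLASTN tabular output. The following is a minimal sketch, assuming the search was written in the standard 12-column tabular format (`-outfmt 6`), where column 1 is the read ID, column 2 the repeat ID, and column 11 the E-value. The function name, the sample lines, and the read total are hypothetical.

```python
def summarise_repeat_hits(blast_lines, total_reads, max_evalue=1e-5):
    """Summarise BLAST tabular hits of reads against a repeat database."""
    repeat_reads = set()   # distinct reads with at least one accepted hit
    repeats_hit = set()    # distinct database repeats matched by any read
    n_hits = 0             # all accepted read-vs-repeat hits
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        read_id, repeat_id = fields[0], fields[1]
        evalue = float(fields[10])
        if evalue > max_evalue:
            continue
        n_hits += 1
        repeat_reads.add(read_id)
        repeats_hit.add(repeat_id)
    return {
        "repeats_hit": len(repeats_hit),
        "repeat_hits": n_hits,
        "repeat_reads": len(repeat_reads),
        "pct_repeat_reads": 100.0 * len(repeat_reads) / total_reads,
    }


def make_line(read_id, repeat_id, evalue):
    """Build a made-up 12-column tabular line for the example."""
    cols = [read_id, repeat_id, "95.0", "100", "5", "0",
            "1", "100", "1", "100", evalue, "150"]
    return "\t".join(cols)


sample = [
    make_line("read1", "repA", "1e-20"),
    make_line("read1", "repB", "1e-8"),
    make_line("read2", "repA", "1e-6"),
    make_line("read3", "repC", "0.01"),  # fails the E-value cutoff
]
summary = summarise_repeat_hits(sample, total_reads=10)
print(summary)
```

Dividing `repeats_hit` by the number of entries in the repeat database would give the "% repeats hit" row in the same way.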
Low-coverage - Repeats analysis
Homology repeat profiles
[Figure: six pie charts of repeat composition, one per genotype (IJ76-514, LA-purple, Mandalay, Q165, R570, SP80-3280). Categories: Ty1-copia, Ty3-gypsy, other retrotransposons, CACTA, MITE, other transposons, centromere-associated repeats, telomere-associated repeats, 45S rDNA, 5S rDNA, unclassified repeat]

Low-coverage - Repeats analysis
Kmer analysis
• Determined optimal repeat kmer size to be 20
• Calculated proportion of unique vs. non-unique kmers
[Figure: unique vs. non-unique kmer proportions and average number of non-unique kmer occurrences for IJ76, LA-Purple, Mandalay, Q165, R570, SP80-3280 and ALL]
Kmer proportions per dataset (each pair sums to 100%; the extracted layout does not show which value of a pair is the unique fraction):
  IJ76        34.74% / 65.26%
  LA-Purple   24.72% / 75.28%
  Mandalay    37.90% / 62.10%
  Q165        36.26% / 63.74%
  R570        28.59% / 71.41%
  SP80-3280   21.63% / 78.37%
  ALL         11.22% / 88.78%
Average number of non-unique kmer occurrences (values in extracted order; column assignment not recoverable): 4.81, 6.21, 4.14, 5.07, 7.24, 4.69, 4.64
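The kmer analysis can be illustrated with a small counting sketch. It assumes that "unique" means a distinct kmer seen exactly once in a readset and "non-unique" means a distinct kmer seen more than once, which is one plausible reading of the slide rather than a confirmed definition. Reverse complements are not merged and kmers containing N are skipped; both are simplifying assumptions, and the function name is hypothetical.

```python
from collections import Counter


def kmer_profile(reads, k=20):
    """Proportion of unique vs. non-unique kmers and the mean count of the latter."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    distinct = len(counts)
    if distinct == 0:
        return {"pct_unique": 0.0, "pct_non_unique": 0.0,
                "mean_non_unique_occurrences": 0.0}
    non_unique = [c for c in counts.values() if c > 1]
    return {
        "pct_unique": 100.0 * (distinct - len(non_unique)) / distinct,
        "pct_non_unique": 100.0 * len(non_unique) / distinct,
        "mean_non_unique_occurrences":
            sum(non_unique) / len(non_unique) if non_unique else 0.0,
    }


# Tiny made-up reads with k=3 so the counts are easy to follow by hand:
# ACG occurs twice, TTT occurs twice, CGT/GTA/TAC occur once each.
profile = kmer_profile(["ACGTACG", "TTTT"], k=3)
print(profile)
```

For real readsets a dedicated kmer counter would be far more memory-efficient than an in-memory `Counter`; the sketch only shows the bookkeeping.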
Low-coverage - Monoploid genome size
[Figure: one panel per genotype (IJ76-514, LA Purple, Mandalay, Q165, R570, SP80-3280)]
• Individual assembly of NGS datasets
• Each of the 6 genotypes
• CLC Genomics Workbench
• Output average coverage of contigs
• Statistical analysis in R
• Normalised contig coverage for each genotype
• Estimated monoploid coverage of each genotype
• TOTAL DATA VOLUME ÷ MONOPLOID COVERAGE = monoploid genome size
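Because coverage is the total number of sequenced bases divided by genome size, a monoploid genome size follows from dividing the data volume by the estimated monoploid coverage. The sketch below illustrates that arithmetic. Using the median contig coverage as the monoploid coverage estimate is a stand-in assumption (the talk's R analysis is not described in detail), and all names and numbers are hypothetical.

```python
from statistics import median


def estimate_monoploid_coverage(contig_coverages):
    """Stand-in estimator: the median of per-contig average coverage.

    Assumption for illustration only; highly repetitive contigs with
    inflated coverage have little effect on the median.
    """
    if not contig_coverages:
        raise ValueError("need at least one contig coverage value")
    return median(contig_coverages)


def monoploid_genome_size_mbp(data_volume_mbp, monoploid_coverage):
    """Genome size (Mbp) = data volume (Mbp) / coverage (x)."""
    if monoploid_coverage <= 0:
        raise ValueError("coverage must be positive")
    return data_volume_mbp / monoploid_coverage


# Hypothetical per-contig coverages; 9.5 mimics a collapsed repeat contig.
coverage = estimate_monoploid_coverage([1.8, 2.0, 2.1, 2.2, 9.5])
size = monoploid_genome_size_mbp(data_volume_mbp=2100.0,
                                 monoploid_coverage=coverage)
print(coverage, size)
```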
Low-coverage - Monoploid genome size
• Genotype-specific monoploid genome size estimates

Genotype   Filtered data (Mbp)  ** Monoploid coverage (x)  Full coverage (x)  *** Revised monoploid coverage (x)  *** Revised monoploid genome size (Mbp)
IJ76-514   1173.41              3.33                       0.25               2.17                                795
LA-purple  1937.01              5.52                       0.41               3.81                                851
Mandalay   1221.95              3.75                       0.34               2.65                                773
Q165       1198.55              4.40                       0.33               2.74                                710
R570       1518.03              3.83                       0.29               1.96                                1,051
SP80-3280  2761.75              6.78                       0.51               3.20                                1,327

BAC sequencing - Approach
• 72 BACs
  • 37 targeted
  • 35 randomly extracted
• Illumina HiSeq2000, 100bp paired-end reads
• 87.99 Gbp total sequence data generated
• Assembly process
  • Remove bacterial sequence contamination
  • Hard-trim last 20 bp or 40 bp from reads
  • Assembled using local assembler
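The hard-trimming step in the BAC assembly process removes a fixed number of bases from the 3' end of every read, regardless of quality. A minimal sketch of that operation on a single read is shown below; the function name is hypothetical and the actual tool used in the talk is not stated.

```python
def hard_trim(seq, qual, n_trim=20):
    """Remove the last n_trim bases (and matching quality values) from a read."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must be the same length")
    if n_trim < 0:
        raise ValueError("n_trim must not be negative")
    keep = max(len(seq) - n_trim, 0)  # reads shorter than n_trim become empty
    return seq[:keep], qual[:keep]


# Made-up 100 bp read with uniform quality characters.
read_seq = "ACGT" * 25
read_qual = "I" * 100
trimmed_20 = hard_trim(read_seq, read_qual, n_trim=20)
trimmed_40 = hard_trim(read_seq, read_qual, n_trim=40)
print(len(trimmed_20[0]), len(trimmed_40[0]))
```

For paired-end data the same trim would be applied to both mates so that pairs stay in sync.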
BAC sequencing - Results
• High quality raw assemblies

Total number of reads                439,938,797
Total volume sequence data (Mbp)     87,987.76
Bacterial sequence contamination     13.06%
Total number assembled contigs       434
Total assembly size (bp)             7,883,217
Total assembly N50 (bp)              39,008
Overall longest contig (bp)          128,410
Number contigs > 10 kbp              203
Number contigs > 100 kbp             7

• Comparable or better results than published 454 BAC assemblies

BAC sequencing - Directions
• WGS mate-pair libraries for improved scaffolding
• High throughput sequencing
  • 96 BACs/lane barcoded on HiSeq
  • Pooled BAC approach appears promising (4-6 BACs/barcode)
• Sugarcane genome
  • 384 BACs/lane?
  • Full monoploid genome on a single HiSeq Run?
  • Under $500k USD?
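The BAC assembly results report an N50, which is the contig length at which the cumulative length of contigs, taken from longest to shortest, first reaches half of the total assembly size. A short sketch of that calculation, with a hypothetical function name and made-up contig lengths:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Made-up contig lengths (bp): total 30, so half is 15;
# 10 + 8 = 18 is the first cumulative sum to reach it.
example = n50([10, 8, 6, 4, 2])
print(example)
```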
Questions to be resolved
• Alleles/genes & homoeologs/homologs
• How do we think about these?
• How will these be presented?
• Intra- vs. Inter-varietal vs. Inter-species polymorphism?
• Large-scale systems approach to sugarcane
• How to integrate multi-omics data?
• How will we approach the effects of polyploidy on the system?
• Are current tools/approaches sufficient?

Acknowledgements
1c10 Collaborators
CSIRO Colleagues
Peter Bundock
Karen Aitken
Robert Henry
Rosanne Casu
Jiri Stiller
BAC Collaborators
Anne Rae
Paul Visendi
Jingchuan Li
Hélène Bergès
Jai Perroux
Angélique D'Hont
Olivier Garsmeur
Carine Charron
Thank you
CSIRO Plant Industry
Paul Berkman
OCE Postdoctoral Fellow
t +61 7 3214 2361
e [email protected]
w www.csiro.au/pi