Genome and transcriptome sequencing of
watermelon (Citrullus lanatus)
Zhangjun Fei (费章君)
Boyce Thompson Institute for Plant Research
USDA Robert W. Holley Center for Agriculture and Health
Cornell University
Research in Fei lab
Purely computational; no wet lab
• Developing biological databases for efficient storage, management, dissemination, and mining of large and diverse public datasets
  • Tomato Functional Genomics Database (http://solgenomics.net/ted)
  • Cucurbit Genomics Database (http://www.icugi.org)
• Developing computational tools and algorithms for efficient processing, analysis, and integration of large-scale ‘omics’ datasets
  • Plant MetGenMAP (http://bioinfo.bti.cornell.edu/tool/MetGenMAP)
  • iAssembler (http://bioinfo.bti.cornell.edu/tool/iAssembler)
  • iTAK (http://bioinfo.bti.cornell.edu/tool/itak)
• Application of bioinformatics and sequencing technologies for trait discovery, crop improvement, and knowledge advancement
  • Plant virus identification
  • Tomato epigenome
  • Genome and transcriptome analysis of several important crops
  • ………………………………….
Talk overview
• Generation of a high-quality draft genome of watermelon
• Re-sequencing the genomes of 20 representative watermelon accessions from the three C. lanatus subspecies
• Comparative transcriptome sequencing (RNA-seq) of cucurbit phloem sap and vascular bundles, and of watermelon flesh and rind during fruit development
• Initial analysis of sweet potato genome sequencing
Watermelon (Citrullus lanatus)
• A major cucurbit and an important vegetable crop
• Narrow genetic diversity
• Complex quantitative traits: size, SSC, shape, aroma, maturity, shelf-life, uniformity, growth vigor
• Diseases: BFB, GSB, viruses, Fusarium wilt, powdery mildew, downy mildew, Phytophthora blight
• Pests: aphids, root-knot nematode
International Watermelon Genome Initiative
• Generation of a high-quality draft genome of watermelon cultivar 97103, an East-Asia type with early maturity and high fruit quality
• Re-sequencing of representative watermelon genotypes to generate a comprehensive genome sequence variation map
• Large-scale sequencing of watermelon transcriptomes to gain a deeper understanding of important biological processes
Sequencing the genome of 97103
Insert size  Read length (bp)  Total length (Mb)  Sequence depth
100 bp       50                6,845.93           16.11
200 bp       75, 100           16,326.05          38.41
400 bp       44, 75            12,803.65          30.13
2 kb         44                4,172.44           9.82
5 kb         44                1,884.77           4.43
10 kb        44                2,526.47           5.94
20 kb        44                1,616.72           3.80
Total                          46,176.03          108.64
Estimated genome size: 425 Mb
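The depth column appears to follow from dividing each library's total sequence by the estimated genome size. A minimal Python sketch of that arithmetic, using the figures from the slide; the function and variable names are illustrative and not from the original analysis.

```python
# Sequencing depth = total sequenced bases / estimated genome size.
# Library totals (Mb) and the 425 Mb genome size come from the slide;
# the names below are illustrative only.
GENOME_SIZE_MB = 425.0

LIBRARY_TOTAL_MB = {
    "100 bp": 6845.93,
    "200 bp": 16326.05,
    "400 bp": 12803.65,
    "2 kb": 4172.44,
    "5 kb": 1884.77,
    "10 kb": 2526.47,
    "20 kb": 1616.72,
}


def sequencing_depth(total_mb, genome_size_mb=GENOME_SIZE_MB):
    """Return fold coverage for a given amount of sequence."""
    return total_mb / genome_size_mb


# Depth per library, then depth for all libraries combined.
depths = {lib: sequencing_depth(mb) for lib, mb in LIBRARY_TOTAL_MB.items()}
total_depth = sequencing_depth(sum(LIBRARY_TOTAL_MB.values()))

for lib, depth in depths.items():
    print(f"{lib:>6}: {depth:6.2f}x")
print(f" Total: {total_depth:6.2f}x")
```

The slide's total of 108.64 equals the sum of the rounded per-library depths; dividing the summed sequence directly gives roughly 108.65.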
De novo assembly of watermelon genome
Effects of sequence depth and large-insert reads on watermelon genome assembly
[Figure: assembly results by library set: 1. 100-200 bp; 2. 400 bp; 3. 2 kb; 4. 5 kb; 5. 10 kb; 6. 20 kb]
Evaluation of genome assembly quality
Analysis of un-assembled reads
De novo assembly (353.3 Mb) covers 83.2% of the estimated watermelon genome (425 Mb)
                                       Lane 1 (%)          Lane 2 (%)          Lane 3 (%)
Total reads                            13,107,531 (100)    12,037,084 (100)    15,435,854 (100)
Reads aligned to genome by SOAP        10,540,768 (80.42)  10,113,735 (84.02)  13,329,285 (86.35)
Reads not aligned to genome by SOAP    2,566,763 (19.58)   1,923,349 (15.98)   2,106,569 (13.65)
Reads aligned to genome by blast*      2,373,937 (18.11)   1,705,221 (14.17)   1,901,355 (12.32)
Reads not aligned to genome by blast*  192,826 (1.47)      218,128 (1.81)      205,214 (1.33)
[Figure labels: centromere, telomere, 45S and 5S rDNAs]
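The read accounting above is a two-stage subtraction: reads that SOAP cannot place are retried with BLAST, and whatever remains is unplaced. The sketch below reproduces the percentages from the slide's counts; the function name and dictionary keys are hypothetical.

```python
# Per-lane counts from the slide: total reads, reads aligned by SOAP,
# and SOAP-unaligned reads that BLAST could still place on the assembly.
LANES = {
    "Lane 1": (13_107_531, 10_540_768, 2_373_937),
    "Lane 2": (12_037_084, 10_113_735, 1_705_221),
    "Lane 3": (15_435_854, 13_329_285, 1_901_355),
}


def alignment_summary(total, soap_aligned, blast_aligned):
    """Summarise the two-step alignment as counts and percentages of all reads."""
    soap_unaligned = total - soap_aligned        # left over after SOAP
    unplaced = soap_unaligned - blast_aligned    # left over after BLAST as well
    return {
        "soap_aligned_pct": 100.0 * soap_aligned / total,
        "blast_aligned_pct": 100.0 * blast_aligned / total,
        "unplaced": unplaced,
        "unplaced_pct": 100.0 * unplaced / total,
    }


summaries = {lane: alignment_summary(*counts) for lane, counts in LANES.items()}
for lane, s in summaries.items():
    print(f"{lane}: {s['unplaced']:,} reads unplaced ({s['unplaced_pct']:.2f}%)")
```

In every lane, fewer than 2% of reads remain unplaced after both steps.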
Scaffold anchoring and ordering
93.5% of the assembled genome anchored
70% anchored and ordered
65% oriented
Whole genome duplication
Cucurbit genome evolution
[Figure, panels a-c: evolutionary model labeled diploid ancestor (n=7), WGD, hexaploid ancestor (n=21), 81 fissions, 91 fusions, TE invasion, structural shuffling, and modern watermelon species (w1-w11); synteny of watermelon chromosomes (Chr1-Chr11) with cucumber chromosomes (Chr1-Chr7) and melon linkage groups (LG1-LG12)]
The transition from the 21-chromosome eudicot intermediate ancestors involved 81 fissions and 91 fusions to reach the modern 11-chromosome structure of watermelon, which is represented as a mosaic of 102 ancestral blocks.
Repeat sequence annotation
                            Repbase TEs               TE proteins               De novo                   Combined TEs
Type                        Length (bp)  % in genome  Length (bp)  % in genome  Length (bp)  % in genome  Length (bp)  % in genome
DNA transposon              3334975      1.0377       2972663      0.925        12100932     3.77         12100932     3.77
long interspersed element   765975       0.2383       1893537      0.5892       4013972      1.25         4013972      1.25
long terminal repeat        16960692     5.2776       20405741     6.3495       107653915    33.5         107653915    33.5
short interspersed element  27194        0.0085       -            -            637970       0.1985       637970       0.1985
Other                       13490        0.0042       -            -            13490        0.0042       13490        0.0042
Unknown                     13699        0.0043       426          0.0001       50632999     15.76        50632999     15.76
Total                       21116025     6.5706       25272367     7.8638       175053278    54.48        175053278    54.48
Gene annotation
               Number   Percent (%)
Total          23,440   100.00
Annotated      19,836   84.62
  Swissprot    14,873   63.45
  TrEMBL       19,760   84.30
  InterPro     16,266   69.39
  KEGG         10,936   46.66
  GO           11,822   50.44
Unannotated    3,604    15.38
Watermelon genome resequencing
[Figure: resequenced accessions, grouped as C. lanatus subsp. vulgaris East-Asia ecotype, C. lanatus subsp. vulgaris America ecotype, C. lanatus subsp. mucosospermus, and C. lanatus subsp. lanatus]
Accessions: JX-2, JLM, JXF, RZ-901, XHBFGM, Black Diamond, Calhoun Gray, Sugarlee, Sy-904304, RZ-900, PI482271, PI500301, PI189317, PI595203, PI249010, PI482276, PI482303, PI296341-FR, PI482326, PI248178
Watermelon genome resequencing
Genetic diversity of watermelon genomes
Structure of watermelon germplasm
Pattern of 5S and 45S rDNA distribution
C. lanatus subsp. mucosospermus is the
recent ancestor of C. lanatus subsp. vulgaris
Selective sweep
A total of 108 regions (7.78 Mb in total) containing 741 candidate genes
GO term enrichment analysis: regulation of carbohydrate utilization, sugar-mediated signaling, carbohydrate metabolism, response to sucrose stimulus, regulation of nitrogen compound metabolism, cellular response to nitrogen starvation, and growth
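The slide does not say how GO enrichment was tested. One common choice is a one-sided hypergeometric test, sketched below with the standard library only. The 741 candidate genes and the 23,440-gene background come from the slides; the per-term counts in the usage lines are invented purely for illustration.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """One-sided hypergeometric p-value P(X >= hits).

    population: genes in the background set
    annotated:  background genes carrying the GO term
    sample:     genes in the candidate set (e.g. inside sweep regions)
    hits:       candidate genes carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical term: 150 genes genome-wide carry it, 15 of them in the sweeps.
p_value = hypergeom_enrichment_p(population=23_440, annotated=150, sample=741, hits=15)
print(f"p = {p_value:.3g}")
```

In practice the p-values for all tested terms would also need multiple-testing correction, which this sketch omits.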
Evolution of disease resistance genes
• Only 44 NBS-LRR genes were identified in the reference 97103 genome
• The LOX family has expanded in the watermelon genome to 26 members, 19 of which are arranged in two tandem gene arrays
Evolution of disease resistance genes
De novo assembly of un-aligned reads from low-coverage resequencing
Group 1 (cultivated watermelon): East Asia ecotype (5) and America ecotype (5)
Group 2 (semi-wild/wild): Citrullus lanatus subsp. lanatus subsp. egusi (6) and Citrullus lanatus subsp. citroides (4)
Group 1: no disease resistance genes.
E-value    Annotation
4.00E-38   conserved hypothetical protein
6.00E-26   conserved hypothetical protein
1.00E-133  cytochrome c biogenesis orf452
1.00E-47   maturase
3.00E-54   NADH-ubiquinone oxidoreductase fe-s protein
1.00E-37   predicted protein
0          ATP binding protein
2.00E-69   mycolic acid methyl transferase-like protein
1.00E-74   Galactose oxidase precursor, putative
3.00E-18   ribosomal protein S12
5.00E-142  xyloglucan endotransglucosylase/hydrolase
Evolution of disease resistance genes
New genes identified from semi-wild/wild species
E-value    Annotation
1.00E-41   pathogenesis-related protein 1
2.00E-45   TIR-LRR-NBS disease resistance protein
3.00E-189  TIR-LRR-NBS disease resistance protein
2.00E-66   TIR-LRR-NBS disease resistance protein
6.00E-163  TIR-LRR-NBS disease resistance protein
9.00E-71   TIR-LRR-NBS disease resistance protein
1.00E-55   TIR-LRR-NBS disease resistance protein
1.00E-80   13S-lipoxygenase
2.00E-104  lipoxygenase
3.00E-39   loxc homologue
3.00E-12   (3S)-linalool/(E)-nerolidol synthase
1.00E-38   1-aminocyclopropane-1-carboxylate oxidase
2.00E-31   1-aminocyclopropane-1-carboxylate oxidase-1
5.00E-15   2,4-dienoyl-CoA reductase, putative
1.00E-12   23 kDa jasmonate-induced protein
2.00E-06   agglutinin [Amaranthus hypochondriacus]
7.00E-10   Alba DNA/RNA-binding protein
2.00E-13   ATP synthase CF1 epsilon subunit
1.00E-44   ATP synthase subunit alpha, mitochondrial
1.00E-07   B3 domain-containing protein At3g25182
1.00E-13   cytochrome P450
7.00E-11   endo-1,3-beta-glucanase
2.00E-30   F1-ATPase alpha subunit
2.00E-06   gag protease polyprotein
8.00E-120  galactose-binding type-2 ribosome-inactivating protein
1.00E-65   heat shock protein
1.00E-12   histone H2B.2
1.00E-22   huntingtin interacting protein
4.00E-19   hypothetical protein
5.00E-08   hypothetical protein
2.00E-07   hypothetical protein
3.00E-21   jasmonate-induced protein
2.00E-19   laccase family protein/diphenol oxidase family protein
8.00E-06   mandelonitrile lyase 1
2.00E-99   Minor allergen Alt a
7.00E-130  monovalent cation:proton antiporter
2.00E-17   nodulin family protein
2.00E-38   NtPRp27 [Nicotiana tabacum]
8.00E-29   nutrient reservoir
6.00E-120  orcinol O-methyltransferase
1.00E-14   p8MTCP1
2.00E-25   pectin methylesterase-like protein
3.00E-17   phosphate starvation-induced protein 2
5.00E-08   polyphenol oxidase
2.00E-46   polyphenol oxidase 4 precursor
8.00E-04   predicted protein
1.00E-13   predicted protein
3.00E-11   predicted protein
6.00E-16   Protein PRY2 precursor, putative
2.00E-70   putative alcohol acyl-transferases
3.00E-30   putative major latex protein
1.00E-25   putative major latex protein
2.00E-14   putative non-specific lipid-transfer protein type 2 subfamily
3.00E-12   putative WRKY transcription factor 62
6.00E-16   quinone reductase
5.00E-25   ribosomal protein L22
2.00E-60   ribosomal protein S12
1.00E-26   RNA polymerase beta subunit
3.00E-05   RNA recognition motif-containing protein
3.00E-32   RNase NE
2.00E-46   similar to MtN19-like protein
6.00E-19   soluble epoxide hydrolase
2.00E-122  UDP-glucosyltransferase
3.00E-76   UDP-glucosyltransferase family 1 protein
2.00E-14   Ulp1-like peptidase
2.00E-08   Ulp1-like peptidase
2.00E-99   xyloglucan endotransglucosylase/hydrolase
Comparative transcriptome analysis
Strand-specific RNA-seq
Phloem sap and vascular RNA-seq
Sample                                           No. cleaned reads  No. mapped reads
Cucumber Chinese long phloem sap, repeat 1       13807149           10815694
Cucumber Chinese long phloem sap, repeat 2       12733347           10010183
Cucumber Chinese long phloem sap, repeat 3       8686814            7742299
Cucumber Chinese long vascular tissue, repeat 1  7288455            5821286
Cucumber Chinese long vascular tissue, repeat 2  16548993           13637388
Cucumber Chinese long vascular tissue, repeat 3  8215212            7689994
Watermelon 97103 phloem sap, repeat 1            16930703           13232347
Watermelon 97103 phloem sap, repeat 2            12069969           9710926
Watermelon 97103 phloem sap, repeat 3            13465328           11177584
Watermelon 97103 vascular tissue, repeat 1       13607959           10957475
Watermelon 97103 vascular tissue, repeat 2       11380726           10420706
Watermelon 97103 vascular tissue, repeat 3       13224995           12060837
Phloem sap and vascular RNA-seq
• 13,775 and 14,242 mRNA species were identified in watermelon and cucumber vascular bundles, respectively, but only 1,519 and 1,012 transcripts in watermelon and cucumber phloem sap
• Gene sets were almost identical in vascular bundles between the two cucurbit species, whereas only 50-60% of the transcripts detected in the phloem sap were shared
• The Gene Ontology (GO) terms most highly enriched in the common phloem transcripts were response to stress or stimulus
• The GO terms macromolecular biosynthesis process and protein metabolic process were highly enriched in the phloem transcripts unique to watermelon
Watermelon fruit RNA-seq
Sample             No. raw reads  No. rRNA reads  No. cleaned reads  No. mapped reads
flesh 10 DAP rep1  10,131,218     24,132          10,107,086         6,575,358
flesh 10 DAP rep2  10,752,201     40,929          10,711,272         8,705,967
flesh 18 DAP rep1  18,914,328     52,506          18,861,822         14,277,592
flesh 18 DAP rep2  12,077,551     33,363          12,044,188         8,704,999
flesh 26 DAP rep1  13,588,792     56,979          13,531,813         11,412,819
flesh 26 DAP rep2  12,345,625     36,261          12,309,364         6,546,412
flesh 34 DAP rep1  10,306,366     25,801          10,280,565         3,325,540
flesh 34 DAP rep2  8,148,638      49,097          8,099,541          5,130,657
rind 10 DAP rep1   8,647,278      65,174          8,582,104          5,554,196
rind 10 DAP rep2   8,488,793      40,998          8,447,795          7,656,486
rind 18 DAP rep1   10,154,906     33,819          10,121,087         5,798,131
rind 18 DAP rep2   12,837,948     52,792          12,785,156         7,751,759
rind 26 DAP rep1   10,861,345     51,046          10,810,299         7,751,488
rind 26 DAP rep2   11,755,860     61,439          11,694,421         9,686,947
rind 34 DAP rep1   9,778,317      40,707          9,737,610          5,643,690
rind 34 DAP rep2   11,572,988     40,868          11,532,120         7,752,428
[Figure labels: 10 DAP, 18 DAP, 26 DAP, 34 DAP]
Watermelon fruit RNA-seq
• 3,046 and 558 genes were differentially expressed in flesh and rind, respectively, during fruit development
• A model for sugar metabolism in cells of watermelon fruit flesh was proposed based on the expression profiles of sugar metabolic and transporter genes
• Key enzymes regulating citrulline accumulation in watermelon fruit were identified
Summary of watermelon sequencing
• A high-quality draft genome of watermelon cultivar 97103 was generated; it contains 23,440 predicted genes.
• Resequencing of 20 watermelon accessions representing three different C. lanatus subspecies produced numerous haplotypes and revealed the extent of genetic diversity and the population structure of watermelon germplasm.
• Genomic regions that were preferentially selected during domestication were identified; on the other hand, many disease resistance genes were found to have been lost during domestication.
• Integrative genomic and transcriptomic analyses yielded important insights into aspects of phloem-based vascular signaling and identified genes critical to valuable fruit quality traits.
The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse
accessions. Nature Genetics In press
Sweetpotato genome sequencing
flow cytometry analysis
Arumuganathan, K. & Earle, E.D. Nuclear DNA content of some important plant species.
Plant Mol. Biol. Reporter 9, 208–218 (1991)
Sweetpotato genome sequencing
• Huachano, a Peruvian landrace that is amenable to genetic transformation
• About 80 Gb of raw sequence data was generated using the Illumina HiSeq 2000 system
• An additional 50 Gb of raw data was generated using the SOLiD system
Raw data
~246 M read pairs in 200 bp library.
~152 M read pairs in 500 bp library.
Total length 80,317 Mb
Cleaned data
Library  # paired  # single  Read size  Total
SP200    185 M     32 M      94 bp      37.68 Gb
SP500    127 M     17 M      90 bp      24.36 Gb
Total    312 M     49 M      92 bp      62.04 Gb
Sweetpotato genome sequencing
Kmer distribution
[Figure: k-mer depth distributions for sweet potato and watermelon]
Genome size = (total number of k-mers) / (position of peak depth) = 4,639,223,061 / 11 = 421.75 M
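The k-mer estimate above can be sketched in a few lines of Python. The histogram format (depth mapped to number of distinct k-mers), the `min_depth` cutoff for skipping the low-depth error peak, and the toy histogram are assumptions for illustration; only the two numbers in the final call come from the slide.

```python
def peak_depth(histogram, min_depth=3):
    """Return the depth with the most distinct k-mers.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are skipped because sequencing errors usually
    produce a large spike of k-mers that occur only once or twice.
    """
    candidates = {d: n for d, n in histogram.items() if d >= min_depth}
    return max(candidates, key=candidates.get)


def estimate_genome_size(total_kmers, peak):
    """Genome size = total number of k-mers / depth at the main peak."""
    return total_kmers / peak


# Toy histogram (made up) with an error spike at depth 1 and a main peak at 11.
toy_histogram = {1: 1000, 2: 200, 9: 50, 10: 120, 11: 150, 12: 110}
toy_peak = peak_depth(toy_histogram)

# Values quoted on the slide.
slide_estimate = estimate_genome_size(4_639_223_061, 11)
print(f"{slide_estimate / 1e6:.2f} M")
```

The result is close to the 425 Mb genome size quoted earlier for watermelon.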
Sweetpotato genome sequencing
De novo assembly
Scaffolds >= 200 bp (GC% = 38.35%)
         Contig                Scaffold
         Size (bp)    Index    Size (bp)    Index
N90      236          737,451  282          538,782
N80      302          552,738  421          392,362
N70      382          407,767  563          292,438
N60      480          292,372  701          212,612
N50      626          202,185  903          149,749
N25      1,267        59,915   1,695        46,336
Largest  19,628       1        21,622       1
Total    492,615,538  998,299  498,123,765  751,346
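For readers unfamiliar with the Nx rows, the sketch below shows the usual definition: sort sequences from longest to shortest and walk down until the running total reaches x% of the assembly. The "Index" column is taken here to be the number of sequences needed to reach that point, which is an assumption about the slide's table. The contig lengths are toy values.

```python
def nx_stats(lengths, x):
    """Return (Nx length, index) for a list of sequence lengths.

    Nx is the length of the sequence at which the cumulative length,
    counted from the longest sequence down, first reaches x% of the total.
    The index is how many sequences were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for index, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length, index
    raise ValueError("lengths must be non-empty and x must be in (0, 100]")


# Toy assembly of seven contigs (300 bp in total).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
n50 = nx_stats(toy_contigs, 50)
n90 = nx_stats(toy_contigs, 90)
print("N50:", n50, "N90:", n90)
```

Under this definition, the short contig N50 (626 bp) relative to the 492.6 Mb total indicates a highly fragmented assembly.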
Sweetpotato genome sequencing
Heterozygous SNP
Sample              Same Base    homoSNP  heteSNP    Total SNP%  heteSNP%
SP200R1             366,949,901  707,821  1,546,356  0.61%       0.42%
SP200R2             362,273,897  713,986  1,548,594  0.62%       0.42%
SP500R1             434,569,440  728,686  1,920,965  0.61%       0.44%
SP500R2             430,648,776  739,694  1,896,432  0.61%       0.44%
SP Read Pairs       478,919,859  744,669  2,225,150  0.62%       0.46%
cucumber            339,806,337  41,408   81,933     0.04%       0.02%
Cucumber (Maximum)  174,393,370  298,379  392,352    0.39%       0.22%
Cucumber (Mean)     171,397,005  419,873  81,520     0.29%       0.05%
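The two percentage columns appear to be computed over all called positions, that is, the sum of the three count columns. This is inferred from the numbers rather than stated on the slide. A short sketch that reproduces several rows under that assumption, with hypothetical names:

```python
def snp_rates(same_base, homo_snp, hete_snp):
    """Return (total SNP %, heterozygous SNP %) over all called positions."""
    called = same_base + homo_snp + hete_snp
    total_pct = 100.0 * (homo_snp + hete_snp) / called
    hete_pct = 100.0 * hete_snp / called
    return total_pct, hete_pct


# Counts copied from the slide: (Same Base, homoSNP, heteSNP).
ROWS = {
    "SP200R1": (366_949_901, 707_821, 1_546_356),
    "SP Read Pairs": (478_919_859, 744_669, 2_225_150),
    "Cucumber (Maximum)": (174_393_370, 298_379, 392_352),
}

rates = {name: snp_rates(*counts) for name, counts in ROWS.items()}
for name, (total_pct, hete_pct) in rates.items():
    print(f"{name}: total {total_pct:.2f}%, heterozygous {hete_pct:.2f}%")
```

The sweet potato libraries show a heterozygous SNP rate roughly twice the cucumber maximum.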
Sweetpotato genome sequencing
Heterozygous SNP
[Figure labels: 2:1, 1:1, 3:2; B1B1B2B2B2B2]
Acknowledgement
Zhangjun Fei
Linyong Mao
Yi Zheng
……
Yong Xu
Shaogui Guo
Honghe Sun
……
Jianguo Zhang
Zhiwen Wang
Jiumeng Min
……
Jerome Salse
Florent Murat
William Lucas
Byung-Kook Ham
Zhaoliang Zhang
Erik Legg, Xingping Zhang, Eric Ganko
Joel Piquemal, Michel Ragot
Jack de Wit, Remco Ursem, Zhongkui Sun
Rob Dirks, Aat Vogelaar