PPT - GCP21

Mining a cassava full-length cDNA
library for gene discovery and its
application for cassava improvement
Germán Plata1, Fausto Rodríguez-Zapata1, Tetsuya Sakurai2, Motoaki Seki2, Andrés
Salcedo1, Joe Tohme1, Yoshiyuki Sakaki2, Atsushi Toyoda2, Atsushi Ishiwata2, Kazuo
Shinozaki2 and Manabu Ishitani1.
1 International Center for Tropical Agriculture (CIAT), A.A. 6713, Cali, Colombia
2 RIKEN Research Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045, Japan
Picture: www.botanypictures.com
Presentation Overview
• Cassava, unique features and challenges
• Tools for cassava computational genomics
• Gene discovery from available data
• Cassava transcriptome properties (fl-cDNA)
• Gene correspondence and classification strategies
• Interesting genes - Gene regulation
• Ongoing work
Cassava (Manihot esculenta)
• South American origin, domesticated 5000 years ago
• Over 165 million tons / year; 84% for human consumption
• Cultivated in tropical countries of Africa, Asia and Latin America
• Used for paper, cardboard, plywood, glue and alcohol production
• Quick recovery after stress exposure
• Photosynthetically active during prolonged water stress
• High productivity in poor soils
Many drawbacks
• Bacterial and viral diseases, arthropod pests
• High water content, post-harvest physiological deterioration
• Low protein / micronutrient (zinc, iron, vitamins) content
• Toxic hydrogen cyanide
• Long selection cycles: 10 - 12 months to harvest
Scarce genomic resources
Picture: www.bspp.org.uk/ndr/
Cassava genome-scale data
• ESTs: cost-effective way of genome analysis
11 954 ESTs
8 577 Unigenes
What ESTs provide
• Partial coding sequences for gene discovery
• Sequences for molecular marker design
• Genome sequence annotation
• Training of gene prediction algorithms
• Gene expression profiles across tissues
• Microarray design/construction
• Clone selection (full-length sequencing)
• Data for comparative genomics
What ESTs don’t provide
(but the genome would)
• Information on promoters, introns or
upstream regulatory elements
• Variation in non-transcribed regions
• Catalog of ALL cassava genes
• Predicted non-protein-coding elements:
– miRNA, siRNA
• Non-expressed pseudogenes
• Syntenic relationships among genes
Cassava full-length cDNA
• Cultivar: TAI 16
• Treatments: heat, drought, high Al, low pH, PPD, untreated
• Tissues: roots, leaves
• Workflow: full-length cDNA cloning and EST sequencing, then structural & functional analysis, then comparative genomics
Library Statistics
• 19 967 clones - 39 934 reads
• 10 577 scaffolds (contigs + singlets)
• ~84% full-length ratio
• 7 796 distinct genes
• 26% with alternative transcripts
• 6 967 new cassava transcripts
• 1 521 - ‘No hits found’
• 58 698 predicted cassava transcripts
Sequence functional annotation
similarity search
• Statistically significant alignments imply sequence
homology
• Function is well conserved above 40% protein
sequence identity (Wilson et al., 2000)
• Similar functions in 70-80% of cases for E-values below 10^-30
(Joshi and Xu, 2007)
Verify functional relationships
• ProSite
• Protein Data Bank (PDB)
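The identity and E-value thresholds quoted above could be applied to similarity-search output roughly as follows. This is a minimal sketch, assuming the hits are stored in the 12-column tab-separated layout written by BLAST+ with `-outfmt 6`. The function name `filter_hits`, the choice to require both cut-offs at once, and the sample rows are illustrative assumptions, not the pipeline used for this library.

```python
def filter_hits(lines, min_identity=40.0, max_evalue=1e-30):
    """Keep hits that satisfy both the identity and the E-value cut-off.

    Each line is expected to hold the 12 tab-separated columns of BLAST
    tabular output: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])   # percent identity
        evalue = float(fields[10])    # expectation value
        if identity >= min_identity and evalue <= max_evalue:
            kept.append((query, subject, identity, evalue))
    return kept


# Made-up example rows: identifiers and values are hypothetical.
example = [
    "cas001\tAT1G01010\t62.5\t200\t75\t0\t1\t200\t1\t200\t1e-80\t290",
    "cas002\tAT2G02020\t35.0\t180\t117\t0\t1\t180\t1\t180\t1e-40\t150",
    "cas003\tAT3G03030\t55.0\t90\t40\t1\t1\t90\t1\t90\t1e-12\t60",
]
hits = filter_hits(example)
```

With the default thresholds only the first row survives: the second fails the identity cut-off and the third fails the E-value cut-off.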
GOMP
Gene Ontology + KEGG pathways
Fausto Rodriguez - CIAT 2006
Similarity search
Inferred annotations
<<< Keyword Search
<<< Hierarchy Navigation
Links to external databases – KEGG, NCBI, TAIR, UNIPROT
Example - Starch Biosynthesis
Functional Annotation
8227 (78%) Annotated Sequences
Graphs for 101 out of 114 Arabidopsis pathways
61 % of the enzymes per pathway on average
[Figure legend: coverage scale from 0% to 100%; Arabidopsis vs. cassava]
Glycolysis/Gluconeogenesis (100%)
Pentose phosphate pathway (93%)
Carbon fixation (92%)
Starch and sucrose metabolism (76%)
Stilbene, coumarine and lignin biosynthesis (73%)
Biosynthesis of steroids (70%)
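The per-pathway percentages above can be read as the share of a reference pathway's enzymes that have at least one annotated cassava sequence. A minimal sketch of that calculation, assuming each pathway is given as a set of EC numbers; the function names, pathway labels and toy enzyme sets are hypothetical and are not the Arabidopsis or cassava data behind the figures on the slide.

```python
def pathway_coverage(reference_pathways, annotated_enzymes):
    """Percent of each reference pathway's enzymes found in the annotation."""
    coverage = {}
    for name, enzymes in reference_pathways.items():
        if not enzymes:
            continue  # avoid dividing by zero for empty pathways
        found = enzymes & annotated_enzymes
        coverage[name] = 100.0 * len(found) / len(enzymes)
    return coverage


def mean_coverage(coverage):
    """Average coverage across pathways."""
    return sum(coverage.values()) / len(coverage)


# Toy input: pathway names and EC sets are placeholders.
reference = {
    "pathway A": {"1.1.1.1", "2.7.1.1", "4.1.2.13", "5.3.1.9"},
    "pathway B": {"2.4.1.21", "2.7.7.27"},
}
cassava = {"1.1.1.1", "2.7.1.1", "5.3.1.9", "2.4.1.21", "3.2.1.1"}

cov = pathway_coverage(reference, cassava)
average = mean_coverage(cov)
```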
Structural Annotation
• 5’UTRs of 1 949 sequences
• 3’UTRs of 2 241 sequences
• Full CDS for 732 genes
• Useful to train gene models
• UTRs with higher folding energies, lengths and
GC % than those of Populus and Arabidopsis
• Tighter post-transcriptional regulation?
• 1 391 SSRs and 2 356 SNPs
• 570 markers for genes in 86 KEGG pathways
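One simple way to locate SSR candidates in transcript sequences is a regular-expression scan for short motifs repeated in tandem. The sketch below is only an illustration: the function name `find_ssrs`, the motif length range (2 to 6 nt) and the minimum of five repeats are assumptions, not the criteria used to obtain the 1 391 SSRs reported above.

```python
import re


def find_ssrs(sequence, min_repeats=5, min_motif=2, max_motif=6):
    """Return (start, motif, repeat_count) for tandem repeats in a DNA string."""
    # A lazily matched motif followed by at least (min_repeats - 1) more copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    results = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # ignore homopolymer runs such as AAAAAAAAAA
        repeats = len(match.group(0)) // len(motif)
        results.append((match.start(), motif, repeats))
    return results


# Toy sequence: six copies of AT flanked by two bases on each side.
ssrs = find_ssrs("GGATATATATATATCC")
```

Because the motif is matched lazily, a run such as ATATATATATAT is reported with the shortest motif (AT) rather than ATAT.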
Stress Genes and Gene
correspondence
Problem:
• Assign orthologous groups bearing in mind
gene duplications
• Trivial for complete genomes of closely
related species
• Uncertainty for incomplete genomes –
transcriptomes
Gene correspondence (cont.)
• Strategies:
– Phylogenetics: hard to automate, computationally expensive
• Pairwise sequence alignments:
– Bidirectional best hits (BBHs)
– Best Unambiguous Subsets (Kellis et al., 2004)
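The BBH criterion listed above can be sketched in a few lines: two genes are paired only if each is the other's top-scoring hit. This is a minimal illustration, assuming hits are available as (query, subject, score) tuples in both search directions; the function names, gene identifiers and scores are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy hit lists: identifiers and scores are made up.
cassava_vs_ath = [
    ("cas1", "ath1", 300), ("cas1", "ath2", 120),
    ("cas2", "ath1", 250), ("cas3", "ath3", 90),
]
ath_vs_cassava = [
    ("ath1", "cas1", 310), ("ath1", "cas2", 240),
    ("ath2", "cas1", 110), ("ath3", "cas3", 95),
]
pairs = bidirectional_best_hits(cassava_vs_ath, ath_vs_cassava)
```

In this toy case cas2 hits ath1 but is not ath1's best hit, so it receives no BBH partner; sequences left over in this way are the kind that would need a duplication-aware method.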
Available genomes
• Complete sets of genes for:
Fabaceae
Glycine
Medicago
Stress Genes
• 181 transcripts for 32 out of 44 stress-induced RAFL
microarray genes (Seki et al., 2001)
• Cassava “exclusive” sequences have overrepresented
GO terms in stress-related categories (BBH)
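The slides do not say which statistic was used to call a GO term overrepresented. A common choice is a one-sided hypergeometric test, sketched here under that assumption; the function name and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes overall, 3 carry the term, 2 are in the subset of
# interest, and at least 1 of those 2 carries the term.
p_value = hypergeom_enrichment_p(population=10, annotated=3, sample=2, hits=1)
```

For these toy numbers the tail probability is 24/45; in practice the p-values for many GO terms would also need a multiple-testing correction.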
Possible gene duplications?
Motifs in BBH Network
Novel genes in the cassava lineage
• 230 candidate gene duplications
• Many enzymes involved in response to stimulus
• Important enzymes in ROS pathway
Duplicated enzymes in
cassava compared to
poplar, castor bean and
Arabidopsis
Gene regulation
Transcription Factors
• Similarity Search + Pfam HMMs
CASSAVA FL-cDNA TRANSCRIPTION FACTOR FAMILIES
Family          Transcripts
ABI3VP1         5
Alfin-like      0
AP2-EREBP       24
ARF             1
ARR-B           8
BBR/BPC         0
BES1            3
bHLH            11
bHSH            0
bZIP            9
C2C2-CO-like    19
C2C2-Dof        9
C2C2-GATA       7
C2C2-YABBY      5
C2H2            13
C3H             26
CAMTA           2
CCAAT           13
CPP             0
CSD             6
DBP             11
E2F-DP          1
EIL             1
FHA             2
G2-like         0
GeBP            3
GRAS            11
GRF             0
HB              10
HRT             0
HSF             8
LFY             0
LIM             4
MADS            6
MYB             32
MYB-related     2
NAC             21
NOZZLE          0
Orphans         0
PBF-2-like      1
PLATZ           7
Pseudo          0
RWP-RK          0
S1Fa-like       2
SAP             0
SBP
Sigma70-like
SRS
TAZ
TCP
Trihelix
TUB
ULT
VOZ
WRKY
zf-HD
ZIM
OTHER TRANSCRIPTIONAL REGULATORS
ARID
AUX/IAA
DDT
0 HMG
0 Jumonji
0 LUG
6 MBF1
4 PHD
0 RB
TOTAL transcripts in families: 355
4 SET
10 SNF2
0
0
5
1
0
6
0
9
0
0
7
3
0
Genes in 44/68 Families
Definitions:
http://plntfdb.bio.uni-potsdam.de/v2.0/
3
1
Cassava Gene Families
The MCL clustering algorithm (van Dongen, 2000) was used to
predict gene families in the cassava dataset
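For readers unfamiliar with MCL, the core loop alternates expansion (squaring a column-stochastic matrix) with inflation (raising entries to a power and re-normalising columns) until the matrix stops changing. The pure-Python sketch below illustrates that idea on a toy similarity matrix. It is not the reference `mcl` program, and the inflation value, tolerance and input matrix are assumptions chosen for the example.

```python
def _normalize_columns(matrix):
    """Scale every column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def _square(matrix):
    """Expansion step: multiply the matrix by itself."""
    n = len(matrix)
    return [
        [sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster nodes of a symmetric similarity matrix; returns sorted index lists."""
    n = len(similarity)
    m = [[float(v) for v in row] for row in similarity]
    for i in range(n):
        m[i][i] = max(m[i][i], 1.0)  # add self-loops
    m = _normalize_columns(m)
    for _ in range(max_iter):
        expanded = _square(m)
        inflated = _normalize_columns(
            [[v ** inflation for v in row] for row in expanded]
        )
        delta = max(
            abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n)
        )
        m = inflated
        if delta < tol:
            break
    # Each row with non-zero entries lists the members of one cluster.
    clusters = {
        frozenset(j for j, v in enumerate(row) if v > 1e-6) for row in m
    }
    return sorted(sorted(c) for c in clusters if c)


# Toy input: two groups of three mutually similar sequences, no links between groups.
toy = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
families = mcl(toy)
```

In a gene-family setting the matrix entries would typically come from pairwise similarity scores, and a higher inflation value tends to give smaller, tighter clusters.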
Gene Regulation - miRNAs
• miRBase: 339 distinct plant microRNAs
• Prediction pipeline: 34 microRNA precursors
Near-perfect complementarity
160 microRNAs
328 Putative Targets
Transcription Factors
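Target prediction by near-perfect complementarity can be illustrated by sliding the reverse complement of a microRNA along a transcript and counting mismatches. This is a minimal sketch under simplifying assumptions: no gaps, no special treatment of G:U pairs, and a mismatch limit of three chosen only for illustration. The function names and the short toy sequences are hypothetical and are not the prediction pipeline referred to above.

```python
_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement_rna(seq):
    """Reverse complement in the RNA alphabet (T is read as U)."""
    rna = seq.upper().replace("T", "U")
    return "".join(_COMPLEMENT[base] for base in reversed(rna))


def find_target_sites(mirna, transcript, max_mismatches=3):
    """Return (start, mismatches) for windows nearly complementary to mirna."""
    site = reverse_complement_rna(mirna)  # the perfectly complementary site
    target = transcript.upper().replace("T", "U")
    hits = []
    for start in range(len(target) - len(site) + 1):
        window = target[start:start + len(site)]
        mismatches = sum(1 for a, b in zip(site, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Toy sequences, far shorter than a real microRNA (about 21 nt).
perfect = find_target_sites("AUGGC", "UUGCCAUAA", max_mismatches=0)
one_off = find_target_sites("AUGGC", "UUGCCGUAA", max_mismatches=1)
```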
Ongoing Work
• Microarrays (Dr. Sakurai)
• Validation of duplications - miRNA
predictions
• Many interesting genes cloned and
available.
• Further characterization required
Ongoing Work
• New full-length library : Biotic - Abiotic stresses
• Normalized library - Rare transcripts
Abiotic stresses (target tissues)
• Drought: roots, stems
• Overflowing: roots, stems
• Fertilization overdose (urea): stems
• Hormone overdose (Picloram): leaves, stems
• Pesticide overdose (Vertimec): leaves, stems
Biotic stresses (target tissues)
• Xanthomonas axonopodis: leaves, stems
• Whitefly (Aleurotrachelus socialis): leaves, stems
• Green mites (Mononychellus tanajoa): stems
• Mealybugs (Phenacoccus herreni): leaves, stems
• Hornworm (Erinnyis ello): leaves
New fl-cDNA
ECU 72
96-well plate snapshot:
• 89 Scaffolds: 56 Contigs, 33 singlets
• 28 New Full-length cDNAs
• 10 New Cassava Genes
• 81 Sequences with Blast Hits - Annotation
Novel Genes - Overlap with TAI 16 Library
Polymorphism discovery (within-the-gene-markers)
Acknowledgements
• Fausto Rodríguez
• Andrés Salcedo
• Danilo Moreta
• Joe Tohme
• Manabu Ishitani
http://www.biomedcentral.com/1471-2229/7/66
Picture: www.botanypictures.com