Mining a cassava full-length cDNA library for gene discovery and its application for cassava improvement

Germán Plata1, Fausto Rodríguez-Zapata1, Tetsuya Sakurai2, Motoaki Seki2, Andrés Salcedo1, Joe Tohme1, Yoshiyuki Sakaki2, Atsushi Toyoda2, Atsushi Ishiwata2, Kazuo Shinozaki2 and Manabu Ishitani1.
1 International Center for Tropical Agriculture (CIAT), A.A. 6713, Cali, Colombia
2 RIKEN Research Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045, Japan
Picture: www.botanypictures.com

Presentation Overview
• Cassava, unique features and challenges
• Tools for cassava computational genomics
• Gene discovery from available data
• Cassava transcriptome properties (fl-cDNA)
• Gene correspondence and classification strategies
• Interesting genes - Gene regulation
• Ongoing work

Cassava (Manihot esculenta)
• South American origin, domesticated 5000 years ago
• Over 165 million tons / year; 84% for human consumption
• Cultivated in tropical countries of Africa, Asia and Latin America
• Used for paper, cardboard, plywood, glue and alcohol production
• Quick recovery after stress exposure
• Photosynthetically active during prolonged water stress
• High productivity in poor soils

Many drawbacks
• Bacterial and viral diseases, arthropod pests
• High water content, post-harvest physiological deterioration
• Low protein / micronutrient (zinc, iron, vitamins) content
• Toxic hydrogen cyanide
• Long selection cycles: 10 - 12 months to harvest
Scarce genomic resources
Picture: www.bspp.org.uk/ndr/

Cassava genome-scale data
• ESTs: cost-effective way of genome analysis
11 954 ESTs
8 577 Unigenes

What ESTs provide
• Partial coding sequences for gene discovery
• Sequences for molecular marker design
• Genome sequence annotation
• Training of gene prediction algorithms
• Gene expression profiles across tissues
• Microarray design/construction
• Clone selection (full-length sequencing)
• Data for comparative genomics

What ESTs don’t provide (but the genome would)
• Information on promoters, introns or upstream regulatory elements
• Variation in non-transcribed regions
• Catalog of ALL cassava genes
• Predicted non-protein-coding elements:
  – miRNA, siRNA
• Non-expressed pseudogenes
• Syntenic relationships among genes

Cassava full-length cDNA
TAI 16: Heat, Drought, High Al, Low pH, PPD, Untreated
Roots - Leaves
Full-length cDNA cloning and EST sequencing
Structural & Functional Analysis
Comparative Genomics

Library Statistics
• 19 967 clones - 39 934 reads
• 10 577 scaffolds (contigs + singlets)
• ~84% full-length ratio
• 7 796 distinct genes
• 26% with alternative transcripts
• 6 967 new cassava transcripts
• 1 521 - ‘No hits found’
• 58 698 predicted cassava transcripts

Sequence functional annotation: similarity search
• Statistically significant alignments imply sequence homology
• Function is well conserved above 40% protein sequence identity (Wilson et al., 2000)
• 70-80% similar functions for e-values below 10^-30 (Joshi and Xu, 2007)

Verify functional relationships
• ProSite
• Protein Data Bank (PDB)

GOMP
Gene Ontology + KEGG pathways
Fausto Rodriguez - CIAT 2006
• Similarity search
• Inferred annotations
• Keyword Search
• Hierarchy Navigation
• Links to external databases – KEGG, NCBI, TAIR, UNIPROT
• Example - Starch Biosynthesis

Functional Annotation
8 227 (78%) Annotated Sequences
Graphs for 101 out of 114 Arabidopsis pathways
61% of the enzymes per pathway on average
[Figure: per-pathway coverage on a 0% to 100% scale, Arabidopsis vs. Cassava]
• Glycolysis/Gluconeogenesis (100%)
• Pentose phosphate pathway (93%)
• Carbon fixation (92%)
• Starch and sucrose metabolism (76%)
• Stilbene, coumarine and lignin biosynthesis (73%)
• Biosynthesis of steroids (70%)

Structural Annotation
• 5’UTRs of 1 949 sequences
• 3’UTRs of 2 241 sequences
• Full CDS for 732 genes
• Useful to train gene models
• UTRs with higher folding energies, lengths and GC % than those of Populus and Arabidopsis
• Tighter post-transcriptional regulation?
• 1 391 SSRs and 2 356 SNPs
• 570 markers for genes in 86 KEGG pathways

Stress Genes and Gene correspondence
Problem:
• Assign orthologous groups bearing in mind gene duplications
• Trivial for complete genomes of closely related species
• Uncertainty for incomplete genomes – transcriptomes

Gene correspondence (cont.)
• Strategies:
  – Phylogenetics: poorly automatable, computationally expensive
• Pairwise sequence alignments:
  – BBHs (best bidirectional hits)
  – Best Unambiguous Subsets (Kellis et al., 2004)

Available genomes
• Complete sets of genes for: Fabaceae: Glycine, Medicago

Stress Genes
• 181 transcripts for 32 out of 44 stress-induced RAFL microarray genes (Seki et al., 2001)
• Cassava “exclusive” sequences have overrepresented GO terms in stress-related categories (BBH)
Possible gene duplications?
Motifs in BBH Network

Novel genes in the cassava lineage
• 230 candidate gene duplications
• Many enzymes involved in response to stimulus
• Important enzymes in ROS pathway
Duplicated enzymes in cassava compared to poplar, castor bean and Arabidopsis

Gene regulation
Transcription Factors
• Similarity Search + Pfam HMMs

CASSAVA FL-cDNA TRANSCRIPTION FACTOR FAMILIES
Family         Count   Family    Count   Family        Count
ABI3VP1            5   C3H          26   HSF               8
Alfin-like         0   CAMTA         2   LFY               0
AP2-EREBP         24   CCAAT        13   LIM               4
ARF                1   CPP           0   MADS              6
ARR-B              8   CSD           6   MYB              32
BBR/BPC            0   DBP          11   MYB-related       2
BES1               3   E2F-DP        1   NAC              21
bHLH              11   EIL           1   NOZZLE            0
bHSH               0   FHA           2   Orphans           0
bZIP               9   G2-like       0   PBF-2-like        1
C2C2-CO-like      19   GeBP          3   PLATZ             7
C2C2-Dof           9   GRAS         11   Pseudo            0
C2C2-GATA          7   GRF           0   RWP-RK            0
C2C2-YABBY         5   HB           10   S1Fa-like         2
C2H2              13   HRT           0   SAP               0

Further families: SBP, Sigma70-like, SRS, TAZ, TCP, Trihelix, TUB, ULT, VOZ, WRKY, zf-HD, ZIM
OTHER TRANSCRIPTIONAL REGULATORS: ARID, AUX/IAA, DDT, HMG, Jumonji, LUG, MBF1, PHD, RB, SET, SNF2
Counts for these 23 families (as extracted; pairing with family names not preserved): 0 0 0 6 4 0 4 10 0 0 5 1 0 6 0 9 0 0 7 3 0 3 1
TOTAL transcripts in families: 355
Genes in 44/68 Families
Definitions: http://plntfdb.bio.uni-potsdam.de/v2.0/

Cassava Gene Families
The MCL clustering algorithm (Stijn van Dongen, 2000) was used to predict gene families in the cassava dataset.

Gene Regulation - miRNAs
• miRBase: 339 distinct plant microRNAs
• Prediction pipeline: 34 microRNA precursors
Near-perfect complementarity
160 microRNAs
328 Putative Targets
Transcription Factors

Ongoing Work
• Microarrays (Dr. Sakurai)
• Validation of duplications - miRNA predictions
• Many interesting genes cloned and available
• Further characterization required

Ongoing Work
• New full-length library: Biotic - Abiotic stresses
• Normalized library - Rare transcripts

Abiotic stresses                            Target tissues
• Drought                                   Roots, stems
• Overflowing                               Roots, stems
• Fertilization overdose (urea)             Stems
• Hormone overdose (Picloram)               Leaves, stems
• Pesticide overdose (Vertimec)             Leaves, stems

Biotic stresses                             Target tissues
• Xanthomonas axonopodis                    Leaves, stems
• Whitefly (Aleurotrachelus socialis)       Leaves, stems
• Green mites (Mononychellus tanajoa)       Stems
• Mealybugs (Phenacoccus herreni)           Leaves, stems
• Hornworm (Erinnyis ello)                  Leaves

New fl-cDNA ECU 72
96-well plate snapshot:
• 89 Scaffolds: 56 contigs, 33 singlets
• 28 New Full-length cDNAs
• 10 New Cassava Genes
• 81 Sequences with Blast Hits - Annotation
Novel Genes - Overlap with TAI 16 Library
Polymorphism discovery (within-the-gene markers)

Acknowledgements
• Fausto Rodríguez
• Andrés Salcedo
• Danilo Moreta
• Joe Tohme
• Manabu Ishitani
http://www.biomedcentral.com/1471-2229/7/66
Picture: www.botanypictures.com
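
Supplementary sketch: BBHs. The slides list BBHs (best bidirectional hits) as one pairwise-alignment strategy for gene correspondence, without showing how they are computed. The Python below is a minimal illustration of the general idea, not the pipeline used for the cassava data. It assumes similarity-search results are already available as (query, subject, score) tuples; the sequence IDs (Me1, At1, ...) and the function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# One similarity-search hit: (query_id, subject_id, score), higher score = better.
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its single highest-scoring subject.

    Ties keep the first hit seen (an arbitrary choice for this sketch).
    """
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b: Iterable[Hit],
                            b_vs_a: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's best hit AND a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


# Hypothetical toy data: two cassava transcripts against two Arabidopsis genes.
cassava_vs_arabidopsis = [
    ("Me1", "At1", 200.0),
    ("Me1", "At2", 150.0),
    ("Me2", "At1", 180.0),
]
arabidopsis_vs_cassava = [
    ("At1", "Me1", 210.0),
    ("At1", "Me2", 170.0),
    ("At2", "Me1", 140.0),
]

pairs = bidirectional_best_hits(cassava_vs_arabidopsis, arabidopsis_vs_cassava)
print(sorted(pairs))
```

In this toy data Me2's best hit is At1, but At1's best hit is Me1, so Me2 is left without a partner. Unreciprocated sequences of this kind are where the gene-duplication question raised in the slides comes up, and a transcriptome-only dataset makes such cases harder to interpret than a complete genome would.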