Title Plant Specimen Contextual Data Consensus Authors *Petra ten Hoopen1, <[email protected]> Ramona L. Walls2, <[email protected]> Ethalinda KS Cannon3, <[email protected]> Guy Cochrane1, <[email protected]> James Cole4, <[email protected]> Anjanette Johnston5, <[email protected]> Ilene Karsch-Mizrachi5, <[email protected]> Pelin Yilmaz6, <[email protected]> * corresponding author Affiliations 1 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom 2 CyVerse, University of Arizona, Tucson, Arizona, USA 3 Department of Computer Science, Iowa State University, Ames, IA, USA; USDAARS Corn Insects and Crop Genetics, Ames, IA, USA 4 Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing MI 48824, USA 5 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA 6 Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Celsius str. 1 Bremen Germany 28359 Abstract The Compliance and Interoperability Working Group of the Genomic Standards Consortium facilitates expert-community building and development of recommendations for description of genomic data and associated information. We present here our ongoing conation to harmonise reporting of contextual data of plant specimens associated with genomic and functional genomic data. This commentary summarises the current state of the plant sample contextual data harmonisation thus enabling engagement of a broad plant science community. Keywords plant, specimen, contextual data, checklist Background Publication of well-structured data in an established data resource supports discoverability and a safe preservation of legacy data. If related contextual data is structured in a similar manner it can allow further scientific advances through meaningful comparisons of datasets. Data and contextual data standards fulfil the role of bringing a structure into data, providing recommendations on data formats and specification of attributes that categorise the data. However, standardisation efforts should be harmonised to prevent duplicating of work or establishing contradictory practices. With extensive global interest in molecular analysis of plant species for food, forestry, biomass, and many other applications, we have entered an era of extensive publication of plant molecular data sets in need of structure, with, for example, 0.37 Terabases of plant assembled sequence data in 1017 studies presented from INSDC databases to August 2016. Here we describe an ongoing endeavour to harmonise recommendations to support the plant scientific community in reporting plant specimen contextual data associated with genomic and functional genomic experiments to data archives. Plant Specimen Contextual Data Consensus Plant specimen contextual data provides information about the plant material that is being analysed in a molecular assay. This information layer is distinct from the investigation layer, which specifies the investigation purpose and its authors, and from the experiment layer, which harbours details of the molecular experiment design. Contextual information of a plant specimen is also independent of the plant molecular analysis. This suggests that a common set of plant specimen descriptors can be used for reporting the contextual information about a plant sample that is associated with a molecular dataset. A number of projects are developing recommendations for plant molecular or phenotyping data reporting, Figure 1. We are aiming to unify these developments in a common contextual data set, the Plant Specimen Contextual Data Consensus, that will contribute to consistent reporting of plant specimen information to data repositories, and improve integration of the specimenassociated molecular data among repositories. Several resources were involved in this exercise: (1) the European Nucleotide Archive at the EMBL-EBI, UK [1], archiving genomic and transcriptomic data together with associated contextual data, (2) CyVerse (formerly the iPlant Collaborative), USA [2], a computation infrastructure for life sciences, (3) the GenBank [3] and BioSample databases at the NCBI, USA [4], archiving nucleotide sequence data and associated sample contextual information, and (4) the USDA Agricultural Research Service, USA, a scientific research agency for agriculture. We have also drawn from the expertise of Array Express, EMBL-EBI, UK [5] in collecting plant transcriptomics data and received a valuable input from authors of the MIAPPE (Minimum Information about Plant Phenotypic Experiments) developed by the transPLANT [6], a trans-national project to construct an einfrastructure for plant genomics. Furthermore, we were able to reuse a number of concepts specified in the core and plant host-associated environmental package of the MIxS Standard [7] developed by the Genomic Standards Consortium (GSC, [8]). The first version of the Plant Specimen Contextual Data Consensus is available in the Supplementary Table 1. However, the Consensus is likely to evolve and we therefore encourage readers to always view the latest version of the Consensus at the GSC website [9]. Each contextual data attribute of the Consensus is described with a name, category, suggested requirement level (M – mandatory, C – recommended, X – optional), definition, format and mapping to an available ontology class. Descriptors are divided into four categories: (1) organism descriptors, which specify taxonomic information, (2) sample descriptors, which characterise the material that has been taken from the plant organism and used for an experiment, (3) treatment descriptors, which describe the natural and imposed environmental conditions on the plant before the sample was taken and (4) growth medium descriptors, which provide details of the plant rooting conditions. The Consensus recommends a number of established relevant ontologies and controlled vocabularies: Plant Ontology (PO, [10]), Phenotypic Quality Ontology (PATO, [11]), Crop Ontology (CO, [12]), Plant Trait Ontology (TO, [13]), Plant Environment Ontology (EO, 14]), Environment Ontology (ENVO, [15]), Experimental Factor Ontology (EFO, [16]), Chemical Entities of Biological Interest (CHEBI, [17],) the NCBI Taxonomy index [18] and INSDC country controlled vocabulary [19]. Ten descriptors of the Consensus were identified as essential contextual data for a plant sample of any molecular experiment and these are highlighted in bold and suggested as mandatory. Moreover, a subset of recommended descriptors can be considered for reporting depending on the particular experiment or implementation. Optional descriptors offer further granularity to reporting. Although practical implementation of the Consensus might very depending on the resource adopting it, the Consensus offers a potential for harmonisation of plant specimen contextual data across molecular assays. One implementation is available at the European Nucleotide Archive [20], and will serve for deposition of plant genomic and transcriptomic data to INSDC. This can add to the collection of well described plant samples, such as for instance the Brassica oleracea sample SAMN03858113 [21] or the Hordeum vulgare sample SAMN04549447 [22]. The Consensus presented here is largely in line with recommendations for description of a plant bioresource and its environment and treatment formulated for plant phenotyping data [6] and we envisage its applicability also for description of samples associated with metabolic data. Conclusion In the current era of substantive need for data integration beyond scientific domains and political borders, it is fundamental for both short-term and longterm initiatives to unite their forces when working towards similar goals. We have presented here an example of ongoing transatlantic community collaboration, Figure 1, on a common goal to provide plant scientists with recommendations on how to describe a plant specimen that has been analysed in a molecular experiment. We believe this can contribute to a consistent description of plant specimens and improve integration of the specimenassociated molecular data. List of abbreviations CHEBI – Chemical Entities of Biological Interest CO – Crop Ontology EFO – Experimental Factor Ontology EMBL-EBI – European Molecular Biology Laboratory, European Bioinformatics Institute ENA – European Nucleotide Archive ENVO – Environment Ontology EO – Plant Environment Ontology GSC – Genomic Standards Consortium INSDC – International Nucleotide Sequence Database Collaboration NCBI – National Centre for Biotechnology Information MIAPPE – Minimum Information about Plant Phenotypic Experiments MIxS – Minimum Information about any (x) Sequence PATO – Phenotypic Quality Ontology PO – Plant Ontology TO – Plant Trait Ontology USDA-ARS – Agricultural Research Service of the US Department of Agriculture Ethics approval and consent to participate Not applicable Consent for publication Not applicable Availability of data and material All data generated during this study are included in this published article and its supplementary information files. Competing interest The authors declare that they have no competing interests. Funding The work was supported by the Biotechnology and Biological Sciences Research Council award BB/M018458/1 for PTH and GC and National Science Foundation Award Numbers DBI-0735191 and DBI-1265383 to CyVerse for RLW. Authors’ contribution PTH coordinated the plant specimen contextual data harmonisation, drafted the ENA plant checklist and mapped the draft to the Array Express, transPLANT and MIxS checklists. RLW and EC were leading on mappings to the plant contextual data checklists developed at CyVerse and USDA-ARS, respectively. AJ was leading on mappings to the NCBI Plant Package. IKM, GC and JC provided overall guidance. PY advised on MIxS standard concepts and published the plant specimen contextual data consensus on the GSC website. PTH wrote the manuscript with an editorial contribution and a revision by all co-authors. Acknowledgements We much appreciate contribution of Lisa Harper and Kapeel Chougule. We would like to thank Laura Huerta and Robert Petryszak for their thoughtful suggestions. We are grateful to Pawel Krajewsky, Hanna Ćwiek and Paul Kersey for reviewing the Consensus draft and appreciate comments from Richard Gibson and Clara Amid. References [1] Silvester, N, Alako, B, Amid, C et al. (2015) Content discovery and retrieval services at the European Nucleotide Archive. Nucleic Acid Res, 41, D30-35. [2] Merchant, Nirav, et al., (2016) The iPlant Collaborative: Cyberinfrastructure for Enabling Data to Discovery for the Life Sciences. PLOS Biology, doi: 10.1371/journal.pbio.1002342. [3] Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2016 Jan 4;44(D1):D67-72. doi: 10.1093/nar/gkv1276. [4] Barrett T, Clark K, Gevorgyan R, Gorelenkov V, Gribov E, Karsch-Mizrachi I, Kimelman M, Pruitt KD, Resenchuk S, Tatusova T, Yaschenko E, Ostell J. (2012) BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res. 40, D57-63. [5] Kolesnikov N, Hastings E, Keays M et al. (2015) Array Express update – simplifying data submissions. Nucleic Acid Res, 43, D1113-1116. [6] Krajewski P, Chen D, Ćwiek H et al. (2015) Towards recommendation for metadata and data handling in plant phenotyping. Journal of Experimental Botany 66, 5417-5427. [7] Yilmaz P, Kottman R, Field D, et al. (2011) Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nature Biotechnology, 29, 415-420. [8] Field D., Amaral-Zettler L., Cochrane D. et al. (2011) The Genomic Standards Consortium. PLoS Biol. 9(6):e1001088. doi: 10.1371/journal.pbio.1001088 [9] Plant Specimen Consensus (2016) Plant Specimen Contextual Data associated with molecular data. http://gensc.org/the-plant-specimen-contextual-dataconsensus/ Accessed 1 September 2016. [10] Cooper L., R.L. Walls, J. Elser, M.A. Gandolfo, et al. (2012) The Plant Ontology as a tool for comparative plant anatomy and genomic analyses. Plant and Cell Physiology 54:e1. doi: 10.1093/pcp/pcs163. [11] Gkoutos GV, Green EC, Mallon AM et al. (2005) Using ontologies to describe mouse phenotypes. Genome Biology 6: R8 [12] Shrestha R, Matteis L, Skofic M et al. (2012) Bridging the phenotypic and genetic data useful for integrated breeding through a data annotation using the Crop Ontology developed by the crop communities of practice. Frontier in Physiology http://dx.doi.org/10.3389/fphys.2012.00326 [13] Yamazaki Y, Jaiswal P (2005) Biological ontologies in rice databases. An introduction to the activities in Gramene and Oryzabase. Plant Cell Physiology 46: 63-68. [14] Walls RL, Athreya B, Cooper L, et al. Ontologies as integrative tools for plant science. (2012) American journal of botany. 99(8):1263-1275. [15] Buttigieg, PL, Morrison, N, Smith, B et al. (2013) The environment ontology: contextualising biological and biomedical entities. Journal of Biomedical Semantics 4, 43. [16] Malone, J., Holloway, E., Adamusiak, T. et al. (2010) Modeling sample variables with an Experimental Factor Ontology. Bioinformatics, 26(8), 11121115. [17] Hastings, J., de Matos, P., Dekker, A. et al.(2013) The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Research 42:D456-463. [18] Sayers, E.W., Barrett, T., Benson D.A. et al. (2009) Database resources of the National Center for Biotechnology Information. Nucleic Acid Res, 37, D5-15. [19] INSDC (1986) INSDC controlled vocabulary for values of the /country qualifier. http://www.insdc.org/country.html Accessed 1 September 2016. [20] ENA Plant Sample Checklist (2016) Implementation of the Plant Specimen Contextual Data Consensus in the submission system Webin. https://www.ebi.ac.uk/ena/submit/sra/#home Accessed 1 September 2016. [21] SAMN03858113, Sample record accession number at the resource where this has been issued. http://www.ncbi.nlm.nih.gov/biosample/SAMN03858113 Accessed 1 September 2016. [22] SAMN04549447, Sample record accession number at the resource where this has been issued. http://www.ncbi.nlm.nih.gov/biosample/SAMN04549447 Accessed 1 September 2016. Figure 1: Overview of data resources or initiatives (in black) and guidelines (in grey) contributing to the development of the Plant Specimen Contextual Data Consensus. European Nucleotide Archive (ENA) and Array Express at the European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI, UK), GenBank and BioSample at the National Centre for Biotechnology Information (NCBI, USA), CyVerse (USA), Agricultural Research Service of the US Department of Agriculture (USDA-ARS, USA) and trans-national transPLANT and Genomic Standards Consortium (GSC). Supplementary Table 1: The Plant Specimen Contextual Data Consensus. Each concept of the Consensus is described with a name, category, suggested requirement level (M – mandatory, C – recommended, X – optional), definition, format and mapping to an available ontology class.
© Copyright 2026 Paperzz