Plant Specimen Contextual Data Consensus

Title
Plant Specimen Contextual Data Consensus
Authors
*Petra ten Hoopen1, <[email protected]>
Ramona L. Walls2, <[email protected]>
Ethalinda KS Cannon3, <[email protected]>
Guy Cochrane1, <[email protected]>
James Cole4, <[email protected]>
Anjanette Johnston5, <[email protected]>
Ilene Karsch-Mizrachi5, <[email protected]>
Pelin Yilmaz6, <[email protected]>
* corresponding author
Affiliations
1 European Molecular Biology Laboratory, European Bioinformatics Institute,
Wellcome Genome Campus, Hinxton, Cambridge,
CB10 1SD, United Kingdom
2 CyVerse, University of Arizona, Tucson, Arizona, USA
3 Department of Computer Science, Iowa State University, Ames, IA, USA; USDAARS Corn Insects and Crop Genetics, Ames, IA, USA
4 Department of Plant, Soil and Microbial Sciences, Michigan State University,
East Lansing MI 48824, USA
5 National Center for Biotechnology Information, National Library of Medicine,
National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD
20894, USA
6 Microbial Genomics and Bioinformatics Research Group, Max Planck Institute
for Marine Microbiology, Celsius str. 1 Bremen Germany 28359
Abstract
The Compliance and Interoperability Working Group of the Genomic Standards
Consortium facilitates expert-community building and development of
recommendations for description of genomic data and associated information.
We present here our ongoing conation to harmonise reporting of contextual data
of plant specimens associated with genomic and functional genomic data. This
commentary summarises the current state of the plant sample contextual data
harmonisation thus enabling engagement of a broad plant science community.
Keywords
plant, specimen, contextual data, checklist
Background
Publication of well-structured data in an established data resource supports
discoverability and a safe preservation of legacy data. If related contextual data
is structured in a similar manner it can allow further scientific advances through
meaningful comparisons of datasets.
Data and contextual data standards fulfil the role of bringing a structure into
data, providing recommendations on data formats and specification of attributes
that categorise the data. However, standardisation efforts should be harmonised
to prevent duplicating of work or establishing contradictory practices.
With extensive global interest in molecular analysis of plant species for food,
forestry, biomass, and many other applications, we have entered an era of
extensive publication of plant molecular data sets in need of structure, with, for
example, 0.37 Terabases of plant assembled sequence data in 1017 studies
presented from INSDC databases to August 2016.
Here we describe an ongoing endeavour to harmonise recommendations to
support the plant scientific community in reporting plant specimen contextual
data associated with genomic and functional genomic experiments to data
archives.
Plant Specimen Contextual Data Consensus
Plant specimen contextual data provides information about the plant material
that is being analysed in a molecular assay. This information layer is distinct
from the investigation layer, which specifies the investigation purpose and its
authors, and from the experiment layer, which harbours details of the molecular
experiment design. Contextual information of a plant specimen is also
independent of the plant molecular analysis. This suggests that a common set of
plant specimen descriptors can be used for reporting the contextual information
about a plant sample that is associated with a molecular dataset.
A number of projects are developing recommendations for plant molecular or
phenotyping data reporting, Figure 1. We are aiming to unify these
developments in a common contextual data set, the Plant Specimen Contextual
Data Consensus, that will contribute to consistent reporting of plant specimen
information to data repositories, and improve integration of the specimenassociated molecular data among repositories.
Several resources were involved in this exercise: (1) the European Nucleotide
Archive at the EMBL-EBI, UK [1], archiving genomic and transcriptomic data
together with associated contextual data, (2) CyVerse (formerly the iPlant
Collaborative), USA [2], a computation infrastructure for life sciences, (3) the
GenBank [3] and BioSample databases at the NCBI, USA [4], archiving nucleotide
sequence data and associated sample contextual information, and (4) the USDA
Agricultural Research Service, USA, a scientific research agency for agriculture.
We have also drawn from the expertise of Array Express, EMBL-EBI, UK [5] in
collecting plant transcriptomics data and received a valuable input from authors
of the MIAPPE (Minimum Information about Plant Phenotypic Experiments)
developed by the transPLANT [6], a trans-national project to construct an einfrastructure for plant genomics. Furthermore, we were able to reuse a number
of concepts specified in the core and plant host-associated environmental
package of the MIxS Standard [7] developed by the Genomic Standards
Consortium (GSC, [8]).
The first version of the Plant Specimen Contextual Data Consensus is available in
the Supplementary Table 1. However, the Consensus is likely to evolve and we
therefore encourage readers to always view the latest version of the Consensus
at the GSC website [9]. Each contextual data attribute of the Consensus is
described with a name, category, suggested requirement level (M – mandatory, C
– recommended, X – optional), definition, format and mapping to an available
ontology class.
Descriptors are divided into four categories:
(1) organism descriptors, which specify taxonomic information, (2) sample
descriptors, which characterise the material that has been taken from the plant
organism and used for an experiment, (3) treatment descriptors, which describe
the natural and imposed environmental conditions on the plant before the
sample was taken and (4) growth medium descriptors, which provide details of
the plant rooting conditions.
The Consensus recommends a number of established relevant ontologies and
controlled vocabularies:
Plant Ontology (PO, [10]), Phenotypic Quality Ontology (PATO, [11]), Crop
Ontology (CO, [12]), Plant Trait Ontology (TO, [13]), Plant Environment Ontology
(EO, 14]), Environment Ontology (ENVO, [15]), Experimental Factor Ontology
(EFO, [16]), Chemical Entities of Biological Interest (CHEBI, [17],) the NCBI
Taxonomy index [18] and INSDC country controlled vocabulary [19].
Ten descriptors of the Consensus were identified as essential contextual data for
a plant sample of any molecular experiment and these are highlighted in bold
and suggested as mandatory. Moreover, a subset of recommended descriptors
can be considered for reporting depending on the particular experiment or
implementation. Optional descriptors offer further granularity to reporting.
Although practical implementation of the Consensus might very depending on
the resource adopting it, the Consensus offers a potential for harmonisation of
plant specimen contextual data across molecular assays. One implementation is
available at the European Nucleotide Archive [20], and will serve for deposition
of plant genomic and transcriptomic data to INSDC. This can add to the collection
of well described plant samples, such as for instance the Brassica oleracea
sample SAMN03858113 [21] or the Hordeum vulgare sample SAMN04549447
[22].
The Consensus presented here is largely in line with recommendations for
description of a plant bioresource and its environment and treatment formulated
for plant phenotyping data [6] and we envisage its applicability also for
description of samples associated with metabolic data.
Conclusion
In the current era of substantive need for data integration beyond scientific
domains and political borders, it is fundamental for both short-term and longterm initiatives to unite their forces when working towards similar goals. We
have presented here an example of ongoing transatlantic community
collaboration, Figure 1, on a common goal to provide plant scientists with
recommendations on how to describe a plant specimen that has been analysed in
a molecular experiment. We believe this can contribute to a consistent
description of plant specimens and improve integration of the specimenassociated molecular data.
List of abbreviations
CHEBI – Chemical Entities of Biological Interest
CO – Crop Ontology
EFO – Experimental Factor Ontology
EMBL-EBI – European Molecular Biology Laboratory, European Bioinformatics
Institute
ENA – European Nucleotide Archive
ENVO – Environment Ontology
EO – Plant Environment Ontology
GSC – Genomic Standards Consortium
INSDC – International Nucleotide Sequence Database Collaboration
NCBI – National Centre for Biotechnology Information
MIAPPE – Minimum Information about Plant Phenotypic Experiments
MIxS – Minimum Information about any (x) Sequence
PATO – Phenotypic Quality Ontology
PO – Plant Ontology
TO – Plant Trait Ontology
USDA-ARS – Agricultural Research Service of the US Department of Agriculture
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Availability of data and material
All data generated during this study are included in this published article and its
supplementary information files.
Competing interest
The authors declare that they have no competing interests.
Funding
The work was supported by the Biotechnology and Biological Sciences Research
Council award BB/M018458/1 for PTH and GC and National Science Foundation
Award Numbers DBI-0735191 and DBI-1265383 to CyVerse for RLW.
Authors’ contribution
PTH coordinated the plant specimen contextual data harmonisation, drafted the
ENA plant checklist and mapped the draft to the Array Express, transPLANT and
MIxS checklists. RLW and EC were leading on mappings to the plant contextual
data checklists developed at CyVerse and USDA-ARS, respectively. AJ was leading
on mappings to the NCBI Plant Package. IKM, GC and JC provided overall
guidance. PY advised on MIxS standard concepts and published the plant
specimen contextual data consensus on the GSC website. PTH wrote the
manuscript with an editorial contribution and a revision by all co-authors.
Acknowledgements
We much appreciate contribution of Lisa Harper and Kapeel Chougule. We would
like to thank Laura Huerta and Robert Petryszak for their thoughtful suggestions.
We are grateful to Pawel Krajewsky, Hanna Ćwiek and Paul Kersey for reviewing
the Consensus draft and appreciate comments from Richard Gibson and Clara
Amid.
References
[1] Silvester, N, Alako, B, Amid, C et al. (2015) Content discovery and retrieval
services at the European Nucleotide Archive. Nucleic Acid Res, 41, D30-35.
[2] Merchant, Nirav, et al., (2016) The iPlant Collaborative: Cyberinfrastructure
for Enabling Data to Discovery for the Life Sciences. PLOS Biology, doi:
10.1371/journal.pbio.1002342.
[3] Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic
Acids Res. 2016 Jan 4;44(D1):D67-72. doi: 10.1093/nar/gkv1276.
[4] Barrett T, Clark K, Gevorgyan R, Gorelenkov V, Gribov E, Karsch-Mizrachi I,
Kimelman M, Pruitt KD, Resenchuk S, Tatusova T, Yaschenko E, Ostell J. (2012)
BioProject and BioSample databases at NCBI: facilitating capture and
organization of metadata. Nucleic Acids Res. 40, D57-63.
[5] Kolesnikov N, Hastings E, Keays M et al. (2015) Array Express update –
simplifying data submissions. Nucleic Acid Res, 43, D1113-1116.
[6] Krajewski P, Chen D, Ćwiek H et al. (2015) Towards recommendation for
metadata and data handling in plant phenotyping. Journal of Experimental Botany
66, 5417-5427.
[7] Yilmaz P, Kottman R, Field D, et al. (2011) Minimum information about a
marker gene sequence (MIMARKS) and minimum information about any (x)
sequence (MIxS) specifications. Nature Biotechnology, 29, 415-420.
[8] Field D., Amaral-Zettler L., Cochrane D. et al. (2011) The Genomic Standards
Consortium. PLoS Biol. 9(6):e1001088. doi: 10.1371/journal.pbio.1001088
[9] Plant Specimen Consensus (2016) Plant Specimen Contextual Data associated
with molecular data. http://gensc.org/the-plant-specimen-contextual-dataconsensus/
Accessed 1 September 2016.
[10] Cooper L., R.L. Walls, J. Elser, M.A. Gandolfo, et al. (2012) The Plant Ontology
as a tool for comparative plant anatomy and genomic analyses. Plant and Cell
Physiology 54:e1. doi: 10.1093/pcp/pcs163.
[11] Gkoutos GV, Green EC, Mallon AM et al. (2005) Using ontologies to describe
mouse phenotypes. Genome Biology 6: R8
[12] Shrestha R, Matteis L, Skofic M et al. (2012) Bridging the phenotypic and
genetic data useful for integrated breeding through a data annotation using the
Crop Ontology developed by the crop communities of practice. Frontier in
Physiology http://dx.doi.org/10.3389/fphys.2012.00326
[13] Yamazaki Y, Jaiswal P (2005) Biological ontologies in rice databases. An
introduction to the activities in Gramene and Oryzabase. Plant Cell Physiology 46:
63-68.
[14] Walls RL, Athreya B, Cooper L, et al. Ontologies as integrative tools for plant
science. (2012) American journal of botany. 99(8):1263-1275.
[15] Buttigieg, PL, Morrison, N, Smith, B et al. (2013) The environment ontology:
contextualising biological and biomedical entities. Journal of Biomedical
Semantics 4, 43.
[16] Malone, J., Holloway, E., Adamusiak, T. et al. (2010) Modeling sample
variables with an Experimental Factor Ontology. Bioinformatics, 26(8), 11121115.
[17] Hastings, J., de Matos, P., Dekker, A. et al.(2013) The ChEBI reference
database and ontology for biologically relevant chemistry: enhancements for
2013. Nucleic Acids Research 42:D456-463.
[18] Sayers, E.W., Barrett, T., Benson D.A. et al. (2009) Database resources of the
National Center for Biotechnology Information. Nucleic Acid Res, 37, D5-15.
[19] INSDC (1986) INSDC controlled vocabulary for values of the /country
qualifier. http://www.insdc.org/country.html
Accessed 1 September 2016.
[20] ENA Plant Sample Checklist (2016) Implementation of the Plant Specimen
Contextual Data Consensus in the submission system Webin.
https://www.ebi.ac.uk/ena/submit/sra/#home
Accessed 1 September 2016.
[21] SAMN03858113, Sample record accession number at the resource where this
has been issued.
http://www.ncbi.nlm.nih.gov/biosample/SAMN03858113
Accessed 1 September 2016.
[22] SAMN04549447, Sample record accession number at the resource where this
has been issued.
http://www.ncbi.nlm.nih.gov/biosample/SAMN04549447
Accessed 1 September 2016.
Figure 1: Overview of data resources or initiatives (in black) and guidelines (in
grey) contributing to the development of the Plant Specimen Contextual Data
Consensus. European Nucleotide Archive (ENA) and Array Express at the
European Molecular Biology Laboratory, European Bioinformatics Institute
(EMBL-EBI, UK), GenBank and BioSample at the National Centre for
Biotechnology Information (NCBI, USA), CyVerse (USA), Agricultural Research
Service of the US Department of Agriculture (USDA-ARS, USA) and trans-national
transPLANT and Genomic Standards Consortium (GSC).
Supplementary Table 1: The Plant Specimen Contextual Data Consensus. Each
concept of the Consensus is described with a name, category, suggested
requirement level (M – mandatory, C – recommended, X – optional), definition,
format and mapping to an available ontology class.