Data Curation in Repositories
Christoph Steinbeck
Cheminformatics and Metabolism, EBI

Christoph Steinbeck
• Chemist by training (worst case for OA and OS)
• Director of the Metabolomics Society
• Editor-in-Chief of Journal of Cheminformatics
• Head of Cheminformatics and Metabolism at European Bioinformatics Institute
• Coordinator of European FP-7 COSMOS project on Metabolomics Standards

The European Molecular Biology Laboratory (EMBL)
A basic research institute funded by public research monies from 20 member states.

The European Bioinformatics Institute (EBI)
EBI resources:
• Long-term access to scientific data
• Citable stable identifiers (experiment with DOI ongoing)
• The public databases of EMBL-EBI are freely available to any individual and for any purpose

MetaboLights vs ChEBI
Author- vs Curator-based Dataset Creation
• ChEBI: < 10 kilobytes of textual and numeric data per data set, highly curated by in-house domain experts. Users submit requests.
• MetaboLights: Gigabytes of complex data (free text, semantic data, spectra, chromatograms and pictures) in a wild mixture of formats, put together by authors

Chemical Entities of Biological Interest (ChEBI)

ChEBI curation workflow - summary
• What to be added?
• Not already in database?
• Create entry + name
• Add synonyms
• Check generated data
• Add structure
• Patent no.?
• Add IUPAC Name(s)
• Citations
• Ontology - role
• Ontology - structure
• Definition
• Status update

Data Growth in PRIDE
[Figure: data growth in PRIDE, April 2012; marked values 10 M, 50 M, 300 M]

ISAcreator
Nature Genetics doi:10.1038/ng.1054
Susanna-Assunta Sansone,…... Haug, Neumann, de Matos, Griffin, Steinbeck, ….. and Hide

ISAconfigurator

Summary (1)
• A dataset can be gigabytes of complex data with tough visualisation challenges or just a kilobyte of simple tabular data => Question of feasibility.

Summary (2)
• Technical review:
  – SOPs are key
  – Open source, domain-specific tools for reviewing open data are key
  – Tools for automatic validation are half the battle
  – Adhering to Minimum Information Standards and moving fully to Semantic data will help a lot
  – Open notebook science and linked micro publications won’t help (yet!) - not enough data stability

Summary (3)
• Scientific review:
  – Holistic vs aspect-oriented review
  – We’ll not be able to do deep data review of complex studies until Watson becomes self-aware

Thank you!
[email protected]
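Appendix: sketch of automatic validation

The technical-review point in Summary (2) about automatic validation and Minimum Information Standards can be illustrated with a small sketch. It is not the validator of any EBI resource. The checklist fields, the ValidationReport class and the validate_metadata function are all hypothetical names chosen for illustration; a real checklist would come from the relevant community standard.

```python
from dataclasses import dataclass, field

# Hypothetical checklist: these field names are illustrative only and are
# not taken from any actual minimum information standard.
REQUIRED_FIELDS = ("title", "organism", "instrument", "protocol")


@dataclass
class ValidationReport:
    """Collects the problems found in one submission's metadata."""
    missing: list = field(default_factory=list)  # required keys that are absent
    empty: list = field(default_factory=list)    # required keys with blank values

    @property
    def ok(self) -> bool:
        # A submission passes only if nothing is missing or blank.
        return not self.missing and not self.empty


def validate_metadata(metadata: dict) -> ValidationReport:
    """Check that every required field is present and non-blank."""
    report = ValidationReport()
    for name in REQUIRED_FIELDS:
        if name not in metadata:
            report.missing.append(name)
        elif not str(metadata[name]).strip():
            report.empty.append(name)
    return report


if __name__ == "__main__":
    # An author-submitted record with one blank and one absent field.
    submission = {"title": "Example study", "organism": "", "instrument": "LC-MS"}
    report = validate_metadata(submission)
    print(report.ok, report.missing, report.empty)
```

A check of this kind only covers completeness. It says nothing about whether the values are scientifically plausible, which is the part the scientific review in Summary (3) still has to cover.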