Christoph Steinbeck

Data Curation in Repositories
Cheminformatics and Metabolism, EBI
Christoph Steinbeck
• Chemist by training (worst case for OA and OS)
• Head of Cheminformatics and Metabolism at European Bioinformatics Institute
• Coordinator of European FP-7 COSMOS project on Metabolomics Standards
• Director of the Metabolomics Society
• Editor-in-Chief of Journal of Cheminformatics
The European Molecular Biology Laboratory (EMBL)
A basic research institute funded by public research monies from 20 member states.
The European Bioinformatics Institute (EBI)
EBI resources:
• Long-term access to scientific data
• Citable stable identifiers (experiment with DOI ongoing)
• The public databases of EMBL-EBI are freely available to any individual and for any purpose
MetaboLights vs ChEBI
Author- vs Curator-based Dataset Creation
• ChEBI: < 10 kilobytes of textual and numeric data per data set, highly curated by in-house domain experts. Users submit requests.
• MetaboLights: Gigabytes of complex data (free text, semantic data, spectra, chromatograms and pictures) in a wild mixture of formats, put together by authors.
Chemical Entities of Biological Interest (ChEBI)
ChEBI curation workflow - summary
What to be added?
Not already in database?
Create entry + name
Add synonyms
Check generated data
Add structure
Patent no.?
Add IUPAC Name(s)
Citations
Ontology - role
Ontology - structure
Definition
Status update
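The workflow steps listed above can be read as an ordered checklist that each new entry passes through. What follows is only a minimal sketch of that idea, not ChEBI's actual curation software: the class `CurationRecord` and its methods are hypothetical, the step order is simply the order on the slide (an assumption), and the three decision questions ("What to be added?", "Not already in database?", "Patent no.?") are left out because they are branches rather than tasks.

```python
from dataclasses import dataclass, field

# Step names follow the workflow summary; using the slide's listing order
# as the processing order is an assumption.
CURATION_STEPS = [
    "Create entry + name",
    "Add synonyms",
    "Check generated data",
    "Add structure",
    "Add IUPAC Name(s)",
    "Citations",
    "Ontology - role",
    "Ontology - structure",
    "Definition",
    "Status update",
]


@dataclass
class CurationRecord:
    """Hypothetical tracker for one entry moving through the checklist."""
    name: str
    done: set = field(default_factory=set)

    def complete(self, step: str) -> None:
        # Reject anything that is not a known workflow step.
        if step not in CURATION_STEPS:
            raise ValueError(f"unknown step: {step}")
        self.done.add(step)

    def remaining(self) -> list:
        # Steps still open, in checklist order.
        return [s for s in CURATION_STEPS if s not in self.done]

    def is_finished(self) -> bool:
        return not self.remaining()
```

A curator-facing tool could call `remaining()` to show what is still open for an entry before its status is updated.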
Data Growth in PRIDE (chart, April 2012; labelled values: 10 M, 50 M, 300 M)
ISAcreator
Susanna-Assunta Sansone, …, Haug, Neumann, de Matos, Griffin, Steinbeck, … and Hide. Nature Genetics, doi:10.1038/ng.1054
ISAconfigurator
Summary (1)
• A dataset can be gigabytes of complex data with tough visualisation challenges or just a kilobyte of simple tabular data => Question of feasibility.
Summary (2)
• Technical review:
–SOPs are key
–Open source, domain-specific tools for reviewing open data are key
–Tools for automatic validation are half the battle
–Adhering to Minimum Information Standards and moving fully to semantic data will help a lot
–Open notebook science and linked micro-publications won’t help (yet!) - not enough data stability
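To make the point about automatic validation concrete, here is a minimal sketch of a check that a submission's metadata covers a minimum-information checklist. Everything in it is an assumption for illustration: the function `validate_submission` is hypothetical, and the field names in `REQUIRED_FIELDS` are invented placeholders, not taken from any published standard or from MetaboLights.

```python
# Hypothetical minimum-information checklist; these field names are
# illustrative assumptions, not a real reporting standard.
REQUIRED_FIELDS = ("title", "description", "organism", "instrument", "data_files")


def validate_submission(metadata: dict) -> list:
    """Return a list of readable problems; an empty list means the check passes."""
    problems = []
    # Every required field must be present and non-empty.
    for name in REQUIRED_FIELDS:
        value = metadata.get(name)
        if value is None or value == "" or value == []:
            problems.append(f"missing required field: {name}")
    # Each listed data file must be a non-blank string.
    for path in metadata.get("data_files") or []:
        if not isinstance(path, str) or not path.strip():
            problems.append(f"invalid data file entry: {path!r}")
    return problems
```

A check like this only covers completeness. It says nothing about whether the data are scientifically sound, which is the separate review question.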
Summary (3)
• Scientific review:
–Holistic vs aspect-oriented review
–We’ll not be able to do deep data review of complex studies until Watson becomes self-aware
Thank you!
[email protected]