molgenis

FAIR Data Stewardship
for Discovery and Innovation
MOLGENIS FAIR Roadmap
David van Enckevort
Utrecht, 3 November 2016
MOLGENIS Overview
• Exchange format: import data and metadata using the EMX format (D4.1)
• Data request: find and request (biobank) data sets and items
• Data explorer: filter and download for further analysis (D4.2)
• Genome browser: data sharing and integration using the DAS protocol
• Model registry: metadata registry of models for biobanks and molecular data (D4.4)
• Annotators: data integration for diagnostics and personalized medicine
• Biobank Connect: using ontologies to derive harmonization rules for data pooling (D2.2)
• R statistics: use the R data API to upload/download data and integrate graphics
• Compute: large-scale computation on computational clusters, grids and clouds
• Impute pipeline: GWAS harmonization and imputation
• DNA pipeline: NGS data alignment, SNV/SV calling, QC, NIPT
• RNA pipeline: NGS data quantitation, structure, eQTL, allele-specific expression
MOLGENIS platform: open
source collaborative
mvc
JPA /
~20 active devs
~25 projects github.com/molgenis
Internally FAIR
Data as increasingly FAIR Digital Objects
[Figure: digital objects at increasing levels of FAIRness]
• Levels shown: Totally UNFAIR; Findable; Usable for Humans; FAIR metadata; FAIR data, restricted access; FAIR data, Open Access; FAIR data, Open Access / Functionally Linked
• Parts of each object: PID, Metadata (intrinsic), 'provenance' (user defined), Data (elements)
Using our own choice of formats and standards: interoperable between MOLGENIS instances, but not necessarily aligned with the FAIR-chosen formats and standards
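The digital-object picture (a PID wrapped around intrinsic metadata, user-defined provenance and data elements) can be sketched as a small data structure. This is an illustration only, not the MOLGENIS data model: the class name, the field names, the `open_access` flag and the example identifier are all assumptions, as is the rule that metadata stays visible while data is shown only under open access.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DigitalObject:
    """Hypothetical container mirroring the parts named on the slide."""
    pid: str                                                    # persistent identifier
    metadata: Dict[str, Any] = field(default_factory=dict)      # intrinsic metadata
    provenance: Dict[str, Any] = field(default_factory=dict)    # user-defined provenance
    data: List[Dict[str, Any]] = field(default_factory=list)    # data elements
    open_access: bool = False                                   # assumed access flag

    def public_view(self) -> Dict[str, Any]:
        """Return what an anonymous client may see (assumed policy)."""
        view: Dict[str, Any] = {
            "pid": self.pid,
            "metadata": self.metadata,
            "provenance": self.provenance,
        }
        if self.open_access:
            view["data"] = self.data
        return view


# Made-up identifier and values, for illustration only.
restricted = DigitalObject(
    pid="example:biobank/123",
    metadata={"title": "Example biobank collection"},
    provenance={"curated_by": "example curator"},
    data=[{"sample_id": "S1"}],
)
```

With `open_access=False` the public view carries the PID, metadata and provenance but no data, which corresponds roughly to the "FAIR data, restricted access" level.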
FAIR Hackathon
• Two MOLGENIS Developers for two days
• Support from LUMC / DTL FAIR Team
• Goal: build a proof of concept of a FAIR MOLGENIS
• Using the BBMRI Biobank Catalogue as the use case
BBMRI-NL
Biobank Catalogue
RD-Connect
Sample Catalogue
FAIR Hackathon Results
Roadmap
• Bring the PoC into production
• Implement Linked Data Fragments service
• Offer a suite of tools to make data FAIR
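To make the Linked Data Fragments item concrete: in the Triple Pattern Fragments style, a client asks for all triples matching a subject/predicate/object pattern and receives one page of matches plus a count. The sketch below shows that selection logic over an in-memory list. It is not the planned MOLGENIS service; the function names, the response keys and the example.org URIs are assumptions, and a real service would expose this over HTTP with hypermedia controls.

```python
from typing import Any, Dict, List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def matches(triple: Triple, pattern: tuple) -> bool:
    """A pattern position of None acts as a wildcard."""
    return all(want is None or want == have for want, have in zip(pattern, triple))


def fragment(triples: List[Triple],
             subject: Optional[str] = None,
             predicate: Optional[str] = None,
             obj: Optional[str] = None,
             page: int = 1,
             page_size: int = 100) -> Dict[str, Any]:
    """Return one page of triples matching the pattern, plus simple metadata."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    selected = [t for t in triples if matches(t, (subject, predicate, obj))]
    start = (page - 1) * page_size
    end = start + page_size
    return {
        "triples": selected[start:end],
        "total": len(selected),
        "next_page": page + 1 if end < len(selected) else None,
    }


# Toy data with made-up URIs.
STORE = [
    Triple("http://example.org/biobank/1", "http://example.org/name", "Biobank A"),
    Triple("http://example.org/biobank/2", "http://example.org/name", "Biobank B"),
    Triple("http://example.org/biobank/1", "http://example.org/country", "NL"),
]

names = fragment(STORE, predicate="http://example.org/name", page_size=1)
```

Here `names` holds the first matching triple, reports two matches in total and points to a second page.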
MOLGENIS/connect
‘FAIRifier’ system for retrospective interoperability of data
BiobankConnect
Make data attributes interoperable
SORTA
Make data values interoperable
SORTA - To code data values
Semantic/lexical matching to shortlist codes for each unique value
Upload data using Excel
SORTA shortlists candidate codes
Semantic matching
Lexical matching
Use n-gram matching threshold (e.g. 80%)
Human expert decides (and so trains SORTA)
SORTA automatically recodes when the matching score is high (e.g. 80%)
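The shortlist-then-threshold flow can be sketched with a plain character n-gram score. This is a minimal illustration, not SORTA's actual implementation: it uses a Dice coefficient over bigrams as a stand-in for the lexical matching, leaves out the semantic matching and the learning from expert decisions, and the function names and the code list are made up.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def ngrams(text: str, n: int = 2) -> Counter:
    """Multiset of character n-grams of a lower-cased, padded string."""
    cleaned = " ".join(text.lower().split())
    padded = f"^{cleaned}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-gram multisets, as a percentage (0-100)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0
    shared = sum((grams_a & grams_b).values())
    return 200.0 * shared / total


def shortlist(value: str, codes: Dict[str, str], top: int = 3) -> List[Tuple[str, float]]:
    """Rank candidate codes for one data value, best match first."""
    scored = [(code, similarity(value, label)) for code, label in codes.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]


def recode(value: str, codes: Dict[str, str],
           threshold: float = 80.0) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Return (code, candidates); code is None when a human must decide."""
    candidates = shortlist(value, codes)
    if candidates and candidates[0][1] >= threshold:
        return candidates[0][0], candidates
    return None, candidates


# Hypothetical code list, for illustration only.
CODES = {"C1": "Asthma", "C2": "Diabetes mellitus", "C3": "Hypertension"}

auto_code, _ = recode("asthma", CODES)            # exact match, recoded automatically
open_issue, suggestions = recode("astma", CODES)  # misspelling, left for the expert
```

The misspelt value scores below the 80% threshold, so `recode` returns `None` together with the shortlist, which is the point at which a human expert would pick the code.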
SORTA - To code data values
When match is < threshold, user decides and works through open issues
(Biobank)Connect to code data attributes
Software autogenerates mappings using ontologies + lexical matching
Per attribute mapping assistant
User can fix the mapping on the fly (using the semantic/lexical tricks)
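The idea of autogenerating attribute mappings from ontologies plus lexical matching can be sketched as follows. This is a toy illustration under stated assumptions, not the BiobankConnect algorithm: the "ontology" is a hand-written synonym table, the score is a token-set Jaccard overlap, and every name and attribute label is made up.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical mini-ontology: each concept maps to its set of synonyms.
ONTOLOGY: Dict[str, Set[str]] = {
    "weight": {"weight", "mass"},
    "height": {"height", "stature", "length"},
}


def tokens(label: str) -> Set[str]:
    """Lower-cased alphanumeric tokens of an attribute label."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def expand(label_tokens: Set[str]) -> Set[str]:
    """Add ontology synonyms for every token that names a known concept."""
    expanded = set(label_tokens)
    for synonyms in ONTOLOGY.values():
        if label_tokens & synonyms:
            expanded |= synonyms
    return expanded


def score(target: str, source: str) -> float:
    """Jaccard overlap of the synonym-expanded token sets (0.0-1.0)."""
    t, s = expand(tokens(target)), expand(tokens(source))
    if not t or not s:
        return 0.0
    return len(t & s) / len(t | s)


def suggest_mappings(target_attrs: List[str],
                     source_attrs: List[str]) -> Dict[str, List[Tuple[str, float]]]:
    """For each target attribute, rank all source attributes, best first."""
    suggestions = {}
    for target in target_attrs:
        ranked = sorted(((src, score(target, src)) for src in source_attrs),
                        key=lambda item: item[1], reverse=True)
        suggestions[target] = ranked
    return suggestions


mappings = suggest_mappings(
    ["Body weight"],
    ["Smoking status", "Stature in cm", "Body mass (kg)"],
)
```

"Body weight" and "Body mass (kg)" share no literal token other than "body", yet the synonym expansion puts the mass attribute at the top of the ranking; a user would then confirm or correct the suggested mapping.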
Learn more or contact
[email protected]
Reading
◻ MOLGENIS docs @ http://molgenis.github.com
◻ BiobankConnect paper - http://pubmed.org/25361575
◻ SORTA paper - http://pubmed.org/26385205
◻ MOLGENIS papers - http://pubmed.org?term=MOLGENIS
Movies
◻ Upload - https://www.youtube.com/watch?v=VSZNXdaGIl4
◻ SORTA - https://www.youtube.com/watch?v=Wq81S-jR3l8
◻ BiobankConnect - https://www.youtube.com/watch?v=Gc1VKRCmTWU