Making Biomedical Research More Like Airbnb?
Reproducibility: A Funder and Data Science Perspective
Philip E. Bourne, PhD, FACMI
University of Virginia
Thanks to Valerie Florence, NIH for some slides
http://www.slideshare.net/pebourne
[email protected]
NetSci Preworkshop 2017
June 19, 2017
6/19/17
1
Who Am I Representing and What Is My Bias?
• I am presenting my views, not necessarily those of the NIH
• Now leading an institutional data science initiative
• Total data parasite
• Unnatural interest in scholarly communication
• Co-founder and founding Editor-in-Chief of PLOS Computational Biology – open access (OA) advocate
• Former co-director of the Protein Data Bank
• Amateur student researcher in scholarly communication
Reproducibility is the
responsibility of all stakeholders….
Let's start with researchers …
Reproducibility – Examples from My Own Work
… And recently …
It took several months to replicate this work.
Phew…
http://www.sdsc.edu/pb/kinases/
Beyond value to myself (and even then the emphasis is not enough), there is too little incentive to make my work reproducible by others …
Tools Fix This Problem, Right?
• Extracted all PMC papers with associated Jupyter notebooks available
• Approx. 100
• Took a random sample of 25
• Only 1 ran out of the box
• Several ran with minor modification
• Others lacked libraries, sufficient detail to run, etc.
It takes more than tools … it takes incentives …
Daniel Mietchen 2017 Personal Communication
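The sampling-and-testing procedure above can be sketched in code. A minimal sketch, assuming the candidate notebooks have already been downloaded locally; `run_notebook` shells out to `jupyter nbconvert --execute` as one common way to test whether a notebook runs "out of the box" (the study's exact method was not specified, and `sample_and_tally` and its parameters are hypothetical names for illustration):

```python
import random
import subprocess
from collections import Counter

def run_notebook(path):
    """Attempt to execute a notebook non-interactively; True if it succeeds."""
    result = subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         "--stdout", path],
        capture_output=True,
    )
    return result.returncode == 0

def sample_and_tally(notebook_paths, n=25, runner=run_notebook, seed=42):
    """Randomly sample n notebooks and tally how many execute cleanly."""
    sample = random.Random(seed).sample(list(notebook_paths), n)
    return Counter("ran" if runner(p) else "failed" for p in sample)
```

Passing `runner` as a parameter lets the tally logic be exercised without a Jupyter installation; in practice most failures here would come from missing libraries or data, as the slide notes.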
Funders and publishers are the major levers …
What are funders doing?
Consider the NIH …
NIH Special Focus Area
https://www.nih.gov/research-training/rigor-reproducibility
Outcomes – General …
Enhancing Reproducibility through Rigor and Transparency (NOT-OD-15-103)
• Clarifies NIH expectations in 4 areas:
  • Scientific premise – describe strengths and weaknesses of prior research
  • Rigorous experimental design – how to achieve robust and unbiased outcomes
  • Consideration of sex and other relevant biological variables
  • Authentication of key biological and/or chemical resources, e.g., cell lines
Outcomes – Network-based …
Experiment in Moving from Pipes to Platforms
Sangeet Paul Choudary https://www.slideshare.net/sanguit
Commons & the FAIR Principles
• The Commons is a virtual platform physically located predominantly on public clouds
• Digital assets (objects) within that system are data, software, narrative, course materials, etc.
• Assets are FAIR – Findable, Accessible, Interoperable and Reusable
Bonazzi and Bourne 2017 PLoS Biol 15(4): e2001818
FAIR: https://www.nature.com/articles/sdata201618
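The FAIR criteria above can be made concrete as a simple metadata audit of a digital object. A minimal, illustrative sketch – the field names below are assumptions chosen for illustration, not any NIH Commons or FAIR-standard schema:

```python
# Hypothetical mapping of each FAIR principle to metadata fields a
# digital asset might carry; the field names are illustrative only.
FAIR_FIELDS = {
    "findable": ["identifier", "indexed_in"],
    "accessible": ["access_url", "access_protocol"],
    "interoperable": ["metadata_schema", "format"],
    "reusable": ["license", "provenance"],
}

def fair_gaps(asset):
    """Return, per FAIR principle, which expected metadata fields are missing."""
    return {
        principle: [f for f in fields if not asset.get(f)]
        for principle, fields in FAIR_FIELDS.items()
    }
```

Running `fair_gaps` over an asset's metadata dictionary yields a per-principle list of gaps, which is the kind of machine-checkable compliance a Commons platform could enforce at deposit time.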
Just announced …
https://commonfund.nih.gov/sites/default/files/RM-17-026_CommonsPilotPhase.pdf
Current Data Commons Pilots
• Commons Platform Pilots
  • Explore feasibility of the Commons Platform
  • Facilitate collaboration and interoperability
• Cloud Credit Model
  • Provide access to cloud via credits to populate the Commons
  • Connecting credits to NIH grants
• Reference Data Sets
  • Making large and/or high-value NIH-funded data sets and tools accessible in the cloud
• Resource Search & Index
  • Developing data & software indexing methods
  • Leveraging BD2K efforts (bioCADDIE et al.)
  • Collaborating with external groups
Commons – Platform Stack
• App store / user interface
• Software: services & tools
  • Scientific analysis tools/workflows
  • Services: APIs, containers, indexing
• Data
  • "Reference" data sets
  • User-defined data
• Digital Object Compliance
• Compute platform: cloud or HPC
https://datascience.nih.gov/commons
Mapping BD2K Activities to the Commons Platform
[Figure: the Commons platform stack annotated with the BD2K activities feeding each layer – BioCADDIE and other indexing efforts; NIH + community-defined reference data sets; BD2K Centers, MODS, HMP & interoperability supplements (software: services & tools); NCI & NIAID cloud pilots and possible FOAs; and the cloud credits model (CCM) for the compute platform (cloud or HPC)]
https://datascience.nih.gov/commons
Overarching Questions
• Is the Commons a step towards improved reproducibility?
• Is the Commons approach at odds with other approaches? If not, how best to coordinate?
• Do the pilots enable a full evaluation for a larger-scale implementation?
• How best to evaluate the success of the pilots?
Other Questions
• Is a mix of cloud vendors appropriate?
• How to balance the overall metrics of success?
  • Reproducibility
  • Cost saving
  • Efficiency – centralized vs. distributed data
  • New science
  • User satisfaction
  • Data integration and reuse – how to measure?
  • Data security
• What are the weaknesses?
Thank You
Acknowledgements
• Vivien Bonazzi, Jennie Larkin, Michelle Dunn, Mark Guyer, Allen Dearry, Sonynka Ngosso, Tonya Scott, Lisa Dunneback, Vivek Navale (CIT/ADDS)
• NLM/NCBI: Patricia Brennan, Mike Huerta, George Komatsoulis
• NHGRI: Eric Green, Valentina di Francesco
• NIGMS: Jon Lorsch, Susan Gregurick, Peter Lyster
• CIT: Andrea Norris
• NIH Common Fund: Jim Anderson, Betsy Wilder, Leslie Derr
• NCI Cloud Pilots/GDC: Warren Kibbe, Tony Kerlavage, Tanja Davidsen
• Commons Reference Data Set Working Group: Weiniu Gan (HL), Ajay Pillai (HG), Elaine Ayres (BITRIS), Sean Davis (NCI), Vinay Pai (NIBIB), Maria Giovanni (AI), Leslie Derr (CF), Claire Schulkey (AI)
• RIWG Core Team: Ron Margolis (DK), Ian Fore (NCI), Alison Yao (AI), Claire Schulkey (AI), Eric Choi (AI)
• OSP: Dina Paltoo