Reproducibility: A Funder and Data Science Perspective
Philip E. Bourne, PhD, FACMI, University of Virginia
With thanks to Valerie Florence, NIH, for some slides
http://www.slideshare.net/pebourne | [email protected]
NetSci Preworkshop 2017, June 19, 2017

Who Am I Representing, and What Is My Bias?
• I am presenting my views, not necessarily those of the NIH
• Now leading an institutional data science initiative
• Total data parasite
• Unnatural interest in scholarly communication
• Co-founder and founding EIC of PLOS Computational Biology – OA advocate
• Prior co-Director of the Protein Data Bank
• Amateur researcher in scholarly communication

Reproducibility is the responsibility of all stakeholders…

Let's start with researchers…

Reproducibility – Examples From My Own Work
… and recently, it took several months to replicate this work. Phew…
http://www.sdsc.edu/pb/kinases/

Beyond value to myself (and even then the emphasis is not enough), there is too little incentive to make my work reproducible by others…

Tools Fix This Problem, Right?
• Extracted all PMC papers with associated Jupyter notebooks available: approx. 100
• Took a random sample of 25
• Only 1 ran out of the box
• Several ran with minor modification
• Others lacked libraries, sufficient detail to run, etc.
It takes more than tools; it takes incentives…
(Daniel Mietchen, 2017, personal communication)
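The slide reports the outcome but not the test harness. The sketch below shows one way such a "does it run out of the box" check can be done with the standard nbconvert execution API; the ./notebooks/ directory is a hypothetical path, and the PMC querying, sampling, and download steps are omitted. Mietchen's actual pipeline was a personal communication and may have differed.

```python
"""Minimal sketch: execute each sampled notebook headlessly and record
whether it runs unmodified. Assumes the sampled notebooks have already
been downloaded into ./notebooks/ (a hypothetical path)."""
import glob

import nbformat
from nbconvert.preprocessors import CellExecutionError, ExecutePreprocessor

results = {}
for path in sorted(glob.glob("notebooks/*.ipynb")):
    nb = nbformat.read(path, as_version=4)
    # Run every cell top to bottom, as a reader doing "Run All" would.
    ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
    try:
        ep.preprocess(nb, {"metadata": {"path": "notebooks/"}})
        results[path] = "ran out of the box"
    except CellExecutionError as err:
        # Typical failure modes: missing libraries, absent data files,
        # hard-coded local paths -- exactly what the sample above found.
        results[path] = "failed: " + str(err).splitlines()[0]
    except Exception as err:  # kernel never started, timed out, etc.
        results[path] = "could not execute: " + repr(err)

for path, outcome in sorted(results.items()):
    print(path, "->", outcome)
```

A harness like this is trivial to run, which underlines the slide's point: the barrier is not tooling but the absence of any incentive to run it before publication.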
Funders and publishers are the major levers. What are funders doing? Consider the NIH…

NIH Special Focus Area
https://www.nih.gov/research-training/rigor-reproducibility

Outcomes – general…

Enhancing Reproducibility through Rigor and Transparency (NOT-OD-15-103)
• Clarifies NIH expectations in 4 areas:
  • Scientific premise: describe strengths and weaknesses of prior research
  • Rigorous experimental design: how to achieve robust and unbiased outcomes
  • Consideration of sex and other relevant biological variables
  • Authentication of key biological and/or chemical resources, e.g., cell lines

Outcomes – network-based…

Experiment in Moving from Pipes to Platforms
(Sangeet Paul Choudary, https://www.slideshare.net/sanguit)

Commons & the FAIR Principles
• The Commons is a virtual platform physically located predominantly on public clouds
• Digital assets (objects) within that system are data, software, narrative, course materials, etc.
• Assets are FAIR: Findable, Accessible, Interoperable and Reusable
Bonazzi and Bourne 2017, PLoS Biol 15(4): e2001818
FAIR: https://www.nature.com/articles/sdata201618

Just announced…
https://commonfund.nih.gov/sites/default/files/RM-17-026_CommonsPilotPhase.pdf

Current Data Commons Pilots
• Commons Platform Pilots: explore the feasibility of the Commons Platform; facilitate collaboration and interoperability
• Cloud Credit Model: provide access to the cloud via credits to populate the Commons; connect credits to NIH grants
• Reference Data Sets: make large and/or high-value NIH-funded data sets and tools accessible in the cloud
• Resource Search & Index: develop data and software indexing methods; leverage BD2K efforts (bioCADDIE et al.); collaborate with external groups

Commons – Platform Stack (diagram: layers, top to bottom)
• App store / user interface
• Software: services and tools (scientific analysis tools/workflows)
• Services: APIs, containers, indexing, data
• Data: "reference" data sets and user-defined data, under digital object compliance
• Compute platform: cloud or HPC
https://datascience.nih.gov/commons

Mapping BD2K Activities to the Commons Platform (diagram: the same stack, annotated)
• Indexing: BioCADDIE and other indexing efforts
• Data: NIH and community-defined reference data sets
• Software, services and tools: BD2K Centers, MODS, HMP and interoperability supplements; NCI and NIAID cloud pilots; possible FOAs
• Compute platform (cloud or HPC): Cloud Credits Model (CCM)
https://datascience.nih.gov/commons
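To make the "digital object compliance" layer and the FAIR bullets above concrete, here is a minimal sketch of what a metadata record for a Commons digital object might look like. The field names are assumptions loosely modeled on schema.org usage, and every identifier value is a placeholder; neither the slides nor the pilot announcements prescribe this particular schema.

```python
"""Illustrative only: one possible FAIR metadata record for a Commons
digital object. All field names and values are assumed/placeholder."""
import json

digital_object = {
    # Findable: a globally unique, persistent, resolvable identifier
    # plus rich descriptive metadata an index can harvest.
    "identifier": "doi:10.5281/zenodo.0000000",  # placeholder DOI
    "name": "Example kinase screening data set",  # hypothetical asset
    "keywords": ["kinase", "reproducibility"],

    # Accessible: retrievable over a standard, open protocol.
    "contentUrl": "https://example-commons.org/objects/abc123",  # hypothetical
    "accessProtocol": "HTTPS",

    # Interoperable: community formats and vocabularies.
    "encodingFormat": "text/csv",
    "conformsTo": "https://schema.org/Dataset",

    # Reusable: explicit license and provenance.
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "provenance": {
        "derivedFrom": "doi:10.1000/xyz",            # placeholder
        "generatedBy": "analysis.ipynb, commit f00ba4",  # placeholder
    },
}

print(json.dumps(digital_object, indent=2))
```

The point is the mapping rather than the schema: each FAIR letter becomes a checkable field, which is what an indexing service (e.g., the bioCADDIE effort named in the pilots) can harvest and enforce.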
Overarching Questions
• Is the Commons a step towards improved reproducibility?
• Is the Commons approach at odds with other approaches? If not, how best to coordinate?
• Do the pilots enable a full evaluation for a larger-scale implementation?
• How best to evaluate the success of the pilots?

Other Questions
• Is a mix of cloud vendors appropriate?
• How to balance the overall metrics of success?
  • Reproducibility
  • Cost saving
  • Efficiency: centralized vs. distributed data
  • New science
  • User satisfaction
  • Data integration and reuse: how to measure?
  • Data security
• What are the weaknesses?

Thank You

Acknowledgements
• Vivien Bonazzi, Jennie Larkin, Michelle Dunn, Mark Guyer, Allen Dearry, Sonynka Ngosso, Tonya Scott, Lisa Dunneback, Vivek Navale (CIT/ADDS)
• NLM/NCBI: Patricia Brennan, Mike Huerta, George Komatsoulis
• NHGRI: Eric Green, Valentina di Francesco
• NIGMS: Jon Lorsch, Susan Gregurick, Peter Lyster
• CIT: Andrea Norris
• NIH Common Fund: Jim Anderson, Betsy Wilder, Leslie Derr
• NCI Cloud Pilots / GDC: Warren Kibbe, Tony Kerlavage, Tanja Davidsen
• Commons Reference Data Set Working Group: Weiniu Gan (HL), Ajay Pillai (HG), Elaine Ayres (BITRIS), Sean Davis (NCI), Vinay Pai (NIBIB), Maria Giovanni (AI), Leslie Derr (CF), Claire Schulkey (AI)
• RIWG Core Team: Ron Margolis (DK), Ian Fore (NCI), Alison Yao (AI), Claire Schulkey (AI), Eric Choi (AI)
• OSP: Dina Paltoo