The Reproducible Research Advantage
Why + how to make your research more reproducible
Presentation for Research Week
February 22, 2016
April Clyburne-Sherin
Objectives
• What is reproducibility?
• Why practice reproducibility?
• What materials are necessary for reproducibility?
• How can you make your research reproducible?
What is reproducibility?
Scientific method
• Replication of findings is the highest standard of evaluating evidence
• Focuses on validating the scientific claim
• Requires transparency of methods, data, and code
• Minimum standard for any scientific study
[Diagram: the scientific method cycle — Observation → Question → Hypothesis → Prediction → Testing → Analysis → Replication.]
Why practice reproducibility?
Fanelli D (2010) “Positive” Results Increase Down the Hierarchy of the Sciences. PLoS ONE 5(4): e10068.
Why practice reproducibility?
• A study report alone is not enough to:
– Assess the sensitivity of findings to assumptions
– Replicate the findings
– Distinguish confirmatory from exploratory analyses
– Identify protocol deviations
• The study analyses and findings cannot be evaluated using a study report alone.
[Diagram: the study report contains only the reported results — figures, tables, and numerical summaries.]
Why practice reproducibility?
[Diagram: the research pipeline behind a study report — Collecting → Raw data → Processing → Analytic data → Analysing → Raw results → Reporting → Reported results (figures, tables, numerical summaries).]
To fully assess the analyses and findings of a study, we need more information.
What materials are necessary for reproducibility?
[Diagram: the same research pipeline, annotated with the materials needed to reproduce each step.]
1. Data + metadata
2. Code
3. Documentation of methods
Why practice reproducibility?
The idealist
• Shoulders of giants!
• Minimum scientific standard
• Allows others to build on your findings
• Improved transparency
• Increased transfer of knowledge
• Increased utility of your data + methods
• Data sharing citation advantage (Piwowar 2013)
The pragmatist
• “It takes some effort to organize your research to be reproducible… the principal beneficiary is generally the author herself.” – Schwab & Claerbout
• Improves capacity for complex and large datasets or analyses
• Increased productivity
How can you make your research reproducible?
1. Plan for reproducibility before you start
• Power
• Data management plan
• Informative naming + location
• Study plan + pre-analysis plan
2. Keep track of things
• Version control
• Documentation
3. Contain bias
• Reporting
• Confirmatory vs. exploratory analyses
4. Archive + share your materials
1. Plan for reproducibility before you start
Power
• Calculate your power
• Low power means:
– Low probability of finding true effects
– Low probability that a positive is a true positive (positive predictive value)
– Exaggerated estimate of the magnitude of effect when a true effect is discovered
– Greater vibration of effects
• Low-powered studies produce more false negatives than high-powered studies
• If there are 100 true effects in a field, 20% power means only about 20 of them will be discovered
The Winner’s Curse
• Low power also produces the “Winner’s Curse”: when an underpowered study does detect a true effect, the estimate of its magnitude is exaggerated, because only the studies that happen to observe a large effect reach statistical significance.
Vibration of effects
• A small study is more likely to obtain different estimates of the magnitude of the effect depending on the analytical options it implements
• A manipulation affecting only three observations could change the odds ratio from 1.00 to 1.50 in a small study, but might only change it from 1.00 to 1.01 in a large study (see the sketch below)
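The three-observation point can be made concrete with a short simulation. The Python sketch below is illustrative only: the cell counts and study sizes are hypothetical assumptions, not figures from the presentation, and the exact numbers differ from the 1.50 quoted above — the point is that the same small perturbation moves the odds ratio far more in a small study than in a large one.

```python
# Illustrative sketch (not from the slides): how much an odds ratio moves when
# just three observations change, in a small study versus a large study.
# The cell counts below are hypothetical and chosen only to make the arithmetic visible.

def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds ratio from a 2x2 table."""
    return (exposed_cases / exposed_controls) / (unexposed_cases / unexposed_controls)

def shift(n_per_cell, flipped=3):
    """Start from a null table (OR = 1.00), then move `flipped` exposed controls
    into the exposed-cases cell and report the new odds ratio."""
    before = odds_ratio(n_per_cell, n_per_cell, n_per_cell, n_per_cell)
    after = odds_ratio(n_per_cell + flipped, n_per_cell - flipped, n_per_cell, n_per_cell)
    return before, after

for n in (20, 2000):  # small study vs. large study (hypothetical sizes)
    before, after = shift(n)
    print(f"n per cell = {n:5d}: OR {before:.2f} -> {after:.2f}")
# Small study (n = 20 per cell):   OR 1.00 -> ~1.35 (large swing)
# Large study (n = 2000 per cell): OR 1.00 -> ~1.00 (barely moves)
```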
How?
• Estimate the size of the effect you are studying
• Design your study with sufficient power to detect that effect (a power-calculation sketch follows below)
• If you need more power, consider collaborating
• If your study is underpowered, report this and acknowledge the limitation when interpreting your results
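As one concrete way to calculate power before you start, here is a minimal Python sketch using the statsmodels package. The effect size, alpha, and power targets are assumptions chosen for illustration; the slides do not prescribe a particular tool or design.

```python
# Minimal power-calculation sketch (illustrative; effect size and targets are assumptions).
# Requires the statsmodels package.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size needed per group to detect an assumed effect (Cohen's d = 0.5)
# with 80% power at alpha = 0.05, two-sided, for a two-sample t-test.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                   alternative="two-sided")
print(f"Required n per group: {n_per_group:.0f}")      # ~64

# Conversely: the power actually achieved if only 20 participants per group are collected.
achieved = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=20,
                                alternative="two-sided")
print(f"Power with n = 20 per group: {achieved:.2f}")   # ~0.34
```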
1. Plan for reproducibility before you start
Data management plan
• Prepare to share
• Data that is well-managed from the start is easier to prepare for sharing
• Smooths transitions between researchers
• Protects you if questions are raised about data validity
• Metadata provides context
How?
• Plan how your data will be collected, stored, documented, and managed
• Plan how your metadata will be collected, documented, and version controlled
• Document metadata while collecting to save time
• Use open data formats rather than proprietary ones: .csv, .txt, .png (a minimal sketch follows below)
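A minimal sketch of the “open format + documented metadata” idea, assuming Python; the file names, fields, and values below are hypothetical examples, not a convention from the presentation.

```python
# Minimal sketch of saving data in an open format with a sidecar metadata file.
# File names and fields are hypothetical examples, not prescribed by the slides.
import csv
import json
from datetime import date

rows = [
    {"sample_id": "S01", "well": "A1", "od600": 0.42},
    {"sample_id": "S02", "well": "A2", "od600": 0.39},
]

# Data: an open, plain-text format (.csv) rather than a proprietary one.
with open("2016-02-22_assay_raw.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sample_id", "well", "od600"])
    writer.writeheader()
    writer.writerows(rows)

# Metadata: recorded at collection time, in an open format (.json), alongside the data.
metadata = {
    "collected_on": date.today().isoformat(),
    "instrument": "plate reader (example)",
    "units": {"od600": "absorbance at 600 nm"},
    "collector": "A. Researcher",
}
with open("2016-02-22_assay_raw.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```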
1. Plan for reproducibility before you start
Informative naming + location
• Plan your file naming + location system a priori
• Names and locations should be distinctive, consistent, and informative:
– What it is
– Why it exists
– How it relates to other files
1. Plan for reproducibility before you start
Informative naming + location
• The rules don’t matter. That you have rules matters.
• Make it machine readable:
– Default ordering
– Use meaningful delimiters and tags
– Example: use “_” and “-” to store metadata in the name (e.g., YYYY-MM-DD_assay_sample-set_well), as in the sketch below
• Make it human readable:
– Choose self-explanatory names and locations
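As a concrete (hypothetical) illustration of the YYYY-MM-DD_assay_sample-set_well convention, the snippet below builds and parses such a name in Python. The field names and values are examples only, not a standard defined in the slides.

```python
# Sketch of a machine-readable file-naming convention:
# "_" separates metadata fields, "-" separates words within a field.
# Field names and values are hypothetical examples.
from datetime import date

def build_name(assay: str, sample_set: str, well: str, ext: str = "csv") -> str:
    """Build a name like 2016-02-22_plasmid-prep_set-03_A01.csv."""
    return f"{date(2016, 2, 22).isoformat()}_{assay}_{sample_set}_{well}.{ext}"

def parse_name(name: str) -> dict:
    """Recover the metadata fields from a name built with build_name()."""
    stem, ext = name.rsplit(".", 1)
    run_date, assay, sample_set, well = stem.split("_")
    return {"date": run_date, "assay": assay, "sample_set": sample_set,
            "well": well, "ext": ext}

name = build_name("plasmid-prep", "set-03", "A01")
print(name)               # 2016-02-22_plasmid-prep_set-03_A01.csv
print(parse_name(name))   # fields come back out, so names sort and filter by machine
```

Because the date leads the name, a plain alphabetical listing already gives a sensible default ordering.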
1. Plan for reproducibility before you start
Study plan
• Pre-register your study plan before you look at your data
• Public registration of all studies counters publication bias
• Counters selective reporting and outcome reporting bias
• Distinguishes a priori design decisions from post hoc ones
• Corroborates the rigor of your findings
How?
• Hypothesis
• Study design
– Type of design
– Sampling
– Power and sample size
– Randomization?
• Variables measured
– Meaningful effect size
• Variables constructed
– Data processing
• Registries: Open Science Framework, ClinicalTrials.gov
1. Plan for reproducibility before you start
Pre-analysis plan
• Pre-register your analysis plan before you look at your data
• Defines your confirmatory analyses
• Corroborates the rigor of your findings
How?
• Define your data analysis set
• Statistical analyses:
– Primary
– Secondary
– Exploratory
• Pre-specify how you will handle:
– Missing data
– Outliers
– Multiplicity (a multiplicity-adjustment sketch follows below)
– Subgroups + covariates
(Adams-Huet and Ahn, 2009)
[Diagram: the pre-analysis plan covers the Processing and Analysing steps of the pipeline — Raw data → Analytic data → Raw results.]
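As one concrete example of pre-specifying how multiplicity will be handled, the Python sketch below applies a Benjamini–Hochberg correction to a set of p-values with statsmodels. Both the p-values and the choice of method are hypothetical assumptions for illustration, not recommendations from the presentation.

```python
# Illustrative sketch: a pre-specified multiplicity adjustment for several
# secondary outcomes. The p-values and the Benjamini-Hochberg choice are
# hypothetical examples, not prescriptions from the slides.
from statsmodels.stats.multitest import multipletests

p_values = [0.01, 0.04, 0.03, 0.20, 0.002]   # hypothetical raw p-values

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for raw, adj, rej in zip(p_values, p_adjusted, reject):
    print(f"raw p = {raw:.3f}  adjusted p = {adj:.3f}  significant: {rej}")
```

The point of writing this down in the pre-analysis plan is that the correction method is fixed before the data are seen, not chosen after the fact.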
1. Plan for reproducibility before you start
Pre-registration
• Pre-register your study + analysis plan with Registered Reports
2. Keep track of things
Version control
• Track your changes
• Everything created manually should use version control
• Tracks changes to files, code, and metadata
• Allows you to revert to old versions
• Make incremental changes: commit early, commit often
• Tools: Git / GitHub / Bitbucket
Version control for data
• Metadata should be version controlled
2. Keep track of things
Documentation
• Document everything done by hand
• Document your software environment (e.g., dependencies, libraries, sessionInfo() in R); a sketch of a Python analogue follows below
• Everything done by hand or not automated from data and code should be precisely documented:
– README files
• Make raw data read-only
– You won’t edit it by accident
– Forces you to document or code your data processing
• Document in code comments
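The slide names sessionInfo() in R; a rough Python analogue (a sketch using only the standard library, with a hypothetical output file name) is to record the interpreter version and installed package versions to a file kept alongside the code.

```python
# Sketch of documenting the software environment from Python (rough analogue of
# R's sessionInfo()): write the interpreter version and installed packages to a file.
import platform
import sys
from importlib import metadata

with open("environment.txt", "w") as f:
    f.write(f"Python {sys.version}\n")
    f.write(f"Platform: {platform.platform()}\n\n")
    for dist in sorted(metadata.distributions(),
                       key=lambda d: (d.metadata["Name"] or "").lower()):
        f.write(f"{dist.metadata['Name']}=={dist.version}\n")
# Commit environment.txt with the analysis code so others can reconstruct the setup.
# (pip freeze > requirements.txt or a conda environment file serve the same purpose.)
```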
3. Contain bias
Reporting
• Report transparently + completely
• Transparently means:
– Readers can use the findings
– Replication is possible
– Users are not misled
– Findings can be pooled in meta-analyses
• Completely means:
– All results are reported, no matter their direction or statistical significance
How?
• Use reporting guidelines:
– CONSORT: Consolidated Standards of Reporting Trials
– SAMPL: Statistical Analyses and Methods in the Published Literature
3. Contain bias
Confirmatory vs. exploratory
• Distinguish confirmatory from exploratory analyses
How?
• Provide access to your pre-registered analysis plan
• Avoid HARKing: Hypothesizing After the Results are Known
• Report all deviations from your study plan
• Report which decisions were made after looking at the data
4. Archive + share your materials
Share your materials
• Where doesn’t matter. That you share matters.
• Get credit for your code, your data, and your methods
• Increase the impact of your research
• For example: the Open Science Framework
How can you make your research reproducible?
1. Plan for reproducibility before you start
• Power – Calculate your power
• Data management plan – Prepare to share
• Informative naming + location – The rules don’t matter. That you have rules matters.
• Study plan + pre-analysis plan – Pre-register your plans
2. Keep track of things
• Version control – Track your changes
• Documentation – Document everything done by hand
3. Contain bias
• Reporting – Report transparently + completely
• Confirmatory vs. exploratory analyses – Distinguish confirmatory from exploratory
4. Archive + share your materials
• Where doesn’t matter. That you share matters.
How to learn more
• Organizing a project for reproducibility
– Reproducible Science Curriculum by Jenny Bryan
– https://github.com/reproducible-science-curriculum/
• Data management
– Data Management from Software Carpentry by Orion Buske
– http://softwarecarpentry.org/v4/data/mgmt.html
• Literate programming
– Literate Statistical Programming by Roger Peng
– https://www.youtube.com/watch?v=YcJb1HBc-1Q
• Version control
– Version Control by Software Carpentry
– http://softwarecarpentry.org/v4/vc/
• Sharing materials
– Open Science Framework by Center for Open Science
– https://osf.io/