Slides - Strengthening Reproducibility in Network Science

Barbara Jasny
Deputy Editor, Em
Science
[email protected]
A Brief History of Science
Science was founded in 1880
on $10,000 of seed money
from Thomas Edison
What is Different Now
• Big data
• Computer modeling
• Team and interdisciplinary science
• Increased ability to study questions of societal significance
• Increased ability to make predictions/approach causality
• Increased pressures for funding/tenure, etc.
• Increased volume of scientific information published
Where are we now?
• >100,000 individual subscribers
• ~1 million online readers
• >13,000 submissions, ~1,000 published
Spectrum of Reproducibility*
• Low End (minimum standard), Repeatability: Another group can access the data, analyze it using the same methodology, and obtain the same result.
• High End (gold standard), Replication: The study is repeated start to finish, including new data collection and analysis with fresh materials and reagents, and yields the same result.
*Ioannidis and Khoury, Science, Special Issue on Data Replication & Reproducibility, 334, December 2011.
What Can Go Wrong?
• The system under investigation may be more complex than previously thought, so that the experimenter is not actually controlling all independent variables. THIS IS HOW SCIENCE EVOLVES.
• Authors may not have divulged all of the details of a complicated experiment, making it irreproducible by another lab.
• Through random chance, a certain number of studies will produce false positives. Authors need to set appropriately stringent significance thresholds for their results (see the sketch after this list).
• Publication bias
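A minimal sketch (illustrative only, not from the talk) of why stringency matters: with a conventional 0.05 threshold, the chance of at least one false positive grows quickly as independent tests accumulate, and a Bonferroni-style correction is one simple way to compensate. The test counts below are hypothetical.

alpha = 0.05  # conventional per-test significance threshold

for n_tests in (1, 5, 20, 100):
    # Probability of at least one false positive across n independent null tests
    p_any_false_positive = 1 - (1 - alpha) ** n_tests
    # Bonferroni correction: divide the threshold by the number of tests
    bonferroni_alpha = alpha / n_tests
    print(f"{n_tests:>3} tests: P(>=1 false positive) = {p_any_false_positive:.2f}, "
          f"Bonferroni per-test alpha = {bonferroni_alpha:.4f}")

For 20 independent tests at alpha = 0.05, the family-wise false-positive probability is already about 64 percent, which is why a blanket 0.05 cutoff is rarely stringent enough.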
Where are the problems coming from?
• Insufficient student training, e.g., experimental design, statistics
• Pressure to publish, renew grants, and get promoted (difficulties in sharing)
• Industry researchers must promote the goals of their company (difficulties in sharing)
• Editors and reviewers for journals and grants, under space and time pressure and in a situation where community standards are evolving, may not be maintaining sufficiently high standards
• Studies were underfunded from the start
2014 Workshop at the Center for Open Science
• TOP (Transparency and Openness Promotion) standards published in Science
• Now 753 journals representing 63 organizations have signed on
TOP Guidelines
B. A. Nosek et al. Science 2015;348:1422-1425
What if Reproducibility/Replicability
Aren’t Options?
Michael Tomasello and Josep Call, Science 2011;334:1227-1228
www.nasa.gov
Second Arnold Workshop: Data Sharing in the Field Sciences 2015
• Establish metadata standards
– Fund data repositories and support data professionals
• Education: importance of quality control
• Culture changes to:
– Relinquish ownership of data
– Start treating data as citable objects
– Liberate field science samples and data
M. McNutt et al., Science, 04 Mar 2016
Third Arnold Workshop: Code and Computational Methods 2016
• Access to data is of little use without the code used to process the data and derive the results
• Need standards for accessibility, interoperability, and attribution
Ideally: share data, software, workflows, and details of the computational environment in open repositories. Persistent links should appear in the published article and include permanent identifiers for data, code, and digital artifacts.
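One hedged illustration of the last point: a short Python sketch (not from the workshop) that records the interpreter, platform, and package versions so the environment description can be deposited alongside the code and data; the package list is a placeholder.

# Minimal sketch: capture the computational environment so it can be archived
# with the code and data (package names below are placeholders).
import json
import platform
import sys
from importlib import metadata

packages = ["numpy", "pandas", "networkx"]  # hypothetical dependency list

environment = {
    "python_version": sys.version,
    "platform": platform.platform(),
    "packages": {},
}
for name in packages:
    try:
        environment["packages"][name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        environment["packages"][name] = "not installed"

with open("environment.json", "w") as fh:  # deposit this record with the code
    json.dump(environment, fh, indent=2)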
Science Policy: Data Must Be Available (in SM or Archived)
"Data and materials availability: All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science. After publication, all reasonable requests for materials must be fulfilled."
There are still some exceptions: "Any restrictions on the availability of data or materials, including fees and original data obtained from other sources (Materials Transfer Agreements), must be disclosed to the editors upon submission."
It’s still complicated
What Needs to be Shared?
• Exposure to ideologically diverse news and opinion on Facebook
Eytan Bakshy, Solomon Messing, Lada Adamic
• Science, 05 Jun 2015: Vol. 348, Issue 6239, pp. 1130-1132
DOI: 10.1126/science.aaa1160
Bakshy et al.
• The following code and data are archived in the Harvard Dataverse Network, http://dx.doi.org/10.7910/DVN/LDJ7MS, “Replication Data for: Exposure to Ideologically Diverse News and Opinion on Facebook”:
– R analysis code and aggregate data for deriving the main results (e.g., Tables S5, S6)
– Python code and dictionaries for training and testing the hard-soft news classifier
– Aggregate summary statistics of the distribution of ideological homophily in networks
– Aggregate summary statistics of the distribution of ideological alignment for hard content shared by the top 500 most shared websites
What Needs to be Shared
• Unique in the shopping mall: On the
reidentifiability of credit card metadata
Yves-Alexandre de Montjoye, Laura Radaelli, Vivek Kumar Singh, Alex “Sandy” Pentland
• Science 30 Jan 2015:
Vol. 347, Issue 6221, pp. 536-539
DOI: 10.1126/science.1256297
de Montjoye et al.
• For contractual and privacy reasons, we unfortunately cannot make the raw data available. Upon request we can, however, make individual-level data of gender, income level, resolution (h, v, a), and unicity (true, false), along with the appropriate documentation, available for replication. This allows the re-creation of Figs. 2 to 4, as well as the GLM model and all of the unicity statistics.
• A randomly subsampled data set for the four points case can be found at http://web.media.mit.edu/~yva/uniqueintheshoppingmall/
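A hedged sketch of the “unicity” idea behind this result, on made-up data rather than the authors’ pipeline: a person counts as unique if a small sample of their points matches no one else’s trace, and unicity is the fraction of sampled people who are unique. The trace structure and point count below are assumptions for illustration.

# Toy illustration of unicity (re-identifiability), not the authors' code.
import random

# user id -> set of (shop, day) tuples, standing in for the real
# spatiotemporal/price tuples used in the paper.
traces = {
    "u1": {("shop_a", 1), ("shop_b", 2), ("shop_c", 3), ("shop_d", 5), ("shop_e", 8)},
    "u2": {("shop_a", 1), ("shop_b", 2), ("shop_c", 3), ("shop_f", 4), ("shop_g", 9)},
    "u3": {("shop_h", 1), ("shop_i", 2), ("shop_j", 6), ("shop_k", 7), ("shop_l", 8)},
}

def unicity(traces, n_points=4, trials=200, seed=0):
    """Estimate the fraction of users uniquely identified by n_points random points."""
    rng = random.Random(seed)
    unique = 0
    for _ in range(trials):
        user = rng.choice(list(traces))
        sample = set(rng.sample(sorted(traces[user]), n_points))
        # The user is unique if no other trace also contains all sampled points.
        matches = [u for u, t in traces.items() if sample <= t]
        if matches == [user]:
            unique += 1
    return unique / trials

print(f"Estimated unicity with 4 points: {unicity(traces):.2f}")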
When Release is Against
Public Safety/Interest
• Heads-up limit hold’em poker is solved
Michael Bowling, Neil Burch, Michael
Johanson, Oskari Tammelin
• Science 09 Jan 2015:
Vol. 347, Issue 6218, pp. 145-149
DOI: 10.1126/science.1259433
Bowling et al.
• As heads-up no-limit Texas hold’em is
commonly played online for high stakes, the
scientific benefit of releasing source code
must be balanced with the potential for it to
be used for gambling purposes. As a
compromise, an implementation of the
DeepStack algorithm for the toy game of no-limit Leduc hold’em is available at
https://github.com/lifrordi/DeepStack-Leduc.
Reproducibility as it Affects Industry/Academia Partnerships
Fourth Arnold Workshop - 2016
Current Problems
• Widely accepted that academic standards are significantly lower than industry's
• Venture firms and biopharma replicate before investing: 2-6 researchers, 1-2 years, $500K-$2 million
• Proprietary/IP concerns
• Agreements can restrict rights of academics and/or delay publication
• Privacy issues: data can be linked to provide unexpected information about you and your network
Recommendations
• Michael Rosenblatt (Chief Medical Officer, Flagship Ventures; published in Science Translational Medicine, April): an incentives-based approach, where industry provides incentives if universities guarantee the reproducibility of their research.
• Universities could do random quality assurance checks of faculty; test whether auditing makes a difference in practice and whether journals see improvements.
• Crowdsourcing to check data deposition; must be seen as helpful, not punitive.
• Form a working group to develop a toolkit for establishing standards for partnerships.
• Organize a high-level meeting to examine ways to sustain existing databases and build new ones.
• Promote better education of students and faculty regarding reproducibility, data sharing, and experimental design.
Industry-Academia Agreements
• What data are needed?
• What will be published?
• What happens to data an academic produces with industry data?
• What approvals are needed for publication/speaking?
• What will be proprietary/trade secrets?
• When will data be released?
• Can released data be used to extend research findings or only reproduce them?
If Replication is a Public Good
• How should replication projects be rewarded?
• Do we have a journal of replication?
• Do people post replications to their blogs?
• Who publishes negative results?
Data Policies of Elsevier
• https://www.elsevier.com/about/ourbusiness/policies/researchdata
• Is it compulsory to share my research data?
• "No. Our policy is clear in that we encourage and support authors to share their research data rather than mandating them to do so ... Where there is community support for (often discipline-specific) mandates regarding data deposit, submission and sharing, some of our journals may reflect this with their own mandatory data sharing policies."
Quarterly Journal of Political Science
• Authors of empirical papers may be asked to
supply a replication data set for editors or
referees. Upon acceptance of a manuscript,
authors will be required to submit a
replication dataset/archive prior to
publication. The dataset, documentation,
command files, etc. will be reviewed in-house
and made available at this site coincident with
publication. Online appendices, if any, will be
handled similarly.