Repeatable and Reproducible Evaluation

Fraida Fund
NYU Polytechnic School of Engineering
[email protected]
“In industry, we ignore the evaluation in academic papers. It is often wrong and always irrelevant.”
- Head of a major industrial lab, 2011
Source of quote: Vitek, Jan, and Tomas Kalibera. "R3: Repeatability, reproducibility and rigor." ACM SIGPLAN Notices 47, no. 4a (2012): 30-36. http://janvitek.github.io/pubs/r3.pdf
Common problems in evaluation
● Unclear goals
● Meaningless measurements
● No baseline (or wrong baseline)
● Not representative
● Implicit assumptions
● Weak statistics
● Ineffective or misleading graphics
● Proprietary code and data
● Results are not reproducible
Repetition
The ability to re-run the exact same experiment with the same method on the same or similar system and obtain the same or very similar result.
Reproducibility
Independent confirmation of qualitative results by a third party, using the description of experiment design in the report/paper.
Six degrees of reproducibility
5: The results can be easily reproduced by an independent researcher with at most 15 minutes of user effort, requiring only standard, freely available tools (C compiler, etc.).
4: The results can be easily reproduced by an independent researcher with at most 15 minutes of user effort, requiring some proprietary source packages (MATLAB, etc.).
3: The results can be reproduced by an independent researcher, requiring considerable effort.
2: The results could be reproduced by an independent researcher, requiring extreme effort.
1: The results cannot seem to be reproduced by an independent researcher.
0: The results cannot be reproduced by an independent researcher.
Source: P. Vandewalle, J. Kovacevic, and M. Vetterli. "Reproducible research in signal processing - what, why, and how." IEEE Signal Processing Magazine, 26(3):37–47, May 2009. http://infoscience.epfl.ch/record/136640/files/VandewalleKV09.pdf
How reproducible is CS systems research?
Common reasons that published results could not be reproduced:
● Versioning problems
● “We’ll give you code… soon”
● No plans to release the code
● Only one student knew how to use the code, and that student has left
● Proprietary code
● Depends on proprietary or obsolete systems
● Poor design
● Build errors
How to create a reproducible experiment
Experiment design
❏ Is there a clear mapping between your experiment goal and experiment design?
❏ Does your experiment achieve your goal with the minimum amount of work possible?
❏ Is it clear what the “result” of your evaluation is?
❏ Are there as few manual steps in your experiment as possible? (See the sketch after this list.)
❏ Are the tools used in your experiment open and widely available?
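For the “few manual steps” item, one approach is to drive the whole run from a single script, so that repeating the experiment is one command. A minimal sketch in Python, assuming an iperf server ("iperf -s") is already running on a remote node; the hostname, trial count, and output path here are hypothetical.

    import subprocess
    import time

    SERVER = "server.example.net"   # hypothetical remote node running "iperf -s"
    TRIALS = 5                      # number of repetitions per configuration
    RAW_OUT = "data/raw/iperf.txt"  # raw output, kept separate from processed data

    with open(RAW_OUT, "w") as f:
        for trial in range(TRIALS):
            # Run one 10-second TCP throughput test; record the raw output verbatim.
            result = subprocess.run(
                ["iperf", "-c", SERVER, "-t", "10"],
                capture_output=True, text=True, check=True,
            )
            f.write(f"--- trial {trial} ---\n{result.stdout}\n")
            time.sleep(2)  # short pause between trials

A script like this also doubles as documentation of exactly how the measurements were collected.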
Data analysis and visualization
❏ Did you separate raw and processed data?
❏ Do you have a data analysis and visualization script? (No manual calculations or interactive image generation! See the sketch after this list.)
❏ Did you share the raw and processed data and the script used to generate any images in your report?
❏ Are you using version control?
❏ Do you follow good statistics and data integrity practices?
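For the script item above, a minimal sketch in Python with matplotlib: one script reads the raw data, writes the processed summary, and regenerates the figure. The file names and the column name "mbps" are hypothetical.

    import csv
    import statistics
    import matplotlib.pyplot as plt

    # Read raw measurements (one throughput value per trial, in Mbit/s).
    with open("data/raw/throughput.csv") as f:
        throughput = [float(row["mbps"]) for row in csv.DictReader(f)]

    # Write processed summary statistics to a separate file; never edit raw data.
    mean = statistics.mean(throughput)
    stdev = statistics.stdev(throughput)
    with open("data/processed/summary.csv", "w") as f:
        f.write("mean_mbps,stdev_mbps,n\n")
        f.write(f"{mean:.2f},{stdev:.2f},{len(throughput)}\n")

    # Regenerate the figure from the data on every run; no interactive editing.
    plt.boxplot(throughput)
    plt.ylabel("Throughput (Mbit/s)")
    plt.savefig("figures/throughput.pdf")

Because raw data, processed data, and figures all flow from one script, anyone with the raw data can regenerate every number and image in the report.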
Documentation
❏ Is it clear where to begin? (e.g., can someone picking up the project see where to start running it?)
❏ Are there instructions for setting up the experiment and executing it?
❏ Do you explain non-obvious steps in the instructions?
❏ Have you noted the exact version of every external application used in the process? (See the sketch after this list.)
❏ Are you using version control?
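Recording exact versions can itself be scripted rather than done by hand. A minimal sketch in Python; the list of tools is hypothetical and should match whatever your experiment actually uses.

    import subprocess

    # Commands that report the exact version of each external dependency.
    TOOLS = [
        ["git", "rev-parse", "HEAD"],  # exact commit of this repository
        ["iperf", "-v"],               # version flag varies across tools
        ["python3", "--version"],
    ]

    with open("VERSIONS.txt", "w") as f:
        for cmd in TOOLS:
            # Some tools print their version to stderr, so capture both streams.
            result = subprocess.run(cmd, capture_output=True, text=True)
            f.write(f"$ {' '.join(cmd)}\n{result.stdout}{result.stderr}\n")

Committing VERSIONS.txt alongside the experiment makes the exact environment part of the record.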
Lab exercises
Final lab exercises
Routing (repeatable and reproducible):
● Dijkstra’s algorithm (see the sketch after this list for a refresher)
● OSPF
Software defined networks
● Just to give you another tool to use in potential projects
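As a refresher for the routing exercise, a minimal sketch of Dijkstra’s algorithm in Python; the four-node topology at the bottom is made up for illustration.

    import heapq

    def dijkstra(graph, source):
        # graph: dict mapping node -> list of (neighbor, link_cost) pairs.
        # Returns the shortest-path distance from source to every reachable node.
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, node = heapq.heappop(heap)
            if d > dist.get(node, float("inf")):
                continue  # stale heap entry; a shorter path was already found
            for neighbor, cost in graph[node]:
                new_d = d + cost
                if new_d < dist.get(neighbor, float("inf")):
                    dist[neighbor] = new_d
                    heapq.heappush(heap, (new_d, neighbor))
        return dist

    # Example: four routers with symmetric link costs.
    topology = {
        "A": [("B", 1), ("C", 4)],
        "B": [("A", 1), ("C", 2), ("D", 5)],
        "C": [("A", 4), ("B", 2), ("D", 1)],
        "D": [("B", 5), ("C", 1)],
    }
    print(dijkstra(topology, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}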
Projects
● Form groups of 3 or 4
● Project will run on GENI
○ Lab exercises give you some software tools to use: iperf, netem, tinyhttpd, OSPF setup, SDN, others
○ May use these or other software
● Must use good experiment design practices
● Must use good practices for communicating quantitative results
● Must use good practices for creating reproducible experiments
Projects
The labs are meant to help you, so you can use them as a jumping-off point for projects.
Topics can include:
● Data center networks
● Congestion and flow control
● Routing and resiliency
● SDN
● Other topics related to HSN
Projects
Start thinking about your project:
● Work in groups of 3-4
● Must have a reasonable division of labor (every student takes responsibility for a part of the project)
● Must apply lessons from the lab lectures
● Specific instructions for the proposal will be given before spring break
● Project proposals due at the midterm
Lab coverage on midterm
Lab topics are included on the midterm:
● Using networking testbeds
● Experiment design
● Communicating results
● Reproducible experiments
Some example problems will be provided for you to work on.
Getting help
● Office hours are listed on the lab website
● Asking for help on the Internet
○ For tools like Git Bash and R, there is plenty of information online
○ GENI Users Group: https://groups.google.com/forum/#!forum/geni-users
○ If you ask a question online, cite it in your report
References
1. Raj Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley-Interscience, New York, NY, April 1991. ISBN 0471503361.
2. Moraila, G., Shankaran, A., Shi, Z., and Warren, A. M. "Measuring Reproducibility in Computer Systems Research." Tech Report (2014). http://reproducibility.cs.arizona.edu/tr.pdf
3. Vitek, Jan, and Tomas Kalibera. "R3: Repeatability, reproducibility and rigor." ACM SIGPLAN Notices 47, no. 4a (2012): 30-36. http://janvitek.github.io/pubs/r3.pdf
4. P. Vandewalle, J. Kovacevic, and M. Vetterli. "Reproducible research in signal processing - what, why, and how." IEEE Signal Processing Magazine, 26(3):37–47, May 2009. http://infoscience.epfl.ch/record/136640/files/VandewalleKV09.pdf
5. Edwards, Sarah, Xuan Liu, and Niky Riga. "Creating Repeatable Computer Science and Networking Experiments on Shared, Public Testbeds." ACM SIGOPS Operating Systems Review 49, no. 1 (2015): 90-99. http://mescal.imag.fr/membres/arnaud.legrand/research/readings/acm_sigops_si_rsea/p90-edwards.pdf and http://groups.geni.net/geni/wiki/PaperOSRMethodology
6. Leek, Jeff. The Elements of Data Analytic Style. 2015.
7. Handigol, Nikhil, Brandon Heller, Vimalkumar Jeyakumar, Bob Lantz, and Nick McKeown. "Reproducible network experiments using container-based emulation." In Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies, pp. 253-264. ACM, 2012. http://tiny-tera.stanford.edu/~nickm/papers/p253.pdf and https://reproducingnetworkresearch.wordpress.com/