Path planning 2015 workshop: Experimental research in practice

Workshop: Experimental research in practice
Roland Geraerts
16 November 2016
Bad repeatability
Why can it be hard to reproduce papers’ claims?
Bad repeatability (1)
 Problem
 The results cannot be reproduced easily
 Cause
 Details of the method are lacking
• Parts of the method are not described
• Degenerate cases are missing
• References to other papers (without mentioning details)
 Parameter values (usually weights) are not reported
 Source code is not available
 The experimental setup is not clear
• Tested hardware (e.g. which PC/GPU, the number of cores used)
• Statistical setup (e.g. number of runs, seed)
• Details of the scenario(s) are missing
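One way to make the statistical setup reproducible is to derive all random input from a seed that is reported with the results. The fragment below is a minimal sketch under that assumption; the type `Query`, the helpers `toUnit` and `makeQueries`, and the choice of `std::mt19937` are hypothetical and not part of the workshop material. The raw engine output is scaled by hand because the C++ standard fixes the sequence `std::mt19937` produces for a given seed, whereas distributions such as `std::uniform_real_distribution` may differ between standard libraries.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical query type: a start and a goal position in the unit square.
struct Query {
    double startX, startY, goalX, goalY;
};

// Map one raw 32-bit engine value to the interval [0, 1).
inline double toUnit(std::mt19937& engine) {
    return static_cast<double>(engine()) / 4294967296.0;  // divide by 2^32
}

// Generate `count` queries from `seed`; the same seed gives the same queries.
inline std::vector<Query> makeQueries(std::uint32_t seed, std::size_t count) {
    std::mt19937 engine(seed);
    std::vector<Query> queries;
    queries.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        Query q;
        q.startX = toUnit(engine);
        q.startY = toUnit(engine);
        q.goalX = toUnit(engine);
        q.goalY = toUnit(engine);
        queries.push_back(q);
    }
    return queries;
}
```

Reporting the seed and the number of queries next to the results then lets someone else regenerate the same queries, for example with `makeQueries(42, 100)`.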
Bad repeatability (2)
 Problem
 The results cannot be reproduced easily
 Cause
 Low significance caused by a low number of runs
 Hard problems can be hard to implement
 Solution
 Let someone else implement the method/paper
 Provide the source code
Data collection errors
What kind of errors occur during the collection of (raw) data?
Data collection errors (1)
 Problem
 Errors occur during collection of raw data
• E.g., copying/pasting values from GUIs into Excel sheets or text files
 Cause
 The data collection process was not automated
• There is a GUI but not a command line (console) version
• Variables aren’t assigned the right values (how to verify?)
 The precision of the stored numbers is too low
 Statistics are computed wrongly (e.g. how to compute the SD)
 Only the execution of a part of the algorithm is recorded
 The visualization part is not strictly separated from the
execution part of the algorithm
• E.g. While the method performs its computations, the results are
being written to a log file and sent to the GPU for visualization
purposes
Data collection errors (2)
 Solution
 Automate the process using a console application called from a batch file
• For small experiments, pass the arguments in the batch file
• Otherwise, build a load/save mechanism
 Create an API that supports setting up experiments
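A console version that takes its settings as arguments could look roughly like the sketch below. Everything in it is an assumption for illustration: the struct `ExperimentConfig`, the functions `parseArgs` and `formatRow`, and the flags `--scenario`, `--runs` and `--seed`. The output row is written with `max_digits10` digits so that the stored numbers do not lose precision.

```cpp
#include <cstdint>
#include <iomanip>
#include <limits>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical experiment settings, filled from the command line.
struct ExperimentConfig {
    std::string scenario = "default";
    int runs = 1;
    std::uint32_t seed = 0;
};

// Parse arguments of the form: --scenario NAME --runs N --seed S
inline ExperimentConfig parseArgs(int argc, const char* const argv[]) {
    ExperimentConfig config;
    for (int i = 1; i < argc; ++i) {
        const std::string flag = argv[i];
        if (i + 1 >= argc) {
            throw std::invalid_argument("missing value for " + flag);
        }
        const std::string value = argv[++i];
        if (flag == "--scenario") {
            config.scenario = value;
        } else if (flag == "--runs") {
            config.runs = std::stoi(value);
        } else if (flag == "--seed") {
            config.seed = static_cast<std::uint32_t>(std::stoul(value));
        } else {
            throw std::invalid_argument("unknown flag " + flag);
        }
    }
    return config;
}

// Format one result as a CSV row. max_digits10 keeps enough digits to
// restore the exact double when the file is read back.
inline std::string formatRow(const ExperimentConfig& config, int run,
                             double seconds) {
    std::ostringstream row;
    row << std::setprecision(std::numeric_limits<double>::max_digits10)
        << config.scenario << ',' << config.seed << ',' << run << ','
        << seconds;
    return row.str();
}
```

A `main(int argc, char* argv[])` can forward its arguments to `parseArgs` directly, and a batch file can then list one call per experiment, e.g. `planner.exe --scenario maze --runs 50 --seed 7` (the executable name is made up).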
Data collection errors (3):
Time measurement errors
 Problem
 Time is measured wrongly
 Cause
 The timer is not accurate enough
• C++: Don’t use time.h
• Don’t start/stop the timer inside the method, especially not if the
parts take less than 1 ms to compute
 Intervening network/CPU/GPU processes
Data collection errors (4):
Time measurement errors
 Solution
 Use accurate timers
• C++ on Windows: use QueryPerformanceCounter(…) instead (beware of jumps of about 0.3 s), or, since C++11, std::chrono::high_resolution_clock
• Run fast methods many times and take the average; watch out for
non-deterministic behavior
 Take the average of several runs, even for deterministic algorithms
 Only measure the running time of the algorithm
• Switch off the network
• Kill the virus killer
• Stop the e-mail program
• Disable update functionality
• Use only 1 core
• Don’t work on your thesis while running the experiments on the same machine; and yes, this happens
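Timing a fast method many times and averaging can be sketched as follows. This is only an illustration: the helper `averageSeconds` is hypothetical, and it uses `std::chrono::steady_clock` rather than the `high_resolution_clock` named above, because `steady_clock` is guaranteed to be monotonic. The timer is started and stopped outside the method, once around all repetitions.

```cpp
#include <cassert>
#include <chrono>

// Run `algorithm` `repetitions` times and return the average wall-clock
// time per call in seconds. Only the calls themselves are timed.
template <typename Algorithm>
double averageSeconds(Algorithm&& algorithm, int repetitions) {
    assert(repetitions > 0);
    using Clock = std::chrono::steady_clock;  // monotonic: never jumps back
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < repetitions; ++i) {
        algorithm();
    }
    const Clock::time_point stop = Clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / repetitions;
}
```

A call could look like `averageSeconds([&] { planner.query(q); }, 1000)`, where `planner` and `q` are placeholders. Make sure the result of the timed call is actually used somewhere, otherwise an optimizing compiler may remove the work being measured.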
Bad figures
When do figures convey information badly?
Bad figures
 Problem
 The figures convey information badly
 Cause
 The figures are hard to read (e.g. too small or bitmapped)
 Axes haven’t been labeled
 The y-axis doesn’t start at 0 which amplifies (random) differences
 Numbers are shown with the wrong precision/format
• Don’t display 100,000.001
• Don’t display 0.0005 s, or 0.1 0.15 0.2 …
 The meaning is not conveyed clearly
 Some colors/patterns don’t do well on black & white printers
 Solution
 Use e.g. gnuplot (set all labels and export to a vector format: EPS or PDF)
 Use vector images as much as possible (e.g. use Ipe)
 Explain all phenomena
Conclusions are too general
When are the conclusions drawn too general?
Conclusions are too general (1)
 Problem
 The conclusions drawn are often too general
 Cause
 Only one instance is tested, e.g.
• environment / moving entity
 Only one problem setting is tested
 A favorable setup is used, e.g.
• a few axis-aligned rectangular obstacles
• polygonal convex obstacles
• 1 fixed query
 Deterministic experiments do suffer from the ‘variance problem’
Conclusions are too general (2)
 Solution
 Try to sample the problem space as well as possible
 Don’t bias the comparison toward any method
• Use a favorable setup (to show certain properties) and a ‘normal’ one
• Also choose worst-case scenarios
• Tune all methods equally
 Compare against the state-of-the-art instead of old methods only
 Dare to show the weakness(es) of your method
Statistical weaknesses
When are the statistics less reliable?
Statistical weaknesses
 Problem
 Statistics are done badly
 Cause
 Results have been collected on different sets of hardware
 Too few runs
 Not all running times are mentioned (e.g. initialization)
 Only averages are mentioned
 Solution
 Use the same machine (and don’t change the setup)
 Use e.g. gnuplot and set all (relevant) labels
 Use other measures, e.g.
• SD
• Boxplot
• Student’s t-test: statistical hypothesis test
• ANOVA: Analysis of variance
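As a small illustration of reporting more than averages, the sketch below computes the mean, the sample SD (dividing by n - 1, not n) and a t statistic for two sets of running times. The function names are hypothetical. The t statistic shown is Welch's variant, which does not assume equal variances, not the classic Student's form. Turning it into a p-value still requires the t distribution, which is not included here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of the samples.
inline double mean(const std::vector<double>& samples) {
    assert(!samples.empty());
    double sum = 0.0;
    for (double x : samples) sum += x;
    return sum / static_cast<double>(samples.size());
}

// Sample variance: divides by n - 1 (Bessel's correction), not by n.
inline double sampleVariance(const std::vector<double>& samples) {
    assert(samples.size() >= 2);
    const double m = mean(samples);
    double squares = 0.0;
    for (double x : samples) squares += (x - m) * (x - m);
    return squares / static_cast<double>(samples.size() - 1);
}

// Sample standard deviation.
inline double sampleSD(const std::vector<double>& samples) {
    return std::sqrt(sampleVariance(samples));
}

// Welch's t statistic for two independent samples. The result is not
// finite if both samples have zero variance.
inline double welchT(const std::vector<double>& a,
                     const std::vector<double>& b) {
    const double standardError =
        std::sqrt(sampleVariance(a) / static_cast<double>(a.size()) +
                  sampleVariance(b) / static_cast<double>(b.size()));
    return (mean(a) - mean(b)) / standardError;
}
```

Feeding it the raw running times of every run, rather than pre-averaged values, keeps the SD meaningful.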
Statistically significant?
So your method is statistically significant
 Even if a method’s improvement is statistically significant, this does not have to mean anything in practice…
 …due to the programmer’s bias.
 Suppose different methods run in 10.2, 10.0, 10.3, and 9.6 seconds (with appropriate SDs etc.). While the last one might seem better, in reality it does not have to be…
 …since the third one might be the only one that wasn’t optimized.
Ways to bias your results (1)
 Run the code with different choices of
 Hardware (CPU, GPU, memory, cache, #cores, #threads)
 Language (C++/C#, 32/64bit, different optimizations)
 Software libraries (own code/boost/STL)
 Implementation is done by different people
Ways to bias your results (2):
Some code optimizations
 Enable optimizations in your compiler
 Run in release mode!
 Visual Studio
• Full optimization
• Inline function expansion
• Enable intrinsic functions
• Etc.
 Compile the code with a 64-bit compiler
 2-15% improvement of running times due to
• usage of a larger instruction set
• Not having to simulate 32-bit code
 However, watch code that deals with memory and loops
• use memsize-types in address arithmetic
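The point about memsize-types can be illustrated with the sketch below, which uses `std::size_t` for index arithmetic and loop counters. The functions `cellIndex` and `sumCells` are hypothetical examples, not code from the workshop.

```cpp
#include <cstddef>
#include <vector>

// Index of cell (row, column) in a row-major grid. std::size_t is as wide
// as a pointer, so in a 64-bit build row * width does not overflow for
// large grids the way a 32-bit int would.
inline std::size_t cellIndex(std::size_t row, std::size_t column,
                             std::size_t width) {
    return row * width + column;
}

// Sum all cells, using std::size_t for the loop counter as well.
inline double sumCells(const std::vector<double>& cells) {
    double sum = 0.0;
    for (std::size_t i = 0; i < cells.size(); ++i) {
        sum += cells[i];
    }
    return sum;
}
```

With `int` indices, a grid of 70,000 by 70,000 cells would already exceed the 32-bit range; with `std::size_t` the same code keeps working after moving to a 64-bit compiler.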
Ways to bias your results (3):
Some code optimizations
 Unroll loops
 Improves usage of parallel execution (e.g. SSE2)
 Create small code
 E.g. by improving the implementation; properly align data
 Improves cache behavior
 Avoid mixed arithmetic
 Use STL
 Is heavily optimized
 Avoid disk usage and writing to a console etc.
 Follow the course: Optimization and vectorization
Ethics versus mistakes
 Let’s have a discussion here!