On Benchmarking Frequent Itemset Mining Algorithms

Balázs Rácz (Computer and Automation Research Institute of the Hungarian Academy of Sciences)
Ferenc Bodon (Budapest University of Technology and Economics)
Lars Schmidt-Thieme (Computer-Based New Media Group, Institute for Computer Science)

OSDM05, 2005-08-21

History
- Over 100 papers on Frequent Itemset Mining; many of them claim to be the ‘best’.
- The FIMI03 and FIMI04 workshops ran extensive benchmarks with many implementations and data sets.
- Claims are based on benchmarks run against some publicly available implementations on some datasets.
- These benchmarks have served as a guideline ever since.
- How ‘fair’ was the benchmark, and what did it actually measure?

On FIMI contests
- Problem 1: We are interested in the quality of algorithms, but we can only measure implementations.
  - There is no good theoretical data model yet for analytical comparison.
  - As we will see later, this would also require a good hardware model.
- Problem 2: If we gave our algorithms and ideas to a very talented and experienced low-level programmer, that could completely redraw the current FIMI rankings.
  - A FIMI contest is all about the ‘constant factor’.

On FIMI contests (2)
- Problem 3: Seemingly unimportant implementation details can hide all algorithmic features when benchmarking. These details often go unnoticed even by the author and are almost never published.

On FIMI contests (3)
- Problem 4: FIM implementations are complete ‘suites’ of a basic algorithm and several algorithmic/implementational optimizations. Comparing such complete ‘suites’ tells us what is fast, but does not tell us why.
- Recommendation: modular programming, with benchmarks on the individual features.

On FIMI contests (4)
- Problem 5: The run time of all ‘dense’ mining tasks is dominated by I/O.
- Problem 6: On ‘dense’ datasets, FIMI benchmarks are measuring the ability of submitters to code a fast integer-to-string conversion function.
- Recommendation: have as much identical code as possible; a library of FIM functions.

On FIMI contests (5)
- Problem 7: Run-time differences are small.
- Problem 8: Run time varies from run to run, even for the very same executable on the very same input. Bug or feature of modern hardware? What should we measure?
- Recommendation: a ‘winner takes all’ evaluation of a mining task is unfair.

On FIMI contests (6)
- Problem 9: Traditional run-time (plus memory) benchmarks do not tell us whether an implementation is better than another in algorithmic aspects or in implementational (hardware-friendliness) aspects.
- Problem 10: Traditional benchmarks do not show whether the conclusions would still hold on a slightly different hardware architecture (e.g. AMD vs. Intel).
- Recommendation: extend the benchmarks.

Library and pluggability
- Code reuse, pluggable components and data structures; object-oriented design.
- Do not sacrifice efficiency: no virtual method calls are allowed in the core.
- Then how? C++ templates:
  - allow pluggability with inlining;
  - plugging requires a source code change, but several versions can coexist;
  - sometimes tricky to code with templates.
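To make the ‘no virtual method calls in the core’ point concrete, here is a minimal sketch, not taken from the authors' library, of template-based pluggability: the output component is a template parameter, so its calls are resolved at compile time and can be inlined. All names (CountingOutput, FileOutput, mine_core) are hypothetical.

```cpp
// Sketch (not the authors' code): pluggable components via C++ templates.
// The core routine takes its output policy as a template parameter, so the
// call is resolved at compile time and can be inlined -- no virtual dispatch.
#include <cstdio>
#include <vector>

struct CountingOutput {                 // plug-in 1: just count itemsets
  long long count = 0;
  void found(const std::vector<int>&, int /*support*/) { ++count; }
};

struct FileOutput {                     // plug-in 2: write itemsets to a file
  std::FILE* f;
  explicit FileOutput(std::FILE* file) : f(file) {}
  void found(const std::vector<int>& itemset, int support) {
    for (int item : itemset) std::fprintf(f, "%d ", item);
    std::fprintf(f, "(%d)\n", support);
  }
};

// The "core": whatever the miner does, it reports results through Output.
template <class Output>
void mine_core(Output& out) {
  std::vector<int> itemset = {1, 3, 7};   // stand-in for real mining work
  out.found(itemset, 42);
}

int main() {
  CountingOutput counter;
  mine_core(counter);                     // inlined call, no vtable
  FileOutput writer(stdout);
  mine_core(writer);
  std::printf("itemsets counted: %lld\n", counter.count);
}
```

Switching the plug-in is a source-level change (a different template argument), but several instantiations can coexist in the same binary, which is exactly the trade-off the slide describes.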
I/O efficiency
- Variations of the output routine (a sketch of the df-cache variant follows the chart below):
  - normal-simple: renders each itemset and each item separately to text;
  - normal-cache: caches the string representation of item identifiers;
  - df-buffered: (depth-first) reuses the string representation of the last line and appends the last item;
  - df-cache: like df-buffered, but also caches the string representation of item identifiers.

[Chart: decoder-test. Time (seconds, log-scale, 0.1–100) of the four output variants (normal-simple, normal-cache, df-buffered, df-cache) as a function of the size of the itemset (19–26).]
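Below is a minimal sketch of the df-cache idea described above, assuming a depth-first miner that pushes and pops items as it recurses: the writer keeps the text of the current prefix and appends pre-rendered item strings, so only the last item changes per output line. The class name ItemsetWriter and its interface are hypothetical, not the authors' API.

```cpp
// Sketch of the "df-cache" output variant: reuse the text of the previous
// line during depth-first traversal and append a cached item string.
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

class ItemsetWriter {
 public:
  ItemsetWriter(std::FILE* f, int max_item) : f_(f) {
    item_str_.reserve(max_item + 1);
    for (int i = 0; i <= max_item; ++i)          // cache "123 " for each item
      item_str_.push_back(std::to_string(i) + ' ');
  }

  // depth-first recursion: push the next item, keep byte offset for backtrack
  void push(int item) {
    offsets_.push_back(line_.size());
    line_ += item_str_[item];
  }
  void pop() {                                    // backtrack: truncate buffer
    line_.resize(offsets_.back());
    offsets_.pop_back();
  }
  void write(int support) {                       // emit current itemset
    std::fprintf(f_, "%s(%d)\n", line_.c_str(), support);
  }

 private:
  std::FILE* f_;
  std::vector<std::string> item_str_;   // cached decimal form of each item id
  std::string line_;                    // text of the current itemset prefix
  std::vector<std::size_t> offsets_;
};

int main() {                            // usage: itemsets {1}, {1,5}, {1,5,9}
  ItemsetWriter w(stdout, 100);
  w.push(1);  w.write(10);
  w.push(5);  w.write(7);
  w.push(9);  w.write(3);
  w.pop();    w.pop();    w.pop();
}
```

The normal-simple variant would instead re-render every item of every itemset; the decoder-test chart above shows how much of the total run time such a detail can account for.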
Benchmarking: desiderata
1. The benchmark should be stable and reproducible. Ideally it should have no variation, certainly not on the same hardware.
2. The benchmark numbers should reflect actual performance: the benchmark should be a fairly accurate model of actual hardware.
3. The benchmark should be hardware-independent, in the sense that it should be stable against slight variations of the underlying hardware architecture, such as changing the processor manufacturer or model.

Benchmarking: reality
- Different implementations stress different aspects of the hardware: one may be better in one aspect and worse in another.
- Migrating to other hardware: the ranking cannot be migrated between hardware platforms.
- Is a win due to an algorithmic reason or to hardware-friendliness?
- Complex benchmark results are necessary: performance is not as simple as ‘run time in seconds’.

Benchmark platform
- Virtual machine: how to define it? How to code the implementations? What is the cost function?
- Instrumentation (simulation of the actual CPU): slow (100-fold slower than plain run time); what accuracy? What cost function?

Benchmark platform (2)
- Run-time measurement with performance counters:
  - present in all modern processors (since the i586);
  - count performance-related events in real time;
  - PerfCtr kernel patch under Linux, vendor-specific software under Windows (see the code sketch at the end of this document).
- Problem: the measured numbers reflect the actual execution and are thus subject to variation.

[Chart: run time (seconds, log-scale, 1–100) on BMS-POS.dat as a function of min_supp (5000 down to 0) for apriori-noprune, eclat-cover, eclat-diffset, nonordfp-classic-td, nonordfp-dense and nonordfp-sparse.]

[Chart: all uops on BMS-POS at min_supp 1000, in GClockticks (0–60), per implementation (eclat-cover, eclat-diffset, nonordfp-sparse, nonordfp-dense, apriori-noprune, nonordfp-classic). Legend: 3 uops/tick, 2 uops/tick, 1 uop/tick, stall, bogus uops, nbogus uops, prefetch pending, r/w pending. Three sets of bars per implementation (wide centered, narrow centered, right): the total size shows clockticks used, i.e. run time; brown shows the number of instructions (u-ops) executed; light brown shows total ticks of memory r/w (mostly wait); purple shows stall time (CPU waiting); cyan shows read-ahead (prefetch); black shows wasted (bogus) u-ops due to branch mispredictions.]

Conclusion
- We cannot measure algorithms, only implementations.
- Modular implementations with pluggable features; shared code for the common functionality (like I/O): a FIMI library with C++ templates.
- Benchmark: run time varies and depends on the hardware used, so complex benchmarks are needed.
- Are the conclusions about algorithmic aspects or about hardware-friendliness?

Thank you for your attention
- Big question: how does the choice of compiler influence the performance and the ranking?
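As a supplement to the ‘Benchmark platform (2)’ slide above, here is a minimal sketch of reading a hardware performance counter around a piece of code on Linux. The talk refers to the PerfCtr kernel patch (2005); this sketch instead uses the perf_event_open interface of modern kernels, so it is a substitute for, not a reproduction of, the authors' setup, and error handling is minimal.

```cpp
// Count retired instructions around a workload via perf_event_open (Linux).
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_open(perf_event_attr* attr) {
  // glibc provides no wrapper for this system call, so invoke it directly:
  // (attr, pid = 0: this process, cpu = -1: any, group_fd = -1, flags = 0)
  return syscall(SYS_perf_event_open, attr, 0, -1, -1, 0);
}

int main() {
  perf_event_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_INSTRUCTIONS;   // retired instructions
  attr.disabled = 1;
  attr.exclude_kernel = 1;

  int fd = static_cast<int>(perf_open(&attr));
  if (fd < 0) { std::perror("perf_event_open"); return 1; }

  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

  std::uint64_t sum = 0;                      // stand-in for the mining run
  for (std::uint64_t i = 0; i < 1000000; ++i) sum += i;

  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  std::uint64_t instructions = 0;
  if (read(fd, &instructions, sizeof(instructions)) != sizeof(instructions))
    return 1;
  std::printf("sum=%llu, instructions retired: %llu\n",
              static_cast<unsigned long long>(sum),
              static_cast<unsigned long long>(instructions));
  close(fd);
}
```

Pointing attr.config at PERF_COUNT_HW_CACHE_MISSES or PERF_COUNT_HW_BRANCH_MISSES gives the kind of per-implementation breakdown shown in the uops chart above; as the slide notes, the readings reflect one actual execution and therefore vary from run to run.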