On Benchmarking Frequent Itemset Mining Algorithms

Balázs Rácz (Computer and Automation Research Institute of the Hungarian Academy of Sciences)
Ferenc Bodon (Budapest University of Technology and Economics)
Lars Schmidt-Thieme (Computer-Based New Media Group, Institute for Computer Science)

OSDM05, 2005-08-21
History


- Over 100 papers on Frequent Itemset Mining
  - Many of them claim to be the 'best'
- FIMI'03 and FIMI'04 workshops: extensive benchmarks with many implementations and data sets
  - Based on benchmarks run against some publicly available implementations on some datasets
  - Serve as a guideline ever since
  - How 'fair' was the benchmark, and what did it measure?
On FIMI contests

- Problem 1: We are interested in the quality of algorithms, but we can only measure implementations.
  - There is no good theoretical data model yet for analytical comparison.
  - As we will see later, a good hardware model would also be needed.
- Problem 2: If we gave our algorithms and ideas to a very talented and experienced low-level programmer, they could completely redraw the current FIMI rankings.
  - A FIMI contest is all about the 'constant factor'.
On FIMI contests (2)

- Problem 3: Seemingly unimportant implementation details can hide all algorithmic features when benchmarking.
  - These details are often unnoticed even by the author and almost never published.
On FIMI contests (3)

- Problem 4: FIM implementations are complete 'suites' of a basic algorithm and several algorithmic/implementational optimizations. Comparing such complete 'suites' tells us what is fast, but does not tell us why.
- Recommendation:
  - Modular programming
  - Benchmarks on the individual features
On FIMI contests (4)

- Problem 5: The run time of all 'dense' mining tasks is dominated by I/O.
- Problem 6: On 'dense' datasets, FIMI benchmarks measure the ability of submitters to code a fast integer-to-string conversion function (see the sketch below).
- Recommendation:
  - Have as much identical code as possible
  - → a library of FIM functions
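To illustrate Problem 6, here is a minimal, hypothetical sketch (not taken from any FIMI submission) of the kind of hand-rolled integer-to-string routine whose speed can dominate the run time on dense outputs: digits are rendered directly into an output buffer and emitted with a single buffered write.

```cpp
#include <cstdio>

// Hypothetical illustration: append the decimal representation of `value`
// followed by `suffix` to the buffer at `out`, returning the new end pointer.
static char* append_uint(char* out, unsigned value, char suffix) {
    char tmp[10];                        // enough digits for a 32-bit unsigned
    int len = 0;
    do {                                 // produce digits in reverse order
        tmp[len++] = char('0' + value % 10);
        value /= 10;
    } while (value != 0);
    while (len > 0) *out++ = tmp[--len]; // copy them back in the right order
    *out++ = suffix;
    return out;
}

int main() {
    char line[64];
    char* end = line;
    // Render the itemset {3, 17, 42} as one output line: "3 17 42\n".
    end = append_uint(end, 3, ' ');
    end = append_uint(end, 17, ' ');
    end = append_uint(end, 42, '\n');
    std::fwrite(line, 1, end - line, stdout);  // a single buffered write
    return 0;
}
```

Swapping such a routine for a generic printf-style call is exactly the kind of detail that can reshuffle the measured ranking on dense tasks, even though it has nothing to do with the mining algorithm itself.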
On FIMI contests (5)

- Problem 7: Run-time differences are small.
- Problem 8: Run time varies from run to run.
  - Even for the very same executable on the very same input
  - Bug or feature of modern hardware?
  - What to measure?
- Recommendation: a 'winner takes all' evaluation of a mining task is unfair.
On FIMI contests (6)

- Problem 9: Traditional run-time (plus memory need) benchmarks do not tell us whether an implementation is better than another in algorithmic aspects or in implementational (hardware-friendliness) aspects.
- Problem 10: Traditional benchmarks do not show whether the conclusions would still hold on a slightly different hardware architecture (such as AMD vs. Intel).
- Recommendation: extend the benchmarks.
Library and pluggability

- Code reuse, pluggable components and data structures
- Object-oriented design
- Do not sacrifice efficiency
  - No virtual method calls allowed in the core
  - Then how?
- C++ templates (see the sketch below)
  - Allow pluggability with inlining
  - Plugging requires a source-code change, but several versions can coexist
  - Sometimes tricky to code with templates
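A minimal sketch of the template-based pluggability described above (my own illustration with assumed interfaces, not the actual FIMI library code): the core routine takes its output policy as a template parameter, so the call is resolved at compile time and can be inlined, with no virtual dispatch in the inner loop, while several instantiations coexist in one binary.

```cpp
#include <cstdio>
#include <vector>

// One pluggable output policy: counts itemsets instead of printing them.
struct CountingOutput {
    unsigned long count = 0;
    void write(const std::vector<unsigned>& itemset, unsigned support) {
        (void)itemset; (void)support;
        ++count;
    }
};

// Another policy: renders each itemset as a text line.
struct TextOutput {
    void write(const std::vector<unsigned>& itemset, unsigned support) {
        for (unsigned item : itemset) std::printf("%u ", item);
        std::printf("(%u)\n", support);
    }
};

// The "core" is parameterized by the output policy; the compiler can inline
// out.write() into the loop, and one instantiation exists per policy type.
template <class Output>
void report_frequent(const std::vector<std::vector<unsigned>>& itemsets,
                     Output& out) {
    for (const auto& is : itemsets)
        out.write(is, /*support=*/42);   // dummy support value for illustration
}

int main() {
    std::vector<std::vector<unsigned>> itemsets = {{1, 2}, {1, 2, 3}};
    CountingOutput counter;
    TextOutput text;
    report_frequent(itemsets, counter);  // counting instantiation
    report_frequent(itemsets, text);     // text-writing instantiation
    std::printf("total: %lu\n", counter.count);
    return 0;
}
```

The "plugging requires a source-code change" point shows up here as well: switching policies means instantiating the template with a different type, but both versions can be compiled into the same executable and benchmarked side by side.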
I/O efficiency

- Variations of the output routine:
  - normal-simple: renders each itemset and each item separately to text
  - normal-cache: caches the string representation of item identifiers
  - df-buffered: (depth-first) reuses the string representation of the last line and appends the last item (see the sketch below)
  - df-cache: like df-buffered, but also caches the string representation of item identifiers
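To make the df-buffered idea concrete, here is a small hypothetical sketch (not the authors' code): during a depth-first traversal the text of the current itemset extends its parent's line, so only the newly added item has to be rendered, and backtracking simply truncates the buffer. A df-cache variant would additionally look the digits up in a precomputed per-item string table instead of calling std::to_string each time.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical df-buffered output routine: the text of the current line is
// kept in a buffer; extending the itemset appends one item's digits, and
// backtracking truncates the buffer to the remembered length.
class DFBufferedWriter {
public:
    explicit DFBufferedWriter(std::FILE* out) : out_(out) {}

    // The depth-first search extends the current itemset by `item`.
    void push(unsigned item) {
        lengths_.push_back(line_.size());   // remember where this item starts
        line_ += std::to_string(item);
        line_ += ' ';
    }

    // The search backtracks: drop the last item's text.
    void pop() {
        line_.resize(lengths_.back());
        lengths_.pop_back();
    }

    // Emit the current itemset with its support count.
    void write(unsigned support) {
        std::fprintf(out_, "%s(%u)\n", line_.c_str(), support);
    }

private:
    std::FILE* out_;
    std::string line_;                  // text of the current itemset
    std::vector<std::size_t> lengths_;  // buffer length before each item
};

int main() {
    DFBufferedWriter w(stdout);
    w.push(1); w.write(10);             // prints "1 (10)"
    w.push(2); w.write(7);              // prints "1 2 (7)", reusing "1 "
    w.pop();
    w.push(3); w.write(5);              // prints "1 3 (5)"
    return 0;
}
```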
[Figure: run time of the output-routine variants (decoder-test, df-buffered, df-cache, normal-cache, normal-simple) in seconds on a log scale, plotted against the size of the itemset (19 to 26).]
Benchmarking: desiderata
1. The benchmark should be stable and reproducible. Ideally it should have no variation, certainly not on the same hardware.
2. The benchmark numbers should reflect the actual performance: the benchmark should be a fairly accurate model of the actual hardware.
3. The benchmark should be hardware-independent, in the sense that it should be stable against slight variations of the underlying hardware architecture, such as changing the processor manufacturer or model.
Benchmarking: reality

- Different implementations stress different aspects of the hardware
- Migrating to other hardware:
  - May be better in one aspect, worse in another
  - Rankings cannot be migrated between hardware platforms
- Complex benchmark results are necessary
  - Was a win due to algorithmic reasons or to hardware friendliness?
  - Performance is not as simple as 'run time in seconds'
Benchmark platform

- Virtual machine
  - How to define it?
  - How to code the implementations for it?
  - Cost function?
- Instrumentation (simulation of the actual CPU)
  - Slow (100-fold slower than the plain run time)
  - Accuracy?
  - Cost function?
Benchmark platform (2)

- Run-time measurement
- Performance counters (see the sketch below)
  - Present in all modern processors (since the i586)
  - Count performance-related events in real time
  - PerfCtr kernel patch under Linux, vendor-specific software under Windows
  - Problem: the measured numbers reflect the actual execution and are thus subject to variation
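The slides refer to the PerfCtr kernel patch that was current in 2005; as an illustration of the same idea on today's Linux kernels, here is a minimal sketch (my substitution, not the authors' setup) that reads hardware performance counters through the perf_event_open system call, counting cycles and retired instructions around a placeholder workload.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Open one hardware counter for the calling thread (user-space events only).
static int open_counter(uint32_t type, uint64_t config) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;         // start stopped; enabled explicitly below
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return static_cast<int>(
        syscall(__NR_perf_event_open, &attr, 0 /*this thread*/, -1 /*any CPU*/,
                -1 /*no group*/, 0));
}

int main() {
    int cycles = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
    int instrs = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
    if (cycles < 0 || instrs < 0) {
        std::perror("perf_event_open (check perf_event_paranoid)");
        return 1;
    }
    ioctl(cycles, PERF_EVENT_IOC_RESET, 0);
    ioctl(instrs, PERF_EVENT_IOC_RESET, 0);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(instrs, PERF_EVENT_IOC_ENABLE, 0);

    // --- code under measurement (placeholder workload) ---
    volatile uint64_t sum = 0;
    for (uint64_t i = 0; i < 10000000; ++i) sum += i;
    // ------------------------------------------------------

    ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(instrs, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t c = 0, n = 0;
    read(cycles, &c, sizeof(c));   // with the default read_format, one u64
    read(instrs, &n, sizeof(n));
    std::printf("cycles=%llu instructions=%llu ipc=%.2f\n",
                (unsigned long long)c, (unsigned long long)n,
                c ? double(n) / double(c) : 0.0);
    close(cycles);
    close(instrs);
    return 0;
}
```

The last bullet above still applies: because the counters observe the real execution, repeated runs of the same binary will not give bit-identical numbers.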
[Figure: run time in seconds (log scale) on BMS-POS.dat as a function of min_supp (5000 down to 0) for apriori-noprune, eclat-cover, eclat-diffset, nonordfp-classic-td, nonordfp-dense and nonordfp-sparse.]
[Figure, shown twice (once annotated): 'all uops on BMS-POS at 1000', with GClockticks on the vertical axis and one group of bars per implementation (eclat-cover, eclat-diffset, apriori-noprune, nonordfp-sparse, nonordfp-dense, nonordfp-classic). Legend: 3 uops/tick, 2 uops/tick, 1 uop/tick, stall, bogus uops, nbogus uops, prefetch pending, r/w pending. The (partly garbled) annotation explains that each group has three sets of bars: the wide centered bar shows the total clockticks used, i.e. the run time, broken down into ticks retiring 3/2/1 uops and stall time (CPU waiting, mostly for memory); the narrow centered bar shows pending memory reads/writes and read-ahead (prefetch); the right bar shows the total u-ops executed, including bogus u-ops wasted due to branch mispredictions.]
Conclusion



- We cannot measure algorithms, only implementations
- Modular implementations with pluggable features
- Shared code for the common functionality (such as I/O)
  - FIMI library with C++ templates
- Benchmarks: run time varies and depends on the hardware used
  - Complex benchmarks are needed
  - Are conclusions due to algorithmic aspects or to hardware friendliness?
Thank you for your attention

- Big question: how does the choice of compiler influence the performance and the ranking?