A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing

Stefan Richter
Information Systems Group
Saarland University

Victor Alvarez∗
Dept. Computer Science
TU Braunschweig

Jens Dittrich
Information Systems Group
Saarland University

∗Work done while at Saarland University.
ABSTRACT
Hashing is a solved problem. It allows us to get constant time access for lookups. Hashing is also simple. It is safe to use an arbitrary method as a black box and expect good performance, and optimizations to hashing can only improve it by a negligible delta. Why are all of the previous statements plain wrong? That is what this paper is about. In this paper we thoroughly study hashing for integer keys and carefully analyze the most common hashing methods in a five-dimensional requirements space: (1) data distribution, (2) load factor, (3) dataset size, (4) read/write ratio, and (5) un/successful lookup ratio. Each point in that design space may potentially suggest a different hashing scheme, and additionally also a different hash function. We show that a right or wrong decision in picking a hashing scheme and hash function combination may lead to significant differences in performance. To substantiate this claim, we carefully analyze two additional dimensions: (6) five representative hashing schemes (including an improved variant of Robin Hood hashing), and (7) four important classes of hash functions widely used today. That is, we consider 20 different combinations in total. Finally, we also provide a glimpse of the effect of table memory layout and the use of SIMD instructions. Our study clearly indicates that picking the right combination may have considerable impact on insert and lookup performance, as well as memory footprint. A major conclusion of our work is that hashing should be considered a white box before blindly using it in applications such as query processing. Finally, we also provide a strong guideline about when to use which hashing method.
1. INTRODUCTION
In recent years there has been a considerable amount of research on tree-structured main-memory indexes, e.g., [17, 13, 21]. However, it is hard to find recent database literature thoroughly examining the effects of different hash tables in query processing. This is unfortunate for at least two reasons: First, hashing has plenty of applications in modern database systems, including join processing, grouping, and accelerating point queries. In those applications, hash tables serve as a building block. Second, there is strong
evidence that hash tables are much faster than even the most recent and best tree-structured indexes. For instance, in our recent experimental analysis [1] we carefully compared the performance of modern tree-structured indexes for main-memory databases, like ARTful [17], with a selection of different hash tables¹. A central lesson learned from our work [1] was that a carefully and well-chosen hash table is still considerably faster (up to a factor of 4-5x) for point queries than any of the aforementioned tree-structured indexes. However, our previous work also triggered some nagging research questions: (1) When exactly should we choose which hash table? (2) What are the most efficient hashing methods that should be considered for query processing? (3) What other dimensions affect the choice of "the right" hash table? And finally, (4) What is the performance impact of those factors? While investigating answers to these questions we stumbled over interesting results that greatly enriched our knowledge, and that could greatly help practitioners, and potentially also the optimizer, to make well-informed decisions as to when to use which hash table.

¹We use the term hash table throughout the paper to indicate that both the hashing scheme (say linear probing) and the hash function (say Murmur) are chosen.
1.1 Our Contributions
We carefully study single-threaded hashing for 64-bit integer
keys and values in a five-dimensional requirements space:
1. Data distribution. Three different data distributions: dense,
sparse, and a grid-like distribution (think of IP addresses).
2. Load factor. Six different load factors between 25% and 90%.
3. Dataset size. We consider a variety of sizes for the hash tables to observe performance when they are rather small (they fit in cache), and when they are of medium and large sizes (outside cache, but still addressable by the TLB using huge pages or not, respectively).
4. Read/write-ratio. We consider whether the hash tables are
to be used under a static workload (OLAP-like) or a dynamic
workload (OLTP-like). For both we simulate an indexing
workload — which in turn captures the essence of other important operations such as joins or aggregates.
5. Un/successful lookup ratio. We study the performance of
the hash tables when the amount of lookups (probes) varies
from all successful to all unsuccessful.
Each point in that design space may potentially suggest a different hash table. We show that a right/wrong decision in picking the right combination ⟨hashing scheme, hash function⟩ may lead to an order of magnitude difference in performance. To substantiate this claim, we carefully analyze two additional dimensions:
6. Hashing scheme. We consider linear probing, quadratic probing, Robin Hood hashing as described in [5] but carefully engineered, Cuckoo hashing [19], and two different variants of chained hashing.
7. Hash function. We integrate each hashing scheme with four different hash functions: Multiply-shift [8], Multiply-add-shift [7], Tabulation hashing [20], and Murmur hashing [2], which is widely used in practice. This gives 24 different combinations (hash tables).
Therefore, we study in total a set of seven different dimensions that are key parameters to the overall performance of a hash
table. We shed light on these seven dimensions focusing on one of
the most important use-cases in query processing: indexing. This
in turn resembles very closely other important operations such as
joins and aggregates — like SUM, MIN, etc. Additionally, we also
offer a glimpse of the effect of different table layouts and the use
of SIMD instructions. Our main goal is to produce enough results
that can guide practitioners, and potentially the optimizer, towards
choosing the most appropriate hash table for their use case at hand.
To the best of our knowledge, no work in the literature has considered such a thorough set of experiments on hash tables.
Our study clearly indicates that picking the right configuration
may have considerable impact on standard query processing tasks
such as main-memory indexing as well as join processing, which
heavily rely on hashing. Hence, hashing should be considered as a
white box method in query processing and query optimization.
We decided to focus on studying hash tables in a single-threaded
context to isolate the impact of the aforementioned dimensions. We
believe that a thorough evaluation of concurrency in hash tables is a research topic in its own right and beyond the scope of this paper. However, our observations still play an important role for hash maps in multi-threaded algorithms. For partitioning-based parallelism — which has recently been considered in the context of (partition-based hash) joins [3, 4, 16] — single-threaded performance is still
a key parameter: each partition can be considered an isolated unit
of work that is only accessed by exactly one thread at a time, and
therefore concurrency control inside the hash tables is not needed.
Furthermore, all hash tables we present in the paper can be extended for thread safety through well-known techniques such as
striped locking or compare-and-swap. Here, the dimensions we
discuss still impact the performance of the underlying hash table.
This paper is organized as follows: In Sections 2 and 3 we briefly
describe each of the five considered hashing schemes and the four
considered hash functions respectively. In Section 4 we describe
our methodology, setup, measurements, and the three data distributions used. We also discuss why we have narrowed down our result
set — we present in this paper what we consider the most relevant
results. In Sections 5, 6, and 7 we present all our experiments along
with their corresponding discussion.
2. HASHING SCHEMES
In this paper, we study the performance of five different hashing schemes: (1) chained hashing, (2) linear probing, (3) quadratic probing, (4) Robin Hood hashing on linear probing, and (5) Cuckoo hashing — the last four belong to the so-called open-addressing schemes, in which every slot of the hash table stores exactly one element, or stores special values denoting whether the corresponding slot is free. For open-addressing schemes we assume that the tables have l slots (l is called the capacity of the table). Let 0 ≤ n ≤ l be the number of occupied slots (we call n the size of the table) and consider the ratio α = n/l as the load factor of the table. For chained hashing, the concept of load factor makes in general little sense since it can store more than one element in the same slot using a linked list, and thus we could obtain α > 1. Hence, whenever we discuss chained hashing for a load factor α, we mean that the presented chained hash tables are memory-wise comparable to open-addressing hash tables at load factor α — in particular, the hash tables contain the same number n of elements, but their directory sizes can differ. We elaborate on this in Section 4.5.
Finally, one fundamental question in open-addressing is whether to organize the table as an array-of-structs (AoS) or as a struct-of-arrays (SoA). In AoS, the table is stored in one (or more, in the case of Cuckoo hashing) arrays of key-value pairs, similar to a row layout. In contrast to that, the SoA representation keeps keys and corresponding values separated in two corresponding, aligned arrays — similar to a column layout. We found in a micro-benchmark that AoS is superior to SoA in most relevant cases for our setup and hence apply this organization in all open-addressing schemes in this paper. For more details on this micro-benchmark see Section 7. We now proceed to briefly describe each considered hashing scheme in turn.
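To make the two layouts concrete, the following minimal sketch (our illustration, not the paper's code) shows declarations one might use for a table of 64-bit key-value pairs; all names are placeholders.

  #include <cstdint>
  #include <vector>

  struct EntryAoS { uint64_t key; uint64_t value; };   // one 16 B slot per entry

  struct TableAoS {                        // array-of-structs: row-like layout
      std::vector<EntryAoS> slots;         // keys and values interleaved in one array
  };

  struct TableSoA {                        // struct-of-arrays: column-like layout
      std::vector<uint64_t> keys;          // probing scans only this array
      std::vector<uint64_t> values;        // payload fetched after a key match
  };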
2.1 Chained Hashing
Standard chained hashing is a very simple approach for collision handling, where each slot of table T (the directory) is a pointer to a linked list of entries. On inserts, entries are appended to the list that corresponds to their key k under hash function h, i.e., T[h(k)]. In case of lookups, the linked list under T[h(k)] is searched for the entry with key k. Chained hashing is a simple and robust method that is widely used in practice, e.g., in the current implementations of std::unordered_map in C++ STL or java.util.HashMap in Java. However, compared to open-addressing methods, chained hashing has typically sub-optimal performance for integer keys w.r.t. runtime and memory footprint. Two main reasons for this are: (1) the pointers used by the linked lists lead to a high memory overhead and (2) using linked lists leads to additional cache misses (even for slots with one element and no collisions). This situation brings different opportunities for optimizing a traditional chained hash table. For example, we can reduce cache misses by making the directory wide enough (say 24-byte entries for key-value-pointer triplets) so that we can always store one element directly in the directory and avoid following the corresponding pointer. Collisions are then stored in the corresponding linked list. In this version we potentially achieve the latency of open-addressing schemes (if collisions are rare) at the cost of space. Throughout the paper we denote these two versions of chained hashing by ChainedH8 and ChainedH24, respectively.
In the very first set of experiments we studied the performance of ChainedH8 and ChainedH24 under a variety of factors, to better understand the trade-offs they offer. One key observation that we would like to point out here is that entry allocation in the linked lists is a key factor for insert performance in all our variants of chained hashing. For example, a naive approach with dynamic allocation, i.e., using one malloc call per insertion and one free call per delete, leads to a significant overhead. For most use cases, an alternative allocation strategy provides a considerable performance benefit. That is, for both chained hashing methods in our indexing experiments, Sections 5 and 6, we use a slab allocator. The idea is to bulk-allocate many (or up to all) entries in one large array and store all map entries consecutively in this array. This strategy is very efficient in all scenarios where the size of the hash table is either known in advance or only growing. We observed an improvement over traditional allocation in both memory footprint (due to less fragmentation and less malloc metadata) and raw performance (by up to one order of magnitude!).
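The following is a minimal sketch of the slab-allocation idea just described; it is an assumption about one possible implementation, not the paper's code, and only covers the case where the table never shrinks.

  #include <cassert>
  #include <cstdint>
  #include <vector>

  struct ChainedEntry {
      uint64_t key;
      uint64_t value;
      ChainedEntry* next;                  // collision chain
  };

  class SlabAllocator {
      std::vector<ChainedEntry> slab_;     // one large, bulk-allocated array
      size_t used_ = 0;
  public:
      explicit SlabAllocator(size_t expected_entries) : slab_(expected_entries) {}
      ChainedEntry* allocate() {           // hands out entries consecutively
          assert(used_ < slab_.size());    // sized for a known or growing-only table
          return &slab_[used_++];
      }
  };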
2.2 Linear Probing
Linear probing (LP) is the simplest scheme for collision handling in open-addressing. The hash function is of the following form: h(k, i) = (h'(k) + i) mod l, where i represents the i-th probed location and h'(k) is an auxiliary hash function. It works as follows: First, try to insert each key-value pair p = ⟨k, v⟩ with key k at the optimal slot T[h(k, 0)] in an open-addressing hash table T. In case h(k, 0) is already occupied by another entry with a different key, we (circularly) probe the consecutive slots h(k, 1) to h(k, l − 1). We store p in the first free slot T[h(k, i)], for some 0 < i < l, that we encounter². We define the displacement d of p as i, and the sum of displacements over all entries as the total displacement of T. Observe that the total displacement is a measure of performance in linear probing, since a high value implies long probe sequences during lookups.
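A minimal sketch of LP insert and lookup following the description above (our simplification: keys and values kept in two plain arrays for brevity, a power-of-two capacity, an assumed reserved sentinel key, and a table that is never completely full):

  #include <cstdint>
  #include <vector>

  static constexpr uint64_t EMPTY_KEY = UINT64_MAX;        // assumed reserved sentinel

  struct LPTable {
      std::vector<uint64_t> keys, values;                   // capacity l = mask + 1
      uint64_t mask;                                        // l - 1, with l a power of two

      void insert(uint64_t k, uint64_t v, uint64_t (*h)(uint64_t)) {
          uint64_t slot = h(k) & mask;                      // optimal slot h(k, 0)
          while (keys[slot] != EMPTY_KEY && keys[slot] != k)
              slot = (slot + 1) & mask;                     // circular linear probe
          keys[slot] = k; values[slot] = v;
      }

      bool lookup(uint64_t k, uint64_t* out, uint64_t (*h)(uint64_t)) const {
          for (uint64_t slot = h(k) & mask; keys[slot] != EMPTY_KEY; slot = (slot + 1) & mask)
              if (keys[slot] == k) { *out = values[slot]; return true; }
          return false;                                     // empty slot reached: unsuccessful
      }
  };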
The rather simple strategy of LP has two advantages: (1) low code complexity, which allows for fast execution, and (2) excellent cache efficiency due to the sequential linear scan. However, at high load factors (> 60%), LP noticeably suffers from primary clustering, i.e., a tendency to create long sequences of filled slots and hence a high total displacement. We will refer to those areas of occupied slots that are adjacent w.r.t. probe sequences as clusters. Further, we can also observe that unsuccessful lookups worsen the performance of LP since they require a complete scan of all slots up to the first empty slot. Linear probing also requires dedicated handling of deletes, i.e., we cannot simply remove entries from the hash table because this could disconnect a cluster and produce incorrect results under lookups. One option to handle deletes in LP is the use of so-called tombstones, i.e., a special value (different from the empty slot) that marks deleted entries so that lookups continue scanning after seeing a tombstone — yielding correct results. Using tombstones makes deletes very fast. However, tombstones can have a negative impact on performance, as they potentially connect otherwise unconnected clusters, thus building larger clusters. Inserts can replace a tombstone that is found during a probe after confirming that the key to insert is not already contained. Another strategy to handle deletes is partial cluster rehash: we delete the entry from the slot and rehash all following entries in the same cluster. For our experiments we decided to implement an optimized version of tombstones which only places tombstones when required to keep a cluster connected (i.e., only if the slot after the deleted entry is occupied). Placing tombstones is very fast (faster in general than rehashing after every deletion), and the only negative point about tombstones are lookups after a considerable amount of deletions — in such a case we could shrink the hash table and perform a rehash anyway.
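A hedged sketch of the optimized tombstone rule just described (again our illustration, with assumed sentinel values): a tombstone is placed only if the slot right after the deleted entry is occupied; otherwise the slot simply becomes empty. Lookups treat the tombstone like an occupied, non-matching slot and keep scanning.

  #include <cstdint>
  #include <vector>

  static constexpr uint64_t EMPTY = UINT64_MAX;             // assumed reserved sentinel
  static constexpr uint64_t TOMBSTONE = UINT64_MAX - 1;     // assumed reserved sentinel

  void lp_delete(std::vector<uint64_t>& keys, uint64_t slot, uint64_t mask) {
      uint64_t next = (slot + 1) & mask;                    // next slot in the circular table
      // Keep the cluster connected only if it actually continues after this slot.
      keys[slot] = (keys[next] != EMPTY) ? TOMBSTONE : EMPTY;
  }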
One of our main motivations to study linear probing in this paper is not only that it belongs to the classical hashing schemes, which date back to the 1950s [15], but also the recent developments regarding its analysis. Knuth was the first [14] to give a formal analysis of the operations of linear probing (insertions, deletions, lookups), and he showed that all these operations can be performed in O(1) expected time using truly random hash functions³. However, very recently [20] it was shown that linear probing with tabulation hashing (see Section 3.3) as hash function asymptotically matches the bounds of Knuth, with an expected running time of O(1/ε²), where the hash table has capacity l = (1 + ε)n. That is, from a theoretical point of view, there is no reason to use any other hash table. We will see in our experiments, however, that the story is slightly different in practice.

²Observe that as long as the table is not full, an empty slot is found.
³Truly random hash functions map every key in a given universe of keys independently and uniformly onto the hash table.
2.3 Quadratic Probing
Quadratic probing (QP) is another popular approach for collision handling in open-addressing. The hash function in QP is of the following form: h(k, i) = (h'(k) + c1·i + c2·i²) mod l, where i represents the i-th probed location, h' is an auxiliary hash function, and c1 ≥ 0, c2 > 0 are auxiliary constants.
In case the capacity of the table l is a power of two and c1 = c2 = 1/2, it can be proven that quadratic probing will consider every single slot of the table exactly once in the worst case [6]. That is, as long as there are empty slots in the hash table, this particular version of quadratic probing will always find them eventually. Compared to linear probing, quadratic probing has a reduced tendency for primary clustering and comparably low code complexity. However, QP still suffers from so-called secondary clustering: if two different keys collide in the very first probe, they will also collide in all subsequent probes. For deletions, we can apply the same strategies as in LP. Our definition of displacement for LP carries over to QP as the number of probes 0 < i < l until an empty slot is found.
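For the special case c1 = c2 = 1/2 on a power-of-two table, the i-th probed slot can be computed from triangular numbers; a minimal sketch (our illustration):

  #include <cstdint>

  // h(k, i) = (h'(k) + i(i + 1)/2) mod l; over i = 0, ..., l - 1 this visits
  // every slot of the table exactly once when l is a power of two.
  uint64_t qp_slot(uint64_t home, uint64_t i, uint64_t mask) {   // mask = l - 1
      return (home + (i * (i + 1)) / 2) & mask;
  }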
2.4 Robin Hood Hashing on LP
Robin Hood hashing [5] is an interesting extension that can be applied to many collision handling schemes, e.g., linear probing [23]. For the remainder of this paper, we will only talk about Robin Hood hashing on top of LP and simply refer to this combination as Robin Hood hashing (RH). Furthermore, we introduce a new, tuned approach to Robin Hood hashing that improves on the worst-case scenario of LP (unsuccessful lookups at high load factors) at a small cost on inserts and on very high rates of successful lookups (close to 100%, the best-case scenario).
In general, RH is based on the observation that collisions can be resolved in favor of any of the keys involved. With this additional degree of freedom, we can modify the insertion algorithm of LP as follows: On a probe sequence to insert a new entry enew, whenever we encounter an existing entry eold with d(enew) > d(eold)⁴, we exchange eold with enew and continue the search for an empty slot with eold. As a result, the variance in displacement between all entries (and hence the variance in their lookup times) is minimized. While this approach does not change the total displacement compared to LP, we can exploit the established ordering in other ways. In this sense, the name Robin Hood was motivated by the observation that the algorithm takes from the "rich" elements (with smaller displacement) and gives to the "poor" (with higher displacement), thus distributing the "wealth" (proximity to the optimal slot) more fairly across all elements without changing the average "wealth" per element.
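A hedged sketch of this insertion rule (our simplification, reusing the plain-array layout and sentinel of the LP sketch above, and assuming a non-full, power-of-two table):

  #include <cstdint>
  #include <utility>
  #include <vector>

  static constexpr uint64_t EMPTY_KEY = UINT64_MAX;         // assumed reserved sentinel

  void rh_insert(std::vector<uint64_t>& keys, std::vector<uint64_t>& values,
                 uint64_t mask, uint64_t k, uint64_t v, uint64_t (*h)(uint64_t)) {
      uint64_t slot = h(k) & mask;                           // optimal slot of the new entry
      for (uint64_t d = 0; ; ++d, slot = (slot + 1) & mask) {
          if (keys[slot] == EMPTY_KEY) { keys[slot] = k; values[slot] = v; return; }
          uint64_t d_old = (slot - (h(keys[slot]) & mask)) & mask;  // displacement of resident
          if (d > d_old) {                                   // new entry is "poorer": take the slot
              std::swap(k, keys[slot]);
              std::swap(v, values[slot]);
              d = d_old;                                     // continue probing with the evicted entry
          }
      }
  }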
It is known that RH can reduce the variance in displacement significantly over LP. Previous work [23] suggests exploiting this property to improve on unsuccessful lookups in several ways. For example, we could start searching for elements at the slot with the expected (average) displacement from their perfect slot and probe bidirectionally from there. In practice, this is not very efficient due to high branch misprediction rates and/or unfriendly access patterns. Another approach introduces an early abort criterion for unsuccessful lookups. If we keep track of the maximum displacement dmax among all entries in the hash table, a probe sequence can already stop after dmax iterations. However, in practice we observed
that dmax is often still too high⁵ to obtain significant improvements
over LP. We can improve on this method by introducing a different abort criterion, which compares the probe iteration i with the displacement of the currently probed entry d(ei) in every step and stops as soon as d(ei) < i. However, comparing against d(ei) on each iteration requires us to either store displacement information or recalculate the hash value. We found all those approaches to be prohibitively expensive w.r.t. runtime and inferior to plain LP in most scenarios. Instead, our approach applies early abortion by hash computation only on every m-th probe, where a good choice of m is slightly bigger than the average displacement in the table. As computing the average displacement under updates can be expensive, a good sweet spot for most load factors is to check once at the end of each cache line. We found this to give a good trade-off between the overhead for successful probes and the ability to stop unsuccessful probes early. Hence, this is the configuration we use for RH in our experiments. Furthermore, our approach to RH applies partial rehash for deletions, which turned out to be superior to tombstones for this table. Notice that tombstones in RH would, for correctness, require storing information that allows us to reconstruct the displacement of the deleted entry.

⁴If d(enew) = d(eold) we can compare the actual keys as a tie breaker to establish a full ordering.
⁵For high load factors α, dmax can often be an order of magnitude higher than the average displacement.
2.5 Cuckoo Hashing
Cuckoo hashing [19] (CuckooH) is another open-addressing scheme that, in its original (and simplest) version, works as follows: There are two hash tables T0 and T1, each one having its own hash function, h0 and h1 respectively. Every inserted element p is stored at either T0[h0(p)] or T1[h1(p)], but never in both. When inserting an element p, location T0[h0(p)] is probed first; if the location is empty, p is stored there; otherwise, p kicks out the element q already found at that location, p is stored there, and we try to insert q at location T1[h1(q)]. If this location is free, q is stored there, otherwise q kicks out the element therein, and we repeat: in iteration i ≥ 0, location Tj[hj(·)] is probed, where j = i mod 2. In the end we hope that every element finds its own "nest" in the hash table. However, it may happen that this process enters a loop, and thus a place for each element is never found. This is dealt with by performing only a fixed number of iterations; once this limit is reached, a rehash of the complete set is performed by choosing two new hash functions. The advantages of CuckooH are: (1) For lookups, traditional CuckooH requires at most two table accesses, which is in general optimal among hashing schemes using linear space. In particular, the load factor has only a small impact on the lookup performance of the hash table. (2) CuckooH has been reported [19] to be competitive with other good hashing schemes, like linear and quadratic probing, and (3) CuckooH is easy to implement. However, it has been empirically observed [19, 12] that the load factor of traditional CuckooH with two tables should stay slightly below 50% in order to work. More precisely, below a 50% load factor creation succeeds with high probability, but it starts failing from 50% on [11, 18]. This problem can be alleviated by generalizing CuckooH to use more tables T1, T2, . . . , Tk, each having its own hash function, with k > 2. For example, for k = 4 the load factor (empirically) increases to 96% [12]. All this comes at the expense of performance, since now lookups require at most four table lookups. Furthermore, Cuckoo hashing is very sensitive to what hash functions are used [19, 10, 20] and requires robust hash functions. In our experiments we only consider Cuckoo hashing on four tables (called CuckooH4) since we want to study the performance of hash tables under many different load factors that go up to 90%,
and CuckooH4 is the only version of traditional Cuckoo hashing
that offers this flexibility.
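As an illustration of the displacement loop in the original two-table scheme described at the beginning of this subsection, consider the following minimal sketch (our code, not the paper's implementation; max_kicks is an assumed bound after which the caller rehashes with new hash functions):

  #include <cstdint>
  #include <utility>
  #include <vector>

  static constexpr uint64_t EMPTY_KEY = UINT64_MAX;         // assumed reserved sentinel

  struct Cuckoo2 {
      std::vector<uint64_t> keys[2], values[2];              // T0 and T1
      uint64_t mask;                                         // per-table capacity minus one
      uint64_t (*h[2])(uint64_t);                            // h0 and h1

      bool insert(uint64_t k, uint64_t v, int max_kicks = 64) {
          for (int i = 0; i < max_kicks; ++i) {
              int j = i & 1;                                 // iteration i probes T_(i mod 2)
              uint64_t slot = h[j](k) & mask;
              if (keys[j][slot] == EMPTY_KEY) {
                  keys[j][slot] = k; values[j][slot] = v;
                  return true;
              }
              std::swap(k, keys[j][slot]);                   // kick out the resident entry
              std::swap(v, values[j][slot]);
          }
          return false;                                      // suspected loop: rehash everything
      }
  };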
3. HASH FUNCTIONS
In our study we want to investigate the impact of different hash functions in combination with various hashing schemes (Section 2) under different key distributions. Our set of hash functions covers a spectrum of different theoretical guarantees that also admit very efficient implementations (low code complexity) and thus are also used in practice. We also consider one hash function that is, in our opinion, the most representative member of a class of engineered hash functions⁶ that do not necessarily have theoretical guarantees, but that show good empirical performance, and thus are widely used in practice. We believe that our chosen set of hash functions is very representative and offers practitioners a good set of hash functions for integers (64-bit in this paper) to choose from. The set of hash functions we considered is: (1) Multiply-shift [8], (2) Multiply-add-shift [7], (3) Tabulation hashing [20], and (4) Murmur hashing [2]. Formally, (1) is the weakest and (3) is the strongest w.r.t. randomization. The definition and properties of these hash functions are as follows:

⁶Like FNV, CRC, DJB, or CityHash, for example.
3.1 Multiply-shift
Multiply-shift (Mult) is very well known [8], and it is given here:

    h_z(x) = (x · z mod 2^w) div 2^(w−d)

where x is a w-bit integer in {0, . . . , 2^w − 1}, z is an odd w-bit integer in {1, . . . , 2^w − 1}, the hash table is of size 2^d, and the div operator is defined as a div b = ⌊a/b⌋. What makes this hash function highly interesting is: (1) It can be implemented extremely efficiently by observing that the multiplication x · z is natively done modulo 2^w on current architectures for native types like 32- and 64-bit integers, and the operator div is equivalent to a right bit shift by w − d positions. (2) It has been proven [8] that if x, y ∈ {0, . . . , 2^w − 1}, with x ≠ y, and if z ∈ {1, . . . , 2^w − 1} is odd and chosen uniformly at random, then the collision probability is at most 1/2^(d−1). This also means that the family of hash functions H_{w,d} = {h_z | 0 < z < 2^w and z odd} is an ideal candidate for simple and rather robust hash functions. Multiply-shift is a universal hash function.
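A minimal sketch of Multiply-shift for w = 64 (our code; z must be a random odd 64-bit integer and 0 < d < 64):

  #include <cstdint>

  uint64_t mult_shift(uint64_t x, uint64_t z, unsigned d) {
      // x * z is implicitly computed mod 2^64; div 2^(64 - d) is a right shift.
      return (x * z) >> (64 - d);
  }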
3.2 Multiply-add-shift
Multiply-add-shift (MultAdd) is also a very well known hash function [7]. Its definition is very similar to the previous one:

    h_{a,b}(x) = ((x · a + b) mod 2^(2w)) div 2^(2w−d)

where again x is a w-bit integer, a, b are two 2w-bit integers, and 2^d is the size of the hash table. For w = 32 this hash function can be implemented natively on 64-bit architectures, but w = 64 requires 128-bit arithmetic, which is still not widely supported natively. It can nevertheless still be implemented (keeping its formal properties) using only 64-bit arithmetic [22]. When a, b are randomly chosen from {0, . . . , 2^(2w) − 1}, it can be proven that the collision probability is 1/2^d, and thus it is stronger than Multiply-shift — although it also incurs heavier computation. Multiply-add-shift is a 2-independent hash function.
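A minimal sketch for w = 64 using the non-standard __uint128_t extension of GCC/Clang (our code; the pure 64-bit formulation of [22] is not shown; a and b are random 128-bit values and 0 < d ≤ 64):

  #include <cstdint>

  uint64_t mult_add_shift(uint64_t x, __uint128_t a, __uint128_t b, unsigned d) {
      // (x * a + b) is implicitly computed mod 2^128; div 2^(128 - d) is a right shift.
      return (uint64_t)((x * a + b) >> (128 - d));
  }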
3.3 Tabulation hashing
Tabulation hashing (Tab) is the strongest hash function among all the ones that we consider and also probably the least known. It became more popular in recent years since it can be proven [20] that tabulation hashing combined with linear probing achieves O(1) for insertions, deletions, and lookups. This produces a hash table that is, in asymptotic terms, unbeatable. Its definition is as follows (we assume 64-bit keys for simplicity): Split the 64-bit key into c characters, say eight 8-bit characters c1, . . . , c8. For every position 1 ≤ i ≤ 8 initialize a table Ti with 256 entries (one per character value) containing truly random 64-bit codes.
The hash function for key x = c1 · · · c8 is then:

    h(x) = T1[c1] ⊕ T2[c2] ⊕ · · · ⊕ T8[c8]

where ⊕ denotes bitwise XOR. So a hash code is composed by the XOR of the corresponding entries in the tables Ti for the characters of x. If all tables are filled with truly random data, then it is known that tabulation is 3-independent (but not stronger), which means that for any three distinct keys x1, x2, x3 from our universe of keys, and three (not necessarily distinct) hash codes y1, y2, y3 ∈ {0, . . . , l − 1},

    Pr[h(x1) = y1 ∧ h(x2) = y2 ∧ h(x3) = y3] ≤ 1/l³

which means that under tabulation hashing, the hash code h(xi) is uniformly distributed onto the hash table for every key in our universe, and that for any three distinct keys x1, x2, x3, the corresponding hash codes are three independent random variables.
Now, the interesting part of tabulation hashing is that it requires only bitwise operations, which are very fast, and lookups in the tables T1, . . . , T8. These tables altogether occupy 256 · 8 · 8 B = 16 KB, which means that they all fit comfortably in the L1 cache of a processor, which is 32 or 64 KB on modern servers. That is, lookups in those tables are potentially low-latency operations, and thus the evaluation of a single hash code is potentially very fast.
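A minimal sketch of tabulation hashing for 64-bit keys (our code; the tables T must be filled with truly random 64-bit codes beforehand):

  #include <cstdint>

  uint64_t tabulation(uint64_t x, const uint64_t T[8][256]) {
      uint64_t h = 0;
      for (int i = 0; i < 8; ++i) {
          h ^= T[i][x & 0xFF];   // XOR in the code of the i-th 8-bit character
          x >>= 8;
      }
      return h;
  }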
3.4 Murmur hashing
Murmur hashing (Murmur) is one of the most common hash
functions used in practice due to its good behavior. It is relatively
fast to compute and it seems to produce quite good hash codes. We
are not aware of any formal analysis on this, so we use Murmur
hashing essentially as is. As we limit ourselves in this paper to
64-bit keys, we use Murmur3’s 64-bit finalizer [2] as shown in the
code below.
uint64_t murmur3_64_finalizer(uint64_t key) {
  key ^= key >> 33;
  key *= 0xff51afd7ed558ccd;
  key ^= key >> 33;
  key *= 0xc4ceb9fe1a85ec53;
  key ^= key >> 33;
  return key;
}
4. METHODOLOGY
Throughout the paper we want to understand how well a hash table can work as a plain index for a set of n ⟨key, value⟩ pairs of 64-bit integers. The keys obey three different data distributions, described later on in Section 4.3. This scenario, albeit generic, resembles very closely other interesting uses of hash tables, such as in join processing or in aggregate operations like AVERAGE, SUM, MIN, MAX, and COUNT. In fact, we performed experiments simulating these operations, and the results were comparable to those from the WORM workload.
We study the relation between (raw) performance and load factors by performing insertions and lookups (successful and unsuccessful) on hash tables at different load factors. For this we consider a write-once-read-many (WORM) workload, and a mixed
read-write (RW) workload. These two kinds of workload simulate
elementary operational requirements of OLAP and OLTP scenarios, respectively, for index structures.
4.1 Setup
All experiments are single-threaded and all implementations are our own. All hash tables have map semantics, i.e., they cover both keys and values. All experiments are in main memory. For the experiments in Sections 5 and 6 we use a single core (one NUMA region) of a dual-socket machine having two hexacore Intel Xeon X5690 processors running at 3.47 GHz. The machine has a total of 192 GB of RAM running at 1066 MHz. The OS is a 64-bit Linux (kernel 3.2.0) with a page size of 2 MB (using transparent huge pages). All algorithms are implemented in C++ and compiled with gcc-4.7.2 with optimization -O3. Prefetching, hyper-threading, and turbo-boost are disabled via BIOS to isolate the real characteristics of the considered hash tables.
Since our server does not support AVX2 instructions, we ran the layout and SIMD evaluation, Section 7, on a MacBook Pro with an Intel Core i7-4980HQ running at 2.80 GHz (Haswell) with 16 GB of DDR3 RAM at 1600 MHz, running Mac OS X 10.10.2 in single-user mode. Here, the page size is 4 KB and prefetching is activated, since we could not deactivate it as cleanly as for our Linux server. All binaries are compiled with clang-600.0.56 with optimization -O3.
4.2 Measurement and Analysis
For all indexing experiments of Sections 5 and 6 we report the average of three independent runs (three different random seeds for the generation and shuffling of data). We performed an analysis of variance on all results and found that, in general, the results are very stable and uniform. Whenever variance was noticeable, we reran the corresponding experiment with the same setting to rule out machine problems. As the variance was insignificant, we decided that there is no added benefit in showing it in the plots.
4.3 Data distributions
Every indexed key is 64 bits. We consider three different kinds of data distributions: Dense, Sparse, and Grid. In the dense distribution we index every key in [1 : n] := {1, 2, . . . , n}. In the sparse distribution, n ≪ 2^64 keys are generated uniformly at random from [1 : 2^64 − 1]. In the grid distribution every byte of every key is in the range [1 : 14]. That is, the universe under the grid distribution consists of 14^8 = 1,475,789,056 different keys, and we use only the first n keys (in sorted order). Thus, the grid distribution is also a different kind of dense distribution. Elements are randomly shuffled before insertion, and the set of lookup keys is also randomly shuffled.
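The following sketch illustrates how the three key sets can be generated (our code; the paper does not name its random number generator, so the Mersenne Twister here is an assumption, and duplicates in the sparse case would still need to be filtered):

  #include <cstdint>
  #include <random>
  #include <vector>

  std::vector<uint64_t> dense_keys(uint64_t n) {                 // 1, 2, ..., n
      std::vector<uint64_t> keys(n);
      for (uint64_t i = 0; i < n; ++i) keys[i] = i + 1;
      return keys;
  }

  std::vector<uint64_t> sparse_keys(uint64_t n, uint64_t seed) { // uniform from [1 : 2^64 - 1]
      std::mt19937_64 gen(seed);
      std::uniform_int_distribution<uint64_t> dist(1, UINT64_MAX);
      std::vector<uint64_t> keys(n);
      for (auto& k : keys) k = dist(gen);
      return keys;
  }

  std::vector<uint64_t> grid_keys(uint64_t n) {                  // first n keys with all bytes in [1 : 14]
      std::vector<uint64_t> keys;
      uint64_t digit[8] = {1, 1, 1, 1, 1, 1, 1, 1};              // base-14 counter, least significant byte first
      while (keys.size() < n) {
          uint64_t k = 0;
          for (int i = 7; i >= 0; --i) k = (k << 8) | digit[i];  // assemble the 8 bytes
          keys.push_back(k);
          for (int i = 0; i < 8 && ++digit[i] > 14; ++i) digit[i] = 1;  // carry over
      }
      return keys;
  }

Before insertion the resulting vectors would then be shuffled, e.g., with std::shuffle, as described above.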
4.4 Narrowing down our result set
Our overall set of experiments contained the combinations of many different dimensions, and thus the amount of raw information obtained easily exceeds legibility and makes the presentation of the paper very difficult. For example, there are in total 24 different hash tables (hashing scheme + hash function). Thus, if we wanted to present all of them, every plot would contain 24 different curves, which is too much information for a single plot. Thus, we decided to present in this paper only the most representative set of results. We will make the complete set of results available in a technical report version of this paper. Therefore, although we originally considered four different hash functions, Mult, MultAdd, Tab, and Murmur (see Section 3), the following observations were uniform across all experiments: (1) Mult is the fastest hash function when integrated with all hashing schemes, i.e., producing the highest throughputs, and also of good quality (robustness), and thus it definitely deserves to be presented. (2) MultAdd, when integrated with hashing schemes, has a robustness that falls between Mult and Murmur — more robust than Mult but less than Murmur. In terms of speed it was slower (in throughput) than Murmur. Thus we decided not to present MultAdd here and present Murmur instead. (3) Tabulation was indeed the strongest, most robust hash function of all when integrated with all hashing schemes. However, it is also the slowest, i.e., producing the lowest throughput. By studying the results provided by Mult and Murmur, we think that the trade-off offered by tabulation (robustness instead of speed) is less attractive in practice. Hence we do not present results for tabulation here.
In the end, we observed the importance of reducing operations
during hash code computations as much as possible. Mult, for example, requires only one multiplication and one right bit shift —
it is by far the lightest to compute. MultAdd for 64-bit keys without 128-bit arithmetic [22] (natively unsupported on our server) requires two multiplications, six additions, plus a number of logical
ANDs and right bit shifts, which is more expensive than Murmur’s
64-bit finalizer which requires only two multiplications and a number of XORs and right bit shifts. As for tabulation, the eight table
lookups per key ended up dominating its execution time. Assuming all tables remain in L1 cache, the latency of each table lookup
is around 5-10 clock cycles. One addition requires one clock cycle
and one multiplication at most five clock cycles (on Intel architectures). Thus, it is very interesting to observe and understand that,
when hash code computation is part of hot loops during a workload
(as in our experiments), we should really be concerned about how
many clock cycles each computation costs — we could observe the
effect of even one more instruction per hash code computation. We
want to point out as well that the situation of MultAdd changes if we use native 128-bit arithmetic, or if we use 32-bit keys with native 64-bit arithmetic (one multiplication, one addition, and one right bit shift). In that case we could use MultAdd instead of Murmur for the benefit of proven theoretical properties.
4.5 On load factors for chained hashing
As we mentioned before, the load factor makes almost no sense for chained hashing since it can exceed one. Thus, throughout the paper we refrain from using the formal definition of load factor together with chained hashing. We will instead study chained hashing under memory budgets. That is, whenever we compare chained hashing against open-addressing schemes at a given load factor α = n/l, we modify the size of the directory of the chained hash table so that its overall memory consumption does not exceed 110% of what the open-addressing schemes require. In such a comparison, all hash tables contain the exact same number n of elements. Thus, all hash tables compute the exact same number of hashes. In this regard, whether or not a chained hash table stays within the memory constraints depends on the number of chained entries. Both variants of chained hashing considered by us cannot place more than a fraction of 16/24 < 0.67 of the total number of elements that an open-addressing scheme could place under the same memory constraint. If we take the extra 10% we grant to chained hash tables into account, this fraction grows to roughly 0.73. However, in practice this threshold is smaller (< 0.7) due to how collisions distribute over the table. This already strongly limits the usability of chained hashing under memory constraints and also brings up the following interesting situation. If chained hashing has to work under memory constraints, we can also try an open-addressing scheme for the exact same task under the same amount of memory. This potentially means lower load factors (< 0.5) for the latter. Depending on the hash function used, collisions might thus be rare, and the performance might become similar to a direct-addressing scheme — which is ideal. This might render chained hashing irrelevant.
5. WRITE-ONCE-READ-MANY (WORM)
In WORM we are interested in build and probe times (read-only structure) under six different load factors: 25%, 35%, 45%, 50%, 70%, 90%. These load factors are w.r.t. open-addressing schemes on three different pre-allocated capacities⁷: 2^16 (small — 1 MB), 2^27 (medium — 2 GB), and 2^30 (large — 16 GB). This gives a total of up to 54 different configurations (three data distributions, six load factors, and three capacities) for each of the 24 hash tables. Due to lack of space, and given our discussion of the load factors of chained hashing, we present here only the subsets for the large capacity shown in Figure 1.
Large capacity

  Load factors       Hash tables
  25%, 35%, 45%      ChainedH8, ChainedH24, LP
  50%                ChainedH24, LP, QP, RH, CuckooH4
  70%, 90%           LP, QP, RH, CuckooH4

Figure 1: Subset of results for WORM presented in this paper.
The main reason for presenting only the large capacity is that "big" datasets are nowadays of primary concern and most observations can be transferred to smaller datasets. Also, we divided the hash tables this way because, by our explanation before, at low load factors collisions will be rare and the performance of open-addressing schemes will be chiefly dominated by the simplicity of the used hash table — i.e., low code complexity. Thus we decided to compare the two variants of chained hashing against the simplest open-addressing scheme (linear probing)⁸. At a load factor of 50%, collision resolution of the different open-addressing schemes starts becoming apparent and thus from that point on we include all open-addressing schemes considered by us. For chained hashing we consider only the best performing variant. For higher load factors (≥ 70%), however, both variants of chained hashing could not place enough elements in the allocated memory. Thus we removed them altogether and study only open-addressing schemes.
5.1 Low load factors: 25%, 35%, 45%
In our very first set of experiments we are interested in understanding (1) the fundamental difference between chained hashing and open-addressing and (2) the trade-offs offered by the two different variants of chained hashing. The results can be seen in Figure 2.
Discussion. We start by discussing the memory footprints of all structures, see Figure 3. For linear probing, the footprint is constant (16 GB), independent of the load factor, and determined only by the size of the directory, i.e., 2^30 slots of 16 B each. In ChainedH8, the footprint is calculated as the size of the directory, i.e., 2^30 or 2^29 slots, times the pointer size — 8 B. In addition to that come 24 B for each entry in the table. The footprint of ChainedH24 is computed as the directory size, 2^29, times 24 B, plus 24 B for each collision. From this data we can obtain the amount of collisions for ChainedH24. For example, at load factor 35%, ChainedH24 requires 12 GB for the directory, and everything beyond that is due to collisions. Thus, for the sparse distribution for example, ChainedH24 deals with a collision rate of ≈ 28%. But under the dense distribution, it deals with only a ≈ 3% collision rate using Mult as hash function.
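As a quick check of this accounting (using the binary-GB convention of the text), the directory sizes stated above work out to:

    LP:          2^30 slots · 16 B = 16 GB   (independent of the load factor)
    ChainedH24:  2^29 slots · 24 B = 12 GB   (+ 24 B per collided entry)
    ChainedH8:   2^30 slots ·  8 B =  8 GB   (+ 24 B per entry)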
For performance results, let us focus on multiplicative hashing (Mult). Here, we can see a clear and stable ranking among the methods.
⁷WORM is a static workload. This means that the hash tables never rehash during the workload.
⁸For insignificant amounts of collisions, the performance of LP, RH, and QP is essentially equivalent.
[Figure 2 — plots, panels (a)-(l): insertion throughput (M insertions/second) vs. load factor, and lookup throughput (M lookups/second) vs. percentage of unsuccessful queries, for the dense, grid, and sparse distributions; curves: ChainedH8Mult, ChainedH8Murmur, ChainedH24Mult, ChainedH24Murmur, LPMult, LPMurmur.]
Figure 2: Insertion and lookup throughputs, comparing two different variants of chained hashing with linear probing, under three different distributions at load factors 25%, 35%, 45% from a capacity of 2^30 for linear probing. Higher is better.
[Figure 3 — plot: memory footprint (MB) vs. load factor (25%, 35%, 45%) under the dense distribution; curves: ChainedH8Mult, ChainedH8Murmur, ChainedH24Mult, ChainedH24Murmur, LPMult, LPMurmur.]
Figure 3: Memory usage under the dense distribution of the hash tables
presented in Figure 2. This distribution produces the largest differences in
memory usage among hash tables. For the sparse and grid distributions,
memory of ChainedH24Mult matches that of ChainedH24Murmur, and the
rest remain the same. Lower is better.
For inserts, ChainedH24 performs better than ChainedH8. This is expected as the inlining of ChainedH24 helps to avoid cache
misses for all occupied slots. Linear probing is, however, the top
performer. This is because low load factors allow for many in-place
insertions to the perfect slot.
In terms of lookup performance, we can also find a clear ranking
among the chained hashing variants. Here, ChainedH24 performs
best again. The superior performance of ChainedH24 is again easily explainable by the lower amount of pointer-chasing in the structure. We can also observe that between LP and ChainedH24, in
all cases, one of the two structures performs best — but each in
a different case and the order is typically determined by the ratio
of unsuccessful queries. For all successful lookups, LP outper-
forms, in all but one case, all variants of chained hashing. The
only exception is under the dense distribution at 25% load factor,
Figure 2(b). There, both methods are essentially equivalent because the amount of collisions is essentially zero. The difference
we observe is due to variance in code complexity and different directory sizes — smaller directories lead to better cache behavior.
Otherwise, in general, LP improves significantly over ChainedH24
if most queries are successful. In turn, ChainedH24 improves over
LP, also by a significant amount in general, if most lookups are unsuccessful. We typically find the crossover point at around 50%
unsuccessful lookups. Interestingly, in some cases we can even
observe ChainedH8 performing slightly better than LP for 100%
unsuccessful lookups. This is explainable because even when collisions are rare, primary clusters can build up in linear probing (think
of a continuous sequence of perfectly placed elements). For every unsuccessful query, LP has to scan until it finds an empty slot,
and as the amount of unsuccessful queries increases, LP becomes
considerably slower. If the unsuccessful query falls into a primary
cluster, chained hashing answers right away if it detects and empty
slot, or it will follow the linked list until the end. However, linked
lists are very short on average. The highest observed collision rate
is ≈ 34% (sparse distribution at 45% load factor). This means that,
at most, roughly one-third of the elements are outside the directory.
Under the probabilistic properties of Mult, it can be argued that the linked lists in chained hashing have expected length at most 2, and thus chained hashing follows on average at most two pointers.
We can conclude that, at low load factors (< 50%), LPMult
is the way to go if most queries are successful (≥ 50%), and
ChainedH24 must be considered otherwise.
[Figure 4 — plots, panels (a)-(l): insertion throughput (M insertions/second) vs. load factor, and lookup throughput (M lookups/second) vs. percentage of unsuccessful queries, for the dense, grid, and sparse distributions; curves: ChainedH24Mult, ChainedH24Murmur, CuckooH4Mult, CuckooH4Murmur, LPMult, LPMurmur, QPMult, QPMurmur, RHMult, RHMurmur.]
Figure 4: Insertion and lookup throughputs, open-addressing variants and chained hashing, under three different distributions at load factors 50%, 70%, 90% from a capacity of 2^30. Higher is better. Memory consumption for all open-addressing schemes is 16 GB, and 16.4 GB for ChainedH24.
5.2 High load factors: 50%, 70%, 90%
In our second set of experiments we study the performance of hash tables when space efficiency is required, and thus we are not able to use hash tables at low load factors. That is, we stress the hash tables to occupy up to 90% of the space assigned to them (chained hashing is allowed up to 10% more). We decided to use Cuckoo hashing on four tables, rather than on two or three tables, because this version of Cuckoo hashing is known to achieve load factors as high as 96.7% [9, 12] with high probability. In contrast, Cuckoo hashing on two and three tables has stable load factors of < 50% and up to ≈ 88%, respectively [18]. This means that if in practice we want to consider very high load factors (≥ 90%), then Cuckoo hashing on four tables is the best candidate. An overview of the absolute best performers w.r.t. the other two capacities (small and medium) is given as a table in Figure 6.
Discussion. Let us first start with a general discussion about the impact of distributions and hash functions on both insert and lookup performance across all tables. Our first important observation is that Multiply-shift (Mult) performs essentially always better than Murmur hashing in this experiment. We can conclude from this that, overall, the improved quality of Murmur over Mult does not justify the higher computational effort. Mult seems already good enough to drive our five considered hash tables — ChainedH24, LP, QP, RH, and CuckooH4 — up to the significantly high load factor of 90%; observe that no hash table is the absolute best using Murmur, see all plots of Figure 4. Another interesting observation is that, while we can see a significant variance in throughput under Mult across different data distributions — compare for example the throughputs of the dense and sparse distributions under Mult — this variance is minimal under Murmur. This indicates that Murmur provides a very good randomization of the input data, basically transforming every input distribution into a distribution that is very close to uniform, and hence the distribution seems not to have much effect under Murmur⁹. However, the sensitivity of a hash function to certain data distributions is not necessarily bad. For example, under the dense distribution¹⁰ Mult is known [15] to produce an approximate arithmetic progression as hash codes, which reduces collisions. For a comparison, just observe that the dense distribution achieves higher throughputs than the sparse distribution, which is usually considered an unbiased reference of speed. We have observed that the picture does not easily change, even in the presence of a certain degree of gaps in the sequence of dense keys. Overall this makes Mult a strong candidate for dense keys, which appear very often in practice, e.g., for generated primary keys. In contrast to that, Mult is slightly slower on the grid distribution compared to the sparse distribution. We could observe that Mult indeed produces more collisions than the expected amount on uniformly distributed keys. However, this larger amount of collisions is not highly reflected in the observed performance. Thus, we consider Mult the best candidate to be used in practice when high throughput is desired, but at the cost of a high variance across data distributions.

⁹We observed the same for Tab.
¹⁰Actually under a generalized dense distribution of keys following an arithmetic progression k, k + d, k + 2d, . . .
Let us now focus on the differences between the hash tables. In general, it can immediately be seen that all open-addressing schemes, except CuckooH4, are better than ChainedH24 in almost all cases for up to 50% unsuccessful lookups, see Figure 4 (a, b, e, f, i, j). Only for the degenerate case of 100% unsuccessful lookups is ChainedH24 the overall winner — for the same reasons as at low load factors. ChainedH24 is removed from the comparison for load factors > 50% because it exceeds the memory limit.
Between the open-addressing schemes, things are more interesting. On insertions (leftmost column of Figure 4), we can observe a rather clear ranking among the methods that holds across all distributions and load factors. CuckooH4 shows a very stable insert performance that is only slightly affected by increasing load factors. However, this performance is rather low. We can explain this result by the expensive reorganization that happens during Cuckoo cycles, which can often incur several cache misses (whenever an element is moved between the tables) for a single insert. Unsurprisingly, LP, QP, and RH show rather similar insert performance characteristics because their insertion algorithms are very similar. Starting with high performance at a 50% load factor, this performance drops significantly as the load factor increases. However, even under a high load factor, linearly and quadratically probing a hash table seems to be very effective. Among the three methods, we observe that RH is in general slightly slower than LP and QP. This is because RH performs small reorganizations on already inserted elements. However, these reorganizations often stay within one cache line, and thus the decrease in performance typically stays within less than 10%. With respect to QP and LP, the following are the most relevant observations. QP and LP have very similar insertion throughput for low load factors (up to 50%). For higher load factors, the difference in collision handling plays a role: (1) LPMult is considerably faster than QPMult under the dense distribution of keys (45M insertions/second versus 35M insertions/second — Figure 4(a)), and (2) QP (Mult/Murmur) is faster than LP (Mult/Murmur) otherwise. This is explainable: for (1) it suffices to observe that a dense distribution is the best case for LPMult, since Mult produces an approximate arithmetic progression (very few collisions). The best way to lay out an (approximate) arithmetic progression, in order to have better data locality, is to do so linearly, just as LP does. We could also observe that when primary clusters start appearing, they appear well distributed across the whole table, and they have similar sizes. Thus no cluster is arbitrarily long, which is good for LP. On the other hand, QP touches a new cache line in every probe subsequent to the third, and touching a new cache line usually results in a cache miss. Data locality is thus not optimal. For (2) the argument complements (1). Data is distributed more randomly, by the hash function, across the table. This causes an increase in collisions w.r.t. the combination ⟨dense distribution + Mult⟩. For high load factors this increase in collisions means considerably long primary clusters that LP has to deal with. In this case, QP is a better strategy to handle collisions since it scatters collisions more sparsely across the table, and the chances to find empty slots fast, over the whole sequence of insertions, are better than in LP with its considerably long primary clusters.
For lookups we find a similar situation as for inserts. LP, QP, and RH perform better than CuckooH4 in many situations, i.e., up to relatively high load factors. However, the performance of the former three significantly decreases with (1) higher load factors and (2) more unsuccessful lookups. We could observe that from a load factor of 80% on, CuckooH4 clearly surpasses the other methods. In general, LP, QP, and RH are better at dealing with higher collision rates than Cuckoo hashing, which is known to be negatively affected by "weak" hash functions [19] such as Mult. However, these "weak" hash functions have an effect only during the construction of the hash table: once the hash table is constructed, lookups in Cuckoo hashing are performed in constant time (at most four cache misses for CuckooH4). As such, Cuckoo hashing is also less affected by unsuccessful lookups than LP, QP, and RH. However, it seems that we can benefit from CuckooH4 only at very high load factors (≥ 80%).
As expected, the more complex re-organization that RH performs on the keys during insertions (see Section 2.4) pays off under unsuccessful lookups — RH is much less affected
by them than LP and QP. In RH, an unsuccessful lookup can stop as
soon as the displacement of the search key is exceeded by another
key we encounter during the probe. Hence, RH does not necessarily
require a complete scan of all adjacent keys in the same cluster, and
can stop probing after fewer iterations than LP or QP. Clearly, this advantage of RH over LP and QP increases with higher load factors
and higher rates of unsuccessful lookups — significantly improving the worst case of the method. However, in the best case,
i.e., when all lookups are successful, RH is slightly slower than
its competitors. This is also expected, as RH does not improve on
the average displacement or the number of loaded cache lines w.r.t. LP
(under RH and LP, clusters contain the same elements, merely in different orders). When all lookups are successful, the
(small) performance penalty of RH is due to its slightly more complex code. We can conclude that RH provides a very interesting trade-off: for a small penalty (often within 1-5%) in peak
performance in the best case (all lookups successful), RH
significantly improves on the worst case over LP in general, by up
to more than a factor of 4. Under the dense distribution — Figure 4
(a – c) — RH and LP have similar performance up to 70% load
factor, but for 90% load factor, RH is significantly faster than LP
(up to 40%) from 25% unsuccessful lookups on.
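The displacement-based early exit can be sketched as follows (64-bit keys, an EMPTY sentinel for free slots, and a placeholder multiplicative hash; this illustrates the idea rather than reproducing our implementation):

```cpp
#include <cstdint>
#include <vector>

// Illustrative Robin Hood lookup with displacement-based early termination.
// EMPTY marks free slots; mult_hash stands in for the table's hash function.
constexpr uint64_t EMPTY = ~0ull;

inline size_t mult_hash(uint64_t key, size_t mask) {
    return static_cast<size_t>(key * 0x9E3779B97F4A7C15ull) & mask;  // placeholder hash
}

bool rh_lookup(const std::vector<uint64_t>& table, uint64_t key) {
    const size_t mask = table.size() - 1;            // capacity assumed to be a power of two
    const size_t home = mult_hash(key, mask);        // slot the search key hashes to
    for (size_t d = 0; d < table.size(); ++d) {      // d = probe distance so far
        const size_t slot = (home + d) & mask;
        const uint64_t cur = table[slot];
        if (cur == key)   return true;               // hit
        if (cur == EMPTY) return false;              // miss: an empty slot ends the cluster
        // Robin Hood keeps cluster entries ordered by displacement, so once the
        // resident key sits closer to its own home slot than d, the search key
        // cannot appear further on: stop early instead of scanning the whole cluster.
        const size_t cur_disp = (slot - mult_hash(cur, mask)) & mask;
        if (cur_disp < d) return false;              // early termination for unsuccessful lookups
    }
    return false;
}
```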
Across the whole set of experiments, RH is always among
the top performers, and even the best method for most cases.
This observation holds for all data set sizes we tested. In this regard, Figure 6 gives an overview and summarizes the absolute best
methods we tested in this experiment under all capacities (small,
medium, and large). Methods are color-coded as in the curves
in the plots. Observe that patterns are nicely recognizable. For
lookups in general, RH seems to be an excellent all-rounder unless the hash table is expected to be very full, or the amount of
unsuccessful queries is rather large. In such cases, CuckooH4 and
ChainedH24 would be better options, respectively, if their slow insertion times are acceptable. With respect to insertions, it is natural that RH does not appear more often, and that CuckooH4 and ChainedH24 do not appear at all, due to their more complicated insertion procedures. For insertions, QP seems to be the best option in general.
Even where LP or RH is sometimes better, the difference is rather
small, less than 10%.
6. READ-WRITE WORKLOAD (RW)
In RW we are interested in analyzing how growing (rehashing)
over a long sequence of operations affects overall throughput and
memory consumption. The set of operations we consider is the
following: insertions, deletions (all successful), and lookups (successful and unsuccessful). In RW we let the hash tables grow
over a set of 1000 million operations that appear in random order.
Each hash table initially contains 16 million keys^11. We set the
insertion-to-deletion ratio (updates) to 4:1 (20% deletions), and the
successful-to-unsuccessful-lookup ratio to 3:1 (25% unsuccessful
queries). For this kind of workload we present here only the results concerning the sparse distribution of keys. We consider three
different thresholds for rehashing: at 50%, 70%, and 90%. Rehashing at 50% allows us to always have enough empty slots, and
thus also fewer collisions. However, this also means a potential loss
^11 In the beginning (no updates), the hash tables have a load factor of roughly 47%.
[Figure 5 plots: panels (a – c) throughput (M operations/second) and panels (d – f) memory footprint (MB), each vs. update percentage (%), for growing at 50%, 70%, and 90% load factor; curves for CuckooH4, LP, QP, RH, and ChainedH24, each with Mult and Murmur; sparse distribution.]
Figure 5: 1000M operations of RW workload under different load factors and update-to-lookup ratios. For updates, the insertion-to-deletion ratio is 4:1. For lookups, the successful-to-unsuccessful-lookup ratio is 3:1. The key distribution is sparse. Higher is better in performance, lower is better for memory.
[Figure 6 matrix: for each distribution (Dense, Grid, Sparse), load factor (50%, 70%, 90%), and capacity (S, M, L), the color-coded best method (QPMult, RHMult, CuckooH4Mult, LPMult, or ChainedH24) and its throughput, for insertions and for lookups with 0%, 25%, 50%, 75%, and 100% unsuccessful queries.]
Figure 6: Absolute best performers for the WORM workload (Section 5.2)
across distributions, different load factors, and different capacities: Small
(S), Medium (M) and Large (L). Throughput of the corresponding hash
table is shown inside its cell in millions of operations per second.
in space since the workload might stop shortly after growing, and
thus up to 75% of the hash table could be empty. On the other
hand, rehashing at 90% deals with a large amount of collisions as
the table gets full, but then we potentially waste less space. In addition, high load factors will incur slow lookup times
before a rehash. Observe again that, given the natural load factor limits of
Cuckoo hashing on two and three tables, Cuckoo hashing on four
tables is again the best candidate for controlling at what load factor
the hash table must rehash. For chained hashing, similar to the situation in WORM, we present here only the case where rehashing
is performed at 50% load factor. This is the only case in which we
can keep the memory consumption of chained hashing (ChainedH24)
comparable to what the open-addressing schemes require. The results of these experiments are shown in Figure 5.
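To make the operation mix of this workload concrete, the stream of operations could be drawn as follows (a sketch under the ratios stated above; not the exact driver we used):

```cpp
#include <random>

// Illustrative generator for the RW operation mix: an update/lookup split
// given by `update_pct`, updates split 4:1 into inserts and deletes, and
// lookups split 3:1 into successful and unsuccessful ones.
enum class Op { Insert, Delete, LookupHit, LookupMiss };

Op next_op(std::mt19937_64& rng, int update_pct) {
    std::uniform_int_distribution<int> pct(0, 99);
    if (pct(rng) < update_pct) {                               // update branch
        return pct(rng) < 80 ? Op::Insert : Op::Delete;        // 4:1 => 20% deletions
    }
    return pct(rng) < 75 ? Op::LookupHit : Op::LookupMiss;     // 3:1 => 25% unsuccessful
}
// e.g., next_op(rng, 50) draws from a 50% update / 50% lookup mix.
```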
Discussion. With respect to the performance in the WORM scenario at high load factors — Section 5.2 — the outcome of the
RW comparison offers few surprises. One of these surprises is
to see that ChainedH24 offers better performance than CuckooH4
(50% load factor only), and sometimes even by a large margin.
Both, however, lag clearly behind the other (open-addressing)
schemes. As the RW workload is write-heavy, what we see in the
plots is mostly the cost of table reorganization (rehashing) — except for data points at 0% updates. In that case, what we see are
only lookups with 25% unsuccessful queries, see Figure 4(j) for
a comparison. For CuckooH4 the gap narrows as the load factor
increases, see Figure 5(c), but this is not enough to become really competitive with the best performers — which are at least twice as fast
once updates become more frequent. As a conclusion, although the
memory requirements of ChainedH24 and CuckooH4 are competitive with those of the other schemes in a dynamic setting,
both — chained and Cuckoo hashing — should be avoided for
write-heavy workloads.
We can also see that Mult again dominates Murmur on all
hash tables — Figure 5 (a – c). This is to be expected, since
the hash tables rehash many times and the cost of hash function computations is therefore fundamental. Also, we always find LP, QP, and RH
to be the fastest methods, and often with very similar performance.
When growing at 50% load factor — Figure 5(a) — the difference in
throughput of all three methods is mostly within the bounds of
variance. For high update percentages (> 50%), we can observe a small performance penalty for RH in comparison to LP and
QP, which is due to the slightly slower insert performance that we
already observed in the WORM benchmark, see Figure 4(i). This
is expected because at 50% load factor there are few collisions,
and more sophisticated strategies for handling collisions cannot
pay off as much. At 70% and 90% load factors — Figures 5(b)
and 5(c) — all three methods get slower, and we can also
observe a clearer difference between them, because the different strategies now have an impact. Interestingly, with increasing load factor
and update ratio, QP shows the best performance, with LP
second and RH in third place. This is consistent with our
observation in the WORM experiment that QP is best for inserts at
high load factors and RH is typically the slowest. As a conclusion,
in a write-heavy workload, quadratic probing looks like the best
option in general.
7. ON TABLE LAYOUT: AOS OR SOA
One fundamental question in open addressing is whether to organize the table as an array-of-structs (AoS) or as a struct-of-arrays
(SoA). In AoS, the table is internally represented as an array of key-value pairs, whereas SoA keeps keys and values separated in two
corresponding, aligned arrays. Both variants offer different performance characteristics and trade-offs. These trade-offs are somewhat
similar to the difference between row and column layouts for storing
database tables. In general, we can expect to touch fewer cache lines
for AoS when the total displacement of the table is rather low, ideally just one cache line. In contrast to that, SoA already needs to
touch at least two cache lines for each successful probe (one for the
key and one for the value), even in the best case. However, for high displacement (and hence longer probe sequences) the SoA layout offers
the benefit that we can search through keys only, thus scanning only up to half the amount of data compared to AoS, where
keys and values are interleaved. Another advantage of SoA over
AoS is that separating keys from values makes vectorization
with SIMD easy, essentially allowing us to load and compare four
densely packed keys at a time in 256-bit SIMD registers as offered
on current AVX-2 platforms. In contrast to that, comparing four
keys in AoS with SIMD requires first extracting only the keys from
the key-value pairs into the SIMD register, e.g., by using gather-scatter vector addressing, which we found to be not very efficient on
current processors. Independent of this, AoS also needs to touch up
to two times more cache lines than SoA for long probe sequences,
when many keys are scanned.
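The two layouts, as we use the terms here, can be sketched as follows (64-bit keys and values; the type and field names are ours, for illustration only):

```cpp
#include <cstdint>
#include <vector>

// Array-of-structs: each slot holds its key and value next to each other,
// so a successful probe can be served from a single cache line.
struct SlotAoS { uint64_t key; uint64_t value; };
using TableAoS = std::vector<SlotAoS>;

// Struct-of-arrays: keys and values live in two parallel arrays, so probing
// scans densely packed keys (good for long probes and SIMD), but a hit needs
// a second cache line to fetch the value.
struct TableSoA {
    std::vector<uint64_t> keys;
    std::vector<uint64_t> values;   // values[i] belongs to keys[i]
};
```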
In the following, we present a micro-benchmark to illustrate the
effect of different layout and SIMD for inserts and lookups in linear probing. Since our computing server does not support AVX2 instructions, we ran this micro-benchmark on a new MacBook
Pro as described in Section 4. We implemented key comparisons
with SIMD instructions for lookups and inserts on top of our existing linear probing hash tables by manually introducing intrinsics
into our code. For example, in SoA, we load four keys at a time
into a SIMD register from a cache-line-aligned index, using the
_mm256_load_si256 intrinsic. Then we perform a vectorized comparison of the four keys using _mm256_cmpeq_epi64
and, in case of a successful comparison, obtain the first matching
index with _mm256_movemask_pd.
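A sketch of this comparison step for the SoA key array (AVX2; the function name is ours, the keys are assumed to be 32-byte aligned, and this is illustrative rather than the exact code we benchmarked):

```cpp
#include <cstdint>
#include <immintrin.h>   // AVX2 intrinsics

// Compare a probe key against four densely packed 64-bit keys starting at
// keys[slot] (SoA layout, 32-byte aligned). Returns the index of the first
// matching lane, or -1 if none of the four keys matches.
inline int simd_find4(const uint64_t* keys, size_t slot, uint64_t probe) {
    const __m256i search = _mm256_set1_epi64x(static_cast<long long>(probe));
    const __m256i block  = _mm256_load_si256(
        reinterpret_cast<const __m256i*>(keys + slot));            // four keys at once
    const __m256i eq     = _mm256_cmpeq_epi64(block, search);      // lane-wise 64-bit compare
    const int mask = _mm256_movemask_pd(_mm256_castsi256_pd(eq));  // one bit per lane
    return mask ? __builtin_ctz(mask) : -1;                        // first matching lane
}
```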
We compare LPMult in AoS layout against LPMult in SoA layout, with and without SIMD, on a sparse data set. Similar to the
indexing experiment of Section 5.2, we measure the throughput for
insertions and lookups for load factors 50, 70, and 90%. Due to the limited memory available on the laptop, we use the medium table capacity of 2^27 slots — 2 GB. This still allows us to study the performance outside of caches, where we expect layout effects to matter
most, because touching different cache lines typically triggers expensive cache misses. Figure 7 shows the results of the experiment.
Discussion. Let us start by discussing the impact of layout without
using SIMD instructions, methods LPAoSMult and LPSoAMult in
Figure 7. For inserts (Figure 7(a)), AoS performs up to 50% better
than SoA at the lowest load factor (50%). This gap slowly
closes with higher load factors, leaving AoS only 10% faster than
SoA at load factor 90%. This result can be explained as follows.
When collisions are rare (as at load factor 50%), SoA touches two
times more cache lines than AoS — it has to place key and value in
different locations. In contrast to that, SoA can fit up to two times
more keys in one cache line than AoS, which improves throughput
for longer probe sequences when searching empty slots under high
load factors. However, when we begin inserting into an empty
hash table, we can often place the entry into its hash bucket without
any further probing. Only over time will we require more and more
probes. Thus, in the beginning, there is a high number of insertions
for which the advantage of AoS has a higher impact. This is also the
reason why the gap in insertion throughput between AoS and SoA
significantly narrows as the load factor increases.
For lookups (Figures 7(b – d)) we noticed overall that AoS is
faster than SoA for short probe sequences, i.e., especially for low
load factors and low rates of unsuccessful queries. At the lowest
load factor (50%, Figure 7(b)), we can see that in the best case
(all queries successful) AoS typically encounters half the number of cache misses compared to SoA, because keys and values
are adjacent. This is reflected in a 63% higher throughput. With
increasing unsuccessful lookup rate, the performance of SoA approaches AoS, and the crossover point lies around 75% unsuccessful lookups. For 100% unsuccessful lookups, SoA improves over
AoS by 15%. For load factor 70% (Figure 7(c)), AoS is again superior to SoA for low rates of unsuccessful queries, but the crossover
point at which SoA starts being beneficial shifts to 25% unsuccessful queries, instead of the 75% observed at 50% load factor. Interestingly, we can observe that for load factor 90% (Figure 7(d)), the
advantage of SoA over AoS layout is unexpectedly low — with the
highest difference observed being around 30% instead of close to
a factor of 2 as we could expect. Our analysis identified a combination of three different factors that explain this result. First, even
in the extreme case of 100% unsuccessful lookups, the difference
in touched cache lines is not a factor of 2. The combination ⟨sparse
distribution, Mult⟩ simulates the ideal case that every key is uniformly distributed over the hash table. Thus, we know [15] that
the average number of probes in an unsuccessful search in linear probing is roughly 1/2 · (1 + 1/(1 − α)^2), where α is the load factor
of the table. Thus, for 90% load factor the average probe length is
roughly 50.5 (we could verify this experimentally as well). Now,
in AoS we can pack four key-value pairs into a cache line, and
twice as many keys (eight) for SoA. This means that the average number of loaded cache lines in AoS and SoA is roughly 50.5/4 and 50.5/8,
respectively. However, in practice this behaves like ⌈50.5/4⌉ = 13
and ⌈50.5/8⌉ = 7, respectively — since whole cache lines are loaded.
This means that AoS loads only roughly 1.85× as many cache lines
as SoA — which we were also able to verify experimentally. In
addition to that, a second factor is the non-uniform cost of visiting
cache lines. We observed that the first probe in a sequence is typically more expensive than the subsequent linear probes because the
first probe is likely to trigger a TLB miss and a page walk, which
amortizes over visiting a larger number of adjacent slots. The third
factor is that, independent of the number of visited cache lines,
the number of hash computations, loop iterations, and key comparisons is identical for SoA and AoS. Those parts of the probing
algorithm involve data dependencies that build up a critical path
in the pipeline, which is not easily hidden in the load latency by
modern processors and compilers. In conclusion, the ideal advantages of SoA over AoS are less strong in practice due to the
way hardware works.
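As a quick sanity check of the arithmetic above, assuming 64-byte cache lines, 16-byte AoS slots, and 8-byte SoA keys (a small sketch, not part of the benchmark itself):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double alpha  = 0.9;                                              // load factor
    const double probes = 0.5 * (1.0 + 1.0 / ((1 - alpha) * (1 - alpha)));  // ~50.5 probes
    const double aos_lines = std::ceil(probes / 4.0);   // 4 key-value pairs per cache line
    const double soa_lines = std::ceil(probes / 8.0);   // 8 keys per cache line
    // Expect roughly: 50.5 probes, 13 vs. 7 cache lines, ratio around 1.85x.
    std::printf("probes=%.1f  AoS lines=%.0f  SoA lines=%.0f  ratio=%.2f\n",
                probes, aos_lines, soa_lines, aos_lines / soa_lines);
    return 0;
}
```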
We now proceed to discuss the impact of SIMD instructions in
both layouts. In general, SIMD allows us to compare up to four
8-byte keys (or half a cache line) in parallel, with one instruction.
However, this parallelism typically comes at a small price because
loading keys into SIMD registers and generating a memory address
from the result of SIMD comparison (e.g., by performing count-
[Figure 7 plots: panel (a) insertions (M insertions/second vs. load factor 50/70/90%); panels (b – d) lookups (M lookups/second vs. unsuccessful queries 0–100%) at 50%, 70%, and 90% load factor; curves for LPAoSMult, LPAoSMultSIMD, LPSoAMult, LPSoAMultSIMD; medium capacity (2^27 slots), sparse distribution.]
Figure 7: Effect of layout and SIMD on the performance of LPMult at load factors 50, 70, and 90% w.r.t. 2^27 slots under a sparse distribution of keys. Higher is better.
trailing-zeros on a bit mask) potentially introduce a small overhead
in terms of instructions. In case of writes that depend on address
calculation based on the result of SIMD operations, we could even
observe expensive pipeline stalls. Hence, in certain cases, SIMD
can actually make execution slower, e.g., see Figure 7(a). For lower
load factors, using SIMD for insertions can decrease performance
significantly for both AoS and SoA layout, by up to 64% in the
extreme case. However, there is a crossover point between SIMD
and non-SIMD insertions around 75% load factor. We found that
in such cases, SIMD is up to 12% faster than non-SIMD.
For lookups, we can observe that SIMD improves performance
in almost all cases. We notice that in general, the improvement of
SIMD is higher for SoA than for AoS. As mentioned before, SoA
layout simplifies loading keys to a SIMD register, whereas AoS requires us to gather the interleaved keys in a register. We observed
that on the Haswell architecture, gathering is still a rather expensive operation and this difference gives SoA an edge over AoS for
SIMD. As a result, we find SoA-SIMD superior to plain SoA in
all cases for lookups, with improvements of up to 81% (Figure 7(b)). We observed that AoS-SIMD can be up to 17% slower
for low load factors, but beneficial for high load factors.
In general, we could observe in this experiment that AoS is significantly superior to SoA for insertions — even up to very high
load factors. Our overall conclusion is that AoS outperforms
SoA by a larger margin than the other way around. Inside
caches (not shown), both methods are comparable in terms of
lookup performance, with AoS performing slightly better. When
using SIMD, SoA has an edge over AoS — at least on current
hardware — because keys are already densely packed.
8. CONCLUSIONS AND FUTURE WORK
Due to the lack of space, we stated our conclusions in an inline
fashion throughout the paper. All the knowledge we gathered leads
us to propose a decision graph, Figure 8, that we hope can help
practitioners to decide more easily what hash table to use in practice
under different circumstances. Obviously, no experiment can be
complete enough to fully capture the true nature of each hash table
in every situation. Our suggestions are, nevertheless, educated by our
large set of experiments, and we are confident that they
represent the behavior of the hash tables very well. We also hope
that our study makes practitioners more aware of the trade-offs and
consequences of not carefully choosing a hash table.
9. REFERENCES
[1] V. Alvarez, S. Richter, X. Chen, and J. Dittrich. A comparison of adaptive radix
trees and hash tables. In 31st IEEE ICDE, April 2015.
[2] A. Appleby. Murmurhash3 64-bit finalizer. Version 19/02/15.
https://code.google.com/p/smhasher/wiki/MurmurHash3.
[3] C. Balkesen, J. Teubner, G. Alonso, and M. Ozsu. Main-memory hash joins on
modern processor architectures. IEEE TKDE, 2014.
[4] R. Barber, G. Lohman, I. Pandis, V. Raman, R. Sidle, G. Attaluri, N. Chainani,
S. Lightstone, and D. Sharpe. Memory-efficient hash joins. VLDB, 8(4), 2014.
[5] P. Celis. Robin Hood Hashing. PhD thesis, University of Waterloo, 1986.
[Figure 8 decision graph: decision nodes on read vs. write ratio, whether the table is dynamic, the load factor, whether the key distribution is dense, and whether lookups are mostly un/successful; the leaves recommend ChainedH24, QPMult, CH4Mult, LPMult, or RHMult.]
Figure 8: Suggested decision graph for practitioners.
[6] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms.
MIT Press, Cambridge, MA, USA, 1990.
[7] M. Dietzfelbinger. Universal hashing and k-wise independent random variables
via integer arithmetic without primes. In STACS, pages 569–580, 1996.
[8] M. Dietzfelbinger, T. Hagerup, J. Katajainen, and M. Penttonen. A reliable
randomized algorithm for the closest-pair problem. Journal of Algorithms,
25(1):19 – 51, 1997.
[9] M. Dietzfelbinger and R. Pagh. Succinct Data Structures for Retrieval and
Approximate Membership, volume 5125, pages 385–396. LNCS, 2008.
[10] M. Dietzfelbinger and U. Schellbach. On risks of using cuckoo hashing with
simple universal hash classes. In SODA, pages 795–804, 2009.
[11] M. Drmota and R. Kutzelnigg. A precise analysis of cuckoo hashing. ACM
Trans. Algorithms, 8(2):11:1–11:36, Apr. 2012.
[12] D. Fotakis, R. Pagh, P. Sanders, and P. Spirakis. Space efficient hash tables with
worst case constant access time. Theory of Comp. Sys., 38(2):229–248, 2005.
[13] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W.
Lee, S. A. Brandt, and P. Dubey. FAST: Fast architecture sensitive tree search on
modern CPUs and GPUs. In ACM SIGMOD, pages 339–350, 2010.
[14] D. Knuth. Notes on "open" addressing. Unpublished Memorandum, 1963.
[15] D. E. Knuth. The Art of Computer Programming, Volume 3: (2nd Ed.) Sorting
and Searching. Addison Wesley, 1998.
[16] H. Lang, V. Leis, M.-C. Albutiu, T. Neumann, and A. Kemper. Massively
Parallel NUMA-Aware Hash Joins, pages 3–14. LNCS, 2015.
[17] V. Leis, A. Kemper, and T. Neumann. The adaptive radix tree: Artful indexing
for main-memory databases. In 29th IEEE ICDE, pages 38–49, April 2013.
[18] M. Mitzenmacher. Some Open Questions Related to Cuckoo Hashing, volume
5757, pages 1–10. LNCS, 2009.
[19] R. Pagh and F. F. Rodler. Cuckoo hashing. Journal of Algorithms,
51(2):122–144, 2004.
[20] M. Pǎtraşcu and M. Thorup. The power of simple tabulation hashing. J. ACM,
59(3):14:1–14:50, June 2012.
[21] B. Schlegel, R. Gemulla, and W. Lehner. k-ary search on modern processors. In
DaMoN Workshop, pages 52–60. ACM, 2009.
[22] M. Thorup. String hashing for linear probing. In 20th ACM-SIAM SODA, pages
655–664, 2009.
[23] A. Viola. Analysis of hashing algorithms and a new mathematical transform.
University of Waterloo, 1995.