Mike Dodds, Andreas Haas, Christoph M. Kirsch Abstract Key Ideas

Fast Concurrent Data-Structures
Through Explicit Timestamping
Mike Dodds, Andreas Haas, Christoph M. Kirsch
[email protected], [email protected], [email protected]
Abstract
Partially Ordered Elements
Concurrent data-structures, such as stacks,
queues and deques, often implicitly enforce a
total order over elements with their underlying
memory layout. However, linearizability only requires that elements are ordered if the inserting
methods ran sequentially. We propose a new
data-structure design which uses explicit timestamping to avoid unwanted ordering. Elements
can be left unordered by associating them with
unordered timestamps if their insert operations
ran concurrently. In our approach, more concurrency translates into less ordering, and thus lesscontended removal and ultimately higher performance and scalability.
Elements with timestamps are stored in an unordered buffer. Elements which were inserted
concurrently may have the same timestamp.
• Order elements in the data-structure only
partially by using explicit timestamping.
• Store elements in thread-local buffers to
avoid synchronization of insert operations.
• Use the RDTSCP CPU instruction for
highly-scalable timestamp generation.
• Use intervals as timestamps to reduce the
order on timestamps while still providing
linearizability.
without interval timestamping
enq(b)
time
oldest element:
removed first
in a FIFO queue
Buffer
a|1
with interval timestamping
b|4
d|6
c|4
e|7
youngest element:
removed first
in a stack
same timestamp:
removed in any
order
artificially
delayed
dequeues can remove
element a and b in parallel
enq(a)
enq(b)
time
still faster than
enqueues in most other
implementations
TS-Buffer Design
tryRemoveR: A thread searches through the
right ends of all linked lists for the rightmost element to remove. Nodes are removed logically first by setting a removed
flag. Unlinking occurs in later insert or
remove operations. If setting the removed
flag fails, then tryRemoveR returns an
invalid item.
insL/tryRemoveL: Analogous to insR and
tryRemoveR, but the left side of the
linked lists is accessed.
Interval Length
Performance of the TS deque with interval
timestamping with an increasing delay in a highcontention producer-consumer benchmark. All
measurements are done with 80 threads on a 40core (2 hyperthreads per core) server machine.
The left y-axis shows the average number of operations (both insert and remove operations) per
millisecond, the right y-axis shows the average
number of CAS retries per remove operation.
7000
TS_Deque {
TS_Buffer buffer;
void insertR(Element element) {
item = buffer.insR(element);
timestamp = buffer.newTimestamp();
buffer.setTimestamp(item, timestamp);
}
Element removeR() {
do {
item = buffer.tryRemoveR();
} while (!item.isValid());
if (item.isEmpty())
return EMPTY;
else
return item.element;
}
}
left
7,R
6,L
...
left
......
right
8,R
right
2,L
9,R
4,R
3,L
6,L
timestamp
right
side where
the node got
inserted
Treiber Stack
EB Stack
MS Queue
FC Queue
3000
6
2000
4
1000
2
0
5000
10000
15000
20000
25000 30000
delay in ns
TS-hardware Queue
TS-interval Queue
Hardcoded TS-hardware Queue
Hardcoded TS-interval Queue
7000
6000
6000
operations per ms (more is better)
operations per ms (more is better)
TS-hardware Stack
TS-interval Stack
Hardcoded TS-hardware Stack
Hardcoded TS-interval Stack
7000
5000
4000
3000
2000
35000
40000
45000
0
50000
4000
3000
2000
1000
0
0
10
20
30
40
50
number of threads
60
70
TS-hardware Deque
TS-interval Deque
5000
1000
2
80
2
10
20
30
40
50
number of threads
60
70
80
(b) TS stack without interval timestamping
7000
7000
6000
6000
operations per ms (more is better)
operations per ms (more is better)
This work has been supported by the National
Research Network RiSE on Rigorous Systems
Engineering (Austrian Science Fund (FWF):
S11404-N23).
8
Performance and scalability of the TS deque in a high-contention producer-consumer benchmark with
an increasing number of threads. The baselines are a Treiber stack, an elimination-backoff stack (EB
stack), a Michael-Scott queue (MS queue), and a flat-combining queue (FC queue). Stack-specific
and queue-specific TS-Buffers (Hardcoded TS Stack and Hardcoded TS Queue) allow additional
performance gains.
Additional Informantion
Acknowledgements
10
4000
0
(a) TS stack with interval timestamping
http://scal.cs.uni-salzburg.at/tsstack
5000
12
Performance in a Producer/Consumer Benchmark
insertL and removeL are defined analogously.
• The remove operations of the TS deque are
lock-free, the insert operations are waitfree.
4,R
14
Performance TS-interval Stack
Retries TS-interval Stack
Performance Hardcoded TS-interval Stack
Retries Hardcoded TS-interval Stack
Performance TS-interval Queue
Retries TS-interval Queue
Performance Hardcoded TS-interval Queue
Retries Hardcoded TS-interval Queue
Performance TS-interval Deque
Retries TS-interval Deque
6000
operations per ms (more is better)
TS Deque Pseudo Code
left
• The TS deque is linearizable with respect
to the sequential specification of a deque.
Linearizability allows reordering
overlapping operations. Therefore
artificially delaying enqueue
operations can reduce the order
of elements in the data structure.
insR: Each thread inserts at the right side of
its own linked list.
TS Buffer
Correctness
dequeues compete for
element a before removing
element b
enq(a)
number of retries (less is better)
Key Ideas
Interval Timestamping
5000
4000
3000
2000
1000
5000
4000
3000
2000
1000
0
0
2
10
20
30
40
50
number of threads
60
70
(c) TS queue with interval timestamping
80
2
10
20
30
40
50
number of threads
60
70
80
(d) TS queue without interval timestamping