Fast Concurrent Data-Structures Through Explicit Timestamping
Mike Dodds, Andreas Haas, Christoph M. Kirsch
[email protected], [email protected], [email protected]

Abstract

Concurrent data-structures, such as stacks, queues and deques, often implicitly enforce a total order over elements with their underlying memory layout. However, linearizability only requires that elements are ordered if the inserting methods ran sequentially. We propose a new data-structure design which uses explicit timestamping to avoid unwanted ordering. Elements can be left unordered by associating them with unordered timestamps if their insert operations ran concurrently. In our approach, more concurrency translates into less ordering, and thus less-contended removal and ultimately higher performance and scalability.

Key Ideas

• Order elements in the data-structure only partially by using explicit timestamping.
• Store elements in thread-local buffers to avoid synchronization of insert operations.
• Use the RDTSCP CPU instruction for highly-scalable timestamp generation.
• Use intervals as timestamps to reduce the order on timestamps while still providing linearizability (see the timestamping sketch below).

Partially Ordered Elements

Elements with timestamps are stored in an unordered buffer. Elements which were inserted concurrently may have the same timestamp.

[Figure: an unordered buffer holding the timestamped elements a|1, b|4, c|4, d|6, e|7. The oldest element is removed first in a FIFO queue, the youngest element is removed first in a stack, and elements with the same timestamp may be removed in any order.]

Interval Timestamping

Linearizability allows reordering overlapping operations. Therefore artificially delaying enqueue operations can reduce the order of elements in the data-structure.

[Figure: without interval timestamping, enq(a) and enq(b) receive ordered timestamps, so dequeues compete for element a before removing element b. With interval timestamping, the enqueues are artificially delayed so that their timestamp intervals overlap, and dequeues can remove elements a and b in parallel. The delayed enqueues are still faster than enqueues in most other implementations.]

TS-Buffer Design

[Figure: the TS buffer consists of per-thread doubly-linked lists with left and right end pointers; each node is labelled with its timestamp and the side where it was inserted, e.g. 7,R 6,L in one list and 9,R 4,R 3,L 6,L in another.]

insR: Each thread inserts at the right side of its own linked list.

tryRemoveR: A thread searches through the right ends of all linked lists for the rightmost element to remove. Nodes are removed logically first by setting a removed flag; unlinking occurs in later insert or remove operations. If setting the removed flag fails, tryRemoveR returns an invalid item (see the logical-removal sketch below).

insL/tryRemoveL: Analogous to insR and tryRemoveR, but the left side of the linked lists is accessed.

TS Deque Pseudo Code

    TS_Deque {
      TS_Buffer buffer;

      void insertR(Element element) {
        // Insert first, then attach a timestamp to the new item.
        item = buffer.insR(element);
        timestamp = buffer.newTimestamp();
        buffer.setTimestamp(item, timestamp);
      }

      Element removeR() {
        // Retry until tryRemoveR returns a valid item.
        do {
          item = buffer.tryRemoveR();
        } while (!item.isValid());
        if (item.isEmpty()) return EMPTY;
        else return item.element;
      }
    }

insertL and removeL are defined analogously.

Correctness

• The TS deque is linearizable with respect to the sequential specification of a deque.
• The remove operations of the TS deque are lock-free, the insert operations are wait-free.
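The poster describes timestamp generation only in prose. As an illustration, here is a minimal C++ sketch of interval timestamping with RDTSCP, assuming an x86-64 machine with an invariant TSC that is synchronized across cores. The type IntervalTimestamp, the names new_interval_timestamp and is_ordered_before, and the delay_ticks parameter are hypothetical simplifications, not the TS-Buffer API; the actual implementation is available at the URL given under Additional Information below.

    // Minimal sketch of interval timestamping, not the authors' code.
    // Assumes x86-64 with an invariant TSC synchronized across cores.
    #include <cstdint>
    #include <x86intrin.h>  // __rdtscp

    struct IntervalTimestamp {
      uint64_t start;  // TSC value read when timestamping begins
      uint64_t end;    // TSC value read after an artificial delay
    };

    // Read the TSC, spin for roughly delay_ticks TSC ticks, then read it
    // again.  The wider the interval, the more likely concurrent inserts
    // get overlapping, i.e. unordered, timestamps.
    inline IntervalTimestamp new_interval_timestamp(uint64_t delay_ticks) {
      unsigned int cpu;  // CPU id output of RDTSCP, unused here
      IntervalTimestamp ts;
      ts.start = __rdtscp(&cpu);
      while (__rdtscp(&cpu) - ts.start < delay_ticks) {
        // spin to widen the interval
      }
      ts.end = __rdtscp(&cpu);
      return ts;
    }

    // a is ordered before b only if a's interval ends before b's begins;
    // overlapping intervals leave the two elements unordered, so they can
    // be removed in either order (or in parallel).
    inline bool is_ordered_before(const IntervalTimestamp& a,
                                  const IntervalTimestamp& b) {
      return a.end < b.start;
    }

With a zero delay the interval collapses to a single RDTSCP reading, which corresponds to the TS-hardware variants in the plots below; increasing the delay trades slower inserts for less ordering and therefore less contention between remove operations.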
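The logical-removal step described for tryRemoveR can be pictured with a second small sketch. This is a simplified, hypothetical rendering of a buffer node and the flag-setting step using C++11 atomics; it omits the per-thread list management and the physical unlinking, which the design defers to later insert or remove operations.

    // Minimal sketch of logical removal via a removed flag, not the
    // authors' code.  Node and try_logical_remove are illustrative names.
    #include <atomic>
    #include <cstdint>

    struct Element { /* user payload */ };

    struct Node {
      Element element;
      uint64_t ts_start, ts_end;         // interval timestamp of the element
      std::atomic<bool> removed{false};  // set when the node is taken
      Node* left = nullptr;              // neighbours in the thread-local list
      Node* right = nullptr;
    };

    // Try to claim the node that the search identified as the rightmost
    // candidate.  Exactly one competing remover wins the compare-exchange;
    // a loser gets false back, which corresponds to tryRemoveR returning
    // an invalid item, and the caller retries with a fresh search.
    inline bool try_logical_remove(Node* candidate) {
      bool expected = false;
      return candidate->removed.compare_exchange_strong(expected, true);
    }

Because a claimed node stays linked until a later operation unlinks it, the search in tryRemoveR has to skip nodes whose removed flag is already set.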
Interval Length

Performance of the TS deque with interval timestamping with an increasing delay in a high-contention producer-consumer benchmark. All measurements are done with 80 threads on a 40-core (2 hyperthreads per core) server machine. The left y-axis shows the average number of operations (both insert and remove operations) per millisecond, the right y-axis shows the average number of CAS retries per remove operation.

[Figure: operations per ms (more is better, left axis) and number of retries (less is better, right axis) plotted against the delay in ns, for the TS-interval stack, queue and deque and their hardcoded variants, each with a performance curve and a retries curve.]

Performance in a Producer/Consumer Benchmark

Performance and scalability of the TS deque in a high-contention producer-consumer benchmark with an increasing number of threads. The baselines are a Treiber stack, an elimination-backoff stack (EB stack), a Michael-Scott queue (MS queue), and a flat-combining queue (FC queue). Stack-specific and queue-specific TS-Buffers (Hardcoded TS Stack and Hardcoded TS Queue) allow additional performance gains.

[Figure: four panels plotting operations per ms (more is better) against the number of threads (up to 80): (a) TS stack with interval timestamping, (b) TS stack without interval timestamping, (c) TS queue with interval timestamping, (d) TS queue without interval timestamping. Each panel compares the TS-hardware and TS-interval variants, their hardcoded variants, and the Treiber stack, EB stack, MS queue and FC queue baselines.]

Additional Information

http://scal.cs.uni-salzburg.at/tsstack

Acknowledgements

This work has been supported by the National Research Network RiSE on Rigorous Systems Engineering (Austrian Science Fund (FWF): S11404-N23).