Safely Accessing Time Stamps in Transactions

Stephan Diestelhorst
Advanced Micro Devices, Inc.

Martin Pohlack
Advanced Micro Devices, Inc.

Abstract

Time stamps are frequently used in multi-threaded applications, and provide a way for an application to determine order between events. We identify interactions between time stamps and transactional mechanisms that differ from the expected behaviour of using locks for mutual exclusion, and draft implementations that remove these differences.

Keywords: Transactional Memory, Lock Elision, Clocks, Synchronization

1 Roadmap

Section 2 provides the background of concurrency control mechanisms and time stamps, and Section 3 reviews their semantics. We then show critical interactions between time stamps and transactions in Section 4, and draft two implementations that properly handle these in Section 5. Sections 6 and 7 provide an outlook and conclude the paper, respectively.

2 Background

2.1 Concurrency Control

Critical sections traditionally operate on a strict mutual exclusion property: instructions of two critical sections must not interleave if both critical sections are protected by the same lock variable. Recent literature calls this mode of operation Single Lock Atomicity (SLA) [6].

Database transactions provide serialisability [8], the strongest form of isolation. Briefly, transactions may overlap if there exists an equivalent execution in which no transactions overlap. Strict serialisability (and the similar linearisability [4]) is a stronger form of serialisability that restricts the serial order such that observed real-time order between non-overlapping transactions is maintained.

Transactional memory [3] brings the notion of transactions to general-purpose systems. The many proposed semantics [7] differ mainly in how they deal with interleavings of instructions that are not part of transactions and instructions that are part of a transaction.

2.2 Transactional Lock Elision

Transactional execution offers benefits over strict mutual exclusion imposed by critical sections, because transactions can be optimistically executed in parallel, as long as the specified isolation level is maintained. The idea to convert lock-protected critical sections into transactions and extract additional performance has been proposed in previous work on lock elision [10]. To transparently elide locks in existing applications, the semantics must not be weaker than that provided by mutual exclusion.

2.3 Access to Time Stamp Counters

Applications must measure time and its progression to determine the duration of events, and to order and coordinate them. Computers provide time sources of varying quality. We focus our analysis mainly on the CPU's time stamp counter (TSC), which has seen significant quality improvements during the last decade. We assume the TSC is a suitable real-time source (see footnote 1), and that applications (through libraries such as libc) rely on such behaviour. For brevity, and without loss of generality, we ignore the potential offset between different TSCs because it can be bounded by a small constant with initial calibration. Our results also hold in those cases, with some additional margins added to comparisons to account for the difference.

On the AMD64 architecture, the TSC can be read with the RDTSC and RDTSCP instructions. We consider only RDTSCP because it is properly serialised with the instruction stream. We do not consider other clocks, due to space and because the TSC is the most frequently used fast, stable time source.

Footnote 1: This has been true for most x86 microprocessors since 2007.

3 Semantics of Time Stamps

3.1 Traditional Code

Memory causality and synchronised time stamps allow us to reason about time stamp relations across multiple cores and application threads:

Causal TSCs: In non-transactional code, order imposed by memory accesses will be reflected in time stamps that are ordered accordingly.

In Figure 1, two RDTSCP instructions are ordered by a memory dependence. We assume that for this and similar examples t2 > t1 always holds; generally, existing order such as memory dependence and program order also orders time stamps.

[Figure 1 listing. Initially: c = 0.
  One thread:   t1 = RDTSCP; c := 1
  Other thread: lc = c; t2 = RDTSCP]
Figure 1: Dependent reads from the time stamp counter produce properly ordered time stamps: If lc = 1, then t2 > t1.

Critical sections implemented through locks serialise execution through memory dependencies. Therefore, for time stamps t1 and t2 read inside critical section CS1 with t1 < t2, and t3, t4 read in CS2 also ordered t3 < t4, and CS1 and CS2 being protected by the same lock variable, we know from Causal TSCs that either t2 < t3, or t4 < t1. In other words:

Temporal mutual exclusion: For two critical sections CS1, CS2 protected by the same lock, the intervals spanned by the set of obtained time stamps T(CS) do not overlap: [min(T(CS1)), max(T(CS1))] ∩ [min(T(CS2)), max(T(CS2))] = ∅.

3.2 TSC-oblivious Transactions

Because transactions are a new programming construct, they need not maintain strict compatibility with legacy code and its assumptions. Therefore, access to the TSC is treated differently in the various implementations of transactional memory: software TMs (STMs) usually do not track RDTSC(P) instructions and so may allow an application to infer temporal placement and overlap of the transactions. AMD's proposed Advanced Synchronization Facility (ASF) [1] does not allow transactions to execute RDTSC(P), making it impossible to detect temporal overlap for transactions at the cost of not being able to read the TSC in transactions. In Intel's recent Transactional Synchronization Extensions (TSX) [5], access to the TSC is permitted in both transactions and elided critical sections. Because the elision does not work fully transparently (see footnote 2), software can be exposed to the changed semantics of (elided) critical sections [9].

Footnote 2: Applications need to use annotated instructions to acquire and release the lock variable, and can query whether they run in a transaction / elided critical section with the XTEST instruction.

3.3 The Need for Stronger Semantics

Usually, elided critical sections track memory accesses and ensure they correlate to a sequential execution by tracking conflicting accesses. However, as we will show in the next section, this is insufficient if applications make use of the TSC inside critical sections. Applications may infer concurrent execution of critical sections that were supposed to execute strictly sequentially. Such a mismatch between expected behaviour and implementation breaks transparency of the elision and can lead to crashes or other misbehaviour.

In addition to supporting stronger semantics for the lock elision case, we also consider a stronger semantics incorporating TSC accesses for (hardware) transactions. Providing the same semantics in both modes makes sense for two reasons: (1) it allows hardware transactions to be used as a drop-in replacement for locking, and (2) it provides a well-known semantics for TSC usage in transactions that is easy to understand and reason about.

4 Semantic Issues of Time Stamps in Transactions

This section illustrates problematic orderings of transactions that access time stamps from within. We aim for transactions and elided critical sections to have a semantics that is equivalent to proper mutual exclusion. We will therefore use the term transactions also for elided critical sections.

4.1 Simple Overlap Case

In Figure 2, the case t1 < t < t2 directly violates Temporal mutual exclusion, the result for time stamps in traditional critical sections obtained in Section 3.1.

[Figure 2 listing.
  TX1: TX1.begin; t1 = RDTSCP; t2 = RDTSCP; TX1.end
  TX2: TX2.begin; t = RDTSCP; ...; TX2.end]
Figure 2: Overlapping time stamp values violate SLA.

4.2 Order Mismatch: Memory vs. Time Stamps

In addition to detecting temporal overlap through time stamps, two transactions may observe mismatches in the respective memory or time stamp orders. Figure 3 shows an execution that orders transaction TX1 before TX2 through memory; however, because t2 < t1, the time stamps indicate the opposite. Proper SLA semantics would not allow such contradicting orders under Causal TSCs.

[Figure 3 listing. Initially: c = 0.
  TX1: TX1.begin; c := 1; t1 = RDTSCP; TX1.end
  TX2: TX2.begin; t2 = RDTSCP; lc = c; TX2.end]
Figure 3: Two transactions can exhibit contradicting ordering if lc = 1 and t2 < t1.

4.3 Weak and Strong Temporal Isolation

If code accesses the same data inside and outside critical sections, no new order between these accesses is created. In those cases, order can be created only through reasoning with the underlying memory semantics. If no such order can be established (usually called a data race), the involved accesses are subject to complex, sometimes even all-bets-are-off / catch-fire semantics.

With transactional memory, however, some systems provide strong isolation, which properly orders transactions with respect to memory accesses outside of transactions, usually by assuming that the non-transactional accesses are one-instruction mini-transactions. Systems that do not provide such isolation, but only order transactions, provide weak isolation.

We observe a similar interaction with TSC accesses, extending the weak / strong isolation property to weak / strong temporal isolation (WTI / STI). Figure 4 highlights the interaction. The example clearly neither violates SLA nor produces overlapping time stamp intervals. It also does not violate purely memory-based strong isolation semantics, because TX2 is ordered behind the non-transactional store from a memory perspective. With strong temporal isolation, the intuitive combination of strong memory isolation and Causal TSCs is enforced.

[Figure 4 listing. Initially: c = 0.
  Left (non-transactional): t1 = RDTSCP; c := 1
  Right (TX2): TX2.begin; t2 = RDTSCP; lc = c; TX2.end]
Figure 4: Failure of strong temporal isolation if lc = 1 and t2 < t1. Instructions on the left do not execute from a transactional context.

4.4 The Need for Strong Temporal Isolation

Strong temporal isolation restricts execution more than a system that provides only weak temporal isolation. SLA does not inherently order critical sections and code outside of critical sections, so it may seem sufficient to provide only weak temporal isolation for transparent lock elision (and transactions). However, if we modify the example in Figure 4 slightly to obtain Figure 5, we find otherwise. Using locks (SLA semantics), we can deduce the following: If t2 < t1, TX2 must also read the old value of c, because it must have executed entirely before the empty transaction TX1, which in turn executed before the update to c. Strong temporal isolation enforces the same implication, but the depicted schedule is possible with weak temporal isolation. Therefore we conclude:

WTI insufficient: Transparent transactional lock elision with support for accesses to TSCs requires a stronger semantics than weak temporal isolation (see footnote 3).

[Figure 5 listing. Initially: c = 0.
  Left:  t1 = RDTSCP; TX1.begin; TX1.end; c := 1
  Right: TX2.begin; t2 = RDTSCP; lc = c; TX2.end]
Figure 5: Transparent lock elision requires strong temporal isolation: with SLA, if t2 < t1 then lc = 0; however, with weak temporal isolation lc = 1 is possible.

Footnote 3: Strong temporal isolation, for example. We do not know if a semantics weaker than STI would also work.

5 Supporting Time Stamps in Transactions

In the previous section, we presented various problems that illustrated how existing systems with unrestricted access to time stamps can cause violations of the semantics outlined in Section 3. Of course, forbidding all access to time stamp counters from within transactions and elided critical sections will provide the exact semantics. ASF works like that and aborts all executed transactions that try to use the RDTSC(P) instruction. However, because time stamping is employed in many applications, it is desirable to find less rigid solutions that provide temporal isolation.

Temporal Isolation Rules: Two rules are needed for proper temporal isolation: (1) the time stamp order needs to agree with the order established by the memory accesses in transactions and non-transactional code; and (2) time stamp intervals of transactions / elided critical sections must not overlap.

5.1 Single TSC Access

Restricting transactions to access the TSC at most once removes the problem of overlapping time stamp intervals. A simple implementation could therefore track locally that transactions do not read from the TSC twice, and abort otherwise.

The simplest solution to enforce consistent time stamp and memory order, Single-AbortAll, is to allow only a single transaction access to the TSC and abort all other overlapping transactions (e.g., by adding a dedicated memory location TSC_A to each transaction's read set and sending out conflicting write probes on an RDTSC(P)). For example, the execution of Figure 3 would abort TX1 immediately at the acquisition of t2, regardless of TX1's content.

A more selective approach is Single-AbortTSC, which aborts only those live transactions that have accessed or will eventually access the TSC. That can be achieved by recording TSC usage, handling the conflict only if a running transaction has already accessed the TSC, and letting future RDTSC(P) instructions check whether their enclosing transactions have seen a remote TSC access. Compared to Single-AbortAll, the advantage is that transactions that do not access the TSC need not be aborted. The execution in Figure 3 would only need conflict handling when TX1 obtains t1. Had TX1 not used RDTSCP, it could have continued execution.
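The difference between Single-AbortAll and Single-AbortTSC can be made concrete with a small software model. The following is a minimal sketch, not a description of any hardware: the class and method names (`SingleAccessModel`, `rdtscp`, `run_figure3`), the counter that stands in for the TSC, the bystander transaction TX3, and the choice that a transaction detecting a conflict at its own TSC read aborts itself are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(eq=False)  # compare transactions by identity, not by field values
class Tx:
    name: str
    used_tsc: bool = False        # has executed a transactional RDTSC(P)
    saw_remote_tsc: bool = False  # was live during a remote transactional TSC read
    aborted: bool = False


class SingleAccessModel:
    """Toy model of the two single-access policies (illustrative only)."""

    def __init__(self, policy):
        if policy not in ("Single-AbortAll", "Single-AbortTSC"):
            raise ValueError(policy)
        self.policy = policy
        self.live = []
        self._tsc = count(1)  # stand-in for the TSC: strictly increasing values

    def begin(self, name):
        tx = Tx(name)
        self.live.append(tx)
        return tx

    def _abort(self, tx):
        tx.aborted = True
        if tx in self.live:
            self.live.remove(tx)

    def commit(self, tx):
        """Return True if tx was still live and is now committed."""
        if tx.aborted or tx not in self.live:
            return False
        self.live.remove(tx)
        return True

    def rdtscp(self, tx):
        """Transactional TSC read: a time stamp, or None if tx is or gets aborted."""
        if tx.aborted:
            return None
        if tx.used_tsc or tx.saw_remote_tsc:
            # Second local read, or a remote reader came first. Assumed
            # resolution: the detecting transaction aborts itself.
            self._abort(tx)
            return None
        for other in list(self.live):
            if other is tx:
                continue
            if self.policy == "Single-AbortAll" or other.used_tsc:
                self._abort(other)
            else:
                other.saw_remote_tsc = True  # remembered for a later local read
        tx.used_tsc = True
        return next(self._tsc)


def run_figure3(policy):
    """Replay Figure 3 plus a hypothetical bystander TX3 that never reads the TSC."""
    m = SingleAccessModel(policy)
    tx1, tx2, tx3 = m.begin("TX1"), m.begin("TX2"), m.begin("TX3")
    t2 = m.rdtscp(tx2)  # TX2 acquires t2 first
    t1 = m.rdtscp(tx1)  # TX1 then tries to acquire t1
    committed = [tx.name for tx in (tx1, tx2, tx3) if m.commit(tx)]
    return {"t1": t1, "t2": t2, "committed": committed}
```

Under Single-AbortAll, `run_figure3` aborts TX1 and the bystander TX3 as soon as TX2 reads t2, so only TX2 commits. Under Single-AbortTSC, TX3 survives because it never touches the TSC, and TX1 is only stopped when it attempts its own read.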
5.2 Multiple TSC Accesses

Allowing a transaction to read from the TSC multiple times allows transactions to observe the passage of time. Similar to Single-AbortTSC, transactions will not abort all other concurrently running transactions at the first read of the TSC, but rather ensure that no overlapping time stamp intervals can form. In this case, each transaction checks at the second or later read from the TSC that it has not received any notifications from remote TSC accesses; otherwise, a conflict exists. This case needs to be handled by a conflict resolution policy, for example by self-aborting on the second local RDTSC(P), or by aborting all other time stamp-using transactions.

The advantage of this technique is that transactions can execute in parallel, and each can access the TSC multiple times, as long as the intervals formed by each transaction's first and last accesses to the TSC are disjoint. In the Figure 2 example, the conflict would be detected at the read of t2 because of the concurrent TSC read to t. Had the reads for t2 and t been ordered the opposite way (and then also t2 < t), no conflict would exist.

Enforcing order: Again, in addition to enforcing disjoint time stamp intervals, we need to make sure that these intervals are ordered consistently with all other orders, for example those imposed through memory accesses. Take the simple example in Figure 3: so far, the algorithm does permit the contradicting order. Consistent ordering between time stamps and memory accesses can be achieved by ensuring that TX2 commits before TX1, for example by waiting at TX1's commit point for a can-commit message from TX2. These messages do not need to be broadcast, because TX2 knows that TX1 has acquired a later time stamp. TX2 can therefore signal TX1 (and all later transactions) that they must wait for a commit signal, and when they can commit.
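The interval check and the commit ordering for multiple TSC accesses can be sketched in the same style. Again this is only an illustrative model under stated assumptions: `MultiAccessModel`, its methods, and the integer counter standing in for the TSC are hypothetical names, the conflict policy shown is self-abort on the second or later local read, and the can-commit message is modelled as a simple query rather than an actual signal between cores.

```python
from itertools import count


class Tx:
    def __init__(self, name):
        self.name = name
        self.first = None      # first time stamp read by this transaction
        self.last = None       # most recent time stamp read
        self.notified = False  # a remote TSC read happened after our first read
        self.state = "live"    # "live", "committed" or "aborted"


class MultiAccessModel:
    """Toy model: disjoint time stamp intervals plus commit in time stamp order."""

    def __init__(self):
        self.txs = []
        self._tsc = count(1)  # stand-in for the TSC: strictly increasing values

    def begin(self, name):
        tx = Tx(name)
        self.txs.append(tx)
        return tx

    def _live_tsc_users(self, but):
        return [o for o in self.txs
                if o is not but and o.state == "live" and o.first is not None]

    def rdtscp(self, tx):
        """Transactional TSC read: a time stamp, or None if tx is or gets aborted."""
        if tx.state != "live":
            return None
        if tx.first is not None and tx.notified:
            # A remote read fell inside our interval. Assumed policy: self-abort.
            tx.state = "aborted"
            return None
        now = next(self._tsc)
        for other in self._live_tsc_users(but=tx):
            other.notified = True  # models the remote-access notification
        if tx.first is None:
            tx.first = now
        tx.last = now
        return now

    def can_commit(self, tx):
        """True once no live transaction holds an earlier first time stamp."""
        if tx.first is None:
            return True
        return all(o.first > tx.first for o in self._live_tsc_users(but=tx))

    def commit(self, tx):
        """Commit if allowed; False means tx is dead or must keep waiting."""
        if tx.state != "live" or not self.can_commit(tx):
            return False
        tx.state = "committed"
        return True
```

Replaying Figure 2 in this model (TX1 reads, TX2 reads, TX1 reads again) makes TX1's second `rdtscp` return `None`. Replaying Figure 3 (TX2 reads, then TX1 reads) lets both reads succeed, but `commit(tx1)` returns `False` until TX2 has committed.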
Waiting alternatives: Instead of TX1 stalling the CPU at commit, it may be possible for it to continue executing the instructions behind the transaction transactionally, essentially increasing the length of the transaction by adding the following, non-transactional instructions to its transactional tail. That way, useful work can be done, reducing the performance penalty incurred by waiting. Of course, this may lead to additional conflicts (due to additional memory accesses adding to the working set), and additional transactional TSC accesses that need special handling.

Another option is to put the core in a low-power state and wait for the special can-commit signal from TX2 (similar to the low-power mode that can be entered with the MONITOR / MWAIT instruction combination [2]).

5.3 Handling Non-transactional TSC Accesses

STI requires that TSC accesses outside transactions participate in the conflict detection mechanisms. However, we may want to treat conflicts with non-transactional TSC accesses differently to ensure progress and minimal obstruction for non-transactional code. We therefore suggest biasing conflict resolution in favour of non-transactional TSC accesses, by always aborting the concurrent transactions with which the non-transactional RDTSC(P) conflicted (instead of retrying / delaying the non-transactional access).

6 Outlook

Although we believe our examples and implementations cover all critical interactions, and thus provide an identical semantics to fully sequential mutual exclusion, we have not yet proven our solutions to be correct or our restrictions to be minimal. We have started to develop a formalism, but have not discussed it here due to space constraints and prematurity.

We are particularly interested in the applicability of our proposed implementations. Clearly, our system is more permissive than AMD ASF's strict ban of TSC reads inside transactions (and more transparent than Intel's TSX lock elision), but we still need to abort and serialise a significant number of transactions. Although we believe that this is inherent to the limit to which one can make parallel execution behave similarly to sequential execution, some applications may just not care. Automatic inference of such properties seems very hard; therefore, we have restricted our analysis to provide as much performance as possible with STI. The straightforward way out (for us as hardware designers) is to leave the decision to software, for example by offering multiple types of transactions and / or time stamp counters. This opens up a new space for looking into the interactions between different types and exactly how much each needs to be weakened to provide compelling performance, while remaining useful for applications.

7 Conclusion

In addition to communication through memory, time stamps provide another way for parallel code to coordinate. Therefore, time stamps need to be considered by systems that control parallel execution such as transactional memory and lock elision. Existing critical sections provide the very strong semantics of strong temporal isolation, which we have not found to be provided by most STM and HTM solutions. In this paper, we have identified the issue of time stamp order and have shown with multiple examples how applications can observe time stamps inconsistent with the isolation level or with the order observed from memory accesses. We drafted various implementations that will provide strong temporal isolation without the need to fully serialise execution.

References

[1] Advanced Micro Devices, Inc. AMD Advanced Synchronization Facility Proposal. Available at developer.amd.com/CPU/ASF/Pages/default.aspx.
[2] Advanced Micro Devices, Inc. AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions, 3.18 edition, March 2012.
[3] Maurice P. Herlihy and J. Eliot B. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. In Proceedings of the 20th International Symposium on Computer Architecture, San Diego, Calif., May 1993.
[4] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: a Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, 1990.
[5] Intel Corp. Intel(R) Architecture Instruction Set Extensions Programming Reference, 319433-012a edition, February 2012.
[6] James R. Larus and Ravi Rajwar. Transactional Memory. Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2007.
[7] Vijay Menon, Steven Balensiefer, Tatiana Shpeisman, Ali-Reza Adl-Tabatabai, Richard Hudson, Bratin Saha, and Adam Welc. Practical Weak-Atomicity Semantics for Java STM. In Proceedings of the 20th ACM Symposium on Parallelism in Algorithms and Architectures, Munich, Germany, June 2008.
[8] Christos H. Papadimitriou. The Serializability of Concurrent Database Updates. Journal of the ACM, 26(4):631–653, 1979.
[9] Ravi Rajwar. Personal communication, 2012.
[10] Ravi Rajwar and James R. Goodman. Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution. In Proceedings of the 34th IEEE/ACM International Symposium on Microarchitecture, Austin, Tex., December 2001.