Safely Accessing Time Stamps in Transactions

Stephan Diestelhorst
[email protected]
Advanced Micro Devices, Inc.
Martin Pohlack
[email protected]
Advanced Micro Devices, Inc.
Abstract
Time stamps are frequently used in multi-threaded
applications, and provide a way for an application to
determine order between events. We identify interactions
between time stamps and transactional mechanisms that differ
from the expected behaviour of using locks for mutual
exclusion, and draft implementations that remove these
differences.

Keywords: Transactional Memory, Lock Elision, Clocks,
Synchronization

1 Roadmap
Section 2 provides the background of concurrency control
mechanisms and time stamps, and Section 3 reviews their
semantics. We then show critical interactions between
time stamps and transactions in Section 4, and draft two
implementations that properly handle these in Section
5. Sections 6 and 7 provide an outlook and conclude the
paper, respectively.

2 Background

2.1 Concurrency Control
Critical sections traditionally operate on a strict mutual
exclusion property: instructions of two critical sections
must not interleave if both critical sections are protected
by the same lock variable. Recent literature calls this
mode of operation Single Lock Atomicity (SLA) [6].
Database transactions provide serialisability [8], the
strongest form of isolation. Briefly, transactions may
overlap if there exists an equivalent execution in which no
transactions overlap. Strict serialisability (and the similar
linearisability [4]) is a stronger form of serialisability that
restricts the serial order such that the observed real-time
order between non-overlapping transactions is maintained.
Transactional memory [3] brings the notion of transactions
to general-purpose systems. The large variety of
semantics [7] differs mainly in how they deal with
interleavings of instructions that are not part of transactions
and instructions that are part of a transaction.

2.2 Transactional Lock Elision
Transactional execution offers benefits over strict mutual
exclusion imposed by critical sections, because transactions
can be optimistically executed in parallel, as long
as the specified isolation level is maintained. The idea to
convert lock-protected critical sections into transactions
and extract additional performance has been proposed in
previous work on lock elision [10].
To transparently elide locks in existing applications,
the semantics must not be weaker than that provided by
mutual exclusion.

2.3 Access to Time Stamp Counters
Applications must measure time and its progression
to determine the duration of events, and to order and
coordinate them. Computers provide time sources of
varying quality. We will focus our analysis mainly on the
CPU’s time stamp counter (TSC), which has seen
significant quality improvements during the last decade.
We assume the TSC is a suitable real-time source,¹ and
that applications (through libraries such as libc) rely on
such behaviour. For brevity, and without loss of generality,
we ignore the potential offset between different TSCs
because it can be bounded by a small constant with initial
calibration. Our results also hold in those cases, with some
additional margins added to comparisons to account for
the difference.
On the AMD64 architecture, the TSC can be read
with the RDTSC and RDTSCP instructions. We consider
only RDTSCP because it is properly serialised with the
instruction stream. We do not consider other clocks, both
for lack of space and because the TSC is the most
frequently used fast, stable time source.

¹ This has been true for most x86 microprocessors since 2007.

3 Semantics of Time Stamps

3.1 Traditional Code
Memory causality and synchronised time stamps allow
us to reason about time stamp relations across multiple
cores and application threads:

Causal TSCs: In non-transactional code, order imposed
by memory accesses will be reflected in time
stamps that are ordered accordingly.

    Initially: c = 0

    t1 = RDTSCP
    c := 1
                        lc = c
                        t2 = RDTSCP

Figure 1: Dependent reads from the time stamp counter
produce properly ordered time stamps: If lc = 1, then t2 > t1.

In Figure 1, two RDTSCP instructions are ordered by
a memory dependence. We assume that t2 > t1 always
holds for this and similar examples; generally, existing
order such as memory dependence and program order
also orders time stamps.
Critical sections implemented through locks serialise
execution through memory dependencies. Therefore, for
time stamps t1 and t2 read inside critical section CS1
with t1 < t2, and t3, t4 read in CS2 also ordered t3 < t4,
and CS1 and CS2 being protected by the same lock
variable, we know from Causal TSCs that either t2 < t3
or t4 < t1. In other words:

Temporal mutual exclusion: For two critical sections
CS1, CS2 protected by the same lock, the intervals
spanned by the set of obtained time stamps T(CS)
do not overlap: [min(T(CS1)), max(T(CS1))] ∩
[min(T(CS2)), max(T(CS2))] = ∅.

3.2 TSC-oblivious Transactions
Because transactions are a new programming construct,
they need not maintain strict compatibility with legacy
code and its assumptions. Therefore, access to the TSC
is treated differently in the various implementations of
transactional memory: software TMs (STMs) usually do
not track RDTSC(P) instructions and so may allow an
application to infer temporal placement and overlap of
the transactions. AMD’s proposed Advanced
Synchronization Facility (ASF) [1] does not allow transactions
to execute RDTSC(P), making it impossible to detect
temporal overlap for transactions at the cost of not being
able to read the TSC in transactions.
In Intel’s recent Transactional Synchronization
Extensions (TSX) [5], access to the TSC is permitted in both
transactions and elided critical sections. Because the
elision does not work fully transparently,² software can
be exposed to the changed semantics of (elided) critical
sections [9].

² Applications need to use annotated instructions to acquire and
release the lock variable, and can query whether they run in a
transaction / elided critical section with the XTEST instruction.

3.3 The Need for Stronger Semantics
Usually, elided critical sections track memory accesses
and ensure they correlate to a sequential execution by
tracking conflicting accesses. However, as we will show
in the next section, this is insufficient if applications make
use of the TSC inside critical sections. Applications may
infer concurrent execution of critical sections that were
supposed to execute strictly sequentially. Such a mismatch
between expected behaviour and implementation breaks
the transparency of the elision and can lead to crashes or
other misbehaviour.
In addition to supporting stronger semantics for the
lock elision case, we also consider a stronger semantics
incorporating TSC accesses for (hardware) transactions.
Providing the same semantics in both modes makes
sense: it (1) allows hardware transactions to be used as a
drop-in replacement for locking, and (2) provides a
well-known semantics for TSC usage in transactions that is
easy to understand and reason about.

4 Semantic Issues of Time Stamps in Transactions
This section illustrates problematic orderings of
transactions and the use of time stamp accesses from within. We
aim for transactions and elided critical sections to have
semantics equivalent to proper mutual exclusion.
We will therefore use the term transactions also for elided
critical sections.

    TX1.begin           TX2.begin
    t1 = RDTSCP
                        t = RDTSCP
    t2 = RDTSCP
    TX1.end             ...
                        TX2.end

Figure 2: Overlapping time stamp values violate SLA.

    Initially: c = 0

    TX1.begin           TX2.begin
                        t2 = RDTSCP
    c := 1
    t1 = RDTSCP
    TX1.end
                        lc = c
                        TX2.end

Figure 3: Two transactions can exhibit contradicting ordering
if lc = 1 and t2 < t1.

4.1 Simple Overlap Case
In Figure 2, the case t1 < t < t2 directly violates
Temporal mutual exclusion, the result for time stamps in
traditional critical sections obtained in Section 3.1.

4.2 Order Mismatch: Memory vs. Time Stamps
In addition to detecting temporal overlap through time
stamps, two transactions may observe mismatches in
the respective memory or time stamp orders. Figure 3
shows an execution that orders transaction TX1 before
TX2 through memory; however, because t2 < t1, the
time stamps indicate the opposite. Proper SLA semantics
would not allow such contradicting orders under Causal
TSCs.

4.3 Weak and Strong Temporal Isolation
If code accesses the same data inside and outside critical
sections, no new order between these accesses is created.
In those cases, order can be created only through
reasoning with the underlying memory semantics. If no
such order can be established (usually called a data race),
the involved accesses are subject to complex, sometimes
even all-bets-are-off / catch-fire semantics.
With transactional memory, however, some systems
provide strong isolation, which properly orders
transactions with respect to memory accesses outside of
transactions, usually by assuming that the
non-transactional accesses are one-instruction mini-transactions.
Systems that do not provide such isolation, but only order
transactions, provide weak isolation.
We observe a similar interaction with TSC accesses,
extending the weak / strong isolation property to weak /
strong temporal isolation (WTI / STI).
Figure 4 highlights the interaction. The example
clearly neither violates SLA, nor does it produce
overlapping time stamp intervals. It also does not violate
purely memory-based strong isolation semantics, because
TX2 is ordered behind the non-transactional store from
a memory perspective. With strong temporal isolation,
the intuitive combination of strong memory isolation and
Causal TSCs is enforced.

    Initially: c = 0

                        TX2.begin
                        t2 = RDTSCP
    t1 = RDTSCP
    c := 1
                        lc = c
                        TX2.end

Figure 4: Failure of strong temporal isolation if lc = 1
and t2 < t1. Instructions on the left do not execute from a
transactional context.

4.4 The Need for Strong Temporal Isolation
Strong temporal isolation restricts execution more than
a system that provides only weak temporal isolation.
Because SLA does not inherently order critical sections
and code outside of critical sections, it may seem
sufficient to provide only weak temporal isolation for
transparent lock elision (and transactions). However, if
we modify the example in Figure 4 slightly to obtain
Figure 5, we find otherwise. Using locks (SLA semantics),
we can deduce the following: If t2 < t1, TX2 must also
read the old value of c because it must have executed
entirely before the empty transaction TX1, which in
turn executed before the update to c. Strong temporal
isolation enforces the same implication, but
the depicted schedule is possible with weak temporal
isolation. Therefore we conclude:

WTI insufficient: Transparent transactional lock
elision with support for accesses to TSCs requires a
stronger semantics than weak temporal isolation.³

³ Strong temporal isolation, for example. We do not know if a
semantics weaker than STI would also work.

    Initially: c = 0

                        TX2.begin
                        t2 = RDTSCP
    t1 = RDTSCP
    TX1.begin
    TX1.end
    c := 1
                        lc = c
                        TX2.end

Figure 5: Transparent lock elision requires strong temporal
isolation: with SLA, if t2 < t1 then lc = 0; however, with
weak temporal isolation lc = 1 is possible.

5 Supporting Time Stamps in Transactions
In the previous section, we presented various problems
that illustrated how existing systems with unrestricted
access to time stamps can cause violations of the semantics
outlined in Section 3.
Of course, forbidding all access to time stamp counters
from within transactions and elided critical sections
will provide the exact semantics. ASF works like that
and aborts all executed transactions that try to use the
RDTSC(P) instruction. However, because time stamping
is employed in many applications, it is desirable to find
less rigid solutions that provide temporal isolation.

Temporal Isolation Rules: Clearly, there are two rules
needed for proper temporal isolation: (1) the time stamp
order needs to agree with the order established by the
memory accesses in transactions and non-transactional
code; and (2) time stamp intervals of transactions / elided
critical sections must not overlap.

5.1 Single TSC Access
Restricting transactions to access the TSC at most once
removes the problem of overlapping time stamp intervals.
A simple implementation could therefore track locally
that transactions do not read from the TSC twice, and
abort otherwise.
The simplest solution to enforce consistent time stamp
and memory order, Single-AbortAll, is to allow only a
single transaction access to the TSC and abort all other
overlapping transactions (e.g., by adding a dedicated
memory location TSC_A to each transaction’s read set and
sending out conflicting write probes on an RDTSC(P)).
For example, the execution of Figure 3 would abort TX1
immediately at the acquisition of t2, regardless of TX1’s
content.
A more selective approach is Single-AbortTSC, which
aborts only those live transactions that have accessed or
will eventually access the TSC. That can be achieved by
recording TSC usage, handling the conflict only if a
running transaction has already accessed the TSC, and
letting future RDTSC(P) instructions check whether their
enclosing transactions have seen a remote TSC access.
Compared to Single-AbortAll, the advantage is that
transactions that do not access the TSC need not be
aborted. The execution in Figure 3 would only need
conflict handling when TX1 obtains t1. Had it not
used RDTSCP, it could have continued execution.
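The Single-AbortAll and Single-AbortTSC policies of Section 5.1 can be sketched as a small software model. This is only an illustration under stated assumptions, not a description of a hardware implementation: the names Tx, TscArbiter, rdtscp and run_figure3 are hypothetical, the TSC is modelled as a simple counter, and every conflict is resolved by aborting the affected transaction, which is just one possible resolution policy.

```python
import itertools


class Tx:
    """Hypothetical model of a live transaction (names are illustrative)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.used_tsc = False        # this transaction has read the TSC
        self.saw_remote_tsc = False  # a remote TSC read occurred while live


class TscArbiter:
    """Toy model of the two single-access policies; not a hardware design."""

    def __init__(self, policy):
        assert policy in ("abort_all", "abort_tsc")
        self.policy = policy
        self.live = []
        self._tsc = itertools.count(1)  # stand-in for the real TSC

    def begin(self, name):
        tx = Tx(name)
        self.live.append(tx)
        return tx

    def _abort(self, tx):
        tx.alive = False
        if tx in self.live:
            self.live.remove(tx)

    def end(self, tx):
        """Return True if the transaction commits, False if it was aborted."""
        committed = tx.alive
        if tx in self.live:
            self.live.remove(tx)
        return committed

    def rdtscp(self, tx):
        if not tx.alive:
            return None
        # At most one TSC read per transaction, and no read after a remote
        # TSC access was observed (assumed resolution: self-abort).
        if tx.used_tsc or tx.saw_remote_tsc:
            self._abort(tx)
            return None
        for other in list(self.live):
            if other is tx:
                continue
            if self.policy == "abort_all" or other.used_tsc:
                self._abort(other)            # conflict with a live transaction
            else:
                other.saw_remote_tsc = True   # remember for a later local read
        tx.used_tsc = True
        return next(self._tsc)


def run_figure3(policy):
    """Replay the TSC reads of Figure 3: TX2 reads t2, then TX1 reads t1."""
    arb = TscArbiter(policy)
    tx1, tx2 = arb.begin("TX1"), arb.begin("TX2")
    arb.rdtscp(tx2)                   # t2
    tx1_alive_after_t2 = tx1.alive
    t1 = arb.rdtscp(tx1)              # t1
    return tx1_alive_after_t2, t1, arb.end(tx1), arb.end(tx2)
```

In this model, run_figure3("abort_all") loses TX1 as soon as t2 is acquired, whereas run_figure3("abort_tsc") keeps TX1 alive until it tries to obtain t1 itself.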
5.2 Multiple TSC Accesses
Letting a transaction read from the TSC multiple
times allows it to observe the passage of time.
Similar to Single-AbortTSC, transactions will not abort
all other concurrently running transactions at the first
read of the TSC, but rather ensure that no overlapping
time stamp intervals can form.
In this case, each transaction checks at the second
or later read from the TSC that it has not received any
notifications from remote TSC accesses; otherwise, a
conflict exists. This case needs to be handled by a conflict
resolution policy, for example by self-aborting on the
second local RDTSC(P), or by aborting all other time
stamp-using transactions.
The advantage of this technique is that transactions can
execute in parallel, and each can access the TSC multiple
times, as long as the intervals formed by each transaction’s
first and last accesses to the TSC are disjoint. In the
Figure 2 example, the conflict would be detected at the
read of t2 because of the concurrent TSC read to t. If the
reads for t2 and t had been ordered the opposite
way (and then also t2 < t), no conflict would exist.
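The detection rule for multiple TSC reads can be illustrated with a small software model. It is a sketch under assumptions, not the proposed hardware: the names Tx, IntervalTracker and rdtscp are hypothetical, the TSC is modelled as a counter, and the conflict is resolved by self-aborting on the second local read, which is one of the policies mentioned above.

```python
import itertools


class Tx:
    """Hypothetical transaction record (names are illustrative)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.stamps = []              # time stamps read so far
        self.remote_tsc_seen = False  # notified of a remote TSC read


class IntervalTracker:
    """Toy model: keep the time stamp intervals of live transactions disjoint."""

    def __init__(self):
        self.live = []
        self._tsc = itertools.count(100)  # stand-in for the real TSC

    def begin(self, name):
        tx = Tx(name)
        self.live.append(tx)
        return tx

    def rdtscp(self, tx):
        if not tx.alive:
            return None
        # Second or later read: a remote read since our first read means the
        # two intervals would overlap (assumed resolution: self-abort).
        if tx.stamps and tx.remote_tsc_seen:
            tx.alive = False
            self.live.remove(tx)
            return None
        # Notify every live transaction that already has an open interval.
        for other in self.live:
            if other is not tx and other.stamps:
                other.remote_tsc_seen = True
        stamp = next(self._tsc)
        tx.stamps.append(stamp)
        return stamp


def figure2_overlapping():
    """Reads in the order t1, t, t2 as in Figure 2."""
    tr = IntervalTracker()
    tx1, tx2 = tr.begin("TX1"), tr.begin("TX2")
    t1 = tr.rdtscp(tx1)
    t = tr.rdtscp(tx2)
    t2 = tr.rdtscp(tx1)   # conflict is detected here
    return t1, t, t2, tx1.alive, tx2.alive


def figure2_disjoint():
    """Same reads, but t2 is obtained before t."""
    tr = IntervalTracker()
    tx1, tx2 = tr.begin("TX1"), tr.begin("TX2")
    t1 = tr.rdtscp(tx1)
    t2 = tr.rdtscp(tx1)
    t = tr.rdtscp(tx2)
    return t1, t2, t, tx1.alive, tx2.alive
```

In the overlapping schedule the model refuses the read of t2 and aborts TX1; in the disjoint schedule both transactions stay alive and t2 < t holds.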
Enforcing order: Again, in addition to enforcing
disjoint time stamp intervals, we need to make sure
that these intervals are ordered consistently with all
other orders, for example those imposed through memory
accesses. Take the simple example in Figure 3: so far, the
algorithm does permit the contradicting order. Consistent
ordering between time stamps and memory accesses can
be achieved by ensuring that TX2 commits before TX1,
for example by waiting at TX1’s commit point for a
can-commit message from TX2. These messages do not
need to be broadcast, because TX2 knows that TX1 has
acquired a later time stamp. TX2 can therefore signal
TX1 (and all later transactions) that they must wait for a
commit signal, and later signal when they can commit.
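The commit gating described above can be sketched as follows. The model is illustrative only and rests on assumptions: the names Tx, CommitOrderer and try_commit are hypothetical, the TSC is a simple counter, and memory conflict detection and aborts are deliberately left out so that only the can-commit handshake is shown.

```python
import itertools


class Tx:
    """Hypothetical transaction record (names are illustrative)."""

    def __init__(self, name):
        self.name = name
        self.first_stamp = None
        self.waiting_for = set()  # earlier-stamped transactions not yet committed
        self.successors = []      # later-stamped transactions to signal at commit


class CommitOrderer:
    """Toy model: transactions commit in the order of their first time stamp."""

    def __init__(self):
        self.live = []
        self.commit_log = []
        self._tsc = itertools.count(1)  # stand-in for the real TSC

    def begin(self, name):
        tx = Tx(name)
        self.live.append(tx)
        return tx

    def rdtscp(self, tx):
        stamp = next(self._tsc)
        if tx.first_stamp is None:
            tx.first_stamp = stamp
            for other in self.live:
                # Every live transaction that already holds a stamp is earlier,
                # so it tells tx to wait for its can-commit signal.
                if other is not tx and other.first_stamp is not None:
                    other.successors.append(tx)
                    tx.waiting_for.add(other.name)
        return stamp

    def try_commit(self, tx):
        """Return False while tx must still wait for an earlier transaction."""
        if tx.waiting_for:
            return False
        self.live.remove(tx)
        self.commit_log.append(tx.name)
        for succ in tx.successors:
            succ.waiting_for.discard(tx.name)  # the can-commit signal
        return True


def figure3_commit_order():
    """TX2 obtains t2 before TX1 obtains t1, as in Figure 3."""
    co = CommitOrderer()
    tx1, tx2 = co.begin("TX1"), co.begin("TX2")
    t2 = co.rdtscp(tx2)
    t1 = co.rdtscp(tx1)
    early_attempt = co.try_commit(tx1)  # TX1 has to wait for TX2
    co.try_commit(tx2)
    late_attempt = co.try_commit(tx1)
    return t2 < t1, early_attempt, late_attempt, co.commit_log
```

Signalling only the recorded successors mirrors the observation that the messages need not be broadcast: the earlier transaction already knows which later transactions depend on it.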
Waiting alternatives: Instead of TX1 stalling the CPU
at commit, it may be possible for it to continue execution
of instructions behind the transaction transactionally,
essentially increasing the length of the transaction by
adding the following, non-transactional instructions to
its transactional tail. That way, useful work can be done,
reducing the performance penalty incurred by waiting.
Of course, this may lead to additional conflicts (due
to additional memory accesses adding to the working
set), and additional transactional TSC accesses that need
special handling.
Another option is to put the core in a low-power state
and wait for the special can-commit signal from TX2
(similar to the low-power mode that can be entered with
the MONITOR / MWAIT instruction combination [2]).

5.3 Handling Non-transactional TSC Accesses
STI requires that TSC accesses outside transactions
participate in the conflict detection mechanisms.
However, we may want to treat conflicts with
non-transactional TSC accesses differently to ensure progress
and minimal obstruction for non-transactional code. We
therefore suggest biasing conflict resolution in favour of
non-transactional TSC accesses, by always aborting the
concurrent transactions with which the non-transactional
RDTSC(P) conflicted (instead of retrying / delaying the
non-transactional access).

6 Outlook
Although we believe our examples and implementations
cover all critical interactions, and thus provide
semantics identical to fully sequential mutual exclusion, we
have not yet proven our solutions to be correct or our
restrictions to be minimal. We have started to develop
a formalism, but have not discussed it here because of
space constraints and because it is still premature.
We are particularly interested in the applicability of
our proposed implementations. Clearly, our system is
more permissive than AMD ASF’s strict ban of TSC
reads inside transactions (and more transparent than
Intel’s TSX lock elision), but we still need to abort and
serialise a significant number of transactions. Although
we believe that this is inherent to the limit to which
one can make parallel execution behave similarly to
sequential execution, some applications may just not
care. Automatic inference of such properties seems
very hard; therefore, we have restricted our analysis to
provide as much performance as possible with STI. The
straightforward way out (for us as hardware designers)
is to leave the decision to software, for example by
offering multiple types of transactions and / or time
stamp counters, opening up a new space for looking into
the interactions between different types and exactly how
much each needs to be weakened to provide compelling
performance, while remaining useful for applications.

7 Conclusion
In addition to communication through memory, time
stamps provide another way for parallel code to
coordinate. Therefore, time stamps need to be considered
by systems that control parallel execution, such as
transactional memory and lock elision. Existing critical
sections provide the very strong semantics of strong temporal
isolation, which we have not found to be provided by
most STM and HTM solutions. In this paper, we have
identified the issue of time stamp order and have shown
with multiple examples how applications can observe
time stamps inconsistent with the isolation level or with
the order observed from memory accesses. We drafted
various implementations that will provide strong temporal
isolation without the need to fully serialise execution.

References
[1] Advanced Micro Devices, Inc. AMD Advanced
Synchronization Facility Proposal. Available at
developer.amd.com/CPU/ASF/Pages/default.aspx.
[2] Advanced Micro Devices, Inc. AMD64 Architecture
Programmer’s Manual Volume 3: General-Purpose and
System Instructions, 3.18 edition, March 2012.
[3] Maurice P. Herlihy and J. Eliot B. Moss. Transactional
Memory: Architectural Support for Lock-Free Data
Structures. In Proceedings of the 20th International
Symposium on Computer Architecture, San Diego, Calif.,
May 1993.
[4] Maurice P. Herlihy and Jeannette M. Wing.
Linearizability: a Correctness Condition for Concurrent Objects.
ACM Transactions on Programming Languages and
Systems, 12(3):463–492, 1990.
[5] Intel Corp. Intel(R) Architecture Instruction Set
Extensions Programming Reference, 319433-012a edition,
February 2012.
[6] James R. Larus and Ravi Rajwar. Transactional Memory.
Synthesis Lectures on Computer Architecture. Morgan &
Claypool, 2007.
[7] Vijay Menon, Steven Balensiefer, Tatiana Shpeisman,
Ali-Reza Adl-Tabatabai, Richard Hudson, Bratin Saha,
and Adam Welc. Practical Weak-Atomicity Semantics for
Java STM. In Proceedings of the 20th ACM Symposium
on Parallelism in Algorithms and Architectures, Munich,
Germany, June 2008.
[8] Christos H. Papadimitriou. The Serializability of
Concurrent Database Updates. Journal of the ACM,
26(4):631–653, 1979.
[9] Ravi Rajwar. Personal communication, 2012.
[10] Ravi Rajwar and James R. Goodman. Speculative Lock
Elision: Enabling Highly Concurrent Multithreaded
Execution. In Proceedings of the 34th IEEE/ACM
International Symposium on Microarchitecture, Austin,
Tex., December 2001.