Hardware Transactional Memory
(Herlihy and Moss, 1993)

Some slides are taken from a presentation by Royi Maimon & Merav Havuv, prepared for a seminar given by Prof. Yehuda Afek.
Outline

Hardware Transactional Memory (HTM)
- Transactions
- Caches and coherence protocols
- General Implementation
- Simulation
What is a transaction?

- A transaction is a sequence of memory loads and stores executed by a single process that either commits or aborts.
- If a transaction commits, all of its loads and stores appear to have executed atomically.
- If a transaction aborts, none of its stores take effect.
- A transaction's operations are not visible to other processes until it commits (if it does).
Transaction properties

A transaction satisfies the following property:

- Atomicity: each transaction either commits (its changes appear to take effect atomically) or aborts (its changes have no effect).
Transactional Memory

- A new multiprocessor architecture.
- The goal: implement non-blocking synchronization that is
  - efficient
  - easy to use
  compared with conventional techniques based on mutual exclusion.
- Implemented by straightforward extensions to multiprocessor cache-coherence protocols and/or by software mechanisms.
A cache is an associative (a.k.a. content-addressable) memory

- Conventional memory: given an address A, it returns the data stored at A.
- Associative memory: given a datum D, it returns an address A such that *A = D.
Cache Associativity (figure)

Fully associative cache (figure)
Cache tags and address structure (figure: main memory and cache)

- Tags are typically the high-order address bits.
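To make this concrete, here is a small C sketch of a fully associative lookup (not from the slides; the number of lines, the line size, and the type names are assumptions): the tag is taken from the high-order address bits and compared against every line.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_LINES   64                 /* assumed number of cache lines */
#define LINE_BYTES  32                 /* assumed line size in bytes    */
#define OFFSET_BITS 5                  /* log2(LINE_BYTES)              */

typedef struct {
    bool     valid;
    uint32_t tag;                      /* high-order address bits       */
    uint8_t  data[LINE_BYTES];
} cache_line;

static cache_line cache[NUM_LINES];

/* Fully associative lookup: any line may hold any block, so the tag is
   compared against all lines (hardware does this in parallel). */
cache_line *cache_lookup(uint32_t address) {
    uint32_t tag = address >> OFFSET_BITS;     /* strip the byte offset */
    for (int i = 0; i < NUM_LINES; i++) {
        if (cache[i].valid && cache[i].tag == tag)
            return &cache[i];                  /* cache hit  */
    }
    return NULL;                               /* cache miss */
}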
Cache-Coherence Protocol

- In multiprocessors, each processor typically has its own local cache memory, in order to:
  - minimize the average latency of memory accesses
  - decrease bus traffic
  - maximize the cache hit ratio
- A cache-coherence protocol manages the consistency of the caches and main memory:
  - shared-memory semantics are maintained
  - the caches and main memory communicate to guarantee coherency
The need to maintain coherency
(Figure taken from the book "Computer Architecture: A Quantitative Approach" by Hennessy and Patterson)

Coherency requirements
(Text taken from the book "Computer Architecture: A Quantitative Approach" by Hennessy and Patterson)
Snoopy Cache

- All caches monitor (snoop) the activity on a global bus/interconnect to determine whether they have a copy of the block of data that is requested on the bus.
Coherence protocol types

- Write-through: the information is written both to the cache block and to the block in the lower-level memory.
- Write-back (widely used): the information is written only to the cache block; the modified cache block is written back to main memory only when it is replaced.
3-state coherence protocol

- Invalid: the cache line/block does not contain legal information.
- Shared: the cache line/block contains information that may be shared by other caches.
- Modified/exclusive: the cache line/block was modified while in the cache and is exclusively owned by the current cache.
Cache-coherency mechanism (write-back) (figure)

Cache-coherency mechanism – state transition diagram
(Figure taken from the book "Computer Architecture: A Quantitative Approach" by Hennessy and Patterson)
MESI protocol (Goodman, 1983)

Cache line status           M (Modified)   E (Exclusive)   S (Shared)   I (Invalid)
Is line valid?              Yes            Yes             Yes          No
Main memory updated?        No             Yes             Yes          --
Other cache copies exist?   No             No              Maybe        --
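As a rough illustration (not part of the original slides; the function names and the simplifications are assumptions), the table can be read as a per-line state machine. A simplified C sketch of the transitions:

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

/* Local processor reads the line. */
mesi_state on_local_read(mesi_state s, int other_copies_exist) {
    if (s == INVALID)                         /* miss: fetch the block  */
        return other_copies_exist ? SHARED : EXCLUSIVE;
    return s;                                 /* hit: state unchanged   */
}

/* Local processor writes the line; SHARED/INVALID lines must first gain
   exclusive ownership (invalidating other copies), then become MODIFIED. */
mesi_state on_local_write(mesi_state s) {
    (void)s;
    return MODIFIED;
}

/* Another processor's read is snooped on the bus: a MODIFIED line is
   written back (or supplied to the requester) and the copy becomes SHARED. */
mesi_state on_snooped_read(mesi_state s) {
    if (s == MODIFIED || s == EXCLUSIVE)
        return SHARED;
    return s;
}

/* Another processor's write (or read-for-ownership) is snooped:
   our copy is now stale. */
mesi_state on_snooped_write(mesi_state s) {
    (void)s;
    return INVALID;
}

Real protocols also handle write-backs and bus upgrades; the sketch only tracks the resulting state of one line.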
HTM-supported API

The following primitive instructions for accessing memory are provided:

- Load-transactional (LT): reads the value of a shared memory location into a private register.
- Load-transactional-exclusive (LTX): like LT, but "hints" that the location is likely to be modified.
- Store-transactional (ST): tentatively writes a value from a private register to a shared memory location.
- Commit (COMMIT): attempts to make the transaction's tentative changes permanent.
- Abort (ABORT): discards the transaction's tentative changes.
- Validate (VALIDATE): tests the current transaction status.
Some definitions

- Read set: the set of locations read by LT instructions issued by a transaction.
- Write set: the set of locations accessed by LTX or ST instructions issued by a transaction.
- Data set (footprint): the union of the read and write sets.
- A set of values in memory is inconsistent if it could not have been produced by any serial execution of transactions.
Intended Use

Instead of acquiring a lock, executing the critical section, and releasing the lock, a process would:

1. use LT or LTX to read from a set of locations,
2. use VALIDATE to check that the values read are consistent,
3. use ST to modify a set of locations,
4. use COMMIT to make the changes permanent.

If either the VALIDATE or the COMMIT fails, the process returns to step (1). A sketch of this pattern follows below.
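A minimal sketch of this pattern in C, assuming C-level bindings for the primitives listed earlier (the extern declarations, the Word type, and the shared_counter example are illustrative assumptions, not from the paper):

#include <stdbool.h>

typedef unsigned long Word;

/* Assumed compiler/hardware bindings for the HTM instructions. */
extern Word LT(Word *addr);              /* Load-transactional            */
extern Word LTX(Word *addr);             /* Load-transactional-exclusive  */
extern void ST(Word *addr, Word value);  /* Store-transactional           */
extern bool COMMIT(void);                /* try to commit; true = success */
extern bool VALIDATE(void);              /* is the transaction still ok?  */

Word shared_counter;

void increment_counter(void) {
    while (1) {
        Word v = LTX(&shared_counter);   /* 1. read (will be modified)       */
        if (!VALIDATE())                 /* 2. values read are consistent?   */
            continue;                    /*    no: start over                */
        ST(&shared_counter, v + 1);      /* 3. tentative write               */
        if (COMMIT())                    /* 4. make the changes permanent    */
            break;                       /*    commit failed: retry          */
    }
}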
Implementation

- Hardware transactional memory is implemented by modifying standard multiprocessor cache-coherence protocols.
- Herlihy and Moss suggested extending a “snoopy” cache protocol for a shared bus to support transactional memory.
- It supports short-lived transactions with a relatively small data set.
The basic idea

- Any protocol capable of detecting access conflicts can also detect transaction conflicts at no extra cost.
- Once a transaction conflict is detected, it can be resolved in a variety of ways.
Implementation

- Each processor maintains two caches:
  - a regular cache for non-transactional operations,
  - a transactional cache: a small, fully associative cache for transactional operations. It holds all the tentative writes, without propagating them to other processors or to main memory until commit.
- An entry may reside in one cache or the other, but not in both.
Cache line states

- Each cache line (regular or transactional) has one of the coherence-protocol states described earlier (e.g. Invalid, Shared, Modified/Exclusive).
- Each transactional cache line has, in addition, one of the following transactional tags:
  - TC_INVALID: the entry is empty
  - TC_NORMAL: the entry holds ordinary (committed) data
  - TC_ABORT: the entry holds a tentative “new” value, discarded if the transaction aborts
  - TC_COMMIT: the entry holds the “old” value, discarded when the transaction commits
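As a data-structure sketch (the type and field names are assumptions, not from the slides), a transactional cache entry could carry both kinds of state:

typedef enum { INVALID, SHARED, MODIFIED_EXCLUSIVE } coherence_state;
typedef enum { TC_INVALID, TC_NORMAL, TC_ABORT, TC_COMMIT } tc_tag;

#define LINE_WORDS 8                   /* assumed line size in words */

typedef struct {
    coherence_state state;             /* regular coherence-protocol state */
    tc_tag          tag;               /* transactional tag                */
    unsigned long   block_address;     /* which memory block is cached     */
    unsigned long   data[LINE_WORDS];  /* the cached values                */
} tc_line;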
Cleanup

- When the transactional cache needs space for a new entry, it searches for:
  - a TC_INVALID entry,
  - if none, a TC_NORMAL entry,
  - and finally a TC_COMMIT entry (why can such entries be replaced?)
- A sketch of this search order appears below.
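A small C sketch of the search order (the function name and the representation of the tags as an array are assumptions). One plausible answer to the slide's question: a TC_COMMIT entry only holds an old value, which can be written back to memory if needed, while the transaction's tentative value still lives in its tc_ABORT counterpart, so evicting it loses nothing.

typedef enum { TC_INVALID, TC_NORMAL, TC_ABORT, TC_COMMIT } tc_tag;

/* Returns the index of the entry to evict, or -1 if only tc_ABORT
   entries remain (the transaction's footprint no longer fits). */
int find_victim(const tc_tag *tags, int n) {
    const tc_tag order[] = { TC_INVALID, TC_NORMAL, TC_COMMIT };
    for (int p = 0; p < 3; p++)
        for (int i = 0; i < n; i++)
            if (tags[i] == order[p])
                return i;
    return -1;
}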
Processor actions

- Each processor maintains two flags:
  - The transaction active (TACTIVE) flag: indicates whether a transaction is in progress.
  - The transaction status (TSTATUS) flag: indicates whether that transaction is active (True) or aborted (False).
- Non-transactional operations behave exactly as in the original cache-coherence protocol.
Example – LT operation:

1. Look for a tc_ABORT entry in the transactional cache. If found, return its value.
2. If not found, look for a tc_NORMAL entry. If found, change it to tc_ABORT, allocate another tc_COMMIT entry with the same value, and return the value.
3. If not found either (cache miss), ask to read this block from the shared memory:
   - On a successful read, create two entries, tc_ABORT and tc_COMMIT, and return the value.
   - On a BUSY signal, abort the transaction: set TSTATUS=FALSE, drop all tc_ABORT entries, and set all tc_COMMIT entries to tc_NORMAL.
Snoopy cache actions:

- Both the regular cache and the transactional cache snoop on the bus.
- A cache ignores any bus cycles for lines not in that cache.
- The transactional cache's behavior:
  - If TSTATUS=False, or if the operation isn't transactional, it acts just like the regular cache, but ignores entries whose state is other than TC_NORMAL.
  - Otherwise, on an LT by another CPU, if the state is TC_NORMAL or the line has not been written to, the cache returns the value; in all other cases it returns BUSY.
Committing/aborting a transaction

- Upon commit:
  - set all entries tagged TC_COMMIT to TC_INVALID
  - set all entries tagged TC_ABORT to TC_NORMAL
- Upon abort:
  - set all entries tagged TC_ABORT to TC_INVALID
  - set all entries tagged TC_COMMIT to TC_NORMAL
- Since the transactional cache is small, it is assumed that these operations can be done in parallel.
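A C sketch of these tag changes (a sequential loop stands in for the parallel update the slide assumes; the function name and array representation are mine):

typedef enum { TC_INVALID, TC_NORMAL, TC_ABORT, TC_COMMIT } tc_tag;

void finish_transaction(tc_tag *tags, int n, int committed) {
    for (int i = 0; i < n; i++) {
        if (committed) {
            if (tags[i] == TC_COMMIT)       /* old values no longer needed     */
                tags[i] = TC_INVALID;
            else if (tags[i] == TC_ABORT)   /* new values become ordinary data */
                tags[i] = TC_NORMAL;
        } else {
            if (tags[i] == TC_ABORT)        /* tentative new values are dropped */
                tags[i] = TC_INVALID;
            else if (tags[i] == TC_COMMIT)  /* old values are kept              */
                tags[i] = TC_NORMAL;
        }
    }
}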
Simulation

- We'll see example code for the producer/consumer algorithm using the transactional memory architecture.
- The simulation runs on both cache-coherence protocols: snoopy cache and directory-based.
- The simulation uses 32 processors.
- The simulation finishes when 2^16 operations have completed.
Part Of Producer/Consumer Code

typedef struct {
    Word deqs;                 /* Holds the head's index */
    Word enqs;                 /* Holds the tail's index */
    Word items[QUEUE_SIZE];
} queue;

unsigned queue_deq(queue *q) {
    unsigned head, tail, result;
    unsigned backoff = BACKOFF_MIN;
    unsigned wait;

    while (1) {
        result = QUEUE_EMPTY;
        tail = LTX(&q->enqs);
        head = LTX(&q->deqs);
        if (head != tail) {                              /* queue not empty? */
            result = LT(&q->items[head % QUEUE_SIZE]);
            ST(&q->deqs, head + 1);                      /* advance counter  */
        }
        if (COMMIT()) break;
        /* abort => backoff */
        wait = random() % (1 << backoff);
        while (wait--);
        if (backoff < BACKOFF_MAX)
            backoff++;
    }
    return result;
}
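For completeness, a hedged sketch of the matching enqueue operation written in the same style, relying on the declarations in the slide's code above; it is not shown on the slide, and QUEUE_FULL and QUEUE_OK are assumed return codes:

unsigned queue_enq(queue *q, Word value) {
    unsigned head, tail, result;
    unsigned backoff = BACKOFF_MIN;
    unsigned wait;

    while (1) {
        result = QUEUE_FULL;
        head = LT(&q->deqs);                             /* only read        */
        tail = LTX(&q->enqs);                            /* will be modified */
        if (tail - head < QUEUE_SIZE) {                  /* queue not full?  */
            ST(&q->items[tail % QUEUE_SIZE], value);     /* store the item   */
            ST(&q->enqs, tail + 1);                      /* advance counter  */
            result = QUEUE_OK;
        }
        if (COMMIT()) break;
        /* abort => backoff, as in queue_deq above */
        wait = random() % (1 << backoff);
        while (wait--);
        if (backoff < BACKOFF_MAX)
            backoff++;
    }
    return result;
}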
The results

(Figures: results for the snoopy cache and for directory-based coherency)
Key Limitations:

- Transaction size is limited by the cache size.
- Transaction length is effectively limited by the scheduling quantum.
- Process migration is problematic.
Process migration problematic