Synchronization Coherence

UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Computer Architecture
ECE 668
Multiprocessor Systems III
Csaba Andras Moritz
Papers:
• Hammond et al., Transactional Memory Coherence and Consistency, at Stanford [ISCA 2004]
• Y. Guo et al., Synchronization Coherence (SyC), at UMass [JPDC 2007]
Textbook
Slides adapted from the authors' groups' papers and focused on class needs
SyC - full Copyright Csaba Andras Moritz 2016
Slides Outline
o Transactional Memory Coherence and Consistency (TCC) –
Synchronization and Coherence with Transactions
o Synchronization Coherence (SyC) - Cache Coherence with
Fine-Grained Synchronization Hardware Primitives
Issues with Parallel Systems
• Cache coherence protocols are complex
– Must track ownership of cache lines
– Difficult to implement and verify all corner cases
• Memory Consistency protocols are complex
– Must provide rules to correctly order individual loads/stores
– Difficult for both hardware and software
• Protocols rely on low latency, not bandwidth
– Critical short control messages on ownership transfers
– Latency of short messages unlikely to scale well in the future
– Bandwidth is likely to scale much better (as discussed in Lecture 1)
• High speed interchip connections
• Multicore (CMP) = on-chip bandwidth
3
Desirable Multiprocessor System
• A shared memory system with
1. A simple, easy programming model (unlike message passing)
2. A simple, low-complexity hardware implementation (unlike conventional shared-memory hardware)
3. Good performance (unlike many protocols that generate excessive bus traffic)
4
Lock Freedom
• Why is using synchronization locks bad?
• Common problems in conventional locking
mechanisms
– Priority inversion: a low-priority process is preempted while holding a lock needed by a high-priority process
– Convoying: a process holding a lock is descheduled (e.g., on a page fault), so no forward progress is made by the other runnable processes waiting on that lock
– Deadlock (or livelock): processes/threads try to lock the same set of objects in different orders (see the sketch below)
• often the result of poorly designed locking by programmers
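A minimal pthreads sketch of the lock-ordering problem above (a hypothetical illustration, not from the papers): each thread grabs its first mutex and then waits for the other's, so the two can deadlock.

/* Deadlock sketch: two threads acquire the same two mutexes in
 * opposite orders; if each gets its first lock, neither progresses. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *thread1(void *arg)
{
    pthread_mutex_lock(&lock_a);      /* holds A ...                    */
    pthread_mutex_lock(&lock_b);      /* ... then waits for B           */
    /* critical section */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

static void *thread2(void *arg)
{
    pthread_mutex_lock(&lock_b);      /* holds B ...                    */
    pthread_mutex_lock(&lock_a);      /* ... then waits for A: deadlock */
    /* critical section */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);           /* may never return */
    pthread_join(t2, NULL);
    puts("no deadlock this run");
    return 0;
}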
5
Using Transactions
• What is a transaction?
– A sequence of instructions that is guaranteed to execute and
complete only as an atomic unit
Begin Transaction
Inst #1
Inst #2
Inst #3
…
End Transaction
– Satisfy the following properties
• Serializability: Transactions appear to execute serially.
• Atomicity (or Failure-Atomicity): A transaction either
– commits changes when complete, visible to all; or
– aborts, discarding changes (will retry again)
6
TCC (Stanford)
• Transactional Coherence and Consistency (TCC)
• Programmer-defined groups of instructions within a program
Begin Transaction        ← start buffering results
Inst #1
Inst #2
Inst #3
…
End Transaction          ← commit results now
• Only commit machine state at the end of each transaction
– Each must update machine state atomically, all at once
– To other processors, all instructions within one transaction appear to
execute only when the transaction commits
– These commits impose an order on how processors may modify
machine state
7
Transaction Code Example
• MIT LTM instruction set
xstart:   XBEGIN on_abort
          lw    r1, 0(r2)
          addi  r1, r1, 1
          ...
          XEND
          ...
on_abort:
          ...              // back off
          j     xstart     // retry
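For comparison, a minimal sketch of the same retry pattern written with x86 RTM intrinsics (Intel TSX) rather than LTM instructions; it assumes a TSX-capable CPU and compilation with -mrtm, and a production version would fall back to a lock after repeated aborts.

/* Same pattern as above, expressed with x86 RTM intrinsics. */
#include <immintrin.h>

void atomic_increment(long *counter)
{
    for (;;) {
        unsigned status = _xbegin();      /* like XBEGIN on_abort     */
        if (status == _XBEGIN_STARTED) {
            (*counter)++;                 /* the lw / addi / sw body  */
            _xend();                      /* like XEND: commit        */
            return;
        }
        _mm_pause();                      /* on_abort: back off ...   */
    }                                     /* ... and retry (j xstart) */
}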
8
Transactional Memory
• Transactions appear to execute in commit order
– A flow (RAW) dependency causes a transaction violation and restart
[Figure: timeline of three concurrent transactions. One transaction (ld 0xdddd ... st 0xbeef) arbitrates and commits, broadcasting its store to 0xbeef. A transaction that only loaded 0xdddd and 0xbbbb is unaffected and commits after arbitration. A transaction that had speculatively loaded 0xbeef detects a violation and re-executes with the new data.]
9
Transactional Memory
• Output and Anti-dependencies are automatically handled
– WAW are handled by writing buffers only in commit order (think about
sequential consistency)
[Figure: Transactions A and B each Store X into their own local buffer; the buffered values reach shared memory only when each transaction commits, in commit order.]
10
Transactional Memory
• Output and Anti-dependencies are automatically handled
– WAW are handled by writing buffers only in commit order
– WAR are handled by keeping all writes private until commit
[Figure: Transaction A executes ST X = 1 and then LD X (=1); Transaction B executes ST X = 3 and then LD X (=3). Each transaction's loads are satisfied from its own local buffer (local stores supply data), so the WAR hazard disappears. At commit, A writes X = 1 and then B writes X = 3 to shared memory, in commit order.]
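A toy software model of the buffering just illustrated (not the TCC hardware; all names are made up): stores stay in a private per-transaction buffer, local loads are satisfied from that buffer, and buffers are drained to shared memory in commit order, so WAW and WAR hazards resolve to the commit order.

/* Toy model: private write buffers applied in commit order. */
#include <stdio.h>

#define MEM_SIZE 4

static int shared_mem[MEM_SIZE];              /* "shared memory"         */

struct txn {
    int has_write[MEM_SIZE];                  /* which words were stored */
    int buffer[MEM_SIZE];                     /* locally buffered values */
};

static void txn_store(struct txn *t, int addr, int val)
{
    t->buffer[addr] = val;                    /* write stays private     */
    t->has_write[addr] = 1;
}

static int txn_load(const struct txn *t, int addr)
{
    /* local stores supply data; otherwise read shared memory */
    return t->has_write[addr] ? t->buffer[addr] : shared_mem[addr];
}

static void txn_commit(const struct txn *t)
{
    for (int a = 0; a < MEM_SIZE; a++)
        if (t->has_write[a])
            shared_mem[a] = t->buffer[a];     /* visible all at once     */
}

int main(void)
{
    struct txn A = {{0}}, B = {{0}};
    txn_store(&A, 0, 1);                      /* Transaction A: ST X = 1 */
    txn_store(&B, 0, 3);                      /* Transaction B: ST X = 3 */
    printf("A sees X=%d, B sees X=%d\n", txn_load(&A, 0), txn_load(&B, 0));
    txn_commit(&A);                           /* commit order: A, then B */
    txn_commit(&B);
    printf("shared X=%d (commit order decides)\n", shared_mem[0]);
    return 0;
}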
11
TCC System
• Similar to prior thread-level speculation (TLS) techniques
– CMU Stampede
– Stanford Hydra
– Wisconsin Multiscalar
– UIUC speculative multithreading CMP
• Loosely coupled TLS system
• Completely eliminates conventional cache coherence and
consistency models
– No MESI-style cache coherence protocol
• But requires new hardware support
12
The TCC Cycle
• Transactions run in a cycle
• Speculatively execute code and
buffer
• Wait for commit permission
– Phase provides synchronization, if
necessary
– Arbitrate with other processors
• Commit stores together (as a packet)
– Provides a well-defined write ordering
– Can invalidate or update other caches
– Large packet utilizes bandwidth
effectively
• And repeat
13
Advantages of TCC
• Trades bandwidth for simplicity and latency tolerance
– Easier to build
– Not dependent on timing/latency of loads and stores
• Transactions eliminate locks
– Transactions are inherently atomic
– Catches most common parallel programming errors
• Shared memory consistency is simplified
– Conventional model sequences individual loads and stores
– Now the hardware only has to sequence transaction commits
• Shared memory coherence is simplified
– Processors may have copies of cache lines in any state (no MESI !)
– Commit order implies an ownership sequence
14
How to Use TCC
• Divide code into potentially parallel tasks (see the sketch after this list)
– Usually loop iterations
– For initial division, tasks = transactions
• But tasks can be subdivided or grouped to match HW limits (buffering)
– Similar to threading in conventional parallel programming, but:
• We do not have to verify parallelism in advance
• Locking is handled automatically
• Easier to get parallel programs running correctly
• Programmer then orders transactions as necessary
– Ordering techniques implemented using phase number
– Deadlock-free (At least one transaction is the oldest one)
– Livelock-free (watchdog HW can easily insert barriers anywhere)
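As an analogy only (OpenMP, not the TCC programming interface), the sketch below shows the "tasks = loop iterations" division; under TCC each iteration would instead be a transaction, and a genuinely conflicting word would simply cause a violation and re-execution rather than requiring the programmer to verify independence in advance.

/* Loop iterations as the unit of parallel work (OpenMP analogy). */
#include <stdio.h>

#define N 1000

int main(void)
{
    static double a[N], b[N];
    for (int i = 0; i < N; i++)
        b[i] = i;

    #pragma omp parallel for                  /* each iteration = one task */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];                    /* independent iterations    */

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}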
15
How to Use TCC
• Three common ordering scenarios
– Unordered for purely parallel tasks
– Fully ordered to specify sequential task (algorithm level)
– Partially ordered to insert synchronization like barriers
16
Basic TCC Transaction Control Bits
• In each local cache
– Read bits (per cache line, or per word to eliminate false sharing)
• Set on speculative loads
• Snooped by a committing transaction (writes by other CPU)
– Modified bits (per cache line)
• Set on speculative stores
• Indicate what to rollback if a violation is detected
• Different from dirty bit
– Renamed bits (optional)
• At word or byte granularity
• To indicate local updates (WAR) that do not cause a violation
• Subsequent reads of words with these bits set do NOT set the read bits, because a local WAR is not considered a violation
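A sketch of the per-line control state just described, assuming 8 words per line; field and function names are illustrative, not taken from the TCC papers.

/* Per-cache-line TCC control bits (illustrative layout). */
#include <stdint.h>

struct tcc_line_state {
    uint8_t read_bits;      /* bit i set: word i was speculatively loaded   */
    uint8_t renamed_bits;   /* bit i set: word i was written locally first,
                               so a later local read sets no read bit       */
    uint8_t modified;       /* line holds speculative stores (for rollback) */
};

/* Another processor's committed write to word w violates this
 * transaction only if the word was speculatively read here. */
static inline int violates(const struct tcc_line_state *s, int w)
{
    return (s->read_bits >> w) & 1;
}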
17
During A Transaction Commit
• Need to collect all of the modified cache lines together into a commit packet
• Potential solutions
– A separate write buffer, or
– An address buffer maintaining a list of the line tags to be committed
– Size?
• Broadcast all writes out as one single (large) packet to the
rest of the system
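One possible layout of a commit packet, assuming the address-buffer variant above, an 8-word (32-byte) line, and a bounded speculative write buffer; all names and sizes are assumptions for illustration.

/* Illustrative commit-packet layout broadcast at transaction commit. */
#include <stdint.h>

#define LINE_BYTES 32
#define MAX_LINES  64                    /* bounded by the write buffer   */

struct commit_line {
    uint64_t tag;                        /* address of the modified line  */
    uint8_t  dirty_words;                /* bit per word: dirty data only */
    uint8_t  data[LINE_BYTES];
};

struct commit_packet {
    uint32_t commit_seq;                 /* position in global commit order */
    uint32_t num_lines;
    struct commit_line lines[MAX_LINES];
};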
18
Re-execute A Transaction
• Rollback is needed when a transaction cannot commit
• Checkpoints needed prior to a transaction
• Checkpoint memory
– Use local cache
– Overflow issue
• Conflict or capacity misses require all the victim lines to be kept
somewhere (e.g. victim cache)
• Checkpoint register state
– Hardware approach: Flash-copying rename table / arch register file
– Software approach: extra instruction overheads
19
Sample TCC Hardware
• Write buffers and L1 Transaction Control Bits
– Write buffer in processor, before broadcast
• A broadcast bus or network to distribute commit packets
– All processors see the commits in a single order
– Snooping on broadcasts triggers violations, if necessary
• Commit arbitration/sequence logic
20
Ideal Speedups with TCC
• equake_l : long transactions
• equake_s : short transactions
21
Speculative Write Buffer Needs
• Only a few KB of write buffering needed
– Set by the natural transaction sizes in applications
– Small write buffer can capture 90% of modified state
22
Broadcast Bandwidth
• Broadcast is bursty
• Average bandwidth
– Needs ~16 bytes/cycle @ 32 processors with whole modified lines
– Needs ~8 bytes/cycle @ 32 processors with dirty data only
• High, but feasible on-chip
23
TCC vs MESI [PACT 2005]
• Application, Protocol + Processor count
24
Further Readings
• M. Herlihy and J. E. B. Moss, “Transactional Memory: Architectural
Support for Lock-Free Data Structures,” ISCA 1993.
• R. Rajwar and J. R. Goodman, “Speculative Lock Elision: Enabling Highly
Concurrent Multithreaded Execution,” MICRO 2001
• R. Rajwar and J. R. Goodman, “Transactional Lock-Free Execution of
Lock-Based Programs,” ASPLOS 2002
• J. F. Martinez and J. Torrellas, “Speculative Synchronization: Applying
Thread-Level Speculation to Explicitly Parallel Applications,” ASPLOS
2002
• L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, “Transactional Memory Coherence and Consistency,” ISCA 2004
• C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie, “Unbounded Transactional Memory,” HPCA 2005
• A. McDonald, J. Chung, H. Chafi, C. C. Minh, B. D. Carlstrom, L. Hammond, C. Kozyrakis, and K. Olukotun, “Characterization of TCC on Chip-Multiprocessors,” PACT 2005
25
Synchronization Coherence
• Synchronization Coherence:
– A Transparent Hardware Mechanism for Cache Coherence and Fine-Grained Synchronization, Yao Guo, Vladimir Vlassov, Raksit Ashok,
Richard Weiss, and Csaba Andras Moritz, Journal of Parallel and
Distributed Computing, 2007
– Paper accessible from my group’s site
• www.ecs.umass.edu/ece/andras – follow the link to Papers
26
Goals of SyC
• Provide transparent fine-grained synchronization and
caching in a multiprocessor machine and single-chip
multiprocessor
• Merges fine-grained synchronization (FGS) mechanisms with the traditional cache coherence protocols we discussed.
– FGS reduces network utilization as well as synchronization-related processing overhead – similar in spirit to dataflow execution in CPUs
– Minimal added hardware complexity as compared to
• cache coherence mechanisms or
• previously reported fine-grained synchronization techniques.
27
Why FGS?
• One popular approach is based on improving the efficiency of traditional coarse-grained synchronization (combined with speculative execution beyond synchronization points).
– Discussed this before
– Scalability limitations in many-cores with potentially 100s of cores
– Speculation may be costly in terms of power consumption!
• Another suggested alternative is fine-grained
synchronization, such as synchronization at a word level.
– Efficient
– But how do you combine it with caching?
• SyC does it transparently
28
MIT Alewife CC-NUMA Machine
• Implements FGS
• Directory-based CC-NUMA multiprocessor with a full/empty
tagged shared memory.
– Hardware support for fine-grained word-level synchronization includes
full/empty bits, special memory operations and a special full/empty trap.
– Software support includes trap handler routines for context switching and
waiting for synchronization.
• While the Alewife architecture supports fine-grained
synchronization and shows demonstrable benefits over a
coarse-grained approach, it still implements synchronization
in a software layer above the cache coherence protocol
layer.
– Keeping the two layers separate entails additional communication overhead.
29
Alewife FGS Synch Structures
• A J-structure is an abstract data type that can be used for producer-consumer process interaction. A J-structure has the semantics of a write-once variable: it can be written only once, and a read from an empty J-structure suspends until the structure is filled. An instance of a J-structure is initially empty. See the paper for more detail.
• A lockable L-structure is an abstract data type with three operations: a non-locking peek (L-peek), a locking read (L-read), and an unlocking write (L-write). An L-peek is similar to a J-read: it waits until the structure is full, and then returns its value without changing the state. An L-write is similar to a J-write: it stores a value to an empty structure, changes its state to full, and releases all pending reads, if any.
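A purely software sketch of J-structure semantics using pthreads instead of Alewife's full/empty bits and traps: the slot is write-once, and a read of an empty slot suspends until it is filled (names are illustrative).

/* Write-once J-structure slot modeled with a mutex and condition variable. */
#include <pthread.h>

struct j_slot {
    pthread_mutex_t lock;
    pthread_cond_t  filled;
    int             full;               /* the full/empty state */
    int             value;
};

#define J_SLOT_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

void j_write(struct j_slot *s, int v)   /* fill once, release pending reads */
{
    pthread_mutex_lock(&s->lock);
    if (!s->full) {                     /* write-once semantics */
        s->value = v;
        s->full  = 1;
        pthread_cond_broadcast(&s->filled);
    }
    pthread_mutex_unlock(&s->lock);
}

int j_read(struct j_slot *s)            /* suspends until the slot is full */
{
    pthread_mutex_lock(&s->lock);
    while (!s->full)
        pthread_cond_wait(&s->filled, &s->lock);
    int v = s->value;
    pthread_mutex_unlock(&s->lock);
    return v;
}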
30
SyC Overview
• Assume a multiprocessor with full/empty tagged shared
memory (FE-memory) where each location (e.g. word) can be
tagged as full or empty (i.e., it has a full/empty bit (FE-bit)
associated with it): if the bit is set the location is full,
otherwise the location is empty.
• In order to provide fine-grained synchronization such as MIT
J-structures and L-structures described above, this
multiprocessor supports special synchronized FE-memory
operations (reads and writes) that can depend on the FE-bit
and can alter the FE-bit in addition to reading or writing the
target location.
• Architecture should include corresponding FE-memory
instructions as well as full/empty conditional branch
instructions.
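A software sketch of what two synchronized FE-memory operations might mean on a tagged word: an altering read that requires "full" and leaves the word empty, and an altering write that requires "empty" and fills it; a return of -1 stands for the synchronization miss that would be deferred (and set the P-bit). Names and the exact operation set are illustrative, not the paper's ISA.

/* Full/empty-tagged word and two altering FE-memory operations. */
#include <stdint.h>

struct fe_word {
    uint32_t value;
    uint8_t  full;          /* the FE-bit: 1 = full, 0 = empty */
};

int fe_read_and_empty(struct fe_word *w, uint32_t *out)
{
    if (!w->full)
        return -1;          /* empty: synchronization miss, read is deferred */
    *out = w->value;
    w->full = 0;            /* altering read: consume and mark empty */
    return 0;
}

int fe_write_and_fill(struct fe_word *w, uint32_t v)
{
    if (w->full)
        return -1;          /* full: synchronization miss, write is deferred */
    w->value = v;
    w->full  = 1;           /* altering write: fill */
    return 0;
}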
31
SyC contd.
• A vector of FE-bits and a vector of P-bits associated with words
in a memory block are stored in the coherence directory and
as an extra field in the cache tag when the block is cached.
– A P-bit indicates whether there are operations (reads or writes) pending on a
corresponding word: if set, it means there are pending synchronized reads (if
the word is empty) or pending writes (if the word is full) for the corresponding
word. This information is required so that a successfully executed altering
operation can immediately satisfy one or more pending operations.
• This way, a tag (directory) lookup includes tag match and
inspection of full/empty bits. Each home node (directory) also
contains a State Miss Buffer (SMB) that holds information
regarding which nodes have pending operations for a given
word and whether operations are altering or not.
• A synchronization miss is treated as a cache miss, and information about the miss is recorded at the cache, e.g., in a Miss Information/Status Holding Register (MSHR) and a write buffer.
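An illustrative layout of the extended directory state and an SMB entry, assuming the 32-byte block with 4 tagged words described on the next slide; field names and widths are assumptions, not taken from the paper.

/* Per-block directory state extended with FE-bit and P-bit vectors. */
#include <stdint.h>

#define WORDS_PER_BLOCK 4

struct syc_dir_entry {
    uint64_t sharers;       /* conventional presence (sharer) vector     */
    uint8_t  state;         /* MESI-style block state                    */
    uint8_t  fe_bits;       /* bit i: word i full (1) or empty (0)       */
    uint8_t  p_bits;        /* bit i: synchronized ops pending on word i */
};

/* One SMB entry: which node waits on which word, and whether the
 * deferred operation will alter the FE-bit when it finally executes. */
struct smb_entry {
    uint16_t node;
    uint8_t  word;
    uint8_t  altering;
};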
32
Architectural Support
• Assumes a processor system with a 32-byte (4-word) block.
• In addition to the 32 data bytes and the tag bits, each cache block has 8 extra bits:
– 4 FE-bits and
– 4 P-bits (8 extra bits per 256-bit data block, i.e., about 3% storage overhead per cache block)
• The cache controller not only matches the tag but also
checks the full/empty bits depending on the instruction, and
makes a decision based on the state of the cache line as
well as the associated FE-bits and P-bits.
• The directory controller is also modified to include the Synch Miss Buffer (SMB), which implements the SyC protocol at the directory.
33
Architectural Support Overview for SyC
[Figure: SyC architectural support – cache organization]
34
Programming Model – FE-Memory Instructions
• Requires compiler support, L/J-structures, or a suitable programming model similar to dataflow; KTH initiated the Extended Dataflow Architecture model
35
Protocol Transactions Support
• Some extensions to MESI are needed for non-ordinary (synchronized) reads and writes.
• See Section 3.3 of the paper if interested.
36
Example Results
37