UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Computer Architecture ECE 668
Multiprocessor Systems III
Csaba Andras Moritz

Papers:
• Hammond et al., Transactional Memory Coherence and Consistency, Stanford [ISCA 2004]
• Y. Guo et al., Synchronization Coherence (SyC), UMass [JPDC 2007]
Textbook slides adapted from the authors'/groups' papers and focused on class needs; SyC slides – full copyright Csaba Andras Moritz 2016.

Outline
o Transactional Memory Coherence and Consistency (TCC) – Synchronization and Coherence with Transactions
o Synchronization Coherence (SyC) – Cache Coherence with Fine-Grained Synchronization Hardware Primitives

Issues with Parallel Systems
• Cache coherence protocols are complex
  – Must track ownership of cache lines
  – Difficult to implement and verify all corner cases
• Memory consistency protocols are complex
  – Must provide rules to correctly order individual loads/stores
  – Difficult for both hardware and software
• Protocols rely on low latency, not bandwidth
  – Critical short control messages on ownership transfers
  – Latency of short messages is unlikely to scale well in the future
  – Bandwidth is likely to scale much better (as discussed in L1)
    • High-speed inter-chip connections
    • Multicore (CMP) = on-chip bandwidth
3

Desirable Multiprocessor System
• A shared memory system with
  1. A simple, easy programming model (unlike message passing)
  2. A simple, low-complexity hardware implementation (unlike shared memory)
  3. Good performance (unlike many protocols that generate bus transactions)
4

Lock Freedom
• Why is using synchronization locks bad?
• Common problems in conventional locking mechanisms
  – Priority inversion: a low-priority process is preempted while holding a lock needed by a high-priority process
  – Convoying: when a process holding a lock is descheduled (e.g., on a page fault), there is no forward progress for other processes capable of running
  – Deadlock (or livelock): processes/threads try to lock the same set of objects in different orders
    • could be caused by poor programmer design
5

Using Transactions
• What is a transaction?
  – A sequence of instructions that is guaranteed to execute and complete only as an atomic unit
      Begin Transaction
        Inst #1
        Inst #2
        Inst #3
        …
      End Transaction
  – Satisfies the following properties
    • Serializability: transactions appear to execute serially
    • Atomicity (or failure-atomicity): a transaction either
      – commits its changes when complete, visible to all; or
      – aborts, discarding its changes (and will retry)
6

TCC (Stanford)
• Transactional Coherence and Consistency (TCC)
• Programmer-defined groups of instructions within a program
      Begin Transaction    ← start buffering results
        Inst #1
        Inst #2
        Inst #3
        …
      End Transaction      ← commit results now
• Only commit machine state at the end of each transaction
  – Each transaction must update machine state atomically, all at once
  – To other processors, all instructions within one transaction appear to execute only when the transaction commits
  – These commits impose an order on how processors may modify machine state
7

Transaction Code Example
• MIT LTM instruction set

    xstart: XBEGIN on_abort
            lw   r1, 0(r2)
            addi r1, r1, 1
            ...
            XEND
            ...
    on_abort:
            ...              // back off
            j    xstart      // retry
8

Transactional Memory
• Transactions appear to execute in commit order
  – Flow (RAW) dependences cause transaction violation and restart
[Figure: three transactions over time. Transaction A executes "ld 0xdddd … st 0xbeef", arbitrates, and commits; Transaction B executes "ld 0xdddd … ld 0xbbbb", arbitrates, and commits; Transaction C has already executed "ld 0xbeef", so A's committed store to 0xbeef triggers a violation and C re-executes "ld 0xbeef" with the new data.]
9
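To make the XBEGIN/XEND abort-and-retry pattern of the last two slides concrete, here is a minimal C sketch of a transactional counter increment. The tm_begin(), tm_end(), and tm_backoff() declarations are hypothetical stand-ins for hardware TM instructions (in LTM, XBEGIN with an abort handler and XEND); this is only an illustration of the retry-loop structure, not a real API.

    #include <stdint.h>

    /* Hypothetical intrinsics standing in for XBEGIN/XEND-style instructions.
       tm_begin() is assumed to return nonzero on first entry and zero when
       control comes back here after an abort. */
    extern int  tm_begin(void);          /* start buffering speculative state  */
    extern void tm_end(void);            /* arbitrate and commit the buffer    */
    extern void tm_backoff(unsigned n);  /* wait a bit before retrying         */

    static volatile uint64_t shared_counter;

    void counter_increment(void)
    {
        unsigned attempts = 0;
        for (;;) {
            if (tm_begin()) {            /* speculative region: loads set read  */
                shared_counter += 1;     /* bits, stores stay in a local buffer */
                tm_end();                /* commit broadcasts the buffered store */
                return;
            }
            /* Another processor committed a write to a line we read:
               violation, so back off and re-execute the transaction. */
            tm_backoff(++attempts);
        }
    }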
Transactional Memory
• Output and anti-dependences are automatically handled
  – WAW is handled by writing buffers to memory only in commit order (think of sequential consistency)
[Figure: Transactions A and B each Store X into their own local buffer; the buffers Commit X to shared memory in commit order.]
10

Transactional Memory
• Output and anti-dependences are automatically handled
  – WAW is handled by writing buffers only in commit order
  – WAR is handled by keeping all writes private until commit (a software sketch of this buffering follows the Advantages of TCC slide below)
[Figure: Transaction A executes ST X = 1 then LD X (=1); Transaction B executes ST X = 3 then LD X (=3). Each load is supplied from its own local buffer (local stores supply data), and shared memory sees X=1 then X=3 as the buffers commit in order.]
11

TCC System
• Similar to prior thread-level speculation (TLS) techniques
  – CMU Stampede
  – Stanford Hydra
  – Wisconsin Multiscalar
  – UIUC speculative multithreading CMP
• Loosely coupled TLS system
• Completely eliminates conventional cache coherence and consistency models
  – No MESI-style cache coherence protocol
• But requires new hardware support
12

The TCC Cycle
• Transactions run in a cycle
• Speculatively execute code and buffer results
• Wait for commit permission
  – This phase provides synchronization, if necessary
  – Arbitrate with other processors
• Commit stores together (as a packet)
  – Provides a well-defined write ordering
  – Can invalidate or update other caches
  – A large packet utilizes bandwidth effectively
• And repeat
13

Advantages of TCC
• Trades bandwidth for simplicity and latency tolerance
  – Easier to build
  – Not dependent on timing/latency of loads and stores
• Transactions eliminate locks
  – Transactions are inherently atomic
  – Catches most common parallel programming errors
• Shared memory consistency is simplified
  – The conventional model sequences individual loads and stores
  – Now the hardware only has to sequence transaction commits
• Shared memory coherence is simplified
  – Processors may have copies of cache lines in any state (no MESI!)
  – Commit order implies an ownership sequence
14
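As a rough software analogy for the private write buffers on slides 10–11 (writes kept private until commit, local stores supplying data to local loads), here is a C sketch of one processor's speculative buffer. The names wb_store, wb_load, and wb_commit are invented for this sketch; in TCC this logic lives in the L1/write-buffer hardware, arbitration is not shown, and overflow handling is omitted.

    #include <stddef.h>
    #include <stdint.h>

    #define WB_ENTRIES 64                 /* arbitrary size for the sketch      */

    struct wb_entry { uintptr_t addr; uint32_t data; };

    struct write_buffer {
        struct wb_entry e[WB_ENTRIES];
        size_t n;
    };

    /* Speculative store: kept private until commit (hides WAR from others). */
    static void wb_store(struct write_buffer *wb, uintptr_t addr, uint32_t data)
    {
        for (size_t i = 0; i < wb->n; i++)
            if (wb->e[i].addr == addr) { wb->e[i].data = data; return; }
        wb->e[wb->n++] = (struct wb_entry){ addr, data };  /* no overflow check */
    }

    /* Speculative load: local stores supply data first, else shared memory. */
    static uint32_t wb_load(const struct write_buffer *wb, uintptr_t addr)
    {
        for (size_t i = wb->n; i-- > 0; )
            if (wb->e[i].addr == addr) return wb->e[i].data;
        return *(volatile uint32_t *)addr;
    }

    /* Commit: after winning arbitration, drain the whole buffer as one packet.
       Because buffers drain only in commit order, WAW resolves to commit order. */
    static void wb_commit(struct write_buffer *wb)
    {
        for (size_t i = 0; i < wb->n; i++)
            *(volatile uint32_t *)wb->e[i].addr = wb->e[i].data;
        wb->n = 0;
    }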
How to Use TCC
• Divide code into potentially parallel tasks
  – Usually loop iterations
  – For the initial division, tasks = transactions
    • But transactions can be subdivided or grouped to match HW limits (buffering)
  – Similar to threading in conventional parallel programming, but:
    • We do not have to verify parallelism in advance
    • Locking is handled automatically
    • Easier to get parallel programs running correctly
• The programmer then orders transactions as necessary
  – Ordering techniques are implemented using phase numbers
  – Deadlock-free (at least one transaction is always the oldest)
  – Livelock-free (watchdog HW can easily insert barriers anywhere)
15

How to Use TCC
• Three common ordering scenarios
  – Unordered for purely parallel tasks
  – Fully ordered to specify sequential tasks (algorithm level)
  – Partially ordered to insert synchronization such as barriers
16

Basic TCC Transaction Control Bits
• In each local cache
  – Read bits (per cache line, or per word to eliminate false sharing)
    • Set on speculative loads
    • Snooped by a committing transaction (writes by another CPU)
  – Modified bits (per cache line)
    • Set on speculative stores
    • Indicate what to roll back if a violation is detected
    • Different from the dirty bit
  – Renamed bits (optional)
    • At word or byte granularity
    • Indicate local updates (WAR) that do not cause a violation
    • Subsequent reads of words with these bits set do NOT set read bits, because a local WAR is not considered a violation
17

During A Transaction Commit
• Need to collect all of the modified cache lines together into a commit packet
• Potential solutions
  – A separate write buffer, or
  – An address buffer maintaining a list of the line tags to be committed
  – Size?
• Broadcast all writes out as one single (large) packet to the rest of the system
18

Re-execute A Transaction
• Rollback is needed when a transaction cannot commit
• Checkpoints are needed prior to a transaction
• Checkpoint memory
  – Use the local cache
  – Overflow issue
    • Conflict or capacity misses require all the victim lines to be kept somewhere (e.g., a victim cache)
• Checkpoint register state
  – Hardware approach: flash-copying the rename table / architectural register file
  – Software approach: extra instruction overhead
19

Sample TCC Hardware
• Write buffers and L1 transaction control bits
  – Write buffer in the processor, before broadcast
• A broadcast bus or network to distribute commit packets
  – All processors see the commits in a single order
  – Snooping on broadcasts triggers violations, if necessary (a simplified sketch of this check follows the TCC vs MESI slide below)
• Commit arbitration/sequence logic
20

Ideal Speedups with TCC
• equake_l: long transactions
• equake_s: short transactions
21

Speculative Write Buffer Needs
• Only a few KB of write buffering is needed
  – Set by the natural transaction sizes in applications
  – A small write buffer can capture 90% of modified state
22

Broadcast Bandwidth
• Broadcast is bursty
• Average bandwidth
  – Needs ~16 bytes/cycle at 32 processors with whole modified lines
  – Needs ~8 bytes/cycle at 32 processors with dirty data only
• High, but feasible on-chip
23

TCC vs MESI [PACT 2005]
• Application, protocol + processor count
24
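Before the reading list, the C sketch below ties together the per-line control bits of slide 17 and the commit snooping of slide 20: speculative loads set read bits (unless the word was locally renamed), speculative stores mark the line for rollback and rename the word, and a committed remote write that hits a word with its read bit set signals a violation. The struct and function names are invented for illustration; real TCC keeps these bits in the L1 tags and performs the check in hardware.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_WORDS 8                 /* assumed words per line for the sketch */

    struct tcc_line {
        uintptr_t tag;
        bool read_bits[LINE_WORDS];      /* set by speculative loads              */
        bool modified;                   /* set by speculative stores (rollback)  */
        bool renamed_bits[LINE_WORDS];   /* word holds locally produced data      */
    };

    /* Speculative load: a word already renamed by a local store does not set its
       read bit, because a later remote write to it is a local WAR, not a violation. */
    static void tcc_load_word(struct tcc_line *l, int w)
    {
        if (!l->renamed_bits[w])
            l->read_bits[w] = true;
    }

    /* Speculative store: mark the line for rollback and rename the word, so that
       subsequent local reads of it do not create a dependence on other commits. */
    static void tcc_store_word(struct tcc_line *l, int w)
    {
        l->modified = true;
        l->renamed_bits[w] = true;
    }

    /* Snoop one word of another processor's commit packet.  A violation occurs
       when the committed write hits a word this transaction has already read.  */
    static bool tcc_snoop_commit(const struct tcc_line *l, uintptr_t tag, int w)
    {
        return l->tag == tag && l->read_bits[w];  /* true => abort, re-execute */
    }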
Further Readings
• M. Herlihy and J. E. B. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures," ISCA 1993.
• R. Rajwar and J. R. Goodman, "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution," MICRO 2001.
• R. Rajwar and J. R. Goodman, "Transactional Lock-Free Execution of Lock-Based Programs," ASPLOS 2002.
• J. F. Martinez and J. Torrellas, "Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications," ASPLOS 2002.
• L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, "Transactional Memory Coherence and Consistency," ISCA 2004.
• C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie, "Unbounded Transactional Memory," HPCA 2005.
• A. McDonald, J. Chung, H. Chafi, C. C. Minh, B. D. Carlstrom, L. Hammond, C. Kozyrakis, and K. Olukotun, "Characterization of TCC on Chip-Multiprocessors," PACT 2005.
25

Synchronization Coherence
• Synchronization Coherence:
  – A Transparent Hardware Mechanism for Cache Coherence and Fine-Grained Synchronization, Yao Guo, Vladimir Vlassov, Raksit Ashok, Richard Weiss, and Csaba Andras Moritz, Journal of Parallel and Distributed Computing, 2007
  – Paper accessible from my group's site
    • www.ecs.umass.edu/ece/andras – follow the link to papers
26

Goals of SyC
• Provide transparent fine-grained synchronization and caching in a multiprocessor machine and a single-chip multiprocessor
• Merges fine-grained synchronization (FGS) mechanisms with the traditional cache coherence protocols we discussed
  – FGS reduces network utilization as well as synchronization-related processing overheads – similar to dataflow execution in CPUs
  – Minimal added hardware complexity as compared to
    • cache coherence mechanisms or
    • previously reported fine-grained synchronization techniques
27

Why FGS?
• One popular approach is based on improving the efficiency of traditional coarse-grained synchronization (and on speculative execution beyond synchronization points)
  – Discussed this before
  – Scalability limitations in many-cores with potentially 100s of cores
  – Speculation may be costly in power consumption!
• Another suggested alternative is fine-grained synchronization, such as synchronization at the word level
  – Efficient
  – But how do you combine it with caching?
    • SyC does it transparently
28

MIT Alewife CC-NUMA Machine
• Implements FGS
• Directory-based CC-NUMA multiprocessor with a full/empty tagged shared memory
  – Hardware support for fine-grained word-level synchronization includes full/empty bits, special memory operations, and a special full/empty trap
  – Software support includes trap handler routines for context switching and waiting for synchronization
• While the Alewife architecture supports fine-grained synchronization and shows demonstrable benefits over a coarse-grained approach, it still implements synchronization in a software layer above the cache coherence protocol layer
  – Keeping the two layers separate entails additional communication overhead
29

Alewife FGS Synch Structures
• A J-structure is an abstract data type that can be used for producer-consumer process interaction. A J-structure has the semantics of a write-once variable: it can be written only once, and a read from an empty J-structure suspends until the structure is filled. An instance of a J-structure is initially empty. See more in the paper.
• A lockable L-structure is an abstract data type with three operations: a non-locking peek (L-peek), a locking read (L-read), and an unlocking write (L-write). An L-peek is similar to a J-read: it waits until the structure is full and then returns its value without changing the state. An L-write is similar to a J-write: it stores the value to an empty structure, changes its state to full, and releases all pending J-reads, if any.
30
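To make the J-structure semantics above concrete, here is a small software emulation in C that uses a mutex and condition variable in place of the hardware full/empty bit and trap handler. It only mirrors the behavior (a J-read suspends on an empty structure; a J-write fills it once and releases all waiters); Alewife implements the waiting through its full/empty trap and handler routines, not through pthreads, and the names below are invented for this sketch.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Software stand-in for a word with a full/empty bit. */
    struct j_structure {
        pthread_mutex_t lock;
        pthread_cond_t  filled;
        bool            full;        /* the emulated FE-bit                    */
        uint32_t        value;
    };

    void j_init(struct j_structure *j)
    {
        pthread_mutex_init(&j->lock, NULL);
        pthread_cond_init(&j->filled, NULL);
        j->full = false;             /* J-structures start empty               */
    }

    /* J-read: suspends until the structure is full, then returns the value.   */
    uint32_t j_read(struct j_structure *j)
    {
        pthread_mutex_lock(&j->lock);
        while (!j->full)
            pthread_cond_wait(&j->filled, &j->lock);
        uint32_t v = j->value;
        pthread_mutex_unlock(&j->lock);
        return v;
    }

    /* J-write: write-once; stores the value, sets full, releases pending reads. */
    void j_write(struct j_structure *j, uint32_t v)
    {
        pthread_mutex_lock(&j->lock);
        j->value = v;
        j->full  = true;
        pthread_cond_broadcast(&j->filled);
        pthread_mutex_unlock(&j->lock);
    }

A consumer that calls j_read() before any producer has called j_write() simply blocks until the structure is filled, which is exactly the producer-consumer handoff described above.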
SyC Overview
• Assume a multiprocessor with full/empty tagged shared memory (FE-memory), where each location (e.g., a word) can be tagged as full or empty, i.e., it has a full/empty bit (FE-bit) associated with it: if the bit is set the location is full, otherwise the location is empty.
• In order to provide fine-grained synchronization such as the MIT J-structures and L-structures described above, this multiprocessor supports special synchronized FE-memory operations (reads and writes) that can depend on the FE-bit and can alter the FE-bit in addition to reading or writing the target location.
• The architecture should include corresponding FE-memory instructions as well as full/empty conditional branch instructions.
31

SyC contd.
• A vector of FE-bits and a vector of P-bits associated with the words in a memory block are stored in the coherence directory, and as an extra field in the cache tag when the block is cached.
  – A P-bit indicates whether there are operations (reads or writes) pending on the corresponding word: if set, there are pending synchronized reads (if the word is empty) or pending writes (if the word is full) for that word. This information is required so that a successfully executed altering operation can immediately satisfy one or more pending operations.
• This way, a tag (directory) lookup includes both the tag match and inspection of the full/empty bits. Each home node (directory) also contains a Synch Miss Buffer (SMB) that holds information regarding which nodes have pending operations for a given word and whether those operations are altering or not.
• A synchronization miss is treated as a cache miss, and information on the miss is recorded at the cache, e.g., in a Miss Information/Status Holding Register (MSHR) and a write buffer.
32

Architectural Support
• Assumes a processor system with a 32-byte (4-word) block.
• In addition to the 32-byte block and the tag bits, each cache block has 8 extra bits:
  – 4 FE-bits and
  – 4 P-bits (about 3% storage overhead per cache block)
• The cache controller not only matches the tag but also checks the full/empty bits depending on the instruction, and makes a decision based on the state of the cache line as well as the associated FE-bits and P-bits (see the sketch at the end of these slides).
• The directory controller is also modified to add the Synch Miss Buffer (SMB) implementing the SyC protocol in the directory.
33

Architectural Support Overview for SyC Cache
34

Programming Model – FE-M Instructions
• Requires compiler support, L/J-structures, or a suitable programming model similar to dataflow; KTH initiated the Extended Dataflow Architecture model
35

Protocol Transactions Support
• Some extensions to MESI for non-ordinary reads and writes. See Section 3.3 of the paper if interested.
36

Example Results
37
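As a closing illustration of the per-word FE-bit/P-bit lookup described on the Architectural Support slides, the C sketch below models the decision a SyC cache controller might make on a synchronized consuming read (in the spirit of an L-read). All names are invented for the sketch, and the SMB/MSHR bookkeeping is reduced to setting a single pending bit.

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_WORDS 4                  /* 32-byte block, 4 words, as above   */

    struct syc_block {
        uintptr_t tag;
        bool      valid;
        bool      fe[BLOCK_WORDS];         /* full/empty bit per word            */
        bool      p[BLOCK_WORDS];          /* pending-operation bit per word     */
        uint32_t  word[BLOCK_WORDS];
    };

    enum syc_result { SYC_HIT, SYC_CACHE_MISS, SYC_SYNC_MISS };

    /* Synchronized consuming read: succeeds only if the block is present AND the
       word is full; on success it empties the word.  An empty word is treated
       like a miss: the P-bit is set and the miss would be recorded (MSHR/SMB). */
    enum syc_result syc_sync_read(struct syc_block *b, uintptr_t tag, int w,
                                  uint32_t *out)
    {
        if (!b->valid || b->tag != tag)
            return SYC_CACHE_MISS;         /* ordinary coherence miss            */
        if (!b->fe[w]) {
            b->p[w] = true;                /* remember a reader is waiting       */
            return SYC_SYNC_MISS;          /* handled like a cache miss          */
        }
        *out     = b->word[w];
        b->fe[w] = false;                  /* consuming read empties the word    */
        /* if b->p[w] were set for a pending write, it could now be satisfied    */
        return SYC_HIT;
    }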