Foundations: Shared-Memory Systems and Cache Coherence
6.173 Fall 2010, Agarwal

What is the meaning of shared memory when you have multiple access ports into global memory? And what if you have caches?

[Figure: processors Pa, Pb, and Pc issue interleaved reads and writes (wa3, ra2, ra1, wb3, wb2, rb1, rc4, rc3, wc2, wc1) against a single memory.]

Sequential consistency: the final state of memory is as if all reads and writes were executed in some fixed serial order, with each processor's own program order also maintained → Lamport. [This notion borrows from similar notions of sequential consistency in transaction-processing systems.]

A hardware designer's physical perspective of sequential consistency

[Figure: memory holds foo1..foo4, whose home is in memory; each processor's cache holds copies of foo1..foo4. To make writes visible, flush foo* from the cache and wait till the flush is done.]

Does it always work? Key: using a fence to wait until the flush is done is the key mechanism that guarantees sequential consistency. We will revisit this in more detail shortly.

One other cache nasty to watch out for

[Figure: one cache line holds both foo1 and xxx. One processor overwrites xxx with yyy and flushes; another processor flushes foo*, writing back its stale copy of the whole line. Correct final value: foo1 yyy. Wrong final value: foo1 xxx.]

• The problem is called "false sharing"
• Leads to bugs with software coherence
• Leads to poor performance with hardware coherence
• Solutions? Pad shared data structures so multiple shared items do not fall into the same cache line

Summary of New Multicore Instructions
• Send message
• Receive message
• Synchronization
  – Barrier
  – Test-and-set
  – Fetch-and-add and relatives (e.g., F&Op, CmpXch)
• Flush cache line
• Memory fence

Recall the shared-memory algorithmic model: processors all read and write a single shared memory.

Outline
• Memory architecture
• Cache coherence in small multicores
• Cache coherence in manycores
Shared-Memory Structures in Parallel Computers

[Figure: three organizations. (1) Monolithic memory: a single M behind a network, with caches and processors below. (2) Distributed: multiple Ms behind the network. (3) Distributed-local: each processor paired with its own cache and memory.]

Like Legos, you can move the Ps, Cs, and Ms around. But what about multicore chips?

Shared-Memory Structure in Cutting-Edge Multicores

[Figure: a multicore chip with processor+cache tiles on a ring interconnect, with memory controllers connecting to off-chip memory.]

[Figure: the Tile processor — 64 cores on a mesh network; each tile is a processor plus cache, with memory controllers at the edges of the chip.]

Caches and Cache Coherence

A world without caches: every load crosses the network to memory. With caches: a load can hit in the local cache and avoid the network entirely.

How Are Caches Different from Fast Local Memory (SRAM)?

Anatomy of a common-case LD A operation:
• If A is replicated in the local store, fetch A from the local store (HW: 1 cycle; SW: 10 cycles)
• Else, send a message to get A from DRAM (HW: 100 cycles; SW: 110 cycles)

Key insight on why to use a cache when local memory exists: all of this can be done in hardware, and that is what typical caches do. When it is done in hardware, we call the store a cache. Discuss.

Cache Coherence Problem

[Figure: two caches hold copies of the same location; one processor writes while another reads, so the copies can diverge.]

Solving the Coherence Problem
– Small multicores
  > Software coherence
  > Snooping caches
– Manycores?
  > Software coherence
  > Full-map directories
  > Limited pointers
  > Chained pointers
    · Singly linked
    · Doubly linked
  > LimitLESS schemes
  > Hierarchical methods

We will study: coherence structures, coherence protocols, cache-side state diagrams, and directory-side state diagrams.

Software Coherence (saw this before)

[Figure: memory holds foo1..foo4, whose home is in memory; each cache holds copies of foo1..foo4.]

  GET_foo_LOCK
  . . .
  MUNGE foo
  . . .
  Flush foo* from cache
  Fence: wait till the changes that result from the flush are visible to everyone
  RELEASE_foo_LOCK

You can stick the locking, flushes, and fences in library code to provide clean abstractions.

Hardware Cache Coherence: Snooping Caches

[Figure: processors and caches sit on a shared bus or ring; the cache tags are dual-ported so the snoop logic can match a broadcast write address against the tags and invalidate on a match.]

• Works for small multicores (memory off chip)
• Broadcast the address on a shared write
• Everyone listens (snoops) on the bus/ring to see if any of their own addresses match
• Invalidate the copy on a match
• How do you know when to broadcast or invalidate?
  – State is associated with each cache line
  – Key benefit: no global state in main memory

Let's look at this in more detail next...

Update versus Invalidate Protocols

On a snooped address match, a cache can either:
• Invalidate its local copy (called an invalidate or ownership protocol), OR
• Update its local copy with the new data from the bus (the writer must broadcast the value along with the address)

Tradeoffs between update and ownership protocols:
• Update is better when write locality is poor
• Invalidate is better otherwise

Competitive snooping idea:
– Do write updates
– If more than a "few" updates, then use ownership
– "Few" → switch mode when the cost of all the updates so far equals the cost of an invalidation
– The cost of this approach is no worse than twice the optimal (try to prove this)
– "Competitive algorithms are cool" — discuss the paper

Only a cache-side state machine is needed.

State Diagram for Ownership Protocols

Definitions:
• For each address a, keep a cache-side state machine; store the state with the cache tags
• Transitions are driven by my local requests/responses and by external bus requests/my bus responses
• Assume the cache block size is one word for now; we will deal with the cache-block complexity later

States ("MSI"; variants include MESI and MOESI):
• "Invalid"
• "Shared" (read-clean)
• "Modified" (write-dirty)

State diagram for a cache block in ownership protocols (a: address):
• invalid → read-clean on a local read (fetch block)
• invalid → write-dirty on a local write (broadcast a; fetch block)
• read-clean → write-dirty on a local write (broadcast a)
• read-clean → invalid on a remote write or local replace
• write-dirty → read-clean on a remote read (update memory)
• write-dirty → invalid on a remote write or local replace (update memory)

In an ownership protocol, the writer owns an exclusive copy.

State diagram for update protocols (a: address; <a>: value):
• invalid → read-clean on a local read (fetch block)
• invalid → write-dirty on a local write (broadcast a, <a>; fetch block)
• read-clean → write-dirty on a local write (broadcast a, <a>)
• write-dirty stays write-dirty on a local write (broadcast a, <a>)
• on a remote write, update the local copy with the broadcast value (no invalidation)
• write-dirty → read-clean on a remote read (update memory)
• write-dirty → invalid on a local replace (update memory)

Maintaining Coherence in Manycores
• Software coherence — saw this before
• Hardware coherence
  > Full-map directories
  > Limited pointers
  > Chained pointers
    · Singly linked
    · Doubly linked
  > LimitLESS schemes
  > Hierarchical methods