Department of Computer Sciences

Revisiting the Complexity of Hardware Cache Coherence and Some Implications
Rakesh Komuravelli, Sarita Adve, Ching-Tsun Chou
University of Illinois @ Urbana-Champaign, Intel
[email protected]

Motivation
• Today's shared memory systems are more complex than ever
  – Implementing cache coherence protocols is a major challenge
  – Tens of transient states, hard to test races and add optimizations
• Have we tamed the protocol complexity yet?
  – Verified a state-of-the-art implementation of MESI from GEMS
  – Found six bugs even after 4+ years of usage worldwide
• Current verification techniques are insufficient
  – Scalable but hard to use or error prone (e.g., parametric verification)
  – Protocols designed for verification impact perf. (e.g., fractals)
Are there any alternatives?

An alternative approach
• DeNovo: a h/w-s/w co-designed protocol [PACT 2011]
  – Assumes disciplined programming eliminating data races
  – Simple protocol, yet providing performance and power advantages
• Model checked DeNovo
  – Found three bugs: easy to fix; implementation errors
  – 15X fewer reachable states compared to MESI
• Focus of the talk:
  – Understand what makes hardware protocols complex
  – Experiences with verifying MESI and DeNovo

Outline
• Motivation
• Understanding hardware protocol complexity
• Verification model and findings
• Conclusions

Why are hardware protocols complex?
[Figure: MESI state-transition diagram over the states Invalid, Shared, Exclusive, and Modified. Edges are labeled Read_i, Write_i, Read_k, and Write_k; Read_i out of Invalid is guarded by [sharers exist] or [no sharers].]
• Textbook protocol for MESI
• In reality, the actual implementation is a lot more complex

Example transition for MESI
[Figure: message sequence for a store at L1 P1. Initial state at L1: Invalid (I) at P1, Shared (S) at P2 … Pn. Initial state at L2: Shared (SS). P1 sends GETX and enters Transient_1; the L2/directory enters Transient_3 and sends Data and Invalidations; the sharers become Invalid (I) and send Acks; P1 passes through Transient_2 and reaches Modified (M) on the last Ack; after Exclusive_Unblock the L2/directory reaches L1Modified (MT).]
• One transition requires three transient states (total 21)
• Transient states ← Hardware races ← Software data races

DeNovo with zero transient states
• Assumes data-race-free software
  – Completely eliminates transient states from the protocol
• Exploits s/w information for simple coherence enforcement
• Invalidate stale copies in private caches
  – Caches selectively self-invalidate entries not written by self
  – No sharers list
• Track up-to-date copy
  – Directory keeps track of one up-to-date copy
[Figure: DeNovo state-transition diagram over the states Invalid, Valid, and Registered. Edges are labeled Read_i, Write_i, Read_k, and Write_k.]

Example transition for DeNovo
[Figure: message sequence for a store at L1 P1. Initial state at L1: Invalid (I) at P1, Valid (V) at P2 … Pn. Initial state at L2: Valid (V). P1 sends a registration request; P1 and the L2/directory become Registered (R); P2 … Pn become Invalid (I) through self-invalidations.]
Zero transient states => a simplified protocol

Verification model
• Murφ model checking tool
• Verified DeNovo and MESI protocols
  – State-of-the-art GEMS implementation
• Abstract model
  – Single address, two data values
  – Two cores with private L1 and unified L2, unordered n/w
  – Data-race-free assumption for DeNovo

A MESI bug
[Figure: race between a replacement at L1 P1 (Modified (M), sends PUTX and DATA) and a store at L1 P2 (Invalid (I), sends GETX). The L2/directory starts in L1Modified (MT) with the line modified at P1, issues Fwd_GETX, and after Exclusive_Unblock is in L1Modified (MT) with the line modified at P2. The PUTX from P1 is left as a dangling message and leads to an error.]
• Complex to identify the cause and fix
• Required adding multiple new state transitions

Results
• Correctness
  – Six bugs in MESI protocol
    • Two deadlock scenarios
    • Unhandled races due to L1 writebacks
    • Several days to fix and needed more transient states
  – Three bugs in DeNovo protocol
    • Mistakes in translation from high level specification
    • Simple to fix
• Complexity
  – 15x fewer reachable states for DeNovo
  – 20x faster to verify for DeNovo
DeNovo is simpler and needs reduced verification effort

Scalability results
• Extended the base model
  – Two addresses instead of one
• DeNovo model finished without new bugs
• MESI model ran out of system memory (32GB)
Need more scalable tools for non-experts

Conclusions
• Have we tamed the coherence protocol complexity yet? No!
  – 6 bugs in a state-of-the-art MESI protocol in use for 4+ years
  – Main source: transient states
• DeNovo: an alternative h/w-s/w co-designed approach
  – 3 easy-to-fix bugs in an immature protocol
  – Zero transient states
• MESI vs. DeNovo
  – DeNovo has 15X fewer reachable states, 20X faster to verify
• Easy-to-use verification tools are not scalable
  – Need better tools for non-experts
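
Backup: what "reachable states" means
The reachable-state counts quoted in the talk come from explicit-state model checking. The sketch below is not the Murφ model used in the talk; it is a minimal Python illustration of the same idea on a toy three-state protocol (Invalid, Valid, Registered) with no transient states. The transition rules, the invariant `single_registrant`, and all function names are assumptions made for illustration, and data values, the directory, and the network are not modeled.

```python
from collections import deque

# Per-cache states of the toy protocol (hypothetical, loosely DeNovo-like).
INVALID, VALID, REGISTERED = "I", "V", "R"


def successors(state):
    """Yield every global state reachable from `state` in one atomic step."""
    for i, s in enumerate(state):
        # Read by core i: an Invalid copy becomes Valid.
        if s == INVALID:
            yield state[:i] + (VALID,) + state[i + 1:]
        # Self-invalidation by core i: a Valid copy is dropped.
        if s == VALID:
            yield state[:i] + (INVALID,) + state[i + 1:]
        # Write by core i: core i registers, and any previous registrant
        # is invalidated. Valid copies elsewhere are left untouched.
        yield tuple(
            REGISTERED if j == i else (INVALID if t == REGISTERED else t)
            for j, t in enumerate(state)
        )


def explore(num_cores, invariant):
    """Breadth-first search over global states, checking `invariant` on each."""
    start = (INVALID,) * num_cores
    seen = {start}
    frontier = deque([start])
    while frontier:
        state = frontier.popleft()
        if not invariant(state):
            raise AssertionError(f"invariant violated in {state}")
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def single_registrant(state):
    """Safety property: at most one cache holds the line as Registered."""
    return state.count(REGISTERED) <= 1


if __name__ == "__main__":
    for n in (2, 3):
        count = len(explore(n, single_registrant))
        print(n, "cores:", count, "reachable states")
```

With these assumed rules the search visits 8 global states for two cores and 20 for three. Every transient state added to a protocol multiplies the combinations such a search has to enumerate, which is one way to read the gap between the MESI and DeNovo models reported above.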