Revisiting the Complexity of Hardware Cache Coherence and Some

Department of Computer Sciences
Revisiting the Complexity of
Hardware Cache Coherence and Some Implications
Rakesh Komuravelli
Sarita Adve, Ching-Tsun Chou
University of Illinois @ Urbana-Champaign, Intel
[email protected]
Department of Computer Sciences
Motivation
• Today’s shared memory systems are more complex than ever
– Implementing cache coherence protocols is a major challenge
– Tens of transient states, hard to test races and add optimizations
• Have we tamed the protocol complexity yet?
– Verified a state-of-the-art implementation of MESI from GEMS
– Found six bugs even after 4+ years of usage worldwide
• Current verification techniques are insufficient
– Scalable but hard to use or error prone (e.g. parametric verification)
– Protocols designed for verification impact perf. (e.g., fractals)
Are there any alternatives?
2
Department of Computer Sciences
An alternative approach
• DeNovo: a h/w-s/w co-designed protocol [PACT 2011]
– Assumes disciplined programming eliminating data races
– Simple protocol, yet providing performance and power advantages
• Model checked DeNovo
– Found three bugs: easy to fix; implementation errors
– 15X fewer reachable states compared to MESI
• Focus of the talk:
– Understand what makes hardware protocols complex
– Experiences with verifying MESI and DeNovo
3
Department of Computer Sciences
Outline
• Motivation
• Understanding hardware protocol complexity
• Verification model and findings
• Conclusions
Department of Computer Sciences
Why are hardware protocols complex?
Invalid
Shared
Modified
Exclusive
• Text book protocol for MESI
5
Department of Computer Sciences
Why are hardware protocols complex?
Writek
Invalid
Shared
Readi
Readi
[sharers exist]
Writek
Writei
Writek
Readk
Readk
Readi
[no sharers]
Writei
Exclusive
Modified
Readi
Writei
Readi/
Writei
• Text book protocol for MESI
• In reality, the actual implementation is a lot more complex
6
Department of Computer Sciences
Why are hardware protocols complex?
Writek
Invalid
Shared
Readi
Readi
[sharers exist]
Writek
Writei
Writek
Readk
Readk
Readi
[no sharers]
Writei
Exclusive
Modified
Readi
Writei
Readi/
Writei
• Text book protocol for MESI
• In reality, the actual implementation is a lot more complex
7
Department of Computer Sciences
Example transition for MESI
Store
L1P1
Initial state
at L1
L1P2
Invalid (I)
GETX
Transient_1
Shared (S)
L1Pn
…
Shared (S)
Acks
Invalid (I)
Invalid (I)
Transient_2
On last Ack
Modified (M)
Invalidations
…
Data
Initial state
at L2
Shared (SS)
Transient_3
L1Modified(MT)
LL2/Directory
Exclusive_Unblock
8
Department of Computer Sciences
Example transition for MESI
Store
L1P1
Initial state
at L1
L1P2
Invalid (I)
GETX
Transient_1
Shared (S)
L1Pn
…
Shared (S)
Acks
Invalid (I)
Invalid (I)
Transient_2
On last Ack
Modified (M)
Invalidations
…
Data
Initial state
at L2
Shared (SS)
Transient_3
L1Modified(MT)
LL2/Directory
Exclusive_Unblock
• One transition requires three transient states (total 21)
• Transient states ← Hardware races ← Software data races
9
Department of Computer Sciences
DeNovo with zero transient states
• Assumes data-race-free software
– Completely eliminates transient states from the protocol
• Exploits s/w information for simple coherence enforcement
Readi
• Invalidate stale copies in private caches
– Caches selectively self-invalidate
entries not written by self
– No sharers list
• Track up-to-date copy
Invalid
Readi
Writei
Valid
Writei
Writek
– Directory keeps track of one up-to-date copy
Registered
Readi, Writei
Readk
10
Department of Computer Sciences
Example transition for DeNovo
Store
L1P1
Initial state
at L1
L1P2
Invalid (I)
Valid (V)
L1Pn
…
Valid (V)
SelfInvalidations
Registered (R)
Initial state
at L2
Invalid (I)
Invalid (I)
Registration
request
Valid (V)
Registered (R)
LL2/Directory
Zero transient states => a simplified protocol
11
Department of Computer Sciences
Outline
• Motivation
• Understanding hardware protocol complexity
• Verification model and findings
• Conclusions
Department of Computer Sciences
Verification model
• Murφ model checking tool
• Verified DeNovo and MESI protocols
– State-of-the art GEMS implementation
• Abstract model
– Single address, two data values
– Two cores with private L1 and unified L2, unordered n/w
– Data-race-free assumption for DeNovo
13
Department of Computer Sciences
• Correctness
Results
– Six bugs in MESI protocol
• Two deadlock scenarios
• Unhandled races due to L1 writebacks
• Several days to fix
14
Department of Computer Sciences
A MESI bug
L1P2
L1P1
Replacement
Store
Modified (M)
Invalid (I)
Transient_1
Transient_3
L1P1
PUTX
DATA
GETX
Dangling
message!!
Invalid (I)
Modified (M)
Fwd_GETX
Exclusive_Unblock
L1Modified (MT)
Modified at:P1
Transient_2
L1Modified (MT)
Modified at:P2
ERROR!!
…
LL2/Directory
2a
• Complex to identify the cause and fix
• Required adding multiple new state transitions
15
Department of Computer Sciences
• Correctness
Results
– Six bugs in MESI protocol
• Two deadlock scenarios
• Unhandled races due to L1 writebacks
• Several days to fix and needed more transient states
– Three bugs in DeNovo protocol
• Mistakes in translation from high level specification
• Simple to fix
• Complexity
– 15x fewer reachable states for DeNovo
– 20x faster to verify for DeNovo
DeNovo is simpler and needs reduced verification effort
16
Department of Computer Sciences
Scalability results
• Extended the base model
– Two addresses instead of one
• DeNovo model finished without new bugs
• MESI model ran out of system memory (32GB)
Need more scalable tools for non-experts
17
Department of Computer Sciences
Conclusions
• Have we tamed the coherence protocol complexity yet? No!
– 6 bugs in a state-of-the-art MESI protocol in use for 4+ years
– Main source: transient states
• DeNovo: an alternative h/w-s/w co-designed approach
– 3 easy-to-fix bugs in an immature protocol
– Zero transient states
• MESI vs. DeNovo
– DeNovo has 15X fewer reachable states, 20X faster to verify
• Easy-to-use verification tools are not scalable
– Need better tools for non-experts
18