QB or not QB: An Efficient Execution Verification Tool for Memory Orderings

Ganesh Gopalakrishnan*, School of Computing, University of Utah, Salt Lake City, UT
Yue Yang*, Microsoft Research, Redmond, WA
Hemanthkumar Sivaraj*, Intel Corporation, Bangalore, India
* Work supported in part by SRC Contract 1031.001 and NSF Award 0219805

Efficient Multiprocessors must have Efficient Shared Memory Systems
[figure labels: CPU performance, Memory performance]

Building Efficient Memory
Allow reorderings between loads / stores that fall on DIFFERENT addresses
Example (two CPUs sharing one Memory):
  Program:    CPU 1: st c,1 ; st d,2      CPU 2: ld d ; ld c
  Execution:  CPU 1: st c,1 ; st d,2      CPU 2: ld d, 2 ; ld c, 0
• Helps hide latencies
• Simplifies design of directory protocols
• System programmers will bite the bullet ;-)

Permitted reorderings are specified by the shared memory consistency model
A VERY complex specification for a real architecture (e.g. Itanium, PowerPC, …)
Also of growing concern in Software (e.g. Java Memory Model, Unified Parallel C model, …)

MODULAR SPECIFICATION OF MEMORY MODELS
legal_itanium exec =            (* a given execution *)
  ?order. requireLinearOrder exec order /\
          requireWriteOperationOrder exec order /\
          requireProgramOrder exec order /\
          requireMemoryDataDependence exec order /\
          requireDataFlowDependence exec order /\
          requireCoherence exec order /\
          requireReadValue exec order /\
          requireAtomicWBRelease exec order /\
          requireSequentialUC exec order /\
          requireNoUCBypass exec order
See IPDPS 2004

A MEMORY MODEL RULE IN HOL
requireCoherence exec order =
  !i j. i IN exec /\ j IN exec ==>
    isWr i /\ isWr j /\ (i.var = j.var) /\ order i j /\
    ((attr_of i.var = WB) \/ (attr_of i.var = UC)) /\
    ((i.wrType=Local) /\ (j.wrType=Local) /\ (i.proc=j.proc) \/
     (i.wrType=Remote) /\ (j.wrType=Remote) /\ (i.wrProc=j.wrProc))
    ==> !p q.
      p IN exec /\ q IN exec ==>
        isWr p /\ isWr q /\ (p.wrID = i.wrID) /\ (q.wrID = j.wrID) /\
        (p.wrType = Remote) /\ (q.wrType = Remote) /\ (p.wrProc = q.wrProc)
        ==> order p q

How do we know that the actual silicon matches the shared memory model?
[figure: a jumble of quantifiers and connectives: ! X . X in exec, ? Y . Y in exec, /\, \/, …]
• Pray
• Run tests and manually check results
• ? What else ?

FORMALLY VERIFY “interesting” EXECUTIONS
P1’s exec:
  st8 [12ca20] = 7f869af546f2f14c
  ld8 r25 = [45180] <87b5e547172644a8>
  ld2 r26 = [2c2a2c] <44a8>
  ld2 r27 = [45aa2a] <c58e>
  …
P2’s exec:
  st8 [45180] = 87b5e547172644a8
  ld8 r25 = [45180] <87b5e547172644a8>
  st2 [2c2a2c] = 44a8
  st2 [45aa2a] = c58e
  …

TWO APPROACHES:
- explicitly QB
- implicitly QB
[diagram, box labels: Spec of memory model in HOL; Given execution; “Boolify”; QBF program; Convert to execution checker program; Given execution; SAT problem]

AN EXAMPLE
requireMickeyMouse exec order =
  !i j. i IN exec /\ j IN exec ==>
    (i.op = read /\ i.data = 35 /\ j.op = write /\ j.data = 46 ==> order j i)
GIVEN MP EXECUTION…
  PROCESSOR 1: read(ADDR, 35)
  PROCESSOR 2: write(ADDR, 46)

requireMickeyMouse exec order =
  !i j. i IN exec /\ j IN exec ==>
    (i.op = read /\ i.data = 35 /\ j.op = write /\ j.data = 46 ==> order j i)
Explicitly QB:
  ! i j : Bool . BOOLIFIED MATRIX
Implicitly QB:
  FOR i = 1 to 2 DO /\ (FOR j = 1 to 2 DO /\ (BOOLIFIED MATRIX))

The Intel Itanium® Processor memory model
• Has these kinds of instructions:
    “weak load” or “ordinary load” -- ld
    “strong load” or “acquire-load” -- ld.acq
    “weak store” or “ordinary store” -- st
    “strong store” or “release store” -- st.rel
    “memory fence” (NOT barrier!) -- mf
    A few semaphore-types
  Allows sub-word writes, I/O spaces…
  We don’t model these details for the moment…

EVEN THIS EXAMPLE HAS A 1-page “proof”
A manual proof…
  P: st [x] = 1 ; mf ; ld r1 = [y] <0>
  Q: st.rel [y] = 1
  R: ld.acq r2 = [y] <1> ; ld r3 = [x] <0>
Proof-step labels on the figure: Atomicity of st.rel; Load of initial value is before store of every other value

CONTRIBUTIONS:
Wrote a formal description of Itanium® in Higher Order Logic
- modular
- extensible
- works for many architectures
As opposed to relying on concurrent data structures that “pretend to be Itanium®” (the “operational style”)
Showed how, using SAT, executions can be formally verified against the spec

Our Approach
  Itanium ordering rules in HOL
    -> Mechanical Program Derivation (to be automated)
    -> Checker Program
  MP execution to be verified
      P: st [x] = 1 ; mf ; ld r1 = [y] <0>
      Q: st.rel [y] = 1
      R: ld.acq r2 = [y] <1> ; ld r3 = [x] <0>
    -> Checker Program
    -> Satisfiability Problem with Clauses carrying annotations
    -> Sat Solver
  Sat: Explanation in the form of one possible interleaving
  Unsat: Unsat Core Extraction using Zcore
RECENT WORK
• Find Offending Clauses
• Trace their annotations
• Determine “ordering cycle”

Largest example tried to date (courtesy S. Zeisset, Intel)
  Proc 1:
    st8 [12ca20] = 7f869af546f2f14c
    ld r25 = [45180] <87b5e547172644a8>
    … 58 more instructions…
    st2 [7c2a00] = 4bca
  Proc 2:
    ld4 r24 = [733a74] <415e304>
    st4.rel [175984] = 96ab4e1f
    … 67 more instructions…
    ld8 r87 = [56460] <b5c113d7ce4783b1>
• Initially the tool gave a trivial violation
• Diagnosed to be forgotten memory initialization
• Added method to incorporate memory initialization in our tool
• Our tool found the exact same cycle as pointed out by the author of the test
Cycle found through our tool:
  st.rel (line 18, P1) -> ld (line 22, P2) -> mf -> ld (line 30, P2) -> st (line 11, P1)

Statistics Pertaining to Case Study
• 140 total instructions
• All runs were on a 1.733 GHz Athlon with 1 GB of memory running Red Hat Linux 9
• 1 minute to generate the Sat instance
• 9M clauses ( O(n^3) in terms of instructions )
• 117,823 variables ( not a problem )
• ~1 minute to run Sat (unsat here) – 0.2 sec to do “real work”
• Zcore runs fast – gave 23 clauses in one iteration

The rest of the talk
• An Intuitive presentation of the Itanium® memory model
•
Example of how a HOL rule was turned into a SAT generator
• How the SAT part was done
  Throwing an efficient “transitivity blanket” over a problem to cover it with whatever transitivity it begs for !!
• What more to expect
• Related work

Itanium® memory model through examples
“Ordinary store”
  …
  st [x] = 2
  …
Can freely slide in a sequential program…
Only rule is coherence
The same applies to an “ordinary load”
  …
  ld reg1 = [x]
  …

Itanium® memory model through examples
“Release store”
  …
  st.rel [x] = 2
Things before it in sequential program order can’t happen after it
Things after it in sequential program order may happen before it !!

Itanium® memory model through examples
“Acquire load”
  …
  ld.acq r3 = [y]
Things before it in sequential program order may happen after it
Things after it in sequential program order can’t happen before it !!

But with these rules alone, we can’t explain the following legal outcome in Itanium®
  st.rel [y] = 1 ; ld.acq r3 = [y] <1> ; ld reg1 = [x] <0>
  st.rel [x] = 2 ; ld.acq r4 = [x] <2> ; ld reg2 = [y] <0>
  (edge labels on the figure: Data dep., ld.acq rule)
Itanium specification DOES NOT try to explain outcomes in terms of “shuffles” of the original instructions!
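The intuitive "sliding" reading of the ordinary, release and acquire slides can be written down as a small predicate. This is only a sketch of that intuition, assuming two instructions on the same processor that touch different addresses; the function name `may_reorder` is hypothetical, `mf` and semaphores are left out, and, as the slide above stresses, the actual Itanium specification is not phrased in terms of such shuffles.

```python
# Instruction kinds named on the slides; "mf" and semaphores are left out on purpose.
KINDS = {"ld", "ld.acq", "st", "st.rel"}


def may_reorder(earlier: str, later: str) -> bool:
    """Can `later` take effect before `earlier`, given program order earlier -> later?

    Assumption: the two instructions touch DIFFERENT addresses, so coherence
    and data dependences (not modeled here) do not come into play.
    """
    if earlier not in KINDS or later not in KINDS:
        raise ValueError("unknown instruction kind")
    if earlier == "ld.acq":
        # Things after an acquire load can't happen before it.
        return False
    if later == "st.rel":
        # Things before a release store can't happen after it.
        return False
    # Ordinary loads and stores can freely slide.
    return True
```

For instance, `may_reorder("st.rel", "ld")` is true (things after a release store may happen before it), while `may_reorder("ld.acq", "st")` is false.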
Itanium® rules explain execution outcomes in terms of “progenies” of stores and loads
This has turned out to be an unspoken convention in this area for other memory models also…
A store generates (n+1) progenies
  st [y] = 1 : local copy for P0, “remote” copy for P0, “remote” copy for P1
Other instructions generate only one
  ld.acq r3 = [y]

We wrote such a “breeding assembler”
P1: St a,1; Ld r1,a <1>; St b,r1 <1>;
P2: Ld.acq r2,b <1>; Ld r3,a <0>;
Tuple 1:
  {id=0; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Local; wrProc=0; reg=-1; useReg=false};
  {id=1; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=0; reg=-1; useReg=false};
  {id=2; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=1; reg=-1; useReg=false};
  {id=3; proc=0; pc=1; op= Ld; var=0; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=0; useReg=true};
  {id=4; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Local; wrProc=0; reg=0; useReg=true};
  {id=5; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=0; reg=0; useReg=true};
  {id=6; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=1; reg=0; useReg=true};
  {id=7; proc=1; pc=0; op= LdAcq; var=1; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=1; useReg=true};
Tuple 9:
  {id=8; proc=1; pc=1; op= Ld; var=0; data=0; wrID=-1; wrType=DontCare; wrProc=-1; reg=2; useReg=true}

Itanium® rules specify how to line up the tuples to explain the load outcomes !!
  P0: st [y] = 1 ; ld.acq r3 = [y] <1> ; ld reg1 = [x] <0>
  P1: st [x] = 2 ; ld.acq r4 = [x] <2> ; ld reg2 = [y] <0>
Split copies:
  st [y] = 1 “l”, st [y] = 1 “rp0”, st [y] = 1 “rp1”
  st [x] = 2 “l”, st [x] = 2 “rp0”, st [x] = 2 “rp1”
Now, arrange the split copies… (edge labels on the figure: Dependencies, Antidependencies)
Explanation…
  st [y] = 1 “l”
  ld.acq r3 = [y] <1>
  st [x] = 2 “l”
  ld.acq r4 = [x] <2>
  st [y] = 1 “rp0”
  st [x] = 2 “rp1”
  ld reg1 = [x] <0>
  st [x] = 2 “rp0”
  ld reg2 = [y] <0>
  st [y] = 1 “rp1”

Gist of our method: Illustration on SC and on Itanium
The tuples to be ordered
SC(exec) =
  Exists order. ( requireStrictTotalOrder exec order
               /\ requireProgramOrder exec order
               /\ requireReadValue exec order )
Find an arrangement under SC constraints
The tuples to be ordered
legalItanium(exec) =
  Exists order. ( requireStrictTotalOrder exec order
               /\ requireWriteOperationOrder exec order
               /\ requireItProgramOrder exec order
               /\ requireMemoryDataDependence exec order
               /\ requireDataFlowDependence exec order
               /\ requireCoherence exec order
               /\ requireAtomicWBRelease exec order
               /\ requireSequentialUC exec order
               /\ requireNoUCBypass exec order
               /\ requireReadValue exec order )
Find arrangement as per above constraints

Gist of constraints:
• Some arrangements are statically known
• Others are conditional: one ordering implies another
• Some must form an atomic set: everybody else strictly before or strictly after
• Many are unordered
• Find a strict total order satisfying all the above!
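The four kinds of constraints listed above can be illustrated with a brute-force search over all arrangements. This is a minimal sketch, not the SAT-based method of the talk: the function `find_order` and its parameter names are hypothetical, tuples are plain hashable labels, and the search is factorial in the number of tuples, so it is only usable on toy inputs.

```python
from itertools import permutations


def find_order(tuples, static=(), conditional=(), atomic_sets=()):
    """Return one strict total order (a list) meeting all constraints, or None.

    static      : pairs (a, b) meaning "a before b" is known outright
    conditional : pairs ((a, b), (c, d)) meaning "a before b implies c before d"
    atomic_sets : sets whose members must be contiguous, so that every other
                  tuple is strictly before or strictly after the whole set
    Tuples not mentioned in any constraint are simply left unordered.
    """
    for perm in permutations(tuples):
        pos = {t: k for k, t in enumerate(perm)}
        if any(pos[a] >= pos[b] for a, b in static):
            continue
        if any(pos[a] < pos[b] and pos[c] >= pos[d]
               for (a, b), (c, d) in conditional):
            continue
        if any(max(pos[t] for t in s) - min(pos[t] for t in s) != len(s) - 1
               for s in atomic_sets):
            continue
        return list(perm)
    return None
```

For example, `find_order("abcd", static=[("a", "b")], atomic_sets=[{"a", "c"}])` returns an order in which `a` and `c` are adjacent and `a` precedes `b`, while contradictory static orderings such as `[("a", "b"), ("b", "a")]` yield `None`, the analogue of an UNSAT answer.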
Gist of constraint ENCODING:
• Use Boolean precedence matrix (N x N, rows i, columns j)
• Capture “i before j” by m_ij
  Statically known: unit clauses (a 1 at position (i, j) of the matrix)
  Conditional (“implies”, “and”): Boolean formula
  Atomic set: see how SAT-generator is derived
  Strict total order: spew out irreflexivity and totality axioms,
    then throw a “transitivity blanket” on top of all tuples

- Also tried E_ij method
- and some incremental SAT (see paper)

Approaches to “transitivity blanket”
Naïve: For all tuples i, j, and k, generate m_ij /\ m_jk => m_ik
  Too many clauses (1B for a 1000-tuple program)
Better: Obtain transitive closure of known orderings and then prune irrelevant parts of the blanket
  E.g., if ~m_ij is known, don’t generate m_ij /\ … => … as well as … /\ m_ij => …

Obtaining SAT-generator from HOL
Initial Spec
  atomicWBRelease(exec,order) =
    forall (i in exec).(j in exec).(k in exec).
      (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\
      (i.wrID = k.wrID) /\ order(i,j) /\ order(j,k)
      ==> (j.wrID = i.wrID)
Applying Contrapositive
  atomicWBRelease(exec,order) =
    forall (i in exec).(j in exec).(k in exec).
      (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\
      (i.wrID = k.wrID) /\ ~(j.wrID = i.wrID)
      ==> ~(order(i,j) /\ order(j,k))
After Reducing Quantifier Scopes
  atomicWBRelease(exec,order) =
    forall (i in exec).
      (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) ==>
      forall (k in exec). (i.wrID = k.wrID) ==>
      forall (j in exec). ~(j.wrID = i.wrID) ==> ~(order(i,j) /\ order(j,k))

…Obtaining SAT-generator from HOL
Transformed Spec
  atomicWBRelease(exec,order) =
    forall (i in exec).
      (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) ==>
      forall (k in exec). (i.wrID = k.wrID) ==>
      forall (j in exec).
        ~(j.wrID = i.wrID) ==> ~(order(i,j) /\ order(j,k))
Functional Program that generates the constraints (will be automated)
  atomicWBRelease(exec) = forall(i,exec,wb(i))
  wb(i) = if ~((attr_of i.var=WB) & (i.op=StRel) & (i.wrType=Remote))
          then true
          else forall(k,exec,wb1(i,k))
  wb1(i,k) = if ~(i.wrID=k.wrID)
             then true
             else forall(j,exec,wb2(i,k,j))
  wb2(i,k,j) = if (j.wrID=i.wrID)
               then true
               else ~(order(i,j) & order(j,k))
  forall(i,S, e(i)) = for all i in S : e(i)
  (* foldr( map (fn i -> e(i)) (S) (&), true) *)

Clause annotations for the unsat core for example
  op1 = 1; op2 = -1; op3 = -1; op4 = -1; rule = Reflexive
  op1 = 4; op2 = 5; op3 = 6; op4 = -1; rule = TransitiveOrder
  op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder
  op1 = 4; op2 = 6; op3 = 8; op4 = -1; rule = TransitiveOrder
  op1 = 4; op2 = 11; op3 = 12; op4 = -1; rule = TransitiveOrder
  op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder
  op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = TotalOrder
  op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = TotalOrder
  op1 = 11; op2 = 4; op3 = 8; op4 = -1; rule = TransitiveOrder
  op1 = 11; op2 = 4; op3 = -1; op4 = -1; rule = TotalOrder
  op1 = 11; op2 = 12; op3 = -1; op4 = -1; rule = ProgramOrder
  op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
  op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
  op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 9; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease

Op numbering for the example (each number denotes an op; a store has both local and remote exec)
  P: st [x] = 1 (ops 1, 2, 3, 4) ; mf (op 5) ; ld r1 = [y] <0> (op 6)
  Q: st.rel [y] = 1 (ops 7, 8, 9, 10)
  R: ld.acq r2 = [y] <1> (op 11) ; ld r3 = [x] <0> (op 12)

The slides that follow repeat this op-numbering figure and highlight one group of annotations at a time.

Highlighted (mf after the store):
  op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder

Highlighted (ld r1 after the mf):
  op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder

Highlighted (ld r1 = [y] <0>):
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

Highlighted (st.rel [y] = 1):
  op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 9; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease

Highlighted (ld.acq r2 = [y] <1>):
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue

Highlighted (ld r3 after the ld.acq):
  op1 = 11; op2 = 12; op3 = -1; op4 = -1; rule = ProgramOrder

Highlighted (ld r3 = [x] <0>):
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

CONCLUSIONS
• An execution verification method for real memory models
• Convert HOL spec of memory model to SAT-generator
• Given an execution, run SAT-generator, and generate a SAT-instance
• Unsat core gives violating cycle
• Works for a few hundred total assembly language instructions

What to expect
• There is only so much engineering one can put in before making the checker code suspect
• About 500 total instructions may be checkable
• To scale beyond this size, we may need to sacrifice completeness (e.g. limited transitivity instantiation, good for bug-hunting)
• Incremental SAT methods can definitely pay off
• Worst-case (for exhaustive checking) is still bad

Related Work
• Yuan Yu encoded Alpha axioms in FOL and solved using Simplify
• TSOtool (ISCA’04, Hangal et al.)
  - TSO much simpler than Itanium
  - They deliberately omit ordering rules to keep their checker polynomial (e.g.
    “ordering unrelated stores”)
  - Hence incomplete
  - Very long executions checked
  - Most industrial in-house checkers are similar

Extra Slides

A real example: Atomic WB Release
Informal statement: Store-Releases to write-back memory become visible to all processors in the same order
Implementation: All copies of a “split st.rel” are visible atomically
  st.rel [x] = 1 : its copies form an atomic set

One standard way of specifying atomicity:
  All other events “e” are strictly before or strictly after the atomic set
Another standard way of specifying atomicity:
  If some event “e” is between two events in the atomic set, then “e” also belongs to the atomic set

Constraint (Sat) Encoding Approach #1
n log n approach (“small domain” encoding)
• Attach a word w_t of 2 bits to each tuple t
• Tuple i before Tuple j --> Assert w_i < w_j
• StrictTotalOrder --> Assert that the w_t words are distinct
• Smaller # of Boolean Vars
• Much Harder SAT instances (abandoned for now)
Illustration on 4 tuples
  Word bits: x00 x01 | x10 x11 | x20 x21 | x30 x31
  requireStrictTotalOrder order exec
  requireOtherOrder order exec
  requireReadValue order exec
  For all i, j: xi1,xi0 != xj1,xj0
  A system of constraints with primitive constraint xi1,xi0 < xj1,xj0

Constraint Encoding Approach #2
n x n approach (“e_ij” encoding)
• Assign a matrix position mij for each pair of tuples ti and tj
• Tuple i before Tuple j --> Assert mij true
• StrictTotalOrder --> Assert Irreflexivity, Transitivity, Totality
• Larger # of Boolean Vars
• Easier SAT instances (being pursued now)
Illustration on 4 tuples
  [figure: 4 x 4 matrix with entry mij at row i, column j]
  requireStrictTotalOrder order exec
  requireOtherOrder order exec
  requireReadValue order exec
  Forall i : ~mii
  Forall i,j : mij \/ mji
  Forall i,j,k : mij /\ mjk => mik
  A system of constraints with primitive constraint mij

Table of Results (somewhat dated…)

SAT-instance generation time for n log n method
  Tuples   Total Order   Other Order
  32       0.2           1.6
  64       1.2           17.1
  128      5.7           179.0

SAT-instance generation time for n x n method
  Tuples   Total Order   Other Order
  32       0.5           0.1
  64       4.3           0.9
  128      34.2          9.0

SAT-checking times
           n log n                           n x n
  Tuples   Monolith  TotalOrd  OtherOrd      Monolith  TotalOrd  OtherOrd
  32       9.6       0.6       4.3           0.33      0.69      0.05
  64       247.17    29.53     37.6          2.73      6.17      0.5
  128      abort     abort     164.8         145.6     351.1     1341

Example execution (Table 18, pg. 31 of App note)
  P: st [x] = 1 ; mf ; ld r1 = [y] <0>
  Q: st.rel [y] = 1
  R: ld.acq r2 = [y] <1> ; ld r3 = [x] <0>
• The Sat instance generated for the above example is UNSAT.
• Next few slides show automated approach to detect the root cause cycle.
• We will ignore the reflexive and transitive rules in these slides (they are necessary to force unsat, but useless in building a cycle!!)
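The n x n encoding of Approach #2 (irreflexivity, totality, transitivity over the variables mij) can be sketched as a small clause generator. This is an illustrative sketch only: the function names, the 0-based tuple indices and the DIMACS-style numbering of mij as `i*n + j + 1` are assumptions, and the brute-force model counter stands in for a real SAT solver on tiny inputs.

```python
from itertools import product


def var(i, j, n):
    """DIMACS-style variable number for m_ij (assumed numbering: i*n + j + 1)."""
    return i * n + j + 1


def strict_total_order_clauses(n):
    """Clauses (lists of signed ints) forcing m to be a strict total order."""
    clauses = [[-var(i, i, n)] for i in range(n)]            # irreflexivity: ~m_ii
    for i in range(n):
        for j in range(i + 1, n):
            clauses.append([var(i, j, n), var(j, i, n)])      # totality: m_ij \/ m_ji
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if i != j and j != k:                         # naive transitivity blanket
                    clauses.append([-var(i, j, n), -var(j, k, n), var(i, k, n)])
    return clauses


def count_models(clauses, num_vars):
    """Brute-force model count; only feasible for very small num_vars."""
    count = 0
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in c) for c in clauses):
            count += 1
    return count
```

With three tuples the clause set has exactly one model per arrangement of the tuples, and adding the unit clauses for a cycle (0 before 1, 1 before 2, 2 before 0) leaves no model, which is the situation the unsat core then explains. The cubic loop also shows why the naive blanket grows as O(n^3).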
Good Case-study Illustrating Program Derivation from Formal Specs
• Initial specs: HOL
• Formal derivation of tail-recursive functional programs
• “Code generation” consists of generating Boolean clauses
  – Choose Boolean encoding method
  – Re-target code generation correspondingly
• Source-level optimizations
  – Record known orderings (e.g., “i before j”) – these manifest as unit clauses
  – Infer others (e.g., “not j before i”) – generate unit clauses for these too
  – Prevent generating transitivity axioms that depend on “j before i”
• The use of incremental SAT can perhaps be directed by “functional scripts” that are automatically generated
• Use of Unsat cores to pinpoint errors

Concluding Remarks
• Main source of complexity: the transitivity axiom
• “Lazy” methods for handling transitivity must be investigated
• Hybrid Sat encoding (partly n x n and partly n log n) can also help, as was the experience of Lahiri, Seshia, and Bryant
• Analyzing larger programs:
  – Somehow view program in terms of “basic blocks”
  – Treat each basic block as a super-instruction
  – If super-instruction unordered, no need to descend into basic block
• Exploit incremental Sat when same litmus tests are rerun
• Try modeling another weak memory model

Extra Slides

Unsat Core generation
• The CNF file generated by the sat-generating program is solved using zchaff.
• If SAT, then we get a satisfying assignment.
• First n*n variables in the assignment correspond to the n*n variables in our ordering. Can be used to output a valid ordering of the exec.
• If UNSAT, then need a way to find a “root-cause” for the illegality of the execution.
• We use unsatisfiable core generation to get to the root cause.
• An unsatisfiable core of an unsatisfiable Sat instance is a subset of clauses of the formula such that its conjunction is still UNSAT.

Generating Unsatisfiable Core
• Zchaff can be told to generate resolution trace while checking for Sat.
• Zcore – tool that takes as input a CNF file and the resolution trace produced by zchaff and produces an unsatisfiable core.
• Zcore available as part of zchaff.
• Unsatisfiable core is another CNF file with the reduced set of clauses.
• Can be fed back into zchaff/zcore to generate a potentially smaller unsatisfiable core.
• Process repeated till fixed point reached.

Mapping back to root-cause
• Clauses in the unsatisfiable core contain the ordering violation information in them
• Tool to home in on the root cause of the violation
• If the root cause is not something trivial, then the cause is usually a cycle of instructions. Each link in the cycle corresponds to an ordering requirement between the instructions involved.
• If cycle exists, then Transitivity can be applied to show that Irreflexivity is not satisfied.
• Input to the tool to generate root cause:
  – The original set of annotated machine instructions for all processors
  – The default values stored in memory locations at the beginning of the execution
  – Clause annotations for the clauses that form the unsatisfiable core

Root-cause cycle analysis algorithm
Each ReadValue rule generates a set of clauses. From the annotations, find the tuples that come from the same ReadValue rule (two different exec will be involved in a rule)
  – Extract the exec out of the annotations and get the corresponding instructions (using the proc and pc values)
From the data being used in the ld instruction and the default data value for the corresponding memory address, it can be seen if the effect of a store is being reflected in a load. This way the dependency between a load and a store is established.
The above is done for all the ReadValue rules in the annotations.
The exec (and the corresponding instructions) on both sides of an mf that form a link in the cycle are inferred based on ProgramOrder rule annotations and the pc values involved.
The other missing links in the violating cycle are also inferred based on the remaining ProgramOrder rule annotations.

A taxonomy of Formal methods to specify industrial Relaxed Memory Models
• Operational
  – Operational models of industrial memory models are complex
  – Running them inside a standard model-checker is too slow!
  – Utility for verification is limited
  – Provides limited insight
• Axiomatic
  – Much more precise
  – Orderings must ideally be expressed through an ORTHOGONAL set of rules
  – No such prior axiomatic specs of industrial memory models

Post-Si verification of MP Orderings today (oversimplified)
  assembly program 1 ... assembly program n
    -> New MP System (run repeatedly to catch one interleaving that might reveal bug)
    -> assembly execution 1 ... assembly execution n
    -> Check every execution against ordering rules for compliance
* This is done ad hoc
* How to make this formal and efficient ?
* How to capitalize on repeated re-runs ?

Explanation of Illegal Executions (p 31 of Itanium App Note – search 251429)
  P: us: st [x] = 1 ; mf: mf ; ul1: ld r1 = [y] <0>
  Q: sr: st.rel [y] = 1
  R: la: ld.acq r2 = [y] <1> ; ul2: ld r3 = [x] <0>
• US >> MF ; hence RVr(US) F(MF)
• MF >> UL1 ; hence F(MF) R(UL1)
• …many reasons… hence R(UL1) RVp(SR)
• If RVr(SR) R(UL1) and RVr(SR) UL1 RVp(SR) , WB release atomicity of SR is violated, thus R(UL1) RVr(SR)
• …five lines of reasons Hence RVr(SR) R(LA)
• Since LA >> UL2, R(LA) R(UL2)
• Another para of reasons LV(Sr2) R(UL2) LV(SR1) RVp(SR1) RVq(SR1) F(MF1) R(UL1) RVq(SR2) RVp(SR2). But can’t allow due to atomicity of SR.

Checking Executions and Providing Explanations (present approach)
  P: st [x] = 1 ; mf ; ld r1 = [y] <0>
  Q: st.rel [y] = 1
  R: ld.acq r2 = [y] <1> ; ld r3 = [x] <0>
• Published approaches are very labor-intensive paper-and-pencil proofs
• Clearly this can’t scale (6 instruction MP program takes 1-page of detailed mathematical proof)
• What about the combinatorics of reasoning about 200 instructions?
• Approaches actually used within the industry involve the use of “checkers”
• Details of these checkers are unknown (How complete? How scalable?)

The rest of the talk
• Itanium memory model in Higher Order Logic (well, not so high actually… )
• Our HOL specs: translation to “sat-generating checker programs”
• Execution to be checked: translation by above program to Sat
• Each assembly instruction: clauses it generates + annotations
• When Sat, what interleaving explains?
• When Unsat, how to get “core” (root-cause) + annotations on core
• Translating annotations on core to cycle on original program

• Itanium memory model in Higher Order Logic (well, not so high actually… )
The initial focus of our presentation:
- How to model an execution ?
- Why use “split stores” in modeling ?

But, how do we check executions against such specs?
SC(exec) =
  Exists order. ( requireStrictTotalOrder exec order
               /\ requireProgramOrder exec order
               /\ requireReadValue exec order )
legalItanium(exec) =
  Exists order. ( requireStrictTotalOrder exec order
               /\ requireWriteOperationOrder exec order
               /\ requireItProgramOrder exec order
               /\ requireMemoryDataDependence exec order
               /\ requireDataFlowDependence exec order
               /\ requireCoherence exec order
               /\ requireAtomicWBRelease exec order
               /\ requireSequentialUC exec order
               /\ requireNoUCBypass exec order
               /\ requireReadValue exec order )
Execution 1:  st c,1 ; st d,2      ld d, 2 ; ld c, 0
Execution 2:  st c,1 ; st d,2      ld d, 2 ; ld c, 1
e.g., which execution is legal under which memory model ?

• Itanium memory model in Higher Order Logic (well, not so high actually… )
• Our HOL specs: translation to “sat-generating checker programs”

• Itanium memory model in Higher Order Logic (well, not so high actually… )
• Our HOL specs: translation to “sat-generating checker programs”
• Execution to be checked: translation by above program to Sat

How the SAT encoding is achieved...
Example Execution
  P1: st c,1 ; st d,2
  P2: ld d, 2 ; ld c, 0
Break it down into “tuples” (8 tuples obtained):
• Store c viewed at P1 for modeling bypassing
• Store c viewed at P1 for modeling global visibility
• Store c viewed at P2 for modeling global visibility
• Store d viewed at P1 for modeling bypassing
• Store d viewed at P1 for modeling global visibility
• Store d viewed at P2 for modeling global visibility
• Ld d viewed at P2 for modeling read value
• Ld c viewed at P2 for modeling read value
SC(exec) =
  Exists order. ( requireStrictTotalOrder exec order
               /\ requireOtherOrderSC exec order
               /\ requireReadValue exec order )
legalItanium(exec) =
  Exists order. ( requireStrictTotalOrder exec order
               /\ requireOtherOrderItanium exec order
               /\ requireReadValue exec order )

Explaining the results of Sat
• Itanium memory model in Higher Order Logic (well, not so high actually… )
• Our HOL specs: translation to “sat-generating checker programs”
• Execution to be checked: translation by above program to Sat
• Each assembly instruction: clauses it generates + annotations
• When Sat, what interleaving explains?
• When Unsat, how to get “core” (root-cause) + annotations on core
• Translating annotations on core to cycle on original program

Clause Annotations
• For each clause it generates, the sat-generating checker program also emits an associated tuple.
• This tuple has information pertaining to the clause’s source.
• Each tuple has the following information
  – The exec involved in generating the clause (up to a maximum of 4 exec could generate a clause)
  – The proc value of the processor whose instructions were used to generate this clause (taken from the tuples generated by the gentuple program)
  – The pc value of the instruction that was the source for this tuple
  – The name of the memory ordering rule whose application generated this tuple (ReadValue, ProgramOrder, Reflexive, etc.)
• The clause annotation looks as follows
  < proc, pc, op1, op2, op3, op4, RuleName >
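The annotation format < proc, pc, op1, op2, op3, op4, RuleName > can be mirrored by a small record type. This is a hypothetical sketch, not the tool's own data structure: the class and function names are made up, the convention that -1 marks an unused op slot is read off the annotation listings earlier in the talk, and the proc and pc values in the sample are invented for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    """One clause annotation: < proc, pc, op1, op2, op3, op4, RuleName >."""
    proc: int
    pc: int
    op1: int
    op2: int
    op3: int
    op4: int
    rule: str

    def ops(self):
        """Op numbers actually used; -1 is taken to mark an unused slot."""
        return [o for o in (self.op1, self.op2, self.op3, self.op4) if o != -1]


def ops_by_rule(annotations):
    """Group the ops of an unsat core by rule name, skipping NoRule separators."""
    grouped = defaultdict(list)
    for a in annotations:
        if a.rule != "NoRule":
            grouped[a.rule].append(tuple(a.ops()))
    return dict(grouped)


# Sample core fragment: op numbers follow the listings in the talk,
# proc and pc values are invented placeholders.
core = [
    Annotation(0, 1, 4, 5, -1, -1, "ProgramOrder"),
    Annotation(0, 2, 5, 6, -1, -1, "ProgramOrder"),
    Annotation(-1, -1, -1, -1, -1, -1, "NoRule"),
    Annotation(2, 1, 11, 12, -1, -1, "ProgramOrder"),
]
```

Grouping the core this way gives the ProgramOrder links (4, 5), (5, 6) and (11, 12) directly, which is the raw material for assembling an ordering cycle.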