Industrial Semantics
or: How to Stop the Maths Getting in the Way of the Marketing

Joe Stoy, Founder and Principal Engineer, Bluespec, Inc.
(with help from many at Bluespec)
APPSEM05 Workshop, 15 September 2005

Basic Message
An industrial tool needs good semantics:
- robust
- simple
- conforming to users' model
But the theory must be "under the covers":
- learning curve
- perceived learning curve

Outline
- A technology based on term-rewriting systems
- A superstructure based on functional programming semantics
- Semantic issues

Tool and market
For designing chips (ASICs, FPGAs, ...):
- currently low-level, with Verilog or VHDL
- chip complexity rising (millions of gates)
For chip designers, verification engineers, system architects.
ASICs have huge NREs ($500K–$1M); mistakes (respins) cost another NRE.
Tools run into millions of dollars per team, and form a significant fraction of a company's budget (e.g., ~10%). Tools tend to run on UNIX (Solaris, Linux).

History of this technology
(timeline: ~1996, 2000, 2003, 2004)
- Research at MIT on high-level synthesis & verification (Prof. Arvind et al.)
- VC funding: technology productization within industry; major pilot project: arbiter for a 160 Gb/s router (1.5M gates, 200 MHz, 1.3m)
- Bluespec, Inc.: high-level synthesis tool; product available 2004

Design Flow
Bluespec SystemVerilog -> Bluespec Synthesis -> Verilog RTL -> RTL Synthesis -> Netlist.
Alongside: Bluesim (transaction / cycle-accurate simulation), DAL (Design Assertion Level), event-based simulation.

Cosim: Typical Use Models
Bluespec integrated into SystemC/Verilog, where Bluespec is:
- integrated into a Verilog System-on-Chip (SoC) design
- back-annotated into a SystemC model
- part of a mixed SystemC/Bluespec model
SystemC/Verilog integrated into Bluespec, where Bluespec is designed with:
- existing Verilog IP re-used
- a SystemC model awaiting the Verilog

A technology based on term-rewriting systems

Term-rewriting systems
  while some rule is applicable:
    choose an applicable rule
    apply it to the term (or a subcomponent)
FP's standard operational semantics. Our rewritings are less free-wheeling (they don't change the structure of the term); maybe "state transition system" is a better name.

Clocked synchronous hardware
The compiler translates BSV source code into Verilog RTL.
(diagram: inputs I feed transition logic, which computes the "next" values of a collection S of state elements and the outputs O)
(diagram: each rule's condition produces a CAN_FIRE signal; the scheduler turns these into WILL_FIRE signals; muxing combines the rules' actions into the modules' next state)

Rules are constructed from guarded atomic actions, with interfaces.

Guarded Atomic Actions
Actions are guarded:
- fifo.enq(x); if the fifo is full, this cannot happen
- this hides lots of tedious bureaucracy
Actions are atomic.

Atomicity
"a-tomic", from the Greek ατομος, "not cut":
- the privative a-, as in asymmetric, atypical, amoral
- the root "cut", as in microtome, tomography, tome (of a multi-volume book)
Rules are atomic ("not cut"):
- whenever they run, they run to completion, never interrupted
- no other activities are interleaved with them
This greatly simplifies design: it avoids many race conditions.

Guarded Atomic Actions (continued)
Actions can be composed: a1; a2 is an atomic action, guarded by the guards of a1 and a2.
Conditionals:
- if (b) a1; guarded by "b implies (a1's guards)"
- when (b) a1; guarded by b (and a1's guards)
- perhaps-if (b) a1; unguarded, with a1's guards conjoined to b
Nice algebra (a separate question: which of these to have in BSV).

... with interfaces
A BSV design is structured by modules. Modules communicate only through interfaces.
(diagram: a module contains rules and state, and presents an interface)

An example

  interface NumFn;
    method Action start (int n);
    method int result ();
  endinterface

  module mkTest ();
    int n = 15;  // constant
    Reg#(int) state <- mkReg(0);
    NumFn f <- mkFact2();
    rule go (state == 0);
      f.start (n);
      state <= 1;
    endrule
    rule finish (state == 1);
      $display ("Result is %d", f.result());
      state <= 2;
    endrule
  endmodule: mkTest

  module mkFact2 (NumFn);
    Reg#(int) x <- mkReg(?);
    Reg#(int) j <- mkReg(0);
    rule step (j > 0);
      x <= x * j;
      j <= j - 1;
    endrule
    method start (n) if (j == 0);
      x <= 1;
      j <= n;
    endmethod
    method result () if (j == 0);
      return x;
    endmethod
  endmodule: mkFact2

Method invocations fit into rules
In mkTest, the condition of rule finish is:
  state == 1 && j == 0
that is, the explicit condition and all the implicit conditions of all the method calls in the rule. Thus a part of the rule's condition (j == 0), and a part of the rule's computation (reading x), are in a different module, reached via a method invocation.

Modularizing rules
Likewise for rule go, which calls f.start(15):
  rule condition: state == 0 && j == 0
  rule actions:   state <= 1, x <= 1 and j <= 15
Thus a part of the rule's action is in a different module.

Order of Evaluation
Not lazy (e.g. Haskell's): schedule as many rules as possible in each clock cycle (patented technology: James Hoe et al.).

Rule semantics mapped to hardware semantics
The effect of each clock cycle is as if a sequence of rules was executed one at a time.
Consequence: the hardware state can never result from an interleaving of actions from different rules. Rule atomicity (and therefore correctness) is preserved.

... which leads to pragmatic constraints on rule combination. Initial set:
- a rule fires within a clock cycle
- a rule fires at most once in a clock cycle
- a rule's effect is only visible in the next clock
- we only combine rules in a certain fixed order within a cycle
- all rules which read a register must precede any which write it
- we only consider rules enabled at the start of the clock cycle
- each rule is independent of previous rules executing in the same cycle
- the logic path delay depends on individual rule paths, and not on combinations of rules
(Some of these have since been relaxed.)

Benefits of atomic-action semantics
Consider this example:
- process 0 increments register x
- process 1 transfers a unit from register x to register y
- process 2 decrements register y
This is an abstraction of some real applications:
- bank account: 0 = deposit to checking, 1 = transfer from checking to savings, 2 = withdraw from savings
- packet processor: 0 = packet arrives, 1 = packet is processed, 2 = packet departs

Concurrency in the example
Process j (= 0, 1, 2) only updates under condition condj. Only one process at a time can update a register. Note:
- processes 0 and 2 can run concurrently if process 1 is not running
- both of process 1's updates must happen "indivisibly" (else inconsistent state)
Suppose we want to prioritize process 2 over process 1 over process 0 (priority 2 > 1 > 0).

In RTL: is either of these correct? Where's the error?

Solution 1 (one always block per process):

  always @(posedge CLK)   // process 0
    if (!cond1 && cond0)
      x <= x + 1;
  always @(posedge CLK)   // process 1
    if (!cond2 && cond1) begin
      y <= y + 1;
      x <= x - 1;
    end
  always @(posedge CLK)   // process 2
    if (cond2)
      y <= y - 1;

(the error: process 0's guard should be if ((!cond1 || cond2) && cond0))

Solution 2 (one always block per register):

  always @(posedge CLK) begin
    if (!cond2 && cond1)
      x <= x - 1;
    else if (cond0)
      x <= x + 1;
    if (cond2)
      y <= y - 1;
    else if (cond1)
      y <= y + 1;
  end

Which of these solutions are correct, if any? What's required to verify that they're correct? Now, what if the priorities were changed to 1 > 2 > 0? And what if the processes are in different modules?
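Before the Bluespec answer, it helps to see what one-rule-at-a-time semantics promises. A minimal Python sketch of the example (a model of the semantics only, not of Bluespec's actual scheduler; the function names and priority encoding are mine): each enabled rule runs atomically in priority order, so process 1's two updates can never be split apart.

```python
# Sketch of atomic one-rule-at-a-time semantics for the three-process
# example.  Each rule runs to completion; nothing interleaves with it.

def proc0(s):          # increment x
    s["x"] += 1

def proc1(s):          # transfer a unit from x to y (both updates atomic)
    s["x"] -= 1
    s["y"] += 1

def proc2(s):          # decrement y
    s["y"] -= 1

def step(state, cond0, cond1, cond2):
    """Fire each enabled rule once, one at a time, priority 2 > 1 > 0."""
    for cond, rule in [(cond2, proc2), (cond1, proc1), (cond0, proc0)]:
        if cond:
            rule(state)          # atomic: runs to completion
    return state

s = step({"x": 5, "y": 3}, cond0=True, cond1=True, cond2=True)
# x: +1 (proc0) -1 (proc1) = 5;  y: +1 (proc1) -1 (proc2) = 3
```

Whatever subset of conditions holds, the final state is always explicable as a sequence of atomic rule firings, which is exactly the property the RTL versions have to be verified against.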
With Bluespec, design is direct
Process priority 2 > 1 > 0:

  (* descending_urgency = "proc2, proc1, proc0" *)
  rule proc0 (cond0);
    x <= x + 1;
  endrule

  rule proc1 (cond1);
    y <= y + 1;
    x <= x - 1;
  endrule

  rule proc2 (cond2);
    y <= y - 1;
  endrule

- Functional correctness follows directly from rule semantics.
- Related actions are grouped naturally with their conditions: easy to change.
- Interactions between rules are managed by the compiler (scheduling, muxing, control).
- Same hardware as the RTL.

Verification-centric design: Reorder Buffer
Example from CPU design: a speculative, out-of-order processor (Nirav Dave, MEMOCODE, 2004).
(diagram: Fetch, Decode, ALU and MEM units and a ReOrder Buffer (ROB), connected by FIFOs to a register file, instruction memory and data memory: many, many concurrent activities)

ROB actions
(diagram: the ROB is a circular buffer of entries, each Empty, Waiting, Dispatched, Killed or Done, holding an instruction, two operands and a result, with head and tail pointers)
- register file: get operands for an instr; write back results
- decode unit: insert an instr into the ROB; resolve branches
- ALU unit(s): get a ready ALU instr; put ALU instr results in the ROB
- MEM unit(s): get a ready MEM instr; put MEM instr results in the ROB

But what about all the potential race conditions?
- Reading from the register file at the same time a separate instruction is writing back to the same location: which value to read?
- An instruction is being inserted into the ROB simultaneously with a dependent upstream instruction's result coming back from an ALU: put a tag or the value in the operand slot?
- An instruction is being inserted into the ROB simultaneously with a branch mis-prediction: must kill the mis-predicted instructions and restore a "consistent state" across many modules.

Rule Atomicity
- Lets you code each operation in isolation.
- Eliminates the nightmare of race conditions ("inconsistent state") under such complex concurrency conditions.
All behaviors are explicable as a sequence of atomic actions on the state:
- Insert instr in ROB: put instruction in first available slot; increment tail pointer.
- Dispatch instr: get source operands (RF or prev instr result); mark instruction dispatched; forward to appropriate unit.
- Write back results to ROB: write back result to all waiting tags; set to Done.
- Commit instr: write results to register file (or allow memory write for store); set to Empty; increment head pointer.
- Branch resolution: ...

Performance Semantics: another processor example
Rule-based specifications.
(diagram: a pipeline IF, Dec, Exe, Mem, Wb with FIFOs bI, bD, bE, bW between the stages, a register file RF with bypasses, and iMem/dMem)
Each pipeline stage is described as a set of atomic rules, e.g.:

  rule ExecuteAdd:
    when (bD.first == (Ri = va + vb)) ==>
      begin
        result = va + vb;       // compute addition
        bE.enq (Ri = result);   // enqueue result into bE
        bD.deq;                 // dequeue instruction from bD
      end

Any legal behavior can be understood in terms of applying one rule at a time.

Performance Concerns
The designer wants to make sure that one instruction executes every cycle, so the FIFOs must support both enq and deq in each cycle. What are the semantics of FIFOs?

Example from a commercially available FIFO IP component
(ports: data_in, push_req_n, pop_req_n, clk, rstn; outputs: data_out, full, empty)
Its constraints are taken from several paragraphs of documentation, spread over many pages, interspersed with other text.

A FIFO interface in BSV

  interface FIFOQueue #(type aType);
    method Action push (aType val);
    method aType first();
    method ActionValue#(aType) pop();
    method Action clear();
  endinterface

Methods as ports
- push: n-bit argument; has a side effect (action); ready = not full; enable.
- first: n-bit result; has no side effect; ready = not empty.
- pop: n-bit result; has a side effect (action); ready = not empty; enable.
- clear: no argument; has a side effect (action); ready always true; enable.

FIFO semantics: types
Given by the interface definition above.

FIFO semantics: laws
An algebra of enq, deq, etc.:
- not at present part of BSV (though SVA assertions are)
- needed for formal verification work
- so far, all in the atomic-action world

FIFO semantics: performance
"Must support enq and deq in the same cycle."
Naïve implementation (ignoring wraparound):

  Bool not_empty = (iptr != optr);
  Bool not_full  = (iptr + 1 != optr);

  method Action enq(x) if (not_full);
    buf[iptr] <= x;
    iptr <= iptr + 1;
  endmethod

  method ActionValue deq() if (not_empty);
    optr <= optr + 1;
    return (buf[optr]);
  endmethod

Semantics: concurrency

              Empty                      Neither                      Full
  Naïve       can't deq                  enq and deq conflict         can't enq
  Normal      can't deq                  enq and deq conflict-free    can't enq
  Turbo       can't deq                  enq + deq, deq before enq    enq + deq, deq before enq
  Bypass      enq + deq, enq before deq  enq + deq, enq before deq    can't enq
  Super       enq + deq                  enq + deq                    enq + deq
  (not yet)

A problem with "super"
(diagram: push's ready wire is "not full" and pop's is "not empty"; with "super" semantics each method's readiness would have to depend on whether the other fires in the same cycle)
So what are the semantics of FIFOs?
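The guarded-method idea behind these interface variants can be sketched in ordinary software. A hypothetical Python model (not the BSV library; names like can_enq are mine): each method carries an implicit readiness condition, and a rule that calls it simply cannot fire until the guard holds.

```python
# Sketch of a guarded FIFO: enq/deq refuse to fire unless their implicit
# conditions (not full / not empty) hold.  Hypothetical model, not BSV code.

class GuardedFifo:
    def __init__(self, depth):
        self.depth = depth
        self.items = []

    def can_enq(self):               # implicit condition of enq
        return len(self.items) < self.depth

    def can_deq(self):               # implicit condition of deq
        return len(self.items) > 0

    def enq(self, x):
        assert self.can_enq(), "enq guard failed: FIFO full"
        self.items.append(x)

    def deq(self):
        assert self.can_deq(), "deq guard failed: FIFO empty"
        return self.items.pop(0)     # first-in, first-out

f = GuardedFifo(depth=2)
f.enq(1)
f.enq(2)
# f.can_enq() is now False: a rule calling enq cannot fire this cycle
```

In software the guard is just a precondition; in hardware it becomes the ready wire of the method's port, and the concurrency table above is about what happens when two guarded methods want to fire in the same atomic step.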
Performance requirements are sometimes an essential part of (interface) specification: pipeline efficiency, for example. Sometimes not.
- Allow extra optional attributes on interface definitions?
- Take care not accidentally to re-invent inheritance (à la OOP), badly.

Semantics
Maybe best to separate functional and performance semantics, so that the functional semantics can be handled entirely in the TRS (atomic action) framework.
Care: we must not rely on performance semantics to achieve functionally necessary concurrency. If actions must be simultaneous, compose them in the same action; don't put them in separate rules.

The processor example
"Composing Atomic Rules with Performance Guarantees" (MIT PhD: Rosenband, 2005).

Performance Concerns
The designer wants to make sure that one CPU instruction executes every cycle. Suppose the designer can specify which rules (if enabled) should execute within a clock cycle, and in which order:

  Wb x Mem x Exe x Dec x IF

No change in the rules; correctness of the hardware is guaranteed by the compiler. The order is important to minimize FIFO size and achieve optimal bypassing.

Experiments in scheduling
What happens if the user specifies:

  Wb x Wb x Mem x Mem x Exe x Exe x Dec x Dec x IF x IF

No change in the rules: a superscalar processor! Executing 2 instructions per cycle requires more resources but is functionally equivalent to the original design.

4-Stage Processor Results

  Design    Benchmark   Area 10ns   Timing 10ns   Area 2ns   Timing 2ns
            (cycles)    (µm²)       (ns)          (µm²)      (ns)
  1-element FIFO:
  No Spec   18525       24762       5.85          26632      2.00
  Spec 1    11115       25094       6.83          33360      2.00
  Spec 2    11115       25264       6.78          34099      2.04
  2-element FIFO:
  No Spec   18525       32240       7.38          39033      2.00
  Spec 1    11115       32535       8.38          47084      2.63
  Spec 2     7410       45296       9.99          62649      4.72

Benchmark: a program containing additions / jumps / loadc's. (Dan Rosenband & Arvind, 2004)

A superstructure based on disguised Haskell

Implementing BSV: two-level compilation
BSV source program
-> Level 1: Elaboration (desugaring; type-checking; massive partial evaluation and static elaboration)
-> Term Rewriting System (rules and actions)
-> Level 2: Synthesis (scheduling; control logic generation; object code generation)
-> Object code (Verilog RTL or C)

Bluespec Classic: a Haskell-based HDL (Lennart Augustsson, 2000)

  package Shift(shift) where
  import List

  sstep :: Bit m -> Bit n -> Nat -> Bit n
  sstep s x i when s[i:i] == 1 = x << (1 << i)
  sstep s x i = x

  shift :: Bit m -> Bit n -> Bit n
  shift s x = foldl (sstep s) x (map fromInteger (upto 0 ((valueOf m) - 1)))

Selling BS Classic
Unfamiliar syntax is a significant barrier, even in marketing slides: even the ()s in function calls are different! Many fronts in the adoption war:
- new hardware design methodology
- new (unfamiliar) syntax
- new type system
- new purely functional thinking
- new FP concepts (map, fold, monads)

Adapt an existing HDL
- Map matching concepts: expressions, bit vectors, functions, modules.
- Extend where straightforward: higher-order functions, first-class objects, polymorphism.
- Standardize where possible: tagged unions, pattern matching (SV 3.1a).
- Desugar where required: imperative assignments, loops.

Bluespec SystemVerilog: FP with Verilog syntax
The Classic shift function above becomes:

  function Bit#(n) sstep(Bit#(m) s, Bit#(n) x, Nat i);
    if (s[i] == 1)
      return (x << (1 << i));
    else
      return x;
  endfunction

  function Bit#(n) shift(Bit#(m) s, Bit#(n) x);
    return (foldl(sstep(s), x,
                  map(fromInteger, upto(0, valueof(m) - 1))));
  endfunction

Bluespec SystemVerilog: imperative circuit construction

  function Bit#(n) shift(Bit#(m) s, Bit#(n) x);
    Integer max = valueof(m);
    Bit#(n) xA [max+1];
    xA[0] = x;
    for (Integer j = 0; j < max; j = j + 1)
      if (s[fromInteger(j)] == 1)
        xA[j+1] = xA[j] << (1 << fromInteger(j));
      else
        xA[j+1] = xA[j];
    return xA[max];
  endfunction

Teaching BSV
Limited training time (2–4 days typical).
Audience: hardware designers:
- little or no FP background
- wires and registers, not abstractions
- conservative (remember the cost of mistakes?)
Format: lectures interspersed with labs.
Need to communicate the basics, or else the evaluation project might be hard. Want to show the full range of features, or else the benefits are not perceived and there is no sale.

Teaching conclusions
- Functional features are "advanced": they get in the way of communicating the basics.
- Strict typing is seen as restrictive: bit vs. Bool; bit-width constraints; structures vs. bit representations.
- Standards are less relevant when teaching: damn the torpedoes and teach the sugar.
- Key challenge: build intuition about the generated hardware.

Time-released Haskell

  FIFO#(Nat) f();
  mkSizedFIFO#(depth) the_f(f);      becomes    FIFO#(Nat) f <- mkSizedFIFO(depth);

  Tuple2#(Nat,Bool) nb = foo(x);     becomes    let nb = foo(x);

Compiling FP into hardware
For historical reasons, the level-one evaluator (in the two-level compilation scheme above) is lazy. Is this still a good idea as the language becomes more imperative?
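Both shift definitions describe the same circuit: a logarithmic barrel shifter whose stage i shifts by 2**i exactly when bit i of the shift amount is set. A Python sketch of that algorithm (an illustration only; M stands in for the type-level width m, and Python's unbounded integers stand in for Bit#(n)):

```python
from functools import reduce

M = 4                         # width of the shift amount (stands in for m)

def sstep(s, x, i):
    """One shifter stage: shift x by 2**i iff bit i of s is set."""
    return x << (1 << i) if (s >> i) & 1 else x

def shift(s, x):
    """Barrel shifter as a fold over the M stages, like the Classic foldl."""
    return reduce(lambda acc, i: sstep(s, acc, i), range(M), x)

# e.g. shift(3, 5): stage 0 gives 5 << 1 = 10, stage 1 gives 10 << 2 = 40
```

Composing the stages in sequence, as the imperative loop version does with its xA array, computes the same function: x shifted left by s.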
Laziness is hard work
- Performance is a challenge: graph reduction is required to avoid duplicating work.
- Non-strict primitives (if, and, or) require careful handling: symmetric short-circuiting; undetermined values must be propagated correctly.
- Error messages can be confusing: "Compile-time expression did not evaluate".

Being lazy pays off
Consider: let z = x + y. Is this:
- a static constant?
- a fixed incrementer?
- a full adder?
A lazy evaluator does not care! It evaluates what it can, and defers (or suspends) what it can't. User benefit: can move freely between static and dynamic code (?).

Conclusions
The FP cultural heritage contains lots of goodies:
- TRS: semantic foundation
- Typing: unexpected convenience
- ...
Hardware designers can enjoy the fruits without realising what they're using ... till later, perhaps.

The End