BS better with FP … in three acts
Mieszko Lis, Ravi Nanavati

Industrial Semantics
or: How to Stop the Maths Getting in the Way of the Marketing

Joe Stoy
Founder and Principal Engineer, Bluespec, Inc.
(with help from many at Bluespec)
APPSEM05 Workshop, 15 September 2005
Basic Message
- An industrial tool needs good semantics
  - Robust
  - Simple
  - Conforming to users' model
- But the theory must be "under the covers"
  - Learning curve
  - Perceived learning curve
Outline
- A technology based on term-rewriting systems
- A superstructure based on functional programming semantics
- Semantic issues
- Tool and market

Tool and market
For designing chips (ASICs, FPGAs, ...)
- currently low-level with Verilog or VHDL
- chip complexity rising (millions of gates)
For chip designers, verification engineers, system architects
- ASICs have huge NREs ($500K–$1M)
- mistakes (respins) cost another NRE
- tools run into millions of $$$ per team, and form a significant fraction of a company's budget (e.g., ~10%)
- tools tend to run on UNIX (Solaris, Linux)
History of this technology
- ~1996: research at MIT on high-level synthesis & verification (Prof. Arvind et al.)
- 2000: productization of the technology within industry; major pilot project: arbiter for a 160 Gb/s router (1.5M gates, 200 MHz, 1.3m)
- 2003: Bluespec, Inc. founded, with VC funding, to build a high-level synthesis tool
- 2004: product available
Design Flow
[Diagram: Bluespec SystemVerilog source feeds Bluespec Synthesis, producing Verilog RTL; that feeds RTL Synthesis, producing a netlist. Bluesim provides transaction-level, cycle-accurate simulation at the DAL (Design Assertion Level); event-based simulation runs on the Verilog RTL.]
Cosim: Typical Use Models
Bluespec integrated into SystemC/Verilog, where Bluespec is:
- …integrated into Verilog System-on-Chip (SoC) design
- …back-annotated into SystemC model
- …part of mixed SystemC/Bluespec model
SystemC/Verilog integrated into Bluespec, where Bluespec is designed with:
- …existing Verilog IP re-used
- …a SystemC model awaiting the Verilog
A technology based on term-rewriting systems
Term-rewriting systems
- while some rule is applicable
  - choose an applicable rule
  - apply it to the term (or a subcomponent)
- FP's standard operational semantics
- Our rewritings are less free-wheeling (don't change structure of term)
  - maybe "state transition system" a better name
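The rewrite loop above can be made concrete with a toy interpreter. The following is a hypothetical Python sketch (the names `run` and `rules` are illustrative, not part of any Bluespec tool): a "term" is just a state dictionary, and each rule is a (guard, action) pair; we rewrite while any rule applies.

```python
# A minimal term-rewriting / state-transition sketch (hypothetical, not BSV).
# A "rule" is a (guard, action) pair over a state; rewrite while any applies.

def run(state, rules):
    """Apply applicable rules until none is applicable (a normal form)."""
    while True:
        applicable = [action for (guard, action) in rules if guard(state)]
        if not applicable:
            return state
        applicable[0](state)   # choose an applicable rule and apply it

# Example: Euclid's GCD by subtraction, as two rewrite rules over (x, y).
rules = [
    (lambda s: s["x"] > s["y"] > 0, lambda s: s.update(x=s["x"] - s["y"])),
    (lambda s: s["y"] > s["x"] > 0, lambda s: s.update(y=s["y"] - s["x"])),
]

print(run({"x": 12, "y": 18}, rules))   # both components end at gcd(12, 18) = 6
```

Note that neither rule changes the *structure* of the "term" (it is always a pair of naturals), which is exactly the "state transition system" reading suggested above.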
Clocked synchronous hardware
The compiler translates BSV source code into Verilog RTL
[Diagram: inputs I feed transition logic, which computes the "next" values for a collection of state elements S and the outputs O.]
Scheduling and control logic
[Diagram: each rule's condition produces a CAN_FIRE signal (p1 … pn) from the modules' current state; the scheduler turns these into WILL_FIRE signals (f1 … fn), which gate the rules' actions (d1 … dn) through muxing into the modules' next state.]
Term-rewriting systems
- while some rule is applicable
  - choose an applicable rule
  - apply it to the term (or a subcomponent)
- FP's standard operational semantics
- Our rewritings are less free-wheeling (don't change structure of term)
- Rules are constructed from guarded atomic actions with interfaces
Guarded Atomic Actions
Actions are guarded
- fifo.enq(x);
  - if fifo full, cannot happen
  - hides lots of tedious bureaucracy
Actions are atomic
Atomicity
atomic, from ατομος: a-tomic
- "a-": not (as in asymmetric, atypical, amoral)
- "-tom": cut (as in microtome, tomography, tome of a multi-volume book)
Atomicity
Rules are atomic: "not cut"
- Whenever they run, they run to completion
  - never interrupted
- No other activities are interleaved with them
- This greatly simplifies design
  - avoids many race conditions
Guarded Atomic Actions
Actions can be composed
- a1; a2
  - resulting action atomic
  - guarded by guards of a1 and a2
Conditionals:
- if (b) a1;
  - guarded by "b implies (a1's guards)"
Another conditional:
- when (b) a1;
  - guarded by b (and a1's guards)
Yet another:
- perhaps-if (b) a1;
  - unguarded
  - a1's guards conjoined to b
Nice algebra
- (separate question: which to have in BSV)
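The guard laws for composition and the first two conditionals can be written down directly. Here is a hypothetical Python rendering (BSV does not expose guards as first-class values; the names `compose`, `if_`, and `when` are illustrative), modelling an action as a (guard, effect) pair and checking only the guard algebra. The `perhaps-if` variant is left out, since its guard treatment is posed as an open design question above.

```python
# Hypothetical model of guarded actions as (guard, effect) pairs.
# Guards are booleans here; effects are thunks.

def compose(a1, a2):
    """a1; a2: atomic, guarded by the conjunction of both guards."""
    g1, e1 = a1
    g2, e2 = a2
    return (g1 and g2, lambda: (e1(), e2()))

def if_(b, a1):
    """if (b) a1: guarded by 'b implies (a1's guards)'."""
    g1, e1 = a1
    return ((not b) or g1, lambda: e1() if b else None)

def when(b, a1):
    """when (b) a1: guarded by b conjoined with a1's guards."""
    g1, e1 = a1
    return (b and g1, e1)

noop = lambda: None
a_blocked = (False, noop)   # an action whose guard is currently false
a_ready   = (True, noop)    # an action whose guard is currently true
```

With `if_`, a false `b` makes the inner guard irrelevant; with `when`, a false `b` blocks the whole action.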
… with interfaces
- A BSV design is structured by modules
- Modules communicate only through interfaces

Modules and interfaces
[Diagram: a module contains state and rules, and presents an interface.]
An example

module mkTest ();
   int n = 15;  // constant
   Reg#(int) state <- mkReg(0);
   NumFn f <- mkFact2();
   rule go (state == 0);
      f.start (n);
      state <= 1;
   endrule
   rule finish (state == 1);
      $display ("Result is %d", f.result());
      state <= 2;
   endrule
endmodule: mkTest

interface NumFn;
   method Action start (int n);
   method int result ();
endinterface

module mkFact2 (NumFn);
   Reg#(int) x <- mkReg(?);
   Reg#(int) j <- mkReg(0);
   rule step (j > 0);
      x <= x * j; j <= j - 1;
   endrule
   method start (n) if (j == 0);
      x <= 1; j <= n;
   endmethod
   method result () if (j == 0);
      return x;
   endmethod
endmodule: mkFact2
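The behaviour of this pair of modules can be checked with a little interpreter. This is a hypothetical Python simulation (not generated hardware; `simulate` is an illustrative name) that fires one enabled rule at a time, folding each method's implicit condition into the invoking rule's guard, as the following slides explain.

```python
# Hypothetical one-rule-at-a-time simulation of mkTest / mkFact2.

def simulate(n):
    # mkTest's state register, plus mkFact2's registers x and j.
    s = {"state": 0, "x": None, "j": 0}
    log = []
    while True:
        if s["state"] == 0 and s["j"] == 0:      # rule go (incl. start's guard j==0)
            s["x"], s["j"], s["state"] = 1, n, 1
        elif s["j"] > 0:                         # rule step, inside mkFact2
            s["x"], s["j"] = s["x"] * s["j"], s["j"] - 1
        elif s["state"] == 1 and s["j"] == 0:    # rule finish (incl. result's guard)
            log.append("Result is %d" % s["x"])
            s["state"] = 2
        else:                                    # no rule applicable: done
            return log

print(simulate(5))   # prints ['Result is 120']
```

Notice that `rule finish` cannot fire while `j > 0`: the guard of `result()` keeps it waiting until the factorial has finished counting down.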
Method invocations fit into rules

module mkTest ();
   …
   Fact f <- mkFact2();
   …
   rule finish (state==1);
      $display("…%d", f.result());
      state <= 2;
   endrule
endmodule

module mkFact2 (Fact);
   Reg#(int) x <- mkReg(?);
   Reg#(int) j <- mkReg(0);
   …
   method result () if (j == 0);
      return x;
   endmethod
endmodule

Rule condition is: state==1 && j==0
- the explicit condition, and all implicit conditions of all method calls in the rule
Thus,
- a part of the rule's condition (j == 0), and
- a part of the rule's computation (reading x)
are in a different module, via a method invocation
Modularizing rules

module mkTest ();
   …
   Fact f <- mkFact2();
   rule go (state==0);
      f.start(15);
      state <= 1;
   endrule
   …
endmodule

module mkFact2 (Fact);
   Reg#(int) x <- mkReg(?);
   Reg#(int) j <- mkReg(0);
   …
   method start (int n) if (j == 0);
      x <= 1; j <= n;
   endmethod
endmodule

Rule condition: state==0 && j==0
Rule actions: state<=1, x<=1 and j<=15
Thus, a part of the rule's action is in a different module
Order of Evaluation
- Not lazy (e.g. Haskell's)
- Schedule as many rules as possible in each clock cycle
  - (patented technology: James Hoe et al.)
Clocked synchronous hardware
The compiler translates BSV source code into Verilog RTL
[Diagram: inputs I feed transition logic, which computes the "next" values for a collection of state elements S and the outputs O.]
Rule semantics mapped to hardware semantics
[Diagram: rules Ri, Rj, Rk fire as individual rule steps; in hardware, several of them fire within each clock cycle.]
- The effect of each cycle is as if a sequence of rules was executed one-at-a-time
- Consequence: the HW state can never result from an interleaving of actions from different rules
- Rule atomicity (therefore, correctness) is preserved
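The "as if one-at-a-time" property can be stated as a check: whatever state a clock cycle produces must equal the state reached by firing the scheduled rules sequentially, in some order. A hypothetical Python sketch (`cycle` is an illustrative name; a real scheduler only picks orders it can implement in one clock):

```python
# Hypothetical check of one-rule-at-a-time cycle semantics.
# A rule = (guard, action); a cycle fires a scheduled list of rules in order.
import copy

def cycle(state, rules, schedule):
    """Fire the scheduled rules sequentially; each sees its predecessors' updates."""
    s = copy.deepcopy(state)
    for i in schedule:
        guard, action = rules[i]
        if guard(s):
            action(s)
    return s

# Two rules touching disjoint state: every schedule gives the same result,
# so the hardware may fire both in one cycle without breaking atomicity.
rules = [
    (lambda s: True, lambda s: s.update(x=s["x"] + 1)),
    (lambda s: True, lambda s: s.update(y=s["y"] - 1)),
]
```

If the two orders disagreed, the scheduler would have to serialise the rules across cycles (or pick one) to preserve atomicity.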
Scheduling and control logic
[Diagram: each rule's condition produces a CAN_FIRE signal (p1 … pn) from the modules' current state; the scheduler turns these into WILL_FIRE signals (f1 … fn), which gate the rules' actions (d1 … dn) through muxing into the modules' next state.]
… leads to pragmatic constraints on rule combination
Initial set:
- A rule fires within a clock cycle
- A rule fires at most once in a clock cycle
- A rule's effect is only visible in the next clock
- We only combine rules in a certain fixed order within a cycle
- All rules which read a register must precede any which write it
- We only consider rules enabled at the start of the clock cycle
- Each rule is independent of previous rules executing in the same cycle
- The logic path delay depends on individual rule paths, and not on combinations of rules
- …
(Some since relaxed)
Benefits of atomic-action semantics

Consider this example
- Process 0 increments register x
- Process 1 transfers a unit from register x to register y
- Process 2 decrements register y
[Diagram: process 0 does +1 into x; process 1 does -1 from x and +1 into y; process 2 does -1 from y.]
This is an abstraction of some real applications:
- Bank account: 0 = deposit to checking, 1 = transfer from checking to savings, 2 = withdraw from savings
- Packet processor: 0 = packet arrives, 1 = packet is processed, 2 = packet departs
- …
Concurrency in the example
[Diagram: the same three processes, each firing under its condition condj; process priority 2 > 1 > 0.]
- Process j (= 0, 1, 2) only updates under condition condj
- Only one process at a time can update a register. Note:
  - Processes 0 and 2 can run concurrently if process 1 is not running
  - Both of process 1's updates must happen "indivisibly" (else inconsistent state)
- Suppose we want to prioritize process 2 over process 1 over process 0
Is either correct? Where's the error?
[Diagram: the same three processes with conditions cond0, cond1, cond2; process priority 2 > 1 > 0.]

Solution 1:
always @(posedge CLK) // process 0
   if (!cond1 && cond0)   // slide overlay: if ((!cond1 || cond2) && cond0)
      x <= x + 1;
always @(posedge CLK) // process 1
   if (!cond2 && cond1) begin
      y <= y + 1;
      x <= x - 1;
   end
always @(posedge CLK) // process 2
   if (cond2)
      y <= y - 1;

Solution 2:
always @(posedge CLK) begin
   if (!cond2 && cond1)
      x <= x - 1;
   else if (cond0)
      x <= x + 1;
   if (cond2)
      y <= y - 1;
   else if (cond1)
      y <= y + 1;
end

- Which of these solutions are correct, if any?
- What's required to verify that they're correct?
- Now, what if I Δ'd the priorities: 1 > 2 > 0?
- And, what if the processes are in different modules?

With Bluespec, design is direct
Process priority: 2 > 1 > 0
[Diagram: the same three processes with conditions cond0, cond1, cond2.]

(* descending_urgency = "proc2, proc1, proc0" *)
rule proc0 (cond0);
   x <= x + 1;
endrule

rule proc1 (cond1);
   y <= y + 1;
   x <= x - 1;
endrule

rule proc2 (cond2);
   y <= y - 1;
endrule

- Functional correctness follows directly from rule semantics
- Related actions are grouped naturally with their conditions: easy to change
- Interactions between rules are managed by the compiler (scheduling, muxing, control)
- Same hardware as the RTL
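The rule-level description can be simulated directly. The following is a hypothetical Python sketch (`step` is an illustrative name; the real compiler derives this scheduling automatically) of one clock cycle: rules are considered in descending urgency, a rule is blocked if a more urgent conflicting rule fires (proc1 conflicts with proc2 on y and with proc0 on x), and the fired rules' actions are then applied.

```python
# Hypothetical one-cycle simulation of the three rules with
# descending urgency proc2 > proc1 > proc0.

def step(x, y, cond0, cond1, cond2):
    fired = set()
    if cond2:                        # proc2 is most urgent
        fired.add(2)
    if cond1 and 2 not in fired:     # proc1 conflicts with proc2 (both touch y)
        fired.add(1)
    if cond0 and 1 not in fired:     # proc0 conflicts with proc1 (both touch x)
        fired.add(0)
    # Apply the fired rules' actions (atomic: all or nothing per rule).
    if 2 in fired:
        y -= 1
    if 1 in fired:
        y += 1
        x -= 1
    if 0 in fired:
        x += 1
    return x, y, fired
```

Note that proc0 and proc2 fire together when all three conditions hold, exactly the concurrency noted earlier, and exactly what Solution 1's `!cond1 && cond0` gets wrong.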
Reorder Buffer
Verification-centric design
- Example from CPU design: speculative, out-of-order
[Diagram: Fetch → Decode → Re-Order Buffer (ROB), connected by FIFOs to the ALU Unit and MEM Unit; also the Register File, Branch unit, Instruction Memory, and Data Memory. Many, many concurrent activities.]
Nirav Dave, MEMOCODE, 2004
ROB actions
[Diagram: the Re-Order Buffer is a table of entries (State, Instruction, Operand 1, Operand 2, Result) between a Head and a Tail pointer. Entry states: Empty (E), Waiting (W), Dispatched (Di), Killed (K), Done (Do). Around it: the Decode Unit inserts an instr into the ROB; the Register File supplies operands for an instr and receives writeback results; branches are resolved; the ALU Unit(s) get a ready ALU instr and put ALU instr results in the ROB; the MEM Unit(s) get a ready MEM instr and put MEM instr results in the ROB.]
But, what about all the potential race conditions?
- Reading from the register file at the same time a separate instruction is writing back to the same location
  - Which value to read?
- An instruction is being inserted into the ROB simultaneously with a dependent upstream instruction's result coming back from an ALU
  - Put a tag or the value in the operand slot?
- An instruction is being inserted into the ROB simultaneously with a branch mis-prediction
  - must kill the mis-predicted instructions and restore a "consistent state" across many modules
Rule Atomicity
- Lets you code each operation in isolation
- Eliminates the nightmare of race conditions ("inconsistent state") under such complex concurrency conditions
- All behaviors are explicable as a sequence of atomic actions on the state

Insert Instr in ROB:
- Put instruction in first available slot
- Increment tail pointer

Dispatch Instr:
- Mark instruction dispatched
- Get source operands: RF <or> prev instr
- Forward to appropriate unit

Write Back Results to ROB:
- Write back results to instr result
- Write back to all waiting tags
- Set to done

Branch Resolution:
- …

Commit Instr:
- Write results to register file (or allow memory write for store)
- Set to Empty
- Increment head pointer
- …
Performance Semantics
Another processor example

Rule-based Specifications
[Diagram: pipeline stages IF → Dec → Exe → Mem → Wb, connected by FIFOs bI, bD, bE, bW, with bypasses; register file RF (e.g. R1 = 5 after R1 = 2 + 3 executes); iMem and dMem.]
Each pipeline stage is described as a set of atomic rules:

rule Execute Add: when (bD.first == (Ri = va + vb)) ==>
   begin
      result = va + vb;       // compute addition
      bE.enq (Ri = result);   // enqueue result into bE
      bD.deq;                 // dequeue instruction from bD
   end

Any legal behavior can be understood in terms of applying one rule at a time
Performance Concerns
- The designer wants to make sure that one instruction executes every cycle
- FIFOs must support both enq and deq in each cycle
[Diagram: the pipeline with instructions I0 … I5 in flight through the stages and FIFOs.]
What are the semantics of FIFOs?
Example from a commercially available FIFO IP component
[Diagram: ports data_in, push_req_n, pop_req_n, clk, rstn; outputs data_out, full, empty.]
These constraints are taken from several paragraphs of documentation, spread over many pages, interspersed with other text
A FIFO interface in BSV

interface FIFOQueue #(type aType);
   method Action              push (aType val);
   method aType               first();
   method ActionValue#(aType) pop();
   method Action              clear();
endinterface
Methods as ports
[Diagram: the FIFOQueue module's methods appear as port groups.]
- push: n-bit argument; has side effect (action); rdy = not full; enab
- pop: n-bit result; has side effect (action); rdy = not empty; enab
- first: n-bit result; has no side effect; rdy = not empty
- clear: no argument; has side effect (action); rdy always true; enab
FIFO semantics: types

interface FIFOQueue #(type aType);
   method Action              push (aType val);
   method aType               first();
   method ActionValue#(aType) pop();
   method Action              clear();
endinterface
FIFO semantics: laws
- Algebra of enq, deq, etc.
- Not at present part of BSV
  - though SVA assertions are
- Needed for formal verification work
- So far, all in atomic-action world
FIFO semantics: performance
"Must support enq and deq in same cycle"
Naïve implementation (ignoring wraparound):

Bool not_empty = (iptr != optr);
Bool not_full  = (iptr+1 != optr);

method Action enq(x) if (not_full);
   buf[iptr] <= x;
   iptr <= iptr + 1;
endmethod

method ActionValue deq() if (not_empty);
   optr <= optr + 1;
   return (buf[optr]);
endmethod

[Diagram: circular buffer with input pointer iptr and output pointer optr.]
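For reference, here is a hypothetical Python model of this pointer scheme (the class name mirrors the interface above, but this is a software sketch, not the BSV library), with the wraparound the slide ignores put back in. Guards become assertions: enq blocks when full, deq when empty. Note that with this `iptr+1 != optr` encoding, a buffer of n slots holds at most n-1 elements.

```python
# Hypothetical Python model of the naive circular-buffer FIFO.
# Guards (the methods' implicit conditions) are modelled as assertions.

class FIFOQueue:
    def __init__(self, n):
        self.buf = [None] * n
        self.iptr = 0          # input (enqueue) pointer
        self.optr = 0          # output (dequeue) pointer

    def not_full(self):        # iptr+1 != optr, modulo buffer size
        return (self.iptr + 1) % len(self.buf) != self.optr

    def not_empty(self):       # iptr != optr
        return self.iptr != self.optr

    def enq(self, x):
        assert self.not_full(), "guard failed: FIFO full"
        self.buf[self.iptr] = x
        self.iptr = (self.iptr + 1) % len(self.buf)

    def deq(self):
        assert self.not_empty(), "guard failed: FIFO empty"
        x = self.buf[self.optr]
        self.optr = (self.optr + 1) % len(self.buf)
        return x
```

This models only the functional (atomic-action) semantics; whether enq and deq may fire in the *same cycle* is exactly the question the next slide's table answers.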
Semantics: concurrency

             Empty            Neither                    Full
Naïve        can't deq        enq and deq conflict       can't enq
Normal       can't deq        enq and deq conflict-free  can't enq
Turbo        can't deq        enq + deq                  deq before enq
Bypass       enq before deq   enq + deq                  can't enq
Super        enq before deq   enq + deq                  deq before enq
(not yet)
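The "deq before enq" and "enq before deq" entries describe which method is scheduled first within one atomic cycle. A hypothetical one-element sketch (illustrative names; real BSV FIFOs have more state) of the two interesting rows:

```python
# Hypothetical one-element FIFO sketch of two table rows:
# "deq before enq" (Turbo-style, when full) and
# "enq before deq" (Bypass-style, when empty), within one atomic cycle.

def turbo_cycle(slot, enq_val):
    """Full FIFO: deq fires first, freeing the slot, so enq can follow
    in the same cycle."""
    assert slot is not None        # guard: must be full to deq
    out = slot                     # deq before ...
    slot = enq_val                 # ... enq
    return slot, out

def bypass_cycle(slot, enq_val):
    """Empty FIFO: enq fires first, filling the slot, so deq can follow
    in the same cycle (the value 'bypasses' the storage)."""
    assert slot is None            # guard: must be empty to enq
    slot = enq_val                 # enq before ...
    out = slot                     # ... deq
    return None, out               # empty again after the deq
```

Either way, the cycle's effect is explicable as the two methods run one-at-a-time in the stated order, so atomicity is preserved.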
A problem with "super"
[Diagram: the push side (rdy = not full, enab, n-bit data) and the pop side (rdy = not empty, enab, n-bit data); each side's readiness now depends on what the other side does in the same cycle.]
So what are the semantics of FIFOs?
- Performance requirements are sometimes an essential part of (interface) specification
  - pipeline efficiency
- Sometimes not
- Allow extra optional attributes on interface definitions?
  - take care not accidentally to re-invent inheritance (à la OOP) badly
Semantics
- Maybe best to separate functional and performance semantics
  - so functional semantics can be handled entirely in TRS (atomic action) framework
- Care: must not rely on performance semantics to achieve functionally necessary concurrency
  - if actions must be simultaneous, compose them in the same action; don't put them in separate rules
The processor example
Composing Atomic Rules with Performance Guarantees (MIT PhD: Rosenband 2005)

Performance Concerns
- The designer wants to make sure that one CPU instruction executes every cycle
- Suppose the designer can specify which rules (if enabled) should execute within a clock cycle, and in which order:
    Wb x Mem x Exe x Dec x IF
- No change in rules
[Diagram: the pipeline IF → Dec → Exe → Mem → Wb, with instructions in flight in the stages and FIFOs.]
- Correctness of hardware guaranteed by the compiler
- Order is important to minimize FIFO size and achieve optimal bypassing
Experiments in scheduling
What happens if the user specifies:
    Wb x Wb x Mem x Mem x Exe x Exe x Dec x Dec x IF x IF
- No change in rules
- a superscalar processor!
[Diagram: the same pipeline, now with two instructions per stage in flight.]
Executing 2 instructions per cycle requires more resources but is functionally equivalent to the original design
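The functional-equivalence claim can be illustrated on a toy two-stage pipeline. This hypothetical Python sketch (illustrative names; not the Rosenband scheduler) fires the downstream rule before the upstream one, either once or twice per cycle, and checks that the results are identical while the 2x schedule finishes in fewer cycles.

```python
# Hypothetical check that a "2x" schedule is functionally equivalent:
# a two-stage pipeline (fetch -> exec) run with one or two rule firings
# per cycle produces the same results from the same program.
from collections import deque

def run_pipeline(prog, firings_per_cycle):
    src, mid, out, cycles = deque(prog), deque(), [], 0
    while len(out) < len(prog):
        cycles += 1
        for _ in range(firings_per_cycle):     # the user-specified schedule
            if mid:                            # rule "exec": stage FIFO -> result
                out.append(mid.popleft() * 2)
            if src:                            # rule "fetch": program -> stage FIFO
                mid.append(src.popleft())
    return out, cycles
```

Downstream-first ordering within the cycle mirrors the `Wb x Mem x Exe x Dec x IF` schedule above: each stage drains before the previous stage refills it, so a one-element FIFO suffices.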
4-Stage Processor Results

Design     Benchmark  Area 10ns  Timing 10ns  Area 2ns  Timing 2ns
           (cycles)   (µm²)      (ns)         (µm²)     (ns)

1-element fifo:
No Spec    18525      24762      5.85         26632     2.00
Spec 1     11115      25094      6.83         33360     2.00
Spec 2     11115      25264      6.78         34099     2.04

2-element fifo:
No Spec    18525      32240      7.38         39033     2.00
Spec 1     11115      32535      8.38         47084     2.63
Spec 2     7410       45296      9.99         62649     4.72

benchmark: a program containing additions / jumps / loadc's
Dan Rosenband & Arvind 2004
A superstructure based on disguised Haskell

Implementing BSV: two-level compilation

BSV source program
   ↓ Level 1: Elaboration
     • Desugaring
     • Type-checking
     • Massive partial evaluation and static elaboration
Term Rewriting System (rules and actions)
   ↓ Level 2: Synthesis
     • Scheduling
     • Control logic generation
     • Object code generation
Object code (Verilog RTL or C)
Bluespec Classic: a Haskell-based HDL

package Shift(shift) where
import List

sstep :: Bit m -> Bit n -> Nat -> Bit n
sstep s x i when s[i:i] == 1 = x << (1 << i)
sstep s x i = x

shift :: Bit m -> Bit n -> Bit n
shift s x = foldl (sstep s) x
                  (map fromInteger (upto 0 ((valueOf m) - 1)))

Lennart Augustsson (2000)
Selling BS Classic
- Unfamiliar syntax a significant barrier
  - even in marketing slides
  - even ()s in function calls are different!
- Many fronts in adoption war
  - new hardware design methodology
  - new unfamiliar syntax
  - new type system
  - new purely functional thinking
  - new FP concepts (map, fold, monads)
Adapt an existing HDL
- Map matching concepts
  - expressions, bit vectors, functions, modules
- Extend where straightforward
  - higher-order functions, first-class objects, polymorphism
- Standardize where possible
  - tagged unions, pattern matching (SV 3.1a)
- Desugar where required
  - imperative assignments, loops
Bluespec Classic: a Haskell-based HDL

package Shift(shift) where
import List

sstep :: Bit m -> Bit n -> Nat -> Bit n
sstep s x i when s[i:i] == 1 = x << (1 << i)
sstep s x i = x

shift :: Bit m -> Bit n -> Bit n
shift s x = foldl (sstep s) x
                  (map fromInteger (upto 0 ((valueOf m) - 1)))
Bluespec SystemVerilog: FP with Verilog Syntax

function Bit#(n) sstep(Bit#(m) s, Bit#(n) x, Nat i);
   if (s[i] == 1)
      return (x << (1 << i));
   else
      return x;
endfunction

function Bit#(n) shift(Bit#(m) s, Bit#(n) x);
   return (foldl(sstep(s), x,
                 map(fromInteger, upto(0, valueof(m) - 1))));
endfunction
Bluespec SystemVerilog: Imperative circuit construction

function Bit#(n) shift(Bit#(m) s, Bit#(n) x);
   Integer max = valueof(m);
   Bit#(n) xA [max+1];
   xA[0] = x;
   for (Integer j = 0; j < max; j = j + 1)
      if (s[fromInteger(j)] == 1)
         xA[j+1] = xA[j] << (1 << fromInteger(j));
      else
         xA[j+1] = xA[j];
   return xA[max];
endfunction
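Both formulations describe the same barrel shifter. A hypothetical Python transliteration (with `functools.reduce` playing `foldl`; unbounded Python ints, so no `Bit#(n)` truncation) checks that the fold and the loop agree, and that both compute `x << s`:

```python
# Hypothetical transliteration of the two shift formulations.
from functools import reduce

M = 4                      # shift-amount width (the type parameter m)

def sstep(s, x, i):
    """Shift x left by 2**i iff bit i of the shift amount s is set."""
    return x << (1 << i) if (s >> i) & 1 else x

def shift_fold(s, x):
    """The foldl formulation: fold sstep over the bit indices 0..M-1."""
    return reduce(lambda acc, i: sstep(s, acc, i), range(M), x)

def shift_loop(s, x):
    """The imperative-loop formulation: same circuit, built with a for loop."""
    xa = x
    for j in range(M):
        if (s >> j) & 1:
            xa = xa << (1 << j)
    return xa
```

Both unroll into the same chain of M conditional shift stages; the loop version is the fold with its accumulator written out by hand.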
Teaching BSV
- Limited training time (2-4 days typical)
- Audience: hardware designers
  - little or no FP background
  - wires and registers, not abstractions
  - conservative (remember cost of mistakes?)
- Format: lectures interspersed with labs
- Need to communicate basics
  - or else evaluation project might be hard
- Want to show full range of features
  - or else benefits not perceived and no sale
Teaching conclusions
- Functional features are "advanced"
  - get in the way of communicating basics
- Strict typing seen as restrictive
  - bit vs. Bool
  - bit-width constraints
  - structures vs. bit representations
- Standards less relevant when teaching
  - damn the torpedoes and teach the sugar
- Key challenge: build intuition about generated hardware
Time-released Haskell

FIFO#(Nat) f();
mkSizedFIFO#(depth) the_f(f);
   ⇒ FIFO#(Nat) f <- mkSizedFIFO(depth);

Tuple2#(Nat,Bool) nb = foo(x);
   ⇒ let nb = foo(x);
Compiling FP into hardware

Implementing BSV: two-level compilation

BSV source program
   ↓ Level 1: Elaboration
     • Desugaring
     • Type-checking
     • Massive partial evaluation and static elaboration
Term Rewriting System (rules and actions)
   ↓ Level 2: Synthesis
     • Scheduling
     • Control logic generation
     • Object code generation
Object code (Verilog RTL or C)

- For historical reasons, the level-one evaluator is lazy
- Is this still a good idea as the language becomes more imperative?
Laziness is hard work
- Performance is a challenge
  - graph reduction required to avoid duplicating work
- Non-strict primitives (if, and, or) require careful handling
  - symmetric short-circuiting
  - undetermined values must be propagated correctly
- Error messages can be confusing
  - "Compile-time expression did not evaluate"
Being lazy pays off
Consider: let z = x + y
Is this:
- a static constant?
- a fixed incrementer?
- a full adder?
A lazy evaluator does not care!
- evaluates what it can
- defers (or suspends) what it can't
User benefit: can move freely between static and dynamic code (?)
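The constant-vs-adder point can be illustrated with a tiny partial evaluator. This is a hypothetical Python sketch (not the Bluespec elaborator; `add` and the `"adder"` tag are illustrative): fully constant operands fold away at elaboration time, while anything symbolic leaves residual structure behind, which is what would become hardware.

```python
# Hypothetical partial evaluator for z = x + y.
# Constants fold away at elaboration time; symbolic operands leave
# residual structure (the hardware) behind.

def add(x, y):
    if isinstance(x, int) and isinstance(y, int):
        return x + y               # static constant: no hardware at all
    return ("adder", x, y)         # residual: an adder (or incrementer)

print(add(2, 3))          # folds to the constant 5
print(add("wire_x", 1))   # residual structure: an incrementer on wire_x
```

The same source expression yields a constant, an incrementer, or a full adder depending only on how much is known statically, which is why the evaluator itself need not care.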
Conclusions
- FP cultural heritage contains lots of goodies
  - TRS: semantic foundation
  - Typing: unexpected convenience
  - …
- Hardware designers can enjoy the fruits without realising what they're using …
- … till later, perhaps

The End