High-level Specification and
Efficient Implementation of
Pipelined Circuits
Maria-Cristina Marinescu
Martin Rinard
Laboratory for Computer Science
Massachusetts Institute of Technology
Overall Goal
• Starting Point: Modular, Asynchronous, Sequential Specification
• Ending Point: Efficient, Synchronous, Parallel Implementation in Synthesizable Verilog
Specification Language
Concepts
• State (Registers, Memory)
• Queues (Conceptually Unbounded Length)
• Modules
• Read inputs from queues and state
• Write outputs to queues and state
Module Example
• Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
• Input Queue: <jz r0>, <inc r1>
• Register Operand Fetch Module
• Output Queue: <inc r2 100>, <inc r3 84>
Module Example
• Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
• Input Queue: <jz r0>
• Register Operand Fetch Module: <inc r1> (requests r1 from Register File)
• Output Queue: <inc r2 100>, <inc r3 84>
Module Example
• Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
• Input Queue: <jz r0>
• Register Operand Fetch Module: <inc r1 43> (Register File returns 43 for r1)
• Output Queue: <inc r2 100>, <inc r3 84>
Module Example
• Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
• Input Queue: <jz r0>
• Register Operand Fetch Module
• Output Queue: <inc r1 43>, <inc r2 100>, <inc r3 84>
Module Behavior
• Each module has a set of update rules
• Each Update Rule Consists of
• Precondition
• Action (set of updates)
• Rule is enabled (and can execute) if
precondition is true in current state
• When rule executes, atomically applies
updates in action to produce new state
Update Rules in Example
“If an increment instruction is at the head of the input
queue and there is no RAW hazard, then atomically
remove the instruction from the queue, fetch the value
from the register file, and append the instruction with
the register value into the output queue”
<INC r> = head(iq) and notin(oq, <INC r _>) ->
    iq = tail(iq), oq = append(oq, <INC r rf[r]>);
“If a jump on zero instruction is at the head of the input
queue and there is no RAW hazard, then atomically
remove the instruction from the queue, fetch the value
from the register file, and append the instruction with
the register value into the output queue”
<JZ r l> = head(iq) and notin(oq, <INC r _>) ->
    iq = tail(iq), oq = append(oq, <JZ rf[r] l>);
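The two update rules above can be sketched in Python as precondition/action pairs. This is only an illustrative encoding, not the authors' tool: the tuple representation of instructions, the function names, and the jump label 7 are assumptions made for the example. Queues are deques whose leftmost element is the head.

```python
from collections import deque

def make_state():
    # Hypothetical encoding of the Module Example: instructions are tuples.
    return {
        "rf": {0: 0, 1: 43, 2: 100, 3: 84},
        "iq": deque([("INC", 1), ("JZ", 0, 7)]),          # head is ("INC", 1)
        "oq": deque([("INC", 3, 84), ("INC", 2, 100)]),   # head is ("INC", 3, 84)
    }

def raw_hazard(oq, r):
    """notin(oq, <INC r _>) is false: a pending increment still targets r."""
    return any(ins[0] == "INC" and ins[1] == r for ins in oq)

def inc_rule(s):
    """<INC r> = head(iq) and notin(oq, <INC r _>) -> update iq and oq."""
    if not s["iq"] or s["iq"][0][0] != "INC":
        return False
    r = s["iq"][0][1]
    if raw_hazard(s["oq"], r):
        return False
    s["iq"].popleft()                          # iq = tail(iq)
    s["oq"].append(("INC", r, s["rf"][r]))     # oq = append(oq, <INC r rf[r]>)
    return True

def jz_rule(s):
    """<JZ r l> = head(iq) and notin(oq, <INC r _>) -> update iq and oq."""
    if not s["iq"] or s["iq"][0][0] != "JZ":
        return False
    _, r, label = s["iq"][0]
    if raw_hazard(s["oq"], r):
        return False
    s["iq"].popleft()                          # iq = tail(iq)
    s["oq"].append(("JZ", s["rf"][r], label))  # oq = append(oq, <JZ rf[r] l>)
    return True
```

Each function returns whether its rule was enabled, and mutates the state only when it was, which mirrors the atomic "precondition, then action" reading of a rule.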
From Modules to Systems
• System is a set of Modules
• Access same Registers and Memories
• Also communicate via Queues
• Behavior of System
• Update rules from all Modules
• Queues Provide Modularity
• Decouple Modules
• Enable Independent Development
• Promote Reusable Modular Designs
Example System Specification
• Instruction Fetch Module
TRUE -> iq = append(iq, im[pc]), pc = pc + 1;
• Register Operand Fetch Module
<INC r> = head(iq) and notin(rq, <INC r _>) ->
    iq = tail(iq), rq = append(rq, <INC r rf[r]>);
<JZ r l> = head(iq) and notin(rq, <INC r _>) ->
    iq = tail(iq), rq = append(rq, <JZ rf[r] l>);
• Compute and Writeback Module
<INC r v> = head(rq) ->
    rf = rf[r = v+1], rq = tail(rq);
<JZ v l> = head(rq) and (v == 0) ->
    pc = l, iq = nil, rq = nil;
<JZ v l> = head(rq) and (v != 0) -> rq = tail(rq);
Abstract Model of Execution
• Conceptually, system execution is a
sequence of rule executions
• while TRUE
choose an enabled rule
execute rule
obtain new state
• Concepts in Abstract Execution Model
• Rules execute atomically
• Rules execute asynchronously
• Rules execute sequentially
• Unbounded Queues
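A minimal Python sketch of this abstract execution model, applied to the three-module example system, might look as follows. The data encoding, the helper names, and the random choice of an enabled rule are assumptions for illustration. One deliberate deviation is labeled in the code: the fetch rule is guarded by `pc in im` (rather than TRUE) so that a finite demo program terminates.

```python
import random
from collections import deque

def new_state(program):
    return {"pc": 0, "im": dict(enumerate(program)),
            "rf": {0: 0, 1: 0, 2: 0}, "iq": deque(), "rq": deque()}

def head_is(q, op):
    return bool(q) and q[0][0] == op

def no_raw(s):
    # notin(rq, <INC r _>) for the register named by the head of iq
    r = s["iq"][0][1]
    return not any(i[0] == "INC" and i[1] == r for i in s["rq"])

def fetch(s):
    s["iq"].append(s["im"][s["pc"]]); s["pc"] += 1

def inc_operand(s):
    _, r = s["iq"].popleft(); s["rq"].append(("INC", r, s["rf"][r]))

def jz_operand(s):
    _, r, l = s["iq"].popleft(); s["rq"].append(("JZ", s["rf"][r], l))

def inc_writeback(s):
    _, r, v = s["rq"].popleft(); s["rf"][r] = v + 1

def jz_taken(s):
    s["pc"] = s["rq"][0][2]; s["iq"].clear(); s["rq"].clear()

def jz_not_taken(s):
    s["rq"].popleft()

# (precondition, action) pairs; the first guard is an assumption for the demo.
RULES = [
    (lambda s: s["pc"] in s["im"], fetch),
    (lambda s: head_is(s["iq"], "INC") and no_raw(s), inc_operand),
    (lambda s: head_is(s["iq"], "JZ") and no_raw(s), jz_operand),
    (lambda s: head_is(s["rq"], "INC"), inc_writeback),
    (lambda s: head_is(s["rq"], "JZ") and s["rq"][0][1] == 0, jz_taken),
    (lambda s: head_is(s["rq"], "JZ") and s["rq"][0][1] != 0, jz_not_taken),
]

def run(state, seed=0, max_steps=1000):
    rng = random.Random(seed)
    for _ in range(max_steps):
        enabled = [action for pre, action in RULES if pre(state)]
        if not enabled:
            break
        rng.choice(enabled)(state)   # choose an enabled rule, execute it atomically
    return state
```

Because rules execute one at a time and queues are unbounded, the order in which enabled rules are chosen can change how far ahead the fetch module runs, but not the final register values for a terminating program.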
Synthesis Algorithm
Key Challenge
• Specification Language
• Sequential, atomic, asynchronous
semantics
• Conceptually unbounded queues
• Implemented Circuit
• Coordinated parallel execution
• Finite length queues
Initial Synthesis Algorithm
• Symbolically Execute Rules in Order
• Each rule starts with result from previous rule
• Obtain Expressions for New Values of
Registers, Memories, and Queues
• Generate Combinational Circuit that Produces
New Values
• Each clock cycle circuit computes new values,
writes new values back
• Every rule gets a chance to execute, every
clock cycle!
[Figure: chain of symbolic execution states SE0 -> Rule 1 -> SE1 -> Rule 2 -> SE2 -> Rule 3 -> SE3]
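The effect of chaining the rules within one clock cycle can be illustrated with concrete values instead of symbolic expressions. The sketch below is an assumption-laden toy, not the synthesis algorithm itself: the three rules, the queue names, and the doubling operation are hypothetical. It only shows that when each rule starts from the result of the previous rule, a value can pass through every stage in a single cycle.

```python
import copy
from collections import deque

def rule1(s):
    # Hypothetical source stage: move one item from src into q1.
    if s["src"]:
        s["q1"].append(s["src"].popleft())
        return True
    return False

def rule2(s):
    # Hypothetical middle stage: double the item and pass it on to q2.
    if s["q1"]:
        s["q2"].append(s["q1"].popleft() * 2)
        return True
    return False

def rule3(s):
    # Hypothetical sink stage: record the item.
    if s["q2"]:
        s["out"].append(s["q2"].popleft())
        return True
    return False

def clock_cycle(state, rules):
    """Apply every rule once, in order; each rule sees the previous rule's result."""
    new = copy.deepcopy(state)
    fired = [rule(new) for rule in rules]
    return new, fired
```

Starting from a state with items only in `src`, one call to `clock_cycle` fires all three rules, which is the long combinational path that motivates relaxation.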
Properties of Initial Algorithm
• Preserves Semantics of Specification
• Independent Rules Execute Concurrently
• But May Have Long Clock Cycle
• Output of each preceding rule fed in
as input to next rule
• Data traverses ALL rules (and pipeline
stages) in a single cycle!
• Solution: Relaxation
Relaxation
for each rule Ri with precondition Pi
    for each variable instance vi in precondition Pi
        replace vi with its earliest safe version
Rule sequence:
    ...
    Rk-1: Pk-1 -> vk = ...
    ...
    Ri : Pi(vi,...) -> ...
    ...
vk safe for vi if either
• Pi[vk/vi] implies Pi
• (Pi,Pk-1) mutually exclusive
[Figure: 0 1 2 3 => 0 1 3 2]
Relaxation Result
• Relaxation exposes additional parallelism
• Queues separate pipeline stages
• Items traverse one stage per clock cycle
• Safety: If a rule executes in new system
• Then it also executes in old system
• And it generates same result
• Liveness: After relaxation, all rules test initial
state
• If rule enabled in old system but not in new
system, then
• Some rule executes in new system
Global Scheduling
• Issue:
• Conceptually unbounded queues
• Finite hardware buffers
• Solution: Modify append rules s.t. no queue
exceeds its specified length
• Challenge:
• Schedule maximum number of rules
• Rules can insert into full queues if within
length at the end of clock cycle
Global Scheduling
• Assumption: queues start within length at
beginning of cycle
• Goal: generate circuit that makes queues
remain within length at end of cycle
• Basic Approach:
• Before an enabled rule executes
• Ensure there will be room for its results in its
output queues at end of clock cycle
• Key Idea: a rule can insert into a queue as
long as enough following rules remove from it
GS: Basic Concepts
• Rule-Queue Graph
• Nodes of 2 types: rules and queues
• Edge from rule node to queue node if rule
inserts into queue
• Edge from queue node to rule node if rule
removes from queue
• In Example:
[Figure: rule-queue graph with rule nodes 1-6 and queue nodes iq and rq]
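A rule-queue graph for the example system can be sketched with plain dictionaries, and Python's standard `graphlib` module can then produce a topological order of the rules or report a cycle. The rule names R0 to R5 (rules numbered in specification order) and the dictionary layout are assumptions for illustration. The taken-jump rule R4 is treated as removing from both queues because it resets iq and rq to nil.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical encoding: which queues each rule inserts into / removes from.
INSERTS = {"R0": {"iq"}, "R1": {"rq"}, "R2": {"rq"}}
REMOVES = {"R1": {"iq"}, "R2": {"iq"}, "R3": {"rq"},
           "R4": {"iq", "rq"}, "R5": {"rq"}}

def rule_queue_graph(inserts, removes):
    """Successor map: rule -> queue if it inserts, queue -> rule if it removes."""
    succ = {}
    for rule, queues in inserts.items():
        succ.setdefault(rule, set()).update(queues)
        for q in queues:
            succ.setdefault(q, set())
    for rule, queues in removes.items():
        succ.setdefault(rule, set())
        for q in queues:
            succ.setdefault(q, set()).add(rule)
    return succ

def rules_in_topological_order(inserts, removes):
    """Return the rule nodes in a topological order; raises CycleError if cyclic."""
    succ = rule_queue_graph(inserts, removes)
    rules = set(inserts) | set(removes)
    # TopologicalSorter expects node -> predecessors, so invert the edges.
    preds = {node: set() for node in succ}
    for node, outs in succ.items():
        for out in outs:
            preds[out].add(node)
    order = TopologicalSorter(preds).static_order()
    return [node for node in order if node in rules]
```

For the example system the graph is acyclic, so a topological order exists; a pair of rules that feed each other through two queues would raise `CycleError` instead.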
Acyclic Rule-Queue Graphs
• Process Rules in Topological Sort Order
• Augment execution precondition
• If rule inserts into a queue, require that either
• there is room in queue when rule executes or
• future rules will execute and remove items to
make room in queue
• Each queue has counter of number of
elements in queue at start of cycle
• Combinational logic tracks queue insertions
and deletions
• GS algorithm generates the control signals
for the combinational logic
Pipeline Implications
• Counter becomes presence bit for single
element queues
• Additional preconditions can be viewed as
pipeline stall logic
• Design can be written to generate pipeline
forwarding/bypassing instead of stall
Global Scheduling: Example
[Figure: symbolic execution of iq through the rules, showing queue versions IQ0, IQ1, IQ2, IQ3, IQ5, the preconditions P0, P1[IQ0/IQ1], P2[IQ0/IQ2], P4, and the values tail(IQ1) and nil]
• For length(iq) = 1, length(rq) = 1
• R0 executes and appends to iq if:
• P1’ || P2’ || P4’ OR
• iq0 = nil
• R4 doesn’t insert into queues
=> P4’ = P4
• Apply same rationale for R1 & R2:
R1 executes and appends to rq if:
• P4 || P3’ || P5’ OR
• rq0 = nil
• R3 and R5 don’t insert into queues
=> P3’ = P3, P5’ = P5
• GS1(rq) = GS2(rq) = (rq0 = nil) || P4 || P3 || P5
• GS0(iq) = (iq0 = nil) || P4 || (P1 || P2) && [(rq0 = nil) || P3 || P5] =
= (iq0 = nil) || P4 || P1 || P2
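The last simplification of GS0(iq) can be checked by brute force. The sketch below enumerates all truth assignments and compares the two forms. It rests on one stated assumption: since length(rq) = 1 and the only instructions are INC and JZ, rq is either empty or its head matches exactly one of P3, P4, P5. The function names are made up for the example.

```python
from itertools import product

def gs0_full(iq_empty, rq_empty, p1, p2, p3, p4, p5):
    # (iq0 = nil) || P4 || (P1 || P2) && [(rq0 = nil) || P3 || P5]
    return iq_empty or p4 or ((p1 or p2) and (rq_empty or p3 or p5))

def gs0_simplified(iq_empty, rq_empty, p1, p2, p3, p4, p5):
    # (iq0 = nil) || P4 || P1 || P2
    return iq_empty or p4 or p1 or p2

def rq_consistent(rq_empty, p3, p4, p5):
    # Assumption: exactly one of "rq empty", P3, P4, P5 holds at any time.
    return [rq_empty, p3, p4, p5].count(True) == 1

def forms_agree(require_consistency=True):
    for vals in product([False, True], repeat=7):
        iq_empty, rq_empty, p1, p2, p3, p4, p5 = vals
        if require_consistency and not rq_consistent(rq_empty, p3, p4, p5):
            continue
        if gs0_full(*vals) != gs0_simplified(*vals):
            return False
    return True
```

Without the consistency assumption the two forms differ, which shows that the simplification depends on the cases for rq being exhaustive.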
Cyclic Rule-Queue Graphs
• Cyclic Graphs lead to Cyclic Dependences
• Rule 1 depends on rule 2 to remove an
item from a queue
• But rule 2 depends on rule 1 to remove
an item from another queue
[Figure: cycle rule 1 -> Queue x -> rule 2 -> Queue y -> rule 1]
• Algorithm from acyclic case would generate
recursive preconditions
Cyclic R-Q Graphs: Example
• Let P1’ = P1 && GS1
• Assumption: R1 executes (P1’ = TRUE)
• Find group of rules that must fire together
• P1’ = P1 && [(x=nil) || P2’] =
= P1 && [(x=nil) || P2 && [(y=nil) || P1’]]
[Figure: cycle rule 1 -> Queue x -> rule 2 -> Queue y -> rule 1]
• No need to explore P1’ further (P1’ = TRUE) =>
P1’ = P1 && [(x=nil) || P2]
Solution to Cyclic Dependence
Problem
• Key Idea: no deadlock if we can coordinate
removals and insertions from/to all queues in
cycle s.t. removals make room for insertions
• Groups of rules must execute together
• Use depth-first search on rule-queue
graph to find cyclic groups
• Augment preconditions to allow all rules in
cycle to execute together
• Extensions include paths into and out of
cyclic group
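The depth-first search for a cyclic group can be sketched as follows. This is a simplified illustration under assumed names: the graph is a successor map over rule and queue nodes, and the function returns the rules on one cycle through a chosen start rule. It does not handle the extensions for paths into and out of the cyclic group.

```python
def cyclic_group(succ, rules, start):
    """Return the rules on some cycle through `start`, or [] if there is none."""
    path, visited = [], {start}

    def dfs(node):
        path.append(node)
        for nxt in sorted(succ.get(node, ())):
            if nxt == start:
                return True          # closed a cycle back to the start rule
            if nxt not in visited:
                visited.add(nxt)
                if dfs(nxt):
                    return True
        path.pop()                   # dead end: node is not on a cycle via this path
        return False

    # dfs runs first; on success `path` holds the nodes along the cycle.
    return [n for n in path if n in rules] if dfs(start) else []

# Hypothetical two-rule cycle: rule 1 -> Queue x -> rule 2 -> Queue y -> rule 1.
CYCLIC = {"rule 1": {"x"}, "x": {"rule 2"}, "rule 2": {"y"}, "y": {"rule 1"}}
```

The rules returned for a cycle are the ones whose preconditions would need to be augmented so that they can execute together.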
Cyclic R-Q Graphs: Algorithm
SymbolicExecution(Ri, CrtPath)
    for each queue q that Ri inserts into
        for each rule Rj that inserts/removes in/from q
            newRj = if Rj ∈ CrtPath
                    then TRUE    (rule already examined)
                    else SymbolicExecution(Rj)
            newCrtPath = if Rj ∈ CrtPath
                         then CrtPath
                         else CrtPath ∪ Rj
            replace Rj’ with newRj in GSi(q)
    GSi = AND over all queues q of GSi(q)
    Ri’ = Ri && GSi
Symbolic Execution
• Substitute out all intermediate versions
of variables
• Obtain expression for last version of
each variable
• Each expression defines new value of
corresponding variable
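Substituting out intermediate versions can be sketched on a tiny expression representation. Everything here is an assumption for illustration: expressions are nested tuples such as `("+", "x0", 1)`, variable versions are named with a trailing number (`x0`, `x1`, ...), and version 0 denotes the value at the start of the clock cycle.

```python
def substitute(expr, env):
    """Replace versioned variable names in a nested-tuple expression using env."""
    if isinstance(expr, str):
        return env.get(expr, expr)   # operators and initial versions stay as-is
    if isinstance(expr, tuple):
        return tuple(substitute(e, env) for e in expr)
    return expr                      # constants

def symbolic_execute(assignments):
    """assignments: ordered (versioned_name, expr) pairs.

    Returns, for each base variable, the expression of its last version
    written purely in terms of initial (version 0) values.
    """
    env, last = {}, {}
    for name, expr in assignments:
        env[name] = substitute(expr, env)
        last[name.rstrip("0123456789")] = env[name]
    return last

# Hypothetical straight-line sequence of versioned updates.
EXAMPLE = [
    ("x1", ("+", "x0", 1)),
    ("y1", ("*", "x1", 2)),
    ("x2", ("if", "c0", "y1", "x1")),
]
```

In this toy form, the expression returned for each variable plays the role of the combinational logic that computes its new value from the register contents at the start of the cycle.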
Optimizations
• Optimize expressions from symbolic execution
• CSE: avoid unnecessary replication of HW
• Mutual Exclusion Testing:
• Eliminate computation of values that
never occur in practice as result of
mutually exclusive preconditions
Verilog Generation
• Synthesize HW directly from expressions:
• Each queue as one or more registers
• Each memory variable as library block
• Each state variable as one or more
registers, depending on type
• Each expression as combinational logic
that feeds back into corresponding
registers
Experimental Results
• We have implemented synthesis system
• Used system to generate synthesizable
Verilog for several specifications
Architecture                 Cycle (MHz)    Area
RISC Pipelined Processor     88.89          23195.25
SCU RTL 98 DSP               90.91          22999.50

Benchmark      Cycle (MHz)    Area
Bubblesort     107.06         5434
Butterfly      104.42         5411
Filter         105.01         3757

(map effort medium, area effort low)
Conclusion
• Starting Point: (Good for Designer)
Modular, Asynchronous, Sequential
Specification with Conceptually Infinite
Queues
• Ending Point: (Good for Implementation)
Efficient, Synchronous, Globally
Scheduled, Parallel Implementation with
Finite Queues in Synthesizable Verilog
• Variety of Techniques:
• Symbolic Execution
• Global Scheduling