Pipelining and Retiming

Prepared by Mark Jarvin
Agenda



Synchronous circuit retiming
Pipelining
Software pipelining
The Retiming Problem: Example
[Figure: combinational circuit with inputs a, b, c and output y; gate delays 1, 1, 2, 1]
D = 4
T = 4
Latency = 4
Throughput = 4

How can this be improved? Pipelining?
The Retiming Problem: Example
[Figure: circuit with inputs a, b, c and output y after pipelining; gate delays 1, 2, 1, 1]
Latency = 6
Throughput = 3

Delay is not balanced
This can still be improved
The Retiming Problem: Example
[Figure: circuit with inputs a, b, c and output y after retiming; gate delays 1, 1, 2, 1]
Latency = 4
Throughput = 2

Now, delay is balanced
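The latency and throughput figures on the three example slides can be reproduced with a small calculation. This is a minimal sketch, assuming the clock period is set by the slowest stage, that latency is one clock period per stage, and that the slides quote throughput as the time between results (so it equals the clock period). The stage delays [3, 1] and [2, 2] are assumptions chosen to match the quoted numbers; they are not read from the figures.

```python
def pipeline_timing(stage_delays):
    """Return (clock_period, latency) for a list of per-stage combinational delays."""
    period = max(stage_delays)             # the slowest stage sets the clock
    latency = period * len(stage_delays)   # one clock period per stage
    return period, latency

# Stage delays are assumed for illustration; the total gate delay is 4 in each case.
unpipelined = pipeline_timing([4])     # no internal registers
unbalanced = pipeline_timing([3, 1])   # registers placed unevenly
balanced = pipeline_timing([2, 2])     # registers moved so both stages match

print(unpipelined, unbalanced, balanced)
```

Under these assumptions the balanced split halves the clock period without adding latency, which is the improvement retiming is after.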
Observations


Some basic transformations can be used for cycle time reduction
The retiming transformation moves registers across gates
Observations


Levelization doesn’t help
Only useful for acyclic circuits
Naïve Algorithm
while ( not timed ) {
    pick a candidate gate;
    apply retiming transformation;
    do timing analysis;
}
Questions

Can we apply retiming in batch mode, i.e., simultaneously on all gates?
Can we make sure the retimed circuit is optimal?
Can we achieve this in polynomial time?
Retiming Circuit Model
$G = (V, E, d, w)$
$V$: gates and primary inputs
$E \subseteq V \times V$
$d : V \to \mathbb{R}_{\ge 0}$: delay of gates
$w : E \to \mathbb{Z}_{\ge 0}$: # registers between 2 gates
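One way to hold this model in code is a pair of dictionaries, one for d and one for w. The sketch below is illustrative only: the class name `RetimingGraph`, its methods, and the three-vertex instance are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass

@dataclass
class RetimingGraph:
    delay: dict   # d: vertex -> gate delay
    weight: dict  # w: (u, v) -> number of registers on edge u -> v

    @property
    def vertices(self):
        return set(self.delay)

    @property
    def edges(self):
        return set(self.weight)

    def validate(self):
        """Check that E is a subset of V x V and that d and w are non-negative."""
        for (u, v), regs in self.weight.items():
            if u not in self.delay or v not in self.delay:
                raise ValueError(f"edge ({u}, {v}) has an endpoint outside V")
            if not isinstance(regs, int) or regs < 0:
                raise ValueError("register counts must be non-negative integers")
        if any(d < 0 for d in self.delay.values()):
            raise ValueError("gate delays must be non-negative")
        return True

# Hypothetical instance: an input vertex i (delay 0) feeding gates x and y.
g = RetimingGraph(
    delay={"i": 0, "x": 2, "y": 3},
    weight={("i", "x"): 0, ("x", "y"): 1, ("i", "y"): 0},
)
print(g.validate(), sorted(g.vertices), len(g.edges))
```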
Retiming Circuit Model: Example 1
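[Figure: circuit with inputs a, b, c and gates x (delay 2) and y (delay 3), shown beside its graph. Vertices: $v_i$ (delay 0), $v_x$ (delay 2), $v_y$ (delay 3). Edge weights: $w_{ix} = 0$, $w_{xy} = 1$, $w_{iy} = 0$]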
Retiming Circuit Model: Example 1
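[Figure: the same circuit with the register moved to the input of gate x, shown beside its graph. Vertices: $v_i$ (delay 0), $v_x$ (delay 2), $v_y$ (delay 3). Edge weights: $w_{ix} = 1$, $w_{xy} = 0$, $w_{iy} = 0$]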
Retiming Circuit Model: Example 2
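[Figure: circuit with gates a through h, shown beside its graph. Vertex delays: $v_h$ = 0; $v_a$, $v_b$, $v_c$, $v_d$ = 3; $v_e$, $v_f$, $v_g$ = 7. Edges with weight 1: $v_h \to v_a$, $v_a \to v_b$, $v_b \to v_c$, $v_c \to v_d$. Edges with weight 0: $v_d \to v_e$, $v_e \to v_f$, $v_f \to v_g$, $v_g \to v_h$, $v_a \to v_g$, $v_b \to v_f$, $v_c \to v_e$]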
Metrics

Path delay:
$d(p) = d(v_i, \ldots, v_j) = \sum_k d_k$ (sum over the vertices on $p$)
Path weight:
$w(p) = w(v_i, \ldots, v_j) = \sum_k w_k$ (sum over the edges on $p$)
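Both metrics are plain sums, which a short sketch makes concrete. The graph below is an assumed reading of Example 2 (the edge list is inferred, not copied from the figure), and the helper names `path_delay` and `path_weight` are hypothetical.

```python
# Example 2 graph; the edge list is an assumption about the figure.
DELAY = {"a": 3, "b": 3, "c": 3, "d": 3, "e": 7, "f": 7, "g": 7, "h": 0}
WEIGHT = {
    ("h", "a"): 1, ("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
    ("d", "e"): 0, ("e", "f"): 0, ("f", "g"): 0, ("g", "h"): 0,
    ("a", "g"): 0, ("b", "f"): 0, ("c", "e"): 0,
}

def path_delay(path):
    """d(p): sum of the gate delays of every vertex on the path."""
    return sum(DELAY[v] for v in path)

def path_weight(path):
    """w(p): sum of the register counts of every edge on the path."""
    return sum(WEIGHT[(u, v)] for u, v in zip(path, path[1:]))

p = ["a", "b", "c", "e"]
print(path_delay(p), path_weight(p))  # 16 2
```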
Metrics

Define weight and delay metrics for any given vertex pair:
$W(u, v) = \min_{p : u \leadsto v} w(p)$
$D(u, v) = \max_{p : u \leadsto v,\ w(p) = W(u, v)} d(p)$
Both quantities are undefined if there is no path p from u to v
W (D) Matrix for Example 2
Rows: source vertex u; columns: destination vertex v; entries: W(u,v) (D(u,v))

W (D)   a        b        c        d        e        f        g        h
a       0 (3)    1 (6)    2 (9)    3 (12)   2 (16)   1 (13)   0 (10)   0 (10)
b       1 (20)   0 (3)    1 (6)    2 (9)    1 (13)   0 (10)   0 (17)   0 (17)
c       1 (27)   2 (30)   0 (3)    1 (6)    0 (10)   0 (17)   0 (24)   0 (24)
d       1 (27)   2 (30)   3 (33)   0 (3)    0 (10)   0 (17)   0 (24)   0 (24)
e       1 (24)   2 (27)   3 (30)   4 (33)   0 (7)    0 (14)   0 (21)   0 (21)
f       1 (17)   2 (20)   3 (23)   4 (26)   3 (30)   0 (7)    0 (14)   0 (14)
g       1 (10)   2 (13)   3 (16)   4 (19)   3 (23)   2 (20)   0 (7)    0 (7)
h       1 (3)    2 (6)    3 (9)    4 (12)   3 (16)   2 (13)   1 (10)   0 (0)
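The matrix can be recomputed with an all-pairs shortest-path pass. This is a sketch of one standard approach, not necessarily the method used to build the slide: each edge u -> v gets the ordered pair (w(e), -d(u)), pairs are compared lexicographically, and Floyd-Warshall is run on them. The edge list is an assumed reading of the Example 2 figure, and `compute_wd` is a hypothetical helper name. It also assumes every cycle carries at least one register.

```python
from itertools import product

# Example 2 graph; the edge list is an assumption about the figure.
DELAY = {"a": 3, "b": 3, "c": 3, "d": 3, "e": 7, "f": 7, "g": 7, "h": 0}
WEIGHT = {
    ("h", "a"): 1, ("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
    ("d", "e"): 0, ("e", "f"): 0, ("f", "g"): 0, ("g", "h"): 0,
    ("a", "g"): 0, ("b", "f"): 0, ("c", "e"): 0,
}
INF = float("inf")

def compute_wd(delay, weight):
    """All-pairs W and D via Floyd-Warshall on lexicographic (weight, -delay) pairs."""
    # best[(u, v)] = (registers, minus the delay of every path vertex except v)
    best = {(u, v): (INF, 0) for u, v in product(delay, repeat=2)}
    for v in delay:
        best[(v, v)] = (0, 0)                       # the empty path
    for (u, v), w in weight.items():
        best[(u, v)] = min(best[(u, v)], (w, -delay[u]))
    for k, i, j in product(delay, repeat=3):        # k varies slowest (outer loop)
        (w1, d1), (w2, d2) = best[(i, k)], best[(k, j)]
        if w1 < INF and w2 < INF and (w1 + w2, d1 + d2) < best[(i, j)]:
            best[(i, j)] = (w1 + w2, d1 + d2)
    W = {uv: w for uv, (w, _) in best.items() if w < INF}
    D = {(u, v): delay[v] - nd for (u, v), (w, nd) in best.items() if w < INF}
    return W, D

W, D = compute_wd(DELAY, WEIGHT)
print(W[("a", "e")], D[("a", "e")])  # 2 16
```

Negating the delay lets a single minimisation pick the fewest registers first and, among ties, the largest delay.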
The Retiming Transformation

How do we represent retiming?


How does it affect G?
Informally:



The transformation is fundamentally moving registers across gates
Represent it as the number of registers to push from a gate’s outputs to its inputs
Define this number for all gates
The Retiming Transformation

Definition: a retiming of a network $G = (V, E, d, w)$ is an integer-valued vertex labelling $r : V \to \mathbb{Z}$ that transforms $G$ into $G' = (V, E, d, w')$, where for each edge $(v_i, v_j) \in E$:
$w'_{ij} = w_{ij} + r_j - r_i$
The Retiming Transformation
[Figure: graph with vertices $v_i$ (delay 0), $v_x$ (delay 2), $v_y$ (delay 3), before and after retiming]
Initially:
$w_{ix} = 0,\ w_{xy} = 1,\ w_{iy} = 0$
Apply retiming:
$r_i = 0,\ r_x = 1,\ r_y = 0$
Finally:
$w'_{ix} = 1,\ w'_{xy} = 0,\ w'_{iy} = 0$

Note: retiming will change the number of registers in general, but not the number of registers in a given cycle
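The edge update is a one-line rule, sketched below on the three-vertex example. The function name `retime` is hypothetical, and the two-vertex cycle used to illustrate the note about cycles is an invented instance, not one from the slides.

```python
def retime(weight, r):
    """Return w'[(i, j)] = w[(i, j)] + r[j] - r[i] for every edge (i, j)."""
    return {(i, j): w + r[j] - r[i] for (i, j), w in weight.items()}

# Three-vertex example: r_x = 1 pulls one register from x's output to its input.
w = {("i", "x"): 0, ("x", "y"): 1, ("i", "y"): 0}
r = {"i": 0, "x": 1, "y": 0}
w_new = retime(w, r)
print(w_new)

# Hypothetical two-vertex cycle: the r terms telescope around the cycle,
# so the register total on the cycle is unchanged.
cycle = {("u", "v"): 1, ("v", "u"): 2}
cycle_new = retime(cycle, {"u": 0, "v": 1})
print(sum(cycle.values()), sum(cycle_new.values()))
```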
Legal and Feasible Retiming

A retiming is legal if the retimed network doesn’t contain negative weights:
$w'_{ij} = w_{ij} + r_j - r_i \ge 0$
$r_j - r_i \ge -w_{ij} \;\Rightarrow\; r_i - r_j \le w_{ij}$

For a given cycle time $\phi$, the network is timing feasible if it can correctly operate under $\phi$
This holds if $W(i, j) \ge 1$ for all $(i, j)$ with $D(i, j) > \phi$
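Both conditions can be checked directly on a weighted graph. The sketch below tests legality and computes the clock period as the largest delay along any register-free path, which is equivalent to the W/D condition above. The helper names are hypothetical, the edge list is an assumed reading of the Example 2 figure, and the labelling `r` is one retiming that happens to reach period 13 in that graph. The recursion assumes there is no register-free cycle.

```python
from functools import lru_cache

# Example 2 graph; the edge list is an assumption about the figure.
DELAY = {"a": 3, "b": 3, "c": 3, "d": 3, "e": 7, "f": 7, "g": 7, "h": 0}
WEIGHT = {
    ("h", "a"): 1, ("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
    ("d", "e"): 0, ("e", "f"): 0, ("f", "g"): 0, ("g", "h"): 0,
    ("a", "g"): 0, ("b", "f"): 0, ("c", "e"): 0,
}

def is_legal(weight):
    """Legal means no edge ends up with a negative register count."""
    return all(w >= 0 for w in weight.values())

def clock_period(delay, weight):
    """Largest total delay along any path made only of zero-weight edges."""
    zero_succ = {v: [] for v in delay}
    for (u, v), w in weight.items():
        if w == 0:
            zero_succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(v):
        # Assumes the zero-weight subgraph is acyclic.
        return delay[v] + max((longest_from(s) for s in zero_succ[v]), default=0)

    return max(longest_from(v) for v in delay)

def retime(weight, r):
    return {(i, j): w + r[j] - r[i] for (i, j), w in weight.items()}

r = {"a": -1, "b": -2, "c": -2, "d": -3, "e": -2, "f": -1, "g": 0, "h": 0}
retimed = retime(WEIGHT, r)
print(clock_period(DELAY, WEIGHT), is_legal(retimed), clock_period(DELAY, retimed))
```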
Feasible Retiming

Furthermore:
$W'(i, j) = w'_{i k_1} + w'_{k_1 k_2} + \cdots + w'_{k_m j}$
$= (w_{i k_1} + r_{k_1} - r_i) + (w_{k_1 k_2} + r_{k_2} - r_{k_1}) + \cdots + (w_{k_m j} + r_j - r_{k_m})$
$= W(i, j) + r_j - r_i$

Finally:
$r_j - r_i + W(i, j) \ge 1 \;\Rightarrow\; r_j - r_i \ge 1 - W(i, j)$
Feasible Test Algorithm




Any retiming must satisfy the system of difference constraints:
$r_j - r_i \ge -w_{ij} \quad \forall (v_i, v_j) \in E$
$r_j - r_i \ge 1 - W(i, j) \quad \forall (i, j) : D(i, j) > \phi$
General approach: integer linear programming
Special form: single-source longest path problem
Note: we can skip the second inequality wherever $D(i, j) - d(j) > \phi$ or $D(i, j) - d(i) > \phi$
Feasible Test Algorithm


Longest path problem can be solved with Bellman-Ford
Build a constraint graph with an edge from i to j if we have a constraint of the form $r_j - r_i \ge b_k$
[Figure: constraint graph for Example 2 with $\phi = 13$; edge labels are the constraint bounds $b_k$]
Feasible Test Algorithm




The system is feasible if there are no positive cycles
If feasible, the longest distance of each vertex provides the retiming function
For the previous example, with reference node $v_h$:
$r_a = -1$, $r_b = -2$, $r_c = -2$, $r_d = -3$
$r_e = -2$, $r_f = -1$, $r_g = 0$, $r_h = 0$
There are no positive cycles; hence, $\phi = 13$ is a feasible clock period
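The feasibility test can be sketched end to end: compute W and D, turn each constraint $r_j - r_i \ge b$ into an edge i -> j of length b, and run a longest-path Bellman-Ford from the reference node. This is an illustrative sketch under stated assumptions, not the slide author's code. The edge list is an assumed reading of the Example 2 figure, the helper names are hypothetical, every vertex is assumed reachable from the reference node, and no redundant constraints are skipped.

```python
from itertools import product

# Example 2 graph; the edge list is an assumption about the figure.
DELAY = {"a": 3, "b": 3, "c": 3, "d": 3, "e": 7, "f": 7, "g": 7, "h": 0}
WEIGHT = {
    ("h", "a"): 1, ("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
    ("d", "e"): 0, ("e", "f"): 0, ("f", "g"): 0, ("g", "h"): 0,
    ("a", "g"): 0, ("b", "f"): 0, ("c", "e"): 0,
}
INF = float("inf")

def compute_wd(delay, weight):
    """All-pairs W and D via Floyd-Warshall on (weight, -delay) pairs."""
    best = {(u, v): (INF, 0) for u, v in product(delay, repeat=2)}
    for v in delay:
        best[(v, v)] = (0, 0)
    for (u, v), w in weight.items():
        best[(u, v)] = min(best[(u, v)], (w, -delay[u]))
    for k, i, j in product(delay, repeat=3):
        (w1, d1), (w2, d2) = best[(i, k)], best[(k, j)]
        if w1 < INF and w2 < INF and (w1 + w2, d1 + d2) < best[(i, j)]:
            best[(i, j)] = (w1 + w2, d1 + d2)
    W = {uv: w for uv, (w, _) in best.items() if w < INF}
    D = {(u, v): delay[v] - nd for (u, v), (w, nd) in best.items() if w < INF}
    return W, D

def feasible_retiming(delay, weight, phi, ref="h"):
    """Return a retiming r for clock period phi, or None if a positive cycle exists."""
    W, D = compute_wd(delay, weight)
    cons = [(i, j, -w) for (i, j), w in weight.items()]                  # legality
    cons += [(i, j, 1 - W[(i, j)]) for (i, j), dij in D.items() if dij > phi]  # timing
    r = {v: -INF for v in delay}
    r[ref] = 0
    for _ in range(len(delay) - 1):          # longest-path relaxation rounds
        for i, j, b in cons:
            if r[i] + b > r[j]:
                r[j] = r[i] + b
    if any(r[i] + b > r[j] for i, j, b in cons):
        return None                          # still improvable: positive cycle
    return r

print(feasible_retiming(DELAY, WEIGHT, 13))
print(feasible_retiming(DELAY, WEIGHT, 12))
```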
Feasible Test Algorithm

Here, there is a positive cycle:
$v_b \to v_e \to v_f \to v_g \to v_b$
Hence, a clock period of 12 is not feasible
[Figure: constraint graph for Example 2 with $\phi = 12$]
Optimally Retimed Example Circuit
[Figure: optimally retimed graph and circuit for Example 2; vertex delays are unchanged ($v_h$ = 0; $v_a$ to $v_d$ = 3; $v_e$ to $v_g$ = 7)]
Optimal Retiming
Binary search of minimum cycle time
optimalRetiming ( G ) {
    min = 0;      // lower bound on the cycle time
    max = MAX;    // upper bound, assumed feasible
    while ( min ≠ max ) {
        mid = ( min + max ) / 2;    // integer midpoint
        if ( feasibleTest ( G, mid ) )
            max = mid;
        else
            min = mid + 1;
    }
    return min;
}
Optimal Retiming

Do we really need to search all clock periods? No…




Optimal cycle time must be one of D(i,j)
So, sort and search O(V^2) clock periods
Computing all D(i,j) requires O(VE + V^2 lg V) time
Overall, the complexity is O(VE lg V)
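The search over candidate periods can be sketched as a binary search on the sorted distinct D(i,j) values. This is a minimal illustration rather than an implementation that meets the quoted complexity: it uses a cubic Floyd-Warshall pass and a plain Bellman-Ford check. The edge list is an assumed reading of the Example 2 figure, and all function names are hypothetical.

```python
from itertools import product

# Example 2 graph; the edge list is an assumption about the figure.
DELAY = {"a": 3, "b": 3, "c": 3, "d": 3, "e": 7, "f": 7, "g": 7, "h": 0}
WEIGHT = {
    ("h", "a"): 1, ("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
    ("d", "e"): 0, ("e", "f"): 0, ("f", "g"): 0, ("g", "h"): 0,
    ("a", "g"): 0, ("b", "f"): 0, ("c", "e"): 0,
}
INF = float("inf")

def compute_wd(delay, weight):
    """All-pairs W and D via Floyd-Warshall on (weight, -delay) pairs."""
    best = {(u, v): (INF, 0) for u, v in product(delay, repeat=2)}
    for v in delay:
        best[(v, v)] = (0, 0)
    for (u, v), w in weight.items():
        best[(u, v)] = min(best[(u, v)], (w, -delay[u]))
    for k, i, j in product(delay, repeat=3):
        (w1, d1), (w2, d2) = best[(i, k)], best[(k, j)]
        if w1 < INF and w2 < INF and (w1 + w2, d1 + d2) < best[(i, j)]:
            best[(i, j)] = (w1 + w2, d1 + d2)
    W = {uv: w for uv, (w, _) in best.items() if w < INF}
    D = {(u, v): delay[v] - nd for (u, v), (w, nd) in best.items() if w < INF}
    return W, D

def is_feasible(delay, weight, W, D, phi):
    """True if the difference constraints for phi have no positive cycle."""
    cons = [(i, j, -w) for (i, j), w in weight.items()]
    cons += [(i, j, 1 - W[(i, j)]) for (i, j), dij in D.items() if dij > phi]
    r = {v: 0 for v in delay}        # all zeros acts as a virtual source
    for _ in range(len(delay)):
        changed = False
        for i, j, b in cons:
            if r[i] + b > r[j]:
                r[j], changed = r[i] + b, True
        if not changed:
            return True
    return False                     # still changing after |V| rounds

def min_clock_period(delay, weight):
    W, D = compute_wd(delay, weight)
    candidates = sorted(set(D.values()))
    lo, hi = 0, len(candidates) - 1  # the largest D is feasible with r = 0
    while lo < hi:
        mid = (lo + hi) // 2
        if is_feasible(delay, weight, W, D, candidates[mid]):
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo]

print(min_clock_period(DELAY, WEIGHT))  # 13
```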
Optimal Retiming

Can we do better? Yes…

Look at the delay-to-register ratios and maximum node delay of the cycles in the circuit, where delay-to-register ratio and maximum node delay are defined as:
$R(C) = \dfrac{\sum_{v \in C} d(v)}{\sum_{e \in C} w(e)}$
$D = \max \{ d(v) : v \in V \}$

Then, the minimum feasible clock period lies in the range:
$\left\lceil \max_{C \in G} R(C) \right\rceil \le \phi_{\min}(G) \le \left\lceil \max_{C \in G} R(C) \right\rceil + D - 1$

This improves the overall running time to O(VE lg D)
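For a small graph the bounds can be checked by brute force. The sketch below enumerates simple cycles, takes the maximum delay-to-register ratio, and forms the range, assuming the bounds use the ceiling of the maximum ratio and that every cycle carries at least one register. The edge list is an assumed reading of the Example 2 figure, the helper names are hypothetical, and the enumeration is only practical for tiny graphs.

```python
from fractions import Fraction
from math import ceil

# Example 2 graph; the edge list is an assumption about the figure.
DELAY = {"a": 3, "b": 3, "c": 3, "d": 3, "e": 7, "f": 7, "g": 7, "h": 0}
WEIGHT = {
    ("h", "a"): 1, ("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
    ("d", "e"): 0, ("e", "f"): 0, ("f", "g"): 0, ("g", "h"): 0,
    ("a", "g"): 0, ("b", "f"): 0, ("c", "e"): 0,
}

def simple_cycles(weight):
    """Yield each simple cycle once, as a vertex list starting at its smallest vertex."""
    succ = {}
    for u, v in weight:
        succ.setdefault(u, []).append(v)
    for start in sorted(succ):
        stack = [[start]]
        while stack:
            path = stack.pop()
            for nxt in succ.get(path[-1], []):
                if nxt == start:
                    yield path
                elif nxt > start and nxt not in path:
                    stack.append(path + [nxt])

def cycle_ratio(cycle, delay, weight):
    """R(C): total vertex delay divided by total registers around the cycle."""
    edges = zip(cycle, cycle[1:] + cycle[:1])
    registers = sum(weight[e] for e in edges)
    return Fraction(sum(delay[v] for v in cycle), registers)

def period_bounds(delay, weight):
    r_max = max(cycle_ratio(c, delay, weight) for c in simple_cycles(weight))
    lower = ceil(r_max)
    return lower, lower + max(delay.values()) - 1

print(period_bounds(DELAY, WEIGHT))  # (10, 16)
```

The feasible period of 13 found earlier for this example falls inside the computed range.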
Pipelining

Can be thought of as a special case of retiming
[Figure: pipeline diagram showing three instructions, each with the stages fetch, decode, execute, writeback]
Software Pipelining

This can also be thought of in terms of retiming
[Figure: loop iterations with the loop boundary marked]