Fast Min-Register Retiming Through Binary Max-Flow

Min-Register Retiming Under Simultaneous Timing and Initial State Constraints
Aaron Hurst
Dec. 2007
Introduction

Retiming is the structural relocation of registers such that output functionality is preserved

Transformation with many means and many ends





Minimizing worst-case delay
Minimizing number of registers
Either of the above under constraints
Optimally or heuristically
Other… ?
In an industrial setting?
1. Diminishing returns in combinational optimization
2. Coming of age of sequential equivalence checking
Motivation
Register minimization is uniquely valuable


Area and power reduction

Clock network: dynamic power, design effort
Testability: scan chain depth
Verification: state representation


Must satisfy several constraints


Timing, initializability, congestion, electrical...
Constrained minimum-register retiming is hard


Current solutions not scalable
Outline
1. Core problem: unconstrained register minimization
2. Constraints: Timing
3. Constraints: Initializability
4. Other constraints
Flow-Based Register Minimization
A New Approach to Unconstrained Register Minimization
Background

Register minimization is an “original” problem in retiming

given $G = \langle V, E, w_i \rangle$
choose $r(v)$ to $\min \sum_{v} r(v)\,\bigl[\mathrm{indegree}(v) - \mathrm{outdegree}(v)\bigr]$ s.t.
$r(u) - r(v) \le w_i(u, v) \quad \forall e = (u \to v)$

$r(v)$: retiming lag
$w_i(u,v)$: initial reg count

An instance of minimum-cost network flow
$O\bigl(VE \log(V^2/E) \log(VC)\bigr)$ [Goldberg97]
But we can do better...
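The formulation above can be checked on a toy graph. The sketch below is an illustration only: it assumes the usual convention that the retimed count on an edge is w_r(u,v) = w_i(u,v) + r(v) - r(u), counts registers per edge (no fanout sharing), and uses a made-up five-edge loop; the function names are hypothetical.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]

def retimed_weights(w_i: Dict[Edge, int], r: Dict[str, int]) -> Dict[Edge, int]:
    """w_r(u, v) = w_i(u, v) + r(v) - r(u) for every edge (assumed convention)."""
    return {(u, v): w + r[v] - r[u] for (u, v), w in w_i.items()}

def is_legal(w_i: Dict[Edge, int], r: Dict[str, int]) -> bool:
    """Legal iff r(u) - r(v) <= w_i(u, v), i.e. no edge count goes negative."""
    return all(w >= 0 for w in retimed_weights(w_i, r).values())

def objective(w_i: Dict[Edge, int], r: Dict[str, int]) -> int:
    """sum over v of r(v) * (indegree(v) - outdegree(v))."""
    indeg = {v: 0 for v in r}
    outdeg = {v: 0 for v in r}
    for u, v in w_i:
        outdeg[u] += 1
        indeg[v] += 1
    return sum(r[v] * (indeg[v] - outdeg[v]) for v in r)

# Toy loop: a fans out to b and c, which reconverge at d, which feeds a.
w_i = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 0, ("c", "d"): 0, ("d", "a"): 0}
r = {"a": 1, "b": 0, "c": 0, "d": 0}   # merge a's two output registers onto its input

before = sum(w_i.values())
after = sum(retimed_weights(w_i, r).values())
print(before, after, objective(w_i, r))
```

The objective value equals the change in the total register count, which is why minimizing it minimizes registers.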
Orientation

Consider one combinational frame of the circuit

A single directed acyclic graph of combinational logic




Nodes: logic gates
Edges: pair-wise net connections
Inputs: register outputs, primary inputs
Outputs: register inputs, primary outputs
[Figure: one combinational frame; labels: primary inputs, register outputs, primary outputs, register inputs]
Cuts in a Frame

Consider circuit w/o primary IOs and their transitive fan-in/out


Retiming = a complete cut of the DAG
Number of registers = $\bigl|\{\, v : \exists\, (v \to u) \text{ crossing cut} \,\}\bigr|$

Problem consists of finding minimum cut
Max-Flow Formulation

Min-cut/Max-flow Duality



Edges in graph are assigned a capacity
Min-cut width = Max-flow through graph
Min-cut derived from residual flow

Partition graph into {S,R} by source reachability
$S = \{\, s : \exists \text{ augmenting path from source} \to s \,\}$
$R = \{\, s : \nexists \text{ augmenting path from source} \to s \,\}$
Min-cut is not unique

Selects one with min movement of registers
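A minimal sketch of deriving the cut from residual reachability, assuming a generic Edmonds-Karp solver rather than the solver used in this work. The function name and the toy graph are hypothetical. After the flow saturates, S is the set of nodes still reachable from the source in the residual graph, which is the min-cut closest to the source.

```python
from collections import defaultdict, deque

def max_flow_min_cut(edges, source, sink):
    """edges: iterable of (u, v, capacity). Returns (flow, S, R)."""
    res = defaultdict(lambda: defaultdict(int))    # residual capacities
    for u, v, c in edges:
        res[u][v] += c
        res[v][u] += 0                             # make sure the reverse arc exists
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            res[parent[v]][v] -= bottleneck
            res[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck
    S = set(parent)                # nodes reached by the last (failed) search
    R = set(res) - S
    return flow, S, R

# Two width-1 cuts exist (c->d and d->t); reachability picks the one nearer the source.
edges = [("s", "a", 1), ("s", "b", 1), ("a", "c", 1), ("b", "c", 1),
         ("c", "d", 1), ("d", "t", 1)]
flow, S, R = max_flow_min_cut(edges, "s", "t")
print(flow, sorted(S), sorted(R))
```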
Constraint Type #1

What is the effect of unconstrained edges?
[Figure: unconstrained edge u → v]
Never saturated; always present in residual graph


Destination node v always reachable from source u
Minimum cut will never lie between u and v
A useful tool for constraining solution....
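The effect can be seen by brute force on a three-edge chain. This is a sketch under assumptions: "unconstrained" is modeled as infinite capacity, cuts are enumerated exhaustively (fine only for tiny graphs), and the helper names are hypothetical.

```python
from itertools import combinations

INF = float("inf")

def cut_width(edges, S):
    """Total capacity of arcs leaving the source-side set S."""
    return sum(c for u, v, c in edges if u in S and v not in S)

def min_cuts(edges, source, sink):
    """Enumerate every source-side set and keep those of minimum width."""
    inner = sorted({n for u, v, _ in edges for n in (u, v)} - {source, sink})
    best, sets = INF, []
    for k in range(len(inner) + 1):
        for extra in combinations(inner, k):
            S = frozenset((source, *extra))
            w = cut_width(edges, S)
            if w < best:
                best, sets = w, [S]
            elif w == best:
                sets.append(S)
    return best, sets

unit = [("s", "u", 1), ("u", "v", 1), ("v", "t", 1)]
pinned = [("s", "u", 1), ("u", "v", INF), ("v", "t", 1)]   # u -> v is unconstrained

_, unit_cuts = min_cuts(unit, "s", "t")
_, pinned_cuts = min_cuts(pinned, "s", "t")
between_u_and_v = frozenset({"s", "u"})
print(between_u_and_v in unit_cuts, between_u_and_v in pinned_cuts)
```

With unit capacity the cut may sit between u and v; with the unconstrained edge that position disappears from the set of minimum cuts.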
A Necessary Modification


Min-cut guarantees every path will be cut at least once
Retiming requires that every path is cut exactly once
[Figure: cut between S and R; registers R1, R2, R3 and retimed positions R1’, R2’, R3’]

Observation: a path that is cut more than once must cross the cut from R → S

Solution: Use unconstrained flow to prevent reverse edges
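A small worked case of the modification, as a sketch only. It assumes the fix is an infinite-capacity arc added opposite every original arc, uses exhaustive cut enumeration instead of a flow solver, and uses capacities above 1 as a stand-in for bundles of parallel nets. All names and the graph are hypothetical.

```python
from itertools import combinations

INF = float("inf")

def cut_width(edges, S):
    """Total capacity of arcs leaving the source-side set S."""
    return sum(c for u, v, c in edges if u in S and v not in S)

def min_cuts(edges, source, sink):
    inner = sorted({n for u, v, _ in edges for n in (u, v)} - {source, sink})
    best, sets = INF, []
    for k in range(len(inner) + 1):
        for extra in combinations(inner, k):
            S = frozenset((source, *extra))
            w = cut_width(edges, S)
            if w < best:
                best, sets = w, [S]
            elif w == best:
                sets.append(S)
    return best, sets

def forward_crossings(path, S):
    """How many times a path steps from S into R."""
    return sum(1 for u, v in zip(path, path[1:]) if u in S and v not in S)

forward = [("s", "x", 1), ("s", "y", 5), ("x", "y", 1), ("x", "t", 5), ("y", "t", 1)]
reverse = [(v, u, INF) for u, v, _ in forward]     # unconstrained reverse arcs
path = ["s", "x", "y", "t"]

w_plain, cuts_plain = min_cuts(forward, "s", "t")
w_fixed, cuts_fixed = min_cuts(forward + reverse, "s", "t")
print(w_plain, [forward_crossings(path, S) for S in cuts_plain])
print(w_fixed, [forward_crossings(path, S) for S in cuts_fixed])
```

Without the reverse arcs the cheapest cut crosses the path s, x, y, t twice; with them, every minimum cut crosses it exactly once, at the price of a wider cut.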
Fanout Sharing

Nets were decomposed into flow arcs

False model of register count

Reality: one register per net / hyper-edge
“Fanout sharing”
Introduce a structure to simulate fanout-sharing
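One way such a structure could look, sketched under assumptions: the driver feeds a hypothetical "net node" through a single unit-capacity arc, and the net node reaches every fanout through infinite-capacity arcs. This node-splitting gadget is an illustration and is not necessarily the exact structure used in this work; it also omits the reverse arcs discussed earlier.

```python
from itertools import combinations

INF = float("inf")

def cut_width(edges, S):
    """Total capacity of arcs leaving the source-side set S."""
    return sum(c for u, v, c in edges if u in S and v not in S)

def min_cut_width(edges, source, sink):
    """Exhaustive minimum over all source-side sets (tiny graphs only)."""
    inner = sorted({n for u, v, _ in edges for n in (u, v)} - {source, sink})
    return min(cut_width(edges, {source, *extra})
               for k in range(len(inner) + 1)
               for extra in combinations(inner, k))

fanouts = ["u1", "u2", "u3"]
frame = [("s", "v", INF)] + [(u, "t", INF) for u in fanouts]

# Naive model: one unit arc per fanout branch, so the net costs three registers.
naive = frame + [("v", u, 1) for u in fanouts]

# Shared model: one unit arc into a net node, unconstrained arcs to the fanouts.
shared = frame + [("v", "net", 1)] + [("net", u, INF) for u in fanouts]

print(min_cut_width(naive, "s", "t"), min_cut_width(shared, "s", "t"))
```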





Single Iteration

What is the final flow graph?
1. Reverse Edges
2. Fanout-sharing
[Figure: final flow graph with capacity labels]

Unitary Flow Simplification


Binary marking scheme
Flow computed on original netlist
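A sketch of what a binary marking scheme can look like, assuming every arc has unit capacity and there are no parallel arcs. Flow is stored as one mark per arc of the original graph, and each augmentation toggles marks along a path. The function name and toy graph are hypothetical, and the fanout-sharing and reverse-edge structures are left out.

```python
from collections import defaultdict

def unit_max_flow(arcs, source, sink):
    """arcs: iterable of (u, v), all with capacity 1. Returns (flow, marked arcs)."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
        pred[v].append(u)
    marked = set()                     # arcs currently carrying one unit of flow

    def find_path():
        parent = {source: None}
        stack = [source]
        while stack:
            u = stack.pop()
            if u == sink:
                return parent
            for v in succ[u]:          # move forward along unmarked arcs
                if (u, v) not in marked and v not in parent:
                    parent[v] = (u, (u, v))
                    stack.append(v)
            for v in pred[u]:          # move backward along marked arcs
                if (v, u) in marked and v not in parent:
                    parent[v] = (u, (v, u))
                    stack.append(v)
        return None

    flow = 0
    while (parent := find_path()) is not None:
        node = sink
        while parent[node] is not None:
            prev, arc = parent[node]
            marked ^= {arc}            # toggle the binary mark on this arc
            node = prev
        flow += 1
    return flow, marked

arcs = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("c", "t"), ("a", "t")]
flow, marked = unit_max_flow(arcs, "s", "t")
print(flow, sorted(marked))
```

Each search touches every arc at most a constant number of times, and the number of augmentations cannot exceed the cut width, which is where a bound of the form O(RE) comes from.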

Multiple Frames

Globally minimum solution requires moving registers beyond one frame
Corresponding min-cut may stretch across multiple combinational frames
Solution: Repeat over single frame
Terminate when no further change
Then, consider backward direction
Final result is provably optimal
[Figure: unrolled circuit]
Overall Algorithm
Start
1. Forward retiming
   Block fan-out cone of PIs
   Compute max-flow
   Improvement? y: implement min-cut and repeat; n: continue to backward retiming
2. Backward retiming
   Block fan-in cone of POs
   Compute max-flow
   Improvement? y: implement min-cut and repeat; n: done
Done

Forward retiming is preferred due to initial state computation
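The control flow above can be sketched as two fixed-point loops. Everything here is a placeholder: `forward_step`, `backward_step`, and `count` are hypothetical callables standing in for one single-frame max-flow pass and a register counter, and the toy "circuit" is just an integer register count.

```python
def iterate_direction(circuit, step, count):
    """Apply `step` until the register count stops improving."""
    iterations = 0
    while True:
        candidate = step(circuit)
        if count(candidate) >= count(circuit):
            return circuit, iterations     # no improvement: keep the last circuit
        circuit = candidate
        iterations += 1

def min_register_retiming(circuit, forward_step, backward_step, count):
    """Forward passes first, then backward passes, each run to a fixed point."""
    circuit, n_forward = iterate_direction(circuit, forward_step, count)
    circuit, n_backward = iterate_direction(circuit, backward_step, count)
    return circuit, n_forward, n_backward

# Toy stand-ins: the "circuit" is its own register count.
result, n_fwd, n_bwd = min_register_retiming(
    20,
    forward_step=lambda n: max(n - 4, 12),
    backward_step=lambda n: max(n - 1, 10),
    count=lambda n: n,
)
print(result, n_fwd, n_bwd)
```

Because a step is only accepted when it reduces the count, the loop can be stopped early and still return an improved circuit.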
Asymptotic Analysis

Single iteration runtime limited by maximum flow solver
$O\bigl(VE \log(V^2/E)\bigr)$ [Goldberg95]
Or, using unitary flow simplification…
$O(RE)$

Total number of iterations is bounded by |R|
$O(R^2 E)$

Experimental Results

Applied to {ISCAS,ITC,OpenCores,Altera} benchmarks...
Register Savings per Iteration

The number of iterations is quite small
Register count is monotonically decreasing
Runtime can be bounded

Runtime is 5x faster than minimum-cost formulation
<0.01s for 70% of benchmarks

[Chart: Register Savings per Iteration. x-axis: Iteration (1, 2, 3, 4, 5+); y-axis: Share of Total Reduction (0.0% to 60.0%); series: Forward, Backward; bar labels as extracted: 48.7%, 42.6%, 3.5%, 0.5%, 0.4%, 0.1%, 1.1%, 3.1%]
Summary
Key points:
1. Optimal
2. Minimum register movement
3. Fast... both absolutely and relatively
4. Scalable: early termination with improvement
Timing Constraints
Background

Timing constraints make problem much harder
$D(u \leadsto v) > \text{max delay} \;\Rightarrow\; W(u \leadsto v) > r(u) - r(v) \quad \forall p : u \leadsto v$
$D(u,v)$: path delay
$W(u,v)$: path reg count
Complexity: pair-wise path delay constraints

Enumeration alone is $O(n^3)$

Simplification: prune unnecessary constraints

Minaret: use skew-equivalence to find ASAP and ALAP register positions [Sapatnekar99]
Conservative Constraints

Consider retiming a register

Two timing constraints made potentially critical in each direction
max arrival, min arrival

If other end of timing constraints is fixed...
Bound on absolute positions of register
All such constraints can be computed with two-pass STA
Linear time
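A generic two-pass timing sketch, shown only to make the linear-time claim concrete. It assumes a gate-delay model on a DAG, a single max-delay bound called `period`, and Python 3.9+ for `graphlib`. How arrival and required times translate into bounds on register positions is not spelled out on the slide, so this stops at per-node slack.

```python
from graphlib import TopologicalSorter

def two_pass_sta(preds, delay, period):
    """preds: node -> list of fan-in nodes. Returns (arrival, required, slack)."""
    order = list(TopologicalSorter(preds).static_order())   # fan-ins come first

    # Forward pass: latest arrival time at each node's output.
    arrival = {}
    for v in order:
        arrival[v] = delay[v] + max((arrival[u] for u in preds.get(v, ())), default=0)

    # Backward pass: latest time each output may settle without violating `period`.
    succs = {v: [] for v in order}
    for v, fanin in preds.items():
        for u in fanin:
            succs[u].append(v)
    required = {}
    for v in reversed(order):
        required[v] = min((required[w] - delay[w] for w in succs[v]), default=period)

    slack = {v: required[v] - arrival[v] for v in order}
    return arrival, required, slack

preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["d"], "f": ["a"]}
delay = {v: 1 for v in preds}                                # unit delay
arrival, required, slack = two_pass_sta(preds, delay, period=4)
print(arrival["e"], slack["d"], slack["f"])
```

Each node and edge is visited once per pass, so the cost is linear in the size of the graph.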
Exact Constraints

Fixing other end of timing path is conservative


May also move in the same direction, relaxing constraint
If other end of timing constraints is not fixed...

Conditional constraints
“Can retime R2 past v2 only if R1 is retimed past v1”
[Figure: registers R1, R2 and nodes v1, v2; unit delay, max delay ≤ 3]

Computed from edge of bounded transitive fan-in/out cone
- delay
- register count
Constraint Implementation

Conservative Constraints: $C_{cons} \subseteq V$
Indicate nodes to be removed from problem
Exact Constraints: $C_{exact} \subseteq V \times V$
Implemented as unconstrained edges
[Figure: unconstrained edge between v1 and v2]
Cut can only move beyond v2 if it moves beyond v1
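A sketch of how one exact constraint can couple two otherwise independent branches. The direction of the added arc (v2 → v1, infinite capacity) is an assumption chosen so that "v2 on the source side" forces "v1 on the source side"; the graph, capacities, and helper names are hypothetical, and cuts are enumerated exhaustively.

```python
from itertools import combinations

INF = float("inf")

def cut_width(edges, S):
    """Total capacity of arcs leaving the source-side set S."""
    return sum(c for u, v, c in edges if u in S and v not in S)

def min_cuts(edges, source, sink):
    inner = sorted({n for u, v, _ in edges for n in (u, v)} - {source, sink})
    best, sets = INF, []
    for k in range(len(inner) + 1):
        for extra in combinations(inner, k):
            S = frozenset((source, *extra))
            w = cut_width(edges, S)
            if w < best:
                best, sets = w, [S]
            elif w == best:
                sets.append(S)
    return best, sets

# Two parallel branches; area alone prefers moving past v2 but not past v1.
branches = [("s", "v1", 1), ("v1", "t", 2), ("s", "v2", 2), ("v2", "t", 1)]
exact = [("v2", "v1", INF)]            # "past v2 only if past v1"

w_free, cuts_free = min_cuts(branches, "s", "t")
w_tied, cuts_tied = min_cuts(branches + exact, "s", "t")
print(w_free, [sorted(S) for S in cuts_free])
print(w_tied, [sorted(S) for S in cuts_tied])
```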

Refinement

All timing constraints met by initial circuit
Guarantees flow from source to sink remains finite
Iteratively tighten conservative constraints into exact ones
Simplification: Only constraints limiting area improvement

1. Ccons = all, Cexact = ∅
2. Compute cut with Ccons
3. Compute cut w/o Ccons
4. Any conservative constraint between the two cuts?
   y: convert ccons between two cuts into cexact, go to 2
   n: done

$\min_{\text{exact}} \le \min_{\text{exact+cons}}$
Building Intuition



[Figure: example circuit, unit delay, max delay ≤ 3; panels: No Constraint, Conservative Constraint, Exact Constraint; labels as extracted: 4, 4, 5, 6]
$\min_{\text{exact}} \le \min_{\text{exact+cons}}$
Constraints impose relations across multiple clock cycles
Building Intuition
[Figure: example circuit, unit delay, max delay ≤ 2; panels: No Constraint, Conservative Constraint, Exact Constraint]
$\min_{\text{exact}} \le \min_{\text{exact+cons}}$
Cycles in constraints lock retiming moves to be ‘in-step’
Experimental Results
Max path delay ≤ initial period, min path delay ≥ 0

[Chart: Flow-based vs. Minaret Runtime. x-axis: Design; axes: Runtime (s), Num. Gates (1k's); series: Minaret, Flow-based w/ STA, Design Size]
Average number of exact constraints = 1.1% of design size
Summary
Key points:
1. Inherits benefits of flow-based retiming
   1. Optimal
   2. Fast
   3. Monotonic improvement
2. Problem reduced using both timing and area criticality
3. Advanced timing model and constraints
4. Scalable: intermediate solutions are timing-feasible
Initializability Constraints
Problem

Retimed circuit must preserve initialization behavior

Accomplished by:
1. Additional combinational logic
2. Identifying an equivalent initial state
   forward retiming: simulation
   backward retiming: SAT
Backward retiming jeopardizes initializability
Background

How to transform an uninitializable circuit into an initializable one?

‘Prayer’: minimizing register movement maximizes initializability [Pan99]
‘Slash and Burn’: incrementally tighten bound on negative retiming lag [Stok95]
‘Brute Force’: mixed-integer linear program [Sapatnekar97]
Feasibility Constraints
SAT problem with variables V : N  Z
Feasibility Constraint :   V

Partial cut of unrolled circuit
Sufficient to imply infeasibility
At least one element must be removed
UNSAT with only variables in TFO()
Built incrementally as retiming progresses
Variables switched off in SAT with additional per-clause variables z
$(x_1 \lor x_2 \lor x_5 \lor z_1) \land (x_1 \lor x_2 \lor z_2) \land \dots$
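A toy illustration of switching clauses off with per-clause variables. It assumes a DIMACS-style encoding (positive integer = variable true, negative = false) and a brute-force check in place of a real SAT solver; the three-clause formula and the function name are made up for the example.

```python
from itertools import product

def satisfiable(clauses, variables, assumptions):
    """Brute-force SAT check with some variables fixed by `assumptions`."""
    free = [v for v in variables if v not in assumptions]
    for bits in product([False, True], repeat=len(free)):
        model = dict(assumptions)
        model.update(zip(free, bits))
        if all(any(model[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

X1, X2 = 1, 2              # problem variables
Z1, Z2, Z3 = 3, 4, 5       # one switch variable per clause

# Original clauses (x1), (not x1 or x2), (not x2) are jointly unsatisfiable.
# Each gets its own z literal; setting that z to True switches the clause off.
clauses = [[X1, Z1], [-X1, X2, Z2], [-X2, Z3]]
variables = [X1, X2, Z1, Z2, Z3]

all_on = satisfiable(clauses, variables, {Z1: False, Z2: False, Z3: False})
last_off = satisfiable(clauses, variables, {Z1: False, Z2: False, Z3: True})
print(all_on, last_off)
```

The clause set never changes between calls; only the values assumed for the z variables do, which is what makes the approach a natural fit for incremental solving.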
Feasibility Constraints
SAT w/o variables TFO( )
[Figure: binary search on v over the unrolled circuit; each probe is a SAT call with outcome y/n]
Topologically order circuit
SAT w/o variables {u : topo(u)  topo(v)}  TFO( )
 SAT w/o variables {u : topo(u)  topo(v)}  TFO( )


Fast: Incremental SAT
Faster: UNSAT core can be used to localize conflict
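The search over the topological order can be sketched as a standard binary search on a monotone predicate. The orientation is an assumption, since the inequalities on the slide did not survive extraction: `sat_with_prefix(k)` is a hypothetical oracle that keeps only the variables at topological positions below k and reports whether the problem is satisfiable.

```python
def conflict_boundary(n, sat_with_prefix):
    """Largest k in [0, n) for which sat_with_prefix(k) is True.

    Assumes sat_with_prefix(0) is True, sat_with_prefix(n) is False, and that
    keeping more variables can only turn SAT into UNSAT, never the reverse.
    """
    lo, hi = 0, n                      # invariant: SAT at lo, UNSAT at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if sat_with_prefix(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Toy oracle: the conflict appears once position 9 is included.
calls = []
def oracle(k):
    calls.append(k)
    return k <= 9

boundary = conflict_boundary(16, oracle)
print(boundary, calls)
```

The number of oracle calls grows with the logarithm of the circuit size rather than linearly.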
Constraint Type #2
Penalty Structure
Adds exactly one unit flow path
Delayed insertion until the feasibility constraint comes into frame
Biases against cuts below feasibility constraint
Closest cut of width +1 returned
Cut is squeezed forward
New cut of width +1 is closest and therefore most initializable
If it isn’t... additional penalty needed
[Figure: penalty structure between source and sink]
Algorithm
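1. Cfeas = ∅
2. Compute min-cut constrained by Cfeas
3. Initializable? y: done; n: continue
4. Binary search on v (one SAT call per probe, outcome y/n) to find a feasibility constraint
5. Add the new constraint to Cfeas, go to 2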
Complexity: already NP-complete from test for initializability
Experimental Results

Equivalent init. state after min-reg retiming for most designs
Only one design was not initializable: s400
Can be easily lost if backward retiming invoked multiple times



A harder problem: randomized initial states
Name          Gates   Original   Min-Register Regs   Min-Register Regs   Avg. constraint   Runtime
                      Regs       (Infeasible)        (Feasible)          size
s400          0.3k    21         18                  +1                  8.0               0.08s
oc_aes_core   16.6k   402        395                 +3                  2.0               2.55s
oc_vga_lcd    17.1k   1108       1087                +1                  1.0               1.09s
nut_003       6.6k    484        450                 +3                  1.0               1.41s
radar12       71.1k   3875       3771                +27                 2.3               108.3s
oc_wb_dma     29.2k   1775       1757                +2                  3.5               5.70s
oc_minirisc   3.9k    289        271                 +2                  1.0               0.49s
Summary
Key points:
1. Optimal
2. Compatible with timing constraints
3. Worst-case bound non-polynomial, but fast in practice
Additional Applications

Physical constraints
Placement congestion: penalty structures

Electrical constraints
Capacitive load on clock network drivers: penalty structures

Others?
Contribution

New formulation of register minimization problem

Constraints of different forms can be added to problem
1. Timing
2. Initializability
3. Other

Improves upon best practices within each sub-problem
Unified solution to synthesis-ready retiming

Scalable
