Min-Register Retiming Under Simultaneous Timing and Initial State Constraints
Aaron Hurst
Dec. 2007
Introduction
Retiming is the structural relocation of registers such
that output functionality is preserved
Transformation with many means and many ends
Minimizing worst-case delay
Minimizing number of registers
Either of the above under constraints
Optimally or heuristically
Other… ?
In an industrial setting?
1. Diminishing returns in combinational optimization
2. Coming of age of sequential equivalence checking
Motivation
Register minimization is uniquely valuable
Area and power reduction
Clock network: dynamic power, design effort
Testability: scan chain depth
Verification: state representation
Must satisfy several constraints
Timing, initializability, congestion, electrical...
Constrained minimum-register retiming is hard
Current solutions not scalable
Outline
1. Core problem: unconstrained register minimization
2. Constraints: Timing
3. Constraints: Initializability
4. Other constraints
Flow-Based Register Minimization
A New Approach to Unconstrained Register Minimization
Background
Register minimization an “original” problem in retiming
Given $G = \langle V, E, w_i \rangle$,
choose $r(v)$ to minimize $\sum_{v \in V} r(v)\,[\mathrm{indegree}(v) - \mathrm{outdegree}(v)]$
subject to $r(u) - r(v) \le w_i(u, v) \quad \forall e = (u \to v) \in E$
An instance of minimum-cost network flow
$O(VE \log(V^2/E) \log(VC))$ [Goldberg97]
But we can do better...
r(v): retiming lag
wi(u,v): initial reg count
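The formulation above can be checked on a toy graph. The following is a minimal sketch, assuming the edge-count register model (one register per edge, no fanout sharing) and the sign convention w_r(u,v) = w_i(u,v) - r(u) + r(v); the function names and the example graph are illustrative and not taken from the slides.

```python
def retimed_weights(edges, lag):
    """edges: {(u, v): initial register count}; lag: {v: r(v)}.
    Returns the register count on every edge after retiming."""
    return {(u, v): w - lag[u] + lag[v] for (u, v), w in edges.items()}


def is_legal(edges, lag):
    """A retiming is legal when no edge ends up with a negative count,
    i.e. r(u) - r(v) <= w_i(u, v) for every edge."""
    return all(w >= 0 for w in retimed_weights(edges, lag).values())


def objective(edges, lag):
    """Sum over nodes of r(v) * (indegree(v) - outdegree(v)):
    the change in total register count caused by the retiming."""
    indeg = {v: 0 for v in lag}
    outdeg = {v: 0 for v in lag}
    for (u, v) in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return sum(lag[v] * (indeg[v] - outdeg[v]) for v in lag)


# Toy loop: a fans out to b and c, which reconverge at d, which feeds a.
edges = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 0, ("c", "d"): 0, ("d", "a"): 0}
# Move both registers forward across b, c, and d (negative lag = forward here).
lag = {"a": 0, "b": -1, "c": -1, "d": -1}
```

With these lags the two registers on the fanout of a merge into a single register on the edge from d back to a, so the objective evaluates to a change of -1.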
Orientation
Consider one combinational frame of the circuit
A single directed acyclic graph of combinational logic
Nodes: logic gates
Edges: pair-wise net connections
Inputs: register outputs, primary inputs
Outputs: register inputs, primary outputs
[Figure: one combinational frame, with primary inputs and register outputs on the input side and primary outputs and register inputs on the output side]
Cuts in a Frame
Consider circuit w/o primary IOs and their transitive fan-in/out
Retiming = a complete cut of the DAG
Number of registers = $|\{v : \exists\,(v \to u) \text{ crossing cut}\}|$
Problem consists of finding minimum cut
Max-Flow Formulation
Min-cut/Max-flow Duality
Edges in graph are assigned a capacity
Min-cut width = Max-flow through graph
Min-cut derived from residual flow
Partition graph into {S,R} by source reachability
$S = \{v : \exists \text{ augmenting path from source } s\}$
$R = \{v : \nexists \text{ augmenting path from source } s\}$
Min-cut is not unique
Selects one with min movement of registers
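The derivation of the cut from the residual graph can be sketched as follows. This is a generic Edmonds-Karp max-flow with finite capacities, not the solver used in the talk; the function name and the example graph are assumptions for illustration. The set of nodes still reachable from the source in the final residual graph is the smallest source side among all minimum cuts, which is what keeps the cut closest to the source.

```python
from collections import deque


def max_flow_min_cut(capacity, source, sink):
    """capacity: {(u, v): finite capacity}.
    Returns (flow value, S), where S is the set of nodes reachable
    from the source in the final residual graph."""
    residual, adj = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)  # reverse residual arc
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def reachable():
        # Breadth-first search over arcs with remaining residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable()
        if sink not in parent:
            return flow, set(parent)  # no augmenting path left: S found
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[e] for e in path)
        for (u, v) in path:
            residual[(u, v)] -= push
            residual[(v, u)] += push
        flow += push


# Two unit-capacity branches reconverge at c before reaching the sink.
capacity = {("s", "a"): 1, ("s", "b"): 1, ("a", "c"): 1, ("b", "c"): 1, ("c", "t"): 1}
flow, S = max_flow_min_cut(capacity, "s", "t")
```

In this example the only minimum cut is the single arc from c to t, so S contains every node except the sink and R contains only the sink.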
Constraint Type #1
What is the effect of an unconstrained (infinite-capacity) edge u → v?
Never saturated; always present in residual graph
Destination node v always reachable from source u
Minimum cut will never lie between u and v
A useful tool for constraining solution....
A Necessary Modification
Min-cut guarantees every path will be cut at least once
Retiming requires that every path is cut exactly once
[Figure: registers R1, R2, R3 and their retimed positions R1', R2', R3' relative to the cut between S and R]
Observation: to be cut more than once, a path must cross the cut backward, from R → S
Solution: add unconstrained reverse edges so that no edge crosses the cut from R → S
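The effect of the reverse edges can be seen on a four-node example. The sketch below enumerates every cut by brute force rather than running a flow solver, which is only practical for tiny graphs; the helper names and the capacities are assumptions chosen so that the plain minimum cut contains a backward edge.

```python
from itertools import combinations

INF = float("inf")


def min_cut_brute_force(nodes, capacity, source, sink):
    """Try every partition with the source in S and the sink in R.
    Returns (width, S) for the first cut of minimum width found."""
    inner = [n for n in nodes if n not in (source, sink)]
    best = (INF, None)
    for k in range(len(inner) + 1):
        for chosen in combinations(inner, k):
            S = {source, *chosen}
            width = sum(c for (u, v), c in capacity.items()
                        if u in S and v not in S)
            if width < best[0]:
                best = (width, S)
    return best


def with_reverse_edges(capacity):
    """Add an infinite-capacity edge v -> u for every edge u -> v."""
    out = dict(capacity)
    for (u, v) in capacity:
        out.setdefault((v, u), INF)
    return out


def backward_edges(capacity, S):
    """Original edges that cross the cut from R back into S."""
    return [(u, v) for (u, v) in capacity if u not in S and v in S]


nodes = ["s", "a", "b", "t"]
capacity = {("s", "a"): 3, ("s", "b"): 1, ("b", "a"): 1, ("a", "t"): 1, ("b", "t"): 3}

plain_width, plain_S = min_cut_brute_force(nodes, capacity, "s", "t")
fixed_width, fixed_S = min_cut_brute_force(nodes, with_reverse_edges(capacity), "s", "t")
```

Without reverse edges the narrowest cut puts a in S and b in R, so the path s → b → a → t is cut twice. With reverse edges that partition becomes infinitely wide and the solver falls back to a cut that every path crosses exactly once, at the cost of a larger width.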
Fanout Sharing
Nets were decomposed into flow arcs
False model of register count
Reality: one register per net / hyper-edge
“Fanout sharing”
Introduce a structure to simulate fanout-sharing
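The gap between the two register models can be made concrete by counting a cut both ways. This is a small sketch with hypothetical helper names; it only counts registers for a given cut and does not reproduce the flow structure used to model sharing.

```python
def registers_per_edge(edges, S):
    """False model: one register for every flow arc crossing the cut."""
    return sum(1 for (u, v) in edges if u in S and v not in S)


def registers_per_net(edges, S):
    """Fanout sharing: one register per driving net that crosses the cut,
    regardless of how many of its fanout arcs cross."""
    return len({u for (u, v) in edges if u in S and v not in S})


# Gate g drives three sinks; the cut sits directly after g.
edges = [("g", "x"), ("g", "y"), ("g", "z")]
S = {"g"}
```

For this cut the per-arc model reports three registers where a single shared register on the output net of g suffices.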
Single Iteration
What is the final flow graph?
1. Reverse edges
2. Fanout-sharing
Unitary Flow Simplification
Binary marking scheme
Flow computed on original netlist
Multiple Frames
Globally minimum solution requires
moving registers beyond one frame
Corresponding min-cut may stretch across
multiple combinational frames
Solution: Repeat over single frame
Terminate when no further change
Then, consider backward direction
Final result is provably optimal
[Figure: unrolled circuit]
Overall Algorithm
Start
1. Forward retiming
   Block fan-out cone of PIs
   Compute max-flow
   If the min-cut is an improvement, implement it and repeat; otherwise continue
2. Backward retiming
   Block fan-in cone of POs
   Compute max-flow
   If the min-cut is an improvement, implement it and repeat; otherwise done
Forward retiming is preferred due to initial state computation
Done
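The control flow of the overall algorithm can be sketched as a short driver loop. The callbacks `retime_once` and `count_registers` are hypothetical placeholders for the single-frame max-flow pass and the register count; `ToyCircuit` is a stand-in that exists only to exercise the loop.

```python
def minimize_registers(retime_once, count_registers):
    """Run forward passes until no improvement, then backward passes.
    Returns the register count after every improving pass."""
    history = [count_registers()]
    for direction in ("forward", "backward"):
        while True:
            retime_once(direction)      # one single-frame min-cut pass
            now = count_registers()
            if now >= history[-1]:      # no improvement: switch or finish
                break
            history.append(now)
    return history


class ToyCircuit:
    """Stand-in circuit whose passes save a scripted number of registers."""

    def __init__(self):
        self.regs = 10
        self.gains = {"forward": [3, 1], "backward": [1]}

    def retime_once(self, direction):
        if self.gains[direction]:
            self.regs -= self.gains[direction].pop(0)

    def count(self):
        return self.regs


toy = ToyCircuit()
history = minimize_registers(toy.retime_once, toy.count)
```

Because a pass is only recorded when it reduces the count, the recorded history is strictly decreasing, and stopping the loop early still leaves a valid, improved circuit.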
Asymptotic Analysis
Single iteration runtime limited by maximum flow solver
$O(VE \log(V^2/E))$ [Goldberg95]
Or, using unitary flow simplification…
$O(RE)$
Total number of iterations is bounded by |R|
$O(R^2 E)$
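One way to read the two bounds above, as a sketch and under the assumption that each augmenting path is found in $O(E)$ time and carries one unit of flow: the current register positions already form a cut of width $|R|$, so the maximum flow $f$ cannot exceed $|R|$.

```latex
\begin{align*}
T_{\text{iteration}} &= O(f \cdot E), \quad f \le |R|
  && \text{($f$ unit augmentations, each found in $O(E)$)} \\
T_{\text{total}} &\le |R| \cdot O(|R| \cdot E) = O(R^2 E)
  && \text{(at most $|R|$ iterations)}
\end{align*}
```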
Experimental Results
Applied to {ISCAS,ITC,OpenCores,Altera} benchmarks...
The number of iterations is quite small
Register count is monotonically decreasing
Runtime can be bounded
[Figure: Register Savings per Iteration. Bar chart of share of total reduction (0% to 60%) against iteration (1, 2, 3, 4, 5+), with forward and backward series. Data labels as extracted: 48.7%, 42.6%, 3.5%, 0.5%, 0.4%, 0.1%, 1.1%, 3.1%]
Runtime is 5x faster than minimum-cost formulation
<0.01s for 70% of benchmarks
Summary
Key points:
1. Optimal
2. Minimum register movement
3. Fast... both absolutely and relatively
4. Scalable: early termination with improvement
Timing Constraints
Background
Timing constraints make problem much harder
$D(u \to v) > T \;\Rightarrow\; W(u \to v) - r(u) + r(v) \ge 1 \quad \forall p : u \to v$
T: target clock period
D(u,v): path delay
W(u,v): path reg count
Complexity: pair-wise path delay constraints
Enumeration alone is $O(n^3)$
Simplification: prune unnecessary constraints
Minaret: use skew-equivalence to find ASAP and ALAP
register positions [Sapatnekar99]
Conservative Constraints
Consider retiming a register
Two timing constraints made potentially critical in each direction
[Figure: a register with its max-arrival and min-arrival timing constraints]
If other end of timing constraints is fixed...
Bound on absolute positions of register
All such constraints can be computed with two-pass STA
Linear time
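The two-pass analysis can be sketched with a generic static timing pass over one combinational frame. This is a textbook longest-path computation, not the exact bound derivation from the talk; the function name, the unit delays, and the period of 3 are assumptions for illustration.

```python
from graphlib import TopologicalSorter


def two_pass_sta(delay, fanin):
    """delay: {node: gate delay}; fanin: {node: [predecessor nodes]}.
    Returns (arrival, departure): the longest delay from any frame input
    up to and including each node, and from each node to any frame output."""
    order = list(TopologicalSorter(fanin).static_order())
    fanout = {n: [] for n in order}
    for n, preds in fanin.items():
        for p in preds:
            fanout[p].append(n)

    arrival = {}
    for n in order:                      # forward pass
        arrival[n] = delay[n] + max(
            (arrival[p] for p in fanin.get(n, ())), default=0)

    departure = {}
    for n in reversed(order):            # backward pass
        departure[n] = delay[n] + max(
            (departure[s] for s in fanout[n]), default=0)
    return arrival, departure


# Chain a -> b -> c with a side input d -> c, unit delay everywhere.
delay = {"a": 1, "b": 1, "c": 1, "d": 1}
fanin = {"b": ["a"], "c": ["b", "d"]}
arrival, departure = two_pass_sta(delay, fanin)

period = 3
slack = {n: period - (arrival[n] + departure[n] - delay[n]) for n in delay}
```

Each node and edge is visited once per pass, so the cost is linear in the size of the frame. Nodes with zero slack lie on a path that already uses the full period.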
Exact Constraints
Fixing other end of timing path is conservative
May also move in the same direction, relaxing constraint
If other end of timing constraints is not fixed...
Conditional constraints
“Can retime R2 past v2 only if R1 is retimed past v1”
[Figure: example with registers R1, R2 and gates v1, v2; unit delay, max delay 3]
Computed from edge of bounded transitive fan-in/out cone
- delay
- register count
Constraint Implementation
Conservative Constraints: $C_{cons} \subseteq V$
Indicate nodes to be removed from problem
Exact Constraints: $C_{exact} \subseteq V \times V$
Implemented as unconstrained edges
Cut can only move beyond v2 if it moves beyond v1
Refinement
All timing constraints met by initial circuit
Guarantees flow from source to sink remains finite
Iteratively tighten conservative constraints into exact ones
Simplification: Only constraints limiting area improvement
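1. Initialize: $C_{cons}$ = all, $C_{exact} = \emptyset$
2. Compute cut with $C_{cons}$ (min exact+cons)
3. Compute cut w/o $C_{cons}$ (min exact)
4. Any conservative constraint between the two cuts? If yes, convert the $c_{cons}$ between the two cuts into $c_{exact}$ and return to step 2; if no, done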
Building Intuition
[Figure: example with unit delay and max delay 3, comparing the cuts under no constraint, the conservative constraint, and the exact constraint (min exact, min exact+cons)]
Constraints impose relations across multiple clock cycles
Building Intuition
[Figure: example with unit delay and max delay 2, comparing the cuts under no constraint, the conservative constraint, and the exact constraint (min exact, min exact+cons)]
Cycles in constraints lock retiming moves to be ‘in-step’
Experimental Results
Max path delay ≤ initial period, min path delay ≥ 0
[Figure: Flow-based vs. Minaret Runtime. Axes: Design, Num. Gates (1k's), Runtime (s). Series: Design Size, Minaret, Flow-based w/ STA]
Average number of exact constraints = 1.1% of design size
Summary
Key points:
1. Inherits benefits of flow-based retiming
   1. Optimal
   2. Fast
   3. Monotonic improvement
2. Problem reduced using both timing and area criticality
3. Advanced timing model and constraints
4. Scalable: intermediate solutions are timing-feasible
Initializability Constraints
Problem
Retimed circuit must preserve initialization behavior
Accomplished by:
1. Additional combinational logic
2. Identifying an equivalent initial state
forward retiming: simulation
backward retiming: SAT
Backward retiming jeopardizes initializability
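The asymmetry between the two directions can be illustrated on a single gate. In this sketch the backward case uses brute-force enumeration as a stand-in for a SAT solver, and it ignores the additional requirement that registers shared between several gates agree on one value; all names are illustrative.

```python
from itertools import product


def forward_initial_state(gate, input_inits):
    """Registers move from the gate inputs to its output:
    the new initial value is obtained by simulating the gate."""
    return gate(*input_inits)


def backward_initial_states(gate, num_inputs, output_init):
    """A register moves from the gate output to its inputs: search for
    input values that reproduce the old initial value.
    An empty result means no equivalent initial state exists."""
    return [bits for bits in product((0, 1), repeat=num_inputs)
            if gate(*bits) == output_init]


def and_gate(a, b):
    return a & b


def contradiction(a):
    # Always evaluates to 0, so an initial output value of 1 has no preimage.
    return a & (1 - a)
```

Forward retiming always yields exactly one new value. Backward retiming may yield several candidates, as for an AND gate with initial output 0, or none at all, which is the case that jeopardizes initializability.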
Background
How to transform an uninitializable circuit into an
initializable one?
‘Prayer’
minimizing register movement maximizes initializability [Pan99]
‘Slash and Burn’
incrementally tighten bound on negative retiming lag [Stok95]
‘Brute Force’
mixed-integer linear program [Sapatnekar97]
Feasibility Constraints
SAT problem with variables V : N Z
Feasibility Constraint : V
Partial cut of unrolled circuit
Sufficient to imply infeasibility
At least one element must be removed
UNSAT with only variables in TFO()
Built incrementally as retiming progresses
Variables switched off in SAT with additional per-clause variables z
$(x_1 \lor x_2 \lor x_5 \lor z_1) \land (x_1 \lor x_2 \lor z_2) \land \dots$
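The per-clause switch variables can be sketched with a tiny brute-force solver that accepts assumptions. This stands in for an incremental SAT solver and is only usable for a handful of variables; the literal encoding (signed integers, DIMACS style) and the variable numbering are assumptions for illustration.

```python
from itertools import product


def solve(clauses, num_vars, assumptions=()):
    """clauses: lists of nonzero ints; literal k means variable k is true,
    -k means it is false. Returns a satisfying assignment or None."""
    for bits in product((False, True), repeat=num_vars):
        def holds(lit):
            return bits[abs(lit) - 1] == (lit > 0)
        if all(holds(a) for a in assumptions) and all(
                any(holds(lit) for lit in clause) for clause in clauses):
            return bits
    return None


X1, Z1, Z2 = 1, 2, 3
# Each clause carries its own switch variable z: assuming z false keeps the
# clause active, leaving z free lets the solver satisfy the clause trivially.
clauses = [[X1, Z1], [-X1, Z2]]

both_active = solve(clauses, 3, assumptions=[-Z1, -Z2])
second_off = solve(clauses, 3, assumptions=[-Z1])
```

With both clauses active the instance is unsatisfiable. Dropping the assumption on the second switch variable makes it satisfiable again without rebuilding the clause set, which is what allows the same instance to be reused across queries.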
Feasibility Constraints
SAT w/o variables TFO( )
Binary search on v
Topologically order circuit
SAT w/o variables {u : topo(u) topo(v)} TFO( )
SAT w/o variables {u : topo(u) topo(v)} TFO( )
Fast: Incremental SAT
Faster: UNSAT core can be used to localize conflict
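The binary search over the topological order can be sketched independently of the SAT encoding. The oracle `sat_with_prefix(k)` is a hypothetical callback that reports whether the problem restricted to the first k nodes of the order is satisfiable. The sketch assumes that the empty prefix is satisfiable, that the full problem is not, and that adding nodes never turns an unsatisfiable prefix back into a satisfiable one.

```python
def first_unsat_prefix(num_nodes, sat_with_prefix):
    """Returns the smallest k for which the first k nodes are unsatisfiable.
    Invariant: prefix lo is satisfiable, prefix hi is not."""
    lo, hi = 0, num_nodes
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if sat_with_prefix(mid):
            lo = mid
        else:
            hi = mid
    return hi


calls = []


def toy_oracle(k):
    # Stand-in for a SAT query: the conflict appears once node 6 is included.
    calls.append(k)
    return k < 6


boundary = first_unsat_prefix(10, toy_oracle)
```

The node at position `boundary` in the order is the one whose inclusion first causes the conflict, and it is found with a logarithmic number of oracle queries.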
Constraint Type #2
Penalty Structure
Adds exactly one unit flow path
Insertion delayed until the constraint comes into frame
Biases against cuts below feasibility constraint
Closest cut of width +1 returned
Cut is squeezed forward
New cut of width +1 is closest and
therefore most initializable
If it isn’t... additional penalty needed
Algorithm
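1. Initialize $C_{feas} = \emptyset$
2. Compute min-cut constrained by $C_{feas}$
3. Initializable? If yes, done
4. If no, binary search on v (SAT checks) for a new feasibility constraint, add it to $C_{feas}$, and return to step 2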
Complexity: already NP-complete from test for initializability
Experimental Results
Equivalent init. state after min-reg retiming for most designs
Only one design was not initializable: s400
Can be easily lost if backward retiming invoked multiple times
A harder problem: randomized initial states
               Original           Infeasible   Feasible Min-Register
Name           Gates     Regs     Regs         Regs    Avg. ||   Runtime
s400           0.3k      21       18           +1      8.0       0.08s
oc_aes_core    16.6k     402      395          +3      2.0       2.55s
oc_vga_lcd     17.1k     1108     1087         +1      1.0       1.09s
nut_003        6.6k      484      450          +3      1.0       1.41s
radar12        71.1k     3875     3771         +27     2.3       108.3s
oc_wb_dma      29.2k     1775     1757         +2      3.5       5.70s
oc_minirisc    3.9k      289      271          +2      1.0       0.49s
Summary
Key points:
1. Optimal
2. Compatible with timing constraints
3. Worst-case bound non-polynomial, but fast in practice
Additional Applications
Physical constraints
Placement congestion: penalty structures
Electrical constraints
Capacitive load on clock network drivers: penalty structures
Others?
Contribution
New formulation of register minimization problem
Constraints of different forms can be added to problem
1. Timing
2. Initializability
3. Other
Improves upon best practices within each sub-problem
Unified solution to synthesis-ready retiming
Scalable