Slides - Cs.ucr.edu

Routing Algorithms for FPGAs with
Sparse Intra-cluster Routing Crossbars
Yehdhih Ould Mohammed Moctar1
Guy G.F. Lemieux2
Philip Brisk1
1University
of California Riverside
2University of British Columbia
22nd International Conference on Field Programmable Logic and Applications
Oslo, Norway, August 29-31, 2012
Outline
• Background
– FPGA architecture
– FPGA routing
• FPGAs w/full intra-cluster routing crossbars
– Implications for routing
• FPGAs w/sparse intra-cluster routing crossbars
– Routing challenges
– New routing algorithms
• Experimental Results
• Conclusion
2
FPGA Architecture (1/3)
BLE output
K
K-LUT
Clock
DFF
Configuration bit
Basic Logic Element (BLE)
3
FPGA Architecture (2/3)
W×Fcin:1 multiplexer
Each CLB has N BLEs (K-LUTs)
Configurable Logic Block (CLB)
Isolation
Buffers
IntraCluster
Routing
...
K
BLE
...
K
C Block
(inputs)
W routing segments
Each BLE connects to W×Fcout
segments in the routing channel
...
BLE
...
...
N local feedbacks
I = Number of of CLB inputs
W routing segments
C Block (outputs)
Versatile Place and Route (VPR) CLB Architecture
4
FPGA Architecture (3/3)
I/O Pads
CLB
Switch Block
(S Block)
Connection Block
(C Block)
5
Routing Resource Graph (RRG)
source
wire3 wire4
2-LUT
out
out
in1
wire1
wire2
in2
wire3
wire4
wire1
wire2
in2
in1
sink
www.eecg.toronto.edu/~aling/ece1718/project/fang/route_rr_graph.png
6
FPGA Routing
• Disjoint-path problem (on the RRG); NP-complete
• Input:
– Graph G(V, E)
– Set of sources S = {s1, s2, …, sm}
– Set of sets of sinks T = {T1, T2, …, Tn}, Ti = {ti1, ti2, …, tik}
• Solution
– Finds paths from each source si to all sinks in Ti
– Paths emanating from different sinks must be disjoint
(cannot shared any vertices or edges)
• Objective(s)
– Minimize delay, wirelength, etc.
7
Disjoint and Non-disjoint Paths
s1
t21
s2
t11
8
Disjoint and Non-disjoint Paths
Two nets are routed on the same FPGA segment
(remember, RRG vertices represent wires and CLB I/Os)
s1
t21
s2
t11
This route is illegal!
9
Disjoint and Non-disjoint Paths
s1
t21
s2
t11
This route is legal!
10
LUT Input Equivalence (1/3)
2-LUT
2-LUT
s1
in1
s1
in1
s2
in2
s2
in2
LUT inputs are interchangeable
11
LUT Input Equivalence (2/3)
2-LUT
2-LUT
s1
in1
s1
in1
s2
in2
s2
in2
s1
t11 = in1
s1
t21 = in1
s2
t21 = in2
s2
T11 = in2
Overly restrictive disjoint-path problem formulation
12
LUT Input Equivalence (3/3)
2-LUT
2-LUT
s1
in1
s1
in1
s2
in2
s2
in2
s1
s2
t
Represent all inputs of a LUT as one RRG sink, t
13
Outline
• Background
– FPGA architecture
– FPGA routing
• FPGAs w/full intra-cluster routing crossbars
– Implications for routing
• FPGAs w/sparse intra-cluster routing crossbars
– Routing challenges
– New routing algorithms
• Experimental Results
• Conclusion
14
CLB Architecture (Recap)
W×Fcin:1 multiplexer
Each CLB has N BLEs (K-LUTs)
Configurable Logic Block (CLB)
Isolation
Buffers
IntraCluster
Routing
...
K
BLE
...
K
C Block
(inputs)
W routing segments
Each BLE connects to W×Fcout
segments in the routing channel
...
BLE
...
...
N local feedbacks
I = Number of of CLB inputs
W routing segments
C Block (outputs)
Versatile Place and Route (VPR) CLB Architecture
15
Full Crossbar Intra-cluster Routing (1/6)
CLB inputs
2-LUT BLE
in1
in2
2-LUT BLE
in1
in2
Local feedbacks
• Typical of FPGAs 10+ years ago
16
Full Crossbar Intra-cluster Routing (2/6)
CLB inputs
2-LUT BLE
in1
in2
2-LUT BLE
in1
in2
Local feedbacks
• Each CLB input connects to each BLE input
17
Full Crossbar Intra-cluster Routing (3/6)
2-LUT BLE
CLB inputs
in1
in2
2-LUT BLE
in1
in2
Local feedbacks
• CLB inputs are equivalent
18
Full Crossbar Intra-cluster Routing (4/6)
CLB inputs
2-LUT BLE
in1
in2
2-LUT BLE
in1
in2
Local feedbacks
• No need to model intra-cluster routing in the RRG
19
Full Crossbar Intra-cluster Routing (5/6)
W×Fcin:1 multiplexer
Isolation
Buffers
Each CLB has N BLEs (K-LUTs)
Configurable Logic Block (CLB)
IntraCluster
Routing
...
C Block
(inputs)
W routing segments
K
BLE
...
K
...
BLE
...
...
N local feedbacks
I = Number of of CLB inputs
RRG
Each BLE connects to W×Fcout
segments in the routing channel
W routing segments
C Block (outputs)
RRG
20
Full Crossbar Intra-cluster Routing (6/6)
I/O Pads
CLB
Switch Block
(S Block)
Connection Block
(C Block)
Don’t model the RRG inside each CLB!
21
Outline
• Background
– FPGA architecture
– FPGA routing
• FPGAs w/full intra-cluster routing crossbars
– Implications for routing
• FPGAs w/sparse Intra-cluster routing crossbars
– Routing challenges
– New routing algorithms
• Experimental Results
• Conclusion
22
Sparse Crossbar Intra-cluster Routing (1/6)
CLB inputs
2-LUT BLE
in1
in2
2-LUT BLE
in1
in2
Local feedbacks
• Most FPGAs today
23
Sparse Crossbar Intra-cluster Routing (2/6)
CLB inputs
2-LUT BLE
in1
in2
2-LUT BLE
in1
in2
Local feedbacks
• CLB inputs are not equivalent!
24
Sparse Crossbar Intra-cluster Routing (3/6)
CLB inputs
2-LUT BLE
in1
in2
2-LUT BLE
in1
in2
Local feedbacks
• RRG must include the intra-cluster routing
25
Sparse Crossbar Intra-cluster Routing (4/6)
W×Fcin:1 multiplexer
Isolation
Buffers
Each CLB has N BLEs (K-LUTs)
Configurable Logic Block (CLB)
IntraCluster
Routing
...
C Block
(inputs)
W routing segments
Each BLE connects to W×Fcout
segments in the routing channel
K
BLE
...
K
...
BLE
...
...
N local feedbacks
I = Number of of CLB inputs
W routing segments
C Block (outputs)
RRG
26
Sparse Crossbar Intra-cluster Routing (5/6)
I/O Pads
CLB
Switch Block
(S Block)
Connection Block
(C Block)
RRG encompasses the entire FPGA
27
Sparse Crossbar Intra-cluster Routing (6/6)
• Key Issues relating to RRG size
– May run out of memory
• If you are not routing in the cloud
– Long router runtimes
• Large search space leads to slow convergence
• Thrashing
• Objective
– Route without expanding the whole RRG
28
Routing Algorithms
• Extensions to PathFinder routing algorithm
– Routes one net at a time via wavefront expansion
• Baseline
– Extend the RRG with intra-cluster routing at each CLB
• Selective RRG Expansion (SERRGE)
– Route nets to CLB inputs
– Dynamically expand the RRG to finish the route
• Partial Pre-Routing (PPR)
– Pre-route all nets within CLBs
– Complete global routes to CLB inputs
29
Baseline Router
CLB inputs
2-LUT BLE
in1
in2
Global
RRG
2-LUT BLE
in1
in2
Local feedbacks
• Inputs of the same LUT are equivalent!
30
SERRGE in Action (1/8)
I/O Pads
CLB
s
Switch Block
(S Block)
t
Start with a global RRG
Connection Block
(C Block)
31
SERRGE in Action (2/8)
I/O Pads
CLB
s
Switch Block
(S Block)
t
Route from s’s CLB to t’s CLB
Connection Block
(C Block)
32
SERRGE in Action (3/8)
I/O Pads
CLB
s
Switch Block
(S Block)
t
Connection Block
(C Block)
Locally expand the RRG with part of t’s CLB
33
SERRGE in Action (4/8)
CLB inputs
2-LUT BLE
in1
in2
t
2-LUT BLE
in1
in2
Local feedbacks
Maintain one copy of the intra-cluster topology
34
SERRGE in Action (5/8)
CLB inputs
2-LUT BLE
in1
in2
t
2-LUT BLE
in1
in2
Local feedbacks
Expand the fanout of the CLB input being routed
35
SERRGE in Action (6/8)
CLB inputs
2-LUT BLE
in1
in2
t
2-LUT BLE
in1
in2
Local feedbacks
Complete the route if possible; if not, try again
36
SERRGE in Action (7/8)
CLB inputs
2-LUT BLE
in1
in2
t
2-LUT BLE
in1
in2
Local feedbacks
Track used CLB and BLE inputs; remember the route
37
SERRGE in Action (8/8)
CLB inputs
2-LUT BLE
in1
in2
t
2-LUT BLE
in1
in2
Local feedbacks
Deallocate unused RRG resources
38
SERRGE Summary
• Global routing uses standard PathFinder
• Selectively expand portions of the RRG to complete
routes from CLB inputs to BLE inputs
• Storage Requirement
– Global RRG (standard PathFinder requirement)
– One copy of the intra-cluster routing topology
– All intra-cluster routes computed thus far
• Fairly challenging to implement!
39
PPR in Action (1/4)
2-LUT BLE
CLB inputs
in1
in2
t1
2-LUT BLE
in1
in2
t2
Local feedbacks
Process CLBs one at a time
40
PPR in Action (2/4)
CLB inputs
2-LUT BLE
in1
in2
t1
2-LUT BLE
in1
in2
t2
Local feedbacks
Compute local routes for each BLE
41
PPR in Action (3/4)
CLB inputs
2-LUT BLE
in1
in2
t1
2-LUT BLE
in1
in2
t2
Local feedbacks
CLB inputs now form equivalence classes.
42
PPR in Action (4/4)
I/O Pads
CLB
s
Switch Block
(S Block)
t
Connection Block
(C Block)
Perform global routing w/equivalence class constraints on CLB inputs
43
PPR Summary
• Pre-route each CLB
• Propagate LUT equivalences to CLB inputs
– A baseline router on the full-blown RRG would still need
to satisfy BLE input-pin equivalence constraints
• Route as normal
• Much easier implementation than SERRGE
44
Outline
• Background
– FPGA architecture
– FPGA routing
• FPGAs w/full intra-cluster routing crossbars
– Implications for routing
• FPGAs w/sparse Intra-cluster routing crossbars
– Routing challenges
– New routing algorithms
• Experimental Results
• Conclusion
45
Experimental Setup
• VPR 5.0, iFAR FPGA architecture files
– Crossbar population density, p = {25%, 50%, 75%, 100%}
– LUT size = K = {4, 5, 6, 7}
– BLEs per cluster, N = {4, 6, 8, 10}
• 10 largest IWLS benchmarks
– Results are averaged
• VPR-PathFinder runs for 100 iterations, then stops
–
–
–
–
Baseline (full-blown RRG)
SERRGE
PPR
VPR 5.0 (when crossbar population density is 100%)
46
Routability as a Function of LUT Size
Percentage of Nets Routed Successfully
100%$
99%
97%
97%
95%$
95%
97%
96%
95%
94%
90%$
92%
92%
90%
85%$
Baseline$
SERRGE$
PPR$
80%
80%$
4$
N = 8 BLEs per CLB
p = 50% population density
5$
6$
7$
LUT Size (K)
47
Routability as a Function of CLB Size
Percentage of Nets Routed Successfully
100%$
98% 98%
98%
97%
96%
95%$
97%
97%
96%
95%
95%
90%
90%$
85%
85%$
Baseline$
SERRGE$
PPR$
80%$
4$
6$
8$
Number of BLEs per cluster (N)
K = 6-LUTs
p = 50% population density
10$
48
Runtime as a Function
of Population Density
Average Runtime
1600"
1400"
Baseline
Seconds
1200"
1000"
SERRGE
800"
600"
400"
PPR
200"
0"
40%"
50%"
75%"
100%"
Intra-cluster routing crossbar population density (p)
K = 6-LUTs
49
N=8 BLEs per cluster
Runtime as a Function of LUT Size
Average Runtime
1000"
900"
Seconds
800"
700"
600"
500"
400"
300"
200"
100"
VPR"5.0"
SERRGE"
PPR"
0"
4"
N=8 BLEs per cluster
p = 100% population density
5"
6"
7"
LUT size (K)
50
PPR Intra-cluster Routing Runtime
Intra-cluster Routing Runtime for PPR
16"
Seconds
40%"
50%"
75%"
100%"
12"
8"
4"
K=6-LUTs
N=8 BLEs per cluster
ae
s_
co
pc
re
$
i_
br
id
ge
32
w
$
b_
co
nm
ax
$
ct
$
b_
fu
n
ac
_c
tr
l$
us
m
em
_c
trl
$
es
$
st
em
ca
a$
sy
ar
e
de
s_
cd
e
sy
st
em
sp
i
$
s$
0"
On average, total runtime
was 100s of seconds!
51
Static and Dynamic RRG Size
RRG Size
30"
25"
Sta8c"
MB
20"
Dynamic"
15"
10"
5"
0"
Baseline" SERRGE"
p=40%"
K=6-LUTs
N=8 BLEs per cluster
PPR"
Baseline" SERRGE"
PPR"
p=100%"
VPR generates FPGAs that are
sized to the application, and are
smaller than commercial
52
Critical Path Delay
Average Critical Path Delay
!14!!
13.60
ns
13.30
13.30
13.40
13.20
13.00
!13!!
13.10
13.10
12.80
Baseline!
12.85
SERRGE!
13.00
12.90
PPR!
!12!!
40%!
50%!
75%!
100%!
Intra-cluster routing crossbar population density (p)
K=6-LUTs
N=8 BLEs per cluster
53
Outline
• Background
– FPGA architecture
– FPGA routing
• FPGAs w/full intra-cluster routing crossbars
– Implications for routing
• FPGAs w/sparse Intra-cluster routing crossbars
– Routing challenges
– New routing algorithms
• Experimental Results
• Conclusion
54
Conclusion
• Addressed FPGA routing algorithms where CLBs
have sparse intra-cluster routing crossbars
• Minimal modifications to PathFinder required
• SERRGE vs. PPR
– SERRGE achieves better routability
– PPR converges faster
– PPR easier to implement
55