IMPROVING TIMING-DRIVEN FPGA PACKING WITH PHYSICAL INFORMATION∗
Doris T. Chen, Kristofer Vorwerk, Andrew Kennings
University of Waterloo
Waterloo, ON
{dtlchen,kpvorwer,akenning}@cheetah.vlsi.uwaterloo.ca
ABSTRACT
The traditional approach to FPGA packing and CLB-level placement has been shown to yield
significantly worse quality than approaches which allow BLEs to move during placement. In practice,
however, modern FPGA architectures require expensive DRC checks which can render full BLE-level
placement impractical. We address this problem by proposing a novel clustering framework that uses
physical information to produce better initial packings which can, in turn, reduce the amount of
BLE-level placement that is required. We quantify our packing technique across accepted benchmarks
and show that it produces results with 16% less wire length, 19% smaller minimum channel widths,
and 8% less critical path delay, on average, than leading methods.
1. INTRODUCTION
Modern FPGA architectures are typically implemented in a hierarchical fashion, where coarse-grained
logic blocks (CLBs) are employed to group basic logic elements (BLEs), which are, themselves,
groups of flip-flops (FFs) and lookup tables (LUTs). Hierarchical architectures have historically
been preferred due to the potential for reduced routing area requirements and better delay [1].
Before placement can occur in such architectures, the BLEs in a netlist must be packed (or
clustered) together to satisfy design rule constraints (DRCs).
Packing is traditionally performed after logic synthesis and prior to placement. As a result,
pre-placement metrics are used to estimate the quality of packing different BLEs together.
Connectivity information (such as net fanout) is one such metric. Timing information, obtained from
a preliminary timing analysis, is also typically employed to predict the criticality of edges in
the final implementation. By employing such metrics, a packing heuristic is able to reduce the
number of critical connections between resultant clusters, thereby lowering wire length and
critical path delay when the design is placed and routed.
Recent work [2] has highlighted a deficiency with this pack-then-place flow: namely, that up-front
packing followed by CLB-level placement produces results with significantly worse quality than when
BLEs are allowed to move during placement. In practice, modern FPGA architectures require expensive
DRC checks which render full BLE-level placement impractical. (Unlike VPR [3] architectures, modern
FPGAs typically possess complex constraints on internal feedbacks, the use of set and reset lines,
carry chains, clock domains, and so forth; DRC checking can, indeed, be a significant run-time
bottleneck.) It is this premise which has driven our work on packing and which forms the basis for
this paper.
This paper introduces two new packing algorithms, which we call DPACK and HDPACK, that produce
better initial packings, which in turn reduce the dependence on computationally expensive BLE-level
placement. (Our methods are independent of and complementary to BLE placement techniques [2].)
DPACK and HDPACK employ the concept of “physical clustering” [4] within a novel hybrid framework
for timing-driven FPGA packing. Our approach uses a quick min-cut, partitioning-based global placer
to determine approximate BLE locations; with this information, our packers are capable of making
more informed decisions which, in turn, lead to reduced wire lengths and critical path delays. We
quantify our techniques across accepted benchmarks and produce results with 16% less wire length,
19% smaller minimum channel widths, and 8% less critical delay, on average, than leading methods.
The rest of this paper is organized as follows. Section 2 provides a background on packing
algorithms. Section 3 presents the details of our packing framework. Numerical results are
presented in Section 4, and Section 5 offers concluding remarks.
2. BACKGROUND
FPGA packing algorithms typically fall into one of three
categories: (1) seed-based methods; (2) depth-optimal or
depth-relaxed methods; and (3) combined packing and
placement-based methods.
∗ This work was supported in part by a grant from the Natural Sciences
and Engineering Research Council of Canada (NSERC) and a grant from
Actel Corporation.
1-4244-1060-6/07/$25.00 ©2007 IEEE.
icant improvement in estimated wire length cost (roughly
7% ∼ 36%) and timing cost (roughly 17% ∼ 25%) depending on the FPGA architecture. SCPlace uses T-VPack to
generate an initial set of CLBs which are feasible for the architecture; i.e., an initial packing must still be performed.
Seed-based methods are among the most-established
techniques for packing BLEs in FPGAs. T-VPack [1] is
a good example of this type of packing algorithm. In
T-VPack, a seed BLE is chosen to start a CLB. Additional
BLEs are added to the CLB until no more BLEs can be added
without exceeding CLB constraints. BLEs are chosen (to add
to a CLB) based on a gain value calculated from a cost function. This cost function is based on the number of shared
edges between the BLE and the current CLB, as well as the
criticalities of shared edges.
Like T-VPack, RPack [5] also packs BLEs one at a time
starting with a seed BLE. However, RPack incorporates
routability metrics into the packing cost function. Compared
to VPack (a non-timing-driven T-VPack), [5] shows that
RPack can significantly improve circuit routability. However, [5] focuses only on routability—no performance numbers were presented to indicate the impact of packing for
routability on the final quality of result in terms of timing. In [6], numerical results show that while RPack outperforms VPack, it only produces results that are comparable to
T-VPack.
iRAC [6] is another seed-based, routability-driven packer.
Special attention is paid to the selection of a seed BLE. Furthermore, the number of pins that are usable on any CLB is
limited to match the Rent parameter of the architecture. Numerical results presented in [6] indicate that the improved
selection of the seed BLE coupled with the use of the Rent
parameter can reduce the number of inter-CLB edges by
roughly 30% compared to RPack and T-VPack for an architecture consisting of 8 BLEs per CLB. The number of
generated CLBs typically increases, however, which may be
a problem in a highly-utilized device. No performance numbers were presented.
3. PACKING METHODS
Computationally expensive BLE-level placement will likely
remain a necessity in modern FPGA CAD; however, it is the
premise of this work to reduce (not eliminate) the amount
of BLE-level placement by producing better CLB packings
in the first place. Our work can be used to complement a
BLE-level placer (i.e., by generating the initial clustering for
a tool like SCPlace).
Our work is essentially a hybridization of a top-down and
bottom-up packing approach. Our method employs a fast,
min-cut partitioner to obtain approximate physical locations
for BLEs; subsequently, bottom-up packing is performed using this physical information. We begin by describing our
packing algorithm, called DPACK, and then discuss how we
augmented it with physical information. Next, we discuss
further extensions to the method and present the resulting
hybridized packer, called HDPACK.
3.1. Greedy Packing (DPack)
In the pursuit of better packings, we first developed a seed-based algorithm, similar to T-VPack, whose pseudocode is
shown in Figure 1. Like T-VPack, a seed BLE is selected as
the most critical, unpacked block. We use the path counting
algorithm in [11] as a tie-breaking mechanism during seed
selection, with the block that has the highest path count selected as the seed. We also use logic depth as a secondary
mechanism to break ties, as in [12]. After the seed BLE has
been chosen, a cost function is computed for all blocks i, j
that are connected to this BLE and is given by
While capable of achieving very tight packings, seed-based approaches are localized, greedy algorithms and may
become trapped in local minima. Depth-optimal and depth-relaxed methods, including TLC [7], MLC [8], and RCP [9],
attempt to duplicate timing-critical logic during packing to
obtain a set of clusters with optimal depth. These clusters
are then merged into CLBs using a variety of bin-packing
methods. While effective at reducing critical path delay, the
process of logic duplication can be hard to control, leading
to large increases in area. Timing estimates made during
packing may not be accurate when compared with the final
placement [10].
$$\mathrm{Cost}_{ij} = \lambda \times E_{ij} + (1 - \lambda) \times \mathrm{Crit}_{ij} \tag{1}$$
where $E_{ij} = \sum_{e \in E_h \mid i,j \in e} \frac{1}{|e| - 1}$ and $\mathrm{Crit}_{ij} = \sum_{e \in E_h \mid i,j \in e} C(e)$.
Here, $E_h$ represents all nets in the netlist, $E_{ij}$ models connectivity, and $C(e)$ is the
estimated timing criticality of net $e$. In (1), λ controls the preference between edge absorption
and timing criticality. The BLE with the highest computed cost is added to the CLB. This continues
until either the CLB is full, or other constraints, such as the number of pins available on the
CLB, are exceeded. Then, a new seed BLE is chosen to start a new CLB, and the process is repeated
until the circuit has been packed.
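As a rough illustration of (1), the following Python sketch evaluates the cost for a pair of blocks. The representation of a net as a frozenset of block names, the `criticality` mapping, and the name `pair_cost` are assumptions made for this example only; the default λ = 0.8 is the value reported in Section 4.1.1 for DPACK without physical information.

```python
from typing import Dict, FrozenSet, List

Net = FrozenSet[str]  # assumption: a net is the set of blocks it connects


def pair_cost(i: str, j: str, nets: List[Net],
              criticality: Dict[Net, float], lam: float = 0.8) -> float:
    """Evaluate Eq. (1) for blocks i and j."""
    shared = [e for e in nets if i in e and j in e]
    e_ij = sum(1.0 / (len(e) - 1) for e in shared)   # connectivity term E_ij
    crit_ij = sum(criticality[e] for e in shared)    # timing term Crit_ij
    return lam * e_ij + (1.0 - lam) * crit_ij


# Hypothetical three-block example.
nets = [frozenset({"a", "b"}), frozenset({"a", "b", "c"})]
criticality = {nets[0]: 0.9, nets[1]: 0.2}
cost_ab = pair_cost("a", "b", nets, criticality)  # a and b share both nets
cost_ac = pair_cost("a", "c", nets, criticality)  # a and c share only the 3-pin net
```

In this toy netlist the pair (a, b) scores higher than (a, c) because it shares a two-pin net, which contributes a full unit to $E_{ij}$, and because that net is also the more critical one.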
DPACK also incorporates the hill-climbing and unrelated
logic packing algorithms from [12]. When the pin constraints of a CLB have been reached, but the CLB is not full,
Another alternative is to alter the placement algorithm
such that BLEs can move between CLBs. SCPlace [2] proposes a simulated annealing-based placement method that is
capable of moving both CLBs and BLEs; i.e., any individual
BLE is capable of being moved to another CLB during placement. Compared to VPR—which uses the traditional flow of
packing BLEs into CLBs followed by simulated annealingbased placement of CLBs—SCPlace demonstrates signif-
Procedure: DPACK
Inputs: A netlist to be packed, N
Returns: A packed netlist, N
Perform timing analysis on the circuit;
Compute block criticalities via Kong path counting;
Sort block criticality from highest to lowest;
seedBLE ← most critical unclustered node;
while seedBLE > 0 do
    clus ← new cluster;
    clus.add(seedBLE);
    while clus.getNumBLEs() < maxNumBLEsPerCLB do
        for each BLE that shares an edge with clus do
            Calculate the cost of adding the BLE;
            if BLE can be added (passes DRC) then
                costVector.add(BLE, cost);
            end if
        end do
        BLEtoAdd ← getHighestCostBLE(costVector);
        if BLEtoAdd is not valid then
            BLEtoAdd ← get best unrelated BLE to add;
        end if
        if BLEtoAdd is valid then
            clus.add(BLEtoAdd);
        else
            break;
        end if
    end do
    Add clus into N;
    seedBLE ← most critical unclustered node;
end do
return N;
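The DPACK procedure can be approximated in a few lines of Python. This is only a sketch under simplifying assumptions: the sole DRC rule is the BLE capacity of a CLB, the hill-climbing phase is omitted, the gain of a candidate is Eq. (1) summed over the nets it shares with the cluster, and the unrelated-logic step simply takes the most critical leftover BLE. The names `attraction` and `dpack_sketch` and the data layout are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set

Net = FrozenSet[str]  # assumption: a net is the set of BLEs it connects


def attraction(ble: str, cluster: Set[str], nets: List[Net],
               net_crit: Dict[Net, float], lam: float) -> float:
    """Eq. (1)-style gain of adding `ble`, over nets it shares with the cluster."""
    shared = [e for e in nets if ble in e and e & cluster]
    e_term = sum(1.0 / (len(e) - 1) for e in shared)
    c_term = sum(net_crit[e] for e in shared)
    return lam * e_term + (1.0 - lam) * c_term


def dpack_sketch(block_crit: Dict[str, float], nets: List[Net],
                 net_crit: Dict[Net, float], max_bles: int,
                 lam: float = 0.8) -> List[List[str]]:
    unpacked = set(block_crit)
    clusters: List[List[str]] = []
    while unpacked:
        # Seed: most critical unclustered BLE (name order breaks ties here).
        seed = max(sorted(unpacked), key=lambda b: block_crit[b])
        cluster = [seed]
        unpacked.remove(seed)
        while len(cluster) < max_bles and unpacked:
            gains = {b: attraction(b, set(cluster), nets, net_crit, lam)
                     for b in sorted(unpacked)}
            connected = {b: g for b, g in gains.items() if g > 0.0}
            if connected:
                best = max(connected, key=connected.get)
            else:
                # Stand-in for unrelated-logic packing: most critical leftover BLE.
                best = max(sorted(unpacked), key=lambda b: block_crit[b])
            cluster.append(best)
            unpacked.remove(best)
        clusters.append(cluster)
    return clusters


# Hypothetical chain a-b-c-d with decreasing criticality.
block_crit = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.1}
nets = [frozenset({"a", "b"}), frozenset({"b", "c"}), frozenset({"c", "d"})]
net_crit = {nets[0]: 0.9, nets[1]: 0.5, nets[2]: 0.1}
packed_2 = dpack_sketch(block_crit, nets, net_crit, max_bles=2)
packed_3 = dpack_sketch(block_crit, nets, net_crit, max_bles=3)
```

With two BLEs per CLB the chain is packed as (a, b) and (c, d); with three BLEs per CLB the first cluster absorbs c and leaves d on its own, which illustrates the greedy, one-CLB-at-a-time behaviour of seed-based packing.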
nores architectural constraints). We employ terminal propagation and alternate cuts in the horizontal and vertical directions; moreover, I / Os are partitioned along with the FFs and
LUTs in the flat netlist.
Once the number of nodes in a partition is less than a predetermined amount or the depth of the partitioning tree has
exceeded a threshold, the algorithm stops. The BLEs within
the partition are assigned the same x and y grid locations—
this is acceptable since it is not our intention for this fast
partitioning to generate legal placements, but rather to provide a rough idea of what BLEs may end up close together.
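The recursive placement described above can be sketched as follows. The bipartitioner used here, `split_in_half`, is only a placeholder that cuts the cell list in the middle; it is not a min-cut partitioner and does not model hMetis or terminal propagation. The function names, the default `min_cells`, and the rectangular region arguments are assumptions of this example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def split_in_half(cells: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Placeholder bipartitioner; a real flow would call a min-cut tool here."""
    mid = len(cells) // 2
    return list(cells[:mid]), list(cells[mid:])


def rough_place(cells: Sequence[str], x0: float, y0: float, x1: float, y1: float,
                max_depth: int = 5, min_cells: int = 2, depth: int = 0,
                vertical: bool = True,
                out: Optional[Dict[str, Point]] = None) -> Dict[str, Point]:
    """Recursively bisect the region, alternating cut directions."""
    if out is None:
        out = {}
    # Stop on the depth threshold or when the partition is small enough;
    # every cell in the partition then receives the same (x, y) location.
    if depth >= max_depth or len(cells) < min_cells:
        centre = ((x0 + x1) / 2.0, (y0 + y1) / 2.0)
        for c in cells:
            out[c] = centre
        return out
    first, second = split_in_half(cells)
    if vertical:
        xm = (x0 + x1) / 2.0
        rough_place(first, x0, y0, xm, y1, max_depth, min_cells, depth + 1, False, out)
        rough_place(second, xm, y0, x1, y1, max_depth, min_cells, depth + 1, False, out)
    else:
        ym = (y0 + y1) / 2.0
        rough_place(first, x0, y0, x1, ym, max_depth, min_cells, depth + 1, True, out)
        rough_place(second, x0, ym, x1, y1, max_depth, min_cells, depth + 1, True, out)
    return out


# Hypothetical eight-cell netlist on a 4 x 4 grid, two levels of cuts.
cells = [f"b{k}" for k in range(8)]
pos = rough_place(cells, 0.0, 0.0, 4.0, 4.0, max_depth=2, min_cells=1)
```

After two levels of cuts the eight cells fall into four quadrants, and both cells in a quadrant share its centre, so the result is an overlapping, illegal placement that still indicates which BLEs are likely to end up close together.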
To account for physical information, the cost function (1)
was augmented with an additional term. We have found that
using physical information in the cost function, rather than
using the actual embedding produced by the partitioner to
generate initial clusterings, offers better performance and
simplifies DRC checking. The new cost function is given
by
$$\mathrm{Cost}_{ij} = \lambda \times E_{ij} + \gamma \times \mathrm{Crit}_{ij} - (1 - \lambda - \gamma) \times \mathrm{Dist}_{ij} \tag{2}$$
where $\mathrm{Dist}_{ij} = \frac{|x_i - x_j|}{\mathrm{GridSize}_x} + \frac{|y_i - y_j|}{\mathrm{GridSize}_y}$.
In this equation, λ and γ control the preference between edge absorption and timing. The
$\mathrm{Dist}_{ij}$ term is a calculation of the Manhattan distance between the current CLB and
the potential BLE (normalized by the grid size). We note that this cost penalizes objects which are
far apart. Several other formulations of the cost function were considered (for instance, we tested
the original function that we introduced in [4]), but the formulation that we present here was
found to yield the best performance.
Fig. 1: Pseudocode for DPACK.
We also modified the way in which DPACK accounts for
unrelated logic packing. In the original algorithm, the BLE
that could fully utilize the remaining available inputs of a
CLB was added. In practice, there can be many blocks with
the same number of inputs. To break ties, we use the physical distance between the potential BLEs and the current CLB.
Consequently, the closest BLE is added to the CLB.
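A small Python sketch of the distance term, the augmented cost (2), and the distance-based tie-break for unrelated logic is given below. The function names and the tuple-based positions are assumptions of this example; the default weights λ = 0.2 and γ = 0.4 are the DPACK values reported in Section 4.1.1.

```python
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float]


def dist(p: Point, q: Point, grid_x: float, grid_y: float) -> float:
    """Manhattan distance normalized by the grid size, as in Eq. (2)."""
    return abs(p[0] - q[0]) / grid_x + abs(p[1] - q[1]) / grid_y


def physical_cost(e_ij: float, crit_ij: float, dist_ij: float,
                  lam: float = 0.2, gamma: float = 0.4) -> float:
    """Eq. (2): the distance term is subtracted, penalizing far-apart objects."""
    return lam * e_ij + gamma * crit_ij - (1.0 - lam - gamma) * dist_ij


def closest_ble(candidates: Iterable[str], pos: Dict[str, Point],
                clb_pos: Point, grid_x: float, grid_y: float) -> str:
    """Tie-break among equally suitable unrelated BLEs: pick the nearest one."""
    return min(sorted(candidates),
               key=lambda b: dist(pos[b], clb_pos, grid_x, grid_y))


# Hypothetical values on a 4 x 4 grid.
d = dist((1.0, 1.0), (3.0, 3.0), 4.0, 4.0)
cost = physical_cost(e_ij=1.5, crit_ij=1.1, dist_ij=d)
pos = {"u": (1.0, 1.0), "v": (3.0, 3.0)}
chosen = closest_ble(["u", "v"], pos, clb_pos=(1.0, 2.0), grid_x=4.0, grid_y=4.0)
```

Because the weights sum to one, choosing λ and γ fixes the physical weight; with the defaults above the distance term carries a weight of 0.4.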
the packer enters a hill-climbing phase; BLEs are continuously added to the CLB even if the number of pins on the
resulting CLB exceeds what is feasible. This is done in the
hope that, by adding more BLEs to the CLB, the number
of pins can be reduced as more edges are absorbed. If, after reaching the maximum number of BLEs per CLB, the pin
constraints are still violated, the last feasible arrangement is
restored. If a CLB is not full, additional BLEs that have no
direct connection (i.e., unrelated logic) with the BLEs in the
CLB may be added provided that the DRC constraints are not
violated.
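The hill-climbing phase with rollback can be sketched in Python as follows. The pin model is a deliberate simplification assumed for this example: every net that crosses the cluster boundary is counted as one used pin, with no distinction between inputs, outputs, or clocks. The names `used_pins` and `hill_climb` are hypothetical, and the order of `candidates` is assumed to come from the cost function.

```python
from typing import FrozenSet, Iterable, List, Set

Net = FrozenSet[str]  # assumption: a net is the set of BLEs it connects


def used_pins(cluster: Set[str], nets: List[Net]) -> int:
    """Count nets with at least one BLE inside and one BLE outside the cluster."""
    return sum(1 for e in nets if e & cluster and e - cluster)


def hill_climb(cluster: Iterable[str], candidates: Iterable[str], nets: List[Net],
               max_bles: int, max_pins: int) -> Set[str]:
    current = set(cluster)
    last_feasible = set(cluster)  # the starting cluster is assumed to be feasible
    for ble in candidates:
        if len(current) >= max_bles:
            break
        current.add(ble)  # accept the BLE even if the pin limit is exceeded
        if used_pins(current, nets) <= max_pins:
            last_feasible = set(current)
    # Keep the final cluster only if it is feasible; otherwise roll back.
    if used_pins(current, nets) <= max_pins:
        return current
    return last_feasible


# Hypothetical netlist: adding c alone violates the pin limit, adding d repairs it.
nets = [frozenset({"a", "c"}), frozenset({"b", "c"}), frozenset({"c", "d"}),
        frozenset({"c", "d", "e"}), frozenset({"c", "f"})]
recovered = hill_climb({"a", "b"}, ["c", "d"], nets, max_bles=4, max_pins=2)
rolled_back = hill_climb({"a", "b"}, ["c"], nets, max_bles=3, max_pins=2)
```

In the first call the cluster passes through an infeasible state (three pins after adding c) and becomes feasible again once d absorbs a net; in the second call no repair is possible, so the last feasible arrangement is restored.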
3.2. Incorporating Physical Information
The concept of “physical clustering” has been employed
successfully in ASIC placement [4]. In this approach, an
initial placement for cells in the unclustered netlist is determined via a quick global placement (which ignores overlap and legality constraints). The clustering method leverages the inter-cell distances from this approximate placement to make better clustering decisions when breaking ties
and packing unrelated logic.
Before physical information could be incorporated into
DPACK, we first developed a simplistic, top-down, min-cut partitioning-based global placer. Our placer uses
hMetis [13] to recursively bi-partition and place the primitive netlist. Our technique does not employ placement feedback, a cutline oracle, or branch-and-bound partitioning, as
in [14]—it is merely intended to act as a fast and approximate means of determining a rough placement (which ig-
3.3. Hybridized Packing (HDPack)
We sought to further improve the quality of CLBs produced
by DPACK by, once again, borrowing concepts from ASIC
clustering. Specifically, we applied our concept of Hybrid
First Choice Clustering (HFCC) from [4] to FPGA packing.
We have previously applied HFCC to large-scale placement,
and we felt that it would be worthwhile to employ it in an
FPGA context.
In HFCC, objects are initially placed onto a “free” list
which contains the set of objects which have not been paired.
The affinity for pairing any two objects is calculated using (2). The algorithm repeatedly removes the object with
the highest affinity from the free list, and pairs it with the
object that (originally) yielded this high affinity, even if that
object had already been paired. Once an object has been
paired, it is said to form a “cluster”. An unpaired object is
always paired with either another unpaired object or a cluster. The position of each intermediate cluster is set to the
average location of its contained cells.
HFCC is very effective at making good pairwise packings
and in minimizing the number of external nets in the clustered netlist. However, we discovered a significant drawback inherent in the approach: HFCC initially creates a large
number of clusters which can be difficult to pack together
in later stages of the algorithm due to DRC constraints and a
lack of a hill-climbing phase. Consequently, HFCC packings
typically contain several percent more CLBs than DPACK or
T-VPack—for highly-utilized devices, this can be a significant drawback. This is similar to depth-optimal methods
(without duplication) in which the bin-packing applied after the initial clustering cannot effectively group clusters together to reduce the CLB count.
In contrast, DPACK was known to achieve good critical
delay reduction with “tight” packings. In the hopes of benefiting from the high net absorption offered by HFCC, while
still preserving the critical delay reduction from DPACK,
we devised a hybrid flow—HDPACK—that combines both
strategies. The pseudocode for this hybrid flow is shown in
Figure 2. In this combined approach, HFCC is used as a prepacking step before DPACK packing is called. First, HFCC is
used to make initial pairings; when the affinity values of the
pairings in the HFCC packer fall below a threshold, HFCC
packing is stopped and the list of “intermediate” clusters
is then fed to DPACK to complete. The threshold between
where HFCC terminates and DPACK begins was parameterized and swept to determine when it was best to stop HFCC
packing and to begin DPACK. In practice, we have found
that this hybrid flow produces good improvements in wire
length and delay over traditional methods.
Procedure: HDPACK
Inputs: A netlist to be packed, N
Returns: A packed netlist, N
Perform min-cut partitioning to determine initial cell locations;
// Do affinity clustering ...
for each edge e ∈ N do
    for each cell i, j ∈ e do
        // This cost is computed using Eq. 2
        Cost_ij ← compute affinity cost for pairing i, j;
    end do
end do
Sort all affinity Cost_ij from largest cost to lowest;
StoppingCost ← predetermined Cost_ij value at which to stop;
for each Cost_ij do
    Attempt to pack cell i and j together;
    if DRC was not successful then
        continue;
    end if
    if Cost_ij < StoppingCost then
        break;
    end if
end do
// Finish off using greedy ...
Put each unclustered BLE into its own cluster;
Collect statistics (num pins, etc.) for each cluster;
Sort the list of clusters first (num BLEs contained, num pins, criticality);
seedClus ← cluster with highest # BLEs, highest # pins, highest criticality;
while seedClus is valid do
    for each BLE B that shares an edge with clus (not in full cluster) do
        // We use physical information when computing the cost of B
        Calculate the cost of putting B in seedClus;
        if B can be added to seedClus without violating DRC then
            costVector.add(B, cost);
        end if
    end do
    BLEtoAdd ← getHighestCostBLE(costVector);
    if BLEtoAdd is not valid then
        BLEtoAdd ← get best unrelated BLE to add;
    end if
    if BLEtoAdd is in another cluster already then
        Remove BLEtoAdd from its original cluster;
    end if
    if BLEtoAdd is valid then
        seedClus.add(BLEtoAdd);
    else
        break;
    end if
    if seedClus.getNumBLEs() = numBLEsPerCLB then
        Mark seedClus as full, and therefore cannot be modified anymore;
    end if
    seedClus ← most fully used, yet still incomplete cluster;
end do
return N;
Fig. 2: Pseudocode for HDPACK.
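The HFCC pre-packing stage of Fig. 2 can be illustrated with the Python sketch below. It makes several assumptions that are not part of the procedure above: DRC checks are ignored, the affinity is supplied by the caller rather than computed from Eq. (2), and each object's best partner is computed once from the original objects. The names `hfcc_prepack` and `stop_below` are hypothetical; `stop_below` plays the role of StoppingCost.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]


def hfcc_prepack(pos: Dict[str, Point], affinity: Callable[[str, str], float],
                 stop_below: float) -> Tuple[List[List[str]], List[Point]]:
    names = sorted(pos)
    # Best partner of every object, computed once from the original objects.
    best = {}
    for o in names:
        partner = max((p for p in names if p != o), key=lambda p: affinity(o, p))
        best[o] = (affinity(o, partner), partner)
    free = set(names)                 # objects that have not been paired yet
    member_of: Dict[str, int] = {}
    clusters: List[List[str]] = []
    for o in sorted(names, key=lambda n: best[n][0], reverse=True):
        score, partner = best[o]
        if score < stop_below:
            break                     # leave the rest to the greedy stage
        if o not in free:
            continue                  # already absorbed as another object's partner
        free.remove(o)
        if partner in free:           # pair two unpaired objects into a new cluster
            free.remove(partner)
            member_of[o] = member_of[partner] = len(clusters)
            clusters.append([o, partner])
        else:                         # partner already belongs to a cluster
            cid = member_of[partner]
            clusters[cid].append(o)
            member_of[o] = cid
    # Each intermediate cluster sits at the average location of its cells.
    centroids = [(sum(pos[m][0] for m in c) / len(c),
                  sum(pos[m][1] for m in c) / len(c)) for c in clusters]
    return clusters, centroids


# Hypothetical affinities for three objects.
table = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): 0.5}
aff = lambda u, v: table[tuple(sorted((u, v)))]
pos = {"a": (0.0, 0.0), "b": (2.0, 0.0), "c": (2.0, 2.0)}
clusters, centroids = hfcc_prepack(pos, aff, stop_below=0.6)
clusters_all, _ = hfcc_prepack(pos, aff, stop_below=0.0)
```

With the threshold at 0.6 only the strong pair (a, b) is formed and c is left for the greedy stage; lowering the threshold lets c join the existing cluster, which mirrors the way the HFCC/DPACK hand-off point is swept in Section 3.3.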
4. NUMERICAL RESULTS
To measure the effectiveness of our algorithms, several experiments were conducted. We used the
twenty largest designs from the MCNC benchmark set [3, 15]. To establish a baseline for comparison,
we packed designs using T-VPack, followed by VPR for placement and routing. Similarly, we employed
both DPACK and HDPACK to perform packing, and then placed and routed the resultant netlists using
VPR. For each circuit, the packing tools and VPR were executed 5 times (with different seeds) and
the average of all five runs was used for comparison purposes. We set VPR’s “timing tradeoff” to
0.5 for all tests. All wire length and critical path delays reported are obtained after routing.
A good packing algorithm should be able to perform well under a variety of situations and
constraints. Therefore, we consider a range of architectural sizes, corresponding to
N = 2, 4, 8, 12 BLEs per CLB, in our tests. This allows us to determine whether a given packing
algorithm can perform well for both large and small CLB sizes. For each architecture, the number of
CLB inputs is calculated as I = 2N + 2. For low-stress routing tests, we set the channel width to
be 20% greater than the minimum channel width found from the baseline (T-VPack) flow. The grid size
is set to the smallest square grid that can accommodate a particular design. We use length-1
segments in all routing architectures.
4.1. Results for DPack and HDPack
4.1.1. Low-Stress Routing
Our first set of experiments compares DPACK and HDPACK to T-VPack in low-stress routing conditions
[12]. In our first test, physical information was not used. Since only edge absorption and timing
information were employed, only one trade-off parameter was used in the cost functions of our
packing tools. For DPACK, λ = 0.8 was found to yield the best results in terms of wire length and
critical path delay. For HDPACK, the best results were obtained using λ = 0.9. The number of
external nets after packing and the final routed wire lengths and critical delays are shown in
Table 1. Results are first normalized against the baseline flow (T-VPack) and then averaged
geometrically across all twenty designs.
Table 1: Packing without physical information, compared to T-VPack.
                    DPACK                       HDPACK
N          Ext Nets   WL      Crit     Ext Nets   WL      Crit
2          0.966      0.902   0.963    0.948      0.873   0.937
4          0.928      0.892   0.985    0.858      0.874   1.007
8          0.911      0.900   0.999    0.832      0.847   0.984
12         0.937      0.931   0.986    0.844      0.825   1.013
Geomean    0.94       0.91    0.98     0.87       0.85    0.98
As shown in Table 1, both DPACK and HDPACK result in significantly better net absorption and wire
length reduction than the baseline flow. A clear difference between DPACK and HDPACK can be seen in
terms of the wire length improvement. Since the HFCC method employed in HDPACK pairs BLEs that share
the highest affinities, it is able to make the best decisions early on and is “unconcerned” with
fully packing CLBs. In contrast, DPACK is limited in that it must complete one CLB before moving on
to another. It is possible that in this process, some packed BLEs may have been better off packed
with other, still unpacked, BLEs.
In our second test, physical information was used during packing. Since there are now two
independent weighting factors (cf. Section 3), a two-dimensional sweep was performed to find the
best configuration. For DPACK, the best results were obtained using λ = 0.2 and γ = 0.4 (yielding
a physical information weight of 0.4). For HDPACK, the best configuration was found with λ = 0.2,
γ = 0.2 (and the physical weight of 0.6). Results using physical information are shown in Table 2.
With physical information, DPACK was able to achieve significant reductions in wire length and
critical path delay, with an average improvement of 16% and 8%, respectively. This represents an
improvement of 7% in wire length and 6% in critical path delay compared to DPACK without physical
information. Significant improvements for HDPACK are also evident, with improvements of 6% in wire
length and 4% in critical delay compared to HDPACK without physical information.
Table 2: Packing with physical information, compared to T-VPack.
                    DPACK                       HDPACK
N          Ext Nets   WL      Crit     Ext Nets   WL      Crit
2          0.962      0.862   0.900    0.966      0.846   0.915
4          0.937      0.834   0.920    0.900      0.804   0.960
8          0.908      0.823   0.922    0.873      0.768   0.939
12         0.942      0.834   0.937    0.864      0.763   0.963
Geomean    0.94       0.84    0.92     0.90       0.79    0.94
The run-time ratios of the DPACK and HDPACK flows compared to T-VPack were computed and compared.
The results (with and without physical information) are summarized in Table 3. Generally, the use
of physical information incurred negligible run-time penalty in the context of the entire
place-and-route run-time for most architectures. (For the case of N = 12, the MCNC benchmarks that
we considered were clustered into such small netlists that placement and routing time approached
that of the packing time. Consequently, these results tend to show more variability which we do not
feel is indicative of performance on much larger, real-world designs.)
Table 3: Run-time comparison vs. baseline.
                DPACK                     HDPACK
N       No Physical   Physical    No Physical   Physical
2       0.959         0.965       0.991         0.985
4       0.974         0.985       0.990         0.985
8       1.070         1.074       1.027         1.024
12      1.288         1.255       1.150         1.130
4.1.2. High-Stress Routing
We conducted an experiment using high-stress routing to find minimum channel widths. The search for
minimum channel width was performed 5 times for each design for all architectures under
consideration. The average channel width was computed for each case and then normalized to the
minimum channel width found by the baseline flow. Physical information was enabled for these tests.
The channel width improvement relative to T-VPack is shown in Table 4. DPACK and HDPACK were
extremely successful in reducing minimum channel widths, with a 19% and 24% improvement,
respectively, across all architectures.
We compare to RPack [5] as follows. For the N = 8 architecture, RPack cites a 16.5% improvement in
minimum channel width versus VPack (cf. [5], Table 3). In [6], however, it is shown that RPack does
not provide any improvement versus T-VPack (cf. [6], Table 2), where it is also pointed out that
T-VPack provides better results than its non-timing-driven counterpart VPack. Given that, for
N = 8, DPACK and HDPACK yield improvements of 24% and 24.5%, respectively, compared to T-VPack, we
conclude that we outperform RPack even though we do not consider minimum channel width as an
objective.
Comparison with iRAC [6] is more difficult. Since iRAC produces more CLBs compared to other packing
methods, the results in [6] use a different grid size and VPR “io_rat” value. We were unable to
reproduce the T-VPack results presented in [6]. Further, we note that the minimum channel width
experiments in [6] were obtained in combination with a modified version of VPR, called iRAP, that
includes a congestion term in the placement algorithm’s objective function. It is reasonable to
expect that the modified placement algorithm also served to reduce channel widths. Nevertheless,
our results of 24% and 24.5% reduction in channel widths for DPACK and HDPACK, respectively,
compare favorably to the 38% reduction obtained by the iRAC+iRAP algorithm. We expect that by using
a congestion-driven placer, we could reduce this gap.
Fig. 3: Wire length reduction vs. partition depth.
4.2. How Much Physical Information is Enough?
Even though partitioning algorithms are fast, they still incur
some penalty in terms of run-time. If we can find a point
after which partitioning does not give much wire length and
critical delay reduction, there is no need to incur the additional run-time penalty. Therefore, we conducted an experiment to determine the partition tree depth that leads to the
best wire length and delay trade-offs.
Our partitioner was set to terminate when all end partitions were of a specified partition depth or when partitions
contained fewer than a set number of cells in the primitive
netlist. We varied the partition depth from 0 (no partitioning
at all) to 14, for each of the 20 designs in the benchmark
suite. This test was conducted across our four architecture
sizes of (N = 2, 4, 8, 12). Wire length improvement results
are shown in Figure 3 and critical delay reduction is shown
in Figure 4.
A dramatic initial reduction in both circuit metrics, as partition depth is increased, can be seen. Wire length improvement is greatest at a partition depth of 5, beyond which the
average wire length reduction increases only slightly before
flattening out. The trend for critical delay reduction is less
apparent (although this may point to limitations in our simplistic partitioner). For almost all architectures, a partition
depth of 5 yielded the best overall wire length and critical
delay reduction. (We note that the run-times presented in
Table 3 were obtained using a partition depth of 5.) Because
the MCNC benchmark suite consists of relatively similarly-sized designs, partition depth was, by and large, a satisfactory stopping criterion; the number of cells per partition is
an alternative stopping metric that would likely be more suitable for suites with more varying design sizes. Regardless,
the key point in our finding is that, to achieve a good improvement in packing, the placement information does not
need to be precise, but must serve only as an approximate
“guide”.
Fig. 4: Critical delay reduction vs. partition depth.
4.3. Further Comparison and Integration of Other Methods
We compared DPACK and HDPACK to T-VPack, RPack, and iRAC in terms of CLB statistics alone. We note
that RPack and iRAC were primarily geared toward addressing routability; neither of these tools
presents timing results (as we do), and this skews the results against our packers (i.e., by
optimizing for wire length, as opposed to both timing and wire length, as we do, RPack and iRAC can
achieve more favorable net absorption statistics and possibly reduce routing congestion, which
could lead to lower minimum channel widths). We report the number of CLBs, number of nets in the
CLB-level netlist, and the average number of pins used per CLB for the N = 8 case in Table 5.¹ All
results are normalized with respect to T-VPack. iRAC was found to produce the lowest number of nets
and average used-pins-per-cluster; however, this was achieved at the cost of significantly more
CLBs. The next best packing results were found by HDPACK. We stress that the data presented in
Table 5 are not figures of merit for FPGA placement; rather, they serve to illustrate the
differences in pre-placement packing statistics between the various methods.²
Table 4: Improvement in minimum channel widths.
N          DPACK    HDPACK
2          0.805    0.764
4          0.808    0.776
8          0.760    0.755
12         0.854    0.756
Geomean    0.81     0.76
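The normalize-then-average procedure used for the tables can be reproduced with a few lines of Python. The helper name `normalized_geomean` is an assumption of this sketch; the per-architecture ratios are the ones listed in Table 4, and the Geomean row appears consistent with the geometric mean of the four rows above it.

```python
from statistics import geometric_mean
from typing import Sequence


def normalized_geomean(ours: Sequence[float], baseline: Sequence[float]) -> float:
    """Normalize each result against the baseline, then average geometrically."""
    return geometric_mean([o / b for o, b in zip(ours, baseline)])


# Per-architecture channel-width ratios from Table 4 (N = 2, 4, 8, 12).
dpack_ratios = [0.805, 0.808, 0.760, 0.854]
hdpack_ratios = [0.764, 0.776, 0.755, 0.756]
dpack_geomean = round(geometric_mean(dpack_ratios), 2)
hdpack_geomean = round(geometric_mean(hdpack_ratios), 2)

# Hypothetical raw measurements, only to show the normalization step.
example = normalized_geomean([90.0, 80.0], [100.0, 100.0])
```

A geometric mean is the usual choice for averaging ratios, since it treats a ratio and its reciprocal symmetrically and is less dominated by a single outlying design than an arithmetic mean.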
¹ Comparisons with other architecture sizes are omitted due to space limitations and because RPack and iRAC only present results when N = 8.
² See Tables 1, 2, and 4 for a summary of our methods’ results in terms of critical path delay, wire length, and channel widths.
Table 5 indicates that there is only a small increase in
the number of clusters made by DPACK or HDPACK. It
is important to note that, in our approach, the improvement
seen from the use of physical information is not a manifestation of depopulated CLBs, which has been shown to help
routability at the expense of area [16]. We note that DPACK
and HDPACK achieve remarkably good net absorption (not
to mention routed critical path delays and wire lengths, as
discussed previously) given that they produce very little depopulation compared to other techniques. Although we observed an increase of 2% in the average number of pins used
per CLB in DPACK, which may lead to more difficulty in
routing [17], we have not found this to be an issue.
Since performance of the placed design is the ultimate objective, it does not suffice to merely compare packing statistics. We attempted to incorporate concepts from RPack and
iRAC into our packers in an effort to assess their potential
benefits. The incorporation of RPack was straightforward
since it consisted of appending a new term to the cost function. The addition of an RPack term improved the critical
delay and wire length by ∼ 1% and ∼ 2%, but worsened
minimum channel width results by ∼ 2%. We did not feel
that these were statistically meaningful improvements.
Moreover, by adding the algorithms from iRAC into our
packers, we were unable to improve upon our best results.
iRAC depopulates by limiting the number of pins used per
CLB, as well as by trying to absorb low-fanout nets. We
conducted tests to establish the effect that a decrease in the
number of edges would have on the resulting packing statistics, and found that as the number of pins used per CLB is
decreased, the number of CLBs increases (with the external
net count decreasing). However, the wire length and critical
delay remained fairly consistent.
Table 5: Comparison of pre-placement packing statistics
between known tools, N = 8, I = 18.

Packer     # CLB    Ext Nets   Pins Used
T-VPack    1.000    1.000      1.000
R-Pack     1.009    1.071      0.954
iRAC       1.078    0.757      0.870
DPACK      1.007    0.908      1.025
HDPACK     1.014    0.870      0.961

5. CONCLUSIONS
We explored the use of physical information during packing.
We described a flow where top-down min-cut partitioning-based placement was performed prior to packing to generate rough physical locations for BLEs. The focus was not to
obtain architecturally correct placements, but to obtain reasonable physical information with little effort. This physical information is then incorporated into two packing algorithms, DPACK and HDPACK. Numerical results showed
that DPACK yielded an average reduction of 16% in wire
length, and 8% in critical path delay compared to T-VPack.
HDPACK showed an average improvement of 20% in wire
length and 6% in critical delay compared to T-VPack. Physical information was found to aid in the creation of better
CLBs; we believe that this is a result of better “tie-breaking”
during packing and a result of better packing of unrelated
logic. We also report significant improvements in minimum
channel widths required during high-stress routing. Neither algorithm produced a significant increase in the number
of CLBs, unlike other techniques described in the literature.
Our packing strategies are very fast and can be used to complement any hierarchical FPGA placement flow.
Our ongoing work focuses on using the physical information (generated from our partitioner) to produce initial
placements. We believe that the use of physical information
can also be employed in those packing schemes that perform
logic duplication to help predict and control the amount of
duplicated logic.
6. REFERENCES
[1] V. Betz and J. Rose, “Cluster-based logic blocks for FPGAs: Area-efficiency vs.
input sharing and size,” in CICC, 1997, pp. 551–554.
[2] G. Chen and J. Cong, “Simultaneous timing driven clustering and placement for
FPGAs,” in Proc. FPL, 2004, pp. 158–167.
[3] V. Betz and J. Rose, “VPR: A new packing, placement and routing tool for
FPGA research,” in Field-Programmable Logic and Applications, W. Luk, P. Y.
Cheung, and M. Glesner, Eds., 1997, pp. 213–222.
[4] K. Vorwerk and A. Kennings, “An improved multi-level framework for force-directed placement,” in Proc. DATE, 2005, pp. 902–907.
[5] E. Bozorgzadeh, S. Ogrenci-Memik, and M. Sarrafzadeh, “RPack: Routability-driven packing for cluster-based FPGAs,” in Proc. ASPDAC, 2001, pp. 629–634.
[6] A. Singh and M. Marek-Sadowska, “Efficient circuit clustering for area and
power reduction in FPGAs,” in Proc. FPGA, 2002, pp. 59–66.
[7] J. Cong and M. Romesis, “Performance-driven multi-level clustering with application to hierarchical FPGA mapping,” in Proc. DAC, 2001, pp. 389–394.
[8] C. Sze, T.-C. Wang, and L.-C. Wang, “Multilevel circuit clustering for delay
minimization,” in IEEE Trans. CAD, 2004, pp. 1073–1085.
[9] M. Dehkordi and S. Brown, “Performance-driven recursive multi-level clustering,” in Proc. FPT, 2003, pp. 262–269.
[10] V. Manohararajah, G. R. Chiu, D. P. Singh, and S. D. Brown, “Difficulty of
predicting interconnect delay in a timing driven FPGA CAD flow,” in SLIP, 2006,
pp. 3–8.
[11] T. Kong, “A novel net weighting algorithm for timing-driven placement,” in
Proc. ICCAD, 2002, pp. 172–176.
[12] V. Betz, J. Rose, and A. Marquardt, Eds., Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic Publishers, 1999.
[13] G. Karypis and V. Kumar, “hMETIS: A hypergraph partitioning package,” Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Tech. Rep., 1998.
[14] J. A. Roy et al., “Capo: Robust and scalable open-source min-cut floorplacer,” in
Proc. ISPD, 2005, pp. 224–226.
[15] S. Yang, “Logic synthesis and optimization benchmarks, version 3.0,” Microelectronics Center of North Carolina, Tech. Rep., 1991.
[16] M. Tom and G. Lemieux, “Logic block clustering of large designs for channel-width constrained FPGAs,” in Proc. DAC, 2005, pp. 726–731.
[17] R. Tessier and H. Giza, “Balancing logic utilization and area efficiency in FPGAs,” in FPL, 2000, pp. 535–544.