Slides

Floorplan Assisted Data Rate
Enhancement through Wire
Pipelining: A Real Assessment
ISPD 2005 San Francisco, CA
May 5th, 2005
Mario R. Casu - Politecnico di Torino
and
Luca Macchiarulo - University of Hawaii at Manoa
Outline


Communication concerns at the physical layer
Great Expectations of “Wire Pipelining”
– No block Delay
– Block delay limitation




Computation locality
Adaptive Communications
Floorplanning strategy for adaptive systems
Experimental results
Wire pipelining - concept



Wire delay:
substantial share of
overall delay
Global wires difficult
to deal with
Global wire scaling
does not follow that of
– Transistors
– Local wiring
[Figure: global wire with delay Del]
Wire pipelining - concept


Introducing a
latch/FF reduces the
timing constraints
Similar to classical
pipelining
[Figure: the wire split by a latch/FF into two segments with delays Del’ and Del’’]
Critical Length

Maximal length for
which the wire can be
driven at a given
frequency
– Optimum number of
buffers
– Optimum buffer
dimensions
– Optimum wire sizing
Del=1/f
Wire Pipelining

Above Critical length
clocked elements are
needed (pipeline
stages)
Del>1/f
“Wire Pipelining” techniques


Problem: maintaining functionality with a
minimum loss in performance.
Solutions:
– Globally Asynchronous Locally Synchronous – GALS
– Retiming
– Regular Distributed Register (J. Cong)
– c-slowing (S. Sapatnekar)
– Latency Insensitive Protocols (L. Carloni)
LIPs: Concept
[Figure: pearl, shell and relay station]
Shell – Relay Station Interaction
[Figure: shell and relay station connected by valid and stop signals]
Feedback Topology
[Animation, six frames: valid data (0, 1, 2) and void tokens (τ) circulate through the feedback loops; the output stream builds up as 0 τ 1 τ τ 2]
Feedback Topology: Performance


Void data circulate in the loops: initially as many as there are relay stations (r)
The “period” of the void-stop pattern equals the number of shells (s) plus the number of relay stations (r) in the loop

The worst loop fixes the throughput

T=s/(s+r)

Ta=2/4, Tb=2/5
[Figure: feedback topology with loops a and b; overall T=2/5, output stream 0 τ 1 τ τ 2]
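The rule T=s/(s+r) and the "worst loop" statement can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the shell and relay station counts for loops a and b are inferred from Ta=2/4 and Tb=2/5 rather than read off the figure.

```python
from fractions import Fraction

def loop_throughput(shells, relay_stations):
    """Throughput T = s / (s + r) of a single feedback loop."""
    return Fraction(shells, shells + relay_stations)

def system_throughput(loops):
    """The slowest loop fixes the throughput of the whole system."""
    return min(loop_throughput(s, r) for s, r in loops)

# (shells, relay stations) per loop; the counts are an assumption
# chosen to reproduce Ta = 2/4 and Tb = 2/5 from the slide.
loops = {"a": (2, 2), "b": (2, 3)}
T_a = loop_throughput(*loops["a"])
T_b = loop_throughput(*loops["b"])
T = system_throughput(loops.values())
print(T_a, T_b, T)
```

Exact fractions are used so that the result can be compared directly with the values on the slide.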
Classical Floorplanning


Problem: find a placement
of (soft or hard) blocks that
optimally fits a floorplan
Optimality criteria: whitespace,
overall wirelength, critical
path, or a combination
Floorplanning for Throughput
[ISPD2004]


The optimal floorplan in
our case is that which
guarantees the
maximum throughput
compatible with given
blocks’ dimensions
Maximum throughput is
equivalent to the worst
cost-to-time ratio loop
New Heuristic Throughput
Computation

Heuristic:
– Statically compute, for every edge e, the length l(e) of the
shortest loop in which e appears
– For every optimization iteration:
  • Cost(e)=1/l(e)*floor(length/Clength)
  • TotCost=ΣCost(e)
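The two steps of the heuristic can be sketched as follows. The netlist, wire lengths and function names are hypothetical, and measuring l(e) as the number of edges in the shortest directed loop through e is an assumption; the slides do not say how the floorplanner computes it.

```python
import math
from collections import deque

def shortest_loop_through(edge, successors):
    """Number of edges in the shortest directed loop containing `edge`.

    The loop is edge (u, v) plus the shortest path from v back to u,
    found by breadth-first search. Returns None if no loop exists.
    """
    u, v = edge
    dist = {v: 0}
    queue = deque([v])
    while queue:
        node = queue.popleft()
        if node == u:
            return dist[node] + 1
        for nxt in successors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

def total_cost(wire_length, loop_len, critical_length):
    """TotCost = sum over edges of floor(length / Clength) / l(e)."""
    cost = 0.0
    for e, length in wire_length.items():
        if loop_len[e] is None:  # assumption: edges on no loop cost nothing
            continue
        cost += math.floor(length / critical_length) / loop_len[e]
    return cost

# Hypothetical netlist: A -> B -> C -> A, plus B -> A.
successors = {"A": ["B"], "B": ["C", "A"], "C": ["A"]}
edges = [(u, v) for u, vs in successors.items() for v in vs]
# Static step: done once, before the optimization loop.
loop_len = {e: shortest_loop_through(e, successors) for e in edges}

# Wire lengths of one candidate floorplan (arbitrary units).
wire_length = {("A", "B"): 12.0, ("B", "C"): 4.0,
               ("C", "A"): 7.0, ("B", "A"): 5.0}
cost = total_cost(wire_length, loop_len, critical_length=5.0)
print(loop_len, cost)
```

Only `total_cost` depends on the placement, so it is the part that would be re-evaluated at every optimization iteration.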
Throughput-frequency trade-off
f=1/L
T=1
DR0=T·f=1·(1/L)=1/L
Throughput-frequency trade-off
f=2/L
T=2/(2+2)=1/2
DR=(1/2)·(2/L)=1/L
No advantage!
Throughput-frequency trade-off
[Figure: loop with wire segments of length L/2, L and L]
f=1/L
T=1
DR0=(1/L)·1=1/L
Throughput-frequency trade-off
[Figure: the loop pipelined into segments of length L/2]
f=2/L
T=3/(3+2)
DR=(2/L)·(3/5)=6/(5L)
Data Rate as the basic performance
metric – Speed-up

Wire pipelining allows increased frequency
But it decreases the throughput according to the
previous considerations
Real performance is given by DATA RATE=Thr*f
Advantage w.r.t. non-pipelined systems to be
assessed through DR measures
Speed-Up SU=DR/DR0

L/(lm+lmax)<SU<L/lm





Floorplanning can be extremely beneficial if it
can reduce the average branch length lm
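The two trade-off examples can be replayed numerically with DR=Thr*f and SU=DR/DR0. In this sketch frequencies are expressed in units of 1/L, and the function name and the shell count given to the non-pipelined references are assumptions made for illustration.

```python
from fractions import Fraction

def data_rate(shells, relay_stations, freq):
    """DR = T * f with T = s / (s + r); freq is in units of 1/L."""
    return Fraction(shells, shells + relay_stations) * freq

# First example: doubling f needs 2 relay stations in a 2-shell loop.
DR0_first = data_rate(shells=2, relay_stations=0, freq=Fraction(1))
DR_first = data_rate(shells=2, relay_stations=2, freq=Fraction(2))
SU_first = DR_first / DR0_first

# Second example: 3 shells, 2 relay stations, f = 2/L.
DR0_second = data_rate(shells=3, relay_stations=0, freq=Fraction(1))
DR_second = data_rate(shells=3, relay_stations=2, freq=Fraction(2))
SU_second = DR_second / DR0_second

print(SU_first, SU_second)
```

The first case gives no speed-up, while the second gives DR=6/(5L), matching the slides.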
Block delay effect

Blocks put a cap on the maximum frequency
– fmax<1/max_i(di)

Delay can be measured as a “length” by using a proportionality factor
Block delay enters the picture if signals are latched at
the input or the output side only
[Figure: wire length L and block delay expressed as an equivalent length ld]
Block delay models

We used two different models
– Delay proportional to block edge
  • Rationale: complexity of logic is related to block size
  • Minimum constant of proportionality=1: the delay equals the time
    the fastest signal needs to traverse the entire block
  • Optimistic assumption

– Delay constant, related to technology and equal to 13FO4
  • Derived from assumptions in the roadmap
  • More realistic for high performance design
  • More pessimistic (see below)

Reality probably lies somewhere between the two
cases
Speed-up with block delay

Taking the block delay into account modifies the
previous considerations

max(Li+di)/(lm+dm+dmax)<SU<max(Li+di)/(lm+dm)

In general, much worse than previous case
Throughput driven floorplan
experiments



We used the floorplanner described in ISPD’04 to
evaluate the optimal frequency (maximum DR)
On GSRC and MCNC benchmarks with input-output
information
No block delay:
– SU varies between 0.8% and 36%
– Better on benchmarks with greater complexity

Block delay:
– Proportional to blocks’ edges: -7% to 44%
– Equal to 13FO4: -11% to 12%
– MCNC suite shows the worst behavior

High-speed systems with highly optimized blocks
yield a negligible SU, despite a large
increase in clock frequency.
Space for better performance?


Not all point-to-point connections are actually used
at every clock cycle.
Ex. CPU to Cache communication.
[Figure: channels Addr, Data-out and Data-in between CPU and cache, shown for a read cycle and for a write cycle]
Space for better performance?


Unused communication channels effectively break
throughput-limiting loops
Pipelining without limitation can become possible
[Animation, three frames of a stream write cycle: Addr 1, 2, 3 and Data-out 1, 2, 3 advance along the pipelined channels on successive clocks; a void token τ appears only in the first frame]
Adaptive Latency Insensitive Protocol


Need a mechanism that lets blocks discard useless
“packets”: Adaptive communication
Details are beyond the scope of the paper, but
– It is possible through a simple modification of the
original protocol
– Requires the introduction of “oracles” predicting
unused inputs for each block
– We designed a functional implementation in
synthesizable VHDL
– We proved the correctness of the implementation
(absence of deadlocks and correct signal sequencing)
ALIP performance evaluation


The adaptiveness of the approach prevents a static
prediction of performance
However, a few conclusions can be reached:
– The performance is bounded above by static LIP
– Performance in long sequences of input independence
is equivalent to that of the simplified network with the
channel removed

If the system experiences infrequent “context
switching” on its channels, such that at any given
time the performance is a static Thi, the average
performance can be approximated as:
– Th=Σi ai·Thi
  • ai: fraction of time with performance Thi
ALIP performance evaluation Example
[Animation, one frame per clock; channel contents listed as drawn, τ is a void token]
Ck=1, Valid Data=1, stream write cycle: Addr 1, Data-out 1, τ
Ck=2, Valid Data=2, stream write cycle: Addr 2, Addr 1, Data-out 2, Data-out 1
Ck=3, Valid Data=3, stream write cycle: Addr 3, Addr 2, Data-out 3, Data-out 2
Ck=4, Valid Data=4, read cycle: Addr 4, Addr 3, Data-out 3
Ck=5, Valid Data=5, read cycle: -----, τ, Addr 4, τ
Ck=6, Valid Data=5, read cycle: τ, -----, τ, Data-in4
Ck=7, Valid Data=5, read cycle: τ, Data-in4, τ, -----
Ck=8, Valid Data=6, read cycle: Addr 5, -----, τ, τ
Result at Ck=8: Throughput=3/4, with Th1=1, Th2=1/2, a1=1/2, a2=1/2
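The example's figures agree with the weighted-average approximation Th = Σ ai·Thi and with the direct count of 6 valid data in 8 clocks. A minimal check, with a hypothetical function name:

```python
from fractions import Fraction

def average_throughput(phases):
    """Th = sum of a_i * Th_i over (a_i, Th_i) pairs; the a_i must sum to 1."""
    assert sum(a for a, _ in phases) == 1, "time fractions must sum to 1"
    return sum(a * th for a, th in phases)

# (a_i, Th_i) from the example: half the time at Th1 = 1 (stream write),
# half the time at Th2 = 1/2 (read).
phases = [(Fraction(1, 2), Fraction(1)), (Fraction(1, 2), Fraction(1, 2))]
Th = average_throughput(phases)

# Direct count from the animation: 6 valid data after 8 clocks.
Th_counted = Fraction(6, 8)
print(Th, Th_counted)
```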
Adaptive communication performance
evaluation - assumptions

Assumption 1: No time lost in “context switching”
– Unrealistic, but acceptable for burst communication,
and consistent with experiments

Assumption 2: Channels behave in a statistically
independent fashion
– Only single clock cycle independence is important for
our purposes

Under 1 and 2, we can compute channel activities
and use them to weight the connections
Floorplanning for Throughput –
adaptive case



The optimal floorplan in
our case is that which
guarantees the
maximum throughput
compatible with given
blocks’ dimensions
Maximum throughput is
equivalent to the worst
cost-to-time ratio loop,
weighted by the loop
activation ratio
It can be approximated
by taking into account
the channel activation
ratio
New Heuristic Throughput
Computation

Heuristic:
– Statically compute, for every edge e, the length l(e) of the
shortest loop in which e appears
– For every optimization iteration:
  • Cost(e)=1/l(e)*floor(length/Clength)*a(e)
  • TotCost=ΣCost(e)

The only change consists in the inclusion of the term
a(e)
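The effect of the a(e) term can be seen on two otherwise identical edges. This is a sketch with made-up numbers; the dictionary keys and the function name are hypothetical, and l(e) is taken as already computed in the static step.

```python
import math

def adaptive_total_cost(edges, critical_length):
    """TotCost = sum over edges of a(e) * floor(length / Clength) / l(e)."""
    return sum(
        e["activity"] * math.floor(e["length"] / critical_length) / e["loop_len"]
        for e in edges
    )

# Hypothetical edges: wire length, shortest-loop size l(e), activation ratio a(e).
edges = [
    {"length": 12.0, "loop_len": 2, "activity": 1.0},   # used every cycle
    {"length": 12.0, "loop_len": 2, "activity": 0.25},  # used one cycle in four
]
cost = adaptive_total_cost(edges, critical_length=5.0)
print(cost)
```

A rarely used channel contributes proportionally less to the cost, so the floorplanner can tolerate a longer wire on it.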
Experiments

GSRC/MCNC benchmarks
– Burst mode
– Uniformly distributed phases and activation times
– Comparison between non-pipelined solution and
adaptively pipelined (13FO4 case)
– After optimization, a VHDL netlist is automatically
generated and simulated to measure the real
performance of the system (as opposed to the
approximation from the floorplanner)

Results:
– SU between 16 and 44%
– Monotonic behavior in the legal interval
– Limitations due mainly to FO4 delays
Experiments

MPEG decoder
– Strict data dependency
– Optimization as in other cases
– Simulation as before and with real channel utilization
profiles

Results:
– SU of 42% with block delay, 76% without
– Real SU of 31% (effect of non-random correlation)
Conclusions and future work



Pure “blind” pipelining fails to achieve the available
optimization because it neglects commonly available information
Adaptive protocols can take advantage of the
information available to the blocks
We will concentrate on
– Automated extraction of information from the blocks
– Power optimization (power/timing trade-offs)
– Routing constraints effects
Thank you
Shell – Relay Station Interaction
[Animation, four frames: data items a, b, c, d move along the shell-to-relay-station channel, which carries the valid and stop signals; the frames show a; b a; c b; d c b]
Feedforward equalization


Maximum performance
can be recovered by
equalizing various paths
Longest path
computation to obtain
the appropriate number
of added relay stations
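One possible way to carry out the longest-path computation is sketched below. It assumes that only the number of inserted relay stations needs to match along reconvergent feedforward paths, and that the blocks are given in topological order; the netlist and the function name are hypothetical.

```python
def equalize(nodes_in_topological_order, edges):
    """Return the extra relay stations per edge that equalize all paths.

    edges maps (u, v) -> number of relay stations already on that wire.
    """
    depth = {n: 0 for n in nodes_in_topological_order}
    # Longest path, counted in relay stations, from any source to each node.
    for v in nodes_in_topological_order:
        for (u, w), r in edges.items():
            if w == v:
                depth[v] = max(depth[v], depth[u] + r)
    # The slack of each edge is filled with additional relay stations.
    return {(u, v): depth[v] - depth[u] - r for (u, v), r in edges.items()}

# Hypothetical feedforward netlist: A feeds C directly and through B.
order = ["A", "B", "C"]
edges = {("A", "B"): 1, ("B", "C"): 2, ("A", "C"): 0}
extra = equalize(order, edges)
print(extra)
```

Here the direct wire from A to C receives three extra relay stations, so both paths into C carry the same added latency.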
Critical Length and Pipelining Stages (ITRS projections)

Year  Node    Clock Frequency  Critical Length  Stages (10 mm)  Stages (34 mm)
2001  130 nm  1.684 GHz        17.11 mm         0               1
2002  115 nm  2.317 GHz        12.17 mm         0               2
2003  100 nm  3.088 GHz        8.95 mm          1               3
2004  90 nm   3.990 GHz        7.37 mm          1               4
2005  80 nm   5.173 GHz        5.28 mm          1               6
2006  70 nm   5.631 GHz        4.63 mm          2               7
2007  65 nm   6.739 GHz        4.16 mm          2               8
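The stage counts in the table are consistent with taking floor(length / critical length), the same term used in the floorplanning cost. The short check below assumes that rule; the function name is hypothetical and the critical lengths are copied from the table.

```python
import math

# Critical length in mm per year, as listed in the table.
critical_length_mm = {2001: 17.11, 2002: 12.17, 2003: 8.95, 2004: 7.37,
                      2005: 5.28, 2006: 4.63, 2007: 4.16}

def pipeline_stages(wire_mm, critical_mm):
    """Assumed rule: one clocked element per full critical length of wire."""
    return math.floor(wire_mm / critical_mm)

# Stages needed for a 10 mm and a 34 mm wire in each year.
stages = {year: (pipeline_stages(10, lc), pipeline_stages(34, lc))
          for year, lc in critical_length_mm.items()}
for year, (s10, s34) in stages.items():
    print(year, s10, s34)
```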
General Performance Evaluation



Generic netlists of blocks are feedforward
connections of loops
If feedforward connections are equalized, “worst”
loop dominates throughput
Problem formulation: max cost-to-time ratio
(polynomial time).