ECE 565 High-Level Synthesis--Introduction

Shantanu Dutt
ECE Dept., UIC
HLS Flow
• Code/Algorithm → Architecture (functional units (FUs) and memory units (MUs) interconnected via muxes, demuxes, tristate buffers, buses, dedicated interconnects)
Classically, these 3 stages were performed sequentially, but they are currently performed together (which leads to better optimization)
HLS Flow (contd)
Taken into consideration during register allocation (post scheduling).
Taken into consideration during scheduling.
(Binding)
Allocation: Simple counting of FUs after the above 2 stages
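The counting step can be illustrated with a minimal Python sketch. It assumes a hypothetical `binding` dictionary that maps each operation to an (FU type, FU instance) pair produced by scheduling and binding; the operation and FU names are illustrative and not taken from the slides.

```python
from collections import defaultdict

def allocate(binding):
    """Count the distinct FU instances of each type used by a binding."""
    instances = defaultdict(set)
    for fu_type, fu_id in binding.values():
        instances[fu_type].add(fu_id)
    return {fu_type: len(ids) for fu_type, ids in instances.items()}

# Hypothetical binding: three operations sharing one multiplier and one adder.
binding = {"c1": ("X", 0), "c2": ("+", 0), "c3": ("+", 0)}
print(allocate(binding))
```

Here allocation is nothing more than a tally, which is why it is "simple" once the other two stages are done.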
Simple HLS Examples
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+) w/ X delay of 2 cc’s and + delay of 1 cc
Operations: (c1) x ← a x b, (c2) y ← c+d, (c3) z ← x+y
(a) Scheduling
i) Non-overlapped pipelined scheduling
[Figure: schedule chart over cc’s 1-6. X row: c1(1), c1(2). + row: c2(1), c3(1), c2(2), c3(2). ck(j) is operation ck of iteration j.]
(b) Arch. Synthesis
[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), one X, one +, mux1 and mux2 (inputs I0, I1) and a demux (outputs O0, O1)]
(c) Controller FSM Synthesis
[Figure: controller FSM entered on Reset, with states labeled cc 3i, cc 3(i+1), cc 3(i+2). Control-signal groups shown:]
• ldx=1 [x ← a x b] (c1)
• mux1=0, mux2=0, demux=0, ldy=1 [y ← c+d] (c2)
• lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1 [z ← x+y] (c3)
Note: Unspecified control signals have either an inactive value or, if such a concept doesn’t exist for the cs, the don’t-care value.
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge triggered system) of the cc after the one in which its load signal is asserted.
[Timing sketch: lda = 1 in one cc; reg. “a” is loaded at the following edge]
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+) (cont’d)
(a) Scheduling
ii) Overlapped pipelined scheduling
[Figure: schedule chart over cc’s 1-6. X row: c1(1), c1(2). + row: c2(1), c3(1), c2(2), c3(2).]
(b) Arch. Synthesis
[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), one X, one +, mux1 and mux2 (inputs I0, I1) and a demux]
(c) Controller FSM Synthesis
[Figure: controller FSM entered on Reset, with states labeled cc 3i and cc 3(i+1). Control-signal groups shown:]
• lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1 [y ← c+d, x ← a x b] (c1, c2)
• ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1 [z ← x+y] (c3)
• For 4 iterations, the overlapped schedule takes 9 cc’s versus 12 cc’s for the non-overlapped sched.
• Overlap. sched: Time for n iterations = 2n+1 cc’s
Throughput = n/(2n+1) ≈ 0.5 outputs/cc for large n
• Non-overlap. sched: Time for n iterations = 3n cc’s
Throughput = n/(3n) ≈ 0.33 outputs/cc
→ ≈ 50% throughput improvement using an overlapped schedule (0.5 vs. 0.33 outputs/cc; equivalently, ≈ 33% fewer cc’s per output)
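The cycle counts and throughputs above can be checked numerically. The following is a small Python sketch using only the two formulas stated on the slide (2n+1 and 3n); the function names are illustrative.

```python
def overlapped_cycles(n):
    """cc's needed for n iterations with the overlapped schedule."""
    return 2 * n + 1

def nonoverlapped_cycles(n):
    """cc's needed for n iterations with the non-overlapped schedule."""
    return 3 * n

def throughput(n, cycles):
    """Outputs per cc for n iterations under the given cycle-count formula."""
    return n / cycles(n)

for n in (4, 100, 10_000):
    t_ov = throughput(n, overlapped_cycles)
    t_non = throughput(n, nonoverlapped_cycles)
    # The ratio of the two throughputs equals 3n / (2n + 1).
    print(n, overlapped_cycles(n), nonoverlapped_cycles(n),
          round(t_ov, 3), round(t_non, 3), round(t_ov / t_non, 3))
```

The extra "+1" in the overlapped count is the one-time pipeline fill, so its effect on throughput fades as n grows.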
Simple HLS Examples (contd)
• Some DFG control operation nodes:
Selector: data inputs in1 (T) and in2 (F), a Condition (T/F) input, one output out
Distributor: one data input in, a Condition (T/F) input, outputs out1 (T) and out2 (F)
• Conditional code:
If (a > b) then
c ← a-b;
Else
c ← b-a;
• Possible DFGs corresponding to the above conditional code: [figure]
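The two control nodes can be mimicked in software to see how the same conditional code maps to two different DFG styles. This is a rough Python sketch under the assumption that a selector picks one of two already-computed values, while a distributor routes its input to exactly one branch; the function names are illustrative and the slides' actual DFGs may differ.

```python
def selector(cond, in1, in2):
    """Selector node: forwards in1 when cond is True, otherwise in2."""
    return in1 if cond else in2

def distributor(cond, value):
    """Distributor node: routes value to out1 (T) or out2 (F); the other output is None."""
    return (value, None) if cond else (None, value)

def diff_with_selector(a, b):
    # Both subtractions are evaluated; the selector keeps one result.
    return selector(a > b, a - b, b - a)

def diff_with_distributor(a, b):
    # The operand pair is routed to one branch; only that branch subtracts.
    t_in, f_in = distributor(a > b, (a, b))
    if t_in is not None:
        x, y = t_in
        return x - y
    x, y = f_in
    return y - x

print(diff_with_selector(5, 2), diff_with_distributor(2, 5))
```

The selector form needs two subtract operations in the DFG, whereas the distributor form activates only one per evaluation, which matters when FUs are constrained.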
Simple HLS Examples (contd)
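• Iterative code: while (a > b)
a ← a-b;
(a) Scheduling (using only 1 adder/sub)
[Figure: DFG with a compare node (>), a subtract node (-), a selector (T sel F) and a distributor (T dist F); annotation: “Initialized to F”]
Scheduling & binding: the single + alternates between c1 and c2 in successive cc’s (c1, c2, c1, c2, ...)
(b) Arch. Synthesis
[Figure: datapath with registers a (lda), b (ldb), r1 (ldr1) and final a (ldfina), a Mux, a Demux, and one adder whose sign result goes to the FSM]
Subtraction on the adder: b’ + 1 (cin = 1) = 2’s compl. representation of -b
Sign of the result: s xor ovfl = 1 → -ve, = 0 → +ve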
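The adder-only datapath for the while loop can be sketched in Python. The sketch assumes an 8-bit 2's-complement adder, assumes the comparison a > b is done by computing b - a and testing the true sign (s xor ovfl), and assumes b > 0 so the loop terminates; the function names and the width are illustrative, not from the slides.

```python
def add_sub(x, y, subtract=False, width=8):
    """Model one width-bit adder; with subtract=True it computes x + y' + 1.

    Returns (result bits, negative), where negative = s xor ovfl.
    """
    mask = (1 << width) - 1
    x &= mask
    y_in = (~y & mask) if subtract else (y & mask)  # y' = bitwise complement
    cin = 1 if subtract else 0
    result = (x + y_in + cin) & mask
    msb = width - 1
    sx, sy, s = x >> msb, y_in >> msb, result >> msb
    # Signed overflow: both adder inputs share a sign that the sum does not.
    ovfl = int(sx == sy and s != sx)
    return result, s ^ ovfl

def repeated_subtract(a, b, width=8):
    """while (a > b): a <- a - b, using only add_sub."""
    while True:
        _, a_gt_b = add_sub(b, a, subtract=True, width=width)  # b - a < 0 ?
        if not a_gt_b:
            return a
        a, _ = add_sub(a, b, subtract=True, width=width)       # a <- a - b

print(repeated_subtract(17, 5))
```

Using s xor ovfl rather than the raw sign bit keeps the comparison correct even when the subtraction overflows the register width.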
Delay Nodes in DFGs
A delay node is generally implemented as a register (or a series of registers if clock
period < T0); a delay node thus becomes a state variable.
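To make the "delay node as state variable" idea concrete, here is a minimal Python sketch. It assumes the delay spans a whole number of clock periods (`stages`), modeled as that many registers in series; the class name, the `stages` parameter and the initial value are illustrative assumptions.

```python
from collections import deque

class DelayNode:
    """A delay node modeled as `stages` registers in series (stages >= 1)."""

    def __init__(self, stages=1, init=0):
        # Leftmost entry is the oldest stored value.
        self.regs = deque([init] * stages, maxlen=stages)

    def clock(self, value):
        """One clock edge: shift `value` in and return the value shifted out."""
        out = self.regs[0]
        self.regs.append(value)  # maxlen makes the deque drop the oldest entry
        return out

d = DelayNode(stages=2)
print([d.clock(v) for v in (1, 2, 3, 4)])
```

The contents of `regs` are exactly the state that must be carried from one cc to the next, which is why each delay node costs at least one register in the architecture.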
Delay Nodes in DFGs (contd)
[Figure: a delay node is transformed in the DFG into a register, which is then mapped to the architecture]
Detailed HLS Example
Detailed HLS Example (contd)
Different paths (i/p → o/p) in the DFG
Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to o/p is the highest, breaking ties in favor of those opers u whose “sibling” o/ps (o/ps to the same children) are avail. or will be available at u’s earliest finish; otherwise the FU(s) will be idle unnecessarily, leading to a larger latency (this will also reduce the lifetimes of sibling o/ps).
(a) Scheduling w/ one X (2 cc’s) & one + (1 cc); Goal: Minimize latency
(b) Reg. alloc. for o/p of
operations
(c) Arch. synthesis
The synthesized architecture
For WAR constraint [can’t store in d1, as would be natural, since d1’s data is yet to be consumed by c6, which has not been scheduled yet]
Note: Above register allocation has been done w/
separate regs for multiplier and adder o/ps.
It is sub-optimal (4 non-primary i/p regs. needed)
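The core of the scheduling heuristic (pick the ready operation with the highest delay to the output) can be sketched as a small list scheduler in Python. This sketch assumes one non-pipelined FU per type and breaks ties by list order; it does not implement the sibling-output tie-breaker. The three-operation DFG used here is the small x ← a x b, y ← c+d, z ← x+y example, not the detailed example's DFG, and all names are illustrative.

```python
def list_schedule(ops, preds, fu_type, delay):
    """Greedy list scheduling with one FU per type.

    ops:     operation names, in tie-breaking order
    preds:   op -> list of ops whose results it consumes
    fu_type: op -> FU type, e.g. "X" or "+"
    delay:   FU type -> delay in cc's (the FU is busy for the whole delay)
    Returns op -> (start cc, finish cc), both inclusive.
    """
    succs = {u: [v for v in ops if u in preds[v]] for u in ops}
    memo = {}

    def delay_to_output(u):
        # Longest path delay from the start of u to a DFG output.
        if u not in memo:
            memo[u] = delay[fu_type[u]] + max(
                (delay_to_output(v) for v in succs[u]), default=0)
        return memo[u]

    sched = {}
    fu_free_at = {t: 1 for t in delay}  # first cc in which each FU is free
    cc = 1
    while len(sched) < len(ops):
        for t in delay:
            if fu_free_at[t] > cc:
                continue
            # Ready: unscheduled, right FU type, all inputs finished before cc.
            ready = [u for u in ops
                     if u not in sched and fu_type[u] == t
                     and all(p in sched and sched[p][1] < cc for p in preds[u])]
            if ready:
                u = max(ready, key=delay_to_output)
                sched[u] = (cc, cc + delay[t] - 1)
                fu_free_at[t] = cc + delay[t]
        cc += 1
    return sched

ops = ["c1", "c2", "c3"]
preds = {"c1": [], "c2": [], "c3": ["c1", "c2"]}
fu_type = {"c1": "X", "c2": "+", "c3": "+"}
print(list_schedule(ops, preds, fu_type, {"X": 2, "+": 1}))
```

Adding the sibling-output tie-breaker would mean replacing the `max(...)` key with a tuple whose second element rewards operations whose siblings are already available.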
Detailed HLS Example (contd)
Detailed HLS Example—Register Allocation
Detailed HLS Example—Register Allocation (contd)
Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to o/p is the highest, breaking ties in favor of those opers u whose “sibling” o/ps (o/ps to the same children) are avail. or will be available at u’s earliest finish; otherwise the FU(s) will be idle unnecessarily, leading to a larger latency (this will reduce the lifetimes of sibling o/ps).
3 non-primary i/p regs. needed
• In the conflict graph (one per FU), there is an edge between 2 var. nodes if
their lifetimes overlap (indicating that different registers need to be allocated
to them)
• Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is NP-hard in general
• The above type of conflict graph is called an interval graph (derived from a
1-dimensional interval of the lifetimes)
• Min. graph coloring can be solved optimally in linear time for interval graphs
(using the left-edge algorithm that we will see later for channel routing)
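The left-edge idea can be sketched for register allocation as follows. This Python sketch assumes each variable's lifetime is a half-open interval [start, end) in cc's, so a register freed at cc t can be reused by a variable born at t; the variable names and lifetimes are hypothetical and are not those of the detailed example.

```python
def left_edge(lifetimes):
    """Assign registers to variables with interval lifetimes.

    lifetimes: var -> (start, end), live over the half-open interval [start, end)
    Returns var -> register index. Fills one register at a time with
    non-overlapping lifetimes taken in order of increasing start.
    """
    remaining = sorted(lifetimes, key=lambda v: lifetimes[v])
    assignment = {}
    reg = 0
    while remaining:
        last_end = None
        unplaced = []
        for v in remaining:
            start, end = lifetimes[v]
            if last_end is None or start >= last_end:
                assignment[v] = reg      # fits after the previous lifetime
                last_end = end
            else:
                unplaced.append(v)       # overlaps; try the next register
        remaining = unplaced
        reg += 1
    return assignment

# Hypothetical lifetimes; at most 2 variables are live at any time.
lifetimes = {"A": (1, 3), "B": (2, 4), "C": (3, 5), "D": (4, 6)}
print(left_edge(lifetimes))
```

Each pass corresponds to one color of the conflict graph, and the number of passes equals the maximum number of simultaneously live variables.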
Detailed HLS Example—Register Allocation (contd)
3 non-primary i/p regs. needed
Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to o/p is the highest, breaking ties arbitrarily. B’s lifetime increases, but D’s (dep. of B) decreases similarly, so the heuristic should be based on more global information.