ECE 565
High-Level Synthesis: An Introduction
Shantanu Dutt, ECE Dept., UIC

HLS Flow
• Code/Algorithm → Architecture (functional units (FUs) and memory units (MUs) interconnected via muxes, demuxes, tristate buffers, buses, dedicated interconnects)
• Classically, these 3 stages were performed sequentially, but they are currently performed together (which leads to better optimization)

HLS Flow (contd)
[Figure annotations]
• Taken into consideration during register allocation (post scheduling).
• Taken into consideration during scheduling.
• (Binding) Allocation: simple counting of FUs after the above 2 stages

Simple HLS Examples

Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (×) and 1 (+), w/ × delay of 2 cc's and + delay of 1 cc
(a) Scheduling
i) Non-overlapped pipelined scheduling
[Figure: schedule chart with rows for × and + against cc's 1 to 6; × row holds c1(1), c1(2); + row holds c2(1), c3(1), c2(2), c3(2)]
(b) Arch. Synthesis
[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), one × unit, one + unit, mux1 and mux2 (inputs I0, I1), and a demux (outputs O0, O1)]
(c) Controller FSM Synthesis
Controller FSM (entered on Reset; state labels in the figure: cc 3i, cc 3(i+1), cc 3(i+2)):
• mux1=0, mux2=0, demux=0, ldy=1   [y ← c+d] (c2)
• ldx=1   [x ← a × b] (c1)
• lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1   [z ← x+y] (c3)
Note: Unspecified control signals have either an inactive value or, if no such concept exists for the control signal, the don't-care value.
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge-triggered system) of the cc after the one in which its load signal is asserted.
[Figure: timing diagram showing lda = 1 followed by reg. "a" loaded]

Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (×) and 1 (+) (cont'd)
(a) Scheduling
ii) Overlapped pipelined scheduling
[Figure: schedule chart with rows for × and + against cc's; × row holds c1(1), c1(2)]
(b) Arch. Synthesis
[Figure: datapath with registers a, b, c, d, x, y, z and their load signals, mux1 and mux2 (inputs I0, I1), one × unit, one + unit, and a demux. Schedule chart, continued: + row holds c2(1), c3(1), c2(2), c3(2); cc's run to 6]
(c) Controller FSM Synthesis
Controller FSM (entered on Reset; state labels in the figure: cc 3i, cc 3(i+1)):
• lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1   [y ← c+d, x ← a × b] (c1, c2)
• ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1   [z ← x+y] (c3)
• For 4 iterations, the overlapped schedule takes 9 cc's versus 12 cc's for the non-overlapped schedule.
• Overlapped schedule: time for n iterations = 2n+1; throughput = n/(2n+1) ~ 0.5 outputs/cc
• Non-overlapped schedule: time for n iterations = 3n; throughput = n/3n ~ 0.33 outputs/cc
• Throughput ratio 0.5/0.33 ~ 1.5, i.e., about 50% higher throughput (about 33% fewer cc's per output) using an overlapped schedule

Simple HLS Examples (contd)
• Some DFG control operation nodes:
    • Selector: inputs in1 (T) and in2 (F), a Condition (T/F) input, output out
    • Distributor: input in, a Condition (T/F) input, outputs out1 (T) and out2 (F)
• Conditional code: If (a > b) then c ← a-b; Else c ← b-a;
• Possible DFGs corresponding to the above conditional code:
[Figure: DFGs not recoverable from the extracted text]

Simple HLS Examples (contd)
• Iterative code: while (a > b) a ← a-b;
(a) Scheduling (using only 1 adder/sub)
[Figure: DFG with inputs a and b, a ">" node, a "-" node, a selector (sel) and a distributor (dist), operations labeled c1 and c2, output "final a"; the condition is initialized to F]
Scheduling & binding: the single + unit executes c1, c2, c1, c2, ... in successive cc's
(b) Arch. Synthesis
[Figure: datapath with registers a, b, r1 and final a (load signals lda, ldb, ldr1, ldfina), a mux, a demux, and one adder fed with b' and cin = 1; the adder's sign result goes to the FSM]
• b' + 1 = 2's compl. of b = -b
• s xor ovfl: = 1 → -ve; = 0 → +ve

Delay Nodes in DFGs
A delay node is generally implemented as a register (or a series of registers if clock period < T0); a delay node thus becomes a state variable.

Delay Nodes in DFGs (contd)
[Figure: a delay node replaced by a register; panels "Transformation in the DFG" and "Mapping to the architecture"]

Detailed HLS Example

Detailed HLS Example (contd)
[Figure: different paths (i/p → o/p) in the DFG]
Scheduling heuristic: Among the available operations, schedule on the available FUs those whose delay to o/p is the highest, breaking ties in favor of those operations u whose "sibling" o/ps (o/ps going to the same children) are available or will be available at u's earliest finish; otherwise the FU(s) will be idle unnecessarily, leading to a larger latency (this will also reduce lifetimes of sibling o/ps).
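The first part of the scheduling heuristic above (highest delay to o/p first) can be sketched as a small list scheduler. This is a minimal illustration, not the lecture's implementation: the `Op` class, the function names, and the FU type names `"mul"` and `"add"` are hypothetical, the sibling-based tie-break is not modeled (ties fall back to input order), and the DFG used is the earlier simple example x ← a × b (c1), y ← c+d (c2), z ← x+y (c3) with one × (2 cc's) and one + (1 cc).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str
    fu: str            # FU type the operation runs on (hypothetical names: "mul", "add")
    preds: tuple = ()  # names of operations whose o/ps this operation consumes

DELAY = {"mul": 2, "add": 1}  # cc's per FU type, as in the simple example

def delay_to_output(ops):
    """Longest delay from the start of each operation to a primary o/p."""
    by_name = {o.name: o for o in ops}
    succs = {o.name: [] for o in ops}
    for o in ops:
        for p in o.preds:
            succs[p].append(o.name)
    memo = {}
    def longest(n):
        if n not in memo:
            tail = max((longest(s) for s in succs[n]), default=0)
            memo[n] = DELAY[by_name[n].fu] + tail
        return memo[n]
    return {o.name: longest(o.name) for o in ops}

def list_schedule(ops, fu_count):
    """Return (start, finish) cc's per operation under the FU constraints."""
    prio = delay_to_output(ops)
    start, finish = {}, {}
    # Next cc in which each FU instance is free again.
    fu_free = {fu: [1] * k for fu, k in fu_count.items()}
    cc = 1
    while len(start) < len(ops):
        # An operation is available once all its inputs were produced in earlier cc's.
        ready = [o for o in ops if o.name not in start
                 and all(p in finish and finish[p] < cc for p in o.preds)]
        ready.sort(key=lambda o: -prio[o.name])  # highest delay to o/p first
        for o in ready:
            units = fu_free[o.fu]
            for i, free_at in enumerate(units):
                if free_at <= cc:
                    start[o.name] = cc
                    finish[o.name] = cc + DELAY[o.fu] - 1
                    units[i] = cc + DELAY[o.fu]
                    break
        cc += 1
    return start, finish

ops = [Op("c1", "mul"), Op("c2", "add"), Op("c3", "add", ("c1", "c2"))]
start, finish = list_schedule(ops, {"mul": 1, "add": 1})
print(start, "latency =", max(finish.values()))
```

On this DFG the sketch starts c1 and c2 in cc 1 and c3 in cc 3, for a latency of 3 cc's, which matches the 3 cc's per iteration of the non-overlapped schedule.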
(a) Scheduling w/ one × (2 cc's) & one + (1 cc); Goal: Minimize latency
(b) Reg. alloc. for o/p of operations
(c) Arch. synthesis
[Figure: the synthesized architecture]
For WAR constraint [can't store in d1, as would be natural, since d1's data is yet to be consumed by c6, which has not been scheduled yet]
Note: The above register allocation has been done w/ separate regs for multiplier and adder o/ps. It is sub-optimal (4 non-primary i/p regs. needed).

Detailed HLS Example (contd)

Detailed HLS Example: Register Allocation

Detailed HLS Example: Register Allocation (contd)
Scheduling heuristic: same as above (highest delay to o/p first, ties broken in favor of operations whose sibling o/ps are or will be available).
[Figure: schedule and register allocation, including register d0; 3 non-primary i/p regs. needed]
• In the conflict graph (one per FU), there is an edge between 2 variable nodes if their lifetimes overlap (indicating that different registers need to be allocated to them)
• Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is NP-hard in general
• The above type of conflict graph is called an interval graph (derived from the 1-dimensional intervals of the lifetimes)
• Min. graph coloring can be solved optimally in linear time for interval graphs (using the left-edge algorithm that we will see later for channel routing)

Detailed HLS Example: Register Allocation (contd)
[Figure: schedule and register allocation, including register d0; 3 non-primary i/p regs. needed]
Scheduling heuristic: Among the available operations, schedule on the available FUs those whose delay to o/p is the highest, breaking ties arbitrarily: B's lifetime increases, but D's (a dependent of B) decreases similarly. The heuristic should be based on more global information.
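The interval-graph coloring mentioned in the register allocation bullets can be sketched with a left-edge-style greedy pass: sort the lifetimes by their left edge and place each variable in the first register that is already free. This is a minimal sketch under stated assumptions, not the lecture's procedure: the function name `left_edge`, the half-open lifetime convention (a value occupies its register from cc `start` up to, but not including, cc `end`), and the variable lifetimes in the example are all hypothetical. Because it sorts first, this version runs in O(n log n) rather than linear time.

```python
def left_edge(lifetimes):
    """Assign registers to variables with interval lifetimes.

    lifetimes: dict mapping variable name -> (start, end), half-open,
               so two lifetimes conflict iff a.start < b.end and b.start < a.end.
    Returns a dict mapping variable name -> register index.
    """
    reg_free_at = []   # reg_free_at[r] = cc from which register r is free again
    assignment = {}
    # Process variables in order of increasing left edge (start of lifetime).
    for var, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for r, free_at in enumerate(reg_free_at):
            if free_at <= start:          # register r no longer holds a live value
                reg_free_at[r] = end
                assignment[var] = r
                break
        else:
            # Every existing register holds a value live at `start`,
            # so a new register is unavoidable.
            reg_free_at.append(end)
            assignment[var] = len(reg_free_at) - 1
    return assignment

# Hypothetical lifetimes, for illustration only.
lifetimes = {"A": (1, 3), "B": (2, 5), "C": (3, 4), "D": (5, 7), "E": (4, 6)}
regs = left_edge(lifetimes)
print(regs, "registers used =", len(set(regs.values())))
```

For these lifetimes the sketch uses 2 registers, which equals the largest number of simultaneously live variables; a new register is opened only when all existing ones hold values live at the same point, so no coloring of this interval graph can use fewer.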