Modulo Graph Embedding: Mapping Applications onto Coarse-Grained Reconfigurable Architectures

Hyunchul Park, Kevin Fan, Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan
University of Michigan, Electrical Engineering and Computer Science

Coarse-Grained Reconfigurable Architecture (CGRA)
[Figure: CGRA diagram with blocks labeled Config, FU, and LRF]
• Array of PEs connected by a mesh-like interconnect
• Characterized by array size, node functionalities, interconnect, and register file configurations
• Executes compute-intensive kernels in multimedia applications

CGRA: Attractive Alternative to ASICs
• Suitable for running multimedia applications on embedded systems
  – High computation throughput
  – Low power consumption and scalability
  – High flexibility with fast configuration
• MorphoSys: 8x8 array with RISC processor
  – SIMD-style execution of loops
• PipeRench: 1-D reconfigurable hardware
  – Virtualizes the hardware pipeline
• ADRES: 8x8 array with tightly coupled VLIW
  – Modulo scheduling with simulated annealing

Scheduling in CGRA
• Different from conventional VLIW
  – Sparse interconnect and distributed register files
  – No dedicated routing resources
• Need a good compiler to exploit the abundance of computing resources
[Figure: conventional VLIW (FU0 to FU3 sharing a central RF) next to a CGRA (FU0 to FU3, each with its own LRF)]

Objectives of This Work
• Modulo scheduling technique for CGRAs
  – Exploit loop-level parallelism by overlapping execution of iterations
• Targeting low-cost CGRAs
  – Achieve a quality schedule under hardware restrictions
• Fast compilation time

Modulo Scheduling Basics
• Expose loop-level parallelism by overlapping execution of iterations
• Initiation interval (II)
  – A new iteration is started every II cycles
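As a minimal sketch of the overlapped execution described above, the snippet below lists which operations run in each cycle when a new iteration starts every II cycles. The three-operation loop body (A, B, C, one cycle each, executed back to back) and the function name `overlapped_schedule` are illustrative assumptions, not taken from the slides.

```python
def overlapped_schedule(ops, ii, iterations):
    """Return {cycle: [(op, iteration), ...]} for an overlapped loop.

    Assumption: every operation takes one cycle and the operations of one
    iteration run back to back in the order given by `ops`.
    """
    schedule = {}
    for it in range(iterations):
        start = it * ii  # iteration `it` starts II cycles after the previous one
        for offset, op in enumerate(ops):
            schedule.setdefault(start + offset, []).append((op, it))
    return schedule


sched = overlapped_schedule(["A", "B", "C"], ii=1, iterations=3)
for cycle in sorted(sched):
    print(cycle, sched[cycle])
```

With II = 1, cycle 2 is the steady state: C of iteration 0, B of iteration 1, and A of iteration 2 execute in the same cycle.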
[Figure: Overlapped Execution, showing operations A, B, C of successive loop iterations staggered by II]

Modulo Scheduling for CGRA
• Mapping a DFG onto a 3-D scheduling space
• Limited number of scheduling slots: (number of PEs) x II
  – Minimize routing cost (number of slots used for routing)
• Sparse interconnect and distributed register files
  – Ensure routability of operands
[Figure: a DFG mapped onto the scheduling space of a 4x4 CGRA, with time (II) as the third dimension]

Our Approach
• Systematic approach to generate a good schedule in reasonable time
• Minimize routing cost
  – Convert the scheduling problem into graph embedding
  – Leverage a graph embedding algorithm
• Ensure routability of operands
  – Skewed scheduling space
  – Create a narrow, but tall scheduling space

1: Minimize Routing Cost
• Routing cost: number of PEs used for routing
• Determined by positions of producer and consumer
  – Minimize distance between producers and consumers
• Height-based list scheduling
  – Schedule operations in the order of dependence height
  – Place consumers close to producers
• Need to carefully place operations of the same height

Scheduling Example – Routing Cost
[Figure: a DFG with operations 0 to 6 scheduled on a 1x4 CGRA (PE 0 to PE 3) in two different ways; the first schedule contains the extra entries 4' and 5' and is labeled Routing Cost = 2]
Common consumer information is important!
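The effect shown in the scheduling example can be sketched with a toy cost model. The model is an assumption made for illustration, not the cost function from the slides: PEs sit in a 1-D row, a consumer can read a value from its own PE or a direct neighbor for free, and every additional hop occupies one routing slot. Time is not modeled. The DFG edges and both placements are hypothetical.

```python
def routing_cost(edges, pe_of):
    """Count routing slots under a toy 1-D model (illustrative assumption).

    edges: list of (producer, consumer) pairs in the DFG.
    pe_of: dict mapping each operation to the index of its PE.
    """
    cost = 0
    for producer, consumer in edges:
        distance = abs(pe_of[producer] - pe_of[consumer])
        cost += max(0, distance - 1)  # neighbors are free, extra hops cost a slot
    return cost


# Hypothetical DFG: ops 0 and 1 feed op 4, ops 2 and 3 feed op 5,
# and ops 4 and 5 feed op 6.
edges = [(0, 4), (1, 4), (2, 5), (3, 5), (4, 6), (5, 6)]

# Producers of a common consumer placed next to each other.
near = {0: 0, 1: 1, 2: 2, 3: 3, 4: 1, 5: 2, 6: 1}
# Producers of a common consumer placed far apart.
far = {0: 0, 2: 1, 3: 2, 1: 3, 4: 1, 5: 3, 6: 2}

print(routing_cost(edges, near), routing_cost(edges, far))  # prints: 0 2
```

Under this model, the only difference between the two placements is whether operations that share a consumer end up on adjacent PEs.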
[Figure, second schedule: Routing Cost = 0]

Affinity Graph Heuristic
• Consider placement of operations with the same height together
  – Use common consumer information
• Affinity value between operations
  – Measured by the distance of common consumers in the DFG
• Construct affinity graph
  – Nodes: operations; edges: affinity values
• Place operations with affinity edges close to each other

Affinity Graph Example
[Figure: a DFG with operations 0 to 5 at heights 3, 2, and 1, its affinity graph, and a bad and a good mapping onto a 2x4 CGRA; caption: drawing the affinity graph onto the scheduling space]

Leveraging Graph Embedding
• Graph embedding
  – Drawing a graph onto a target space
• Grid layout algorithm by Li & Kurata
  – Embeds complicated biochemical networks onto a 2-D grid space
  – Simulated annealing
• Our scheduling problem is a graph embedding problem
  – Draw the affinity graph onto the scheduling space, minimizing edge length
[Figure: Process Flow of Grid Layout [Li 2005]]

2: Ensure Routability of Operands
• Resources are repeatedly used every II cycles
  – Routing can fail due to previously scheduled operations
• Backtracking: hard to make forward progress for CGRA
• Take a preventative approach
[Figure: a DFG with operations 0 to 7 scheduled on a 1x3 CGRA (PE 0 to PE 2) with initiation interval II; routing fails for Op 7]
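The reuse of resources every II cycles is commonly tracked with a modulo reservation table. The sketch below is a generic illustration under stated assumptions, not the scheduler from the slides: the class name `ModuloReservationTable` and its methods are hypothetical, and each (PE, time mod II) slot is assumed to hold at most one operation or routing step.

```python
class ModuloReservationTable:
    """Tracks which (PE, time mod II) slots are taken (hypothetical helper)."""

    def __init__(self, num_pes, ii):
        self.num_pes = num_pes
        self.ii = ii
        self.slots = {}  # (pe, time % ii) -> name of the occupant

    def is_free(self, pe, time):
        return (pe, time % self.ii) not in self.slots

    def reserve(self, pe, time, op):
        """Try to claim a slot; return False if an earlier placement holds it."""
        if not 0 <= pe < self.num_pes:
            raise ValueError("PE index out of range")
        key = (pe, time % self.ii)
        if key in self.slots:
            return False
        self.slots[key] = op
        return True


mrt = ModuloReservationTable(num_pes=3, ii=2)
placed_first = mrt.reserve(pe=1, time=0, op="op0")
# Time 2 wraps to the same modulo slot as time 0, so this attempt fails.
placed_clash = mrt.reserve(pe=1, time=2, op="route_op7")
# One cycle later the slot on the same PE is still free.
placed_other = mrt.reserve(pe=1, time=3, op="route_op7")
print(placed_first, placed_clash, placed_other)
```

The failed reservation is the situation the slide describes: an operation placed earlier blocks a slot that a later operand would need for routing.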
Skewed Scheduling Space
[Figure: skewed scheduling space over PE 0 to PE 2 and times 0 to 5, holding operations 0 to 7]
• Should prevent routing failures in advance
• Skew the scheduling space
  – Staggering down to the right
• Create a narrow, but tall scheduling space
  – Operations can be routed to the right
• Dynamically adjust the scheduling space

System Flow
[Figure]

Experimental Setup
• Twelve innermost loop kernels from various domains
• Three designs with different RF configurations
  – Evaluate the impact of register file sharing
[Figure: the three designs, labeled Dedicated RF, Shared RF, and Central RF]

Evaluation of Affinity Heuristic
[Figure: bar chart of Routing Cost (y-axis, 0 to 60) for the affinity and greedy heuristics on blowfish, channel, dct, fft, fir, fsed, sharp, sobel, viterbi]
• Results of acyclic scheduling
• Average of 59% reduction in routing cost

Modulo Graph Embedding vs. Simulated Annealing
[Figure: bar chart of utilization (y-axis, 0 to 1) for MGE and SA on blowfish, channel, dct, fft, fir, fsed, iir, sharp, sobel, viterbi, idct, dequant, and the average]
• Utilization = (# slots used for computation) / (# total slots)
• Time: MGE (~5 sec) vs. SA (5 min to 3 hours)

Impact of Register File Configurations
[Figure]

Conclusions
• Modulo scheduler targeting low-cost CGRAs
  – Provide high computation throughput, scalability, power efficiency
• Two heuristics to generate a good schedule
  – Affinity graph heuristic
  – Skewed scheduling space
  – Average utilizations of 56-68% for three designs
• Systematic approach allows fast compilation time
  – All benchmarks finished within 5s

Questions?