Dynamic Acceleration Management of SystemC Emulation

Scott Sirowy, Chen Huang, and Frank Vahid†
Department of Computer Science and Engineering
University of California, Riverside
{ssirowy,chuang,vahid}@cs.ucr.edu
†Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported in part by the National Science Foundation and the Office of Naval Research.

[2/17] Introduction: Prototyping Circuits and Systems

[Figure: an Edge Detector and a Memory Controller connected by address, data, and go signals; a pixel value datapath over s1-s4 and s6-s9 with adders, a subtractor, and MIN against 255.]

Capture in HDL: SystemC
- C++ based
- Creation, instantiation, and connection of components
- Precisely timed communication and execution among concurrently executing components
- Supports both "software" and "hardware" constructs and semantics

    class EDGE_DETECTOR : public sc_module {
      //signal declarations
      …
      EDGE_DETECTOR() {
        SC_method(mainComp);
        sensitive << dataReady;
        SC_method(getPixel);
        sensitive << clock.pos();

[3/17] Introduction: Prototyping Circuits and Systems

Capture in HDL, then Emulation.

In-System Emulation
- Prior to time-consuming mapping and synthesis
- Quickly-obtained simulation interaction with real I/O
- But slower

[4/17] SystemC Emulation Engine

Emulation Engine platform (representative of many systems):
- Main Processor with Instruction Memory
- Read Signal Memory and Write Signal Memory
- Input Memory and Output Memory
- Real I/O peripherals: UART, Buttons, LEDs
- USB Interface

Emulation Engine Kernel:
- Virtual Machine
- Discrete Event Kernel
- Peripheral Access and Hooks
- USB Download Interface
[5/17] Emulation Engine Acceleration

[Figure: the Emulation Engine, with SystemC bytecode held in Instruction Memory.]

For some SystemC applications, emulation can be slow:
- An Edge Detection circuit required ~10 minutes to process a 320x240 image*

* on a 100 MHz/SRAM MicroBlaze SystemC Emulation Engine implementation

[6/17] Emulation Engine Acceleration

If available, use the platform FPGA to create bytecode accelerators (Accelerator 1, Accelerator 2, Accelerator 3):
- Execute SystemC bytecode natively
- Accelerators speed up emulation by over 100X

[7/17] Emulation Engine Acceleration Management

Example: an Image Processing System whose Event Queue holds Edge and Blur processes.

For each process on the Event Queue: emulate in software, or accelerate using a bytecode accelerator?

[Figure: Total Execution Time for three strategies. Emulate every process in software; accelerate every process (adds communication and accelerator loading overhead); dynamically manage accelerators.]
[8/17] Emulation Engine Acceleration Management

Online decision: should the process at the front of the Event Queue (here Edge) be accelerated? The decision uses a History Table.

History Table
Process         Edge  Emboss  Mean  Blur  Radial
# of uses       8     4       3     5     2
In Accelerator  Yes   No      No    Yes   No

[9/17] Dynamic Accelerator Management

Process  Emulation (ms)  Emulation + Accelerator (ms)  Loading Time (ms)
p1       80              10                            70
p2       60              20                            60
p3       70              25                            30

Initial state: Accelerator 1 and Accelerator 2 are preloaded with processes p1 and p2 from the SystemC circuit.

[10/17] Dynamic Accelerator Management

[Figure: timelines (Time in ms) on Accelerator 1 and Accelerator 2 for three schedules of the same Event Queue. Statically Preloaded: no reloads. Greedy schedule: reloads labeled 3 -> 2, 2 -> 1, 1 -> 3, 3 -> 2. Better schedule: a single reload labeled 3 -> 1.]

[11/17] Aggregate Gain Solution: AG Table

Gain = Emulation only - (Emulation + Accelerator)

Maintain a gain table with one entry for each process in the SystemC circuit:
    ag(i) = ag(i) + gain(i)

Fading process for temporal locality:
    ag(i) = ag(i) * f

How to define the fading factor f?
    f = min{Loading time / Gain, 1}

Example with F = 0.5:

Process  Emulation_only  Emulation+Acc  Gain
p1       200             10             190
p2       100             20             80
p3       50              25             25

Q = <p1, p1, p3, p2, p2, p1, p3>

Without fading:
Event   p1   p1   p3   p2   p2   p1   p3
ag(1)   190  380  380  380  380  570  570
ag(2)   0    0    0    80   160  160  160
ag(3)   0    0    25   25   25   25   50

With fading (for example, the second entry of ag(1) is 190*F + 190 = 285):
Event   p1   p1   p3   p2   p2   p1   p3
ag(1)   190  285  142  71   35   207  103
ag(2)   0    0    0    80   120  60   30
ag(3)   0    0    25   12   6    3    26

[12/17] AG: Overheads And Replacement Policy

Process  Emulation_only  Emulation+Acc  Gain
Edge     100             20             80
Blur     200             10             190

AG table: ag(Edge) = 80, ag(Blur) = 0

Loading time (LT): accelerator loading time(i)
Communication overhead (CO): (Dλ' - Dλ) * runtime(i)
Overhead = LT + CO

Policies:
- Load: ag(i) > Overhead

[13/17] AG: Overheads And Replacement Policy

AG table (updated): ag(Edge) = 80, ag(Blur) = 190

Policies:
- Load: ag(i) > Overhead
- Replace: ag(i) > Overhead + ag(j), where j is the accelerator to be replaced
- Wait: ag(i) > Overhead + ag(j) + wait_time(j)

[14/17, 15/17] Comparison Solutions

- Base Emulation: emulating every process on the main processor
- Infinite Accelerators: accelerating every process w/o loading overhead
- Static preloaded: each accelerator is statically assigned a process to execute when on the event queue and never changes
- Greedy: always assign the current process on the event queue to an accelerator

[16/17] Experiments and Results

[Figure: Total Execution Time (ms), axis 0 to 200, for the Random, Biased, and Periodic cases on two platforms: (a) Virtex-4 ML403, 1 accelerator; (b) Virtex-II Pro, 3 accelerators. Series: Base emulator, Infinite Accelerators, Statically Preloaded, Greedy, AG. Off-scale bars carry the labels 617, 397, 651, 389, 622, 428 and 651, 617, 622.]

- Aggregate Gain Algorithm on average 3.8X faster than statically preloading accelerators
- 1.3X faster than greedily assigning accelerators

[17/17] Conclusions

- SystemC emulation can be improved by dynamically managing the SystemC bytecode accelerators
- Applied the online Aggregate Gain Algorithm to the SystemC emulation framework
- Improves emulation performance by 14X compared to emulating all of the SystemC on a base software emulation kernel
- 3.8X performance improvement over statically preloading the accelerator engines
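
The AG table update from slide 11 and the Load / Replace / Wait checks from slides 12 and 13 can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation. The names `Process`, `ag_trace`, `decide`, `free_slot`, and `busy` are hypothetical. Two behaviors are assumptions inferred from the worked numbers on slide 11: every entry is faded on every event before the executed process's gain is added, and faded values are truncated to integers. The slides do not say when each policy is evaluated, so the branching in `decide` (free accelerator, idle occupied accelerator, busy accelerator) is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    emulation_only: int  # time when emulated in software
    emulation_acc: int   # time when executed with an accelerator

    @property
    def gain(self) -> int:
        # Gain = Emulation only - (Emulation + Accelerator)
        return self.emulation_only - self.emulation_acc


# Values taken from the AG Table example on slide 11.
PROCS = {"p1": Process(200, 10), "p2": Process(100, 20), "p3": Process(50, 25)}
Q = ["p1", "p1", "p3", "p2", "p2", "p1", "p3"]


def ag_trace(procs, queue, f=1.0):
    """Return a snapshot of the AG table after each event in `queue`.

    On every event all entries are faded by `f`, then the executed
    process's gain is added. f=1.0 disables fading.
    """
    ag = {name: 0 for name in procs}
    trace = []
    for name in queue:
        for key in ag:
            # Assumption: truncate to an integer, as the slide values suggest.
            ag[key] = int(ag[key] * f)
        ag[name] += procs[name].gain
        trace.append(dict(ag))
    return trace


def decide(ag_i, overhead, free_slot, ag_j=0, busy=False, wait_time_j=0):
    """Pick an action for process i given Overhead = LT + CO.

    free_slot: an empty accelerator exists (Load policy applies).
    ag_j, busy, wait_time_j: describe the occupied accelerator j that
    would be replaced (Replace policy) or waited for (Wait policy).
    """
    if free_slot:
        return "load" if ag_i > overhead else "emulate"
    if not busy:
        return "replace" if ag_i > overhead + ag_j else "emulate"
    return "wait" if ag_i > overhead + ag_j + wait_time_j else "emulate"


if __name__ == "__main__":
    for event, row in zip(Q, ag_trace(PROCS, Q, f=0.5)):
        print(event, row)
    # Overhead of 50 is a made-up value; the slides give no number for it.
    print(decide(ag_i=190, overhead=50, free_slot=False, ag_j=80))
```

With `f=0.5` the trace matches the faded rows on slide 11, and with `f=1.0` it matches the unfaded rows. In the final call, Blur (ag = 190) is compared against an accelerator holding Edge (ag = 80) with an assumed overhead of 50, so the Replace condition 190 > 50 + 80 holds.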