CHAINSAW Von-Neumann Accelerators To Leverage Fused Instruction Chains Amirali Sharifian, Snehasish Kumar, Apala Guha, Arrvindh Shriraman 1 AXC Challenge 1: Idleness Application DFG 1 Spatial Fabric 81 2 92 12 3 3 Fabric Size Dataflow graph size More dataflow dependencies More Idleness 2 AXC Challenge 2: Data movement Compute 81 92 30% 12 3 70% Communication 70% energy for moving data 3 Von-Neumann Features DFG Ins. Buffer 1 1 2 Reg 2 3 ALU 3 Temporal Mapping = Less Idleness Central Register File Fetch and Decode 4 Our Approach : Fused Instruction Chains Von-Neumann + Chains CHAIN DFG Reg. 1 1 2 2 3 3 Compiler exposed Bypass Temporal Mapping = Less Idleness Bypass = Internalize communication 5 Our Approach : Fused Instruction Chains Von-Neumann w/ Chains CHAIN DFG Do chains exist in a DFG? Reg. 1 1 How to form the chains? Compiler 2 2 exposed Bypass Reg. What are the challenges? 3 3 Modeling and Evaluation Temporal Mapping = Less Idleness Central Register File 6 CHAINs vs VLIW Chains VLIW Finding dependent instructions Finding independent instructions Vertical Fusion Horizontal Fusion 7 Do chains exist in a DFG ? 100 1-2 Op 3-4 Ops 5+ Ops 50 0 50–80% of DFG part of 3+ op chains 8 How to form chains? Reduce Communication Chained DFG C1 C2 1 4 2 5 3 6 Schedule C1 1 2 3 C2 4 5 6 Internalize communication May fail to discover ILP 9 How to form chains? Optimize for ILP Chained DFG C1 Schedule C1 C2 1 1 4 C3 C3 2 5 3 2 3 C2 4 5 6 6 Same ILP as the prog. Increased communication 10 How much communication is within chains? Inter-Chain Intra-Chain 100 80 60 40 20 0 ILP+ SIZE+ 9 Workloads (dwt, soplex) ILP+ SIZE+ 9 Workloads (parser, craft) ILP+ SIZE+ 8 Workloads (gcc, namd) 40-60% of communication localized 11 How to extract – longer – Chains Control Flow 12 How to extract – longer – Chains GUARD Control Flow Larger Superblocks/Paths ⇒ Larger chains 13 CHAINSAW is an Accelerator IN/ OUT Reg IN/ OUT Reg OOO Core Control free HOT PATH Ins. Buffer Ins. Buffer Reg. File Fetch/ Decode Fetch/ Decode Bypass Reg BUS Cache Mem. Reg. File Bypass Reg CHAINSAW WORKLOAD Only hot paths Limited inst. buffer 14 Multi-Lane CHAINSAW Execution Dataflow Graph 1 D2 C2 C0 Lane 1 Lane 2 Ins. Buffer Ins. Buffer 4 C1 D1 4 2 3 5 1 2 6 C2 C0 C1 5 3 6 D1 D2 Register file 15 Chainsaw – Fetch and Decode Dataflow Graph Instruction Fields C1 D1 4 Op 4 5 5 6 6 IN0/1 WR FWD L/R OUT0/1 1 0 X 0 1 0 0 1 1 X 1 X 0 1 0 Only 13bits is needed to decode! 16 Evaluation – Dynamic Energy • Chainsaw adds Fetch/Decode cost for dynamic energy • CGRA network overhead dominate Chainsaw F/D cost 100.0 Communication 80.0 Computation F/D OOO-4 F/D cost = 8% 60.0 40.0 20.0 0.0 gzip bzip2 mcf2k lbm parser 45% less than 4-way OOO 14% less than CGRA8x8 dwt53 17 Evaluation – Data movement energy 100 Inter-Chain 80 Intra-Chain CGRA 8X8 60 40 20 0 gzip bzip2 mcf2k lbm parser dwt53 Chainsaw reduces 40% of energy 18 Evaluation – Performance CHAINSAW8 CGRA 8X8 100 80 60 40 20 0 gzip bzip2 hmmer lbm parser Within 73% of CGRA8x8 20% better than OOO core dwt53 19 Chainsaw is a Von-Neumman accelerator Chains sequentially dependent operations. Chainsaw Accelerator: Exploit lack of ILP Reduce communication energy Reuse functional units Energy < CGRA Performance ≃ CGRA IN/ OUT Reg IN/ OUT Reg Ins. Buffer Ins. Buffer Reg. File Reg. File Fetch/ Decode Fetch/ Decode Bypass Reg Bypass Reg BUS 81 92 3 20 Q&A github.com/sfu-arch/chainsaw 21 AXC Challenge 2: Data movement Spatial Fabric 81 COMPUTE 92 12 3 SWITCH 50% Energy overhead for data movement 22 Evaluation – Data movement energy Intra-Chain 1 0.8 0.6 0.4 0.2 0 gzip bzip2 mcf2k lbm parser dwt53 Reduced energy in Chainsaw Chainsaw internalizes 50%+ of comm. 24 Evaluation – Dynamic Energy • Chainsaw adds Fetch/Decode cost for dynamic energy • CGRA network overhead dominate Chainsaw F/D cost 1 0.8 0.6 0.4 0.2 0 Communication gzip bzip2 mcf2k Ops lbm F/D parser OOO-4 dwt53 13% less than CGRA 45% less than 4-way OOO 25
© Copyright 2024 Paperzz