General-purpose SIMT processors
Sylvain Collange
INRIA Rennes / IRISA
[email protected]

[Slide 2] From GPU to heterogeneous multi-core
- Yesterday (2000-2010): homogeneous multi-core, discrete components
  - Central Processing Unit (CPU) and Graphics Processing Unit (GPU) on separate chips
- Today (2011-...): heterogeneous multi-core
  - Physically unified: CPU + GPU on the same chip
  - Logically separated: different programming models, compilers, instruction sets
- Tomorrow: unified programming models? A single instruction set?
  - One heterogeneous multi-core chip combining latency-optimized cores, throughput-optimized cores and hardware accelerators

[Slide 3] Outline
- Performance or efficiency?
  - Latency-oriented architectures
  - Throughput-oriented architectures
  - Heterogeneous architectures
- Dynamic SPMD vectorization
  - Traditional dynamic vectorization
  - More flexibility with state-free dynamic vectorization
- New CPU-GPU hybrids
  - DITVA: CPU with dynamic vectorization
  - SBI: GPU with parallel path execution

[Slide 4] The 1980s: the pipelined processor
Example: scalar-vector multiplication X ← a·X

Source code:
    for i = 0 to n-1
        X[i] ← a * X[i]

Machine code:
    move i ← 0
    loop:
        load   t ← X[i]
        mul    t ← a×t
        store  X[i] ← t
        add    i ← i+1
        branch i<n? loop

[Figure: sequential CPU with Fetch, Decode, Execute and Load/Store Unit stages working on successive instructions of the loop, backed by memory]

[Slide 5] The 1990s: the superscalar processor
- Goal: improve the performance of sequential applications
  - Latency: time to get the result
- Exploits Instruction-Level Parallelism (ILP)
- Uses many tricks: branch prediction, out-of-order execution, register renaming, data prefetching, memory disambiguation…
- Basis: speculation. Take a bet on future events:
  - If right: time gained
  - If wrong, roll back: energy lost

[Slide 6] What makes speculation work: regularity
Application behavior is likely to follow regular patterns.

    for(i…) {
        if(f(i)) {
            j = g(i);
            x = a[j];
        }
    }

- Regular case: control regularity (branch taken for i = 0, 1, 2, 3) and memory regularity (j = 17, 18, 19, 20)
- Irregular case: branch not taken, taken, taken, not taken; j = 21, 4, 17, 2
- Speculation exploits these patterns to guess accurately
- Applications: caches, branch prediction, instruction prefetch, data prefetch, write combining…

[Slide 7] The 2000s: going multi-threaded
- Memory wall: the compute/memory performance gap keeps growing; it is more and more difficult to hide memory latency
- Power wall: performance is now limited by power consumption
- ILP wall: law of diminishing returns on Instruction-Level Parallelism
- Result: a gradual transition from latency-oriented to throughput-oriented designs (simultaneous multi-threading, homogeneous multi-core), trading serial performance for throughput

[Slide 8] Homogeneous multi-core
- Replication of the complete execution engine
- Multi-threaded software: each thread processes one slice of the data

Machine code (per thread):
    move i ← slice_begin
    loop:
        load   t ← X[i]
        mul    t ← a×t
        store  X[i] ← t
        add    i ← i+1
        branch i<slice_end? loop

[Figure: two threads T0 and T1, each on its own core with private IF/ID/EX/LSU stages, sharing memory]
- Improves throughput thanks to explicit parallelism

[Slide 9] Simultaneous multi-threading (SMT)
- Time-multiplexing of the processing units
- Same software view as multi-core
[Figure: four threads T0-T3 sharing a single Fetch/Decode/Execute/Load-Store pipeline, with an instruction from a different thread in each stage]
- Hides latency thanks to explicit parallelism
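To make the data slicing on slides 8-9 concrete, here is a minimal C sketch of the multi-threaded X ← a·X kernel, assuming pthreads. The names scale_slice and NTHREADS, and the static block partitioning, are illustrative choices, not part of the original slides.

    /* build: cc scale.c -lpthread */
    #include <pthread.h>

    #define N 1024
    #define NTHREADS 4

    static float X[N];
    static float a;

    /* Each thread scales its own slice of X, mirroring the
       slice_begin/slice_end machine code on slide 8. */
    static void *scale_slice(void *arg) {
        int t = (int)(long)arg;
        int slice = N / NTHREADS;
        int begin = t * slice;        /* slice_begin */
        int end   = begin + slice;    /* slice_end   */
        for (int i = begin; i < end; i++)
            X[i] = a * X[i];
        return 0;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        a = 2.0f;
        for (int t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], 0, scale_slice, (void *)(long)t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], 0);
        return 0;
    }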
[Slide 10] Outline (recap; next: throughput-oriented architectures)

[Slide 11] Throughput-oriented architectures
- Also known as GPUs, but they do more than just graphics
- Target: highly parallel sections of programs
- Programming model: SPMD, one function run by many threads
  - One code, many threads: for n threads, X[tid] ← a * X[tid]
- Goal: maximize the computation / energy consumption ratio
- Many-core approach: many independent, multi-threaded cores
- Can we be more efficient? Exploit regularity

[Slide 12] Parallel regularity
Similarity in behavior between threads:
- Control regularity, e.g. switch(i) { case 2:… case 17:… case 21:… }
  - Regular: all four threads have i=17; irregular: i = 21, 4, 17, 2
- Memory regularity, e.g. r = A[i]
  - Regular: contiguous loads A[8], A[9], A[10], A[11]; irregular: scattered loads A[8], A[0], A[11], A[3]
- Data regularity, e.g. r = a*b
  - Regular: a=32, b=52 in every thread; irregular: a = 17, -5, 11, 42 and b = 15, 0, -2, 52

[Slide 13] Dynamic SPMD vectorization, aka SIMT
- Run SPMD threads in lockstep
- Mutualize the fetch/decode and load-store units:
  - Fetch 1 instruction on behalf of several threads
  - Read 1 memory location and broadcast it to several registers
[Figure: threads T0-T3 share IF and ID; four execute lanes run the same mul; the LSU serves the load/store of threads 0-3 together]
- SIMT: Single Instruction, Multiple Threads
- A wave of synchronized threads is called a warp
- Improves area and power efficiency thanks to regularity

[Slide 14] Example GPU: NVIDIA GeForce GTX 980
- SIMT: warps of 32 threads
- 16 SMs per chip, 4×32 cores per SM, 64 warps per SM
- 4612 Gflop/s, up to 32768 threads in flight
[Figure: warps 1-64 time-multiplexed over the cores of SM1 … SM16]

[Slide 15] SIMT vs. multi-core + explicit SIMD
- SIMT: all parallelism is expressed using threads; the warp size is implementation-defined; dynamic vectorization
- Multi-core + explicit SIMD: combination of threads and vectors; vector length fixed at compile time; static vectorization
- SIMT benefits: easier programming, and binary compatibility is retained

[Slide 16] Outline (recap; next: heterogeneous architectures)

[Slide 17] Heterogeneity: causes and consequences
- Amdahl's law: S = 1 / ((1 − P) + P/N), where P is the fraction of execution time spent in parallel sections and N the number of cores
- Latency-optimized multi-core (CPU): low efficiency on parallel sections, spends too many resources on them
- Throughput-optimized multi-core (GPU): low performance on sequential sections
- Heterogeneous multi-core (CPU+GPU): use the right tool for the right job; resources saved in parallel sections can be devoted to accelerating sequential sections
- M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.
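As a quick numerical check of the Amdahl formula on slide 17, a minimal C sketch (the function name amdahl is hypothetical):

    #include <stdio.h>

    /* Amdahl speedup: S = 1 / ((1 - P) + P/N),
       P = parallel fraction, N = number of cores. */
    static double amdahl(double P, int N) {
        return 1.0 / ((1.0 - P) + P / N);
    }

    int main(void) {
        /* Even with P = 0.95, 256 cores give only ~18.6x, capped at
           20x as N grows: the sequential 5% dominates, hence the case
           for a fast latency-optimized core next to throughput cores. */
        printf("S(P=0.95, N=8)   = %.2f\n", amdahl(0.95, 8));
        printf("S(P=0.95, N=256) = %.2f\n", amdahl(0.95, 256));
        return 0;
    }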
[Slide 18] Single-ISA heterogeneous architectures
- Proposed in academia, now found in embedded systems-on-chip
- Example: ARM big.LITTLE
  - High-performance CPU cores: Cortex-A57 (big)
  - Low-power CPU cores: Cortex-A53 (LITTLE)
- Different cores, same instruction set
- Application threads migrate dynamically between big and LITTLE cores according to demand

[Slide 19] Single-ISA heterogeneous CPU-GPU
- Enables dynamic task migration between CPU cores and GPU cores
  - Use the best core for the current job
  - Load-balance over the available resources

[Slide 20] Option 1: static vectorization
- Extend CPU SIMD instruction sets to the throughput-oriented cores
- Scalar + wide vector ISA on both core types, e.g. x86-64 + AVX-512
- Issue: conflicting requirements; either
  - the same, suboptimal SIMD vector length for all cores, or
  - loss of binary compatibility and ISA fragmentation
- Intel. Intel® Advanced Vector Extensions 2015/2016 Support in GNU Compiler Collection. GNU Tools Cauldron, 2014.

[Slide 21] Our proposal: dynamic vectorization
- Extend the SIMT execution model to general-purpose cores
- Scalar ISA on both latency and throughput cores
- Flexibility advantage: the SIMD width can be optimized for each core type
- Challenge: generalize dynamic vectorization to general-purpose instruction processing

[Slide 22] Outline (recap; next: traditional dynamic vectorization)

[Slide 23] Capturing instruction regularity
- How to keep threads synchronized? The challenge is conditional branches.
- Rules of the game:
  - One thread per SIMD lane
  - Same instruction on all lanes
  - Lanes can be individually disabled

Running example:
    x = 0;
    // Uniform condition
    if(tid > 17) {
        x = 1;
    }
    // Divergent conditions
    if(tid < 2) {
        if(tid == 0) {
            x = 2;
        } else {
            x = 3;
        }
    }

[Slide 24] Most common: the mask stack
- One activity bit per thread
- Trace on the example, for threads tid = 0, 1, 2, 3 (mask bits written left to right for tid 0..3):
  - x = 0;  mask 1111
  - if(tid > 17): uniformly false, skip; mask stays 1111
  - if(tid < 2): push, mask 1100
  - if(tid == 0): push, mask 1000; x = 2; pop, back to 1100
  - else: push, mask 0100; x = 3; pop, back to 1100
  - end of outer if: pop, mask 1111
- A. Levinthal and T. Porter. Chap - a SIMD graphics processor. SIGGRAPH '84, 1984.

[Slide 25] Traditional SIMT pipeline
- An instruction sequencer holds the PC and the mask stack
- Fetch broadcasts (instruction, activity bit) to every lane
- A lane whose activity bit is 0 discards the instruction
- Used in NVIDIA GPUs
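A minimal C sketch of one common mask-stack variant from slide 24; push_if, do_else and pop are hypothetical names, and real hardware differs in details (for instance, the trace above pushes around each arm of the if, while this variant flips the mask in place for the else).

    #include <stdint.h>

    #define WARP 4               /* threads per warp */

    typedef uint8_t mask_t;      /* one activity bit per thread: bit t = thread t */

    static mask_t stack[32];     /* divergence nesting is bounded in hardware */
    static int    sp = 0;
    static mask_t active = 0xF;  /* all WARP threads active */

    /* Divergent 'if': save the enclosing mask, keep only the threads
       whose condition is true. */
    static void push_if(mask_t taken) {
        stack[sp++] = active;
        active &= taken;
    }

    /* 'else': re-activate, within the enclosing mask, the threads
       that did not take the 'if' side. */
    static void do_else(void) {
        active = stack[sp - 1] & ~active;
    }

    /* End of the structured region: reconverge to the enclosing mask. */
    static void pop(void) {
        active = stack[--sp];
    }

    int main(void) {
        int x[WARP] = {0};
        /* if(tid > 17): uniformly false for tid = 0..3, no divergence */
        push_if(0x3);            /* if(tid < 2): threads 0 and 1 */
        push_if(0x1);            /* if(tid == 0): thread 0 */
        for (int t = 0; t < WARP; t++)
            if (active >> t & 1) x[t] = 2;
        do_else();               /* else: thread 1 */
        for (int t = 0; t < WARP; t++)
            if (active >> t & 1) x[t] = 3;
        pop();                   /* end inner if */
        pop();                   /* end outer if: all threads active again */
        /* x is now {2, 3, 0, 0} */
        return 0;
    }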
[Slide 26] Outline (recap; next: state-free dynamic vectorization)

[Slide 27] Goto considered harmful?
Control instructions in some CPU and GPU instruction sets:
- MIPS: j, jal, jr, syscall
- NVIDIA Tesla (2007): bar, bra, brk, brkpt, cal, cont, kil, pbk, pret, ret, ssy, trap, .s
- NVIDIA Fermi (2010): bar, bpt, bra, brk, brx, cal, cont, exit, jcal, jmx, kil, pbk, pret, ret, ssy, .s
- Intel GMA Gen4 (2006): jmpi, if, iff, else, endif, do, while, break, cont, halt, msave, mrest, push, pop
- Intel GMA SB (2011): jmpi, if, else, endif, case, while, break, cont, halt, call, return, fork
- AMD R500 (2005): jump, loop, endloop, rep, endrep, breakloop, breakrep, continue
- AMD R600 (2007): push, push_else, pop, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after
- AMD Cayman (2011): all of the R600 list, plus push_wqm, pop_wqm, else_wqm, jump_any, reactivate, reactivate_wqm

Why so many? They expose the control-flow structure to the instruction sequencer.

[Slide 28] SIMD is so last century
- MasPar MP-1 (1990): 1 instruction drives 16384 processing elements (PEs); PE ≈ 1 mm² in a 1.6 µm process; SIMD programming model
- NVIDIA Fermi (2010): 1 instruction drives 16 PEs (1000× fewer); PE ≈ 0.03 mm² in a 40 nm process (50× bigger, normalized to the process); more divergence; threaded programming model
- Trend: from centralized control to flexible, distributed control

[Slide 29] Moving away from the vector model
- Requirements for a single-ISA CPU+GPU:
  - Run general-purpose applications
  - Switch freely back and forth between SIMT and MIMD modes
- Conventional techniques do not meet these requirements
- Solution: stateless dynamic vectorization. Key idea:
  - Maintain 1 Program Counter (PC) per thread
  - Each cycle, elect one master PC to fetch from
  - Activate all the threads whose PC matches the master PC

[Slide 30] 1 PC per thread
On the running example (tid = 0..3): each thread keeps its own PC (PC0..PC3). Threads whose PC matches the elected master PC are active; the others stay inactive until their own PC is elected.

[Slide 31] Our new SIMT pipeline
- Vote: elect the master PC (MPC) among PC0 … PCn
- Fetch the instruction at MPC and broadcast (instruction, MPC) to all lanes
- Each lane compares MPC with its own PC: on a match it executes and updates its PC; on no match it discards the instruction

[Slide 32] Benefits of stateless dynamic vectorization
- Before (stack, counters):
  - O(n) or O(log n) memory, with n the nesting depth; 1 read/write port to that memory
  - Exceptions: stack overflow, underflow
  - Vector semantics, structured control flow only, specific instruction sets
- After (multiple PCs):
  - O(1) memory, no shared state
  - Allows thread suspension, restart and migration
  - Multi-thread semantics: traditional languages, compilers and instruction sets
  - Can be mixed with MIMD

[Slide 33] Scheduling policy: min(SP:PC)
- Which PC to choose as the master PC?
- Conditionals and loops: follow the order of code addresses: min(PC)
- Functions: favor the maximum nesting depth: min(SP)
- With compiler support: handles unstructured control flow too, with no code duplication
- Full backward and forward compatibility
(Example orderings from the slide: an if/else fetches the then block, else block and join point in address order; a while loop replays its body until all threads exit; a called function completes before the code after the call resumes.)
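A minimal C sketch of the master-PC election of slides 29-33, assuming a downward-growing stack so that a smaller SP means a deeper nesting level; thread_ctx_t, before and elect are illustrative names, not part of the original slides.

    #include <stdint.h>
    #include <stdbool.h>

    #define WARP 4

    /* Per-thread architectural state: stack pointer and program counter.
       min(SP:PC) compares SP first (favor the deepest nesting), then PC
       (favor the earliest code address). */
    typedef struct { uint32_t sp, pc; } thread_ctx_t;

    static bool before(thread_ctx_t a, thread_ctx_t b) {
        if (a.sp != b.sp) return a.sp < b.sp;
        return a.pc < b.pc;
    }

    /* Each cycle: elect the master PC, then activate the threads that
       match it. Returns the master context; 'active' receives one bit
       per matching thread. */
    static thread_ctx_t elect(const thread_ctx_t t[WARP], uint8_t *active) {
        thread_ctx_t mpc = t[0];
        for (int i = 1; i < WARP; i++)
            if (before(t[i], mpc)) mpc = t[i];
        *active = 0;
        for (int i = 0; i < WARP; i++)
            if (t[i].pc == mpc.pc && t[i].sp == mpc.sp)
                *active |= 1u << i;
        return mpc;
    }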
[Slide 34] Potential of min(SP:PC)
- Comparison of fetch policies on SPMD benchmarks
- PARSEC and SPLASH benchmarks for CPU, using pthreads, OpenMP and TBB
- Microarchitecture-independent model: an ideal SIMD machine
- Metric: average number of active threads
- Result: min(SP:PC) achieves reconvergence at minimal cost
- T. Milanez et al. Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads. Parallel Computing 40(9):548-558, 2014.

[Slide 35] Simty: a synthesizable SIMT CPU
- Proof of concept for stateless dynamic vectorization
- Written in synthesizable VHDL
- Runs the RISC-V instruction set (RV32I)
- Fully parameterizable warp size and warp count; 10-stage pipeline
- FPGA prototype on an Altera Cyclone IV based DE-2 board:
  - 8 warps × 4 threads: 7151 LEs, 12 M9Ks (6%), 72 MHz
  - 8 warps × 8 threads: 12765 LEs, 24 M9Ks (11%), 63 MHz
- The overhead of per-PC control is small

[Slide 36] Outline (recap; next: DITVA)

[Slide 37] DITVA: Dynamic Inter-Thread Vectorization Architecture
- Adds a dynamic vectorization capability to an in-order SMT CPU
- Runs existing parallel programs compiled for x86
- Scheduling policy: alternate min(SP:PC) and round-robin
- Baseline: 4-thread, 4-issue in-order core with explicit SIMD (4 scalar units, 2 SIMD units)
- DITVA: 4 warps × 4 threads, 4-issue (4 SIMT units)

[Slide 38] DITVA performance
- Speedup of 4-warp × 2-thread DITVA and 4-warp × 4-thread DITVA over the baseline 4-thread processor
- +18% and +30% performance, respectively, on SPMD workloads

[Slide 39] Outline (recap; next: SBI)

[Slide 40] Simultaneous Branch Interweaving (SBI)
- Co-issue instructions from divergent branches
- Fill inactive execution units using parallelism from divergent paths
- Where baseline SIMT serializes the two sides of a divergent branch in the control-flow graph, SBI issues two instructions from different paths in the same cycle
- N. Brunie, S. Collange, G. Diamos. Simultaneous branch and warp interweaving for sustained GPU performance. ISCA 2012.

[Slide 41] Conclusion: the missing link
- CPU today: CPU ISA with multi-core, multi-thread execution; GPU today: the SIMT model
- DITVA and SBI bridge the two: a new design space with a range of architecture options between multi-core CPUs and GPUs
- Enables heterogeneous platforms with a unified instruction set
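To close, a minimal C sketch of the SBI idea from slide 40: elect a primary path with min(SP:PC), then a secondary path among the remaining threads, so instructions from two divergent paths can be co-issued in the same cycle. elect_two is a hypothetical name; the actual SBI policy (ISCA 2012) involves further lane and scheduling constraints.

    #include <stdint.h>

    #define WARP 4

    typedef struct { uint32_t sp, pc; } thread_ctx_t;

    static void elect_two(const thread_ctx_t t[WARP],
                          uint8_t *primary, uint8_t *secondary) {
        /* Primary path: min(SP:PC) over all threads. */
        int lead = 0;
        for (int i = 1; i < WARP; i++)
            if (t[i].sp < t[lead].sp ||
                (t[i].sp == t[lead].sp && t[i].pc < t[lead].pc))
                lead = i;
        *primary = *secondary = 0;
        for (int i = 0; i < WARP; i++)
            if (t[i].sp == t[lead].sp && t[i].pc == t[lead].pc)
                *primary |= 1u << i;

        /* Secondary path: min(SP:PC) among the threads left inactive
           by the primary election. */
        int sec = -1;
        for (int i = 0; i < WARP; i++) {
            if (*primary >> i & 1) continue;
            if (sec < 0 || t[i].sp < t[sec].sp ||
                (t[i].sp == t[sec].sp && t[i].pc < t[sec].pc))
                sec = i;
        }
        if (sec < 0) return;   /* no divergence: nothing to interweave */
        for (int i = 0; i < WARP; i++)
            if (t[i].sp == t[sec].sp && t[i].pc == t[sec].pc)
                *secondary |= 1u << i;
    }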