Eric Schwarz
June 22, 2015
The IBM z13 SIMD Accelerators for Integer, String, and Floating-Point
© 2015 IBM Corporation


z Systems - Processor Roadmap

z10, 2/2008
- Top tier single thread performance, system capacity
- Workload consolidation and integration engine for CPU intensive workloads
- Decimal FP, InfiniBand, 64-CP image, large pages

z196, 9/2010
- Leadership single thread, enhanced throughput
- Accelerator integration, out of order execution
- Water cooling, PCIe I/O fabric, RAIM, enhanced energy management

zEC12, 8/2012
- Leadership system capacity and performance
- Modularity & scalability
- Transactional memory, dynamic optimization
- 2 GB page support, step function in system capacity

z13, 1/2015
- Dynamic SMT (supports two instruction threads), SIMD
- Improved out-of-order, PCIe attached accelerators
- Business analytics optimized, shared memory

[Die diagram: four cores (Core 0-3), each with a private L2, two shared L3 controllers backed by eDRAM (L3B), MCU, co-processors (CoP), and MC/GX I/O.]


z13 Continues the CMOS Mainframe Heritage Begun in 1994

[Chart: clock frequency (MHz/GHz) and per-engine capacity by generation.]

Year  Machine  Process     Cores**  Frequency  PCI*
2000  z900     180 nm SOI    16     770 MHz      -
2003  z990     130 nm SOI    32     1.2 GHz      -
2005  z9 EC     90 nm SOI    54     1.7 GHz      -
2008  z10 EC    65 nm SOI    64     4.4 GHz     902
2010  z196      45 nm SOI    80     5.2 GHz    1202
2012  zEC12     32 nm SOI   101     5.5 GHz    1514
2015  z13       22 nm SOI   141     5.0 GHz    1695

Highlights by generation: z900 - full 64-bit z/Architecture; z990 - superscalar, modular SMP; z9 EC - system level scaling; z10 EC - high-frequency core, 3-level cache; z196 - OOO core, eDRAM cache, RAIM memory, zBX integration; zEC12 - OOO and eDRAM cache improvements, PCIe Flash, architecture extensions for scaling; z13 - SMT & SIMD, up to 10 TB of memory.

* MIPS tables are NOT adequate for making comparisons of z Systems processors.
Additional capacity planning is required.
** Number of PU cores for customer use.


z13 Core Changes (1Q 2015 GA)

- 8 cores per die; 24 CP chips in the system; 192 cores x 2 threads => 384 threads
- Grows LSPR 8-way by 11%, up to 1.4x performance with SMT2
- Single thread -> 2-way SMT
- Double the instruction fetch, decode, and dispatch bandwidth
- Frequency drop (5.2 GHz z196 -> 5.5 GHz zEC12 -> 5.0 GHz z13)
- Double the number of execution units:
  – 2 FXU writer pipes, 2 FXU condition-code pipes
  – 2 BFU, 2 DFU (quad precision is faster, divide is faster)
  – 2 SIMD units
  – 2 DFX units to accelerate SS-format decimal


z13 Instruction / Execution Dataflow

[Diagram: the instruction cache feeds decode/crack/dispatch/map, which feeds a branch queue and two issue queues (side 0 and side 1). Each side has an LSU pipe, two FXU pipes (a/b), and vector units (VBU/VFU) built around a 128-bit vector/FPR register file with a 128-bit string/integer SIMD unit, a BFU, and a DFU. The duplicated instruction flow and execution units raise core throughput; the new architected registers and execution units accelerate business analytics workloads.]

* FXa pipes execute register writers and support back-to-back execution to themselves; FXb pipes execute non-register-writers and non-relative branches (needs 3-way AGEN).


Performance

- Technology is not getting any faster; System z has already been the fastest, at 5+ GHz, for the past couple of generations
- Parallelism is the future:
  – Simultaneous Multi-Threading – SMT
  – Single Instruction Multiple Data – SIMD
- Huge performance gains are possible through parallelism
- We have to be smarter and figure out the best ways to exploit parallelism


SIMD – Single Instruction Multiple Data

[Illustration: old road vs. new super highway.]

Old: one 64-bit operation at a time. New: one operation up to 128 bits wide, such as sixteen 8-bit operations at once.


Different Size Data Types – All 128 Bits Wide

16 x 8-bit, 8 x 16-bit, 4 x 32-bit, 2 x 64-bit, or 1 x 128-bit elements, vs. the old 1 x 64-bit.
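The lane view of a 128-bit register can be sketched in ordinary code. This is an illustrative model only, not z13 vector code; `simd_add` is a hypothetical helper that treats one 128-bit integer as independent lanes of a chosen width, the way a SIMD add does.

```python
# Illustrative model (not z13 code): one 128-bit SIMD add interpreted
# at different element widths. Each lane wraps modulo 2**element_bits,
# just as an independent hardware lane would.

def simd_add(a: int, b: int, element_bits: int) -> int:
    """Add two 128-bit values as independent lanes of element_bits each."""
    assert 128 % element_bits == 0
    mask = (1 << element_bits) - 1
    result = 0
    for lane in range(128 // element_bits):
        shift = lane * element_bits
        lane_sum = (((a >> shift) & mask) + ((b >> shift) & mask)) & mask
        result |= lane_sum << shift
    return result

# Sixteen 8-bit adds in "one operation": a carry out of one byte lane
# never ripples into the next lane.
a = int.from_bytes(bytes(range(16)), "little")
b = int.from_bytes(bytes([0xFF] * 16), "little")
out = simd_add(a, b, 8).to_bytes(16, "little")
# each byte lane computes (i + 0xFF) mod 256
```

With `element_bits=128` the same helper degenerates to a single 128-bit add, matching the 1 x 128 interpretation above.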
[Figure: element sizes within a 128-bit register – byte, halfword, word, doubleword, quadword – vs. a single 64-bit register.]


Data Types Supported

- Integers – unsigned and two's complement
- Floating-point – System z supports more formats than any other platform:
  – 32-bit (Single Precision – SP), 64-bit (Double Precision – DP), and 128-bit (Quad Precision – QP)
  – Radices of 2 (Binary Floating-Point – BFP), 10 (Decimal – DFP), and 16 (Hexadecimal – HFP)
- Character strings
  – Characters of 8, 16, and 32 bits
  – Null-terminated (C) and length-based (Java)


z13 SIMD Accelerates

- Integer
- Binary floating-point double precision – BFP DP (and, indirectly, quad precision)
- Strings


SIMD Integer

z13 has 2 new SIMD execution pipelines, each 128 bits wide:
16 x 8-bit / 8 x 16-bit / 4 x 32-bit / 2 x 64-bit / 1 x 128-bit

- Add, subtract, compare
- Logical operations
- Shifts
- Min / max / absolute value
- Select
- CRC / checksum

Element types in a 128-bit register: byte, halfword, word, doubleword, quadword (16 x B, 8 x HW, 4 x W, 2 x DW, 1 x QW).


zEC12 (grey) Integer vs. z13 (grey + purple)

zEC12 has GPRs (16 x 64-bit) feeding 2 FXU pipes. z13 adds a vector register file (VR, 32 x 128-bit) feeding 4 FXU pipes (FX0A/B, FX1A/B) plus 2 SIMD pipes – increased from 2 x 64-bit pipelines to the equivalent of 8 x 64-bit pipelines, with even more parallelism for smaller integer data types.


Integer Dataflows from zEC12 to z13

The integer dataflow roads are 4 times as wide, supporting up to 18 times the number of elements. [Illustration: two old 64-bit one-lane roads become two new 64-bit one-lane roads plus two new 128-bit SIMD highways.]


Integer – Error Coding Acceleration

- Includes 64-bit Cyclic Redundancy Coding (CRC) assists that compute AxB + CxD + E on 64-bit operands in carryless form.
- Also includes a checksum accelerator that adds 5 x 32-bit operands modulo 2**32, folding a new 128-bit operand into the running sum every cycle.
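The two error-coding assists above can be modeled in a few lines. This is a hedged sketch of the arithmetic only – `clmul`, `crc_assist`, and `checksum_step` are hypothetical names, not the actual instruction interfaces – showing what "carryless" multiply-accumulate and mod-2**32 checksum accumulation mean.

```python
# Illustrative model (not the z13 instruction set) of the CRC assist's
# carryless (GF(2) polynomial) arithmetic and the checksum accumulation.

def clmul(a: int, b: int) -> int:
    """Carryless multiply: shift-and-XOR instead of shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def crc_assist(a, b, c, d, e):
    """A*B + C*D + E where multiply is carryless and add is XOR."""
    return clmul(a, b) ^ clmul(c, d) ^ e

def checksum_step(running: int, words) -> int:
    """Fold one 128-bit chunk (four 32-bit words) into a running sum
    modulo 2**32 -- five 32-bit operands added per step."""
    for w in words:
        running = (running + w) % (1 << 32)
    return running
```

In GF(2), addition is XOR, so the carryless product distributes over XOR – the property CRC algorithms rely on.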
More Execution Pipelines Need More Data

- Quadrupled the FPRs to create the vector register file
- VRs contain floats, integers, and strings
- The FPRs overlay bits 0:63 of VRs 0-15; the VR file is 32 registers x 128 bits, alongside 16 GPRs x 64 bits
- Think of it as a parking lot for rental cars, or a truck depot


Speeding Up Hash Functions / Integer Divide

- z196
  – Multiplicative integer divide algorithm
  – Inside the BFU, 1 engine, blocking
- zEC12
  – SRT based, 1st generation (k=2)
  – Speedup over z196: 1.4 – 3.6x
- z13
  – SRT based, 2nd generation (k=3)
  – Stand-alone engines in the FXU, 2 engines
  – Non-blocking
  – Speedup over zEC12: 1.7x; throughput: 3.4x

[Chart: latency of a 64-bit integer divide in cycles (0-90) vs. number of non-zero quotient bits (1-64), for z196, zEC12, and z13.]

Example: 234876 / 3 = 78292 produces a long quotient, while 234876 / 7893 = 29 produces a short one.


z13 Floating-Point Throughput Is Twice as Wide

- zEC12 has 1 BFU pipeline; z13 has 2 BFU pipelines
- The SIMD / vector architecture makes it easy to clump data to fully utilize 2 BFUs with 1 thread
- Note: a vector operation is double-pumped to the same BFU pipeline


Improved Binary Floating-Point (DP, SP) Latencies

Latency in cycles             z196 / zEC12   z13
Add, sub, mul, FMA, convert         8          8
Compare                             8          3
Divide 32b                         33         17
Divide 64b                         40         27
SQRT 32b                           39         21
SQRT 64b                           53         36

z13 allows multiple floating-point units to execute at the same time and does not stall on long operations like divide and square root.

[Diagram: on zEC12, all float operations share one pipeline; on z13, each floating-point pipeline consists of 5 sub-pipelines – integer & string, divide/sqrt, BFP/HFP FMA, DFP and quad, and fixed-point BCD add/sub.]
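The divide discussion above says latency tracks the quotient's bit count, with an SRT divider retiring k bits per iteration (k=2 on zEC12, k=3 on z13). The following toy model is for intuition only; the fixed overhead constant is an assumption, not a z13 parameter.

```python
# Toy latency model (assumed, for intuition only): an SRT divider
# retires k quotient bits per cycle, so latency scales with the
# quotient's bit length rather than the operand width.

def srt_cycles(dividend: int, divisor: int, k: int, overhead: int = 6) -> int:
    quotient = dividend // divisor
    qbits = max(1, quotient.bit_length())
    return overhead + -(-qbits // k)   # ceil(qbits / k) iterations

# 234876 // 3 = 78292, a 17-bit quotient: many iterations.
# 234876 // 7893 = 29, a 5-bit quotient: only a few.
```

The model reproduces the slide's two qualitative points: a short quotient finishes sooner than a long one, and raising k from 2 to 3 shortens every divide.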
On zEC12, run-on instructions block the single pipeline. z13's pipelines are non-blocking: an operation blocks only a specific resource / sub-pipeline, such as the divider.


Floating-Point Quad Precision Is Much Faster

- All other platforms use software; z is the only platform to do quad precision in hardware
- Quad precision moved to the wide pipeline
- Now over 3x faster, with 2x the bandwidth, for a total of 6x the performance of zEC12
- Also non-blocking (zEC12 needed multiple blocking passes; z13 has side A and side B engines)

BFP quad precision latency in cycles:

            zEC12     z13
add, sub      35       11
multiply    55 - 97    23
divide      ~165       49
sqrt        ~170       66

zEC12 has 1 engine; z13 has 2 engines, each 3x faster.


String Acceleration

- Character types supported: byte, halfword, and word
- Loads and stores for both
  – known-length strings (Java strings), and
  – strings with a null terminating character (C strings)
- String search acceleration
  – Find Equal
  – Find Not Equal
  – Range Compare


Safe String Loads

In the past, loading or storing a string could raise an exception: a normal load of "THE END\0" could run past the terminator into a protected page (holding, say, SSNs or passwords) and page fault. The new instructions – Load with Length for Java strings and Load to Block Boundary for C strings – are exact and will not touch storage beyond the string, so they never set off that alarm.


String Acceleration with Powerful New Instructions

- 16 characters of the string can be examined every cycle
- The search text can be 16 characters, or 8 pairs of characters for ranges
- Vector Find Element Equal: searching a table for "Barak" returns a pointer to the match

[Figure: a table of names (e.g. "Barak Hussein Obama II") with associated numbers being scanned for a matching entry.]
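The search behavior above can be modeled in plain code. This is an illustrative sketch, not the actual Vector Find Element Equal semantics; `find_element_equal` and `strchr16` are hypothetical names showing the "16 characters per cycle, index or no-hit" pattern a software loop builds around such an instruction.

```python
# Illustrative model (not actual z13 semantics): a Vector Find Element
# Equal-style search over one 16-byte chunk. Returns the byte index of
# the first match, or 16 if no lane matched -- the "no hit" result
# tells the loop to fetch the next chunk.

def find_element_equal(chunk: bytes, needle: int) -> int:
    assert len(chunk) == 16
    for i, ch in enumerate(chunk):
        if ch == needle:
            return i
    return 16

def strchr16(data: bytes, needle: int) -> int:
    """Scan a byte string 16 bytes at a time, the way the hardware
    examines 16 characters per cycle. Returns the index, or -1."""
    for base in range(0, len(data), 16):
        chunk = data[base:base + 16].ljust(16, b"\x00")
        hit = find_element_equal(chunk, needle)
        if hit < 16 and base + hit < len(data):
            return base + hit
    return -1
```

The hardware does the inner 16-way comparison in one operation; the outer loop is unchanged, which is why the speedup applies to both C-style and length-based strings.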
Vector String Range Compare

Use it to find IsNotAlphaNumericOrHyphen in one instruction:
(GE '0' AND LE '9') OR (GE 'a' AND LE 'z') OR (GE 'A' AND LE 'Z') OR (EQ '-')
Returns a pointer to the first non-matching character.


MATRIX MULTIPLICATION


Matrix Multiply - DGEMM

A product matrix element p(i,j) is equal to the sum of row i x column j. To create independent operations we do the opposite: we multiply a column by a row and form independent product elements. 30 FPRs are allocated to accumulated product elements; an iteration consists of accumulating one new product into each of a set of independent product elements.


DGEMM 4x4

- Load 8 FPRs (4 column elements into registers 0-3, 4 row elements into registers 4-7), then issue 16 FMAs
- Accumulator mapping: 4x0=>8, 4x1=>9, 4x2=>10, 4x3=>11; 5x0=>12, 5x1=>13, 5x2=>14, 5x3=>15; 6x0=>16, 6x1=>17, 6x2=>18, 6x3=>19; 7x0=>20, 7x1=>21, 7x2=>22, 7x3=>23
- With 2 BFU pipes, dependent FMAs can be separated by 8 cycles
- Needs 24 architected FPRs (use the vector-scalar registers)
- # regs = NxM + N + M; a 4x5 block would fit in 32 regs, but it is not a power of 2
- 4 rows x 4 columns form 16 entries at a time
- Other algorithms could separate the adds so as not to delay the multiplies


DGEMM 4x4 Vector-Scalar: No Stall Cycles

[Timing diagram: loads followed by WFMA operations alternating across BFU0 and BFU1; each WFMA takes 8 cycles, and the 16 FMAs per iteration are independent, so both pipes stay full.]


DGEMM Conclusions

- The 2x4 algorithm is about optimal for 16 registers, and only 1 BFU is needed
- To use the bandwidth of 2 BFUs with an 8-cycle latency, a 4x4 block is needed; this approximately doubles the throughput (16 entries are formed in the time it took to form 8)
- Using a vectorized 4x8 block causes no slowdown to code now, and in the future may allow up to 2x performance
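The column-times-row (rank-1 update) formulation above can be sketched directly. This is an illustrative sketch of the blocking structure, not the actual z13 kernel: each step of the inner loop does 8 loads (4 row elements, 4 column elements) and 16 independent FMAs into a 4x4 block of accumulators.

```python
# Illustrative sketch (assumed structure, not the actual z13 kernel):
# DGEMM by rank-1 updates. Per k-step: 8 loads, then 16 independent
# FMAs accumulating into a 4x4 block of "registers".

def dgemm_4x4_block(A, B, n):
    """Multiply two n x n matrices (lists of lists), n divisible by 4."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, 4):
        for j0 in range(0, n, 4):
            acc = [[0.0] * 4 for _ in range(4)]   # 16 accumulator registers
            for k in range(n):
                col = [A[i0 + i][k] for i in range(4)]   # 4 loads
                row = [B[k][j0 + j] for j in range(4)]   # 4 loads
                for i in range(4):                        # 16 independent FMAs
                    for j in range(4):
                        acc[i][j] += col[i] * row[j]
            for i in range(4):
                for j in range(4):
                    C[i0 + i][j0 + j] = acc[i][j]
    return C
```

Because the 16 FMAs in one k-step touch 16 different accumulators, none depends on another, and a dependent pair to the same accumulator is 16 FMAs apart – enough to cover two 8-cycle BFU pipes, as the conclusions above state.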
[Diagram: VFU dataflow – two mirrored sides, each with a 64 x 127-bit 6-write/4-read register file protected by ECC + parity, DFX and DFU (ECC), an SX unit (string & multiply), a PX unit (permute & simple), a BFU with FP divider, forwarding paths, pipe control, and CPM.]


Backup / Old Slides


Range Compare for isAlphaNumeric() in One Instruction

The control vector holds pairs of (bound, comparison) controls: 'a' GE / 'z' LE, 'A' GE / 'Z' LE, '0' GE / '9' LE. For each input character (e.g. "Ab c 0 5 6 7 d z Z 0"), the two compares within a pair are ANDed and the pairs are ORed, yielding one match flag per element.


DGEMM 2x4

This is the actual sequence used: load registers 0, 2, 9, E, C, B, then 8 FMAs:
A = A + 2xB    8 = 8 + Bx0    7 = 7 + 2x9    5 = 5 + 2xE
3 = 3 + 2xC    1 = 1 + 0x9    6 = 6 + Ex0    4 = 4 + Cx0
6 loads followed by 8 FMAs form 8 entries at a time; each iteration moves along the row and column.


2x4 Scroll Pipes

- First iteration: the 6 loads come back in different cycles, so the FMAs start up in a scattered order
- Further iterations: loads are done ahead of time; the critical path is FMA to FMA for accumulation

The FMAs are dependent due to accumulation; loads are done early, and each FMA depends on the prior FMA to the same accumulator.

Scroll pipeline legend: D - decode, m - mapping, I - issue, F - FPU execution, p - putaway, r - checkpointed.


DGEMM 2x4 Timing

The FPU pipeline latency (8 cycles per madbr) is critical, and the question is how many multiplies you can stick between dependent accumulations. With 8 FMAs between dependent ones the latency is partly covered; if you could stick 16 FMAs between dependent ones, you would get 2 FMAs per cycle.
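The paired-bound control scheme above can be modeled compactly. This is an illustrative model of the AND-within-pair / OR-across-pairs logic, not the instruction's actual encoding; `range_compare` and `first_non_alnum` are hypothetical names.

```python
# Illustrative model (not the instruction's encoding): a Vector String
# Range Compare-style test. Each control pair gives a lower and upper
# bound; bounds within a pair are ANDed, pairs are ORed -- matching the
# isAlphaNumeric() controls 'a'/'z', 'A'/'Z', '0'/'9'.

RANGES = [(ord("a"), ord("z")), (ord("A"), ord("Z")), (ord("0"), ord("9"))]

def range_compare(chunk: bytes):
    """Per-element match flags for up to 16 characters."""
    return [any(lo <= ch <= hi for lo, hi in RANGES) for ch in chunk]

def first_non_alnum(chunk: bytes) -> int:
    """Index of the first character outside every range, or 16 if all
    match -- the 'returns a pointer' behavior of the slide's example."""
    for i, ok in enumerate(range_compare(chunk)):
        if not ok:
            return i
    return 16
```

One instruction evaluates all pairs against all 16 characters at once, so a tokenizer-style "find first delimiter" scan costs one operation per 16-byte chunk.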