Slides - ARITH 22 (IEEE Symposium on Computer Arithmetic)

The IBM z13 SIMD Accelerators for Integer, String, and Floating-Point
Eric Schwarz
June 22, 2015
© 2015 IBM Corporation
z Systems - Processor Roadmap

[Slide: roadmap chart with a CP-chip floorplan (cores 0-3, L2 caches, L3_0/L3_1 controllers, L3B, MCU, coprocessors (CoP), GX and memory/IO interfaces).]

• z10 (2/2008) – Workload Consolidation and Integration Engine for CPU Intensive Workloads: Decimal FP, Infiniband, 64-CP image, large pages, shared memory
• z196 (9/2010) – Top Tier Single Thread Performance, System Capacity: accelerator integration, out-of-order execution, water cooling, PCIe I/O fabric, RAIM, enhanced energy management
• zEC12 (8/2012) – Leadership Single Thread, Enhanced Throughput: improved out-of-order, transactional memory, dynamic optimization, 2 GB page support, step function in system capacity
• z13 (1/2015) – Leadership System Capacity and Performance: modularity & scalability, dynamic SMT (supports two instruction threads), SIMD, PCIe-attached accelerators, business analytics optimized
z13 Continues the CMOS Mainframe Heritage Begun in 1994
[Slide: bar chart of core frequency and relative single-engine capacity* by generation.]

• 2000 – z900: 770 MHz, 180 nm SOI, 16 cores**, full 64-bit z/Architecture
• 2003 – z990: 1.2 GHz, 130 nm SOI, 32 cores**, superscalar, modular SMP
• 2005 – z9 EC: 1.7 GHz, 90 nm SOI, 54 cores**, system-level scaling
• 2008 – z10 EC: 4.4 GHz (+159% GHz), 902*, 65 nm SOI, 64 cores**, high-frequency core, 3-level cache
• 2010 – z196: 5.2 GHz (+18% GHz, +33% capacity), 1202*, 45 nm SOI, 80 cores**, OOO core, eDRAM cache, RAIM memory, zBX integration
• 2012 – zEC12: 5.5 GHz (+6% GHz, +26% capacity), 1514*, 32 nm SOI, 101 cores**, OOO and eDRAM cache improvements, PCIe Flash, arch extensions for scaling
• 2015 – z13: 5.0 GHz (-9% GHz, +12% capacity), 1695*, 22 nm SOI, 141 cores**, SMT & SIMD, up to 10 TB of memory

* MIPS tables are NOT adequate for making comparisons of z Systems processors. Additional capacity planning required.
** Number of PU cores for customer use
z13 Core Changes
• 1Q 2015 GA
• 8 cores per die
• 24 CP chips in system
• 192 cores x 2 threads => 384 threads
• Grows LSPR 8-way performance by 11%, and up to 1.4x with SMT2
• Single thread -> 2-way SMT
• Double the instruction fetch, decode, and dispatch bandwidth
• Frequency reduced (5.2 GHz z196 -> 5.5 GHz zEC12 -> 5.0 GHz z13)
• Double the number of execution units
– 2 FXU writer pipes, 2 FXU condition-code pipes
– 2 BFU, 2 DFU (quad precision is faster, divide is faster)
– 2 SIMD units
– 2 DFX units – accelerate SS-format decimal
z13 Instruction / Execution Dataflow

[Slide: pipeline diagram. Instructions flow from the I-cache through branch logic (SBBB) into instruction decode / crack / dispatch / map, 3 per side, then into the branch queue and two issue-queue sides (side 0 and side 1). Each side has a GR file, an LSU pipe, FXU pipes a and b, and a VFU with a 128-bit Vector/FPR register file feeding a 128-bit string/integer SIMD unit, a BFU, and a DFU; loads return from the D-cache.]

Annotations: additional instruction flow for higher core throughput; additional execution units for higher core throughput; new architected registers / execution units to accelerate business analytics workloads.

*FXa pipes execute register writers and support back-to-back execution to themselves; FXb pipes execute non-register-writers and non-relative branches (needs 3-way AGEN).
Performance
• Technology is not getting any faster
• System z is already the fastest, at ~5 GHz for the past couple of generations
• Parallelism is the future
– Simultaneous Multi-Threading – SMT
– Single Instruction Multiple Data – SIMD
• Huge performance gains are possible through parallelism
• We have to be smarter and figure out the best ways to exploit parallelism
SIMD – Single Instruction Multiple Data
[Slide: analogy picture – the old road versus the new superhighway.]
SIMD – Single Instruction Multiple Data
OLD: one 64-bit operation per instruction.
NEW: one operation on up to 128 bits per instruction – for example, sixteen 8-bit additions at once.
Different Size Data Types – All 128 Bits Wide
• 16 x 8-bit, 8 x 16-bit, 4 x 32-bit, 2 x 64-bit, or 1 x 128-bit elements, versus a single 64-bit operation previously
[Slide: byte, halfword, word, doubleword, and quadword layouts of the 128-bit register.]
Datatypes
[Slide: the example value -561 rendered in the supported data formats.]
Data Types Supported
• Integers – unsigned and two's complement
• Floating-point – System z supports more formats than any other platform
– Widths of 32 (Single Precision – SP), 64 (Double Precision – DP), and 128 bits (Quad Precision – QP)
– Radices of 2 (Binary Floating-Point – BFP), 10 (Decimal – DFP), and 16 (Hexadecimal – HFP)
• Character strings
– Characters of 8, 16, and 32 bits
– Null-terminated (C) and length-based (Java)
z13 SIMD accelerates:
• Integer
• Binary Floating-Point Double Precision – BFP DP
– Also, indirectly, quad precision
• Strings
SIMD Integer
 z13 has 2 new SIMD execution pipelines each 128 bits wide
 16 x 8 bit / 8 x 16 bit / 4 x 32 bit / 2 x 64 bit / 1 x 128 bit
 Add, Subtract, Compare
 Logical operations
 Shifts
 Min / Max / Absolute Value
 Select
 CRC / Checksum
SIMD Integer – Types
• Byte, halfword, word, doubleword, or quadword elements in a 128-bit register: 16 x B, 8 x HW, 4 x W, 2 x DW, 1 x QW
zEC12 (grey) Integer vs. z13 (grey + purple)
[Slide: zEC12 provides 16 x 64-bit GPRs feeding the FX0 and FX1 pipes; z13 adds a 32 x 128-bit vector register (VR) file and the SIMD 0 / SIMD 1 pipes alongside the FX0 A/B and FX1 A/B pipes.]
Increased from 2 x 64-bit pipelines to 8 x 64-bit pipelines, with even more parallelism on smaller integer data types.
Integer Dataflows from zEC12 to z13
Integer dataflow roads are 4 times as wide, supporting up to 18 times the number of elements.
[Slide: road analogy – two old 64-bit one-lane roads become two new 64-bit one-lane roads plus two new 128-bit SIMD highways.]
Integer – Error Coding Acceleration
• Includes 64-bit Cyclic Redundancy Check (CRC) assists that compute A x B + C x D + E on 64-bit operands in carryless form
• Also includes a checksum accelerator that adds 5 x 32-bit operands modulo 2^32, accumulating a new 128-bit operand into a running sum every cycle
More Execution Pipelines Need More Data
• Quadrupled the FPRs to create the vector register file
• VRs contain floats, integers, and strings
• Think of it as a parking lot for rental cars, or a truck depot
[Slide: the 16 x 64-bit FPRs occupy bits 0-63 of vector registers 0-15; the full file is 32 registers x 128 bits, shown next to the GPRs.]
Speeding Up Hash Functions / Integer Divide
• z196
– Multiplicative integer divide algorithm
– Inside the BFU, 1x, blocking
• zEC12
– SRT based, 1st generation (k=2)
– Speedup over z196: 1.4 – 3.6x
• z13
– SRT based, 2nd generation (k=3)
– Stand-alone engines in the FXU, 2x
– Non-blocking
– Speedup over zEC12: 1.7x; throughput: 3.4x

[Chart: latency in cycles (0-90) of a 64-bit integer divide versus the number of non-zero quotient bits (1-64), for z196, zEC12, and z13 (labeled zNext); latency grows with the number of quotient bits.]

Examples: 234876 ÷ 3 = 78292 (a wide quotient), 234876 ÷ 7893 = 29 (a narrow quotient).
z13 Floating-Point Throughput Is Twice as Wide
• zEC12 has 1 BFU pipeline; z13 has 2 BFU pipelines
• The SIMD / vector architecture makes it easy to clump data to fully utilize 2 BFUs with 1 thread
• Note: a vector operation is double-pumped through the same BFU pipeline
Improved Binary Floating-Point (DP, SP) Latencies

Latency in cycles:

                              z196 / zEC12    z13
Add, sub, mul, FMA, convert        8            8
Compare                            8            3
Divide 32b                        33           17
Divide 64b                        40           27
SQRT 32b                          39           21
SQRT 64b                          53           36
z13 allows multiple floating-point units to execute at the same time and does not stall on long operations like divide and square root.

[Slide: zEC12 runs all float operations through one pipeline. z13 has two floating-point pipelines, each split into independent engines: Integer & String, Divide/Sqrt, BFP/HFP FMA, DFP and quad, and fixed-point BCD add/sub.]
z13 does not stall on divide: each floating-point pipeline now consists of 5 sub-pipelines.
• zEC12 has 1 pipeline, so run-on instructions block the pipeline
• z13 has non-blocking pipelines; an instruction blocks only its specific resource / sub-pipeline, such as the divider
Floating-Point Quad Precision Is Much Faster
• All other platforms use software; z is the only platform to use hardware
• Moved quad precision to the wide pipeline
• Now over 3x faster, with 2x the bandwidth, for a total of 6x the performance of zEC12
• Also non-blocking
[Slide: zEC12 needs multiple passes; z13 executes quad precision on side A or side B in a single pass.]
BFP Quad Precision Latency in Cycles

            zEC12 (1 engine)    z13 (2 engines, each 3x faster)
add, sub          35                  11
multiply        55 – 97               23
divide           ~165                 49
sqrt             ~170                 66
String Acceleration
• Character types supported are byte, halfword, and word
• Loads and stores for both
– known-length strings (Java strings), and
– null-terminated strings (C strings)
• String search acceleration
– Find Equal
– Find Not Equal
– Range Compare
In the past, loading or storing a string could result in exceptions

[Slide: cartoon – a normal load of the string "THE END\0" runs past its end into a private page (SS#s, passwords) and takes a page fault, setting off an alarm.]

NEW INSTRUCTIONS:
• Load with Length – for Java-style length-based strings
• Load to Block Boundary – for C-style null-terminated strings
These loads are exact and will not set off an alarm.
String Acceleration with Powerful New Instructions
• 16 characters of the string can be examined every cycle
• The search text can be 16 characters, or 8 pairs of characters for ranges

Vector Find Element Equal – search for "Barak"; returns a pointer
[Slide: example lists being scanned – names (John Doe, Barak Hussein Obama II, Mary H Lamb, Humpty Dumpty, ...), SSNs, and phone numbers.]

Vector String Range Compare – use it to find IsNotAlphaNumericOrHyphen; returns a pointer
(GE '0' LE '9') or (GE 'a' LE 'z') or (GE 'A' LE 'Z') or (EQ '-' EQ '-')
MATRIX MULTIPLICATION
Matrix Multiply – DGEMM
• A product matrix element p(i,j) equals the sum of products of row i and column j
• To create independent operations we do the opposite: multiply a column by a row, forming independent product elements
• FPRs are allocated to the accumulated product elements
• An iteration consists of accumulating one new product into a set of independent product elements
[Slide: diagram of row i x column j forming element (i,j), versus a column x row update.]
DGEMM 4x4
• Load 8 FPRs (registers 0–7) per iteration
• 16 FMAs per iteration (4 rows x 4 columns form 16 entries at a time)
• With 2 BFU pipes, dependent FMAs could be separated by 8 cycles
• Needs 24 architected FPRs (use vector-scalar registers)
– Accumulators: 4x0=>8, 4x1=>9, 4x2=>10, 4x3=>11; 5x0=>12, 5x1=>13, 5x2=>14, 5x3=>15; 6x0=>16, 6x1=>17, 6x2=>18, 6x3=>19; 7x0=>20, 7x1=>21, 7x2=>22, 7x3=>23
• # regs = NxM + N + M; a 4x5 block would fit in 32 regs, but is not a power of 2
• Other algorithms could separate the adds so they do not delay the multiplies
DGEMM 4x4 Vector-Scalar
[Slide: timing diagram – loads issue ahead of time, then 16 independent WFMA operations alternate across BFU0 and BFU1, each with 8-cycle latency; with 16 independent FMAs in flight there are no stall cycles.]
DGEMM Conclusions
• The 2x4 algorithm is about optimal for 16 registers, and only 1 BFU is needed
• To use the bandwidth of 2 BFUs with an 8-cycle latency, 4x4 is needed; this approximately doubles the throughput
– Can form 16 entries in the same time it takes to form 8
• Use of the vectorized 4x8 causes no slowdown to code now, and in the future may allow up to 2x performance
[Slide: VFU floorplan. Two mirrored halves, each containing: register file arrays (64 x 127, 6w4r) with ECC + parity, read/write ports, a DFX engine with ECC, a DFU, a BFU with an FPdiv engine, an SX engine (string & multiply), a PX engine (permute & simple), pipe control, forwarding (FWD), and CPM.]
Backup / Old slides
Range Compare – isAlphaNumeric() in one instruction
[Slide: an input string ("Ab c 0567 dzZ ...") is compared element-wise against a control vector of range pairs: (0: 'a' GE, 1: 'z' LE), (2: 'A' GE, 3: 'Z' LE), (4: '0' GE, 5: '9' LE). The per-element true/false results of the two bound tests within each pair are ANDed, and the pairs are then ORed together to produce the final mask.]
DGEMM 2x4
This is the actual sequence used:
• Load registers 0, 2, 9, E, C, B
• 8 FMAs into the accumulators:
– A = A + 2 x B
– 8 = 8 + B x 0
– 7 = 7 + 2 x 9
– 5 = 5 + 2 x E
– 3 = 3 + 2 x C
– 1 = 1 + 0 x 9
– 6 = 6 + E x 0
– 4 = 4 + C x 0
• 6 loads followed by 8 FMAs; forms 8 entries at a time, and each iteration moves on to the next column/row
[Slide: 2 x 4 block diagram with the hex register numbers above.]
2x4 Scroll Pipes
• First iteration
– The 6 loads come back in different cycles, so the FMAs start up irregularly
• Further iterations
– Loads are done ahead of time
– The critical path is FMA to FMA for the accumulation
FMAs Dependent Due to Accumulation
[Slide: scroll-pipeline timing diagram – loads are done early; each FMA depends on the prior FMA to the same accumulator.]
Pipeline stage key: D = decode, m = mapping, I = issue, F = FPU execution, p = putaway, r = checkpointed.
DGEMM 2x4
[Slide: timing – loads, then madbr (multiply-add) operations chained with 8-cycle latency through each accumulator.]
• The FPU pipeline latency is critical: the question is how many independent multiplies you can fit between dependent accumulations
• The loop has 8 FMAs; if you could fit 16 FMAs between dependent ones, you would get 2 FMAs per cycle