5.5 GFLOPS Vector Units for “Emotion Synthesis”

System ULSI Engineering Laboratory, TOSHIBA Corp.
A.Kunimatsu, N.Ide, T.Sato, Y.Endo,
H.Murakami, T.Kamei, M.Hirano
Sony Computer Entertainment Inc.
M.Oka, A.Ohba, T.Yutaka, T.Okada,
M.Suzuoki
Overview
• Strategy of VU architecture
• VU features and instruction sets
• Examples
• Performance
Strategy of VU Architecture
• Targets
– Optimized for 3D calculations
– Easy to develop software pipelined programs
– Simple hardware
• Solutions
– VLIW processor
– 4 parallel fMACs
– All operations have same pipeline depth
(Except DIV/SQRT/RSQRT operations)
And one more essential element …
"Emotion Synthesis"
• "Emotion Synthesis"
– more realistic, more detailed and more natural motions
Need Huge Calculation Power
• We prepared two different Vector Units (VUs)
– VU0 : for flexible calculation with CPU control
– VU1 : for well-defined 3D calculations
“Emotion Engine” Block Diagram
[Figure: block diagram of the Emotion Engine. Blocks attached to the 128-bit System Bus:]
• CPU : 128bit Processor Core with COP1 (FPU), COP2, I$, D$ and SPRAM
• VPU0 : VU0 with Inst. MEM 4KB, Data MEM 4KB and VIF0
• VPU1 : VU1 with Inst. MEM 16KB, Data MEM 16KB, VIF1 and EFU
• GIF : to Rendering Engine (GS LSI)
• IPU
• 10ch DMAC
• Memory Interface : to External Memory
• I/O Interface : to Peripherals
VU
• VLIW mode
– 64-bit VLIW instruction formats
– 5 function units are available simultaneously
4 FMAC + (FDIV, Branch, load/store, or iALU)
• Co-processor mode
– MIPS COP2 instruction : 32-bit opcode
– Two kinds of COP2 instructions :
• 4 parallel Floating operations
• call micro-subroutine of VLIW mode
VU0 and VU1
                      VU0                          VU1
Main purpose          Flexible calculation with    Well-defined 3D
                      CPU control                  calculations
VLIW mode             Available                    Available
Co-processor mode     Available                    N/A
VPU                   With …                       With …
                      Inst. memory : 4KB           Inst. memory : 16KB
                      Data memory : 4KB            Data memory : 16KB
                      VIF (system bus I/F)         VIF (system bus I/F)
                                                   GIF (Graphics I/F)
                                                   EFU (VU option)
VPU1 Block Diagram
[Figure: block diagram of VPU1, showing the Vector Unit (VU) with its memories and interfaces.]
• Inst. Mem : 16KByte, supplying a 64-bit instruction split into a 32-bit Upper Instruction and a 32-bit Lower Instruction
• Data Mem : 16KByte
• Floating registers : 128bit x 32
• Integer registers : 16bit x 16
• Upper Execution Unit : fMACx, fMACy, fMACz, fMACw
• Lower Execution Unit : fDIV, LSU, iALU, RANDU/etc
• EFU : fMAC, fDIV
• VIF : to System Bus
• GIF : to Rendering Engine (GS LSI), reached through Path1, Path2 and Path3
Instruction opcode to support 2 modes
VLIW mode : 64-bit VLIW instruction opcode
Upper 32bit (bits 63-32), example MADDbc :
  Bits     Width     Field
  63-59    1 bit x 5 I E M D T
  58-57    1 bit x 2 0 0
  56-53    4 bit     dest
  52-48    5 bit     ft reg
  47-43    5 bit     fs reg
  42-38    5 bit     fd reg
  37-34    4 bit     Opcode Body (MADDbc : 0010)
  33-32    2 bit     bc
Lower 32bit (bits 31-00), example LQI :
  Bits     Width     Field
  31-25    7 bit     Lower OP. (1000000)
  24-21    4 bit     dest
  20-16    5 bit     ft reg
  15-11    5 bit     is reg
  10-00    11 bit    Opcode Body (LQI : 01101 1111 00)
A VLIW instruction includes two COP2 instructions.
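The Upper-word layout can be exercised with a small bit-field sketch. This is only an illustration in Python: the function names `decode_upper` and `encode_maddbc` are hypothetical, and the bit positions are an assumption derived by packing the slide's field widths from the most significant bit downward.

```python
def decode_upper(word: int) -> dict:
    """Split a 32-bit Upper instruction word into its fields.

    Assumed layout (MSB first): I E M D T (1 bit each), two zero bits,
    dest (4), ft (5), fs (5), fd (5), opcode (4), bc (2).
    """
    if not 0 <= word < (1 << 32):
        raise ValueError("expected a 32-bit word")
    flags = {name: (word >> bit) & 1
             for name, bit in zip("IEMDT", range(31, 26, -1))}
    return {
        "flags": flags,
        "dest": (word >> 21) & 0xF,
        "ft": (word >> 16) & 0x1F,
        "fs": (word >> 11) & 0x1F,
        "fd": (word >> 6) & 0x1F,
        "opcode": (word >> 2) & 0xF,
        "bc": word & 0x3,
    }


def encode_maddbc(dest: int, ft: int, fs: int, fd: int, bc: int) -> int:
    """Build an Upper word for MADDbc (opcode 0010) with all flags clear."""
    return (dest << 21) | (ft << 16) | (fs << 11) | (fd << 6) | (0b0010 << 2) | bc


fields = decode_upper(encode_maddbc(dest=0b1111, ft=3, fs=2, fd=1, bc=0))
print(fields["opcode"], fields["dest"], fields["ft"], fields["fs"], fields["fd"])
```

Round-tripping through encode and decode is a quick way to confirm that the field widths sum to 32 bits.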
Co-processor mode : 32-bit MIPS COP2 instruction opcode
Example VMADDbc :
  Bits     Width     Field
  31-26    6 bit     COP2 (010010)
  25       1 bit     co (1)
  24-21    4 bit     dest
  20-16    5 bit     ft reg
  15-11    5 bit     fs reg
  10-06    5 bit     fd reg
  05-02    4 bit     Opcode Body (VMADDbc : 0010)
  01-00    2 bit     bc
Example VLQI :
  Bits     Width     Field
  31-26    6 bit     COP2 (010010)
  25       1 bit     co (1)
  24-21    4 bit     dest
  20-16    5 bit     ft reg
  15-11    5 bit     is reg
  10-00    11 bit    Opcode Body (VLQI : 01101 1111 00)
64-bit VLIW instruction
• Upper 32-bit (bits 63-32) : FMAC x 4
  I E M D T (1 bit each) | 0 0 | dest (4 bit) | ft reg (5 bit) | fs reg (5 bit) | fd reg (5 bit) | opcode (4 bit) | bc (2 bit)
  Example : MADDbc, opcode 0010
• Lower 32-bit (bits 31-00) : FDIV, load/store, iALU
  Lower OP. (7 bit) | dest (4 bit) | ft reg (5 bit) | is reg (5 bit) | opcode body (11 bit)
  Example : LQI, Lower OP. 1000000, opcode body 01101 1111 00
Upper Instructions :
– 4 parallel floating ADD/SUB
– 4 parallel floating MUL
– 4 parallel floating MADD/MSUB
– 4 parallel floating MAX/MIN
– Outer product calculation
– Clipping detection
Lower Instructions :
– Floating DIV/SQRT/RSQRT
– Load/Store 128-bit data (4 FP data)
– Jump/Branch
– Elementary Function Unit
– Random Unit
Registers
128-bit floating register x 32 (VF00 to VF31), each made of four 32bit fields :
  Bits     127-96    95-64     63-32     31-0
  VF00     1.00      0.00      0.00      0.00
  VF01     VF01w     VF01z     VF01y     VF01x
  VF02     VF02w     VF02z     VF02y     VF02x
  VF03     VF03w     VF03z     VF03y     VF03x
  :
  VF31     VF31w     VF31z     VF31y     VF31x
16-bit integer register x 16 (VI00 to VI15), bits 15-0 :
  VI00     0
  VI01
  VI02
  :
  VI15
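4 Parallel FMADD
[Figure: the x, y, z and w fields of Source register 1 and Source register 2 feed fMACx, fMACy, fMACz and fMACw respectively. Each fMAC multiplies its two inputs and adds the product to its accumulator field (ACCx, ACCy, ACCz, ACCw).]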
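4 Parallel FMADD with Broadcast
[Figure: as for 4 Parallel FMADD, except that a single field of one source register is broadcast to all four fMACs (fMACx, fMACy, fMACz, fMACw), while the other source register supplies its own x, y, z and w fields. Each product is added to ACCx, ACCy, ACCz and ACCw.]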
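The two FMADD variants can be modeled lane by lane. The following is a minimal Python sketch, not the hardware behavior in detail: vectors are plain lists ordered [x, y, z, w], the names `fmadd` and `fmadd_bc` are hypothetical, and the choice to broadcast a field of the second source is an assumption made for illustration.

```python
FIELDS = "xyzw"  # lane order used for all vectors in this sketch


def fmadd(acc, src1, src2):
    """Four parallel multiply-adds: each lane adds src1*src2 to its ACC lane."""
    return [a + s1 * s2 for a, s1, s2 in zip(acc, src1, src2)]


def fmadd_bc(acc, src1, src2, field):
    """Same as fmadd, but one field of src2 is broadcast to all four lanes."""
    b = src2[FIELDS.index(field)]
    return [a + s1 * b for a, s1 in zip(acc, src1)]


print(fmadd([1, 1, 1, 1], [1, 2, 3, 4], [5, 6, 7, 8]))
print(fmadd_bc([0, 0, 0, 0], [1, 2, 3, 4], [5, 6, 7, 8], "y"))
```

The broadcast form is what lets a matrix column be scaled by a single vertex coordinate in one instruction.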
VLIW mode Pipeline Stages
• 3 cycle Latency : Basic Operations
F D X1 X2 X3 W : ADD, SUB, MUL, MADD, MSUB, load/store, iALU operations, etc.
• 7 cycle Latency : DIV / SQRT
F D 1 2 3 4 5 6 7 DIV
F D 1 2 3 4 5 6 7 SQRT
• 13 cycle Latency : RSQRT
F D 1 2 3 4 5 6 7 8 9 10 11 12 13 RSQRT
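Geometry Transformation
\[
\begin{pmatrix}
m_{00} & m_{01} & m_{02} & m_{03} \\
m_{10} & m_{11} & m_{12} & m_{13} \\
m_{20} & m_{21} & m_{22} & m_{23} \\
m_{30} & m_{31} & m_{32} & m_{33}
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ w \end{pmatrix}
=
\begin{pmatrix}
m_{00}x + m_{01}y + m_{02}z + m_{03}w \\
m_{10}x + m_{11}y + m_{12}z + m_{13}w \\
m_{20}x + m_{21}y + m_{22}z + m_{23}w \\
m_{30}x + m_{31}y + m_{32}z + m_{33}w
\end{pmatrix}
\]
Rows are assigned to FMACx, FMACy, FMACz and FMACw, from the first row to the fourth.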
Geometry Transformation Pipeline
Four Upper instructions, each passing through F D X1 X2 X3 W and staggered in the pipeline, accumulate the result column by column :
  Instruction                  FMACx      FMACy      FMACz      FMACw
  4 x MULA  (x broadcast)      m00 x      m10 x      m20 x      m30 x
  4 x MADDA (y broadcast)    + m01 y    + m11 y    + m21 y    + m31 y
  4 x MADDA (z broadcast)    + m02 z    + m12 z    + m22 z    + m32 z
  4 x MADD  (w broadcast)    + m03 w    + m13 w    + m23 w    + m33 w
F : Fetch, D : Decode, X1 - 3 : eXecute, W : Write back
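The four-instruction sequence can be sketched as column-wise accumulation. This Python illustration assumes the matrix is given as a list of rows and the vertex as [x, y, z, w]; the function name `geometry_transform` is hypothetical, and the comments map each step to the instruction named on the slide.

```python
def geometry_transform(m, v):
    """Matrix-vector product computed as MULA, MADDA, MADDA, MADD.

    m is a 4x4 matrix as a list of rows; v is [x, y, z, w].
    Each list element of acc corresponds to one fMAC lane (x, y, z, w).
    """
    x, y, z, w = v
    cols = [[m[r][c] for r in range(4)] for c in range(4)]
    acc = [c * x for c in cols[0]]                      # 4 x MULA,  x broadcast
    acc = [a + c * y for a, c in zip(acc, cols[1])]     # 4 x MADDA, y broadcast
    acc = [a + c * z for a, c in zip(acc, cols[2])]     # 4 x MADDA, z broadcast
    return [a + c * w for a, c in zip(acc, cols[3])]    # 4 x MADD,  w broadcast


translate = [[1, 0, 0, 10],
             [0, 1, 0, 20],
             [0, 0, 1, 30],
             [0, 0, 0, 1]]
print(geometry_transform(translate, [1, 2, 3, 1]))
```

Working by columns rather than rows means every instruction keeps all four fMACs busy and no horizontal sum across lanes is needed.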
Perspective Transformation
Perspective Transformation Software Pipelining
[Figure: pipeline chart with four overlapped loop iterations, LOOP 1 to LOOP 4.]
• Upper : Geometry Transformation (4 x MULA, 4 x MADDA, 4 x MADDA, 4 x MADD), then 4 x MUL and others, each with stages F D X1 X2 X3 W
• Lower : DIV (F D d1 d2 d3 d4 d5 d6 d7), plus another Lower inst.: load/store, loop control
• 7 Cycles : DIV repeat time
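The slide gives only the instruction schedule, not the arithmetic. Assuming the usual perspective divide (transform the vertex, take the reciprocal of w with DIV, then scale with MUL), one loop iteration might look like the Python sketch below. The function name `perspective_transform` and the exact use of each instruction are assumptions, not taken from the slides.

```python
def perspective_transform(m, v):
    """Transform v by the 4x4 matrix m (list of rows), then divide by w.

    Assumed mapping to the schedule: the first four steps are the
    geometry transformation, q is produced by the Lower DIV, and the
    final scaling corresponds to the 4 x MUL.
    """
    x, y, z, w = v
    cols = [[m[r][c] for r in range(4)] for c in range(4)]
    acc = [c * x for c in cols[0]]                      # MULA
    acc = [a + c * y for a, c in zip(acc, cols[1])]     # MADDA
    acc = [a + c * z for a, c in zip(acc, cols[2])]     # MADDA
    tx, ty, tz, tw = [a + c * w for a, c in zip(acc, cols[3])]  # MADD
    q = 1.0 / tw                                        # DIV (Lower)
    return [tx * q, ty * q, tz * q]                     # MUL by broadcast q


# Hypothetical projection matrix that copies z into w.
proj = [[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 1, 0]]
print(perspective_transform(proj, [2.0, 4.0, 2.0, 1.0]))
```

Because DIV runs in the Lower slot, its latency can overlap the Upper-slot work of neighboring iterations, which is the point of the software pipelining shown in the chart.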
Vector Normalization
• 25 cycles for normalization
• 13 cycles for repeat (software pipelining)
  Instruction    Stages                          Result
  MUL            F D X1 X2 X3 W                  $x^2$
  MADD           F D X1 X2 X3 W                  $x^2 + y^2$
  MADD           F D X1 X2 X3 W                  $x^2 + y^2 + z^2$
  RSQRT          F D 1 2 3 ... 13                $1/\sqrt{x^2 + y^2 + z^2}$
  3 x MUL        F D X1 X2 X3 W                  normalized vector
\[
\left( \frac{x}{\sqrt{x^2 + y^2 + z^2}},\ \frac{y}{\sqrt{x^2 + y^2 + z^2}},\ \frac{z}{\sqrt{x^2 + y^2 + z^2}} \right)
\]
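The normalization sequence can be followed step by step in a short Python sketch. The function name `normalize` is hypothetical, and `1.0 / math.sqrt(...)` merely stands in for the RSQRT instruction; precision and rounding of the real unit are not modeled.

```python
import math


def normalize(v):
    """Normalize [x, y, z] using the MUL, MADD, MADD, RSQRT, 3 x MUL order."""
    x, y, z = v
    s = x * x                   # MUL   : x^2
    s = s + y * y               # MADD  : x^2 + y^2
    s = s + z * z               # MADD  : x^2 + y^2 + z^2
    r = 1.0 / math.sqrt(s)      # RSQRT : 1 / sqrt(x^2 + y^2 + z^2)
    return [x * r, y * r, z * r]  # 3 x MUL


n = normalize([3.0, 0.0, 4.0])
print(n, math.sqrt(sum(c * c for c in n)))
```

The chain is strictly dependent, which is why a single normalization takes 25 cycles while overlapped iterations repeat every 13 cycles, the RSQRT latency.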
Co-processor Mode Pipeline stages
• Basic operations
I Q R A D W MIPS core pipeline
F D X1 X2 X3 W VU pipeline
10 stages
• call VLIW processor mode programs
I Q R A D W MIPS core pipeline
F D X1 X2 X3 W
VU VLIW pipelines
F D X1 X2 X3 W
F D X1 X2 X3 W
n steps
………….
F D X1 X2 X3 W
11 + n stages
Example : Jelly
• Dynamic simulation
• Spring model
Performance
[Figure: bar chart, Emotion Engine (300 MHz), vertical axis in M vectors/sec]
• Geometry Trans. : 150
• Perspective Trans. : 66
• Distance : 66
• 1/Distance : 46
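One possible reading of two of these figures, not stated on the slides, is steady-state throughput: clock rate divided by the number of cycles between successive vectors, multiplied by the number of VUs running the loop. The Python sketch below assumes both VUs run the same loop, that geometry transformation issues one vector every 4 cycles (its four Upper instructions), and that 1/Distance is bound by the 13-cycle repeat interval. The function name `vectors_per_sec` is hypothetical.

```python
def vectors_per_sec(clock_hz, repeat_cycles, units=1):
    """Steady-state throughput for a loop that starts one vector every
    repeat_cycles cycles on each of `units` identical vector units."""
    return clock_hz / repeat_cycles * units


CLOCK_HZ = 300e6  # Emotion Engine frequency from the slides

geometry = vectors_per_sec(CLOCK_HZ, 4, units=2)        # 4 Upper instructions
inv_distance = vectors_per_sec(CLOCK_HZ, 13, units=2)   # 13-cycle repeat

print(f"geometry   : {geometry / 1e6:.1f} M vectors/sec")
print(f"1/distance : {inv_distance / 1e6:.1f} M vectors/sec")
```

Under these assumptions the results land close to the 150 and 46 shown in the chart; the other two bars depend on loop details the slides do not give.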
EE Micrograph
Die size : 15.02mm x 15.04mm
Frequency : 300MHz
Transistors : 13.5M
Power : 18Watts
Design Rule : 0.25um
Gate Length : 0.18um
VU Micrograph
Acknowledgements
TOSHIBA Corp.
Y.Fujimoto, H.Goto, S.Hayakawa, F.Ishihara, T.Kawaai,
S.Kawakami, M.Saito, H.Tago, H.Takeda, T.Teruyama,
Y.Watanabe, I.Yamazaki
TOSHIBA MICROELECTRONICS Corp.
Y.Ogawa
TOSHIBA ENGINEERING Corp.
N.Iwatsuki, M.Otsuki, Y.Tsutsumi
TOSHIBA INFORMATION SYSTEMS(JAPAN) Corp.
M.Kuramochi, Y.Hikita
SONY COMPUTER ENTERTAINMENT Inc.
S.Aoki, K.Kutaragi, S.Shinohara, K.Umemura