5.5 GFLOPS Vector Units for “Emotion Synthesis” System ULSI Engineering Laboratory, TOSHIBA Corp. A.Kunimatsu, N.Ide, T.Sato, Y.Endo, H.Murakami, T.Kamei, M.Hirano Sony Computer Entertainment Inc. M.Oka, A.Ohba, T.Yutaka, T.Okada, M.Suzuoki Overview • • • • Strategy of VU architecture VU features and instruction sets Examples Performance Strategy of VU Architecture • Targets – Optimized for 3D calculations – Easy to develop software pipelined programs – Simple hardware • Solutions – VLIW processor – 4 parallel fMACs – All operations have same pipeline depth (Except DIV/SQRT/RSQRT operations) And one more essence ….. "Emotion Synthesis" • "Emotion Synthesis" – more realistic, more detailed and more natural motions Need Huge Calculation Power • We prepared two different of Vector Unit(VU)s – VU0 : for flexible calculation with CPU control – VU1 : for well-defined 3D calculations to Rendering Engine (GS LSI) “Emotion Engine” Block Diagram COP1 FPU COP2 128bit Processor Core 128 I$ D$ SPRAM VU1 VU0 Inst. MEM 4KB Data MEM 16KB Data MEM 4KB VIF0 CPU 10ch DMAC System Bus 128-bit Inst. MEM 16KB GIF VIF1 VPU0 IPU E F U VPU1 Memory Interface External Memory I/O Interface Peripherals VU • VLIW mode – 64-bit VLIW instruction formats – 5 function units are available simultaneously 4 FMAC + (FDIV, Branch, load/store, or iALU) • Co-processor mode – MIPS COP2 instruction : 32-bit opcode – Two kinds of COP2 instructions : • 4 parallel Floating operations • call micro-subroutine of VLIW mode VU0 and VU1 VU0 VU1 Main purpose Flexible Calculation with CPU control Well-defined 3D calculations VLIW mode Available Available Co-processor mode Available VPU With … Inst. memory : 4KB Data memory : 4KB VIF(system bus I/F) N/A With … Inst. Memory : 16KB Data Memory : 16KB VIF(system bus I/F) GIF(Graphics I/F) EFU(VU option) VPU1 Block Diagram 64 32 128 Path1 128 Upper Instruction Lower Instruction 32 16 16 128 Inst. Mem Data Mem 16KByte 16KByte GIF 128 integer registers 16bit x 16 fDIV EFU fMAC LSU iALU RANDU/etc fDIV fMACx fMACy fMACz Lower Execution Unit fMACw floating registers 128bit x 32 Upper Execution Unit 16 Path2 128 128 64 VIF System Bus 64 to Rendering Engine (GS LSI) VU Vector Unit : Path3 128 128 Instruction opcode to support 2 modes VLIW mode : 64-bit VLIW instruction opcode Upper 32bit Lower 32bit 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 1 1 1 1 1 1 1 4 bit I E M D T 5 bit - - dest - - - - - 0 0 ---- 5 bit 5 bit 4 bit Opcode Body ft r eg ----- fs r eg ----- fd r eg ----- 2b MADD bc 0010 - - 7 bit 4 bit Lo w er OP. 1000000 5 bit 5 bit 11 bit LQI Opcode Body 01101 1111 dest ---- ft r eg ----- is r eg ----- 00 A VLIW instruction includes two COP2 instructions. Co-processor mode : 32-bit MIPS COP2 instruction opcode 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 6 bit 1 COP2 010010 co 1 4 bit dest ---- 5 bit 5 bit 5 bit 4 bit 2b fs r eg fd r eg VMADD bc Opcode Body --------0010 - - ft r eg ----- 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 6 bit 1 COP2 010010 co 1 4 bit dest ---- 5 bit 5 bit 11 bit is r eg VLQI Opcode Body ----01101 1111 ft r eg ----- 00 64-bit VLIW instruction Upper 32-bit Lower 32-bit 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 1 1 1 1 1 I E M D T 1 1 4 bit - - dest - - - - - 0 0 ---- 5 bit 5 bit 5 bit ft r eg ----- fs r eg ----- fd r eg ----- FMAC x 4 4 bit 2b MADD bc 0010 - - Upper Instructions : 4 parallel floating ADD/SUB 4 parallel floating MUL 4 parallel floating MADD/MSUB 4 parallel floating MAX/MIN Outer product calculation Clipping detection 7 bit Lo w er OP. 1000000 4 bit 5 bit 5 bit 11 bit is r eg dest LQI ft r eg FDIV, load/store, iALU -----------01101 1111 00 Lower Instructions : Floating DIV/SQRT/RSQRT Load/Store 128-bit data(4 FP data) Jump/Branch Elementary Function Unit Random Unit Registers 128-bit floating register x 32 32bit 127 32bit 96 VF00 VF01 VF02 VF03 1.00 VF01w VF02w VF03w 95 32bit 64 63 0.00 VF01z VF02z VF03z 32bit 32 31 0 0.00 VF01y VF02y VF03y 0.00 VF01x VF02x VF03x VF31y VF31x : : VF31w VF31 VF31z 16-bit integer register x 16 15 VI00 VI01 VI02 VI15 0 0 : : 4 Parallel FMADD w z y w fMACw ACCw + z fMACz ACCz + Source register 1 x Source register 2 y x fMACy ACCy + fMACx ACCx + 4 Parallel FMADD with Broadcast w z w fMACw ACCw y z + fMACz ACCz Source register 1 x Source register 2 x y + fMACy ACCy + fMACx ACCx + VLIW mode Pipeline Stages • 3 cycle Latency : Basic Operations F D X1 X2 X3 W ADD, SUB, MUL, MADD, MSUB load/store, iALU operations, e.t.c. • 7 cycle Latency : DIV / SQRT F D 1 2 3 4 5 6 7 DIV F D 1 2 3 4 5 6 7 SQRT • 13 cycle Latency : RSQRT F D 1 2 3 4 5 6 7 8 9 10 11 12 13 RSQRT Geometry Transformation FMACx : m00 FMACy : m10 FMACz : m20 FMACw : m30 m01 m11 m21 m31 m02 m12 m22 m32 m03 x m13 y m23 z m33 w m00 x + m01 y + m02 z + m03w m10 x + m11 y + m12 z + m13w = m20 x + m21 y + m22 z + m23w m30 x + m31 y + m32 z + m33w Geometry Transformation Pipeline Geometry Transformation Pipeline FMACx FMACy FMACz FMACw 4 x MULA (m00 x x broadcast + 4 x MADDA (m01 y F D X1 X2 X3 W y broadcast + F : Fetch 4 x MADDA (m02 z D : Decode F D X1 X2 X3 W z broadcast X1 - 3 : eXecute + W : Write back F D X1 X2 X3 W 4 x MADD (m03w w broadcast Upper F D X1 X2 X3 W m10 x m20 x m 30 x ) + + + m11 y m21 y m 31 y ) + + + m12 z m22 z m 32 z ) + + + m13 w m23 w m 33 w) Perspective Transformation Perspective Transformation Software Pipelining LOOP 2 F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W 4 x MUL F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W LOOP 4 F D X1 X2 X3 W F D X1 X2 X3 W F D X1 X2 X3 W Upper LOOP 3 4 x MULA 4 x MADDA 4 x MADDA F D X1 X2 X3 W 4 x MADD F D X1 X2 X3 W F D X1 X2 X3 W Geometry Transformation LOOP 1 F D X1 X2 X3 W F D X1 X2 X3 W F D d1 d2 d3 d4 d5 d6 d7 Others DIV F D d1 d2 d3 d4 d5 d6 d7 Lower F D d1 d2 d3 d4 d5 d6 d7 F D d1 d2 d3 d4 d5 d6 d7 7 Cycles : DIV repeat time Another Lower inst.: load/store, loop control Vector Normalization 25 cycles for normalization 13 cycles for repeat (software pipelining) x2 x2 MADD x 2 x+2 y+2 y 2 F D X1 X2 X3 W MADD x 2 + y 2 + z 2 Ç Ç X1 X2 X3 W F D X1 X2 X3 W MUL F D 1 2 3 4 5 6 7 8 9 10111213 RSQRT F D X1 X2 X3 W 1 3 x MUL x +y +z 2 2 2 x x2 + y2 + z 2 , y x2 + y2 + z 2 , z √ x 2 + y 2 + z 2 √↵ Co-processor Mode Pipeline stages • Basic operations I Q R A D W MIPS core pipeline F D X1 X2 X3 W VU pipeline 10 stages • call VLIW processor mode programs I Q R A D W MIPS core pipeline F D X1 X2 X3 W VU VLIW pipelines F D X1 X2 X3 W F D X1 X2 X3 W n steps …………. F D X1 X2 X3 W 11 + n stages Example : Jelly • Dynamic simulation Spring model Performance 160 150 140 Emotion Engine (300 MHz) M vectors/sec 120 100 80 66 66 60 46 40 20 0 Geometry Trans. Perspective Trans. Distance 1/Distance EE Micrograph Die size : 15.02mm x 15.04mm Frequency : 300MHz Transistors : 13.5M Power : 18Watts Design Rule : 0.25um Gate Length : 0.18um VU Micrograph Acknowledgements TOSHIBA Corp. Y.Fujimoto, H.Goto, S.Hayakawa, F.Ishihara, T.Kawaai, S.Kawakami, M.Saito, H.Tago, H.Takeda, T.Teruyama, Y.Watanabe, I.Yamazaki TOSHIBA MICROELECTRONICS Corp. Y.Ogawa TOSHIBA ENGINEERING Corp. N.Iwatsuki, M.Otsuki, Y.Tsutsumi TOSHIBA INFORMATION SYSTEMS(JAPAN) Corp. M.Kuramochi, Y.Hikita SONY COMPUTER ENTERTAINMENT Inc. S.Aoki, K.Kutaragi, S.Shinohara, K.Umemura
© Copyright 2026 Paperzz