CS718 : Data Parallel Processors
27th April, 2006
Anshul Kumar, CSE IITD

Data Parallel Architectures
• SIMD Processors
  – Multiple processing elements driven by a single instruction stream
• Associative Processors
  – SIMD-like processors with associative memory
• Vector Processors
  – Uniprocessors with vector instructions
• Systolic Arrays
  – Application-specific VLSI structures

SIMD
[Figure: a control unit C sends one instruction stream (IS) to the processors P; each processor exchanges its own data stream (DS) with memory M.]
One of the earliest models of parallel computers

ILLIAC IV SIMD Model
[Figure: a control unit CU (with its own processor P and memory M) attached to I/O and a bus; processing elements PE1, PE2, ..., PEn, each a processor P with local memory M, joined by an interconnection network.]
Planned for 64 x 4 PEs; only 64 were built

Burroughs Scientific Processor (BSP) Model
[Figure: a control unit CU (with its own processor P and memory M) attached to I/O and a bus; processors P1, P2, ..., Pn reach shared memories M1, M2, ..., Mk through an interconnection network.]

SIMD algorithms: sum of vector elements
Inputs: a0 a1 a2 a3 a4 a5 a6 a7
step 1: a0+a1, a2+a3, a4+a5, a6+a7
step 2: a0+a1+a2+a3, a4+a5+a6+a7
step 3: a0+a1+a2+a3+a4+a5+a6+a7

S_i = a_i + a_{i+1}    i = 0,2,4,6
S_i = S_i + S_{i+2}    i = 0,4
S_i = S_i + S_{i+4}    i = 0
OR
S_i = a_i + a_{i+4}    i = 0,1,2,3
S_i = S_i + S_{i+2}    i = 0,1
S_i = S_i + S_{i+1}    i = 0

No. of processors vs time
Adding vector elements:
  – n processors: log n steps
  – n/log n processors: log n steps
Matrix multiplication:
  – n processors: n^2 steps
  – n^2 processors: n steps
  – n^3 processors: log n steps
  – n^3/log n processors: log n steps
Important factors: data distribution, network

Rise and fall of SIMDs
• Introduced in the 60's (e.g. Illiac, BSP)
• Problems:
  – not cost effective
  – serial fraction and Amdahl's law
  – I/O bottleneck
• Overshadowed by Vector Processors
• Resurrected in the 80's (MPP from Goodyear, Connection Machine from Thinking Machines Corp., MP-1 from MasPar)
• Did not survive because of high cost

Related ideas
• Coarse-grain SIMD with off-the-shelf processors (synchronized MIMD), e.g. CM5 of Thinking Machines
• This gave rise to SPMD (single program multiple data)
• MMX and SIMD instructions in Pentium

Vector Processors
[Figure: block diagram with I-unit and control, I-cache, D-cache, memory and memory control, address unit, vector registers (V-reg), GPRs, buses, two vector functional units (VFU) and a scalar functional unit (FU).]

Four Generations of CRAY systems (vector processors)
System   CPUs   Clock (MHz)   Flops/clock/CPU   Words moved/clk/CPU   Mflops   Gates/chip
CRAY-1      1            80                 2                     1       80            2
X-MP        4           105                 2                     3      840           16
Y-MP        8           166                 2                     3     2667         2500
C90        16           240                 4                     6    15360        10000

Cray History
• http://www.cray.com/company/history.html

CRAY C90
• 8 GB central memory shared by 16 CPUs
• 128 CPU-memory paths
• word = 64 bits + 16 ECC
• Dual vector pipes
• 128-element segments
Memory organization:
  – 8 sections
  – 8x8 subsections
  – 8x8x2 bank groups
  – 8x8x2x8 banks

Convex C4/XA
• CPU: 7.5 ns clock, 1620 MFLOPs
• Mem: 32 MB x 32 banks, 64-bit word, 50 ns access time
• 3 FP pipes, 2 results each
• Vector regs - FPU crossbar
• 1.1 GB/s per I/O port
[Figure: CPUs, I/O and system utilities connected to the memories through a 5x5 crossbar.]

Other examples
Fujitsu VP2000
• 1 - 2 CPUs
Fujitsu VP5000
• 7 - 222 CPUs
• 2 LS pipes
• 3 Func pipes
• 2 mask pipes
NEC SX-X
• 4 CPUs
• 4 x 2 pipes each

Systolic Arrays (H.T. Kung 1978)
Simplicity, Regularity, Concurrency, Communication
Example: Band matrix multiplication, C = A x B, with

    | A11 A12  0   0   0   0  |        | B11 B12  0   0   0   0  |
    | A21 A22 A23  0   0   0  |        | B21 B22 B23  0   0   0  |
A = | A31 A32 A33 A34  0   0  |    B = | B31 B32 B33 B34  0   0  |
    |  0  A42 A43 A44 A45  0  |        |  0  B42 B43 B44 B45  0  |
    |  0   0  A53 A54 A55 A56 |        |  0   0  B53 B54 B55 B56 |
    |  0   0   0  A64 A65 A66 |        |  0   0   0  B64 B65 B66 |

[Figure sequence, T = 0 to T = 6: snapshots of the systolic array as the diagonals of A and B are fed in one step at a time. Products such as A11 B11 appear in the cells from T = 4, and the first result elements, C11 and C12, appear in the frames for T = 5 and T = 6.]

WARP: Programmable Systolic Processor [Kung, CMU 1987]
In complete contrast to the original idea:
• not application specific
• not a single VLSI
• complex cell (pipelined FP adder, multiplier, FIFOs, RAM, crossbar)
• linear
• asynchronous

References
• D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Addison Wesley, 1997.
• K. Hwang, "Advanced Computer Architecture: Parallelism, Scalability, Programmability", McGraw Hill, 1993.
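
Worked example: sum of vector elements
The two index schedules given for the SIMD vector sum (pairing neighbours with strides 1, 2, 4, or folding with strides 4, 2, 1) can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the vector length is assumed to be a power of two, and the inner `for` loop stands in for one lock-step SIMD operation in which every listed index i is handled by its own processing element.

```python
def _check_power_of_two(n):
    # Assumption for this sketch: the slide's schedules use n = 8.
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("vector length must be a power of two")


def tree_sum_adjacent(a):
    """First schedule: S_i = S_i + S_{i+stride} with stride = 1, 2, 4, ..."""
    s = list(a)
    n = len(s)
    _check_power_of_two(n)
    stride, steps = 1, 0
    while stride < n:
        # One SIMD step: the active indices are i = 0, 2*stride, 4*stride, ...
        # Reads (i + stride) and writes (i) never overlap within a step.
        for i in range(0, n - stride, 2 * stride):
            s[i] = s[i] + s[i + stride]
        stride *= 2
        steps += 1
    return s[0], steps


def tree_sum_halving(a):
    """Second schedule: S_i = S_i + S_{i+stride} with stride = n/2, ..., 2, 1."""
    s = list(a)
    n = len(s)
    _check_power_of_two(n)
    stride, steps = n // 2, 0
    while stride >= 1:
        # One SIMD step: the active indices are i = 0 .. stride-1.
        for i in range(stride):
            s[i] = s[i] + s[i + stride]
        stride //= 2
        steps += 1
    return s[0], steps


if __name__ == "__main__":
    data = list(range(8))           # a0 .. a7
    print(tree_sum_adjacent(data))  # total and number of parallel steps
    print(tree_sum_halving(data))
```

Both functions perform log2(n) parallel steps for n elements, matching the "n processors: log n steps" entry. The schedules differ only in which elements are paired, which is where data distribution and the interconnection network come into play.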