Vector Processors

CS718 : Data Parallel Processors
27th April, 2006
Anshul Kumar, CSE IITD
Data Parallel Architectures
• SIMD Processors
– Multiple processing elements driven by a single
instruction stream
• Associative Processors
– SIMD-like processors with associative memory
• Vector Processors
– Uni-processors with vector instructions
• Systolic Arrays
– Application-specific VLSI structures
SIMD
[Figure: SIMD model. Labels: control unit (C), instruction stream (IS), processors (P), data streams (DS), memory (M).]
One of the earliest models of a parallel computer
ILLIAC IV SIMD Model
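[Figure: ILLIAC IV organization. Labels: I/O; control unit (CU) with processor (P) and memory (M); bus; processing elements PE1, PE2, ..., PEn, each with its own processor (P) and memory (M); interconnection network.]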
Planned for 64 x 4 PEs, built only 64
Burroughs Scientific Processor (BSP) Model
[Figure: BSP organization. Labels: I/O; control unit (CU) with processor (P) and memory (M); bus; processors P1, P2, ..., Pn; interconnection network; memories M1, M2, ..., Mk.]
SIMD algorithms: sum of vector elements
Inputs: a0 a1 a2 a3 a4 a5 a6 a7
step 1: a0+a1, a2+a3, a4+a5, a6+a7
step 2: a0+a1+a2+a3, a4+a5+a6+a7
step 3: a0+a1+a2+a3+a4+a5+a6+a7

Scheme 1:
step 1: S_i = a_i + a_{i+1}, i = 0,2,4,6
step 2: S_i = S_i + S_{i+2}, i = 0,4
step 3: S_i = S_i + S_{i+4}, i = 0
OR
Scheme 2:
step 1: S_i = a_i + a_{i+4}, i = 0,1,2,3
step 2: S_i = S_i + S_{i+2}, i = 0,1
step 3: S_i = S_i + S_{i+1}, i = 0
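The two summation schemes above can be sketched in Python. This is a sequential simulation, not SIMD code: the inner loop stands in for what all processing elements would do in one step, and the function names and the power-of-two length restriction are assumptions made for the sketch.

```python
def sum_adjacent(a):
    """Scheme 1: combine neighbours first, doubling the stride each step."""
    s = list(a)
    n = len(s)                      # assumed to be a power of two
    stride, steps = 1, 0
    while stride < n:
        # The updates for different i are independent, so a SIMD
        # machine could perform all of them in a single step.
        for i in range(0, n, 2 * stride):
            s[i] = s[i] + s[i + stride]
        stride *= 2
        steps += 1
    return s[0], steps


def sum_halving(a):
    """Scheme 2: fold the upper half onto the lower half, halving the stride."""
    s = list(a)
    stride, steps = len(s) // 2, 0  # length assumed to be a power of two
    while stride >= 1:
        for i in range(stride):
            s[i] = s[i] + s[i + stride]
        stride //= 2
        steps += 1
    return s[0], steps


print(sum_adjacent(range(8)))   # (total, number of parallel steps)
print(sum_halving(range(8)))
```

Both schemes take log2(n) steps; they differ in which elements are paired, and therefore in the communication pattern the interconnection network must support.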
No. of processors vs time
Adding vector elements:
– n processors – log n steps
– n/log n processors – log n steps
Matrix multiplication:
– n processors – n^2 steps
– n^2 processors – n steps
– n^3 processors – log n steps
– n^3/log n processors – log n steps
Important factors: data distribution, network
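As a rough illustration of why about n/log n processors can still add n elements in a number of steps proportional to log n, the sketch below counts addition steps under a simple assumed schedule: each processor first adds up its own block sequentially, then the partial sums are combined in a tree. The function name and the schedule are assumptions for illustration; communication cost is ignored.

```python
import math


def parallel_sum_steps(n, p):
    """Count addition steps to sum n values on p processors (1 <= p <= n)."""
    block = (n + p - 1) // p        # elements held by the busiest processor
    local_steps = block - 1         # sequential additions inside one block
    tree_steps = math.ceil(math.log2(p)) if p > 1 else 0
    return local_steps + tree_steps


n = 1024
log_n = int(math.log2(n))
print(parallel_sum_steps(n, n))            # n processors
print(parallel_sum_steps(n, n // log_n))   # about n / log n processors
```

With the smaller processor count the step count stays within a constant factor of log n, which is the sense in which the "log n steps" entries should be read.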
Rise and fall of SIMDs
• Introduced in the 1960s (e.g. ILLIAC IV, BSP)
• Problems:
– not cost-effective
– serial fraction and Amdahl's law
– I/O bottleneck
• Overshadowed by vector processors
• Resurrected in the 1980s (MPP from Goodyear,
Connection Machine from Thinking
Machines, MP-1 from MasPar)
• Did not survive because of high cost
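The Amdahl's law problem listed above can be made concrete with a small calculation. The function name and the 10% serial fraction are illustrative assumptions; 64 is the number of PEs the ILLIAC IV slide reports as built.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Illustrative numbers only: 64 processing elements, 10% serial work.
print(round(amdahl_speedup(0.10, 64), 2))
# The speedup can never exceed 1 / serial_fraction, however many PEs are added.
print(round(amdahl_speedup(0.10, 10**6), 2))
```

Under these assumed numbers, most of the 64 processing elements' potential is lost to the serial part, which is the cost-effectiveness concern in the list above.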
Related ideas
• Coarse-grain SIMD with off-the-shelf
processors (synchronized MIMD), e.g. the CM-5
from Thinking Machines
• This gave rise to SPMD (single program
multiple data)
• MMX and SIMD instructions in Pentium
Vector Processors
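[Figure: vector processor organization. Labels: I-unit and control; I-cache; D-cache; memory and memory control; address unit; GPRs; vector registers (V-reg); buses; functional unit (FU); vector functional units (VFU).]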
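To show how a single processor with vector registers handles an array longer than one register, the sketch below processes a long vector addition in register-sized chunks (commonly called strip-mining). It is a plain Python model, not machine code; the function name and the default register length `vlen=128` are hypothetical values chosen for illustration.

```python
def vector_add_stripmined(x, y, vlen=128):
    """Add two equal-length lists in chunks of at most vlen elements."""
    if len(x) != len(y):
        raise ValueError("operands must have the same length")
    result = []
    vector_ops = 0
    for start in range(0, len(x), vlen):
        # One chunk models one vector load / add / store sequence on
        # vector registers that hold up to vlen elements.
        xs = x[start:start + vlen]
        ys = y[start:start + vlen]
        result.extend(a + b for a, b in zip(xs, ys))
        vector_ops += 1
    return result, vector_ops


total, ops = vector_add_stripmined(list(range(300)), [1] * 300)
print(ops)   # number of vector add instructions issued
```

Each vector instruction replaces a whole scalar loop of up to `vlen` iterations, so instruction fetch and decode costs are paid once per chunk rather than once per element.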
Four Generations of CRAY systems
(vector processors)
System   CPUs  Clock (MHz)  Flops/clock/CPU  Words moved/clock/CPU  Mflops  Gates/chip
CRAY-1   1     80           2                1                      80      2
X-MP     4     105          2                3                      840     16
Y-MP     8     166          2                3                      2667    2500
C90      16    240          4                6                      15360   10000
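One way to read the table is to compare the Mflops column with the product CPUs x clock (MHz) x flops/clock/CPU. The sketch below does only that arithmetic; the row tuples are transcribed from the table, and the function name is an assumption. For X-MP and C90 the product equals the listed value, for Y-MP it is slightly lower (which would be consistent with a rounded clock figure), and for CRAY-1 the product is twice the listed value; the sketch reports these differences without trying to explain them.

```python
# (system, CPUs, clock in MHz, flops per clock per CPU, Mflops listed in the table)
rows = [
    ("CRAY-1", 1, 80, 2, 80),
    ("X-MP", 4, 105, 2, 840),
    ("Y-MP", 8, 166, 2, 2667),
    ("C90", 16, 240, 4, 15360),
]


def peak_mflops(cpus, clock_mhz, flops_per_clock):
    """Peak rate implied by the first three numeric columns."""
    return cpus * clock_mhz * flops_per_clock


for name, cpus, mhz, fpc, listed in rows:
    product = peak_mflops(cpus, mhz, fpc)
    print(f"{name:7s} product = {product:6d}   listed = {listed}")
```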
Cray History
• http://www.cray.com/company/history.html
CRAY C90
• 8 GB central memory shared by 16 CPUs
• 128 CPU-memory paths
• Word = 64 bits + 16 ECC
• Dual vector pipes
• 128-element segments
• Memory:
– 8 sections
– 8 x 8 subsections
– 8 x 8 x 2 bank groups
– 8 x 8 x 2 x 8 banks
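The memory hierarchy above multiplies out to 8 x 8 x 2 x 8 = 1024 banks. The sketch below shows why so many banks help, and when they do not, by counting how many distinct banks a strided vector access touches. The address-to-bank mapping used here (low-order interleaving, sections varying fastest) is an assumption for illustration; the slides do not give the actual C90 mapping, and all names are hypothetical.

```python
LEVELS = (8, 8, 2, 8)   # sections, subsections, bank groups, banks


def total_banks(levels=LEVELS):
    """Number of banks implied by the hierarchy."""
    total = 1
    for size in levels:
        total *= size
    return total


def locate(word_address, levels=LEVELS):
    """ASSUMED mapping: consecutive words rotate through sections first,
    then subsections, then bank groups, then banks."""
    coords = []
    for size in levels:
        coords.append(word_address % size)
        word_address //= size
    return tuple(coords)   # (section, subsection, bank group, bank)


def banks_touched(stride, count=128, levels=LEVELS):
    """Distinct banks hit by `count` accesses that are `stride` words apart."""
    return len({locate(i * stride, levels) for i in range(count)})


print(total_banks())
for stride in (1, 8, 64, 1024):
    print(stride, banks_touched(stride))
```

Under this assumed mapping, unit-stride accesses spread over as many banks as there are elements, while a stride equal to the bank count sends every access to the same bank, so each access has to wait for the bank to recover from the previous one.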
Convex C4/XA system
• CPU: 7.5 ns clock, 1620 MFLOPS
• Memory: 32 MB x 32 banks, 64-bit word, 50 ns access time
• 3 FP pipes, 2 results each
• Vector registers to FPU crossbar
• 1.1 GB/s per I/O port
[Figure: Convex C4/XA organization. Labels: CPUs, I/O, utilities, 5x5 crossbar, memories.]
Other examples
• Fujitsu VP2000
– 1 - 2 CPUs
• Fujitsu VP5000
– 7 - 222 CPUs
– 2 LS pipes
– 3 Func pipes
– 2 mask pipes
• NEC SX-X
– 4 CPUs
– 4 x 2 pipes each
Systolic Arrays (H.T. Kung 1978)
Simplicity, Regularity, Concurrency, Communication
Example :
Band matrix multiplication: C = A x B, where

    | A11 A12  0   0   0   0  |
    | A21 A22 A23  0   0   0  |
A = | A31 A32 A33 A34  0   0  |
    |  0  A42 A43 A44 A45  0  |
    |  0   0  A53 A54 A55 A56 |
    |  0   0   0  A64 A65 A66 |

    | B11 B12  0   0   0   0  |
    | B21 B22 B23  0   0   0  |
B = | B31 B32 B33 B34  0   0  |
    |  0  B42 B43 B44 B45  0  |
    |  0   0  B53 B54 B55 B56 |
    |  0   0   0  B64 B65 B66 |
[Figure: seven snapshots of the systolic array, T = 0 to T = 6. Elements of A (A11, A12, A21, ...) and B (B11, B12, B21, ...) move into the array step by step. The first product A11 B11 appears at T = 3; products such as A12 B21, A21 B11 and A11 B12 appear from T = 4 onward; the first result, C11, appears at T = 6.]
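The work done by the array can be restated sequentially: every cell performs inner-product steps of the form c = c + a * b, and only elements inside the bands ever meet. The sketch below is not a cycle-accurate model of the systolic data flow; it only shows that restricting the multiply-accumulate steps to the bands gives the same C as a dense multiplication. Function names and the test values are assumptions; the band shape (two sub-diagonals, one super-diagonal) is read from the 6 x 6 example above.

```python
def band_matrix(n, lower, upper, fill):
    """n x n matrix that is zero outside the band -upper <= i - j <= lower."""
    return [[fill(i, j) if -upper <= i - j <= lower else 0
             for j in range(n)] for i in range(n)]


def band_multiply(a, b, lower, upper):
    """C = A x B using only multiply-accumulate steps inside the bands."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    steps = 0
    for i in range(n):
        # Non-zero A[i][k] lie in columns i - lower .. i + upper.
        for k in range(max(0, i - lower), min(n, i + upper + 1)):
            # Non-zero B[k][j] lie in columns k - lower .. k + upper.
            for j in range(max(0, k - lower), min(n, k + upper + 1)):
                c[i][j] += a[i][k] * b[k][j]   # one inner-product step
                steps += 1
    return c, steps


def dense_multiply(a, b):
    """Reference result using all n^3 multiply-accumulate steps."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = band_matrix(6, 2, 1, lambda i, j: i + j + 1)
B = band_matrix(6, 2, 1, lambda i, j: 2 * i - j + 3)
C, steps = band_multiply(A, B, 2, 1)
print(C == dense_multiply(A, B), steps, 6 ** 3)
```

In the systolic array these inner-product steps are spread over the cells and overlapped in time, so the total time depends on the matrix size and band widths rather than on n^3.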
WARP: Programmable Systolic Processor
[Kung, CMU 1987]
Complete contrast to the original idea
• not application specific
• not a single VLSI chip
• complex cell (pipelined FP adder, multiplier,
FIFOs, RAM, crossbar)
• linear
• asynchronous
References
• D. Sima, T. Fountain, P. Kacsuk, "Advanced
Computer Architectures : A Design Space
Approach", Addison Wesley, 1997.
• K. Hwang, "Advanced Computer
Architecture : Parallelism, Scalability,
Programmability", McGraw Hill, 1993.