
Multiprocessing: SIMD and Performance
CS/COE 1541 (term 2174)
Jarrett Billingsley
Class Announcements
● How about this weather?
● There are only 3 weeks of lectures left!!!
● Clarification on NUMA:
o Not just different memory technologies, but also shared-memory
systems where different parts of the same memory have different
access times.
● I'm building a simple CPU for you in Logisim.
o I love Logisim sooooo that should be done by this weekend.
o Let's talk some more about vector processing so you have a
better idea of what the project will be about.
 Starting withhhhh these cool slides from a proposal for vector
processing for the RISC-V open CPU architecture!
3/29/2017
CS/COE 1541 term 2174
2
SIMD Architectures
The general idea
● SIMD makes it as easy to do computations on vectors (1D arrays)
as on scalars (single numbers).
Scalar loop:

        addi    s3, s0, 256
top:    lw      t0, 0(s0)
        lw      t1, 0(s1)
        add     t0, t0, t1
        sw      t0, 0(s2)
        addi    s0, s0, 4
        addi    s1, s1, 4
        addi    s2, s2, 4
        blt     s0, s3, top

Vector version:

        setvl   64              # vectors are 64 items long
        lv.w    v0, 0(s0)       # loads 64 words
        lv.w    v1, 0(s1)       # loads 64 words
        addv.w  v0, v0, v1      # adds 64 words
        sv.w    v0, 0(s2)       # stores 64 words
What if we had 80 things, but the
maximum vector length was 64?
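The usual answer is "strip-mining": loop over the data in chunks of at most the maximum vector length, letting a setvl-style instruction clamp the final, partial chunk. A minimal Python sketch of the idea (the names MVL and strip_mined_add are mine, not RISC-V's):

```python
# Strip-mining: process an 80-element array when the maximum vector
# length (MVL) is only 64. Each pass sets the vector length to
# min(remaining, MVL), does one vector's worth of work, and advances.
MVL = 64

def strip_mined_add(a, b):
    n = len(a)
    c = [0] * n
    i = 0
    while i < n:
        vl = min(n - i, MVL)          # what setvl would return
        for j in range(i, i + vl):    # one vector instruction's worth of adds
            c[j] = a[j] + b[j]
        i += vl                       # advance by the chunk just processed
    return c
```

With 80 elements this runs two passes: one full 64-element vector, then a 16-element remainder.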
Packed SIMD ("Multimedia SIMD")
● Invented in the 1950s (!!!!!) for supercomputers
o Then reintroduced in desktop PCs in the 1990s
● Operate on small vectors (4-64 items) at a time
● Fixed-size registers; fixed-function instructions.
o e.g. "add 4 doubles" or "do a population count (number of 1 bits)
on 16 bytes"
● Simple to implement, can give surprising performance boost for
things like sound/video encoding/decoding, 3D calculations, etc.
● To support larger vectors, need new registers and instructions!
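As a software sketch of "fixed-size registers, fixed-function instructions," here are toy versions of the two examples above. The function names are illustrative, not real instruction mnemonics:

```python
# Packed SIMD semantics in miniature: each "instruction" works on a
# fixed number of lanes, no more and no less.

def addpd4(xs, ys):
    # "add 4 doubles": a fixed-width operation on exactly 4 lanes
    assert len(xs) == 4 and len(ys) == 4
    return [x + y for x, y in zip(xs, ys)]

def popcnt16(register):
    # population count (number of 1 bits) over a fixed 16-byte register
    assert len(register) == 16
    return sum(bin(byte).count("1") for byte in register)
```

The fixed widths are the point: to operate on 8 doubles instead of 4, a packed-SIMD ISA needs a whole new instruction (and usually wider registers).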
Packed SIMD processing hardware
● Packed SIMD hardware might look like this: (I'll draw it and upload after class)
Vector processing
● Invented in the 1960s-70s
o ILLIAC IV, CDC STAR-100, Cray-1
● Similar idea to Packed SIMD, but much
more flexible
● Register to hold current vector length,
instead of hardwiring/coding it
● Reconfigurable vector register file to
support multiple data types
● Pipelined vector processing unit to
perform vector calculations extremely
quickly
The Cray-1 was built out of
wires and discrete gate ICs.
It wasn't curved for fun. It made
the wires on the inside ring
shorter, improving cycle time.
Vector processing hardware
● A vector processor might look like this: (I'll draw it and upload after class)
How fast can it be?
● We only have to fetch and decode a fraction of the number of
instructions that a typical scalar loop would use.
o We only have to check dependencies once.
o The vector unit just plugs and chugs – very simple cycles.
● We greatly reduce or even eliminate branching instructions.
● The vector processing unit can even run at a higher clock speed.
● Multiple lanes (vector pipelines) make it even faster.
● Vector loads/stores can make great use of DRAM burst transfers.
o Can also avoid caching data that will only be read/written once!
● Packed SIMD does give many of the same benefits, just at a much
smaller and less-flexible scale.
Other cool features
● Striding lets you load/store non-contiguous data from memory at
regular offsets. (e.g. the first member of each struct in an array)
[Diagram: memory words at addresses 0-F; a load with a stride of 4 picks up the words at addresses 0, 4, 8, and C.]
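A strided load can be sketched in a few lines of Python. This is a model of the concept, not any particular ISA's instruction; memory is word-addressed here for simplicity:

```python
def strided_load(memory, base, stride, vl):
    # strided vector load: pick up every stride-th word starting at base
    return [memory[base + i * stride] for i in range(vl)]
```

With 4-word structs, a stride of 4 grabs the first field of each struct in an array, matching the slide's example.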
● Gather-scatter lets you put pointers in a vector, then load/store
from arbitrary memory addresses. (gather = load, scatter = store)
[Diagram: an index vector holding addresses 2, E, 7, and 4; a gather loads the words at those four scattered addresses.]
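Gather and scatter are just indexed loads and stores driven by a vector of addresses. A minimal sketch (again word-addressed memory, illustrative names):

```python
def gather(memory, index_vector):
    # gather: vector load from the arbitrary addresses in index_vector
    return [memory[i] for i in index_vector]

def scatter(memory, index_vector, values):
    # scatter: vector store to the arbitrary addresses in index_vector
    for i, v in zip(index_vector, values):
        memory[i] = v
```

Using the slide's index vector [2, E, 7, 4], gather pulls those four words out of memory in one "instruction."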
The downsides…
● A very fast vector processor is only useful for large arrays;
otherwise it will constantly stall waiting for data.
● Vector registers are huge and saving them on interrupts/exceptions
is extremely time-consuming.
o RISC-V improves this by enabling and disabling the vector
processor, so its state only has to be saved if it's enabled.
Parallel Processing Performance
Speedup and Efficiency
● Suppose we have a problem of size n, and we run it on two
computer systems: a single processor, and a parallel processor
with p cores. Then:
Speedup(n, p) = Time_single(n) / Time_parallel(n, p)
● We'd like the parallel run to take 1/p of the single-processor time
(linear speedup!), but that probably won't happen. So we measure how
close we are with:
Efficiency(n, p) = Speedup(n, p) / p
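The two definitions above translate directly into code. A quick sketch:

```python
def speedup(time_single, time_parallel):
    # Speedup(n, p) = Time_single(n) / Time_parallel(n, p)
    return time_single / time_parallel

def efficiency(time_single, time_parallel, p):
    # Efficiency(n, p) = Speedup(n, p) / p
    return speedup(time_single, time_parallel) / p
```

For example, if a job takes 100 s on one core and 25 s on 8 cores, the speedup is 4x but the efficiency is only 0.5: half the added hardware is effectively wasted.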
Strong and Weak scaling
● We can measure two kinds of scalability for a problem.
● Strong scaling means, for a fixed total problem size, how does
speedup change as we add more CPUs?
o Good for measuring speedup of CPU-bound programs, like
summing the contents of two arrays.
● Weak scaling means, for a fixed amount of work per processor,
how does speedup change as we add more CPUs?
o Good for measuring speedup of memory-bound programs,
such as large databases.
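The difference between the two setups is just what you hold fixed as p grows. Here's a toy illustration using a made-up cost model (a fixed serial part plus perfectly parallel work); the numbers are invented, only the fixed-vs-growing distinction matters:

```python
def runtime(n, p, serial=1.0):
    # hypothetical cost model: a fixed serial part plus n/p parallel work
    return serial + n / p

def strong_scaling(n, procs):
    # strong scaling: TOTAL problem size n is fixed as p grows
    return [runtime(n, 1) / runtime(n, p) for p in procs]

def weak_scaling(n_per_proc, procs):
    # weak scaling: work PER PROCESSOR is fixed, so total size grows with p
    return [runtime(n_per_proc, 1) / runtime(n_per_proc * p, p) for p in procs]
```

Under this model, strong scaling gives diminishing speedup (the serial part looms larger), while weak scaling holds the runtime ratio flat as processors and problem size grow together.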
Amdahl's Law
● If f is the fraction of a task that can be executed in parallel,
Time_parallel(n, p) = (1 - f) * Time_single(n) + (f / p) * Time_single(n)
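Dividing Time_single by that expression gives the familiar closed form for the speedup, which is easy to play with in code:

```python
def amdahl_speedup(f, p):
    # Speedup = Time_single / Time_parallel
    #         = 1 / ((1 - f) + f / p), from the formula above
    return 1.0 / ((1.0 - f) + f / p)
```

Note the ceiling: as p goes to infinity, the speedup approaches 1 / (1 - f). If only half the task is parallelizable, no number of processors gets you past 2x.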
● Minsky's Conjecture says that the speedup we can achieve with p
processors is logarithmic in p: O(lg p)
o We can never have perfectly linear speedup!