Pipelining Review and Its Limitations
Yuri Baida
[email protected]
[email protected]
October 16, 2010
MDSP Project | Intel Lab
Moscow Institute of Physics and Technology
Agenda
• Review
– Instruction set architecture
– Basic tools for computer architects
• Amdahl's law
• Pipelining
– Ideal pipeline
– Cost-performance trade-off
– Dependencies
– Hazards, interlocks & stalls
– Forwarding
– Limits of simple pipeline
Review
Instruction Set Architecture
• ISA is the hardware/software interface
– Defines the set of programmer-visible state
– Defines instruction format (bit encoding) and instruction semantics
– Examples: MIPS, x86, IBM 360, JVM
• Many possible implementations of one ISA
– 360 implementations: model 30 (c. 1964), z10 (c. 2008)
– x86 implementations: 8086, 286, 386, 486, Pentium, Pentium Pro,
Pentium 4, Core 2, Core i7, AMD Athlon, Transmeta Crusoe
– MIPS implementations: R2000, R4000, R10000, R18K, …
– JVM: HotSpot, PicoJava, ARM Jazelle, ...
Basic Tools for Computer Architects
• Amdahl's law
• Pipelining
• Out-of-order execution
• Critical path
• Speculation
• Locality
• Important metrics
– Performance
– Power/energy
– Cost
– Complexity
Amdahl's Law
Amdahl’s Law: Definition
• Speedup = Time(without enhancement) / Time(with enhancement)
• Suppose an enhancement speeds up a fraction f of a task by a factor of S
  – Original (normalized) time: (1 − f) + f = 1
  – Enhanced time: (1 − f) + f/S
  – Overall Speedup = 1 / ((1 − f) + f/S)
• We should concentrate efforts on improving frequently occurring
events or frequently used mechanisms
Amdahl’s Law: Example
• New processor is 10 times faster
• Input-output is a bottleneck
  – 60% of the time we wait for I/O
• Let's calculate: f = 0.4, S = 10 ⇒ Speedup = 1 / (0.6 + 0.4/10) ≈ 1.56
• It's human nature to be attracted by "10× faster"
  – Keeping it in perspective, it's only about 1.6× faster overall
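A minimal sketch of this calculation in Python (illustrative only; the function name is ours, not from the slides):

    def amdahl_speedup(f, s):
        # Overall speedup when a fraction f of the original time is sped up by a factor s
        return 1.0 / ((1.0 - f) + f / s)

    # Slide's example: 60% of the time is I/O wait, the remaining 40% runs 10x faster
    print(amdahl_speedup(f=0.4, s=10))   # ~1.5625, i.e. roughly 1.6x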
Pipelining
A Generic 4-stage Processor Pipeline
• Fetch Unit gets the next instruction from the cache
• Decode Unit determines the type of instruction
• Instruction and data are sent to an Execution Unit (Integer Unit, Floating Point Unit, or Multimedia Unit)
• Write Unit stores the result
[The original block diagram also includes an L2 Cache.]

Basic 4-stage pipeline:   Fetch → Decode → Execute → Write
Refined 6-stage pipeline: Fetch → Decode → Read → Execute → Memory → Write
Pipeline: Steady State
Cycle      1       2       3       4        5        6        7        8        9
Instr1     Fetch   Decode  Read    Execute  Memory   Write
Instr2             Fetch   Decode  Read     Execute  Memory   Write
Instr3                     Fetch   Decode   Read     Execute  Memory   Write
Instr4                             Fetch    Decode   Read     Execute  Memory   Write
Instr5                                      Fetch    Decode   Read     Execute  Memory
Instr6                                               Fetch    Decode   Read     Execute
• Latency — elapsed time from start to completion of a particular task
• Throughput — how many tasks can be completed per unit of time
• Pipelining only improves throughput
  – Each job still takes the full pipeline depth (here, 6 cycles) to complete
• Real-life analogy: Henry Ford's automobile assembly line
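A minimal sketch in Python (illustrative only) that reproduces this staircase and shows latency vs. throughput:

    STAGES = ("Fetch", "Decode", "Read", "Execute", "Memory", "Write")

    def stage_of(instr, cycle):
        # Ideal pipeline: instruction i (1-based) occupies stage (cycle - i) during that cycle
        idx = cycle - instr
        return STAGES[idx] if 0 <= idx < len(STAGES) else None

    print(stage_of(1, 6))   # 'Write' -> Instr1 completes in cycle 6 (latency = pipeline depth)
    print(stage_of(2, 7))   # 'Write' -> once full, one instruction completes every cycle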
Pipelining Illustrated
• Unpipelined: one block of n gate delays between latches (L), so TP ~ 1/n
• 2-stage pipeline: two blocks of n/2 gate delays with a latch between them, so TP ~ 2/n
• 3-stage pipeline: three blocks of n/3 gate delays, so TP ~ 3/n
(L — pipeline latch; TP — throughput)
Pipelining Performance Model
• Starting from an unpipelined version with propagation delay T and throughput TP = 1/T
• In the k-stage pipelined version each stage contains T/k of logic plus a latch, so
  – Cycle time = T/k + S
  – TP(pipelined) = 1 / (T/k + S)
  where S — delay through latch and overhead
Hardware Cost Model
• Starting from an unpipelined version with hardware cost G
• The k-stage pipelined version splits the logic into k stages of cost G/k and adds a latch after each stage, so
  – Cost(pipelined) = G + k·L
  where L — cost of adding each latch
Cost/Performance Trade-off
[Peter M. Kogge, 1981]
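The trade-off curve from Kogge's figure is not reproduced here. Below is a minimal Python sketch of the underlying relationship, combining the two models above into the usual cost/performance ratio (G + k·L)·(T/k + S); the numbers are hypothetical, chosen only for illustration:

    # Hypothetical parameters (not from the slides)
    G, T = 1000.0, 100.0   # unpipelined hardware cost and propagation delay
    L, S = 10.0, 2.0       # cost of each latch and latch/clocking overhead

    def cost_per_performance(k):
        cost = G + k * L                  # hardware cost model
        throughput = 1.0 / (T / k + S)    # performance model
        return cost / throughput          # lower is better

    best_k = min(range(1, 101), key=cost_per_performance)
    print(best_k)   # 71 here, near the analytic optimum sqrt(G*T / (L*S)) ~ 70.7

The point of the trade-off: adding stages keeps improving throughput, but each stage also adds latch cost and overhead, so the cost/performance ratio has a finite optimum pipeline depth.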
Pipelining Idealism
• Uniform suboperations
– The operation to be pipelined can be evenly partitioned into uniform-latency suboperations
• Repetition of identical operations
– The same operations are to be performed repeatedly
on a large number of different inputs
• Repetition of independent operations
– All the repetitions of the same operation are mutually independent
– Good example: automobile assembly line
Pipelining Reality
• Uniform suboperations... NOT! ⇒ Balance pipeline stages
– Stage quantization to yield balanced stages
– Minimize internal fragmentation (some waiting stages)
• Identical operations... NOT! ⇒ Unify instruction types
– Coalescing instruction types into one “multi-function” pipe
– Minimize external fragmentation (some idling stages)
• Independent operations... NOT! ⇒ Resolve dependencies
– Inter-instruction dependency detection and resolution
– Minimize performance loss
Hazards, Interlocks, and Stalls
• Pipeline hazards
– Potential violations of program dependences
– Must ensure program dependences are not violated
• Hazard resolution
– Static Method: performed at compile time in software
– Dynamic Method: performed at run time using hardware
– Stall
– Flush
– Forward
• Pipeline interlock
– Hardware mechanisms for dynamic hazard resolution
– Must detect and enforce dependences at run time
Dependencies & Pipeline Hazards
• Data dependence (register or memory)
– True dependence (RAW)
– Instruction must wait for all required input operands
– Anti-dependence (WAR)
– Later write must not clobber a still-pending earlier read
– Output dependence (WAW)
– Earlier write must not clobber an already-finished later write
• Control dependence
– A “data dependency” on the instruction pointer
– Conditional branches cause uncertainty in instruction sequencing
• Resource conflicts
– Two instructions need the same device
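A minimal sketch in Python (illustrative only; the helper is ours, not from the slides) that classifies the data-dependence types between an earlier and a later instruction from their register read/write sets:

    def classify(earlier_writes, earlier_reads, later_writes, later_reads):
        # Data-dependence types carried by an in-program-order pair of instructions
        deps = []
        if earlier_writes & later_reads:
            deps.append("RAW (true dependence)")
        if earlier_reads & later_writes:
            deps.append("WAR (anti-dependence)")
        if earlier_writes & later_writes:
            deps.append("WAW (output dependence)")
        return deps

    # lw $25, 0($24) followed by bge $25, $15, L2 from the example below -> RAW on $25
    print(classify({"$25"}, {"$24"}, set(), {"$25", "$15"}))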
Example: Quick Sort for MIPS

# for (;(j<high)&&(array[j]<array[low]);++j);
# $10 = j; $9 = high; $6 = array; $8 = low

        bge   $10, $9, L2      # exit the loop if j >= high
        mul   $15, $10, 4      # $15 = j * 4
        addu  $24, $6, $15     # $24 = &array[j]
        lw    $25, 0($24)      # $25 = array[j]
        mul   $13, $8, 4       # $13 = low * 4
        addu  $14, $6, $13     # $14 = &array[low]
        lw    $15, 0($14)      # $15 = array[low]
        bge   $25, $15, L2     # exit the loop if array[j] >= array[low]
L1:
        addu  $10, $10, 1      # ++j
        …
L2:
        addu  $11, $11, -1
Pipeline: Data Hazards
Cycle      1       2       3       4        5        6        7        8        9
Instr1     Fetch   Decode  Read    Execute  Memory   Write
Instr2             Fetch   Decode  Read     Execute  Memory   Write
Instr3                     Fetch   Decode   Read     Execute  Memory   Write
Instr4                             Fetch    Decode   Read     Execute  Memory   Write
Instr5                                      Fetch    Decode   Read     Execute  Memory
Instr6                                               Fetch    Decode   Read     Execute

• Instr2: _ → rk
• Instr3: rk → _
• How long should we stall for?
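A back-of-the-envelope answer in Python, under the assumption that rk only becomes readable from the register file in the cycle after Instr2's Write stage (our sketch, not from the slides):

    WRITE_CYCLE_INSTR2 = 7   # Instr2 writes rk in its Write stage
    READ_CYCLE_INSTR3 = 5    # Instr3 would read rk in its Read stage
    # rk is available in cycle 8, so Instr3's Read must slip from cycle 5 to cycle 8
    stall_cycles = (WRITE_CYCLE_INSTR2 + 1) - READ_CYCLE_INSTR3
    print(stall_cycles)      # 3 bubbles without forwarding

This matches the three bubbles shown on the next slide.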
Pipeline: Stall on Data Hazard
[Pipeline diagram: Instr1 and Instr2 proceed as before; Instr3 is stalled before its Read stage until the hazard on rk has passed, injecting three bubbles into the downstream stages; Instr4 and Instr5 are held back in the earlier stages behind it.]

• Instr2: _ → rk
• Instr3: rk → _
• Make the younger instruction wait until the hazard has passed:
  – Stop all up-stream stages
  – Drain all down-stream stages
Pipeline: Forwarding
Cycle      1       2       3       4        5        6        7        8        9
Instr1     Fetch   Decode  Read    Execute  Memory   Write
Instr2             Fetch   Decode  Read     Execute  Memory   Write
Instr3                     Fetch   Decode   Read     Execute  Memory   Write
Instr4                             Fetch    Decode   Read     Execute  Memory   Write
Instr5                                      Fetch    Decode   Read     Execute  Memory
Instr6                                               Fetch    Decode   Read     Execute

[The forwarding paths drawn on the original slide are not reproduced: Instr2's result is passed from its pipeline registers directly to Instr3's Execute stage, so no stall is needed and the pipeline runs at full speed again.]
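A minimal sketch in Python (illustrative only; names are ours) of the usual forwarding decision, assuming results can be bypassed from the EX/MEM and MEM/WB pipeline registers into Execute:

    def operand_source(src_reg, ex_mem_dest, mem_wb_dest):
        # Decide where an Execute-stage source operand should come from
        if src_reg is not None and src_reg == ex_mem_dest:
            return "EX/MEM bypass"    # youngest in-flight producer wins
        if src_reg is not None and src_reg == mem_wb_dest:
            return "MEM/WB bypass"
        return "register file"        # no in-flight producer: normal read

    print(operand_source("$25", "$25", None))   # -> 'EX/MEM bypass'

Note that forwarding cannot eliminate every stall: in the classic case of a load immediately followed by a consumer of the loaded value, at least one bubble remains.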
Limitations of Simple Pipelined Processors
(aka Scalar Processors)
• Upper bound on scalar pipeline throughput
– Limited by IPC = 1
• Inefficiencies of very deep pipelines
– Clocking overheads
– Longer hazards and stalls
• Performance lost due to in-order pipeline
– Unnecessary stalls
• Inefficient unification into a single pipeline
– Long latency for each instruction
– Hazards and associated stalls
Limitations of Deeper Pipelines
• Execution time = (Code size) × (CPI) × (Cycle time)
• Going from unpipelined (delay T) to k-stage pipelined (stage delay T/k plus latch overhead S):
  – Cycle time = T/k + S, eventually limited by S
  – CPI — ?
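A quick consequence of the performance model above (our observation, not from the slides): as k grows, the cycle time T/k + S approaches S, so the clock-period speedup over the unpipelined design is bounded by roughly T/S, and any CPI growth from extra hazards eats into even that. A one-line Python check with the same hypothetical numbers used earlier:

    T, S = 100.0, 2.0   # hypothetical: total logic delay and per-stage latch overhead
    print(max(T / (T / k + S) for k in range(1, 1000)))   # approaches T/S = 50 but never reaches it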
Put it All Together: Limits to Deeper Pipelines
Source: Ed Grochowski, 1997
Acknowledgements
• These slides contain material developed and copyright by:
– Grant McFarland (Intel)
– Christos Kozyrakis (Stanford University)
– Arvind (MIT)
– Joel Emer (Intel/MIT)