Pipelining

5008: Computer
Architecture
Appendix A – Pipelining
CA Lecture03 - pipelining ([email protected])
03-1
Why Pipeline? Because the resources are there!
Time (clock cycles)
Im
Reg
Im
Dm
Reg
Dm
Reg
Dm
Im
Reg
ALU
Inst 2
Reg
ALU
Inst 1
Im
ALU
Im
Reg
Inst 3
Inst 4
Reg
Reg
Dm
Reg
ALU
O
r
d
e
r
Inst 0
ALU
I
n
s
t
r.
Single cycle
data path
Dm
CA Lecture03 - pipelining ([email protected])
Reg
03-2
Pipeline Review
•
•
•
•
•
•
•
•
•
A pipeline is like an hooked assembly line.
Pipelining, in general, is not visible to the programmer (vs ILP)
Pipelining doesn’t help latency of single task, it helps
throughput of entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously using different
resources
Potential speedup = Number pipe stages, if perfectly balanced
stage.
Unbalanced lengths of pipe stages reduces speedup
Time to “fill” pipeline and time to “drain” it reduces speedup
Stall for Dependences
CA Lecture03 - pipelining ([email protected])
03-3
Outline
• MIPS – An ISA example for
pipelining
• 5 stage pipelining
• Structural and Data Hazards
• Forwarding
• Branch Schemes
• Exceptions and Interrupts
• Conclusion
CA Lecture03 - pipelining ([email protected])
03-4
A "Typical" RISC ISA
•
32-bit fixed format instruction (3 formats)
•
32 32-bit GPR (R0 contains zero, DP take pair)
•
3-address, reg-reg arithmetic instruction
•
Single address mode for load/store:
base + displacement
– no indirection
•
Simple branch conditions
•
Delayed branch
see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC,
CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
CA Lecture03 - pipelining ([email protected])
03-5
Example: MIPS (- MIPS)
Register-Register
31
26 25
Op
21 20
Rs1
16 15
Rs2
11 10
6 5
Rd
0
Opx
Register-Immediate
31
26 25
Op
21 20
Rs1
16 15
Rd
immediate
0
Branch
31
26 25
Op
Rs1
21 20
16 15
Rs2/Opx
immediate
0
Jump / Call
31
26 25
Op
target
CA Lecture03 - pipelining ([email protected])
0
03-6
Datapath vs Control
Datapath
Controller
signals
Control Points
•
Datapath: Storage, FU, interconnect sufficient to perform
the desired functions
– Inputs are Control Points
– Outputs are signals
•
Controller: State machine to orchestrate operation on the
data path
CA Lecture03 - pipelining ([email protected])
– Based on desired function and signals
03-7
Approaching an ISA
•
Instruction Set Architecture
•
Meaning of each instruction is described by RTL on
architected registers and memory
Given technology constraints assemble adequate datapath
•
– Defines set of operations, instruction format, hardware
supported data types, named storage, addressing modes,
sequencing
–
–
–
–
•
•
•
•
Architected storage mapped to actual storage
Function units to do all the required operations
Possible additional storage (eg. MAR, MBR, …)
Interconnect to move information among regs and FUs
Map each instruction to sequence of RTLs
Collate sequences into symbolic controller state transition
diagram (STD)
Lower symbolic STD to control points
Implement controller
CA Lecture03 - pipelining ([email protected])
03-8
Outline
• MIPS – An ISA example for
pipelining -- Read Appendix B
• 5 stage pipelining
• Structural and Data Hazards
• Forwarding
• Branch Schemes
• Exceptions and Interrupts
• Conclusion
CA Lecture03 - pipelining ([email protected])
03-9
The Five Steps of the Load Instruction
Cycle 1 Cycle 2
Load Ifetch
Reg/Dec
Cycle 3 Cycle 4 Cycle 5
Exec
Mem
Wr
•
•
Every instruction can be implemented in at most 5 clock cycle
Ifetch: Instruction Fetch
•
•
•
•
Reg/Dec: Registers Fetch and Instruction Decode
Exec: Execution and calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file
– Fetch the instruction from the Instruction Memory
Branch requires ? cycles, Store requires ? cycles, others require ? cycles
CA Lecture03 - pipelining ([email protected])
03-10
The Four Steps of R-type Instruction
does not access data memory…
Cycle 1 Cycle 2
R-type Ifetch
Reg/Dec
Cycle 3 Cycle 4
Exec
Wr
•
Ifetch: Instruction Fetch
•
•
Reg/Dec: Registers Fetch and Instruction Decode
Exec:
•
Wr: Write the ALU output back to the register file
– Fetch the instruction from the Instruction Memory
– Update PC
– ALU operates on the two register operands
CA Lecture03 - pipelining ([email protected])
03-11
Important Observation
•
•
Each functional unit can only be used once per instruction
Each functional unit must be used at the same step for all
instructions:
– Load uses Register File’s Write Port during its 5th step
Load
1
Ifetch
2
Reg/Dec
3
Exec
4
Mem
5
Wr
This’s what caused
the problem
– R-type uses Register File’s Write Port during its 4th step
1
R-type Ifetch
2
Reg/Dec
3
Exec
4
Wr
Structural hazard !!
CA Lecture03 - pipelining ([email protected])
03-12
Pipelining the R-type and Load Instruction
Cycle 1 Cycle 2
Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock
R-type Ifetch
R-type
Reg/Dec
Exec
Ifetch
Reg/Dec
Exec
Ifetch
Reg/Dec
Load
Ops! We have a problem!
Wr
R-type Ifetch
Wr
Exec
Mem
Wr
Reg/Dec
Exec
Wr
R-type Ifetch
Reg/Dec
Exec
Wr
• We have pipeline conflict or structural hazard:
– Two instructions try to write to the register file at the
same time!
– Only one write port
CA Lecture03 - pipelining ([email protected])
03-13
Sol 1: Insert “Bubble” into the Pipeline
Cycle 1 Cycle 2
Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock
Ifetch
R-type
Load
Reg/Dec
Exec
Wr
Ifetch
Reg/Dec
Exec
Mem
R-type
Ifetch
Reg/Dec
Exec
R-type
Ifetch
Wr
Wr
Reg/Dec Pipeline
R-type Ifetch
Exec
Bubble Reg/Dec
Ifetch
•
Wr
Exec
Reg/Dec
Wr
Exec
Insert a “bubble” into the pipeline to prevent 2 writes at
the same cycle
– The control logic can be complex.
– Lose instruction fetch and issue opportunity.
•
No instruction is started in Cycle 6!
CA Lecture03 - pipelining ([email protected])
03-14
Sol 2: Delay R-type’s Write by One Cycle
•
•
Now R-type instructions also use Reg File’s write port at Step 5
Mem step for R-type inst. is a NOOP : nothing is being done.
1
R-type Ifetch
Cycle 1 Cycle 2
2
Reg/Dec
3
Exec
4
Mem
5
Wr
Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock
R-type Ifetch
R-type
Reg/Dec
Exec
Mem
Wr
Ifetch
Reg/Dec
Exec
Mem
Wr
Ifetch
Reg/Dec
Exec
Mem
Wr
Reg/Dec
Exec
Mem
Wr
Reg/Dec
Exec
Mem
Load
R-type Ifetch
R-type Ifetch
CA Lecture03 - pipelining ([email protected])
Wr
03-15
Similarly, the Four Steps of Store
Cycle 1 Cycle 2
Store
Ifetch
Reg/Dec
Cycle 3 Cycle 4
Exec
• Ifetch: Instruction Fetch
Mem
Wr
In order to keep our pipeline
uniform
– Fetch the instruction from the Instruction Memory
– Update PC
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Write the data into the Data Memory
The Three Steps of Beq
Cycle 1 Cycle 2
Beq
Ifetch
Reg/Dec
Cycle 3 Cycle 4
Exec
Mem
Wr
• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec:
– Registers Fetch and Instruction Decode
• Exec:
– compares the two register operand,
– select correct branch target address
– latch into PC
Designing a Pipelined Processor
• Examine the datapath and control diagram
– Starting with single- or multi-cycle datapath?
– Single- or multi-cycle control?
• Partition datapath into steps
• Insert pipeline registers between
successive steps
• Associate resources with steps
• Ensure that flows do not conflict, or
figure out how to resolve
• Assert control in appropriate stage
CA Lecture03 - pipelining ([email protected])
03-18
5 Steps of MIPS Datapath
Instruction
Fetch
Instr. Decode
Reg. Fetch
Execute
Addr. Calc
Next SEQ PC
Adder
4
L
M
D
MUX
Data
Memory
ALU
Imm
MUX MUX
RD
Reg File
Inst
Memory
Address
IR <= mem[PC];
Zero?
RS1
RS2
Write
Back
MUX
Next PC
Memory
Access
Sign
Extend
WB Data
PC <= PC + 4
Reg[IRrd] <= Reg[IRrs] opIRop Reg[IRrt]
CA Lecture03 - pipelining ([email protected])
03-19
5 Steps of MIPS Datapath
Execute
Addr. Calc
Instr. Decode
Reg. Fetch
Next SEQ PC
Next SEQ PC
Adder
4
Zero?
RS1
B <= Reg[IRrt]
Imm
MUX
PC <= PC + 4
A <= Reg[IRrs];
MEM/WB
IR <= mem[PC];
Data
Memory
EX/MEM
ALU
MUX MUX
ID/EX
Reg File
IF/ID
Memory
Address
RS2
Write
Back
MUX
Next PC
Memory
Access
WB Data
Instruction
Fetch
Sign
Extend
RD
RD
RD
rslt <= A opIRop B
WB <= rslt
Reg[IRrd] <= WB CA Lecture03 - pipelining ([email protected])
Pipeline registers (latches)
03-20
Inst. Set Processor Controller
IR <= mem[PC];
Ifetch
PC <= PC + 4
JSR
A <= Reg[IRrs];
JR
RR
if bop(A,b)
ST
B <= Reg[IRrt]
jmp
br
opFetch-DCD
PC <= IRjaddr
r <= A opIRop B
RI
LD
r <= A opIRop IRim
r <= A + IRim
WB <= r
WB <= Mem[r]
PC <= PC+IRim
WB <= r
Reg[IRrd] <= WB
• Use data stationary control
Reg[IRrd] <= WB
CA Lecture03 - pipelining ([email protected])
– local decode
for each instruction phase / pipeline stage
Reg[IRrd] <= WB
03-21
Data Stationary Control
• Pass control signals along just like the data
– Main control generates control signals during ID
WB
Instruction
IF/ID
Control
M
WB
EX
M
WB
ID/EX
EX/MEM
MEM/WB
Use of “Data Stationary Control”
• The Main Control generates the control signals during Reg/Dec
– Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
– Control signals for Mem (MemWr Branch) are used 2 cycles later
– Control signals for Wr (MemtoReg MemWr) are used 3 cycles later
Reg/Dec
MemWr
Branch
MemtoReg
RegWr
RegDst
MemWr
Branch
MemtoReg
RegWr
MemWr
Branch
MemtoReg
RegWr
Wr
Mem/Wr Register
RegDst
ExtOp
ALUSrc
ALUOp
Mem
Ex/Mem Register
Main
Control
ID/Ex Register
IF/ID Register
ExtOp
ALUSrc
ALUOp
Exec
MemtoReg
RegWr
Visualizing Pipelining
Figure A.2, Page A-8
Time (clock cycles)
Reg
DMem
Ifetch
Reg
DMem
Reg
ALU
DMem
Reg
ALU
O
r
d
e
r
Ifetch
ALU
I
n
s
t
r.
ALU
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
Ifetch
Ifetch
Reg
Reg
CA Lecture03 - pipelining ([email protected])
Reg
DMem
Reg
03-24
Outline
• MIPS – An ISA example for
pipelining
• 5 stage pipelining
• Structural and Data Hazards
• Forwarding
• Branch Schemes
• Exceptions and Interrupts
• Conclusion
CA Lecture03 - pipelining ([email protected])
03-25
Pipelining is not quite that easy!
• Limits to pipelining: Hazards prevent next
instruction from executing during its designated
clock cycle
– Structural hazards: HW cannot support this combination
of instructions (single person to fold and put clothes
away)
– Data hazards: Instruction depends on result of prior
instruction still in the pipeline (missing sock)
– Control hazards: Caused by delay between the fetching
of instructions and decisions about changes in control
flow (branches and jumps).
CA Lecture03 - pipelining ([email protected])
03-26
One Memory Port/Structural Hazards
Time (clock cycles)
DMem
Ifetch
Reg
DMem
Reg
ALU
DMem
Reg
ALU
DMem
Reg
ALU
O
r
d
e
r
Reg
ALU
I Load Ifetch
n
s
Instr 1
t
r.
ALU
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
Ifetch
Instr 2
Instr 3
Instr 4
Ifetch
Reg
Ifetch
Reg
CA Lecture03 - pipelining ([email protected])
Reg
Reg
DMem
03-27
Detection is easy in this case! (right half highlight means read, left half write)
Reg
Structural Hazards Limit
Performance
• Why? The primary reason is to reduce cost of the
unit
• Example: if 1.3 memory accesses per instruction
and only one memory access per cycle then
– average CPI = 1.3
– otherwise resource is more than 100% utilized
• Solution 1: Use separate instruction and data
memories
• Solution 2: Allow memory to read and write more
than one word per cycle
• Solution 3: Stall
CA Lecture03 - pipelining ([email protected])
03-28
Speed Up Equations for Pipelining
Speedup =
Average instruction time unpipelined CPI unpipelined Clock cycle unpipelined
=
×
Average instruction time pipelined
CPI pipelined
Clock cycle pipelined
CPI pipelined = Ideal CPI + Average Stall cycles per Inst
Speedup =
Cycle Time unpipelined
Ideal CPI × Pipeline depth
×
Ideal CPI + Pipeline stall CPI Cycle Time pipelined
for balanced pipelining
For simple RISC pipeline, CPI = 1:
Cycle Time unpipelined
Pipeline depth
Speedup =
×
1 + Pipeline stall CPI Cycle Time pipelined
CA Lecture03 - pipelining ([email protected])
03-29
Example: One or Two Memory Ports?
•
Machine A: Dual ported memory (“Harvard Architecture”)
•
Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate
•
Ideal CPI = 1 for both
•
Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe)
= Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe / 1.05)
= (Pipeline Depth/1.4) x 1.05
= 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
•
Machine A is 1.33 times faster
CA Lecture03 - pipelining ([email protected])
03-30
One Memory Port Structural Hazards
Time (clock cycles)
DMem
Ifetch
Reg
DMem
Reg
ALU
Ifetch
Bubble
Stall
Instr 3
Reg
Reg
DMem
Bubble Bubble
Ifetch
Reg
How
do you
“bubble”
the pipe?
CA Lecture03
- pipelining
([email protected])
Reg
Bubble
ALU
O
r
d
e
r
Instr 2
Reg
ALU
I Load Ifetch
n
s
Instr 1
t
r.
ALU
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
Bubble
Reg
DMem
03-31
Handling Stalls
• How to stall?
– Stall instruction in IF and ID: not change PC
and IF/ID
=> the stages re-execute the instructions
– What to move into EX: insert an NOP by
changing EX, MEM, WB control fields of
ID/EX pipeline register to 0
• as control signals propagate, all control signals to EX,
MEM, WB are de-asserted and no registers or
memories are written
CA Lecture03 - pipelining ([email protected])
03-32
Data Hazard Problem
• Due to the overlapped instructions.
Example: r1 cannot be read by other instructions
before it is written by the add.
add
r1 ,r2,r3
sub
r4, r1 ,r3
and
r6, r1 ,r7
or
r8, r1 ,r9
xor
r10, r1 ,r11
CA Lecture03 - pipelining ([email protected])
03-33
RAW Hazards on R1
Time (clock cycles) Dependencies backwards in time are hazards
or
r8,r1,r9
xor r10,r1,r11
Reg
DMem
Ifetch
Reg
DMem
Ifetch
Reg
DMem
Ifetch
Reg
ALU
and r6,r1,r7
Ifetch
Reg
DMem
ALU
sub r4,r1,r3
Reg
ALU
O
r
d
e
r
add r1,r2,r3 Ifetch
WB
ALU
I
n
s
t
r.
MEM
ALU
IF ID/RF EX
Reg
CA Lecture03 - pipelining ([email protected])
Reg
Reg
DMem
Reg
03-34
Types of Data Hazards
Three types: (inst. i1 followed by inst. i2)
• RAW (read after write): dependence
i2 tries to read operand before i1 writes it
• WAR (write after read): anti-dependence
i2 tries to write operand before i1 reads it
– Gets wrong operand, e.g., auto-increment addr.
– Can’t happen in MIPS 5-stage pipeline because:
• All instructions take 5 stages, and reads are always in stage 2, and
writes are always in stage 5
• WAW (write after write): output dependence
i2 tries to write operand before i1 writes it
– Leaves wrong result ( i1’s not i2’s); occur only in pipelines that
write in more than one stage
– Can’t happen in MIPS 5-stage pipeline because:
• All instructions take 5 stages, and writes are always in stage 5
– Out of order executions may suffer this data dependence
CA Lecture03 - pipelining ([email protected])
03-35
WAR Data Hazard
•
Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
•
•
•
•
Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
Can’t happen in MIPS 5 stage pipeline because:
Can happen in between a shorter (Int) pipeline and a longer (FP)
pipeline
WAR hazards can happen if instructions execute out of order
or access data late
CA Lecture03 - pipelining ([email protected])
03-36
No WAR Hazards on r1
Time (clock cycles)
IF ID/RF EX
read
add r4,r1,r3
Ifetch
Reg
WB
write
ALU
DMem
Reg
write
read
sub r1,r2,r3
Ifetch
Reg
ALU
I
n
s
t
r.
MEM
DMem
Reg
O
r
d
e
r
CA Lecture03 - pipelining ([email protected])
03-37
WAW Data Hazard
Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
•
•
•
Called an “output dependence” by compiler writers
This also results from the reuse of name “r1”.
Can’t happen in 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
Will see WAR and WAW in more complicated pipes
CA Lecture03 - pipelining ([email protected])
03-38
No WAW Hazards on R1
Time (clock cycles)
IF ID/RF EX
WB
write
add r1,r4,r3
Ifetch
Reg
ALU
DMem
Reg
write
sub r1,r2,r3
Ifetch
Reg
ALU
I
n
s
t
r.
MEM
DMem
Reg
O
r
d
e
r
CA Lecture03 - pipelining ([email protected])
03-39
Data Forwarding to Avoid Data Hazard
•
With data forwarding (also called bypassing or shortcircuiting), data is transferred back to earlier pipeline
stages before it is written into the register file.
– Instr i: add r1,r2,r3
---------------------– Instr j: sub r4,r1,r5
•
•
(result ready after EX stage)
(result needed in EX stage)
This either eliminates or reduces the penalty of RAW
hazards.
To support data forwarding, additional hardware is required.
– Multiplexors to allow data to be transferred back
– Control logic for the multiplexors
CA Lecture03 - pipelining ([email protected])
03-40
Data Hazard Solution
“Forward” result from one stage to another
Time (clock cycles)
IF
ID/RF
Im
Reg
Dm
Im
Reg
Dm
Im
Reg
Dm
Im
Reg
ALU
xor r10,r1,r11
Dm
ALU
or r8,r1,r9
Reg
ALU
and r6,r1,r7
WB
ALU
O
r
d
e
r
sub r4,r1,r3
Im
MEM
ALU
I
n
s
t
r.
add r1,r2,r3
EX
Reg
Reg
“or”
OK if define
read/write
properly
CA
Lecture03
- pipelining
([email protected])
Reg
Reg
Dm
Reg
03-41
HW Change for Forwarding
NextPC
mux
MEM/WR
EX/MEM
ALU
mux
ID/EX
Registers
Data
Memory
mux
Immediate
CA Lecture03 - pipelining ([email protected])
03-42
Data Hazard Even with Forwarding
sub r4,r1,r6
and r6,r1,r7
or
Reg
DMem
Ifetch
Reg
DMem
Reg
Ifetch
r8,r1,r9
Ifetch
Reg
Reg
DMem
Reg
ALU
Ifetch
ALU
O
r
d
e
r
lw r1, 0(r2)
ALU
I
n
s
t
r.
ALU
Time (clock cycles)
Reg
DMem
Reg
Can’t solve with forwarding
CAdelay/stall
Lecture03 - pipelining
([email protected])
Must
instruction
dependent on loads
03-43
Pipeline Interlock Solution for
Load Stall
Reg
DMem
Ifetch
Reg
Bubble
Ifetch
Bubble
Reg
Bubble
Ifetch
and r6,r1,r7
or r8,r1,r9
How is this detected?
Reg
DMem
Reg
Reg
DMem
ALU
sub r4,r1,r6
Ifetch
ALU
O
r
d
e
r
lw r1, 0(r2)
ALU
I
n
s
t
r.
ALU
Time (clock cycles)
Reg
DMem
Pipeline interlock
CA Lecture03 - pipelining ([email protected])
03-44
Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e – f;
assuming a, b, c, d ,e, and f in memory.
Slow code:
LW
LW
ADD
SW
LW
LW
SUB
SW
Rb,b
Rc,c
Ra,Rb,Rc
a,Ra
Re,e
Rf,f
Rd,Re,Rf
d,Rd
Fast code:
LW
LW
LW
ADD
LW
SW
SUB
SW
Rb,b
Rc,c
Re,e
Ra,Rb,Rc
Rf,f
a,Ra
Rd,Re,Rf
d,Rd
Compiler CA
optimizes
for
performance.
Hardware checks for safety.
Lecture03
- pipelining
([email protected])
03-45
Compiler Avoiding Load Stalls
scheduled
unscheduled
54%
gcc
31%
42%
spice
14%
65%
tex
25%
0%
20%
40%
60%
80%
% loads stalling pipeline
Compilers reduce the number of load stalls, but do not
completelyCAeliminate
them.
Lecture03 - pipelining ([email protected])
03-46
Outline
• MIPS – An ISA example for
pipelining
• 5 stage pipelining
• Structural and Data Hazards
• Forwarding
• Branch Schemes
• Exceptions and Interrupts
• Conclusion
CA Lecture03 - pipelining ([email protected])
03-47
22: add r8,r1,r9
36: xor r10,r1,r11
Reg
DMem
Ifetch
Reg
DMem
Ifetch
Reg
DMem
Ifetch
Reg
ALU
r6,r1,r7
Ifetch
DMem
ALU
18: or
Reg
ALU
14: and r2,r3,r5
Ifetch
ALU
10: beq r1,r3,36
ALU
Control Hazard on Branches
Three Stages Stall
Reg
Reg
Reg
Reg
DMem
What do you do with the 3 instructions in between?
CA Lecture03 - pipelining ([email protected])
03-48
Reg
Control/Branch Hazards
•
Control hazards, which occur due to instructions changing
the PC, can result in a large performance loss.
•
A branch is either
– Taken: PC <= PC + 4 + Imm
; branch target address
– Not Taken: PC <= PC + 4
•
The simplest solution is to stall the pipeline as soon as a
branch instruction is detected.
– Detect the branch in the ID stage
– Don’t know if the branch is taken until the EX stage
– If the branch is taken, we need to repeat the IF and ID stages
– New PC is not changed until the end of the MEM stage, after
determining if the branch is taken and the new PC value
CA Lecture03 - pipelining ([email protected])
03-49
Branch Stall Impact
• If CPI = 1, 30% branch,
Stall 3 cycles => new CPI = 1.9 !!
• Two part solution:
– Determine branch taken or not sooner, AND
– Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
– Move Zero test to ID/RF stage
– Adder to calculate new PC in ID/RF stage
– 1 clock cycle penalty for branch versus 3
CA Lecture03 - pipelining ([email protected])
03-50
Pipelined MIPS Datapath
page A-38
Instruction
Fetch
Memory
Access
Write
Back
Adder
Adder
MUX
Next
SEQ PC
Next PC
Zero?
RS1
MUX
MEM/WB
Data
Memory
EX/MEM
ALU
MUX
ID/EX
Imm
Reg File
IF/ID
Memory
Address
RS2
WB Data
4
Execute
Addr. Calc
Instr. Decode
Reg. Fetch
Sign
Extend
RD
RD
RD
• Interplay of instruction set design and cycle time.
CA Lecture03 - pipelining ([email protected])
03-51
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
–
–
–
–
–
Execute successor instructions in sequence
“Squash” instructions in pipeline if branch actually taken
Advantage of late pipeline state update
47% MIPS branches not taken on average
PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% MIPS branches taken on average
– But haven’t calculated branch target address in MIPS
• MIPS still incurs 1 cycle branch penalty
• Other machines: branch target known before outcome
CA Lecture03 - pipelining ([email protected])
03-52
Four Branch Hazard Alternatives
#4: Delayed Branch -- make the stall cycle useful
– Define branch to take place AFTER a following instruction
branch instruction
sequential successor1
sequential successor2
........
sequential successorn
branch target if taken
e.g. Branch delay slot
of length n
These insts. are executed !!
– 1 slot delay allows proper decision and branch target address
in 5 stage pipeline
– MIPS uses this
CA Lecture03 - pipelining ([email protected])
03-53
Stall -- Control Hazard Solution
•
Stall: wait until decision is clear
– It’s possible to move up decision to 2nd stage by adding
hardware to check registers as being read
Time (clock cycles)
O
r
d
e
r
•
Load
Reg
Mem
Mem
Reg
Reg
Mem
Reg
Mem
Reg
ALU
Beq
Mem
ALU
Add
ALU
I
n
s
t
r.
Mem
Reg
Impact: 2 cycles (or 1 cycle penalty) per branch instruction
=> slow
CA Lecture03 - pipelining ([email protected])
03-54
Predict-- Control Hazard Solution
•
– Predict not taken, for example
Time (clock cycles)
O
r
d
e
r
•
Load
Reg
Mem
Mem
Reg
Reg
Mem
Reg
Mem
Reg
ALU
Beq
Mem
ALU
Add
ALU
I
n
s
t
r.
Predict: guess one direction then back up if wrong
Mem
Reg
Impact: 1 clock cycle per branch instruction if right, 2 if
wrong
CA Lecture03 - pipelining ([email protected])
03-55
Predict-Not-Taken Example
A Stall indeed
1CA
clock
cycle- pipelining
per branch
instruction if right, 2 if wrong
Lecture03
([email protected])
03-56
Delayed Branch-- Control Hazard Solution
•
Time (clock cycles)
•
Misc
Load
Mem
Mem
Reg
Reg
Mem
Reg
Mem
Reg
Mem
Reg
Mem
Reg
ALU
O
r
d
e
r
Reg
ALU
Beq
Mem
ALU
Add
ALU
I
n
s
t
r.
Redefine branch behavior (takes place after next instruction)
“delayed branch”
Mem
Impact: 1 clock cycles per branch instruction if can find
instruction to put in “slot”
CA Lecture03 - pipelining ([email protected])
Reg
03-57
Delayed Branch
• Delayed branch Æ make the stall cycle useful
– Add delay slots = branch penalty = length of
branch delay
• 1 slot for 5-stage DLX/MIPS
– Instructions in the delay slot are executed
whether or not the branch is taken
– See if the compiler can schedule something
useful in these slots
• When the slots cannot be scheduled, they are filled
with the no-op instruction (indeed, stall!!)
– Hope that filled slots actually help advance the
computation
CA Lecture03 - pipelining ([email protected])
03-58
Scheduling Branch Delay Slots
A. From before branch
add $1,$2,$3
if $2=0 then
delay slot
becomes
B. From branch target
sub $4,$5,$6
add $1,$2,$3
if $1=0 then
delay slot
becomes
if $2=0 then
add $1,$2,$3
•
•
•
add $1,$2,$3
if $1=0 then
sub $4,$5,$6
C. From fall through
add $1,$2,$3
if $1=0 then
delay slot
sub $4,$5,$6
becomes
add $1,$2,$3
if $1=0 then
sub $4,$5,$6
A is the best choice, fills delay slot & reduces instruction count (IC)
In B, the sub instruction may need to be copied, increasing IC
In B and C, must
be okay to execute sub when branch fails
CA Lecture03 - pipelining ([email protected])
03-59
Delay-Branch Scheduling Schemes and
Their Requirements
Scheduling
Strategy
Requirements
Improve Performance
When?
From before
Branch must not depend on
the rescheduled instructions
Always
From target
Must be OK to execute
rescheduled instructions if
branch is not taken. May need
to duplicate instructions
When branch is taken.
May enlarge program
if instructions are
duplicated
From fall
through
Must be OK to execute
instructions if branch is taken
When branch is not
taken.
CA Lecture03 - pipelining ([email protected])
03-60
Delayed Branch Summary
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots
useful in computation
– About 50% (60% x 80%) of slots usefully filled
• As processor go to deeper pipelines and multiple
issue, the branch delay grows and need more than
one delay slots
– Delayed branching (the static way) has lost popularity
compared to more expensive but more flexible dynamic
approaches
– Growth in available transistors has made dynamic
approaches relatively cheaper
CA Lecture03 - pipelining ([email protected])
03-61
Evaluating Branch Alternatives
Pipeline speedup =
Pipeline depth
1 +Branch frequency ×Branch penalty
• Assume 4% unconditional branch, 6% conditional branchuntaken, 10% conditional branch-taken
Branch
scheme
Branch
penalty
Stall pipeline
Predict taken
Predict not taken
Delayed branch
3
1
1
0.5
CPI
speedup v.
unpipelined
speedup v.
stall
1.60
1.20
1.14
1.10
3.1
4.2
4.4
4.5
1.0
1.33
1.40
1.45
CA Lecture03 - pipelining ([email protected])
03-62
Deeper Pipeline Example
• For a deeper pipeline, e.g. MIPS R4K, it takes at least 3
pipeline stages before the branch-target address is known
and an additional cycle before the branch condition is
evaluated, assuming no stalls on the registers in the
conditional comparisons
– Assuming an ideal CPI of 1,
Speedup =
Pipeline depth
1 + Pipeline stall cycles from branches
Pipeline stall cycles from branches = Branch frequency × Branch penalty
CA Lecture03 - pipelining ([email protected])
03-63
Branch Penalties
• Assume 4% unconditional branch, 6% conditional branchuntaken, 10% conditional branch-taken
• Branch penalties
Branch
scheme
Penalty
unconditional.
Flush pipeline
Predict taken
Predict not taken
2
2
2
Penalty
untaken
Penalty
taken
3
3
0
3
2
3
Try to find the CPI penalties for 3 branch schemes (Fig. A.16)
CA Lecture03 - pipelining ([email protected])
03-64
Problems with Pipelining
•
Exception: An unusual event happens to an instruction during its
execution
– Examples: divide by zero, undefined opcode
•
Interrupt: Hardware signal to switch the processor to a new
instruction stream
– Example: a sound card interrupts when it needs more audio output
samples (an audio “click” happens if it is left waiting)
•
Problem: It must appear that the exception or interrupt must
appear between 2 instructions (Ii and Ii+1)
– The effect of all instructions up to and including Ii is totalling
complete
–
•
No effect of any instruction after Ii can take place
The interrupt (exception) handler either aborts program or
restarts at instruction Ii+1
CA Lecture03 - pipelining ([email protected])
03-65
Exceptions in MIPS
Pipeline stage
Problem exceptions occurring
IF
Page fault on instruction fetch; misaligned
memory access; memory protection
violation
ID
Undefined or illegal opcode
EX
Arithmetic exception
MEM
Page fault on data fetch; misaligned
memory access; memory protection
violation
WB
Non
Note: Multiple exceptions may occur in the same clock cycle
in pipelining architecture
CA Lecture03 - pipelining ([email protected])
03-66
Precise Exceptions in Static Pipelines
Key observation: architected state only change
in memory and register write stages.
And In Conclusion: Control & Pipelining
•
•
•
Control VIA State Machines and Microprogramming
Just overlap tasks; easy if tasks are independent
Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:
Cycle Time unpipelined
Pipeline depth
Speedup =
×
1 + Pipeline stall CPI Cycle Time pipelined
•
•
Hazards limit performance on computers:
– Structural: need more HW resources
– Data (RAW,WAR,WAW): need forwarding, compiler
scheduling
– Control: delayed branch, prediction
Exceptions, Interrupts add complexity
CA Lecture03 - pipelining ([email protected])
03-68