Goal: Design a Pipelined Datapath

Goal: Describe Pipelining
A pipeline is like a conveyor belt: multiple instructions can be executed at
the same time, each one in a different stage.
Laundry is like pipelining:
1. Put a load of dirty clothing in the washer
2. Move the wet wash into the dryer
3. Move the dry wash onto a table and fold
4. Put the clothes away in a closet
You can’t do it any faster. But what if you
have 3 washes (whites, colors, delicates)?
Computer
Architecture - Pipelining
1/14
Pipelined Laundry
- Each wash cycle takes 2 hours.
- 4 washes take 8 hours.
- 4 pipelined washes take 3.5 hours.
- The stages are overlapped.
- Pipelined laundry is potentially 4 times faster.
[Figure: two laundry timelines from 6 PM to 2 AM for tasks A, B, C, D in task order. Top: the washes run one after another. Bottom: the washes are pipelined, with stages overlapped.]
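The laundry numbers can be checked with a short calculation. This is a minimal Python sketch, assuming four equal stages of 30 minutes each (inferred from a 2-hour wash with 4 steps); the function names are illustrative only.

```python
def sequential_hours(loads, stages=4, stage_hours=0.5):
    """Each load runs through all stages before the next load starts."""
    return loads * stages * stage_hours


def pipelined_hours(loads, stages=4, stage_hours=0.5):
    """The first load fills the pipeline; each later load finishes one stage later."""
    return (stages + loads - 1) * stage_hours


for loads in (1, 4, 1000):
    seq = sequential_hours(loads)
    pipe = pipelined_hours(loads)
    print(f"{loads} loads: sequential {seq} h, pipelined {pipe} h, speedup {seq / pipe:.2f}")
```

With 4 loads the speedup is only 8 / 3.5; it approaches 4 only as the number of loads grows, which is why the slide says "potentially".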
Pipelining Instructions
Executing instructions is performed in stages:
1. Fetch the instruction from memory
2. Decode the instruction and read the registers
3. Execute the operation or calculate address
4. Access a word in data memory
5. Write the result back into a register
While the first instruction is being decoded
the second instruction is already being
fetched. While the first instruction is executed
the second is decoded and the third ...
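The overlap described above can be drawn as a stage-by-cycle chart. The following Python sketch assumes an ideal pipeline with no stalls and uses the abbreviations IF, ID, EX, MEM, WB for the five stages; `pipeline_diagram` is a hypothetical helper, not part of any tool.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return one row per instruction; row i is shifted i cycles to the right."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, _ in enumerate(instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage   # instruction i is in stage s during cycle i + s
        rows.append(row)
    return rows


program = ["lw $1, 100($0)", "lw $2, 200($0)", "lw $3, 300($0)"]
for text, row in zip(program, pipeline_diagram(program)):
    print(f"{text:<16}" + "".join(f"{cell:<5}" for cell in row))
```

Reading any column of the output top to bottom shows which stage each instruction occupies in that cycle.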
Single-Cycle vs. Pipelined Performance
An lw takes 8 ns in the single-cycle design. In the pipelined design each cycle
is 2 ns long (the time of the longest stage, memory access).
[Figure: three consecutive loads, lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), in program execution order on a time axis in ns. Each passes through Instruction fetch, Reg, ALU, Data access, Reg. Top (single-cycle): a new instruction starts every 8 ns. Bottom (pipelined): a new instruction starts every 2 ns.]
Speedup of Pipelining
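Assuming ideal conditions:
$$\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of pipe stages}}$$
Time between instructions (nonpipelined): 8 ns.
Number of pipe stages: 5 -> 8 ns / 5 = 1.6 ns.
But the pipeline clock cannot be shorter than the slowest stage, 2 ns, so the
best possible speedup is 8 ns / 2 ns = 4.0.
So why is the speedup for three instructions only 24 ns / 14 ns = 1.7 and not
4.0? With so few instructions, the time to fill the pipeline dominates.
For 1003 instructions:
single-cycle: 1000 * 8 ns + 24 ns = 8,024 ns
pipelined: 1000 * 2 ns + 14 ns = 2,014 ns
speedup: 8,024 ns / 2,014 ns = 3.98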
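The speedup arithmetic can be reproduced for any instruction count. This is a small Python sketch using the slide's figures (8 ns per single-cycle instruction, 2 ns pipeline clock, 5 stages); the function names are illustrative.

```python
SINGLE_CYCLE_NS = 8   # time per instruction in the single-cycle design
STAGE_NS = 2          # pipeline clock period: the slowest stage
NUM_STAGES = 5


def single_cycle_time(n):
    return n * SINGLE_CYCLE_NS


def pipelined_time(n):
    # The first instruction needs all five stages; each later one finishes one cycle after it.
    return (NUM_STAGES + n - 1) * STAGE_NS


def speedup(n):
    return single_cycle_time(n) / pipelined_time(n)


for n in (3, 1003, 1_000_000):
    print(n, single_cycle_time(n), pipelined_time(n), round(speedup(n), 2))
```

For 3 instructions this gives 24 ns against 14 ns, and for 1003 instructions 8,024 ns against 2,014 ns; the ratio tends to 4 as the count grows.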
The MIPS instruction set was designed for pipelining:
- Each instruction is the same size, so one instruction is fetched every cycle.
  In the 80x86, instruction lengths vary from 1 to 17 bytes; some instructions
  can be fetched in one cycle and others cannot, which complicates fetching.
- MIPS has few instruction formats, and in every format the source register
  fields are in the same place. The registers can be read at the same time as
  the control determines the type of instruction. If this weren’t so, an extra
  pipeline stage would be needed to read the registers.
- Only load and store instructions access memory. If ALU operations could
  access memory, stages 3 and 4 would have to be expanded.
Pipeline Hazards
When the next instruction can’t execute in the next clock cycle, we say
that a hazard has occurred.
There are 3 types of hazards:
- Structural hazards: the same unit is needed by two instructions at the same time.
- Control hazards: the next instruction isn’t known yet.
- Data hazards: an instruction depends on the result of a previous instruction.
Structural Hazards
Laundry example:
A washer/dryer combination is used. Both
the wash and dry stages use the same unit.
The folding table has all your school work on it.
Instruction example:
A single memory is used. The Fetch and Memory stages
can’t be executed at the same time.
The register file can’t be read and written in the same cycle.
Solution:
Two memories, for data and instructions.
Enable the register file to read and write simultaneously.
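To see where a single shared memory would cause trouble, the sketch below lists the cycles in which a load or store is in its memory stage while a later instruction is being fetched. It is a Python illustration only, assuming an ideal five-stage pipeline with no other stalls; `memory_conflicts` is a hypothetical helper.

```python
MEMORY_OPS = {"lw", "sw"}   # only loads and stores access data memory
IF_STAGE, MEM_STAGE = 0, 3  # stage positions in IF, ID, EX, MEM, WB


def memory_conflicts(program):
    """Return (cycle, data_instr, fetched_instr) for every clash on one shared memory.

    Instruction i (0-based) is in stage s during cycle i + s + 1 when nothing stalls.
    """
    conflicts = []
    for i, instr in enumerate(program):
        opcode = instr.split()[0]
        j = i + (MEM_STAGE - IF_STAGE)   # the instruction fetched while i is in MEM
        if opcode in MEMORY_OPS and j < len(program):
            conflicts.append((i + MEM_STAGE + 1, instr, program[j]))
    return conflicts


program = [
    "lw $1, 100($0)",
    "lw $2, 200($0)",
    "lw $3, 300($0)",
    "add $4, $5, $6",
]
for cycle, data_instr, fetched in memory_conflicts(program):
    print(f"cycle {cycle}: data access of '{data_instr}' clashes with fetch of '{fetched}'")
```

With separate instruction and data memories, as in the solution above, these clashes disappear because the two stages no longer share a unit.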
Control Hazards
Laundry example:
Washing filthy uniforms. We might have to add more soap
and wash again. Only after the dry stage can we tell whether the
uniforms are clean or more soap is needed.
Instruction example:
A branch instruction is being decoded. The next
instruction fetched might be the wrong one. Even with a
dedicated ALU in the decode stage to resolve the branch, we still lose one cycle.
Solution:
Stall: Wait until the branch direction is known and then
continue fetching. This is known as a pipeline stall or
bubble.
Control Hazard Example
If the branch test fails the lw instruction will be
executed with a delay of one cycle. In many
processors a delay of 2 cycles is necessary.
[Figure: pipeline timing for add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0) in program execution order. The beq is fetched 2 ns after the add, but the lw is fetched 4 ns after the beq: one 2 ns bubble.]
This solution is too slow; we need a faster one.
What if we predict the result of the branch?
Branch Prediction
Laundry Solution:
While drying the first load of uniforms, wash the second load.
If the first load isn’t clean enough, rewash the first and second loads.
Instruction Solution:
Predict all branches are not taken. Fetch the next instruction. If the
branch is taken (misprediction), stall.
[Figure: two pipeline timing diagrams in program execution order. Top (branch not taken): add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0) start 2 ns apart with no bubble. Bottom (branch taken): add $4, $5, $6; beq $1, $2, 40; a bubble; then or $7, $8, $9 is fetched 4 ns after the beq.]
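The cost of predict-not-taken can be counted in cycles. The Python sketch below assumes a five-stage pipeline and a one-cycle penalty per mispredicted branch, matching the one-bubble case in the slides; `MISPREDICT_BUBBLES` and `total_cycles` are illustrative names, and the penalty can be set to 2 for processors that need the longer delay.

```python
NUM_STAGES = 5
MISPREDICT_BUBBLES = 1  # assumption: branch resolved in decode, one bubble per misprediction


def total_cycles(program, taken):
    """Cycles to run `program` when every branch is predicted not taken.

    `program` lists the opcodes actually executed, in order; `taken` holds one
    boolean per beq, in order.
    """
    outcomes = iter(taken)
    bubbles = 0
    for opcode in program:
        if opcode == "beq" and next(outcomes):
            bubbles += MISPREDICT_BUBBLES   # the wrongly fetched instruction is discarded
    return NUM_STAGES + len(program) - 1 + bubbles


print(total_cycles(["add", "beq", "lw"], taken=[False]))  # prediction correct
print(total_cycles(["add", "beq", "or"], taken=[True]))   # misprediction
```

At 2 ns per cycle, the two cases take 14 ns and 16 ns: the prediction costs nothing when it is right and one bubble when it is wrong.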
3rd Solution: Delayed Decision
Laundry Solution:
When drying the first uniform load, wash a regular load.
Instruction Solution:
Reorder the instructions: after the branch instruction, execute an
instruction that doesn’t depend on the branch.
[Figure: pipeline timing in program execution order for beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0). The instructions start 2 ns apart with no bubble.]
Data Hazards
Laundry example:
The first load is mainly socks. Every sock has its pair in
the second load. Can’t fold the first load until the second
load is dried.
Instruction example:
add $s0, $t0, $t1
sub $t2, $s0, $t3
The 2nd instruction is dependent on the 1st. Only during the
5th stage is the result written back into $s0.
Solution:
Stall: Wait until the 1st instruction ends. This results in 3
bubbles, which is too long to wait.
Forwarding
The result is calculated in the 3rd stage, so why wait? Forward it from the
add’s EX stage straight to the sub’s EX stage:

Clock cycle (2 ns each)   1    2    3    4    5    6
add $s0, $t0, $t1         IF   ID   EX   MEM  WB
sub $t2, $s0, $t3              IF   ID   EX   MEM  WB
When an R-format instruction uses the result of the load just before it, a
bubble is needed even with forwarding (this is called a load-use data hazard):

Clock cycle (2 ns each)   1    2    3    4    5    6    7
lw  $s0, 20($t1)          IF   ID   EX   MEM  WB
bubble
sub $t2, $s0, $t3                   IF   ID   EX   MEM  WB
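The bubble counts for the three cases (no forwarding, forwarding after an R-format instruction, forwarding after a load) follow from when the result is ready and when the next instruction needs it. This Python sketch is one way to model that, assuming the consumer immediately follows the producer and, for the no-forwarding case, that the consumer may read registers only after the producer's write-back has finished, as the slides do; `bubbles` is a hypothetical helper.

```python
STAGE_CYCLE = {"IF": 1, "ID": 2, "EX": 3, "MEM": 4, "WB": 5}  # producer's cycle in each stage


def bubbles(producer, forwarding):
    """Bubbles needed when the very next instruction uses the producer's result.

    producer: "alu" (R-format such as add) or "load" (lw).
    """
    if forwarding:
        # The value is ready at the end of EX for ALU results, at the end of MEM for loads.
        ready_after = STAGE_CYCLE["EX"] if producer == "alu" else STAGE_CYCLE["MEM"]
        needed_in = STAGE_CYCLE["EX"] + 1     # consumer's EX cycle if it does not stall
    else:
        ready_after = STAGE_CYCLE["WB"]       # assumption: wait for write-back to finish
        needed_in = STAGE_CYCLE["ID"] + 1     # consumer's register-read cycle if it does not stall
    return max(0, ready_after + 1 - needed_in)


print("add then sub, no forwarding:", bubbles("alu", forwarding=False))
print("add then sub, forwarding:   ", bubbles("alu", forwarding=True))
print("lw then sub, forwarding:    ", bubbles("load", forwarding=True))
```

The three results are 3, 0 and 1 bubbles, which agree with the stall, forwarding and load-use cases described in the slides.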
Code Reordering
Find the hazard:
# $t1 is the address of v[k]
lw $t0, 0($t1) # $t0 = v[k]
lw $t2, 4($t1) # $t2 = v[k+1]
sw $t2, 0($t1) # v[k] = $t2
sw $t0, 4($t1) # v[k+1] = $t0
Solution: Reorder the instructions so that the sw of $t2 no longer directly
follows the lw that loads $t2:
lw $t0, 0($t1) # $t0 = v[k]
lw $t2, 4($t1) # $t2 = v[k+1]
sw $t0, 4($t1) # v[k+1] = $t0
sw $t2, 0($t1) # v[k] = $t2
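Finding the hazard can be automated by scanning for a load whose destination register is used by the very next instruction. The Python sketch below is a simplified checker under that rule; following the slides, a sw that stores a just-loaded register counts as a load-use hazard. The helpers `parse` and `load_use_stalls` are hypothetical, and the instruction strings must not carry comments.

```python
import re


def parse(instr):
    """Split 'lw $t0, 0($t1)' into (opcode, [registers in order of appearance])."""
    opcode = instr.split()[0]
    return opcode, re.findall(r"\$\w+", instr)


def load_use_stalls(program):
    """Count adjacent pairs where an lw result is used by the very next instruction."""
    stalls = 0
    for first, second in zip(program, program[1:]):
        op1, regs1 = parse(first)
        op2, regs2 = parse(second)
        if op1 != "lw":
            continue
        loaded = regs1[0]
        # sw reads all of its registers; other instructions read all but the first (destination).
        sources = regs2 if op2 == "sw" else regs2[1:]
        if loaded in sources:
            stalls += 1
    return stalls


original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]
print("original:", load_use_stalls(original))
print("reordered:", load_use_stalls(reordered))
```

The original order has one load-use pair (lw $t2 followed by sw $t2); after reordering there is none, so the bubble is avoided without changing what the code stores.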