CS203 – Advanced Computer Architecture
More Pipelining
Review: 5-stage MIPS Pipeline
Stages
F: Instruction fetch
D: Decode & read registers
X: Execute
M: Memory
W: Write-back
Assume
All ALU ops in 1 cycle
All memory accesses in 1 cycle
Branches resolved in D stage
SAXPY (Y=A*X+Y)
int i;
float X[N], Y[N];
for (i = 0; i < N; i++) { Y[i] += a * X[i]; }
Loop: l.s    f1, 0(r2)      // &X in r2
      l.s    f2, 0(r3)      // &Y in r3
      mult.s f1, f1, f0     // a in f0
      addi   r2, r2, 4
      add.s  f2, f2, f1
      addi   r3, r3, 4
      bneq   r2, r1, Loop   // N+4 in r1
      s.s    f2, -4(r3)     // in br delay slot
No stalls, 8 cycles per iteration
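The loop above can be written as a small self-contained C function. This is only a minimal sketch: the function name `saxpy`, its signature, and the pointer check are assumptions made for illustration and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* SAXPY: y[i] = a * x[i] + y[i] for every i in [0, n).
 * Name and signature are illustrative assumptions. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    /* Null pointers are only acceptable for an empty vector. */
    assert(n == 0 || (x != NULL && y != NULL));

    for (size_t i = 0; i < n; i++) {
        y[i] += a * x[i];   /* one multiply and one add per element */
    }
}
```

Each iteration performs two loads, one multiply, one add, and one store, which matches the floating-point work in the 8-instruction MIPS loop body.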
How to improve the basic pipeline?
CPUtime = InstCnt * CPI * ClkCycleTime
CPI = ideal CPI + stalls per instruction
Ideas?
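The two formulas above can be combined into one small function to see which term each optimization attacks. This is a sketch only; the function name `cpu_time` and its parameter names are assumptions introduced here.

```c
#include <assert.h>

/* CPU time = instruction count * CPI * clock cycle time,
 * where CPI = ideal CPI + average stall cycles per instruction.
 * The result is in the same time unit as cycle_time. */
double cpu_time(double inst_count, double ideal_cpi,
                double stalls_per_inst, double cycle_time)
{
    assert(inst_count >= 0.0 && cycle_time > 0.0);

    double cpi = ideal_cpi + stalls_per_inst;
    return inst_count * cpi * cycle_time;
}
```

For example, `cpu_time(1000.0, 1.0, 0.5, 2.0)` evaluates to 3000.0. Lowering the ideal CPI, removing stalls, or shortening the cycle time each shrink a different factor of the product.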
Wide Pipelines (Superscalar)
N instructions each clock cycle
ideal CPI = 1/N
Resources needed
wider path to I$
multi-ported register file
detect dependencies & implement forwarding
[Figure: several F-D-X-M-W pipelines operating in parallel]
Wide Pipelines (2)
Simplify: one integer and one floating-point per cycle
separate register files, no forwarding between them
load/store are integer
Issues
branch hazards & delay slots
forwarding
SAXPY
Integer pipeline:
      l.s    f1, 0(r2)
      l.s    f2, 0(r3)
      addi   r2, r2, 4
      addi   r3, r3, 4
      bneq   r2, r1, Loop
      s.s    f2, -4(r3)
Floating-point pipeline:
      mult.s f1, f1, f0
      add.s  f2, f2, f1
6 cycles per iteration, 33% better
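The pairing rule of this simplified machine (one integer-class and one floating-point-class instruction per cycle, with no dependence between them) can be sketched as a small check. Everything here is hypothetical and chosen for illustration: the `instr` record, the register numbering macros `R` and `F`, and the function name `can_pair`. It is not a description of any real issue logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical register numbering: integer r0..r31 map to 0..31,
 * floating-point f0..f31 map to 32..63. NO_REG marks an unused field. */
#define R(n) (n)
#define F(n) (32 + (n))
#define NO_REG (-1)

typedef enum { CLASS_INT, CLASS_FP } issue_class;

typedef struct {
    issue_class cls; /* issuing pipeline; loads and stores count as integer */
    int dest;        /* register written, or NO_REG */
    int src1;        /* registers read, or NO_REG */
    int src2;
} instr;

/* Returns true if `first` and `second` (consecutive in program order)
 * may issue in the same cycle. */
bool can_pair(instr first, instr second)
{
    assert(first.dest < 64 && second.dest < 64);

    if (first.cls == second.cls)
        return false;   /* structural hazard: one slot per class */
    if (first.dest != NO_REG &&
        (second.src1 == first.dest || second.src2 == first.dest))
        return false;   /* RAW: second reads what first writes */
    if (first.dest != NO_REG && second.dest == first.dest)
        return false;   /* WAW: both write the same register */
    return true;
}
```

With this encoding, `addi r2, r2, 4` can pair with `mult.s f1, f1, f0`, while `l.s f1, 0(r2)` cannot pair with the `mult.s` that consumes `f1`.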
[Figure: parallel F-D-X-M-W pipelines]
Deep Pipelines
Deeper pipeline, smaller CCT
ideal CCT = 1/k of the unpipelined latency, for k stages
Motivations for deep pipelines
variable latencies of ALU, FPU, cache etc.
CCT = max{all latencies}
Commercial Practices: Intel's Pipelines
P5: 5 stages, Pentium, < 500MHz
P6: 12 stages, Pentium 2, 3 & M, < 2GHz
Netburst: 20 stages, Pentium 4, > 3GHz
[Figure: stage-by-stage diagrams of the P5, P6, and Netburst pipelines]
EE282 – Winter 2004, Lecture 5 - 14, Christos Kozyrakis
Limits to Pipelining
Cost/Performance tradeoffs (Peter Kogge, 1981)
Non-pipelined:
let T be latency and C be logic area cost
Pipelined:
d is latch delay, p is clock period, p =T/k + d;
pipelined frequency f = 1/p
pipelined area cost = C + k*h (h is latch area cost)
Performance/Cost Ratio (PCR):
$$\mathrm{PCR} = \frac{f}{C + k h} = \frac{1}{\left(\frac{T}{k} + d\right)\left(C + k h\right)}$$
PCR is max at k0, the optimum # of pipeline stages:
$$k_0 = \sqrt{\frac{T C}{d h}}$$
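The optimum can be checked numerically by evaluating PCR = 1 / ((T/k + d)(C + k h)) for every integer k and keeping the best. The sketch below uses a brute-force search rather than the closed form; the function names `pcr` and `best_stage_count` and the sample values are assumptions for illustration only.

```c
#include <assert.h>

/* Performance/cost ratio of a k-stage pipeline:
 * clock period p = T/k + d, area = C + k*h, PCR = (1/p) / area. */
double pcr(double T, double C, double d, double h, int k)
{
    assert(k > 0);

    double period = T / k + d;
    double area = C + k * h;
    return 1.0 / (period * area);
}

/* Integer stage count in [1, max_k] with the highest PCR. */
int best_stage_count(double T, double C, double d, double h, int max_k)
{
    int best = 1;
    for (int k = 2; k <= max_k; k++) {
        if (pcr(T, C, d, h, k) > pcr(T, C, d, h, best))
            best = k;
    }
    return best;
}
```

With the made-up values T = 64, C = 100, d = 4, h = 4, the search returns 20, which agrees with the closed form sqrt(T*C / (d*h)) = sqrt(400) = 20.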
Limits to Pipelining (2)
Overhead introduced at each pipeline stage
pipeline latches
uneven distribution of work per stage
clock skew: clock may take longer to arrive at different stages
Eventually overhead dominates, diminishing returns
k-stage pipeline where
overhead per stage is d (time)
instructions are spaced by S (time), i.e. S/(T/k) = Sk/T stall cycles
CCT = T/k + d
CPI = ideal CPI + stalls = 1 + Sk/T
CPU time = (1 + Sk/T) * (T/k + d)
T=60, d=2, S=10
k=5, CPUtime = 25.67
k=10, CPUtime = 21.33
k=15, CPUtime = 21.00
k=20, CPUtime = 21.67
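The table above can be reproduced, and the best depth located, with a few lines of C. This is a sketch under the slide's model CPU time = (1 + Sk/T)(T/k + d); the function names `time_per_instr` and `best_depth` are assumptions introduced here.

```c
#include <assert.h>

/* Time per instruction for a k-stage pipeline:
 * cycle time T/k + d, CPI 1 + S*k/T. */
double time_per_instr(double T, double d, double S, int k)
{
    assert(k > 0 && T > 0.0);

    double cct = T / k + d;        /* logic delay per stage plus overhead */
    double cpi = 1.0 + S * k / T;  /* ideal CPI plus stall cycles */
    return cpi * cct;
}

/* Integer depth in [1, max_k] that minimizes time_per_instr. */
int best_depth(double T, double d, double S, int max_k)
{
    int best = 1;
    for (int k = 2; k <= max_k; k++) {
        if (time_per_instr(T, d, S, k) < time_per_instr(T, d, S, best))
            best = k;
    }
    return best;
}
```

For T = 60, d = 2, S = 10 the search settles on k = 13 (about 20.95), which lies between the k = 10 and k = 15 rows and shows how flat the curve is near the minimum.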
MIPS R4000 pipeline
MIPS R4000 Performance
[Figure: stacked bar chart, vertical axis from 0 to 4.5, one bar per benchmark (tomcatv, su2cor, spice2g6, ora, nasa7, doduc, li, espresso, eqntott, gcc), each split into Base, Load stalls, Branch stalls, FP result stalls, and FP structural stalls]
Pipelined CPU speedup
For simple RISC pipeline, CPI = 1:
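A standard textbook form of this relation can be sketched as follows. It assumes perfectly balanced stages, no clocking overhead, and an unpipelined machine that needs k cycles per instruction at the same clock; the symbol names are chosen here for illustration.

```latex
\text{Speedup}
  = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}}
    \times \frac{\text{CCT}_{\text{unpipelined}}}{\text{CCT}_{\text{pipelined}}}
  = \frac{k}{1 + \text{stall cycles per instruction}}
```

Under these assumptions, a 5-stage pipeline with 0.25 stall cycles per instruction would give a speedup of 5 / 1.25 = 4.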
Multiple pipelines
In-order execution
in-order issue and completion of instructions
[Figure: shared F, D, R front end feeding parallel integer, fp add, fp mult, ld/st, and DIV pipelines, followed by W]
Multiple pipelines - Tomasulo
In-order issue
out of order completion
register renaming through reservation stations and CDB: eliminates WAW & WAR
dynamic loop unrolling: loop level parallelism
[Figure: shared F, D, R front end feeding parallel integer, fp add, fp mult, ld/st, and DIV pipelines, connected by a COMMON DATA BUS, followed by W]
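How renaming removes WAW and WAR hazards while keeping true (RAW) dependences can be illustrated with a map from architectural registers to producer tags. This is a simplified sketch, not Tomasulo's hardware: real reservation-station tags are a small reusable pool, whereas the tags here simply count upward. All type and function names are assumptions made for illustration.

```c
#include <assert.h>

#define NUM_ARCH_REGS 32

/* Hypothetical instruction format using architectural register numbers. */
typedef struct { int dest, src1, src2; } arch_instr;

/* After renaming, each operand is a tag naming the producer of its value. */
typedef struct { int dest_tag, src1_tag, src2_tag; } renamed_instr;

typedef struct {
    int map[NUM_ARCH_REGS]; /* current tag for each architectural register */
    int next_tag;           /* next unused tag */
} rename_table;

void rename_init(rename_table *rt)
{
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        rt->map[r] = r;            /* initial register values use tags 0..31 */
    rt->next_tag = NUM_ARCH_REGS;
}

renamed_instr rename_one(rename_table *rt, arch_instr in)
{
    assert(in.dest >= 0 && in.dest < NUM_ARCH_REGS);
    assert(in.src1 >= 0 && in.src1 < NUM_ARCH_REGS);
    assert(in.src2 >= 0 && in.src2 < NUM_ARCH_REGS);

    renamed_instr out;
    /* Sources are looked up before the destination is remapped,
     * so a RAW dependence on an earlier producer is preserved. */
    out.src1_tag = rt->map[in.src1];
    out.src2_tag = rt->map[in.src2];
    /* A fresh tag per result means no two instructions share a
     * destination (no WAW) and no later write can clobber a value
     * an earlier instruction still needs (no WAR). */
    out.dest_tag = rt->next_tag++;
    rt->map[in.dest] = out.dest_tag;
    return out;
}
```

Renaming the sequence `r2 = r1 op r0`, `r1 = r3 op r4`, `r2 = r1 op r5` gives three distinct destination tags, and the third instruction's first source carries the second instruction's tag, so only the true dependence remains.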
Multiple pipelines - ROB
In-order issue
out of order completion
in order commit: supports speculation through branch prediction
ROB eliminates the CDB bottleneck
separates completion from commit stages
[Figure: shared F, D, R front end feeding parallel integer, fp add, fp mult, ld/st, and DIV pipelines, followed by a ROB stage and W]
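The separation of completion from commit can be sketched with a circular buffer: entries are allocated in program order, may be marked done in any order, and leave only from the head. This is a minimal illustration, not a model of any real processor; the size `ROB_SIZE` and all names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

#define ROB_SIZE 8

typedef struct {
    bool done[ROB_SIZE]; /* entry has finished executing */
    int head;            /* oldest instruction in flight */
    int tail;            /* next free slot */
    int count;           /* entries currently allocated */
} rob;

void rob_init(rob *r)
{
    r->head = r->tail = r->count = 0;
    for (int i = 0; i < ROB_SIZE; i++)
        r->done[i] = false;
}

/* In-order issue: allocate at the tail. Returns the slot, or -1 if full. */
int rob_issue(rob *r)
{
    if (r->count == ROB_SIZE)
        return -1;
    int slot = r->tail;
    r->done[slot] = false;
    r->tail = (r->tail + 1) % ROB_SIZE;
    r->count++;
    return slot;
}

/* Out-of-order completion: any allocated entry may finish at any time. */
void rob_complete(rob *r, int slot)
{
    assert(slot >= 0 && slot < ROB_SIZE);
    r->done[slot] = true;
}

/* In-order commit: retire finished entries from the head only.
 * Returns the number of instructions committed. */
int rob_commit(rob *r)
{
    int committed = 0;
    while (r->count > 0 && r->done[r->head]) {
        r->done[r->head] = false;
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        committed++;
    }
    return committed;
}
```

If three instructions are issued and the youngest completes first, `rob_commit` retires nothing until the oldest is done. That ordering is what lets a mispredicted branch discard younger, already completed work before it becomes architecturally visible.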