Dr. Amr Talaat Elect 707

Chapter 3
Instruction Level Parallelism
Dr. Eng. Amr T. Abdel-Hamid
Spring 2011
Text book slides: Computer Architec
ture: A Quantitative Approach 4th E
dition, John L. Hennessy & David A
. Patterso with modifications.
Computer Applications
Elect 707
Recall from Pipelining Review
CPIpipelined  Ideal CPI  Average Stall cycles per Inst
Pipelined CPI = Ideal pipeline CPI + Structural Stalls + Data Ha
zard Stalls + Control Stalls
Dr. Amr Talaat
 Ideal pipeline CPI: measure of the maximum performance
attainable by the implementation
 Structural hazards: HW cannot support this combination of
instructions
 Data hazards: Instruction depends on result of prior
instruction still in the pipeline
 Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow
(branches and jumps)
Elect 707
Reduction of Pipeline Hazards Techniques

Today
Dr. Amr Talaat
Elect 707
Instruction Level Parallelism
 Instruction-Level Parallelism (ILP): overlap the execution
of instructions to improve performance.
 2 main approaches to exploit ILP:
1) Rely on hardware to help discover and exploit the
parallelism dynamically (e.g., Pentium 4)
2) Rely on software technology to find parallelism,
statically at compile-time (e.g., Itanium 2)
Dr. Amr Talaat
Elect 707
Floating-Point Pipeline
Dr. Amr Talaat
Elect 707
Dynamic Scheduling
 Dynamic scheduling: hardware rearranges the instruction
execution to reduce stalls while maintaining data flow and
exception behavior
 It handles cases when dependences unknown at compile
time
 it allows the processor to tolerate unpredictable delays such
as cache misses, by executing other code while waiting for
the miss to resolve
 It allows code that compiled for one pipeline to run
efficiently on a different pipeline
 It simplifies the compiler
Dr. Amr Talaat
Elect 707
HW Schemes: Instruction Parallelism
 Key idea: Allow instructions behind stall to proceed
DIVD
ADDD
SUBD
F0,F2,F4
F10,F0,F8
F12,F8,F14
 Enables out-of-order execution and allows out-of-order
completion (e.g., SUBD)
 In a dynamically scheduled pipeline, all instructions still pas
s through issue stage in order (in-order issue)
Dr. Amr Talaat
 Will distinguish when an instruction begins execution and
when it completes execution; between 2 times, the instru
ction is in execution
 Note: Dynamic execution creates WAR and WAW hazards
Elect 707
Dynamic Scheduling Step 1
 Simple pipeline had 1 stage to check both structur
al and data hazards: Instruction Decode (ID), also
called Instruction Issue
 Split the ID pipe stage of simple 5-stage pipeline i
nto 2 stages:
 Issue—Decode instructions, check for structural haz
ards
 Read operands—Wait until no data hazards, then re
ad operands
Dr. Amr Talaat
Elect 707
A Dynamic Algorithm: Tomasulo’s
 For IBM 360/91 (before caches!)
  Long memory latency
 Goal: High Performance without special compilers
 Small number of floating point registers (4 in 360) prevented in
teresting compiler scheduling of operations
 This led Tomasulo to try to figure out how to get more effective re
gisters — renaming in hardware!
 Why Study 1966 Computer?
 The descendants of this are used in:
Dr. Amr Talaat
 Alpha 21264, Pentium 4, AMD Opteron, Power 5, …
Elect 707
Tomasulo Organization
FP Registers
From Mem
FP Op
Queue
Load Buffers
Load1
Load2
Load3
Load4
Load5
Load6
Store
Buffers
Add1
Add2
Add3
Mult1
Mult2
Dr. Amr Talaat
FP adders
Elect 707
Reservation
Stations
To Mem
FP multipliers
Reservation Station Components
Op: Operation to perform in the unit (e.g., + or –)
Vj, Vk: Value of Source operands
Qj, Qk: Reservation stations producing source registers (value
to be written)
 Note: Qj,Qk=0 => ready
 Store buffers only have Qi for RS producing result
Busy: Indicates reservation station or FU is busy
Dr. Amr Talaat
Register result status—Indicates which functional unit will write
each register, if one exists. Blank when no pending instruction
s that will write that register.
Elect 707
Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station free (no structural hazard),
control issues instr & sends operands (renames registers).
2. Execute—operate on operands (EX)
When both operands ready then execute;
if not ready, watch Common Data Bus for result
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting units;
mark reservation station available
 Normal data bus: data + destination (“go to” bus)
 Common data bus: data + source
Dr. Amr Talaat
 Example speed:
3 clocks for Fl .pt. +,-; 10 for * ; 40 clks for /
Elect 707
Tomasulo Example
Instruction stream
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
Load1
Load2
Load3
Dr. Amr Talaat
Register result status:
Clock
0
Clock cycle
counter
Elect 707
FU
No
No
No
3 Load/Buffers
Reservation Stations:
Time Name Busy
Add1
No
Add2
No
FU count
Add3
No
down
Mult1 No
Mult2 No
Busy Address
Op
S1
Vj
S2
Vk
RS
Qj
RS
Qk
3 FP Adder R.S.
2 FP Mult R.S.
F0
F2
F4
F6
F8
F10
F12
...
F30
Tomasulo Example Cycle 1
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
Reservation Stations:
Time Name Busy
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 No
Dr. Amr Talaat
Register result status:
Clock
1
Elect 707
FU
Busy Address
Load1
Load2
Load3
Op
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F0
F2
F4
F6
F8
Load1
Yes
No
No
34+R2
F10
F12
...
F30
Tomasulo Example Cycle 2
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
Reservation Stations:
Time Name Busy
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 No
Dr. Amr Talaat
Register result status:
Clock
2
FU
Busy Address
Load1
Load2
Load3
Op
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F0
F2
F4
F6
F8
Load2
Yes
Yes
No
34+R2
45+R3
F10
F12
Load1
Note: Can have multiple loads outstanding
Elect 707
...
F30
Tomasulo Example Cycle 3
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
Reservation Stations:
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes MULTD
Mult2 No
Dr. Amr Talaat
Register result status:
Clock
3
Elect 707
FU
F0
Busy Address
3
S1
Vj
Load1
Load2
Load3
S2
Vk
RS
Qj
RS
Qk
R(F4) Load2
F2
Mult1 Load2
F4
F6
Load1
F8
Yes
Yes
No
34+R2
45+R3
• Note: registers
names are removed
(“renamed”) in
Reservation Stations
; MULT issued
• Load1 completing;
what is waiting for
Load1?
F10
F12
...
F30
Tomasulo Example Cycle 4
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
Reservation Stations:
Busy Address
3
4
4
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
No
Yes
No
45+R3
F10
F12
Time Name Busy Op
Add1 Yes SUBD M(A1)
Load2
Add2
No
Add3
No
Mult1 Yes MULTD
R(F4) Load2
Mult2 No
Dr. Amr Talaat
Register result status:
Clock
4
FU
F0
Mult1 Load2
M(A1) Add1
• Load2 completing; what is waiting for Load2?
Elect 707
...
F30
Tomasulo Example Cycle 5
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
2 Add1 Yes SUBD M(A1) M(A2)
Add2
No
Add3
No
10 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Dr. Amr Talaat
Register result status:
Clock
5
FU
F0
Mult1 M(A2)
F10
M(A1) Add1 Mult2
• Timer starts down for Add1, Mult1
Elect 707
No
No
No
F12
...
F30
Tomasulo Example Cycle 6
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
1 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD
M(A2) Add1
Add3
No
9 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Dr. Amr Talaat
Register result status:
Clock
6
FU
F0
Mult1 M(A2)
Add2
No
No
No
F10
F12
...
Add1 Mult2
• Issue ADDD here despite name dependency on F6?
Elect 707
F30
Tomasulo Example Cycle 7
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
Busy Address
4
5
Load1
Load2
Load3
7
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
0 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD
M(A2) Add1
Add3
No
8 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Dr. Amr Talaat
Register result status:
Clock
7
FU
F0
No
No
No
Mult1 M(A2)
Add2
F10
F12
Add1 Mult2
• Add1 (SUBD) completing; what is waiting for it?
Elect 707
...
F30
Tomasulo Example Cycle 8
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
7
8
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
2 Add2 Yes ADDD (M-M) M(A2)
Add3
No
7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Dr. Amr Talaat
Register result status:
Clock
8
Elect 707
FU
F0
Mult1 M(A2)
No
No
No
F10
Add2 (M-M) Mult2
F12
...
F30
Tomasulo Example Cycle 9
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
7
8
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
1 Add2 Yes ADDD (M-M) M(A2)
Add3
No
6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Dr. Amr Talaat
Register result status:
Clock
9
Elect 707
FU
F0
Mult1 M(A2)
No
No
No
F10
Add2 (M-M) Mult2
F12
...
F30
Tomasulo Example Cycle 10
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
4
5
7
8
Busy Address
Load1
Load2
Load3
10
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
0 Add2 Yes ADDD (M-M) M(A2)
Add3
No
5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Dr. Amr Talaat
Register result status:
Clock
10
FU
F0
No
No
No
Mult1 M(A2)
F10
F12
Add2 (M-M) Mult2
• Add2 (ADDD) completing; what is waiting for it?
Elect 707
...
F30
Tomasulo Example Cycle 11
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
7
8
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
4 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Dr. Amr Talaat
Register result status:
Clock
11
Elect 707
FU
F0
Mult1 M(A2)
No
No
No
• Write result of
ADDD here?
• All quick instructi
ons complete
in this cycle!
F10
(M-M+M)(M-M) Mult2
F12
...
F30
Tomasulo Example Cycle 12
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
7
8
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
3 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Dr. Amr Talaat
Register result status:
Clock
12
Elect 707
FU
F0
Mult1 M(A2)
No
No
No
F10
(M-M+M)(M-M) Mult2
F12
...
F30
Tomasulo Example Cycle 13
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
7
8
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
2 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Dr. Amr Talaat
Register result status:
Clock
13
Elect 707
FU
F0
Mult1 M(A2)
No
No
No
F10
(M-M+M)(M-M) Mult2
F12
...
F30
Tomasulo Example Cycle 14
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
7
8
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
1 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Dr. Amr Talaat
Register result status:
Clock
14
Elect 707
FU
F0
Mult1 M(A2)
No
No
No
F10
(M-M+M)(M-M) Mult2
F12
...
F30
Tomasulo Example Cycle 15
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
15
7
4
5
Load1
Load2
Load3
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
0 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Dr. Amr Talaat
Register result status:
Clock
15
FU
F0
Mult1 M(A2)
No
No
No
F10
F12
(M-M+M)(M-M) Mult2
• Mult1 (MULTD) completing; what is waiting for it?
Elect 707
...
F30
Tomasulo Example Cycle 16
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
15
7
4
5
16
8
Load1
Load2
Load3
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 No
40 Mult2 Yes DIVD M*F4 M(A1)
Dr. Amr Talaat
Register result status:
Clock
16
FU
F0
Busy Address
M*F4 M(A2)
No
No
No
F10
(M-M+M)(M-M) Mult2
• Just waiting for Mult2 (DIVD) to complete
Elect 707
F12
...
F30
(skip a couple of cycles)
Dr. Amr Talaat
Elect 707
Tomasulo Example Cycle 55
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
15
7
4
5
16
8
Load1
Load2
Load3
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 No
1 Mult2 Yes DIVD M*F4 M(A1)
Dr. Amr Talaat
Register result status:
Clock
55
Elect 707
FU
F0
Busy Address
M*F4 M(A2)
No
No
No
F10
(M-M+M)(M-M) Mult2
F12
...
F30
Tomasulo Example Cycle 56
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
15
7
56
10
4
5
16
8
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
Dr. Amr Talaat
56
FU
F0
F2
F4
F6
F8
M*F4 M(A2)
No
No
No
11
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 No
0 Mult2 Yes DIVD M*F4 M(A1)
Register result status:
Clock
Busy Address
F10
F12
(M-M+M)(M-M) Mult2
• Mult2 (DIVD) is completing; what is waiting for it?
Elect 707
...
F30
Tomasulo Example Cycle 57
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
15
7
56
10
4
5
16
8
57
11
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 Yes DIVD M*F4 M(A1)
Dr. Amr Talaat
Register result status:
Clock
56
FU
F0
Busy Address
M*F4 M(A2)
No
No
No
F10
F12
...
F30
(M-M+M)(M-M) Result
• Once again: In-order issue, out-of-order execution and out-of-or
der completion.
Elect 707
Why can Tomasulo overlap iterations of loops?
 Register renaming
 Multiple iterations use different physical destinations for regi
sters (dynamic loop unrolling).
 Reservation stations
 Permit instruction issue to advance past integer control flow
operations
 Also buffer old values of registers - totally avoiding the WAR
stall
 Other perspective: Tomasulo building data flo
w dependency graph on the fly
Dr. Amr Talaat
Elect 707
Tomasulo Loop Example
Loop: LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
 Multiply takes 4 clocks
 Load have cache misses
Dr. Amr Talaat
Elect 707
0
F0
0
R1
Loop
R1
F2
R1
#8
Loop Example Cycle 0
Instruction status
Instruction
j
k
iteration
LD
F0
0 R1
1
MULTDF4
F0 F2
1
SD
F4
0 R1
1
LD
F0
0 R1
2
MULTDF4
F0 F2
2
SD
F4
0 R1
2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Dr. Amr Talaat
Clock
0
Elect 707
R1
80
F0
Qi
Issue
S1
Vj
F2
Execution Write
complete Result
S2
Vk
F4
Busy Address
No
No
No
Qi
No
No
No
Load1
Load2
Load3
Store1
Store2
Store3
RS for j RS for k
Qj
Qk
Code:
LD
F0
MULTDF4
SD
F4
SUBI R1
BNEZ R1
F6
F8
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Loop Example Cycle 1
Dr. Amr Talaat
Instruction status
Instruction
j
k
iteration
LD
F0
0 R1
1
MULTDF4
F0 F2
1
SD
F4
0 R1
1
LD
F0
0 R1
2
MULTDF4
F0 F2
2
SD
F4
0 R1
2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock
1
Elect 707
R1
80
F0
Qi
Load1
Issue
1
S1
Vj
F2
Execution Write
complete Result
S2
Vk
F4
Busy Address
Yes
80
No
No
Qi
No
No
No
Load1
Load2
Load3
Store1
Store2
Store3
RS for j RS for k
Qj
Qk
Code:
LD
F0
MULTDF4
SD
F4
SUBI R1
BNEZ R1
F6
F8
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Loop Example Cycle 2
Dr. Amr Talaat
Instruction status
Instruction
j
k
iteration
LD
F0
0 R1
1
MULTDF4
F0 F2
1
SD
F4
0 R1
1
LD
F0
0 R1
2
MULTDF4
F0 F2
2
SD
F4
0 R1
2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes MULTD
0 Mult2 No
Register result status
Clock
2
R1
80
F0
Qi
Load1
Issue
1
2
S1
Vj
Execution Write
complete Result
S2
Vk
R(F2)
F2
F4
Busy Address
Yes
80
No
No
Qi
No
No
No
Load1
Load2
Load3
Store1
Store2
Store3
RS for j RS for k
Qj
Qk
Code:
LD
F0
MULTDF4
SD
F4
Load1
SUBI R1
BNEZ R1
F6
F8
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult1
Elect 707
3
Loop Example Cycle 3
Dr. Amr Talaat
Instruction status
Instruction
j
k
iteration
LD
F0
0 R1
1
MULTDF4
F0 F2
1
SD
F4
0 R1
1
LD
F0
0 R1
2
MULTDF4
F0 F2
2
SD
F4
0 R1
2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes MULTD
0 Mult2 No
Register result status
Clock
3
R1
80
F0
Qi
Load1
Issue
1
2
3
S1
Vj
Execution Write
complete Result
S2
Vk
R(F2)
F2
F4
Busy Address
Yes
80
No
No
Qi
Yes
80 Mult1
No
No
Load1
Load2
Load3
Store1
Store2
Store3
RS for j RS for k
Qj
Qk
Code:
LD
F0
MULTDF4
SD
F4
Load1
SUBI R1
BNEZ R1
F6
F8
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult1
Elect 707
3
Loop Example Cycle 4
Dr. Amr Talaat
Instruction status
Instruction
j
k
iteration
LD
F0
0 R1
1
MULTDF4
F0 F2
1
SD
F4
0 R1
1
LD
F0
0 R1
2
MULTDF4
F0 F2
2
SD
F4
0 R1
2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes MULTD
0 Mult2 No
Register result status
Clock
4
R1
72
F0
Qi
Load1
Issue
1
2
3
S1
Vj
Execution Write
complete Result
S2
Vk
R(F2)
F2
F4
Busy Address
Yes
80
No
No
Qi
Yes
80 Mult1
No
No
Load1
Load2
Load3
Store1
Store2
Store3
RS for j RS for k
Qj
Qk
Code:
LD
F0
MULTDF4
SD
F4
Load1
SUBI R1
BNEZ R1
F6
F8
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult1
Elect 707
4
Loop Example Cycle 5
Dr. Amr Talaat
Instruction status
Instruction
j
k
iteration
LD
F0
0 R1
1
MULTDF4
F0 F2
1
SD
F4
0 R1
1
LD
F0
0 R1
2
MULTDF4
F0 F2
2
SD
F4
0 R1
2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes MULTD
0 Mult2 No
Register result status
Clock
5
R1
72
F0
Qi
Load1
Issue
1
2
3
S1
Vj
Execution Write
complete Result
S2
Vk
R(F2)
F2
F4
Busy Address
Yes
80
No
No
Qi
Yes
80 Mult1
No
No
Load1
Load2
Load3
Store1
Store2
Store3
RS for j RS for k
Qj
Qk
Code:
LD
F0
MULTDF4
SD
F4
Load1
SUBI R1
BNEZ R1
F6
F8
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult1
Elect 707
4
Loop Example Cycle 6
Dr. Amr Talaat
Instruction status
Instruction j
k
iteration
LD F0
0 R1
1
MULTD
F4
F0 F2
1
SD F4
0 R1
1
LD F0
0 R1
2
MULTD
F4
F0 F2
2
SD F4
0 R1
2
Reservation Stations
TimeNameBusyOp
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes MULTD
0 Mult2 No
Register result status
Execution
Write
Issue complete
Result
BusyAddress
1
Load1 Yes 80
2
Load2 Yes 72
3
Load3 No
Qi
6
Store1 Yes 80 Mult1
Store2 No
Store3 No
S1
S2
RS forRS
j for k
Vj
Vk
Qj
Qk
Code:
LD F0
0 R1
MULTD
F4 F0 F2
SD F4
0 R1
R(F2) Load1
SUBIR1 R1 #8
BNEZ
R1 Loop
Clock
F2
6
Elect 707
R1
72
F0
Qi
Load1
F4
Mult1
F6
F8
F10F12... F30
Loop Example Cycle 7
Dr. Amr Talaat
Instruction status
Instruction j
k
iteration
LD F0
0 R1
1
MULTD
F4
F0 F2
1
SD F4
0 R1
1
LD F0
0 R1
2
MULTD
F4
F0 F2
2
SD F4
0 R1
2
Reservation Stations
TimeNameBusyOp
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes MULTD
0 Mult2 Yes MULTD
Register result status
Execution
Write
Issue complete
Result
BusyAddress
1
Load1 Yes 80
2
Load2 Yes 72
3
Load3 No
Qi
6
Store1 Yes 80 Mult1
7
Store2 No
Store3 No
S1
S2
RS forRS
j for k
Vj
Vk
Qj
Qk
Code:
LD F0
0 R1
MULTD
F4 F0 F2
SD F4
0 R1
R(F2) Load1
SUBIR1 R1 #8
R(F2) Load2
BNEZ
R1 Loop
Clock
F2
7
R1
72
F0
Qi
Load2
F4
F6
F8
F10F12... F30
Mult2
Elect 707
4
Loop Example Cycle 8
Dr. Amr Talaat
Instruction status
Instruction
j
k
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
Reservation Stations
Time Name Busy
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes
0 Mult2 Yes
Register result status
Clock
8
R1
72
iteration
1
1
1
2
2
2
Op
MULTD
MULTD
F0
Qi
Issue
1
2
3
6
7
8
S1
Vj
Load2
F2
Execution Write
complete Result
Busy Address
Yes
80
Yes
72
No
Qi
Yes
80 Mult1
Yes
72 Mult2
No
R(F2)
R(F2)
Load1
Load2
Load3
Store1
Store2
Store3
RS for j RS for k
Qj
Qk
Code:
LD
F0
MULTDF4
SD
F4
Load1
SUBI R1
Load2
BNEZ R1
F4
F6
S2
Vk
F8
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult2
Elect 707
4
Loop Example Cycle 9
Dr. Amr Talaat
Instruction status
Instruction
j
k
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
Reservation Stations
Time Name Busy
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes
0 Mult2 Yes
Register result status
Clock
9
R1
64
iteration
1
1
1
2
2
2
Op
MULTD
MULTD
F0
Qi
Issue
1
2
3
6
7
8
S1
Vj
Load2
F2
Execution Write
complete Result
9
Load1
Load2
Load3
Store1
Store2
Store3
S2
RS for j RS for k
Vk
Qj
Qk
R(F2)
R(F2)
Load1
Load2
F4
F6
F8
Busy Address
Yes
80
Yes
72
No
Qi
Yes
80 Mult1
Yes
72 Mult2
No
Code:
LD
F0
MULTDF4
SD
F4
SUBI R1
BNEZ R1
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult2
Elect 707
4
Loop Example Cycle 10
Dr. Amr Talaat
Instruction status
Instruction
j
k
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
Reservation Stations
Time Name Busy
0 Add1 No
0 Add2 No
0 Add3 No
4 Mult1 Yes
0 Mult2 Yes
Register result status
Clock
10
R1
64
Qi
iteration
1
1
1
2
2
2
Op
Issue
1
2
3
6
7
8
S1
Vj
Execution Write
complete Result
9
10
Load1
Load2
Load3
10
Store1
Store2
Store3
S2
RS for j RS for k
Vk
Qj
Qk
MULTD
MULTD
M(80) R(F2)
R(F2)
Load2
F0
F2
F6
Load2
F4
F8
Busy Address
No
Yes
72
No
Qi
Yes
80 Mult1
Yes
72 Mult2
No
Code:
LD
F0
MULTDF4
SD
F4
SUBI R1
BNEZ R1
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult2
Elect 707
4
Loop Example Cycle 11
Dr. Amr Talaat
Instruction status
Instruction
j
k
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
Reservation Stations
Time Name Busy
0 Add1 No
0 Add2 No
0 Add3 No
3 Mult1 Yes
4 Mult2 Yes
Register result status
Clock
11
R1
64
Qi
iteration
1
1
1
2
2
2
Op
Issue
1
2
3
6
7
8
S1
Vj
Execution Write
complete Result
9
10
Load1
Load2
Load3
10
11
Store1
Store2
Store3
S2
RS for j RS for k
Vk
Qj
Qk
MULTD
MULTD
M(80) R(F2)
M(72) R(F2)
F0
F2
F4
F6
F8
Busy Address
No
No
Yes
64 Qi
Yes
80 Mult1
Yes
72 Mult2
No
Code:
LD
F0
MULTDF4
SD
F4
SUBI R1
BNEZ R1
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult2
Elect 707
4
Loop Example Cycle 12
Structural hazard – no MULT unit available
Dr. Amr Talaat
Instruction status
Instruction
j
k
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
Reservation Stations
Time Name Busy
0 Add1 No
0 Add2 No
0 Add3 No
2 Mult1 Yes
3 Mult2 Yes
Register result status
Clock
12
R1
64
Qi
iteration
1
1
1
2
2
2
Op
Issue
1
2
3
6
7
8
S1
Vj
Execution Write
complete Result
9
10
Load1
Load2
Load3
10
11
Store1
Store2
Store3
S2
RS for j RS for k
Vk
Qj
Qk
MULTD
MULTD
M(80) R(F2)
M(72) R(F2)
F0
F2
F4
F6
F8
Busy Address
No
No
Yes
64 Qi
Yes
80 Mult1
Yes
72 Mult2
No
Code:
LD
F0
MULTDF4
SD
F4
SUBI R1
BNEZ R1
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult2
Elect 707
4
Loop Example Cycle 13
Dr. Amr Talaat
Instruction status
Instruction
j
k
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
Reservation Stations
Time Name Busy
0 Add1 No
0 Add2 No
0 Add3 No
1 Mult1 Yes
2 Mult2 Yes
Register result status
Clock
13
R1
64
Qi
iteration
1
1
1
2
2
2
Op
Issue
1
2
3
6
7
8
S1
Vj
Execution Write
complete Result
9
10
Load1
Load2
Load3
10
11
Store1
Store2
Store3
S2
RS for j RS for k
Vk
Qj
Qk
MULTD
MULTD
M(80) R(F2)
M(72) R(F2)
F0
F2
F4
F6
F8
Busy Address
No
No
Yes
64 Qi
Yes
80 Mult1
Yes
72 Mult2
No
Code:
LD
F0
MULTDF4
SD
F4
SUBI R1
BNEZ R1
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult2
Elect 707
4
Loop Example Cycle 14
Dr. Amr Talaat
Instruction status
Instruction
j
k
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
LD
F0
0 R1
MULTDF4
F0 F2
SD
F4
0 R1
Reservation Stations
Time Name Busy
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes
1 Mult2 Yes
Register result status
Clock
14
R1
64
Qi
iteration
1
1
1
2
2
2
Op
Issue
1
2
3
6
7
8
S1
Vj
Execution Write
complete Result
9
10
Load1
14
Load2
Load3
10
11
Store1
Store2
Store3
S2
RS for j RS for k
Vk
Qj
Qk
MULTD
MULTD
M(80) R(F2)
M(72) R(F2)
F0
F2
F4
F6
F8
Busy Address
No
No
Yes
64 Qi
Yes
80 Mult1
Yes
72 Mult2
No
Code:
LD
F0
MULTDF4
SD
F4
SUBI R1
BNEZ R1
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult2
Elect 707
5
Loop Example Cycle 15
Dr. Amr Talaat
Instruction status
Instruction
j
k
iteration
LD
F0
0 R1
1
MULTDF4
F0 F2
1
SD
F4
0 R1
1
LD
F0
0 R1
2
MULTDF4
F0 F2
2
SD
F4
0 R1
2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 No
0 Mult2 Yes MULTD
Register result status
Clock
15
R1
64
F0
Qi
Issue
1
2
3
6
7
8
S1
Vj
Execution Write
complete Result
9
10
Load1
14
15
Load2
Load3
10
11
Store1
15
Store2
Store3
S2
RS for j RS for k
Vk
Qj
Qk
M(72) R(F2)
F2
F4
F6
F8
Busy Address
No
No
Yes
64 Qi
Yes
80 M(80)*R(F2)
Yes
72 Mult2
No
Code:
LD
F0
MULTDF4
SD
F4
SUBI R1
BNEZ R1
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult2
Elect 707
5
Loop Example Cycle 16
Dr. Amr Talaat
Instruction status
Instruction
j
k
iteration
LD
F0
0 R1
1
MULTDF4
F0 F2
1
SD
F4
0 R1
1
LD
F0
0 R1
2
MULTDF4
F0 F2
2
SD
F4
0 R1
2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes MULTD
0 Mult2 No
Register result status
Clock
16
R1
64
F0
Qi
Issue
1
2
3
6
7
8
S1
Vj
F2
Execution Write
complete Result
9
10
Load1
14
15
Load2
Load3
10
11
Store1
15
16
Store2
Store3
S2
RS for j RS for k
Vk
Qj
Qk
R(F2)
Load3
F4
F6
F8
Busy Address
No
No
Yes
64 Qi
Yes
80 M(80)*R(F2)
Yes
72 M(72)*R(72)
No
Code:
LD
F0
MULTDF4
SD
F4
SUBI R1
BNEZ R1
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult1
Elect 707
5
Loop Example Cycle 17
Dr. Amr Talaat
Instruction status
Instruction
j
k
iteration
LD
F0
0 R1
1
MULTDF4
F0 F2
1
SD
F4
0 R1
1
LD
F0
0 R1
2
MULTDF4
F0 F2
2
SD
F4
0 R1
2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes MULTD
0 Mult2 No
Register result status
Clock
17
R1
64
F0
Qi
Issue
1
2
3
6
7
8
S1
Vj
F2
Execution Write
complete Result
9
10
Load1
14
15
Load2
Load3
10
11
Store1
15
16
Store2
Store3
S2
RS for j RS for k
Vk
Qj
Qk
R(F2)
Load3
F4
F6
F8
Busy Address
No
No
Yes
64 Qi
Yes
80 M(80)*R(F2)
Yes
72 M(72)*R(72)
Yes
64 Mult1
Code:
LD
F0
MULTDF4
SD
F4
SUBI R1
BNEZ R1
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult1
Elect 707
5
Loop Example Cycle 18
Dr. Amr Talaat
Instruction status
Instruction
j
k
iteration
LD
F0
0 R1
1
MULTDF4
F0 F2
1
SD
F4
0 R1
1
LD
F0
0 R1
2
MULTDF4
F0 F2
2
SD
F4
0 R1
2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes MULTD
0 Mult2 No
Register result status
Clock
18
R1
56
F0
Qi
Issue
1
2
3
6
7
8
S1
Vj
Execution Write
complete Result
9
10
14
15
18
10
11
15
16
S2
Vk
R(F2)
F2
F4
Busy Address
No
No
Yes
64 Qi
Yes
80 M(80)*R(F2)
Yes
72 M(72)*R(72)
Yes
64 Mult1
Load1
Load2
Load3
Store1
Store2
Store3
RS for j RS for k
Qj
Qk
Code:
LD
F0
MULTDF4
SD
F4
Load3
SUBI R1
BNEZ R1
F6
F8
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult1
Elect 707
5
Loop Example Cycle 19
Dr. Amr Talaat
Instruction status
Instruction
j
k
iteration
LD
F0
0 R1
1
MULTDF4
F0 F2
1
SD
F4
0 R1
1
LD
F0
0 R1
2
MULTDF4
F0 F2
2
SD
F4
0 R1
2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes MULTD
0 Mult2 No
Register result status
Clock
19
R1
56
F0
Qi
Issue
1
2
3
6
7
8
S1
Vj
Execution Write
complete Result
9
10
14
15
18
19
10
11
15
16
S2
Vk
R(F2)
F2
F4
Busy Address
No
No
Yes
64 Qi
No
Yes
72 M(72)*R(72)
Yes
64 Mult1
Load1
Load2
Load3
Store1
Store2
Store3
RS for j RS for k
Qj
Qk
Code:
LD
F0
MULTDF4
SD
F4
Load3
SUBI R1
BNEZ R1
F6
F8
0 R1
F0 F2
0 R1
R1 #8
Loop
…
F10 F12 ... F30
Mult1
Elect 707
5
Loop Example Cycle 20
Dr. Amr Talaat
Instruction status
Instruction
j
k
iteration
LD
F0
0 R1
1
MULTDF4
F0 F2
1
SD
F4
0 R1
1
LD
F0
0 R1
2
MULTDF4
F0 F2
2
SD
F4
0 R1
2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes MULTD
0 Mult2 No
Register result status
Clock
20
R1
56
F0
Qi
Issue
1
2
3
6
7
8
S1
Vj
Execution Write
complete Result
9
10
14
15
18
19
10
11
15
16
20
S2
RS for j
Vk
Qj
R(F2)
F2
F4
Busy Address
No
No
Yes
64 Qi
No
Yes
72 M(72)*R(72)
Yes
64 Mult1
Load1
Load2
Load3
Store1
Store2
Store3
RS for k
Qk
Code:
LD
F0
MULTDF4
SD
F4
Load3
SUBI R1
BNEZ R1
F6
F8
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult1
Elect 707
5
Loop Example Cycle 21
Dr. Amr Talaat
Instruction status
Instruction
j
k
iteration
LD
F0
0 R1
1
MULTDF4
F0 F2
1
SD
F4
0 R1
1
LD
F0
0 R1
2
MULTDF4
F0 F2
2
SD
F4
0 R1
2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 Yes MULTD
0 Mult2 No
Register result status
Clock
21
R1
56
F0
Qi
Issue
1
2
3
6
7
8
S1
Vj
Execution Write
complete Result
9
10
14
15
18
19
10
11
15
16
20
21
S2
RS for j
Vk
Qj
R(F2)
F2
F4
Busy Address
No
No
Yes
64 Qi
No
No
Yes
64 Mult1
Load1
Load2
Load3
Store1
Store2
Store3
RS for k
Qk
Code:
LD
F0
MULTDF4
SD
F4
Load3
SUBI R1
BNEZ R1
F6
F8
0 R1
F0 F2
0 R1
R1 #8
Loop
F10 F12 ... F30
Mult1
Elect 707
5
Tomasulo’s scheme offers 2 major advantages
1. Distribution of the hazard detection logic
 distributed reservation stations and the CDB
 If multiple instructions waiting on single result, & each instru
ction has other operand, then instructions can be released si
multaneously by broadcast on CDB
 If a centralized register file were used, the units would have t
o read their results from the registers when register buses ar
e available
2. Elimination of stalls for WAW and WAR hazar
ds
Dr. Amr Talaat
Elect 707
Tomasulo Drawbacks
 Complexity
 Many associative stores (CDB) at high speed
 Performance limited by Common Data Bus
 Each CDB must go to multiple functional units
high capacitance, high wiring density
 Number of functional units that can complete per cycle limite
d to one!
 Multiple CDBs  more FU logic for parallel assoc stores
 Non-precise interrupts!
 We will address this later
Dr. Amr Talaat
Elect 707
 Sect 5th Edition 3.4 & 3.5
Dr. Amr Talaat
Elect 707