Lecture 8: Tomasulo’s Algorithm

Lecture 6
Scoreboard (Contd.) and Tomasulo’s Algorithm
Instructor: Laxmi Bhuyan
Nov. 2, 2004
Lec. 7
Three Parts of the Scoreboard
1. Instruction status: which of the 4 steps the instruction is in
(Issue, Read Operands, Execute, Write Result)
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for
each functional unit:
Busy: indicates whether the unit is busy or not
Op: operation to perform in the unit (e.g., + or –)
Fi: destination register
Fj, Fk: source-register numbers
Qj, Qk: functional units producing source registers Fj, Fk
Rj, Rk: flags indicating that Fj, Fk are ready and not yet read. Set to
No after the operands are read.
3. Register result status: indicates which functional unit will write each register, if one
exists. Blank when no pending instruction will write that register.
Detailed Scoreboard Pipeline Control
(A.55 on page A-76)
Issue
– Wait until: not Busy(FU) and not Result(D) [the Result(D) test prevents WAW]
– Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← ‘D’; Fj(FU) ← ‘S1’; Fk(FU) ← ‘S2’;
Qj ← Result(‘S1’); Qk ← Result(‘S2’); Rj ← not Qj; Rk ← not Qk; Result(‘D’) ← FU
Read operands
– Wait until: Rj and Rk
– Bookkeeping: Rj ← No; Rk ← No
Execution complete
– Wait until: functional unit done
Write result
– Wait until: ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) and (Fk(f) ≠ Fi(FU) or Rk(f) = No)) [prevents WAR]
– Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes);
Result(Fi(FU)) ← 0; Busy(FU) ← No
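The issue and write-result conditions of the scoreboard can be sketched in Python. This is a minimal sketch under stated assumptions, not the CDC 6600 control logic: the `FU` class, the `result` dictionary, and the function names are hypothetical; only the field names (Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk) come from the slides.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class FU:
    """One functional-unit status entry (field names follow the slides)."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # FU producing Fj / Fk (None = no producer)
    qk: Optional[str] = None
    rj: bool = False           # Fj / Fk ready and not yet read
    rk: bool = False

def can_issue(fus: Dict[str, FU], result: Dict[str, str], name: str, dest: str) -> bool:
    """Issue check: FU is free (structural) and nobody will write dest (WAW)."""
    return not fus[name].busy and result.get(dest) is None

def issue(fus, result, name, op, dest, s1, s2) -> None:
    """Issue bookkeeping: fill the FU entry and claim the destination."""
    fu = fus[name]
    fu.busy, fu.op, fu.fi, fu.fj, fu.fk = True, op, dest, s1, s2
    fu.qj, fu.qk = result.get(s1), result.get(s2)
    fu.rj, fu.rk = fu.qj is None, fu.qk is None
    result[dest] = name

def can_write_result(fus: Dict[str, FU], name: str) -> bool:
    """WAR check: no unit still has to read the old value of Fi(name)."""
    fi = fus[name].fi
    return all((f.fj != fi or not f.rj) and (f.fk != fi or not f.rk)
               for f in fus.values())

# Hypothetical replay of the DIVD / ADDD situation from the example.
fus = {n: FU() for n in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
result: Dict[str, str] = {}
issue(fus, result, "Mult1", "Mult", "F0", "F2", "F4")
issue(fus, result, "Divide", "Div", "F10", "F0", "F6")
issue(fus, result, "Add", "Add", "F6", "F8", "F2")
blocked = can_write_result(fus, "Add")        # DIVD has not read F6 yet
fus["Divide"].rj = fus["Divide"].rk = False   # pretend DIVD has read its operands
released = can_write_result(fus, "Add")
```

Here `blocked` is false because DIVD still holds an unread F6, and `released` is true once DIVD has read its operands, which mirrors the ADDD write-back stall in the example.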
Scoreboard Example
• The following latencies are chosen to illustrate behavior; they are not
representative of real hardware
• LD takes 1 cycle
– (compute address + data cache access)
• ADDD and SUBD take 2 cycles
• Multiply takes 10 cycles
• Divide takes 40 cycles
Scoreboard Example
Instruction status
Instruction  dest  j    k    Issue  Read operands  Execution complete  Write result
LD           F6    34+  R2
LD           F2    45+  R3
MULTD        F0    F2   F4
SUBD         F8    F6   F2
DIVD         F10   F0   F6
ADDD         F6    F8   F2
Functional unit status
(Fi = dest; Fj, Fk = sources S1, S2; Qj, Qk = FU for j, k; Rj, Rk = Fj?, Fk?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
FU
Scoreboard Example Cycle 1
Instruction status
Instruction  dest  j    k    Issue  Read operands  Execution complete  Write result
LD           F6    34+  R2   1
LD           F2    45+  R3
MULTD        F0    F2   F4
SUBD         F8    F6   F2
DIVD         F10   F0   F6
ADDD         F6    F8   F2
Functional unit status
(Fi = dest; Fj, Fk = sources S1, S2; Qj, Qk = FU for j, k; Rj, Rk = Fj?, Fk?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock  F0  F2  F4  F6       F8  F10  F12  ...  F30
1
FU                 Integer
Scoreboard Example Cycle 2
Instruction status
Instruction  dest  j    k    Issue  Read operands  Execution complete  Write result
LD           F6    34+  R2   1      2
LD           F2    45+  R3
MULTD        F0    F2   F4
SUBD         F8    F6   F2
DIVD         F10   F0   F6
ADDD         F6    F8   F2
Note: Can’t issue I2 because the Integer unit is busy. Can’t issue the next
instruction either, due to in-order issue.
Functional unit status
(Fi = dest; Fj, Fk = sources S1, S2; Qj, Qk = FU for j, k; Rj, Rk = Fj?, Fk?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock  F0  F2  F4  F6       F8  F10  F12  ...  F30
2
FU                 Integer
Scoreboard Example Cycle 3
Instruction status
Instruction
j
k Issue
LD
F6 34+ R2
1
LD
F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD
F10 F0 F6
ADDD F6 F8 F2
Functional unit status
TimeName Busy
Integer Yes
Mult1 No
Mult2 No
Add
No
Divide No
Register result status
Clock
F0
3
FU
Nov. 2, 2004
Read Execution Write
operands
complete Result
2
3
dest
Op Fi
Load F6
S1 S2 FU for FU
j for F
k j?
Fj Fk Qj
Qk
Rj
R2
Fk?
Rk
No
F2
F6 F8 F10
Integer
F30
F4
Lec. 7
F12
...
8
Scoreboard Example Cycle 4
Instruction status
Instruction
j
k Issue
LD
F6 34+ R2
1
LD
F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status
TimeName Busy
Integer Yes
Mult1 No
Mult2 No
Add
No
Divide No
Register result status
Clock
F0
4
FU
Nov. 2, 2004
Read Execution Write
operands
complete Result
2
3
4
dest
Op
Fi
Load F6
S1 S2 FU for FU
j for F
k j?
Fj Fk Qj
Qk
Rj
R2
Fk?
Rk
No
F2
F6 F8 F10
F30
F4
Lec. 7
F12
...
9
Scoreboard Example Cycle 5
Instruction status
Instruction
j
k Issue
LD
F6 34+ R2
1
LD
F2 45+ R3
5
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status
TimeName Busy
Integer Yes
Mult1 No
Mult2 No
Add
No
Divide No
Register result status
Clock
F0
5
FU
Nov. 2, 2004
Read Execution Write
operands
complete Result
2
3
4
Now I2 is issued
dest
Op
Fi
Load F2
S1 S2 FU for FU
j for F
k j?
Fj Fk Qj
Qk
Rj
R3
Fk?
Rk
Yes
F2
F4
Integer
F6 F8 F10
F30
Lec. 7
F12
...
10
Scoreboard Example Cycle 6
Instruction status
Instruction
j
k Issue
LD
F6 34+ R2
1
LD
F2 45+ R3
5
MULTD F0 F2 F4
6
SUBD F8 F6 F2
DIVD
F10 F0 F6
ADDD F6 F8 F2
Functional unit status
TimeName Busy
Integer Yes
Mult1 Yes
Mult2 No
Add
No
Divide No
Register result status
Clock
F0
6
FU Mult
Nov. 2, 2004
Read Execution Write
operands
complete Result
2
3
4
6
dest
Op Fi
Load F2
Mult F0
S1 S2 FU for j FU for k Fj?
Fj Fk Qj
Qk
Rj
R3
F2 F4 Integer
No
Fk?
Rk
No
Yes
F2
F4
Integer
F6 F8 F10
F30
Lec. 7
F12
...
11
Scoreboard Example Cycle 7
Instruction status
Instruction
j
k Issue
LD
F6 34+ R2
1
LD
F2 45+ R3
5
MULTD F0 F2 F4
6
SUBD F8 F6 F2
7
DIVD
F10 F0 F6
ADDD F6 F8 F2
Functional unit status
TimeName Busy
Integer Yes
Mult1 Yes
Mult2 No
Add
Yes
Divide No
Register result status
Clock
F0
7
FU Mult
Nov. 2, 2004
Read Execution Write
operands
complete Result
2
3
4
6
7
I3 stalled at read
because I2 isn’t
complete
dest
Op Fi
Load F2
Mult F0
S1 S2 FU for j FU for k Fj?
Fj Fk Qj
Qk
Rj
R3
F2 F4 Integer
No
Subd F8
F6 F2
Integer Yes No
F2
F4
Integer
F6 F8 F10
Add
F12
Lec. 7
...
Fk?
Rk
No
Yes
F30
12
Scoreboard Example Cycle 8
Instruction status
Instruction
j
k Issue
LD
F6 34+ R2
1
LD
F2 45+ R3
5
MULTD F0 F2 F4
6
SUBD F8 F6 F2
7
DIVD F10 F0 F6
8
ADDD F6 F8 F2
Functional unit status
TimeName Busy
Integer No
Mult1
Yes
Mult2
No
Add
Yes
Divide Yes
Register result status
Clock
F0
8 Nov. 2, 2004
FU Mult1
Read EX
Write
Op
compl. Result
2
3
4
6
7
8
Op
dest
Fi
S1 S2 FU for FU
j for kFj? Fk?
Fj Fk Qj
Qk
Rj Rk
Mult
F0
F2 F4
Yes Yes
Sub
Div
F8
F10
F6 F2
F0 F6 Mult1
Yes Yes
No Yes
F2
F4
F6 F8 F10 F12
Add Divide
...
Lec. 7
F30
13
Scoreboard Example Cycle 9
Instruction status
Instruction
j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status
Time Name
Integer
10 Mult1
Mult2
2 Add
Divide
Register result status
Clock
9
FU
Nov. 2, 2004
Read EX
Write
IssueOp
complete
Result
1
2
3
4
5
6
7
8
6
9
7
9
8
Busy Op
No
Yes Mult
No
Yes Sub
Yes Div
F0 F2
Mult1
Note: I3 and I4 read
operands because F2 is now
available. ADDD (I6) can’t
be issued because SUBD
(I4) uses the adder
dest
Fi
S1 S2 FU for j FU for k Fj?
Fj Fk Qj
Qk
Rj
Fk?
Rk
F0
F2 F4
No
No
F8
F10
F6 F2
F0 F6 Mult1
No
No
No
Yes
F4
F6 F8 F10
Add Divide
...
F30
Lec. 7
F12
14
Scoreboard Example Cycle 11
Instruction status
Read Execution
Write
Instruction j
k Issueoperands
complete
Result
LD F6 34+ R2 1
2
3
4
LD F2 45+ R3 5
6
7
8
Note: Add takes 2 cycles,
MULTD
F0 F2 F4
6
9
so nothing happens in
SUBD
F8 F6 F2
7
9
11
cycle 10. MUL continues.
DIVDF10 F0 F6
8
ADDD
F6 F8 F2
Functional unit status
dest S1 S2 FU for j FU for k Fj? Fk?
TimeName Busy Op
Fi
Fj Fk Qj
Qk
Rj
Rk
Integer No
8 Mult1
Yes Mult F0
F2 F4
No No
Mult2
No
0 Add
Yes Sub F8
F6 F2
No No
Divide Yes Div
F10 F0 F6 Mult1
No Yes
Register result status
Clock
F0 F2
F4
F6 F8 F10
F12
...
F30
11
FU Mult1
Add Divide
Nov. 2, 2004
Lec. 7
15
Scoreboard Example Cycle 12
Instruction status
Read Execution
Write
Instruction j
k Issueoperands
complete
Result
LD
F6 34+ R2 1
2
3
4
LD
F2 45+ R3 5
6
7
8
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2
Functional unit status
dest S1 S2
TimeName Busy Op
Fi
Fj Fk
Integer No
7 Mult1 Yes Mult F0
F2 F4
Mult2 No
Add
No
Divide Yes Div F10 F0 F6
Register result status
Clock
F0 F2
F4
F6 F8
12
FU Mult1
Nov. 2, 2004
Lec. 7
FU for FU
j for F
k j?
Qj
Qk
Rj
Fk?
Rk
No
No
Mult1
No
Yes
F10 F12
Divide
...
F30
16
Scoreboard Example Cycle 13
Instruction status
Read Execution
Write
Instruction
j
k Issueoperands
complete
Result
LD
F6 34+ R2 1
2
3
4
LD
F2 45+ R3 5
6
7
8
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
Functional unit status
dest S1 S2
TimeName Busy Op
Fi
Fj Fk
Integer No
6 Mult1 Yes Mult F0
F2 F4
Mult2 No
Add
Yes Add F6
F8 F2
Divide Yes Div F10 F0 F6
Register result status
Clock
F0 F2
F4
F6 F8
13
FU Mult1
Add
Nov. 2, 2004
Lec. 7
Now ADDD is issued
because SUBD has
completed
FU for j FU for kFj?
Qj
Qk
Rj
No
Mult1
F10
F12
Divide
Fk?
Rk
No
Yes Yes
No Yes
...
F30
17
Scoreboard Example Cycle 14
Instruction status
Read Execution
Write
Instruction
j
k Issueoperands
complete
Result
LD
F6 34+ R2 1
2
3
4
LD
F2 45+ R3 5
6
7
8
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
14
Functional unit status
dest S1 S2
TimeName Busy Op
Fi
Fj Fk
Integer No
5 Mult1 Yes Mult F0
F2 F4
Mult2 No
2 Add
Yes Add F6
F8 F2
Divide Yes Div F10 F0 F6
Register result status
Clock
F0 F2
F4
F6 F8
14
FU Mult1
Add
Nov. 2, 2004
Lec. 7
FU for FU
j for F
k j?
Qj
Qk
Rj
Mult1
F10 F12
Divide
Fk?
Rk
No
No
No
No
No
Yes
...
F30
18
Scoreboard Example Cycle 15
Instruction status
Read Execution
Write
Instruction
j
k Issueoperands
complete
Result
LD
F6 34+ R2 1
2
3
4
LD
F2 45+ R3 5
6
7
8
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
14
Functional unit status
dest S1 S2
TimeName Busy Op
Fi
Fj Fk
Integer No
4 Mult1 Yes Mult F0
F2 F4
Mult2 No
1 Add
Yes Add F6
F8 F2
Divide Yes Div F10 F0 F6
Register result status
Clock
F0 F2
F4
F6 F8
15
FU Mult1
Add
Nov. 2, 2004
Lec. 7
Note: ADDD takes 2
cycles, so no change
FU for j FU for k Fj?
Qj
Qk
Rj
Mult1
F10
F12
Divide
Fk?
Rk
No
No
No
No
No
Yes
...
F30
19
Scoreboard Example Cycle 16
Instruction status
Read Execution
Write
Instruction
j
k Issue operands
complete
Result
LD
F6 34+ R2
1
2
3
4
LD
F2 45+ R3
5
6
7
8
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
14
16
Functional unit status
dest S1 S2
TimeName Busy Op
Fi
Fj Fk
Integer No
3 Mult1
Yes Mult F0
F2 F4
Mult2
No
0 Add
Yes Add F6
F8 F2
Divide Yes Div F10 F0 F6
Register result status
Clock
F0 F2
F4
F6 F8
16
FU Mult1
Add
Nov. 2, 2004
Lec. 7
ADDD completes, but
MULTD and DIVD go on
FU for j FU for k Fj?
Qj
Qk
Rj
Mult1
F10
Divide
F12
Fk?
Rk
No
No
No
No
No
Yes
...
F30
20
Scoreboard Example Cycle 17
Instruction status
Read Execution
Write
Instruction
j
k Issueoperands
complete
Result
LD
F6 34+ R2 1
2
3
4
LD
F2 45+ R3 5
6
7
8
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
14
16
Functional unit status
dest S1 S2
TimeName Busy Op
Fi
Fj Fk
Integer No
2 Mult1
Yes Mult F0
F2 F4
Mult2
No
Add
Yes Add F6
F8 F2
Divide Yes Div
F10 F0 F6
Register result status
Clock
F0 F2
F4
F6 F8
17
FU Mult1
Add
Nov. 2, 2004
Lec. 7
ADDD stalls, can’t write back
due to WAR with DIVD.
MULT and DIV continue
FU for FU
j for F
k j?
Qj
Qk
Rj
Mult1
F10 F12
Divide
Fk?
Rk
No
No
No
No
No
Yes
...
F30
21
Scoreboard Example Cycle 18
Instruction status
Read Execution
Write
Instruction
j
k Issueoperands
complete
Result
LD
F6 34+ R2 1
2
3
4
LD
F2 45+ R3 5
6
7
8
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
14
16
Functional unit status
dest S1 S2
TimeName Busy Op
Fi
Fj Fk
Integer No
1 Mult1
Yes Mult F0
F2 F4
Mult2
No
Add
Yes Add F6
F8 F2
Divide Yes Div
F10 F0 F6
Register result status
Clock
F0 F2
F4
F6 F8
18
FU Mult1
Add
Nov. 2, 2004
Lec. 7
MULT and DIV
continue
FU for FU
j for F
k j?
Qj
Qk
Rj
Mult1
F10 F12
Divide
Fk?
Rk
No
No
No
No
No
Yes
...
F30
22
Scoreboard Example Cycle 19
Instruction status
Read Execution
Write
Instruction
j
k Issueoperands
complete
Result
LD
F6 34+ R2 1
2
3
4
LD
F2 45+ R3 5
6
7
8
19
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
14
16
Functional unit status
dest S1 S2
TimeName Busy Op
Fi
Fj Fk
Integer No
0 Mult1
Yes Mult F0
F2 F4
Mult2
No
Add
Yes Add F6
F8 F2
Divide Yes Div
F10 F0 F6
Register result status
Clock
F0 F2
F4
F6 F8
19
FU Mult1
Add
Nov. 2, 2004
Lec. 7
MULT completes
after 10 cycles
FU for FU
j for F
k j?
Qj
Qk
Rj
Mult1
F10 F12
Divide
Fk?
Rk
No
No
No
No
No
Yes
...
F30
23
Scoreboard Example Cycle 20
Instruction
j
k Issueoperands
complete
Result
LD
F6 34+ R2
1
2
3
4
LD
F2 45+ R3
5
6
7
8
MULTD F0 F2 F4
6
9
19 20
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
14
16
Functional unit status
dest S1 S2
TimeName Busy Op
Fi
Fj Fk
Integer No
Mult1
No
Mult2
No
Add
Yes Add F6
F8 F2
Divide Yes Div
F10 F0 F6
Register result status
Clock
F0 F2
F4
F6 F8
20
FU
Add
Nov. 2, 2004
Lec. 7
MULTD completes and
writes to F0
FU for FU
j for F
k j?
Qj
Qk
Rj
Fk?
Rk
No No
Yes Yes
F10 F12
Divide
...
F30
24
Scoreboard Example Cycle 21
Instruction
j
k Issueoperands
complete
Result
LD
F6 34+ R2
1
2
3
4
LD
F2 45+ R3
5
6
7
8
MULTD F0 F2 F4
6
9
19 20
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
21
ADDD F6 F8 F2 13
14
16
Functional unit status
dest S1 S2
TimeName Busy Op
Fi
Fj Fk
Integer No
Mult1
No
Mult2
No
Add
Yes Add F6
F8 F2
Divide Yes Div
F10 F0 F6
Register result status
Clock
F0 F2
F4
F6 F8
21
FU
Add
Nov. 2, 2004
Lec. 7
Now DIVD reads
because F0 is
available
FU for FU
j for F
k j?
Qj
Qk
Rj
F10 F12
Divide
Fk?
Rk
No
No
No
No
...
F30
25
Scoreboard Example Cycle 22
Instruction
j
k Issueoperands
complete
Result
LD
F6 34+ R2
1
2
3
4
LD
F2 45+ R3
5
6
7
8
MULTD F0 F2 F4
6
9
19 20
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
21
ADDD F6 F8 F2 13
14
16 22
Functional unit status
dest S1 S2
TimeName Busy Op
Fi
Fj Fk
Integer No
Mult1
No
Mult2
No
Add
No
Divide Yes Div
F10 F0 F6
Register result status
Clock
F0 F2
F4
F6 F8
22
FU
Nov. 2, 2004
Lec. 7
ADDD writes result
because WAR is
removed.
FU for FU
j for F
k j?
Qj
Qk
Rj
F10 F12
Divide
Fk?
Rk
No
No
...
F30
26
Scoreboard Example Cycle 61
Instruction
j
k Issueoperands
complete
Result
LD
F6 34+ R2
1
2
3
4
LD
F2 45+ R3
5
6
7
8
MULTD F0 F2 F4
6
9
19 20
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
21
61
ADDD F6 F8 F2 13
14
16 22
Functional unit status
dest S1 S2
TimeName Busy Op
Fi
Fj Fk
Integer No
Mult1
No
Mult2
No
Add
No
Divide Yes Div
F10 F0 F6
Register result status
Clock
F0 F2
F4
F6 F8
61
FU
Nov. 2, 2004
Lec. 7
DIVD completes
execution
FU for FU
j for F
k j?
Qj
Qk
Rj
F10 F12
Divide
Fk?
Rk
No
No
...
F30
27
Scoreboard Example Cycle 62
Execution is finished
Instruction status
Instruction  dest  j    k    Issue  Read operands  Execution complete  Write result
LD           F6    34+  R2   1      2              3                   4
LD           F2    45+  R3   5      6              7                   8
MULTD        F0    F2   F4   6      9              19                  20
SUBD         F8    F6   F2   7      9              11                  12
DIVD         F10   F0   F6   8      21             61                  62
ADDD         F6    F8   F2   13     14             16                  22
Functional unit status
(Fi = dest; Fj, Fk = sources S1, S2; Qj, Qk = FU for j, k; Rj, Rk = Fj?, Fk?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   No
Register result status
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
62
FU
Review: Scoreboard
• Limitations of 6600 scoreboard
– No forwarding
– Limited to instructions in basic block (small window)
– Limited number of functional units (structural hazards)
– Stall on WAR hazards
– Stall on WAW hazards
• Name dependences in an example sequence
DIV.D  F0, F2, F4
ADD.D  F6, F0, F8
S.D    F6, 0(R1)
SUB.D  F8, F10, F14
MUL.D  F6, F10, F8
– WAR (antidependence): ADD.D reads F8, which SUB.D later writes
– WAW (output dependence): ADD.D and MUL.D both write F6
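The hazards in this sequence can be found mechanically by comparing register names pairwise. The sketch below is illustrative only: the `Instr` tuple and `name_hazards` function are hypothetical names, and the check is deliberately conservative (it compares every earlier/later pair and ignores intervening redefinitions).

```python
from typing import List, NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    op: str
    dest: Optional[str]        # None for stores, which write no register
    srcs: Tuple[str, ...]

def name_hazards(prog: List[Instr]):
    """Return (kind, earlier index, later index, register) for each pair."""
    found = []
    for i, a in enumerate(prog):
        for j in range(i + 1, len(prog)):
            b = prog[j]
            if a.dest is not None and a.dest in b.srcs:
                found.append(("RAW", i, j, a.dest))   # true dependence
            if b.dest is not None and b.dest in a.srcs:
                found.append(("WAR", i, j, b.dest))   # antidependence
            if a.dest is not None and a.dest == b.dest:
                found.append(("WAW", i, j, a.dest))   # output dependence
    return found

prog = [
    Instr("DIV.D", "F0", ("F2", "F4")),
    Instr("ADD.D", "F6", ("F0", "F8")),
    Instr("S.D",   None, ("F6", "R1")),
    Instr("SUB.D", "F8", ("F10", "F14")),
    Instr("MUL.D", "F6", ("F10", "F8")),
]
hazards = name_hazards(prog)
for h in hazards:
    print(h)
```

Besides the WAR on F8 and the WAW on F6, the pairwise check also reports a WAR between S.D (reads F6) and MUL.D (writes F6).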
Another Dynamic Algorithm: Tomasulo Algorithm
• For IBM 360/91, about 3 years after CDC 6600
• Goal: high performance without special compilers
• Differences between Tomasulo algorithm & scoreboard
– Control & buffers are distributed with the functional units (as “reservation stations”) vs.
centralized in the scoreboard
– Registers in instructions are replaced by pointers to reservation station buffers
– HW renaming of registers to avoid WAW hazards
– Buffer operand values to avoid WAR hazards
– Common Data Bus broadcasts results to all FUs
– Loads and stores are treated as FUs as well
• Why study? It led to Alpha 21264, HP 8000, MIPS 10000,
Pentium II, PowerPC 604 …
FP unit and load-store unit using Tomasulo’s alg.
Another Dynamic Algorithm: Tomasulo Algorithm
Register renaming: the example sequence with F6 and F8 renamed to S and T
DIV.D  F0, F2, F4
ADD.D  S, F0, F8
S.D    S, 0(R1)
SUB.D  T, F10, F14
MUL.D  F6, F10, T
• Implemented through reservation stations (rs) per functional unit
– Buffers an operand as soon as it is available – avoids WAR hazards.
– Pending instructions designate the rs that will provide their inputs – avoids WAW hazards.
– The last write in a sequence of writes to the same register actually updates the
register
– Decentralizes hazard detection and execution control
– Instruction results are passed directly to the FUs from the rs rather than through registers,
over the common data bus (CDB)
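The effect of renaming can be sketched in a few lines of Python. This is a simplified illustration, not the hardware mechanism: unlike the slide, which renames only F6 and F8 to S and T, the hypothetical `rename` function below gives every destination a fresh name (`T0`, `T1`, ...), and the tuple layout is an assumption.

```python
from typing import Dict, List, Optional, Tuple

Instr = Tuple[str, Optional[str], Tuple[str, ...]]   # (op, dest, sources)

def rename(prog: List[Instr]):
    """Give each destination a fresh name and redirect later readers to it."""
    mapping: Dict[str, str] = {}     # architectural register -> latest name
    renamed: List[Instr] = []
    for count, (op, dest, srcs) in enumerate(prog):
        new_srcs = tuple(mapping.get(s, s) for s in srcs)   # read current names
        new_dest = None
        if dest is not None:
            new_dest = f"T{count}"
            mapping[dest] = new_dest                        # later readers use it
        renamed.append((op, new_dest, new_srcs))
    return renamed, mapping

prog = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
renamed, mapping = rename(prog)
for instr in renamed:
    print(instr)
```

After renaming, no two instructions share a destination name, so the WAR and WAW hazards disappear while the RAW chains (DIV.D to ADD.D to S.D, and SUB.D to MUL.D) are preserved. The final `mapping` entry for F6 points at the MUL.D result, matching the rule that the last write updates the register.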
Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
Stall if there is a structural hazard, i.e., no free reservation station (rs). If an rs is free,
the issue logic issues the instruction to the rs and reads the operands into the rs if they are
ready (register renaming => solves WAR). Mark the destination register as waiting for this
latest instruction, even if a previous instruction writing to this register hasn’t completed =>
solves WAW hazards.
2. Execution: operate on operands (EX)
When both operands are ready, execute;
if not ready, watch the CDB for the result => solves RAW.
3. Write result: finish execution (WB)
Write on the Common Data Bus to all awaiting units;
mark the reservation station available. Write the result into the destination register only if
its status is r (this reservation station) => solves WAW.
• Normal data bus: data + destination (“go to” bus)
• CDB: data + source (“come from” bus)
– 64 bits of data + 4 bits of functional unit source address
– Write if matches expected functional unit (produces result)
– Does broadcast
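The write-result step can be sketched as a broadcast over plain dictionaries. This is a minimal sketch under stated assumptions: the `broadcast` function, the dictionary layout, and the numeric values are hypothetical; only the field names (Vj, Vk, Qj, Qk, Busy) follow the slides.

```python
def broadcast(stations, reg_status, reg_file, tag, value):
    """Put (tag, value) on the CDB; every consumer waiting on tag captures it."""
    for rs in stations.values():
        if rs.get("Qj") == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs.get("Qk") == tag:
            rs["Vk"], rs["Qk"] = value, None
    for reg, producer in reg_status.items():
        if producer == tag:              # register still waits for this station
            reg_file[reg] = value
            reg_status[reg] = None
    stations[tag]["Busy"] = False        # the producing station is free again

# Hypothetical state: Load2 is about to deliver the value both units wait for.
stations = {
    "Load2": {"Busy": True},
    "Add1":  {"Busy": True, "Op": "Sub",  "Vj": 10.0, "Vk": None, "Qj": None, "Qk": "Load2"},
    "Mult1": {"Busy": True, "Op": "Mult", "Vj": None, "Vk": 4.0,  "Qj": "Load2", "Qk": None},
}
reg_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
reg_file = {"F2": 0.0}
broadcast(stations, reg_status, reg_file, "Load2", 2.5)
```

Because consumers match on the source tag rather than a destination, one broadcast updates Add1, Mult1, and register F2 at once, while F0 and F8 keep waiting for their own producers.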
Reservation Station Components
Op: operation to perform in the unit (e.g., + or –)
Vj, Vk: values of the source operands.
Qj, Qk: names of the RS that will provide the source
operands. A value of zero means the source operand is already
available in Vj or Vk, or is not needed.
Busy: indicates that the reservation station or FU is busy
Register File Status Qi:
Qi: indicates which functional unit will write each register, if
one exists. Blank (0) when no pending instruction will write
that register, meaning that the value is already available.
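These fields map naturally onto a small record, and the issue step then decides per operand whether to copy a value (V) or record a producer tag (Q). The sketch below is an assumption-laden illustration: the `RS` class, the `issue` function, and the register values are hypothetical, and `None` stands in for the slide's zero/blank.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class RS:
    """One reservation station (field names follow the slide)."""
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None    # None plays the role of the slide's 0
    qk: Optional[str] = None

def issue(stations: Dict[str, RS], qi: Dict[str, Optional[str]],
          regs: Dict[str, float], name: str, op: str,
          dest: str, s1: str, s2: str) -> bool:
    """Issue into station `name`; return False on a structural hazard."""
    rs = stations[name]
    if rs.busy:
        return False
    rs.busy, rs.op = True, op
    # For each source: record the producing station if a write is pending,
    # otherwise copy the value out of the register file.
    rs.qj, rs.vj = (qi[s1], None) if qi.get(s1) else (None, regs[s1])
    rs.qk, rs.vk = (qi[s2], None) if qi.get(s2) else (None, regs[s2])
    qi[dest] = name             # the latest writer of dest is this station
    return True

stations = {n: RS() for n in ("Add1", "Add2", "Add3", "Mult1", "Mult2")}
regs = {"F0": 0.0, "F2": 0.0, "F4": 4.0, "F6": 6.0}
qi: Dict[str, Optional[str]] = {"F2": "Load2"}      # F2 is still being loaded
first = issue(stations, qi, regs, "Mult1", "Mult", "F0", "F2", "F4")
second = issue(stations, qi, regs, "Add1", "Sub", "F8", "F6", "F2")
third = issue(stations, qi, regs, "Mult1", "Div", "F10", "F0", "F6")  # Mult1 is busy
```

Copying ready values at issue is what removes WAR hazards, and overwriting `qi[dest]` with the newest station is what removes WAW hazards.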
Tomasulo Example Cycle 0
Instruction status
Instruction  dest  j    k    Issue  Execution complete  Write result
LD           F6    34+  R2
LD           F2    45+  R3
MULTD        F0    F2   F4
SUBD         F8    F6   F2
DIVD         F10   F0   F6
ADDD         F6    F8   F2
Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No
Reservation stations
(Vj, Vk = values of S1, S2; Qj, Qk = RS for j, k)
Time  Name   Busy  Op  Vj  Vk  Qj  Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No
Register result status
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
0
FU
Tomasulo Example Cycle 1
Instruction status
Execution Write
Instruction
j
k Issue complete Result
Busy
LD
F6
34+
R2
1
Load1 Yes
LD
F2
45+
R3
Load2 No
MULTD F0
F2
F4
Load3 No
SUBD
F8
F6
F2
DIVD
F10
F0
F6
ADDD
F6
F8
F2
Reservation Stations
S1
S2 RS for j RS for k
Time Name Busy Op
Vj
Vk
Qj
Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock
F0
F2
F4
F6
F8
F10
1
FU
Load1
Nov. 2, 2004
Lec. 7
Address
34+R2
F12
...
36
F30
Tomasulo Example Cycle 2
Instruction status
Execution Write
Instruction
j
k Issue complete Result
Busy Address
LD
F6
34+
R2
1
2Load1 Yes
34+R2
LD
F2
45+
R3
2
Load2 Yes
45+R3
MULTD F0
F2
F4
Load3 No
SUBD
F8
F6
F2
Assume Load takes 2 cycles
DIVD
F10
F0
F6
ADDD
F6
F8
F2
Reservation Stations
S1
S2 RS for j RS for k
Time Name Busy Op
Vj
Vk
Qj
Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock
F0
F2
F4
F6
F8
F10 F12
...
2
FU
Load2
Load1
Nov. 2, 2004
Lec. 7
37
F30
Tomasulo Example Cycle 3
Instruction status
Execution Write
Instruction
j
k Issue complete Result
LD
F6
34+
R2
1
2--3
Load1
LD
F2
45+
R3
2
3Load2
MULTD F0
F2
F4
3
Load3
SUBD
F8
F6
F2
DIVD
F10
F0
F6
ADDD
F6
F8
F2
Reservation Stations
S1
S2 RS for j RS for k
Time Name Busy Op
Vj
Vk
Qj
Qk
0 Add1 No
0 Add2 No
read value
Add3 No
0 Mult1 Yes Mult
R(F4) Load2
0 Mult2 No
Register result status
Clock
F0
F2
F4
F6
F8
3
FU Mult1 Load2
Load1
Nov. 2, 2004
Lec. 7
Busy
Yes
Yes
No
Address
34+R2
45+R3
F10
F12
...
38
F30
Tomasulo Example Cycle 4
Instruction status
Execution Write
Instruction
j
k Issue complete Result
Busy
LD
F6
34+
R2
1
2--3
4
Load1 No
LD
F2
45+
R3
2
3--4
Load2 Yes
MULTD F0
F2
F4
3
Load3 No
SUBD
F8
F6
F2
4
DIVD
F10
F0
F6
ADDD
F6
F8
F2
Reservation Stations
S1
S2 RS for j RS for k
Time Name Busy Op
Vj
Vk
Qj
Qk
0 Add1 Yes Sub
M(A1)
Load2
0 Add2 No
Add3 No
0 Mult1 Yes Mult
R(F4) Load2
0 Mult2 No
Register result status
Clock
F0
F2
F4
F6
F8
F10
4
FU Mult1 Load2
M(A1) Add1
Nov. 2, 2004
Lec. 7
Address
45+R3
F12
...
39
F30
Tomasulo Example Cycle 5
Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
2 Add1
0 Add2
Add3
10 Mult1
0 Mult2
Register result status
Clock
5
Nov. 2, 2004
k
R2
R3
F4
F2
F6
F2
Execution Write
Issue complete Result
1
2--3
4
2
3--4
5
3
4
5
Busy Op
Yes Sub
No
No
Yes Mult
Yes Div
FU
F0
Mult1
Busy
Load1 No
Load2 No
Load3 No
S1
Vj
M(A1)
S2 RS for j RS for k
Vk
Qj
Qk
M(A2)
M(A2)
R(F4)
M(A1)
F2
M(A2)
F4
Lec. 7
Address
Mult1
F6
M(A1)
F8
F10 F12
Add1 Mult2
...
40
F30
Tomasulo Example Cycle 6
Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
1 Add1
0 Add2
Add3
9 Mult1
0 Mult2
Register result status
Clock
6
Nov. 2, 2004
Execution Write
k Issue complete Result
Busy
R2
1
2--3
4
Load1 No
R3
2
3--4
5
Load2 No
F4
3
6 -Load3 No
F2
4
6 -F6
5
F2
6
S1
S2 RS for j RS for k
Busy Op
Vj
Vk
Qj
Qk
Yes Sub
M(A1)
M(A2)
Yes Add
M(A2) Add1
No
Yes Mult
M(A2)
R(F4)
Yes Div
M(A1) Mult1
FU
F0
Mult1
F2
M(A2)
F4
Lec. 7
F6
Add2
Address
F8
F10 F12
Add1 Mult2
...
41
F30
Tomasulo Example Cycle 7
Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
0 Add1
0 Add2
Add3
8 Mult1
0 Mult2
Register result status
Clock
7
Nov. 2, 2004
Execution Write
k Issue complete Result
Busy
R2
1
2--3
4
Load1 No
R3
2
3--4
5
Load2 No
F4
3
6 -Load3 No
F2
4
6 -- 7
F6
5
F2
6
S1
S2 RS for j RS for k
Busy Op
Vj
Vk
Qj
Qk
Yes Sub
M(A1)
M(A2)
Yes Add
M(A2) Add1
No
Yes Mult
M(A2)
R(F4)
Yes Div
M(A1) Mult1
FU
F0
Mult1
F2
M(A2)
F4
Lec. 7
F6
Add2
Address
F8
F10 F12
Add1 Mult2
...
42
F30
Tomasulo Example Cycle 8
Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
0 Add1
2 Add2
Add3
7 Mult1
0 Mult2
Register result status
Clock
8
Nov. 2, 2004
Execution Write
k Issue complete Result
Busy
R2
1
2--3
4
Load1 No
R3
2
3--4
5
Load2 No
F4
3
6 -Load3 No
F2
4
6 -- 7
8
F6
5
F2
6
S1
S2 RS for j RS for k
Busy Op
Vj
Vk
Qj
Qk
No
Yes Add
M1-M2
M(A2)
No
Yes Mult
M(A2)
R(F4)
Yes Div
M(A1) Mult1
FU
F0
Mult1
F2
M(A2)
F4
Lec. 7
Address
F6
F8
F10 F12
Add2 M1-M2 Mult2
...
43
F30
Tomasulo Example Cycle 9
Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
0 Add1
1 Add2
Add3
6 Mult1
0 Mult2
Register result status
Clock
9
Nov. 2, 2004
Execution Write
k Issue complete Result
Busy
R2
1
2--3
4
Load1 No
R3
2
3--4
5
Load2 No
F4
3
6 -Load3 No
F2
4
6 -- 7
8
F6
5
F2
6
9 -S1
S2 RS for j RS for k
Busy Op
Vj
Vk
Qj
Qk
No
Yes Add
M1-M2
M(A2)
No
Yes Mult
M(A2)
R(F4)
Yes Div
M(A1) Mult1
FU
F0
Mult1
F2
M(A2)
F4
Lec. 7
Address
F6
F8
F10 F12
Add2 M1-M2 Mult2
...
44
F30
Tomasulo Example Cycle 10
Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
0 Add1
0 Add2
Add3
5 Mult1
0 Mult2
Register result status
Clock
10
Nov. 2, 2004
Execution Write
k Issue complete Result
Busy
R2
1
2--3
4
Load1 No
R3
2
3--4
5
Load2 No
F4
3
6 -Load3 No
F2
4
6 -- 7
8
F6
5
F2
6
9 -- 10
S1
S2 RS for j RS for k
Busy Op
Vj
Vk
Qj
Qk
No
Yes Add
M1-M2
M(A2)
No
Yes Mult
M(A2)
R(F4)
Yes Div
M(A1) Mult1
FU
F0
Mult1
F2
M(A2)
F4
Lec. 7
Address
F6
F8
F10 F12
Add2 M1-M2 Mult2
...
45
F30
Tomasulo Example Cycle 11
Instruction status
Execution Write
Instruction
j
k Issue complete Result
Busy Address
LD
F6
34+
R2
1
2--3
4
Load1 No
LD
F2
45+
R3
2
3--4
5
Load2 No
MULTD F0
F2
F4
3
6 -Load3 No
SUBD
F8
F6
F2
4
6 -- 7
8
DIVD
F10
F0
F6
5
ADDD
F6
F8
F2
6
9 -- 10
11
Reservation Stations
S1
S2 RS for j RS for k
Time Name Busy Op
Vj
Vk
Qj
Qk
0 Add1 No
Add2 No
Add3 No
4 Mult1 Yes Mult
M(A2)
R(F4)
0 Mult2 Yes Div
M(A1) Mult1
Register result status
Clock
F0
F2
F4
F6
F8
F10 F12
...
11
FU Mult1 M(A2)
M1-M2+M(A2)
M1-M2 Mult2
Nov. 2, 2004
Lec. 7
46
F30
Tomasulo Example Cycle 12
Instruction status
Execution Write
Instruction
j
k Issue complete Result
Busy Address
LD
F6
34+
R2
1
2--3
4
Load1 No
LD
F2
45+
R3
2
3--4
5
Load2 No
MULTD F0
F2
F4
3
6 -Load3 No
SUBD
F8
F6
F2
4
6 -- 7
8
DIVD
F10
F0
F6
5
ADDD
F6
F8
F2
6
9 -- 10
11
Reservation Stations
S1
S2 RS for j RS for k
Time Name Busy Op
Vj
Vk
Qj
Qk
0 Add1 No
Add2 No
Add3 No
4 Mult1 Yes Mult
M(A2)
R(F4)
0 Mult2 Yes Div
M(A1) Mult1
Register result status
Clock
F0
F2
F4
F6
F8
F10 F12
...
12
FU Mult1 M(A2)
M1-M2+M(A2)
M1-M2 Mult2
Nov. 2, 2004
Lec. 7
47
F30
Tomasulo Example Cycle 15
Instruction status
Execution Write
Instruction
j
k Issue complete Result
Busy Address
LD
F6
34+
R2
1
2--3
4
Load1 No
LD
F2
45+
R3
2
3--4
5
Load2 No
MULTD F0
F2
F4
3
6 -- 15
Load3 No
SUBD
F8
F6
F2
4
6 -- 7
8
DIVD
F10
F0
F6
5
ADDD
F6
F8
F2
6
9 -- 10
11
Reservation Stations
S1
S2 RS for j RS for k
Time Name Busy Op
Vj
Vk
Qj
Qk
0 Add1 No
Add2 No
Add3 No
0 Mult1 Yes Mult
M(A2)
R(F4)
0 Mult2 Yes Div
M(A1) Mult1
Register result status
Clock
F0
F2
F4
F6
F8
F10 F12
...
15
FU Mult1 M(A2)
M1-M2+M(A2)
M1-M2 Mult2
Nov. 2, 2004
Lec. 7
48
F30
Tomasulo Example Cycle 16
Instruction status
Execution Write
Instruction
j
k Issue complete Result
Busy Address
LD
F6
34+
R2
1
2--3
4
Load1 No
LD
F2
45+
R3
2
3--4
5
Load2 No
MULTD F0
F2
F4
3
6 -- 15
16
Load3 No
SUBD
F8
F6
F2
4
6 -- 7
8
DIVD
F10
F0
F6
5
ADDD
F6
F8
F2
6
9 -- 10
11
Reservation Stations
S1
S2 RS for j RS for k
Time Name Busy Op
Vj
Vk
Qj
Qk
0 Add1 No
Add2 No
Add3 No
Mult1 No
40 Mult2 Yes Div
M*F4
M(A1)
Register result status
Clock
F0
F2
F4
F6
F8
F10 F12
...
16
FU M*F4
M(A2)
M1-M2+M(A2)
M1-M2 Mult2
Nov. 2, 2004
Lec. 7
49
F30
Tomasulo Example Cycle 56
Instruction status
Execution Write
Instruction
j
k Issue complete Result
Busy Address
LD
F6
34+
R2
1
2--3
4
Load1 No
LD
F2
45+
R3
2
3--4
5
Load2 No
MULTD F0
F2
F4
3
6 -- 15
16
Load3 No
SUBD
F8
F6
F2
4
6 -- 7
8
DIVD
F10
F0
F6
5
17 -- 56
ADDD
F6
F8
F2
6
9 -- 10
11
Reservation Stations
S1
S2 RS for j RS for k
Time Name Busy Op
Vj
Vk
Qj
Qk
0 Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 Yes Div
M*F4
M(A1)
Register result status
Clock
F0
F2
F4
F6
F8
F10 F12
...
56
FU M*F4
M(A2)
M1-M2+M(A2)
M1-M2 Mult2
Nov. 2, 2004
Lec. 7
50
F30
Tomasulo Example Cycle 57
Instruction status
Instruction  dest  j    k    Issue  Execution complete  Write result
LD           F6    34+  R2   1      2--3                4
LD           F2    45+  R3   2      3--4                5
MULTD        F0    F2   F4   3      6--15               16
SUBD         F8    F6   F2   4      6--7                8
DIVD         F10   F0   F6   5      17--56              57
ADDD         F6    F8   F2   6      9--10               11
Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No
Reservation stations
(Vj, Vk = values of S1, S2; Qj, Qk = RS for j, k)
Time  Name   Busy  Op  Vj  Vk  Qj  Qk
0     Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  No
Register result status
Clock  F0    F2     F4  F6           F8     F10     F12  ...  F30
57
FU     M*F4  M(A2)      M1-M2+M(A2)  M1-M2  result
Branch Prediction (3.4, 3.5)
Branch Prediction
• Easiest (static prediction)
– Always taken, always not taken
– Opcode based
– Displacement based (forward not taken, backward taken)
– Compiler directed (branch likely, branch not likely)
• Next easiest
– 1-bit predictor: remember last taken/not taken per branch
   • Use a branch-prediction buffer or branch-history table
   • Use part of the PC (low-order bits) to index buffer/table
– Multiple branches may share the same bit
   • Invert the bit if the prediction is wrong
   • Backward branches for loops will be mispredicted twice (on loop exit and again on re-entry)
Q: Assume a loop branch is taken nine times in a row, then not taken once. What
is the prediction accuracy using a 1-bit predictor?
A: On the first iteration, the predictor says not taken, because the last time
execution left the loop it stored a “0” in the predictor. So it is a misprediction, and
the bit is now set to “1”. Prediction then works fine until the last iteration, which is
predicted as taken but is not taken. So there are 2 mispredictions in 10 executions of the
branch => 80% accuracy.
How about a 2-bit predictor? Let the prediction be changed only after it misses
twice in a row.
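The 80% figure can be checked with a short simulation. This is a sketch with hypothetical names (`one_bit_accuracy`, `loop`); it models a single predictor entry whose bit starts at "not taken", as in the answer.

```python
def one_bit_accuracy(outcomes, initial=False):
    """Return (hits, total) for one 1-bit predictor entry."""
    bit = initial                 # False = predict not taken
    hits = 0
    for taken in outcomes:
        if bit == taken:
            hits += 1
        bit = taken               # remember only the last outcome
    return hits, len(outcomes)

loop = [True] * 9 + [False]       # taken nine times, then not taken once
hits, total = one_bit_accuracy(loop * 10)
print(hits, total)                # 80 100
```

Each pass through the loop costs exactly two misses: the first branch (the bit still says "not taken" from the previous exit) and the last one (the bit says "taken").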
2-bit Branch Prediction
• Has 4 states instead of 2, allowing for more information about
tendencies
• A prediction must miss twice before it is changed
• Good for backward branches of loops
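One common way to realize the four states is a 2-bit saturating counter, sketched below. This is an assumption, not necessarily the exact state machine on the slide: in the two "strong" states it takes two misses in a row to flip the prediction, while in the "weak" states one miss is enough. The function name and the initial state are hypothetical.

```python
def two_bit_accuracy(outcomes, state=3):
    """Return (hits, total) for one 2-bit saturating counter.

    States 0..3; states 2 and 3 predict taken, 0 and 1 predict not taken.
    """
    hits = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            hits += 1
        # Count up on taken, down on not taken, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return hits, len(outcomes)

loop = [True] * 9 + [False]       # taken nine times, then not taken once
hits, total = two_bit_accuracy(loop * 10)
print(hits, total)                # 90 100
```

On the same loop pattern only the exit branch is mispredicted, because a single not-taken outcome moves the counter from 3 to 2 and the prediction stays "taken" on re-entry.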
Branch History Table
• Has limited size
• 2 bits by N (e.g. 4K)
• 4K entries perform about the same as an infinite table, see Fig. 3.9
• Uses low-order bits of branch PC to choose entry
(Figure: the branch PC indexes the BHT and selects a 2-bit entry, e.g. 01.)
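Indexing with the low-order PC bits can be sketched as follows. The function name, the 4-byte instruction size (so the two lowest PC bits carry no information), and the example addresses are assumptions for illustration.

```python
def bht_index(pc: int, entries: int = 4096, instr_bytes: int = 4) -> int:
    """Pick a BHT entry from the low-order bits of the branch PC.

    Assumes `entries` is a power of two and instructions are aligned.
    """
    return (pc // instr_bytes) % entries

a = bht_index(0x00400010)
b = bht_index(0x00404010)   # differs from the first PC only in higher bits
c = bht_index(0x00400014)   # the next instruction
print(a, b, c)              # 4 4 5
```

The first two branches map to the same entry, which is the sharing (aliasing) effect of a table with limited size and no tags.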
Can we do better?
• Correlating branch predictors also look at other branches for
clues
if (aa==2)          (T)
    aa = 0
if (bb==2)          (T)
    bb = 0
if (aa!=bb) { …     (NT)
• (1,1) predictor: uses the history of 1 branch and a 1-bit predictor. Each entry holds two
prediction bits: the prediction if the last branch is NT and the prediction if the last branch is T.
Correlating Branch Predictor
• If we use 2 branches as history, then there are 4 possibilities
(T-T, T-NT, NT-T, NT-NT).
• For each possibility, we need to use a predictor (1-bit, 2-bit).
• And this repeats for every branch.
(2,2) branch prediction
Performance of Correlating Branch Prediction
• With the same number of state bits, a (2,2) predictor performs better than a noncorrelating
2-bit predictor.
• It also outperforms a 2-bit predictor with an infinite number of entries.
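"Same number of state bits" can be made concrete: an (m,n) predictor stores 2^m n-bit predictors per selected entry. The helper name and the entry counts below are illustrative assumptions.

```python
def predictor_bits(m: int, n: int, entries: int) -> int:
    """Total state bits of an (m, n) predictor with `entries` PC-selected rows."""
    return (2 ** m) * n * entries

plain_2bit = predictor_bits(0, 2, 4096)     # noncorrelating 2-bit, 4K entries
correlating = predictor_bits(2, 2, 1024)    # (2,2) predictor, 1K entries
print(plain_2bit, correlating)              # 8192 8192
```

So a (2,2) predictor with 1K entries costs the same 8K bits as a plain 2-bit predictor with 4K entries.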
General (m,n) Branch Predictors
• The global history register (GHR) is an m-bit shift register that records
the outcomes of the last m branches encountered by the processor
• Usually use both the PC address and the GHR (2-level)
(Figure: the PC and the m-bit GHR feed a combining function that selects one of the n-bit predictors.)
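A small (m,n) predictor can be sketched to show how the GHR and the PC select a counter. This is a minimal sketch under assumptions: the class and method names are hypothetical, the "combining function" is simply (PC row, GHR column), and the counters are saturating.

```python
class CorrelatingPredictor:
    """(m, n) predictor: 2**m n-bit saturating counters per PC-indexed row."""

    def __init__(self, m: int, n: int, entries: int):
        self.m, self.entries = m, entries
        self.max_count = (1 << n) - 1
        self.ghr = 0                                   # m-bit global history
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def _row(self, pc: int):
        return self.table[(pc >> 2) % self.entries]    # low-order PC bits

    def predict(self, pc: int) -> bool:
        return self._row(pc)[self.ghr] > self.max_count // 2

    def update(self, pc: int, taken: bool) -> None:
        row = self._row(pc)
        c = row[self.ghr]
        row[self.ghr] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        # Shift the outcome into the history register, keeping m bits.
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.m) - 1)

def run(pred: CorrelatingPredictor, pc: int, outcomes) -> int:
    """Count correct predictions for one branch over a list of outcomes."""
    hits = 0
    for taken in outcomes:
        hits += pred.predict(pc) == taken
        pred.update(pc, taken)
    return hits

alternating = [True, False] * 50                       # T, NT, T, NT, ...
hits_11 = run(CorrelatingPredictor(1, 1, 16), 0x40, alternating)
hits_01 = run(CorrelatingPredictor(0, 1, 16), 0x40, alternating)
print(hits_11, hits_01)                                # 99 0
```

On a strictly alternating branch, the plain 1-bit predictor (m = 0) is always wrong, while one bit of history lets the (1,1) predictor learn the pattern after a single miss.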
Is Branch Predictor Enough?
• When is using branch prediction beneficial?
– When the outcome is known later than the target
– For example, in our standard MIPS pipeline, we compute the target in the ID
stage, but testing the branch condition there incurs a structural hazard on the register
file.
• If we predict that the branch is taken and the prediction is correct, what
is the target address?
– Need a mechanism to provide the target address as well
• Can we eliminate the one-cycle delay for the 5-stage pipeline?
– Need to fetch from the branch target immediately after the branch
Branch Target Buffer (BTB)
Is the current instruction a branch?
• The BTB provides the answer before the current instruction is decoded
and therefore enables fetching to begin right after the IF stage.
What is the branch target?
• The BTB provides the branch target if the prediction is a taken direct
branch (for not-taken branches the target is simply PC+4).
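The two questions the BTB answers can be sketched with a dictionary keyed by the fetch PC. This is a simplified illustration under assumptions: a real BTB is a fixed-size tagged table, while the hypothetical class below uses an unbounded dictionary, stores only branches predicted taken, and assumes 4-byte instructions.

```python
class BranchTargetBuffer:
    """Maps the PC of a predicted-taken branch to its target address."""

    def __init__(self):
        self.entries = {}                 # branch PC -> predicted target

    def next_fetch_pc(self, pc: int) -> int:
        """Hit: fetch from the stored target. Miss: fall through to PC+4."""
        return self.entries.get(pc, pc + 4)

    def update(self, pc: int, taken: bool, target: int) -> None:
        """After the branch resolves, insert or remove its entry."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)

btb = BranchTargetBuffer()
before = btb.next_fetch_pc(0x100)         # unknown PC: fetch 0x104
btb.update(0x100, True, 0x80)             # loop branch resolved as taken
after = btb.next_fetch_pc(0x100)          # now redirected to 0x80
btb.update(0x100, False, 0x80)            # later resolved as not taken
removed = btb.next_fetch_pc(0x100)        # back to 0x104
```

Because the lookup uses only the fetch PC, the next fetch address is available during IF, before the instruction has been decoded as a branch.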