MS108 Computer System I
Lecture 7: Tomasulo’s Algorithm
Prof. Xiaoyao Liang
2015/4/10

Tomasulo’s Algorithm
• From the IBM 360/91
• Goal: high performance using a limited number of registers, without a special compiler
– 4 double-precision FP registers on the 360
– Uses register renaming
• Why study a 1966 computer?
– Its descendants include: Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …

Tomasulo Algorithm
• Control & buffers are distributed with the Function Units (FU)
– FU buffers are called “reservation stations” (RS)
– They contain information about instructions, including operands
– More reservation stations than registers, so the hardware can do optimizations compilers can’t
• Registers in instructions are replaced by values or pointers to reservation stations
– A form of register renaming
– Avoids WAR and WAW hazards
• Results go to the FUs from the RSes, not through registers (equivalent of forwarding). A Common Data Bus (CDB) broadcasts results to all FUs (their RSes)
• Loads and stores are treated as FUs with RSes as well

Tomasulo Organization
[Figure: Tomasulo organization. FP Op Queue and FP Registers; Load Buffers Load1 to Load6 (from memory); Store Buffers (to memory); Reservation Stations Add1 to Add3 feeding the FP adders and Mult1, Mult2 feeding the FP multipliers; Common Data Bus (CDB).]

Reservation Station Components
• Busy: indicates the reservation station or FU is busy
• Op: operation to perform in the unit (e.g., + or –)
• Vj, Vk: values of the source operands
• Qj, Qk: reservation stations producing the source registers (value to be written)
– Note: Qj, Qk = 0 => ready
• A: effective address

Tomasulo Organization
• Register result status (Qi)
– Indicates which functional unit will write each register, if one exists.
– Blank when no pending instruction will write that register
• Common data bus
– Normal data bus: data + destination (“go to” bus)
– CDB: data + source (“come from” bus)
• 64 bits of data + 4 bits of Functional Unit source address
• Write if it matches the expected Functional Unit (the one that produces the result)
• Does the broadcast

Three Stages of Tomasulo Algorithm
• 1. Issue: get instruction from FP Op Queue
– If a reservation station is free (no structural hazard), control issues the instruction & sends operands (renames registers).
• 2. Execute: operate on operands (EX)
– When both operands are ready, execute; if not ready, watch the Common Data Bus for the result
• 3. Write result: finish execution (WB)
– Write on the Common Data Bus to all awaiting units; mark the reservation station available

Tomasulo Loop Example
Loop: LD    F0, 0(R1)
      MULTD F4, F0, F2
      SD    F4, 0(R1)
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
• This time assume multiply takes 4 clock cycles in the execution stage
• Assume the 1st load takes 8 clock cycles (L1 cache miss) in the execution stage; the 2nd load takes 1 extra cycle (hit)
• Assume a store takes 3 cycles in the execution stage
• For clarity, clocks for SUBI and BNEZ are not shown
• Show about 2 iterations

Loop Example (simplified presentation for load/store components)
Each cycle snapshot tracks: instruction status (Issue, Exec Comp, Write Result), reservation stations (Time, Busy, Op, Vj, Vk, Qj, Qk), load/store buffers (Busy, Addr, Qk), and register result status (Qi).

Loop Example Cycle 0
• Instruction status: LD, MULTD, SD tracked for iterations 1 and 2 (iteration count); nothing issued yet
• Reservation stations: Add1, Add2, Add3, Mult1, Mult2 not busy
• Load/store buffers: Load1 to Load3 and Store1 to Store3 not busy (store buffers added)
• Register result status: empty; R1 = 80
• R1 holds the value of the register used for addressing and iteration control

Loop Example Cycle 1
• Instruction status: LD F0, 0(R1) (iteration 1) issued at cycle 1
• Reservation stations: Add1 to Add3, Mult1, Mult2 not busy
• Load/store buffers: Load1 busy, address 80
• Register result status: F0 = Load1; R1 = 80

Loop Example Cycle 2
• Instruction status: MULTD F4, F0, F2 (iteration 1) issued at cycle 2
• Reservation stations: Mult1 busy, Op = MULTD, Vk = R(F2), Qj = Load1
• Load/store buffers: Load1 busy, address 80
• Register result status: F0 = Load1, F4 = Mult1; R1 = 80

Loop Example Cycle 3
• Instruction status: SD F4, 0(R1) (iteration 1) issued at cycle 3
• Reservation stations: Mult1 busy, Op = MULTD, Vk = R(F2), Qj = Load1
• Load/store buffers: Load1 busy, address 80; Store1 busy, address 80, Qk = Mult1
• Register result status: F0 = Load1, F4 = Mult1; R1 = 80

Loop Example Cycle 4
• Instruction status: LD issued at 1, MULTD at 2, SD at 3; no new FP instruction issued
• Reservation stations: Mult1 busy, Op = MULTD, Vk = R(F2), Qj = Load1
• Load/store buffers: Load1 busy, address 80; Store1 busy, address 80, Qk = Mult1
• Register result status: F0 = Load1, F4 = Mult1; R1 = 80
• Dispatching SUBI instruction (not in FP queue)

Loop Example Cycle 5
• Instruction status: LD issued at 1, MULTD at 2, SD at 3
• Reservation stations: Mult1 busy, Op = MULTD, Vk = R(F2), Qj = Load1
• Load/store buffers: Load1 busy, address 80; Store1 busy, address 80, Qk = Mult1
• Register result status: F0 = Load1, F4 = Mult1; R1 = 72
• And the BNEZ instruction (not in FP queue)

Loop Example Cycle 6
• Instruction status: LD F0, 0(R1) (iteration 2) issued at cycle 6
• Reservation stations: Mult1 busy, Op = MULTD, Vk = R(F2), Qj = Load1
• Load/store buffers: Load1 busy, address 80; Load2 busy, address 72; Store1 busy, address 80, Qk = Mult1
• Register result status: F0 = Load2, F4 = Mult1; R1 = 72
• Notice that F0 never sees the load from location 80

Loop Example Cycle 7
• Instruction status: MULTD F4, F0, F2 (iteration 2) issued at cycle 7
• Reservation stations: Mult1 busy, Op = MULTD, Vk = R(F2), Qj = Load1; Mult2 busy, Op = MULTD, Vk = R(F2), Qj = Load2
• Load/store buffers: Load1 busy, address 80; Load2 busy, address 72; Store1 busy, address 80, Qk = Mult1
• Register result status: F0 = Load2, F4 = Mult2; R1 = 72
• Register file completely detached from computation
• First and second iterations completely overlapped

Loop Example Cycle 8
• Instruction status: SD F4, 0(R1) (iteration 2) issued at cycle 8
• Reservation stations: Mult1 busy, Op = MULTD, Vk = R(F2), Qj = Load1; Mult2 busy, Op = MULTD, Vk = R(F2), Qj = Load2
• Load/store buffers: Load1 busy, address 80; Load2 busy, address 72; Store1 busy, address 80, Qk = Mult1; Store2 busy, address 72, Qk = Mult2
• Register result status: F0 = Load2, F4 = Mult2; R1 = 72

Loop Example Cycle 9
• Instruction status: LD (iteration 1) completes execution at cycle 9
• Reservation stations: Mult1 busy, Op = MULTD, Vk = R(F2), Qj = Load1; Mult2 busy, Op = MULTD, Vk = R(F2), Qj = Load2
• Load/store buffers: Load1 busy, address 80; Load2 busy, address 72; Store1 busy, address 80, Qk = Mult1; Store2 busy, address 72, Qk = Mult2
• Register result status: F0 = Load2, F4 = Mult2; R1 = 72
• Load1 completing: who is waiting?
• Note: Dispatching SUBI

Loop Example Cycle 10
• Instruction status: LD (iteration 1) writes its result at cycle 10; LD (iteration 2) completes execution at cycle 10
• Reservation stations: Mult1 busy, Time = 4, Op = MULTD, Vj = M[80], Vk = R(F2); Mult2 busy, Op = MULTD, Vk = R(F2), Qj = Load2
• Load/store buffers: Load1 not busy; Load2 busy, address 72; Store1 busy, address 80, Qk = Mult1; Store2 busy, address 72, Qk = Mult2
• Register result status: F0 = Load2, F4 = Mult2; R1 = 64
• Load2 completing: who is waiting?
• Note: Dispatching BNEZ

Loop Example Cycle 11
• Instruction status: LD (iteration 2) writes its result at cycle 11
• Reservation stations: Mult1 busy, Time = 3, Op = MULTD, Vj = M[80], Vk = R(F2); Mult2 busy, Time = 4, Op = MULTD, Vj = M[72], Vk = R(F2)
• Load/store buffers: Load1 and Load2 not busy; Load3 busy, address 64; Store1 busy, address 80, Qk = Mult1; Store2 busy, address 72, Qk = Mult2
• Register result status: F0 = Load3, F4 = Mult2; R1 = 64
• Next load in sequence

Loop Example Cycle 12
• Instruction status: unchanged (iteration 1: LD 1/9/10, MULTD issued at 2, SD issued at 3; iteration 2: LD 6/10/11, MULTD issued at 7, SD issued at 8)
• Reservation stations: Mult1 busy, Time = 2, Op = MULTD, Vj = M[80], Vk = R(F2); Mult2 busy, Time = 3, Op = MULTD, Vj = M[72], Vk = R(F2)
• Load/store buffers: Load3 busy, address 64; Store1 busy, address 80, Qk = Mult1; Store2 busy, address 72, Qk = Mult2
• Register result status: F0 = Load3, F4 = Mult2; R1 = 64
• Why not issue third multiply?
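The cycle 12 question can be read as the structural-hazard check from the Issue stage: the snapshot shows both Mult1 and Mult2 busy, so a third MULTD has no reservation station to go to. The following is a minimal sketch of that check, not the 360/91 control logic; the `Station` class and `try_issue` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Station:
    """One reservation station (only the fields needed for the issue check)."""
    name: str
    busy: bool = False
    op: Optional[str] = None


def try_issue(op: str, stations: List[Station]) -> Optional[str]:
    """Claim the first free station for `op`; return None on a structural hazard."""
    for rs in stations:
        if not rs.busy:
            rs.busy, rs.op = True, op
            return rs.name
    return None  # no free station: the instruction (and all behind it) must wait


# Cycle 12 of the loop example: both multiply stations are occupied.
mult_stations = [Station("Mult1", True, "MULTD"), Station("Mult2", True, "MULTD")]
stalled = try_issue("MULTD", mult_stations)

# Once Mult1 has written its result and been marked available, issue succeeds.
mult_stations[0].busy = False
issued = try_issue("MULTD", mult_stations)

print(stalled, issued)  # None Mult1
```

Because issue is in order, a stalled MULTD also holds back the instructions queued behind it, even when their own stations are free.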
Loop Example Cycle 13
• Instruction status: unchanged (iteration 1: LD 1/9/10, MULTD issued at 2, SD issued at 3; iteration 2: LD 6/10/11, MULTD issued at 7, SD issued at 8)
• Reservation stations: Mult1 busy, Time = 1, Op = MULTD, Vj = M[80], Vk = R(F2); Mult2 busy, Time = 2, Op = MULTD, Vj = M[72], Vk = R(F2)
• Load/store buffers: Load3 busy, address 64; Store1 busy, address 80, Qk = Mult1; Store2 busy, address 72, Qk = Mult2; Store3 not busy
• Register result status: F0 = Load3, F4 = Mult2; R1 = 64
• Why not issue third store?

Loop Example Cycle 14
• Instruction status: MULTD (iteration 1) completes execution at cycle 14
• Reservation stations: Mult1 busy, Time = 0, Op = MULTD, Vj = M[80], Vk = R(F2); Mult2 busy, Time = 1, Op = MULTD, Vj = M[72], Vk = R(F2)
• Load/store buffers: Load3 busy, address 64; Store1 busy, address 80, Qk = Mult1; Store2 busy, address 72, Qk = Mult2; Store3 not busy
• Register result status: F0 = Load3, F4 = Mult2; R1 = 64
• Mult1 completing. Who is waiting?

Loop Example Cycle 15
• Instruction status: MULTD (iteration 1) writes its result at cycle 15; MULTD (iteration 2) completes execution at cycle 15
• Reservation stations: Mult1 not busy; Mult2 busy, Time = 0, Op = MULTD, Vj = M[72], Vk = R(F2)
• Load/store buffers: Load3 busy, address 64; Store1 busy, address 80, value [80]*R2 (that is, M[80] times R(F2)); Store2 busy, address 72, Qk = Mult2; Store3 not busy
• Register result status: F0 = Load3, F4 = Mult2; R1 = 64
• Mult2 completing. Who is waiting?
Loop Example Cycle 16
• Instruction status: MULTD (iteration 2) writes its result at cycle 16
• Reservation stations: Mult1 busy, Op = MULTD, Vk = R(F2), Qj = Load3; Mult2 not busy
• Load/store buffers: Load3 busy, address 64; Store1 busy, address 80, value [80]*R2; Store2 busy, address 72, value [72]*R2; Store3 not busy
• Register result status: F0 = Load3, F4 = Mult1; R1 = 64

Loop Example Cycle 17
• Instruction status: unchanged (iteration 1: LD 1/9/10, MULTD 2/14/15, SD issued at 3; iteration 2: LD 6/10/11, MULTD 7/15/16, SD issued at 8)
• Reservation stations: Mult1 busy, Op = MULTD, Vk = R(F2), Qj = Load3; Mult2 not busy
• Load/store buffers: Load3 busy, address 64; Store1 busy, address 80, value [80]*R2; Store2 busy, address 72, value [72]*R2; Store3 busy, address 64, Qk = Mult1
• Register result status: F0 = Load3, F4 = Mult1; R1 = 64

Loop Example Cycle 18
• Instruction status: SD (iteration 1) completes execution at cycle 18
• Reservation stations: Mult1 busy, Op = MULTD; Mult2 not busy
• Load/store buffers: Load3 busy, address 64; Store1 busy, address 80, value [80]*R2; Store2 busy, address 72, value [72]*R2; Store3 busy, address 64, Qk = Mult1
• Mult1 operands: Vk = R(F2), Qj = Load3
• Register result status: F0 = Load3, F4 = Mult1; R1 = 64

Loop Example Cycle 19
• Instruction status: SD (iteration 1) writes its result at cycle 19; SD (iteration 2) completes execution at cycle 19
• Reservation stations: Mult1 busy, Op = MULTD, Vk = R(F2), Qj = Load3; Mult2 not busy
• Load/store buffers: Load3 busy, address 64; Store1 not busy; Store2 busy, address 72, value [72]*R2; Store3 busy, address 64, Qk = Mult1
• Register result status: F0 = Load3, F4 = Mult1; R1 = 56

Loop Example Cycle 20
• Instruction status (Issue / Exec Comp / Write Result): iteration 1: LD 1/9/10, MULTD 2/14/15, SD 3/18/19; iteration 2: LD 6/10/11, MULTD 7/15/16, SD 8/19/20
• Reservation stations: Mult1 busy, Op = MULTD, Vk = R(F2), Qj = Load3; Mult2 not busy
• Load/store buffers: Load1 busy, address 56; Load2 not busy; Load3 busy, address 64; Store1 and Store2 not busy; Store3 busy, address 64, Qk = Mult1
• Register result status: F0 = Load1, F4 = Mult1; R1 = 56
• Once again: in-order issue, out-of-order execution, and out-of-order completion.

Why can Tomasulo overlap iterations of loops?
• Register renaming
– Multiple iterations use different physical destinations for registers (dynamic loop unrolling).
• Reservation stations
– Buffer old values of registers, avoiding the WAR stall that we saw in the scoreboard.
• Other perspective: Tomasulo builds the data flow dependency graph on the fly.
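The renaming that makes this overlap possible can be sketched in a few lines. This is an illustrative model under stated assumptions, not the hardware's actual bookkeeping: `reg_status`, `reg_file`, and `issue` are hypothetical names, the value in F2 is made up, and a pending operand is tagged `"Q"` with the producing station's name while a ready operand is tagged `"V"` with its value.

```python
# Register result status (Qi): register -> station that will write it.
reg_status = {}
# Architectural register file; the value of F2 is an assumed placeholder.
reg_file = {"F2": 2.0}


def issue(dest, srcs, tag):
    """Rename one instruction at issue time.

    Each source becomes ("Q", producer) if a station will still write it,
    otherwise ("V", value) read from the register file. Sources are read
    before the destination is renamed, then dest is mapped to `tag`.
    """
    operands = []
    for reg in srcs:
        if reg in reg_status:
            operands.append(("Q", reg_status[reg]))
        else:
            operands.append(("V", reg_file[reg]))
    if dest is not None:
        reg_status[dest] = tag
    return operands


# Iteration 1 of the loop (address operands omitted for brevity).
issue("F0", [], "Load1")                      # LD    F0, 0(R1)
mult1 = issue("F4", ["F0", "F2"], "Mult1")    # MULTD F4, F0, F2
store1 = issue(None, ["F4"], "Store1")        # SD    F4, 0(R1)

# Iteration 2 reuses the names F0 and F4 but gets fresh producers.
issue("F0", [], "Load2")
mult2 = issue("F4", ["F0", "F2"], "Mult2")
store2 = issue(None, ["F4"], "Store2")

print(mult1)       # [('Q', 'Load1'), ('V', 2.0)]
print(mult2)       # [('Q', 'Load2'), ('V', 2.0)]
print(reg_status)  # {'F0': 'Load2', 'F4': 'Mult2'}
```

The two multiplies wait on different producers even though both name F0, which matches the cycle 8 snapshot: Mult1 waits on Load1, Mult2 on Load2, and the register status points only at the newest writers.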
Tomasulo’s scheme offers 2 major advantages
(1) The distribution of the hazard detection logic
– Distributed reservation stations and the CDB
– If multiple instructions are waiting on a single result, they can be released simultaneously by the broadcast on the CDB
– If a centralized register file were used, the units would have to read their results from the registers when register buses are available
(2) The elimination of stalls for WAW and WAR hazards
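The simultaneous release in advantage (1) can be illustrated with a small model of the CDB write. It is a sketch under assumptions: the `Entry` class and `broadcast` function are hypothetical, `None` stands in for the slides' "Qj, Qk = 0 => ready", and the Add1 entry and all numeric values are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """A reservation station entry; Q fields hold producer tags, None means ready."""
    name: str
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.qj is None and self.qk is None


def broadcast(tag, value, entries, reg_status, reg_file):
    """Drive (tag, value) on the CDB; every matching consumer latches it at once."""
    woken = []
    for e in entries:
        hit = False
        if e.qj == tag:
            e.vj, e.qj, hit = value, None, True
        if e.qk == tag:
            e.vk, e.qk, hit = value, None, True
        if hit:
            woken.append(e.name)
    # A register is written only if it still expects this producer.
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            reg_file[reg] = value
            del reg_status[reg]
    return woken


entries = [
    Entry("Mult1", vk=2.0, qj="Load1"),
    Entry("Add1", vj=1.0, qk="Load1"),   # hypothetical second consumer
    Entry("Mult2", vk=2.0, qj="Load2"),
]
reg_status = {"F0": "Load2", "F4": "Mult2"}
reg_file = {}

woken = broadcast("Load1", 5.0, entries, reg_status, reg_file)
print(woken)     # ['Mult1', 'Add1']
print(reg_file)  # {}
```

Both consumers of Load1 become ready from the single broadcast, while F0 is left untouched because the register status already names Load2, the same effect the loop example notes when F0 never sees the load from location 80.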