5008: Computer Architecture Appendix A – Pipelining CA Lecture03 - pipelining ([email protected]) 03-1 Why Pipeline? Because the resources are there! Time (clock cycles) Im Reg Im Dm Reg Dm Reg Dm Im Reg ALU Inst 2 Reg ALU Inst 1 Im ALU Im Reg Inst 3 Inst 4 Reg Reg Dm Reg ALU O r d e r Inst 0 ALU I n s t r. Single cycle data path Dm CA Lecture03 - pipelining ([email protected]) Reg 03-2 Pipeline Review • • • • • • • • • A pipeline is like an hooked assembly line. Pipelining, in general, is not visible to the programmer (vs ILP) Pipelining doesn’t help latency of single task, it helps throughput of entire workload Pipeline rate limited by slowest pipeline stage Multiple tasks operating simultaneously using different resources Potential speedup = Number pipe stages, if perfectly balanced stage. Unbalanced lengths of pipe stages reduces speedup Time to “fill” pipeline and time to “drain” it reduces speedup Stall for Dependences CA Lecture03 - pipelining ([email protected]) 03-3 Outline • MIPS – An ISA example for pipelining • 5 stage pipelining • Structural and Data Hazards • Forwarding • Branch Schemes • Exceptions and Interrupts • Conclusion CA Lecture03 - pipelining ([email protected]) 03-4 A "Typical" RISC ISA • 32-bit fixed format instruction (3 formats) • 32 32-bit GPR (R0 contains zero, DP take pair) • 3-address, reg-reg arithmetic instruction • Single address mode for load/store: base + displacement – no indirection • Simple branch conditions • Delayed branch see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3 CA Lecture03 - pipelining ([email protected]) 03-5 Example: MIPS (- MIPS) Register-Register 31 26 25 Op 21 20 Rs1 16 15 Rs2 11 10 6 5 Rd 0 Opx Register-Immediate 31 26 25 Op 21 20 Rs1 16 15 Rd immediate 0 Branch 31 26 25 Op Rs1 21 20 16 15 Rs2/Opx immediate 0 Jump / Call 31 26 25 Op target CA Lecture03 - pipelining ([email protected]) 0 03-6 Datapath vs Control Datapath Controller signals Control Points • Datapath: Storage, FU, interconnect sufficient to perform the desired functions – Inputs are Control Points – Outputs are signals • Controller: State machine to orchestrate operation on the data path CA Lecture03 - pipelining ([email protected]) – Based on desired function and signals 03-7 Approaching an ISA • Instruction Set Architecture • Meaning of each instruction is described by RTL on architected registers and memory Given technology constraints assemble adequate datapath • – Defines set of operations, instruction format, hardware supported data types, named storage, addressing modes, sequencing – – – – • • • • Architected storage mapped to actual storage Function units to do all the required operations Possible additional storage (eg. MAR, MBR, …) Interconnect to move information among regs and FUs Map each instruction to sequence of RTLs Collate sequences into symbolic controller state transition diagram (STD) Lower symbolic STD to control points Implement controller CA Lecture03 - pipelining ([email protected]) 03-8 Outline • MIPS – An ISA example for pipelining -- Read Appendix B • 5 stage pipelining • Structural and Data Hazards • Forwarding • Branch Schemes • Exceptions and Interrupts • Conclusion CA Lecture03 - pipelining ([email protected]) 03-9 The Five Steps of the Load Instruction Cycle 1 Cycle 2 Load Ifetch Reg/Dec Cycle 3 Cycle 4 Cycle 5 Exec Mem Wr • • Every instruction can be implemented in at most 5 clock cycle Ifetch: Instruction Fetch • • • • Reg/Dec: Registers Fetch and Instruction Decode Exec: Execution and calculate the memory address Mem: Read the data from the Data Memory Wr: Write the data back to the register file – Fetch the instruction from the Instruction Memory Branch requires ? cycles, Store requires ? cycles, others require ? cycles CA Lecture03 - pipelining ([email protected]) 03-10 The Four Steps of R-type Instruction does not access data memory… Cycle 1 Cycle 2 R-type Ifetch Reg/Dec Cycle 3 Cycle 4 Exec Wr • Ifetch: Instruction Fetch • • Reg/Dec: Registers Fetch and Instruction Decode Exec: • Wr: Write the ALU output back to the register file – Fetch the instruction from the Instruction Memory – Update PC – ALU operates on the two register operands CA Lecture03 - pipelining ([email protected]) 03-11 Important Observation • • Each functional unit can only be used once per instruction Each functional unit must be used at the same step for all instructions: – Load uses Register File’s Write Port during its 5th step Load 1 Ifetch 2 Reg/Dec 3 Exec 4 Mem 5 Wr This’s what caused the problem – R-type uses Register File’s Write Port during its 4th step 1 R-type Ifetch 2 Reg/Dec 3 Exec 4 Wr Structural hazard !! CA Lecture03 - pipelining ([email protected]) 03-12 Pipelining the R-type and Load Instruction Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock R-type Ifetch R-type Reg/Dec Exec Ifetch Reg/Dec Exec Ifetch Reg/Dec Load Ops! We have a problem! Wr R-type Ifetch Wr Exec Mem Wr Reg/Dec Exec Wr R-type Ifetch Reg/Dec Exec Wr • We have pipeline conflict or structural hazard: – Two instructions try to write to the register file at the same time! – Only one write port CA Lecture03 - pipelining ([email protected]) 03-13 Sol 1: Insert “Bubble” into the Pipeline Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock Ifetch R-type Load Reg/Dec Exec Wr Ifetch Reg/Dec Exec Mem R-type Ifetch Reg/Dec Exec R-type Ifetch Wr Wr Reg/Dec Pipeline R-type Ifetch Exec Bubble Reg/Dec Ifetch • Wr Exec Reg/Dec Wr Exec Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle – The control logic can be complex. – Lose instruction fetch and issue opportunity. • No instruction is started in Cycle 6! CA Lecture03 - pipelining ([email protected]) 03-14 Sol 2: Delay R-type’s Write by One Cycle • • Now R-type instructions also use Reg File’s write port at Step 5 Mem step for R-type inst. is a NOOP : nothing is being done. 1 R-type Ifetch Cycle 1 Cycle 2 2 Reg/Dec 3 Exec 4 Mem 5 Wr Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock R-type Ifetch R-type Reg/Dec Exec Mem Wr Ifetch Reg/Dec Exec Mem Wr Ifetch Reg/Dec Exec Mem Wr Reg/Dec Exec Mem Wr Reg/Dec Exec Mem Load R-type Ifetch R-type Ifetch CA Lecture03 - pipelining ([email protected]) Wr 03-15 Similarly, the Four Steps of Store Cycle 1 Cycle 2 Store Ifetch Reg/Dec Cycle 3 Cycle 4 Exec • Ifetch: Instruction Fetch Mem Wr In order to keep our pipeline uniform – Fetch the instruction from the Instruction Memory – Update PC • Reg/Dec: Registers Fetch and Instruction Decode • Exec: Calculate the memory address • Mem: Write the data into the Data Memory The Three Steps of Beq Cycle 1 Cycle 2 Beq Ifetch Reg/Dec Cycle 3 Cycle 4 Exec Mem Wr • Ifetch: Instruction Fetch – Fetch the instruction from the Instruction Memory • Reg/Dec: – Registers Fetch and Instruction Decode • Exec: – compares the two register operand, – select correct branch target address – latch into PC Designing a Pipelined Processor • Examine the datapath and control diagram – Starting with single- or multi-cycle datapath? – Single- or multi-cycle control? • Partition datapath into steps • Insert pipeline registers between successive steps • Associate resources with steps • Ensure that flows do not conflict, or figure out how to resolve • Assert control in appropriate stage CA Lecture03 - pipelining ([email protected]) 03-18 5 Steps of MIPS Datapath Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Next SEQ PC Adder 4 L M D MUX Data Memory ALU Imm MUX MUX RD Reg File Inst Memory Address IR <= mem[PC]; Zero? RS1 RS2 Write Back MUX Next PC Memory Access Sign Extend WB Data PC <= PC + 4 Reg[IRrd] <= Reg[IRrs] opIRop Reg[IRrt] CA Lecture03 - pipelining ([email protected]) 03-19 5 Steps of MIPS Datapath Execute Addr. Calc Instr. Decode Reg. Fetch Next SEQ PC Next SEQ PC Adder 4 Zero? RS1 B <= Reg[IRrt] Imm MUX PC <= PC + 4 A <= Reg[IRrs]; MEM/WB IR <= mem[PC]; Data Memory EX/MEM ALU MUX MUX ID/EX Reg File IF/ID Memory Address RS2 Write Back MUX Next PC Memory Access WB Data Instruction Fetch Sign Extend RD RD RD rslt <= A opIRop B WB <= rslt Reg[IRrd] <= WB CA Lecture03 - pipelining ([email protected]) Pipeline registers (latches) 03-20 Inst. Set Processor Controller IR <= mem[PC]; Ifetch PC <= PC + 4 JSR A <= Reg[IRrs]; JR RR if bop(A,b) ST B <= Reg[IRrt] jmp br opFetch-DCD PC <= IRjaddr r <= A opIRop B RI LD r <= A opIRop IRim r <= A + IRim WB <= r WB <= Mem[r] PC <= PC+IRim WB <= r Reg[IRrd] <= WB • Use data stationary control Reg[IRrd] <= WB CA Lecture03 - pipelining ([email protected]) – local decode for each instruction phase / pipeline stage Reg[IRrd] <= WB 03-21 Data Stationary Control • Pass control signals along just like the data – Main control generates control signals during ID WB Instruction IF/ID Control M WB EX M WB ID/EX EX/MEM MEM/WB Use of “Data Stationary Control” • The Main Control generates the control signals during Reg/Dec – Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later – Control signals for Mem (MemWr Branch) are used 2 cycles later – Control signals for Wr (MemtoReg MemWr) are used 3 cycles later Reg/Dec MemWr Branch MemtoReg RegWr RegDst MemWr Branch MemtoReg RegWr MemWr Branch MemtoReg RegWr Wr Mem/Wr Register RegDst ExtOp ALUSrc ALUOp Mem Ex/Mem Register Main Control ID/Ex Register IF/ID Register ExtOp ALUSrc ALUOp Exec MemtoReg RegWr Visualizing Pipelining Figure A.2, Page A-8 Time (clock cycles) Reg DMem Ifetch Reg DMem Reg ALU DMem Reg ALU O r d e r Ifetch ALU I n s t r. ALU Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Ifetch Ifetch Reg Reg CA Lecture03 - pipelining ([email protected]) Reg DMem Reg 03-24 Outline • MIPS – An ISA example for pipelining • 5 stage pipelining • Structural and Data Hazards • Forwarding • Branch Schemes • Exceptions and Interrupts • Conclusion CA Lecture03 - pipelining ([email protected]) 03-25 Pipelining is not quite that easy! • Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle – Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away) – Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock) – Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps). CA Lecture03 - pipelining ([email protected]) 03-26 One Memory Port/Structural Hazards Time (clock cycles) DMem Ifetch Reg DMem Reg ALU DMem Reg ALU DMem Reg ALU O r d e r Reg ALU I Load Ifetch n s Instr 1 t r. ALU Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Ifetch Instr 2 Instr 3 Instr 4 Ifetch Reg Ifetch Reg CA Lecture03 - pipelining ([email protected]) Reg Reg DMem 03-27 Detection is easy in this case! (right half highlight means read, left half write) Reg Structural Hazards Limit Performance • Why? The primary reason is to reduce cost of the unit • Example: if 1.3 memory accesses per instruction and only one memory access per cycle then – average CPI = 1.3 – otherwise resource is more than 100% utilized • Solution 1: Use separate instruction and data memories • Solution 2: Allow memory to read and write more than one word per cycle • Solution 3: Stall CA Lecture03 - pipelining ([email protected]) 03-28 Speed Up Equations for Pipelining Speedup = Average instruction time unpipelined CPI unpipelined Clock cycle unpipelined = × Average instruction time pipelined CPI pipelined Clock cycle pipelined CPI pipelined = Ideal CPI + Average Stall cycles per Inst Speedup = Cycle Time unpipelined Ideal CPI × Pipeline depth × Ideal CPI + Pipeline stall CPI Cycle Time pipelined for balanced pipelining For simple RISC pipeline, CPI = 1: Cycle Time unpipelined Pipeline depth Speedup = × 1 + Pipeline stall CPI Cycle Time pipelined CA Lecture03 - pipelining ([email protected]) 03-29 Example: One or Two Memory Ports? • Machine A: Dual ported memory (“Harvard Architecture”) • Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate • Ideal CPI = 1 for both • Loads are 40% of instructions executed SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe) = Pipeline Depth SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe / 1.05) = (Pipeline Depth/1.4) x 1.05 = 0.75 x Pipeline Depth SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33 • Machine A is 1.33 times faster CA Lecture03 - pipelining ([email protected]) 03-30 One Memory Port Structural Hazards Time (clock cycles) DMem Ifetch Reg DMem Reg ALU Ifetch Bubble Stall Instr 3 Reg Reg DMem Bubble Bubble Ifetch Reg How do you “bubble” the pipe? CA Lecture03 - pipelining ([email protected]) Reg Bubble ALU O r d e r Instr 2 Reg ALU I Load Ifetch n s Instr 1 t r. ALU Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Bubble Reg DMem 03-31 Handling Stalls • How to stall? – Stall instruction in IF and ID: not change PC and IF/ID => the stages re-execute the instructions – What to move into EX: insert an NOP by changing EX, MEM, WB control fields of ID/EX pipeline register to 0 • as control signals propagate, all control signals to EX, MEM, WB are de-asserted and no registers or memories are written CA Lecture03 - pipelining ([email protected]) 03-32 Data Hazard Problem • Due to the overlapped instructions. Example: r1 cannot be read by other instructions before it is written by the add. add r1 ,r2,r3 sub r4, r1 ,r3 and r6, r1 ,r7 or r8, r1 ,r9 xor r10, r1 ,r11 CA Lecture03 - pipelining ([email protected]) 03-33 RAW Hazards on R1 Time (clock cycles) Dependencies backwards in time are hazards or r8,r1,r9 xor r10,r1,r11 Reg DMem Ifetch Reg DMem Ifetch Reg DMem Ifetch Reg ALU and r6,r1,r7 Ifetch Reg DMem ALU sub r4,r1,r3 Reg ALU O r d e r add r1,r2,r3 Ifetch WB ALU I n s t r. MEM ALU IF ID/RF EX Reg CA Lecture03 - pipelining ([email protected]) Reg Reg DMem Reg 03-34 Types of Data Hazards Three types: (inst. i1 followed by inst. i2) • RAW (read after write): dependence i2 tries to read operand before i1 writes it • WAR (write after read): anti-dependence i2 tries to write operand before i1 reads it – Gets wrong operand, e.g., auto-increment addr. – Can’t happen in MIPS 5-stage pipeline because: • All instructions take 5 stages, and reads are always in stage 2, and writes are always in stage 5 • WAW (write after write): output dependence i2 tries to write operand before i1 writes it – Leaves wrong result ( i1’s not i2’s); occur only in pipelines that write in more than one stage – Can’t happen in MIPS 5-stage pipeline because: • All instructions take 5 stages, and writes are always in stage 5 – Out of order executions may suffer this data dependence CA Lecture03 - pipelining ([email protected]) 03-35 WAR Data Hazard • Write After Read (WAR) InstrJ writes operand before InstrI reads it I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 • • • • Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”. Can’t happen in MIPS 5 stage pipeline because: Can happen in between a shorter (Int) pipeline and a longer (FP) pipeline WAR hazards can happen if instructions execute out of order or access data late CA Lecture03 - pipelining ([email protected]) 03-36 No WAR Hazards on r1 Time (clock cycles) IF ID/RF EX read add r4,r1,r3 Ifetch Reg WB write ALU DMem Reg write read sub r1,r2,r3 Ifetch Reg ALU I n s t r. MEM DMem Reg O r d e r CA Lecture03 - pipelining ([email protected]) 03-37 WAW Data Hazard Write After Write (WAW) InstrJ writes operand before InstrI writes it. I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 • • • Called an “output dependence” by compiler writers This also results from the reuse of name “r1”. Can’t happen in 5 stage pipeline because: – All instructions take 5 stages, and – Writes are always in stage 5 Will see WAR and WAW in more complicated pipes CA Lecture03 - pipelining ([email protected]) 03-38 No WAW Hazards on R1 Time (clock cycles) IF ID/RF EX WB write add r1,r4,r3 Ifetch Reg ALU DMem Reg write sub r1,r2,r3 Ifetch Reg ALU I n s t r. MEM DMem Reg O r d e r CA Lecture03 - pipelining ([email protected]) 03-39 Data Forwarding to Avoid Data Hazard • With data forwarding (also called bypassing or shortcircuiting), data is transferred back to earlier pipeline stages before it is written into the register file. – Instr i: add r1,r2,r3 ---------------------– Instr j: sub r4,r1,r5 • • (result ready after EX stage) (result needed in EX stage) This either eliminates or reduces the penalty of RAW hazards. To support data forwarding, additional hardware is required. – Multiplexors to allow data to be transferred back – Control logic for the multiplexors CA Lecture03 - pipelining ([email protected]) 03-40 Data Hazard Solution “Forward” result from one stage to another Time (clock cycles) IF ID/RF Im Reg Dm Im Reg Dm Im Reg Dm Im Reg ALU xor r10,r1,r11 Dm ALU or r8,r1,r9 Reg ALU and r6,r1,r7 WB ALU O r d e r sub r4,r1,r3 Im MEM ALU I n s t r. add r1,r2,r3 EX Reg Reg “or” OK if define read/write properly CA Lecture03 - pipelining ([email protected]) Reg Reg Dm Reg 03-41 HW Change for Forwarding NextPC mux MEM/WR EX/MEM ALU mux ID/EX Registers Data Memory mux Immediate CA Lecture03 - pipelining ([email protected]) 03-42 Data Hazard Even with Forwarding sub r4,r1,r6 and r6,r1,r7 or Reg DMem Ifetch Reg DMem Reg Ifetch r8,r1,r9 Ifetch Reg Reg DMem Reg ALU Ifetch ALU O r d e r lw r1, 0(r2) ALU I n s t r. ALU Time (clock cycles) Reg DMem Reg Can’t solve with forwarding CAdelay/stall Lecture03 - pipelining ([email protected]) Must instruction dependent on loads 03-43 Pipeline Interlock Solution for Load Stall Reg DMem Ifetch Reg Bubble Ifetch Bubble Reg Bubble Ifetch and r6,r1,r7 or r8,r1,r9 How is this detected? Reg DMem Reg Reg DMem ALU sub r4,r1,r6 Ifetch ALU O r d e r lw r1, 0(r2) ALU I n s t r. ALU Time (clock cycles) Reg DMem Pipeline interlock CA Lecture03 - pipelining ([email protected]) 03-44 Software Scheduling to Avoid Load Hazards Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d ,e, and f in memory. Slow code: LW LW ADD SW LW LW SUB SW Rb,b Rc,c Ra,Rb,Rc a,Ra Re,e Rf,f Rd,Re,Rf d,Rd Fast code: LW LW LW ADD LW SW SUB SW Rb,b Rc,c Re,e Ra,Rb,Rc Rf,f a,Ra Rd,Re,Rf d,Rd Compiler CA optimizes for performance. Hardware checks for safety. Lecture03 - pipelining ([email protected]) 03-45 Compiler Avoiding Load Stalls scheduled unscheduled 54% gcc 31% 42% spice 14% 65% tex 25% 0% 20% 40% 60% 80% % loads stalling pipeline Compilers reduce the number of load stalls, but do not completelyCAeliminate them. Lecture03 - pipelining ([email protected]) 03-46 Outline • MIPS – An ISA example for pipelining • 5 stage pipelining • Structural and Data Hazards • Forwarding • Branch Schemes • Exceptions and Interrupts • Conclusion CA Lecture03 - pipelining ([email protected]) 03-47 22: add r8,r1,r9 36: xor r10,r1,r11 Reg DMem Ifetch Reg DMem Ifetch Reg DMem Ifetch Reg ALU r6,r1,r7 Ifetch DMem ALU 18: or Reg ALU 14: and r2,r3,r5 Ifetch ALU 10: beq r1,r3,36 ALU Control Hazard on Branches Three Stages Stall Reg Reg Reg Reg DMem What do you do with the 3 instructions in between? CA Lecture03 - pipelining ([email protected]) 03-48 Reg Control/Branch Hazards • Control hazards, which occur due to instructions changing the PC, can result in a large performance loss. • A branch is either – Taken: PC <= PC + 4 + Imm ; branch target address – Not Taken: PC <= PC + 4 • The simplest solution is to stall the pipeline as soon as a branch instruction is detected. – Detect the branch in the ID stage – Don’t know if the branch is taken until the EX stage – If the branch is taken, we need to repeat the IF and ID stages – New PC is not changed until the end of the MEM stage, after determining if the branch is taken and the new PC value CA Lecture03 - pipelining ([email protected]) 03-49 Branch Stall Impact • If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9 !! • Two part solution: – Determine branch taken or not sooner, AND – Compute taken branch address earlier • MIPS branch tests if register = 0 or ≠ 0 • MIPS Solution: – Move Zero test to ID/RF stage – Adder to calculate new PC in ID/RF stage – 1 clock cycle penalty for branch versus 3 CA Lecture03 - pipelining ([email protected]) 03-50 Pipelined MIPS Datapath page A-38 Instruction Fetch Memory Access Write Back Adder Adder MUX Next SEQ PC Next PC Zero? RS1 MUX MEM/WB Data Memory EX/MEM ALU MUX ID/EX Imm Reg File IF/ID Memory Address RS2 WB Data 4 Execute Addr. Calc Instr. Decode Reg. Fetch Sign Extend RD RD RD • Interplay of instruction set design and cycle time. CA Lecture03 - pipelining ([email protected]) 03-51 Four Branch Hazard Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken – – – – – Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken – 53% MIPS branches taken on average – But haven’t calculated branch target address in MIPS • MIPS still incurs 1 cycle branch penalty • Other machines: branch target known before outcome CA Lecture03 - pipelining ([email protected]) 03-52 Four Branch Hazard Alternatives #4: Delayed Branch -- make the stall cycle useful – Define branch to take place AFTER a following instruction branch instruction sequential successor1 sequential successor2 ........ sequential successorn branch target if taken e.g. Branch delay slot of length n These insts. are executed !! – 1 slot delay allows proper decision and branch target address in 5 stage pipeline – MIPS uses this CA Lecture03 - pipelining ([email protected]) 03-53 Stall -- Control Hazard Solution • Stall: wait until decision is clear – It’s possible to move up decision to 2nd stage by adding hardware to check registers as being read Time (clock cycles) O r d e r • Load Reg Mem Mem Reg Reg Mem Reg Mem Reg ALU Beq Mem ALU Add ALU I n s t r. Mem Reg Impact: 2 cycles (or 1 cycle penalty) per branch instruction => slow CA Lecture03 - pipelining ([email protected]) 03-54 Predict-- Control Hazard Solution • – Predict not taken, for example Time (clock cycles) O r d e r • Load Reg Mem Mem Reg Reg Mem Reg Mem Reg ALU Beq Mem ALU Add ALU I n s t r. Predict: guess one direction then back up if wrong Mem Reg Impact: 1 clock cycle per branch instruction if right, 2 if wrong CA Lecture03 - pipelining ([email protected]) 03-55 Predict-Not-Taken Example A Stall indeed 1CA clock cycle- pipelining per branch instruction if right, 2 if wrong Lecture03 ([email protected]) 03-56 Delayed Branch-- Control Hazard Solution • Time (clock cycles) • Misc Load Mem Mem Reg Reg Mem Reg Mem Reg Mem Reg Mem Reg ALU O r d e r Reg ALU Beq Mem ALU Add ALU I n s t r. Redefine branch behavior (takes place after next instruction) “delayed branch” Mem Impact: 1 clock cycles per branch instruction if can find instruction to put in “slot” CA Lecture03 - pipelining ([email protected]) Reg 03-57 Delayed Branch • Delayed branch Æ make the stall cycle useful – Add delay slots = branch penalty = length of branch delay • 1 slot for 5-stage DLX/MIPS – Instructions in the delay slot are executed whether or not the branch is taken – See if the compiler can schedule something useful in these slots • When the slots cannot be scheduled, they are filled with the no-op instruction (indeed, stall!!) – Hope that filled slots actually help advance the computation CA Lecture03 - pipelining ([email protected]) 03-58 Scheduling Branch Delay Slots A. From before branch add $1,$2,$3 if $2=0 then delay slot becomes B. From branch target sub $4,$5,$6 add $1,$2,$3 if $1=0 then delay slot becomes if $2=0 then add $1,$2,$3 • • • add $1,$2,$3 if $1=0 then sub $4,$5,$6 C. From fall through add $1,$2,$3 if $1=0 then delay slot sub $4,$5,$6 becomes add $1,$2,$3 if $1=0 then sub $4,$5,$6 A is the best choice, fills delay slot & reduces instruction count (IC) In B, the sub instruction may need to be copied, increasing IC In B and C, must be okay to execute sub when branch fails CA Lecture03 - pipelining ([email protected]) 03-59 Delay-Branch Scheduling Schemes and Their Requirements Scheduling Strategy Requirements Improve Performance When? From before Branch must not depend on the rescheduled instructions Always From target Must be OK to execute rescheduled instructions if branch is not taken. May need to duplicate instructions When branch is taken. May enlarge program if instructions are duplicated From fall through Must be OK to execute instructions if branch is taken When branch is not taken. CA Lecture03 - pipelining ([email protected]) 03-60 Delayed Branch Summary • Compiler effectiveness for single branch delay slot: – Fills about 60% of branch delay slots – About 80% of instructions executed in branch delay slots useful in computation – About 50% (60% x 80%) of slots usefully filled • As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slots – Delayed branching (the static way) has lost popularity compared to more expensive but more flexible dynamic approaches – Growth in available transistors has made dynamic approaches relatively cheaper CA Lecture03 - pipelining ([email protected]) 03-61 Evaluating Branch Alternatives Pipeline speedup = Pipeline depth 1 +Branch frequency ×Branch penalty • Assume 4% unconditional branch, 6% conditional branchuntaken, 10% conditional branch-taken Branch scheme Branch penalty Stall pipeline Predict taken Predict not taken Delayed branch 3 1 1 0.5 CPI speedup v. unpipelined speedup v. stall 1.60 1.20 1.14 1.10 3.1 4.2 4.4 4.5 1.0 1.33 1.40 1.45 CA Lecture03 - pipelining ([email protected]) 03-62 Deeper Pipeline Example • For a deeper pipeline, e.g. MIPS R4K, it takes at least 3 pipeline stages before the branch-target address is known and an additional cycle before the branch condition is evaluated, assuming no stalls on the registers in the conditional comparisons – Assuming an ideal CPI of 1, Speedup = Pipeline depth 1 + Pipeline stall cycles from branches Pipeline stall cycles from branches = Branch frequency × Branch penalty CA Lecture03 - pipelining ([email protected]) 03-63 Branch Penalties • Assume 4% unconditional branch, 6% conditional branchuntaken, 10% conditional branch-taken • Branch penalties Branch scheme Penalty unconditional. Flush pipeline Predict taken Predict not taken 2 2 2 Penalty untaken Penalty taken 3 3 0 3 2 3 Try to find the CPI penalties for 3 branch schemes (Fig. A.16) CA Lecture03 - pipelining ([email protected]) 03-64 Problems with Pipelining • Exception: An unusual event happens to an instruction during its execution – Examples: divide by zero, undefined opcode • Interrupt: Hardware signal to switch the processor to a new instruction stream – Example: a sound card interrupts when it needs more audio output samples (an audio “click” happens if it is left waiting) • Problem: It must appear that the exception or interrupt must appear between 2 instructions (Ii and Ii+1) – The effect of all instructions up to and including Ii is totalling complete – • No effect of any instruction after Ii can take place The interrupt (exception) handler either aborts program or restarts at instruction Ii+1 CA Lecture03 - pipelining ([email protected]) 03-65 Exceptions in MIPS Pipeline stage Problem exceptions occurring IF Page fault on instruction fetch; misaligned memory access; memory protection violation ID Undefined or illegal opcode EX Arithmetic exception MEM Page fault on data fetch; misaligned memory access; memory protection violation WB Non Note: Multiple exceptions may occur in the same clock cycle in pipelining architecture CA Lecture03 - pipelining ([email protected]) 03-66 Precise Exceptions in Static Pipelines Key observation: architected state only change in memory and register write stages. And In Conclusion: Control & Pipelining • • • Control VIA State Machines and Microprogramming Just overlap tasks; easy if tasks are independent Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then: Cycle Time unpipelined Pipeline depth Speedup = × 1 + Pipeline stall CPI Cycle Time pipelined • • Hazards limit performance on computers: – Structural: need more HW resources – Data (RAW,WAR,WAW): need forwarding, compiler scheduling – Control: delayed branch, prediction Exceptions, Interrupts add complexity CA Lecture03 - pipelining ([email protected]) 03-68
© Copyright 2025 Paperzz